#
76deacf0 |
| 24-Jan-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: hold a ref on the root in create_reloc_inode
We're creating a reloc inode in the data reloc tree, we need to hold a ref on the root while we're doing that.
Signed-off-by: Josef Bacik <josef@
btrfs: hold a ref on the root in create_reloc_inode
We're creating a reloc inode in the data reloc tree, we need to hold a ref on the root while we're doing that.
Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
3d7babdc |
| 24-Jan-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: hold a ref on the root in find_data_references
We're looking up the data references for the bytenr in a root, we need to hold a ref on that root while we're doing that.
Signed-off-by: Josef
btrfs: hold a ref on the root in find_data_references
We're looking up the data references for the bytenr in a root, we need to hold a ref on that root while we're doing that.
Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
442b1ac5 |
| 24-Jan-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: hold a ref on the root in record_reloc_root_in_trans
We are recording this root in the transaction, so we need to hold a ref on it until we do that.
Signed-off-by: Josef Bacik <josef@toxicpa
btrfs: hold a ref on the root in record_reloc_root_in_trans
We are recording this root in the transaction, so we need to hold a ref on it until we do that.
Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
ab9737bd |
| 24-Jan-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: hold a ref on the root in merge_reloc_roots
We look up the corresponding root for the reloc root, we need to hold a ref while we're messing with it.
Signed-off-by: Josef Bacik <josef@toxicpa
btrfs: hold a ref on the root in merge_reloc_roots
We look up the corresponding root for the reloc root, we need to hold a ref while we're messing with it.
Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
db2c2ca2 |
| 24-Jan-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: hold a ref on the root in prepare_to_merge
We look up the reloc roots corresponding root, we need to hold a ref on that root.
Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: D
btrfs: hold a ref on the root in prepare_to_merge
We look up the reloc roots corresponding root, we need to hold a ref on that root.
Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
0b530bc5 |
| 24-Jan-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: hold a ref on the root in build_backref_tree
This is trickier than the previous conversions. We have backref_node's that need to hold onto their root for their lifetime. Do the read of the
btrfs: hold a ref on the root in build_backref_tree
This is trickier than the previous conversions. We have backref_node's that need to hold onto their root for their lifetime. Do the read of the root and grab the ref. If at any point we don't use the root we discard it, however if we use it in our backref node we don't free it until we free the backref node. Any time we switch the root's for the backref node we need to drop our ref on the old root and grab the ref on the new root, and if we dupe a node we need to get a ref on the root there as well.
Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
3dbf1738 |
| 24-Jan-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: make relocation use btrfs_read_tree_root()
Relocation has it's special roots, we don't want to save these in the root cache either, so swap it to use btrfs_read_tree_root(). However the relo
btrfs: make relocation use btrfs_read_tree_root()
Relocation has it's special roots, we don't want to save these in the root cache either, so swap it to use btrfs_read_tree_root(). However the reloc root does need REF_COWS set, so make sure we set it everywhere we use this helper, as it no longer does the REF_COWS setting.
Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
0c891389 |
| 15-Jan-2020 |
Qu Wenruo <wqu@suse.com> |
btrfs: relocation: Add introduction of how relocation works
Relocation is one of the most complex part of btrfs, while it's also the foundation stone for online resizing, profile converting.
For su
btrfs: relocation: Add introduction of how relocation works
Relocation is one of the most complex part of btrfs, while it's also the foundation stone for online resizing, profile converting.
For such a complex facility, we should at least have some introduction to it.
This patch will add an basic introduction at pretty a high level, explaining:
- What relocation does - How relocation is done Only mentioning how data reloc tree and reloc tree are involved in the operation. No details like the backref cache, or the data reloc tree contents. - Which function to refer.
More detailed comments will be added for reloc tree creation, data reloc tree creation and backref cache.
Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
bffe633e |
| 02-Dec-2019 |
Omar Sandoval <osandov@fb.com> |
btrfs: make btrfs_ordered_extent naming consistent with btrfs_file_extent_item
ordered->start, ordered->len, and ordered->disk_len correspond to fi->disk_bytenr, fi->num_bytes, and fi->disk_num_byte
btrfs: make btrfs_ordered_extent naming consistent with btrfs_file_extent_item
ordered->start, ordered->len, and ordered->disk_len correspond to fi->disk_bytenr, fi->num_bytes, and fi->disk_num_bytes, respectively. It's confusing to translate between the two naming schemes. Since a btrfs_ordered_extent is basically a pending btrfs_file_extent_item, let's make the former use the naming from the latter.
Note that I didn't touch the names in tracepoints just in case there are scripts depending on the current naming.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
430640e3 |
| 28-Nov-2019 |
Qu Wenruo <wqu@suse.com> |
btrfs: relocation: Output current relocation stage at btrfs_relocate_block_group()
There are two relocation stages but both print the same message. Add the description of the stage. This can help de
btrfs: relocation: Output current relocation stage at btrfs_relocate_block_group()
There are two relocation stages but both print the same message. Add the description of the stage. This can help debugging or provides informative message to users.
BTRFS info (device dm-5): balance: start -d -m -s BTRFS info (device dm-5): relocating block group 30408704 flags metadata|dup BTRFS info (device dm-5): found 2 extents, stage: move data extents BTRFS info (device dm-5): relocating block group 22020096 flags system|dup BTRFS info (device dm-5): found 1 extents, stage: move data extents BTRFS info (device dm-5): relocating block group 13631488 flags data BTRFS info (device dm-5): found 1 extents, stage: move data extents BTRFS info (device dm-5): found 1 extents, stage: update data pointers BTRFS info (device dm-5): balance: ended with status: 0
Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
6282675e |
| 07-Jan-2020 |
Qu Wenruo <wqu@suse.com> |
btrfs: relocation: fix reloc_root lifespan and access
[BUG] There are several different KASAN reports for balance + snapshot workloads. Involved call paths include:
should_ignore_root+0x54/0xb0
btrfs: relocation: fix reloc_root lifespan and access
[BUG] There are several different KASAN reports for balance + snapshot workloads. Involved call paths include:
should_ignore_root+0x54/0xb0 [btrfs] build_backref_tree+0x11af/0x2280 [btrfs] relocate_tree_blocks+0x391/0xb80 [btrfs] relocate_block_group+0x3e5/0xa00 [btrfs] btrfs_relocate_block_group+0x240/0x4d0 [btrfs] btrfs_relocate_chunk+0x53/0xf0 [btrfs] btrfs_balance+0xc91/0x1840 [btrfs] btrfs_ioctl_balance+0x416/0x4e0 [btrfs] btrfs_ioctl+0x8af/0x3e60 [btrfs] do_vfs_ioctl+0x831/0xb10
create_reloc_root+0x9f/0x460 [btrfs] btrfs_reloc_post_snapshot+0xff/0x6c0 [btrfs] create_pending_snapshot+0xa9b/0x15f0 [btrfs] create_pending_snapshots+0x111/0x140 [btrfs] btrfs_commit_transaction+0x7a6/0x1360 [btrfs] btrfs_mksubvol+0x915/0x960 [btrfs] btrfs_ioctl_snap_create_transid+0x1d5/0x1e0 [btrfs] btrfs_ioctl_snap_create_v2+0x1d3/0x270 [btrfs] btrfs_ioctl+0x241b/0x3e60 [btrfs] do_vfs_ioctl+0x831/0xb10
btrfs_reloc_pre_snapshot+0x85/0xc0 [btrfs] create_pending_snapshot+0x209/0x15f0 [btrfs] create_pending_snapshots+0x111/0x140 [btrfs] btrfs_commit_transaction+0x7a6/0x1360 [btrfs] btrfs_mksubvol+0x915/0x960 [btrfs] btrfs_ioctl_snap_create_transid+0x1d5/0x1e0 [btrfs] btrfs_ioctl_snap_create_v2+0x1d3/0x270 [btrfs] btrfs_ioctl+0x241b/0x3e60 [btrfs] do_vfs_ioctl+0x831/0xb10
[CAUSE] All these call sites are only relying on root->reloc_root, which can undergo btrfs_drop_snapshot(), and since we don't have real refcount based protection to reloc roots, we can reach already dropped reloc root, triggering KASAN.
[FIX] To avoid such access to unstable root->reloc_root, we should check BTRFS_ROOT_DEAD_RELOC_TREE bit first.
This patch introduces wrappers that provide the correct way to check the bit with memory barriers protection.
Most callers don't distinguish merged reloc tree and no reloc tree. The only exception is should_ignore_root(), as merged reloc tree can be ignored, while no reloc tree shouldn't.
[CRITICAL SECTION ANALYSIS] Although test_bit()/set_bit()/clear_bit() doesn't imply a barrier, the DEAD_RELOC_TREE bit has extra help from transaction as a higher level barrier, the lifespan of root::reloc_root and DEAD_RELOC_TREE bit are:
NULL: reloc_root is NULL PTR: reloc_root is not NULL 0: DEAD_RELOC_ROOT bit not set DEAD: DEAD_RELOC_ROOT bit set
(NULL, 0) Initial state __ | /\ Section A btrfs_init_reloc_root() \/ | __ (PTR, 0) reloc_root initialized /\ | | btrfs_update_reloc_root() | Section B | | (PTR, DEAD) reloc_root has been merged \/ | __ === btrfs_commit_transaction() ==================== | /\ clean_dirty_subvols() | | | Section C (NULL, DEAD) reloc_root cleanup starts \/ | __ btrfs_drop_snapshot() /\ | | Section D (NULL, 0) Back to initial state \/
Every have_reloc_root() or test_bit(DEAD_RELOC_ROOT) caller holds transaction handle, so none of such caller can cross transaction boundary.
In Section A, every caller just found no DEAD bit, and grab reloc_root.
In the cross section A-B, caller may get no DEAD bit, but since reloc_root is still completely valid thus accessing reloc_root is completely safe.
No test_bit() caller can cross the boundary of Section B and Section C.
In Section C, every caller found the DEAD bit, so no one will access reloc_root.
In the cross section C-D, either caller gets the DEAD bit set, avoiding access reloc_root no matter if it's safe or not. Or caller get the DEAD bit cleared, then access reloc_root, which is already NULL, nothing will be wrong.
The memory write barriers are between the reloc_root updates and bit set/clear, the pairing read side is before test_bit.
Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org> Fixes: d2311e698578 ("btrfs: relocation: Delay reloc tree deletion after merge_reloc_roots") CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ barriers ] Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
ca1aa281 |
| 06-Dec-2019 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: do not leak reloc root if we fail to read the fs root
If we fail to read the fs root corresponding with a reloc root we'll just break out and free the reloc roots. But we remove our current
btrfs: do not leak reloc root if we fail to read the fs root
If we fail to read the fs root corresponding with a reloc root we'll just break out and free the reloc roots. But we remove our current reloc_root from this list higher up, which means we'll leak this reloc_root. Fix this by adding ourselves back to the reloc_roots list so we are properly cleaned up.
CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
Revision tags: v5.3.11, v5.3.10, v5.3.9, v5.3.8, v5.3.7, v5.3.6, v5.3.5, v5.3.4, v5.3.3, v5.3.2, v5.3.1, v5.3, v5.2.14, v5.3-rc8, v5.2.13, v5.2.12 |
|
#
a019e9e1 |
| 30-Aug-2019 |
David Sterba <dsterba@suse.com> |
btrfs: remove extent_map::bdev
We can now remove the bdev from extent_map. Previous patches made sure that bio_set_dev is correctly in all places and that we don't need to grab it from latest_bdev o
btrfs: remove extent_map::bdev
We can now remove the bdev from extent_map. Previous patches made sure that bio_set_dev is correctly in all places and that we don't need to grab it from latest_bdev or pass it around inside the extent map.
Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
b12de528 |
| 14-Nov-2019 |
Qu Wenruo <wqu@suse.com> |
btrfs: scrub: Don't check free space before marking a block group RO
[BUG] When running btrfs/072 with only one online CPU, it has a pretty high chance to fail:
btrfs/072 12s ... _check_dmesg: so
btrfs: scrub: Don't check free space before marking a block group RO
[BUG] When running btrfs/072 with only one online CPU, it has a pretty high chance to fail:
btrfs/072 12s ... _check_dmesg: something found in dmesg (see xfstests-dev/results//btrfs/072.dmesg) - output mismatch (see xfstests-dev/results//btrfs/072.out.bad) --- tests/btrfs/072.out 2019-10-22 15:18:14.008965340 +0800 +++ /xfstests-dev/results//btrfs/072.out.bad 2019-11-14 15:56:45.877152240 +0800 @@ -1,2 +1,3 @@ QA output created by 072 Silence is golden +Scrub find errors in "-m dup -d single" test ...
And with the following call trace:
BTRFS info (device dm-5): scrub: started on devid 1 ------------[ cut here ]------------ BTRFS: Transaction aborted (error -27) WARNING: CPU: 0 PID: 55087 at fs/btrfs/block-group.c:1890 btrfs_create_pending_block_groups+0x3e6/0x470 [btrfs] CPU: 0 PID: 55087 Comm: btrfs Tainted: G W O 5.4.0-rc1-custom+ #13 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:btrfs_create_pending_block_groups+0x3e6/0x470 [btrfs] Call Trace: __btrfs_end_transaction+0xdb/0x310 [btrfs] btrfs_end_transaction+0x10/0x20 [btrfs] btrfs_inc_block_group_ro+0x1c9/0x210 [btrfs] scrub_enumerate_chunks+0x264/0x940 [btrfs] btrfs_scrub_dev+0x45c/0x8f0 [btrfs] btrfs_ioctl+0x31a1/0x3fb0 [btrfs] do_vfs_ioctl+0x636/0xaa0 ksys_ioctl+0x67/0x90 __x64_sys_ioctl+0x43/0x50 do_syscall_64+0x79/0xe0 entry_SYSCALL_64_after_hwframe+0x49/0xbe ---[ end trace 166c865cec7688e7 ]---
[CAUSE] The error number -27 is -EFBIG, returned from the following call chain: btrfs_end_transaction() |- __btrfs_end_transaction() |- btrfs_create_pending_block_groups() |- btrfs_finish_chunk_alloc() |- btrfs_add_system_chunk()
This happens because we have used up all space of btrfs_super_block::sys_chunk_array.
The root cause is, we have the following bad loop of creating tons of system chunks:
1. The only SYSTEM chunk is being scrubbed It's very common to have only one SYSTEM chunk. 2. New SYSTEM bg will be allocated As btrfs_inc_block_group_ro() will check if we have enough space after marking current bg RO. If not, then allocate a new chunk. 3. New SYSTEM bg is still empty, will be reclaimed During the reclaim, we will mark it RO again. 4. That newly allocated empty SYSTEM bg get scrubbed We go back to step 2, as the bg is already mark RO but still not cleaned up yet.
If the cleaner kthread doesn't get executed fast enough (e.g. only one CPU), then we will get more and more empty SYSTEM chunks, using up all the space of btrfs_super_block::sys_chunk_array.
[FIX] Since scrub/dev-replace doesn't always need to allocate new extent, especially chunk tree extent, so we don't really need to do chunk pre-allocation.
To break above spiral, here we introduce a new parameter to btrfs_inc_block_group(), @do_chunk_alloc, which indicates whether we need extra chunk pre-allocation.
For relocation, we pass @do_chunk_alloc=true, while for scrub, we pass @do_chunk_alloc=false. This should keep unnecessary empty chunks from popping up for scrub.
Also, since there are two parameters for btrfs_inc_block_group_ro(), add more comment for it.
Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
32da5386 |
| 29-Oct-2019 |
David Sterba <dsterba@suse.com> |
btrfs: rename btrfs_block_group_cache
The type name is misleading, a single entry is named 'cache' while this normally means a collection of objects. Rename that everywhere. Also the identifier was
btrfs: rename btrfs_block_group_cache
The type name is misleading, a single entry is named 'cache' while this normally means a collection of objects. Rename that everywhere. Also the identifier was quite long, making function prototypes harder to format.
Suggested-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
b3470b5d |
| 23-Oct-2019 |
David Sterba <dsterba@suse.com> |
btrfs: add dedicated members for start and length of a block group
The on-disk format of block group item makes use of the key that stores the offset and length. This is further used in the code, al
btrfs: add dedicated members for start and length of a block group
The on-disk format of block group item makes use of the key that stores the offset and length. This is further used in the code, although this makes thing harder to understand. The key is also packed so the offset/length is not properly aligned as u64.
Add start (key.objectid) and length (key.offset) members to block group and remove the embedded key. When the item is searched or written, a local variable for key is used.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
bf38be65 |
| 23-Oct-2019 |
David Sterba <dsterba@suse.com> |
btrfs: move block_group_item::used to block group
For unknown reasons, the member 'used' in the block group struct is stored in the b-tree item and accessed everywhere using the special accessor hel
btrfs: move block_group_item::used to block group
For unknown reasons, the member 'used' in the block group struct is stored in the b-tree item and accessed everywhere using the special accessor helper. Let's unify it and make it a regular member and only update the item before writing it to the tree.
The item is still being used for flags and chunk_objectid, there's some duplication until the item is removed in following patches.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
67439dad |
| 08-Oct-2019 |
David Sterba <dsterba@suse.com> |
btrfs: opencode extent_buffer_get
The helper is trivial and we can understand what the atomic_inc on something named refs does.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Ana
btrfs: opencode extent_buffer_get
The helper is trivial and we can understand what the atomic_inc on something named refs does.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
4c66e0d4 |
| 03-Oct-2019 |
David Sterba <dsterba@suse.com> |
btrfs: drop unused parameter is_new from btrfs_iget
The parameter is now always set to NULL and could be dropped. The last user was get_default_root but that got reworked in 05dbe6837b60 ("Btrfs: un
btrfs: drop unused parameter is_new from btrfs_iget
The parameter is now always set to NULL and could be dropped. The last user was get_default_root but that got reworked in 05dbe6837b60 ("Btrfs: unify subvol= and subvolid= mounting") and the parameter became unused.
Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
8702ba93 |
| 14-Oct-2019 |
Qu Wenruo <wqu@suse.com> |
btrfs: qgroup: Always free PREALLOC META reserve in btrfs_delalloc_release_extents()
[Background] Btrfs qgroup uses two types of reserved space for METADATA space, PERTRANS and PREALLOC.
PERTRANS i
btrfs: qgroup: Always free PREALLOC META reserve in btrfs_delalloc_release_extents()
[Background] Btrfs qgroup uses two types of reserved space for METADATA space, PERTRANS and PREALLOC.
PERTRANS is metadata space reserved for each transaction started by btrfs_start_transaction(). While PREALLOC is for delalloc, where we reserve space before joining a transaction, and finally it will be converted to PERTRANS after the writeback is done.
[Inconsistency] However there is inconsistency in how we handle PREALLOC metadata space.
The most obvious one is: In btrfs_buffered_write(): btrfs_delalloc_release_extents(BTRFS_I(inode), reserve_bytes, true);
We always free qgroup PREALLOC meta space.
While in btrfs_truncate_block(): btrfs_delalloc_release_extents(BTRFS_I(inode), blocksize, (ret != 0));
We only free qgroup PREALLOC meta space when something went wrong.
[The Correct Behavior] The correct behavior should be the one in btrfs_buffered_write(), we should always free PREALLOC metadata space.
The reason is, the btrfs_delalloc_* mechanism works by: - Reserve metadata first, even it's not necessary In btrfs_delalloc_reserve_metadata()
- Free the unused metadata space Normally in: btrfs_delalloc_release_extents() |- btrfs_inode_rsv_release() Here we do calculation on whether we should release or not.
E.g. for 64K buffered write, the metadata rsv works like:
/* The first page */ reserve_meta: num_bytes=calc_inode_reservations() free_meta: num_bytes=0 total: num_bytes=calc_inode_reservations() /* The first page caused one outstanding extent, thus needs metadata rsv */
/* The 2nd page */ reserve_meta: num_bytes=calc_inode_reservations() free_meta: num_bytes=calc_inode_reservations() total: not changed /* The 2nd page doesn't cause new outstanding extent, needs no new meta rsv, so we free what we have reserved */
/* The 3rd~16th pages */ reserve_meta: num_bytes=calc_inode_reservations() free_meta: num_bytes=calc_inode_reservations() total: not changed (still space for one outstanding extent)
This means, if btrfs_delalloc_release_extents() determines to free some space, then those space should be freed NOW. So for qgroup, we should call btrfs_qgroup_free_meta_prealloc() other than btrfs_qgroup_convert_reserved_meta().
The good news is: - The callers are not that hot The hottest caller is in btrfs_buffered_write(), which is already fixed by commit 336a8bb8e36a ("btrfs: Fix wrong btrfs_delalloc_release_extents parameter"). Thus it's not that easy to cause false EDQUOT.
- The trans commit in advance for qgroup would hide the bug Since commit f5fef4593653 ("btrfs: qgroup: Make qgroup async transaction commit more aggressive"), when btrfs qgroup metadata free space is slow, it will try to commit transaction and free the wrongly converted PERTRANS space, so it's not that easy to hit such bug.
[FIX] So to fix the problem, remove the @qgroup_free parameter for btrfs_delalloc_release_extents(), and always pass true to btrfs_inode_rsv_release().
Reported-by: Filipe Manana <fdmanana@suse.com> Fixes: 43b18595d660 ("btrfs: qgroup: Use separate meta reservation type for delalloc") CC: stable@vger.kernel.org # 4.19+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
44db1216 |
| 09-Oct-2019 |
Filipe Manana <fdmanana@suse.com> |
Btrfs: add missing extents release on file extent cluster relocation error
If we error out when finding a page at relocate_file_extent_cluster(), we need to release the outstanding extents counter o
Btrfs: add missing extents release on file extent cluster relocation error
If we error out when finding a page at relocate_file_extent_cluster(), we need to release the outstanding extents counter on the relocation inode, set by the previous call to btrfs_delalloc_reserve_metadata(), otherwise the inode's block reserve size can never decrease to zero and metadata space is leaked. Therefore add a call to btrfs_delalloc_release_extents() in case we can't find the target page.
Fixes: 8b62f87bad9c ("Btrfs: rework outstanding_extents") CC: stable@vger.kernel.org # 4.19+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
1fac4a54 |
| 23-Sep-2019 |
Qu Wenruo <wqu@suse.com> |
btrfs: relocation: fix use-after-free on dead relocation roots
[BUG] One user reported a reproducible KASAN report about use-after-free:
BTRFS info (device sdi1): balance: start -dvrange=12568116
btrfs: relocation: fix use-after-free on dead relocation roots
[BUG] One user reported a reproducible KASAN report about use-after-free:
BTRFS info (device sdi1): balance: start -dvrange=1256811659264..1256811659265 BTRFS info (device sdi1): relocating block group 1256811659264 flags data|raid0 ================================================================== BUG: KASAN: use-after-free in btrfs_init_reloc_root+0x2cd/0x340 [btrfs] Write of size 8 at addr ffff88856f671710 by task kworker/u24:10/261579
CPU: 2 PID: 261579 Comm: kworker/u24:10 Tainted: P OE 5.2.11-arch1-1-kasan #4 Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X99 Extreme4, BIOS P3.80 04/06/2018 Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs] Call Trace: dump_stack+0x7b/0xba print_address_description+0x6c/0x22e ? btrfs_init_reloc_root+0x2cd/0x340 [btrfs] __kasan_report.cold+0x1b/0x3b ? btrfs_init_reloc_root+0x2cd/0x340 [btrfs] kasan_report+0x12/0x17 __asan_report_store8_noabort+0x17/0x20 btrfs_init_reloc_root+0x2cd/0x340 [btrfs] record_root_in_trans+0x2a0/0x370 [btrfs] btrfs_record_root_in_trans+0xf4/0x140 [btrfs] start_transaction+0x1ab/0xe90 [btrfs] btrfs_join_transaction+0x1d/0x20 [btrfs] btrfs_finish_ordered_io+0x7bf/0x18a0 [btrfs] ? lock_repin_lock+0x400/0x400 ? __kmem_cache_shutdown.cold+0x140/0x1ad ? btrfs_unlink_subvol+0x9b0/0x9b0 [btrfs] finish_ordered_fn+0x15/0x20 [btrfs] normal_work_helper+0x1bd/0xca0 [btrfs] ? process_one_work+0x819/0x1720 ? kasan_check_read+0x11/0x20 btrfs_endio_write_helper+0x12/0x20 [btrfs] process_one_work+0x8c9/0x1720 ? pwq_dec_nr_in_flight+0x2f0/0x2f0 ? worker_thread+0x1d9/0x1030 worker_thread+0x98/0x1030 kthread+0x2bb/0x3b0 ? process_one_work+0x1720/0x1720 ? kthread_park+0x120/0x120 ret_from_fork+0x35/0x40
Allocated by task 369692: __kasan_kmalloc.part.0+0x44/0xc0 __kasan_kmalloc.constprop.0+0xba/0xc0 kasan_kmalloc+0x9/0x10 kmem_cache_alloc_trace+0x138/0x260 btrfs_read_tree_root+0x92/0x360 [btrfs] btrfs_read_fs_root+0x10/0xb0 [btrfs] create_reloc_root+0x47d/0xa10 [btrfs] btrfs_init_reloc_root+0x1e2/0x340 [btrfs] record_root_in_trans+0x2a0/0x370 [btrfs] btrfs_record_root_in_trans+0xf4/0x140 [btrfs] start_transaction+0x1ab/0xe90 [btrfs] btrfs_start_transaction+0x1e/0x20 [btrfs] __btrfs_prealloc_file_range+0x1c2/0xa00 [btrfs] btrfs_prealloc_file_range+0x13/0x20 [btrfs] prealloc_file_extent_cluster+0x29f/0x570 [btrfs] relocate_file_extent_cluster+0x193/0xc30 [btrfs] relocate_data_extent+0x1f8/0x490 [btrfs] relocate_block_group+0x600/0x1060 [btrfs] btrfs_relocate_block_group+0x3a0/0xa00 [btrfs] btrfs_relocate_chunk+0x9e/0x180 [btrfs] btrfs_balance+0x14e4/0x2fc0 [btrfs] btrfs_ioctl_balance+0x47f/0x640 [btrfs] btrfs_ioctl+0x119d/0x8380 [btrfs] do_vfs_ioctl+0x9f5/0x1060 ksys_ioctl+0x67/0x90 __x64_sys_ioctl+0x73/0xb0 do_syscall_64+0xa5/0x370 entry_SYSCALL_64_after_hwframe+0x44/0xa9
Freed by task 369692: __kasan_slab_free+0x14f/0x210 kasan_slab_free+0xe/0x10 kfree+0xd8/0x270 btrfs_drop_snapshot+0x154c/0x1eb0 [btrfs] clean_dirty_subvols+0x227/0x340 [btrfs] relocate_block_group+0x972/0x1060 [btrfs] btrfs_relocate_block_group+0x3a0/0xa00 [btrfs] btrfs_relocate_chunk+0x9e/0x180 [btrfs] btrfs_balance+0x14e4/0x2fc0 [btrfs] btrfs_ioctl_balance+0x47f/0x640 [btrfs] btrfs_ioctl+0x119d/0x8380 [btrfs] do_vfs_ioctl+0x9f5/0x1060 ksys_ioctl+0x67/0x90 __x64_sys_ioctl+0x73/0xb0 do_syscall_64+0xa5/0x370 entry_SYSCALL_64_after_hwframe+0x44/0xa9
The buggy address belongs to the object at ffff88856f671100 which belongs to the cache kmalloc-4k of size 4096 The buggy address is located 1552 bytes inside of 4096-byte region [ffff88856f671100, ffff88856f672100) The buggy address belongs to the page: page:ffffea0015bd9c00 refcount:1 mapcount:0 mapping:ffff88864400e600 index:0x0 compound_mapcount: 0 flags: 0x2ffff0000010200(slab|head) raw: 02ffff0000010200 dead000000000100 dead000000000200 ffff88864400e600 raw: 0000000000000000 0000000000070007 00000001ffffffff 0000000000000000 page dumped because: kasan: bad access detected
Memory state around the buggy address: ffff88856f671600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff88856f671680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >ffff88856f671700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ^ ffff88856f671780: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff88856f671800: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ================================================================== BTRFS info (device sdi1): 1 enospc errors during balance BTRFS info (device sdi1): balance: ended with status: -28
[CAUSE] The problem happens when finish_ordered_io() get called with balance still running, while the reloc root of that subvolume is already dead. (Tree is swap already done, but tree not yet deleted for possible qgroup usage.)
That means root->reloc_root still exists, but that reloc_root can be under btrfs_drop_snapshot(), thus we shouldn't access it.
The following race could cause the use-after-free problem:
CPU1 | CPU2 -------------------------------------------------------------------------- | relocate_block_group() | |- unset_reloc_control(rc) | |- btrfs_commit_transaction() btrfs_finish_ordered_io() | |- clean_dirty_subvols() |- btrfs_join_transaction() | | |- record_root_in_trans() | | |- btrfs_init_reloc_root() | | |- if (root->reloc_root) | | | | |- root->reloc_root = NULL | | |- btrfs_drop_snapshot(reloc_root); |- reloc_root->last_trans| = trans->transid | ^^^^^^^^^^^^^^^^^^^^^^ Use after free
[FIX] Fix it by the following modifications:
- Test if the root has dead reloc tree before accessing root->reloc_root If the root has BTRFS_ROOT_DEAD_RELOC_TREE, then we don't need to create or update root->reloc_tree
- Clear the BTRFS_ROOT_DEAD_RELOC_TREE flag until we have fully dropped reloc tree To co-operate with above modification, so as long as BTRFS_ROOT_DEAD_RELOC_TREE is still set, we won't try to re-create reloc tree at record_root_in_trans().
Reported-by: Cebtenzzre <cebtenzzre@gmail.com> Fixes: d2311e698578 ("btrfs: relocation: Delay reloc tree deletion after merge_reloc_roots") CC: stable@vger.kernel.org # 5.1+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
Revision tags: v5.2.11, v5.2.10, v5.2.9, v5.2.8, v5.2.7, v5.2.6, v5.2.5, v5.2.4, v5.2.3, v5.2.2, v5.2.1, v5.2, v5.1.16, v5.1.15, v5.1.14, v5.1.13 |
|
#
aac0023c |
| 20-Jun-2019 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: move basic block_group definitions to their own header
This is prep work for moving all of the block group cache code into its own file.
Signed-off-by: Josef Bacik <josef@toxicpanda.com> Rev
btrfs: move basic block_group definitions to their own header
This is prep work for moving all of the block group cache code into its own file.
Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ minor comment updates ] Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
330a5827 |
| 17-Jul-2019 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: Remove leftover of in-band dedupe
It's unlikely in-band dedupe is going to land so just remove any leftovers - dedupe.h header as well as the 'dedupe' parameter to btrfs_set_extent_delalloc.
btrfs: Remove leftover of in-band dedupe
It's unlikely in-band dedupe is going to land so just remove any leftovers - dedupe.h header as well as the 'dedupe' parameter to btrfs_set_extent_delalloc.
Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
86736342 |
| 19-Jun-2019 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: migrate the delalloc space stuff to it's own home
We have code for data and metadata reservations for delalloc. There's quite a bit of code here, and it's used in a lot of places so I've sep
btrfs: migrate the delalloc space stuff to it's own home
We have code for data and metadata reservations for delalloc. There's quite a bit of code here, and it's used in a lot of places so I've separated it out to it's own file. inode.c and file.c are already pretty large, and this code is complicated enough to live in its own space.
Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|