Revision tags: v6.6.67, v6.6.66, v6.6.65, v6.6.64, v6.6.63, v6.6.62, v6.6.61, v6.6.60, v6.6.59, v6.6.58, v6.6.57, v6.6.56, v6.6.55, v6.6.54, v6.6.53, v6.6.52, v6.6.51, v6.6.50, v6.6.49 |
|
#
26d0dfbb |
| 29-Aug-2024 |
Andrew Jeffery <andrew@codeconstruct.com.au> |
Merge tag 'v6.6.48' into for/openbmc/dev-6.6
This is the 6.6.48 stable release
|
Revision tags: v6.6.48, v6.6.47, v6.6.46, v6.6.45, v6.6.44, v6.6.43, v6.6.42, v6.6.41, v6.6.40, v6.6.39, v6.6.38, v6.6.37, v6.6.36, v6.6.35, v6.6.34, v6.6.33, v6.6.32, v6.6.31, v6.6.30, v6.6.29, v6.6.28, v6.6.27, v6.6.26, v6.6.25, v6.6.24, v6.6.23, v6.6.16, v6.6.15, v6.6.14, v6.6.13 |
|
#
1bca9776 |
| 16-Jan-2024 |
David Sterba <dsterba@suse.com> |
btrfs: replace sb::s_blocksize by fs_info::sectorsize
[ Upstream commit 4e00422ee62663e31e611d7de4d2c4aa3f8555f2 ]
The block size stored in the super block is used by subsystems outside of btrfs and it's a copy of fs_info::sectorsize. Unify that to always use our sectorsize, with the exception of mount where we first need to use fixed values (4K) until we read the super block and can set the sectorsize.
Replace all uses, in most cases it's fewer pointer indirections.
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com> Stable-dep-of: 46a6e10a1ab1 ("btrfs: send: allow cloning non-aligned extent if it ends at i_size") Signed-off-by: Sasha Levin <sashal@kernel.org>
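As a rough illustration of the substitution this commit describes, here is a standalone C sketch with simplified stand-in structs (not the actual btrfs definitions): the change trades a pointer hop through the VFS super block for a direct read of fs_info::sectorsize.

    /* Simplified model: the filesystem's own sectorsize field becomes the
     * single source of truth instead of the VFS super_block copy. */
    #include <stdio.h>

    struct super_block   { unsigned long s_blocksize; };
    struct btrfs_fs_info { unsigned int sectorsize; struct super_block *sb; };

    /* Before: extra pointer hop through the VFS super block. */
    static unsigned long blocksize_old(const struct btrfs_fs_info *fs_info)
    {
        return fs_info->sb->s_blocksize;
    }

    /* After: use fs_info::sectorsize directly. */
    static unsigned int blocksize_new(const struct btrfs_fs_info *fs_info)
    {
        return fs_info->sectorsize;
    }

    int main(void)
    {
        struct super_block sb = { .s_blocksize = 4096 };
        struct btrfs_fs_info fs_info = { .sectorsize = 4096, .sb = &sb };

        printf("old=%lu new=%u\n", blocksize_old(&fs_info), blocksize_new(&fs_info));
        return 0;
    }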
|
#
c005e2f6 |
| 14-Aug-2024 |
Andrew Jeffery <andrew@codeconstruct.com.au> |
Merge tag 'v6.6.46' into for/openbmc/dev-6.6
This is the 6.6.46 stable release
|
#
ba4dedb7 |
| 05-Mar-2024 |
Qu Wenruo <wqu@suse.com> |
btrfs: do not clear page dirty inside extent_write_locked_range()
[ Upstream commit 97713b1a2ced1e4a2a6c40045903797ebd44d7e0 ]
[BUG] For subpage + zoned case, the following workload can lead to rsv data leak at unmount time:
# mkfs.btrfs -f -s 4k $dev
# mount $dev $mnt
# fsstress -w -n 8 -d $mnt -s 1709539240
0/0: fiemap - no filename
0/1: copyrange read - no filename
0/2: write - no filename
0/3: rename - no source filename
0/4: creat f0 x:0 0 0
0/4: creat add id=0,parent=-1
0/5: writev f0[259 1 0 0 0 0] [778052,113,965] 0
0/6: ioctl(FIEMAP) f0[259 1 0 0 224 887097] [1294220,2291618343991484791,0x10000] -1
0/7: dwrite - xfsctl(XFS_IOC_DIOINFO) f0[259 1 0 0 224 887097] return 25, fallback to stat()
0/7: dwrite f0[259 1 0 0 224 887097] [696320,102400] 0
# umount $mnt
The dmesg includes the following rsv leak detection warning (all call trace skipped):
------------[ cut here ]------------
WARNING: CPU: 2 PID: 4528 at fs/btrfs/inode.c:8653 btrfs_destroy_inode+0x1e0/0x200 [btrfs]
---[ end trace 0000000000000000 ]---
------------[ cut here ]------------
WARNING: CPU: 2 PID: 4528 at fs/btrfs/inode.c:8654 btrfs_destroy_inode+0x1a8/0x200 [btrfs]
---[ end trace 0000000000000000 ]---
------------[ cut here ]------------
WARNING: CPU: 2 PID: 4528 at fs/btrfs/inode.c:8660 btrfs_destroy_inode+0x1a0/0x200 [btrfs]
---[ end trace 0000000000000000 ]---
BTRFS info (device sda): last unmount of filesystem 1b4abba9-de34-4f07-9e7f-157cf12a18d6
------------[ cut here ]------------
WARNING: CPU: 3 PID: 4528 at fs/btrfs/block-group.c:4434 btrfs_free_block_groups+0x338/0x500 [btrfs]
---[ end trace 0000000000000000 ]---
BTRFS info (device sda): space_info DATA has 268218368 free, is not full
BTRFS info (device sda): space_info total=268435456, used=204800, pinned=0, reserved=0, may_use=12288, readonly=0 zone_unusable=0
BTRFS info (device sda): global_block_rsv: size 0 reserved 0
BTRFS info (device sda): trans_block_rsv: size 0 reserved 0
BTRFS info (device sda): chunk_block_rsv: size 0 reserved 0
BTRFS info (device sda): delayed_block_rsv: size 0 reserved 0
BTRFS info (device sda): delayed_refs_rsv: size 0 reserved 0
------------[ cut here ]------------
WARNING: CPU: 3 PID: 4528 at fs/btrfs/block-group.c:4434 btrfs_free_block_groups+0x338/0x500 [btrfs]
---[ end trace 0000000000000000 ]---
BTRFS info (device sda): space_info METADATA has 267796480 free, is not full
BTRFS info (device sda): space_info total=268435456, used=131072, pinned=0, reserved=0, may_use=262144, readonly=0 zone_unusable=245760
BTRFS info (device sda): global_block_rsv: size 0 reserved 0
BTRFS info (device sda): trans_block_rsv: size 0 reserved 0
BTRFS info (device sda): chunk_block_rsv: size 0 reserved 0
BTRFS info (device sda): delayed_block_rsv: size 0 reserved 0
BTRFS info (device sda): delayed_refs_rsv: size 0 reserved 0
Above $dev is a tcmu-runner emulated zoned HDD, which has a max zone append size of 64K, and the system has 64K page size.
[CAUSE] I have added several trace_printk() to show the events (header skipped):
> btrfs_dirty_pages: r/i=5/259 dirty start=774144 len=114688
> btrfs_dirty_pages: r/i=5/259 dirty part of page=720896 off_in_page=53248 len_in_page=12288
> btrfs_dirty_pages: r/i=5/259 dirty part of page=786432 off_in_page=0 len_in_page=65536
> btrfs_dirty_pages: r/i=5/259 dirty part of page=851968 off_in_page=0 len_in_page=36864
The above lines show our buffered write has dirtied 3 pages of inode 259 of root 5:
704K         768K              832K           896K
I       |////I/////////////////I///////////|  I
        756K                               868K
|///| is the dirtied range using subpage bitmaps, and 'I' is the page boundary.
Meanwhile all three pages (704K, 768K, 832K) have their PageDirty flag set.
> btrfs_direct_write: r/i=5/259 start dio filepos=696320 len=102400
Then direct IO write starts, since the range [680K, 780K) covers the beginning part of the above dirty range, we need to writeback the two pages at 704K and 768K.
> cow_file_range: r/i=5/259 add ordered extent filepos=774144 len=65536
> extent_write_locked_range: r/i=5/259 locked page=720896 start=774144 len=65536
Now the above 2 lines show that we're writing back the dirty range [756K, 756K + 64K). We only write back 64K because the zoned device has a max zone append size of 64K.
> extent_write_locked_range: r/i=5/259 clear dirty for page=786432
!!! The above line shows the root cause. !!!
We're calling clear_page_dirty_for_io() inside extent_write_locked_range(), for the page at 768K. This is because extent_write_locked_range() can go beyond the current locked page; here we hit the page at 768K and clear its page dirty flag.
In fact this would lead to the desync between subpage dirty and page dirty flags. We have the page dirty flag cleared, but the subpage range [820K, 832K) is still dirty.
After the writeback of range [756K, 820K), the dirty flags look like this, as page 768K no longer has dirty flag set.
704K         768K              832K           896K
I            I            |    I/////////////|  I
                          820K               868K
This means we will no longer writeback range [820K, 832K), thus the reserved data/metadata space would never be properly released.
> extent_write_cache_pages: r/i=5/259 skip non-dirty folio=786432
Now even though we try to start writeback for page 768K, since the page is not dirty, we completely skip it at extent_write_cache_pages() time.
> btrfs_direct_write: r/i=5/259 dio done filepos=696320 len=0
Now the direct IO finished.
> cow_file_range: r/i=5/259 add ordered extent filepos=851968 len=36864
> extent_write_locked_range: r/i=5/259 locked page=851968 start=851968 len=36864
Now we writeback the remaining dirty range, which is [832K, 868K). Causing the range [820K, 832K) never to be submitted, thus leaking the reserved space.
This bug only affects subpage and zoned case. For non-subpage and zoned case, we have exactly one sector for each page, thus no such partial dirty cases.
For subpage and non-zoned case, we never go into run_delalloc_cow(), and normally all the dirty subpage ranges would be properly submitted inside __extent_writepage_io().
[FIX] Just do not clear the page dirty at all inside extent_write_locked_range(). As __extent_writepage_io() would do a more accurate, subpage compatible clear for page and subpage dirty flags anyway.
Now the correct trace would look like this:
> btrfs_dirty_pages: r/i=5/259 dirty start=774144 len=114688
> btrfs_dirty_pages: r/i=5/259 dirty part of page=720896 off_in_page=53248 len_in_page=12288
> btrfs_dirty_pages: r/i=5/259 dirty part of page=786432 off_in_page=0 len_in_page=65536
> btrfs_dirty_pages: r/i=5/259 dirty part of page=851968 off_in_page=0 len_in_page=36864
The page dirty part is still the same 3 pages.
> btrfs_direct_write: r/i=5/259 start dio filepos=696320 len=102400
> cow_file_range: r/i=5/259 add ordered extent filepos=774144 len=65536
> extent_write_locked_range: r/i=5/259 locked page=720896 start=774144 len=65536
And the writeback for the first 64K is still correct.
> cow_file_range: r/i=5/259 add ordered extent filepos=839680 len=49152
> extent_write_locked_range: r/i=5/259 locked page=786432 start=839680 len=49152
Now with the fix, we can properly writeback the range [820K, 832K), and properly release the reserved data/metadata space.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
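A standalone C model of the desync described above (illustrative names only, not btrfs code): once the page-level dirty flag is cleared while subpage bitmap bits are still set, the writeback scan skips the page and the remaining dirty sectors are never submitted, which is why the fix leaves the clearing to the lower-level writeback path.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* A 64K "page" tracked both as a single dirty flag and as a
     * per-4K-sector bitmap (subpage mode). */
    struct subpage_page {
        bool     page_dirty;     /* page-level Dirty flag */
        uint16_t dirty_bitmap;   /* one bit per 4K sector */
    };

    /* Writeback scan: pages whose Dirty flag is clear are skipped entirely,
     * like extent_write_cache_pages() skipping non-dirty folios. */
    static void writeback_scan(struct subpage_page *p)
    {
        if (!p->page_dirty) {
            printf("skip non-dirty page, leftover bitmap=0x%x (leaked)\n",
                   p->dirty_bitmap);
            return;
        }
        printf("writing back bitmap=0x%x\n", p->dirty_bitmap);
        p->dirty_bitmap = 0;
        p->page_dirty = false;
    }

    int main(void)
    {
        /* Sectors 13..15 of the page are dirty (e.g. range [820K, 832K)). */
        struct subpage_page p = { .page_dirty = true, .dirty_bitmap = 0xE000 };

        /* Buggy behaviour: an earlier partial writeback cleared the page
         * flag without consulting the subpage bitmap. */
        p.page_dirty = false;

        writeback_scan(&p);   /* skips the page, sectors stay dirty forever */
        return 0;
    }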
|
#
e0d77d0f |
| 19-May-2024 |
Andrew Jeffery <andrew@codeconstruct.com.au> |
Merge tag 'v6.6.31' into dev-6.6
This is the 6.6.31 stable release
|
#
845cf1c7 |
| 25-Mar-2024 |
Qu Wenruo <wqu@suse.com> |
btrfs: do not wait for short bulk allocation
commit 1db7959aacd905e6487d0478ac01d89f86eb1e51 upstream.
[BUG] There is a recent report that when memory pressure is high (including cached pages), btrfs can spend most of its time on memory allocation in btrfs_alloc_page_array() for compressed read/write.
[CAUSE] For btrfs_alloc_page_array() we always go alloc_pages_bulk_array(), and even if the bulk allocation failed (fell back to single page allocation) we still retry but with extra memalloc_retry_wait().
If the bulk alloc only returned one page at a time, we would spend a lot of time on the retry wait.
The behavior was introduced in commit 395cb57e8560 ("btrfs: wait between incomplete batch memory allocations").
[FIX] Although the commit mentioned that other filesystems do the wait, it's not the case at least nowadays.
All the mainlined filesystems only call memalloc_retry_wait() if they failed to allocate any page (not only for bulk allocation). If there is any progress, they won't call memalloc_retry_wait() at all.
For example, xfs_buf_alloc_pages() would only call memalloc_retry_wait() if there is no allocation progress at all, and the call is not for metadata readahead.
So I don't believe we should call memalloc_retry_wait() unconditionally for short allocation.
Call memalloc_retry_wait() if it fails to allocate any page for tree block allocation (which goes with __GFP_NOFAIL and may not need the special handling anyway), and reduce the latency for btrfs_alloc_page_array().
Reported-by: Julian Taylor <julian.taylor@1und1.de> Tested-by: Julian Taylor <julian.taylor@1und1.de> Link: https://lore.kernel.org/all/8966c095-cbe7-4d22-9784-a647d1bf27c3@1und1.de/ Fixes: 395cb57e8560 ("btrfs: wait between incomplete batch memory allocations") CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
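A standalone sketch of the retry policy this commit describes; fake_bulk_alloc() is an illustrative stand-in for alloc_pages_bulk_array(), and the printf marks where the kernel would call memalloc_retry_wait(). Only a pass that makes no progress at all triggers the back-off.

    #include <stdio.h>
    #include <stdlib.h>

    #define NR_PAGES 8

    /* Stand-in bulk allocator: fills some empty slots, possibly fewer than
     * asked for, and returns how many slots are now populated. */
    static size_t fake_bulk_alloc(size_t nr, void **slots)
    {
        size_t populated = 0;
        for (size_t i = 0; i < nr; i++) {
            if (!slots[i] && rand() % 2)
                slots[i] = malloc(64);
            if (slots[i])
                populated++;
        }
        return populated;
    }

    static int alloc_page_array(size_t nr, void **slots)
    {
        size_t allocated = 0;

        while (allocated < nr) {
            size_t last = allocated;

            allocated = fake_bulk_alloc(nr, slots);
            if (allocated == nr)
                return 0;
            /* Old behaviour: wait after every short pass.
             * New behaviour: only back off when no progress was made. */
            if (allocated == last)
                printf("no progress, back off (memalloc_retry_wait)\n");
        }
        return 0;
    }

    int main(void)
    {
        void *slots[NR_PAGES] = { 0 };

        alloc_page_array(NR_PAGES, slots);
        for (size_t i = 0; i < NR_PAGES; i++)
            free(slots[i]);
        return 0;
    }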
|
#
86aa961b |
| 10-Apr-2024 |
Andrew Jeffery <andrew@codeconstruct.com.au> |
Merge tag 'v6.6.26' into dev-6.6
This is the 6.6.26 stable release
|
#
49d640d2 |
| 28-Feb-2024 |
Filipe Manana <fdmanana@suse.com> |
btrfs: fix race when detecting delalloc ranges during fiemap
[ Upstream commit 978b63f7464abcfd364a6c95f734282c50f3decf ]
For fiemap we recently stopped locking the target extent range for the whole duration of the fiemap call, in order to avoid a deadlock in a scenario where the fiemap buffer happens to be a memory mapped range of the same file. This use case is very unlikely to be useful in practice but it may be triggered by fuzz testing (syzbot, etc).
This however introduced a race that makes us miss delalloc ranges for file regions that are currently holes, so the caller of fiemap will not be aware that there's data for some file regions. This can be quite serious for some use cases - for example in coreutils versions before 9.0, the cp program used fiemap to detect holes and data in the source file, copying only regions with data (extents or delalloc) from the source file to the destination file in order to preserve holes (see the documentation for its --sparse command line option). This means that if cp was used with a source file that had delalloc in a hole, the destination file could end up without that data, which is effectively a data loss issue, if it happened to hit the race described below.
The race happens like this:
1) Fiemap is called, without the FIEMAP_FLAG_SYNC flag, for a file that has delalloc in the file range [64M, 65M[, which is currently a hole;
2) Fiemap locks the inode in shared mode, then starts iterating the inode's subvolume tree searching for file extent items, without having the whole fiemap target range locked in the inode's io tree - the change introduced recently by commit b0ad381fa769 ("btrfs: fix deadlock with fiemap and extent locking"). It only locks ranges in the io tree when it finds a hole or prealloc extent since that commit;
3) Note that fiemap clones each leaf before using it, and this is to avoid deadlocks when locking a file range in the inode's io tree and the fiemap buffer is memory mapped to some file, because writing to the page with btrfs_page_mkwrite() will wait on any ordered extent for the page's range and the ordered extent needs to lock the range and may need to modify the same leaf, therefore leading to a deadlock on the leaf;
4) While iterating the file extent items in the cloned leaf before finding the hole in the range [64M, 65M[, the delalloc in that range is flushed and its ordered extent completes - meaning the corresponding file extent item is in the inode's subvolume tree, but not present in the cloned leaf that fiemap is iterating over;
5) When fiemap finds the hole in the [64M, 65M[ range by seeing the gap in the cloned leaf (or a file extent item with disk_bytenr == 0 in case the NO_HOLES feature is not enabled), it will lock that file range in the inode's io tree and then search for delalloc by checking for the EXTENT_DELALLOC bit in the io tree for that range and ordered extents (with btrfs_find_delalloc_in_range()). But it finds nothing since the delalloc in that range was already flushed and the ordered extent completed and is gone - as a result fiemap will not report that there's delalloc or an extent for the range [64M, 65M[, so user space will be mislead into thinking that there's a hole in that range.
This could actually be sporadically triggered with test case generic/094 from fstests, which reports a missing extent/delalloc range like this:
# generic/094 2s ... - output mismatch (see /home/fdmanana/git/hub/xfstests/results//generic/094.out.bad)
# --- tests/generic/094.out 2020-06-10 19:29:03.830519425 +0100
# +++ /home/fdmanana/git/hub/xfstests/results//generic/094.out.bad 2024-02-28 11:00:00.381071525 +0000
# @@ -1,3 +1,9 @@
#  QA output created by 094
#  fiemap run with sync
#  fiemap run without sync
# +ERROR: couldn't find extent at 7
# +map is 'HHDDHPPDPHPH'
# +logical: [ 5.. 6] phys: 301517.. 301518 flags: 0x800 tot: 2
# +logical: [ 8.. 8] phys: 301520.. 301520 flags: 0x800 tot: 1
# ...
# (Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/generic/094.out /home/fdmanana/git/hub/xfstests/results//generic/094.out.bad' to see the entire diff)
So in order to fix this, while still avoiding deadlocks in the case where the fiemap buffer is memory mapped to the same file, change fiemap to work like the following:
1) Always lock the whole range in the inode's io tree before starting to iterate the inode's subvolume tree searching for file extent items, just like we did before commit b0ad381fa769 ("btrfs: fix deadlock with fiemap and extent locking");
2) Now instead of writing to the fiemap buffer every time we have an extent to report, write instead to a temporary buffer (1 page), and when that buffer becomes full, stop iterating the file extent items, unlock the range in the io tree, release the search path, submit all the entries kept in that buffer to the fiemap buffer, and then resume the search for file extent items after locking again the remainder of the range in the io tree.
The buffer having a size of a page, allows for 146 entries in a system with 4K pages. This is a large enough value to have a good performance by avoiding too many restarts of the search for file extent items. In other words this preserves the huge performance gains made in the last two years to fiemap, while avoiding the deadlocks in case the fiemap buffer is memory mapped to the same file (useless in practice, but possible and exercised by fuzz testing and syzbot).
Fixes: b0ad381fa769 ("btrfs: fix deadlock with fiemap and extent locking") Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
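A standalone sketch of the batching scheme described above (all names illustrative, not the real fiemap code): entries accumulate in a small fixed-size cache while the range is locked; when the cache fills up, the lock is dropped, the cached entries are copied out to the possibly page-faulting destination, and the scan resumes.

    #include <stdio.h>

    #define CACHE_ENTRIES 4   /* the real code fits ~146 entries in one page */

    struct fm_entry { unsigned long long offset, len; };

    static struct fm_entry cache[CACHE_ENTRIES];
    static int cache_used;

    static void lock_range(void)   { printf("lock io range\n"); }
    static void unlock_range(void) { printf("unlock io range\n"); }

    /* Copy cached entries to the user-visible buffer with no locks held,
     * so a page fault on the destination cannot deadlock with the io lock. */
    static void flush_cache(void)
    {
        unlock_range();
        for (int i = 0; i < cache_used; i++)
            printf("emit [%llu, %llu)\n", cache[i].offset,
                   cache[i].offset + cache[i].len);
        cache_used = 0;
        lock_range();
    }

    int main(void)
    {
        lock_range();
        for (unsigned long long off = 0; off < 10 * 4096; off += 4096) {
            if (cache_used == CACHE_ENTRIES)
                flush_cache();
            cache[cache_used++] = (struct fm_entry){ off, 4096 };
        }
        flush_cache();
        unlock_range();
        return 0;
    }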
|
#
8cc484e8 |
| 22-Feb-2024 |
Filipe Manana <fdmanana@suse.com> |
btrfs: ensure fiemap doesn't race with writes when FIEMAP_FLAG_SYNC is given
[ Upstream commit 418b09027743d9a9fb39116bed46a192f868a3c3 ]
When FIEMAP_FLAG_SYNC is given to fiemap the expectation is that there are no concurrent writes and we get a stable view of the inode's extent layout.
When the flag is given we flush all IO (and wait for ordered extents to complete) and then lock the inode in shared mode, however that leaves open the possibility that a write might happen right after the flushing and before locking the inode. So fix this by flushing again after locking the inode - we leave the initial flushing before locking the inode to avoid holding the lock and blocking other RO operations while waiting for IO and ordered extents to complete. The second flushing while holding the inode's lock will most of the time do nothing or very little since the time window for new writes to have happened is small.
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Stable-dep-of: 978b63f7464a ("btrfs: fix race when detecting delalloc ranges during fiemap") Signed-off-by: Sasha Levin <sashal@kernel.org>
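A minimal sketch of the flush/lock/flush ordering described above, using stub functions rather than the real btrfs helpers:

    #include <stdio.h>

    static void flush_and_wait_ordered(void) { printf("flush dirty pages + wait ordered extents\n"); }
    static void lock_inode_shared(void)      { printf("lock inode (shared)\n"); }
    static void unlock_inode_shared(void)    { printf("unlock inode (shared)\n"); }

    static void fiemap_sync(void)
    {
        flush_and_wait_ordered();   /* no lock held yet, readers not blocked   */
        lock_inode_shared();
        flush_and_wait_ordered();   /* usually a no-op, closes the race window */
        /* ... walk extents and report them ... */
        unlock_inode_shared();
    }

    int main(void) { fiemap_sync(); return 0; }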
|
#
46eeaa11 |
| 03-Apr-2024 |
Andrew Jeffery <andrew@codeconstruct.com.au> |
Merge tag 'v6.6.24' into dev-6.6
This is the 6.6.24 stable release
|
#
0427c8ef |
| 15-Mar-2024 |
Tavian Barnes <tavianator@tavianator.com> |
btrfs: fix race in read_extent_buffer_pages()
commit ef1e68236b9153c27cb7cf29ead0c532870d4215 upstream.
There are reports from tree-checker that it detects corrupted nodes, without any obvious pattern, so possibly an overwrite in memory. After some debugging it turns out there's a race when reading an extent buffer: the uptodate status can be missed.
To prevent concurrent reads for the same extent buffer, read_extent_buffer_pages() performs these checks:
/* (1) */ if (test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags)) return 0;
/* (2) */ if (test_and_set_bit(EXTENT_BUFFER_READING, &eb->bflags)) goto done;
At this point, it seems safe to start the actual read operation. Once that completes, end_bbio_meta_read() does
/* (3) */ set_extent_buffer_uptodate(eb);
/* (4) */ clear_bit(EXTENT_BUFFER_READING, &eb->bflags);
Normally, this is enough to ensure only one read happens, and all other callers wait for it to finish before returning. Unfortunately, there is a racey interleaving:
Thread A | Thread B | Thread C
---------+----------+---------
(1)      |          |
         | (1)      |
(2)      |          |
(3)      |          |
(4)      |          |
         | (2)      |
         |          | (1)
When this happens, thread B kicks off an unnecessary read. Worse, thread C will see UPTODATE set and return immediately, while the read from thread B is still in progress. This race could result in tree-checker errors like this as the extent buffer is concurrently modified:
BTRFS critical (device dm-0): corrupted node, root=256 block=8550954455682405139 owner mismatch, have 11858205567642294356 expect [256, 18446744073709551360]
Fix it by testing UPTODATE again after setting the READING bit, and if it's been set, skip the unnecessary read.
Fixes: d7172f52e993 ("btrfs: use per-buffer locking for extent_buffer reading") Link: https://lore.kernel.org/linux-btrfs/CAHk-=whNdMaN9ntZ47XRKP6DBes2E5w7fi-0U3H2+PS18p+Pzw@mail.gmail.com/ Link: https://lore.kernel.org/linux-btrfs/f51a6d5d7432455a6a858d51b49ecac183e0bbc9.1706312914.git.wqu@suse.com/ Link: https://lore.kernel.org/linux-btrfs/c7241ea4-fcc6-48d2-98c8-b5ea790d6c89@gmx.com/ CC: stable@vger.kernel.org # 6.5+ Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Tavian Barnes <tavianator@tavianator.com> Reviewed-by: David Sterba <dsterba@suse.com> [ minor update of changelog ] Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
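A standalone model of the fixed check ordering; the bit helpers are defined locally to mirror test_bit()/test_and_set_bit() semantics, so this is not the btrfs implementation, only a sketch of the pattern.

    #include <stdbool.h>
    #include <stdio.h>

    #define EB_UPTODATE 0x1
    #define EB_READING  0x2

    struct eb { unsigned int flags; };

    static bool test_flag(struct eb *e, unsigned int f) { return e->flags & f; }
    static bool test_and_set_flag(struct eb *e, unsigned int f)
    {
        bool was_set = e->flags & f;
        e->flags |= f;
        return was_set;
    }
    static void clear_flag(struct eb *e, unsigned int f) { e->flags &= ~f; }

    static int read_extent_buffer_pages(struct eb *e)
    {
        if (test_flag(e, EB_UPTODATE))             /* (1) */
            return 0;
        if (test_and_set_flag(e, EB_READING))      /* (2) */
            return 0;                              /* another reader owns the IO */

        /* The fix: another reader may have finished (3)+(4) between our
         * checks (1) and (2), so re-check UPTODATE before issuing IO. */
        if (test_flag(e, EB_UPTODATE)) {
            clear_flag(e, EB_READING);
            return 0;
        }

        printf("submit read\n");                   /* actual IO would go here */
        e->flags |= EB_UPTODATE;                   /* (3) */
        clear_flag(e, EB_READING);                 /* (4) */
        return 0;
    }

    int main(void)
    {
        struct eb e = { 0 };

        read_extent_buffer_pages(&e);   /* first call performs the read    */
        read_extent_buffer_pages(&e);   /* second call returns immediately */
        return 0;
    }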
|
#
ded566b4 |
| 12-Feb-2024 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: fix deadlock with fiemap and extent locking
commit b0ad381fa7690244802aed119b478b4bdafc31dd upstream.
While working on the patchset to remove extent locking I got a lockdep splat with fiemap and pagefaulting with my new extent lock replacement lock.
This deadlock exists with our normal code, we just don't have lockdep annotations with the extent locking so we've never noticed it.
Since we're copying the fiemap extent to user space on every iteration we have the chance of pagefaulting. Because we hold the extent lock for the entire range we could mkwrite into a range in the file that we have mmap'ed. This would deadlock with the following stack trace
[<0>] lock_extent+0x28d/0x2f0
[<0>] btrfs_page_mkwrite+0x273/0x8a0
[<0>] do_page_mkwrite+0x50/0xb0
[<0>] do_fault+0xc1/0x7b0
[<0>] __handle_mm_fault+0x2fa/0x460
[<0>] handle_mm_fault+0xa4/0x330
[<0>] do_user_addr_fault+0x1f4/0x800
[<0>] exc_page_fault+0x7c/0x1e0
[<0>] asm_exc_page_fault+0x26/0x30
[<0>] rep_movs_alternative+0x33/0x70
[<0>] _copy_to_user+0x49/0x70
[<0>] fiemap_fill_next_extent+0xc8/0x120
[<0>] emit_fiemap_extent+0x4d/0xa0
[<0>] extent_fiemap+0x7f8/0xad0
[<0>] btrfs_fiemap+0x49/0x80
[<0>] __x64_sys_ioctl+0x3e1/0xb50
[<0>] do_syscall_64+0x94/0x1a0
[<0>] entry_SYSCALL_64_after_hwframe+0x6e/0x76
I wrote an fstest to reproduce this deadlock without my replacement lock and verified that the deadlock exists with our existing locking.
To fix this simply don't take the extent lock for the entire duration of the fiemap. This is safe in general because we keep track of where we are when we're searching the tree, so if an ordered extent updates in the middle of our fiemap call we'll still emit the correct extents because we know what offset we were on before.
The only place we maintain the lock is searching delalloc. Since the delalloc stuff can change during writeback we want to lock the extent range so we have a consistent view of delalloc at the time we're checking to see if we need to set the delalloc flag.
With this patch applied we no longer deadlock with my testcase.
CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
#
816ffd28 |
| 13-Mar-2024 |
Andrew Jeffery <andrew@codeconstruct.com.au> |
Merge tag 'v6.6.21' into dev-6.6
This is the 6.6.21 stable release
|
#
d43f8e58 |
| 22-Feb-2024 |
Filipe Manana <fdmanana@suse.com> |
btrfs: fix race between ordered extent completion and fiemap
[ Upstream commit a1a4a9ca77f143c00fce69c1239887ff8b813bec ]
For fiemap we recently stopped locking the target extent range for the whole duration of the fiemap call, in order to avoid a deadlock in a scenario where the fiemap buffer happens to be a memory mapped range of the same file. This use case is very unlikely to be useful in practice but it may be triggered by fuzz testing (syzbot, etc).
However by not locking the target extent range for the whole duration of the fiemap call we can race with an ordered extent. This happens like this:
1) The fiemap task finishes processing a file extent item that covers the file range [512K, 1M[, and that file extent item is the last item in the leaf currently being processed;
2) An ordered extent for the file range [768K, 2M[, in COW mode, completes (btrfs_finish_one_ordered()) and the file extent item covering the range [512K, 1M[ is trimmed to cover the range [512K, 768K[ and then a new file extent item for the range [768K, 2M[ is inserted in the inode's subvolume tree;
3) The fiemap task calls fiemap_next_leaf_item(), which then calls btrfs_next_leaf() to find the next leaf / item. This finds that the next key following the one we previously processed (its type is BTRFS_EXTENT_DATA_KEY and its offset is 512K), is the key corresponding to the new file extent item inserted by the ordered extent, which has a type of BTRFS_EXTENT_DATA_KEY and an offset of 768K;
4) Later the fiemap code ends up at emit_fiemap_extent() and triggers the warning:
if (cache->offset + cache->len > offset) { WARN_ON(1); return -EINVAL; }
Since we get 1M > 768K, because the previously emitted entry for the old extent covering the file range [512K, 1M[ ends at an offset that is greater than the new extent's start offset (768K). This makes fiemap fail with -EINVAL besides triggering the warning that produces a stack trace like the following:
[1621.677651] ------------[ cut here ]------------
[1621.677656] WARNING: CPU: 1 PID: 204366 at fs/btrfs/extent_io.c:2492 emit_fiemap_extent+0x84/0x90 [btrfs]
[1621.677899] Modules linked in: btrfs blake2b_generic (...)
[1621.677951] CPU: 1 PID: 204366 Comm: pool Not tainted 6.8.0-rc5-btrfs-next-151+ #1
[1621.677954] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
[1621.677956] RIP: 0010:emit_fiemap_extent+0x84/0x90 [btrfs]
[1621.678033] Code: 2b 4c 89 63 (...)
[1621.678035] RSP: 0018:ffffab16089ffd20 EFLAGS: 00010206
[1621.678037] RAX: 00000000004fa000 RBX: ffffab16089ffe08 RCX: 0000000000009000
[1621.678039] RDX: 00000000004f9000 RSI: 00000000004f1000 RDI: ffffab16089ffe90
[1621.678040] RBP: 00000000004f9000 R08: 0000000000001000 R09: 0000000000000000
[1621.678041] R10: 0000000000000000 R11: 0000000000001000 R12: 0000000041d78000
[1621.678043] R13: 0000000000001000 R14: 0000000000000000 R15: ffff9434f0b17850
[1621.678044] FS: 00007fa6e20006c0(0000) GS:ffff943bdfa40000(0000) knlGS:0000000000000000
[1621.678046] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1621.678048] CR2: 00007fa6b0801000 CR3: 000000012d404002 CR4: 0000000000370ef0
[1621.678053] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[1621.678055] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[1621.678056] Call Trace:
[1621.678074] <TASK>
[1621.678076] ? __warn+0x80/0x130
[1621.678082] ? emit_fiemap_extent+0x84/0x90 [btrfs]
[1621.678159] ? report_bug+0x1f4/0x200
[1621.678164] ? handle_bug+0x42/0x70
[1621.678167] ? exc_invalid_op+0x14/0x70
[1621.678170] ? asm_exc_invalid_op+0x16/0x20
[1621.678178] ? emit_fiemap_extent+0x84/0x90 [btrfs]
[1621.678253] extent_fiemap+0x766/0xa30 [btrfs]
[1621.678339] btrfs_fiemap+0x45/0x80 [btrfs]
[1621.678420] do_vfs_ioctl+0x1e4/0x870
[1621.678431] __x64_sys_ioctl+0x6a/0xc0
[1621.678434] do_syscall_64+0x52/0x120
[1621.678445] entry_SYSCALL_64_after_hwframe+0x6e/0x76
There's also another case where before calling btrfs_next_leaf() we are processing a hole or a prealloc extent and we had several delalloc ranges within that hole or prealloc extent. In that case if the ordered extents complete before we find the next key, we may end up finding an extent item with an offset smaller than (or equal to) the offset in cache->offset.
So fix this by changing emit_fiemap_extent() to address these three scenarios like this:
1) For the first case, steps listed above, adjust the length of the previously cached extent so that it does not overlap with the current extent, emit the previous one and cache the current file extent item;
2) For the second case where we had a hole or prealloc extent with multiple delalloc ranges inside the hole or prealloc extent's range, and the current file extent item has an offset that matches the offset in the fiemap cache, just discard what we have in the fiemap cache and assign the current file extent item to the cache, since it's more up to date;
3) For the third case where we had a hole or prealloc extent with multiple delalloc ranges inside the hole or prealloc extent's range and the offset of the file extent item we just found is smaller than what we have in the cache, just skip the current file extent item if its range ends at or behind the cached extent's end, because we may have emitted (to the fiemap user space buffer) delalloc ranges that overlap with the current file extent item's range. If the file extent item's range goes beyond the end offset of the cached extent, just emit the cached extent and cache a subrange of the file extent item, that goes from the end offset of the cached extent to the end offset of the file extent item.
Dealing with those cases in those ways makes everything consistent by reflecting the current state of file extent items in the btree and without emitting extents that have overlapping ranges (which would be confusing and violating expectations).
This issue could be triggered often with test case generic/561, and was also hit and reported by Wang Yugui.
Reported-by: Wang Yugui <wangyugui@e16-tech.com> Link: https://lore.kernel.org/linux-btrfs/20240223104619.701F.409509F4@e16-tech.com/ Fixes: b0ad381fa769 ("btrfs: fix deadlock with fiemap and extent locking") Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
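A standalone sketch of the overlap handling described above, tracking only offsets and lengths; the names are illustrative and the real emit_fiemap_extent() also tracks flags and physical addresses.

    #include <stdio.h>

    struct extent { unsigned long long offset, len; };

    static void emit(const struct extent *e)
    {
        printf("emit [%llu, %llu)\n", e->offset, e->offset + e->len);
    }

    /* cache: previously found extent, not yet emitted; cur: newly found one. */
    static void handle_extent(struct extent *cache, struct extent cur)
    {
        unsigned long long cache_end = cache->offset + cache->len;
        unsigned long long cur_end = cur.offset + cur.len;

        if (cur.offset >= cache_end) {
            /* No overlap: normal case, emit the cache and keep cur. */
            emit(cache);
            *cache = cur;
        } else if (cur.offset == cache->offset) {
            /* Case 2: same start, cur is more up to date, replace. */
            *cache = cur;
        } else if (cur.offset > cache->offset) {
            /* Case 1: cur starts inside the cache; trim the cache so it
             * ends where cur begins, emit it, then cache cur. */
            cache->len = cur.offset - cache->offset;
            emit(cache);
            *cache = cur;
        } else if (cur_end <= cache_end) {
            /* Case 3a: cur is fully behind what we already cached (and
             * possibly already emitted as delalloc), so skip it. */
        } else {
            /* Case 3b: emit the cache, then cache only the part of cur
             * that goes past the cached extent's end. */
            emit(cache);
            cache->offset = cache_end;
            cache->len = cur_end - cache_end;
        }
    }

    int main(void)
    {
        struct extent cache = { 512 << 10, 512 << 10 };                   /* [512K, 1M)  */

        handle_extent(&cache, (struct extent){ 768 << 10, 1280 << 10 });  /* [768K, 2M)  */
        emit(&cache);
        return 0;
    }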
|
#
e6eff205 |
| 10-Feb-2024 |
Andrew Jeffery <andrew@codeconstruct.com.au> |
Merge tag 'v6.6.8' into dev-6.6
This is the 6.6.8 stable release
|
Revision tags: v6.6.12, v6.6.11, v6.6.10, v6.6.9, v6.6.8, v6.6.7, v6.6.6, v6.6.5, v6.6.4 |
|
#
d3cf0243 |
| 01-Dec-2023 |
Boris Burkov <boris@bur.io> |
btrfs: don't clear qgroup reserved bit in release_folio
commit a86805504b88f636a6458520d85afdf0634e3c6b upstream.
The EXTENT_QGROUP_RESERVED bit is used to "lock" regions of the file for duplicate reservations. That is, two writes to that range in one transaction shouldn't create two reservations, as the reservation will only be freed once when the write finally goes down. Therefore, it is never OK to clear that bit without freeing the associated qgroup reserve. At this point, we don't want to be freeing the reserve, so mask off the bit.
CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
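A minimal sketch of the masking idea, with illustrative bit values rather than the kernel's: when clearing io tree bits on folio release, EXTENT_QGROUP_RESERVED is masked out of the bits to clear so the reservation marker survives until the reserve itself is freed.

    #include <stdio.h>

    #define EXTENT_DIRTY           0x01
    #define EXTENT_DELALLOC        0x02
    #define EXTENT_QGROUP_RESERVED 0x04

    static unsigned int range_bits = EXTENT_DIRTY | EXTENT_DELALLOC |
                                     EXTENT_QGROUP_RESERVED;

    static void clear_extent_bits(unsigned int bits)
    {
        range_bits &= ~bits;
    }

    int main(void)
    {
        /* Clear "everything" on folio release, except the qgroup marker,
         * which may only be dropped when the reservation itself is freed. */
        unsigned int clear = ~0u & ~EXTENT_QGROUP_RESERVED;

        clear_extent_bits(clear);
        printf("qgroup reserved still set: %s\n",
               (range_bits & EXTENT_QGROUP_RESERVED) ? "yes" : "no");
        return 0;
    }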
|
#
b97d6790 |
| 13-Dec-2023 |
Joel Stanley <joel@jms.id.au> |
Merge tag 'v6.6.6' into dev-6.6
This is the 6.6.6 stable release
Signed-off-by: Joel Stanley <joel@jms.id.au>
|
Revision tags: v6.6.3 |
|
#
33357859 |
| 23-Nov-2023 |
Qu Wenruo <wqu@suse.com> |
btrfs: free the allocated memory if btrfs_alloc_page_array() fails
commit 94dbf7c0871f7ae6349ba4b0341ce8f5f98a071d upstream.
[BUG] If btrfs_alloc_page_array() fails to allocate all pages but only part of the slots, then the partially allocated pages would be leaked in btrfs_submit_compressed_read().
[CAUSE] As explicitly stated, if btrfs_alloc_page_array() returned -ENOMEM, caller is responsible to free the partially allocated pages.
For the existing call sites, most of them are fine:
- btrfs_raid_bio::stripe_pages Handled by free_raid_bio().
- extent_buffer::pages[] Handled by btrfs_release_extent_buffer_pages().
- scrub_stripe::pages[] Handled by release_scrub_stripe().
But there is one exception in btrfs_submit_compressed_read(), if btrfs_alloc_page_array() failed, we didn't cleanup the array and freed the array pointer directly.
Initially there is still the error handling in commit dd137dd1f2d7 ("btrfs: factor out allocating an array of pages"), but later in commit 544fe4a903ce ("btrfs: embed a btrfs_bio into struct compressed_bio"), the error handling is removed, leading to the possible memory leak.
[FIX] This patch adds back the error handling first; then, to prevent such a situation from happening again, it also makes btrfs_alloc_page_array() free the allocated pages as an extra safety net, so we don't need to add the error handling to btrfs_submit_compressed_read().
Fixes: 544fe4a903ce ("btrfs: embed a btrfs_bio into struct compressed_bio") CC: stable@vger.kernel.org # 6.4+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
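A standalone sketch of the safety net described in the fix, with malloc/free standing in for page allocation: on failure the partially filled slots are freed and cleared before returning -ENOMEM, so no caller can leak the partial array.

    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Stand-in allocator that fails on one slot to exercise the cleanup path. */
    static void *maybe_alloc(size_t i, size_t fail_at)
    {
        return (i == fail_at) ? NULL : malloc(4096);
    }

    static int alloc_page_array(size_t nr, void **slots, size_t fail_at)
    {
        for (size_t i = 0; i < nr; i++) {
            slots[i] = maybe_alloc(i, fail_at);
            if (!slots[i]) {
                /* The fix: free the partially allocated slots here instead
                 * of expecting every caller to do it. */
                for (size_t j = 0; j < i; j++) {
                    free(slots[j]);
                    slots[j] = NULL;
                }
                return -ENOMEM;
            }
        }
        return 0;
    }

    int main(void)
    {
        void *slots[8] = { 0 };

        if (alloc_page_array(8, slots, 5) == -ENOMEM)
            printf("allocation failed, nothing leaked\n");
        return 0;
    }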
|
Revision tags: v6.6.2, v6.5.11, v6.6.1, v6.5.10, v6.6, v6.5.9, v6.5.8, v6.5.7, v6.5.6 |
|
#
cac405a3 |
| 26-Sep-2023 |
Linus Torvalds <torvalds@linux-foundation.org> |
Merge tag 'for-6.6-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- delayed refs fixes:
    - fix race when refilling delayed refs block reserve
    - prevent transaction block reserve underflow when starting transaction
    - error message and value adjustments
- fix build warnings with CONFIG_CC_OPTIMIZE_FOR_SIZE and -Wmaybe-uninitialized
- fix for smatch report where uninitialized data from invalid extent buffer range could be returned to the caller
- fix numeric overflow in statfs when calculating lower threshold for a full filesystem
* tag 'for-6.6-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: initialize start_slot in btrfs_log_prealloc_extents
  btrfs: make sure to initialize start and len in find_free_dev_extent
  btrfs: reset destination buffer when read_extent_buffer() gets invalid range
  btrfs: properly report 0 avail for very full file systems
  btrfs: log message if extent item not found when running delayed extent op
  btrfs: remove redundant BUG_ON() from __btrfs_inc_extent_ref()
  btrfs: return -EUCLEAN for delayed tree ref with a ref count not equals to 1
  btrfs: prevent transaction block reserve underflow when starting transaction
  btrfs: fix race when refilling delayed refs block reserve
|
Revision tags: v6.5.5, v6.5.4 |
|
#
74ee7914 |
| 18-Sep-2023 |
Qu Wenruo <wqu@suse.com> |
btrfs: reset destination buffer when read_extent_buffer() gets invalid range
Commit f98b6215d7d1 ("btrfs: extent_io: do extra check for extent buffer read write functions") changed how we handle invalid extent buffer range for read_extent_buffer().
Previously if the range is invalid we just set the destination to zero, but after the patch we do nothing and error out.
This can lead to smatch static checker errors like:
fs/btrfs/print-tree.c:186 print_uuid_item() error: uninitialized symbol 'subvol_id'.
fs/btrfs/tests/extent-io-tests.c:338 check_eb_bitmap() error: uninitialized symbol 'has'.
fs/btrfs/tests/extent-io-tests.c:353 check_eb_bitmap() error: uninitialized symbol 'has'.
fs/btrfs/uuid-tree.c:203 btrfs_uuid_tree_remove() error: uninitialized symbol 'read_subid'.
fs/btrfs/uuid-tree.c:353 btrfs_uuid_tree_iterate() error: uninitialized symbol 'subid_le'.
fs/btrfs/uuid-tree.c:72 btrfs_uuid_tree_lookup() error: uninitialized symbol 'data'.
fs/btrfs/volumes.c:7415 btrfs_dev_stats_value() error: uninitialized symbol 'val'.
Fix those warnings by reverting back to the old memset() behavior. By this we keep the static checker happy and would still make a lot of noise when such invalid ranges are passed in.
Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Fixes: f98b6215d7d1 ("btrfs: extent_io: do extra check for extent buffer read write functions") Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
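A minimal sketch of the restored behaviour (illustrative, not the kernel's read_extent_buffer()): on an out-of-bounds request, zero the destination so callers never see uninitialized data, while still warning loudly about the invalid range.

    #include <stdio.h>
    #include <string.h>

    #define EB_LEN 4096

    static unsigned char eb_data[EB_LEN];

    static void read_eb(void *dst, size_t start, size_t len)
    {
        if (start > EB_LEN || len > EB_LEN - start) {
            fprintf(stderr, "WARN: invalid range start=%zu len=%zu\n",
                    start, len);
            memset(dst, 0, len);   /* keep the destination defined */
            return;
        }
        memcpy(dst, eb_data + start, len);
    }

    int main(void)
    {
        unsigned char buf[64];

        read_eb(buf, EB_LEN - 8, 64);   /* invalid: runs past the buffer */
        printf("buf[0]=%u (zeroed, not garbage)\n", buf[0]);
        return 0;
    }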
|
#
a229cf67 |
| 20-Sep-2023 |
Linus Torvalds <torvalds@linux-foundation.org> |
Merge tag 'for-6.6-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba: "A few more followup fixes to the directory listing.
People have noticed different behaviour compared to other filesystems after changes in 6.5. This is now unified to more "logical" and expected behaviour while still within POSIX. And a few more fixes for stable.
- change behaviour of readdir()/rewinddir() when new directory entries are created after opendir(), properly tracking the last entry
- fix race in readdir when multiple threads can set the last entry index for a directory
Additionally:
- use exclusive lock when direct io might need to drop privs and call notify_change()
- don't clear uptodate bit on page after an error, this may lead to a deadlock in subpage mode
- fix waiting pattern when multiple readers block on Merkle tree data, switch to folios"
* tag 'for-6.6-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: fix race between reading a directory and adding entries to it
  btrfs: refresh dir last index during a rewinddir(3) call
  btrfs: set last dir index to the current last index when opening dir
  btrfs: don't clear uptodate on write errors
  btrfs: file_remove_privs needs an exclusive lock in direct io write
  btrfs: convert btrfs_read_merkle_tree_page() to use a folio
|
Revision tags: v6.5.3 |
|
#
b595d259 |
| 08-Sep-2023 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: don't clear uptodate on write errors
We have been consistently seeing hangs with generic/648 in our subpage GitHub CI setup. This is a classic deadlock: we are calling btrfs_read_folio() on a folio, which requires holding the folio lock on the folio, and then finding an ordered extent that overlaps that range and calling btrfs_start_ordered_extent(), which then tries to write out the dirty page, which requires taking the folio lock and then we deadlock.
The hang happens because we're writing to range [1271750656, 1271767040), page index [77621, 77622], and page 77621 is !Uptodate. It is also Dirty, so we call btrfs_read_folio() for 77621 and which does btrfs_lock_and_flush_ordered_range() for that range, and we find an ordered extent which is [1271644160, 1271746560), page index [77615, 77621]. The page indexes overlap, but the actual bytes don't overlap. We're holding the page lock for 77621, then call btrfs_lock_and_flush_ordered_range() which tries to flush the dirty page, and tries to lock 77621 again and then we deadlock.
The byte ranges do not overlap, but with subpage support if we clear uptodate on any portion of the page we mark the entire thing as not uptodate.
We have been clearing page uptodate on write errors, but no other file system does this, and is in fact incorrect. This doesn't hurt us in the !subpage case because we can't end up with overlapped ranges that don't also overlap on the page.
Fix this by not clearing uptodate when we have a write error. The only thing we should be doing in this case is setting the mapping error and carrying on. This makes it so we would no longer call btrfs_read_folio() on the page as it's uptodate and eliminates the deadlock.
With this patch we're now able to make it through a full fstests run on our subpage blocksize VMs.
Note for stable backports: this probably goes beyond 6.1 but the code has been cleaned up and clearing the uptodate bit must be verified on each version independently.
CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
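A minimal sketch of the behavioural change, with illustrative structures: a failed write only records the error on the mapping and leaves the uptodate state alone, so a later read does not have to re-read (and possibly deadlock on) the page.

    #include <stdbool.h>
    #include <stdio.h>

    struct mapping    { int error; };
    struct page_state { bool uptodate; bool dirty; struct mapping *mapping; };

    static void end_page_write(struct page_state *p, int err)
    {
        if (err) {
            p->mapping->error = err;   /* record the error on the mapping     */
            /* Old buggy behaviour: p->uptodate = false;  -- removed.         */
        }
        p->dirty = false;              /* this writeback attempt is finished  */
    }

    int main(void)
    {
        struct mapping m = { 0 };
        struct page_state p = { .uptodate = true, .dirty = true, .mapping = &m };

        end_page_write(&p, -5 /* EIO */);
        printf("uptodate=%d mapping error=%d\n", p.uptodate, m.error);
        return 0;
    }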
|
#
c900529f |
| 12-Sep-2023 |
Thomas Zimmermann <tzimmermann@suse.de> |
Merge drm/drm-fixes into drm-misc-fixes
Forwarding to v6.6-rc1.
Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
|
Revision tags: v6.5.2, v6.1.51, v6.5.1 |
|
#
1ac731c5 |
| 30-Aug-2023 |
Dmitry Torokhov <dmitry.torokhov@gmail.com> |
Merge branch 'next' into for-linus
Prepare input updates for 6.6 merge window.
|
Revision tags: v6.1.50 |
|
#
547635c6 |
| 28-Aug-2023 |
Linus Torvalds <torvalds@linux-foundation.org> |
Merge tag 'for-6.6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba: "No new features, the bulk of the changes are fixes, refactoring and cleanups. The notable fix is the scrub performance restoration after rewrite in 6.4, though still only partial.
Fixes:
- scrub performance drop due to rewrite in 6.4 partially restored:
    - do IO grouping by blk_plug/blk_unplug again
    - avoid unnecessary tree searches when processing stripes, in extent and checksum trees
    - the drop is noticeable on fast PCIe devices, -66% and restored to -33% of the original
    - backports to 6.4 planned
- handle more corner cases of transaction commit during orphan cleanup or delayed ref processing
- use correct fsid/metadata_uuid when validating super block
- copy directory permissions and time when creating a stub subvolume
Core:
- debugging feature integrity checker deprecated, to be removed in 6.7
- in zoned mode, zones are activated just before the write, making error handling easier, now the overcommit mechanism can be enabled again which improves performance by avoiding more frequent flushing
- v0 extent handling completely removed, deprecated long time ago
- error handling improvements
- tests:
    - extent buffer bitmap tests
    - pinned extent splitting tests
- cleanups and refactoring:
    - compression writeback
    - extent buffer bitmap
    - space flushing, ENOSPC handling"
* tag 'for-6.6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (110 commits)
  btrfs: zoned: skip splitting and logical rewriting on pre-alloc write
  btrfs: tests: test invalid splitting when skipping pinned drop extent_map
  btrfs: tests: add a test for btrfs_add_extent_mapping
  btrfs: tests: add extent_map tests for dropping with odd layouts
  btrfs: scrub: move write back of repaired sectors to scrub_stripe_read_repair_worker()
  btrfs: scrub: don't go ordered workqueue for dev-replace
  btrfs: scrub: fix grouping of read IO
  btrfs: scrub: avoid unnecessary csum tree search preparing stripes
  btrfs: scrub: avoid unnecessary extent tree search preparing stripes
  btrfs: copy dir permission and time when creating a stub subvolume
  btrfs: remove pointless empty list check when reading delayed dir indexes
  btrfs: drop redundant check to use fs_devices::metadata_uuid
  btrfs: compare the correct fsid/metadata_uuid in btrfs_validate_super
  btrfs: use the correct superblock to compare fsid in btrfs_validate_super
  btrfs: simplify memcpy either of metadata_uuid or fsid
  btrfs: add a helper to read the superblock metadata_uuid
  btrfs: remove v0 extent handling
  btrfs: output extra debug info if we failed to find an inline backref
  btrfs: move the !zoned assert into run_delalloc_cow
  btrfs: consolidate the error handling in run_delalloc_nocow
  ...
|