Revision tags: v6.6.25, v6.6.24, v6.6.23 |
|
#
eb344109 |
| 23-Feb-2024 |
Filipe Manana <fdmanana@suse.com> |
btrfs: fix double free of anonymous device after snapshot creation failure
commit e2b54eaf28df0c978626c9736b94f003b523b451 upstream.
When creating a snapshot we may do a double free of an anonymous
btrfs: fix double free of anonymous device after snapshot creation failure
commit e2b54eaf28df0c978626c9736b94f003b523b451 upstream.
When creating a snapshot we may do a double free of an anonymous device in case there's an error committing the transaction. The second free may result in freeing an anonymous device number that was allocated by some other subsystem in the kernel or another btrfs filesystem.
The steps that lead to this:
1) At ioctl.c:create_snapshot() we allocate an anonymous device number and assign it to pending_snapshot->anon_dev;
2) Then we call btrfs_commit_transaction() and end up at transaction.c:create_pending_snapshot();
3) There we call btrfs_get_new_fs_root() and pass it the anonymous device number stored in pending_snapshot->anon_dev;
4) btrfs_get_new_fs_root() frees that anonymous device number because btrfs_lookup_fs_root() returned a root - someone else did a lookup of the new root already, which could some task doing backref walking;
5) After that some error happens in the transaction commit path, and at ioctl.c:create_snapshot() we jump to the 'fail' label, and after that we free again the same anonymous device number, which in the meanwhile may have been reallocated somewhere else, because pending_snapshot->anon_dev still has the same value as in step 1.
Recently syzbot ran into this and reported the following trace:
------------[ cut here ]------------ ida_free called for id=51 which is not allocated. WARNING: CPU: 1 PID: 31038 at lib/idr.c:525 ida_free+0x370/0x420 lib/idr.c:525 Modules linked in: CPU: 1 PID: 31038 Comm: syz-executor.2 Not tainted 6.8.0-rc4-syzkaller-00410-gc02197fc9076 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/25/2024 RIP: 0010:ida_free+0x370/0x420 lib/idr.c:525 Code: 10 42 80 3c 28 (...) RSP: 0018:ffffc90015a67300 EFLAGS: 00010246 RAX: be5130472f5dd000 RBX: 0000000000000033 RCX: 0000000000040000 RDX: ffffc90009a7a000 RSI: 000000000003ffff RDI: 0000000000040000 RBP: ffffc90015a673f0 R08: ffffffff81577992 R09: 1ffff92002b4cdb4 R10: dffffc0000000000 R11: fffff52002b4cdb5 R12: 0000000000000246 R13: dffffc0000000000 R14: ffffffff8e256b80 R15: 0000000000000246 FS: 00007fca3f4b46c0(0000) GS:ffff8880b9500000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f167a17b978 CR3: 000000001ed26000 CR4: 0000000000350ef0 Call Trace: <TASK> btrfs_get_root_ref+0xa48/0xaf0 fs/btrfs/disk-io.c:1346 create_pending_snapshot+0xff2/0x2bc0 fs/btrfs/transaction.c:1837 create_pending_snapshots+0x195/0x1d0 fs/btrfs/transaction.c:1931 btrfs_commit_transaction+0xf1c/0x3740 fs/btrfs/transaction.c:2404 create_snapshot+0x507/0x880 fs/btrfs/ioctl.c:848 btrfs_mksubvol+0x5d0/0x750 fs/btrfs/ioctl.c:998 btrfs_mksnapshot+0xb5/0xf0 fs/btrfs/ioctl.c:1044 __btrfs_ioctl_snap_create+0x387/0x4b0 fs/btrfs/ioctl.c:1306 btrfs_ioctl_snap_create_v2+0x1ca/0x400 fs/btrfs/ioctl.c:1393 btrfs_ioctl+0xa74/0xd40 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:871 [inline] __se_sys_ioctl+0xfe/0x170 fs/ioctl.c:857 do_syscall_64+0xfb/0x240 entry_SYSCALL_64_after_hwframe+0x6f/0x77 RIP: 0033:0x7fca3e67dda9 Code: 28 00 00 00 (...) RSP: 002b:00007fca3f4b40c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 00007fca3e7abf80 RCX: 00007fca3e67dda9 RDX: 00000000200005c0 RSI: 0000000050009417 RDI: 0000000000000003 RBP: 00007fca3e6ca47a R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 000000000000000b R14: 00007fca3e7abf80 R15: 00007fff6bf95658 </TASK>
Where we get an explicit message where we attempt to free an anonymous device number that is not currently allocated. It happens in a different code path from the example below, at btrfs_get_root_ref(), so this change may not fix the case triggered by syzbot.
To fix at least the code path from the example above, change btrfs_get_root_ref() and its callers to receive a dev_t pointer argument for the anonymous device number, so that in case it frees the number, it also resets it to 0, so that up in the call chain we don't attempt to do the double free.
CC: stable@vger.kernel.org # 5.10+ Link: https://lore.kernel.org/linux-btrfs/000000000000f673a1061202f630@google.com/ Fixes: e03ee2fe873e ("btrfs: do not ASSERT() if the newly created subvolume already got read") Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
show more ...
|
Revision tags: v6.6.16, v6.6.15, v6.6.14, v6.6.13 |
|
#
83377565 |
| 20-Jan-2024 |
Qu Wenruo <wqu@suse.com> |
btrfs: do not ASSERT() if the newly created subvolume already got read
commit e03ee2fe873eb68c1f9ba5112fee70303ebf9dfb upstream.
[BUG] There is a syzbot crash, triggered by the ASSERT() during subv
btrfs: do not ASSERT() if the newly created subvolume already got read
commit e03ee2fe873eb68c1f9ba5112fee70303ebf9dfb upstream.
[BUG] There is a syzbot crash, triggered by the ASSERT() during subvolume creation:
assertion failed: !anon_dev, in fs/btrfs/disk-io.c:1319 ------------[ cut here ]------------ kernel BUG at fs/btrfs/disk-io.c:1319! invalid opcode: 0000 [#1] PREEMPT SMP KASAN RIP: 0010:btrfs_get_root_ref.part.0+0x9aa/0xa60 <TASK> btrfs_get_new_fs_root+0xd3/0xf0 create_subvol+0xd02/0x1650 btrfs_mksubvol+0xe95/0x12b0 __btrfs_ioctl_snap_create+0x2f9/0x4f0 btrfs_ioctl_snap_create+0x16b/0x200 btrfs_ioctl+0x35f0/0x5cf0 __x64_sys_ioctl+0x19d/0x210 do_syscall_64+0x3f/0xe0 entry_SYSCALL_64_after_hwframe+0x63/0x6b ---[ end trace 0000000000000000 ]---
[CAUSE] During create_subvol(), after inserting root item for the newly created subvolume, we would trigger btrfs_get_new_fs_root() to get the btrfs_root of that subvolume.
The idea here is, we have preallocated an anonymous device number for the subvolume, thus we can assign it to the new subvolume.
But there is really nothing preventing things like backref walk to read the new subvolume. If that happens before we call btrfs_get_new_fs_root(), the subvolume would be read out, with a new anonymous device number assigned already.
In that case, we would trigger ASSERT(), as we really expect no one to read out that subvolume (which is not yet accessible from the fs). But things like backref walk is still possible to trigger the read on the subvolume.
Thus our assumption on the ASSERT() is not correct in the first place.
[FIX] Fix it by removing the ASSERT(), and just free the @anon_dev, reset it to 0, and continue.
If the subvolume tree is read out by something else, it should have already get a new anon_dev assigned thus we only need to free the preallocated one.
Reported-by: Chenyuan Yang <chenyuan0y@gmail.com> Fixes: 2dfb1e43f57d ("btrfs: preallocate anon block device at first phase of snapshot creation") CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
show more ...
|
Revision tags: v6.6.12, v6.6.11, v6.6.10, v6.6.9, v6.6.8, v6.6.7, v6.6.6, v6.6.5, v6.6.4 |
|
#
624bc6f6 |
| 01-Dec-2023 |
Boris Burkov <boris@bur.io> |
btrfs: free qgroup pertrans reserve on transaction abort
[ Upstream commit b321a52cce062ec7ed385333a33905d22159ce36 ]
If we abort a transaction, we never run the code that frees the pertrans qgroup
btrfs: free qgroup pertrans reserve on transaction abort
[ Upstream commit b321a52cce062ec7ed385333a33905d22159ce36 ]
If we abort a transaction, we never run the code that frees the pertrans qgroup reservation. This results in warnings on unmount as that reservation has been leaked. The leak isn't a huge issue since the fs is read-only, but it's better to clean it up when we know we can/should. Do it during the cleanup_transaction step of aborting.
CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
show more ...
|
Revision tags: v6.6.3, v6.6.2, v6.5.11, v6.6.1, v6.5.10 |
|
#
9f7894e2 |
| 01-Nov-2023 |
Qu Wenruo <wqu@suse.com> |
btrfs: add dmesg output for first mount and last unmount of a filesystem
commit 2db313205f8b96eea467691917138d646bb50aef upstream.
There is a feature request to add dmesg output when unmounting a b
btrfs: add dmesg output for first mount and last unmount of a filesystem
commit 2db313205f8b96eea467691917138d646bb50aef upstream.
There is a feature request to add dmesg output when unmounting a btrfs. There are several alternative methods to do the same thing, but with their own problems:
- Use eBPF to watch btrfs_put_super()/open_ctree() Not end user friendly, they have to dip their head into the source code.
- Watch for directory /sys/fs/<uuid>/ This is way more simple, but still requires some simple device -> uuid lookups. And a script needs to use inotify to watch /sys/fs/.
Compared to all these, directly outputting the information into dmesg would be the most simple one, with both device and UUID included.
And since we're here, also add the output when mounting a filesystem for the first time for parity. A more fine grained monitoring of subvolume mounts should be done by another layer, like audit.
Now mounting a btrfs with all default mkfs options would look like this:
[81.906566] BTRFS info (device dm-8): first mount of filesystem 633b5c16-afe3-4b79-b195-138fe145e4f2 [81.907494] BTRFS info (device dm-8): using crc32c (crc32c-intel) checksum algorithm [81.908258] BTRFS info (device dm-8): using free space tree [81.912644] BTRFS info (device dm-8): auto enabling async discard [81.913277] BTRFS info (device dm-8): checking UUID tree [91.668256] BTRFS info (device dm-8): last unmount of filesystem 633b5c16-afe3-4b79-b195-138fe145e4f2
CC: stable@vger.kernel.org # 5.4+ Link: https://github.com/kdave/btrfs-progs/issues/689 Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
show more ...
|
Revision tags: v6.6, v6.5.9, v6.5.8, v6.5.7, v6.5.6, v6.5.5, v6.5.4, v6.5.3 |
|
#
d5e09e38 |
| 12-Sep-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: abort transaction on generation mismatch when marking eb as dirty
[ Upstream commit 50564b651d01c19ce732819c5b3c3fd60707188e ]
When marking an extent buffer as dirty, at btrfs_mark_buffer_di
btrfs: abort transaction on generation mismatch when marking eb as dirty
[ Upstream commit 50564b651d01c19ce732819c5b3c3fd60707188e ]
When marking an extent buffer as dirty, at btrfs_mark_buffer_dirty(), we check if its generation matches the running transaction and if not we just print a warning. Such mismatch is an indicator that something really went wrong and only printing a warning message (and stack trace) is not enough to prevent a corruption. Allowing a transaction to commit with such an extent buffer will trigger an error if we ever try to read it from disk due to a generation mismatch with its parent generation.
So abort the current transaction with -EUCLEAN if we notice a generation mismatch. For this we need to pass a transaction handle to btrfs_mark_buffer_dirty() which is always available except in test code, in which case we can pass NULL since it operates on dummy extent buffers and all test roots have a single node/leaf (root node at level 0).
Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
show more ...
|
Revision tags: v6.5.2, v6.1.51, v6.5.1, v6.1.50, v6.5, v6.1.49, v6.1.48 |
|
#
5e0e8799 |
| 22-Aug-2023 |
Qu Wenruo <wqu@suse.com> |
btrfs: fix a compilation error if DEBUG is defined in btree_dirty_folio
[BUG] After commit 72a69cd03082 ("btrfs: subpage: pack all subpage bitmaps into a larger bitmap"), the DEBUG section of btree_
btrfs: fix a compilation error if DEBUG is defined in btree_dirty_folio
[BUG] After commit 72a69cd03082 ("btrfs: subpage: pack all subpage bitmaps into a larger bitmap"), the DEBUG section of btree_dirty_folio() would no longer compile.
[CAUSE] If DEBUG is defined, we would do extra checks for btree_dirty_folio(), mostly to make sure the range we marked dirty has an extent buffer and that extent buffer is dirty.
For subpage, we need to iterate through all the extent buffers covered by that page range, and make sure they all matches the criteria.
However commit 72a69cd03082 ("btrfs: subpage: pack all subpage bitmaps into a larger bitmap") changes how we store the bitmap, we pack all the 16 bits bitmaps into a larger bitmap, which would save some space.
This means we no longer have btrfs_subpage::dirty_bitmap, instead the dirty bitmap is starting at btrfs_subpage_info::dirty_offset, and has a length of btrfs_subpage_info::bitmap_nr_bits.
[FIX] Although I'm not sure if it still makes sense to maintain such code, at least let it compile.
This patch would let us test the bits one by one through the bitmaps.
CC: stable@vger.kernel.org # 6.1+ Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
77d20c68 |
| 24-Aug-2023 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: do not block starts waiting on previous transaction commit
Internally I got a report of very long stalls on normal operations like creating a new file when auto relocation was running. The r
btrfs: do not block starts waiting on previous transaction commit
Internally I got a report of very long stalls on normal operations like creating a new file when auto relocation was running. The reporter used the 'bpf offcputime' tracer to show that we would get stuck in start_transaction for 5 to 30 seconds, and were always being woken up by the transaction commit.
Using my timing-everything script, which times how long a function takes and what percentage of that total time is taken up by its children, I saw several traces like this
1083 took 32812902424 ns 29929002926 ns 91.2110% wait_for_commit_duration 25568 ns 7.7920e-05% commit_fs_roots_duration 1007751 ns 0.00307% commit_cowonly_roots_duration 446855602 ns 1.36182% btrfs_run_delayed_refs_duration 271980 ns 0.00082% btrfs_run_delayed_items_duration 2008 ns 6.1195e-06% btrfs_apply_pending_changes_duration 9656 ns 2.9427e-05% switch_commit_roots_duration 1598 ns 4.8700e-06% btrfs_commit_device_sizes_duration 4314 ns 1.3147e-05% btrfs_free_log_root_tree_duration
Here I was only tracing functions that happen where we are between START_COMMIT and UNBLOCKED in order to see what would be keeping us blocked for so long. The wait_for_commit() we do is where we wait for a previous transaction that hasn't completed it's commit. This can include all of the unpin work and other cleanups, which tends to be the longest part of our transaction commit.
There is no reason we should be blocking new things from entering the transaction at this point, it just adds to random latency spikes for no reason.
Fix this by adding a PREP stage. This allows us to properly deal with multiple committers coming in at the same time, we retain the behavior that the winner waits on the previous transaction and the losers all wait for this transaction commit to occur. Nothing else is blocked during the PREP stage, and then once the wait is complete we switch to COMMIT_START and all of the same behavior as before is maintained.
Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
Revision tags: v6.1.46, v6.1.45, v6.1.44, v6.1.43 |
|
#
67bc5ad0 |
| 31-Jul-2023 |
Anand Jain <anand.jain@oracle.com> |
btrfs: drop redundant check to use fs_devices::metadata_uuid
fs_devices::metadata_uuid value is already updated based on the super_block::METADATA_UUID flag for either fsid or metadata_uuid as appro
btrfs: drop redundant check to use fs_devices::metadata_uuid
fs_devices::metadata_uuid value is already updated based on the super_block::METADATA_UUID flag for either fsid or metadata_uuid as appropriate. So, fs_devices::metadata_uuid can be used directly.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
6bfe3959 |
| 31-Jul-2023 |
Anand Jain <anand.jain@oracle.com> |
btrfs: compare the correct fsid/metadata_uuid in btrfs_validate_super
The function btrfs_validate_super() should verify the metadata_uuid in the provided superblock argument. Because, all its caller
btrfs: compare the correct fsid/metadata_uuid in btrfs_validate_super
The function btrfs_validate_super() should verify the metadata_uuid in the provided superblock argument. Because, all its callers expect it to do that.
Such as in the following stacks:
write_all_supers() sb = fs_info->super_for_commit; btrfs_validate_write_super(.., sb) btrfs_validate_super(.., sb, ..)
scrub_one_super() btrfs_validate_super(.., sb, ..)
And check_dev_super() btrfs_validate_super(.., sb, ..)
However, it currently verifies the fs_info::super_copy::metadata_uuid instead. Fix this using the correct metadata_uuid in the superblock argument.
CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
d167aa76 |
| 31-Jul-2023 |
Anand Jain <anand.jain@oracle.com> |
btrfs: use the correct superblock to compare fsid in btrfs_validate_super
The function btrfs_validate_super() should verify the fsid in the provided superblock argument. Because, all its callers exp
btrfs: use the correct superblock to compare fsid in btrfs_validate_super
The function btrfs_validate_super() should verify the fsid in the provided superblock argument. Because, all its callers expect it to do that.
Such as in the following stack:
write_all_supers() sb = fs_info->super_for_commit; btrfs_validate_write_super(.., sb) btrfs_validate_super(.., sb, ..)
scrub_one_super() btrfs_validate_super(.., sb, ..)
And check_dev_super() btrfs_validate_super(.., sb, ..)
However, it currently verifies the fs_info::super_copy::fsid instead, which is not correct. Fix this using the correct fsid in the superblock argument.
CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Tested-by: Guilherme G. Piccoli <gpiccoli@igalia.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
84af994b |
| 09-Aug-2023 |
Ruan Jinjie <ruanjinjie@huawei.com> |
btrfs: use LIST_HEAD() to initialize the list_head
Use LIST_HEAD() to initialize the list_head instead of open-coding it.
Signed-off-by: Ruan Jinjie <ruanjinjie@huawei.com> Reviewed-by: David Sterb
btrfs: use LIST_HEAD() to initialize the list_head
Use LIST_HEAD() to initialize the list_head instead of open-coding it.
Signed-off-by: Ruan Jinjie <ruanjinjie@huawei.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
a7e1ac7b |
| 07-Aug-2023 |
Naohiro Aota <naohiro.aota@wdc.com> |
btrfs: zoned: reserve zones for an active metadata/system block group
Ensure a metadata and system block group can be activated on write time, by leaving a certain number of active zones when trying
btrfs: zoned: reserve zones for an active metadata/system block group
Ensure a metadata and system block group can be activated on write time, by leaving a certain number of active zones when trying to activate a data block group.
Zones for two metadata block groups (normal and tree-log) and one system block group are reserved, according to the profile type: two zones per block group on the DUP profile and one zone per block group otherwise.
The reservation must be freed once a non-data block group is allocated. If not, we over-reserve the active zones and data block group activation will suffer. For the dynamic reservation count, we need to manage the reservation count per device.
The reservation count variable is protected by fs_info->zone_active_bgs_lock.
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
Revision tags: v6.1.42 |
|
#
504b1596 |
| 26-Jul-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: make btrfs_cleanup_fs_roots() static
btrfs_cleanup_fs_roots() is not used outside disk-io.c, so make it static, remove its prototype from disk-io.h and move its definition above the where it'
btrfs: make btrfs_cleanup_fs_roots() static
btrfs_cleanup_fs_roots() is not used outside disk-io.c, so make it static, remove its prototype from disk-io.h and move its definition above the where it's used in disk-io.c
Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
ae3364e5 |
| 26-Jul-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: store the error that turned the fs into error state
Currently when we turn the fs into an error state, typically after a transaction abort, we don't store the error anywhere, we just set a bi
btrfs: store the error that turned the fs into error state
Currently when we turn the fs into an error state, typically after a transaction abort, we don't store the error anywhere, we just set a bit (BTRFS_FS_STATE_ERROR) at struct btrfs_fs_info::fs_state to signal the error state.
There are cases where it would be useful to have access to the specific error in order to provide a more meaningful error to users/applications. This change adds a member to struct btrfs_fs_info to store the error and removes the BTRFS_FS_STATE_ERROR bit. When there's no error, the new member (fs_error) has a value of 0, otherwise its value is a negative errno value.
Followup changes will make use of this new member.
Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
Revision tags: v6.1.41, v6.1.40, v6.1.39, v6.1.38, v6.1.37 |
|
#
e5860f82 |
| 30-Jun-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: make find_first_extent_bit() return a boolean
Currently find_first_extent_bit() returns a 0 if it found a range in the given io tree and 1 if it didn't find any. There's no need to return any
btrfs: make find_first_extent_bit() return a boolean
Currently find_first_extent_bit() returns a 0 if it found a range in the given io tree and 1 if it didn't find any. There's no need to return any errors, so make the return value a boolean and invert the logic to make more sense: return true if it found a range and false if it didn't find any range.
Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
46d81ebd |
| 30-Jun-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: make btrfs_destroy_pinned_extent() return void
Currently btrfs_destroy_pinned_extent() is always returning 0 no matter what and its caller ignores its return value (as well everything up in t
btrfs: make btrfs_destroy_pinned_extent() return void
Currently btrfs_destroy_pinned_extent() is always returning 0 no matter what and its caller ignores its return value (as well everything up in the call chain). This is because this is called in the transaction abort path, where we can't even deal with any errors since we are in a critical situation already and cleanup of resources is done in a best effort fashion.
So make btrfs_destroy_pinned_extent() return void.
Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
aec5716c |
| 30-Jun-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: make btrfs_destroy_marked_extents() return void
Currently btrfs_destroy_marked_extents() is returning the value of the last call to find_first_extent_bit(), which returns a value of 1 meaning
btrfs: make btrfs_destroy_marked_extents() return void
Currently btrfs_destroy_marked_extents() is returning the value of the last call to find_first_extent_bit(), which returns a value of 1 meaning no more ranges found the dirty pages io tree. This value is useless to the single caller of btrfs_destroy_marked_extents(), which ignores any return value from btrfs_destroy_marked_extents(). This is because it's only used in the transaction abort path, where we can't even deal with any errors since we are in a critical situation already and cleanup of resources is done in a best effort fashion.
So make btrfs_destroy_marked_extents() return void.
Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
6ebcd021 |
| 03-Aug-2023 |
Qu Wenruo <wqu@suse.com> |
btrfs: reject invalid reloc tree root keys with stack dump
[BUG] Syzbot reported a crash that an ASSERT() got triggered inside prepare_to_merge().
That ASSERT() makes sure the reloc tree is properl
btrfs: reject invalid reloc tree root keys with stack dump
[BUG] Syzbot reported a crash that an ASSERT() got triggered inside prepare_to_merge().
That ASSERT() makes sure the reloc tree is properly pointed back by its subvolume tree.
[CAUSE] After more debugging output, it turns out we had an invalid reloc tree:
BTRFS error (device loop1): reloc tree mismatch, root 8 has no reloc root, expect reloc root key (-8, 132, 8) gen 17
Note the above root key is (TREE_RELOC_OBJECTID, ROOT_ITEM, QUOTA_TREE_OBJECTID), meaning it's a reloc tree for quota tree.
But reloc trees can only exist for subvolumes, as for non-subvolume trees, we just COW the involved tree block, no need to create a reloc tree since those tree blocks won't be shared with other trees.
Only subvolumes tree can share tree blocks with other trees (thus they have BTRFS_ROOT_SHAREABLE flag).
Thus this new debug output proves my previous assumption that corrupted on-disk data can trigger that ASSERT().
[FIX] Besides the dedicated fix and the graceful exit, also let tree-checker to check such root keys, to make sure reloc trees can only exist for subvolumes.
CC: stable@vger.kernel.org # 5.15+ Reported-by: syzbot+ae97a827ae1c3336bbb4@syzkaller.appspotmail.com Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
773e722a |
| 03-Aug-2023 |
Qu Wenruo <wqu@suse.com> |
btrfs: avoid race between qgroup tree creation and relocation
[BUG] Syzbot reported a weird ASSERT() triggered inside prepare_to_merge().
assertion failed: root->reloc_root == reloc_root, in fs/b
btrfs: avoid race between qgroup tree creation and relocation
[BUG] Syzbot reported a weird ASSERT() triggered inside prepare_to_merge().
assertion failed: root->reloc_root == reloc_root, in fs/btrfs/relocation.c:1919 ------------[ cut here ]------------ kernel BUG at fs/btrfs/relocation.c:1919! invalid opcode: 0000 [#1] PREEMPT SMP KASAN CPU: 0 PID: 9904 Comm: syz-executor.3 Not tainted 6.4.0-syzkaller-08881-g533925cb7604 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/27/2023 RIP: 0010:prepare_to_merge+0xbb2/0xc40 fs/btrfs/relocation.c:1919 Code: fe e9 f5 (...) RSP: 0018:ffffc9000325f760 EFLAGS: 00010246 RAX: 000000000000004f RBX: ffff888075644030 RCX: 1481ccc522da5800 RDX: ffffc90005c09000 RSI: 00000000000364ca RDI: 00000000000364cb RBP: ffffc9000325f870 R08: ffffffff816f33ac R09: 1ffff9200064bea0 R10: dffffc0000000000 R11: fffff5200064bea1 R12: ffff888075644000 R13: ffff88803b166000 R14: ffff88803b166560 R15: ffff88803b166558 FS: 00007f4e305fd700(0000) GS:ffff8880b9800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000056080679c000 CR3: 00000000193ad000 CR4: 00000000003506f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> relocate_block_group+0xa5d/0xcd0 fs/btrfs/relocation.c:3749 btrfs_relocate_block_group+0x7ab/0xd70 fs/btrfs/relocation.c:4087 btrfs_relocate_chunk+0x12c/0x3b0 fs/btrfs/volumes.c:3283 __btrfs_balance+0x1b06/0x2690 fs/btrfs/volumes.c:4018 btrfs_balance+0xbdb/0x1120 fs/btrfs/volumes.c:4402 btrfs_ioctl_balance+0x496/0x7c0 fs/btrfs/ioctl.c:3604 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:870 [inline] __se_sys_ioctl+0xf8/0x170 fs/ioctl.c:856 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd RIP: 0033:0x7f4e2f88c389
[CAUSE] With extra debugging, the offending reloc_root is for quota tree (rootid 8).
Normally we should not use the reloc tree for quota root at all, as reloc trees are only for subvolume trees.
But there is a race between quota enabling and relocation, this happens after commit 85724171b302 ("btrfs: fix the btrfs_get_global_root return value").
Before that commit, for quota and free space tree, we exit immediately if we cannot grab it from fs_info.
But now we would try to read it from disk, just as if they are fs trees, this sets ROOT_SHAREABLE flags in such race:
Thread A | Thread B ---------------------------------+------------------------------ btrfs_quota_enable() | | | btrfs_get_root_ref() | | |- btrfs_get_global_root() | | | Returned NULL | | |- btrfs_lookup_fs_root() | | | Returned NULL |- btrfs_create_tree() | | | Now quota root item is | | | inserted | |- btrfs_read_tree_root() | | | Got the newly inserted quota root | | |- btrfs_init_fs_root() | | | Set ROOT_SHAREABLE flag
[FIX] Get back to the old behavior by returning PTR_ERR(-ENOENT) if the target objectid is not a subvolume tree or data reloc tree.
Reported-and-tested-by: syzbot+ae97a827ae1c3336bbb4@syzkaller.appspotmail.com Fixes: 85724171b302 ("btrfs: fix the btrfs_get_global_root return value") Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
95ca6599 |
| 29-Jun-2023 |
Naohiro Aota <naohiro.aota@wdc.com> |
btrfs: zoned: do not enable async discard
The zoned mode need to reset a zone before using it. We rely on btrfs's original discard functionality (discarding unused block group range) to do the reset
btrfs: zoned: do not enable async discard
The zoned mode need to reset a zone before using it. We rely on btrfs's original discard functionality (discarding unused block group range) to do the resetting.
While the commit 63a7cb130718 ("btrfs: auto enable discard=async when possible") made the discard done in an async manner, a zoned reset do not need to be async, as it is fast enough.
Even worth, delaying zone rests prevents using those zones again. So, let's disable async discard on the zoned mode.
Fixes: 63a7cb130718 ("btrfs: auto enable discard=async when possible") CC: stable@vger.kernel.org # 6.3+ Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update message text ] Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
Revision tags: v6.1.36, v6.4, v6.1.35, v6.1.34, v6.1.33, v6.1.32, v6.1.31, v6.1.30, v6.1.29, v6.1.28, v6.1.27, v6.1.26, v6.3, v6.1.25, v6.1.24, v6.1.23 |
|
#
efcfcbc6 |
| 03-Apr-2023 |
David Sterba <dsterba@suse.com> |
btrfs: add xxhash to fast checksum implementations
The implementation of XXHASH is now CPU only but still fast enough to be considered for the synchronous checksumming, like non-generic crc32c.
A u
btrfs: add xxhash to fast checksum implementations
The implementation of XXHASH is now CPU only but still fast enough to be considered for the synchronous checksumming, like non-generic crc32c.
A userspace benchmark comparing it to various implementations (patched hash-speedtest from btrfs-progs):
Block size: 4096 Iterations: 1000000 Implementation: builtin Units: CPU cycles
NULL-NOP: cycles: 73384294, cycles/i 73 NULL-MEMCPY: cycles: 228033868, cycles/i 228, 61664.320 MiB/s CRC32C-ref: cycles: 24758559416, cycles/i 24758, 567.950 MiB/s CRC32C-NI: cycles: 1194350470, cycles/i 1194, 11773.433 MiB/s CRC32C-ADLERSW: cycles: 6150186216, cycles/i 6150, 2286.372 MiB/s CRC32C-ADLERHW: cycles: 626979180, cycles/i 626, 22427.453 MiB/s CRC32C-PCL: cycles: 466746732, cycles/i 466, 30126.699 MiB/s XXHASH: cycles: 860656400, cycles/i 860, 16338.188 MiB/s
Comparing purely software implementation (ref), current outdated accelerated using crc32q instruction (NI), optimized implementations by M. Adler (https://stackoverflow.com/questions/17645167/implementing-sse-4-2s-crc32c-in-software/17646775#17646775) and the best one that was taken from kernel using the PCLMULQDQ instruction (PCL).
Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
99f09ce3 |
| 02-Jun-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: make btrfs_destroy_delayed_refs() return void
btrfs_destroy_delayed_refs() always returns 0 and its single caller does not check its return value, as it also returns void, and so does the cal
btrfs: make btrfs_destroy_delayed_refs() return void
btrfs_destroy_delayed_refs() always returns 0 and its single caller does not check its return value, as it also returns void, and so does the callers' caller and so on. This is because we are in the transaction abort path, where we have no way to deal with errors (we are in a critical situation) and all cleanup of resources works in a best effort fashion. So make btrfs_destroy_delayed_refs() return void.
Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
184533e3 |
| 29-May-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: remove unnecessary prototype declarations at disk-io.c
We have a few static functions at disk-io.c for which we have a forward declaration of their prototype, but it's not needed because all
btrfs: remove unnecessary prototype declarations at disk-io.c
We have a few static functions at disk-io.c for which we have a forward declaration of their prototype, but it's not needed because all those functions are defined before they are called, so remove them.
Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
4d34ad34 |
| 29-May-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: remove pointless in_tree field from struct btrfs_delayed_ref_node
The 'in_tree' field is really not needed in struct btrfs_delayed_ref_node, as we can check whether a reference is in the tree
btrfs: remove pointless in_tree field from struct btrfs_delayed_ref_node
The 'in_tree' field is really not needed in struct btrfs_delayed_ref_node, as we can check whether a reference is in the tree or not simply by checking its red black tree node member with RB_EMPTY_NODE(), as when we remove it from the tree we always call RB_CLEAR_NODE(). So remove that field and use RB_EMPTY_NODE().
Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
58e814fc |
| 25-May-2023 |
Tejun Heo <tj@kernel.org> |
btrfs: use alloc_ordered_workqueue() to create ordered workqueues
BACKGROUND ==========
When multiple work items are queued to a workqueue, their execution order doesn't match the queueing order. T
btrfs: use alloc_ordered_workqueue() to create ordered workqueues
BACKGROUND ==========
When multiple work items are queued to a workqueue, their execution order doesn't match the queueing order. They may get executed in any order and simultaneously. When fully serialized execution - one by one in the queueing order - is needed, an ordered workqueue should be used which can be created with alloc_ordered_workqueue().
However, alloc_ordered_workqueue() was a later addition. Before it, an ordered workqueue could be obtained by creating an UNBOUND workqueue with @max_active==1. This originally was an implementation side-effect which was broken by 4c16bd327c74 ("workqueue: restore WQ_UNBOUND/max_active==1 to be ordered"). Because there were users that depended on the ordered execution, 5c0338c68706 ("workqueue: restore WQ_UNBOUND/max_active==1 to be ordered") made workqueue allocation path to implicitly promote UNBOUND workqueues w/ @max_active==1 to ordered workqueues.
While this has worked okay, overloading the UNBOUND allocation interface this way creates other issues. It's difficult to tell whether a given workqueue actually needs to be ordered and users that legitimately want a min concurrency level wq unexpectedly gets an ordered one instead. With planned UNBOUND workqueue updates to improve execution locality and more prevalence of chiplet designs which can benefit from such improvements, this isn't a state we wanna be in forever.
This patch series audits all call sites that create an UNBOUND workqueue w/ @max_active==1 and converts them to alloc_ordered_workqueue() as necessary.
BTRFS =====
* fs_info->scrub_workers initialized in scrub_workers_get() was setting @max_active to 1 when @is_dev_replace is set and it seems that the workqueue actually needs to be ordered if @is_dev_replace. Update the code so that alloc_ordered_workqueue() is used if @is_dev_replace.
* fs_info->discard_ctl.discard_workers initialized in btrfs_init_workqueues() was directly using alloc_workqueue() w/ @max_active==1. Converted to alloc_ordered_workqueue().
* fs_info->fixup_workers and fs_info->qgroup_rescan_workers initialized in btrfs_queue_work() use the btrfs's workqueue wrapper, btrfs_workqueue, which are allocated with btrfs_alloc_workqueue().
btrfs_workqueue implements automatic @max_active adjustment which is disabled when the specified max limit is below a certain threshold, so calling btrfs_alloc_workqueue() with @limit_active==1 yields an ordered workqueue whose @max_active won't be changed as the auto-tuning is disabled.
This is rather brittle in that nothing clearly indicates that the two workqueues should be ordered or btrfs_alloc_workqueue() must disable auto-tuning when @limit_active==1.
This patch factors out the common btrfs_workqueue init code into btrfs_init_workqueue() and add explicit btrfs_alloc_ordered_workqueue(). The two workqueues are converted to use the new ordered allocation interface.
Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|