Revision tags: v6.6.25, v6.6.24, v6.6.23, v6.6.16, v6.6.15, v6.6.14, v6.6.13, v6.6.12, v6.6.11, v6.6.10, v6.6.9, v6.6.8, v6.6.7, v6.6.6, v6.6.5, v6.6.4, v6.6.3, v6.6.2, v6.5.11, v6.6.1, v6.5.10, v6.6, v6.5.9, v6.5.8, v6.5.7, v6.5.6, v6.5.5, v6.5.4, v6.5.3 |
|
#
d5e09e38 |
| 12-Sep-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: abort transaction on generation mismatch when marking eb as dirty
[ Upstream commit 50564b651d01c19ce732819c5b3c3fd60707188e ]
When marking an extent buffer as dirty, at btrfs_mark_buffer_di
btrfs: abort transaction on generation mismatch when marking eb as dirty
[ Upstream commit 50564b651d01c19ce732819c5b3c3fd60707188e ]
When marking an extent buffer as dirty, at btrfs_mark_buffer_dirty(), we check if its generation matches the running transaction and if not we just print a warning. Such mismatch is an indicator that something really went wrong and only printing a warning message (and stack trace) is not enough to prevent a corruption. Allowing a transaction to commit with such an extent buffer will trigger an error if we ever try to read it from disk due to a generation mismatch with its parent generation.
So abort the current transaction with -EUCLEAN if we notice a generation mismatch. For this we need to pass a transaction handle to btrfs_mark_buffer_dirty() which is always available except in test code, in which case we can pass NULL since it operates on dummy extent buffers and all test roots have a single node/leaf (root node at level 0).
Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
show more ...
|
#
eb96e221 |
| 19-Oct-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: fix unwritten extent buffer after snapshotting a new subvolume
When creating a snapshot of a subvolume that was created in the current transaction, we can end up not persisting a dirty extent
btrfs: fix unwritten extent buffer after snapshotting a new subvolume
When creating a snapshot of a subvolume that was created in the current transaction, we can end up not persisting a dirty extent buffer that is referenced by the snapshot, resulting in IO errors due to checksum failures when trying to read the extent buffer later from disk. A sequence of steps that leads to this is the following:
1) At ioctl.c:create_subvol() we allocate an extent buffer, with logical address 36007936, for the leaf/root of a new subvolume that has an ID of 291. We mark the extent buffer as dirty, and at this point the subvolume tree has a single node/leaf which is also its root (level 0);
2) We no longer commit the transaction used to create the subvolume at create_subvol(). We used to, but that was recently removed in commit 1b53e51a4a8f ("btrfs: don't commit transaction for every subvol create");
3) The transaction used to create the subvolume has an ID of 33, so the extent buffer 36007936 has a generation of 33;
4) Several updates happen to subvolume 291 during transaction 33, several files created and its tree height changes from 0 to 1, so we end up with a new root at level 1 and the extent buffer 36007936 is now a leaf of that new root node, which is extent buffer 36048896.
The commit root remains as 36007936, since we are still at transaction 33;
5) Creation of a snapshot of subvolume 291, with an ID of 292, starts at ioctl.c:create_snapshot(). This triggers a commit of transaction 33 and we end up at transaction.c:create_pending_snapshot(), in the critical section of a transaction commit.
There we COW the root of subvolume 291, which is extent buffer 36048896. The COW operation returns extent buffer 36048896, since there's no need to COW because the extent buffer was created in this transaction and it was not written yet.
The we call btrfs_copy_root() against the root node 36048896. During this operation we allocate a new extent buffer to turn into the root node of the snapshot, copy the contents of the root node 36048896 into this snapshot root extent buffer, set the owner to 292 (the ID of the snapshot), etc, and then we call btrfs_inc_ref(). This will create a delayed reference for each leaf pointed by the root node with a reference root of 292 - this includes a reference for the leaf 36007936.
After that we set the bit BTRFS_ROOT_FORCE_COW in the root's state.
Then we call btrfs_insert_dir_item(), to create the directory entry in in the tree of subvolume 291 that points to the snapshot. This ends up needing to modify leaf 36007936 to insert the respective directory items. Because the bit BTRFS_ROOT_FORCE_COW is set for the root's state, we need to COW the leaf. We end up at btrfs_force_cow_block() and then at update_ref_for_cow().
At update_ref_for_cow() we call btrfs_block_can_be_shared() which returns false, despite the fact the leaf 36007936 is shared - the subvolume's root and the snapshot's root point to that leaf. The reason that it incorrectly returns false is because the commit root of the subvolume is extent buffer 36007936 - it was the initial root of the subvolume when we created it. So btrfs_block_can_be_shared() which has the following logic:
int btrfs_block_can_be_shared(struct btrfs_root *root, struct extent_buffer *buf) { if (test_bit(BTRFS_ROOT_SHAREABLE, &root->state) && buf != root->node && buf != root->commit_root && (btrfs_header_generation(buf) <= btrfs_root_last_snapshot(&root->root_item) || btrfs_header_flag(buf, BTRFS_HEADER_FLAG_RELOC))) return 1;
return 0; }
Returns false (0) since 'buf' (extent buffer 36007936) matches the root's commit root.
As a result, at update_ref_for_cow(), we don't check for the number of references for extent buffer 36007936, we just assume it's not shared and therefore that it has only 1 reference, so we set the local variable 'refs' to 1.
Later on, in the final if-else statement at update_ref_for_cow():
static noinline int update_ref_for_cow(struct btrfs_trans_handle *trans, struct btrfs_root *root, struct extent_buffer *buf, struct extent_buffer *cow, int *last_ref) { (...) if (refs > 1) { (...) } else { (...) btrfs_clear_buffer_dirty(trans, buf); *last_ref = 1; } }
So we mark the extent buffer 36007936 as not dirty, and as a result we don't write it to disk later in the transaction commit, despite the fact that the snapshot's root points to it.
Attempting to access the leaf or dumping the tree for example shows that the extent buffer was not written:
$ btrfs inspect-internal dump-tree -t 292 /dev/sdb btrfs-progs v6.2.2 file tree key (292 ROOT_ITEM 33) node 36110336 level 1 items 2 free space 119 generation 33 owner 292 node 36110336 flags 0x1(WRITTEN) backref revision 1 checksum stored a8103e3e checksum calced a8103e3e fs uuid 90c9a46f-ae9f-4626-9aff-0cbf3e2e3a79 chunk uuid e8c9c885-78f4-4d31-85fe-89e5f5fd4a07 key (256 INODE_ITEM 0) block 36007936 gen 33 key (257 EXTENT_DATA 0) block 36052992 gen 33 checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29 checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29 total bytes 107374182400 bytes used 38572032 uuid 90c9a46f-ae9f-4626-9aff-0cbf3e2e3a79
The respective on disk region is full of zeroes as the device was trimmed at mkfs time.
Obviously 'btrfs check' also detects and complains about this:
$ btrfs check /dev/sdb Opening filesystem to check... Checking filesystem on /dev/sdb UUID: 90c9a46f-ae9f-4626-9aff-0cbf3e2e3a79 generation: 33 (33) [1/7] checking root items [2/7] checking extents checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29 checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29 checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29 bad tree block 36007936, bytenr mismatch, want=36007936, have=0 owner ref check failed [36007936 4096] ERROR: errors found in extent allocation tree or chunk allocation [3/7] checking free space tree [4/7] checking fs roots checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29 checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29 checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29 bad tree block 36007936, bytenr mismatch, want=36007936, have=0 The following tree block(s) is corrupted in tree 292: tree block bytenr: 36110336, level: 1, node key: (256, 1, 0) root 292 root dir 256 not found ERROR: errors found in fs roots found 38572032 bytes used, error(s) found total csum bytes: 16048 total tree bytes: 1265664 total fs tree bytes: 1118208 total extent tree bytes: 65536 btree space waste bytes: 562598 file data blocks allocated: 65978368 referenced 36569088
Fix this by updating btrfs_block_can_be_shared() to consider that an extent buffer may be shared if it matches the commit root and if its generation matches the current transaction's generation.
This can be reproduced with the following script:
$ cat test.sh #!/bin/bash
MNT=/mnt/sdi DEV=/dev/sdi
# Use a filesystem with a 64K node size so that we have the same node # size on every machine regardless of its page size (on x86_64 default # node size is 16K due to the 4K page size, while on PPC it's 64K by # default). This way we can make sure we are able to create a btree for # the subvolume with a height of 2. mkfs.btrfs -f -n 64K $DEV mount $DEV $MNT
btrfs subvolume create $MNT/subvol
# Create a few empty files on the subvolume, this bumps its btree # height to 2 (root node at level 1 and 2 leaves). for ((i = 1; i <= 300; i++)); do echo -n > $MNT/subvol/file_$i done
btrfs subvolume snapshot -r $MNT/subvol $MNT/subvol/snap
umount $DEV
btrfs check $DEV
Running it on a 6.5 kernel (or any 6.6-rc kernel at the moment):
$ ./test.sh Create subvolume '/mnt/sdi/subvol' Create a readonly snapshot of '/mnt/sdi/subvol' in '/mnt/sdi/subvol/snap' Opening filesystem to check... Checking filesystem on /dev/sdi UUID: bbdde2ff-7d02-45ca-8a73-3c36f23755a1 [1/7] checking root items [2/7] checking extents parent transid verify failed on 30539776 wanted 7 found 5 parent transid verify failed on 30539776 wanted 7 found 5 parent transid verify failed on 30539776 wanted 7 found 5 Ignoring transid failure owner ref check failed [30539776 65536] ERROR: errors found in extent allocation tree or chunk allocation [3/7] checking free space tree [4/7] checking fs roots parent transid verify failed on 30539776 wanted 7 found 5 Ignoring transid failure Wrong key of child node/leaf, wanted: (256, 1, 0), have: (2, 132, 0) Wrong generation of child node/leaf, wanted: 5, have: 7 root 257 root dir 256 not found ERROR: errors found in fs roots found 917504 bytes used, error(s) found total csum bytes: 0 total tree bytes: 851968 total fs tree bytes: 393216 total extent tree bytes: 65536 btree space waste bytes: 736550 file data blocks allocated: 0 referenced 0
A test case for fstests will follow soon.
Fixes: 1b53e51a4a8f ("btrfs: don't commit transaction for every subvol create") CC: stable@vger.kernel.org # 6.5+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
e36f9491 |
| 27-Sep-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: error out when reallocating block for defrag using a stale transaction
At btrfs_realloc_node() we have these checks to verify we are not using a stale transaction (a past transaction with an
btrfs: error out when reallocating block for defrag using a stale transaction
At btrfs_realloc_node() we have these checks to verify we are not using a stale transaction (a past transaction with an unblocked state or higher), and the only thing we do is to trigger two WARN_ON(). This however is a critical problem, highly unexpected and if it happens it's most likely due to a bug, so we should error out and turn the fs into error state so that such issue is much more easily noticed if it's triggered.
The problem is critical because in btrfs_realloc_node() we COW tree blocks, and using such stale transaction will lead to not persisting the extent buffers used for the COW operations, as allocating tree block adds the range of the respective extent buffers to the ->dirty_pages iotree of the transaction, and a stale transaction, in the unlocked state or higher, will not flush dirty extent buffers anymore, therefore resulting in not persisting the tree block and resource leaks (not cleaning the dirty_pages iotree for example).
So do the following changes:
1) Return -EUCLEAN if we find a stale transaction;
2) Turn the fs into error state, with error -EUCLEAN, so that no transaction can be committed, and generate a stack trace;
3) Combine both conditions into a single if statement, as both are related and have the same error message;
4) Mark the check as unlikely, since this is not expected to ever happen.
Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
a2caab29 |
| 27-Sep-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: error when COWing block from a root that is being deleted
At btrfs_cow_block() we check if the block being COWed belongs to a root that is being deleted and if so we log an error message. How
btrfs: error when COWing block from a root that is being deleted
At btrfs_cow_block() we check if the block being COWed belongs to a root that is being deleted and if so we log an error message. However this is an unexpected case and it indicates a bug somewhere, so we should return an error and abort the transaction. So change this in the following ways:
1) Abort the transaction with -EUCLEAN, so that if the issue ever happens it can easily be noticed;
2) Change the logged message level from error to critical, and change the message itself to print the block's logical address and the ID of the root;
3) Return -EUCLEAN to the caller;
4) As this is an unexpected scenario, that should never happen, mark the check as unlikely, allowing the compiler to potentially generate better code.
Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
48774f3b |
| 27-Sep-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: error out when COWing block using a stale transaction
At btrfs_cow_block() we have these checks to verify we are not using a stale transaction (a past transaction with an unblocked state or h
btrfs: error out when COWing block using a stale transaction
At btrfs_cow_block() we have these checks to verify we are not using a stale transaction (a past transaction with an unblocked state or higher), and the only thing we do is to trigger a WARN with a message and a stack trace. This however is a critical problem, highly unexpected and if it happens it's most likely due to a bug, so we should error out and turn the fs into error state so that such issue is much more easily noticed if it's triggered.
The problem is critical because using such stale transaction will lead to not persisting the extent buffer used for the COW operation, as allocating a tree block adds the range of the respective extent buffer to the ->dirty_pages iotree of the transaction, and a stale transaction, in the unlocked state or higher, will not flush dirty extent buffers anymore, therefore resulting in not persisting the tree block and resource leaks (not cleaning the dirty_pages iotree for example).
So do the following changes:
1) Return -EUCLEAN if we find a stale transaction;
2) Turn the fs into error state, with error -EUCLEAN, so that no transaction can be committed, and generate a stack trace;
3) Combine both conditions into a single if statement, as both are related and have the same error message;
4) Mark the check as unlikely, since this is not expected to ever happen.
Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
Revision tags: v6.5.2, v6.1.51, v6.5.1, v6.1.50, v6.5, v6.1.49, v6.1.48, v6.1.46, v6.1.45, v6.1.44, v6.1.43, v6.1.42, v6.1.41, v6.1.40, v6.1.39, v6.1.38, v6.1.37, v6.1.36, v6.4, v6.1.35, v6.1.34 |
|
#
7569141e |
| 09-Jun-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: replace BUG_ON() at split_item() with proper error handling
There's no need to BUG_ON() at split_item() if the leaf does not have enough free space for the new item, we can just return -ENOSP
btrfs: replace BUG_ON() at split_item() with proper error handling
There's no need to BUG_ON() at split_item() if the leaf does not have enough free space for the new item, we can just return -ENOSPC since the caller can deal with errors from split_item(). Also, as this is a very unlikely condition to happen, because the caller has invoked setup_leaf_for_split() before calling split_item(), surround the condition with a WARN_ON() which makes it easier to notice this unexpected condition and tags the if branch with 'unlikely' as well.
I've actually once hit this BUG_ON() with some incorrect code changes I had, which was very inconvenient as it required rebooting the VM.
Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
Revision tags: v6.1.33 |
|
#
751a2761 |
| 08-Jun-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: do not BUG_ON() on tree mod log failures at btrfs_del_ptr()
At btrfs_del_ptr(), instead of doing a BUG_ON() in case we fail to record tree mod log operations, do a transaction abort and retur
btrfs: do not BUG_ON() on tree mod log failures at btrfs_del_ptr()
At btrfs_del_ptr(), instead of doing a BUG_ON() in case we fail to record tree mod log operations, do a transaction abort and return the error to the callers. There's really no need for the BUG_ON() as we can release all resources in the context of all callers, and we have to abort because other future tree searches that use the tree mod log (btrfs_search_old_slot()) may get inconsistent results if other operations modify the tree after that failure and before the tree mod log based search.
This implies btrfs_del_ptr() return an int instead of void, and making all callers check for returned errors.
Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
50b5d1fc |
| 08-Jun-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: do not BUG_ON() on tree mod log failures at insert_ptr()
At insert_ptr(), instead of doing a BUG_ON() in case we fail to record tree mod log operations, do a transaction abort and return the
btrfs: do not BUG_ON() on tree mod log failures at insert_ptr()
At insert_ptr(), instead of doing a BUG_ON() in case we fail to record tree mod log operations, do a transaction abort and return the error to the callers. There's really no need for the BUG_ON() as we can release all resources in the context of all callers, and we have to abort because other future tree searches that use the tree mod log (btrfs_search_old_slot()) may get inconsistent results if other operations modify the tree after that failure and before the tree mod log based search.
This implies making insert_ptr() return an int instead of void, and making all callers check for returned errors.
Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
f61aa7ba |
| 08-Jun-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: do not BUG_ON() on tree mod log failure at insert_new_root()
At insert_new_root(), instead of doing a BUG_ON() in case we fail to record the tree mod log operation, just return the error to t
btrfs: do not BUG_ON() on tree mod log failure at insert_new_root()
At insert_new_root(), instead of doing a BUG_ON() in case we fail to record the tree mod log operation, just return the error to the callers after releasing the allocated tree block. At this point we haven't made any changes to the b+tree, so we haven't left it in an inconsistent state and therefore have no need to abort the transaction. All we need to do is to unlock and free the extent buffer we just allocated with the purpose of making it the new root.
Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
11d6ae03 |
| 08-Jun-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: do not BUG_ON() on tree mod log failures at push_nodes_for_insert()
At push_nodes_for_insert(), instead of doing a BUG_ON() in case we fail to record tree mod log operations, do a transaction
btrfs: do not BUG_ON() on tree mod log failures at push_nodes_for_insert()
At push_nodes_for_insert(), instead of doing a BUG_ON() in case we fail to record tree mod log operations, do a transaction abort and return the error to the caller. There's really no need for the BUG_ON() as we can release all resources in this context, and we have to abort because other future tree searches that use the tree mod log (btrfs_search_old_slot()) may get inconsistent results if other operations modify the tree after that failure and before the tree mod log based search.
Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
eced687e |
| 08-Jun-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: abort transaction at update_ref_for_cow() when ref count is zero
At update_ref_for_cow() we are calling btrfs_handle_fs_error() if we find that the extent buffer has an unexpected ref count o
btrfs: abort transaction at update_ref_for_cow() when ref count is zero
At update_ref_for_cow() we are calling btrfs_handle_fs_error() if we find that the extent buffer has an unexpected ref count of zero, however we can simply use btrfs_abort_transaction(), which achieves the same purposes: to turn the fs to error state, abort the current transaction and turn the fs to RO mode as well. Besides that, btrfs_abort_transaction() also prints a stack trace which makes it more useful.
Also, as this is a very unexpected situation, indicating a serious corruption/inconsistency, tag the if branch as 'unlikely', set the error code to -EUCLEAN instead of -EROFS, and log an explicit message.
Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
725026ed |
| 08-Jun-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: abort transaction at balance_level() when left child is missing
At balance_level() we are calling btrfs_handle_fs_error() when the middle child only has 1 item and the left child is missing,
btrfs: abort transaction at balance_level() when left child is missing
At balance_level() we are calling btrfs_handle_fs_error() when the middle child only has 1 item and the left child is missing, however we can simply use btrfs_abort_transaction(), which achieves the same purposes: to turn the fs to error state, abort the current transaction and turn the fs to RO mode. Besides that, btrfs_abort_transaction() also prints a stack trace which makes it more useful.
Also, as this is a highly unexpected case and it's about a b+tree inconsistency, change the error code from -EROFS to -EUCLEAN, tag the if branch as 'unlikely' and log an explicit error message.
Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
87b8e9d0 |
| 08-Jun-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: avoid unnecessarily setting the fs to RO and error state at balance_level()
At balance_level(), when trying to promote a child node to a root node, if we fail to read the child we call btrfs_
btrfs: avoid unnecessarily setting the fs to RO and error state at balance_level()
At balance_level(), when trying to promote a child node to a root node, if we fail to read the child we call btrfs_handle_fs_error(), which turns the fs to RO mode and sets it to error state as well, causing any ongoing transaction to abort. This however is not necessary because at that point we have not made any change yet at balance_level(), so any error reading the child node does not leaves us in any inconsistent state. Therefore we can just return the error to the caller.
Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
daefe4d4 |
| 08-Jun-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: rename enospc label to out at balance_level()
At balance_level() we have this 'enospc' label where we jump to in case we get an error at several places. However that error is certainly not -E
btrfs: rename enospc label to out at balance_level()
At balance_level() we have this 'enospc' label where we jump to in case we get an error at several places. However that error is certainly not -ENOSPC in call cases, it can be -EIO or -ENOMEM when reading a child extent buffer for example, or -ENOMEM when trying to record tree mod log operations. So to make this less confusing, rename the label to 'out'.
Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
39020d8a |
| 08-Jun-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: do not BUG_ON() on tree mod log failure at balance_level()
At balance_level(), instead of doing a BUG_ON() in case we fail to record tree mod log operations, do a transaction abort and return
btrfs: do not BUG_ON() on tree mod log failure at balance_level()
At balance_level(), instead of doing a BUG_ON() in case we fail to record tree mod log operations, do a transaction abort and return the error to the callers. There's really no need for the BUG_ON() as we can release all resources in this context, and we have to abort because other future tree searches that use the tree mod log (btrfs_search_old_slot()) may get inconsistent results if other operations modify the tree after that failure and before the tree mod log based search.
CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
40b0a749 |
| 08-Jun-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: do not BUG_ON() on tree mod log failure at __btrfs_cow_block()
At __btrfs_cow_block(), instead of doing a BUG_ON() in case we fail to record a tree mod log root insertion operation, do a tran
btrfs: do not BUG_ON() on tree mod log failure at __btrfs_cow_block()
At __btrfs_cow_block(), instead of doing a BUG_ON() in case we fail to record a tree mod log root insertion operation, do a transaction abort instead. There's really no need for the BUG_ON(), we can properly release all resources in this context and turn the filesystem to RO mode and in an error state instead.
CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
ede600e4 |
| 08-Jun-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: fix extent buffer leak after tree mod log failure at split_node()
At split_node(), if we fail to log the tree mod log copy operation, we return without unlocking the split extent buffer we ju
btrfs: fix extent buffer leak after tree mod log failure at split_node()
At split_node(), if we fail to log the tree mod log copy operation, we return without unlocking the split extent buffer we just allocated and without decrementing the reference we own on it. Fix this by unlocking it and decrementing the ref count before returning.
Fixes: 5de865eebb83 ("Btrfs: fix tree mod logging") CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
d09c5152 |
| 08-Jun-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: add missing error handling when logging operation while COWing extent buffer
When COWing an extent buffer that is not the root node, we need to log in the tree mod log that we replaced a poin
btrfs: add missing error handling when logging operation while COWing extent buffer
When COWing an extent buffer that is not the root node, we need to log in the tree mod log that we replaced a pointer in the parent node, otherwise a tree mod log user doing a search on the b+tree can return incorrect results (that miss something). We are doing the call to btrfs_tree_mod_log_insert_key() but we totally ignore its return value.
So fix this by adding the missing error handling, resulting in a transaction abort and freeing the COWed extent buffer.
Fixes: f230475e62f7 ("Btrfs: put all block modifications into the tree mod log") CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
Revision tags: v6.1.32 |
|
#
5cead542 |
| 01-Jun-2023 |
Boris Burkov <boris@bur.io> |
btrfs: insert tree mod log move in push_node_left
There is a fairly unlikely race condition in tree mod log rewind that can result in a kernel panic which has the following trace:
[530.569] BTRFS
btrfs: insert tree mod log move in push_node_left
There is a fairly unlikely race condition in tree mod log rewind that can result in a kernel panic which has the following trace:
[530.569] BTRFS critical (device sda3): unable to find logical 0 length 4096 [530.585] BTRFS critical (device sda3): unable to find logical 0 length 4096 [530.602] BUG: kernel NULL pointer dereference, address: 0000000000000002 [530.618] #PF: supervisor read access in kernel mode [530.629] #PF: error_code(0x0000) - not-present page [530.641] PGD 0 P4D 0 [530.647] Oops: 0000 [#1] SMP [530.654] CPU: 30 PID: 398973 Comm: below Kdump: loaded Tainted: G S O K 5.12.0-0_fbk13_clang_7455_gb24de3bdb045 #1 [530.680] Hardware name: Quanta Mono Lake-M.2 SATA 1HY9U9Z001G/Mono Lake-M.2 SATA, BIOS F20_3A15 08/16/2017 [530.703] RIP: 0010:__btrfs_map_block+0xaa/0xd00 [530.755] RSP: 0018:ffffc9002c2f7600 EFLAGS: 00010246 [530.767] RAX: ffffffffffffffea RBX: ffff888292e41000 RCX: f2702d8b8be15100 [530.784] RDX: ffff88885fda6fb8 RSI: ffff88885fd973c8 RDI: ffff88885fd973c8 [530.800] RBP: ffff888292e410d0 R08: ffffffff82fd7fd0 R09: 00000000fffeffff [530.816] R10: ffffffff82e57fd0 R11: ffffffff82e57d70 R12: 0000000000000000 [530.832] R13: 0000000000001000 R14: 0000000000001000 R15: ffffc9002c2f76f0 [530.848] FS: 00007f38d64af000(0000) GS:ffff88885fd80000(0000) knlGS:0000000000000000 [530.866] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [530.880] CR2: 0000000000000002 CR3: 00000002b6770004 CR4: 00000000003706e0 [530.896] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [530.912] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [530.928] Call Trace: [530.934] ? btrfs_printk+0x13b/0x18c [530.943] ? btrfs_bio_counter_inc_blocked+0x3d/0x130 [530.955] btrfs_map_bio+0x75/0x330 [530.963] ? kmem_cache_alloc+0x12a/0x2d0 [530.973] ? btrfs_submit_metadata_bio+0x63/0x100 [530.984] btrfs_submit_metadata_bio+0xa4/0x100 [530.995] submit_extent_page+0x30f/0x360 [531.004] read_extent_buffer_pages+0x49e/0x6d0 [531.015] ? submit_extent_page+0x360/0x360 [531.025] btree_read_extent_buffer_pages+0x5f/0x150 [531.037] read_tree_block+0x37/0x60 [531.046] read_block_for_search+0x18b/0x410 [531.056] btrfs_search_old_slot+0x198/0x2f0 [531.066] resolve_indirect_ref+0xfe/0x6f0 [531.076] ? ulist_alloc+0x31/0x60 [531.084] ? kmem_cache_alloc_trace+0x12e/0x2b0 [531.095] find_parent_nodes+0x720/0x1830 [531.105] ? ulist_alloc+0x10/0x60 [531.113] iterate_extent_inodes+0xea/0x370 [531.123] ? btrfs_previous_extent_item+0x8f/0x110 [531.134] ? btrfs_search_path_in_tree+0x240/0x240 [531.146] iterate_inodes_from_logical+0x98/0xd0 [531.157] ? btrfs_search_path_in_tree+0x240/0x240 [531.168] btrfs_ioctl_logical_to_ino+0xd9/0x180 [531.179] btrfs_ioctl+0xe2/0x2eb0
This occurs when logical inode resolution takes a tree mod log sequence number, and then while backref walking hits a rewind on a busy node which has the following sequence of tree mod log operations (numbers filled in from a specific example, but they are somewhat arbitrary)
REMOVE_WHILE_FREEING slot 532 REMOVE_WHILE_FREEING slot 531 REMOVE_WHILE_FREEING slot 530 ... REMOVE_WHILE_FREEING slot 0 REMOVE slot 455 REMOVE slot 454 REMOVE slot 453 ... REMOVE slot 0 ADD slot 455 ADD slot 454 ADD slot 453 ... ADD slot 0 MOVE src slot 0 -> dst slot 456 nritems 533 REMOVE slot 455 REMOVE slot 454 REMOVE slot 453 ... REMOVE slot 0
When this sequence gets applied via btrfs_tree_mod_log_rewind, it allocates a fresh rewind eb, and first inserts the correct key info for the 533 elements, then overwrites the first 456 of them, then decrements the count by 456 via the add ops, then rewinds the move by doing a memmove from 456:988->0:532. We have never written anything past 532, so that memmove writes garbage into the 0:532 range. In practice, this results in a lot of fully 0 keys. The rewind then puts valid keys into slots 0:455 with the last removes, but 456:532 are still invalid.
When search_old_slot uses this eb, if it uses one of those invalid slots, it can then read the extent buffer and issue a bio for offset 0 which ultimately panics looking up extent mappings.
This bad tree mod log sequence gets generated when the node balancing code happens to do a balance_node_right followed by a push_node_left while logging in the tree mod log. Illustrated for ebs L and R (left and right):
L R start: [XXX|YYY|...] [ZZZ|...|...] balance_node_right: [XXX|YYY|...] [...|ZZZ|...] move Z to make room for Y [XXX|...|...] [YYY|ZZZ|...] copy Y from L to R push_node_left: [XXX|YYY|...] [...|ZZZ|...] copy Y from R to L [XXX|YYY|...] [ZZZ|...|...] move Z into emptied space (NOT LOGGED!)
This is because balance_node_right logs a move, but push_node_left explicitly doesn't. That is because logging the move would remove the overwritten src < dst range in the right eb, which was already logged when we called btrfs_tree_mod_log_eb_copy. The correct sequence would include a move from 456:988 to 0:532 after remove 0:455 and before removing 0:532. Reversing that sequence would entail creating keys for 0:532, then moving those keys out to 456:988, then creating more keys for 0:455.
i.e.,
REMOVE_WHILE_FREEING slot 532 REMOVE_WHILE_FREEING slot 531 REMOVE_WHILE_FREEING slot 530 ... REMOVE_WHILE_FREEING slot 0 MOVE src slot 456 -> dst slot 0 nritems 533 REMOVE slot 455 REMOVE slot 454 REMOVE slot 453 ... REMOVE slot 0 ADD slot 455 ADD slot 454 ADD slot 453 ... ADD slot 0 MOVE src slot 0 -> dst slot 456 nritems 533 REMOVE slot 455 REMOVE slot 454 REMOVE slot 453 ... REMOVE slot 0
Fix this to log the move but avoid the double remove by putting all the logging logic in btrfs_tree_mod_log_eb_copy which has enough information to detect these cases and properly log moves, removes, and adds. Leave btrfs_tree_mod_log_insert_move to handle insert_ptr and delete_ptr's tree mod logging.
(Un)fortunately, this is quite difficult to reproduce, and I was only able to reproduce it by adding sleeps in btrfs_search_old_slot that would encourage more log rewinding during ino_to_logical ioctls. I was able to hit the warning in the previous patch in the series without the fix quite quickly, but not after this patch.
CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
Revision tags: v6.1.31, v6.1.30, v6.1.29, v6.1.28, v6.1.27 |
|
#
016f9d0b |
| 29-Apr-2023 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: rename del_ptr to btrfs_del_ptr and export it
This exists internal to ctree.c, however btrfs check needs to use it for some of its operations. I'd rather not duplicate that code inside of bt
btrfs: rename del_ptr to btrfs_del_ptr and export it
This exists internal to ctree.c, however btrfs check needs to use it for some of its operations. I'd rather not duplicate that code inside of btrfs check as this is low level and I want to keep this code in one place, so rename the function to btrfs_del_ptr and export it so that it can be used inside of btrfs-progs safely. Add a comment to make sure this doesn't get removed by a future cleanup.
Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
b3cbfb0d |
| 29-Apr-2023 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: add a btrfs_csum_type_size helper
This is needed in btrfs-progs for the tools that convert the checksum types for file systems and a few other things. We don't have it in the kernel as we ju
btrfs: add a btrfs_csum_type_size helper
This is needed in btrfs-progs for the tools that convert the checksum types for file systems and a few other things. We don't have it in the kernel as we just want to get the size for the super blocks type. However I don't want to have to manually add this every time we sync ctree.c into btrfs-progs, so add the helper in the kernel with a note so it doesn't get removed by a later cleanup.
Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
4aec05fa |
| 29-Apr-2023 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: remove level argument from btrfs_set_block_flags
We just pass in btrfs_header_level(eb) for the level, and we're passing in the eb already, so simply get the level from the eb inside of btrfs
btrfs: remove level argument from btrfs_set_block_flags
We just pass in btrfs_header_level(eb) for the level, and we're passing in the eb already, so simply get the level from the eb inside of btrfs_set_block_flags.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
eee3b811 |
| 27-Apr-2023 |
Qu Wenruo <wqu@suse.com> |
btrfs: improve leaf dump and error handling
Improve the leaf dump behavior by:
- Always dump the leaf first, then the error message
- Output the slot number if possible Especially in __btrfs_fre
btrfs: improve leaf dump and error handling
Improve the leaf dump behavior by:
- Always dump the leaf first, then the error message
- Output the slot number if possible Especially in __btrfs_free_extent() the leaf dump of extent tree can be pretty large. With an extra slot number it's much easier to locate the problem.
Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
6c75a589 |
| 27-Apr-2023 |
Qu Wenruo <wqu@suse.com> |
btrfs: print-tree: pass const extent buffer pointer
Since print-tree infrastructure only prints the content of a tree block, we can make them to accept const extent buffer pointer.
This removes a f
btrfs: print-tree: pass const extent buffer pointer
Since print-tree infrastructure only prints the content of a tree block, we can make them to accept const extent buffer pointer.
This removes a forced type convert in extent-tree, where we convert a const extent buffer pointer to regular one, just to avoid compiler warning.
Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
Revision tags: v6.1.26 |
|
#
88ad95b0 |
| 26-Apr-2023 |
Filipe Manana <fdmanana@suse.com> |
btrfs: tag as unlikely the key comparison when checking sibling keys
When checking siblings keys, before moving keys from one node/leaf to a sibling node/leaf, it's very unexpected to have the last
btrfs: tag as unlikely the key comparison when checking sibling keys
When checking siblings keys, before moving keys from one node/leaf to a sibling node/leaf, it's very unexpected to have the last key of the left sibling greater than or equals to the first key of the right sibling, as that means we have a (serious) corruption that breaks the key ordering properties of a b+tree. Since this is unexpected, surround the comparison with the unlikely macro, which helps the compiler generate better code for the most expected case (no existing b+tree corruption). This is also what we do for other unexpected cases of invalid key ordering (like at btrfs_set_item_key_safe()).
Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|