History log of /openbmc/linux/fs/btrfs/backref.c (Results 1 – 25 of 948)
Revision (<<< Hide revision tags) (Show revision tags >>>) Date Author Comments
Revision tags: v6.6.30, v6.6.29, v6.6.28
# 3a63cee1 17-Apr-2024 Johannes Thumshirn <johannes.thumshirn@wdc.com>

btrfs: fix information leak in btrfs_ioctl_logical_to_ino()

commit 2f7ef5bb4a2f3e481ef05fab946edb97c84f67cf upstream.

Syzbot reported the following information leak for in
btrfs_ioctl_logical_to_in

btrfs: fix information leak in btrfs_ioctl_logical_to_ino()

commit 2f7ef5bb4a2f3e481ef05fab946edb97c84f67cf upstream.

Syzbot reported the following information leak for in
btrfs_ioctl_logical_to_ino():

BUG: KMSAN: kernel-infoleak in instrument_copy_to_user include/linux/instrumented.h:114 [inline]
BUG: KMSAN: kernel-infoleak in _copy_to_user+0xbc/0x110 lib/usercopy.c:40
instrument_copy_to_user include/linux/instrumented.h:114 [inline]
_copy_to_user+0xbc/0x110 lib/usercopy.c:40
copy_to_user include/linux/uaccess.h:191 [inline]
btrfs_ioctl_logical_to_ino+0x440/0x750 fs/btrfs/ioctl.c:3499
btrfs_ioctl+0x714/0x1260
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:904 [inline]
__se_sys_ioctl+0x261/0x450 fs/ioctl.c:890
__x64_sys_ioctl+0x96/0xe0 fs/ioctl.c:890
x64_sys_call+0x1883/0x3b50 arch/x86/include/generated/asm/syscalls_64.h:17
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xcf/0x1e0 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f

Uninit was created at:
__kmalloc_large_node+0x231/0x370 mm/slub.c:3921
__do_kmalloc_node mm/slub.c:3954 [inline]
__kmalloc_node+0xb07/0x1060 mm/slub.c:3973
kmalloc_node include/linux/slab.h:648 [inline]
kvmalloc_node+0xc0/0x2d0 mm/util.c:634
kvmalloc include/linux/slab.h:766 [inline]
init_data_container+0x49/0x1e0 fs/btrfs/backref.c:2779
btrfs_ioctl_logical_to_ino+0x17c/0x750 fs/btrfs/ioctl.c:3480
btrfs_ioctl+0x714/0x1260
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:904 [inline]
__se_sys_ioctl+0x261/0x450 fs/ioctl.c:890
__x64_sys_ioctl+0x96/0xe0 fs/ioctl.c:890
x64_sys_call+0x1883/0x3b50 arch/x86/include/generated/asm/syscalls_64.h:17
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xcf/0x1e0 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f

Bytes 40-65535 of 65536 are uninitialized
Memory access of size 65536 starts at ffff888045a40000

This happens, because we're copying a 'struct btrfs_data_container' back
to user-space. This btrfs_data_container is allocated in
'init_data_container()' via kvmalloc(), which does not zero-fill the
memory.

Fix this by using kvzalloc() which zeroes out the memory on allocation.

CC: stable@vger.kernel.org # 4.14+
Reported-by: <syzbot+510a1abbb8116eeb341d@syzkaller.appspotmail.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <Johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

show more ...


Revision tags: v6.6.30, v6.6.29, v6.6.28
# 3a63cee1 17-Apr-2024 Johannes Thumshirn <johannes.thumshirn@wdc.com>

btrfs: fix information leak in btrfs_ioctl_logical_to_ino()

commit 2f7ef5bb4a2f3e481ef05fab946edb97c84f67cf upstream.

Syzbot reported the following information leak for in
btrfs_ioctl_logical_to_in

btrfs: fix information leak in btrfs_ioctl_logical_to_ino()

commit 2f7ef5bb4a2f3e481ef05fab946edb97c84f67cf upstream.

Syzbot reported the following information leak for in
btrfs_ioctl_logical_to_ino():

BUG: KMSAN: kernel-infoleak in instrument_copy_to_user include/linux/instrumented.h:114 [inline]
BUG: KMSAN: kernel-infoleak in _copy_to_user+0xbc/0x110 lib/usercopy.c:40
instrument_copy_to_user include/linux/instrumented.h:114 [inline]
_copy_to_user+0xbc/0x110 lib/usercopy.c:40
copy_to_user include/linux/uaccess.h:191 [inline]
btrfs_ioctl_logical_to_ino+0x440/0x750 fs/btrfs/ioctl.c:3499
btrfs_ioctl+0x714/0x1260
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:904 [inline]
__se_sys_ioctl+0x261/0x450 fs/ioctl.c:890
__x64_sys_ioctl+0x96/0xe0 fs/ioctl.c:890
x64_sys_call+0x1883/0x3b50 arch/x86/include/generated/asm/syscalls_64.h:17
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xcf/0x1e0 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f

Uninit was created at:
__kmalloc_large_node+0x231/0x370 mm/slub.c:3921
__do_kmalloc_node mm/slub.c:3954 [inline]
__kmalloc_node+0xb07/0x1060 mm/slub.c:3973
kmalloc_node include/linux/slab.h:648 [inline]
kvmalloc_node+0xc0/0x2d0 mm/util.c:634
kvmalloc include/linux/slab.h:766 [inline]
init_data_container+0x49/0x1e0 fs/btrfs/backref.c:2779
btrfs_ioctl_logical_to_ino+0x17c/0x750 fs/btrfs/ioctl.c:3480
btrfs_ioctl+0x714/0x1260
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:904 [inline]
__se_sys_ioctl+0x261/0x450 fs/ioctl.c:890
__x64_sys_ioctl+0x96/0xe0 fs/ioctl.c:890
x64_sys_call+0x1883/0x3b50 arch/x86/include/generated/asm/syscalls_64.h:17
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xcf/0x1e0 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f

Bytes 40-65535 of 65536 are uninitialized
Memory access of size 65536 starts at ffff888045a40000

This happens, because we're copying a 'struct btrfs_data_container' back
to user-space. This btrfs_data_container is allocated in
'init_data_container()' via kvmalloc(), which does not zero-fill the
memory.

Fix this by using kvzalloc() which zeroes out the memory on allocation.

CC: stable@vger.kernel.org # 4.14+
Reported-by: <syzbot+510a1abbb8116eeb341d@syzkaller.appspotmail.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <Johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

show more ...


Revision tags: v6.6.30, v6.6.29, v6.6.28
# 3a63cee1 17-Apr-2024 Johannes Thumshirn <johannes.thumshirn@wdc.com>

btrfs: fix information leak in btrfs_ioctl_logical_to_ino()

commit 2f7ef5bb4a2f3e481ef05fab946edb97c84f67cf upstream.

Syzbot reported the following information leak for in
btrfs_ioctl_logical_to_in

btrfs: fix information leak in btrfs_ioctl_logical_to_ino()

commit 2f7ef5bb4a2f3e481ef05fab946edb97c84f67cf upstream.

Syzbot reported the following information leak for in
btrfs_ioctl_logical_to_ino():

BUG: KMSAN: kernel-infoleak in instrument_copy_to_user include/linux/instrumented.h:114 [inline]
BUG: KMSAN: kernel-infoleak in _copy_to_user+0xbc/0x110 lib/usercopy.c:40
instrument_copy_to_user include/linux/instrumented.h:114 [inline]
_copy_to_user+0xbc/0x110 lib/usercopy.c:40
copy_to_user include/linux/uaccess.h:191 [inline]
btrfs_ioctl_logical_to_ino+0x440/0x750 fs/btrfs/ioctl.c:3499
btrfs_ioctl+0x714/0x1260
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:904 [inline]
__se_sys_ioctl+0x261/0x450 fs/ioctl.c:890
__x64_sys_ioctl+0x96/0xe0 fs/ioctl.c:890
x64_sys_call+0x1883/0x3b50 arch/x86/include/generated/asm/syscalls_64.h:17
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xcf/0x1e0 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f

Uninit was created at:
__kmalloc_large_node+0x231/0x370 mm/slub.c:3921
__do_kmalloc_node mm/slub.c:3954 [inline]
__kmalloc_node+0xb07/0x1060 mm/slub.c:3973
kmalloc_node include/linux/slab.h:648 [inline]
kvmalloc_node+0xc0/0x2d0 mm/util.c:634
kvmalloc include/linux/slab.h:766 [inline]
init_data_container+0x49/0x1e0 fs/btrfs/backref.c:2779
btrfs_ioctl_logical_to_ino+0x17c/0x750 fs/btrfs/ioctl.c:3480
btrfs_ioctl+0x714/0x1260
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:904 [inline]
__se_sys_ioctl+0x261/0x450 fs/ioctl.c:890
__x64_sys_ioctl+0x96/0xe0 fs/ioctl.c:890
x64_sys_call+0x1883/0x3b50 arch/x86/include/generated/asm/syscalls_64.h:17
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xcf/0x1e0 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f

Bytes 40-65535 of 65536 are uninitialized
Memory access of size 65536 starts at ffff888045a40000

This happens, because we're copying a 'struct btrfs_data_container' back
to user-space. This btrfs_data_container is allocated in
'init_data_container()' via kvmalloc(), which does not zero-fill the
memory.

Fix this by using kvzalloc() which zeroes out the memory on allocation.

CC: stable@vger.kernel.org # 4.14+
Reported-by: <syzbot+510a1abbb8116eeb341d@syzkaller.appspotmail.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <Johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

show more ...


Revision tags: v6.6.27, v6.6.26, v6.6.25, v6.6.24, v6.6.23, v6.6.16, v6.6.15, v6.6.14, v6.6.13, v6.6.12, v6.6.11, v6.6.10, v6.6.9, v6.6.8, v6.6.7, v6.6.6, v6.6.5, v6.6.4, v6.6.3, v6.6.2, v6.5.11, v6.6.1, v6.5.10, v6.6, v6.5.9, v6.5.8
# eb96e221 19-Oct-2023 Filipe Manana <fdmanana@suse.com>

btrfs: fix unwritten extent buffer after snapshotting a new subvolume

When creating a snapshot of a subvolume that was created in the current
transaction, we can end up not persisting a dirty extent

btrfs: fix unwritten extent buffer after snapshotting a new subvolume

When creating a snapshot of a subvolume that was created in the current
transaction, we can end up not persisting a dirty extent buffer that is
referenced by the snapshot, resulting in IO errors due to checksum failures
when trying to read the extent buffer later from disk. A sequence of steps
that leads to this is the following:

1) At ioctl.c:create_subvol() we allocate an extent buffer, with logical
address 36007936, for the leaf/root of a new subvolume that has an ID
of 291. We mark the extent buffer as dirty, and at this point the
subvolume tree has a single node/leaf which is also its root (level 0);

2) We no longer commit the transaction used to create the subvolume at
create_subvol(). We used to, but that was recently removed in
commit 1b53e51a4a8f ("btrfs: don't commit transaction for every subvol
create");

3) The transaction used to create the subvolume has an ID of 33, so the
extent buffer 36007936 has a generation of 33;

4) Several updates happen to subvolume 291 during transaction 33, several
files created and its tree height changes from 0 to 1, so we end up with
a new root at level 1 and the extent buffer 36007936 is now a leaf of
that new root node, which is extent buffer 36048896.

The commit root remains as 36007936, since we are still at transaction
33;

5) Creation of a snapshot of subvolume 291, with an ID of 292, starts at
ioctl.c:create_snapshot(). This triggers a commit of transaction 33 and
we end up at transaction.c:create_pending_snapshot(), in the critical
section of a transaction commit.

There we COW the root of subvolume 291, which is extent buffer 36048896.
The COW operation returns extent buffer 36048896, since there's no need
to COW because the extent buffer was created in this transaction and it
was not written yet.

The we call btrfs_copy_root() against the root node 36048896. During
this operation we allocate a new extent buffer to turn into the root
node of the snapshot, copy the contents of the root node 36048896 into
this snapshot root extent buffer, set the owner to 292 (the ID of the
snapshot), etc, and then we call btrfs_inc_ref(). This will create a
delayed reference for each leaf pointed by the root node with a
reference root of 292 - this includes a reference for the leaf
36007936.

After that we set the bit BTRFS_ROOT_FORCE_COW in the root's state.

Then we call btrfs_insert_dir_item(), to create the directory entry in
in the tree of subvolume 291 that points to the snapshot. This ends up
needing to modify leaf 36007936 to insert the respective directory
items. Because the bit BTRFS_ROOT_FORCE_COW is set for the root's state,
we need to COW the leaf. We end up at btrfs_force_cow_block() and then
at update_ref_for_cow().

At update_ref_for_cow() we call btrfs_block_can_be_shared() which
returns false, despite the fact the leaf 36007936 is shared - the
subvolume's root and the snapshot's root point to that leaf. The
reason that it incorrectly returns false is because the commit root
of the subvolume is extent buffer 36007936 - it was the initial root
of the subvolume when we created it. So btrfs_block_can_be_shared()
which has the following logic:

int btrfs_block_can_be_shared(struct btrfs_root *root,
struct extent_buffer *buf)
{
if (test_bit(BTRFS_ROOT_SHAREABLE, &root->state) &&
buf != root->node && buf != root->commit_root &&
(btrfs_header_generation(buf) <=
btrfs_root_last_snapshot(&root->root_item) ||
btrfs_header_flag(buf, BTRFS_HEADER_FLAG_RELOC)))
return 1;

return 0;
}

Returns false (0) since 'buf' (extent buffer 36007936) matches the
root's commit root.

As a result, at update_ref_for_cow(), we don't check for the number
of references for extent buffer 36007936, we just assume it's not
shared and therefore that it has only 1 reference, so we set the local
variable 'refs' to 1.

Later on, in the final if-else statement at update_ref_for_cow():

static noinline int update_ref_for_cow(struct btrfs_trans_handle *trans,
struct btrfs_root *root,
struct extent_buffer *buf,
struct extent_buffer *cow,
int *last_ref)
{
(...)
if (refs > 1) {
(...)
} else {
(...)
btrfs_clear_buffer_dirty(trans, buf);
*last_ref = 1;
}
}

So we mark the extent buffer 36007936 as not dirty, and as a result
we don't write it to disk later in the transaction commit, despite the
fact that the snapshot's root points to it.

Attempting to access the leaf or dumping the tree for example shows
that the extent buffer was not written:

$ btrfs inspect-internal dump-tree -t 292 /dev/sdb
btrfs-progs v6.2.2
file tree key (292 ROOT_ITEM 33)
node 36110336 level 1 items 2 free space 119 generation 33 owner 292
node 36110336 flags 0x1(WRITTEN) backref revision 1
checksum stored a8103e3e
checksum calced a8103e3e
fs uuid 90c9a46f-ae9f-4626-9aff-0cbf3e2e3a79
chunk uuid e8c9c885-78f4-4d31-85fe-89e5f5fd4a07
key (256 INODE_ITEM 0) block 36007936 gen 33
key (257 EXTENT_DATA 0) block 36052992 gen 33
checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29
checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29
total bytes 107374182400
bytes used 38572032
uuid 90c9a46f-ae9f-4626-9aff-0cbf3e2e3a79

The respective on disk region is full of zeroes as the device was
trimmed at mkfs time.

Obviously 'btrfs check' also detects and complains about this:

$ btrfs check /dev/sdb
Opening filesystem to check...
Checking filesystem on /dev/sdb
UUID: 90c9a46f-ae9f-4626-9aff-0cbf3e2e3a79
generation: 33 (33)
[1/7] checking root items
[2/7] checking extents
checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29
checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29
checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29
bad tree block 36007936, bytenr mismatch, want=36007936, have=0
owner ref check failed [36007936 4096]
ERROR: errors found in extent allocation tree or chunk allocation
[3/7] checking free space tree
[4/7] checking fs roots
checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29
checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29
checksum verify failed on 36007936 wanted 0x00000000 found 0x86005f29
bad tree block 36007936, bytenr mismatch, want=36007936, have=0
The following tree block(s) is corrupted in tree 292:
tree block bytenr: 36110336, level: 1, node key: (256, 1, 0)
root 292 root dir 256 not found
ERROR: errors found in fs roots
found 38572032 bytes used, error(s) found
total csum bytes: 16048
total tree bytes: 1265664
total fs tree bytes: 1118208
total extent tree bytes: 65536
btree space waste bytes: 562598
file data blocks allocated: 65978368
referenced 36569088

Fix this by updating btrfs_block_can_be_shared() to consider that an
extent buffer may be shared if it matches the commit root and if its
generation matches the current transaction's generation.

This can be reproduced with the following script:

$ cat test.sh
#!/bin/bash

MNT=/mnt/sdi
DEV=/dev/sdi

# Use a filesystem with a 64K node size so that we have the same node
# size on every machine regardless of its page size (on x86_64 default
# node size is 16K due to the 4K page size, while on PPC it's 64K by
# default). This way we can make sure we are able to create a btree for
# the subvolume with a height of 2.
mkfs.btrfs -f -n 64K $DEV
mount $DEV $MNT

btrfs subvolume create $MNT/subvol

# Create a few empty files on the subvolume, this bumps its btree
# height to 2 (root node at level 1 and 2 leaves).
for ((i = 1; i <= 300; i++)); do
echo -n > $MNT/subvol/file_$i
done

btrfs subvolume snapshot -r $MNT/subvol $MNT/subvol/snap

umount $DEV

btrfs check $DEV

Running it on a 6.5 kernel (or any 6.6-rc kernel at the moment):

$ ./test.sh
Create subvolume '/mnt/sdi/subvol'
Create a readonly snapshot of '/mnt/sdi/subvol' in '/mnt/sdi/subvol/snap'
Opening filesystem to check...
Checking filesystem on /dev/sdi
UUID: bbdde2ff-7d02-45ca-8a73-3c36f23755a1
[1/7] checking root items
[2/7] checking extents
parent transid verify failed on 30539776 wanted 7 found 5
parent transid verify failed on 30539776 wanted 7 found 5
parent transid verify failed on 30539776 wanted 7 found 5
Ignoring transid failure
owner ref check failed [30539776 65536]
ERROR: errors found in extent allocation tree or chunk allocation
[3/7] checking free space tree
[4/7] checking fs roots
parent transid verify failed on 30539776 wanted 7 found 5
Ignoring transid failure
Wrong key of child node/leaf, wanted: (256, 1, 0), have: (2, 132, 0)
Wrong generation of child node/leaf, wanted: 5, have: 7
root 257 root dir 256 not found
ERROR: errors found in fs roots
found 917504 bytes used, error(s) found
total csum bytes: 0
total tree bytes: 851968
total fs tree bytes: 393216
total extent tree bytes: 65536
btree space waste bytes: 736550
file data blocks allocated: 0
referenced 0

A test case for fstests will follow soon.

Fixes: 1b53e51a4a8f ("btrfs: don't commit transaction for every subvol create")
CC: stable@vger.kernel.org # 6.5+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

show more ...


Revision tags: v6.5.7, v6.5.6, v6.5.5, v6.5.4, v6.5.3, v6.5.2, v6.1.51, v6.5.1, v6.1.50, v6.5, v6.1.49, v6.1.48, v6.1.46
# 182741d2 11-Aug-2023 Qu Wenruo <wqu@suse.com>

btrfs: remove v0 extent handling

The v0 extent item has been deprecated for a long time, and we don't have
any report from the community either.

So it's time to remove the v0 extent specific error

btrfs: remove v0 extent handling

The v0 extent item has been deprecated for a long time, and we don't have
any report from the community either.

So it's time to remove the v0 extent specific error handling, and just
treat them as regular extent tree corruption.

This patch would remove the btrfs_print_v0_err() helper, and enhance the
involved error handling to treat them just as any extent tree
corruption. No reports regarding v0 extents have been seen since the
graceful handling was added in 2018.

This involves:

- btrfs_backref_add_tree_node()
This change is a little tricky, the new code is changed to only handle
BTRFS_TREE_BLOCK_REF_KEY and BTRFS_SHARED_BLOCK_REF_KEY.

But this is safe, as we have rejected any unknown inline refs through
btrfs_get_extent_inline_ref_type().
For keyed backrefs, we're safe to skip anything we don't know (that's
if it can pass tree-checker in the first place).

- btrfs_lookup_extent_info()
- lookup_inline_extent_backref()
- run_delayed_extent_op()
- __btrfs_free_extent()
- add_tree_block()
Regular error handling of unexpected extent tree item, and abort
transaction (if we have a trans handle).

- remove_extent_data_ref()
It's pretty much the same as the regular rejection of unknown backref
key.
But for this particular case, we can also remove a BUG_ON().

- extent_data_ref_count()
We can remove the BTRFS_EXTENT_REF_V0_KEY BUG_ON(), as it would be
rejected by the only caller.

- btrfs_print_leaf()
Remove the handling for BTRFS_EXTENT_REF_V0_KEY.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

show more ...


Revision tags: v6.1.45, v6.1.44, v6.1.43, v6.1.42, v6.1.41, v6.1.40, v6.1.39, v6.1.38, v6.1.37, v6.1.36, v6.4, v6.1.35, v6.1.34, v6.1.33, v6.1.32, v6.1.31, v6.1.30, v6.1.29, v6.1.28
# 0cad8f14 09-May-2023 Filipe Manana <fdmanana@suse.com>

btrfs: fix backref walking not returning all inode refs

When using the logical to ino ioctl v2, if the flag to ignore offsets of
file extent items (BTRFS_LOGICAL_INO_ARGS_IGNORE_OFFSET) is given, th

btrfs: fix backref walking not returning all inode refs

When using the logical to ino ioctl v2, if the flag to ignore offsets of
file extent items (BTRFS_LOGICAL_INO_ARGS_IGNORE_OFFSET) is given, the
backref walking code ends up not returning references for all file offsets
of an inode that point to the given logical bytenr. This happens since
kernel 6.2, commit 6ce6ba534418 ("btrfs: use a single argument for extent
offset in backref walking functions") because:

1) It mistakenly skipped the search for file extent items in a leaf that
point to the target extent if that flag is given. Instead it should
only skip the filtering done by check_extent_in_eb() - that is, it
should not avoid the calls to that function (or find_extent_in_eb(),
which uses it).

2) It was also not building a list of inode extent elements (struct
extent_inode_elem) if we have multiple inode references for an extent
when the ignore offset flag is given to the logical to ino ioctl - it
would leave a single element, only the last one that was found.

These stem from the confusing old interface for backref walking functions
where we had an extent item offset argument that was a pointer to a u64
and another boolean argument that indicated if the offset should be
ignored, but the pointer could be NULL. That NULL case is used by
relocation, qgroup extent accounting and fiemap, simply to avoid building
the inode extent list for each reference, as it's not necessary for those
use cases and therefore avoids memory allocations and some computations.

Fix this by adding a boolean argument to the backref walk context
structure to indicate that the inode extent list should not be built,
make relocation set that argument to true and fix the backref walking
logic to skip the calls to check_extent_in_eb() and find_extent_in_eb()
only if this new argument is true, instead of 'ignore_extent_item_pos'
being true.

A test case for fstests will be added soon, to provide cover not only
for these cases but to the logical to ino ioctl in general as well, as
currently we do not have a test case for it.

Reported-by: Vladimir Panteleev <git@vladimir.panteleev.md>
Link: https://lore.kernel.org/linux-btrfs/CAHhfkvwo=nmzrJSqZ2qMfF-rZB-ab6ahHnCD_sq9h4o8v+M7QQ@mail.gmail.com/
Fixes: 6ce6ba534418 ("btrfs: use a single argument for extent offset in backref walking functions")
CC: stable@vger.kernel.org # 6.2+
Tested-by: Vladimir Panteleev <git@vladimir.panteleev.md>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

show more ...


Revision tags: v6.1.27, v6.1.26, v6.3, v6.1.25, v6.1.24, v6.1.23, v6.1.22
# 2280d425 28-Mar-2023 Filipe Manana <fdmanana@suse.com>

btrfs: ignore fiemap path cache when there are multiple paths for a node

During fiemap, when walking backreferences to determine if a b+tree
node/leaf is shared, we may find a tree block (leaf or no

btrfs: ignore fiemap path cache when there are multiple paths for a node

During fiemap, when walking backreferences to determine if a b+tree
node/leaf is shared, we may find a tree block (leaf or node) for which
two parents were added to the references ulist. This happens if we get
for example one direct ref (shared tree block ref) and one indirect ref
(non-shared tree block ref) for the tree block at the current level,
which can happen during relocation.

In that case the fiemap path cache can not be used since it's meant for
a single path, with one tree block at each possible level, so having
multiple references for a tree block at any level may result in getting
the level counter exceed BTRFS_MAX_LEVEL and eventually trigger the
warning:

WARN_ON_ONCE(level >= BTRFS_MAX_LEVEL)

at lookup_backref_shared_cache() and at store_backref_shared_cache().
This is harmless since the code ignores any level >= BTRFS_MAX_LEVEL, the
warning is there just to catch any unexpected case like the one described
above. However if a user finds this it may be scary and get reported.

So just ignore the path cache once we find a tree block for which there
are more than one reference, which is the less common case, and update
the cache with the sharedness check result for all levels below the level
for which we found multiple references.

Reported-by: Jarno Pelkonen <jarno.pelkonen@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CAKv8qLmDNAGJGCtsevxx_VZ_YOvvs1L83iEJkTgyA4joJertng@mail.gmail.com/
Fixes: 12a824dc67a6 ("btrfs: speedup checking for extent sharedness during fiemap")
CC: stable@vger.kernel.org # 6.1+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

show more ...


Revision tags: v6.1.21, v6.1.20, v6.1.19, v6.1.18, v6.1.17, v6.1.16, v6.1.15, v6.1.14, v6.1.13, v6.2, v6.1.12, v6.1.11, v6.1.10, v6.1.9, v6.1.8, v6.1.7
# e2fd8306 17-Jan-2023 Filipe Manana <fdmanana@suse.com>

btrfs: skip backref walking during fiemap if we know the leaf is shared

During fiemap, when checking if a data extent is shared we are doing the
backref walking even if we already know the leaf is s

btrfs: skip backref walking during fiemap if we know the leaf is shared

During fiemap, when checking if a data extent is shared we are doing the
backref walking even if we already know the leaf is shared, which is a
waste of time since if the leaf shared then the data extent is also
shared. So skip the backref walking when we know we are in a shared leaf.

The following test was measures the gains for a case where all leaves
are shared due to a snapshot:

$ cat test.sh
#!/bin/bash

DEV=/dev/sdj
MNT=/mnt/sdj

umount $DEV &> /dev/null
mkfs.btrfs -f $DEV
# Use compression to quickly create files with a lot of extents
# (each with a size of 128K).
mount -o compress=lzo $DEV $MNT

# 40G gives 327680 extents, each with a size of 128K.
xfs_io -f -c "pwrite -S 0xab -b 1M 0 40G" $MNT/foobar

# Add some more files to increase the size of the fs and extent
# trees (in the real world there's a lot of files and extents
# from other files).
xfs_io -f -c "pwrite -S 0xcd -b 1M 0 20G" $MNT/file1
xfs_io -f -c "pwrite -S 0xef -b 1M 0 20G" $MNT/file2
xfs_io -f -c "pwrite -S 0x73 -b 1M 0 20G" $MNT/file3

# Create a snapshot so all the extents become indirectly shared
# through subtrees, with a generation less than or equals to the
# generation used to create the snapshot.
btrfs subvolume snapshot -r $MNT $MNT/snap1

# Unmount and mount again to clear cached metadata.
umount $MNT
mount -o compress=lzo $DEV $MNT

start=$(date +%s%N)
# The filefrag tool uses the fiemap ioctl.
filefrag $MNT/foobar
end=$(date +%s%N)
dur=$(( (end - start) / 1000000 ))
echo "fiemap took $dur milliseconds (metadata not cached)"
echo

start=$(date +%s%N)
filefrag $MNT/foobar
end=$(date +%s%N)
dur=$(( (end - start) / 1000000 ))
echo "fiemap took $dur milliseconds (metadata cached)"

umount $MNT

The results were the following on a non-debug kernel (Debian's default
kernel config).

Before this patch:

(...)
/mnt/sdi/foobar: 327680 extents found
fiemap took 1821 milliseconds (metadata not cached)

/mnt/sdi/foobar: 327680 extents found
fiemap took 399 milliseconds (metadata cached)

After this patch:

(...)
/mnt/sdi/foobar: 327680 extents found
fiemap took 591 milliseconds (metadata not cached)

/mnt/sdi/foobar: 327680 extents found
fiemap took 123 milliseconds (metadata cached)

That's a speedup of 3.1x and 3.2x.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

show more ...


# 4e4488d4 17-Jan-2023 Filipe Manana <fdmanana@suse.com>

btrfs: assert commit root semaphore is held when accessing backref cache

During fiemap, when accessing the cache that stores the sharedness of an
extent, we need to either be holding a transaction h

btrfs: assert commit root semaphore is held when accessing backref cache

During fiemap, when accessing the cache that stores the sharedness of an
extent, we need to either be holding a transaction handle or the commit
root semaphore. I left comments about this in the comment that precedes
store_backref_shared_cache() and lookup_backref_shared_cache(), but have
actually not enforced it through assertions. So assert that the commit
root semaphore is held if we are not holding a transaction handle.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

show more ...


Revision tags: v6.1.6, v6.1.5, v6.0.19, v6.0.18, v6.1.4, v6.1.3, v6.0.17, v6.1.2, v6.0.16, v6.1.1, v6.0.15, v6.0.14
# 560840af 14-Dec-2022 Boris Burkov <boris@bur.io>

btrfs: fix resolving backrefs for inline extent followed by prealloc

If a file consists of an inline extent followed by a regular or prealloc
extent, then a legitimate attempt to resolve a logical a

btrfs: fix resolving backrefs for inline extent followed by prealloc

If a file consists of an inline extent followed by a regular or prealloc
extent, then a legitimate attempt to resolve a logical address in the
non-inline region will result in add_all_parents reading the invalid
offset field of the inline extent. If the inline extent item is placed
in the leaf eb s.t. it is the first item, attempting to access the
offset field will not only be meaningless, it will go past the end of
the eb and cause this panic:

[17.626048] BTRFS warning (device dm-2): bad eb member end: ptr 0x3fd4 start 30834688 member offset 16377 size 8
[17.631693] general protection fault, probably for non-canonical address 0x5088000000000: 0000 [#1] SMP PTI
[17.635041] CPU: 2 PID: 1267 Comm: btrfs Not tainted 5.12.0-07246-g75175d5adc74-dirty #199
[17.637969] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[17.641995] RIP: 0010:btrfs_get_64+0xe7/0x110
[17.649890] RSP: 0018:ffffc90001f73a08 EFLAGS: 00010202
[17.651652] RAX: 0000000000000001 RBX: ffff88810c42d000 RCX: 0000000000000000
[17.653921] RDX: 0005088000000000 RSI: ffffc90001f73a0f RDI: 0000000000000001
[17.656174] RBP: 0000000000000ff9 R08: 0000000000000007 R09: c0000000fffeffff
[17.658441] R10: ffffc90001f73790 R11: ffffc90001f73788 R12: ffff888106afe918
[17.661070] R13: 0000000000003fd4 R14: 0000000000003f6f R15: cdcdcdcdcdcdcdcd
[17.663617] FS: 00007f64e7627d80(0000) GS:ffff888237c80000(0000) knlGS:0000000000000000
[17.666525] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[17.668664] CR2: 000055d4a39152e8 CR3: 000000010c596002 CR4: 0000000000770ee0
[17.671253] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[17.673634] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[17.676034] PKRU: 55555554
[17.677004] Call Trace:
[17.677877] add_all_parents+0x276/0x480
[17.679325] find_parent_nodes+0xfae/0x1590
[17.680771] btrfs_find_all_leafs+0x5e/0xa0
[17.682217] iterate_extent_inodes+0xce/0x260
[17.683809] ? btrfs_inode_flags_to_xflags+0x50/0x50
[17.685597] ? iterate_inodes_from_logical+0xa1/0xd0
[17.687404] iterate_inodes_from_logical+0xa1/0xd0
[17.689121] ? btrfs_inode_flags_to_xflags+0x50/0x50
[17.691010] btrfs_ioctl_logical_to_ino+0x131/0x190
[17.692946] btrfs_ioctl+0x104a/0x2f60
[17.694384] ? selinux_file_ioctl+0x182/0x220
[17.695995] ? __x64_sys_ioctl+0x84/0xc0
[17.697394] __x64_sys_ioctl+0x84/0xc0
[17.698697] do_syscall_64+0x33/0x40
[17.700017] entry_SYSCALL_64_after_hwframe+0x44/0xae
[17.701753] RIP: 0033:0x7f64e72761b7
[17.709355] RSP: 002b:00007ffefb067f58 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[17.712088] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f64e72761b7
[17.714667] RDX: 00007ffefb067fb0 RSI: 00000000c0389424 RDI: 0000000000000003
[17.717386] RBP: 00007ffefb06d188 R08: 000055d4a390d2b0 R09: 00007f64e7340a60
[17.719938] R10: 0000000000000231 R11: 0000000000000246 R12: 0000000000000001
[17.722383] R13: 0000000000000000 R14: 00000000c0389424 R15: 000055d4a38fd2a0
[17.724839] Modules linked in:

Fix the bug by detecting the inline extent item in add_all_parents and
skipping to the next extent item.

CC: stable@vger.kernel.org # 4.9+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>

show more ...


Revision tags: v6.0.13, v6.1, v6.0.12, v6.0.11, v6.0.10, v5.15.80, v6.0.9, v5.15.79
# 27137fac 15-Nov-2022 Christoph Hellwig <hch@lst.de>

btrfs: move struct btrfs_tree_parent_check out of disk-io.h

Move struct btrfs_tree_parent_check out of disk-io.h so that volumes.h
an various .c files don't have to include disk-io.h just for it.

R

btrfs: move struct btrfs_tree_parent_check out of disk-io.h

Move struct btrfs_tree_parent_check out of disk-io.h so that volumes.h
an various .c files don't have to include disk-io.h just for it.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
[ use tree-checker.h for the structure ]
Signed-off-by: David Sterba <dsterba@suse.com>

show more ...


Revision tags: v6.0.8, v5.15.78, v6.0.7, v5.15.77, v5.15.76, v6.0.6, v6.0.5, v5.15.75, v6.0.4, v6.0.3, v6.0.2, v5.15.74, v5.15.73, v6.0.1, v5.15.72, v6.0, v5.15.71, v5.15.70, v5.15.69, v5.15.68
# 789d6a3a 14-Sep-2022 Qu Wenruo <wqu@suse.com>

btrfs: concentrate all tree block parentness check parameters into one structure

There are several different tree block parentness check parameters used
across several helpers:

- level
Mandatory

btrfs: concentrate all tree block parentness check parameters into one structure

There are several different tree block parentness check parameters used
across several helpers:

- level
Mandatory

- transid
Under most cases it's mandatory, but there are several backref cases
which skips this check.

- owner_root
- first_key
Utilized by most top-down tree search routine. Otherwise can be
skipped.

Those four members are not always mandatory checks, and some of them are
the same u64, which means if some arguments got swapped compiler will
not catch it.

Furthermore if we're going to further expand the parentness check, we
need to modify quite some helpers just to add one more parameter.

This patch will concentrate all these members into a structure called
btrfs_tree_parent_check, and pass that structure for the following
helpers:

- btrfs_read_extent_buffer()
- read_tree_block()

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

show more ...


# adf02418 01-Nov-2022 Filipe Manana <fdmanana@suse.com>

btrfs: send: skip resolution of our own backref when finding clone source

When doing backref walking to determine a source range to clone from, it
is worthless to collect and resolve our own data ba

btrfs: send: skip resolution of our own backref when finding clone source

When doing backref walking to determine a source range to clone from, it
is worthless to collect and resolve our own data backref, as we can't
obviously use it as a clone source and it represents the range we want to
clone into. Collecting the backref implies doing the extra work to resolve
it, doing the search for a file extent item in a subvolume tree, etc.
Skipping the data backref is valid as long as we only have the send root
as the single clone root, otherwise the leaf with the file extent item may
be accessible from another clone root due to shared subtrees created by
snapshots, and therefore we have to collect the backref and resolve it.

So add a callback to the backref walking code to guide it to skip data
backrefs.

This change is part of a patchset comprised of the following patches:

01/17 btrfs: fix inode list leak during backref walking at resolve_indirect_refs()
02/17 btrfs: fix inode list leak during backref walking at find_parent_nodes()
03/17 btrfs: fix ulist leaks in error paths of qgroup self tests
04/17 btrfs: remove pointless and double ulist frees in error paths of qgroup tests
05/17 btrfs: send: avoid unnecessary path allocations when finding extent clone
06/17 btrfs: send: update comment at find_extent_clone()
07/17 btrfs: send: drop unnecessary backref context field initializations
08/17 btrfs: send: avoid unnecessary backref lookups when finding clone source
09/17 btrfs: send: optimize clone detection to increase extent sharing
10/17 btrfs: use a single argument for extent offset in backref walking functions
11/17 btrfs: use a structure to pass arguments to backref walking functions
12/17 btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes()
13/17 btrfs: constify ulist parameter of ulist_next()
14/17 btrfs: send: cache leaf to roots mapping during backref walking
15/17 btrfs: send: skip unnecessary backref iterations
16/17 btrfs: send: avoid double extent tree search when finding clone source
17/17 btrfs: send: skip resolution of our own backref when finding clone source

The following test was run on non-debug kernel (Debian's default kernel
config) before and after applying the patchset:

$ cat test-send-many-shared-extents.sh
#!/bin/bash

DEV=/dev/sdh
MNT=/mnt/sdh

umount $DEV &> /dev/null
mkfs.btrfs -f $DEV
mount $DEV $MNT

num_files=50000
num_clones_per_file=50

for ((i = 1; i <= $num_files; i++)); do
xfs_io -f -c "pwrite 0 64K" $MNT/file_$i > /dev/null
echo -ne "\r$i files created..."
done
echo

btrfs subvolume snapshot -r $MNT $MNT/snap1

cloned=0
for ((i = 1; i <= $num_clones_per_file; i++)); do
for ((j = 1; j <= $num_files; j++)); do
cp --reflink=always $MNT/file_$j $MNT/file_${j}_clone_${i}
cloned=$((cloned + 1))
echo -ne "\r$cloned / $((num_files * num_clones_per_file)) clone operations"
done
done
echo

btrfs subvolume snapshot -r $MNT $MNT/snap2

# Unmount and mount again to clear all cached metadata (and data).
umount $DEV
mount $DEV $MNT

start=$(date +%s%N)
btrfs send $MNT/snap2 > /dev/null
end=$(date +%s%N)

dur=$(( (end - start) / 1000000000 ))
echo -e "\nFull send took $dur seconds"

# Unmount and mount again to clear all cached metadata (and data).
umount $DEV
mount $DEV $MNT

start=$(date +%s%N)
btrfs send -p $MNT/snap1 $MNT/snap2 > /dev/null
end=$(date +%s%N)

dur=$(( (end - start) / 1000000000 ))
echo -e "\nIncremental send took $dur seconds"

umount $MNT

Before applying the patchset:

(...)
Full send took 1108 seconds
(...)
Incremental send took 1135 seconds

After applying the whole patchset:

(...)
Full send took 268 seconds (-75.8%)
(...)
Incremental send took 316 seconds (-72.2%)

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

show more ...


# f73853c7 01-Nov-2022 Filipe Manana <fdmanana@suse.com>

btrfs: send: avoid double extent tree search when finding clone source

At find_extent_clone() we search twice for the extent item corresponding
to the data extent that the current file extent items

btrfs: send: avoid double extent tree search when finding clone source

At find_extent_clone() we search twice for the extent item corresponding
to the data extent that the current file extent items points to:

1) Once with a call to extent_from_logical();

2) Once again during backref walking, through iterate_extent_inodes()
which eventually leads to find_parent_nodes() where we will search
again the extent tree for the same extent item.

The extent tree can be huge, so doing this one extra search for every
extent we want to send adds up and it's expensive.

The first call is there since the send code was introduced and it
accomplishes two things:

1) Check that the extent is flagged as a data extent in the extent tree.
But it can not be anything else, otherwise we wouldn't have a file
extent item in the send root pointing to it.
This was probably added to catch bugs in the early days where send was
yet too young and the interaction with everything else was far from
perfect;

2) Check how many direct references there are on the extent, and if
there's too many (more than SEND_MAX_EXTENT_REFS), avoid doing the
backred walking as it may take too long and slowdown send.

So improve on this by having a callback in the backref walking code that
is called when it finds the extent item in the extent tree, and have those
checks done in the callback. When the callback returns anything different
from 0, it stops the backref walking code. This way we do a single search
on the extent tree for the extent item of our data extent.

Also, before this change we were only checking the number of references on
the data extent against SEND_MAX_EXTENT_REFS, but after starting backref
walking we will end up resolving backrefs for extent buffers in the path
from a leaf having a file extent item pointing to our data extent, up to
roots of trees from which the extent buffer is accessible from, due to
shared subtrees resulting from snapshoting. We were therefore allowing for
the possibility for send taking too long due to some node in the path from
the leaf to a root node being shared too many times. After this change we
check for reference counts being greater than SEND_MAX_EXTENT_REFS for
both data extents and metadata extents.

This change is part of a patchset comprised of the following patches:

01/17 btrfs: fix inode list leak during backref walking at resolve_indirect_refs()
02/17 btrfs: fix inode list leak during backref walking at find_parent_nodes()
03/17 btrfs: fix ulist leaks in error paths of qgroup self tests
04/17 btrfs: remove pointless and double ulist frees in error paths of qgroup tests
05/17 btrfs: send: avoid unnecessary path allocations when finding extent clone
06/17 btrfs: send: update comment at find_extent_clone()
07/17 btrfs: send: drop unnecessary backref context field initializations
08/17 btrfs: send: avoid unnecessary backref lookups when finding clone source
09/17 btrfs: send: optimize clone detection to increase extent sharing
10/17 btrfs: use a single argument for extent offset in backref walking functions
11/17 btrfs: use a structure to pass arguments to backref walking functions
12/17 btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes()
13/17 btrfs: constify ulist parameter of ulist_next()
14/17 btrfs: send: cache leaf to roots mapping during backref walking
15/17 btrfs: send: skip unnecessary backref iterations
16/17 btrfs: send: avoid double extent tree search when finding clone source
17/17 btrfs: send: skip resolution of our own backref when finding clone source

Performance test results are in the changelog of patch 17/17.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

show more ...


# 88ffb665 01-Nov-2022 Filipe Manana <fdmanana@suse.com>

btrfs: send: skip unnecessary backref iterations

When looking for a clone source for an extent, we are iterating over all
the backreferences for an extent. This is often a waste of time, because
onc

btrfs: send: skip unnecessary backref iterations

When looking for a clone source for an extent, we are iterating over all
the backreferences for an extent. This is often a waste of time, because
once we find a good clone source we could stop immediately instead of
continuing backref walking, which is expensive.

Basically what happens currently is this:

1) Call iterate_extent_inodes() to iterate over all the backreferences;

2) It calls btrfs_find_all_leafs() which in turn calls the main function
to walk over backrefs and collect them - find_parent_nodes();

3) Then we collect all the references for our target data extent from the
extent tree (and delayed refs if any), add them to the rb trees,
resolve all the indirect backreferences and search for all the file
extent items in fs trees, building a list of inodes for each one of
them (struct extent_inode_elem);

4) Then back at iterate_extent_inodes() we find all the roots associated
to each found leaf, and call the callback __iterate_backrefs defined
at send.c for each inode in the inode list associated to each leaf.

Some times one the first backreferences we find in a fs tree is optimal
to satisfy the clone operation that send wants to perform, and in that
case we could stop immediately and avoid resolving all the remaining
indirect backreferences (search fs trees for the respective file extent
items, etc). This possibly if when we find a fs tree leaf with a file
extent item we are able to know what are all the roots that can lead to
the leaf - this is now possible after the previous patch in the series
that adds a cache that maps leaves to a list of roots. So we can now
shortcircuit backref walking during send, by having the callback we
pass to iterate_extent_inodes() to be called when we find a file extent
item for an indirect backreference, and have it return a special value
when it found a suitable backreference and it does not need to look for
more backreferences. This change does that.

This change is part of a patchset comprised of the following patches:

01/17 btrfs: fix inode list leak during backref walking at resolve_indirect_refs()
02/17 btrfs: fix inode list leak during backref walking at find_parent_nodes()
03/17 btrfs: fix ulist leaks in error paths of qgroup self tests
04/17 btrfs: remove pointless and double ulist frees in error paths of qgroup tests
05/17 btrfs: send: avoid unnecessary path allocations when finding extent clone
06/17 btrfs: send: update comment at find_extent_clone()
07/17 btrfs: send: drop unnecessary backref context field initializations
08/17 btrfs: send: avoid unnecessary backref lookups when finding clone source
09/17 btrfs: send: optimize clone detection to increase extent sharing
10/17 btrfs: use a single argument for extent offset in backref walking functions
11/17 btrfs: use a structure to pass arguments to backref walking functions
12/17 btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes()
13/17 btrfs: constify ulist parameter of ulist_next()
14/17 btrfs: send: cache leaf to roots mapping during backref walking
15/17 btrfs: send: skip unnecessary backref iterations
16/17 btrfs: send: avoid double extent tree search when finding clone source
17/17 btrfs: send: skip resolution of our own backref when finding clone source

Performance test results are in the changelog of patch 17/17.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

show more ...


# 66d04209 01-Nov-2022 Filipe Manana <fdmanana@suse.com>

btrfs: send: cache leaf to roots mapping during backref walking

During a send operation, when doing backref walking to determine which
inodes/offsets/roots we can clone from, the most repetitive and

btrfs: send: cache leaf to roots mapping during backref walking

During a send operation, when doing backref walking to determine which
inodes/offsets/roots we can clone from, the most repetitive and expensive
step is to map each leaf that has file extent items pointing to the target
data extent to the IDs of the roots from which the leaves are accessible,
which happens at iterate_extent_inodes(). That step requires finding every
parent node of a leaf, then the parent of each parent, and so on until we
reach a root node. So it's a naturally expensive operation, and repetitive
because each leaf can have hundreds of file extent items (for a nodesize
of 16K, that can be slightly over 200 file extent items). There's also
temporal locality, as we process all file extent items from a leave before
moving the next leaf.

This change caches the mapping of leaves to root IDs, to avoid repeating
those computations over and over again. The cache is limited to a maximum
of 128 entries, with each entry being a struct with a size of 128 bytes,
so the maximum cache size is 16K plus any nodes internally allocated by
the maple tree that is used to index pointers to those structs. The cache
is invalidated whenever we detect relocation happened since we started
filling the cache, because if relocation happened then extent buffers for
leaves and nodes of the trees used by a send operation may have been
reallocated.

This cache also allows for another important optimization that is
introduced in the next patch in the series.

This change is part of a patchset comprised of the following patches:

01/17 btrfs: fix inode list leak during backref walking at resolve_indirect_refs()
02/17 btrfs: fix inode list leak during backref walking at find_parent_nodes()
03/17 btrfs: fix ulist leaks in error paths of qgroup self tests
04/17 btrfs: remove pointless and double ulist frees in error paths of qgroup tests
05/17 btrfs: send: avoid unnecessary path allocations when finding extent clone
06/17 btrfs: send: update comment at find_extent_clone()
07/17 btrfs: send: drop unnecessary backref context field initializations
08/17 btrfs: send: avoid unnecessary backref lookups when finding clone source
09/17 btrfs: send: optimize clone detection to increase extent sharing
10/17 btrfs: use a single argument for extent offset in backref walking functions
11/17 btrfs: use a structure to pass arguments to backref walking functions
12/17 btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes()
13/17 btrfs: constify ulist parameter of ulist_next()
14/17 btrfs: send: cache leaf to roots mapping during backref walking
15/17 btrfs: send: skip unnecessary backref iterations
16/17 btrfs: send: avoid double extent tree search when finding clone source
17/17 btrfs: send: skip resolution of our own backref when finding clone source

Performance test results are in the changelog of patch 17/17.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

show more ...


# 1baea6f1 01-Nov-2022 Filipe Manana <fdmanana@suse.com>

btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes()

At iterate_extent_inodes() we collect a ulist of leaves for a given extent
with a call to btrfs_find_all_leafs() and then

btrfs: reuse roots ulist on each leaf iteration for iterate_extent_inodes()

At iterate_extent_inodes() we collect a ulist of leaves for a given extent
with a call to btrfs_find_all_leafs() and then we enter a loop where we
iterate over all the collected leaves. Each iteration of that loop does a
call to btrfs_find_all_roots_safe(), to determine all roots from which a
leaf is accessible, and that results in allocating and releasing a ulist
to store the root IDs.

Instead of allocating and releasing the roots ulist on every iteration,
allocate a ulist before entering the loop and keep using it on each
iteration, reinitializing the ulist at the end of each iteration.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

show more ...


# a2c8d27e 01-Nov-2022 Filipe Manana <fdmanana@suse.com>

btrfs: use a structure to pass arguments to backref walking functions

The public backref walking functions have quite a lot of arguments that
are passed down the call stack to find_parent_nodes(), t

btrfs: use a structure to pass arguments to backref walking functions

The public backref walking functions have quite a lot of arguments that
are passed down the call stack to find_parent_nodes(), the core function
of the backref walking code.

The next patches in series will need to add even arguments to these
functions that should be passed not only to find_parent_nodes(), but also
to other functions used by the later (directly or even lower in the call
stack).

So create a structure to hold all these arguments and state used by the
main backref walking function, find_parent_nodes(), and use it as the
argument for the public backref walking functions iterate_extent_inodes(),
btrfs_find_all_leafs() and btrfs_find_all_roots().

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

show more ...


# 6ce6ba53 01-Nov-2022 Filipe Manana <fdmanana@suse.com>

btrfs: use a single argument for extent offset in backref walking functions

The interface for find_parent_nodes() has two extent offset related
arguments:

1) One u64 pointer argument for the extent

btrfs: use a single argument for extent offset in backref walking functions

The interface for find_parent_nodes() has two extent offset related
arguments:

1) One u64 pointer argument for the extent offset;

2) One boolean argument to tell if the extent offset should be ignored or
not.

These are confusing, becase the extent offset pointer can be NULL and in
some cases callers pass a NULL value as a way to tell the backref walking
code to ignore offsets in file extent items (and simply consider all file
extent items that point to the target data extent).

The boolean argument was added in commit c995ab3cda3f ("btrfs: add a flag
to iterate_inodes_from_logical to find all extent refs for uncompressed
extents"), but it was never really necessary, it was enough if it could
find a way to get a NULL value passed to the "extent_item_pos" argument of
find_parent_nodes(). The arguments are also passed to functions called
by find_parent_nodes() and respective helper functions, which further
makes everything more complicated than needed.

Then we have several backref walking related functions that end up calling
find_parent_nodes(), either directly or through some other function that
they call, and for many we have to use an "extent_item_pos" (u64) argument
and a boolean "ignore_offset" argument too.

This is confusing and not really necessary. So use a single argument to
specify the extent offset, as a simple u64 and not as a pointer, but
using a special value of (u64)-1, defined as a documented constant, to
indicate when the extent offset should be ignored.

This is also preparation work for the upcoming patches in the series that
add other arguments to find_parent_nodes() and other related functions
that use it.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

show more ...


# c7499a64 01-Nov-2022 Filipe Manana <fdmanana@suse.com>

btrfs: send: optimize clone detection to increase extent sharing

Currently send does not do the best decisions when it comes to decide
between multiple clone sources, which results in clone operatio

btrfs: send: optimize clone detection to increase extent sharing

Currently send does not do the best decisions when it comes to decide
between multiple clone sources, which results in clone operations for
partial extent ranges, which has the following disadvantages:

1) We get less shared extents at the destination;

2) We have to read more data during the send operation and emit more
write commands.

Besides not being optimal behaviour, it also breaks user expectations and
is often reported by users, with a recent example in the Link tag at the
bottom of this change log.

Part of the reason for this non-optimal behaviour is that the backref
walking code does not provide information about the length of the file
extent items that were found for each backref, so send is blind about
which backref is the best to chose as a cloning source.

The other existing reasons are just silliness, namely always prefering
the inode with the lowest number when multiple are found for the same
root and when we can clone from multiple roots, always prefer the send
root over any of the other clone roots. This does not make any sense
since any inode or root is fine and as good as any other inode/root.

Fix this by making backref walking pass information about the number of
bytes referenced by each file extent item and then have send's backref
callback pick the inode with the highest number of bytes for each root.
Finally select the root from which we can clone more bytes from.

Example reproducer:

$ cat test.sh
#!/bin/bash

DEV=/dev/sdi
MNT=/mnt/sdi

mkfs.btrfs -f $DEV
mount $DEV $MNT

xfs_io -f -c "pwrite -S 0xab -b 2M 0 2M" $MNT/foo
cp --reflink=always $MNT/foo $MNT/bar
cp --reflink=always $MNT/foo $MNT/baz
sync

# Overwrite the second half of file foo.
xfs_io -c "pwrite -S 0xcd -b 1M 1M 1M" $MNT/foo
sync

echo
echo "*** fiemap in the original filesystem ***"
echo
xfs_io -c "fiemap -v" $MNT/foo
xfs_io -c "fiemap -v" $MNT/bar
xfs_io -c "fiemap -v" $MNT/baz
echo

btrfs filesystem du $MNT

btrfs subvolume snapshot -r $MNT $MNT/snap

btrfs send -f /tmp/send_stream $MNT/snap

umount $MNT
mkfs.btrfs -f $DEV &> /dev/null
mount $DEV $MNT

btrfs receive -f /tmp/send_stream $MNT

echo
echo "*** fiemap in the new filesystem ***"
echo
xfs_io -r -c "fiemap -v" $MNT/snap/foo
xfs_io -r -c "fiemap -v" $MNT/snap/bar
xfs_io -r -c "fiemap -v" $MNT/snap/baz
echo

btrfs filesystem du $MNT

rm -f /tmp/send_stream
rm -f /tmp/snap.fssum

umount $MNT

Before this change:

$ ./test.sh
(...)

*** fiemap in the original filesystem ***

/mnt/sdi/foo:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..2047]: 26624..28671 2048 0x2000
1: [2048..4095]: 30720..32767 2048 0x1
/mnt/sdi/bar:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..4095]: 26624..30719 4096 0x2001
/mnt/sdi/baz:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..4095]: 26624..30719 4096 0x2001

Total Exclusive Set shared Filename
2.00MiB 1.00MiB - /mnt/sdi/foo
2.00MiB 0.00B - /mnt/sdi/bar
2.00MiB 0.00B - /mnt/sdi/baz
6.00MiB 1.00MiB 2.00MiB /mnt/sdi

Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap'
At subvol /mnt/sdi/snap
At subvol snap

*** fiemap in the new filesystem ***

/mnt/sdi/snap/foo:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..4095]: 26624..30719 4096 0x2001
/mnt/sdi/snap/bar:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..2047]: 26624..28671 2048 0x2000
1: [2048..4095]: 30720..32767 2048 0x1
/mnt/sdi/snap/baz:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..2047]: 26624..28671 2048 0x2000
1: [2048..4095]: 32768..34815 2048 0x1

Total Exclusive Set shared Filename
2.00MiB 0.00B - /mnt/sdi/snap/foo
2.00MiB 1.00MiB - /mnt/sdi/snap/bar
2.00MiB 1.00MiB - /mnt/sdi/snap/baz
6.00MiB 2.00MiB - /mnt/sdi/snap
6.00MiB 2.00MiB 2.00MiB /mnt/sdi

We end up with two 1M extents that are not shared for files bar and baz.

After this change:

$ ./test.sh
(...)

*** fiemap in the original filesystem ***

/mnt/sdi/foo:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..2047]: 26624..28671 2048 0x2000
1: [2048..4095]: 30720..32767 2048 0x1
/mnt/sdi/bar:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..4095]: 26624..30719 4096 0x2001
/mnt/sdi/baz:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..4095]: 26624..30719 4096 0x2001

Total Exclusive Set shared Filename
2.00MiB 1.00MiB - /mnt/sdi/foo
2.00MiB 0.00B - /mnt/sdi/bar
2.00MiB 0.00B - /mnt/sdi/baz
6.00MiB 1.00MiB 2.00MiB /mnt/sdi
Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap'
At subvol /mnt/sdi/snap
At subvol snap

*** fiemap in the new filesystem ***

/mnt/sdi/snap/foo:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..4095]: 26624..30719 4096 0x2001
/mnt/sdi/snap/bar:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..2047]: 26624..28671 2048 0x2000
1: [2048..4095]: 30720..32767 2048 0x2001
/mnt/sdi/snap/baz:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..2047]: 26624..28671 2048 0x2000
1: [2048..4095]: 30720..32767 2048 0x2001

Total Exclusive Set shared Filename
2.00MiB 0.00B - /mnt/sdi/snap/foo
2.00MiB 0.00B - /mnt/sdi/snap/bar
2.00MiB 0.00B - /mnt/sdi/snap/baz
6.00MiB 0.00B - /mnt/sdi/snap
6.00MiB 0.00B 3.00MiB /mnt/sdi

Now there's a much better sharing, files bar and baz share 1M of the
extent of file foo and the second extent of files bar and baz is shared
between themselves.

This will later be turned into a test case for fstests.

Link: https://lore.kernel.org/linux-btrfs/20221008005704.795b44b0@crass-HP-ZBook-15-G2/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

show more ...


# 67707479 26-Oct-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: move relocation prototypes into relocation.h

Move these out of ctree.h into relocation.h to cut down on code in
ctree.h

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-of

btrfs: move relocation prototypes into relocation.h

Move these out of ctree.h into relocation.h to cut down on code in
ctree.h

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

show more ...


# a0231804 24-Oct-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: move extent-tree helpers into their own header file

Move all the extent tree related prototypes to extent-tree.h out of
ctree.h, and then go include it everywhere needed so everything
compile

btrfs: move extent-tree helpers into their own header file

Move all the extent tree related prototypes to extent-tree.h out of
ctree.h, and then go include it everywhere needed so everything
compiles.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

show more ...


# d68194b2 14-Oct-2022 David Sterba <dsterba@suse.com>

btrfs: sink gfp_t parameter to btrfs_backref_iter_alloc

There's only one caller that passes GFP_NOFS, we can drop the parameter
an use the flags directly.

Reviewed-by: Anand Jain <anand.jain@oracle

btrfs: sink gfp_t parameter to btrfs_backref_iter_alloc

There's only one caller that passes GFP_NOFS, we can drop the parameter
an use the flags directly.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>

show more ...


# 07e81dc9 19-Oct-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: move accessor helpers into accessors.h

This is a large patch, but because they're all macros it's impossible to
split up. Simply copy all of the item accessors in ctree.h and paste
them in a

btrfs: move accessor helpers into accessors.h

This is a large patch, but because they're all macros it's impossible to
split up. Simply copy all of the item accessors in ctree.h and paste
them in accessors.h, and then update any files to include the header so
everything compiles.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ reformat comments, style fixups ]
Signed-off-by: David Sterba <dsterba@suse.com>

show more ...


# c7f13d42 19-Oct-2022 Josef Bacik <josef@toxicpanda.com>

btrfs: move fs wide helpers out of ctree.h

We have several fs wide related helpers in ctree.h. The bulk of these
are the incompat flag test helpers, but there are things such as
btrfs_fs_closing()

btrfs: move fs wide helpers out of ctree.h

We have several fs wide related helpers in ctree.h. The bulk of these
are the incompat flag test helpers, but there are things such as
btrfs_fs_closing() and the read only helpers that also aren't directly
related to the ctree code. Move these into a fs.h header, which will
serve as the location for file system wide related helpers.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

show more ...


12345678910>>...38