History log of /openbmc/linux/fs/btrfs/relocation.c (Results 251 – 275 of 908)
Revision (<<< Hide revision tags) (Show revision tags >>>) Date Author Comments
# 76deacf0 24-Jan-2020 Josef Bacik <josef@toxicpanda.com>

btrfs: hold a ref on the root in create_reloc_inode

We're creating a reloc inode in the data reloc tree, we need to hold a
ref on the root while we're doing that.

Signed-off-by: Josef Bacik <josef@

btrfs: hold a ref on the root in create_reloc_inode

We're creating a reloc inode in the data reloc tree, we need to hold a
ref on the root while we're doing that.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

show more ...


# 3d7babdc 24-Jan-2020 Josef Bacik <josef@toxicpanda.com>

btrfs: hold a ref on the root in find_data_references

We're looking up the data references for the bytenr in a root, we need
to hold a ref on that root while we're doing that.

Signed-off-by: Josef

btrfs: hold a ref on the root in find_data_references

We're looking up the data references for the bytenr in a root, we need
to hold a ref on that root while we're doing that.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

show more ...


# 442b1ac5 24-Jan-2020 Josef Bacik <josef@toxicpanda.com>

btrfs: hold a ref on the root in record_reloc_root_in_trans

We are recording this root in the transaction, so we need to hold a ref
on it until we do that.

Signed-off-by: Josef Bacik <josef@toxicpa

btrfs: hold a ref on the root in record_reloc_root_in_trans

We are recording this root in the transaction, so we need to hold a ref
on it until we do that.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

show more ...


# ab9737bd 24-Jan-2020 Josef Bacik <josef@toxicpanda.com>

btrfs: hold a ref on the root in merge_reloc_roots

We look up the corresponding root for the reloc root, we need to hold a
ref while we're messing with it.

Signed-off-by: Josef Bacik <josef@toxicpa

btrfs: hold a ref on the root in merge_reloc_roots

We look up the corresponding root for the reloc root, we need to hold a
ref while we're messing with it.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

show more ...


# db2c2ca2 24-Jan-2020 Josef Bacik <josef@toxicpanda.com>

btrfs: hold a ref on the root in prepare_to_merge

We look up the reloc roots corresponding root, we need to hold a ref on
that root.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: D

btrfs: hold a ref on the root in prepare_to_merge

We look up the reloc roots corresponding root, we need to hold a ref on
that root.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

show more ...


# 0b530bc5 24-Jan-2020 Josef Bacik <josef@toxicpanda.com>

btrfs: hold a ref on the root in build_backref_tree

This is trickier than the previous conversions. We have backref_node's
that need to hold onto their root for their lifetime. Do the read of
the

btrfs: hold a ref on the root in build_backref_tree

This is trickier than the previous conversions. We have backref_node's
that need to hold onto their root for their lifetime. Do the read of
the root and grab the ref. If at any point we don't use the root we
discard it, however if we use it in our backref node we don't free it
until we free the backref node. Any time we switch the root's for the
backref node we need to drop our ref on the old root and grab the ref on
the new root, and if we dupe a node we need to get a ref on the root
there as well.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

show more ...


# 3dbf1738 24-Jan-2020 Josef Bacik <josef@toxicpanda.com>

btrfs: make relocation use btrfs_read_tree_root()

Relocation has it's special roots, we don't want to save these in the
root cache either, so swap it to use btrfs_read_tree_root(). However
the relo

btrfs: make relocation use btrfs_read_tree_root()

Relocation has it's special roots, we don't want to save these in the
root cache either, so swap it to use btrfs_read_tree_root(). However
the reloc root does need REF_COWS set, so make sure we set it everywhere
we use this helper, as it no longer does the REF_COWS setting.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

show more ...


# 0c891389 15-Jan-2020 Qu Wenruo <wqu@suse.com>

btrfs: relocation: Add introduction of how relocation works

Relocation is one of the most complex part of btrfs, while it's also the
foundation stone for online resizing, profile converting.

For su

btrfs: relocation: Add introduction of how relocation works

Relocation is one of the most complex part of btrfs, while it's also the
foundation stone for online resizing, profile converting.

For such a complex facility, we should at least have some introduction
to it.

This patch will add an basic introduction at pretty a high level,
explaining:

- What relocation does
- How relocation is done
Only mentioning how data reloc tree and reloc tree are involved in the
operation.
No details like the backref cache, or the data reloc tree contents.
- Which function to refer.

More detailed comments will be added for reloc tree creation, data reloc
tree creation and backref cache.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

show more ...


# bffe633e 02-Dec-2019 Omar Sandoval <osandov@fb.com>

btrfs: make btrfs_ordered_extent naming consistent with btrfs_file_extent_item

ordered->start, ordered->len, and ordered->disk_len correspond to
fi->disk_bytenr, fi->num_bytes, and fi->disk_num_byte

btrfs: make btrfs_ordered_extent naming consistent with btrfs_file_extent_item

ordered->start, ordered->len, and ordered->disk_len correspond to
fi->disk_bytenr, fi->num_bytes, and fi->disk_num_bytes, respectively.
It's confusing to translate between the two naming schemes. Since a
btrfs_ordered_extent is basically a pending btrfs_file_extent_item,
let's make the former use the naming from the latter.

Note that I didn't touch the names in tracepoints just in case there are
scripts depending on the current naming.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

show more ...


# 430640e3 28-Nov-2019 Qu Wenruo <wqu@suse.com>

btrfs: relocation: Output current relocation stage at btrfs_relocate_block_group()

There are two relocation stages but both print the same message. Add the
description of the stage. This can help de

btrfs: relocation: Output current relocation stage at btrfs_relocate_block_group()

There are two relocation stages but both print the same message. Add the
description of the stage. This can help debugging or provides
informative message to users.

BTRFS info (device dm-5): balance: start -d -m -s
BTRFS info (device dm-5): relocating block group 30408704 flags metadata|dup
BTRFS info (device dm-5): found 2 extents, stage: move data extents
BTRFS info (device dm-5): relocating block group 22020096 flags system|dup
BTRFS info (device dm-5): found 1 extents, stage: move data extents
BTRFS info (device dm-5): relocating block group 13631488 flags data
BTRFS info (device dm-5): found 1 extents, stage: move data extents
BTRFS info (device dm-5): found 1 extents, stage: update data pointers
BTRFS info (device dm-5): balance: ended with status: 0

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

show more ...


# 6282675e 07-Jan-2020 Qu Wenruo <wqu@suse.com>

btrfs: relocation: fix reloc_root lifespan and access

[BUG]
There are several different KASAN reports for balance + snapshot
workloads. Involved call paths include:

should_ignore_root+0x54/0xb0

btrfs: relocation: fix reloc_root lifespan and access

[BUG]
There are several different KASAN reports for balance + snapshot
workloads. Involved call paths include:

should_ignore_root+0x54/0xb0 [btrfs]
build_backref_tree+0x11af/0x2280 [btrfs]
relocate_tree_blocks+0x391/0xb80 [btrfs]
relocate_block_group+0x3e5/0xa00 [btrfs]
btrfs_relocate_block_group+0x240/0x4d0 [btrfs]
btrfs_relocate_chunk+0x53/0xf0 [btrfs]
btrfs_balance+0xc91/0x1840 [btrfs]
btrfs_ioctl_balance+0x416/0x4e0 [btrfs]
btrfs_ioctl+0x8af/0x3e60 [btrfs]
do_vfs_ioctl+0x831/0xb10

create_reloc_root+0x9f/0x460 [btrfs]
btrfs_reloc_post_snapshot+0xff/0x6c0 [btrfs]
create_pending_snapshot+0xa9b/0x15f0 [btrfs]
create_pending_snapshots+0x111/0x140 [btrfs]
btrfs_commit_transaction+0x7a6/0x1360 [btrfs]
btrfs_mksubvol+0x915/0x960 [btrfs]
btrfs_ioctl_snap_create_transid+0x1d5/0x1e0 [btrfs]
btrfs_ioctl_snap_create_v2+0x1d3/0x270 [btrfs]
btrfs_ioctl+0x241b/0x3e60 [btrfs]
do_vfs_ioctl+0x831/0xb10

btrfs_reloc_pre_snapshot+0x85/0xc0 [btrfs]
create_pending_snapshot+0x209/0x15f0 [btrfs]
create_pending_snapshots+0x111/0x140 [btrfs]
btrfs_commit_transaction+0x7a6/0x1360 [btrfs]
btrfs_mksubvol+0x915/0x960 [btrfs]
btrfs_ioctl_snap_create_transid+0x1d5/0x1e0 [btrfs]
btrfs_ioctl_snap_create_v2+0x1d3/0x270 [btrfs]
btrfs_ioctl+0x241b/0x3e60 [btrfs]
do_vfs_ioctl+0x831/0xb10

[CAUSE]
All these call sites are only relying on root->reloc_root, which can
undergo btrfs_drop_snapshot(), and since we don't have real refcount
based protection to reloc roots, we can reach already dropped reloc
root, triggering KASAN.

[FIX]
To avoid such access to unstable root->reloc_root, we should check
BTRFS_ROOT_DEAD_RELOC_TREE bit first.

This patch introduces wrappers that provide the correct way to check the
bit with memory barriers protection.

Most callers don't distinguish merged reloc tree and no reloc tree. The
only exception is should_ignore_root(), as merged reloc tree can be
ignored, while no reloc tree shouldn't.

[CRITICAL SECTION ANALYSIS]
Although test_bit()/set_bit()/clear_bit() doesn't imply a barrier, the
DEAD_RELOC_TREE bit has extra help from transaction as a higher level
barrier, the lifespan of root::reloc_root and DEAD_RELOC_TREE bit are:

NULL: reloc_root is NULL PTR: reloc_root is not NULL
0: DEAD_RELOC_ROOT bit not set DEAD: DEAD_RELOC_ROOT bit set

(NULL, 0) Initial state __
| /\ Section A
btrfs_init_reloc_root() \/
| __
(PTR, 0) reloc_root initialized /\
| |
btrfs_update_reloc_root() | Section B
| |
(PTR, DEAD) reloc_root has been merged \/
| __
=== btrfs_commit_transaction() ====================
| /\
clean_dirty_subvols() |
| | Section C
(NULL, DEAD) reloc_root cleanup starts \/
| __
btrfs_drop_snapshot() /\
| | Section D
(NULL, 0) Back to initial state \/

Every have_reloc_root() or test_bit(DEAD_RELOC_ROOT) caller holds
transaction handle, so none of such caller can cross transaction boundary.

In Section A, every caller just found no DEAD bit, and grab reloc_root.

In the cross section A-B, caller may get no DEAD bit, but since reloc_root
is still completely valid thus accessing reloc_root is completely safe.

No test_bit() caller can cross the boundary of Section B and Section C.

In Section C, every caller found the DEAD bit, so no one will access
reloc_root.

In the cross section C-D, either caller gets the DEAD bit set, avoiding
access reloc_root no matter if it's safe or not. Or caller get the DEAD
bit cleared, then access reloc_root, which is already NULL, nothing will
be wrong.

The memory write barriers are between the reloc_root updates and bit
set/clear, the pairing read side is before test_bit.

Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Fixes: d2311e698578 ("btrfs: relocation: Delay reloc tree deletion after merge_reloc_roots")
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ barriers ]
Signed-off-by: David Sterba <dsterba@suse.com>

show more ...


# ca1aa281 06-Dec-2019 Josef Bacik <josef@toxicpanda.com>

btrfs: do not leak reloc root if we fail to read the fs root

If we fail to read the fs root corresponding with a reloc root we'll
just break out and free the reloc roots. But we remove our current

btrfs: do not leak reloc root if we fail to read the fs root

If we fail to read the fs root corresponding with a reloc root we'll
just break out and free the reloc roots. But we remove our current
reloc_root from this list higher up, which means we'll leak this
reloc_root. Fix this by adding ourselves back to the reloc_roots list
so we are properly cleaned up.

CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

show more ...


Revision tags: v5.3.11, v5.3.10, v5.3.9, v5.3.8, v5.3.7, v5.3.6, v5.3.5, v5.3.4, v5.3.3, v5.3.2, v5.3.1, v5.3, v5.2.14, v5.3-rc8, v5.2.13, v5.2.12
# a019e9e1 30-Aug-2019 David Sterba <dsterba@suse.com>

btrfs: remove extent_map::bdev

We can now remove the bdev from extent_map. Previous patches made sure
that bio_set_dev is correctly in all places and that we don't need to
grab it from latest_bdev o

btrfs: remove extent_map::bdev

We can now remove the bdev from extent_map. Previous patches made sure
that bio_set_dev is correctly in all places and that we don't need to
grab it from latest_bdev or pass it around inside the extent map.

Signed-off-by: David Sterba <dsterba@suse.com>

show more ...


# b12de528 14-Nov-2019 Qu Wenruo <wqu@suse.com>

btrfs: scrub: Don't check free space before marking a block group RO

[BUG]
When running btrfs/072 with only one online CPU, it has a pretty high
chance to fail:

btrfs/072 12s ... _check_dmesg: so

btrfs: scrub: Don't check free space before marking a block group RO

[BUG]
When running btrfs/072 with only one online CPU, it has a pretty high
chance to fail:

btrfs/072 12s ... _check_dmesg: something found in dmesg (see xfstests-dev/results//btrfs/072.dmesg)
- output mismatch (see xfstests-dev/results//btrfs/072.out.bad)
--- tests/btrfs/072.out 2019-10-22 15:18:14.008965340 +0800
+++ /xfstests-dev/results//btrfs/072.out.bad 2019-11-14 15:56:45.877152240 +0800
@@ -1,2 +1,3 @@
QA output created by 072
Silence is golden
+Scrub find errors in "-m dup -d single" test
...

And with the following call trace:

BTRFS info (device dm-5): scrub: started on devid 1
------------[ cut here ]------------
BTRFS: Transaction aborted (error -27)
WARNING: CPU: 0 PID: 55087 at fs/btrfs/block-group.c:1890 btrfs_create_pending_block_groups+0x3e6/0x470 [btrfs]
CPU: 0 PID: 55087 Comm: btrfs Tainted: G W O 5.4.0-rc1-custom+ #13
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:btrfs_create_pending_block_groups+0x3e6/0x470 [btrfs]
Call Trace:
__btrfs_end_transaction+0xdb/0x310 [btrfs]
btrfs_end_transaction+0x10/0x20 [btrfs]
btrfs_inc_block_group_ro+0x1c9/0x210 [btrfs]
scrub_enumerate_chunks+0x264/0x940 [btrfs]
btrfs_scrub_dev+0x45c/0x8f0 [btrfs]
btrfs_ioctl+0x31a1/0x3fb0 [btrfs]
do_vfs_ioctl+0x636/0xaa0
ksys_ioctl+0x67/0x90
__x64_sys_ioctl+0x43/0x50
do_syscall_64+0x79/0xe0
entry_SYSCALL_64_after_hwframe+0x49/0xbe
---[ end trace 166c865cec7688e7 ]---

[CAUSE]
The error number -27 is -EFBIG, returned from the following call chain:
btrfs_end_transaction()
|- __btrfs_end_transaction()
|- btrfs_create_pending_block_groups()
|- btrfs_finish_chunk_alloc()
|- btrfs_add_system_chunk()

This happens because we have used up all space of
btrfs_super_block::sys_chunk_array.

The root cause is, we have the following bad loop of creating tons of
system chunks:

1. The only SYSTEM chunk is being scrubbed
It's very common to have only one SYSTEM chunk.
2. New SYSTEM bg will be allocated
As btrfs_inc_block_group_ro() will check if we have enough space
after marking current bg RO. If not, then allocate a new chunk.
3. New SYSTEM bg is still empty, will be reclaimed
During the reclaim, we will mark it RO again.
4. That newly allocated empty SYSTEM bg get scrubbed
We go back to step 2, as the bg is already mark RO but still not
cleaned up yet.

If the cleaner kthread doesn't get executed fast enough (e.g. only one
CPU), then we will get more and more empty SYSTEM chunks, using up all
the space of btrfs_super_block::sys_chunk_array.

[FIX]
Since scrub/dev-replace doesn't always need to allocate new extent,
especially chunk tree extent, so we don't really need to do chunk
pre-allocation.

To break above spiral, here we introduce a new parameter to
btrfs_inc_block_group(), @do_chunk_alloc, which indicates whether we
need extra chunk pre-allocation.

For relocation, we pass @do_chunk_alloc=true, while for scrub, we pass
@do_chunk_alloc=false.
This should keep unnecessary empty chunks from popping up for scrub.

Also, since there are two parameters for btrfs_inc_block_group_ro(),
add more comment for it.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

show more ...


# 32da5386 29-Oct-2019 David Sterba <dsterba@suse.com>

btrfs: rename btrfs_block_group_cache

The type name is misleading, a single entry is named 'cache' while this
normally means a collection of objects. Rename that everywhere. Also the
identifier was

btrfs: rename btrfs_block_group_cache

The type name is misleading, a single entry is named 'cache' while this
normally means a collection of objects. Rename that everywhere. Also the
identifier was quite long, making function prototypes harder to format.

Suggested-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

show more ...


# b3470b5d 23-Oct-2019 David Sterba <dsterba@suse.com>

btrfs: add dedicated members for start and length of a block group

The on-disk format of block group item makes use of the key that stores
the offset and length. This is further used in the code, al

btrfs: add dedicated members for start and length of a block group

The on-disk format of block group item makes use of the key that stores
the offset and length. This is further used in the code, although this
makes thing harder to understand. The key is also packed so the
offset/length is not properly aligned as u64.

Add start (key.objectid) and length (key.offset) members to block group
and remove the embedded key. When the item is searched or written, a
local variable for key is used.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

show more ...


# bf38be65 23-Oct-2019 David Sterba <dsterba@suse.com>

btrfs: move block_group_item::used to block group

For unknown reasons, the member 'used' in the block group struct is
stored in the b-tree item and accessed everywhere using the special
accessor hel

btrfs: move block_group_item::used to block group

For unknown reasons, the member 'used' in the block group struct is
stored in the b-tree item and accessed everywhere using the special
accessor helper. Let's unify it and make it a regular member and only
update the item before writing it to the tree.

The item is still being used for flags and chunk_objectid, there's some
duplication until the item is removed in following patches.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

show more ...


# 67439dad 08-Oct-2019 David Sterba <dsterba@suse.com>

btrfs: opencode extent_buffer_get

The helper is trivial and we can understand what the atomic_inc on
something named refs does.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Ana

btrfs: opencode extent_buffer_get

The helper is trivial and we can understand what the atomic_inc on
something named refs does.

Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>

show more ...


# 4c66e0d4 03-Oct-2019 David Sterba <dsterba@suse.com>

btrfs: drop unused parameter is_new from btrfs_iget

The parameter is now always set to NULL and could be dropped. The last
user was get_default_root but that got reworked in 05dbe6837b60 ("Btrfs:
un

btrfs: drop unused parameter is_new from btrfs_iget

The parameter is now always set to NULL and could be dropped. The last
user was get_default_root but that got reworked in 05dbe6837b60 ("Btrfs:
unify subvol= and subvolid= mounting") and the parameter became unused.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>

show more ...


# 8702ba93 14-Oct-2019 Qu Wenruo <wqu@suse.com>

btrfs: qgroup: Always free PREALLOC META reserve in btrfs_delalloc_release_extents()

[Background]
Btrfs qgroup uses two types of reserved space for METADATA space,
PERTRANS and PREALLOC.

PERTRANS i

btrfs: qgroup: Always free PREALLOC META reserve in btrfs_delalloc_release_extents()

[Background]
Btrfs qgroup uses two types of reserved space for METADATA space,
PERTRANS and PREALLOC.

PERTRANS is metadata space reserved for each transaction started by
btrfs_start_transaction().
While PREALLOC is for delalloc, where we reserve space before joining a
transaction, and finally it will be converted to PERTRANS after the
writeback is done.

[Inconsistency]
However there is inconsistency in how we handle PREALLOC metadata space.

The most obvious one is:
In btrfs_buffered_write():
btrfs_delalloc_release_extents(BTRFS_I(inode), reserve_bytes, true);

We always free qgroup PREALLOC meta space.

While in btrfs_truncate_block():
btrfs_delalloc_release_extents(BTRFS_I(inode), blocksize, (ret != 0));

We only free qgroup PREALLOC meta space when something went wrong.

[The Correct Behavior]
The correct behavior should be the one in btrfs_buffered_write(), we
should always free PREALLOC metadata space.

The reason is, the btrfs_delalloc_* mechanism works by:
- Reserve metadata first, even it's not necessary
In btrfs_delalloc_reserve_metadata()

- Free the unused metadata space
Normally in:
btrfs_delalloc_release_extents()
|- btrfs_inode_rsv_release()
Here we do calculation on whether we should release or not.

E.g. for 64K buffered write, the metadata rsv works like:

/* The first page */
reserve_meta: num_bytes=calc_inode_reservations()
free_meta: num_bytes=0
total: num_bytes=calc_inode_reservations()
/* The first page caused one outstanding extent, thus needs metadata
rsv */

/* The 2nd page */
reserve_meta: num_bytes=calc_inode_reservations()
free_meta: num_bytes=calc_inode_reservations()
total: not changed
/* The 2nd page doesn't cause new outstanding extent, needs no new meta
rsv, so we free what we have reserved */

/* The 3rd~16th pages */
reserve_meta: num_bytes=calc_inode_reservations()
free_meta: num_bytes=calc_inode_reservations()
total: not changed (still space for one outstanding extent)

This means, if btrfs_delalloc_release_extents() determines to free some
space, then those space should be freed NOW.
So for qgroup, we should call btrfs_qgroup_free_meta_prealloc() other
than btrfs_qgroup_convert_reserved_meta().

The good news is:
- The callers are not that hot
The hottest caller is in btrfs_buffered_write(), which is already
fixed by commit 336a8bb8e36a ("btrfs: Fix wrong
btrfs_delalloc_release_extents parameter"). Thus it's not that
easy to cause false EDQUOT.

- The trans commit in advance for qgroup would hide the bug
Since commit f5fef4593653 ("btrfs: qgroup: Make qgroup async transaction
commit more aggressive"), when btrfs qgroup metadata free space is slow,
it will try to commit transaction and free the wrongly converted
PERTRANS space, so it's not that easy to hit such bug.

[FIX]
So to fix the problem, remove the @qgroup_free parameter for
btrfs_delalloc_release_extents(), and always pass true to
btrfs_inode_rsv_release().

Reported-by: Filipe Manana <fdmanana@suse.com>
Fixes: 43b18595d660 ("btrfs: qgroup: Use separate meta reservation type for delalloc")
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

show more ...


# 44db1216 09-Oct-2019 Filipe Manana <fdmanana@suse.com>

Btrfs: add missing extents release on file extent cluster relocation error

If we error out when finding a page at relocate_file_extent_cluster(), we
need to release the outstanding extents counter o

Btrfs: add missing extents release on file extent cluster relocation error

If we error out when finding a page at relocate_file_extent_cluster(), we
need to release the outstanding extents counter on the relocation inode,
set by the previous call to btrfs_delalloc_reserve_metadata(), otherwise
the inode's block reserve size can never decrease to zero and metadata
space is leaked. Therefore add a call to btrfs_delalloc_release_extents()
in case we can't find the target page.

Fixes: 8b62f87bad9c ("Btrfs: rework outstanding_extents")
CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

show more ...


# 1fac4a54 23-Sep-2019 Qu Wenruo <wqu@suse.com>

btrfs: relocation: fix use-after-free on dead relocation roots

[BUG]
One user reported a reproducible KASAN report about use-after-free:

BTRFS info (device sdi1): balance: start -dvrange=12568116

btrfs: relocation: fix use-after-free on dead relocation roots

[BUG]
One user reported a reproducible KASAN report about use-after-free:

BTRFS info (device sdi1): balance: start -dvrange=1256811659264..1256811659265
BTRFS info (device sdi1): relocating block group 1256811659264 flags data|raid0
==================================================================
BUG: KASAN: use-after-free in btrfs_init_reloc_root+0x2cd/0x340 [btrfs]
Write of size 8 at addr ffff88856f671710 by task kworker/u24:10/261579

CPU: 2 PID: 261579 Comm: kworker/u24:10 Tainted: P OE 5.2.11-arch1-1-kasan #4
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X99 Extreme4, BIOS P3.80 04/06/2018
Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
Call Trace:
dump_stack+0x7b/0xba
print_address_description+0x6c/0x22e
? btrfs_init_reloc_root+0x2cd/0x340 [btrfs]
__kasan_report.cold+0x1b/0x3b
? btrfs_init_reloc_root+0x2cd/0x340 [btrfs]
kasan_report+0x12/0x17
__asan_report_store8_noabort+0x17/0x20
btrfs_init_reloc_root+0x2cd/0x340 [btrfs]
record_root_in_trans+0x2a0/0x370 [btrfs]
btrfs_record_root_in_trans+0xf4/0x140 [btrfs]
start_transaction+0x1ab/0xe90 [btrfs]
btrfs_join_transaction+0x1d/0x20 [btrfs]
btrfs_finish_ordered_io+0x7bf/0x18a0 [btrfs]
? lock_repin_lock+0x400/0x400
? __kmem_cache_shutdown.cold+0x140/0x1ad
? btrfs_unlink_subvol+0x9b0/0x9b0 [btrfs]
finish_ordered_fn+0x15/0x20 [btrfs]
normal_work_helper+0x1bd/0xca0 [btrfs]
? process_one_work+0x819/0x1720
? kasan_check_read+0x11/0x20
btrfs_endio_write_helper+0x12/0x20 [btrfs]
process_one_work+0x8c9/0x1720
? pwq_dec_nr_in_flight+0x2f0/0x2f0
? worker_thread+0x1d9/0x1030
worker_thread+0x98/0x1030
kthread+0x2bb/0x3b0
? process_one_work+0x1720/0x1720
? kthread_park+0x120/0x120
ret_from_fork+0x35/0x40

Allocated by task 369692:
__kasan_kmalloc.part.0+0x44/0xc0
__kasan_kmalloc.constprop.0+0xba/0xc0
kasan_kmalloc+0x9/0x10
kmem_cache_alloc_trace+0x138/0x260
btrfs_read_tree_root+0x92/0x360 [btrfs]
btrfs_read_fs_root+0x10/0xb0 [btrfs]
create_reloc_root+0x47d/0xa10 [btrfs]
btrfs_init_reloc_root+0x1e2/0x340 [btrfs]
record_root_in_trans+0x2a0/0x370 [btrfs]
btrfs_record_root_in_trans+0xf4/0x140 [btrfs]
start_transaction+0x1ab/0xe90 [btrfs]
btrfs_start_transaction+0x1e/0x20 [btrfs]
__btrfs_prealloc_file_range+0x1c2/0xa00 [btrfs]
btrfs_prealloc_file_range+0x13/0x20 [btrfs]
prealloc_file_extent_cluster+0x29f/0x570 [btrfs]
relocate_file_extent_cluster+0x193/0xc30 [btrfs]
relocate_data_extent+0x1f8/0x490 [btrfs]
relocate_block_group+0x600/0x1060 [btrfs]
btrfs_relocate_block_group+0x3a0/0xa00 [btrfs]
btrfs_relocate_chunk+0x9e/0x180 [btrfs]
btrfs_balance+0x14e4/0x2fc0 [btrfs]
btrfs_ioctl_balance+0x47f/0x640 [btrfs]
btrfs_ioctl+0x119d/0x8380 [btrfs]
do_vfs_ioctl+0x9f5/0x1060
ksys_ioctl+0x67/0x90
__x64_sys_ioctl+0x73/0xb0
do_syscall_64+0xa5/0x370
entry_SYSCALL_64_after_hwframe+0x44/0xa9

Freed by task 369692:
__kasan_slab_free+0x14f/0x210
kasan_slab_free+0xe/0x10
kfree+0xd8/0x270
btrfs_drop_snapshot+0x154c/0x1eb0 [btrfs]
clean_dirty_subvols+0x227/0x340 [btrfs]
relocate_block_group+0x972/0x1060 [btrfs]
btrfs_relocate_block_group+0x3a0/0xa00 [btrfs]
btrfs_relocate_chunk+0x9e/0x180 [btrfs]
btrfs_balance+0x14e4/0x2fc0 [btrfs]
btrfs_ioctl_balance+0x47f/0x640 [btrfs]
btrfs_ioctl+0x119d/0x8380 [btrfs]
do_vfs_ioctl+0x9f5/0x1060
ksys_ioctl+0x67/0x90
__x64_sys_ioctl+0x73/0xb0
do_syscall_64+0xa5/0x370
entry_SYSCALL_64_after_hwframe+0x44/0xa9

The buggy address belongs to the object at ffff88856f671100
which belongs to the cache kmalloc-4k of size 4096
The buggy address is located 1552 bytes inside of
4096-byte region [ffff88856f671100, ffff88856f672100)
The buggy address belongs to the page:
page:ffffea0015bd9c00 refcount:1 mapcount:0 mapping:ffff88864400e600 index:0x0 compound_mapcount: 0
flags: 0x2ffff0000010200(slab|head)
raw: 02ffff0000010200 dead000000000100 dead000000000200 ffff88864400e600
raw: 0000000000000000 0000000000070007 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
ffff88856f671600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff88856f671680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>ffff88856f671700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff88856f671780: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff88856f671800: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================
BTRFS info (device sdi1): 1 enospc errors during balance
BTRFS info (device sdi1): balance: ended with status: -28

[CAUSE]
The problem happens when finish_ordered_io() get called with balance
still running, while the reloc root of that subvolume is already dead.
(Tree is swap already done, but tree not yet deleted for possible qgroup
usage.)

That means root->reloc_root still exists, but that reloc_root can be
under btrfs_drop_snapshot(), thus we shouldn't access it.

The following race could cause the use-after-free problem:

CPU1 | CPU2
--------------------------------------------------------------------------
| relocate_block_group()
| |- unset_reloc_control(rc)
| |- btrfs_commit_transaction()
btrfs_finish_ordered_io() | |- clean_dirty_subvols()
|- btrfs_join_transaction() | |
|- record_root_in_trans() | |
|- btrfs_init_reloc_root() | |
|- if (root->reloc_root) | |
| | |- root->reloc_root = NULL
| | |- btrfs_drop_snapshot(reloc_root);
|- reloc_root->last_trans|
= trans->transid |
^^^^^^^^^^^^^^^^^^^^^^
Use after free

[FIX]
Fix it by the following modifications:

- Test if the root has dead reloc tree before accessing root->reloc_root
If the root has BTRFS_ROOT_DEAD_RELOC_TREE, then we don't need to
create or update root->reloc_tree

- Clear the BTRFS_ROOT_DEAD_RELOC_TREE flag until we have fully dropped
reloc tree
To co-operate with above modification, so as long as
BTRFS_ROOT_DEAD_RELOC_TREE is still set, we won't try to re-create
reloc tree at record_root_in_trans().

Reported-by: Cebtenzzre <cebtenzzre@gmail.com>
Fixes: d2311e698578 ("btrfs: relocation: Delay reloc tree deletion after merge_reloc_roots")
CC: stable@vger.kernel.org # 5.1+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

show more ...


Revision tags: v5.2.11, v5.2.10, v5.2.9, v5.2.8, v5.2.7, v5.2.6, v5.2.5, v5.2.4, v5.2.3, v5.2.2, v5.2.1, v5.2, v5.1.16, v5.1.15, v5.1.14, v5.1.13
# aac0023c 20-Jun-2019 Josef Bacik <josef@toxicpanda.com>

btrfs: move basic block_group definitions to their own header

This is prep work for moving all of the block group cache code into its
own file.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Rev

btrfs: move basic block_group definitions to their own header

This is prep work for moving all of the block group cache code into its
own file.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor comment updates ]
Signed-off-by: David Sterba <dsterba@suse.com>

show more ...


# 330a5827 17-Jul-2019 Nikolay Borisov <nborisov@suse.com>

btrfs: Remove leftover of in-band dedupe

It's unlikely in-band dedupe is going to land so just remove any
leftovers - dedupe.h header as well as the 'dedupe' parameter to
btrfs_set_extent_delalloc.

btrfs: Remove leftover of in-band dedupe

It's unlikely in-band dedupe is going to land so just remove any
leftovers - dedupe.h header as well as the 'dedupe' parameter to
btrfs_set_extent_delalloc.

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

show more ...


# 86736342 19-Jun-2019 Josef Bacik <josef@toxicpanda.com>

btrfs: migrate the delalloc space stuff to it's own home

We have code for data and metadata reservations for delalloc. There's
quite a bit of code here, and it's used in a lot of places so I've
sep

btrfs: migrate the delalloc space stuff to it's own home

We have code for data and metadata reservations for delalloc. There's
quite a bit of code here, and it's used in a lot of places so I've
separated it out to it's own file. inode.c and file.c are already
pretty large, and this code is complicated enough to live in its own
space.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>

show more ...


1...<<11121314151617181920>>...37