History log of /openbmc/linux/fs/btrfs/bio.c (Results 1 – 25 of 46)
Revision (<<< Hide revision tags) (Show revision tags >>>) Date Author Comments
Revision tags: v6.6.36, v6.6.35, v6.6.34, v6.6.33
# 082b3d4e 07-Jun-2024 Johannes Thumshirn <johannes.thumshirn@wdc.com>

btrfs: zoned: allocate dummy checksums for zoned NODATASUM writes

[ Upstream commit cebae292e0c32a228e8f2219c270a7237be24a6a ]

Shin'ichiro reported that when he's running fstests' test-case
btrfs/1

btrfs: zoned: allocate dummy checksums for zoned NODATASUM writes

[ Upstream commit cebae292e0c32a228e8f2219c270a7237be24a6a ]

Shin'ichiro reported that when he's running fstests' test-case
btrfs/167 on emulated zoned devices, he's seeing the following NULL
pointer dereference in 'btrfs_zone_finish_endio()':

Oops: general protection fault, probably for non-canonical address 0xdffffc0000000011: 0000 [#1] PREEMPT SMP KASAN NOPTI
KASAN: null-ptr-deref in range [0x0000000000000088-0x000000000000008f]
CPU: 4 PID: 2332440 Comm: kworker/u80:15 Tainted: G W 6.10.0-rc2-kts+ #4
Hardware name: Supermicro Super Server/X11SPi-TF, BIOS 3.3 02/21/2020
Workqueue: btrfs-endio-write btrfs_work_helper [btrfs]
RIP: 0010:btrfs_zone_finish_endio.part.0+0x34/0x160 [btrfs]

RSP: 0018:ffff88867f107a90 EFLAGS: 00010206
RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffffff893e5534
RDX: 0000000000000011 RSI: 0000000000000004 RDI: 0000000000000088
RBP: 0000000000000002 R08: 0000000000000001 R09: ffffed1081696028
R10: ffff88840b4b0143 R11: ffff88834dfff600 R12: ffff88840b4b0000
R13: 0000000000020000 R14: 0000000000000000 R15: ffff888530ad5210
FS: 0000000000000000(0000) GS:ffff888e3f800000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f87223fff38 CR3: 00000007a7c6a002 CR4: 00000000007706f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
<TASK>
? __die_body.cold+0x19/0x27
? die_addr+0x46/0x70
? exc_general_protection+0x14f/0x250
? asm_exc_general_protection+0x26/0x30
? do_raw_read_unlock+0x44/0x70
? btrfs_zone_finish_endio.part.0+0x34/0x160 [btrfs]
btrfs_finish_one_ordered+0x5d9/0x19a0 [btrfs]
? __pfx_lock_release+0x10/0x10
? do_raw_write_lock+0x90/0x260
? __pfx_do_raw_write_lock+0x10/0x10
? __pfx_btrfs_finish_one_ordered+0x10/0x10 [btrfs]
? _raw_write_unlock+0x23/0x40
? btrfs_finish_ordered_zoned+0x5a9/0x850 [btrfs]
? lock_acquire+0x435/0x500
btrfs_work_helper+0x1b1/0xa70 [btrfs]
? __schedule+0x10a8/0x60b0
? __pfx___might_resched+0x10/0x10
process_one_work+0x862/0x1410
? __pfx_lock_acquire+0x10/0x10
? __pfx_process_one_work+0x10/0x10
? assign_work+0x16c/0x240
worker_thread+0x5e6/0x1010
? __pfx_worker_thread+0x10/0x10
kthread+0x2c3/0x3a0
? trace_irq_enable.constprop.0+0xce/0x110
? __pfx_kthread+0x10/0x10
ret_from_fork+0x31/0x70
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1a/0x30
</TASK>

Enabling CONFIG_BTRFS_ASSERT revealed the following assertion to
trigger:

assertion failed: !list_empty(&ordered->list), in fs/btrfs/zoned.c:1815

This indicates, that we're missing the checksums list on the
ordered_extent. As btrfs/167 is doing a NOCOW write this is to be
expected.

Further analysis with drgn confirmed the assumption:

>>> inode = prog.crashed_thread().stack_trace()[11]['ordered'].inode
>>> btrfs_inode = drgn.container_of(inode, "struct btrfs_inode", \
"vfs_inode")
>>> print(btrfs_inode.flags)
(u32)1

As zoned emulation mode simulates conventional zones on regular devices,
we cannot use zone-append for writing. But we're only attaching dummy
checksums if we're doing a zone-append write.

So for NOCOW zoned data writes on conventional zones, also attach a
dummy checksum.

Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Fixes: cbfce4c7fbde ("btrfs: optimize the logical to physical mapping for zoned writes")
CC: Naohiro Aota <Naohiro.Aota@wdc.com> # 6.6+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>

show more ...


Revision tags: v6.6.32, v6.6.31, v6.6.30, v6.6.29, v6.6.28, v6.6.27, v6.6.26, v6.6.25, v6.6.24, v6.6.23, v6.6.16, v6.6.15, v6.6.14, v6.6.13, v6.6.12, v6.6.11, v6.6.10, v6.6.9, v6.6.8, v6.6.7, v6.6.6, v6.6.5, v6.6.4, v6.6.3, v6.6.2, v6.5.11, v6.6.1, v6.5.10, v6.6, v6.5.9, v6.5.8, v6.5.7, v6.5.6, v6.5.5, v6.5.4, v6.5.3, v6.5.2, v6.1.51, v6.5.1, v6.1.50, v6.5, v6.1.49, v6.1.48, v6.1.46, v6.1.45, v6.1.44, v6.1.43, v6.1.42, v6.1.41, v6.1.40, v6.1.39, v6.1.38, v6.1.37, v6.1.36, v6.4, v6.1.35, v6.1.34, v6.1.33, v6.1.32
# ec63b84d 31-May-2023 Christoph Hellwig <hch@lst.de>

btrfs: add an ordered_extent pointer to struct btrfs_bio

Add a pointer to the ordered_extent to the existing union in struct
btrfs_bio, so all code dealing with data write bios can just use a
pointe

btrfs: add an ordered_extent pointer to struct btrfs_bio

Add a pointer to the ordered_extent to the existing union in struct
btrfs_bio, so all code dealing with data write bios can just use a
pointer dereference to retrieve the ordered_extent instead of doing
multiple rbtree lookups per I/O.

The reference to this ordered_extent is dropped at end I/O time,
which implies that an extra one must be acquired when the bio is split.
This also requires moving the btrfs_extract_ordered_extent call into
btrfs_split_bio so that the invariant of always having a valid
ordered_extent reference for the btrfs_bio is kept.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

show more ...


# fbe96087 31-May-2023 Christoph Hellwig <hch@lst.de>

btrfs: add a is_data_bbio helper

Add a helper to check for that a btrfs_bio has a valid inode, and that
it is a data inode to key off all the special handling for data path
checksumming. Note that

btrfs: add a is_data_bbio helper

Add a helper to check for that a btrfs_bio has a valid inode, and that
it is a data inode to key off all the special handling for data path
checksumming. Note that this uses is_data_inode instead of REQ_META as
REQ_META is only set directly before submission in submit_one_bio and
we'll also want to use this helper for error handling where REQ_META
isn't set yet.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

show more ...


# a39da514 31-May-2023 Christoph Hellwig <hch@lst.de>

btrfs: limit write bios to a single ordered extent

Currently buffered writeback bios are allowed to span multiple
ordered_extents, although that basically never actually happens since
commit 4a445b7

btrfs: limit write bios to a single ordered extent

Currently buffered writeback bios are allowed to span multiple
ordered_extents, although that basically never actually happens since
commit 4a445b7b6178 ("btrfs: don't merge pages into bio if their page
offset is not contiguous").

Supporting bios than span ordered_extents complicates the file
checksumming code, and prevents us from adding an ordered_extent pointer
to the btrfs_bio structure. Use the existing code to limit a bio to
single ordered_extent for zoned device writes for all writes.

This allows to remove the REQ_BTRFS_ONE_ORDERED flags, and the
handling of multiple ordered_extents in btrfs_csum_one_bio.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

show more ...


# c731cd0b 31-May-2023 Christoph Hellwig <hch@lst.de>

btrfs: fix file_offset for REQ_BTRFS_ONE_ORDERED bios that get split

If a bio gets split, it needs to have a proper file_offset for checksum
validation and repair to work properly.

Based on feedbac

btrfs: fix file_offset for REQ_BTRFS_ONE_ORDERED bios that get split

If a bio gets split, it needs to have a proper file_offset for checksum
validation and repair to work properly.

Based on feedback from Josef, commit 852eee62d31a ("btrfs: allow
btrfs_submit_bio to split bios") skipped this adjustment for ONE_ORDERED
bios. But if we actually ever need to split a ONE_ORDERED read bio, this
will lead to a wrong file offset in the repair code. Right now the only
user of the file_offset is logging of an error message so this is mostly
harmless, but the wrong offset might be more problematic for additional
users in the future.

Fixes: 852eee62d31a ("btrfs: allow btrfs_submit_bio to split bios")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

show more ...


# cd4efd21 30-May-2023 Christoph Hellwig <hch@lst.de>

btrfs: rename __btrfs_map_block to btrfs_map_block

Now that the old btrfs_map_block is gone, drop the leading underscores
from __btrfs_map_block.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by:

btrfs: rename __btrfs_map_block to btrfs_map_block

Now that the old btrfs_map_block is gone, drop the leading underscores
from __btrfs_map_block.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

show more ...


Revision tags: v6.1.31, v6.1.30
# 71df088c 24-May-2023 Christoph Hellwig <hch@lst.de>

btrfs: defer splitting of ordered extents until I/O completion

The btrfs zoned completion code currently needs an ordered_extent and
extent_map per bio so that it can account for the non-predictable

btrfs: defer splitting of ordered extents until I/O completion

The btrfs zoned completion code currently needs an ordered_extent and
extent_map per bio so that it can account for the non-predictable
write location from Zone Append. To archive that it currently splits
the ordered_extent and extent_map at I/O submission time, and then
records the actual physical address in the ->physical field of the
ordered_extent.

This patch instead switches to record the "original" physical address
that the btrfs allocator assigned in spare space in the btrfs_bio,
and then rewrites the logical address in the btrfs_ordered_sum
structure at I/O completion time. This allows the ordered extent
completion handler to simply walk the list of ordered csums and
split the ordered extent as needed. This removes an extra ordered
extent and extent_map lookup and manipulation during the I/O
submission path, and instead batches it in the I/O completion path
where we need to touch these anyway.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>

show more ...


# 3887653c 09-Jun-2023 Christoph Hellwig <hch@lst.de>

btrfs: record orig_physical only for the original bio

btrfs_submit_dev_bio is also called for clone bios that aren't embedded
into a btrfs_bio structure, but previous commit "btrfs: optimize the
log

btrfs: record orig_physical only for the original bio

btrfs_submit_dev_bio is also called for clone bios that aren't embedded
into a btrfs_bio structure, but previous commit "btrfs: optimize the
logical to physical mapping for zoned writes" added code to assign
btrfs_bio.orig_physical in it.

This is harmless right now as only the single data profile can be used
on zoned devices, but will blow up when the RAID stripe tree is added.
Move it out into the single I/O specific branch in the caller.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>

show more ...


# cbfce4c7 24-May-2023 Christoph Hellwig <hch@lst.de>

btrfs: optimize the logical to physical mapping for zoned writes

The current code to store the final logical to physical mapping for a
zone append write in the extent tree is rather inefficient. It

btrfs: optimize the logical to physical mapping for zoned writes

The current code to store the final logical to physical mapping for a
zone append write in the extent tree is rather inefficient. It first has
to split the ordered extent so that there is one ordered extent per bio,
so that it can look up the ordered extent on I/O completion in
btrfs_record_physical_zoned and store the physical LBA returned by the
block driver in the ordered extent.

btrfs_rewrite_logical_zoned then has to do a lookup in the chunk tree to
see what physical address the logical address for this bio / ordered
extent is mapped to, and then rewrite it in the extent tree.

To optimize this process, we can store the physical address assigned in
the chunk tree to the original logical address and a pointer to
btrfs_ordered_sum structure the in the btrfs_bio structure, and then use
this information to rewrite the logical address in the btrfs_ordered_sum
structure directly at I/O completion time in btrfs_record_physical_zoned.
btrfs_rewrite_logical_zoned then simply updates the logical address in
the extent tree and the ordered_extent itself.

The code in btrfs_rewrite_logical_zoned now runs for all data I/O
completions in zoned file systems, which is fine as there is no remapping
to do for non-append writes to conventional zones or for relocation, and
the overhead for quickly breaking out of the loop is very low.

Because zoned file systems now need the ordered_sums structure to
record the actual write location returned by zone append, allocate dummy
structures without the csum array for them when the I/O doesn't use
checksums, and free them when completing the ordered_extent.

Note that the btrfs_bio doesn't grow as the new field are places into
a union that is so far not used for data writes and has plenty of space
left in it.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

show more ...


# e9cb93b9 24-May-2023 Christoph Hellwig <hch@lst.de>

btrfs: don't call btrfs_record_physical_zoned for failed append

When a zoned append command fails there is no written address reported,
so don't try to record it.

Reviewed-by: Johannes Thumshirn <j

btrfs: don't call btrfs_record_physical_zoned for failed append

When a zoned append command fails there is no written address reported,
so don't try to record it.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

show more ...


Revision tags: v6.1.29, v6.1.28
# 8bfec2e4 03-May-2023 Christoph Hellwig <hch@lst.de>

btrfs: remove hipri_workers workqueue

Now that btrfs_wq_submit_bio is never called for synchronous I/O,
the hipri_workers workqueue is not used anymore and can be removed.

Reviewed-by: Chris Mason

btrfs: remove hipri_workers workqueue

Now that btrfs_wq_submit_bio is never called for synchronous I/O,
the hipri_workers workqueue is not used anymore and can be removed.

Reviewed-by: Chris Mason <clm@fb.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

show more ...


# e917ff56 03-May-2023 Christoph Hellwig <hch@lst.de>

btrfs: determine synchronous writers from bio or writeback control

The writeback_control structure already passes down the information about
a writeback being synchronous from the core VM code, and

btrfs: determine synchronous writers from bio or writeback control

The writeback_control structure already passes down the information about
a writeback being synchronous from the core VM code, and thus information
is propagated into the bio REQ_SYNC flag through the wbc_to_write_flags
helper.

Use that information to decide if checksums calculation is offloaded to
a workqueue instead of btrfs_inode::sync_writers field that not only
bloats the inode but also has too wide scope, being inode wide instead
of limited to the actual writeback request.

The sync writes were set in:

- btrfs_do_write_iter - regular IO, sync status is set
- start_ordered_ops - ordered write start, writeback with WB_SYNC_ALL
mode
- btrfs_write_marked_extents - write marked extents, writeback with
WB_SYNC_ALL mode

Reviewed-by: Chris Mason <clm@fb.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>

show more ...


# da023618 03-May-2023 Christoph Hellwig <hch@lst.de>

btrfs: submit IO synchronously for fast checksum implementations

Most modern hardware supports very fast accelerated crc32c calculation.
If that is supported the CPU overhead of the checksum calcula

btrfs: submit IO synchronously for fast checksum implementations

Most modern hardware supports very fast accelerated crc32c calculation.
If that is supported the CPU overhead of the checksum calculation is
very limited, and offloading the calculation to special worker threads
has a lot of overhead for no gain.

E.g. on an Intel Optane device is actually very much slows down even
1M buffered writes with fio:

Unpatched:

write: IOPS=3316, BW=3316MiB/s (3477MB/s)(200GiB/61757msec); 0 zone resets

With synchronous CRCs:

write: IOPS=4882, BW=4882MiB/s (5119MB/s)(200GiB/41948msec); 0 zone resets

With a lot of variation during the unpatched run going down as low as
1100MB/s, while the synchronous CRC version has about the same peak write
speed but much lower dips, and fewer kworkers churning around.
Both tests had fio saturated at 100% CPU.

(thanks to Jens Axboe via Chris Mason for the benchmarking)

Reviewed-by: Chris Mason <clm@fb.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

show more ...


Revision tags: v6.1.27, v6.1.26, v6.3, v6.1.25
# adbe7e38 15-Apr-2023 Anand Jain <anand.jain@oracle.com>

btrfs: use SECTOR_SHIFT to convert LBA to physical offset

Using SECTOR_SHIFT to convert LBA to physical address makes it more
readable.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by

btrfs: use SECTOR_SHIFT to convert LBA to physical offset

Using SECTOR_SHIFT to convert LBA to physical address makes it more
readable.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

show more ...


# b675df02 01-Jun-2023 Qu Wenruo <wqu@suse.com>

btrfs: zoned: fix dev-replace after the scrub rework

[BUG]
After commit e02ee89baa66 ("btrfs: scrub: switch scrub_simple_mirror()
to scrub_stripe infrastructure"), scrub no longer works for zoned de

btrfs: zoned: fix dev-replace after the scrub rework

[BUG]
After commit e02ee89baa66 ("btrfs: scrub: switch scrub_simple_mirror()
to scrub_stripe infrastructure"), scrub no longer works for zoned device
at all.

Even an empty zoned btrfs cannot be replaced:

# mkfs.btrfs -f /dev/nvme0n1
# mount /dev/nvme0n1 /mnt/btrfs
# btrfs replace start -Bf 1 /dev/nvme0n2 /mnt/btrfs
Resetting device zones /dev/nvme1n1 (160 zones) ...
ERROR: ioctl(DEV_REPLACE_START) failed on "/mnt/btrfs/": Input/output error

And we can hit kernel crash related to that:

BTRFS info (device nvme1n1): host-managed zoned block device /dev/nvme3n1, 160 zones of 134217728 bytes
BTRFS info (device nvme1n1): dev_replace from /dev/nvme2n1 (devid 2) to /dev/nvme3n1 started
nvme3n1: Zone Management Append(0x7d) @ LBA 65536, 4 blocks, Zone Is Full (sct 0x1 / sc 0xb9) DNR
I/O error, dev nvme3n1, sector 786432 op 0xd:(ZONE_APPEND) flags 0x4000 phys_seg 3 prio class 2
BTRFS error (device nvme1n1): bdev /dev/nvme3n1 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
BUG: kernel NULL pointer dereference, address: 00000000000000a8
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
RIP: 0010:_raw_spin_lock_irqsave+0x1e/0x40
Call Trace:
<IRQ>
btrfs_lookup_ordered_extent+0x31/0x190
btrfs_record_physical_zoned+0x18/0x40
btrfs_simple_end_io+0xaf/0xc0
blk_update_request+0x153/0x4c0
blk_mq_end_request+0x15/0xd0
nvme_poll_cq+0x1d3/0x360
nvme_irq+0x39/0x80
__handle_irq_event_percpu+0x3b/0x190
handle_irq_event+0x2f/0x70
handle_edge_irq+0x7c/0x210
__common_interrupt+0x34/0xa0
common_interrupt+0x7d/0xa0
</IRQ>
<TASK>
asm_common_interrupt+0x22/0x40

[CAUSE]
Dev-replace reuses scrub code to iterate all extents and write the
existing content back to the new device.

And for zoned devices, we call fill_writer_pointer_gap() to make sure
all the writes into the zoned device is sequential, even if there may be
some gaps between the writes.

However we have several different bugs all related to zoned dev-replace:

- We are using ZONE_APPEND operation for metadata style write back
For zoned devices, btrfs has two ways to write data:

* ZONE_APPEND for data
This allows higher queue depth, but will not be able to know where
the write would land.
Thus needs to grab the real on-disk physical location in it's endio.

* WRITE for metadata
This requires single queue depth (new writes can only be submitted
after previous one finished), and all writes must be sequential.

For scrub, we go single queue depth, but still goes with ZONE_APPEND,
which requires btrfs_bio::inode being populated.
This is the cause of that crash.

- No correct tracing of write_pointer
After a write finished, we should forward sctx->write_pointer, or
fill_writer_pointer_gap() would not work properly and cause more
than necessary zero out, and fill the whole zone prematurely.

- Incorrect physical bytenr passed to fill_writer_pointer_gap()
In scrub_write_sectors(), one call site passes logical address, which
is completely wrong.

The other call site passes physical address of current sector, but
we should pass the physical address of the btrfs_bio we're submitting.

This is the cause of the -EIO errors.

[FIX]
- Do not use ZONE_APPEND for btrfs_submit_repair_write().

- Manually forward sctx->write_pointer after successful writeback

- Use the physical address of the to-be-submitted btrfs_bio for
fill_writer_pointer_gap()

Now zoned device replace would work as expected.

Reported-by: Christoph Hellwig <hch@lst.de>
Fixes: e02ee89baa66 ("btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

show more ...


# 45c2f368 15-May-2023 Christoph Hellwig <hch@lst.de>

btrfs: call btrfs_orig_bbio_end_io in btrfs_end_bio_work

When I implemented the storage layer bio splitting, I was under the
assumption that we'll never split metadata bios. But Qu reminded me that

btrfs: call btrfs_orig_bbio_end_io in btrfs_end_bio_work

When I implemented the storage layer bio splitting, I was under the
assumption that we'll never split metadata bios. But Qu reminded me that
this can actually happen with very old file systems with unaligned
metadata chunks and RAID0.

I still haven't seen such a case in practice, but we better handled this
case, especially as it is fairly easily to do not calling the ->end_іo
method directly in btrfs_end_io_work, and using the proper
btrfs_orig_bbio_end_io helper instead.

In addition to the old file system with unaligned metadata chunks case
documented in the commit log, the combination of the new scrub code
with Johannes pending raid-stripe-tree also triggers this case. We
spent some time debugging it and found that this patch solves
the problem.

Fixes: 103c19723c80 ("btrfs: split the bio submission path into a separate file")
CC: stable@vger.kernel.org # 6.3+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>

show more ...


Revision tags: v6.1.24, v6.1.23, v6.1.22, v6.1.21
# 4886ff7b 19-Mar-2023 Qu Wenruo <wqu@suse.com>

btrfs: introduce a new helper to submit write bio for repair

Both scrub and read-repair are utilizing a special repair writes that:

- Only writes back to a single device
Even for read-repair on R

btrfs: introduce a new helper to submit write bio for repair

Both scrub and read-repair are utilizing a special repair writes that:

- Only writes back to a single device
Even for read-repair on RAID56, we only update the corrupted data
stripe itself, not triggering the full RMW path.

- Requires a valid @mirror_num
For RAID56 case, only @mirror_num == 1 is valid.
For non-RAID56 cases, we need @mirror_num to locate our stripe.

- No data csum generation needed

These two call sites still have some differences though:

- Read-repair goes plain bio
It doesn't need a full btrfs_bio, and goes submit_bio_wait().

- New scrub repair would go btrfs_bio
To simplify both read and write path.

So here this patch would:

- Introduce a common helper, btrfs_map_repair_block()
Due to the single device nature, we can use an on-stack
btrfs_io_stripe to pass device and its physical bytenr.

- Introduce a new interface, btrfs_submit_repair_bio(), for later scrub
code
This is for the incoming scrub code.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

show more ...


# 4317ff00 23-Mar-2023 Qu Wenruo <wqu@suse.com>

btrfs: introduce btrfs_bio::fs_info member

Currently we're doing a lot of work for btrfs_bio:

- Checksum verification for data read bios
- Bio splits if it crosses stripe boundary
- Read repair for

btrfs: introduce btrfs_bio::fs_info member

Currently we're doing a lot of work for btrfs_bio:

- Checksum verification for data read bios
- Bio splits if it crosses stripe boundary
- Read repair for data read bios

However for the incoming scrub patches, we don't want this extra
functionality at all, just plain logical + mirror -> physical mapping
ability.

Thus here we do the following changes:

- Introduce btrfs_bio::fs_info
This is for the new scrub specific btrfs_bio, which would not populate
btrfs_bio::inode.
Thus we need such new member to grab a fs_info

This new member will always be populated.

- Replace @inode argument with @fs_info for btrfs_bio_init() and its
caller
Since @inode is no longer a mandatory member, replace it with
@fs_info, and let involved users populate @inode.

- Skip checksum verification and generation if @bbio->inode is NULL

- Add extra ASSERT()s
To make sure:

* bbio->inode is properly set for involved read repair path
* if @file_offset is set, bbio->inode is also populated

- Grab @fs_info from @bbio directly
We can no longer go @bbio->inode->root->fs_info, as bbio->inode can be
NULL. This involves:

* btrfs_simple_end_io()
* should_async_write()
* btrfs_wq_submit_bio()
* btrfs_use_zone_append()

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

show more ...


# 3480373e 26-Mar-2023 Christoph Hellwig <hch@lst.de>

btrfs, block: move REQ_CGROUP_PUNT to btrfs

REQ_CGROUP_PUNT is a bit annoying as it is hard to follow and adds
a branch to the bio submission hot path. To fix this, export
blkcg_punt_bio_submit and

btrfs, block: move REQ_CGROUP_PUNT to btrfs

REQ_CGROUP_PUNT is a bit annoying as it is hard to follow and adds
a branch to the bio submission hot path. To fix this, export
blkcg_punt_bio_submit and let btrfs call it directly. Add a new
REQ_FS_PRIVATE flag for btrfs to indicate to it's own low-level
bio submission code that a punt to the cgroup submission helper
is required.

Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

show more ...


# 7edd339c 28-Mar-2023 Christoph Hellwig <hch@lst.de>

btrfs: pass an ordered_extent to btrfs_extract_ordered_extent

To prepare for a new caller that already has the ordered_extent
available, change btrfs_extract_ordered_extent to take an argument
for i

btrfs: pass an ordered_extent to btrfs_extract_ordered_extent

To prepare for a new caller that already has the ordered_extent
available, change btrfs_extract_ordered_extent to take an argument
for it. Add a wrapper for the bio case that still has to do the
lookup (for now).

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

show more ...


# 078e4cf5 30-Mar-2023 Johannes Thumshirn <johannes.thumshirn@wdc.com>

btrfs: use __bio_add_page for adding a single page in repair_one_sector

The btrfs repair bio submission code uses bio_add_page() to add a page
to a newly created bio. bio_add_page() can fail, but th

btrfs: use __bio_add_page for adding a single page in repair_one_sector

The btrfs repair bio submission code uses bio_add_page() to add a page
to a newly created bio. bio_add_page() can fail, but the return value is
never checked.

Use __bio_add_page() as adding a single page to a newly created bio is
guaranteed to succeed.

This brings us a step closer to marking bio_add_page() as __must_check.

Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

show more ...


Revision tags: v6.1.20, v6.1.19, v6.1.18, v6.1.17, v6.1.16
# 2cef0c79 07-Mar-2023 Christoph Hellwig <hch@lst.de>

btrfs: make btrfs_split_bio work on struct btrfs_bio

btrfs_split_bio expects a btrfs_bio as argument and always allocates one.
Type both the orig_bio argument and the return value as struct btrfs_bi

btrfs: make btrfs_split_bio work on struct btrfs_bio

btrfs_split_bio expects a btrfs_bio as argument and always allocates one.
Type both the orig_bio argument and the return value as struct btrfs_bio
to improve type safety.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>

show more ...


# b41bbd29 07-Mar-2023 Christoph Hellwig <hch@lst.de>

btrfs: return a btrfs_bio from btrfs_bio_alloc

Return the containing struct btrfs_bio instead of the less type safe
struct bio from btrfs_bio_alloc.

Reviewed-by: Anand Jain <anand.jain@oracle.com>

btrfs: return a btrfs_bio from btrfs_bio_alloc

Return the containing struct btrfs_bio instead of the less type safe
struct bio from btrfs_bio_alloc.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>

show more ...


# ae42a154 07-Mar-2023 Christoph Hellwig <hch@lst.de>

btrfs: pass a btrfs_bio to btrfs_submit_bio

btrfs_submit_bio expects the bio passed to it to be embedded into a
btrfs_bio structure. Pass the btrfs_bio directly to increase type
safety and make the

btrfs: pass a btrfs_bio to btrfs_submit_bio

btrfs_submit_bio expects the bio passed to it to be embedded into a
btrfs_bio structure. Pass the btrfs_bio directly to increase type
safety and make the code self-documenting.

Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>

show more ...


Revision tags: v6.1.15, v6.1.14, v6.1.13, v6.2, v6.1.12
# 98e8d36a 12-Feb-2023 Naohiro Aota <naohiro.aota@wdc.com>

btrfs: fix unnecessary increment of read error stat on write error

Current btrfs_log_dev_io_error() increases the read error count even if the
erroneous IO is a WRITE request. This is because it for

btrfs: fix unnecessary increment of read error stat on write error

Current btrfs_log_dev_io_error() increases the read error count even if the
erroneous IO is a WRITE request. This is because it forget to use "else
if", and all the error WRITE requests counts as READ error as there is (of
course) no REQ_RAHEAD bit set.

Fixes: c3a62baf21ad ("btrfs: use chained bios when cloning")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

show more ...


12