Revision tags: v6.6.36, v6.6.35, v6.6.34, v6.6.33 |
|
#
082b3d4e |
| 07-Jun-2024 |
Johannes Thumshirn <johannes.thumshirn@wdc.com> |
btrfs: zoned: allocate dummy checksums for zoned NODATASUM writes
[ Upstream commit cebae292e0c32a228e8f2219c270a7237be24a6a ]
Shin'ichiro reported that when he's running fstests' test-case btrfs/1
btrfs: zoned: allocate dummy checksums for zoned NODATASUM writes
[ Upstream commit cebae292e0c32a228e8f2219c270a7237be24a6a ]
Shin'ichiro reported that when he's running fstests' test-case btrfs/167 on emulated zoned devices, he's seeing the following NULL pointer dereference in 'btrfs_zone_finish_endio()':
Oops: general protection fault, probably for non-canonical address 0xdffffc0000000011: 0000 [#1] PREEMPT SMP KASAN NOPTI KASAN: null-ptr-deref in range [0x0000000000000088-0x000000000000008f] CPU: 4 PID: 2332440 Comm: kworker/u80:15 Tainted: G W 6.10.0-rc2-kts+ #4 Hardware name: Supermicro Super Server/X11SPi-TF, BIOS 3.3 02/21/2020 Workqueue: btrfs-endio-write btrfs_work_helper [btrfs] RIP: 0010:btrfs_zone_finish_endio.part.0+0x34/0x160 [btrfs]
RSP: 0018:ffff88867f107a90 EFLAGS: 00010206 RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffffff893e5534 RDX: 0000000000000011 RSI: 0000000000000004 RDI: 0000000000000088 RBP: 0000000000000002 R08: 0000000000000001 R09: ffffed1081696028 R10: ffff88840b4b0143 R11: ffff88834dfff600 R12: ffff88840b4b0000 R13: 0000000000020000 R14: 0000000000000000 R15: ffff888530ad5210 FS: 0000000000000000(0000) GS:ffff888e3f800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f87223fff38 CR3: 00000007a7c6a002 CR4: 00000000007706f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: <TASK> ? __die_body.cold+0x19/0x27 ? die_addr+0x46/0x70 ? exc_general_protection+0x14f/0x250 ? asm_exc_general_protection+0x26/0x30 ? do_raw_read_unlock+0x44/0x70 ? btrfs_zone_finish_endio.part.0+0x34/0x160 [btrfs] btrfs_finish_one_ordered+0x5d9/0x19a0 [btrfs] ? __pfx_lock_release+0x10/0x10 ? do_raw_write_lock+0x90/0x260 ? __pfx_do_raw_write_lock+0x10/0x10 ? __pfx_btrfs_finish_one_ordered+0x10/0x10 [btrfs] ? _raw_write_unlock+0x23/0x40 ? btrfs_finish_ordered_zoned+0x5a9/0x850 [btrfs] ? lock_acquire+0x435/0x500 btrfs_work_helper+0x1b1/0xa70 [btrfs] ? __schedule+0x10a8/0x60b0 ? __pfx___might_resched+0x10/0x10 process_one_work+0x862/0x1410 ? __pfx_lock_acquire+0x10/0x10 ? __pfx_process_one_work+0x10/0x10 ? assign_work+0x16c/0x240 worker_thread+0x5e6/0x1010 ? __pfx_worker_thread+0x10/0x10 kthread+0x2c3/0x3a0 ? trace_irq_enable.constprop.0+0xce/0x110 ? __pfx_kthread+0x10/0x10 ret_from_fork+0x31/0x70 ? __pfx_kthread+0x10/0x10 ret_from_fork_asm+0x1a/0x30 </TASK>
Enabling CONFIG_BTRFS_ASSERT revealed the following assertion to trigger:
assertion failed: !list_empty(&ordered->list), in fs/btrfs/zoned.c:1815
This indicates, that we're missing the checksums list on the ordered_extent. As btrfs/167 is doing a NOCOW write this is to be expected.
Further analysis with drgn confirmed the assumption:
>>> inode = prog.crashed_thread().stack_trace()[11]['ordered'].inode >>> btrfs_inode = drgn.container_of(inode, "struct btrfs_inode", \ "vfs_inode") >>> print(btrfs_inode.flags) (u32)1
As zoned emulation mode simulates conventional zones on regular devices, we cannot use zone-append for writing. But we're only attaching dummy checksums if we're doing a zone-append write.
So for NOCOW zoned data writes on conventional zones, also attach a dummy checksum.
Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com> Fixes: cbfce4c7fbde ("btrfs: optimize the logical to physical mapping for zoned writes") CC: Naohiro Aota <Naohiro.Aota@wdc.com> # 6.6+ Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> Signed-off-by: Sasha Levin <sashal@kernel.org>
show more ...
|
Revision tags: v6.6.32, v6.6.31, v6.6.30, v6.6.29, v6.6.28, v6.6.27, v6.6.26, v6.6.25, v6.6.24, v6.6.23, v6.6.16, v6.6.15, v6.6.14, v6.6.13, v6.6.12, v6.6.11, v6.6.10, v6.6.9, v6.6.8, v6.6.7, v6.6.6, v6.6.5, v6.6.4, v6.6.3, v6.6.2, v6.5.11, v6.6.1, v6.5.10, v6.6, v6.5.9, v6.5.8, v6.5.7, v6.5.6, v6.5.5, v6.5.4, v6.5.3, v6.5.2, v6.1.51, v6.5.1, v6.1.50, v6.5, v6.1.49, v6.1.48, v6.1.46, v6.1.45, v6.1.44, v6.1.43, v6.1.42, v6.1.41, v6.1.40, v6.1.39, v6.1.38, v6.1.37, v6.1.36, v6.4, v6.1.35, v6.1.34, v6.1.33, v6.1.32 |
|
#
ec63b84d |
| 31-May-2023 |
Christoph Hellwig <hch@lst.de> |
btrfs: add an ordered_extent pointer to struct btrfs_bio
Add a pointer to the ordered_extent to the existing union in struct btrfs_bio, so all code dealing with data write bios can just use a pointe
btrfs: add an ordered_extent pointer to struct btrfs_bio
Add a pointer to the ordered_extent to the existing union in struct btrfs_bio, so all code dealing with data write bios can just use a pointer dereference to retrieve the ordered_extent instead of doing multiple rbtree lookups per I/O.
The reference to this ordered_extent is dropped at end I/O time, which implies that an extra one must be acquired when the bio is split. This also requires moving the btrfs_extract_ordered_extent call into btrfs_split_bio so that the invariant of always having a valid ordered_extent reference for the btrfs_bio is kept.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
fbe96087 |
| 31-May-2023 |
Christoph Hellwig <hch@lst.de> |
btrfs: add a is_data_bbio helper
Add a helper to check for that a btrfs_bio has a valid inode, and that it is a data inode to key off all the special handling for data path checksumming. Note that
btrfs: add a is_data_bbio helper
Add a helper to check for that a btrfs_bio has a valid inode, and that it is a data inode to key off all the special handling for data path checksumming. Note that this uses is_data_inode instead of REQ_META as REQ_META is only set directly before submission in submit_one_bio and we'll also want to use this helper for error handling where REQ_META isn't set yet.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
a39da514 |
| 31-May-2023 |
Christoph Hellwig <hch@lst.de> |
btrfs: limit write bios to a single ordered extent
Currently buffered writeback bios are allowed to span multiple ordered_extents, although that basically never actually happens since commit 4a445b7
btrfs: limit write bios to a single ordered extent
Currently buffered writeback bios are allowed to span multiple ordered_extents, although that basically never actually happens since commit 4a445b7b6178 ("btrfs: don't merge pages into bio if their page offset is not contiguous").
Supporting bios than span ordered_extents complicates the file checksumming code, and prevents us from adding an ordered_extent pointer to the btrfs_bio structure. Use the existing code to limit a bio to single ordered_extent for zoned device writes for all writes.
This allows to remove the REQ_BTRFS_ONE_ORDERED flags, and the handling of multiple ordered_extents in btrfs_csum_one_bio.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
c731cd0b |
| 31-May-2023 |
Christoph Hellwig <hch@lst.de> |
btrfs: fix file_offset for REQ_BTRFS_ONE_ORDERED bios that get split
If a bio gets split, it needs to have a proper file_offset for checksum validation and repair to work properly.
Based on feedbac
btrfs: fix file_offset for REQ_BTRFS_ONE_ORDERED bios that get split
If a bio gets split, it needs to have a proper file_offset for checksum validation and repair to work properly.
Based on feedback from Josef, commit 852eee62d31a ("btrfs: allow btrfs_submit_bio to split bios") skipped this adjustment for ONE_ORDERED bios. But if we actually ever need to split a ONE_ORDERED read bio, this will lead to a wrong file offset in the repair code. Right now the only user of the file_offset is logging of an error message so this is mostly harmless, but the wrong offset might be more problematic for additional users in the future.
Fixes: 852eee62d31a ("btrfs: allow btrfs_submit_bio to split bios") Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
cd4efd21 |
| 30-May-2023 |
Christoph Hellwig <hch@lst.de> |
btrfs: rename __btrfs_map_block to btrfs_map_block
Now that the old btrfs_map_block is gone, drop the leading underscores from __btrfs_map_block.
Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by:
btrfs: rename __btrfs_map_block to btrfs_map_block
Now that the old btrfs_map_block is gone, drop the leading underscores from __btrfs_map_block.
Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
Revision tags: v6.1.31, v6.1.30 |
|
#
71df088c |
| 24-May-2023 |
Christoph Hellwig <hch@lst.de> |
btrfs: defer splitting of ordered extents until I/O completion
The btrfs zoned completion code currently needs an ordered_extent and extent_map per bio so that it can account for the non-predictable
btrfs: defer splitting of ordered extents until I/O completion
The btrfs zoned completion code currently needs an ordered_extent and extent_map per bio so that it can account for the non-predictable write location from Zone Append. To archive that it currently splits the ordered_extent and extent_map at I/O submission time, and then records the actual physical address in the ->physical field of the ordered_extent.
This patch instead switches to record the "original" physical address that the btrfs allocator assigned in spare space in the btrfs_bio, and then rewrites the logical address in the btrfs_ordered_sum structure at I/O completion time. This allows the ordered extent completion handler to simply walk the list of ordered csums and split the ordered extent as needed. This removes an extra ordered extent and extent_map lookup and manipulation during the I/O submission path, and instead batches it in the I/O completion path where we need to touch these anyway.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
3887653c |
| 09-Jun-2023 |
Christoph Hellwig <hch@lst.de> |
btrfs: record orig_physical only for the original bio
btrfs_submit_dev_bio is also called for clone bios that aren't embedded into a btrfs_bio structure, but previous commit "btrfs: optimize the log
btrfs: record orig_physical only for the original bio
btrfs_submit_dev_bio is also called for clone bios that aren't embedded into a btrfs_bio structure, but previous commit "btrfs: optimize the logical to physical mapping for zoned writes" added code to assign btrfs_bio.orig_physical in it.
This is harmless right now as only the single data profile can be used on zoned devices, but will blow up when the RAID stripe tree is added. Move it out into the single I/O specific branch in the caller.
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
cbfce4c7 |
| 24-May-2023 |
Christoph Hellwig <hch@lst.de> |
btrfs: optimize the logical to physical mapping for zoned writes
The current code to store the final logical to physical mapping for a zone append write in the extent tree is rather inefficient. It
btrfs: optimize the logical to physical mapping for zoned writes
The current code to store the final logical to physical mapping for a zone append write in the extent tree is rather inefficient. It first has to split the ordered extent so that there is one ordered extent per bio, so that it can look up the ordered extent on I/O completion in btrfs_record_physical_zoned and store the physical LBA returned by the block driver in the ordered extent.
btrfs_rewrite_logical_zoned then has to do a lookup in the chunk tree to see what physical address the logical address for this bio / ordered extent is mapped to, and then rewrite it in the extent tree.
To optimize this process, we can store the physical address assigned in the chunk tree to the original logical address and a pointer to btrfs_ordered_sum structure the in the btrfs_bio structure, and then use this information to rewrite the logical address in the btrfs_ordered_sum structure directly at I/O completion time in btrfs_record_physical_zoned. btrfs_rewrite_logical_zoned then simply updates the logical address in the extent tree and the ordered_extent itself.
The code in btrfs_rewrite_logical_zoned now runs for all data I/O completions in zoned file systems, which is fine as there is no remapping to do for non-append writes to conventional zones or for relocation, and the overhead for quickly breaking out of the loop is very low.
Because zoned file systems now need the ordered_sums structure to record the actual write location returned by zone append, allocate dummy structures without the csum array for them when the I/O doesn't use checksums, and free them when completing the ordered_extent.
Note that the btrfs_bio doesn't grow as the new field are places into a union that is so far not used for data writes and has plenty of space left in it.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
e9cb93b9 |
| 24-May-2023 |
Christoph Hellwig <hch@lst.de> |
btrfs: don't call btrfs_record_physical_zoned for failed append
When a zoned append command fails there is no written address reported, so don't try to record it.
Reviewed-by: Johannes Thumshirn <j
btrfs: don't call btrfs_record_physical_zoned for failed append
When a zoned append command fails there is no written address reported, so don't try to record it.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
Revision tags: v6.1.29, v6.1.28 |
|
#
8bfec2e4 |
| 03-May-2023 |
Christoph Hellwig <hch@lst.de> |
btrfs: remove hipri_workers workqueue
Now that btrfs_wq_submit_bio is never called for synchronous I/O, the hipri_workers workqueue is not used anymore and can be removed.
Reviewed-by: Chris Mason
btrfs: remove hipri_workers workqueue
Now that btrfs_wq_submit_bio is never called for synchronous I/O, the hipri_workers workqueue is not used anymore and can be removed.
Reviewed-by: Chris Mason <clm@fb.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
e917ff56 |
| 03-May-2023 |
Christoph Hellwig <hch@lst.de> |
btrfs: determine synchronous writers from bio or writeback control
The writeback_control structure already passes down the information about a writeback being synchronous from the core VM code, and
btrfs: determine synchronous writers from bio or writeback control
The writeback_control structure already passes down the information about a writeback being synchronous from the core VM code, and thus information is propagated into the bio REQ_SYNC flag through the wbc_to_write_flags helper.
Use that information to decide if checksums calculation is offloaded to a workqueue instead of btrfs_inode::sync_writers field that not only bloats the inode but also has too wide scope, being inode wide instead of limited to the actual writeback request.
The sync writes were set in:
- btrfs_do_write_iter - regular IO, sync status is set - start_ordered_ops - ordered write start, writeback with WB_SYNC_ALL mode - btrfs_write_marked_extents - write marked extents, writeback with WB_SYNC_ALL mode
Reviewed-by: Chris Mason <clm@fb.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
da023618 |
| 03-May-2023 |
Christoph Hellwig <hch@lst.de> |
btrfs: submit IO synchronously for fast checksum implementations
Most modern hardware supports very fast accelerated crc32c calculation. If that is supported the CPU overhead of the checksum calcula
btrfs: submit IO synchronously for fast checksum implementations
Most modern hardware supports very fast accelerated crc32c calculation. If that is supported the CPU overhead of the checksum calculation is very limited, and offloading the calculation to special worker threads has a lot of overhead for no gain.
E.g. on an Intel Optane device is actually very much slows down even 1M buffered writes with fio:
Unpatched:
write: IOPS=3316, BW=3316MiB/s (3477MB/s)(200GiB/61757msec); 0 zone resets
With synchronous CRCs:
write: IOPS=4882, BW=4882MiB/s (5119MB/s)(200GiB/41948msec); 0 zone resets
With a lot of variation during the unpatched run going down as low as 1100MB/s, while the synchronous CRC version has about the same peak write speed but much lower dips, and fewer kworkers churning around. Both tests had fio saturated at 100% CPU.
(thanks to Jens Axboe via Chris Mason for the benchmarking)
Reviewed-by: Chris Mason <clm@fb.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
Revision tags: v6.1.27, v6.1.26, v6.3, v6.1.25 |
|
#
adbe7e38 |
| 15-Apr-2023 |
Anand Jain <anand.jain@oracle.com> |
btrfs: use SECTOR_SHIFT to convert LBA to physical offset
Using SECTOR_SHIFT to convert LBA to physical address makes it more readable.
Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by
btrfs: use SECTOR_SHIFT to convert LBA to physical offset
Using SECTOR_SHIFT to convert LBA to physical address makes it more readable.
Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
b675df02 |
| 01-Jun-2023 |
Qu Wenruo <wqu@suse.com> |
btrfs: zoned: fix dev-replace after the scrub rework
[BUG] After commit e02ee89baa66 ("btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure"), scrub no longer works for zoned de
btrfs: zoned: fix dev-replace after the scrub rework
[BUG] After commit e02ee89baa66 ("btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure"), scrub no longer works for zoned device at all.
Even an empty zoned btrfs cannot be replaced:
# mkfs.btrfs -f /dev/nvme0n1 # mount /dev/nvme0n1 /mnt/btrfs # btrfs replace start -Bf 1 /dev/nvme0n2 /mnt/btrfs Resetting device zones /dev/nvme1n1 (160 zones) ... ERROR: ioctl(DEV_REPLACE_START) failed on "/mnt/btrfs/": Input/output error
And we can hit kernel crash related to that:
BTRFS info (device nvme1n1): host-managed zoned block device /dev/nvme3n1, 160 zones of 134217728 bytes BTRFS info (device nvme1n1): dev_replace from /dev/nvme2n1 (devid 2) to /dev/nvme3n1 started nvme3n1: Zone Management Append(0x7d) @ LBA 65536, 4 blocks, Zone Is Full (sct 0x1 / sc 0xb9) DNR I/O error, dev nvme3n1, sector 786432 op 0xd:(ZONE_APPEND) flags 0x4000 phys_seg 3 prio class 2 BTRFS error (device nvme1n1): bdev /dev/nvme3n1 errs: wr 1, rd 0, flush 0, corrupt 0, gen 0 BUG: kernel NULL pointer dereference, address: 00000000000000a8 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 RIP: 0010:_raw_spin_lock_irqsave+0x1e/0x40 Call Trace: <IRQ> btrfs_lookup_ordered_extent+0x31/0x190 btrfs_record_physical_zoned+0x18/0x40 btrfs_simple_end_io+0xaf/0xc0 blk_update_request+0x153/0x4c0 blk_mq_end_request+0x15/0xd0 nvme_poll_cq+0x1d3/0x360 nvme_irq+0x39/0x80 __handle_irq_event_percpu+0x3b/0x190 handle_irq_event+0x2f/0x70 handle_edge_irq+0x7c/0x210 __common_interrupt+0x34/0xa0 common_interrupt+0x7d/0xa0 </IRQ> <TASK> asm_common_interrupt+0x22/0x40
[CAUSE] Dev-replace reuses scrub code to iterate all extents and write the existing content back to the new device.
And for zoned devices, we call fill_writer_pointer_gap() to make sure all the writes into the zoned device is sequential, even if there may be some gaps between the writes.
However we have several different bugs all related to zoned dev-replace:
- We are using ZONE_APPEND operation for metadata style write back For zoned devices, btrfs has two ways to write data:
* ZONE_APPEND for data This allows higher queue depth, but will not be able to know where the write would land. Thus needs to grab the real on-disk physical location in it's endio.
* WRITE for metadata This requires single queue depth (new writes can only be submitted after previous one finished), and all writes must be sequential.
For scrub, we go single queue depth, but still goes with ZONE_APPEND, which requires btrfs_bio::inode being populated. This is the cause of that crash.
- No correct tracing of write_pointer After a write finished, we should forward sctx->write_pointer, or fill_writer_pointer_gap() would not work properly and cause more than necessary zero out, and fill the whole zone prematurely.
- Incorrect physical bytenr passed to fill_writer_pointer_gap() In scrub_write_sectors(), one call site passes logical address, which is completely wrong.
The other call site passes physical address of current sector, but we should pass the physical address of the btrfs_bio we're submitting.
This is the cause of the -EIO errors.
[FIX] - Do not use ZONE_APPEND for btrfs_submit_repair_write().
- Manually forward sctx->write_pointer after successful writeback
- Use the physical address of the to-be-submitted btrfs_bio for fill_writer_pointer_gap()
Now zoned device replace would work as expected.
Reported-by: Christoph Hellwig <hch@lst.de> Fixes: e02ee89baa66 ("btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure") Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
45c2f368 |
| 15-May-2023 |
Christoph Hellwig <hch@lst.de> |
btrfs: call btrfs_orig_bbio_end_io in btrfs_end_bio_work
When I implemented the storage layer bio splitting, I was under the assumption that we'll never split metadata bios. But Qu reminded me that
btrfs: call btrfs_orig_bbio_end_io in btrfs_end_bio_work
When I implemented the storage layer bio splitting, I was under the assumption that we'll never split metadata bios. But Qu reminded me that this can actually happen with very old file systems with unaligned metadata chunks and RAID0.
I still haven't seen such a case in practice, but we better handled this case, especially as it is fairly easily to do not calling the ->end_іo method directly in btrfs_end_io_work, and using the proper btrfs_orig_bbio_end_io helper instead.
In addition to the old file system with unaligned metadata chunks case documented in the commit log, the combination of the new scrub code with Johannes pending raid-stripe-tree also triggers this case. We spent some time debugging it and found that this patch solves the problem.
Fixes: 103c19723c80 ("btrfs: split the bio submission path into a separate file") CC: stable@vger.kernel.org # 6.3+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
Revision tags: v6.1.24, v6.1.23, v6.1.22, v6.1.21 |
|
#
4886ff7b |
| 19-Mar-2023 |
Qu Wenruo <wqu@suse.com> |
btrfs: introduce a new helper to submit write bio for repair
Both scrub and read-repair are utilizing a special repair writes that:
- Only writes back to a single device Even for read-repair on R
btrfs: introduce a new helper to submit write bio for repair
Both scrub and read-repair are utilizing a special repair writes that:
- Only writes back to a single device Even for read-repair on RAID56, we only update the corrupted data stripe itself, not triggering the full RMW path.
- Requires a valid @mirror_num For RAID56 case, only @mirror_num == 1 is valid. For non-RAID56 cases, we need @mirror_num to locate our stripe.
- No data csum generation needed
These two call sites still have some differences though:
- Read-repair goes plain bio It doesn't need a full btrfs_bio, and goes submit_bio_wait().
- New scrub repair would go btrfs_bio To simplify both read and write path.
So here this patch would:
- Introduce a common helper, btrfs_map_repair_block() Due to the single device nature, we can use an on-stack btrfs_io_stripe to pass device and its physical bytenr.
- Introduce a new interface, btrfs_submit_repair_bio(), for later scrub code This is for the incoming scrub code.
Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
4317ff00 |
| 23-Mar-2023 |
Qu Wenruo <wqu@suse.com> |
btrfs: introduce btrfs_bio::fs_info member
Currently we're doing a lot of work for btrfs_bio:
- Checksum verification for data read bios - Bio splits if it crosses stripe boundary - Read repair for
btrfs: introduce btrfs_bio::fs_info member
Currently we're doing a lot of work for btrfs_bio:
- Checksum verification for data read bios - Bio splits if it crosses stripe boundary - Read repair for data read bios
However for the incoming scrub patches, we don't want this extra functionality at all, just plain logical + mirror -> physical mapping ability.
Thus here we do the following changes:
- Introduce btrfs_bio::fs_info This is for the new scrub specific btrfs_bio, which would not populate btrfs_bio::inode. Thus we need such new member to grab a fs_info
This new member will always be populated.
- Replace @inode argument with @fs_info for btrfs_bio_init() and its caller Since @inode is no longer a mandatory member, replace it with @fs_info, and let involved users populate @inode.
- Skip checksum verification and generation if @bbio->inode is NULL
- Add extra ASSERT()s To make sure:
* bbio->inode is properly set for involved read repair path * if @file_offset is set, bbio->inode is also populated
- Grab @fs_info from @bbio directly We can no longer go @bbio->inode->root->fs_info, as bbio->inode can be NULL. This involves:
* btrfs_simple_end_io() * should_async_write() * btrfs_wq_submit_bio() * btrfs_use_zone_append()
Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
3480373e |
| 26-Mar-2023 |
Christoph Hellwig <hch@lst.de> |
btrfs, block: move REQ_CGROUP_PUNT to btrfs
REQ_CGROUP_PUNT is a bit annoying as it is hard to follow and adds a branch to the bio submission hot path. To fix this, export blkcg_punt_bio_submit and
btrfs, block: move REQ_CGROUP_PUNT to btrfs
REQ_CGROUP_PUNT is a bit annoying as it is hard to follow and adds a branch to the bio submission hot path. To fix this, export blkcg_punt_bio_submit and let btrfs call it directly. Add a new REQ_FS_PRIVATE flag for btrfs to indicate to it's own low-level bio submission code that a punt to the cgroup submission helper is required.
Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
7edd339c |
| 28-Mar-2023 |
Christoph Hellwig <hch@lst.de> |
btrfs: pass an ordered_extent to btrfs_extract_ordered_extent
To prepare for a new caller that already has the ordered_extent available, change btrfs_extract_ordered_extent to take an argument for i
btrfs: pass an ordered_extent to btrfs_extract_ordered_extent
To prepare for a new caller that already has the ordered_extent available, change btrfs_extract_ordered_extent to take an argument for it. Add a wrapper for the bio case that still has to do the lookup (for now).
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
078e4cf5 |
| 30-Mar-2023 |
Johannes Thumshirn <johannes.thumshirn@wdc.com> |
btrfs: use __bio_add_page for adding a single page in repair_one_sector
The btrfs repair bio submission code uses bio_add_page() to add a page to a newly created bio. bio_add_page() can fail, but th
btrfs: use __bio_add_page for adding a single page in repair_one_sector
The btrfs repair bio submission code uses bio_add_page() to add a page to a newly created bio. bio_add_page() can fail, but the return value is never checked.
Use __bio_add_page() as adding a single page to a newly created bio is guaranteed to succeed.
This brings us a step closer to marking bio_add_page() as __must_check.
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
Revision tags: v6.1.20, v6.1.19, v6.1.18, v6.1.17, v6.1.16 |
|
#
2cef0c79 |
| 07-Mar-2023 |
Christoph Hellwig <hch@lst.de> |
btrfs: make btrfs_split_bio work on struct btrfs_bio
btrfs_split_bio expects a btrfs_bio as argument and always allocates one. Type both the orig_bio argument and the return value as struct btrfs_bi
btrfs: make btrfs_split_bio work on struct btrfs_bio
btrfs_split_bio expects a btrfs_bio as argument and always allocates one. Type both the orig_bio argument and the return value as struct btrfs_bio to improve type safety.
Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
b41bbd29 |
| 07-Mar-2023 |
Christoph Hellwig <hch@lst.de> |
btrfs: return a btrfs_bio from btrfs_bio_alloc
Return the containing struct btrfs_bio instead of the less type safe struct bio from btrfs_bio_alloc.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
btrfs: return a btrfs_bio from btrfs_bio_alloc
Return the containing struct btrfs_bio instead of the less type safe struct bio from btrfs_bio_alloc.
Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
ae42a154 |
| 07-Mar-2023 |
Christoph Hellwig <hch@lst.de> |
btrfs: pass a btrfs_bio to btrfs_submit_bio
btrfs_submit_bio expects the bio passed to it to be embedded into a btrfs_bio structure. Pass the btrfs_bio directly to increase type safety and make the
btrfs: pass a btrfs_bio to btrfs_submit_bio
btrfs_submit_bio expects the bio passed to it to be embedded into a btrfs_bio structure. Pass the btrfs_bio directly to increase type safety and make the code self-documenting.
Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
Revision tags: v6.1.15, v6.1.14, v6.1.13, v6.2, v6.1.12 |
|
#
98e8d36a |
| 12-Feb-2023 |
Naohiro Aota <naohiro.aota@wdc.com> |
btrfs: fix unnecessary increment of read error stat on write error
Current btrfs_log_dev_io_error() increases the read error count even if the erroneous IO is a WRITE request. This is because it for
btrfs: fix unnecessary increment of read error stat on write error
Current btrfs_log_dev_io_error() increases the read error count even if the erroneous IO is a WRITE request. This is because it forget to use "else if", and all the error WRITE requests counts as READ error as there is (of course) no REQ_RAHEAD bit set.
Fixes: c3a62baf21ad ("btrfs: use chained bios when cloning") CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|