#
88a777a6 |
| 09-Oct-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: implement space clamping for preemptive flushing
Starting preemptive flushing at 50% of available free space is a good start, but some workloads are particularly abusive and can quickly overw
btrfs: implement space clamping for preemptive flushing
Starting preemptive flushing at 50% of available free space is a good start, but some workloads are particularly abusive and can quickly overwhelm the preemptive flushing code and drive us into using tickets.
Handle this by clamping down on our threshold for starting and continuing to run preemptive flushing. This is particularly important for our overcommit case, as we can really drive the file system into overages and then it's more difficult to pull it back as we start to actually fill up the file system.
The clamping is essentially 2^CLAMP, but we start at 1 so whatever we calculate for overcommit is the baseline.
Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
2e294c60 |
| 09-Oct-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: simplify the logic in need_preemptive_flushing
A lot of this was added all in one go with no explanation, and is a bit unwieldy and confusing. Simplify the logic to start preemptive flushing
btrfs: simplify the logic in need_preemptive_flushing
A lot of this was added all in one go with no explanation, and is a bit unwieldy and confusing. Simplify the logic to start preemptive flushing if we've reserved more than half of our available free space.
Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
9f42d377 |
| 09-Oct-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: rework btrfs_calc_reclaim_metadata_size
Currently btrfs_calc_reclaim_metadata_size does two things, it returns the space currently required for flushing by the tickets, and if there are no ti
btrfs: rework btrfs_calc_reclaim_metadata_size
Currently btrfs_calc_reclaim_metadata_size does two things, it returns the space currently required for flushing by the tickets, and if there are no tickets it calculates a value for the preemptive flushing.
However for the normal ticketed flushing we really only care about the space required for tickets. We will accidentally come in and flush one time, but as soon as we see there are no tickets we bail out of our flushing.
Fix this by making btrfs_calc_reclaim_metadata_size really only tell us what is required for flushing if we have people waiting on space. Then move the preemptive flushing logic into need_preemptive_reclaim(). We ignore btrfs_calc_reclaim_metadata_size() in need_preemptive_reclaim() because if we are in this path then we made our reservation and there are not pending tickets currently, so we do not need to check it, simply do the fuzzy logic to check if we're getting low on space.
Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
f205edf7 |
| 09-Oct-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: check reclaim_size in need_preemptive_reclaim
If we're flushing space for tickets then we have space_info->reclaim_size set and we do not need to do background reclaim.
Reviewed-by: Nikolay
btrfs: check reclaim_size in need_preemptive_reclaim
If we're flushing space for tickets then we have space_info->reclaim_size set and we do not need to do background reclaim.
Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
ae7913ba |
| 09-Oct-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: rename need_do_async_reclaim
All of our normal flushing is asynchronous reclaim, so this helper is poorly named. This is more checking if we need to preemptively flush space, so rename it to
btrfs: rename need_do_async_reclaim
All of our normal flushing is asynchronous reclaim, so this helper is poorly named. This is more checking if we need to preemptively flush space, so rename it to need_preemptive_reclaim.
Also switch it to bool and make it plain static as followup patches will move more code here.
Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
576fa348 |
| 09-Oct-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: improve preemptive background space flushing
Currently if we ever have to flush space because we do not have enough we allocate a ticket and attach it to the space_info, and then systematical
btrfs: improve preemptive background space flushing
Currently if we ever have to flush space because we do not have enough we allocate a ticket and attach it to the space_info, and then systematically flush things in the filesystem that hold space reservations until our space is reclaimed.
However this has a latency cost, we must go to sleep and wait for the flushing to make progress before we are woken up and allowed to continue doing our work.
In order to address that we used to kick off the async worker to flush space preemptively, so that we could be reclaiming space hopefully before any tasks needed to stop and wait for space to reclaim.
When I introduced the ticketed ENOSPC stuff this broke slightly in the fact that we were using tickets to indicate if we were done flushing. No tickets, no more flushing. However this meant that we essentially never preemptively flushed. This caused a write performance regression that Nikolay noticed in an unrelated patch that removed the committing of the transaction during btrfs_end_transaction.
The behavior that happened pre that patch was btrfs_end_transaction() would see that we were low on space, and it would commit the transaction. This was bad because in this particular case you could end up with thousands and thousands of transactions being committed during the 5 minute reproducer. With the patch to remove this behavior we got much more sane transaction commits, but we ended up slower because we would write for a while, flush, write for a while, flush again.
To address this we need to reinstate a preemptive flushing mechanism. However it is distinctly different from our ticketing flushing in that it doesn't have tickets to base it's decisions on. Instead of bolting this logic into our existing flushing work, add another worker to handle this preemptive flushing. Here we will attempt to be slightly intelligent about the things that we flushing, attempting to balance between whichever pool is taking up the most space.
Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
f00c42dd |
| 09-Oct-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: introduce a FORCE_COMMIT_TRANS flush operation
Solely for preemptive flushing, we want to be able to force the transaction commit without any of the ambiguity of may_commit_transaction(). Th
btrfs: introduce a FORCE_COMMIT_TRANS flush operation
Solely for preemptive flushing, we want to be able to force the transaction commit without any of the ambiguity of may_commit_transaction(). This is because may_commit_transaction() checks tickets and such, and in preemptive flushing we already know it'll be helpful, so use this to keep the code nice and clean and straightforward.
Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> [ add comment ] Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
5deb17e1 |
| 09-Oct-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: track ordered bytes instead of just dio ordered bytes
We track dio_bytes because the shrink delalloc code needs to know if we have more DIO in flight than we have normal buffered IO. The rea
btrfs: track ordered bytes instead of just dio ordered bytes
We track dio_bytes because the shrink delalloc code needs to know if we have more DIO in flight than we have normal buffered IO. The reason for this is because we can't "flush" DIO, we have to just wait on the ordered extents to finish.
However this is true of all ordered extents. If we have more ordered space outstanding than dirty pages we should be waiting on ordered extents. We already are ok on this front technically, because we always do a FLUSH_DELALLOC_WAIT loop, but I want to use the ordered counter in the preemptive flushing code as well, so change this to count all ordered bytes instead of just DIO ordered bytes.
Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
ac1ea10e |
| 09-Oct-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: add a trace point for reserve tickets
While debugging a ENOSPC related performance problem I needed to see the time difference between start and end of a reserve ticket, so add a trace point
btrfs: add a trace point for reserve tickets
While debugging a ENOSPC related performance problem I needed to see the time difference between start and end of a reserve ticket, so add a trace point to report when we handle a reserve ticket.
I opted to spit out start_ns itself without calculating the difference because there could be a gap between enabling the tracepoint and setting start_ns. Doing it this way allows us to filter on 0 start_ns so we don't get bogus entries, and we can easily calculate the time difference with bpftrace or something else.
Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
91e79a83 |
| 09-Oct-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: make flush_space take a enum btrfs_flush_state instead of int
I got a automated message from somebody who runs clang against our kernels and it's because I used the wrong enum type for what I
btrfs: make flush_space take a enum btrfs_flush_state instead of int
I got a automated message from somebody who runs clang against our kernels and it's because I used the wrong enum type for what I passed into flush_space, caught by -Wenum-conversion. Change the argument to be explicitly the enum we're expecting to make everything consistent. Maybe eventually gcc will catch errors like this.
Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
d98b188e |
| 22-Jan-2021 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: fix parameter description in space-info.c
With these fixes space-info.c is clear for W=1 warnings, namely the following ones are fixed:
fs/btrfs/space-info.c:575: warning: Function parameter
btrfs: fix parameter description in space-info.c
With these fixes space-info.c is clear for W=1 warnings, namely the following ones are fixed:
fs/btrfs/space-info.c:575: warning: Function parameter or member 'fs_info' not described in 'may_commit_transaction' fs/btrfs/space-info.c:575: warning: Function parameter or member 'space_info' not described in 'may_commit_transaction' fs/btrfs/space-info.c:1231: warning: Function parameter or member 'fs_info' not described in 'handle_reserve_ticket' fs/btrfs/space-info.c:1231: warning: Function parameter or member 'space_info' not described in 'handle_reserve_ticket' fs/btrfs/space-info.c:1231: warning: Function parameter or member 'ticket' not described in 'handle_reserve_ticket' fs/btrfs/space-info.c:1231: warning: Function parameter or member 'flush' not described in 'handle_reserve_ticket' fs/btrfs/space-info.c:1315: warning: Function parameter or member 'fs_info' not described in '__reserve_bytes' fs/btrfs/space-info.c:1315: warning: Function parameter or member 'space_info' not described in '__reserve_bytes' fs/btrfs/space-info.c:1315: warning: Function parameter or member 'orig_bytes' not described in '__reserve_bytes' fs/btrfs/space-info.c:1315: warning: Function parameter or member 'flush' not described in '__reserve_bytes' fs/btrfs/space-info.c:1427: warning: Function parameter or member 'root' not described in 'btrfs_reserve_metadata_bytes' fs/btrfs/space-info.c:1427: warning: Function parameter or member 'block_rsv' not described in 'btrfs_reserve_metadata_bytes' fs/btrfs/space-info.c:1427: warning: Function parameter or member 'orig_bytes' not described in 'btrfs_reserve_metadata_bytes' fs/btrfs/space-info.c:1427: warning: Function parameter or member 'flush' not described in 'btrfs_reserve_metadata_bytes' fs/btrfs/space-info.c:1462: warning: Function parameter or member 'fs_info' not described in 'btrfs_reserve_data_bytes' fs/btrfs/space-info.c:1462: warning: Function parameter or member 'bytes' not described in 'btrfs_reserve_data_bytes' fs/btrfs/space-info.c:1462: warning: Function parameter or member 'flush' not described in 'btrfs_reserve_data_bytes'
Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
9db4dc24 |
| 11-Jan-2021 |
Nikolay Borisov <nborisov@suse.com> |
btrfs: make btrfs_start_delalloc_root's nr argument a long
It's currently u64 which gets instantly translated either to LONG_MAX (if U64_MAX is passed) or cast to an unsigned long (which is in fact,
btrfs: make btrfs_start_delalloc_root's nr argument a long
It's currently u64 which gets instantly translated either to LONG_MAX (if U64_MAX is passed) or cast to an unsigned long (which is in fact, wrong because writeback_control::nr_to_write is a signed, long type).
Just convert the function's argument to be long time which obviates the need to manually convert u64 value to a long. Adjust all call sites which pass U64_MAX to pass LONG_MAX. Finally ensure that in shrink_delalloc the u64 is converted to a long without overflowing, resulting in a negative number.
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
e076ab2a |
| 07-Jan-2021 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: shrink delalloc pages instead of full inodes
Commit 38d715f494f2 ("btrfs: use btrfs_start_delalloc_roots in shrink_delalloc") cleaned up how we do delalloc shrinking by utilizing some infrast
btrfs: shrink delalloc pages instead of full inodes
Commit 38d715f494f2 ("btrfs: use btrfs_start_delalloc_roots in shrink_delalloc") cleaned up how we do delalloc shrinking by utilizing some infrastructure we have in place to flush inodes that we use for device replace and snapshot. However this introduced a pretty serious performance regression. To reproduce the user untarred the source tarball of Firefox (360MiB xz compressed/1.5GiB uncompressed), and would see it take anywhere from 5 to 20 times as long to untar in 5.10 compared to 5.9. This was observed on fast devices (SSD and better) and not on HDD.
The root cause is because before we would generally use the normal writeback path to reclaim delalloc space, and for this we would provide it with the number of pages we wanted to flush. The referenced commit changed this to flush that many inodes, which drastically increased the amount of space we were flushing in certain cases, which severely affected performance.
We cannot revert this patch unfortunately because of 3d45f221ce62 ("btrfs: fix deadlock when cloning inline extent and low on free metadata space") which requires the ability to skip flushing inodes that are being cloned in certain scenarios, which means we need to keep using our flushing infrastructure or risk re-introducing the deadlock.
Instead to fix this problem we can go back to providing btrfs_start_delalloc_roots with a number of pages to flush, and then set up a writeback_control and utilize sync_inode() to handle the flushing for us. This gives us the same behavior we had prior to the fix, while still allowing us to avoid the deadlock that was fixed by Filipe. I redid the users original test and got the following results on one of our test machines (256GiB of ram, 56 cores, 2TiB Intel NVMe drive)
5.9 0m54.258s 5.10 1m26.212s 5.10+patch 0m38.800s
5.10+patch is significantly faster than plain 5.9 because of my patch series "Change data reservations to use the ticketing infra" which contained the patch that introduced the regression, but generally improved the overall ENOSPC flushing mechanisms.
Additional testing on consumer-grade SSD (8GiB ram, 8 CPU) confirm the results:
5.10.5 4m00s 5.10.5+patch 1m08s 5.11-rc2 5m14s 5.11-rc2+patch 1m30s
Reported-by: René Rebe <rene@exactcode.de> Fixes: 38d715f494f2 ("btrfs: use btrfs_start_delalloc_roots in shrink_delalloc") CC: stable@vger.kernel.org # 5.10 Signed-off-by: Josef Bacik <josef@toxicpanda.com> Tested-by: David Sterba <dsterba@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add my test results ] Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
3d45f221 |
| 02-Dec-2020 |
Filipe Manana <fdmanana@suse.com> |
btrfs: fix deadlock when cloning inline extent and low on free metadata space
When cloning an inline extent there are cases where we can not just copy the inline extent from the source range to the
btrfs: fix deadlock when cloning inline extent and low on free metadata space
When cloning an inline extent there are cases where we can not just copy the inline extent from the source range to the target range (e.g. when the target range starts at an offset greater than zero). In such cases we copy the inline extent's data into a page of the destination inode and then dirty that page. However, after that we will need to start a transaction for each processed extent and, if we are ever low on available metadata space, we may need to flush existing delalloc for all dirty inodes in an attempt to release metadata space - if that happens we may deadlock:
* the async reclaim task queued a delalloc work to flush delalloc for the destination inode of the clone operation;
* the task executing that delalloc work gets blocked waiting for the range with the dirty page to be unlocked, which is currently locked by the task doing the clone operation;
* the async reclaim task blocks waiting for the delalloc work to complete;
* the cloning task is waiting on the waitqueue of its reservation ticket while holding the range with the dirty page locked in the inode's io_tree;
* if metadata space is not released by some other task (like delalloc for some other inode completing for example), the clone task waits forever and as a consequence the delalloc work and async reclaim tasks will hang forever as well. Releasing more space on the other hand may require starting a transaction, which will hang as well when trying to reserve metadata space, resulting in a deadlock between all these tasks.
When this happens, traces like the following show up in dmesg/syslog:
[87452.323003] INFO: task kworker/u16:11:1810830 blocked for more than 120 seconds. [87452.323644] Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1 [87452.324248] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [87452.324852] task:kworker/u16:11 state:D stack: 0 pid:1810830 ppid: 2 flags:0x00004000 [87452.325520] Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs] [87452.326136] Call Trace: [87452.326737] __schedule+0x5d1/0xcf0 [87452.327390] schedule+0x45/0xe0 [87452.328174] lock_extent_bits+0x1e6/0x2d0 [btrfs] [87452.328894] ? finish_wait+0x90/0x90 [87452.329474] btrfs_invalidatepage+0x32c/0x390 [btrfs] [87452.330133] ? __mod_memcg_state+0x8e/0x160 [87452.330738] __extent_writepage+0x2d4/0x400 [btrfs] [87452.331405] extent_write_cache_pages+0x2b2/0x500 [btrfs] [87452.332007] ? lock_release+0x20e/0x4c0 [87452.332557] ? trace_hardirqs_on+0x1b/0xf0 [87452.333127] extent_writepages+0x43/0x90 [btrfs] [87452.333653] ? lock_acquire+0x1a3/0x490 [87452.334177] do_writepages+0x43/0xe0 [87452.334699] ? __filemap_fdatawrite_range+0xa4/0x100 [87452.335720] __filemap_fdatawrite_range+0xc5/0x100 [87452.336500] btrfs_run_delalloc_work+0x17/0x40 [btrfs] [87452.337216] btrfs_work_helper+0xf1/0x600 [btrfs] [87452.337838] process_one_work+0x24e/0x5e0 [87452.338437] worker_thread+0x50/0x3b0 [87452.339137] ? process_one_work+0x5e0/0x5e0 [87452.339884] kthread+0x153/0x170 [87452.340507] ? kthread_mod_delayed_work+0xc0/0xc0 [87452.341153] ret_from_fork+0x22/0x30 [87452.341806] INFO: task kworker/u16:1:2426217 blocked for more than 120 seconds. [87452.342487] Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1 [87452.343274] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [87452.344049] task:kworker/u16:1 state:D stack: 0 pid:2426217 ppid: 2 flags:0x00004000 [87452.344974] Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs] [87452.345655] Call Trace: [87452.346305] __schedule+0x5d1/0xcf0 [87452.346947] ? kvm_clock_read+0x14/0x30 [87452.347676] ? wait_for_completion+0x81/0x110 [87452.348389] schedule+0x45/0xe0 [87452.349077] schedule_timeout+0x30c/0x580 [87452.349718] ? _raw_spin_unlock_irqrestore+0x3c/0x60 [87452.350340] ? lock_acquire+0x1a3/0x490 [87452.351006] ? try_to_wake_up+0x7a/0xa20 [87452.351541] ? lock_release+0x20e/0x4c0 [87452.352040] ? lock_acquired+0x199/0x490 [87452.352517] ? wait_for_completion+0x81/0x110 [87452.353000] wait_for_completion+0xab/0x110 [87452.353490] start_delalloc_inodes+0x2af/0x390 [btrfs] [87452.353973] btrfs_start_delalloc_roots+0x12d/0x250 [btrfs] [87452.354455] flush_space+0x24f/0x660 [btrfs] [87452.355063] btrfs_async_reclaim_metadata_space+0x1bb/0x480 [btrfs] [87452.355565] process_one_work+0x24e/0x5e0 [87452.356024] worker_thread+0x20f/0x3b0 [87452.356487] ? process_one_work+0x5e0/0x5e0 [87452.356973] kthread+0x153/0x170 [87452.357434] ? kthread_mod_delayed_work+0xc0/0xc0 [87452.357880] ret_from_fork+0x22/0x30 (...) < stack traces of several tasks waiting for the locks of the inodes of the clone operation > (...) [92867.444138] RSP: 002b:00007ffc3371bbe8 EFLAGS: 00000246 ORIG_RAX: 0000000000000052 [92867.444624] RAX: ffffffffffffffda RBX: 00007ffc3371bea0 RCX: 00007f61efe73f97 [92867.445116] RDX: 0000000000000000 RSI: 0000560fbd5d7a40 RDI: 0000560fbd5d8960 [92867.445595] RBP: 00007ffc3371beb0 R08: 0000000000000001 R09: 0000000000000003 [92867.446070] R10: 00007ffc3371b996 R11: 0000000000000246 R12: 0000000000000000 [92867.446820] R13: 000000000000001f R14: 00007ffc3371bea0 R15: 00007ffc3371beb0 [92867.447361] task:fsstress state:D stack: 0 pid:2508238 ppid:2508153 flags:0x00004000 [92867.447920] Call Trace: [92867.448435] __schedule+0x5d1/0xcf0 [92867.448934] ? _raw_spin_unlock_irqrestore+0x3c/0x60 [92867.449423] schedule+0x45/0xe0 [92867.449916] __reserve_bytes+0x4a4/0xb10 [btrfs] [92867.450576] ? finish_wait+0x90/0x90 [92867.451202] btrfs_reserve_metadata_bytes+0x29/0x190 [btrfs] [92867.451815] btrfs_block_rsv_add+0x1f/0x50 [btrfs] [92867.452412] start_transaction+0x2d1/0x760 [btrfs] [92867.453216] clone_copy_inline_extent+0x333/0x490 [btrfs] [92867.453848] ? lock_release+0x20e/0x4c0 [92867.454539] ? btrfs_search_slot+0x9a7/0xc30 [btrfs] [92867.455218] btrfs_clone+0x569/0x7e0 [btrfs] [92867.455952] btrfs_clone_files+0xf6/0x150 [btrfs] [92867.456588] btrfs_remap_file_range+0x324/0x3d0 [btrfs] [92867.457213] do_clone_file_range+0xd4/0x1f0 [92867.457828] vfs_clone_file_range+0x4d/0x230 [92867.458355] ? lock_release+0x20e/0x4c0 [92867.458890] ioctl_file_clone+0x8f/0xc0 [92867.459377] do_vfs_ioctl+0x342/0x750 [92867.459913] __x64_sys_ioctl+0x62/0xb0 [92867.460377] do_syscall_64+0x33/0x80 [92867.460842] entry_SYSCALL_64_after_hwframe+0x44/0xa9 (...) < stack traces of more tasks blocked on metadata reservation like the clone task above, because the async reclaim task has deadlocked > (...)
Another thing to notice is that the worker task that is deadlocked when trying to flush the destination inode of the clone operation is at btrfs_invalidatepage(). This is simply because the clone operation has a destination offset greater than the i_size and we only update the i_size of the destination file after cloning an extent (just like we do in the buffered write path).
Since the async reclaim path uses btrfs_start_delalloc_roots() to trigger the flushing of delalloc for all inodes that have delalloc, add a runtime flag to an inode to signal it should not be flushed, and for inodes with that flag set, start_delalloc_inodes() will simply skip them. When the cloning code needs to dirty a page to copy an inline extent, set that flag on the inode and then clear it when the clone operation finishes.
This could be sporadically triggered with test case generic/269 from fstests, which exercises many fsstress processes running in parallel with several dd processes filling up the entire filesystem.
CC: stable@vger.kernel.org # 5.9+ Fixes: 05a5a7621ce6 ("Btrfs: implement full reflink support for inline extents") Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
72804905 |
| 01-Sep-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: kill the RCU protection for fs_info->space_info
We have this thing wrapped in an RCU lock, but it's really not needed. We create all the space_info's on mount, and we destroy them on unmount.
btrfs: kill the RCU protection for fs_info->space_info
We have this thing wrapped in an RCU lock, but it's really not needed. We create all the space_info's on mount, and we destroy them on unmount. The list never changes and we're protected from messing with it by the normal mount/umount path, so kill the RCU stuff around it.
Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
a31a5876 |
| 09-Sep-2020 |
YueHaibing <yuehaibing@huawei.com> |
btrfs: remove unused function calc_global_rsv_need_space()
It is not used since commit 0096420adb03 ("btrfs: do not account global reserve in can_overcommit").
Reviewed-by: Anand Jain <anand.jain@o
btrfs: remove unused function calc_global_rsv_need_space()
It is not used since commit 0096420adb03 ("btrfs: do not account global reserve in can_overcommit").
Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: YueHaibing <yuehaibing@huawei.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
c4923027 |
| 25-Aug-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: fix possible infinite loop in data async reclaim
Dave reported an issue where generic/102 would sometimes hang. This turned out to be because we'd get into this spot where we were no longer
btrfs: fix possible infinite loop in data async reclaim
Dave reported an issue where generic/102 would sometimes hang. This turned out to be because we'd get into this spot where we were no longer making progress on data reservations because our exit condition was not met. The log is basically
while (!space_info->full && !list_empty(&space_info->tickets)) flush_space(space_info, flush_state);
where flush state is our various flush states, but doesn't include ALLOC_CHUNK_FORCE. This is because we actually lead with allocating chunks, and so the assumption was that once you got to the actual flushing states you could no longer allocate chunks. This was a stupid assumption, because you could have deleted block groups that would be reclaimed by a transaction commit, thus unsetting space_info->full. This is essentially what happens with generic/102, and so sometimes you'd get stuck in the flushing loop because we weren't allocating chunks, but flushing space wasn't giving us what we needed to make progress.
Fix this by adding ALLOC_CHUNK_FORCE to the end of our flushing states, that way we will eventually bail out because we did end up with space_info->full if we free'd a chunk previously. Otherwise, as is the case for this test, we'll allocate our chunk and continue on our happy merry way.
Reported-by: David Sterba <dsterba@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
1a7a92c8 |
| 21-Jul-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: add a comment explaining the data flush steps
The data flushing steps are not obvious to people other than myself and Chris. Write a giant comment explaining the reasoning behind each flush
btrfs: add a comment explaining the data flush steps
The data flushing steps are not obvious to people other than myself and Chris. Write a giant comment explaining the reasoning behind each flush step for data as well as why it is in that particular order.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
57056740 |
| 21-Jul-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: do async reclaim for data reservations
Now that we have the data ticketing stuff in place, move normal data reservations to use an async reclaim helper to satisfy tickets. Before we could ha
btrfs: do async reclaim for data reservations
Now that we have the data ticketing stuff in place, move normal data reservations to use an async reclaim helper to satisfy tickets. Before we could have multiple tasks race in and both allocate chunks, resulting in more data chunks than we would necessarily need. Serializing these allocations and making a single thread responsible for flushing will only allocate chunks as needed, as well as cut down on transaction commits and other flush related activities.
Priority reservations will still work as they have before, simply trying to allocate a chunk until they can make their reservation.
Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
cb3e3930 |
| 21-Jul-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: flush delayed refs when trying to reserve data space
We can end up with freed extents in the delayed refs, and thus may_commit_transaction() may not think we have enough pinned space to commi
btrfs: flush delayed refs when trying to reserve data space
We can end up with freed extents in the delayed refs, and thus may_commit_transaction() may not think we have enough pinned space to commit the transaction and we'll ENOSPC early. Handle this by running the delayed refs in order to make sure pinned is uptodate before we try to commit the transaction.
Tested-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
327feeeb |
| 21-Jul-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: run delayed iputs before committing the transaction for data
Before we were waiting on iputs after we committed the transaction, but this doesn't really make much sense. We want to reclaim a
btrfs: run delayed iputs before committing the transaction for data
Before we were waiting on iputs after we committed the transaction, but this doesn't really make much sense. We want to reclaim any space we may have in order to be more likely to commit the transaction, due to pinned space being added by running the delayed iputs. Fix this by making delayed iputs run before committing the transaction.
Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
bb86bd3d |
| 21-Jul-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: don't force commit if we are data
We used to unconditionally commit the transaction at least 2 times and then on the 3rd try check against pinned space to make sure committing the transaction
btrfs: don't force commit if we are data
We used to unconditionally commit the transaction at least 2 times and then on the 3rd try check against pinned space to make sure committing the transaction was worth the effort. This is overkill, we know nobody is going to steal our reservation, and if we can't make our reservation with the pinned amount simply bail out.
This also cleans up the passing of bytes_needed to may_commit_transaction, as that was the thing we added into place in order to accomplish this behavior. We no longer need it so remove that mess.
Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
02827001 |
| 21-Jul-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: drop the commit_cycles stuff for data reservations
This was an old wart left over from how we previously did data reservations. Before we could have people race in and take a reservation whi
btrfs: drop the commit_cycles stuff for data reservations
This was an old wart left over from how we previously did data reservations. Before we could have people race in and take a reservation while we were flushing space, so we needed to make sure we looped a few times before giving up. Now that we're using the ticketing infrastructure we don't have to worry about this and can drop the logic altogether.
Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
f3bda421 |
| 21-Jul-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: use the same helper for data and metadata reservations
Now that data reservations follow the same pattern as metadata reservations we can simply rename __reserve_metadata_bytes to __reserve_b
btrfs: use the same helper for data and metadata reservations
Now that data reservations follow the same pattern as metadata reservations we can simply rename __reserve_metadata_bytes to __reserve_bytes and use that helper for data reservations.
Things to keep in mind, btrfs_can_overcommit() returns 0 for data, because we can never overcommit. We also will never pass in FLUSH_ALL for data, so we'll simply be added to the priority list and go straight into handle_reserve_ticket.
Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|
#
0532a6f8 |
| 21-Jul-2020 |
Josef Bacik <josef@toxicpanda.com> |
btrfs: serialize data reservations if we are flushing
Nikolay reported a problem where generic/371 would fail sometimes with a slow drive. The gist of the test is that we fallocate a file in parall
btrfs: serialize data reservations if we are flushing
Nikolay reported a problem where generic/371 would fail sometimes with a slow drive. The gist of the test is that we fallocate a file in parallel with a pwrite of a different file. These two files combined are smaller than the file system, but sometimes the pwrite would ENOSPC.
A fair bit of investigation uncovered the fact that the fallocate workload was racing in and grabbing the free space that the pwrite workload was trying to free up so it could make its own reservation. After a few loops of this eventually the pwrite workload would error out with an ENOSPC.
We've had the same problem with metadata as well, and we serialized all metadata allocations to satisfy this problem. This wasn't usually a problem with data because data reservations are more straightforward, but obviously could still happen.
Fix this by not allowing reservations to occur if there are any pending tickets waiting to be satisfied on the space info.
Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
show more ...
|