Revision tags: v6.6.25, v6.6.24, v6.6.23, v6.6.16, v6.6.15, v6.6.14, v6.6.13, v6.6.12, v6.6.11, v6.6.10, v6.6.9, v6.6.8, v6.6.7, v6.6.6, v6.6.5, v6.6.4, v6.6.3, v6.6.2, v6.5.11, v6.6.1, v6.5.10, v6.6, v6.5.9, v6.5.8, v6.5.7, v6.5.6 |
|
#
428c4435 |
| 03-Oct-2023 |
Dave Chinner <dchinner@redhat.com> |
xfs: move log discard work to xfs_discard.c
Because we are going to use the same list-based discard submission interface for fstrim-based discards, too.
Signed-off-by: Dave Chinner <dchinner@redhat
xfs: move log discard work to xfs_discard.c
Because we are going to use the same list-based discard submission interface for fstrim-based discards, too.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
show more ...
|
Revision tags: v6.5.5, v6.5.4, v6.5.3 |
|
#
ecd49f7a |
| 11-Sep-2023 |
Darrick J. Wong <djwong@kernel.org> |
xfs: fix per-cpu CIL structure aggregation racing with dying cpus
In commit 7c8ade2121200 ("xfs: implement percpu cil space used calculation"), the XFS committed (log) item list code was converted t
xfs: fix per-cpu CIL structure aggregation racing with dying cpus
In commit 7c8ade2121200 ("xfs: implement percpu cil space used calculation"), the XFS committed (log) item list code was converted to use per-cpu lists and space tracking to reduce cpu contention when multiple threads are modifying different parts of the filesystem and hence end up contending on the log structures during transaction commit. Each CPU tracks its own commit items and space usage, and these do not have to be merged into the main CIL until either someone wants to push the CIL items, or we run over a soft threshold and switch to slower (but more accurate) accounting with atomics.
Unfortunately, the for_each_cpu iteration suffers from the same race with cpu dying problem that was identified in commit 8b57b11cca88f ("pcpcntrs: fix dying cpu summation race") -- CPUs are removed from cpu_online_mask before the CPUHP_XFS_DEAD callback gets called. As a result, both CIL percpu structure aggregation functions fail to collect the items and accounted space usage at the correct point in time.
If we're lucky, the items that are collected from the online cpus exceed the space given to those cpus, and the log immediately shuts down in xlog_cil_insert_items due to the (apparent) log reservation overrun. This happens periodically with generic/650, which exercises cpu hotplug vs. the filesystem code:
smpboot: CPU 3 is now offline XFS (sda3): ctx ticket reservation ran out. Need to up reservation XFS (sda3): ticket reservation summary: XFS (sda3): unit res = 9268 bytes XFS (sda3): current res = -40 bytes XFS (sda3): original count = 1 XFS (sda3): remaining count = 1 XFS (sda3): Filesystem has been shut down due to log error (0x2).
Applying the same sort of fix from 8b57b11cca88f to the CIL code seems to make the generic/650 problem go away, but I've been told that tglx was not happy when he saw:
"...the only thing we actually need to care about is that percpu_counter_sum() iterates dying CPUs. That's trivial to do, and when there are no CPUs dying, it has no addition overhead except for a cpumask_or() operation."
The CPU hotplug code is rather complex and difficult to understand and I don't want to try to understand the cpu hotplug locking well enough to use cpu_dying mask. Furthermore, there's a performance improvement that could be had here. Attach a private cpu mask to the CIL structure so that we can track exactly which cpus have accessed the percpu data at all. It doesn't matter if the cpu has since gone offline; log item aggregation will still find the items. Better yet, we skip cpus that have not recently logged anything.
Worse yet, Ritesh Harjani and Eric Sandeen both reported today that CPU hot remove racing with an xfs mount can crash if the cpu_dead notifier tries to access the log but the mount hasn't yet set up the log.
Link: https://lore.kernel.org/linux-xfs/ZOLzgBOuyWHapOyZ@dread.disaster.area/T/ Link: https://lore.kernel.org/lkml/877cuj1mt1.ffs@tglx/ Link: https://lore.kernel.org/lkml/20230414162755.281993820@linutronix.de/ Link: https://lore.kernel.org/linux-xfs/ZOVkjxWZq0YmjrJu@dread.disaster.area/T/ Cc: tglx@linutronix.de Cc: peterz@infradead.org Reported-by: ritesh.list@gmail.com Reported-by: sandeen@sandeen.net Fixes: af1c2146a50b ("xfs: introduce per-cpu CIL tracking structure") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
show more ...
|
Revision tags: v6.5.2, v6.1.51, v6.5.1, v6.1.50, v6.5, v6.1.49, v6.1.48, v6.1.46, v6.1.45, v6.1.44, v6.1.43, v6.1.42, v6.1.41, v6.1.40, v6.1.39, v6.1.38, v6.1.37, v6.1.36, v6.4, v6.1.35, v6.1.34, v6.1.33, v6.1.32, v6.1.31, v6.1.30, v6.1.29, v6.1.28, v6.1.27, v6.1.26, v6.3, v6.1.25, v6.1.24, v6.1.23, v6.1.22, v6.1.21, v6.1.20, v6.1.19, v6.1.18, v6.1.17, v6.1.16, v6.1.15, v6.1.14, v6.1.13, v6.2, v6.1.12, v6.1.11, v6.1.10, v6.1.9, v6.1.8, v6.1.7, v6.1.6, v6.1.5, v6.0.19, v6.0.18, v6.1.4, v6.1.3, v6.0.17, v6.1.2, v6.0.16, v6.1.1, v6.0.15, v6.0.14, v6.0.13, v6.1, v6.0.12, v6.0.11, v6.0.10, v5.15.80, v6.0.9, v5.15.79, v6.0.8, v5.15.78, v6.0.7, v5.15.77, v5.15.76, v6.0.6, v6.0.5, v5.15.75, v6.0.4, v6.0.3, v6.0.2, v5.15.74, v5.15.73, v6.0.1, v5.15.72, v6.0, v5.15.71, v5.15.70, v5.15.69, v5.15.68, v5.15.67, v5.15.66, v5.15.65, v5.15.64, v5.15.63, v5.15.62, v5.15.61, v5.15.60, v5.15.59, v5.19, v5.15.58, v5.15.57, v5.15.56, v5.15.55, v5.15.54, v5.15.53 |
|
#
d9f68777 |
| 07-Jul-2022 |
Dave Chinner <dchinner@redhat.com> |
xfs: xlog_sync() manually adjusts grant head space
When xlog_sync() rounds off the tail the iclog that is being flushed, it manually subtracts that space from the grant heads. This space is actually
xfs: xlog_sync() manually adjusts grant head space
When xlog_sync() rounds off the tail the iclog that is being flushed, it manually subtracts that space from the grant heads. This space is actually reserved by the transaction ticket that covers the xlog_sync() call from xlog_write(), but we don't plumb the ticket down far enough for it to account for the space consumed in the current log ticket.
The grant heads are hot, so we really should be accounting this to the ticket is we can, rather than adding thousands of extra grant head updates every CIL commit.
Interestingly, this actually indicates a potential log space overrun can occur when we force the log. By the time that xfs_log_force() pushes out an active iclog and consumes the roundoff space, the reservation for that roundoff space has been returned to the grant heads and is no longer covered by a reservation. In theory the roundoff added to log force on an already full log could push the write head past the tail. In practice, the CIL commit that writes to the log and needs the iclog pushed will have reserved space for roundoff, so when it releases the ticket there will still be physical space for the roundoff to be committed to the log, even though it is no longer reserved. This roundoff won't be enough space to allow a transaction to be woken if the log is full, so overruns should not actually occur in practice.
That said, it indicates that we should not release the CIL context log ticket until after we've released the commit iclog. It also means that xlog_sync() still needs the direct grant head manipulation if we don't provide it with a ticket. Log forces are rare when we are in fast paths running 1.5 million transactions/s that make the grant heads hot, so let's optimise the hot case and pass CIL log tickets down to the xlog_sync() code.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
show more ...
|
#
1ccb0745 |
| 07-Jul-2022 |
Dave Chinner <dchinner@redhat.com> |
xfs: avoid cil push lock if possible
Because now it hurts when the CIL fills up.
- 37.20% __xfs_trans_commit - 35.84% xfs_log_commit_cil - 19.34% _raw_spin_lock - do_ra
xfs: avoid cil push lock if possible
Because now it hurts when the CIL fills up.
- 37.20% __xfs_trans_commit - 35.84% xfs_log_commit_cil - 19.34% _raw_spin_lock - do_raw_spin_lock 19.01% __pv_queued_spin_lock_slowpath - 4.20% xfs_log_ticket_ungrant 0.90% xfs_log_space_wake
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
show more ...
|
#
4eb56069 |
| 07-Jul-2022 |
Dave Chinner <dchinner@redhat.com> |
xfs: move CIL ordering to the logvec chain
Adding a list_sort() call to the CIL push work while the xc_ctx_lock is held exclusively has resulted in fairly long lock hold times and that stops all fro
xfs: move CIL ordering to the logvec chain
Adding a list_sort() call to the CIL push work while the xc_ctx_lock is held exclusively has resulted in fairly long lock hold times and that stops all front end transaction commits from making progress.
We can move the sorting out of the xc_ctx_lock if we can transfer the ordering information to the log vectors as they are detached from the log items and then we can sort the log vectors. With these changes, we can move the list_sort() call to just before we call xlog_write() when we aren't holding any locks at all.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
show more ...
|
#
16924853 |
| 07-Jul-2022 |
Dave Chinner <dchinner@redhat.com> |
xfs: convert log vector chain to use list heads
Because the next change is going to require sorting log vectors, and that requires arbitrary rearrangement of the list which cannot be done easily wit
xfs: convert log vector chain to use list heads
Because the next change is going to require sorting log vectors, and that requires arbitrary rearrangement of the list which cannot be done easily with a single linked list.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
show more ...
|
#
c0fb4765 |
| 07-Jul-2022 |
Dave Chinner <dchinner@redhat.com> |
xfs: convert CIL to unordered per cpu lists
So that we can remove the cil_lock which is a global serialisation point. We've already got ordering sorted, so all we need to do is treat the CIL list li
xfs: convert CIL to unordered per cpu lists
So that we can remove the cil_lock which is a global serialisation point. We've already got ordering sorted, so all we need to do is treat the CIL list like the busy extent list and reconstruct it before the push starts.
This is what we're trying to avoid:
- 75.35% 1.83% [kernel] [k] xfs_log_commit_cil - 46.35% xfs_log_commit_cil - 41.54% _raw_spin_lock - 67.30% do_raw_spin_lock 66.96% __pv_queued_spin_lock_slowpath
Which happens on a 32p system when running a 32-way 'rm -rf' workload. After this patch:
- 20.90% 3.23% [kernel] [k] xfs_log_commit_cil - 17.67% xfs_log_commit_cil - 6.51% xfs_log_ticket_ungrant 1.40% xfs_log_space_wake 2.32% memcpy_erms - 2.18% xfs_buf_item_committing - 2.12% xfs_buf_item_release - 1.03% xfs_buf_unlock 0.96% up 0.72% xfs_buf_rele 1.33% xfs_inode_item_format 1.19% down_read 0.91% up_read 0.76% xfs_buf_item_format - 0.68% kmem_alloc_large - 0.67% kmem_alloc 0.64% __kmalloc 0.50% xfs_buf_item_size
It kinda looks like the workload is running out of log space all the time. But all the spinlock contention is gone and the transaction commit rate has gone from 800k/s to 1.3M/s so the amount of real work being done has gone up a *lot*.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
show more ...
|
#
016a2338 |
| 07-Jul-2022 |
Dave Chinner <dchinner@redhat.com> |
xfs: Add order IDs to log items in CIL
Before we split the ordered CIL up into per cpu lists, we need a mechanism to track the order of the items in the CIL. We need to do this because there are rul
xfs: Add order IDs to log items in CIL
Before we split the ordered CIL up into per cpu lists, we need a mechanism to track the order of the items in the CIL. We need to do this because there are rules around the order in which related items must physically appear in the log even inside a single checkpoint transaction.
An example of this is intents - an intent must appear in the log before it's intent done record so that log recovery can cancel the intent correctly. If we have these two records misordered in the CIL, then they will not be recovered correctly by journal replay.
We also will not be able to move items to the tail of the CIL list when they are relogged, hence the log items will need some mechanism to allow the correct log item order to be recreated before we write log items to the hournal.
Hence we need to have a mechanism for recording global order of transactions in the log items so that we can recover that order from un-ordered per-cpu lists.
Do this with a simple monotonic increasing commit counter in the CIL context. Each log item in the transaction gets stamped with the current commit order ID before it is added to the CIL. If the item is already in the CIL, leave it where it is instead of moving it to the tail of the list and instead sort the list before we start the push work.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
show more ...
|
#
df7a4a21 |
| 07-Jul-2022 |
Dave Chinner <dchinner@redhat.com> |
xfs: convert CIL busy extents to per-cpu
To get them out from under the CIL lock.
This is an unordered list, so we can simply punt it to per-cpu lists during transaction commits and reaggregate it
xfs: convert CIL busy extents to per-cpu
To get them out from under the CIL lock.
This is an unordered list, so we can simply punt it to per-cpu lists during transaction commits and reaggregate it back into a single list during the CIL push work.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
show more ...
|
#
1dd2a2c1 |
| 07-Jul-2022 |
Dave Chinner <dchinner@redhat.com> |
xfs: track CIL ticket reservation in percpu structure
To get it out from under the cil spinlock.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
|
#
7c8ade21 |
| 07-Jul-2022 |
Dave Chinner <dchinner@redhat.com> |
xfs: implement percpu cil space used calculation
Now that we have the CIL percpu structures in place, implement the space used counter as a per-cpu counter.
We have to be really careful now about e
xfs: implement percpu cil space used calculation
Now that we have the CIL percpu structures in place, implement the space used counter as a per-cpu counter.
We have to be really careful now about ensuring that the checks and updates run without arbitrary delays, which means they need to run with pre-emption disabled. We do this by careful placement of the get_cpu_ptr/put_cpu_ptr calls to access the per-cpu structures for that CPU.
We need to be able to reliably detect that the CIL has reached the hard limit threshold so we can take extra reservations for the iclog headers when the space used overruns the original reservation. hence we factor out xlog_cil_over_hard_limit() from xlog_cil_push_background().
The global CIL space used is an atomic variable that is backed by per-cpu aggregation to minimise the number of atomic updates we do to the global state in the fast path. While we are under the soft limit, we aggregate only when the per-cpu aggregation is over the proportion of the soft limit assigned to that CPU. This means that all CPUs can use all but one byte of their aggregation threshold and we will not go over the soft limit.
Hence once we detect that we've gone over both a per-cpu aggregation threshold and the soft limit, we know that we have only exceeded the soft limit by one per-cpu aggregation threshold. Even if all CPUs hit this at the same time, we can't be over the hard limit, so we can run an aggregation back into the atomic counter at this point and still be under the hard limit.
At this point, we will be over the soft limit and hence we'll aggregate into the global atomic used space directly rather than the per-cpu counters, hence providing accurate detection of hard limit excursion for accounting and reservation purposes.
Hence we get the best of both worlds - lockless, scalable per-cpu fast path plus accurate, atomic detection of hard limit excursion.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
show more ...
|
Revision tags: v5.15.52 |
|
#
af1c2146 |
| 01-Jul-2022 |
Dave Chinner <dchinner@redhat.com> |
xfs: introduce per-cpu CIL tracking structure
The CIL push lock is highly contended on larger machines, becoming a hard bottleneck that about 700,000 transaction commits/s on >16p machines. To addre
xfs: introduce per-cpu CIL tracking structure
The CIL push lock is highly contended on larger machines, becoming a hard bottleneck that about 700,000 transaction commits/s on >16p machines. To address this, start moving the CIL tracking infrastructure to utilise per-CPU structures.
We need to track the space used, the amount of log reservation space reserved to write the CIL, the log items in the CIL and the busy extents that need to be completed by the CIL commit. This requires a couple of per-cpu counters, an unordered per-cpu list and a globally ordered per-cpu list.
Create a per-cpu structure to hold these and all the management interfaces needed, as well as the hooks to handle hotplug CPUs.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
show more ...
|
#
31151cc3 |
| 01-Jul-2022 |
Dave Chinner <dchinner@redhat.com> |
xfs: rework per-iclog header CIL reservation
For every iclog that a CIL push will use up, we need to ensure we have space reserved for the iclog header in each iclog. It is extremely difficult to do
xfs: rework per-iclog header CIL reservation
For every iclog that a CIL push will use up, we need to ensure we have space reserved for the iclog header in each iclog. It is extremely difficult to do this accurately with a per-cpu counter without expensive summing of the counter in every commit. However, we know what the maximum CIL size is going to be because of the hard space limit we have, and hence we know exactly how many iclogs we are going to need to write out the CIL.
We are constrained by the requirement that small transactions only have reservation space for a single iclog header built into them. At commit time we don't know how much of the current transaction reservation is made up of iclog header reservations as calculated by xfs_log_calc_unit_res() when the ticket was reserved. As larger reservations have multiple header spaces reserved, we can steal more than one iclog header reservation at a time, but we only steal the exact number needed for the given log vector size delta.
As a result, we don't know exactly when we are going to steal iclog header reservations, nor do we know exactly how many we are going to need for a given CIL.
To make things simple, start by calculating the worst case number of iclog headers a full CIL push will require. Record this into an atomic variable in the CIL. Then add a byte counter to the log ticket that records exactly how much iclog header space has been reserved in this ticket by xfs_log_calc_unit_res(). This tells us exactly how much space we can steal from the ticket at transaction commit time.
Now, at transaction commit time, we can check if the CIL has a full iclog header reservation and, if not, steal the entire reservation the current ticket holds for iclog headers. This minimises the number of times we need to do atomic operations in the fast path, but still guarantees we get all the reservations we need.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
show more ...
|
#
12380d23 |
| 01-Jul-2022 |
Dave Chinner <dchinner@redhat.com> |
xfs: lift init CIL reservation out of xc_cil_lock
The xc_cil_lock is the most highly contended lock in XFS now. To start the process of getting rid of it, lift the initial reservation of the CIL log
xfs: lift init CIL reservation out of xc_cil_lock
The xc_cil_lock is the most highly contended lock in XFS now. To start the process of getting rid of it, lift the initial reservation of the CIL log space out from under the xc_cil_lock.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
show more ...
|
#
88591e7f |
| 01-Jul-2022 |
Dave Chinner <dchinner@redhat.com> |
xfs: use the CIL space used counter for emptiness checks
In the next patches we are going to make the CIL list itself per-cpu, and so we cannot use list_empty() to check is the list is empty. Replac
xfs: use the CIL space used counter for emptiness checks
In the next patches we are going to make the CIL list itself per-cpu, and so we cannot use list_empty() to check is the list is empty. Replace the list_empty() checks with a flag in the CIL to indicate we have committed at least one transaction to the CIL and hence the CIL is not empty.
We need this flag to be an atomic so that we can clear it without holding any locks in the commit fast path, but we also need to be careful to avoid atomic operations in the fast path. Hence we use the fact that test_bit() is not an atomic op to first check if the flag is set and then run the atomic test_and_clear_bit() operation to clear it and steal the initial unit reservation for the CIL context checkpoint.
When we are switching to a new context in a push, we place the setting of the XLOG_CIL_EMPTY flag under the xc_push_lock. THis allows all the other places that need to check whether the CIL is empty to use test_bit() and still be serialised correctly with the CIL context swaps that set the bit.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
show more ...
|
Revision tags: v5.15.51, v5.15.50, v5.15.49, v5.15.48, v5.15.47, v5.15.46, v5.15.45, v5.15.44, v5.15.43, v5.15.42, v5.18, v5.15.41, v5.15.40, v5.15.39 |
|
#
45ff8b47 |
| 12-May-2022 |
Dave Chinner <dchinner@redhat.com> |
xfs: can't use kmem_zalloc() for attribute buffers
Because heap allocation of 64kB buffers will fail:
.... XFS: fs_mark(8414) possible memory allocation deadlock size 65768 in kmem_alloc (mode:0x2
xfs: can't use kmem_zalloc() for attribute buffers
Because heap allocation of 64kB buffers will fail:
.... XFS: fs_mark(8414) possible memory allocation deadlock size 65768 in kmem_alloc (mode:0x2d40) XFS: fs_mark(8417) possible memory allocation deadlock size 65768 in kmem_alloc (mode:0x2d40) XFS: fs_mark(8409) possible memory allocation deadlock size 65768 in kmem_alloc (mode:0x2d40) XFS: fs_mark(8428) possible memory allocation deadlock size 65768 in kmem_alloc (mode:0x2d40) XFS: fs_mark(8430) possible memory allocation deadlock size 65768 in kmem_alloc (mode:0x2d40) XFS: fs_mark(8437) possible memory allocation deadlock size 65768 in kmem_alloc (mode:0x2d40) XFS: fs_mark(8433) possible memory allocation deadlock size 65768 in kmem_alloc (mode:0x2d40) XFS: fs_mark(8406) possible memory allocation deadlock size 65768 in kmem_alloc (mode:0x2d40) XFS: fs_mark(8412) possible memory allocation deadlock size 65768 in kmem_alloc (mode:0x2d40) XFS: fs_mark(8432) possible memory allocation deadlock size 65768 in kmem_alloc (mode:0x2d40) XFS: fs_mark(8424) possible memory allocation deadlock size 65768 in kmem_alloc (mode:0x2d40) ....
I'd use kvmalloc() instead, but....
- 48.19% xfs_attr_create_intent - 46.89% xfs_attri_init - kvmalloc_node - 46.04% __kmalloc_node - kmalloc_large_node - 45.99% __alloc_pages - 39.39% __alloc_pages_slowpath.constprop.0 - 38.89% __alloc_pages_direct_compact - 38.71% try_to_compact_pages - compact_zone_order - compact_zone - 21.09% isolate_migratepages_block 10.31% PageHuge 5.82% set_pfnblock_flags_mask 0.86% get_pfnblock_flags_mask - 4.48% __reset_isolation_suitable 4.44% __reset_isolation_pfn - 3.56% __pageblock_pfn_to_page 1.33% pfn_to_online_page 2.83% get_pfnblock_flags_mask - 0.87% migrate_pages 0.86% compaction_alloc 0.84% find_suitable_fallback - 6.60% get_page_from_freelist 4.99% clear_page_erms - 1.19% _raw_spin_lock_irqsave - do_raw_spin_lock __pv_queued_spin_lock_slowpath - 0.86% __vmalloc_node_range 0.65% __alloc_pages_bulk
.... this is just yet another reminder of how much kvmalloc() sucks. So lift xlog_cil_kvmalloc(), rename it to xlog_kvmalloc() and use that instead....
We also clean up the attribute name and value lengths as they no longer need to be rounded out to sizes compatible with log vectors.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <david@fromorbit.com>
show more ...
|
Revision tags: v5.15.38 |
|
#
0d227466 |
| 03-May-2022 |
Dave Chinner <dchinner@redhat.com> |
xfs: intent item whiteouts
When we log modifications based on intents, we add both intent and intent done items to the modification being made. These get written to the log to ensure that the operat
xfs: intent item whiteouts
When we log modifications based on intents, we add both intent and intent done items to the modification being made. These get written to the log to ensure that the operation is re-run if the intent done is not found in the log.
However, for operations that complete wholly within a single checkpoint, the change in the checkpoint is atomic and will never need replay. In this case, we don't need to actually write the intent and intent done items to the journal because log recovery will never need to manually restart this modification.
Log recovery currently handles intent/intent done matching by inserting the intent into the AIL, then removing it when a matching intent done item is found. Hence for all the intent-based operations that complete within a checkpoint, we spend all that time parsing the intent/intent done items just to cancel them and do nothing with them.
Hence it follows that the only time we actually need intents in the log is when the modification crosses checkpoint boundaries in the log and so may only be partially complete in the journal. Hence if we commit and intent done item to the CIL and the intent item is in the same checkpoint, we don't actually have to write them to the journal because log recovery will always cancel the intents.
We've never really worried about the overhead of logging intents unnecessarily like this because the intents we log are generally very much smaller than the change being made. e.g. freeing an extent involves modifying at lease two freespace btree blocks and the AGF, so the EFI/EFD overhead is only a small increase in space and processing time compared to the overall cost of freeing an extent.
However, delayed attributes change this cost equation dramatically, especially for inline attributes. In the case of adding an inline attribute, we only log the inode core and attribute fork at present. With delayed attributes, we now log the attr intent which includes the name and value, the inode core adn attr fork, and finally the attr intent done item. We increase the number of items we log from 1 to 3, and the number of log vectors (regions) goes up from 3 to 7. Hence we tripple the number of objects that the CIL has to process, and more than double the number of log vectors that need to be written to the journal.
At scale, this means delayed attributes cause a non-pipelined CIL to become CPU bound processing all the extra items, resulting in a > 40% performance degradation on 16-way file+xattr create worklaods. Pipelining the CIL (as per 5.15) reduces the performance degradation to 20%, but now the limitation is the rate at which the log items can be written to the iclogs and iclogs be dispatched for IO and completed.
Even log IO completion is slowed down by these intents, because it now has to process 3x the number of items in the checkpoint. Processing completed intents is especially inefficient here, because we first insert the intent into the AIL, then remove it from the AIL when the intent done is processed. IOWs, we are also doing expensive operations in log IO completion we could completely avoid if we didn't log completed intent/intent done pairs.
Enter log item whiteouts.
When an intent done is committed, we can check to see if the associated intent is in the same checkpoint as we are currently committing the intent done to. If so, we can mark the intent log item with a whiteout and immediately free the intent done item rather than committing it to the CIL. We can basically skip the entire formatting and CIL insertion steps for the intent done item.
However, we cannot remove the intent item from the CIL at this point because the unlocked per-cpu CIL item lists do not permit removal without holding the CIL context lock exclusively. Transaction commit only holds the context lock shared, hence the best we can do is mark the intent item with a whiteout so that the CIL push can release it rather than writing it to the log.
This means we never write the intent to the log if the intent done has also been committed to the same checkpoint, but we'll always write the intent if the intent done has not been committed or has been committed to a different checkpoint. This will result in correct log recovery behaviour in all cases, without the overhead of logging unnecessary intents.
This intent whiteout concept is generic - we can apply it to all intent/intent done pairs that have a direct 1:1 relationship. The way deferred ops iterate and relog intents mean that all intents currently have a 1:1 relationship with their done intent, and hence we can apply this cancellation to all existing intent/intent done implementations.
For delayed attributes with a 16-way 64kB xattr create workload, whiteouts reduce the amount of journalled metadata from ~2.5GB/s down to ~600MB/s and improve the creation rate from 9000/s to 14000/s.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
show more ...
|
#
22b1afc5 |
| 03-May-2022 |
Dave Chinner <dchinner@redhat.com> |
xfs: factor and move some code in xfs_log_cil.c
In preparation for adding support for intent item whiteouts.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@k
xfs: factor and move some code in xfs_log_cil.c
In preparation for adding support for intent item whiteouts.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
show more ...
|
Revision tags: v5.15.37, v5.15.36 |
|
#
593e3439 |
| 20-Apr-2022 |
Dave Chinner <dchinner@redhat.com> |
xfs: CIL context doesn't need to count iovecs
Now that we account for log opheaders in the log item formatting code, we don't actually use the aggregated count of log iovecs in the CIL for anything.
xfs: CIL context doesn't need to count iovecs
Now that we account for log opheaders in the log item formatting code, we don't actually use the aggregated count of log iovecs in the CIL for anything. Remove it and the tracking code that calculates it.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
show more ...
|
#
14b07ecd |
| 20-Apr-2022 |
Dave Chinner <dchinner@redhat.com> |
xfs: xlog_write() doesn't need optype anymore
So remove it from the interface and callers.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewe
xfs: xlog_write() doesn't need optype anymore
So remove it from the interface and callers.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
show more ...
|
#
d80fc291 |
| 20-Apr-2022 |
Dave Chinner <dchinner@redhat.com> |
xfs: pass lv chain length into xlog_write()
The caller of xlog_write() usually has a close accounting of the aggregated vector length contained in the log vector chain passed to xlog_write(). There
xfs: pass lv chain length into xlog_write()
The caller of xlog_write() usually has a close accounting of the aggregated vector length contained in the log vector chain passed to xlog_write(). There is no need to iterate the chain to calculate he length of the data in xlog_write_calculate_len() if the caller is already iterating that chain to build it.
Passing in the vector length avoids doing an extra chain iteration, which can be a significant amount of work given that large CIL commits can have hundreds of thousands of vectors attached to the chain.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
show more ...
|
#
8d547cf9 |
| 20-Apr-2022 |
Dave Chinner <dchinner@redhat.com> |
xfs: reserve space and initialise xlog_op_header in item formatting
Current xlog_write() adds op headers to the log manually for every log item region that is in the vector passed to it. While xlog_
xfs: reserve space and initialise xlog_op_header in item formatting
Current xlog_write() adds op headers to the log manually for every log item region that is in the vector passed to it. While xlog_write() needs to stamp the transaction ID into the ophdr, we already know it's length, flags, clientid, etc at CIL commit time.
This means the only time that xlog write really needs to format and reserve space for a new ophdr is when a region is split across two iclogs. Adding the opheader and accounting for it as part of the normal formatted item region means we simplify the accounting of space used by a transaction and we don't have to special case reserving of space in for the ophdrs in xlog_write(). It also means we can largely initialise the ophdr in transaction commit instead of xlog_write, making the xlog_write formatting inner loop much tighter.
xlog_prepare_iovec() is now too large to stay as an inline function, so we move it out of line and into xfs_log.c.
Object sizes: text data bss dec hex filename 1125934 305951 484 1432369 15db31 fs/xfs/built-in.a.before 1123360 305951 484 1429795 15d123 fs/xfs/built-in.a.after
So the code is a roughly 2.5kB smaller with xlog_prepare_iovec() now out of line, even though it grew in size itself.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
show more ...
|
#
c7610dce |
| 20-Apr-2022 |
Dave Chinner <dchinner@redhat.com> |
xfs: log tickets don't need log client id
We currently set the log ticket client ID when we reserve a transaction. This client ID is only ever written to the log by a CIL checkpoint or unmount recor
xfs: log tickets don't need log client id
We currently set the log ticket client ID when we reserve a transaction. This client ID is only ever written to the log by a CIL checkpoint or unmount records, and so anything using a high level transaction allocated through xfs_trans_alloc() does not need a log ticket client ID to be set.
For the CIL checkpoint, the client ID written to the journal is always XFS_TRANSACTION, and for the unmount record it is always XFS_LOG, and nothing else writes to the log. All of these operations tell xlog_write() exactly what they need to write to the log (the optype) and build their own opheaders for start, commit and unmount records. Hence we no longer need to set the client id in either the log ticket or the xfs_trans.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
show more ...
|
#
54021b62 |
| 20-Apr-2022 |
Dave Chinner <david@fromorbit.com> |
xfs: embed the xlog_op_header in the commit record
Remove the final case where xlog_write() has to prepend an opheader to a log transaction. Similar to the start record, the commit record is just an
xfs: embed the xlog_op_header in the commit record
Remove the final case where xlog_write() has to prepend an opheader to a log transaction. Similar to the start record, the commit record is just an empty opheader with a XLOG_COMMIT_TRANS type, so we can just make this the payload for the region being passed to xlog_write() and remove the special handling in xlog_write() for the commit record.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
show more ...
|
#
6eaed95e |
| 20-Apr-2022 |
Dave Chinner <dchinner@redhat.com> |
xfs: only CIL pushes require a start record
So move the one-off start record writing in xlog_write() out into the static header that the CIL push builds to write into the log initially. This simplif
xfs: only CIL pushes require a start record
So move the one-off start record writing in xlog_write() out into the static header that the CIL push builds to write into the log initially. This simplifes the xlog_write() logic a lot.
pahole on x86-64 confirms that the xlog_cil_trans_hdr is correctly 32 bit aligned and packed for copying the log op and transaction headers directly into the log as a single log region copy.
struct xlog_cil_trans_hdr { struct xlog_op_header oph[2]; /* 0 24 */ struct xfs_trans_header thdr; /* 24 16 */ struct xfs_log_iovec lhdr[2]; /* 40 32 */
/* size: 72, cachelines: 2, members: 3 */ /* last cacheline: 8 bytes */ };
A wart is needed to handle the fact that length of the region the opheader points to doesn't include the opheader length. hence if we embed the opheader, we have to substract the opheader length from the length written into the opheader by the generic copying code. This will eventually go away when everything is converted to embedded opheaders.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
show more ...
|