Revision tags: v6.6.25, v6.6.24 |
|
#
cd3c2cf3 |
| 26-Mar-2024 |
Darrick J. Wong <djwong@kernel.org> |
xfs: use xfs_defer_pending objects to recover intent items
commit 03f7767c9f6120ac933378fdec3bfd78bf07bc11 upstream.
One thing I never quite got around to doing is porting the log intent item recov
xfs: use xfs_defer_pending objects to recover intent items
commit 03f7767c9f6120ac933378fdec3bfd78bf07bc11 upstream.
One thing I never quite got around to doing is porting the log intent item recovery code to reconstruct the deferred pending work state. As a result, each intent item open codes xfs_defer_finish_one in its recovery method, because that's what the EFI code did before xfs_defer.c even existed.
This is a gross thing to have left unfixed -- if an EFI cannot proceed due to busy extents, we end up creating separate new EFIs for each unfinished work item, which is a change in behavior from what runtime would have done.
Worse yet, Long Li pointed out that there's a UAF in the recovery code. The ->commit_pass2 function adds the intent item to the AIL and drops the refcount. The one remaining refcount is now owned by the recovery mechanism (aka the log intent items in the AIL) with the intent of giving the refcount to the intent done item in the ->iop_recover function.
However, if something fails later in recovery, xlog_recover_finish will walk the recovered intent items in the AIL and release them. If the CIL hasn't been pushed before that point (which is possible since we don't force the log until later) then the intent done release will try to free its associated intent, which has already been freed.
This patch starts to address this mess by having the ->commit_pass2 functions recreate the xfs_defer_pending state. The next few patches will fix the recovery functions.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Catherine Hoang <catherine.hoang@oracle.com> Acked-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
show more ...
|
Revision tags: v6.6.25, v6.6.24 |
|
#
cd3c2cf3 |
| 26-Mar-2024 |
Darrick J. Wong <djwong@kernel.org> |
xfs: use xfs_defer_pending objects to recover intent items
commit 03f7767c9f6120ac933378fdec3bfd78bf07bc11 upstream.
One thing I never quite got around to doing is porting the log intent item recov
xfs: use xfs_defer_pending objects to recover intent items
commit 03f7767c9f6120ac933378fdec3bfd78bf07bc11 upstream.
One thing I never quite got around to doing is porting the log intent item recovery code to reconstruct the deferred pending work state. As a result, each intent item open codes xfs_defer_finish_one in its recovery method, because that's what the EFI code did before xfs_defer.c even existed.
This is a gross thing to have left unfixed -- if an EFI cannot proceed due to busy extents, we end up creating separate new EFIs for each unfinished work item, which is a change in behavior from what runtime would have done.
Worse yet, Long Li pointed out that there's a UAF in the recovery code. The ->commit_pass2 function adds the intent item to the AIL and drops the refcount. The one remaining refcount is now owned by the recovery mechanism (aka the log intent items in the AIL) with the intent of giving the refcount to the intent done item in the ->iop_recover function.
However, if something fails later in recovery, xlog_recover_finish will walk the recovered intent items in the AIL and release them. If the CIL hasn't been pushed before that point (which is possible since we don't force the log until later) then the intent done release will try to free its associated intent, which has already been freed.
This patch starts to address this mess by having the ->commit_pass2 functions recreate the xfs_defer_pending state. The next few patches will fix the recovery functions.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Catherine Hoang <catherine.hoang@oracle.com> Acked-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
show more ...
|
Revision tags: v6.6.23, v6.6.16, v6.6.15, v6.6.14, v6.6.13, v6.6.12, v6.6.11, v6.6.10, v6.6.9, v6.6.8, v6.6.7, v6.6.6, v6.6.5, v6.6.4, v6.6.3, v6.6.2, v6.5.11, v6.6.1, v6.5.10, v6.6, v6.5.9, v6.5.8, v6.5.7, v6.5.6 |
|
#
428c4435 |
| 03-Oct-2023 |
Dave Chinner <dchinner@redhat.com> |
xfs: move log discard work to xfs_discard.c
Because we are going to use the same list-based discard submission interface for fstrim-based discards, too.
Signed-off-by: Dave Chinner <dchinner@redhat
xfs: move log discard work to xfs_discard.c
Because we are going to use the same list-based discard submission interface for fstrim-based discards, too.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
show more ...
|
Revision tags: v6.5.5, v6.5.4, v6.5.3 |
|
#
ecd49f7a |
| 11-Sep-2023 |
Darrick J. Wong <djwong@kernel.org> |
xfs: fix per-cpu CIL structure aggregation racing with dying cpus
In commit 7c8ade2121200 ("xfs: implement percpu cil space used calculation"), the XFS committed (log) item list code was converted t
xfs: fix per-cpu CIL structure aggregation racing with dying cpus
In commit 7c8ade2121200 ("xfs: implement percpu cil space used calculation"), the XFS committed (log) item list code was converted to use per-cpu lists and space tracking to reduce cpu contention when multiple threads are modifying different parts of the filesystem and hence end up contending on the log structures during transaction commit. Each CPU tracks its own commit items and space usage, and these do not have to be merged into the main CIL until either someone wants to push the CIL items, or we run over a soft threshold and switch to slower (but more accurate) accounting with atomics.
Unfortunately, the for_each_cpu iteration suffers from the same race with cpu dying problem that was identified in commit 8b57b11cca88f ("pcpcntrs: fix dying cpu summation race") -- CPUs are removed from cpu_online_mask before the CPUHP_XFS_DEAD callback gets called. As a result, both CIL percpu structure aggregation functions fail to collect the items and accounted space usage at the correct point in time.
If we're lucky, the items that are collected from the online cpus exceed the space given to those cpus, and the log immediately shuts down in xlog_cil_insert_items due to the (apparent) log reservation overrun. This happens periodically with generic/650, which exercises cpu hotplug vs. the filesystem code:
smpboot: CPU 3 is now offline XFS (sda3): ctx ticket reservation ran out. Need to up reservation XFS (sda3): ticket reservation summary: XFS (sda3): unit res = 9268 bytes XFS (sda3): current res = -40 bytes XFS (sda3): original count = 1 XFS (sda3): remaining count = 1 XFS (sda3): Filesystem has been shut down due to log error (0x2).
Applying the same sort of fix from 8b57b11cca88f to the CIL code seems to make the generic/650 problem go away, but I've been told that tglx was not happy when he saw:
"...the only thing we actually need to care about is that percpu_counter_sum() iterates dying CPUs. That's trivial to do, and when there are no CPUs dying, it has no addition overhead except for a cpumask_or() operation."
The CPU hotplug code is rather complex and difficult to understand and I don't want to try to understand the cpu hotplug locking well enough to use cpu_dying mask. Furthermore, there's a performance improvement that could be had here. Attach a private cpu mask to the CIL structure so that we can track exactly which cpus have accessed the percpu data at all. It doesn't matter if the cpu has since gone offline; log item aggregation will still find the items. Better yet, we skip cpus that have not recently logged anything.
Worse yet, Ritesh Harjani and Eric Sandeen both reported today that CPU hot remove racing with an xfs mount can crash if the cpu_dead notifier tries to access the log but the mount hasn't yet set up the log.
Link: https://lore.kernel.org/linux-xfs/ZOLzgBOuyWHapOyZ@dread.disaster.area/T/ Link: https://lore.kernel.org/lkml/877cuj1mt1.ffs@tglx/ Link: https://lore.kernel.org/lkml/20230414162755.281993820@linutronix.de/ Link: https://lore.kernel.org/linux-xfs/ZOVkjxWZq0YmjrJu@dread.disaster.area/T/ Cc: tglx@linutronix.de Cc: peterz@infradead.org Reported-by: ritesh.list@gmail.com Reported-by: sandeen@sandeen.net Fixes: af1c2146a50b ("xfs: introduce per-cpu CIL tracking structure") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
show more ...
|
Revision tags: v6.5.2, v6.1.51, v6.5.1, v6.1.50, v6.5, v6.1.49, v6.1.48, v6.1.46, v6.1.45, v6.1.44, v6.1.43, v6.1.42, v6.1.41, v6.1.40, v6.1.39, v6.1.38, v6.1.37, v6.1.36, v6.4, v6.1.35, v6.1.34, v6.1.33, v6.1.32, v6.1.31, v6.1.30, v6.1.29, v6.1.28, v6.1.27, v6.1.26, v6.3, v6.1.25, v6.1.24, v6.1.23, v6.1.22, v6.1.21, v6.1.20, v6.1.19, v6.1.18, v6.1.17, v6.1.16, v6.1.15, v6.1.14, v6.1.13, v6.2, v6.1.12, v6.1.11, v6.1.10, v6.1.9, v6.1.8, v6.1.7, v6.1.6, v6.1.5, v6.0.19, v6.0.18, v6.1.4, v6.1.3, v6.0.17, v6.1.2, v6.0.16, v6.1.1, v6.0.15, v6.0.14, v6.0.13, v6.1, v6.0.12, v6.0.11, v6.0.10, v5.15.80, v6.0.9, v5.15.79, v6.0.8, v5.15.78, v6.0.7, v5.15.77, v5.15.76, v6.0.6, v6.0.5, v5.15.75, v6.0.4, v6.0.3, v6.0.2, v5.15.74, v5.15.73, v6.0.1, v5.15.72, v6.0, v5.15.71, v5.15.70, v5.15.69, v5.15.68, v5.15.67, v5.15.66, v5.15.65, v5.15.64, v5.15.63, v5.15.62, v5.15.61, v5.15.60, v5.15.59, v5.19, v5.15.58, v5.15.57, v5.15.56, v5.15.55, v5.15.54, v5.15.53 |
|
#
d9f68777 |
| 07-Jul-2022 |
Dave Chinner <dchinner@redhat.com> |
xfs: xlog_sync() manually adjusts grant head space
When xlog_sync() rounds off the tail the iclog that is being flushed, it manually subtracts that space from the grant heads. This space is actually
xfs: xlog_sync() manually adjusts grant head space
When xlog_sync() rounds off the tail the iclog that is being flushed, it manually subtracts that space from the grant heads. This space is actually reserved by the transaction ticket that covers the xlog_sync() call from xlog_write(), but we don't plumb the ticket down far enough for it to account for the space consumed in the current log ticket.
The grant heads are hot, so we really should be accounting this to the ticket is we can, rather than adding thousands of extra grant head updates every CIL commit.
Interestingly, this actually indicates a potential log space overrun can occur when we force the log. By the time that xfs_log_force() pushes out an active iclog and consumes the roundoff space, the reservation for that roundoff space has been returned to the grant heads and is no longer covered by a reservation. In theory the roundoff added to log force on an already full log could push the write head past the tail. In practice, the CIL commit that writes to the log and needs the iclog pushed will have reserved space for roundoff, so when it releases the ticket there will still be physical space for the roundoff to be committed to the log, even though it is no longer reserved. This roundoff won't be enough space to allow a transaction to be woken if the log is full, so overruns should not actually occur in practice.
That said, it indicates that we should not release the CIL context log ticket until after we've released the commit iclog. It also means that xlog_sync() still needs the direct grant head manipulation if we don't provide it with a ticket. Log forces are rare when we are in fast paths running 1.5 million transactions/s that make the grant heads hot, so let's optimise the hot case and pass CIL log tickets down to the xlog_sync() code.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
show more ...
|
#
16924853 |
| 07-Jul-2022 |
Dave Chinner <dchinner@redhat.com> |
xfs: convert log vector chain to use list heads
Because the next change is going to require sorting log vectors, and that requires arbitrary rearrangement of the list which cannot be done easily wit
xfs: convert log vector chain to use list heads
Because the next change is going to require sorting log vectors, and that requires arbitrary rearrangement of the list which cannot be done easily with a single linked list.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
show more ...
|
#
c0fb4765 |
| 07-Jul-2022 |
Dave Chinner <dchinner@redhat.com> |
xfs: convert CIL to unordered per cpu lists
So that we can remove the cil_lock which is a global serialisation point. We've already got ordering sorted, so all we need to do is treat the CIL list li
xfs: convert CIL to unordered per cpu lists
So that we can remove the cil_lock which is a global serialisation point. We've already got ordering sorted, so all we need to do is treat the CIL list like the busy extent list and reconstruct it before the push starts.
This is what we're trying to avoid:
- 75.35% 1.83% [kernel] [k] xfs_log_commit_cil - 46.35% xfs_log_commit_cil - 41.54% _raw_spin_lock - 67.30% do_raw_spin_lock 66.96% __pv_queued_spin_lock_slowpath
Which happens on a 32p system when running a 32-way 'rm -rf' workload. After this patch:
- 20.90% 3.23% [kernel] [k] xfs_log_commit_cil - 17.67% xfs_log_commit_cil - 6.51% xfs_log_ticket_ungrant 1.40% xfs_log_space_wake 2.32% memcpy_erms - 2.18% xfs_buf_item_committing - 2.12% xfs_buf_item_release - 1.03% xfs_buf_unlock 0.96% up 0.72% xfs_buf_rele 1.33% xfs_inode_item_format 1.19% down_read 0.91% up_read 0.76% xfs_buf_item_format - 0.68% kmem_alloc_large - 0.67% kmem_alloc 0.64% __kmalloc 0.50% xfs_buf_item_size
It kinda looks like the workload is running out of log space all the time. But all the spinlock contention is gone and the transaction commit rate has gone from 800k/s to 1.3M/s so the amount of real work being done has gone up a *lot*.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
show more ...
|
#
016a2338 |
| 07-Jul-2022 |
Dave Chinner <dchinner@redhat.com> |
xfs: Add order IDs to log items in CIL
Before we split the ordered CIL up into per cpu lists, we need a mechanism to track the order of the items in the CIL. We need to do this because there are rul
xfs: Add order IDs to log items in CIL
Before we split the ordered CIL up into per cpu lists, we need a mechanism to track the order of the items in the CIL. We need to do this because there are rules around the order in which related items must physically appear in the log even inside a single checkpoint transaction.
An example of this is intents - an intent must appear in the log before it's intent done record so that log recovery can cancel the intent correctly. If we have these two records misordered in the CIL, then they will not be recovered correctly by journal replay.
We also will not be able to move items to the tail of the CIL list when they are relogged, hence the log items will need some mechanism to allow the correct log item order to be recreated before we write log items to the hournal.
Hence we need to have a mechanism for recording global order of transactions in the log items so that we can recover that order from un-ordered per-cpu lists.
Do this with a simple monotonic increasing commit counter in the CIL context. Each log item in the transaction gets stamped with the current commit order ID before it is added to the CIL. If the item is already in the CIL, leave it where it is instead of moving it to the tail of the list and instead sort the list before we start the push work.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
show more ...
|
#
1dd2a2c1 |
| 07-Jul-2022 |
Dave Chinner <dchinner@redhat.com> |
xfs: track CIL ticket reservation in percpu structure
To get it out from under the cil spinlock.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
|
#
7c8ade21 |
| 07-Jul-2022 |
Dave Chinner <dchinner@redhat.com> |
xfs: implement percpu cil space used calculation
Now that we have the CIL percpu structures in place, implement the space used counter as a per-cpu counter.
We have to be really careful now about e
xfs: implement percpu cil space used calculation
Now that we have the CIL percpu structures in place, implement the space used counter as a per-cpu counter.
We have to be really careful now about ensuring that the checks and updates run without arbitrary delays, which means they need to run with pre-emption disabled. We do this by careful placement of the get_cpu_ptr/put_cpu_ptr calls to access the per-cpu structures for that CPU.
We need to be able to reliably detect that the CIL has reached the hard limit threshold so we can take extra reservations for the iclog headers when the space used overruns the original reservation. hence we factor out xlog_cil_over_hard_limit() from xlog_cil_push_background().
The global CIL space used is an atomic variable that is backed by per-cpu aggregation to minimise the number of atomic updates we do to the global state in the fast path. While we are under the soft limit, we aggregate only when the per-cpu aggregation is over the proportion of the soft limit assigned to that CPU. This means that all CPUs can use all but one byte of their aggregation threshold and we will not go over the soft limit.
Hence once we detect that we've gone over both a per-cpu aggregation threshold and the soft limit, we know that we have only exceeded the soft limit by one per-cpu aggregation threshold. Even if all CPUs hit this at the same time, we can't be over the hard limit, so we can run an aggregation back into the atomic counter at this point and still be under the hard limit.
At this point, we will be over the soft limit and hence we'll aggregate into the global atomic used space directly rather than the per-cpu counters, hence providing accurate detection of hard limit excursion for accounting and reservation purposes.
Hence we get the best of both worlds - lockless, scalable per-cpu fast path plus accurate, atomic detection of hard limit excursion.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
show more ...
|
Revision tags: v5.15.52 |
|
#
af1c2146 |
| 01-Jul-2022 |
Dave Chinner <dchinner@redhat.com> |
xfs: introduce per-cpu CIL tracking structure
The CIL push lock is highly contended on larger machines, becoming a hard bottleneck that about 700,000 transaction commits/s on >16p machines. To addre
xfs: introduce per-cpu CIL tracking structure
The CIL push lock is highly contended on larger machines, becoming a hard bottleneck that about 700,000 transaction commits/s on >16p machines. To address this, start moving the CIL tracking infrastructure to utilise per-CPU structures.
We need to track the space used, the amount of log reservation space reserved to write the CIL, the log items in the CIL and the busy extents that need to be completed by the CIL commit. This requires a couple of per-cpu counters, an unordered per-cpu list and a globally ordered per-cpu list.
Create a per-cpu structure to hold these and all the management interfaces needed, as well as the hooks to handle hotplug CPUs.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
show more ...
|
#
31151cc3 |
| 01-Jul-2022 |
Dave Chinner <dchinner@redhat.com> |
xfs: rework per-iclog header CIL reservation
For every iclog that a CIL push will use up, we need to ensure we have space reserved for the iclog header in each iclog. It is extremely difficult to do
xfs: rework per-iclog header CIL reservation
For every iclog that a CIL push will use up, we need to ensure we have space reserved for the iclog header in each iclog. It is extremely difficult to do this accurately with a per-cpu counter without expensive summing of the counter in every commit. However, we know what the maximum CIL size is going to be because of the hard space limit we have, and hence we know exactly how many iclogs we are going to need to write out the CIL.
We are constrained by the requirement that small transactions only have reservation space for a single iclog header built into them. At commit time we don't know how much of the current transaction reservation is made up of iclog header reservations as calculated by xfs_log_calc_unit_res() when the ticket was reserved. As larger reservations have multiple header spaces reserved, we can steal more than one iclog header reservation at a time, but we only steal the exact number needed for the given log vector size delta.
As a result, we don't know exactly when we are going to steal iclog header reservations, nor do we know exactly how many we are going to need for a given CIL.
To make things simple, start by calculating the worst case number of iclog headers a full CIL push will require. Record this into an atomic variable in the CIL. Then add a byte counter to the log ticket that records exactly how much iclog header space has been reserved in this ticket by xfs_log_calc_unit_res(). This tells us exactly how much space we can steal from the ticket at transaction commit time.
Now, at transaction commit time, we can check if the CIL has a full iclog header reservation and, if not, steal the entire reservation the current ticket holds for iclog headers. This minimises the number of times we need to do atomic operations in the fast path, but still guarantees we get all the reservations we need.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
show more ...
|
#
88591e7f |
| 01-Jul-2022 |
Dave Chinner <dchinner@redhat.com> |
xfs: use the CIL space used counter for emptiness checks
In the next patches we are going to make the CIL list itself per-cpu, and so we cannot use list_empty() to check is the list is empty. Replac
xfs: use the CIL space used counter for emptiness checks
In the next patches we are going to make the CIL list itself per-cpu, and so we cannot use list_empty() to check is the list is empty. Replace the list_empty() checks with a flag in the CIL to indicate we have committed at least one transaction to the CIL and hence the CIL is not empty.
We need this flag to be an atomic so that we can clear it without holding any locks in the commit fast path, but we also need to be careful to avoid atomic operations in the fast path. Hence we use the fact that test_bit() is not an atomic op to first check if the flag is set and then run the atomic test_and_clear_bit() operation to clear it and steal the initial unit reservation for the CIL context checkpoint.
When we are switching to a new context in a push, we place the setting of the XLOG_CIL_EMPTY flag under the xc_push_lock. THis allows all the other places that need to check whether the CIL is empty to use test_bit() and still be serialised correctly with the CIL context swaps that set the bit.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
show more ...
|
Revision tags: v5.15.51, v5.15.50, v5.15.49, v5.15.48, v5.15.47, v5.15.46, v5.15.45, v5.15.44 |
|
#
27232349 |
| 26-May-2022 |
Darrick J. Wong <djwong@kernel.org> |
xfs: refactor buffer cancellation table allocation
Move the code that allocates and frees the buffer cancellation tables used by log recovery into the file that actually uses the tables. This is a
xfs: refactor buffer cancellation table allocation
Move the code that allocates and frees the buffer cancellation tables used by log recovery into the file that actually uses the tables. This is a precursor to some cleanups and a memory leak fix.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
show more ...
|
Revision tags: v5.15.43, v5.15.42, v5.18, v5.15.41, v5.15.40, v5.15.39 |
|
#
45ff8b47 |
| 12-May-2022 |
Dave Chinner <dchinner@redhat.com> |
xfs: can't use kmem_zalloc() for attribute buffers
Because heap allocation of 64kB buffers will fail:
.... XFS: fs_mark(8414) possible memory allocation deadlock size 65768 in kmem_alloc (mode:0x2
xfs: can't use kmem_zalloc() for attribute buffers
Because heap allocation of 64kB buffers will fail:
.... XFS: fs_mark(8414) possible memory allocation deadlock size 65768 in kmem_alloc (mode:0x2d40) XFS: fs_mark(8417) possible memory allocation deadlock size 65768 in kmem_alloc (mode:0x2d40) XFS: fs_mark(8409) possible memory allocation deadlock size 65768 in kmem_alloc (mode:0x2d40) XFS: fs_mark(8428) possible memory allocation deadlock size 65768 in kmem_alloc (mode:0x2d40) XFS: fs_mark(8430) possible memory allocation deadlock size 65768 in kmem_alloc (mode:0x2d40) XFS: fs_mark(8437) possible memory allocation deadlock size 65768 in kmem_alloc (mode:0x2d40) XFS: fs_mark(8433) possible memory allocation deadlock size 65768 in kmem_alloc (mode:0x2d40) XFS: fs_mark(8406) possible memory allocation deadlock size 65768 in kmem_alloc (mode:0x2d40) XFS: fs_mark(8412) possible memory allocation deadlock size 65768 in kmem_alloc (mode:0x2d40) XFS: fs_mark(8432) possible memory allocation deadlock size 65768 in kmem_alloc (mode:0x2d40) XFS: fs_mark(8424) possible memory allocation deadlock size 65768 in kmem_alloc (mode:0x2d40) ....
I'd use kvmalloc() instead, but....
- 48.19% xfs_attr_create_intent - 46.89% xfs_attri_init - kvmalloc_node - 46.04% __kmalloc_node - kmalloc_large_node - 45.99% __alloc_pages - 39.39% __alloc_pages_slowpath.constprop.0 - 38.89% __alloc_pages_direct_compact - 38.71% try_to_compact_pages - compact_zone_order - compact_zone - 21.09% isolate_migratepages_block 10.31% PageHuge 5.82% set_pfnblock_flags_mask 0.86% get_pfnblock_flags_mask - 4.48% __reset_isolation_suitable 4.44% __reset_isolation_pfn - 3.56% __pageblock_pfn_to_page 1.33% pfn_to_online_page 2.83% get_pfnblock_flags_mask - 0.87% migrate_pages 0.86% compaction_alloc 0.84% find_suitable_fallback - 6.60% get_page_from_freelist 4.99% clear_page_erms - 1.19% _raw_spin_lock_irqsave - do_raw_spin_lock __pv_queued_spin_lock_slowpath - 0.86% __vmalloc_node_range 0.65% __alloc_pages_bulk
.... this is just yet another reminder of how much kvmalloc() sucks. So lift xlog_cil_kvmalloc(), rename it to xlog_kvmalloc() and use that instead....
We also clean up the attribute name and value lengths as they no longer need to be rounded out to sizes compatible with log vectors.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <david@fromorbit.com>
show more ...
|
Revision tags: v5.15.38, v5.15.37, v5.15.36 |
|
#
c60d13ea |
| 20-Apr-2022 |
Dave Chinner <dchinner@redhat.com> |
xfs: convert log ticket and iclog flags to unsigned.
5.18 w/ std=gnu11 compiled with gcc-5 wants flags stored in unsigned fields to be unsigned.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Re
xfs: convert log ticket and iclog flags to unsigned.
5.18 w/ std=gnu11 compiled with gcc-5 wants flags stored in unsigned fields to be unsigned.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
show more ...
|
#
593e3439 |
| 20-Apr-2022 |
Dave Chinner <dchinner@redhat.com> |
xfs: CIL context doesn't need to count iovecs
Now that we account for log opheaders in the log item formatting code, we don't actually use the aggregated count of log iovecs in the CIL for anything.
xfs: CIL context doesn't need to count iovecs
Now that we account for log opheaders in the log item formatting code, we don't actually use the aggregated count of log iovecs in the CIL for anything. Remove it and the tracking code that calculates it.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
show more ...
|
#
14b07ecd |
| 20-Apr-2022 |
Dave Chinner <dchinner@redhat.com> |
xfs: xlog_write() doesn't need optype anymore
So remove it from the interface and callers.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewe
xfs: xlog_write() doesn't need optype anymore
So remove it from the interface and callers.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
show more ...
|
#
1236bbe8 |
| 20-Apr-2022 |
Christoph Hellwig <hch@lst.de> |
xfs: remove xlog_verify_dest_ptr
Just check that the offset in xlog_write_vec is smaller than the iclog size and remove the expensive cycling through all iclogs.
Signed-off-by: Christoph Hellwig <h
xfs: remove xlog_verify_dest_ptr
Just check that the offset in xlog_write_vec is smaller than the iclog size and remove the expensive cycling through all iclogs.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <david@fromorbit.com>
show more ...
|
#
ad3e3693 |
| 20-Apr-2022 |
Dave Chinner <dchinner@redhat.com> |
xfs: introduce xlog_write_partial()
Re-implement writing of a log vector that does not fit into the current iclog. The iclog will already be in XLOG_STATE_WANT_SYNC because xlog_get_iclog_space() wi
xfs: introduce xlog_write_partial()
Re-implement writing of a log vector that does not fit into the current iclog. The iclog will already be in XLOG_STATE_WANT_SYNC because xlog_get_iclog_space() will have reserved all the remaining iclog space for us, hence we can simply iterate over the iovecs in the log vector getting more iclog space until the entire log vector is written.
Handling this partial write case separately means we do need to pass unnecessary state around for the common, fast path case when the log vector fits entirely within the current iclog. It isolates the complexity and allows us to modify and improve the partial write case without impacting the simple fast path.
This change includes several improvements incorporated from patches written by Christoph Hellwig.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
show more ...
|
#
decb545f |
| 20-Apr-2022 |
Christoph Hellwig <hch@lst.de> |
xfs: change the type of ic_datap
Turn ic_datap from a char into a void pointer given that it points to arbitrary data.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfost
xfs: change the type of ic_datap
Turn ic_datap from a char into a void pointer given that it points to arbitrary data.
Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Chandan Babu R <chandanrlinux@gmail.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> [dgc: also remove (char *) cast in xlog_alloc_log()] Signed-off-by: Dave Chinner <david@fromorbit.com>
show more ...
|
#
d80fc291 |
| 20-Apr-2022 |
Dave Chinner <dchinner@redhat.com> |
xfs: pass lv chain length into xlog_write()
The caller of xlog_write() usually has a close accounting of the aggregated vector length contained in the log vector chain passed to xlog_write(). There
xfs: pass lv chain length into xlog_write()
The caller of xlog_write() usually has a close accounting of the aggregated vector length contained in the log vector chain passed to xlog_write(). There is no need to iterate the chain to calculate he length of the data in xlog_write_calculate_len() if the caller is already iterating that chain to build it.
Passing in the vector length avoids doing an extra chain iteration, which can be a significant amount of work given that large CIL commits can have hundreds of thousands of vectors attached to the chain.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
show more ...
|
#
c5141320 |
| 20-Apr-2022 |
Dave Chinner <dchinner@redhat.com> |
xfs: log ticket region debug is largely useless
xlog_tic_add_region() is used to trace the regions being added to a log ticket to provide information in the situation where a ticket reservation over
xfs: log ticket region debug is largely useless
xlog_tic_add_region() is used to trace the regions being added to a log ticket to provide information in the situation where a ticket reservation overrun occurs. The information gathered is stored int the ticket, and dumped if xlog_print_tic_res() is called.
For a front end struct xfs_trans overrun, the ticket only contains reservation tracking information - the ticket is never handed to the log so has no regions attached to it. The overrun debug information in this case comes from xlog_print_trans(), which walks the items attached to the transaction and dumps their attached formatted log vectors directly. It also dumps the ticket state, but that only contains reservation accounting and nothing else. Hence xlog_print_tic_res() never dumps region or overrun information from this path.
xlog_tic_add_region() is actually called from xlog_write(), which means it is being used to track the regions seen in a CIL checkpoint log vector chain. In looking at CIL behaviour recently, I've seen 32MB checkpoints regularly exceed 250,000 regions in the LV chain. The log ticket debug code can track *15* regions. IOWs, if there is a ticket overrun in the CIL code, the ticket region tracking code is going to be completely useless for determining what went wrong. The only thing it can tell us is how much of an overrun occurred, and we really don't need extra debug information in the log ticket to tell us that.
Indeed, the main place we call xlog_tic_add_region() is also adding up the number of regions and the space used so that xlog_write() knows how much will be written to the log. This is exactly the same information that log ticket is storing once we take away the useless region tracking array. Hence xlog_tic_add_region() is not useful, but can be called 250,000 times a CIL push...
Just strip all that debug "information" out of the of the log ticket and only have it report reservation space information when an overrun occurs. This also reduces the size of a log ticket down by about 150 bytes...
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
show more ...
|
#
c7610dce |
| 20-Apr-2022 |
Dave Chinner <dchinner@redhat.com> |
xfs: log tickets don't need log client id
We currently set the log ticket client ID when we reserve a transaction. This client ID is only ever written to the log by a CIL checkpoint or unmount recor
xfs: log tickets don't need log client id
We currently set the log ticket client ID when we reserve a transaction. This client ID is only ever written to the log by a CIL checkpoint or unmount records, and so anything using a high level transaction allocated through xfs_trans_alloc() does not need a log ticket client ID to be set.
For the CIL checkpoint, the client ID written to the journal is always XFS_TRANSACTION, and for the unmount record it is always XFS_LOG, and nothing else writes to the log. All of these operations tell xlog_write() exactly what they need to write to the log (the optype) and build their own opheaders for start, commit and unmount records. Hence we no longer need to set the client id in either the log ticket or the xfs_trans.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Chandan Babu R <chandan.babu@oracle.com> Signed-off-by: Dave Chinner <david@fromorbit.com>
show more ...
|
Revision tags: v5.15.35, v5.15.34, v5.15.33 |
|
#
919edbad |
| 29-Mar-2022 |
Dave Chinner <dchinner@redhat.com> |
xfs: drop async cache flushes from CIL commits.
Jan Kara reported a performance regression in dbench that he bisected down to commit bad77c375e8d ("xfs: CIL checkpoint flushes caches unconditionally
xfs: drop async cache flushes from CIL commits.
Jan Kara reported a performance regression in dbench that he bisected down to commit bad77c375e8d ("xfs: CIL checkpoint flushes caches unconditionally").
Whilst developing the journal flush/fua optimisations this cache was part of, it appeared to made a significant difference to performance. However, now that this patchset has settled and all the correctness issues fixed, there does not appear to be any significant performance benefit to asynchronous cache flushes.
In fact, the opposite is true on some storage types and workloads, where additional cache flushes that can occur from fsync heavy workloads have measurable and significant impact on overall throughput.
Local dbench testing shows little difference on dbench runs with sync vs async cache flushes on either fast or slow SSD storage, and no difference in streaming concurrent async transaction workloads like fs-mark.
Fast NVME storage.
From `dbench -t 30`, CIL scale:
clients async sync BW Latency BW Latency 1 935.18 0.855 915.64 0.903 8 2404.51 6.873 2341.77 6.511 16 3003.42 6.460 2931.57 6.529 32 3697.23 7.939 3596.28 7.894 128 7237.43 15.495 7217.74 11.588 512 5079.24 90.587 5167.08 95.822
fsmark, 32 threads, create w/ 64 byte xattr w/32k logbsize
create chown unlink async 1m41s 1m16s 2m03s sync 1m40s 1m19s 1m54s
Slower SATA SSD storage:
From `dbench -t 30`, CIL scale:
clients async sync BW Latency BW Latency 1 78.59 15.792 83.78 10.729 8 367.88 92.067 404.63 59.943 16 564.51 72.524 602.71 76.089 32 831.66 105.984 870.26 110.482 128 1659.76 102.969 1624.73 91.356 512 2135.91 223.054 2603.07 161.160
fsmark, 16 threads, create w/32k logbsize
create unlink async 5m06s 4m15s sync 5m00s 4m22s
And on Jan's test machine:
5.18-rc8-vanilla 5.18-rc8-patched Amean 1 71.22 ( 0.00%) 64.94 * 8.81%* Amean 2 93.03 ( 0.00%) 84.80 * 8.85%* Amean 4 150.54 ( 0.00%) 137.51 * 8.66%* Amean 8 252.53 ( 0.00%) 242.24 * 4.08%* Amean 16 454.13 ( 0.00%) 439.08 * 3.31%* Amean 32 835.24 ( 0.00%) 829.74 * 0.66%* Amean 64 1740.59 ( 0.00%) 1686.73 * 3.09%*
Performance and cache flush behaviour is restored to pre-regression levels.
As such, we can now consider the async cache flush mechanism an unnecessary exercise in premature optimisation and hence we can now remove it and the infrastructure it requires completely.
Fixes: bad77c375e8d ("xfs: CIL checkpoint flushes caches unconditionally") Reported-and-tested-by: Jan Kara <jack@suse.cz> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
show more ...
|