Revision tags: v6.6.25, v6.6.24, v6.6.23, v6.6.16, v6.6.15, v6.6.14, v6.6.13, v6.6.12, v6.6.11, v6.6.10, v6.6.9, v6.6.8, v6.6.7, v6.6.6, v6.6.5, v6.6.4, v6.6.3, v6.6.2, v6.5.11, v6.6.1, v6.5.10, v6.6, v6.5.9, v6.5.8, v6.5.7, v6.5.6, v6.5.5, v6.5.4, v6.5.3 |
|
#
68b957f6 |
| 11-Sep-2023 |
Darrick J. Wong <djwong@kernel.org> |
xfs: load uncached unlinked inodes into memory on demand
shrikanth hegde reports that filesystems fail shortly after mount with the following failure:
WARNING: CPU: 56 PID: 12450 at fs/xfs/xfs_inode.c:1839 xfs_iunlink_lookup+0x58/0x80 [xfs]
This of course is the WARN_ON_ONCE in xfs_iunlink_lookup:
    ip = radix_tree_lookup(&pag->pag_ici_root, agino);
    if (WARN_ON_ONCE(!ip || !ip->i_ino)) {
        ...
    }
From diagnostic data collected by the bug reporters, it would appear that we cleanly mounted a filesystem that contained unlinked inodes. Unlinked inodes are only processed as a final step of log recovery, which means that clean mounts do not process the unlinked list at all.
Prior to the introduction of the incore unlinked lists, this wasn't a problem because the unlink code would (very expensively) traverse the entire ondisk metadata iunlink chain to keep things up to date. However, the incore unlinked list code complains when it realizes that it is out of sync with the ondisk metadata and shuts down the fs, which is bad.
Ritesh proposed to solve this problem by unconditionally parsing the unlinked lists at mount time, but this imposes a mount time cost for every filesystem to catch something that should be very infrequent. Instead, let's target the places where we can encounter a next_unlinked pointer that refers to an inode that is not in cache, and load it into cache.
Note: This patch does not address the problem of iget loading an inode from the middle of the iunlink list and needing to set i_prev_unlinked correctly.
Reported-by: shrikanth hegde <sshegde@linux.vnet.ibm.com> Triaged-by: Ritesh Harjani <ritesh.list@gmail.com> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
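A minimal sketch of the on-demand load described above, with an assumed helper name and iget usage (the actual patch may differ): on a cache miss, cold-load the unlinked inode instead of tripping the WARN_ON_ONCE.

    static struct xfs_inode *
    xfs_iunlink_lookup_or_load(
        struct xfs_perag    *pag,
        xfs_agino_t         agino)
    {
        struct xfs_mount    *mp = pag->pag_mount;
        struct xfs_inode    *ip;
        int                 error;

        ip = radix_tree_lookup(&pag->pag_ici_root, agino);
        if (ip)
            return ip;

        /* Clean mount left this unlinked inode uncached; load it cold. */
        error = xfs_iget(mp, NULL,
                XFS_AGINO_TO_INO(mp, pag->pag_agno, agino),
                XFS_IGET_UNTRUSTED, 0, &ip);
        if (error)
            return NULL;
        return ip;
    }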
|
#
83771c50 |
| 11-Sep-2023 |
Darrick J. Wong <djwong@kernel.org> |
xfs: reload entire unlinked bucket lists
The previous patch to reload unrecovered unlinked inodes when adding a newly created inode to the unlinked list is missing a key piece of functionality. It doesn't handle the case that someone calls xfs_iget on an inode that is not the last item in the incore list. For example, if at mount time the ondisk iunlink bucket looks like this:
AGI -> 7 -> 22 -> 3 -> NULL
None of these three inodes are cached in memory. Now let's say that someone tries to open inode 3 by handle. We need to walk the list to make sure that inodes 7 and 22 get loaded cold, and that the i_prev_unlinked of inode 3 gets set to 22.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
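An illustrative walk of one bucket under the same assumptions (xfs_iunlink_lookup_or_load() is the hypothetical helper from the previous sketch; NULLAGINO terminates the on-disk chain):

    static int
    xfs_iunlink_reload_bucket(
        struct xfs_perag    *pag,
        xfs_agino_t         head_agino)
    {
        struct xfs_inode    *ip;
        xfs_agino_t         prev_agino = NULLAGINO;
        xfs_agino_t         next_agino = head_agino;

        while (next_agino != NULLAGINO) {
            ip = xfs_iunlink_lookup_or_load(pag, next_agino);
            if (!ip)
                return -EFSCORRUPTED;
            /*
             * For AGI -> 7 -> 22 -> 3, inode 3 ends up with
             * i_prev_unlinked == 22.
             */
            ip->i_prev_unlinked = prev_agino;
            prev_agino = next_agino;
            next_agino = ip->i_next_unlinked;
        }
        return 0;
    }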
|
Revision tags: v6.5.2, v6.1.51, v6.5.1, v6.1.50, v6.5, v6.1.49, v6.1.48 |
|
#
1d024e7a |
| 18-Aug-2023 |
Matthew Wilcox (Oracle) <willy@infradead.org> |
mm: remove enum page_entry_size
Remove the unnecessary encoding of page order into an enum and pass the page order directly. That lets us get rid of pe_order().
The switch constructs have to be changed to if/else constructs to prevent GCC from warning on builds with 3-level page tables where PMD_ORDER and PUD_ORDER have the same value.
If you are looking at this commit because your driver stopped compiling, look at the previous commit as well and audit your driver to be sure it doesn't depend on mmap_lock being held in its ->huge_fault method.
[willy@infradead.org: use "order %u" to match the (non dev_t) style] Link: https://lkml.kernel.org/r/ZOUYekbtTv+n8hYf@casper.infradead.org Link: https://lkml.kernel.org/r/20230818202335.2739663-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
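A self-contained illustration of the compiler problem (the constants are stand-ins; on real 3-level configs PMD_ORDER and PUD_ORDER collide the same way):

    #define PMD_ORDER   9
    #define PUD_ORDER   9   /* equal on 3-level page tables */

    static unsigned long fault_size(unsigned int order)
    {
        /*
         * A switch would need "case PMD_ORDER:" and "case PUD_ORDER:",
         * which GCC rejects as duplicate case values when the two
         * orders are equal. The if/else form compiles either way.
         */
        if (order == PMD_ORDER)
            return 1UL << 21;   /* PMD-sized fault */
        if (order == PUD_ORDER)
            return 1UL << 30;   /* PUD-sized fault */
        return 1UL << 12;       /* base page */
    }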
|
Revision tags: v6.1.46, v6.1.45 |
|
#
526aab5f |
| 10-Aug-2023 |
Darrick J. Wong <djwong@kernel.org> |
xfs: implement online scrubbing of rtsummary info
Finish the realtime summary scrubber by adding the functions we need to compute a fresh copy of the rtsummary info and comparing it to the copy on disk.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
|
Revision tags: v6.1.44, v6.1.43, v6.1.42, v6.1.41, v6.1.40, v6.1.39, v6.1.38, v6.1.37 |
|
#
f045dd00 |
| 29-Jun-2023 |
Darrick J. Wong <djwong@kernel.org> |
xfs: clean up the rtbitmap fsmap backend
The rtbitmap fsmap backend doesn't query the rmapbt, so it's wasteful to spend time initializing the rmap_irec objects. Worse yet, the logic to query the rtbitmap is spread across three separate functions, which is unnecessarily difficult to follow.
Compute the start rtextent that we want from keys[0] directly and combine the functions to avoid passing parameters around everywhere, and consolidate all the logic into a single function. At one point many years ago I intended to use __xfs_getfsmap_rtdev as the launching point for realtime rmapbt queries, but this hasn't been the case for a long time.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
|
Revision tags: v6.1.36, v6.4, v6.1.35, v6.1.34, v6.1.33, v6.1.32, v6.1.31, v6.1.30 |
|
#
54919f94 |
| 22-May-2023 |
David Howells <dhowells@redhat.com> |
xfs: Provide a splice-read wrapper
Provide a splice_read wrapper for XFS. This does a stat count and a shutdown check before proceeding, then emits a new trace line and locks the inode across the call to filemap_splice_read() and adds to the stats afterwards. Splicing from direct I/O or DAX is handled by the caller.
Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> cc: Al Viro <viro@zeniv.linux.org.uk> cc: Jens Axboe <axboe@kernel.dk> cc: Darrick J. Wong <djwong@kernel.org> cc: linux-xfs@vger.kernel.org cc: linux-fsdevel@vger.kernel.org cc: linux-block@vger.kernel.org cc: linux-mm@kvack.org Link: https://lore.kernel.org/r/20230522135018.2742245-25-dhowells@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
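The wrapper shape described above, sketched with assumed stat counter and trace names:

    static ssize_t
    xfs_file_splice_read(
        struct file             *in,
        loff_t                  *ppos,
        struct pipe_inode_info  *pipe,
        size_t                  len,
        unsigned int            flags)
    {
        struct xfs_inode        *ip = XFS_I(file_inode(in));
        struct xfs_mount        *mp = ip->i_mount;
        ssize_t                 ret;

        XFS_STATS_INC(mp, xs_read_calls);
        if (xfs_is_shutdown(mp))
            return -EIO;

        trace_xfs_file_splice_read(ip, *ppos, len);

        xfs_ilock(ip, XFS_IOLOCK_SHARED);
        ret = filemap_splice_read(in, ppos, pipe, len, flags);
        xfs_iunlock(ip, XFS_IOLOCK_SHARED);
        if (ret > 0)
            XFS_STATS_ADD(mp, xs_read_bytes, ret);
        return ret;
    }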
|
Revision tags: v6.1.29, v6.1.28, v6.1.27, v6.1.26, v6.3, v6.1.25, v6.1.24 |
|
#
d5c88131 |
| 11-Apr-2023 |
Darrick J. Wong <djwong@kernel.org> |
xfs: allow queued AG intents to drain before scrubbing
When a writer thread executes a chain of log intent items, the AG header buffer locks will cycle during a transaction roll to get from one intent item to the next in a chain. Although scrub takes all AG header buffer locks, this isn't sufficient to guard against scrub checking an AG while that writer thread is in the middle of finishing a chain because there's no higher level locking primitive guarding allocation groups.
When there's a collision, cross-referencing between data structures (e.g. rmapbt and refcountbt) yields false corruption events; if repair is running, this results in incorrect repairs, which is catastrophic.
Fix this by adding to the perag structure the count of active intents and make scrub wait until it has both AG header buffer locks and the intent counter reaches zero.
One quirk of the drain code is that deferred bmap updates also bump and drop the intent counter. A fundamental decision made during the design phase of the reverse mapping feature is that updates to the rmapbt records are always made by the same code that updates the primary metadata. In other words, callers of bmapi functions expect that the bmapi functions will queue deferred rmap updates.
Some parts of the reflink code queue deferred refcount (CUI) and bmap (BUI) updates in the same head transaction, but the deferred work manager completely finishes the CUI before the BUI work is started. As a result, the CUI drops the intent count long before the deferred rmap (RUI) update even has a chance to bump the intent count. The only way to keep the intent count elevated between the CUI and RUI is for the BUI to bump the counter until the RUI has been created.
A second quirk of the intent drain code is that deferred work items must increment the intent counter as soon as the work item is added to the transaction. When a BUI completes and queues an RUI, the RUI must increment the counter before the BUI decrements it. The only way to accomplish this is to require that the counter be bumped as soon as the deferred work item is created in memory.
In the next patches we'll improve on this facility, but this patch provides the basic functionality.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
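A hedged sketch of the drain primitive this describes (field and helper names are illustrative): intent holders bump the counter, and scrub waits for it to reach zero after taking the AG header buffer locks.

    struct xfs_defer_drain {
        atomic_t                dr_count;
        struct wait_queue_head  dr_waiters;
    };

    static inline void xfs_drain_bump(struct xfs_defer_drain *dr)
    {
        atomic_inc(&dr->dr_count);
    }

    static inline void xfs_drain_drop(struct xfs_defer_drain *dr)
    {
        if (atomic_dec_and_test(&dr->dr_count))
            wake_up(&dr->dr_waiters);
    }

    /* Scrub calls this with the AG header buffers already locked. */
    static inline int xfs_drain_wait(struct xfs_defer_drain *dr)
    {
        return wait_event_killable(dr->dr_waiters,
                atomic_read(&dr->dr_count) == 0);
    }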
|
#
9b2e5a23 |
| 11-Apr-2023 |
Darrick J. Wong <djwong@kernel.org> |
xfs: create traced helper to get extra perag references
There are a few places in the XFS codebase where a caller has either an active or a passive reference to a perag structure and wants to give a passive reference to some other piece of code. Btree cursor creation and inode walks are good examples of this. Replace the open-coded logic with a helper to do this.
The new function adds a few safeguards -- it checks that there's at least one reference to the perag structure passed in, and it records the refcount bump in the ftrace information. This makes it much easier to debug perag refcounting problems.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
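A sketch of what such a helper looks like (the name and tracepoint here are assumptions): assert that a reference already exists, record the bump in ftrace, then take the passive reference.

    static inline struct xfs_perag *
    xfs_perag_hold(
        struct xfs_perag    *pag)
    {
        ASSERT(atomic_read(&pag->pag_ref) > 0 ||
               atomic_read(&pag->pag_active_ref) > 0);

        trace_xfs_perag_hold(pag, _RET_IP_);
        atomic_inc(&pag->pag_ref);
        return pag;
    }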
|
Revision tags: v6.1.23, v6.1.22, v6.1.21, v6.1.20 |
|
#
e6fbb716 |
| 15-Mar-2023 |
Darrick J. Wong <djwong@kernel.org> |
xfs: add tracepoints for each of the externally visible allocators
There are now five separate space allocator interfaces exposed to the rest of XFS for five different strategies to find space. Add tracepoints for each of them so that I can tell from a trace dump exactly which ones got called and what happened underneath them. Add a sixth so it's more obvious if an allocation actually happened.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
|
Revision tags: v6.1.19, v6.1.18, v6.1.17, v6.1.16, v6.1.15, v6.1.14, v6.1.13, v6.2, v6.1.12 |
|
#
bd4f5d09 |
| 12-Feb-2023 |
Dave Chinner <dchinner@redhat.com> |
xfs: refactor the filestreams allocator pick functions
Now that the filestreams allocator is largely rewritten, restructure the main entry point and pick function to separate out the different operations cleanly. The MRU lookup function should not handle the start AG selection on MRU lookup failure, nor should the pick function handle building the association that is inserted into the MRU.
This leaves the filestreams allocator fairly clean and easy to understand, returning to the caller with an active perag reference and a target block to allocate at.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
|
#
571e2592 |
| 12-Feb-2023 |
Dave Chinner <dchinner@redhat.com> |
xfs: pass perag to filestreams tracing
Pass perags instead of raw ag numbers, avoiding the need for the special peek function for the tracing code.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
|
#
230e8fe8 |
| 12-Feb-2023 |
Dave Chinner <dchinner@redhat.com> |
xfs: fold xfs_alloc_ag_vextent() into callers
We don't need the multiplexing xfs_alloc_ag_vextent() provided anymore - we can just call the exact/near/size variants directly. This allows us to remove args->type completely and stop using args->fsbno as an input to the allocator algorithms.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
|
#
368e2d09 |
| 12-Feb-2023 |
Dave Chinner <dchinner@redhat.com> |
xfs: rework the perag trace points to be perag centric
So that they all output the same information in the traces to make debugging refcount issues easier.
This means that all the lookup/drop functions no longer need to use the full memory barrier atomic operations (atomic*_return()) so will have less overhead when tracing is off. The set/clear tag tracepoints no longer abuse the reference count to pass the tag - the tag being cleared is obvious from the _RET_IP_ that is recorded in the trace point.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
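An illustration of the overhead point (simplified; RCU protection of the radix tree lookup is omitted): passing atomic_inc_return() into the tracepoint forces a full-barrier atomic even when tracing is off, while a plain atomic_inc() plus a perag-centric tracepoint does not.

    static inline struct xfs_perag *
    perag_get_traced(struct xfs_mount *mp, xfs_agnumber_t agno)
    {
        struct xfs_perag *pag;

        pag = radix_tree_lookup(&mp->m_perag_tree, agno);
        if (!pag)
            return NULL;
        /*
         * Old style: trace_xfs_perag_get(mp, agno,
         *      atomic_inc_return(&pag->pag_ref), _RET_IP_);
         * pays for a full memory barrier on every call.
         */
        atomic_inc(&pag->pag_ref);
        trace_xfs_perag_get(pag, _RET_IP_);
        return pag;
    }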
|
#
c4d5660a |
| 12-Feb-2023 |
Dave Chinner <dchinner@redhat.com> |
xfs: active perag reference counting
We need to be able to dynamically remove instantiated AGs from memory safely, either for shrinking the filesystem or paging AG state in and out of memory (e.g. supporting millions of AGs). This means we need to be able to safely exclude operations from accessing perags while dynamic removal is in progress.
To do this, introduce the concept of active and passive references. Active references are required for high level operations that make use of an AG for a given operation (e.g. allocation) and pin the perag in memory for the duration of the operation that is operating on the perag (e.g. transaction scope). This means we can fail to get an active reference to an AG, hence callers of the new active reference API must be able to handle lookup failure gracefully.
Passive references are used in low level code, where we might need to access the perag structure for the purposes of completing high level operations. For example, buffers need to use passive references because:
- we need to be able to do metadata IO during operations like grow and shrink transactions where high level active references to the AG have already been blocked
- buffers need to pin the perag until they are reclaimed from memory, something that high level code has no direct control over.
- unused cached buffers should not prevent a shrink from being started.
Hence we have active references that will form exclusion barriers for operations to be performed on an AG, and passive references that will prevent reclaim of the perag until all objects with passive references have been reclaimed themselves.
This patch introduces xfs_perag_grab()/xfs_perag_rele() as the API for active AG reference functionality. We also need to convert the for_each_perag*() iterators to use active references, which will start the process of converting high level code over to using active references. Conversion of non-iterator based code to active references will be done in followup patches.
Note that the implementation using reference counting is really just a development vehicle for the API to ensure we don't have any leaks in the callers. Once we need to remove perag structures from memory dynamically, we will need a much more robust per-ag state transition mechanism for preventing new references from being taken while we wait for existing references to drain before removal from memory can occur....
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
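A sketch of the grab/rele pair under stated assumptions (the waiter hook and field names are illustrative): atomic_inc_not_zero() expresses "take an active reference only if the AG is still alive", which is why callers must handle failure.

    static inline struct xfs_perag *
    xfs_perag_grab(struct xfs_mount *mp, xfs_agnumber_t agno)
    {
        struct xfs_perag    *pag;

        rcu_read_lock();
        pag = radix_tree_lookup(&mp->m_perag_tree, agno);
        if (pag && !atomic_inc_not_zero(&pag->pag_active_ref))
            pag = NULL;     /* AG teardown in progress: fail the grab */
        rcu_read_unlock();
        return pag;
    }

    static inline void
    xfs_perag_rele(struct xfs_perag *pag)
    {
        if (atomic_dec_and_test(&pag->pag_active_ref))
            wake_up(&pag->pag_active_wq);   /* assumed waiter hook */
    }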
|
#
692b6cdd |
| 10-Feb-2023 |
Dave Chinner <dchinner@redhat.com> |
xfs: t_firstblock is tracking AGs not blocks
The tp->t_firstblock field is now really tracking the highest AG we have locked, not the block number of the highest allocation we've made. Its purpose is to prevent AGF locking deadlocks, so rename it to "highest AG" and simplify the implementation to just track the agno rather than a fsbno.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
|
#
1dd0510f |
| 10-Feb-2023 |
Dave Chinner <dchinner@redhat.com> |
xfs: fix low space alloc deadlock
I've recently encountered an ABBA deadlock with g/476. The upcoming changes seem to make this much easier to hit, but the underlying problem is a pre-existing one.
Essentially, if we select an AG for allocation, then lock the AGF and then fail to allocate for some reason (e.g. minimum length requirements cannot be satisfied), then we drop out of the allocation with the AGF still locked.
The caller then modifies the allocation constraints - usually loosening them up - and tries again. This can result in trying to access AGFs that are lower than the AGF we already have locked from the failed attempt. e.g. the failed attempt skipped several AGs before failing, so we have locked an AG higher than the start AG. Retrying the allocation from the start AG then causes us to violate AGF lock ordering and this can lead to deadlocks.
The deadlock exists even if allocation succeeds - we can do followup allocations in the same transaction for BMBT blocks that aren't guaranteed to be in the same AG as the original, and can move into higher AGs. Hence we really need to move the tp->t_firstblock tracking down into xfs_alloc_vextent() where it can be set when we exit with a locked AG.
xfs_alloc_vextent() can also check there if the requested allocation falls within the allowed range of AGs set by tp->t_firstblock. If we can't allocate within the range set, we have to fail the allocation. If we are allowed to do non-blocking AGF locking, we can ignore the AG locking order limitations as we can use try-locks for the first iteration over the requested AG range.
This invalidates a set of post allocation asserts that check that the allocation is always above tp->t_firstblock if it is set. Because we can use try-locks to avoid the deadlock in some circumstances, having a pre-existing locked AGF doesn't always prevent allocation from lower order AGFs. Hence those ASSERTs need to be removed.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
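The ordering rule reduces to a sketch like this (the signature and the t_highest_agno field name, per the later rename, are assumptions here):

    static int
    xfs_agf_lock_flags(struct xfs_trans *tp, xfs_agnumber_t agno)
    {
        /*
         * Taking a lower-numbered AGF while holding a higher one
         * inverts AGF lock order, so such attempts may only trylock;
         * a busy AGF gets skipped rather than waited on.
         */
        if (tp->t_highest_agno != NULLAGNUMBER &&
            agno < tp->t_highest_agno)
            return XFS_ALLOC_FLAG_TRYLOCK;
        return 0;
    }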
|
Revision tags: v6.1.11, v6.1.10 |
|
#
0b11553e |
| 01-Feb-2023 |
Darrick J. Wong <djwong@kernel.org> |
xfs: pass refcount intent directly through the log intent code
Pass the incore refcount intent through the CUI logging code instead of repeatedly boxing and unboxing parameters.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
Revision tags: v6.1.9, v6.1.8, v6.1.7, v6.1.6, v6.1.5, v6.0.19, v6.0.18, v6.1.4, v6.1.3, v6.0.17, v6.1.2, v6.0.16, v6.1.1, v6.0.15, v6.0.14, v6.0.13, v6.1, v6.0.12, v6.0.11 |
|
#
254e3459 |
| 28-Nov-2022 |
Darrick J. Wong <djwong@kernel.org> |
xfs: add debug knob to slow down write for fun
Add a new error injection knob so that we can arbitrarily slow down pagecache writes to test for race conditions and aberrant reclaim behavior if the writeback mechanisms are slow to issue writeback. This will enable functional testing for the ifork sequence counters introduced in commit 304a68b9c63b ("xfs: use iomap_valid method to detect stale cached iomaps") that fixes write racing with reclaim writeback.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
|
#
c2beff99 |
| 28-Nov-2022 |
Darrick J. Wong <djwong@kernel.org> |
xfs: add debug knob to slow down writeback for fun
Add a new error injection knob so that we can arbitrarily slow down writeback to test for race conditions and aberrant reclaim behavior if the writeback mechanisms are slow to issue writeback. This will enable functional testing for the ifork sequence counters introduced in commit 745b3f76d1c8 ("xfs: maintain a sequence count for inode fork manipulations").
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
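A hedged sketch of how such a knob typically hooks into the writeback path (the tag name and placement are assumptions, not the patch's actual code):

    static void
    xfs_maybe_delay_writeback(struct xfs_mount *mp)
    {
        /* Armed via the errortag sysfs interface; ~100ms stall. */
        if (XFS_TEST_ERROR(false, mp, XFS_ERRTAG_WB_DELAY_MS))
            msleep(100);
    }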
|
Revision tags: v6.0.10, v5.15.80, v6.0.9, v5.15.79, v6.0.8, v5.15.78, v6.0.7, v5.15.77, v5.15.76, v6.0.6 |
|
#
571423a1 |
| 26-Oct-2022 |
Darrick J. Wong <djwong@kernel.org> |
xfs: report refcount domain in tracepoints
Now that we've broken out the startblock and shared/cow domain in the incore refcount extent record structure, update the tracepoints to report the domain.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
|
Revision tags: v6.0.5, v5.15.75, v6.0.4, v6.0.3, v6.0.2, v5.15.74, v5.15.73, v6.0.1, v5.15.72, v6.0, v5.15.71, v5.15.70, v5.15.69 |
|
#
8838dafe |
| 18-Sep-2022 |
Zeng Heng <zengheng4@huawei.com> |
xfs: missing space in xfs trace log
Adding a space between arguments helps someone locate the key words they want, so break quoted strings at a space character.
Such as below: [Before] kworker/1:0-280 [001] ..... 600.782135: xfs_bunmap: dev 7:0 ino 0x85 disize 0x0 fileoff 0x0 fsbcount 0x400000001fffffflags ATTRFORK ...
[After] kworker/1:2-564 [001] ..... 23817.906160: xfs_bunmap: dev 7:0 ino 0x85 disize 0x0 fileoff 0x0 fsbcount 0x400000001fffff flags ATTRFORK ...
Signed-off-by: Zeng Heng <zengheng4@huawei.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <david@fromorbit.com>
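The bug class reduces to C string-literal concatenation; an illustration (not the exact XFS macro body):

    /*
     * Before: concatenates to "...fsbcount 0x%llxflags %s", so the
     * printed value and the word "flags" run together in the trace.
     */
    #define FMT_BAD     "ino 0x%llx fsbcount 0x%llx" \
                        "flags %s"

    /* After: break the literal at a space character. */
    #define FMT_GOOD    "ino 0x%llx fsbcount 0x%llx " \
                        "flags %s"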
|
Revision tags: v5.15.68, v5.15.67, v5.15.66, v5.15.65, v5.15.64, v5.15.63, v5.15.62, v5.15.61, v5.15.60, v5.15.59, v5.19, v5.15.58, v5.15.57, v5.15.56, v5.15.55 |
|
#
a83d5a8b |
| 13-Jul-2022 |
Dave Chinner <dchinner@redhat.com> |
xfs: introduce xfs_iunlink_lookup
When an inode is on an unlinked list during normal operation, it is guaranteed to be pinned in memory as it is either referenced by the current unlink operation or it has an open file descriptor that references it and has it pinned in memory. Hence to look up an inode on the unlinked list, we can do a direct inode cache lookup and always expect the lookup to succeed.
Add a function to do this lookup based on the agino that we use to link the chain of unlinked inodes together so we can begin converting the unlinked list manipulations to use in-memory inodes rather than inode cluster buffers and remove the backref cache.
Use this lookup function to replace the on-disk inode buffer walk when removing inodes from the unlinked list with an in-core inode unlinked list walk.
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
|
Revision tags: v5.15.54 |
|
#
c01147d9 |
| 09-Jul-2022 |
Darrick J. Wong <djwong@kernel.org> |
xfs: replace inode fork size macros with functions
Replace the shouty macros here with typechecked helper functions.
Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
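The conversion pattern, illustrated with a simplified stand-in (not the exact XFS definitions):

    struct xfs_inode {
        unsigned char   i_forkoff;  /* fork offset, in 8-byte units */
    };

    /* Before: shouty macro, no typechecking of "ip". */
    #define XFS_IFORK_BOFF(ip)  ((int)((ip)->i_forkoff << 3))

    /* After: the compiler now enforces the argument type. */
    static inline int
    xfs_inode_fork_boff(struct xfs_inode *ip)
    {
        return ip->i_forkoff << 3;
    }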
|
Revision tags: v5.15.53, v5.15.52, v5.15.51, v5.15.50, v5.15.49 |
|
#
5e672cd6 |
| 16-Jun-2022 |
Dave Chinner <dchinner@redhat.com> |
xfs: introduce xfs_inodegc_push()
The current blocking mechanism for pushing the inodegc queue out to disk can result in systems becoming unusable when there is a long running inodegc operation. This is because the statfs() implementation currently issues a blocking flush of the inodegc queue and a significant number of common system utilities will call statfs() to discover something about the underlying filesystem.
This can result in userspace operations getting stuck on inodegc progress, and when trying to remove a heavily reflinked file on slow storage with a full journal, this can result in delays measuring in hours.
Avoid this problem by adding a "push" function that expedites the flushing of the inodegc queue, but doesn't wait for it to complete.
Convert xfs_fs_statfs() and xfs_qm_scall_getquota() to use this mechanism so they don't block but still ensure that queued operations are expedited.
Fixes: ab23a7768739 ("xfs: per-cpu deferred inode inactivation queues") Reported-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Dave Chinner <dchinner@redhat.com> [djwong: fix _getquota_next to use _inodegc_push too] Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
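The push/flush split, sketched with assumed workqueue and trace names: push expedites queued work without waiting, flush pushes and then waits.

    void
    xfs_inodegc_push(struct xfs_mount *mp)
    {
        if (!xfs_is_inodegc_enabled(mp))
            return;
        trace_xfs_inodegc_push(mp, __return_address);
        xfs_inodegc_queue_all(mp);  /* kick per-cpu workers, no wait */
    }

    void
    xfs_inodegc_flush(struct xfs_mount *mp)
    {
        xfs_inodegc_push(mp);
        trace_xfs_inodegc_flush(mp, __return_address);
        flush_workqueue(mp->m_inodegc_wq);  /* assumed wq field name */
    }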
|
Revision tags: v5.15.48, v5.15.47, v5.15.46, v5.15.45, v5.15.44, v5.15.43, v5.15.42, v5.18, v5.15.41, v5.15.40, v5.15.39 |
|
#
fdaf1bb3 |
| 12-May-2022 |
Dave Chinner <dchinner@redhat.com> |
xfs: ATTR_REPLACE algorithm with LARP enabled needs rework
We can't use the same algorithm for replacing an existing attribute when logging attributes. The existing algorithm is essentially:
1. create new attr w/ INCOMPLETE
2. atomically flip INCOMPLETE flags between old + new attribute
3. remove old attr which is marked w/ INCOMPLETE
This algorithm guarantees that we see either the old or new attribute, and if we fail after the atomic flag flip, we don't have to recover the removal of the old attr because we never see INCOMPLETE attributes in lookups.
For logged attributes, however, this does not work. The logged attribute intents do not track the work that has been done as the transaction rolls, and hence the only recovery mechanism we have is "run the replace operation from scratch".
This is further exacerbated by the attempt to avoid needing the INCOMPLETE flag to create an atomic swap. This means we can create a second active attribute of the same name before we remove the original. If we fail at any point after the create but before the removal has completed, we end up with duplicate attributes in the attr btree and recovery only tries to replace one of them.
There are several other failure modes where we can leave partially allocated remote attributes that expose stale data, partially free remote attributes that enable UAF based stale data exposure, etc.
To fix this, we need a different algorithm for replace operations when LARP is enabled. Luckily, it's not that complex if we take the right first step. That is, the first thing we log is the attri intent with the new name/value pair and mark the old attr as INCOMPLETE in the same transaction.
From there, we then remove the old attr and keep relogging the new name/value in the intent, such that we always know that we have to create the new attr in recovery. Once the old attr is removed, we then run a normal ATTR_CREATE operation relogging the intent as we go. If the new attr is local, then it gets created in a single atomic transaction that also logs the final intent done. If the new attr is remote, then we set INCOMPLETE on the new attr while we allocate and set the remote value, and then we clear the INCOMPLETE flag in the last transaction that logs the final intent done.
If we fail at any point in this algorithm, log recovery will always see the same state on disk: the new name/value in the intent, and either an INCOMPLETE attr or no attr in the attr btree. If we find an INCOMPLETE attr, we run the full replace starting with removing the INCOMPLETE attr. If we don't find it, then we simply create the new attr.
Notably, recovery of a failed create that has an INCOMPLETE flag set is now the same - we start with the lookup of the INCOMPLETE attr, and if that exists then we do the full replace recovery process, otherwise we just create the new attr.
Hence changing the way we do the replace operation when LARP is enabled allows us to use the same log recovery algorithm for both the ATTR_CREATE and ATTR_REPLACE operations. This is also the same algorithm we use for runtime ATTR_REPLACE operations (except for the step setting up the initial conditions).
The result is that:
- ATTR_CREATE uses the same algorithm regardless of whether LARP is enabled or not
- ATTR_REPLACE with larp=0 is identical to the old algorithm
- ATTR_REPLACE with larp=1 runs an unmodified attr removal algorithm from the larp=0 code and then runs the unmodified ATTR_CREATE code.
- log recovery when larp=1 runs the same ATTR_REPLACE algorithm as it uses at runtime.
Because the state machine is now quite clean, changing the algorithm is really just a case of changing the initial state and how the states link together for the ATTR_REPLACE case. Hence it's not a huge amount of code for what is a fairly substantial rework of the attr logging and recovery algorithm....
Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Dave Chinner <david@fromorbit.com>
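The replace sequence above condenses to an ordered set of states; the enum below is an illustrative restatement, not the kernel's actual state machine:

    enum larp_replace_state {
        LR_LOG_NEW_NVPAIR,      /* log attri w/ new name/value,
                                 * mark old attr INCOMPLETE */
        LR_REMOVE_OLD_ATTR,     /* remove old attr, relogging the
                                 * intent as we go */
        LR_CREATE_NEW_ATTR,     /* normal ATTR_CREATE path */
        LR_CLEAR_INCOMPLETE,    /* remote value written: clear
                                 * INCOMPLETE, log final intent done */
    };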
|