1.. SPDX-License-Identifier: GPL-2.0
2.. _xfs_online_fsck_design:
3
4..
5        Mapping of heading styles within this document:
6        Heading 1 uses "====" above and below
7        Heading 2 uses "===="
8        Heading 3 uses "----"
9        Heading 4 uses "````"
10        Heading 5 uses "^^^^"
11        Heading 6 uses "~~~~"
12        Heading 7 uses "...."
13
14        Sections are manually numbered because apparently that's what everyone
15        does in the kernel.
16
17======================
18XFS Online Fsck Design
19======================
20
21This document captures the design of the online filesystem check feature for
22XFS.
23The purpose of this document is threefold:
24
25- To help kernel distributors understand exactly what the XFS online fsck
26  feature is, and issues about which they should be aware.
27
28- To help people reading the code to familiarize themselves with the relevant
29  concepts and design points before they start digging into the code.
30
31- To help developers maintaining the system by capturing the reasons
32  supporting higher level decision making.
33
34As the online fsck code is merged, the links in this document to topic branches
35will be replaced with links to code.
36
37This document is licensed under the terms of the GNU Public License, v2.
38The primary author is Darrick J. Wong.
39
40This design document is split into seven parts.
41Part 1 defines what fsck tools are and the motivations for writing a new one.
Parts 2 and 3 present a high level overview of how the online fsck process
works and how it is tested to ensure correct functionality.
44Part 4 discusses the user interface and the intended usage modes of the new
45program.
46Parts 5 and 6 show off the high level components and how they fit together, and
47then present case studies of how each repair function actually works.
48Part 7 sums up what has been discussed so far and speculates about what else
49might be built atop online fsck.
50
51.. contents:: Table of Contents
52   :local:
53
541. What is a Filesystem Check?
55==============================
56
57A Unix filesystem has four main responsibilities:
58
59- Provide a hierarchy of names through which application programs can associate
60  arbitrary blobs of data for any length of time,
61
- Virtualize physical storage media across those names,

- Retrieve the named data blobs at any time, and

- Examine resource usage.
67
68Metadata directly supporting these functions (e.g. files, directories, space
69mappings) are sometimes called primary metadata.
70Secondary metadata (e.g. reverse mapping and directory parent pointers) support
71operations internal to the filesystem, such as internal consistency checking
72and reorganization.
73Summary metadata, as the name implies, condense information contained in
74primary metadata for performance reasons.
75
76The filesystem check (fsck) tool examines all the metadata in a filesystem
77to look for errors.
78In addition to looking for obvious metadata corruptions, fsck also
79cross-references different types of metadata records with each other to look
80for inconsistencies.
People do not like losing data, so most fsck tools also contain some ability
to correct any problems found.
83As a word of caution -- the primary goal of most Linux fsck tools is to restore
84the filesystem metadata to a consistent state, not to maximize the data
85recovered.
86That precedent will not be challenged here.
87
Filesystems of the 20th century generally lacked any redundancy in the ondisk
format, which meant that fsck could only respond to errors by erasing files
until errors were no longer detected.
More recent filesystem designs contain enough redundancy in their metadata that
it is now possible to regenerate data structures when non-catastrophic errors
occur; this capability aids both strategies -- restoring metadata consistency
and maximizing the amount of data recovered.
94
95+--------------------------------------------------------------------------+
96| **Note**:                                                                |
97+--------------------------------------------------------------------------+
98| System administrators avoid data loss by increasing the number of        |
99| separate storage systems through the creation of backups; and they avoid |
100| downtime by increasing the redundancy of each storage system through the |
101| creation of RAID arrays.                                                 |
102| fsck tools address only the first problem.                               |
103+--------------------------------------------------------------------------+
104
105TLDR; Show Me the Code!
106-----------------------
107
108Code is posted to the kernel.org git trees as follows:
109`kernel changes <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-symlink>`_,
110`userspace changes <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-media-scan-service>`_, and
111`QA test changes <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=repair-dirs>`_.
112Each kernel patchset adding an online repair function will use the same branch
113name across the kernel, xfsprogs, and fstests git repos.
114
115Existing Tools
116--------------
117
118The online fsck tool described here will be the third tool in the history of
119XFS (on Linux) to check and repair filesystems.
120Two programs precede it:
121
122The first program, ``xfs_check``, was created as part of the XFS debugger
123(``xfs_db``) and can only be used with unmounted filesystems.
It walks all metadata in the filesystem looking for inconsistencies, though it
lacks any ability to repair what it finds.
126Due to its high memory requirements and inability to repair things, this
127program is now deprecated and will not be discussed further.
128
129The second program, ``xfs_repair``, was created to be faster and more robust
130than the first program.
131Like its predecessor, it can only be used with unmounted filesystems.
132It uses extent-based in-memory data structures to reduce memory consumption,
133and tries to schedule readahead IO appropriately to reduce I/O waiting time
134while it scans the metadata of the entire filesystem.
The most important feature of this tool is its ability to respond to
inconsistencies in file metadata and the directory tree by erasing things as
needed to eliminate problems.
138Space usage metadata are rebuilt from the observed file metadata.
139
140Problem Statement
141-----------------
142
143The current XFS tools leave several problems unsolved:
144
1451. **User programs** suddenly **lose access** to the filesystem when unexpected
146   shutdowns occur as a result of silent corruptions in the metadata.
147   These occur **unpredictably** and often without warning.
148
1492. **Users** experience a **total loss of service** during the recovery period
150   after an **unexpected shutdown** occurs.
151
1523. **Users** experience a **total loss of service** if the filesystem is taken
153   offline to **look for problems** proactively.
154
1554. **Data owners** cannot **check the integrity** of their stored data without
156   reading all of it.
157   This may expose them to substantial billing costs when a linear media scan
158   performed by the storage system administrator might suffice.
159
1605. **System administrators** cannot **schedule** a maintenance window to deal
161   with corruptions if they **lack the means** to assess filesystem health
162   while the filesystem is online.
163
1646. **Fleet monitoring tools** cannot **automate periodic checks** of filesystem
165   health when doing so requires **manual intervention** and downtime.
166
1677. **Users** can be tricked into **doing things they do not desire** when
168   malicious actors **exploit quirks of Unicode** to place misleading names
169   in directories.
170
171Given this definition of the problems to be solved and the actors who would
172benefit, the proposed solution is a third fsck tool that acts on a running
173filesystem.
174
175This new third program has three components: an in-kernel facility to check
176metadata, an in-kernel facility to repair metadata, and a userspace driver
177program to drive fsck activity on a live filesystem.
178``xfs_scrub`` is the name of the driver program.
179The rest of this document presents the goals and use cases of the new fsck
180tool, describes its major design points in connection to those goals, and
181discusses the similarities and differences with existing tools.
182
183+--------------------------------------------------------------------------+
184| **Note**:                                                                |
185+--------------------------------------------------------------------------+
186| Throughout this document, the existing offline fsck tool can also be     |
187| referred to by its current name "``xfs_repair``".                        |
188| The userspace driver program for the new online fsck tool can be         |
189| referred to as "``xfs_scrub``".                                          |
190| The kernel portion of online fsck that validates metadata is called      |
191| "online scrub", and portion of the kernel that fixes metadata is called  |
192| "online repair".                                                         |
193+--------------------------------------------------------------------------+
194
The naming hierarchy is broken up into objects known as directories and files,
and the physical space is split into pieces known as allocation groups.
197Sharding enables better performance on highly parallel systems and helps to
198contain the damage when corruptions occur.
199The division of the filesystem into principal objects (allocation groups and
200inodes) means that there are ample opportunities to perform targeted checks and
201repairs on a subset of the filesystem.
202
While checks and repairs are underway on one subset of the filesystem, other
parts continue processing IO requests.
204Even if a piece of filesystem metadata can only be regenerated by scanning the
205entire system, the scan can still be done in the background while other file
206operations continue.
207
208In summary, online fsck takes advantage of resource sharding and redundant
209metadata to enable targeted checking and repair operations while the system
210is running.
211This capability will be coupled to automatic system management so that
212autonomous self-healing of XFS maximizes service availability.
213
2142. Theory of Operation
215======================
216
217Because it is necessary for online fsck to lock and scan live metadata objects,
218online fsck consists of three separate code components.
219The first is the userspace driver program ``xfs_scrub``, which is responsible
220for identifying individual metadata items, scheduling work items for them,
221reacting to the outcomes appropriately, and reporting results to the system
222administrator.
223The second and third are in the kernel, which implements functions to check
224and repair each type of online fsck work item.
225
226+------------------------------------------------------------------+
227| **Note**:                                                        |
228+------------------------------------------------------------------+
229| For brevity, this document shortens the phrase "online fsck work |
230| item" to "scrub item".                                           |
231+------------------------------------------------------------------+
232
233Scrub item types are delineated in a manner consistent with the Unix design
234philosophy, which is to say that each item should handle one aspect of a
235metadata structure, and handle it well.
236
237Scope
238-----
239
240In principle, online fsck should be able to check and to repair everything that
241the offline fsck program can handle.
242However, online fsck cannot be running 100% of the time, which means that
243latent errors may creep in after a scrub completes.
244If these errors cause the next mount to fail, offline fsck is the only
245solution.
246This limitation means that maintenance of the offline fsck tool will continue.
247A second limitation of online fsck is that it must follow the same resource
248sharing and lock acquisition rules as the regular filesystem.
249This means that scrub cannot take *any* shortcuts to save time, because doing
250so could lead to concurrency problems.
In other words, online fsck is not a complete replacement for offline fsck, and
a complete run of online fsck may take longer than a run of offline fsck.
253However, both of these limitations are acceptable tradeoffs to satisfy the
254different motivations of online fsck, which are to **minimize system downtime**
255and to **increase predictability of operation**.
256
257.. _scrubphases:
258
259Phases of Work
260--------------
261
262The userspace driver program ``xfs_scrub`` splits the work of checking and
263repairing an entire filesystem into seven phases.
264Each phase concentrates on checking specific types of scrub items and depends
265on the success of all previous phases.
266The seven phases are as follows:
267
2681. Collect geometry information about the mounted filesystem and computer,
269   discover the online fsck capabilities of the kernel, and open the
270   underlying storage devices.
271
2722. Check allocation group metadata, all realtime volume metadata, and all quota
273   files.
274   Each metadata structure is scheduled as a separate scrub item.
275   If corruption is found in the inode header or inode btree and ``xfs_scrub``
276   is permitted to perform repairs, then those scrub items are repaired to
277   prepare for phase 3.
278   Repairs are implemented by using the information in the scrub item to
279   resubmit the kernel scrub call with the repair flag enabled; this is
280   discussed in the next section.
281   Optimizations and all other repairs are deferred to phase 4.
282
2833. Check all metadata of every file in the filesystem.
284   Each metadata structure is also scheduled as a separate scrub item.
285   If repairs are needed and ``xfs_scrub`` is permitted to perform repairs,
286   and there were no problems detected during phase 2, then those scrub items
287   are repaired immediately.
288   Optimizations, deferred repairs, and unsuccessful repairs are deferred to
289   phase 4.
290
2914. All remaining repairs and scheduled optimizations are performed during this
292   phase, if the caller permits them.
293   Before starting repairs, the summary counters are checked and any necessary
294   repairs are performed so that subsequent repairs will not fail the resource
295   reservation step due to wildly incorrect summary counters.
   Unsuccessful repairs are requeued as long as forward progress on repairs is
   made somewhere in the filesystem.
298   Free space in the filesystem is trimmed at the end of phase 4 if the
299   filesystem is clean.
300
3015. By the start of this phase, all primary and secondary filesystem metadata
302   must be correct.
303   Summary counters such as the free space counts and quota resource counts
304   are checked and corrected.
305   Directory entry names and extended attribute names are checked for
306   suspicious entries such as control characters or confusing Unicode sequences
307   appearing in names.
308
3096. If the caller asks for a media scan, read all allocated and written data
310   file extents in the filesystem.
   The ability to use hardware-assisted data file integrity checking is new
   to online fsck; neither of the previous tools has this capability.
313   If media errors occur, they will be mapped to the owning files and reported.
314
7. Re-check the summary counters and present the caller with a summary of
   space usage and file counts.
317
318Steps for Each Scrub Item
319-------------------------
320
321The kernel scrub code uses a three-step strategy for checking and repairing
322the one aspect of a metadata object represented by a scrub item:
323
3241. The scrub item of interest is checked for corruptions; opportunities for
325   optimization; and for values that are directly controlled by the system
326   administrator but look suspicious.
   If the item is not corrupt or does not need optimization, resources are
   released and the positive scan results are returned to userspace.
329   If the item is corrupt or could be optimized but the caller does not permit
330   this, resources are released and the negative scan results are returned to
331   userspace.
332   Otherwise, the kernel moves on to the second step.
333
3342. The repair function is called to rebuild the data structure.
   Repair functions generally choose to rebuild a structure from other metadata
   rather than try to salvage the existing structure.
337   If the repair fails, the scan results from the first step are returned to
338   userspace.
339   Otherwise, the kernel moves on to the third step.
340
3413. In the third step, the kernel runs the same checks over the new metadata
342   item to assess the efficacy of the repairs.
343   The results of the reassessment are returned to userspace.
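
The check, repair, and re-check steps map onto the ``XFS_IOC_SCRUB_METADATA``
ioctl that ``xfs_scrub`` issues against a file descriptor opened on the
filesystem (typically the mount point).
The fragment below is a minimal sketch of that cycle, assuming the uapi
definitions provided by the xfsprogs development headers; the real driver
program performs far more elaborate error handling and scheduling.

.. code-block:: c

        #include <string.h>
        #include <sys/ioctl.h>
        #include <xfs/xfs.h>
        #include <xfs/xfs_fs.h>

        /* Returns 0 if the scrub item is clean (possibly after repair). */
        static int scrub_one_item(int fd, unsigned int type, unsigned int agno,
                                  int allow_repair)
        {
                struct xfs_scrub_metadata sm;

                memset(&sm, 0, sizeof(sm));
                sm.sm_type = type;              /* e.g. XFS_SCRUB_TYPE_INOBT */
                sm.sm_agno = agno;

                /* Step 1: check only; the kernel sets OFLAG bits in sm_flags. */
                if (ioctl(fd, XFS_IOC_SCRUB_METADATA, &sm))
                        return -1;
                if (!(sm.sm_flags & (XFS_SCRUB_OFLAG_CORRUPT |
                                     XFS_SCRUB_OFLAG_XCORRUPT |
                                     XFS_SCRUB_OFLAG_PREEN)))
                        return 0;               /* nothing to do */
                if (!allow_repair)
                        return -1;              /* caller forbids repairs */

                /*
                 * Steps 2 and 3: resubmit with the repair flag set; the kernel
                 * rebuilds the structure and then re-checks it before returning.
                 */
                sm.sm_flags = XFS_SCRUB_IFLAG_REPAIR;
                if (ioctl(fd, XFS_IOC_SCRUB_METADATA, &sm))
                        return -1;
                return (sm.sm_flags & XFS_SCRUB_OFLAG_CORRUPT) ? -1 : 0;
        }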
344
345Classification of Metadata
346--------------------------
347
348Each type of metadata object (and therefore each type of scrub item) is
349classified as follows:
350
351Primary Metadata
352````````````````
353
Metadata structures in this category should be most familiar to filesystem
users either because they are directly created by the user or because they
index objects created by the user.
357Most filesystem objects fall into this class:
358
359- Free space and reference count information
360
361- Inode records and indexes
362
363- Storage mapping information for file data
364
365- Directories
366
367- Extended attributes
368
369- Symbolic links
370
371- Quota limits
372
373Scrub obeys the same rules as regular filesystem accesses for resource and lock
374acquisition.
375
376Primary metadata objects are the simplest for scrub to process.
377The principal filesystem object (either an allocation group or an inode) that
378owns the item being scrubbed is locked to guard against concurrent updates.
379The check function examines every record associated with the type for obvious
380errors and cross-references healthy records against other metadata to look for
381inconsistencies.
382Repairs for this class of scrub item are simple, since the repair function
383starts by holding all the resources acquired in the previous step.
384The repair function scans available metadata as needed to record all the
385observations needed to complete the structure.
386Next, it stages the observations in a new ondisk structure and commits it
387atomically to complete the repair.
Finally, the storage used by the old data structure is carefully reaped.
389
390Because ``xfs_scrub`` locks a primary object for the duration of the repair,
391this is effectively an offline repair operation performed on a subset of the
392filesystem.
393This minimizes the complexity of the repair code because it is not necessary to
394handle concurrent updates from other threads, nor is it necessary to access
395any other part of the filesystem.
396As a result, indexed structures can be rebuilt very quickly, and programs
397trying to access the damaged structure will be blocked until repairs complete.
The only infrastructure needed by the repair code is the staging area for
observations and a means to write new structures to disk.
400Despite these limitations, the advantage that online repair holds is clear:
401targeted work on individual shards of the filesystem avoids total loss of
402service.
403
404This mechanism is described in section 2.1 ("Off-Line Algorithm") of
405V. Srinivasan and M. J. Carey, `"Performance of On-Line Index Construction
406Algorithms" <https://minds.wisconsin.edu/bitstream/handle/1793/59524/TR1047.pdf>`_,
407*Extending Database Technology*, pp. 293-309, 1992.
408
409Most primary metadata repair functions stage their intermediate results in an
410in-memory array prior to formatting the new ondisk structure, which is very
411similar to the list-based algorithm discussed in section 2.3 ("List-Based
412Algorithms") of Srinivasan.
413However, any data structure builder that maintains a resource lock for the
414duration of the repair is *always* an offline algorithm.
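
To recap, the overall flow of a primary metadata repair can be sketched as
follows; every helper named here is a placeholder for one of the stages
described above, not an actual kernel function.

.. code-block:: c

        /* Outline of a primary metadata repair; helper names are placeholders. */
        int repair_primary_structure(struct scrub_context *sc)
        {
                int error;

                /* Locks and resources are already held from the check step. */
                error = gather_observations(sc);    /* scan other metadata */
                if (error)
                        return error;
                error = build_and_commit_new_structure(sc); /* stage, atomic swap */
                if (error)
                        return error;
                return reap_old_blocks(sc);         /* free the old structure */
        }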
415
416Secondary Metadata
417``````````````````
418
419Metadata structures in this category reflect records found in primary metadata,
420but are only needed for online fsck or for reorganization of the filesystem.
421
422Secondary metadata include:
423
424- Reverse mapping information
425
426- Directory parent pointers
427
428This class of metadata is difficult for scrub to process because scrub attaches
429to the secondary object but needs to check primary metadata, which runs counter
430to the usual order of resource acquisition.
Frequently, this means that full filesystem scans are necessary to rebuild the
metadata.
433Check functions can be limited in scope to reduce runtime.
434Repairs, however, require a full scan of primary metadata, which can take a
435long time to complete.
436Under these conditions, ``xfs_scrub`` cannot lock resources for the entire
437duration of the repair.
438
439Instead, repair functions set up an in-memory staging structure to store
440observations.
441Depending on the requirements of the specific repair function, the staging
442index will either have the same format as the ondisk structure or a design
443specific to that repair function.
444The next step is to release all locks and start the filesystem scan.
445When the repair scanner needs to record an observation, the staging data are
446locked long enough to apply the update.
447While the filesystem scan is in progress, the repair function hooks the
448filesystem so that it can apply pending filesystem updates to the staging
449information.
450Once the scan is done, the owning object is re-locked, the live data is used to
451write a new ondisk structure, and the repairs are committed atomically.
The hooks are disabled and the staging area is freed.
Finally, the storage used by the old data structure is carefully reaped.
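
Expressed as an outline, the live-scan repair strategy looks roughly like the
following; as with the earlier sketch, every helper named here is a placeholder
for a step described above rather than a real kernel interface.

.. code-block:: c

        /* Outline of a secondary metadata repair; helper names are placeholders. */
        int repair_secondary_structure(struct scrub_context *sc)
        {
                int error;

                setup_staging_index(sc);      /* in-memory shadow of the new index */
                register_live_hooks(sc);      /* observe concurrent updates */
                unlock_target_object(sc);

                /* Long-running scan; hook events also update the staging index. */
                error = scan_primary_metadata(sc);

                relock_target_object(sc);
                if (!error)
                        error = commit_staged_index(sc);    /* atomic swap */
                unregister_live_hooks(sc);
                teardown_staging_index(sc);
                if (!error)
                        error = reap_old_blocks(sc);
                return error;
        }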
454
455Introducing concurrency helps online repair avoid various locking problems, but
456comes at a high cost to code complexity.
457Live filesystem code has to be hooked so that the repair function can observe
458updates in progress.
459The staging area has to become a fully functional parallel structure so that
460updates can be merged from the hooks.
461Finally, the hook, the filesystem scan, and the inode locking model must be
462sufficiently well integrated that a hook event can decide if a given update
463should be applied to the staging structure.
464
465In theory, the scrub implementation could apply these same techniques for
466primary metadata, but doing so would make it massively more complex and less
467performant.
468Programs attempting to access the damaged structures are not blocked from
469operation, which may cause application failure or an unplanned filesystem
470shutdown.
471
472Inspiration for the secondary metadata repair strategy was drawn from section
2.4 of Srinivasan above, and sections 2 ("NSF: Index Build Without Side-File")
474and 3.1.1 ("Duplicate Key Insert Problem") in C. Mohan, `"Algorithms for
475Creating Indexes for Very Large Tables Without Quiescing Updates"
476<https://dl.acm.org/doi/10.1145/130283.130337>`_, 1992.
477
478The sidecar index mentioned above bears some resemblance to the side file
479method mentioned in Srinivasan and Mohan.
480Their method consists of an index builder that extracts relevant record data to
481build the new structure as quickly as possible; and an auxiliary structure that
482captures all updates that would be committed to the index by other threads were
483the new index already online.
484After the index building scan finishes, the updates recorded in the side file
485are applied to the new index.
486To avoid conflicts between the index builder and other writer threads, the
487builder maintains a publicly visible cursor that tracks the progress of the
488scan through the record space.
489To avoid duplication of work between the side file and the index builder, side
490file updates are elided when the record ID for the update is greater than the
491cursor position within the record ID space.
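
The elision rule can be stated compactly in code.
In this sketch, which uses purely illustrative names, a concurrent writer
consults the builder's published cursor to decide whether its update needs to
be recorded in the side file.

.. code-block:: c

        /*
         * Illustrative side-file elision check.  scan_cursor is the highest
         * record ID that the index builder has already copied into the new
         * index.
         */
        void writer_side_file_hook(struct index_builder *ib, uint64_t record_id,
                                   const struct index_update *update)
        {
                if (record_id > ib->scan_cursor) {
                        /*
                         * The builder has not reached this record yet and will
                         * pick up the new value during its scan, so the side
                         * file entry can be elided.
                         */
                        return;
                }
                /* Already scanned: log the update for replay after the scan. */
                side_file_append(ib->side_file, update);
        }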
492
493To minimize changes to the rest of the codebase, XFS online repair keeps the
494replacement index hidden until it's completely ready to go.
495In other words, there is no attempt to expose the keyspace of the new index
496while repair is running.
497The complexity of such an approach would be very high and perhaps more
498appropriate to building *new* indices.
499
500**Future Work Question**: Can the full scan and live update code used to
501facilitate a repair also be used to implement a comprehensive check?
502
503*Answer*: In theory, yes.  Check would be much stronger if each scrub function
504employed these live scans to build a shadow copy of the metadata and then
505compared the shadow records to the ondisk records.
However, doing that is a fair amount more work than what the checking functions
do now, and the live scans and hooks were developed much later.
Employing live scans for checking would also increase the runtime of those
scrub functions.
510
511Summary Information
512```````````````````
513
514Metadata structures in this last category summarize the contents of primary
515metadata records.
516These are often used to speed up resource usage queries, and are many times
517smaller than the primary metadata which they represent.
518
519Examples of summary information include:
520
521- Summary counts of free space and inodes
522
523- File link counts from directories
524
525- Quota resource usage counts
526
527Check and repair require full filesystem scans, but resource and lock
528acquisition follow the same paths as regular filesystem accesses.
529
530The superblock summary counters have special requirements due to the underlying
531implementation of the incore counters, and will be treated separately.
532Check and repair of the other types of summary counters (quota resource counts
533and file link counts) employ the same filesystem scanning and hooking
534techniques as outlined above, but because the underlying data are sets of
535integer counters, the staging data need not be a fully functional mirror of the
536ondisk structure.
537
Inspiration for quota and file link count repair strategies was drawn from
sections 2.12 ("Online Index Operations") through 2.14 ("Incremental View
Maintenance") of G. Graefe, `"Concurrent Queries and Updates in Summary Views
541and Their Indexes"
542<http://www.odbms.org/wp-content/uploads/2014/06/Increment-locks.pdf>`_, 2011.
543
544Since quotas are non-negative integer counts of resource usage, online
545quotacheck can use the incremental view deltas described in section 2.14 to
546track pending changes to the block and inode usage counts in each transaction,
547and commit those changes to a dquot side file when the transaction commits.
548Delta tracking is necessary for dquots because the index builder scans inodes,
549whereas the data structure being rebuilt is an index of dquots.
550Link count checking combines the view deltas and commit step into one because
551it sets attributes of the objects being scanned instead of writing them to a
552separate data structure.
553Each online fsck function will be discussed as case studies later in this
554document.
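
A minimal sketch of the delta-tracking idea for online quotacheck might look
like the following; the structure and helpers are hypothetical and exist only
to show how per-transaction deltas are folded into a dquot side file at commit
time.

.. code-block:: c

        /* Hypothetical illustration of incremental view deltas for quotacheck. */
        struct quotacheck_delta {
                uint32_t        q_id;         /* user/group/project quota id */
                int64_t         d_blocks;     /* signed change to block usage */
                int64_t         d_inodes;     /* signed change to inode usage */
        };

        /* Called from a transaction commit hook while the inode scan runs. */
        void quotacheck_commit_hook(struct quotacheck *qc,
                                    const struct quotacheck_delta *delta)
        {
                struct shadow_dquot *sdq = shadow_dquot_get(qc, delta->q_id);

                /* Fold this transaction's deltas into the side-file copy. */
                sdq->blocks += delta->d_blocks;
                sdq->inodes += delta->d_inodes;
                shadow_dquot_put(sdq);
        }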
555
556Risk Management
557---------------
558
559During the development of online fsck, several risk factors were identified
560that may make the feature unsuitable for certain distributors and users.
561Steps can be taken to mitigate or eliminate those risks, though at a cost to
562functionality.
563
564- **Decreased performance**: Adding metadata indices to the filesystem
565  increases the time cost of persisting changes to disk, and the reverse space
566  mapping and directory parent pointers are no exception.
567  System administrators who require the maximum performance can disable the
568  reverse mapping features at format time, though this choice dramatically
569  reduces the ability of online fsck to find inconsistencies and repair them.
570
571- **Incorrect repairs**: As with all software, there might be defects in the
572  software that result in incorrect repairs being written to the filesystem.
573  Systematic fuzz testing (detailed in the next section) is employed by the
574  authors to find bugs early, but it might not catch everything.
575  The kernel build system provides Kconfig options (``CONFIG_XFS_ONLINE_SCRUB``
576  and ``CONFIG_XFS_ONLINE_REPAIR``) to enable distributors to choose not to
577  accept this risk.
578  The xfsprogs build system has a configure option (``--enable-scrub=no``) that
579  disables building of the ``xfs_scrub`` binary, though this is not a risk
580  mitigation if the kernel functionality remains enabled.
581
582- **Inability to repair**: Sometimes, a filesystem is too badly damaged to be
583  repairable.
584  If the keyspaces of several metadata indices overlap in some manner but a
585  coherent narrative cannot be formed from records collected, then the repair
586  fails.
587  To reduce the chance that a repair will fail with a dirty transaction and
588  render the filesystem unusable, the online repair functions have been
589  designed to stage and validate all new records before committing the new
590  structure.
591
592- **Misbehavior**: Online fsck requires many privileges -- raw IO to block
593  devices, opening files by handle, ignoring Unix discretionary access control,
594  and the ability to perform administrative changes.
595  Running this automatically in the background scares people, so the systemd
596  background service is configured to run with only the privileges required.
597  Obviously, this cannot address certain problems like the kernel crashing or
598  deadlocking, but it should be sufficient to prevent the scrub process from
599  escaping and reconfiguring the system.
600  The cron job does not have this protection.
601
602- **Fuzz Kiddiez**: There are many people now who seem to think that running
  automated fuzz testing of ondisk artifacts to find mischievous behavior and
604  spraying exploit code onto the public mailing list for instant zero-day
605  disclosure is somehow of some social benefit.
606  In the view of this author, the benefit is realized only when the fuzz
607  operators help to **fix** the flaws, but this opinion apparently is not
608  widely shared among security "researchers".
609  The XFS maintainers' continuing ability to manage these events presents an
610  ongoing risk to the stability of the development process.
611  Automated testing should front-load some of the risk while the feature is
612  considered EXPERIMENTAL.
613
614Many of these risks are inherent to software programming.
615Despite this, it is hoped that this new functionality will prove useful in
616reducing unexpected downtime.
617
6183. Testing Plan
619===============
620
621As stated before, fsck tools have three main goals:
622
6231. Detect inconsistencies in the metadata;
624
6252. Eliminate those inconsistencies; and
626
6273. Minimize further loss of data.
628
629Demonstrations of correct operation are necessary to build users' confidence
630that the software behaves within expectations.
631Unfortunately, it was not really feasible to perform regular exhaustive testing
632of every aspect of a fsck tool until the introduction of low-cost virtual
633machines with high-IOPS storage.
634With ample hardware availability in mind, the testing strategy for the online
635fsck project involves differential analysis against the existing fsck tools and
636systematic testing of every attribute of every type of metadata object.
637Testing can be split into four major categories, as discussed below.
638
639Integrated Testing with fstests
640-------------------------------
641
642The primary goal of any free software QA effort is to make testing as
643inexpensive and widespread as possible to maximize the scaling advantages of
644community.
645In other words, testing should maximize the breadth of filesystem configuration
646scenarios and hardware setups.
647This improves code quality by enabling the authors of online fsck to find and
648fix bugs early, and helps developers of new features to find integration
649issues earlier in their development effort.
650
651The Linux filesystem community shares a common QA testing suite,
652`fstests <https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/>`_, for
653functional and regression testing.
654Even before development work began on online fsck, fstests (when run on XFS)
655would run both the ``xfs_check`` and ``xfs_repair -n`` commands on the test and
656scratch filesystems between each test.
657This provides a level of assurance that the kernel and the fsck tools stay in
658alignment about what constitutes consistent metadata.
659During development of the online checking code, fstests was modified to run
660``xfs_scrub -n`` between each test to ensure that the new checking code
661produces the same results as the two existing fsck tools.
662
663To start development of online repair, fstests was modified to run
664``xfs_repair`` to rebuild the filesystem's metadata indices between tests.
665This ensures that offline repair does not crash, leave a corrupt filesystem
after it exits, or trigger complaints from the online check.
667This also established a baseline for what can and cannot be repaired offline.
668To complete the first phase of development of online repair, fstests was
669modified to be able to run ``xfs_scrub`` in a "force rebuild" mode.
670This enables a comparison of the effectiveness of online repair as compared to
671the existing offline repair tools.
672
673General Fuzz Testing of Metadata Blocks
674---------------------------------------
675
676XFS benefits greatly from having a very robust debugging tool, ``xfs_db``.
677
678Before development of online fsck even began, a set of fstests were created
679to test the rather common fault that entire metadata blocks get corrupted.
680This required the creation of fstests library code that can create a filesystem
681containing every possible type of metadata object.
682Next, individual test cases were created to create a test filesystem, identify
683a single block of a specific type of metadata object, trash it with the
684existing ``blocktrash`` command in ``xfs_db``, and test the reaction of a
685particular metadata validation strategy.
686
687This earlier test suite enabled XFS developers to test the ability of the
688in-kernel validation functions and the ability of the offline fsck tool to
689detect and eliminate the inconsistent metadata.
690This part of the test suite was extended to cover online fsck in exactly the
691same manner.
692
693In other words, for a given fstests filesystem configuration:
694
695* For each metadata object existing on the filesystem:
696
697  * Write garbage to it
698
699  * Test the reactions of:
700
701    1. The kernel verifiers to stop obviously bad metadata
702    2. Offline repair (``xfs_repair``) to detect and fix
703    3. Online repair (``xfs_scrub``) to detect and fix
704
705Targeted Fuzz Testing of Metadata Records
706-----------------------------------------
707
708The testing plan for online fsck includes extending the existing fs testing
709infrastructure to provide a much more powerful facility: targeted fuzz testing
710of every metadata field of every metadata object in the filesystem.
711``xfs_db`` can modify every field of every metadata structure in every
712block in the filesystem to simulate the effects of memory corruption and
713software bugs.
714Given that fstests already contains the ability to create a filesystem
715containing every metadata format known to the filesystem, ``xfs_db`` can be
716used to perform exhaustive fuzz testing!
717
718For a given fstests filesystem configuration:
719
720* For each metadata object existing on the filesystem...
721
722  * For each record inside that metadata object...
723
724    * For each field inside that record...
725
726      * For each conceivable type of transformation that can be applied to a bit field...
727
728        1. Clear all bits
729        2. Set all bits
730        3. Toggle the most significant bit
731        4. Toggle the middle bit
732        5. Toggle the least significant bit
733        6. Add a small quantity
734        7. Subtract a small quantity
735        8. Randomize the contents
736
737        * ...test the reactions of:
738
739          1. The kernel verifiers to stop obviously bad metadata
740          2. Offline checking (``xfs_repair -n``)
741          3. Offline repair (``xfs_repair``)
742          4. Online checking (``xfs_scrub -n``)
743          5. Online repair (``xfs_scrub``)
744          6. Both repair tools (``xfs_scrub`` and then ``xfs_repair`` if online repair doesn't succeed)
745
746This is quite the combinatoric explosion!
747
748Fortunately, having this much test coverage makes it easy for XFS developers to
749check the responses of XFS' fsck tools.
750Since the introduction of the fuzz testing framework, these tests have been
751used to discover incorrect repair code and missing functionality for entire
752classes of metadata objects in ``xfs_repair``.
753The enhanced testing was used to finalize the deprecation of ``xfs_check`` by
754confirming that ``xfs_repair`` could detect at least as many corruptions as
755the older tool.
756
757These tests have been very valuable for ``xfs_scrub`` in the same ways -- they
758allow the online fsck developers to compare online fsck against offline fsck,
759and they enable XFS developers to find deficiencies in the code base.
760
761Proposed patchsets include
762`general fuzzer improvements
763<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=fuzzer-improvements>`_,
764`fuzzing baselines
765<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=fuzz-baseline>`_,
766and `improvements in fuzz testing comprehensiveness
767<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=more-fuzz-testing>`_.
768
769Stress Testing
770--------------
771
A unique requirement of online fsck is the ability to operate on a filesystem
773concurrently with regular workloads.
774Although it is of course impossible to run ``xfs_scrub`` with *zero* observable
775impact on the running system, the online repair code should never introduce
776inconsistencies into the filesystem metadata, and regular workloads should
777never notice resource starvation.
778To verify that these conditions are being met, fstests has been enhanced in
779the following ways:
780
781* For each scrub item type, create a test to exercise checking that item type
782  while running ``fsstress``.
783* For each scrub item type, create a test to exercise repairing that item type
784  while running ``fsstress``.
785* Race ``fsstress`` and ``xfs_scrub -n`` to ensure that checking the whole
786  filesystem doesn't cause problems.
787* Race ``fsstress`` and ``xfs_scrub`` in force-rebuild mode to ensure that
788  force-repairing the whole filesystem doesn't cause problems.
789* Race ``xfs_scrub`` in check and force-repair mode against ``fsstress`` while
790  freezing and thawing the filesystem.
791* Race ``xfs_scrub`` in check and force-repair mode against ``fsstress`` while
792  remounting the filesystem read-only and read-write.
793* The same, but running ``fsx`` instead of ``fsstress``.  (Not done yet?)
794
795Success is defined by the ability to run all of these tests without observing
796any unexpected filesystem shutdowns due to corrupted metadata, kernel hang
797check warnings, or any other sort of mischief.
798
799Proposed patchsets include `general stress testing
800<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=race-scrub-and-mount-state-changes>`_
801and the `evolution of existing per-function stress testing
802<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=refactor-scrub-stress>`_.
803
8044. User Interface
805=================
806
807The primary user of online fsck is the system administrator, just like offline
808repair.
809Online fsck presents two modes of operation to administrators:
810A foreground CLI process for online fsck on demand, and a background service
811that performs autonomous checking and repair.
812
813Checking on Demand
814------------------
815
816For administrators who want the absolute freshest information about the
817metadata in a filesystem, ``xfs_scrub`` can be run as a foreground process on
818a command line.
819The program checks every piece of metadata in the filesystem while the
820administrator waits for the results to be reported, just like the existing
821``xfs_repair`` tool.
822Both tools share a ``-n`` option to perform a read-only scan, and a ``-v``
823option to increase the verbosity of the information reported.
824
825A new feature of ``xfs_scrub`` is the ``-x`` option, which employs the error
826correction capabilities of the hardware to check data file contents.
827The media scan is not enabled by default because it may dramatically increase
828program runtime and consume a lot of bandwidth on older storage hardware.
829
830The output of a foreground invocation is captured in the system log.
831
832The ``xfs_scrub_all`` program walks the list of mounted filesystems and
833initiates ``xfs_scrub`` for each of them in parallel.
834It serializes scans for any filesystems that resolve to the same top level
835kernel block device to prevent resource overconsumption.
836
837Background Service
838------------------
839
840To reduce the workload of system administrators, the ``xfs_scrub`` package
841provides a suite of `systemd <https://systemd.io/>`_ timers and services that
842run online fsck automatically on weekends by default.
843The background service configures scrub to run with as little privilege as
844possible, the lowest CPU and IO priority, and in a CPU-constrained single
845threaded mode.
846This can be tuned by the systemd administrator at any time to suit the latency
847and throughput requirements of customer workloads.
848
849The output of the background service is also captured in the system log.
850If desired, reports of failures (either due to inconsistencies or mere runtime
851errors) can be emailed automatically by setting the ``EMAIL_ADDR`` environment
852variable in the following service files:
853
854* ``xfs_scrub_fail@.service``
855* ``xfs_scrub_media_fail@.service``
856* ``xfs_scrub_all_fail.service``
857
858The decision to enable the background scan is left to the system administrator.
859This can be done by enabling either of the following services:
860
861* ``xfs_scrub_all.timer`` on systemd systems
862* ``xfs_scrub_all.cron`` on non-systemd systems
863
864This automatic weekly scan is configured out of the box to perform an
865additional media scan of all file data once per month.
866This is less foolproof than, say, storing file data block checksums, but much
867more performant if application software provides its own integrity checking,
868redundancy can be provided elsewhere above the filesystem, or the storage
869device's integrity guarantees are deemed sufficient.
870
871The systemd unit file definitions have been subjected to a security audit
872(as of systemd 249) to ensure that the xfs_scrub processes have as little
873access to the rest of the system as possible.
This was performed via ``systemd-analyze security``, after which privileges
were restricted to the minimum required, sandboxing was set up to the maximal
extent possible with system call filtering, and access to the filesystem tree
was restricted to the minimum needed to start the program and access the
filesystem being scanned.
879The service definition files restrict CPU usage to 80% of one CPU core, and
880apply as nice of a priority to IO and CPU scheduling as possible.
881This measure was taken to minimize delays in the rest of the filesystem.
882No such hardening has been performed for the cron job.
883
884Proposed patchset:
885`Enabling the xfs_scrub background service
886<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-media-scan-service>`_.
887
888Health Reporting
889----------------
890
891XFS caches a summary of each filesystem's health status in memory.
892The information is updated whenever ``xfs_scrub`` is run, or whenever
893inconsistencies are detected in the filesystem metadata during regular
894operations.
System administrators should use the ``health`` command of ``xfs_spaceman`` to
retrieve this information in a human-readable format.
897If problems have been observed, the administrator can schedule a reduced
898service window to run the online repair tool to correct the problem.
899Failing that, the administrator can decide to schedule a maintenance window to
900run the traditional offline repair tool to correct the problem.
901
902**Future Work Question**: Should the health reporting integrate with the new
903inotify fs error notification system?
904Would it be helpful for sysadmins to have a daemon to listen for corruption
905notifications and initiate a repair?
906
907*Answer*: These questions remain unanswered, but should be a part of the
908conversation with early adopters and potential downstream users of XFS.
909
910Proposed patchsets include
911`wiring up health reports to correction returns
912<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=corruption-health-reports>`_
913and
914`preservation of sickness info during memory reclaim
915<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=indirect-health-reporting>`_.
916
9175. Kernel Algorithms and Data Structures
918========================================
919
920This section discusses the key algorithms and data structures of the kernel
921code that provide the ability to check and repair metadata while the system
922is running.
923The first chapters in this section reveal the pieces that provide the
924foundation for checking metadata.
925The remainder of this section presents the mechanisms through which XFS
926regenerates itself.
927
928Self Describing Metadata
929------------------------
930
931Starting with XFS version 5 in 2012, XFS updated the format of nearly every
932ondisk block header to record a magic number, a checksum, a universally
933"unique" identifier (UUID), an owner code, the ondisk address of the block,
934and a log sequence number.
935When loading a block buffer from disk, the magic number, UUID, owner, and
936ondisk address confirm that the retrieved block matches the specific owner of
937the current filesystem, and that the information contained in the block is
938supposed to be found at the ondisk address.
939The first three components enable checking tools to disregard alleged metadata
940that doesn't belong to the filesystem, and the fourth component enables the
941filesystem to detect lost writes.
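
The self-describing fields can be pictured with the composite header below.
This is an illustration of the information recorded, not the actual layout of
any particular ondisk structure; each structure embeds these fields in its own
header format.

.. code-block:: c

        /* Illustrative only -- the self-describing fields in v5 block headers. */
        struct xfs_v5_header_example {
                __be32          magic;  /* structure type identifier */
                __be32          crc;    /* checksum of the block contents */
                uuid_t          uuid;   /* filesystem UUID */
                __be64          owner;  /* AG or inode that owns this block */
                __be64          blkno;  /* ondisk address of this block */
                __be64          lsn;    /* last log sequence number to touch it */
        };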
942
943Whenever a file system operation modifies a block, the change is submitted
944to the log as part of a transaction.
The log then processes these transactions, marking them done once they are
safely persisted to storage.
947The logging code maintains the checksum and the log sequence number of the last
948transactional update.
949Checksums are useful for detecting torn writes and other discrepancies that can
950be introduced between the computer and its storage devices.
951Sequence number tracking enables log recovery to avoid applying out of date
952log updates to the filesystem.
953
954These two features improve overall runtime resiliency by providing a means for
955the filesystem to detect obvious corruption when reading metadata blocks from
956disk, but these buffer verifiers cannot provide any consistency checking
957between metadata structures.
958
For more information, please see
Documentation/filesystems/xfs-self-describing-metadata.rst.
961
962Reverse Mapping
963---------------
964
965The original design of XFS (circa 1993) is an improvement upon 1980s Unix
966filesystem design.
967In those days, storage density was expensive, CPU time was scarce, and
968excessive seek time could kill performance.
969For performance reasons, filesystem authors were reluctant to add redundancy to
970the filesystem, even at the cost of data integrity.
Filesystem designers in the early 21st century chose different strategies to
increase internal redundancy -- either storing nearly identical copies of
metadata, or using more space-efficient encoding techniques.
974
975For XFS, a different redundancy strategy was chosen to modernize the design:
976a secondary space usage index that maps allocated disk extents back to their
977owners.
978By adding a new index, the filesystem retains most of its ability to scale
979well to heavily threaded workloads involving large datasets, since the primary
980file metadata (the directory tree, the file block map, and the allocation
981groups) remain unchanged.
982Like any system that improves redundancy, the reverse-mapping feature increases
983overhead costs for space mapping activities.
984However, it has two critical advantages: first, the reverse index is key to
985enabling online fsck and other requested functionality such as free space
986defragmentation, better media failure reporting, and filesystem shrinking.
987Second, the different ondisk storage format of the reverse mapping btree
988defeats device-level deduplication because the filesystem requires real
989redundancy.
990
991+--------------------------------------------------------------------------+
992| **Sidebar**:                                                             |
993+--------------------------------------------------------------------------+
994| A criticism of adding the secondary index is that it does nothing to     |
995| improve the robustness of user data storage itself.                      |
996| This is a valid point, but adding a new index for file data block        |
997| checksums increases write amplification by turning data overwrites into  |
998| copy-writes, which age the filesystem prematurely.                       |
999| In keeping with thirty years of precedent, users who want file data      |
1000| integrity can supply as powerful a solution as they require.             |
1001| As for metadata, the complexity of adding a new secondary index of space |
1002| usage is much less than adding volume management and storage device      |
1003| mirroring to XFS itself.                                                 |
1004| Perfection of RAID and volume management are best left to existing       |
1005| layers in the kernel.                                                    |
1006+--------------------------------------------------------------------------+
1007
1008The information captured in a reverse space mapping record is as follows:
1009
1010.. code-block:: c
1011
1012	struct xfs_rmap_irec {
1013	    xfs_agblock_t    rm_startblock;   /* extent start block */
1014	    xfs_extlen_t     rm_blockcount;   /* extent length */
1015	    uint64_t         rm_owner;        /* extent owner */
1016	    uint64_t         rm_offset;       /* offset within the owner */
1017	    unsigned int     rm_flags;        /* state flags */
1018	};
1019
1020The first two fields capture the location and size of the physical space,
1021in units of filesystem blocks.
The owner field tells scrub which metadata structure or file inode has been
assigned this space.
1024For space allocated to files, the offset field tells scrub where the space was
1025mapped within the file fork.
1026Finally, the flags field provides extra information about the space usage --
1027is this an attribute fork extent?  A file mapping btree extent?  Or an
1028unwritten data extent?
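
The flags are bit values stored in ``rm_flags``.
The definitions below reflect the author's reading of the current flag names
and should be confirmed against ``fs/xfs/libxfs/xfs_rmap.h``.

.. code-block:: c

        /* Per-record reverse mapping flags. */
        #define XFS_RMAP_ATTR_FORK      (1 << 0)  /* attr fork mapping */
        #define XFS_RMAP_BMBT_BLOCK     (1 << 1)  /* file mapping btree block */
        #define XFS_RMAP_UNWRITTEN      (1 << 2)  /* unwritten data extent */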
1029
1030Online filesystem checking judges the consistency of each primary metadata
1031record by comparing its information against all other space indices.
1032The reverse mapping index plays a key role in the consistency checking process
1033because it contains a centralized alternate copy of all space allocation
1034information.
1035Program runtime and ease of resource acquisition are the only real limits to
1036what online checking can consult.
1037For example, a file data extent mapping can be checked against:
1038
1039* The absence of an entry in the free space information.
1040* The absence of an entry in the inode index.
1041* The absence of an entry in the reference count data if the file is not
1042  marked as having shared extents.
1043* The correspondence of an entry in the reverse mapping information.
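
As a sketch, with purely illustrative helper names, the cross-referencing of a
single file data mapping might proceed as follows.

.. code-block:: c

        /* Illustrative cross-reference of one file data mapping. */
        void xref_data_extent(struct scrub_context *sc, const struct file_mapping *m)
        {
                /* The extent must not appear in the free space btrees. */
                if (extent_is_free_space(sc, m->startblock, m->blockcount))
                        mark_xcorrupt(sc);

                /* The extent must not overlap any inode chunks. */
                if (extent_overlaps_inode_chunks(sc, m->startblock, m->blockcount))
                        mark_xcorrupt(sc);

                /* Unshared files must have no refcount records for the extent. */
                if (!file_has_shared_extents(sc) &&
                    extent_is_shared(sc, m->startblock, m->blockcount))
                        mark_xcorrupt(sc);

                /* A reverse mapping owned by this file must cover the extent. */
                if (!rmap_record_exists(sc, m->startblock, m->blockcount, m->owner))
                        mark_xcorrupt(sc);
        }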
1044
1045There are several observations to make about reverse mapping indices:
1046
10471. Reverse mappings can provide a positive affirmation of correctness if any of
1048   the above primary metadata are in doubt.
1049   The checking code for most primary metadata follows a path similar to the
1050   one outlined above.
1051
10522. Proving the consistency of secondary metadata with the primary metadata is
1053   difficult because that requires a full scan of all primary space metadata,
1054   which is very time intensive.
1055   For example, checking a reverse mapping record for a file extent mapping
1056   btree block requires locking the file and searching the entire btree to
1057   confirm the block.
1058   Instead, scrub relies on rigorous cross-referencing during the primary space
1059   mapping structure checks.
1060
10613. Consistency scans must use non-blocking lock acquisition primitives if the
1062   required locking order is not the same order used by regular filesystem
1063   operations.
1064   For example, if the filesystem normally takes a file ILOCK before taking
1065   the AGF buffer lock but scrub wants to take a file ILOCK while holding
1066   an AGF buffer lock, scrub cannot block on that second acquisition.
1067   This means that forward progress during this part of a scan of the reverse
1068   mapping data cannot be guaranteed if system load is heavy.
1069
1070In summary, reverse mappings play a key role in reconstruction of primary
1071metadata.
1072The details of how these records are staged, written to disk, and committed
1073into the filesystem are covered in subsequent sections.
1074
1075Checking and Cross-Referencing
1076------------------------------
1077
1078The first step of checking a metadata structure is to examine every record
1079contained within the structure and its relationship with the rest of the
1080system.
1081XFS contains multiple layers of checking to try to prevent inconsistent
1082metadata from wreaking havoc on the system.
Each of these layers contributes information that helps the kernel to make the
following decisions about the health of a metadata structure:
1085
1086- Is a part of this structure obviously corrupt (``XFS_SCRUB_OFLAG_CORRUPT``) ?
1087- Is this structure inconsistent with the rest of the system
1088  (``XFS_SCRUB_OFLAG_XCORRUPT``) ?
1089- Is there so much damage around the filesystem that cross-referencing is not
1090  possible (``XFS_SCRUB_OFLAG_XFAIL``) ?
1091- Can the structure be optimized to improve performance or reduce the size of
1092  metadata (``XFS_SCRUB_OFLAG_PREEN``) ?
1093- Does the structure contain data that is not inconsistent but deserves review
1094  by the system administrator (``XFS_SCRUB_OFLAG_WARNING``) ?
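
These outcomes are reported to userspace as bits set in the ``sm_flags`` field
of ``struct xfs_scrub_metadata``.
The values shown below reflect the author's reading of the uapi header and
should be verified against ``fs/xfs/libxfs/xfs_fs.h``.

.. code-block:: c

        /* Output flags set by the kernel in sm_flags. */
        #define XFS_SCRUB_OFLAG_CORRUPT         (1u << 1)  /* obvious corruption */
        #define XFS_SCRUB_OFLAG_PREEN           (1u << 2)  /* could be optimized */
        #define XFS_SCRUB_OFLAG_XFAIL           (1u << 3)  /* cross-ref not possible */
        #define XFS_SCRUB_OFLAG_XCORRUPT        (1u << 4)  /* cross-ref disagreement */
        #define XFS_SCRUB_OFLAG_WARNING         (1u << 6)  /* needs admin review */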
1095
1096The following sections describe how the metadata scrubbing process works.
1097
1098Metadata Buffer Verification
1099````````````````````````````
1100
The lowest layer of metadata protection in XFS is the set of metadata verifiers
built into the buffer cache.
1103These functions perform inexpensive internal consistency checking of the block
1104itself, and answer these questions:
1105
1106- Does the block belong to this filesystem?
1107
1108- Does the block belong to the structure that asked for the read?
1109  This assumes that metadata blocks only have one owner, which is always true
1110  in XFS.
1111
1112- Is the type of data stored in the block within a reasonable range of what
1113  scrub is expecting?
1114
1115- Does the physical location of the block match the location it was read from?
1116
1117- Does the block checksum match the data?
1118
The scope of the protections here is very limited -- verifiers can only
1120establish that the filesystem code is reasonably free of gross corruption bugs
1121and that the storage system is reasonably competent at retrieval.
1122Corruption problems observed at runtime cause the generation of health reports,
1123failed system calls, and in the extreme case, filesystem shutdowns if the
1124corrupt metadata force the cancellation of a dirty transaction.
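
Verifiers are attached to buffers through a ``struct xfs_buf_ops`` object.
The following sketch shows the general shape of such a verifier pair; the
``xfs_foo`` names are placeholders, and the function bodies merely describe
the checks that a real verifier would perform.

.. code-block:: c

    /* Illustrative only: "foo" stands in for a real ondisk block type. */
    static void xfs_foo_read_verify(struct xfs_buf *bp)
    {
            /*
             * Check the magic number, filesystem UUID, block address, and
             * checksum; mark the buffer corrupt if anything looks wrong.
             */
    }

    static void xfs_foo_write_verify(struct xfs_buf *bp)
    {
            /* Recompute the checksum just before the block goes to disk. */
    }

    const struct xfs_buf_ops xfs_foo_buf_ops = {
            .name           = "xfs_foo",
            .verify_read    = xfs_foo_read_verify,
            .verify_write   = xfs_foo_write_verify,
    };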
1125
1126Every online fsck scrubbing function is expected to read every ondisk metadata
1127block of a structure in the course of checking the structure.
1128Corruption problems observed during a check are immediately reported to
1129userspace as corruption; during a cross-reference, they are reported as a
1130failure to cross-reference once the full examination is complete.
1131Reads satisfied by a buffer already in cache (and hence already verified)
1132bypass these checks.
1133
1134Internal Consistency Checks
1135```````````````````````````
1136
1137After the buffer cache, the next level of metadata protection is the internal
1138record verification code built into the filesystem.
1139These checks are split between the buffer verifiers, the in-filesystem users of
1140the buffer cache, and the scrub code itself, depending on the amount of higher
1141level context required.
1142The scope of checking is still internal to the block.
1143These higher level checking functions answer these questions:
1144
1145- Does the type of data stored in the block match what scrub is expecting?
1146
1147- Does the block belong to the owning structure that asked for the read?
1148
1149- If the block contains records, do the records fit within the block?
1150
1151- If the block tracks internal free space information, is it consistent with
1152  the record areas?
1153
1154- Are the records contained inside the block free of obvious corruptions?
1155
1156Record checks in this category are more rigorous and more time-intensive.
1157For example, block pointers and inumbers are checked to ensure that they point
1158within the dynamically allocated parts of an allocation group and within
1159the filesystem.
1160Names are checked for invalid characters, and flags are checked for invalid
1161combinations.
1162Other record attributes are checked for sensible values.
1163Btree records spanning an interval of the btree keyspace are checked for
1164correct order and lack of mergeability (except for file fork mappings).
1165For performance reasons, regular code may skip some of these checks unless
1166debugging is enabled or a write is about to occur.
1167Scrub functions, of course, must check all possible problems.
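
As a concrete illustration, a pointer check of the sort described above might
look like the following sketch.
The function and its ``ag_length`` parameter are hypothetical, though the
static header macro comes from the ondisk format headers.

.. code-block:: c

    /*
     * Hypothetical sketch: is this AG block number inside the dynamically
     * allocated part of its allocation group?
     */
    static bool xfs_foo_agbno_ok(struct xfs_mount *mp, xfs_agblock_t agbno,
                                 xfs_agblock_t ag_length)
    {
            /* Static AG headers (SB, AGF, AGFL, AGI) are never free-floating. */
            if (agbno <= XFS_AGFL_BLOCK(mp))
                    return false;
            /* The pointer must not run off the end of the AG. */
            if (agbno >= ag_length)
                    return false;
            return true;
    }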
1168
1169Validation of Userspace-Controlled Record Attributes
1170````````````````````````````````````````````````````
1171
1172Various pieces of filesystem metadata are directly controlled by userspace.
1173Because of this nature, validation work cannot be more precise than checking
1174that a value is within the possible range.
1175These fields include:
1176
1177- Superblock fields controlled by mount options
1178- Filesystem labels
1179- File timestamps
1180- File permissions
1181- File size
1182- File flags
1183- Names present in directory entries, extended attribute keys, and filesystem
1184  labels
1185- Extended attribute key namespaces
1186- Extended attribute values
1187- File data block contents
1188- Quota limits
1189- Quota timer expiration (if resource usage exceeds the soft limit)
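
Because these values are controlled by userspace, the strongest check scrub
can apply is a plausibility test, as in this trivial sketch (the helper name
is made up):

.. code-block:: c

    /* Illustrative only: scrub cannot know what limits the admin intended. */
    static bool foo_quota_limits_plausible(uint64_t soft_limit,
                                           uint64_t hard_limit)
    {
            /* A soft limit above the hard limit can never make sense... */
            if (hard_limit && soft_limit > hard_limit)
                    return false;
            /* ...but any other combination has to be taken at face value. */
            return true;
    }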
1190
1191Cross-Referencing Space Metadata
1192````````````````````````````````
1193
1194After internal block checks, the next higher level of checking is
1195cross-referencing records between metadata structures.
1196For regular runtime code, the cost of these checks is considered to be
1197prohibitively expensive, but as scrub is dedicated to rooting out
1198inconsistencies, it must pursue all avenues of inquiry.
1199The exact set of cross-referencing is highly dependent on the context of the
1200data structure being checked.
1201
1202The XFS btree code has keyspace scanning functions that online fsck uses to
1203cross reference one structure with another.
1204Specifically, scrub can scan the key space of an index to determine if that
1205keyspace is fully, sparsely, or not at all mapped to records.
1206For the reverse mapping btree, it is possible to mask parts of the key for the
1207purposes of performing a keyspace scan so that scrub can decide if the rmap
1208btree contains records mapping a certain extent of physical space without the
sparseness of the rest of the rmap keyspace getting in the way.
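
A keyspace scan therefore produces one of three outcomes.
The enum and helper below are a hypothetical sketch of such an interface, not
the kernel's exact API:

.. code-block:: c

    /* Hypothetical sketch of a btree keyspace scan interface. */
    enum foo_keyfill {
            FOO_KEYFILL_EMPTY,      /* no records overlap the queried keyspace */
            FOO_KEYFILL_SPARSE,     /* only part of the keyspace is mapped */
            FOO_KEYFILL_FULL,       /* the entire keyspace is mapped by records */
    };

    /*
     * Decide how much of the keyspace [low, high] of the btree under @cur is
     * covered by records.  For the rmap btree, @mask restricts the comparison
     * to certain key fields so that owner information can be ignored.
     */
    int foo_btree_keyfill(struct xfs_btree_cur *cur, const void *low,
                          const void *high, const void *mask,
                          enum foo_keyfill *outcome);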
1210
1211Btree blocks undergo the following checks before cross-referencing:
1212
1213- Does the type of data stored in the block match what scrub is expecting?
1214
1215- Does the block belong to the owning structure that asked for the read?
1216
1217- Do the records fit within the block?
1218
1219- Are the records contained inside the block free of obvious corruptions?
1220
1221- Are the name hashes in the correct order?
1222
1223- Do node pointers within the btree point to valid block addresses for the type
1224  of btree?
1225
1226- Do child pointers point towards the leaves?
1227
1228- Do sibling pointers point across the same level?
1229
- For each node block record, does the record key accurately reflect the
  contents of the child block?
1232
1233Space allocation records are cross-referenced as follows:
1234
1. Any space mentioned by any metadata structure is cross-referenced as
1236   follows:
1237
1238   - Does the reverse mapping index list only the appropriate owner as the
1239     owner of each block?
1240
1241   - Are none of the blocks claimed as free space?
1242
1243   - If these aren't file data blocks, are none of the blocks claimed as space
1244     shared by different owners?
1245
12462. Btree blocks are cross-referenced as follows:
1247
1248   - Everything in class 1 above.
1249
1250   - If there's a parent node block, do the keys listed for this block match the
1251     keyspace of this block?
1252
1253   - Do the sibling pointers point to valid blocks?  Of the same level?
1254
1255   - Do the child pointers point to valid blocks?  Of the next level down?
1256
12573. Free space btree records are cross-referenced as follows:
1258
1259   - Everything in class 1 and 2 above.
1260
1261   - Does the reverse mapping index list no owners of this space?
1262
1263   - Is this space not claimed by the inode index for inodes?
1264
1265   - Is it not mentioned by the reference count index?
1266
1267   - Is there a matching record in the other free space btree?
1268
12694. Inode btree records are cross-referenced as follows:
1270
1271   - Everything in class 1 and 2 above.
1272
   - Is there a matching record in the free inode btree?
1274
1275   - Do cleared bits in the holemask correspond with inode clusters?
1276
1277   - Do set bits in the freemask correspond with inode records with zero link
1278     count?
1279
12805. Inode records are cross-referenced as follows:
1281
1282   - Everything in class 1.
1283
1284   - Do all the fields that summarize information about the file forks actually
1285     match those forks?
1286
1287   - Does each inode with zero link count correspond to a record in the free
1288     inode btree?
1289
12906. File fork space mapping records are cross-referenced as follows:
1291
1292   - Everything in class 1 and 2 above.
1293
1294   - Is this space not mentioned by the inode btrees?
1295
1296   - If this is a CoW fork mapping, does it correspond to a CoW entry in the
1297     reference count btree?
1298
12997. Reference count records are cross-referenced as follows:
1300
1301   - Everything in class 1 and 2 above.
1302
1303   - Within the space subkeyspace of the rmap btree (that is to say, all
1304     records mapped to a particular space extent and ignoring the owner info),
1305     are there the same number of reverse mapping records for each block as the
1306     reference count record claims?
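
The class 1 checks might be expressed in code along the lines of the next
sketch; the ``foo_xref_*`` helpers are hypothetical stand-ins for scrub's
cross-referencing functions.

.. code-block:: c

    /* Hypothetical sketch of the class 1 cross-reference checks. */
    static void foo_xref_block(struct xfs_scrub *sc, xfs_agblock_t agbno,
                               xfs_extlen_t len,
                               const struct xfs_owner_info *owner,
                               bool is_file_data)
    {
            /* The reverse mapping index should list only @owner here. */
            foo_xref_only_owned_by(sc, agbno, len, owner);

            /* None of these blocks should be claimed as free space. */
            foo_xref_not_free_space(sc, agbno, len);

            /* Only file data blocks may be shared between owners. */
            if (!is_file_data)
                    foo_xref_not_shared(sc, agbno, len);
    }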
1307
1308Proposed patchsets are the series to find gaps in
1309`refcount btree
1310<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-refcount-gaps>`_,
1311`inode btree
1312<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-inobt-gaps>`_, and
1313`rmap btree
1314<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-rmapbt-gaps>`_ records;
1315to find
1316`mergeable records
1317<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-mergeable-records>`_;
1318and to
1319`improve cross referencing with rmap
1320<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-strengthen-rmap-checking>`_
1321before starting a repair.
1322
1323Checking Extended Attributes
1324````````````````````````````
1325
1326Extended attributes implement a key-value store that enable fragments of data
1327to be attached to any file.
1328Both the kernel and userspace can access the keys and values, subject to
1329namespace and privilege restrictions.
1330Most typically these fragments are metadata about the file -- origins, security
1331contexts, user-supplied labels, indexing information, etc.
1332
1333Names can be as long as 255 bytes and can exist in several different
1334namespaces.
1335Values can be as large as 64KB.
1336A file's extended attributes are stored in blocks mapped by the attr fork.
1337The mappings point to leaf blocks, remote value blocks, or dabtree blocks.
1338Block 0 in the attribute fork is always the top of the structure, but otherwise
1339each of the three types of blocks can be found at any offset in the attr fork.
1340Leaf blocks contain attribute key records that point to the name and the value.
1341Names are always stored elsewhere in the same leaf block.
1342Values that are less than 3/4 the size of a filesystem block are also stored
1343elsewhere in the same leaf block.
1344Remote value blocks contain values that are too large to fit inside a leaf.
1345If the leaf information exceeds a single filesystem block, a dabtree (also
1346rooted at block 0) is created to map hashes of the attribute names to leaf
1347blocks in the attr fork.
1348
Checking an extended attribute structure is not so straightforward due to the
1350lack of separation between attr blocks and index blocks.
1351Scrub must read each block mapped by the attr fork and ignore the non-leaf
1352blocks:
1353
13541. Walk the dabtree in the attr fork (if present) to ensure that there are no
1355   irregularities in the blocks or dabtree mappings that do not point to
1356   attr leaf blocks.
1357
13582. Walk the blocks of the attr fork looking for leaf blocks.
1359   For each entry inside a leaf:
1360
1361   a. Validate that the name does not contain invalid characters.
1362
1363   b. Read the attr value.
1364      This performs a named lookup of the attr name to ensure the correctness
1365      of the dabtree.
1366      If the value is stored in a remote block, this also validates the
1367      integrity of the remote value block.
1368
1369Checking and Cross-Referencing Directories
1370``````````````````````````````````````````
1371
The filesystem directory tree is a directed acyclic graph structure, with files
1373constituting the nodes, and directory entries (dirents) constituting the edges.
1374Directories are a special type of file containing a set of mappings from a
1375255-byte sequence (name) to an inumber.
1376These are called directory entries, or dirents for short.
1377Each directory file must have exactly one directory pointing to the file.
1378A root directory points to itself.
1379Directory entries point to files of any type.
1380Each non-directory file may have multiple directories point to it.
1381
1382In XFS, directories are implemented as a file containing up to three 32GB
1383partitions.
1384The first partition contains directory entry data blocks.
1385Each data block contains variable-sized records associating a user-provided
1386name with an inumber and, optionally, a file type.
1387If the directory entry data grows beyond one block, the second partition (which
1388exists as post-EOF extents) is populated with a block containing free space
1389information and an index that maps hashes of the dirent names to directory data
1390blocks in the first partition.
1391This makes directory name lookups very fast.
1392If this second partition grows beyond one block, the third partition is
1393populated with a linear array of free space information for faster
1394expansions.
1395If the free space has been separated and the second partition grows again
1396beyond one block, then a dabtree is used to map hashes of dirent names to
1397directory data blocks.
1398
Checking a directory is pretty straightforward:
1400
14011. Walk the dabtree in the second partition (if present) to ensure that there
1402   are no irregularities in the blocks or dabtree mappings that do not point to
1403   dirent blocks.
1404
14052. Walk the blocks of the first partition looking for directory entries.
1406   Each dirent is checked as follows:
1407
1408   a. Does the name contain no invalid characters?
1409
1410   b. Does the inumber correspond to an actual, allocated inode?
1411
1412   c. Does the child inode have a nonzero link count?
1413
1414   d. If a file type is included in the dirent, does it match the type of the
1415      inode?
1416
1417   e. If the child is a subdirectory, does the child's dotdot pointer point
1418      back to the parent?
1419
1420   f. If the directory has a second partition, perform a named lookup of the
1421      dirent name to ensure the correctness of the dabtree.
1422
14233. Walk the free space list in the third partition (if present) to ensure that
1424   the free spaces it describes are really unused.
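
Step 2 might be outlined in code as follows.
Every ``foo_*`` helper is hypothetical, and the real scrub code records
problems by setting flags rather than returning ``-EFSCORRUPTED``; the sketch
only mirrors checks 2a-2f above.

.. code-block:: c

    /* Hypothetical outline of the per-dirent checks in step 2. */
    static int foo_check_dirent(struct xfs_inode *dp, const unsigned char *name,
                                int namelen, xfs_ino_t inum, unsigned char ftype)
    {
            struct xfs_inode *ip;
            int error;

            if (!foo_dir_name_ok(name, namelen))                    /* 2a */
                    return -EFSCORRUPTED;

            error = foo_iget(dp->i_mount, inum, &ip);               /* 2b */
            if (error)
                    return error;

            error = -EFSCORRUPTED;
            if (foo_inode_link_count(ip) == 0)                      /* 2c */
                    goto out_rele;
            if (ftype != foo_inode_ftype(ip))                       /* 2d */
                    goto out_rele;
            if (foo_is_dir(ip) && foo_dotdot_ino(ip) != dp->i_ino)  /* 2e */
                    goto out_rele;

            error = foo_hash_lookup(dp, name, namelen, inum);       /* 2f */
    out_rele:
            foo_irele(ip);
            return error;
    }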
1425
1426Checking operations involving :ref:`parents <dirparent>` and
1427:ref:`file link counts <nlinks>` are discussed in more detail in later
1428sections.
1429
1430Checking Directory/Attribute Btrees
1431^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1432
1433As stated in previous sections, the directory/attribute btree (dabtree) index
1434maps user-provided names to improve lookup times by avoiding linear scans.
1435Internally, it maps a 32-bit hash of the name to a block offset within the
1436appropriate file fork.
1437
1438The internal structure of a dabtree closely resembles the btrees that record
1439fixed-size metadata records -- each dabtree block contains a magic number, a
1440checksum, sibling pointers, a UUID, a tree level, and a log sequence number.
The format of leaf and node records is the same -- each entry points to the
1442next level down in the hierarchy, with dabtree node records pointing to dabtree
1443leaf blocks, and dabtree leaf records pointing to non-dabtree blocks elsewhere
1444in the fork.
1445
1446Checking and cross-referencing the dabtree is very similar to what is done for
1447space btrees:
1448
1449- Does the type of data stored in the block match what scrub is expecting?
1450
1451- Does the block belong to the owning structure that asked for the read?
1452
1453- Do the records fit within the block?
1454
1455- Are the records contained inside the block free of obvious corruptions?
1456
1457- Are the name hashes in the correct order?
1458
1459- Do node pointers within the dabtree point to valid fork offsets for dabtree
1460  blocks?
1461
1462- Do leaf pointers within the dabtree point to valid fork offsets for directory
1463  or attr leaf blocks?
1464
1465- Do child pointers point towards the leaves?
1466
1467- Do sibling pointers point across the same level?
1468
- For each dabtree node record, does the record key accurately reflect the
  contents of the child dabtree block?
1471
- For each dabtree leaf record, does the record key accurately reflect the
  contents of the directory or attr block?
1474
1475Cross-Referencing Summary Counters
1476``````````````````````````````````
1477
1478XFS maintains three classes of summary counters: available resources, quota
1479resource usage, and file link counts.
1480
1481In theory, the amount of available resources (data blocks, inodes, realtime
1482extents) can be found by walking the entire filesystem.
1483This would make for very slow reporting, so a transactional filesystem can
1484maintain summaries of this information in the superblock.
1485Cross-referencing these values against the filesystem metadata should be a
1486simple matter of walking the free space and inode metadata in each AG and the
1487realtime bitmap, but there are complications that will be discussed in
1488:ref:`more detail <fscounters>` later.
1489
1490:ref:`Quota usage <quotacheck>` and :ref:`file link count <nlinks>`
1491checking are sufficiently complicated to warrant separate sections.
1492
1493Post-Repair Reverification
1494``````````````````````````
1495
1496After performing a repair, the checking code is run a second time to validate
1497the new structure, and the results of the health assessment are recorded
1498internally and returned to the calling process.
This step is critical for enabling system administrators to monitor the status
1500of the filesystem and the progress of any repairs.
1501For developers, it is a useful means to judge the efficacy of error detection
1502and correction in the online and offline checking tools.
1503
1504Eventual Consistency vs. Online Fsck
1505------------------------------------
1506
1507Complex operations can make modifications to multiple per-AG data structures
1508with a chain of transactions.
1509These chains, once committed to the log, are restarted during log recovery if
1510the system crashes while processing the chain.
1511Because the AG header buffers are unlocked between transactions within a chain,
1512online checking must coordinate with chained operations that are in progress to
1513avoid incorrectly detecting inconsistencies due to pending chains.
1514Furthermore, online repair must not run when operations are pending because
1515the metadata are temporarily inconsistent with each other, and rebuilding is
1516not possible.
1517
Only online fsck has this requirement of total consistency of AG metadata, and
such checking should be relatively rare compared to filesystem change
operations.
1520Online fsck coordinates with transaction chains as follows:
1521
* For each AG, maintain a count of intent items targeting that AG.
1523  The count should be bumped whenever a new item is added to the chain.
1524  The count should be dropped when the filesystem has locked the AG header
1525  buffers and finished the work.
1526
1527* When online fsck wants to examine an AG, it should lock the AG header
1528  buffers to quiesce all transaction chains that want to modify that AG.
1529  If the count is zero, proceed with the checking operation.
1530  If it is nonzero, cycle the buffer locks to allow the chain to make forward
1531  progress.
1532
1533This may lead to online fsck taking a long time to complete, but regular
1534filesystem updates take precedence over background checking activity.
1535Details about the discovery of this situation are presented in the
1536:ref:`next section <chain_coordination>`, and details about the solution
1537are presented :ref:`after that<intent_drains>`.
1538
1539.. _chain_coordination:
1540
1541Discovery of the Problem
1542````````````````````````
1543
1544Midway through the development of online scrubbing, the fsstress tests
1545uncovered a misinteraction between online fsck and compound transaction chains
1546created by other writer threads that resulted in false reports of metadata
1547inconsistency.
1548The root cause of these reports is the eventual consistency model introduced by
1549the expansion of deferred work items and compound transaction chains when
1550reverse mapping and reflink were introduced.
1551
1552Originally, transaction chains were added to XFS to avoid deadlocks when
1553unmapping space from files.
1554Deadlock avoidance rules require that AGs only be locked in increasing order,
1555which makes it impossible (say) to use a single transaction to free a space
1556extent in AG 7 and then try to free a now superfluous block mapping btree block
1557in AG 3.
1558To avoid these kinds of deadlocks, XFS creates Extent Freeing Intent (EFI) log
1559items to commit to freeing some space in one transaction while deferring the
1560actual metadata updates to a fresh transaction.
1561The transaction sequence looks like this:
1562
15631. The first transaction contains a physical update to the file's block mapping
1564   structures to remove the mapping from the btree blocks.
1565   It then attaches to the in-memory transaction an action item to schedule
1566   deferred freeing of space.
1567   Concretely, each transaction maintains a list of ``struct
1568   xfs_defer_pending`` objects, each of which maintains a list of ``struct
1569   xfs_extent_free_item`` objects.
1570   Returning to the example above, the action item tracks the freeing of both
1571   the unmapped space from AG 7 and the block mapping btree (BMBT) block from
1572   AG 3.
1573   Deferred frees recorded in this manner are committed in the log by creating
1574   an EFI log item from the ``struct xfs_extent_free_item`` object and
1575   attaching the log item to the transaction.
1576   When the log is persisted to disk, the EFI item is written into the ondisk
1577   transaction record.
1578   EFIs can list up to 16 extents to free, all sorted in AG order.
1579
15802. The second transaction contains a physical update to the free space btrees
1581   of AG 3 to release the former BMBT block and a second physical update to the
1582   free space btrees of AG 7 to release the unmapped file space.
   Observe that the physical updates are resequenced in the correct order
1584   when possible.
   Attached to the transaction is an extent free done (EFD) log item.
1586   The EFD contains a pointer to the EFI logged in transaction #1 so that log
1587   recovery can tell if the EFI needs to be replayed.
1588
1589If the system goes down after transaction #1 is written back to the filesystem
1590but before #2 is committed, a scan of the filesystem metadata would show
1591inconsistent filesystem metadata because there would not appear to be any owner
1592of the unmapped space.
1593Happily, log recovery corrects this inconsistency for us -- when recovery finds
1594an intent log item but does not find a corresponding intent done item, it will
1595reconstruct the incore state of the intent item and finish it.
1596In the example above, the log must replay both frees described in the recovered
1597EFI to complete the recovery phase.
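
In rough pseudo-C, the chain described above looks something like the sketch
below.
The ``foo_*`` helpers are illustrative stand-ins for the real transaction and
deferred-work interfaces, and in practice the second transaction is created by
the deferred work machinery rather than open-coded by the caller.

.. code-block:: c

    /* Illustrative pseudo-C of the EFI/EFD chain; helper names are made up. */
    static int foo_unmap_and_free(struct xfs_inode *ip, struct foo_mapping *map)
    {
            struct xfs_trans *tp;
            int error;

            /* Transaction 1: BMBT update plus deferred frees, logged as an EFI. */
            error = foo_trans_alloc(ip->i_mount, &tp);
            if (error)
                    return error;
            foo_bmbt_remove_mapping(tp, ip, map);
            foo_defer_free_extent(tp, &map->space_in_ag7);
            foo_defer_free_extent(tp, &map->old_bmbt_block_in_ag3);
            error = foo_trans_commit(tp);   /* EFI reaches the ondisk log */
            if (error)
                    return error;

            /* Transaction 2: free space btree updates plus an EFD. */
            error = foo_trans_alloc(ip->i_mount, &tp);
            if (error)
                    return error;
            foo_free_extent(tp, &map->old_bmbt_block_in_ag3);       /* AG 3 */
            foo_free_extent(tp, &map->space_in_ag7);                /* AG 7 */
            foo_log_extent_free_done(tp);   /* EFD points back at the EFI */
            return foo_trans_commit(tp);
    }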
1598
1599There are subtleties to XFS' transaction chaining strategy to consider:
1600
1601* Log items must be added to a transaction in the correct order to prevent
1602  conflicts with principal objects that are not held by the transaction.
1603  In other words, all per-AG metadata updates for an unmapped block must be
1604  completed before the last update to free the extent, and extents should not
1605  be reallocated until that last update commits to the log.
1606
1607* AG header buffers are released between each transaction in a chain.
1608  This means that other threads can observe an AG in an intermediate state,
1609  but as long as the first subtlety is handled, this should not affect the
1610  correctness of filesystem operations.
1611
1612* Unmounting the filesystem flushes all pending work to disk, which means that
1613  offline fsck never sees the temporary inconsistencies caused by deferred
1614  work item processing.
1615
1616In this manner, XFS employs a form of eventual consistency to avoid deadlocks
1617and increase parallelism.
1618
1619During the design phase of the reverse mapping and reflink features, it was
1620decided that it was impractical to cram all the reverse mapping updates for a
1621single filesystem change into a single transaction because a single file
1622mapping operation can explode into many small updates:
1623
1624* The block mapping update itself
1625* A reverse mapping update for the block mapping update
1626* Fixing the freelist
1627* A reverse mapping update for the freelist fix
1628
1629* A shape change to the block mapping btree
1630* A reverse mapping update for the btree update
1631* Fixing the freelist (again)
1632* A reverse mapping update for the freelist fix
1633
1634* An update to the reference counting information
1635* A reverse mapping update for the refcount update
1636* Fixing the freelist (a third time)
1637* A reverse mapping update for the freelist fix
1638
1639* Freeing any space that was unmapped and not owned by any other file
1640* Fixing the freelist (a fourth time)
1641* A reverse mapping update for the freelist fix
1642
1643* Freeing the space used by the block mapping btree
1644* Fixing the freelist (a fifth time)
1645* A reverse mapping update for the freelist fix
1646
1647Free list fixups are not usually needed more than once per AG per transaction
1648chain, but it is theoretically possible if space is very tight.
1649For copy-on-write updates this is even worse, because this must be done once to
1650remove the space from a staging area and again to map it into the file!
1651
1652To deal with this explosion in a calm manner, XFS expands its use of deferred
1653work items to cover most reverse mapping updates and all refcount updates.
1654This reduces the worst case size of transaction reservations by breaking the
1655work into a long chain of small updates, which increases the degree of eventual
1656consistency in the system.
1657Again, this generally isn't a problem because XFS orders its deferred work
1658items carefully to avoid resource reuse conflicts between unsuspecting threads.
1659
1660However, online fsck changes the rules -- remember that although physical
1661updates to per-AG structures are coordinated by locking the buffers for AG
1662headers, buffer locks are dropped between transactions.
1663Once scrub acquires resources and takes locks for a data structure, it must do
1664all the validation work without releasing the lock.
1665If the main lock for a space btree is an AG header buffer lock, scrub may have
1666interrupted another thread that is midway through finishing a chain.
1667For example, if a thread performing a copy-on-write has completed a reverse
1668mapping update but not the corresponding refcount update, the two AG btrees
1669will appear inconsistent to scrub and an observation of corruption will be
1670recorded.  This observation will not be correct.
1671If a repair is attempted in this state, the results will be catastrophic!
1672
1673Several other solutions to this problem were evaluated upon discovery of this
1674flaw and rejected:
1675
16761. Add a higher level lock to allocation groups and require writer threads to
1677   acquire the higher level lock in AG order before making any changes.
1678   This would be very difficult to implement in practice because it is
1679   difficult to determine which locks need to be obtained, and in what order,
1680   without simulating the entire operation.
1681   Performing a dry run of a file operation to discover necessary locks would
1682   make the filesystem very slow.
1683
16842. Make the deferred work coordinator code aware of consecutive intent items
1685   targeting the same AG and have it hold the AG header buffers locked across
1686   the transaction roll between updates.
1687   This would introduce a lot of complexity into the coordinator since it is
1688   only loosely coupled with the actual deferred work items.
1689   It would also fail to solve the problem because deferred work items can
1690   generate new deferred subtasks, but all subtasks must be complete before
1691   work can start on a new sibling task.
1692
16933. Teach online fsck to walk all transactions waiting for whichever lock(s)
1694   protect the data structure being scrubbed to look for pending operations.
1695   The checking and repair operations must factor these pending operations into
1696   the evaluations being performed.
1697   This solution is a nonstarter because it is *extremely* invasive to the main
1698   filesystem.
1699
1700.. _intent_drains:
1701
1702Intent Drains
1703`````````````
1704
1705Online fsck uses an atomic intent item counter and lock cycling to coordinate
1706with transaction chains.
1707There are two key properties to the drain mechanism.
1708First, the counter is incremented when a deferred work item is *queued* to a
1709transaction, and it is decremented after the associated intent done log item is
1710*committed* to another transaction.
1711The second property is that deferred work can be added to a transaction without
1712holding an AG header lock, but per-AG work items cannot be marked done without
1713locking that AG header buffer to log the physical updates and the intent done
1714log item.
1715The first property enables scrub to yield to running transaction chains, which
1716is an explicit deprioritization of online fsck to benefit file operations.
1717The second property of the drain is key to the correct coordination of scrub,
1718since scrub will always be able to decide if a conflict is possible.
1719
1720For regular filesystem code, the drain works as follows:
1721
17221. Call the appropriate subsystem function to add a deferred work item to a
1723   transaction.
1724
17252. The function calls ``xfs_defer_drain_bump`` to increase the counter.
1726
17273. When the deferred item manager wants to finish the deferred work item, it
1728   calls ``->finish_item`` to complete it.
1729
17304. The ``->finish_item`` implementation logs some changes and calls
1731   ``xfs_defer_drain_drop`` to decrease the sloppy counter and wake up any threads
1732   waiting on the drain.
1733
17345. The subtransaction commits, which unlocks the resource associated with the
1735   intent item.
1736
1737For scrub, the drain works as follows:
1738
17391. Lock the resource(s) associated with the metadata being scrubbed.
1740   For example, a scan of the refcount btree would lock the AGI and AGF header
1741   buffers.
1742
17432. If the counter is zero (``xfs_defer_drain_busy`` returns false), there are no
1744   chains in progress and the operation may proceed.
1745
17463. Otherwise, release the resources grabbed in step 1.
1747
17484. Wait for the intent counter to reach zero (``xfs_defer_drain_intents``), then go
1749   back to step 1 unless a signal has been caught.
1750
1751To avoid polling in step 4, the drain provides a waitqueue for scrub threads to
1752be woken up whenever the intent count drops to zero.
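
Put together, the scrub-side loop might look like the hedged sketch below.
The locking helpers are hypothetical, and the drain function signatures are
paraphrased rather than exact.

.. code-block:: c

    /* Hedged sketch of scrub's coordination with the per-AG intent drain. */
    static int foo_scrub_lock_ag(struct xfs_scrub *sc)
    {
            int error;

            do {
                    /* Step 1: lock the AGI and AGF header buffers. */
                    error = foo_lock_ag_headers(sc);
                    if (error)
                            return error;

                    /* Step 2: no chains in progress?  Start checking. */
                    if (!xfs_defer_drain_busy(foo_ag_drain(sc)))
                            return 0;

                    /* Step 3: back off so the chain can make progress. */
                    foo_unlock_ag_headers(sc);

                    /* Step 4: sleep until the intent count reaches zero. */
                    error = xfs_defer_drain_intents(foo_ag_drain(sc));
            } while (!error);

            /* A caught signal ends the retry loop. */
            return error;
    }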
1753
1754The proposed patchset is the
1755`scrub intent drain series
1756<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-drain-intents>`_.
1757
1758.. _jump_labels:
1759
1760Static Keys (aka Jump Label Patching)
1761`````````````````````````````````````
1762
1763Online fsck for XFS separates the regular filesystem from the checking and
1764repair code as much as possible.
1765However, there are a few parts of online fsck (such as the intent drains, and
1766later, live update hooks) where it is useful for the online fsck code to know
1767what's going on in the rest of the filesystem.
1768Since it is not expected that online fsck will be constantly running in the
1769background, it is very important to minimize the runtime overhead imposed by
1770these hooks when online fsck is compiled into the kernel but not actively
1771running on behalf of userspace.
1772Taking locks in the hot path of a writer thread to access a data structure only
1773to find that no further action is necessary is expensive -- on the author's
computer, this has an overhead of 40-50ns per access.
1775Fortunately, the kernel supports dynamic code patching, which enables XFS to
1776replace a static branch to hook code with ``nop`` sleds when online fsck isn't
1777running.
1778This sled has an overhead of however long it takes the instruction decoder to
1779skip past the sled, which seems to be on the order of less than 1ns and
1780does not access memory outside of instruction fetching.
1781
1782When online fsck enables the static key, the sled is replaced with an
1783unconditional branch to call the hook code.
1784The switchover is quite expensive (~22000ns) but is paid entirely by the
1785program that invoked online fsck, and can be amortized if multiple threads
1786enter online fsck at the same time, or if multiple filesystems are being
1787checked at the same time.
1788Changing the branch direction requires taking the CPU hotplug lock, and since
1789CPU initialization requires memory allocation, online fsck must be careful not
1790to change a static key while holding any locks or resources that could be
1791accessed in the memory reclaim paths.
1792To minimize contention on the CPU hotplug lock, care should be taken not to
1793enable or disable static keys unnecessarily.
1794
1795Because static keys are intended to minimize hook overhead for regular
1796filesystem operations when xfs_scrub is not running, the intended usage
1797patterns are as follows:
1798
1799- The hooked part of XFS should declare a static-scoped static key that
1800  defaults to false.
1801  The ``DEFINE_STATIC_KEY_FALSE`` macro takes care of this.
1802  The static key itself should be declared as a ``static`` variable.
1803
1804- When deciding to invoke code that's only used by scrub, the regular
1805  filesystem should call the ``static_branch_unlikely`` predicate to avoid the
1806  scrub-only hook code if the static key is not enabled.
1807
1808- The regular filesystem should export helper functions that call
1809  ``static_branch_inc`` to enable and ``static_branch_dec`` to disable the
1810  static key.
1811  Wrapper functions make it easy to compile out the relevant code if the kernel
1812  distributor turns off online fsck at build time.
1813
1814- Scrub functions wanting to turn on scrub-only XFS functionality should call
  ``xchk_fsgates_enable`` from the setup function to enable a specific
1816  hook.
1817  This must be done before obtaining any resources that are used by memory
1818  reclaim.
1819  Callers had better be sure they really need the functionality gated by the
1820  static key; the ``TRY_HARDER`` flag is useful here.
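
A minimal sketch of this pattern, using the generic jump label API, might look
like the following; the ``xfs_foo_*`` names are made up.

.. code-block:: c

    #include <linux/jump_label.h>

    struct xfs_foo;                         /* hypothetical hooked object */
    void xfs_foo_scrub_hook(struct xfs_foo *foo);

    /* Defaults to false, so the hot path costs only a nop sled. */
    static DEFINE_STATIC_KEY_FALSE(xfs_foo_hooks_switch);

    /* Wrappers exported to the scrub code to flip the key on and off. */
    void xfs_foo_hooks_enable(void)
    {
            static_branch_inc(&xfs_foo_hooks_switch);
    }

    void xfs_foo_hooks_disable(void)
    {
            static_branch_dec(&xfs_foo_hooks_switch);
    }

    /* Hot path in the regular filesystem. */
    void xfs_foo_commit(struct xfs_foo *foo)
    {
            if (static_branch_unlikely(&xfs_foo_hooks_switch))
                    xfs_foo_scrub_hook(foo);        /* only when scrub asks */

            /* ... normal processing continues ... */
    }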
1821
1822Online scrub has resource acquisition helpers (e.g. ``xchk_perag_lock``) to
1823handle locking AGI and AGF buffers for all scrubber functions.
1824If it detects a conflict between scrub and the running transactions, it will
1825try to wait for intents to complete.
1826If the caller of the helper has not enabled the static key, the helper will
1827return -EDEADLOCK, which should result in the scrub being restarted with the
1828``TRY_HARDER`` flag set.
1829The scrub setup function should detect that flag, enable the static key, and
1830try the scrub again.
1831Scrub teardown disables all static keys obtained by ``xchk_fsgates_enable``.
1832
1833For more information, please see the kernel documentation of
1834Documentation/staging/static-keys.rst.
1835