1.. SPDX-License-Identifier: GPL-2.0
2.. _xfs_online_fsck_design:
3
4..
5        Mapping of heading styles within this document:
6        Heading 1 uses "====" above and below
7        Heading 2 uses "===="
8        Heading 3 uses "----"
9        Heading 4 uses "````"
10        Heading 5 uses "^^^^"
11        Heading 6 uses "~~~~"
12        Heading 7 uses "...."
13
14        Sections are manually numbered because apparently that's what everyone
15        does in the kernel.
16
17======================
18XFS Online Fsck Design
19======================
20
21This document captures the design of the online filesystem check feature for
22XFS.
23The purpose of this document is threefold:
24
25- To help kernel distributors understand exactly what the XFS online fsck
26  feature is, and issues about which they should be aware.
27
28- To help people reading the code to familiarize themselves with the relevant
29  concepts and design points before they start digging into the code.
30
31- To help developers maintaining the system by capturing the reasons
32  supporting higher level decision making.
33
34As the online fsck code is merged, the links in this document to topic branches
35will be replaced with links to code.
36
37This document is licensed under the terms of the GNU Public License, v2.
38The primary author is Darrick J. Wong.
39
40This design document is split into seven parts.
41Part 1 defines what fsck tools are and the motivations for writing a new one.
Parts 2 and 3 present a high level overview of how the online fsck process works
43and how it is tested to ensure correct functionality.
44Part 4 discusses the user interface and the intended usage modes of the new
45program.
46Parts 5 and 6 show off the high level components and how they fit together, and
47then present case studies of how each repair function actually works.
48Part 7 sums up what has been discussed so far and speculates about what else
49might be built atop online fsck.
50
51.. contents:: Table of Contents
52   :local:
53
541. What is a Filesystem Check?
55==============================
56
57A Unix filesystem has four main responsibilities:
58
59- Provide a hierarchy of names through which application programs can associate
60  arbitrary blobs of data for any length of time,
61
- Virtualize physical storage media across those names,

- Retrieve the named data blobs at any time, and

- Examine resource usage.
67
68Metadata directly supporting these functions (e.g. files, directories, space
69mappings) are sometimes called primary metadata.
70Secondary metadata (e.g. reverse mapping and directory parent pointers) support
71operations internal to the filesystem, such as internal consistency checking
72and reorganization.
73Summary metadata, as the name implies, condense information contained in
74primary metadata for performance reasons.
75
76The filesystem check (fsck) tool examines all the metadata in a filesystem
77to look for errors.
78In addition to looking for obvious metadata corruptions, fsck also
79cross-references different types of metadata records with each other to look
80for inconsistencies.
People do not like losing data, so most fsck tools also contain some ability
82to correct any problems found.
83As a word of caution -- the primary goal of most Linux fsck tools is to restore
84the filesystem metadata to a consistent state, not to maximize the data
85recovered.
86That precedent will not be challenged here.
87
88Filesystems of the 20th century generally lacked any redundancy in the ondisk
89format, which means that fsck can only respond to errors by erasing files until
90errors are no longer detected.
91More recent filesystem designs contain enough redundancy in their metadata that
92it is now possible to regenerate data structures when non-catastrophic errors
occur; this capability aids both strategies mentioned above -- restoring
metadata consistency and minimizing the loss of data.
94
95+--------------------------------------------------------------------------+
96| **Note**:                                                                |
97+--------------------------------------------------------------------------+
98| System administrators avoid data loss by increasing the number of        |
99| separate storage systems through the creation of backups; and they avoid |
100| downtime by increasing the redundancy of each storage system through the |
101| creation of RAID arrays.                                                 |
102| fsck tools address only the first problem.                               |
103+--------------------------------------------------------------------------+
104
105TLDR; Show Me the Code!
106-----------------------
107
108Code is posted to the kernel.org git trees as follows:
109`kernel changes <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-symlink>`_,
110`userspace changes <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-media-scan-service>`_, and
111`QA test changes <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=repair-dirs>`_.
112Each kernel patchset adding an online repair function will use the same branch
113name across the kernel, xfsprogs, and fstests git repos.
114
115Existing Tools
116--------------
117
118The online fsck tool described here will be the third tool in the history of
119XFS (on Linux) to check and repair filesystems.
120Two programs precede it:
121
122The first program, ``xfs_check``, was created as part of the XFS debugger
123(``xfs_db``) and can only be used with unmounted filesystems.
124It walks all metadata in the filesystem looking for inconsistencies in the
125metadata, though it lacks any ability to repair what it finds.
126Due to its high memory requirements and inability to repair things, this
127program is now deprecated and will not be discussed further.
128
129The second program, ``xfs_repair``, was created to be faster and more robust
130than the first program.
131Like its predecessor, it can only be used with unmounted filesystems.
132It uses extent-based in-memory data structures to reduce memory consumption,
133and tries to schedule readahead IO appropriately to reduce I/O waiting time
134while it scans the metadata of the entire filesystem.
135The most important feature of this tool is its ability to respond to
inconsistencies in file metadata and the directory tree by erasing things as
needed
137to eliminate problems.
138Space usage metadata are rebuilt from the observed file metadata.
139
140Problem Statement
141-----------------
142
143The current XFS tools leave several problems unsolved:
144
1451. **User programs** suddenly **lose access** to the filesystem when unexpected
146   shutdowns occur as a result of silent corruptions in the metadata.
147   These occur **unpredictably** and often without warning.
148
1492. **Users** experience a **total loss of service** during the recovery period
150   after an **unexpected shutdown** occurs.
151
1523. **Users** experience a **total loss of service** if the filesystem is taken
153   offline to **look for problems** proactively.
154
1554. **Data owners** cannot **check the integrity** of their stored data without
156   reading all of it.
157   This may expose them to substantial billing costs when a linear media scan
158   performed by the storage system administrator might suffice.
159
1605. **System administrators** cannot **schedule** a maintenance window to deal
161   with corruptions if they **lack the means** to assess filesystem health
162   while the filesystem is online.
163
1646. **Fleet monitoring tools** cannot **automate periodic checks** of filesystem
165   health when doing so requires **manual intervention** and downtime.
166
1677. **Users** can be tricked into **doing things they do not desire** when
168   malicious actors **exploit quirks of Unicode** to place misleading names
169   in directories.
170
171Given this definition of the problems to be solved and the actors who would
172benefit, the proposed solution is a third fsck tool that acts on a running
173filesystem.
174
175This new third program has three components: an in-kernel facility to check
176metadata, an in-kernel facility to repair metadata, and a userspace driver
177program to drive fsck activity on a live filesystem.
178``xfs_scrub`` is the name of the driver program.
179The rest of this document presents the goals and use cases of the new fsck
180tool, describes its major design points in connection to those goals, and
181discusses the similarities and differences with existing tools.
182
183+--------------------------------------------------------------------------+
184| **Note**:                                                                |
185+--------------------------------------------------------------------------+
186| Throughout this document, the existing offline fsck tool can also be     |
187| referred to by its current name "``xfs_repair``".                        |
188| The userspace driver program for the new online fsck tool can be         |
189| referred to as "``xfs_scrub``".                                          |
190| The kernel portion of online fsck that validates metadata is called      |
191| "online scrub", and portion of the kernel that fixes metadata is called  |
192| "online repair".                                                         |
193+--------------------------------------------------------------------------+
194
195The naming hierarchy is broken up into objects known as directories and files
196and the physical space is split into pieces known as allocation groups.
197Sharding enables better performance on highly parallel systems and helps to
198contain the damage when corruptions occur.
199The division of the filesystem into principal objects (allocation groups and
200inodes) means that there are ample opportunities to perform targeted checks and
201repairs on a subset of the filesystem.
202
While checks and repairs are underway on one shard, the rest of the filesystem
continues to process IO requests.
204Even if a piece of filesystem metadata can only be regenerated by scanning the
205entire system, the scan can still be done in the background while other file
206operations continue.
207
208In summary, online fsck takes advantage of resource sharding and redundant
209metadata to enable targeted checking and repair operations while the system
210is running.
211This capability will be coupled to automatic system management so that
212autonomous self-healing of XFS maximizes service availability.
213
2142. Theory of Operation
215======================
216
217Because it is necessary for online fsck to lock and scan live metadata objects,
218online fsck consists of three separate code components.
219The first is the userspace driver program ``xfs_scrub``, which is responsible
220for identifying individual metadata items, scheduling work items for them,
221reacting to the outcomes appropriately, and reporting results to the system
222administrator.
223The second and third are in the kernel, which implements functions to check
224and repair each type of online fsck work item.
225
226+------------------------------------------------------------------+
227| **Note**:                                                        |
228+------------------------------------------------------------------+
229| For brevity, this document shortens the phrase "online fsck work |
230| item" to "scrub item".                                           |
231+------------------------------------------------------------------+
232
233Scrub item types are delineated in a manner consistent with the Unix design
234philosophy, which is to say that each item should handle one aspect of a
235metadata structure, and handle it well.
236
237Scope
238-----
239
240In principle, online fsck should be able to check and to repair everything that
241the offline fsck program can handle.
242However, online fsck cannot be running 100% of the time, which means that
243latent errors may creep in after a scrub completes.
244If these errors cause the next mount to fail, offline fsck is the only
245solution.
246This limitation means that maintenance of the offline fsck tool will continue.
247A second limitation of online fsck is that it must follow the same resource
248sharing and lock acquisition rules as the regular filesystem.
249This means that scrub cannot take *any* shortcuts to save time, because doing
250so could lead to concurrency problems.
251In other words, online fsck is not a complete replacement for offline fsck, and
a complete run of online fsck may take longer than a run of offline fsck.
253However, both of these limitations are acceptable tradeoffs to satisfy the
254different motivations of online fsck, which are to **minimize system downtime**
255and to **increase predictability of operation**.
256
257.. _scrubphases:
258
259Phases of Work
260--------------
261
262The userspace driver program ``xfs_scrub`` splits the work of checking and
263repairing an entire filesystem into seven phases.
264Each phase concentrates on checking specific types of scrub items and depends
265on the success of all previous phases.
266The seven phases are as follows:
267
2681. Collect geometry information about the mounted filesystem and computer,
269   discover the online fsck capabilities of the kernel, and open the
270   underlying storage devices.
271
2722. Check allocation group metadata, all realtime volume metadata, and all quota
273   files.
274   Each metadata structure is scheduled as a separate scrub item.
275   If corruption is found in the inode header or inode btree and ``xfs_scrub``
276   is permitted to perform repairs, then those scrub items are repaired to
277   prepare for phase 3.
278   Repairs are implemented by using the information in the scrub item to
279   resubmit the kernel scrub call with the repair flag enabled; this is
280   discussed in the next section.
281   Optimizations and all other repairs are deferred to phase 4.
282
2833. Check all metadata of every file in the filesystem.
284   Each metadata structure is also scheduled as a separate scrub item.
285   If repairs are needed and ``xfs_scrub`` is permitted to perform repairs,
286   and there were no problems detected during phase 2, then those scrub items
287   are repaired immediately.
288   Optimizations, deferred repairs, and unsuccessful repairs are deferred to
289   phase 4.
290
2914. All remaining repairs and scheduled optimizations are performed during this
292   phase, if the caller permits them.
293   Before starting repairs, the summary counters are checked and any necessary
294   repairs are performed so that subsequent repairs will not fail the resource
295   reservation step due to wildly incorrect summary counters.
   Unsuccessful repairs are requeued as long as forward progress on repairs is
297   made somewhere in the filesystem.
298   Free space in the filesystem is trimmed at the end of phase 4 if the
299   filesystem is clean.
300
3015. By the start of this phase, all primary and secondary filesystem metadata
302   must be correct.
303   Summary counters such as the free space counts and quota resource counts
304   are checked and corrected.
305   Directory entry names and extended attribute names are checked for
306   suspicious entries such as control characters or confusing Unicode sequences
307   appearing in names.
308
3096. If the caller asks for a media scan, read all allocated and written data
310   file extents in the filesystem.
311   The ability to use hardware-assisted data file integrity checking is new
   to online fsck; neither of the previous tools has this capability.
313   If media errors occur, they will be mapped to the owning files and reported.
314
7. Re-check the summary counters and present the caller with a summary of
316   space usage and file counts.
317
318Steps for Each Scrub Item
319-------------------------
320
The kernel scrub code uses a three-step strategy for checking and repairing
the one aspect of a metadata object represented by a scrub item; a sketch of
how userspace drives this cycle follows the list:
323
3241. The scrub item of interest is checked for corruptions; opportunities for
325   optimization; and for values that are directly controlled by the system
326   administrator but look suspicious.
   If the item is not corrupt or does not need optimization, resources are
328   released and the positive scan results are returned to userspace.
329   If the item is corrupt or could be optimized but the caller does not permit
330   this, resources are released and the negative scan results are returned to
331   userspace.
332   Otherwise, the kernel moves on to the second step.
333
3342. The repair function is called to rebuild the data structure.
   Repair functions generally choose to rebuild a structure from other metadata
336   rather than try to salvage the existing structure.
337   If the repair fails, the scan results from the first step are returned to
338   userspace.
339   Otherwise, the kernel moves on to the third step.
340
3413. In the third step, the kernel runs the same checks over the new metadata
342   item to assess the efficacy of the repairs.
343   The results of the reassessment are returned to userspace.
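
From userspace, all three steps are driven through the metadata scrub ioctl
that ``xfs_scrub`` calls.
The following sketch shows one way a caller might run a single scrub item
through this cycle.
The ioctl, structure, and flag names are taken from the kernel's XFS scrub
interface, but the header path, the parameters, and the retry policy shown
here are simplifying assumptions rather than a copy of the real ``xfs_scrub``
code.

.. code-block:: c

	#include <stdbool.h>
	#include <stdint.h>
	#include <sys/ioctl.h>
	#include <xfs/xfs.h>	/* assumed to pull in the scrub ioctl definitions */

	/* Run one scrub item through the check/repair/re-check cycle. */
	static int
	scrub_one_item(int fs_fd, uint32_t type, uint32_t agno, bool allow_repair)
	{
		struct xfs_scrub_metadata sm = {
			.sm_type = type,	/* which kind of scrub item */
			.sm_agno = agno,	/* which AG, for per-AG items */
		};

		/* File-scoped items would also set sm_ino and sm_gen. */

		/* Step 1: check the item and collect the outcome flags. */
		if (ioctl(fs_fd, XFS_IOC_SCRUB_METADATA, &sm))
			return -1;

		/* Healthy and nothing to optimize?  Then we are done. */
		if (!(sm.sm_flags & (XFS_SCRUB_OFLAG_CORRUPT |
				     XFS_SCRUB_OFLAG_XCORRUPT |
				     XFS_SCRUB_OFLAG_PREEN)))
			return 0;

		/* In report-only mode the caller just logs sm.sm_flags. */
		if (!allow_repair)
			return 0;

		/*
		 * Steps 2 and 3: resubmit the call with the repair flag set.
		 * The kernel rebuilds the structure, re-checks it, and returns
		 * flags describing the state of the new structure.
		 */
		sm.sm_flags = XFS_SCRUB_IFLAG_REPAIR;
		return ioctl(fs_fd, XFS_IOC_SCRUB_METADATA, &sm);
	}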
344
345Classification of Metadata
346--------------------------
347
348Each type of metadata object (and therefore each type of scrub item) is
349classified as follows:
350
351Primary Metadata
352````````````````
353
354Metadata structures in this category should be most familiar to filesystem
users either because they are directly created by the user or because they
index objects created by the user.
357Most filesystem objects fall into this class:
358
359- Free space and reference count information
360
361- Inode records and indexes
362
363- Storage mapping information for file data
364
365- Directories
366
367- Extended attributes
368
369- Symbolic links
370
371- Quota limits
372
373Scrub obeys the same rules as regular filesystem accesses for resource and lock
374acquisition.
375
376Primary metadata objects are the simplest for scrub to process.
377The principal filesystem object (either an allocation group or an inode) that
378owns the item being scrubbed is locked to guard against concurrent updates.
379The check function examines every record associated with the type for obvious
380errors and cross-references healthy records against other metadata to look for
381inconsistencies.
382Repairs for this class of scrub item are simple, since the repair function
383starts by holding all the resources acquired in the previous step.
384The repair function scans available metadata as needed to record all the
385observations needed to complete the structure.
386Next, it stages the observations in a new ondisk structure and commits it
387atomically to complete the repair.
Finally, the storage from the old data structure is carefully reaped.
389
390Because ``xfs_scrub`` locks a primary object for the duration of the repair,
391this is effectively an offline repair operation performed on a subset of the
392filesystem.
393This minimizes the complexity of the repair code because it is not necessary to
394handle concurrent updates from other threads, nor is it necessary to access
395any other part of the filesystem.
396As a result, indexed structures can be rebuilt very quickly, and programs
397trying to access the damaged structure will be blocked until repairs complete.
The only infrastructure needed by the repair code is a staging area for
399observations and a means to write new structures to disk.
400Despite these limitations, the advantage that online repair holds is clear:
401targeted work on individual shards of the filesystem avoids total loss of
402service.
403
404This mechanism is described in section 2.1 ("Off-Line Algorithm") of
405V. Srinivasan and M. J. Carey, `"Performance of On-Line Index Construction
406Algorithms" <https://minds.wisconsin.edu/bitstream/handle/1793/59524/TR1047.pdf>`_,
407*Extending Database Technology*, pp. 293-309, 1992.
408
409Most primary metadata repair functions stage their intermediate results in an
410in-memory array prior to formatting the new ondisk structure, which is very
411similar to the list-based algorithm discussed in section 2.3 ("List-Based
412Algorithms") of Srinivasan.
413However, any data structure builder that maintains a resource lock for the
414duration of the repair is *always* an offline algorithm.
415
416.. _secondary_metadata:
417
418Secondary Metadata
419``````````````````
420
421Metadata structures in this category reflect records found in primary metadata,
422but are only needed for online fsck or for reorganization of the filesystem.
423
424Secondary metadata include:
425
426- Reverse mapping information
427
428- Directory parent pointers
429
430This class of metadata is difficult for scrub to process because scrub attaches
431to the secondary object but needs to check primary metadata, which runs counter
432to the usual order of resource acquisition.
Frequently, this means that full filesystem scans are necessary to rebuild the
434metadata.
435Check functions can be limited in scope to reduce runtime.
436Repairs, however, require a full scan of primary metadata, which can take a
437long time to complete.
438Under these conditions, ``xfs_scrub`` cannot lock resources for the entire
439duration of the repair.
440
441Instead, repair functions set up an in-memory staging structure to store
442observations.
443Depending on the requirements of the specific repair function, the staging
444index will either have the same format as the ondisk structure or a design
445specific to that repair function.
446The next step is to release all locks and start the filesystem scan.
447When the repair scanner needs to record an observation, the staging data are
448locked long enough to apply the update.
449While the filesystem scan is in progress, the repair function hooks the
450filesystem so that it can apply pending filesystem updates to the staging
451information.
452Once the scan is done, the owning object is re-locked, the live data is used to
453write a new ondisk structure, and the repairs are committed atomically.
The hooks are disabled and the staging area is freed.
Finally, the storage from the old data structure is carefully reaped.
456
457Introducing concurrency helps online repair avoid various locking problems, but
458comes at a high cost to code complexity.
459Live filesystem code has to be hooked so that the repair function can observe
460updates in progress.
461The staging area has to become a fully functional parallel structure so that
462updates can be merged from the hooks.
463Finally, the hook, the filesystem scan, and the inode locking model must be
464sufficiently well integrated that a hook event can decide if a given update
465should be applied to the staging structure.
466
467In theory, the scrub implementation could apply these same techniques for
468primary metadata, but doing so would make it massively more complex and less
469performant.
470Programs attempting to access the damaged structures are not blocked from
471operation, which may cause application failure or an unplanned filesystem
472shutdown.
473
474Inspiration for the secondary metadata repair strategy was drawn from section
2.4 of Srinivasan above, and sections 2 ("NSF: Index Build Without Side-File")
476and 3.1.1 ("Duplicate Key Insert Problem") in C. Mohan, `"Algorithms for
477Creating Indexes for Very Large Tables Without Quiescing Updates"
478<https://dl.acm.org/doi/10.1145/130283.130337>`_, 1992.
479
480The sidecar index mentioned above bears some resemblance to the side file
481method mentioned in Srinivasan and Mohan.
482Their method consists of an index builder that extracts relevant record data to
483build the new structure as quickly as possible; and an auxiliary structure that
484captures all updates that would be committed to the index by other threads were
485the new index already online.
486After the index building scan finishes, the updates recorded in the side file
487are applied to the new index.
488To avoid conflicts between the index builder and other writer threads, the
489builder maintains a publicly visible cursor that tracks the progress of the
490scan through the record space.
491To avoid duplication of work between the side file and the index builder, side
492file updates are elided when the record ID for the update is greater than the
493cursor position within the record ID space.
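
For XFS, the decision of whether a hooked live update must be applied to the
staging data likewise reduces to a comparison against the scan cursor.
The sketch below is illustrative only: the structure and function names are
hypothetical, and a single integer key stands in for whatever ordering the
scan uses (inode numbers, AG block numbers, and so on).

.. code-block:: c

	/* Hypothetical example; not actual kernel code. */
	struct scan_cursor {
		uint64_t	visited;	/* highest key already scanned */
		bool		scan_done;	/* set when the scan finishes */
	};

	static bool
	want_live_update(const struct scan_cursor *cursor, uint64_t key)
	{
		/*
		 * Keys that the scanner has not yet reached will be picked up
		 * by the scan itself, so the hook can skip them.  Keys at or
		 * below the cursor were already sampled, so the hook must
		 * apply the update to the staging data.
		 */
		if (cursor->scan_done)
			return true;
		return key <= cursor->visited;
	}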
494
495To minimize changes to the rest of the codebase, XFS online repair keeps the
496replacement index hidden until it's completely ready to go.
497In other words, there is no attempt to expose the keyspace of the new index
498while repair is running.
499The complexity of such an approach would be very high and perhaps more
500appropriate to building *new* indices.
501
502**Future Work Question**: Can the full scan and live update code used to
503facilitate a repair also be used to implement a comprehensive check?
504
505*Answer*: In theory, yes.  Check would be much stronger if each scrub function
506employed these live scans to build a shadow copy of the metadata and then
507compared the shadow records to the ondisk records.
However, doing that is a fair amount more work than what the checking functions
do now, and it would increase the runtime of those scrub functions.
The live scans and hooks were also developed much later than the checking code.
512
513Summary Information
514```````````````````
515
516Metadata structures in this last category summarize the contents of primary
517metadata records.
518These are often used to speed up resource usage queries, and are many times
519smaller than the primary metadata which they represent.
520
521Examples of summary information include:
522
523- Summary counts of free space and inodes
524
525- File link counts from directories
526
527- Quota resource usage counts
528
529Check and repair require full filesystem scans, but resource and lock
530acquisition follow the same paths as regular filesystem accesses.
531
532The superblock summary counters have special requirements due to the underlying
533implementation of the incore counters, and will be treated separately.
534Check and repair of the other types of summary counters (quota resource counts
535and file link counts) employ the same filesystem scanning and hooking
536techniques as outlined above, but because the underlying data are sets of
537integer counters, the staging data need not be a fully functional mirror of the
538ondisk structure.
539
Inspiration for quota and file link count repair strategies was drawn from
sections 2.12 ("Online Index Operations") through 2.14 ("Incremental View
Maintenance") of G. Graefe, `"Concurrent Queries and Updates in Summary Views
543and Their Indexes"
544<http://www.odbms.org/wp-content/uploads/2014/06/Increment-locks.pdf>`_, 2011.
545
546Since quotas are non-negative integer counts of resource usage, online
547quotacheck can use the incremental view deltas described in section 2.14 to
548track pending changes to the block and inode usage counts in each transaction,
549and commit those changes to a dquot side file when the transaction commits.
550Delta tracking is necessary for dquots because the index builder scans inodes,
551whereas the data structure being rebuilt is an index of dquots.
552Link count checking combines the view deltas and commit step into one because
553it sets attributes of the objects being scanned instead of writing them to a
554separate data structure.
555Each online fsck function will be discussed as case studies later in this
556document.
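
As a concrete illustration of the delta idea, consider the following sketch.
The structure and function names are hypothetical; the real quotacheck code
carries considerably more state (dquot identifiers, locking, and the different
kinds of block reservations), but the arithmetic applied at transaction commit
time is essentially this:

.. code-block:: c

	/* Hypothetical example; not actual kernel code. */
	struct shadow_dquot {
		int64_t		blocks;		/* observed block usage */
		int64_t		inodes;		/* observed inode usage */
	};

	struct dquot_delta {
		int64_t		d_blocks;	/* blocks mapped (negative if freed) */
		int64_t		d_inodes;	/* inodes created (negative if freed) */
	};

	/* Fold one transaction's changes into the shadow dquot at commit. */
	static void
	apply_quota_delta(struct shadow_dquot *shadow,
			  const struct dquot_delta *delta)
	{
		shadow->blocks += delta->d_blocks;
		shadow->inodes += delta->d_inodes;
	}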
557
558Risk Management
559---------------
560
561During the development of online fsck, several risk factors were identified
562that may make the feature unsuitable for certain distributors and users.
563Steps can be taken to mitigate or eliminate those risks, though at a cost to
564functionality.
565
566- **Decreased performance**: Adding metadata indices to the filesystem
567  increases the time cost of persisting changes to disk, and the reverse space
568  mapping and directory parent pointers are no exception.
569  System administrators who require the maximum performance can disable the
570  reverse mapping features at format time, though this choice dramatically
571  reduces the ability of online fsck to find inconsistencies and repair them.
572
573- **Incorrect repairs**: As with all software, there might be defects in the
574  software that result in incorrect repairs being written to the filesystem.
575  Systematic fuzz testing (detailed in the next section) is employed by the
576  authors to find bugs early, but it might not catch everything.
577  The kernel build system provides Kconfig options (``CONFIG_XFS_ONLINE_SCRUB``
578  and ``CONFIG_XFS_ONLINE_REPAIR``) to enable distributors to choose not to
579  accept this risk.
580  The xfsprogs build system has a configure option (``--enable-scrub=no``) that
581  disables building of the ``xfs_scrub`` binary, though this is not a risk
582  mitigation if the kernel functionality remains enabled.
583
584- **Inability to repair**: Sometimes, a filesystem is too badly damaged to be
585  repairable.
586  If the keyspaces of several metadata indices overlap in some manner but a
587  coherent narrative cannot be formed from records collected, then the repair
588  fails.
589  To reduce the chance that a repair will fail with a dirty transaction and
590  render the filesystem unusable, the online repair functions have been
591  designed to stage and validate all new records before committing the new
592  structure.
593
594- **Misbehavior**: Online fsck requires many privileges -- raw IO to block
595  devices, opening files by handle, ignoring Unix discretionary access control,
596  and the ability to perform administrative changes.
597  Running this automatically in the background scares people, so the systemd
598  background service is configured to run with only the privileges required.
599  Obviously, this cannot address certain problems like the kernel crashing or
600  deadlocking, but it should be sufficient to prevent the scrub process from
601  escaping and reconfiguring the system.
602  The cron job does not have this protection.
603
604- **Fuzz Kiddiez**: There are many people now who seem to think that running
  automated fuzz testing of ondisk artifacts to find mischievous behavior and
606  spraying exploit code onto the public mailing list for instant zero-day
607  disclosure is somehow of some social benefit.
608  In the view of this author, the benefit is realized only when the fuzz
609  operators help to **fix** the flaws, but this opinion apparently is not
610  widely shared among security "researchers".
611  The XFS maintainers' continuing ability to manage these events presents an
612  ongoing risk to the stability of the development process.
613  Automated testing should front-load some of the risk while the feature is
614  considered EXPERIMENTAL.
615
616Many of these risks are inherent to software programming.
617Despite this, it is hoped that this new functionality will prove useful in
618reducing unexpected downtime.
619
6203. Testing Plan
621===============
622
623As stated before, fsck tools have three main goals:
624
6251. Detect inconsistencies in the metadata;
626
6272. Eliminate those inconsistencies; and
628
6293. Minimize further loss of data.
630
631Demonstrations of correct operation are necessary to build users' confidence
632that the software behaves within expectations.
633Unfortunately, it was not really feasible to perform regular exhaustive testing
634of every aspect of a fsck tool until the introduction of low-cost virtual
635machines with high-IOPS storage.
636With ample hardware availability in mind, the testing strategy for the online
637fsck project involves differential analysis against the existing fsck tools and
638systematic testing of every attribute of every type of metadata object.
639Testing can be split into four major categories, as discussed below.
640
641Integrated Testing with fstests
642-------------------------------
643
644The primary goal of any free software QA effort is to make testing as
645inexpensive and widespread as possible to maximize the scaling advantages of
646community.
647In other words, testing should maximize the breadth of filesystem configuration
648scenarios and hardware setups.
649This improves code quality by enabling the authors of online fsck to find and
650fix bugs early, and helps developers of new features to find integration
651issues earlier in their development effort.
652
653The Linux filesystem community shares a common QA testing suite,
654`fstests <https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/>`_, for
655functional and regression testing.
656Even before development work began on online fsck, fstests (when run on XFS)
657would run both the ``xfs_check`` and ``xfs_repair -n`` commands on the test and
658scratch filesystems between each test.
659This provides a level of assurance that the kernel and the fsck tools stay in
660alignment about what constitutes consistent metadata.
661During development of the online checking code, fstests was modified to run
662``xfs_scrub -n`` between each test to ensure that the new checking code
663produces the same results as the two existing fsck tools.
664
665To start development of online repair, fstests was modified to run
666``xfs_repair`` to rebuild the filesystem's metadata indices between tests.
667This ensures that offline repair does not crash, leave a corrupt filesystem
after it exits, or trigger complaints from the online check.
669This also established a baseline for what can and cannot be repaired offline.
670To complete the first phase of development of online repair, fstests was
671modified to be able to run ``xfs_scrub`` in a "force rebuild" mode.
672This enables a comparison of the effectiveness of online repair as compared to
673the existing offline repair tools.
674
675General Fuzz Testing of Metadata Blocks
676---------------------------------------
677
678XFS benefits greatly from having a very robust debugging tool, ``xfs_db``.
679
680Before development of online fsck even began, a set of fstests were created
681to test the rather common fault that entire metadata blocks get corrupted.
682This required the creation of fstests library code that can create a filesystem
683containing every possible type of metadata object.
684Next, individual test cases were created to create a test filesystem, identify
685a single block of a specific type of metadata object, trash it with the
686existing ``blocktrash`` command in ``xfs_db``, and test the reaction of a
687particular metadata validation strategy.
688
689This earlier test suite enabled XFS developers to test the ability of the
690in-kernel validation functions and the ability of the offline fsck tool to
691detect and eliminate the inconsistent metadata.
692This part of the test suite was extended to cover online fsck in exactly the
693same manner.
694
695In other words, for a given fstests filesystem configuration:
696
697* For each metadata object existing on the filesystem:
698
699  * Write garbage to it
700
701  * Test the reactions of:
702
703    1. The kernel verifiers to stop obviously bad metadata
704    2. Offline repair (``xfs_repair``) to detect and fix
705    3. Online repair (``xfs_scrub``) to detect and fix
706
707Targeted Fuzz Testing of Metadata Records
708-----------------------------------------
709
710The testing plan for online fsck includes extending the existing fs testing
711infrastructure to provide a much more powerful facility: targeted fuzz testing
712of every metadata field of every metadata object in the filesystem.
713``xfs_db`` can modify every field of every metadata structure in every
714block in the filesystem to simulate the effects of memory corruption and
715software bugs.
716Given that fstests already contains the ability to create a filesystem
717containing every metadata format known to the filesystem, ``xfs_db`` can be
718used to perform exhaustive fuzz testing!
719
720For a given fstests filesystem configuration:
721
722* For each metadata object existing on the filesystem...
723
724  * For each record inside that metadata object...
725
726    * For each field inside that record...
727
728      * For each conceivable type of transformation that can be applied to a bit field...
729
730        1. Clear all bits
731        2. Set all bits
732        3. Toggle the most significant bit
733        4. Toggle the middle bit
734        5. Toggle the least significant bit
735        6. Add a small quantity
736        7. Subtract a small quantity
737        8. Randomize the contents
738
739        * ...test the reactions of:
740
741          1. The kernel verifiers to stop obviously bad metadata
742          2. Offline checking (``xfs_repair -n``)
743          3. Offline repair (``xfs_repair``)
744          4. Online checking (``xfs_scrub -n``)
745          5. Online repair (``xfs_scrub``)
746          6. Both repair tools (``xfs_scrub`` and then ``xfs_repair`` if online repair doesn't succeed)
747
748This is quite the combinatoric explosion!
749
750Fortunately, having this much test coverage makes it easy for XFS developers to
751check the responses of XFS' fsck tools.
752Since the introduction of the fuzz testing framework, these tests have been
753used to discover incorrect repair code and missing functionality for entire
754classes of metadata objects in ``xfs_repair``.
755The enhanced testing was used to finalize the deprecation of ``xfs_check`` by
756confirming that ``xfs_repair`` could detect at least as many corruptions as
757the older tool.
758
759These tests have been very valuable for ``xfs_scrub`` in the same ways -- they
760allow the online fsck developers to compare online fsck against offline fsck,
761and they enable XFS developers to find deficiencies in the code base.
762
763Proposed patchsets include
764`general fuzzer improvements
765<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=fuzzer-improvements>`_,
766`fuzzing baselines
767<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=fuzz-baseline>`_,
768and `improvements in fuzz testing comprehensiveness
769<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=more-fuzz-testing>`_.
770
771Stress Testing
772--------------
773
A unique requirement of online fsck is the ability to operate on a filesystem
775concurrently with regular workloads.
776Although it is of course impossible to run ``xfs_scrub`` with *zero* observable
777impact on the running system, the online repair code should never introduce
778inconsistencies into the filesystem metadata, and regular workloads should
779never notice resource starvation.
780To verify that these conditions are being met, fstests has been enhanced in
781the following ways:
782
783* For each scrub item type, create a test to exercise checking that item type
784  while running ``fsstress``.
785* For each scrub item type, create a test to exercise repairing that item type
786  while running ``fsstress``.
787* Race ``fsstress`` and ``xfs_scrub -n`` to ensure that checking the whole
788  filesystem doesn't cause problems.
789* Race ``fsstress`` and ``xfs_scrub`` in force-rebuild mode to ensure that
790  force-repairing the whole filesystem doesn't cause problems.
791* Race ``xfs_scrub`` in check and force-repair mode against ``fsstress`` while
792  freezing and thawing the filesystem.
793* Race ``xfs_scrub`` in check and force-repair mode against ``fsstress`` while
794  remounting the filesystem read-only and read-write.
795* The same, but running ``fsx`` instead of ``fsstress``.  (Not done yet?)
796
797Success is defined by the ability to run all of these tests without observing
798any unexpected filesystem shutdowns due to corrupted metadata, kernel hang
799check warnings, or any other sort of mischief.
800
801Proposed patchsets include `general stress testing
802<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=race-scrub-and-mount-state-changes>`_
803and the `evolution of existing per-function stress testing
804<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=refactor-scrub-stress>`_.
805
8064. User Interface
807=================
808
The primary user of online fsck is the system administrator, just as with
offline repair.
811Online fsck presents two modes of operation to administrators:
812A foreground CLI process for online fsck on demand, and a background service
813that performs autonomous checking and repair.
814
815Checking on Demand
816------------------
817
818For administrators who want the absolute freshest information about the
819metadata in a filesystem, ``xfs_scrub`` can be run as a foreground process on
820a command line.
821The program checks every piece of metadata in the filesystem while the
822administrator waits for the results to be reported, just like the existing
823``xfs_repair`` tool.
824Both tools share a ``-n`` option to perform a read-only scan, and a ``-v``
825option to increase the verbosity of the information reported.
826
827A new feature of ``xfs_scrub`` is the ``-x`` option, which employs the error
828correction capabilities of the hardware to check data file contents.
829The media scan is not enabled by default because it may dramatically increase
830program runtime and consume a lot of bandwidth on older storage hardware.
831
832The output of a foreground invocation is captured in the system log.
833
834The ``xfs_scrub_all`` program walks the list of mounted filesystems and
835initiates ``xfs_scrub`` for each of them in parallel.
836It serializes scans for any filesystems that resolve to the same top level
837kernel block device to prevent resource overconsumption.
838
839Background Service
840------------------
841
842To reduce the workload of system administrators, the ``xfs_scrub`` package
843provides a suite of `systemd <https://systemd.io/>`_ timers and services that
844run online fsck automatically on weekends by default.
845The background service configures scrub to run with as little privilege as
846possible, the lowest CPU and IO priority, and in a CPU-constrained single
847threaded mode.
848This can be tuned by the systemd administrator at any time to suit the latency
849and throughput requirements of customer workloads.
850
851The output of the background service is also captured in the system log.
852If desired, reports of failures (either due to inconsistencies or mere runtime
853errors) can be emailed automatically by setting the ``EMAIL_ADDR`` environment
854variable in the following service files:
855
856* ``xfs_scrub_fail@.service``
857* ``xfs_scrub_media_fail@.service``
858* ``xfs_scrub_all_fail.service``
859
860The decision to enable the background scan is left to the system administrator.
861This can be done by enabling either of the following services:
862
863* ``xfs_scrub_all.timer`` on systemd systems
864* ``xfs_scrub_all.cron`` on non-systemd systems
865
866This automatic weekly scan is configured out of the box to perform an
867additional media scan of all file data once per month.
868This is less foolproof than, say, storing file data block checksums, but much
869more performant if application software provides its own integrity checking,
870redundancy can be provided elsewhere above the filesystem, or the storage
871device's integrity guarantees are deemed sufficient.
872
873The systemd unit file definitions have been subjected to a security audit
874(as of systemd 249) to ensure that the xfs_scrub processes have as little
875access to the rest of the system as possible.
876This was performed via ``systemd-analyze security``, after which privileges
877were restricted to the minimum required, sandboxing was set up to the maximal
extent possible with system call filtering; and access to the
879filesystem tree was restricted to the minimum needed to start the program and
880access the filesystem being scanned.
881The service definition files restrict CPU usage to 80% of one CPU core, and
882apply as nice of a priority to IO and CPU scheduling as possible.
883This measure was taken to minimize delays in the rest of the filesystem.
884No such hardening has been performed for the cron job.
885
886Proposed patchset:
887`Enabling the xfs_scrub background service
888<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-media-scan-service>`_.
889
890Health Reporting
891----------------
892
893XFS caches a summary of each filesystem's health status in memory.
894The information is updated whenever ``xfs_scrub`` is run, or whenever
895inconsistencies are detected in the filesystem metadata during regular
896operations.
897System administrators should use the ``health`` command of ``xfs_spaceman`` to
retrieve this information in a human-readable format.
899If problems have been observed, the administrator can schedule a reduced
900service window to run the online repair tool to correct the problem.
901Failing that, the administrator can decide to schedule a maintenance window to
902run the traditional offline repair tool to correct the problem.
903
904**Future Work Question**: Should the health reporting integrate with the new
905inotify fs error notification system?
906Would it be helpful for sysadmins to have a daemon to listen for corruption
907notifications and initiate a repair?
908
909*Answer*: These questions remain unanswered, but should be a part of the
910conversation with early adopters and potential downstream users of XFS.
911
912Proposed patchsets include
913`wiring up health reports to correction returns
914<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=corruption-health-reports>`_
915and
916`preservation of sickness info during memory reclaim
917<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=indirect-health-reporting>`_.
918
9195. Kernel Algorithms and Data Structures
920========================================
921
922This section discusses the key algorithms and data structures of the kernel
923code that provide the ability to check and repair metadata while the system
924is running.
925The first chapters in this section reveal the pieces that provide the
926foundation for checking metadata.
927The remainder of this section presents the mechanisms through which XFS
928regenerates itself.
929
930Self Describing Metadata
931------------------------
932
933Starting with XFS version 5 in 2012, XFS updated the format of nearly every
934ondisk block header to record a magic number, a checksum, a universally
935"unique" identifier (UUID), an owner code, the ondisk address of the block,
936and a log sequence number.
937When loading a block buffer from disk, the magic number, UUID, owner, and
938ondisk address confirm that the retrieved block matches the specific owner of
939the current filesystem, and that the information contained in the block is
940supposed to be found at the ondisk address.
941The first three components enable checking tools to disregard alleged metadata
942that doesn't belong to the filesystem, and the fourth component enables the
943filesystem to detect lost writes.
944
945Whenever a file system operation modifies a block, the change is submitted
946to the log as part of a transaction.
The log then processes these transactions, marking them done once they are
948safely persisted to storage.
949The logging code maintains the checksum and the log sequence number of the last
950transactional update.
951Checksums are useful for detecting torn writes and other discrepancies that can
952be introduced between the computer and its storage devices.
953Sequence number tracking enables log recovery to avoid applying out of date
954log updates to the filesystem.
955
956These two features improve overall runtime resiliency by providing a means for
957the filesystem to detect obvious corruption when reading metadata blocks from
958disk, but these buffer verifiers cannot provide any consistency checking
959between metadata structures.
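
The self-describing fields are easiest to see in a composite sketch of a v5
block header.
This is not the literal layout of any single ondisk structure -- each
structure defines its own header in ``fs/xfs/libxfs/xfs_format.h`` -- but
every v5 header carries fields along these lines:

.. code-block:: c

	/* Illustrative composite; consult xfs_format.h for real layouts. */
	struct example_block_header {
		__be32		magic;	/* structure type identifier */
		__be32		crc;	/* checksum of the block contents */
		uuid_t		uuid;	/* filesystem that owns this block */
		__be64		owner;	/* AG or inode that owns this block */
		__be64		blkno;	/* ondisk address of this block */
		__be64		lsn;	/* sequence number of the last write */
	};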
960
961For more information, please see the documentation for
962Documentation/filesystems/xfs-self-describing-metadata.rst
963
964Reverse Mapping
965---------------
966
967The original design of XFS (circa 1993) is an improvement upon 1980s Unix
968filesystem design.
969In those days, storage density was expensive, CPU time was scarce, and
970excessive seek time could kill performance.
971For performance reasons, filesystem authors were reluctant to add redundancy to
972the filesystem, even at the cost of data integrity.
Filesystem designers in the early 21st century chose different strategies to
974increase internal redundancy -- either storing nearly identical copies of
975metadata, or more space-efficient encoding techniques.
976
977For XFS, a different redundancy strategy was chosen to modernize the design:
978a secondary space usage index that maps allocated disk extents back to their
979owners.
980By adding a new index, the filesystem retains most of its ability to scale
981well to heavily threaded workloads involving large datasets, since the primary
982file metadata (the directory tree, the file block map, and the allocation
983groups) remain unchanged.
984Like any system that improves redundancy, the reverse-mapping feature increases
985overhead costs for space mapping activities.
986However, it has two critical advantages: first, the reverse index is key to
987enabling online fsck and other requested functionality such as free space
988defragmentation, better media failure reporting, and filesystem shrinking.
989Second, the different ondisk storage format of the reverse mapping btree
990defeats device-level deduplication because the filesystem requires real
991redundancy.
992
993+--------------------------------------------------------------------------+
994| **Sidebar**:                                                             |
995+--------------------------------------------------------------------------+
996| A criticism of adding the secondary index is that it does nothing to     |
997| improve the robustness of user data storage itself.                      |
998| This is a valid point, but adding a new index for file data block        |
999| checksums increases write amplification by turning data overwrites into  |
1000| copy-writes, which age the filesystem prematurely.                       |
1001| In keeping with thirty years of precedent, users who want file data      |
1002| integrity can supply as powerful a solution as they require.             |
1003| As for metadata, the complexity of adding a new secondary index of space |
1004| usage is much less than adding volume management and storage device      |
1005| mirroring to XFS itself.                                                 |
1006| Perfection of RAID and volume management are best left to existing       |
1007| layers in the kernel.                                                    |
1008+--------------------------------------------------------------------------+
1009
1010The information captured in a reverse space mapping record is as follows:
1011
1012.. code-block:: c
1013
1014	struct xfs_rmap_irec {
1015	    xfs_agblock_t    rm_startblock;   /* extent start block */
1016	    xfs_extlen_t     rm_blockcount;   /* extent length */
1017	    uint64_t         rm_owner;        /* extent owner */
1018	    uint64_t         rm_offset;       /* offset within the owner */
1019	    unsigned int     rm_flags;        /* state flags */
1020	};
1021
1022The first two fields capture the location and size of the physical space,
1023in units of filesystem blocks.
The owner field tells scrub which metadata structure or file inode has been
1025assigned this space.
1026For space allocated to files, the offset field tells scrub where the space was
1027mapped within the file fork.
1028Finally, the flags field provides extra information about the space usage --
1029is this an attribute fork extent?  A file mapping btree extent?  Or an
1030unwritten data extent?
1031
1032Online filesystem checking judges the consistency of each primary metadata
1033record by comparing its information against all other space indices.
1034The reverse mapping index plays a key role in the consistency checking process
1035because it contains a centralized alternate copy of all space allocation
1036information.
1037Program runtime and ease of resource acquisition are the only real limits to
1038what online checking can consult.
For example, a file data extent mapping can be checked against the following;
a sketch of one such cross-reference appears after the list:
1040
1041* The absence of an entry in the free space information.
1042* The absence of an entry in the inode index.
1043* The absence of an entry in the reference count data if the file is not
1044  marked as having shared extents.
1045* The correspondence of an entry in the reverse mapping information.
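
The reverse mapping correspondence check in the last item amounts to proving
that some rmap record owned by the file covers the entire extent.
The sketch below is illustrative only: it searches a flat array using
hypothetical type names, whereas the kernel performs a ranged lookup in the
reverse mapping btree while holding the appropriate locks.

.. code-block:: c

	/* Hypothetical example; not actual kernel code. */
	struct example_rmap {
		uint32_t	rm_startblock;
		uint32_t	rm_blockcount;
		uint64_t	rm_owner;
	};

	static bool
	extent_has_rmap(const struct example_rmap *rmaps, unsigned int nr,
			uint32_t agbno, uint32_t len, uint64_t owner)
	{
		for (unsigned int i = 0; i < nr; i++) {
			const struct example_rmap *rec = &rmaps[i];

			if (rec->rm_owner != owner)
				continue;
			/* The record must cover the whole extent. */
			if (rec->rm_startblock <= agbno &&
			    agbno + len <= rec->rm_startblock +
					   rec->rm_blockcount)
				return true;
		}
		return false;
	}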
1046
1047There are several observations to make about reverse mapping indices:
1048
10491. Reverse mappings can provide a positive affirmation of correctness if any of
1050   the above primary metadata are in doubt.
1051   The checking code for most primary metadata follows a path similar to the
1052   one outlined above.
1053
10542. Proving the consistency of secondary metadata with the primary metadata is
1055   difficult because that requires a full scan of all primary space metadata,
1056   which is very time intensive.
1057   For example, checking a reverse mapping record for a file extent mapping
1058   btree block requires locking the file and searching the entire btree to
1059   confirm the block.
1060   Instead, scrub relies on rigorous cross-referencing during the primary space
1061   mapping structure checks.
1062
10633. Consistency scans must use non-blocking lock acquisition primitives if the
1064   required locking order is not the same order used by regular filesystem
1065   operations.
1066   For example, if the filesystem normally takes a file ILOCK before taking
1067   the AGF buffer lock but scrub wants to take a file ILOCK while holding
1068   an AGF buffer lock, scrub cannot block on that second acquisition.
1069   This means that forward progress during this part of a scan of the reverse
1070   mapping data cannot be guaranteed if system load is heavy.
1071
1072In summary, reverse mappings play a key role in reconstruction of primary
1073metadata.
1074The details of how these records are staged, written to disk, and committed
1075into the filesystem are covered in subsequent sections.
1076
1077Checking and Cross-Referencing
1078------------------------------
1079
1080The first step of checking a metadata structure is to examine every record
1081contained within the structure and its relationship with the rest of the
1082system.
1083XFS contains multiple layers of checking to try to prevent inconsistent
1084metadata from wreaking havoc on the system.
1085Each of these layers contributes information that helps the kernel to make
five decisions about the health of a metadata structure:
1087
1088- Is a part of this structure obviously corrupt (``XFS_SCRUB_OFLAG_CORRUPT``) ?
1089- Is this structure inconsistent with the rest of the system
1090  (``XFS_SCRUB_OFLAG_XCORRUPT``) ?
1091- Is there so much damage around the filesystem that cross-referencing is not
1092  possible (``XFS_SCRUB_OFLAG_XFAIL``) ?
1093- Can the structure be optimized to improve performance or reduce the size of
1094  metadata (``XFS_SCRUB_OFLAG_PREEN``) ?
1095- Does the structure contain data that is not inconsistent but deserves review
1096  by the system administrator (``XFS_SCRUB_OFLAG_WARNING``) ?
1097
1098The following sections describe how the metadata scrubbing process works.
1099
1100Metadata Buffer Verification
1101````````````````````````````
1102
The lowest layer of metadata protection in XFS is the set of metadata
verifiers built into the buffer cache.
1105These functions perform inexpensive internal consistency checking of the block
1106itself, and answer these questions:
1107
1108- Does the block belong to this filesystem?
1109
1110- Does the block belong to the structure that asked for the read?
1111  This assumes that metadata blocks only have one owner, which is always true
1112  in XFS.
1113
1114- Is the type of data stored in the block within a reasonable range of what
1115  scrub is expecting?
1116
1117- Does the physical location of the block match the location it was read from?
1118
1119- Does the block checksum match the data?
1120
The scope of the protections here is very limited -- verifiers can only
1122establish that the filesystem code is reasonably free of gross corruption bugs
1123and that the storage system is reasonably competent at retrieval.
1124Corruption problems observed at runtime cause the generation of health reports,
1125failed system calls, and in the extreme case, filesystem shutdowns if the
1126corrupt metadata force the cancellation of a dirty transaction.
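
As a rough model of what a read verifier does (the structure layout and
function below are invented for this document and are not the kernel's
``xfs_buf_ops`` verifiers), consider a self-describing block header carrying a
magic number, the filesystem UUID, the block's own address, and a checksum:

.. code-block:: c

	#include <stdbool.h>
	#include <stdint.h>
	#include <string.h>

	/* Purely illustrative model of a self-describing block header. */
	struct example_block_header {
		uint32_t	magic;		/* structure type */
		uint8_t		uuid[16];	/* filesystem owning this block */
		uint64_t	blkno;		/* where this block should live */
		uint32_t	crc;		/* checksum of the block contents */
	};

	/*
	 * Model of a read verifier: cheap, self-contained checks only.
	 * Cross-referencing with other structures happens at higher levels.
	 */
	static bool example_verify_read(const struct example_block_header *hdr,
					const uint8_t *fs_uuid,
					uint64_t read_from_blkno,
					uint32_t expected_magic,
					uint32_t computed_crc)
	{
		if (hdr->magic != expected_magic)		/* right structure? */
			return false;
		if (memcmp(hdr->uuid, fs_uuid, 16) != 0)	/* right filesystem? */
			return false;
		if (hdr->blkno != read_from_blkno)		/* right location? */
			return false;
		if (hdr->crc != computed_crc)			/* contents intact? */
			return false;
		return true;
	}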
1127
1128Every online fsck scrubbing function is expected to read every ondisk metadata
1129block of a structure in the course of checking the structure.
1130Corruption problems observed during a check are immediately reported to
1131userspace as corruption; during a cross-reference, they are reported as a
1132failure to cross-reference once the full examination is complete.
1133Reads satisfied by a buffer already in cache (and hence already verified)
1134bypass these checks.
1135
1136Internal Consistency Checks
1137```````````````````````````
1138
1139After the buffer cache, the next level of metadata protection is the internal
1140record verification code built into the filesystem.
1141These checks are split between the buffer verifiers, the in-filesystem users of
1142the buffer cache, and the scrub code itself, depending on the amount of higher
1143level context required.
1144The scope of checking is still internal to the block.
1145These higher level checking functions answer these questions:
1146
1147- Does the type of data stored in the block match what scrub is expecting?
1148
1149- Does the block belong to the owning structure that asked for the read?
1150
1151- If the block contains records, do the records fit within the block?
1152
1153- If the block tracks internal free space information, is it consistent with
1154  the record areas?
1155
1156- Are the records contained inside the block free of obvious corruptions?
1157
1158Record checks in this category are more rigorous and more time-intensive.
1159For example, block pointers and inumbers are checked to ensure that they point
1160within the dynamically allocated parts of an allocation group and within
1161the filesystem.
1162Names are checked for invalid characters, and flags are checked for invalid
1163combinations.
1164Other record attributes are checked for sensible values.
1165Btree records spanning an interval of the btree keyspace are checked for
1166correct order and lack of mergeability (except for file fork mappings).
1167For performance reasons, regular code may skip some of these checks unless
1168debugging is enabled or a write is about to occur.
1169Scrub functions, of course, must check all possible problems.
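
For example, a directory entry name check reduces to a few byte-level rules.
The helper below is a standalone illustration, not the kernel's name checking
code:

.. code-block:: c

	#include <stdbool.h>
	#include <stddef.h>

	/*
	 * Illustrative model of a name check: a directory entry name must be
	 * 1-255 bytes long and may not contain NUL or '/' bytes.
	 */
	static bool example_dirent_name_is_valid(const unsigned char *name,
						 size_t len)
	{
		if (len == 0 || len > 255)
			return false;
		for (size_t i = 0; i < len; i++) {
			if (name[i] == 0 || name[i] == '/')
				return false;
		}
		return true;
	}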
1170
1171Validation of Userspace-Controlled Record Attributes
1172````````````````````````````````````````````````````
1173
1174Various pieces of filesystem metadata are directly controlled by userspace.
1175Because of this nature, validation work cannot be more precise than checking
1176that a value is within the possible range.
1177These fields include:
1178
1179- Superblock fields controlled by mount options
1180- Filesystem labels
1181- File timestamps
1182- File permissions
1183- File size
1184- File flags
1185- Names present in directory entries, extended attribute keys, and filesystem
1186  labels
1187- Extended attribute key namespaces
1188- Extended attribute values
1189- File data block contents
1190- Quota limits
1191- Quota timer expiration (if resource usage exceeds the soft limit)
1192
1193Cross-Referencing Space Metadata
1194````````````````````````````````
1195
1196After internal block checks, the next higher level of checking is
1197cross-referencing records between metadata structures.
1198For regular runtime code, the cost of these checks is considered to be
1199prohibitively expensive, but as scrub is dedicated to rooting out
1200inconsistencies, it must pursue all avenues of inquiry.
1201The exact set of cross-referencing is highly dependent on the context of the
1202data structure being checked.
1203
1204The XFS btree code has keyspace scanning functions that online fsck uses to
1205cross reference one structure with another.
1206Specifically, scrub can scan the key space of an index to determine if that
1207keyspace is fully, sparsely, or not at all mapped to records.
1208For the reverse mapping btree, it is possible to mask parts of the key for the
1209purposes of performing a keyspace scan so that scrub can decide if the rmap
1210btree contains records mapping a certain extent of physical space without the
sparseness of the rest of the rmap keyspace getting in the way.
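
Conceptually, a keyspace scan classifies how a queried range is covered by the
records of an index.
The standalone sketch below models that classification for sorted,
non-overlapping extent records; it is not the kernel's btree scanning code:

.. code-block:: c

	#include <stdbool.h>
	#include <stddef.h>

	enum example_keyfill {
		EXAMPLE_EMPTY,	/* no records overlap the query range */
		EXAMPLE_SPARSE,	/* some of the range is covered */
		EXAMPLE_FULL,	/* every block in the range is covered */
	};

	struct example_extent {
		unsigned long long	start;
		unsigned long long	len;
	};

	/* Classify coverage of [qstart, qstart + qlen) by sorted records. */
	static enum example_keyfill
	example_classify_range(const struct example_extent *recs, size_t nr,
			       unsigned long long qstart, unsigned long long qlen)
	{
		unsigned long long	next = qstart;	/* next uncovered block */
		unsigned long long	qend = qstart + qlen;
		bool			found = false;

		for (size_t i = 0; i < nr; i++) {
			unsigned long long rstart = recs[i].start;
			unsigned long long rend = rstart + recs[i].len;

			if (rend <= qstart)
				continue;	/* entirely before the query */
			if (rstart >= qend)
				break;		/* entirely after the query */
			found = true;
			if (rstart > next)
				return EXAMPLE_SPARSE;	/* gap in the range */
			if (rend > next)
				next = rend;
		}
		if (!found)
			return EXAMPLE_EMPTY;
		return next >= qend ? EXAMPLE_FULL : EXAMPLE_SPARSE;
	}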
1212
1213Btree blocks undergo the following checks before cross-referencing:
1214
1215- Does the type of data stored in the block match what scrub is expecting?
1216
1217- Does the block belong to the owning structure that asked for the read?
1218
1219- Do the records fit within the block?
1220
1221- Are the records contained inside the block free of obvious corruptions?
1222
1223- Are the name hashes in the correct order?
1224
1225- Do node pointers within the btree point to valid block addresses for the type
1226  of btree?
1227
1228- Do child pointers point towards the leaves?
1229
1230- Do sibling pointers point across the same level?
1231
- For each node block record, does the record key accurately reflect the
  contents of the child block?
1234
1235Space allocation records are cross-referenced as follows:
1236
1. Any space mentioned by any metadata structure is cross-referenced as
   follows:
1239
1240   - Does the reverse mapping index list only the appropriate owner as the
1241     owner of each block?
1242
1243   - Are none of the blocks claimed as free space?
1244
1245   - If these aren't file data blocks, are none of the blocks claimed as space
1246     shared by different owners?
1247
12482. Btree blocks are cross-referenced as follows:
1249
1250   - Everything in class 1 above.
1251
1252   - If there's a parent node block, do the keys listed for this block match the
1253     keyspace of this block?
1254
1255   - Do the sibling pointers point to valid blocks?  Of the same level?
1256
1257   - Do the child pointers point to valid blocks?  Of the next level down?
1258
12593. Free space btree records are cross-referenced as follows:
1260
1261   - Everything in class 1 and 2 above.
1262
1263   - Does the reverse mapping index list no owners of this space?
1264
1265   - Is this space not claimed by the inode index for inodes?
1266
1267   - Is it not mentioned by the reference count index?
1268
1269   - Is there a matching record in the other free space btree?
1270
12714. Inode btree records are cross-referenced as follows:
1272
1273   - Everything in class 1 and 2 above.
1274
   - Is there a matching record in the free inode btree?
1276
1277   - Do cleared bits in the holemask correspond with inode clusters?
1278
1279   - Do set bits in the freemask correspond with inode records with zero link
1280     count?
1281
12825. Inode records are cross-referenced as follows:
1283
1284   - Everything in class 1.
1285
1286   - Do all the fields that summarize information about the file forks actually
1287     match those forks?
1288
1289   - Does each inode with zero link count correspond to a record in the free
1290     inode btree?
1291
12926. File fork space mapping records are cross-referenced as follows:
1293
1294   - Everything in class 1 and 2 above.
1295
1296   - Is this space not mentioned by the inode btrees?
1297
1298   - If this is a CoW fork mapping, does it correspond to a CoW entry in the
1299     reference count btree?
1300
13017. Reference count records are cross-referenced as follows:
1302
1303   - Everything in class 1 and 2 above.
1304
1305   - Within the space subkeyspace of the rmap btree (that is to say, all
1306     records mapped to a particular space extent and ignoring the owner info),
1307     are there the same number of reverse mapping records for each block as the
1308     reference count record claims?
1309
1310Proposed patchsets are the series to find gaps in
1311`refcount btree
1312<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-refcount-gaps>`_,
1313`inode btree
1314<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-inobt-gaps>`_, and
1315`rmap btree
1316<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-rmapbt-gaps>`_ records;
1317to find
1318`mergeable records
1319<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-mergeable-records>`_;
1320and to
1321`improve cross referencing with rmap
1322<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-strengthen-rmap-checking>`_
1323before starting a repair.
1324
1325Checking Extended Attributes
1326````````````````````````````
1327
1328Extended attributes implement a key-value store that enable fragments of data
1329to be attached to any file.
1330Both the kernel and userspace can access the keys and values, subject to
1331namespace and privilege restrictions.
1332Most typically these fragments are metadata about the file -- origins, security
1333contexts, user-supplied labels, indexing information, etc.
1334
1335Names can be as long as 255 bytes and can exist in several different
1336namespaces.
1337Values can be as large as 64KB.
1338A file's extended attributes are stored in blocks mapped by the attr fork.
1339The mappings point to leaf blocks, remote value blocks, or dabtree blocks.
1340Block 0 in the attribute fork is always the top of the structure, but otherwise
1341each of the three types of blocks can be found at any offset in the attr fork.
1342Leaf blocks contain attribute key records that point to the name and the value.
1343Names are always stored elsewhere in the same leaf block.
1344Values that are less than 3/4 the size of a filesystem block are also stored
1345elsewhere in the same leaf block.
1346Remote value blocks contain values that are too large to fit inside a leaf.
1347If the leaf information exceeds a single filesystem block, a dabtree (also
1348rooted at block 0) is created to map hashes of the attribute names to leaf
1349blocks in the attr fork.
1350
Checking an extended attribute structure is not so straightforward due to the
1352lack of separation between attr blocks and index blocks.
1353Scrub must read each block mapped by the attr fork and ignore the non-leaf
1354blocks:
1355
13561. Walk the dabtree in the attr fork (if present) to ensure that there are no
1357   irregularities in the blocks or dabtree mappings that do not point to
1358   attr leaf blocks.
1359
13602. Walk the blocks of the attr fork looking for leaf blocks.
1361   For each entry inside a leaf:
1362
1363   a. Validate that the name does not contain invalid characters.
1364
1365   b. Read the attr value.
1366      This performs a named lookup of the attr name to ensure the correctness
1367      of the dabtree.
1368      If the value is stored in a remote block, this also validates the
1369      integrity of the remote value block.
1370
1371Checking and Cross-Referencing Directories
1372``````````````````````````````````````````
1373
The filesystem directory tree is a directed acyclic graph structure, with files
1375constituting the nodes, and directory entries (dirents) constituting the edges.
1376Directories are a special type of file containing a set of mappings from a
1377255-byte sequence (name) to an inumber.
1378These are called directory entries, or dirents for short.
1379Each directory file must have exactly one directory pointing to the file.
1380A root directory points to itself.
1381Directory entries point to files of any type.
1382Each non-directory file may have multiple directories point to it.
1383
1384In XFS, directories are implemented as a file containing up to three 32GB
1385partitions.
1386The first partition contains directory entry data blocks.
1387Each data block contains variable-sized records associating a user-provided
1388name with an inumber and, optionally, a file type.
1389If the directory entry data grows beyond one block, the second partition (which
1390exists as post-EOF extents) is populated with a block containing free space
1391information and an index that maps hashes of the dirent names to directory data
1392blocks in the first partition.
1393This makes directory name lookups very fast.
1394If this second partition grows beyond one block, the third partition is
1395populated with a linear array of free space information for faster
1396expansions.
1397If the free space has been separated and the second partition grows again
1398beyond one block, then a dabtree is used to map hashes of dirent names to
1399directory data blocks.
1400
Checking a directory is pretty straightforward:
1402
14031. Walk the dabtree in the second partition (if present) to ensure that there
1404   are no irregularities in the blocks or dabtree mappings that do not point to
1405   dirent blocks.
1406
14072. Walk the blocks of the first partition looking for directory entries.
1408   Each dirent is checked as follows:
1409
1410   a. Does the name contain no invalid characters?
1411
1412   b. Does the inumber correspond to an actual, allocated inode?
1413
1414   c. Does the child inode have a nonzero link count?
1415
1416   d. If a file type is included in the dirent, does it match the type of the
1417      inode?
1418
1419   e. If the child is a subdirectory, does the child's dotdot pointer point
1420      back to the parent?
1421
1422   f. If the directory has a second partition, perform a named lookup of the
1423      dirent name to ensure the correctness of the dabtree.
1424
14253. Walk the free space list in the third partition (if present) to ensure that
1426   the free spaces it describes are really unused.
1427
1428Checking operations involving :ref:`parents <dirparent>` and
1429:ref:`file link counts <nlinks>` are discussed in more detail in later
1430sections.
1431
1432Checking Directory/Attribute Btrees
1433^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1434
1435As stated in previous sections, the directory/attribute btree (dabtree) index
1436maps user-provided names to improve lookup times by avoiding linear scans.
1437Internally, it maps a 32-bit hash of the name to a block offset within the
1438appropriate file fork.
1439
1440The internal structure of a dabtree closely resembles the btrees that record
1441fixed-size metadata records -- each dabtree block contains a magic number, a
1442checksum, sibling pointers, a UUID, a tree level, and a log sequence number.
The format of leaf and node records is the same -- each entry points to the
1444next level down in the hierarchy, with dabtree node records pointing to dabtree
1445leaf blocks, and dabtree leaf records pointing to non-dabtree blocks elsewhere
1446in the fork.
1447
1448Checking and cross-referencing the dabtree is very similar to what is done for
1449space btrees:
1450
1451- Does the type of data stored in the block match what scrub is expecting?
1452
1453- Does the block belong to the owning structure that asked for the read?
1454
1455- Do the records fit within the block?
1456
1457- Are the records contained inside the block free of obvious corruptions?
1458
1459- Are the name hashes in the correct order?
1460
1461- Do node pointers within the dabtree point to valid fork offsets for dabtree
1462  blocks?
1463
1464- Do leaf pointers within the dabtree point to valid fork offsets for directory
1465  or attr leaf blocks?
1466
1467- Do child pointers point towards the leaves?
1468
1469- Do sibling pointers point across the same level?
1470
- For each dabtree node record, does the record key accurately reflect the
  contents of the child dabtree block?
1473
- For each dabtree leaf record, does the record key accurately reflect the
  contents of the directory or attr block?
1476
1477Cross-Referencing Summary Counters
1478``````````````````````````````````
1479
1480XFS maintains three classes of summary counters: available resources, quota
1481resource usage, and file link counts.
1482
1483In theory, the amount of available resources (data blocks, inodes, realtime
1484extents) can be found by walking the entire filesystem.
1485This would make for very slow reporting, so a transactional filesystem can
1486maintain summaries of this information in the superblock.
1487Cross-referencing these values against the filesystem metadata should be a
1488simple matter of walking the free space and inode metadata in each AG and the
1489realtime bitmap, but there are complications that will be discussed in
1490:ref:`more detail <fscounters>` later.
1491
1492:ref:`Quota usage <quotacheck>` and :ref:`file link count <nlinks>`
1493checking are sufficiently complicated to warrant separate sections.
1494
1495Post-Repair Reverification
1496``````````````````````````
1497
1498After performing a repair, the checking code is run a second time to validate
1499the new structure, and the results of the health assessment are recorded
1500internally and returned to the calling process.
This step is critical for enabling the system administrator to monitor the status
1502of the filesystem and the progress of any repairs.
1503For developers, it is a useful means to judge the efficacy of error detection
1504and correction in the online and offline checking tools.
1505
1506Eventual Consistency vs. Online Fsck
1507------------------------------------
1508
1509Complex operations can make modifications to multiple per-AG data structures
1510with a chain of transactions.
1511These chains, once committed to the log, are restarted during log recovery if
1512the system crashes while processing the chain.
1513Because the AG header buffers are unlocked between transactions within a chain,
1514online checking must coordinate with chained operations that are in progress to
1515avoid incorrectly detecting inconsistencies due to pending chains.
1516Furthermore, online repair must not run when operations are pending because
1517the metadata are temporarily inconsistent with each other, and rebuilding is
1518not possible.
1519
Only online fsck has this requirement of total consistency of AG metadata, and
such checks should be relatively rare compared to filesystem change operations.
1522Online fsck coordinates with transaction chains as follows:
1523
* For each AG, maintain a count of intent items targeting that AG.
1525  The count should be bumped whenever a new item is added to the chain.
1526  The count should be dropped when the filesystem has locked the AG header
1527  buffers and finished the work.
1528
1529* When online fsck wants to examine an AG, it should lock the AG header
1530  buffers to quiesce all transaction chains that want to modify that AG.
1531  If the count is zero, proceed with the checking operation.
1532  If it is nonzero, cycle the buffer locks to allow the chain to make forward
1533  progress.
1534
1535This may lead to online fsck taking a long time to complete, but regular
1536filesystem updates take precedence over background checking activity.
1537Details about the discovery of this situation are presented in the
1538:ref:`next section <chain_coordination>`, and details about the solution
1539are presented :ref:`after that<intent_drains>`.
1540
1541.. _chain_coordination:
1542
1543Discovery of the Problem
1544````````````````````````
1545
1546Midway through the development of online scrubbing, the fsstress tests
1547uncovered a misinteraction between online fsck and compound transaction chains
1548created by other writer threads that resulted in false reports of metadata
1549inconsistency.
1550The root cause of these reports is the eventual consistency model introduced by
1551the expansion of deferred work items and compound transaction chains when
1552reverse mapping and reflink were introduced.
1553
1554Originally, transaction chains were added to XFS to avoid deadlocks when
1555unmapping space from files.
1556Deadlock avoidance rules require that AGs only be locked in increasing order,
1557which makes it impossible (say) to use a single transaction to free a space
1558extent in AG 7 and then try to free a now superfluous block mapping btree block
1559in AG 3.
1560To avoid these kinds of deadlocks, XFS creates Extent Freeing Intent (EFI) log
1561items to commit to freeing some space in one transaction while deferring the
1562actual metadata updates to a fresh transaction.
1563The transaction sequence looks like this:
1564
15651. The first transaction contains a physical update to the file's block mapping
1566   structures to remove the mapping from the btree blocks.
1567   It then attaches to the in-memory transaction an action item to schedule
1568   deferred freeing of space.
1569   Concretely, each transaction maintains a list of ``struct
1570   xfs_defer_pending`` objects, each of which maintains a list of ``struct
1571   xfs_extent_free_item`` objects.
1572   Returning to the example above, the action item tracks the freeing of both
1573   the unmapped space from AG 7 and the block mapping btree (BMBT) block from
1574   AG 3.
1575   Deferred frees recorded in this manner are committed in the log by creating
1576   an EFI log item from the ``struct xfs_extent_free_item`` object and
1577   attaching the log item to the transaction.
1578   When the log is persisted to disk, the EFI item is written into the ondisk
1579   transaction record.
1580   EFIs can list up to 16 extents to free, all sorted in AG order.
1581
15822. The second transaction contains a physical update to the free space btrees
1583   of AG 3 to release the former BMBT block and a second physical update to the
1584   free space btrees of AG 7 to release the unmapped file space.
   Observe that the physical updates are resequenced in the correct order
1586   when possible.
   Attached to the transaction is an extent free done (EFD) log item.
1588   The EFD contains a pointer to the EFI logged in transaction #1 so that log
1589   recovery can tell if the EFI needs to be replayed.
1590
1591If the system goes down after transaction #1 is written back to the filesystem
1592but before #2 is committed, a scan of the filesystem metadata would show
1593inconsistent filesystem metadata because there would not appear to be any owner
1594of the unmapped space.
1595Happily, log recovery corrects this inconsistency for us -- when recovery finds
1596an intent log item but does not find a corresponding intent done item, it will
1597reconstruct the incore state of the intent item and finish it.
1598In the example above, the log must replay both frees described in the recovered
1599EFI to complete the recovery phase.
1600
1601There are subtleties to XFS' transaction chaining strategy to consider:
1602
1603* Log items must be added to a transaction in the correct order to prevent
1604  conflicts with principal objects that are not held by the transaction.
1605  In other words, all per-AG metadata updates for an unmapped block must be
1606  completed before the last update to free the extent, and extents should not
1607  be reallocated until that last update commits to the log.
1608
1609* AG header buffers are released between each transaction in a chain.
1610  This means that other threads can observe an AG in an intermediate state,
1611  but as long as the first subtlety is handled, this should not affect the
1612  correctness of filesystem operations.
1613
1614* Unmounting the filesystem flushes all pending work to disk, which means that
1615  offline fsck never sees the temporary inconsistencies caused by deferred
1616  work item processing.
1617
1618In this manner, XFS employs a form of eventual consistency to avoid deadlocks
1619and increase parallelism.
1620
1621During the design phase of the reverse mapping and reflink features, it was
1622decided that it was impractical to cram all the reverse mapping updates for a
1623single filesystem change into a single transaction because a single file
1624mapping operation can explode into many small updates:
1625
1626* The block mapping update itself
1627* A reverse mapping update for the block mapping update
1628* Fixing the freelist
1629* A reverse mapping update for the freelist fix
1630
1631* A shape change to the block mapping btree
1632* A reverse mapping update for the btree update
1633* Fixing the freelist (again)
1634* A reverse mapping update for the freelist fix
1635
1636* An update to the reference counting information
1637* A reverse mapping update for the refcount update
1638* Fixing the freelist (a third time)
1639* A reverse mapping update for the freelist fix
1640
1641* Freeing any space that was unmapped and not owned by any other file
1642* Fixing the freelist (a fourth time)
1643* A reverse mapping update for the freelist fix
1644
1645* Freeing the space used by the block mapping btree
1646* Fixing the freelist (a fifth time)
1647* A reverse mapping update for the freelist fix
1648
1649Free list fixups are not usually needed more than once per AG per transaction
1650chain, but it is theoretically possible if space is very tight.
1651For copy-on-write updates this is even worse, because this must be done once to
1652remove the space from a staging area and again to map it into the file!
1653
1654To deal with this explosion in a calm manner, XFS expands its use of deferred
1655work items to cover most reverse mapping updates and all refcount updates.
1656This reduces the worst case size of transaction reservations by breaking the
1657work into a long chain of small updates, which increases the degree of eventual
1658consistency in the system.
1659Again, this generally isn't a problem because XFS orders its deferred work
1660items carefully to avoid resource reuse conflicts between unsuspecting threads.
1661
1662However, online fsck changes the rules -- remember that although physical
1663updates to per-AG structures are coordinated by locking the buffers for AG
1664headers, buffer locks are dropped between transactions.
1665Once scrub acquires resources and takes locks for a data structure, it must do
1666all the validation work without releasing the lock.
1667If the main lock for a space btree is an AG header buffer lock, scrub may have
1668interrupted another thread that is midway through finishing a chain.
1669For example, if a thread performing a copy-on-write has completed a reverse
1670mapping update but not the corresponding refcount update, the two AG btrees
1671will appear inconsistent to scrub and an observation of corruption will be
1672recorded.  This observation will not be correct.
1673If a repair is attempted in this state, the results will be catastrophic!
1674
1675Several other solutions to this problem were evaluated upon discovery of this
1676flaw and rejected:
1677
16781. Add a higher level lock to allocation groups and require writer threads to
1679   acquire the higher level lock in AG order before making any changes.
1680   This would be very difficult to implement in practice because it is
1681   difficult to determine which locks need to be obtained, and in what order,
1682   without simulating the entire operation.
1683   Performing a dry run of a file operation to discover necessary locks would
1684   make the filesystem very slow.
1685
16862. Make the deferred work coordinator code aware of consecutive intent items
1687   targeting the same AG and have it hold the AG header buffers locked across
1688   the transaction roll between updates.
1689   This would introduce a lot of complexity into the coordinator since it is
1690   only loosely coupled with the actual deferred work items.
1691   It would also fail to solve the problem because deferred work items can
1692   generate new deferred subtasks, but all subtasks must be complete before
1693   work can start on a new sibling task.
1694
16953. Teach online fsck to walk all transactions waiting for whichever lock(s)
1696   protect the data structure being scrubbed to look for pending operations.
1697   The checking and repair operations must factor these pending operations into
1698   the evaluations being performed.
1699   This solution is a nonstarter because it is *extremely* invasive to the main
1700   filesystem.
1701
1702.. _intent_drains:
1703
1704Intent Drains
1705`````````````
1706
1707Online fsck uses an atomic intent item counter and lock cycling to coordinate
1708with transaction chains.
1709There are two key properties to the drain mechanism.
1710First, the counter is incremented when a deferred work item is *queued* to a
1711transaction, and it is decremented after the associated intent done log item is
1712*committed* to another transaction.
1713The second property is that deferred work can be added to a transaction without
1714holding an AG header lock, but per-AG work items cannot be marked done without
1715locking that AG header buffer to log the physical updates and the intent done
1716log item.
1717The first property enables scrub to yield to running transaction chains, which
1718is an explicit deprioritization of online fsck to benefit file operations.
1719The second property of the drain is key to the correct coordination of scrub,
1720since scrub will always be able to decide if a conflict is possible.
1721
1722For regular filesystem code, the drain works as follows:
1723
17241. Call the appropriate subsystem function to add a deferred work item to a
1725   transaction.
1726
17272. The function calls ``xfs_defer_drain_bump`` to increase the counter.
1728
17293. When the deferred item manager wants to finish the deferred work item, it
1730   calls ``->finish_item`` to complete it.
1731
17324. The ``->finish_item`` implementation logs some changes and calls
1733   ``xfs_defer_drain_drop`` to decrease the sloppy counter and wake up any threads
1734   waiting on the drain.
1735
17365. The subtransaction commits, which unlocks the resource associated with the
1737   intent item.
1738
1739For scrub, the drain works as follows:
1740
17411. Lock the resource(s) associated with the metadata being scrubbed.
1742   For example, a scan of the refcount btree would lock the AGI and AGF header
1743   buffers.
1744
17452. If the counter is zero (``xfs_defer_drain_busy`` returns false), there are no
1746   chains in progress and the operation may proceed.
1747
17483. Otherwise, release the resources grabbed in step 1.
1749
17504. Wait for the intent counter to reach zero (``xfs_defer_drain_intents``), then go
1751   back to step 1 unless a signal has been caught.
1752
1753To avoid polling in step 4, the drain provides a waitqueue for scrub threads to
1754be woken up whenever the intent count drops to zero.
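
The behavior described above can be modeled with an ordinary counter and a
wait queue.
The sketch below is a simplified userspace analogue using pthreads; the real
drain uses an atomic counter and a kernel waitqueue behind the
``xfs_defer_drain_*`` helpers named above, and the ``example_`` names are
invented:

.. code-block:: c

	#include <pthread.h>

	/* Simplified model of a per-AG intent drain; assume static init. */
	struct example_drain {
		pthread_mutex_t	lock;
		pthread_cond_t	idle;
		unsigned int	pending;	/* queued but unfinished intents */
	};

	/* Writer: a deferred work item was queued to a transaction. */
	static void example_drain_bump(struct example_drain *dr)
	{
		pthread_mutex_lock(&dr->lock);
		dr->pending++;
		pthread_mutex_unlock(&dr->lock);
	}

	/* Writer: the intent done item was committed; wake any waiters. */
	static void example_drain_drop(struct example_drain *dr)
	{
		pthread_mutex_lock(&dr->lock);
		if (--dr->pending == 0)
			pthread_cond_broadcast(&dr->idle);
		pthread_mutex_unlock(&dr->lock);
	}

	/* Scrub: having released the AG header buffers, wait for idle. */
	static void example_drain_wait(struct example_drain *dr)
	{
		pthread_mutex_lock(&dr->lock);
		while (dr->pending > 0)
			pthread_cond_wait(&dr->idle, &dr->lock);
		pthread_mutex_unlock(&dr->lock);
	}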
1755
1756The proposed patchset is the
1757`scrub intent drain series
1758<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-drain-intents>`_.
1759
1760.. _jump_labels:
1761
1762Static Keys (aka Jump Label Patching)
1763`````````````````````````````````````
1764
1765Online fsck for XFS separates the regular filesystem from the checking and
1766repair code as much as possible.
1767However, there are a few parts of online fsck (such as the intent drains, and
1768later, live update hooks) where it is useful for the online fsck code to know
1769what's going on in the rest of the filesystem.
1770Since it is not expected that online fsck will be constantly running in the
1771background, it is very important to minimize the runtime overhead imposed by
1772these hooks when online fsck is compiled into the kernel but not actively
1773running on behalf of userspace.
1774Taking locks in the hot path of a writer thread to access a data structure only
1775to find that no further action is necessary is expensive -- on the author's
computer, this has an overhead of 40-50ns per access.
1777Fortunately, the kernel supports dynamic code patching, which enables XFS to
1778replace a static branch to hook code with ``nop`` sleds when online fsck isn't
1779running.
1780This sled has an overhead of however long it takes the instruction decoder to
1781skip past the sled, which seems to be on the order of less than 1ns and
1782does not access memory outside of instruction fetching.
1783
1784When online fsck enables the static key, the sled is replaced with an
1785unconditional branch to call the hook code.
1786The switchover is quite expensive (~22000ns) but is paid entirely by the
1787program that invoked online fsck, and can be amortized if multiple threads
1788enter online fsck at the same time, or if multiple filesystems are being
1789checked at the same time.
1790Changing the branch direction requires taking the CPU hotplug lock, and since
1791CPU initialization requires memory allocation, online fsck must be careful not
1792to change a static key while holding any locks or resources that could be
1793accessed in the memory reclaim paths.
1794To minimize contention on the CPU hotplug lock, care should be taken not to
1795enable or disable static keys unnecessarily.
1796
1797Because static keys are intended to minimize hook overhead for regular
1798filesystem operations when xfs_scrub is not running, the intended usage
1799patterns are as follows:
1800
1801- The hooked part of XFS should declare a static-scoped static key that
1802  defaults to false.
1803  The ``DEFINE_STATIC_KEY_FALSE`` macro takes care of this.
1804  The static key itself should be declared as a ``static`` variable.
1805
1806- When deciding to invoke code that's only used by scrub, the regular
1807  filesystem should call the ``static_branch_unlikely`` predicate to avoid the
1808  scrub-only hook code if the static key is not enabled.
1809
1810- The regular filesystem should export helper functions that call
1811  ``static_branch_inc`` to enable and ``static_branch_dec`` to disable the
1812  static key.
1813  Wrapper functions make it easy to compile out the relevant code if the kernel
1814  distributor turns off online fsck at build time.
1815
1816- Scrub functions wanting to turn on scrub-only XFS functionality should call
1817  the ``xchk_fsgates_enable`` from the setup function to enable a specific
1818  hook.
1819  This must be done before obtaining any resources that are used by memory
1820  reclaim.
1821  Callers had better be sure they really need the functionality gated by the
1822  static key; the ``TRY_HARDER`` flag is useful here.
1823
1824Online scrub has resource acquisition helpers (e.g. ``xchk_perag_lock``) to
1825handle locking AGI and AGF buffers for all scrubber functions.
1826If it detects a conflict between scrub and the running transactions, it will
1827try to wait for intents to complete.
1828If the caller of the helper has not enabled the static key, the helper will
1829return -EDEADLOCK, which should result in the scrub being restarted with the
1830``TRY_HARDER`` flag set.
1831The scrub setup function should detect that flag, enable the static key, and
1832try the scrub again.
1833Scrub teardown disables all static keys obtained by ``xchk_fsgates_enable``.
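
A brief sketch of the hook pattern described above follows.
Only the jump label macros (``DEFINE_STATIC_KEY_FALSE``,
``static_branch_unlikely``, ``static_branch_inc``, ``static_branch_dec``) are
real kernel API; the ``xfs_example_*`` names are invented for illustration:

.. code-block:: c

	#include <linux/jump_label.h>

	/* Hypothetical hook body; the name is illustrative only. */
	void xfs_example_hook_slowpath(unsigned int arg);

	/* Defaults to false, so the hot path below compiles to a nop sled. */
	static DEFINE_STATIC_KEY_FALSE(xfs_example_hook_switch);

	/* Regular filesystem hot path: skip the hook unless scrub enabled it. */
	static inline void xfs_example_hook_call(unsigned int arg)
	{
		if (static_branch_unlikely(&xfs_example_hook_switch))
			xfs_example_hook_slowpath(arg);
	}

	/* Helpers exported to the scrub setup and teardown code. */
	void xfs_example_hook_enable(void)
	{
		static_branch_inc(&xfs_example_hook_switch);
	}

	void xfs_example_hook_disable(void)
	{
		static_branch_dec(&xfs_example_hook_switch);
	}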
1834
1835For more information, please see the kernel documentation of
1836Documentation/staging/static-keys.rst.
1837
1838.. _xfile:
1839
1840Pageable Kernel Memory
1841----------------------
1842
1843Some online checking functions work by scanning the filesystem to build a
1844shadow copy of an ondisk metadata structure in memory and comparing the two
1845copies.
1846For online repair to rebuild a metadata structure, it must compute the record
1847set that will be stored in the new structure before it can persist that new
1848structure to disk.
1849Ideally, repairs complete with a single atomic commit that introduces
1850a new data structure.
1851To meet these goals, the kernel needs to collect a large amount of information
1852in a place that doesn't require the correct operation of the filesystem.
1853
1854Kernel memory isn't suitable because:
1855
1856* Allocating a contiguous region of memory to create a C array is very
1857  difficult, especially on 32-bit systems.
1858
1859* Linked lists of records introduce double pointer overhead which is very high
1860  and eliminate the possibility of indexed lookups.
1861
1862* Kernel memory is pinned, which can drive the system into OOM conditions.
1863
1864* The system might not have sufficient memory to stage all the information.
1865
1866At any given time, online fsck does not need to keep the entire record set in
1867memory, which means that individual records can be paged out if necessary.
1868Continued development of online fsck demonstrated that the ability to perform
1869indexed data storage would also be very useful.
1870Fortunately, the Linux kernel already has a facility for byte-addressable and
1871pageable storage: tmpfs.
1872In-kernel graphics drivers (most notably i915) take advantage of tmpfs files
1873to store intermediate data that doesn't need to be in memory at all times, so
1874that usage precedent is already established.
1875Hence, the ``xfile`` was born!
1876
1877+--------------------------------------------------------------------------+
1878| **Historical Sidebar**:                                                  |
1879+--------------------------------------------------------------------------+
1880| The first edition of online repair inserted records into a new btree as  |
1881| it found them, which failed because filesystem could shut down with a    |
1882| built data structure, which would be live after recovery finished.       |
1883|                                                                          |
1884| The second edition solved the half-rebuilt structure problem by storing  |
1885| everything in memory, but frequently ran the system out of memory.       |
1886|                                                                          |
1887| The third edition solved the OOM problem by using linked lists, but the  |
1888| memory overhead of the list pointers was extreme.                        |
1889+--------------------------------------------------------------------------+
1890
1891xfile Access Models
1892```````````````````
1893
1894A survey of the intended uses of xfiles suggested these use cases:
1895
18961. Arrays of fixed-sized records (space management btrees, directory and
1897   extended attribute entries)
1898
18992. Sparse arrays of fixed-sized records (quotas and link counts)
1900
19013. Large binary objects (BLOBs) of variable sizes (directory and extended
1902   attribute names and values)
1903
19044. Staging btrees in memory (reverse mapping btrees)
1905
19065. Arbitrary contents (realtime space management)
1907
1908To support the first four use cases, high level data structures wrap the xfile
1909to share functionality between online fsck functions.
1910The rest of this section discusses the interfaces that the xfile presents to
1911four of those five higher level data structures.
1912The fifth use case is discussed in the :ref:`realtime summary <rtsummary>` case
1913study.
1914
1915The most general storage interface supported by the xfile enables the reading
1916and writing of arbitrary quantities of data at arbitrary offsets in the xfile.
1917This capability is provided by ``xfile_pread`` and ``xfile_pwrite`` functions,
1918which behave similarly to their userspace counterparts.
1919XFS is very record-based, which suggests that the ability to load and store
1920complete records is important.
1921To support these cases, a pair of ``xfile_obj_load`` and ``xfile_obj_store``
1922functions are provided to read and persist objects into an xfile.
1923They are internally the same as pread and pwrite, except that they treat any
1924error as an out of memory error.
1925For online repair, squashing error conditions in this manner is an acceptable
1926behavior because the only reaction is to abort the operation back to userspace.
1927All five xfile usecases can be serviced by these four functions.
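
As a sketch of how a repair function might stage fixed-size records with these
functions, consider the fragment below.
The record layout is invented, and the exact signatures of ``xfile_obj_store``
and ``xfile_obj_load`` shown here are assumptions based on the description
above:

.. code-block:: c

	/* Invented fixed-size record staged during a repair. */
	struct example_rec {
		__u64	startblock;
		__u64	blockcount;
	};

	/* Persist one record at a position computed from its index. */
	static int example_stash_rec(struct xfile *xf, __u64 idx,
				     const struct example_rec *rec)
	{
		/* signature assumed: (xfile, buffer, length, offset) */
		return xfile_obj_store(xf, rec, sizeof(*rec),
				       idx * sizeof(*rec));
	}

	/* Recall the record later; failures are reported as out of memory. */
	static int example_recall_rec(struct xfile *xf, __u64 idx,
				      struct example_rec *rec)
	{
		return xfile_obj_load(xf, rec, sizeof(*rec),
				      idx * sizeof(*rec));
	}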
1928
1929However, no discussion of file access idioms is complete without answering the
1930question, "But what about mmap?"
1931It is convenient to access storage directly with pointers, just like userspace
1932code does with regular memory.
1933Online fsck must not drive the system into OOM conditions, which means that
1934xfiles must be responsive to memory reclamation.
1935tmpfs can only push a pagecache folio to the swap cache if the folio is neither
1936pinned nor locked, which means the xfile must not pin too many folios.
1937
1938Short term direct access to xfile contents is done by locking the pagecache
1939folio and mapping it into kernel address space.
1940Programmatic access (e.g. pread and pwrite) uses this mechanism.
1941Folio locks are not supposed to be held for long periods of time, so long
1942term direct access to xfile contents is done by bumping the folio refcount,
1943mapping it into kernel address space, and dropping the folio lock.
1944These long term users *must* be responsive to memory reclaim by hooking into
1945the shrinker infrastructure to know when to release folios.
1946
1947The ``xfile_get_page`` and ``xfile_put_page`` functions are provided to
1948retrieve the (locked) folio that backs part of an xfile and to release it.
The only users of these folio lease functions are the xfarray
:ref:`sorting<xfarray_sort>` algorithms and the :ref:`in-memory
btrees<xfbtree>`.
1952
1953xfile Access Coordination
1954`````````````````````````
1955
1956For security reasons, xfiles must be owned privately by the kernel.
1957They are marked ``S_PRIVATE`` to prevent interference from the security system,
1958must never be mapped into process file descriptor tables, and their pages must
1959never be mapped into userspace processes.
1960
1961To avoid locking recursion issues with the VFS, all accesses to the shmfs file
1962are performed by manipulating the page cache directly.
1963xfile writers call the ``->write_begin`` and ``->write_end`` functions of the
1964xfile's address space to grab writable pages, copy the caller's buffer into the
1965page, and release the pages.
1966xfile readers call ``shmem_read_mapping_page_gfp`` to grab pages directly
1967before copying the contents into the caller's buffer.
1968In other words, xfiles ignore the VFS read and write code paths to avoid
1969having to create a dummy ``struct kiocb`` and to avoid taking inode and
1970freeze locks.
1971tmpfs cannot be frozen, and xfiles must not be exposed to userspace.
1972
1973If an xfile is shared between threads to stage repairs, the caller must provide
1974its own locks to coordinate access.
1975For example, if a scrub function stores scan results in an xfile and needs
1976other threads to provide updates to the scanned data, the scrub function must
1977provide a lock for all threads to share.
1978
1979.. _xfarray:
1980
1981Arrays of Fixed-Sized Records
1982`````````````````````````````
1983
1984In XFS, each type of indexed space metadata (free space, inodes, reference
1985counts, file fork space, and reverse mappings) consists of a set of fixed-size
1986records indexed with a classic B+ tree.
1987Directories have a set of fixed-size dirent records that point to the names,
1988and extended attributes have a set of fixed-size attribute keys that point to
1989names and values.
1990Quota counters and file link counters index records with numbers.
1991During a repair, scrub needs to stage new records during the gathering step and
1992retrieve them during the btree building step.
1993
Although this requirement can be satisfied by calling the read and write
methods of the xfile directly, it is simpler for callers to use a higher level
abstraction that takes care of computing array offsets, provides iterator
functions, and deals with sparse records and sorting.
1998The ``xfarray`` abstraction presents a linear array for fixed-size records atop
1999the byte-accessible xfile.
2000
2001.. _xfarray_access_patterns:
2002
2003Array Access Patterns
2004^^^^^^^^^^^^^^^^^^^^^
2005
2006Array access patterns in online fsck tend to fall into three categories.
2007Iteration of records is assumed to be necessary for all cases and will be
2008covered in the next section.
2009
2010The first type of caller handles records that are indexed by position.
2011Gaps may exist between records, and a record may be updated multiple times
2012during the collection step.
2013In other words, these callers want a sparse linearly addressed table file.
2014The typical use case are quota records or file link count records.
2015Access to array elements is performed programmatically via ``xfarray_load`` and
2016``xfarray_store`` functions, which wrap the similarly-named xfile functions to
2017provide loading and storing of array elements at arbitrary array indices.
2018Gaps are defined to be null records, and null records are defined to be a
2019sequence of all zero bytes.
2020Null records are detected by calling ``xfarray_element_is_null``.
2021They are created either by calling ``xfarray_unset`` to null out an existing
2022record or by never storing anything to an array index.
2023
2024The second type of caller handles records that are not indexed by position
2025and do not require multiple updates to a record.
2026The typical use case here is rebuilding space btrees and key/value btrees.
2027These callers can add records to the array without caring about array indices
2028via the ``xfarray_append`` function, which stores a record at the end of the
2029array.
For callers that require records to be presented in a specific order (e.g.
rebuilding btree data), the ``xfarray_sort`` function can arrange the records
in sorted order; this function will be covered later.
2033
2034The third type of caller is a bag, which is useful for counting records.
2035The typical use case here is constructing space extent reference counts from
2036reverse mapping information.
2037Records can be put in the bag in any order, they can be removed from the bag
2038at any time, and uniqueness of records is left to callers.
2039The ``xfarray_store_anywhere`` function is used to insert a record in any
2040null record slot in the bag; and the ``xfarray_unset`` function removes a
2041record from the bag.
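
A sketch of the three access patterns appears below.
The record layout and ``example_`` helpers are invented, and the signatures of
the ``xfarray_*`` functions are assumed from the descriptions above:

.. code-block:: c

	/* Invented fixed-size record for illustration. */
	struct example_rmap_rec {
		__u64	startblock;
		__u64	blockcount;
		__u64	owner;
	};

	/* Pattern 1: sparse, indexed storage keyed by (say) inode number. */
	static int example_store_at(struct xfarray *array, __u64 ino,
				    const struct example_rmap_rec *rec)
	{
		return xfarray_store(array, ino, rec);	/* signature assumed */
	}

	/* Pattern 2: append records in any order; sort before bulk loading. */
	static int example_stash(struct xfarray *array,
				 const struct example_rmap_rec *rec)
	{
		return xfarray_append(array, rec);	/* signature assumed */
	}

	/* Pattern 3 (bag): drop a record that is no longer needed. */
	static int example_bag_remove(struct xfarray *array, xfarray_idx_t idx)
	{
		return xfarray_unset(array, idx);	/* signature assumed */
	}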
2042
2043The proposed patchset is the
2044`big in-memory array
2045<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=big-array>`_.
2046
2047Iterating Array Elements
2048^^^^^^^^^^^^^^^^^^^^^^^^
2049
2050Most users of the xfarray require the ability to iterate the records stored in
2051the array.
2052Callers can probe every possible array index with the following:
2053
2054.. code-block:: c
2055
2056	xfarray_idx_t i;
2057	foreach_xfarray_idx(array, i) {
2058	    xfarray_load(array, i, &rec);
2059
2060	    /* do something with rec */
2061	}
2062
2063All users of this idiom must be prepared to handle null records or must already
2064know that there aren't any.
2065
2066For xfarray users that want to iterate a sparse array, the ``xfarray_iter``
2067function ignores indices in the xfarray that have never been written to by
2068calling ``xfile_seek_data`` (which internally uses ``SEEK_DATA``) to skip areas
2069of the array that are not populated with memory pages.
2070Once it finds a page, it will skip the zeroed areas of the page.
2071
2072.. code-block:: c
2073
2074	xfarray_idx_t i = XFARRAY_CURSOR_INIT;
2075	while ((ret = xfarray_iter(array, &i, &rec)) == 1) {
2076	    /* do something with rec */
2077	}
2078
2079.. _xfarray_sort:
2080
2081Sorting Array Elements
2082^^^^^^^^^^^^^^^^^^^^^^
2083
2084During the fourth demonstration of online repair, a community reviewer remarked
2085that for performance reasons, online repair ought to load batches of records
2086into btree record blocks instead of inserting records into a new btree one at a
2087time.
2088The btree insertion code in XFS is responsible for maintaining correct ordering
2089of the records, so naturally the xfarray must also support sorting the record
2090set prior to bulk loading.
2091
2092Case Study: Sorting xfarrays
2093~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2094
2095The sorting algorithm used in the xfarray is actually a combination of adaptive
2096quicksort and a heapsort subalgorithm in the spirit of
2097`Sedgewick <https://algs4.cs.princeton.edu/23quicksort/>`_ and
2098`pdqsort <https://github.com/orlp/pdqsort>`_, with customizations for the Linux
2099kernel.
2100To sort records in a reasonably short amount of time, ``xfarray`` takes
2101advantage of the binary subpartitioning offered by quicksort, but it also uses
heapsort to hedge against performance collapse if the chosen quicksort pivots
2103are poor.
2104Both algorithms are (in general) O(n * lg(n)), but there is a wide performance
2105gulf between the two implementations.
2106
2107The Linux kernel already contains a reasonably fast implementation of heapsort.
2108It only operates on regular C arrays, which limits the scope of its usefulness.
2109There are two key places where the xfarray uses it:
2110
2111* Sorting any record subset backed by a single xfile page.
2112
2113* Loading a small number of xfarray records from potentially disparate parts
2114  of the xfarray into a memory buffer, and sorting the buffer.
2115
2116In other words, ``xfarray`` uses heapsort to constrain the nested recursion of
2117quicksort, thereby mitigating quicksort's worst runtime behavior.
2118
2119Choosing a quicksort pivot is a tricky business.
2120A good pivot splits the set to sort in half, leading to the divide and conquer
behavior that is crucial to O(n * lg(n)) performance.
2122A poor pivot barely splits the subset at all, leading to O(n\ :sup:`2`)
2123runtime.
2124The xfarray sort routine tries to avoid picking a bad pivot by sampling nine
2125records into a memory buffer and using the kernel heapsort to identify the
2126median of the nine.
2127
2128Most modern quicksort implementations employ Tukey's "ninther" to select a
2129pivot from a classic C array.
2130Typical ninther implementations pick three unique triads of records, sort each
2131of the triads, and then sort the middle value of each triad to determine the
2132ninther value.
2133As stated previously, however, xfile accesses are not entirely cheap.
2134It turned out to be much more performant to read the nine elements into a
2135memory buffer, run the kernel's in-memory heapsort on the buffer, and choose
2136the 4th element of that buffer as the pivot.
2137Tukey's ninthers are described in J. W. Tukey, `The ninther, a technique for
2138low-effort robust (resistant) location in large samples`, in *Contributions to
2139Survey Sampling and Applied Statistics*, edited by H. David, (Academic Press,
21401978), pp. 251–257.
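
The pivot selection can be modeled in a few lines of ordinary C.
In the standalone sketch below, ``qsort`` stands in for the kernel's heapsort
and the record type is simplified to an integer:

.. code-block:: c

	#include <stdlib.h>

	static int example_cmp(const void *a, const void *b)
	{
		unsigned long long x = *(const unsigned long long *)a;
		unsigned long long y = *(const unsigned long long *)b;

		return (x > y) - (x < y);
	}

	/* Pick a quicksort pivot for recs[lo..hi] via a median of nine. */
	static unsigned long long
	example_choose_pivot(const unsigned long long *recs, size_t lo, size_t hi)
	{
		unsigned long long	samples[9];
		size_t			nr = hi - lo + 1;

		/* Copy nine evenly spaced samples into an ordinary buffer... */
		for (unsigned int i = 0; i < 9; i++)
			samples[i] = recs[lo + (nr - 1) * i / 8];

		/* ...sort the buffer (the kernel uses heapsort here)... */
		qsort(samples, 9, sizeof(samples[0]), example_cmp);

		/* ...and take the middle element as the pivot. */
		return samples[4];
	}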
2141
2142The partitioning of quicksort is fairly textbook -- rearrange the record
2143subset around the pivot, then set up the current and next stack frames to
2144sort with the larger and the smaller halves of the pivot, respectively.
2145This keeps the stack space requirements to log2(record count).
2146
2147As a final performance optimization, the hi and lo scanning phase of quicksort
2148keeps examined xfile pages mapped in the kernel for as long as possible to
2149reduce map/unmap cycles.
2150Surprisingly, this reduces overall sort runtime by nearly half again after
2151accounting for the application of heapsort directly onto xfile pages.
2152
2153Blob Storage
2154````````````
2155
2156Extended attributes and directories add an additional requirement for staging
2157records: arbitrary byte sequences of finite length.
Each directory entry record needs to store the entry name,
2159and each extended attribute needs to store both the attribute name and value.
2160The names, keys, and values can consume a large amount of memory, so the
2161``xfblob`` abstraction was created to simplify management of these blobs
2162atop an xfile.
2163
2164Blob arrays provide ``xfblob_load`` and ``xfblob_store`` functions to retrieve
2165and persist objects.
2166The store function returns a magic cookie for every object that it persists.
Later, callers provide this cookie to ``xfblob_load`` to recall the object.
2168The ``xfblob_free`` function frees a specific blob, and the ``xfblob_truncate``
2169function frees them all because compaction is not needed.
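
For example, a directory repair might stash each entry's name as a blob and
keep only the cookie in its fixed-size staging record.
The sketch below assumes a cookie-returning store and a cookie-consuming load,
which may not match the kernel code exactly, and the staging record is
invented:

.. code-block:: c

	/* Invented staging record: the name lives in the blob store. */
	struct example_stage_dirent {
		xfblob_cookie	name_cookie;
		__u64		ino;
		__u8		namelen;
	};

	/* Stash a dirent name and remember where it went. */
	static int example_stash_name(struct xfblob *blobs,
				      struct example_stage_dirent *stage,
				      const unsigned char *name, __u8 namelen)
	{
		stage->namelen = namelen;
		/* signature assumed: cookie is returned through a pointer */
		return xfblob_store(blobs, &stage->name_cookie, name, namelen);
	}

	/* Recall the name when flushing entries to the temporary file. */
	static int example_recall_name(struct xfblob *blobs,
				       const struct example_stage_dirent *stage,
				       unsigned char *name)
	{
		return xfblob_load(blobs, stage->name_cookie, name,
				   stage->namelen);
	}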
2170
2171The details of repairing directories and extended attributes will be discussed
2172in a subsequent section about atomic extent swapping.
2173However, it should be noted that these repair functions only use blob storage
2174to cache a small number of entries before adding them to a temporary ondisk
2175file, which is why compaction is not required.
2176
2177The proposed patchset is at the start of the
2178`extended attribute repair
2179<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-xattrs>`_ series.
2180
2181.. _xfbtree:
2182
2183In-Memory B+Trees
2184`````````````````
2185
2186The chapter about :ref:`secondary metadata<secondary_metadata>` mentioned that
2187checking and repairing of secondary metadata commonly requires coordination
2188between a live metadata scan of the filesystem and writer threads that are
2189updating that metadata.
Keeping the scan data up to date requires the ability to propagate
2191metadata updates from the filesystem into the data being collected by the scan.
2192This *can* be done by appending concurrent updates into a separate log file and
2193applying them before writing the new metadata to disk, but this leads to
2194unbounded memory consumption if the rest of the system is very busy.
2195Another option is to skip the side-log and commit live updates from the
2196filesystem directly into the scan data, which trades more overhead for a lower
2197maximum memory requirement.
2198In both cases, the data structure holding the scan results must support indexed
2199access to perform well.
2200
2201Given that indexed lookups of scan data is required for both strategies, online
2202fsck employs the second strategy of committing live updates directly into
2203scan data.
2204Because xfarrays are not indexed and do not enforce record ordering, they
2205are not suitable for this task.
2206Conveniently, however, XFS has a library to create and maintain ordered reverse
2207mapping records: the existing rmap btree code!
2208If only there was a means to create one in memory.
2209
2210Recall that the :ref:`xfile <xfile>` abstraction represents memory pages as a
2211regular file, which means that the kernel can create byte or block addressable
2212virtual address spaces at will.
The XFS buffer cache specializes in abstracting IO to block-oriented address
2214spaces, which means that adaptation of the buffer cache to interface with
2215xfiles enables reuse of the entire btree library.
2216Btrees built atop an xfile are collectively known as ``xfbtrees``.
2217The next few sections describe how they actually work.
2218
2219The proposed patchset is the
2220`in-memory btree
2221<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=in-memory-btrees>`_
2222series.
2223
2224Using xfiles as a Buffer Cache Target
2225^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2226
2227Two modifications are necessary to support xfiles as a buffer cache target.
2228The first is to make it possible for the ``struct xfs_buftarg`` structure to
2229host the ``struct xfs_buf`` rhashtable, because normally those are held by a
2230per-AG structure.
2231The second change is to modify the buffer ``ioapply`` function to "read" cached
2232pages from the xfile and "write" cached pages back to the xfile.
2233Multiple access to individual buffers is controlled by the ``xfs_buf`` lock,
2234since the xfile does not provide any locking on its own.
2235With this adaptation in place, users of the xfile-backed buffer cache use
2236exactly the same APIs as users of the disk-backed buffer cache.
2237The separation between xfile and buffer cache implies higher memory usage since
2238they do not share pages, but this property could some day enable transactional
2239updates to an in-memory btree.
2240Today, however, it simply eliminates the need for new code.
2241
2242Space Management with an xfbtree
2243^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2244
2245Space management for an xfile is very simple -- each btree block is one memory
2246page in size.
2247These blocks use the same header format as an on-disk btree, but the in-memory
2248block verifiers ignore the checksums, assuming that xfile memory is no more
2249corruption-prone than regular DRAM.
2250Reusing existing code here is more important than absolute memory efficiency.
2251
2252The very first block of an xfile backing an xfbtree contains a header block.
2253The header describes the owner, height, and the block number of the root
2254xfbtree block.
2255
2256To allocate a btree block, use ``xfile_seek_data`` to find a gap in the file.
2257If there are no gaps, create one by extending the length of the xfile.
2258Preallocate space for the block with ``xfile_prealloc``, and hand back the
2259location.
2260To free an xfbtree block, use ``xfile_discard`` (which internally uses
2261``FALLOC_FL_PUNCH_HOLE``) to remove the memory page from the xfile.
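
For illustration, the allocation and free paths might look like the sketch
below.
The ``xfile_*`` helpers are the ones named above, but the argument lists shown
here are assumptions, ``xfile_length`` is a hypothetical stand-in for "extend
the xfile", and error handling is omitted::

  /* Sketch only; not the real kernel prototypes. */
  static int xfbtree_alloc_block_sketch(struct xfile *xf, loff_t *pos)
  {
          loff_t          gap;

          /* Find a gap left behind by a previously freed block. */
          gap = xfile_seek_data(xf, 0);
          if (gap < 0)
                  gap = xfile_length(xf);   /* no gaps; extend the xfile */

          /* Reserve memory for the one-page block and hand it back. */
          xfile_prealloc(xf, gap, PAGE_SIZE);
          *pos = gap;
          return 0;
  }

  static void xfbtree_free_block_sketch(struct xfile *xf, loff_t pos)
  {
          /* Punch out the page so a later allocation can reuse the gap. */
          xfile_discard(xf, pos, PAGE_SIZE);
  }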
2262
2263Populating an xfbtree
2264^^^^^^^^^^^^^^^^^^^^^
2265
An online fsck function that wants to create an xfbtree should proceed as
follows; a condensed code sketch appears after the list:
2268
22691. Call ``xfile_create`` to create an xfile.
2270
22712. Call ``xfs_alloc_memory_buftarg`` to create a buffer cache target structure
2272   pointing to the xfile.
2273
22743. Pass the buffer cache target, buffer ops, and other information to
2275   ``xfbtree_create`` to write an initial tree header and root block to the
2276   xfile.
2277   Each btree type should define a wrapper that passes necessary arguments to
2278   the creation function.
2279   For example, rmap btrees define ``xfs_rmapbt_mem_create`` to take care of
2280   all the necessary details for callers.
2281   A ``struct xfbtree`` object will be returned.
2282
22834. Pass the xfbtree object to the btree cursor creation function for the
2284   btree type.
2285   Following the example above, ``xfs_rmapbt_mem_cursor`` takes care of this
2286   for callers.
2287
22885. Pass the btree cursor to the regular btree functions to make queries against
2289   and to update the in-memory btree.
2290   For example, a btree cursor for an rmap xfbtree can be passed to the
2291   ``xfs_rmap_*`` functions just like any other btree cursor.
2292   See the :ref:`next section<xfbtree_commit>` for information on dealing with
2293   xfbtree updates that are logged to a transaction.
2294
6. When finished, delete the btree cursor, destroy the xfbtree object, free the
   buffer target, and then destroy the xfile to release all resources.
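
Putting the steps together, a condensed sketch for the rmap flavor might look
like the following.
The function names are the ones mentioned above, but the argument lists shown
are assumptions, the teardown helpers are hypothetical stand-ins, and error
handling is omitted::

  /* Sketch only; prototypes are illustrative, not the real kernel API. */
  static int xfbtree_example(struct xfs_mount *mp, struct xfs_trans *tp,
                             xfs_agnumber_t agno)
  {
          struct xfile            *xfile;
          struct xfs_buftarg      *btp;
          struct xfbtree          *xfbt;
          struct xfs_btree_cur    *cur;
          int                     error = 0;

          xfile_create("rmap staging", 0, &xfile);          /* step 1 */
          btp = xfs_alloc_memory_buftarg(mp, xfile);        /* step 2 */
          xfbt = xfs_rmapbt_mem_create(mp, agno, btp);      /* step 3 */
          cur = xfs_rmapbt_mem_cursor(mp, tp, xfbt);        /* step 4 */

          /* step 5: pass cur to the regular xfs_rmap_* functions */

          xfs_btree_del_cursor(cur, error);                 /* step 6 */
          xfbtree_destroy(xfbt);                            /* hypothetical */
          xfs_free_buftarg(btp);                            /* hypothetical */
          xfile_destroy(xfile);                             /* hypothetical */
          return error;
  }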
2297
2298.. _xfbtree_commit:
2299
2300Committing Logged xfbtree Buffers
2301^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2302
2303Although it is a clever hack to reuse the rmap btree code to handle the staging
2304structure, the ephemeral nature of the in-memory btree block storage presents
2305some challenges of its own.
2306The XFS transaction manager must not commit buffer log items for buffers backed
2307by an xfile because the log format does not understand updates for devices
2308other than the data device.
2309An ephemeral xfbtree probably will not exist by the time the AIL checkpoints
2310log transactions back into the filesystem, and certainly won't exist during
2311log recovery.
2312For these reasons, any code updating an xfbtree in transaction context must
2313remove the buffer log items from the transaction and write the updates into the
2314backing xfile before committing or cancelling the transaction.
2315
2316The ``xfbtree_trans_commit`` and ``xfbtree_trans_cancel`` functions implement
2317this functionality as follows:
2318
23191. Find each buffer log item whose buffer targets the xfile.
2320
23212. Record the dirty/ordered status of the log item.
2322
23233. Detach the log item from the buffer.
2324
23254. Queue the buffer to a special delwri list.
2326
23275. Clear the transaction dirty flag if the only dirty log items were the ones
2328   that were detached in step 3.
2329
23306. Submit the delwri list to commit the changes to the xfile, if the updates
2331   are being committed.
2332
2333After removing xfile logged buffers from the transaction in this manner, the
2334transaction can be committed or cancelled.
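
In pseudocode, with hypothetical helper names standing in for the parts that
the list above describes only in prose, ``xfbtree_trans_commit`` behaves
roughly like this::

  /* Pseudocode sketch; iterator and helper names are illustrative. */
  LIST_HEAD(buffers);

  for_each_trans_log_item(tp, lip) {                        /* step 1 */
          bp = buffer_behind_log_item(lip);
          if (!bp || bp->b_target != xfbtree_buftarg)
                  continue;                 /* not an xfile-backed buffer */

          dirty = log_item_is_dirty(lip);                   /* step 2 */
          detach_buf_log_item(tp, bp, lip);                 /* step 3 */
          if (dirty)
                  xfs_buf_delwri_queue(bp, &buffers);       /* step 4 */
  }

  if (!transaction_has_other_dirty_items(tp))               /* step 5 */
          clear_transaction_dirty_flag(tp);

  /* step 6: only on the commit path, not when cancelling */
  error = xfs_buf_delwri_submit(&buffers);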
2335
2336Bulk Loading of Ondisk B+Trees
2337------------------------------
2338
2339As mentioned previously, early iterations of online repair built new btree
2340structures by creating a new btree and adding observations individually.
2341Loading a btree one record at a time had a slight advantage of not requiring
2342the incore records to be sorted prior to commit, but was very slow and leaked
2343blocks if the system went down during a repair.
2344Loading records one at a time also meant that repair could not control the
2345loading factor of the blocks in the new btree.
2346
2347Fortunately, the venerable ``xfs_repair`` tool had a more efficient means for
2348rebuilding a btree index from a collection of records -- bulk btree loading.
2349This was implemented rather inefficiently code-wise, since ``xfs_repair``
2350had separate copy-pasted implementations for each btree type.
2351
To prepare for online fsck, each of the four bulk loaders was studied, notes
were taken, and the four were refactored into a single generic btree bulk
loading mechanism.
2355Those notes in turn have been refreshed and are presented below.
2356
2357Geometry Computation
2358````````````````````
2359
2360The zeroth step of bulk loading is to assemble the entire record set that will
2361be stored in the new btree, and sort the records.
2362Next, call ``xfs_btree_bload_compute_geometry`` to compute the shape of the
2363btree from the record set, the type of btree, and any load factor preferences.
2364This information is required for resource reservation.
2365
First, the geometry computation determines the minimum and maximum number of
records that will fit in a leaf block, given the size of a btree block and the
size of the block header.
2369Roughly speaking, the maximum number of records is::
2370
2371        maxrecs = (block_size - header_size) / record_size
2372
2373The XFS design specifies that btree blocks should be merged when possible,
2374which means the minimum number of records is half of maxrecs::
2375
2376        minrecs = maxrecs / 2
2377
2378The next variable to determine is the desired loading factor.
2379This must be at least minrecs and no more than maxrecs.
2380Choosing minrecs is undesirable because it wastes half the block.
2381Choosing maxrecs is also undesirable because adding a single record to each
2382newly rebuilt leaf block will cause a tree split, which causes a noticeable
2383drop in performance immediately afterwards.
2384The default loading factor was chosen to be 75% of maxrecs, which provides a
2385reasonably compact structure without any immediate split penalties::
2386
2387        default_load_factor = (maxrecs + minrecs) / 2
2388
2389If space is tight, the loading factor will be set to maxrecs to try to avoid
2390running out of space::
2391
2392        leaf_load_factor = enough space ? default_load_factor : maxrecs
2393
2394Load factor is computed for btree node blocks using the combined size of the
2395btree key and pointer as the record size::
2396
2397        maxrecs = (block_size - header_size) / (key_size + ptr_size)
2398        minrecs = maxrecs / 2
2399        node_load_factor = enough space ? default_load_factor : maxrecs
2400
2401Once that's done, the number of leaf blocks required to store the record set
2402can be computed as::
2403
2404        leaf_blocks = ceil(record_count / leaf_load_factor)
2405
2406The number of node blocks needed to point to the next level down in the tree
2407is computed as::
2408
2409        n_blocks = (n == 0 ? leaf_blocks : node_blocks[n])
2410        node_blocks[n + 1] = ceil(n_blocks / node_load_factor)
2411
The entire computation is performed recursively until the current level only
needs one block.
A short sketch of the computation appears after the list below.
The resulting geometry is as follows:
2415
2416- For AG-rooted btrees, this level is the root level, so the height of the new
2417  tree is ``level + 1`` and the space needed is the summation of the number of
2418  blocks on each level.
2419
2420- For inode-rooted btrees where the records in the top level do not fit in the
2421  inode fork area, the height is ``level + 2``, the space needed is the
2422  summation of the number of blocks on each level, and the inode fork points to
2423  the root block.
2424
2425- For inode-rooted btrees where the records in the top level can be stored in
2426  the inode fork area, then the root block can be stored in the inode, the
2427  height is ``level + 1``, and the space needed is one less than the summation
2428  of the number of blocks on each level.
2429  This only becomes relevant when non-bmap btrees gain the ability to root in
2430  an inode, which is a future patchset and only included here for completeness.
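
To make the arithmetic concrete, the whole computation can be sketched as a
small function.
This is illustrative only and is not the kernel's
``xfs_btree_bload_compute_geometry`` implementation::

  /* Integer ceiling division stands in for ceil() above. */
  #define CEIL_DIV(a, b)  (((a) + (b) - 1) / (b))

  static uint64_t bload_geometry_sketch(unsigned int block_size,
                                        unsigned int header_size,
                                        unsigned int record_size,
                                        unsigned int key_size,
                                        unsigned int ptr_size,
                                        uint64_t record_count,
                                        bool space_is_tight)
  {
          unsigned int    maxrecs = (block_size - header_size) / record_size;
          unsigned int    minrecs = maxrecs / 2;
          unsigned int    leaf_load = space_is_tight ?
                                  maxrecs : (maxrecs + minrecs) / 2;
          unsigned int    node_maxrecs = (block_size - header_size) /
                                  (key_size + ptr_size);
          unsigned int    node_load = space_is_tight ?
                                  node_maxrecs :
                                  (node_maxrecs + node_maxrecs / 2) / 2;
          uint64_t        level_blocks = CEIL_DIV(record_count, leaf_load);
          uint64_t        total_blocks = level_blocks;
          unsigned int    height = 1;

          /* Add node levels until a single (root) block remains. */
          while (level_blocks > 1) {
                  level_blocks = CEIL_DIV(level_blocks, node_load);
                  total_blocks += level_blocks;
                  height++;
          }

          return total_blocks;    /* height is also known at this point */
  }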
2431
2432.. _newbt:
2433
2434Reserving New B+Tree Blocks
2435```````````````````````````
2436
2437Once repair knows the number of blocks needed for the new btree, it allocates
2438those blocks using the free space information.
2439Each reserved extent is tracked separately by the btree builder state data.
2440To improve crash resilience, the reservation code also logs an Extent Freeing
2441Intent (EFI) item in the same transaction as each space allocation and attaches
2442its in-memory ``struct xfs_extent_free_item`` object to the space reservation.
If the system goes down, log recovery will use the unfinished EFIs to free the
unused space, leaving the filesystem unchanged.
2445
2446Each time the btree builder claims a block for the btree from a reserved
2447extent, it updates the in-memory reservation to reflect the claimed space.
2448Block reservation tries to allocate as much contiguous space as possible to
2449reduce the number of EFIs in play.
2450
2451While repair is writing these new btree blocks, the EFIs created for the space
2452reservations pin the tail of the ondisk log.
2453It's possible that other parts of the system will remain busy and push the head
2454of the log towards the pinned tail.
2455To avoid livelocking the filesystem, the EFIs must not pin the tail of the log
2456for too long.
To alleviate this problem, the dynamic relogging capability of the deferred ops
mechanism is reused here to commit a transaction at the log head containing an
EFD for the old EFI and a new EFI.
2460This enables the log to release the old EFI to keep the log moving forwards.
2461
2462EFIs have a role to play during the commit and reaping phases; please see the
2463next section and the section about :ref:`reaping<reaping>` for more details.
2464
2465Proposed patchsets are the
2466`bitmap rework
2467<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-bitmap-rework>`_
2468and the
2469`preparation for bulk loading btrees
2470<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-prep-for-bulk-loading>`_.
2471
2472
2473Writing the New Tree
2474````````````````````
2475
This part is pretty simple -- the btree builder (``xfs_btree_bload``) claims
2477a block from the reserved list, writes the new btree block header, fills the
2478rest of the block with records, and adds the new leaf block to a list of
2479written blocks::
2480
2481  ┌────┐
2482  │leaf│
2483  │RRR │
2484  └────┘
2485
2486Sibling pointers are set every time a new block is added to the level::
2487
2488  ┌────┐ ┌────┐ ┌────┐ ┌────┐
2489  │leaf│→│leaf│→│leaf│→│leaf│
2490  │RRR │←│RRR │←│RRR │←│RRR │
2491  └────┘ └────┘ └────┘ └────┘
2492
2493When it finishes writing the record leaf blocks, it moves on to the node
blocks.
2495To fill a node block, it walks each block in the next level down in the tree
2496to compute the relevant keys and write them into the parent node::
2497
2498      ┌────┐       ┌────┐
2499      │node│──────→│node│
2500      │PP  │←──────│PP  │
2501      └────┘       └────┘
2502      ↙   ↘         ↙   ↘
2503  ┌────┐ ┌────┐ ┌────┐ ┌────┐
2504  │leaf│→│leaf│→│leaf│→│leaf│
2505  │RRR │←│RRR │←│RRR │←│RRR │
2506  └────┘ └────┘ └────┘ └────┘
2507
2508When it reaches the root level, it is ready to commit the new btree!::
2509
2510          ┌─────────┐
2511          │  root   │
2512          │   PP    │
2513          └─────────┘
2514          ↙         ↘
2515      ┌────┐       ┌────┐
2516      │node│──────→│node│
2517      │PP  │←──────│PP  │
2518      └────┘       └────┘
2519      ↙   ↘         ↙   ↘
2520  ┌────┐ ┌────┐ ┌────┐ ┌────┐
2521  │leaf│→│leaf│→│leaf│→│leaf│
2522  │RRR │←│RRR │←│RRR │←│RRR │
2523  └────┘ └────┘ └────┘ └────┘
2524
2525The first step to commit the new btree is to persist the btree blocks to disk
2526synchronously.
2527This is a little complicated because a new btree block could have been freed
2528in the recent past, so the builder must use ``xfs_buf_delwri_queue_here`` to
2529remove the (stale) buffer from the AIL list before it can write the new blocks
2530to disk.
2531Blocks are queued for IO using a delwri list and written in one large batch
2532with ``xfs_buf_delwri_submit``.
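
In outline, and with the loop over the newly written buffers elided, the
writeout step looks like this::

  LIST_HEAD(buffers_list);

  /* Queue each new btree block buffer, displacing any stale copy. */
  xfs_buf_delwri_queue_here(bp, &buffers_list);
  /* ...repeated for every block written by the bulk loader... */

  /* One large synchronous batch write persists the whole new btree. */
  error = xfs_buf_delwri_submit(&buffers_list);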
2533
2534Once the new blocks have been persisted to disk, control returns to the
2535individual repair function that called the bulk loader.
2536The repair function must log the location of the new root in a transaction,
2537clean up the space reservations that were made for the new btree, and reap the
2538old metadata blocks:
2539
25401. Commit the location of the new btree root.
2541
25422. For each incore reservation:
2543
2544   a. Log Extent Freeing Done (EFD) items for all the space that was consumed
2545      by the btree builder.  The new EFDs must point to the EFIs attached to
2546      the reservation to prevent log recovery from freeing the new blocks.
2547
   b. For unclaimed portions of incore reservations, create a regular deferred
      extent free work item to free the unused space later in the
      transaction chain.
2551
2552   c. The EFDs and EFIs logged in steps 2a and 2b must not overrun the
2553      reservation of the committing transaction.
2554      If the btree loading code suspects this might be about to happen, it must
2555      call ``xrep_defer_finish`` to clear out the deferred work and obtain a
2556      fresh transaction.
2557
25583. Clear out the deferred work a second time to finish the commit and clean
2559   the repair transaction.
2560
The transaction rolling in steps 2c and 3 represents a weakness in the repair
algorithm, because a log flush and a crash before the end of the reap step can
result in space leaking.
Online repair functions minimize the chances of this occurring by using very
large transactions, each of which can accommodate many thousands of block
freeing instructions.
2567Repair moves on to reaping the old blocks, which will be presented in a
2568subsequent :ref:`section<reaping>` after a few case studies of bulk loading.
2569
2570Case Study: Rebuilding the Inode Index
2571^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2572
2573The high level process to rebuild the inode index btree is:
2574
25751. Walk the reverse mapping records to generate ``struct xfs_inobt_rec``
2576   records from the inode chunk information and a bitmap of the old inode btree
2577   blocks.
2578
25792. Append the records to an xfarray in inode order.
2580
25813. Use the ``xfs_btree_bload_compute_geometry`` function to compute the number
2582   of blocks needed for the inode btree.
2583   If the free space inode btree is enabled, call it again to estimate the
2584   geometry of the finobt.
2585
25864. Allocate the number of blocks computed in the previous step.
2587
25885. Use ``xfs_btree_bload`` to write the xfarray records to btree blocks and
2589   generate the internal node blocks.
2590   If the free space inode btree is enabled, call it again to load the finobt.
2591
25926. Commit the location of the new btree root block(s) to the AGI.
2593
25947. Reap the old btree blocks using the bitmap created in step 1.
2595
2596Details are as follows.
2597
2598The inode btree maps inumbers to the ondisk location of the associated
2599inode records, which means that the inode btrees can be rebuilt from the
2600reverse mapping information.
Reverse mapping records with an owner of ``XFS_RMAP_OWN_INOBT`` mark the
location of the old inode btree blocks.
2603Each reverse mapping record with an owner of ``XFS_RMAP_OWN_INODES`` marks the
2604location of at least one inode cluster buffer.
2605A cluster is the smallest number of ondisk inodes that can be allocated or
2606freed in a single transaction; it is never smaller than 1 fs block or 4 inodes.
2607
2608For the space represented by each inode cluster, ensure that there are no
2609records in the free space btrees nor any records in the reference count btree.
2610If there are, the space metadata inconsistencies are reason enough to abort the
2611operation.
2612Otherwise, read each cluster buffer to check that its contents appear to be
2613ondisk inodes and to decide if the file is allocated
2614(``xfs_dinode.i_mode != 0``) or free (``xfs_dinode.i_mode == 0``).
2615Accumulate the results of successive inode cluster buffer reads until there is
2616enough information to fill a single inode chunk record, which is 64 consecutive
2617numbers in the inumber keyspace.
2618If the chunk is sparse, the chunk record may include holes.
2619
2620Once the repair function accumulates one chunk's worth of data, it calls
2621``xfarray_append`` to add the inode btree record to the xfarray.
2622This xfarray is walked twice during the btree creation step -- once to populate
2623the inode btree with all inode chunk records, and a second time to populate the
2624free inode btree with records for chunks that have free non-sparse inodes.
2625The number of records for the inode btree is the number of xfarray records,
2626but the record count for the free inode btree has to be computed as inode chunk
2627records are stored in the xfarray.
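
A rough sketch of the accumulation loop follows.
``xfarray_append`` is the xfarray primitive mentioned earlier; the incore
record type, the chunk iterator, and the free-inode test shown here are
simplifying assumptions made for the example::

  /* Sketch: build one 64-inode chunk record, then append it. */
  struct xfs_inobt_rec_incore     irec = {
          .ir_startino    = chunk_start_ino,
          .ir_count       = 64,
  };

  for_each_inode_in_chunk(ino, dip) {       /* hypothetical iterator */
          if (dip->di_mode == 0) {          /* free ondisk inode */
                  irec.ir_free |= 1ULL << (ino - chunk_start_ino);
                  irec.ir_freecount++;
          }
  }

  error = xfarray_append(xfarray, &irec);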
2628
2629The proposed patchset is the
2630`AG btree repair
2631<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_
2632series.
2633
2634Case Study: Rebuilding the Space Reference Counts
2635^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2636
2637Reverse mapping records are used to rebuild the reference count information.
2638Reference counts are required for correct operation of copy on write for shared
2639file data.
2640Imagine the reverse mapping entries as rectangles representing extents of
2641physical blocks, and that the rectangles can be laid down to allow them to
2642overlap each other.
2643From the diagram below, it is apparent that a reference count record must start
2644or end wherever the height of the stack changes.
2645In other words, the record emission stimulus is level-triggered::
2646
2647                        █    ███
2648              ██      █████ ████   ███        ██████
2649        ██   ████     ███████████ ████     █████████
2650        ████████████████████████████████ ███████████
2651        ^ ^  ^^ ^^    ^ ^^ ^^^  ^^^^  ^ ^^ ^  ^     ^
2652        2 1  23 21    3 43 234  2123  1 01 2  3     0
2653
2654The ondisk reference count btree does not store the refcount == 0 cases because
2655the free space btree already records which blocks are free.
2656Extents being used to stage copy-on-write operations should be the only records
2657with refcount == 1.
2658Single-owner file blocks aren't recorded in either the free space or the
2659reference count btrees.
2660
2661The high level process to rebuild the reference count btree is:
2662
26631. Walk the reverse mapping records to generate ``struct xfs_refcount_irec``
2664   records for any space having more than one reverse mapping and add them to
2665   the xfarray.
2666   Any records owned by ``XFS_RMAP_OWN_COW`` are also added to the xfarray
2667   because these are extents allocated to stage a copy on write operation and
2668   are tracked in the refcount btree.
2669
2670   Use any records owned by ``XFS_RMAP_OWN_REFC`` to create a bitmap of old
2671   refcount btree blocks.
2672
26732. Sort the records in physical extent order, putting the CoW staging extents
2674   at the end of the xfarray.
2675   This matches the sorting order of records in the refcount btree.
2676
26773. Use the ``xfs_btree_bload_compute_geometry`` function to compute the number
2678   of blocks needed for the new tree.
2679
26804. Allocate the number of blocks computed in the previous step.
2681
26825. Use ``xfs_btree_bload`` to write the xfarray records to btree blocks and
2683   generate the internal node blocks.
2684
26856. Commit the location of new btree root block to the AGF.
2686
26877. Reap the old btree blocks using the bitmap created in step 1.
2688
2689Details are as follows; the same algorithm is used by ``xfs_repair`` to
2690generate refcount information from reverse mapping records.
2691
2692- Until the reverse mapping btree runs out of records:
2693
2694  - Retrieve the next record from the btree and put it in a bag.
2695
2696  - Collect all records with the same starting block from the btree and put
2697    them in the bag.
2698
2699  - While the bag isn't empty:
2700
2701    - Among the mappings in the bag, compute the lowest block number where the
2702      reference count changes.
2703      This position will be either the starting block number of the next
2704      unprocessed reverse mapping or the next block after the shortest mapping
2705      in the bag.
2706
2707    - Remove all mappings from the bag that end at this position.
2708
2709    - Collect all reverse mappings that start at this position from the btree
2710      and put them in the bag.
2711
2712    - If the size of the bag changed and is greater than one, create a new
2713      refcount record associating the block number range that we just walked to
2714      the size of the bag.
2715
2716The bag-like structure in this case is a type 2 xfarray as discussed in the
2717:ref:`xfarray access patterns<xfarray_access_patterns>` section.
2718Reverse mappings are added to the bag using ``xfarray_store_anywhere`` and
2719removed via ``xfarray_unset``.
2720Bag members are examined through ``xfarray_iter`` loops.
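
In pseudocode, with the bag operations built on the xfarray calls just
mentioned and the reverse mapping iteration helpers left abstract, the
level-triggered walk looks roughly like this::

  /* Sketch of refcount record generation; helper names are illustrative. */
  while ((rmap = next_unprocessed_rmap()) != NULL) {
          bag_add(bag, rmap);               /* xfarray_store_anywhere */
          bag_add_all_starting_at(bag, rmap->rm_startblock);
          bno = rmap->rm_startblock;

          while (!bag_is_empty(bag)) {
                  /* Next block where the reference count changes. */
                  next_bno = min(next_unprocessed_rmap_start(),
                                 shortest_mapping_end(bag));

                  old_refcount = bag_size(bag);
                  bag_remove_ending_at(bag, next_bno);      /* xfarray_unset */
                  bag_add_all_starting_at(bag, next_bno);

                  /* [bno, next_bno) had old_refcount owners. */
                  if (bag_size(bag) != old_refcount && old_refcount > 1)
                          emit_refcount_record(bno, next_bno - bno,
                                               old_refcount);
                  bno = next_bno;
          }
  }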
2721
2722The proposed patchset is the
2723`AG btree repair
2724<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_
2725series.
2726
2727Case Study: Rebuilding File Fork Mapping Indices
2728^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2729
2730The high level process to rebuild a data/attr fork mapping btree is:
2731
27321. Walk the reverse mapping records to generate ``struct xfs_bmbt_rec``
2733   records from the reverse mapping records for that inode and fork.
2734   Append these records to an xfarray.
2735   Compute the bitmap of the old bmap btree blocks from the ``BMBT_BLOCK``
2736   records.
2737
27382. Use the ``xfs_btree_bload_compute_geometry`` function to compute the number
2739   of blocks needed for the new tree.
2740
27413. Sort the records in file offset order.
2742
27434. If the extent records would fit in the inode fork immediate area, commit the
2744   records to that immediate area and skip to step 8.
2745
5. Allocate the number of blocks computed in step 2.
2747
27486. Use ``xfs_btree_bload`` to write the xfarray records to btree blocks and
2749   generate the internal node blocks.
2750
27517. Commit the new btree root block to the inode fork immediate area.
2752
27538. Reap the old btree blocks using the bitmap created in step 1.
2754
2755There are some complications here:
2756First, it's possible to move the fork offset to adjust the sizes of the
2757immediate areas if the data and attr forks are not both in BMBT format.
2758Second, if there are sufficiently few fork mappings, it may be possible to use
2759EXTENTS format instead of BMBT, which may require a conversion.
2760Third, the incore extent map must be reloaded carefully to avoid disturbing
2761any delayed allocation extents.
2762
2763The proposed patchset is the
2764`file mapping repair
2765<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-file-mappings>`_
2766series.
2767
2768.. _reaping:
2769
2770Reaping Old Metadata Blocks
2771---------------------------
2772
2773Whenever online fsck builds a new data structure to replace one that is
2774suspect, there is a question of how to find and dispose of the blocks that
2775belonged to the old structure.
2776The laziest method of course is not to deal with them at all, but this slowly
2777leads to service degradations as space leaks out of the filesystem.
2778Hopefully, someone will schedule a rebuild of the free space information to
2779plug all those leaks.
2780Offline repair rebuilds all space metadata after recording the usage of
2781the files and directories that it decides not to clear, hence it can build new
2782structures in the discovered free space and avoid the question of reaping.
2783
2784As part of a repair, online fsck relies heavily on the reverse mapping records
2785to find space that is owned by the corresponding rmap owner yet truly free.
2786Cross referencing rmap records with other rmap records is necessary because
2787there may be other data structures that also think they own some of those
2788blocks (e.g. crosslinked trees).
2789Permitting the block allocator to hand them out again will not push the system
2790towards consistency.
2791
2792For space metadata, the process of finding extents to dispose of generally
2793follows this format:
2794
27951. Create a bitmap of space used by data structures that must be preserved.
2796   The space reservations used to create the new metadata can be used here if
2797   the same rmap owner code is used to denote all of the objects being rebuilt.
2798
27992. Survey the reverse mapping data to create a bitmap of space owned by the
2800   same ``XFS_RMAP_OWN_*`` number for the metadata that is being preserved.
2801
28023. Use the bitmap disunion operator to subtract (1) from (2).
2803   The remaining set bits represent candidate extents that could be freed.
2804   The process moves on to step 4 below.
2805
2806Repairs for file-based metadata such as extended attributes, directories,
2807symbolic links, quota files and realtime bitmaps are performed by building a
2808new structure attached to a temporary file and swapping the forks.
2809Afterward, the mappings in the old file fork are the candidate blocks for
2810disposal.
2811
The process for disposing of old extents is as follows; a sketch appears after
the list:
2813
28144. For each candidate extent, count the number of reverse mapping records for
2815   the first block in that extent that do not have the same rmap owner for the
2816   data structure being repaired.
2817
2818   - If zero, the block has a single owner and can be freed.
2819
2820   - If not, the block is part of a crosslinked structure and must not be
2821     freed.
2822
28235. Starting with the next block in the extent, figure out how many more blocks
2824   have the same zero/nonzero other owner status as that first block.
2825
28266. If the region is crosslinked, delete the reverse mapping entry for the
2827   structure being repaired and move on to the next region.
2828
28297. If the region is to be freed, mark any corresponding buffers in the buffer
2830   cache as stale to prevent log writeback.
2831
28328. Free the region and move on.
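
The disposal decisions can be sketched as follows, with every helper name
standing in for the prose above::

  /* Pseudocode sketch of steps 4-8; helper names are illustrative. */
  for_each_candidate_extent(agbno, len) {
          while (len > 0) {
                  /* Step 4: do other owners also claim the first block? */
                  crosslinked = has_other_rmap_owners(agbno, oinfo);

                  /* Step 5: how far does that answer hold? */
                  span = count_same_status_blocks(agbno, len, crosslinked);

                  if (crosslinked) {
                          /* Step 6: drop only this structure's rmap. */
                          remove_rmap_record(agbno, span, oinfo);
                  } else {
                          /* Steps 7-8: invalidate buffers, then free. */
                          invalidate_buffers(agbno, span);
                          free_extent(agbno, span);
                  }
                  agbno += span;
                  len -= span;
          }
  }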
2833
2834However, there is one complication to this procedure.
2835Transactions are of finite size, so the reaping process must be careful to roll
2836the transactions to avoid overruns.
2837Overruns come from two sources:
2838
2839a. EFIs logged on behalf of space that is no longer occupied
2840
2841b. Log items for buffer invalidations
2842
2843This is also a window in which a crash during the reaping process can leak
2844blocks.
2845As stated earlier, online repair functions use very large transactions to
2846minimize the chances of this occurring.
2847
2848The proposed patchset is the
2849`preparation for bulk loading btrees
2850<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-prep-for-bulk-loading>`_
2851series.
2852
2853Case Study: Reaping After a Regular Btree Repair
2854````````````````````````````````````````````````
2855
2856Old reference count and inode btrees are the easiest to reap because they have
2857rmap records with special owner codes: ``XFS_RMAP_OWN_REFC`` for the refcount
2858btree, and ``XFS_RMAP_OWN_INOBT`` for the inode and free inode btrees.
2859Creating a list of extents to reap the old btree blocks is quite simple,
2860conceptually:
2861
28621. Lock the relevant AGI/AGF header buffers to prevent allocation and frees.
2863
28642. For each reverse mapping record with an rmap owner corresponding to the
2865   metadata structure being rebuilt, set the corresponding range in a bitmap.
2866
28673. Walk the current data structures that have the same rmap owner.
2868   For each block visited, clear that range in the above bitmap.
2869
28704. Each set bit in the bitmap represents a block that could be a block from the
2871   old data structures and hence is a candidate for reaping.
2872   In other words, ``(rmap_records_owned_by & ~blocks_reachable_by_walk)``
2873   are the blocks that might be freeable.
2874
2875If it is possible to maintain the AGF lock throughout the repair (which is the
2876common case), then step 2 can be performed at the same time as the reverse
2877mapping record walk that creates the records for the new btree.
2878
2879Case Study: Rebuilding the Free Space Indices
2880`````````````````````````````````````````````
2881
2882The high level process to rebuild the free space indices is:
2883
28841. Walk the reverse mapping records to generate ``struct xfs_alloc_rec_incore``
2885   records from the gaps in the reverse mapping btree.
2886
28872. Append the records to an xfarray.
2888
28893. Use the ``xfs_btree_bload_compute_geometry`` function to compute the number
2890   of blocks needed for each new tree.
2891
28924. Allocate the number of blocks computed in the previous step from the free
2893   space information collected.
2894
28955. Use ``xfs_btree_bload`` to write the xfarray records to btree blocks and
2896   generate the internal node blocks for the free space by length index.
2897   Call it again for the free space by block number index.
2898
28996. Commit the locations of the new btree root blocks to the AGF.
2900
29017. Reap the old btree blocks by looking for space that is not recorded by the
2902   reverse mapping btree, the new free space btrees, or the AGFL.
2903
2904Repairing the free space btrees has three key complications over a regular
2905btree repair:
2906
2907First, free space is not explicitly tracked in the reverse mapping records.
2908Hence, the new free space records must be inferred from gaps in the physical
2909space component of the keyspace of the reverse mapping btree.
2910
2911Second, free space repairs cannot use the common btree reservation code because
2912new blocks are reserved out of the free space btrees.
2913This is impossible when repairing the free space btrees themselves.
2914However, repair holds the AGF buffer lock for the duration of the free space
2915index reconstruction, so it can use the collected free space information to
2916supply the blocks for the new free space btrees.
2917It is not necessary to back each reserved extent with an EFI because the new
2918free space btrees are constructed in what the ondisk filesystem thinks is
2919unowned space.
2920However, if reserving blocks for the new btrees from the collected free space
2921information changes the number of free space records, repair must re-estimate
2922the new free space btree geometry with the new record count until the
2923reservation is sufficient.
2924As part of committing the new btrees, repair must ensure that reverse mappings
2925are created for the reserved blocks and that unused reserved blocks are
2926inserted into the free space btrees.
Deferred rmap and freeing operations are used to ensure that this transition
2928is atomic, similar to the other btree repair functions.
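
The re-estimation loop might be sketched like this;
``xfs_btree_bload_compute_geometry`` is the function described earlier, while
the ``nr_blocks`` field and the remaining helpers are assumptions made for the
example::

  /* Sketch: re-estimate until reserving blocks stops changing the count. */
  do {
          old_nr_records = nr_free_records;

          /* Estimate both free space btrees from the current record count. */
          xfs_btree_bload_compute_geometry(bno_cur, &bno_bload,
                                           nr_free_records);
          xfs_btree_bload_compute_geometry(cnt_cur, &cnt_bload,
                                           nr_free_records);

          /* Take the new btree blocks out of the collected free space... */
          reserve_blocks(bno_bload.nr_blocks + cnt_bload.nr_blocks);

          /* ...which may shrink or split the collected free space records. */
          nr_free_records = count_collected_free_records();
  } while (nr_free_records != old_nr_records);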
2929
2930Third, finding the blocks to reap after the repair is not overly
2931straightforward.
2932Blocks for the free space btrees and the reverse mapping btrees are supplied by
2933the AGFL.
2934Blocks put onto the AGFL have reverse mapping records with the owner
2935``XFS_RMAP_OWN_AG``.
2936This ownership is retained when blocks move from the AGFL into the free space
2937btrees or the reverse mapping btrees.
2938When repair walks reverse mapping records to synthesize free space records, it
2939creates a bitmap (``ag_owner_bitmap``) of all the space claimed by
2940``XFS_RMAP_OWN_AG`` records.
2941The repair context maintains a second bitmap corresponding to the rmap btree
2942blocks and the AGFL blocks (``rmap_agfl_bitmap``).
2943When the walk is complete, the bitmap disunion operation ``(ag_owner_bitmap &
2944~rmap_agfl_bitmap)`` computes the extents that are used by the old free space
2945btrees.
2946These blocks can then be reaped using the methods outlined above.
2947
2948The proposed patchset is the
2949`AG btree repair
2950<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_
2951series.
2952
2953.. _rmap_reap:
2954
2955Case Study: Reaping After Repairing Reverse Mapping Btrees
2956``````````````````````````````````````````````````````````
2957
2958Old reverse mapping btrees are less difficult to reap after a repair.
2959As mentioned in the previous section, blocks on the AGFL, the two free space
2960btree blocks, and the reverse mapping btree blocks all have reverse mapping
2961records with ``XFS_RMAP_OWN_AG`` as the owner.
The full process of gathering reverse mapping records and building a new btree
is described in the case study of
2964:ref:`live rebuilds of rmap data <rmap_repair>`, but a crucial point from that
2965discussion is that the new rmap btree will not contain any records for the old
2966rmap btree, nor will the old btree blocks be tracked in the free space btrees.
2967The list of candidate reaping blocks is computed by setting the bits
2968corresponding to the gaps in the new rmap btree records, and then clearing the
2969bits corresponding to extents in the free space btrees and the current AGFL
2970blocks.
2971The result ``(new_rmapbt_gaps & ~(agfl | bnobt_records))`` are reaped using the
2972methods outlined above.
2973
The rest of the process of rebuilding the reverse mapping btree is discussed
2975in a separate :ref:`case study<rmap_repair>`.
2976
2977The proposed patchset is the
2978`AG btree repair
2979<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_
2980series.
2981
2982Case Study: Rebuilding the AGFL
2983```````````````````````````````
2984
The allocation group free block list (AGFL) is repaired as follows; a sketch
appears after the list:
2986
29871. Create a bitmap for all the space that the reverse mapping data claims is
2988   owned by ``XFS_RMAP_OWN_AG``.
2989
29902. Subtract the space used by the two free space btrees and the rmap btree.
2991
29923. Subtract any space that the reverse mapping data claims is owned by any
2993   other owner, to avoid re-adding crosslinked blocks to the AGFL.
2994
4. Once the AGFL is full, reap any leftover blocks.
2996
29975. The next operation to fix the freelist will right-size the list.
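
A bitmap-level sketch of the procedure, with every helper name illustrative;
the real implementation lives in the file referenced below::

  /* Pseudocode sketch of AGFL reconstruction. */
  bitmap_set_from_rmaps(&agfl_extents, XFS_RMAP_OWN_AG);          /* step 1 */
  bitmap_clear_btree_blocks(&agfl_extents, bnobt, cntbt, rmapbt); /* step 2 */
  bitmap_clear_crosslinked(&agfl_extents);                        /* step 3 */

  /* Fill the AGFL from the surviving bits; reap whatever is left over. */
  fill_agfl(&agfl_extents, xfs_agfl_size(mp));
  reap_extents(&agfl_extents);                                    /* step 4 */
  /* Step 5 happens the next time the freelist is fixed up. */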
2998
2999See `fs/xfs/scrub/agheader_repair.c <https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/xfs/scrub/agheader_repair.c>`_ for more details.
3000