.. SPDX-License-Identifier: GPL-2.0
.. _xfs_online_fsck_design:

..
        Mapping of heading styles within this document:
        Heading 1 uses "====" above and below
        Heading 2 uses "===="
        Heading 3 uses "----"
        Heading 4 uses "````"
        Heading 5 uses "^^^^"
        Heading 6 uses "~~~~"
        Heading 7 uses "...."

        Sections are manually numbered because apparently that's what everyone
        does in the kernel.
======================
XFS Online Fsck Design
======================

This document captures the design of the online filesystem check feature for
XFS.
The purpose of this document is threefold:

- To help kernel distributors understand exactly what the XFS online fsck
  feature is, and issues about which they should be aware.

- To help people reading the code to familiarize themselves with the relevant
  concepts and design points before they start digging into the code.

- To help developers maintaining the system by capturing the reasons
  supporting higher level decision making.

As the online fsck code is merged, the links in this document to topic branches
will be replaced with links to code.

This document is licensed under the terms of the GNU General Public License,
v2.
The primary author is Darrick J. Wong.

This design document is split into seven parts.
Part 1 defines what fsck tools are and the motivations for writing a new one.
Parts 2 and 3 present a high level overview of how the online fsck process
works and how it is tested to ensure correct functionality.
Part 4 discusses the user interface and the intended usage modes of the new
program.
Parts 5 and 6 show off the high level components and how they fit together, and
then present case studies of how each repair function actually works.
Part 7 sums up what has been discussed so far and speculates about what else
might be built atop online fsck.

.. contents:: Table of Contents
   :local:

1. What is a Filesystem Check?
==============================

A Unix filesystem has four main responsibilities:

- Provide a hierarchy of names through which application programs can associate
  arbitrary blobs of data for any length of time,

- Virtualize physical storage media across those names,

- Retrieve the named data blobs at any time, and

- Examine resource usage.

Metadata directly supporting these functions (e.g. files, directories, space
mappings) are sometimes called primary metadata.
Secondary metadata (e.g. reverse mapping and directory parent pointers) support
operations internal to the filesystem, such as internal consistency checking
and reorganization.
Summary metadata, as the name implies, condense information contained in
primary metadata for performance reasons.

The filesystem check (fsck) tool examines all the metadata in a filesystem
to look for errors.
In addition to looking for obvious metadata corruptions, fsck also
cross-references different types of metadata records with each other to look
for inconsistencies.
People do not like losing data, so most fsck tools also contain some ability
to correct any problems found.
As a word of caution -- the primary goal of most Linux fsck tools is to restore
the filesystem metadata to a consistent state, not to maximize the data
recovered.
That precedent will not be challenged here.

Filesystems of the 20th century generally lacked any redundancy in the ondisk
format, which means that fsck can only respond to errors by erasing files until
errors are no longer detected.
More recent filesystem designs contain enough redundancy in their metadata that
it is now possible to regenerate data structures when non-catastrophic errors
occur; this capability aids both strategies: restoring metadata consistency
and maximizing the data recovered.

+--------------------------------------------------------------------------+
| **Note**:                                                                |
+--------------------------------------------------------------------------+
| System administrators avoid data loss by increasing the number of        |
| separate storage systems through the creation of backups; and they avoid |
| downtime by increasing the redundancy of each storage system through the |
| creation of RAID arrays.                                                 |
| fsck tools address only the first problem.                               |
+--------------------------------------------------------------------------+

TLDR; Show Me the Code!
-----------------------

Code is posted to the kernel.org git trees as follows:
`kernel changes <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-symlink>`_,
`userspace changes <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-media-scan-service>`_, and
`QA test changes <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=repair-dirs>`_.
Each kernel patchset adding an online repair function will use the same branch
name across the kernel, xfsprogs, and fstests git repos.

Existing Tools
--------------

The online fsck tool described here will be the third tool in the history of
XFS (on Linux) to check and repair filesystems.
Two programs precede it:

The first program, ``xfs_check``, was created as part of the XFS debugger
(``xfs_db``) and can only be used with unmounted filesystems.
It walks all metadata in the filesystem looking for inconsistencies in the
metadata, though it lacks any ability to repair what it finds.
Due to its high memory requirements and inability to repair things, this
program is now deprecated and will not be discussed further.

The second program, ``xfs_repair``, was created to be faster and more robust
than the first program.
Like its predecessor, it can only be used with unmounted filesystems.
It uses extent-based in-memory data structures to reduce memory consumption,
and tries to schedule readahead IO appropriately to reduce IO waiting time
while it scans the metadata of the entire filesystem.
The most important feature of this tool is its ability to respond to
inconsistencies in file metadata and the directory tree by erasing things as
needed to eliminate problems.
Space usage metadata are rebuilt from the observed file metadata.

Problem Statement
-----------------

The current XFS tools leave several problems unsolved:

1. **User programs** suddenly **lose access** to the filesystem when unexpected
   shutdowns occur as a result of silent corruptions in the metadata.
   These occur **unpredictably** and often without warning.

2. **Users** experience a **total loss of service** during the recovery period
   after an **unexpected shutdown** occurs.

3. **Users** experience a **total loss of service** if the filesystem is taken
   offline to **look for problems** proactively.

4. **Data owners** cannot **check the integrity** of their stored data without
   reading all of it.
   This may expose them to substantial billing costs when a linear media scan
   performed by the storage system administrator might suffice.

5. **System administrators** cannot **schedule** a maintenance window to deal
   with corruptions if they **lack the means** to assess filesystem health
   while the filesystem is online.

6. **Fleet monitoring tools** cannot **automate periodic checks** of filesystem
   health when doing so requires **manual intervention** and downtime.

7. **Users** can be tricked into **doing things they do not desire** when
   malicious actors **exploit quirks of Unicode** to place misleading names
   in directories.

Given this definition of the problems to be solved and the actors who would
benefit, the proposed solution is a third fsck tool that acts on a running
filesystem.

This new third program has three components: an in-kernel facility to check
metadata, an in-kernel facility to repair metadata, and a userspace driver
program to drive fsck activity on a live filesystem.
``xfs_scrub`` is the name of the driver program.
The rest of this document presents the goals and use cases of the new fsck
tool, describes its major design points in connection to those goals, and
discusses the similarities and differences with existing tools.

+--------------------------------------------------------------------------+
| **Note**:                                                                |
+--------------------------------------------------------------------------+
| Throughout this document, the existing offline fsck tool can also be     |
| referred to by its current name "``xfs_repair``".                        |
| The userspace driver program for the new online fsck tool can be         |
| referred to as "``xfs_scrub``".                                          |
| The kernel portion of online fsck that validates metadata is called      |
| "online scrub", and the portion of the kernel that fixes metadata is     |
| called "online repair".                                                  |
+--------------------------------------------------------------------------+

The naming hierarchy is broken up into objects known as directories and files
and the physical space is split into pieces known as allocation groups.
Sharding enables better performance on highly parallel systems and helps to
contain the damage when corruptions occur.
The division of the filesystem into principal objects (allocation groups and
inodes) means that there are ample opportunities to perform targeted checks and
repairs on a subset of the filesystem.

While checks and repairs proceed on one shard, other parts of the filesystem
continue processing IO requests.
Even if a piece of filesystem metadata can only be regenerated by scanning the
entire system, the scan can still be done in the background while other file
operations continue.

In summary, online fsck takes advantage of resource sharding and redundant
metadata to enable targeted checking and repair operations while the system
is running.
This capability will be coupled to automatic system management so that
autonomous self-healing of XFS maximizes service availability.

2. Theory of Operation
======================

Because it is necessary for online fsck to lock and scan live metadata objects,
online fsck consists of three separate code components.
The first is the userspace driver program ``xfs_scrub``, which is responsible
for identifying individual metadata items, scheduling work items for them,
reacting to the outcomes appropriately, and reporting results to the system
administrator.
The second and third components live in the kernel, which implements functions
to check and to repair each type of online fsck work item.

+------------------------------------------------------------------+
| **Note**:                                                        |
+------------------------------------------------------------------+
| For brevity, this document shortens the phrase "online fsck work |
| item" to "scrub item".                                           |
+------------------------------------------------------------------+

Scrub item types are delineated in a manner consistent with the Unix design
philosophy, which is to say that each item should handle one aspect of a
metadata structure, and handle it well.

Scope
-----

In principle, online fsck should be able to check and to repair everything that
the offline fsck program can handle.
However, online fsck cannot be running 100% of the time, which means that
latent errors may creep in after a scrub completes.
If these errors cause the next mount to fail, offline fsck is the only
solution.
This limitation means that maintenance of the offline fsck tool will continue.
A second limitation of online fsck is that it must follow the same resource
sharing and lock acquisition rules as the regular filesystem.
This means that scrub cannot take *any* shortcuts to save time, because doing
so could lead to concurrency problems.
In other words, online fsck is not a complete replacement for offline fsck, and
a complete run of online fsck may take longer than offline fsck.
However, both of these limitations are acceptable tradeoffs to satisfy the
different motivations of online fsck, which are to **minimize system downtime**
and to **increase predictability of operation**.

.. _scrubphases:

Phases of Work
--------------

The userspace driver program ``xfs_scrub`` splits the work of checking and
repairing an entire filesystem into seven phases.
Each phase concentrates on checking specific types of scrub items and depends
on the success of all previous phases.
The seven phases are as follows:

1. Collect geometry information about the mounted filesystem and computer,
   discover the online fsck capabilities of the kernel, and open the
   underlying storage devices.

2. Check allocation group metadata, all realtime volume metadata, and all quota
   files.
   Each metadata structure is scheduled as a separate scrub item.
   If corruption is found in the inode header or inode btree and ``xfs_scrub``
   is permitted to perform repairs, then those scrub items are repaired to
   prepare for phase 3.
   Repairs are implemented by using the information in the scrub item to
   resubmit the kernel scrub call with the repair flag enabled; this is
   discussed in the next section.
   Optimizations and all other repairs are deferred to phase 4.

3. Check all metadata of every file in the filesystem.
   Each metadata structure is also scheduled as a separate scrub item.
   If repairs are needed and ``xfs_scrub`` is permitted to perform repairs,
   and there were no problems detected during phase 2, then those scrub items
   are repaired immediately.
   Optimizations, deferred repairs, and unsuccessful repairs are deferred to
   phase 4.

4. All remaining repairs and scheduled optimizations are performed during this
   phase, if the caller permits them.
   Before starting repairs, the summary counters are checked and any necessary
   repairs are performed so that subsequent repairs will not fail the resource
   reservation step due to wildly incorrect summary counters.
   Unsuccessful repairs are requeued as long as forward progress on repairs is
   made somewhere in the filesystem.
   Free space in the filesystem is trimmed at the end of phase 4 if the
   filesystem is clean.

5. By the start of this phase, all primary and secondary filesystem metadata
   must be correct.
   Summary counters such as the free space counts and quota resource counts
   are checked and corrected.
   Directory entry names and extended attribute names are checked for
   suspicious entries such as control characters or confusing Unicode sequences
   appearing in names.

6. If the caller asks for a media scan, read all allocated and written data
   file extents in the filesystem.
   The ability to use hardware-assisted data file integrity checking is new
   to online fsck; neither of the previous tools has this capability.
   If media errors occur, they will be mapped to the owning files and reported.

7. Re-check the summary counters and present the caller with a summary of
   space usage and file counts.

Steps for Each Scrub Item
-------------------------

The kernel scrub code uses a three-step strategy for checking and repairing
the one aspect of a metadata object represented by a scrub item:

1. The scrub item of interest is checked for corruptions; opportunities for
   optimization; and for values that are directly controlled by the system
   administrator but look suspicious.
   If the item is not corrupt or does not need optimization, resources are
   released and the positive scan results are returned to userspace.
   If the item is corrupt or could be optimized but the caller does not permit
   this, resources are released and the negative scan results are returned to
   userspace.
   Otherwise, the kernel moves on to the second step.

2. The repair function is called to rebuild the data structure.
   Repair functions generally choose to rebuild a structure from other metadata
   rather than try to salvage the existing structure.
   If the repair fails, the scan results from the first step are returned to
   userspace.
   Otherwise, the kernel moves on to the third step.

3. In the third step, the kernel runs the same checks over the new metadata
   item to assess the efficacy of the repairs.
   The results of the reassessment are returned to userspace.

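From userspace, this three-step cycle is driven through the metadata scrub
ioctl.
The sketch below, which assumes the ``XFS_IOC_SCRUB_METADATA`` ioctl and the
``struct xfs_scrub_metadata`` definitions exported through the xfsprogs
``xfs/xfs.h`` header, shows how a driver program might check one allocation
group's inode btree and, if problems are reported, resubmit the call with the
repair flag set.
It is a minimal illustration with abbreviated error handling, not the actual
``xfs_scrub`` code:

.. code-block:: c

        #include <stdbool.h>
        #include <stdint.h>
        #include <string.h>
        #include <sys/ioctl.h>
        #include <xfs/xfs.h>

        /*
         * Minimal sketch of the check -> repair -> re-check cycle.
         * Returns 0 if this AG's inode btree is clean or was repaired,
         * 1 if it remains unfixed, and -1 on operational error.
         */
        static int scrub_inobt(int mount_fd, uint32_t agno, bool try_repair)
        {
                struct xfs_scrub_metadata sm;

                memset(&sm, 0, sizeof(sm));
                sm.sm_type = XFS_SCRUB_TYPE_INOBT;
                sm.sm_agno = agno;

                /* Step 1: check only; the kernel sets output flags. */
                if (ioctl(mount_fd, XFS_IOC_SCRUB_METADATA, &sm) < 0)
                        return -1;
                if (!(sm.sm_flags & (XFS_SCRUB_OFLAG_CORRUPT |
                                     XFS_SCRUB_OFLAG_PREEN)))
                        return 0;       /* clean; nothing more to do */
                if (!try_repair)
                        return 1;       /* dirty, but repairs not permitted */

                /*
                 * Steps 2 and 3: resubmit with the repair flag set; the
                 * kernel re-checks the rebuilt structure before returning.
                 */
                sm.sm_flags = XFS_SCRUB_IFLAG_REPAIR;
                if (ioctl(mount_fd, XFS_IOC_SCRUB_METADATA, &sm) < 0)
                        return -1;
                return (sm.sm_flags & XFS_SCRUB_OFLAG_CORRUPT) ? 1 : 0;
        }
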
Classification of Metadata
--------------------------

Each type of metadata object (and therefore each type of scrub item) is
classified as follows:

Primary Metadata
````````````````

Metadata structures in this category should be most familiar to filesystem
users either because they are directly created by the user or they index
objects created by the user.
Most filesystem objects fall into this class:

- Free space and reference count information

- Inode records and indexes

- Storage mapping information for file data

- Directories

- Extended attributes

- Symbolic links

- Quota limits

Scrub obeys the same rules as regular filesystem accesses for resource and lock
acquisition.

Primary metadata objects are the simplest for scrub to process.
The principal filesystem object (either an allocation group or an inode) that
owns the item being scrubbed is locked to guard against concurrent updates.
The check function examines every record associated with the type for obvious
errors and cross-references healthy records against other metadata to look for
inconsistencies.
Repairs for this class of scrub item are simple, since the repair function
starts by holding all the resources acquired in the previous step.
The repair function scans available metadata as needed to record all the
observations needed to complete the structure.
Next, it stages the observations in a new ondisk structure and commits it
atomically to complete the repair.
Finally, the storage from the old data structure is carefully reaped.

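The overall shape of such a repair function might resemble the following
sketch.
All of the type and helper names here are hypothetical stand-ins, shown only
to make the sequence of steps concrete; they are not the kernel's actual
internal API:

.. code-block:: c

        #include <errno.h>

        /* All types and helpers below are hypothetical stand-ins. */
        struct scrub_context;
        struct staging_area;

        struct staging_area *staging_alloc(void);
        void staging_free(struct staging_area *stage);
        int scan_redundant_metadata(struct scrub_context *sc,
                                    struct staging_area *stage);
        int stage_and_commit(struct scrub_context *sc,
                             struct staging_area *stage);
        int reap_old_structure(struct scrub_context *sc,
                               struct staging_area *stage);

        /* Offline-style repair: resources stay locked the whole time. */
        int repair_primary_structure(struct scrub_context *sc)
        {
                struct staging_area *stage;
                int error;

                /* The owning AG or inode is already locked by the check. */
                stage = staging_alloc();
                if (!stage)
                        return -ENOMEM;

                /* Scan other (redundant) metadata for observations. */
                error = scan_redundant_metadata(sc, stage);
                if (error)
                        goto out;

                /*
                 * Write the observations into a new ondisk structure and
                 * commit it atomically, then reap the old, damaged one.
                 */
                error = stage_and_commit(sc, stage);
                if (!error)
                        error = reap_old_structure(sc, stage);
        out:
                staging_free(stage);
                return error;
        }
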
Because ``xfs_scrub`` locks a primary object for the duration of the repair,
this is effectively an offline repair operation performed on a subset of the
filesystem.
This minimizes the complexity of the repair code because it is not necessary to
handle concurrent updates from other threads, nor is it necessary to access
any other part of the filesystem.
As a result, indexed structures can be rebuilt very quickly, and programs
trying to access the damaged structure will be blocked until repairs complete.
The only infrastructure needed by the repair code are the staging area for
observations and a means to write new structures to disk.
Despite these limitations, the advantage that online repair holds is clear:
targeted work on individual shards of the filesystem avoids total loss of
service.

This mechanism is described in section 2.1 ("Off-Line Algorithm") of
V. Srinivasan and M. J. Carey, `"Performance of On-Line Index Construction
Algorithms" <https://minds.wisconsin.edu/bitstream/handle/1793/59524/TR1047.pdf>`_,
*Extending Database Technology*, pp. 293-309, 1992.

Most primary metadata repair functions stage their intermediate results in an
in-memory array prior to formatting the new ondisk structure, which is very
similar to the list-based algorithm discussed in section 2.3 ("List-Based
Algorithms") of Srinivasan.
However, any data structure builder that maintains a resource lock for the
duration of the repair is *always* an offline algorithm.

Secondary Metadata
``````````````````

Metadata structures in this category reflect records found in primary metadata,
but are only needed for online fsck or for reorganization of the filesystem.

Secondary metadata include:

- Reverse mapping information

- Directory parent pointers

This class of metadata is difficult for scrub to process because scrub attaches
to the secondary object but needs to check primary metadata, which runs counter
to the usual order of resource acquisition.
Frequently, this means that full filesystem scans are necessary to rebuild the
metadata.
Check functions can be limited in scope to reduce runtime.
Repairs, however, require a full scan of primary metadata, which can take a
long time to complete.
Under these conditions, ``xfs_scrub`` cannot lock resources for the entire
duration of the repair.

Instead, repair functions set up an in-memory staging structure to store
observations.
Depending on the requirements of the specific repair function, the staging
index will either have the same format as the ondisk structure or a design
specific to that repair function.
The next step is to release all locks and start the filesystem scan.
When the repair scanner needs to record an observation, the staging data are
locked long enough to apply the update.
While the filesystem scan is in progress, the repair function hooks the
filesystem so that it can apply pending filesystem updates to the staging
information.
Once the scan is done, the owning object is re-locked, the live data is used to
write a new ondisk structure, and the repairs are committed atomically.
The hooks are disabled and the staging area is freed.
Finally, the storage from the old data structure is carefully reaped.

Introducing concurrency helps online repair avoid various locking problems, but
comes at a high cost to code complexity.
Live filesystem code has to be hooked so that the repair function can observe
updates in progress.
The staging area has to become a fully functional parallel structure so that
updates can be merged from the hooks.
Finally, the hook, the filesystem scan, and the inode locking model must be
sufficiently well integrated that a hook event can decide if a given update
should be applied to the staging structure.

In theory, the scrub implementation could apply these same techniques for
primary metadata, but doing so would make it massively more complex and less
performant.
Programs attempting to access the damaged structures are not blocked from
operation, which may cause application failure or an unplanned filesystem
shutdown.

Inspiration for the secondary metadata repair strategy was drawn from section
2.4 of Srinivasan above, and sections 2 ("NSF: Index Build Without Side-File")
and 3.1.1 ("Duplicate Key Insert Problem") in C. Mohan, `"Algorithms for
Creating Indexes for Very Large Tables Without Quiescing Updates"
<https://dl.acm.org/doi/10.1145/130283.130337>`_, 1992.

The sidecar index mentioned above bears some resemblance to the side file
method mentioned in Srinivasan and Mohan.
Their method consists of an index builder that extracts relevant record data to
build the new structure as quickly as possible; and an auxiliary structure that
captures all updates that would be committed to the index by other threads were
the new index already online.
After the index building scan finishes, the updates recorded in the side file
are applied to the new index.
To avoid conflicts between the index builder and other writer threads, the
builder maintains a publicly visible cursor that tracks the progress of the
scan through the record space.
To avoid duplication of work between the side file and the index builder, side
file updates are elided when the record ID for the update is greater than the
cursor position within the record ID space.

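The elision rule reduces to a single comparison against the scan cursor, as
this sketch shows; the function name is hypothetical:

.. code-block:: c

        #include <stdbool.h>
        #include <stdint.h>

        /*
         * Records at or behind the scan cursor have already been copied
         * into the new index, so concurrent updates to them must be
         * recorded in the side file.  Records ahead of the cursor will be
         * visited by the scan itself, so updates to them can be elided.
         */
        static bool side_file_should_record(uint64_t scan_cursor,
                                            uint64_t record_id)
        {
                return record_id <= scan_cursor;
        }
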
To minimize changes to the rest of the codebase, XFS online repair keeps the
replacement index hidden until it's completely ready to go.
In other words, there is no attempt to expose the keyspace of the new index
while repair is running.
The complexity of such an approach would be very high and perhaps more
appropriate to building *new* indices.

**Future Work Question**: Can the full scan and live update code used to
facilitate a repair also be used to implement a comprehensive check?

*Answer*: In theory, yes.  Check would be much stronger if each scrub function
employed these live scans to build a shadow copy of the metadata and then
compared the shadow records to the ondisk records.
However, doing that is a fair amount more work than what the checking functions
do now, and it would increase the runtime of those scrub functions.
The live scans and hooks were also developed much later than the checking
code.

Summary Information
```````````````````

Metadata structures in this last category summarize the contents of primary
metadata records.
These are often used to speed up resource usage queries, and are many times
smaller than the primary metadata which they represent.

Examples of summary information include:

- Summary counts of free space and inodes

- File link counts from directories

- Quota resource usage counts

Check and repair require full filesystem scans, but resource and lock
acquisition follow the same paths as regular filesystem accesses.

The superblock summary counters have special requirements due to the underlying
implementation of the incore counters, and will be treated separately.
Check and repair of the other types of summary counters (quota resource counts
and file link counts) employ the same filesystem scanning and hooking
techniques as outlined above, but because the underlying data are sets of
integer counters, the staging data need not be a fully functional mirror of the
ondisk structure.

Inspiration for quota and file link count repair strategies was drawn from
sections 2.12 ("Online Index Operations") through 2.14 ("Incremental View
Maintenance") of G. Graefe, `"Concurrent Queries and Updates in Summary Views
and Their Indexes"
<http://www.odbms.org/wp-content/uploads/2014/06/Increment-locks.pdf>`_, 2011.

Since quotas are non-negative integer counts of resource usage, online
quotacheck can use the incremental view deltas described in section 2.14 to
track pending changes to the block and inode usage counts in each transaction,
and commit those changes to a dquot side file when the transaction commits.
Delta tracking is necessary for dquots because the index builder scans inodes,
whereas the data structure being rebuilt is an index of dquots.
Link count checking combines the view deltas and commit step into one because
it sets attributes of the objects being scanned instead of writing them to a
separate data structure.
Each online fsck function will be discussed as a case study later in this
document.

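A minimal sketch of that delta tracking might look like the following; the
structure and function names are hypothetical, intended only to illustrate
the incremental view technique, not the kernel's actual implementation:

.. code-block:: c

        #include <stdint.h>

        /* Hypothetical shadow dquot built by the quotacheck scan. */
        struct shadow_dquot {
                uint64_t        id;     /* user/group/project id */
                int64_t         bcount; /* observed block usage */
                int64_t         icount; /* observed inode usage */
                /* lock and lookup linkage elided */
        };

        /* Hypothetical locking helpers for the staging data. */
        void shadow_dquot_lock(struct shadow_dquot *sdq);
        void shadow_dquot_unlock(struct shadow_dquot *sdq);

        /*
         * Called from a transaction commit hook while the quotacheck scan
         * runs, to fold one transaction's pending usage changes into the
         * shadow dquot (the "side file") that the scan is building.
         */
        void quotacheck_commit_deltas(struct shadow_dquot *sdq,
                                      int64_t bcount_delta,
                                      int64_t icount_delta)
        {
                shadow_dquot_lock(sdq); /* staging locked only briefly */
                sdq->bcount += bcount_delta;
                sdq->icount += icount_delta;
                shadow_dquot_unlock(sdq);
        }
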
Risk Management
---------------

During the development of online fsck, several risk factors were identified
that may make the feature unsuitable for certain distributors and users.
Steps can be taken to mitigate or eliminate those risks, though at a cost to
functionality.

- **Decreased performance**: Adding metadata indices to the filesystem
  increases the time cost of persisting changes to disk, and the reverse space
  mapping and directory parent pointers are no exception.
  System administrators who require the maximum performance can disable the
  reverse mapping features at format time, though this choice dramatically
  reduces the ability of online fsck to find inconsistencies and repair them.

- **Incorrect repairs**: As with all software, there might be defects in the
  software that result in incorrect repairs being written to the filesystem.
  Systematic fuzz testing (detailed in the next section) is employed by the
  authors to find bugs early, but it might not catch everything.
  The kernel build system provides Kconfig options (``CONFIG_XFS_ONLINE_SCRUB``
  and ``CONFIG_XFS_ONLINE_REPAIR``) to enable distributors to choose not to
  accept this risk.
  The xfsprogs build system has a configure option (``--enable-scrub=no``) that
  disables building of the ``xfs_scrub`` binary, though this is not a risk
  mitigation if the kernel functionality remains enabled.

- **Inability to repair**: Sometimes, a filesystem is too badly damaged to be
  repairable.
  If the keyspaces of several metadata indices overlap in some manner but a
  coherent narrative cannot be formed from records collected, then the repair
  fails.
  To reduce the chance that a repair will fail with a dirty transaction and
  render the filesystem unusable, the online repair functions have been
  designed to stage and validate all new records before committing the new
  structure.

- **Misbehavior**: Online fsck requires many privileges -- raw IO to block
  devices, opening files by handle, ignoring Unix discretionary access control,
  and the ability to perform administrative changes.
  Running this automatically in the background scares people, so the systemd
  background service is configured to run with only the privileges required.
  Obviously, this cannot address certain problems like the kernel crashing or
  deadlocking, but it should be sufficient to prevent the scrub process from
  escaping and reconfiguring the system.
  The cron job does not have this protection.

- **Fuzz Kiddiez**: There are many people now who seem to think that running
  automated fuzz testing of ondisk artifacts to find mischievous behavior and
  spraying exploit code onto the public mailing list for instant zero-day
  disclosure is somehow of some social benefit.
  In the view of this author, the benefit is realized only when the fuzz
  operators help to **fix** the flaws, but this opinion apparently is not
  widely shared among security "researchers".
  The XFS maintainers' continuing ability to manage these events presents an
  ongoing risk to the stability of the development process.
  Automated testing should front-load some of the risk while the feature is
  considered EXPERIMENTAL.

Many of these risks are inherent to software programming.
Despite this, it is hoped that this new functionality will prove useful in
reducing unexpected downtime.