.. SPDX-License-Identifier: GPL-2.0
.. _xfs_online_fsck_design:

..
        Mapping of heading styles within this document:
        Heading 1 uses "====" above and below
        Heading 2 uses "===="
        Heading 3 uses "----"
        Heading 4 uses "````"
        Heading 5 uses "^^^^"
        Heading 6 uses "~~~~"
        Heading 7 uses "...."

        Sections are manually numbered because apparently that's what everyone
        does in the kernel.

======================
XFS Online Fsck Design
======================

This document captures the design of the online filesystem check feature for
XFS.
The purpose of this document is threefold:

- To help kernel distributors understand exactly what the XFS online fsck
  feature is, and issues about which they should be aware.

- To help people reading the code to familiarize themselves with the relevant
  concepts and design points before they start digging into the code.

- To help developers maintaining the system by capturing the reasons
  supporting higher level decision making.

As the online fsck code is merged, the links in this document to topic branches
will be replaced with links to code.

This document is licensed under the terms of the GNU Public License, v2.
The primary author is Darrick J. Wong.

This design document is split into seven parts.
Part 1 defines what fsck tools are and the motivations for writing a new one.
Parts 2 and 3 present a high level overview of how the online fsck process
works and how it is tested to ensure correct functionality.
Part 4 discusses the user interface and the intended usage modes of the new
program.
Parts 5 and 6 show off the high level components and how they fit together, and
then present case studies of how each repair function actually works.
Part 7 sums up what has been discussed so far and speculates about what else
might be built atop online fsck.

.. contents:: Table of Contents
   :local:

1. What is a Filesystem Check?
==============================

A Unix filesystem has four main responsibilities:

- Provide a hierarchy of names through which application programs can associate
  arbitrary blobs of data for any length of time,

- Virtualize physical storage media across those names,

- Retrieve the named data blobs at any time, and

- Examine resource usage.

Metadata directly supporting these functions (e.g. files, directories, space
mappings) are sometimes called primary metadata.
Secondary metadata (e.g. reverse mapping and directory parent pointers) support
operations internal to the filesystem, such as internal consistency checking
and reorganization.
Summary metadata, as the name implies, condense information contained in
primary metadata for performance reasons.

The filesystem check (fsck) tool examines all the metadata in a filesystem
to look for errors.
In addition to looking for obvious metadata corruptions, fsck also
cross-references different types of metadata records with each other to look
for inconsistencies.
People do not like losing data, so most fsck tools also contain some ability
to correct any problems found.
As a word of caution -- the primary goal of most Linux fsck tools is to restore
the filesystem metadata to a consistent state, not to maximize the data
recovered.
That precedent will not be challenged here.

Filesystems of the 20th century generally lacked any redundancy in the ondisk
format, which means that fsck can only respond to errors by erasing files until
errors are no longer detected.
More recent filesystem designs contain enough redundancy in their metadata that
it is now possible to regenerate data structures when non-catastrophic errors
occur; this capability aids both strategies.

+--------------------------------------------------------------------------+
| **Note**:                                                                |
+--------------------------------------------------------------------------+
| System administrators avoid data loss by increasing the number of        |
| separate storage systems through the creation of backups; and they avoid |
| downtime by increasing the redundancy of each storage system through the |
| creation of RAID arrays.                                                 |
| fsck tools address only the first problem.                               |
+--------------------------------------------------------------------------+

TLDR; Show Me the Code!
-----------------------

Code is posted to the kernel.org git trees as follows:
`kernel changes <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-symlink>`_,
`userspace changes <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-media-scan-service>`_, and
`QA test changes <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=repair-dirs>`_.
Each kernel patchset adding an online repair function will use the same branch
name across the kernel, xfsprogs, and fstests git repos.

Existing Tools
--------------

The online fsck tool described here will be the third tool in the history of
XFS (on Linux) to check and repair filesystems.
Two programs precede it:

The first program, ``xfs_check``, was created as part of the XFS debugger
(``xfs_db``) and can only be used with unmounted filesystems.
It walks all metadata in the filesystem looking for inconsistencies in the
metadata, though it lacks any ability to repair what it finds.
Due to its high memory requirements and inability to repair things, this
program is now deprecated and will not be discussed further.

The second program, ``xfs_repair``, was created to be faster and more robust
than the first program.
Like its predecessor, it can only be used with unmounted filesystems.
It uses extent-based in-memory data structures to reduce memory consumption,
and tries to schedule readahead IO appropriately to reduce I/O waiting time
while it scans the metadata of the entire filesystem.
The most important feature of this tool is its ability to respond to
inconsistencies in file metadata and the directory tree by erasing things as
needed to eliminate problems.
Space usage metadata are rebuilt from the observed file metadata.

Problem Statement
-----------------

The current XFS tools leave several problems unsolved:

1. **User programs** suddenly **lose access** to the filesystem when unexpected
   shutdowns occur as a result of silent corruptions in the metadata.
   These occur **unpredictably** and often without warning.

2. **Users** experience a **total loss of service** during the recovery period
   after an **unexpected shutdown** occurs.

3. **Users** experience a **total loss of service** if the filesystem is taken
   offline to **look for problems** proactively.

4. **Data owners** cannot **check the integrity** of their stored data without
   reading all of it.
   This may expose them to substantial billing costs when a linear media scan
   performed by the storage system administrator might suffice.

5. **System administrators** cannot **schedule** a maintenance window to deal
   with corruptions if they **lack the means** to assess filesystem health
   while the filesystem is online.

6. **Fleet monitoring tools** cannot **automate periodic checks** of filesystem
   health when doing so requires **manual intervention** and downtime.

7. **Users** can be tricked into **doing things they do not desire** when
   malicious actors **exploit quirks of Unicode** to place misleading names
   in directories.

Given this definition of the problems to be solved and the actors who would
benefit, the proposed solution is a third fsck tool that acts on a running
filesystem.

This new third program has three components: an in-kernel facility to check
metadata, an in-kernel facility to repair metadata, and a userspace driver
program to drive fsck activity on a live filesystem.
``xfs_scrub`` is the name of the driver program.
The rest of this document presents the goals and use cases of the new fsck
tool, describes its major design points in connection to those goals, and
discusses the similarities and differences with existing tools.

+--------------------------------------------------------------------------+
| **Note**:                                                                |
+--------------------------------------------------------------------------+
| Throughout this document, the existing offline fsck tool can also be     |
| referred to by its current name "``xfs_repair``".                        |
| The userspace driver program for the new online fsck tool can be         |
| referred to as "``xfs_scrub``".                                          |
| The kernel portion of online fsck that validates metadata is called      |
| "online scrub", and the portion of the kernel that fixes metadata is     |
| called "online repair".                                                  |
+--------------------------------------------------------------------------+

The naming hierarchy is broken up into objects known as directories and files
and the physical space is split into pieces known as allocation groups.
Sharding enables better performance on highly parallel systems and helps to
contain the damage when corruptions occur.
The division of the filesystem into principal objects (allocation groups and
inodes) means that there are ample opportunities to perform targeted checks
and repairs on a subset of the filesystem.

While this is going on, other parts continue processing IO requests.
Even if a piece of filesystem metadata can only be regenerated by scanning the
entire system, the scan can still be done in the background while other file
operations continue.

In summary, online fsck takes advantage of resource sharding and redundant
metadata to enable targeted checking and repair operations while the system
is running.
This capability will be coupled to automatic system management so that
autonomous self-healing of XFS maximizes service availability.

2. Theory of Operation
======================

Because it is necessary for online fsck to lock and scan live metadata objects,
online fsck consists of three separate code components.
The first is the userspace driver program ``xfs_scrub``, which is responsible
for identifying individual metadata items, scheduling work items for them,
reacting to the outcomes appropriately, and reporting results to the system
administrator.
The second and third are in the kernel, which implements functions to check
and repair each type of online fsck work item.

+------------------------------------------------------------------+
| **Note**:                                                        |
+------------------------------------------------------------------+
| For brevity, this document shortens the phrase "online fsck work |
| item" to "scrub item".                                           |
+------------------------------------------------------------------+

Scrub item types are delineated in a manner consistent with the Unix design
philosophy, which is to say that each item should handle one aspect of a
metadata structure, and handle it well.

Scope
-----

In principle, online fsck should be able to check and to repair everything that
the offline fsck program can handle.
However, online fsck cannot be running 100% of the time, which means that
latent errors may creep in after a scrub completes.
If these errors cause the next mount to fail, offline fsck is the only
solution.
This limitation means that maintenance of the offline fsck tool will continue.
A second limitation of online fsck is that it must follow the same resource
sharing and lock acquisition rules as the regular filesystem.
This means that scrub cannot take *any* shortcuts to save time, because doing
so could lead to concurrency problems.
In other words, online fsck is not a complete replacement for offline fsck, and
a complete run of online fsck may take longer than offline fsck.
However, both of these limitations are acceptable tradeoffs to satisfy the
different motivations of online fsck, which are to **minimize system downtime**
and to **increase predictability of operation**.

.. _scrubphases:

Phases of Work
--------------

The userspace driver program ``xfs_scrub`` splits the work of checking and
repairing an entire filesystem into seven phases.
Each phase concentrates on checking specific types of scrub items and depends
on the success of all previous phases.
The seven phases are as follows:

1. Collect geometry information about the mounted filesystem and computer,
   discover the online fsck capabilities of the kernel, and open the
   underlying storage devices.

2. Check allocation group metadata, all realtime volume metadata, and all quota
   files.
   Each metadata structure is scheduled as a separate scrub item.
   If corruption is found in the inode header or inode btree and ``xfs_scrub``
   is permitted to perform repairs, then those scrub items are repaired to
   prepare for phase 3.
   Repairs are implemented by using the information in the scrub item to
   resubmit the kernel scrub call with the repair flag enabled; this is
   discussed in the next section.
   Optimizations and all other repairs are deferred to phase 4.

3. Check all metadata of every file in the filesystem.
   Each metadata structure is also scheduled as a separate scrub item.
   If repairs are needed and ``xfs_scrub`` is permitted to perform repairs,
   and there were no problems detected during phase 2, then those scrub items
   are repaired immediately.
   Optimizations, deferred repairs, and unsuccessful repairs are deferred to
   phase 4.

4. All remaining repairs and scheduled optimizations are performed during this
   phase, if the caller permits them.
   Before starting repairs, the summary counters are checked and any necessary
   repairs are performed so that subsequent repairs will not fail the resource
   reservation step due to wildly incorrect summary counters.
   Unsuccessful repairs are requeued as long as forward progress on repairs is
   made somewhere in the filesystem.
   Free space in the filesystem is trimmed at the end of phase 4 if the
   filesystem is clean.

5. By the start of this phase, all primary and secondary filesystem metadata
   must be correct.
   Summary counters such as the free space counts and quota resource counts
   are checked and corrected.
   Directory entry names and extended attribute names are checked for
   suspicious entries such as control characters or confusing Unicode
   sequences appearing in names.

6. If the caller asks for a media scan, read all allocated and written data
   file extents in the filesystem.
   The ability to use hardware-assisted data file integrity checking is new
   to online fsck; neither of the previous tools has this capability.
   If media errors occur, they will be mapped to the owning files and reported.

7. Re-check the summary counters and present the caller with a summary of
   space usage and file counts.

Steps for Each Scrub Item
-------------------------

The kernel scrub code uses a three-step strategy for checking and repairing
the one aspect of a metadata object represented by a scrub item:

1.
   The scrub item of interest is checked for corruptions; opportunities for
   optimization; and for values that are directly controlled by the system
   administrator but look suspicious.
   If the item is not corrupt or does not need optimization, resources are
   released and the positive scan results are returned to userspace.
   If the item is corrupt or could be optimized but the caller does not permit
   this, resources are released and the negative scan results are returned to
   userspace.
   Otherwise, the kernel moves on to the second step.

2. The repair function is called to rebuild the data structure.
   Repair functions generally choose to rebuild a structure from other
   metadata rather than try to salvage the existing structure.
   If the repair fails, the scan results from the first step are returned to
   userspace.
   Otherwise, the kernel moves on to the third step.

3. In the third step, the kernel runs the same checks over the new metadata
   item to assess the efficacy of the repairs.
   The results of the reassessment are returned to userspace.

Classification of Metadata
--------------------------

Each type of metadata object (and therefore each type of scrub item) is
classified as follows:

Primary Metadata
````````````````

Metadata structures in this category should be most familiar to filesystem
users either because they are directly created by the user or they index
objects created by the user.
Most filesystem objects fall into this class:

- Free space and reference count information

- Inode records and indexes

- Storage mapping information for file data

- Directories

- Extended attributes

- Symbolic links

- Quota limits

Scrub obeys the same rules as regular filesystem accesses for resource and lock
acquisition.

Primary metadata objects are the simplest for scrub to process.
The principal filesystem object (either an allocation group or an inode) that
owns the item being scrubbed is locked to guard against concurrent updates.
The check function examines every record associated with the type for obvious
errors and cross-references healthy records against other metadata to look for
inconsistencies.
Repairs for this class of scrub item are simple, since the repair function
starts by holding all the resources acquired in the previous step.
The repair function scans available metadata as needed to record all the
observations needed to complete the structure.
Next, it stages the observations in a new ondisk structure and commits it
atomically to complete the repair.
Finally, the storage from the old data structure is carefully reaped.

Because ``xfs_scrub`` locks a primary object for the duration of the repair,
this is effectively an offline repair operation performed on a subset of the
filesystem.
This minimizes the complexity of the repair code because it is not necessary to
handle concurrent updates from other threads, nor is it necessary to access
any other part of the filesystem.
As a result, indexed structures can be rebuilt very quickly, and programs
trying to access the damaged structure will be blocked until repairs complete.
The only infrastructure needed by the repair code is the staging area for
observations and a means to write new structures to disk.
Despite these limitations, the advantage that online repair holds is clear:
targeted work on individual shards of the filesystem avoids total loss of
service.

This mechanism is described in section 2.1 ("Off-Line Algorithm") of
V. Srinivasan and M. J. Carey, `"Performance of On-Line Index Construction
Algorithms" <https://minds.wisconsin.edu/bitstream/handle/1793/59524/TR1047.pdf>`_,
*Extending Database Technology*, pp. 293-309, 1992.
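
As a rough illustration of this repair pattern, the following standalone C
program stages observations from a simulated metadata scan in an in-memory
array, sorts them, and then commits the sorted records as the replacement
index in a single ordered pass.
Every name in it (``struct xrep_obs``, ``scan_metadata``, and so on) is
invented for the sketch; none of this is actual XFS code.

.. code-block:: c

   /* Offline-style rebuild sketch: stage, sort, commit in one pass. */
   #include <stdio.h>
   #include <stdlib.h>
   #include <assert.h>

   struct xrep_obs {
           unsigned long key;      /* e.g. a starting block number */
           unsigned long len;      /* e.g. an extent length */
   };

   static int obs_cmp(const void *a, const void *b)
   {
           const struct xrep_obs *oa = a, *ob = b;

           if (oa->key != ob->key)
                   return oa->key < ob->key ? -1 : 1;
           return 0;
   }

   /* Scan other metadata and record observations, in whatever order
    * the scan happens to produce them. */
   static size_t scan_metadata(struct xrep_obs *obs)
   {
           const struct xrep_obs found[] = {
                   { 37, 4 }, { 2, 8 }, { 90, 1 }, { 12, 6 }, { 55, 3 },
           };
           size_t i, n = sizeof(found) / sizeof(found[0]);

           for (i = 0; i < n; i++)
                   obs[i] = found[i];
           return n;
   }

   int main(void)
   {
           struct xrep_obs obs[16];
           size_t i, n = scan_metadata(obs);

           /* Sort the staged observations so the replacement index can
            * be written with a single in-order pass. */
           qsort(obs, n, sizeof(obs[0]), obs_cmp);

           for (i = 1; i < n; i++)
                   assert(obs[i - 1].key < obs[i].key);

           /* "Commit" the new index; here that just means printing it. */
           for (i = 0; i < n; i++)
                   printf("rec %lu..%lu\n", obs[i].key,
                          obs[i].key + obs[i].len - 1);
           printf("committed %zu records\n", n);
           return 0;
   }

Staging and sorting before writing is what makes the final commit a single
ordered pass over the new structure, which is the property the list-based
algorithm below relies on.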

Most primary metadata repair functions stage their intermediate results in an
in-memory array prior to formatting the new ondisk structure, which is very
similar to the list-based algorithm discussed in section 2.3 ("List-Based
Algorithms") of Srinivasan.
However, any data structure builder that maintains a resource lock for the
duration of the repair is *always* an offline algorithm.

Secondary Metadata
``````````````````

Metadata structures in this category reflect records found in primary metadata,
but are only needed for online fsck or for reorganization of the filesystem.

Secondary metadata include:

- Reverse mapping information

- Directory parent pointers

This class of metadata is difficult for scrub to process because scrub attaches
to the secondary object but needs to check primary metadata, which runs counter
to the usual order of resource acquisition.
Frequently, this means that full filesystem scans are necessary to rebuild the
metadata.
Check functions can be limited in scope to reduce runtime.
Repairs, however, require a full scan of primary metadata, which can take a
long time to complete.
Under these conditions, ``xfs_scrub`` cannot lock resources for the entire
duration of the repair.

Instead, repair functions set up an in-memory staging structure to store
observations.
Depending on the requirements of the specific repair function, the staging
index will either have the same format as the ondisk structure or a design
specific to that repair function.
The next step is to release all locks and start the filesystem scan.
When the repair scanner needs to record an observation, the staging data are
locked long enough to apply the update.
While the filesystem scan is in progress, the repair function hooks the
filesystem so that it can apply pending filesystem updates to the staging
information.
Once the scan is done, the owning object is re-locked, the live data is used to
write a new ondisk structure, and the repairs are committed atomically.
The hooks are disabled and the staging area is freed.
Finally, the storage from the old data structure is carefully reaped.

Introducing concurrency helps online repair avoid various locking problems, but
comes at a high cost to code complexity.
Live filesystem code has to be hooked so that the repair function can observe
updates in progress.
The staging area has to become a fully functional parallel structure so that
updates can be merged from the hooks.
Finally, the hook, the filesystem scan, and the inode locking model must be
sufficiently well integrated that a hook event can decide if a given update
should be applied to the staging structure.

In theory, the scrub implementation could apply these same techniques for
primary metadata, but doing so would make it massively more complex and less
performant.
Programs attempting to access the damaged structures are not blocked from
operation, which may cause application failure or an unplanned filesystem
shutdown.

Inspiration for the secondary metadata repair strategy was drawn from section
2.4 of Srinivasan above, and sections 2 ("NSF: Index Build Without Side-File")
and 3.1.1 ("Duplicate Key Insert Problem") in C. Mohan, `"Algorithms for
Creating Indexes for Very Large Tables Without Quiescing Updates"
<https://dl.acm.org/doi/10.1145/130283.130337>`_, 1992.

The sidecar index mentioned above bears some resemblance to the side file
method mentioned in Srinivasan and Mohan.
Their method consists of an index builder that extracts relevant record data to
build the new structure as quickly as possible; and an auxiliary structure that
captures all updates that would be committed to the index by other threads were
the new index already online.
After the index building scan finishes, the updates recorded in the side file
are applied to the new index.
To avoid conflicts between the index builder and other writer threads, the
builder maintains a publicly visible cursor that tracks the progress of the
scan through the record space.
To avoid duplication of work between the side file and the index builder, side
file updates are elided when the record ID for the update is greater than the
cursor position within the record ID space.

To minimize changes to the rest of the codebase, XFS online repair keeps the
replacement index hidden until it's completely ready to go.
In other words, there is no attempt to expose the keyspace of the new index
while repair is running.
The complexity of such an approach would be very high and perhaps more
appropriate to building *new* indices.

**Future Work Question**: Can the full scan and live update code used to
facilitate a repair also be used to implement a comprehensive check?

*Answer*: In theory, yes.
Check would be much stronger if each scrub function employed these live scans
to build a shadow copy of the metadata and then compared the shadow records to
the ondisk records.
However, doing that is a fair amount more work than what the checking functions
do now, and it would increase the runtime of those scrub functions.
The live scans and hooks were also developed much later than the checking
functions.

Summary Information
```````````````````

Metadata structures in this last category summarize the contents of primary
metadata records.
These are often used to speed up resource usage queries, and are many times
smaller than the primary metadata which they represent.

Examples of summary information include:

- Summary counts of free space and inodes

- File link counts from directories

- Quota resource usage counts

Check and repair require full filesystem scans, but resource and lock
acquisition follow the same paths as regular filesystem accesses.

The superblock summary counters have special requirements due to the underlying
implementation of the incore counters, and will be treated separately.
Check and repair of the other types of summary counters (quota resource counts
and file link counts) employ the same filesystem scanning and hooking
techniques as outlined above, but because the underlying data are sets of
integer counters, the staging data need not be a fully functional mirror of the
ondisk structure.

Inspiration for quota and file link count repair strategies was drawn from
sections 2.12 ("Online Index Operations") through 2.14 ("Incremental View
Maintenance") of G. Graefe, `"Concurrent Queries and Updates in Summary Views
and Their Indexes"
<http://www.odbms.org/wp-content/uploads/2014/06/Increment-locks.pdf>`_, 2011.

Since quotas are non-negative integer counts of resource usage, online
quotacheck can use the incremental view deltas described in section 2.14 to
track pending changes to the block and inode usage counts in each transaction,
and commit those changes to a dquot side file when the transaction commits.
Delta tracking is necessary for dquots because the index builder scans inodes,
whereas the data structure being rebuilt is an index of dquots.
Link count checking combines the view deltas and commit step into one because
it sets attributes of the objects being scanned instead of writing them to a
separate data structure.
Each online fsck function will be discussed as a case study later in this
document.
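
To make the delta-tracking idea concrete, here is a small standalone C sketch
in which each transaction carries signed deltas against per-dquot block and
inode counts, and those deltas are applied to an in-memory side file when the
transaction commits.
The dquot IDs, structure layouts, and function names are all invented for this
illustration; the actual kernel implementation is covered by the case studies
later in this document.

.. code-block:: c

   /* Incremental-view delta sketch for an online quotacheck. */
   #include <stdio.h>
   #include <assert.h>

   #define MAX_DQUOTS      4

   /* Side file: shadow usage counts accumulated while the scan runs. */
   static struct dq_counts {
           long blocks;
           long inodes;
   } side_file[MAX_DQUOTS];

   /* One pending change tracked by a transaction; applied only at
    * transaction commit, never while the transaction is in progress. */
   struct dq_delta {
           unsigned int id;        /* which dquot this delta touches */
           long dblocks;           /* signed change to block count */
           long dinodes;           /* signed change to inode count */
   };

   static void commit_deltas(const struct dq_delta *d, size_t n)
   {
           size_t i;

           for (i = 0; i < n; i++) {
                   side_file[d[i].id].blocks += d[i].dblocks;
                   side_file[d[i].id].inodes += d[i].dinodes;
           }
   }

   int main(void)
   {
           /* Transaction 1: user 0 creates a file mapping 10 blocks. */
           const struct dq_delta t1[] = { { 0, 10, 1 } };
           /* Transaction 2: user 1 creates a file; user 0 frees 4 blocks. */
           const struct dq_delta t2[] = { { 1, 2, 1 }, { 0, -4, 0 } };

           commit_deltas(t1, 1);
           commit_deltas(t2, 2);

           /* The side file now reflects both committed transactions. */
           assert(side_file[0].blocks == 6 && side_file[0].inodes == 1);
           assert(side_file[1].blocks == 2 && side_file[1].inodes == 1);
           printf("user 0: %ld blocks, %ld inodes\n",
                  side_file[0].blocks, side_file[0].inodes);
           printf("user 1: %ld blocks, %ld inodes\n",
                  side_file[1].blocks, side_file[1].inodes);
           return 0;
   }

Because the deltas are signed and the side file is only touched at commit
time, aborted transactions never perturb the shadow counts, which is the
property the incremental view maintenance technique depends on.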

Risk Management
---------------

During the development of online fsck, several risk factors were identified
that may make the feature unsuitable for certain distributors and users.
Steps can be taken to mitigate or eliminate those risks, though at a cost to
functionality.

- **Decreased performance**: Adding metadata indices to the filesystem
  increases the time cost of persisting changes to disk, and the reverse space
  mapping and directory parent pointers are no exception.
  System administrators who require the maximum performance can disable the
  reverse mapping features at format time, though this choice dramatically
  reduces the ability of online fsck to find inconsistencies and repair them.

- **Incorrect repairs**: As with all software, there might be defects in the
  software that result in incorrect repairs being written to the filesystem.
  Systematic fuzz testing (detailed in the next section) is employed by the
  authors to find bugs early, but it might not catch everything.
  The kernel build system provides Kconfig options (``CONFIG_XFS_ONLINE_SCRUB``
  and ``CONFIG_XFS_ONLINE_REPAIR``) to enable distributors to choose not to
  accept this risk.
  The xfsprogs build system has a configure option (``--enable-scrub=no``) that
  disables building of the ``xfs_scrub`` binary, though this is not a risk
  mitigation if the kernel functionality remains enabled.

- **Inability to repair**: Sometimes, a filesystem is too badly damaged to be
  repairable.
  If the keyspaces of several metadata indices overlap in some manner but a
  coherent narrative cannot be formed from records collected, then the repair
  fails.
  To reduce the chance that a repair will fail with a dirty transaction and
  render the filesystem unusable, the online repair functions have been
  designed to stage and validate all new records before committing the new
  structure.

- **Misbehavior**: Online fsck requires many privileges -- raw IO to block
  devices, opening files by handle, ignoring Unix discretionary access control,
  and the ability to perform administrative changes.
  Running this automatically in the background scares people, so the systemd
  background service is configured to run with only the privileges required.
  Obviously, this cannot address certain problems like the kernel crashing or
  deadlocking, but it should be sufficient to prevent the scrub process from
  escaping and reconfiguring the system.
  The cron job does not have this protection.

- **Fuzz Kiddiez**: There are many people now who seem to think that running
  automated fuzz testing of ondisk artifacts to find mischievous behavior and
  spraying exploit code onto the public mailing list for instant zero-day
  disclosure is somehow of some social benefit.
  In the view of this author, the benefit is realized only when the fuzz
  operators help to **fix** the flaws, but this opinion apparently is not
  widely shared among security "researchers".
  The XFS maintainers' continuing ability to manage these events presents an
  ongoing risk to the stability of the development process.
  Automated testing should front-load some of the risk while the feature is
  considered EXPERIMENTAL.

Many of these risks are inherent to software programming.
Despite this, it is hoped that this new functionality will prove useful in
reducing unexpected downtime.