.. SPDX-License-Identifier: GPL-2.0
.. _xfs_online_fsck_design:

..
        Mapping of heading styles within this document:
        Heading 1 uses "====" above and below
        Heading 2 uses "===="
        Heading 3 uses "----"
        Heading 4 uses "````"
        Heading 5 uses "^^^^"
        Heading 6 uses "~~~~"
        Heading 7 uses "...."

        Sections are manually numbered because apparently that's what everyone
        does in the kernel.

======================
XFS Online Fsck Design
======================

This document captures the design of the online filesystem check feature for
XFS.
The purpose of this document is threefold:

- To help kernel distributors understand exactly what the XFS online fsck
  feature is, and issues about which they should be aware.

- To help people reading the code to familiarize themselves with the relevant
  concepts and design points before they start digging into the code.

- To help developers maintaining the system by capturing the reasons
  supporting higher level decision making.

As the online fsck code is merged, the links in this document to topic branches
will be replaced with links to code.

This document is licensed under the terms of the GNU Public License, v2.
The primary author is Darrick J. Wong.

This design document is split into seven parts.
Part 1 defines what fsck tools are and the motivations for writing a new one.
Parts 2 and 3 present a high level overview of how the online fsck process
works and how it is tested to ensure correct functionality.
Part 4 discusses the user interface and the intended usage modes of the new
program.
Parts 5 and 6 show off the high level components and how they fit together, and
then present case studies of how each repair function actually works.
Part 7 sums up what has been discussed so far and speculates about what else
might be built atop online fsck.

.. contents:: Table of Contents
   :local:

1. What is a Filesystem Check?
==============================

A Unix filesystem has four main responsibilities:

- Provide a hierarchy of names through which application programs can associate
  arbitrary blobs of data for any length of time,

- Virtualize physical storage media across those names,

- Retrieve the named data blobs at any time, and

- Examine resource usage.

Metadata directly supporting these functions (e.g. files, directories, space
mappings) are sometimes called primary metadata.
Secondary metadata (e.g. reverse mapping and directory parent pointers) support
operations internal to the filesystem, such as internal consistency checking
and reorganization.
Summary metadata, as the name implies, condense information contained in
primary metadata for performance reasons.

The filesystem check (fsck) tool examines all the metadata in a filesystem
to look for errors.
In addition to looking for obvious metadata corruptions, fsck also
cross-references different types of metadata records with each other to look
for inconsistencies.
People do not like losing data, so most fsck tools also contain some ability
to correct any problems found.
As a word of caution -- the primary goal of most Linux fsck tools is to restore
the filesystem metadata to a consistent state, not to maximize the data
recovered.
That precedent will not be challenged here.
Filesystems of the 20th century generally lacked any redundancy in the ondisk
format, which means that fsck can only respond to errors by erasing files until
errors are no longer detected.
More recent filesystem designs contain enough redundancy in their metadata that
it is now possible to regenerate data structures when non-catastrophic errors
occur; this capability aids both strategies.

+--------------------------------------------------------------------------+
| **Note**:                                                                |
+--------------------------------------------------------------------------+
| System administrators avoid data loss by increasing the number of        |
| separate storage systems through the creation of backups; and they avoid |
| downtime by increasing the redundancy of each storage system through the |
| creation of RAID arrays.                                                 |
| fsck tools address only the first problem.                               |
+--------------------------------------------------------------------------+

TLDR; Show Me the Code!
-----------------------

Code is posted to the kernel.org git trees as follows:
`kernel changes <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-symlink>`_,
`userspace changes <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-media-scan-service>`_, and
`QA test changes <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=repair-dirs>`_.
Each kernel patchset adding an online repair function will use the same branch
name across the kernel, xfsprogs, and fstests git repos.

Existing Tools
--------------

The online fsck tool described here will be the third tool in the history of
XFS (on Linux) to check and repair filesystems.
Two programs precede it:

The first program, ``xfs_check``, was created as part of the XFS debugger
(``xfs_db``) and can only be used with unmounted filesystems.
It walks all metadata in the filesystem looking for inconsistencies in the
metadata, though it lacks any ability to repair what it finds.
Due to its high memory requirements and inability to repair things, this
program is now deprecated and will not be discussed further.

The second program, ``xfs_repair``, was created to be faster and more robust
than the first program.
Like its predecessor, it can only be used with unmounted filesystems.
It uses extent-based in-memory data structures to reduce memory consumption,
and tries to schedule readahead IO appropriately to reduce I/O waiting time
while it scans the metadata of the entire filesystem.
The most important feature of this tool is its ability to respond to
inconsistencies in file metadata and directory tree by erasing things as needed
to eliminate problems.
Space usage metadata are rebuilt from the observed file metadata.

Problem Statement
-----------------

The current XFS tools leave several problems unsolved:

1. **User programs** suddenly **lose access** to the filesystem when unexpected
   shutdowns occur as a result of silent corruptions in the metadata.
   These occur **unpredictably** and often without warning.

2. **Users** experience a **total loss of service** during the recovery period
   after an **unexpected shutdown** occurs.

3. **Users** experience a **total loss of service** if the filesystem is taken
   offline to **look for problems** proactively.
4. **Data owners** cannot **check the integrity** of their stored data without
   reading all of it.
   This may expose them to substantial billing costs when a linear media scan
   performed by the storage system administrator might suffice.

5. **System administrators** cannot **schedule** a maintenance window to deal
   with corruptions if they **lack the means** to assess filesystem health
   while the filesystem is online.

6. **Fleet monitoring tools** cannot **automate periodic checks** of filesystem
   health when doing so requires **manual intervention** and downtime.

7. **Users** can be tricked into **doing things they do not desire** when
   malicious actors **exploit quirks of Unicode** to place misleading names
   in directories.

Given this definition of the problems to be solved and the actors who would
benefit, the proposed solution is a third fsck tool that acts on a running
filesystem.

This new third program has three components: an in-kernel facility to check
metadata, an in-kernel facility to repair metadata, and a userspace driver
program to drive fsck activity on a live filesystem.
``xfs_scrub`` is the name of the driver program.
The rest of this document presents the goals and use cases of the new fsck
tool, describes its major design points in connection to those goals, and
discusses the similarities and differences with existing tools.

+--------------------------------------------------------------------------+
| **Note**:                                                                |
+--------------------------------------------------------------------------+
| Throughout this document, the existing offline fsck tool can also be     |
| referred to by its current name "``xfs_repair``".                        |
| The userspace driver program for the new online fsck tool can be         |
| referred to as "``xfs_scrub``".                                          |
| The kernel portion of online fsck that validates metadata is called      |
| "online scrub", and the portion of the kernel that fixes metadata is     |
| called "online repair".                                                  |
+--------------------------------------------------------------------------+

The naming hierarchy is broken up into objects known as directories and files
and the physical space is split into pieces known as allocation groups.
Sharding enables better performance on highly parallel systems and helps to
contain the damage when corruptions occur.
The division of the filesystem into principal objects (allocation groups and
inodes) means that there are ample opportunities to perform targeted checks
and repairs on a subset of the filesystem.

While this is going on, other parts of the filesystem continue processing IO
requests.
Even if a piece of filesystem metadata can only be regenerated by scanning the
entire system, the scan can still be done in the background while other file
operations continue.

In summary, online fsck takes advantage of resource sharding and redundant
metadata to enable targeted checking and repair operations while the system
is running.
This capability will be coupled to automatic system management so that
autonomous self-healing of XFS maximizes service availability.

2. Theory of Operation
======================

Because it is necessary for online fsck to lock and scan live metadata objects,
online fsck consists of three separate code components.
The first is the userspace driver program ``xfs_scrub``, which is responsible
for identifying individual metadata items, scheduling work items for them,
reacting to the outcomes appropriately, and reporting results to the system
administrator.
The second and third are in the kernel, which implements functions to check
and repair each type of online fsck work item.

+------------------------------------------------------------------+
| **Note**:                                                        |
+------------------------------------------------------------------+
| For brevity, this document shortens the phrase "online fsck work |
| item" to "scrub item".                                           |
+------------------------------------------------------------------+

Scrub item types are delineated in a manner consistent with the Unix design
philosophy, which is to say that each item should handle one aspect of a
metadata structure, and handle it well.

Scope
-----

In principle, online fsck should be able to check and to repair everything that
the offline fsck program can handle.
However, online fsck cannot be running 100% of the time, which means that
latent errors may creep in after a scrub completes.
If these errors cause the next mount to fail, offline fsck is the only
solution.
This limitation means that maintenance of the offline fsck tool will continue.
A second limitation of online fsck is that it must follow the same resource
sharing and lock acquisition rules as the regular filesystem.
This means that scrub cannot take *any* shortcuts to save time, because doing
so could lead to concurrency problems.
In other words, online fsck is not a complete replacement for offline fsck, and
a complete run of online fsck may take longer than offline fsck.
However, both of these limitations are acceptable tradeoffs to satisfy the
different motivations of online fsck, which are to **minimize system downtime**
and to **increase predictability of operation**.

.. _scrubphases:

Phases of Work
--------------

The userspace driver program ``xfs_scrub`` splits the work of checking and
repairing an entire filesystem into seven phases.
Each phase concentrates on checking specific types of scrub items and depends
on the success of all previous phases.
The seven phases are as follows:

1. Collect geometry information about the mounted filesystem and computer,
   discover the online fsck capabilities of the kernel, and open the
   underlying storage devices.

2. Check allocation group metadata, all realtime volume metadata, and all quota
   files.
   Each metadata structure is scheduled as a separate scrub item.
   If corruption is found in the inode header or inode btree and ``xfs_scrub``
   is permitted to perform repairs, then those scrub items are repaired to
   prepare for phase 3.
   Repairs are implemented by using the information in the scrub item to
   resubmit the kernel scrub call with the repair flag enabled; this is
   discussed in the next section.
   Optimizations and all other repairs are deferred to phase 4.

3. Check all metadata of every file in the filesystem.
   Each metadata structure is also scheduled as a separate scrub item.
   If repairs are needed and ``xfs_scrub`` is permitted to perform repairs,
   and there were no problems detected during phase 2, then those scrub items
   are repaired immediately.
   Optimizations, deferred repairs, and unsuccessful repairs are deferred to
   phase 4.
4. All remaining repairs and scheduled optimizations are performed during this
   phase, if the caller permits them.
   Before starting repairs, the summary counters are checked and any necessary
   repairs are performed so that subsequent repairs will not fail the resource
   reservation step due to wildly incorrect summary counters.
   Unsuccessful repairs are requeued as long as forward progress on repairs is
   made somewhere in the filesystem.
   Free space in the filesystem is trimmed at the end of phase 4 if the
   filesystem is clean.

5. By the start of this phase, all primary and secondary filesystem metadata
   must be correct.
   Summary counters such as the free space counts and quota resource counts
   are checked and corrected.
   Directory entry names and extended attribute names are checked for
   suspicious entries such as control characters or confusing Unicode
   sequences appearing in names.

6. If the caller asks for a media scan, read all allocated and written data
   file extents in the filesystem.
   The ability to use hardware-assisted data file integrity checking is new
   to online fsck; neither of the previous tools has this capability.
   If media errors occur, they will be mapped to the owning files and reported.

7. Re-check the summary counters and present the caller with a summary of
   space usage and file counts.

Steps for Each Scrub Item
-------------------------

The kernel scrub code uses a three-step strategy for checking and repairing
the one aspect of a metadata object represented by a scrub item:

1. The scrub item of interest is checked for corruptions; opportunities for
   optimization; and for values that are directly controlled by the system
   administrator but look suspicious.
   If the item is not corrupt or does not need optimization, resources are
   released and the positive scan results are returned to userspace.
   If the item is corrupt or could be optimized but the caller does not permit
   this, resources are released and the negative scan results are returned to
   userspace.
   Otherwise, the kernel moves on to the second step.

2. The repair function is called to rebuild the data structure.
   Repair functions generally choose to rebuild a structure from other
   metadata rather than try to salvage the existing structure.
   If the repair fails, the scan results from the first step are returned to
   userspace.
   Otherwise, the kernel moves on to the third step.

3. In the third step, the kernel runs the same checks over the new metadata
   item to assess the efficacy of the repairs.
   The results of the reassessment are returned to userspace.
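The minimal userspace sketch below illustrates the shape of this cycle for a
single scrub item, assuming the scrub ioctl interface exported in the XFS UAPI
headers (``XFS_IOC_SCRUB_METADATA`` and ``struct xfs_scrub_metadata``); the
real driver distinguishes many more outcomes (warnings, preening, incomplete
scans) and its error handling is far richer:

.. code-block:: c

    /*
     * Sketch of the check -> repair -> re-check cycle for one scrub item.
     * fd is an open file descriptor somewhere in the filesystem, such as
     * the mount point.  Error handling is trimmed for brevity.
     */
    #include <string.h>
    #include <sys/ioctl.h>
    #include <xfs/xfs.h>

    static int scrub_one_item(int fd, __u32 type, __u32 agno)
    {
        struct xfs_scrub_metadata sm;

        memset(&sm, 0, sizeof(sm));
        sm.sm_type = type;        /* e.g. XFS_SCRUB_TYPE_BNOBT */
        sm.sm_agno = agno;

        /* Step 1: check only.  A clean result means we are done. */
        if (ioctl(fd, XFS_IOC_SCRUB_METADATA, &sm))
            return -1;
        if (!(sm.sm_flags & (XFS_SCRUB_OFLAG_CORRUPT |
                             XFS_SCRUB_OFLAG_XCORRUPT |
                             XFS_SCRUB_OFLAG_PREEN)))
            return 0;

        /* Step 2: resubmit the same item with the repair flag set. */
        sm.sm_flags = XFS_SCRUB_IFLAG_REPAIR;
        if (ioctl(fd, XFS_IOC_SCRUB_METADATA, &sm))
            return -1;

        /* Step 3: the kernel re-checked the new structure; read the verdict. */
        return (sm.sm_flags & XFS_SCRUB_OFLAG_CORRUPT) ? -1 : 0;
    }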
Classification of Metadata
--------------------------

Each type of metadata object (and therefore each type of scrub item) is
classified as follows:

Primary Metadata
````````````````

Metadata structures in this category should be most familiar to filesystem
users either because they are directly created by the user or they index
objects created by the user.
Most filesystem objects fall into this class:

- Free space and reference count information

- Inode records and indexes

- Storage mapping information for file data

- Directories

- Extended attributes

- Symbolic links

- Quota limits

Scrub obeys the same rules as regular filesystem accesses for resource and lock
acquisition.

Primary metadata objects are the simplest for scrub to process.
The principal filesystem object (either an allocation group or an inode) that
owns the item being scrubbed is locked to guard against concurrent updates.
The check function examines every record associated with the type for obvious
errors and cross-references healthy records against other metadata to look for
inconsistencies.
Repairs for this class of scrub item are simple, since the repair function
starts by holding all the resources acquired in the previous step.
The repair function scans available metadata as needed to record all the
observations needed to complete the structure.
Next, it stages the observations in a new ondisk structure and commits it
atomically to complete the repair.
Finally, the storage from the old data structure is carefully reaped.

Because ``xfs_scrub`` locks a primary object for the duration of the repair,
this is effectively an offline repair operation performed on a subset of the
filesystem.
This minimizes the complexity of the repair code because it is not necessary to
handle concurrent updates from other threads, nor is it necessary to access
any other part of the filesystem.
As a result, indexed structures can be rebuilt very quickly, and programs
trying to access the damaged structure will be blocked until repairs complete.
The only infrastructure needed by the repair code are the staging area for
observations and a means to write new structures to disk.
Despite these limitations, the advantage that online repair holds is clear:
targeted work on individual shards of the filesystem avoids total loss of
service.

This mechanism is described in section 2.1 ("Off-Line Algorithm") of
V. Srinivasan and M. J. Carey, `"Performance of On-Line Index Construction
Algorithms" <https://minds.wisconsin.edu/bitstream/handle/1793/59524/TR1047.pdf>`_,
*Extending Database Technology*, pp. 293-309, 1992.

Most primary metadata repair functions stage their intermediate results in an
in-memory array prior to formatting the new ondisk structure, which is very
similar to the list-based algorithm discussed in section 2.3 ("List-Based
Algorithms") of Srinivasan.
However, any data structure builder that maintains a resource lock for the
duration of the repair is *always* an offline algorithm.
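To make the shape of this pattern concrete, here is a minimal sketch of an
offline-style primary metadata rebuild; every helper named here
(``xrep_collect_records`` and friends) is hypothetical and stands in for the
staging, bulk-loading, and reaping machinery discussed later in this document:

.. code-block:: c

    /* Sketch only: offline-style rebuild of one primary structure. */
    static int xrep_rebuild_primary(struct scrub_context *sc)
    {
        int error;

        /* Lock the owning AG or inode; held for the whole repair. */
        xrep_lock_principal_object(sc);

        /* Stage observations from other metadata in an in-memory array. */
        error = xrep_collect_records(sc, &sc->records);
        if (error)
            goto out_unlock;

        /* Format the staged records into a new ondisk structure. */
        error = xrep_bload_new_btree(sc, &sc->records);
        if (error)
            goto out_unlock;

        /* Commit the root switch atomically, then reap the old blocks. */
        error = xrep_commit_new_root(sc);
        if (!error)
            error = xrep_reap_old_blocks(sc);
    out_unlock:
        xrep_unlock_principal_object(sc);
        return error;
    }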
.. _secondary_metadata:

Secondary Metadata
``````````````````

Metadata structures in this category reflect records found in primary metadata,
but are only needed for online fsck or for reorganization of the filesystem.

Secondary metadata include:

- Reverse mapping information

- Directory parent pointers

This class of metadata is difficult for scrub to process because scrub attaches
to the secondary object but needs to check primary metadata, which runs counter
to the usual order of resource acquisition.
Frequently, this means that full filesystem scans are necessary to rebuild the
metadata.
Check functions can be limited in scope to reduce runtime.
Repairs, however, require a full scan of primary metadata, which can take a
long time to complete.
Under these conditions, ``xfs_scrub`` cannot lock resources for the entire
duration of the repair.

Instead, repair functions set up an in-memory staging structure to store
observations.
Depending on the requirements of the specific repair function, the staging
index will either have the same format as the ondisk structure or a design
specific to that repair function.
The next step is to release all locks and start the filesystem scan.
When the repair scanner needs to record an observation, the staging data are
locked long enough to apply the update.
While the filesystem scan is in progress, the repair function hooks the
filesystem so that it can apply pending filesystem updates to the staging
information.
Once the scan is done, the owning object is re-locked, the live data is used to
write a new ondisk structure, and the repairs are committed atomically.
The hooks are disabled and the staging area is freed.
Finally, the storage from the old data structure is carefully reaped.

Introducing concurrency helps online repair avoid various locking problems, but
comes at a high cost to code complexity.
Live filesystem code has to be hooked so that the repair function can observe
updates in progress.
The staging area has to become a fully functional parallel structure so that
updates can be merged from the hooks.
Finally, the hook, the filesystem scan, and the inode locking model must be
sufficiently well integrated that a hook event can decide if a given update
should be applied to the staging structure.

In theory, the scrub implementation could apply these same techniques for
primary metadata, but doing so would make it massively more complex and less
performant.
Programs attempting to access the damaged structures are not blocked from
operation, which may cause application failure or an unplanned filesystem
shutdown.

Inspiration for the secondary metadata repair strategy was drawn from section
2.4 of Srinivasan above, and sections 2 ("NSF: Index Build Without Side-File")
and 3.1.1 ("Duplicate Key Insert Problem") in C. Mohan, `"Algorithms for
Creating Indexes for Very Large Tables Without Quiescing Updates"
<https://dl.acm.org/doi/10.1145/130283.130337>`_, 1992.

The sidecar index mentioned above bears some resemblance to the side file
method mentioned in Srinivasan and Mohan.
Their method consists of an index builder that extracts relevant record data to
build the new structure as quickly as possible; and an auxiliary structure that
captures all updates that would be committed to the index by other threads were
the new index already online.
After the index building scan finishes, the updates recorded in the side file
are applied to the new index.
To avoid conflicts between the index builder and other writer threads, the
builder maintains a publicly visible cursor that tracks the progress of the
scan through the record space.
To avoid duplication of work between the side file and the index builder, side
file updates are elided when the record ID for the update is greater than the
cursor position within the record ID space.

To minimize changes to the rest of the codebase, XFS online repair keeps the
replacement index hidden until it's completely ready to go.
In other words, there is no attempt to expose the keyspace of the new index
while repair is running.
The complexity of such an approach would be very high and perhaps more
appropriate to building *new* indices.
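The cursor rule can be sketched in code form as follows; all of the types and
helpers here are hypothetical, and the sketch is userspace-flavored for the
sake of being self-contained:

.. code-block:: c

    /*
     * Sketch of the scan-cursor rule: a live-update hook applies an
     * update to the staging structure only if the scan has already
     * passed that part of the keyspace.  Records beyond the cursor
     * will be picked up by the scan itself, so staging them here
     * would duplicate work.
     */
    #include <pthread.h>
    #include <stdint.h>

    struct xrep_record;                 /* hypothetical staged record */
    struct xrep_scan {
        pthread_mutex_t  lock;
        uint64_t         cursor;        /* highest record id scanned so far */
    };

    void xrep_stage_record(struct xrep_scan *scan,
                           const struct xrep_record *rec);  /* hypothetical */

    /* Called from a filesystem hook when live code updates a record. */
    static void xrep_hook_update(struct xrep_scan *scan, uint64_t id,
                                 const struct xrep_record *rec)
    {
        pthread_mutex_lock(&scan->lock);
        if (id <= scan->cursor)
            xrep_stage_record(scan, rec);   /* scan already went past */
        pthread_mutex_unlock(&scan->lock);
    }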
**Future Work Question**: Can the full scan and live update code used to
facilitate a repair also be used to implement a comprehensive check?

*Answer*: In theory, yes.
Check would be much stronger if each scrub function employed these live scans
to build a shadow copy of the metadata and then compared the shadow records to
the ondisk records.
However, doing that is a fair amount more work than what the checking functions
do now, and it would increase the runtime of those scrub functions.
The live scans and hooks were also developed much later than the checking code.

Summary Information
```````````````````

Metadata structures in this last category summarize the contents of primary
metadata records.
These are often used to speed up resource usage queries, and are many times
smaller than the primary metadata which they represent.

Examples of summary information include:

- Summary counts of free space and inodes

- File link counts from directories

- Quota resource usage counts

Check and repair require full filesystem scans, but resource and lock
acquisition follow the same paths as regular filesystem accesses.

The superblock summary counters have special requirements due to the underlying
implementation of the incore counters, and will be treated separately.
Check and repair of the other types of summary counters (quota resource counts
and file link counts) employ the same filesystem scanning and hooking
techniques as outlined above, but because the underlying data are sets of
integer counters, the staging data need not be a fully functional mirror of the
ondisk structure.

Inspiration for quota and file link count repair strategies were drawn from
sections 2.12 ("Online Index Operations") through 2.14 ("Incremental View
Maintenance") of G. Graefe, `"Concurrent Queries and Updates in Summary Views
and Their Indexes"
<http://www.odbms.org/wp-content/uploads/2014/06/Increment-locks.pdf>`_, 2011.

Since quotas are non-negative integer counts of resource usage, online
quotacheck can use the incremental view deltas described in section 2.14 to
track pending changes to the block and inode usage counts in each transaction,
and commit those changes to a dquot side file when the transaction commits.
Delta tracking is necessary for dquots because the index builder scans inodes,
whereas the data structure being rebuilt is an index of dquots.
Link count checking combines the view deltas and commit step into one because
it sets attributes of the objects being scanned instead of writing them to a
separate data structure.
Each online fsck function will be discussed as a case study later in this
document.
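A sketch of the delta-tracking idea, using hypothetical types: each transaction
carries signed deltas against the shadow dquot information, and the deltas are
folded into the side file only when the transaction commits.

.. code-block:: c

    #include <stdint.h>

    /* Hypothetical per-transaction delta against one shadow dquot. */
    struct qcheck_delta {
        uint64_t  dq_id;      /* dquot being shadowed */
        int64_t   bcount;     /* change to observed block count */
        int64_t   icount;     /* change to observed inode count */
    };

    /* Fold one committed transaction's deltas into the side file. */
    static void qcheck_commit_delta(struct qcheck_shadow *shadow,
                                    const struct qcheck_delta *delta)
    {
        struct qcheck_dquot *dq = qcheck_get_dquot(shadow, delta->dq_id);

        dq->bcount += delta->bcount;    /* deltas may be negative */
        dq->icount += delta->icount;
        qcheck_put_dquot(shadow, dq);
    }

Deltas that belong to a transaction that aborts are simply discarded, which is
what makes this approach safe to run concurrently with regular workloads.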
Risk Management
---------------

During the development of online fsck, several risk factors were identified
that may make the feature unsuitable for certain distributors and users.
Steps can be taken to mitigate or eliminate those risks, though at a cost to
functionality.

- **Decreased performance**: Adding metadata indices to the filesystem
  increases the time cost of persisting changes to disk, and the reverse space
  mapping and directory parent pointers are no exception.
  System administrators who require the maximum performance can disable the
  reverse mapping features at format time, though this choice dramatically
  reduces the ability of online fsck to find inconsistencies and repair them.

- **Incorrect repairs**: As with all software, there might be defects in the
  software that result in incorrect repairs being written to the filesystem.
  Systematic fuzz testing (detailed in the next section) is employed by the
  authors to find bugs early, but it might not catch everything.
  The kernel build system provides Kconfig options (``CONFIG_XFS_ONLINE_SCRUB``
  and ``CONFIG_XFS_ONLINE_REPAIR``) to enable distributors to choose not to
  accept this risk.
  The xfsprogs build system has a configure option (``--enable-scrub=no``) that
  disables building of the ``xfs_scrub`` binary, though this is not a risk
  mitigation if the kernel functionality remains enabled.

- **Inability to repair**: Sometimes, a filesystem is too badly damaged to be
  repairable.
  If the keyspaces of several metadata indices overlap in some manner but a
  coherent narrative cannot be formed from records collected, then the repair
  fails.
  To reduce the chance that a repair will fail with a dirty transaction and
  render the filesystem unusable, the online repair functions have been
  designed to stage and validate all new records before committing the new
  structure.

- **Misbehavior**: Online fsck requires many privileges -- raw IO to block
  devices, opening files by handle, ignoring Unix discretionary access control,
  and the ability to perform administrative changes.
  Running this automatically in the background scares people, so the systemd
  background service is configured to run with only the privileges required.
  Obviously, this cannot address certain problems like the kernel crashing or
  deadlocking, but it should be sufficient to prevent the scrub process from
  escaping and reconfiguring the system.
  The cron job does not have this protection.

- **Fuzz Kiddiez**: There are many people now who seem to think that running
  automated fuzz testing of ondisk artifacts to find mischievous behavior and
  spraying exploit code onto the public mailing list for instant zero-day
  disclosure is somehow of some social benefit.
  In the view of this author, the benefit is realized only when the fuzz
  operators help to **fix** the flaws, but this opinion apparently is not
  widely shared among security "researchers".
  The XFS maintainers' continuing ability to manage these events presents an
  ongoing risk to the stability of the development process.
  Automated testing should front-load some of the risk while the feature is
  considered EXPERIMENTAL.

Many of these risks are inherent to software programming.
Despite this, it is hoped that this new functionality will prove useful in
reducing unexpected downtime.

3. Testing Plan
===============

As stated before, fsck tools have three main goals:

1. Detect inconsistencies in the metadata;

2. Eliminate those inconsistencies; and

3. Minimize further loss of data.

Demonstrations of correct operation are necessary to build users' confidence
that the software behaves within expectations.
Unfortunately, it was not really feasible to perform regular exhaustive testing
of every aspect of a fsck tool until the introduction of low-cost virtual
machines with high-IOPS storage.
With ample hardware availability in mind, the testing strategy for the online
fsck project involves differential analysis against the existing fsck tools and
systematic testing of every attribute of every type of metadata object.
Testing can be split into four major categories, as discussed below.

Integrated Testing with fstests
-------------------------------

The primary goal of any free software QA effort is to make testing as
inexpensive and widespread as possible to maximize the scaling advantages of
community.
In other words, testing should maximize the breadth of filesystem configuration
scenarios and hardware setups.
This improves code quality by enabling the authors of online fsck to find and
fix bugs early, and helps developers of new features to find integration
issues earlier in their development effort.

The Linux filesystem community shares a common QA testing suite,
`fstests <https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/>`_, for
functional and regression testing.
Even before development work began on online fsck, fstests (when run on XFS)
would run both the ``xfs_check`` and ``xfs_repair -n`` commands on the test and
scratch filesystems between each test.
This provides a level of assurance that the kernel and the fsck tools stay in
alignment about what constitutes consistent metadata.
During development of the online checking code, fstests was modified to run
``xfs_scrub -n`` between each test to ensure that the new checking code
produces the same results as the two existing fsck tools.

To start development of online repair, fstests was modified to run
``xfs_repair`` to rebuild the filesystem's metadata indices between tests.
This ensures that offline repair does not crash, leave a corrupt filesystem
after it exits, or trigger complaints from the online check.
This also established a baseline for what can and cannot be repaired offline.
To complete the first phase of development of online repair, fstests was
modified to be able to run ``xfs_scrub`` in a "force rebuild" mode.
This enables a comparison of the effectiveness of online repair as compared to
the existing offline repair tools.

General Fuzz Testing of Metadata Blocks
---------------------------------------

XFS benefits greatly from having a very robust debugging tool, ``xfs_db``.

Before development of online fsck even began, a set of fstests was created
to test the rather common fault that entire metadata blocks get corrupted.
This required the creation of fstests library code that can create a filesystem
containing every possible type of metadata object.
Next, individual test cases were created to create a test filesystem, identify
a single block of a specific type of metadata object, trash it with the
existing ``blocktrash`` command in ``xfs_db``, and test the reaction of a
particular metadata validation strategy.

This earlier test suite enabled XFS developers to test the ability of the
in-kernel validation functions and the ability of the offline fsck tool to
detect and eliminate the inconsistent metadata.
This part of the test suite was extended to cover online fsck in exactly the
same manner.

In other words, for a given fstests filesystem configuration:

* For each metadata object existing on the filesystem:

  * Write garbage to it

  * Test the reactions of:

    1. The kernel verifiers to stop obviously bad metadata
    2. Offline repair (``xfs_repair``) to detect and fix
    3. Online repair (``xfs_scrub``) to detect and fix
Targeted Fuzz Testing of Metadata Records
-----------------------------------------

The testing plan for online fsck includes extending the existing fs testing
infrastructure to provide a much more powerful facility: targeted fuzz testing
of every metadata field of every metadata object in the filesystem.
``xfs_db`` can modify every field of every metadata structure in every
block in the filesystem to simulate the effects of memory corruption and
software bugs.
Given that fstests already contains the ability to create a filesystem
containing every metadata format known to the filesystem, ``xfs_db`` can be
used to perform exhaustive fuzz testing!

For a given fstests filesystem configuration:

* For each metadata object existing on the filesystem...

  * For each record inside that metadata object...

    * For each field inside that record...

      * For each conceivable type of transformation that can be applied to a
        bit field...

        1. Clear all bits
        2. Set all bits
        3. Toggle the most significant bit
        4. Toggle the middle bit
        5. Toggle the least significant bit
        6. Add a small quantity
        7. Subtract a small quantity
        8. Randomize the contents

        * ...test the reactions of:

          1. The kernel verifiers to stop obviously bad metadata
          2. Offline checking (``xfs_repair -n``)
          3. Offline repair (``xfs_repair``)
          4. Online checking (``xfs_scrub -n``)
          5. Online repair (``xfs_scrub``)
          6. Both repair tools (``xfs_scrub`` and then ``xfs_repair`` if
             online repair doesn't succeed)

This is quite the combinatoric explosion!

Fortunately, having this much test coverage makes it easy for XFS developers to
check the responses of XFS' fsck tools.
Since the introduction of the fuzz testing framework, these tests have been
used to discover incorrect repair code and missing functionality for entire
classes of metadata objects in ``xfs_repair``.
The enhanced testing was used to finalize the deprecation of ``xfs_check`` by
confirming that ``xfs_repair`` could detect at least as many corruptions as
the older tool.

These tests have been very valuable for ``xfs_scrub`` in the same ways -- they
allow the online fsck developers to compare online fsck against offline fsck,
and they enable XFS developers to find deficiencies in the code base.

Proposed patchsets include
`general fuzzer improvements
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=fuzzer-improvements>`_,
`fuzzing baselines
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=fuzz-baseline>`_,
and `improvements in fuzz testing comprehensiveness
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=more-fuzz-testing>`_.

Stress Testing
--------------

A unique requirement of online fsck is the ability to operate on a filesystem
concurrently with regular workloads.
Although it is of course impossible to run ``xfs_scrub`` with *zero* observable
impact on the running system, the online repair code should never introduce
inconsistencies into the filesystem metadata, and regular workloads should
never notice resource starvation.
To verify that these conditions are being met, fstests has been enhanced in
the following ways:

* For each scrub item type, create a test to exercise checking that item type
  while running ``fsstress``.
* For each scrub item type, create a test to exercise repairing that item type
  while running ``fsstress``.
* Race ``fsstress`` and ``xfs_scrub -n`` to ensure that checking the whole
  filesystem doesn't cause problems.
* Race ``fsstress`` and ``xfs_scrub`` in force-rebuild mode to ensure that
  force-repairing the whole filesystem doesn't cause problems.
* Race ``xfs_scrub`` in check and force-repair mode against ``fsstress`` while
  freezing and thawing the filesystem.
* Race ``xfs_scrub`` in check and force-repair mode against ``fsstress`` while
  remounting the filesystem read-only and read-write.
* The same, but running ``fsx`` instead of ``fsstress``. (Not done yet?)

Success is defined by the ability to run all of these tests without observing
any unexpected filesystem shutdowns due to corrupted metadata, kernel hang
check warnings, or any other sort of mischief.

Proposed patchsets include `general stress testing
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=race-scrub-and-mount-state-changes>`_
and the `evolution of existing per-function stress testing
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=refactor-scrub-stress>`_.

4. User Interface
=================

The primary user of online fsck is the system administrator, just like offline
repair.
Online fsck presents two modes of operation to administrators: a foreground
CLI process for online fsck on demand, and a background service that performs
autonomous checking and repair.

Checking on Demand
------------------

For administrators who want the absolute freshest information about the
metadata in a filesystem, ``xfs_scrub`` can be run as a foreground process on
a command line.
The program checks every piece of metadata in the filesystem while the
administrator waits for the results to be reported, just like the existing
``xfs_repair`` tool.
Both tools share a ``-n`` option to perform a read-only scan, and a ``-v``
option to increase the verbosity of the information reported.

A new feature of ``xfs_scrub`` is the ``-x`` option, which employs the error
correction capabilities of the hardware to check data file contents.
The media scan is not enabled by default because it may dramatically increase
program runtime and consume a lot of bandwidth on older storage hardware.

The output of a foreground invocation is captured in the system log.

The ``xfs_scrub_all`` program walks the list of mounted filesystems and
initiates ``xfs_scrub`` for each of them in parallel.
It serializes scans for any filesystems that resolve to the same top level
kernel block device to prevent resource overconsumption.

Background Service
------------------

To reduce the workload of system administrators, the ``xfs_scrub`` package
provides a suite of `systemd <https://systemd.io/>`_ timers and services that
run online fsck automatically on weekends by default.
The background service configures scrub to run with as little privilege as
possible, the lowest CPU and IO priority, and in a CPU-constrained single
threaded mode.
This can be tuned by the systemd administrator at any time to suit the latency
and throughput requirements of customer workloads.

The output of the background service is also captured in the system log.
If desired, reports of failures (either due to inconsistencies or mere runtime
errors) can be emailed automatically by setting the ``EMAIL_ADDR`` environment
variable in the following service files:

* ``xfs_scrub_fail@.service``
* ``xfs_scrub_media_fail@.service``
* ``xfs_scrub_all_fail.service``

The decision to enable the background scan is left to the system administrator.
This can be done by enabling either of the following services:

* ``xfs_scrub_all.timer`` on systemd systems
* ``xfs_scrub_all.cron`` on non-systemd systems

This automatic weekly scan is configured out of the box to perform an
additional media scan of all file data once per month.
This is less foolproof than, say, storing file data block checksums, but much
more performant if application software provides its own integrity checking,
redundancy can be provided elsewhere above the filesystem, or the storage
device's integrity guarantees are deemed sufficient.

The systemd unit file definitions have been subjected to a security audit
(as of systemd 249) to ensure that the xfs_scrub processes have as little
access to the rest of the system as possible.
This was performed via ``systemd-analyze security``, after which privileges
were restricted to the minimum required, sandboxing and system call filtering
were set up to the maximal extent possible, and access to the filesystem tree
was restricted to the minimum needed to start the program and access the
filesystem being scanned.
The service definition files restrict CPU usage to 80% of one CPU core, and
apply as nice of a priority to IO and CPU scheduling as possible.
This measure was taken to minimize delays in the rest of the filesystem.
No such hardening has been performed for the cron job.

Proposed patchset:
`Enabling the xfs_scrub background service
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-media-scan-service>`_.

Health Reporting
----------------

XFS caches a summary of each filesystem's health status in memory.
The information is updated whenever ``xfs_scrub`` is run, or whenever
inconsistencies are detected in the filesystem metadata during regular
operations.
System administrators should use the ``health`` command of ``xfs_spaceman`` to
download this information into a human-readable format.
If problems have been observed, the administrator can schedule a reduced
service window to run the online repair tool to correct the problem.
Failing that, the administrator can decide to schedule a maintenance window to
run the traditional offline repair tool to correct the problem.

**Future Work Question**: Should the health reporting integrate with the new
inotify fs error notification system?
Would it be helpful for sysadmins to have a daemon to listen for corruption
notifications and initiate a repair?

*Answer*: These questions remain unanswered, but should be a part of the
conversation with early adopters and potential downstream users of XFS.
Proposed patchsets include
`wiring up health reports to correction returns
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=corruption-health-reports>`_
and
`preservation of sickness info during memory reclaim
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=indirect-health-reporting>`_.

5. Kernel Algorithms and Data Structures
========================================

This section discusses the key algorithms and data structures of the kernel
code that provide the ability to check and repair metadata while the system
is running.
The first chapters in this section reveal the pieces that provide the
foundation for checking metadata.
The remainder of this section presents the mechanisms through which XFS
regenerates itself.

Self Describing Metadata
------------------------

Starting with XFS version 5 in 2012, XFS updated the format of nearly every
ondisk block header to record a magic number, a checksum, a universally
"unique" identifier (UUID), an owner code, the ondisk address of the block,
and a log sequence number.
When loading a block buffer from disk, the magic number, UUID, owner, and
ondisk address confirm that the retrieved block matches the specific owner of
the current filesystem, and that the information contained in the block is
supposed to be found at the ondisk address.
The first three components enable checking tools to disregard alleged metadata
that doesn't belong to the filesystem, and the fourth component enables the
filesystem to detect lost writes.

Whenever a file system operation modifies a block, the change is submitted
to the log as part of a transaction.
The log then processes these transactions, marking them done once they are
safely persisted to storage.
The logging code maintains the checksum and the log sequence number of the last
transactional update.
Checksums are useful for detecting torn writes and other discrepancies that can
be introduced between the computer and its storage devices.
Sequence number tracking enables log recovery to avoid applying out of date
log updates to the filesystem.

These two features improve overall runtime resiliency by providing a means for
the filesystem to detect obvious corruption when reading metadata blocks from
disk, but these buffer verifiers cannot provide any consistency checking
between metadata structures.

For more information, please see
Documentation/filesystems/xfs-self-describing-metadata.rst.
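As an illustration, the decision sequence of a v5 buffer verifier might be
sketched as follows; the header layout and the ``checksum_ok`` helper are
simplified stand-ins rather than the real ondisk structures, which vary by
metadata type:

.. code-block:: c

    /* Illustrative sketch of a self-describing v5 block header. */
    struct block_header {
        __be32   magic;    /* structure type */
        __be32   crc;      /* checksum of the block */
        uuid_t   uuid;     /* filesystem this block belongs to */
        __be64   owner;    /* AG or inode that owns this block */
        __be64   blkno;    /* ondisk address of this block */
        __be64   lsn;      /* last transaction to touch this block */
    };

    static bool block_header_verify(struct xfs_mount *mp,
                                    struct block_header *hdr,
                                    xfs_daddr_t daddr, uint32_t magic)
    {
        if (hdr->magic != cpu_to_be32(magic))
            return false;   /* not the structure we asked for */
        if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_meta_uuid))
            return false;   /* alleged metadata from another filesystem */
        if (be64_to_cpu(hdr->blkno) != daddr)
            return false;   /* misdirected or lost write */
        return checksum_ok(hdr);    /* hypothetical; torn write detection */
    }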
Reverse Mapping
---------------

The original design of XFS (circa 1993) is an improvement upon 1980s Unix
filesystem design.
In those days, storage density was expensive, CPU time was scarce, and
excessive seek time could kill performance.
For performance reasons, filesystem authors were reluctant to add redundancy to
the filesystem, even at the cost of data integrity.
Filesystem designers in the early 21st century chose different strategies to
increase internal redundancy -- either storing nearly identical copies of
metadata, or more space-efficient encoding techniques.

For XFS, a different redundancy strategy was chosen to modernize the design:
a secondary space usage index that maps allocated disk extents back to their
owners.
By adding a new index, the filesystem retains most of its ability to scale
well to heavily threaded workloads involving large datasets, since the primary
file metadata (the directory tree, the file block map, and the allocation
groups) remain unchanged.
Like any system that improves redundancy, the reverse-mapping feature increases
overhead costs for space mapping activities.
However, it has two critical advantages: first, the reverse index is key to
enabling online fsck and other requested functionality such as free space
defragmentation, better media failure reporting, and filesystem shrinking.
Second, the different ondisk storage format of the reverse mapping btree
defeats device-level deduplication because the filesystem requires real
redundancy.

+--------------------------------------------------------------------------+
| **Sidebar**:                                                             |
+--------------------------------------------------------------------------+
| A criticism of adding the secondary index is that it does nothing to     |
| improve the robustness of user data storage itself.                      |
| This is a valid point, but adding a new index for file data block        |
| checksums increases write amplification by turning data overwrites into  |
| copy-writes, which age the filesystem prematurely.                       |
| In keeping with thirty years of precedent, users who want file data      |
| integrity can supply as powerful a solution as they require.             |
| As for metadata, the complexity of adding a new secondary index of space |
| usage is much less than adding volume management and storage device      |
| mirroring to XFS itself.                                                 |
| Perfection of RAID and volume management are best left to existing       |
| layers in the kernel.                                                    |
+--------------------------------------------------------------------------+

The information captured in a reverse space mapping record is as follows:

.. code-block:: c

    struct xfs_rmap_irec {
        xfs_agblock_t    rm_startblock;   /* extent start block */
        xfs_extlen_t     rm_blockcount;   /* extent length */
        uint64_t         rm_owner;        /* extent owner */
        uint64_t         rm_offset;       /* offset within the owner */
        unsigned int     rm_flags;        /* state flags */
    };

The first two fields capture the location and size of the physical space,
in units of filesystem blocks.
The owner field tells scrub which metadata structure or file inode has been
assigned this space.
For space allocated to files, the offset field tells scrub where the space was
mapped within the file fork.
Finally, the flags field provides extra information about the space usage --
is this an attribute fork extent?  A file mapping btree extent?  Or an
unwritten data extent?
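For example, a hypothetical record stating that blocks 120-127 of an
allocation group hold written data belonging to offset zero of the data fork
of inode 133 would be encoded like this:

.. code-block:: c

    /* Eight blocks at AG block 120, mapped to file offset 0 of inode
     * 133's data fork; no flags, so this is a plain written extent. */
    struct xfs_rmap_irec rmap = {
        .rm_startblock  = 120,
        .rm_blockcount  = 8,
        .rm_owner       = 133,
        .rm_offset      = 0,
        .rm_flags       = 0,
    };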
Online filesystem checking judges the consistency of each primary metadata
record by comparing its information against all other space indices.
The reverse mapping index plays a key role in the consistency checking process
because it contains a centralized alternate copy of all space allocation
information.
Program runtime and ease of resource acquisition are the only real limits to
what online checking can consult.
For example, a file data extent mapping can be checked against:

* The absence of an entry in the free space information.
* The absence of an entry in the inode index.
* The absence of an entry in the reference count data if the file is not
  marked as having shared extents.
* The correspondence of an entry in the reverse mapping information.

There are several observations to make about reverse mapping indices:

1. Reverse mappings can provide a positive affirmation of correctness if any of
   the above primary metadata are in doubt.
   The checking code for most primary metadata follows a path similar to the
   one outlined above.

2. Proving the consistency of secondary metadata with the primary metadata is
   difficult because that requires a full scan of all primary space metadata,
   which is very time intensive.
   For example, checking a reverse mapping record for a file extent mapping
   btree block requires locking the file and searching the entire btree to
   confirm the block.
   Instead, scrub relies on rigorous cross-referencing during the primary space
   mapping structure checks.

3. Consistency scans must use non-blocking lock acquisition primitives if the
   required locking order is not the same order used by regular filesystem
   operations, as sketched below.
   For example, if the filesystem normally takes a file ILOCK before taking
   the AGF buffer lock but scrub wants to take a file ILOCK while holding
   an AGF buffer lock, scrub cannot block on that second acquisition.
   This means that forward progress during this part of a scan of the reverse
   mapping data cannot be guaranteed if system load is heavy.

In summary, reverse mappings play a key role in reconstruction of primary
metadata.
The details of how these records are staged, written to disk, and committed
into the filesystem are covered in subsequent sections.
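The non-blocking rule from observation 3 might look like the following sketch,
in which every helper is hypothetical; the point is only that the second lock
acquisition must be a trylock, and that failure marks the scan incomplete
instead of deadlocking:

.. code-block:: c

    /*
     * Sketch: scrub already holds an AGF buffer lock, which inverts the
     * usual ILOCK -> AGF ordering, so the ILOCK must be a trylock.
     */
    static int xref_check_file_mapping(struct scrub_ctx *sc, uint64_t ino)
    {
        struct file_obj *ip = scrub_iget(sc, ino);      /* hypothetical */

        if (!scrub_trylock_inode(ip)) {
            /* Cannot wait here without risking deadlock. */
            scrub_set_incomplete(sc);
            scrub_irele(ip);
            return -ECANCELED;
        }
        /* ...cross-reference the file's space mappings here... */
        scrub_unlock_inode(ip);
        scrub_irele(ip);
        return 0;
    }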
Checking and Cross-Referencing
------------------------------

The first step of checking a metadata structure is to examine every record
contained within the structure and its relationship with the rest of the
system.
XFS contains multiple layers of checking to try to prevent inconsistent
metadata from wreaking havoc on the system.
Each of these layers contributes information that helps the kernel to make
five decisions about the health of a metadata structure:

- Is a part of this structure obviously corrupt (``XFS_SCRUB_OFLAG_CORRUPT``) ?
- Is this structure inconsistent with the rest of the system
  (``XFS_SCRUB_OFLAG_XCORRUPT``) ?
- Is there so much damage around the filesystem that cross-referencing is not
  possible (``XFS_SCRUB_OFLAG_XFAIL``) ?
- Can the structure be optimized to improve performance or reduce the size of
  metadata (``XFS_SCRUB_OFLAG_PREEN``) ?
- Does the structure contain data that is not inconsistent but deserves review
  by the system administrator (``XFS_SCRUB_OFLAG_WARNING``) ?

The following sections describe how the metadata scrubbing process works.

Metadata Buffer Verification
````````````````````````````

The lowest layer of metadata protection in XFS are the metadata verifiers built
into the buffer cache.
These functions perform inexpensive internal consistency checking of the block
itself, and answer these questions:

- Does the block belong to this filesystem?

- Does the block belong to the structure that asked for the read?
  This assumes that metadata blocks only have one owner, which is always true
  in XFS.

- Is the type of data stored in the block within a reasonable range of what
  scrub is expecting?

- Does the physical location of the block match the location it was read from?

- Does the block checksum match the data?

The scope of the protections here are very limited -- verifiers can only
establish that the filesystem code is reasonably free of gross corruption bugs
and that the storage system is reasonably competent at retrieval.
Corruption problems observed at runtime cause the generation of health reports,
failed system calls, and in the extreme case, filesystem shutdowns if the
corrupt metadata force the cancellation of a dirty transaction.

Every online fsck scrubbing function is expected to read every ondisk metadata
block of a structure in the course of checking the structure.
Corruption problems observed during a check are immediately reported to
userspace as corruption; during a cross-reference, they are reported as a
failure to cross-reference once the full examination is complete.
Reads satisfied by a buffer already in cache (and hence already verified)
bypass these checks.

Internal Consistency Checks
```````````````````````````

After the buffer cache, the next level of metadata protection is the internal
record verification code built into the filesystem.
These checks are split between the buffer verifiers, the in-filesystem users of
the buffer cache, and the scrub code itself, depending on the amount of higher
level context required.
The scope of checking is still internal to the block.
These higher level checking functions answer these questions:

- Does the type of data stored in the block match what scrub is expecting?

- Does the block belong to the owning structure that asked for the read?

- If the block contains records, do the records fit within the block?

- If the block tracks internal free space information, is it consistent with
  the record areas?

- Are the records contained inside the block free of obvious corruptions?

Record checks in this category are more rigorous and more time-intensive.
For example, block pointers and inumbers are checked to ensure that they point
within the dynamically allocated parts of an allocation group and within
the filesystem.
Names are checked for invalid characters, and flags are checked for invalid
combinations.
Other record attributes are checked for sensible values.
Btree records spanning an interval of the btree keyspace are checked for
correct order and lack of mergeability (except for file fork mappings).
For performance reasons, regular code may skip some of these checks unless
debugging is enabled or a write is about to occur.
Scrub functions, of course, must check all possible problems.
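For instance, the order and mergeability checks for two adjacent free space
btree records might look like the following sketch (the record layout follows
``struct xfs_alloc_rec_incore``; treat the exact checks as illustrative):

.. code-block:: c

    /*
     * Sketch: adjacent free space records must be in ascending order,
     * must not overlap, and must not abut each other -- abutting
     * records should have been merged into one.
     */
    static bool freesp_records_ok(const struct xfs_alloc_rec_incore *prev,
                                  const struct xfs_alloc_rec_incore *cur)
    {
        xfs_agblock_t prev_end = prev->ar_startblock + prev->ar_blockcount;

        if (prev_end > cur->ar_startblock)
            return false;   /* out of order or overlapping */
        if (prev_end == cur->ar_startblock)
            return false;   /* mergeable records waste btree space */
        return true;
    }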

Validation of Userspace-Controlled Record Attributes
````````````````````````````````````````````````````

Various pieces of filesystem metadata are directly controlled by userspace.
Because of this, validation work cannot be more precise than checking that a
value is within the possible range.
These fields include:

- Superblock fields controlled by mount options
- Filesystem labels
- File timestamps
- File permissions
- File size
- File flags
- Names present in directory entries, extended attribute keys, and filesystem
  labels
- Extended attribute key namespaces
- Extended attribute values
- File data block contents
- Quota limits
- Quota timer expiration (if resource usage exceeds the soft limit)

Cross-Referencing Space Metadata
````````````````````````````````

After internal block checks, the next higher level of checking is
cross-referencing records between metadata structures.
For regular runtime code, the cost of these checks is considered to be
prohibitively expensive, but as scrub is dedicated to rooting out
inconsistencies, it must pursue all avenues of inquiry.
The exact set of cross-referencing is highly dependent on the context of the
data structure being checked.

The XFS btree code has keyspace scanning functions that online fsck uses to
cross-reference one structure with another.
Specifically, scrub can scan the key space of an index to determine if that
keyspace is fully, sparsely, or not at all mapped to records.
For the reverse mapping btree, it is possible to mask parts of the key for the
purposes of performing a keyspace scan so that scrub can decide if the rmap
btree contains records mapping a certain extent of physical space without the
sparseness of the rest of the rmap keyspace getting in the way.

Btree blocks undergo the following checks before cross-referencing:

- Does the type of data stored in the block match what scrub is expecting?

- Does the block belong to the owning structure that asked for the read?

- Do the records fit within the block?

- Are the records contained inside the block free of obvious corruptions?

- Are the name hashes in the correct order?

- Do node pointers within the btree point to valid block addresses for the type
  of btree?

- Do child pointers point towards the leaves?

- Do sibling pointers point across the same level?

- For each node block record, does the record key accurately reflect the
  contents of the child block?

Space allocation records are cross-referenced as follows:

1. Any space mentioned by any metadata structure is cross-referenced as
   follows:

   - Does the reverse mapping index list only the appropriate owner as the
     owner of each block?

   - Are none of the blocks claimed as free space?

   - If these aren't file data blocks, are none of the blocks claimed as space
     shared by different owners?

2. Btree blocks are cross-referenced as follows:

   - Everything in class 1 above.

   - If there's a parent node block, do the keys listed for this block match
     the keyspace of this block?

   - Do the sibling pointers point to valid blocks?  Of the same level?

   - Do the child pointers point to valid blocks?  Of the next level down?

3. Free space btree records are cross-referenced as follows:

   - Everything in class 1 and 2 above.

   - Does the reverse mapping index list no owners of this space?

   - Is this space not claimed by the inode index for inodes?

   - Is it not mentioned by the reference count index?

   - Is there a matching record in the other free space btree?

4. Inode btree records are cross-referenced as follows:

   - Everything in class 1 and 2 above.

   - Is there a matching record in the free inode btree?

   - Do cleared bits in the holemask correspond with inode clusters?

   - Do set bits in the freemask correspond with inode records with zero link
     count?

5. Inode records are cross-referenced as follows:

   - Everything in class 1.

   - Do all the fields that summarize information about the file forks actually
     match those forks?

   - Does each inode with zero link count correspond to a record in the free
     inode btree?

6. File fork space mapping records are cross-referenced as follows:

   - Everything in class 1 and 2 above.

   - Is this space not mentioned by the inode btrees?

   - If this is a CoW fork mapping, does it correspond to a CoW entry in the
     reference count btree?

7. Reference count records are cross-referenced as follows:

   - Everything in class 1 and 2 above.

   - Within the space subkeyspace of the rmap btree (that is to say, all
     records mapped to a particular space extent and ignoring the owner info),
     are there the same number of reverse mapping records for each block as the
     reference count record claims?

Proposed patchsets are the series to find gaps in
`refcount btree
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-refcount-gaps>`_,
`inode btree
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-inobt-gaps>`_, and
`rmap btree
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-rmapbt-gaps>`_ records;
to find
`mergeable records
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-mergeable-records>`_;
and to
`improve cross referencing with rmap
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-strengthen-rmap-checking>`_
before starting a repair.

Checking Extended Attributes
````````````````````````````

Extended attributes implement a key-value store that enables fragments of data
to be attached to any file.
Both the kernel and userspace can access the keys and values, subject to
namespace and privilege restrictions.
Most typically these fragments are metadata about the file -- origins, security
contexts, user-supplied labels, indexing information, etc.

Names can be as long as 255 bytes and can exist in several different
namespaces.
Values can be as large as 64KB.
A file's extended attributes are stored in blocks mapped by the attr fork.
The mappings point to leaf blocks, remote value blocks, or dabtree blocks.
Block 0 in the attribute fork is always the top of the structure, but otherwise
each of the three types of blocks can be found at any offset in the attr fork.
Leaf blocks contain attribute key records that point to the name and the value.
Names are always stored elsewhere in the same leaf block.
Values that are less than 3/4 the size of a filesystem block are also stored
elsewhere in the same leaf block.
Remote value blocks contain values that are too large to fit inside a leaf.
If the leaf information exceeds a single filesystem block, a dabtree (also
rooted at block 0) is created to map hashes of the attribute names to leaf
blocks in the attr fork.

Checking an extended attribute structure is not so straightforward due to the
lack of separation between attr blocks and index blocks.
Scrub must read each block mapped by the attr fork and ignore the non-leaf
blocks:

1. Walk the dabtree in the attr fork (if present) to ensure that there are no
   irregularities in the blocks or dabtree mappings that do not point to
   attr leaf blocks.

2. Walk the blocks of the attr fork looking for leaf blocks.
   For each entry inside a leaf:

   a. Validate that the name does not contain invalid characters.

   b. Read the attr value.
      This performs a named lookup of the attr name to ensure the correctness
      of the dabtree.
      If the value is stored in a remote block, this also validates the
      integrity of the remote value block.

Checking and Cross-Referencing Directories
``````````````````````````````````````````

The filesystem directory tree is a directed acyclic graph structure, with files
constituting the nodes, and directory entries (dirents) constituting the edges.
Directories are a special type of file containing a set of mappings from a
255-byte sequence (name) to an inumber.
These are called directory entries, or dirents for short.
Each directory file must have exactly one directory pointing to the file.
A root directory points to itself.
Directory entries point to files of any type.
Each non-directory file may have multiple directories pointing to it.

In XFS, directories are implemented as a file containing up to three 32GB
partitions.
The first partition contains directory entry data blocks.
Each data block contains variable-sized records associating a user-provided
name with an inumber and, optionally, a file type.
If the directory entry data grows beyond one block, the second partition (which
exists as post-EOF extents) is populated with a block containing free space
information and an index that maps hashes of the dirent names to directory data
blocks in the first partition.
This makes directory name lookups very fast.
If this second partition grows beyond one block, the third partition is
populated with a linear array of free space information for faster
expansions.
If the free space has been separated and the second partition grows again
beyond one block, then a dabtree is used to map hashes of dirent names to
directory data blocks.

Checking a directory is pretty straightforward:

1. Walk the dabtree in the second partition (if present) to ensure that there
   are no irregularities in the blocks or dabtree mappings that do not point to
   dirent blocks.

2. Walk the blocks of the first partition looking for directory entries.
   Each dirent is checked as follows, with a sketch of these checks shown
   after this section:

   a. Does the name contain no invalid characters?

   b. Does the inumber correspond to an actual, allocated inode?

   c. Does the child inode have a nonzero link count?

   d. If a file type is included in the dirent, does it match the type of the
      inode?

   e. If the child is a subdirectory, does the child's dotdot pointer point
      back to the parent?

   f. If the directory has a second partition, perform a named lookup of the
      dirent name to ensure the correctness of the dabtree.

3. Walk the free space list in the third partition (if present) to ensure that
   the free spaces it describes are really unused.

Checking operations involving :ref:`parents <dirparent>` and
:ref:`file link counts <nlinks>` are discussed in more detail in later
sections.
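
The per-dirent checks in step 2 above might look like the following sketch;
the helper functions named here are hypothetical stand-ins for the real
lookups, shown only to make the checklist concrete.

.. code-block:: c

   /* Illustrative only; these helpers are not the actual scrub code. */
   static int
   check_dirent(
       struct xfs_inode		*dp,	/* directory being checked */
       const unsigned char	*name,
       int			namelen,
       uint64_t			inumber,
       uint8_t			ftype)
   {
       struct xfs_inode		*ip;
       int			error;

       /* a. Names must not contain '/' or NUL characters. */
       if (memchr(name, '/', namelen) || memchr(name, 0, namelen))
           return -EFSCORRUPTED;

       /* b. The inumber must point to an actual, allocated inode. */
       error = grab_allocated_inode(dp->i_mount, inumber, &ip);
       if (error)
           return error;

       /* c. The child inode must have a nonzero link count. */
       if (link_count(ip) == 0)
           error = -EFSCORRUPTED;

       /* d. The dirent file type must match the inode itself. */
       else if (ftype != inode_file_type(ip))
           error = -EFSCORRUPTED;

       /* e. A subdirectory's dotdot must point back at @dp. */
       else if (is_directory(ip) && lookup_dotdot(ip) != inode_number(dp))
           error = -EFSCORRUPTED;

       release_inode(ip);
       return error;
   }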

Checking Directory/Attribute Btrees
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

As stated in previous sections, the directory/attribute btree (dabtree) index
maps user-provided names to improve lookup times by avoiding linear scans.
Internally, it maps a 32-bit hash of the name to a block offset within the
appropriate file fork.

The internal structure of a dabtree closely resembles the btrees that record
fixed-size metadata records -- each dabtree block contains a magic number, a
checksum, sibling pointers, a UUID, a tree level, and a log sequence number.
The format of leaf and node records is the same -- each entry points to the
next level down in the hierarchy, with dabtree node records pointing to dabtree
leaf blocks, and dabtree leaf records pointing to non-dabtree blocks elsewhere
in the fork.

Checking and cross-referencing the dabtree is very similar to what is done for
space btrees:

- Does the type of data stored in the block match what scrub is expecting?

- Does the block belong to the owning structure that asked for the read?

- Do the records fit within the block?

- Are the records contained inside the block free of obvious corruptions?

- Are the name hashes in the correct order?

- Do node pointers within the dabtree point to valid fork offsets for dabtree
  blocks?

- Do leaf pointers within the dabtree point to valid fork offsets for directory
  or attr leaf blocks?

- Do child pointers point towards the leaves?

- Do sibling pointers point across the same level?

- For each dabtree node record, does the record key accurately reflect the
  contents of the child dabtree block?

- For each dabtree leaf record, does the record key accurately reflect the
  contents of the directory or attr block?

Cross-Referencing Summary Counters
``````````````````````````````````

XFS maintains three classes of summary counters: available resources, quota
resource usage, and file link counts.

In theory, the amount of available resources (data blocks, inodes, realtime
extents) can be found by walking the entire filesystem.
This would make for very slow reporting, so a transactional filesystem can
maintain summaries of this information in the superblock.
Cross-referencing these values against the filesystem metadata should be a
simple matter of walking the free space and inode metadata in each AG and the
realtime bitmap, but there are complications that will be discussed in
:ref:`more detail <fscounters>` later.

:ref:`Quota usage <quotacheck>` and :ref:`file link count <nlinks>`
checking are sufficiently complicated to warrant separate sections.

Post-Repair Reverification
``````````````````````````

After performing a repair, the checking code is run a second time to validate
the new structure, and the results of the health assessment are recorded
internally and returned to the calling process.
This step is critical for enabling system administrators to monitor the status
of the filesystem and the progress of any repairs.
For developers, it is a useful means to judge the efficacy of error detection
and correction in the online and offline checking tools.

Eventual Consistency vs. Online Fsck
------------------------------------

Complex operations can make modifications to multiple per-AG data structures
with a chain of transactions.
These chains, once committed to the log, are restarted during log recovery if
the system crashes while processing the chain.
Because the AG header buffers are unlocked between transactions within a chain,
online checking must coordinate with chained operations that are in progress to
avoid incorrectly detecting inconsistencies due to pending chains.
Furthermore, online repair must not run when operations are pending because
the metadata are temporarily inconsistent with each other, and rebuilding is
not possible.

Only online fsck has this requirement of total consistency of AG metadata, and
fsck operations should be relatively rare compared to filesystem change
operations.
Online fsck coordinates with transaction chains as follows:

* For each AG, maintain a count of intent items targeting that AG.
  The count should be bumped whenever a new item is added to the chain.
  The count should be dropped when the filesystem has locked the AG header
  buffers and finished the work.

* When online fsck wants to examine an AG, it should lock the AG header
  buffers to quiesce all transaction chains that want to modify that AG.
  If the count is zero, proceed with the checking operation.
  If it is nonzero, cycle the buffer locks to allow the chain to make forward
  progress.

This may lead to online fsck taking a long time to complete, but regular
filesystem updates take precedence over background checking activity.
Details about the discovery of this situation are presented in the
:ref:`next section <chain_coordination>`, and details about the solution
are presented :ref:`after that<intent_drains>`.

.. _chain_coordination:

Discovery of the Problem
````````````````````````

Midway through the development of online scrubbing, the fsstress tests
uncovered a misinteraction between online fsck and compound transaction chains
created by other writer threads that resulted in false reports of metadata
inconsistency.
The root cause of these reports is the eventual consistency model introduced by
the expansion of deferred work items and compound transaction chains when
reverse mapping and reflink were introduced.

Originally, transaction chains were added to XFS to avoid deadlocks when
unmapping space from files.
Deadlock avoidance rules require that AGs only be locked in increasing order,
which makes it impossible (say) to use a single transaction to free a space
extent in AG 7 and then try to free a now superfluous block mapping btree block
in AG 3.
To avoid these kinds of deadlocks, XFS creates Extent Freeing Intent (EFI) log
items to commit to freeing some space in one transaction while deferring the
actual metadata updates to a fresh transaction.
The transaction sequence looks like this:

1. The first transaction contains a physical update to the file's block mapping
   structures to remove the mapping from the btree blocks.
   It then attaches to the in-memory transaction an action item to schedule
   deferred freeing of space.
   Concretely, each transaction maintains a list of ``struct
   xfs_defer_pending`` objects, each of which maintains a list of ``struct
   xfs_extent_free_item`` objects.
   Returning to the example above, the action item tracks the freeing of both
   the unmapped space from AG 7 and the block mapping btree (BMBT) block from
   AG 3.
   Deferred frees recorded in this manner are committed in the log by creating
   an EFI log item from the ``struct xfs_extent_free_item`` object and
   attaching the log item to the transaction.
   When the log is persisted to disk, the EFI item is written into the ondisk
   transaction record.
   EFIs can list up to 16 extents to free, all sorted in AG order.

2. The second transaction contains a physical update to the free space btrees
   of AG 3 to release the former BMBT block and a second physical update to the
   free space btrees of AG 7 to release the unmapped file space.
   Observe that the physical updates are resequenced in the correct order
   when possible.
   Attached to the transaction is an extent free done (EFD) log item.
   The EFD contains a pointer to the EFI logged in transaction #1 so that log
   recovery can tell if the EFI needs to be replayed.

If the system goes down after transaction #1 is written back to the filesystem
but before #2 is committed, a scan of the filesystem metadata would show
inconsistent filesystem metadata because there would not appear to be any owner
of the unmapped space.
Happily, log recovery corrects this inconsistency for us -- when recovery finds
an intent log item but does not find a corresponding intent done item, it will
reconstruct the incore state of the intent item and finish it.
In the example above, the log must replay both frees described in the recovered
EFI to complete the recovery phase.

There are subtleties to XFS' transaction chaining strategy to consider:

* Log items must be added to a transaction in the correct order to prevent
  conflicts with principal objects that are not held by the transaction.
  In other words, all per-AG metadata updates for an unmapped block must be
  completed before the last update to free the extent, and extents should not
  be reallocated until that last update commits to the log.

* AG header buffers are released between each transaction in a chain.
  This means that other threads can observe an AG in an intermediate state,
  but as long as the first subtlety is handled, this should not affect the
  correctness of filesystem operations.

* Unmounting the filesystem flushes all pending work to disk, which means that
  offline fsck never sees the temporary inconsistencies caused by deferred
  work item processing.

In this manner, XFS employs a form of eventual consistency to avoid deadlocks
and increase parallelism.
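
Condensed into pseudo-code, the two-transaction unmap example above might look
like the following sketch.
Every helper name here is a hypothetical stand-in for the real unmap and
deferred work machinery, not an actual XFS function.

.. code-block:: c

   /* Transaction 1: unmap the space and log an intent to free it. */
   tp = trans_alloc(mp);
   remove_block_mapping(tp, ip, &ag7_space, &ag3_bmbt_block);
   efi = log_free_intent(tp, &ag3_bmbt_block, &ag7_space);
   trans_commit(tp);			/* the EFI is now in the ondisk log */

   /* Transaction 2: do the real frees, in increasing AG order. */
   tp = trans_alloc(mp);
   free_extent(tp, &ag3_bmbt_block);	/* AG 3 first */
   free_extent(tp, &ag7_space);		/* then AG 7 */
   log_free_done(tp, efi);		/* the EFD points to the EFI */
   trans_commit(tp);

If the system crashes between the two commits, log recovery finds the EFI
without a matching EFD and performs the frees itself.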
1620 1621During the design phase of the reverse mapping and reflink features, it was 1622decided that it was impractical to cram all the reverse mapping updates for a 1623single filesystem change into a single transaction because a single file 1624mapping operation can explode into many small updates: 1625 1626* The block mapping update itself 1627* A reverse mapping update for the block mapping update 1628* Fixing the freelist 1629* A reverse mapping update for the freelist fix 1630 1631* A shape change to the block mapping btree 1632* A reverse mapping update for the btree update 1633* Fixing the freelist (again) 1634* A reverse mapping update for the freelist fix 1635 1636* An update to the reference counting information 1637* A reverse mapping update for the refcount update 1638* Fixing the freelist (a third time) 1639* A reverse mapping update for the freelist fix 1640 1641* Freeing any space that was unmapped and not owned by any other file 1642* Fixing the freelist (a fourth time) 1643* A reverse mapping update for the freelist fix 1644 1645* Freeing the space used by the block mapping btree 1646* Fixing the freelist (a fifth time) 1647* A reverse mapping update for the freelist fix 1648 1649Free list fixups are not usually needed more than once per AG per transaction 1650chain, but it is theoretically possible if space is very tight. 1651For copy-on-write updates this is even worse, because this must be done once to 1652remove the space from a staging area and again to map it into the file! 1653 1654To deal with this explosion in a calm manner, XFS expands its use of deferred 1655work items to cover most reverse mapping updates and all refcount updates. 1656This reduces the worst case size of transaction reservations by breaking the 1657work into a long chain of small updates, which increases the degree of eventual 1658consistency in the system. 1659Again, this generally isn't a problem because XFS orders its deferred work 1660items carefully to avoid resource reuse conflicts between unsuspecting threads. 1661 1662However, online fsck changes the rules -- remember that although physical 1663updates to per-AG structures are coordinated by locking the buffers for AG 1664headers, buffer locks are dropped between transactions. 1665Once scrub acquires resources and takes locks for a data structure, it must do 1666all the validation work without releasing the lock. 1667If the main lock for a space btree is an AG header buffer lock, scrub may have 1668interrupted another thread that is midway through finishing a chain. 1669For example, if a thread performing a copy-on-write has completed a reverse 1670mapping update but not the corresponding refcount update, the two AG btrees 1671will appear inconsistent to scrub and an observation of corruption will be 1672recorded. This observation will not be correct. 1673If a repair is attempted in this state, the results will be catastrophic! 1674 1675Several other solutions to this problem were evaluated upon discovery of this 1676flaw and rejected: 1677 16781. Add a higher level lock to allocation groups and require writer threads to 1679 acquire the higher level lock in AG order before making any changes. 1680 This would be very difficult to implement in practice because it is 1681 difficult to determine which locks need to be obtained, and in what order, 1682 without simulating the entire operation. 1683 Performing a dry run of a file operation to discover necessary locks would 1684 make the filesystem very slow. 1685 16862. 
Make the deferred work coordinator code aware of consecutive intent items 1687 targeting the same AG and have it hold the AG header buffers locked across 1688 the transaction roll between updates. 1689 This would introduce a lot of complexity into the coordinator since it is 1690 only loosely coupled with the actual deferred work items. 1691 It would also fail to solve the problem because deferred work items can 1692 generate new deferred subtasks, but all subtasks must be complete before 1693 work can start on a new sibling task. 1694 16953. Teach online fsck to walk all transactions waiting for whichever lock(s) 1696 protect the data structure being scrubbed to look for pending operations. 1697 The checking and repair operations must factor these pending operations into 1698 the evaluations being performed. 1699 This solution is a nonstarter because it is *extremely* invasive to the main 1700 filesystem. 1701 1702.. _intent_drains: 1703 1704Intent Drains 1705````````````` 1706 1707Online fsck uses an atomic intent item counter and lock cycling to coordinate 1708with transaction chains. 1709There are two key properties to the drain mechanism. 1710First, the counter is incremented when a deferred work item is *queued* to a 1711transaction, and it is decremented after the associated intent done log item is 1712*committed* to another transaction. 1713The second property is that deferred work can be added to a transaction without 1714holding an AG header lock, but per-AG work items cannot be marked done without 1715locking that AG header buffer to log the physical updates and the intent done 1716log item. 1717The first property enables scrub to yield to running transaction chains, which 1718is an explicit deprioritization of online fsck to benefit file operations. 1719The second property of the drain is key to the correct coordination of scrub, 1720since scrub will always be able to decide if a conflict is possible. 1721 1722For regular filesystem code, the drain works as follows: 1723 17241. Call the appropriate subsystem function to add a deferred work item to a 1725 transaction. 1726 17272. The function calls ``xfs_defer_drain_bump`` to increase the counter. 1728 17293. When the deferred item manager wants to finish the deferred work item, it 1730 calls ``->finish_item`` to complete it. 1731 17324. The ``->finish_item`` implementation logs some changes and calls 1733 ``xfs_defer_drain_drop`` to decrease the sloppy counter and wake up any threads 1734 waiting on the drain. 1735 17365. The subtransaction commits, which unlocks the resource associated with the 1737 intent item. 1738 1739For scrub, the drain works as follows: 1740 17411. Lock the resource(s) associated with the metadata being scrubbed. 1742 For example, a scan of the refcount btree would lock the AGI and AGF header 1743 buffers. 1744 17452. If the counter is zero (``xfs_defer_drain_busy`` returns false), there are no 1746 chains in progress and the operation may proceed. 1747 17483. Otherwise, release the resources grabbed in step 1. 1749 17504. Wait for the intent counter to reach zero (``xfs_defer_drain_intents``), then go 1751 back to step 1 unless a signal has been caught. 1752 1753To avoid polling in step 4, the drain provides a waitqueue for scrub threads to 1754be woken up whenever the intent count drops to zero. 1755 1756The proposed patchset is the 1757`scrub intent drain series 1758<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-drain-intents>`_. 1759 1760.. 
_jump_labels:

Static Keys (aka Jump Label Patching)
`````````````````````````````````````

Online fsck for XFS separates the regular filesystem from the checking and
repair code as much as possible.
However, there are a few parts of online fsck (such as the intent drains, and
later, live update hooks) where it is useful for the online fsck code to know
what's going on in the rest of the filesystem.
Since it is not expected that online fsck will be constantly running in the
background, it is very important to minimize the runtime overhead imposed by
these hooks when online fsck is compiled into the kernel but not actively
running on behalf of userspace.
Taking locks in the hot path of a writer thread to access a data structure only
to find that no further action is necessary is expensive -- on the author's
computer, this has an overhead of 40-50ns per access.
Fortunately, the kernel supports dynamic code patching, which enables XFS to
replace a static branch to hook code with ``nop`` sleds when online fsck isn't
running.
This sled has an overhead of however long it takes the instruction decoder to
skip past the sled, which seems to be on the order of less than 1ns and
does not access memory outside of instruction fetching.

When online fsck enables the static key, the sled is replaced with an
unconditional branch to call the hook code.
The switchover is quite expensive (~22000ns) but is paid entirely by the
program that invoked online fsck, and can be amortized if multiple threads
enter online fsck at the same time, or if multiple filesystems are being
checked at the same time.
Changing the branch direction requires taking the CPU hotplug lock, and since
CPU initialization requires memory allocation, online fsck must be careful not
to change a static key while holding any locks or resources that could be
accessed in the memory reclaim paths.
To minimize contention on the CPU hotplug lock, care should be taken not to
enable or disable static keys unnecessarily.

Because static keys are intended to minimize hook overhead for regular
filesystem operations when xfs_scrub is not running, the intended usage
patterns are as follows:

- The hooked part of XFS should declare a static-scoped static key that
  defaults to false.
  The ``DEFINE_STATIC_KEY_FALSE`` macro takes care of this.
  The static key itself should be declared as a ``static`` variable.

- When deciding to invoke code that's only used by scrub, the regular
  filesystem should call the ``static_branch_unlikely`` predicate to avoid the
  scrub-only hook code if the static key is not enabled.

- The regular filesystem should export helper functions that call
  ``static_branch_inc`` to enable and ``static_branch_dec`` to disable the
  static key.
  Wrapper functions make it easy to compile out the relevant code if the kernel
  distributor turns off online fsck at build time.

- Scrub functions wanting to turn on scrub-only XFS functionality should call
  ``xchk_fsgates_enable`` from the setup function to enable a specific
  hook.
  This must be done before obtaining any resources that are used by memory
  reclaim.
  Callers had better be sure they really need the functionality gated by the
  static key; the ``TRY_HARDER`` flag is useful here.

Online scrub has resource acquisition helpers (e.g. ``xchk_perag_lock``) to
handle locking AGI and AGF buffers for all scrubber functions.
If it detects a conflict between scrub and the running transactions, it will
try to wait for intents to complete.
If the caller of the helper has not enabled the static key, the helper will
return -EDEADLOCK, which should result in the scrub being restarted with the
``TRY_HARDER`` flag set.
The scrub setup function should detect that flag, enable the static key, and
try the scrub again.
Scrub teardown disables all static keys obtained by ``xchk_fsgates_enable``.

For more information, please see the kernel documentation of
Documentation/staging/static-keys.rst.

.. _xfile:

Pageable Kernel Memory
----------------------

Some online checking functions work by scanning the filesystem to build a
shadow copy of an ondisk metadata structure in memory and comparing the two
copies.
For online repair to rebuild a metadata structure, it must compute the record
set that will be stored in the new structure before it can persist that new
structure to disk.
Ideally, repairs complete with a single atomic commit that introduces
a new data structure.
To meet these goals, the kernel needs to collect a large amount of information
in a place that doesn't require the correct operation of the filesystem.

Kernel memory isn't suitable because:

* Allocating a contiguous region of memory to create a C array is very
  difficult, especially on 32-bit systems.

* Linked lists of records introduce double pointer overhead which is very high
  and eliminate the possibility of indexed lookups.

* Kernel memory is pinned, which can drive the system into OOM conditions.

* The system might not have sufficient memory to stage all the information.

At any given time, online fsck does not need to keep the entire record set in
memory, which means that individual records can be paged out if necessary.
Continued development of online fsck demonstrated that the ability to perform
indexed data storage would also be very useful.
Fortunately, the Linux kernel already has a facility for byte-addressable and
pageable storage: tmpfs.
In-kernel graphics drivers (most notably i915) take advantage of tmpfs files
to store intermediate data that doesn't need to be in memory at all times, so
that usage precedent is already established.
Hence, the ``xfile`` was born!

+--------------------------------------------------------------------------+
| **Historical Sidebar**:                                                  |
+--------------------------------------------------------------------------+
| The first edition of online repair inserted records into a new btree as |
| it found them, which failed because the filesystem could shut down with |
| a half-rebuilt data structure, which would be live after recovery       |
| finished.                                                                |
|                                                                          |
| The second edition solved the half-rebuilt structure problem by storing  |
| everything in memory, but frequently ran the system out of memory.       |
|                                                                          |
| The third edition solved the OOM problem by using linked lists, but the  |
| memory overhead of the list pointers was extreme.                        |
+--------------------------------------------------------------------------+

xfile Access Models
```````````````````

A survey of the intended uses of xfiles suggested these use cases:

1. Arrays of fixed-sized records (space management btrees, directory and
   extended attribute entries)

2. Sparse arrays of fixed-sized records (quotas and link counts)

3. Large binary objects (BLOBs) of variable sizes (directory and extended
   attribute names and values)

4. Staging btrees in memory (reverse mapping btrees)

5. Arbitrary contents (realtime space management)

To support the first four use cases, high level data structures wrap the xfile
to share functionality between online fsck functions.
The rest of this section discusses the interfaces that the xfile presents to
four of those five higher level data structures.
The fifth use case is discussed in the :ref:`realtime summary <rtsummary>` case
study.

The most general storage interface supported by the xfile enables the reading
and writing of arbitrary quantities of data at arbitrary offsets in the xfile.
This capability is provided by ``xfile_pread`` and ``xfile_pwrite`` functions,
which behave similarly to their userspace counterparts.
XFS is very record-based, which suggests that the ability to load and store
complete records is important.
To support these cases, a pair of ``xfile_obj_load`` and ``xfile_obj_store``
functions are provided to read and persist objects into an xfile.
They are internally the same as pread and pwrite, except that they treat any
error as an out of memory error.
For online repair, squashing error conditions in this manner is an acceptable
behavior because the only reaction is to abort the operation back to userspace.
All five xfile use cases can be serviced by these four functions.

However, no discussion of file access idioms is complete without answering the
question, "But what about mmap?"
It is convenient to access storage directly with pointers, just like userspace
code does with regular memory.
Online fsck must not drive the system into OOM conditions, which means that
xfiles must be responsive to memory reclamation.
tmpfs can only push a pagecache folio to the swap cache if the folio is neither
pinned nor locked, which means the xfile must not pin too many folios.

Short term direct access to xfile contents is done by locking the pagecache
folio and mapping it into kernel address space.
Programmatic access (e.g. pread and pwrite) uses this mechanism.
Folio locks are not supposed to be held for long periods of time, so long
term direct access to xfile contents is done by bumping the folio refcount,
mapping it into kernel address space, and dropping the folio lock.
These long term users *must* be responsive to memory reclaim by hooking into
the shrinker infrastructure to know when to release folios.

The ``xfile_get_page`` and ``xfile_put_page`` functions are provided to
retrieve the (locked) folio that backs part of an xfile and to release it.
The only users of these folio lease functions are the xfarray
:ref:`sorting<xfarray_sort>` algorithms and the :ref:`in-memory
btrees<xfbtree>`.
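
As a concrete illustration, a checking function might stage fixed-size records
in an xfile as shown below.
The ``xfile_obj_load`` and ``xfile_obj_store`` names come from the text above,
but the signatures and the record layout are assumptions made for this sketch.

.. code-block:: c

   /* Hypothetical record type staged by a scrubber. */
   struct shadow_rec {
       uint64_t	key;
       uint64_t	value;
   };

   /* Store the record for @index, treating the xfile as an array. */
   static int
   stage_record(
       struct xfile		*xf,
       uint64_t			index,
       const struct shadow_rec	*rec)
   {
       loff_t	pos = index * sizeof(struct shadow_rec);

       /* Failures are reported as ENOMEM, per the text above. */
       return xfile_obj_store(xf, rec, sizeof(*rec), pos);
   }

   /* Read back the record for @index during the comparison step. */
   static int
   retrieve_record(
       struct xfile		*xf,
       uint64_t			index,
       struct shadow_rec	*rec)
   {
       loff_t	pos = index * sizeof(struct shadow_rec);

       return xfile_obj_load(xf, rec, sizeof(*rec), pos);
   }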

xfile Access Coordination
`````````````````````````

For security reasons, xfiles must be owned privately by the kernel.
They are marked ``S_PRIVATE`` to prevent interference from the security system,
must never be mapped into process file descriptor tables, and their pages must
never be mapped into userspace processes.

To avoid locking recursion issues with the VFS, all accesses to the shmfs file
are performed by manipulating the page cache directly.
xfile writers call the ``->write_begin`` and ``->write_end`` functions of the
xfile's address space to grab writable pages, copy the caller's buffer into the
page, and release the pages.
xfile readers call ``shmem_read_mapping_page_gfp`` to grab pages directly
before copying the contents into the caller's buffer.
In other words, xfiles ignore the VFS read and write code paths to avoid
having to create a dummy ``struct kiocb`` and to avoid taking inode and
freeze locks.
tmpfs cannot be frozen, and xfiles must not be exposed to userspace.

If an xfile is shared between threads to stage repairs, the caller must provide
its own locks to coordinate access.
For example, if a scrub function stores scan results in an xfile and needs
other threads to provide updates to the scanned data, the scrub function must
provide a lock for all threads to share.

.. _xfarray:

Arrays of Fixed-Sized Records
`````````````````````````````

In XFS, each type of indexed space metadata (free space, inodes, reference
counts, file fork space, and reverse mappings) consists of a set of fixed-size
records indexed with a classic B+ tree.
Directories have a set of fixed-size dirent records that point to the names,
and extended attributes have a set of fixed-size attribute keys that point to
names and values.
Quota counters and file link counters index records with numbers.
During a repair, scrub needs to stage new records during the gathering step and
retrieve them during the btree building step.

Although this requirement can be satisfied by calling the read and write
methods of the xfile directly, it is simpler for callers for there to be a
higher level abstraction to take care of computing array offsets, to provide
iterator functions, and to deal with sparse records and sorting.
The ``xfarray`` abstraction presents a linear array for fixed-size records atop
the byte-accessible xfile.

.. _xfarray_access_patterns:

Array Access Patterns
^^^^^^^^^^^^^^^^^^^^^

Array access patterns in online fsck tend to fall into three categories.
Iteration of records is assumed to be necessary for all cases and will be
covered in the next section.

The first type of caller handles records that are indexed by position.
Gaps may exist between records, and a record may be updated multiple times
during the collection step.
In other words, these callers want a sparse linearly addressed table file.
Typical use cases are quota records and file link count records.
Access to array elements is performed programmatically via ``xfarray_load`` and
``xfarray_store`` functions, which wrap the similarly-named xfile functions to
provide loading and storing of array elements at arbitrary array indices.
Gaps are defined to be null records, and null records are defined to be a
sequence of all zero bytes.
Null records are detected by calling ``xfarray_element_is_null``.
They are created either by calling ``xfarray_unset`` to null out an existing
record or by never storing anything to an array index.
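
A sparse-array caller of this type might look like the following sketch, which
accumulates per-id shadow quota counts.
The record layout is made up for the example, and it is assumed here for
simplicity that loading a never-written index yields a zeroed (null) record.

.. code-block:: c

   /* Hypothetical shadow quota record, for illustration only. */
   struct shadow_dquot {
       uint64_t	blocks;		/* observed block usage */
   };

   /* Accumulate block usage for quota @id; ids are array indices. */
   static int
   bump_shadow_dquot(
       struct xfarray	*counts,
       uint32_t		id,
       uint64_t		blocks)
   {
       struct shadow_dquot	dq;
       int			error;

       error = xfarray_load(counts, id, &dq);
       if (error)
           return error;

       /* Update the record in place; gaps read back as all zeroes. */
       dq.blocks += blocks;
       return xfarray_store(counts, id, &dq);
   }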

The second type of caller handles records that are not indexed by position
and do not require multiple updates to a record.
The typical use case here is rebuilding space btrees and key/value btrees.
These callers can add records to the array without caring about array indices
via the ``xfarray_append`` function, which stores a record at the end of the
array.
For callers that require records to be presentable in a specific order (e.g.
rebuilding btree data), the ``xfarray_sort`` function can arrange the sorted
records; this function will be covered later.

The third type of caller is a bag, which is useful for counting records.
The typical use case here is constructing space extent reference counts from
reverse mapping information.
Records can be put in the bag in any order, they can be removed from the bag
at any time, and uniqueness of records is left to callers.
The ``xfarray_store_anywhere`` function is used to insert a record in any
null record slot in the bag; and the ``xfarray_unset`` function removes a
record from the bag.

The proposed patchset is the
`big in-memory array
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=big-array>`_.

Iterating Array Elements
^^^^^^^^^^^^^^^^^^^^^^^^

Most users of the xfarray require the ability to iterate the records stored in
the array.
Callers can probe every possible array index with the following:

.. code-block:: c

   xfarray_idx_t i;
   foreach_xfarray_idx(array, i) {
       xfarray_load(array, i, &rec);

       /* do something with rec */
   }

All users of this idiom must be prepared to handle null records or must already
know that there aren't any.

For xfarray users that want to iterate a sparse array, the ``xfarray_iter``
function ignores indices in the xfarray that have never been written to by
calling ``xfile_seek_data`` (which internally uses ``SEEK_DATA``) to skip areas
of the array that are not populated with memory pages.
Once it finds a page, it will skip the zeroed areas of the page.

.. code-block:: c

   xfarray_idx_t i = XFARRAY_CURSOR_INIT;
   while ((ret = xfarray_iter(array, &i, &rec)) == 1) {
       /* do something with rec */
   }

.. _xfarray_sort:

Sorting Array Elements
^^^^^^^^^^^^^^^^^^^^^^

During the fourth demonstration of online repair, a community reviewer remarked
that for performance reasons, online repair ought to load batches of records
into btree record blocks instead of inserting records into a new btree one at a
time.
The btree insertion code in XFS is responsible for maintaining correct ordering
of the records, so naturally the xfarray must also support sorting the record
set prior to bulk loading.

Case Study: Sorting xfarrays
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The sorting algorithm used in the xfarray is actually a combination of adaptive
quicksort and a heapsort subalgorithm in the spirit of
`Sedgewick <https://algs4.cs.princeton.edu/23quicksort/>`_ and
`pdqsort <https://github.com/orlp/pdqsort>`_, with customizations for the Linux
kernel.
To sort records in a reasonably short amount of time, ``xfarray`` takes
advantage of the binary subpartitioning offered by quicksort, but it also uses
heapsort to hedge against performance collapse if the chosen quicksort pivots
are poor.
Both algorithms are (in general) O(n * lg(n)), but there is a wide performance
gulf between the two implementations.

The Linux kernel already contains a reasonably fast implementation of heapsort.
It only operates on regular C arrays, which limits the scope of its usefulness.
There are two key places where the xfarray uses it:

* Sorting any record subset backed by a single xfile page.

* Loading a small number of xfarray records from potentially disparate parts
  of the xfarray into a memory buffer, and sorting the buffer.

In other words, ``xfarray`` uses heapsort to constrain the nested recursion of
quicksort, thereby mitigating quicksort's worst runtime behavior.

Choosing a quicksort pivot is a tricky business.
A good pivot splits the set to sort in half, leading to the divide and conquer
behavior that is crucial to O(n * lg(n)) performance.
A poor pivot barely splits the subset at all, leading to O(n\ :sup:`2`)
runtime.
The xfarray sort routine tries to avoid picking a bad pivot by sampling nine
records into a memory buffer and using the kernel heapsort to identify the
median of the nine.

Most modern quicksort implementations employ Tukey's "ninther" to select a
pivot from a classic C array.
Typical ninther implementations pick three unique triads of records, sort each
of the triads, and then sort the middle value of each triad to determine the
ninther value.
As stated previously, however, xfile accesses are not entirely cheap.
It turned out to be much more performant to read the nine elements into a
memory buffer, run the kernel's in-memory heapsort on the buffer, and choose
the 4th element of that buffer as the pivot.
Tukey's ninthers are described in J. W. Tukey, `The ninther, a technique for
low-effort robust (resistant) location in large samples`, in *Contributions to
Survey Sampling and Applied Statistics*, edited by H. David, (Academic Press,
1978), pp. 251–257.

The partitioning of quicksort is fairly textbook -- rearrange the record
subset around the pivot, then set up the current and next stack frames to
sort with the larger and the smaller halves of the pivot, respectively.
This keeps the stack space requirements to log2(record count).

As a final performance optimization, the hi and lo scanning phase of quicksort
keeps examined xfile pages mapped in the kernel for as long as possible to
reduce map/unmap cycles.
Surprisingly, this reduces overall sort runtime by nearly half again after
accounting for the application of heapsort directly onto xfile pages.

Blob Storage
````````````

Extended attributes and directories add an additional requirement for staging
records: arbitrary byte sequences of finite length.
Each directory entry record needs to store the entry name,
and each extended attribute needs to store both the attribute name and value.
The names, keys, and values can consume a large amount of memory, so the
``xfblob`` abstraction was created to simplify management of these blobs
atop an xfile.

Blob arrays provide ``xfblob_load`` and ``xfblob_store`` functions to retrieve
and persist objects.
The store function returns a magic cookie for every object that it persists.
Later, callers provide this cookie to ``xfblob_load`` to recall the object.
The ``xfblob_free`` function frees a specific blob, and the ``xfblob_truncate``
function frees them all because compaction is not needed.
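
To make the cookie flow concrete, the following sketch stages a dirent name
and recalls it later; the signatures and the staging record are assumptions
made for this example, not the final API.

.. code-block:: c

   /* Illustrative staging record; not the actual repair code. */
   struct staged_dirent {
       xfblob_cookie	name_cookie;	/* handed out by xfblob_store */
       uint8_t		namelen;
       uint64_t		inumber;
   };

   /* Stash the name; keep the cookie so it can be recalled later. */
   static int
   stage_dirent(
       struct xfblob		*names,
       struct staged_dirent	*sd,
       const unsigned char	*name,
       uint8_t			namelen,
       uint64_t			inumber)
   {
       int	error;

       error = xfblob_store(names, &sd->name_cookie, name, namelen);
       if (error)
           return error;
       sd->namelen = namelen;
       sd->inumber = inumber;
       return 0;
   }

   /* Recall the name while flushing entries to the temporary file. */
   static int
   recall_dirent_name(
       struct xfblob			*names,
       const struct staged_dirent	*sd,
       unsigned char			*namebuf)
   {
       return xfblob_load(names, sd->name_cookie, namebuf, sd->namelen);
   }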

The details of repairing directories and extended attributes will be discussed
in a subsequent section about atomic extent swapping.
However, it should be noted that these repair functions only use blob storage
to cache a small number of entries before adding them to a temporary ondisk
file, which is why compaction is not required.

The proposed patchset is at the start of the
`extended attribute repair
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-xattrs>`_ series.

.. _xfbtree:

In-Memory B+Trees
`````````````````

The chapter about :ref:`secondary metadata<secondary_metadata>` mentioned that
checking and repairing of secondary metadata commonly requires coordination
between a live metadata scan of the filesystem and writer threads that are
updating that metadata.
Keeping the scan data up to date requires the ability to propagate metadata
updates from the filesystem into the data being collected by the scan.
This *can* be done by appending concurrent updates into a separate log file and
applying them before writing the new metadata to disk, but this leads to
unbounded memory consumption if the rest of the system is very busy.
Another option is to skip the side-log and commit live updates from the
filesystem directly into the scan data, which trades more overhead for a lower
maximum memory requirement.
In both cases, the data structure holding the scan results must support indexed
access to perform well.

Given that indexed lookups of scan data are required for both strategies,
online fsck employs the second strategy of committing live updates directly
into scan data.
Because xfarrays are not indexed and do not enforce record ordering, they
are not suitable for this task.
Conveniently, however, XFS has a library to create and maintain ordered reverse
mapping records: the existing rmap btree code!
If only there was a means to create one in memory.

Recall that the :ref:`xfile <xfile>` abstraction represents memory pages as a
regular file, which means that the kernel can create byte or block addressable
virtual address spaces at will.
The XFS buffer cache specializes in abstracting IO to block-oriented address
spaces, which means that adaptation of the buffer cache to interface with
xfiles enables reuse of the entire btree library.
Btrees built atop an xfile are collectively known as ``xfbtrees``.
The next few sections describe how they actually work.

The proposed patchset is the
`in-memory btree
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=in-memory-btrees>`_
series.

Using xfiles as a Buffer Cache Target
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Two modifications are necessary to support xfiles as a buffer cache target.
The first is to make it possible for the ``struct xfs_buftarg`` structure to
host the ``struct xfs_buf`` rhashtable, because normally those are held by a
per-AG structure.
The second change is to modify the buffer ``ioapply`` function to "read" cached
pages from the xfile and "write" cached pages back to the xfile.
Multiple access to individual buffers is controlled by the ``xfs_buf`` lock,
since the xfile does not provide any locking on its own.
With this adaptation in place, users of the xfile-backed buffer cache use
exactly the same APIs as users of the disk-backed buffer cache.
The separation between xfile and buffer cache implies higher memory usage since
they do not share pages, but this property could some day enable transactional
updates to an in-memory btree.
Today, however, it simply eliminates the need for new code.

Space Management with an xfbtree
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Space management for an xfile is very simple -- each btree block is one memory
page in size.
These blocks use the same header format as an on-disk btree, but the in-memory
block verifiers ignore the checksums, assuming that xfile memory is no more
corruption-prone than regular DRAM.
Reusing existing code here is more important than absolute memory efficiency.

The very first block of an xfile backing an xfbtree contains a header block.
The header describes the owner, height, and the block number of the root
xfbtree block.

To allocate a btree block, use ``xfile_seek_data`` to find a gap in the file.
If there are no gaps, create one by extending the length of the xfile.
Preallocate space for the block with ``xfile_prealloc``, and hand back the
location.
To free an xfbtree block, use ``xfile_discard`` (which internally uses
``FALLOC_FL_PUNCH_HOLE``) to remove the memory page from the xfile.

Populating an xfbtree
^^^^^^^^^^^^^^^^^^^^^

An online fsck function that wants to create an xfbtree should proceed as
follows, with a sketch of the whole sequence shown after this list:

1. Call ``xfile_create`` to create an xfile.

2. Call ``xfs_alloc_memory_buftarg`` to create a buffer cache target structure
   pointing to the xfile.

3. Pass the buffer cache target, buffer ops, and other information to
   ``xfbtree_create`` to write an initial tree header and root block to the
   xfile.
   Each btree type should define a wrapper that passes necessary arguments to
   the creation function.
   For example, rmap btrees define ``xfs_rmapbt_mem_create`` to take care of
   all the necessary details for callers.
   A ``struct xfbtree`` object will be returned.

4. Pass the xfbtree object to the btree cursor creation function for the
   btree type.
   Following the example above, ``xfs_rmapbt_mem_cursor`` takes care of this
   for callers.

5. Pass the btree cursor to the regular btree functions to make queries against
   and to update the in-memory btree.
   For example, a btree cursor for an rmap xfbtree can be passed to the
   ``xfs_rmap_*`` functions just like any other btree cursor.
   See the :ref:`next section<xfbtree_commit>` for information on dealing with
   xfbtree updates that are logged to a transaction.

6. When finished, delete the btree cursor, destroy the xfbtree object, free the
   buffer target, and destroy the xfile to release all resources.
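
Condensed into code, the sequence might look like the following sketch; the
argument lists, and the teardown helpers in particular, are illustrative
approximations of the functions named above, not the final kernel API.

.. code-block:: c

   /* Illustrative only; argument lists are assumed for this sketch. */
   static int
   build_shadow_rmapbt(
       struct xfs_mount	*mp,
       struct xfs_perag	*pag)
   {
       struct xfile		*xfile;
       struct xfs_buftarg	*btp;
       struct xfbtree		*xfbt;
       struct xfs_btree_cur	*cur;
       int			error;

       error = xfile_create("rmap shadow", &xfile);		/* step 1 */
       if (error)
           return error;

       error = xfs_alloc_memory_buftarg(mp, xfile, &btp);	/* step 2 */
       if (error)
           goto out_xfile;

       error = xfs_rmapbt_mem_create(mp, pag, btp, &xfbt);	/* step 3 */
       if (error)
           goto out_btp;

       cur = xfs_rmapbt_mem_cursor(pag, NULL, xfbt);		/* step 4 */

       /* Step 5: query and update the btree with the regular API. */

       xfs_btree_del_cursor(cur, error);			/* step 6 */
       xfbtree_destroy(xfbt);
   out_btp:
       xfs_buftarg_destroy(btp);
   out_xfile:
       xfile_destroy(xfile);
       return error;
   }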

.. _xfbtree_commit:

Committing Logged xfbtree Buffers
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Although it is a clever hack to reuse the rmap btree code to handle the staging
structure, the ephemeral nature of the in-memory btree block storage presents
some challenges of its own.
The XFS transaction manager must not commit buffer log items for buffers backed
by an xfile because the log format does not understand updates for devices
other than the data device.
An ephemeral xfbtree probably will not exist by the time the AIL checkpoints
log transactions back into the filesystem, and certainly won't exist during
log recovery.
For these reasons, any code updating an xfbtree in transaction context must
remove the buffer log items from the transaction and write the updates into the
backing xfile before committing or cancelling the transaction.

The ``xfbtree_trans_commit`` and ``xfbtree_trans_cancel`` functions implement
this functionality as follows:

1. Find each buffer log item whose buffer targets the xfile.

2. Record the dirty/ordered status of the log item.

3. Detach the log item from the buffer.

4. Queue the buffer to a special delwri list.

5. Clear the transaction dirty flag if the only dirty log items were the ones
   that were detached in step 3.

6. Submit the delwri list to commit the changes to the xfile, if the updates
   are being committed.

After removing xfile logged buffers from the transaction in this manner, the
transaction can be committed or cancelled.

Bulk Loading of Ondisk B+Trees
------------------------------

As mentioned previously, early iterations of online repair built new btree
structures by creating a new btree and adding observations individually.
Loading a btree one record at a time had a slight advantage of not requiring
the incore records to be sorted prior to commit, but was very slow and leaked
blocks if the system went down during a repair.
Loading records one at a time also meant that repair could not control the
loading factor of the blocks in the new btree.

Fortunately, the venerable ``xfs_repair`` tool had a more efficient means for
rebuilding a btree index from a collection of records -- bulk btree loading.
This was implemented rather inefficiently code-wise, since ``xfs_repair``
had separate copy-pasted implementations for each btree type.

To prepare for online fsck, each of the four bulk loaders was studied, notes
were taken, and the four were refactored into a single generic btree bulk
loading mechanism.
Those notes in turn have been refreshed and are presented below.

Geometry Computation
````````````````````

The zeroth step of bulk loading is to assemble the entire record set that will
be stored in the new btree, and sort the records.
Next, call ``xfs_btree_bload_compute_geometry`` to compute the shape of the
btree from the record set, the type of btree, and any load factor preferences.
This information is required for resource reservation.

First, the geometry computation computes the minimum and maximum records that
will fit in a leaf block from the size of a btree block and the size of the
block header.
Roughly speaking, the maximum number of records is::

    maxrecs = (block_size - header_size) / record_size

The XFS design specifies that btree blocks should be merged when possible,
which means the minimum number of records is half of maxrecs::

    minrecs = maxrecs / 2

The next variable to determine is the desired loading factor.
This must be at least minrecs and no more than maxrecs.
Choosing minrecs is undesirable because it wastes half the block.
Bulk Loading of Ondisk B+Trees
------------------------------

As mentioned previously, early iterations of online repair built new btree
structures by creating a new btree and adding observations individually.
Loading a btree one record at a time had a slight advantage of not requiring
the incore records to be sorted prior to commit, but was very slow and leaked
blocks if the system went down during a repair.
Loading records one at a time also meant that repair could not control the
loading factor of the blocks in the new btree.

Fortunately, the venerable ``xfs_repair`` tool had a more efficient means for
rebuilding a btree index from a collection of records -- bulk btree loading.
This was implemented rather inefficiently code-wise, since ``xfs_repair``
had separate copy-pasted implementations for each btree type.

To prepare for online fsck, each of the four bulk loaders was studied, notes
were taken, and the four were refactored into a single generic btree bulk
loading mechanism.
Those notes in turn have been refreshed and are presented below.

Geometry Computation
````````````````````

The zeroth step of bulk loading is to assemble the entire record set that
will be stored in the new btree, and sort the records.
Next, call ``xfs_btree_bload_compute_geometry`` to compute the shape of the
btree from the record set, the type of btree, and any load factor
preferences.
This information is required for resource reservation.

First, the geometry computation computes the minimum and maximum records that
will fit in a leaf block from the size of a btree block and the size of the
block header.
Roughly speaking, the maximum number of records is::

        maxrecs = (block_size - header_size) / record_size

The XFS design specifies that btree blocks should be merged when possible,
which means the minimum number of records is half of maxrecs::

        minrecs = maxrecs / 2

The next variable to determine is the desired loading factor.
This must be at least minrecs and no more than maxrecs.
Choosing minrecs is undesirable because it wastes half the block.
Choosing maxrecs is also undesirable because adding a single record to each
newly rebuilt leaf block will cause a tree split, which causes a noticeable
drop in performance immediately afterwards.
The default loading factor was chosen to be 75% of maxrecs, which provides a
reasonably compact structure without any immediate split penalties::

        default_load_factor = (maxrecs + minrecs) / 2

If space is tight, the loading factor will be set to maxrecs to try to avoid
running out of space::

        leaf_load_factor = enough space ? default_load_factor : maxrecs

Load factor is computed for btree node blocks using the combined size of the
btree key and pointer as the record size::

        maxrecs = (block_size - header_size) / (key_size + ptr_size)
        minrecs = maxrecs / 2
        node_load_factor = enough space ? default_load_factor : maxrecs

Once that's done, the number of leaf blocks required to store the record set
can be computed as::

        leaf_blocks = ceil(record_count / leaf_load_factor)

The number of node blocks needed to point to the next level down in the tree
is computed as::

        n_blocks = (n == 0 ? leaf_blocks : node_blocks[n])
        node_blocks[n + 1] = ceil(n_blocks / node_load_factor)

The entire computation is performed recursively until the current level only
needs one block.
The resulting geometry is as follows:

- For AG-rooted btrees, this level is the root level, so the height of the
  new tree is ``level + 1`` and the space needed is the summation of the
  number of blocks on each level.

- For inode-rooted btrees where the records in the top level do not fit in
  the inode fork area, the height is ``level + 2``, the space needed is the
  summation of the number of blocks on each level, and the inode fork points
  to the root block.

- For inode-rooted btrees where the records in the top level can be stored in
  the inode fork area, then the root block can be stored in the inode, the
  height is ``level + 1``, and the space needed is one less than the
  summation of the number of blocks on each level.
  This only becomes relevant when non-bmap btrees gain the ability to root in
  an inode, which is a future patchset and only included here for
  completeness.

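A worked example makes the geometry computation concrete.
The numbers below are purely illustrative and do not correspond to any
particular XFS btree type::

    block_size = 4096 bytes, header_size = 64 bytes
    record_size = 16 bytes, key_size + ptr_size = 12 bytes
    record_count = 1,000,000

    leaf:  maxrecs             = (4096 - 64) / 16 = 252
           minrecs             = 252 / 2 = 126
           default_load_factor = (252 + 126) / 2 = 189
           leaf_blocks         = ceil(1,000,000 / 189) = 5292

    node:  maxrecs             = (4096 - 64) / 12 = 336
           node_load_factor    = (336 + 168) / 2 = 252
           node_blocks[1]      = ceil(5292 / 252) = 21
           node_blocks[2]      = ceil(21 / 252)   = 1    <- root level

For an AG-rooted btree, this gives a height of 3 and a space requirement of
5292 + 21 + 1 = 5314 blocks.
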
.. _newbt:

Reserving New B+Tree Blocks
```````````````````````````

Once repair knows the number of blocks needed for the new btree, it allocates
those blocks using the free space information.
Each reserved extent is tracked separately by the btree builder state data.
To improve crash resilience, the reservation code also logs an Extent Freeing
Intent (EFI) item in the same transaction as each space allocation and
attaches its in-memory ``struct xfs_extent_free_item`` object to the space
reservation.
If the system goes down, log recovery will use the unfinished EFIs to free
the unused space, leaving the filesystem unchanged.

Each time the btree builder claims a block for the btree from a reserved
extent, it updates the in-memory reservation to reflect the claimed space.
Block reservation tries to allocate as much contiguous space as possible to
reduce the number of EFIs in play.

While repair is writing these new btree blocks, the EFIs created for the
space reservations pin the tail of the ondisk log.
It's possible that other parts of the system will remain busy and push the
head of the log towards the pinned tail.
To avoid livelocking the filesystem, the EFIs must not pin the tail of the
log for too long.
To alleviate this problem, the dynamic relogging capability of the deferred
ops mechanism is reused here to commit a transaction at the log head
containing an EFD for the old EFI and a new EFI at the head.
This enables the log to release the old EFI and keep the log moving forwards.

EFIs have a role to play during the commit and reaping phases; please see the
next section and the section about :ref:`reaping<reaping>` for more details.

Proposed patchsets are the
`bitmap rework
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-bitmap-rework>`_
and the
`preparation for bulk loading btrees
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-prep-for-bulk-loading>`_.


Writing the New Tree
````````````````````

This part is pretty simple -- the btree builder (``xfs_btree_bulkload``)
claims a block from the reserved list, writes the new btree block header,
fills the rest of the block with records, and adds the new leaf block to a
list of written blocks::

    ┌────┐
    │leaf│
    │RRR │
    └────┘

Sibling pointers are set every time a new block is added to the level::

    ┌────┐ ┌────┐ ┌────┐ ┌────┐
    │leaf│→│leaf│→│leaf│→│leaf│
    │RRR │←│RRR │←│RRR │←│RRR │
    └────┘ └────┘ └────┘ └────┘

When it finishes writing the record leaf blocks, it moves on to the node
blocks.
To fill a node block, it walks each block in the next level down in the tree
to compute the relevant keys and write them into the parent node::

        ┌────┐       ┌────┐
        │node│──────→│node│
        │PP  │←──────│PP  │
        └────┘       └────┘
        ↙   ↘       ↙   ↘
    ┌────┐ ┌────┐ ┌────┐ ┌────┐
    │leaf│→│leaf│→│leaf│→│leaf│
    │RRR │←│RRR │←│RRR │←│RRR │
    └────┘ └────┘ └────┘ └────┘

When it reaches the root level, it is ready to commit the new btree!::

            ┌─────────┐
            │  root   │
            │   PP    │
            └─────────┘
           ↙           ↘
        ┌────┐       ┌────┐
        │node│──────→│node│
        │PP  │←──────│PP  │
        └────┘       └────┘
        ↙   ↘       ↙   ↘
    ┌────┐ ┌────┐ ┌────┐ ┌────┐
    │leaf│→│leaf│→│leaf│→│leaf│
    │RRR │←│RRR │←│RRR │←│RRR │
    └────┘ └────┘ └────┘ └────┘

The first step to commit the new btree is to persist the btree blocks to disk
synchronously.
This is a little complicated because a new btree block could have been freed
in the recent past, so the builder must use ``xfs_buf_delwri_queue_here`` to
remove the (stale) buffer from the AIL before it can write the new blocks to
disk.
Blocks are queued for IO using a delwri list and written in one large batch
with ``xfs_buf_delwri_submit``.

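The write-out step might look like the following sketch;
``xfs_buf_delwri_queue_here`` comes from the patchset, so its exact prototype
is an assumption here, and error handling is elided::

    /* Persist all of the new btree buffers in one synchronous batch. */
    LIST_HEAD(buffer_list);

    /* For each buffer bp written by the bulk loader: */
    xfs_buf_delwri_queue_here(bp, &buffer_list);

    /* One large synchronous write of every new block. */
    error = xfs_buf_delwri_submit(&buffer_list);
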
Once the new blocks have been persisted to disk, control returns to the
individual repair function that called the bulk loader.
The repair function must log the location of the new root in a transaction,
clean up the space reservations that were made for the new btree, and reap
the old metadata blocks:

1. Commit the location of the new btree root.

2. For each incore reservation:

   a. Log Extent Freeing Done (EFD) items for all the space that was consumed
      by the btree builder.
      The new EFDs must point to the EFIs attached to the reservation to
      prevent log recovery from freeing the new blocks.

   b. For unclaimed portions of incore reservations, create a regular
      deferred extent free work item to free the unused space later in the
      transaction chain.

   c. The EFDs and EFIs logged in steps 2a and 2b must not overrun the
      reservation of the committing transaction.
      If the btree loading code suspects this might be about to happen, it
      must call ``xrep_defer_finish`` to clear out the deferred work and
      obtain a fresh transaction.

3. Clear out the deferred work a second time to finish the commit and clean
   the repair transaction.

The transaction rolling in steps 2c and 3 represents a weakness in the repair
algorithm, because a log flush and a crash before the end of the reap step
can result in space leaking.
Online repair functions minimize the chances of this occurring by using very
large transactions, which each can accommodate many thousands of block
freeing instructions.
Repair moves on to reaping the old blocks, which will be presented in a
subsequent :ref:`section<reaping>` after a few case studies of bulk loading.

Case Study: Rebuilding the Inode Index
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The high level process to rebuild the inode index btree is:

1. Walk the reverse mapping records to generate ``struct xfs_inobt_rec``
   records from the inode chunk information and a bitmap of the old inode
   btree blocks.

2. Append the records to an xfarray in inode order.

3. Use the ``xfs_btree_bload_compute_geometry`` function to compute the
   number of blocks needed for the inode btree.
   If the free space inode btree is enabled, call it again to estimate the
   geometry of the finobt.

4. Allocate the number of blocks computed in the previous step.

5. Use ``xfs_btree_bload`` to write the xfarray records to btree blocks and
   generate the internal node blocks.
   If the free space inode btree is enabled, call it again to load the
   finobt.

6. Commit the location of the new btree root block(s) to the AGI.

7. Reap the old btree blocks using the bitmap created in step 1.

Details are as follows.

The inode btree maps inumbers to the ondisk location of the associated
inode records, which means that the inode btrees can be rebuilt from the
reverse mapping information.
Reverse mapping records with an owner of ``XFS_RMAP_OWN_INOBT`` mark the
location of the old inode btree blocks.
Each reverse mapping record with an owner of ``XFS_RMAP_OWN_INODES`` marks
the location of at least one inode cluster buffer.
A cluster is the smallest number of ondisk inodes that can be allocated or
freed in a single transaction; it is never smaller than 1 fs block or 4
inodes.

For the space represented by each inode cluster, ensure that there are no
records in the free space btrees nor any records in the reference count
btree.
If there are, the space metadata inconsistencies are reason enough to abort
the operation.
Otherwise, read each cluster buffer to check that its contents appear to be
ondisk inodes and to decide if the file is allocated
(``xfs_dinode.i_mode != 0``) or free (``xfs_dinode.i_mode == 0``).

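The in-use decision is a simple inspection of each ondisk inode's mode field.
A sketch of the check for the ``i``-th inode in a cluster buffer, with the
iteration and bookkeeping left out::

    struct xfs_dinode *dip;

    dip = xfs_buf_offset(cluster_bp, i * mp->m_sb.sb_inodesize);
    if (be16_to_cpu(dip->di_mode) != 0)
        inuse_mask |= 1ULL << i;        /* allocated */
    else
        free_mask |= 1ULL << i;         /* free */
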
Accumulate the results of successive inode cluster buffer reads until there
is enough information to fill a single inode chunk record, which is 64
consecutive numbers in the inumber keyspace.
If the chunk is sparse, the chunk record may include holes.

Once the repair function accumulates one chunk's worth of data, it calls
``xfarray_append`` to add the inode btree record to the xfarray.
This xfarray is walked twice during the btree creation step -- once to
populate the inode btree with all inode chunk records, and a second time to
populate the free inode btree with records for chunks that have free
non-sparse inodes.
The number of records for the inode btree is the number of xfarray records,
but the record count for the free inode btree has to be computed as inode
chunk records are stored in the xfarray.

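For illustration, turning one accumulated chunk into an xfarray record might
look like this; the ``ichunk_*`` variables and the array handle are
hypothetical stand-ins::

    /* One fully accumulated inode chunk becomes one xfarray record. */
    struct xfs_inobt_rec_incore irec = {
        .ir_startino    = ichunk_startino,
        .ir_count       = XFS_INODES_PER_CHUNK,  /* 64 inumbers */
        .ir_freecount   = ichunk_nr_free,
        .ir_free        = ichunk_free_mask,
    };

    error = xfarray_append(inode_records, &irec);
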
The proposed patchset is the
`AG btree repair
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_
series.

Case Study: Rebuilding the Space Reference Counts
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Reverse mapping records are used to rebuild the reference count information.
Reference counts are required for correct operation of copy on write for
shared file data.
Imagine the reverse mapping entries as rectangles representing extents of
physical blocks, and that the rectangles can be laid down to allow them to
overlap each other.
From the diagram below, it is apparent that a reference count record must
start or end wherever the height of the stack changes.
In other words, the record emission stimulus is level-triggered::

                █    ███
              ██ █████ ████            ███        ██████
            ██ ████ ███████████ ████      █████████
        ████████████████████████████████    ███████████
        ^ ^  ^^ ^^    ^ ^^ ^^^  ^^^^  ^ ^^ ^  ^  ^ ^
        2 1  23 21    3 43 234  2123  1 01 2  3  0

The ondisk reference count btree does not store the refcount == 0 cases
because the free space btree already records which blocks are free.
Extents being used to stage copy-on-write operations should be the only
records with refcount == 1.
Single-owner file blocks aren't recorded in either the free space or the
reference count btrees.

The high level process to rebuild the reference count btree is:

1. Walk the reverse mapping records to generate ``struct xfs_refcount_irec``
   records for any space having more than one reverse mapping and add them to
   the xfarray.
   Any records owned by ``XFS_RMAP_OWN_COW`` are also added to the xfarray
   because these are extents allocated to stage a copy on write operation and
   are tracked in the refcount btree.

   Use any records owned by ``XFS_RMAP_OWN_REFC`` to create a bitmap of old
   refcount btree blocks.

2. Sort the records in physical extent order, putting the CoW staging extents
   at the end of the xfarray.
   This matches the sorting order of records in the refcount btree.

3. Use the ``xfs_btree_bload_compute_geometry`` function to compute the
   number of blocks needed for the new tree.

4. Allocate the number of blocks computed in the previous step.

5. Use ``xfs_btree_bload`` to write the xfarray records to btree blocks and
   generate the internal node blocks.

6. Commit the location of the new btree root block to the AGF.

7. Reap the old btree blocks using the bitmap created in step 1.

Details are as follows; the same algorithm is used by ``xfs_repair`` to
generate refcount information from reverse mapping records.

- Until the reverse mapping btree runs out of records:

  - Retrieve the next record from the btree and put it in a bag.

  - Collect all records with the same starting block from the btree and put
    them in the bag.

  - While the bag isn't empty:

    - Among the mappings in the bag, compute the lowest block number where
      the reference count changes.
      This position will be either the starting block number of the next
      unprocessed reverse mapping or the next block after the shortest
      mapping in the bag.

    - Remove all mappings from the bag that end at this position.

    - Collect all reverse mappings that start at this position from the btree
      and put them in the bag.

    - If the size of the bag changed and is greater than one, create a new
      refcount record associating the block number range that was just
      walked to the size of the bag.

The bag-like structure in this case is a type 2 xfarray as discussed in the
:ref:`xfarray access patterns<xfarray_access_patterns>` section.
Reverse mappings are added to the bag using ``xfarray_store_anywhere`` and
removed via ``xfarray_unset``.
Bag members are examined through ``xfarray_iter`` loops.

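In rough C-like pseudocode, with ``bag_*`` helpers standing in for the
xfarray-based bag (the names are illustrative, not the patchset's)::

    /* Level-triggered generation of refcount records from sorted rmaps. */
    while (rmaps_left(ri)) {
        rmap = next_rmap(ri);
        bag_add(&bag, rmap);
        bag_add_all_starting_at(&bag, ri, rmap.rm_startblock);
        cur_bno = rmap.rm_startblock;

        while (!bag_is_empty(&bag)) {
            /* Lowest block where the stack height can change. */
            next_bno = min(peek_next_rmap_start(ri),
                           bag_lowest_mapping_end(&bag));

            old_size = bag_size(&bag);
            bag_remove_mappings_ending_at(&bag, next_bno);
            bag_add_all_starting_at(&bag, ri, next_bno);

            /* Emit a record for the range that was just walked. */
            if (bag_size(&bag) != old_size && old_size > 1)
                emit_refcount(cur_bno, next_bno - cur_bno, old_size);
            cur_bno = next_bno;
        }
    }
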
The proposed patchset is the
`AG btree repair
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_
series.

Case Study: Rebuilding File Fork Mapping Indices
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The high level process to rebuild a data/attr fork mapping btree is:

1. Walk the reverse mapping records to generate ``struct xfs_bmbt_rec``
   records from the reverse mapping records for that inode and fork.
   Append these records to an xfarray.
   Compute the bitmap of the old bmap btree blocks from the ``BMBT_BLOCK``
   records.

2. Use the ``xfs_btree_bload_compute_geometry`` function to compute the
   number of blocks needed for the new tree.

3. Sort the records in file offset order.

4. If the extent records would fit in the inode fork immediate area, commit
   the records to that immediate area and skip to step 8.

5. Allocate the number of blocks computed in step 2.

6. Use ``xfs_btree_bload`` to write the xfarray records to btree blocks and
   generate the internal node blocks.

7. Commit the new btree root block to the inode fork immediate area.

8. Reap the old btree blocks using the bitmap created in step 1.

There are some complications here:
First, it's possible to move the fork offset to adjust the sizes of the
immediate areas if the data and attr forks are not both in BMBT format.
Second, if there are sufficiently few fork mappings, it may be possible to
use EXTENTS format instead of BMBT, which may require a conversion.
Third, the incore extent map must be reloaded carefully to avoid disturbing
any delayed allocation extents.

The proposed patchset is the
`file mapping repair
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-file-mappings>`_
series.

.. _reaping:

Reaping Old Metadata Blocks
---------------------------

Whenever online fsck builds a new data structure to replace one that is
suspect, there is a question of how to find and dispose of the blocks that
belonged to the old structure.
The laziest method of course is not to deal with them at all, but this slowly
leads to service degradations as space leaks out of the filesystem.
Hopefully, someone will schedule a rebuild of the free space information to
plug all those leaks.
Offline repair rebuilds all space metadata after recording the usage of
the files and directories that it decides not to clear, hence it can build
new structures in the discovered free space and avoid the question of
reaping.

As part of a repair, online fsck relies heavily on the reverse mapping
records to find space that is owned by the corresponding rmap owner yet
truly free.
Cross referencing rmap records with other rmap records is necessary because
there may be other data structures that also think they own some of those
blocks (e.g. crosslinked trees).
Permitting the block allocator to hand them out again will not push the
system towards consistency.

For space metadata, the process of finding extents to dispose of generally
follows this format:

1. Create a bitmap of space used by data structures that must be preserved.
   The space reservations used to create the new metadata can be used here
   if the same rmap owner code is used to denote all of the objects being
   rebuilt.

2. Survey the reverse mapping data to create a bitmap of space owned by the
   same ``XFS_RMAP_OWN_*`` number for the metadata that is being preserved.

3. Use the bitmap disunion operator to subtract (1) from (2).
   The remaining set bits represent candidate extents that could be freed.
   The process moves on to step 4 below.

Repairs for file-based metadata such as extended attributes, directories,
symbolic links, quota files and realtime bitmaps are performed by building a
new structure attached to a temporary file and swapping the forks.
Afterward, the mappings in the old file fork are the candidate blocks for
disposal.

The process for disposing of old extents is as follows:

4. For each candidate extent, count the number of reverse mapping records for
   the first block in that extent that do not have the same rmap owner for
   the data structure being repaired.

   - If zero, the block has a single owner and can be freed.

   - If not, the block is part of a crosslinked structure and must not be
     freed.

5. Starting with the next block in the extent, figure out how many more
   blocks have the same zero/nonzero other owner status as that first block.

6. If the region is crosslinked, delete the reverse mapping entry for the
   structure being repaired and move on to the next region.

7. If the region is to be freed, mark any corresponding buffers in the buffer
   cache as stale to prevent log writeback.

8. Free the region and move on.

However, there is one complication to this procedure.
Transactions are of finite size, so the reaping process must be careful to
roll the transactions to avoid overruns.
Overruns come from two sources:

a. EFIs logged on behalf of space that is no longer occupied

b. Log items for buffer invalidations

This is also a window in which a crash during the reaping process can leak
blocks.
As stated earlier, online repair functions use very large transactions to
minimize the chances of this occurring.

The proposed patchset is the
`preparation for bulk loading btrees
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-prep-for-bulk-loading>`_
series.

Case Study: Reaping After a Regular Btree Repair
````````````````````````````````````````````````

Old reference count and inode btrees are the easiest to reap because they
have rmap records with special owner codes: ``XFS_RMAP_OWN_REFC`` for the
refcount btree, and ``XFS_RMAP_OWN_INOBT`` for the inode and free inode
btrees.
Creating a list of extents to reap the old btree blocks is quite simple,
conceptually:

1. Lock the relevant AGI/AGF header buffers to prevent allocation and frees.

2. For each reverse mapping record with an rmap owner corresponding to the
   metadata structure being rebuilt, set the corresponding range in a bitmap.

3. Walk the current data structures that have the same rmap owner.
   For each block visited, clear that range in the above bitmap.

4. Each set bit in the bitmap represents a block that could be a block from
   the old data structures and hence is a candidate for reaping.
   In other words, ``(rmap_records_owned_by & ~blocks_reachable_by_walk)``
   are the blocks that might be freeable.

If it is possible to maintain the AGF lock throughout the repair (which is
the common case), then step 2 can be performed at the same time as the
reverse mapping record walk that creates the records for the new btree.

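Expressed with bitmaps, the computation might look like the sketch below.
The ``xagb_bitmap_*`` helper names are assumptions modeled on the scrub
bitmap code, not the patchset's exact API::

    struct xagb_bitmap old_btree_blocks;

    /* Step 2: set every block that the rmap data assigns to this owner. */
    error = xagb_bitmap_set_rmap_owned(&old_btree_blocks, sc,
                                       XFS_RMAP_OWN_REFC);

    /* Step 3: clear every block reachable by walking the rebuilt btree. */
    error = xagb_bitmap_clear_btblocks(&old_btree_blocks, new_refc_cur);

    /*
     * Step 4: the bits that remain set are the reap candidates, i.e.
     * rmap_records_owned_by & ~blocks_reachable_by_walk.
     */
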
Case Study: Rebuilding the Free Space Indices
`````````````````````````````````````````````

The high level process to rebuild the free space indices is:

1. Walk the reverse mapping records to generate
   ``struct xfs_alloc_rec_incore`` records from the gaps in the reverse
   mapping btree.

2. Append the records to an xfarray.

3. Use the ``xfs_btree_bload_compute_geometry`` function to compute the
   number of blocks needed for each new tree.

4. Allocate the number of blocks computed in the previous step from the free
   space information collected.

5. Use ``xfs_btree_bload`` to write the xfarray records to btree blocks and
   generate the internal node blocks for the free space by length index.
   Call it again for the free space by block number index.

6. Commit the locations of the new btree root blocks to the AGF.

7. Reap the old btree blocks by looking for space that is not recorded by
   the reverse mapping btree, the new free space btrees, or the AGFL.

Repairing the free space btrees has three key complications over a regular
btree repair:

First, free space is not explicitly tracked in the reverse mapping records.
Hence, the new free space records must be inferred from gaps in the physical
space component of the keyspace of the reverse mapping btree.

Second, free space repairs cannot use the common btree reservation code
because new blocks are reserved out of the free space btrees.
This is impossible when repairing the free space btrees themselves.
However, repair holds the AGF buffer lock for the duration of the free space
index reconstruction, so it can use the collected free space information to
supply the blocks for the new free space btrees.
It is not necessary to back each reserved extent with an EFI because the new
free space btrees are constructed in what the ondisk filesystem thinks is
unowned space.
However, if reserving blocks for the new btrees from the collected free
space information changes the number of free space records, repair must
re-estimate the new free space btree geometry with the new record count
until the reservation is sufficient.
As part of committing the new btrees, repair must ensure that reverse
mappings are created for the reserved blocks and that unused reserved blocks
are inserted into the free space btrees.
Deferred rmap and freeing operations are used to ensure that this transition
is atomic, similar to the other btree repair functions.

Third, finding the blocks to reap after the repair is not overly
straightforward.
Blocks for the free space btrees and the reverse mapping btrees are supplied
by the AGFL.
Blocks put onto the AGFL have reverse mapping records with the owner
``XFS_RMAP_OWN_AG``.
This ownership is retained when blocks move from the AGFL into the free
space btrees or the reverse mapping btrees.
When repair walks reverse mapping records to synthesize free space records,
it creates a bitmap (``ag_owner_bitmap``) of all the space claimed by
``XFS_RMAP_OWN_AG`` records.
The repair context maintains a second bitmap corresponding to the rmap btree
blocks and the AGFL blocks (``rmap_agfl_bitmap``).
When the walk is complete, the bitmap disunion operation ``(ag_owner_bitmap
& ~rmap_agfl_bitmap)`` computes the extents that are used by the old free
space btrees.
These blocks can then be reaped using the methods outlined above.

The proposed patchset is the
`AG btree repair
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_
series.

.. _rmap_reap:

Case Study: Reaping After Repairing Reverse Mapping Btrees
``````````````````````````````````````````````````````````

Old reverse mapping btrees are less difficult to reap after a repair.
As mentioned in the previous section, blocks on the AGFL, the two free space
btree blocks, and the reverse mapping btree blocks all have reverse mapping
records with ``XFS_RMAP_OWN_AG`` as the owner.
The full process of gathering reverse mapping records and building a new
btree is described in the case study of
:ref:`live rebuilds of rmap data <rmap_repair>`, but a crucial point from
that discussion is that the new rmap btree will not contain any records for
the old rmap btree, nor will the old btree blocks be tracked in the free
space btrees.
The list of candidate reaping blocks is computed by setting the bits
corresponding to the gaps in the new rmap btree records, and then clearing
the bits corresponding to extents in the free space btrees and the current
AGFL blocks.
The resulting extents, ``(new_rmapbt_gaps & ~(agfl | bnobt_records))``, are
reaped using the methods outlined above.

The rest of the process of rebuilding the reverse mapping btree is discussed
in a separate :ref:`case study<rmap_repair>`.

The proposed patchset is the
`AG btree repair
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_
series.

Case Study: Rebuilding the AGFL
```````````````````````````````

The allocation group free block list (AGFL) is repaired as follows:

1. Create a bitmap for all the space that the reverse mapping data claims is
   owned by ``XFS_RMAP_OWN_AG``.

2. Subtract the space used by the two free space btrees and the rmap btree.

3. Subtract any space that the reverse mapping data claims is owned by any
   other owner, to avoid re-adding crosslinked blocks to the AGFL.

4. Once the AGFL is full, reap any leftover blocks.

5. The next operation to fix the freelist will right-size the list.

See `fs/xfs/scrub/agheader_repair.c
<https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/xfs/scrub/agheader_repair.c>`_
for more details.

Inode Record Repairs
--------------------

Inode records must be handled carefully, because they have both ondisk
records ("dinodes") and an in-memory ("cached") representation.
There is a very high potential for cache coherency issues if online fsck is
not careful to access the ondisk metadata *only* when the ondisk metadata is
so badly damaged that the filesystem cannot load the in-memory
representation.
When online fsck wants to open a damaged file for scrubbing, it must use
specialized resource acquisition functions that return either the in-memory
representation *or* a lock on whichever object is necessary to prevent any
update to the ondisk location.

The only repairs that should be made to the ondisk inode buffers are
whatever is necessary to get the in-core structure loaded.
This means fixing whatever is caught by the inode cluster buffer and inode
fork verifiers, and retrying the ``iget`` operation.
If the second ``iget`` fails, the repair has failed.

Once the in-memory representation is loaded, repair can lock the inode and
can subject it to comprehensive checks, repairs, and optimizations.
Most inode attributes are easy to check and constrain, or are
user-controlled arbitrary bit patterns; these are both easy to fix.
Dealing with the data and attr fork extent counts and the file block counts
is more complicated, because computing the correct value requires traversing
the forks, or, if that fails, leaving the fields invalid and waiting for the
fork fsck functions to run.

The proposed patchset is the
`inode
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-inodes>`_
repair series.

Quota Record Repairs
--------------------

Similar to inodes, quota records ("dquots") also have both ondisk records
and an in-memory representation, and hence are subject to the same cache
coherency issues.
Somewhat confusingly, both are known as dquots in the XFS codebase.

The only repairs that should be made to the ondisk quota record buffers are
whatever is necessary to get the in-core structure loaded.
Once the in-memory representation is loaded, the only attributes needing
checking are obviously bad limits and timer values.

Quota usage counters are checked, repaired, and discussed separately in the
section about :ref:`live quotacheck <quotacheck>`.

The proposed patchset is the
`quota
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quota>`_
repair series.

.. _fscounters:

Freezing to Fix Summary Counters
--------------------------------

Filesystem summary counters track availability of filesystem resources such
as free blocks, free inodes, and allocated inodes.
This information could be compiled by walking the free space and inode
indexes, but this is a slow process, so XFS maintains a copy in the ondisk
superblock that should reflect the ondisk metadata, at least when the
filesystem has been unmounted cleanly.
For performance reasons, XFS also maintains incore copies of those counters,
which are key to enabling resource reservations for active transactions.
Writer threads reserve the worst-case quantities of resources from the
incore counter and give back whatever they don't use at commit time.
It is therefore only necessary to serialize on the superblock when the
superblock is being committed to disk.

The lazy superblock counter feature introduced in XFS v5 took this even
further by training log recovery to recompute the summary counters from the
AG headers, which eliminated the need for most transactions even to touch
the superblock.
The only time XFS commits the summary counters is at filesystem unmount.
To reduce contention even further, the incore counter is implemented as a
percpu counter, which means that each CPU is allocated a batch of blocks
from a global incore counter and can satisfy small allocations from the
local batch.

The high-performance nature of the summary counters makes it difficult for
online fsck to check them, since there is no way to quiesce a percpu counter
while the system is running.
Although online fsck can read the filesystem metadata to compute the correct
values of the summary counters, there's no way to hold the value of a percpu
counter stable, so it's quite possible that the counter will be out of date
by the time the walk is complete.
Earlier versions of online scrub would return to userspace with an
incomplete scan flag, but this is not a satisfying outcome for a system
administrator.
For repairs, the in-memory counters must be stabilized while walking the
filesystem metadata to get an accurate reading and install it in the percpu
counter.

To satisfy this requirement, online fsck must prevent other programs in the
system from initiating new writes to the filesystem, it must disable
background garbage collection threads, and it must wait for existing writer
programs to exit the kernel.
Once that has been established, scrub can walk the AG free space indexes,
the inode btrees, and the realtime bitmap to compute the correct value of
all four summary counters.
This is very similar to a filesystem freeze, though not all of the pieces
are necessary:

- The final freeze state is set one higher than ``SB_FREEZE_COMPLETE`` to
  prevent other threads from thawing the filesystem, or other scrub threads
  from initiating another fscounters freeze.

- It does not quiesce the log.

With this code in place, it is now possible to pause the filesystem for just
long enough to check and correct the summary counters.

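As an aside, the reason the incore counters cannot simply be sampled while
the filesystem runs is that the percpu counter API only ever yields a
snapshot.
This fragment is an illustration, not the fsck code itself::

    #include <linux/percpu_counter.h>

    /*
     * Sum all of the per-CPU batches to get the current free block count.
     * By the time the sum returns, other CPUs may already have reserved
     * or returned blocks, so the value is stale unless all writer threads
     * have been quiesced first.
     */
    s64 fdblocks = percpu_counter_sum(&mp->m_fdblocks);
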
+--------------------------------------------------------------------------+
| **Historical Sidebar**:                                                  |
+--------------------------------------------------------------------------+
| The initial implementation used the actual VFS filesystem freeze         |
| mechanism to quiesce filesystem activity.                                |
| With the filesystem frozen, it is possible to resolve the counter values |
| with exact precision, but there are many problems with calling the VFS   |
| methods directly:                                                        |
|                                                                          |
| - Other programs can unfreeze the filesystem without our knowledge.      |
|   This leads to incorrect scan results and incorrect repairs.            |
|                                                                          |
| - Adding an extra lock to prevent others from thawing the filesystem     |
|   required the addition of a ``->freeze_super`` function to wrap         |
|   ``freeze_fs()``.                                                       |
|   This in turn caused other subtle problems because it turns out that    |
|   the VFS ``freeze_super`` and ``thaw_super`` functions can drop the     |
|   last reference to the VFS superblock, and any subsequent access        |
|   becomes a UAF bug!                                                     |
|   This can happen if the filesystem is unmounted while the underlying    |
|   block device has frozen the filesystem.                                |
|   This problem could be solved by grabbing extra references to the       |
|   superblock, but it felt suboptimal given the other inadequacies of     |
|   this approach.                                                         |
|                                                                          |
| - The log need not be quiesced to check the summary counters, but a VFS  |
|   freeze initiates one anyway.                                           |
|   This adds unnecessary runtime to live fscounter fsck operations.       |
|                                                                          |
| - Quiescing the log means that XFS flushes the (possibly incorrect)      |
|   counters to disk as part of cleaning the log.                          |
|                                                                          |
| - A bug in the VFS meant that freeze could complete even when            |
|   sync_filesystem fails to flush the filesystem and returns an error.    |
|   This bug was fixed in Linux 5.17.                                      |
+--------------------------------------------------------------------------+

The proposed patchset is the
`summary counter cleanup
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-fscounters>`_
series.

Full Filesystem Scans
---------------------

Certain types of metadata can only be checked by walking every file in the
entire filesystem to record observations and comparing the observations
against what's recorded on disk.
Like every other type of online repair, repairs are made by writing those
observations to disk in a replacement structure and committing it atomically.
However, it is not practical to shut down the entire filesystem to examine
hundreds of billions of files because the downtime would be excessive.
Therefore, online fsck must build the infrastructure to manage a live scan
of all the files in the filesystem.
There are two questions that need to be solved to perform a live walk:

- How does scrub manage the scan while it is collecting data?

- How does the scan keep abreast of changes being made to the system by
  other threads?

.. _iscan:

Coordinated Inode Scans
```````````````````````

In the original Unix filesystems of the 1970s, each directory entry
contained an index number (*inumber*) which was used as an index into an
ondisk array (*itable*) of fixed-size records (*inodes*) describing a file's
attributes and its data block mapping.
This system is described by J. Lions, `"inode (5659)"
<http://www.lemis.com/grog/Documentation/Lions/>`_ in *Lions' Commentary on
UNIX, 6th Edition*, (Dept. of Computer Science, the University of New South
Wales, November 1977), pp. 18-2; and later by D. Ritchie and K. Thompson,
`"Implementation of the File System"
<https://archive.org/details/bstj57-6-1905/page/n8/mode/1up>`_, from *The
UNIX Time-Sharing System*, (The Bell System Technical Journal, July 1978),
pp. 1913-4.

XFS retains most of this design, except now inumbers are search keys over
all the space in the data section of the filesystem.
They form a continuous keyspace that can be expressed as a 64-bit integer,
though the inodes themselves are sparsely distributed within the keyspace.
Scans proceed in a linear fashion across the inumber keyspace, starting from
``0x0`` and ending at ``0xFFFFFFFFFFFFFFFF``.
Naturally, a scan through a keyspace requires a scan cursor object to track
the scan progress.
Because this keyspace is sparse, this cursor contains two parts.
The first part of this scan cursor object tracks the inode that will be
examined next; call this the examination cursor.
Somewhat less obviously, the scan cursor object must also track which parts
of the keyspace have already been visited, which is critical for deciding if
a concurrent filesystem update needs to be incorporated into the scan data.
Call this the visited inode cursor.

Advancing the scan cursor is a multi-step process encapsulated in
``xchk_iscan_iter``:

1. Lock the AGI buffer of the AG containing the inode pointed to by the
   visited inode cursor.
   This guarantees that inodes in this AG cannot be allocated or freed while
   advancing the cursor.

2. Use the per-AG inode btree to look up the next inumber after the one that
   was just visited, since it may not be keyspace adjacent.

3. If there are no more inodes left in this AG:

   a. Move the examination cursor to the point of the inumber keyspace that
      corresponds to the start of the next AG.

   b. Adjust the visited inode cursor to indicate that it has "visited" the
      last possible inode in the current AG's inode keyspace.
      XFS inumbers are segmented, so the cursor needs to be marked as having
      visited the entire keyspace up to just before the start of the next
      AG's inode keyspace.

   c. Unlock the AGI and return to step 1 if there are unexamined AGs in the
      filesystem.

   d. If there are no more AGs to examine, set both cursors to the end of
      the inumber keyspace.
      The scan is now complete.

4. Otherwise, there is at least one more inode to scan in this AG:

   a. Move the examination cursor ahead to the next inode marked as
      allocated by the inode btree.

   b. Adjust the visited inode cursor to point to the inode just prior to
      where the examination cursor is now.
      Because the scanner holds the AGI buffer lock, no inodes could have
      been created in the part of the inode keyspace that the visited inode
      cursor just advanced.

5. Get the incore inode for the inumber of the examination cursor.
   By maintaining the AGI buffer lock until this point, the scanner knows
   that it was safe to advance the examination cursor across the entire
   keyspace, and that it has stabilized this next inode so that it cannot
   disappear from the filesystem until the scan releases the incore inode.

6. Drop the AGI lock and return the incore inode to the caller.

Online fsck functions scan all files in the filesystem as follows:

1. Start a scan by calling ``xchk_iscan_start``.

2. Advance the scan cursor (``xchk_iscan_iter``) to get the next inode.
   If one is provided:

   a. Lock the inode to prevent updates during the scan.

   b. Scan the inode.

   c. While still holding the inode lock, adjust the visited inode cursor
      (``xchk_iscan_mark_visited``) to point to this inode.

   d. Unlock and release the inode.

3. Call ``xchk_iscan_teardown`` to complete the scan.

There are subtleties with the inode cache that complicate grabbing the
incore inode for the caller.
Obviously, it is an absolute requirement that the inode metadata be
consistent enough to load it into the inode cache.
Second, if the incore inode is stuck in some intermediate state, the scan
coordinator must release the AGI and push the main filesystem to get the
inode back into a loadable state.

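Put together, a scan loop might look like the sketch below; the exact
``xchk_iscan_*`` prototypes are assumptions, and error handling is
abbreviated::

    struct xchk_iscan   iscan;
    struct xfs_inode    *ip;
    int                 error;

    xchk_iscan_start(sc, &iscan);

    while ((error = xchk_iscan_iter(&iscan, &ip)) == 1) {
        xfs_ilock(ip, XFS_ILOCK_EXCL);          /* step 2a */
        error = collect_observations(sc, ip);   /* step 2b, hypothetical */
        xchk_iscan_mark_visited(&iscan, ip);    /* step 2c, ILOCK held */
        xfs_iunlock(ip, XFS_ILOCK_EXCL);        /* step 2d */
        xfs_irele(ip);
        if (error)
            break;
    }

    xchk_iscan_teardown(&iscan);                /* step 3 */
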
The proposed patches are the
`inode scanner
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-iscan>`_
series.
The first user of the new functionality is the
`online quotacheck
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quotacheck>`_
series.

Inode Management
````````````````

In regular filesystem code, references to allocated XFS incore inodes are
always obtained (``xfs_iget``) outside of transaction context because the
creation of the incore context for an existing file does not require
metadata updates.
However, it is important to note that references to incore inodes obtained
as part of file creation must be performed in transaction context because
the filesystem must ensure the atomicity of the ondisk inode btree index
updates and the initialization of the actual ondisk inode.

References to incore inodes are always released (``xfs_irele``) outside of
transaction context because there are a handful of activities that might
require ondisk updates:

- The VFS may decide to kick off writeback as part of a ``DONTCACHE`` inode
  release.

- Speculative preallocations need to be unreserved.

- An unlinked file may have lost its last reference, in which case the
  entire file must be inactivated, which involves releasing all of its
  resources in the ondisk metadata and freeing the inode.

These activities are collectively called inode inactivation.
Inactivation has two parts -- the VFS part, which initiates writeback on all
dirty file pages, and the XFS part, which cleans up XFS-specific information
and frees the inode if it was unlinked.
If the inode is unlinked (or unconnected after a file handle operation), the
kernel drops the inode into the inactivation machinery immediately.

During normal operation, resource acquisition for an update follows this
order to avoid deadlocks:

1. Inode reference (``iget``).

2. Filesystem freeze protection, if repairing (``mnt_want_write_file``).

3. Inode ``IOLOCK`` (VFS ``i_rwsem``) lock to control file IO.

4. Inode ``MMAPLOCK`` (page cache ``invalidate_lock``) lock for operations
   that can update page cache mappings.

5. Log feature enablement.

6. Transaction log space grant.

7. Space on the data and realtime devices for the transaction.

8. Incore dquot references, if a file is being repaired.
   Note that they are not locked, merely acquired.

9. Inode ``ILOCK`` for file metadata updates.

10. AG header buffer locks / Realtime metadata inode ILOCK.

11. Realtime metadata buffer locks, if applicable.

12. Extent mapping btree blocks, if applicable.

Resources are often released in the reverse order, though this is not
required.
However, online fsck differs from regular XFS operations because it may
examine an object that normally is acquired in a later stage of the locking
order, and then decide to cross-reference the object with an object that is
acquired earlier in the order.
The next few sections detail the specific ways in which online fsck takes
care to avoid deadlocks.

iget and irele During a Scrub
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

An inode scan performed on behalf of a scrub operation runs in transaction
context, and possibly with resources already locked and bound to it.
This isn't much of a problem for ``iget`` since it can operate in the
context of an existing transaction, as long as all of the bound resources
are acquired before the inode reference in the regular filesystem.

When the VFS ``iput`` function is given a linked inode with no other
references, it normally puts the inode on an LRU list in the hope that it
can save time if another process re-opens the file before the system runs
out of memory and frees it.
Filesystem callers can short-circuit the LRU process by setting a
``DONTCACHE`` flag on the inode to cause the kernel to try to drop the inode
into the inactivation machinery immediately.

In the past, inactivation was always done from the process that dropped the
inode, which was a problem for scrub because scrub may already hold a
transaction, and XFS does not support nesting transactions.
On the other hand, if there is no scrub transaction, it is desirable to drop
otherwise unused inodes immediately to avoid polluting caches.
To capture these nuances, the online fsck code has a separate ``xchk_irele``
function to set or clear the ``DONTCACHE`` flag to get the required release
behavior.

Proposed patchsets include fixing
`scrub iget usage
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-iget-fixes>`_ and
`dir iget usage
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-dir-iget-fixes>`_.

Locking Inodes
^^^^^^^^^^^^^^

In regular filesystem code, the VFS and XFS will acquire multiple IOLOCK
locks in a well-known order: parent → child when updating the directory
tree, and in numerical order of the addresses of their ``struct inode``
object otherwise.
For regular files, the MMAPLOCK can be acquired after the IOLOCK to stop
page faults.
If two MMAPLOCKs must be acquired, they are acquired in numerical order of
the addresses of their ``struct address_space`` objects.
Due to the structure of existing filesystem code, IOLOCKs and MMAPLOCKs must
be acquired before transactions are allocated.
If two ILOCKs must be acquired, they are acquired in inumber order.

Inode lock acquisition must be done carefully during a coordinated inode
scan.

Online fsck cannot abide these conventions, because for a directory tree
scanner, the scrub process holds the IOLOCK of the file being scanned and it
needs to take the IOLOCK of the file at the other end of the directory link.
If the directory tree is corrupt because it contains a cycle, ``xfs_scrub``
cannot use the regular inode locking functions without becoming trapped in
an ABBA deadlock.

Solving both of these problems is straightforward -- any time online fsck
needs to take a second lock of the same class, it uses trylock to avoid an
ABBA deadlock.
If the trylock fails, scrub drops all inode locks and uses trylock loops to
(re)acquire all necessary resources.
Trylock loops enable scrub to check for pending fatal signals, which is how
scrub avoids deadlocking the filesystem or becoming an unresponsive process.
However, trylock loops mean that online fsck must be prepared to measure the
resource being scrubbed before and after the lock cycle to detect changes
and react accordingly.

.. _dirparent:

Case Study: Finding a Directory Parent
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Consider the directory parent pointer repair code as an example.
Online fsck must verify that the dotdot dirent of a directory points up to a
parent directory, and that the parent directory contains exactly one dirent
pointing down to the child directory.
Fully validating this relationship (and repairing it if possible) requires a
walk of every directory on the filesystem while holding the child locked,
and while updates to the directory tree are being made.
The coordinated inode scan provides a way to walk the filesystem without the
possibility of missing an inode.
The child directory is kept locked to prevent updates to the dotdot dirent,
but if the scanner fails to lock a parent, it can drop and relock both the
child and the prospective parent.
If the dotdot entry changes while the directory is unlocked, then a move or
rename operation must have changed the child's parentage, and the scan can
exit early.

The proposed patchset is the
`directory repair
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-dirs>`_
series.

.. _fshooks:

Filesystem Hooks
````````````````

The second piece of support that online fsck functions need during a full
filesystem scan is the ability to stay informed about updates being made by
other threads in the filesystem, since comparisons against the past are
useless in a dynamic environment.
Two pieces of Linux kernel infrastructure enable online fsck to monitor
regular filesystem operations: filesystem hooks and
:ref:`static keys<jump_labels>`.

Filesystem hooks convey information about an ongoing filesystem operation to
a downstream consumer.
In this case, the downstream consumer is always an online fsck function.
Because multiple fsck functions can run in parallel, online fsck uses the
Linux notifier call chain facility to dispatch updates to any number of
interested fsck processes.
Call chains are a dynamic list, which means that they can be configured at
run time.
Because these hooks are private to the XFS module, the information passed
along contains exactly what the checking function needs to update its
observations.

The current implementation of XFS hooks uses SRCU notifier chains to reduce
the impact on highly threaded workloads.
Regular blocking notifier chains use a rwsem and seem to have a much lower
overhead for single-threaded applications.
However, it may turn out that blocking chains and static keys are a more
performant combination; more study is needed here.

The following pieces are necessary to hook a certain point in the
filesystem:

- A ``struct xfs_hooks`` object must be embedded in a convenient place such
  as a well-known incore filesystem object.

- Each hook must define an action code and a structure containing more
  context about the action.

- Hook providers should provide appropriate wrapper functions and structs
  around the ``xfs_hooks`` and ``xfs_hook`` objects to take advantage of
  type checking to ensure correct usage.

- A callsite in the regular filesystem code must be chosen to call
  ``xfs_hooks_call`` with the action code and data structure.
  This place should be adjacent to (and not earlier than) the place where
  the filesystem update is committed to the transaction.
  In general, when the filesystem calls a hook chain, it should be able to
  handle sleeping and should not be vulnerable to memory reclaim or locking
  recursion.
  However, the exact requirements are very dependent on the context of the
  hook caller and the callee.

- The online fsck function should define a structure to hold scan data, a
  lock to coordinate access to the scan data, and a ``struct xfs_hook``
  object.
  The scanner function and the regular filesystem code must acquire
  resources in the same order; see the next section for details.

- The online fsck code must contain a C function to catch the hook action
  code and data structure.
  If the object being updated has already been visited by the scan, then the
  hook information must be applied to the scan data.

- Prior to unlocking inodes to start the scan, online fsck must call
  ``xfs_hooks_setup`` to initialize the ``struct xfs_hook``, and
  ``xfs_hooks_add`` to enable the hook.

- Online fsck must call ``xfs_hooks_del`` to disable the hook once the scan
  is complete.

The number of hooks should be kept to a minimum to reduce complexity.
Static keys are used to reduce the overhead of filesystem hooks to nearly
zero when online fsck is not running.

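The fragment below sketches that wiring for a hypothetical scanner.
The ``xfs_hooks_setup``, ``xfs_hooks_add``, and ``xfs_hooks_del`` entry
points are named in the list above; the structure layout, the hook chain
location, and the prototypes are illustrative guesses::

    /* Scan-private data: shadow observations, a lock, and the hook. */
    struct xchk_myscan {
        struct mutex        lock;       /* guards the shadow data */
        struct xfarray      *shadow;    /* shadow observations */
        struct xfs_hook     hook;       /* live update hook */
    };

    /* Called via the notifier chain to apply updates to scanned objects. */
    static int xchk_myscan_hook(struct notifier_block *nb,
                                unsigned long action, void *data);

    /* Before unlocking inodes to start the scan: */
    xfs_hooks_setup(&scan->hook, xchk_myscan_hook);
    xfs_hooks_add(&mp->m_myscan_hooks, &scan->hook);

    /* ...run the coordinated inode scan... */

    /* Once the scan is complete: */
    xfs_hooks_del(&mp->m_myscan_hooks, &scan->hook);
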
.. _liveupdate:

Live Updates During a Scan
``````````````````````````

The code paths of the online fsck scanning code and the
:ref:`hooked<fshooks>` filesystem code look like this::

                other program
                      ↓
                inode lock ←────────────────────┐
                      ↓                         │
                AG header lock                  │
                      ↓                         │
                filesystem function             │
                      ↓                         │
                notifier call chain             │    same
                      ↓                         ├─── inode
                scrub hook function             │    lock
                      ↓                         │
                scan data mutex ←──┐    same    │
                      ↓            ├─── scan    │
                update scan data   │    lock    │
                      ↑            │            │
                scan data mutex ←──┘            │
                      ↑                         │
                inode lock ←────────────────────┘
                      ↑
                scrub function
                      ↑
                inode scanner
                      ↑
                xfs_scrub

These rules must be followed to ensure correct interactions between the
checking code and the code making an update to the filesystem:

- Prior to invoking the notifier call chain, the filesystem function being
  hooked must acquire the same lock that the scrub scanning function
  acquires to scan the inode.

- The scanning function and the scrub hook function must coordinate access
  to the scan data by acquiring a lock on the scan data.

- Scrub hook functions must not add the live update information to the scan
  observations unless the inode being updated has already been scanned.
  The scan coordinator has a helper predicate
  (``xchk_iscan_want_live_update``) for this.

- Scrub hook functions must not change the caller's state, including the
  transaction that it is running.
  They must not acquire any resources that might conflict with the
  filesystem function being hooked.

- The hook function can abort the inode scan to avoid breaking the other
  rules.

The inode scan APIs are pretty simple:

- ``xchk_iscan_start`` starts a scan

- ``xchk_iscan_iter`` grabs a reference to the next inode in the scan or
  returns zero if there is nothing left to scan

- ``xchk_iscan_want_live_update`` decides if an inode has already been
  visited in the scan.
  This is critical for hook functions to decide if they need to update the
  in-memory scan information.

- ``xchk_iscan_mark_visited`` marks an inode as having been visited in the
  scan

- ``xchk_iscan_teardown`` finishes the scan

This functionality is also a part of the
`inode scanner
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-iscan>`_
series.

.. _quotacheck:

Case Study: Quota Counter Checking
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

It is useful to compare the mount time quotacheck code to the online repair
quotacheck code.
Mount time quotacheck does not have to contend with concurrent operations,
so it does the following:

1. Make sure the ondisk dquots are in good enough shape that all the incore
   dquots will actually load, and zero the resource usage counters in the
   ondisk buffer.

2. Walk every inode in the filesystem.
   Add each file's resource usage to the incore dquot.

3. Walk each incore dquot.
   If the incore dquot is not being flushed, add the ondisk buffer backing
   the incore dquot to a delayed write (delwri) list.

4. Write the buffer list to disk.

Like most online fsck functions, online quotacheck can't write to regular
filesystem objects until the newly collected metadata reflect all filesystem
state.

.. _quotacheck:

Case Study: Quota Counter Checking
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

It is useful to compare the mount time quotacheck code to the online repair
quotacheck code.
Mount time quotacheck does not have to contend with concurrent operations, so
it does the following:

1. Make sure the ondisk dquots are in good enough shape that all the incore
   dquots will actually load, and zero the resource usage counters in the
   ondisk buffer.

2. Walk every inode in the filesystem.
   Add each file's resource usage to the incore dquot.

3. Walk each incore dquot.
   If the incore dquot is not being flushed, add the ondisk buffer backing the
   incore dquot to a delayed write (delwri) list.

4. Write the buffer list to disk.

Like most online fsck functions, online quotacheck cannot write to regular
filesystem objects until the newly collected metadata reflect all filesystem
state.
Therefore, online quotacheck records file resource usage to a shadow dquot
index implemented with a sparse ``xfarray``, and only writes to the real dquots
once the scan is complete.
Handling transactional updates is tricky because quota resource usage updates
are applied in phases to minimize contention on dquots:

1. The inodes involved are joined and locked to a transaction.

2. For each dquot attached to the file:

   a. The dquot is locked.

   b. A quota reservation is added to the dquot's resource usage.
      The reservation is recorded in the transaction.

   c. The dquot is unlocked.

3. Changes in actual quota usage are tracked in the transaction.

4. At transaction commit time, each dquot is examined again:

   a. The dquot is locked again.

   b. Quota usage changes are logged and unused reservation is given back to
      the dquot.

   c. The dquot is unlocked.

For online quotacheck, hooks are placed in steps 2 and 4.
The step 2 hook creates a shadow version of the transaction dquot context
(``dqtrx``) that operates in a similar manner to the regular code.
The step 4 hook commits the shadow ``dqtrx`` changes to the shadow dquots.
Notice that both hooks are called with the inode locked, which is how the
live update coordinates with the inode scanner.

The quotacheck scan looks like this:

1. Set up a coordinated inode scan.

2. For each inode returned by the inode scan iterator:

   a. Grab and lock the inode.

   b. Determine that inode's resource usage (data blocks, inode counts,
      realtime blocks) and add that to the shadow dquots for the user, group,
      and project ids associated with the inode.

   c. Unlock and release the inode.

3. For each dquot in the system:

   a. Grab and lock the dquot.

   b. Check the dquot against the shadow dquots created by the scan and
      updated by the live hooks.

Live updates are key to being able to walk every quota record without
needing to hold any locks for a long duration.
If repairs are desired, the real and shadow dquots are locked and the real
dquot's resource counts are set to the values in the shadow dquot.

The proposed patchset is the
`online quotacheck
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quotacheck>`_
series.
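
As an illustration of the shadow bookkeeping, the step 4 hook described above
might fold committed usage deltas into a shadow record as sketched here.
The ``xqcheck`` structure, its fields, and the record layout are hypothetical,
and the ``xfarray`` accessor signatures are assumptions; the essential points
are the ``xchk_iscan_want_live_update`` gate and the scan data lock::

        /* Hypothetical shadow dquot record kept in a sparse xfarray. */
        struct xqcheck_dquot {
                int64_t         icount;         /* observed inode count */
                int64_t         bcount;         /* observed data blocks */
                int64_t         rtbcount;       /* observed rt blocks */
        };

        struct xqcheck_dquot    xqd;

        mutex_lock(&xqc->lock);
        if (xchk_iscan_want_live_update(&xqc->iscan, ip->i_ino)) {
                xfarray_load(xqc->shadow_dquots, dqid, &xqd);
                xqd.icount   += delta->icount;
                xqd.bcount   += delta->bcount;
                xqd.rtbcount += delta->rtbcount;
                xfarray_store(xqc->shadow_dquots, dqid, &xqd);
        }
        mutex_unlock(&xqc->lock);

Because the hook runs with the inode locked, this check coordinates correctly
with the inode scanner, per the locking diagram in the previous section.
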
.. _nlinks:

Case Study: File Link Count Checking
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

File link count checking also uses live update hooks.
The coordinated inode scanner is used to visit all directories on the
filesystem, and per-file link count records are stored in a sparse ``xfarray``
indexed by inumber.
During the scanning phase, each entry in a directory generates observation
data as follows:

1. If the entry is a dotdot (``'..'``) entry of the root directory, the
   directory's parent link count is bumped because the root directory's dotdot
   entry is self-referential.

2. If the entry is a dotdot entry of a subdirectory, the parent's backref
   count is bumped.

3. If the entry is neither a dot nor a dotdot entry, the target file's parent
   count is bumped.

4. If the target is a subdirectory, the parent's child link count is bumped.

A crucial point to understand about how the link count inode scanner interacts
with the live update hooks is that the scan cursor tracks which *parent*
directories have been scanned.
In other words, the live updates ignore any update about ``A → B`` when A has
not been scanned, even if B has been scanned.
Furthermore, a subdirectory A with a dotdot entry pointing back to B is
accounted as a backref counter in the shadow data for A, since child dotdot
entries affect the parent's link count.
Live update hooks are carefully placed in all parts of the filesystem that
create, change, or remove directory entries, since those operations involve
bumplink and droplink.

For any file, the correct link count is the number of parents plus the number
of child subdirectories.
Non-directories never have children of any kind.
The backref information is used to detect inconsistencies in the number of
links pointing to child subdirectories and the number of dotdot entries
pointing back.

After the scan completes, the link count of each file can be checked by locking
both the inode and the shadow data, and comparing the link counts.
A second coordinated inode scan cursor is used for comparisons.
Live updates are key to being able to walk every inode without needing to hold
any locks between inodes.
If repairs are desired, the inode's link count is set to the value in the
shadow information.
If no parents are found, the file must be :ref:`reparented <orphanage>` to the
orphanage to prevent the file from being lost forever.

The proposed patchset is the
`file link count repair
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-nlinks>`_
series.
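
To illustrate, the shadow record and the post-scan comparison might look like
the sketch below.
The record layout and field names are hypothetical; the expected value follows
the "parents plus child subdirectories" rule stated above::

        /* Hypothetical shadow link count record, indexed by inumber. */
        struct xchk_nlink {
                uint32_t        parents;        /* dirents pointing here */
                uint32_t        backrefs;       /* dotdots pointing here */
                uint32_t        children;       /* subdirs of this dir */
        };

        /* After the scan, with the inode and the shadow data locked: */
        expected = rec.parents + rec.children;  /* 0 children for non-dirs */
        if (VFS_I(ip)->i_nlink != expected)
                /* flag corruption, or set the link count if repairing */;
        if (S_ISDIR(VFS_I(ip)->i_mode) && rec.children != rec.backrefs)
                /* child subdirectory links and dotdot backrefs disagree */;
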
.. _rmap_repair:

Case Study: Rebuilding Reverse Mapping Records
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Most repair functions follow the same pattern: lock filesystem resources,
walk the surviving ondisk metadata looking for replacement metadata records,
and use an :ref:`in-memory array <xfarray>` to store the gathered observations.
The primary advantage of this approach is the simplicity and modularity of the
repair code -- code and data are entirely contained within the scrub module,
do not require hooks in the main filesystem, and are usually the most efficient
in memory use.
A secondary advantage of this repair approach is atomicity -- once the kernel
decides a structure is corrupt, no other threads can access the metadata until
the kernel finishes repairing and revalidating the metadata.

For repairs going on within a shard of the filesystem, these advantages
outweigh the delays inherent in locking the shard while it is being repaired.
Unfortunately, repairs to the reverse mapping btree cannot use the "standard"
btree repair strategy because the rebuild must scan every space mapping of
every fork of every file in the filesystem, and the filesystem cannot be
stopped for that long.
Therefore, rmap repair forgoes atomicity between scrub and repair.
It combines a :ref:`coordinated inode scanner <iscan>`, :ref:`live update hooks
<liveupdate>`, and an :ref:`in-memory rmap btree <xfbtree>` to complete the
scan for reverse mapping records.

1. Set up an xfbtree to stage rmap records.

2. While holding the locks on the AGI and AGF buffers acquired during the
   scrub, generate reverse mappings for all AG metadata: inodes, btrees, CoW
   staging extents, and the internal log.

3. Set up an inode scanner.

4. Hook into rmap updates for the AG being repaired so that the live scan data
   can receive updates to the rmap btree from the rest of the filesystem during
   the file scan.

5. For each space mapping found in either fork of each file scanned,
   decide if the mapping matches the AG of interest.
   If so:

   a. Create a btree cursor for the in-memory btree.

   b. Use the rmap code to add the record to the in-memory btree.

   c. Use the :ref:`special commit function <xfbtree_commit>` to write the
      xfbtree changes to the xfile.

6. For each live update received via the hook, decide if the owner has already
   been scanned.
   If so, apply the live update into the scan data (see the sketch at the end
   of this section):

   a. Create a btree cursor for the in-memory btree.

   b. Replay the operation into the in-memory btree.

   c. Use the :ref:`special commit function <xfbtree_commit>` to write the
      xfbtree changes to the xfile.
      This is performed with an empty transaction to avoid changing the
      caller's state.

7. When the inode scan finishes, create a new scrub transaction and relock the
   two AG headers.

8. Compute the new btree geometry using the number of rmap records in the
   shadow btree, like all other btree rebuilding functions.

9. Allocate the number of blocks computed in the previous step.

10. Perform the usual btree bulk loading and commit to install the new rmap
    btree.

11. Reap the old rmap btree blocks as discussed in the case study about how
    to :ref:`reap after rmap btree repair <rmap_reap>`.

12. Free the xfbtree now that it is not needed.

The proposed patchset is the
`rmap repair
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-rmap-btree>`_
series.
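
To illustrate step 6 above, here is a rough sketch of a hook function replaying
one live rmap update into the staging xfbtree.
The ``rr`` repair context and the helper names (``xfs_rmapbt_mem_cursor``,
``xfs_rmap_map_raw``, ``xfbtree_trans_commit``, ``xchk_iscan_abort``) are
assumptions about plausible plumbing; only the empty transaction and the
owner-already-scanned gate come directly from the case study::

        struct xfs_trans        *tp;
        struct xfs_btree_cur    *cur;
        int                     error;

        /* Skip updates for owners that the scan has not reached yet. */
        if (!xchk_iscan_want_live_update(&rr->iscan, oinfo->oi_owner))
                return NOTIFY_DONE;

        mutex_lock(&rr->lock);

        /* Empty transaction so the caller's state is never modified. */
        error = xfs_trans_alloc_empty(mp, &tp);
        if (!error) {
                cur = xfs_rmapbt_mem_cursor(mp, tp, rr->rmap_btree);
                error = xfs_rmap_map_raw(cur, &rmap);   /* replay the op */
                xfs_btree_del_cursor(cur, error);

                /* Special commit writes the xfbtree changes to the xfile. */
                if (!error)
                        error = xfbtree_trans_commit(rr->rmap_btree, tp);
                xfs_trans_cancel(tp);
        }
        if (error)
                xchk_iscan_abort(&rr->iscan);   /* per the liveupdate rules */

        mutex_unlock(&rr->lock);

Aborting the scan on error follows the rule from the live update section that
a hook which cannot apply an update must abort rather than leave the scan data
inconsistent.
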