.. SPDX-License-Identifier: GPL-2.0
.. _xfs_online_fsck_design:

..
        Mapping of heading styles within this document:
        Heading 1 uses "====" above and below
        Heading 2 uses "===="
        Heading 3 uses "----"
        Heading 4 uses "````"
        Heading 5 uses "^^^^"
        Heading 6 uses "~~~~"
        Heading 7 uses "...."

        Sections are manually numbered because apparently that's what everyone
        does in the kernel.

======================
XFS Online Fsck Design
======================

This document captures the design of the online filesystem check feature for
XFS.
The purpose of this document is threefold:

- To help kernel distributors understand exactly what the XFS online fsck
  feature is, and issues about which they should be aware.

- To help people reading the code to familiarize themselves with the relevant
  concepts and design points before they start digging into the code.

- To help developers maintaining the system by capturing the reasons
  supporting higher level decision making.

As the online fsck code is merged, the links in this document to topic branches
will be replaced with links to code.


This document is licensed under the terms of the GNU General Public License,
v2.
The primary author is Darrick J. Wong.

This design document is split into seven parts.
Part 1 defines what fsck tools are and the motivations for writing a new one.
Parts 2 and 3 present a high level overview of how the online fsck process
works and how it is tested to ensure correct functionality.
Part 4 discusses the user interface and the intended usage modes of the new
program.
Parts 5 and 6 show off the high level components and how they fit together, and
then present case studies of how each repair function actually works.
Part 7 sums up what has been discussed so far and speculates about what else
might be built atop online fsck.

.. contents:: Table of Contents
   :local:

1. What is a Filesystem Check?
==============================

A Unix filesystem has four main responsibilities:

- Provide a hierarchy of names through which application programs can associate
  arbitrary blobs of data for any length of time,

- Virtualize physical storage media across those names,

- Retrieve the named data blobs at any time, and

- Examine resource usage.


Metadata directly supporting these functions (e.g. files, directories, space
mappings) are sometimes called primary metadata.
Secondary metadata (e.g. reverse mapping and directory parent pointers) support
operations internal to the filesystem, such as internal consistency checking
and reorganization.
Summary metadata, as the name implies, condense information contained in
primary metadata for performance reasons.

The filesystem check (fsck) tool examines all the metadata in a filesystem
to look for errors.
In addition to looking for obvious metadata corruptions, fsck also
cross-references different types of metadata records with each other to look
for inconsistencies.
People do not like losing data, so most fsck tools also contain some ability
to correct any problems found.
As a word of caution -- the primary goal of most Linux fsck tools is to restore
the filesystem metadata to a consistent state, not to maximize the data
recovered.
That precedent will not be challenged here.

Filesystems of the 20th century generally lacked any redundancy in the ondisk
format, which means that fsck can only respond to errors by erasing files until
errors are no longer detected.
More recent filesystem designs contain enough redundancy in their metadata that
it is now possible to regenerate data structures when non-catastrophic errors
occur; this capability aids both strategies.

+--------------------------------------------------------------------------+
| **Note**:                                                                |
+--------------------------------------------------------------------------+
| System administrators avoid data loss by increasing the number of        |
| separate storage systems through the creation of backups; and they avoid |
| downtime by increasing the redundancy of each storage system through the |
| creation of RAID arrays.                                                 |
| fsck tools address only the first problem.                               |
+--------------------------------------------------------------------------+

TLDR; Show Me the Code!
-----------------------

Code is posted to the kernel.org git trees as follows:
`kernel changes <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-symlink>`_,
`userspace changes <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-media-scan-service>`_, and
`QA test changes <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=repair-dirs>`_.
Each kernel patchset adding an online repair function will use the same branch
name across the kernel, xfsprogs, and fstests git repos.

Existing Tools
--------------

The online fsck tool described here will be the third tool in the history of
XFS (on Linux) to check and repair filesystems.
Two programs precede it:

The first program, ``xfs_check``, was created as part of the XFS debugger
(``xfs_db``) and can only be used with unmounted filesystems.
It walks all metadata in the filesystem looking for inconsistencies in the
metadata, though it lacks any ability to repair what it finds.
Due to its high memory requirements and inability to repair things, this
program is now deprecated and will not be discussed further.

The second program, ``xfs_repair``, was created to be faster and more robust
than the first program.
Like its predecessor, it can only be used with unmounted filesystems.
It uses extent-based in-memory data structures to reduce memory consumption,
and tries to schedule readahead I/O appropriately to reduce I/O waiting time
while it scans the metadata of the entire filesystem.
The most important feature of this tool is its ability to respond to
inconsistencies in file metadata and the directory tree by erasing things as
needed to eliminate problems.
Space usage metadata are rebuilt from the observed file metadata.

Problem Statement
-----------------

The current XFS tools leave several problems unsolved:

1. **User programs** suddenly **lose access** to the filesystem when unexpected
   shutdowns occur as a result of silent corruptions in the metadata.
   These occur **unpredictably** and often without warning.

2. **Users** experience a **total loss of service** during the recovery period
   after an **unexpected shutdown** occurs.

3. **Users** experience a **total loss of service** if the filesystem is taken
   offline to **look for problems** proactively.

4. **Data owners** cannot **check the integrity** of their stored data without
   reading all of it.
   This may expose them to substantial billing costs when a linear media scan
   performed by the storage system administrator might suffice.

5. **System administrators** cannot **schedule** a maintenance window to deal
   with corruptions if they **lack the means** to assess filesystem health
   while the filesystem is online.

6. **Fleet monitoring tools** cannot **automate periodic checks** of filesystem
   health when doing so requires **manual intervention** and downtime.

7. **Users** can be tricked into **doing things they do not desire** when
   malicious actors **exploit quirks of Unicode** to place misleading names
   in directories.

Given this definition of the problems to be solved and the actors who would
benefit, the proposed solution is a third fsck tool that acts on a running
filesystem.

This new third program has three components: an in-kernel facility to check
metadata, an in-kernel facility to repair metadata, and a userspace driver
program to drive fsck activity on a live filesystem.
``xfs_scrub`` is the name of the driver program.
The rest of this document presents the goals and use cases of the new fsck
tool, describes its major design points in connection to those goals, and
discusses the similarities and differences with existing tools.

+--------------------------------------------------------------------------+
| **Note**:                                                                |
+--------------------------------------------------------------------------+
| Throughout this document, the existing offline fsck tool can also be     |
| referred to by its current name "``xfs_repair``".                        |
| The userspace driver program for the new online fsck tool can be         |
| referred to as "``xfs_scrub``".                                          |
| The kernel portion of online fsck that validates metadata is called      |
| "online scrub", and the portion of the kernel that fixes metadata is     |
| called "online repair".                                                  |
+--------------------------------------------------------------------------+

The naming hierarchy is broken up into objects known as directories and files
and the physical space is split into pieces known as allocation groups.
Sharding enables better performance on highly parallel systems and helps to
contain the damage when corruptions occur.
The division of the filesystem into principal objects (allocation groups and
inodes) means that there are ample opportunities to perform targeted checks and
repairs on a subset of the filesystem.

While this is going on, other parts continue processing IO requests.
Even if a piece of filesystem metadata can only be regenerated by scanning the
entire system, the scan can still be done in the background while other file
operations continue.

In summary, online fsck takes advantage of resource sharding and redundant
metadata to enable targeted checking and repair operations while the system
is running.
This capability will be coupled to automatic system management so that
autonomous self-healing of XFS maximizes service availability.

2. Theory of Operation
======================

Because it is necessary for online fsck to lock and scan live metadata objects,
online fsck consists of three separate code components.
The first is the userspace driver program ``xfs_scrub``, which is responsible
for identifying individual metadata items, scheduling work items for them,
reacting to the outcomes appropriately, and reporting results to the system
administrator.
The second and third are in the kernel, which implements functions to check
and repair each type of online fsck work item.

+------------------------------------------------------------------+
| **Note**:                                                        |
+------------------------------------------------------------------+
| For brevity, this document shortens the phrase "online fsck work |
| item" to "scrub item".                                           |
+------------------------------------------------------------------+

Scrub item types are delineated in a manner consistent with the Unix design
philosophy, which is to say that each item should handle one aspect of a
metadata structure, and handle it well.

Scope
-----

In principle, online fsck should be able to check and to repair everything that
the offline fsck program can handle.
However, online fsck cannot be running 100% of the time, which means that
latent errors may creep in after a scrub completes.
If these errors cause the next mount to fail, offline fsck is the only
solution.
This limitation means that maintenance of the offline fsck tool will continue.
A second limitation of online fsck is that it must follow the same resource
sharing and lock acquisition rules as the regular filesystem.
This means that scrub cannot take *any* shortcuts to save time, because doing
so could lead to concurrency problems.
In other words, online fsck is not a complete replacement for offline fsck, and
a complete run of online fsck may take longer than offline fsck.
However, both of these limitations are acceptable tradeoffs to satisfy the
different motivations of online fsck, which are to **minimize system downtime**
and to **increase predictability of operation**.

.. _scrubphases:

Phases of Work
--------------

The userspace driver program ``xfs_scrub`` splits the work of checking and
repairing an entire filesystem into seven phases.
Each phase concentrates on checking specific types of scrub items and depends
on the success of all previous phases.
The seven phases are as follows:

1. Collect geometry information about the mounted filesystem and computer,
   discover the online fsck capabilities of the kernel, and open the
   underlying storage devices.

2. Check allocation group metadata, all realtime volume metadata, and all quota
   files.
   Each metadata structure is scheduled as a separate scrub item.
   If corruption is found in the inode header or inode btree and ``xfs_scrub``
   is permitted to perform repairs, then those scrub items are repaired to
   prepare for phase 3.
   Repairs are implemented by using the information in the scrub item to
   resubmit the kernel scrub call with the repair flag enabled; this is
   discussed in the next section.
   Optimizations and all other repairs are deferred to phase 4.

3. Check all metadata of every file in the filesystem.
   Each metadata structure is also scheduled as a separate scrub item.
   If repairs are needed and ``xfs_scrub`` is permitted to perform repairs,
   and there were no problems detected during phase 2, then those scrub items
   are repaired immediately.
   Optimizations, deferred repairs, and unsuccessful repairs are deferred to
   phase 4.

4. All remaining repairs and scheduled optimizations are performed during this
   phase, if the caller permits them.
   Before starting repairs, the summary counters are checked and any necessary
   repairs are performed so that subsequent repairs will not fail the resource
   reservation step due to wildly incorrect summary counters.
   Unsuccessful repairs are requeued as long as forward progress on repairs is
   made somewhere in the filesystem.
   Free space in the filesystem is trimmed at the end of phase 4 if the
   filesystem is clean.

5. By the start of this phase, all primary and secondary filesystem metadata
   must be correct.
   Summary counters such as the free space counts and quota resource counts
   are checked and corrected.
   Directory entry names and extended attribute names are checked for
   suspicious entries such as control characters or confusing Unicode sequences
   appearing in names.

6. If the caller asks for a media scan, read all allocated and written data
   file extents in the filesystem.
   The ability to use hardware-assisted data file integrity checking is new
   to online fsck; neither of the previous tools has this capability.
   If media errors occur, they will be mapped to the owning files and reported.

7. Re-check the summary counters and present the caller with a summary of
   space usage and file counts.

This allocation of responsibilities will be :ref:`revisited <scrubcheck>`
later in this document.

Steps for Each Scrub Item
-------------------------

The kernel scrub code uses a three-step strategy for checking and repairing
the one aspect of a metadata object represented by a scrub item:

1. The scrub item of interest is checked for corruptions; opportunities for
   optimization; and for values that are directly controlled by the system
   administrator but look suspicious.
   If the item is neither corrupt nor in need of optimization, resources are
   released and the positive scan results are returned to userspace.
   If the item is corrupt or could be optimized but the caller does not permit
   this, resources are released and the negative scan results are returned to
   userspace.
   Otherwise, the kernel moves on to the second step.

2. The repair function is called to rebuild the data structure.
   Repair functions generally choose to rebuild a structure from other metadata
   rather than try to salvage the existing structure.
   If the repair fails, the scan results from the first step are returned to
   userspace.
   Otherwise, the kernel moves on to the third step.

3. In the third step, the kernel runs the same checks over the new metadata
   item to assess the efficacy of the repairs.
   The results of the reassessment are returned to userspace.

Classification of Metadata
--------------------------

Each type of metadata object (and therefore each type of scrub item) is
classified as follows:

Primary Metadata
````````````````


Metadata structures in this category should be most familiar to filesystem
users either because they are directly created by the user or they index
objects created by the user.
Most filesystem objects fall into this class:

- Free space and reference count information

- Inode records and indexes

- Storage mapping information for file data

- Directories

- Extended attributes

- Symbolic links

- Quota limits

Scrub obeys the same rules as regular filesystem accesses for resource and lock
acquisition.

Primary metadata objects are the simplest for scrub to process.
The principal filesystem object (either an allocation group or an inode) that
owns the item being scrubbed is locked to guard against concurrent updates.
The check function examines every record associated with the type for obvious
errors and cross-references healthy records against other metadata to look for
inconsistencies.
Repairs for this class of scrub item are simple, since the repair function
starts by holding all the resources acquired in the previous step.
The repair function scans available metadata as needed to record all the
observations needed to complete the structure.
Next, it stages the observations in a new ondisk structure and commits it
atomically to complete the repair.
Finally, the storage from the old data structure is carefully reaped.

Because ``xfs_scrub`` locks a primary object for the duration of the repair,
this is effectively an offline repair operation performed on a subset of the
filesystem.
This minimizes the complexity of the repair code because it is not necessary to
handle concurrent updates from other threads, nor is it necessary to access
any other part of the filesystem.
As a result, indexed structures can be rebuilt very quickly, and programs
trying to access the damaged structure will be blocked until repairs complete.
The only infrastructure needed by the repair code is a staging area for
observations and a means to write new structures to disk.
Despite these limitations, the advantage that online repair holds is clear:
targeted work on individual shards of the filesystem avoids total loss of
service.

This mechanism is described in section 2.1 ("Off-Line Algorithm") of
V. Srinivasan and M. J. Carey, `"Performance of On-Line Index Construction
Algorithms" <https://minds.wisconsin.edu/bitstream/handle/1793/59524/TR1047.pdf>`_,
*Extending Database Technology*, pp. 293-309, 1992.

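
The sequence described above (lock the owning object, record observations,
stage a new structure, commit it, and reap the old one) can be modeled with a
short Python sketch.
This is only an illustration under simplifying assumptions and is not kernel
code: ``Shard``, ``records``, ``reaped``, and ``rebuild_index`` are hypothetical
names, a dictionary stands in for the ondisk index, and a single reference swap
stands in for the atomic commit.

```python
import threading
from typing import Dict, List, Tuple


class Shard:
    """Hypothetical stand-in for one lockable object (an AG or an inode)."""

    def __init__(self, records: List[Tuple[int, int]],
                 index: Dict[int, int]) -> None:
        self.lock = threading.Lock()
        self.records = records      # other metadata that the repair can scan
        self.index = index          # possibly damaged structure to replace
        self.reaped: List[Dict[int, int]] = []  # old structures that were freed


def rebuild_index(shard: Shard) -> None:
    """Rebuild shard.index from shard.records while holding the shard lock."""
    with shard.lock:                        # held for the whole repair
        staged = sorted(shard.records)      # 1. record observations in an array
        new_index = dict(staged)            # 2. format the new structure
        old_index, shard.index = shard.index, new_index  # 3. commit the swap
        shard.reaped.append(old_index)      # 4. reap the old structure
```

Because the lock in this model is held from the first observation to the final
reap, no other thread can see a half-built index, which is what makes this an
offline algorithm applied to a single shard.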

Most primary metadata repair functions stage their intermediate results in an
in-memory array prior to formatting the new ondisk structure, which is very
similar to the list-based algorithm discussed in section 2.3 ("List-Based
Algorithms") of Srinivasan.
However, any data structure builder that maintains a resource lock for the
duration of the repair is *always* an offline algorithm.

.. _secondary_metadata:

Secondary Metadata
``````````````````

Metadata structures in this category reflect records found in primary metadata,
but are only needed for online fsck or for reorganization of the filesystem.

Secondary metadata include:

- Reverse mapping information

- Directory parent pointers

This class of metadata is difficult for scrub to process because scrub attaches
to the secondary object but needs to check primary metadata, which runs counter
to the usual order of resource acquisition.
Frequently, this means that full filesystem scans are necessary to rebuild the
metadata.
Check functions can be limited in scope to reduce runtime.
Repairs, however, require a full scan of primary metadata, which can take a
long time to complete.
Under these conditions, ``xfs_scrub`` cannot lock resources for the entire
duration of the repair.

Instead, repair functions set up an in-memory staging structure to store
observations.
Depending on the requirements of the specific repair function, the staging
index will either have the same format as the ondisk structure or a design
specific to that repair function.
The next step is to release all locks and start the filesystem scan.
When the repair scanner needs to record an observation, the staging data are
locked long enough to apply the update.
While the filesystem scan is in progress, the repair function hooks the
filesystem so that it can apply pending filesystem updates to the staging
information.
Once the scan is done, the owning object is re-locked, the live data is used to
write a new ondisk structure, and the repairs are committed atomically.
The hooks are disabled and the staging area is freed.
Finally, the storage from the old data structure is carefully reaped.

Introducing concurrency helps online repair avoid various locking problems, but
comes at a high cost to code complexity.
Live filesystem code has to be hooked so that the repair function can observe
updates in progress.
The staging area has to become a fully functional parallel structure so that
updates can be merged from the hooks.
Finally, the hook, the filesystem scan, and the inode locking model must be
sufficiently well integrated that a hook event can decide if a given update
should be applied to the staging structure.

In theory, the scrub implementation could apply these same techniques for
primary metadata, but doing so would make it massively more complex and less
performant.
Programs attempting to access the damaged structures are not blocked from
operation, which may cause application failure or an unplanned filesystem
shutdown.

Inspiration for the secondary metadata repair strategy was drawn from section
2.4 of Srinivasan above, and sections 2 ("NSF: Index Build Without Side-File")
and 3.1.1 ("Duplicate Key Insert Problem") in C. Mohan, `"Algorithms for
Creating Indexes for Very Large Tables Without Quiescing Updates"
<https://dl.acm.org/doi/10.1145/130283.130337>`_, 1992.

The sidecar index mentioned above bears some resemblance to the side file
method mentioned in Srinivasan and Mohan.
Their method consists of an index builder that extracts relevant record data to
build the new structure as quickly as possible; and an auxiliary structure that
captures all updates that would be committed to the index by other threads were
the new index already online.
After the index building scan finishes, the updates recorded in the side file
are applied to the new index.
To avoid conflicts between the index builder and other writer threads, the
builder maintains a publicly visible cursor that tracks the progress of the
scan through the record space.
To avoid duplication of work between the side file and the index builder, side
file updates are elided when the record ID for the update is greater than the
cursor position within the record ID space.

To minimize changes to the rest of the codebase, XFS online repair keeps the
replacement index hidden until it's completely ready to go.
In other words, there is no attempt to expose the keyspace of the new index
while repair is running.
The complexity of such an approach would be very high and perhaps more
appropriate to building *new* indices.

**Future Work Question**: Can the full scan and live update code used to
facilitate a repair also be used to implement a comprehensive check?

*Answer*: In theory, yes.  Check would be much stronger if each scrub function
employed these live scans to build a shadow copy of the metadata and then
compared the shadow records to the ondisk records.
However, doing that is a fair amount more work than what the checking functions
do now.
The live scans and hooks were developed much later.
The extra work would in turn increase the runtime of those scrub functions.

Summary Information
```````````````````

Metadata structures in this last category summarize the contents of primary
metadata records.
These are often used to speed up resource usage queries, and are many times
smaller than the primary metadata which they represent.

Examples of summary information include:

- Summary counts of free space and inodes

- File link counts from directories

- Quota resource usage counts

Check and repair require full filesystem scans, but resource and lock
acquisition follow the same paths as regular filesystem accesses.

The superblock summary counters have special requirements due to the underlying
implementation of the incore counters, and will be treated separately.
Check and repair of the other types of summary counters (quota resource counts
and file link counts) employ the same filesystem scanning and hooking
techniques as outlined above, but because the underlying data are sets of
integer counters, the staging data need not be a fully functional mirror of the
ondisk structure.

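The scan-plus-hook idea for integer counters can be illustrated with a small
simulation.
This is only a sketch under stated assumptions, not the kernel implementation:
the class ``ShadowUsageCounts``, the function ``rebuild_counts``, and the
dictionary that stands in for the inode table are all hypothetical names, and
the example runs single-threaded, so the brief locking of the staging data
described earlier is omitted.

```python
from collections import Counter


class ShadowUsageCounts:
    """Hypothetical staging area: per-owner block counts rebuilt by a scan."""

    def __init__(self):
        self.counts = Counter()  # owner id -> blocks observed so far
        self.cursor = -1         # highest inode number the scan has visited

    def scan_inode(self, ino, owner, nblocks):
        """Called by the scanner, which visits inodes in ascending order."""
        self.counts[owner] += nblocks
        self.cursor = ino

    def hook_update(self, ino, owner, delta):
        """Called by the simulated live filesystem for every change.

        Inodes beyond the cursor have not been scanned yet.  The scan will
        observe their final state later, so their deltas are skipped.
        """
        if ino <= self.cursor:
            self.counts[owner] += delta


def rebuild_counts(inodes, live_updates):
    """Scan `inodes`, applying one simulated live update after each visit.

    inodes:       dict of inode number -> [owner, nblocks], mutated in place
    live_updates: list of (ino, delta) changes made by other writers
    """
    shadow = ShadowUsageCounts()
    pending = iter(live_updates)

    def apply_live_update(ino, delta):
        inodes[ino][1] += delta                       # the live change itself
        shadow.hook_update(ino, inodes[ino][0], delta)  # the hook sees it too

    for ino in sorted(inodes):
        owner, nblocks = inodes[ino]
        shadow.scan_inode(ino, owner, nblocks)
        update = next(pending, None)
        if update is not None:
            apply_live_update(*update)

    # Updates arriving after the scan has finished always hit scanned inodes.
    for update in pending:
        apply_live_update(*update)
    return shadow.counts
```

Because the staging data is just a set of counters keyed by owner, the hook
only has to add a signed delta.
It never needs to look up or rewrite a record, which is why a full mirror of
the ondisk structure is unnecessary in this case.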
Inspiration for quota and file link count repair strategies was drawn from
sections 2.12 ("Online Index Operations") through 2.14 ("Incremental View
Maintenance") of G. Graefe, `"Concurrent Queries and Updates in Summary Views
and Their Indexes"
<http://www.odbms.org/wp-content/uploads/2014/06/Increment-locks.pdf>`_, 2011.

Since quotas are non-negative integer counts of resource usage, online
quotacheck can use the incremental view deltas described in section 2.14 to
track pending changes to the block and inode usage counts in each transaction,
and commit those changes to a dquot side file when the transaction commits.
Delta tracking is necessary for dquots because the index builder scans inodes,
whereas the data structure being rebuilt is an index of dquots.
Link count checking combines the view deltas and commit step into one because
it sets attributes of the objects being scanned instead of writing them to a
separate data structure.
Each online fsck function will be discussed as a case study later in this
document.

Risk Management
---------------

During the development of online fsck, several risk factors were identified
that may make the feature unsuitable for certain distributors and users.
Steps can be taken to mitigate or eliminate those risks, though at a cost to
functionality.


- **Decreased performance**: Adding metadata indices to the filesystem
  increases the time cost of persisting changes to disk, and the reverse space
  mapping and directory parent pointers are no exception.
  System administrators who require the maximum performance can disable the
  reverse mapping features at format time, though this choice dramatically
  reduces the ability of online fsck to find inconsistencies and repair them.

- **Incorrect repairs**: As with all software, there might be defects in the
  software that result in incorrect repairs being written to the filesystem.
  Systematic fuzz testing (detailed in the next section) is employed by the
  authors to find bugs early, but it might not catch everything.
  The kernel build system provides Kconfig options (``CONFIG_XFS_ONLINE_SCRUB``
  and ``CONFIG_XFS_ONLINE_REPAIR``) to enable distributors to choose not to
  accept this risk.
  The xfsprogs build system has a configure option (``--enable-scrub=no``) that
  disables building of the ``xfs_scrub`` binary, though this is not a risk
  mitigation if the kernel functionality remains enabled.

- **Inability to repair**: Sometimes, a filesystem is too badly damaged to be
  repairable.
  If the keyspaces of several metadata indices overlap in some manner but a
  coherent narrative cannot be formed from records collected, then the repair
  fails.
  To reduce the chance that a repair will fail with a dirty transaction and
  render the filesystem unusable, the online repair functions have been
  designed to stage and validate all new records before committing the new
  structure.

- **Misbehavior**: Online fsck requires many privileges -- raw IO to block
  devices, opening files by handle, ignoring Unix discretionary access control,
  and the ability to perform administrative changes.
  Running this automatically in the background scares people, so the systemd
  background service is configured to run with only the privileges required.
  Obviously, this cannot address certain problems like the kernel crashing or
  deadlocking, but it should be sufficient to prevent the scrub process from
  escaping and reconfiguring the system.
  The cron job does not have this protection.

- **Fuzz Kiddiez**: There are many people now who seem to think that running
  automated fuzz testing of ondisk artifacts to find mischievous behavior and
  spraying exploit code onto the public mailing list for instant zero-day
  disclosure is somehow of some social benefit.
  In the view of this author, the benefit is realized only when the fuzz
  operators help to **fix** the flaws, but this opinion apparently is not
  widely shared among security "researchers".
  The XFS maintainers' continuing ability to manage these events presents an
  ongoing risk to the stability of the development process.
  Automated testing should front-load some of the risk while the feature is
  considered EXPERIMENTAL.

Many of these risks are inherent to software programming.
Despite this, it is hoped that this new functionality will prove useful in
reducing unexpected downtime.

3. Testing Plan
===============

As stated before, fsck tools have three main goals:

1. Detect inconsistencies in the metadata;

2. Eliminate those inconsistencies; and

3. Minimize further loss of data.

Demonstrations of correct operation are necessary to build users' confidence
that the software behaves within expectations.
Unfortunately, it was not really feasible to perform regular exhaustive testing
of every aspect of a fsck tool until the introduction of low-cost virtual
machines with high-IOPS storage.
With ample hardware availability in mind, the testing strategy for the online
fsck project involves differential analysis against the existing fsck tools and
systematic testing of every attribute of every type of metadata object.
Testing can be split into four major categories, as discussed below.


Integrated Testing with fstests
-------------------------------

The primary goal of any free software QA effort is to make testing as
inexpensive and widespread as possible to maximize the scaling advantages of
community.
In other words, testing should maximize the breadth of filesystem configuration
scenarios and hardware setups.
This improves code quality by enabling the authors of online fsck to find and
fix bugs early, and helps developers of new features to find integration
issues earlier in their development effort.

The Linux filesystem community shares a common QA testing suite,
`fstests <https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/>`_, for
functional and regression testing.
Even before development work began on online fsck, fstests (when run on XFS)
would run both the ``xfs_check`` and ``xfs_repair -n`` commands on the test and
scratch filesystems between each test.
This provides a level of assurance that the kernel and the fsck tools stay in
alignment about what constitutes consistent metadata.
During development of the online checking code, fstests was modified to run
``xfs_scrub -n`` between each test to ensure that the new checking code
produces the same results as the two existing fsck tools.

To start development of online repair, fstests was modified to run
``xfs_repair`` to rebuild the filesystem's metadata indices between tests.
This ensures that offline repair does not crash, leave a corrupt filesystem
after it exits, or trigger complaints from the online check.
This also established a baseline for what can and cannot be repaired offline.
To complete the first phase of development of online repair, fstests was
modified to be able to run ``xfs_scrub`` in a "force rebuild" mode.
This enables a comparison of the effectiveness of online repair as compared to
the existing offline repair tools.

General Fuzz Testing of Metadata Blocks
---------------------------------------

XFS benefits greatly from having a very robust debugging tool, ``xfs_db``.

Before development of online fsck even began, a set of fstests was created
to test the rather common fault that entire metadata blocks get corrupted.
This required the creation of fstests library code that can create a filesystem
containing every possible type of metadata object.
Next, individual test cases were created to create a test filesystem, identify
a single block of a specific type of metadata object, trash it with the
existing ``blocktrash`` command in ``xfs_db``, and test the reaction of a
particular metadata validation strategy.

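The shape of such a test case can be sketched without a real filesystem.
The example below is a toy model only: it does not invoke ``xfs_db`` or any
XFS code, the block layout and the names ``format_block``, ``verify_block``,
and ``trash_block`` are hypothetical, and ``zlib.crc32`` stands in for the
checksum algorithm that XFS actually uses.
It formats one checksummed block, damages it, and confirms that a simple
verifier rejects the result.

```python
import random
import struct
import zlib

MAGIC = b"DEMO"   # hypothetical magic number, not a real XFS value
BLOCK_SIZE = 512


def format_block(payload):
    """Build a block: 4-byte magic, 4-byte CRC of the body, then the body."""
    body = payload.ljust(BLOCK_SIZE - 8, b"\0")
    return MAGIC + struct.pack("<I", zlib.crc32(body)) + body


def verify_block(block):
    """Stand-in for a verifier: check the magic number and the checksum."""
    if block[:4] != MAGIC:
        return False
    (stored_crc,) = struct.unpack("<I", block[4:8])
    return stored_crc == zlib.crc32(block[8:])


def trash_block(block, rng):
    """Corrupt one randomly chosen byte so that it is guaranteed to differ."""
    damaged = bytearray(block)
    offset = rng.randrange(len(damaged))
    damaged[offset] ^= rng.randrange(1, 256)  # nonzero mask always changes it
    return bytes(damaged)


rng = random.Random(0)                 # fixed seed keeps the run repeatable
good = format_block(b"pretend these are btree records")
bad = trash_block(good, rng)
print("pristine block accepted:", verify_block(good))
print("trashed block accepted:", verify_block(bad))
```

A single damaged byte is always caught here, either by the magic number check
or by the checksum.
The real tests go further than this sketch, because they also examine whether
the repair tools can fix the damage rather than merely detect it.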
This earlier test suite enabled XFS developers to test the ability of the
in-kernel validation functions and the ability of the offline fsck tool to
detect and eliminate the inconsistent metadata.
This part of the test suite was extended to cover online fsck in exactly the
same manner.

In other words, for a given fstests filesystem configuration:

* For each metadata object existing on the filesystem:

  * Write garbage to it

  * Test the reactions of:

    1. The kernel verifiers to stop obviously bad metadata
    2. Offline repair (``xfs_repair``) to detect and fix
    3. Online repair (``xfs_scrub``) to detect and fix

Targeted Fuzz Testing of Metadata Records
-----------------------------------------

The testing plan for online fsck includes extending the existing fs testing
infrastructure to provide a much more powerful facility: targeted fuzz testing
of every metadata field of every metadata object in the filesystem.
``xfs_db`` can modify every field of every metadata structure in every
block in the filesystem to simulate the effects of memory corruption and
software bugs.
Given that fstests already contains the ability to create a filesystem
containing every metadata format known to the filesystem, ``xfs_db`` can be
used to perform exhaustive fuzz testing!

For a given fstests filesystem configuration:

* For each metadata object existing on the filesystem...

  * For each record inside that metadata object...

    * For each field inside that record...

      * For each conceivable type of transformation that can be applied to a bit field...

        1. Clear all bits
        2. Set all bits
        3. Toggle the most significant bit
        4. Toggle the middle bit
        5. Toggle the least significant bit
        6. Add a small quantity
        7. Subtract a small quantity
        8. Randomize the contents

        * ...test the reactions of:

          1. The kernel verifiers to stop obviously bad metadata
          2. Offline checking (``xfs_repair -n``)
          3. Offline repair (``xfs_repair``)
          4. Online checking (``xfs_scrub -n``)
          5. Online repair (``xfs_scrub``)
          6. Both repair tools (``xfs_scrub`` and then ``xfs_repair`` if online repair doesn't succeed)

This is quite the combinatoric explosion!

Fortunately, having this much test coverage makes it easy for XFS developers to
check the responses of XFS' fsck tools.
Since the introduction of the fuzz testing framework, these tests have been
used to discover incorrect repair code and missing functionality for entire
classes of metadata objects in ``xfs_repair``.
The enhanced testing was used to finalize the deprecation of ``xfs_check`` by
confirming that ``xfs_repair`` could detect at least as many corruptions as
the older tool.

These tests have been very valuable for ``xfs_scrub`` in the same ways -- they
allow the online fsck developers to compare online fsck against offline fsck,
and they enable XFS developers to find deficiencies in the code base.

Proposed patchsets include
`general fuzzer improvements
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=fuzzer-improvements>`_,
`fuzzing baselines
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=fuzz-baseline>`_,
and `improvements in fuzz testing comprehensiveness
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=more-fuzz-testing>`_.

Stress Testing
--------------

A unique requirement to online fsck is the ability to operate on a filesystem
concurrently with regular workloads.
Although it is of course impossible to run ``xfs_scrub`` with *zero* observable
impact on the running system, the online repair code should never introduce
inconsistencies into the filesystem metadata, and regular workloads should
never notice resource starvation.
To verify that these conditions are being met, fstests has been enhanced in
the following ways:

* For each scrub item type, create a test to exercise checking that item type
  while running ``fsstress``.
* For each scrub item type, create a test to exercise repairing that item type
  while running ``fsstress``.
* Race ``fsstress`` and ``xfs_scrub -n`` to ensure that checking the whole
  filesystem doesn't cause problems.
* Race ``fsstress`` and ``xfs_scrub`` in force-rebuild mode to ensure that
  force-repairing the whole filesystem doesn't cause problems.
* Race ``xfs_scrub`` in check and force-repair mode against ``fsstress`` while
  freezing and thawing the filesystem.
* Race ``xfs_scrub`` in check and force-repair mode against ``fsstress`` while
  remounting the filesystem read-only and read-write.
* The same, but running ``fsx`` instead of ``fsstress``.  (Not done yet?)

Success is defined by the ability to run all of these tests without observing
any unexpected filesystem shutdowns due to corrupted metadata, kernel hang
check warnings, or any other sort of mischief.

Proposed patchsets include `general stress testing
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=race-scrub-and-mount-state-changes>`_
and the `evolution of existing per-function stress testing
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=refactor-scrub-stress>`_.

4. User Interface
=================

The primary user of online fsck is the system administrator, just like offline
repair.
Online fsck presents two modes of operation to administrators:
A foreground CLI process for online fsck on demand, and a background service
that performs autonomous checking and repair.

Checking on Demand
------------------

For administrators who want the absolute freshest information about the
metadata in a filesystem, ``xfs_scrub`` can be run as a foreground process on
a command line.
The program checks every piece of metadata in the filesystem while the
administrator waits for the results to be reported, just like the existing
``xfs_repair`` tool.
Both tools share a ``-n`` option to perform a read-only scan, and a ``-v``
option to increase the verbosity of the information reported.

A new feature of ``xfs_scrub`` is the ``-x`` option, which employs the error
correction capabilities of the hardware to check data file contents.
The media scan is not enabled by default because it may dramatically increase
program runtime and consume a lot of bandwidth on older storage hardware.

The output of a foreground invocation is captured in the system log.

The ``xfs_scrub_all`` program walks the list of mounted filesystems and
initiates ``xfs_scrub`` for each of them in parallel.
It serializes scans for any filesystems that resolve to the same top level
kernel block device to prevent resource overconsumption.

Background Service
------------------

To reduce the workload of system administrators, the ``xfs_scrub`` package
provides a suite of `systemd <https://systemd.io/>`_ timers and services that
run online fsck automatically on weekends by default.
The background service configures scrub to run with as little privilege as
possible, the lowest CPU and IO priority, and in a CPU-constrained single
threaded mode.
This can be tuned by the systemd administrator at any time to suit the latency
and throughput requirements of customer workloads.

The output of the background service is also captured in the system log.
If desired, reports of failures (either due to inconsistencies or mere runtime
errors) can be emailed automatically by setting the ``EMAIL_ADDR`` environment
variable in the following service files:


* ``xfs_scrub_fail@.service``
* ``xfs_scrub_media_fail@.service``
* ``xfs_scrub_all_fail.service``

The decision to enable the background scan is left to the system administrator.
This can be done by enabling either of the following services:

* ``xfs_scrub_all.timer`` on systemd systems
* ``xfs_scrub_all.cron`` on non-systemd systems

This automatic weekly scan is configured out of the box to perform an
additional media scan of all file data once per month.
This is less foolproof than, say, storing file data block checksums, but much
more performant if application software provides its own integrity checking,
redundancy can be provided elsewhere above the filesystem, or the storage
device's integrity guarantees are deemed sufficient.

The systemd unit file definitions have been subjected to a security audit
(as of systemd 249) to ensure that the xfs_scrub processes have as little
access to the rest of the system as possible.
This was performed via ``systemd-analyze security``, after which privileges
were restricted to the minimum required, sandboxing and system call filtering
were set up to the maximal extent possible, and access to the
filesystem tree was restricted to the minimum needed to start the program and
access the filesystem being scanned.
The service definition files restrict CPU usage to 80% of one CPU core, and
apply as nice of a priority to IO and CPU scheduling as possible.
This measure was taken to minimize delays in the rest of the filesystem.
No such hardening has been performed for the cron job.

Proposed patchset:
`Enabling the xfs_scrub background service
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-media-scan-service>`_.

Health Reporting
----------------

XFS caches a summary of each filesystem's health status in memory.
The information is updated whenever ``xfs_scrub`` is run, or whenever
inconsistencies are detected in the filesystem metadata during regular
operations.
System administrators should use the ``health`` command of ``xfs_spaceman`` to
retrieve this information in a human-readable format.
If problems have been observed, the administrator can schedule a reduced
service window to run the online repair tool to correct the problem.
Failing that, the administrator can decide to schedule a maintenance window to
run the traditional offline repair tool to correct the problem.

**Future Work Question**: Should the health reporting integrate with the new
fanotify fs error notification system?
Would it be helpful for sysadmins to have a daemon to listen for corruption
notifications and initiate a repair?

*Answer*: These questions remain unanswered, but should be a part of the
conversation with early adopters and potential downstream users of XFS.

Proposed patchsets include
`wiring up health reports to correction returns
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=corruption-health-reports>`_
and
`preservation of sickness info during memory reclaim
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=indirect-health-reporting>`_.

5. Kernel Algorithms and Data Structures
========================================

This section discusses the key algorithms and data structures of the kernel
code that provide the ability to check and repair metadata while the system
is running.
The first chapters in this section reveal the pieces that provide the
foundation for checking metadata.
The remainder of this section presents the mechanisms through which XFS
regenerates itself.

Self Describing Metadata
------------------------

Starting with XFS version 5 in 2012, XFS updated the format of nearly every
ondisk block header to record a magic number, a checksum, a universally
"unique" identifier (UUID), an owner code, the ondisk address of the block,
and a log sequence number.
When loading a block buffer from disk, the magic number, UUID, owner, and
ondisk address confirm that the retrieved block matches the specific owner of
the current filesystem, and that the information contained in the block is
supposed to be found at the ondisk address.
The first three components enable checking tools to disregard alleged metadata
that doesn't belong to the filesystem, and the fourth component enables the
filesystem to detect lost writes.

Whenever a file system operation modifies a block, the change is submitted
to the log as part of a transaction.
The log then processes these transactions, marking them done once they are
safely persisted to storage.
The logging code maintains the checksum and the log sequence number of the last
transactional update.
Checksums are useful for detecting torn writes and other discrepancies that can
be introduced between the computer and its storage devices.
Sequence number tracking enables log recovery to avoid applying out of date
log updates to the filesystem.

These two features improve overall runtime resiliency by providing a means for
the filesystem to detect obvious corruption when reading metadata blocks from
disk, but these buffer verifiers cannot provide any consistency checking
between metadata structures.

For more information, please see the documentation for
Documentation/filesystems/xfs-self-describing-metadata.rst.

Reverse Mapping
---------------

The original design of XFS (circa 1993) is an improvement upon 1980s Unix
filesystem design.
In those days, storage density was expensive, CPU time was scarce, and
excessive seek time could kill performance.
For performance reasons, filesystem authors were reluctant to add redundancy to
the filesystem, even at the cost of data integrity.
Filesystem designers in the early 21st century chose different strategies to
increase internal redundancy -- either storing nearly identical copies of
metadata, or more space-efficient encoding techniques.

For XFS, a different redundancy strategy was chosen to modernize the design:
a secondary space usage index that maps allocated disk extents back to their
owners.
By adding a new index, the filesystem retains most of its ability to scale
well to heavily threaded workloads involving large datasets, since the primary
file metadata (the directory tree, the file block map, and the allocation
groups) remain unchanged.
Like any system that improves redundancy, the reverse-mapping feature increases
overhead costs for space mapping activities.
However, it has two critical advantages: first, the reverse index is key to
enabling online fsck and other requested functionality such as free space
defragmentation, better media failure reporting, and filesystem shrinking.
Second, the different ondisk storage format of the reverse mapping btree
defeats device-level deduplication because the filesystem requires real
redundancy.

+--------------------------------------------------------------------------+
| **Sidebar**:                                                             |
+--------------------------------------------------------------------------+
| A criticism of adding the secondary index is that it does nothing to     |
| improve the robustness of user data storage itself.                      |
| This is a valid point, but adding a new index for file data block        |
| checksums increases write amplification by turning data overwrites into  |
| copy-writes, which age the filesystem prematurely.                       |
| In keeping with thirty years of precedent, users who want file data      |
| integrity can supply as powerful a solution as they require.             |
| As for metadata, the complexity of adding a new secondary index of space |
| usage is much less than adding volume management and storage device      |
| mirroring to XFS itself.                                                 |
| Perfection of RAID and volume management is best left to existing        |
| layers in the kernel.                                                    |
+--------------------------------------------------------------------------+

The information captured in a reverse space mapping record is as follows:

.. code-block:: c

    struct xfs_rmap_irec {
        xfs_agblock_t    rm_startblock;  /* extent start block */
        xfs_extlen_t     rm_blockcount;  /* extent length */
        uint64_t         rm_owner;       /* extent owner */
        uint64_t         rm_offset;      /* offset within the owner */
        unsigned int     rm_flags;       /* state flags */
    };

The first two fields capture the location and size of the physical space,
in units of filesystem blocks.
The owner field tells scrub which metadata structure or file inode has been
assigned this space.
For space allocated to files, the offset field tells scrub where the space was
mapped within the file fork.
Finally, the flags field provides extra information about the space usage --
is this an attribute fork extent? A file mapping btree extent? Or an
unwritten data extent?

Online filesystem checking judges the consistency of each primary metadata
record by comparing its information against all other space indices.
The reverse mapping index plays a key role in the consistency checking process
because it contains a centralized alternate copy of all space allocation
information.
Program runtime and ease of resource acquisition are the only real limits to
what online checking can consult.
For example, a file data extent mapping can be checked against:

* The absence of an entry in the free space information.
* The absence of an entry in the inode index.
* The absence of an entry in the reference count data if the file is not
  marked as having shared extents.
* The correspondence of an entry in the reverse mapping information.

There are several observations to make about reverse mapping indices:

1. Reverse mappings can provide a positive affirmation of correctness if any of
   the above primary metadata are in doubt.
   The checking code for most primary metadata follows a path similar to the
   one outlined above.

2. Proving the consistency of secondary metadata with the primary metadata is
   difficult because that requires a full scan of all primary space metadata,
   which is very time intensive.
   For example, checking a reverse mapping record for a file extent mapping
   btree block requires locking the file and searching the entire btree to
   confirm the block.
   Instead, scrub relies on rigorous cross-referencing during the primary space
   mapping structure checks.

3. Consistency scans must use non-blocking lock acquisition primitives if the
   required locking order is not the same order used by regular filesystem
   operations.
   For example, if the filesystem normally takes a file ILOCK before taking
   the AGF buffer lock but scrub wants to take a file ILOCK while holding
   an AGF buffer lock, scrub cannot block on that second acquisition.
   This means that forward progress during this part of a scan of the reverse
   mapping data cannot be guaranteed if system load is heavy.

In summary, reverse mappings play a key role in reconstruction of primary
metadata.
The details of how these records are staged, written to disk, and committed
into the filesystem are covered in subsequent sections.

Checking and Cross-Referencing
------------------------------

The first step of checking a metadata structure is to examine every record
contained within the structure and its relationship with the rest of the
system.
XFS contains multiple layers of checking to try to prevent inconsistent
metadata from wreaking havoc on the system.
Each of these layers contributes information that helps the kernel to make
five decisions about the health of a metadata structure:

- Is a part of this structure obviously corrupt (``XFS_SCRUB_OFLAG_CORRUPT``)?
- Is this structure inconsistent with the rest of the system
  (``XFS_SCRUB_OFLAG_XCORRUPT``)?
- Is there so much damage around the filesystem that cross-referencing is not
  possible (``XFS_SCRUB_OFLAG_XFAIL``)?
- Can the structure be optimized to improve performance or reduce the size of
  metadata (``XFS_SCRUB_OFLAG_PREEN``)?
- Does the structure contain data that is not inconsistent but deserves review
  by the system administrator (``XFS_SCRUB_OFLAG_WARNING``)?

The following sections describe how the metadata scrubbing process works.

Metadata Buffer Verification
````````````````````````````

The lowest layer of metadata protection in XFS consists of the metadata
verifiers built into the buffer cache.
These functions perform inexpensive internal consistency checking of the block
itself, and answer these questions:

- Does the block belong to this filesystem?

- Does the block belong to the structure that asked for the read?
  This assumes that metadata blocks only have one owner, which is always true
  in XFS.

- Is the type of data stored in the block within a reasonable range of what
  scrub is expecting?

- Does the physical location of the block match the location it was read from?

- Does the block checksum match the data?

The scope of the protections here is very limited -- verifiers can only
establish that the filesystem code is reasonably free of gross corruption bugs
and that the storage system is reasonably competent at retrieval.
Corruption problems observed at runtime cause the generation of health reports,
failed system calls, and in the extreme case, filesystem shutdowns if the
corrupt metadata force the cancellation of a dirty transaction.

Every online fsck scrubbing function is expected to read every ondisk metadata
block of a structure in the course of checking the structure.
Corruption problems observed during a check are immediately reported to
userspace as corruption; during a cross-reference, they are reported as a
failure to cross-reference once the full examination is complete.
Reads satisfied by a buffer already in cache (and hence already verified)
bypass these checks.

Internal Consistency Checks
```````````````````````````

After the buffer cache, the next level of metadata protection is the internal
record verification code built into the filesystem.
These checks are split between the buffer verifiers, the in-filesystem users of
the buffer cache, and the scrub code itself, depending on the amount of higher
level context required.
The scope of checking is still internal to the block.
These higher level checking functions answer these questions:

- Does the type of data stored in the block match what scrub is expecting?

- Does the block belong to the owning structure that asked for the read?

- If the block contains records, do the records fit within the block?

- If the block tracks internal free space information, is it consistent with
  the record areas?

- Are the records contained inside the block free of obvious corruptions?

Record checks in this category are more rigorous and more time-intensive.
For example, block pointers and inumbers are checked to ensure that they point
within the dynamically allocated parts of an allocation group and within
the filesystem.
Names are checked for invalid characters, and flags are checked for invalid
combinations.
Other record attributes are checked for sensible values.
Btree records spanning an interval of the btree keyspace are checked for
correct order and lack of mergeability (except for file fork mappings).
For performance reasons, regular code may skip some of these checks unless
debugging is enabled or a write is about to occur.
Scrub functions, of course, must check all possible problems.

Validation of Userspace-Controlled Record Attributes
````````````````````````````````````````````````````

Various pieces of filesystem metadata are directly controlled by userspace.
Because of this, validation work cannot be more precise than checking
that a value is within the possible range.
These fields include:

- Superblock fields controlled by mount options
- Filesystem labels
- File timestamps
- File permissions
- File size
- File flags
- Names present in directory entries, extended attribute keys, and filesystem
  labels
- Extended attribute key namespaces
- Extended attribute values
- File data block contents
- Quota limits
- Quota timer expiration (if resource usage exceeds the soft limit)

Cross-Referencing Space Metadata
````````````````````````````````

After internal block checks, the next higher level of checking is
cross-referencing records between metadata structures.
For regular runtime code, the cost of these checks is considered to be
prohibitively expensive, but as scrub is dedicated to rooting out
inconsistencies, it must pursue all avenues of inquiry.
The exact set of cross-referencing is highly dependent on the context of the
data structure being checked.

The XFS btree code has keyspace scanning functions that online fsck uses to
cross-reference one structure with another.
Specifically, scrub can scan the key space of an index to determine if that
keyspace is fully, sparsely, or not at all mapped to records.
For the reverse mapping btree, it is possible to mask parts of the key for the
purposes of performing a keyspace scan so that scrub can decide if the rmap
btree contains records mapping a certain extent of physical space without the
sparseness of the rest of the rmap keyspace getting in the way.

Btree blocks undergo the following checks before cross-referencing:

- Does the type of data stored in the block match what scrub is expecting?

- Does the block belong to the owning structure that asked for the read?

- Do the records fit within the block?

- Are the records contained inside the block free of obvious corruptions?

- Are the name hashes in the correct order?

- Do node pointers within the btree point to valid block addresses for the type
  of btree?

- Do child pointers point towards the leaves?

- Do sibling pointers point across the same level?

- For each node block record, does the record key accurately reflect the
  contents of the child block?

Space allocation records are cross-referenced as follows:

1. Any space mentioned by any metadata structure is cross-referenced as
   follows:

   - Does the reverse mapping index list only the appropriate owner as the
     owner of each block?

   - Are none of the blocks claimed as free space?

   - If these aren't file data blocks, are none of the blocks claimed as space
     shared by different owners?

2. Btree blocks are cross-referenced as follows:

   - Everything in class 1 above.

   - If there's a parent node block, do the keys listed for this block match the
     keyspace of this block?

   - Do the sibling pointers point to valid blocks?  Of the same level?

   - Do the child pointers point to valid blocks?
     Of the next level down?

3. Free space btree records are cross-referenced as follows:

   - Everything in class 1 and 2 above.

   - Does the reverse mapping index list no owners of this space?

   - Is this space not claimed by the inode index for inodes?

   - Is it not mentioned by the reference count index?

   - Is there a matching record in the other free space btree?

4. Inode btree records are cross-referenced as follows:

   - Everything in class 1 and 2 above.

   - Is there a matching record in the free inode btree?

   - Do cleared bits in the holemask correspond with inode clusters?

   - Do set bits in the freemask correspond with inode records with zero link
     count?

5. Inode records are cross-referenced as follows:

   - Everything in class 1.

   - Do all the fields that summarize information about the file forks actually
     match those forks?

   - Does each inode with zero link count correspond to a record in the free
     inode btree?

6.
   File fork space mapping records are cross-referenced as follows:

   - Everything in class 1 and 2 above.

   - Is this space not mentioned by the inode btrees?

   - If this is a CoW fork mapping, does it correspond to a CoW entry in the
     reference count btree?

7. Reference count records are cross-referenced as follows:

   - Everything in class 1 and 2 above.

   - Within the space subkeyspace of the rmap btree (that is to say, all
     records mapped to a particular space extent and ignoring the owner info),
     are there the same number of reverse mapping records for each block as the
     reference count record claims?

Proposed patchsets are the series to find gaps in
`refcount btree
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-refcount-gaps>`_,
`inode btree
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-inobt-gaps>`_, and
`rmap btree
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-rmapbt-gaps>`_ records;
to find
`mergeable records
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-mergeable-records>`_;
and to
`improve cross referencing with rmap
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-strengthen-rmap-checking>`_
before starting a repair.

Checking Extended Attributes
````````````````````````````

Extended attributes implement a key-value store that enables fragments of data
to be attached to any file.
Both the kernel and userspace can access the keys and values, subject to
namespace and privilege restrictions.
Most typically these fragments are metadata about the file -- origins, security
contexts, user-supplied labels, indexing information, etc.

Names can be as long as 255 bytes and can exist in several different
namespaces.
Values can be as large as 64KB.
A file's extended attributes are stored in blocks mapped by the attr fork.
The mappings point to leaf blocks, remote value blocks, or dabtree blocks.
Block 0 in the attribute fork is always the top of the structure, but otherwise
each of the three types of blocks can be found at any offset in the attr fork.
Leaf blocks contain attribute key records that point to the name and the value.
Names are always stored elsewhere in the same leaf block.
Values that are less than 3/4 the size of a filesystem block are also stored
elsewhere in the same leaf block.
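
The placement rule for values can be sketched as a simple size test.
The helper below is a deliberate simplification that takes the 3/4 threshold
stated above at face value and looks only at the value length; the names
``ATTR_VALUE_MAX`` and ``attr_value_is_local()`` are hypothetical and are not
kernel symbols.
The kernel's own calculation may take other parts of the leaf entry into
account.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Largest attribute value described in the text: 64KB. */
#define ATTR_VALUE_MAX	((size_t)64 * 1024)

/*
 * Return true if a value of value_len bytes would be stored inside the leaf
 * block, or false if it would need a remote value block.
 * Assumption: only the value length matters, and the cutoff is exactly
 * "less than 3/4 of a filesystem block".
 */
static bool attr_value_is_local(size_t value_len, size_t block_size)
{
	assert(block_size > 0);
	assert(value_len <= ATTR_VALUE_MAX);

	/* value_len < (3/4) * block_size, written without integer division */
	return 4 * value_len < 3 * block_size;
}
```

Under these assumptions, a 100-byte value on a filesystem with 4096-byte blocks
stays in the leaf, whereas a 3072-byte value (exactly 3/4 of a block) is stored
remotely.
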
Remote value blocks contain values that are too large to fit inside a leaf.
If the leaf information exceeds a single filesystem block, a dabtree (also
rooted at block 0) is created to map hashes of the attribute names to leaf
blocks in the attr fork.

Checking an extended attribute structure is not so straightforward due to the
lack of separation between attr blocks and index blocks.
Scrub must read each block mapped by the attr fork and ignore the non-leaf
blocks:

1. Walk the dabtree in the attr fork (if present) to ensure that there are no
   irregularities in the blocks or dabtree mappings that do not point to
   attr leaf blocks.

2. Walk the blocks of the attr fork looking for leaf blocks.
   For each entry inside a leaf:

   a. Validate that the name does not contain invalid characters.

   b. Read the attr value.
      This performs a named lookup of the attr name to ensure the correctness
      of the dabtree.
      If the value is stored in a remote block, this also validates the
      integrity of the remote value block.

Checking and Cross-Referencing Directories
``````````````````````````````````````````

The filesystem directory tree is a directed acyclic graph structure, with files
constituting the nodes, and directory entries (dirents) constituting the edges.
Directories are a special type of file containing a set of mappings from a
255-byte sequence (name) to an inumber.
These are called directory entries, or dirents for short.
Each directory file must have exactly one directory pointing to the file.
A root directory points to itself.
Directory entries point to files of any type.
Each non-directory file may have multiple directories pointing to it.

In XFS, directories are implemented as a file containing up to three 32GB
partitions.
The first partition contains directory entry data blocks.
Each data block contains variable-sized records associating a user-provided
name with an inumber and, optionally, a file type.
If the directory entry data grows beyond one block, the second partition (which
exists as post-EOF extents) is populated with a block containing free space
information and an index that maps hashes of the dirent names to directory data
blocks in the first partition.
This makes directory name lookups very fast.
If this second partition grows beyond one block, the third partition is
populated with a linear array of free space information for faster
expansions.
If the free space has been separated and the second partition grows again
beyond one block, then a dabtree is used to map hashes of dirent names to
directory data blocks.

Checking a directory is pretty straightforward:

1. Walk the dabtree in the second partition (if present) to ensure that there
   are no irregularities in the blocks or dabtree mappings that do not point to
   dirent blocks.

2. Walk the blocks of the first partition looking for directory entries.
   Each dirent is checked as follows:

   a. Does the name contain no invalid characters?

   b. Does the inumber correspond to an actual, allocated inode?

   c. Does the child inode have a nonzero link count?

   d. If a file type is included in the dirent, does it match the type of the
      inode?

   e. If the child is a subdirectory, does the child's dotdot pointer point
      back to the parent?

   f. If the directory has a second partition, perform a named lookup of the
      dirent name to ensure the correctness of the dabtree.

3. Walk the free space list in the third partition (if present) to ensure that
   the free spaces it describes are really unused.

Checking operations involving :ref:`parents <dirparent>` and
:ref:`file link counts <nlinks>` are discussed in more detail in later
sections.

Checking Directory/Attribute Btrees
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

As stated in previous sections, the directory/attribute btree (dabtree) index
maps user-provided names to improve lookup times by avoiding linear scans.
Internally, it maps a 32-bit hash of the name to a block offset within the
appropriate file fork.

The internal structure of a dabtree closely resembles the btrees that record
fixed-size metadata records -- each dabtree block contains a magic number, a
checksum, sibling pointers, a UUID, a tree level, and a log sequence number.
The format of leaf and node records is the same -- each entry points to the
next level down in the hierarchy, with dabtree node records pointing to dabtree
leaf blocks, and dabtree leaf records pointing to non-dabtree blocks elsewhere
in the fork.

Checking and cross-referencing the dabtree is very similar to what is done for
space btrees:

- Does the type of data stored in the block match what scrub is expecting?

- Does the block belong to the owning structure that asked for the read?

- Do the records fit within the block?

- Are the records contained inside the block free of obvious corruptions?

- Are the name hashes in the correct order?

- Do node pointers within the dabtree point to valid fork offsets for dabtree
  blocks?

- Do leaf pointers within the dabtree point to valid fork offsets for directory
  or attr leaf blocks?

- Do child pointers point towards the leaves?

- Do sibling pointers point across the same level?

- For each dabtree node record, does the record key accurately reflect the
  contents of the child dabtree block?

- For each dabtree leaf record, does the record key accurately reflect the
  contents of the directory or attr block?

Cross-Referencing Summary Counters
``````````````````````````````````

XFS maintains three classes of summary counters: available resources, quota
resource usage, and file link counts.

In theory, the amount of available resources (data blocks, inodes, realtime
extents) can be found by walking the entire filesystem.
This would make for very slow reporting, so a transactional filesystem can
maintain summaries of this information in the superblock.
Cross-referencing these values against the filesystem metadata should be a
simple matter of walking the free space and inode metadata in each AG and the
realtime bitmap, but there are complications that will be discussed in
:ref:`more detail <fscounters>` later.

:ref:`Quota usage <quotacheck>` and :ref:`file link count <nlinks>`
checking are sufficiently complicated to warrant separate sections.

Post-Repair Reverification
``````````````````````````

After performing a repair, the checking code is run a second time to validate
the new structure, and the results of the health assessment are recorded
internally and returned to the calling process.
This step is critical for enabling system administrators to monitor the status
of the filesystem and the progress of any repairs.
For developers, it is a useful means to judge the efficacy of error detection
and correction in the online and offline checking tools.

Eventual Consistency vs. Online Fsck
------------------------------------

Complex operations can make modifications to multiple per-AG data structures
with a chain of transactions.
These chains, once committed to the log, are restarted during log recovery if
the system crashes while processing the chain.
Because the AG header buffers are unlocked between transactions within a chain,
online checking must coordinate with chained operations that are in progress to
avoid incorrectly detecting inconsistencies due to pending chains.
Furthermore, online repair must not run when operations are pending because
the metadata are temporarily inconsistent with each other, and rebuilding is
not possible.

Only online fsck has this requirement of total consistency of AG metadata, and
it should run relatively rarely compared to filesystem change operations.
Online fsck coordinates with transaction chains as follows:

* For each AG, maintain a count of intent items targeting that AG.
  The count should be bumped whenever a new item is added to the chain.
  The count should be dropped when the filesystem has locked the AG header
  buffers and finished the work.

* When online fsck wants to examine an AG, it should lock the AG header
  buffers to quiesce all transaction chains that want to modify that AG.
  If the count is zero, proceed with the checking operation.
  If it is nonzero, cycle the buffer locks to allow the chain to make forward
  progress.

This may lead to online fsck taking a long time to complete, but regular
filesystem updates take precedence over background checking activity.
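
The counting rules above can be sketched in userspace C.
This is a minimal illustration under stated assumptions, not the kernel's
implementation: the names ``ag_drain``, ``drain_bump``, ``drain_drop``, and
``scrub_may_proceed`` are hypothetical, C11 atomics stand in for whatever
counter the kernel uses, and taking the AG header buffer locks is left to the
caller.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-AG drain: counts intent items still targeting this AG. */
struct ag_drain {
	atomic_int pending;
};

static inline void drain_init(struct ag_drain *d)
{
	atomic_init(&d->pending, 0);
}

/* Called when a new intent item targeting this AG is added to a chain. */
static inline void drain_bump(struct ag_drain *d)
{
	atomic_fetch_add(&d->pending, 1);
}

/*
 * Called after the filesystem has locked the AG header buffers and
 * finished the work described by one intent item.
 */
static inline void drain_drop(struct ag_drain *d)
{
	int old = atomic_fetch_sub(&d->pending, 1);

	assert(old > 0);	/* a drop without a matching bump is a bug */
	(void)old;
}

/*
 * Scrub calls this while holding the AG header buffer locks (assumed to be
 * taken by the caller).  A false return means a chain is still in flight.
 */
static inline bool scrub_may_proceed(struct ag_drain *d)
{
	return atomic_load(&d->pending) == 0;
}
```

A scrub thread that sees ``scrub_may_proceed`` return false would release the
AG header buffers, let the chain make progress, and try again.
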
Details about the discovery of this situation are presented in the
:ref:`next section <chain_coordination>`, and details about the solution
are presented :ref:`after that <intent_drains>`.

.. _chain_coordination:

Discovery of the Problem
````````````````````````

Midway through the development of online scrubbing, the fsstress tests
uncovered a misinteraction between online fsck and compound transaction chains
created by other writer threads that resulted in false reports of metadata
inconsistency.
The root cause of these reports is the eventual consistency model introduced by
the expansion of deferred work items and compound transaction chains when
reverse mapping and reflink were introduced.

Originally, transaction chains were added to XFS to avoid deadlocks when
unmapping space from files.
Deadlock avoidance rules require that AGs only be locked in increasing order,
which makes it impossible (say) to use a single transaction to free a space
extent in AG 7 and then try to free a now superfluous block mapping btree block
in AG 3.
To avoid these kinds of deadlocks, XFS creates Extent Freeing Intent (EFI) log
items to commit to freeing some space in one transaction while deferring the
actual metadata updates to a fresh transaction.
The transaction sequence looks like this:

1. The first transaction contains a physical update to the file's block mapping
   structures to remove the mapping from the btree blocks.
   It then attaches to the in-memory transaction an action item to schedule
   deferred freeing of space.
   Concretely, each transaction maintains a list of ``struct
   xfs_defer_pending`` objects, each of which maintains a list of ``struct
   xfs_extent_free_item`` objects.
   Returning to the example above, the action item tracks the freeing of both
   the unmapped space from AG 7 and the block mapping btree (BMBT) block from
   AG 3.
   Deferred frees recorded in this manner are committed in the log by creating
   an EFI log item from the ``struct xfs_extent_free_item`` object and
   attaching the log item to the transaction.
   When the log is persisted to disk, the EFI item is written into the ondisk
   transaction record.
   EFIs can list up to 16 extents to free, all sorted in AG order.

2. The second transaction contains a physical update to the free space btrees
   of AG 3 to release the former BMBT block and a second physical update to the
   free space btrees of AG 7 to release the unmapped file space.
   Observe that the physical updates are resequenced in the correct order
   when possible.
   Attached to the transaction is an extent free done (EFD) log item.
   The EFD contains a pointer to the EFI logged in transaction #1 so that log
   recovery can tell if the EFI needs to be replayed.

If the system goes down after transaction #1 is written back to the filesystem
but before #2 is committed, a scan of the filesystem metadata would show
inconsistent filesystem metadata because there would not appear to be any owner
of the unmapped space.
Happily, log recovery corrects this inconsistency for us -- when recovery finds
an intent log item but does not find a corresponding intent done item, it will
reconstruct the incore state of the intent item and finish it.
In the example above, the log must replay both frees described in the recovered
EFI to complete the recovery phase.

There are subtleties to XFS' transaction chaining strategy to consider:

* Log items must be added to a transaction in the correct order to prevent
  conflicts with principal objects that are not held by the transaction.
  In other words, all per-AG metadata updates for an unmapped block must be
  completed before the last update to free the extent, and extents should not
  be reallocated until that last update commits to the log.

* AG header buffers are released between each transaction in a chain.
  This means that other threads can observe an AG in an intermediate state,
  but as long as the first subtlety is handled, this should not affect the
  correctness of filesystem operations.

* Unmounting the filesystem flushes all pending work to disk, which means that
  offline fsck never sees the temporary inconsistencies caused by deferred
  work item processing.

In this manner, XFS employs a form of eventual consistency to avoid deadlocks
and increase parallelism.

During the design phase of the reverse mapping and reflink features, it was
decided that it was impractical to cram all the reverse mapping updates for a
single filesystem change into a single transaction because a single file
mapping operation can explode into many small updates:

* The block mapping update itself
* A reverse mapping update for the block mapping update
* Fixing the freelist
* A reverse mapping update for the freelist fix

* A shape change to the block mapping btree
* A reverse mapping update for the btree update
* Fixing the freelist (again)
* A reverse mapping update for the freelist fix

* An update to the reference counting information
* A reverse mapping update for the refcount update
* Fixing the freelist (a third time)
* A reverse mapping update for the freelist fix

* Freeing any space that was unmapped and not owned by any other file
* Fixing the freelist (a fourth time)
* A reverse mapping update for the freelist fix

* Freeing the space used by the block mapping btree
* Fixing the freelist (a fifth time)
* A reverse mapping update for the freelist fix

Free list fixups are not usually needed more than once per AG per transaction
chain, but it is theoretically possible if space is very tight.
For copy-on-write updates this is even worse, because this must be done once to
remove the space from a staging area and again to map it into the file!

To deal with this explosion in a calm manner, XFS expands its use of deferred
work items to cover most reverse mapping updates and all refcount updates.
This reduces the worst case size of transaction reservations by breaking the
work into a long chain of small updates, which increases the degree of eventual
consistency in the system.
Again, this generally isn't a problem because XFS orders its deferred work
items carefully to avoid resource reuse conflicts between unsuspecting threads.

However, online fsck changes the rules -- remember that although physical
updates to per-AG structures are coordinated by locking the buffers for AG
headers, buffer locks are dropped between transactions.
Once scrub acquires resources and takes locks for a data structure, it must do
all the validation work without releasing the lock.
If the main lock for a space btree is an AG header buffer lock, scrub may have
interrupted another thread that is midway through finishing a chain.
For example, if a thread performing a copy-on-write has completed a reverse
mapping update but not the corresponding refcount update, the two AG btrees
will appear inconsistent to scrub and an observation of corruption will be
recorded.
This observation will not be correct.
If a repair is attempted in this state, the results will be catastrophic!

Several other solutions to this problem were evaluated upon discovery of this
flaw and rejected:

1. Add a higher level lock to allocation groups and require writer threads to
   acquire the higher level lock in AG order before making any changes.
   This would be very difficult to implement in practice because it is
   difficult to determine which locks need to be obtained, and in what order,
   without simulating the entire operation.
   Performing a dry run of a file operation to discover necessary locks would
   make the filesystem very slow.

2. Make the deferred work coordinator code aware of consecutive intent items
   targeting the same AG and have it hold the AG header buffers locked across
   the transaction roll between updates.
   This would introduce a lot of complexity into the coordinator since it is
   only loosely coupled with the actual deferred work items.
   It would also fail to solve the problem because deferred work items can
   generate new deferred subtasks, but all subtasks must be complete before
   work can start on a new sibling task.

3. Teach online fsck to walk all transactions waiting for whichever lock(s)
   protect the data structure being scrubbed to look for pending operations.
   The checking and repair operations must factor these pending operations into
   the evaluations being performed.
   This solution is a nonstarter because it is *extremely* invasive to the main
   filesystem.

.. _intent_drains:

Intent Drains
`````````````

Online fsck uses an atomic intent item counter and lock cycling to coordinate
with transaction chains.
There are two key properties to the drain mechanism.
First, the counter is incremented when a deferred work item is *queued* to a
transaction, and it is decremented after the associated intent done log item is
*committed* to another transaction.
The second property is that deferred work can be added to a transaction without
holding an AG header lock, but per-AG work items cannot be marked done without
locking that AG header buffer to log the physical updates and the intent done
log item.
The first property enables scrub to yield to running transaction chains, which
is an explicit deprioritization of online fsck to benefit file operations.
The second property of the drain is key to the correct coordination of scrub,
since scrub will always be able to decide if a conflict is possible.

For regular filesystem code, the drain works as follows:

1. Call the appropriate subsystem function to add a deferred work item to a
   transaction.

2. The function calls ``xfs_defer_drain_bump`` to increase the counter.

3. When the deferred item manager wants to finish the deferred work item, it
   calls ``->finish_item`` to complete it.

4. The ``->finish_item`` implementation logs some changes and calls
   ``xfs_defer_drain_drop`` to decrease the sloppy counter and wake up any threads
   waiting on the drain.

5. The subtransaction commits, which unlocks the resource associated with the
   intent item.

For scrub, the drain works as follows:

1. Lock the resource(s) associated with the metadata being scrubbed.
   For example, a scan of the refcount btree would lock the AGI and AGF header
   buffers.

2. If the counter is zero (``xfs_defer_drain_busy`` returns false), there are no
   chains in progress and the operation may proceed.

3. Otherwise, release the resources grabbed in step 1.

4. Wait for the intent counter to reach zero (``xfs_defer_drain_intents``), then go
   back to step 1 unless a signal has been caught.

To avoid polling in step 4, the drain provides a waitqueue for scrub threads to
be woken up whenever the intent count drops to zero.

The proposed patchset is the
`scrub intent drain series
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-drain-intents>`_.

.. _jump_labels:

Static Keys (aka Jump Label Patching)
`````````````````````````````````````

Online fsck for XFS separates the regular filesystem from the checking and
repair code as much as possible.
However, there are a few parts of online fsck (such as the intent drains, and
later, live update hooks) where it is useful for the online fsck code to know
what's going on in the rest of the filesystem.
Since it is not expected that online fsck will be constantly running in the
background, it is very important to minimize the runtime overhead imposed by
these hooks when online fsck is compiled into the kernel but not actively
running on behalf of userspace.
Taking locks in the hot path of a writer thread to access a data structure only
to find that no further action is necessary is expensive -- on the author's
computer, this has an overhead of 40-50ns per access.
Fortunately, the kernel supports dynamic code patching, which enables XFS to
replace a static branch to hook code with ``nop`` sleds when online fsck isn't
running.
This sled has an overhead of however long it takes the instruction decoder to
skip past the sled, which seems to be on the order of less than 1ns and
does not access memory outside of instruction fetching.

When online fsck enables the static key, the sled is replaced with an
unconditional branch to call the hook code.
The switchover is quite expensive (~22000ns) but is paid entirely by the
program that invoked online fsck, and can be amortized if multiple threads
enter online fsck at the same time, or if multiple filesystems are being
checked at the same time.
Changing the branch direction requires taking the CPU hotplug lock, and since
CPU initialization requires memory allocation, online fsck must be careful not
to change a static key while holding any locks or resources that could be
accessed in the memory reclaim paths.
To minimize contention on the CPU hotplug lock, care should be taken not to
enable or disable static keys unnecessarily.

Because static keys are intended to minimize hook overhead for regular
filesystem operations when xfs_scrub is not running, the intended usage
patterns are as follows:

- The hooked part of XFS should declare a static-scoped static key that
  defaults to false.
  The ``DEFINE_STATIC_KEY_FALSE`` macro takes care of this.
  The static key itself should be declared as a ``static`` variable.

- When deciding to invoke code that's only used by scrub, the regular
  filesystem should call the ``static_branch_unlikely`` predicate to avoid the
  scrub-only hook code if the static key is not enabled.

- The regular filesystem should export helper functions that call
  ``static_branch_inc`` to enable and ``static_branch_dec`` to disable the
  static key.
  Wrapper functions make it easy to compile out the relevant code if the kernel
  distributor turns off online fsck at build time.

- Scrub functions wanting to turn on scrub-only XFS functionality should call
  ``xchk_fsgates_enable`` from the setup function to enable a specific
Wong hook. 1822bae43864SDarrick J. Wong This must be done before obtaining any resources that are used by memory 1823bae43864SDarrick J. Wong reclaim. 1824bae43864SDarrick J. Wong Callers had better be sure they really need the functionality gated by the 1825bae43864SDarrick J. Wong static key; the ``TRY_HARDER`` flag is useful here. 1826bae43864SDarrick J. Wong 1827bae43864SDarrick J. WongOnline scrub has resource acquisition helpers (e.g. ``xchk_perag_lock``) to 1828bae43864SDarrick J. Wonghandle locking AGI and AGF buffers for all scrubber functions. 1829bae43864SDarrick J. WongIf it detects a conflict between scrub and the running transactions, it will 1830bae43864SDarrick J. Wongtry to wait for intents to complete. 1831bae43864SDarrick J. WongIf the caller of the helper has not enabled the static key, the helper will 1832bae43864SDarrick J. Wongreturn -EDEADLOCK, which should result in the scrub being restarted with the 1833bae43864SDarrick J. Wong``TRY_HARDER`` flag set. 1834bae43864SDarrick J. WongThe scrub setup function should detect that flag, enable the static key, and 1835bae43864SDarrick J. Wongtry the scrub again. 1836bae43864SDarrick J. WongScrub teardown disables all static keys obtained by ``xchk_fsgates_enable``. 1837bae43864SDarrick J. Wong 1838bae43864SDarrick J. WongFor more information, please see the kernel documentation of 1839bae43864SDarrick J. WongDocumentation/staging/static-keys.rst. 18405f658dadSDarrick J. Wong 18415f658dadSDarrick J. Wong.. _xfile: 18425f658dadSDarrick J. Wong 18435f658dadSDarrick J. WongPageable Kernel Memory 18445f658dadSDarrick J. Wong---------------------- 18455f658dadSDarrick J. Wong 18465f658dadSDarrick J. WongSome online checking functions work by scanning the filesystem to build a 18475f658dadSDarrick J. Wongshadow copy of an ondisk metadata structure in memory and comparing the two 18485f658dadSDarrick J. Wongcopies. 18495f658dadSDarrick J. 
For online repair to rebuild a metadata structure, it must compute the record
set that will be stored in the new structure before it can persist that new
structure to disk.
Ideally, repairs complete with a single atomic commit that introduces
a new data structure.
To meet these goals, the kernel needs to collect a large amount of information
in a place that doesn't require the correct operation of the filesystem.

Kernel memory isn't suitable because:

* Allocating a contiguous region of memory to create a C array is very
  difficult, especially on 32-bit systems.

* Linked lists of records introduce double pointer overhead which is very high
  and eliminate the possibility of indexed lookups.

* Kernel memory is pinned, which can drive the system into OOM conditions.

* The system might not have sufficient memory to stage all the information.

At any given time, online fsck does not need to keep the entire record set in
memory, which means that individual records can be paged out if necessary.
Continued development of online fsck demonstrated that the ability to perform
indexed data storage would also be very useful.
Fortunately, the Linux kernel already has a facility for byte-addressable and
pageable storage: tmpfs.
In-kernel graphics drivers (most notably i915) take advantage of tmpfs files
to store intermediate data that doesn't need to be in memory at all times, so
that usage precedent is already established.
Hence, the ``xfile`` was born!

+--------------------------------------------------------------------------+
| **Historical Sidebar**:                                                  |
+--------------------------------------------------------------------------+
| The first edition of online repair inserted records into a new btree as  |
| it found them, which failed because the filesystem could shut down with  |
| a half-built data structure, which would be live after recovery          |
| finished.                                                                |
|                                                                          |
| The second edition solved the half-rebuilt structure problem by storing  |
| everything in memory, but frequently ran the system out of memory.       |
|                                                                          |
| The third edition solved the OOM problem by using linked lists, but the  |
| memory overhead of the list pointers was extreme.                        |
+--------------------------------------------------------------------------+

xfile Access Models
```````````````````

A survey of the intended uses of xfiles suggested these use cases:

1. Arrays of fixed-sized records (space management btrees, directory and
   extended attribute entries)

2. Sparse arrays of fixed-sized records (quotas and link counts)

3. Large binary objects (BLOBs) of variable sizes (directory and extended
   attribute names and values)

4. Staging btrees in memory (reverse mapping btrees)

5. Arbitrary contents (realtime space management)

To support the first four use cases, high level data structures wrap the xfile
to share functionality between online fsck functions.
The rest of this section discusses the interfaces that the xfile presents to
the higher level data structures that serve those four use cases.
The fifth use case is discussed in the :ref:`realtime summary <rtsummary>` case
study.

The most general storage interface supported by the xfile enables the reading
and writing of arbitrary quantities of data at arbitrary offsets in the xfile.
This capability is provided by ``xfile_pread`` and ``xfile_pwrite`` functions,
which behave similarly to their userspace counterparts.
XFS is very record-based, which suggests that the ability to load and store
complete records is important.
To support these cases, a pair of ``xfile_obj_load`` and ``xfile_obj_store``
functions are provided to read and persist objects into an xfile.
They are internally the same as pread and pwrite, except that they treat any
error as an out of memory error.
For online repair, squashing error conditions in this manner is an acceptable
behavior because the only reaction is to abort the operation back to userspace.
All five xfile use cases can be serviced by these four functions.

However, no discussion of file access idioms is complete without answering the
question, "But what about mmap?"
It is convenient to access storage directly with pointers, just like userspace
code does with regular memory.
Online fsck must not drive the system into OOM conditions, which means that
xfiles must be responsive to memory reclamation.
tmpfs can only push a pagecache folio to the swap cache if the folio is neither
pinned nor locked, which means the xfile must not pin too many folios.

Short term direct access to xfile contents is done by locking the pagecache
folio and mapping it into kernel address space.
Programmatic access (e.g. pread and pwrite) uses this mechanism.
Folio locks are not supposed to be held for long periods of time, so long
term direct access to xfile contents is done by bumping the folio refcount,
mapping it into kernel address space, and dropping the folio lock.
These long term users *must* be responsive to memory reclaim by hooking into
the shrinker infrastructure to know when to release folios.

The ``xfile_get_page`` and ``xfile_put_page`` functions are provided to
retrieve the (locked) folio that backs part of an xfile and to release it.
The only users of these folio lease functions are the xfarray
:ref:`sorting<xfarray_sort>` algorithms and the :ref:`in-memory
btrees<xfbtree>`.

xfile Access Coordination
`````````````````````````

For security reasons, xfiles must be owned privately by the kernel.
They are marked ``S_PRIVATE`` to prevent interference from the security system,
must never be mapped into process file descriptor tables, and their pages must
never be mapped into userspace processes.

To avoid locking recursion issues with the VFS, all accesses to the shmfs file
are performed by manipulating the page cache directly.
xfile writers call the ``->write_begin`` and ``->write_end`` functions of the
xfile's address space to grab writable pages, copy the caller's buffer into the
page, and release the pages.
xfile readers call ``shmem_read_mapping_page_gfp`` to grab pages directly
before copying the contents into the caller's buffer.
In other words, xfiles ignore the VFS read and write code paths to avoid
having to create a dummy ``struct kiocb`` and to avoid taking inode and
freeze locks.
tmpfs cannot be frozen, and xfiles must not be exposed to userspace.

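
As a rough userspace illustration of the page-at-a-time copying described
above, the following sketch splits an arbitrary read or write into per-page
chunks.
It is only a model: ``demo_mapping``, ``demo_pwrite``, ``demo_pread``, and the
fixed page size and page count are hypothetical names and values chosen for
this example, and plain heap buffers stand in for the page cache rather than
the kernel's ``->write_begin``, ``->write_end``, or
``shmem_read_mapping_page_gfp`` calls.

.. code-block:: c

	#include <errno.h>
	#include <stddef.h>
	#include <stdlib.h>
	#include <string.h>

	#define DEMO_PAGE_SIZE	4096u
	#define DEMO_NR_PAGES	16u

	/* Hypothetical stand-in for the page cache of one xfile. */
	struct demo_mapping {
		unsigned char	*pages[DEMO_NR_PAGES];	/* NULL: not populated */
	};

	/* Copy len bytes from buf into the mapping at pos, one page at a time. */
	int demo_pwrite(struct demo_mapping *m, const void *buf, size_t len,
			size_t pos)
	{
		const unsigned char	*src = buf;

		while (len > 0) {
			size_t	pgno = pos / DEMO_PAGE_SIZE;
			size_t	pgoff = pos % DEMO_PAGE_SIZE;
			size_t	chunk = DEMO_PAGE_SIZE - pgoff;

			if (chunk > len)
				chunk = len;
			if (pgno >= DEMO_NR_PAGES)
				return -EFBIG;

			/* "Grab" a writable page, allocating a zeroed one on demand. */
			if (!m->pages[pgno]) {
				m->pages[pgno] = calloc(1, DEMO_PAGE_SIZE);
				if (!m->pages[pgno])
					return -ENOMEM;
			}

			/* Copy the caller's data in; nothing stays held afterwards. */
			memcpy(m->pages[pgno] + pgoff, src, chunk);

			src += chunk;
			pos += chunk;
			len -= chunk;
		}
		return 0;
	}

	/* Copy len bytes at pos into buf; unpopulated pages read as zeroes. */
	int demo_pread(const struct demo_mapping *m, void *buf, size_t len,
		       size_t pos)
	{
		unsigned char	*dst = buf;

		while (len > 0) {
			size_t	pgno = pos / DEMO_PAGE_SIZE;
			size_t	pgoff = pos % DEMO_PAGE_SIZE;
			size_t	chunk = DEMO_PAGE_SIZE - pgoff;

			if (chunk > len)
				chunk = len;
			if (pgno >= DEMO_NR_PAGES)
				return -EFBIG;

			if (m->pages[pgno])
				memcpy(dst, m->pages[pgno] + pgoff, chunk);
			else
				memset(dst, 0, chunk);

			dst += chunk;
			pos += chunk;
			len -= chunk;
		}
		return 0;
	}

In this model, a 16-byte write that starts five bytes before a page boundary
populates exactly two pages, and reading a range that was never written returns
zeroes, much as a hole in a sparse file would.
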

If an xfile is shared between threads to stage repairs, the caller must provide
its own locks to coordinate access.
For example, if a scrub function stores scan results in an xfile and needs
other threads to provide updates to the scanned data, the scrub function must
provide a lock for all threads to share.

.. _xfarray:

Arrays of Fixed-Sized Records
`````````````````````````````

In XFS, each type of indexed space metadata (free space, inodes, reference
counts, file fork space, and reverse mappings) consists of a set of fixed-size
records indexed with a classic B+ tree.
Directories have a set of fixed-size dirent records that point to the names,
and extended attributes have a set of fixed-size attribute keys that point to
names and values.
Quota counters and file link counters index records with numbers.
During a repair, scrub needs to stage new records during the gathering step and
retrieve them during the btree building step.

Although this requirement can be satisfied by calling the read and write
methods of the xfile directly, callers are better served by a higher level
abstraction that takes care of computing array offsets, provides iterator
functions, and deals with sparse records and sorting.
The ``xfarray`` abstraction presents a linear array for fixed-size records atop
the byte-accessible xfile.

.. _xfarray_access_patterns:

Array Access Patterns
^^^^^^^^^^^^^^^^^^^^^

Array access patterns in online fsck tend to fall into three categories.
Iteration of records is assumed to be necessary for all cases and will be
covered in the next section.

The first type of caller handles records that are indexed by position.
Gaps may exist between records, and a record may be updated multiple times
during the collection step.
In other words, these callers want a sparse linearly addressed table file.
The typical use cases are quota records and file link count records.
Access to array elements is performed programmatically via ``xfarray_load`` and
``xfarray_store`` functions, which wrap the similarly-named xfile functions to
provide loading and storing of array elements at arbitrary array indices.
Gaps are defined to be null records, and null records are defined to be a
sequence of all zero bytes.
Null records are detected by calling ``xfarray_element_is_null``.
They are created either by calling ``xfarray_unset`` to null out an existing
record or by never storing anything to an array index.

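
The null record convention can be modelled in userspace with a zero-filled
buffer.
The sketch below is an illustration under stated assumptions, not the kernel
implementation: ``demo_array`` and its helper functions are hypothetical names,
and a heap allocation stands in for the xfile.

.. code-block:: c

	#include <stdbool.h>
	#include <stddef.h>
	#include <stdlib.h>
	#include <string.h>

	/* Hypothetical userspace stand-in for an xfarray. */
	struct demo_array {
		unsigned char	*buf;		/* zero-filled backing store */
		size_t		nr;		/* number of slots allocated */
		size_t		obj_size;	/* size of one record, in bytes */
	};

	/* Grow the backing store so that slot idx exists; new slots are zeroed. */
	int demo_array_grow(struct demo_array *a, size_t idx)
	{
		size_t		new_nr = idx + 1;
		unsigned char	*p;

		if (new_nr <= a->nr)
			return 0;
		p = realloc(a->buf, new_nr * a->obj_size);
		if (!p)
			return -1;
		memset(p + a->nr * a->obj_size, 0,
		       (new_nr - a->nr) * a->obj_size);
		a->buf = p;
		a->nr = new_nr;
		return 0;
	}

	/* Store a record at an arbitrary index; any gap before it stays null. */
	int demo_array_store(struct demo_array *a, size_t idx, const void *rec)
	{
		if (demo_array_grow(a, idx))
			return -1;
		memcpy(a->buf + idx * a->obj_size, rec, a->obj_size);
		return 0;
	}

	/* Load a record; slots that were never written read back as zeroes. */
	void demo_array_load(const struct demo_array *a, size_t idx, void *rec)
	{
		if (idx >= a->nr)
			memset(rec, 0, a->obj_size);
		else
			memcpy(rec, a->buf + idx * a->obj_size, a->obj_size);
	}

	/* Null out an existing record by overwriting it with zero bytes. */
	void demo_array_unset(struct demo_array *a, size_t idx)
	{
		if (idx < a->nr)
			memset(a->buf + idx * a->obj_size, 0, a->obj_size);
	}

	/* A record is null if every byte of it is zero. */
	bool demo_element_is_null(const struct demo_array *a, const void *rec)
	{
		const unsigned char	*p = rec;
		size_t			i;

		for (i = 0; i < a->obj_size; i++)
			if (p[i])
				return false;
		return true;
	}

Storing a record at index 5 of an empty array leaves indices 0 through 4 null,
and unsetting index 5 afterwards makes it indistinguishable from a slot that
was never written.
One consequence of this convention is that a legitimate record consisting
entirely of zero bytes cannot be told apart from a gap.
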

The second type of caller handles records that are not indexed by position
and do not require multiple updates to a record.
The typical use case here is rebuilding space btrees and key/value btrees.
These callers can add records to the array without caring about array indices
via the ``xfarray_append`` function, which stores a record at the end of the
array.
For callers that require records to be presentable in a specific order (e.g.
rebuilding btree data), the ``xfarray_sort`` function can arrange the
records in sorted order; this function will be covered later.

The third type of caller is a bag, which is useful for counting records.
The typical use case here is constructing space extent reference counts from
reverse mapping information.
Records can be put in the bag in any order, they can be removed from the bag
at any time, and uniqueness of records is left to callers.
The ``xfarray_store_anywhere`` function is used to insert a record in any
null record slot in the bag; and the ``xfarray_unset`` function removes a
record from the bag.

The proposed patchset is the
`big in-memory array
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=big-array>`_.

Iterating Array Elements
^^^^^^^^^^^^^^^^^^^^^^^^

Most users of the xfarray require the ability to iterate the records stored in
the array.
Callers can probe every possible array index with the following:

.. code-block:: c

	xfarray_idx_t i;
	foreach_xfarray_idx(array, i) {
		xfarray_load(array, i, &rec);

		/* do something with rec */
	}

All users of this idiom must be prepared to handle null records or must already
know that there aren't any.

For xfarray users that want to iterate a sparse array, the ``xfarray_iter``
function ignores indices in the xfarray that have never been written to by
calling ``xfile_seek_data`` (which internally uses ``SEEK_DATA``) to skip areas
of the array that are not populated with memory pages.
Once it finds a page, it will skip the zeroed areas of the page.

.. code-block:: c

	xfarray_idx_t i = XFARRAY_CURSOR_INIT;
	while ((ret = xfarray_iter(array, &i, &rec)) == 1) {
		/* do something with rec */
	}

.. _xfarray_sort:

Sorting Array Elements
^^^^^^^^^^^^^^^^^^^^^^

During the fourth demonstration of online repair, a community reviewer remarked
that for performance reasons, online repair ought to load batches of records
into btree record blocks instead of inserting records into a new btree one at a
time.
The btree insertion code in XFS is responsible for maintaining correct ordering
of the records, so naturally the xfarray must also support sorting the record
set prior to bulk loading.

Case Study: Sorting xfarrays
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The sorting algorithm used in the xfarray is actually a combination of adaptive
quicksort and a heapsort subalgorithm in the spirit of
`Sedgewick <https://algs4.cs.princeton.edu/23quicksort/>`_ and
`pdqsort <https://github.com/orlp/pdqsort>`_, with customizations for the Linux
kernel.
To sort records in a reasonably short amount of time, ``xfarray`` takes
advantage of the binary subpartitioning offered by quicksort, but it also uses
heapsort to hedge against performance collapse if the chosen quicksort pivots
are poor.
Both algorithms are (in general) O(n * lg(n)), but there is a wide performance
gulf between the two implementations.

The Linux kernel already contains a reasonably fast implementation of heapsort.
It only operates on regular C arrays, which limits the scope of its usefulness.
There are two key places where the xfarray uses it:

* Sorting any record subset backed by a single xfile page.

* Loading a small number of xfarray records from potentially disparate parts
  of the xfarray into a memory buffer, and sorting the buffer.

In other words, ``xfarray`` uses heapsort to constrain the nested recursion of
quicksort, thereby mitigating quicksort's worst runtime behavior.

Choosing a quicksort pivot is a tricky business.
A good pivot splits the set to sort in half, leading to the divide and conquer
behavior that is crucial to O(n * lg(n)) performance.
A poor pivot barely splits the subset at all, leading to O(n\ :sup:`2`)
runtime.
The xfarray sort routine tries to avoid picking a bad pivot by sampling nine
records into a memory buffer and using the kernel heapsort to identify the
median of the nine.

Most modern quicksort implementations employ Tukey's "ninther" to select a
pivot from a classic C array.
Typical ninther implementations pick three unique triads of records, sort each
of the triads, and then sort the middle value of each triad to determine the
ninther value.
As stated previously, however, xfile accesses are not entirely cheap.
It turned out to be much more performant to read the nine elements into a
memory buffer, run the kernel's in-memory heapsort on the buffer, and choose
the median (the 4th element of that buffer, counting from zero) as the pivot.
Tukey's ninthers are described in J. W. Tukey, `The ninther, a technique for
low-effort robust (resistant) location in large samples`, in *Contributions to
Survey Sampling and Applied Statistics*, edited by H. David, (Academic Press,
1978), pp. 251–257.

The partitioning of quicksort is fairly textbook -- rearrange the record
subset around the pivot, then set up the current and next stack frames to
sort with the larger and the smaller halves of the pivot, respectively.
This keeps the stack space requirements to log2(record count).

As a final performance optimization, the hi and lo scanning phase of quicksort
keeps examined xfile pages mapped in the kernel for as long as possible to
reduce map/unmap cycles.
Surprisingly, this reduces overall sort runtime by nearly half again after
accounting for the application of heapsort directly onto xfile pages.

.. _xfblob:

Blob Storage
````````````

Extended attributes and directories add an additional requirement for staging
records: arbitrary byte sequences of finite length.
Each directory entry record needs to store the entry name,
and each extended attribute needs to store both the attribute name and value.
The names, keys, and values can consume a large amount of memory, so the
``xfblob`` abstraction was created to simplify management of these blobs
atop an xfile.

Blob arrays provide ``xfblob_load`` and ``xfblob_store`` functions to retrieve
and persist objects.
The store function returns a magic cookie for every object that it persists.
Later, callers provide this cookie to ``xfblob_load`` to recall the object.
The ``xfblob_free`` function frees a specific blob, and the ``xfblob_truncate``
function frees them all because compaction is not needed.

The details of repairing directories and extended attributes will be discussed
in a subsequent section about atomic extent swapping.
However, it should be noted that these repair functions only use blob storage
to cache a small number of entries before adding them to a temporary ondisk
file, which is why compaction is not required.

The proposed patchset is at the start of the
`extended attribute repair
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-xattrs>`_ series.

.. _xfbtree:

In-Memory B+Trees
`````````````````

The chapter about :ref:`secondary metadata<secondary_metadata>` mentioned that
checking and repairing of secondary metadata commonly requires coordination
between a live metadata scan of the filesystem and writer threads that are
updating that metadata.
Keeping the scan data up to date requires the ability to propagate
metadata updates from the filesystem into the data being collected by the scan.
This *can* be done by appending concurrent updates into a separate log file and
applying them before writing the new metadata to disk, but this leads to
unbounded memory consumption if the rest of the system is very busy.
Another option is to skip the side-log and commit live updates from the
filesystem directly into the scan data, which trades more overhead for a lower
maximum memory requirement.
In both cases, the data structure holding the scan results must support indexed
access to perform well.

Given that indexed lookups of scan data are required for both strategies, online
fsck employs the second strategy of committing live updates directly into
scan data.
Because xfarrays are not indexed and do not enforce record ordering, they
are not suitable for this task.
Conveniently, however, XFS has a library to create and maintain ordered reverse
mapping records: the existing rmap btree code!
If only there was a means to create one in memory.

Recall that the :ref:`xfile <xfile>` abstraction represents memory pages as a
regular file, which means that the kernel can create byte or block addressable
virtual address spaces at will.
The XFS buffer cache specializes in abstracting IO to block-oriented address
spaces, which means that adaptation of the buffer cache to interface with
xfiles enables reuse of the entire btree library.
Btrees built atop an xfile are collectively known as ``xfbtrees``.
The next few sections describe how they actually work.

The proposed patchset is the
`in-memory btree
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=in-memory-btrees>`_
series.

Using xfiles as a Buffer Cache Target
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Two modifications are necessary to support xfiles as a buffer cache target.
The first is to make it possible for the ``struct xfs_buftarg`` structure to
host the ``struct xfs_buf`` rhashtable, because normally those are held by a
per-AG structure.
The second change is to modify the buffer ``ioapply`` function to "read" cached
pages from the xfile and "write" cached pages back to the xfile.
Multiple access to individual buffers is controlled by the ``xfs_buf`` lock,
since the xfile does not provide any locking on its own.
With this adaptation in place, users of the xfile-backed buffer cache use
exactly the same APIs as users of the disk-backed buffer cache.
The separation between xfile and buffer cache implies higher memory usage since
they do not share pages, but this property could some day enable transactional
updates to an in-memory btree.
Today, however, it simply eliminates the need for new code.

Space Management with an xfbtree
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Space management for an xfile is very simple -- each btree block is one memory
page in size.
These blocks use the same header format as an on-disk btree, but the in-memory
block verifiers ignore the checksums, assuming that xfile memory is no more
corruption-prone than regular DRAM.
Reusing existing code here is more important than absolute memory efficiency.

The very first block of an xfile backing an xfbtree is a header block.
The header describes the owner, height, and the block number of the root
xfbtree block.

To allocate a btree block, use ``xfile_seek_data`` to find a gap in the file.
If there are no gaps, create one by extending the length of the xfile.
Preallocate space for the block with ``xfile_prealloc``, and hand back the
location.
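
The allocation policy above might be sketched in userspace as follows.
This is a simplified model under stated assumptions: ``demo_xfile`` and
``demo_alloc_block`` are hypothetical names, one boolean per block stands in
for the presence of a memory page, and a linear scan replaces the
``SEEK_DATA``-based gap search.

.. code-block:: c

	#include <stdbool.h>
	#include <stddef.h>
	#include <stdlib.h>

	/* Hypothetical model of an xfile that holds one btree block per page. */
	struct demo_xfile {
		bool	*populated;	/* populated[i]: block i has a page */
		size_t	nr_blocks;	/* current length of the file, in blocks */
	};

	/*
	 * Allocate one block: reuse the first gap, or extend the file by one
	 * block if there are no gaps.  Returns 0 and sets *blockp on success,
	 * or -1 if memory could not be allocated.
	 */
	int demo_alloc_block(struct demo_xfile *xf, size_t *blockp)
	{
		size_t	i;
		bool	*p;

		/* Look for a hole left behind by a previously removed page. */
		for (i = 0; i < xf->nr_blocks; i++) {
			if (!xf->populated[i])
				goto found;
		}

		/* No gaps; grow the file so that block i exists. */
		p = realloc(xf->populated, (xf->nr_blocks + 1) * sizeof(*p));
		if (!p)
			return -1;
		xf->populated = p;
		xf->nr_blocks++;

	found:
		/* "Preallocate" the backing page and hand back the location. */
		xf->populated[i] = true;
		*blockp = i;
		return 0;
	}

In this model, clearing an entry of ``populated`` leaves a gap that the next
allocation reuses before the file is extended.
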
To free an xfbtree block, use ``xfile_discard`` (which internally uses
``FALLOC_FL_PUNCH_HOLE``) to remove the memory page from the xfile.

Populating an xfbtree
^^^^^^^^^^^^^^^^^^^^^

An online fsck function that wants to create an xfbtree should proceed as
follows:

1. Call ``xfile_create`` to create an xfile.

2. Call ``xfs_alloc_memory_buftarg`` to create a buffer cache target structure
   pointing to the xfile.

3. Pass the buffer cache target, buffer ops, and other information to
   ``xfbtree_create`` to write an initial tree header and root block to the
   xfile.
   Each btree type should define a wrapper that passes necessary arguments to
   the creation function.
   For example, rmap btrees define ``xfs_rmapbt_mem_create`` to take care of
   all the necessary details for callers.
   A ``struct xfbtree`` object will be returned.

4. Pass the xfbtree object to the btree cursor creation function for the
   btree type.
   Following the example above, ``xfs_rmapbt_mem_cursor`` takes care of this
   for callers.

5. Pass the btree cursor to the regular btree functions to make queries against
   and to update the in-memory btree.
   For example, a btree cursor for an rmap xfbtree can be passed to the
   ``xfs_rmap_*`` functions just like any other btree cursor.
   See the :ref:`next section<xfbtree_commit>` for information on dealing with
   xfbtree updates that are logged to a transaction.

6. When finished, delete the btree cursor, destroy the xfbtree object, free the
   buffer target, and then destroy the xfile to release all resources.

.. _xfbtree_commit:

Committing Logged xfbtree Buffers
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Although it is a clever hack to reuse the rmap btree code to handle the staging
structure, the ephemeral nature of the in-memory btree block storage presents
some challenges of its own.
The XFS transaction manager must not commit buffer log items for buffers backed
by an xfile because the log format does not understand updates for devices
other than the data device.
An ephemeral xfbtree probably will not exist by the time the AIL checkpoints
log transactions back into the filesystem, and certainly won't exist during
log recovery.
For these reasons, any code updating an xfbtree in transaction context must
remove the buffer log items from the transaction and write the updates into the
backing xfile before committing or cancelling the transaction.

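The requirement in the preceding paragraph can be pictured with a toy model.
This is only a sketch under stated assumptions: ``struct toy_log_item``,
``struct toy_trans``, and ``toy_detach_xfile_items`` are hypothetical names
invented for illustration and are not kernel structures or functions.
The sketch shows xfile-backed items being dropped from a transaction, with the
dirty flag cleared when no disk-backed dirty item remains; writing the detached
buffers to the xfile is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_ITEMS 16

/* Hypothetical stand-ins; these are not the kernel's structures. */
struct toy_log_item {
	bool targets_xfile;	/* buffer is backed by an xfile */
	bool dirty;		/* item was dirtied in this transaction */
};

struct toy_trans {
	struct toy_log_item items[TOY_MAX_ITEMS];
	size_t nr_items;
	bool dirty;
};

/*
 * Remove every xfile-backed item from the transaction, compacting the
 * surviving items in place.  Returns the number of items detached.
 */
static size_t
toy_detach_xfile_items(struct toy_trans *tp)
{
	size_t kept = 0, detached = 0, i;
	bool still_dirty = false;

	assert(tp->nr_items <= TOY_MAX_ITEMS);

	for (i = 0; i < tp->nr_items; i++) {
		if (tp->items[i].targets_xfile) {
			detached++;
			continue;
		}
		if (tp->items[i].dirty)
			still_dirty = true;
		tp->items[kept++] = tp->items[i];
	}
	tp->nr_items = kept;

	/* Only disk-backed items may keep the transaction dirty. */
	if (!still_dirty)
		tp->dirty = false;
	return detached;
}
```

After the call, the toy transaction contains only items that the log format
can describe, so committing or cancelling it cannot touch the ephemeral
storage.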

The ``xfbtree_trans_commit`` and ``xfbtree_trans_cancel`` functions implement
this functionality as follows:

1. Find each buffer log item whose buffer targets the xfile.

2. Record the dirty/ordered status of the log item.

3. Detach the log item from the buffer.

4. Queue the buffer to a special delwri list.

5. Clear the transaction dirty flag if the only dirty log items were the ones
   that were detached in step 3.

6. Submit the delwri list to commit the changes to the xfile, if the updates
   are being committed.

After removing xfile logged buffers from the transaction in this manner, the
transaction can be committed or cancelled.

Bulk Loading of Ondisk B+Trees
------------------------------

As mentioned previously, early iterations of online repair built new btree
structures by creating a new btree and adding observations individually.
Loading a btree one record at a time had a slight advantage of not requiring
the incore records to be sorted prior to commit, but was very slow and leaked
blocks if the system went down during a repair.
Loading records one at a time also meant that repair could not control the
loading factor of the blocks in the new btree.

Fortunately, the venerable ``xfs_repair`` tool had a more efficient means for
rebuilding a btree index from a collection of records -- bulk btree loading.
This was implemented rather inefficiently code-wise, since ``xfs_repair``
had separate copy-pasted implementations for each btree type.

To prepare for online fsck, each of the four bulk loaders was studied, notes
were taken, and the four were refactored into a single generic btree bulk
loading mechanism.
Those notes in turn have been refreshed and are presented below.

Geometry Computation
````````````````````

The zeroth step of bulk loading is to assemble the entire record set that will
be stored in the new btree, and sort the records.
Next, call ``xfs_btree_bload_compute_geometry`` to compute the shape of the
btree from the record set, the type of btree, and any load factor preferences.
This information is required for resource reservation.

First, the geometry computation computes the minimum and maximum records that
will fit in a leaf block from the size of a btree block and the size of the
block header.
Roughly speaking, the maximum number of records is::

        maxrecs = (block_size - header_size) / record_size


The XFS design specifies that btree blocks should be merged when possible,
which means the minimum number of records is half of maxrecs::

        minrecs = maxrecs / 2

The next variable to determine is the desired loading factor.
This must be at least minrecs and no more than maxrecs.
Choosing minrecs is undesirable because it wastes half the block.
Choosing maxrecs is also undesirable because adding a single record to each
newly rebuilt leaf block will cause a tree split, which causes a noticeable
drop in performance immediately afterwards.
The default loading factor was chosen to be 75% of maxrecs, which provides a
reasonably compact structure without any immediate split penalties::

        default_load_factor = (maxrecs + minrecs) / 2

If space is tight, the loading factor will be set to maxrecs to try to avoid
running out of space::

        leaf_load_factor = enough space ? default_load_factor : maxrecs

Load factor is computed for btree node blocks using the combined size of the
btree key and pointer as the record size::

        maxrecs = (block_size - header_size) / (key_size + ptr_size)
        minrecs = maxrecs / 2
        node_load_factor = enough space ? default_load_factor : maxrecs

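The load factor arithmetic above can be sketched in a few lines of C.
This is an illustration only, not the kernel's implementation:
``struct toy_load_factor`` and ``toy_compute_load_factor`` are hypothetical
names, and the sizes used in the note that follows are made-up numbers rather
than real XFS block, header, or record sizes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical illustration; not the kernel's bulk loading structures. */
struct toy_load_factor {
	size_t maxrecs;		/* most entries that fit in one block */
	size_t minrecs;		/* half of maxrecs */
	size_t load_factor;	/* entries to place in each new block */
};

/*
 * entry_size is record_size for leaf blocks, or key_size + ptr_size for
 * node blocks.
 */
static struct toy_load_factor
toy_compute_load_factor(size_t block_size, size_t header_size,
			size_t entry_size, bool enough_space)
{
	struct toy_load_factor lf;

	assert(entry_size > 0 && block_size > header_size);

	lf.maxrecs = (block_size - header_size) / entry_size;
	lf.minrecs = lf.maxrecs / 2;

	/* Halfway between minrecs and maxrecs is about 75% of maxrecs. */
	if (enough_space)
		lf.load_factor = (lf.maxrecs + lf.minrecs) / 2;
	else
		lf.load_factor = lf.maxrecs;
	return lf;
}
```

With an assumed 4096-byte block, a 56-byte header, and 16-byte records,
maxrecs is 252, minrecs is 126, and the default leaf load factor is 189.
Passing ``false`` for ``enough_space`` packs each block with all 252 records.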

Once that's done, the number of leaf blocks required to store the record set
can be computed as::

        leaf_blocks = ceil(record_count / leaf_load_factor)

The number of node blocks needed to point to the next level down in the tree
is computed as::

        n_blocks = (n == 0 ? leaf_blocks : node_blocks[n])
        node_blocks[n + 1] = ceil(n_blocks / node_load_factor)

The entire computation is performed recursively until the current level only
needs one block.
The resulting geometry is as follows:

- For AG-rooted btrees, this level is the root level, so the height of the new
  tree is ``level + 1`` and the space needed is the summation of the number of
  blocks on each level.

- For inode-rooted btrees where the records in the top level do not fit in the
  inode fork area, the height is ``level + 2``, the space needed is the
  summation of the number of blocks on each level, and the inode fork points to
  the root block.

- For inode-rooted btrees where the records in the top level can be stored in
  the inode fork area, then the root block can be stored in the inode, the
  height is ``level + 1``, and the space needed is one less than the summation
  of the number of blocks on each level.
  This only becomes relevant when non-bmap btrees gain the ability to root in
  an inode, which is a future patchset and only included here for completeness.

.. _newbt:

Reserving New B+Tree Blocks
```````````````````````````

Once repair knows the number of blocks needed for the new btree, it allocates
those blocks using the free space information.
Each reserved extent is tracked separately by the btree builder state data.
To improve crash resilience, the reservation code also logs an Extent Freeing
Intent (EFI) item in the same transaction as each space allocation and attaches
its in-memory ``struct xfs_extent_free_item`` object to the space reservation.
If the system goes down, log recovery will use the unfinished EFIs to free the
unused space, leaving the filesystem unchanged.

Each time the btree builder claims a block for the btree from a reserved
extent, it updates the in-memory reservation to reflect the claimed space.
Block reservation tries to allocate as much contiguous space as possible to
reduce the number of EFIs in play.

While repair is writing these new btree blocks, the EFIs created for the space
reservations pin the tail of the ondisk log.
It's possible that other parts of the system will remain busy and push the head
of the log towards the pinned tail.
To avoid livelocking the filesystem, the EFIs must not pin the tail of the log
for too long.
To alleviate this problem, the dynamic relogging capability of the deferred ops
mechanism is reused here to commit a transaction at the log head containing an
EFD for the old EFI and a new EFI at the head.
This enables the log to release the old EFI to keep the log moving forwards.

EFIs have a role to play during the commit and reaping phases; please see the
next section and the section about :ref:`reaping<reaping>` for more details.

Proposed patchsets are the
`bitmap rework
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-bitmap-rework>`_
and the
`preparation for bulk loading btrees
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-prep-for-bulk-loading>`_.


Writing the New Tree
````````````````````

This part is pretty simple -- the btree builder (``xfs_btree_bulkload``) claims
a block from the reserved list, writes the new btree block header, fills the
rest of the block with records, and adds the new leaf block to a list of
written blocks::

          ┌────┐
          │leaf│
          │RRR │
          └────┘

Sibling pointers are set every time a new block is added to the level::

          ┌────┐ ┌────┐ ┌────┐ ┌────┐
          │leaf│→│leaf│→│leaf│→│leaf│
          │RRR │←│RRR │←│RRR │←│RRR │
          └────┘ └────┘ └────┘ └────┘

When it finishes writing the record leaf blocks, it moves on to the node
blocks.
To fill a node block, it walks each block in the next level down in the tree
to compute the relevant keys and write them into the parent node::

              ┌────┐       ┌────┐
              │node│──────→│node│
              │PP  │←──────│PP  │
              └────┘       └────┘
              ↙   ↘         ↙   ↘
          ┌────┐ ┌────┐ ┌────┐ ┌────┐
          │leaf│→│leaf│→│leaf│→│leaf│
          │RRR │←│RRR │←│RRR │←│RRR │
          └────┘ └────┘ └────┘ └────┘

When it reaches the root level, it is ready to commit the new btree!::

                  ┌─────────┐
                  │  root   │
                  │   PP    │
                  └─────────┘
                  ↙         ↘
              ┌────┐       ┌────┐
              │node│──────→│node│
              │PP  │←──────│PP  │
              └────┘       └────┘
              ↙   ↘         ↙   ↘
          ┌────┐ ┌────┐ ┌────┐ ┌────┐
          │leaf│→│leaf│→│leaf│→│leaf│
          │RRR │←│RRR │←│RRR │←│RRR │
          └────┘ └────┘ └────┘ └────┘

The first step to commit the new btree is to persist the btree blocks to disk
synchronously.
This is a little complicated because a new btree block could have been freed
in the recent past, so the builder must use ``xfs_buf_delwri_queue_here`` to
remove the (stale) buffer from the AIL list before it can write the new blocks
to disk.
Blocks are queued for IO using a delwri list and written in one large batch
with ``xfs_buf_delwri_submit``.

Once the new blocks have been persisted to disk, control returns to the
individual repair function that called the bulk loader.
The repair function must log the location of the new root in a transaction,
clean up the space reservations that were made for the new btree, and reap the
old metadata blocks:

1. Commit the location of the new btree root.

2. For each incore reservation:

   a. Log Extent Freeing Done (EFD) items for all the space that was consumed
      by the btree builder.  The new EFDs must point to the EFIs attached to
      the reservation to prevent log recovery from freeing the new blocks.

   b. For unclaimed portions of incore reservations, create a regular deferred
      extent free work item to free the unused space later in the
      transaction chain.

   c. The EFDs and EFIs logged in steps 2a and 2b must not overrun the
      reservation of the committing transaction.
      If the btree loading code suspects this might be about to happen, it must
      call ``xrep_defer_finish`` to clear out the deferred work and obtain a
      fresh transaction.

3. Clear out the deferred work a second time to finish the commit and clean
   the repair transaction.

The transaction rolling in steps 2c and 3 represents a weakness in the repair
algorithm, because a log flush and a crash before the end of the reap step can
result in space leaking.
Online repair functions minimize the chances of this occurring by using very
large transactions, which each can accommodate many thousands of block freeing
instructions.
Repair moves on to reaping the old blocks, which will be presented in a
subsequent :ref:`section<reaping>` after a few case studies of bulk loading.

Case Study: Rebuilding the Inode Index
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The high level process to rebuild the inode index btree is:

1. Walk the reverse mapping records to generate ``struct xfs_inobt_rec``
   records from the inode chunk information and a bitmap of the old inode btree
   blocks.

2. Append the records to an xfarray in inode order.

3. Use the ``xfs_btree_bload_compute_geometry`` function to compute the number
   of blocks needed for the inode btree.
   If the free space inode btree is enabled, call it again to estimate the
   geometry of the finobt.

4. Allocate the number of blocks computed in the previous step.

5. Use ``xfs_btree_bload`` to write the xfarray records to btree blocks and
   generate the internal node blocks.
   If the free space inode btree is enabled, call it again to load the finobt.

6. Commit the location of the new btree root block(s) to the AGI.

7. Reap the old btree blocks using the bitmap created in step 1.

Details are as follows.

The inode btree maps inumbers to the ondisk location of the associated
inode records, which means that the inode btrees can be rebuilt from the
reverse mapping information.
Reverse mapping records with an owner of ``XFS_RMAP_OWN_INOBT`` mark the
location of the old inode btree blocks.
Each reverse mapping record with an owner of ``XFS_RMAP_OWN_INODES`` marks the
location of at least one inode cluster buffer.
A cluster is the smallest number of ondisk inodes that can be allocated or
freed in a single transaction; it is never smaller than 1 fs block or 4 inodes.

For the space represented by each inode cluster, ensure that there are no
records in the free space btrees nor any records in the reference count btree.
If there are, the space metadata inconsistencies are reason enough to abort the
operation.
Otherwise, read each cluster buffer to check that its contents appear to be
ondisk inodes and to decide if the file is allocated
(``xfs_dinode.i_mode != 0``) or free (``xfs_dinode.i_mode == 0``).
Accumulate the results of successive inode cluster buffer reads until there is
enough information to fill a single inode chunk record, which is 64 consecutive
numbers in the inumber keyspace.
If the chunk is sparse, the chunk record may include holes.

Once the repair function accumulates one chunk's worth of data, it calls
``xfarray_append`` to add the inode btree record to the xfarray.
This xfarray is walked twice during the btree creation step -- once to populate
the inode btree with all inode chunk records, and a second time to populate the
free inode btree with records for chunks that have free non-sparse inodes.
The number of records for the inode btree is the number of xfarray records,
but the record count for the free inode btree has to be computed as inode chunk
records are stored in the xfarray.

The proposed patchset is the
`AG btree repair
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_
series.

Case Study: Rebuilding the Space Reference Counts
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Reverse mapping records are used to rebuild the reference count information.
Reference counts are required for correct operation of copy on write for shared
file data.
Imagine the reverse mapping entries as rectangles representing extents of
physical blocks, and that the rectangles can be laid down to allow them to
overlap each other.
From the diagram below, it is apparent that a reference count record must start
or end wherever the height of the stack changes.
In other words, the record emission stimulus is level-triggered::

                        █    ███
              ██      █████ ████   ███        ██████
        ██   ████     ███████████ ████     █████████
        ████████████████████████████████ ███████████
        ^ ^  ^^ ^^    ^ ^^ ^^^  ^^^^  ^ ^^ ^  ^     ^
        2 1  23 21    3 43 234  2123  1 01 2  3     0

The ondisk reference count btree does not store the refcount == 0 cases because
the free space btree already records which blocks are free.
Extents being used to stage copy-on-write operations should be the only records
with refcount == 1.
Single-owner file blocks aren't recorded in either the free space or the
reference count btrees.

The high level process to rebuild the reference count btree is:

1. Walk the reverse mapping records to generate ``struct xfs_refcount_irec``
   records for any space having more than one reverse mapping and add them to
   the xfarray.
   Any records owned by ``XFS_RMAP_OWN_COW`` are also added to the xfarray
   because these are extents allocated to stage a copy on write operation and
   are tracked in the refcount btree.

   Use any records owned by ``XFS_RMAP_OWN_REFC`` to create a bitmap of old
   refcount btree blocks.

2. Sort the records in physical extent order, putting the CoW staging extents
   at the end of the xfarray.
   This matches the sorting order of records in the refcount btree.

3. Use the ``xfs_btree_bload_compute_geometry`` function to compute the number
   of blocks needed for the new tree.

4. Allocate the number of blocks computed in the previous step.

5. Use ``xfs_btree_bload`` to write the xfarray records to btree blocks and
   generate the internal node blocks.


6. Commit the location of the new btree root block to the AGF.

7. Reap the old btree blocks using the bitmap created in step 1.

Details are as follows; the same algorithm is used by ``xfs_repair`` to
generate refcount information from reverse mapping records.

- Until the reverse mapping btree runs out of records:

  - Retrieve the next record from the btree and put it in a bag.

  - Collect all records with the same starting block from the btree and put
    them in the bag.

  - While the bag isn't empty:

    - Among the mappings in the bag, compute the lowest block number where the
      reference count changes.
      This position will be either the starting block number of the next
      unprocessed reverse mapping or the next block after the shortest mapping
      in the bag.

    - Remove all mappings from the bag that end at this position.

    - Collect all reverse mappings that start at this position from the btree
      and put them in the bag.

    - If the size of the bag changed and is greater than one, create a new
      refcount record associating the block number range that we just walked to
      the size of the bag.

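The bag walk above can be sketched as a small standalone C function.
This is a simplified model, not the kernel or ``xfs_repair`` code: the
``toy_`` names are hypothetical, the bag is a fixed-size array instead of an
xfarray, the input is a plain array already sorted by starting block, and CoW
staging extents are ignored.
It also assumes that the count recorded for a range is the size of the bag
while that range was being walked, that is, the size before the change.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_BAG_MAX 64

struct toy_rmap { uint64_t start; uint64_t len; };
struct toy_refc { uint64_t start; uint64_t len; unsigned int refcount; };

/*
 * rmaps[] must be sorted by start block.  Stores at most out_max records in
 * out[] (further records are dropped) and returns the number stored.
 */
static size_t
toy_compute_refcounts(const struct toy_rmap *rmaps, size_t nr,
		      struct toy_refc *out, size_t out_max)
{
	struct toy_rmap bag[TOY_BAG_MAX];
	size_t bag_nr, i = 0, nr_out = 0, j;

	while (i < nr) {
		/* Seed the bag with every mapping that starts at this block. */
		uint64_t cbno = rmaps[i].start;	/* start of the walked range */

		bag_nr = 0;
		while (i < nr && rmaps[i].start == cbno) {
			assert(bag_nr < TOY_BAG_MAX);
			bag[bag_nr++] = rmaps[i++];
		}

		while (bag_nr > 0) {
			size_t old_nr = bag_nr;
			/* Lowest block number where the count changes. */
			uint64_t nbno = bag[0].start + bag[0].len;

			for (j = 1; j < bag_nr; j++)
				if (bag[j].start + bag[j].len < nbno)
					nbno = bag[j].start + bag[j].len;
			if (i < nr && rmaps[i].start < nbno)
				nbno = rmaps[i].start;

			/* Remove the mappings that end at this position. */
			for (j = 0; j < bag_nr; ) {
				if (bag[j].start + bag[j].len == nbno)
					bag[j] = bag[--bag_nr];
				else
					j++;
			}

			/* Collect the mappings that start at this position. */
			while (i < nr && rmaps[i].start == nbno) {
				assert(bag_nr < TOY_BAG_MAX);
				bag[bag_nr++] = rmaps[i++];
			}

			if (bag_nr != old_nr) {
				/* Shared ranges only; refcount 1 is not stored. */
				if (old_nr > 1 && nr_out < out_max) {
					out[nr_out].start = cbno;
					out[nr_out].len = nbno - cbno;
					out[nr_out].refcount = (unsigned int)old_nr;
					nr_out++;
				}
				cbno = nbno;
			}
		}
	}
	return nr_out;
}
```

For example, the mappings [0, 10), [5, 15), [5, 8), and [20, 24) produce two
records: blocks 5 through 7 with a count of 3, and blocks 8 and 9 with a count
of 2.
Blocks with a single owner produce no record.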

The bag-like structure in this case is a type 2 xfarray as discussed in the
:ref:`xfarray access patterns<xfarray_access_patterns>` section.
Reverse mappings are added to the bag using ``xfarray_store_anywhere`` and
removed via ``xfarray_unset``.
Bag members are examined through ``xfarray_iter`` loops.

The proposed patchset is the
`AG btree repair
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_
series.

Case Study: Rebuilding File Fork Mapping Indices
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The high level process to rebuild a data/attr fork mapping btree is:

1. Walk the reverse mapping records to generate ``struct xfs_bmbt_rec``
   records from the reverse mapping records for that inode and fork.
   Append these records to an xfarray.
   Compute the bitmap of the old bmap btree blocks from the ``BMBT_BLOCK``
   records.

2. Use the ``xfs_btree_bload_compute_geometry`` function to compute the number
   of blocks needed for the new tree.

3. Sort the records in file offset order.

4. If the extent records would fit in the inode fork immediate area, commit the
   records to that immediate area and skip to step 8.
5. Allocate the number of blocks computed in step 2.

6. Use ``xfs_btree_bload`` to write the xfarray records to btree blocks and
   generate the internal node blocks.

7. Commit the new btree root block to the inode fork immediate area.

8. Reap the old btree blocks using the bitmap created in step 1.

There are some complications here:
First, it's possible to move the fork offset to adjust the sizes of the
immediate areas if the data and attr forks are not both in BMBT format.
Second, if there are sufficiently few fork mappings, it may be possible to use
EXTENTS format instead of BMBT, which may require a conversion.
Third, the incore extent map must be reloaded carefully to avoid disturbing
any delayed allocation extents.

The proposed patchset is the
`file mapping repair
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-file-mappings>`_
series.

.. _reaping:

Reaping Old Metadata Blocks
---------------------------

Whenever online fsck builds a new data structure to replace one that is
suspect, there is a question of how to find and dispose of the blocks that
belonged to the old structure.
The laziest method of course is not to deal with them at all, but this slowly
leads to service degradations as space leaks out of the filesystem.
Hopefully, someone will schedule a rebuild of the free space information to
plug all those leaks.
Offline repair rebuilds all space metadata after recording the usage of
the files and directories that it decides not to clear, hence it can build new
structures in the discovered free space and avoid the question of reaping.

As part of a repair, online fsck relies heavily on the reverse mapping records
to find space that is owned by the corresponding rmap owner yet truly free.
Cross referencing rmap records with other rmap records is necessary because
there may be other data structures that also think they own some of those
blocks (e.g. crosslinked trees).
Permitting the block allocator to hand them out again will not push the system
towards consistency.

For space metadata, the process of finding extents to dispose of generally
follows this format:

1. Create a bitmap of space used by data structures that must be preserved.
   The space reservations used to create the new metadata can be used here if
   the same rmap owner code is used to denote all of the objects being rebuilt.

2. Survey the reverse mapping data to create a bitmap of space owned by the
   same ``XFS_RMAP_OWN_*`` number for the metadata that is being preserved.

3. Use the bitmap disunion operator to subtract (1) from (2).
   The remaining set bits represent candidate extents that could be freed.
   The process moves on to step 4 below.

Repairs for file-based metadata such as extended attributes, directories,
symbolic links, quota files and realtime bitmaps are performed by building a
new structure attached to a temporary file and swapping the forks.
Afterward, the mappings in the old file fork are the candidate blocks for
disposal.

The process for disposing of old extents is as follows:

4. For each candidate extent, count the number of reverse mapping records for
   the first block in that extent that do not have the same rmap owner for the
   data structure being repaired.

   - If zero, the block has a single owner and can be freed.

   - If not, the block is part of a crosslinked structure and must not be
     freed.

5. Starting with the next block in the extent, figure out how many more blocks
   have the same zero/nonzero other owner status as that first block.

6. If the region is crosslinked, delete the reverse mapping entry for the
   structure being repaired and move on to the next region.

7. If the region is to be freed, mark any corresponding buffers in the buffer
   cache as stale to prevent log writeback.

8. Free the region and move on.

However, there is one complication to this procedure.
Transactions are of finite size, so the reaping process must be careful to roll
the transactions to avoid overruns.
Overruns come from two sources:

a. EFIs logged on behalf of space that is no longer occupied

b. Log items for buffer invalidations

This is also a window in which a crash during the reaping process can leak
blocks.
As stated earlier, online repair functions use very large transactions to
minimize the chances of this occurring.

The proposed patchset is the
`preparation for bulk loading btrees
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-prep-for-bulk-loading>`_
series.

Case Study: Reaping After a Regular Btree Repair
````````````````````````````````````````````````

Old reference count and inode btrees are the easiest to reap because they have
rmap records with special owner codes: ``XFS_RMAP_OWN_REFC`` for the refcount
btree, and ``XFS_RMAP_OWN_INOBT`` for the inode and free inode btrees.
Creating a list of extents to reap the old btree blocks is quite simple,
conceptually:

1. Lock the relevant AGI/AGF header buffers to prevent allocation and frees.

2. For each reverse mapping record with an rmap owner corresponding to the
   metadata structure being rebuilt, set the corresponding range in a bitmap.

3. Walk the current data structures that have the same rmap owner.
   For each block visited, clear that range in the above bitmap.

4. Each set bit in the bitmap represents a block that could be a block from the
   old data structures and hence is a candidate for reaping.
   In other words, ``(rmap_records_owned_by & ~blocks_reachable_by_walk)``
   are the blocks that might be freeable.

If it is possible to maintain the AGF lock throughout the repair (which is the
common case), then step 2 can be performed at the same time as the reverse
mapping record walk that creates the records for the new btree.

Case Study: Rebuilding the Free Space Indices
`````````````````````````````````````````````

The high level process to rebuild the free space indices is:

1. Walk the reverse mapping records to generate ``struct xfs_alloc_rec_incore``
   records from the gaps in the reverse mapping btree.

2. Append the records to an xfarray.

3. Use the ``xfs_btree_bload_compute_geometry`` function to compute the number
   of blocks needed for each new tree.

4. Allocate the number of blocks computed in the previous step from the free
   space information collected.

5. Use ``xfs_btree_bload`` to write the xfarray records to btree blocks and
   generate the internal node blocks for the free space by length index.
   Call it again for the free space by block number index.

6. Commit the locations of the new btree root blocks to the AGF.

7. Reap the old btree blocks by looking for space that is not recorded by the
   reverse mapping btree, the new free space btrees, or the AGFL.

Repairing the free space btrees has three key complications over a regular
btree repair:

First, free space is not explicitly tracked in the reverse mapping records.
Hence, the new free space records must be inferred from gaps in the physical
space component of the keyspace of the reverse mapping btree.

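The gap inference described in the first complication can be sketched in
Python as follows.
This is a simplified illustration rather than the kernel implementation:
``free_extents_from_rmaps`` and ``ag_blocks`` are hypothetical names, each
reverse mapping is reduced to a ``(start, length)`` pair, and the sketch
assumes that every block in use (including the AG headers) is covered by at
least one reverse mapping.

```python
from typing import List, Tuple

Extent = Tuple[int, int]  # (start block, length in blocks)


def free_extents_from_rmaps(rmaps: List[Extent], ag_blocks: int) -> List[Extent]:
    """Infer free extents from the gaps between reverse mappings."""
    free: List[Extent] = []
    next_unmapped = 0  # lowest block not yet covered by any mapping seen
    for start, length in sorted(rmaps):
        if start > next_unmapped:
            # Nothing claims [next_unmapped, start), so treat it as free.
            free.append((next_unmapped, start - next_unmapped))
        # Mappings may overlap (shared blocks), so never move backwards.
        next_unmapped = max(next_unmapped, start + length)
    if next_unmapped < ag_blocks:
        # Space between the last mapping and the end of the AG.
        free.append((next_unmapped, ag_blocks - next_unmapped))
    return free


def by_block_number(free: List[Extent]) -> List[Extent]:
    """Record order for the free space by block number index."""
    return sorted(free)


def by_length(free: List[Extent]) -> List[Extent]:
    """Record order for the free space by length index."""
    return sorted(free, key=lambda e: (e[1], e[0]))
```

The same list of inferred extents feeds both indexes; only the sort order
differs, which mirrors the two ``xfs_btree_bload`` calls in step 5.
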
Second, free space repairs cannot use the common btree reservation code because
new blocks are reserved out of the free space btrees.
This is impossible when repairing the free space btrees themselves.
However, repair holds the AGF buffer lock for the duration of the free space
index reconstruction, so it can use the collected free space information to
supply the blocks for the new free space btrees.
It is not necessary to back each reserved extent with an EFI because the new
free space btrees are constructed in what the ondisk filesystem thinks is
unowned space.
However, if reserving blocks for the new btrees from the collected free space
information changes the number of free space records, repair must re-estimate
the new free space btree geometry with the new record count until the
reservation is sufficient.
As part of committing the new btrees, repair must ensure that reverse mappings
are created for the reserved blocks and that unused reserved blocks are
inserted into the free space btrees.
Deferred rmap and freeing operations are used to ensure that this transition
is atomic, similar to the other btree repair functions.

Third, finding the blocks to reap after the repair is not overly
straightforward.
Blocks for the free space btrees and the reverse mapping btrees are supplied by
the AGFL.
Blocks put onto the AGFL have reverse mapping records with the owner
``XFS_RMAP_OWN_AG``.
This ownership is retained when blocks move from the AGFL into the free space
btrees or the reverse mapping btrees.
When repair walks reverse mapping records to synthesize free space records, it
creates a bitmap (``ag_owner_bitmap``) of all the space claimed by
``XFS_RMAP_OWN_AG`` records.
The repair context maintains a second bitmap corresponding to the rmap btree
blocks and the AGFL blocks (``rmap_agfl_bitmap``).
When the walk is complete, the bitmap disunion operation ``(ag_owner_bitmap &
~rmap_agfl_bitmap)`` computes the extents that are used by the old free space
btrees.
These blocks can then be reaped using the methods outlined above.

The proposed patchset is the
`AG btree repair
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_
series.

.. _rmap_reap:

Case Study: Reaping After Repairing Reverse Mapping Btrees
``````````````````````````````````````````````````````````

Old reverse mapping btrees are less difficult to reap after a repair.
As mentioned in the previous section, blocks on the AGFL, the two free space
btree blocks, and the reverse mapping btree blocks all have reverse mapping
records with ``XFS_RMAP_OWN_AG`` as the owner.
The full process of gathering reverse mapping records and building a new btree
is described in the case study of
:ref:`live rebuilds of rmap data <rmap_repair>`, but a crucial point from that
discussion is that the new rmap btree will not contain any records for the old
rmap btree, nor will the old btree blocks be tracked in the free space btrees.
The list of candidate reaping blocks is computed by setting the bits
corresponding to the gaps in the new rmap btree records, and then clearing the
bits corresponding to extents in the free space btrees and the current AGFL
blocks.
The resulting blocks ``(new_rmapbt_gaps & ~(agfl | bnobt_records))`` are reaped
using the methods outlined above.

The rest of the process of rebuilding the reverse mapping btree is discussed
in a separate :ref:`case study<rmap_repair>`.

The proposed patchset is the
`AG btree repair
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_
series.

Case Study: Rebuilding the AGFL
```````````````````````````````

The allocation group free block list (AGFL) is repaired as follows:

1. Create a bitmap for all the space that the reverse mapping data claims is
   owned by ``XFS_RMAP_OWN_AG``.

2. Subtract the space used by the two free space btrees and the rmap btree.

3. Subtract any space that the reverse mapping data claims is owned by any
   other owner, to avoid re-adding crosslinked blocks to the AGFL.

4. Once the AGFL is full, reap any blocks left over.

5. The next operation to fix the freelist will right-size the list.

See `fs/xfs/scrub/agheader_repair.c <https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/xfs/scrub/agheader_repair.c>`_ for more details.

Inode Record Repairs
--------------------

Inode records must be handled carefully, because they have both ondisk records
("dinodes") and an in-memory ("cached") representation.
There is a very high potential for cache coherency issues if online fsck is not
careful to access the ondisk metadata *only* when the ondisk metadata is so
badly damaged that the filesystem cannot load the in-memory representation.
When online fsck wants to open a damaged file for scrubbing, it must use
specialized resource acquisition functions that return either the in-memory
representation *or* a lock on whichever object is necessary to prevent any
update to the ondisk location.

The only repairs that should be made to the ondisk inode buffers are whatever
is necessary to get the in-core structure loaded.
This means fixing whatever is caught by the inode cluster buffer and inode fork
verifiers, and retrying the ``iget`` operation.
If the second ``iget`` fails, the repair has failed.

Once the in-memory representation is loaded, repair can lock the inode and can
subject it to comprehensive checks, repairs, and optimizations.
Most inode attributes are easy to check and constrain, or are user-controlled
arbitrary bit patterns; these are both easy to fix.
Dealing with the data and attr fork extent counts and the file block counts is
more complicated, because computing the correct value requires traversing the
forks, or if that fails, leaving the fields invalid and waiting for the fork
fsck functions to run.

The proposed patchset is the
`inode
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-inodes>`_
repair series.

Quota Record Repairs
--------------------

Similar to inodes, quota records ("dquots") also have both ondisk records and
an in-memory representation, and hence are subject to the same cache coherency
issues.
Somewhat confusingly, both are known as dquots in the XFS codebase.

The only repairs that should be made to the ondisk quota record buffers are
whatever is necessary to get the in-core structure loaded.
Once the in-memory representation is loaded, the only attributes needing
checking are obviously bad limits and timer values.

Quota usage counters are checked, repaired, and discussed separately in the
section about :ref:`live quotacheck <quotacheck>`.

The proposed patchset is the
`quota
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quota>`_
repair series.

.. _fscounters:

Freezing to Fix Summary Counters
--------------------------------

Filesystem summary counters track availability of filesystem resources such
as free blocks, free inodes, and allocated inodes.
This information could be compiled by walking the free space and inode indexes,
but this is a slow process, so XFS maintains a copy in the ondisk superblock
that should reflect the ondisk metadata, at least when the filesystem has been
unmounted cleanly.
For performance reasons, XFS also maintains incore copies of those counters,
which are key to enabling resource reservations for active transactions.
Writer threads reserve the worst-case quantities of resources from the
incore counter and give back whatever they don't use at commit time.
It is therefore only necessary to serialize on the superblock when the
superblock is being committed to disk.

The lazy superblock counter feature introduced in XFS v5 took this even further
by training log recovery to recompute the summary counters from the AG headers,
which eliminated the need for most transactions even to touch the superblock.
The only time XFS commits the summary counters is at filesystem unmount.
To reduce contention even further, the incore counter is implemented as a
percpu counter, which means that each CPU is allocated a batch of blocks from a
global incore counter and can satisfy small allocations from the local batch.

The high-performance nature of the summary counters makes it difficult for
online fsck to check them, since there is no way to quiesce a percpu counter
while the system is running.
Although online fsck can read the filesystem metadata to compute the correct
values of the summary counters, there's no way to hold the value of a percpu
counter stable, so it's quite possible that the counter will be out of date by
the time the walk is complete.
Earlier versions of online scrub would return to userspace with an incomplete
scan flag, but this is not a satisfying outcome for a system administrator.
For repairs, the in-memory counters must be stabilized while walking the
filesystem metadata to get an accurate reading and install it in the percpu
counter.

To satisfy this requirement, online fsck must prevent other programs in the
system from initiating new writes to the filesystem, it must disable background
garbage collection threads, and it must wait for existing writer programs to
exit the kernel.
Once that has been established, scrub can walk the AG free space indexes, the
inode btrees, and the realtime bitmap to compute the correct value of all
four summary counters.
This is very similar to a filesystem freeze, though not all of the pieces are
necessary:

- The final freeze state is set one higher than ``SB_FREEZE_COMPLETE`` to
  prevent other threads from thawing the filesystem, or other scrub threads
  from initiating another fscounters freeze.

- It does not quiesce the log.

With this code in place, it is now possible to pause the filesystem for just
long enough to check and correct the summary counters.

+--------------------------------------------------------------------------+
| **Historical Sidebar**:                                                  |
+--------------------------------------------------------------------------+
| The initial implementation used the actual VFS filesystem freeze         |
| mechanism to quiesce filesystem activity.                                |
| With the filesystem frozen, it is possible to resolve the counter values |
| with exact precision, but there are many problems with calling the VFS   |
| methods directly:                                                        |
|                                                                          |
| - Other programs can unfreeze the filesystem without our knowledge.      |
|   This leads to incorrect scan results and incorrect repairs.            |
|                                                                          |
| - Adding an extra lock to prevent others from thawing the filesystem     |
|   required the addition of a ``->freeze_super`` function to wrap         |
|   ``freeze_fs()``.                                                       |
|   This in turn caused other subtle problems because it turns out that    |
|   the VFS ``freeze_super`` and ``thaw_super`` functions can drop the     |
|   last reference to the VFS superblock, and any subsequent access        |
|   becomes a UAF bug!                                                     |
|   This can happen if the filesystem is unmounted while the underlying    |
|   block device has frozen the filesystem.                                |
|   This problem could be solved by grabbing extra references to the       |
|   superblock, but it felt suboptimal given the other inadequacies of     |
|   this approach.                                                         |
|                                                                          |
| - The log need not be quiesced to check the summary counters, but a VFS  |
|   freeze initiates one anyway.                                           |
|   This adds unnecessary runtime to live fscounter fsck operations.       |
|                                                                          |
| - Quiescing the log means that XFS flushes the (possibly incorrect)      |
|   counters to disk as part of cleaning the log.                          |
|                                                                          |
| - A bug in the VFS meant that freeze could complete even when            |
|   sync_filesystem fails to flush the filesystem and returns an error.    |
|   This bug was fixed in Linux 5.17.                                      |
+--------------------------------------------------------------------------+

The proposed patchset is the
`summary counter cleanup
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-fscounters>`_
series.

Full Filesystem Scans
---------------------

Certain types of metadata can only be checked by walking every file in the
entire filesystem to record observations and comparing the observations against
what's recorded on disk.
Like every other type of online repair, repairs are made by writing those
observations to disk in a replacement structure and committing it atomically.
However, it is not practical to shut down the entire filesystem to examine
hundreds of billions of files because the downtime would be excessive.
Therefore, online fsck must build the infrastructure to manage a live scan of
all the files in the filesystem.
There are two questions that need to be answered to perform a live walk:

- How does scrub manage the scan while it is collecting data?

- How does the scan keep abreast of changes being made to the system by other
  threads?

.. _iscan:

Coordinated Inode Scans
```````````````````````

In the original Unix filesystems of the 1970s, each directory entry contained
an index number (*inumber*) which was used as an index into an ondisk array
(*itable*) of fixed-size records (*inodes*) describing a file's attributes and
its data block mapping.
This system is described by J. Lions, `"inode (5659)"
<http://www.lemis.com/grog/Documentation/Lions/>`_ in *Lions' Commentary on
UNIX, 6th Edition*, (Dept. of Computer Science, the University of New South
Wales, November 1977), pp. 18-2; and later by D. Ritchie and K. Thompson,
`"Implementation of the File System"
<https://archive.org/details/bstj57-6-1905/page/n8/mode/1up>`_, from *The UNIX
Time-Sharing System*, (The Bell System Technical Journal, July 1978), pp.
1913-4.

XFS retains most of this design, except now inumbers are search keys over all
the space in the data section of the filesystem.
They form a continuous keyspace that can be expressed as a 64-bit integer,
though the inodes themselves are sparsely distributed within the keyspace.
Scans proceed in a linear fashion across the inumber keyspace, starting from
``0x0`` and ending at ``0xFFFFFFFFFFFFFFFF``.
Naturally, a scan through a keyspace requires a scan cursor object to track the
scan progress.
Because this keyspace is sparse, this cursor contains two parts.
The first part of this scan cursor object tracks the inode that will be
examined next; call this the examination cursor.
Somewhat less obviously, the scan cursor object must also track which parts of
the keyspace have already been visited, which is critical for deciding if a
concurrent filesystem update needs to be incorporated into the scan data.
Call this the visited inode cursor.

Advancing the scan cursor is a multi-step process encapsulated in
``xchk_iscan_iter``:

1. Lock the AGI buffer of the AG containing the inode pointed to by the visited
   inode cursor.
   This guarantees that inodes in this AG cannot be allocated or freed while
   advancing the cursor.

2. Use the per-AG inode btree to look up the next inumber after the one that
   was just visited, since it may not be keyspace adjacent.

3. If there are no more inodes left in this AG:

   a. Move the examination cursor to the point of the inumber keyspace that
      corresponds to the start of the next AG.

   b. Adjust the visited inode cursor to indicate that it has "visited" the
      last possible inode in the current AG's inode keyspace.
      XFS inumbers are segmented, so the cursor needs to be marked as having
      visited the entire keyspace up to just before the start of the next AG's
      inode keyspace.

   c. Unlock the AGI and return to step 1 if there are unexamined AGs in the
      filesystem.

   d. If there are no more AGs to examine, set both cursors to the end of the
      inumber keyspace.
      The scan is now complete.

4. Otherwise, there is at least one more inode to scan in this AG:

   a. Move the examination cursor ahead to the next inode marked as allocated
      by the inode btree.

   b. Adjust the visited inode cursor to point to the inode just prior to where
      the examination cursor is now.
      Because the scanner holds the AGI buffer lock, no inodes could have been
      created in the part of the inode keyspace that the visited inode cursor
      just advanced.

5. Get the incore inode for the inumber of the examination cursor.
   By maintaining the AGI buffer lock until this point, the scanner knows that
   it was safe to advance the examination cursor across the entire keyspace,
   and that it has stabilized this next inode so that it cannot disappear from
   the filesystem until the scan releases the incore inode.

6. Drop the AGI lock and return the incore inode to the caller.

Online fsck functions scan all files in the filesystem as follows:

1. Start a scan by calling ``xchk_iscan_start``.

2. Advance the scan cursor (``xchk_iscan_iter``) to get the next inode.
   If one is provided:

   a. Lock the inode to prevent updates during the scan.

   b. Scan the inode.

   c. While still holding the inode lock, adjust the visited inode cursor
      (``xchk_iscan_mark_visited``) to point to this inode.

   d. Unlock and release the inode.

3. Call ``xchk_iscan_teardown`` to complete the scan.

There are subtleties with the inode cache that complicate grabbing the incore
inode for the caller.
First, it is an absolute requirement that the inode metadata be consistent
enough to load it into the inode cache.
Second, if the incore inode is stuck in some intermediate state, the scan
coordinator must release the AGI and push the main filesystem to get the inode
back into a loadable state.

The proposed patches are the
`inode scanner
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-iscan>`_
series.
The first user of the new functionality is the
`online quotacheck
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quotacheck>`_
series.

Inode Management
````````````````

In regular filesystem code, references to allocated XFS incore inodes are
always obtained (``xfs_iget``) outside of transaction context because the
creation of the incore context for an existing file does not require metadata
updates.
However, it is important to note that references to incore inodes obtained as
part of file creation must be performed in transaction context because the
filesystem must ensure the atomicity of the ondisk inode btree index updates
and the initialization of the actual ondisk inode.

References to incore inodes are always released (``xfs_irele``) outside of
transaction context because there are a handful of activities that might
require ondisk updates:

- The VFS may decide to kick off writeback as part of a ``DONTCACHE`` inode
  release.

- Speculative preallocations need to be unreserved.

- An unlinked file may have lost its last reference, in which case the entire
  file must be inactivated, which involves releasing all of its resources in
  the ondisk metadata and freeing the inode.

These activities are collectively called inode inactivation.
Inactivation has two parts -- the VFS part, which initiates writeback on all
dirty file pages, and the XFS part, which cleans up XFS-specific information
and frees the inode if it was unlinked.
If the inode is unlinked (or unconnected after a file handle operation), the
kernel drops the inode into the inactivation machinery immediately.

During normal operation, resource acquisition for an update follows this order
to avoid deadlocks:

1. Inode reference (``iget``).

2. Filesystem freeze protection, if repairing (``mnt_want_write_file``).

3. Inode ``IOLOCK`` (VFS ``i_rwsem``) lock to control file IO.

4. Inode ``MMAPLOCK`` (page cache ``invalidate_lock``) lock for operations that
   can update page cache mappings.

5. Log feature enablement.

6. Transaction log space grant.

7. Space on the data and realtime devices for the transaction.

8. Incore dquot references, if a file is being repaired.
   Note that they are not locked, merely acquired.

9. Inode ``ILOCK`` for file metadata updates.

10. AG header buffer locks / Realtime metadata inode ILOCK.

11. Realtime metadata buffer locks, if applicable.

12. Extent mapping btree blocks, if applicable.

Resources are often released in the reverse order, though this is not required.
However, online fsck differs from regular XFS operations because it may examine
an object that normally is acquired in a later stage of the locking order, and
then decide to cross-reference the object with an object that is acquired
earlier in the order.
The next few sections detail the specific ways in which online fsck takes care
to avoid deadlocks.

iget and irele During a Scrub
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

An inode scan performed on behalf of a scrub operation runs in transaction
context, and possibly with resources already locked and bound to it.
This isn't much of a problem for ``iget`` since it can operate in the context
of an existing transaction, as long as all of the bound resources are acquired
before the inode reference in the regular filesystem.

When the VFS ``iput`` function is given a linked inode with no other
references, it normally puts the inode on an LRU list in the hope that it can
save time if another process re-opens the file before the system runs out
of memory and frees it.
Filesystem callers can short-circuit the LRU process by setting a ``DONTCACHE``
flag on the inode to cause the kernel to try to drop the inode into the
inactivation machinery immediately.

In the past, inactivation was always done from the process that dropped the
inode, which was a problem for scrub because scrub may already hold a
transaction, and XFS does not support nesting transactions.
On the other hand, if there is no scrub transaction, it is desirable to drop
otherwise unused inodes immediately to avoid polluting caches.
To capture these nuances, the online fsck code has a separate ``xchk_irele``
function to set or clear the ``DONTCACHE`` flag to get the required release
behavior.

Proposed patchsets include fixing
`scrub iget usage
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-iget-fixes>`_ and
`dir iget usage
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-dir-iget-fixes>`_.

.. _ilocking:

Locking Inodes
^^^^^^^^^^^^^^

In regular filesystem code, the VFS and XFS will acquire multiple IOLOCK locks
in a well-known order: parent → child when updating the directory tree, and
in numerical order of the addresses of their ``struct inode`` objects otherwise.
For regular files, the MMAPLOCK can be acquired after the IOLOCK to stop page
faults.
If two MMAPLOCKs must be acquired, they are acquired in numerical order of
the addresses of their ``struct address_space`` objects.
Due to the structure of existing filesystem code, IOLOCKs and MMAPLOCKs must be
acquired before transactions are allocated.
If two ILOCKs must be acquired, they are acquired in inumber order.

Inode lock acquisition must be done carefully during a coordinated inode scan.
Online fsck cannot abide by these conventions, because for a directory tree
scanner, the scrub process holds the IOLOCK of the file being scanned and it
needs to take the IOLOCK of the file at the other end of the directory link.
If the directory tree is corrupt because it contains a cycle, ``xfs_scrub``
cannot use the regular inode locking functions and avoid becoming trapped in an
ABBA deadlock.
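
The address-ordering convention, and the out-of-order case that a scanner runs
into, can be illustrated with a small userspace sketch.
This is not XFS code: pthread mutexes stand in for inode locks, and
``demo_inode``, ``demo_lock_two``, and ``demo_trylock_second`` are hypothetical
names invented for this illustration.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdint.h>

/* Hypothetical stand-in for an inode; the mutex plays the role of an IOLOCK. */
struct demo_inode {
	uint64_t	ino;
	pthread_mutex_t	lock;
};

/* Regular convention: take two locks of the same class in address order. */
static void demo_lock_two(struct demo_inode *a, struct demo_inode *b)
{
	if ((uintptr_t)a > (uintptr_t)b) {
		struct demo_inode *tmp = a;

		a = b;
		b = tmp;
	}
	pthread_mutex_lock(&a->lock);
	pthread_mutex_lock(&b->lock);
}

static void demo_unlock_two(struct demo_inode *a, struct demo_inode *b)
{
	pthread_mutex_unlock(&a->lock);
	pthread_mutex_unlock(&b->lock);
}

/*
 * Scanner case: "held" is already locked and "other" may sort before it, so
 * sleeping on "other" could deadlock.  Try once; on failure, drop the lock
 * that is held and let the caller start over and revalidate its state.
 * Returns 0 with both locks held, or an errno value with neither held.
 */
static int demo_trylock_second(struct demo_inode *held,
			       struct demo_inode *other)
{
	int error = pthread_mutex_trylock(&other->lock);

	if (error) {
		pthread_mutex_unlock(&held->lock);
		return error;
	}
	return 0;
}
```

A caller that gets an error back from ``demo_trylock_second`` holds no locks at
all, so anything it observed under the first lock has to be checked again after
relocking.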

Solving both of these problems is straightforward -- any time online fsck
needs to take a second lock of the same class, it uses trylock to avoid an ABBA
deadlock.
If the trylock fails, scrub drops all inode locks and uses trylock loops to
(re)acquire all necessary resources.
Trylock loops enable scrub to check for pending fatal signals, which is how
scrub avoids deadlocking the filesystem or becoming an unresponsive process.
However, trylock loops mean that online fsck must be prepared to measure the
resource being scrubbed before and after the lock cycle to detect changes and
react accordingly.

.. _dirparent:

Case Study: Finding a Directory Parent
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Consider the directory parent pointer repair code as an example.
Online fsck must verify that the dotdot dirent of a directory points up to a
parent directory, and that the parent directory contains exactly one dirent
pointing down to the child directory.
Fully validating this relationship (and repairing it if possible) requires a
walk of every directory on the filesystem while holding the child locked, and
while updates to the directory tree are being made.
The coordinated inode scan provides a way to walk the filesystem without the
possibility of missing an inode.
The child directory is kept locked to prevent updates to the dotdot dirent, but
if the scanner fails to lock a parent, it can drop and relock both the child
and the prospective parent.
If the dotdot entry changes while the directory is unlocked, then a move or
rename operation must have changed the child's parentage, and the scan can
exit early.

The proposed patchset is the
`directory repair
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-dirs>`_
series.

.. _fshooks:

Filesystem Hooks
````````````````

The second piece of support that online fsck functions need during a full
filesystem scan is the ability to stay informed about updates being made by
other threads in the filesystem, since comparisons against the past are useless
in a dynamic environment.
Two pieces of Linux kernel infrastructure enable online fsck to monitor regular
filesystem operations: filesystem hooks and :ref:`static keys<jump_labels>`.

Filesystem hooks convey information about an ongoing filesystem operation to
a downstream consumer.
In this case, the downstream consumer is always an online fsck function.
Because multiple fsck functions can run in parallel, online fsck uses the Linux
notifier call chain facility to dispatch updates to any number of interested
fsck processes.
Call chains are a dynamic list, which means that they can be configured at
run time.
Because these hooks are private to the XFS module, the information passed along
contains exactly what the checking function needs to update its observations.

The current implementation of XFS hooks uses SRCU notifier chains to reduce the
impact to highly threaded workloads.
Regular blocking notifier chains use a rwsem and seem to have a much lower
overhead for single-threaded applications.
However, it may turn out that the combination of blocking chains and static
keys is more performant; more study is needed here.

The following pieces are necessary to hook a certain point in the filesystem:

- A ``struct xfs_hooks`` object must be embedded in a convenient place such as
  a well-known incore filesystem object.

- Each hook must define an action code and a structure containing more context
  about the action.

- Hook providers should provide appropriate wrapper functions and structs
  around the ``xfs_hooks`` and ``xfs_hook`` objects to take advantage of type
  checking to ensure correct usage.

- A callsite in the regular filesystem code must be chosen to call
  ``xfs_hooks_call`` with the action code and data structure.
  This place should be adjacent to (and not earlier than) the place where
  the filesystem update is committed to the transaction.
  In general, when the filesystem calls a hook chain, it should be able to
  handle sleeping and should not be vulnerable to memory reclaim or locking
  recursion.
  However, the exact requirements are very dependent on the context of the hook
  caller and the callee.

- The online fsck function should define a structure to hold scan data, a lock
  to coordinate access to the scan data, and a ``struct xfs_hook`` object.
  The scanner function and the regular filesystem code must acquire resources
  in the same order; see the next section for details.

- The online fsck code must contain a C function to catch the hook action code
  and data structure.
  If the object being updated has already been visited by the scan, then the
  hook information must be applied to the scan data.

- Prior to unlocking inodes to start the scan, online fsck must call
  ``xfs_hooks_setup`` to initialize the ``struct xfs_hook``, and
  ``xfs_hooks_add`` to enable the hook.

- Online fsck must call ``xfs_hooks_del`` to disable the hook once the scan is
  complete.

The number of hooks should be kept to a minimum to reduce complexity.
Static keys are used to reduce the overhead of filesystem hooks to nearly
zero when online fsck is not running.

.. _liveupdate:

Live Updates During a Scan
``````````````````````````

The code paths of the online fsck scanning code and the :ref:`hooked<fshooks>`
filesystem code look like this::

            other program
                  ↓
            inode lock ←────────────────────┐
                  ↓                         │
            AG header lock                  │
                  ↓                         │
            filesystem function             │
                  ↓                         │
            notifier call chain             │    same
                  ↓                         ├─── inode
            scrub hook function             │    lock
                  ↓                         │
            scan data mutex ←──┐    same    │
                  ↓            ├─── scan    │
            update scan data   │    lock    │
                  ↑            │            │
            scan data mutex ←──┘            │
                  ↑                         │
            inode lock ←────────────────────┘
                  ↑
            scrub function
                  ↑
            inode scanner
                  ↑
            xfs_scrub

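The two code paths in the diagram can be sketched in userspace C.
This is an illustration under simplifying assumptions, not the XFS
implementation: pthread mutexes replace the kernel locks, the hook is a direct
function call rather than a notifier chain, the visited inode cursor is a
single inumber, and every ``demo_`` name is hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical inode: the mutex is the "inode lock" in the diagram. */
struct demo_inode {
	uint64_t	ino;
	int64_t		nblocks;
	pthread_mutex_t	lock;
};

/* Hypothetical scan context: the mutex is the "scan data mutex". */
struct demo_scan {
	pthread_mutex_t	lock;
	uint64_t	visited_ino;	/* every inumber <= this was scanned */
	bool		any_visited;
	int64_t		total_blocks;	/* the scan data */
};

/* Simplified predicate: has the scan already visited this inumber? */
static bool demo_want_live_update(const struct demo_scan *sc, uint64_t ino)
{
	return sc->any_visited && ino <= sc->visited_ino;
}

/* Scrub hook function: runs with the inode lock held by the caller. */
static void demo_hook(struct demo_scan *sc, const struct demo_inode *ip,
		      int64_t delta)
{
	pthread_mutex_lock(&sc->lock);
	if (demo_want_live_update(sc, ip->ino))
		sc->total_blocks += delta;
	pthread_mutex_unlock(&sc->lock);
}

/* Filesystem function: inode lock, make the update, then call the hook. */
static void demo_fs_add_blocks(struct demo_scan *sc, struct demo_inode *ip,
			       int64_t delta)
{
	pthread_mutex_lock(&ip->lock);
	ip->nblocks += delta;
	if (sc)
		demo_hook(sc, ip, delta);
	pthread_mutex_unlock(&ip->lock);
}

/* Scrub function: same inode lock, then the scan data mutex. */
static void demo_scan_inode(struct demo_scan *sc, struct demo_inode *ip)
{
	pthread_mutex_lock(&ip->lock);
	pthread_mutex_lock(&sc->lock);
	sc->total_blocks += ip->nblocks;
	/* Mark the inode visited before dropping the inode lock. */
	sc->visited_ino = ip->ino;
	sc->any_visited = true;
	pthread_mutex_unlock(&sc->lock);
	pthread_mutex_unlock(&ip->lock);
}
```

Because both paths take the inode lock before the scan data mutex, an update to
a given inode lands either entirely before that inode is scanned (and is picked
up by the scan itself) or entirely after (and is picked up by the hook).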
These rules must be followed to ensure correct interactions between the
checking code and the code making an update to the filesystem:

- Prior to invoking the notifier call chain, the filesystem function being
  hooked must acquire the same lock that the scrub scanning function acquires
  to scan the inode.

- The scanning function and the scrub hook function must coordinate access to
  the scan data by acquiring a lock on the scan data.

- Scrub hook functions must not add the live update information to the scan
  observations unless the inode being updated has already been scanned.
  The scan coordinator has a helper predicate (``xchk_iscan_want_live_update``)
  for this.

- Scrub hook functions must not change the caller's state, including the
  transaction that it is running.
  They must not acquire any resources that might conflict with the filesystem
  function being hooked.

- The hook function can abort the inode scan to avoid breaking the other rules.

The inode scan APIs are pretty simple:

- ``xchk_iscan_start`` starts a scan

- ``xchk_iscan_iter`` grabs a reference to the next inode in the scan or
  returns zero if there is nothing left to scan

- ``xchk_iscan_want_live_update`` decides if an inode has already been
  visited in the scan.
  This is critical for hook functions to decide if they need to update the
  in-memory scan information.

- ``xchk_iscan_mark_visited`` marks an inode as having been visited in the
  scan

- ``xchk_iscan_teardown`` finishes the scan

This functionality is also a part of the
`inode scanner
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-iscan>`_
series.

.. _quotacheck:

Case Study: Quota Counter Checking
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

It is useful to compare the mount time quotacheck code to the online repair
quotacheck code.
Mount time quotacheck does not have to contend with concurrent operations, so
it does the following:

1. Make sure the ondisk dquots are in good enough shape that all the incore
   dquots will actually load, and zero the resource usage counters in the
   ondisk buffer.

2. Walk every inode in the filesystem.
   Add each file's resource usage to the incore dquot.

3. Walk each incore dquot.
   If the incore dquot is not being flushed, add the ondisk buffer backing the
   incore dquot to a delayed write (delwri) list.

4. Write the buffer list to disk.

Like most online fsck functions, online quotacheck can't write to regular
filesystem objects until the newly collected metadata reflect all filesystem
state.
Therefore, online quotacheck records file resource usage to a shadow dquot
index implemented with a sparse ``xfarray``, and only writes to the real dquots
once the scan is complete.
Handling transactional updates is tricky because quota resource usage updates
are handled in phases to minimize contention on dquots:

1. The inodes involved are joined and locked to a transaction.

2. For each dquot attached to the file:

   a. The dquot is locked.

   b. A quota reservation is added to the dquot's resource usage.
      The reservation is recorded in the transaction.

   c. The dquot is unlocked.

3. Changes in actual quota usage are tracked in the transaction.

4. At transaction commit time, each dquot is examined again:

   a. The dquot is locked again.

   b. Quota usage changes are logged and unused reservation is given back to
      the dquot.

   c. The dquot is unlocked.

For online quotacheck, hooks are placed in steps 2 and 4.
The step 2 hook creates a shadow version of the transaction dquot context
(``dqtrx``) that operates in a similar manner to the regular code.
The step 4 hook commits the shadow ``dqtrx`` changes to the shadow dquots.
Notice that both hooks are called with the inode locked, which is how the
live update coordinates with the inode scanner.

The quotacheck scan looks like this:

1. Set up a coordinated inode scan.

2. For each inode returned by the inode scan iterator:

   a. Grab and lock the inode.

   b. Determine that inode's resource usage (data blocks, inode counts,
      realtime blocks) and add that to the shadow dquots for the user, group,
      and project ids associated with the inode.

   c. Unlock and release the inode.

3. For each dquot in the system:

   a. Grab and lock the dquot.

   b. Check the dquot against the shadow dquots created by the scan and updated
      by the live hooks.

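The scan and compare steps above can be sketched as a small userspace program.
It is a simplified illustration, not the kernel code: it tracks only user ids,
uses a small fixed array where the real code uses a sparse ``xfarray``, and
omits all locking and live hooks.
All ``demo_`` names and ``DEMO_MAX_ID`` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_ID	8	/* hypothetical: ids 0..7 only */

/* Hypothetical file record: owner id and data block usage. */
struct demo_file {
	uint32_t	uid;
	uint64_t	blocks;
};

/* Hypothetical dquot: inode count and block count for one id. */
struct demo_dquot {
	uint64_t	icount;
	uint64_t	bcount;
};

/* Scan step: add every file's usage to the shadow dquot of its owner. */
static void demo_quotacheck_scan(const struct demo_file *files, size_t nr,
				 struct demo_dquot *shadow)
{
	for (size_t i = 0; i < nr; i++) {
		if (files[i].uid >= DEMO_MAX_ID)
			continue;	/* outside this sketch's id range */
		shadow[files[i].uid].icount++;
		shadow[files[i].uid].bcount += files[i].blocks;
	}
}

/* Compare step: count the ids whose real counters disagree with the shadow. */
static size_t demo_quotacheck_compare(const struct demo_dquot *real,
				      const struct demo_dquot *shadow)
{
	size_t bad = 0;

	for (size_t id = 0; id < DEMO_MAX_ID; id++) {
		if (real[id].icount != shadow[id].icount ||
		    real[id].bcount != shadow[id].bcount)
			bad++;
	}
	return bad;
}

/* Repair step: set the real counters to the shadow values. */
static void demo_quotacheck_repair(struct demo_dquot *real,
				   const struct demo_dquot *shadow)
{
	for (size_t id = 0; id < DEMO_MAX_ID; id++)
		real[id] = shadow[id];
}
```

Keeping the shadow counters separate from the real ones is what allows the scan
to run without writing to any live dquot until the comparison is finished.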
Live updates are key to being able to walk every quota record without
needing to hold any locks for a long duration.
If repairs are desired, the real and shadow dquots are locked and their
resource counts are set to the values in the shadow dquot.

The proposed patchset is the
`online quotacheck
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quotacheck>`_
series.

.. _nlinks:

Case Study: File Link Count Checking
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

File link count checking also uses live update hooks.
The coordinated inode scanner is used to visit all directories on the
filesystem, and per-file link count records are stored in a sparse ``xfarray``
indexed by inumber.
During the scanning phase, each entry in a directory generates observation
data as follows:

1. If the entry is a dotdot (``'..'``) entry of the root directory, the
   directory's parent link count is bumped because the root directory's dotdot
   entry is self referential.

2. If the entry is a dotdot entry of a subdirectory, the parent's backref
   count is bumped.

3. If the entry is neither a dot nor a dotdot entry, the target file's parent
   count is bumped.

4. If the target is a subdirectory, the parent's child link count is bumped.

A crucial point to understand about how the link count inode scanner interacts
with the live update hooks is that the scan cursor tracks which *parent*
directories have been scanned.
In other words, the live updates ignore any update about ``A → B`` when A has
not been scanned, even if B has been scanned.
Furthermore, a subdirectory A with a dotdot entry pointing back to B is
accounted as a backref counter in the shadow data for A, since child dotdot
entries affect the parent's link count.
Live update hooks are carefully placed in all parts of the filesystem that
create, change, or remove directory entries, since those operations involve
bumplink and droplink.

For any file, the correct link count is the number of parents plus the number
of child subdirectories.
Non-directories never have children of any kind.
The backref information is used to detect inconsistencies in the number of
links pointing to child subdirectories and the number of dotdot entries
pointing back.

After the scan completes, the link count of each file can be checked by locking
both the inode and the shadow data, and comparing the link counts.
A second coordinated inode scan cursor is used for comparisons.
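
The four observation rules and the link count formula above can be written
down as a small userspace sketch.
All names (``nl_rec``, ``nl_observe``, ``nl_expected``, ``nl_backrefs_ok``) are
hypothetical, a fixed array stands in for the sparse ``xfarray``, and the
backref test shown here (backrefs equal to children) is one assumed reading of
how the backref information is compared.
Dot entries generate no observation in this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define NL_MAX_INO 16	/* hypothetical: tiny inumber space for illustration */

/* Hypothetical shadow link count record for one file. */
struct nl_rec {
	uint32_t parents;	/* dirents elsewhere that point at this file */
	uint32_t backrefs;	/* dotdot entries of children pointing back */
	uint32_t children;	/* dirents here that point at subdirectories */
};

/* Record the observation generated by one entry of directory @dir. */
static void nl_observe(struct nl_rec *recs, uint64_t root, uint64_t dir,
		       const char *name, uint64_t target, bool target_is_dir)
{
	assert(dir < NL_MAX_INO && target < NL_MAX_INO);

	if (!strcmp(name, "."))
		return;				/* no observation */
	if (!strcmp(name, "..")) {
		if (dir == root)
			recs[dir].parents++;	/* rule 1: self referential */
		else
			recs[target].backrefs++; /* rule 2: parent's backref */
		return;
	}
	recs[target].parents++;			/* rule 3 */
	if (target_is_dir)
		recs[dir].children++;		/* rule 4 */
}

/* Correct link count: parents plus child subdirectories. */
static uint32_t nl_expected(const struct nl_rec *rec)
{
	return rec->parents + rec->children;
}

/* Assumed consistency test between child links and dotdot backrefs. */
static bool nl_backrefs_ok(const struct nl_rec *rec)
{
	return rec->backrefs == rec->children;
}
```

For a root directory (inumber 1) holding one subdirectory (inumber 2) that in
turn holds one regular file (inumber 3), this model yields link counts of 2, 1,
and 1 respectively.
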
Live updates are key to being able to walk every inode without needing to hold
any locks between inodes.
If repairs are desired, the inode's link count is set to the value in the
shadow information.
If no parents are found, the file must be :ref:`reparented <orphanage>` to the
orphanage to prevent the file from being lost forever.

The proposed patchset is the
`file link count repair
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-nlinks>`_
series.

.. _rmap_repair:

Case Study: Rebuilding Reverse Mapping Records
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Most repair functions follow the same pattern: lock filesystem resources,
walk the surviving ondisk metadata looking for replacement metadata records,
and use an :ref:`in-memory array <xfarray>` to store the gathered observations.
The primary advantage of this approach is the simplicity and modularity of the
repair code -- code and data are entirely contained within the scrub module,
do not require hooks in the main filesystem, and are usually the most efficient
in memory use.
A secondary advantage of this repair approach is atomicity -- once the kernel
decides a structure is corrupt, no other threads can access the metadata until
the kernel finishes repairing and revalidating the metadata.

For repairs going on within a shard of the filesystem, these advantages
outweigh the delays inherent in locking the shard while repairing parts of the
shard.
Unfortunately, repairs to the reverse mapping btree cannot use the "standard"
btree repair strategy because it must scan every space mapping of every fork of
every file in the filesystem, and the filesystem cannot stop.
Therefore, rmap repair forgoes atomicity between scrub and repair.
It combines a :ref:`coordinated inode scanner <iscan>`, :ref:`live update hooks
<liveupdate>`, and an :ref:`in-memory rmap btree <xfbtree>` to complete the
scan for reverse mapping records.

1. Set up an xfbtree to stage rmap records.

2. While holding the locks on the AGI and AGF buffers acquired during the
   scrub, generate reverse mappings for all AG metadata: inodes, btrees, CoW
   staging extents, and the internal log.

3. Set up an inode scanner.

4. Hook into rmap updates for the AG being repaired so that the live scan data
   can receive updates to the rmap btree from the rest of the filesystem during
   the file scan.

5. For each space mapping found in either fork of each file scanned,
   decide if the mapping matches the AG of interest.
   If so:

   a. Create a btree cursor for the in-memory btree.

   b. Use the rmap code to add the record to the in-memory btree.

   c. Use the :ref:`special commit function <xfbtree_commit>` to write the
      xfbtree changes to the xfile.

6. For each live update received via the hook, decide if the owner has already
   been scanned.
   If so, apply the live update into the scan data:

   a. Create a btree cursor for the in-memory btree.

   b. Replay the operation into the in-memory btree.

   c. Use the :ref:`special commit function <xfbtree_commit>` to write the
      xfbtree changes to the xfile.
      This is performed with an empty transaction to avoid changing the
      caller's state.

7. When the inode scan finishes, create a new scrub transaction and relock the
   two AG headers.

8. Compute the new btree geometry using the number of rmap records in the
   shadow btree, like all other btree rebuilding functions.

9. Allocate the number of blocks computed in the previous step.

10. Perform the usual btree bulk loading and commit to install the new rmap
    btree.

11. Reap the old rmap btree blocks as discussed in the case study about how
    to :ref:`reap after rmap btree repair <rmap_reap>`.

12. Free the xfbtree now that it is no longer needed.

The proposed patchset is the
`rmap repair
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-rmap-btree>`_
series.

Staging Repairs with Temporary Files on Disk
--------------------------------------------

XFS stores a substantial amount of metadata in file forks: directories,
extended attributes, symbolic link targets, free space bitmaps and summary
information for the realtime volume, and quota records.
File forks map 64-bit logical file fork space extents to physical storage space
extents, similar to how a memory management unit maps 64-bit virtual addresses
to physical memory addresses.
Therefore, file-based tree structures (such as directories and extended
attributes) use blocks mapped in the file fork offset address space that point
to other blocks mapped within that same address space, and file-based linear
structures (such as bitmaps and quota records) compute array element offsets in
the file fork offset address space.

Because file forks can consume as much space as the entire filesystem, repairs
cannot be staged in memory, even when a paging scheme is available.
Therefore, online repair of file-based metadata creates a temporary file in
the XFS filesystem, writes a new structure at the correct offsets into the
temporary file, and atomically swaps the fork mappings (and hence the fork
contents) to commit the repair.
Once the repair is complete, the old fork can be reaped as necessary; if the
system goes down during the reap, the iunlink code will delete the blocks
during log recovery.

**Note**: All space usage and inode indices in the filesystem *must* be
consistent to use a temporary file safely!
This dependency is the reason why online repair can only use pageable kernel
memory to stage ondisk space usage information.

Swapping metadata extents with a temporary file requires the owner field of the
block headers to match the file being repaired and not the temporary file.
The directory, extended attribute, and symbolic link functions were all
modified to allow callers to specify owner numbers explicitly.

There is a downside to the reaping process -- if the system crashes during the
reap phase and the fork extents are crosslinked, the iunlink processing will
fail because freeing space will find the extra reverse mappings and abort.

Temporary files created for repair are similar to ``O_TMPFILE`` files created
by userspace.
They are not linked into a directory and the entire file will be reaped when
the last reference to the file is lost.
The key differences are that these files must have no access permission outside
the kernel at all, they must be specially marked to prevent them from being
opened by handle, and they must never be linked into the directory tree.

+--------------------------------------------------------------------------+
| **Historical Sidebar**:                                                  |
+--------------------------------------------------------------------------+
| In the initial iteration of file metadata repair, the damaged metadata   |
| blocks would be scanned for salvageable data; the extents in the file    |
| fork would be reaped; and then a new structure would be built in its     |
| place.                                                                   |
| This strategy did not survive the introduction of the atomic repair      |
| requirement expressed earlier in this document.                          |
|                                                                          |
| The second iteration explored building a second structure at a high      |
| offset in the fork from the salvage data, reaping the old extents, and   |
| using a ``COLLAPSE_RANGE`` operation to slide the new extents into       |
| place.                                                                   |
|                                                                          |
| This had many drawbacks:                                                 |
|                                                                          |
| - Array structures are linearly addressed, and the regular filesystem    |
|   codebase does not have the concept of a linear offset that could be    |
|   applied to the record offset computation to build an alternate copy.   |
|                                                                          |
| - Extended attributes are allowed to use the entire attr fork offset     |
|   address space.                                                         |
|                                                                          |
| - Even if repair could build an alternate copy of a data structure in a  |
|   different part of the fork address space, the atomic repair commit     |
|   requirement means that online repair would have to be able to perform  |
|   a log assisted ``COLLAPSE_RANGE`` operation to ensure that the old     |
|   structure was completely replaced.                                     |
|                                                                          |
| - A crash after construction of the secondary tree but before the range  |
|   collapse would leave unreachable blocks in the file fork.              |
|   This would likely confuse things further.                              |
|                                                                          |
| - Reaping blocks after a repair is not a simple operation, and           |
|   initiating a reap operation from a restarted range collapse operation  |
|   during log recovery is daunting.                                       |
|                                                                          |
| - Directory entry blocks and quota records record the file fork offset   |
|   in the header area of each block.                                      |
|   An atomic range collapse operation would have to rewrite this part of  |
|   each block header.                                                     |
|   Rewriting a single field in block headers is not a huge problem, but   |
|   it's something to be aware of.                                         |
|                                                                          |
| - Each block in a directory or extended attributes btree index contains  |
|   sibling and child block pointers.                                      |
|   Were the atomic commit to use a range collapse operation, each block   |
|   would have to be rewritten very carefully to preserve the graph        |
|   structure.                                                             |
|   Doing this as part of a range collapse means rewriting a large number  |
|   of blocks repeatedly, which is not conducive to quick repairs.         |
|                                                                          |
| This led to the introduction of temporary file staging.                  |
+--------------------------------------------------------------------------+

Using a Temporary File
``````````````````````

Online repair code should use the ``xrep_tempfile_create`` function to create a
temporary file inside the filesystem.
This allocates an inode, marks the in-core inode private, and attaches it to
the scrub context.
These files are hidden from userspace, may not be added to the directory tree,
and must be kept private.

Temporary files only use two inode locks: the IOLOCK and the ILOCK.
The MMAPLOCK is not needed here, because there must not be page faults from
userspace for data fork blocks.
The usage patterns of these two locks are the same as for any other XFS file --
access to file data is controlled via the IOLOCK, and access to file metadata
is controlled via the ILOCK.
Locking helpers are provided so that the temporary file and its lock state can
be cleaned up by the scrub context.
To comply with the nested locking strategy laid out in the :ref:`inode
locking<ilocking>` section, it is recommended that scrub functions use the
``xrep_tempfile_ilock*_nowait`` lock helpers.

Data can be written to a temporary file by two means:

1. ``xrep_tempfile_copyin`` can be used to set the contents of a regular
   temporary file from an xfile.

2. The regular directory, symbolic link, and extended attribute functions can
   be used to write to the temporary file.

Once a good copy of a data file has been constructed in a temporary file, it
must be conveyed to the file being repaired, which is the topic of the next
section.

The proposed patches are in the
`repair temporary files
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-tempfiles>`_
series.

Atomic Extent Swapping
----------------------

Once repair builds a temporary file with a new data structure written into
it, it must commit the new changes into the existing file.
It is not possible to swap the inumbers of two files, so instead the new
metadata must replace the old.
This suggests the need for the ability to swap extents, but the existing extent
swapping code used by the file defragmenting tool ``xfs_fsr`` is not sufficient
for online repair because:

a. When the reverse-mapping btree is enabled, the swap code must keep the
   reverse mapping information up to date with every exchange of mappings.
   Therefore, it can only exchange one mapping per transaction, and each
   transaction is independent.

b. Reverse-mapping is critical for the operation of online fsck, so the old
   defragmentation code (which swapped entire extent forks in a single
   operation) is not useful here.

c. Defragmentation is assumed to occur between two files with identical
   contents.
   For this use case, an incomplete exchange will not result in a user-visible
   change in file contents, even if the operation is interrupted.

d. Online repair needs to swap the contents of two files that are by definition
   *not* identical.
   For directory and xattr repairs, the user-visible contents might be the
   same, but the contents of individual blocks may be very different.

e. Old blocks in the file may be cross-linked with another structure and must
   not reappear if the system goes down mid-repair.

These problems are overcome by creating a new deferred operation and a new type
of log intent item to track the progress of an operation to exchange two file
ranges.
The new deferred operation type chains together the same transactions used by
the reverse-mapping extent swap code.
The new log item records the progress of the exchange to ensure that once an
exchange begins, it will always run to completion, even if there are
interruptions.
The new ``XFS_SB_FEAT_INCOMPAT_LOG_ATOMIC_SWAP`` log-incompatible feature flag
in the superblock protects these new log item records from being replayed on
old kernels.

The proposed patchset is the
`atomic extent swap
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=atomic-file-updates>`_
series.

+--------------------------------------------------------------------------+
| **Sidebar: Using Log-Incompatible Feature Flags**                        |
+--------------------------------------------------------------------------+
| Starting with XFS v5, the superblock contains a                          |
| ``sb_features_log_incompat`` field to indicate that the log contains     |
| records that might not be readable by all kernels that could mount this  |
| filesystem.                                                              |
| In short, log incompat features protect the log contents against kernels |
| that will not understand the contents.                                   |
| Unlike the other superblock feature bits, log incompat bits are          |
| ephemeral because an empty (clean) log does not need protection.         |
| The log cleans itself after its contents have been committed into the    |
| filesystem, either as part of an unmount or because the system is        |
| otherwise idle.                                                          |
| Because upper level code can be working on a transaction at the same     |
| time that the log cleans itself, it is necessary for upper level code to |
| communicate to the log when it is going to use a log incompatible        |
| feature.                                                                 |
|                                                                          |
| The log coordinates access to incompatible features through the use of   |
| one ``struct rw_semaphore`` for each feature.                            |
| The log cleaning code tries to take this rwsem in exclusive mode to      |
| clear the bit; if the lock attempt fails, the feature bit remains set.   |
| Filesystem code signals its intention to use a log incompat feature in a |
| transaction by calling ``xlog_use_incompat_feat``, which takes the rwsem |
| in shared mode.                                                          |
| The code supporting a log incompat feature should create wrapper         |
| functions to obtain the log feature and call                             |
| ``xfs_add_incompat_log_feature`` to set the feature bits in the primary  |
| superblock.                                                              |
| The superblock update is performed transactionally, so the wrapper to    |
| obtain log assistance must be called just prior to the creation of the   |
| transaction that uses the functionality.                                 |
| For a file operation, this step must happen after taking the IOLOCK      |
| and the MMAPLOCK, but before allocating the transaction.                 |
| When the transaction is complete, the ``xlog_drop_incompat_feat``        |
| function is called to release the feature.                               |
| The feature bit will not be cleared from the superblock until the log    |
| becomes clean.                                                           |
|                                                                          |
| Log-assisted extended attribute updates and atomic extent swaps both use |
| log incompat features and provide convenience wrappers around the        |
| functionality.                                                           |
+--------------------------------------------------------------------------+

Mechanics of an Atomic Extent Swap
``````````````````````````````````

Swapping entire file forks is a complex task.
The goal is to exchange all file fork mappings between two file fork offset
ranges.
There are likely to be many extent mappings in each fork, and the edges of
the mappings aren't necessarily aligned.
Furthermore, there may be other updates that need to happen after the swap,
such as exchanging file sizes, inode flags, or conversion of fork data to local
format.
This is roughly the format of the new deferred extent swap work item:

.. code-block:: c

	struct xfs_swapext_intent {
	    /* Inodes participating in the operation. */
	    struct xfs_inode    *sxi_ip1;
	    struct xfs_inode    *sxi_ip2;

	    /* File offset range information. */
	    xfs_fileoff_t       sxi_startoff1;
	    xfs_fileoff_t       sxi_startoff2;
	    xfs_filblks_t       sxi_blockcount;

	    /* Set these file sizes after the operation, unless negative. */
	    xfs_fsize_t         sxi_isize1;
	    xfs_fsize_t         sxi_isize2;

	    /* XFS_SWAP_EXT_* log operation flags */
	    uint64_t            sxi_flags;
	};

The new log intent item contains enough information to track two logical fork
offset ranges: ``(inode1, startoff1, blockcount)`` and ``(inode2, startoff2,
blockcount)``.
Each step of a swap operation exchanges the largest file range mapping possible
from one file to the other.
After each step in the swap operation, the two startoff fields are incremented
and the blockcount field is decremented to reflect the progress made.
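
The cursor arithmetic described above can be illustrated with a short
userspace sketch.
The ``sx_cursor`` structure and the ``sx_advance`` and ``sx_run`` helpers are
hypothetical stand-ins that keep only the three range fields of the work item;
the real code also exchanges mappings and logs intent items at every step,
which this sketch leaves out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, trimmed-down model of the range fields in the work item. */
struct sx_cursor {
	uint64_t startoff1;	/* next fork offset to swap in file 1 */
	uint64_t startoff2;	/* next fork offset to swap in file 2 */
	uint64_t blockcount;	/* blocks that remain to be swapped */
};

/* Account for @done blocks of progress after one swap step. */
static void sx_advance(struct sx_cursor *c, uint64_t done)
{
	assert(done <= c->blockcount);
	c->startoff1 += done;
	c->startoff2 += done;
	c->blockcount -= done;
}

/*
 * Feed a list of per-step lengths to the cursor until the range is used up.
 * Each step is clamped to the blocks that remain.  Returns the number of
 * steps that were taken.
 */
static unsigned int sx_run(struct sx_cursor *c, const uint64_t *steps,
			   unsigned int nr)
{
	unsigned int i;

	for (i = 0; i < nr && c->blockcount > 0; i++) {
		uint64_t done = steps[i];

		if (done > c->blockcount)
			done = c->blockcount;
		sx_advance(c, done);
	}
	return i;
}
```

Because both offsets move by the same amount, the two remaining ranges always
have equal length, and a cursor with a zero ``blockcount`` means that the
exchange is complete.
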
41202f754f7fSDarrick J. WongThe flags field captures behavioral parameters such as swapping the attr fork 41212f754f7fSDarrick J. Wonginstead of the data fork and other work to be done after the extent swap. 41222f754f7fSDarrick J. WongThe two isize fields are used to swap the file size at the end of the operation 41232f754f7fSDarrick J. Wongif the file data fork is the target of the swap operation. 41242f754f7fSDarrick J. Wong 41252f754f7fSDarrick J. WongWhen the extent swap is initiated, the sequence of operations is as follows: 41262f754f7fSDarrick J. Wong 41272f754f7fSDarrick J. Wong1. Create a deferred work item for the extent swap. 41282f754f7fSDarrick J. Wong At the start, it should contain the entirety of the file ranges to be 41292f754f7fSDarrick J. Wong swapped. 41302f754f7fSDarrick J. Wong 41312f754f7fSDarrick J. Wong2. Call ``xfs_defer_finish`` to process the exchange. 41322f754f7fSDarrick J. Wong This is encapsulated in ``xrep_tempswap_contents`` for scrub operations. 41332f754f7fSDarrick J. Wong This will log an extent swap intent item to the transaction for the deferred 41342f754f7fSDarrick J. Wong extent swap work item. 41352f754f7fSDarrick J. Wong 41362f754f7fSDarrick J. Wong3. Until ``sxi_blockcount`` of the deferred extent swap work item is zero, 41372f754f7fSDarrick J. Wong 41382f754f7fSDarrick J. Wong a. Read the block maps of both file ranges starting at ``sxi_startoff1`` and 41392f754f7fSDarrick J. Wong ``sxi_startoff2``, respectively, and compute the longest extent that can 41402f754f7fSDarrick J. Wong be swapped in a single step. 41412f754f7fSDarrick J. Wong This is the minimum of the two ``br_blockcount`` s in the mappings. 41422f754f7fSDarrick J. Wong Keep advancing through the file forks until at least one of the mappings 41432f754f7fSDarrick J. Wong contains written blocks. 41442f754f7fSDarrick J. Wong Mutual holes, unwritten extents, and extent mappings to the same physical 41452f754f7fSDarrick J. Wong space are not exchanged. 

      For the next few steps, this document will refer to the mapping that came
      from file 1 as "map1", and the mapping that came from file 2 as "map2".

   b. Create a deferred block mapping update to unmap map1 from file 1.

   c. Create a deferred block mapping update to unmap map2 from file 2.

   d. Create a deferred block mapping update to map map1 into file 2.

   e. Create a deferred block mapping update to map map2 into file 1.

   f. Log the block, quota, and extent count updates for both files.

   g. Extend the ondisk size of either file if necessary.

   h. Log an extent swap done log item for the extent swap intent log item
      that was read at the start of step 3.

   i. Compute the amount of file range that has just been covered.
      This quantity is ``(map1.br_startoff + map1.br_blockcount -
      sxi_startoff1)``, because step 3a could have skipped holes.

   j. Increase the starting offsets of ``sxi_startoff1`` and ``sxi_startoff2``
      by the number of blocks computed in the previous step, and decrease
      ``sxi_blockcount`` by the same quantity.
      This advances the cursor.

   k. Log a new extent swap intent log item reflecting the advanced state of
      the work item.

   l. Return the proper error code (EAGAIN) to the deferred operation manager
      to inform it that there is more work to be done.
      The operation manager completes the deferred work in steps 3b-3e before
      moving back to the start of step 3.

4. Perform any post-processing.
   This will be discussed in more detail in subsequent sections.

If the filesystem goes down in the middle of an operation, log recovery will
find the most recent unfinished extent swap log intent item and restart from
there.
This is how extent swapping guarantees that an outside observer will either see
the old broken structure or the new one, and never a mishmash of both.

Preparation for Extent Swapping
```````````````````````````````

There are a few things that need to be taken care of before initiating an
atomic extent swap operation.
First, regular files require the page cache to be flushed to disk before the
operation begins, and directio writes to be quiesced.
Like any filesystem operation, extent swapping must determine the maximum
amount of disk space and quota that can be consumed on behalf of both files in
the operation, and reserve that quantity of resources to avoid an unrecoverable
out of space failure once it starts dirtying metadata.
The preparation step scans the ranges of both files to estimate:

- Data device blocks needed to handle the repeated updates to the fork
  mappings.
- Change in data and realtime block counts for both files.
- Increase in quota usage for both files, if the two files do not share the
  same set of quota ids.
- The number of extent mappings that will be added to each file.
- Whether or not there are partially written realtime extents.
  User programs must never be able to access a realtime file extent that maps
  to different extents on the realtime volume, which could happen if the
  operation fails to run to completion.

The need for precise estimation increases the run time of the swap operation,
but it is very important to maintain correct accounting.
The filesystem must not run completely out of free space, nor can the extent
swap ever add more extent mappings to a fork than it can support.
Regular users are required to abide by the quota limits, though metadata
repairs may exceed quota to resolve inconsistent metadata elsewhere.

Special Features for Swapping Metadata File Extents
```````````````````````````````````````````````````

Extended attributes, symbolic links, and directories can set the fork format to
"local" and treat the fork as a literal area for data storage.
Metadata repairs must take extra steps to support these cases:

- If both forks are in local format and the fork areas are large enough, the
  swap is performed by copying the incore fork contents, logging both forks,
  and committing.
  The atomic extent swap mechanism is not necessary, since this can be done
  with a single transaction.

- If both forks map blocks, then the regular atomic extent swap is used.

- Otherwise, only one fork is in local format.
  The contents of the local format fork are converted to a block to perform the
  swap.
  The conversion to block format must be done in the same transaction that
  logs the initial extent swap intent log item.
  The regular atomic extent swap is used to exchange the mappings.
  Special flags are set on the swap operation so that the transaction can be
  rolled one more time to convert the second file's fork back to local format
  so that the second file will be ready to go as soon as the ILOCK is dropped.

Extended attributes and directories stamp the owning inode into every block,
but the buffer verifiers do not actually check the inode number!
Although there is no verification, it is still important to maintain
referential integrity, so prior to performing the extent swap, online repair
builds every block in the new data structure with the owner field of the file
being repaired.

After a successful swap operation, the repair operation must reap the old fork
blocks by processing each fork mapping through the standard :ref:`file extent
reaping <reaping>` mechanism that is done post-repair.
If the filesystem should go down during the reap part of the repair, the
iunlink processing at the end of recovery will free both the temporary file and
whatever blocks were not reaped.
However, this iunlink processing omits the cross-link detection of online
repair, and is not completely foolproof.

Swapping Temporary File Extents
```````````````````````````````

To repair a metadata file, online repair proceeds as follows:

1. Create a temporary repair file.

2. Use the staging data to write out new contents into the temporary repair
   file.
   The same fork must be written to as is being repaired.

3. Commit the scrub transaction, since the swap estimation step must be
   completed before transaction reservations are made.

4. Call ``xrep_tempswap_trans_alloc`` to allocate a new scrub transaction with
   the appropriate resource reservations, locks, and fill out a ``struct
   xfs_swapext_req`` with the details of the swap operation.

5. Call ``xrep_tempswap_contents`` to swap the contents.

6. Commit the transaction to complete the repair.

.. _rtsummary:

Case Study: Repairing the Realtime Summary File
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In the "realtime" section of an XFS filesystem, free space is tracked via a
bitmap, similar to Unix FFS.
Each bit in the bitmap represents one realtime extent, which is a multiple of
the filesystem block size between 4KiB and 1GiB in size.
The realtime summary file indexes the number of free extents of a given size to
the offset of the block within the realtime free space bitmap where those free
extents begin.
In other words, the summary file helps the allocator find free extents by
length, similar to what the free space by count (cntbt) btree does for the data
section.

The summary file itself is a flat file (with no block headers or checksums!)
partitioned into ``log2(total rt extents)`` sections containing enough 32-bit
counters to match the number of blocks in the rt bitmap.
Each counter records the number of free extents that start in that bitmap block
and can satisfy a power-of-two allocation request.

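
The layout described above can be sketched as a flat array of counters.
This is a minimal model under stated assumptions, not the kernel code: the
helper names are hypothetical, and the ordering of the array (one section per
``floor(log2(length))``, each section holding one counter per bitmap block) is
assumed here for illustration.

.. code-block:: c

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical helper: floor(log2(len)) for len > 0. */
    static unsigned int rtsum_log2(uint64_t len)
    {
        unsigned int log = 0;

        while (len >>= 1)
            log++;
        return log;
    }

    /*
     * Index of the counter covering free extents of length extlen that
     * start in bitmap block bitmap_block (assumed section-major layout).
     */
    static size_t rtsum_index(uint64_t extlen, uint64_t bitmap_block,
                              uint64_t nr_bitmap_blocks)
    {
        return (size_t)(rtsum_log2(extlen) * nr_bitmap_blocks + bitmap_block);
    }

    /* Account for one free extent in an in-memory copy of the summary. */
    static void rtsum_account(uint32_t *summary, uint64_t start_rtext,
                              uint64_t extlen, uint64_t rtext_per_bitmap_block,
                              uint64_t nr_bitmap_blocks)
    {
        uint64_t bitmap_block = start_rtext / rtext_per_bitmap_block;

        summary[rtsum_index(extlen, bitmap_block, nr_bitmap_blocks)]++;
    }

Rebuilding the whole array in memory from the bitmap and comparing it with the
ondisk file is the essence of the check that follows.
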
To check the summary file against the bitmap:

1. Take the ILOCK of both the realtime bitmap and summary files.

2. For each free space extent recorded in the bitmap:

   a. Compute the position in the summary file that contains a counter that
      represents this free extent.

   b. Read the counter from the xfile.

   c. Increment it, and write it back to the xfile.

3. Compare the contents of the xfile against the ondisk file.

To repair the summary file, write the xfile contents into the temporary file
and use atomic extent swap to commit the new contents.
The temporary file is then reaped.

The proposed patchset is the
`realtime summary repair
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-rtsummary>`_
series.

Case Study: Salvaging Extended Attributes
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In XFS, extended attributes are implemented as a namespaced name-value store.
Values are limited in size to 64KiB, but there is no limit on the number of
names.
The attribute fork is unpartitioned, which means that the root of the attribute
structure is always in logical block zero, but attribute leaf blocks, dabtree
index blocks, and remote value blocks are intermixed.
Attribute leaf blocks contain variable-sized records that associate
user-provided names with the user-provided values.
Values larger than a block are allocated separate extents and written there.
If the leaf information expands beyond a single block, a directory/attribute
btree (``dabtree``) is created to map hashes of attribute names to entries
for fast lookup.

Salvaging extended attributes is done as follows:

1. Walk the attr fork mappings of the file being repaired to find the attribute
   leaf blocks.
   When one is found,

   a. Walk the attr leaf block to find candidate keys.
      When one is found,

      1. Check the name for problems, and ignore the name if there are any.

      2. Retrieve the value.
         If that succeeds, add the name and value to the staging xfarray and
         xfblob.

2. If the memory usage of the xfarray and xfblob exceeds a certain amount of
   memory or there are no more attr fork blocks to examine, unlock the file and
   add the staged extended attributes to the temporary file.

3. Use atomic extent swapping to exchange the new and old extended attribute
   structures.
   The old attribute blocks are now attached to the temporary file.

4. Reap the temporary file.

The proposed patchset is the
`extended attribute repair
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-xattrs>`_
series.

Fixing Directories
------------------

Fixing directories is difficult with currently available filesystem features,
since directory entries are not redundant.
The offline repair tool scans all inodes to find files with nonzero link count,
and then it scans all directories to establish parentage of those linked files.
Damaged files and directories are zapped, and files with no parent are
moved to the ``/lost+found`` directory.
It does not try to salvage anything.

The best that online repair can do at this time is to read directory data
blocks and salvage any dirents that look plausible, correct link counts, and
move orphans back into the directory tree.
The salvage process is discussed in the case study at the end of this section.
The :ref:`file link count fsck <nlinks>` code takes care of fixing link counts
and moving orphans to the ``/lost+found`` directory.

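
The offline strategy described above amounts to a two-pass reachability check.
The following is a minimal sketch over a toy in-memory model; ``struct
toy_inode``, ``struct toy_dirent``, and ``find_orphans`` are hypothetical names
invented for this illustration and do not correspond to any repair tool code.
The root directory, which has no parent dirent, is ignored in this model.

.. code-block:: c

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Toy result of the inode scan (first pass). */
    struct toy_inode {
        uint64_t ino;
        uint32_t nlink;       /* link count recorded in the inode */
        bool     has_parent;  /* filled in by the directory scan */
    };

    /* Toy directory entry observed during the directory scan. */
    struct toy_dirent {
        uint64_t parent_ino;
        uint64_t child_ino;
    };

    /* Count linked files that no directory entry points to. */
    static size_t find_orphans(struct toy_inode *inodes, size_t nr_inodes,
                               const struct toy_dirent *dirents,
                               size_t nr_dirents)
    {
        size_t i, j, orphans = 0;

        /* Second pass: every dirent establishes parentage of its target. */
        for (j = 0; j < nr_dirents; j++)
            for (i = 0; i < nr_inodes; i++)
                if (inodes[i].ino == dirents[j].child_ino)
                    inodes[i].has_parent = true;

        /* Linked files without a parent would be moved to /lost+found. */
        for (i = 0; i < nr_inodes; i++)
            if (inodes[i].nlink > 0 && !inodes[i].has_parent)
                orphans++;

        return orphans;
    }

The quadratic search keeps the sketch short; a real tool would index inodes by
number.
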

Case Study: Salvaging Directories
`````````````````````````````````

Unlike extended attributes, directory blocks are all the same size, so
salvaging directories is straightforward:

1. Find the parent of the directory.
   If the dotdot entry is readable, try to confirm that the alleged
   parent has a child entry pointing back to the directory being repaired.
   Otherwise, walk the filesystem to find it.

2. Walk the first partition of the data fork of the directory to find the
   directory entry data blocks.
   When one is found,

   a. Walk the directory data block to find candidate entries.
      When an entry is found:

      i. Check the name for problems, and ignore the name if there are any.

      ii. Retrieve the inumber and grab the inode.
          If that succeeds, add the name, inode number, and file type to the
          staging xfarray and xfblob.

3. If the memory usage of the xfarray and xfblob exceeds a certain amount of
   memory or there are no more directory data blocks to examine, unlock the
   directory and add the staged dirents into the temporary directory.
   Truncate the staging files.

4. Use atomic extent swapping to exchange the new and old directory structures.
   The old directory blocks are now attached to the temporary file.

5. Reap the temporary file.

**Future Work Question**: Should repair revalidate the dentry cache when
rebuilding a directory?

*Answer*: Yes, it should.

In theory it is necessary to scan all dentry cache entries for a directory to
ensure that one of the following applies:

1. The cached dentry reflects an ondisk dirent in the new directory.

2. The cached dentry no longer has a corresponding ondisk dirent in the new
   directory and the dentry can be purged from the cache.

3. The cached dentry no longer has an ondisk dirent but the dentry cannot be
   purged.
   This is the problem case.

Unfortunately, the current dentry cache design doesn't provide a means to walk
every child dentry of a specific directory, which makes this a hard problem.
There is no known solution.

The proposed patchset is the
`directory repair
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-dirs>`_
series.

Parent Pointers
```````````````

A parent pointer is a piece of file metadata that enables a user to locate the
file's parent directory without having to traverse the directory tree from the
root.
Without them, reconstruction of directory trees is hindered in much the same
way that the historic lack of reverse space mapping information once hindered
reconstruction of filesystem space metadata.
The parent pointer feature, however, makes total directory reconstruction
possible.

XFS parent pointers include the dirent name and location of the entry within
the parent directory.
In other words, child files use extended attributes to store pointers to
parents in the form ``(parent_inum, parent_gen, dirent_pos) → (dirent_name)``.
The directory checking process can be strengthened to ensure that the target of
each dirent also contains a parent pointer pointing back to the dirent.
Likewise, each parent pointer can be checked by ensuring that the target of
each parent pointer is a directory and that it contains a dirent matching
the parent pointer.
Both online and offline repair can use this strategy.

**Note**: The ondisk format of parent pointers is not yet finalized.

+--------------------------------------------------------------------------+
| **Historical Sidebar**:                                                  |
+--------------------------------------------------------------------------+
| Directory parent pointers were first proposed as an XFS feature more     |
| than a decade ago by SGI.                                                |
| Each link from a parent directory to a child file is mirrored with an    |
| extended attribute in the child that could be used to identify the       |
| parent directory.                                                        |
| Unfortunately, this early implementation had major shortcomings and was  |
| never merged into Linux XFS:                                             |
|                                                                          |
| 1. The XFS codebase of the late 2000s did not have the infrastructure to |
|    enforce strong referential integrity in the directory tree.           |
|    It did not guarantee that a change in a forward link would always be  |
|    followed up with the corresponding change to the reverse links.       |
|                                                                          |
| 2. Referential integrity was not integrated into offline repair.         |
|    Checking and repairs were performed on mounted filesystems without    |
|    taking any kernel or inode locks to coordinate access.                |
|    It is not clear how this actually worked properly.                    |
|                                                                          |
| 3. The extended attribute did not record the name of the directory entry |
|    in the parent, so the SGI parent pointer implementation cannot be     |
|    used to reconnect the directory tree.                                 |
|                                                                          |
| 4. Extended attribute forks only support 65,536 extents, which means     |
|    that parent pointer attribute creation is likely to fail at some      |
|    point before the maximum file link count is achieved.                 |
|                                                                          |
| The original parent pointer design was too unstable for something like   |
| a file system repair to depend on.                                       |
| Allison Henderson, Chandan Babu, and Catherine Hoang are working on a    |
| second implementation that solves all shortcomings of the first.         |
| During 2022, Allison introduced log intent items to track physical       |
| manipulations of the extended attribute structures.                      |
| This solves the referential integrity problem by making it possible to   |
| commit a dirent update and a parent pointer update in the same           |
| transaction.                                                             |
| Chandan increased the maximum extent counts of both data and attribute   |
| forks, thereby ensuring that the extended attribute structure can grow   |
| to handle the maximum hardlink count of any file.                        |
+--------------------------------------------------------------------------+

Case Study: Repairing Directories with Parent Pointers
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Directory rebuilding uses a :ref:`coordinated inode scan <iscan>` and
a :ref:`directory entry live update hook <liveupdate>` as follows:

1. Set up a temporary directory for generating the new directory structure,
   an xfblob for storing entry names, and an xfarray for stashing directory
   updates.

2. Set up an inode scanner and hook into the directory entry code to receive
   updates on directory operations.

3. For each parent pointer found in each file scanned, decide if the parent
   pointer references the directory of interest.
   If so:

   a. Stash an addname entry for this dirent in the xfarray for later.

   b. When finished scanning that file, flush the stashed updates to the
      temporary directory.

4. For each live directory update received via the hook, decide if the child
   has already been scanned.
   If so:

   a. Stash an addname or removename entry for this dirent update in the
      xfarray for later.
      We cannot write directly to the temporary directory because hook
      functions are not allowed to modify filesystem metadata.
      Instead, we stash updates in the xfarray and rely on the scanner thread
      to apply the stashed updates to the temporary directory.

5. When the scan is complete, atomically swap the contents of the temporary
   directory and the directory being repaired.
   The temporary directory now contains the damaged directory structure.

6. Reap the temporary directory.

7. Update the dirent position field of parent pointers as necessary.
   This may require the queuing of a substantial number of xattr log intent
   items.

The proposed patchset is the
`parent pointers directory repair
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=pptrs-online-dir-repair>`_
series.

**Unresolved Question**: How will repair ensure that the ``dirent_pos`` fields
match in the reconstructed directory?

*Answer*: There are a few ways to solve this problem:

1. The field could be designated advisory, since the other three values are
   sufficient to find the entry in the parent.
   However, this makes indexed key lookup impossible while repairs are ongoing.

2. We could allow creating directory entries at specified offsets, which solves
   the referential integrity problem but runs the risk that dirent creation
   will fail due to conflicts with the free space in the directory.

   These conflicts could be resolved by appending the directory entry and
   amending the xattr code to support updating an xattr key and reindexing the
   dabtree, though this would have to be performed with the parent directory
   still locked.


3. Same as above, but remove the old parent pointer entry and add a new one
   atomically.

4. Change the ondisk xattr format to ``(parent_inum, name) → (parent_gen)``,
   which would provide the attr name uniqueness that we require, without
   forcing repair code to update the dirent position.
   Unfortunately, this requires changes to the xattr code to support attr
   names as long as 263 bytes.

5. Change the ondisk xattr format to ``(parent_inum, hash(name)) →
   (name, parent_gen)``.
   If the hash is sufficiently resistant to collisions (e.g. sha256) then
   this should provide the attr name uniqueness that we require.
   Names shorter than 247 bytes could be stored directly.

Discussion is ongoing under the `parent pointers patch deluge
<https://www.spinics.net/lists/linux-xfs/msg69397.html>`_.

Case Study: Repairing Parent Pointers
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Online reconstruction of a file's parent pointer information works similarly to
directory reconstruction:

1. Set up a temporary file for generating a new extended attribute structure,
   an :ref:`xfblob<xfblob>` for storing parent pointer names, and an xfarray
   for stashing parent pointer updates.

2. Set up an inode scanner and hook into the directory entry code to receive
   updates on directory operations.

3. For each directory entry found in each directory scanned, decide if the
   dirent references the file of interest.
   If so:

   a. Stash an addpptr entry for this parent pointer in the xfblob and xfarray
      for later.

   b. When finished scanning the directory, flush the stashed updates to the
      temporary file.

4. For each live directory update received via the hook, decide if the parent
   has already been scanned.
   If so:

   a. Stash an addpptr or removepptr entry for this dirent update in the
      xfarray for later.
      We cannot write parent pointers directly to the temporary file because
      hook functions are not allowed to modify filesystem metadata.
      Instead, we stash updates in the xfarray and rely on the scanner thread
      to apply the stashed parent pointer updates to the temporary file.

5. Copy all non-parent pointer extended attributes to the temporary file.

6. When the scan is complete, atomically swap the attribute fork of the
   temporary file and the file being repaired.
   The temporary file now contains the damaged extended attribute structure.


7. Reap the temporary file.

The proposed patchset is the
`parent pointers repair
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=pptrs-online-parent-repair>`_
series.

Digression: Offline Checking of Parent Pointers
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Examining parent pointers in offline repair works differently because corrupt
files are erased long before directory tree connectivity checks are performed.
Parent pointer checks are therefore a second pass to be added to the existing
connectivity checks:

1. After the set of surviving files has been established (i.e. phase 6),
   walk the surviving directories of each AG in the filesystem.
   This is already performed as part of the connectivity checks.

2. For each directory entry found, record the name in an xfblob, and store
   ``(child_ag_inum, parent_inum, parent_gen, dirent_pos)`` tuples in a
   per-AG in-memory slab.

3. For each AG in the filesystem,

   a. Sort the per-AG tuples in order of child_ag_inum, parent_inum, and
      dirent_pos.

   b. For each inode in the AG,

      1. Scan the inode for parent pointers.
         Record the names in a per-file xfblob, and store ``(parent_inum,
         parent_gen, dirent_pos)`` tuples in a per-file slab.

      2. Sort the per-file tuples in order of parent_inum and dirent_pos.

      3. Position one slab cursor at the start of the inode's records in the
         per-AG tuple slab.
         This should be trivial since the per-AG tuples are in child inumber
         order.

      4. Position a second slab cursor at the start of the per-file tuple slab.

      5. Iterate the two cursors in lockstep, comparing the parent_inum and
         dirent_pos fields of the records under each cursor.

         a. Tuples in the per-AG list but not the per-file list are missing and
            need to be written to the inode.

         b. Tuples in the per-file list but not the per-AG list are dangling
            and need to be removed from the inode.

         c. For tuples in both lists, update the parent_gen and name components
            of the parent pointer if necessary.

4. Move on to examining link counts, as we do today.

The proposed patchset is the
`offline parent pointers repair
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=pptrs-repair>`_
series.
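
The lockstep comparison of the per-AG and per-file tuples for one inode can be
sketched as a merge of two sorted arrays.
This is only an illustration, not code from the proposed patchset: the
``pptr_tuple`` and ``pptr_diff`` structures and both function names are
hypothetical, plain arrays stand in for slab cursors, and the name comparison
against the xfblob is left out.

.. code-block:: c

    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical in-memory form of one parent pointer tuple. */
    struct pptr_tuple {
        uint64_t parent_inum;
        uint32_t parent_gen;
        uint32_t dirent_pos;
    };

    /* Tally of the three outcomes of the lockstep walk. */
    struct pptr_diff {
        size_t missing;   /* per-AG list only: write to the inode */
        size_t dangling;  /* per-file list only: remove from the inode */
        size_t stale;     /* in both lists, but parent_gen differs */
    };

    /* Order tuples by parent_inum, then dirent_pos, matching the sort. */
    static int pptr_cmp(const struct pptr_tuple *a, const struct pptr_tuple *b)
    {
        if (a->parent_inum != b->parent_inum)
            return a->parent_inum < b->parent_inum ? -1 : 1;
        if (a->dirent_pos != b->dirent_pos)
            return a->dirent_pos < b->dirent_pos ? -1 : 1;
        return 0;
    }

    /*
     * Walk two sorted arrays in lockstep.  ag[] holds the tuples derived from
     * directory entries that point at one child inode; file[] holds the tuples
     * read from that inode's parent pointers.
     */
    static struct pptr_diff
    pptr_compare(const struct pptr_tuple *ag, size_t nr_ag,
                 const struct pptr_tuple *file, size_t nr_file)
    {
        struct pptr_diff d = { 0, 0, 0 };
        size_t i = 0, j = 0;

        assert(ag != NULL || nr_ag == 0);
        assert(file != NULL || nr_file == 0);

        while (i < nr_ag && j < nr_file) {
            int c = pptr_cmp(&ag[i], &file[j]);

            if (c < 0) {
                d.missing++;        /* dirent has no parent pointer */
                i++;
            } else if (c > 0) {
                d.dangling++;       /* parent pointer has no dirent */
                j++;
            } else {
                if (ag[i].parent_gen != file[j].parent_gen)
                    d.stale++;      /* parent pointer needs an update */
                i++;
                j++;
            }
        }

        /* Whatever remains in either array has no partner in the other. */
        d.missing += nr_ag - i;
        d.dangling += nr_file - j;
        return d;
    }

Because both inputs are sorted on the same key, each tuple is visited once, so
the comparison for one inode is linear in the number of tuples.
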

Rebuilding directories from parent pointers in offline repair is very
challenging because offline repair currently uses a single-pass scan of the
filesystem during phase 3 to decide which files are corrupt enough to be
zapped.
This scan would have to be converted into a multi-pass scan:

1. The first pass of the scan zaps corrupt inodes, forks, and attributes
   much as it does now.
   Corrupt directories are noted but not zapped.

2. The next pass records parent pointers pointing to the directories noted
   as being corrupt in the first pass.
   This second pass may have to happen after the phase 4 scan for duplicate
   blocks, if phase 4 is also capable of zapping directories.

3. The third pass resets corrupt directories to an empty shortform directory.
   Free space metadata has not been ensured yet, so repair cannot yet use the
   directory building code in libxfs.

4. At the start of phase 6, space metadata have been rebuilt.
   Use the parent pointer information recorded during step 2 to reconstruct
   the dirents and add them to the now-empty directories.

This code has not yet been constructed.

.. _orphanage:

The Orphanage
-------------

Filesystems present files as a directed, and hopefully acyclic, graph.
In other words, a tree.
The root of the filesystem is a directory, and each entry in a directory points
downwards either to more subdirectories or to non-directory files.
Unfortunately, a disruption in the directory graph pointers results in a
disconnected graph, which makes files impossible to access via regular path
resolution.

Without parent pointers, the directory parent pointer online scrub code can
detect a dotdot entry pointing to a parent directory that doesn't have a link
back to the child directory and the file link count checker can detect a file
that isn't pointed to by any directory in the filesystem.
If such a file has a positive link count, the file is an orphan.

With parent pointers, directories can be rebuilt by scanning parent pointers
and parent pointers can be rebuilt by scanning directories.
This should reduce the incidence of files ending up in ``/lost+found``.

When orphans are found, they should be reconnected to the directory tree.
Offline fsck solves the problem by creating a directory ``/lost+found`` to
serve as an orphanage, and linking orphan files into the orphanage by using the
inumber as the name.
Reparenting a file to the orphanage does not reset any of its permissions or
ACLs.

This process is more involved in the kernel than it is in userspace.
The directory and file link count repair setup functions must use the regular
VFS mechanisms to create the orphanage directory with all the necessary
security attributes and dentry cache entries, just like a regular directory
tree modification.

Orphaned files are adopted by the orphanage as follows:

1. Call ``xrep_orphanage_try_create`` at the start of the scrub setup function
   to try to ensure that the lost and found directory actually exists.
   This also attaches the orphanage directory to the scrub context.

2. If the decision is made to reconnect a file, take the IOLOCK of both the
   orphanage and the file being reattached.
   The ``xrep_orphanage_iolock_two`` function follows the inode locking
   strategy discussed earlier.

3. Call ``xrep_orphanage_compute_blkres`` and ``xrep_orphanage_compute_name``
   to compute the new name in the orphanage and the block reservation required.

4. Use ``xrep_orphanage_adoption_prep`` to reserve resources to the repair
   transaction.

5. Call ``xrep_orphanage_adopt`` to reparent the orphaned file into the lost
   and found, and update the kernel dentry cache.

The proposed patches are in the
`orphanage adoption
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-orphanage>`_
series.

6. Userspace Algorithms and Data Structures
===========================================

This section discusses the key algorithms and data structures of the userspace
program, ``xfs_scrub``, that provide the ability to drive metadata checks and
repairs in the kernel, verify file data, and look for other potential problems.

.. _scrubcheck:

Checking Metadata
-----------------

Recall the :ref:`phases of fsck work<scrubphases>` outlined earlier.
That structure follows naturally from the data dependencies designed into the
filesystem from its beginnings in 1993.
In XFS, there are several groups of metadata dependencies:

a. Filesystem summary counts depend on consistency within the inode indices,
   the allocation group space btrees, and the realtime volume space
   information.

b. Quota resource counts depend on consistency within the quota file data
   forks, inode indices, inode records, and the forks of every file on the
   system.

c. The naming hierarchy depends on consistency within the directory and
   extended attribute structures.
   This includes file link counts.

d. Directories, extended attributes, and file data depend on consistency within
   the file forks that map directory and extended attribute data to physical
   storage media.

e. The file forks depend on consistency within inode records and the space
   metadata indices of the allocation groups and the realtime volume.
   This includes quota and realtime metadata files.

f. Inode records depend on consistency within the inode metadata indices.

g. Realtime space metadata depend on the inode records and data forks of the
   realtime metadata inodes.

h. The allocation group metadata indices (free space, inodes, reference count,
   and reverse mapping btrees) depend on consistency within the AG headers and
   between all the AG metadata btrees.

i. ``xfs_scrub`` depends on the filesystem being mounted and kernel support
   for online fsck functionality.

Therefore, a metadata dependency graph is a convenient way to schedule checking
operations in the ``xfs_scrub`` program:

- Phase 1 checks that the provided path maps to an XFS filesystem and detects
  the kernel's scrubbing abilities, which validates group (i).

- Phase 2 scrubs groups (g) and (h) in parallel using a threaded workqueue.

- Phase 3 scans inodes in parallel.
  For each inode, groups (f), (e), and (d) are checked, in that order.

- Phase 4 repairs everything in groups (i) through (d) so that phases 5 and 6
  may run reliably.

- Phase 5 starts by checking groups (b) and (c) in parallel before moving on
  to checking names.

- Phase 6 depends on groups (i) through (b) to find file data blocks to verify,
  to read them, and to report which blocks of which files are affected.

- Phase 7 checks group (a), having validated everything else.

Notice that the data dependencies between groups are enforced by the structure
of the program flow.

Parallel Inode Scans
--------------------

An XFS filesystem can easily contain hundreds of millions of inodes.
Given that XFS targets installations with large high-performance storage,
it is desirable to scrub inodes in parallel to minimize runtime, particularly
if the program has been invoked manually from a command line.
This requires careful scheduling to keep the threads as evenly loaded as
possible.

Early iterations of the ``xfs_scrub`` inode scanner naïvely created a single
workqueue and scheduled a single workqueue item per AG.
Each workqueue item walked the inode btree (with ``XFS_IOC_INUMBERS``) to find
inode chunks and then called bulkstat (``XFS_IOC_BULKSTAT``) to gather enough
information to construct file handles.
The file handle was then passed to a function to generate scrub items for each
metadata object of each inode.
This simple algorithm leads to thread balancing problems in phase 3 if the
filesystem contains one AG with a few large sparse files and the rest of the
AGs contain many smaller files.
The inode scan dispatch function was not sufficiently granular; it should have
been dispatching at the level of individual inodes, or, to constrain memory
consumption, inode btree records.

Thanks to Dave Chinner, bounded workqueues in userspace enable ``xfs_scrub`` to
avoid this problem with ease by adding a second workqueue.
Just like before, the first workqueue is seeded with one workqueue item per AG,
and it uses INUMBERS to find inode btree chunks.
The second workqueue, however, is configured with an upper bound on the number
of items that can be waiting to be run.
Each inode btree chunk found by the first workqueue's workers is queued to the
second workqueue, and it is this second workqueue that queries BULKSTAT,
creates a file handle, and passes it to a function to generate scrub items for
each metadata object of each inode.
If the second workqueue is too full, the workqueue add function blocks the
first workqueue's workers until the backlog eases.
This doesn't completely solve the balancing problem, but reduces it enough to
move on to more pressing issues.

The proposed patchsets are the scrub
`performance tweaks
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-performance-tweaks>`_
and the
`inode scan rebalance
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-iscan-rebalance>`_
series.

.. _scrubrepair:

Scheduling Repairs
------------------

During phase 2, corruptions and inconsistencies reported in any AGI header or
inode btree are repaired immediately, because phase 3 relies on proper
functioning of the inode indices to find inodes to scan.
Failed repairs are rescheduled to phase 4.
Problems reported in any other space metadata are deferred to phase 4.
Optimization opportunities are always deferred to phase 4, no matter their
origin.
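
The phase 2 policy above amounts to a small decision function.
The sketch below is illustrative only: the enums and the ``phase2_schedule``
name are hypothetical and do not come from ``xfs_scrub``, which tracks this
state in its own repair items.

.. code-block:: c

    #include <assert.h>

    /* Hypothetical classification of the object that phase 2 scrubbed. */
    enum p2_object {
        P2_AGI_HEADER,
        P2_INODE_BTREE,
        P2_OTHER_SPACE_METADATA,
    };

    /* Hypothetical classification of what the kernel reported. */
    enum p2_finding {
        P2_CORRUPT,     /* corruption or inconsistency */
        P2_PREEN,       /* optimization opportunity */
    };

    enum p2_action {
        P2_REPAIR_NOW,
        P2_DEFER_TO_PHASE4,
    };

    /* Decide what phase 2 does with one reported problem. */
    static enum p2_action
    phase2_schedule(enum p2_object obj, enum p2_finding what)
    {
        assert(what == P2_CORRUPT || what == P2_PREEN);

        /* Optimizations always wait for phase 4, whatever their origin. */
        if (what == P2_PREEN)
            return P2_DEFER_TO_PHASE4;

        /* Phase 3 needs working inode indices to find inodes to scan. */
        if (obj == P2_AGI_HEADER || obj == P2_INODE_BTREE)
            return P2_REPAIR_NOW;

        /* Problems in any other space metadata are deferred. */
        return P2_DEFER_TO_PHASE4;
    }

An immediate repair that fails would then be requeued as if this function had
returned ``P2_DEFER_TO_PHASE4``.
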

During phase 3, corruptions and inconsistencies reported in any part of a
file's metadata are repaired immediately if all space metadata were validated
during phase 2.
Repairs that fail or cannot be performed immediately are scheduled for phase 4.

In the original design of ``xfs_scrub``, it was thought that repairs would be
so infrequent that the ``struct xfs_scrub_metadata`` objects used to
communicate with the kernel could also be used as the primary object to
schedule repairs.
With recent increases in the number of optimizations possible for a given
filesystem object, it became much more memory-efficient to track all eligible
repairs for a given filesystem object with a single repair item.
Each repair item represents a single lockable object -- AGs, metadata files,
individual inodes, or a class of summary information.

Phase 4 is responsible for scheduling a lot of repair work in as quick a
manner as is practical.
The :ref:`data dependencies <scrubcheck>` outlined earlier still apply, which
means that ``xfs_scrub`` must try to complete the repair work scheduled by
phase 2 before trying repair work scheduled by phase 3.
The repair process is as follows:

1. Start a round of repair with a workqueue and enough workers to keep the CPUs
   as busy as the user desires.

   a. For each repair item queued by phase 2,

      i.   Ask the kernel to repair everything listed in the repair item for a
           given filesystem object.

      ii.  Make a note if the kernel made any progress in reducing the number
           of repairs needed for this object.

      iii. If the object no longer requires repairs, revalidate all metadata
           associated with this object.
           If the revalidation succeeds, drop the repair item.
           If not, requeue the item for more repairs.

   b. If any repairs were made, jump back to 1a to retry all the phase 2 items.

   c. For each repair item queued by phase 3,

      i.   Ask the kernel to repair everything listed in the repair item for a
           given filesystem object.

      ii.  Make a note if the kernel made any progress in reducing the number
           of repairs needed for this object.

      iii. If the object no longer requires repairs, revalidate all metadata
           associated with this object.
           If the revalidation succeeds, drop the repair item.
           If not, requeue the item for more repairs.

   d. If any repairs were made, jump back to 1c to retry all the phase 3 items.

2. If step 1 made any repair progress of any kind, jump back to step 1 to start
   another round of repair.

3. If there are items left to repair, run them all serially one more time.
   Complain if the repairs were not successful, since this is the last chance
   to repair anything.

Corruptions and inconsistencies encountered during phases 5 and 7 are repaired
immediately.
Corrupt file data blocks reported by phase 6 cannot be recovered by the
filesystem.

The proposed patchsets are the
`repair warning improvements
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-better-repair-warnings>`_,
refactoring of the
`repair data dependency
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-repair-data-deps>`_
and
`object tracking
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-object-tracking>`_,
and the
`repair scheduling
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-repair-scheduling>`_
improvement series.

Checking Names for Confusable Unicode Sequences
-----------------------------------------------

If ``xfs_scrub`` succeeds in validating the filesystem metadata by the end of
phase 4, it moves on to phase 5, which checks for suspicious looking names in
the filesystem.
These names consist of the filesystem label, names in directory entries, and
the names of extended attributes.
Like most Unix filesystems, XFS imposes the sparest of constraints on the
contents of a name:

- Slashes and null bytes are not allowed in directory entries.

- Null bytes are not allowed in userspace-visible extended attributes.

- Null bytes are not allowed in the filesystem label.

Directory entries and attribute keys store the length of the name explicitly
ondisk, which means that nulls are not name terminators.
For this section, the term "naming domain" refers to any place where names are
presented together -- all the names in a directory, or all the attributes of a
file.

Although the Unix naming constraints are very permissive, the reality of most
modern-day Linux systems is that programs work with Unicode character code
points to support international languages.
These programs typically encode those code points in UTF-8 when interfacing
with the C library because the kernel expects null-terminated names.
In the common case, therefore, names found in an XFS filesystem are actually
UTF-8 encoded Unicode data.

To maximize its expressiveness, the Unicode standard defines separate code
points for various characters that render similarly or identically in writing
systems around the world.
For example, the character "Cyrillic Small Letter A" U+0430 "а" often renders
identically to "Latin Small Letter A" U+0061 "a".

The standard also permits characters to be constructed in multiple ways --
either by using a defined code point, or by combining one code point with
various combining marks.
For example, the character "Angstrom Sign" U+212B "Å" can also be expressed
as "Latin Capital Letter A" U+0041 "A" followed by "Combining Ring Above"
U+030A "◌̊".
Both sequences render identically.

Like the standards that preceded it, Unicode also defines various control
characters to alter the presentation of text.
For example, the character "Right-to-Left Override" U+202E can trick some
programs into rendering "moo\\xe2\\x80\\xaegnp.txt" as "mootxt.png".
A second category of rendering problems involves whitespace characters.
If the character "Zero Width Space" U+200B is encountered in a file name, the
name will render identically to a name that does not have the zero width
space.
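
To make the two examples above concrete, the following sketch scans a name for
the UTF-8 encodings of U+202E (bytes ``e2 80 ae``) and U+200B (bytes
``e2 80 8b``).
It is a hand-rolled illustration with a hypothetical function name, and it
covers only these two code points; it does not attempt the general confusable
detection that a Unicode library provides.

.. code-block:: c

    #include <assert.h>
    #include <stdbool.h>
    #include <stddef.h>

    /*
     * Return true if the len-byte name contains the UTF-8 encoding of
     * U+202E (Right-to-Left Override) or U+200B (Zero Width Space).
     * The length is passed explicitly because names are stored with an
     * explicit length rather than a terminator.
     */
    static bool name_has_hidden_codepoint(const unsigned char *name, size_t len)
    {
        size_t i;

        assert(name != NULL || len == 0);

        /* Both code points encode as three bytes starting with e2 80. */
        for (i = 0; i + 2 < len; i++) {
            if (name[i] != 0xe2 || name[i + 1] != 0x80)
                continue;
            if (name[i + 2] == 0xae || name[i + 2] == 0x8b)
                return true;
        }
        return false;
    }

In well-formed UTF-8 the byte ``0xe2`` only ever appears as the first byte of a
three-byte sequence, so this byte-wise match cannot straddle two characters.
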

If two names within a naming domain have different byte sequences but render
identically, a user may be confused by them.
The kernel, in its indifference to upper level encoding schemes, permits this.
Most filesystem drivers persist the byte sequence names that are given to them
by the VFS.

Techniques for detecting confusable names are explained in great detail in
sections 4 and 5 of the
`Unicode Security Mechanisms <https://unicode.org/reports/tr39/>`_
document.
When ``xfs_scrub`` detects UTF-8 encoding in use on a system, it uses the
Unicode normalization form NFD in conjunction with the confusable name
detection component of
`libicu <https://github.com/unicode-org/icu>`_
to identify names within a directory or within a file's extended attributes
that could be confused for each other.
Names are also checked for control characters, non-rendering characters, and
mixing of bidirectional characters.
All of these potential issues are reported to the system administrator during
phase 5.

Media Verification of File Data Extents
---------------------------------------

The system administrator can elect to initiate a media scan of all file data
blocks.
This scan runs after validation of all filesystem metadata (except for the
summary counters) as phase 6.
The scan starts by calling ``FS_IOC_GETFSMAP`` to scan the filesystem space map
to find areas that are allocated to file data fork extents.
Gaps between data fork extents that are smaller than 64k are treated as if
they were data fork extents to reduce the command setup overhead.
When the space map scan accumulates a region larger than 32MB, a media
verification request is sent to the disk as a directio read of the raw block
device.

If the verification read fails, ``xfs_scrub`` retries with single-block reads
to narrow down the failure to the specific region of the media, which is then
recorded.
When it has finished issuing verification requests, it again uses the space
mapping ioctl to map the recorded media errors back to metadata structures
and report what has been lost.
For media errors in blocks owned by files, parent pointers can be used to
construct file paths from inode numbers for user-friendly reporting.

7. Conclusion and Future Work
=============================

It is hoped that the reader of this document has followed the designs laid out
in this document and now has some familiarity with how XFS performs online
rebuilding of its metadata indices, and how filesystem users can interact with
that functionality.
Although the scope of this work is daunting, it is hoped that this guide will
make it easier for code readers to understand what has been built, for whom it
has been built, and why.
Please feel free to contact the XFS mailing list with questions.

FIEXCHANGE_RANGE
----------------

As discussed earlier, a second frontend to the atomic extent swap mechanism is
a new ioctl call that userspace programs can use to commit updates to files
atomically.
This frontend has been out for review for several years now, though the
necessary refinements to online repair and lack of customer demand mean that
the proposal has not been pushed very hard.

Extent Swapping with Regular User Files
```````````````````````````````````````

As mentioned earlier, XFS has long had the ability to swap extents between
files, which is used almost exclusively by ``xfs_fsr`` to defragment files.
The earliest form of this was the fork swap mechanism, where the entire
contents of data forks could be exchanged between two files by exchanging the
raw bytes in each inode fork's immediate area.
When XFS v5 came along with self-describing metadata, this old mechanism grew
some log support to continue rewriting the owner fields of BMBT blocks during
log recovery.
When the reverse mapping btree was later added to XFS, the only way to maintain
the consistency of the fork mappings with the reverse mapping index was to
develop an iterative mechanism that used deferred bmap and rmap operations to
swap mappings one at a time.
This mechanism is identical to steps 2-3 from the procedure above except for
the new tracking items, because the atomic extent swap mechanism is an
iteration of an existing mechanism and not something totally novel.
For the narrow case of file defragmentation, the file contents must be
identical, so the recovery guarantees are not much of a gain.

Atomic extent swapping is much more flexible than the existing swapext
implementations because it can guarantee that the caller never sees a mix of
old and new contents even after a crash, and it can operate on two arbitrary
file fork ranges.
The extra flexibility enables several new use cases:

- **Atomic commit of file writes**: A userspace process opens a file that it
  wants to update.
  Next, it opens a temporary file and calls the file clone operation to reflink
  the first file's contents into the temporary file.
  Writes to the original file should instead be written to the temporary file.
  Finally, the process calls the atomic extent swap system call
  (``FIEXCHANGE_RANGE``) to exchange the file contents, thereby committing all
  of the updates to the original file, or none of them.

.. _swapext_if_unchanged:

- **Transactional file updates**: The same mechanism as above, but the caller
  only wants the commit to occur if the original file's contents have not
  changed.
  To make this happen, the calling process snapshots the file modification and
  change timestamps of the original file before reflinking its data to the
  temporary file.
  When the program is ready to commit the changes, it passes the timestamps
  into the kernel as arguments to the atomic extent swap system call.
  The kernel only commits the changes if the provided timestamps match the
  original file.

- **Emulation of atomic block device writes**: Export a block device with a
  logical sector size matching the filesystem block size to force all writes
  to be aligned to the filesystem block size.
  Stage all writes to a temporary file, and when that is complete, call the
  atomic extent swap system call with a flag to indicate that holes in the
  temporary file should be ignored.
  This emulates an atomic device write in software, and can support arbitrary
  scattered writes.

Vectorized Scrub
----------------

As it turns out, the :ref:`refactoring <scrubrepair>` of repair items mentioned
earlier was a catalyst for enabling a vectorized scrub system call.
Since 2018, the cost of making a kernel call has increased considerably on some
systems because of mitigations for speculative execution attacks.
This incentivizes program authors to make as few system calls as possible to
reduce the number of times an execution path crosses a security boundary.

With vectorized scrub, userspace pushes to the kernel the identity of a
filesystem object, a list of scrub types to run against that object, and a
simple representation of the data dependencies between the selected scrub
types.
The kernel executes as much of the caller's plan as it can until it hits a
dependency that cannot be satisfied due to a corruption, and tells userspace
how much was accomplished.
It is hoped that ``io_uring`` will pick up enough of this functionality that
online fsck can use that instead of adding a separate vectorized scrub system
call to XFS.

The relevant patchsets are the
`kernel vectorized scrub
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=vectorized-scrub>`_
and
`userspace vectorized scrub
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=vectorized-scrub>`_
series.

Quality of Service Targets for Scrub
------------------------------------

One serious shortcoming of the online fsck code is that the amount of time that
it can spend in the kernel holding resource locks is basically unbounded.
Userspace is allowed to send a fatal signal to the process, which will cause
``xfs_scrub`` to exit when it reaches a good stopping point, but there's no way
for userspace to provide a time budget to the kernel.
Given that the scrub codebase has helpers to detect fatal signals, it shouldn't
be too much work to allow userspace to specify a timeout for a scrub/repair
operation and abort the operation if it exceeds its budget.
However, most repair functions have the property that once they begin to touch
ondisk metadata, the operation cannot be cancelled cleanly, after which a QoS
timeout is no longer useful.

Defragmenting Free Space
------------------------

Over the years, many XFS users have requested the creation of a program to
clear a portion of the physical storage underlying a filesystem so that it
becomes a contiguous chunk of free space.
Call this free space defragmenter ``clearspace`` for short.

The first piece the ``clearspace`` program needs is the ability to read the
reverse mapping index from userspace.
This already exists in the form of the ``FS_IOC_GETFSMAP`` ioctl.
The second piece it needs is a new fallocate mode
(``FALLOC_FL_MAP_FREE_SPACE``) that allocates the free space in a region and
maps it to a file.
Call this file the "space collector" file.
The third piece is the ability to force an online repair.

To clear all the metadata out of a portion of physical storage, clearspace
uses the new fallocate map-freespace call to map any free space in that region
to the space collector file.
Next, clearspace finds all metadata blocks in that region by way of
``GETFSMAP`` and issues forced repair requests on the data structure.
This often results in the metadata being rebuilt somewhere that is not being
cleared.
After each relocation, clearspace calls the "map free space" function again to
collect any newly freed space in the region being cleared.

To clear all the file data out of a portion of the physical storage, clearspace
uses the FSMAP information to find relevant file data blocks.
Having identified a good target, it uses the ``FICLONERANGE`` call on that part
of the file to try to share the physical space with a dummy file.
Cloning the extent means that the original owners cannot overwrite the
contents; any changes will be written somewhere else via copy-on-write.
Clearspace makes its own copy of the frozen extent in an area that is not being
cleared, and uses ``FIDEDUPERANGE`` (or the :ref:`atomic extent swap
<swapext_if_unchanged>` feature) to change the target file's data extent
mapping away from the area being cleared.
When all other mappings have been moved, clearspace reflinks the space into the
space collector file so that it becomes unavailable.

There are further optimizations that could apply to the above algorithm.
To clear a piece of physical storage that has a high sharing factor, it is
strongly desirable to retain this sharing factor.
In fact, these extents should be moved first to maximize sharing factor after
the operation completes.
To make this work smoothly, clearspace needs a new ioctl
(``FS_IOC_GETREFCOUNTS``) to report reference count information to userspace.
With the refcount information exposed, clearspace can quickly find the longest,
most shared data extents in the filesystem, and target them first.

**Future Work Question**: How might the filesystem move inode chunks?

*Answer*: To move inode chunks, Dave Chinner constructed a prototype program
that creates a new file with the old contents and then locklessly runs around
the filesystem updating directory entries.
The operation cannot complete if the filesystem goes down.
That problem isn't totally insurmountable: create an inode remapping table
hidden behind a jump label, and a log item that tracks the kernel walking the
filesystem to update directory entries.
The trouble is, the kernel can't do anything about open files, since it cannot
revoke them.

**Future Work Question**: Can static keys be used to minimize the cost of
supporting ``revoke()`` on XFS files?

*Answer*: Yes.
Until the first revocation, the bailout code need not be in the call path at
all.

The relevant patchsets are the
`kernel freespace defrag
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=defrag-freespace>`_
and
`userspace freespace defrag
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=defrag-freespace>`_
series.

Shrinking Filesystems
---------------------

Removing the end of the filesystem ought to be a simple matter of evacuating
the data and metadata at the end of the filesystem, and handing the freed space
to the shrink code.
That requires an evacuation of the space at the end of the filesystem, which is
a use of free space defragmentation!