.. SPDX-License-Identifier: GPL-2.0
.. _xfs_online_fsck_design:

..
        Mapping of heading styles within this document:
        Heading 1 uses "====" above and below
        Heading 2 uses "===="
        Heading 3 uses "----"
        Heading 4 uses "````"
        Heading 5 uses "^^^^"
        Heading 6 uses "~~~~"
        Heading 7 uses "...."

        Sections are manually numbered because apparently that's what everyone
        does in the kernel.

======================
XFS Online Fsck Design
======================

This document captures the design of the online filesystem check feature for
XFS.
The purpose of this document is threefold:

- To help kernel distributors understand exactly what the XFS online fsck
  feature is, and issues about which they should be aware.

- To help people reading the code to familiarize themselves with the relevant
  concepts and design points before they start digging into the code.

- To help developers maintaining the system by capturing the reasons
  supporting higher level decision making.

As the online fsck code is merged, the links in this document to topic branches
will be replaced with links to code.

This document is licensed under the terms of the GNU General Public License,
v2.
The primary author is Darrick J. Wong.

This design document is split into seven parts.
Part 1 defines what fsck tools are and the motivations for writing a new one.
Parts 2 and 3 present a high level overview of how the online fsck process
works and how it is tested to ensure correct functionality.
Part 4 discusses the user interface and the intended usage modes of the new
program.
Parts 5 and 6 show off the high level components and how they fit together, and
then present case studies of how each repair function actually works.
Part 7 sums up what has been discussed so far and speculates about what else
might be built atop online fsck.

.. contents:: Table of Contents
   :local:

1. What is a Filesystem Check?
==============================

A Unix filesystem has four main responsibilities:

- Provide a hierarchy of names through which application programs can associate
  arbitrary blobs of data for any length of time,

- Virtualize physical storage media across those names,

- Retrieve the named data blobs at any time, and

- Examine resource usage.

Metadata directly supporting these functions (e.g. files, directories, space
mappings) are sometimes called primary metadata.
Secondary metadata (e.g. reverse mapping and directory parent pointers) support
operations internal to the filesystem, such as internal consistency checking
and reorganization.
Summary metadata, as the name implies, condense information contained in
primary metadata for performance reasons.

The filesystem check (fsck) tool examines all the metadata in a filesystem
to look for errors.
In addition to looking for obvious metadata corruptions, fsck also
cross-references different types of metadata records with each other to look
for inconsistencies.
People do not like losing data, so most fsck tools also contain some ability
to correct any problems found.
As a word of caution -- the primary goal of most Linux fsck tools is to restore
the filesystem metadata to a consistent state, not to maximize the data
recovered.
That precedent will not be challenged here.

Filesystems of the 20th century generally lacked any redundancy in the ondisk
format, which means that fsck can only respond to errors by erasing files until
errors are no longer detected.
More recent filesystem designs contain enough redundancy in their metadata that
it is now possible to regenerate data structures when non-catastrophic errors
occur; this capability aids both strategies.

+--------------------------------------------------------------------------+
| **Note**:                                                                |
+--------------------------------------------------------------------------+
| System administrators avoid data loss by increasing the number of        |
| separate storage systems through the creation of backups; and they avoid |
| downtime by increasing the redundancy of each storage system through the |
| creation of RAID arrays.                                                 |
| fsck tools address only the first problem.                               |
+--------------------------------------------------------------------------+

TLDR; Show Me the Code!
-----------------------

Code is posted to the kernel.org git trees as follows:
`kernel changes <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-symlink>`_,
`userspace changes <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-media-scan-service>`_, and
`QA test changes <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=repair-dirs>`_.
Each kernel patchset adding an online repair function will use the same branch
name across the kernel, xfsprogs, and fstests git repos.

Existing Tools
--------------

The online fsck tool described here will be the third tool in the history of
XFS (on Linux) to check and repair filesystems.
Two programs precede it:

The first program, ``xfs_check``, was created as part of the XFS debugger
(``xfs_db``) and can only be used with unmounted filesystems.
It walks all metadata in the filesystem looking for inconsistencies, though it
lacks any ability to repair what it finds.
Due to its high memory requirements and inability to repair things, this
program is now deprecated and will not be discussed further.

The second program, ``xfs_repair``, was created to be faster and more robust
than the first program.
Like its predecessor, it can only be used with unmounted filesystems.
It uses extent-based in-memory data structures to reduce memory consumption,
and tries to schedule readahead IO appropriately to reduce I/O waiting time
while it scans the metadata of the entire filesystem.
The most important feature of this tool is its ability to respond to
inconsistencies in file metadata and the directory tree by erasing things as
needed to eliminate problems.
Space usage metadata are rebuilt from the observed file metadata.

Problem Statement
-----------------

The current XFS tools leave several problems unsolved:

1. **User programs** suddenly **lose access** to the filesystem when unexpected
   shutdowns occur as a result of silent corruptions in the metadata.
   These occur **unpredictably** and often without warning.

2. **Users** experience a **total loss of service** during the recovery period
   after an **unexpected shutdown** occurs.

3. **Users** experience a **total loss of service** if the filesystem is taken
   offline to **look for problems** proactively.

4. **Data owners** cannot **check the integrity** of their stored data without
   reading all of it.
   This may expose them to substantial billing costs when a linear media scan
   performed by the storage system administrator might suffice.

5. **System administrators** cannot **schedule** a maintenance window to deal
   with corruptions if they **lack the means** to assess filesystem health
   while the filesystem is online.

6. **Fleet monitoring tools** cannot **automate periodic checks** of filesystem
   health when doing so requires **manual intervention** and downtime.

7. **Users** can be tricked into **doing things they do not desire** when
   malicious actors **exploit quirks of Unicode** to place misleading names
   in directories.

Given this definition of the problems to be solved and the actors who would
benefit, the proposed solution is a third fsck tool that acts on a running
filesystem.

This new third program has three components: an in-kernel facility to check
metadata, an in-kernel facility to repair metadata, and a userspace driver
program to drive fsck activity on a live filesystem.
``xfs_scrub`` is the name of the driver program.
The rest of this document presents the goals and use cases of the new fsck
tool, describes its major design points in connection to those goals, and
discusses the similarities and differences with existing tools.

+--------------------------------------------------------------------------+
| **Note**:                                                                |
+--------------------------------------------------------------------------+
| Throughout this document, the existing offline fsck tool can also be     |
| referred to by its current name "``xfs_repair``".                        |
| The userspace driver program for the new online fsck tool can be         |
| referred to as "``xfs_scrub``".                                          |
| The kernel portion of online fsck that validates metadata is called      |
| "online scrub", and the portion of the kernel that fixes metadata is     |
| called "online repair".                                                  |
+--------------------------------------------------------------------------+

The naming hierarchy is broken up into objects known as directories and files
and the physical space is split into pieces known as allocation groups.
Sharding enables better performance on highly parallel systems and helps to
contain the damage when corruptions occur.
The division of the filesystem into principal objects (allocation groups and
inodes) means that there are ample opportunities to perform targeted checks and
repairs on a subset of the filesystem.

While a targeted check or repair is running on one part of the filesystem, the
other parts continue processing IO requests.
Even if a piece of filesystem metadata can only be regenerated by scanning the
entire system, the scan can still be done in the background while other file
operations continue.

In summary, online fsck takes advantage of resource sharding and redundant
metadata to enable targeted checking and repair operations while the system
is running.
This capability will be coupled to automatic system management so that
autonomous self-healing of XFS maximizes service availability.

2. Theory of Operation
======================

Because it is necessary for online fsck to lock and scan live metadata objects,
online fsck consists of three separate code components.
The first is the userspace driver program ``xfs_scrub``, which is responsible
for identifying individual metadata items, scheduling work items for them,
reacting to the outcomes appropriately, and reporting results to the system
administrator.
The second and third are in the kernel, which implements functions to check
and repair each type of online fsck work item.

+------------------------------------------------------------------+
| **Note**:                                                        |
+------------------------------------------------------------------+
| For brevity, this document shortens the phrase "online fsck work |
| item" to "scrub item".                                           |
+------------------------------------------------------------------+

Scrub item types are delineated in a manner consistent with the Unix design
philosophy, which is to say that each item should handle one aspect of a
metadata structure, and handle it well.

Scope
-----

In principle, online fsck should be able to check and to repair everything that
the offline fsck program can handle.
However, online fsck cannot be running 100% of the time, which means that
latent errors may creep in after a scrub completes.
If these errors cause the next mount to fail, offline fsck is the only
solution.
This limitation means that maintenance of the offline fsck tool will continue.
A second limitation of online fsck is that it must follow the same resource
sharing and lock acquisition rules as the regular filesystem.
This means that scrub cannot take *any* shortcuts to save time, because doing
so could lead to concurrency problems.
In other words, online fsck is not a complete replacement for offline fsck, and
a complete run of online fsck may take longer than offline fsck.
However, both of these limitations are acceptable tradeoffs to satisfy the
different motivations of online fsck, which are to **minimize system downtime**
and to **increase predictability of operation**.

.. _scrubphases:

Phases of Work
--------------

The userspace driver program ``xfs_scrub`` splits the work of checking and
repairing an entire filesystem into seven phases.
Each phase concentrates on checking specific types of scrub items and depends
on the success of all previous phases.
The seven phases are as follows:

1. Collect geometry information about the mounted filesystem and computer,
   discover the online fsck capabilities of the kernel, and open the
   underlying storage devices.

2. Check allocation group metadata, all realtime volume metadata, and all quota
   files.
   Each metadata structure is scheduled as a separate scrub item.
   If corruption is found in the inode header or inode btree and ``xfs_scrub``
   is permitted to perform repairs, then those scrub items are repaired to
   prepare for phase 3.
   Repairs are implemented by using the information in the scrub item to
   resubmit the kernel scrub call with the repair flag enabled; this is
   discussed in the next section.
   Optimizations and all other repairs are deferred to phase 4.

3. Check all metadata of every file in the filesystem.
   Each metadata structure is also scheduled as a separate scrub item.
   If repairs are needed and ``xfs_scrub`` is permitted to perform repairs,
   and there were no problems detected during phase 2, then those scrub items
   are repaired immediately.
   Optimizations, deferred repairs, and unsuccessful repairs are deferred to
   phase 4.

4. All remaining repairs and scheduled optimizations are performed during this
   phase, if the caller permits them.
   Before starting repairs, the summary counters are checked and any necessary
   repairs are performed so that subsequent repairs will not fail the resource
   reservation step due to wildly incorrect summary counters.
   Unsuccessful repairs are requeued as long as forward progress on repairs is
   made somewhere in the filesystem.
   Free space in the filesystem is trimmed at the end of phase 4 if the
   filesystem is clean.

5. By the start of this phase, all primary and secondary filesystem metadata
   must be correct.
   Summary counters such as the free space counts and quota resource counts
   are checked and corrected.
   Directory entry names and extended attribute names are checked for
   suspicious entries such as control characters or confusing Unicode sequences
   appearing in names.

6. If the caller asks for a media scan, read all allocated and written data
   file extents in the filesystem.
   The ability to use hardware-assisted data file integrity checking is new
   to online fsck; neither of the previous tools has this capability.
   If media errors occur, they will be mapped to the owning files and reported.

7. Re-check the summary counters and present the caller with a summary of
   space usage and file counts.

Steps for Each Scrub Item
-------------------------

The kernel scrub code uses a three-step strategy for checking and repairing
the one aspect of a metadata object represented by a scrub item:

1. The scrub item of interest is checked for corruptions; opportunities for
   optimization; and for values that are directly controlled by the system
   administrator but look suspicious.
   If the item is neither corrupt nor in need of optimization, resources are
   released and the positive scan results are returned to userspace.
   If the item is corrupt or could be optimized but the caller does not permit
   this, resources are released and the negative scan results are returned to
   userspace.
   Otherwise, the kernel moves on to the second step.

2. The repair function is called to rebuild the data structure.
   Repair functions generally choose to rebuild a structure from other metadata
   rather than try to salvage the existing structure.
   If the repair fails, the scan results from the first step are returned to
   userspace.
   Otherwise, the kernel moves on to the third step.

3. In the third step, the kernel runs the same checks over the new metadata
   item to assess the efficacy of the repairs.
   The results of the reassessment are returned to userspace.

Classification of Metadata
--------------------------

Each type of metadata object (and therefore each type of scrub item) is
classified as follows:

Primary Metadata
````````````````

Metadata structures in this category should be most familiar to filesystem
users either because they are directly created by the user or they index
objects created by the user.
Most filesystem objects fall into this class:

- Free space and reference count information

- Inode records and indexes

- Storage mapping information for file data

- Directories

- Extended attributes

- Symbolic links

- Quota limits

Scrub obeys the same rules as regular filesystem accesses for resource and lock
acquisition.

Primary metadata objects are the simplest for scrub to process.
The principal filesystem object (either an allocation group or an inode) that
owns the item being scrubbed is locked to guard against concurrent updates.
The check function examines every record associated with the type for obvious
errors and cross-references healthy records against other metadata to look for
inconsistencies.
Repairs for this class of scrub item are simple, since the repair function
starts by holding all the resources acquired in the previous step.
The repair function scans available metadata as needed to record all the
observations needed to complete the structure.
Next, it stages the observations in a new ondisk structure and commits it
atomically to complete the repair.
Finally, the storage from the old data structure is carefully reaped.

Because ``xfs_scrub`` locks a primary object for the duration of the repair,
this is effectively an offline repair operation performed on a subset of the
filesystem.
This minimizes the complexity of the repair code because it is not necessary to
handle concurrent updates from other threads, nor is it necessary to access
any other part of the filesystem.
As a result, indexed structures can be rebuilt very quickly, and programs
trying to access the damaged structure will be blocked until repairs complete.
The only infrastructure needed by the repair code is the staging area for
observations and a means to write new structures to disk.
Despite these limitations, the advantage that online repair holds is clear:
targeted work on individual shards of the filesystem avoids total loss of
service.

This mechanism is described in section 2.1 ("Off-Line Algorithm") of
V. Srinivasan and M. J. Carey, `"Performance of On-Line Index Construction
Algorithms" <https://minds.wisconsin.edu/bitstream/handle/1793/59524/TR1047.pdf>`_,
*Extending Database Technology*, pp. 293-309, 1992.

Most primary metadata repair functions stage their intermediate results in an
in-memory array prior to formatting the new ondisk structure, which is very
similar to the list-based algorithm discussed in section 2.3 ("List-Based
Algorithms") of Srinivasan.
However, any data structure builder that maintains a resource lock for the
duration of the repair is *always* an offline algorithm.

.. _secondary_metadata:

Secondary Metadata
``````````````````

Metadata structures in this category reflect records found in primary metadata,
but are only needed for online fsck or for reorganization of the filesystem.

Secondary metadata include:

- Reverse mapping information

- Directory parent pointers

This class of metadata is difficult for scrub to process because scrub attaches
to the secondary object but needs to check primary metadata, which runs counter
to the usual order of resource acquisition.
Frequently, this means that full filesystem scans are necessary to rebuild the
metadata.
Check functions can be limited in scope to reduce runtime.
Repairs, however, require a full scan of primary metadata, which can take a
long time to complete.
Under these conditions, ``xfs_scrub`` cannot lock resources for the entire
duration of the repair.

Instead, repair functions set up an in-memory staging structure to store
observations.
Depending on the requirements of the specific repair function, the staging
index will either have the same format as the ondisk structure or a design
specific to that repair function.
The next step is to release all locks and start the filesystem scan.
When the repair scanner needs to record an observation, the staging data are
locked long enough to apply the update.
While the filesystem scan is in progress, the repair function hooks the
filesystem so that it can apply pending filesystem updates to the staging
information.
Once the scan is done, the owning object is re-locked, the live data is used to
write a new ondisk structure, and the repairs are committed atomically.
The hooks are disabled and the staging area is freed.
Finally, the storage from the old data structure is carefully reaped.

Introducing concurrency helps online repair avoid various locking problems, but
comes at a high cost to code complexity.
Live filesystem code has to be hooked so that the repair function can observe
updates in progress.
The staging area has to become a fully functional parallel structure so that
updates can be merged from the hooks.
Finally, the hook, the filesystem scan, and the inode locking model must be
sufficiently well integrated that a hook event can decide if a given update
should be applied to the staging structure.

In theory, the scrub implementation could apply these same techniques for
primary metadata, but doing so would make it massively more complex and less
performant.
Programs attempting to access the damaged structures are not blocked from
operation, which may cause application failure or an unplanned filesystem
shutdown.

Inspiration for the secondary metadata repair strategy was drawn from section
2.4 of Srinivasan above, and sections 2 ("NSF: Index Build Without Side-File")
and 3.1.1 ("Duplicate Key Insert Problem") in C. Mohan, `"Algorithms for
Creating Indexes for Very Large Tables Without Quiescing Updates"
<https://dl.acm.org/doi/10.1145/130283.130337>`_, 1992.

The sidecar index mentioned above bears some resemblance to the side file
method mentioned in Srinivasan and Mohan.
Their method consists of an index builder that extracts relevant record data to
build the new structure as quickly as possible; and an auxiliary structure that
captures all updates that would be committed to the index by other threads were
the new index already online.
After the index building scan finishes, the updates recorded in the side file
are applied to the new index.
To avoid conflicts between the index builder and other writer threads, the
builder maintains a publicly visible cursor that tracks the progress of the
scan through the record space.
To avoid duplication of work between the side file and the index builder, side
file updates are elided when the record ID for the update is greater than the
cursor position within the record ID space.

To minimize changes to the rest of the codebase, XFS online repair keeps the
replacement index hidden until it's completely ready to go.
In other words, there is no attempt to expose the keyspace of the new index
while repair is running.
The complexity of such an approach would be very high and perhaps more
appropriate to building *new* indices.

**Future Work Question**: Can the full scan and live update code used to
facilitate a repair also be used to implement a comprehensive check?

*Answer*: In theory, yes.  Check would be much stronger if each scrub function
employed these live scans to build a shadow copy of the metadata and then
compared the shadow records to the ondisk records.
However, doing that is a fair amount more work than what the checking functions
do now, which in turn would increase the runtime of those scrub functions.
The live scans and hooks were also developed much later than the checking
functions.

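The interaction between the scan cursor and the live update hook described in
this section can be sketched in a few lines of Python.
This is only a toy model written under stated assumptions: the class
``LiveRebuild``, its methods, and the block-to-owner "reverse index" are
hypothetical names invented for illustration, the scan is simulated in a single
thread, and none of it reflects the actual kernel implementation.

```python
from dataclasses import dataclass, field


@dataclass
class LiveRebuild:
    """Toy rebuild of a reverse index (block -> owner) from primary records.

    All names here are hypothetical; this is not the kernel's data model.
    """
    primary: dict                                  # record ID -> set of owned blocks
    staging: dict = field(default_factory=dict)    # block -> record ID (staging index)
    cursor: int = -1                               # highest record ID scanned so far

    def scan_one(self, record_id):
        """Scanner step: record observations for one primary record."""
        for block in self.primary.get(record_id, set()):
            self.staging[block] = record_id
        self.cursor = record_id

    def hook(self, record_id, block, added):
        """Live update hook, called after a primary record has changed."""
        if record_id > self.cursor:
            # Not scanned yet: the scanner will observe this change itself,
            # so the update is elided to avoid duplicated work.
            return
        if added:
            self.staging[block] = record_id
        else:
            self.staging.pop(block, None)


def update_primary(rebuild, record_id, block, added):
    """Stand-in for a regular filesystem update that fires the hook."""
    blocks = rebuild.primary.setdefault(record_id, set())
    if added:
        blocks.add(block)
    else:
        blocks.discard(block)
    rebuild.hook(record_id, block, added)


def run_demo():
    rebuild = LiveRebuild(primary={1: {10, 11}, 2: {20}, 3: {30}})
    rebuild.scan_one(1)
    update_primary(rebuild, 1, 12, added=True)     # behind the cursor: applied
    update_primary(rebuild, 3, 31, added=True)     # ahead of the cursor: elided
    update_primary(rebuild, 1, 10, added=False)    # behind the cursor: applied
    rebuild.scan_one(2)
    rebuild.scan_one(3)
    # What a rebuild from a quiesced filesystem would have produced.
    expected = {b: r for r, blocks in rebuild.primary.items() for b in blocks}
    return rebuild.staging, expected
```

In this model the staging index ends up identical to one built from a quiesced
copy of the primary records, even though updates arrived on both sides of the
cursor while the scan was running.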
Summary Information
```````````````````

Metadata structures in this last category summarize the contents of primary
metadata records.
These are often used to speed up resource usage queries, and are many times
smaller than the primary metadata which they represent.

Examples of summary information include:

- Summary counts of free space and inodes

- File link counts from directories

- Quota resource usage counts

Check and repair require full filesystem scans, but resource and lock
acquisition follow the same paths as regular filesystem accesses.

The superblock summary counters have special requirements due to the underlying
implementation of the incore counters, and will be treated separately.
Check and repair of the other types of summary counters (quota resource counts
and file link counts) employ the same filesystem scanning and hooking
techniques as outlined above, but because the underlying data are sets of
integer counters, the staging data need not be a fully functional mirror of the
ondisk structure.

Inspiration for quota and file link count repair strategies was drawn from
sections 2.12 ("Online Index Operations") through 2.14 ("Incremental View
Maintenance") of G.
Graefe, `"Concurrent Queries and Updates in Summary Views
and Their Indexes"
<http://www.odbms.org/wp-content/uploads/2014/06/Increment-locks.pdf>`_, 2011.

Since quotas are non-negative integer counts of resource usage, online
quotacheck can use the incremental view deltas described in section 2.14 to
track pending changes to the block and inode usage counts in each transaction,
and commit those changes to a dquot side file when the transaction commits.
Delta tracking is necessary for dquots because the index builder scans inodes,
whereas the data structure being rebuilt is an index of dquots.
Link count checking combines the view deltas and commit step into one because
it sets attributes of the objects being scanned instead of writing them to a
separate data structure.
Each online fsck function is discussed as a case study later in this
document.

Risk Management
---------------

During the development of online fsck, several risk factors were identified
that may make the feature unsuitable for certain distributors and users.
Steps can be taken to mitigate or eliminate those risks, though at a cost to
functionality.

- **Decreased performance**: Adding metadata indices to the filesystem
  increases the time cost of persisting changes to disk, and the reverse space
  mapping and directory parent pointers are no exception.
  System administrators who require the maximum performance can disable the
  reverse mapping features at format time, though this choice dramatically
  reduces the ability of online fsck to find inconsistencies and repair them.

- **Incorrect repairs**: As with all software, there might be defects in the
  software that result in incorrect repairs being written to the filesystem.
  Systematic fuzz testing (detailed in the next section) is employed by the
  authors to find bugs early, but it might not catch everything.
  The kernel build system provides Kconfig options (``CONFIG_XFS_ONLINE_SCRUB``
  and ``CONFIG_XFS_ONLINE_REPAIR``) to enable distributors to choose not to
  accept this risk.
  The xfsprogs build system has a configure option (``--enable-scrub=no``) that
  disables building of the ``xfs_scrub`` binary, though this is not a risk
  mitigation if the kernel functionality remains enabled.

- **Inability to repair**: Sometimes, a filesystem is too badly damaged to be
  repairable.
  If the keyspaces of several metadata indices overlap in some manner but a
  coherent narrative cannot be formed from records collected, then the repair
  fails.
  To reduce the chance that a repair will fail with a dirty transaction and
  render the filesystem unusable, the online repair functions have been
  designed to stage and validate all new records before committing the new
  structure.

- **Misbehavior**: Online fsck requires many privileges: raw IO to block
  devices, opening files by handle, ignoring Unix discretionary access control,
  and the ability to perform administrative changes.
  Running this automatically in the background scares people, so the systemd
  background service is configured to run with only the privileges required.
  Obviously, this cannot address certain problems like the kernel crashing or
  deadlocking, but it should be sufficient to prevent the scrub process from
  escaping and reconfiguring the system.
  The cron job does not have this protection.

- **Fuzz Kiddiez**: There are many people now who seem to think that running
  automated fuzz testing of ondisk artifacts to find mischievous behavior and
  spraying exploit code onto the public mailing list for instant zero-day
  disclosure is somehow of some social benefit.
  In the view of this author, the benefit is realized only when the fuzz
  operators help to **fix** the flaws, but this opinion apparently is not
  widely shared among security "researchers".
  The XFS maintainers' continuing ability to manage these events presents an
  ongoing risk to the stability of the development process.
  Automated testing should front-load some of the risk while the feature is
  considered EXPERIMENTAL.

Many of these risks are inherent to software programming.
Despite this, it is hoped that this new functionality will prove useful in
reducing unexpected downtime.

3. Testing Plan
===============

As stated before, fsck tools have three main goals:

1. Detect inconsistencies in the metadata;

2. Eliminate those inconsistencies; and

3. Minimize further loss of data.

Demonstrations of correct operation are necessary to build users' confidence
that the software behaves within expectations.
Unfortunately, it was not really feasible to perform regular exhaustive testing
of every aspect of a fsck tool until the introduction of low-cost virtual
machines with high-IOPS storage.
With ample hardware availability in mind, the testing strategy for the online
fsck project involves differential analysis against the existing fsck tools and
systematic testing of every attribute of every type of metadata object.
Testing can be split into four major categories, as discussed below.

Integrated Testing with fstests
-------------------------------

The primary goal of any free software QA effort is to make testing as
inexpensive and widespread as possible to maximize the scaling advantages of
the community.
In other words, testing should maximize the breadth of filesystem configuration
scenarios and hardware setups.
This improves code quality by enabling the authors of online fsck to find and
fix bugs early, and helps developers of new features to find integration
issues earlier in their development effort.

The Linux filesystem community shares a common QA testing suite,
`fstests <https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/>`_, for
functional and regression testing.
Even before development work began on online fsck, fstests (when run on XFS)
would run both the ``xfs_check`` and ``xfs_repair -n`` commands on the test and
scratch filesystems between each test.
This provides a level of assurance that the kernel and the fsck tools stay in
alignment about what constitutes consistent metadata.
During development of the online checking code, fstests was modified to run
``xfs_scrub -n`` between each test to ensure that the new checking code
produces the same results as the two existing fsck tools.

To start development of online repair, fstests was modified to run
``xfs_repair`` to rebuild the filesystem's metadata indices between tests.
This ensures that offline repair does not crash, leave a corrupt filesystem
after it exits, or trigger complaints from the online check.
This also established a baseline for what can and cannot be repaired offline.
To complete the first phase of development of online repair, fstests was
modified to be able to run ``xfs_scrub`` in a "force rebuild" mode.
This enables a comparison of the effectiveness of online repair against that of
the existing offline repair tools.

General Fuzz Testing of Metadata Blocks
---------------------------------------

XFS benefits greatly from having a very robust debugging tool, ``xfs_db``.

Before development of online fsck even began, a set of fstests was created
to test the rather common fault that entire metadata blocks get corrupted.
This required the creation of fstests library code that can create a filesystem
containing every possible type of metadata object.
Next, individual test cases were created to create a test filesystem, identify
a single block of a specific type of metadata object, trash it with the
existing ``blocktrash`` command in ``xfs_db``, and test the reaction of a
particular metadata validation strategy.

This earlier test suite enabled XFS developers to test the ability of the
in-kernel validation functions and the ability of the offline fsck tool to
detect and eliminate the inconsistent metadata.
This part of the test suite was extended to cover online fsck in exactly the
same manner.

In other words, for a given fstests filesystem configuration:

* For each metadata object existing on the filesystem:

  * Write garbage to it

  * Test the reactions of:

    1. The kernel verifiers to stop obviously bad metadata
    2. Offline repair (``xfs_repair``) to detect and fix
    3. Online repair (``xfs_scrub``) to detect and fix

Targeted Fuzz Testing of Metadata Records
-----------------------------------------

The testing plan for online fsck includes extending the existing fs testing
infrastructure to provide a much more powerful facility: targeted fuzz testing
of every metadata field of every metadata object in the filesystem.
``xfs_db`` can modify every field of every metadata structure in every
block in the filesystem to simulate the effects of memory corruption and
software bugs.
Given that fstests already contains the ability to create a filesystem
containing every metadata format known to the filesystem, ``xfs_db`` can be
used to perform exhaustive fuzz testing!

For a given fstests filesystem configuration:

* For each metadata object existing on the filesystem...

  * For each record inside that metadata object...

    * For each field inside that record...

      * For each conceivable type of transformation that can be applied to a bit field...

        1. Clear all bits
        2. Set all bits
        3. Toggle the most significant bit
        4. Toggle the middle bit
        5. Toggle the least significant bit
        6. Add a small quantity
        7. Subtract a small quantity
        8. Randomize the contents

        * ...test the reactions of:

          1. The kernel verifiers to stop obviously bad metadata
          2. Offline checking (``xfs_repair -n``)
          3. Offline repair (``xfs_repair``)
          4. Online checking (``xfs_scrub -n``)
          5. Online repair (``xfs_scrub``)
          6. Both repair tools (``xfs_scrub`` and then ``xfs_repair`` if online repair doesn't succeed)

This is quite the combinatoric explosion!

Fortunately, having this much test coverage makes it easy for XFS developers to
check the responses of XFS' fsck tools.
Since the introduction of the fuzz testing framework, these tests have been
used to discover incorrect repair code and missing functionality for entire
classes of metadata objects in ``xfs_repair``.
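
The eight field transformations listed above can be illustrated with a short
Python sketch.
This is not how ``xfs_db`` or fstests implement fuzzing; the function name
``fuzz_variants``, the dictionary labels, the delta of 17 used as the "small
quantity", and the wrap-around at the field width are all assumptions made
purely for illustration.

```python
import random


def fuzz_variants(value, width, rng, delta=17):
    """Return the eight fuzzed versions of a `width`-bit field value.

    Hypothetical helper: the labels and the default delta are illustrative
    choices, and arithmetic is assumed to wrap at the field width.
    """
    mask = (1 << width) - 1
    value &= mask
    return {
        "clear_all": 0,                                   # 1. clear all bits
        "set_all": mask,                                  # 2. set all bits
        "toggle_msb": value ^ (1 << (width - 1)),         # 3. most significant bit
        "toggle_middle": value ^ (1 << (width // 2)),     # 4. middle bit
        "toggle_lsb": value ^ 1,                          # 5. least significant bit
        "add_small": (value + delta) & mask,              # 6. add a small quantity
        "sub_small": (value - delta) & mask,              # 7. subtract a small quantity
        "randomize": rng.getrandbits(width),              # 8. randomize the contents
    }


# Usage: fuzz an 8-bit field that currently holds the value 10.
variants = fuzz_variants(10, 8, random.Random(0))
```

Each variant would then be written back to the field and the reactions of the
kernel verifiers and the fsck tools observed, which is where the combinatoric
explosion comes from.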
The enhanced testing was used to finalize the deprecation of ``xfs_check`` by
confirming that ``xfs_repair`` could detect at least as many corruptions as
the older tool.

These tests have been very valuable for ``xfs_scrub`` in the same ways: they
allow the online fsck developers to compare online fsck against offline fsck,
and they enable XFS developers to find deficiencies in the code base.

Proposed patchsets include
`general fuzzer improvements
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=fuzzer-improvements>`_,
`fuzzing baselines
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=fuzz-baseline>`_,
and `improvements in fuzz testing comprehensiveness
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=more-fuzz-testing>`_.

Stress Testing
--------------

A unique requirement of online fsck is the ability to operate on a filesystem
concurrently with regular workloads.
Although it is of course impossible to run ``xfs_scrub`` with *zero* observable
impact on the running system, the online repair code should never introduce
inconsistencies into the filesystem metadata, and regular workloads should
never notice resource starvation.
To verify that these conditions are being met, fstests has been enhanced in
the following ways:

* For each scrub item type, create a test to exercise checking that item type
  while running ``fsstress``.
* For each scrub item type, create a test to exercise repairing that item type
  while running ``fsstress``.
* Race ``fsstress`` and ``xfs_scrub -n`` to ensure that checking the whole
  filesystem doesn't cause problems.
* Race ``fsstress`` and ``xfs_scrub`` in force-rebuild mode to ensure that
  force-repairing the whole filesystem doesn't cause problems.
* Race ``xfs_scrub`` in check and force-repair mode against ``fsstress`` while
  freezing and thawing the filesystem.
* Race ``xfs_scrub`` in check and force-repair mode against ``fsstress`` while
  remounting the filesystem read-only and read-write.
* The same, but running ``fsx`` instead of ``fsstress``.  (Not done yet?)

Success is defined by the ability to run all of these tests without observing
any unexpected filesystem shutdowns due to corrupted metadata, kernel hang
check warnings, or any other sort of mischief.

Proposed patchsets include `general stress testing
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=race-scrub-and-mount-state-changes>`_
and the `evolution of existing per-function stress testing
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=refactor-scrub-stress>`_.

4. User Interface
=================

The primary user of online fsck is the system administrator, just like offline
repair.
Online fsck presents two modes of operation to administrators:
A foreground CLI process for online fsck on demand, and a background service
that performs autonomous checking and repair.

Checking on Demand
------------------

For administrators who want the absolute freshest information about the
metadata in a filesystem, ``xfs_scrub`` can be run as a foreground process on
a command line.
The program checks every piece of metadata in the filesystem while the
administrator waits for the results to be reported, just like the existing
``xfs_repair`` tool.
Both tools share a ``-n`` option to perform a read-only scan, and a ``-v``
option to increase the verbosity of the information reported.

A new feature of ``xfs_scrub`` is the ``-x`` option, which employs the error
correction capabilities of the hardware to check data file contents.
The media scan is not enabled by default because it may dramatically increase
program runtime and consume a lot of bandwidth on older storage hardware.

The output of a foreground invocation is captured in the system log.

The ``xfs_scrub_all`` program walks the list of mounted filesystems and
initiates ``xfs_scrub`` for each of them in parallel.
It serializes scans for any filesystems that resolve to the same top level
kernel block device to prevent resource overconsumption.

Background Service
------------------

To reduce the workload of system administrators, the ``xfs_scrub`` package
provides a suite of `systemd <https://systemd.io/>`_ timers and services that
run online fsck automatically on weekends by default.
The background service configures scrub to run with as little privilege as
possible, the lowest CPU and IO priority, and in a CPU-constrained single
threaded mode.
This can be tuned by the systemd administrator at any time to suit the latency
and throughput requirements of customer workloads.

The output of the background service is also captured in the system log.
If desired, reports of failures (either due to inconsistencies or mere runtime
errors) can be emailed automatically by setting the ``EMAIL_ADDR`` environment
variable in the following service files:

* ``xfs_scrub_fail@.service``
* ``xfs_scrub_media_fail@.service``
* ``xfs_scrub_all_fail.service``

The decision to enable the background scan is left to the system administrator.
This can be done by enabling either of the following services:

* ``xfs_scrub_all.timer`` on systemd systems
* ``xfs_scrub_all.cron`` on non-systemd systems

This automatic weekly scan is configured out of the box to perform an
additional media scan of all file data once per month.
This is less foolproof than, say, storing file data block checksums, but much
more performant if application software provides its own integrity checking,
redundancy can be provided elsewhere above the filesystem, or the storage
device's integrity guarantees are deemed sufficient.

The systemd unit file definitions have been subjected to a security audit
(as of systemd 249) to ensure that the ``xfs_scrub`` processes have as little
access to the rest of the system as possible.
This was performed via ``systemd-analyze security``, after which privileges
were restricted to the minimum required, sandboxing and system call filtering
were set up to the maximal extent possible, and access to the
filesystem tree was restricted to the minimum needed to start the program and
access the filesystem being scanned.
The service definition files restrict CPU usage to 80% of one CPU core, and
apply as nice of a priority to IO and CPU scheduling as possible.
This measure was taken to minimize delays in the rest of the filesystem.
No such hardening has been performed for the cron job.

Proposed patchset:
`Enabling the xfs_scrub background service
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-media-scan-service>`_.

Health Reporting
----------------

XFS caches a summary of each filesystem's health status in memory.
The information is updated whenever ``xfs_scrub`` is run, or whenever
inconsistencies are detected in the filesystem metadata during regular
operations.
System administrators should use the ``health`` command of ``xfs_spaceman`` to
retrieve this information in a human-readable format.
If problems have been observed, the administrator can schedule a reduced
service window to run the online repair tool to correct the problem.
Failing that, the administrator can decide to schedule a maintenance window to
run the traditional offline repair tool to correct the problem.

**Future Work Question**: Should the health reporting integrate with the new
fanotify fs error notification system?
Would it be helpful for sysadmins to have a daemon to listen for corruption
notifications and initiate a repair?


*Answer*: These questions remain unanswered, but should be a part of the
conversation with early adopters and potential downstream users of XFS.

Proposed patchsets include
`wiring up health reports to corruption returns
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=corruption-health-reports>`_
and
`preservation of sickness info during memory reclaim
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=indirect-health-reporting>`_.

5. Kernel Algorithms and Data Structures
========================================

This section discusses the key algorithms and data structures of the kernel
code that provide the ability to check and repair metadata while the system
is running.
The first chapters in this section reveal the pieces that provide the
foundation for checking metadata.
The remainder of this section presents the mechanisms through which XFS
regenerates itself.

Self Describing Metadata
------------------------

Starting with XFS version 5 in 2012, XFS updated the format of nearly every
ondisk block header to record a magic number, a checksum, a universally
"unique" identifier (UUID), an owner code, the ondisk address of the block,
and a log sequence number.
When loading a block buffer from disk, the magic number, UUID, owner, and
ondisk address confirm that the retrieved block matches the specific owner of
the current filesystem, and that the information contained in the block is
supposed to be found at the ondisk address.
The first three components enable checking tools to disregard alleged metadata
that doesn't belong to the filesystem, and the fourth component enables the
filesystem to detect lost writes.

Whenever a file system operation modifies a block, the change is submitted
to the log as part of a transaction.
The log then processes these transactions, marking them done once they are
safely persisted to storage.
The logging code maintains the checksum and the log sequence number of the last
transactional update.
Checksums are useful for detecting torn writes and other discrepancies that can
be introduced between the computer and its storage devices.
Sequence number tracking enables log recovery to avoid applying out of date
log updates to the filesystem.

These two features improve overall runtime resiliency by providing a means for
the filesystem to detect obvious corruption when reading metadata blocks from
disk, but these buffer verifiers cannot provide any consistency checking
between metadata structures.

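The header checks described above can be illustrated with a small sketch.
This is not the XFS implementation: ``struct demo_blk_hdr``, the
``demo_*`` functions, and the toy checksum are all hypothetical stand-ins
(real XFS headers differ per block type and use CRC32c), and the replay rule
is an assumption about how sequence numbers might be compared.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified self-describing block header. */
struct demo_blk_hdr {
    uint32_t magic;     /* identifies the type of block */
    uint32_t crc;       /* checksum of the block payload */
    uint8_t  uuid[16];  /* filesystem that the block belongs to */
    uint64_t owner;     /* structure (inode, AG, ...) owning the block */
    uint64_t blkno;     /* ondisk address the block was written to */
    uint64_t lsn;       /* log sequence number of the last update */
};

enum demo_verdict {
    DEMO_OK,
    DEMO_BAD_MAGIC,
    DEMO_BAD_UUID,
    DEMO_BAD_OWNER,
    DEMO_BAD_BLKNO,
    DEMO_BAD_CRC,
};

/* Toy checksum for illustration only; this is NOT CRC32c. */
static uint32_t demo_checksum(const uint8_t *payload, size_t len)
{
    uint32_t sum = 0;

    for (size_t i = 0; i < len; i++)
        sum = sum * 31 + payload[i];
    return sum;
}

/* Check a block that was just read from address read_blkno. */
static enum demo_verdict
demo_verify(const struct demo_blk_hdr *hdr, const uint8_t *payload, size_t len,
            uint32_t want_magic, const uint8_t fs_uuid[16],
            uint64_t want_owner, uint64_t read_blkno)
{
    assert(hdr != NULL && payload != NULL);

    if (hdr->magic != want_magic)
        return DEMO_BAD_MAGIC;      /* not the expected block type */
    if (memcmp(hdr->uuid, fs_uuid, 16) != 0)
        return DEMO_BAD_UUID;       /* belongs to another filesystem */
    if (hdr->owner != want_owner)
        return DEMO_BAD_OWNER;      /* belongs to another structure */
    if (hdr->blkno != read_blkno)
        return DEMO_BAD_BLKNO;      /* misdirected or lost write */
    if (hdr->crc != demo_checksum(payload, len))
        return DEMO_BAD_CRC;        /* torn write or media damage */
    return DEMO_OK;
}

/* Assumed rule: replay a logged update only if it is newer than the block. */
static bool demo_should_replay(const struct demo_blk_hdr *hdr,
                               uint64_t update_lsn)
{
    return update_lsn > hdr->lsn;
}
```

Note that every check here looks only at the block itself, which is why such
verifiers cannot detect inconsistencies between metadata structures.
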

For more information, please see
Documentation/filesystems/xfs-self-describing-metadata.rst.

Reverse Mapping
---------------

The original design of XFS (circa 1993) is an improvement upon 1980s Unix
filesystem design.
In those days, storage density was expensive, CPU time was scarce, and
excessive seek time could kill performance.
For performance reasons, filesystem authors were reluctant to add redundancy to
the filesystem, even at the cost of data integrity.
Filesystem designers in the early 21st century chose different strategies to
increase internal redundancy -- either storing nearly identical copies of
metadata, or more space-efficient encoding techniques.

For XFS, a different redundancy strategy was chosen to modernize the design:
a secondary space usage index that maps allocated disk extents back to their
owners.
By adding a new index, the filesystem retains most of its ability to scale
well to heavily threaded workloads involving large datasets, since the primary
file metadata (the directory tree, the file block map, and the allocation
groups) remain unchanged.
Like any system that improves redundancy, the reverse-mapping feature increases
overhead costs for space mapping activities.
However, it has two critical advantages: first, the reverse index is key to
enabling online fsck and other requested functionality such as free space
defragmentation, better media failure reporting, and filesystem shrinking.
Second, the different ondisk storage format of the reverse mapping btree
defeats device-level deduplication because the filesystem requires real
redundancy.

+--------------------------------------------------------------------------+
| **Sidebar**:                                                             |
+--------------------------------------------------------------------------+
| A criticism of adding the secondary index is that it does nothing to     |
| improve the robustness of user data storage itself.                      |
| This is a valid point, but adding a new index for file data block        |
| checksums increases write amplification by turning data overwrites into  |
| copy-writes, which age the filesystem prematurely.                       |
| In keeping with thirty years of precedent, users who want file data      |
| integrity can supply as powerful a solution as they require.             |
| As for metadata, the complexity of adding a new secondary index of space |
| usage is much less than adding volume management and storage device      |
| mirroring to XFS itself.                                                 |
| Perfection of RAID and volume management are best left to existing       |
| layers in the kernel.                                                    |
+--------------------------------------------------------------------------+

The information captured in a reverse space mapping record is as follows:

.. code-block:: c

    struct xfs_rmap_irec {
        xfs_agblock_t    rm_startblock;   /* extent start block */
        xfs_extlen_t     rm_blockcount;   /* extent length */
        uint64_t         rm_owner;        /* extent owner */
        uint64_t         rm_offset;       /* offset within the owner */
        unsigned int     rm_flags;        /* state flags */
    };

The first two fields capture the location and size of the physical space,
in units of filesystem blocks.
The owner field tells scrub which metadata structure or file inode has been
assigned this space.
For space allocated to files, the offset field tells scrub where the space was
mapped within the file fork.
Finally, the flags field provides extra information about the space usage --
is this an attribute fork extent? A file mapping btree extent? Or an
unwritten data extent?

Online filesystem checking judges the consistency of each primary metadata
record by comparing its information against all other space indices.
The reverse mapping index plays a key role in the consistency checking process
because it contains a centralized alternate copy of all space allocation
information.
Program runtime and ease of resource acquisition are the only real limits to
what online checking can consult.
For example, a file data extent mapping can be checked against:

* The absence of an entry in the free space information.
* The absence of an entry in the inode index.
* The absence of an entry in the reference count data if the file is not
  marked as having shared extents.
* The correspondence of an entry in the reverse mapping information.

There are several observations to make about reverse mapping indices:

1. Reverse mappings can provide a positive affirmation of correctness if any of
   the above primary metadata are in doubt.
   The checking code for most primary metadata follows a path similar to the
   one outlined above.

2. Proving the consistency of secondary metadata with the primary metadata is
   difficult because that requires a full scan of all primary space metadata,
   which is very time intensive.
   For example, checking a reverse mapping record for a file extent mapping
   btree block requires locking the file and searching the entire btree to
   confirm the block.
   Instead, scrub relies on rigorous cross-referencing during the primary space
   mapping structure checks.

3. Consistency scans must use non-blocking lock acquisition primitives if the
   required locking order is not the same order used by regular filesystem
   operations.
   For example, if the filesystem normally takes a file ILOCK before taking
   the AGF buffer lock but scrub wants to take a file ILOCK while holding
   an AGF buffer lock, scrub cannot block on that second acquisition.
   This means that forward progress during this part of a scan of the reverse
   mapping data cannot be guaranteed if system load is heavy.

In summary, reverse mappings play a key role in reconstruction of primary
metadata.
The details of how these records are staged, written to disk, and committed
into the filesystem are covered in subsequent sections.

Checking and Cross-Referencing
------------------------------

The first step of checking a metadata structure is to examine every record
contained within the structure and its relationship with the rest of the
system.
XFS contains multiple layers of checking to try to prevent inconsistent
metadata from wreaking havoc on the system.
Each of these layers contributes information that helps the kernel to make
five decisions about the health of a metadata structure:

- Is a part of this structure obviously corrupt (``XFS_SCRUB_OFLAG_CORRUPT``) ?
- Is this structure inconsistent with the rest of the system
  (``XFS_SCRUB_OFLAG_XCORRUPT``) ?
- Is there so much damage around the filesystem that cross-referencing is not
  possible (``XFS_SCRUB_OFLAG_XFAIL``) ?
- Can the structure be optimized to improve performance or reduce the size of
  metadata (``XFS_SCRUB_OFLAG_PREEN``) ?
- Does the structure contain data that is not inconsistent but deserves review
  by the system administrator (``XFS_SCRUB_OFLAG_WARNING``) ?

The following sections describe how the metadata scrubbing process works.

Metadata Buffer Verification
````````````````````````````

The lowest layer of metadata protection in XFS is the set of metadata verifiers
built into the buffer cache.
These functions perform inexpensive internal consistency checking of the block
itself, and answer these questions:

- Does the block belong to this filesystem?

- Does the block belong to the structure that asked for the read?
  This assumes that metadata blocks only have one owner, which is always true
  in XFS.

- Is the type of data stored in the block within a reasonable range of what
  scrub is expecting?

- Does the physical location of the block match the location it was read from?

- Does the block checksum match the data?

The scope of the protections here is very limited -- verifiers can only
establish that the filesystem code is reasonably free of gross corruption bugs
and that the storage system is reasonably competent at retrieval.
Corruption problems observed at runtime cause the generation of health reports,
failed system calls, and in the extreme case, filesystem shutdowns if the
corrupt metadata force the cancellation of a dirty transaction.

Every online fsck scrubbing function is expected to read every ondisk metadata
block of a structure in the course of checking the structure.
Corruption problems observed during a check are immediately reported to
userspace as corruption; during a cross-reference, they are reported as a
failure to cross-reference once the full examination is complete.
Reads satisfied by a buffer already in cache (and hence already verified)
bypass these checks.

Internal Consistency Checks
```````````````````````````

After the buffer cache, the next level of metadata protection is the internal
record verification code built into the filesystem.
These checks are split between the buffer verifiers, the in-filesystem users of
the buffer cache, and the scrub code itself, depending on the amount of higher
level context required.
The scope of checking is still internal to the block.
These higher level checking functions answer these questions:

- Does the type of data stored in the block match what scrub is expecting?

- Does the block belong to the owning structure that asked for the read?

- If the block contains records, do the records fit within the block?

- If the block tracks internal free space information, is it consistent with
  the record areas?

- Are the records contained inside the block free of obvious corruptions?

Record checks in this category are more rigorous and more time-intensive.
For example, block pointers and inumbers are checked to ensure that they point
within the dynamically allocated parts of an allocation group and within
the filesystem.
Names are checked for invalid characters, and flags are checked for invalid
combinations.
Other record attributes are checked for sensible values.
Btree records spanning an interval of the btree keyspace are checked for
correct order and lack of mergeability (except for file fork mappings).
For performance reasons, regular code may skip some of these checks unless
debugging is enabled or a write is about to occur.
Scrub functions, of course, must check all possible problems.

Validation of Userspace-Controlled Record Attributes
````````````````````````````````````````````````````

Various pieces of filesystem metadata are directly controlled by userspace.
Because userspace may set these values freely, validation work cannot be more
precise than checking that a value is within the possible range.
These fields include:

- Superblock fields controlled by mount options
- Filesystem labels
- File timestamps
- File permissions
- File size
- File flags
- Names present in directory entries, extended attribute keys, and filesystem
  labels
- Extended attribute key namespaces
- Extended attribute values
- File data block contents
- Quota limits
- Quota timer expiration (if resource usage exceeds the soft limit)

Cross-Referencing Space Metadata
````````````````````````````````

After internal block checks, the next higher level of checking is
cross-referencing records between metadata structures.
For regular runtime code, the cost of these checks is considered to be
prohibitively expensive, but as scrub is dedicated to rooting out
inconsistencies, it must pursue all avenues of inquiry.
The exact set of cross-referencing is highly dependent on the context of the
data structure being checked.

The XFS btree code has keyspace scanning functions that online fsck uses to
cross reference one structure with another.
Specifically, scrub can scan the key space of an index to determine if that
keyspace is fully, sparsely, or not at all mapped to records.
For the reverse mapping btree, it is possible to mask parts of the key for the
purposes of performing a keyspace scan so that scrub can decide if the rmap
btree contains records mapping a certain extent of physical space without the
sparseness of the rest of the rmap keyspace getting in the way.

Btree blocks undergo the following checks before cross-referencing:

- Does the type of data stored in the block match what scrub is expecting?

- Does the block belong to the owning structure that asked for the read?

- Do the records fit within the block?

- Are the records contained inside the block free of obvious corruptions?

- Are the name hashes in the correct order?

- Do node pointers within the btree point to valid block addresses for the type
  of btree?

- Do child pointers point towards the leaves?

- Do sibling pointers point across the same level?

- For each node block record, does the record key accurately reflect the
  contents of the child block?

Space allocation records are cross-referenced as follows:

1. Any space mentioned by any metadata structure is cross-referenced as
   follows:

   - Does the reverse mapping index list only the appropriate owner as the
     owner of each block?

   - Are none of the blocks claimed as free space?

   - If these aren't file data blocks, are none of the blocks claimed as space
     shared by different owners?

2. Btree blocks are cross-referenced as follows:

   - Everything in class 1 above.

   - If there's a parent node block, do the keys listed for this block match the
     keyspace of this block?

   - Do the sibling pointers point to valid blocks? Of the same level?

   - Do the child pointers point to valid blocks? Of the next level down?

3. Free space btree records are cross-referenced as follows:

   - Everything in class 1 and 2 above.

   - Does the reverse mapping index list no owners of this space?

   - Is this space not claimed by the inode index for inodes?

   - Is it not mentioned by the reference count index?

   - Is there a matching record in the other free space btree?

4. Inode btree records are cross-referenced as follows:

   - Everything in class 1 and 2 above.

   - Is there a matching record in free inode btree?

   - Do cleared bits in the holemask correspond with inode clusters?

   - Do set bits in the freemask correspond with inode records with zero link
     count?

5. Inode records are cross-referenced as follows:

   - Everything in class 1.

   - Do all the fields that summarize information about the file forks actually
     match those forks?

   - Does each inode with zero link count correspond to a record in the free
     inode btree?

6. File fork space mapping records are cross-referenced as follows:

   - Everything in class 1 and 2 above.

   - Is this space not mentioned by the inode btrees?

   - If this is a CoW fork mapping, does it correspond to a CoW entry in the
     reference count btree?

7. Reference count records are cross-referenced as follows:

   - Everything in class 1 and 2 above.

   - Within the space subkeyspace of the rmap btree (that is to say, all
     records mapped to a particular space extent and ignoring the owner info),
     are there the same number of reverse mapping records for each block as the
     reference count record claims?

Proposed patchsets are the series to find gaps in
`refcount btree
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-refcount-gaps>`_,
`inode btree
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-inobt-gaps>`_, and
`rmap btree
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-rmapbt-gaps>`_ records;
to find
`mergeable records
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-mergeable-records>`_;
and to
`improve cross referencing with rmap
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-strengthen-rmap-checking>`_
before starting a repair.

Checking Extended Attributes
````````````````````````````

Extended attributes implement a key-value store that enables fragments of data
to be attached to any file.
Both the kernel and userspace can access the keys and values, subject to
namespace and privilege restrictions.
Most typically these fragments are metadata about the file -- origins, security
contexts, user-supplied labels, indexing information, etc.

Names can be as long as 255 bytes and can exist in several different
namespaces.
Values can be as large as 64KB.
A file's extended attributes are stored in blocks mapped by the attr fork.
The mappings point to leaf blocks, remote value blocks, or dabtree blocks.
Block 0 in the attribute fork is always the top of the structure, but otherwise
each of the three types of blocks can be found at any offset in the attr fork.
Leaf blocks contain attribute key records that point to the name and the value.
Names are always stored elsewhere in the same leaf block.
Values that are less than 3/4 the size of a filesystem block are also stored
elsewhere in the same leaf block.
Remote value blocks contain values that are too large to fit inside a leaf.
If the leaf information exceeds a single filesystem block, a dabtree (also
rooted at block 0) is created to map hashes of the attribute names to leaf
blocks in the attr fork.

Checking an extended attribute structure is not so straightforward due to the
lack of separation between attr blocks and index blocks.
Scrub must read each block mapped by the attr fork and ignore the non-leaf
blocks:

1. Walk the dabtree in the attr fork (if present) to ensure that there are no
   irregularities in the blocks or dabtree mappings that do not point to
   attr leaf blocks.

2. Walk the blocks of the attr fork looking for leaf blocks.
   For each entry inside a leaf:

   a. Validate that the name does not contain invalid characters.

   b. Read the attr value.
      This performs a named lookup of the attr name to ensure the correctness
      of the dabtree.
      If the value is stored in a remote block, this also validates the
      integrity of the remote value block.

Checking and Cross-Referencing Directories
``````````````````````````````````````````

The filesystem directory tree is a directed acyclic graph structure, with files
constituting the nodes, and directory entries (dirents) constituting the edges.
Directories are a special type of file containing a set of mappings from a
255-byte sequence (name) to an inumber.
These are called directory entries, or dirents for short.
Each directory file must have exactly one directory pointing to the file.
A root directory points to itself.
Directory entries point to files of any type.
Each non-directory file may have multiple directories point to it.

In XFS, directories are implemented as a file containing up to three 32GB
partitions.
The first partition contains directory entry data blocks.
Each data block contains variable-sized records associating a user-provided
name with an inumber and, optionally, a file type.
If the directory entry data grows beyond one block, the second partition (which
exists as post-EOF extents) is populated with a block containing free space
information and an index that maps hashes of the dirent names to directory data
blocks in the first partition.
This makes directory name lookups very fast.
If this second partition grows beyond one block, the third partition is
populated with a linear array of free space information for faster
expansions.
If the free space has been separated and the second partition grows again
beyond one block, then a dabtree is used to map hashes of dirent names to
directory data blocks.

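
The partition layout described above can be illustrated with a small
userspace sketch that classifies a byte offset within the directory's data
fork.
This is not kernel code: the ``demo_`` names are hypothetical, and the sketch
assumes that each partition occupies exactly 32GB of fork offset space,
starting at offset zero.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: every partition spans 32GB of directory fork offsets. */
#define DEMO_DIR_PARTITION_BYTES	(32ULL << 30)

/* Hypothetical labels for the three partitions described in the text. */
enum demo_dir_partition {
	DEMO_DIR_DATA = 0,	/* first partition: dirent data blocks */
	DEMO_DIR_INDEX = 1,	/* second partition: name hash index */
	DEMO_DIR_FREE = 2,	/* third partition: free space array */
	DEMO_DIR_BEYOND = 3,	/* past the third partition: never valid */
};

/* Map a byte offset in the directory fork to the partition containing it. */
static enum demo_dir_partition
demo_dir_partition_of(uint64_t fork_offset)
{
	uint64_t idx = fork_offset / DEMO_DIR_PARTITION_BYTES;

	if (idx >= DEMO_DIR_BEYOND)
		return DEMO_DIR_BEYOND;
	return (enum demo_dir_partition)idx;
}
```

A checker built along these lines could reject any dabtree or free space
pointer whose target offset lands in the wrong partition.
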

Checking a directory is pretty straightforward:

1. Walk the dabtree in the second partition (if present) to ensure that there
   are no irregularities in the blocks or dabtree mappings that do not point to
   dirent blocks.

2. Walk the blocks of the first partition looking for directory entries.
   Each dirent is checked as follows:

   a. Does the name contain no invalid characters?

   b. Does the inumber correspond to an actual, allocated inode?

   c. Does the child inode have a nonzero link count?

   d. If a file type is included in the dirent, does it match the type of the
      inode?

   e. If the child is a subdirectory, does the child's dotdot pointer point
      back to the parent?

   f. If the directory has a second partition, perform a named lookup of the
      dirent name to ensure the correctness of the dabtree.

3. Walk the free space list in the third partition (if present) to ensure that
   the free spaces it describes are really unused.

Checking operations involving :ref:`parents <dirparent>` and
:ref:`file link counts <nlinks>` are discussed in more detail in later
sections.

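
As a rough illustration, the per-dirent checks (a) through (d) in step 2 of
the directory walk might be modelled in userspace as shown here.
Everything in this sketch is hypothetical: the structures are toy stand-ins
for the ondisk formats, the inode "table" is a plain array indexed by inumber,
and the set of invalid name characters is assumed to be ``/`` and NUL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-ins for ondisk structures; none of these are real XFS types. */
struct demo_inode {
	bool		allocated;
	uint32_t	nlink;
	uint8_t		ftype;
};

struct demo_dirent {
	const char	*name;
	size_t		namelen;
	uint64_t	inumber;
	uint8_t		ftype;		/* 0 means "no file type recorded" */
};

/* Check (a): assumes '/' and NUL are the only invalid characters. */
static bool demo_name_ok(const char *name, size_t len)
{
	if (len == 0 || len > 255)
		return false;
	for (size_t i = 0; i < len; i++)
		if (name[i] == '/' || name[i] == '\0')
			return false;
	return true;
}

/* Apply checks (a) through (d); return true if the dirent looks sane. */
static bool demo_dirent_ok(const struct demo_dirent *de,
			   const struct demo_inode *itable, size_t nr_inodes)
{
	const struct demo_inode *ip;

	if (!demo_name_ok(de->name, de->namelen))
		return false;				/* (a) bad name */
	if (de->inumber >= nr_inodes)
		return false;				/* (b) no such inode */
	ip = &itable[de->inumber];
	if (!ip->allocated)
		return false;				/* (b) inode is free */
	if (ip->nlink == 0)
		return false;				/* (c) zero link count */
	if (de->ftype != 0 && de->ftype != ip->ftype)
		return false;				/* (d) type mismatch */
	return true;
}
```

Checks (e) and (f) are omitted because they require reading the child
directory and performing a hashed lookup, neither of which fits a toy model.
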
Checking Directory/Attribute Btrees
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

As stated in previous sections, the directory/attribute btree (dabtree) index
maps user-provided names to improve lookup times by avoiding linear scans.
Internally, it maps a 32-bit hash of the name to a block offset within the
appropriate file fork.

The internal structure of a dabtree closely resembles the btrees that record
fixed-size metadata records -- each dabtree block contains a magic number, a
checksum, sibling pointers, a UUID, a tree level, and a log sequence number.
The format of leaf and node records is the same -- each entry points to the
next level down in the hierarchy, with dabtree node records pointing to dabtree
leaf blocks, and dabtree leaf records pointing to non-dabtree blocks elsewhere
in the fork.

Checking and cross-referencing the dabtree is very similar to what is done for
space btrees:

- Does the type of data stored in the block match what scrub is expecting?

- Does the block belong to the owning structure that asked for the read?

- Do the records fit within the block?

- Are the records contained inside the block free of obvious corruptions?

- Are the name hashes in the correct order?


- Do node pointers within the dabtree point to valid fork offsets for dabtree
  blocks?

- Do leaf pointers within the dabtree point to valid fork offsets for directory
  or attr leaf blocks?

- Do child pointers point towards the leaves?

- Do sibling pointers point across the same level?

- For each dabtree node record, does the record key accurately reflect the
  contents of the child dabtree block?

- For each dabtree leaf record, does the record key accurately reflect the
  contents of the directory or attr block?

Cross-Referencing Summary Counters
``````````````````````````````````

XFS maintains three classes of summary counters: available resources, quota
resource usage, and file link counts.

In theory, the amount of available resources (data blocks, inodes, realtime
extents) can be found by walking the entire filesystem.
This would make for very slow reporting, so a transactional filesystem can
maintain summaries of this information in the superblock.
Cross-referencing these values against the filesystem metadata should be a
simple matter of walking the free space and inode metadata in each AG and the
realtime bitmap, but there are complications that will be discussed in
:ref:`more detail <fscounters>` later.

:ref:`Quota usage <quotacheck>` and :ref:`file link count <nlinks>`
checking are sufficiently complicated to warrant separate sections.

Post-Repair Reverification
``````````````````````````

After performing a repair, the checking code is run a second time to validate
the new structure, and the results of the health assessment are recorded
internally and returned to the calling process.
This step is critical for enabling system administrators to monitor the status
of the filesystem and the progress of any repairs.
For developers, it is a useful means to judge the efficacy of error detection
and correction in the online and offline checking tools.

Eventual Consistency vs. Online Fsck
------------------------------------

Complex operations can make modifications to multiple per-AG data structures
with a chain of transactions.
These chains, once committed to the log, are restarted during log recovery if
the system crashes while processing the chain.
Because the AG header buffers are unlocked between transactions within a chain,
online checking must coordinate with chained operations that are in progress to
avoid incorrectly detecting inconsistencies due to pending chains.
Furthermore, online repair must not run when operations are pending because
the metadata are temporarily inconsistent with each other, and rebuilding is
not possible.

Only online fsck has this requirement of total consistency of AG metadata, and
it should be relatively rare as compared to filesystem change operations.
Online fsck coordinates with transaction chains as follows:

* For each AG, maintain a count of intent items targeting that AG.
  The count should be bumped whenever a new item is added to the chain.
  The count should be dropped when the filesystem has locked the AG header
  buffers and finished the work.

* When online fsck wants to examine an AG, it should lock the AG header
  buffers to quiesce all transaction chains that want to modify that AG.
  If the count is zero, proceed with the checking operation.
  If it is nonzero, cycle the buffer locks to allow the chain to make forward
  progress.

This may lead to online fsck taking a long time to complete, but regular
filesystem updates take precedence over background checking activity.
Details about the discovery of this situation are presented in the
:ref:`next section <chain_coordination>`, and details about the solution
are presented :ref:`after that<intent_drains>`.
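
The two rules above can be modelled with a deliberately simplified,
single-threaded userspace sketch.
It is an illustration of the counting and lock cycling idea only: the
``demo_`` names are hypothetical, a boolean stands in for the AG header
buffer locks, and no real waiting or wakeup is implemented.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of one AG; not the kernel's structures. */
struct demo_ag {
	atomic_int	intent_count;	/* intent items targeting this AG */
	bool		hdr_locked;	/* stands in for the AG header locks */
};

/* A writer queues a new intent item; no AG header lock is required. */
static void demo_intent_bump(struct demo_ag *ag)
{
	atomic_fetch_add(&ag->intent_count, 1);
}

/* A writer finishes an item; it must take the AG header lock to do so. */
static void demo_intent_finish(struct demo_ag *ag)
{
	assert(!ag->hdr_locked);	/* the lock must be free to take */
	ag->hdr_locked = true;
	atomic_fetch_sub(&ag->intent_count, 1);
	ag->hdr_locked = false;		/* commit releases the lock */
}

/*
 * Scrub takes the lock and samples the count.  Returns true with the lock
 * held if no chains are pending; otherwise drops the lock again so that the
 * pending chains can make forward progress, and returns false.
 */
static bool demo_scrub_trylock(struct demo_ag *ag)
{
	ag->hdr_locked = true;
	if (atomic_load(&ag->intent_count) == 0)
		return true;
	ag->hdr_locked = false;
	return false;
}
```

In this model, a caller that receives ``false`` would wait for the count to
drain and then retry, which is the "cycle the buffer locks" step.
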

.. _chain_coordination:

Discovery of the Problem
````````````````````````

Midway through the development of online scrubbing, the fsstress tests
uncovered a misinteraction between online fsck and compound transaction chains
created by other writer threads that resulted in false reports of metadata
inconsistency.
The root cause of these reports is the eventual consistency model introduced by
the expansion of deferred work items and compound transaction chains when
reverse mapping and reflink were introduced.

Originally, transaction chains were added to XFS to avoid deadlocks when
unmapping space from files.
Deadlock avoidance rules require that AGs only be locked in increasing order,
which makes it impossible (say) to use a single transaction to free a space
extent in AG 7 and then try to free a now superfluous block mapping btree block
in AG 3.
To avoid these kinds of deadlocks, XFS creates Extent Freeing Intent (EFI) log
items to commit to freeing some space in one transaction while deferring the
actual metadata updates to a fresh transaction.
The transaction sequence looks like this:

1. The first transaction contains a physical update to the file's block mapping
   structures to remove the mapping from the btree blocks.
   It then attaches to the in-memory transaction an action item to schedule
   deferred freeing of space.
   Concretely, each transaction maintains a list of ``struct
   xfs_defer_pending`` objects, each of which maintains a list of ``struct
   xfs_extent_free_item`` objects.
   Returning to the example above, the action item tracks the freeing of both
   the unmapped space from AG 7 and the block mapping btree (BMBT) block from
   AG 3.
   Deferred frees recorded in this manner are committed in the log by creating
   an EFI log item from the ``struct xfs_extent_free_item`` object and
   attaching the log item to the transaction.
   When the log is persisted to disk, the EFI item is written into the ondisk
   transaction record.
   EFIs can list up to 16 extents to free, all sorted in AG order.

2. The second transaction contains a physical update to the free space btrees
   of AG 3 to release the former BMBT block and a second physical update to the
   free space btrees of AG 7 to release the unmapped file space.
   Observe that the physical updates are resequenced in the correct order
   when possible.
   Attached to the transaction is an extent free done (EFD) log item.
   The EFD contains a pointer to the EFI logged in transaction #1 so that log
   recovery can tell if the EFI needs to be replayed.
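
The resequencing of deferred frees into AG order can be illustrated with a
short userspace sketch.
The structure and function names here are hypothetical and do not correspond
to the kernel's types; breaking ties by start block is an assumption made
only to give the sort a total order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical record of one extent awaiting freeing; not an XFS type. */
struct demo_free_extent {
	uint32_t	agno;		/* allocation group number */
	uint32_t	agbno;		/* start block within the AG */
	uint32_t	len;		/* length in blocks */
};

/* Order by AG number first so that AGs are visited in increasing order. */
static int demo_free_extent_cmp(const void *a, const void *b)
{
	const struct demo_free_extent *x = a;
	const struct demo_free_extent *y = b;

	if (x->agno != y->agno)
		return x->agno < y->agno ? -1 : 1;
	if (x->agbno != y->agbno)
		return x->agbno < y->agbno ? -1 : 1;
	return 0;
}

/* Sort the pending frees so that locking proceeds in ascending AG order. */
static void demo_sort_free_extents(struct demo_free_extent *ext, size_t nr)
{
	qsort(ext, nr, sizeof(*ext), demo_free_extent_cmp);
}
```

Applied to the running example, a list queued as (AG 7, AG 3) comes out as
(AG 3, AG 7), which matches the order of the physical updates in the second
transaction.
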

If the system goes down after transaction #1 is written back to the filesystem
but before #2 is committed, a scan of the filesystem metadata would show
inconsistent filesystem metadata because there would not appear to be any owner
of the unmapped space.
Happily, log recovery corrects this inconsistency for us -- when recovery finds
an intent log item but does not find a corresponding intent done item, it will
reconstruct the incore state of the intent item and finish it.
In the example above, the log must replay both frees described in the recovered
EFI to complete the recovery phase.

There are subtleties to XFS' transaction chaining strategy to consider:

* Log items must be added to a transaction in the correct order to prevent
  conflicts with principal objects that are not held by the transaction.
  In other words, all per-AG metadata updates for an unmapped block must be
  completed before the last update to free the extent, and extents should not
  be reallocated until that last update commits to the log.

* AG header buffers are released between each transaction in a chain.
  This means that other threads can observe an AG in an intermediate state,
  but as long as the first subtlety is handled, this should not affect the
  correctness of filesystem operations.


* Unmounting the filesystem flushes all pending work to disk, which means that
  offline fsck never sees the temporary inconsistencies caused by deferred
  work item processing.

In this manner, XFS employs a form of eventual consistency to avoid deadlocks
and increase parallelism.

During the design phase of the reverse mapping and reflink features, it was
decided that it was impractical to cram all the reverse mapping updates for a
single filesystem change into a single transaction because a single file
mapping operation can explode into many small updates:

* The block mapping update itself
* A reverse mapping update for the block mapping update
* Fixing the freelist
* A reverse mapping update for the freelist fix

* A shape change to the block mapping btree
* A reverse mapping update for the btree update
* Fixing the freelist (again)
* A reverse mapping update for the freelist fix

* An update to the reference counting information
* A reverse mapping update for the refcount update
* Fixing the freelist (a third time)
* A reverse mapping update for the freelist fix

* Freeing any space that was unmapped and not owned by any other file
* Fixing the freelist (a fourth time)
* A reverse mapping update for the freelist fix

* Freeing the space used by the block mapping btree
* Fixing the freelist (a fifth time)
* A reverse mapping update for the freelist fix

Free list fixups are not usually needed more than once per AG per transaction
chain, but it is theoretically possible if space is very tight.
For copy-on-write updates this is even worse, because this must be done once to
remove the space from a staging area and again to map it into the file!

To deal with this explosion in a calm manner, XFS expands its use of deferred
work items to cover most reverse mapping updates and all refcount updates.
This reduces the worst case size of transaction reservations by breaking the
work into a long chain of small updates, which increases the degree of eventual
consistency in the system.
Again, this generally isn't a problem because XFS orders its deferred work
items carefully to avoid resource reuse conflicts between unsuspecting threads.

However, online fsck changes the rules -- remember that although physical
updates to per-AG structures are coordinated by locking the buffers for AG
headers, buffer locks are dropped between transactions.
Once scrub acquires resources and takes locks for a data structure, it must do
all the validation work without releasing the lock.
If the main lock for a space btree is an AG header buffer lock, scrub may have
interrupted another thread that is midway through finishing a chain.
For example, if a thread performing a copy-on-write has completed a reverse
mapping update but not the corresponding refcount update, the two AG btrees
will appear inconsistent to scrub and an observation of corruption will be
recorded.
This observation will not be correct.
If a repair is attempted in this state, the results will be catastrophic!

Several other solutions to this problem were evaluated upon discovery of this
flaw and rejected:

1. Add a higher level lock to allocation groups and require writer threads to
   acquire the higher level lock in AG order before making any changes.
   This would be very difficult to implement in practice because it is
   difficult to determine which locks need to be obtained, and in what order,
   without simulating the entire operation.
   Performing a dry run of a file operation to discover necessary locks would
   make the filesystem very slow.

2. Make the deferred work coordinator code aware of consecutive intent items
   targeting the same AG and have it hold the AG header buffers locked across
   the transaction roll between updates.
   This would introduce a lot of complexity into the coordinator since it is
   only loosely coupled with the actual deferred work items.
   It would also fail to solve the problem because deferred work items can
   generate new deferred subtasks, but all subtasks must be complete before
   work can start on a new sibling task.

3. Teach online fsck to walk all transactions waiting for whichever lock(s)
   protect the data structure being scrubbed to look for pending operations.
   The checking and repair operations must factor these pending operations into
   the evaluations being performed.
   This solution is a nonstarter because it is *extremely* invasive to the main
   filesystem.

.. _intent_drains:

Intent Drains
`````````````

Online fsck uses an atomic intent item counter and lock cycling to coordinate
with transaction chains.
There are two key properties to the drain mechanism.
First, the counter is incremented when a deferred work item is *queued* to a
transaction, and it is decremented after the associated intent done log item is
*committed* to another transaction.
The second property is that deferred work can be added to a transaction without
holding an AG header lock, but per-AG work items cannot be marked done without
locking that AG header buffer to log the physical updates and the intent done
log item.
The first property enables scrub to yield to running transaction chains, which
is an explicit deprioritization of online fsck to benefit file operations.
The second property of the drain is key to the correct coordination of scrub,
since scrub will always be able to decide if a conflict is possible.

For regular filesystem code, the drain works as follows:

1. Call the appropriate subsystem function to add a deferred work item to a
   transaction.

2. The function calls ``xfs_defer_drain_bump`` to increase the counter.

3. When the deferred item manager wants to finish the deferred work item, it
   calls ``->finish_item`` to complete it.

4. The ``->finish_item`` implementation logs some changes and calls
   ``xfs_defer_drain_drop`` to decrease the sloppy counter and wake up any
   threads waiting on the drain.

5. The subtransaction commits, which unlocks the resource associated with the
   intent item.

For scrub, the drain works as follows:

1. Lock the resource(s) associated with the metadata being scrubbed.
   For example, a scan of the refcount btree would lock the AGI and AGF header
   buffers.

2. If the counter is zero (``xfs_defer_drain_busy`` returns false), there are
   no chains in progress and the operation may proceed.

3. Otherwise, release the resources grabbed in step 1.

4. Wait for the intent counter to reach zero (``xfs_defer_drain_intents``),
   then go back to step 1 unless a signal has been caught.

To avoid polling in step 4, the drain provides a waitqueue for scrub threads to
be woken up whenever the intent count drops to zero.

The proposed patchset is the
`scrub intent drain series
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-drain-intents>`_.

.. _jump_labels:

Static Keys (aka Jump Label Patching)
`````````````````````````````````````

Online fsck for XFS separates the regular filesystem from the checking and
repair code as much as possible.
However, there are a few parts of online fsck (such as the intent drains, and
later, live update hooks) where it is useful for the online fsck code to know
what's going on in the rest of the filesystem.
Since it is not expected that online fsck will be constantly running in the
background, it is very important to minimize the runtime overhead imposed by
these hooks when online fsck is compiled into the kernel but not actively
running on behalf of userspace.
Taking locks in the hot path of a writer thread to access a data structure only
to find that no further action is necessary is expensive -- on the author's
computer, this has an overhead of 40-50ns per access.
Fortunately, the kernel supports dynamic code patching, which enables XFS to
replace a static branch to hook code with ``nop`` sleds when online fsck isn't
running.
This sled has an overhead of however long it takes the instruction decoder to
skip past the sled, which seems to be on the order of less than 1ns and
does not access memory outside of instruction fetching.

When online fsck enables the static key, the sled is replaced with an
unconditional branch to call the hook code.
The switchover is quite expensive (~22000ns) but is paid entirely by the
program that invoked online fsck, and can be amortized if multiple threads
enter online fsck at the same time, or if multiple filesystems are being
checked at the same time.
Changing the branch direction requires taking the CPU hotplug lock, and since
CPU initialization requires memory allocation, online fsck must be careful not
to change a static key while holding any locks or resources that could be
accessed in the memory reclaim paths.
To minimize contention on the CPU hotplug lock, care should be taken not to
enable or disable static keys unnecessarily.

Because static keys are intended to minimize hook overhead for regular
filesystem operations when xfs_scrub is not running, the intended usage
patterns are as follows:

- The hooked part of XFS should declare a static-scoped static key that
  defaults to false.
  The ``DEFINE_STATIC_KEY_FALSE`` macro takes care of this.
  The static key itself should be declared as a ``static`` variable.

- When deciding to invoke code that's only used by scrub, the regular
  filesystem should call the ``static_branch_unlikely`` predicate to avoid the
  scrub-only hook code if the static key is not enabled.

- The regular filesystem should export helper functions that call
  ``static_branch_inc`` to enable and ``static_branch_dec`` to disable the
  static key.
  Wrapper functions make it easy to compile out the relevant code if the kernel
  distributor turns off online fsck at build time.

- Scrub functions wanting to turn on scrub-only XFS functionality should call
  ``xchk_fsgates_enable`` from the setup function to enable a specific hook.
  This must be done before obtaining any resources that are used by memory
  reclaim.
  Callers had better be sure they really need the functionality gated by the
  static key; the ``TRY_HARDER`` flag is useful here.

Online scrub has resource acquisition helpers (e.g. ``xchk_perag_lock``) to
handle locking AGI and AGF buffers for all scrubber functions.
If a helper detects a conflict between scrub and the running transactions, it
will try to wait for intents to complete.
If the caller of the helper has not enabled the static key, the helper will
return -EDEADLOCK, which should result in the scrub being restarted with the
``TRY_HARDER`` flag set.
The scrub setup function should detect that flag, enable the static key, and
try the scrub again.
Scrub teardown disables all static keys obtained by ``xchk_fsgates_enable``.

For more information, please see the kernel documentation of
Documentation/staging/static-keys.rst.

.. _xfile:

Pageable Kernel Memory
----------------------

Some online checking functions work by scanning the filesystem to build a
shadow copy of an ondisk metadata structure in memory and comparing the two
copies.
For online repair to rebuild a metadata structure, it must compute the record
set that will be stored in the new structure before it can persist that new
structure to disk.
Ideally, repairs complete with a single atomic commit that introduces
a new data structure.
To meet these goals, the kernel needs to collect a large amount of information
in a place that doesn't require the correct operation of the filesystem.

Kernel memory isn't suitable because:

* Allocating a contiguous region of memory to create a C array is very
  difficult, especially on 32-bit systems.

* Linked lists of records introduce double pointer overhead which is very high
  and eliminate the possibility of indexed lookups.

* Kernel memory is pinned, which can drive the system into OOM conditions.

* The system might not have sufficient memory to stage all the information.

At any given time, online fsck does not need to keep the entire record set in
memory, which means that individual records can be paged out if necessary.
Continued development of online fsck demonstrated that the ability to perform
indexed data storage would also be very useful.
Fortunately, the Linux kernel already has a facility for byte-addressable and
pageable storage: tmpfs.
In-kernel graphics drivers (most notably i915) take advantage of tmpfs files
to store intermediate data that doesn't need to be in memory at all times, so
that usage precedent is already established.
Hence, the ``xfile`` was born!

+--------------------------------------------------------------------------+
| **Historical Sidebar**:                                                  |
+--------------------------------------------------------------------------+
| The first edition of online repair inserted records into a new btree as  |
| it found them, which failed because the filesystem could shut down with  |
| a half-built data structure that would be live after recovery finished.  |
|                                                                          |
| The second edition solved the half-rebuilt structure problem by storing  |
| everything in memory, but frequently ran the system out of memory.       |
|                                                                          |
| The third edition solved the OOM problem by using linked lists, but the  |
| memory overhead of the list pointers was extreme.                        |
+--------------------------------------------------------------------------+

xfile Access Models
```````````````````

A survey of the intended uses of xfiles suggested these use cases:

1. Arrays of fixed-sized records (space management btrees, directory and
   extended attribute entries)

2. Sparse arrays of fixed-sized records (quotas and link counts)

3. Large binary objects (BLOBs) of variable sizes (directory and extended
   attribute names and values)

4. Staging btrees in memory (reverse mapping btrees)

5. Arbitrary contents (realtime space management)

To support the first four use cases, high level data structures wrap the xfile
to share functionality between online fsck functions.
The rest of this section discusses the interfaces that the xfile presents to
those four higher level data structures.
The fifth use case is discussed in the :ref:`realtime summary <rtsummary>` case
study.

The most general storage interface supported by the xfile enables the reading
and writing of arbitrary quantities of data at arbitrary offsets in the xfile.
This capability is provided by ``xfile_pread`` and ``xfile_pwrite`` functions,
which behave similarly to their userspace counterparts.
XFS is very record-based, which suggests that the ability to load and store
complete records is important.
To support these cases, a pair of ``xfile_obj_load`` and ``xfile_obj_store``
functions are provided to read and persist objects into an xfile.
They are internally the same as pread and pwrite, except that they treat any
error as an out of memory error.
For online repair, squashing error conditions in this manner is an acceptable
behavior because the only reaction is to abort the operation back to userspace.
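
As a rough userspace illustration of the object interface (not the kernel
code), the sketch below backs a toy "xfile" with an anonymous temporary file
and wraps ``pread`` and ``pwrite`` so that any error or short transfer is
reported as ``-ENOMEM``.
The names ``toy_xfile``, ``toy_obj_store``, ``toy_obj_load``, ``toy_rec``, and
``toy_xfile_demo`` are hypothetical and exist only for this example.

.. code-block:: c

    #define _XOPEN_SOURCE 700
    #include <assert.h>
    #include <errno.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Hypothetical stand-in for an xfile: an unlinked temporary file. */
    struct toy_xfile {
        FILE *fp;
        int fd;
    };

    /* A made-up fixed-size record, just to have something to store. */
    struct toy_rec {
        uint64_t start;
        uint64_t len;
    };

    static int toy_xfile_create(struct toy_xfile *xf)
    {
        xf->fp = tmpfile();
        if (!xf->fp)
            return -ENOMEM;
        xf->fd = fileno(xf->fp);
        return 0;
    }

    /* Store a whole object; any error or short write becomes -ENOMEM. */
    static int toy_obj_store(struct toy_xfile *xf, const void *buf,
                             size_t count, off_t pos)
    {
        ssize_t ret = pwrite(xf->fd, buf, count, pos);

        return (ret < 0 || (size_t)ret != count) ? -ENOMEM : 0;
    }

    /* Load a whole object; any error or short read becomes -ENOMEM. */
    static int toy_obj_load(struct toy_xfile *xf, void *buf, size_t count,
                            off_t pos)
    {
        ssize_t ret = pread(xf->fd, buf, count, pos);

        return (ret < 0 || (size_t)ret != count) ? -ENOMEM : 0;
    }

    /* Returns 0 if a record round-trips and a read past EOF is squashed. */
    int toy_xfile_demo(void)
    {
        struct toy_xfile xf;
        struct toy_rec in = { .start = 128, .len = 16 }, out;
        int error;

        error = toy_xfile_create(&xf);
        if (error)
            return error;

        /* Store the record at array slot 3, then read it back. */
        error = toy_obj_store(&xf, &in, sizeof(in), 3 * sizeof(in));
        if (!error)
            error = toy_obj_load(&xf, &out, sizeof(out), 3 * sizeof(out));
        if (!error && memcmp(&in, &out, sizeof(in)) != 0)
            error = -EIO;

        /* Reading beyond the end of the file is a short read. */
        if (!error &&
            toy_obj_load(&xf, &out, sizeof(out), 100 * sizeof(out)) != -ENOMEM)
            error = -EIO;

        fclose(xf.fp);
        return error;
    }

Because the caller only ever sees success or ``-ENOMEM``, the calling code in
this sketch needs a single failure path, which mirrors the reasoning given
above for squashing errors.
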
All five xfile use cases can be serviced by these four functions.

However, no discussion of file access idioms is complete without answering the
question, "But what about mmap?"
It is convenient to access storage directly with pointers, just like userspace
code does with regular memory.
Online fsck must not drive the system into OOM conditions, which means that
xfiles must be responsive to memory reclamation.
tmpfs can only push a pagecache folio to the swap cache if the folio is neither
pinned nor locked, which means the xfile must not pin too many folios.

Short term direct access to xfile contents is done by locking the pagecache
folio and mapping it into kernel address space.
Programmatic access (e.g. pread and pwrite) uses this mechanism.
Folio locks are not supposed to be held for long periods of time, so long
term direct access to xfile contents is done by bumping the folio refcount,
mapping it into kernel address space, and dropping the folio lock.
These long term users *must* be responsive to memory reclaim by hooking into
the shrinker infrastructure to know when to release folios.

The ``xfile_get_page`` and ``xfile_put_page`` functions are provided to
retrieve the (locked) folio that backs part of an xfile and to release it.
The only users of these folio lease functions are the xfarray
:ref:`sorting<xfarray_sort>` algorithms and the :ref:`in-memory
btrees<xfbtree>`.

xfile Access Coordination
`````````````````````````

For security reasons, xfiles must be owned privately by the kernel.
They are marked ``S_PRIVATE`` to prevent interference from the security system,
must never be mapped into process file descriptor tables, and their pages must
never be mapped into userspace processes.

To avoid locking recursion issues with the VFS, all accesses to the shmfs file
are performed by manipulating the page cache directly.
xfile writers call the ``->write_begin`` and ``->write_end`` functions of the
xfile's address space to grab writable pages, copy the caller's buffer into the
page, and release the pages.
xfile readers call ``shmem_read_mapping_page_gfp`` to grab pages directly
before copying the contents into the caller's buffer.
In other words, xfiles ignore the VFS read and write code paths to avoid
having to create a dummy ``struct kiocb`` and to avoid taking inode and
freeze locks.
tmpfs cannot be frozen, and xfiles must not be exposed to userspace.

If an xfile is shared between threads to stage repairs, the caller must provide
its own locks to coordinate access.
For example, if a scrub function stores scan results in an xfile and needs
other threads to provide updates to the scanned data, the scrub function must
provide a lock for all threads to share.

.. _xfarray:

Arrays of Fixed-Sized Records
`````````````````````````````

In XFS, each type of indexed space metadata (free space, inodes, reference
counts, file fork space, and reverse mappings) consists of a set of fixed-size
records indexed with a classic B+ tree.
Directories have a set of fixed-size dirent records that point to the names,
and extended attributes have a set of fixed-size attribute keys that point to
names and values.
Quota counters and file link counters index records with numbers.
During a repair, scrub needs to stage new records during the gathering step and
retrieve them during the btree building step.

Although this requirement can be satisfied by calling the read and write
methods of the xfile directly, it is simpler for callers to have a higher
level abstraction that takes care of computing array offsets, provides
iterator functions, and deals with sparse records and sorting.
The ``xfarray`` abstraction presents a linear array for fixed-size records atop
the byte-accessible xfile.

.. _xfarray_access_patterns:

Array Access Patterns
^^^^^^^^^^^^^^^^^^^^^

Array access patterns in online fsck tend to fall into three categories.
Iteration of records is assumed to be necessary for all cases and will be
covered in the next section.

The first type of caller handles records that are indexed by position.
Gaps may exist between records, and a record may be updated multiple times
during the collection step.
In other words, these callers want a sparse linearly addressed table file.
The typical use cases are quota records or file link count records.
Access to array elements is performed programmatically via ``xfarray_load`` and
``xfarray_store`` functions, which wrap the similarly-named xfile functions to
provide loading and storing of array elements at arbitrary array indices.
Gaps are defined to be null records, and null records are defined to be a
sequence of all zero bytes.
Null records are detected by calling ``xfarray_element_is_null``.
They are created either by calling ``xfarray_unset`` to null out an existing
record or by never storing anything to an array index.

The second type of caller handles records that are not indexed by position
and do not require multiple updates to a record.
The typical use case here is rebuilding space btrees and key/value btrees.
These callers can add records to the array without caring about array indices
via the ``xfarray_append`` function, which stores a record at the end of the
array.
For callers that require records to be presentable in a specific order (e.g.
rebuilding btree data), the ``xfarray_sort`` function can arrange the records
in sorted order; this function will be covered later.

The third type of caller is a bag, which is useful for counting records.
The typical use case here is constructing space extent reference counts from
reverse mapping information.
Records can be put in the bag in any order, they can be removed from the bag
at any time, and uniqueness of records is left to callers.
The ``xfarray_store_anywhere`` function is used to insert a record in any
null record slot in the bag; and the ``xfarray_unset`` function removes a
record from the bag.

The proposed patchset is the
`big in-memory array
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=big-array>`_.

Iterating Array Elements
^^^^^^^^^^^^^^^^^^^^^^^^

Most users of the xfarray require the ability to iterate the records stored in
the array.
Callers can probe every possible array index with the following:

.. code-block:: c

    xfarray_idx_t i;
    foreach_xfarray_idx(array, i) {
        xfarray_load(array, i, &rec);

        /* do something with rec */
    }

All users of this idiom must be prepared to handle null records or must already
know that there aren't any.

For xfarray users that want to iterate a sparse array, the ``xfarray_iter``
function ignores indices in the xfarray that have never been written to by
calling ``xfile_seek_data`` (which internally uses ``SEEK_DATA``) to skip areas
of the array that are not populated with memory pages.
Once it finds a page, it will skip the zeroed areas of the page.

.. code-block:: c

    xfarray_idx_t i = XFARRAY_CURSOR_INIT;
    while ((ret = xfarray_iter(array, &i, &rec)) == 1) {
        /* do something with rec */
    }

.. _xfarray_sort:

Sorting Array Elements
^^^^^^^^^^^^^^^^^^^^^^

During the fourth demonstration of online repair, a community reviewer remarked
that for performance reasons, online repair ought to load batches of records
into btree record blocks instead of inserting records into a new btree one at a
time.
The btree insertion code in XFS is responsible for maintaining correct ordering
of the records, so naturally the xfarray must also support sorting the record
set prior to bulk loading.

Case Study: Sorting xfarrays
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The sorting algorithm used in the xfarray is actually a combination of adaptive
quicksort and a heapsort subalgorithm in the spirit of
`Sedgewick <https://algs4.cs.princeton.edu/23quicksort/>`_ and
`pdqsort <https://github.com/orlp/pdqsort>`_, with customizations for the Linux
kernel.
To sort records in a reasonably short amount of time, ``xfarray`` takes
advantage of the binary subpartitioning offered by quicksort, but it also uses
heapsort to hedge against performance collapse if the chosen quicksort pivots
are poor.
Both algorithms are (in general) O(n * lg(n)), but there is a wide performance
gulf between the two implementations.

The Linux kernel already contains a reasonably fast implementation of heapsort.
It only operates on regular C arrays, which limits the scope of its usefulness.
There are two key places where the xfarray uses it:

* Sorting any record subset backed by a single xfile page.

* Loading a small number of xfarray records from potentially disparate parts
  of the xfarray into a memory buffer, and sorting the buffer.

In other words, ``xfarray`` uses heapsort to constrain the nested recursion of
quicksort, thereby mitigating quicksort's worst runtime behavior.

Choosing a quicksort pivot is a tricky business.
A good pivot splits the set to sort in half, leading to the divide and conquer
behavior that is crucial to O(n * lg(n)) performance.
A poor pivot barely splits the subset at all, leading to O(n\ :sup:`2`)
runtime.
The xfarray sort routine tries to avoid picking a bad pivot by sampling nine
records into a memory buffer and using the kernel heapsort to identify the
median of the nine.

Most modern quicksort implementations employ Tukey's "ninther" to select a
pivot from a classic C array.
Typical ninther implementations pick three unique triads of records, sort each
of the triads, and then sort the middle value of each triad to determine the
ninther value.
As stated previously, however, xfile accesses are not entirely cheap.
It turned out to be much more performant to read the nine elements into a
memory buffer, run the kernel's in-memory heapsort on the buffer, and choose
the 4th element of that buffer as the pivot.
Tukey's ninthers are described in J. W. Tukey, `The ninther, a technique for
low-effort robust (resistant) location in large samples`, in *Contributions to
Survey Sampling and Applied Statistics*, edited by H. David, (Academic Press,
1978), pp. 251–257.

The partitioning of quicksort is fairly textbook -- rearrange the record
subset around the pivot, then set up the current and next stack frames to
sort the larger and the smaller halves of the partitioned subset,
respectively.
This keeps the stack space requirements to log2(record count).

As a final performance optimization, the hi and lo scanning phase of quicksort
keeps examined xfile pages mapped in the kernel for as long as possible to
reduce map/unmap cycles.
Surprisingly, this reduces overall sort runtime by nearly half again after
accounting for the application of heapsort directly onto xfile pages.

.. _xfblob:

Blob Storage
````````````

Extended attributes and directories add an additional requirement for staging
records: arbitrary byte sequences of finite length.
Each directory entry record needs to store the entry name,
and each extended attribute needs to store both the attribute name and value.
The names, keys, and values can consume a large amount of memory, so the
``xfblob`` abstraction was created to simplify management of these blobs
atop an xfile.
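
One way to picture blob staging atop a byte-addressable file is shown below as
a userspace sketch, which is not the kernel implementation.
Each blob is appended as a length header followed by its bytes, and the caller
keeps the file offset of the header as a handle for finding the blob again.
The on-file layout and the names ``toy_blobs``, ``toy_blob_store``,
``toy_blob_load``, and ``toy_blob_demo`` are assumptions made for this example.

.. code-block:: c

    #define _XOPEN_SOURCE 700
    #include <assert.h>
    #include <errno.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/types.h>
    #include <unistd.h>

    typedef uint64_t toy_blob_cookie;   /* hypothetical: offset of the header */

    struct toy_blobs {
        FILE *fp;
        int fd;
        off_t last_off;     /* next free byte; blobs are only appended */
    };

    /* Append a blob and hand back a cookie that locates it. */
    static int toy_blob_store(struct toy_blobs *b, toy_blob_cookie *cookie,
                              const void *ptr, uint32_t size)
    {
        uint32_t hdr = size;
        off_t pos = b->last_off;

        if (pwrite(b->fd, &hdr, sizeof(hdr), pos) != (ssize_t)sizeof(hdr))
            return -ENOMEM;
        if (size > 0 &&
            pwrite(b->fd, ptr, size, pos + (off_t)sizeof(hdr)) != (ssize_t)size)
            return -ENOMEM;

        b->last_off = pos + (off_t)sizeof(hdr) + (off_t)size;
        *cookie = (toy_blob_cookie)pos;
        return 0;
    }

    /* Recall a blob into a caller buffer that holds at least its length. */
    static int toy_blob_load(struct toy_blobs *b, toy_blob_cookie cookie,
                             void *ptr, uint32_t size)
    {
        uint32_t hdr;
        off_t pos = (off_t)cookie;

        if (pread(b->fd, &hdr, sizeof(hdr), pos) != (ssize_t)sizeof(hdr))
            return -ENOMEM;
        if (hdr > size)
            return -ERANGE;
        if (hdr > 0 &&
            pread(b->fd, ptr, hdr, pos + (off_t)sizeof(hdr)) != (ssize_t)hdr)
            return -ENOMEM;
        return 0;
    }

    /* Returns 0 if two blobs can be stored and recalled in any order. */
    int toy_blob_demo(void)
    {
        struct toy_blobs b = { 0 };
        toy_blob_cookie name, value;
        char buf[32];
        int error = 0;

        b.fp = tmpfile();
        if (!b.fp)
            return -ENOMEM;
        b.fd = fileno(b.fp);

        /* Sizes include the terminating NUL so strcmp() works below. */
        error = toy_blob_store(&b, &name, "user.comment", 13);
        if (!error)
            error = toy_blob_store(&b, &value, "hello", 6);
        if (!error)
            error = toy_blob_load(&b, value, buf, sizeof(buf));
        if (!error && strcmp(buf, "hello") != 0)
            error = -EIO;
        if (!error)
            error = toy_blob_load(&b, name, buf, sizeof(buf));
        if (!error && strcmp(buf, "user.comment") != 0)
            error = -EIO;

        fclose(b.fp);
        return error;
    }

An append-only layout like this never reuses space, which is tolerable in the
sketch only because the whole file is thrown away at the end.
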

Blob arrays provide ``xfblob_load`` and ``xfblob_store`` functions to retrieve
and persist objects.
The store function returns a magic cookie for every object that it persists.
Later, callers provide this cookie to ``xfblob_load`` to recall the object.
The ``xfblob_free`` function frees a specific blob, and the ``xfblob_truncate``
function frees them all because compaction is not needed.

The details of repairing directories and extended attributes will be discussed
in a subsequent section about atomic extent swapping.
However, it should be noted that these repair functions only use blob storage
to cache a small number of entries before adding them to a temporary ondisk
file, which is why compaction is not required.

The proposed patchset is at the start of the
`extended attribute repair
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-xattrs>`_ series.

.. _xfbtree:

In-Memory B+Trees
`````````````````

The chapter about :ref:`secondary metadata<secondary_metadata>` mentioned that
checking and repairing of secondary metadata commonly requires coordination
between a live metadata scan of the filesystem and writer threads that are
updating that metadata.
Keeping the scan data up to date requires the ability to propagate
metadata updates from the filesystem into the data being collected by the scan.
This *can* be done by appending concurrent updates into a separate log file and
applying them before writing the new metadata to disk, but this leads to
unbounded memory consumption if the rest of the system is very busy.
Another option is to skip the side-log and commit live updates from the
filesystem directly into the scan data, which trades more overhead for a lower
maximum memory requirement.
In both cases, the data structure holding the scan results must support indexed
access to perform well.

Given that indexed lookups of scan data are required for both strategies,
online fsck employs the second strategy of committing live updates directly
into scan data.
Because xfarrays are not indexed and do not enforce record ordering, they
are not suitable for this task.
Conveniently, however, XFS has a library to create and maintain ordered reverse
mapping records: the existing rmap btree code!
If only there were a means to create one in memory.

Recall that the :ref:`xfile <xfile>` abstraction represents memory pages as a
regular file, which means that the kernel can create byte or block addressable
virtual address spaces at will.
The XFS buffer cache specializes in abstracting IO to block-oriented address
spaces, which means that adaptation of the buffer cache to interface with
xfiles enables reuse of the entire btree library.
Btrees built atop an xfile are collectively known as ``xfbtrees``.
The next few sections describe how they actually work.

The proposed patchset is the
`in-memory btree
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=in-memory-btrees>`_
series.

Using xfiles as a Buffer Cache Target
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Two modifications are necessary to support xfiles as a buffer cache target.
The first is to make it possible for the ``struct xfs_buftarg`` structure to
host the ``struct xfs_buf`` rhashtable, because normally those are held by a
per-AG structure.
The second change is to modify the buffer ``ioapply`` function to "read" cached
pages from the xfile and "write" cached pages back to the xfile.
Multiple access to individual buffers is controlled by the ``xfs_buf`` lock,
since the xfile does not provide any locking on its own.
With this adaptation in place, users of the xfile-backed buffer cache use
exactly the same APIs as users of the disk-backed buffer cache.
The separation between xfile and buffer cache implies higher memory usage since
they do not share pages, but this property could some day enable transactional
updates to an in-memory btree.
Today, however, it simply eliminates the need for new code.

Space Management with an xfbtree
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Space management for an xfile is very simple -- each btree block is one memory
page in size.
These blocks use the same header format as an on-disk btree, but the in-memory
block verifiers ignore the checksums, assuming that xfile memory is no more
corruption-prone than regular DRAM.
Reusing existing code here is more important than absolute memory efficiency.

The very first block of an xfile backing an xfbtree is a header block.
The header describes the owner, height, and the block number of the root
xfbtree block.

To allocate a btree block, use ``xfile_seek_data`` to find a gap in the file.
If there are no gaps, create one by extending the length of the xfile.
Preallocate space for the block with ``xfile_prealloc``, and hand back the
location.
To free an xfbtree block, use ``xfile_discard`` (which internally uses
``FALLOC_FL_PUNCH_HOLE``) to remove the memory page from the xfile.

Populating an xfbtree
^^^^^^^^^^^^^^^^^^^^^

An online fsck function that wants to create an xfbtree should proceed as
follows:

1. Call ``xfile_create`` to create an xfile.

2. Call ``xfs_alloc_memory_buftarg`` to create a buffer cache target structure
   pointing to the xfile.

3. Pass the buffer cache target, buffer ops, and other information to
   ``xfbtree_create`` to write an initial tree header and root block to the
   xfile.
   Each btree type should define a wrapper that passes necessary arguments to
   the creation function.
   For example, rmap btrees define ``xfs_rmapbt_mem_create`` to take care of
   all the necessary details for callers.
   A ``struct xfbtree`` object will be returned.

4. Pass the xfbtree object to the btree cursor creation function for the
   btree type.
   Following the example above, ``xfs_rmapbt_mem_cursor`` takes care of this
   for callers.

5. Pass the btree cursor to the regular btree functions to make queries against
   and to update the in-memory btree.
   For example, a btree cursor for an rmap xfbtree can be passed to the
   ``xfs_rmap_*`` functions just like any other btree cursor.
   See the :ref:`next section<xfbtree_commit>` for information on dealing with
   xfbtree updates that are logged to a transaction.

6. When finished, delete the btree cursor, destroy the xfbtree object, free the
   buffer target, and then destroy the xfile to release all resources.

.. _xfbtree_commit:

Committing Logged xfbtree Buffers
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Although it is a clever hack to reuse the rmap btree code to handle the staging
structure, the ephemeral nature of the in-memory btree block storage presents
some challenges of its own.
The XFS transaction manager must not commit buffer log items for buffers backed
by an xfile because the log format does not understand updates for devices
other than the data device.
An ephemeral xfbtree probably will not exist by the time the AIL checkpoints
log transactions back into the filesystem, and certainly won't exist during
log recovery.
For these reasons, any code updating an xfbtree in transaction context must
remove the buffer log items from the transaction and write the updates into the
backing xfile before committing or cancelling the transaction.

The ``xfbtree_trans_commit`` and ``xfbtree_trans_cancel`` functions implement
this functionality as follows:

1. Find each buffer log item whose buffer targets the xfile.

2. Record the dirty/ordered status of the log item.

3. Detach the log item from the buffer.

4. Queue the buffer to a special delwri list.

5. Clear the transaction dirty flag if the only dirty log items were the ones
   that were detached in step 3.

6. Submit the delwri list to commit the changes to the xfile, if the updates
   are being committed.

After removing xfile logged buffers from the transaction in this manner, the
transaction can be committed or cancelled.

Bulk Loading of Ondisk B+Trees
------------------------------

As mentioned previously, early iterations of online repair built new btree
structures by creating a new btree and adding observations individually.
Loading a btree one record at a time had a slight advantage of not requiring
the incore records to be sorted prior to commit, but was very slow and leaked
blocks if the system went down during a repair.
Loading records one at a time also meant that repair could not control the
loading factor of the blocks in the new btree.

Fortunately, the venerable ``xfs_repair`` tool had a more efficient means for
rebuilding a btree index from a collection of records -- bulk btree loading.
This was implemented rather inefficiently code-wise, since ``xfs_repair``
had separate copy-pasted implementations for each btree type.

To prepare for online fsck, each of the four bulk loaders was studied, notes
were taken, and the four were refactored into a single generic btree bulk
loading mechanism.
Those notes in turn have been refreshed and are presented below.

Geometry Computation
````````````````````

The zeroth step of bulk loading is to assemble the entire record set that will
be stored in the new btree, and sort the records.
Next, call ``xfs_btree_bload_compute_geometry`` to compute the shape of the
btree from the record set, the type of btree, and any load factor preferences.
This information is required for resource reservation.

First, the geometry computation computes the minimum and maximum number of
records that will fit in a leaf block from the size of a btree block and the
size of the block header.
Roughly speaking, the maximum number of records is::

        maxrecs = (block_size - header_size) / record_size

The XFS design specifies that btree blocks should be merged when possible,
which means the minimum number of records is half of maxrecs::

        minrecs = maxrecs / 2

The next variable to determine is the desired loading factor.
This must be at least minrecs and no more than maxrecs.
Choosing minrecs is undesirable because it wastes half the block.
Choosing maxrecs is also undesirable because adding a single record to each
newly rebuilt leaf block will cause a tree split, which causes a noticeable
drop in performance immediately afterwards.
The default loading factor was chosen to be 75% of maxrecs, which provides a
reasonably compact structure without any immediate split penalties::

        default_load_factor = (maxrecs + minrecs) / 2

If space is tight, the loading factor will be set to maxrecs to try to avoid
running out of space::

        leaf_load_factor = enough space ? default_load_factor : maxrecs

Load factor is computed for btree node blocks using the combined size of the
btree key and pointer as the record size::

        maxrecs = (block_size - header_size) / (key_size + ptr_size)
        minrecs = maxrecs / 2
        node_load_factor = enough space ? default_load_factor : maxrecs

Once that's done, the number of leaf blocks required to store the record set
can be computed as::

        leaf_blocks = ceil(record_count / leaf_load_factor)

The number of node blocks needed to point to the next level down in the tree
is computed as::

        n_blocks = (n == 0 ? leaf_blocks : node_blocks[n])
        node_blocks[n + 1] = ceil(n_blocks / node_load_factor)

The entire computation is performed recursively until the current level only
needs one block.
The resulting geometry is as follows:

- For AG-rooted btrees, this level is the root level, so the height of the new
  tree is ``level + 1`` and the space needed is the summation of the number of
  blocks on each level.

- For inode-rooted btrees where the records in the top level do not fit in the
  inode fork area, the height is ``level + 2``, the space needed is the
  summation of the number of blocks on each level, and the inode fork points to
  the root block.

- For inode-rooted btrees where the records in the top level can be stored in
  the inode fork area, then the root block can be stored in the inode, the
  height is ``level + 1``, and the space needed is one less than the summation
  of the number of blocks on each level.
  This only becomes relevant when non-bmap btrees gain the ability to root in
  an inode, which is a future patchset and only included here for completeness.

.. _newbt:

Reserving New B+Tree Blocks
```````````````````````````

Once repair knows the number of blocks needed for the new btree, it allocates
those blocks using the free space information.
Each reserved extent is tracked separately by the btree builder state data.
To improve crash resilience, the reservation code also logs an Extent Freeing
Intent (EFI) item in the same transaction as each space allocation and attaches
its in-memory ``struct xfs_extent_free_item`` object to the space reservation.
If the system goes down, log recovery will use the unfinished EFIs to free the
unused space, leaving the filesystem unchanged.

Each time the btree builder claims a block for the btree from a reserved
extent, it updates the in-memory reservation to reflect the claimed space.
Block reservation tries to allocate as much contiguous space as possible to
reduce the number of EFIs in play.

While repair is writing these new btree blocks, the EFIs created for the space
reservations pin the tail of the ondisk log.
It's possible that other parts of the system will remain busy and push the head
of the log towards the pinned tail.
To avoid livelocking the filesystem, the EFIs must not pin the tail of the log
for too long.
To alleviate this problem, the dynamic relogging capability of the deferred ops
mechanism is reused here to commit a transaction at the log head containing an
EFD for the old EFI and a new EFI at the head.
This enables the log to release the old EFI to keep the log moving forwards.

EFIs have a role to play during the commit and reaping phases; please see the
next section and the section about :ref:`reaping<reaping>` for more details.

Proposed patchsets are the
`bitmap rework
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-bitmap-rework>`_
and the
`preparation for bulk loading btrees
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-prep-for-bulk-loading>`_.


Writing the New Tree
````````````````````

This part is pretty simple -- the btree builder (``xfs_btree_bulkload``) claims
a block from the reserved list, writes the new btree block header, fills the
rest of the block with records, and adds the new leaf block to a list of
written blocks::

  ┌────┐
  │leaf│
  │RRR │
  └────┘

Sibling pointers are set every time a new block is added to the level::

  ┌────┐ ┌────┐ ┌────┐ ┌────┐
  │leaf│→│leaf│→│leaf│→│leaf│
  │RRR │←│RRR │←│RRR │←│RRR │
  └────┘ └────┘ └────┘ └────┘

When it finishes writing the record leaf blocks, it moves on to the node
blocks.
To fill a node block, it walks each block in the next level down in the tree
to compute the relevant keys and write them into the parent node::

      ┌────┐       ┌────┐
      │node│──────→│node│
      │PP  │←──────│PP  │
      └────┘       └────┘
      ↙   ↘         ↙   ↘
  ┌────┐ ┌────┐ ┌────┐ ┌────┐
  │leaf│→│leaf│→│leaf│→│leaf│
  │RRR │←│RRR │←│RRR │←│RRR │
  └────┘ └────┘ └────┘ └────┘

When it reaches the root level, it is ready to commit the new btree!::

          ┌─────────┐
          │  root   │
          │   PP    │
          └─────────┘
          ↙         ↘
      ┌────┐       ┌────┐
      │node│──────→│node│
      │PP  │←──────│PP  │
      └────┘       └────┘
      ↙   ↘         ↙   ↘
  ┌────┐ ┌────┐ ┌────┐ ┌────┐
  │leaf│→│leaf│→│leaf│→│leaf│
  │RRR │←│RRR │←│RRR │←│RRR │
  └────┘ └────┘ └────┘ └────┘

The first step to commit the new btree is to persist the btree blocks to disk
synchronously.
This is a little complicated because a new btree block could have been freed
in the recent past, so the builder must use ``xfs_buf_delwri_queue_here`` to
remove the (stale) buffer from the AIL list before it can write the new blocks
to disk.
Blocks are queued for IO using a delwri list and written in one large batch
with ``xfs_buf_delwri_submit``.

Once the new blocks have been persisted to disk, control returns to the
individual repair function that called the bulk loader.
The repair function must log the location of the new root in a transaction,
clean up the space reservations that were made for the new btree, and reap the
old metadata blocks:

1. Commit the location of the new btree root.

2. For each incore reservation:

   a. Log Extent Freeing Done (EFD) items for all the space that was consumed
      by the btree builder.  The new EFDs must point to the EFIs attached to
      the reservation to prevent log recovery from freeing the new blocks.

   b. For unclaimed portions of incore reservations, create a regular deferred
      extent free work item to free the unused space later in the
      transaction chain.

   c. The EFDs and EFIs logged in steps 2a and 2b must not overrun the
      reservation of the committing transaction.
      If the btree loading code suspects this might be about to happen, it must
      call ``xrep_defer_finish`` to clear out the deferred work and obtain a
      fresh transaction.

3. Clear out the deferred work a second time to finish the commit and clean
   the repair transaction.

The transaction rolling in steps 2c and 3 represents a weakness in the repair
algorithm, because a log flush and a crash before the end of the reap step can
result in space leaking.
Online repair functions minimize the chances of this occurring by using very
large transactions, which each can accommodate many thousands of block freeing
instructions.
Repair moves on to reaping the old blocks, which will be presented in a
subsequent :ref:`section<reaping>` after a few case studies of bulk loading.

Case Study: Rebuilding the Inode Index
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The high level process to rebuild the inode index btree is:

1. Walk the reverse mapping records to generate ``struct xfs_inobt_rec``
   records from the inode chunk information and a bitmap of the old inode btree
   blocks.

2. Append the records to an xfarray in inode order.

3. Use the ``xfs_btree_bload_compute_geometry`` function to compute the number
   of blocks needed for the inode btree.
   If the free space inode btree is enabled, call it again to estimate the
   geometry of the finobt.

4. Allocate the number of blocks computed in the previous step.

5. Use ``xfs_btree_bload`` to write the xfarray records to btree blocks and
   generate the internal node blocks.
   If the free space inode btree is enabled, call it again to load the finobt.

6. Commit the location of the new btree root block(s) to the AGI.

7. Reap the old btree blocks using the bitmap created in step 1.

Details are as follows.

The inode btree maps inumbers to the ondisk location of the associated
inode records, which means that the inode btrees can be rebuilt from the
reverse mapping information.
Reverse mapping records with an owner of ``XFS_RMAP_OWN_INOBT`` mark the
location of the old inode btree blocks.
Each reverse mapping record with an owner of ``XFS_RMAP_OWN_INODES`` marks the
location of at least one inode cluster buffer.
A cluster is the smallest number of ondisk inodes that can be allocated or
freed in a single transaction; it is never smaller than 1 fs block or 4 inodes.

For the space represented by each inode cluster, ensure that there are no
records in the free space btrees nor any records in the reference count btree.
If there are, the space metadata inconsistencies are reason enough to abort the
operation.
Otherwise, read each cluster buffer to check that its contents appear to be
ondisk inodes and to decide if the file is allocated
(``xfs_dinode.i_mode != 0``) or free (``xfs_dinode.i_mode == 0``).
Accumulate the results of successive inode cluster buffer reads until there is
enough information to fill a single inode chunk record, which is 64 consecutive
numbers in the inumber keyspace.
If the chunk is sparse, the chunk record may include holes.

Once the repair function accumulates one chunk's worth of data, it calls
``xfarray_append`` to add the inode btree record to the xfarray.
This xfarray is walked twice during the btree creation step -- once to populate
the inode btree with all inode chunk records, and a second time to populate the
free inode btree with records for chunks that have free non-sparse inodes.
The number of records for the inode btree is the number of xfarray records,
but the record count for the free inode btree has to be computed as inode chunk
records are stored in the xfarray.

The proposed patchset is the
`AG btree repair
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_
series.

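As a rough illustration of the accumulation step in this case study, the
following C sketch folds per-inode ``i_mode`` observations from successive
cluster buffers into one 64-inode chunk record.
Every name prefixed with ``sketch_`` or ``SKETCH_`` is hypothetical and is not
part of the kernel; sparse chunks, buffer verification, and the cross-checks
against the free space and reference count btrees are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_INODES_PER_CHUNK	64

/* Hypothetical, simplified stand-in for an inode btree record. */
struct sketch_inobt_rec {
	uint64_t startino;	/* first inumber covered by the chunk */
	uint64_t free_mask;	/* bit n set: inode startino + n is free */
	uint32_t count;		/* inodes observed so far */
	uint32_t freecount;	/* how many of those are free */
};

/* Begin accumulating a chunk record that starts at startino. */
static void sketch_chunk_init(struct sketch_inobt_rec *rec, uint64_t startino)
{
	rec->startino = startino;
	rec->free_mask = 0;
	rec->count = 0;
	rec->freecount = 0;
}

/*
 * Fold the i_mode values read from one inode cluster into the record.
 * An i_mode of zero means the inode is free.  Returns true once the
 * record describes a full chunk and could be appended to the array.
 */
static bool sketch_chunk_add_cluster(struct sketch_inobt_rec *rec,
				     const uint16_t *i_modes, uint32_t nr)
{
	uint32_t i;

	assert(rec->count + nr <= SKETCH_INODES_PER_CHUNK);
	for (i = 0; i < nr; i++) {
		if (i_modes[i] == 0) {
			rec->free_mask |= 1ULL << (rec->count + i);
			rec->freecount++;
		}
	}
	rec->count += nr;
	return rec->count == SKETCH_INODES_PER_CHUNK;
}

/* Only chunks with at least one free inode count towards the finobt. */
static bool sketch_needs_finobt_rec(const struct sketch_inobt_rec *rec)
{
	return rec->freecount > 0;
}
```

A caller would invoke ``sketch_chunk_add_cluster`` once per cluster buffer and
bump a separate finobt record counter whenever ``sketch_needs_finobt_rec``
returns true for a completed chunk, which mirrors the counting requirement
described above.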

Case Study: Rebuilding the Space Reference Counts
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Reverse mapping records are used to rebuild the reference count information.
Reference counts are required for correct operation of copy on write for shared
file data.
Imagine the reverse mapping entries as rectangles representing extents of
physical blocks, and that the rectangles can be laid down to allow them to
overlap each other.
From the diagram below, it is apparent that a reference count record must start
or end wherever the height of the stack changes.
In other words, the record emission stimulus is level-triggered::

                        █    ███
              ██      █████ ████   ███        ██████
        ██   ████     ███████████ ████     █████████
        ████████████████████████████████ ███████████
        ^ ^  ^^ ^^    ^ ^^ ^^^  ^^^^  ^ ^^ ^  ^     ^
        2 1  23 21    3 43 234  2123  1 01 2  3     0

The ondisk reference count btree does not store the refcount == 0 cases because
the free space btree already records which blocks are free.
Extents being used to stage copy-on-write operations should be the only records
with refcount == 1.
Single-owner file blocks aren't recorded in either the free space or the
reference count btrees.

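The stacking picture above can be sketched in C as a sweep over the edges of
the mappings: wherever the number of overlapping mappings changes, one record
ends and another may begin.
This is only an illustration under simplifying assumptions, not the way repair
actually walks the rmap btree: ``struct ext``, ``struct refc``,
``SKETCH_MAX_MAPPINGS``, and ``sketch_refcounts`` are hypothetical names, the
mappings live in a small array, and CoW staging extents (refcount == 1) are
not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_MAX_MAPPINGS	64

/* Hypothetical stand-in for a reverse mapping: blocks [start, start + len). */
struct ext {
	uint32_t start;
	uint32_t len;
};

/* Hypothetical stand-in for a refcount record. */
struct refc {
	uint32_t start;
	uint32_t len;
	uint32_t refcount;
};

static int cmp_u32(const void *a, const void *b)
{
	uint32_t x = *(const uint32_t *)a, y = *(const uint32_t *)b;

	return (x > y) - (x < y);
}

/*
 * Emit one record per maximal run of blocks covered by the same number of
 * mappings, skipping runs with fewer than two owners.  Returns the number
 * of records stored in out[], or -1 if an array limit is exceeded.
 */
static int sketch_refcounts(const struct ext *m, size_t nr,
			    struct refc *out, size_t max_out)
{
	uint32_t edges[2 * SKETCH_MAX_MAPPINGS];
	size_t nr_edges = 0, nr_out = 0, i, j;

	assert(m != NULL || nr == 0);
	if (nr > SKETCH_MAX_MAPPINGS)
		return -1;

	/* Every start and end is a place where the stack height may change. */
	for (i = 0; i < nr; i++) {
		edges[nr_edges++] = m[i].start;
		edges[nr_edges++] = m[i].start + m[i].len;
	}
	qsort(edges, nr_edges, sizeof(edges[0]), cmp_u32);

	for (i = 0; i + 1 < nr_edges; i++) {
		uint32_t lo = edges[i], hi = edges[i + 1], height = 0;

		if (lo == hi)
			continue;	/* duplicate edge, empty range */

		/* Stack height over [lo, hi) is constant; sample it at lo. */
		for (j = 0; j < nr; j++)
			if (m[j].start <= lo && lo < m[j].start + m[j].len)
				height++;
		if (height < 2)
			continue;	/* free or single-owner space */

		/* Extend the previous record if the height did not change. */
		if (nr_out > 0 &&
		    out[nr_out - 1].start + out[nr_out - 1].len == lo &&
		    out[nr_out - 1].refcount == height) {
			out[nr_out - 1].len += hi - lo;
			continue;
		}
		if (nr_out == max_out)
			return -1;
		out[nr_out++] = (struct refc){ lo, hi - lo, height };
	}
	return (int)nr_out;
}
```

The quadratic inner loop keeps the sketch short; it trades efficiency for
readability and would not be suitable for a real rmap btree.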
The high level process to rebuild the reference count btree is:

1. Walk the reverse mapping records to generate ``struct xfs_refcount_irec``
   records for any space having more than one reverse mapping and add them to
   the xfarray.
   Any records owned by ``XFS_RMAP_OWN_COW`` are also added to the xfarray
   because these are extents allocated to stage a copy on write operation and
   are tracked in the refcount btree.

   Use any records owned by ``XFS_RMAP_OWN_REFC`` to create a bitmap of old
   refcount btree blocks.

2. Sort the records in physical extent order, putting the CoW staging extents
   at the end of the xfarray.
   This matches the sorting order of records in the refcount btree.

3. Use the ``xfs_btree_bload_compute_geometry`` function to compute the number
   of blocks needed for the new tree.

4. Allocate the number of blocks computed in the previous step.

5. Use ``xfs_btree_bload`` to write the xfarray records to btree blocks and
   generate the internal node blocks.

6. Commit the location of the new btree root block to the AGF.

7. Reap the old btree blocks using the bitmap created in step 1.

Details are as follows; the same algorithm is used by ``xfs_repair`` to
generate refcount information from reverse mapping records.

- Until the reverse mapping btree runs out of records:

  - Retrieve the next record from the btree and put it in a bag.

  - Collect all records with the same starting block from the btree and put
    them in the bag.

  - While the bag isn't empty:

    - Among the mappings in the bag, compute the lowest block number where the
      reference count changes.
      This position will be either the starting block number of the next
      unprocessed reverse mapping or the next block after the shortest mapping
      in the bag.

    - Remove all mappings from the bag that end at this position.

    - Collect all reverse mappings that start at this position from the btree
      and put them in the bag.

    - If the size of the bag changed and is greater than one, create a new
      refcount record associating the block number range that we just walked to
      the size of the bag.

The bag-like structure in this case is a type 2 xfarray as discussed in the
:ref:`xfarray access patterns<xfarray_access_patterns>` section.
Reverse mappings are added to the bag using ``xfarray_store_anywhere`` and
removed via ``xfarray_unset``.
Bag members are examined through ``xfarray_iter`` loops.

The proposed patchset is the
`AG btree repair
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_
series.

Case Study: Rebuilding File Fork Mapping Indices
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The high level process to rebuild a data/attr fork mapping btree is:

1. Walk the reverse mapping records to generate ``struct xfs_bmbt_rec``
   records from the reverse mapping records for that inode and fork.
   Append these records to an xfarray.
   Compute the bitmap of the old bmap btree blocks from the ``BMBT_BLOCK``
   records.

2. Use the ``xfs_btree_bload_compute_geometry`` function to compute the number
   of blocks needed for the new tree.

3. Sort the records in file offset order.

4. If the extent records would fit in the inode fork immediate area, commit the
   records to that immediate area and skip to step 8.

5. Allocate the number of blocks computed in step 2.

6. Use ``xfs_btree_bload`` to write the xfarray records to btree blocks and
   generate the internal node blocks.

7. Commit the new btree root block to the inode fork immediate area.

8. Reap the old btree blocks using the bitmap created in step 1.

There are some complications here:
First, it's possible to move the fork offset to adjust the sizes of the
immediate areas if the data and attr forks are not both in BMBT format.
Second, if there are sufficiently few fork mappings, it may be possible to use
EXTENTS format instead of BMBT, which may require a conversion.
Third, the incore extent map must be reloaded carefully to avoid disturbing
any delayed allocation extents.

The proposed patchset is the
`file mapping repair
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-file-mappings>`_
series.

.. _reaping:

Reaping Old Metadata Blocks
---------------------------

Whenever online fsck builds a new data structure to replace one that is
suspect, there is a question of how to find and dispose of the blocks that
belonged to the old structure.
The laziest method of course is not to deal with them at all, but this slowly
leads to service degradations as space leaks out of the filesystem.
Hopefully, someone will schedule a rebuild of the free space information to
plug all those leaks.
Offline repair rebuilds all space metadata after recording the usage of
the files and directories that it decides not to clear, hence it can build new
structures in the discovered free space and avoid the question of reaping.

As part of a repair, online fsck relies heavily on the reverse mapping records
to find space that is owned by the corresponding rmap owner yet truly free.
Cross referencing rmap records with other rmap records is necessary because
there may be other data structures that also think they own some of those
blocks (e.g. crosslinked trees).
Permitting the block allocator to hand them out again will not push the system
towards consistency.

For space metadata, the process of finding extents to dispose of generally
follows this format:

1. Create a bitmap of space used by data structures that must be preserved.
   The space reservations used to create the new metadata can be used here if
   the same rmap owner code is used to denote all of the objects being rebuilt.

2. Survey the reverse mapping data to create a bitmap of space owned by the
   same ``XFS_RMAP_OWN_*`` number for the metadata that is being preserved.

3. Use the bitmap disunion operator to subtract (1) from (2).
   The remaining set bits represent candidate extents that could be freed.
   The process moves on to step 4 below.

Repairs for file-based metadata such as extended attributes, directories,
symbolic links, quota files and realtime bitmaps are performed by building a
new structure attached to a temporary file and swapping the forks.
Afterward, the mappings in the old file fork are the candidate blocks for
disposal.

The process for disposing of old extents is as follows:

4. For each candidate extent, count the number of reverse mapping records for
   the first block in that extent that do not have the same rmap owner for the
   data structure being repaired.

   - If zero, the block has a single owner and can be freed.

   - If not, the block is part of a crosslinked structure and must not be
     freed.

5. Starting with the next block in the extent, figure out how many more blocks
   have the same zero/nonzero other owner status as that first block.

6. If the region is crosslinked, delete the reverse mapping entry for the
   structure being repaired and move on to the next region.

7. If the region is to be freed, mark any corresponding buffers in the buffer
   cache as stale to prevent log writeback.

8. Free the region and move on.
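
Steps 4 through 8 can be sketched in ordinary userspace C.
This is only an illustration under simplifying assumptions: the reverse mapping
data is modelled as a small array instead of a btree, and every name prefixed
with ``demo_`` is hypothetical and does not exist in the kernel.
Instead of freeing space, the sketch merely counts how many blocks would be
freed and how many would only have their reverse mapping removed.

.. code-block:: c

   #include <assert.h>
   #include <stdbool.h>
   #include <stddef.h>
   #include <stdint.h>

   /* Hypothetical, simplified reverse mapping record (not a kernel struct). */
   struct demo_rmap {
       uint64_t start;   /* first block of the mapping */
       uint64_t len;     /* number of blocks mapped */
       uint64_t owner;   /* owner code */
   };

   /* Hypothetical tally of what the reaper would do to a candidate extent. */
   struct demo_reap_totals {
       uint64_t freed;      /* blocks with no other owner (steps 7 and 8) */
       uint64_t unmapped;   /* crosslinked blocks, rmap removed only (step 6) */
   };

   /* Step 4: does a record with a different owner cover this block? */
   static bool demo_has_other_owner(const struct demo_rmap *recs, size_t nr,
                                    uint64_t block, uint64_t owner)
   {
       for (size_t i = 0; i < nr; i++) {
           if (recs[i].owner != owner &&
               block >= recs[i].start &&
               block - recs[i].start < recs[i].len)
               return true;
       }
       return false;
   }

   /*
    * Steps 4 and 5: classify the first block, then count how many following
    * blocks share its crosslinked status.  Requires len >= 1.
    */
   static uint64_t demo_next_region(const struct demo_rmap *recs, size_t nr,
                                    uint64_t start, uint64_t len,
                                    uint64_t owner, bool *crosslinked)
   {
       uint64_t run = 1;

       *crosslinked = demo_has_other_owner(recs, nr, start, owner);
       while (run < len &&
              demo_has_other_owner(recs, nr, start + run, owner) == *crosslinked)
           run++;
       return run;
   }

   /* Steps 6 to 8: walk the candidate extent one uniform region at a time. */
   static struct demo_reap_totals demo_reap_extent(const struct demo_rmap *recs,
                                                   size_t nr, uint64_t start,
                                                   uint64_t len, uint64_t owner)
   {
       struct demo_reap_totals totals = { 0, 0 };

       while (len > 0) {
           bool crosslinked;
           uint64_t run = demo_next_region(recs, nr, start, len, owner,
                                           &crosslinked);

           if (crosslinked)
               totals.unmapped += run;   /* only drop our own rmap entry */
           else
               totals.freed += run;      /* stale the buffers, then free */
           start += run;
           len -= run;
       }
       return totals;
   }

For a candidate extent covering blocks 10 to 17 in which another owner also
claims blocks 14 and 15, the sketch reports six freeable blocks and two
crosslinked blocks.
A real implementation must also roll the transaction between regions.
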

However, there is one complication to this procedure.
Transactions are of finite size, so the reaping process must be careful to roll
the transactions to avoid overruns.
Overruns come from two sources:

a. EFIs logged on behalf of space that is no longer occupied

b. Log items for buffer invalidations

This is also a window in which a crash during the reaping process can leak
blocks.
As stated earlier, online repair functions use very large transactions to
minimize the chances of this occurring.

The proposed patchset is the
`preparation for bulk loading btrees
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-prep-for-bulk-loading>`_
series.

Case Study: Reaping After a Regular Btree Repair
````````````````````````````````````````````````

Old reference count and inode btrees are the easiest to reap because they have
rmap records with special owner codes: ``XFS_RMAP_OWN_REFC`` for the refcount
btree, and ``XFS_RMAP_OWN_INOBT`` for the inode and free inode btrees.
Creating a list of extents to reap the old btree blocks is quite simple,
conceptually:

1. Lock the relevant AGI/AGF header buffers to prevent allocations and frees.

2. For each reverse mapping record with an rmap owner corresponding to the
   metadata structure being rebuilt, set the corresponding range in a bitmap.

3. Walk the current data structures that have the same rmap owner.
   For each block visited, clear that range in the above bitmap.

4. Each set bit in the bitmap represents a block that could be a block from the
   old data structures and hence is a candidate for reaping.
   In other words, ``(rmap_records_owned_by & ~blocks_reachable_by_walk)``
   are the blocks that might be freeable.

If it is possible to maintain the AGF lock throughout the repair (which is the
common case), then step 2 can be performed at the same time as the reverse
mapping record walk that creates the records for the new btree.

Case Study: Rebuilding the Free Space Indices
`````````````````````````````````````````````

The high level process to rebuild the free space indices is:

1. Walk the reverse mapping records to generate ``struct xfs_alloc_rec_incore``
   records from the gaps in the reverse mapping btree.

2. Append the records to an xfarray.

3. Use the ``xfs_btree_bload_compute_geometry`` function to compute the number
   of blocks needed for each new tree.

4. Allocate the number of blocks computed in the previous step from the free
   space information collected.

5. Use ``xfs_btree_bload`` to write the xfarray records to btree blocks and
   generate the internal node blocks for the free space by length index.
   Call it again for the free space by block number index.

6. Commit the locations of the new btree root blocks to the AGF.

7. Reap the old btree blocks by looking for space that is not recorded by the
   reverse mapping btree, the new free space btrees, or the AGFL.

Repairing the free space btrees has three key complications over a regular
btree repair:

First, free space is not explicitly tracked in the reverse mapping records.
Hence, the new free space records must be inferred from gaps in the physical
space component of the keyspace of the reverse mapping btree.

Second, free space repairs cannot use the common btree reservation code because
new blocks are reserved out of the free space btrees.
This is impossible when repairing the free space btrees themselves.
However, repair holds the AGF buffer lock for the duration of the free space
index reconstruction, so it can use the collected free space information to
supply the blocks for the new free space btrees.
It is not necessary to back each reserved extent with an EFI because the new
free space btrees are constructed in what the ondisk filesystem thinks is
unowned space.
However, if reserving blocks for the new btrees from the collected free space
information changes the number of free space records, repair must re-estimate
the new free space btree geometry with the new record count until the
reservation is sufficient.
As part of committing the new btrees, repair must ensure that reverse mappings
are created for the reserved blocks and that unused reserved blocks are
inserted into the free space btrees.
Deferred rmap and freeing operations are used to ensure that this transition
is atomic, similar to the other btree repair functions.

Third, finding the blocks to reap after the repair is not overly
straightforward.
Blocks for the free space btrees and the reverse mapping btrees are supplied by
the AGFL.
Blocks put onto the AGFL have reverse mapping records with the owner
``XFS_RMAP_OWN_AG``.
This ownership is retained when blocks move from the AGFL into the free space
btrees or the reverse mapping btrees.
When repair walks reverse mapping records to synthesize free space records, it
creates a bitmap (``ag_owner_bitmap``) of all the space claimed by
``XFS_RMAP_OWN_AG`` records.
The repair context maintains a second bitmap corresponding to the rmap btree
blocks and the AGFL blocks (``rmap_agfl_bitmap``).
When the walk is complete, the bitmap disunion operation
``(ag_owner_bitmap & ~rmap_agfl_bitmap)`` computes the extents that are used
by the old free space btrees.
These blocks can then be reaped using the methods outlined above.

The proposed patchset is the
`AG btree repair
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_
series.

.. _rmap_reap:

Case Study: Reaping After Repairing Reverse Mapping Btrees
``````````````````````````````````````````````````````````

Old reverse mapping btrees are less difficult to reap after a repair.
As mentioned in the previous section, blocks on the AGFL, the two free space
btree blocks, and the reverse mapping btree blocks all have reverse mapping
records with ``XFS_RMAP_OWN_AG`` as the owner.
The full process of gathering reverse mapping records and building a new btree
is described in the case study of
:ref:`live rebuilds of rmap data <rmap_repair>`, but a crucial point from that
discussion is that the new rmap btree will not contain any records for the old
rmap btree, nor will the old btree blocks be tracked in the free space btrees.
The list of candidate reaping blocks is computed by setting the bits
corresponding to the gaps in the new rmap btree records, and then clearing the
bits corresponding to extents in the free space btrees and the current AGFL
blocks.
The resulting blocks, ``(new_rmapbt_gaps & ~(agfl | bnobt_records))``, are
reaped using the methods outlined above.

The rest of the process of rebuilding the reverse mapping btree is discussed
in a separate :ref:`case study<rmap_repair>`.

The proposed patchset is the
`AG btree repair
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_
series.

Case Study: Rebuilding the AGFL
```````````````````````````````

The allocation group free block list (AGFL) is repaired as follows:

1. Create a bitmap for all the space that the reverse mapping data claims is
   owned by ``XFS_RMAP_OWN_AG``.

2. Subtract the space used by the two free space btrees and the rmap btree.

3. Subtract any space that the reverse mapping data claims is owned by any
   other owner, to avoid re-adding crosslinked blocks to the AGFL.

4. Once the AGFL is full, reap any blocks left over.

5. The next operation to fix the freelist will right-size the list.

See `fs/xfs/scrub/agheader_repair.c <https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/xfs/scrub/agheader_repair.c>`_ for more details.

Inode Record Repairs
--------------------

Inode records must be handled carefully, because they have both ondisk records
("dinodes") and an in-memory ("cached") representation.
There is a very high potential for cache coherency issues if online fsck is not
careful to access the ondisk metadata *only* when the ondisk metadata is so
badly damaged that the filesystem cannot load the in-memory representation.
When online fsck wants to open a damaged file for scrubbing, it must use
specialized resource acquisition functions that return either the in-memory
representation *or* a lock on whichever object is necessary to prevent any
update to the ondisk location.

The only repairs that should be made to the ondisk inode buffers are whatever
is necessary to get the in-core structure loaded.
This means fixing whatever is caught by the inode cluster buffer and inode fork
verifiers, and retrying the ``iget`` operation.
If the second ``iget`` fails, the repair has failed.

Once the in-memory representation is loaded, repair can lock the inode and can
subject it to comprehensive checks, repairs, and optimizations.
Most inode attributes are easy to check and constrain, or are user-controlled
arbitrary bit patterns; these are both easy to fix.
Dealing with the data and attr fork extent counts and the file block counts is
more complicated, because computing the correct value requires traversing the
forks, or if that fails, leaving the fields invalid and waiting for the fork
fsck functions to run.

The proposed patchset is the
`inode
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-inodes>`_
repair series.

Quota Record Repairs
--------------------

Similar to inodes, quota records ("dquots") also have both ondisk records and
an in-memory representation, and hence are subject to the same cache coherency
issues.
Somewhat confusingly, both are known as dquots in the XFS codebase.

The only repairs that should be made to the ondisk quota record buffers are
whatever is necessary to get the in-core structure loaded.
Once the in-memory representation is loaded, the only attributes needing
checking are obviously bad limits and timer values.

Quota usage counters are checked, repaired, and discussed separately in the
section about :ref:`live quotacheck <quotacheck>`.

The proposed patchset is the
`quota
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quota>`_
repair series.

.. _fscounters:

Freezing to Fix Summary Counters
--------------------------------

Filesystem summary counters track availability of filesystem resources such
as free blocks, free inodes, and allocated inodes.
This information could be compiled by walking the free space and inode indexes,
but this is a slow process, so XFS maintains a copy in the ondisk superblock
that should reflect the ondisk metadata, at least when the filesystem has been
unmounted cleanly.
For performance reasons, XFS also maintains incore copies of those counters,
which are key to enabling resource reservations for active transactions.
Writer threads reserve the worst-case quantities of resources from the
incore counter and give back whatever they don't use at commit time.
It is therefore only necessary to serialize on the superblock when the
superblock is being committed to disk.

The lazy superblock counter feature introduced in XFS v5 took this even further
by training log recovery to recompute the summary counters from the AG headers,
which eliminated the need for most transactions even to touch the superblock.
The only time XFS commits the summary counters is at filesystem unmount.
To reduce contention even further, the incore counter is implemented as a
percpu counter, which means that each CPU is allocated a batch of blocks from a
global incore counter and can satisfy small allocations from the local batch.

The high-performance nature of the summary counters makes it difficult for
online fsck to check them, since there is no way to quiesce a percpu counter
while the system is running.
Although online fsck can read the filesystem metadata to compute the correct
values of the summary counters, there's no way to hold the value of a percpu
counter stable, so it's quite possible that the counter will be out of date by
the time the walk is complete.
Earlier versions of online scrub would return to userspace with an incomplete
scan flag, but this is not a satisfying outcome for a system administrator.
For repairs, the in-memory counters must be stabilized while walking the
filesystem metadata to get an accurate reading and install it in the percpu
counter.

To satisfy this requirement, online fsck must prevent other programs in the
system from initiating new writes to the filesystem, it must disable background
garbage collection threads, and it must wait for existing writer programs to
exit the kernel.
Once that has been established, scrub can walk the AG free space indexes, the
inode btrees, and the realtime bitmap to compute the correct value of all
four summary counters.
This is very similar to a filesystem freeze, though not all of the pieces are
necessary:

- The final freeze state is set one higher than ``SB_FREEZE_COMPLETE`` to
  prevent other threads from thawing the filesystem, or other scrub threads
  from initiating another fscounters freeze.

- It does not quiesce the log.

With this code in place, it is now possible to pause the filesystem for just
long enough to check and correct the summary counters.

+--------------------------------------------------------------------------+
| **Historical Sidebar**:                                                  |
+--------------------------------------------------------------------------+
| The initial implementation used the actual VFS filesystem freeze         |
| mechanism to quiesce filesystem activity.                                |
| With the filesystem frozen, it is possible to resolve the counter values |
| with exact precision, but there are many problems with calling the VFS   |
| methods directly:                                                        |
|                                                                          |
| - Other programs can unfreeze the filesystem without our knowledge.      |
|   This leads to incorrect scan results and incorrect repairs.            |
|                                                                          |
| - Adding an extra lock to prevent others from thawing the filesystem     |
|   required the addition of a ``->freeze_super`` function to wrap         |
|   ``freeze_fs()``.                                                       |
|   This in turn caused other subtle problems because it turns out that    |
|   the VFS ``freeze_super`` and ``thaw_super`` functions can drop the     |
|   last reference to the VFS superblock, and any subsequent access        |
|   becomes a UAF bug!                                                     |
|   This can happen if the filesystem is unmounted while the underlying    |
|   block device has frozen the filesystem.                                |
|   This problem could be solved by grabbing extra references to the       |
|   superblock, but it felt suboptimal given the other inadequacies of     |
|   this approach.                                                         |
|                                                                          |
| - The log need not be quiesced to check the summary counters, but a VFS  |
|   freeze initiates one anyway.                                           |
|   This adds unnecessary runtime to live fscounter fsck operations.       |
|                                                                          |
| - Quiescing the log means that XFS flushes the (possibly incorrect)      |
|   counters to disk as part of cleaning the log.                          |
|                                                                          |
| - A bug in the VFS meant that freeze could complete even when            |
|   sync_filesystem fails to flush the filesystem and returns an error.    |
|   This bug was fixed in Linux 5.17.                                      |
+--------------------------------------------------------------------------+


The proposed patchset is the
`summary counter cleanup
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-fscounters>`_
series.

Full Filesystem Scans
---------------------

Certain types of metadata can only be checked by walking every file in the
entire filesystem to record observations and comparing the observations against
what's recorded on disk.
Like every other type of online repair, repairs are made by writing those
observations to disk in a replacement structure and committing it atomically.
However, it is not practical to shut down the entire filesystem to examine
hundreds of billions of files because the downtime would be excessive.
Therefore, online fsck must build the infrastructure to manage a live scan of
all the files in the filesystem.
There are two questions that need to be solved to perform a live walk:

- How does scrub manage the scan while it is collecting data?

- How does the scan keep abreast of changes being made to the system by other
  threads?

.. _iscan:

Coordinated Inode Scans
```````````````````````

In the original Unix filesystems of the 1970s, each directory entry contained
an index number (*inumber*) which was used as an index into an ondisk array
(*itable*) of fixed-size records (*inodes*) describing a file's attributes and
its data block mapping.
This system is described by J. Lions, `"inode (5659)"
<http://www.lemis.com/grog/Documentation/Lions/>`_ in *Lions' Commentary on
UNIX, 6th Edition*, (Dept. of Computer Science, the University of New South
Wales, November 1977), pp. 18-2; and later by D. Ritchie and K. Thompson,
`"Implementation of the File System"
<https://archive.org/details/bstj57-6-1905/page/n8/mode/1up>`_, from *The UNIX
Time-Sharing System*, (The Bell System Technical Journal, July 1978), pp.
1913-4.

XFS retains most of this design, except now inumbers are search keys over all
the space in the data section of the filesystem.
They form a continuous keyspace that can be expressed as a 64-bit integer,
though the inodes themselves are sparsely distributed within the keyspace.
Scans proceed in a linear fashion across the inumber keyspace, starting from
``0x0`` and ending at ``0xFFFFFFFFFFFFFFFF``.
Naturally, a scan through a keyspace requires a scan cursor object to track the
scan progress.
Because this keyspace is sparse, this cursor contains two parts.
The first part of this scan cursor object tracks the inode that will be
examined next; call this the examination cursor.
Somewhat less obviously, the scan cursor object must also track which parts of
the keyspace have already been visited, which is critical for deciding if a
concurrent filesystem update needs to be incorporated into the scan data.
Call this the visited inode cursor.

Advancing the scan cursor is a multi-step process encapsulated in
``xchk_iscan_iter``:

1. Lock the AGI buffer of the AG containing the inode pointed to by the visited
   inode cursor.
   This guarantees that inodes in this AG cannot be allocated or freed while
   advancing the cursor.

2. Use the per-AG inode btree to look up the next inumber after the one that
   was just visited, since it may not be keyspace adjacent.

3. If there are no more inodes left in this AG:

   a. Move the examination cursor to the point of the inumber keyspace that
      corresponds to the start of the next AG.

   b. Adjust the visited inode cursor to indicate that it has "visited" the
      last possible inode in the current AG's inode keyspace.
      XFS inumbers are segmented, so the cursor needs to be marked as having
      visited the entire keyspace up to just before the start of the next AG's
      inode keyspace.

   c. Unlock the AGI and return to step 1 if there are unexamined AGs in the
      filesystem.

   d. If there are no more AGs to examine, set both cursors to the end of the
      inumber keyspace.
      The scan is now complete.

4. Otherwise, there is at least one more inode to scan in this AG:

   a. Move the examination cursor ahead to the next inode marked as allocated
      by the inode btree.

   b. Adjust the visited inode cursor to point to the inode just prior to where
      the examination cursor is now.
      Because the scanner holds the AGI buffer lock, no inodes could have been
      created in the part of the inode keyspace that the visited inode cursor
      just advanced.

5. Get the incore inode for the inumber of the examination cursor.
   By maintaining the AGI buffer lock until this point, the scanner knows that
   it was safe to advance the examination cursor across the entire keyspace,
   and that it has stabilized this next inode so that it cannot disappear from
   the filesystem until the scan releases the incore inode.

6. Drop the AGI lock and return the incore inode to the caller.

Online fsck functions scan all files in the filesystem as follows:

1. Start a scan by calling ``xchk_iscan_start``.

2. Advance the scan cursor (``xchk_iscan_iter``) to get the next inode.
   If one is provided:

   a. Lock the inode to prevent updates during the scan.

   b. Scan the inode.

   c. While still holding the inode lock, adjust the visited inode cursor
      (``xchk_iscan_mark_visited``) to point to this inode.

   d. Unlock and release the inode.

3. Call ``xchk_iscan_teardown`` to complete the scan.

There are subtleties with the inode cache that complicate grabbing the incore
inode for the caller.
First, it is an absolute requirement that the inode metadata be consistent
enough to load it into the inode cache.
Second, if the incore inode is stuck in some intermediate state, the scan
coordinator must release the AGI and push the main filesystem to get the inode
back into a loadable state.

The proposed patches are the
`inode scanner
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-iscan>`_
series.
The first user of the new functionality is the
`online quotacheck
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quotacheck>`_
series.
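
The two-part cursor described in this section can be modelled in a few lines of
userspace C.
This is only a sketch under loud assumptions: the ``toy_iscan`` structure and
every ``toy_`` function are hypothetical names invented for illustration, a
sorted array stands in for the per-AG inode btrees, there is a single AG, and
the AGI buffer lock and inode locks are omitted entirely.
It is not the kernel implementation of ``xchk_iscan_iter``.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy scan cursor; all names here are hypothetical, not the kernel's. */
struct toy_iscan {
	const uint64_t	*inos;		/* sorted allocated inumbers (stand-in for the inode btree) */
	size_t		nr;		/* number of entries in inos */
	size_t		next;		/* index of the next inode to examine */
	uint64_t	cursor_ino;	/* examination cursor */
	uint64_t	visited_ino;	/* visited cursor: inumbers <= this are visited */
	bool		any_visited;	/* false until the visited cursor first moves */
};

void toy_iscan_start(struct toy_iscan *sc, const uint64_t *inos, size_t nr)
{
	sc->inos = inos;
	sc->nr = nr;
	sc->next = 0;
	sc->cursor_ino = 0;
	sc->visited_ino = 0;
	sc->any_visited = false;
}

/* Returns 1 and sets *ino if there is another inode, or 0 when the scan is done. */
int toy_iscan_iter(struct toy_iscan *sc, uint64_t *ino)
{
	if (sc->next == sc->nr) {
		/* Nothing left: both cursors jump to the end of the keyspace. */
		sc->cursor_ino = UINT64_MAX;
		sc->visited_ino = UINT64_MAX;
		sc->any_visited = true;
		return 0;
	}

	sc->cursor_ino = sc->inos[sc->next++];
	if (sc->cursor_ino > 0) {
		/* The gap before the next inode holds no inodes, so it counts as visited. */
		sc->visited_ino = sc->cursor_ino - 1;
		sc->any_visited = true;
	}
	*ino = sc->cursor_ino;
	return 1;
}

/* Called by the scanner after it has finished examining ino. */
void toy_iscan_mark_visited(struct toy_iscan *sc, uint64_t ino)
{
	sc->visited_ino = ino;
	sc->any_visited = true;
}

/* Should a concurrent update to ino be folded into the scan data? */
bool toy_iscan_want_live_update(const struct toy_iscan *sc, uint64_t ino)
{
	return sc->any_visited && ino <= sc->visited_ino;
}
```

In this model, an update to an inumber that lies in a gap the cursor has
already skipped is treated as visited, while an update to the inode currently
under examination is not treated as visited until the scanner marks it.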

Inode Management
````````````````

In regular filesystem code, references to allocated XFS incore inodes are
always obtained (``xfs_iget``) outside of transaction context because the
creation of the incore context for an existing file does not require metadata
updates.
However, it is important to note that references to incore inodes obtained as
part of file creation must be performed in transaction context because the
filesystem must ensure the atomicity of the ondisk inode btree index updates
and the initialization of the actual ondisk inode.

References to incore inodes are always released (``xfs_irele``) outside of
transaction context because there are a handful of activities that might
require ondisk updates:

- The VFS may decide to kick off writeback as part of a ``DONTCACHE`` inode
  release.

- Speculative preallocations need to be unreserved.

- An unlinked file may have lost its last reference, in which case the entire
  file must be inactivated, which involves releasing all of its resources in
  the ondisk metadata and freeing the inode.

These activities are collectively called inode inactivation.
Inactivation has two parts -- the VFS part, which initiates writeback on all
dirty file pages, and the XFS part, which cleans up XFS-specific information
and frees the inode if it was unlinked.
If the inode is unlinked (or unconnected after a file handle operation), the
kernel drops the inode into the inactivation machinery immediately.

During normal operation, resource acquisition for an update follows this order
to avoid deadlocks:

1. Inode reference (``iget``).

2. Filesystem freeze protection, if repairing (``mnt_want_write_file``).

3. Inode ``IOLOCK`` (VFS ``i_rwsem``) lock to control file IO.

4. Inode ``MMAPLOCK`` (page cache ``invalidate_lock``) lock for operations that
   can update page cache mappings.

5. Log feature enablement.

6. Transaction log space grant.

7. Space on the data and realtime devices for the transaction.

8. Incore dquot references, if a file is being repaired.
   Note that they are not locked, merely acquired.

9. Inode ``ILOCK`` for file metadata updates.

10. AG header buffer locks / Realtime metadata inode ILOCK.

11. Realtime metadata buffer locks, if applicable.

12. Extent mapping btree blocks, if applicable.
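
The acquisition order above can be pictured as a ranking, with a deadlock risk
arising whenever a thread acquires a lower-ranked resource while holding a
higher-ranked one.
The following userspace sketch makes that concrete; the ``toy_stage``
enumeration and ``toy_first_order_violation`` are hypothetical names invented
for this illustration and do not exist in the kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical labels for the twelve acquisition stages listed above. */
enum toy_stage {
	TOY_IGET = 1,
	TOY_FREEZE,
	TOY_IOLOCK,
	TOY_MMAPLOCK,
	TOY_LOG_FEATURE,
	TOY_LOG_GRANT,
	TOY_DEVICE_SPACE,
	TOY_DQUOT_REF,
	TOY_ILOCK,
	TOY_AG_HEADER,
	TOY_RT_BUFFER,
	TOY_BMBT_BLOCK,
};

/*
 * Return the index of the first acquisition that steps backwards in the
 * order, or -1 if the whole sequence respects it.  Acquiring two resources
 * of the same stage is allowed by this toy check.
 */
long toy_first_order_violation(const enum toy_stage *seq, size_t nr)
{
	size_t i;

	for (i = 1; i < nr; i++)
		if (seq[i] < seq[i - 1])
			return (long)i;
	return -1;
}
```

A sequence such as inode reference, ``ILOCK``, AG header, and then another
inode reference is flagged at its last step, which is the kind of
cross-referencing pattern that online fsck has to handle with care.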

Resources are often released in the reverse order, though this is not required.
However, online fsck differs from regular XFS operations because it may examine
an object that normally is acquired in a later stage of the locking order, and
then decide to cross-reference the object with an object that is acquired
earlier in the order.
The next few sections detail the specific ways in which online fsck takes care
to avoid deadlocks.

iget and irele During a Scrub
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

An inode scan performed on behalf of a scrub operation runs in transaction
context, and possibly with resources already locked and bound to it.
This isn't much of a problem for ``iget`` since it can operate in the context
of an existing transaction, as long as all of the bound resources are acquired
before the inode reference in the regular filesystem.

When the VFS ``iput`` function is given a linked inode with no other
references, it normally puts the inode on an LRU list in the hope that it can
save time if another process re-opens the file before the system runs out
of memory and frees it.
Filesystem callers can short-circuit the LRU process by setting a ``DONTCACHE``
flag on the inode to cause the kernel to try to drop the inode into the
inactivation machinery immediately.

In the past, inactivation was always done from the process that dropped the
inode, which was a problem for scrub because scrub may already hold a
transaction, and XFS does not support nesting transactions.
On the other hand, if there is no scrub transaction, it is desirable to drop
otherwise unused inodes immediately to avoid polluting caches.
To capture these nuances, the online fsck code has a separate ``xchk_irele``
function to set or clear the ``DONTCACHE`` flag to get the required release
behavior.

Proposed patchsets include fixing
`scrub iget usage
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-iget-fixes>`_ and
`dir iget usage
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-dir-iget-fixes>`_.

.. _ilocking:

Locking Inodes
^^^^^^^^^^^^^^

In regular filesystem code, the VFS and XFS will acquire multiple IOLOCK locks
in a well-known order: parent → child when updating the directory tree, and
in numerical order of the addresses of their ``struct inode`` object otherwise.
For regular files, the MMAPLOCK can be acquired after the IOLOCK to stop page
faults.
If two MMAPLOCKs must be acquired, they are acquired in numerical order of
the addresses of their ``struct address_space`` objects.
Due to the structure of existing filesystem code, IOLOCKs and MMAPLOCKs must be
acquired before transactions are allocated.
If two ILOCKs must be acquired, they are acquired in inumber order.

Inode lock acquisition must be done carefully during a coordinated inode scan.
Online fsck cannot abide by these conventions, because for a directory tree
scanner, the scrub process holds the IOLOCK of the file being scanned and it
needs to take the IOLOCK of the file at the other end of the directory link.
If the directory tree is corrupt because it contains a cycle, ``xfs_scrub``
cannot use the regular inode locking functions and avoid becoming trapped in an
ABBA deadlock.

Solving both of these problems is straightforward -- any time online fsck
needs to take a second lock of the same class, it uses trylock to avoid an ABBA
deadlock.
If the trylock fails, scrub drops all inode locks and uses trylock loops to
(re)acquire all necessary resources.
Trylock loops enable scrub to check for pending fatal signals, which is how
scrub avoids deadlocking the filesystem or becoming an unresponsive process.
However, trylock loops mean that online fsck must be prepared to measure the
resource being scrubbed before and after the lock cycle to detect changes and
react accordingly.

.. _dirparent:

Case Study: Finding a Directory Parent
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Consider the directory parent pointer repair code as an example.
Online fsck must verify that the dotdot dirent of a directory points up to a
parent directory, and that the parent directory contains exactly one dirent
pointing down to the child directory.
Fully validating this relationship (and repairing it if possible) requires a
walk of every directory on the filesystem while holding the child locked, and
while updates to the directory tree are being made.
The coordinated inode scan provides a way to walk the filesystem without the
possibility of missing an inode.
The child directory is kept locked to prevent updates to the dotdot dirent, but
if the scanner fails to lock a parent, it can drop and relock both the child
and the prospective parent.
If the dotdot entry changes while the directory is unlocked, then a move or
rename operation must have changed the child's parentage, and the scan can
exit early.

The proposed patchset is the
`directory repair
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-dirs>`_
series.

.. _fshooks:

Filesystem Hooks
`````````````````

The second piece of support that online fsck functions need during a full
filesystem scan is the ability to stay informed about updates being made by
other threads in the filesystem, since comparisons against the past are useless
in a dynamic environment.
Two pieces of Linux kernel infrastructure enable online fsck to monitor regular
filesystem operations: filesystem hooks and :ref:`static keys<jump_labels>`.

Filesystem hooks convey information about an ongoing filesystem operation to
a downstream consumer.
In this case, the downstream consumer is always an online fsck function.
Because multiple fsck functions can run in parallel, online fsck uses the Linux
notifier call chain facility to dispatch updates to any number of interested
fsck processes.
Call chains are a dynamic list, which means that they can be configured at
run time.
Because these hooks are private to the XFS module, the information passed along
contains exactly what the checking function needs to update its observations.

The current implementation of XFS hooks uses SRCU notifier chains to reduce the
impact to highly threaded workloads.
Regular blocking notifier chains use a rwsem and seem to have a much lower
overhead for single-threaded applications.
However, it may turn out that the combination of blocking chains and static
keys is a more performant combination; more study is needed here.

The following pieces are necessary to hook a certain point in the filesystem:

- A ``struct xfs_hooks`` object must be embedded in a convenient place such as
  a well-known incore filesystem object.

- Each hook must define an action code and a structure containing more context
  about the action.

- Hook providers should provide appropriate wrapper functions and structs
  around the ``xfs_hooks`` and ``xfs_hook`` objects to take advantage of type
  checking to ensure correct usage.

- A callsite in the regular filesystem code must be chosen to call
  ``xfs_hooks_call`` with the action code and data structure.
  This place should be adjacent to (and not earlier than) the place where
  the filesystem update is committed to the transaction.
  In general, when the filesystem calls a hook chain, it should be able to
  handle sleeping and should not be vulnerable to memory reclaim or locking
  recursion.
  However, the exact requirements are very dependent on the context of the hook
  caller and the callee.

- The online fsck function should define a structure to hold scan data, a lock
  to coordinate access to the scan data, and a ``struct xfs_hook`` object.
  The scanner function and the regular filesystem code must acquire resources
  in the same order; see the next section for details.

- The online fsck code must contain a C function to catch the hook action code
  and data structure.
  If the object being updated has already been visited by the scan, then the
  hook information must be applied to the scan data.

- Prior to unlocking inodes to start the scan, online fsck must call
  ``xfs_hooks_setup`` to initialize the ``struct xfs_hook``, and
  ``xfs_hooks_add`` to enable the hook.

- Online fsck must call ``xfs_hooks_del`` to disable the hook once the scan is
  complete.

The number of hooks should be kept to a minimum to reduce complexity.
Static keys are used to reduce the overhead of filesystem hooks to nearly
zero when online fsck is not running.

.. _liveupdate:

Live Updates During a Scan
``````````````````````````

The code paths of the online fsck scanning code and the :ref:`hooked<fshooks>`
filesystem code look like this::

            other program
                  ↓
            inode lock ←────────────────────┐
                  ↓                         │
            AG header lock                  │
                  ↓                         │
            filesystem function             │
                  ↓                         │
            notifier call chain             │    same
                  ↓                         ├─── inode
            scrub hook function             │    lock
                  ↓                         │
            scan data mutex ←──┐    same    │
                  ↓            ├─── scan    │
            update scan data   │    lock    │
                  ↑            │            │
            scan data mutex ←──┘            │
                  ↑                         │
            inode lock ←────────────────────┘
                  ↑
            scrub function
                  ↑
            inode scanner
                  ↑
            xfs_scrub

These rules must be followed to ensure correct interactions between the
checking code and the code making an update to the filesystem:

- Prior to invoking the notifier call chain, the filesystem function being
  hooked must acquire the same lock that the scrub scanning function acquires
  to scan the inode.

- The scanning function and the scrub hook function must coordinate access to
  the scan data by acquiring a lock on the scan data.

- The scrub hook function must not add the live update information to the scan
  observations unless the inode being updated has already been scanned.
  The scan coordinator has a helper predicate (``xchk_iscan_want_live_update``)
  for this.

- Scrub hook functions must not change the caller's state, including the
  transaction that it is running.
  They must not acquire any resources that might conflict with the filesystem
  function being hooked.

- The hook function can abort the inode scan to avoid breaking the other rules.

The inode scan APIs are pretty simple:

- ``xchk_iscan_start`` starts a scan.

- ``xchk_iscan_iter`` grabs a reference to the next inode in the scan or
  returns zero if there is nothing left to scan.

- ``xchk_iscan_want_live_update`` decides if an inode has already been
  visited in the scan.
  This is critical for hook functions to decide if they need to update the
  in-memory scan information.

- ``xchk_iscan_mark_visited`` marks an inode as having been visited in the
  scan.

- ``xchk_iscan_teardown`` finishes the scan.

This functionality is also a part of the
`inode scanner
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-iscan>`_
series.

.. _quotacheck:

Case Study: Quota Counter Checking
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

It is useful to compare the mount time quotacheck code to the online repair
quotacheck code.
Mount time quotacheck does not have to contend with concurrent operations, so
it does the following:

1. Make sure the ondisk dquots are in good enough shape that all the incore
   dquots will actually load, and zero the resource usage counters in the
   ondisk buffer.

2. Walk every inode in the filesystem.
   Add each file's resource usage to the incore dquot.

3. Walk each incore dquot.
   If the incore dquot is not being flushed, add the ondisk buffer backing the
   incore dquot to a delayed write (delwri) list.

4. Write the buffer list to disk.

Like most online fsck functions, online quotacheck can't write to regular
filesystem objects until the newly collected metadata reflect all filesystem
state.
Therefore, online quotacheck records file resource usage to a shadow dquot
index implemented with a sparse ``xfarray``, and only writes to the real dquots
once the scan is complete.
Handling transactional updates is tricky because quota resource usage updates
are handled in phases to minimize contention on dquots:

1. The inodes involved are joined and locked to a transaction.

2. For each dquot attached to the file:

   a. The dquot is locked.

   b. A quota reservation is added to the dquot's resource usage.
      The reservation is recorded in the transaction.

   c. The dquot is unlocked.

3. Changes in actual quota usage are tracked in the transaction.

4. At transaction commit time, each dquot is examined again:

   a. The dquot is locked again.

   b. Quota usage changes are logged and unused reservation is given back to
      the dquot.

   c. The dquot is unlocked.

For online quotacheck, hooks are placed in steps 2 and 4.
The step 2 hook creates a shadow version of the transaction dquot context
(``dqtrx``) that operates in a similar manner to the regular code.
The step 4 hook commits the shadow ``dqtrx`` changes to the shadow dquots.
Notice that both hooks are called with the inode locked, which is how the
live update coordinates with the inode scanner.

The quotacheck scan looks like this:

1. Set up a coordinated inode scan.

2. For each inode returned by the inode scan iterator:

   a. Grab and lock the inode.

   b. Determine that inode's resource usage (data blocks, inode counts,
      realtime blocks) and add that to the shadow dquots for the user, group,
      and project ids associated with the inode.

   c. Unlock and release the inode.

3. For each dquot in the system:

   a. Grab and lock the dquot.

   b. Check the dquot against the shadow dquots created by the scan and updated
      by the live hooks.

Live updates are key to being able to walk every quota record without
needing to hold any locks for a long duration.
If repairs are desired, the real and shadow dquots are locked and their
resource counts are set to the values in the shadow dquot.

The proposed patchset is the
`online quotacheck
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quotacheck>`_
series.
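
To see why the hook must consult the visited cursor before touching the shadow
counters, consider the following userspace sketch.
It is a deliberately simplified model and not the kernel code: the
``toy_quotacheck`` structure and the ``toy_qc_`` functions are hypothetical,
a small fixed array replaces the sparse ``xfarray``, only block counts for a
single id type are tracked, the two-phase shadow ``dqtrx`` is collapsed into a
single delta, and all locking is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_NR_IDS	4	/* tiny id space, for illustration only */

/* Toy quotacheck state; every name here is hypothetical. */
struct toy_quotacheck {
	int64_t		shadow[TOY_NR_IDS];	/* shadow block counts, indexed by id */
	uint64_t	visited_ino;		/* inumbers <= this have been scanned */
	bool		any_visited;		/* false until the first inode is scanned */
};

/* Scanner side: record one inode's usage, then mark it visited. */
void toy_qc_scan_inode(struct toy_quotacheck *qc, uint64_t ino,
		       unsigned int id, int64_t blocks)
{
	/* Callers must pass id < TOY_NR_IDS. */
	qc->shadow[id] += blocks;
	qc->visited_ino = ino;
	qc->any_visited = true;
}

/* Hook side: apply a usage change only if the scan already counted the inode. */
void toy_qc_live_update(struct toy_quotacheck *qc, uint64_t ino,
			unsigned int id, int64_t delta)
{
	if (qc->any_visited && ino <= qc->visited_ino)
		qc->shadow[id] += delta;
}

/* Check phase: does the real counter match what the scan observed? */
bool toy_qc_matches(const struct toy_quotacheck *qc, unsigned int id,
		    int64_t real_blocks)
{
	return qc->shadow[id] == real_blocks;
}
```

Suppose inode 10 (4 blocks) is scanned, then inode 20 grows by 2 blocks and
inode 10 grows by 3 blocks before the scanner reaches inode 20 (now 8 blocks).
Only the update to inode 10 is applied by the hook; the growth of inode 20 is
picked up when the scanner visits it, so the shadow total is 15 and nothing is
counted twice.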

.. _nlinks:

Case Study: File Link Count Checking
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

File link count checking also uses live update hooks.
The coordinated inode scanner is used to visit all directories on the
filesystem, and per-file link count records are stored in a sparse ``xfarray``
indexed by inumber.
During the scanning phase, each entry in a directory generates observation
data as follows:

1. If the entry is a dotdot (``'..'``) entry of the root directory, the
   directory's parent link count is bumped because the root directory's dotdot
   entry is self referential.

2. If the entry is a dotdot entry of a subdirectory, the parent's backref
   count is bumped.

3. If the entry is neither a dot nor a dotdot entry, the target file's parent
   count is bumped.

4. If the target is a subdirectory, the parent's child link count is bumped.

A crucial point to understand about how the link count inode scanner interacts
with the live update hooks is that the scan cursor tracks which *parent*
directories have been scanned.
In other words, the live updates ignore any update about ``A → B`` when A has
not been scanned, even if B has been scanned.
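
The four observation rules and the parent-based filtering of live updates can
be modelled in a few lines of userspace C.
This is a sketch under stated assumptions, not the scrub code: the
``nlink_obs`` and ``nlink_scan`` structures, the fixed-size table standing in
for the sparse ``xfarray``, and the function names are all hypothetical.
The model follows rule 2 as written (the backref is recorded against the
directory that the dotdot entry points to) and handles only entries being
added.

.. code-block:: c

    #include <assert.h>
    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    #define MAX_INODES 16  /* hypothetical: inumbers are 0..15 in this model */

    /* Shadow link count observations for one file. */
    struct nlink_obs {
        uint32_t parents;   /* dirents pointing down to this file */
        uint32_t backrefs;  /* dotdot entries pointing up to this directory */
        uint32_t children;  /* dirents pointing down to child subdirectories */
    };

    struct nlink_scan {
        struct nlink_obs obs[MAX_INODES];
        bool             scanned[MAX_INODES]; /* directory already scanned? */
        uint64_t         root_ino;
    };

    /* Record one directory entry (@dir, @name) -> @target, per rules 1-4. */
    static void nlink_observe(struct nlink_scan *ns, uint64_t dir,
                              const char *name, uint64_t target,
                              bool target_is_dir)
    {
        if (strcmp(name, ".") == 0)
            return;                          /* dot entries are not counted */
        if (strcmp(name, "..") == 0) {
            if (dir == ns->root_ino)
                ns->obs[dir].parents++;      /* rule 1: root is its own parent */
            else
                ns->obs[target].backrefs++;  /* rule 2: parent's backref count */
            return;
        }
        ns->obs[target].parents++;           /* rule 3 */
        if (target_is_dir)
            ns->obs[dir].children++;         /* rule 4 */
    }

    /* Live hook: apply an update only if the parent directory was scanned. */
    static void nlink_live_update(struct nlink_scan *ns, uint64_t dir,
                                  const char *name, uint64_t target,
                                  bool target_is_dir)
    {
        if (ns->scanned[dir])
            nlink_observe(ns, dir, name, target, target_is_dir);
    }

A caller of this model would set ``scanned[dir]`` after walking all entries of
a directory, which is the role the scan cursor plays in the text above.
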
Furthermore, a subdirectory A with a dotdot entry pointing back to B is
accounted as a backref counter in the shadow data for A, since child dotdot
entries affect the parent's link count.
Live update hooks are carefully placed in all parts of the filesystem that
create, change, or remove directory entries, since those operations involve
bumplink and droplink.

For any file, the correct link count is the number of parents plus the number
of child subdirectories.
Non-directories never have children of any kind.
The backref information is used to detect inconsistencies in the number of
links pointing to child subdirectories and the number of dotdot entries
pointing back.

After the scan completes, the link count of each file can be checked by locking
both the inode and the shadow data, and comparing the link counts.
A second coordinated inode scan cursor is used for comparisons.
Live updates are key to being able to walk every inode without needing to hold
any locks between inodes.
If repairs are desired, the inode's link count is set to the value in the
shadow information.
If no parents are found, the file must be :ref:`reparented <orphanage>` to the
orphanage to prevent the file from being lost forever.

The proposed patchset is the
`file link count repair
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-nlinks>`_
series.

.. _rmap_repair:

Case Study: Rebuilding Reverse Mapping Records
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Most repair functions follow the same pattern: lock filesystem resources,
walk the surviving ondisk metadata looking for replacement metadata records,
and use an :ref:`in-memory array <xfarray>` to store the gathered observations.
The primary advantage of this approach is the simplicity and modularity of the
repair code -- code and data are entirely contained within the scrub module,
do not require hooks in the main filesystem, and are usually the most efficient
in memory use.
A secondary advantage of this repair approach is atomicity -- once the kernel
decides a structure is corrupt, no other threads can access the metadata until
the kernel finishes repairing and revalidating the metadata.

For repairs going on within a shard of the filesystem, these advantages
outweigh the delays inherent in locking the shard while repairing parts of the
shard.
Unfortunately, repairs to the reverse mapping btree cannot use the "standard"
btree repair strategy because it must scan every space mapping of every fork of
every file in the filesystem, and the filesystem cannot stop.
Therefore, rmap repair forgoes atomicity between scrub and repair.
It combines a :ref:`coordinated inode scanner <iscan>`, :ref:`live update hooks
<liveupdate>`, and an :ref:`in-memory rmap btree <xfbtree>` to complete the
scan for reverse mapping records.

1. Set up an xfbtree to stage rmap records.

2. While holding the locks on the AGI and AGF buffers acquired during the
   scrub, generate reverse mappings for all AG metadata: inodes, btrees, CoW
   staging extents, and the internal log.

3. Set up an inode scanner.

4. Hook into rmap updates for the AG being repaired so that the live scan data
   can receive updates to the rmap btree from the rest of the filesystem during
   the file scan.

5. For each space mapping found in either fork of each file scanned,
   decide if the mapping matches the AG of interest.
   If so:

   a. Create a btree cursor for the in-memory btree.

   b. Use the rmap code to add the record to the in-memory btree.

   c. Use the :ref:`special commit function <xfbtree_commit>` to write the
      xfbtree changes to the xfile.

6. For each live update received via the hook, decide if the owner has already
   been scanned.
   If so, apply the live update into the scan data:

   a. Create a btree cursor for the in-memory btree.

   b. Replay the operation into the in-memory btree.

   c. Use the :ref:`special commit function <xfbtree_commit>` to write the
      xfbtree changes to the xfile.
      This is performed with an empty transaction to avoid changing the
      caller's state.

7. When the inode scan finishes, create a new scrub transaction and relock the
   two AG headers.

8. Compute the new btree geometry using the number of rmap records in the
   shadow btree, like all other btree rebuilding functions.

9. Allocate the number of blocks computed in the previous step.

10. Perform the usual btree bulk loading and commit to install the new rmap
    btree.

11. Reap the old rmap btree blocks as discussed in the case study about how
    to :ref:`reap after rmap btree repair <rmap_reap>`.

12. Free the xfbtree now that it is not needed.

The proposed patchset is the
`rmap repair
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-rmap-btree>`_
series.

Staging Repairs with Temporary Files on Disk
--------------------------------------------

XFS stores a substantial amount of metadata in file forks: directories,
extended attributes, symbolic link targets, free space bitmaps and summary
information for the realtime volume, and quota records.
File forks map 64-bit logical file fork space extents to physical storage space
extents, similar to how a memory management unit maps 64-bit virtual addresses
to physical memory addresses.
Therefore, file-based tree structures (such as directories and extended
attributes) use blocks mapped in the file fork offset address space that point
to other blocks mapped within that same address space, and file-based linear
structures (such as bitmaps and quota records) compute array element offsets in
the file fork offset address space.

Because file forks can consume as much space as the entire filesystem, repairs
cannot be staged in memory, even when a paging scheme is available.
Therefore, online repair of file-based metadata creates a temporary file in
the XFS filesystem, writes a new structure at the correct offsets into the
temporary file, and atomically swaps the fork mappings (and hence the fork
contents) to commit the repair.
Once the repair is complete, the old fork can be reaped as necessary; if the
system goes down during the reap, the iunlink code will delete the blocks
during log recovery.

**Note**: All space usage and inode indices in the filesystem *must* be
consistent to use a temporary file safely!
This dependency is the reason why online repair can only use pageable kernel
memory to stage ondisk space usage information.

Swapping metadata extents with a temporary file requires the owner field of the
block headers to match the file being repaired and not the temporary file.
The directory, extended attribute, and symbolic link functions were all
modified to allow callers to specify owner numbers explicitly.

There is a downside to the reaping process -- if the system crashes during the
reap phase and the fork extents are crosslinked, the iunlink processing will
fail because freeing space will find the extra reverse mappings and abort.

Temporary files created for repair are similar to ``O_TMPFILE`` files created
by userspace.
They are not linked into a directory and the entire file will be reaped when
the last reference to the file is lost.
The key differences are that these files must have no access permission outside
the kernel at all, they must be specially marked to prevent them from being
opened by handle, and they must never be linked into the directory tree.

+--------------------------------------------------------------------------+
| **Historical Sidebar**:                                                  |
+--------------------------------------------------------------------------+
| In the initial iteration of file metadata repair, the damaged metadata   |
| blocks would be scanned for salvageable data; the extents in the file    |
| fork would be reaped; and then a new structure would be built in its     |
| place.                                                                   |
| This strategy did not survive the introduction of the atomic repair      |
| requirement expressed earlier in this document.                          |
|                                                                          |
| The second iteration explored building a second structure at a high      |
| offset in the fork from the salvage data, reaping the old extents, and   |
| using a ``COLLAPSE_RANGE`` operation to slide the new extents into       |
| place.                                                                   |
|                                                                          |
| This had many drawbacks:                                                 |
|                                                                          |
| - Array structures are linearly addressed, and the regular filesystem    |
|   codebase does not have the concept of a linear offset that could be    |
|   applied to the record offset computation to build an alternate copy.   |
|                                                                          |
| - Extended attributes are allowed to use the entire attr fork offset     |
|   address space.                                                         |
|                                                                          |
| - Even if repair could build an alternate copy of a data structure in a  |
|   different part of the fork address space, the atomic repair commit     |
|   requirement means that online repair would have to be able to perform  |
|   a log assisted ``COLLAPSE_RANGE`` operation to ensure that the old     |
|   structure was completely replaced.                                     |
|                                                                          |
| - A crash after construction of the secondary tree but before the range  |
|   collapse would leave unreachable blocks in the file fork.              |
|   This would likely confuse things further.                              |
|                                                                          |
| - Reaping blocks after a repair is not a simple operation, and           |
|   initiating a reap operation from a restarted range collapse operation  |
|   during log recovery is daunting.                                       |
|                                                                          |
| - Directory entry blocks and quota records record the file fork offset   |
|   in the header area of each block.                                      |
|   An atomic range collapse operation would have to rewrite this part of  |
|   each block header.                                                     |
|   Rewriting a single field in block headers is not a huge problem, but   |
|   it's something to be aware of.                                         |
|                                                                          |
| - Each block in a directory or extended attributes btree index contains  |
|   sibling and child block pointers.                                      |
|   Were the atomic commit to use a range collapse operation, each block   |
|   would have to be rewritten very carefully to preserve the graph        |
|   structure.                                                             |
|   Doing this as part of a range collapse means rewriting a large number  |
|   of blocks repeatedly, which is not conducive to quick repairs.         |
|                                                                          |
| This led to the introduction of temporary file staging.                  |
+--------------------------------------------------------------------------+

Using a Temporary File
``````````````````````

Online repair code should use the ``xrep_tempfile_create`` function to create a
temporary file inside the filesystem.
This allocates an inode, marks the in-core inode private, and attaches it to
the scrub context.
These files are hidden from userspace, may not be added to the directory tree,
and must be kept private.

Temporary files only use two inode locks: the IOLOCK and the ILOCK.
The MMAPLOCK is not needed here, because there must not be page faults from
userspace for data fork blocks.
The usage patterns of these two locks are the same as for any other XFS file --
access to file data is controlled via the IOLOCK, and access to file metadata
is controlled via the ILOCK.
Locking helpers are provided so that the temporary file and its lock state can
be cleaned up by the scrub context.
To comply with the nested locking strategy laid out in the :ref:`inode
locking<ilocking>` section, it is recommended that scrub functions use the
``xrep_tempfile_ilock*_nowait`` lock helpers.

Data can be written to a temporary file by two means:

1. ``xrep_tempfile_copyin`` can be used to set the contents of a regular
   temporary file from an xfile.

2. The regular directory, symbolic link, and extended attribute functions can
   be used to write to the temporary file.

Once a good copy of a data file has been constructed in a temporary file, it
must be conveyed to the file being repaired, which is the topic of the next
section.

The proposed patches are in the
`repair temporary files
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-tempfiles>`_
series.

Atomic Extent Swapping
----------------------

Once repair builds a temporary file with a new data structure written into
it, it must commit the new changes into the existing file.
It is not possible to swap the inumbers of two files, so instead the new
metadata must replace the old.
This suggests the need for the ability to swap extents, but the existing extent
swapping code used by the file defragmenting tool ``xfs_fsr`` is not sufficient
for online repair because:

a. When the reverse-mapping btree is enabled, the swap code must keep the
   reverse mapping information up to date with every exchange of mappings.
   Therefore, it can only exchange one mapping per transaction, and each
   transaction is independent.

b. Reverse-mapping is critical for the operation of online fsck, so the old
   defragmentation code (which swapped entire extent forks in a single
   operation) is not useful here.

c. Defragmentation is assumed to occur between two files with identical
   contents.
   For this use case, an incomplete exchange will not result in a user-visible
   change in file contents, even if the operation is interrupted.

d. Online repair needs to swap the contents of two files that are by definition
   *not* identical.
   For directory and xattr repairs, the user-visible contents might be the
   same, but the contents of individual blocks may be very different.

e. Old blocks in the file may be cross-linked with another structure and must
   not reappear if the system goes down mid-repair.

These problems are overcome by creating a new deferred operation and a new type
of log intent item to track the progress of an operation to exchange two file
ranges.
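
The value of an intent item that tracks progress can be illustrated with a toy
model.
This is a sketch under stated assumptions, not the XFS implementation: files
are plain arrays of block values, the "log" is a single saved copy of the
intent, and the names ``toy_intent``, ``toy_swap_step``, and ``toy_swap_run``
are hypothetical.
It shows only the idea that each step exchanges part of the range and records
how far it got, so that a restart from the last logged intent still completes
the whole exchange.

.. code-block:: c

    #include <assert.h>
    #include <stdint.h>

    /* Hypothetical progress record: what remains to be exchanged. */
    struct toy_intent {
        uint64_t startoff1;   /* next block offset in file 1 */
        uint64_t startoff2;   /* next block offset in file 2 */
        uint64_t blockcount;  /* blocks still to exchange */
    };

    /*
     * One "transaction": exchange up to @max_step blocks, then advance the
     * intent and save it to @log so that recovery knows where to resume.
     */
    static void toy_swap_step(int *file1, int *file2, struct toy_intent *it,
                              uint64_t max_step, struct toy_intent *log)
    {
        uint64_t n = it->blockcount < max_step ? it->blockcount : max_step;

        for (uint64_t i = 0; i < n; i++) {
            int tmp = file1[it->startoff1 + i];

            file1[it->startoff1 + i] = file2[it->startoff2 + i];
            file2[it->startoff2 + i] = tmp;
        }
        it->startoff1 += n;
        it->startoff2 += n;
        it->blockcount -= n;
        *log = *it;   /* the updated intent commits with the exchange */
    }

    /* Run at most @nsteps steps; return the number of blocks still pending. */
    static uint64_t toy_swap_run(int *file1, int *file2, struct toy_intent *it,
                                 uint64_t max_step, unsigned int nsteps,
                                 struct toy_intent *log)
    {
        while (it->blockcount > 0 && nsteps-- > 0)
            toy_swap_step(file1, file2, it, max_step, log);
        return it->blockcount;
    }

To simulate a crash in this model, stop ``toy_swap_run`` after a few steps,
discard the in-memory intent, and call it again with a copy of the saved log
record; the two arrays end up fully exchanged.
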
The new deferred operation type chains together the same transactions used by
the reverse-mapping extent swap code.
The new log item records the progress of the exchange to ensure that once an
exchange begins, it will always run to completion, even if there are
interruptions.
The new ``XFS_SB_FEAT_INCOMPAT_LOG_ATOMIC_SWAP`` log-incompatible feature flag
in the superblock protects these new log item records from being replayed on
old kernels.

The proposed patchset is the
`atomic extent swap
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=atomic-file-updates>`_
series.

+--------------------------------------------------------------------------+
| **Sidebar: Using Log-Incompatible Feature Flags**                        |
+--------------------------------------------------------------------------+
| Starting with XFS v5, the superblock contains a                          |
| ``sb_features_log_incompat`` field to indicate that the log contains     |
| records that might not be readable by all kernels that could mount this  |
| filesystem.                                                              |
| In short, log incompat features protect the log contents against kernels |
| that will not understand the contents.                                   |
| Unlike the other superblock feature bits, log incompat bits are          |
| ephemeral because an empty (clean) log does not need protection.         |
| The log cleans itself after its contents have been committed into the    |
| filesystem, either as part of an unmount or because the system is        |
| otherwise idle.                                                          |
| Because upper level code can be working on a transaction at the same     |
| time that the log cleans itself, it is necessary for upper level code to |
| communicate to the log when it is going to use a log incompatible        |
| feature.                                                                 |
|                                                                          |
| The log coordinates access to incompatible features through the use of   |
| one ``struct rw_semaphore`` for each feature.                            |
| The log cleaning code tries to take this rwsem in exclusive mode to      |
| clear the bit; if the lock attempt fails, the feature bit remains set.   |
| Filesystem code signals its intention to use a log incompat feature in a |
| transaction by calling ``xlog_use_incompat_feat``, which takes the rwsem |
| in shared mode.                                                          |
| The code supporting a log incompat feature should create wrapper         |
| functions to obtain the log feature and call                             |
| ``xfs_add_incompat_log_feature`` to set the feature bits in the primary  |
| superblock.                                                              |
| The superblock update is performed transactionally, so the wrapper to    |
| obtain log assistance must be called just prior to the creation of the   |
| transaction that uses the functionality.                                 |
| For a file operation, this step must happen after taking the IOLOCK      |
| and the MMAPLOCK, but before allocating the transaction.                 |
| When the transaction is complete, the ``xlog_drop_incompat_feat``        |
| function is called to release the feature.                               |
| The feature bit will not be cleared from the superblock until the log    |
| becomes clean.                                                           |
|                                                                          |
| Log-assisted extended attribute updates and atomic extent swaps both use |
| log incompat features and provide convenience wrappers around the        |
| functionality.                                                           |
+--------------------------------------------------------------------------+

Mechanics of an Atomic Extent Swap
``````````````````````````````````

Swapping entire file forks is a complex task.
The goal is to exchange all file fork mappings between two file fork offset
ranges.
There are likely to be many extent mappings in each fork, and the edges of
the mappings aren't necessarily aligned.
Furthermore, there may be other updates that need to happen after the swap,
such as exchanging file sizes, inode flags, or conversion of fork data to local
format.
This is roughly the format of the new deferred extent swap work item:

.. code-block:: c

    struct xfs_swapext_intent {
        /* Inodes participating in the operation. */
        struct xfs_inode    *sxi_ip1;
        struct xfs_inode    *sxi_ip2;

        /* File offset range information. */
        xfs_fileoff_t       sxi_startoff1;
        xfs_fileoff_t       sxi_startoff2;
        xfs_filblks_t       sxi_blockcount;

        /* Set these file sizes after the operation, unless negative. */
        xfs_fsize_t         sxi_isize1;
        xfs_fsize_t         sxi_isize2;

        /* XFS_SWAP_EXT_* log operation flags */
        uint64_t            sxi_flags;
    };

The new log intent item contains enough information to track two logical fork
offset ranges: ``(inode1, startoff1, blockcount)`` and ``(inode2, startoff2,
blockcount)``.
Each step of a swap operation exchanges the largest file range mapping possible
from one file to the other.
After each step in the swap operation, the two startoff fields are incremented
and the blockcount field is decremented to reflect the progress made.
The flags field captures behavioral parameters such as swapping the attr fork
instead of the data fork and other work to be done after the extent swap.
The two isize fields are used to swap the file size at the end of the operation
if the file data fork is the target of the swap operation.

When the extent swap is initiated, the sequence of operations is as follows:
1. Create a deferred work item for the extent swap.
   At the start, it should contain the entirety of the file ranges to be
   swapped.

2. Call ``xfs_defer_finish`` to process the exchange.
   This is encapsulated in ``xrep_tempswap_contents`` for scrub operations.
   This will log an extent swap intent item to the transaction for the deferred
   extent swap work item.

3. Until ``sxi_blockcount`` of the deferred extent swap work item is zero,

   a. Read the block maps of both file ranges starting at ``sxi_startoff1`` and
      ``sxi_startoff2``, respectively, and compute the longest extent that can
      be swapped in a single step.
      This is the minimum of the two ``br_blockcount`` values in the mappings.
      Keep advancing through the file forks until at least one of the mappings
      contains written blocks.
      Mutual holes, unwritten extents, and extent mappings to the same physical
      space are not exchanged.

      For the next few steps, this document will refer to the mapping that came
      from file 1 as "map1", and the mapping that came from file 2 as "map2".

   b. Create a deferred block mapping update to unmap map1 from file 1.

   c. Create a deferred block mapping update to unmap map2 from file 2.

   d. Create a deferred block mapping update to map map1 into file 2.

   e. Create a deferred block mapping update to map map2 into file 1.

   f. Log the block, quota, and extent count updates for both files.

   g. Extend the ondisk size of either file if necessary.

   h. Log an extent swap done log item for the extent swap intent log item
      that was read at the start of step 3.

   i. Compute the amount of file range that has just been covered.
      This quantity is ``(map1.br_startoff + map1.br_blockcount -
      sxi_startoff1)``, because step 3a could have skipped holes.

   j. Increase the starting offsets of ``sxi_startoff1`` and ``sxi_startoff2``
      by the number of blocks computed in the previous step, and decrease
      ``sxi_blockcount`` by the same quantity.
      This advances the cursor.

   k. Log a new extent swap intent log item reflecting the advanced state of
      the work item.

   l. Return the proper error code (EAGAIN) to the deferred operation manager
      to inform it that there is more work to be done.
      The operation manager completes the deferred work in steps 3b-3e before
      moving back to the start of step 3.

4. Perform any post-processing.
   This will be discussed in more detail in subsequent sections.

If the filesystem goes down in the middle of an operation, log recovery will
find the most recent unfinished extent swap log intent item and restart from
there.
This is how extent swapping guarantees that an outside observer will either see
the old broken structure or the new one, and never a mishmash of both.

Preparation for Extent Swapping
```````````````````````````````

There are a few things that need to be taken care of before initiating an
atomic extent swap operation.
First, regular files require the page cache to be flushed to disk before the
operation begins, and directio writes to be quiesced.
Like any filesystem operation, extent swapping must determine the maximum
amount of disk space and quota that can be consumed on behalf of both files in
the operation, and reserve that quantity of resources to avoid an unrecoverable
out of space failure once it starts dirtying metadata.
The preparation step scans the ranges of both files to estimate:

- Data device blocks needed to handle the repeated updates to the fork
  mappings.
- Change in data and realtime block counts for both files.
- Increase in quota usage for both files, if the two files do not share the
  same set of quota ids.
- The number of extent mappings that will be added to each file.
- Whether or not there are partially written realtime extents.
  User programs must never be able to access a realtime file extent that maps
  to different extents on the realtime volume, which could happen if the
  operation fails to run to completion.

The need for precise estimation increases the run time of the swap operation,
but it is very important to maintain correct accounting.
The filesystem must not run completely out of free space, nor can the extent
swap ever add more extent mappings to a fork than it can support.
Regular users are required to abide by the quota limits, though metadata
repairs may exceed quota to resolve inconsistent metadata elsewhere.

Special Features for Swapping Metadata File Extents
```````````````````````````````````````````````````

Extended attributes, symbolic links, and directories can set the fork format to
"local" and treat the fork as a literal area for data storage.
Metadata repairs must take extra steps to support these cases:

- If both forks are in local format and the fork areas are large enough, the
  swap is performed by copying the incore fork contents, logging both forks,
  and committing.
  The atomic extent swap mechanism is not necessary, since this can be done
  with a single transaction.

- If both forks map blocks, then the regular atomic extent swap is used.

- Otherwise, only one fork is in local format.
  The contents of the local format fork are converted to a block to perform the
  swap.
  The conversion to block format must be done in the same transaction that
  logs the initial extent swap intent log item.
  The regular atomic extent swap is used to exchange the mappings.
  Special flags are set on the swap operation so that the transaction can be
  rolled one more time to convert the second file's fork back to local format
  so that the second file will be ready to go as soon as the ILOCK is dropped.

Extended attributes and directories stamp the owning inode into every block,
but the buffer verifiers do not actually check the inode number!
Although there is no verification, it is still important to maintain
referential integrity, so prior to performing the extent swap, online repair
builds every block in the new data structure with the owner field of the file
being repaired.

After a successful swap operation, the repair operation must reap the old fork
blocks by processing each fork mapping through the standard :ref:`file extent
reaping <reaping>` mechanism that is done post-repair.
If the filesystem should go down during the reap part of the repair, the
iunlink processing at the end of recovery will free both the temporary file and
whatever blocks were not reaped.
However, this iunlink processing omits the cross-link detection of online
repair, and is not completely foolproof.

Swapping Temporary File Extents
```````````````````````````````

To repair a metadata file, online repair proceeds as follows:

1. Create a temporary repair file.

2. Use the staging data to write out new contents into the temporary repair
   file.
   Write to the same fork that is being repaired.

3. Commit the scrub transaction, since the swap estimation step must be
   completed before transaction reservations are made.

4. Call ``xrep_tempswap_trans_alloc`` to allocate a new scrub transaction with
   the appropriate resource reservations and locks, and to fill out a ``struct
   xfs_swapext_req`` with the details of the swap operation.

5. Call ``xrep_tempswap_contents`` to swap the contents.

6. Commit the transaction to complete the repair.

.. _rtsummary:

Case Study: Repairing the Realtime Summary File
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In the "realtime" section of an XFS filesystem, free space is tracked via a
bitmap, similar to Unix FFS.
Each bit in the bitmap represents one realtime extent, which is a multiple of
the filesystem block size between 4KiB and 1GiB in size.
The realtime summary file indexes the number of free extents of a given size to
the offset of the block within the realtime free space bitmap where those free
extents begin.
In other words, the summary file helps the allocator find free extents by
length, similar to what the free space by count (cntbt) btree does for the data
section.

The summary file itself is a flat file (with no block headers or checksums!)
partitioned into ``log2(total rt extents)`` sections containing enough 32-bit
counters to match the number of blocks in the rt bitmap.
Each counter records the number of free extents that start in that bitmap block
and can satisfy a power-of-two allocation request.

To check the summary file against the bitmap:

1. Take the ILOCK of both the realtime bitmap and summary files.

2. For each free space extent recorded in the bitmap:

   a. Compute the position in the summary file that contains a counter that
      represents this free extent.

   b. Read the counter from the xfile.

   c. Increment it, and write it back to the xfile.

3. Compare the contents of the xfile against the ondisk file.

To repair the summary file, write the xfile contents into the temporary file
and use atomic extent swap to commit the new contents.
The temporary file is then reaped.

The proposed patchset is the
`realtime summary repair
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-rtsummary>`_
series.

Case Study: Salvaging Extended Attributes
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In XFS, extended attributes are implemented as a namespaced name-value store.
Values are limited in size to 64KiB, but there is no limit on the number of
names.
The attribute fork is unpartitioned, which means that the root of the attribute
structure is always in logical block zero, but attribute leaf blocks, dabtree
index blocks, and remote value blocks are intermixed.
Attribute leaf blocks contain variable-sized records that associate
user-provided names with the user-provided values.
Values larger than a block are allocated separate extents and written there.
If the leaf information expands beyond a single block, a directory/attribute
btree (``dabtree``) is created to map hashes of attribute names to entries
for fast lookup.

Salvaging extended attributes is done as follows:

1. Walk the attr fork mappings of the file being repaired to find the attribute
   leaf blocks.
   When one is found,

   a. Walk the attr leaf block to find candidate keys.
      When one is found,

      1. Check the name for problems, and ignore the name if there are any.

      2. Retrieve the value.
         If that succeeds, add the name and value to the staging xfarray and
         xfblob.

2. If the memory usage of the xfarray and xfblob exceeds a certain threshold
   or there are no more attr fork blocks to examine, unlock the file and
   add the staged extended attributes to the temporary file.

3. Use atomic extent swapping to exchange the new and old extended attribute
   structures.
   The old attribute blocks are now attached to the temporary file.

4. Reap the temporary file.

The proposed patchset is the
`extended attribute repair
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-xattrs>`_
series.

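The stage-then-flush control flow in step 2 of the salvage procedure above can
be sketched in plain userspace C.
This is only an illustration under stated assumptions: ``struct xattr_stage``,
``xattr_stage_add``, ``xattr_stage_flush``, and ``STAGE_LIMIT`` are
hypothetical names invented for this sketch, not kernel interfaces.
The name check is deliberately trivial, and the "flush" merely counts records
instead of writing them to a temporary file.

.. code-block:: c

    #include <assert.h>
    #include <stddef.h>
    #include <string.h>

    /* Hypothetical staging budget in bytes; the real threshold is not stated. */
    #define STAGE_LIMIT 64

    struct xattr_stage {
        size_t bytes;    /* bytes of names and values currently staged */
        size_t staged;   /* records currently staged in memory */
        size_t flushed;  /* records already copied to the temporary file */
        size_t flushes;  /* number of flushes performed so far */
    };

    /* Stand-in for adding the staged attributes to the temporary file. */
    static void xattr_stage_flush(struct xattr_stage *st)
    {
        assert(st != NULL);
        if (st->staged == 0)
            return;
        st->flushed += st->staged;
        st->flushes++;
        st->staged = 0;
        st->bytes = 0;
    }

    /*
     * Stage one salvaged name/value pair.  Returns 0 if the name fails the
     * sanity check and is ignored, or 1 if the pair was staged.  Staging
     * past the budget triggers a flush.
     */
    static int xattr_stage_add(struct xattr_stage *st, const char *name,
                               size_t valuelen)
    {
        assert(st != NULL);
        if (name == NULL || name[0] == '\0')
            return 0;
        st->bytes += strlen(name) + valuelen;
        st->staged++;
        if (st->bytes > STAGE_LIMIT)
            xattr_stage_flush(st);
        return 1;
    }

A caller would invoke ``xattr_stage_add`` for every candidate found in a leaf
block and call ``xattr_stage_flush`` once more when no attr fork blocks remain,
which mirrors the two flush conditions named in step 2.
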
Fixing Directories
------------------

Fixing directories is difficult with currently available filesystem features,
since directory entries are not redundant.
The offline repair tool scans all inodes to find files with nonzero link count,
and then it scans all directories to establish parentage of those linked files.
Damaged files and directories are zapped, and files with no parent are
moved to the ``/lost+found`` directory.
It does not try to salvage anything.

The best that online repair can do at this time is to read directory data
blocks and salvage any dirents that look plausible, correct link counts, and
move orphans back into the directory tree.
The salvage process is discussed in the case study at the end of this section.
The :ref:`file link count fsck <nlinks>` code takes care of fixing link counts
and moving orphans to the ``/lost+found`` directory.

Case Study: Salvaging Directories
`````````````````````````````````

Unlike extended attributes, directory blocks are all the same size, so
salvaging directories is straightforward:

1. Find the parent of the directory.
   If the dotdot entry is readable, try to confirm that the alleged
   parent has a child entry pointing back to the directory being repaired.
   Otherwise, walk the filesystem to find it.

2. Walk the first partition of the data fork of the directory to find the
   directory entry data blocks.
   When one is found,

   a. Walk the directory data block to find candidate entries.
      When an entry is found:

      i. Check the name for problems, and ignore the name if there are any.

      ii. Retrieve the inumber and grab the inode.
          If that succeeds, add the name, inode number, and file type to the
          staging xfarray and xfblob.

3. If the memory usage of the xfarray and xfblob exceeds a certain threshold
   or there are no more directory data blocks to examine, unlock the
   directory and add the staged dirents into the temporary directory.
   Truncate the staging files.

4. Use atomic extent swapping to exchange the new and old directory structures.
   The old directory blocks are now attached to the temporary file.

5. Reap the temporary file.

**Future Work Question**: Should repair revalidate the dentry cache when
rebuilding a directory?

*Answer*: Yes, it should.

In theory it is necessary to scan all dentry cache entries for a directory to
ensure that one of the following applies:

1. The cached dentry reflects an ondisk dirent in the new directory.

2. The cached dentry no longer has a corresponding ondisk dirent in the new
   directory and the dentry can be purged from the cache.

3. The cached dentry no longer has an ondisk dirent but the dentry cannot be
   purged.
   This is the problem case.

Unfortunately, the current dentry cache design doesn't provide a means to walk
every child dentry of a specific directory, which makes this a hard problem.
There is no known solution.

The proposed patchset is the
`directory repair
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-dirs>`_
series.

Parent Pointers
```````````````

A parent pointer is a piece of file metadata that enables a user to locate the
file's parent directory without having to traverse the directory tree from the
root.
Without them, reconstruction of directory trees is hindered in much the same
way that the historic lack of reverse space mapping information once hindered
reconstruction of filesystem space metadata.
The parent pointer feature, however, makes total directory reconstruction
possible.

XFS parent pointers include the dirent name and location of the entry within
the parent directory.
In other words, child files use extended attributes to store pointers to
parents in the form ``(parent_inum, parent_gen, dirent_pos) → (dirent_name)``.
The directory checking process can be strengthened to ensure that the target of
each dirent also contains a parent pointer pointing back to the dirent.
Likewise, each parent pointer can be checked by ensuring that the target of
each parent pointer is a directory and that it contains a dirent matching
the parent pointer.
Both online and offline repair can use this strategy.

**Note**: The ondisk format of parent pointers is not yet finalized.

+--------------------------------------------------------------------------+
| **Historical Sidebar**:                                                  |
+--------------------------------------------------------------------------+
| Directory parent pointers were first proposed as an XFS feature more     |
| than a decade ago by SGI.                                                |
| Each link from a parent directory to a child file is mirrored with an    |
| extended attribute in the child that could be used to identify the       |
| parent directory.                                                        |
| Unfortunately, this early implementation had major shortcomings and was  |
| never merged into Linux XFS:                                             |
|                                                                          |
| 1. The XFS codebase of the late 2000s did not have the infrastructure to |
|    enforce strong referential integrity in the directory tree.           |
|    It did not guarantee that a change in a forward link would always be  |
|    followed up with the corresponding change to the reverse links.       |
|                                                                          |
| 2. Referential integrity was not integrated into offline repair.         |
|    Checking and repairs were performed on mounted filesystems without    |
|    taking any kernel or inode locks to coordinate access.                |
|    It is not clear how this actually worked properly.                    |
|                                                                          |
| 3. The extended attribute did not record the name of the directory entry |
|    in the parent, so the SGI parent pointer implementation cannot be     |
|    used to reconnect the directory tree.                                 |
|                                                                          |
| 4. Extended attribute forks only support 65,536 extents, which means     |
|    that parent pointer attribute creation is likely to fail at some      |
|    point before the maximum file link count is achieved.                 |
|                                                                          |
| The original parent pointer design was too unstable for something like   |
| a file system repair to depend on.                                       |
| Allison Henderson, Chandan Babu, and Catherine Hoang are working on a    |
| second implementation that solves all shortcomings of the first.         |
| During 2022, Allison introduced log intent items to track physical       |
| manipulations of the extended attribute structures.                      |
| This solves the referential integrity problem by making it possible to   |
| commit a dirent update and a parent pointer update in the same           |
| transaction.                                                             |
| Chandan increased the maximum extent counts of both data and attribute   |
| forks, thereby ensuring that the extended attribute structure can grow   |
| to handle the maximum hardlink count of any file.                        |
+--------------------------------------------------------------------------+

Case Study: Repairing Directories with Parent Pointers
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Directory rebuilding uses a :ref:`coordinated inode scan <iscan>` and
a :ref:`directory entry live update hook <liveupdate>` as follows:

1. Set up a temporary directory for generating the new directory structure,
   an xfblob for storing entry names, and an xfarray for stashing directory
   updates.

2. Set up an inode scanner and hook into the directory entry code to receive
   updates on directory operations.

3. For each parent pointer found in each file scanned, decide if the parent
   pointer references the directory of interest.
   If so:

   a. Stash an addname entry for this dirent in the xfarray for later.

   b. When finished scanning that file, flush the stashed updates to the
      temporary directory.

4. For each live directory update received via the hook, decide if the child
   has already been scanned.
   If so:

   a. Stash an addname or removename entry for this dirent update in the
      xfarray for later.
      We cannot write directly to the temporary directory because hook
      functions are not allowed to modify filesystem metadata.
      Instead, we stash updates in the xfarray and rely on the scanner thread
      to apply the stashed updates to the temporary directory.

5. When the scan is complete, atomically swap the contents of the temporary
   directory and the directory being repaired.
   The temporary directory now contains the damaged directory structure.

6. Reap the temporary directory.

7. Update the dirent position field of parent pointers as necessary.
   This may require the queuing of a substantial number of xattr log intent
   items.

The proposed patchset is the
Wong`parent pointers directory repair 4565*a26aa252SDarrick J. Wong<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=pptrs-online-dir-repair>`_ 4566*a26aa252SDarrick J. Wongseries. 4567*a26aa252SDarrick J. Wong 4568*a26aa252SDarrick J. Wong**Unresolved Question**: How will repair ensure that the ``dirent_pos`` fields 4569*a26aa252SDarrick J. Wongmatch in the reconstructed directory? 4570*a26aa252SDarrick J. Wong 4571*a26aa252SDarrick J. Wong*Answer*: There are a few ways to solve this problem: 4572*a26aa252SDarrick J. Wong 4573*a26aa252SDarrick J. Wong1. The field could be designated advisory, since the other three values are 4574*a26aa252SDarrick J. Wong sufficient to find the entry in the parent. 4575*a26aa252SDarrick J. Wong However, this makes indexed key lookup impossible while repairs are ongoing. 4576*a26aa252SDarrick J. Wong 4577*a26aa252SDarrick J. Wong2. We could allow creating directory entries at specified offsets, which solves 4578*a26aa252SDarrick J. Wong the referential integrity problem but runs the risk that dirent creation 4579*a26aa252SDarrick J. Wong will fail due to conflicts with the free space in the directory. 4580*a26aa252SDarrick J. Wong 4581*a26aa252SDarrick J. Wong These conflicts could be resolved by appending the directory entry and 4582*a26aa252SDarrick J. Wong amending the xattr code to support updating an xattr key and reindexing the 4583*a26aa252SDarrick J. Wong dabtree, though this would have to be performed with the parent directory 4584*a26aa252SDarrick J. Wong still locked. 4585*a26aa252SDarrick J. Wong 4586*a26aa252SDarrick J. Wong3. Same as above, but remove the old parent pointer entry and add a new one 4587*a26aa252SDarrick J. Wong atomically. 4588*a26aa252SDarrick J. Wong 4589*a26aa252SDarrick J. Wong4. Change the ondisk xattr format to ``(parent_inum, name) → (parent_gen)``, 4590*a26aa252SDarrick J. 
Wong which would provide the attr name uniqueness that we require, without 4591*a26aa252SDarrick J. Wong forcing repair code to update the dirent position. 4592*a26aa252SDarrick J. Wong Unfortunately, this requires changes to the xattr code to support attr 4593*a26aa252SDarrick J. Wong names as long as 263 bytes. 4594*a26aa252SDarrick J. Wong 4595*a26aa252SDarrick J. Wong5. Change the ondisk xattr format to ``(parent_inum, hash(name)) → 4596*a26aa252SDarrick J. Wong (name, parent_gen)``. 4597*a26aa252SDarrick J. Wong If the hash is sufficiently resistant to collisions (e.g. sha256) then 4598*a26aa252SDarrick J. Wong this should provide the attr name uniqueness that we require. 4599*a26aa252SDarrick J. Wong Names shorter than 247 bytes could be stored directly. 4600*a26aa252SDarrick J. Wong 4601*a26aa252SDarrick J. WongDiscussion is ongoing under the `parent pointers patch deluge 4602*a26aa252SDarrick J. Wong<https://www.spinics.net/lists/linux-xfs/msg69397.html>`_. 4603*a26aa252SDarrick J. Wong 4604*a26aa252SDarrick J. WongCase Study: Repairing Parent Pointers 4605*a26aa252SDarrick J. Wong^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 4606*a26aa252SDarrick J. Wong 4607*a26aa252SDarrick J. WongOnline reconstruction of a file's parent pointer information works similarly to 4608*a26aa252SDarrick J. Wongdirectory reconstruction: 4609*a26aa252SDarrick J. Wong 4610*a26aa252SDarrick J. Wong1. Set up a temporary file for generating a new extended attribute structure, 4611*a26aa252SDarrick J. Wong an `xfblob<xfblob>` for storing parent pointer names, and an xfarray for 4612*a26aa252SDarrick J. Wong stashing parent pointer updates. 4613*a26aa252SDarrick J. Wong 4614*a26aa252SDarrick J. Wong2. Set up an inode scanner and hook into the directory entry code to receive 4615*a26aa252SDarrick J. Wong updates on directory operations. 4616*a26aa252SDarrick J. Wong 4617*a26aa252SDarrick J. Wong3. For each directory entry found in each directory scanned, decide if the 4618*a26aa252SDarrick J. 
   dirent references the file of interest.
   If so:

   a. Stash an addpptr entry for this parent pointer in the xfblob and xfarray
      for later.

   b. When finished scanning the directory, flush the stashed updates to the
      temporary file.

4. For each live directory update received via the hook, decide if the parent
   has already been scanned.
   If so:

   a. Stash an addpptr or removepptr entry for this dirent update in the
      xfarray for later.
      We cannot write parent pointers directly to the temporary file because
      hook functions are not allowed to modify filesystem metadata.
      Instead, we stash updates in the xfarray and rely on the scanner thread
      to apply the stashed parent pointer updates to the temporary file.

5. Copy all non-parent pointer extended attributes to the temporary file.

6. When the scan is complete, atomically swap the attribute forks of the
   temporary file and the file being repaired.
   The temporary file now contains the damaged extended attribute structure.

7. Reap the temporary file.

The proposed patchset is the
`parent pointers repair
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=pptrs-online-parent-repair>`_
series.

Digression: Offline Checking of Parent Pointers
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Examining parent pointers in offline repair works differently because corrupt
files are erased long before directory tree connectivity checks are performed.
Parent pointer checks are therefore a second pass to be added to the existing
connectivity checks:

1. After the set of surviving files has been established (i.e. phase 6),
   walk the surviving directories of each AG in the filesystem.
   This is already performed as part of the connectivity checks.

2. For each directory entry found, record the name in an xfblob, and store
   ``(child_ag_inum, parent_inum, parent_gen, dirent_pos)`` tuples in a
   per-AG in-memory slab.

3. For each AG in the filesystem,

   a. Sort the per-AG tuples in order of child_ag_inum, parent_inum, and
      dirent_pos.

   b. For each inode in the AG,

      1. Scan the inode for parent pointers.
         Record the names in a per-file xfblob, and store ``(parent_inum,
         parent_gen, dirent_pos)`` tuples in a per-file slab.

      2. Sort the per-file tuples in order of parent_inum and dirent_pos.

      3. Position one slab cursor at the start of the inode's records in the
         per-AG tuple slab.
         This should be trivial since the per-AG tuples are in child inumber
         order.

      4. Position a second slab cursor at the start of the per-file tuple slab.

      5. Iterate the two cursors in lockstep, comparing the parent_inum and
         dirent_pos fields of the records under each cursor.

         a. Tuples in the per-AG list but not the per-file list are missing and
            need to be written to the inode.

         b. Tuples in the per-file list but not the per-AG list are dangling
            and need to be removed from the inode.

         c. For tuples in both lists, update the parent_gen and name components
            of the parent pointer if necessary.

4. Move on to examining link counts, as we do today.

The proposed patchset is the
`offline parent pointers repair
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=pptrs-repair>`_
series.

Rebuilding directories from parent pointers in offline repair is very
challenging because it currently uses a single-pass scan of the filesystem
during phase 3 to decide which files are corrupt enough to be zapped.
This scan would have to be converted into a multi-pass scan:

1. The first pass of the scan zaps corrupt inodes, forks, and attributes
   much as it does now.
   Corrupt directories are noted but not zapped.

2. The next pass records parent pointers pointing to the directories noted
   as being corrupt in the first pass.
   This second pass may have to happen after the phase 4 scan for duplicate
   blocks, if phase 4 is also capable of zapping directories.

3. The third pass resets corrupt directories to an empty shortform directory.
   Free space metadata has not been ensured yet, so repair cannot yet use the
   directory building code in libxfs.

4. By the start of phase 6, the space metadata has been rebuilt.
   Use the parent pointer information recorded during step 2 to reconstruct
   the dirents and add them to the now-empty directories.

This code has not yet been constructed.

.. _orphanage:

The Orphanage
-------------

Filesystems present files as a directed, and hopefully acyclic, graph.
In other words, a tree.
The root of the filesystem is a directory, and each entry in a directory points
downwards either to more subdirectories or to non-directory files.
Unfortunately, a disruption in the directory graph pointers results in a
disconnected graph, which makes files impossible to access via regular path
resolution.

Without parent pointers, the directory parent pointer online scrub code can
detect a dotdot entry pointing to a parent directory that doesn't have a link
back to the child directory, and the file link count checker can detect a file
that isn't pointed to by any directory in the filesystem.
If such a file has a positive link count, the file is an orphan.

With parent pointers, directories can be rebuilt by scanning parent pointers
and parent pointers can be rebuilt by scanning directories.
This should reduce the incidence of files ending up in ``/lost+found``.

When orphans are found, they should be reconnected to the directory tree.
Offline fsck solves the problem by creating a directory ``/lost+found`` to
serve as an orphanage, and linking orphan files into the orphanage by using the
inumber as the name.
Reparenting a file to the orphanage does not reset any of its permissions or
ACLs.

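A minimal userspace sketch of this naming convention follows.
The helper ``orphan_name`` and the numeric suffix used to retry after a name
collision are illustrative assumptions, not code taken from offline fsck:

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: format the orphanage name for an inode.
 * The first attempt is the bare inumber in decimal; later attempts
 * append a numeric suffix (an assumed convention) so that a caller
 * can retry if the name is already taken.
 * Returns the snprintf() result, i.e. the length the name requires.
 */
int orphan_name(char *buf, size_t len, uint64_t ino, unsigned int attempt)
{
	if (attempt == 0)
		return snprintf(buf, len, "%llu", (unsigned long long)ino);
	return snprintf(buf, len, "%llu.%u", (unsigned long long)ino, attempt);
}
```

For inumber 133, the first candidate name is ``133``; a retry with
``attempt`` set to 1 yields ``133.1``.
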
This process is more involved in the kernel than it is in userspace.
The directory and file link count repair setup functions must use the regular
VFS mechanisms to create the orphanage directory with all the necessary
security attributes and dentry cache entries, just like a regular directory
tree modification.

Orphaned files are adopted by the orphanage as follows:

1. Call ``xrep_orphanage_try_create`` at the start of the scrub setup function
   to try to ensure that the lost and found directory actually exists.
   This also attaches the orphanage directory to the scrub context.

2. If the decision is made to reconnect a file, take the IOLOCK of both the
   orphanage and the file being reattached.
   The ``xrep_orphanage_iolock_two`` function follows the inode locking
   strategy discussed earlier.

3. Call ``xrep_orphanage_compute_blkres`` and ``xrep_orphanage_compute_name``
   to compute the block reservation required and the new name in the orphanage.

4. Use ``xrep_orphanage_adoption_prep`` to reserve resources for the repair
   transaction.

5. Call ``xrep_orphanage_adopt`` to reparent the orphaned file into the lost
   and found, and update the kernel dentry cache.

The proposed patches are in the
`orphanage adoption
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-orphanage>`_
series.
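
The order of the adoption steps can be modeled in userspace with the sketch
below.
Every ``model_*`` type and function is hypothetical and only mirrors the
sequence of the ``xrep_orphanage_*`` calls; the kernel signatures, the
deadlock-avoiding lock acquisition, and the real reservation arithmetic are
not reproduced.
The name is assumed to be the decimal inumber, as in offline fsck.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Toy stand-ins for kernel objects; none of these are real XFS types. */
struct model_inode {
	uint64_t	ino;
	bool		iolocked;
	uint64_t	parent;		/* 0 means "no parent" */
};

struct model_adoption {
	struct model_inode	*orphanage;	/* attached in step 1 */
	char			name[32];	/* computed in step 3 */
	unsigned int		blkres;		/* computed in step 3 */
	bool			prepared;	/* set in step 4 */
};

/* Step 1: make sure the orphanage exists and attach it to the context. */
int model_try_create(struct model_adoption *ad, struct model_inode *lost_found)
{
	if (!lost_found)
		return -1;
	ad->orphanage = lost_found;
	return 0;
}

/* Step 2: take both IOLOCKs. This toy model has no contention to handle. */
void model_iolock_two(struct model_inode *orphanage, struct model_inode *child)
{
	orphanage->iolocked = true;
	child->iolocked = true;
}

/* Step 3: compute the block reservation and the new name. */
void model_compute(struct model_adoption *ad, const struct model_inode *child)
{
	ad->blkres = 1;	/* placeholder, not a real reservation formula */
	snprintf(ad->name, sizeof(ad->name), "%llu",
		 (unsigned long long)child->ino);
}

/* Step 4: reserve resources; refuse unless both inodes are locked. */
int model_adoption_prep(struct model_adoption *ad,
			const struct model_inode *child)
{
	if (!ad->orphanage->iolocked || !child->iolocked)
		return -1;
	ad->prepared = true;
	return 0;
}

/* Step 5: reparent the child into the orphanage. */
int model_adopt(struct model_adoption *ad, struct model_inode *child)
{
	if (!ad->prepared)
		return -1;
	child->parent = ad->orphanage->ino;
	return 0;
}

/* Run all five steps in the documented order. */
int model_adopt_orphan(struct model_inode *lost_found,
		       struct model_inode *child)
{
	struct model_adoption ad = { 0 };

	if (model_try_create(&ad, lost_found))
		return -1;
	model_iolock_two(ad.orphanage, child);
	model_compute(&ad, child);
	if (model_adoption_prep(&ad, child))
		return -1;
	return model_adopt(&ad, child);
}
```

Calling ``model_adopt_orphan`` with an orphanage inode and an orphan leaves
the orphan's ``parent`` field set to the orphanage inumber; passing a null
orphanage fails at step 1 before any lock is taken.
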