.. SPDX-License-Identifier: GPL-2.0
.. _xfs_online_fsck_design:

..
        Mapping of heading styles within this document:
        Heading 1 uses "====" above and below
        Heading 2 uses "===="
        Heading 3 uses "----"
        Heading 4 uses "````"
        Heading 5 uses "^^^^"
        Heading 6 uses "~~~~"
        Heading 7 uses "...."

        Sections are manually numbered because apparently that's what everyone
        does in the kernel.

======================
XFS Online Fsck Design
======================

This document captures the design of the online filesystem check feature for
XFS.
The purpose of this document is threefold:

- To help kernel distributors understand exactly what the XFS online fsck
  feature is, and issues about which they should be aware.

- To help people reading the code to familiarize themselves with the relevant
  concepts and design points before they start digging into the code.

- To help developers maintaining the system by capturing the reasons
  supporting higher level decision making.

As the online fsck code is merged, the links in this document to topic branches
will be replaced with links to code.

This document is licensed under the terms of the GNU General Public License,
v2.
The primary author is Darrick J. Wong.

This design document is split into seven parts.
Part 1 defines what fsck tools are and the motivations for writing a new one.
Parts 2 and 3 present a high level overview of how the online fsck process
works and how it is tested to ensure correct functionality.
Part 4 discusses the user interface and the intended usage modes of the new
program.
Parts 5 and 6 show off the high level components and how they fit together, and
then present case studies of how each repair function actually works.
Part 7 sums up what has been discussed so far and speculates about what else
might be built atop online fsck.

.. contents:: Table of Contents
   :local:

1. What is a Filesystem Check?
==============================

A Unix filesystem has four main responsibilities:

- Provide a hierarchy of names through which application programs can associate
  arbitrary blobs of data for any length of time,

- Virtualize physical storage media across those names,

- Retrieve the named data blobs at any time, and

- Examine resource usage.

Metadata directly supporting these functions (e.g. files, directories, space
mappings) are sometimes called primary metadata.
Secondary metadata (e.g. reverse mapping and directory parent pointers) support
operations internal to the filesystem, such as internal consistency checking
and reorganization.
Summary metadata, as the name implies, condense information contained in
primary metadata for performance reasons.

The filesystem check (fsck) tool examines all the metadata in a filesystem
to look for errors.
In addition to looking for obvious metadata corruptions, fsck also
cross-references different types of metadata records with each other to look
for inconsistencies.
People do not like losing data, so most fsck tools also contain some ability
to correct any problems found.
As a word of caution -- the primary goal of most Linux fsck tools is to restore
the filesystem metadata to a consistent state, not to maximize the data
recovered.
That precedent will not be challenged here.

Filesystems of the 20th century generally lacked any redundancy in the ondisk
format, which meant that fsck could only respond to errors by erasing files
until errors were no longer detected.
More recent filesystem designs contain enough redundancy in their metadata that
it is now possible to regenerate data structures when non-catastrophic errors
occur; this capability serves both goals: restoring metadata consistency and
preserving as much data as possible.

+--------------------------------------------------------------------------+
| **Note**:                                                                |
+--------------------------------------------------------------------------+
| System administrators avoid data loss by increasing the number of        |
| separate storage systems through the creation of backups; and they avoid |
| downtime by increasing the redundancy of each storage system through the |
| creation of RAID arrays.                                                 |
| fsck tools address only the first problem.                               |
+--------------------------------------------------------------------------+

TLDR; Show Me the Code!
-----------------------

Code is posted to the kernel.org git trees as follows:
`kernel changes <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-symlink>`_,
`userspace changes <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-media-scan-service>`_, and
`QA test changes <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=repair-dirs>`_.
Each kernel patchset adding an online repair function will use the same branch
name across the kernel, xfsprogs, and fstests git repos.

Existing Tools
--------------

The online fsck tool described here will be the third tool in the history of
XFS (on Linux) to check and repair filesystems.
Two programs precede it:

The first program, ``xfs_check``, was created as part of the XFS debugger
(``xfs_db``) and can only be used with unmounted filesystems.
It walks all metadata in the filesystem looking for inconsistencies, though it
lacks any ability to repair what it finds.
Due to its high memory requirements and inability to repair things, this
program is now deprecated and will not be discussed further.

The second program, ``xfs_repair``, was created to be faster and more robust
than the first program.
Like its predecessor, it can only be used with unmounted filesystems.
It uses extent-based in-memory data structures to reduce memory consumption,
and tries to schedule readahead IO appropriately to reduce IO waiting time
while it scans the metadata of the entire filesystem.
The most important feature of this tool is its ability to respond to
inconsistencies in file metadata and the directory tree by erasing things as
needed to eliminate problems.
Space usage metadata are rebuilt from the observed file metadata.

Problem Statement
-----------------

The current XFS tools leave several problems unsolved:

1. **User programs** suddenly **lose access** to the filesystem when unexpected
   shutdowns occur as a result of silent corruptions in the metadata.
   These occur **unpredictably** and often without warning.

2. **Users** experience a **total loss of service** during the recovery period
   after an **unexpected shutdown** occurs.

3. **Users** experience a **total loss of service** if the filesystem is taken
   offline to **look for problems** proactively.

4. **Data owners** cannot **check the integrity** of their stored data without
   reading all of it.
   This may expose them to substantial billing costs when a linear media scan
   performed by the storage system administrator might suffice.

5. **System administrators** cannot **schedule** a maintenance window to deal
   with corruptions if they **lack the means** to assess filesystem health
   while the filesystem is online.

6. **Fleet monitoring tools** cannot **automate periodic checks** of filesystem
   health when doing so requires **manual intervention** and downtime.

7. **Users** can be tricked into **doing things they do not desire** when
   malicious actors **exploit quirks of Unicode** to place misleading names
   in directories.

Given this definition of the problems to be solved and the actors who would
benefit, the proposed solution is a third fsck tool that acts on a running
filesystem.

This new third program has three components: an in-kernel facility to check
metadata, an in-kernel facility to repair metadata, and a userspace driver
program to drive fsck activity on a live filesystem.
``xfs_scrub`` is the name of the driver program.
The rest of this document presents the goals and use cases of the new fsck
tool, describes its major design points in connection to those goals, and
discusses the similarities and differences with existing tools.

+--------------------------------------------------------------------------+
| **Note**:                                                                |
+--------------------------------------------------------------------------+
| Throughout this document, the existing offline fsck tool can also be     |
| referred to by its current name "``xfs_repair``".                        |
| The userspace driver program for the new online fsck tool can be         |
| referred to as "``xfs_scrub``".                                          |
| The kernel portion of online fsck that validates metadata is called      |
| "online scrub", and the portion of the kernel that fixes metadata is     |
| called "online repair".                                                  |
+--------------------------------------------------------------------------+

The naming hierarchy is broken up into objects known as directories and files
and the physical space is split into pieces known as allocation groups.
Sharding enables better performance on highly parallel systems and helps to
contain the damage when corruptions occur.
The division of the filesystem into principal objects (allocation groups and
inodes) means that there are ample opportunities to perform targeted checks and
repairs on a subset of the filesystem.

While a targeted check or repair is under way, other parts of the filesystem
continue processing IO requests.
Even if a piece of filesystem metadata can only be regenerated by scanning the
entire system, the scan can still be done in the background while other file
operations continue.

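The isolation that sharding provides can be modelled in a few lines of Python.
This is a minimal sketch rather than kernel code: ``AllocationGroup``,
``scrub_one_group``, and ``busy_groups`` are hypothetical names invented for
illustration, and a plain mutex stands in for the real per-group locking.

```python
import threading


class AllocationGroup:
    """Hypothetical stand-in for one shard of the filesystem."""

    def __init__(self, agno):
        self.agno = agno
        self.lock = threading.Lock()


def scrub_one_group(groups, agno, probe):
    """Hold only the lock of the group being checked, then call probe()."""
    with groups[agno].lock:
        return probe()


def busy_groups(groups):
    """Report the numbers of the groups that are locked right now."""
    busy = []
    for ag in groups:
        if ag.lock.acquire(blocking=False):
            ag.lock.release()       # free: other work could proceed here
        else:
            busy.append(ag.agno)    # held by the scrub in progress
    return busy


groups = [AllocationGroup(agno) for agno in range(4)]
# While group 2 is being scrubbed, ask which groups are unavailable.
busy_during_scrub = scrub_one_group(groups, 2, lambda: busy_groups(groups))
busy_afterwards = busy_groups(groups)
```

In this model only the group under examination is reported as busy; every
other group remains available to ordinary IO.
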
In summary, online fsck takes advantage of resource sharding and redundant
metadata to enable targeted checking and repair operations while the system
is running.
This capability will be coupled to automatic system management so that
autonomous self-healing of XFS maximizes service availability.

2. Theory of Operation
======================

Because it is necessary for online fsck to lock and scan live metadata objects,
online fsck consists of three separate code components.
The first is the userspace driver program ``xfs_scrub``, which is responsible
for identifying individual metadata items, scheduling work items for them,
reacting to the outcomes appropriately, and reporting results to the system
administrator.
The second and third are in the kernel, which implements functions to check
and repair each type of online fsck work item.

+------------------------------------------------------------------+
| **Note**:                                                        |
+------------------------------------------------------------------+
| For brevity, this document shortens the phrase "online fsck work |
| item" to "scrub item".                                           |
+------------------------------------------------------------------+

Scrub item types are delineated in a manner consistent with the Unix design
philosophy, which is to say that each item should handle one aspect of a
metadata structure, and handle it well.

Scope
-----

In principle, online fsck should be able to check and to repair everything that
the offline fsck program can handle.
However, online fsck cannot be running 100% of the time, which means that
latent errors may creep in after a scrub completes.
If these errors cause the next mount to fail, offline fsck is the only
solution.
This limitation means that maintenance of the offline fsck tool will continue.
A second limitation of online fsck is that it must follow the same resource
sharing and lock acquisition rules as the regular filesystem.
This means that scrub cannot take *any* shortcuts to save time, because doing
so could lead to concurrency problems.
In other words, online fsck is not a complete replacement for offline fsck, and
a complete run of online fsck may take longer than offline fsck.
However, both of these limitations are acceptable tradeoffs to satisfy the
different motivations of online fsck, which are to **minimize system downtime**
and to **increase predictability of operation**.

.. _scrubphases:

Phases of Work
--------------

The userspace driver program ``xfs_scrub`` splits the work of checking and
repairing an entire filesystem into seven phases.
Each phase concentrates on checking specific types of scrub items and depends
on the success of all previous phases.
The seven phases are as follows:

1. Collect geometry information about the mounted filesystem and computer,
   discover the online fsck capabilities of the kernel, and open the
   underlying storage devices.

2. Check allocation group metadata, all realtime volume metadata, and all quota
   files.
   Each metadata structure is scheduled as a separate scrub item.
   If corruption is found in the inode header or inode btree and ``xfs_scrub``
   is permitted to perform repairs, then those scrub items are repaired to
   prepare for phase 3.
   Repairs are implemented by using the information in the scrub item to
   resubmit the kernel scrub call with the repair flag enabled; this is
   discussed in the next section.
   Optimizations and all other repairs are deferred to phase 4.

3. Check all metadata of every file in the filesystem.
   Each metadata structure is also scheduled as a separate scrub item.
   If repairs are needed and ``xfs_scrub`` is permitted to perform repairs,
   and there were no problems detected during phase 2, then those scrub items
   are repaired immediately.
   Optimizations, deferred repairs, and unsuccessful repairs are deferred to
   phase 4.

4. All remaining repairs and scheduled optimizations are performed during this
   phase, if the caller permits them.
   Before starting repairs, the summary counters are checked and any necessary
   repairs are performed so that subsequent repairs will not fail the resource
   reservation step due to wildly incorrect summary counters.
   Unsuccessful repairs are requeued as long as forward progress on repairs is
   made somewhere in the filesystem.
   Free space in the filesystem is trimmed at the end of phase 4 if the
   filesystem is clean.

5. By the start of this phase, all primary and secondary filesystem metadata
   must be correct.
   Summary counters such as the free space counts and quota resource counts
   are checked and corrected.
   Directory entry names and extended attribute names are checked for
   suspicious entries such as control characters or confusing Unicode sequences
   appearing in names.

6. If the caller asks for a media scan, read all allocated and written data
   file extents in the filesystem.
   The ability to use hardware-assisted data file integrity checking is new
   to online fsck; neither of the previous tools has this capability.
   If media errors occur, they will be mapped to the owning files and reported.

7. Re-check the summary counters and present the caller with a summary of
   space usage and file counts.

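The requeueing rule of phase 4 can be sketched as a loop that keeps retrying
failed repairs for as long as each pass over the queue fixes at least one item.
The Python below is an illustrative model, not the ``xfs_scrub`` source:
``run_repair_queue``, the item names, and the dependency table are assumptions
made up for this example.

```python
from collections import deque


def run_repair_queue(items, try_repair):
    """Retry failed repairs for as long as a full pass fixes something."""
    pending = deque(items)
    repaired = []
    while pending:
        progress = False
        # Visit every item that was pending at the start of this pass.
        for _ in range(len(pending)):
            item = pending.popleft()
            if try_repair(item, repaired):
                repaired.append(item)
                progress = True
            else:
                pending.append(item)    # requeue for a later pass
        if not progress:
            break                       # no forward progress anywhere: stop
    return repaired, list(pending)


# Hypothetical dependencies: a repair succeeds only once everything it
# relies on has already been repaired.
needs = {
    "directory": {"inode btree"},
    "inode btree": set(),
    "quota": {"unrecoverable structure"},
}


def try_repair(item, repaired):
    return needs[item] <= set(repaired)


fixed, unfixed = run_repair_queue(["directory", "inode btree", "quota"],
                                  try_repair)
```

Here the directory repair fails on the first pass, succeeds on the second once
the inode btree has been rebuilt, and the loop stops after a third pass in
which nothing more could be fixed.
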
Steps for Each Scrub Item
-------------------------

The kernel scrub code uses a three-step strategy for checking and repairing
the one aspect of a metadata object represented by a scrub item:

1. The scrub item of interest is checked for corruptions, for opportunities
   for optimization, and for values that are directly controlled by the system
   administrator but look suspicious.
   If the item is neither corrupt nor in need of optimization, resources are
   released and the positive scan results are returned to userspace.
   If the item is corrupt or could be optimized but the caller does not permit
   this, resources are released and the negative scan results are returned to
   userspace.
   Otherwise, the kernel moves on to the second step.

2. The repair function is called to rebuild the data structure.
   Repair functions generally choose to rebuild a structure from other metadata
   rather than try to salvage the existing structure.
   If the repair fails, the scan results from the first step are returned to
   userspace.
   Otherwise, the kernel moves on to the third step.

3. In the third step, the kernel runs the same checks over the new metadata
   item to assess the efficacy of the repairs.
   The results of the reassessment are returned to userspace.

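The control flow of these three steps can be summarized in a short Python
model.
It is a sketch under stated assumptions: ``ScanResult``, ``scrub_item``,
``check_index``, and ``rebuild_index`` are hypothetical names, and the "index"
being repaired is simply a sorted copy of a list of records.

```python
from dataclasses import dataclass


@dataclass
class ScanResult:
    """Outcome of one check pass (hypothetical, for illustration only)."""
    corrupt: bool = False
    could_optimize: bool = False


def scrub_item(item, check, repair, allow_repair):
    """Check one scrub item, optionally repair it, then check it again."""
    # Step 1: look for corruption and for optimization opportunities.
    first = check(item)
    if not (first.corrupt or first.could_optimize):
        return first            # nothing to do: positive result
    if not allow_repair:
        return first            # caller forbids changes: negative result

    # Step 2: rebuild the structure; on failure, report the first findings.
    if not repair(item):
        return first

    # Step 3: rerun the same checks to judge whether the repair worked.
    return check(item)


def check_index(item):
    """The index is corrupt if it disagrees with the records it indexes."""
    return ScanResult(corrupt=item["index"] != sorted(item["records"]))


def rebuild_index(item):
    """Discard the old index and regenerate it from the other records."""
    item["index"] = sorted(item["records"])
    return True


damaged = {"records": [7, 3, 5], "index": [3, 7]}
report_only = scrub_item(damaged, check_index, rebuild_index,
                         allow_repair=False)
after_repair = scrub_item(damaged, check_index, rebuild_index,
                          allow_repair=True)
```

The first call only reports the damage; the second rebuilds the index from the
records and returns the result of the follow-up check.
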
Classification of Metadata
--------------------------

Each type of metadata object (and therefore each type of scrub item) is
classified as follows:

Primary Metadata
````````````````

Metadata structures in this category should be most familiar to filesystem
users either because they are directly created by the user or they index
objects created by the user.
Most filesystem objects fall into this class:

- Free space and reference count information

- Inode records and indexes

- Storage mapping information for file data

- Directories

- Extended attributes

- Symbolic links

- Quota limits

Scrub obeys the same rules as regular filesystem accesses for resource and lock
acquisition.

Primary metadata objects are the simplest for scrub to process.
The principal filesystem object (either an allocation group or an inode) that
owns the item being scrubbed is locked to guard against concurrent updates.
The check function examines every record associated with the type for obvious
errors and cross-references healthy records against other metadata to look for
inconsistencies.
Repairs for this class of scrub item are simple, since the repair function
starts by holding all the resources acquired in the previous step.
The repair function scans available metadata as needed to record all the
observations needed to complete the structure.
Next, it stages the observations in a new ondisk structure and commits it
atomically to complete the repair.
Finally, the storage from the old data structure is carefully reaped.

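The scan, stage, commit, and reap sequence described above can be modelled as
follows.
This is a minimal sketch, not the kernel implementation: ``Group``,
``repair_index``, and the dictionary layout of the index are hypothetical, and
a single reference swap stands in for the atomic ondisk commit.

```python
import itertools
import threading


class Group:
    """Hypothetical owner object, e.g. one allocation group."""

    def __init__(self, records, index):
        self.lock = threading.Lock()
        self.records = records      # other metadata that the repair scans
        self.index = index          # the structure being repaired
        self.free_blocks = []       # space available for reuse


def repair_index(group, alloc_block):
    """Rebuild group.index while holding the owner's lock throughout."""
    with group.lock:
        # Scan the available metadata and record every observation.
        observations = sorted(rec["key"] for rec in group.records)
        # Stage the observations in a brand new structure.
        staged = {"blocks": [alloc_block()], "keys": observations}
        # Commit: a single reference swap replaces the old structure.
        old, group.index = group.index, staged
        # Reap: return the old structure's blocks to the free pool.
        group.free_blocks.extend(old["blocks"])
    return group.index


next_block = itertools.count(100).__next__
group = Group(records=[{"key": 5}, {"key": 1}, {"key": 3}],
              index={"blocks": [7, 8], "keys": [1, 99]})
rebuilt = repair_index(group, next_block)
```

Because the lock is held from the scan through the reap, no other thread can
observe a half-built index in this model.
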
Because ``xfs_scrub`` locks a primary object for the duration of the repair,
this is effectively an offline repair operation performed on a subset of the
filesystem.
This minimizes the complexity of the repair code because it is not necessary to
handle concurrent updates from other threads, nor is it necessary to access
any other part of the filesystem.
As a result, indexed structures can be rebuilt very quickly, and programs
trying to access the damaged structure will be blocked until repairs complete.
The only infrastructure needed by the repair code is the staging area for
observations and a means to write new structures to disk.
Despite these limitations, the advantage that online repair holds is clear:
targeted work on individual shards of the filesystem avoids total loss of
service.

This mechanism is described in section 2.1 ("Off-Line Algorithm") of
V. Srinivasan and M. J. Carey, `"Performance of On-Line Index Construction
Algorithms" <https://minds.wisconsin.edu/bitstream/handle/1793/59524/TR1047.pdf>`_,
*Extending Database Technology*, pp. 293-309, 1992.

Most primary metadata repair functions stage their intermediate results in an
in-memory array prior to formatting the new ondisk structure, which is very
similar to the list-based algorithm discussed in section 2.3 ("List-Based
Algorithms") of Srinivasan.
However, any data structure builder that maintains a resource lock for the
duration of the repair is *always* an offline algorithm.

4165f658dadSDarrick J. Wong.. _secondary_metadata:
4175f658dadSDarrick J. Wong
41888757e04SDarrick J. WongSecondary Metadata
41988757e04SDarrick J. Wong``````````````````
42088757e04SDarrick J. Wong
42188757e04SDarrick J. WongMetadata structures in this category reflect records found in primary metadata,
42288757e04SDarrick J. Wongbut are only needed for online fsck or for reorganization of the filesystem.
42388757e04SDarrick J. Wong
42488757e04SDarrick J. WongSecondary metadata include:
42588757e04SDarrick J. Wong
42688757e04SDarrick J. Wong- Reverse mapping information
42788757e04SDarrick J. Wong
42888757e04SDarrick J. Wong- Directory parent pointers
42988757e04SDarrick J. Wong
43088757e04SDarrick J. WongThis class of metadata is difficult for scrub to process because scrub attaches
43188757e04SDarrick J. Wongto the secondary object but needs to check primary metadata, which runs counter
43288757e04SDarrick J. Wongto the usual order of resource acquisition.
Frequently, this means that full filesystem scans are necessary to rebuild the
metadata.
Check functions can be limited in scope to reduce runtime.
Repairs, however, require a full scan of primary metadata, which can take a
long time to complete.
Under these conditions, ``xfs_scrub`` cannot lock resources for the entire
duration of the repair.

Instead, repair functions set up an in-memory staging structure to store
observations.
Depending on the requirements of the specific repair function, the staging
index will either have the same format as the ondisk structure or a design
specific to that repair function.
The next step is to release all locks and start the filesystem scan.
When the repair scanner needs to record an observation, the staging data are
locked long enough to apply the update.
While the filesystem scan is in progress, the repair function hooks the
filesystem so that it can apply pending filesystem updates to the staging
information.
Once the scan is done, the owning object is re-locked, the live data is used to
write a new ondisk structure, and the repairs are committed atomically.
The hooks are disabled and the staging area is freed.
Finally, the storage from the old data structure is carefully reaped.

Introducing concurrency helps online repair avoid various locking problems, but
comes at a high cost to code complexity.
Live filesystem code has to be hooked so that the repair function can observe
updates in progress.
The staging area has to become a fully functional parallel structure so that
updates can be merged from the hooks.
Finally, the hook, the filesystem scan, and the inode locking model must be
sufficiently well integrated that a hook event can decide if a given update
should be applied to the staging structure.
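
The interaction between the scan, the hooks, and the staging structure can be
illustrated with a toy model.
The following Python sketch is not kernel code: the ``LiveRebuild`` class and
its method names are hypothetical, and it assumes that each owner (an inode,
say) is scanned exactly once and that a hook only needs to know whether the
scan has already visited that owner.

.. code-block:: python

    import threading


    class LiveRebuild:
        """Toy model of a staged rebuild that tracks live updates."""

        def __init__(self):
            self.lock = threading.Lock()  # protects staging and scanned
            self.staging = {}             # owner -> set of observed records
            self.scanned = set()          # owners the scan has visited

        def scan_owner(self, owner, records):
            """Scanner records a snapshot of one owner's records."""
            with self.lock:
                self.staging[owner] = set(records)
                self.scanned.add(owner)

        def hook(self, owner, record, added):
            """Live filesystem reports one update; True if it was applied."""
            with self.lock:
                if owner not in self.scanned:
                    # The scan has not reached this owner yet, so it will
                    # observe the result of this update by itself.
                    return False
                if added:
                    self.staging[owner].add(record)
                else:
                    self.staging[owner].discard(record)
                return True

In this model, an update to an owner that the scan has not reached is ignored
by the hook, whereas an update to an already scanned owner is merged into the
staging data while the staging lock is held.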

In theory, the scrub implementation could apply these same techniques for
primary metadata, but doing so would make it massively more complex and less
performant.
Programs attempting to access the damaged structures are not blocked from
operation, which may cause application failure or an unplanned filesystem
shutdown.

Inspiration for the secondary metadata repair strategy was drawn from section
2.4 of Srinivasan above, and sections 2 ("NSF: Index Build Without Side-File")
and 3.1.1 ("Duplicate Key Insert Problem") in C. Mohan, `"Algorithms for
Creating Indexes for Very Large Tables Without Quiescing Updates"
<https://dl.acm.org/doi/10.1145/130283.130337>`_, 1992.

The sidecar index mentioned above bears some resemblance to the side file
method mentioned in Srinivasan and Mohan.
Their method consists of an index builder that extracts relevant record data to
build the new structure as quickly as possible; and an auxiliary structure that
captures all updates that would be committed to the index by other threads were
the new index already online.
After the index building scan finishes, the updates recorded in the side file
are applied to the new index.
To avoid conflicts between the index builder and other writer threads, the
builder maintains a publicly visible cursor that tracks the progress of the
scan through the record space.
To avoid duplication of work between the side file and the index builder, side
file updates are elided when the record ID for the update is greater than the
cursor position within the record ID space.
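
The side file method described above can be modeled in a few lines of Python.
This is a sketch under simplifying assumptions rather than the algorithm as
published: the ``SideFileBuild`` class is hypothetical, record IDs are plain
integers that the builder visits in ascending order, and all calls are made
from a single thread so that no locking is shown.

.. code-block:: python

    class SideFileBuild:
        """Toy model of an index build that uses a side file and a cursor."""

        def __init__(self, table):
            self.table = table    # record ID -> key; the live primary data
            self.cursor = -1      # highest record ID scanned so far
            self.new_index = {}   # key -> record ID; hidden until finished
            self.side_file = []   # (op, record ID, key) from writers

        def scan_step(self, record_id):
            """Builder visits one record ID and advances the cursor."""
            if record_id in self.table:
                self.new_index[self.table[record_id]] = record_id
            self.cursor = record_id

        def writer_insert(self, record_id, key):
            """A writer adds a record to the live table."""
            self.table[record_id] = key
            if record_id <= self.cursor:
                # The builder already passed this ID and will not see it.
                self.side_file.append(("insert", record_id, key))
            # Otherwise the update is elided; the scan will find the record.

        def writer_delete(self, record_id):
            """A writer removes a record from the live table."""
            key = self.table.pop(record_id)
            if record_id <= self.cursor:
                self.side_file.append(("delete", record_id, key))

        def finish(self):
            """Apply the side file to the new index once the scan is done."""
            for op, record_id, key in self.side_file:
                if op == "insert":
                    self.new_index[key] = record_id
                else:
                    self.new_index.pop(key, None)
            self.side_file.clear()
            return self.new_index

Only updates at or behind the cursor reach the side file; updates ahead of the
cursor are left for the scan to discover, which is the elision rule stated
above.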

To minimize changes to the rest of the codebase, XFS online repair keeps the
replacement index hidden until it's completely ready to go.
In other words, there is no attempt to expose the keyspace of the new index
while repair is running.
The complexity of such an approach would be very high and perhaps more
appropriate to building *new* indices.

**Future Work Question**: Can the full scan and live update code used to
facilitate a repair also be used to implement a comprehensive check?

*Answer*: In theory, yes.  Check would be much stronger if each scrub function
employed these live scans to build a shadow copy of the metadata and then
compared the shadow records to the ondisk records.
However, doing that is a fair amount more work than what the checking functions
do now, which in turn increases the runtime of those scrub functions.
The live scans and hooks were also developed much later than the checking
functions.

Summary Information
```````````````````

Metadata structures in this last category summarize the contents of primary
metadata records.
These are often used to speed up resource usage queries, and are many times
smaller than the primary metadata which they represent.

Examples of summary information include:

- Summary counts of free space and inodes

- File link counts from directories

- Quota resource usage counts

Check and repair require full filesystem scans, but resource and lock
acquisition follow the same paths as regular filesystem accesses.

The superblock summary counters have special requirements due to the underlying
implementation of the incore counters, and will be treated separately.
Check and repair of the other types of summary counters (quota resource counts
and file link counts) employ the same filesystem scanning and hooking
techniques as outlined above, but because the underlying data are sets of
integer counters, the staging data need not be a fully functional mirror of the
ondisk structure.

Inspiration for the quota and file link count repair strategies was drawn from
sections 2.12 ("Online Index Operations") through 2.14 ("Incremental View
Maintenance") of G. Graefe, `"Concurrent Queries and Updates in Summary Views
and Their Indexes"
<http://www.odbms.org/wp-content/uploads/2014/06/Increment-locks.pdf>`_, 2011.

Since quotas are non-negative integer counts of resource usage, online
quotacheck can use the incremental view deltas described in section 2.14 to
track pending changes to the block and inode usage counts in each transaction,
and commit those changes to a dquot side file when the transaction commits.
Delta tracking is necessary for dquots because the index builder scans inodes,
whereas the data structure being rebuilt is an index of dquots.
Link count checking combines the view deltas and commit step into one because
it sets attributes of the objects being scanned instead of writing them to a
separate data structure.
Each online fsck function will be discussed as a case study later in this
document.
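
The delta tracking idea for quota counters can be sketched as follows.
This is a simplified Python illustration, not the kernel implementation:
``ShadowDquots`` and the ``(quota id, resource)`` keys are hypothetical names,
and a transaction is modeled as nothing more than a bag of pending deltas.

.. code-block:: python

    from collections import Counter


    class ShadowDquots:
        """Shadow usage counters keyed by (quota id, resource name)."""

        def __init__(self):
            self.usage = Counter()

        def begin(self):
            """Start a transaction; returns its private set of deltas."""
            return Counter()

        def track(self, tx, quota_id, resource, delta):
            """Record a pending change without touching the shadow counts."""
            tx[(quota_id, resource)] += delta

        def commit(self, tx):
            """Fold the transaction's deltas into the shadow counts."""
            for key, delta in tx.items():
                self.usage[key] += delta
            tx.clear()

        def cancel(self, tx):
            """Drop the deltas of a transaction that did not commit."""
            tx.clear()

Because the deltas stay private to the transaction until ``commit()`` is
called, a cancelled transaction leaves the shadow counters untouched.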

Risk Management
---------------

During the development of online fsck, several risk factors were identified
that may make the feature unsuitable for certain distributors and users.
Steps can be taken to mitigate or eliminate those risks, though at a cost to
functionality.

- **Decreased performance**: Adding metadata indices to the filesystem
  increases the time cost of persisting changes to disk, and the reverse space
  mapping and directory parent pointers are no exception.
  System administrators who require the maximum performance can disable the
  reverse mapping features at format time, though this choice dramatically
  reduces the ability of online fsck to find inconsistencies and repair them.

- **Incorrect repairs**: As with all software, there might be defects in the
  software that result in incorrect repairs being written to the filesystem.
  Systematic fuzz testing (detailed in the next section) is employed by the
  authors to find bugs early, but it might not catch everything.
  The kernel build system provides Kconfig options (``CONFIG_XFS_ONLINE_SCRUB``
  and ``CONFIG_XFS_ONLINE_REPAIR``) to enable distributors to choose not to
  accept this risk.
  The xfsprogs build system has a configure option (``--enable-scrub=no``) that
  disables building of the ``xfs_scrub`` binary, though this is not a risk
  mitigation if the kernel functionality remains enabled.

- **Inability to repair**: Sometimes, a filesystem is too badly damaged to be
  repairable.
  If the keyspaces of several metadata indices overlap in some manner but a
  coherent narrative cannot be formed from records collected, then the repair
  fails.
  To reduce the chance that a repair will fail with a dirty transaction and
  render the filesystem unusable, the online repair functions have been
  designed to stage and validate all new records before committing the new
  structure.

- **Misbehavior**: Online fsck requires many privileges -- raw IO to block
  devices, opening files by handle, ignoring Unix discretionary access control,
  and the ability to perform administrative changes.
  Running this automatically in the background scares people, so the systemd
  background service is configured to run with only the privileges required.
  Obviously, this cannot address certain problems like the kernel crashing or
  deadlocking, but it should be sufficient to prevent the scrub process from
  escaping and reconfiguring the system.
  The cron job does not have this protection.

- **Fuzz Kiddiez**: There are many people now who seem to think that running
  automated fuzz testing of ondisk artifacts to find mischievous behavior and
  spraying exploit code onto the public mailing list for instant zero-day
  disclosure is somehow of some social benefit.
  In the view of this author, the benefit is realized only when the fuzz
  operators help to **fix** the flaws, but this opinion apparently is not
  widely shared among security "researchers".
  The XFS maintainers' continuing ability to manage these events presents an
  ongoing risk to the stability of the development process.
  Automated testing should front-load some of the risk while the feature is
  considered EXPERIMENTAL.

Many of these risks are inherent to software programming.
Despite this, it is hoped that this new functionality will prove useful in
reducing unexpected downtime.

3. Testing Plan
===============

As stated before, fsck tools have three main goals:

1. Detect inconsistencies in the metadata;

2. Eliminate those inconsistencies; and

3. Minimize further loss of data.

Demonstrations of correct operation are necessary to build users' confidence
that the software behaves within expectations.
Unfortunately, it was not really feasible to perform regular exhaustive testing
of every aspect of a fsck tool until the introduction of low-cost virtual
machines with high-IOPS storage.
With ample hardware availability in mind, the testing strategy for the online
fsck project involves differential analysis against the existing fsck tools and
systematic testing of every attribute of every type of metadata object.
Testing can be split into four major categories, as discussed below.

Integrated Testing with fstests
-------------------------------

The primary goal of any free software QA effort is to make testing as
inexpensive and widespread as possible to maximize the scaling advantages of
community.
In other words, testing should maximize the breadth of filesystem configuration
scenarios and hardware setups.
This improves code quality by enabling the authors of online fsck to find and
fix bugs early, and helps developers of new features to find integration
issues earlier in their development effort.

The Linux filesystem community shares a common QA testing suite,
`fstests <https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/>`_, for
functional and regression testing.
Even before development work began on online fsck, fstests (when run on XFS)
would run both the ``xfs_check`` and ``xfs_repair -n`` commands on the test and
scratch filesystems between each test.
This provides a level of assurance that the kernel and the fsck tools stay in
alignment about what constitutes consistent metadata.
During development of the online checking code, fstests was modified to run
``xfs_scrub -n`` between each test to ensure that the new checking code
produces the same results as the two existing fsck tools.

To start development of online repair, fstests was modified to run
``xfs_repair`` to rebuild the filesystem's metadata indices between tests.
This ensures that offline repair does not crash, leave a corrupt filesystem
after it exits, or trigger complaints from the online check.
This also established a baseline for what can and cannot be repaired offline.
To complete the first phase of development of online repair, fstests was
modified to be able to run ``xfs_scrub`` in a "force rebuild" mode.
This enables a comparison of the effectiveness of online repair as compared to
the existing offline repair tools.

General Fuzz Testing of Metadata Blocks
---------------------------------------

XFS benefits greatly from having a very robust debugging tool, ``xfs_db``.

Before development of online fsck even began, a set of fstests were created
to test the rather common fault that entire metadata blocks get corrupted.
This required the creation of fstests library code that can create a filesystem
containing every possible type of metadata object.
Next, individual test cases were created to create a test filesystem, identify
a single block of a specific type of metadata object, trash it with the
existing ``blocktrash`` command in ``xfs_db``, and test the reaction of a
particular metadata validation strategy.

This earlier test suite enabled XFS developers to test the ability of the
in-kernel validation functions and the ability of the offline fsck tool to
detect and eliminate the inconsistent metadata.
This part of the test suite was extended to cover online fsck in exactly the
same manner.

In other words, for a given fstests filesystem configuration:

* For each metadata object existing on the filesystem:

  * Write garbage to it

  * Test the reactions of:

    1. The kernel verifiers to stop obviously bad metadata
    2. Offline repair (``xfs_repair``) to detect and fix
    3. Online repair (``xfs_scrub``) to detect and fix

Targeted Fuzz Testing of Metadata Records
-----------------------------------------

The testing plan for online fsck includes extending the existing fs testing
infrastructure to provide a much more powerful facility: targeted fuzz testing
of every metadata field of every metadata object in the filesystem.
``xfs_db`` can modify every field of every metadata structure in every
block in the filesystem to simulate the effects of memory corruption and
software bugs.
Given that fstests already contains the ability to create a filesystem
containing every metadata format known to the filesystem, ``xfs_db`` can be
used to perform exhaustive fuzz testing!

For a given fstests filesystem configuration:

* For each metadata object existing on the filesystem...

  * For each record inside that metadata object...

    * For each field inside that record...

      * For each conceivable type of transformation that can be applied to a bit field...

        1. Clear all bits
        2. Set all bits
        3. Toggle the most significant bit
        4. Toggle the middle bit
        5. Toggle the least significant bit
        6. Add a small quantity
        7. Subtract a small quantity
        8. Randomize the contents

        * ...test the reactions of:

          1. The kernel verifiers to stop obviously bad metadata
          2. Offline checking (``xfs_repair -n``)
          3. Offline repair (``xfs_repair``)
          4. Online checking (``xfs_scrub -n``)
          5. Online repair (``xfs_scrub``)
          6. Both repair tools (``xfs_scrub`` and then ``xfs_repair`` if online repair doesn't succeed)
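
The eight field transformations listed above can be written down concretely.
The following Python sketch is only an illustration: the function name and the
dictionary keys are hypothetical, and the choice of which bit counts as the
middle bit, the size of the small quantity, and the wraparound on overflow are
assumptions of this sketch rather than a description of what ``xfs_db`` does.

.. code-block:: python

    import random


    def fuzz_values(value, width, small=3, seed=0):
        """Return one fuzzed value per transformation for a bit field."""
        mask = (1 << width) - 1
        rng = random.Random(seed)  # seeded so that runs are reproducible
        return {
            "clear_all": 0,
            "set_all": mask,
            "toggle_msb": value ^ (1 << (width - 1)),
            "toggle_middle": value ^ (1 << (width // 2)),
            "toggle_lsb": value ^ 1,
            "add_small": (value + small) & mask,
            "subtract_small": (value - small) & mask,
            "randomize": rng.getrandbits(width),
        }

Each value returned for a field would then be written back to the filesystem
and followed by one pass through the six reactions in the list above.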

This is quite the combinatoric explosion!

Fortunately, having this much test coverage makes it easy for XFS developers to
check the responses of XFS' fsck tools.
Since the introduction of the fuzz testing framework, these tests have been
used to discover incorrect repair code and missing functionality for entire
classes of metadata objects in ``xfs_repair``.
The enhanced testing was used to finalize the deprecation of ``xfs_check`` by
confirming that ``xfs_repair`` could detect at least as many corruptions as
the older tool.

These tests have been very valuable for ``xfs_scrub`` in the same ways -- they
allow the online fsck developers to compare online fsck against offline fsck,
and they enable XFS developers to find deficiencies in the code base.

Proposed patchsets include
`general fuzzer improvements
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=fuzzer-improvements>`_,
`fuzzing baselines
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=fuzz-baseline>`_,
and `improvements in fuzz testing comprehensiveness
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=more-fuzz-testing>`_.

Stress Testing
--------------

A unique requirement of online fsck is the ability to operate on a filesystem
concurrently with regular workloads.
Although it is of course impossible to run ``xfs_scrub`` with *zero* observable
impact on the running system, the online repair code should never introduce
inconsistencies into the filesystem metadata, and regular workloads should
never notice resource starvation.
To verify that these conditions are being met, fstests has been enhanced in
the following ways:

* For each scrub item type, create a test to exercise checking that item type
  while running ``fsstress``.
* For each scrub item type, create a test to exercise repairing that item type
  while running ``fsstress``.
* Race ``fsstress`` and ``xfs_scrub -n`` to ensure that checking the whole
  filesystem doesn't cause problems.
* Race ``fsstress`` and ``xfs_scrub`` in force-rebuild mode to ensure that
  force-repairing the whole filesystem doesn't cause problems.
* Race ``xfs_scrub`` in check and force-repair mode against ``fsstress`` while
  freezing and thawing the filesystem.
* Race ``xfs_scrub`` in check and force-repair mode against ``fsstress`` while
  remounting the filesystem read-only and read-write.
* The same, but running ``fsx`` instead of ``fsstress``.  (Not done yet?)

Success is defined by the ability to run all of these tests without observing
any unexpected filesystem shutdowns due to corrupted metadata, kernel hang
check warnings, or any other sort of mischief.

Proposed patchsets include `general stress testing
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=race-scrub-and-mount-state-changes>`_
and the `evolution of existing per-function stress testing
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=refactor-scrub-stress>`_.

4. User Interface
=================

The primary user of online fsck is the system administrator, just like offline
repair.
Online fsck presents two modes of operation to administrators:
a foreground CLI process for online fsck on demand, and a background service
that performs autonomous checking and repair.

Checking on Demand
------------------

For administrators who want the absolute freshest information about the
metadata in a filesystem, ``xfs_scrub`` can be run as a foreground process on
a command line.
The program checks every piece of metadata in the filesystem while the
administrator waits for the results to be reported, just like the existing
``xfs_repair`` tool.
Both tools share a ``-n`` option to perform a read-only scan, and a ``-v``
option to increase the verbosity of the information reported.

A new feature of ``xfs_scrub`` is the ``-x`` option, which employs the error
correction capabilities of the hardware to check data file contents.
The media scan is not enabled by default because it may dramatically increase
program runtime and consume a lot of bandwidth on older storage hardware.

The output of a foreground invocation is captured in the system log.

The ``xfs_scrub_all`` program walks the list of mounted filesystems and
initiates ``xfs_scrub`` for each of them in parallel.
It serializes scans for any filesystems that resolve to the same top level
kernel block device to prevent resource overconsumption.
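
The scheduling policy described above, parallel across devices and serial
within a device, can be sketched as a grouping step.
The Python below is a hypothetical illustration and is not taken from
``xfs_scrub_all``: the ``plan_scrubs`` function is invented for this example,
and it assumes the caller has already resolved each mount point to the name of
its top level block device.

.. code-block:: python

    from collections import defaultdict


    def plan_scrubs(mounts):
        """Group mount points by top level block device.

        ``mounts`` is an iterable of (mountpoint, device) pairs.
        Each list in the result is meant to be scrubbed serially by one
        worker, while separate lists can be handled in parallel.
        """
        queues = defaultdict(list)
        for mountpoint, device in mounts:
            queues[device].append(mountpoint)
        return dict(queues)

With this grouping, two filesystems that share a disk never compete with each
other for that disk's bandwidth, while filesystems on unrelated devices are
still scanned at the same time.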

Background Service
------------------

To reduce the workload of system administrators, the ``xfs_scrub`` package
provides a suite of `systemd <https://systemd.io/>`_ timers and services that
run online fsck automatically on weekends by default.
The background service configures scrub to run with as little privilege as
possible, the lowest CPU and IO priority, and in a CPU-constrained single
threaded mode.
This can be tuned by the systemd administrator at any time to suit the latency
and throughput requirements of customer workloads.

The output of the background service is also captured in the system log.
If desired, reports of failures (either due to inconsistencies or mere runtime
errors) can be emailed automatically by setting the ``EMAIL_ADDR`` environment
variable in the following service files:

* ``xfs_scrub_fail@.service``
* ``xfs_scrub_media_fail@.service``
* ``xfs_scrub_all_fail.service``

The decision to enable the background scan is left to the system administrator.
This can be done by enabling either of the following services:

* ``xfs_scrub_all.timer`` on systemd systems
* ``xfs_scrub_all.cron`` on non-systemd systems

This automatic weekly scan is configured out of the box to perform an
additional media scan of all file data once per month.
This is less foolproof than, say, storing file data block checksums, but much
more performant if application software provides its own integrity checking,
redundancy can be provided elsewhere above the filesystem, or the storage
device's integrity guarantees are deemed sufficient.

The systemd unit file definitions have been subjected to a security audit
(as of systemd 249) to ensure that the ``xfs_scrub`` processes have as little
access to the rest of the system as possible.
8764f7f6469SDarrick J. WongThis was performed via ``systemd-analyze security``, after which privileges
8774f7f6469SDarrick J. Wongwere restricted to the minimum required, sandboxing was set up to the maximal
8784f7f6469SDarrick J. Wongextent possible with sandboxing and system call filtering; and access to the
8794f7f6469SDarrick J. Wongfilesystem tree was restricted to the minimum needed to start the program and
8804f7f6469SDarrick J. Wongaccess the filesystem being scanned.
8814f7f6469SDarrick J. WongThe service definition files restrict CPU usage to 80% of one CPU core, and
8824f7f6469SDarrick J. Wongapply as nice of a priority to IO and CPU scheduling as possible.
8834f7f6469SDarrick J. WongThis measure was taken to minimize delays in the rest of the filesystem.
8844f7f6469SDarrick J. WongNo such hardening has been performed for the cron job.

Proposed patchset:
`Enabling the xfs_scrub background service
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-media-scan-service>`_.

Health Reporting
----------------

XFS caches a summary of each filesystem's health status in memory.
The information is updated whenever ``xfs_scrub`` is run, or whenever
inconsistencies are detected in the filesystem metadata during regular
operations.
System administrators should use the ``health`` command of ``xfs_spaceman`` to
retrieve this information in a human-readable format.
If problems have been observed, the administrator can schedule a reduced
service window to run the online repair tool to correct the problem.
Failing that, the administrator can decide to schedule a maintenance window to
run the traditional offline repair tool to correct the problem.

**Future Work Question**: Should the health reporting integrate with the new
inotify fs error notification system?
Would it be helpful for sysadmins to have a daemon to listen for corruption
notifications and initiate a repair?

*Answer*: These questions remain unanswered, but should be a part of the
conversation with early adopters and potential downstream users of XFS.

Proposed patchsets include
`wiring up health reports to corruption returns
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=corruption-health-reports>`_
and
`preservation of sickness info during memory reclaim
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=indirect-health-reporting>`_.

5. Kernel Algorithms and Data Structures
========================================

This section discusses the key algorithms and data structures of the kernel
code that provide the ability to check and repair metadata while the system
is running.
The first chapters in this section reveal the pieces that provide the
foundation for checking metadata.
The remainder of this section presents the mechanisms through which XFS
regenerates itself.

Self Describing Metadata
------------------------

Starting with XFS version 5 in 2012, XFS updated the format of nearly every
ondisk block header to record a magic number, a checksum, a universally
"unique" identifier (UUID), an owner code, the ondisk address of the block,
and a log sequence number.
When loading a block buffer from disk, the magic number, UUID, owner, and
ondisk address confirm that the retrieved block matches the specific owner of
the current filesystem, and that the information contained in the block is
supposed to be found at the ondisk address.
The first three components enable checking tools to disregard alleged metadata
that doesn't belong to the filesystem, and the fourth component enables the
filesystem to detect lost writes.
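As a rough illustration of the header checks described above, the following
sketch compares a toy block header against what the reader expects to find at
the address it just read.
The structure layouts and the names ``toy_hdr``, ``toy_expect``, and
``toy_verify`` are assumptions made for this example and do not reflect the
XFS ondisk format; checksum verification is deliberately left out.

.. code-block:: c

	#include <assert.h>
	#include <stddef.h>
	#include <stdint.h>
	#include <string.h>

	/* Toy block header; field names are illustrative only. */
	struct toy_hdr {
	    uint32_t magic;     /* identifies the type of metadata */
	    uint8_t  uuid[16];  /* filesystem that owns the block */
	    uint64_t owner;     /* structure that owns the block */
	    uint64_t blkno;     /* address the block was written to */
	};

	/* What the reader expects, given the read it just issued. */
	struct toy_expect {
	    uint32_t       magic;
	    const uint8_t *fs_uuid;     /* 16 bytes */
	    uint64_t       owner;
	    uint64_t       read_blkno;  /* address that was actually read */
	};

	enum toy_verdict { TOY_OK, TOY_NOT_OURS, TOY_WRONG_PLACE };

	static enum toy_verdict toy_verify(const struct toy_hdr *h,
	                                   const struct toy_expect *e)
	{
	    assert(h != NULL && e != NULL);

	    /* Magic, UUID, and owner: does this block belong to us at all? */
	    if (h->magic != e->magic ||
	        memcmp(h->uuid, e->fs_uuid, sizeof(h->uuid)) != 0 ||
	        h->owner != e->owner)
	        return TOY_NOT_OURS;

	    /* Recorded address: is this the block that should live here? */
	    if (h->blkno != e->read_blkno)
	        return TOY_WRONG_PLACE;

	    return TOY_OK;
	}

In this toy model, a caller would disregard a ``TOY_NOT_OURS`` block as
foreign or stale metadata, and treat ``TOY_WRONG_PLACE`` as evidence that the
expected write never reached this address.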

Whenever a filesystem operation modifies a block, the change is submitted
to the log as part of a transaction.
The log then processes these transactions, marking them done once they are
safely persisted to storage.
The logging code maintains the checksum and the log sequence number of the last
transactional update.
Checksums are useful for detecting torn writes and other discrepancies that can
be introduced between the computer and its storage devices.
Sequence number tracking enables log recovery to avoid applying out of date
log updates to the filesystem.

These two features improve overall runtime resiliency by providing a means for
the filesystem to detect obvious corruption when reading metadata blocks from
disk, but these buffer verifiers cannot provide any consistency checking
between metadata structures.

For more information, please see the documentation for
Documentation/filesystems/xfs-self-describing-metadata.rst

Reverse Mapping
---------------

The original design of XFS (circa 1993) is an improvement upon 1980s Unix
filesystem design.
In those days, storage density was expensive, CPU time was scarce, and
excessive seek time could kill performance.
For performance reasons, filesystem authors were reluctant to add redundancy to
the filesystem, even at the cost of data integrity.
Filesystem designers in the early 21st century chose different strategies to
increase internal redundancy -- either storing nearly identical copies of
metadata, or more space-efficient encoding techniques.

For XFS, a different redundancy strategy was chosen to modernize the design:
a secondary space usage index that maps allocated disk extents back to their
owners.
By adding a new index, the filesystem retains most of its ability to scale
well to heavily threaded workloads involving large datasets, since the primary
file metadata (the directory tree, the file block map, and the allocation
groups) remain unchanged.
Like any system that improves redundancy, the reverse-mapping feature increases
overhead costs for space mapping activities.
However, it has two critical advantages: first, the reverse index is key to
enabling online fsck and other requested functionality such as free space
defragmentation, better media failure reporting, and filesystem shrinking.
Second, the different ondisk storage format of the reverse mapping btree
defeats device-level deduplication because the filesystem requires real
redundancy.

+--------------------------------------------------------------------------+
| **Sidebar**:                                                             |
+--------------------------------------------------------------------------+
| A criticism of adding the secondary index is that it does nothing to     |
| improve the robustness of user data storage itself.                      |
| This is a valid point, but adding a new index for file data block        |
| checksums increases write amplification by turning data overwrites into  |
| copy-writes, which age the filesystem prematurely.                       |
| In keeping with thirty years of precedent, users who want file data      |
| integrity can supply as powerful a solution as they require.             |
| As for metadata, the complexity of adding a new secondary index of space |
| usage is much less than adding volume management and storage device      |
| mirroring to XFS itself.                                                 |
| Perfection of RAID and volume management is best left to existing        |
| layers in the kernel.                                                    |
+--------------------------------------------------------------------------+

The information captured in a reverse space mapping record is as follows:

.. code-block:: c

	struct xfs_rmap_irec {
	    xfs_agblock_t    rm_startblock;   /* extent start block */
	    xfs_extlen_t     rm_blockcount;   /* extent length */
	    uint64_t         rm_owner;        /* extent owner */
	    uint64_t         rm_offset;       /* offset within the owner */
	    unsigned int     rm_flags;        /* state flags */
	};

The first two fields capture the location and size of the physical space,
in units of filesystem blocks.
The owner field tells scrub which metadata structure or file inode has been
assigned this space.
For space allocated to files, the offset field tells scrub where the space was
mapped within the file fork.
Finally, the flags field provides extra information about the space usage --
is this an attribute fork extent?  A file mapping btree extent?  Or an
unwritten data extent?
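To make the role of each field concrete, here is a minimal sketch of how a
consumer might interpret such a record.
The ``TOY_RMAP_*`` flag names and bit values, the plain integer types, and the
helper functions are assumptions made for illustration; they are not taken
from the kernel headers.

.. code-block:: c

	#include <assert.h>
	#include <stdbool.h>
	#include <stddef.h>
	#include <stdint.h>

	/* Illustrative flag bits; names and values are assumptions. */
	#define TOY_RMAP_ATTR_FORK   (1u << 0)  /* attribute fork extent */
	#define TOY_RMAP_BMBT_BLOCK  (1u << 1)  /* file mapping btree block */
	#define TOY_RMAP_UNWRITTEN   (1u << 2)  /* unwritten extent */

	/* Same fields as the record above, using plain integer types. */
	struct toy_rmap_irec {
	    uint32_t     rm_startblock;
	    uint32_t     rm_blockcount;
	    uint64_t     rm_owner;
	    uint64_t     rm_offset;
	    unsigned int rm_flags;
	};

	enum toy_rmap_kind {
	    TOY_KIND_BMBT_BLOCK,
	    TOY_KIND_ATTR_EXTENT,
	    TOY_KIND_UNWRITTEN_EXTENT,
	    TOY_KIND_DATA_EXTENT,
	};

	/* Location and size: does this record cover the given block? */
	static bool toy_rmap_covers(const struct toy_rmap_irec *r, uint32_t agbno)
	{
	    assert(r != NULL);
	    return agbno >= r->rm_startblock &&
	           agbno - r->rm_startblock < r->rm_blockcount;
	}

	/* Flags: what kind of space usage does this record describe? */
	static enum toy_rmap_kind toy_rmap_classify(const struct toy_rmap_irec *r)
	{
	    assert(r != NULL);
	    if (r->rm_flags & TOY_RMAP_BMBT_BLOCK)
	        return TOY_KIND_BMBT_BLOCK;
	    if (r->rm_flags & TOY_RMAP_ATTR_FORK)
	        return TOY_KIND_ATTR_EXTENT;
	    if (r->rm_flags & TOY_RMAP_UNWRITTEN)
	        return TOY_KIND_UNWRITTEN_EXTENT;
	    return TOY_KIND_DATA_EXTENT;
	}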

Online filesystem checking judges the consistency of each primary metadata
record by comparing its information against all other space indices.
The reverse mapping index plays a key role in the consistency checking process
because it contains a centralized alternate copy of all space allocation
information.
Program runtime and ease of resource acquisition are the only real limits to
what online checking can consult.
For example, a file data extent mapping can be checked against:

* The absence of an entry in the free space information.
* The absence of an entry in the inode index.
* The absence of an entry in the reference count data if the file is not
  marked as having shared extents.
* The correspondence of an entry in the reverse mapping information.
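The four checks above can be sketched against a toy in-memory model of one
allocation group.
Everything here is hypothetical: the ``toy_*`` types stand in for the real
btrees, the lists are plain arrays, and the reverse mapping is required to
match the file mapping exactly, which is a simplification.

.. code-block:: c

	#include <assert.h>
	#include <stdbool.h>
	#include <stddef.h>
	#include <stdint.h>

	struct toy_extent { uint32_t start; uint32_t len; };

	struct toy_rmap {
	    struct toy_extent ext;
	    uint64_t          owner;   /* inode number */
	    uint64_t          offset;  /* offset within the file */
	};

	/* Toy stand-ins for the space indices of one allocation group. */
	struct toy_ag {
	    const struct toy_extent *free;    size_t nr_free;
	    const struct toy_extent *inodes;  size_t nr_inodes;
	    const struct toy_extent *shared;  size_t nr_shared;
	    const struct toy_rmap   *rmaps;   size_t nr_rmaps;
	};

	static bool toy_overlaps(const struct toy_extent *a,
	                         const struct toy_extent *b)
	{
	    uint64_t a_end = (uint64_t)a->start + a->len;
	    uint64_t b_end = (uint64_t)b->start + b->len;

	    return a->start < b_end && b->start < a_end;
	}

	static bool toy_listed(const struct toy_extent *list, size_t nr,
	                       const struct toy_extent *ext)
	{
	    for (size_t i = 0; i < nr; i++)
	        if (toy_overlaps(&list[i], ext))
	            return true;
	    return false;
	}

	/* Returns true if the file mapping agrees with every other index. */
	static bool toy_xref_file_extent(const struct toy_ag *ag, uint64_t ino,
	                                 uint64_t offset,
	                                 const struct toy_extent *ext,
	                                 bool file_has_shared_extents)
	{
	    assert(ag != NULL && ext != NULL && ext->len > 0);

	    if (toy_listed(ag->free, ag->nr_free, ext))
	        return false;   /* claimed as free space */
	    if (toy_listed(ag->inodes, ag->nr_inodes, ext))
	        return false;   /* claimed by the inode index */
	    if (!file_has_shared_extents &&
	        toy_listed(ag->shared, ag->nr_shared, ext))
	        return false;   /* unexpectedly shared */

	    /* There must be a corresponding reverse mapping. */
	    for (size_t i = 0; i < ag->nr_rmaps; i++) {
	        const struct toy_rmap *r = &ag->rmaps[i];

	        if (r->owner == ino && r->offset == offset &&
	            r->ext.start == ext->start && r->ext.len == ext->len)
	            return true;
	    }
	    return false;
	}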

There are several observations to make about reverse mapping indices:

1. Reverse mappings can provide a positive affirmation of correctness if any of
   the above primary metadata are in doubt.
   The checking code for most primary metadata follows a path similar to the
   one outlined above.

2. Proving the consistency of secondary metadata with the primary metadata is
   difficult because that requires a full scan of all primary space metadata,
   which is very time intensive.
   For example, checking a reverse mapping record for a file extent mapping
   btree block requires locking the file and searching the entire btree to
   confirm the block.
   Instead, scrub relies on rigorous cross-referencing during the primary space
   mapping structure checks.

3. Consistency scans must use non-blocking lock acquisition primitives if the
   required locking order is not the same order used by regular filesystem
   operations.
   For example, if the filesystem normally takes a file ILOCK before taking
   the AGF buffer lock but scrub wants to take a file ILOCK while holding
   an AGF buffer lock, scrub cannot block on that second acquisition.
   This means that forward progress during this part of a scan of the reverse
   mapping data cannot be guaranteed if system load is heavy.
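The third observation can be illustrated with a small sketch that takes two
locks in the reverse of the usual order.
It uses a C11 ``atomic_flag`` as a stand-in for kernel locks; the names
``toy_lock``, ``toy_trylock``, and ``toy_scan_owner`` are hypothetical, and
the toy lock has no blocking form, so even the first acquisition is a try.

.. code-block:: c

	#include <assert.h>
	#include <stdatomic.h>
	#include <stdbool.h>
	#include <stddef.h>

	struct toy_lock { atomic_flag held; };

	/* Returns true if the lock was acquired; never waits. */
	static bool toy_trylock(struct toy_lock *l)
	{
	    return !atomic_flag_test_and_set(&l->held);
	}

	static void toy_unlock(struct toy_lock *l)
	{
	    atomic_flag_clear(&l->held);
	}

	enum toy_result { TOY_DONE, TOY_RETRY };

	/*
	 * Regular code takes the inode lock first and the AG header lock second.
	 * This scan holds the AG header lock first, so it may only *try* to take
	 * the inode lock; waiting for it could deadlock against regular code.
	 */
	static enum toy_result toy_scan_owner(struct toy_lock *agf,
	                                      struct toy_lock *ilock)
	{
	    enum toy_result ret = TOY_RETRY;

	    assert(agf != NULL && ilock != NULL);

	    /* Real code would be allowed to block on this first lock. */
	    if (!toy_trylock(agf))
	        return TOY_RETRY;

	    if (toy_trylock(ilock)) {
	        /* ... examine the file while both locks are held ... */
	        toy_unlock(ilock);
	        ret = TOY_DONE;
	    }

	    /* On failure, back out completely so that others can progress. */
	    toy_unlock(agf);
	    return ret;
	}

A caller that keeps receiving ``TOY_RETRY`` under heavy contention makes no
progress, which is the lack of a forward progress guarantee noted above.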

In summary, reverse mappings play a key role in reconstruction of primary
metadata.
The details of how these records are staged, written to disk, and committed
into the filesystem are covered in subsequent sections.

Checking and Cross-Referencing
------------------------------

The first step of checking a metadata structure is to examine every record
contained within the structure and its relationship with the rest of the
system.
XFS contains multiple layers of checking to try to prevent inconsistent
metadata from wreaking havoc on the system.
Each of these layers contributes information that helps the kernel to make
five decisions about the health of a metadata structure:

- Is a part of this structure obviously corrupt (``XFS_SCRUB_OFLAG_CORRUPT``)?
- Is this structure inconsistent with the rest of the system
  (``XFS_SCRUB_OFLAG_XCORRUPT``)?
- Is there so much damage around the filesystem that cross-referencing is not
  possible (``XFS_SCRUB_OFLAG_XFAIL``)?
- Can the structure be optimized to improve performance or reduce the size of
  metadata (``XFS_SCRUB_OFLAG_PREEN``)?
- Does the structure contain data that is not inconsistent but deserves review
  by the system administrator (``XFS_SCRUB_OFLAG_WARNING``)?

The following sections describe how the metadata scrubbing process works.

Metadata Buffer Verification
````````````````````````````

The lowest layer of metadata protection in XFS consists of the metadata
verifiers built into the buffer cache.
These functions perform inexpensive internal consistency checking of the block
itself, and answer these questions:

- Does the block belong to this filesystem?

- Does the block belong to the structure that asked for the read?
  This assumes that metadata blocks only have one owner, which is always true
  in XFS.

- Is the type of data stored in the block within a reasonable range of what
  scrub is expecting?

- Does the physical location of the block match the location it was read from?

- Does the block checksum match the data?

The scope of the protections here is very limited -- verifiers can only
establish that the filesystem code is reasonably free of gross corruption bugs
and that the storage system is reasonably competent at retrieval.
Corruption problems observed at runtime cause the generation of health reports,
failed system calls, and in the extreme case, filesystem shutdowns if the
corrupt metadata force the cancellation of a dirty transaction.

Every online fsck scrubbing function is expected to read every ondisk metadata
block of a structure in the course of checking the structure.
Corruption problems observed during a check are immediately reported to
userspace as corruption; during a cross-reference, they are reported as a
failure to cross-reference once the full examination is complete.
Reads satisfied by a buffer already in cache (and hence already verified)
bypass these checks.

Internal Consistency Checks
```````````````````````````

After the buffer cache, the next level of metadata protection is the internal
record verification code built into the filesystem.
These checks are split between the buffer verifiers, the in-filesystem users of
the buffer cache, and the scrub code itself, depending on the amount of higher
level context required.
The scope of checking is still internal to the block.
These higher level checking functions answer these questions:

- Does the type of data stored in the block match what scrub is expecting?

- Does the block belong to the owning structure that asked for the read?

- If the block contains records, do the records fit within the block?

- If the block tracks internal free space information, is it consistent with
  the record areas?

- Are the records contained inside the block free of obvious corruptions?

Record checks in this category are more rigorous and more time-intensive.
For example, block pointers and inumbers are checked to ensure that they point
within the dynamically allocated parts of an allocation group and within
the filesystem.
Names are checked for invalid characters, and flags are checked for invalid
combinations.
Other record attributes are checked for sensible values.
Btree records spanning an interval of the btree keyspace are checked for
correct order and lack of mergeability (except for file fork mappings).
For performance reasons, regular code may skip some of these checks unless
debugging is enabled or a write is about to occur.
Scrub functions, of course, must check all possible problems.
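A minimal sketch of the range, ordering, and mergeability checks follows.
It assumes free-space-style records that carry nothing but a start block and
a length, so that two adjacent records could always have been stored as one;
the ``toy_*`` names and the ``first_valid`` and ``ag_blocks`` parameters are
hypothetical.

.. code-block:: c

	#include <assert.h>
	#include <stddef.h>
	#include <stdint.h>

	struct toy_rec { uint32_t start; uint32_t len; };

	enum toy_recerr {
	    TOY_REC_OK,
	    TOY_REC_RANGE,      /* points outside the allocatable area */
	    TOY_REC_ORDER,      /* out of order or overlapping */
	    TOY_REC_MERGEABLE,  /* adjacent to its predecessor */
	};

	/*
	 * Check records describing space within [first_valid, ag_blocks).
	 * Returns the first problem found, or TOY_REC_OK.
	 */
	static enum toy_recerr toy_check_records(const struct toy_rec *recs,
	                                         size_t nr, uint32_t first_valid,
	                                         uint32_t ag_blocks)
	{
	    assert(recs != NULL || nr == 0);

	    for (size_t i = 0; i < nr; i++) {
	        uint64_t end = (uint64_t)recs[i].start + recs[i].len;

	        /* Each record must be nonempty and inside the valid area. */
	        if (recs[i].len == 0 || recs[i].start < first_valid ||
	            end > ag_blocks)
	            return TOY_REC_RANGE;

	        if (i == 0)
	            continue;

	        uint64_t prev_end = (uint64_t)recs[i - 1].start + recs[i - 1].len;

	        /* Records must be sorted and must not overlap. */
	        if (recs[i].start < prev_end)
	            return TOY_REC_ORDER;

	        /* Adjacent records should have been stored as one record. */
	        if (recs[i].start == prev_end)
	            return TOY_REC_MERGEABLE;
	    }
	    return TOY_REC_OK;
	}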

Validation of Userspace-Controlled Record Attributes
````````````````````````````````````````````````````

Various pieces of filesystem metadata are directly controlled by userspace.
Because of this nature, validation work cannot be more precise than checking
that a value is within the possible range.
These fields include:

- Superblock fields controlled by mount options
- Filesystem labels
- File timestamps
- File permissions
- File size
- File flags
- Names present in directory entries, extended attribute keys, and filesystem
  labels
- Extended attribute key namespaces
- Extended attribute values
- File data block contents
- Quota limits
- Quota timer expiration (if resource usage exceeds the soft limit)

Cross-Referencing Space Metadata
````````````````````````````````

After internal block checks, the next higher level of checking is
cross-referencing records between metadata structures.
For regular runtime code, the cost of these checks is considered to be
prohibitively expensive, but as scrub is dedicated to rooting out
inconsistencies, it must pursue all avenues of inquiry.
The exact set of cross-referencing is highly dependent on the context of the
data structure being checked.

The XFS btree code has keyspace scanning functions that online fsck uses to
cross reference one structure with another.
Specifically, scrub can scan the key space of an index to determine if that
keyspace is fully, sparsely, or not at all mapped to records.
For the reverse mapping btree, it is possible to mask parts of the key for the
purposes of performing a keyspace scan so that scrub can decide if the rmap
btree contains records mapping a certain extent of physical space without the
sparseness of the rest of the rmap keyspace getting in the way.
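The idea of a keyspace scan can be sketched over a sorted array instead of a
btree.
The function below looks only at the physical extent of each record, which
loosely corresponds to masking off the owner and offset parts of an rmap key;
the ``toy_*`` names are hypothetical and the records are assumed to be
nonempty and sorted by start block, though they may overlap.

.. code-block:: c

	#include <assert.h>
	#include <stdbool.h>
	#include <stddef.h>
	#include <stdint.h>

	struct toy_rec { uint32_t start; uint32_t len; };

	enum toy_packing { TOY_EMPTY, TOY_SPARSE, TOY_FULL };

	/*
	 * Decide whether [start, start + len) is fully, sparsely, or not at
	 * all covered by the records.
	 */
	static enum toy_packing toy_scan_keyspace(const struct toy_rec *recs,
	                                          size_t nr, uint32_t start,
	                                          uint32_t len)
	{
	    uint64_t lo = start;
	    uint64_t hi = (uint64_t)start + len;
	    uint64_t next = lo;   /* first block not yet known to be covered */
	    bool any = false;
	    bool gap = false;

	    assert(len > 0);

	    for (size_t i = 0; i < nr; i++) {
	        uint64_t rs = recs[i].start;
	        uint64_t re = rs + recs[i].len;

	        if (re <= lo || rs >= hi)
	            continue;     /* entirely outside the queried range */

	        any = true;
	        if (rs > next)
	            gap = true;   /* hole between coverage so far and here */
	        if (re > next)
	            next = re;
	    }

	    if (!any)
	        return TOY_EMPTY;
	    if (gap || next < hi)
	        return TOY_SPARSE;
	    return TOY_FULL;
	}

For instance, a free space record would be expected to produce ``TOY_EMPTY``
when the same range is scanned in a reverse mapping index modeled this way.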
1212e5edad52SDarrick J. Wong
1213e5edad52SDarrick J. WongBtree blocks undergo the following checks before cross-referencing:
1214e5edad52SDarrick J. Wong
1215e5edad52SDarrick J. Wong- Does the type of data stored in the block match what scrub is expecting?
1216e5edad52SDarrick J. Wong
1217e5edad52SDarrick J. Wong- Does the block belong to the owning structure that asked for the read?
1218e5edad52SDarrick J. Wong
1219e5edad52SDarrick J. Wong- Do the records fit within the block?
1220e5edad52SDarrick J. Wong
1221e5edad52SDarrick J. Wong- Are the records contained inside the block free of obvious corruptions?
1222e5edad52SDarrick J. Wong
1223e5edad52SDarrick J. Wong- Are the name hashes in the correct order?
1224e5edad52SDarrick J. Wong
1225e5edad52SDarrick J. Wong- Do node pointers within the btree point to valid block addresses for the type
1226e5edad52SDarrick J. Wong  of btree?
1227e5edad52SDarrick J. Wong
1228e5edad52SDarrick J. Wong- Do child pointers point towards the leaves?
1229e5edad52SDarrick J. Wong
1230e5edad52SDarrick J. Wong- Do sibling pointers point across the same level?
1231e5edad52SDarrick J. Wong
1232e5edad52SDarrick J. Wong- For each node block record, does the record key accurate reflect the contents
1233e5edad52SDarrick J. Wong  of the child block?
1234e5edad52SDarrick J. Wong
1235e5edad52SDarrick J. WongSpace allocation records are cross-referenced as follows:
1236e5edad52SDarrick J. Wong
1237e5edad52SDarrick J. Wong1. Any space mentioned by any metadata structure are cross-referenced as
1238e5edad52SDarrick J. Wong   follows:
1239e5edad52SDarrick J. Wong
1240e5edad52SDarrick J. Wong   - Does the reverse mapping index list only the appropriate owner as the
1241e5edad52SDarrick J. Wong     owner of each block?
1242e5edad52SDarrick J. Wong
1243e5edad52SDarrick J. Wong   - Are none of the blocks claimed as free space?
1244e5edad52SDarrick J. Wong
1245e5edad52SDarrick J. Wong   - If these aren't file data blocks, are none of the blocks claimed as space
1246e5edad52SDarrick J. Wong     shared by different owners?
1247e5edad52SDarrick J. Wong
1248e5edad52SDarrick J. Wong2. Btree blocks are cross-referenced as follows:
1249e5edad52SDarrick J. Wong
1250e5edad52SDarrick J. Wong   - Everything in class 1 above.
1251e5edad52SDarrick J. Wong
1252e5edad52SDarrick J. Wong   - If there's a parent node block, do the keys listed for this block match the
1253e5edad52SDarrick J. Wong     keyspace of this block?
1254e5edad52SDarrick J. Wong
1255e5edad52SDarrick J. Wong   - Do the sibling pointers point to valid blocks?  Of the same level?
1256e5edad52SDarrick J. Wong
1257e5edad52SDarrick J. Wong   - Do the child pointers point to valid blocks?  Of the next level down?
1258e5edad52SDarrick J. Wong
1259e5edad52SDarrick J. Wong3. Free space btree records are cross-referenced as follows:
1260e5edad52SDarrick J. Wong
1261e5edad52SDarrick J. Wong   - Everything in class 1 and 2 above.
1262e5edad52SDarrick J. Wong
1263e5edad52SDarrick J. Wong   - Does the reverse mapping index list no owners of this space?
1264e5edad52SDarrick J. Wong
1265e5edad52SDarrick J. Wong   - Is this space not claimed by the inode index for inodes?
1266e5edad52SDarrick J. Wong
1267e5edad52SDarrick J. Wong   - Is it not mentioned by the reference count index?
1268e5edad52SDarrick J. Wong
1269e5edad52SDarrick J. Wong   - Is there a matching record in the other free space btree?
1270e5edad52SDarrick J. Wong
1271e5edad52SDarrick J. Wong4. Inode btree records are cross-referenced as follows:
1272e5edad52SDarrick J. Wong
1273e5edad52SDarrick J. Wong   - Everything in class 1 and 2 above.
1274e5edad52SDarrick J. Wong
   - Is there a matching record in the free inode btree?

   - Do cleared bits in the holemask correspond with inode clusters?

   - Do set bits in the freemask correspond with inode records with zero link
     count?

5. Inode records are cross-referenced as follows:

   - Everything in class 1.

   - Do all the fields that summarize information about the file forks actually
     match those forks?

   - Does each inode with zero link count correspond to a record in the free
     inode btree?

6. File fork space mapping records are cross-referenced as follows:

   - Everything in class 1 and 2 above.

   - Is this space not mentioned by the inode btrees?

   - If this is a CoW fork mapping, does it correspond to a CoW entry in the
     reference count btree?

7. Reference count records are cross-referenced as follows:

   - Everything in class 1 and 2 above.

   - Within the space subkeyspace of the rmap btree (that is to say, all
     records mapped to a particular space extent and ignoring the owner info),
     are there the same number of reverse mapping records for each block as the
     reference count record claims?
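
The last check in class 7 can be illustrated with a minimal sketch.
The record layouts and function names below are hypothetical simplifications
for illustration, not the ondisk XFS formats or the kernel's scrub code.
For every block covered by a reference count record, the sketch counts the
reverse mappings that cover that block, ignoring the owner, and compares the
result with the claimed reference count:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified records; not the ondisk XFS formats. */
struct rmap_rec {
    uint32_t start;    /* first AG block of the mapping */
    uint32_t len;      /* number of blocks mapped */
    uint64_t owner;    /* ignored here: only the space subkeyspace matters */
};

struct refc_rec {
    uint32_t start;    /* first AG block of the shared extent */
    uint32_t len;      /* number of blocks in the extent */
    uint32_t refcount; /* claimed number of owners per block */
};

/* Count the reverse mappings that cover a single AG block. */
static uint32_t rmap_owners_of(const struct rmap_rec *rmaps, size_t nr,
                               uint32_t agbno)
{
    uint32_t owners = 0;

    for (size_t i = 0; i < nr; i++)
        if (agbno >= rmaps[i].start &&
            agbno - rmaps[i].start < rmaps[i].len)
            owners++;
    return owners;
}

/* Does every block in the refcount record have exactly refcount owners? */
static bool refc_matches_rmaps(const struct refc_rec *rc,
                               const struct rmap_rec *rmaps, size_t nr)
{
    assert(rc->len > 0);
    for (uint32_t i = 0; i < rc->len; i++)
        if (rmap_owners_of(rmaps, nr, rc->start + i) != rc->refcount)
            return false;
    return true;
}
```

A real implementation would query the rmap btree by key range instead of
scanning an array, but the comparison being made is the same.
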

Proposed patchsets are the series to find gaps in
`refcount btree
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-refcount-gaps>`_,
`inode btree
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-inobt-gaps>`_, and
`rmap btree
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-rmapbt-gaps>`_ records;
to find
`mergeable records
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-mergeable-records>`_;
and to
`improve cross referencing with rmap
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-strengthen-rmap-checking>`_
before starting a repair.

Checking Extended Attributes
````````````````````````````

Extended attributes implement a key-value store that enables fragments of data
to be attached to any file.
Both the kernel and userspace can access the keys and values, subject to
namespace and privilege restrictions.
Most typically these fragments are metadata about the file -- origins, security
contexts, user-supplied labels, indexing information, etc.

Names can be as long as 255 bytes and can exist in several different
namespaces.
Values can be as large as 64KB.
A file's extended attributes are stored in blocks mapped by the attr fork.
The mappings point to leaf blocks, remote value blocks, or dabtree blocks.
Block 0 in the attribute fork is always the top of the structure, but otherwise
each of the three types of blocks can be found at any offset in the attr fork.
Leaf blocks contain attribute key records that point to the name and the value.
Names are always stored elsewhere in the same leaf block.
Values that are less than 3/4 the size of a filesystem block are also stored
elsewhere in the same leaf block.
Remote value blocks contain values that are too large to fit inside a leaf.
If the leaf information exceeds a single filesystem block, a dabtree (also
rooted at block 0) is created to map hashes of the attribute names to leaf
blocks in the attr fork.
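
The local-versus-remote rule for attribute values can be sketched as follows.
This is a simplified illustration of the 3/4-block threshold stated above; the
function names are hypothetical, and the sketch deliberately ignores the space
taken by the name and the entry header, which a real implementation would have
to account for:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Threshold from the text: 3/4 of a filesystem block. */
static uint32_t attr_local_value_limit(uint32_t blocksize)
{
    return blocksize / 2 + blocksize / 4;
}

/* True if the value stays in the leaf; false if it needs remote blocks. */
static bool attr_value_is_local(uint32_t valuelen, uint32_t blocksize)
{
    assert(blocksize > 0);
    return valuelen < attr_local_value_limit(blocksize);
}
```

With 4096-byte blocks, the sketch keeps values shorter than 3072 bytes in the
leaf and sends anything larger, up to the 64KB maximum, to remote value blocks.
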

Checking an extended attribute structure is not so straightforward due to the
lack of separation between attr blocks and index blocks.
Scrub must read each block mapped by the attr fork and ignore the non-leaf
blocks:

1. Walk the dabtree in the attr fork (if present) to ensure that there are no
   irregularities in the blocks or dabtree mappings that do not point to
   attr leaf blocks.

2. Walk the blocks of the attr fork looking for leaf blocks.
   For each entry inside a leaf:

   a. Validate that the name does not contain invalid characters.

   b. Read the attr value.
      This performs a named lookup of the attr name to ensure the correctness
      of the dabtree.
      If the value is stored in a remote block, this also validates the
      integrity of the remote value block.

Checking and Cross-Referencing Directories
``````````````````````````````````````````

The filesystem directory tree is a directed acyclic graph structure, with files
constituting the nodes, and directory entries (dirents) constituting the edges.
Directories are a special type of file containing a set of mappings from a
byte sequence of up to 255 bytes (name) to an inumber.
These are called directory entries, or dirents for short.
Each directory file must have exactly one directory pointing to the file.
A root directory points to itself.
Directory entries point to files of any type.
Each non-directory file may have multiple directories pointing to it.

In XFS, directories are implemented as a file containing up to three 32GB
partitions.
The first partition contains directory entry data blocks.
Each data block contains variable-sized records associating a user-provided
name with an inumber and, optionally, a file type.
If the directory entry data grows beyond one block, the second partition (which
exists as post-EOF extents) is populated with a block containing free space
information and an index that maps hashes of the dirent names to directory data
blocks in the first partition.
This makes directory name lookups very fast.
If this second partition grows beyond one block, the third partition is
populated with a linear array of free space information for faster
expansions.
If the free space has been separated and the second partition grows again
beyond one block, then a dabtree is used to map hashes of dirent names to
directory data blocks.
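
The partitioning scheme can be sketched as a function that classifies a byte
offset within the directory file.
The enum and function names are hypothetical; the sketch assumes only what the
text states, namely that the three partitions are each 32GB and laid out
consecutively from offset zero:

```c
#include <assert.h>
#include <stdint.h>

#define DIR_PARTITION_BYTES (32ULL << 30) /* 32GB per partition */

enum dir_partition {
    DIR_PART_DATA = 0,    /* directory entry data blocks */
    DIR_PART_INDEX = 1,   /* name hash index (leaf or dabtree blocks) */
    DIR_PART_FREE = 2,    /* free space information */
    DIR_PART_INVALID = 3, /* beyond the third partition */
};

/* Map a byte offset in the directory file to the partition holding it. */
static enum dir_partition dir_partition_of(uint64_t byte_offset)
{
    uint64_t idx = byte_offset / DIR_PARTITION_BYTES;

    return idx < DIR_PART_INVALID ? (enum dir_partition)idx
                                  : DIR_PART_INVALID;
}
```

A checker can use a classification like this to reject, for example, a dabtree
pointer that claims to reference a dirent data block but lands outside the
first partition.
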

Checking a directory is pretty straightforward:

1. Walk the dabtree in the second partition (if present) to ensure that there
   are no irregularities in the blocks or dabtree mappings that do not point to
   dirent blocks.

2. Walk the blocks of the first partition looking for directory entries.
   Each dirent is checked as follows:

   a. Does the name contain no invalid characters?

   b. Does the inumber correspond to an actual, allocated inode?

   c. Does the child inode have a nonzero link count?

   d. If a file type is included in the dirent, does it match the type of the
      inode?

   e. If the child is a subdirectory, does the child's dotdot pointer point
      back to the parent?

   f. If the directory has a second partition, perform a named lookup of the
      dirent name to ensure the correctness of the dabtree.

3. Walk the free space list in the third partition (if present) to ensure that
   the free spaces it describes are really unused.
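
Checks 2a through 2d above can be sketched as a single predicate.
The structures and names are hypothetical incore views invented for this
illustration, and the sketch assumes that the invalid characters in a dirent
name are the slash and the NUL byte.
Checks 2e and 2f need access to the child directory and the dabtree and are
left out:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DIRENT_NAME_MAX 255

/* Hypothetical incore views of a dirent and of the inode it points to. */
struct dirent_view {
    const char *name;
    size_t namelen;
    uint64_t ino;
    uint8_t ftype;
    bool has_ftype;    /* does the dirent record a file type? */
};

struct inode_view {
    uint64_t ino;
    bool allocated;
    uint32_t nlink;
    uint8_t ftype;
};

/* Check 2a: assumed here to mean no '/' and no NUL inside the name. */
static bool dirent_name_ok(const char *name, size_t len)
{
    if (len == 0 || len > DIRENT_NAME_MAX)
        return false;
    return !memchr(name, '/', len) && !memchr(name, '\0', len);
}

/* Checks 2a-2d; child is NULL if no inode was found for the inumber. */
static bool dirent_ok(const struct dirent_view *d,
                      const struct inode_view *child)
{
    if (!dirent_name_ok(d->name, d->namelen))
        return false;                           /* 2a */
    if (!child || !child->allocated || child->ino != d->ino)
        return false;                           /* 2b */
    if (child->nlink == 0)
        return false;                           /* 2c */
    if (d->has_ftype && d->ftype != child->ftype)
        return false;                           /* 2d */
    return true;
}
```
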

Checking operations involving :ref:`parents <dirparent>` and
:ref:`file link counts <nlinks>` are discussed in more detail in later
sections.

Checking Directory/Attribute Btrees
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

As stated in previous sections, the directory/attribute btree (dabtree) index
maps user-provided names to improve lookup times by avoiding linear scans.
Internally, it maps a 32-bit hash of the name to a block offset within the
appropriate file fork.

The internal structure of a dabtree closely resembles the btrees that record
fixed-size metadata records -- each dabtree block contains a magic number, a
checksum, sibling pointers, a UUID, a tree level, and a log sequence number.
The format of leaf and node records is the same -- each entry points to the
next level down in the hierarchy, with dabtree node records pointing to dabtree
leaf blocks, and dabtree leaf records pointing to non-dabtree blocks elsewhere
in the fork.

Checking and cross-referencing the dabtree is very similar to what is done for
space btrees:

- Does the type of data stored in the block match what scrub is expecting?

- Does the block belong to the owning structure that asked for the read?

- Do the records fit within the block?

- Are the records contained inside the block free of obvious corruptions?

- Are the name hashes in the correct order?

- Do node pointers within the dabtree point to valid fork offsets for dabtree
  blocks?

- Do leaf pointers within the dabtree point to valid fork offsets for directory
  or attr leaf blocks?

- Do child pointers point towards the leaves?

- Do sibling pointers point across the same level?

- For each dabtree node record, does the record key accurately reflect the
  contents of the child dabtree block?

- For each dabtree leaf record, does the record key accurately reflect the
  contents of the directory or attr block?
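
The hash ordering and record key checks in the list above can be sketched for
a single dabtree block.
The function is hypothetical, and it rests on an assumption that is not stated
in the text: that hashes within a block are stored in nondecreasing order
(equal hashes are allowed, since different names can collide) and that the key
in the parent node record is an upper bound on every hash in the child block:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Check the name hashes of one dabtree block.  If the block has a parent,
 * parent_key is the hash stored in the node record that points to it.
 */
static bool dablock_hashes_ok(const uint32_t *hashes, size_t nr,
                              bool has_parent, uint32_t parent_key)
{
    for (size_t i = 0; i < nr; i++) {
        /* Hashes must not go backwards within the block. */
        if (i > 0 && hashes[i] < hashes[i - 1])
            return false;
        /* Assumed invariant: no hash exceeds the parent's key. */
        if (has_parent && hashes[i] > parent_key)
            return false;
    }
    return true;
}
```
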

Cross-Referencing Summary Counters
``````````````````````````````````

XFS maintains three classes of summary counters: available resources, quota
resource usage, and file link counts.

In theory, the amount of available resources (data blocks, inodes, realtime
extents) can be found by walking the entire filesystem.
This would make for very slow reporting, so a transactional filesystem can
maintain summaries of this information in the superblock.
Cross-referencing these values against the filesystem metadata should be a
simple matter of walking the free space and inode metadata in each AG and the
realtime bitmap, but there are complications that will be discussed in
:ref:`more detail <fscounters>` later.
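
In its simplest form, the cross-reference amounts to summing the per-AG values
and comparing the totals with the superblock summary.
The sketch below uses hypothetical structures and covers only free blocks and
free inodes.
It deliberately ignores the complications mentioned above, so it shows the
idea and not a usable check for a live filesystem:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-AG and filesystem-wide counters. */
struct ag_counts {
    uint64_t free_blocks;
    uint64_t free_inodes;
};

struct fs_summary {
    uint64_t free_blocks;
    uint64_t free_inodes;
};

/* Walk every AG and total up what it reports. */
static struct fs_summary sum_ags(const struct ag_counts *ags, size_t nr)
{
    struct fs_summary sum = { 0, 0 };

    for (size_t i = 0; i < nr; i++) {
        sum.free_blocks += ags[i].free_blocks;
        sum.free_inodes += ags[i].free_inodes;
    }
    return sum;
}

/* Does the superblock summary agree with the per-AG totals? */
static bool summary_matches(const struct fs_summary *sb,
                            const struct ag_counts *ags, size_t nr)
{
    struct fs_summary sum = sum_ags(ags, nr);

    return sum.free_blocks == sb->free_blocks &&
           sum.free_inodes == sb->free_inodes;
}
```
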

:ref:`Quota usage <quotacheck>` and :ref:`file link count <nlinks>`
checking are sufficiently complicated to warrant separate sections.

Post-Repair Reverification
``````````````````````````

After performing a repair, the checking code is run a second time to validate
the new structure, and the results of the health assessment are recorded
internally and returned to the calling process.
This step is critical for enabling system administrators to monitor the status
of the filesystem and the progress of any repairs.
For developers, it is a useful means to judge the efficacy of error detection
and correction in the online and offline checking tools.

Eventual Consistency vs. Online Fsck
------------------------------------

Complex operations can make modifications to multiple per-AG data structures
with a chain of transactions.
These chains, once committed to the log, are restarted during log recovery if
the system crashes while processing the chain.
Because the AG header buffers are unlocked between transactions within a chain,
online checking must coordinate with chained operations that are in progress to
avoid incorrectly detecting inconsistencies due to pending chains.
Furthermore, online repair must not run when operations are pending because
the metadata are temporarily inconsistent with each other, and rebuilding is
not possible.

Only online fsck has this requirement of total consistency of AG metadata, and
online fsck should be relatively rare compared to filesystem change operations.
Online fsck coordinates with transaction chains as follows:

* For each AG, maintain a count of intent items targeting that AG.
  The count should be bumped whenever a new item is added to the chain.
  The count should be dropped when the filesystem has locked the AG header
  buffers and finished the work.

* When online fsck wants to examine an AG, it should lock the AG header
  buffers to quiesce all transaction chains that want to modify that AG.
  If the count is zero, proceed with the checking operation.
  If it is nonzero, cycle the buffer locks to allow the chain to make forward
  progress.

This may lead to online fsck taking a long time to complete, but regular
filesystem updates take precedence over background checking activity.
Details about the discovery of this situation are presented in the
:ref:`next section <chain_coordination>`, and details about the solution
are presented :ref:`after that <intent_drains>`.

.. _chain_coordination:

Discovery of the Problem
````````````````````````

Midway through the development of online scrubbing, the fsstress tests
uncovered a misinteraction between online fsck and compound transaction chains
created by other writer threads that resulted in false reports of metadata
inconsistency.
The root cause of these reports is the eventual consistency model introduced by
the expansion of deferred work items and compound transaction chains when
reverse mapping and reflink were introduced.

Originally, transaction chains were added to XFS to avoid deadlocks when
unmapping space from files.
Deadlock avoidance rules require that AGs only be locked in increasing order,
which makes it impossible (say) to use a single transaction to free a space
extent in AG 7 and then try to free a now superfluous block mapping btree block
in AG 3.
To avoid these kinds of deadlocks, XFS creates Extent Freeing Intent (EFI) log
items to commit to freeing some space in one transaction while deferring the
actual metadata updates to a fresh transaction.
The transaction sequence looks like this:

1. The first transaction contains a physical update to the file's block mapping
   structures to remove the mapping from the btree blocks.
   It then attaches to the in-memory transaction an action item to schedule
   deferred freeing of space.
   Concretely, each transaction maintains a list of ``struct
   xfs_defer_pending`` objects, each of which maintains a list of ``struct
   xfs_extent_free_item`` objects.
   Returning to the example above, the action item tracks the freeing of both
   the unmapped space from AG 7 and the block mapping btree (BMBT) block from
   AG 3.
   Deferred frees recorded in this manner are committed in the log by creating
   an EFI log item from the ``struct xfs_extent_free_item`` object and
   attaching the log item to the transaction.
   When the log is persisted to disk, the EFI item is written into the ondisk
   transaction record.
   EFIs can list up to 16 extents to free, all sorted in AG order.

2. The second transaction contains a physical update to the free space btrees
   of AG 3 to release the former BMBT block and a second physical update to the
   free space btrees of AG 7 to release the unmapped file space.
   Observe that the physical updates are resequenced in the correct order
   when possible.
   Attached to the transaction is an extent free done (EFD) log item.
   The EFD contains a pointer to the EFI logged in transaction #1 so that log
   recovery can tell if the EFI needs to be replayed.
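
The resequencing of the extents in an EFI, described in the two steps above,
can be sketched as a sort by AG number.
The structure and function names are hypothetical stand-ins for the incore
extent free items.
The secondary sort on the block number is an addition made only to keep the
result deterministic, and is not taken from the text:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define EFI_MAX_EXTENTS 16 /* limit stated in the text */

/* Hypothetical stand-in for one extent queued for deferred freeing. */
struct free_extent {
    uint32_t agno;  /* allocation group number */
    uint32_t agbno; /* first block within the AG */
    uint32_t len;   /* number of blocks to free */
};

/* Order by AG so that AG headers get locked in increasing order. */
static int cmp_free_extent(const void *a, const void *b)
{
    const struct free_extent *x = a;
    const struct free_extent *y = b;

    if (x->agno != y->agno)
        return x->agno < y->agno ? -1 : 1;
    if (x->agbno != y->agbno)
        return x->agbno < y->agbno ? -1 : 1;
    return 0;
}

static void sort_extents_by_ag(struct free_extent *ext, size_t nr)
{
    assert(nr <= EFI_MAX_EXTENTS);
    qsort(ext, nr, sizeof(*ext), cmp_free_extent);
}
```

Applied to the running example, an item list holding the unmapped space in
AG 7 followed by the BMBT block in AG 3 comes out with the AG 3 extent first.
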

If the system goes down after transaction #1 is written back to the filesystem
but before #2 is committed, a scan of the filesystem metadata would show
inconsistent filesystem metadata because there would not appear to be any owner
of the unmapped space.
Happily, log recovery corrects this inconsistency for us -- when recovery finds
an intent log item but does not find a corresponding intent done item, it will
reconstruct the incore state of the intent item and finish it.
In the example above, the log must replay both frees described in the recovered
EFI to complete the recovery phase.
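
The matching of intent items with intent done items during recovery can be
sketched as follows.
The log item representation is a hypothetical simplification in which each EFD
carries the identifier of the EFI that it completes; it is not the ondisk log
format:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum log_item_type {
    LOG_ITEM_EFI, /* extent freeing intent */
    LOG_ITEM_EFD, /* extent free done */
};

/* Hypothetical recovered log item, in the order it was logged. */
struct log_item {
    enum log_item_type type;
    uint64_t efi_id; /* EFI: its own id; EFD: the EFI it completes */
};

/*
 * Collect the ids of EFIs that have no matching EFD later in the log.
 * These are the intents that recovery must reconstruct and finish.
 * Returns the number of ids written to out.
 */
static size_t find_unfinished_efis(const struct log_item *items, size_t nr,
                                   uint64_t *out, size_t out_cap)
{
    size_t found = 0;

    for (size_t i = 0; i < nr; i++) {
        bool done = false;

        if (items[i].type != LOG_ITEM_EFI)
            continue;
        for (size_t j = i + 1; j < nr && !done; j++)
            done = items[j].type == LOG_ITEM_EFD &&
                   items[j].efi_id == items[i].efi_id;
        if (!done) {
            assert(found < out_cap);
            out[found++] = items[i].efi_id;
        }
    }
    return found;
}
```
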
1600bae43864SDarrick J. Wong
1601bae43864SDarrick J. WongThere are subtleties to XFS' transaction chaining strategy to consider:
1602bae43864SDarrick J. Wong
1603bae43864SDarrick J. Wong* Log items must be added to a transaction in the correct order to prevent
1604bae43864SDarrick J. Wong  conflicts with principal objects that are not held by the transaction.
1605bae43864SDarrick J. Wong  In other words, all per-AG metadata updates for an unmapped block must be
1606bae43864SDarrick J. Wong  completed before the last update to free the extent, and extents should not
1607bae43864SDarrick J. Wong  be reallocated until that last update commits to the log.
1608bae43864SDarrick J. Wong
1609bae43864SDarrick J. Wong* AG header buffers are released between each transaction in a chain.
1610bae43864SDarrick J. Wong  This means that other threads can observe an AG in an intermediate state,
1611bae43864SDarrick J. Wong  but as long as the first subtlety is handled, this should not affect the
1612bae43864SDarrick J. Wong  correctness of filesystem operations.
1613bae43864SDarrick J. Wong
1614bae43864SDarrick J. Wong* Unmounting the filesystem flushes all pending work to disk, which means that
1615bae43864SDarrick J. Wong  offline fsck never sees the temporary inconsistencies caused by deferred
1616bae43864SDarrick J. Wong  work item processing.
1617bae43864SDarrick J. Wong
1618bae43864SDarrick J. WongIn this manner, XFS employs a form of eventual consistency to avoid deadlocks
1619bae43864SDarrick J. Wongand increase parallelism.
1620bae43864SDarrick J. Wong
1621bae43864SDarrick J. WongDuring the design phase of the reverse mapping and reflink features, it was
1622bae43864SDarrick J. Wongdecided that it was impractical to cram all the reverse mapping updates for a
1623bae43864SDarrick J. Wongsingle filesystem change into a single transaction because a single file
1624bae43864SDarrick J. Wongmapping operation can explode into many small updates:
1625bae43864SDarrick J. Wong
1626bae43864SDarrick J. Wong* The block mapping update itself
1627bae43864SDarrick J. Wong* A reverse mapping update for the block mapping update
1628bae43864SDarrick J. Wong* Fixing the freelist
1629bae43864SDarrick J. Wong* A reverse mapping update for the freelist fix
1630bae43864SDarrick J. Wong
1631bae43864SDarrick J. Wong* A shape change to the block mapping btree
1632bae43864SDarrick J. Wong* A reverse mapping update for the btree update
1633bae43864SDarrick J. Wong* Fixing the freelist (again)
1634bae43864SDarrick J. Wong* A reverse mapping update for the freelist fix
1635bae43864SDarrick J. Wong
1636bae43864SDarrick J. Wong* An update to the reference counting information
1637bae43864SDarrick J. Wong* A reverse mapping update for the refcount update
1638bae43864SDarrick J. Wong* Fixing the freelist (a third time)
1639bae43864SDarrick J. Wong* A reverse mapping update for the freelist fix
1640bae43864SDarrick J. Wong
1641bae43864SDarrick J. Wong* Freeing any space that was unmapped and not owned by any other file
1642bae43864SDarrick J. Wong* Fixing the freelist (a fourth time)
1643bae43864SDarrick J. Wong* A reverse mapping update for the freelist fix
1644bae43864SDarrick J. Wong
1645bae43864SDarrick J. Wong* Freeing the space used by the block mapping btree
1646bae43864SDarrick J. Wong* Fixing the freelist (a fifth time)
1647bae43864SDarrick J. Wong* A reverse mapping update for the freelist fix
1648bae43864SDarrick J. Wong
1649bae43864SDarrick J. WongFree list fixups are not usually needed more than once per AG per transaction
1650bae43864SDarrick J. Wongchain, but it is theoretically possible if space is very tight.
1651bae43864SDarrick J. WongFor copy-on-write updates this is even worse, because this must be done once to
1652bae43864SDarrick J. Wongremove the space from a staging area and again to map it into the file!
1653bae43864SDarrick J. Wong
1654bae43864SDarrick J. WongTo deal with this explosion in a calm manner, XFS expands its use of deferred
1655bae43864SDarrick J. Wongwork items to cover most reverse mapping updates and all refcount updates.
1656bae43864SDarrick J. WongThis reduces the worst case size of transaction reservations by breaking the
1657bae43864SDarrick J. Wongwork into a long chain of small updates, which increases the degree of eventual
1658bae43864SDarrick J. Wongconsistency in the system.
1659bae43864SDarrick J. WongAgain, this generally isn't a problem because XFS orders its deferred work
1660bae43864SDarrick J. Wongitems carefully to avoid resource reuse conflicts between unsuspecting threads.
1661bae43864SDarrick J. Wong
1662bae43864SDarrick J. WongHowever, online fsck changes the rules -- remember that although physical
1663bae43864SDarrick J. Wongupdates to per-AG structures are coordinated by locking the buffers for AG
1664bae43864SDarrick J. Wongheaders, buffer locks are dropped between transactions.
1665bae43864SDarrick J. WongOnce scrub acquires resources and takes locks for a data structure, it must do
1666bae43864SDarrick J. Wongall the validation work without releasing the lock.
1667bae43864SDarrick J. WongIf the main lock for a space btree is an AG header buffer lock, scrub may have
1668bae43864SDarrick J. Wonginterrupted another thread that is midway through finishing a chain.
1669bae43864SDarrick J. WongFor example, if a thread performing a copy-on-write has completed a reverse
1670bae43864SDarrick J. Wongmapping update but not the corresponding refcount update, the two AG btrees
1671bae43864SDarrick J. Wongwill appear inconsistent to scrub and an observation of corruption will be
1672bae43864SDarrick J. Wongrecorded.  This observation will not be correct.
1673bae43864SDarrick J. WongIf a repair is attempted in this state, the results will be catastrophic!
1674bae43864SDarrick J. Wong
1675bae43864SDarrick J. WongSeveral other solutions to this problem were evaluated upon discovery of this
1676bae43864SDarrick J. Wongflaw and rejected:
1677bae43864SDarrick J. Wong
1678bae43864SDarrick J. Wong1. Add a higher level lock to allocation groups and require writer threads to
1679bae43864SDarrick J. Wong   acquire the higher level lock in AG order before making any changes.
1680bae43864SDarrick J. Wong   This would be very difficult to implement in practice because it is
1681bae43864SDarrick J. Wong   difficult to determine which locks need to be obtained, and in what order,
1682bae43864SDarrick J. Wong   without simulating the entire operation.
1683bae43864SDarrick J. Wong   Performing a dry run of a file operation to discover necessary locks would
1684bae43864SDarrick J. Wong   make the filesystem very slow.
1685bae43864SDarrick J. Wong
1686bae43864SDarrick J. Wong2. Make the deferred work coordinator code aware of consecutive intent items
1687bae43864SDarrick J. Wong   targeting the same AG and have it hold the AG header buffers locked across
1688bae43864SDarrick J. Wong   the transaction roll between updates.
1689bae43864SDarrick J. Wong   This would introduce a lot of complexity into the coordinator since it is
1690bae43864SDarrick J. Wong   only loosely coupled with the actual deferred work items.
1691bae43864SDarrick J. Wong   It would also fail to solve the problem because deferred work items can
1692bae43864SDarrick J. Wong   generate new deferred subtasks, but all subtasks must be complete before
1693bae43864SDarrick J. Wong   work can start on a new sibling task.
1694bae43864SDarrick J. Wong
1695bae43864SDarrick J. Wong3. Teach online fsck to walk all transactions waiting for whichever lock(s)
1696bae43864SDarrick J. Wong   protect the data structure being scrubbed to look for pending operations.
1697bae43864SDarrick J. Wong   The checking and repair operations must factor these pending operations into
1698bae43864SDarrick J. Wong   the evaluations being performed.
1699bae43864SDarrick J. Wong   This solution is a nonstarter because it is *extremely* invasive to the main
1700bae43864SDarrick J. Wong   filesystem.
1701bae43864SDarrick J. Wong
1702bae43864SDarrick J. Wong.. _intent_drains:
1703bae43864SDarrick J. Wong
1704bae43864SDarrick J. WongIntent Drains
1705bae43864SDarrick J. Wong`````````````
1706bae43864SDarrick J. Wong
1707bae43864SDarrick J. WongOnline fsck uses an atomic intent item counter and lock cycling to coordinate
1708bae43864SDarrick J. Wongwith transaction chains.
1709bae43864SDarrick J. WongThere are two key properties to the drain mechanism.
1710bae43864SDarrick J. WongFirst, the counter is incremented when a deferred work item is *queued* to a
1711bae43864SDarrick J. Wongtransaction, and it is decremented after the associated intent done log item is
1712bae43864SDarrick J. Wong*committed* to another transaction.
1713bae43864SDarrick J. WongThe second property is that deferred work can be added to a transaction without
1714bae43864SDarrick J. Wongholding an AG header lock, but per-AG work items cannot be marked done without
1715bae43864SDarrick J. Wonglocking that AG header buffer to log the physical updates and the intent done
1716bae43864SDarrick J. Wonglog item.
1717bae43864SDarrick J. WongThe first property enables scrub to yield to running transaction chains, which
1718bae43864SDarrick J. Wongis an explicit deprioritization of online fsck to benefit file operations.
1719bae43864SDarrick J. WongThe second property of the drain is key to the correct coordination of scrub,
1720bae43864SDarrick J. Wongsince scrub will always be able to decide if a conflict is possible.

For regular filesystem code, the drain works as follows:

1. Call the appropriate subsystem function to add a deferred work item to a
   transaction.

2. The function calls ``xfs_defer_drain_bump`` to increase the counter.

3. When the deferred item manager wants to finish the deferred work item, it
   calls ``->finish_item`` to complete it.

4. The ``->finish_item`` implementation logs some changes and calls
   ``xfs_defer_drain_drop`` to decrease the counter and wake up any threads
   waiting on the drain.

5. The subtransaction commits, which unlocks the resource associated with the
   intent item.

For scrub, the drain works as follows:

1. Lock the resource(s) associated with the metadata being scrubbed.
   For example, a scan of the refcount btree would lock the AGI and AGF header
   buffers.

2. If the counter is zero (``xfs_defer_drain_busy`` returns false), there are no
   chains in progress and the operation may proceed.

3. Otherwise, release the resources grabbed in step 1.

4. Wait for the intent counter to reach zero (``xfs_defer_drain_intents``), then go
   back to step 1 unless a signal has been caught.

To avoid polling in step 4, the drain provides a waitqueue for scrub threads to
be woken up whenever the intent count drops to zero.
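As an illustration only, the counter side of the drain can be modelled in
userspace C with a C11 atomic.
The ``drain_model_*`` and ``scrub_model_try_lock`` names are hypothetical
stand-ins and are not the kernel functions named above; the waitqueue and the
real AG header locks are left out, so the scrub side merely reports whether it
has to back off and retry.

.. code-block:: c

	#include <stdatomic.h>
	#include <stdbool.h>

	/* Hypothetical userspace model of a per-AG intent drain. */
	struct drain_model {
		atomic_uint	pending;	/* intents queued but not yet done */
	};

	/* Regular filesystem: a deferred work item was queued. */
	static void drain_model_bump(struct drain_model *dr)
	{
		atomic_fetch_add(&dr->pending, 1);
	}

	/* Regular filesystem: the intent done item was logged. */
	static void drain_model_drop(struct drain_model *dr)
	{
		/* The kernel would also wake waiters once this reaches zero. */
		atomic_fetch_sub(&dr->pending, 1);
	}

	/* Scrub: are any transaction chains still in flight? */
	static bool drain_model_busy(struct drain_model *dr)
	{
		return atomic_load(&dr->pending) != 0;
	}

	/*
	 * Scrub: one pass over steps 1-3.  Returns true if scrub may proceed
	 * with the (modelled) AG lock held, or false if it released the lock
	 * and must wait for the drain before trying again.
	 */
	static bool scrub_model_try_lock(struct drain_model *dr, bool *ag_locked)
	{
		*ag_locked = true;		/* step 1: lock the AG headers */
		if (!drain_model_busy(dr))
			return true;		/* step 2: no chains in progress */
		*ag_locked = false;		/* step 3: release and back off */
		return false;
	}

In this model a caller would loop on ``scrub_model_try_lock`` until it returns
true, sleeping between attempts instead of polling.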

The proposed patchset is the
`scrub intent drain series
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-drain-intents>`_.

.. _jump_labels:

Static Keys (aka Jump Label Patching)
`````````````````````````````````````

Online fsck for XFS separates the regular filesystem from the checking and
repair code as much as possible.
However, there are a few parts of online fsck (such as the intent drains, and
later, live update hooks) where it is useful for the online fsck code to know
what's going on in the rest of the filesystem.
Since it is not expected that online fsck will be constantly running in the
background, it is very important to minimize the runtime overhead imposed by
these hooks when online fsck is compiled into the kernel but not actively
running on behalf of userspace.
Taking locks in the hot path of a writer thread to access a data structure only
to find that no further action is necessary is expensive -- on the author's
computer, this has an overhead of 40-50ns per access.
Fortunately, the kernel supports dynamic code patching, which enables XFS to
replace a static branch to hook code with ``nop`` sleds when online fsck isn't
running.
This sled has an overhead of however long it takes the instruction decoder to
skip past the sled, which seems to be on the order of less than 1ns and
does not access memory outside of instruction fetching.

When online fsck enables the static key, the sled is replaced with an
unconditional branch to call the hook code.
The switchover is quite expensive (~22000ns) but is paid entirely by the
program that invoked online fsck, and can be amortized if multiple threads
enter online fsck at the same time, or if multiple filesystems are being
checked at the same time.
Changing the branch direction requires taking the CPU hotplug lock, and since
CPU initialization requires memory allocation, online fsck must be careful not
to change a static key while holding any locks or resources that could be
accessed in the memory reclaim paths.
To minimize contention on the CPU hotplug lock, care should be taken not to
enable or disable static keys unnecessarily.

Because static keys are intended to minimize hook overhead for regular
filesystem operations when xfs_scrub is not running, the intended usage
patterns are as follows:

- The hooked part of XFS should declare a static-scoped static key that
  defaults to false.
  The ``DEFINE_STATIC_KEY_FALSE`` macro takes care of this.
  The static key itself should be declared as a ``static`` variable.

- When deciding to invoke code that's only used by scrub, the regular
  filesystem should call the ``static_branch_unlikely`` predicate to avoid the
  scrub-only hook code if the static key is not enabled.

- The regular filesystem should export helper functions that call
  ``static_branch_inc`` to enable and ``static_branch_dec`` to disable the
  static key.
  Wrapper functions make it easy to compile out the relevant code if the kernel
  distributor turns off online fsck at build time.

- Scrub functions wanting to turn on scrub-only XFS functionality should call
  ``xchk_fsgates_enable`` from the setup function to enable a specific hook.
  This must be done before obtaining any resources that are used by memory
  reclaim.
  Callers had better be sure they really need the functionality gated by the
  static key; the ``TRY_HARDER`` flag is useful here.
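The counting behaviour behind this pattern can be sketched in userspace C.
This is only a model: a real static key patches the instruction stream instead
of reading a variable, and the ``hook_gate_*`` and ``fs_model_update`` names
are hypothetical.
What the sketch does show is that the gate stays enabled for as long as at
least one enable call has not yet been matched by a disable call.

.. code-block:: c

	#include <stdatomic.h>

	/* Stand-in for a static key that defaults to false (hypothetical). */
	static atomic_int hook_gate_users;

	/* Stand-ins for the helpers exported by the regular filesystem. */
	static void hook_gate_enable(void)
	{
		atomic_fetch_add(&hook_gate_users, 1);
	}

	static void hook_gate_disable(void)
	{
		atomic_fetch_sub(&hook_gate_users, 1);
	}

	static int hook_calls;	/* how many times the scrub-only hook ran */

	/* Hot path of the regular filesystem. */
	static void fs_model_update(void)
	{
		/* In the kernel, this test would be static_branch_unlikely(). */
		if (atomic_load(&hook_gate_users) > 0)
			hook_calls++;	/* scrub-only hook code */

		/* ... the regular update happens either way ... */
	}

Because the enable and disable helpers nest, two scrub threads can enable the
same gate and the hook remains active until both have finished.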

Online scrub has resource acquisition helpers (e.g. ``xchk_perag_lock``) to
handle locking AGI and AGF buffers for all scrubber functions.
If a helper detects a conflict between scrub and the running transactions, it
will try to wait for intents to complete.
If the caller of the helper has not enabled the static key, the helper will
return -EDEADLOCK, which should result in the scrub being restarted with the
``TRY_HARDER`` flag set.
The scrub setup function should detect that flag, enable the static key, and
try the scrub again.
Scrub teardown disables all static keys obtained by ``xchk_fsgates_enable``.
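The retry protocol described above can be sketched as a small userspace state
machine.
Every name here (``scrub_model``, ``MODEL_TRY_HARDER``, and the helper
functions) is hypothetical, ``-EDEADLK`` stands in for the ``-EDEADLOCK``
return value, and "waiting for the drain" is reduced to clearing a flag.

.. code-block:: c

	#include <errno.h>
	#include <stdbool.h>

	#define MODEL_TRY_HARDER	(1U << 0)	/* hypothetical flag bit */

	struct scrub_model {
		unsigned int	flags;
		bool		drain_gate_on;	/* did setup enable the key? */
		bool		intents_busy;	/* pretend a chain is in flight */
	};

	/* Setup: only pay for the static key when asked to try harder. */
	static void scrub_model_setup(struct scrub_model *sc)
	{
		if (sc->flags & MODEL_TRY_HARDER)
			sc->drain_gate_on = true;
	}

	/* Lock helper: without the gate it cannot wait, so it gives up. */
	static int scrub_model_lock(struct scrub_model *sc)
	{
		if (!sc->intents_busy)
			return 0;
		if (!sc->drain_gate_on)
			return -EDEADLK;	/* stand-in for -EDEADLOCK */
		sc->intents_busy = false;	/* model: waited for the drain */
		return 0;
	}

	/* Teardown: drop whatever gates setup enabled. */
	static void scrub_model_teardown(struct scrub_model *sc)
	{
		sc->drain_gate_on = false;
	}

	/* Run one scrub, retrying once with TRY_HARDER after a deadlock error. */
	static int scrub_model_run(struct scrub_model *sc, int *attempts)
	{
		int error;

		*attempts = 0;
		for (;;) {
			(*attempts)++;
			scrub_model_setup(sc);
			error = scrub_model_lock(sc);
			scrub_model_teardown(sc);
			if (error != -EDEADLK || (sc->flags & MODEL_TRY_HARDER))
				break;
			sc->flags |= MODEL_TRY_HARDER;
		}
		return error;
	}

A scrub that never meets a conflict completes in one attempt without enabling
the gate, which is the common case that the static key is meant to keep cheap.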

For more information, please see the kernel documentation of
Documentation/staging/static-keys.rst.

.. _xfile:

Pageable Kernel Memory
----------------------

Some online checking functions work by scanning the filesystem to build a
shadow copy of an ondisk metadata structure in memory and comparing the two
copies.
For online repair to rebuild a metadata structure, it must compute the record
set that will be stored in the new structure before it can persist that new
structure to disk.
Ideally, repairs complete with a single atomic commit that introduces
a new data structure.
To meet these goals, the kernel needs to collect a large amount of information
in a place that doesn't require the correct operation of the filesystem.
18535f658dadSDarrick J. Wong
18545f658dadSDarrick J. WongKernel memory isn't suitable because:
18555f658dadSDarrick J. Wong
18565f658dadSDarrick J. Wong* Allocating a contiguous region of memory to create a C array is very
18575f658dadSDarrick J. Wong  difficult, especially on 32-bit systems.
18585f658dadSDarrick J. Wong
18595f658dadSDarrick J. Wong* Linked lists of records introduce double pointer overhead which is very high
18605f658dadSDarrick J. Wong  and eliminate the possibility of indexed lookups.
18615f658dadSDarrick J. Wong
18625f658dadSDarrick J. Wong* Kernel memory is pinned, which can drive the system into OOM conditions.
18635f658dadSDarrick J. Wong
18645f658dadSDarrick J. Wong* The system might not have sufficient memory to stage all the information.
18655f658dadSDarrick J. Wong
18665f658dadSDarrick J. WongAt any given time, online fsck does not need to keep the entire record set in
18675f658dadSDarrick J. Wongmemory, which means that individual records can be paged out if necessary.
18685f658dadSDarrick J. WongContinued development of online fsck demonstrated that the ability to perform
18695f658dadSDarrick J. Wongindexed data storage would also be very useful.
18705f658dadSDarrick J. WongFortunately, the Linux kernel already has a facility for byte-addressable and
18715f658dadSDarrick J. Wongpageable storage: tmpfs.
18725f658dadSDarrick J. WongIn-kernel graphics drivers (most notably i915) take advantage of tmpfs files
18735f658dadSDarrick J. Wongto store intermediate data that doesn't need to be in memory at all times, so
18745f658dadSDarrick J. Wongthat usage precedent is already established.
18755f658dadSDarrick J. WongHence, the ``xfile`` was born!
18765f658dadSDarrick J. Wong
18775f658dadSDarrick J. Wong+--------------------------------------------------------------------------+
18785f658dadSDarrick J. Wong| **Historical Sidebar**:                                                  |
18795f658dadSDarrick J. Wong+--------------------------------------------------------------------------+
18805f658dadSDarrick J. Wong| The first edition of online repair inserted records into a new btree as  |
18815f658dadSDarrick J. Wong| it found them, which failed because filesystem could shut down with a    |
18825f658dadSDarrick J. Wong| built data structure, which would be live after recovery finished.       |
18835f658dadSDarrick J. Wong|                                                                          |
18845f658dadSDarrick J. Wong| The second edition solved the half-rebuilt structure problem by storing  |
18855f658dadSDarrick J. Wong| everything in memory, but frequently ran the system out of memory.       |
18865f658dadSDarrick J. Wong|                                                                          |
18875f658dadSDarrick J. Wong| The third edition solved the OOM problem by using linked lists, but the  |
18885f658dadSDarrick J. Wong| memory overhead of the list pointers was extreme.                        |
18895f658dadSDarrick J. Wong+--------------------------------------------------------------------------+
18905f658dadSDarrick J. Wong
18915f658dadSDarrick J. Wongxfile Access Models
18925f658dadSDarrick J. Wong```````````````````
18935f658dadSDarrick J. Wong
18945f658dadSDarrick J. WongA survey of the intended uses of xfiles suggested these use cases:
18955f658dadSDarrick J. Wong
18965f658dadSDarrick J. Wong1. Arrays of fixed-sized records (space management btrees, directory and
18975f658dadSDarrick J. Wong   extended attribute entries)
18985f658dadSDarrick J. Wong
18995f658dadSDarrick J. Wong2. Sparse arrays of fixed-sized records (quotas and link counts)
19005f658dadSDarrick J. Wong
19015f658dadSDarrick J. Wong3. Large binary objects (BLOBs) of variable sizes (directory and extended
19025f658dadSDarrick J. Wong   attribute names and values)
19035f658dadSDarrick J. Wong
19045f658dadSDarrick J. Wong4. Staging btrees in memory (reverse mapping btrees)
19055f658dadSDarrick J. Wong
19065f658dadSDarrick J. Wong5. Arbitrary contents (realtime space management)
19075f658dadSDarrick J. Wong
19085f658dadSDarrick J. WongTo support the first four use cases, high level data structures wrap the xfile
19095f658dadSDarrick J. Wongto share functionality between online fsck functions.
19105f658dadSDarrick J. WongThe rest of this section discusses the interfaces that the xfile presents to
19115f658dadSDarrick J. Wongfour of those five higher level data structures.
19125f658dadSDarrick J. WongThe fifth use case is discussed in the :ref:`realtime summary <rtsummary>` case
19135f658dadSDarrick J. Wongstudy.
19145f658dadSDarrick J. Wong
19155f658dadSDarrick J. WongThe most general storage interface supported by the xfile enables the reading
19165f658dadSDarrick J. Wongand writing of arbitrary quantities of data at arbitrary offsets in the xfile.
19175f658dadSDarrick J. WongThis capability is provided by ``xfile_pread`` and ``xfile_pwrite`` functions,
19185f658dadSDarrick J. Wongwhich behave similarly to their userspace counterparts.
19195f658dadSDarrick J. WongXFS is very record-based, which suggests that the ability to load and store
19205f658dadSDarrick J. Wongcomplete records is important.
19215f658dadSDarrick J. WongTo support these cases, a pair of ``xfile_obj_load`` and ``xfile_obj_store``
19225f658dadSDarrick J. Wongfunctions are provided to read and persist objects into an xfile.
19235f658dadSDarrick J. WongThey are internally the same as pread and pwrite, except that they treat any
19245f658dadSDarrick J. Wongerror as an out of memory error.
19255f658dadSDarrick J. WongFor online repair, squashing error conditions in this manner is an acceptable
19265f658dadSDarrick J. Wongbehavior because the only reaction is to abort the operation back to userspace.
19275f658dadSDarrick J. WongAll five xfile usecases can be serviced by these four functions.
19285f658dadSDarrick J. Wong
19295f658dadSDarrick J. WongHowever, no discussion of file access idioms is complete without answering the
19305f658dadSDarrick J. Wongquestion, "But what about mmap?"
19315f658dadSDarrick J. WongIt is convenient to access storage directly with pointers, just like userspace
19325f658dadSDarrick J. Wongcode does with regular memory.
19335f658dadSDarrick J. WongOnline fsck must not drive the system into OOM conditions, which means that
19345f658dadSDarrick J. Wongxfiles must be responsive to memory reclamation.
19355f658dadSDarrick J. Wongtmpfs can only push a pagecache folio to the swap cache if the folio is neither
19365f658dadSDarrick J. Wongpinned nor locked, which means the xfile must not pin too many folios.
19375f658dadSDarrick J. Wong
19385f658dadSDarrick J. WongShort term direct access to xfile contents is done by locking the pagecache
19395f658dadSDarrick J. Wongfolio and mapping it into kernel address space.
19405f658dadSDarrick J. WongProgrammatic access (e.g. pread and pwrite) uses this mechanism.
19415f658dadSDarrick J. WongFolio locks are not supposed to be held for long periods of time, so long
19425f658dadSDarrick J. Wongterm direct access to xfile contents is done by bumping the folio refcount,
19435f658dadSDarrick J. Wongmapping it into kernel address space, and dropping the folio lock.
19445f658dadSDarrick J. WongThese long term users *must* be responsive to memory reclaim by hooking into
19455f658dadSDarrick J. Wongthe shrinker infrastructure to know when to release folios.
19465f658dadSDarrick J. Wong
19475f658dadSDarrick J. WongThe ``xfile_get_page`` and ``xfile_put_page`` functions are provided to
19485f658dadSDarrick J. Wongretrieve the (locked) folio that backs part of an xfile and to release it.
19495f658dadSDarrick J. WongThe only code to use these folio lease functions are the xfarray
19505f658dadSDarrick J. Wong:ref:`sorting<xfarray_sort>` algorithms and the :ref:`in-memory
19515f658dadSDarrick J. Wongbtrees<xfbtree>`.
19525f658dadSDarrick J. Wong
19535f658dadSDarrick J. Wongxfile Access Coordination
19545f658dadSDarrick J. Wong`````````````````````````
19555f658dadSDarrick J. Wong
19565f658dadSDarrick J. WongFor security reasons, xfiles must be owned privately by the kernel.
19575f658dadSDarrick J. WongThey are marked ``S_PRIVATE`` to prevent interference from the security system,
19585f658dadSDarrick J. Wongmust never be mapped into process file descriptor tables, and their pages must
19595f658dadSDarrick J. Wongnever be mapped into userspace processes.
19605f658dadSDarrick J. Wong
19615f658dadSDarrick J. WongTo avoid locking recursion issues with the VFS, all accesses to the shmfs file
19625f658dadSDarrick J. Wongare performed by manipulating the page cache directly.
19635f658dadSDarrick J. Wongxfile writers call the ``->write_begin`` and ``->write_end`` functions of the
19645f658dadSDarrick J. Wongxfile's address space to grab writable pages, copy the caller's buffer into the
19655f658dadSDarrick J. Wongpage, and release the pages.
19665f658dadSDarrick J. Wongxfile readers call ``shmem_read_mapping_page_gfp`` to grab pages directly
19675f658dadSDarrick J. Wongbefore copying the contents into the caller's buffer.
19685f658dadSDarrick J. WongIn other words, xfiles ignore the VFS read and write code paths to avoid
19695f658dadSDarrick J. Wonghaving to create a dummy ``struct kiocb`` and to avoid taking inode and
19705f658dadSDarrick J. Wongfreeze locks.
19715f658dadSDarrick J. Wongtmpfs cannot be frozen, and xfiles must not be exposed to userspace.
19725f658dadSDarrick J. Wong
19735f658dadSDarrick J. WongIf an xfile is shared between threads to stage repairs, the caller must provide
19745f658dadSDarrick J. Wongits own locks to coordinate access.
19755f658dadSDarrick J. WongFor example, if a scrub function stores scan results in an xfile and needs
19765f658dadSDarrick J. Wongother threads to provide updates to the scanned data, the scrub function must
19775f658dadSDarrick J. Wongprovide a lock for all threads to share.
19785f658dadSDarrick J. Wong
19795f658dadSDarrick J. Wong.. _xfarray:
19805f658dadSDarrick J. Wong
19815f658dadSDarrick J. WongArrays of Fixed-Sized Records
19825f658dadSDarrick J. Wong`````````````````````````````
19835f658dadSDarrick J. Wong
19845f658dadSDarrick J. WongIn XFS, each type of indexed space metadata (free space, inodes, reference
19855f658dadSDarrick J. Wongcounts, file fork space, and reverse mappings) consists of a set of fixed-size
19865f658dadSDarrick J. Wongrecords indexed with a classic B+ tree.
19875f658dadSDarrick J. WongDirectories have a set of fixed-size dirent records that point to the names,
19885f658dadSDarrick J. Wongand extended attributes have a set of fixed-size attribute keys that point to
19895f658dadSDarrick J. Wongnames and values.
19905f658dadSDarrick J. WongQuota counters and file link counters index records with numbers.
19915f658dadSDarrick J. WongDuring a repair, scrub needs to stage new records during the gathering step and
19925f658dadSDarrick J. Wongretrieve them during the btree building step.
19935f658dadSDarrick J. Wong
19945f658dadSDarrick J. WongAlthough this requirement can be satisfied by calling the read and write
19955f658dadSDarrick J. Wongmethods of the xfile directly, it is simpler for callers for there to be a
19965f658dadSDarrick J. Wonghigher level abstraction to take care of computing array offsets, to provide
19975f658dadSDarrick J. Wongiterator functions, and to deal with sparse records and sorting.
19985f658dadSDarrick J. WongThe ``xfarray`` abstraction presents a linear array for fixed-size records atop
19995f658dadSDarrick J. Wongthe byte-accessible xfile.
20005f658dadSDarrick J. Wong
20015f658dadSDarrick J. Wong.. _xfarray_access_patterns:
20025f658dadSDarrick J. Wong
20035f658dadSDarrick J. WongArray Access Patterns
20045f658dadSDarrick J. Wong^^^^^^^^^^^^^^^^^^^^^
20055f658dadSDarrick J. Wong
20065f658dadSDarrick J. WongArray access patterns in online fsck tend to fall into three categories.
20075f658dadSDarrick J. WongIteration of records is assumed to be necessary for all cases and will be
20085f658dadSDarrick J. Wongcovered in the next section.
20095f658dadSDarrick J. Wong
20105f658dadSDarrick J. WongThe first type of caller handles records that are indexed by position.
20115f658dadSDarrick J. WongGaps may exist between records, and a record may be updated multiple times
20125f658dadSDarrick J. Wongduring the collection step.
20135f658dadSDarrick J. WongIn other words, these callers want a sparse linearly addressed table file.
20145f658dadSDarrick J. WongThe typical use case are quota records or file link count records.
20155f658dadSDarrick J. WongAccess to array elements is performed programmatically via ``xfarray_load`` and
20165f658dadSDarrick J. Wong``xfarray_store`` functions, which wrap the similarly-named xfile functions to
20175f658dadSDarrick J. Wongprovide loading and storing of array elements at arbitrary array indices.
20185f658dadSDarrick J. WongGaps are defined to be null records, and null records are defined to be a
20195f658dadSDarrick J. Wongsequence of all zero bytes.
20205f658dadSDarrick J. WongNull records are detected by calling ``xfarray_element_is_null``.
20215f658dadSDarrick J. WongThey are created either by calling ``xfarray_unset`` to null out an existing
20225f658dadSDarrick J. Wongrecord or by never storing anything to an array index.
20235f658dadSDarrick J. Wong
20245f658dadSDarrick J. WongThe second type of caller handles records that are not indexed by position
20255f658dadSDarrick J. Wongand do not require multiple updates to a record.
20265f658dadSDarrick J. WongThe typical use case here is rebuilding space btrees and key/value btrees.
20275f658dadSDarrick J. WongThese callers can add records to the array without caring about array indices
20285f658dadSDarrick J. Wongvia the ``xfarray_append`` function, which stores a record at the end of the
20295f658dadSDarrick J. Wongarray.
20305f658dadSDarrick J. WongFor callers that require records to be presentable in a specific order (e.g.
20315f658dadSDarrick J. Wongrebuilding btree data), the ``xfarray_sort`` function can arrange the sorted
20325f658dadSDarrick J. Wongrecords; this function will be covered later.
20335f658dadSDarrick J. Wong
20345f658dadSDarrick J. WongThe third type of caller is a bag, which is useful for counting records.
20355f658dadSDarrick J. WongThe typical use case here is constructing space extent reference counts from
20365f658dadSDarrick J. Wongreverse mapping information.
20375f658dadSDarrick J. WongRecords can be put in the bag in any order, they can be removed from the bag
20385f658dadSDarrick J. Wongat any time, and uniqueness of records is left to callers.
20395f658dadSDarrick J. WongThe ``xfarray_store_anywhere`` function is used to insert a record in any
20405f658dadSDarrick J. Wongnull record slot in the bag; and the ``xfarray_unset`` function removes a
20415f658dadSDarrick J. Wongrecord from the bag.
20425f658dadSDarrick J. Wong
20435f658dadSDarrick J. WongThe proposed patchset is the
20445f658dadSDarrick J. Wong`big in-memory array
20455f658dadSDarrick J. Wong<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=big-array>`_.
20465f658dadSDarrick J. Wong
20475f658dadSDarrick J. WongIterating Array Elements
20485f658dadSDarrick J. Wong^^^^^^^^^^^^^^^^^^^^^^^^
20495f658dadSDarrick J. Wong
20505f658dadSDarrick J. WongMost users of the xfarray require the ability to iterate the records stored in
20515f658dadSDarrick J. Wongthe array.
20525f658dadSDarrick J. WongCallers can probe every possible array index with the following:
20535f658dadSDarrick J. Wong
20545f658dadSDarrick J. Wong.. code-block:: c
20555f658dadSDarrick J. Wong
20565f658dadSDarrick J. Wong	xfarray_idx_t i;
20575f658dadSDarrick J. Wong	foreach_xfarray_idx(array, i) {
20585f658dadSDarrick J. Wong	    xfarray_load(array, i, &rec);
20595f658dadSDarrick J. Wong
20605f658dadSDarrick J. Wong	    /* do something with rec */
20615f658dadSDarrick J. Wong	}
20625f658dadSDarrick J. Wong
20635f658dadSDarrick J. WongAll users of this idiom must be prepared to handle null records or must already
20645f658dadSDarrick J. Wongknow that there aren't any.
20655f658dadSDarrick J. Wong
20665f658dadSDarrick J. WongFor xfarray users that want to iterate a sparse array, the ``xfarray_iter``
20675f658dadSDarrick J. Wongfunction ignores indices in the xfarray that have never been written to by
20685f658dadSDarrick J. Wongcalling ``xfile_seek_data`` (which internally uses ``SEEK_DATA``) to skip areas
20695f658dadSDarrick J. Wongof the array that are not populated with memory pages.
20705f658dadSDarrick J. WongOnce it finds a page, it will skip the zeroed areas of the page.
20715f658dadSDarrick J. Wong
20725f658dadSDarrick J. Wong.. code-block:: c
20735f658dadSDarrick J. Wong
20745f658dadSDarrick J. Wong	xfarray_idx_t i = XFARRAY_CURSOR_INIT;
20755f658dadSDarrick J. Wong	while ((ret = xfarray_iter(array, &i, &rec)) == 1) {
20765f658dadSDarrick J. Wong	    /* do something with rec */
20775f658dadSDarrick J. Wong	}
20785f658dadSDarrick J. Wong
20795f658dadSDarrick J. Wong.. _xfarray_sort:
20805f658dadSDarrick J. Wong
20815f658dadSDarrick J. WongSorting Array Elements
20825f658dadSDarrick J. Wong^^^^^^^^^^^^^^^^^^^^^^
20835f658dadSDarrick J. Wong
20845f658dadSDarrick J. WongDuring the fourth demonstration of online repair, a community reviewer remarked
20855f658dadSDarrick J. Wongthat for performance reasons, online repair ought to load batches of records
20865f658dadSDarrick J. Wonginto btree record blocks instead of inserting records into a new btree one at a
20875f658dadSDarrick J. Wongtime.
20885f658dadSDarrick J. WongThe btree insertion code in XFS is responsible for maintaining correct ordering
20895f658dadSDarrick J. Wongof the records, so naturally the xfarray must also support sorting the record
20905f658dadSDarrick J. Wongset prior to bulk loading.
20915f658dadSDarrick J. Wong
20925f658dadSDarrick J. WongCase Study: Sorting xfarrays
20935f658dadSDarrick J. Wong~~~~~~~~~~~~~~~~~~~~~~~~~~~~
20945f658dadSDarrick J. Wong
20955f658dadSDarrick J. WongThe sorting algorithm used in the xfarray is actually a combination of adaptive
20965f658dadSDarrick J. Wongquicksort and a heapsort subalgorithm in the spirit of
20975f658dadSDarrick J. Wong`Sedgewick <https://algs4.cs.princeton.edu/23quicksort/>`_ and
20985f658dadSDarrick J. Wong`pdqsort <https://github.com/orlp/pdqsort>`_, with customizations for the Linux
20995f658dadSDarrick J. Wongkernel.
21005f658dadSDarrick J. WongTo sort records in a reasonably short amount of time, ``xfarray`` takes
21015f658dadSDarrick J. Wongadvantage of the binary subpartitioning offered by quicksort, but it also uses
21025f658dadSDarrick J. Wongheapsort to hedge aginst performance collapse if the chosen quicksort pivots
21035f658dadSDarrick J. Wongare poor.
21045f658dadSDarrick J. WongBoth algorithms are (in general) O(n * lg(n)), but there is a wide performance
21055f658dadSDarrick J. Wonggulf between the two implementations.
21065f658dadSDarrick J. Wong
21075f658dadSDarrick J. WongThe Linux kernel already contains a reasonably fast implementation of heapsort.
21085f658dadSDarrick J. WongIt only operates on regular C arrays, which limits the scope of its usefulness.
21095f658dadSDarrick J. WongThere are two key places where the xfarray uses it:
21105f658dadSDarrick J. Wong
21115f658dadSDarrick J. Wong* Sorting any record subset backed by a single xfile page.
21125f658dadSDarrick J. Wong
21135f658dadSDarrick J. Wong* Loading a small number of xfarray records from potentially disparate parts
21145f658dadSDarrick J. Wong  of the xfarray into a memory buffer, and sorting the buffer.
21155f658dadSDarrick J. Wong
21165f658dadSDarrick J. WongIn other words, ``xfarray`` uses heapsort to constrain the nested recursion of
21175f658dadSDarrick J. Wongquicksort, thereby mitigating quicksort's worst runtime behavior.
21185f658dadSDarrick J. Wong
21195f658dadSDarrick J. WongChoosing a quicksort pivot is a tricky business.
21205f658dadSDarrick J. WongA good pivot splits the set to sort in half, leading to the divide and conquer
21215f658dadSDarrick J. Wongbehavior that is crucial to  O(n * lg(n)) performance.
21225f658dadSDarrick J. WongA poor pivot barely splits the subset at all, leading to O(n\ :sup:`2`)
21235f658dadSDarrick J. Wongruntime.
21245f658dadSDarrick J. WongThe xfarray sort routine tries to avoid picking a bad pivot by sampling nine
21255f658dadSDarrick J. Wongrecords into a memory buffer and using the kernel heapsort to identify the
21265f658dadSDarrick J. Wongmedian of the nine.
21275f658dadSDarrick J. Wong
21285f658dadSDarrick J. WongMost modern quicksort implementations employ Tukey's "ninther" to select a
21295f658dadSDarrick J. Wongpivot from a classic C array.
21305f658dadSDarrick J. WongTypical ninther implementations pick three unique triads of records, sort each
21315f658dadSDarrick J. Wongof the triads, and then sort the middle value of each triad to determine the
21325f658dadSDarrick J. Wongninther value.
21335f658dadSDarrick J. WongAs stated previously, however, xfile accesses are not entirely cheap.
21345f658dadSDarrick J. WongIt turned out to be much more performant to read the nine elements into a
21355f658dadSDarrick J. Wongmemory buffer, run the kernel's in-memory heapsort on the buffer, and choose
21365f658dadSDarrick J. Wongthe 4th element of that buffer as the pivot.
21375f658dadSDarrick J. WongTukey's ninthers are described in J. W. Tukey, `The ninther, a technique for
21385f658dadSDarrick J. Wonglow-effort robust (resistant) location in large samples`, in *Contributions to
21395f658dadSDarrick J. WongSurvey Sampling and Applied Statistics*, edited by H. David, (Academic Press,
21405f658dadSDarrick J. Wong1978), pp. 251–257.
21415f658dadSDarrick J. Wong
21425f658dadSDarrick J. WongThe partitioning of quicksort is fairly textbook -- rearrange the record
21435f658dadSDarrick J. Wongsubset around the pivot, then set up the current and next stack frames to
21445f658dadSDarrick J. Wongsort with the larger and the smaller halves of the pivot, respectively.
21455f658dadSDarrick J. WongThis keeps the stack space requirements to log2(record count).
21465f658dadSDarrick J. Wong
21475f658dadSDarrick J. WongAs a final performance optimization, the hi and lo scanning phase of quicksort
21485f658dadSDarrick J. Wongkeeps examined xfile pages mapped in the kernel for as long as possible to
21495f658dadSDarrick J. Wongreduce map/unmap cycles.
21505f658dadSDarrick J. WongSurprisingly, this reduces overall sort runtime by nearly half again after
21515f658dadSDarrick J. Wongaccounting for the application of heapsort directly onto xfile pages.
21525f658dadSDarrick J. Wong
2153*a26aa252SDarrick J. Wong.. _xfblob:
2154*a26aa252SDarrick J. Wong
21555f658dadSDarrick J. WongBlob Storage
21565f658dadSDarrick J. Wong````````````
21575f658dadSDarrick J. Wong
21585f658dadSDarrick J. WongExtended attributes and directories add an additional requirement for staging
21595f658dadSDarrick J. Wongrecords: arbitrary byte sequences of finite length.
Each directory entry record needs to store the entry name,
21615f658dadSDarrick J. Wongand each extended attribute needs to store both the attribute name and value.
21625f658dadSDarrick J. WongThe names, keys, and values can consume a large amount of memory, so the
21635f658dadSDarrick J. Wong``xfblob`` abstraction was created to simplify management of these blobs
21645f658dadSDarrick J. Wongatop an xfile.
21655f658dadSDarrick J. Wong
21665f658dadSDarrick J. WongBlob arrays provide ``xfblob_load`` and ``xfblob_store`` functions to retrieve
21675f658dadSDarrick J. Wongand persist objects.
21685f658dadSDarrick J. WongThe store function returns a magic cookie for every object that it persists.
Later, callers provide this cookie to ``xfblob_load`` to recall the object.
21705f658dadSDarrick J. WongThe ``xfblob_free`` function frees a specific blob, and the ``xfblob_truncate``
21715f658dadSDarrick J. Wongfunction frees them all because compaction is not needed.
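
The store/load/truncate contract can be illustrated with a toy userspace
model.
Everything here is an assumption made for the example: the ``toy_blob_*``
names are hypothetical, a heap buffer stands in for the xfile, and the cookie
is assumed to be the byte offset of a small header written in front of each
blob.
The real ``xfblob`` functions have different signatures.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uint64_t toy_blob_cookie;	/* assumed: offset of the blob header */

struct toy_blob_hdr {
	uint32_t size;			/* length of the blob that follows */
};

struct toy_blob_store {
	unsigned char *mem;		/* stands in for the xfile */
	size_t used, cap;
};

/* Append a blob and hand back a cookie that identifies it. */
static int toy_blob_store(struct toy_blob_store *bs, const void *data,
			  uint32_t size, toy_blob_cookie *cookie)
{
	struct toy_blob_hdr hdr = { .size = size };
	size_t need = bs->used + sizeof(hdr) + size;

	assert(cookie != NULL);
	if (need > bs->cap) {
		size_t ncap = bs->cap ? bs->cap : 64;
		unsigned char *nmem;

		while (ncap < need)
			ncap *= 2;
		nmem = realloc(bs->mem, ncap);
		if (!nmem)
			return -1;
		bs->mem = nmem;
		bs->cap = ncap;
	}
	*cookie = bs->used;
	memcpy(bs->mem + bs->used, &hdr, sizeof(hdr));
	memcpy(bs->mem + bs->used + sizeof(hdr), data, size);
	bs->used = need;
	return 0;
}

/* Recall a blob; the caller must already know its size. */
static int toy_blob_load(const struct toy_blob_store *bs,
			 toy_blob_cookie cookie, void *buf, uint32_t size)
{
	struct toy_blob_hdr hdr;

	if (cookie + sizeof(hdr) > bs->used)
		return -1;
	memcpy(&hdr, bs->mem + cookie, sizeof(hdr));
	if (hdr.size != size || cookie + sizeof(hdr) + size > bs->used)
		return -1;
	memcpy(buf, bs->mem + cookie + sizeof(hdr), size);
	return 0;
}

/* Drop every blob at once; no compaction is attempted. */
static void toy_blob_truncate(struct toy_blob_store *bs)
{
	bs->used = 0;
}
```

Because blobs are only ever appended and then dropped in bulk, the model never
needs to move a stored blob, which mirrors the "compaction is not needed"
remark above.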
21725f658dadSDarrick J. Wong
21735f658dadSDarrick J. WongThe details of repairing directories and extended attributes will be discussed
21745f658dadSDarrick J. Wongin a subsequent section about atomic extent swapping.
21755f658dadSDarrick J. WongHowever, it should be noted that these repair functions only use blob storage
21765f658dadSDarrick J. Wongto cache a small number of entries before adding them to a temporary ondisk
21775f658dadSDarrick J. Wongfile, which is why compaction is not required.
21785f658dadSDarrick J. Wong
21795f658dadSDarrick J. WongThe proposed patchset is at the start of the
21805f658dadSDarrick J. Wong`extended attribute repair
21815f658dadSDarrick J. Wong<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-xattrs>`_ series.
21825f658dadSDarrick J. Wong
21835f658dadSDarrick J. Wong.. _xfbtree:
21845f658dadSDarrick J. Wong
21855f658dadSDarrick J. WongIn-Memory B+Trees
21865f658dadSDarrick J. Wong`````````````````
21875f658dadSDarrick J. Wong
21885f658dadSDarrick J. WongThe chapter about :ref:`secondary metadata<secondary_metadata>` mentioned that
21895f658dadSDarrick J. Wongchecking and repairing of secondary metadata commonly requires coordination
21905f658dadSDarrick J. Wongbetween a live metadata scan of the filesystem and writer threads that are
21915f658dadSDarrick J. Wongupdating that metadata.
Keeping the scan data up to date requires the ability to propagate
21935f658dadSDarrick J. Wongmetadata updates from the filesystem into the data being collected by the scan.
21945f658dadSDarrick J. WongThis *can* be done by appending concurrent updates into a separate log file and
21955f658dadSDarrick J. Wongapplying them before writing the new metadata to disk, but this leads to
21965f658dadSDarrick J. Wongunbounded memory consumption if the rest of the system is very busy.
21975f658dadSDarrick J. WongAnother option is to skip the side-log and commit live updates from the
21985f658dadSDarrick J. Wongfilesystem directly into the scan data, which trades more overhead for a lower
21995f658dadSDarrick J. Wongmaximum memory requirement.
22005f658dadSDarrick J. WongIn both cases, the data structure holding the scan results must support indexed
22015f658dadSDarrick J. Wongaccess to perform well.
22025f658dadSDarrick J. Wong
Given that indexed lookups of scan data are required for both strategies, online
22045f658dadSDarrick J. Wongfsck employs the second strategy of committing live updates directly into
22055f658dadSDarrick J. Wongscan data.
22065f658dadSDarrick J. WongBecause xfarrays are not indexed and do not enforce record ordering, they
22075f658dadSDarrick J. Wongare not suitable for this task.
22085f658dadSDarrick J. WongConveniently, however, XFS has a library to create and maintain ordered reverse
22095f658dadSDarrick J. Wongmapping records: the existing rmap btree code!
If only there were a means to create one in memory.
22115f658dadSDarrick J. Wong
22125f658dadSDarrick J. WongRecall that the :ref:`xfile <xfile>` abstraction represents memory pages as a
22135f658dadSDarrick J. Wongregular file, which means that the kernel can create byte or block addressable
22145f658dadSDarrick J. Wongvirtual address spaces at will.
The XFS buffer cache specializes in abstracting IO to block-oriented address
22165f658dadSDarrick J. Wongspaces, which means that adaptation of the buffer cache to interface with
22175f658dadSDarrick J. Wongxfiles enables reuse of the entire btree library.
22185f658dadSDarrick J. WongBtrees built atop an xfile are collectively known as ``xfbtrees``.
22195f658dadSDarrick J. WongThe next few sections describe how they actually work.
22205f658dadSDarrick J. Wong
22215f658dadSDarrick J. WongThe proposed patchset is the
22225f658dadSDarrick J. Wong`in-memory btree
22235f658dadSDarrick J. Wong<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=in-memory-btrees>`_
22245f658dadSDarrick J. Wongseries.
22255f658dadSDarrick J. Wong
22265f658dadSDarrick J. WongUsing xfiles as a Buffer Cache Target
22275f658dadSDarrick J. Wong^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
22285f658dadSDarrick J. Wong
22295f658dadSDarrick J. WongTwo modifications are necessary to support xfiles as a buffer cache target.
22305f658dadSDarrick J. WongThe first is to make it possible for the ``struct xfs_buftarg`` structure to
22315f658dadSDarrick J. Wonghost the ``struct xfs_buf`` rhashtable, because normally those are held by a
22325f658dadSDarrick J. Wongper-AG structure.
22335f658dadSDarrick J. WongThe second change is to modify the buffer ``ioapply`` function to "read" cached
22345f658dadSDarrick J. Wongpages from the xfile and "write" cached pages back to the xfile.
Concurrent access to individual buffers is controlled by the ``xfs_buf`` lock,
22365f658dadSDarrick J. Wongsince the xfile does not provide any locking on its own.
22375f658dadSDarrick J. WongWith this adaptation in place, users of the xfile-backed buffer cache use
22385f658dadSDarrick J. Wongexactly the same APIs as users of the disk-backed buffer cache.
22395f658dadSDarrick J. WongThe separation between xfile and buffer cache implies higher memory usage since
22405f658dadSDarrick J. Wongthey do not share pages, but this property could some day enable transactional
22415f658dadSDarrick J. Wongupdates to an in-memory btree.
22425f658dadSDarrick J. WongToday, however, it simply eliminates the need for new code.
22435f658dadSDarrick J. Wong
22445f658dadSDarrick J. WongSpace Management with an xfbtree
22455f658dadSDarrick J. Wong^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
22465f658dadSDarrick J. Wong
22475f658dadSDarrick J. WongSpace management for an xfile is very simple -- each btree block is one memory
22485f658dadSDarrick J. Wongpage in size.
22495f658dadSDarrick J. WongThese blocks use the same header format as an on-disk btree, but the in-memory
22505f658dadSDarrick J. Wongblock verifiers ignore the checksums, assuming that xfile memory is no more
22515f658dadSDarrick J. Wongcorruption-prone than regular DRAM.
22525f658dadSDarrick J. WongReusing existing code here is more important than absolute memory efficiency.
22535f658dadSDarrick J. Wong
The very first block of an xfile backing an xfbtree is a header block.
22555f658dadSDarrick J. WongThe header describes the owner, height, and the block number of the root
22565f658dadSDarrick J. Wongxfbtree block.
22575f658dadSDarrick J. Wong
22585f658dadSDarrick J. WongTo allocate a btree block, use ``xfile_seek_data`` to find a gap in the file.
22595f658dadSDarrick J. WongIf there are no gaps, create one by extending the length of the xfile.
22605f658dadSDarrick J. WongPreallocate space for the block with ``xfile_prealloc``, and hand back the
22615f658dadSDarrick J. Wonglocation.
22625f658dadSDarrick J. WongTo free an xfbtree block, use ``xfile_discard`` (which internally uses
22635f658dadSDarrick J. Wong``FALLOC_FL_PUNCH_HOLE``) to remove the memory page from the xfile.
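
The allocation policy (reuse a gap if one exists, otherwise extend the file)
can be modelled with a simple presence map.
This is only a sketch: the ``toy_*`` names are hypothetical, the boolean array
stands in for the xfile's sparse page mapping, and no real ``xfile_*`` calls
are made.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_BLOCKS	1024

struct toy_xfile {
	bool present[TOY_MAX_BLOCKS];	/* true if the page is allocated */
	size_t nr_blocks;		/* current file length, in blocks */
};

/*
 * Allocate one btree block: reuse the first gap below EOF, or extend the
 * file by one block if there are no gaps.  Returns the block number, or -1
 * if the toy file is full.
 */
static long toy_alloc_block(struct toy_xfile *xf)
{
	size_t bno;

	for (bno = 0; bno < xf->nr_blocks; bno++)
		if (!xf->present[bno])
			goto found;
	if (xf->nr_blocks == TOY_MAX_BLOCKS)
		return -1;
	bno = xf->nr_blocks++;
found:
	xf->present[bno] = true;	/* models preallocating the page */
	return (long)bno;
}

/* Free one btree block by punching a hole where its page used to be. */
static void toy_free_block(struct toy_xfile *xf, size_t bno)
{
	assert(bno < TOY_MAX_BLOCKS);
	if (bno < xf->nr_blocks)
		xf->present[bno] = false;
}
```

In this model the first allocation returns block 0, which corresponds to the
header block described above.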
22645f658dadSDarrick J. Wong
22655f658dadSDarrick J. WongPopulating an xfbtree
22665f658dadSDarrick J. Wong^^^^^^^^^^^^^^^^^^^^^
22675f658dadSDarrick J. Wong
22685f658dadSDarrick J. WongAn online fsck function that wants to create an xfbtree should proceed as
22695f658dadSDarrick J. Wongfollows:
22705f658dadSDarrick J. Wong
22715f658dadSDarrick J. Wong1. Call ``xfile_create`` to create an xfile.
22725f658dadSDarrick J. Wong
22735f658dadSDarrick J. Wong2. Call ``xfs_alloc_memory_buftarg`` to create a buffer cache target structure
22745f658dadSDarrick J. Wong   pointing to the xfile.
22755f658dadSDarrick J. Wong
22765f658dadSDarrick J. Wong3. Pass the buffer cache target, buffer ops, and other information to
22775f658dadSDarrick J. Wong   ``xfbtree_create`` to write an initial tree header and root block to the
22785f658dadSDarrick J. Wong   xfile.
22795f658dadSDarrick J. Wong   Each btree type should define a wrapper that passes necessary arguments to
22805f658dadSDarrick J. Wong   the creation function.
22815f658dadSDarrick J. Wong   For example, rmap btrees define ``xfs_rmapbt_mem_create`` to take care of
22825f658dadSDarrick J. Wong   all the necessary details for callers.
22835f658dadSDarrick J. Wong   A ``struct xfbtree`` object will be returned.
22845f658dadSDarrick J. Wong
22855f658dadSDarrick J. Wong4. Pass the xfbtree object to the btree cursor creation function for the
22865f658dadSDarrick J. Wong   btree type.
22875f658dadSDarrick J. Wong   Following the example above, ``xfs_rmapbt_mem_cursor`` takes care of this
22885f658dadSDarrick J. Wong   for callers.
22895f658dadSDarrick J. Wong
22905f658dadSDarrick J. Wong5. Pass the btree cursor to the regular btree functions to make queries against
22915f658dadSDarrick J. Wong   and to update the in-memory btree.
22925f658dadSDarrick J. Wong   For example, a btree cursor for an rmap xfbtree can be passed to the
22935f658dadSDarrick J. Wong   ``xfs_rmap_*`` functions just like any other btree cursor.
22945f658dadSDarrick J. Wong   See the :ref:`next section<xfbtree_commit>` for information on dealing with
22955f658dadSDarrick J. Wong   xfbtree updates that are logged to a transaction.
22965f658dadSDarrick J. Wong
22975f658dadSDarrick J. Wong6. When finished, delete the btree cursor, destroy the xfbtree object, free the
   buffer target, and destroy the xfile to release all resources.
22995f658dadSDarrick J. Wong
23005f658dadSDarrick J. Wong.. _xfbtree_commit:
23015f658dadSDarrick J. Wong
23025f658dadSDarrick J. WongCommitting Logged xfbtree Buffers
23035f658dadSDarrick J. Wong^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
23045f658dadSDarrick J. Wong
23055f658dadSDarrick J. WongAlthough it is a clever hack to reuse the rmap btree code to handle the staging
23065f658dadSDarrick J. Wongstructure, the ephemeral nature of the in-memory btree block storage presents
23075f658dadSDarrick J. Wongsome challenges of its own.
23085f658dadSDarrick J. WongThe XFS transaction manager must not commit buffer log items for buffers backed
23095f658dadSDarrick J. Wongby an xfile because the log format does not understand updates for devices
23105f658dadSDarrick J. Wongother than the data device.
23115f658dadSDarrick J. WongAn ephemeral xfbtree probably will not exist by the time the AIL checkpoints
23125f658dadSDarrick J. Wonglog transactions back into the filesystem, and certainly won't exist during
23135f658dadSDarrick J. Wonglog recovery.
23145f658dadSDarrick J. WongFor these reasons, any code updating an xfbtree in transaction context must
23155f658dadSDarrick J. Wongremove the buffer log items from the transaction and write the updates into the
23165f658dadSDarrick J. Wongbacking xfile before committing or cancelling the transaction.
23175f658dadSDarrick J. Wong
23185f658dadSDarrick J. WongThe ``xfbtree_trans_commit`` and ``xfbtree_trans_cancel`` functions implement
23195f658dadSDarrick J. Wongthis functionality as follows:
23205f658dadSDarrick J. Wong
23215f658dadSDarrick J. Wong1. Find each buffer log item whose buffer targets the xfile.
23225f658dadSDarrick J. Wong
23235f658dadSDarrick J. Wong2. Record the dirty/ordered status of the log item.
23245f658dadSDarrick J. Wong
23255f658dadSDarrick J. Wong3. Detach the log item from the buffer.
23265f658dadSDarrick J. Wong
23275f658dadSDarrick J. Wong4. Queue the buffer to a special delwri list.
23285f658dadSDarrick J. Wong
23295f658dadSDarrick J. Wong5. Clear the transaction dirty flag if the only dirty log items were the ones
23305f658dadSDarrick J. Wong   that were detached in step 3.
23315f658dadSDarrick J. Wong
23325f658dadSDarrick J. Wong6. Submit the delwri list to commit the changes to the xfile, if the updates
23335f658dadSDarrick J. Wong   are being committed.
23345f658dadSDarrick J. Wong
23355f658dadSDarrick J. WongAfter removing xfile logged buffers from the transaction in this manner, the
23365f658dadSDarrick J. Wongtransaction can be committed or cancelled.
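
The bookkeeping in the steps above can be sketched with a toy model of a
transaction and its log items.
The structures and the ``toy_detach_xfile_items`` function are hypothetical
and greatly simplified; in particular, this sketch assumes that only dirty
buffers need to be queued for writing, which the text does not state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_item {
	bool xfile_backed;	/* buffer targets the xfile */
	bool dirty;		/* log item was dirtied in this transaction */
	bool attached;		/* log item is attached to the transaction */
	bool queued;		/* buffer is on the toy delwri list */
};

struct toy_tp {
	struct toy_item *items;
	size_t nr;
	bool dirty;		/* transaction dirty flag */
};

/*
 * Detach every xfile-backed log item from the transaction and queue its
 * buffer for writing.  Returns the number of buffers queued.
 */
static size_t toy_detach_xfile_items(struct toy_tp *tp)
{
	bool other_dirty = false;
	size_t queued = 0, i;

	assert(tp != NULL);
	for (i = 0; i < tp->nr; i++) {
		struct toy_item *it = &tp->items[i];

		if (!it->attached)
			continue;
		if (!it->xfile_backed) {
			/* Disk-backed items stay with the transaction. */
			other_dirty |= it->dirty;
			continue;
		}
		it->attached = false;		/* step 3 */
		if (it->dirty) {		/* status recorded in step 2 */
			it->queued = true;	/* step 4 */
			queued++;
		}
	}
	if (!other_dirty)
		tp->dirty = false;		/* step 5 */
	return queued;
}
```

On commit, the caller would then write the queued buffers to the xfile
(step 6); on cancel, the queued buffers would simply be dropped.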
23377fb8ccffSDarrick J. Wong
23387fb8ccffSDarrick J. WongBulk Loading of Ondisk B+Trees
23397fb8ccffSDarrick J. Wong------------------------------
23407fb8ccffSDarrick J. Wong
23417fb8ccffSDarrick J. WongAs mentioned previously, early iterations of online repair built new btree
23427fb8ccffSDarrick J. Wongstructures by creating a new btree and adding observations individually.
23437fb8ccffSDarrick J. WongLoading a btree one record at a time had a slight advantage of not requiring
23447fb8ccffSDarrick J. Wongthe incore records to be sorted prior to commit, but was very slow and leaked
23457fb8ccffSDarrick J. Wongblocks if the system went down during a repair.
23467fb8ccffSDarrick J. WongLoading records one at a time also meant that repair could not control the
23477fb8ccffSDarrick J. Wongloading factor of the blocks in the new btree.
23487fb8ccffSDarrick J. Wong
23497fb8ccffSDarrick J. WongFortunately, the venerable ``xfs_repair`` tool had a more efficient means for
23507fb8ccffSDarrick J. Wongrebuilding a btree index from a collection of records -- bulk btree loading.
23517fb8ccffSDarrick J. WongThis was implemented rather inefficiently code-wise, since ``xfs_repair``
23527fb8ccffSDarrick J. Wonghad separate copy-pasted implementations for each btree type.
23537fb8ccffSDarrick J. Wong
To prepare for online fsck, each of the four bulk loaders was studied, notes
23557fb8ccffSDarrick J. Wongwere taken, and the four were refactored into a single generic btree bulk
23567fb8ccffSDarrick J. Wongloading mechanism.
23577fb8ccffSDarrick J. WongThose notes in turn have been refreshed and are presented below.
23587fb8ccffSDarrick J. Wong
23597fb8ccffSDarrick J. WongGeometry Computation
23607fb8ccffSDarrick J. Wong````````````````````
23617fb8ccffSDarrick J. Wong
23627fb8ccffSDarrick J. WongThe zeroth step of bulk loading is to assemble the entire record set that will
23637fb8ccffSDarrick J. Wongbe stored in the new btree, and sort the records.
23647fb8ccffSDarrick J. WongNext, call ``xfs_btree_bload_compute_geometry`` to compute the shape of the
23657fb8ccffSDarrick J. Wongbtree from the record set, the type of btree, and any load factor preferences.
23667fb8ccffSDarrick J. WongThis information is required for resource reservation.
23677fb8ccffSDarrick J. Wong
23687fb8ccffSDarrick J. WongFirst, the geometry computation computes the minimum and maximum records that
23697fb8ccffSDarrick J. Wongwill fit in a leaf block from the size of a btree block and the size of the
23707fb8ccffSDarrick J. Wongblock header.
23717fb8ccffSDarrick J. WongRoughly speaking, the maximum number of records is::
23727fb8ccffSDarrick J. Wong
23737fb8ccffSDarrick J. Wong        maxrecs = (block_size - header_size) / record_size
23747fb8ccffSDarrick J. Wong
23757fb8ccffSDarrick J. WongThe XFS design specifies that btree blocks should be merged when possible,
23767fb8ccffSDarrick J. Wongwhich means the minimum number of records is half of maxrecs::
23777fb8ccffSDarrick J. Wong
23787fb8ccffSDarrick J. Wong        minrecs = maxrecs / 2
23797fb8ccffSDarrick J. Wong
23807fb8ccffSDarrick J. WongThe next variable to determine is the desired loading factor.
23817fb8ccffSDarrick J. WongThis must be at least minrecs and no more than maxrecs.
23827fb8ccffSDarrick J. WongChoosing minrecs is undesirable because it wastes half the block.
23837fb8ccffSDarrick J. WongChoosing maxrecs is also undesirable because adding a single record to each
23847fb8ccffSDarrick J. Wongnewly rebuilt leaf block will cause a tree split, which causes a noticeable
23857fb8ccffSDarrick J. Wongdrop in performance immediately afterwards.
23867fb8ccffSDarrick J. WongThe default loading factor was chosen to be 75% of maxrecs, which provides a
23877fb8ccffSDarrick J. Wongreasonably compact structure without any immediate split penalties::
23887fb8ccffSDarrick J. Wong
23897fb8ccffSDarrick J. Wong        default_load_factor = (maxrecs + minrecs) / 2
23907fb8ccffSDarrick J. Wong
23917fb8ccffSDarrick J. WongIf space is tight, the loading factor will be set to maxrecs to try to avoid
23927fb8ccffSDarrick J. Wongrunning out of space::
23937fb8ccffSDarrick J. Wong
23947fb8ccffSDarrick J. Wong        leaf_load_factor = enough space ? default_load_factor : maxrecs
23957fb8ccffSDarrick J. Wong
23967fb8ccffSDarrick J. WongLoad factor is computed for btree node blocks using the combined size of the
23977fb8ccffSDarrick J. Wongbtree key and pointer as the record size::
23987fb8ccffSDarrick J. Wong
23997fb8ccffSDarrick J. Wong        maxrecs = (block_size - header_size) / (key_size + ptr_size)
24007fb8ccffSDarrick J. Wong        minrecs = maxrecs / 2
24017fb8ccffSDarrick J. Wong        node_load_factor = enough space ? default_load_factor : maxrecs
24027fb8ccffSDarrick J. Wong
24037fb8ccffSDarrick J. WongOnce that's done, the number of leaf blocks required to store the record set
24047fb8ccffSDarrick J. Wongcan be computed as::
24057fb8ccffSDarrick J. Wong
24067fb8ccffSDarrick J. Wong        leaf_blocks = ceil(record_count / leaf_load_factor)
24077fb8ccffSDarrick J. Wong
24087fb8ccffSDarrick J. WongThe number of node blocks needed to point to the next level down in the tree
24097fb8ccffSDarrick J. Wongis computed as::
24107fb8ccffSDarrick J. Wong
24117fb8ccffSDarrick J. Wong        n_blocks = (n == 0 ? leaf_blocks : node_blocks[n])
24127fb8ccffSDarrick J. Wong        node_blocks[n + 1] = ceil(n_blocks / node_load_factor)
24137fb8ccffSDarrick J. Wong
24147fb8ccffSDarrick J. WongThe entire computation is performed recursively until the current level only
24157fb8ccffSDarrick J. Wongneeds one block.
24167fb8ccffSDarrick J. WongThe resulting geometry is as follows:
24177fb8ccffSDarrick J. Wong
24187fb8ccffSDarrick J. Wong- For AG-rooted btrees, this level is the root level, so the height of the new
24197fb8ccffSDarrick J. Wong  tree is ``level + 1`` and the space needed is the summation of the number of
24207fb8ccffSDarrick J. Wong  blocks on each level.
24217fb8ccffSDarrick J. Wong
24227fb8ccffSDarrick J. Wong- For inode-rooted btrees where the records in the top level do not fit in the
24237fb8ccffSDarrick J. Wong  inode fork area, the height is ``level + 2``, the space needed is the
24247fb8ccffSDarrick J. Wong  summation of the number of blocks on each level, and the inode fork points to
24257fb8ccffSDarrick J. Wong  the root block.
24267fb8ccffSDarrick J. Wong
24277fb8ccffSDarrick J. Wong- For inode-rooted btrees where the records in the top level can be stored in
24287fb8ccffSDarrick J. Wong  the inode fork area, then the root block can be stored in the inode, the
24297fb8ccffSDarrick J. Wong  height is ``level + 1``, and the space needed is one less than the summation
24307fb8ccffSDarrick J. Wong  of the number of blocks on each level.
24317fb8ccffSDarrick J. Wong  This only becomes relevant when non-bmap btrees gain the ability to root in
24327fb8ccffSDarrick J. Wong  an inode, which is a future patchset and only included here for completeness.
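
The formulas above can be combined into a short userspace sketch for the
AG-rooted case.
The ``toy_*`` names are hypothetical and this is not the signature of
``xfs_btree_bload_compute_geometry``; it only restates the arithmetic.

```c
#include <assert.h>
#include <stdint.h>

struct toy_geometry {
	unsigned int height;	/* number of levels, including the root */
	uint64_t nr_blocks;	/* total blocks across all levels */
};

static uint64_t ceil_div(uint64_t a, uint64_t b)
{
	return (a + b - 1) / b;
}

/* Load factor for one block type; rec_size is key + ptr size for nodes. */
static unsigned int toy_load_factor(unsigned int block_size,
				    unsigned int header_size,
				    unsigned int rec_size, int enough_space)
{
	unsigned int maxrecs = (block_size - header_size) / rec_size;
	unsigned int minrecs = maxrecs / 2;

	return enough_space ? (maxrecs + minrecs) / 2 : maxrecs;
}

/* Compute the shape of an AG-rooted btree holding nr_records records. */
static int toy_compute_geometry(uint64_t nr_records, unsigned int leaf_lf,
				unsigned int node_lf,
				struct toy_geometry *geo)
{
	uint64_t blocks;
	unsigned int level = 0;

	assert(leaf_lf > 0);
	if (node_lf < 2)
		return -1;	/* upper levels would never shrink */

	blocks = ceil_div(nr_records, leaf_lf);
	if (blocks == 0)
		blocks = 1;	/* an empty tree still has a root leaf */
	geo->nr_blocks = blocks;

	/* Add node levels until a single block can point to everything. */
	while (blocks > 1) {
		blocks = ceil_div(blocks, node_lf);
		geo->nr_blocks += blocks;
		level++;
	}
	geo->height = level + 1;
	return 0;
}
```

As a worked example with made-up sizes (4096-byte blocks, a 56-byte header,
16-byte records, and 12 bytes per key/pointer pair), the leaf load factor is
189 and the node load factor is 252.
Loading 100,000 records then needs 530 leaf blocks, 3 node blocks, and 1 root
block, for a height of 3 and a total of 534 blocks.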
24337fb8ccffSDarrick J. Wong
24347fb8ccffSDarrick J. Wong.. _newbt:
24357fb8ccffSDarrick J. Wong
24367fb8ccffSDarrick J. WongReserving New B+Tree Blocks
24377fb8ccffSDarrick J. Wong```````````````````````````
24387fb8ccffSDarrick J. Wong
24397fb8ccffSDarrick J. WongOnce repair knows the number of blocks needed for the new btree, it allocates
24407fb8ccffSDarrick J. Wongthose blocks using the free space information.
24417fb8ccffSDarrick J. WongEach reserved extent is tracked separately by the btree builder state data.
24427fb8ccffSDarrick J. WongTo improve crash resilience, the reservation code also logs an Extent Freeing
24437fb8ccffSDarrick J. WongIntent (EFI) item in the same transaction as each space allocation and attaches
24447fb8ccffSDarrick J. Wongits in-memory ``struct xfs_extent_free_item`` object to the space reservation.
24457fb8ccffSDarrick J. WongIf the system goes down, log recovery will use the unfinished EFIs to free the
unused space, leaving the filesystem unchanged.
24477fb8ccffSDarrick J. Wong
24487fb8ccffSDarrick J. WongEach time the btree builder claims a block for the btree from a reserved
24497fb8ccffSDarrick J. Wongextent, it updates the in-memory reservation to reflect the claimed space.
24507fb8ccffSDarrick J. WongBlock reservation tries to allocate as much contiguous space as possible to
24517fb8ccffSDarrick J. Wongreduce the number of EFIs in play.
24527fb8ccffSDarrick J. Wong
24537fb8ccffSDarrick J. WongWhile repair is writing these new btree blocks, the EFIs created for the space
24547fb8ccffSDarrick J. Wongreservations pin the tail of the ondisk log.
24557fb8ccffSDarrick J. WongIt's possible that other parts of the system will remain busy and push the head
24567fb8ccffSDarrick J. Wongof the log towards the pinned tail.
24577fb8ccffSDarrick J. WongTo avoid livelocking the filesystem, the EFIs must not pin the tail of the log
24587fb8ccffSDarrick J. Wongfor too long.
24597fb8ccffSDarrick J. WongTo alleviate this problem, the dynamic relogging capability of the deferred ops
24607fb8ccffSDarrick J. Wongmechanism is reused here to commit a transaction at the log head containing an
EFD for the old EFI and a new EFI at the head.
24627fb8ccffSDarrick J. WongThis enables the log to release the old EFI to keep the log moving forwards.
24637fb8ccffSDarrick J. Wong
24647fb8ccffSDarrick J. WongEFIs have a role to play during the commit and reaping phases; please see the
24657fb8ccffSDarrick J. Wongnext section and the section about :ref:`reaping<reaping>` for more details.
24667fb8ccffSDarrick J. Wong
24677fb8ccffSDarrick J. WongProposed patchsets are the
24687fb8ccffSDarrick J. Wong`bitmap rework
24697fb8ccffSDarrick J. Wong<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-bitmap-rework>`_
24707fb8ccffSDarrick J. Wongand the
24717fb8ccffSDarrick J. Wong`preparation for bulk loading btrees
24727fb8ccffSDarrick J. Wong<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-prep-for-bulk-loading>`_.
24737fb8ccffSDarrick J. Wong
24747fb8ccffSDarrick J. Wong
24757fb8ccffSDarrick J. WongWriting the New Tree
24767fb8ccffSDarrick J. Wong````````````````````
24777fb8ccffSDarrick J. Wong
24787fb8ccffSDarrick J. WongThis part is pretty simple -- the btree builder (``xfs_btree_bulkload``) claims
24797fb8ccffSDarrick J. Wonga block from the reserved list, writes the new btree block header, fills the
24807fb8ccffSDarrick J. Wongrest of the block with records, and adds the new leaf block to a list of
24817fb8ccffSDarrick J. Wongwritten blocks::
24827fb8ccffSDarrick J. Wong
24837fb8ccffSDarrick J. Wong  ┌────┐
24847fb8ccffSDarrick J. Wong  │leaf│
24857fb8ccffSDarrick J. Wong  │RRR │
24867fb8ccffSDarrick J. Wong  └────┘
24877fb8ccffSDarrick J. Wong
24887fb8ccffSDarrick J. WongSibling pointers are set every time a new block is added to the level::
24897fb8ccffSDarrick J. Wong
24907fb8ccffSDarrick J. Wong  ┌────┐ ┌────┐ ┌────┐ ┌────┐
24917fb8ccffSDarrick J. Wong  │leaf│→│leaf│→│leaf│→│leaf│
24927fb8ccffSDarrick J. Wong  │RRR │←│RRR │←│RRR │←│RRR │
24937fb8ccffSDarrick J. Wong  └────┘ └────┘ └────┘ └────┘
24947fb8ccffSDarrick J. Wong
24957fb8ccffSDarrick J. WongWhen it finishes writing the record leaf blocks, it moves on to the node
blocks.
24977fb8ccffSDarrick J. WongTo fill a node block, it walks each block in the next level down in the tree
24987fb8ccffSDarrick J. Wongto compute the relevant keys and write them into the parent node::
24997fb8ccffSDarrick J. Wong
25007fb8ccffSDarrick J. Wong      ┌────┐       ┌────┐
25017fb8ccffSDarrick J. Wong      │node│──────→│node│
25027fb8ccffSDarrick J. Wong      │PP  │←──────│PP  │
25037fb8ccffSDarrick J. Wong      └────┘       └────┘
25047fb8ccffSDarrick J. Wong      ↙   ↘         ↙   ↘
25057fb8ccffSDarrick J. Wong  ┌────┐ ┌────┐ ┌────┐ ┌────┐
25067fb8ccffSDarrick J. Wong  │leaf│→│leaf│→│leaf│→│leaf│
25077fb8ccffSDarrick J. Wong  │RRR │←│RRR │←│RRR │←│RRR │
25087fb8ccffSDarrick J. Wong  └────┘ └────┘ └────┘ └────┘
25097fb8ccffSDarrick J. Wong
25107fb8ccffSDarrick J. WongWhen it reaches the root level, it is ready to commit the new btree!::
25117fb8ccffSDarrick J. Wong
25127fb8ccffSDarrick J. Wong          ┌─────────┐
25137fb8ccffSDarrick J. Wong          │  root   │
25147fb8ccffSDarrick J. Wong          │   PP    │
25157fb8ccffSDarrick J. Wong          └─────────┘
25167fb8ccffSDarrick J. Wong          ↙         ↘
25177fb8ccffSDarrick J. Wong      ┌────┐       ┌────┐
25187fb8ccffSDarrick J. Wong      │node│──────→│node│
25197fb8ccffSDarrick J. Wong      │PP  │←──────│PP  │
25207fb8ccffSDarrick J. Wong      └────┘       └────┘
25217fb8ccffSDarrick J. Wong      ↙   ↘         ↙   ↘
25227fb8ccffSDarrick J. Wong  ┌────┐ ┌────┐ ┌────┐ ┌────┐
25237fb8ccffSDarrick J. Wong  │leaf│→│leaf│→│leaf│→│leaf│
25247fb8ccffSDarrick J. Wong  │RRR │←│RRR │←│RRR │←│RRR │
25257fb8ccffSDarrick J. Wong  └────┘ └────┘ └────┘ └────┘
25267fb8ccffSDarrick J. Wong
25277fb8ccffSDarrick J. WongThe first step to commit the new btree is to persist the btree blocks to disk
25287fb8ccffSDarrick J. Wongsynchronously.
25297fb8ccffSDarrick J. WongThis is a little complicated because a new btree block could have been freed
25307fb8ccffSDarrick J. Wongin the recent past, so the builder must use ``xfs_buf_delwri_queue_here`` to
25317fb8ccffSDarrick J. Wongremove the (stale) buffer from the AIL list before it can write the new blocks
25327fb8ccffSDarrick J. Wongto disk.
25337fb8ccffSDarrick J. WongBlocks are queued for IO using a delwri list and written in one large batch
25347fb8ccffSDarrick J. Wongwith ``xfs_buf_delwri_submit``.
25357fb8ccffSDarrick J. Wong
25367fb8ccffSDarrick J. WongOnce the new blocks have been persisted to disk, control returns to the
25377fb8ccffSDarrick J. Wongindividual repair function that called the bulk loader.
25387fb8ccffSDarrick J. WongThe repair function must log the location of the new root in a transaction,
25397fb8ccffSDarrick J. Wongclean up the space reservations that were made for the new btree, and reap the
25407fb8ccffSDarrick J. Wongold metadata blocks:
25417fb8ccffSDarrick J. Wong
25427fb8ccffSDarrick J. Wong1. Commit the location of the new btree root.
25437fb8ccffSDarrick J. Wong
25447fb8ccffSDarrick J. Wong2. For each incore reservation:
25457fb8ccffSDarrick J. Wong
25467fb8ccffSDarrick J. Wong   a. Log Extent Freeing Done (EFD) items for all the space that was consumed
25477fb8ccffSDarrick J. Wong      by the btree builder.  The new EFDs must point to the EFIs attached to
25487fb8ccffSDarrick J. Wong      the reservation to prevent log recovery from freeing the new blocks.
25497fb8ccffSDarrick J. Wong
25507fb8ccffSDarrick J. Wong   b. For unclaimed portions of incore reservations, create a regular deferred
      extent free work item to free the unused space later in the
25527fb8ccffSDarrick J. Wong      transaction chain.
25537fb8ccffSDarrick J. Wong
25547fb8ccffSDarrick J. Wong   c. The EFDs and EFIs logged in steps 2a and 2b must not overrun the
25557fb8ccffSDarrick J. Wong      reservation of the committing transaction.
25567fb8ccffSDarrick J. Wong      If the btree loading code suspects this might be about to happen, it must
25577fb8ccffSDarrick J. Wong      call ``xrep_defer_finish`` to clear out the deferred work and obtain a
25587fb8ccffSDarrick J. Wong      fresh transaction.
25597fb8ccffSDarrick J. Wong
25607fb8ccffSDarrick J. Wong3. Clear out the deferred work a second time to finish the commit and clean
25617fb8ccffSDarrick J. Wong   the repair transaction.
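
The accounting behind steps 2a and 2b can be sketched as follows.
The ``toy_*`` structures and functions are hypothetical; they model only how
an incore reservation is consumed by the builder and then split into a
claimed part (kept by the new btree) and an unclaimed part (to be freed).
No log items are created here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct toy_resv {
	uint64_t start;		/* first block of the reserved extent */
	uint64_t len;		/* number of blocks reserved */
	uint64_t used;		/* blocks claimed by the btree builder */
};

struct toy_extent {
	uint64_t start, len;
};

/* Claim the next unused block from the reservation list. */
static int toy_resv_claim(struct toy_resv *resv, size_t nr, uint64_t *bno)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		if (resv[i].used < resv[i].len) {
			*bno = resv[i].start + resv[i].used++;
			return 0;
		}
	}
	return -1;		/* reservations exhausted */
}

/*
 * At commit time, split a reservation into the part that now belongs to the
 * new btree and the part that must be freed.
 */
static void toy_resv_split(const struct toy_resv *resv,
			   struct toy_extent *kept,
			   struct toy_extent *unused)
{
	assert(resv->used <= resv->len);
	kept->start = resv->start;
	kept->len = resv->used;
	unused->start = resv->start + resv->used;
	unused->len = resv->len - resv->used;
}
```

A caller would cover the ``kept`` extent with an EFD so that log recovery
leaves it alone, and schedule a deferred free for any ``unused`` extent of
nonzero length.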
25627fb8ccffSDarrick J. Wong
The transaction rolling in steps 2c and 3 represents a weakness in the repair
25647fb8ccffSDarrick J. Wongalgorithm, because a log flush and a crash before the end of the reap step can
25657fb8ccffSDarrick J. Wongresult in space leaking.
Online repair functions minimize the chances of this occurring by using very
large transactions, each of which can accommodate many thousands of block freeing
25687fb8ccffSDarrick J. Wonginstructions.
25697fb8ccffSDarrick J. WongRepair moves on to reaping the old blocks, which will be presented in a
25707fb8ccffSDarrick J. Wongsubsequent :ref:`section<reaping>` after a few case studies of bulk loading.
25717fb8ccffSDarrick J. Wong
25727fb8ccffSDarrick J. WongCase Study: Rebuilding the Inode Index
25737fb8ccffSDarrick J. Wong^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
25747fb8ccffSDarrick J. Wong
25757fb8ccffSDarrick J. WongThe high level process to rebuild the inode index btree is:
25767fb8ccffSDarrick J. Wong
25777fb8ccffSDarrick J. Wong1. Walk the reverse mapping records to generate ``struct xfs_inobt_rec``
25787fb8ccffSDarrick J. Wong   records from the inode chunk information and a bitmap of the old inode btree
25797fb8ccffSDarrick J. Wong   blocks.
25807fb8ccffSDarrick J. Wong
25817fb8ccffSDarrick J. Wong2. Append the records to an xfarray in inode order.
25827fb8ccffSDarrick J. Wong
25837fb8ccffSDarrick J. Wong3. Use the ``xfs_btree_bload_compute_geometry`` function to compute the number
25847fb8ccffSDarrick J. Wong   of blocks needed for the inode btree.
25857fb8ccffSDarrick J. Wong   If the free space inode btree is enabled, call it again to estimate the
25867fb8ccffSDarrick J. Wong   geometry of the finobt.
25877fb8ccffSDarrick J. Wong
25887fb8ccffSDarrick J. Wong4. Allocate the number of blocks computed in the previous step.
25897fb8ccffSDarrick J. Wong
25907fb8ccffSDarrick J. Wong5. Use ``xfs_btree_bload`` to write the xfarray records to btree blocks and
25917fb8ccffSDarrick J. Wong   generate the internal node blocks.
25927fb8ccffSDarrick J. Wong   If the free space inode btree is enabled, call it again to load the finobt.
25937fb8ccffSDarrick J. Wong
25947fb8ccffSDarrick J. Wong6. Commit the location of the new btree root block(s) to the AGI.
25957fb8ccffSDarrick J. Wong
25967fb8ccffSDarrick J. Wong7. Reap the old btree blocks using the bitmap created in step 1.
25977fb8ccffSDarrick J. Wong
25987fb8ccffSDarrick J. WongDetails are as follows.
25997fb8ccffSDarrick J. Wong
26007fb8ccffSDarrick J. WongThe inode btree maps inumbers to the ondisk location of the associated
26017fb8ccffSDarrick J. Wonginode records, which means that the inode btrees can be rebuilt from the
26027fb8ccffSDarrick J. Wongreverse mapping information.
Reverse mapping records with an owner of ``XFS_RMAP_OWN_INOBT`` mark the
26047fb8ccffSDarrick J. Wonglocation of the old inode btree blocks.
26057fb8ccffSDarrick J. WongEach reverse mapping record with an owner of ``XFS_RMAP_OWN_INODES`` marks the
26067fb8ccffSDarrick J. Wonglocation of at least one inode cluster buffer.
26077fb8ccffSDarrick J. WongA cluster is the smallest number of ondisk inodes that can be allocated or
26087fb8ccffSDarrick J. Wongfreed in a single transaction; it is never smaller than 1 fs block or 4 inodes.
26097fb8ccffSDarrick J. Wong
26107fb8ccffSDarrick J. WongFor the space represented by each inode cluster, ensure that there are no
26117fb8ccffSDarrick J. Wongrecords in the free space btrees nor any records in the reference count btree.
26127fb8ccffSDarrick J. WongIf there are, the space metadata inconsistencies are reason enough to abort the
26137fb8ccffSDarrick J. Wongoperation.
26147fb8ccffSDarrick J. WongOtherwise, read each cluster buffer to check that its contents appear to be
26157fb8ccffSDarrick J. Wongondisk inodes and to decide if the file is allocated
26167fb8ccffSDarrick J. Wong(``xfs_dinode.i_mode != 0``) or free (``xfs_dinode.i_mode == 0``).
26177fb8ccffSDarrick J. WongAccumulate the results of successive inode cluster buffer reads until there is
26187fb8ccffSDarrick J. Wongenough information to fill a single inode chunk record, which is 64 consecutive
26197fb8ccffSDarrick J. Wongnumbers in the inumber keyspace.
26207fb8ccffSDarrick J. WongIf the chunk is sparse, the chunk record may include holes.
26217fb8ccffSDarrick J. Wong
26227fb8ccffSDarrick J. WongOnce the repair function accumulates one chunk's worth of data, it calls
26237fb8ccffSDarrick J. Wong``xfarray_append`` to add the inode btree record to the xfarray.
26247fb8ccffSDarrick J. WongThis xfarray is walked twice during the btree creation step -- once to populate
26257fb8ccffSDarrick J. Wongthe inode btree with all inode chunk records, and a second time to populate the
26267fb8ccffSDarrick J. Wongfree inode btree with records for chunks that have free non-sparse inodes.
26277fb8ccffSDarrick J. WongThe number of records for the inode btree is the number of xfarray records,
26287fb8ccffSDarrick J. Wongbut the record count for the free inode btree has to be computed as inode chunk
26297fb8ccffSDarrick J. Wongrecords are stored in the xfarray.
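
The accumulation step described above can be sketched in userspace C.
This is a simplified model and not the kernel's code: ``struct chunk_rec`` and
the helper names are hypothetical, and the sketch assumes that each of the 16
holemask bits covers 4 inodes and that inodes in sparse regions stay marked
free in the 64-bit free mask.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define INODES_PER_CHUNK	64
#define INODES_PER_HOLEMASK_BIT	4	/* assumption: 64 inodes / 16 mask bits */

/* Hypothetical model of one inode chunk record while it is being built. */
struct chunk_rec {
	uint64_t	startino;	/* first inumber covered by the chunk */
	uint16_t	holemask;	/* set bit: 4 inodes with no backing space */
	uint64_t	free;		/* set bit: inode is not allocated */
	unsigned int	count;		/* inodes physically present */
	unsigned int	freecount;	/* present inodes that are free */
};

/* Start with a fully sparse chunk; cluster buffer reads fill it in. */
static void chunk_rec_init(struct chunk_rec *rec, uint64_t startino)
{
	rec->startino = startino;
	rec->holemask = UINT16_MAX;
	rec->free = UINT64_MAX;
	rec->count = 0;
	rec->freecount = 0;
}

/* Record one ondisk inode; idx is its offset (0..63) within the chunk. */
static void chunk_rec_add_inode(struct chunk_rec *rec, unsigned int idx,
				uint16_t i_mode)
{
	assert(idx < INODES_PER_CHUNK);

	rec->holemask &= (uint16_t)~(1U << (idx / INODES_PER_HOLEMASK_BIT));
	rec->count++;
	if (i_mode == 0)
		rec->freecount++;			/* free: bit stays set */
	else
		rec->free &= ~(UINT64_C(1) << idx);	/* allocated */
}

/* A chunk needs a finobt record only if a present inode is free. */
static bool chunk_rec_needs_finobt(const struct chunk_rec *rec)
{
	return rec->freecount > 0;
}
```

In this model, a caller would invoke ``chunk_rec_add_inode`` once for each
inode found in a cluster buffer, append the finished record to the xfarray,
and bump a separate finobt record counter whenever ``chunk_rec_needs_finobt``
returns true.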

The proposed patchset is the
`AG btree repair
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_
series.

Case Study: Rebuilding the Space Reference Counts
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Reverse mapping records are used to rebuild the reference count information.
Reference counts are required for correct operation of copy on write for shared
file data.
Imagine the reverse mapping entries as rectangles representing extents of
physical blocks, and that the rectangles can be laid down to allow them to
overlap each other.
From the diagram below, it is apparent that a reference count record must start
or end wherever the height of the stack changes.
In other words, the record emission stimulus is level-triggered::

                        █    ███
              ██      █████ ████   ███        ██████
        ██   ████     ███████████ ████     █████████
        ████████████████████████████████ ███████████
        ^ ^  ^^ ^^    ^ ^^ ^^^  ^^^^  ^ ^^ ^  ^     ^
        2 1  23 21    3 43 234  2123  1 01 2  3     0

The ondisk reference count btree does not store the refcount == 0 cases because
the free space btree already records which blocks are free.
Extents being used to stage copy-on-write operations should be the only records
with refcount == 1.
Single-owner file blocks aren't recorded in either the free space or the
reference count btrees.

The high level process to rebuild the reference count btree is:

1. Walk the reverse mapping records to generate ``struct xfs_refcount_irec``
   records for any space having more than one reverse mapping and add them to
   the xfarray.
   Any records owned by ``XFS_RMAP_OWN_COW`` are also added to the xfarray
   because these are extents allocated to stage a copy on write operation and
   are tracked in the refcount btree.

   Use any records owned by ``XFS_RMAP_OWN_REFC`` to create a bitmap of old
   refcount btree blocks.

2. Sort the records in physical extent order, putting the CoW staging extents
   at the end of the xfarray.
   This matches the sorting order of records in the refcount btree.

3. Use the ``xfs_btree_bload_compute_geometry`` function to compute the number
   of blocks needed for the new tree.

4. Allocate the number of blocks computed in the previous step.

5. Use ``xfs_btree_bload`` to write the xfarray records to btree blocks and
   generate the internal node blocks.

6. Commit the location of the new btree root block to the AGF.

7. Reap the old btree blocks using the bitmap created in step 1.

Details are as follows; the same algorithm is used by ``xfs_repair`` to
generate refcount information from reverse mapping records.

- Until the reverse mapping btree runs out of records:

  - Retrieve the next record from the btree and put it in a bag.

  - Collect all records with the same starting block from the btree and put
    them in the bag.

  - While the bag isn't empty:

    - Among the mappings in the bag, compute the lowest block number where the
      reference count changes.
      This position will be either the starting block number of the next
      unprocessed reverse mapping or the next block after the shortest mapping
      in the bag.

    - Remove all mappings from the bag that end at this position.

    - Collect all reverse mappings that start at this position from the btree
      and put them in the bag.

    - If the size of the bag changed and is greater than one, create a new
      refcount record associating the block number range that we just walked to
      the size of the bag.

The bag-like structure in this case is a type 2 xfarray as discussed in the
:ref:`xfarray access patterns<xfarray_access_patterns>` section.
Reverse mappings are added to the bag using ``xfarray_store_anywhere`` and
removed via ``xfarray_unset``.
Bag members are examined through ``xfarray_iter`` loops.
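
The bag algorithm described above can be modeled in a short piece of
userspace C.
This is an illustration under stated assumptions, not the kernel or
``xfs_repair`` implementation: the bag is a small fixed-size array instead of
an xfarray, the input is an array of reverse mappings that is already sorted by
starting block and contains no zero-length entries, CoW staging extents are
ignored, and all type and function names are made up for the example.
Each emitted record carries the number of mappings that covered the range just
walked, that is, the size of the bag before it changed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BAG_MAX	16	/* arbitrary capacity for this sketch */

struct rmap_extent { uint32_t start; uint32_t len; };
struct refc_rec { uint32_t start; uint32_t len; uint32_t refcount; };

/* Lowest block at which the bag gains or loses a mapping. */
static uint32_t next_edge(const struct rmap_extent *rmaps, size_t nr,
			  size_t next, const struct rmap_extent *bag,
			  size_t bag_nr)
{
	uint32_t edge = next < nr ? rmaps[next].start : UINT32_MAX;

	for (size_t i = 0; i < bag_nr; i++) {
		uint32_t end = bag[i].start + bag[i].len;

		if (end < edge)
			edge = end;
	}
	return edge;
}

/* rmaps[] must be sorted by start block; returns the records emitted. */
static size_t compute_refcounts(const struct rmap_extent *rmaps, size_t nr,
				struct refc_rec *out, size_t out_max)
{
	struct rmap_extent bag[BAG_MAX];
	size_t bag_nr = 0, next = 0, nout = 0;

	while (next < nr) {
		uint32_t cbno = rmaps[next].start; /* start of current run */
		size_t old_nr;

		/* Seed the bag with every mapping starting at cbno. */
		while (next < nr && rmaps[next].start == cbno) {
			assert(bag_nr < BAG_MAX);
			bag[bag_nr++] = rmaps[next++];
		}
		old_nr = bag_nr;

		while (bag_nr > 0) {
			uint32_t nbno = next_edge(rmaps, nr, next, bag, bag_nr);

			/* Drop mappings that end at nbno. */
			for (size_t i = 0; i < bag_nr; ) {
				if (bag[i].start + bag[i].len == nbno)
					bag[i] = bag[--bag_nr];
				else
					i++;
			}
			/* Pick up mappings that start at nbno. */
			while (next < nr && rmaps[next].start == nbno) {
				assert(bag_nr < BAG_MAX);
				bag[bag_nr++] = rmaps[next++];
			}
			/* The run [cbno, nbno) had old_nr owners. */
			if (bag_nr != old_nr) {
				if (old_nr > 1) {
					assert(nout < out_max);
					out[nout++] = (struct refc_rec){
						cbno, nbno - cbno,
						(uint32_t)old_nr };
				}
				cbno = nbno;
			}
			old_nr = bag_nr;
		}
	}
	return nout;
}
```

Because the run start only advances when the bag size changes, a mapping that
ends exactly where another begins does not split the emitted record.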

The proposed patchset is the
`AG btree repair
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_
series.

Case Study: Rebuilding File Fork Mapping Indices
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The high level process to rebuild a data/attr fork mapping btree is:

1. Walk the reverse mapping records for that inode and fork to generate
   ``struct xfs_bmbt_rec`` records.
   Append these records to an xfarray.
   Compute the bitmap of the old bmap btree blocks from the ``BMBT_BLOCK``
   records.

2. Use the ``xfs_btree_bload_compute_geometry`` function to compute the number
   of blocks needed for the new tree.

3. Sort the records in file offset order.

4. If the extent records would fit in the inode fork immediate area, commit the
   records to that immediate area and skip to step 8.

5. Allocate the number of blocks computed in step 2.

6. Use ``xfs_btree_bload`` to write the xfarray records to btree blocks and
   generate the internal node blocks.

7. Commit the new btree root block to the inode fork immediate area.

8. Reap the old btree blocks using the bitmap created in step 1.

There are some complications here:
First, it's possible to move the fork offset to adjust the sizes of the
immediate areas if the data and attr forks are not both in BMBT format.
Second, if there are sufficiently few fork mappings, it may be possible to use
EXTENTS format instead of BMBT, which may require a conversion.
Third, the incore extent map must be reloaded carefully to avoid disturbing
any delayed allocation extents.
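
The sort in step 3 and the format decision in step 4 can be illustrated with a
small userspace sketch.
Everything here is hypothetical scaffolding: ``struct fork_mapping`` stands in
for an unpacked mapping record, ``fork_bytes`` is the size of the immediate
area that the caller is assumed to know, and the sketch assumes that each
packed ondisk mapping record occupies 16 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define BMBT_REC_BYTES	16	/* assumed size of one packed ondisk mapping */

enum fork_format { FORK_EXTENTS, FORK_BTREE };

/* Hypothetical unpacked mapping gathered from one rmap record. */
struct fork_mapping {
	uint64_t	startoff;	/* file offset, in blocks */
	uint64_t	startblock;	/* physical block */
	uint64_t	blockcount;	/* length, in blocks */
};

/* qsort comparator: ascending file offset. */
static int cmp_startoff(const void *a, const void *b)
{
	const struct fork_mapping *ma = a, *mb = b;

	return (ma->startoff > mb->startoff) - (ma->startoff < mb->startoff);
}

/*
 * Sort the mappings by file offset, then report whether they fit in an
 * immediate area of fork_bytes bytes or need a mapping btree.
 */
static enum fork_format plan_fork_rebuild(struct fork_mapping *maps, size_t nr,
					  size_t fork_bytes)
{
	assert(maps != NULL || nr == 0);

	if (nr > 1)
		qsort(maps, nr, sizeof(*maps), cmp_startoff);
	return nr * BMBT_REC_BYTES <= fork_bytes ? FORK_EXTENTS : FORK_BTREE;
}
```

A result of ``FORK_EXTENTS`` corresponds to skipping the block allocation and
bulk loading steps, and ``FORK_BTREE`` corresponds to running them.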

The proposed patchset is the
`file mapping repair
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-file-mappings>`_
series.

.. _reaping:

Reaping Old Metadata Blocks
---------------------------

Whenever online fsck builds a new data structure to replace one that is
suspect, there is a question of how to find and dispose of the blocks that
belonged to the old structure.
The laziest method of course is not to deal with them at all, but this slowly
leads to service degradations as space leaks out of the filesystem.
Hopefully, someone will schedule a rebuild of the free space information to
plug all those leaks.
Offline repair rebuilds all space metadata after recording the usage of
the files and directories that it decides not to clear, hence it can build new
structures in the discovered free space and avoid the question of reaping.

As part of a repair, online fsck relies heavily on the reverse mapping records
to find space that is owned by the corresponding rmap owner yet truly free.
Cross referencing rmap records with other rmap records is necessary because
there may be other data structures that also think they own some of those
blocks (e.g. crosslinked trees).
Permitting the block allocator to hand them out again will not push the system
towards consistency.

For space metadata, the process of finding extents to dispose of generally
follows this format:

1. Create a bitmap of space used by data structures that must be preserved.
   The space reservations used to create the new metadata can be used here if
   the same rmap owner code is used to denote all of the objects being rebuilt.

2. Survey the reverse mapping data to create a bitmap of space owned by the
   same ``XFS_RMAP_OWN_*`` number for the metadata that is being preserved.

3. Use the bitmap disunion operator to subtract (1) from (2).
   The remaining set bits represent candidate extents that could be freed.
   The process moves on to step 4 below.
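
The disunion in step 3 amounts to clearing, from the bitmap of owned space,
every bit that is also set in the bitmap of preserved space.
The sketch below shows the operation on a flat array of 64-bit words for a toy
AG of 256 blocks.
The type and function names are hypothetical and the flat layout is chosen only
for brevity; it is not meant to mirror the kernel's bitmap implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define AG_WORDS	4	/* toy AG of 256 blocks, one bit per block */

struct blk_bitmap { uint64_t w[AG_WORDS]; };

/* Set the bits for blocks [start, start + len). */
static void bitmap_set_range(struct blk_bitmap *b, uint32_t start,
			     uint32_t len)
{
	assert((uint64_t)start + len <= AG_WORDS * 64);

	for (uint32_t i = start; i < start + len; i++)
		b->w[i / 64] |= UINT64_C(1) << (i % 64);
}

/* Is the bit for this block set? */
static bool bitmap_test(const struct blk_bitmap *b, uint32_t bit)
{
	assert(bit < AG_WORDS * 64);

	return b->w[bit / 64] & (UINT64_C(1) << (bit % 64));
}

/* Disunion: clear from @owned every block that is set in @preserve. */
static void bitmap_disunion(struct blk_bitmap *owned,
			    const struct blk_bitmap *preserve)
{
	for (size_t i = 0; i < AG_WORDS; i++)
		owned->w[i] &= ~preserve->w[i];
}
```

After ``bitmap_disunion`` returns, the bits still set in ``owned`` are the
candidate blocks that feed the disposal steps.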

Repairs for file-based metadata such as extended attributes, directories,
symbolic links, quota files and realtime bitmaps are performed by building a
new structure attached to a temporary file and swapping the forks.
Afterward, the mappings in the old file fork are the candidate blocks for
disposal.

The process for disposing of old extents is as follows:

4. For each candidate extent, count the number of reverse mapping records for
   the first block in that extent that do not have the same rmap owner for the
   data structure being repaired.

   - If zero, the block has a single owner and can be freed.

   - If not, the block is part of a crosslinked structure and must not be
     freed.

5. Starting with the next block in the extent, figure out how many more blocks
   have the same zero/nonzero other owner status as that first block.

6. If the region is crosslinked, delete the reverse mapping entry for the
   structure being repaired and move on to the next region.

7. If the region is to be freed, mark any corresponding buffers in the buffer
   cache as stale to prevent log writeback.

8. Free the region and move on.
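
Steps 4 through 8 split each candidate extent into regions that share the same
crosslinked or not-crosslinked status.
The following sketch models only that splitting logic.
It assumes a precomputed array ``other_owners`` that holds, for each block, the
number of reverse mapping records with a different owner; in a real repair that
count would come from querying the rmap btree.
The enum, structure, and function names are invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum reap_action {
	REAP_FREE,		/* sole owner: invalidate buffers and free */
	REAP_UNMAP_ONLY,	/* crosslinked: only drop our rmap record */
};

struct reap_region {
	uint32_t		start;
	uint32_t		len;
	enum reap_action	action;
};

/*
 * other_owners[bno] is assumed to hold the number of rmap records for block
 * bno whose owner differs from the structure under repair.
 */
static size_t split_candidate(const unsigned int *other_owners, uint32_t start,
			      uint32_t len, struct reap_region *out,
			      size_t out_max)
{
	uint32_t bno = start, end = start + len;
	size_t n = 0;

	while (bno < end) {
		int crosslinked = other_owners[bno] != 0;
		uint32_t run = 1;

		/* Extend the run while the zero/nonzero status matches. */
		while (bno + run < end &&
		       (other_owners[bno + run] != 0) == crosslinked)
			run++;

		assert(n < out_max);
		out[n].start = bno;
		out[n].len = run;
		out[n].action = crosslinked ? REAP_UNMAP_ONLY : REAP_FREE;
		n++;
		bno += run;
	}
	return n;
}
```

Each returned region maps onto one pass through steps 6 to 8: a
``REAP_UNMAP_ONLY`` region only loses its reverse mapping, whereas a
``REAP_FREE`` region has its buffers invalidated and its space freed.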

However, there is one complication to this procedure.
Transactions are of finite size, so the reaping process must be careful to roll
the transactions to avoid overruns.
Overruns come from two sources:

a. EFIs logged on behalf of space that is no longer occupied

b. Log items for buffer invalidations

This is also a window in which a crash during the reaping process can leak
blocks.
As stated earlier, online repair functions use very large transactions to
minimize the chances of this occurring.

The proposed patchset is the
`preparation for bulk loading btrees
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-prep-for-bulk-loading>`_
series.

Case Study: Reaping After a Regular Btree Repair
````````````````````````````````````````````````

Old reference count and inode btrees are the easiest to reap because they have
rmap records with special owner codes: ``XFS_RMAP_OWN_REFC`` for the refcount
btree, and ``XFS_RMAP_OWN_INOBT`` for the inode and free inode btrees.
Creating a list of extents to reap the old btree blocks is quite simple,
conceptually:

1. Lock the relevant AGI/AGF header buffers to prevent allocation and frees.

2. For each reverse mapping record with an rmap owner corresponding to the
   metadata structure being rebuilt, set the corresponding range in a bitmap.

3. Walk the current data structures that have the same rmap owner.
   For each block visited, clear that range in the above bitmap.

4. Each set bit in the bitmap represents a block that could be a block from the
   old data structures and hence is a candidate for reaping.
   In other words, ``(rmap_records_owned_by & ~blocks_reachable_by_walk)``
   are the blocks that might be freeable.

If it is possible to maintain the AGF lock throughout the repair (which is the
common case), then step 2 can be performed at the same time as the reverse
mapping record walk that creates the records for the new btree.

Case Study: Rebuilding the Free Space Indices
`````````````````````````````````````````````

The high level process to rebuild the free space indices is:

1. Walk the reverse mapping records to generate ``struct xfs_alloc_rec_incore``
   records from the gaps in the reverse mapping btree.

2. Append the records to an xfarray.

3. Use the ``xfs_btree_bload_compute_geometry`` function to compute the number
   of blocks needed for each new tree.

4. Allocate the number of blocks computed in the previous step from the free
   space information collected.

5. Use ``xfs_btree_bload`` to write the xfarray records to btree blocks and
   generate the internal node blocks for the free space by length index.
   Call it again for the free space by block number index.

6. Commit the locations of the new btree root blocks to the AGF.

7. Reap the old btree blocks by looking for space that is not recorded by the
   reverse mapping btree, the new free space btrees, or the AGFL.

Repairing the free space btrees has three key complications over a regular
btree repair:

First, free space is not explicitly tracked in the reverse mapping records.
Hence, the new free space records must be inferred from gaps in the physical
space component of the keyspace of the reverse mapping btree.
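
Inferring free space from gaps can be sketched as a single pass over the
reverse mappings in physical order.
The code below is a userspace illustration with invented names; it assumes the
mappings are sorted by starting block, tolerates overlapping mappings (shared
blocks), and treats any space between the end of the last mapping and the end
of the AG as free.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct ag_extent { uint32_t start; uint32_t len; };

/* rmaps[] must be sorted by start; returns the number of free extents. */
static size_t free_space_from_gaps(const struct ag_extent *rmaps, size_t nr,
				   uint32_t ag_blocks, struct ag_extent *out,
				   size_t out_max)
{
	uint32_t next_bno = 0;	/* first block not yet covered by any rmap */
	size_t n = 0;

	for (size_t i = 0; i < nr; i++) {
		uint32_t end = rmaps[i].start + rmaps[i].len;

		/* A mapping that starts past next_bno exposes a gap. */
		if (rmaps[i].start > next_bno) {
			assert(n < out_max);
			out[n].start = next_bno;
			out[n].len = rmaps[i].start - next_bno;
			n++;
		}
		/* Overlapping mappings must not move the cursor backwards. */
		if (end > next_bno)
			next_bno = end;
	}

	/* Trailing gap between the last mapping and the end of the AG. */
	if (ag_blocks > next_bno) {
		assert(n < out_max);
		out[n].start = next_bno;
		out[n].len = ag_blocks - next_bno;
		n++;
	}
	return n;
}
```

The sketch leaves out the handling of blocks that belong to the old free space
btrees, which appear as owned space in the reverse mappings.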

Second, free space repairs cannot use the common btree reservation code because
new blocks are reserved out of the free space btrees.
This is impossible when repairing the free space btrees themselves.
However, repair holds the AGF buffer lock for the duration of the free space
index reconstruction, so it can use the collected free space information to
supply the blocks for the new free space btrees.
It is not necessary to back each reserved extent with an EFI because the new
free space btrees are constructed in what the ondisk filesystem thinks is
unowned space.
However, if reserving blocks for the new btrees from the collected free space
information changes the number of free space records, repair must re-estimate
the new free space btree geometry with the new record count until the
reservation is sufficient.
As part of committing the new btrees, repair must ensure that reverse mappings
are created for the reserved blocks and that unused reserved blocks are
inserted into the free space btrees.
Deferred rmap and freeing operations are used to ensure that this transition
is atomic, similar to the other btree repair functions.

Third, finding the blocks to reap after the repair is not overly
straightforward.
Blocks for the free space btrees and the reverse mapping btrees are supplied by
the AGFL.
Blocks put onto the AGFL have reverse mapping records with the owner
``XFS_RMAP_OWN_AG``.
This ownership is retained when blocks move from the AGFL into the free space
btrees or the reverse mapping btrees.
When repair walks reverse mapping records to synthesize free space records, it
creates a bitmap (``ag_owner_bitmap``) of all the space claimed by
``XFS_RMAP_OWN_AG`` records.
The repair context maintains a second bitmap corresponding to the rmap btree
blocks and the AGFL blocks (``rmap_agfl_bitmap``).
When the walk is complete, the bitmap disunion operation ``(ag_owner_bitmap &
~rmap_agfl_bitmap)`` computes the extents that are used by the old free space
btrees.
These blocks can then be reaped using the methods outlined above.

The proposed patchset is the
`AG btree repair
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_
series.

.. _rmap_reap:

Case Study: Reaping After Repairing Reverse Mapping Btrees
``````````````````````````````````````````````````````````

Old reverse mapping btrees are less difficult to reap after a repair.
As mentioned in the previous section, blocks on the AGFL, blocks in the two
free space btrees, and blocks in the reverse mapping btree all have reverse
mapping records with ``XFS_RMAP_OWN_AG`` as the owner.
The full process of gathering reverse mapping records and building a new btree
is described in the case study of
:ref:`live rebuilds of rmap data <rmap_repair>`, but a crucial point from that
discussion is that the new rmap btree will not contain any records for the old
rmap btree, nor will the old btree blocks be tracked in the free space btrees.
The list of candidate reaping blocks is computed by setting the bits
corresponding to the gaps in the new rmap btree records, and then clearing the
bits corresponding to extents in the free space btrees and the current AGFL
blocks.
The resulting blocks ``(new_rmapbt_gaps & ~(agfl | bnobt_records))`` are reaped
using the methods outlined above.

The rest of the process of rebuilding the reverse mapping btree is discussed
in a separate :ref:`case study<rmap_repair>`.

The proposed patchset is the
`AG btree repair
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_
series.

Case Study: Rebuilding the AGFL
```````````````````````````````

The allocation group free block list (AGFL) is repaired as follows:

1. Create a bitmap for all the space that the reverse mapping data claims is
   owned by ``XFS_RMAP_OWN_AG``.

2. Subtract the space used by the two free space btrees and the rmap btree.

3. Subtract any space that the reverse mapping data claims is owned by any
   other owner, to avoid re-adding crosslinked blocks to the AGFL.

4. Once the AGFL is full, reap any leftover blocks.

5. The next operation to fix the freelist will right-size the list.
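
The bitmap arithmetic in steps 1 through 4 can be shown on a toy AG of 64
blocks, where each bitmap fits in one 64-bit word.
This is only a sketch: the function name and parameters are hypothetical, and
filling the list from the lowest-numbered candidate blocks is an arbitrary
choice made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Each argument is a bitmap with one bit per AG block.  On return, *agfl
 * holds the blocks chosen for the new AGFL and *reap holds the leftover
 * candidates.  Returns the number of blocks placed on the AGFL.
 */
static unsigned int agfl_pick_blocks(uint64_t own_ag, uint64_t bnobt,
				     uint64_t cntbt, uint64_t rmapbt,
				     uint64_t other_owners,
				     unsigned int agfl_size, uint64_t *agfl,
				     uint64_t *reap)
{
	/* Steps 1-3: OWN_AG space minus btree blocks and crosslinked blocks. */
	uint64_t candidates = own_ag & ~(bnobt | cntbt | rmapbt) &
			      ~other_owners;
	unsigned int nr = 0;

	assert(agfl != NULL && reap != NULL);

	*agfl = 0;
	*reap = 0;
	for (unsigned int bit = 0; bit < 64; bit++) {
		uint64_t mask = UINT64_C(1) << bit;

		if (!(candidates & mask))
			continue;
		if (nr < agfl_size) {
			*agfl |= mask;	/* goes onto the rebuilt AGFL */
			nr++;
		} else {
			*reap |= mask;	/* step 4: AGFL is full, reap it */
		}
	}
	return nr;
}
```

Blocks left in ``*reap`` would then go through the disposal procedure described
in the reaping section.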

See `fs/xfs/scrub/agheader_repair.c <https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/xfs/scrub/agheader_repair.c>`_ for more details.

Inode Record Repairs
--------------------

Inode records must be handled carefully, because they have both ondisk records
("dinodes") and an in-memory ("cached") representation.
There is a very high potential for cache coherency issues if online fsck is not
careful to access the ondisk metadata *only* when the ondisk metadata is so
badly damaged that the filesystem cannot load the in-memory representation.
When online fsck wants to open a damaged file for scrubbing, it must use
specialized resource acquisition functions that return either the in-memory
representation *or* a lock on whichever object is necessary to prevent any
update to the ondisk location.

The only repairs that should be made to the ondisk inode buffers are whatever
is necessary to get the in-core structure loaded.
This means fixing whatever is caught by the inode cluster buffer and inode fork
verifiers, and retrying the ``iget`` operation.
If the second ``iget`` fails, the repair has failed.

Once the in-memory representation is loaded, repair can lock the inode and can
subject it to comprehensive checks, repairs, and optimizations.
Most inode attributes are easy to check and constrain, or are user-controlled
arbitrary bit patterns; these are both easy to fix.
Dealing with the data and attr fork extent counts and the file block counts is
more complicated, because computing the correct value requires traversing the
forks, or if that fails, leaving the fields invalid and waiting for the fork
fsck functions to run.

The proposed patchset is the
`inode
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-inodes>`_
repair series.

Quota Record Repairs
--------------------

Similar to inodes, quota records ("dquots") also have both ondisk records and
an in-memory representation, and hence are subject to the same cache coherency
issues.
Somewhat confusingly, both are known as dquots in the XFS codebase.

The only repairs that should be made to the ondisk quota record buffers are
whatever is necessary to get the in-core structure loaded.
Once the in-memory representation is loaded, the only attributes needing
checking are obviously bad limits and timer values.

Quota usage counters are checked, repaired, and discussed separately in the
section about :ref:`live quotacheck <quotacheck>`.

The proposed patchset is the
`quota
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quota>`_
repair series.
3056d6978871SDarrick J. Wong
.. _fscounters:

Freezing to Fix Summary Counters
--------------------------------

Filesystem summary counters track availability of filesystem resources such
as free blocks, free inodes, and allocated inodes.
This information could be compiled by walking the free space and inode indexes,
but this is a slow process, so XFS maintains a copy in the ondisk superblock
that should reflect the ondisk metadata, at least when the filesystem has been
unmounted cleanly.
For performance reasons, XFS also maintains incore copies of those counters,
which are key to enabling resource reservations for active transactions.
Writer threads reserve the worst-case quantities of resources from the
incore counter and give back whatever they don't use at commit time.
It is therefore only necessary to serialize on the superblock when the
superblock is being committed to disk.

The lazy superblock counter feature introduced in XFS v5 took this even further
by training log recovery to recompute the summary counters from the AG headers,
which eliminated the need for most transactions even to touch the superblock.
The only time XFS commits the summary counters is at filesystem unmount.
To reduce contention even further, the incore counter is implemented as a
percpu counter, which means that each CPU is allocated a batch of blocks from a
global incore counter and can satisfy small allocations from the local batch.

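The reserve-and-give-back scheme with per-CPU batches can be illustrated with
a small single-threaded model.
This is only a sketch: every ``toy_`` name and the batch size are invented for
this example, there is no real concurrency, and the kernel's percpu counter
implementation differs in many details.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_NR_CPUS	4
#define TOY_BATCH	32	/* extra blocks pulled from the global pool */

/* Hypothetical toy model, not the kernel's percpu counter. */
struct toy_fdblocks {
	int64_t	global;			/* shared pool, contended */
	int64_t	local[TOY_NR_CPUS];	/* per-CPU batches, uncontended */
};

/* Reserve @want blocks on @cpu, refilling the local batch if needed. */
bool toy_reserve(struct toy_fdblocks *c, int cpu, int64_t want)
{
	assert(cpu >= 0 && cpu < TOY_NR_CPUS && want >= 0);

	if (c->local[cpu] < want) {
		int64_t	refill = want - c->local[cpu] + TOY_BATCH;

		if (refill > c->global)
			refill = c->global;
		c->global -= refill;
		c->local[cpu] += refill;
	}
	if (c->local[cpu] < want)
		return false;		/* a real filesystem would see ENOSPC */
	c->local[cpu] -= want;
	return true;
}

/* Give back the unused part of a worst-case reservation at commit time. */
void toy_unreserve(struct toy_fdblocks *c, int cpu, int64_t unused)
{
	assert(cpu >= 0 && cpu < TOY_NR_CPUS && unused >= 0);
	c->local[cpu] += unused;
}

/* The true total is only known after summing every CPU's share. */
int64_t toy_sum(const struct toy_fdblocks *c)
{
	int64_t	sum = c->global;

	for (int i = 0; i < TOY_NR_CPUS; i++)
		sum += c->local[i];
	return sum;
}
```

In this model, ``toy_sum`` must visit every CPU's share to learn the total,
which hints at why a precise reading is hard to take while writers are active.
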
The high-performance nature of the summary counters makes it difficult for
online fsck to check them, since there is no way to quiesce a percpu counter
while the system is running.
Although online fsck can read the filesystem metadata to compute the correct
values of the summary counters, there's no way to hold the value of a percpu
counter stable, so it's quite possible that the counter will be out of date by
the time the walk is complete.
Earlier versions of online scrub would return to userspace with an incomplete
scan flag, but this is not a satisfying outcome for a system administrator.
For repairs, the in-memory counters must be stabilized while walking the
filesystem metadata to get an accurate reading and install it in the percpu
counter.

To satisfy this requirement, online fsck must prevent other programs in the
system from initiating new writes to the filesystem, it must disable background
garbage collection threads, and it must wait for existing writer programs to
exit the kernel.
Once that has been established, scrub can walk the AG free space indexes, the
inode btrees, and the realtime bitmap to compute the correct value of all
four summary counters.
This is very similar to a filesystem freeze, though not all of the pieces are
necessary:

- The final freeze state is set one higher than ``SB_FREEZE_COMPLETE`` to
  prevent other threads from thawing the filesystem, or other scrub threads
  from initiating another fscounters freeze.

- It does not quiesce the log.

With this code in place, it is now possible to pause the filesystem for just
long enough to check and correct the summary counters.

+--------------------------------------------------------------------------+
| **Historical Sidebar**:                                                  |
+--------------------------------------------------------------------------+
| The initial implementation used the actual VFS filesystem freeze         |
| mechanism to quiesce filesystem activity.                                |
| With the filesystem frozen, it is possible to resolve the counter values |
| with exact precision, but there are many problems with calling the VFS   |
| methods directly:                                                        |
|                                                                          |
| - Other programs can unfreeze the filesystem without our knowledge.      |
|   This leads to incorrect scan results and incorrect repairs.            |
|                                                                          |
| - Adding an extra lock to prevent others from thawing the filesystem     |
|   required the addition of a ``->freeze_super`` function to wrap         |
|   ``freeze_fs()``.                                                       |
|   This in turn caused other subtle problems because it turns out that    |
|   the VFS ``freeze_super`` and ``thaw_super`` functions can drop the     |
|   last reference to the VFS superblock, and any subsequent access        |
|   becomes a UAF bug!                                                     |
|   This can happen if the filesystem is unmounted while the underlying    |
|   block device has frozen the filesystem.                                |
|   This problem could be solved by grabbing extra references to the       |
|   superblock, but it felt suboptimal given the other inadequacies of     |
|   this approach.                                                         |
|                                                                          |
| - The log need not be quiesced to check the summary counters, but a VFS  |
|   freeze initiates one anyway.                                           |
|   This adds unnecessary runtime to live fscounter fsck operations.       |
|                                                                          |
| - Quiescing the log means that XFS flushes the (possibly incorrect)      |
|   counters to disk as part of cleaning the log.                          |
|                                                                          |
| - A bug in the VFS meant that freeze could complete even when            |
|   sync_filesystem fails to flush the filesystem and returns an error.    |
|   This bug was fixed in Linux 5.17.                                      |
+--------------------------------------------------------------------------+

The proposed patchset is the
`summary counter cleanup
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-fscounters>`_
series.

Full Filesystem Scans
---------------------

Certain types of metadata can only be checked by walking every file in the
entire filesystem to record observations and comparing the observations against
what's recorded on disk.
Like every other type of online repair, repairs are made by writing those
observations to disk in a replacement structure and committing it atomically.
However, it is not practical to shut down the entire filesystem to examine
hundreds of billions of files because the downtime would be excessive.
Therefore, online fsck must build the infrastructure to manage a live scan of
all the files in the filesystem.
There are two questions that need to be solved to perform a live walk:

- How does scrub manage the scan while it is collecting data?

- How does the scan keep abreast of changes being made to the system by other
  threads?

.. _iscan:

Coordinated Inode Scans
```````````````````````

In the original Unix filesystems of the 1970s, each directory entry contained
an index number (*inumber*) which was used as an index into an ondisk array
(*itable*) of fixed-size records (*inodes*) describing a file's attributes and
its data block mapping.
This system is described by J. Lions, `"inode (5659)"
<http://www.lemis.com/grog/Documentation/Lions/>`_ in *Lions' Commentary on
UNIX, 6th Edition*, (Dept. of Computer Science, the University of New South
Wales, November 1977), pp. 18-2; and later by D. Ritchie and K. Thompson,
`"Implementation of the File System"
<https://archive.org/details/bstj57-6-1905/page/n8/mode/1up>`_, from *The UNIX
Time-Sharing System*, (The Bell System Technical Journal, July 1978), pp.
1913-4.

XFS retains most of this design, except now inumbers are search keys over all
the space in the data section of the filesystem.
They form a continuous keyspace that can be expressed as a 64-bit integer,
though the inodes themselves are sparsely distributed within the keyspace.
Scans proceed in a linear fashion across the inumber keyspace, starting from
``0x0`` and ending at ``0xFFFFFFFFFFFFFFFF``.
Naturally, a scan through a keyspace requires a scan cursor object to track the
scan progress.
Because this keyspace is sparse, this cursor contains two parts.
The first part of this scan cursor object tracks the inode that will be
examined next; call this the examination cursor.
Somewhat less obviously, the scan cursor object must also track which parts of
the keyspace have already been visited, which is critical for deciding if a
concurrent filesystem update needs to be incorporated into the scan data.
Call this the visited inode cursor.

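A minimal sketch of the two-part cursor, and of the decision that the visited
inode cursor enables, might look like the following.
The structure, its field names, and the helper are hypothetical and chosen
only for illustration; the sketch also assumes that inumber zero never names a
real file, so a freshly zeroed cursor has visited nothing of interest.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical two-part scan cursor; names are illustrative only. */
struct toy_iscan {
	uint64_t	cursor_ino;	/* examination cursor: next inode to examine */
	uint64_t	visited_ino;	/* visited cursor: inumbers <= this are done */
};

/*
 * Decide whether a concurrent update to @ino must be folded into the scan
 * data.  The scan will not return to inodes that it has already visited, so
 * it needs to hear about changes to them.  Inodes beyond the visited cursor
 * will be observed in their updated state when the scan reaches them.
 */
bool toy_iscan_want_live_update(const struct toy_iscan *iscan, uint64_t ino)
{
	assert(iscan);
	return ino <= iscan->visited_ino;
}
```
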
Advancing the scan cursor is a multi-step process encapsulated in
``xchk_iscan_iter``:

1. Lock the AGI buffer of the AG containing the inode pointed to by the visited
   inode cursor.
   This guarantees that inodes in this AG cannot be allocated or freed while
   advancing the cursor.

2. Use the per-AG inode btree to look up the next inumber after the one that
   was just visited, since it may not be keyspace adjacent.

3. If there are no more inodes left in this AG:

   a. Move the examination cursor to the point of the inumber keyspace that
      corresponds to the start of the next AG.

   b. Adjust the visited inode cursor to indicate that it has "visited" the
      last possible inode in the current AG's inode keyspace.
      XFS inumbers are segmented, so the cursor needs to be marked as having
      visited the entire keyspace up to just before the start of the next AG's
      inode keyspace.

   c. Unlock the AGI and return to step 1 if there are unexamined AGs in the
      filesystem.

   d. If there are no more AGs to examine, set both cursors to the end of the
      inumber keyspace.
      The scan is now complete.

4. Otherwise, there is at least one more inode to scan in this AG:

   a. Move the examination cursor ahead to the next inode marked as allocated
      by the inode btree.

   b. Adjust the visited inode cursor to point to the inode just prior to where
      the examination cursor is now.
      Because the scanner holds the AGI buffer lock, no inodes could have been
      created in the part of the inode keyspace that the visited inode cursor
      just advanced.

5. Get the incore inode for the inumber of the examination cursor.
   By maintaining the AGI buffer lock until this point, the scanner knows that
   it was safe to advance the examination cursor across the entire keyspace,
   and that it has stabilized this next inode so that it cannot disappear from
   the filesystem until the scan releases the incore inode.

6. Drop the AGI lock and return the incore inode to the caller.

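The cursor arithmetic in the steps above can be sketched with a toy model in
which each AG is a sorted array of allocated inode numbers.
Everything here is an assumption made for the example: the ``toy_`` names, the
16-bit split between AG number and per-AG inode number, the rule that per-AG
inode zero is never allocated, and the absence of any real locking or inode
cache.
It is not the implementation of ``xchk_iscan_iter``.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_AGINO_BITS	16	/* hypothetical split: inumber = agno:agino */
#define TOY_AGINO_MASK	((1ULL << TOY_AGINO_BITS) - 1)
#define TOY_SCAN_DONE	UINT64_MAX

struct toy_ag {
	const uint32_t	*aginos;	/* allocated inodes, sorted ascending */
	size_t		nr;
};

struct toy_fs {
	const struct toy_ag	*ags;
	uint32_t		agcount;
};

struct toy_iscan {
	uint64_t	cursor_ino;	/* examination cursor */
	uint64_t	visited_ino;	/* visited inode cursor */
};

/* Stand-in for an inode btree lookup: first allocated agino >= @agino. */
static bool toy_inobt_lookup_ge(const struct toy_ag *ag, uint32_t agino,
				uint32_t *found)
{
	for (size_t i = 0; i < ag->nr; i++) {
		if (ag->aginos[i] >= agino) {
			*found = ag->aginos[i];
			return true;
		}
	}
	return false;
}

/* The caller records that it has finished examining @ino. */
void toy_iscan_mark_visited(struct toy_iscan *iscan, uint64_t ino)
{
	iscan->visited_ino = ino;
}

/* Returns true and sets @ino if there is another inode to examine. */
bool toy_iscan_iter(const struct toy_fs *fs, struct toy_iscan *iscan,
		    uint64_t *ino)
{
	assert(fs->agcount > 0);

	while (iscan->cursor_ino != TOY_SCAN_DONE) {
		uint64_t	start = iscan->visited_ino + 1;
		uint32_t	agno = (uint32_t)(start >> TOY_AGINO_BITS);
		uint32_t	agino = (uint32_t)(start & TOY_AGINO_MASK);
		uint32_t	next;

		/* Step 3d: no AGs left, so park both cursors at the end. */
		if (agno >= fs->agcount) {
			iscan->cursor_ino = TOY_SCAN_DONE;
			iscan->visited_ino = TOY_SCAN_DONE;
			break;
		}

		/* Step 1 would lock this AG's AGI buffer here. */
		/* Step 2: look up the next allocated inode in this AG. */
		if (toy_inobt_lookup_ge(&fs->ags[agno], agino, &next)) {
			/* Step 4: move both cursors up to the inode found. */
			iscan->cursor_ino =
				((uint64_t)agno << TOY_AGINO_BITS) | next;
			iscan->visited_ino = iscan->cursor_ino - 1;
			/* Steps 5 and 6 would grab the inode and unlock. */
			*ino = iscan->cursor_ino;
			return true;
		}

		/* Steps 3a to 3c: skip the rest of this AG's keyspace. */
		iscan->cursor_ino = ((uint64_t)agno + 1) << TOY_AGINO_BITS;
		iscan->visited_ino = iscan->cursor_ino - 1;
	}
	return false;
}
```

In this toy, a caller that does not call ``toy_iscan_mark_visited`` before
iterating again is handed the same inode a second time.
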
Online fsck functions scan all files in the filesystem as follows:

1. Start a scan by calling ``xchk_iscan_start``.

2. Advance the scan cursor (``xchk_iscan_iter``) to get the next inode.
   If one is provided:

   a. Lock the inode to prevent updates during the scan.

   b. Scan the inode.

   c. While still holding the inode lock, adjust the visited inode cursor
      (``xchk_iscan_mark_visited``) to point to this inode.

   d. Unlock and release the inode.

3. Call ``xchk_iscan_teardown`` to complete the scan.

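The calling pattern above can be sketched as a loop over a toy inode table.
The ``toy_`` functions only mimic the shape of the ``xchk_iscan_*`` calls
named in the list; their signatures, the boolean stand-in for the inode lock,
and the block-counting "scan" are all assumptions made for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct toy_inode {
	uint64_t	ino;
	uint64_t	nblocks;
	bool		locked;		/* stand-in for the inode lock */
};

struct toy_iscan {
	struct toy_inode	*inodes;	/* sorted by ino */
	size_t			nr;
	size_t			next;		/* examination cursor (index) */
	uint64_t		visited_ino;	/* visited inode cursor */
};

void toy_iscan_start(struct toy_iscan *iscan, struct toy_inode *inodes,
		     size_t nr)
{
	iscan->inodes = inodes;
	iscan->nr = nr;
	iscan->next = 0;
	iscan->visited_ino = 0;
}

/* Hand back the next inode, or NULL when the scan is complete. */
struct toy_inode *toy_iscan_iter(struct toy_iscan *iscan)
{
	if (iscan->next >= iscan->nr)
		return NULL;
	return &iscan->inodes[iscan->next++];
}

void toy_iscan_mark_visited(struct toy_iscan *iscan,
			    const struct toy_inode *ip)
{
	assert(ip->locked);	/* must be called with the inode still locked */
	iscan->visited_ino = ip->ino;
}

/* Walk every inode and total up the blocks that the files own. */
uint64_t toy_count_blocks(struct toy_inode *inodes, size_t nr)
{
	struct toy_iscan	iscan;
	struct toy_inode	*ip;
	uint64_t		total = 0;

	toy_iscan_start(&iscan, inodes, nr);		/* step 1 */
	while ((ip = toy_iscan_iter(&iscan)) != NULL) {	/* step 2 */
		ip->locked = true;			/* step 2a */
		total += ip->nblocks;			/* step 2b */
		toy_iscan_mark_visited(&iscan, ip);	/* step 2c */
		ip->locked = false;			/* step 2d */
	}
	/* Step 3: this toy has nothing to tear down. */
	return total;
}
```
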
There are subtleties with the inode cache that complicate grabbing the incore
inode for the caller.
Obviously, it is an absolute requirement that the inode metadata be consistent
enough to load it into the inode cache.
Second, if the incore inode is stuck in some intermediate state, the scan
coordinator must release the AGI and push the main filesystem to get the inode
back into a loadable state.

The proposed patches are the
`inode scanner
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-iscan>`_
series.
The first user of the new functionality is the
`online quotacheck
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quotacheck>`_
series.

Inode Management
````````````````

In regular filesystem code, references to allocated XFS incore inodes are
always obtained (``xfs_iget``) outside of transaction context because the
creation of the incore context for an existing file does not require metadata
updates.
However, it is important to note that references to incore inodes obtained as
part of file creation must be performed in transaction context because the
filesystem must ensure the atomicity of the ondisk inode btree index updates
and the initialization of the actual ondisk inode.

References to incore inodes are always released (``xfs_irele``) outside of
transaction context because there are a handful of activities that might
require ondisk updates:

- The VFS may decide to kick off writeback as part of a ``DONTCACHE`` inode
  release.

- Speculative preallocations need to be unreserved.

- An unlinked file may have lost its last reference, in which case the entire
  file must be inactivated, which involves releasing all of its resources in
  the ondisk metadata and freeing the inode.

These activities are collectively called inode inactivation.
Inactivation has two parts -- the VFS part, which initiates writeback on all
dirty file pages, and the XFS part, which cleans up XFS-specific information
and frees the inode if it was unlinked.
If the inode is unlinked (or unconnected after a file handle operation), the
kernel drops the inode into the inactivation machinery immediately.

During normal operation, resource acquisition for an update follows this order
to avoid deadlocks:

1. Inode reference (``iget``).

2. Filesystem freeze protection, if repairing (``mnt_want_write_file``).

3. Inode ``IOLOCK`` (VFS ``i_rwsem``) lock to control file IO.

4. Inode ``MMAPLOCK`` (page cache ``invalidate_lock``) lock for operations that
   can update page cache mappings.

5. Log feature enablement.

6. Transaction log space grant.

7. Space on the data and realtime devices for the transaction.

8. Incore dquot references, if a file is being repaired.
   Note that they are not locked, merely acquired.

9. Inode ``ILOCK`` for file metadata updates.

10. AG header buffer locks / Realtime metadata inode ILOCK.

11. Realtime metadata buffer locks, if applicable.

12. Extent mapping btree blocks, if applicable.

Resources are often released in the reverse order, though this is not required.
However, online fsck differs from regular XFS operations because it may examine
an object that normally is acquired in a later stage of the locking order, and
then decide to cross-reference the object with an object that is acquired
earlier in the order.
The next few sections detail the specific ways in which online fsck takes care
to avoid deadlocks.

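One way to picture the problem is to number the resource classes in the order
listed above and ask whether a new acquisition comes later than everything
already held.
The sketch below does exactly that; the enumerator names and helpers are
invented for this example and do not correspond to any kernel interface.

```c
#include <assert.h>
#include <stdbool.h>

/* Acquisition order from the list above; enumerator names are made up. */
enum toy_res {
	TOY_RES_IGET = 1,
	TOY_RES_FREEZE_PROT,
	TOY_RES_IOLOCK,
	TOY_RES_MMAPLOCK,
	TOY_RES_LOG_FEATURE,
	TOY_RES_LOG_GRANT,
	TOY_RES_DEV_SPACE,
	TOY_RES_DQUOT_REF,
	TOY_RES_ILOCK,
	TOY_RES_AG_HEADER,	/* or the realtime metadata inode ILOCK */
	TOY_RES_RT_BUFFER,
	TOY_RES_BMBT_BLOCK,
	TOY_RES_MAX = TOY_RES_BMBT_BLOCK,
};

struct toy_held {
	unsigned int	mask;	/* bit N is set if resource class N is held */
};

void toy_acquire(struct toy_held *h, enum toy_res got)
{
	assert(got >= TOY_RES_IGET && got <= TOY_RES_MAX);
	h->mask |= 1U << got;
}

/* Highest-numbered class currently held, or 0 if nothing is held. */
int toy_highest_held(const struct toy_held *h)
{
	for (int r = TOY_RES_MAX; r >= TOY_RES_IGET; r--)
		if (h->mask & (1U << r))
			return r;
	return 0;
}

/*
 * Following the regular order means that @want comes strictly later than
 * everything already held.  Anything else is an out-of-order acquisition
 * that needs special care, such as a trylock with a way to back out.
 */
bool toy_follows_order(const struct toy_held *h, enum toy_res want)
{
	return (int)want > toy_highest_held(h);
}
```

A scrubber that already holds an AG header buffer and then wants an inode
reference is the kind of caller for which ``toy_follows_order`` returns false.
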
iget and irele During a Scrub
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

An inode scan performed on behalf of a scrub operation runs in transaction
context, and possibly with resources already locked and bound to it.
This isn't much of a problem for ``iget`` since it can operate in the context
of an existing transaction, as long as all of the bound resources are acquired
before the inode reference in the regular filesystem.

When the VFS ``iput`` function is given a linked inode with no other
references, it normally puts the inode on an LRU list in the hope that it can
save time if another process re-opens the file before the system runs out
of memory and frees it.
Filesystem callers can short-circuit the LRU process by setting a ``DONTCACHE``
flag on the inode to cause the kernel to try to drop the inode into the
inactivation machinery immediately.

In the past, inactivation was always done from the process that dropped the
inode, which was a problem for scrub because scrub may already hold a
transaction, and XFS does not support nesting transactions.
On the other hand, if there is no scrub transaction, it is desirable to drop
otherwise unused inodes immediately to avoid polluting caches.
To capture these nuances, the online fsck code has a separate ``xchk_irele``
function to set or clear the ``DONTCACHE`` flag to get the required release
behavior.

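The decision described above can be modelled in a few lines.
This is a sketch of the idea only, under the assumption that the presence of a
scrub transaction is the sole input; the ``toy_`` types, the flag value, and
the bare reference count are stand-ins and not the kernel's data structures or
the actual body of ``xchk_irele``.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_I_DONTCACHE	(1U << 0)	/* stand-in for the DONTCACHE flag */

struct toy_inode {
	unsigned int	state;
	unsigned int	refcount;
};

struct toy_scrub {
	void	*tp;	/* non-NULL while scrub holds a transaction */
};

/*
 * Release an inode reference on behalf of scrub.  With a transaction held,
 * clear DONTCACHE so that dropping the last reference parks the inode on the
 * LRU instead of starting inactivation from this thread.  Without one, set
 * DONTCACHE so that an otherwise unused inode leaves the cache promptly.
 */
void toy_xchk_irele(const struct toy_scrub *sc, struct toy_inode *ip)
{
	assert(ip->refcount > 0);

	if (sc->tp != NULL)
		ip->state &= ~TOY_I_DONTCACHE;
	else
		ip->state |= TOY_I_DONTCACHE;
	ip->refcount--;
}
```
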
Proposed patchsets include fixing
`scrub iget usage
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-iget-fixes>`_ and
`dir iget usage
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-dir-iget-fixes>`_.

.. _ilocking:

Locking Inodes
^^^^^^^^^^^^^^

In regular filesystem code, the VFS and XFS will acquire multiple IOLOCK locks
in a well-known order: parent → child when updating the directory tree, and
in numerical order of the addresses of their ``struct inode`` object otherwise.
For regular files, the MMAPLOCK can be acquired after the IOLOCK to stop page
faults.
If two MMAPLOCKs must be acquired, they are acquired in numerical order of
the addresses of their ``struct address_space`` objects.
Due to the structure of existing filesystem code, IOLOCKs and MMAPLOCKs must be
acquired before transactions are allocated.
If two ILOCKs must be acquired, they are acquired in inumber order.

Inode lock acquisition must be done carefully during a coordinated inode scan.
Online fsck cannot abide by these conventions, because for a directory tree
scanner, the scrub process holds the IOLOCK of the file being scanned and it
needs to take the IOLOCK of the file at the other end of the directory link.
If the directory tree is corrupt because it contains a cycle, ``xfs_scrub``
cannot use the regular inode locking functions and avoid becoming trapped in an
ABBA deadlock.

Solving both of these problems is straightforward -- any time online fsck
needs to take a second lock of the same class, it uses trylock to avoid an ABBA
deadlock.
If the trylock fails, scrub drops all inode locks and uses trylock loops to
(re)acquire all necessary resources.
Trylock loops enable scrub to check for pending fatal signals, which is how
scrub avoids deadlocking the filesystem or becoming an unresponsive process.
However, trylock loops mean that online fsck must be prepared to measure the
resource being scrubbed before and after the lock cycle to detect changes and
react accordingly.

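The trylock-and-back-out pattern can be sketched as follows.
The lock type here is a single-threaded boolean stand-in, and
``toy_fatal_signal`` stands in for the kernel's pending fatal signal check;
all of the names are hypothetical.
The ``cycled`` output tells the caller that the first lock was dropped at some
point, which is its cue to re-measure whatever it was scrubbing.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

struct toy_lock {
	bool	held;	/* single-threaded stand-in for a sleeping lock */
};

/* Stand-in for a pending fatal signal; set by "someone else". */
volatile bool toy_fatal_signal;

bool toy_trylock(struct toy_lock *l)
{
	if (l->held)
		return false;
	l->held = true;
	return true;
}

void toy_unlock(struct toy_lock *l)
{
	assert(l->held);
	l->held = false;
}

/*
 * The caller holds @held and wants @want, which is of the same lock class.
 * Returns 0 with both locks held, or -EINTR with neither lock held.  In a
 * real multithreaded system the loop would also pause between attempts.
 */
int toy_lock_second(struct toy_lock *held, struct toy_lock *want, bool *cycled)
{
	*cycled = false;
	if (toy_trylock(want))
		return 0;

	/* Back out completely so that the other lock holder can progress. */
	toy_unlock(held);
	*cycled = true;

	for (;;) {
		if (toy_fatal_signal)
			return -EINTR;
		if (!toy_trylock(held))
			continue;
		if (toy_trylock(want))
			return 0;
		toy_unlock(held);
	}
}
```
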
.. _dirparent:

Case Study: Finding a Directory Parent
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Consider the directory parent pointer repair code as an example.
Online fsck must verify that the dotdot dirent of a directory points up to a
parent directory, and that the parent directory contains exactly one dirent
pointing down to the child directory.
Fully validating this relationship (and repairing it if possible) requires a
walk of every directory on the filesystem while holding the child locked, and
while updates to the directory tree are being made.
The coordinated inode scan provides a way to walk the filesystem without the
possibility of missing an inode.
The child directory is kept locked to prevent updates to the dotdot dirent, but
if the scanner fails to lock a parent, it can drop and relock both the child
and the prospective parent.
If the dotdot entry changes while the directory is unlocked, then a move or
rename operation must have changed the child's parentage, and the scan can
exit early.

The proposed patchset is the
`directory repair
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-dirs>`_
series.

.. _fshooks:

Filesystem Hooks
`````````````````

The second piece of support that online fsck functions need during a full
filesystem scan is the ability to stay informed about updates being made by
other threads in the filesystem, since comparisons against the past are useless
in a dynamic environment.
Two pieces of Linux kernel infrastructure enable online fsck to monitor regular
filesystem operations: filesystem hooks and :ref:`static keys<jump_labels>`.

Filesystem hooks convey information about an ongoing filesystem operation to
a downstream consumer.
In this case, the downstream consumer is always an online fsck function.
Because multiple fsck functions can run in parallel, online fsck uses the Linux
notifier call chain facility to dispatch updates to any number of interested
fsck processes.
Call chains are a dynamic list, which means that they can be configured at
run time.
Because these hooks are private to the XFS module, the information passed along
contains exactly what the checking function needs to update its observations.

3478a0d856eeSDarrick J. WongThe current implementation of XFS hooks uses SRCU notifier chains to reduce the
3479a0d856eeSDarrick J. Wongimpact to highly threaded workloads.
3480a0d856eeSDarrick J. WongRegular blocking notifier chains use a rwsem and seem to have a much lower
3481a0d856eeSDarrick J. Wongoverhead for single-threaded applications.
3482a0d856eeSDarrick J. WongHowever, it may turn out that the combination of blocking chains and static
3483a0d856eeSDarrick J. Wongkeys are a more performant combination; more study is needed here.
3484a0d856eeSDarrick J. Wong
3485a0d856eeSDarrick J. WongThe following pieces are necessary to hook a certain point in the filesystem:
3486a0d856eeSDarrick J. Wong
3487a0d856eeSDarrick J. Wong- A ``struct xfs_hooks`` object must be embedded in a convenient place such as
3488a0d856eeSDarrick J. Wong  a well-known incore filesystem object.
3489a0d856eeSDarrick J. Wong
3490a0d856eeSDarrick J. Wong- Each hook must define an action code and a structure containing more context
3491a0d856eeSDarrick J. Wong  about the action.
3492a0d856eeSDarrick J. Wong
3493a0d856eeSDarrick J. Wong- Hook providers should provide appropriate wrapper functions and structs
3494a0d856eeSDarrick J. Wong  around the ``xfs_hooks`` and ``xfs_hook`` objects to take advantage of type
3495a0d856eeSDarrick J. Wong  checking to ensure correct usage.
3496a0d856eeSDarrick J. Wong
3497a0d856eeSDarrick J. Wong- A callsite in the regular filesystem code must be chosen to call
3498a0d856eeSDarrick J. Wong  ``xfs_hooks_call`` with the action code and data structure.
3499a0d856eeSDarrick J. Wong  This place should be adjacent to (and not earlier than) the place where
3500a0d856eeSDarrick J. Wong  the filesystem update is committed to the transaction.
3501a0d856eeSDarrick J. Wong  In general, when the filesystem calls a hook chain, it should be able to
3502a0d856eeSDarrick J. Wong  handle sleeping and should not be vulnerable to memory reclaim or locking
3503a0d856eeSDarrick J. Wong  recursion.
3504a0d856eeSDarrick J. Wong  However, the exact requirements are very dependent on the context of the hook
3505a0d856eeSDarrick J. Wong  caller and the callee.
3506a0d856eeSDarrick J. Wong
3507a0d856eeSDarrick J. Wong- The online fsck function should define a structure to hold scan data, a lock
3508a0d856eeSDarrick J. Wong  to coordinate access to the scan data, and a ``struct xfs_hook`` object.
3509a0d856eeSDarrick J. Wong  The scanner function and the regular filesystem code must acquire resources
3510a0d856eeSDarrick J. Wong  in the same order; see the next section for details.
3511a0d856eeSDarrick J. Wong
3512a0d856eeSDarrick J. Wong- The online fsck code must contain a C function to catch the hook action code
3513a0d856eeSDarrick J. Wong  and data structure.
3514a0d856eeSDarrick J. Wong  If the object being updated has already been visited by the scan, then the
3515a0d856eeSDarrick J. Wong  hook information must be applied to the scan data.
3516a0d856eeSDarrick J. Wong
3517a0d856eeSDarrick J. Wong- Prior to unlocking inodes to start the scan, online fsck must call
3518a0d856eeSDarrick J. Wong  ``xfs_hooks_setup`` to initialize the ``struct xfs_hook``, and
3519a0d856eeSDarrick J. Wong  ``xfs_hooks_add`` to enable the hook.
3520a0d856eeSDarrick J. Wong
3521a0d856eeSDarrick J. Wong- Online fsck must call ``xfs_hooks_del`` to disable the hook once the scan is
3522a0d856eeSDarrick J. Wong  complete.
3523a0d856eeSDarrick J. Wong
3524a0d856eeSDarrick J. WongThe number of hooks should be kept to a minimum to reduce complexity.
3525a0d856eeSDarrick J. WongStatic keys are used to reduce the overhead of filesystem hooks to nearly
3526a0d856eeSDarrick J. Wongzero when online fsck is not running.
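
The pieces listed above can be modelled in userspace with a few lines of C.
This is only a sketch under stated assumptions, not the XFS implementation: a
singly linked list stands in for the notifier call chain, a plain counter
stands in for the static key, and every ``demo_*`` name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct xfs_hook: one subscriber. */
struct demo_hook {
	/* Callback receives an action code and hook-specific context. */
	int (*fn)(struct demo_hook *hook, unsigned long action, void *data);
	struct demo_hook *next;
};

/* Hypothetical stand-in for struct xfs_hooks: the call chain itself. */
struct demo_hooks {
	struct demo_hook *head;		/* dynamic list of subscribers */
	unsigned int nr_enabled;	/* models the static key; 0 means off */
};

static void demo_hooks_init(struct demo_hooks *chain)
{
	chain->head = NULL;
	chain->nr_enabled = 0;
}

static void demo_hook_setup(struct demo_hook *hook,
		int (*fn)(struct demo_hook *, unsigned long, void *))
{
	hook->fn = fn;
	hook->next = NULL;
}

static void demo_hooks_add(struct demo_hooks *chain, struct demo_hook *hook)
{
	assert(hook->fn != NULL);	/* setup must come before add */
	hook->next = chain->head;
	chain->head = hook;
	chain->nr_enabled++;
}

static void demo_hooks_del(struct demo_hooks *chain, struct demo_hook *hook)
{
	struct demo_hook **pp;

	for (pp = &chain->head; *pp != NULL; pp = &(*pp)->next) {
		if (*pp == hook) {
			*pp = hook->next;
			hook->next = NULL;
			chain->nr_enabled--;
			return;
		}
	}
}

/* Callsite in the "regular filesystem" path; returns how many hooks ran. */
static int demo_hooks_call(struct demo_hooks *chain, unsigned long action,
		void *data)
{
	struct demo_hook *hook;
	int called = 0;

	/* Fast path: nobody is listening, so skip the chain walk entirely. */
	if (chain->nr_enabled == 0)
		return 0;

	for (hook = chain->head; hook != NULL; hook = hook->next) {
		hook->fn(hook, action, data);
		called++;
	}
	return called;
}

/* Example consumer: scan data that embeds the generic hook object. */
struct demo_scan {
	struct demo_hook hook;		/* must stay the first member */
	long delta_seen;
};

static int demo_scan_hook_fn(struct demo_hook *hook, unsigned long action,
		void *data)
{
	/* The hook is the first member, so this cast recovers the scan. */
	struct demo_scan *scan = (struct demo_scan *)hook;

	(void)action;
	scan->delta_seen += *(long *)data;
	return 0;
}
```

In this model the cost of a disabled hook is a single integer comparison; a
real static key patches the branch out of the instruction stream instead.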

.. _liveupdate:

Live Updates During a Scan
``````````````````````````

The code paths of the online fsck scanning code and the :ref:`hooked<fshooks>`
filesystem code look like this::

            other program
                  ↓
            inode lock ←────────────────────┐
                  ↓                         │
            AG header lock                  │
                  ↓                         │
            filesystem function             │
                  ↓                         │
            notifier call chain             │    same
                  ↓                         ├─── inode
            scrub hook function             │    lock
                  ↓                         │
            scan data mutex ←──┐    same    │
                  ↓            ├─── scan    │
            update scan data   │    lock    │
                  ↑            │            │
            scan data mutex ←──┘            │
                  ↑                         │
            inode lock ←────────────────────┘
                  ↑
            scrub function
                  ↑
            inode scanner
                  ↑
            xfs_scrub

These rules must be followed to ensure correct interactions between the
checking code and the code making an update to the filesystem:

- Prior to invoking the notifier call chain, the filesystem function being
  hooked must acquire the same lock that the scrub scanning function acquires
  to scan the inode.

- The scanning function and the scrub hook function must coordinate access to
  the scan data by acquiring a lock on the scan data.

- Scrub hook functions must not add the live update information to the scan
  observations unless the inode being updated has already been scanned.
  The scan coordinator has a helper predicate (``xchk_iscan_want_live_update``)
  for this.

- Scrub hook functions must not change the caller's state, including the
  transaction that it is running.
  They must not acquire any resources that might conflict with the filesystem
  function being hooked.

- The hook function can abort the inode scan to avoid breaking the other rules.

The inode scan APIs are pretty simple:

- ``xchk_iscan_start`` starts a scan.

- ``xchk_iscan_iter`` grabs a reference to the next inode in the scan or
  returns zero if there is nothing left to scan.

- ``xchk_iscan_want_live_update`` decides if an inode has already been
  visited in the scan.
  This is critical for hook functions to decide if they need to update the
  in-memory scan information.

- ``xchk_iscan_mark_visited`` marks an inode as having been visited in the
  scan.

- ``xchk_iscan_teardown`` finishes the scan.
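
To see why the "already visited" predicate matters, consider the following
single-threaded model of a scan racing with filesystem updates.
It is a sketch only: it assumes inodes are visited in increasing inode number
order, it omits the inode lock and the scan data mutex, and the ``demo_*``
functions merely mirror the shape of the ``xchk_iscan_*`` calls listed above
without reproducing their real signatures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define DEMO_NR_INODES	8

/* Hypothetical scan state; the real scan cursor is more involved. */
struct demo_iscan {
	uint64_t next_ino;	/* next inode the iterator will hand out */
	uint64_t visited_ino;	/* every inode below this has been visited */
	int64_t observed[DEMO_NR_INODES];	/* per-inode observations */
};

static void demo_iscan_start(struct demo_iscan *iscan)
{
	memset(iscan, 0, sizeof(*iscan));
}

/* Returns 1 and sets *ino if an inode remains, or 0 when the scan is done. */
static int demo_iscan_iter(struct demo_iscan *iscan, uint64_t *ino)
{
	if (iscan->next_ino >= DEMO_NR_INODES)
		return 0;
	*ino = iscan->next_ino++;
	return 1;
}

static void demo_iscan_mark_visited(struct demo_iscan *iscan, uint64_t ino)
{
	assert(ino == iscan->visited_ino);	/* strictly in order */
	iscan->visited_ino = ino + 1;
}

static bool demo_iscan_want_live_update(const struct demo_iscan *iscan,
		uint64_t ino)
{
	return ino < iscan->visited_ino;
}

/* Scanner side: record the current state of one inode, then mark it. */
static void demo_scan_one(struct demo_iscan *iscan, uint64_t ino,
		const int64_t *fs_state)
{
	iscan->observed[ino] = fs_state[ino];
	demo_iscan_mark_visited(iscan, ino);
}

/*
 * Filesystem side: change an inode, then run the "hook".  The hook only
 * touches the scan data if the scan has already passed this inode;
 * otherwise the scanner will pick up the new value when it gets there.
 */
static void demo_fs_update(struct demo_iscan *iscan, int64_t *fs_state,
		uint64_t ino, int64_t delta)
{
	fs_state[ino] += delta;
	if (demo_iscan_want_live_update(iscan, ino))
		iscan->observed[ino] += delta;
}
```

Applying an update for an unvisited inode would count the change twice, once
through the hook and once when the scanner later reads the inode.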

This functionality is also a part of the
`inode scanner
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-iscan>`_
series.

.. _quotacheck:

Case Study: Quota Counter Checking
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

It is useful to compare the mount time quotacheck code to the online repair
quotacheck code.
Mount time quotacheck does not have to contend with concurrent operations, so
it does the following:

1. Make sure the ondisk dquots are in good enough shape that all the incore
   dquots will actually load, and zero the resource usage counters in the
   ondisk buffer.

2. Walk every inode in the filesystem.
   Add each file's resource usage to the incore dquot.

3. Walk each incore dquot.
   If the incore dquot is not being flushed, add the ondisk buffer backing the
   incore dquot to a delayed write (delwri) list.

4. Write the buffer list to disk.

Like most online fsck functions, online quotacheck can't write to regular
filesystem objects until the newly collected metadata reflect all filesystem
state.
Therefore, online quotacheck records file resource usage to a shadow dquot
index implemented with a sparse ``xfarray``, and only writes to the real dquots
once the scan is complete.
Handling transactional updates is tricky because quota resource usage updates
are handled in phases to minimize contention on dquots:

1. The inodes involved are locked and joined to a transaction.

2. For each dquot attached to the file:

   a. The dquot is locked.

   b. A quota reservation is added to the dquot's resource usage.
      The reservation is recorded in the transaction.

   c. The dquot is unlocked.

3. Changes in actual quota usage are tracked in the transaction.

4. At transaction commit time, each dquot is examined again:

   a. The dquot is locked again.

   b. Quota usage changes are logged and unused reservation is given back to
      the dquot.

   c. The dquot is unlocked.

For online quotacheck, hooks are placed in steps 2 and 4.
The step 2 hook creates a shadow version of the transaction dquot context
(``dqtrx``) that operates in a similar manner to the regular code.
The step 4 hook commits the shadow ``dqtrx`` changes to the shadow dquots.
Notice that both hooks are called with the inode locked, which is how the
live update coordinates with the inode scanner.
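
The two-hook arrangement can be illustrated with a small model in which
deltas are staged in a shadow ``dqtrx`` and only folded into the shadow dquot
when the transaction commits.
This is a sketch under several assumptions: it tracks a single transaction
and a single dquot id at a time, it treats a cancelled transaction as
discarding the staged deltas, it leaves out the "has this inode been scanned"
check, and all ``demo_*`` names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_MAX_IDS	4

/* Shadow dquot: usage as seen by the scan plus committed live updates. */
struct demo_shadow_dquot {
	int64_t bcount;		/* blocks */
	int64_t icount;		/* inodes */
};

/* Shadow dqtrx: deltas staged on behalf of one running transaction. */
struct demo_shadow_dqtrx {
	bool in_use;
	uint32_t id;
	int64_t bcount_delta;
	int64_t icount_delta;
};

struct demo_qcheck {
	struct demo_shadow_dquot dquots[DEMO_MAX_IDS];
	struct demo_shadow_dqtrx dqtrx;	/* one transaction, for brevity */
};

/* Models the step 2 hook: stage a delta without touching the shadow dquot. */
static void demo_mod_hook(struct demo_qcheck *qc, uint32_t id,
		int64_t bdelta, int64_t idelta)
{
	struct demo_shadow_dqtrx *dqtrx = &qc->dqtrx;

	assert(id < DEMO_MAX_IDS);
	if (!dqtrx->in_use) {
		dqtrx->in_use = true;
		dqtrx->id = id;
		dqtrx->bcount_delta = 0;
		dqtrx->icount_delta = 0;
	}
	assert(dqtrx->id == id);
	dqtrx->bcount_delta += bdelta;
	dqtrx->icount_delta += idelta;
}

/* Models the step 4 hook: fold staged deltas into the shadow dquot. */
static void demo_apply_hook(struct demo_qcheck *qc, bool committed)
{
	struct demo_shadow_dqtrx *dqtrx = &qc->dqtrx;

	if (!dqtrx->in_use)
		return;
	if (committed) {
		qc->dquots[dqtrx->id].bcount += dqtrx->bcount_delta;
		qc->dquots[dqtrx->id].icount += dqtrx->icount_delta;
	}
	/* Whether committed or not, the staged context is now finished. */
	dqtrx->in_use = false;
}
```

Staging the deltas keeps the shadow dquots from reflecting changes that the
filesystem itself has not yet committed.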

The quotacheck scan looks like this:

1. Set up a coordinated inode scan.

2. For each inode returned by the inode scan iterator:

   a. Grab and lock the inode.

   b. Determine that inode's resource usage (data blocks, inode counts,
      realtime blocks) and add that to the shadow dquots for the user, group,
      and project ids associated with the inode.

   c. Unlock and release the inode.

3. For each dquot in the system:

   a. Grab and lock the dquot.

   b. Check the dquot against the shadow dquots created by the scan and updated
      by the live hooks.

Live updates are key to being able to walk every quota record without
needing to hold any locks for a long duration.
If repairs are desired, the real and shadow dquots are locked and their
resource counts are set to the values in the shadow dquot.

The proposed patchset is the
`online quotacheck
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quotacheck>`_
series.

.. _nlinks:

Case Study: File Link Count Checking
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

File link count checking also uses live update hooks.
The coordinated inode scanner is used to visit all directories on the
filesystem, and per-file link count records are stored in a sparse ``xfarray``
indexed by inumber.
During the scanning phase, each entry in a directory generates observation
data as follows:

1. If the entry is a dotdot (``'..'``) entry of the root directory, the
   directory's parent link count is bumped because the root directory's dotdot
   entry is self referential.

2. If the entry is a dotdot entry of a subdirectory, the parent's backref
   count is bumped.

3. If the entry is neither a dot nor a dotdot entry, the target file's parent
   count is bumped.

4. If the target is a subdirectory, the parent's child link count is bumped.

A crucial point to understand about how the link count inode scanner interacts
with the live update hooks is that the scan cursor tracks which *parent*
directories have been scanned.
In other words, the live updates ignore any update about ``A → B`` when A has
not been scanned, even if B has been scanned.
Furthermore, a subdirectory A with a dotdot entry pointing back to B is
accounted as a backref counter in the shadow data for B, since child dotdot
entries affect the parent's link count.
Live update hooks are carefully placed in all parts of the filesystem that
create, change, or remove directory entries, since those operations involve
bumplink and droplink.

For any file, the correct link count is the number of parents plus the number
of child subdirectories.
Non-directories never have children of any kind.
The backref information is used to detect inconsistencies in the number of
links pointing to child subdirectories and the number of dotdot entries
pointing back.
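
The four observation rules and the link count formula above can be sketched
as follows.
The code is illustrative only: the shadow records live in a small fixed array
rather than an ``xfarray``, inode 0 is assumed to be the root directory, the
total is computed exactly as the formula in the text states it, and every
``demo_*`` name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_ROOT_INO	0
#define DEMO_NR_INODES	4

/* Shadow link count record for one inode. */
struct demo_nlink {
	uint32_t parents;	/* dirents pointing down to this inode */
	uint32_t backrefs;	/* dotdot entries pointing up to this inode */
	uint32_t children;	/* subdirectories linked from this inode */
};

/* One directory entry: @name in directory @dir points at inode @target. */
struct demo_dirent {
	uint64_t dir;
	const char *name;
	uint64_t target;
	bool target_is_dir;
};

static void demo_collect_dirent(struct demo_nlink *shadow,
		const struct demo_dirent *de)
{
	bool dot = strcmp(de->name, ".") == 0;
	bool dotdot = strcmp(de->name, "..") == 0;

	if (dotdot && de->dir == DEMO_ROOT_INO) {
		/* Rule 1: the root's dotdot entry is self referential. */
		shadow[de->dir].parents++;
	} else if (dotdot) {
		/* Rule 2: a subdirectory's dotdot bumps the parent's backrefs. */
		shadow[de->target].backrefs++;
	} else if (!dot) {
		/* Rule 3: an ordinary entry gives the target another parent. */
		shadow[de->target].parents++;
		/* Rule 4: a subdirectory target is a child of this directory. */
		if (de->target_is_dir)
			shadow[de->dir].children++;
	}
}

/* Link count as stated in the text: parents plus child subdirectories. */
static uint32_t demo_expected_nlink(const struct demo_nlink *nl)
{
	return nl->parents + nl->children;
}

/* Every child subdirectory should have a dotdot entry pointing back. */
static bool demo_backrefs_consistent(const struct demo_nlink *nl)
{
	return nl->backrefs == nl->children;
}
```

A directory whose ``children`` and ``backrefs`` counts disagree has either a
subdirectory whose dotdot entry points elsewhere or a stray dotdot entry
pointing at it.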

After the scan completes, the link count of each file can be checked by locking
both the inode and the shadow data, and comparing the link counts.
A second coordinated inode scan cursor is used for comparisons.
Live updates are key to being able to walk every inode without needing to hold
any locks between inodes.
If repairs are desired, the inode's link count is set to the value in the
shadow information.
If no parents are found, the file must be :ref:`reparented <orphanage>` to the
orphanage to prevent the file from being lost forever.

The proposed patchset is the
`file link count repair
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-nlinks>`_
series.

.. _rmap_repair:

Case Study: Rebuilding Reverse Mapping Records
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Most repair functions follow the same pattern: lock filesystem resources,
walk the surviving ondisk metadata looking for replacement metadata records,
and use an :ref:`in-memory array <xfarray>` to store the gathered observations.
The primary advantage of this approach is the simplicity and modularity of the
repair code -- code and data are entirely contained within the scrub module,
do not require hooks in the main filesystem, and are usually the most efficient
in memory use.
A secondary advantage of this repair approach is atomicity -- once the kernel
decides a structure is corrupt, no other threads can access the metadata until
the kernel finishes repairing and revalidating the metadata.

For repairs going on within a shard of the filesystem, these advantages
outweigh the delays inherent in locking the shard while repairing parts of the
shard.
Unfortunately, repairs to the reverse mapping btree cannot use the "standard"
btree repair strategy because it must scan every space mapping of every fork of
every file in the filesystem, and the filesystem cannot stop.
Therefore, rmap repair forgoes atomicity between scrub and repair.
It combines a :ref:`coordinated inode scanner <iscan>`, :ref:`live update hooks
<liveupdate>`, and an :ref:`in-memory rmap btree <xfbtree>` to complete the
scan for reverse mapping records.

1. Set up an xfbtree to stage rmap records.

2. While holding the locks on the AGI and AGF buffers acquired during the
   scrub, generate reverse mappings for all AG metadata: inodes, btrees, CoW
   staging extents, and the internal log.

3. Set up an inode scanner.

4. Hook into rmap updates for the AG being repaired so that the live scan data
   can receive updates to the rmap btree from the rest of the filesystem during
   the file scan.

5. For each space mapping found in either fork of each file scanned,
   decide if the mapping matches the AG of interest.
   If so:

   a. Create a btree cursor for the in-memory btree.

   b. Use the rmap code to add the record to the in-memory btree.

   c. Use the :ref:`special commit function <xfbtree_commit>` to write the
      xfbtree changes to the xfile.

6. For each live update received via the hook, decide if the owner has already
   been scanned.
   If so, apply the live update into the scan data:

   a. Create a btree cursor for the in-memory btree.

   b. Replay the operation into the in-memory btree.

   c. Use the :ref:`special commit function <xfbtree_commit>` to write the
      xfbtree changes to the xfile.
      This is performed with an empty transaction to avoid changing the
      caller's state.

7. When the inode scan finishes, create a new scrub transaction and relock the
   two AG headers.

8. Compute the new btree geometry using the number of rmap records in the
   shadow btree, like all other btree rebuilding functions.

9. Allocate the number of blocks computed in the previous step.

10. Perform the usual btree bulk loading and commit to install the new rmap
    btree.

11. Reap the old rmap btree blocks as discussed in the case study about how
    to :ref:`reap after rmap btree repair <rmap_reap>`.

12. Free the xfbtree now that it is not needed.
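
Steps 8 and 9 depend on turning a record count into a block count.
A rough sketch of that arithmetic follows, assuming every block is packed
completely full and using made-up per-block capacities; the real bulk loader
may leave slack space in each block, and ``demo_compute_geometry`` is a
hypothetical name rather than a kernel function.

```c
#include <assert.h>
#include <stdint.h>

/* Result of the geometry computation for a btree that is to be rebuilt. */
struct demo_geometry {
	unsigned int height;	/* number of levels, leaves included */
	uint64_t nr_blocks;	/* total blocks to allocate */
};

static uint64_t demo_div_round_up(uint64_t n, uint64_t d)
{
	return (n + d - 1) / d;
}

/*
 * Work upwards from the leaves: each level needs enough blocks to hold
 * one pointer per block of the level below, until one root block remains.
 */
static struct demo_geometry demo_compute_geometry(uint64_t nr_records,
		unsigned int recs_per_leaf, unsigned int ptrs_per_node)
{
	struct demo_geometry geo = { .height = 1, .nr_blocks = 0 };
	uint64_t level_blocks;

	assert(recs_per_leaf > 0 && ptrs_per_node > 1);

	/* Even an empty btree needs a root block. */
	level_blocks = nr_records ?
			demo_div_round_up(nr_records, recs_per_leaf) : 1;
	geo.nr_blocks = level_blocks;

	while (level_blocks > 1) {
		level_blocks = demo_div_round_up(level_blocks, ptrs_per_node);
		geo.nr_blocks += level_blocks;
		geo.height++;
	}
	return geo;
}
```

For example, with 100 records per leaf and 10 pointers per node, 1001 records
need 11 leaves, 2 interior nodes, and 1 root, for 14 blocks over 3 levels.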

The proposed patchset is the
`rmap repair
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-rmap-btree>`_
series.

Staging Repairs with Temporary Files on Disk
--------------------------------------------

XFS stores a substantial amount of metadata in file forks: directories,
extended attributes, symbolic link targets, free space bitmaps and summary
information for the realtime volume, and quota records.
File forks map 64-bit logical file fork space extents to physical storage space
extents, similar to how a memory management unit maps 64-bit virtual addresses
to physical memory addresses.
Therefore, file-based tree structures (such as directories and extended
attributes) use blocks mapped in the file fork offset address space that point
to other blocks mapped within that same address space, and file-based linear
structures (such as bitmaps and quota records) compute array element offsets in
the file fork offset address space.
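
The mapping described above can be pictured with a toy extent table.
This is a minimal sketch with hypothetical ``demo_*`` names and invented
numbers: the fork is an unsorted array of extents searched linearly, whereas
the real filesystem keeps its mappings in indexed structures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_HOLE	UINT64_MAX	/* no mapping at this fork offset */

/* One mapping: @len blocks at fork offset @fork_off live at @phys. */
struct demo_extent {
	uint64_t fork_off;
	uint64_t phys;
	uint64_t len;
};

/* Translate a fork block offset to a physical block, like an MMU lookup. */
static uint64_t demo_fork_to_phys(const struct demo_extent *map, size_t nr,
		uint64_t fork_block)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		if (fork_block >= map[i].fork_off &&
		    fork_block - map[i].fork_off < map[i].len)
			return map[i].phys + (fork_block - map[i].fork_off);
	}
	return DEMO_HOLE;
}

/* Linear structures compute which fork block holds a given record. */
static uint64_t demo_record_fork_block(uint64_t record_nr,
		unsigned int recs_per_block)
{
	assert(recs_per_block > 0);
	return record_nr / recs_per_block;
}
```

Because records and tree pointers are expressed as fork offsets rather than
physical addresses, a structure written at the same offsets in another file's
fork remains internally consistent.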

Because file forks can consume as much space as the entire filesystem, repairs
cannot be staged in memory, even when a paging scheme is available.
Therefore, online repair of file-based metadata creates a temporary file in
the XFS filesystem, writes a new structure at the correct offsets into the
temporary file, and atomically swaps the fork mappings (and hence the fork
contents) to commit the repair.
Once the repair is complete, the old fork can be reaped as necessary; if the
system goes down during the reap, the iunlink code will delete the blocks
during log recovery.

**Note**: All space usage and inode indices in the filesystem *must* be
consistent to use a temporary file safely!
This dependency is the reason why online repair can only use pageable kernel
memory to stage ondisk space usage information.

Swapping metadata extents with a temporary file requires the owner field of the
block headers to match the file being repaired and not the temporary file.  The
directory, extended attribute, and symbolic link functions were all modified to
allow callers to specify owner numbers explicitly.

There is a downside to the reaping process -- if the system crashes during the
reap phase and the fork extents are crosslinked, the iunlink processing will
fail because freeing space will find the extra reverse mappings and abort.

Temporary files created for repair are similar to ``O_TMPFILE`` files created
by userspace.
They are not linked into a directory and the entire file will be reaped when
the last reference to the file is lost.
The key differences are that these files must have no access permission outside
the kernel at all, they must be specially marked to prevent them from being
opened by handle, and they must never be linked into the directory tree.

+--------------------------------------------------------------------------+
| **Historical Sidebar**:                                                  |
+--------------------------------------------------------------------------+
| In the initial iteration of file metadata repair, the damaged metadata   |
| blocks would be scanned for salvageable data; the extents in the file    |
| fork would be reaped; and then a new structure would be built in its     |
| place.                                                                   |
| This strategy did not survive the introduction of the atomic repair      |
| requirement expressed earlier in this document.                          |
|                                                                          |
| The second iteration explored building a second structure at a high      |
| offset in the fork from the salvage data, reaping the old extents, and   |
| using a ``COLLAPSE_RANGE`` operation to slide the new extents into       |
| place.                                                                   |
|                                                                          |
| This had many drawbacks:                                                 |
|                                                                          |
| - Array structures are linearly addressed, and the regular filesystem    |
|   codebase does not have the concept of a linear offset that could be    |
|   applied to the record offset computation to build an alternate copy.   |
|                                                                          |
| - Extended attributes are allowed to use the entire attr fork offset     |
|   address space.                                                         |
|                                                                          |
| - Even if repair could build an alternate copy of a data structure in a  |
|   different part of the fork address space, the atomic repair commit     |
|   requirement means that online repair would have to be able to perform  |
|   a log assisted ``COLLAPSE_RANGE`` operation to ensure that the old     |
|   structure was completely replaced.                                     |
|                                                                          |
| - A crash after construction of the secondary tree but before the range  |
|   collapse would leave unreachable blocks in the file fork.              |
|   This would likely confuse things further.                              |
|                                                                          |
| - Reaping blocks after a repair is not a simple operation, and           |
|   initiating a reap operation from a restarted range collapse operation  |
|   during log recovery is daunting.                                       |
|                                                                          |
| - Directory entry blocks and quota records record the file fork offset   |
|   in the header area of each block.                                      |
|   An atomic range collapse operation would have to rewrite this part of  |
|   each block header.                                                     |
|   Rewriting a single field in block headers is not a huge problem, but   |
|   it's something to be aware of.                                         |
|                                                                          |
| - Each block in a directory or extended attributes btree index contains  |
|   sibling and child block pointers.                                      |
|   Were the atomic commit to use a range collapse operation, each block   |
|   would have to be rewritten very carefully to preserve the graph        |
|   structure.                                                             |
|   Doing this as part of a range collapse means rewriting a large number  |
39382f754f7fSDarrick J. Wong|   of blocks repeatedly, which is not conducive to quick repairs.         |
39392f754f7fSDarrick J. Wong|                                                                          |
| This led to the introduction of temporary file staging.                  |
39412f754f7fSDarrick J. Wong+--------------------------------------------------------------------------+
39422f754f7fSDarrick J. Wong
39432f754f7fSDarrick J. WongUsing a Temporary File
39442f754f7fSDarrick J. Wong``````````````````````
39452f754f7fSDarrick J. Wong
39462f754f7fSDarrick J. WongOnline repair code should use the ``xrep_tempfile_create`` function to create a
39472f754f7fSDarrick J. Wongtemporary file inside the filesystem.
39482f754f7fSDarrick J. WongThis allocates an inode, marks the in-core inode private, and attaches it to
39492f754f7fSDarrick J. Wongthe scrub context.
39502f754f7fSDarrick J. WongThese files are hidden from userspace, may not be added to the directory tree,
39512f754f7fSDarrick J. Wongand must be kept private.
39522f754f7fSDarrick J. Wong
39532f754f7fSDarrick J. WongTemporary files only use two inode locks: the IOLOCK and the ILOCK.
39542f754f7fSDarrick J. WongThe MMAPLOCK is not needed here, because there must not be page faults from
39552f754f7fSDarrick J. Wonguserspace for data fork blocks.
39562f754f7fSDarrick J. WongThe usage patterns of these two locks are the same as for any other XFS file --
access to file data is controlled via the IOLOCK, and access to file metadata
is controlled via the ILOCK.
39592f754f7fSDarrick J. WongLocking helpers are provided so that the temporary file and its lock state can
39602f754f7fSDarrick J. Wongbe cleaned up by the scrub context.
39612f754f7fSDarrick J. WongTo comply with the nested locking strategy laid out in the :ref:`inode
39622f754f7fSDarrick J. Wonglocking<ilocking>` section, it is recommended that scrub functions use the
``xrep_tempfile_ilock*_nowait`` lock helpers.
39642f754f7fSDarrick J. Wong
39652f754f7fSDarrick J. WongData can be written to a temporary file by two means:
39662f754f7fSDarrick J. Wong
39672f754f7fSDarrick J. Wong1. ``xrep_tempfile_copyin`` can be used to set the contents of a regular
39682f754f7fSDarrick J. Wong   temporary file from an xfile.
39692f754f7fSDarrick J. Wong
39702f754f7fSDarrick J. Wong2. The regular directory, symbolic link, and extended attribute functions can
39712f754f7fSDarrick J. Wong   be used to write to the temporary file.
39722f754f7fSDarrick J. Wong
39732f754f7fSDarrick J. WongOnce a good copy of a data file has been constructed in a temporary file, it
39742f754f7fSDarrick J. Wongmust be conveyed to the file being repaired, which is the topic of the next
39752f754f7fSDarrick J. Wongsection.
39762f754f7fSDarrick J. Wong
39772f754f7fSDarrick J. WongThe proposed patches are in the
39782f754f7fSDarrick J. Wong`repair temporary files
39792f754f7fSDarrick J. Wong<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-tempfiles>`_
39802f754f7fSDarrick J. Wongseries.
39812f754f7fSDarrick J. Wong
39822f754f7fSDarrick J. WongAtomic Extent Swapping
39832f754f7fSDarrick J. Wong----------------------
39842f754f7fSDarrick J. Wong
39852f754f7fSDarrick J. WongOnce repair builds a temporary file with a new data structure written into
39862f754f7fSDarrick J. Wongit, it must commit the new changes into the existing file.
39872f754f7fSDarrick J. WongIt is not possible to swap the inumbers of two files, so instead the new
39882f754f7fSDarrick J. Wongmetadata must replace the old.
39892f754f7fSDarrick J. WongThis suggests the need for the ability to swap extents, but the existing extent
39902f754f7fSDarrick J. Wongswapping code used by the file defragmenting tool ``xfs_fsr`` is not sufficient
39912f754f7fSDarrick J. Wongfor online repair because:
39922f754f7fSDarrick J. Wong
39932f754f7fSDarrick J. Wonga. When the reverse-mapping btree is enabled, the swap code must keep the
39942f754f7fSDarrick J. Wong   reverse mapping information up to date with every exchange of mappings.
39952f754f7fSDarrick J. Wong   Therefore, it can only exchange one mapping per transaction, and each
39962f754f7fSDarrick J. Wong   transaction is independent.
39972f754f7fSDarrick J. Wong
39982f754f7fSDarrick J. Wongb. Reverse-mapping is critical for the operation of online fsck, so the old
39992f754f7fSDarrick J. Wong   defragmentation code (which swapped entire extent forks in a single
40002f754f7fSDarrick J. Wong   operation) is not useful here.
40012f754f7fSDarrick J. Wong
40022f754f7fSDarrick J. Wongc. Defragmentation is assumed to occur between two files with identical
40032f754f7fSDarrick J. Wong   contents.
40042f754f7fSDarrick J. Wong   For this use case, an incomplete exchange will not result in a user-visible
40052f754f7fSDarrick J. Wong   change in file contents, even if the operation is interrupted.
40062f754f7fSDarrick J. Wong
40072f754f7fSDarrick J. Wongd. Online repair needs to swap the contents of two files that are by definition
40082f754f7fSDarrick J. Wong   *not* identical.
40092f754f7fSDarrick J. Wong   For directory and xattr repairs, the user-visible contents might be the
40102f754f7fSDarrick J. Wong   same, but the contents of individual blocks may be very different.
40112f754f7fSDarrick J. Wong
40122f754f7fSDarrick J. Wonge. Old blocks in the file may be cross-linked with another structure and must
40132f754f7fSDarrick J. Wong   not reappear if the system goes down mid-repair.
40142f754f7fSDarrick J. Wong
40152f754f7fSDarrick J. WongThese problems are overcome by creating a new deferred operation and a new type
40162f754f7fSDarrick J. Wongof log intent item to track the progress of an operation to exchange two file
40172f754f7fSDarrick J. Wongranges.
40182f754f7fSDarrick J. WongThe new deferred operation type chains together the same transactions used by
40192f754f7fSDarrick J. Wongthe reverse-mapping extent swap code.
40202f754f7fSDarrick J. WongThe new log item records the progress of the exchange to ensure that once an
exchange begins, it will always run to completion, even if there are
40222f754f7fSDarrick J. Wonginterruptions.
40232f754f7fSDarrick J. WongThe new ``XFS_SB_FEAT_INCOMPAT_LOG_ATOMIC_SWAP`` log-incompatible feature flag
40242f754f7fSDarrick J. Wongin the superblock protects these new log item records from being replayed on
40252f754f7fSDarrick J. Wongold kernels.
40262f754f7fSDarrick J. Wong
40272f754f7fSDarrick J. WongThe proposed patchset is the
40282f754f7fSDarrick J. Wong`atomic extent swap
40292f754f7fSDarrick J. Wong<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=atomic-file-updates>`_
40302f754f7fSDarrick J. Wongseries.
40312f754f7fSDarrick J. Wong
40322f754f7fSDarrick J. Wong+--------------------------------------------------------------------------+
40332f754f7fSDarrick J. Wong| **Sidebar: Using Log-Incompatible Feature Flags**                        |
40342f754f7fSDarrick J. Wong+--------------------------------------------------------------------------+
40352f754f7fSDarrick J. Wong| Starting with XFS v5, the superblock contains a                          |
40362f754f7fSDarrick J. Wong| ``sb_features_log_incompat`` field to indicate that the log contains     |
| records that might not be readable by all kernels that could mount this  |
40382f754f7fSDarrick J. Wong| filesystem.                                                              |
40392f754f7fSDarrick J. Wong| In short, log incompat features protect the log contents against kernels |
40402f754f7fSDarrick J. Wong| that will not understand the contents.                                   |
40412f754f7fSDarrick J. Wong| Unlike the other superblock feature bits, log incompat bits are          |
40422f754f7fSDarrick J. Wong| ephemeral because an empty (clean) log does not need protection.         |
40432f754f7fSDarrick J. Wong| The log cleans itself after its contents have been committed into the    |
40442f754f7fSDarrick J. Wong| filesystem, either as part of an unmount or because the system is        |
40452f754f7fSDarrick J. Wong| otherwise idle.                                                          |
40462f754f7fSDarrick J. Wong| Because upper level code can be working on a transaction at the same     |
40472f754f7fSDarrick J. Wong| time that the log cleans itself, it is necessary for upper level code to |
40482f754f7fSDarrick J. Wong| communicate to the log when it is going to use a log incompatible        |
40492f754f7fSDarrick J. Wong| feature.                                                                 |
40502f754f7fSDarrick J. Wong|                                                                          |
40512f754f7fSDarrick J. Wong| The log coordinates access to incompatible features through the use of   |
40522f754f7fSDarrick J. Wong| one ``struct rw_semaphore`` for each feature.                            |
40532f754f7fSDarrick J. Wong| The log cleaning code tries to take this rwsem in exclusive mode to      |
40542f754f7fSDarrick J. Wong| clear the bit; if the lock attempt fails, the feature bit remains set.   |
40552f754f7fSDarrick J. Wong| Filesystem code signals its intention to use a log incompat feature in a |
40562f754f7fSDarrick J. Wong| transaction by calling ``xlog_use_incompat_feat``, which takes the rwsem |
40572f754f7fSDarrick J. Wong| in shared mode.                                                          |
40582f754f7fSDarrick J. Wong| The code supporting a log incompat feature should create wrapper         |
40592f754f7fSDarrick J. Wong| functions to obtain the log feature and call                             |
40602f754f7fSDarrick J. Wong| ``xfs_add_incompat_log_feature`` to set the feature bits in the primary  |
40612f754f7fSDarrick J. Wong| superblock.                                                              |
40622f754f7fSDarrick J. Wong| The superblock update is performed transactionally, so the wrapper to    |
40632f754f7fSDarrick J. Wong| obtain log assistance must be called just prior to the creation of the   |
40642f754f7fSDarrick J. Wong| transaction that uses the functionality.                                 |
40652f754f7fSDarrick J. Wong| For a file operation, this step must happen after taking the IOLOCK      |
40662f754f7fSDarrick J. Wong| and the MMAPLOCK, but before allocating the transaction.                 |
40672f754f7fSDarrick J. Wong| When the transaction is complete, the ``xlog_drop_incompat_feat``        |
40682f754f7fSDarrick J. Wong| function is called to release the feature.                               |
40692f754f7fSDarrick J. Wong| The feature bit will not be cleared from the superblock until the log    |
40702f754f7fSDarrick J. Wong| becomes clean.                                                           |
40712f754f7fSDarrick J. Wong|                                                                          |
40722f754f7fSDarrick J. Wong| Log-assisted extended attribute updates and atomic extent swaps both use |
40732f754f7fSDarrick J. Wong| log incompat features and provide convenience wrappers around the        |
40742f754f7fSDarrick J. Wong| functionality.                                                           |
40752f754f7fSDarrick J. Wong+--------------------------------------------------------------------------+
40762f754f7fSDarrick J. Wong
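The interaction between feature users and log cleaning described in the sidebar can be illustrated with a small userspace model.
This is only a single-threaded sketch under stated assumptions: a plain counter stands in for the ``struct rw_semaphore``, the ``toy_*`` names and the ``TOY_LOG_INCOMPAT_SWAP`` bit are hypothetical, and setting the superblock bit is folded into the "use" helper rather than done transactionally in a separate wrapper.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical feature bit; stands in for a real log incompat flag. */
#define TOY_LOG_INCOMPAT_SWAP	(1u << 0)

struct toy_log {
	uint32_t	sb_features_log_incompat; /* toy superblock field */
	unsigned int	users;	/* shared holders of the toy "rwsem" */
};

/* Take the feature in shared mode and make sure the bit is set. */
static void toy_use_incompat_feat(struct toy_log *log, uint32_t feat)
{
	log->users++;
	log->sb_features_log_incompat |= feat;
}

/* Release the feature.  The bit is deliberately left set. */
static void toy_drop_incompat_feat(struct toy_log *log)
{
	assert(log->users > 0);
	log->users--;
}

/*
 * Log cleaning: try to take the lock in exclusive mode.  This succeeds
 * only when there are no shared holders; otherwise the feature bit
 * remains set.  Returns true if the bit was cleared.
 */
static bool toy_log_clean(struct toy_log *log, uint32_t feat)
{
	if (log->users > 0)
		return false;
	log->sb_features_log_incompat &= ~feat;
	return true;
}
```

In this model, a cleaning attempt made between the "use" and "drop" calls leaves the bit set, and dropping the feature does not clear the bit by itself; only a later cleaning pass with no users does.
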
40772f754f7fSDarrick J. WongMechanics of an Atomic Extent Swap
40782f754f7fSDarrick J. Wong``````````````````````````````````
40792f754f7fSDarrick J. Wong
40802f754f7fSDarrick J. WongSwapping entire file forks is a complex task.
40812f754f7fSDarrick J. WongThe goal is to exchange all file fork mappings between two file fork offset
40822f754f7fSDarrick J. Wongranges.
40832f754f7fSDarrick J. WongThere are likely to be many extent mappings in each fork, and the edges of
40842f754f7fSDarrick J. Wongthe mappings aren't necessarily aligned.
40852f754f7fSDarrick J. WongFurthermore, there may be other updates that need to happen after the swap,
40862f754f7fSDarrick J. Wongsuch as exchanging file sizes, inode flags, or conversion of fork data to local
40872f754f7fSDarrick J. Wongformat.
40882f754f7fSDarrick J. WongThis is roughly the format of the new deferred extent swap work item:
40892f754f7fSDarrick J. Wong
40902f754f7fSDarrick J. Wong.. code-block:: c
40912f754f7fSDarrick J. Wong
40922f754f7fSDarrick J. Wong	struct xfs_swapext_intent {
40932f754f7fSDarrick J. Wong	    /* Inodes participating in the operation. */
40942f754f7fSDarrick J. Wong	    struct xfs_inode    *sxi_ip1;
40952f754f7fSDarrick J. Wong	    struct xfs_inode    *sxi_ip2;
40962f754f7fSDarrick J. Wong
40972f754f7fSDarrick J. Wong	    /* File offset range information. */
40982f754f7fSDarrick J. Wong	    xfs_fileoff_t       sxi_startoff1;
40992f754f7fSDarrick J. Wong	    xfs_fileoff_t       sxi_startoff2;
41002f754f7fSDarrick J. Wong	    xfs_filblks_t       sxi_blockcount;
41012f754f7fSDarrick J. Wong
41022f754f7fSDarrick J. Wong	    /* Set these file sizes after the operation, unless negative. */
41032f754f7fSDarrick J. Wong	    xfs_fsize_t         sxi_isize1;
41042f754f7fSDarrick J. Wong	    xfs_fsize_t         sxi_isize2;
41052f754f7fSDarrick J. Wong
41062f754f7fSDarrick J. Wong	    /* XFS_SWAP_EXT_* log operation flags */
41072f754f7fSDarrick J. Wong	    uint64_t            sxi_flags;
41082f754f7fSDarrick J. Wong	};
41092f754f7fSDarrick J. Wong
41102f754f7fSDarrick J. WongThe new log intent item contains enough information to track two logical fork
41112f754f7fSDarrick J. Wongoffset ranges: ``(inode1, startoff1, blockcount)`` and ``(inode2, startoff2,
41122f754f7fSDarrick J. Wongblockcount)``.
41132f754f7fSDarrick J. WongEach step of a swap operation exchanges the largest file range mapping possible
41142f754f7fSDarrick J. Wongfrom one file to the other.
41152f754f7fSDarrick J. WongAfter each step in the swap operation, the two startoff fields are incremented
41162f754f7fSDarrick J. Wongand the blockcount field is decremented to reflect the progress made.
41172f754f7fSDarrick J. WongThe flags field captures behavioral parameters such as swapping the attr fork
41182f754f7fSDarrick J. Wonginstead of the data fork and other work to be done after the extent swap.
41192f754f7fSDarrick J. WongThe two isize fields are used to swap the file size at the end of the operation
41202f754f7fSDarrick J. Wongif the file data fork is the target of the swap operation.
41212f754f7fSDarrick J. Wong
41222f754f7fSDarrick J. WongWhen the extent swap is initiated, the sequence of operations is as follows:
41232f754f7fSDarrick J. Wong
41242f754f7fSDarrick J. Wong1. Create a deferred work item for the extent swap.
41252f754f7fSDarrick J. Wong   At the start, it should contain the entirety of the file ranges to be
41262f754f7fSDarrick J. Wong   swapped.
41272f754f7fSDarrick J. Wong
41282f754f7fSDarrick J. Wong2. Call ``xfs_defer_finish`` to process the exchange.
41292f754f7fSDarrick J. Wong   This is encapsulated in ``xrep_tempswap_contents`` for scrub operations.
41302f754f7fSDarrick J. Wong   This will log an extent swap intent item to the transaction for the deferred
41312f754f7fSDarrick J. Wong   extent swap work item.
41322f754f7fSDarrick J. Wong
41332f754f7fSDarrick J. Wong3. Until ``sxi_blockcount`` of the deferred extent swap work item is zero,
41342f754f7fSDarrick J. Wong
41352f754f7fSDarrick J. Wong   a. Read the block maps of both file ranges starting at ``sxi_startoff1`` and
41362f754f7fSDarrick J. Wong      ``sxi_startoff2``, respectively, and compute the longest extent that can
41372f754f7fSDarrick J. Wong      be swapped in a single step.
41382f754f7fSDarrick J. Wong      This is the minimum of the two ``br_blockcount`` s in the mappings.
41392f754f7fSDarrick J. Wong      Keep advancing through the file forks until at least one of the mappings
41402f754f7fSDarrick J. Wong      contains written blocks.
41412f754f7fSDarrick J. Wong      Mutual holes, unwritten extents, and extent mappings to the same physical
41422f754f7fSDarrick J. Wong      space are not exchanged.
41432f754f7fSDarrick J. Wong
41442f754f7fSDarrick J. Wong      For the next few steps, this document will refer to the mapping that came
41452f754f7fSDarrick J. Wong      from file 1 as "map1", and the mapping that came from file 2 as "map2".
41462f754f7fSDarrick J. Wong
41472f754f7fSDarrick J. Wong   b. Create a deferred block mapping update to unmap map1 from file 1.
41482f754f7fSDarrick J. Wong
41492f754f7fSDarrick J. Wong   c. Create a deferred block mapping update to unmap map2 from file 2.
41502f754f7fSDarrick J. Wong
41512f754f7fSDarrick J. Wong   d. Create a deferred block mapping update to map map1 into file 2.
41522f754f7fSDarrick J. Wong
41532f754f7fSDarrick J. Wong   e. Create a deferred block mapping update to map map2 into file 1.
41542f754f7fSDarrick J. Wong
41552f754f7fSDarrick J. Wong   f. Log the block, quota, and extent count updates for both files.
41562f754f7fSDarrick J. Wong
41572f754f7fSDarrick J. Wong   g. Extend the ondisk size of either file if necessary.
41582f754f7fSDarrick J. Wong
41592f754f7fSDarrick J. Wong   h. Log an extent swap done log item for the extent swap intent log item
41602f754f7fSDarrick J. Wong      that was read at the start of step 3.
41612f754f7fSDarrick J. Wong
41622f754f7fSDarrick J. Wong   i. Compute the amount of file range that has just been covered.
41632f754f7fSDarrick J. Wong      This quantity is ``(map1.br_startoff + map1.br_blockcount -
41642f754f7fSDarrick J. Wong      sxi_startoff1)``, because step 3a could have skipped holes.
41652f754f7fSDarrick J. Wong
41662f754f7fSDarrick J. Wong   j. Increase the starting offsets of ``sxi_startoff1`` and ``sxi_startoff2``
41672f754f7fSDarrick J. Wong      by the number of blocks computed in the previous step, and decrease
41682f754f7fSDarrick J. Wong      ``sxi_blockcount`` by the same quantity.
41692f754f7fSDarrick J. Wong      This advances the cursor.
41702f754f7fSDarrick J. Wong
41712f754f7fSDarrick J. Wong   k. Log a new extent swap intent log item reflecting the advanced state of
41722f754f7fSDarrick J. Wong      the work item.
41732f754f7fSDarrick J. Wong
41742f754f7fSDarrick J. Wong   l. Return the proper error code (EAGAIN) to the deferred operation manager
41752f754f7fSDarrick J. Wong      to inform it that there is more work to be done.
41762f754f7fSDarrick J. Wong      The operation manager completes the deferred work in steps 3b-3e before
41772f754f7fSDarrick J. Wong      moving back to the start of step 3.
41782f754f7fSDarrick J. Wong
41792f754f7fSDarrick J. Wong4. Perform any post-processing.
41802f754f7fSDarrick J. Wong   This will be discussed in more detail in subsequent sections.
41812f754f7fSDarrick J. Wong
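The cursor arithmetic in step 3 can be sketched with a toy userspace model.
This is not the kernel implementation: each fork is modelled as an array holding one physical block number per file block (with 0 meaning a hole), the ``toy_*`` names are hypothetical, and the logging, quota, and deferred mapping updates of steps 3b through 3h and 3k are reduced to a direct in-memory exchange.
Unlike step 3i, the sketch advances the cursor while it skips unchanged ranges instead of accounting for them afterwards.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy cursor; the fields mirror sxi_startoff1/2 and sxi_blockcount. */
struct toy_intent {
	uint32_t	startoff1;
	uint32_t	startoff2;
	uint32_t	blockcount;
};

/* Length of the mapping starting at @off, capped at @max (nonzero). */
static uint32_t toy_map_len(const uint32_t *fork, uint32_t off, uint32_t max)
{
	uint32_t	len = 1;

	while (len < max) {
		/* Holes continue as holes; extents continue contiguously. */
		uint32_t	want = fork[off] ? fork[off] + len : 0;

		if (fork[off + len] != want)
			break;
		len++;
	}
	return len;
}

/* Move both offsets forward and shrink the remaining range. */
static void toy_advance(struct toy_intent *sxi, uint32_t len)
{
	sxi->startoff1 += len;
	sxi->startoff2 += len;
	sxi->blockcount -= len;
}

/*
 * One pass through step 3.  Returns true if more work remains, which is
 * the toy equivalent of returning EAGAIN to the deferred ops manager.
 */
static bool toy_swap_step(uint32_t *fork1, uint32_t *fork2,
			  struct toy_intent *sxi)
{
	uint32_t	len = 0, i;

	/* 3a: skip mutual holes and mappings to the same physical space. */
	while (sxi->blockcount > 0) {
		uint32_t len1 = toy_map_len(fork1, sxi->startoff1,
					    sxi->blockcount);
		uint32_t len2 = toy_map_len(fork2, sxi->startoff2,
					    sxi->blockcount);

		len = len1 < len2 ? len1 : len2;
		if (fork1[sxi->startoff1] != fork2[sxi->startoff2])
			break;
		toy_advance(sxi, len);
	}
	if (sxi->blockcount == 0)
		return false;

	/* 3b-3e: exchange the two mappings, one block at a time here. */
	for (i = 0; i < len; i++) {
		uint32_t tmp = fork1[sxi->startoff1 + i];

		fork1[sxi->startoff1 + i] = fork2[sxi->startoff2 + i];
		fork2[sxi->startoff2 + i] = tmp;
	}

	/* 3i-3j: advance the cursor past the range just exchanged. */
	toy_advance(sxi, len);
	return sxi->blockcount > 0;
}

/* Drive the steps to completion; returns the number of passes made. */
static unsigned int toy_swap_all(uint32_t *fork1, uint32_t *fork2,
				 struct toy_intent *sxi)
{
	unsigned int	steps = 1;

	while (toy_swap_step(fork1, fork2, sxi))
		steps++;
	return steps;
}
```

Because all progress is recorded in the cursor, a copy of ``struct toy_intent`` taken between passes is enough to resume the exchange, which is the property that the logged intent item provides in the real design.
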
41822f754f7fSDarrick J. WongIf the filesystem goes down in the middle of an operation, log recovery will
41832f754f7fSDarrick J. Wongfind the most recent unfinished extent swap log intent item and restart from
41842f754f7fSDarrick J. Wongthere.
41852f754f7fSDarrick J. WongThis is how extent swapping guarantees that an outside observer will either see
the old broken structure or the new one, and never a mishmash of both.
41872f754f7fSDarrick J. Wong
41882f754f7fSDarrick J. WongPreparation for Extent Swapping
41892f754f7fSDarrick J. Wong```````````````````````````````
41902f754f7fSDarrick J. Wong
41912f754f7fSDarrick J. WongThere are a few things that need to be taken care of before initiating an
41922f754f7fSDarrick J. Wongatomic extent swap operation.
41932f754f7fSDarrick J. WongFirst, regular files require the page cache to be flushed to disk before the
41942f754f7fSDarrick J. Wongoperation begins, and directio writes to be quiesced.
41952f754f7fSDarrick J. WongLike any filesystem operation, extent swapping must determine the maximum
41962f754f7fSDarrick J. Wongamount of disk space and quota that can be consumed on behalf of both files in
41972f754f7fSDarrick J. Wongthe operation, and reserve that quantity of resources to avoid an unrecoverable
41982f754f7fSDarrick J. Wongout of space failure once it starts dirtying metadata.
41992f754f7fSDarrick J. WongThe preparation step scans the ranges of both files to estimate:
42002f754f7fSDarrick J. Wong
42012f754f7fSDarrick J. Wong- Data device blocks needed to handle the repeated updates to the fork
42022f754f7fSDarrick J. Wong  mappings.
42032f754f7fSDarrick J. Wong- Change in data and realtime block counts for both files.
42042f754f7fSDarrick J. Wong- Increase in quota usage for both files, if the two files do not share the
42052f754f7fSDarrick J. Wong  same set of quota ids.
42062f754f7fSDarrick J. Wong- The number of extent mappings that will be added to each file.
42072f754f7fSDarrick J. Wong- Whether or not there are partially written realtime extents.
42082f754f7fSDarrick J. Wong  User programs must never be able to access a realtime file extent that maps
42092f754f7fSDarrick J. Wong  to different extents on the realtime volume, which could happen if the
42102f754f7fSDarrick J. Wong  operation fails to run to completion.
42112f754f7fSDarrick J. Wong
42122f754f7fSDarrick J. WongThe need for precise estimation increases the run time of the swap operation,
42132f754f7fSDarrick J. Wongbut it is very important to maintain correct accounting.
42142f754f7fSDarrick J. WongThe filesystem must not run completely out of free space, nor can the extent
42152f754f7fSDarrick J. Wongswap ever add more extent mappings to a fork than it can support.
Regular users are required to abide by the quota limits, though metadata repairs
42172f754f7fSDarrick J. Wongmay exceed quota to resolve inconsistent metadata elsewhere.
42182f754f7fSDarrick J. Wong
42192f754f7fSDarrick J. WongSpecial Features for Swapping Metadata File Extents
42202f754f7fSDarrick J. Wong```````````````````````````````````````````````````
42212f754f7fSDarrick J. Wong
42222f754f7fSDarrick J. WongExtended attributes, symbolic links, and directories can set the fork format to
42232f754f7fSDarrick J. Wong"local" and treat the fork as a literal area for data storage.
42242f754f7fSDarrick J. WongMetadata repairs must take extra steps to support these cases:
42252f754f7fSDarrick J. Wong
42262f754f7fSDarrick J. Wong- If both forks are in local format and the fork areas are large enough, the
42272f754f7fSDarrick J. Wong  swap is performed by copying the incore fork contents, logging both forks,
42282f754f7fSDarrick J. Wong  and committing.
42292f754f7fSDarrick J. Wong  The atomic extent swap mechanism is not necessary, since this can be done
42302f754f7fSDarrick J. Wong  with a single transaction.
42312f754f7fSDarrick J. Wong
42322f754f7fSDarrick J. Wong- If both forks map blocks, then the regular atomic extent swap is used.
42332f754f7fSDarrick J. Wong
42342f754f7fSDarrick J. Wong- Otherwise, only one fork is in local format.
42352f754f7fSDarrick J. Wong  The contents of the local format fork are converted to a block to perform the
42362f754f7fSDarrick J. Wong  swap.
42372f754f7fSDarrick J. Wong  The conversion to block format must be done in the same transaction that
42382f754f7fSDarrick J. Wong  logs the initial extent swap intent log item.
42392f754f7fSDarrick J. Wong  The regular atomic extent swap is used to exchange the mappings.
42402f754f7fSDarrick J. Wong  Special flags are set on the swap operation so that the transaction can be
42412f754f7fSDarrick J. Wong  rolled one more time to convert the second file's fork back to local format
42422f754f7fSDarrick J. Wong  so that the second file will be ready to go as soon as the ILOCK is dropped.
42432f754f7fSDarrick J. Wong
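The three cases above amount to a small decision table, sketched here in plain C.
The enum and function names are hypothetical and do not correspond to kernel symbols.
The text does not say what happens when both forks are local but the fork areas are too small, so the sketch assumes that this case also takes the conversion path.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy fork formats: inline ("local") data versus mapped blocks. */
enum toy_fork_fmt {
	TOY_FMT_LOCAL,
	TOY_FMT_BLOCKS,
};

enum toy_swap_strategy {
	TOY_SWAP_COPY_LOCAL,	/* copy incore forks in one transaction */
	TOY_SWAP_EXTENTS,	/* regular atomic extent swap */
	TOY_SWAP_CONVERT_FIRST,	/* convert local data to a block, then swap */
};

/*
 * Pick a strategy from the two fork formats.  @local_fits says whether
 * each fork area is large enough to hold the other file's local data.
 */
static enum toy_swap_strategy
toy_pick_strategy(enum toy_fork_fmt fmt1, enum toy_fork_fmt fmt2,
		  bool local_fits)
{
	if (fmt1 == TOY_FMT_LOCAL && fmt2 == TOY_FMT_LOCAL && local_fits)
		return TOY_SWAP_COPY_LOCAL;
	if (fmt1 == TOY_FMT_BLOCKS && fmt2 == TOY_FMT_BLOCKS)
		return TOY_SWAP_EXTENTS;
	/* Assumption: every remaining combination converts before swapping. */
	return TOY_SWAP_CONVERT_FIRST;
}
```
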
42442f754f7fSDarrick J. WongExtended attributes and directories stamp the owning inode into every block,
42452f754f7fSDarrick J. Wongbut the buffer verifiers do not actually check the inode number!
42462f754f7fSDarrick J. WongAlthough there is no verification, it is still important to maintain
42472f754f7fSDarrick J. Wongreferential integrity, so prior to performing the extent swap, online repair
42482f754f7fSDarrick J. Wongbuilds every block in the new data structure with the owner field of the file
42492f754f7fSDarrick J. Wongbeing repaired.
42502f754f7fSDarrick J. Wong
42512f754f7fSDarrick J. WongAfter a successful swap operation, the repair operation must reap the old fork
42522f754f7fSDarrick J. Wongblocks by processing each fork mapping through the standard :ref:`file extent
42532f754f7fSDarrick J. Wongreaping <reaping>` mechanism that is done post-repair.
42542f754f7fSDarrick J. WongIf the filesystem should go down during the reap part of the repair, the
42552f754f7fSDarrick J. Wongiunlink processing at the end of recovery will free both the temporary file and
42562f754f7fSDarrick J. Wongwhatever blocks were not reaped.
42572f754f7fSDarrick J. WongHowever, this iunlink processing omits the cross-link detection of online
42582f754f7fSDarrick J. Wongrepair, and is not completely foolproof.
42592f754f7fSDarrick J. Wong
42602f754f7fSDarrick J. WongSwapping Temporary File Extents
42612f754f7fSDarrick J. Wong```````````````````````````````
42622f754f7fSDarrick J. Wong
42632f754f7fSDarrick J. WongTo repair a metadata file, online repair proceeds as follows:
42642f754f7fSDarrick J. Wong
42652f754f7fSDarrick J. Wong1. Create a temporary repair file.
42662f754f7fSDarrick J. Wong
42672f754f7fSDarrick J. Wong2. Use the staging data to write out new contents into the temporary repair
42682f754f7fSDarrick J. Wong   file.
42692f754f7fSDarrick J. Wong   The same fork must be written to as is being repaired.
42702f754f7fSDarrick J. Wong
42712f754f7fSDarrick J. Wong3. Commit the scrub transaction, since the swap estimation step must be
42722f754f7fSDarrick J. Wong   completed before transaction reservations are made.
42732f754f7fSDarrick J. Wong
4. Call ``xrep_tempswap_trans_alloc`` to allocate a new scrub transaction with
   the appropriate resource reservations, take the necessary locks, and fill
   out a ``struct xfs_swapext_req`` with the details of the swap operation.
42772f754f7fSDarrick J. Wong
42782f754f7fSDarrick J. Wong5. Call ``xrep_tempswap_contents`` to swap the contents.
42792f754f7fSDarrick J. Wong
42802f754f7fSDarrick J. Wong6. Commit the transaction to complete the repair.
42812f754f7fSDarrick J. Wong
42822f754f7fSDarrick J. Wong.. _rtsummary:
42832f754f7fSDarrick J. Wong
42842f754f7fSDarrick J. WongCase Study: Repairing the Realtime Summary File
42852f754f7fSDarrick J. Wong^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
42862f754f7fSDarrick J. Wong
42872f754f7fSDarrick J. WongIn the "realtime" section of an XFS filesystem, free space is tracked via a
42882f754f7fSDarrick J. Wongbitmap, similar to Unix FFS.
42892f754f7fSDarrick J. WongEach bit in the bitmap represents one realtime extent, which is a multiple of
42902f754f7fSDarrick J. Wongthe filesystem block size between 4KiB and 1GiB in size.
42912f754f7fSDarrick J. WongThe realtime summary file indexes the number of free extents of a given size to
42922f754f7fSDarrick J. Wongthe offset of the block within the realtime free space bitmap where those free
42932f754f7fSDarrick J. Wongextents begin.
42942f754f7fSDarrick J. WongIn other words, the summary file helps the allocator find free extents by
42952f754f7fSDarrick J. Wonglength, similar to what the free space by count (cntbt) btree does for the data
42962f754f7fSDarrick J. Wongsection.
42972f754f7fSDarrick J. Wong
42982f754f7fSDarrick J. WongThe summary file itself is a flat file (with no block headers or checksums!)
42992f754f7fSDarrick J. Wongpartitioned into ``log2(total rt extents)`` sections containing enough 32-bit
43002f754f7fSDarrick J. Wongcounters to match the number of blocks in the rt bitmap.
43012f754f7fSDarrick J. WongEach counter records the number of free extents that start in that bitmap block
43022f754f7fSDarrick J. Wongand can satisfy a power-of-two allocation request.
43032f754f7fSDarrick J. Wong
43042f754f7fSDarrick J. WongTo check the summary file against the bitmap:
43052f754f7fSDarrick J. Wong
43062f754f7fSDarrick J. Wong1. Take the ILOCK of both the realtime bitmap and summary files.
43072f754f7fSDarrick J. Wong
43082f754f7fSDarrick J. Wong2. For each free space extent recorded in the bitmap:
43092f754f7fSDarrick J. Wong
43102f754f7fSDarrick J. Wong   a. Compute the position in the summary file that contains a counter that
43112f754f7fSDarrick J. Wong      represents this free extent.
43122f754f7fSDarrick J. Wong
43132f754f7fSDarrick J. Wong   b. Read the counter from the xfile.
43142f754f7fSDarrick J. Wong
43152f754f7fSDarrick J. Wong   c. Increment it, and write it back to the xfile.
43162f754f7fSDarrick J. Wong
43172f754f7fSDarrick J. Wong3. Compare the contents of the xfile against the ondisk file.
43182f754f7fSDarrick J. Wong
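The checking procedure above can be sketched in userspace as follows.
This is a toy model, not the scrub code: the bitmap is one byte per realtime extent with nonzero meaning free, a plain array stands in for the xfile, and the ``toy_*`` names are hypothetical.
The sketch assumes that the counter for a free extent lives at index ``level * nr_bmblocks + bmblock``, where ``level`` is the floor of the base-2 logarithm of the extent length and ``bmblock`` is the bitmap block in which the extent starts.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* floor(log2(len)) for len > 0. */
static unsigned int toy_log2(uint64_t len)
{
	unsigned int	level = 0;

	while (len >>= 1)
		level++;
	return level;
}

/*
 * Rebuild the summary counters from the free space bitmap.
 * @sum must have room for nr_levels * nr_bmblocks counters.
 */
static void toy_rtsum_compute(const uint8_t *free_bits, uint64_t nr_rtx,
			      uint64_t rtx_per_bmblock, uint32_t *sum,
			      uint64_t nr_bmblocks, unsigned int nr_levels)
{
	uint64_t	start = 0;

	memset(sum, 0, nr_levels * nr_bmblocks * sizeof(*sum));
	while (start < nr_rtx) {
		uint64_t	end, bmblock;
		unsigned int	level;

		if (!free_bits[start]) {
			start++;
			continue;
		}

		/* Find the end of this free space extent. */
		end = start;
		while (end < nr_rtx && free_bits[end])
			end++;

		/* 2a: compute the position of the counter. */
		level = toy_log2(end - start);
		bmblock = start / rtx_per_bmblock;
		assert(level < nr_levels && bmblock < nr_bmblocks);

		/* 2b-2c: read the counter, increment it, write it back. */
		sum[level * nr_bmblocks + bmblock]++;
		start = end;
	}
}

/* 3: compare the rebuilt counters against the ondisk contents. */
static bool toy_rtsum_matches(const uint32_t *computed,
			      const uint32_t *ondisk, uint64_t nr_counters)
{
	return memcmp(computed, ondisk, nr_counters * sizeof(*computed)) == 0;
}
```

A free extent that crosses a bitmap block boundary is counted only in the block where it starts, which matches the description of the counters given earlier in this section.
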
43192f754f7fSDarrick J. WongTo repair the summary file, write the xfile contents into the temporary file
43202f754f7fSDarrick J. Wongand use atomic extent swap to commit the new contents.
43212f754f7fSDarrick J. WongThe temporary file is then reaped.
43222f754f7fSDarrick J. Wong
43232f754f7fSDarrick J. WongThe proposed patchset is the
43242f754f7fSDarrick J. Wong`realtime summary repair
43252f754f7fSDarrick J. Wong<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-rtsummary>`_
43262f754f7fSDarrick J. Wongseries.
43272f754f7fSDarrick J. Wong
43282f754f7fSDarrick J. WongCase Study: Salvaging Extended Attributes
43292f754f7fSDarrick J. Wong^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
43302f754f7fSDarrick J. Wong
43312f754f7fSDarrick J. WongIn XFS, extended attributes are implemented as a namespaced name-value store.
Values are limited in size to 64KiB, but there is no limit on the number of
43332f754f7fSDarrick J. Wongnames.
43342f754f7fSDarrick J. WongThe attribute fork is unpartitioned, which means that the root of the attribute
43352f754f7fSDarrick J. Wongstructure is always in logical block zero, but attribute leaf blocks, dabtree
43362f754f7fSDarrick J. Wongindex blocks, and remote value blocks are intermixed.
43372f754f7fSDarrick J. WongAttribute leaf blocks contain variable-sized records that associate
43382f754f7fSDarrick J. Wonguser-provided names with the user-provided values.
43392f754f7fSDarrick J. WongValues larger than a block are allocated separate extents and written there.
43402f754f7fSDarrick J. WongIf the leaf information expands beyond a single block, a directory/attribute
43412f754f7fSDarrick J. Wongbtree (``dabtree``) is created to map hashes of attribute names to entries
43422f754f7fSDarrick J. Wongfor fast lookup.
43432f754f7fSDarrick J. Wong
43442f754f7fSDarrick J. WongSalvaging extended attributes is done as follows:
43452f754f7fSDarrick J. Wong
43462f754f7fSDarrick J. Wong1. Walk the attr fork mappings of the file being repaired to find the attribute
43472f754f7fSDarrick J. Wong   leaf blocks.
43482f754f7fSDarrick J. Wong   When one is found,
43492f754f7fSDarrick J. Wong
43502f754f7fSDarrick J. Wong   a. Walk the attr leaf block to find candidate keys.
43512f754f7fSDarrick J. Wong      When one is found,
43522f754f7fSDarrick J. Wong
43532f754f7fSDarrick J. Wong      1. Check the name for problems, and ignore the name if there are.
43542f754f7fSDarrick J. Wong
43552f754f7fSDarrick J. Wong      2. Retrieve the value.
43562f754f7fSDarrick J. Wong         If that succeeds, add the name and value to the staging xfarray and
43572f754f7fSDarrick J. Wong         xfblob.
43582f754f7fSDarrick J. Wong
43592f754f7fSDarrick J. Wong2. If the memory usage of the xfarray and xfblob exceed a certain amount of
43602f754f7fSDarrick J. Wong   memory or there are no more attr fork blocks to examine, unlock the file and
43612f754f7fSDarrick J. Wong   add the staged extended attributes to the temporary file.
43622f754f7fSDarrick J. Wong
43632f754f7fSDarrick J. Wong3. Use atomic extent swapping to exchange the new and old extended attribute
43642f754f7fSDarrick J. Wong   structures.
43652f754f7fSDarrick J. Wong   The old attribute blocks are now attached to the temporary file.
43662f754f7fSDarrick J. Wong
43672f754f7fSDarrick J. Wong4. Reap the temporary file.
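
The stage-then-flush pattern of steps 1 and 2 can be sketched as follows.
This is an illustrative userspace model only: the byte budget, the structure,
and every function name are hypothetical, the name check is an assumed
approximation of what the kernel verifies, and "flushing" merely moves
counters where the real code would write attributes into the temporary file.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define STAGE_LIMIT	64	/* hypothetical memory budget, in bytes */

struct salvage_stage {
	size_t	bytes_used;	/* stand-in for xfarray + xfblob usage */
	size_t	nr_staged;	/* attrs waiting to be flushed */
	size_t	nr_flushed;	/* attrs already added to the temp file */
	size_t	nr_flushes;	/* how many flush cycles were run */
};

/* Step 1a.1 (assumed checks): reject empty, overlong, or NUL-holding names. */
static int salvage_name_ok(const char *name, size_t namelen)
{
	if (namelen == 0 || namelen > 255)
		return 0;
	return memchr(name, '\0', namelen) == NULL;
}

/* Step 2: move everything staged so far into the temporary file. */
static void salvage_flush(struct salvage_stage *st)
{
	if (st->nr_staged == 0)
		return;
	st->nr_flushed += st->nr_staged;
	st->nr_flushes++;
	st->nr_staged = 0;
	st->bytes_used = 0;
}

/* Step 1a.2: stage one attr, then flush if the budget is exceeded. */
static int salvage_one(struct salvage_stage *st, const char *name,
		       size_t namelen, size_t valuelen)
{
	assert(st != NULL && name != NULL);
	if (!salvage_name_ok(name, namelen))
		return 0;	/* ignore the damaged name and keep going */
	st->bytes_used += namelen + valuelen;
	st->nr_staged++;
	if (st->bytes_used > STAGE_LIMIT)
		salvage_flush(st);
	return 1;
}
```

The caller invokes ``salvage_flush`` one last time when there are no more attr
fork blocks to examine, which covers the second condition in step 2.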

The proposed patchset is the
`extended attribute repair
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-xattrs>`_
series.

Fixing Directories
------------------

Fixing directories is difficult with currently available filesystem features,
since directory entries are not redundant.
The offline repair tool scans all inodes to find files with nonzero link count,
and then it scans all directories to establish parentage of those linked files.
Damaged files and directories are zapped, and files with no parent are
moved to the ``/lost+found`` directory.
It does not try to salvage anything.

The best that online repair can do at this time is to read directory data
blocks and salvage any dirents that look plausible, correct link counts, and
move orphans back into the directory tree.
The salvage process is discussed in the case study at the end of this section.
The :ref:`file link count fsck <nlinks>` code takes care of fixing link counts
and moving orphans to the ``/lost+found`` directory.

Case Study: Salvaging Directories
`````````````````````````````````

Unlike extended attributes, directory blocks are all the same size, so
salvaging directories is straightforward:

1. Find the parent of the directory.
   If the dotdot entry is readable, try to confirm that the alleged
   parent has a child entry pointing back to the directory being repaired.
   Otherwise, walk the filesystem to find it.

2. Walk the first partition of the data fork of the directory to find the
   directory entry data blocks.
   When one is found,

   a. Walk the directory data block to find candidate entries.
      When an entry is found:

      i. Check the name for problems, and ignore the name if there are any.

      ii. Retrieve the inumber and grab the inode.
          If that succeeds, add the name, inode number, and file type to the
          staging xfarray and xfblob.

3. If the memory usage of the xfarray and xfblob exceeds a certain threshold
   or there are no more directory data blocks to examine, unlock the
   directory and add the staged dirents into the temporary directory.
   Truncate the staging files.

4. Use atomic extent swapping to exchange the new and old directory structures.
   The old directory blocks are now attached to the temporary file.

5. Reap the temporary file.
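
The split between fixed-size records and variable-length names in step 2 can
be illustrated with a small userspace model.
Everything here is a sketch under stated assumptions: the arrays stand in for
the xfarray and xfblob, the capacities and function names are hypothetical,
a zero inode number stands in for a failed inode grab, and the sketch assumes
that dot and dotdot are handled separately from the salvaged entries.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOB_SIZE	256	/* hypothetical blob capacity */
#define MAX_RECS	8	/* hypothetical record capacity */

/* Fixed-size record, as would be kept in the staging xfarray. */
struct stage_dirent {
	uint64_t	ino;		/* inode number of the child */
	uint32_t	name_cookie;	/* offset of the name in the blob */
	uint8_t		namelen;
	uint8_t		ftype;
};

struct dir_stage {
	struct stage_dirent	recs[MAX_RECS];
	unsigned int		nr_recs;
	char			blob[BLOB_SIZE]; /* stand-in for the xfblob */
	uint32_t		blob_used;
};

/* Step 2a.i (assumed checks): 1-255 bytes with no '/' or NUL bytes. */
static int dirent_name_ok(const char *name, size_t len)
{
	if (len == 0 || len > 255)
		return 0;
	return !memchr(name, '/', len) && !memchr(name, '\0', len);
}

/* Assumption: dot and dotdot are not staged with the other entries. */
static int dirent_is_dot_or_dotdot(const char *name, size_t len)
{
	return (len == 1 && name[0] == '.') ||
	       (len == 2 && name[0] == '.' && name[1] == '.');
}

/* Step 2a.ii: stage one entry.  Returns 1 if staged, 0 if skipped. */
static int dir_stage_add(struct dir_stage *ds, const char *name, size_t len,
			 uint64_t ino, uint8_t ftype)
{
	struct stage_dirent	*rec;

	assert(ds->nr_recs <= MAX_RECS);
	if (!dirent_name_ok(name, len) || dirent_is_dot_or_dotdot(name, len))
		return 0;
	if (ino == 0)	/* stand-in for "grabbing the inode failed" */
		return 0;
	if (ds->nr_recs == MAX_RECS || ds->blob_used + len > BLOB_SIZE)
		return 0;	/* a real caller would flush and retry */

	rec = &ds->recs[ds->nr_recs++];
	rec->ino = ino;
	rec->name_cookie = ds->blob_used;
	rec->namelen = (uint8_t)len;
	rec->ftype = ftype;
	memcpy(ds->blob + ds->blob_used, name, len);
	ds->blob_used += (uint32_t)len;
	return 1;
}
```

Keeping only a cookie in the record means the record array stays fixed-size
and sortable, while names of any length live in the blob.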

**Future Work Question**: Should repair revalidate the dentry cache when
rebuilding a directory?

*Answer*: Yes, it should.

In theory it is necessary to scan all dentry cache entries for a directory to
ensure that one of the following applies:

1. The cached dentry reflects an ondisk dirent in the new directory.

2. The cached dentry no longer has a corresponding ondisk dirent in the new
   directory and the dentry can be purged from the cache.

3. The cached dentry no longer has an ondisk dirent but the dentry cannot be
   purged.
   This is the problem case.

Unfortunately, the current dentry cache design doesn't provide a means to walk
every child dentry of a specific directory, which makes this a hard problem.
There is no known solution.

The proposed patchset is the
`directory repair
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-dirs>`_
series.

Parent Pointers
```````````````

A parent pointer is a piece of file metadata that enables a user to locate the
file's parent directory without having to traverse the directory tree from the
root.
Without them, reconstruction of directory trees is hindered in much the same
way that the historic lack of reverse space mapping information once hindered
reconstruction of filesystem space metadata.
The parent pointer feature, however, makes total directory reconstruction
possible.

XFS parent pointers include the dirent name and location of the entry within
the parent directory.
In other words, child files use extended attributes to store pointers to
parents in the form ``(parent_inum, parent_gen, dirent_pos) → (dirent_name)``.
The directory checking process can be strengthened to ensure that the target of
each dirent also contains a parent pointer pointing back to the dirent.
Likewise, each parent pointer can be checked by ensuring that the target of
each parent pointer is a directory and that it contains a dirent matching
the parent pointer.
Both online and offline repair can use this strategy.
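
The two-way cross-check described above can be sketched with in-memory
records.
The structures and function names below are hypothetical and exist only for
illustration, since the ondisk format is not final; the check that the parent
is actually a directory is left out of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical in-memory form of one parent pointer xattr. */
struct pptr {
	uint64_t	parent_inum;
	uint32_t	parent_gen;
	uint32_t	dirent_pos;
	const char	*name;		/* dirent_name, NUL-terminated here */
};

/* Hypothetical in-memory form of one directory entry. */
struct dirent_rec {
	uint64_t	dir_inum;	/* directory holding the entry */
	uint32_t	dir_gen;
	uint32_t	pos;		/* location of the entry in the dir */
	uint64_t	child_inum;
	const char	*name;
};

/* Does this parent pointer describe exactly this directory entry? */
static int pptr_matches_dirent(const struct pptr *pp,
			       const struct dirent_rec *de)
{
	assert(pp->name != NULL && de->name != NULL);
	return pp->parent_inum == de->dir_inum &&
	       pp->parent_gen == de->dir_gen &&
	       pp->dirent_pos == de->pos &&
	       strcmp(pp->name, de->name) == 0;
}

/* Forward check: the dirent's target must carry a matching pptr. */
static int dirent_has_pptr(const struct dirent_rec *de,
			   const struct pptr *child_pptrs, size_t nr)
{
	size_t	i;

	for (i = 0; i < nr; i++)
		if (pptr_matches_dirent(&child_pptrs[i], de))
			return 1;
	return 0;
}

/* Reverse check: the pptr's target directory must hold a matching dirent. */
static int pptr_has_dirent(const struct pptr *pp, uint64_t child_inum,
			   const struct dirent_rec *dir_entries, size_t nr)
{
	size_t	i;

	for (i = 0; i < nr; i++)
		if (dir_entries[i].child_inum == child_inum &&
		    pptr_matches_dirent(pp, &dir_entries[i]))
			return 1;
	return 0;
}
```

A checker would run the forward check while scrubbing a directory and the
reverse check while scrubbing a child file, flagging any entry for which the
corresponding function returns zero.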

**Note**: The ondisk format of parent pointers is not yet finalized.

+--------------------------------------------------------------------------+
| **Historical Sidebar**:                                                  |
+--------------------------------------------------------------------------+
| Directory parent pointers were first proposed as an XFS feature more     |
| than a decade ago by SGI.                                                |
| Each link from a parent directory to a child file is mirrored with an    |
| extended attribute in the child that could be used to identify the       |
| parent directory.                                                        |
| Unfortunately, this early implementation had major shortcomings and was  |
| never merged into Linux XFS:                                             |
|                                                                          |
| 1. The XFS codebase of the late 2000s did not have the infrastructure to |
|    enforce strong referential integrity in the directory tree.           |
|    It did not guarantee that a change in a forward link would always be  |
|    followed up with the corresponding change to the reverse links.       |
|                                                                          |
| 2. Referential integrity was not integrated into offline repair.         |
|    Checking and repairs were performed on mounted filesystems without    |
|    taking any kernel or inode locks to coordinate access.                |
|    It is not clear how this actually worked properly.                    |
|                                                                          |
| 3. The extended attribute did not record the name of the directory entry |
|    in the parent, so the SGI parent pointer implementation cannot be     |
|    used to reconnect the directory tree.                                 |
|                                                                          |
| 4. Extended attribute forks only support 65,536 extents, which means     |
|    that parent pointer attribute creation is likely to fail at some      |
|    point before the maximum file link count is achieved.                 |
|                                                                          |
| The original parent pointer design was too unstable for something like   |
| a file system repair to depend on.                                       |
| Allison Henderson, Chandan Babu, and Catherine Hoang are working on a    |
| second implementation that solves all shortcomings of the first.         |
| During 2022, Allison introduced log intent items to track physical       |
| manipulations of the extended attribute structures.                      |
| This solves the referential integrity problem by making it possible to   |
| commit a dirent update and a parent pointer update in the same           |
| transaction.                                                             |
| Chandan increased the maximum extent counts of both data and attribute   |
| forks, thereby ensuring that the extended attribute structure can grow   |
| to handle the maximum hardlink count of any file.                        |
+--------------------------------------------------------------------------+

Case Study: Repairing Directories with Parent Pointers
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Directory rebuilding uses a :ref:`coordinated inode scan <iscan>` and
a :ref:`directory entry live update hook <liveupdate>` as follows:

1. Set up a temporary directory for generating the new directory structure,
   an xfblob for storing entry names, and an xfarray for stashing directory
   updates.

2. Set up an inode scanner and hook into the directory entry code to receive
   updates on directory operations.

3. For each parent pointer found in each file scanned, decide if the parent
   pointer references the directory of interest.
   If so:

   a. Stash an addname entry for this dirent in the xfarray for later.

   b. When finished scanning that file, flush the stashed updates to the
      temporary directory.

4. For each live directory update received via the hook, decide if the child
   has already been scanned.
   If so:

   a. Stash an addname or removename entry for this dirent update in the
      xfarray for later.
      We cannot write directly to the temporary directory because hook
      functions are not allowed to modify filesystem metadata.
      Instead, we stash updates in the xfarray and rely on the scanner thread
      to apply the stashed updates to the temporary directory.

5. When the scan is complete, atomically swap the contents of the temporary
   directory and the directory being repaired.
   The temporary directory now contains the damaged directory structure.

6. Reap the temporary directory.

7. Update the dirent position field of parent pointers as necessary.
   This may require the queuing of a substantial number of xattr log intent
   items.
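
The division of labor between the hook (step 4) and the scanner thread
(step 3b) can be modelled in a few lines of C.
This is a sketch only: the structure, the fixed-size stash, and the function
names are hypothetical, the scan position is simplified to a single inode
number cursor, and the temporary directory is reduced to an entry count.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum stash_op { STASH_ADDNAME, STASH_REMOVENAME };

struct stash_rec {
	enum stash_op	op;
	uint64_t	child_inum;
};

#define STASH_MAX	16	/* hypothetical capacity of the stash */

struct dir_rebuild {
	uint64_t		target_dir;	/* directory being repaired */
	uint64_t		scan_cursor;	/* inodes below this were scanned */
	struct stash_rec	stash[STASH_MAX];
	size_t			nr_stashed;
	long			nr_entries;	/* entries in the temp directory */
};

/*
 * Step 4: called from the (hypothetical) dirent update hook.  Only stash;
 * hook functions must not modify filesystem metadata.
 */
static int rebuild_live_update(struct dir_rebuild *rb, uint64_t dir_inum,
			       uint64_t child_inum, enum stash_op op)
{
	if (dir_inum != rb->target_dir)
		return 0;	/* update is for some other directory */
	if (child_inum >= rb->scan_cursor)
		return 0;	/* scan has not reached the child yet */
	assert(rb->nr_stashed < STASH_MAX);
	rb->stash[rb->nr_stashed].op = op;
	rb->stash[rb->nr_stashed].child_inum = child_inum;
	rb->nr_stashed++;
	return 1;
}

/* Scanner thread: replay the stash in order against the temp directory. */
static void rebuild_replay(struct dir_rebuild *rb)
{
	size_t	i;

	for (i = 0; i < rb->nr_stashed; i++)
		rb->nr_entries += (rb->stash[i].op == STASH_ADDNAME) ? 1 : -1;
	rb->nr_stashed = 0;
}
```

Updates for children that the scan has not reached are dropped on purpose:
the scanner will observe the updated state when it visits that child.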

The proposed patchset is the
`parent pointers directory repair
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=pptrs-online-dir-repair>`_
series.

**Unresolved Question**: How will repair ensure that the ``dirent_pos`` fields
match in the reconstructed directory?

*Answer*: There are a few ways to solve this problem:

1. The field could be designated advisory, since the other three values are
   sufficient to find the entry in the parent.
   However, this makes indexed key lookup impossible while repairs are ongoing.

2. We could allow creating directory entries at specified offsets, which solves
   the referential integrity problem but runs the risk that dirent creation
   will fail due to conflicts with the free space in the directory.

   These conflicts could be resolved by appending the directory entry and
   amending the xattr code to support updating an xattr key and reindexing the
   dabtree, though this would have to be performed with the parent directory
   still locked.

3. Same as above, but remove the old parent pointer entry and add a new one
   atomically.

4. Change the ondisk xattr format to ``(parent_inum, name) → (parent_gen)``,
   which would provide the attr name uniqueness that we require, without
   forcing repair code to update the dirent position.
   Unfortunately, this requires changes to the xattr code to support attr
   names as long as 263 bytes.

5. Change the ondisk xattr format to ``(parent_inum, hash(name)) →
   (name, parent_gen)``.
   If the hash is sufficiently resistant to collisions (e.g. sha256) then
   this should provide the attr name uniqueness that we require.
   Names shorter than 247 bytes could be stored directly.

Discussion is ongoing under the `parent pointers patch deluge
<https://www.spinics.net/lists/linux-xfs/msg69397.html>`_.

Case Study: Repairing Parent Pointers
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Online reconstruction of a file's parent pointer information works similarly to
directory reconstruction:

1. Set up a temporary file for generating a new extended attribute structure,
   an xfblob for storing parent pointer names, and an xfarray for
   stashing parent pointer updates.

2. Set up an inode scanner and hook into the directory entry code to receive
   updates on directory operations.

3. For each directory entry found in each directory scanned, decide if the
   dirent references the file of interest.
   If so:

   a. Stash an addpptr entry for this parent pointer in the xfblob and xfarray
      for later.

   b. When finished scanning the directory, flush the stashed updates to the
      temporary file.

4. For each live directory update received via the hook, decide if the parent
   has already been scanned.
   If so:

   a. Stash an addpptr or removepptr entry for this dirent update in the
      xfarray for later.
      We cannot write parent pointers directly to the temporary file because
      hook functions are not allowed to modify filesystem metadata.
      Instead, we stash updates in the xfarray and rely on the scanner thread
      to apply the stashed parent pointer updates to the temporary file.

5. Copy all non-parent pointer extended attributes to the temporary file.

6. When the scan is complete, atomically swap the attribute fork of the
   temporary file and the file being repaired.
   The temporary file now contains the damaged extended attribute structure.

7. Reap the temporary file.

The proposed patchset is the
`parent pointers repair
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=pptrs-online-parent-repair>`_
series.

Digression: Offline Checking of Parent Pointers
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Examining parent pointers in offline repair works differently because corrupt
files are erased long before directory tree connectivity checks are performed.
Parent pointer checks are therefore a second pass to be added to the existing
connectivity checks:

1. After the set of surviving files has been established (i.e. phase 6),
   walk the surviving directories of each AG in the filesystem.
   This is already performed as part of the connectivity checks.

2. For each directory entry found, record the name in an xfblob, and store
   ``(child_ag_inum, parent_inum, parent_gen, dirent_pos)`` tuples in a
   per-AG in-memory slab.

3. For each AG in the filesystem,

   a. Sort the per-AG tuples in order of child_ag_inum, parent_inum, and
      dirent_pos.

   b. For each inode in the AG,

      1. Scan the inode for parent pointers.
         Record the names in a per-file xfblob, and store ``(parent_inum,
         parent_gen, dirent_pos)`` tuples in a per-file slab.

      2. Sort the per-file tuples in order of parent_inum and dirent_pos.

      3. Position one slab cursor at the start of the inode's records in the
         per-AG tuple slab.
         This should be trivial since the per-AG tuples are in child inumber
         order.

      4. Position a second slab cursor at the start of the per-file tuple slab.

      5. Iterate the two cursors in lockstep, comparing the parent_inum and
         dirent_pos fields of the records under each cursor.

         a. Tuples in the per-AG list but not the per-file list are missing and
            need to be written to the inode.

         b. Tuples in the per-file list but not the per-AG list are dangling
            and need to be removed from the inode.

         c. For tuples in both lists, update the parent_gen and name components
            of the parent pointer if necessary.

4. Move on to examining link counts, as we do today.
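
The lockstep comparison in step 3b.5 is a merge walk over two sorted lists.
The following is a minimal sketch of that walk using plain arrays in place of
slabs; the structure and function names are hypothetical, the function only
counts what it finds rather than fixing anything, and name comparison is
omitted because the names live in separate xfblobs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sort key shared by both lists in this sketch. */
struct pptr_tuple {
	uint64_t	parent_inum;
	uint32_t	parent_gen;
	uint32_t	dirent_pos;
};

struct pptr_diff {
	unsigned int	missing;	/* in dirent list only: add to inode */
	unsigned int	dangling;	/* in file list only: remove */
	unsigned int	stale_gen;	/* in both, parent_gen needs update */
};

/* Order by parent_inum, then dirent_pos, matching the sort in the text. */
static int pptr_tuple_cmp(const struct pptr_tuple *a,
			  const struct pptr_tuple *b)
{
	if (a->parent_inum != b->parent_inum)
		return a->parent_inum < b->parent_inum ? -1 : 1;
	if (a->dirent_pos != b->dirent_pos)
		return a->dirent_pos < b->dirent_pos ? -1 : 1;
	return 0;
}

/*
 * Walk two sorted arrays in lockstep.  @from_dirs holds this inode's slice
 * of the per-AG tuples; @from_file holds the per-file tuples.
 */
static struct pptr_diff pptr_compare(const struct pptr_tuple *from_dirs,
				     size_t nr_dirs,
				     const struct pptr_tuple *from_file,
				     size_t nr_file)
{
	struct pptr_diff	d = { 0, 0, 0 };
	size_t			i = 0, j = 0;

	assert(from_dirs != NULL || nr_dirs == 0);
	assert(from_file != NULL || nr_file == 0);
	while (i < nr_dirs && j < nr_file) {
		int	c = pptr_tuple_cmp(&from_dirs[i], &from_file[j]);

		if (c < 0) {
			d.missing++;
			i++;
		} else if (c > 0) {
			d.dangling++;
			j++;
		} else {
			if (from_dirs[i].parent_gen != from_file[j].parent_gen)
				d.stale_gen++;
			i++;
			j++;
		}
	}
	/* Whatever remains on either side has no partner on the other. */
	d.missing += (unsigned int)(nr_dirs - i);
	d.dangling += (unsigned int)(nr_file - j);
	return d;
}
```

Because both inputs are sorted on the same key, each tuple is examined once
and the comparison needs no lookups beyond advancing the two cursors.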

The proposed patchset is the
`offline parent pointers repair
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=pptrs-repair>`_
series.

Rebuilding directories from parent pointers in offline repair is very
challenging because it currently uses a single-pass scan of the filesystem
during phase 3 to decide which files are corrupt enough to be zapped.
This scan would have to be converted into a multi-pass scan:

1. The first pass of the scan zaps corrupt inodes, forks, and attributes
   much as it does now.
   Corrupt directories are noted but not zapped.

2. The next pass records parent pointers pointing to the directories noted
   as being corrupt in the first pass.
   This second pass may have to happen after the phase 4 scan for duplicate
   blocks, if phase 4 is also capable of zapping directories.

3. The third pass resets corrupt directories to an empty shortform directory.
   Free space metadata has not been ensured yet, so repair cannot yet use the
   directory building code in libxfs.

4. At the start of phase 6, space metadata have been rebuilt.
   Use the parent pointer information recorded during step 2 to reconstruct
   the dirents and add them to the now-empty directories.

This code has not yet been constructed.

.. _orphanage:

The Orphanage
-------------

Filesystems present files as a directed, and hopefully acyclic, graph.
In other words, a tree.
The root of the filesystem is a directory, and each entry in a directory points
downwards either to more subdirectories or to non-directory files.
Unfortunately, a disruption in the directory graph pointers results in a
disconnected graph, which makes files impossible to access via regular path
resolution.

Without parent pointers, the directory parent pointer online scrub code can
detect a dotdot entry pointing to a parent directory that doesn't have a link
back to the child directory and the file link count checker can detect a file
that isn't pointed to by any directory in the filesystem.
If such a file has a positive link count, the file is an orphan.

With parent pointers, directories can be rebuilt by scanning parent pointers
and parent pointers can be rebuilt by scanning directories.
This should reduce the incidence of files ending up in ``/lost+found``.

When orphans are found, they should be reconnected to the directory tree.
Offline fsck solves the problem by creating a directory ``/lost+found`` to
serve as an orphanage, and linking orphan files into the orphanage by using the
inumber as the name.
Reparenting a file to the orphanage does not reset any of its permissions or
ACLs.

This process is more involved in the kernel than it is in userspace.
The directory and file link count repair setup functions must use the regular
VFS mechanisms to create the orphanage directory with all the necessary
security attributes and dentry cache entries, just like a regular directory
tree modification.

Orphaned files are adopted by the orphanage as follows:

1. Call ``xrep_orphanage_try_create`` at the start of the scrub setup function
   to try to ensure that the lost and found directory actually exists.
   This also attaches the orphanage directory to the scrub context.

2. If the decision is made to reconnect a file, take the IOLOCK of both the
   orphanage and the file being reattached.
   The ``xrep_orphanage_iolock_two`` function follows the inode locking
   strategy discussed earlier.

3. Call ``xrep_orphanage_compute_blkres`` and ``xrep_orphanage_compute_name``
   to compute the new name in the orphanage and the block reservation required.

4. Use ``xrep_orphanage_adoption_prep`` to reserve resources for the repair
   transaction.

5. Call ``xrep_orphanage_adopt`` to reparent the orphaned file into the lost
   and found, and update the kernel dentry cache.
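
The idea of naming an adopted file after its inumber (step 3) can be shown with
a short userspace sketch.
This is not the body of ``xrep_orphanage_compute_name``: the function names,
the callback type, and in particular the numeric suffix used to resolve name
collisions are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define ORPHAN_NAME_MAX	255	/* assumed longest dirent name, in bytes */

/* Caller-supplied lookup: nonzero if @name already exists in lost+found. */
typedef int (*orphan_lookup_fn)(const char *name, void *priv);

/*
 * Format the inumber as the new name.  If that name is taken, try
 * "<inum>.1", "<inum>.2", ... until a free one is found.  The suffix
 * scheme is an assumption of this sketch.  Returns the name length, or
 * -1 on failure.
 */
static int orphan_compute_name(uint64_t inum, orphan_lookup_fn exists,
			       void *priv, char *buf, size_t buflen)
{
	unsigned int	incr;
	int		len;

	assert(buf != NULL && buflen > 0);
	len = snprintf(buf, buflen, "%llu", (unsigned long long)inum);
	for (incr = 1; incr < 1000; incr++) {
		if (len < 0 || (size_t)len >= buflen || len > ORPHAN_NAME_MAX)
			return -1;
		if (!exists(buf, priv))
			return len;
		len = snprintf(buf, buflen, "%llu.%u",
			       (unsigned long long)inum, incr);
	}
	return -1;
}

/* Demo lookup: @priv is a NULL-terminated array of existing names. */
static int orphan_name_in_list(const char *name, void *priv)
{
	const char	**names = priv;

	for (; *names; names++)
		if (strcmp(*names, name) == 0)
			return 1;
	return 0;
}
```

In the kernel the lookup would be a directory lookup in the orphanage; the
array-backed ``orphan_name_in_list`` exists only so the sketch can be exercised
in userspace.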

The proposed patches are in the
`orphanage adoption
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-orphanage>`_
series.