.. SPDX-License-Identifier: GPL-2.0
.. _xfs_online_fsck_design:

..
        Mapping of heading styles within this document:
        Heading 1 uses "====" above and below
        Heading 2 uses "===="
        Heading 3 uses "----"
        Heading 4 uses "````"
        Heading 5 uses "^^^^"
        Heading 6 uses "~~~~"
        Heading 7 uses "...."

        Sections are manually numbered because apparently that's what everyone
        does in the kernel.

======================
XFS Online Fsck Design
======================

This document captures the design of the online filesystem check feature for
XFS.
The purpose of this document is threefold:

- To help kernel distributors understand exactly what the XFS online fsck
  feature is, and issues about which they should be aware.

- To help people reading the code to familiarize themselves with the relevant
  concepts and design points before they start digging into the code.

- To help developers maintaining the system by capturing the reasons
  supporting higher level decision making.

As the online fsck code is merged, the links in this document to topic branches
will be replaced with links to code.

This document is licensed under the terms of the GNU General Public License,
v2.
The primary author is Darrick J. Wong.

This design document is split into seven parts.
Part 1 defines what fsck tools are and the motivations for writing a new one.
Parts 2 and 3 present a high level overview of how the online fsck process
works and how it is tested to ensure correct functionality.
Part 4 discusses the user interface and the intended usage modes of the new
program.
Parts 5 and 6 show off the high level components and how they fit together, and
then present case studies of how each repair function actually works.
Part 7 sums up what has been discussed so far and speculates about what else
might be built atop online fsck.

.. contents:: Table of Contents
   :local:

1. What is a Filesystem Check?
==============================

A Unix filesystem has four main responsibilities:

- Provide a hierarchy of names through which application programs can associate
  arbitrary blobs of data for any length of time,

- Virtualize physical storage media across those names,

- Retrieve the named data blobs at any time, and

- Examine resource usage.

Metadata directly supporting these functions (e.g. files, directories, space
mappings) are sometimes called primary metadata.
Secondary metadata (e.g. reverse mapping and directory parent pointers) support
operations internal to the filesystem, such as internal consistency checking
and reorganization.
Summary metadata, as the name implies, condense information contained in
primary metadata for performance reasons.

The filesystem check (fsck) tool examines all the metadata in a filesystem
to look for errors.
In addition to looking for obvious metadata corruptions, fsck also
cross-references different types of metadata records with each other to look
for inconsistencies.
People do not like losing data, so most fsck tools also contain some ability
to correct any problems found.
As a word of caution -- the primary goal of most Linux fsck tools is to restore
the filesystem metadata to a consistent state, not to maximize the data
recovered.
That precedent will not be challenged here.

Filesystems of the 20th century generally lacked any redundancy in the ondisk
format, which means that fsck can only respond to errors by erasing files until
errors are no longer detected.
More recent filesystem designs contain enough redundancy in their metadata that
it is now possible to regenerate data structures when non-catastrophic errors
occur; this capability aids both strategies.

+--------------------------------------------------------------------------+
| **Note**:                                                                |
+--------------------------------------------------------------------------+
| System administrators avoid data loss by increasing the number of        |
| separate storage systems through the creation of backups; and they avoid |
| downtime by increasing the redundancy of each storage system through the |
| creation of RAID arrays.                                                 |
| fsck tools address only the first problem.                               |
+--------------------------------------------------------------------------+

TLDR; Show Me the Code!
-----------------------

Code is posted to the kernel.org git trees as follows:
`kernel changes <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-symlink>`_,
`userspace changes <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-media-scan-service>`_, and
`QA test changes <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=repair-dirs>`_.
Each kernel patchset adding an online repair function will use the same branch
name across the kernel, xfsprogs, and fstests git repos.

Existing Tools
--------------

The online fsck tool described here will be the third tool in the history of
XFS (on Linux) to check and repair filesystems.
Two programs precede it:

The first program, ``xfs_check``, was created as part of the XFS debugger
(``xfs_db``) and can only be used with unmounted filesystems.
It walks all metadata in the filesystem looking for inconsistencies in the
metadata, though it lacks any ability to repair what it finds.
Due to its high memory requirements and inability to repair things, this
program is now deprecated and will not be discussed further.

The second program, ``xfs_repair``, was created to be faster and more robust
than the first program.
Like its predecessor, it can only be used with unmounted filesystems.
It uses extent-based in-memory data structures to reduce memory consumption,
and tries to schedule readahead IO appropriately to reduce I/O waiting time
while it scans the metadata of the entire filesystem.
The most important feature of this tool is its ability to respond to
inconsistencies in file metadata and the directory tree by erasing things as
needed to eliminate problems.
Space usage metadata are rebuilt from the observed file metadata.

Problem Statement
-----------------

The current XFS tools leave several problems unsolved:

1. **User programs** suddenly **lose access** to the filesystem when unexpected
   shutdowns occur as a result of silent corruptions in the metadata.
   These occur **unpredictably** and often without warning.

2. **Users** experience a **total loss of service** during the recovery period
   after an **unexpected shutdown** occurs.

3. **Users** experience a **total loss of service** if the filesystem is taken
   offline to **look for problems** proactively.

4. **Data owners** cannot **check the integrity** of their stored data without
   reading all of it.
   This may expose them to substantial billing costs when a linear media scan
   performed by the storage system administrator might suffice.

5. **System administrators** cannot **schedule** a maintenance window to deal
   with corruptions if they **lack the means** to assess filesystem health
   while the filesystem is online.

6. **Fleet monitoring tools** cannot **automate periodic checks** of filesystem
   health when doing so requires **manual intervention** and downtime.

7. **Users** can be tricked into **doing things they do not desire** when
   malicious actors **exploit quirks of Unicode** to place misleading names
   in directories.

Given this definition of the problems to be solved and the actors who would
benefit, the proposed solution is a third fsck tool that acts on a running
filesystem.

This new third program has three components: an in-kernel facility to check
metadata, an in-kernel facility to repair metadata, and a userspace driver
program to drive fsck activity on a live filesystem.
``xfs_scrub`` is the name of the driver program.
The rest of this document presents the goals and use cases of the new fsck
tool, describes its major design points in connection to those goals, and
discusses the similarities and differences with existing tools.

+--------------------------------------------------------------------------+
| **Note**:                                                                |
+--------------------------------------------------------------------------+
| Throughout this document, the existing offline fsck tool can also be     |
| referred to by its current name "``xfs_repair``".                        |
| The userspace driver program for the new online fsck tool can be         |
| referred to as "``xfs_scrub``".                                          |
| The kernel portion of online fsck that validates metadata is called      |
| "online scrub", and the kernel portion that fixes metadata is called     |
| "online repair".                                                         |
+--------------------------------------------------------------------------+

The naming hierarchy is broken up into objects known as directories and files
and the physical space is split into pieces known as allocation groups.
Sharding enables better performance on highly parallel systems and helps to
contain the damage when corruptions occur.
The division of the filesystem into principal objects (allocation groups and
inodes) means that there are ample opportunities to perform targeted checks and
repairs on a subset of the filesystem.

While this is going on, other parts continue processing IO requests.
Even if a piece of filesystem metadata can only be regenerated by scanning the
entire system, the scan can still be done in the background while other file
operations continue.

In summary, online fsck takes advantage of resource sharding and redundant
metadata to enable targeted checking and repair operations while the system
is running.
This capability will be coupled to automatic system management so that
autonomous self-healing of XFS maximizes service availability.

2. Theory of Operation
======================

Because it is necessary for online fsck to lock and scan live metadata objects,
online fsck consists of three separate code components.
The first is the userspace driver program ``xfs_scrub``, which is responsible
for identifying individual metadata items, scheduling work items for them,
reacting to the outcomes appropriately, and reporting results to the system
administrator.
The second and third are in the kernel, which implements functions to check
and repair each type of online fsck work item.

+------------------------------------------------------------------+
| **Note**:                                                        |
+------------------------------------------------------------------+
| For brevity, this document shortens the phrase "online fsck work |
| item" to "scrub item".                                           |
+------------------------------------------------------------------+

Scrub item types are delineated in a manner consistent with the Unix design
philosophy, which is to say that each item should handle one aspect of a
metadata structure, and handle it well.

Scope
-----

In principle, online fsck should be able to check and to repair everything that
the offline fsck program can handle.
However, online fsck cannot be running 100% of the time, which means that
latent errors may creep in after a scrub completes.
If these errors cause the next mount to fail, offline fsck is the only
solution.
This limitation means that maintenance of the offline fsck tool will continue.
A second limitation of online fsck is that it must follow the same resource
sharing and lock acquisition rules as the regular filesystem.
This means that scrub cannot take *any* shortcuts to save time, because doing
so could lead to concurrency problems.
In other words, online fsck is not a complete replacement for offline fsck, and
a complete run of online fsck may take longer than offline fsck.
However, both of these limitations are acceptable tradeoffs to satisfy the
different motivations of online fsck, which are to **minimize system downtime**
and to **increase predictability of operation**.

.. _scrubphases:

Phases of Work
--------------

The userspace driver program ``xfs_scrub`` splits the work of checking and
repairing an entire filesystem into seven phases.
Each phase concentrates on checking specific types of scrub items and depends
on the success of all previous phases.
The seven phases are as follows:

1. Collect geometry information about the mounted filesystem and computer,
   discover the online fsck capabilities of the kernel, and open the
   underlying storage devices.

2. Check allocation group metadata, all realtime volume metadata, and all quota
   files.
   Each metadata structure is scheduled as a separate scrub item.
   If corruption is found in the inode header or inode btree and ``xfs_scrub``
   is permitted to perform repairs, then those scrub items are repaired to
   prepare for phase 3.
   Repairs are implemented by using the information in the scrub item to
   resubmit the kernel scrub call with the repair flag enabled; this is
   discussed in the next section.
   Optimizations and all other repairs are deferred to phase 4.

3. Check all metadata of every file in the filesystem.
   Each metadata structure is also scheduled as a separate scrub item.
   If repairs are needed and ``xfs_scrub`` is permitted to perform repairs,
   and there were no problems detected during phase 2, then those scrub items
   are repaired immediately.
   Optimizations, deferred repairs, and unsuccessful repairs are deferred to
   phase 4.

4. All remaining repairs and scheduled optimizations are performed during this
   phase, if the caller permits them.
   Before starting repairs, the summary counters are checked and any necessary
   repairs are performed so that subsequent repairs will not fail the resource
   reservation step due to wildly incorrect summary counters.
   Unsuccessful repairs are requeued as long as forward progress on repairs is
   made somewhere in the filesystem.
   Free space in the filesystem is trimmed at the end of phase 4 if the
   filesystem is clean.

5. By the start of this phase, all primary and secondary filesystem metadata
   must be correct.
   Summary counters such as the free space counts and quota resource counts
   are checked and corrected.
   Directory entry names and extended attribute names are checked for
   suspicious entries such as control characters or confusing Unicode sequences
   appearing in names.

6. If the caller asks for a media scan, read all allocated and written data
   file extents in the filesystem.
   The ability to use hardware-assisted data file integrity checking is new
   to online fsck; neither of the previous tools has this capability.
   If media errors occur, they will be mapped to the owning files and reported.

7. Re-check the summary counters and present the caller with a summary of
   space usage and file counts.

This allocation of responsibilities will be :ref:`revisited <scrubcheck>`
later in this document.
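
The requeue rule of phase 4 can be illustrated with a small model.
The sketch below is not taken from ``xfs_scrub``; the function
``drain_repair_queue``, the item names, and the dependency table are all
hypothetical and exist only to show how deferred repairs might be retried until
a full pass over the queue makes no forward progress.

```python
def drain_repair_queue(queue, try_repair):
    """Retry deferred repairs until a full pass repairs nothing.

    queue: hypothetical scrub item names deferred from phases 2 and 3.
    try_repair: callable returning True when the repair succeeded.
    Returns the items that could not be repaired.
    """
    pending = list(queue)
    while pending:
        still_broken = [item for item in pending if not try_repair(item)]
        if len(still_broken) == len(pending):
            # No forward progress anywhere in this pass; stop retrying.
            return still_broken
        pending = still_broken
    return []


# Assumed toy dependencies: an item can only be repaired once the items it
# depends on have been repaired.  "rmapbt" depends on something unfixable.
deps = {"bnobt": set(), "inobt": {"bnobt"}, "dir": {"inobt"},
        "rmapbt": {"missing"}}
fixed = set()


def try_repair(item):
    if deps[item] <= fixed:
        fixed.add(item)
        return True
    return False


leftover = drain_repair_queue(["dir", "inobt", "bnobt", "rmapbt"], try_repair)
```

In this model a repair that fails on one pass may succeed on a later pass
because some other structure was fixed in the meantime, which is why the loop
only stops when an entire pass repairs nothing.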

Steps for Each Scrub Item
-------------------------

The kernel scrub code uses a three-step strategy for checking and repairing
the one aspect of a metadata object represented by a scrub item:

1. The scrub item of interest is checked for corruptions; opportunities for
   optimization; and for values that are directly controlled by the system
   administrator but look suspicious.
   If the item is neither corrupt nor in need of optimization, resources are
   released and the positive scan results are returned to userspace.
   If the item is corrupt or could be optimized but the caller does not permit
   this, resources are released and the negative scan results are returned to
   userspace.
   Otherwise, the kernel moves on to the second step.

2. The repair function is called to rebuild the data structure.
   Repair functions generally choose to rebuild a structure from other metadata
   rather than try to salvage the existing structure.
   If the repair fails, the scan results from the first step are returned to
   userspace.
   Otherwise, the kernel moves on to the third step.

3. In the third step, the kernel runs the same checks over the new metadata
   item to assess the efficacy of the repairs.
   The results of the reassessment are returned to userspace.
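
The control flow of these three steps can be written down as a short model.
This is a sketch under stated assumptions and not the kernel implementation:
``ScanResult``, ``scrub_item``, and the ``check`` and ``repair`` callables are
hypothetical names, and resource acquisition and release are left out.

```python
from dataclasses import dataclass


@dataclass
class ScanResult:
    corrupt: bool = False
    could_optimize: bool = False


def scrub_item(check, repair, allow_repair):
    """Model of the check, repair, re-check sequence for one scrub item."""
    # Step 1: examine the item.
    first = check()
    if not (first.corrupt or first.could_optimize):
        return first          # positive results go back to the caller
    if not allow_repair:
        return first          # negative results; the caller forbade repairs

    # Step 2: rebuild the structure; on failure report the step 1 results.
    if not repair():
        return first

    # Step 3: run the same checks over the new structure.
    return check()


# Toy metadata object with one flag standing in for its health.
state = {"corrupt": True}


def check():
    return ScanResult(corrupt=state["corrupt"])


def repair():
    state["corrupt"] = False
    return True


report_only = scrub_item(check, repair, allow_repair=False)
after_repair = scrub_item(check, repair, allow_repair=True)
```

The first call only reports the problem because repairs were not permitted;
the second call repairs the toy object and returns the results of the
reassessment.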

Classification of Metadata
--------------------------

Each type of metadata object (and therefore each type of scrub item) is
classified as follows:

Primary Metadata
````````````````

Metadata structures in this category should be most familiar to filesystem
users either because they are directly created by the user or they index
objects created by the user.
Most filesystem objects fall into this class:

- Free space and reference count information

- Inode records and indexes

- Storage mapping information for file data

- Directories

- Extended attributes

- Symbolic links

- Quota limits

Scrub obeys the same rules as regular filesystem accesses for resource and lock
acquisition.

Primary metadata objects are the simplest for scrub to process.
The principal filesystem object (either an allocation group or an inode) that
owns the item being scrubbed is locked to guard against concurrent updates.
The check function examines every record associated with the type for obvious
errors and cross-references healthy records against other metadata to look for
inconsistencies.
Repairs for this class of scrub item are simple, since the repair function
starts by holding all the resources acquired in the previous step.
The repair function scans available metadata as needed to record all the
observations needed to complete the structure.
Next, it stages the observations in a new ondisk structure and commits it
atomically to complete the repair.
Finally, the storage from the old data structure is carefully reaped.

Because ``xfs_scrub`` locks a primary object for the duration of the repair,
this is effectively an offline repair operation performed on a subset of the
filesystem.
This minimizes the complexity of the repair code because it is not necessary to
handle concurrent updates from other threads, nor is it necessary to access
any other part of the filesystem.
As a result, indexed structures can be rebuilt very quickly, and programs
trying to access the damaged structure will be blocked until repairs complete.
The only infrastructure needed by the repair code is a staging area for
observations and a means to write new structures to disk.
Despite these limitations, the advantage that online repair holds is clear:
targeted work on individual shards of the filesystem avoids total loss of
service.

This mechanism is described in section 2.1 ("Off-Line Algorithm") of
V. Srinivasan and M. J. Carey, `"Performance of On-Line Index Construction
Algorithms" <https://minds.wisconsin.edu/bitstream/handle/1793/59524/TR1047.pdf>`_,
*Extending Database Technology*, pp. 293-309, 1992.

Most primary metadata repair functions stage their intermediate results in an
in-memory array prior to formatting the new ondisk structure, which is very
similar to the list-based algorithm discussed in section 2.3 ("List-Based
Algorithms") of Srinivasan.
However, any data structure builder that maintains a resource lock for the
duration of the repair is *always* an offline algorithm.
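
The lock, scan, stage, commit, and reap sequence described above can be
sketched as a toy model.
Everything in it is an assumption made for illustration: the
``AllocationGroup`` class, its fields, and the choice of deriving free space
from the gaps between used extents are hypothetical, and a single reference
swap stands in for the atomic commit of a new ondisk structure.

```python
import threading


class AllocationGroup:
    """Hypothetical stand-in for one shard of the filesystem."""

    def __init__(self, size, used_extents, free_index):
        self.lock = threading.Lock()
        self.size = size                  # number of blocks in the shard
        self.used_extents = used_extents  # trusted records: (start, length)
        self.free_index = free_index      # suspect structure to rebuild


def rebuild_free_index(ag):
    """Rebuild ag.free_index from ag.used_extents; return the old index."""
    with ag.lock:  # held for the whole repair, so this is an offline algorithm
        staged = []  # in-memory array of observations
        cursor = 0
        for start, length in sorted(ag.used_extents):
            if start > cursor:
                staged.append((cursor, start - cursor))
            cursor = max(cursor, start + length)
        if cursor < ag.size:
            staged.append((cursor, ag.size - cursor))
        old_index = ag.free_index
        ag.free_index = staged  # one reference swap models the atomic commit
    return old_index  # the caller reaps the storage of the old structure


ag = AllocationGroup(100, [(10, 5), (0, 4), (50, 50)], free_index=[(0, 100)])
old_index = rebuild_free_index(ag)
```

Because the lock is held from the first observation to the commit, no
concurrent update can invalidate the staged array, which is what keeps this
class of repair simple.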
41888757e04SDarrick J. Wong
4195f658dadSDarrick J. Wong.. _secondary_metadata:
4205f658dadSDarrick J. Wong
42188757e04SDarrick J. WongSecondary Metadata
42288757e04SDarrick J. Wong``````````````````
42388757e04SDarrick J. Wong
42488757e04SDarrick J. WongMetadata structures in this category reflect records found in primary metadata,
42588757e04SDarrick J. Wongbut are only needed for online fsck or for reorganization of the filesystem.
42688757e04SDarrick J. Wong
Secondary metadata include:

- Reverse mapping information

- Directory parent pointers

This class of metadata is difficult for scrub to process because scrub attaches
to the secondary object but needs to check primary metadata, which runs counter
to the usual order of resource acquisition.
Frequently, this means that full filesystem scans are necessary to rebuild the
metadata.
Check functions can be limited in scope to reduce runtime.
Repairs, however, require a full scan of primary metadata, which can take a
long time to complete.
Under these conditions, ``xfs_scrub`` cannot lock resources for the entire
duration of the repair.

Instead, repair functions set up an in-memory staging structure to store
observations.
Depending on the requirements of the specific repair function, the staging
index will either have the same format as the ondisk structure or a design
specific to that repair function.
The next step is to release all locks and start the filesystem scan.
When the repair scanner needs to record an observation, the staging data are
locked long enough to apply the update.
While the filesystem scan is in progress, the repair function hooks the
filesystem so that it can apply pending filesystem updates to the staging
information.
Once the scan is done, the owning object is re-locked, the live data is used to
write a new ondisk structure, and the repairs are committed atomically.
The hooks are disabled and the staging area is freed.
Finally, the storage from the old data structure is carefully reaped.

Introducing concurrency helps online repair avoid various locking problems, but
comes at a high cost to code complexity.
Live filesystem code has to be hooked so that the repair function can observe
updates in progress.
The staging area has to become a fully functional parallel structure so that
updates can be merged from the hooks.
Finally, the hook, the filesystem scan, and the inode locking model must be
sufficiently well integrated that a hook event can decide if a given update
should be applied to the staging structure.

In theory, the scrub implementation could apply these same techniques for
primary metadata, but doing so would make it massively more complex and less
performant.
Programs attempting to access the damaged structures are not blocked from
operation, which may cause application failure or an unplanned filesystem
shutdown.

Inspiration for the secondary metadata repair strategy was drawn from section
2.4 of Srinivasan above, and sections 2 ("NSF: Index Build Without Side-File")
and 3.1.1 ("Duplicate Key Insert Problem") in C. Mohan, `"Algorithms for
Creating Indexes for Very Large Tables Without Quiescing Updates"
<https://dl.acm.org/doi/10.1145/130283.130337>`_, 1992.

The sidecar index mentioned above bears some resemblance to the side file
method mentioned in Srinivasan and Mohan.
Their method consists of an index builder that extracts relevant record data to
build the new structure as quickly as possible; and an auxiliary structure that
captures all updates that would be committed to the index by other threads were
the new index already online.
After the index building scan finishes, the updates recorded in the side file
are applied to the new index.
To avoid conflicts between the index builder and other writer threads, the
builder maintains a publicly visible cursor that tracks the progress of the
scan through the record space.
To avoid duplication of work between the side file and the index builder, side
file updates are elided when the record ID for the update is greater than the
cursor position within the record ID space.

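As a rough illustration of the cursor rule, the side file method can be modeled
in a few lines of Python.
This is a sketch under stated assumptions rather than code from either paper or
from XFS: the names ``SideFileIndexBuild``, ``scan_step``, ``writer_update``,
and ``finish`` are hypothetical, record IDs are assumed to be visited in
increasing order, and a single lock stands in for whatever concurrency control
a real system would use.

```python
import threading


class SideFileIndexBuild:
    """Toy model of an index build that tolerates concurrent writers."""

    def __init__(self, table):
        self.table = table            # live records: record ID -> key
        self.new_index = {}           # index under construction
        self.side_file = []           # updates captured during the scan
        self.cursor = -1              # highest record ID the scan has consumed
        self.lock = threading.Lock()  # guards table, cursor, and side_file

    def scan_step(self, record_id):
        """Builder: consume one record and advance the public cursor."""
        with self.lock:
            if record_id in self.table:
                self.new_index[record_id] = self.table[record_id]
            self.cursor = record_id

    def writer_update(self, record_id, key):
        """Writer thread: change the table and log the change if needed."""
        with self.lock:
            self.table[record_id] = key
            if record_id > self.cursor:
                # The scan has not reached this record yet and will see the
                # new value on its own, so the side file entry is elided.
                return
            self.side_file.append((record_id, key))

    def finish(self):
        """After the scan completes, replay the side file into the new index."""
        with self.lock:
            for record_id, key in self.side_file:
                self.new_index[record_id] = key
            self.side_file.clear()
        return self.new_index
```

In this model, a writer that touches a record behind the cursor lands in the
side file, while a writer ahead of the cursor is left for the scan to pick up.
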
To minimize changes to the rest of the codebase, XFS online repair keeps the
replacement index hidden until it's completely ready to go.
In other words, there is no attempt to expose the keyspace of the new index
while repair is running.
The complexity of such an approach would be very high and perhaps more
appropriate to building *new* indices.

**Future Work Question**: Can the full scan and live update code used to
facilitate a repair also be used to implement a comprehensive check?

*Answer*: In theory, yes.  Check would be much stronger if each scrub function
employed these live scans to build a shadow copy of the metadata and then
compared the shadow records to the ondisk records.
However, doing that is a fair amount more work than what the checking functions
do now, and that in turn increases the runtime of those scrub functions.
The live scans and hooks were developed much later.

Summary Information
```````````````````

Metadata structures in this last category summarize the contents of primary
metadata records.
These are often used to speed up resource usage queries, and are many times
smaller than the primary metadata which they represent.

Examples of summary information include:

- Summary counts of free space and inodes

- File link counts from directories

- Quota resource usage counts

Check and repair require full filesystem scans, but resource and lock
acquisition follow the same paths as regular filesystem accesses.

The superblock summary counters have special requirements due to the underlying
implementation of the incore counters, and will be treated separately.
Check and repair of the other types of summary counters (quota resource counts
and file link counts) employ the same filesystem scanning and hooking
techniques as outlined above, but because the underlying data are sets of
integer counters, the staging data need not be a fully functional mirror of the
ondisk structure.

Inspiration for quota and file link count repair strategies was drawn from
sections 2.12 ("Online Index Operations") through 2.14 ("Incremental View
Maintenance") of G. Graefe, `"Concurrent Queries and Updates in Summary Views
and Their Indexes"
<http://www.odbms.org/wp-content/uploads/2014/06/Increment-locks.pdf>`_, 2011.

Since quotas are non-negative integer counts of resource usage, online
quotacheck can use the incremental view deltas described in section 2.14 to
track pending changes to the block and inode usage counts in each transaction,
and commit those changes to a dquot side file when the transaction commits.
Delta tracking is necessary for dquots because the index builder scans inodes,
whereas the data structure being rebuilt is an index of dquots.
Link count checking combines the view deltas and commit step into one because
it sets attributes of the objects being scanned instead of writing them to a
separate data structure.
Each online fsck function will be discussed as a case study later in this
document.

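The delta tracking idea for quota counts can be sketched as follows.
This is a simplified model, not the XFS implementation: the ``ShadowQuota`` and
``Transaction`` classes and their methods are hypothetical names, dquots are
keyed by an assumed ``(type, id)`` tuple, and the sketch leaves out the decision
of whether the scan has already visited the inode being changed.

```python
from collections import defaultdict


class ShadowQuota:
    """Side file of shadow usage counts, keyed by (quota type, ID)."""

    def __init__(self):
        self.usage = defaultdict(lambda: {"blocks": 0, "inodes": 0})

    def apply(self, key, blocks=0, inodes=0):
        """Add a signed change to the shadow counts for one dquot."""
        counts = self.usage[key]
        counts["blocks"] += blocks
        counts["inodes"] += inodes


class Transaction:
    """Accumulates usage deltas and publishes them only at commit time."""

    def __init__(self, shadow):
        self.shadow = shadow
        self.deltas = defaultdict(lambda: [0, 0])  # key -> [blocks, inodes]

    def mod_usage(self, key, blocks=0, inodes=0):
        """Record a pending change; nothing is visible outside yet."""
        self.deltas[key][0] += blocks
        self.deltas[key][1] += inodes

    def commit(self):
        """Fold every pending delta into the side file."""
        for key, (blocks, inodes) in self.deltas.items():
            self.shadow.apply(key, blocks=blocks, inodes=inodes)
        self.deltas.clear()

    def cancel(self):
        """A cancelled transaction never touches the side file."""
        self.deltas.clear()
```

Because the deltas are plain signed integers, several changes to the same dquot
within one transaction collapse into a single update at commit time.
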
Risk Management
---------------

During the development of online fsck, several risk factors were identified
that may make the feature unsuitable for certain distributors and users.
Steps can be taken to mitigate or eliminate those risks, though at a cost to
functionality.

- **Decreased performance**: Adding metadata indices to the filesystem
  increases the time cost of persisting changes to disk, and the reverse space
  mapping and directory parent pointers are no exception.
  System administrators who require the maximum performance can disable the
  reverse mapping features at format time, though this choice dramatically
  reduces the ability of online fsck to find inconsistencies and repair them.

- **Incorrect repairs**: As with all software, there might be defects in the
  software that result in incorrect repairs being written to the filesystem.
  Systematic fuzz testing (detailed in the next section) is employed by the
  authors to find bugs early, but it might not catch everything.
  The kernel build system provides Kconfig options (``CONFIG_XFS_ONLINE_SCRUB``
  and ``CONFIG_XFS_ONLINE_REPAIR``) to enable distributors to choose not to
  accept this risk.
  The xfsprogs build system has a configure option (``--enable-scrub=no``) that
  disables building of the ``xfs_scrub`` binary, though this is not a risk
  mitigation if the kernel functionality remains enabled.

- **Inability to repair**: Sometimes, a filesystem is too badly damaged to be
  repairable.
  If the keyspaces of several metadata indices overlap in some manner but a
  coherent narrative cannot be formed from records collected, then the repair
  fails.
  To reduce the chance that a repair will fail with a dirty transaction and
  render the filesystem unusable, the online repair functions have been
  designed to stage and validate all new records before committing the new
  structure.

- **Misbehavior**: Online fsck requires many privileges -- raw IO to block
  devices, opening files by handle, ignoring Unix discretionary access control,
  and the ability to perform administrative changes.
  Running this automatically in the background scares people, so the systemd
  background service is configured to run with only the privileges required.
  Obviously, this cannot address certain problems like the kernel crashing or
  deadlocking, but it should be sufficient to prevent the scrub process from
  escaping and reconfiguring the system.
  The cron job does not have this protection.

- **Fuzz Kiddiez**: There are many people now who seem to think that running
  automated fuzz testing of ondisk artifacts to find mischievous behavior and
  spraying exploit code onto the public mailing list for instant zero-day
  disclosure is somehow of some social benefit.
  In the view of this author, the benefit is realized only when the fuzz
  operators help to **fix** the flaws, but this opinion apparently is not
  widely shared among security "researchers".
  The XFS maintainers' continuing ability to manage these events presents an
  ongoing risk to the stability of the development process.
  Automated testing should front-load some of the risk while the feature is
  considered EXPERIMENTAL.

Many of these risks are inherent to software programming.
Despite this, it is hoped that this new functionality will prove useful in
reducing unexpected downtime.

3. Testing Plan
===============

As stated before, fsck tools have three main goals:

1. Detect inconsistencies in the metadata;

2. Eliminate those inconsistencies; and

3. Minimize further loss of data.

Demonstrations of correct operation are necessary to build users' confidence
that the software behaves within expectations.
Unfortunately, it was not really feasible to perform regular exhaustive testing
of every aspect of a fsck tool until the introduction of low-cost virtual
machines with high-IOPS storage.
With ample hardware availability in mind, the testing strategy for the online
fsck project involves differential analysis against the existing fsck tools and
systematic testing of every attribute of every type of metadata object.
Testing can be split into four major categories, as discussed below.

Integrated Testing with fstests
-------------------------------

The primary goal of any free software QA effort is to make testing as
inexpensive and widespread as possible to maximize the scaling advantages of
community.
In other words, testing should maximize the breadth of filesystem configuration
scenarios and hardware setups.
This improves code quality by enabling the authors of online fsck to find and
fix bugs early, and helps developers of new features to find integration
issues earlier in their development effort.

The Linux filesystem community shares a common QA testing suite,
`fstests <https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/>`_, for
functional and regression testing.
Even before development work began on online fsck, fstests (when run on XFS)
would run both the ``xfs_check`` and ``xfs_repair -n`` commands on the test and
scratch filesystems between each test.
This provides a level of assurance that the kernel and the fsck tools stay in
alignment about what constitutes consistent metadata.
During development of the online checking code, fstests was modified to run
``xfs_scrub -n`` between each test to ensure that the new checking code
produces the same results as the two existing fsck tools.

To start development of online repair, fstests was modified to run
``xfs_repair`` to rebuild the filesystem's metadata indices between tests.
This ensures that offline repair does not crash, leave a corrupt filesystem
after it exits, or trigger complaints from the online check.
This also established a baseline for what can and cannot be repaired offline.
To complete the first phase of development of online repair, fstests was
modified to be able to run ``xfs_scrub`` in a "force rebuild" mode.
This enables a comparison of the effectiveness of online repair as compared to
the existing offline repair tools.

General Fuzz Testing of Metadata Blocks
---------------------------------------

XFS benefits greatly from having a very robust debugging tool, ``xfs_db``.

Before development of online fsck even began, a set of fstests were created
to test the rather common fault that entire metadata blocks get corrupted.
This required the creation of fstests library code that can create a filesystem
containing every possible type of metadata object.
Next, individual test cases were created to create a test filesystem, identify
a single block of a specific type of metadata object, trash it with the
existing ``blocktrash`` command in ``xfs_db``, and test the reaction of a
particular metadata validation strategy.

This earlier test suite enabled XFS developers to test the ability of the
in-kernel validation functions and the ability of the offline fsck tool to
detect and eliminate the inconsistent metadata.
This part of the test suite was extended to cover online fsck in exactly the
same manner.

In other words, for a given fstests filesystem configuration:

* For each metadata object existing on the filesystem:

  * Write garbage to it

  * Test the reactions of:

    1. The kernel verifiers to stop obviously bad metadata
    2. Offline repair (``xfs_repair``) to detect and fix
    3. Online repair (``xfs_scrub``) to detect and fix

Targeted Fuzz Testing of Metadata Records
-----------------------------------------

The testing plan for online fsck includes extending the existing fs testing
infrastructure to provide a much more powerful facility: targeted fuzz testing
of every metadata field of every metadata object in the filesystem.
``xfs_db`` can modify every field of every metadata structure in every
block in the filesystem to simulate the effects of memory corruption and
software bugs.
Given that fstests already contains the ability to create a filesystem
containing every metadata format known to the filesystem, ``xfs_db`` can be
used to perform exhaustive fuzz testing!

For a given fstests filesystem configuration:

* For each metadata object existing on the filesystem...

  * For each record inside that metadata object...

    * For each field inside that record...

      * For each conceivable type of transformation that can be applied to a bit field...

        1. Clear all bits
        2. Set all bits
        3. Toggle the most significant bit
        4. Toggle the middle bit
        5. Toggle the least significant bit
        6. Add a small quantity
        7. Subtract a small quantity
        8. Randomize the contents

        * ...test the reactions of:

          1. The kernel verifiers to stop obviously bad metadata
          2. Offline checking (``xfs_repair -n``)
          3. Offline repair (``xfs_repair``)
          4. Online checking (``xfs_scrub -n``)
          5. Online repair (``xfs_scrub``)
          6. Both repair tools (``xfs_scrub`` and then ``xfs_repair`` if online repair doesn't succeed)

This is quite the combinatoric explosion!

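To make the eight transformations concrete, here is a minimal Python sketch
that applies each of them to an unsigned bit field of a given width.
It is an illustration only and does not reproduce how ``xfs_db`` or fstests
implement fuzzing: the function name ``fuzz_values``, the dictionary labels,
the choice of ``width // 2`` as the "middle" bit, and the default "small
quantity" of 3 are all assumptions made for this example.

```python
import random


def fuzz_values(value, width, rng=None, delta=3):
    """Return {label: fuzzed value} for an unsigned field of `width` bits."""
    rng = rng or random.Random()
    mask = (1 << width) - 1  # keeps every result inside the field
    return {
        "clear_all": 0,
        "set_all": mask,
        "toggle_msb": value ^ (1 << (width - 1)),
        "toggle_middle": value ^ (1 << (width // 2)),
        "toggle_lsb": value ^ 1,
        "add_small": (value + delta) & mask,  # wraps around on overflow
        "sub_small": (value - delta) & mask,  # wraps around on underflow
        "randomize": rng.getrandbits(width),
    }
```

Every field in every record yields eight corrupted variants, and each variant
is then fed to all six checkers listed above, which is where the explosion in
test count comes from.
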
Fortunately, having this much test coverage makes it easy for XFS developers to
check the responses of XFS' fsck tools.
Since the introduction of the fuzz testing framework, these tests have been
used to discover incorrect repair code and missing functionality for entire
classes of metadata objects in ``xfs_repair``.
The enhanced testing was used to finalize the deprecation of ``xfs_check`` by
confirming that ``xfs_repair`` could detect at least as many corruptions as
the older tool.

These tests have been very valuable for ``xfs_scrub`` in the same ways -- they
allow the online fsck developers to compare online fsck against offline fsck,
and they enable XFS developers to find deficiencies in the code base.

Proposed patchsets include
`general fuzzer improvements
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=fuzzer-improvements>`_,
`fuzzing baselines
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=fuzz-baseline>`_,
and `improvements in fuzz testing comprehensiveness
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=more-fuzz-testing>`_.

Stress Testing
--------------

A unique requirement to online fsck is the ability to operate on a filesystem
concurrently with regular workloads.
Although it is of course impossible to run ``xfs_scrub`` with *zero* observable
impact on the running system, the online repair code should never introduce
inconsistencies into the filesystem metadata, and regular workloads should
never notice resource starvation.
To verify that these conditions are being met, fstests has been enhanced in
the following ways:

* For each scrub item type, create a test to exercise checking that item type
  while running ``fsstress``.
* For each scrub item type, create a test to exercise repairing that item type
  while running ``fsstress``.
* Race ``fsstress`` and ``xfs_scrub -n`` to ensure that checking the whole
  filesystem doesn't cause problems.
* Race ``fsstress`` and ``xfs_scrub`` in force-rebuild mode to ensure that
  force-repairing the whole filesystem doesn't cause problems.
* Race ``xfs_scrub`` in check and force-repair mode against ``fsstress`` while
  freezing and thawing the filesystem.
* Race ``xfs_scrub`` in check and force-repair mode against ``fsstress`` while
  remounting the filesystem read-only and read-write.
* The same, but running ``fsx`` instead of ``fsstress``.  (Not done yet?)

Success is defined by the ability to run all of these tests without observing
any unexpected filesystem shutdowns due to corrupted metadata, kernel hang
check warnings, or any other sort of mischief.

Proposed patchsets include `general stress testing
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=race-scrub-and-mount-state-changes>`_
and the `evolution of existing per-function stress testing
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfstests-dev.git/log/?h=refactor-scrub-stress>`_.

4. User Interface
=================

The primary user of online fsck is the system administrator, just like offline
repair.
Online fsck presents two modes of operation to administrators:
a foreground CLI process for online fsck on demand, and a background service
that performs autonomous checking and repair.

Checking on Demand
------------------

For administrators who want the absolute freshest information about the
metadata in a filesystem, ``xfs_scrub`` can be run as a foreground process on
a command line.
The program checks every piece of metadata in the filesystem while the
administrator waits for the results to be reported, just like the existing
``xfs_repair`` tool.
Both tools share a ``-n`` option to perform a read-only scan, and a ``-v``
option to increase the verbosity of the information reported.

A new feature of ``xfs_scrub`` is the ``-x`` option, which employs the error
correction capabilities of the hardware to check data file contents.
The media scan is not enabled by default because it may dramatically increase
program runtime and consume a lot of bandwidth on older storage hardware.

The output of a foreground invocation is captured in the system log.

The ``xfs_scrub_all`` program walks the list of mounted filesystems and
initiates ``xfs_scrub`` for each of them in parallel.
It serializes scans for any filesystems that resolve to the same top level
kernel block device to prevent resource overconsumption.

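The scheduling policy described above (parallel across devices, serial within a
device) can be sketched as follows.
This is not the ``xfs_scrub_all`` source; the function ``scrub_all``, its
``(mountpoint, device)`` input pairs, and the ``scrub_one`` callback are
hypothetical stand-ins, and resolving a mount to its top level block device is
assumed to have been done by the caller.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def scrub_all(mounts, scrub_one):
    """Scrub every mount; serialize mounts that share a top level device.

    mounts: iterable of (mountpoint, top_level_device) pairs.
    scrub_one: callable invoked once with each mountpoint.
    Returns {device: [result of scrub_one for each mountpoint, in order]}.
    """
    by_device = defaultdict(list)
    for mountpoint, device in mounts:
        by_device[device].append(mountpoint)
    if not by_device:
        return {}

    def scrub_device(mountpoints):
        # One worker per device, so these scans run one after another.
        return [scrub_one(m) for m in mountpoints]

    # Different devices get their own worker and proceed in parallel.
    with ThreadPoolExecutor(max_workers=len(by_device)) as pool:
        futures = {
            device: pool.submit(scrub_device, mountpoints)
            for device, mountpoints in by_device.items()
        }
        return {device: future.result() for device, future in futures.items()}
```

Grouping by device before dispatching keeps two scans from competing for the
same underlying storage while still letting independent devices make progress
at the same time.
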
Background Service
------------------

To reduce the workload of system administrators, the ``xfs_scrub`` package
provides a suite of `systemd <https://systemd.io/>`_ timers and services that
run online fsck automatically on weekends by default.
The background service configures scrub to run with as little privilege as
possible, the lowest CPU and IO priority, and in a CPU-constrained single
threaded mode.
This can be tuned by the systemd administrator at any time to suit the latency
and throughput requirements of customer workloads.

The output of the background service is also captured in the system log.
If desired, reports of failures (either due to inconsistencies or mere runtime
errors) can be emailed automatically by setting the ``EMAIL_ADDR`` environment
variable in the following service files:

* ``xfs_scrub_fail@.service``
* ``xfs_scrub_media_fail@.service``
* ``xfs_scrub_all_fail.service``

The decision to enable the background scan is left to the system administrator.
This can be done by enabling either of the following services:

* ``xfs_scrub_all.timer`` on systemd systems
* ``xfs_scrub_all.cron`` on non-systemd systems

This automatic weekly scan is configured out of the box to perform an
additional media scan of all file data once per month.
This is less foolproof than, say, storing file data block checksums, but much
more performant if application software provides its own integrity checking,
redundancy can be provided elsewhere above the filesystem, or the storage
device's integrity guarantees are deemed sufficient.

8764f7f6469SDarrick J. WongThe systemd unit file definitions have been subjected to a security audit
8774f7f6469SDarrick J. Wong(as of systemd 249) to ensure that the xfs_scrub processes have as little
8784f7f6469SDarrick J. Wongaccess to the rest of the system as possible.
8794f7f6469SDarrick J. WongThis was performed via ``systemd-analyze security``, after which privileges
8804f7f6469SDarrick J. Wongwere restricted to the minimum required, sandboxing was set up to the maximal
8814f7f6469SDarrick J. Wongextent possible with sandboxing and system call filtering; and access to the
8824f7f6469SDarrick J. Wongfilesystem tree was restricted to the minimum needed to start the program and
8834f7f6469SDarrick J. Wongaccess the filesystem being scanned.
8844f7f6469SDarrick J. WongThe service definition files restrict CPU usage to 80% of one CPU core, and
8854f7f6469SDarrick J. Wongapply as nice of a priority to IO and CPU scheduling as possible.
8864f7f6469SDarrick J. WongThis measure was taken to minimize delays in the rest of the filesystem.
8874f7f6469SDarrick J. WongNo such hardening has been performed for the cron job.

Proposed patchset:
`Enabling the xfs_scrub background service
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-media-scan-service>`_.

Health Reporting
----------------

XFS caches a summary of each filesystem's health status in memory.
The information is updated whenever ``xfs_scrub`` is run, or whenever
inconsistencies are detected in the filesystem metadata during regular
operations.
System administrators should use the ``health`` command of ``xfs_spaceman`` to
retrieve this information in a human-readable format.
If problems have been observed, the administrator can schedule a reduced
service window to run the online repair tool to correct the problem.
Failing that, the administrator can decide to schedule a maintenance window to
run the traditional offline repair tool to correct the problem.

**Future Work Question**: Should the health reporting integrate with the new
inotify fs error notification system?
Would it be helpful for sysadmins to have a daemon to listen for corruption
notifications and initiate a repair?

*Answer*: These questions remain unanswered, but should be a part of the
conversation with early adopters and potential downstream users of XFS.

Proposed patchsets include
`wiring up health reports to correction returns
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=corruption-health-reports>`_
and
`preservation of sickness info during memory reclaim
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=indirect-health-reporting>`_.

5. Kernel Algorithms and Data Structures
========================================

This section discusses the key algorithms and data structures of the kernel
code that provide the ability to check and repair metadata while the system
is running.
The first chapters in this section reveal the pieces that provide the
foundation for checking metadata.
The remainder of this section presents the mechanisms through which XFS
regenerates itself.

Self Describing Metadata
------------------------

Starting with XFS version 5 in 2012, XFS updated the format of nearly every
ondisk block header to record a magic number, a checksum, a universally
"unique" identifier (UUID), an owner code, the ondisk address of the block,
and a log sequence number.
When loading a block buffer from disk, the magic number, UUID, owner, and
ondisk address confirm that the retrieved block matches the specific owner of
the current filesystem, and that the information contained in the block is
supposed to be found at the ondisk address.
The first three components enable checking tools to disregard alleged metadata
that doesn't belong to the filesystem, and the fourth component enables the
filesystem to detect lost writes.
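
The identity checks described above can be sketched roughly as follows.
This is a minimal illustration and not the kernel's verifier code: the
``demo_blk_hdr`` layout, the ``demo_expect`` structure, and the
``demo_verify_hdr`` function are hypothetical names invented for this example,
and real XFS block headers differ from one metadata type to another.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified block header; not an actual XFS ondisk format. */
struct demo_blk_hdr {
	uint32_t magic;     /* identifies the type of metadata */
	uint8_t  uuid[16];  /* filesystem that wrote the block */
	uint64_t owner;     /* structure or inode that owns the block */
	uint64_t blkno;     /* ondisk address recorded at write time */
};

/* What the reader expects to find at the address it just read. */
struct demo_expect {
	uint32_t magic;
	uint8_t  uuid[16];
	uint64_t owner;
	uint64_t blkno;
};

enum demo_verdict {
	DEMO_OK,
	DEMO_NOT_OURS,       /* magic, UUID, or owner does not match */
	DEMO_WRONG_ADDRESS,  /* block claims to live somewhere else */
};

static enum demo_verdict
demo_verify_hdr(const struct demo_blk_hdr *hdr, const struct demo_expect *exp)
{
	/* First three components: does this block belong to us at all? */
	if (hdr->magic != exp->magic)
		return DEMO_NOT_OURS;
	if (memcmp(hdr->uuid, exp->uuid, sizeof(hdr->uuid)) != 0)
		return DEMO_NOT_OURS;
	if (hdr->owner != exp->owner)
		return DEMO_NOT_OURS;

	/* Fourth component: was the block read from where it was written? */
	if (hdr->blkno != exp->blkno)
		return DEMO_WRONG_ADDRESS;

	return DEMO_OK;
}
```

The checksum is deliberately left out of this sketch; it would be verified
over the whole block before any of the header fields are trusted.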

Whenever a filesystem operation modifies a block, the change is submitted
to the log as part of a transaction.
The log then processes these transactions, marking them done once they are
safely persisted to storage.
The logging code maintains the checksum and the log sequence number of the last
transactional update.
Checksums are useful for detecting torn writes and other discrepancies that can
be introduced between the computer and its storage devices.
Sequence number tracking enables log recovery to avoid applying out of date
log updates to the filesystem.
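
The sequence number comparison mentioned above might look something like the
following sketch.
The ``demo_lsn`` and ``demo_should_replay`` helpers are hypothetical, and the
encoding (log cycle in the upper 32 bits, log block in the lower 32 bits) is an
assumption made so that later log positions compare as larger integers.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Assumed encoding for this sketch: the log cycle occupies the high 32 bits
 * and the block within the log occupies the low 32 bits.
 */
static uint64_t demo_lsn(uint32_t cycle, uint32_t block)
{
	return ((uint64_t)cycle << 32) | block;
}

/*
 * Decide whether a logged update should be applied during recovery.
 * If the block on disk already carries a sequence number at or beyond the
 * logged update, the update is out of date and must be skipped.
 */
static bool demo_should_replay(uint64_t block_lsn, uint64_t item_lsn)
{
	return item_lsn > block_lsn;
}
```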

These two features improve overall runtime resiliency by providing a means for
the filesystem to detect obvious corruption when reading metadata blocks from
disk, but these buffer verifiers cannot provide any consistency checking
between metadata structures.

For more information, please see
Documentation/filesystems/xfs-self-describing-metadata.rst.

Reverse Mapping
---------------

The original design of XFS (circa 1993) is an improvement upon 1980s Unix
filesystem design.
In those days, storage density was expensive, CPU time was scarce, and
excessive seek time could kill performance.
For performance reasons, filesystem authors were reluctant to add redundancy to
the filesystem, even at the cost of data integrity.
Filesystem designers in the early 21st century chose different strategies to
increase internal redundancy -- either storing nearly identical copies of
metadata, or more space-efficient encoding techniques.

For XFS, a different redundancy strategy was chosen to modernize the design:
a secondary space usage index that maps allocated disk extents back to their
owners.
By adding a new index, the filesystem retains most of its ability to scale
well to heavily threaded workloads involving large datasets, since the primary
file metadata (the directory tree, the file block map, and the allocation
groups) remain unchanged.
Like any system that improves redundancy, the reverse-mapping feature increases
overhead costs for space mapping activities.
However, it has two critical advantages: first, the reverse index is key to
enabling online fsck and other requested functionality such as free space
defragmentation, better media failure reporting, and filesystem shrinking.
Second, the different ondisk storage format of the reverse mapping btree
defeats device-level deduplication because the filesystem requires real
redundancy.

+--------------------------------------------------------------------------+
| **Sidebar**:                                                             |
+--------------------------------------------------------------------------+
| A criticism of adding the secondary index is that it does nothing to     |
| improve the robustness of user data storage itself.                      |
| This is a valid point, but adding a new index for file data block        |
| checksums increases write amplification by turning data overwrites into  |
| copy-writes, which age the filesystem prematurely.                       |
| In keeping with thirty years of precedent, users who want file data      |
| integrity can supply as powerful a solution as they require.             |
| As for metadata, the complexity of adding a new secondary index of space |
| usage is much less than adding volume management and storage device      |
| mirroring to XFS itself.                                                 |
| Perfection of RAID and volume management is best left to existing        |
| layers in the kernel.                                                    |
+--------------------------------------------------------------------------+

The information captured in a reverse space mapping record is as follows:

.. code-block:: c

	struct xfs_rmap_irec {
	    xfs_agblock_t    rm_startblock;   /* extent start block */
	    xfs_extlen_t     rm_blockcount;   /* extent length */
	    uint64_t         rm_owner;        /* extent owner */
	    uint64_t         rm_offset;       /* offset within the owner */
	    unsigned int     rm_flags;        /* state flags */
	};

The first two fields capture the location and size of the physical space,
in units of filesystem blocks.
The owner field tells scrub which metadata structure or file inode has been
assigned this space.
For space allocated to files, the offset field tells scrub where the space was
mapped within the file fork.
Finally, the flags field provides extra information about the space usage --
is this an attribute fork extent?  A file mapping btree extent?  Or an
unwritten data extent?
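
As a rough illustration of how the fields described above might be interpreted
for a file-owned record, consider the sketch below.
The two typedefs are stand-ins so that the example compiles on its own, and the
``DEMO_RMAP_*`` flag names and bit values are assumptions made for this
example; they are not taken from the kernel headers.

```c
#include <stdint.h>

/* Stand-in typedefs so the sketch is self-contained. */
typedef uint32_t xfs_agblock_t;
typedef uint32_t xfs_extlen_t;

struct xfs_rmap_irec {
	xfs_agblock_t    rm_startblock;   /* extent start block */
	xfs_extlen_t     rm_blockcount;   /* extent length */
	uint64_t         rm_owner;        /* extent owner */
	uint64_t         rm_offset;       /* offset within the owner */
	unsigned int     rm_flags;        /* state flags */
};

/* Hypothetical flag bits; the real values live in the XFS headers. */
#define DEMO_RMAP_ATTR_FORK	(1U << 0)
#define DEMO_RMAP_BMBT_BLOCK	(1U << 1)
#define DEMO_RMAP_UNWRITTEN	(1U << 2)

/* Describe what kind of file space a reverse mapping record covers. */
static const char *demo_rmap_describe(const struct xfs_rmap_irec *irec)
{
	/* A mapping btree block may belong to either fork, so test it first. */
	if (irec->rm_flags & DEMO_RMAP_BMBT_BLOCK)
		return "file mapping btree block";
	if (irec->rm_flags & DEMO_RMAP_ATTR_FORK)
		return "attribute fork extent";
	if (irec->rm_flags & DEMO_RMAP_UNWRITTEN)
		return "unwritten data extent";
	return "written data extent";
}

/* Last block covered by the record, in the same units as rm_startblock. */
static xfs_agblock_t demo_rmap_last_block(const struct xfs_rmap_irec *irec)
{
	return irec->rm_startblock + irec->rm_blockcount - 1;
}
```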

Online filesystem checking judges the consistency of each primary metadata
record by comparing its information against all other space indices.
The reverse mapping index plays a key role in the consistency checking process
because it contains a centralized alternate copy of all space allocation
information.
Program runtime and ease of resource acquisition are the only real limits to
what online checking can consult.
For example, a file data extent mapping can be checked against:

* The absence of an entry in the free space information.
* The absence of an entry in the inode index.
* The absence of an entry in the reference count data if the file is not
  marked as having shared extents.
* The correspondence of an entry in the reverse mapping information.
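
The four checks listed above could be modelled along these lines.
This is only a sketch under simplifying assumptions: each index is represented
as a plain in-memory array instead of a btree, the reverse mapping must match
the file mapping exactly, and every ``demo_`` name is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct demo_extent { uint64_t start; uint64_t len; };
struct demo_rmap   { uint64_t start; uint64_t len; uint64_t owner; uint64_t offset; };

/* Toy stand-ins for the free space, inode, refcount, and rmap indices. */
struct demo_indices {
	const struct demo_extent *free;   size_t nr_free;
	const struct demo_extent *inodes; size_t nr_inodes;
	const struct demo_extent *shared; size_t nr_shared;
	const struct demo_rmap   *rmaps;  size_t nr_rmaps;
};

static bool demo_any_overlap(const struct demo_extent *tbl, size_t nr,
			     uint64_t start, uint64_t len)
{
	for (size_t i = 0; i < nr; i++)
		if (tbl[i].start < start + len && start < tbl[i].start + tbl[i].len)
			return true;
	return false;
}

/* Returns true if the file data extent is consistent with every index. */
static bool demo_xref_file_extent(const struct demo_indices *idx,
				  uint64_t ino, uint64_t fileoff,
				  uint64_t start, uint64_t len,
				  bool file_has_shared_extents)
{
	if (demo_any_overlap(idx->free, idx->nr_free, start, len))
		return false;	/* allocated space must not be free */
	if (demo_any_overlap(idx->inodes, idx->nr_inodes, start, len))
		return false;	/* file data must not overlap inode chunks */
	if (!file_has_shared_extents &&
	    demo_any_overlap(idx->shared, idx->nr_shared, start, len))
		return false;	/* unshared file must not own shared space */

	/* Finally, the reverse mapping must name this file and offset. */
	for (size_t i = 0; i < idx->nr_rmaps; i++) {
		const struct demo_rmap *r = &idx->rmaps[i];

		if (r->start == start && r->len == len &&
		    r->owner == ino && r->offset == fileoff)
			return true;
	}
	return false;
}
```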

There are several observations to make about reverse mapping indices:

1. Reverse mappings can provide a positive affirmation of correctness if any of
   the above primary metadata are in doubt.
   The checking code for most primary metadata follows a path similar to the
   one outlined above.

2. Proving the consistency of secondary metadata with the primary metadata is
   difficult because that requires a full scan of all primary space metadata,
   which is very time intensive.
   For example, checking a reverse mapping record for a file extent mapping
   btree block requires locking the file and searching the entire btree to
   confirm the block.
   Instead, scrub relies on rigorous cross-referencing during the primary space
   mapping structure checks.

3. Consistency scans must use non-blocking lock acquisition primitives if the
   required locking order is not the same order used by regular filesystem
   operations.
   For example, if the filesystem normally takes a file ILOCK before taking
   the AGF buffer lock but scrub wants to take a file ILOCK while holding
   an AGF buffer lock, scrub cannot block on that second acquisition.
   This means that forward progress during this part of a scan of the reverse
   mapping data cannot be guaranteed if system load is heavy.
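
The non-blocking acquisition pattern from the third observation can be
illustrated with a toy lock.
Nothing here resembles the kernel's actual locking primitives: ``demo_lock``
is a hypothetical test-and-set flag built on C11 atomics, and the retry limit
is an arbitrary choice for the sake of the example.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Toy lock: set means held, clear means available. */
struct demo_lock {
	atomic_flag held;
};

/* Try to take the lock without waiting; returns true on success. */
static bool demo_trylock(struct demo_lock *lock)
{
	return !atomic_flag_test_and_set(&lock->held);
}

static void demo_unlock(struct demo_lock *lock)
{
	atomic_flag_clear(&lock->held);
}

/*
 * The caller already holds the AGF lock and wants a file lock, which is the
 * reverse of the usual order.  Blocking here could deadlock against a thread
 * that holds the file lock and is waiting for the AGF, so only bounded,
 * non-blocking attempts are made.  On failure the caller is expected to drop
 * what it holds and retry the whole operation later.
 */
static bool demo_scrub_lock_file(struct demo_lock *ilock, unsigned int max_tries)
{
	for (unsigned int i = 0; i < max_tries; i++)
		if (demo_trylock(ilock))
			return true;
	return false;
}
```

Because the function can return ``false`` indefinitely while another thread
holds the file lock, this sketch exhibits the same lack of a forward progress
guarantee under heavy load.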

In summary, reverse mappings play a key role in reconstruction of primary
metadata.
The details of how these records are staged, written to disk, and committed
into the filesystem are covered in subsequent sections.

Checking and Cross-Referencing
------------------------------

The first step of checking a metadata structure is to examine every record
contained within the structure and its relationship with the rest of the
system.
XFS contains multiple layers of checking to try to prevent inconsistent
metadata from wreaking havoc on the system.
Each of these layers contributes information that helps the kernel to make
five decisions about the health of a metadata structure:

- Is a part of this structure obviously corrupt (``XFS_SCRUB_OFLAG_CORRUPT``) ?
- Is this structure inconsistent with the rest of the system
  (``XFS_SCRUB_OFLAG_XCORRUPT``) ?
- Is there so much damage around the filesystem that cross-referencing is not
  possible (``XFS_SCRUB_OFLAG_XFAIL``) ?
- Can the structure be optimized to improve performance or reduce the size of
  metadata (``XFS_SCRUB_OFLAG_PREEN``) ?
- Does the structure contain data that is not inconsistent but deserves review
  by the system administrator (``XFS_SCRUB_OFLAG_WARNING``) ?
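
One way that a caller might turn these five outcome flags into a next step is
sketched below.
The ``DEMO_OFLAG_*`` bit values are placeholders, not the values defined for
the ``XFS_SCRUB_OFLAG_*`` constants, and the priority order chosen in
``demo_triage`` is an assumption made for illustration only.

```c
#include <stdint.h>

/* Placeholder bits mirroring the five questions; values are illustrative. */
#define DEMO_OFLAG_CORRUPT	(1U << 0)
#define DEMO_OFLAG_XCORRUPT	(1U << 1)
#define DEMO_OFLAG_XFAIL	(1U << 2)
#define DEMO_OFLAG_PREEN	(1U << 3)
#define DEMO_OFLAG_WARNING	(1U << 4)

enum demo_action {
	DEMO_ACT_NONE,      /* structure looks healthy */
	DEMO_ACT_REVIEW,    /* administrator should take a look */
	DEMO_ACT_OPTIMIZE,  /* structure could be tidied up */
	DEMO_ACT_RECHECK,   /* cross-referencing was not possible */
	DEMO_ACT_REPAIR,    /* structure is corrupt or inconsistent */
};

/* Pick the most urgent follow-up implied by a set of outcome flags. */
static enum demo_action demo_triage(uint32_t oflags)
{
	if (oflags & (DEMO_OFLAG_CORRUPT | DEMO_OFLAG_XCORRUPT))
		return DEMO_ACT_REPAIR;
	if (oflags & DEMO_OFLAG_XFAIL)
		return DEMO_ACT_RECHECK;
	if (oflags & DEMO_OFLAG_PREEN)
		return DEMO_ACT_OPTIMIZE;
	if (oflags & DEMO_OFLAG_WARNING)
		return DEMO_ACT_REVIEW;
	return DEMO_ACT_NONE;
}
```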

The following sections describe how the metadata scrubbing process works.

Metadata Buffer Verification
````````````````````````````

The lowest layer of metadata protection in XFS is the set of metadata verifiers
built into the buffer cache.
These functions perform inexpensive internal consistency checking of the block
itself, and answer these questions:

- Does the block belong to this filesystem?

- Does the block belong to the structure that asked for the read?
  This assumes that metadata blocks only have one owner, which is always true
  in XFS.

- Is the type of data stored in the block within a reasonable range of what
  scrub is expecting?

- Does the physical location of the block match the location it was read from?

- Does the block checksum match the data?

The scope of the protections here is very limited -- verifiers can only
establish that the filesystem code is reasonably free of gross corruption bugs
and that the storage system is reasonably competent at retrieval.
Corruption problems observed at runtime cause the generation of health reports,
failed system calls, and in the extreme case, filesystem shutdowns if the
corrupt metadata force the cancellation of a dirty transaction.

Every online fsck scrubbing function is expected to read every ondisk metadata
block of a structure in the course of checking the structure.
Corruption problems observed during a check are immediately reported to
userspace as corruption; during a cross-reference, they are reported as a
failure to cross-reference once the full examination is complete.
Reads satisfied by a buffer already in cache (and hence already verified)
bypass these checks.

Internal Consistency Checks
```````````````````````````

After the buffer cache, the next level of metadata protection is the internal
record verification code built into the filesystem.
These checks are split between the buffer verifiers, the in-filesystem users of
the buffer cache, and the scrub code itself, depending on the amount of higher
level context required.
The scope of checking is still internal to the block.
These higher level checking functions answer these questions:

- Does the type of data stored in the block match what scrub is expecting?

- Does the block belong to the owning structure that asked for the read?

- If the block contains records, do the records fit within the block?

- If the block tracks internal free space information, is it consistent with
  the record areas?

- Are the records contained inside the block free of obvious corruptions?

Record checks in this category are more rigorous and more time-intensive.
For example, block pointers and inumbers are checked to ensure that they point
within the dynamically allocated parts of an allocation group and within
the filesystem.
Names are checked for invalid characters, and flags are checked for invalid
combinations.
Other record attributes are checked for sensible values.
Btree records spanning an interval of the btree keyspace are checked for
correct order and lack of mergeability (except for file fork mappings).
For performance reasons, regular code may skip some of these checks unless
debugging is enabled or a write is about to occur.
Scrub functions, of course, must check all possible problems.
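
The ordering and mergeability checks mentioned above can be sketched for a
simple index whose records must never overlap, such as a list of free extents.
The ``demo_rec`` structure and ``demo_check_recs`` function are hypothetical,
and treating two adjacent records with the same owner as mergeable is an
assumption of this example.

```c
#include <stddef.h>
#include <stdint.h>

struct demo_rec {
	uint64_t start;  /* first block covered */
	uint64_t len;    /* number of blocks covered */
	uint64_t owner;  /* who the space belongs to */
};

enum demo_rec_problem {
	DEMO_REC_OK,
	DEMO_REC_BAD_LENGTH,    /* a record covers no space */
	DEMO_REC_OUT_OF_ORDER,  /* records are unsorted or overlap */
	DEMO_REC_MERGEABLE,     /* neighbours should have been one record */
};

/* Walk the records in index order and report the first problem found. */
static enum demo_rec_problem
demo_check_recs(const struct demo_rec *recs, size_t nr)
{
	for (size_t i = 0; i < nr; i++) {
		const struct demo_rec *prev;

		if (recs[i].len == 0)
			return DEMO_REC_BAD_LENGTH;
		if (i == 0)
			continue;

		prev = &recs[i - 1];
		if (recs[i].start < prev->start + prev->len)
			return DEMO_REC_OUT_OF_ORDER;
		if (recs[i].start == prev->start + prev->len &&
		    recs[i].owner == prev->owner)
			return DEMO_REC_MERGEABLE;
	}
	return DEMO_REC_OK;
}
```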

Validation of Userspace-Controlled Record Attributes
````````````````````````````````````````````````````

Various pieces of filesystem metadata are directly controlled by userspace.
Because of this nature, validation work cannot be more precise than checking
that a value is within the possible range.
These fields include:

- Superblock fields controlled by mount options
- Filesystem labels
- File timestamps
- File permissions
- File size
- File flags
- Names present in directory entries, extended attribute keys, and filesystem
  labels
- Extended attribute key namespaces
- Extended attribute values
- File data block contents
- Quota limits
- Quota timer expiration (if resource usage exceeds the soft limit)

Cross-Referencing Space Metadata
````````````````````````````````

After internal block checks, the next higher level of checking is
cross-referencing records between metadata structures.
For regular runtime code, the cost of these checks is considered to be
prohibitively expensive, but as scrub is dedicated to rooting out
inconsistencies, it must pursue all avenues of inquiry.
The exact set of cross-referencing is highly dependent on the context of the
data structure being checked.

The XFS btree code has keyspace scanning functions that online fsck uses to
cross-reference one structure with another.
Specifically, scrub can scan the keyspace of an index to determine if that
keyspace is fully, sparsely, or not at all mapped to records.
For the reverse mapping btree, it is possible to mask parts of the key for the
purposes of performing a keyspace scan so that scrub can decide if the rmap
btree contains records mapping a certain extent of physical space without the
sparseness of the rest of the rmap keyspace getting in the way.
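
The idea of classifying a range of keyspace as fully, sparsely, or not at all
mapped can be sketched as follows.
This is not the kernel's btree code: records are held in a plain array sorted
by starting block, each record is assumed to cover at least one block, and the
``demo_packing`` enum and ``demo_scan_keyspace`` function are hypothetical.
Only the physical extent of each record is examined, which stands in for
masking off the owner and offset parts of an rmap key.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct demo_extent { uint64_t start; uint64_t len; };

enum demo_packing {
	DEMO_PACKING_EMPTY,   /* no record touches the range */
	DEMO_PACKING_SPARSE,  /* some, but not all, of the range is mapped */
	DEMO_PACKING_FULL,    /* every block of the range is mapped */
};

/*
 * Classify [start, start + len) against records sorted by start block.
 * Records may overlap each other, as reverse mappings of shared space do.
 */
static enum demo_packing
demo_scan_keyspace(const struct demo_extent *recs, size_t nr,
		   uint64_t start, uint64_t len)
{
	uint64_t end = start + len;
	uint64_t next = start;	/* first block not yet known to be mapped */
	bool found = false;
	bool gap = false;

	for (size_t i = 0; i < nr; i++) {
		uint64_t rstart = recs[i].start;
		uint64_t rend = rstart + recs[i].len;

		if (rend <= start || rstart >= end)
			continue;	/* record lies outside the range */
		found = true;
		if (rstart > next)
			gap = true;	/* hole between next and rstart */
		if (rend > next)
			next = rend;
	}

	if (!found)
		return DEMO_PACKING_EMPTY;
	if (gap || next < end)
		return DEMO_PACKING_SPARSE;
	return DEMO_PACKING_FULL;
}
```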

Btree blocks undergo the following checks before cross-referencing:

- Does the type of data stored in the block match what scrub is expecting?

- Does the block belong to the owning structure that asked for the read?

- Do the records fit within the block?

- Are the records contained inside the block free of obvious corruptions?

- Are the name hashes in the correct order?

- Do node pointers within the btree point to valid block addresses for the type
  of btree?

- Do child pointers point towards the leaves?

- Do sibling pointers point across the same level?

- For each node block record, does the record key accurately reflect the
  contents of the child block?
1237e5edad52SDarrick J. Wong
1238e5edad52SDarrick J. WongSpace allocation records are cross-referenced as follows:
1239e5edad52SDarrick J. Wong
1240e5edad52SDarrick J. Wong1. Any space mentioned by any metadata structure are cross-referenced as
1241e5edad52SDarrick J. Wong   follows:
1242e5edad52SDarrick J. Wong
1243e5edad52SDarrick J. Wong   - Does the reverse mapping index list only the appropriate owner as the
1244e5edad52SDarrick J. Wong     owner of each block?
1245e5edad52SDarrick J. Wong
1246e5edad52SDarrick J. Wong   - Are none of the blocks claimed as free space?
1247e5edad52SDarrick J. Wong
1248e5edad52SDarrick J. Wong   - If these aren't file data blocks, are none of the blocks claimed as space
1249e5edad52SDarrick J. Wong     shared by different owners?
1250e5edad52SDarrick J. Wong
1251e5edad52SDarrick J. Wong2. Btree blocks are cross-referenced as follows:
1252e5edad52SDarrick J. Wong
1253e5edad52SDarrick J. Wong   - Everything in class 1 above.
1254e5edad52SDarrick J. Wong
1255e5edad52SDarrick J. Wong   - If there's a parent node block, do the keys listed for this block match the
1256e5edad52SDarrick J. Wong     keyspace of this block?
1257e5edad52SDarrick J. Wong
1258e5edad52SDarrick J. Wong   - Do the sibling pointers point to valid blocks?  Of the same level?
1259e5edad52SDarrick J. Wong
1260e5edad52SDarrick J. Wong   - Do the child pointers point to valid blocks?  Of the next level down?
1261e5edad52SDarrick J. Wong
1262e5edad52SDarrick J. Wong3. Free space btree records are cross-referenced as follows:
1263e5edad52SDarrick J. Wong
1264e5edad52SDarrick J. Wong   - Everything in class 1 and 2 above.
1265e5edad52SDarrick J. Wong
1266e5edad52SDarrick J. Wong   - Does the reverse mapping index list no owners of this space?
1267e5edad52SDarrick J. Wong
1268e5edad52SDarrick J. Wong   - Is this space not claimed by the inode index for inodes?
1269e5edad52SDarrick J. Wong
1270e5edad52SDarrick J. Wong   - Is it not mentioned by the reference count index?
1271e5edad52SDarrick J. Wong
1272e5edad52SDarrick J. Wong   - Is there a matching record in the other free space btree?
1273e5edad52SDarrick J. Wong
4. Inode btree records are cross-referenced as follows:

   - Everything in class 1 and 2 above.

   - Is there a matching record in the free inode btree?

   - Do cleared bits in the holemask correspond with inode clusters?

   - Do set bits in the freemask correspond with inode records with zero link
     count?

5. Inode records are cross-referenced as follows:

   - Everything in class 1.

   - Do all the fields that summarize information about the file forks actually
     match those forks?

   - Does each inode with zero link count correspond to a record in the free
     inode btree?

6. File fork space mapping records are cross-referenced as follows:

   - Everything in class 1 and 2 above.

   - Is this space not mentioned by the inode btrees?

   - If this is a CoW fork mapping, does it correspond to a CoW entry in the
     reference count btree?

7. Reference count records are cross-referenced as follows:

   - Everything in class 1 and 2 above.

   - Within the space subkeyspace of the rmap btree (that is to say, all
     records mapped to a particular space extent and ignoring the owner info),
     are there the same number of reverse mapping records for each block as the
     reference count record claims?
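
The reference count check in class 7 can be illustrated with a minimal C sketch.
The structures and function names below are hypothetical simplifications, not
the ondisk XFS formats or the kernel's scrub code.
The sketch only shows the idea: for every block covered by a refcount record,
count the reverse mappings that cover the block regardless of owner, and
compare that count with the value the refcount record claims.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified records; not the ondisk XFS formats. */
struct rmap_rec {
    uint32_t start;     /* first AG block of the mapping */
    uint32_t len;       /* number of blocks mapped */
    uint64_t owner;     /* ignored when counting in the space subkeyspace */
};

struct refcount_rec {
    uint32_t start;
    uint32_t len;
    uint32_t refcount;
};

/* Count the reverse mappings that cover one block, ignoring the owner. */
static uint32_t rmap_count_owners(const struct rmap_rec *rmaps, size_t nr,
                                  uint32_t agbno)
{
    uint32_t count = 0;

    for (size_t i = 0; i < nr; i++) {
        if (agbno >= rmaps[i].start &&
            agbno - rmaps[i].start < rmaps[i].len)
            count++;
    }
    return count;
}

/* Return true if every block in the refcount record has a matching count. */
static bool refcount_matches_rmaps(const struct refcount_rec *rc,
                                   const struct rmap_rec *rmaps, size_t nr)
{
    assert(rc->len > 0);
    for (uint32_t off = 0; off < rc->len; off++) {
        if (rmap_count_owners(rmaps, nr, rc->start + off) != rc->refcount)
            return false;
    }
    return true;
}
```

Scanning an array once per block keeps the sketch short; a btree-based
implementation would presumably use range queries instead.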

Proposed patchsets are the series to find gaps in
`refcount btree
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-refcount-gaps>`_,
`inode btree
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-inobt-gaps>`_, and
`rmap btree
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-rmapbt-gaps>`_ records;
to find
`mergeable records
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-detect-mergeable-records>`_;
and to
`improve cross referencing with rmap
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-strengthen-rmap-checking>`_
before starting a repair.

Checking Extended Attributes
````````````````````````````

Extended attributes implement a key-value store that enables fragments of data
to be attached to any file.
Both the kernel and userspace can access the keys and values, subject to
namespace and privilege restrictions.
Most typically these fragments are metadata about the file -- origins, security
contexts, user-supplied labels, indexing information, etc.

Names can be as long as 255 bytes and can exist in several different
namespaces.
Values can be as large as 64KB.
A file's extended attributes are stored in blocks mapped by the attr fork.
The mappings point to leaf blocks, remote value blocks, or dabtree blocks.
Block 0 in the attribute fork is always the top of the structure, but otherwise
each of the three types of blocks can be found at any offset in the attr fork.
Leaf blocks contain attribute key records that point to the name and the value.
Names are always stored elsewhere in the same leaf block.
Values that are less than 3/4 the size of a filesystem block are also stored
elsewhere in the same leaf block.
Remote value blocks contain values that are too large to fit inside a leaf.
If the leaf information exceeds a single filesystem block, a dabtree (also
rooted at block 0) is created to map hashes of the attribute names to leaf
blocks in the attr fork.

Checking an extended attribute structure is not so straightforward due to the
lack of separation between attr blocks and index blocks.
Scrub must read each block mapped by the attr fork and ignore the non-leaf
blocks:

1. Walk the dabtree in the attr fork (if present) to ensure that there are no
   irregularities in the blocks or dabtree mappings that do not point to
   attr leaf blocks.

2. Walk the blocks of the attr fork looking for leaf blocks.
   For each entry inside a leaf:

   a. Validate that the name does not contain invalid characters.

   b. Read the attr value.
      This performs a named lookup of the attr name to ensure the correctness
      of the dabtree.
      If the value is stored in a remote block, this also validates the
      integrity of the remote value block.

Checking and Cross-Referencing Directories
``````````````````````````````````````````

The filesystem directory tree is a directed acyclic graph structure, with files
constituting the nodes, and directory entries (dirents) constituting the edges.
Directories are a special type of file containing a set of mappings from a
255-byte sequence (name) to an inumber.
These are called directory entries, or dirents for short.
Each directory file must have exactly one directory pointing to the file.
A root directory points to itself.
Directory entries point to files of any type.
Each non-directory file may have multiple directories point to it.

In XFS, directories are implemented as a file containing up to three 32GB
partitions.
The first partition contains directory entry data blocks.
Each data block contains variable-sized records associating a user-provided
name with an inumber and, optionally, a file type.
If the directory entry data grows beyond one block, the second partition (which
exists as post-EOF extents) is populated with a block containing free space
information and an index that maps hashes of the dirent names to directory data
blocks in the first partition.
This makes directory name lookups very fast.
If this second partition grows beyond one block, the third partition is
populated with a linear array of free space information for faster
expansions.
If the free space has been separated and the second partition grows again
beyond one block, then a dabtree is used to map hashes of dirent names to
directory data blocks.

Checking a directory is pretty straightforward:

1. Walk the dabtree in the second partition (if present) to ensure that there
   are no irregularities in the blocks or dabtree mappings that do not point to
   dirent blocks.

2. Walk the blocks of the first partition looking for directory entries.
   Each dirent is checked as follows:

   a. Does the name contain no invalid characters?

   b. Does the inumber correspond to an actual, allocated inode?

   c. Does the child inode have a nonzero link count?

   d. If a file type is included in the dirent, does it match the type of the
      inode?

   e. If the child is a subdirectory, does the child's dotdot pointer point
      back to the parent?

   f. If the directory has a second partition, perform a named lookup of the
      dirent name to ensure the correctness of the dabtree.

3. Walk the free space list in the third partition (if present) to ensure that
   the free spaces it describes are really unused.

Checking operations involving :ref:`parents <dirparent>` and
:ref:`file link counts <nlinks>` are discussed in more detail in later
sections.
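
The per-dirent checks (a) through (e) in step 2 can be sketched in C as
follows.
This is a minimal illustration under stated assumptions, not the kernel's
directory scrubber: the structures, enum values, and function names are
hypothetical, the sketch assumes that ``/`` and NUL are the invalid name
characters, and it ignores the dot and dotdot entries of the directory being
checked.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified incore views; not the ondisk XFS formats. */
enum ftype { FT_UNKNOWN, FT_REG, FT_DIR };

struct inode_info {
    bool       allocated;
    uint32_t   nlink;
    enum ftype type;
    uint64_t   dotdot;      /* parent inumber; only meaningful for FT_DIR */
};

struct dirent {
    const char *name;
    size_t      namelen;
    uint64_t    inumber;
    enum ftype  ftype;      /* FT_UNKNOWN if the dirent records no type */
};

/* Check (a): assume '/' and NUL are the invalid characters in a name. */
static bool name_ok(const char *name, size_t len)
{
    if (len == 0 || len > 255)
        return false;
    return !memchr(name, '/', len) && !memchr(name, '\0', len);
}

/* Checks (a) through (e) for one dirent found in directory dir_ino. */
static bool dirent_ok(uint64_t dir_ino, const struct dirent *de,
                      const struct inode_info *child)
{
    if (!name_ok(de->name, de->namelen))
        return false;                       /* (a) */
    if (child == NULL || !child->allocated)
        return false;                       /* (b) */
    if (child->nlink == 0)
        return false;                       /* (c) */
    if (de->ftype != FT_UNKNOWN && de->ftype != child->type)
        return false;                       /* (d) */
    if (child->type == FT_DIR && child->dotdot != dir_ino)
        return false;                       /* (e) */
    return true;
}
```

Check (f), the named lookup through the dabtree, is omitted because it depends
on the index structures described in the next section.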

Checking Directory/Attribute Btrees
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

As stated in previous sections, the directory/attribute btree (dabtree) index
maps user-provided names to improve lookup times by avoiding linear scans.
Internally, it maps a 32-bit hash of the name to a block offset within the
appropriate file fork.

The internal structure of a dabtree closely resembles the btrees that record
fixed-size metadata records -- each dabtree block contains a magic number, a
checksum, sibling pointers, a UUID, a tree level, and a log sequence number.
The format of leaf and node records is the same -- each entry points to the
next level down in the hierarchy, with dabtree node records pointing to dabtree
leaf blocks, and dabtree leaf records pointing to non-dabtree blocks elsewhere
in the fork.

Checking and cross-referencing the dabtree is very similar to what is done for
space btrees:

- Does the type of data stored in the block match what scrub is expecting?

- Does the block belong to the owning structure that asked for the read?

- Do the records fit within the block?

- Are the records contained inside the block free of obvious corruptions?

- Are the name hashes in the correct order?

- Do node pointers within the dabtree point to valid fork offsets for dabtree
  blocks?

- Do leaf pointers within the dabtree point to valid fork offsets for directory
  or attr leaf blocks?

- Do child pointers point towards the leaves?

- Do sibling pointers point across the same level?

- For each dabtree node record, does the record key accurately reflect the
  contents of the child dabtree block?

- For each dabtree leaf record, does the record key accurately reflect the
  contents of the directory or attr block?
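
Two of the checks above, hash ordering and key accuracy, can be sketched in C.
The block structure and function names are hypothetical, and the sketch makes
two assumptions that are not spelled out in the list: that equal hashes may sit
next to each other (so the order is nondecreasing), and that a parent key
"accurately reflects" a child when it is at least as large as every hash stored
in that child.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified dabtree block holding only hash keys. */
struct da_block {
    const uint32_t *hashes;     /* hash key of each record in the block */
    size_t          nr;
};

/* Are the name hashes in the block in nondecreasing order? */
static bool da_hashes_ordered(const struct da_block *blk)
{
    for (size_t i = 1; i < blk->nr; i++) {
        if (blk->hashes[i] < blk->hashes[i - 1])
            return false;
    }
    return true;
}

/*
 * Does the parent's key cover the child?  Assumption: the key must be at
 * least as large as every hash stored in the child block.
 */
static bool da_key_covers_child(uint32_t parent_key,
                                const struct da_block *child)
{
    for (size_t i = 0; i < child->nr; i++) {
        if (child->hashes[i] > parent_key)
            return false;
    }
    return true;
}
```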

Cross-Referencing Summary Counters
``````````````````````````````````

XFS maintains three classes of summary counters: available resources, quota
resource usage, and file link counts.

In theory, the amount of available resources (data blocks, inodes, realtime
extents) can be found by walking the entire filesystem.
This would make for very slow reporting, so a transactional filesystem can
maintain summaries of this information in the superblock.
Cross-referencing these values against the filesystem metadata should be a
simple matter of walking the free space and inode metadata in each AG and the
realtime bitmap, but there are complications that will be discussed in
:ref:`more detail <fscounters>` later.
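
The "simple matter" version of this cross-reference can be sketched in C.
The structures and names are hypothetical, the per-AG totals are assumed to
have been gathered already by walking each AG, and the sketch deliberately
ignores both the realtime bitmap and the complications mentioned above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-AG totals, as if gathered by walking each AG's metadata. */
struct ag_totals {
    uint64_t free_blocks;
    uint64_t inodes;
    uint64_t free_inodes;
};

/* Hypothetical view of the summary counters kept in the superblock. */
struct sb_counters {
    uint64_t fdblocks;      /* free data blocks */
    uint64_t icount;        /* allocated inodes */
    uint64_t ifree;         /* free inodes */
};

/* Sum the per-AG totals and compare them with the superblock summaries. */
static bool counters_match(const struct sb_counters *sb,
                           const struct ag_totals *ags, size_t nr_ags)
{
    uint64_t fdblocks = 0, icount = 0, ifree = 0;

    for (size_t i = 0; i < nr_ags; i++) {
        fdblocks += ags[i].free_blocks;
        icount += ags[i].inodes;
        ifree += ags[i].free_inodes;
    }
    return sb->fdblocks == fdblocks && sb->icount == icount &&
           sb->ifree == ifree;
}
```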

:ref:`Quota usage <quotacheck>` and :ref:`file link count <nlinks>`
checking are sufficiently complicated to warrant separate sections.

Post-Repair Reverification
``````````````````````````

After performing a repair, the checking code is run a second time to validate
the new structure, and the results of the health assessment are recorded
internally and returned to the calling process.
This step is critical for enabling system administrators to monitor the status
of the filesystem and the progress of any repairs.
For developers, it is a useful means to judge the efficacy of error detection
and correction in the online and offline checking tools.

Eventual Consistency vs. Online Fsck
------------------------------------

Complex operations can make modifications to multiple per-AG data structures
with a chain of transactions.
These chains, once committed to the log, are restarted during log recovery if
the system crashes while processing the chain.
Because the AG header buffers are unlocked between transactions within a chain,
online checking must coordinate with chained operations that are in progress to
avoid incorrectly detecting inconsistencies due to pending chains.
Furthermore, online repair must not run when operations are pending because
the metadata are temporarily inconsistent with each other, and rebuilding is
not possible.

Only online fsck has this requirement of total consistency of AG metadata, and
online fsck should be relatively rare as compared to filesystem change
operations.
Online fsck coordinates with transaction chains as follows:

* For each AG, maintain a count of intent items targeting that AG.
  The count should be bumped whenever a new item is added to the chain.
  The count should be dropped when the filesystem has locked the AG header
  buffers and finished the work.

* When online fsck wants to examine an AG, it should lock the AG header
  buffers to quiesce all transaction chains that want to modify that AG.
  If the count is zero, proceed with the checking operation.
  If it is nonzero, cycle the buffer locks to allow the chain to make forward
  progress.

This may lead to online fsck taking a long time to complete, but regular
filesystem updates take precedence over background checking activity.
Details about the discovery of this situation are presented in the
:ref:`next section <chain_coordination>`, and details about the solution
are presented :ref:`after that <intent_drains>`.

.. _chain_coordination:

Discovery of the Problem
````````````````````````

Midway through the development of online scrubbing, the fsstress tests
uncovered a misinteraction between online fsck and compound transaction chains
created by other writer threads that resulted in false reports of metadata
inconsistency.
The root cause of these reports is the eventual consistency model introduced by
the expansion of deferred work items and compound transaction chains when
reverse mapping and reflink were introduced.

Originally, transaction chains were added to XFS to avoid deadlocks when
unmapping space from files.
Deadlock avoidance rules require that AGs only be locked in increasing order,
which makes it impossible (say) to use a single transaction to free a space
extent in AG 7 and then try to free a now superfluous block mapping btree block
in AG 3.
To avoid these kinds of deadlocks, XFS creates Extent Freeing Intent (EFI) log
items to commit to freeing some space in one transaction while deferring the
actual metadata updates to a fresh transaction.
The transaction sequence looks like this:

1. The first transaction contains a physical update to the file's block mapping
   structures to remove the mapping from the btree blocks.
   It then attaches to the in-memory transaction an action item to schedule
   deferred freeing of space.
   Concretely, each transaction maintains a list of ``struct
   xfs_defer_pending`` objects, each of which maintains a list of ``struct
   xfs_extent_free_item`` objects.
   Returning to the example above, the action item tracks the freeing of both
   the unmapped space from AG 7 and the block mapping btree (BMBT) block from
   AG 3.
   Deferred frees recorded in this manner are committed in the log by creating
   an EFI log item from the ``struct xfs_extent_free_item`` object and
   attaching the log item to the transaction.
   When the log is persisted to disk, the EFI item is written into the ondisk
   transaction record.
   EFIs can list up to 16 extents to free, all sorted in AG order.

2. The second transaction contains a physical update to the free space btrees
   of AG 3 to release the former BMBT block and a second physical update to the
   free space btrees of AG 7 to release the unmapped file space.
   Observe that the physical updates are resequenced in the correct order
   when possible.
   Attached to the transaction is an extent free done (EFD) log item.
   The EFD contains a pointer to the EFI logged in transaction #1 so that log
   recovery can tell if the EFI needs to be replayed.
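
The resequencing in the second transaction amounts to sorting the pending frees
by AG number, so that AG 3 is visited before AG 7.
A minimal C sketch follows; ``struct free_item`` and both functions are
hypothetical stand-ins rather than the kernel's ``struct xfs_extent_free_item``
or its deferred work code, and breaking ties by block number within an AG is an
assumption made only to keep the order deterministic.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a deferred extent free; not the kernel struct. */
struct free_item {
    uint32_t agno;      /* allocation group holding the extent */
    uint32_t agbno;     /* first block within that AG */
    uint32_t len;       /* number of blocks to free */
};

/* Order by AG number first, then by starting block (assumed tie-break). */
static int free_item_cmp(const void *a, const void *b)
{
    const struct free_item *fa = a, *fb = b;

    if (fa->agno != fb->agno)
        return fa->agno < fb->agno ? -1 : 1;
    if (fa->agbno != fb->agbno)
        return fa->agbno < fb->agbno ? -1 : 1;
    return 0;
}

/* Sort pending frees so that AGs are visited in increasing order. */
static void sort_free_items(struct free_item *items, size_t nr)
{
    qsort(items, nr, sizeof(*items), free_item_cmp);
}
```

With the example from the text, an unmapped extent in AG 7 queued ahead of a
BMBT block in AG 3 would be processed after it once the list is sorted.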

If the system goes down after transaction #1 is written back to the filesystem
but before #2 is committed, a scan of the filesystem metadata would show
inconsistent filesystem metadata because there would not appear to be any owner
of the unmapped space.
Happily, log recovery corrects this inconsistency for us -- when recovery finds
an intent log item but does not find a corresponding intent done item, it will
reconstruct the incore state of the intent item and finish it.
In the example above, the log must replay both frees described in the recovered
EFI to complete the recovery phase.
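
The matching rule that recovery applies can be sketched in C.
The log item structure, the ``efi_id`` field, and the function names are
hypothetical; the sketch only models the decision of which intents still need
to be finished, not the reconstruction of incore state.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum item_type { ITEM_EFI, ITEM_EFD };

/* Hypothetical recovered log item; real log items carry much more state. */
struct log_item {
    enum item_type type;
    uint64_t       efi_id;  /* the EFI's id, or the id of the EFI an EFD finishes */
};

/* Does any EFD in the recovered log refer to this EFI? */
static bool efi_is_done(const struct log_item *log, size_t nr, uint64_t efi_id)
{
    for (size_t i = 0; i < nr; i++) {
        if (log[i].type == ITEM_EFD && log[i].efi_id == efi_id)
            return true;
    }
    return false;
}

/*
 * Collect the ids of EFIs that have no matching EFD and so must be replayed.
 * Returns the number of ids written to out, which must have room for nr.
 */
static size_t efis_to_replay(const struct log_item *log, size_t nr,
                             uint64_t *out)
{
    size_t n = 0;

    for (size_t i = 0; i < nr; i++) {
        if (log[i].type == ITEM_EFI && !efi_is_done(log, nr, log[i].efi_id))
            out[n++] = log[i].efi_id;
    }
    return n;
}
```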

There are subtleties to XFS' transaction chaining strategy to consider:

* Log items must be added to a transaction in the correct order to prevent
  conflicts with principal objects that are not held by the transaction.
  In other words, all per-AG metadata updates for an unmapped block must be
  completed before the last update to free the extent, and extents should not
  be reallocated until that last update commits to the log.

* AG header buffers are released between each transaction in a chain.
  This means that other threads can observe an AG in an intermediate state,
  but as long as the first subtlety is handled, this should not affect the
  correctness of filesystem operations.

* Unmounting the filesystem flushes all pending work to disk, which means that
  offline fsck never sees the temporary inconsistencies caused by deferred
  work item processing.

In this manner, XFS employs a form of eventual consistency to avoid deadlocks
and increase parallelism.

During the design phase of the reverse mapping and reflink features, it was
decided that it was impractical to cram all the reverse mapping updates for a
single filesystem change into a single transaction because a single file
mapping operation can explode into many small updates:

* The block mapping update itself
* A reverse mapping update for the block mapping update
* Fixing the freelist
* A reverse mapping update for the freelist fix

* A shape change to the block mapping btree
* A reverse mapping update for the btree update
* Fixing the freelist (again)
* A reverse mapping update for the freelist fix

* An update to the reference counting information
* A reverse mapping update for the refcount update
* Fixing the freelist (a third time)
* A reverse mapping update for the freelist fix

* Freeing any space that was unmapped and not owned by any other file
* Fixing the freelist (a fourth time)
* A reverse mapping update for the freelist fix

* Freeing the space used by the block mapping btree
* Fixing the freelist (a fifth time)
* A reverse mapping update for the freelist fix

Free list fixups are not usually needed more than once per AG per transaction
chain, but it is theoretically possible if space is very tight.
For copy-on-write updates this is even worse, because this must be done once to
remove the space from a staging area and again to map it into the file!

To deal with this explosion in a calm manner, XFS expands its use of deferred
work items to cover most reverse mapping updates and all refcount updates.
This reduces the worst case size of transaction reservations by breaking the
work into a long chain of small updates, which increases the degree of eventual
consistency in the system.
Again, this generally isn't a problem because XFS orders its deferred work
items carefully to avoid resource reuse conflicts between unsuspecting threads.

However, online fsck changes the rules -- remember that although physical
updates to per-AG structures are coordinated by locking the buffers for AG
headers, buffer locks are dropped between transactions.
Once scrub acquires resources and takes locks for a data structure, it must do
all the validation work without releasing the lock.
If the main lock for a space btree is an AG header buffer lock, scrub may have
interrupted another thread that is midway through finishing a chain.
For example, if a thread performing a copy-on-write has completed a reverse
mapping update but not the corresponding refcount update, the two AG btrees
will appear inconsistent to scrub and an observation of corruption will be
recorded.
This observation will not be correct.
If a repair is attempted in this state, the results will be catastrophic!

Several other solutions to this problem were evaluated upon discovery of this
flaw and rejected:

1. Add a higher level lock to allocation groups and require writer threads to
   acquire the higher level lock in AG order before making any changes.
   This would be very difficult to implement in practice because it is
   difficult to determine which locks need to be obtained, and in what order,
   without simulating the entire operation.
   Performing a dry run of a file operation to discover necessary locks would
   make the filesystem very slow.

2. Make the deferred work coordinator code aware of consecutive intent items
   targeting the same AG and have it hold the AG header buffers locked across
   the transaction roll between updates.
   This would introduce a lot of complexity into the coordinator since it is
   only loosely coupled with the actual deferred work items.
   It would also fail to solve the problem because deferred work items can
   generate new deferred subtasks, but all subtasks must be complete before
   work can start on a new sibling task.

3. Teach online fsck to walk all transactions waiting for whichever lock(s)
   protect the data structure being scrubbed to look for pending operations.
   The checking and repair operations must factor these pending operations into
   the evaluations being performed.
   This solution is a nonstarter because it is *extremely* invasive to the main
   filesystem.
.. _intent_drains:

Intent Drains
`````````````

Online fsck uses an atomic intent item counter and lock cycling to coordinate
with transaction chains.
There are two key properties to the drain mechanism.
First, the counter is incremented when a deferred work item is *queued* to a
transaction, and it is decremented after the associated intent done log item is
*committed* to another transaction.
The second property is that deferred work can be added to a transaction without
holding an AG header lock, but per-AG work items cannot be marked done without
locking that AG header buffer to log the physical updates and the intent done
log item.
The first property enables scrub to yield to running transaction chains, which
is an explicit deprioritization of online fsck to benefit file operations.
The second property of the drain is key to the correct coordination of scrub,
since scrub will always be able to decide if a conflict is possible.

For regular filesystem code, the drain works as follows:

1. Call the appropriate subsystem function to add a deferred work item to a
   transaction.

2. The function calls ``xfs_defer_drain_bump`` to increase the counter.

3. When the deferred item manager wants to finish the deferred work item, it
   calls ``->finish_item`` to complete it.

4. The ``->finish_item`` implementation logs some changes and calls
   ``xfs_defer_drain_drop`` to decrease the counter and wake up any threads
   waiting on the drain.

5. The subtransaction commits, which unlocks the resource associated with the
   intent item.

For scrub, the drain works as follows:

1. Lock the resource(s) associated with the metadata being scrubbed.
   For example, a scan of the refcount btree would lock the AGI and AGF header
   buffers.

2. If the counter is zero (``xfs_defer_drain_busy`` returns false), there are no
   chains in progress and the operation may proceed.

3. Otherwise, release the resources grabbed in step 1.

4. Wait for the intent counter to reach zero (``xfs_defer_drain_intents``), then go
   back to step 1 unless a signal has been caught.

To avoid polling in step 4, the drain provides a waitqueue for scrub threads to
be woken up whenever the intent count drops to zero.
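
The counter and the scrub-side back-off can be modeled in userspace with C11
atomics.
This is only a sketch under stated assumptions: ``struct drain_model`` and the
``model_*`` helpers are hypothetical stand-ins for the kernel's drain
functions, and the waitqueue is reduced to a return value that reports when
waiters would be woken.

.. code-block:: c

	#include <stdatomic.h>
	#include <stdbool.h>

	/* Hypothetical userspace stand-in for a per-AG intent drain. */
	struct drain_model {
		atomic_uint	count;	/* queued but unfinished work items */
	};

	static void model_drain_init(struct drain_model *dr)
	{
		atomic_init(&dr->count, 0);
	}

	/* Regular filesystem, step 2: a deferred work item was queued. */
	static void model_drain_bump(struct drain_model *dr)
	{
		atomic_fetch_add(&dr->count, 1);
	}

	/*
	 * Regular filesystem, step 4: the work item finished.  Returns true if
	 * the count reached zero, which is when waiters would be woken.
	 */
	static bool model_drain_drop(struct drain_model *dr)
	{
		return atomic_fetch_sub(&dr->count, 1) == 1;
	}

	static bool model_drain_busy(struct drain_model *dr)
	{
		return atomic_load(&dr->count) > 0;
	}

	/*
	 * Scrub, steps 1-3: take the lock and keep it only if no chains are in
	 * progress.  On failure the caller waits for the drain (step 4) and
	 * tries again.
	 */
	static bool model_scrub_trylock(struct drain_model *dr, bool *ag_locked)
	{
		*ag_locked = true;		/* step 1: lock the AG headers */
		if (!model_drain_busy(dr))
			return true;		/* step 2: no conflict possible */
		*ag_locked = false;		/* step 3: back off */
		return false;
	}

Because the busy check happens while the (modeled) AG header lock is held, no
work item can be marked done between the check and the start of the scrub,
which is the second property described above.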

The proposed patchset is the
`scrub intent drain series
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-drain-intents>`_.

.. _jump_labels:

Static Keys (aka Jump Label Patching)
`````````````````````````````````````

Online fsck for XFS separates the regular filesystem from the checking and
repair code as much as possible.
However, there are a few parts of online fsck (such as the intent drains, and
later, live update hooks) where it is useful for the online fsck code to know
what's going on in the rest of the filesystem.
Since it is not expected that online fsck will be constantly running in the
background, it is very important to minimize the runtime overhead imposed by
these hooks when online fsck is compiled into the kernel but not actively
running on behalf of userspace.
Taking locks in the hot path of a writer thread to access a data structure only
to find that no further action is necessary is expensive -- on the author's
computer, this has an overhead of 40-50ns per access.
Fortunately, the kernel supports dynamic code patching, which enables XFS to
replace a static branch to hook code with ``nop`` sleds when online fsck isn't
running.
This sled has an overhead of however long it takes the instruction decoder to
skip past the sled, which seems to be on the order of less than 1ns and
does not access memory outside of instruction fetching.

When online fsck enables the static key, the sled is replaced with an
unconditional branch to call the hook code.
The switchover is quite expensive (~22000ns) but is paid entirely by the
program that invoked online fsck, and can be amortized if multiple threads
enter online fsck at the same time, or if multiple filesystems are being
checked at the same time.
Changing the branch direction requires taking the CPU hotplug lock, and since
CPU initialization requires memory allocation, online fsck must be careful not
to change a static key while holding any locks or resources that could be
accessed in the memory reclaim paths.
To minimize contention on the CPU hotplug lock, care should be taken not to
enable or disable static keys unnecessarily.

Because static keys are intended to minimize hook overhead for regular
filesystem operations when xfs_scrub is not running, the intended usage
patterns are as follows:

- The hooked part of XFS should declare a static key that defaults to false,
  using the ``DEFINE_STATIC_KEY_FALSE`` macro.
  The key itself should be a ``static`` variable, private to the file that
  defines it.

- When deciding to invoke code that's only used by scrub, the regular
  filesystem should call the ``static_branch_unlikely`` predicate to avoid the
  scrub-only hook code if the static key is not enabled.

- The regular filesystem should export helper functions that call
  ``static_branch_inc`` to enable and ``static_branch_dec`` to disable the
  static key.
  Wrapper functions make it easy to compile out the relevant code if the kernel
  distributor turns off online fsck at build time.

- Scrub functions wanting to turn on scrub-only XFS functionality should call
  ``xchk_fsgates_enable`` from the setup function to enable a specific hook.
  This must be done before obtaining any resources that are used by memory
  reclaim.
  Callers had better be sure they really need the functionality gated by the
  static key; the ``TRY_HARDER`` flag is useful here.

Online scrub has resource acquisition helpers (e.g. ``xchk_perag_lock``) to
handle locking AGI and AGF buffers for all scrubber functions.
If a helper detects a conflict between scrub and the running transactions, it
will try to wait for intents to complete.
If the caller of the helper has not enabled the static key, the helper will
return -EDEADLOCK, which should result in the scrub being restarted with the
``TRY_HARDER`` flag set.
The scrub setup function should detect that flag, enable the static key, and
try the scrub again.
Scrub teardown disables all static keys obtained by ``xchk_fsgates_enable``.
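
The restart protocol in the preceding paragraph can be modeled in plain C.
This sketch makes no attempt to reproduce jump label patching: a simple enable
count stands in for the static key, and every ``model_*`` name and the
``MODEL_TRY_HARDER`` bit are hypothetical.
It only illustrates that the first attempt avoids the cost of enabling the key
and that teardown always undoes what setup enabled.

.. code-block:: c

	#include <errno.h>
	#include <stdbool.h>

	#ifndef EDEADLOCK
	#define EDEADLOCK	EDEADLK	/* not every libc defines EDEADLOCK */
	#endif

	#define MODEL_TRY_HARDER	(1U << 0)	/* hypothetical flag bit */

	static int model_gate_refs;	/* stand-in for the key's enable count */

	struct scrub_model {
		unsigned int	flags;
		bool		gate_held;	/* did setup enable the gate? */
	};

	/* Setup: only pay for the static key when a retry asks for it. */
	static void model_setup(struct scrub_model *sc)
	{
		if (sc->flags & MODEL_TRY_HARDER) {
			model_gate_refs++;
			sc->gate_held = true;
		}
	}

	/* Lock helper: waiting for intents is only possible with the gate on. */
	static int model_perag_lock(struct scrub_model *sc, bool intents_pending)
	{
		if (!intents_pending)
			return 0;
		if (!sc->gate_held)
			return -EDEADLOCK;
		return 0;	/* the kernel would sleep on the drain here */
	}

	/* Teardown: drop whatever setup enabled. */
	static void model_teardown(struct scrub_model *sc)
	{
		if (sc->gate_held) {
			model_gate_refs--;
			sc->gate_held = false;
		}
	}

	/* Run one scrub, restarting once with TRY_HARDER on -EDEADLOCK. */
	static int model_scrub(bool intents_pending, int *attempts)
	{
		struct scrub_model sc = { 0 };
		int error;

		*attempts = 0;
		for (;;) {
			(*attempts)++;
			model_setup(&sc);
			error = model_perag_lock(&sc, intents_pending);
			model_teardown(&sc);
			if (error != -EDEADLOCK || (sc.flags & MODEL_TRY_HARDER))
				return error;
			sc.flags |= MODEL_TRY_HARDER;
		}
	}

In this model a scrub that meets no pending intents finishes in one attempt
without touching the gate, while a conflicting scrub takes two attempts and
leaves the enable count balanced.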

For more information, please see the kernel documentation in
Documentation/staging/static-keys.rst.

.. _xfile:

Pageable Kernel Memory
----------------------

Some online checking functions work by scanning the filesystem to build a
shadow copy of an ondisk metadata structure in memory and comparing the two
copies.
For online repair to rebuild a metadata structure, it must compute the record
set that will be stored in the new structure before it can persist that new
structure to disk.
Ideally, repairs complete with a single atomic commit that introduces
a new data structure.
To meet these goals, the kernel needs to collect a large amount of information
in a place that doesn't require the correct operation of the filesystem.

Kernel memory isn't suitable because:

* Allocating a contiguous region of memory to create a C array is very
  difficult, especially on 32-bit systems.

* Linked lists of records introduce double pointer overhead, which is very
  high, and eliminate the possibility of indexed lookups.

* Kernel memory is pinned, which can drive the system into OOM conditions.

* The system might not have sufficient memory to stage all the information.

At any given time, online fsck does not need to keep the entire record set in
memory, which means that individual records can be paged out if necessary.
Continued development of online fsck demonstrated that the ability to perform
indexed data storage would also be very useful.
Fortunately, the Linux kernel already has a facility for byte-addressable and
pageable storage: tmpfs.
In-kernel graphics drivers (most notably i915) take advantage of tmpfs files
to store intermediate data that doesn't need to be in memory at all times, so
that usage precedent is already established.
Hence, the ``xfile`` was born!

+--------------------------------------------------------------------------+
| **Historical Sidebar**:                                                  |
+--------------------------------------------------------------------------+
| The first edition of online repair inserted records into a new btree as  |
| it found them, which failed because the filesystem could shut down with  |
| a half-built structure, which would be live after recovery finished.     |
|                                                                          |
| The second edition solved the half-rebuilt structure problem by storing  |
| everything in memory, but frequently ran the system out of memory.       |
|                                                                          |
| The third edition solved the OOM problem by using linked lists, but the  |
| memory overhead of the list pointers was extreme.                        |
+--------------------------------------------------------------------------+

xfile Access Models
```````````````````

A survey of the intended uses of xfiles suggested these use cases:

1. Arrays of fixed-sized records (space management btrees, directory and
   extended attribute entries)

2. Sparse arrays of fixed-sized records (quotas and link counts)

3. Large binary objects (BLOBs) of variable sizes (directory and extended
   attribute names and values)

4. Staging btrees in memory (reverse mapping btrees)

5. Arbitrary contents (realtime space management)

To support the first four use cases, high level data structures wrap the xfile
to share functionality between online fsck functions.
The rest of this section discusses the interfaces that the xfile presents to
those four higher level data structures.
The fifth use case is discussed in the :ref:`realtime summary <rtsummary>` case
study.

The most general storage interface supported by the xfile enables the reading
and writing of arbitrary quantities of data at arbitrary offsets in the xfile.
This capability is provided by ``xfile_pread`` and ``xfile_pwrite`` functions,
which behave similarly to their userspace counterparts.
XFS is very record-based, which suggests that the ability to load and store
complete records is important.
To support these cases, the ``xfile_obj_load`` and ``xfile_obj_store``
functions are provided to read and persist objects into an xfile.
They are internally the same as pread and pwrite, except that they treat any
error as an out of memory error.
For online repair, squashing error conditions in this manner is an acceptable
behavior because the only reaction is to abort the operation back to userspace.
All five xfile use cases can be serviced by these four functions.
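
The relationship between the two pairs of functions can be sketched with a
fixed byte buffer standing in for the tmpfs file.
Everything here is an illustrative assumption (``struct xfile_model``, the
``model_*`` names, the argument order, and the particular errno values chosen
for the low level calls); the point is only that the object helpers collapse
short transfers and errors alike into ``-ENOMEM``.

.. code-block:: c

	#include <errno.h>
	#include <stddef.h>
	#include <string.h>
	#include <sys/types.h>

	#define MODEL_XFILE_BYTES	256	/* hypothetical capacity */

	struct xfile_model {
		unsigned char	bytes[MODEL_XFILE_BYTES];
	};

	/* Read up to count bytes at pos; short reads happen at the end. */
	static ssize_t model_pread(const struct xfile_model *xf, void *buf,
			size_t count, off_t pos)
	{
		size_t avail;

		if (pos < 0)
			return -EINVAL;
		if ((size_t)pos >= MODEL_XFILE_BYTES)
			return 0;
		avail = MODEL_XFILE_BYTES - (size_t)pos;
		if (count > avail)
			count = avail;
		memcpy(buf, xf->bytes + pos, count);
		return (ssize_t)count;
	}

	/* Write count bytes at pos, refusing to grow past the capacity. */
	static ssize_t model_pwrite(struct xfile_model *xf, const void *buf,
			size_t count, off_t pos)
	{
		if (pos < 0)
			return -EINVAL;
		if ((size_t)pos > MODEL_XFILE_BYTES ||
		    count > MODEL_XFILE_BYTES - (size_t)pos)
			return -EFBIG;
		memcpy(xf->bytes + pos, buf, count);
		return (ssize_t)count;
	}

	/* Load a whole object; anything less is reported as -ENOMEM. */
	static int model_obj_load(const struct xfile_model *xf, void *buf,
			size_t count, off_t pos)
	{
		ssize_t ret = model_pread(xf, buf, count, pos);

		return (ret < 0 || (size_t)ret != count) ? -ENOMEM : 0;
	}

	/* Store a whole object; anything less is reported as -ENOMEM. */
	static int model_obj_store(struct xfile_model *xf, const void *buf,
			size_t count, off_t pos)
	{
		ssize_t ret = model_pwrite(xf, buf, count, pos);

		return (ret < 0 || (size_t)ret != count) ? -ENOMEM : 0;
	}

Callers of the object helpers therefore have a single failure mode to handle,
which matches the abort-to-userspace behavior described above.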

However, no discussion of file access idioms is complete without answering the
question, "But what about mmap?"
It is convenient to access storage directly with pointers, just like userspace
code does with regular memory.
Online fsck must not drive the system into OOM conditions, however, which
means that xfiles must be responsive to memory reclamation.
tmpfs can only push a pagecache folio to the swap cache if the folio is neither
pinned nor locked, which means the xfile must not pin too many folios.

Short term direct access to xfile contents is done by locking the pagecache
folio and mapping it into kernel address space.
Programmatic access (e.g. pread and pwrite) uses this mechanism.
Folio locks are not supposed to be held for long periods of time, so long
term direct access to xfile contents is done by bumping the folio refcount,
mapping it into kernel address space, and dropping the folio lock.
These long term users *must* be responsive to memory reclaim by hooking into
the shrinker infrastructure to know when to release folios.

The ``xfile_get_page`` and ``xfile_put_page`` functions are provided to
retrieve the (locked) folio that backs part of an xfile and to release it.
The only users of these folio lease functions are the xfarray
:ref:`sorting<xfarray_sort>` algorithms and the :ref:`in-memory
btrees<xfbtree>`.

xfile Access Coordination
`````````````````````````

For security reasons, xfiles must be owned privately by the kernel.
They are marked ``S_PRIVATE`` to prevent interference from the security system,
must never be mapped into process file descriptor tables, and their pages must
never be mapped into userspace processes.

To avoid locking recursion issues with the VFS, all accesses to the shmfs file
are performed by manipulating the page cache directly.
xfile writers call the ``->write_begin`` and ``->write_end`` functions of the
xfile's address space to grab writable pages, copy the caller's buffer into the
page, and release the pages.
xfile readers call ``shmem_read_mapping_page_gfp`` to grab pages directly
before copying the contents into the caller's buffer.
In other words, xfiles ignore the VFS read and write code paths to avoid
having to create a dummy ``struct kiocb`` and to avoid taking inode and
freeze locks.
tmpfs cannot be frozen, and xfiles must not be exposed to userspace.

If an xfile is shared between threads to stage repairs, the caller must provide
its own locks to coordinate access.
For example, if a scrub function stores scan results in an xfile and needs
other threads to provide updates to the scanned data, the scrub function must
provide a lock for all threads to share.

.. _xfarray:

Arrays of Fixed-Sized Records
`````````````````````````````

In XFS, each type of indexed space metadata (free space, inodes, reference
counts, file fork space, and reverse mappings) consists of a set of fixed-size
records indexed with a classic B+ tree.
Directories have a set of fixed-size dirent records that point to the names,
and extended attributes have a set of fixed-size attribute keys that point to
names and values.
Quota counters and file link counters index records with numbers.
During a repair, scrub needs to stage new records during the gathering step and
retrieve them during the btree building step.

Although this requirement can be satisfied by calling the read and write
methods of the xfile directly, it is simpler for callers for there to be a
higher level abstraction to take care of computing array offsets, to provide
iterator functions, and to deal with sparse records and sorting.
The ``xfarray`` abstraction presents a linear array for fixed-size records atop
the byte-accessible xfile.

.. _xfarray_access_patterns:

Array Access Patterns
^^^^^^^^^^^^^^^^^^^^^

Array access patterns in online fsck tend to fall into three categories.
Iteration of records is assumed to be necessary for all cases and will be
covered in the next section.

The first type of caller handles records that are indexed by position.
Gaps may exist between records, and a record may be updated multiple times
during the collection step.
In other words, these callers want a sparse linearly addressed table file.
The typical use cases are quota records or file link count records.
Access to array elements is performed programmatically via ``xfarray_load`` and
``xfarray_store`` functions, which wrap the similarly-named xfile functions to
provide loading and storing of array elements at arbitrary array indices.
Gaps are defined to be null records, and null records are defined to be a
sequence of all zero bytes.
Null records are detected by calling ``xfarray_element_is_null``.
They are created either by calling ``xfarray_unset`` to null out an existing
record or by never storing anything to an array index.

The second type of caller handles records that are not indexed by position
and do not require multiple updates to a record.
The typical use case here is rebuilding space btrees and key/value btrees.
These callers can add records to the array without caring about array indices
via the ``xfarray_append`` function, which stores a record at the end of the
array.
For callers that require records to be presentable in a specific order (e.g.
rebuilding btree data), the ``xfarray_sort`` function can arrange the records
in sorted order; this function will be covered later.

The third type of caller is a bag, which is useful for counting records.
The typical use case here is constructing space extent reference counts from
reverse mapping information.
Records can be put in the bag in any order, they can be removed from the bag
at any time, and uniqueness of records is left to callers.
The ``xfarray_store_anywhere`` function is used to insert a record in any
null record slot in the bag; and the ``xfarray_unset`` function removes a
record from the bag.
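
The three access patterns can be illustrated with a small model in which a
zero-filled byte buffer takes the place of the xfile.
The ``struct xfarray_model`` type, the ``model_*`` names, the fixed capacity,
and the choice of ``-ENODATA`` for out-of-range indices are all assumptions
made for this sketch; they are not the kernel's signatures.

.. code-block:: c

	#include <errno.h>
	#include <stdbool.h>
	#include <stddef.h>
	#include <stdint.h>
	#include <string.h>

	#define MODEL_BYTES	4096	/* hypothetical backing store size */

	struct xfarray_model {
		unsigned char	bytes[MODEL_BYTES];	/* stands in for the xfile */
		size_t		obj_size;		/* bytes per record */
		uint64_t	nr;			/* one past the last index */
	};

	/* A null record is a sequence of all zero bytes. */
	static bool model_is_null(const struct xfarray_model *a, const void *rec)
	{
		const unsigned char *p = rec;
		size_t i;

		for (i = 0; i < a->obj_size; i++)
			if (p[i])
				return false;
		return true;
	}

	/* First pattern: store at an arbitrary index, leaving gaps behind. */
	static int model_store(struct xfarray_model *a, uint64_t idx,
			const void *rec)
	{
		if (idx >= MODEL_BYTES / a->obj_size)
			return -ENOMEM;
		memcpy(a->bytes + idx * a->obj_size, rec, a->obj_size);
		if (idx >= a->nr)
			a->nr = idx + 1;
		return 0;
	}

	static int model_load(const struct xfarray_model *a, uint64_t idx,
			void *rec)
	{
		if (idx >= a->nr)
			return -ENODATA;
		memcpy(rec, a->bytes + idx * a->obj_size, a->obj_size);
		return 0;
	}

	/* Null out an existing record. */
	static int model_unset(struct xfarray_model *a, uint64_t idx)
	{
		if (idx >= a->nr)
			return -ENODATA;
		memset(a->bytes + idx * a->obj_size, 0, a->obj_size);
		return 0;
	}

	/* Second pattern: add a record at the end of the array. */
	static int model_append(struct xfarray_model *a, const void *rec)
	{
		return model_store(a, a->nr, rec);
	}

	/* Third pattern: reuse any null slot, else grow the bag. */
	static int model_store_anywhere(struct xfarray_model *a, const void *rec)
	{
		uint64_t idx;

		for (idx = 0; idx < a->nr; idx++)
			if (model_is_null(a, a->bytes + idx * a->obj_size))
				return model_store(a, idx, rec);
		return model_append(a, rec);
	}

Note that because a null record is all zeroes, callers of such an array cannot
store a legitimately all-zero record and expect to find it again.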

The proposed patchset is the
`big in-memory array
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=big-array>`_.

Iterating Array Elements
^^^^^^^^^^^^^^^^^^^^^^^^

Most users of the xfarray require the ability to iterate the records stored in
the array.
Callers can probe every possible array index with the following:

.. code-block:: c

	xfarray_idx_t i;
	foreach_xfarray_idx(array, i) {
	    xfarray_load(array, i, &rec);

	    /* do something with rec */
	}

All users of this idiom must be prepared to handle null records or must already
know that there aren't any.

For xfarray users that want to iterate a sparse array, the ``xfarray_iter``
function ignores indices in the xfarray that have never been written to by
calling ``xfile_seek_data`` (which internally uses ``SEEK_DATA``) to skip areas
of the array that are not populated with memory pages.
Once it finds a page, it will skip the zeroed areas of the page.

.. code-block:: c

	xfarray_idx_t i = XFARRAY_CURSOR_INIT;
	while ((ret = xfarray_iter(array, &i, &rec)) == 1) {
	    /* do something with rec */
	}
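
The cursor contract of the second idiom can be shown with a simplified model.
This sketch assumes records are plain ``uint64_t`` values with zero as the
null record, and it checks every slot rather than skipping unpopulated pages;
``model_iter`` is a hypothetical name.

.. code-block:: c

	#include <stddef.h>
	#include <stdint.h>

	/*
	 * Copy the next non-null record into *rec and advance the cursor past
	 * it.  Returns 1 if a record was found or 0 at the end of the array.
	 */
	static int model_iter(const uint64_t *recs, size_t nr, size_t *cur,
			uint64_t *rec)
	{
		while (*cur < nr) {
			uint64_t v = recs[(*cur)++];

			if (v != 0) {
				*rec = v;
				return 1;
			}
		}
		return 0;
	}

Unlike the index-probing loop, callers of an iterator shaped like this never
see null records.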

.. _xfarray_sort:

Sorting Array Elements
^^^^^^^^^^^^^^^^^^^^^^

During the fourth demonstration of online repair, a community reviewer remarked
that for performance reasons, online repair ought to load batches of records
into btree record blocks instead of inserting records into a new btree one at a
time.
The btree insertion code in XFS is responsible for maintaining correct ordering
of the records.
Bulk loading does not go through that code, so naturally the xfarray must also
support sorting the record set prior to bulk loading.

Case Study: Sorting xfarrays
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The sorting algorithm used in the xfarray is actually a combination of adaptive
quicksort and a heapsort subalgorithm in the spirit of
`Sedgewick <https://algs4.cs.princeton.edu/23quicksort/>`_ and
`pdqsort <https://github.com/orlp/pdqsort>`_, with customizations for the Linux
kernel.
To sort records in a reasonably short amount of time, ``xfarray`` takes
advantage of the binary subpartitioning offered by quicksort, but it also uses
heapsort to hedge against performance collapse if the chosen quicksort pivots
are poor.
Both algorithms are (in general) O(n * lg(n)), but there is a wide performance
gulf between the two implementations.

The Linux kernel already contains a reasonably fast implementation of heapsort.
It only operates on regular C arrays, which limits the scope of its usefulness.
There are two key places where the xfarray uses it:

* Sorting any record subset backed by a single xfile page.

* Loading a small number of xfarray records from potentially disparate parts
  of the xfarray into a memory buffer, and sorting the buffer.

In other words, ``xfarray`` uses heapsort to constrain the nested recursion of
quicksort, thereby mitigating quicksort's worst runtime behavior.

Choosing a quicksort pivot is a tricky business.
A good pivot splits the set to sort in half, leading to the divide and conquer
behavior that is crucial to O(n * lg(n)) performance.
A poor pivot barely splits the subset at all, leading to O(n\ :sup:`2`)
runtime.
The xfarray sort routine tries to avoid picking a bad pivot by sampling nine
records into a memory buffer and using the kernel heapsort to identify the
median of the nine.

Most modern quicksort implementations employ Tukey's "ninther" to select a
pivot from a classic C array.
Typical ninther implementations pick three unique triads of records, sort each
of the triads, and then sort the middle value of each triad to determine the
ninther value.
As stated previously, however, xfile accesses are not entirely cheap.
It turned out to be much more performant to read the nine elements into a
memory buffer, run the kernel's in-memory heapsort on the buffer, and choose
the middle element of the sorted buffer (index 4, counting from zero) as the
pivot.
Tukey's ninthers are described in J. W. Tukey, `The ninther, a technique for
low-effort robust (resistant) location in large samples`, in *Contributions to
Survey Sampling and Applied Statistics*, edited by H. David, (Academic Press,
1978), pp. 251–257.

The partitioning of quicksort is fairly textbook -- rearrange the record
subset around the pivot, then set up the current and next stack frames to
sort the larger and the smaller partitions, respectively.
This keeps the stack space requirements to log2(record count).

As a final performance optimization, the hi and lo scanning phase of quicksort
keeps examined xfile pages mapped in the kernel for as long as possible to
reduce map/unmap cycles.
Surprisingly, this reduces overall sort runtime by nearly half again after
accounting for the application of heapsort directly onto xfile pages.
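
The overall shape of the algorithm can be sketched on an ordinary C array.
This is a userspace illustration, not the xfarray code: records are plain
``uint64_t`` values, the C library's ``qsort`` stands in for the kernel's
heapsort, the samples are simply evenly spaced, and ``MODEL_SMALL_RUN`` is an
arbitrary cutoff chosen for the example.

.. code-block:: c

	#include <stdint.h>
	#include <stdlib.h>

	#define MODEL_SMALL_RUN	16	/* hypothetical in-memory sort cutoff */

	struct sort_range {
		long	lo;		/* inclusive bounds of a record subset */
		long	hi;
	};

	static int cmp_u64(const void *a, const void *b)
	{
		uint64_t x = *(const uint64_t *)a;
		uint64_t y = *(const uint64_t *)b;

		return (x > y) - (x < y);
	}

	/* Copy nine samples into a buffer, sort it, and take the median. */
	static uint64_t model_pivot(const uint64_t *recs, struct sort_range r)
	{
		uint64_t samples[9];
		long step = (r.hi - r.lo) / 8;
		int i;

		for (i = 0; i < 9; i++)
			samples[i] = recs[r.lo + i * step];
		qsort(samples, 9, sizeof(samples[0]), cmp_u64);
		return samples[4];
	}

	static void model_sort(uint64_t *recs, long nr)
	{
		struct sort_range stack[64];	/* deferred larger partitions */
		struct sort_range r = { 0, nr - 1 };
		int top = 0;

		for (;;) {
			while (r.hi - r.lo + 1 > MODEL_SMALL_RUN) {
				uint64_t pivot = model_pivot(recs, r);
				long i = r.lo, j = r.hi;

				while (i <= j) {	/* textbook partition */
					while (recs[i] < pivot)
						i++;
					while (recs[j] > pivot)
						j--;
					if (i <= j) {
						uint64_t tmp = recs[i];

						recs[i++] = recs[j];
						recs[j--] = tmp;
					}
				}
				/* Defer the larger side, continue with the smaller. */
				if (j - r.lo > r.hi - i) {
					stack[top++] = (struct sort_range){ r.lo, j };
					r.lo = i;
				} else {
					stack[top++] = (struct sort_range){ i, r.hi };
					r.hi = j;
				}
			}
			if (r.hi > r.lo)	/* small run: sort it in memory */
				qsort(recs + r.lo, r.hi - r.lo + 1,
				      sizeof(*recs), cmp_u64);
			if (top == 0)
				break;
			r = stack[--top];
		}
	}

Because the smaller partition is always processed first, each deferred frame
is at least as large as the work remaining above it, which is what bounds the
explicit stack at roughly log2(record count) entries.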
21555f658dadSDarrick J. Wong
2156a26aa252SDarrick J. Wong.. _xfblob:
2157a26aa252SDarrick J. Wong
21585f658dadSDarrick J. WongBlob Storage
21595f658dadSDarrick J. Wong````````````
21605f658dadSDarrick J. Wong
21615f658dadSDarrick J. WongExtended attributes and directories add an additional requirement for staging
21625f658dadSDarrick J. Wongrecords: arbitrary byte sequences of finite length.
Each directory entry record needs to store the entry name,
21645f658dadSDarrick J. Wongand each extended attribute needs to store both the attribute name and value.
21655f658dadSDarrick J. WongThe names, keys, and values can consume a large amount of memory, so the
21665f658dadSDarrick J. Wong``xfblob`` abstraction was created to simplify management of these blobs
21675f658dadSDarrick J. Wongatop an xfile.
21685f658dadSDarrick J. Wong
21695f658dadSDarrick J. WongBlob arrays provide ``xfblob_load`` and ``xfblob_store`` functions to retrieve
21705f658dadSDarrick J. Wongand persist objects.
21715f658dadSDarrick J. WongThe store function returns a magic cookie for every object that it persists.
Later, callers provide this cookie to ``xfblob_load`` to recall the object.
21735f658dadSDarrick J. WongThe ``xfblob_free`` function frees a specific blob, and the ``xfblob_truncate``
21745f658dadSDarrick J. Wongfunction frees them all because compaction is not needed.
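
The cookie pattern can be sketched in userspace as follows.
This toy model is an assumption-laden illustration, not the ``xfblob`` code:
a fixed array stands in for the xfile, the cookie is simply the byte offset of
a length header, and every ``toy_`` name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_BLOB_SPACE 4096

typedef uint64_t toy_cookie_t;

struct toy_blobs {
	unsigned char	space[TOY_BLOB_SPACE];	/* stand-in for the xfile */
	size_t		used;			/* next free byte */
};

/* Append a length header and the payload; the cookie is the header offset. */
static int toy_blob_store(struct toy_blobs *b, const void *data,
			  uint32_t size, toy_cookie_t *cookie)
{
	assert(b->used <= TOY_BLOB_SPACE);
	if (size > TOY_BLOB_SPACE - sizeof(size) ||
	    b->used > TOY_BLOB_SPACE - sizeof(size) - size)
		return -1;	/* out of space */

	*cookie = b->used;
	memcpy(&b->space[b->used], &size, sizeof(size));
	memcpy(&b->space[b->used + sizeof(size)], data, size);
	b->used += sizeof(size) + size;
	return 0;
}

/* Recall a blob by cookie; fails if the expected size does not match. */
static int toy_blob_load(const struct toy_blobs *b, toy_cookie_t cookie,
			 void *data, uint32_t size)
{
	uint32_t stored;

	if (cookie > b->used || b->used - cookie < sizeof(stored))
		return -1;
	memcpy(&stored, &b->space[cookie], sizeof(stored));
	if (stored != size || b->used - cookie - sizeof(stored) < stored)
		return -1;
	memcpy(data, &b->space[cookie + sizeof(stored)], stored);
	return 0;
}

/* Drop every blob at once; no compaction is ever attempted. */
static void toy_blob_truncate(struct toy_blobs *b)
{
	b->used = 0;
}
```

A caller would keep only the fixed-size cookie in its staging record and fetch
the variable-length name or value on demand.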
21755f658dadSDarrick J. Wong
21765f658dadSDarrick J. WongThe details of repairing directories and extended attributes will be discussed
21775f658dadSDarrick J. Wongin a subsequent section about atomic extent swapping.
21785f658dadSDarrick J. WongHowever, it should be noted that these repair functions only use blob storage
21795f658dadSDarrick J. Wongto cache a small number of entries before adding them to a temporary ondisk
21805f658dadSDarrick J. Wongfile, which is why compaction is not required.
21815f658dadSDarrick J. Wong
21825f658dadSDarrick J. WongThe proposed patchset is at the start of the
21835f658dadSDarrick J. Wong`extended attribute repair
21845f658dadSDarrick J. Wong<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-xattrs>`_ series.
21855f658dadSDarrick J. Wong
21865f658dadSDarrick J. Wong.. _xfbtree:
21875f658dadSDarrick J. Wong
21885f658dadSDarrick J. WongIn-Memory B+Trees
21895f658dadSDarrick J. Wong`````````````````
21905f658dadSDarrick J. Wong
21915f658dadSDarrick J. WongThe chapter about :ref:`secondary metadata<secondary_metadata>` mentioned that
21925f658dadSDarrick J. Wongchecking and repairing of secondary metadata commonly requires coordination
21935f658dadSDarrick J. Wongbetween a live metadata scan of the filesystem and writer threads that are
21945f658dadSDarrick J. Wongupdating that metadata.
Keeping the scan data up to date requires the ability to propagate
21965f658dadSDarrick J. Wongmetadata updates from the filesystem into the data being collected by the scan.
21975f658dadSDarrick J. WongThis *can* be done by appending concurrent updates into a separate log file and
21985f658dadSDarrick J. Wongapplying them before writing the new metadata to disk, but this leads to
21995f658dadSDarrick J. Wongunbounded memory consumption if the rest of the system is very busy.
22005f658dadSDarrick J. WongAnother option is to skip the side-log and commit live updates from the
22015f658dadSDarrick J. Wongfilesystem directly into the scan data, which trades more overhead for a lower
22025f658dadSDarrick J. Wongmaximum memory requirement.
22035f658dadSDarrick J. WongIn both cases, the data structure holding the scan results must support indexed
22045f658dadSDarrick J. Wongaccess to perform well.
22055f658dadSDarrick J. Wong
Given that indexed lookups of scan data are required for both strategies, online
22075f658dadSDarrick J. Wongfsck employs the second strategy of committing live updates directly into
22085f658dadSDarrick J. Wongscan data.
22095f658dadSDarrick J. WongBecause xfarrays are not indexed and do not enforce record ordering, they
22105f658dadSDarrick J. Wongare not suitable for this task.
22115f658dadSDarrick J. WongConveniently, however, XFS has a library to create and maintain ordered reverse
22125f658dadSDarrick J. Wongmapping records: the existing rmap btree code!
If only there were a means to create one in memory.
22145f658dadSDarrick J. Wong
22155f658dadSDarrick J. WongRecall that the :ref:`xfile <xfile>` abstraction represents memory pages as a
22165f658dadSDarrick J. Wongregular file, which means that the kernel can create byte or block addressable
22175f658dadSDarrick J. Wongvirtual address spaces at will.
The XFS buffer cache specializes in abstracting IO to block-oriented address
22195f658dadSDarrick J. Wongspaces, which means that adaptation of the buffer cache to interface with
22205f658dadSDarrick J. Wongxfiles enables reuse of the entire btree library.
22215f658dadSDarrick J. WongBtrees built atop an xfile are collectively known as ``xfbtrees``.
22225f658dadSDarrick J. WongThe next few sections describe how they actually work.
22235f658dadSDarrick J. Wong
22245f658dadSDarrick J. WongThe proposed patchset is the
22255f658dadSDarrick J. Wong`in-memory btree
22265f658dadSDarrick J. Wong<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=in-memory-btrees>`_
22275f658dadSDarrick J. Wongseries.
22285f658dadSDarrick J. Wong
22295f658dadSDarrick J. WongUsing xfiles as a Buffer Cache Target
22305f658dadSDarrick J. Wong^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
22315f658dadSDarrick J. Wong
22325f658dadSDarrick J. WongTwo modifications are necessary to support xfiles as a buffer cache target.
22335f658dadSDarrick J. WongThe first is to make it possible for the ``struct xfs_buftarg`` structure to
22345f658dadSDarrick J. Wonghost the ``struct xfs_buf`` rhashtable, because normally those are held by a
22355f658dadSDarrick J. Wongper-AG structure.
22365f658dadSDarrick J. WongThe second change is to modify the buffer ``ioapply`` function to "read" cached
22375f658dadSDarrick J. Wongpages from the xfile and "write" cached pages back to the xfile.
Concurrent access to individual buffers is controlled by the ``xfs_buf`` lock,
22395f658dadSDarrick J. Wongsince the xfile does not provide any locking on its own.
22405f658dadSDarrick J. WongWith this adaptation in place, users of the xfile-backed buffer cache use
22415f658dadSDarrick J. Wongexactly the same APIs as users of the disk-backed buffer cache.
22425f658dadSDarrick J. WongThe separation between xfile and buffer cache implies higher memory usage since
22435f658dadSDarrick J. Wongthey do not share pages, but this property could some day enable transactional
22445f658dadSDarrick J. Wongupdates to an in-memory btree.
22455f658dadSDarrick J. WongToday, however, it simply eliminates the need for new code.
22465f658dadSDarrick J. Wong
22475f658dadSDarrick J. WongSpace Management with an xfbtree
22485f658dadSDarrick J. Wong^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
22495f658dadSDarrick J. Wong
22505f658dadSDarrick J. WongSpace management for an xfile is very simple -- each btree block is one memory
22515f658dadSDarrick J. Wongpage in size.
22525f658dadSDarrick J. WongThese blocks use the same header format as an on-disk btree, but the in-memory
22535f658dadSDarrick J. Wongblock verifiers ignore the checksums, assuming that xfile memory is no more
22545f658dadSDarrick J. Wongcorruption-prone than regular DRAM.
22555f658dadSDarrick J. WongReusing existing code here is more important than absolute memory efficiency.
22565f658dadSDarrick J. Wong
The very first block of an xfile backing an xfbtree is a header block.
22585f658dadSDarrick J. WongThe header describes the owner, height, and the block number of the root
22595f658dadSDarrick J. Wongxfbtree block.
22605f658dadSDarrick J. Wong
22615f658dadSDarrick J. WongTo allocate a btree block, use ``xfile_seek_data`` to find a gap in the file.
22625f658dadSDarrick J. WongIf there are no gaps, create one by extending the length of the xfile.
22635f658dadSDarrick J. WongPreallocate space for the block with ``xfile_prealloc``, and hand back the
22645f658dadSDarrick J. Wonglocation.
22655f658dadSDarrick J. WongTo free an xfbtree block, use ``xfile_discard`` (which internally uses
22665f658dadSDarrick J. Wong``FALLOC_FL_PUNCH_HOLE``) to remove the memory page from the xfile.
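
The allocate-from-gap and punch-hole policy can be modeled with a short
userspace sketch.
It assumes a fixed array of "page present" flags in place of a real xfile, so
the scan for a hole, the preallocation, and the discard are all simulated; the
``toy_`` names are hypothetical and none of the real ``xfile_*`` functions are
called.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_PAGES 64

struct toy_xfile {
	bool	present[TOY_MAX_PAGES];	/* true if the page is backed */
	size_t	nr_pages;		/* current file length in pages */
};

/* Allocate one page-sized block; returns its page index or -1 if full. */
static long toy_alloc_block(struct toy_xfile *xf)
{
	size_t i;

	/* Look for a hole left behind by an earlier free. */
	for (i = 0; i < xf->nr_pages; i++) {
		if (!xf->present[i])
			goto found;
	}

	/* No holes: extend the file by one page. */
	if (xf->nr_pages == TOY_MAX_PAGES)
		return -1;
	i = xf->nr_pages++;
found:
	xf->present[i] = true;	/* "preallocate" the backing page */
	return (long)i;
}

/* Free a block by punching a hole where its page used to be. */
static void toy_free_block(struct toy_xfile *xf, size_t pgno)
{
	assert(pgno < xf->nr_pages && xf->present[pgno]);
	xf->present[pgno] = false;
}
```

In this model the first allocation would naturally be the header block, and
freed pages are reused before the file grows.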
22675f658dadSDarrick J. Wong
22685f658dadSDarrick J. WongPopulating an xfbtree
22695f658dadSDarrick J. Wong^^^^^^^^^^^^^^^^^^^^^
22705f658dadSDarrick J. Wong
22715f658dadSDarrick J. WongAn online fsck function that wants to create an xfbtree should proceed as
22725f658dadSDarrick J. Wongfollows:
22735f658dadSDarrick J. Wong
22745f658dadSDarrick J. Wong1. Call ``xfile_create`` to create an xfile.
22755f658dadSDarrick J. Wong
22765f658dadSDarrick J. Wong2. Call ``xfs_alloc_memory_buftarg`` to create a buffer cache target structure
22775f658dadSDarrick J. Wong   pointing to the xfile.
22785f658dadSDarrick J. Wong
22795f658dadSDarrick J. Wong3. Pass the buffer cache target, buffer ops, and other information to
22805f658dadSDarrick J. Wong   ``xfbtree_create`` to write an initial tree header and root block to the
22815f658dadSDarrick J. Wong   xfile.
22825f658dadSDarrick J. Wong   Each btree type should define a wrapper that passes necessary arguments to
22835f658dadSDarrick J. Wong   the creation function.
22845f658dadSDarrick J. Wong   For example, rmap btrees define ``xfs_rmapbt_mem_create`` to take care of
22855f658dadSDarrick J. Wong   all the necessary details for callers.
22865f658dadSDarrick J. Wong   A ``struct xfbtree`` object will be returned.
22875f658dadSDarrick J. Wong
22885f658dadSDarrick J. Wong4. Pass the xfbtree object to the btree cursor creation function for the
22895f658dadSDarrick J. Wong   btree type.
22905f658dadSDarrick J. Wong   Following the example above, ``xfs_rmapbt_mem_cursor`` takes care of this
22915f658dadSDarrick J. Wong   for callers.
22925f658dadSDarrick J. Wong
22935f658dadSDarrick J. Wong5. Pass the btree cursor to the regular btree functions to make queries against
22945f658dadSDarrick J. Wong   and to update the in-memory btree.
22955f658dadSDarrick J. Wong   For example, a btree cursor for an rmap xfbtree can be passed to the
22965f658dadSDarrick J. Wong   ``xfs_rmap_*`` functions just like any other btree cursor.
22975f658dadSDarrick J. Wong   See the :ref:`next section<xfbtree_commit>` for information on dealing with
22985f658dadSDarrick J. Wong   xfbtree updates that are logged to a transaction.
22995f658dadSDarrick J. Wong
23005f658dadSDarrick J. Wong6. When finished, delete the btree cursor, destroy the xfbtree object, free the
   buffer target, and then destroy the xfile to release all resources.
23025f658dadSDarrick J. Wong
23035f658dadSDarrick J. Wong.. _xfbtree_commit:
23045f658dadSDarrick J. Wong
23055f658dadSDarrick J. WongCommitting Logged xfbtree Buffers
23065f658dadSDarrick J. Wong^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
23075f658dadSDarrick J. Wong
23085f658dadSDarrick J. WongAlthough it is a clever hack to reuse the rmap btree code to handle the staging
23095f658dadSDarrick J. Wongstructure, the ephemeral nature of the in-memory btree block storage presents
23105f658dadSDarrick J. Wongsome challenges of its own.
23115f658dadSDarrick J. WongThe XFS transaction manager must not commit buffer log items for buffers backed
23125f658dadSDarrick J. Wongby an xfile because the log format does not understand updates for devices
23135f658dadSDarrick J. Wongother than the data device.
23145f658dadSDarrick J. WongAn ephemeral xfbtree probably will not exist by the time the AIL checkpoints
23155f658dadSDarrick J. Wonglog transactions back into the filesystem, and certainly won't exist during
23165f658dadSDarrick J. Wonglog recovery.
23175f658dadSDarrick J. WongFor these reasons, any code updating an xfbtree in transaction context must
23185f658dadSDarrick J. Wongremove the buffer log items from the transaction and write the updates into the
23195f658dadSDarrick J. Wongbacking xfile before committing or cancelling the transaction.
23205f658dadSDarrick J. Wong
23215f658dadSDarrick J. WongThe ``xfbtree_trans_commit`` and ``xfbtree_trans_cancel`` functions implement
23225f658dadSDarrick J. Wongthis functionality as follows:
23235f658dadSDarrick J. Wong
23245f658dadSDarrick J. Wong1. Find each buffer log item whose buffer targets the xfile.
23255f658dadSDarrick J. Wong
23265f658dadSDarrick J. Wong2. Record the dirty/ordered status of the log item.
23275f658dadSDarrick J. Wong
23285f658dadSDarrick J. Wong3. Detach the log item from the buffer.
23295f658dadSDarrick J. Wong
23305f658dadSDarrick J. Wong4. Queue the buffer to a special delwri list.
23315f658dadSDarrick J. Wong
23325f658dadSDarrick J. Wong5. Clear the transaction dirty flag if the only dirty log items were the ones
23335f658dadSDarrick J. Wong   that were detached in step 3.
23345f658dadSDarrick J. Wong
23355f658dadSDarrick J. Wong6. Submit the delwri list to commit the changes to the xfile, if the updates
23365f658dadSDarrick J. Wong   are being committed.
23375f658dadSDarrick J. Wong
23385f658dadSDarrick J. WongAfter removing xfile logged buffers from the transaction in this manner, the
23395f658dadSDarrick J. Wongtransaction can be committed or cancelled.
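
The six steps can be pictured with a toy model.
The structures below are hypothetical simplifications (a real transaction
tracks far more state), and the sketch assumes that only dirty buffers need to
be queued for writeback; it is meant to show the control flow, not the kernel
data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_log_item {
	bool	targets_xfile;	/* buffer is backed by the xfile */
	bool	dirty;		/* item was dirtied by this transaction */
	bool	attached;	/* still attached to the transaction */
	bool	queued;		/* buffer sits on the delwri list */
	bool	written;	/* contents reached the xfile */
};

struct toy_trans {
	struct toy_log_item	*items;
	size_t			nr_items;
	bool			dirty;
};

/* Strip xfile-backed items from the transaction before commit or cancel. */
static void toy_detach_xfile_items(struct toy_trans *tp, bool commit)
{
	bool other_dirty = false;
	size_t i;

	for (i = 0; i < tp->nr_items; i++) {
		struct toy_log_item *lip = &tp->items[i];

		if (!lip->attached)
			continue;
		if (!lip->targets_xfile) {
			if (lip->dirty)
				other_dirty = true;
			continue;
		}
		/* Steps 1-4: note the dirty state, detach, queue. */
		lip->attached = false;
		lip->queued = lip->dirty;
	}

	/* Step 5: nothing else dirtied the transaction. */
	if (!other_dirty)
		tp->dirty = false;

	/* Step 6: write the queued buffers only when committing. */
	for (i = 0; i < tp->nr_items; i++) {
		struct toy_log_item *lip = &tp->items[i];

		if (lip->queued) {
			lip->written = commit;
			lip->queued = false;
		}
	}
}
```

When ``commit`` is false the queued buffers are dropped without being written,
which models the cancel path.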
23407fb8ccffSDarrick J. Wong
23417fb8ccffSDarrick J. WongBulk Loading of Ondisk B+Trees
23427fb8ccffSDarrick J. Wong------------------------------
23437fb8ccffSDarrick J. Wong
23447fb8ccffSDarrick J. WongAs mentioned previously, early iterations of online repair built new btree
23457fb8ccffSDarrick J. Wongstructures by creating a new btree and adding observations individually.
23467fb8ccffSDarrick J. WongLoading a btree one record at a time had a slight advantage of not requiring
23477fb8ccffSDarrick J. Wongthe incore records to be sorted prior to commit, but was very slow and leaked
23487fb8ccffSDarrick J. Wongblocks if the system went down during a repair.
23497fb8ccffSDarrick J. WongLoading records one at a time also meant that repair could not control the
23507fb8ccffSDarrick J. Wongloading factor of the blocks in the new btree.
23517fb8ccffSDarrick J. Wong
23527fb8ccffSDarrick J. WongFortunately, the venerable ``xfs_repair`` tool had a more efficient means for
23537fb8ccffSDarrick J. Wongrebuilding a btree index from a collection of records -- bulk btree loading.
23547fb8ccffSDarrick J. WongThis was implemented rather inefficiently code-wise, since ``xfs_repair``
23557fb8ccffSDarrick J. Wonghad separate copy-pasted implementations for each btree type.
23567fb8ccffSDarrick J. Wong
To prepare for online fsck, each of the four bulk loaders was studied, notes
23587fb8ccffSDarrick J. Wongwere taken, and the four were refactored into a single generic btree bulk
23597fb8ccffSDarrick J. Wongloading mechanism.
23607fb8ccffSDarrick J. WongThose notes in turn have been refreshed and are presented below.
23617fb8ccffSDarrick J. Wong
23627fb8ccffSDarrick J. WongGeometry Computation
23637fb8ccffSDarrick J. Wong````````````````````
23647fb8ccffSDarrick J. Wong
23657fb8ccffSDarrick J. WongThe zeroth step of bulk loading is to assemble the entire record set that will
23667fb8ccffSDarrick J. Wongbe stored in the new btree, and sort the records.
23677fb8ccffSDarrick J. WongNext, call ``xfs_btree_bload_compute_geometry`` to compute the shape of the
23687fb8ccffSDarrick J. Wongbtree from the record set, the type of btree, and any load factor preferences.
23697fb8ccffSDarrick J. WongThis information is required for resource reservation.
23707fb8ccffSDarrick J. Wong
23717fb8ccffSDarrick J. WongFirst, the geometry computation computes the minimum and maximum records that
23727fb8ccffSDarrick J. Wongwill fit in a leaf block from the size of a btree block and the size of the
23737fb8ccffSDarrick J. Wongblock header.
23747fb8ccffSDarrick J. WongRoughly speaking, the maximum number of records is::
23757fb8ccffSDarrick J. Wong
23767fb8ccffSDarrick J. Wong        maxrecs = (block_size - header_size) / record_size
23777fb8ccffSDarrick J. Wong
23787fb8ccffSDarrick J. WongThe XFS design specifies that btree blocks should be merged when possible,
23797fb8ccffSDarrick J. Wongwhich means the minimum number of records is half of maxrecs::
23807fb8ccffSDarrick J. Wong
23817fb8ccffSDarrick J. Wong        minrecs = maxrecs / 2
23827fb8ccffSDarrick J. Wong
23837fb8ccffSDarrick J. WongThe next variable to determine is the desired loading factor.
23847fb8ccffSDarrick J. WongThis must be at least minrecs and no more than maxrecs.
23857fb8ccffSDarrick J. WongChoosing minrecs is undesirable because it wastes half the block.
23867fb8ccffSDarrick J. WongChoosing maxrecs is also undesirable because adding a single record to each
23877fb8ccffSDarrick J. Wongnewly rebuilt leaf block will cause a tree split, which causes a noticeable
23887fb8ccffSDarrick J. Wongdrop in performance immediately afterwards.
23897fb8ccffSDarrick J. WongThe default loading factor was chosen to be 75% of maxrecs, which provides a
23907fb8ccffSDarrick J. Wongreasonably compact structure without any immediate split penalties::
23917fb8ccffSDarrick J. Wong
23927fb8ccffSDarrick J. Wong        default_load_factor = (maxrecs + minrecs) / 2
23937fb8ccffSDarrick J. Wong
23947fb8ccffSDarrick J. WongIf space is tight, the loading factor will be set to maxrecs to try to avoid
23957fb8ccffSDarrick J. Wongrunning out of space::
23967fb8ccffSDarrick J. Wong
23977fb8ccffSDarrick J. Wong        leaf_load_factor = enough space ? default_load_factor : maxrecs
23987fb8ccffSDarrick J. Wong
23997fb8ccffSDarrick J. WongLoad factor is computed for btree node blocks using the combined size of the
24007fb8ccffSDarrick J. Wongbtree key and pointer as the record size::
24017fb8ccffSDarrick J. Wong
24027fb8ccffSDarrick J. Wong        maxrecs = (block_size - header_size) / (key_size + ptr_size)
24037fb8ccffSDarrick J. Wong        minrecs = maxrecs / 2
24047fb8ccffSDarrick J. Wong        node_load_factor = enough space ? default_load_factor : maxrecs
24057fb8ccffSDarrick J. Wong
24067fb8ccffSDarrick J. WongOnce that's done, the number of leaf blocks required to store the record set
24077fb8ccffSDarrick J. Wongcan be computed as::
24087fb8ccffSDarrick J. Wong
24097fb8ccffSDarrick J. Wong        leaf_blocks = ceil(record_count / leaf_load_factor)
24107fb8ccffSDarrick J. Wong
24117fb8ccffSDarrick J. WongThe number of node blocks needed to point to the next level down in the tree
24127fb8ccffSDarrick J. Wongis computed as::
24137fb8ccffSDarrick J. Wong
24147fb8ccffSDarrick J. Wong        n_blocks = (n == 0 ? leaf_blocks : node_blocks[n])
24157fb8ccffSDarrick J. Wong        node_blocks[n + 1] = ceil(n_blocks / node_load_factor)
24167fb8ccffSDarrick J. Wong
24177fb8ccffSDarrick J. WongThe entire computation is performed recursively until the current level only
24187fb8ccffSDarrick J. Wongneeds one block.
24197fb8ccffSDarrick J. WongThe resulting geometry is as follows:
24207fb8ccffSDarrick J. Wong
24217fb8ccffSDarrick J. Wong- For AG-rooted btrees, this level is the root level, so the height of the new
24227fb8ccffSDarrick J. Wong  tree is ``level + 1`` and the space needed is the summation of the number of
24237fb8ccffSDarrick J. Wong  blocks on each level.
24247fb8ccffSDarrick J. Wong
24257fb8ccffSDarrick J. Wong- For inode-rooted btrees where the records in the top level do not fit in the
24267fb8ccffSDarrick J. Wong  inode fork area, the height is ``level + 2``, the space needed is the
24277fb8ccffSDarrick J. Wong  summation of the number of blocks on each level, and the inode fork points to
24287fb8ccffSDarrick J. Wong  the root block.
24297fb8ccffSDarrick J. Wong
24307fb8ccffSDarrick J. Wong- For inode-rooted btrees where the records in the top level can be stored in
  the inode fork area, the root block can be stored in the inode, the
24327fb8ccffSDarrick J. Wong  height is ``level + 1``, and the space needed is one less than the summation
24337fb8ccffSDarrick J. Wong  of the number of blocks on each level.
24347fb8ccffSDarrick J. Wong  This only becomes relevant when non-bmap btrees gain the ability to root in
24357fb8ccffSDarrick J. Wong  an inode, which is a future patchset and only included here for completeness.
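
The formulas above can be combined into a short worked sketch for the
AG-rooted case.
This is not ``xfs_btree_bload_compute_geometry``; the function names, the
``struct toy_geometry`` layout, and the block, header, record, key, and pointer
sizes used in the example are assumptions chosen only to make the arithmetic
concrete.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_MAX_LEVELS 16

struct toy_geometry {
	unsigned int	height;				/* levels, leaves included */
	uint64_t	level_blocks[TOY_MAX_LEVELS];	/* [0] is the leaf level */
	uint64_t	total_blocks;			/* sum over all levels */
};

static uint64_t ceil_div(uint64_t a, uint64_t b)
{
	return (a + b - 1) / b;
}

/* Pick the per-block load factor: 75% full unless space is tight. */
static uint64_t load_factor(uint64_t block_size, uint64_t header_size,
			    uint64_t entry_size, bool enough_space)
{
	uint64_t maxrecs = (block_size - header_size) / entry_size;
	uint64_t minrecs = maxrecs / 2;

	return enough_space ? (maxrecs + minrecs) / 2 : maxrecs;
}

/* Shape of an AG-rooted btree: add levels until one block suffices. */
static void compute_geometry(struct toy_geometry *g, uint64_t record_count,
			     uint64_t leaf_load, uint64_t node_load)
{
	unsigned int level = 0;

	assert(record_count > 0 && leaf_load > 0 && node_load > 1);
	g->level_blocks[0] = ceil_div(record_count, leaf_load);
	g->total_blocks = g->level_blocks[0];

	while (g->level_blocks[level] > 1) {
		assert(level + 1 < TOY_MAX_LEVELS);
		g->level_blocks[level + 1] =
			ceil_div(g->level_blocks[level], node_load);
		g->total_blocks += g->level_blocks[level + 1];
		level++;
	}
	g->height = level + 1;
}
```

With an assumed 4096-byte block, 56-byte header, 16-byte records, and 8 bytes
per key and pointer pair, the leaf load factor is 189 and the node load factor
is 378.
Loading 100,000 records then needs 530 leaf blocks, 2 node blocks, and 1 root
block, for a height of 3 and 533 blocks in total.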
24367fb8ccffSDarrick J. Wong
24377fb8ccffSDarrick J. Wong.. _newbt:
24387fb8ccffSDarrick J. Wong
24397fb8ccffSDarrick J. WongReserving New B+Tree Blocks
24407fb8ccffSDarrick J. Wong```````````````````````````
24417fb8ccffSDarrick J. Wong
24427fb8ccffSDarrick J. WongOnce repair knows the number of blocks needed for the new btree, it allocates
24437fb8ccffSDarrick J. Wongthose blocks using the free space information.
24447fb8ccffSDarrick J. WongEach reserved extent is tracked separately by the btree builder state data.
24457fb8ccffSDarrick J. WongTo improve crash resilience, the reservation code also logs an Extent Freeing
24467fb8ccffSDarrick J. WongIntent (EFI) item in the same transaction as each space allocation and attaches
24477fb8ccffSDarrick J. Wongits in-memory ``struct xfs_extent_free_item`` object to the space reservation.
24487fb8ccffSDarrick J. WongIf the system goes down, log recovery will use the unfinished EFIs to free the
unused space, leaving the filesystem unchanged.
24507fb8ccffSDarrick J. Wong
24517fb8ccffSDarrick J. WongEach time the btree builder claims a block for the btree from a reserved
24527fb8ccffSDarrick J. Wongextent, it updates the in-memory reservation to reflect the claimed space.
24537fb8ccffSDarrick J. WongBlock reservation tries to allocate as much contiguous space as possible to
24547fb8ccffSDarrick J. Wongreduce the number of EFIs in play.
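
Claiming blocks from the reserved extents might look like the following
sketch.
``struct toy_resv`` and ``toy_claim_block`` are hypothetical names, and the
model omits the EFI that the real reservation carries.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct toy_resv {
	uint64_t	start;	/* first block of the reserved extent */
	uint64_t	len;	/* number of blocks in the extent */
	uint64_t	used;	/* blocks already claimed by the builder */
};

/*
 * Claim the next unused block from the first reservation that still has
 * room.  Returns 0 and sets *bno on success, or -1 if nothing is left.
 */
static int toy_claim_block(struct toy_resv *resv, size_t nr_resv,
			   uint64_t *bno)
{
	size_t i;

	for (i = 0; i < nr_resv; i++) {
		assert(resv[i].used <= resv[i].len);
		if (resv[i].used < resv[i].len) {
			/* Update the incore reservation to reflect the claim. */
			*bno = resv[i].start + resv[i].used++;
			return 0;
		}
	}
	return -1;
}
```

Fewer, longer extents mean fewer entries in this list, which mirrors the
preference for contiguous allocations.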
24557fb8ccffSDarrick J. Wong
24567fb8ccffSDarrick J. WongWhile repair is writing these new btree blocks, the EFIs created for the space
24577fb8ccffSDarrick J. Wongreservations pin the tail of the ondisk log.
24587fb8ccffSDarrick J. WongIt's possible that other parts of the system will remain busy and push the head
24597fb8ccffSDarrick J. Wongof the log towards the pinned tail.
24607fb8ccffSDarrick J. WongTo avoid livelocking the filesystem, the EFIs must not pin the tail of the log
24617fb8ccffSDarrick J. Wongfor too long.
24627fb8ccffSDarrick J. WongTo alleviate this problem, the dynamic relogging capability of the deferred ops
24637fb8ccffSDarrick J. Wongmechanism is reused here to commit a transaction at the log head containing an
EFD for the old EFI and a new EFI to replace it.
24657fb8ccffSDarrick J. WongThis enables the log to release the old EFI to keep the log moving forwards.
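
The effect of relogging on the log tail can be shown with a deliberately tiny
model.
It assumes that log positions are plain integers and that the tail is pinned by
the oldest unfinished intent; the ``toy_`` structures and functions are
hypothetical and ignore everything else that lives in a real log.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct toy_intent {
	uint64_t	lsn;	/* position where the intent was last logged */
	int		done;	/* nonzero once a matching done item exists */
};

struct toy_log {
	uint64_t	head;	/* next position to be written */
};

/* The tail is pinned by the oldest intent that is not yet done. */
static uint64_t toy_log_tail(const struct toy_log *log,
			     const struct toy_intent *items, size_t nr)
{
	uint64_t tail = log->head;
	size_t i;

	for (i = 0; i < nr; i++) {
		if (!items[i].done && items[i].lsn < tail)
			tail = items[i].lsn;
	}
	return tail;
}

/*
 * Relog: the old intent is finished and an equivalent intent is written
 * at the head, so the item no longer pins its old position.
 */
static void toy_relog(struct toy_log *log, struct toy_intent *item)
{
	assert(!item->done);
	item->lsn = log->head++;
}
```

After the oldest intent is relogged, the tail advances to the next oldest
unfinished item.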
24667fb8ccffSDarrick J. Wong
24677fb8ccffSDarrick J. WongEFIs have a role to play during the commit and reaping phases; please see the
24687fb8ccffSDarrick J. Wongnext section and the section about :ref:`reaping<reaping>` for more details.
24697fb8ccffSDarrick J. Wong
24707fb8ccffSDarrick J. WongProposed patchsets are the
24717fb8ccffSDarrick J. Wong`bitmap rework
24727fb8ccffSDarrick J. Wong<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-bitmap-rework>`_
24737fb8ccffSDarrick J. Wongand the
24747fb8ccffSDarrick J. Wong`preparation for bulk loading btrees
24757fb8ccffSDarrick J. Wong<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-prep-for-bulk-loading>`_.
24767fb8ccffSDarrick J. Wong
24777fb8ccffSDarrick J. Wong
24787fb8ccffSDarrick J. WongWriting the New Tree
24797fb8ccffSDarrick J. Wong````````````````````
24807fb8ccffSDarrick J. Wong
24817fb8ccffSDarrick J. WongThis part is pretty simple -- the btree builder (``xfs_btree_bulkload``) claims
24827fb8ccffSDarrick J. Wonga block from the reserved list, writes the new btree block header, fills the
24837fb8ccffSDarrick J. Wongrest of the block with records, and adds the new leaf block to a list of
24847fb8ccffSDarrick J. Wongwritten blocks::
24857fb8ccffSDarrick J. Wong
24867fb8ccffSDarrick J. Wong  ┌────┐
24877fb8ccffSDarrick J. Wong  │leaf│
24887fb8ccffSDarrick J. Wong  │RRR │
24897fb8ccffSDarrick J. Wong  └────┘
24907fb8ccffSDarrick J. Wong
24917fb8ccffSDarrick J. WongSibling pointers are set every time a new block is added to the level::
24927fb8ccffSDarrick J. Wong
24937fb8ccffSDarrick J. Wong  ┌────┐ ┌────┐ ┌────┐ ┌────┐
24947fb8ccffSDarrick J. Wong  │leaf│→│leaf│→│leaf│→│leaf│
24957fb8ccffSDarrick J. Wong  │RRR │←│RRR │←│RRR │←│RRR │
24967fb8ccffSDarrick J. Wong  └────┘ └────┘ └────┘ └────┘
24977fb8ccffSDarrick J. Wong
24987fb8ccffSDarrick J. WongWhen it finishes writing the record leaf blocks, it moves on to the node
blocks.
25007fb8ccffSDarrick J. WongTo fill a node block, it walks each block in the next level down in the tree
25017fb8ccffSDarrick J. Wongto compute the relevant keys and write them into the parent node::
25027fb8ccffSDarrick J. Wong
25037fb8ccffSDarrick J. Wong      ┌────┐       ┌────┐
25047fb8ccffSDarrick J. Wong      │node│──────→│node│
25057fb8ccffSDarrick J. Wong      │PP  │←──────│PP  │
25067fb8ccffSDarrick J. Wong      └────┘       └────┘
25077fb8ccffSDarrick J. Wong      ↙   ↘         ↙   ↘
25087fb8ccffSDarrick J. Wong  ┌────┐ ┌────┐ ┌────┐ ┌────┐
25097fb8ccffSDarrick J. Wong  │leaf│→│leaf│→│leaf│→│leaf│
25107fb8ccffSDarrick J. Wong  │RRR │←│RRR │←│RRR │←│RRR │
25117fb8ccffSDarrick J. Wong  └────┘ └────┘ └────┘ └────┘
25127fb8ccffSDarrick J. Wong
25137fb8ccffSDarrick J. WongWhen it reaches the root level, it is ready to commit the new btree!::
25147fb8ccffSDarrick J. Wong
25157fb8ccffSDarrick J. Wong          ┌─────────┐
25167fb8ccffSDarrick J. Wong          │  root   │
25177fb8ccffSDarrick J. Wong          │   PP    │
25187fb8ccffSDarrick J. Wong          └─────────┘
25197fb8ccffSDarrick J. Wong          ↙         ↘
25207fb8ccffSDarrick J. Wong      ┌────┐       ┌────┐
25217fb8ccffSDarrick J. Wong      │node│──────→│node│
25227fb8ccffSDarrick J. Wong      │PP  │←──────│PP  │
25237fb8ccffSDarrick J. Wong      └────┘       └────┘
25247fb8ccffSDarrick J. Wong      ↙   ↘         ↙   ↘
25257fb8ccffSDarrick J. Wong  ┌────┐ ┌────┐ ┌────┐ ┌────┐
25267fb8ccffSDarrick J. Wong  │leaf│→│leaf│→│leaf│→│leaf│
25277fb8ccffSDarrick J. Wong  │RRR │←│RRR │←│RRR │←│RRR │
25287fb8ccffSDarrick J. Wong  └────┘ └────┘ └────┘ └────┘
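
The bottom-up construction shown in the diagrams can be sketched with a toy
in-memory tree.
This is not ``xfs_btree_bload``: blocks are array slots rather than disk
blocks, every block is filled completely instead of honoring a load factor,
each node entry stores only the lowest key of its child, and all names are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_FANOUT	3	/* records or pointers per block */
#define TOY_MAX_BLOCKS	64
#define TOY_NULL	(-1)

struct toy_block {
	int		level;			/* 0 for leaves */
	int		left, right;		/* sibling block indexes */
	int		nr;			/* entries in use */
	uint64_t	keys[TOY_FANOUT];	/* records (leaf) or low keys */
	int		ptrs[TOY_FANOUT];	/* child indexes (nodes only) */
};

struct toy_tree {
	struct toy_block	blocks[TOY_MAX_BLOCKS];
	int			nr_blocks;
	int			root;
};

/* Claim the next block and link it to the previous block on its level. */
static int toy_new_block(struct toy_tree *t, int level, int prev)
{
	int bno = t->nr_blocks++;
	struct toy_block *b;

	assert(bno < TOY_MAX_BLOCKS);
	b = &t->blocks[bno];
	b->level = level;
	b->left = prev;
	b->right = TOY_NULL;
	b->nr = 0;
	if (prev != TOY_NULL)
		t->blocks[prev].right = bno;
	return bno;
}

/* Build the tree bottom-up from sorted records; returns the root index. */
static int toy_bulk_load(struct toy_tree *t, const uint64_t *recs,
			 size_t nr_recs)
{
	int first, cur, level = 0;
	size_t i;

	assert(nr_recs > 0);
	t->nr_blocks = 0;

	/* Leaf level: pack records and chain the siblings together. */
	first = cur = toy_new_block(t, 0, TOY_NULL);
	for (i = 0; i < nr_recs; i++) {
		if (t->blocks[cur].nr == TOY_FANOUT)
			cur = toy_new_block(t, 0, cur);
		t->blocks[cur].keys[t->blocks[cur].nr++] = recs[i];
	}

	/* Node levels: walk the level below until one block remains. */
	while (t->blocks[first].right != TOY_NULL) {
		int child = first;

		level++;
		first = cur = toy_new_block(t, level, TOY_NULL);
		for (; child != TOY_NULL; child = t->blocks[child].right) {
			struct toy_block *b;

			if (t->blocks[cur].nr == TOY_FANOUT)
				cur = toy_new_block(t, level, cur);
			b = &t->blocks[cur];
			b->keys[b->nr] = t->blocks[child].keys[0];
			b->ptrs[b->nr] = child;
			b->nr++;
		}
	}
	t->root = first;
	return first;
}
```

Loading ten sorted records with a fanout of three produces four leaves, two
nodes, and one root, matching the three-level shape in the last diagram.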
25297fb8ccffSDarrick J. Wong
25307fb8ccffSDarrick J. WongThe first step to commit the new btree is to persist the btree blocks to disk
25317fb8ccffSDarrick J. Wongsynchronously.
25327fb8ccffSDarrick J. WongThis is a little complicated because a new btree block could have been freed
25337fb8ccffSDarrick J. Wongin the recent past, so the builder must use ``xfs_buf_delwri_queue_here`` to
25347fb8ccffSDarrick J. Wongremove the (stale) buffer from the AIL list before it can write the new blocks
25357fb8ccffSDarrick J. Wongto disk.
25367fb8ccffSDarrick J. WongBlocks are queued for IO using a delwri list and written in one large batch
25377fb8ccffSDarrick J. Wongwith ``xfs_buf_delwri_submit``.
25387fb8ccffSDarrick J. Wong
25397fb8ccffSDarrick J. WongOnce the new blocks have been persisted to disk, control returns to the
25407fb8ccffSDarrick J. Wongindividual repair function that called the bulk loader.
25417fb8ccffSDarrick J. WongThe repair function must log the location of the new root in a transaction,
25427fb8ccffSDarrick J. Wongclean up the space reservations that were made for the new btree, and reap the
25437fb8ccffSDarrick J. Wongold metadata blocks:
25447fb8ccffSDarrick J. Wong
25457fb8ccffSDarrick J. Wong1. Commit the location of the new btree root.
25467fb8ccffSDarrick J. Wong
25477fb8ccffSDarrick J. Wong2. For each incore reservation:
25487fb8ccffSDarrick J. Wong
25497fb8ccffSDarrick J. Wong   a. Log Extent Freeing Done (EFD) items for all the space that was consumed
25507fb8ccffSDarrick J. Wong      by the btree builder.  The new EFDs must point to the EFIs attached to
25517fb8ccffSDarrick J. Wong      the reservation to prevent log recovery from freeing the new blocks.
25527fb8ccffSDarrick J. Wong
25537fb8ccffSDarrick J. Wong   b. For unclaimed portions of incore reservations, create a regular deferred
      extent free work item to free the unused space later in the
25557fb8ccffSDarrick J. Wong      transaction chain.
25567fb8ccffSDarrick J. Wong
25577fb8ccffSDarrick J. Wong   c. The EFDs and EFIs logged in steps 2a and 2b must not overrun the
25587fb8ccffSDarrick J. Wong      reservation of the committing transaction.
25597fb8ccffSDarrick J. Wong      If the btree loading code suspects this might be about to happen, it must
25607fb8ccffSDarrick J. Wong      call ``xrep_defer_finish`` to clear out the deferred work and obtain a
25617fb8ccffSDarrick J. Wong      fresh transaction.
25627fb8ccffSDarrick J. Wong
25637fb8ccffSDarrick J. Wong3. Clear out the deferred work a second time to finish the commit and clean
25647fb8ccffSDarrick J. Wong   the repair transaction.
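
The bookkeeping in steps 2 and 3 can be approximated with the sketch below.
It is a simplified model under stated assumptions: each reservation logs one
EFD plus at most one deferred free, a transaction has room for a fixed number
of such items, and "rolling" is just a counter.
None of the ``toy_`` names exist in the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct toy_resv {
	uint64_t	start;	/* first block of the reserved extent */
	uint64_t	len;	/* blocks reserved */
	uint64_t	used;	/* blocks claimed by the btree builder */
};

struct toy_extent {
	uint64_t	start;
	uint64_t	len;
};

struct toy_commit_stats {
	uint64_t	kept_blocks;	/* now owned by the new btree */
	size_t		nr_freed;	/* extents queued for deferred free */
	unsigned int	nr_rolls;	/* times the transaction was rolled */
};

/* to_free must have room for nr_resv entries. */
static void toy_commit_reservations(const struct toy_resv *resv,
		size_t nr_resv, unsigned int items_per_trans,
		struct toy_extent *to_free, struct toy_commit_stats *st)
{
	unsigned int logged = 0;
	size_t i;

	assert(items_per_trans >= 2);
	st->kept_blocks = 0;
	st->nr_freed = 0;
	st->nr_rolls = 0;

	for (i = 0; i < nr_resv; i++) {
		assert(resv[i].used <= resv[i].len);

		/* Step 2c: roll before this reservation could overrun. */
		if (logged + 2 > items_per_trans) {
			st->nr_rolls++;
			logged = 0;
		}

		/* Step 2a: the EFD protects the claimed blocks. */
		st->kept_blocks += resv[i].used;
		logged++;

		/* Step 2b: schedule the unclaimed remainder for freeing. */
		if (resv[i].used < resv[i].len) {
			to_free[st->nr_freed].start =
				resv[i].start + resv[i].used;
			to_free[st->nr_freed].len =
				resv[i].len - resv[i].used;
			st->nr_freed++;
			logged++;
		}
	}

	/* Step 3: one final roll finishes the deferred work. */
	st->nr_rolls++;
}
```

A larger ``items_per_trans`` means fewer rolls, which corresponds to fewer
points at which a crash could interrupt the cleanup.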
25657fb8ccffSDarrick J. Wong
The transaction rolling in steps 2c and 3 represents a weakness in the repair
25677fb8ccffSDarrick J. Wongalgorithm, because a log flush and a crash before the end of the reap step can
25687fb8ccffSDarrick J. Wongresult in space leaking.
2569*d56b699dSBjorn HelgaasOnline repair functions minimize the chances of this occurring by using very
2570*d56b699dSBjorn Helgaaslarge transactions, which each can accommodate many thousands of block freeing
25717fb8ccffSDarrick J. Wonginstructions.
25727fb8ccffSDarrick J. WongRepair moves on to reaping the old blocks, which will be presented in a
25737fb8ccffSDarrick J. Wongsubsequent :ref:`section<reaping>` after a few case studies of bulk loading.
25747fb8ccffSDarrick J. Wong
25757fb8ccffSDarrick J. WongCase Study: Rebuilding the Inode Index
25767fb8ccffSDarrick J. Wong^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
25777fb8ccffSDarrick J. Wong
25787fb8ccffSDarrick J. WongThe high level process to rebuild the inode index btree is:
25797fb8ccffSDarrick J. Wong
25807fb8ccffSDarrick J. Wong1. Walk the reverse mapping records to generate ``struct xfs_inobt_rec``
25817fb8ccffSDarrick J. Wong   records from the inode chunk information and a bitmap of the old inode btree
25827fb8ccffSDarrick J. Wong   blocks.
25837fb8ccffSDarrick J. Wong
25847fb8ccffSDarrick J. Wong2. Append the records to an xfarray in inode order.
25857fb8ccffSDarrick J. Wong
25867fb8ccffSDarrick J. Wong3. Use the ``xfs_btree_bload_compute_geometry`` function to compute the number
25877fb8ccffSDarrick J. Wong   of blocks needed for the inode btree.
25887fb8ccffSDarrick J. Wong   If the free space inode btree is enabled, call it again to estimate the
25897fb8ccffSDarrick J. Wong   geometry of the finobt.
25907fb8ccffSDarrick J. Wong
25917fb8ccffSDarrick J. Wong4. Allocate the number of blocks computed in the previous step.
25927fb8ccffSDarrick J. Wong
25937fb8ccffSDarrick J. Wong5. Use ``xfs_btree_bload`` to write the xfarray records to btree blocks and
25947fb8ccffSDarrick J. Wong   generate the internal node blocks.
25957fb8ccffSDarrick J. Wong   If the free space inode btree is enabled, call it again to load the finobt.
25967fb8ccffSDarrick J. Wong
25977fb8ccffSDarrick J. Wong6. Commit the location of the new btree root block(s) to the AGI.
25987fb8ccffSDarrick J. Wong
25997fb8ccffSDarrick J. Wong7. Reap the old btree blocks using the bitmap created in step 1.
26007fb8ccffSDarrick J. Wong
26017fb8ccffSDarrick J. WongDetails are as follows.
26027fb8ccffSDarrick J. Wong
26037fb8ccffSDarrick J. WongThe inode btree maps inumbers to the ondisk location of the associated
26047fb8ccffSDarrick J. Wonginode records, which means that the inode btrees can be rebuilt from the
26057fb8ccffSDarrick J. Wongreverse mapping information.
Reverse mapping records with an owner of ``XFS_RMAP_OWN_INOBT`` mark the
26077fb8ccffSDarrick J. Wonglocation of the old inode btree blocks.
26087fb8ccffSDarrick J. WongEach reverse mapping record with an owner of ``XFS_RMAP_OWN_INODES`` marks the
26097fb8ccffSDarrick J. Wonglocation of at least one inode cluster buffer.
26107fb8ccffSDarrick J. WongA cluster is the smallest number of ondisk inodes that can be allocated or
26117fb8ccffSDarrick J. Wongfreed in a single transaction; it is never smaller than 1 fs block or 4 inodes.
26127fb8ccffSDarrick J. Wong
26137fb8ccffSDarrick J. WongFor the space represented by each inode cluster, ensure that there are no
26147fb8ccffSDarrick J. Wongrecords in the free space btrees nor any records in the reference count btree.
26157fb8ccffSDarrick J. WongIf there are, the space metadata inconsistencies are reason enough to abort the
26167fb8ccffSDarrick J. Wongoperation.
26177fb8ccffSDarrick J. WongOtherwise, read each cluster buffer to check that its contents appear to be
26187fb8ccffSDarrick J. Wongondisk inodes and to decide if the file is allocated
26197fb8ccffSDarrick J. Wong(``xfs_dinode.i_mode != 0``) or free (``xfs_dinode.i_mode == 0``).
26207fb8ccffSDarrick J. WongAccumulate the results of successive inode cluster buffer reads until there is
26217fb8ccffSDarrick J. Wongenough information to fill a single inode chunk record, which is 64 consecutive
26227fb8ccffSDarrick J. Wongnumbers in the inumber keyspace.
26237fb8ccffSDarrick J. WongIf the chunk is sparse, the chunk record may include holes.
26247fb8ccffSDarrick J. Wong
26257fb8ccffSDarrick J. WongOnce the repair function accumulates one chunk's worth of data, it calls
26267fb8ccffSDarrick J. Wong``xfarray_append`` to add the inode btree record to the xfarray.
26277fb8ccffSDarrick J. WongThis xfarray is walked twice during the btree creation step -- once to populate
26287fb8ccffSDarrick J. Wongthe inode btree with all inode chunk records, and a second time to populate the
26297fb8ccffSDarrick J. Wongfree inode btree with records for chunks that have free non-sparse inodes.
26307fb8ccffSDarrick J. WongThe number of records for the inode btree is the number of xfarray records,
26317fb8ccffSDarrick J. Wongbut the record count for the free inode btree has to be computed as inode chunk
26327fb8ccffSDarrick J. Wongrecords are stored in the xfarray.
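
The accumulation step can be pictured with a minimal sketch.
The ``chunk_rec`` structure and the three helpers below are hypothetical
simplifications for illustration and are not the kernel's handling of
``struct xfs_inobt_rec``.
The sketch assumes that a record starts out as all holes and all free, and that
each holemask bit covers four inodes.

```c
#include <assert.h>
#include <stdint.h>

#define INODES_PER_CHUNK        64
#define INODES_PER_HOLEMASK_BIT 4

/* Hypothetical, simplified inode chunk record. */
struct chunk_rec {
    uint64_t startino;   /* first inumber in the chunk */
    uint64_t free;       /* bit set: inode is free or sits in a hole */
    uint16_t holemask;   /* bit set: these four inodes have no cluster */
    uint8_t  count;      /* inodes backed by a cluster */
    uint8_t  freecount;  /* backed inodes with i_mode == 0 */
};

/* Start with a chunk that is entirely sparse and entirely free. */
void chunk_init(struct chunk_rec *rec, uint64_t startino)
{
    rec->startino = startino;
    rec->free = UINT64_MAX;
    rec->holemask = UINT16_MAX;
    rec->count = 0;
    rec->freecount = 0;
}

/*
 * Fold one inode cluster into the chunk record.  modes[i] is the i_mode
 * of the inode at offset (first + i) within the chunk.
 */
void chunk_add_cluster(struct chunk_rec *rec, unsigned int first,
                       const uint16_t *modes, unsigned int nr)
{
    unsigned int i;

    assert(first % INODES_PER_HOLEMASK_BIT == 0);
    assert(nr % INODES_PER_HOLEMASK_BIT == 0);
    assert(first + nr <= INODES_PER_CHUNK);

    for (i = 0; i < nr; i++) {
        unsigned int ino = first + i;

        /* This inode is backed by a cluster, so it is not a hole. */
        rec->holemask &= (uint16_t)~(1U << (ino / INODES_PER_HOLEMASK_BIT));
        rec->count++;
        if (modes[i] != 0)
            rec->free &= ~(UINT64_C(1) << ino);   /* allocated file */
        else
            rec->freecount++;                     /* free inode */
    }
}

/* A chunk belongs in the free inode btree if it has a free, backed inode. */
int chunk_has_free_inodes(const struct chunk_rec *rec)
{
    return rec->freecount > 0;
}
```

Under these assumptions, the free inode btree record count is the number of
appended records for which ``chunk_has_free_inodes`` returns nonzero.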

The proposed patchset is the
`AG btree repair
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_
series.

Case Study: Rebuilding the Space Reference Counts
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Reverse mapping records are used to rebuild the reference count information.
Reference counts are required for correct operation of copy on write for shared
file data.
Imagine the reverse mapping entries as rectangles representing extents of
physical blocks, and that the rectangles can be laid down to allow them to
overlap each other.
From the diagram below, it is apparent that a reference count record must start
or end wherever the height of the stack changes.
In other words, the record emission stimulus is level-triggered::

                        █    ███
              ██      █████ ████   ███        ██████
        ██   ████     ███████████ ████     █████████
        ████████████████████████████████ ███████████
        ^ ^  ^^ ^^    ^ ^^ ^^^  ^^^^  ^ ^^ ^  ^     ^
        2 1  23 21    3 43 234  2123  1 01 2  3     0

The ondisk reference count btree does not store the refcount == 0 cases because
the free space btree already records which blocks are free.
Extents being used to stage copy-on-write operations should be the only records
with refcount == 1.
Single-owner file blocks aren't recorded in either the free space or the
reference count btrees.

The high level process to rebuild the reference count btree is:

1. Walk the reverse mapping records to generate ``struct xfs_refcount_irec``
   records for any space having more than one reverse mapping and add them to
   the xfarray.
   Any records owned by ``XFS_RMAP_OWN_COW`` are also added to the xfarray
   because these are extents allocated to stage a copy on write operation and
   are tracked in the refcount btree.

   Use any records owned by ``XFS_RMAP_OWN_REFC`` to create a bitmap of old
   refcount btree blocks.

2. Sort the records in physical extent order, putting the CoW staging extents
   at the end of the xfarray.
   This matches the sorting order of records in the refcount btree.

3. Use the ``xfs_btree_bload_compute_geometry`` function to compute the number
   of blocks needed for the new tree.

4. Allocate the number of blocks computed in the previous step.

5. Use ``xfs_btree_bload`` to write the xfarray records to btree blocks and
   generate the internal node blocks.

6. Commit the location of the new btree root block to the AGF.

7. Reap the old btree blocks using the bitmap created in step 1.

Details are as follows; the same algorithm is used by ``xfs_repair`` to
generate refcount information from reverse mapping records.

- Until the reverse mapping btree runs out of records:

  - Retrieve the next record from the btree and put it in a bag.

  - Collect all records with the same starting block from the btree and put
    them in the bag.

  - While the bag isn't empty:

    - Among the mappings in the bag, compute the lowest block number where the
      reference count changes.
      This position will be either the starting block number of the next
      unprocessed reverse mapping or the next block after the shortest mapping
      in the bag.

    - Remove all mappings from the bag that end at this position.

    - Collect all reverse mappings that start at this position from the btree
      and put them in the bag.

    - If the size of the bag changed and is greater than one, create a new
      refcount record associating the block number range that we just walked to
      the size of the bag.

The bag-like structure in this case is a type 2 xfarray as discussed in the
:ref:`xfarray access patterns<xfarray_access_patterns>` section.
Reverse mappings are added to the bag using ``xfarray_store_anywhere`` and
removed via ``xfarray_unset``.
Bag members are examined through ``xfarray_iter`` loops.
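
The bag walk can be sketched in a few dozen lines of C.
This is an illustration under stated assumptions, not the kernel or
``xfs_repair`` code: the bag is a small fixed-size array instead of an xfarray,
the input is an array already sorted by starting block, every mapping has a
nonzero length, and CoW staging extents and unshareable owners are ignored.
All names (``rmap_ext``, ``refc_ext``, ``refcounts_from_rmaps``) are
hypothetical.
The count stored in each emitted record is the size of the bag over the range
just walked, that is, the size before mappings were removed or added at the
current position.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BAG_MAX 64  /* arbitrary limit chosen for this sketch */

/* Simplified stand-ins for reverse mapping and refcount records. */
struct rmap_ext { uint32_t start; uint32_t len; };
struct refc_ext { uint32_t start; uint32_t len; uint32_t count; };

/* Lowest block number at which the height of the stack changes. */
static uint32_t next_change(const struct rmap_ext *bag, size_t bag_nr,
                            const struct rmap_ext *rmaps, size_t nr,
                            size_t next)
{
    uint32_t pos = UINT32_MAX;
    size_t i;

    for (i = 0; i < bag_nr; i++)
        if (bag[i].start + bag[i].len < pos)
            pos = bag[i].start + bag[i].len;
    if (next < nr && rmaps[next].start < pos)
        pos = rmaps[next].start;
    return pos;
}

/* rmaps[] must be sorted by start; returns the number of records emitted. */
size_t refcounts_from_rmaps(const struct rmap_ext *rmaps, size_t nr,
                            struct refc_ext *out, size_t out_max)
{
    struct rmap_ext bag[BAG_MAX];
    size_t bag_nr = 0, next = 0, nr_out = 0, i;

    while (next < nr) {
        uint32_t cbno = rmaps[next].start;  /* start of the current range */

        /* Seed the bag with every mapping that starts at cbno. */
        while (next < nr && rmaps[next].start == cbno) {
            assert(bag_nr < BAG_MAX);
            bag[bag_nr++] = rmaps[next++];
        }

        while (bag_nr > 0) {
            size_t old_nr = bag_nr;
            uint32_t nbno = next_change(bag, bag_nr, rmaps, nr, next);

            /* Remove the mappings that end at nbno. */
            for (i = 0; i < bag_nr; ) {
                if (bag[i].start + bag[i].len == nbno)
                    bag[i] = bag[--bag_nr];
                else
                    i++;
            }
            /* Add the mappings that start at nbno. */
            while (next < nr && rmaps[next].start == nbno) {
                assert(bag_nr < BAG_MAX);
                bag[bag_nr++] = rmaps[next++];
            }
            /* The height changed, so close out the range just walked. */
            if (bag_nr != old_nr) {
                if (old_nr > 1) {
                    assert(nr_out < out_max);
                    out[nr_out].start = cbno;
                    out[nr_out].len = nbno - cbno;
                    out[nr_out].count = (uint32_t)old_nr;
                    nr_out++;
                }
                cbno = nbno;
            }
        }
    }
    return nr_out;
}
```

When one mapping ends exactly where another begins, the bag size does not
change, so the sketch keeps extending the current range instead of emitting two
adjacent records with the same count.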

The proposed patchset is the
`AG btree repair
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_
series.

Case Study: Rebuilding File Fork Mapping Indices
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The high level process to rebuild a data/attr fork mapping btree is:

1. Walk the reverse mapping records to generate ``struct xfs_bmbt_rec``
   records from the reverse mapping records for that inode and fork.
   Append these records to an xfarray.
   Compute the bitmap of the old bmap btree blocks from the ``BMBT_BLOCK``
   records.

2. Use the ``xfs_btree_bload_compute_geometry`` function to compute the number
   of blocks needed for the new tree.

3. Sort the records in file offset order.

4. If the extent records would fit in the inode fork immediate area, commit the
   records to that immediate area and skip to step 8.

5. Allocate the number of blocks computed in step 2.

6. Use ``xfs_btree_bload`` to write the xfarray records to btree blocks and
   generate the internal node blocks.

7. Commit the new btree root block to the inode fork immediate area.

8. Reap the old btree blocks using the bitmap created in step 1.

There are some complications here:
First, it's possible to move the fork offset to adjust the sizes of the
immediate areas if the data and attr forks are not both in BMBT format.
Second, if there are sufficiently few fork mappings, it may be possible to use
EXTENTS format instead of BMBT, which may require a conversion.
Third, the incore extent map must be reloaded carefully to avoid disturbing
any delayed allocation extents.
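
Steps 3 and 4 can be illustrated with a short sketch.
The ``fork_mapping`` structure, the ``plan_fork_rebuild`` helper, and the
``fork_area_bytes`` parameter are hypothetical names invented for this example;
the only fact taken from the ondisk format is that one mapping record occupies
16 bytes.
The sketch ignores the fork offset adjustment and delayed allocation
complications described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define EXTENT_REC_BYTES 16  /* size of one ondisk mapping record */

enum fork_format { FORK_EXTENTS, FORK_BTREE };

/* Simplified stand-in for one mapping gathered from the rmap data. */
struct fork_mapping {
    uint64_t offset;      /* file offset, in blocks */
    uint64_t startblock;  /* physical block */
    uint64_t len;         /* length, in blocks */
};

static int cmp_offset(const void *a, const void *b)
{
    const struct fork_mapping *ma = a, *mb = b;

    return (ma->offset > mb->offset) - (ma->offset < mb->offset);
}

/*
 * Sort the mappings by file offset and report whether they fit in the
 * fork's immediate area (EXTENTS format) or need a mapping btree.
 */
enum fork_format plan_fork_rebuild(struct fork_mapping *maps, size_t nr,
                                   size_t fork_area_bytes)
{
    assert(maps != NULL || nr == 0);

    if (nr > 1)
        qsort(maps, nr, sizeof(*maps), cmp_offset);

    if (nr <= fork_area_bytes / EXTENT_REC_BYTES)
        return FORK_EXTENTS;
    return FORK_BTREE;
}
```

A caller would only allocate blocks and bulk load a btree when the helper
returns ``FORK_BTREE``.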

The proposed patchset is the
`file mapping repair
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-file-mappings>`_
series.

.. _reaping:

Reaping Old Metadata Blocks
---------------------------

Whenever online fsck builds a new data structure to replace one that is
suspect, there is a question of how to find and dispose of the blocks that
belonged to the old structure.
The laziest method of course is not to deal with them at all, but this slowly
leads to service degradations as space leaks out of the filesystem.
Hopefully, someone will schedule a rebuild of the free space information to
plug all those leaks.
Offline repair rebuilds all space metadata after recording the usage of
the files and directories that it decides not to clear, hence it can build new
structures in the discovered free space and avoid the question of reaping.

As part of a repair, online fsck relies heavily on the reverse mapping records
to find space that is owned by the corresponding rmap owner yet truly free.
Cross referencing rmap records with other rmap records is necessary because
there may be other data structures that also think they own some of those
blocks (e.g. crosslinked trees).
Permitting the block allocator to hand them out again will not push the system
towards consistency.

For space metadata, the process of finding extents to dispose of generally
follows this format:

1. Create a bitmap of space used by data structures that must be preserved.
   The space reservations used to create the new metadata can be used here if
   the same rmap owner code is used to denote all of the objects being rebuilt.

2. Survey the reverse mapping data to create a bitmap of space owned by the
   same ``XFS_RMAP_OWN_*`` number for the metadata that is being preserved.

3. Use the bitmap disunion operator to subtract (1) from (2).
   The remaining set bits represent candidate extents that could be freed.
   The process moves on to step 4 below.
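
The three steps above amount to computing ``owned & ~preserved``.
The following sketch shows that operation on a flat array of 64-bit words.
The flat layout, the fixed ``AG_BLOCKS`` size, and the function names are
assumptions made for brevity; they are not meant to mirror the kernel's bitmap
implementation.

```c
#include <assert.h>
#include <stdint.h>

#define AG_BLOCKS    256  /* toy allocation group size for this sketch */
#define BITMAP_WORDS (AG_BLOCKS / 64)

struct blk_bitmap {
    uint64_t words[BITMAP_WORDS];
};

/* Mark the blocks [start, start + len) as set. */
void bitmap_set_range(struct blk_bitmap *bm, uint32_t start, uint32_t len)
{
    uint32_t blk;

    assert(start <= AG_BLOCKS && len <= AG_BLOCKS - start);
    for (blk = start; blk < start + len; blk++)
        bm->words[blk / 64] |= UINT64_C(1) << (blk % 64);
}

/* Return nonzero if the given block is set. */
int bitmap_test(const struct blk_bitmap *bm, uint32_t blk)
{
    assert(blk < AG_BLOCKS);
    return (int)((bm->words[blk / 64] >> (blk % 64)) & 1);
}

/* Disunion: clear every bit in bm that is set in sub (bm &= ~sub). */
void bitmap_disunion(struct blk_bitmap *bm, const struct blk_bitmap *sub)
{
    unsigned int i;

    for (i = 0; i < BITMAP_WORDS; i++)
        bm->words[i] &= ~sub->words[i];
}
```

Here the bitmap from step 2 plays the role of ``bm`` and the bitmap from step 1
plays the role of ``sub``; whatever remains set afterwards is a candidate for
the disposal steps.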

Repairs for file-based metadata such as extended attributes, directories,
symbolic links, quota files and realtime bitmaps are performed by building a
new structure attached to a temporary file and swapping the forks.
Afterward, the mappings in the old file fork are the candidate blocks for
disposal.

The process for disposing of old extents is as follows:

4. For each candidate extent, count the number of reverse mapping records for
   the first block in that extent that do not have the same rmap owner for the
   data structure being repaired.

   - If zero, the block has a single owner and can be freed.

   - If not, the block is part of a crosslinked structure and must not be
     freed.

5. Starting with the next block in the extent, figure out how many more blocks
   have the same zero/nonzero other owner status as that first block.

6. If the region is crosslinked, delete the reverse mapping entry for the
   structure being repaired and move on to the next region.

7. If the region is to be freed, mark any corresponding buffers in the buffer
   cache as stale to prevent log writeback.

8. Free the region and move on.
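
Steps 4 and 5 split each candidate extent into regions that share the same
crosslinked status.
The sketch below shows that splitting only; it assumes a hypothetical
``other_owners`` array, indexed by block number, that already holds the count
of reverse mappings belonging to other owners.
The real code would obtain those counts by querying the reverse mapping btree,
and would then act on each region as described in steps 6 through 8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum reap_action {
    REAP_FREE,        /* single owner: invalidate buffers and free */
    REAP_UNMAP_ONLY   /* crosslinked: only remove our rmap entry */
};

struct reap_run {
    uint32_t start;
    uint32_t len;
    enum reap_action action;
};

/*
 * Split the candidate extent [start, start + len) into runs of blocks
 * that share the same zero/nonzero other-owner status.
 */
size_t plan_reap(const unsigned int *other_owners, uint32_t start,
                 uint32_t len, struct reap_run *runs, size_t max_runs)
{
    uint32_t blk = start, end = start + len;
    size_t nr = 0;

    while (blk < end) {
        int crosslinked = other_owners[blk] != 0;   /* step 4 */
        uint32_t run_start = blk;

        /* Step 5: extend the run while the status stays the same. */
        do {
            blk++;
        } while (blk < end && (other_owners[blk] != 0) == crosslinked);

        assert(nr < max_runs);
        runs[nr].start = run_start;
        runs[nr].len = blk - run_start;
        runs[nr].action = crosslinked ? REAP_UNMAP_ONLY : REAP_FREE;
        nr++;
    }
    return nr;
}
```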

However, there is one complication to this procedure.
Transactions are of finite size, so the reaping process must be careful to roll
the transactions to avoid overruns.
Overruns come from two sources:

a. EFIs logged on behalf of space that is no longer occupied

b. Log items for buffer invalidations

This is also a window in which a crash during the reaping process can leak
blocks.
As stated earlier, online repair functions use very large transactions to
minimize the chances of this occurring.
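
One way to picture the rolling decision is a small budget that counts both
sources of log items and signals a roll when either limit is reached.
The ``reap_budget`` structure, its limits, and the ``reap_budget_charge``
helper are hypothetical; the actual thresholds and bookkeeping used by the
kernel are not stated in this document.

```c
#include <assert.h>

/* Hypothetical per-transaction budget for the reaping loop. */
struct reap_budget {
    unsigned int max_efis;     /* limit on EFIs per transaction */
    unsigned int max_binvals;  /* limit on buffer invalidations */
    unsigned int efis;         /* EFIs logged since the last roll */
    unsigned int binvals;      /* invalidations since the last roll */
    unsigned int rolls;        /* number of rolls so far */
};

/*
 * Account for the log items generated by reaping one region.
 * Returns 1 if the caller should roll the transaction now, else 0.
 */
int reap_budget_charge(struct reap_budget *b, unsigned int efis,
                       unsigned int binvals)
{
    assert(b->max_efis > 0 && b->max_binvals > 0);

    b->efis += efis;
    b->binvals += binvals;
    if (b->efis < b->max_efis && b->binvals < b->max_binvals)
        return 0;

    /* A real implementation would roll the transaction here. */
    b->efis = 0;
    b->binvals = 0;
    b->rolls++;
    return 1;
}
```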

The proposed patchset is the
`preparation for bulk loading btrees
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-prep-for-bulk-loading>`_
series.

Case Study: Reaping After a Regular Btree Repair
````````````````````````````````````````````````

Old reference count and inode btrees are the easiest to reap because they have
rmap records with special owner codes: ``XFS_RMAP_OWN_REFC`` for the refcount
btree, and ``XFS_RMAP_OWN_INOBT`` for the inode and free inode btrees.
Creating a list of extents to reap the old btree blocks is quite simple,
conceptually:

1. Lock the relevant AGI/AGF header buffers to prevent allocations and frees.

2. For each reverse mapping record with an rmap owner corresponding to the
   metadata structure being rebuilt, set the corresponding range in a bitmap.

3. Walk the current data structures that have the same rmap owner.
   For each block visited, clear that range in the above bitmap.

4. Each set bit in the bitmap represents a block that could be a block from the
   old data structures and hence is a candidate for reaping.
   In other words, ``(rmap_records_owned_by & ~blocks_reachable_by_walk)``
   are the blocks that might be freeable.

If it is possible to maintain the AGF lock throughout the repair (which is the
common case), then step 2 can be performed at the same time as the reverse
mapping record walk that creates the records for the new btree.

Case Study: Rebuilding the Free Space Indices
`````````````````````````````````````````````

The high level process to rebuild the free space indices is:

1. Walk the reverse mapping records to generate ``struct xfs_alloc_rec_incore``
   records from the gaps in the reverse mapping btree.

2. Append the records to an xfarray.

3. Use the ``xfs_btree_bload_compute_geometry`` function to compute the number
   of blocks needed for each new tree.

4. Allocate the number of blocks computed in the previous step from the free
   space information collected.

5. Use ``xfs_btree_bload`` to write the xfarray records to btree blocks and
   generate the internal node blocks for the free space by length index.
   Call it again for the free space by block number index.

6. Commit the locations of the new btree root blocks to the AGF.

7. Reap the old btree blocks by looking for space that is not recorded by the
   reverse mapping btree, the new free space btrees, or the AGFL.

Repairing the free space btrees has three key complications over a regular
btree repair:

First, free space is not explicitly tracked in the reverse mapping records.
Hence, the new free space records must be inferred from gaps in the physical
space component of the keyspace of the reverse mapping btree.
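
Inferring free space from gaps can be sketched as a single pass over reverse
mappings sorted by starting block.
The structures and the ``free_space_from_gaps`` helper are hypothetical
simplifications rather than the kernel's record types.
Because reverse mappings may overlap when blocks are shared, the sketch tracks
the highest block covered so far instead of assuming that each record starts
where the previous one ended.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for reverse mapping and free space records. */
struct used_ext { uint32_t start; uint32_t len; };
struct free_ext { uint32_t start; uint32_t len; };

/*
 * rmaps[] must be sorted by start block.  Every range of blocks below
 * ag_blocks that no rmap covers becomes one free space record.
 * Returns the number of free space records emitted.
 */
size_t free_space_from_gaps(const struct used_ext *rmaps, size_t nr,
                            uint32_t ag_blocks, struct free_ext *out,
                            size_t out_max)
{
    uint32_t next_used = 0;  /* first block not yet covered by any rmap */
    size_t nr_out = 0, i;

    for (i = 0; i < nr; i++) {
        uint32_t end = rmaps[i].start + rmaps[i].len;

        if (rmaps[i].start > next_used) {
            assert(nr_out < out_max);
            out[nr_out].start = next_used;
            out[nr_out].len = rmaps[i].start - next_used;
            nr_out++;
        }
        if (end > next_used)
            next_used = end;
    }

    /* Anything between the last mapping and the end of the AG is free. */
    if (ag_blocks > next_used) {
        assert(nr_out < out_max);
        out[nr_out].start = next_used;
        out[nr_out].len = ag_blocks - next_used;
        nr_out++;
    }
    return nr_out;
}
```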

Second, free space repairs cannot use the common btree reservation code because
new blocks are reserved out of the free space btrees.
This is impossible when repairing the free space btrees themselves.
However, repair holds the AGF buffer lock for the duration of the free space
index reconstruction, so it can use the collected free space information to
supply the blocks for the new free space btrees.
It is not necessary to back each reserved extent with an EFI because the new
free space btrees are constructed in what the ondisk filesystem thinks is
unowned space.
However, if reserving blocks for the new btrees from the collected free space
information changes the number of free space records, repair must re-estimate
the new free space btree geometry with the new record count until the
reservation is sufficient.
As part of committing the new btrees, repair must ensure that reverse mappings
are created for the reserved blocks and that unused reserved blocks are
inserted into the free space btrees.
Deferred rmap and freeing operations are used to ensure that this transition
is atomic, similar to the other btree repair functions.

Third, finding the blocks to reap after the repair is not overly
straightforward.
Blocks for the free space btrees and the reverse mapping btrees are supplied by
the AGFL.
Blocks put onto the AGFL have reverse mapping records with the owner
``XFS_RMAP_OWN_AG``.
This ownership is retained when blocks move from the AGFL into the free space
btrees or the reverse mapping btrees.
When repair walks reverse mapping records to synthesize free space records, it
creates a bitmap (``ag_owner_bitmap``) of all the space claimed by
``XFS_RMAP_OWN_AG`` records.
The repair context maintains a second bitmap corresponding to the rmap btree
blocks and the AGFL blocks (``rmap_agfl_bitmap``).
When the walk is complete, the bitmap disunion operation ``(ag_owner_bitmap &
~rmap_agfl_bitmap)`` computes the extents that are used by the old free space
btrees.
These blocks can then be reaped using the methods outlined above.

The proposed patchset is the
`AG btree repair
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_
series.

.. _rmap_reap:

Case Study: Reaping After Repairing Reverse Mapping Btrees
``````````````````````````````````````````````````````````

Old reverse mapping btrees are less difficult to reap after a repair.
As mentioned in the previous section, blocks on the AGFL, the two free space
btree blocks, and the reverse mapping btree blocks all have reverse mapping
records with ``XFS_RMAP_OWN_AG`` as the owner.
The full process of gathering reverse mapping records and building a new btree
is described in the case study of
:ref:`live rebuilds of rmap data <rmap_repair>`, but a crucial point from that
discussion is that the new rmap btree will not contain any records for the old
rmap btree, nor will the old btree blocks be tracked in the free space btrees.
The list of candidate reaping blocks is computed by setting the bits
corresponding to the gaps in the new rmap btree records, and then clearing the
bits corresponding to extents in the free space btrees and the current AGFL
blocks.
The result ``(new_rmapbt_gaps & ~(agfl | bnobt_records))`` is reaped using the
methods outlined above.

The rest of the process of rebuilding the reverse mapping btree is discussed
in a separate :ref:`case study<rmap_repair>`.

The proposed patchset is the
`AG btree repair
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-ag-btrees>`_
series.

Case Study: Rebuilding the AGFL
```````````````````````````````

The allocation group free block list (AGFL) is repaired as follows:

1. Create a bitmap for all the space that the reverse mapping data claims is
   owned by ``XFS_RMAP_OWN_AG``.

2. Subtract the space used by the two free space btrees and the rmap btree.

3. Subtract any space that the reverse mapping data claims is owned by any
   other owner, to avoid re-adding crosslinked blocks to the AGFL.

4. Once the AGFL is full, reap any blocks left over.

5. The next operation to fix the freelist will right-size the list.

See `fs/xfs/scrub/agheader_repair.c <https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/xfs/scrub/agheader_repair.c>`_ for more details.

Inode Record Repairs
--------------------

Inode records must be handled carefully, because they have both ondisk records
("dinodes") and an in-memory ("cached") representation.
There is a very high potential for cache coherency issues if online fsck is not
careful to access the ondisk metadata *only* when the ondisk metadata is so
badly damaged that the filesystem cannot load the in-memory representation.
When online fsck wants to open a damaged file for scrubbing, it must use
specialized resource acquisition functions that return either the in-memory
representation *or* a lock on whichever object is necessary to prevent any
update to the ondisk location.

The only repairs that should be made to the ondisk inode buffers are whatever
is necessary to get the in-core structure loaded.
This means fixing whatever is caught by the inode cluster buffer and inode fork
verifiers, and retrying the ``iget`` operation.
If the second ``iget`` fails, the repair has failed.

Once the in-memory representation is loaded, repair can lock the inode and can
subject it to comprehensive checks, repairs, and optimizations.
Most inode attributes are easy to check and constrain, or are user-controlled
arbitrary bit patterns; these are both easy to fix.
Dealing with the data and attr fork extent counts and the file block counts is
more complicated, because computing the correct value requires traversing the
forks, or if that fails, leaving the fields invalid and waiting for the fork
fsck functions to run.

The proposed patchset is the
`inode
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-inodes>`_
repair series.

Quota Record Repairs
--------------------

Similar to inodes, quota records ("dquots") also have both ondisk records and
an in-memory representation, and hence are subject to the same cache coherency
issues.
Somewhat confusingly, both are known as dquots in the XFS codebase.

The only repairs that should be made to the ondisk quota record buffers are
whatever is necessary to get the in-core structure loaded.
Once the in-memory representation is loaded, the only attributes needing
checking are obviously bad limits and timer values.

Quota usage counters are checked, repaired, and discussed separately in the
section about :ref:`live quotacheck <quotacheck>`.

The proposed patchset is the
`quota
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quota>`_
repair series.

.. _fscounters:

Freezing to Fix Summary Counters
--------------------------------

Filesystem summary counters track availability of filesystem resources such
as free blocks, free inodes, and allocated inodes.
This information could be compiled by walking the free space and inode indexes,
but this is a slow process, so XFS maintains a copy in the ondisk superblock
that should reflect the ondisk metadata, at least when the filesystem has been
unmounted cleanly.
For performance reasons, XFS also maintains incore copies of those counters,
which are key to enabling resource reservations for active transactions.
Writer threads reserve the worst-case quantities of resources from the
incore counter and give back whatever they don't use at commit time.
It is therefore only necessary to serialize on the superblock when the
3076d6978871SDarrick J. Wongsuperblock is being committed to disk.
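
The reserve-then-give-back pattern can be illustrated with a small userspace
sketch.
This is not the XFS implementation; ``struct demo_counter``,
``demo_reserve``, and ``demo_commit`` are hypothetical names, and the counter
is a plain integer rather than a percpu counter.

.. code-block:: c

	#include <assert.h>
	#include <stdbool.h>
	#include <stdint.h>

	/* Hypothetical stand-in for the incore free block counter. */
	struct demo_counter {
		int64_t	free_blocks;
	};

	/* Take the worst-case reservation up front, or fail without changes. */
	static bool demo_reserve(struct demo_counter *c, int64_t worst_case)
	{
		assert(worst_case >= 0);
		if (c->free_blocks < worst_case)
			return false;
		c->free_blocks -= worst_case;
		return true;
	}

	/* At commit time, return the part of the reservation that went unused. */
	static void demo_commit(struct demo_counter *c, int64_t reserved,
				int64_t used)
	{
		assert(used >= 0 && used <= reserved);
		c->free_blocks += reserved - used;
	}

In this model, a writer that reserves 30 blocks and uses 12 leaves the counter
18 blocks higher after commit than it was immediately after the reservation.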
3077d6978871SDarrick J. Wong
3078d6978871SDarrick J. WongThe lazy superblock counter feature introduced in XFS v5 took this even further
3079d6978871SDarrick J. Wongby training log recovery to recompute the summary counters from the AG headers,
3080d6978871SDarrick J. Wongwhich eliminated the need for most transactions even to touch the superblock.
3081d6978871SDarrick J. WongThe only time XFS commits the summary counters is at filesystem unmount.
3082d6978871SDarrick J. WongTo reduce contention even further, the incore counter is implemented as a
3083d6978871SDarrick J. Wongpercpu counter, which means that each CPU is allocated a batch of blocks from a
3084d6978871SDarrick J. Wongglobal incore counter and can satisfy small allocations from the local batch.
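
A simplified model of the batching described above may help to show why an
exact reading is expensive.
The sketch below follows the description in this section rather than the
kernel's percpu counter code; the ``demo_`` names, the CPU count, and the
batch size are all assumptions made for illustration.

.. code-block:: c

	#include <assert.h>
	#include <stdint.h>

	#define DEMO_NR_CPUS	4	/* assumed CPU count */
	#define DEMO_BATCH	32	/* assumed batch size */

	struct demo_pcp_counter {
		int64_t	global;			/* not handed to any CPU */
		int64_t	local[DEMO_NR_CPUS];	/* cached by each CPU */
	};

	/*
	 * Allocate from this CPU's batch, refilling from the global pool when
	 * the batch is too small.  Returns 0 on success or -1 if the global
	 * pool cannot cover the request; a real implementation would also
	 * reclaim the batches cached by other CPUs before giving up.
	 */
	static int demo_alloc(struct demo_pcp_counter *c, int cpu, int64_t nr)
	{
		assert(cpu >= 0 && cpu < DEMO_NR_CPUS && nr >= 0);
		if (c->local[cpu] < nr) {
			int64_t	want = nr - c->local[cpu];
			int64_t	refill = want > DEMO_BATCH ? want : DEMO_BATCH;

			if (refill > c->global)
				refill = c->global;
			if (refill < want)
				return -1;
			c->global -= refill;
			c->local[cpu] += refill;
		}
		c->local[cpu] -= nr;
		return 0;
	}

	/* An exact reading must visit every CPU's batch and the global pool. */
	static int64_t demo_sum(const struct demo_pcp_counter *c)
	{
		int64_t	sum = c->global;

		for (int i = 0; i < DEMO_NR_CPUS; i++)
			sum += c->local[i];
		return sum;
	}

Small allocations touch only ``local[cpu]``, which is what keeps contention
low; the cost is that the global value alone no longer reflects the true
total.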
3085d6978871SDarrick J. Wong
3086d6978871SDarrick J. WongThe high-performance nature of the summary counters makes it difficult for
3087d6978871SDarrick J. Wongonline fsck to check them, since there is no way to quiesce a percpu counter
3088d6978871SDarrick J. Wongwhile the system is running.
3089d6978871SDarrick J. WongAlthough online fsck can read the filesystem metadata to compute the correct
3090d6978871SDarrick J. Wongvalues of the summary counters, there's no way to hold the value of a percpu
3091d6978871SDarrick J. Wongcounter stable, so it's quite possible that the counter will be out of date by
3092d6978871SDarrick J. Wongthe time the walk is complete.
3093d6978871SDarrick J. WongEarlier versions of online scrub would return to userspace with an incomplete
3094d6978871SDarrick J. Wongscan flag, but this is not a satisfying outcome for a system administrator.
3095d6978871SDarrick J. WongFor repairs, the in-memory counters must be stabilized while walking the
3096d6978871SDarrick J. Wongfilesystem metadata to get an accurate reading and install it in the percpu
3097d6978871SDarrick J. Wongcounter.
3098d6978871SDarrick J. Wong
3099d6978871SDarrick J. WongTo satisfy this requirement, online fsck must prevent other programs in the
3100d6978871SDarrick J. Wongsystem from initiating new writes to the filesystem, it must disable background
3101d6978871SDarrick J. Wonggarbage collection threads, and it must wait for existing writer programs to
3102d6978871SDarrick J. Wongexit the kernel.
3103d6978871SDarrick J. WongOnce that has been established, scrub can walk the AG free space indexes, the
3104d6978871SDarrick J. Wonginode btrees, and the realtime bitmap to compute the correct value of all
3105d6978871SDarrick J. Wongfour summary counters.
3106d6978871SDarrick J. WongThis is very similar to a filesystem freeze, though not all of the pieces are
3107d6978871SDarrick J. Wongnecessary:
3108d6978871SDarrick J. Wong
3109d6978871SDarrick J. Wong- The final freeze state is set one higher than ``SB_FREEZE_COMPLETE`` to
3110d6978871SDarrick J. Wong  prevent other threads from thawing the filesystem, or other scrub threads
3111d6978871SDarrick J. Wong  from initiating another fscounters freeze.
3112d6978871SDarrick J. Wong
3113d6978871SDarrick J. Wong- It does not quiesce the log.
3114d6978871SDarrick J. Wong
3115d6978871SDarrick J. WongWith this code in place, it is now possible to pause the filesystem for just
3116d6978871SDarrick J. Wonglong enough to check and correct the summary counters.
3117d6978871SDarrick J. Wong
3118d6978871SDarrick J. Wong+--------------------------------------------------------------------------+
3119d6978871SDarrick J. Wong| **Historical Sidebar**:                                                  |
3120d6978871SDarrick J. Wong+--------------------------------------------------------------------------+
3121d6978871SDarrick J. Wong| The initial implementation used the actual VFS filesystem freeze         |
3122d6978871SDarrick J. Wong| mechanism to quiesce filesystem activity.                                |
3123d6978871SDarrick J. Wong| With the filesystem frozen, it is possible to resolve the counter values |
3124d6978871SDarrick J. Wong| with exact precision, but there are many problems with calling the VFS   |
3125d6978871SDarrick J. Wong| methods directly:                                                        |
3126d6978871SDarrick J. Wong|                                                                          |
3127d6978871SDarrick J. Wong| - Other programs can unfreeze the filesystem without our knowledge.      |
3128d6978871SDarrick J. Wong|   This leads to incorrect scan results and incorrect repairs.            |
3129d6978871SDarrick J. Wong|                                                                          |
3130d6978871SDarrick J. Wong| - Adding an extra lock to prevent others from thawing the filesystem     |
3131d6978871SDarrick J. Wong|   required the addition of a ``->freeze_super`` function to wrap         |
3132d6978871SDarrick J. Wong|   ``freeze_fs()``.                                                       |
3133d6978871SDarrick J. Wong|   This in turn caused other subtle problems because it turns out that    |
3134d6978871SDarrick J. Wong|   the VFS ``freeze_super`` and ``thaw_super`` functions can drop the     |
3135d6978871SDarrick J. Wong|   last reference to the VFS superblock, and any subsequent access        |
3136d6978871SDarrick J. Wong|   becomes a UAF bug!                                                     |
3137d6978871SDarrick J. Wong|   This can happen if the filesystem is unmounted while the underlying    |
3138d6978871SDarrick J. Wong|   block device has frozen the filesystem.                                |
3139d6978871SDarrick J. Wong|   This problem could be solved by grabbing extra references to the       |
3140d6978871SDarrick J. Wong|   superblock, but it felt suboptimal given the other inadequacies of     |
3141d6978871SDarrick J. Wong|   this approach.                                                         |
3142d6978871SDarrick J. Wong|                                                                          |
3143d6978871SDarrick J. Wong| - The log need not be quiesced to check the summary counters, but a VFS  |
3144d6978871SDarrick J. Wong|   freeze initiates one anyway.                                           |
3145d6978871SDarrick J. Wong|   This adds unnecessary runtime to live fscounter fsck operations.       |
3146d6978871SDarrick J. Wong|                                                                          |
3147d6978871SDarrick J. Wong| - Quiescing the log means that XFS flushes the (possibly incorrect)      |
3148d6978871SDarrick J. Wong|   counters to disk as part of cleaning the log.                          |
3149d6978871SDarrick J. Wong|                                                                          |
3150d6978871SDarrick J. Wong| - A bug in the VFS meant that freeze could complete even when            |
3151d6978871SDarrick J. Wong|   sync_filesystem fails to flush the filesystem and returns an error.    |
3152d6978871SDarrick J. Wong|   This bug was fixed in Linux 5.17.                                      |
3153d6978871SDarrick J. Wong+--------------------------------------------------------------------------+
3154d6978871SDarrick J. Wong
3155d6978871SDarrick J. WongThe proposed patchset is the
3156d6978871SDarrick J. Wong`summary counter cleanup
3157d6978871SDarrick J. Wong<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-fscounters>`_
3158d6978871SDarrick J. Wongseries.
3159a0d856eeSDarrick J. Wong
3160a0d856eeSDarrick J. WongFull Filesystem Scans
3161a0d856eeSDarrick J. Wong---------------------
3162a0d856eeSDarrick J. Wong
3163a0d856eeSDarrick J. WongCertain types of metadata can only be checked by walking every file in the
3164a0d856eeSDarrick J. Wongentire filesystem to record observations and comparing the observations against
3165a0d856eeSDarrick J. Wongwhat's recorded on disk.
3166a0d856eeSDarrick J. WongLike every other type of online repair, repairs are made by writing those
3167a0d856eeSDarrick J. Wongobservations to disk in a replacement structure and committing it atomically.
3168a0d856eeSDarrick J. WongHowever, it is not practical to shut down the entire filesystem to examine
3169a0d856eeSDarrick J. Wonghundreds of billions of files because the downtime would be excessive.
3170a0d856eeSDarrick J. WongTherefore, online fsck must build the infrastructure to manage a live scan of
3171a0d856eeSDarrick J. Wongall the files in the filesystem.
3172a0d856eeSDarrick J. WongThere are two questions that need to be solved to perform a live walk:
3173a0d856eeSDarrick J. Wong
3174a0d856eeSDarrick J. Wong- How does scrub manage the scan while it is collecting data?
3175a0d856eeSDarrick J. Wong
3176a0d856eeSDarrick J. Wong- How does the scan keep abreast of changes being made to the system by other
3177a0d856eeSDarrick J. Wong  threads?
3178a0d856eeSDarrick J. Wong
3179a0d856eeSDarrick J. Wong.. _iscan:
3180a0d856eeSDarrick J. Wong
3181a0d856eeSDarrick J. WongCoordinated Inode Scans
3182a0d856eeSDarrick J. Wong```````````````````````
3183a0d856eeSDarrick J. Wong
3184a0d856eeSDarrick J. WongIn the original Unix filesystems of the 1970s, each directory entry contained
an index number (*inumber*) which was used as an index into an ondisk array
3186a0d856eeSDarrick J. Wong(*itable*) of fixed-size records (*inodes*) describing a file's attributes and
3187a0d856eeSDarrick J. Wongits data block mapping.
3188a0d856eeSDarrick J. WongThis system is described by J. Lions, `"inode (5659)"
3189a0d856eeSDarrick J. Wong<http://www.lemis.com/grog/Documentation/Lions/>`_ in *Lions' Commentary on
3190a0d856eeSDarrick J. WongUNIX, 6th Edition*, (Dept. of Computer Science, the University of New South
3191a0d856eeSDarrick J. WongWales, November 1977), pp. 18-2; and later by D. Ritchie and K. Thompson,
3192a0d856eeSDarrick J. Wong`"Implementation of the File System"
3193a0d856eeSDarrick J. Wong<https://archive.org/details/bstj57-6-1905/page/n8/mode/1up>`_, from *The UNIX
3194a0d856eeSDarrick J. WongTime-Sharing System*, (The Bell System Technical Journal, July 1978), pp.
3195a0d856eeSDarrick J. Wong1913-4.
3196a0d856eeSDarrick J. Wong
3197a0d856eeSDarrick J. WongXFS retains most of this design, except now inumbers are search keys over all
the space in the data section of the filesystem.
3199a0d856eeSDarrick J. WongThey form a continuous keyspace that can be expressed as a 64-bit integer,
3200a0d856eeSDarrick J. Wongthough the inodes themselves are sparsely distributed within the keyspace.
3201a0d856eeSDarrick J. WongScans proceed in a linear fashion across the inumber keyspace, starting from
3202a0d856eeSDarrick J. Wong``0x0`` and ending at ``0xFFFFFFFFFFFFFFFF``.
3203a0d856eeSDarrick J. WongNaturally, a scan through a keyspace requires a scan cursor object to track the
3204a0d856eeSDarrick J. Wongscan progress.
3205a0d856eeSDarrick J. WongBecause this keyspace is sparse, this cursor contains two parts.
3206a0d856eeSDarrick J. WongThe first part of this scan cursor object tracks the inode that will be
3207a0d856eeSDarrick J. Wongexamined next; call this the examination cursor.
3208a0d856eeSDarrick J. WongSomewhat less obviously, the scan cursor object must also track which parts of
3209a0d856eeSDarrick J. Wongthe keyspace have already been visited, which is critical for deciding if a
3210a0d856eeSDarrick J. Wongconcurrent filesystem update needs to be incorporated into the scan data.
3211a0d856eeSDarrick J. WongCall this the visited inode cursor.
3212a0d856eeSDarrick J. Wong
3213a0d856eeSDarrick J. WongAdvancing the scan cursor is a multi-step process encapsulated in
3214a0d856eeSDarrick J. Wong``xchk_iscan_iter``:
3215a0d856eeSDarrick J. Wong
3216a0d856eeSDarrick J. Wong1. Lock the AGI buffer of the AG containing the inode pointed to by the visited
3217a0d856eeSDarrick J. Wong   inode cursor.
   This guarantees that inodes in this AG cannot be allocated or freed while
3219a0d856eeSDarrick J. Wong   advancing the cursor.
3220a0d856eeSDarrick J. Wong
3221a0d856eeSDarrick J. Wong2. Use the per-AG inode btree to look up the next inumber after the one that
3222a0d856eeSDarrick J. Wong   was just visited, since it may not be keyspace adjacent.
3223a0d856eeSDarrick J. Wong
3224a0d856eeSDarrick J. Wong3. If there are no more inodes left in this AG:
3225a0d856eeSDarrick J. Wong
3226a0d856eeSDarrick J. Wong   a. Move the examination cursor to the point of the inumber keyspace that
3227a0d856eeSDarrick J. Wong      corresponds to the start of the next AG.
3228a0d856eeSDarrick J. Wong
3229a0d856eeSDarrick J. Wong   b. Adjust the visited inode cursor to indicate that it has "visited" the
3230a0d856eeSDarrick J. Wong      last possible inode in the current AG's inode keyspace.
3231a0d856eeSDarrick J. Wong      XFS inumbers are segmented, so the cursor needs to be marked as having
3232a0d856eeSDarrick J. Wong      visited the entire keyspace up to just before the start of the next AG's
3233a0d856eeSDarrick J. Wong      inode keyspace.
3234a0d856eeSDarrick J. Wong
3235a0d856eeSDarrick J. Wong   c. Unlock the AGI and return to step 1 if there are unexamined AGs in the
3236a0d856eeSDarrick J. Wong      filesystem.
3237a0d856eeSDarrick J. Wong
3238a0d856eeSDarrick J. Wong   d. If there are no more AGs to examine, set both cursors to the end of the
3239a0d856eeSDarrick J. Wong      inumber keyspace.
3240a0d856eeSDarrick J. Wong      The scan is now complete.
3241a0d856eeSDarrick J. Wong
3242a0d856eeSDarrick J. Wong4. Otherwise, there is at least one more inode to scan in this AG:
3243a0d856eeSDarrick J. Wong
3244a0d856eeSDarrick J. Wong   a. Move the examination cursor ahead to the next inode marked as allocated
3245a0d856eeSDarrick J. Wong      by the inode btree.
3246a0d856eeSDarrick J. Wong
3247a0d856eeSDarrick J. Wong   b. Adjust the visited inode cursor to point to the inode just prior to where
3248a0d856eeSDarrick J. Wong      the examination cursor is now.
3249a0d856eeSDarrick J. Wong      Because the scanner holds the AGI buffer lock, no inodes could have been
3250a0d856eeSDarrick J. Wong      created in the part of the inode keyspace that the visited inode cursor
3251a0d856eeSDarrick J. Wong      just advanced.
3252a0d856eeSDarrick J. Wong
3253a0d856eeSDarrick J. Wong5. Get the incore inode for the inumber of the examination cursor.
3254a0d856eeSDarrick J. Wong   By maintaining the AGI buffer lock until this point, the scanner knows that
3255a0d856eeSDarrick J. Wong   it was safe to advance the examination cursor across the entire keyspace,
3256a0d856eeSDarrick J. Wong   and that it has stabilized this next inode so that it cannot disappear from
3257a0d856eeSDarrick J. Wong   the filesystem until the scan releases the incore inode.
3258a0d856eeSDarrick J. Wong
3259a0d856eeSDarrick J. Wong6. Drop the AGI lock and return the incore inode to the caller.
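
The cursor movements in the steps above can be modelled in userspace.
This sketch is an illustration only: each AG is represented by a sorted array
of allocated inumbers instead of an inode btree, no locks are taken, inumber
zero is assumed never to be allocated, and all of the ``demo_`` names and the
``DEMO_AG_SHIFT`` value are hypothetical.

.. code-block:: c

	#include <assert.h>
	#include <stddef.h>
	#include <stdint.h>

	#define DEMO_AG_SHIFT	8	/* assumed: 256 inumbers per AG */

	struct demo_ag {
		const uint64_t	*inos;	/* allocated inumbers, ascending */
		size_t		nr;
	};

	struct demo_iscan {
		uint64_t	cursor_ino;	/* examination cursor */
		uint64_t	visited_ino;	/* visited inode cursor */
	};

	/* Stand-in for the inode btree: first allocated inumber > @ino. */
	static int demo_lookup_gt(const struct demo_ag *ag, uint64_t ino,
				  uint64_t *found)
	{
		for (size_t i = 0; i < ag->nr; i++) {
			if (ag->inos[i] > ino) {
				*found = ag->inos[i];
				return 1;
			}
		}
		return 0;
	}

	/* Returns 1 and sets *inop if there is another inode, else 0. */
	static int demo_iscan_iter(struct demo_iscan *s,
				   const struct demo_ag *ags, uint64_t nr_ags,
				   uint64_t *inop)
	{
		uint64_t	agno;

		if (s->visited_ino == UINT64_MAX)
			return 0;	/* scan already finished */

		agno = (s->visited_ino + 1) >> DEMO_AG_SHIFT;
		while (agno < nr_ags) {
			/* Steps 1-2: the real code holds the AGI here. */
			if (demo_lookup_gt(&ags[agno], s->visited_ino, inop)) {
				/* Step 4: nothing is allocated below *inop. */
				s->cursor_ino = *inop;
				s->visited_ino = *inop - 1;
				return 1;
			}
			/* Step 3: mark the rest of this AG as visited. */
			agno++;
			s->cursor_ino = agno << DEMO_AG_SHIFT;
			s->visited_ino = s->cursor_ino - 1;
		}
		/* Step 3d: no more AGs, so the scan is complete. */
		s->cursor_ino = UINT64_MAX;
		s->visited_ino = UINT64_MAX;
		return 0;
	}

	/* Called by the scanner once it has finished examining @ino. */
	static void demo_iscan_mark_visited(struct demo_iscan *s, uint64_t ino)
	{
		s->visited_ino = ino;
	}

In this model the caller must mark each returned inode as visited before
asking for the next one; otherwise the same inode is returned again.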
3260a0d856eeSDarrick J. Wong
3261a0d856eeSDarrick J. WongOnline fsck functions scan all files in the filesystem as follows:
3262a0d856eeSDarrick J. Wong
3263a0d856eeSDarrick J. Wong1. Start a scan by calling ``xchk_iscan_start``.
3264a0d856eeSDarrick J. Wong
3265a0d856eeSDarrick J. Wong2. Advance the scan cursor (``xchk_iscan_iter``) to get the next inode.
3266a0d856eeSDarrick J. Wong   If one is provided:
3267a0d856eeSDarrick J. Wong
3268a0d856eeSDarrick J. Wong   a. Lock the inode to prevent updates during the scan.
3269a0d856eeSDarrick J. Wong
3270a0d856eeSDarrick J. Wong   b. Scan the inode.
3271a0d856eeSDarrick J. Wong
3272a0d856eeSDarrick J. Wong   c. While still holding the inode lock, adjust the visited inode cursor
3273a0d856eeSDarrick J. Wong      (``xchk_iscan_mark_visited``) to point to this inode.
3274a0d856eeSDarrick J. Wong
3275a0d856eeSDarrick J. Wong   d. Unlock and release the inode.
3276a0d856eeSDarrick J. Wong
3. Call ``xchk_iscan_teardown`` to complete the scan.
3278a0d856eeSDarrick J. Wong
3279a0d856eeSDarrick J. WongThere are subtleties with the inode cache that complicate grabbing the incore
3280a0d856eeSDarrick J. Wonginode for the caller.
3281a0d856eeSDarrick J. WongObviously, it is an absolute requirement that the inode metadata be consistent
3282a0d856eeSDarrick J. Wongenough to load it into the inode cache.
3283a0d856eeSDarrick J. WongSecond, if the incore inode is stuck in some intermediate state, the scan
3284a0d856eeSDarrick J. Wongcoordinator must release the AGI and push the main filesystem to get the inode
3285a0d856eeSDarrick J. Wongback into a loadable state.
3286a0d856eeSDarrick J. Wong
3287a0d856eeSDarrick J. WongThe proposed patches are the
3288a0d856eeSDarrick J. Wong`inode scanner
3289a0d856eeSDarrick J. Wong<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-iscan>`_
3290a0d856eeSDarrick J. Wongseries.
3291a0d856eeSDarrick J. WongThe first user of the new functionality is the
3292a0d856eeSDarrick J. Wong`online quotacheck
3293a0d856eeSDarrick J. Wong<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quotacheck>`_
3294a0d856eeSDarrick J. Wongseries.
3295a0d856eeSDarrick J. Wong
3296a0d856eeSDarrick J. WongInode Management
3297a0d856eeSDarrick J. Wong````````````````
3298a0d856eeSDarrick J. Wong
3299a0d856eeSDarrick J. WongIn regular filesystem code, references to allocated XFS incore inodes are
3300a0d856eeSDarrick J. Wongalways obtained (``xfs_iget``) outside of transaction context because the
3301a0d856eeSDarrick J. Wongcreation of the incore context for an existing file does not require metadata
3302a0d856eeSDarrick J. Wongupdates.
3303a0d856eeSDarrick J. WongHowever, it is important to note that references to incore inodes obtained as
3304a0d856eeSDarrick J. Wongpart of file creation must be performed in transaction context because the
3305a0d856eeSDarrick J. Wongfilesystem must ensure the atomicity of the ondisk inode btree index updates
3306a0d856eeSDarrick J. Wongand the initialization of the actual ondisk inode.
3307a0d856eeSDarrick J. Wong
3308a0d856eeSDarrick J. WongReferences to incore inodes are always released (``xfs_irele``) outside of
3309a0d856eeSDarrick J. Wongtransaction context because there are a handful of activities that might
3310a0d856eeSDarrick J. Wongrequire ondisk updates:
3311a0d856eeSDarrick J. Wong
3312a0d856eeSDarrick J. Wong- The VFS may decide to kick off writeback as part of a ``DONTCACHE`` inode
3313a0d856eeSDarrick J. Wong  release.
3314a0d856eeSDarrick J. Wong
3315a0d856eeSDarrick J. Wong- Speculative preallocations need to be unreserved.
3316a0d856eeSDarrick J. Wong
3317a0d856eeSDarrick J. Wong- An unlinked file may have lost its last reference, in which case the entire
3318a0d856eeSDarrick J. Wong  file must be inactivated, which involves releasing all of its resources in
3319a0d856eeSDarrick J. Wong  the ondisk metadata and freeing the inode.
3320a0d856eeSDarrick J. Wong
3321a0d856eeSDarrick J. WongThese activities are collectively called inode inactivation.
3322a0d856eeSDarrick J. WongInactivation has two parts -- the VFS part, which initiates writeback on all
3323a0d856eeSDarrick J. Wongdirty file pages, and the XFS part, which cleans up XFS-specific information
3324a0d856eeSDarrick J. Wongand frees the inode if it was unlinked.
3325a0d856eeSDarrick J. WongIf the inode is unlinked (or unconnected after a file handle operation), the
3326a0d856eeSDarrick J. Wongkernel drops the inode into the inactivation machinery immediately.
3327a0d856eeSDarrick J. Wong
3328a0d856eeSDarrick J. WongDuring normal operation, resource acquisition for an update follows this order
3329a0d856eeSDarrick J. Wongto avoid deadlocks:
3330a0d856eeSDarrick J. Wong
3331a0d856eeSDarrick J. Wong1. Inode reference (``iget``).
3332a0d856eeSDarrick J. Wong
3333a0d856eeSDarrick J. Wong2. Filesystem freeze protection, if repairing (``mnt_want_write_file``).
3334a0d856eeSDarrick J. Wong
3335a0d856eeSDarrick J. Wong3. Inode ``IOLOCK`` (VFS ``i_rwsem``) lock to control file IO.
3336a0d856eeSDarrick J. Wong
3337a0d856eeSDarrick J. Wong4. Inode ``MMAPLOCK`` (page cache ``invalidate_lock``) lock for operations that
3338a0d856eeSDarrick J. Wong   can update page cache mappings.
3339a0d856eeSDarrick J. Wong
3340a0d856eeSDarrick J. Wong5. Log feature enablement.
3341a0d856eeSDarrick J. Wong
3342a0d856eeSDarrick J. Wong6. Transaction log space grant.
3343a0d856eeSDarrick J. Wong
3344a0d856eeSDarrick J. Wong7. Space on the data and realtime devices for the transaction.
3345a0d856eeSDarrick J. Wong
3346a0d856eeSDarrick J. Wong8. Incore dquot references, if a file is being repaired.
3347a0d856eeSDarrick J. Wong   Note that they are not locked, merely acquired.
3348a0d856eeSDarrick J. Wong
3349a0d856eeSDarrick J. Wong9. Inode ``ILOCK`` for file metadata updates.
3350a0d856eeSDarrick J. Wong
3351a0d856eeSDarrick J. Wong10. AG header buffer locks / Realtime metadata inode ILOCK.
3352a0d856eeSDarrick J. Wong
3353a0d856eeSDarrick J. Wong11. Realtime metadata buffer locks, if applicable.
3354a0d856eeSDarrick J. Wong
3355a0d856eeSDarrick J. Wong12. Extent mapping btree blocks, if applicable.
3356a0d856eeSDarrick J. Wong
3357a0d856eeSDarrick J. WongResources are often released in the reverse order, though this is not required.
3358a0d856eeSDarrick J. WongHowever, online fsck differs from regular XFS operations because it may examine
3359a0d856eeSDarrick J. Wongan object that normally is acquired in a later stage of the locking order, and
3360a0d856eeSDarrick J. Wongthen decide to cross-reference the object with an object that is acquired
3361a0d856eeSDarrick J. Wongearlier in the order.
3362a0d856eeSDarrick J. WongThe next few sections detail the specific ways in which online fsck takes care
3363a0d856eeSDarrick J. Wongto avoid deadlocks.
3364a0d856eeSDarrick J. Wong
3365a0d856eeSDarrick J. Wongiget and irele During a Scrub
3366a0d856eeSDarrick J. Wong^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
3367a0d856eeSDarrick J. Wong
3368a0d856eeSDarrick J. WongAn inode scan performed on behalf of a scrub operation runs in transaction
3369a0d856eeSDarrick J. Wongcontext, and possibly with resources already locked and bound to it.
3370a0d856eeSDarrick J. WongThis isn't much of a problem for ``iget`` since it can operate in the context
3371a0d856eeSDarrick J. Wongof an existing transaction, as long as all of the bound resources are acquired
3372a0d856eeSDarrick J. Wongbefore the inode reference in the regular filesystem.
3373a0d856eeSDarrick J. Wong
3374a0d856eeSDarrick J. WongWhen the VFS ``iput`` function is given a linked inode with no other
3375a0d856eeSDarrick J. Wongreferences, it normally puts the inode on an LRU list in the hope that it can
3376a0d856eeSDarrick J. Wongsave time if another process re-opens the file before the system runs out
3377a0d856eeSDarrick J. Wongof memory and frees it.
3378a0d856eeSDarrick J. WongFilesystem callers can short-circuit the LRU process by setting a ``DONTCACHE``
3379a0d856eeSDarrick J. Wongflag on the inode to cause the kernel to try to drop the inode into the
3380a0d856eeSDarrick J. Wonginactivation machinery immediately.
3381a0d856eeSDarrick J. Wong
3382a0d856eeSDarrick J. WongIn the past, inactivation was always done from the process that dropped the
3383a0d856eeSDarrick J. Wonginode, which was a problem for scrub because scrub may already hold a
3384a0d856eeSDarrick J. Wongtransaction, and XFS does not support nesting transactions.
3385a0d856eeSDarrick J. WongOn the other hand, if there is no scrub transaction, it is desirable to drop
3386a0d856eeSDarrick J. Wongotherwise unused inodes immediately to avoid polluting caches.
3387a0d856eeSDarrick J. WongTo capture these nuances, the online fsck code has a separate ``xchk_irele``
3388a0d856eeSDarrick J. Wongfunction to set or clear the ``DONTCACHE`` flag to get the required release
3389a0d856eeSDarrick J. Wongbehavior.
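
The release decision can be sketched as follows.
This is a userspace model and not the code from the patchset:
``struct demo_inode``, ``struct demo_scrub``, and ``demo_scrub_irele`` are
hypothetical, and the "last reference" test is an assumption about what
"otherwise unused" means.

.. code-block:: c

	#include <assert.h>
	#include <stdbool.h>

	struct demo_inode {
		int	refcount;
		bool	dontcache;	/* models the DONTCACHE flag */
	};

	struct demo_scrub {
		bool	has_transaction;
	};

	/* Choose the caching hint, then drop the scrub reference. */
	static void demo_scrub_irele(const struct demo_scrub *sc,
				     struct demo_inode *ip)
	{
		assert(ip->refcount > 0);
		if (sc->has_transaction) {
			/*
			 * Inactivation might need its own transaction, and
			 * transactions do not nest, so ask for the inode to
			 * stay cached instead.
			 */
			ip->dontcache = false;
		} else if (ip->refcount == 1) {
			/* Nobody else wants it; avoid polluting the cache. */
			ip->dontcache = true;
		}
		ip->refcount--;
	}

Note that an inode with other active references keeps whatever hint it
already had when no transaction is held.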
3390a0d856eeSDarrick J. Wong
3391a0d856eeSDarrick J. WongProposed patchsets include fixing
3392a0d856eeSDarrick J. Wong`scrub iget usage
3393a0d856eeSDarrick J. Wong<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-iget-fixes>`_ and
3394a0d856eeSDarrick J. Wong`dir iget usage
3395a0d856eeSDarrick J. Wong<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-dir-iget-fixes>`_.
3396a0d856eeSDarrick J. Wong
33972f754f7fSDarrick J. Wong.. _ilocking:
33982f754f7fSDarrick J. Wong
3399a0d856eeSDarrick J. WongLocking Inodes
3400a0d856eeSDarrick J. Wong^^^^^^^^^^^^^^
3401a0d856eeSDarrick J. Wong
3402a0d856eeSDarrick J. WongIn regular filesystem code, the VFS and XFS will acquire multiple IOLOCK locks
3403a0d856eeSDarrick J. Wongin a well-known order: parent → child when updating the directory tree, and
3404a0d856eeSDarrick J. Wongin numerical order of the addresses of their ``struct inode`` object otherwise.
3405a0d856eeSDarrick J. WongFor regular files, the MMAPLOCK can be acquired after the IOLOCK to stop page
3406a0d856eeSDarrick J. Wongfaults.
3407a0d856eeSDarrick J. WongIf two MMAPLOCKs must be acquired, they are acquired in numerical order of
3408a0d856eeSDarrick J. Wongthe addresses of their ``struct address_space`` objects.
3409a0d856eeSDarrick J. WongDue to the structure of existing filesystem code, IOLOCKs and MMAPLOCKs must be
3410a0d856eeSDarrick J. Wongacquired before transactions are allocated.
3411a0d856eeSDarrick J. WongIf two ILOCKs must be acquired, they are acquired in inumber order.
3412a0d856eeSDarrick J. Wong
3413a0d856eeSDarrick J. WongInode lock acquisition must be done carefully during a coordinated inode scan.
Online fsck cannot abide by these conventions, because for a directory tree
3415a0d856eeSDarrick J. Wongscanner, the scrub process holds the IOLOCK of the file being scanned and it
3416a0d856eeSDarrick J. Wongneeds to take the IOLOCK of the file at the other end of the directory link.
3417a0d856eeSDarrick J. WongIf the directory tree is corrupt because it contains a cycle, ``xfs_scrub``
3418a0d856eeSDarrick J. Wongcannot use the regular inode locking functions and avoid becoming trapped in an
3419a0d856eeSDarrick J. WongABBA deadlock.
3420a0d856eeSDarrick J. Wong
3421a0d856eeSDarrick J. WongSolving both of these problems is straightforward -- any time online fsck
3422a0d856eeSDarrick J. Wongneeds to take a second lock of the same class, it uses trylock to avoid an ABBA
3423a0d856eeSDarrick J. Wongdeadlock.
If the trylock fails, scrub drops all inode locks and uses trylock loops to
3425a0d856eeSDarrick J. Wong(re)acquire all necessary resources.
3426a0d856eeSDarrick J. WongTrylock loops enable scrub to check for pending fatal signals, which is how
3427a0d856eeSDarrick J. Wongscrub avoids deadlocking the filesystem or becoming an unresponsive process.
However, trylock loops mean that online fsck must be prepared to measure the
3429a0d856eeSDarrick J. Wongresource being scrubbed before and after the lock cycle to detect changes and
3430a0d856eeSDarrick J. Wongreact accordingly.
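
The shape of such a trylock loop is sketched below.
To keep the example self-contained, the locks are modelled with C11
``atomic_flag`` objects, and the fatal signal check is replaced by a simple
attempt budget; every ``demo_`` name is hypothetical.

.. code-block:: c

	#include <assert.h>
	#include <stdatomic.h>
	#include <stdbool.h>

	struct demo_lock {
		atomic_flag	held;
	};

	static void demo_lock_init(struct demo_lock *l)
	{
		atomic_flag_clear(&l->held);
	}

	static bool demo_trylock(struct demo_lock *l)
	{
		return !atomic_flag_test_and_set(&l->held);
	}

	static void demo_unlock(struct demo_lock *l)
	{
		atomic_flag_clear(&l->held);
	}

	/* Stand-in for a pending fatal signal check: an attempt budget. */
	static int demo_tries_left;

	static bool demo_should_stop(void)
	{
		return demo_tries_left-- <= 0;
	}

	/*
	 * Take two locks of the same class without imposing an order.
	 * Returns 0 with both locks held, or -1 with neither lock held.
	 */
	static int demo_lock_pair(struct demo_lock *a, struct demo_lock *b)
	{
		for (;;) {
			if (demo_should_stop())
				return -1;
			if (!demo_trylock(a))
				continue;
			if (demo_trylock(b))
				return 0;
			/* Back off completely so the other owner can finish. */
			demo_unlock(a);
		}
	}

Because the first lock is dropped whenever the second cannot be taken, a
caller of something like ``demo_lock_pair`` has to revalidate anything it
learned about the first object before the lock cycle.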
3431a0d856eeSDarrick J. Wong
3432a0d856eeSDarrick J. Wong.. _dirparent:
3433a0d856eeSDarrick J. Wong
3434a0d856eeSDarrick J. WongCase Study: Finding a Directory Parent
3435a0d856eeSDarrick J. Wong^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
3436a0d856eeSDarrick J. Wong
3437a0d856eeSDarrick J. WongConsider the directory parent pointer repair code as an example.
3438a0d856eeSDarrick J. WongOnline fsck must verify that the dotdot dirent of a directory points up to a
3439a0d856eeSDarrick J. Wongparent directory, and that the parent directory contains exactly one dirent
3440a0d856eeSDarrick J. Wongpointing down to the child directory.
3441a0d856eeSDarrick J. WongFully validating this relationship (and repairing it if possible) requires a
3442a0d856eeSDarrick J. Wongwalk of every directory on the filesystem while holding the child locked, and
3443a0d856eeSDarrick J. Wongwhile updates to the directory tree are being made.
3444a0d856eeSDarrick J. WongThe coordinated inode scan provides a way to walk the filesystem without the
3445a0d856eeSDarrick J. Wongpossibility of missing an inode.
3446a0d856eeSDarrick J. WongThe child directory is kept locked to prevent updates to the dotdot dirent, but
3447a0d856eeSDarrick J. Wongif the scanner fails to lock a parent, it can drop and relock both the child
3448a0d856eeSDarrick J. Wongand the prospective parent.
3449a0d856eeSDarrick J. WongIf the dotdot entry changes while the directory is unlocked, then a move or
3450a0d856eeSDarrick J. Wongrename operation must have changed the child's parentage, and the scan can
3451a0d856eeSDarrick J. Wongexit early.
3452a0d856eeSDarrick J. Wong
3453a0d856eeSDarrick J. WongThe proposed patchset is the
3454a0d856eeSDarrick J. Wong`directory repair
3455a0d856eeSDarrick J. Wong<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-dirs>`_
3456a0d856eeSDarrick J. Wongseries.
3457a0d856eeSDarrick J. Wong
3458a0d856eeSDarrick J. Wong.. _fshooks:
3459a0d856eeSDarrick J. Wong
3460a0d856eeSDarrick J. WongFilesystem Hooks
3461a0d856eeSDarrick J. Wong`````````````````
3462a0d856eeSDarrick J. Wong
3463a0d856eeSDarrick J. WongThe second piece of support that online fsck functions need during a full
3464a0d856eeSDarrick J. Wongfilesystem scan is the ability to stay informed about updates being made by
3465a0d856eeSDarrick J. Wongother threads in the filesystem, since comparisons against the past are useless
3466a0d856eeSDarrick J. Wongin a dynamic environment.
3467a0d856eeSDarrick J. WongTwo pieces of Linux kernel infrastructure enable online fsck to monitor regular
3468a0d856eeSDarrick J. Wongfilesystem operations: filesystem hooks and :ref:`static keys<jump_labels>`.
3469a0d856eeSDarrick J. Wong
3470a0d856eeSDarrick J. WongFilesystem hooks convey information about an ongoing filesystem operation to
3471a0d856eeSDarrick J. Wonga downstream consumer.
3472a0d856eeSDarrick J. WongIn this case, the downstream consumer is always an online fsck function.
3473a0d856eeSDarrick J. WongBecause multiple fsck functions can run in parallel, online fsck uses the Linux
3474a0d856eeSDarrick J. Wongnotifier call chain facility to dispatch updates to any number of interested
3475a0d856eeSDarrick J. Wongfsck processes.
3476a0d856eeSDarrick J. WongCall chains are a dynamic list, which means that they can be configured at
3477a0d856eeSDarrick J. Wongrun time.
3478a0d856eeSDarrick J. WongBecause these hooks are private to the XFS module, the information passed along
3479a0d856eeSDarrick J. Wongcontains exactly what the checking function needs to update its observations.
3480a0d856eeSDarrick J. Wong
3481a0d856eeSDarrick J. WongThe current implementation of XFS hooks uses SRCU notifier chains to reduce the
3482a0d856eeSDarrick J. Wongimpact to highly threaded workloads.
3483a0d856eeSDarrick J. WongRegular blocking notifier chains use a rwsem and seem to have a much lower
3484a0d856eeSDarrick J. Wongoverhead for single-threaded applications.
It may turn out that blocking chains combined with static keys perform better
overall; more study is needed here.
3487a0d856eeSDarrick J. Wong
3488a0d856eeSDarrick J. WongThe following pieces are necessary to hook a certain point in the filesystem:
3489a0d856eeSDarrick J. Wong
3490a0d856eeSDarrick J. Wong- A ``struct xfs_hooks`` object must be embedded in a convenient place such as
3491a0d856eeSDarrick J. Wong  a well-known incore filesystem object.
3492a0d856eeSDarrick J. Wong
3493a0d856eeSDarrick J. Wong- Each hook must define an action code and a structure containing more context
3494a0d856eeSDarrick J. Wong  about the action.
3495a0d856eeSDarrick J. Wong
3496a0d856eeSDarrick J. Wong- Hook providers should provide appropriate wrapper functions and structs
3497a0d856eeSDarrick J. Wong  around the ``xfs_hooks`` and ``xfs_hook`` objects to take advantage of type
3498a0d856eeSDarrick J. Wong  checking to ensure correct usage.
3499a0d856eeSDarrick J. Wong
3500a0d856eeSDarrick J. Wong- A callsite in the regular filesystem code must be chosen to call
3501a0d856eeSDarrick J. Wong  ``xfs_hooks_call`` with the action code and data structure.
3502a0d856eeSDarrick J. Wong  This place should be adjacent to (and not earlier than) the place where
3503a0d856eeSDarrick J. Wong  the filesystem update is committed to the transaction.
3504a0d856eeSDarrick J. Wong  In general, when the filesystem calls a hook chain, it should be able to
3505a0d856eeSDarrick J. Wong  handle sleeping and should not be vulnerable to memory reclaim or locking
3506a0d856eeSDarrick J. Wong  recursion.
3507a0d856eeSDarrick J. Wong  However, the exact requirements are very dependent on the context of the hook
3508a0d856eeSDarrick J. Wong  caller and the callee.
3509a0d856eeSDarrick J. Wong
3510a0d856eeSDarrick J. Wong- The online fsck function should define a structure to hold scan data, a lock
3511a0d856eeSDarrick J. Wong  to coordinate access to the scan data, and a ``struct xfs_hook`` object.
3512a0d856eeSDarrick J. Wong  The scanner function and the regular filesystem code must acquire resources
3513a0d856eeSDarrick J. Wong  in the same order; see the next section for details.
3514a0d856eeSDarrick J. Wong
3515a0d856eeSDarrick J. Wong- The online fsck code must contain a C function to catch the hook action code
3516a0d856eeSDarrick J. Wong  and data structure.
3517a0d856eeSDarrick J. Wong  If the object being updated has already been visited by the scan, then the
3518a0d856eeSDarrick J. Wong  hook information must be applied to the scan data.
3519a0d856eeSDarrick J. Wong
3520a0d856eeSDarrick J. Wong- Prior to unlocking inodes to start the scan, online fsck must call
3521a0d856eeSDarrick J. Wong  ``xfs_hooks_setup`` to initialize the ``struct xfs_hook``, and
3522a0d856eeSDarrick J. Wong  ``xfs_hooks_add`` to enable the hook.
3523a0d856eeSDarrick J. Wong
3524a0d856eeSDarrick J. Wong- Online fsck must call ``xfs_hooks_del`` to disable the hook once the scan is
3525a0d856eeSDarrick J. Wong  complete.
3526a0d856eeSDarrick J. Wong
3527a0d856eeSDarrick J. WongThe number of hooks should be kept to a minimum to reduce complexity.
3528a0d856eeSDarrick J. WongStatic keys are used to reduce the overhead of filesystem hooks to nearly
3529a0d856eeSDarrick J. Wongzero when online fsck is not running.
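
The registration and dispatch pattern described above can be sketched as a small userspace model.
This is an illustration only: the ``hk_*`` names are hypothetical and are not the ``xfs_hooks`` API, a plain linked list stands in for the notifier call chain, the ``nr_enabled`` counter stands in for a static key, and all SRCU and locking concerns are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; not the kernel's xfs_hooks API. */
struct hk_hook;
typedef int (*hk_fn)(struct hk_hook *hook, unsigned long action, void *data);

struct hk_hook {
	hk_fn		fn;
	struct hk_hook	*next;
};

struct hk_chain {
	struct hk_hook	*head;		/* registered consumers */
	int		nr_enabled;	/* models the static key */
};

/* Register a consumer on the chain (no locking in this toy model). */
static void hk_add(struct hk_chain *chain, struct hk_hook *hook, hk_fn fn)
{
	hook->fn = fn;
	hook->next = chain->head;
	chain->head = hook;
	chain->nr_enabled++;
}

/* Unregister a consumer; does nothing if the hook is not on the chain. */
static void hk_del(struct hk_chain *chain, struct hk_hook *hook)
{
	struct hk_hook **pp;

	for (pp = &chain->head; *pp != NULL; pp = &(*pp)->next) {
		if (*pp == hook) {
			*pp = hook->next;
			chain->nr_enabled--;
			return;
		}
	}
}

/* Dispatch an action to every consumer; returns how many were called. */
static int hk_call(struct hk_chain *chain, unsigned long action, void *data)
{
	struct hk_hook *hook;
	int called = 0;

	if (chain->nr_enabled == 0)	/* fast path: nobody is listening */
		return 0;
	for (hook = chain->head; hook != NULL; hook = hook->next) {
		hook->fn(hook, action, data);
		called++;
	}
	return called;
}

/* Example consumer: add the delta carried in @data to a running total. */
static long hk_total;

static int hk_count_fn(struct hk_hook *hook, unsigned long action, void *data)
{
	(void)hook;
	(void)action;
	hk_total += *(long *)data;
	return 0;
}
```

In this model the callsite pays only for one integer comparison when no consumer is registered, which is the property that static keys provide far more cheaply in the kernel.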
3530a0d856eeSDarrick J. Wong
3531a0d856eeSDarrick J. Wong.. _liveupdate:
3532a0d856eeSDarrick J. Wong
3533a0d856eeSDarrick J. WongLive Updates During a Scan
3534a0d856eeSDarrick J. Wong``````````````````````````
3535a0d856eeSDarrick J. Wong
3536a0d856eeSDarrick J. WongThe code paths of the online fsck scanning code and the :ref:`hooked<fshooks>`
3537a0d856eeSDarrick J. Wongfilesystem code look like this::
3538a0d856eeSDarrick J. Wong
            other program
                  ↓
            inode lock ←────────────────────┐
                  ↓                         │
            AG header lock                  │
                  ↓                         │
            filesystem function             │
                  ↓                         │
            notifier call chain             │    same
                  ↓                         ├─── inode
            scrub hook function             │    lock
                  ↓                         │
            scan data mutex ←──┐    same    │
                  ↓            ├─── scan    │
            update scan data   │    lock    │
                  ↑            │            │
            scan data mutex ←──┘            │
                  ↑                         │
            inode lock ←────────────────────┘
                  ↑
            scrub function
                  ↑
            inode scanner
                  ↑
            xfs_scrub
3564a0d856eeSDarrick J. Wong
3565a0d856eeSDarrick J. WongThese rules must be followed to ensure correct interactions between the
3566a0d856eeSDarrick J. Wongchecking code and the code making an update to the filesystem:
3567a0d856eeSDarrick J. Wong
3568a0d856eeSDarrick J. Wong- Prior to invoking the notifier call chain, the filesystem function being
3569a0d856eeSDarrick J. Wong  hooked must acquire the same lock that the scrub scanning function acquires
3570a0d856eeSDarrick J. Wong  to scan the inode.
3571a0d856eeSDarrick J. Wong
3572a0d856eeSDarrick J. Wong- The scanning function and the scrub hook function must coordinate access to
3573a0d856eeSDarrick J. Wong  the scan data by acquiring a lock on the scan data.
3574a0d856eeSDarrick J. Wong
- The scrub hook function must not add the live update information to the scan
3576a0d856eeSDarrick J. Wong  observations unless the inode being updated has already been scanned.
3577a0d856eeSDarrick J. Wong  The scan coordinator has a helper predicate (``xchk_iscan_want_live_update``)
3578a0d856eeSDarrick J. Wong  for this.
3579a0d856eeSDarrick J. Wong
3580a0d856eeSDarrick J. Wong- Scrub hook functions must not change the caller's state, including the
3581a0d856eeSDarrick J. Wong  transaction that it is running.
3582a0d856eeSDarrick J. Wong  They must not acquire any resources that might conflict with the filesystem
3583a0d856eeSDarrick J. Wong  function being hooked.
3584a0d856eeSDarrick J. Wong
3585a0d856eeSDarrick J. Wong- The hook function can abort the inode scan to avoid breaking the other rules.
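
A minimal userspace sketch of how a hook function might follow these rules is shown here.
Everything in it is an assumption made for illustration: the ``lu_*`` names are hypothetical, a pthread mutex stands in for the scan data mutex, the cursor is a single inode number, and the caller is assumed to already hold the inode lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical scan state; field names are illustrative only. */
struct lu_scan {
	pthread_mutex_t	lock;		/* the "scan data mutex" */
	uint64_t	cursor_ino;	/* inodes below this were scanned */
	bool		aborted;	/* set if the scan was abandoned */
	int64_t		observed;	/* the scan data */
};

/* Toy predicate: has @ino already been visited by the scan? */
static bool lu_want_live_update(const struct lu_scan *scan, uint64_t ino)
{
	return ino < scan->cursor_ino;
}

/*
 * Hook side.  The hooked filesystem function is assumed to hold the inode
 * lock for @ino already.  The hook only touches the scan data and never
 * changes the caller's state.
 */
static void lu_hook(struct lu_scan *scan, uint64_t ino, int64_t delta)
{
	pthread_mutex_lock(&scan->lock);
	if (!scan->aborted && lu_want_live_update(scan, ino))
		scan->observed += delta;
	pthread_mutex_unlock(&scan->lock);
}

/*
 * Scanner side.  The scanner is assumed to hold the same inode lock, so a
 * live update for @ino arrives either wholly before or wholly after this.
 */
static void lu_scan_inode(struct lu_scan *scan, uint64_t ino, int64_t usage)
{
	pthread_mutex_lock(&scan->lock);
	scan->observed += usage;
	scan->cursor_ino = ino + 1;
	pthread_mutex_unlock(&scan->lock);
}
```

Both sides take the inode lock first and the scan data mutex second, which matches the lock order in the diagram.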
3586a0d856eeSDarrick J. Wong
3587a0d856eeSDarrick J. WongThe inode scan APIs are pretty simple:
3588a0d856eeSDarrick J. Wong
- ``xchk_iscan_start`` starts a scan.

- ``xchk_iscan_iter`` grabs a reference to the next inode in the scan, or
  returns zero if there is nothing left to scan.

- ``xchk_iscan_want_live_update`` decides if an inode has already been
  visited in the scan.
  This is critical for hook functions to decide if they need to update the
  in-memory scan information.

- ``xchk_iscan_mark_visited`` marks an inode as having been visited in the
  scan.

- ``xchk_iscan_teardown`` finishes the scan.
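
The calling sequence implied by this list can be sketched with a toy cursor.
The ``is_*`` functions below are hypothetical userspace stand-ins, not the ``xchk_iscan_*`` signatures: the filesystem is modelled as a sorted array of inode numbers, the iterator is assumed to return 1 when it hands back an inode, and teardown is omitted because the toy allocates nothing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy scan cursor over a sorted array of allocated inode numbers. */
struct is_scan {
	const uint64_t	*inodes;	/* sorted, ascending */
	size_t		nr;
	size_t		next;		/* index of the next inode to return */
	uint64_t	visited_upto;	/* inodes <= this were visited */
	bool		any_visited;
};

static void is_start(struct is_scan *s, const uint64_t *inodes, size_t nr)
{
	s->inodes = inodes;
	s->nr = nr;
	s->next = 0;
	s->visited_upto = 0;
	s->any_visited = false;
}

/* Returns 1 and sets *ino, or 0 when nothing is left to scan. */
static int is_iter(struct is_scan *s, uint64_t *ino)
{
	if (s->next >= s->nr)
		return 0;
	*ino = s->inodes[s->next++];
	return 1;
}

/* Advance the visited cursor once the inode's data has been recorded. */
static void is_mark_visited(struct is_scan *s, uint64_t ino)
{
	s->visited_upto = ino;
	s->any_visited = true;
}

/* Should a live update about @ino be applied to the scan data? */
static bool is_want_live_update(const struct is_scan *s, uint64_t ino)
{
	return s->any_visited && ino <= s->visited_upto;
}

/* Walk the remaining inodes, summing a per-inode value from @usage. */
static uint64_t is_scan_all(struct is_scan *s, uint64_t (*usage)(uint64_t))
{
	uint64_t ino, total = 0;

	while (is_iter(s, &ino) == 1) {
		total += usage(ino);
		is_mark_visited(s, ino);
	}
	return total;
}

/* Example per-inode callback: every inode contributes one unit. */
static uint64_t is_one_unit(uint64_t ino)
{
	(void)ino;
	return 1;
}
```

Marking an inode visited only after its observations are recorded keeps the predicate from accepting live updates for an inode whose baseline has not been captured yet.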
3603a0d856eeSDarrick J. Wong
3604a0d856eeSDarrick J. WongThis functionality is also a part of the
3605a0d856eeSDarrick J. Wong`inode scanner
3606a0d856eeSDarrick J. Wong<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-iscan>`_
3607a0d856eeSDarrick J. Wongseries.
3608a0d856eeSDarrick J. Wong
3609a0d856eeSDarrick J. Wong.. _quotacheck:
3610a0d856eeSDarrick J. Wong
3611a0d856eeSDarrick J. WongCase Study: Quota Counter Checking
3612a0d856eeSDarrick J. Wong^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
3613a0d856eeSDarrick J. Wong
3614a0d856eeSDarrick J. WongIt is useful to compare the mount time quotacheck code to the online repair
3615a0d856eeSDarrick J. Wongquotacheck code.
3616a0d856eeSDarrick J. WongMount time quotacheck does not have to contend with concurrent operations, so
3617a0d856eeSDarrick J. Wongit does the following:
3618a0d856eeSDarrick J. Wong
3619a0d856eeSDarrick J. Wong1. Make sure the ondisk dquots are in good enough shape that all the incore
3620a0d856eeSDarrick J. Wong   dquots will actually load, and zero the resource usage counters in the
3621a0d856eeSDarrick J. Wong   ondisk buffer.
3622a0d856eeSDarrick J. Wong
3623a0d856eeSDarrick J. Wong2. Walk every inode in the filesystem.
3624a0d856eeSDarrick J. Wong   Add each file's resource usage to the incore dquot.
3625a0d856eeSDarrick J. Wong
3626a0d856eeSDarrick J. Wong3. Walk each incore dquot.
3627a0d856eeSDarrick J. Wong   If the incore dquot is not being flushed, add the ondisk buffer backing the
3628a0d856eeSDarrick J. Wong   incore dquot to a delayed write (delwri) list.
3629a0d856eeSDarrick J. Wong
3630a0d856eeSDarrick J. Wong4. Write the buffer list to disk.
3631a0d856eeSDarrick J. Wong
3632a0d856eeSDarrick J. WongLike most online fsck functions, online quotacheck can't write to regular
3633a0d856eeSDarrick J. Wongfilesystem objects until the newly collected metadata reflect all filesystem
3634a0d856eeSDarrick J. Wongstate.
3635a0d856eeSDarrick J. WongTherefore, online quotacheck records file resource usage to a shadow dquot
3636a0d856eeSDarrick J. Wongindex implemented with a sparse ``xfarray``, and only writes to the real dquots
3637a0d856eeSDarrick J. Wongonce the scan is complete.
3638a0d856eeSDarrick J. WongHandling transactional updates is tricky because quota resource usage updates
3639a0d856eeSDarrick J. Wongare handled in phases to minimize contention on dquots:
3640a0d856eeSDarrick J. Wong
1. The inodes involved are locked and joined to a transaction.
3642a0d856eeSDarrick J. Wong
3643a0d856eeSDarrick J. Wong2. For each dquot attached to the file:
3644a0d856eeSDarrick J. Wong
3645a0d856eeSDarrick J. Wong   a. The dquot is locked.
3646a0d856eeSDarrick J. Wong
3647a0d856eeSDarrick J. Wong   b. A quota reservation is added to the dquot's resource usage.
3648a0d856eeSDarrick J. Wong      The reservation is recorded in the transaction.
3649a0d856eeSDarrick J. Wong
3650a0d856eeSDarrick J. Wong   c. The dquot is unlocked.
3651a0d856eeSDarrick J. Wong
3652a0d856eeSDarrick J. Wong3. Changes in actual quota usage are tracked in the transaction.
3653a0d856eeSDarrick J. Wong
3654a0d856eeSDarrick J. Wong4. At transaction commit time, each dquot is examined again:
3655a0d856eeSDarrick J. Wong
3656a0d856eeSDarrick J. Wong   a. The dquot is locked again.
3657a0d856eeSDarrick J. Wong
3658a0d856eeSDarrick J. Wong   b. Quota usage changes are logged and unused reservation is given back to
3659a0d856eeSDarrick J. Wong      the dquot.
3660a0d856eeSDarrick J. Wong
3661a0d856eeSDarrick J. Wong   c. The dquot is unlocked.
3662a0d856eeSDarrick J. Wong
3663a0d856eeSDarrick J. WongFor online quotacheck, hooks are placed in steps 2 and 4.
3664a0d856eeSDarrick J. WongThe step 2 hook creates a shadow version of the transaction dquot context
3665a0d856eeSDarrick J. Wong(``dqtrx``) that operates in a similar manner to the regular code.
3666a0d856eeSDarrick J. WongThe step 4 hook commits the shadow ``dqtrx`` changes to the shadow dquots.
3667a0d856eeSDarrick J. WongNotice that both hooks are called with the inode locked, which is how the
3668a0d856eeSDarrick J. Wonglive update coordinates with the inode scanner.
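
The two hooks can be pictured with the following userspace sketch.
It is a simplification under stated assumptions: the ``qc_*`` structures and functions are hypothetical, a small fixed array replaces the sparse ``xfarray``, only block and inode counts are tracked, and locking is left out.

```c
#include <assert.h>
#include <stdint.h>

#define QC_MAX_IDS	8	/* toy bound; the text uses a sparse xfarray */

/* Shadow dquot: usage observed by the scan plus live updates. */
struct qc_shadow_dquot {
	int64_t	bcount;		/* blocks */
	int64_t	icount;		/* inodes */
};

/* Shadow of the per-transaction dquot context for one dquot id. */
struct qc_shadow_dqtrx {
	uint32_t	id;
	int64_t		bcount_delta;
	int64_t		icount_delta;
};

/* Step 2 hook: accumulate a pending change in the shadow dqtrx. */
static void qc_hook_mod(struct qc_shadow_dqtrx *dqtrx, int64_t dblocks,
			int64_t dinodes)
{
	dqtrx->bcount_delta += dblocks;
	dqtrx->icount_delta += dinodes;
}

/* Step 4 hook: fold the pending changes into the shadow dquot at commit. */
static void qc_hook_apply(struct qc_shadow_dquot *shadow,
			  struct qc_shadow_dqtrx *dqtrx)
{
	struct qc_shadow_dquot *dq = &shadow[dqtrx->id];

	dq->bcount += dqtrx->bcount_delta;
	dq->icount += dqtrx->icount_delta;
	dqtrx->bcount_delta = 0;
	dqtrx->icount_delta = 0;
}
```

Deferring the update to commit time mirrors the regular code, so a transaction that is cancelled would never reach the shadow dquots in this model.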
3669a0d856eeSDarrick J. Wong
3670a0d856eeSDarrick J. WongThe quotacheck scan looks like this:
3671a0d856eeSDarrick J. Wong
3672a0d856eeSDarrick J. Wong1. Set up a coordinated inode scan.
3673a0d856eeSDarrick J. Wong
3674a0d856eeSDarrick J. Wong2. For each inode returned by the inode scan iterator:
3675a0d856eeSDarrick J. Wong
3676a0d856eeSDarrick J. Wong   a. Grab and lock the inode.
3677a0d856eeSDarrick J. Wong
3678a0d856eeSDarrick J. Wong   b. Determine that inode's resource usage (data blocks, inode counts,
3679a0d856eeSDarrick J. Wong      realtime blocks) and add that to the shadow dquots for the user, group,
3680a0d856eeSDarrick J. Wong      and project ids associated with the inode.
3681a0d856eeSDarrick J. Wong
3682a0d856eeSDarrick J. Wong   c. Unlock and release the inode.
3683a0d856eeSDarrick J. Wong
3684a0d856eeSDarrick J. Wong3. For each dquot in the system:
3685a0d856eeSDarrick J. Wong
3686a0d856eeSDarrick J. Wong   a. Grab and lock the dquot.
3687a0d856eeSDarrick J. Wong
3688a0d856eeSDarrick J. Wong   b. Check the dquot against the shadow dquots created by the scan and updated
3689a0d856eeSDarrick J. Wong      by the live hooks.
3690a0d856eeSDarrick J. Wong
3691a0d856eeSDarrick J. WongLive updates are key to being able to walk every quota record without
3692a0d856eeSDarrick J. Wongneeding to hold any locks for a long duration.
3693a0d856eeSDarrick J. WongIf repairs are desired, the real and shadow dquots are locked and their
3694a0d856eeSDarrick J. Wongresource counts are set to the values in the shadow dquot.
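
The final comparison and optional repair might look like this in a userspace model.
The ``qr_*`` names are hypothetical, the three counters are taken from the resource list in step 2b, and the locking of the real and shadow dquots is assumed to have been done by the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Resource counts tracked per dquot in this toy model. */
struct qr_counts {
	uint64_t	bcount;		/* data blocks */
	uint64_t	icount;		/* inodes */
	uint64_t	rtbcount;	/* realtime blocks */
};

/* Returns true if the real counters disagree with the shadow counters. */
static bool qr_is_corrupt(const struct qr_counts *real,
			  const struct qr_counts *shadow)
{
	return real->bcount != shadow->bcount ||
	       real->icount != shadow->icount ||
	       real->rtbcount != shadow->rtbcount;
}

/*
 * Check one dquot against its shadow.  If @repair is set and they differ,
 * copy the shadow counts into the real dquot.  Returns true if a
 * discrepancy was found.
 */
static bool qr_check_dquot(struct qr_counts *real,
			   const struct qr_counts *shadow, bool repair)
{
	bool corrupt = qr_is_corrupt(real, shadow);

	if (corrupt && repair)
		*real = *shadow;
	return corrupt;
}
```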
3695a0d856eeSDarrick J. Wong
3696a0d856eeSDarrick J. WongThe proposed patchset is the
3697a0d856eeSDarrick J. Wong`online quotacheck
3698a0d856eeSDarrick J. Wong<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-quotacheck>`_
3699a0d856eeSDarrick J. Wongseries.
3700a0d856eeSDarrick J. Wong
3701a0d856eeSDarrick J. Wong.. _nlinks:
3702a0d856eeSDarrick J. Wong
3703a0d856eeSDarrick J. WongCase Study: File Link Count Checking
3704a0d856eeSDarrick J. Wong^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
3705a0d856eeSDarrick J. Wong
3706a0d856eeSDarrick J. WongFile link count checking also uses live update hooks.
3707a0d856eeSDarrick J. WongThe coordinated inode scanner is used to visit all directories on the
3708a0d856eeSDarrick J. Wongfilesystem, and per-file link count records are stored in a sparse ``xfarray``
3709a0d856eeSDarrick J. Wongindexed by inumber.
3710a0d856eeSDarrick J. WongDuring the scanning phase, each entry in a directory generates observation
3711a0d856eeSDarrick J. Wongdata as follows:
3712a0d856eeSDarrick J. Wong
3713a0d856eeSDarrick J. Wong1. If the entry is a dotdot (``'..'``) entry of the root directory, the
3714a0d856eeSDarrick J. Wong   directory's parent link count is bumped because the root directory's dotdot
3715a0d856eeSDarrick J. Wong   entry is self referential.
3716a0d856eeSDarrick J. Wong
3717a0d856eeSDarrick J. Wong2. If the entry is a dotdot entry of a subdirectory, the parent's backref
3718a0d856eeSDarrick J. Wong   count is bumped.
3719a0d856eeSDarrick J. Wong
3720a0d856eeSDarrick J. Wong3. If the entry is neither a dot nor a dotdot entry, the target file's parent
3721a0d856eeSDarrick J. Wong   count is bumped.
3722a0d856eeSDarrick J. Wong
3723a0d856eeSDarrick J. Wong4. If the target is a subdirectory, the parent's child link count is bumped.
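
The four rules can be expressed as a single observation function.
This is a sketch with hypothetical ``nl_*`` names: inode numbers index directly into a small array instead of a sparse ``xfarray``, and the caller is assumed to know whether the target of a dirent is a directory.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define NL_MAX_INODES	16	/* toy bound; the text uses a sparse xfarray */

/* Per-file shadow link count observations. */
struct nl_counts {
	uint32_t	parents;	/* dirents pointing at this file */
	uint32_t	backrefs;	/* child dotdot entries pointing here */
	uint32_t	children;	/* subdirectories of this directory */
};

/*
 * Record the observations for one dirent @name found in directory @dir
 * that points at inode @target.
 */
static void nl_observe(struct nl_counts *nl, uint64_t root_ino, uint64_t dir,
		       const char *name, uint64_t target, bool target_is_dir)
{
	if (strcmp(name, ".") == 0)
		return;				/* dot entries are skipped */

	if (strcmp(name, "..") == 0) {
		if (dir == root_ino)
			nl[dir].parents++;	/* rule 1: root is its own parent */
		else
			nl[target].backrefs++;	/* rule 2: parent's backref count */
		return;
	}

	nl[target].parents++;			/* rule 3 */
	if (target_is_dir)
		nl[dir].children++;		/* rule 4 */
}
```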
3724a0d856eeSDarrick J. Wong
3725a0d856eeSDarrick J. WongA crucial point to understand about how the link count inode scanner interacts
3726a0d856eeSDarrick J. Wongwith the live update hooks is that the scan cursor tracks which *parent*
3727a0d856eeSDarrick J. Wongdirectories have been scanned.
3728a0d856eeSDarrick J. WongIn other words, the live updates ignore any update about ``A → B`` when A has
3729a0d856eeSDarrick J. Wongnot been scanned, even if B has been scanned.
Furthermore, a subdirectory A with a dotdot entry pointing back to B is
accounted as a backref counter in the shadow data for B, since child dotdot
entries affect the parent's link count.
3733a0d856eeSDarrick J. WongLive update hooks are carefully placed in all parts of the filesystem that
3734a0d856eeSDarrick J. Wongcreate, change, or remove directory entries, since those operations involve
3735a0d856eeSDarrick J. Wongbumplink and droplink.
3736a0d856eeSDarrick J. Wong
3737a0d856eeSDarrick J. WongFor any file, the correct link count is the number of parents plus the number
3738a0d856eeSDarrick J. Wongof child subdirectories.
3739a0d856eeSDarrick J. WongNon-directories never have children of any kind.
3740a0d856eeSDarrick J. WongThe backref information is used to detect inconsistencies in the number of
3741a0d856eeSDarrick J. Wonglinks pointing to child subdirectories and the number of dotdot entries
3742a0d856eeSDarrick J. Wongpointing back.
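
As a worked example, the computation stated above can be written down directly.
The ``lc_*`` names are hypothetical, and the sketch implements only the formula as given in this section; any further accounting in the real code is outside its scope.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Shadow observations for one file, as collected by the scan. */
struct lc_counts {
	uint32_t	parents;
	uint32_t	backrefs;
	uint32_t	children;
};

/* Link count per the text: parents plus child subdirectories. */
static uint32_t lc_expected_nlink(const struct lc_counts *c, bool is_dir)
{
	/* Non-directories never have children of any kind. */
	return c->parents + (is_dir ? c->children : 0);
}

/* Every child subdirectory should have a dotdot entry pointing back. */
static bool lc_backrefs_consistent(const struct lc_counts *c)
{
	return c->backrefs == c->children;
}
```

A directory with one parent and two subdirectories therefore has an expected count of three under this formula, and it is flagged as inconsistent if only one of those subdirectories points back to it.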
3743a0d856eeSDarrick J. Wong
3744a0d856eeSDarrick J. WongAfter the scan completes, the link count of each file can be checked by locking
3745a0d856eeSDarrick J. Wongboth the inode and the shadow data, and comparing the link counts.
3746a0d856eeSDarrick J. WongA second coordinated inode scan cursor is used for comparisons.
3747a0d856eeSDarrick J. WongLive updates are key to being able to walk every inode without needing to hold
3748a0d856eeSDarrick J. Wongany locks between inodes.
3749a0d856eeSDarrick J. WongIf repairs are desired, the inode's link count is set to the value in the
3750a0d856eeSDarrick J. Wongshadow information.
3751a0d856eeSDarrick J. WongIf no parents are found, the file must be :ref:`reparented <orphanage>` to the
3752a0d856eeSDarrick J. Wongorphanage to prevent the file from being lost forever.
3753a0d856eeSDarrick J. Wong
3754a0d856eeSDarrick J. WongThe proposed patchset is the
3755a0d856eeSDarrick J. Wong`file link count repair
3756a0d856eeSDarrick J. Wong<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-nlinks>`_
3757a0d856eeSDarrick J. Wongseries.
3758a0d856eeSDarrick J. Wong
3759a0d856eeSDarrick J. Wong.. _rmap_repair:
3760a0d856eeSDarrick J. Wong
3761a0d856eeSDarrick J. WongCase Study: Rebuilding Reverse Mapping Records
3762a0d856eeSDarrick J. Wong^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
3763a0d856eeSDarrick J. Wong
3764a0d856eeSDarrick J. WongMost repair functions follow the same pattern: lock filesystem resources,
3765a0d856eeSDarrick J. Wongwalk the surviving ondisk metadata looking for replacement metadata records,
3766a0d856eeSDarrick J. Wongand use an :ref:`in-memory array <xfarray>` to store the gathered observations.
3767a0d856eeSDarrick J. WongThe primary advantage of this approach is the simplicity and modularity of the
3768a0d856eeSDarrick J. Wongrepair code -- code and data are entirely contained within the scrub module,
3769a0d856eeSDarrick J. Wongdo not require hooks in the main filesystem, and are usually the most efficient
3770a0d856eeSDarrick J. Wongin memory use.
3771a0d856eeSDarrick J. WongA secondary advantage of this repair approach is atomicity -- once the kernel
3772a0d856eeSDarrick J. Wongdecides a structure is corrupt, no other threads can access the metadata until
3773a0d856eeSDarrick J. Wongthe kernel finishes repairing and revalidating the metadata.
3774a0d856eeSDarrick J. Wong
3775a0d856eeSDarrick J. WongFor repairs going on within a shard of the filesystem, these advantages
3776a0d856eeSDarrick J. Wongoutweigh the delays inherent in locking the shard while repairing parts of the
3777a0d856eeSDarrick J. Wongshard.
3778a0d856eeSDarrick J. WongUnfortunately, repairs to the reverse mapping btree cannot use the "standard"
3779a0d856eeSDarrick J. Wongbtree repair strategy because it must scan every space mapping of every fork of
3780a0d856eeSDarrick J. Wongevery file in the filesystem, and the filesystem cannot stop.
Therefore, rmap repair forgoes atomicity between scrub and repair.
3782a0d856eeSDarrick J. WongIt combines a :ref:`coordinated inode scanner <iscan>`, :ref:`live update hooks
3783a0d856eeSDarrick J. Wong<liveupdate>`, and an :ref:`in-memory rmap btree <xfbtree>` to complete the
3784a0d856eeSDarrick J. Wongscan for reverse mapping records.
3785a0d856eeSDarrick J. Wong
3786a0d856eeSDarrick J. Wong1. Set up an xfbtree to stage rmap records.
3787a0d856eeSDarrick J. Wong
3788a0d856eeSDarrick J. Wong2. While holding the locks on the AGI and AGF buffers acquired during the
3789a0d856eeSDarrick J. Wong   scrub, generate reverse mappings for all AG metadata: inodes, btrees, CoW
3790a0d856eeSDarrick J. Wong   staging extents, and the internal log.
3791a0d856eeSDarrick J. Wong
3792a0d856eeSDarrick J. Wong3. Set up an inode scanner.
3793a0d856eeSDarrick J. Wong
3794a0d856eeSDarrick J. Wong4. Hook into rmap updates for the AG being repaired so that the live scan data
3795a0d856eeSDarrick J. Wong   can receive updates to the rmap btree from the rest of the filesystem during
3796a0d856eeSDarrick J. Wong   the file scan.
3797a0d856eeSDarrick J. Wong
3798a0d856eeSDarrick J. Wong5. For each space mapping found in either fork of each file scanned,
3799a0d856eeSDarrick J. Wong   decide if the mapping matches the AG of interest.
3800a0d856eeSDarrick J. Wong   If so:
3801a0d856eeSDarrick J. Wong
3802a0d856eeSDarrick J. Wong   a. Create a btree cursor for the in-memory btree.
3803a0d856eeSDarrick J. Wong
3804a0d856eeSDarrick J. Wong   b. Use the rmap code to add the record to the in-memory btree.
3805a0d856eeSDarrick J. Wong
3806a0d856eeSDarrick J. Wong   c. Use the :ref:`special commit function <xfbtree_commit>` to write the
3807a0d856eeSDarrick J. Wong      xfbtree changes to the xfile.
3808a0d856eeSDarrick J. Wong
3809a0d856eeSDarrick J. Wong6. For each live update received via the hook, decide if the owner has already
3810a0d856eeSDarrick J. Wong   been scanned.
3811a0d856eeSDarrick J. Wong   If so, apply the live update into the scan data:
3812a0d856eeSDarrick J. Wong
3813a0d856eeSDarrick J. Wong   a. Create a btree cursor for the in-memory btree.
3814a0d856eeSDarrick J. Wong
3815a0d856eeSDarrick J. Wong   b. Replay the operation into the in-memory btree.
3816a0d856eeSDarrick J. Wong
3817a0d856eeSDarrick J. Wong   c. Use the :ref:`special commit function <xfbtree_commit>` to write the
3818a0d856eeSDarrick J. Wong      xfbtree changes to the xfile.
3819a0d856eeSDarrick J. Wong      This is performed with an empty transaction to avoid changing the
3820a0d856eeSDarrick J. Wong      caller's state.
3821a0d856eeSDarrick J. Wong
3822a0d856eeSDarrick J. Wong7. When the inode scan finishes, create a new scrub transaction and relock the
3823a0d856eeSDarrick J. Wong   two AG headers.
3824a0d856eeSDarrick J. Wong
3825a0d856eeSDarrick J. Wong8. Compute the new btree geometry using the number of rmap records in the
3826a0d856eeSDarrick J. Wong   shadow btree, like all other btree rebuilding functions.
3827a0d856eeSDarrick J. Wong
3828a0d856eeSDarrick J. Wong9. Allocate the number of blocks computed in the previous step.
3829a0d856eeSDarrick J. Wong
3830a0d856eeSDarrick J. Wong10. Perform the usual btree bulk loading and commit to install the new rmap
3831a0d856eeSDarrick J. Wong    btree.
3832a0d856eeSDarrick J. Wong
3833a0d856eeSDarrick J. Wong11. Reap the old rmap btree blocks as discussed in the case study about how
3834a0d856eeSDarrick J. Wong    to :ref:`reap after rmap btree repair <rmap_reap>`.
3835a0d856eeSDarrick J. Wong
12. Free the xfbtree now that it is not needed.
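
The filtering decision in step 5 can be illustrated with a toy model.
The sketch makes several assumptions: the ``rm_*`` names and the record layout are hypothetical, a flat array stands in for the in-memory rmap btree, and a filesystem block number is treated as an AG number shifted left by ``agblklog`` bits with the AG block number in the low bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified record; a real rmap record has more fields. */
struct rm_rec {
	uint32_t	agbno;		/* start block within the AG */
	uint32_t	len;		/* length in blocks */
	uint64_t	owner;		/* owning inode number */
};

/* Toy geometry: fsblock = (agno << agblklog) | agbno. */
static uint32_t rm_fsb_to_agno(uint64_t fsb, unsigned int agblklog)
{
	return (uint32_t)(fsb >> agblklog);
}

static uint32_t rm_fsb_to_agbno(uint64_t fsb, unsigned int agblklog)
{
	return (uint32_t)(fsb & ((1ULL << agblklog) - 1));
}

/*
 * Stage one file mapping if it lands in the AG being repaired.
 * Returns 1 if staged, 0 if the mapping belongs to another AG, and -1 if
 * the staging array is full.
 */
static int rm_stage_mapping(struct rm_rec *staged, size_t *nr, size_t cap,
			    uint32_t want_agno, unsigned int agblklog,
			    uint64_t owner, uint64_t fsb, uint32_t len)
{
	if (rm_fsb_to_agno(fsb, agblklog) != want_agno)
		return 0;
	if (*nr >= cap)
		return -1;

	staged[*nr].agbno = rm_fsb_to_agbno(fsb, agblklog);
	staged[*nr].len = len;
	staged[*nr].owner = owner;
	(*nr)++;
	return 1;
}
```

A live update received through the hook would pass through the same filter before being replayed into the staging structure.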
3837a0d856eeSDarrick J. Wong
3838a0d856eeSDarrick J. WongThe proposed patchset is the
3839a0d856eeSDarrick J. Wong`rmap repair
3840a0d856eeSDarrick J. Wong<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-rmap-btree>`_
3841a0d856eeSDarrick J. Wongseries.
38422f754f7fSDarrick J. Wong
38432f754f7fSDarrick J. WongStaging Repairs with Temporary Files on Disk
38442f754f7fSDarrick J. Wong--------------------------------------------
38452f754f7fSDarrick J. Wong
38462f754f7fSDarrick J. WongXFS stores a substantial amount of metadata in file forks: directories,
38472f754f7fSDarrick J. Wongextended attributes, symbolic link targets, free space bitmaps and summary
38482f754f7fSDarrick J. Wonginformation for the realtime volume, and quota records.
38492f754f7fSDarrick J. WongFile forks map 64-bit logical file fork space extents to physical storage space
38502f754f7fSDarrick J. Wongextents, similar to how a memory management unit maps 64-bit virtual addresses
38512f754f7fSDarrick J. Wongto physical memory addresses.
38522f754f7fSDarrick J. WongTherefore, file-based tree structures (such as directories and extended
38532f754f7fSDarrick J. Wongattributes) use blocks mapped in the file fork offset address space that point
38542f754f7fSDarrick J. Wongto other blocks mapped within that same address space, and file-based linear
38552f754f7fSDarrick J. Wongstructures (such as bitmaps and quota records) compute array element offsets in
38562f754f7fSDarrick J. Wongthe file fork offset address space.
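
For the linear case, locating a record reduces to integer division.
The sketch below uses hypothetical ``ff_*`` names and assumes fixed-size records packed a known number per block; it does not describe any particular ondisk format.

```c
#include <assert.h>
#include <stdint.h>

/* Where a record lives within a file fork's offset address space. */
struct ff_loc {
	uint64_t	fileoff;	/* file fork block offset */
	uint32_t	index;		/* record index within that block */
};

/* Locate record number @recno when @recs_per_block records fit per block. */
static struct ff_loc ff_locate(uint64_t recno, uint32_t recs_per_block)
{
	struct ff_loc loc;

	assert(recs_per_block > 0);
	loc.fileoff = recno / recs_per_block;
	loc.index = (uint32_t)(recno % recs_per_block);
	return loc;
}
```

Because the offset is computed rather than stored in a pointer, a replacement structure has to be written at the same fork offsets as the original for lookups to keep working.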
38572f754f7fSDarrick J. Wong
38582f754f7fSDarrick J. WongBecause file forks can consume as much space as the entire filesystem, repairs
38592f754f7fSDarrick J. Wongcannot be staged in memory, even when a paging scheme is available.
Therefore, online repair of file-based metadata creates a temporary file in
38612f754f7fSDarrick J. Wongthe XFS filesystem, writes a new structure at the correct offsets into the
38622f754f7fSDarrick J. Wongtemporary file, and atomically swaps the fork mappings (and hence the fork
38632f754f7fSDarrick J. Wongcontents) to commit the repair.
38642f754f7fSDarrick J. WongOnce the repair is complete, the old fork can be reaped as necessary; if the
38652f754f7fSDarrick J. Wongsystem goes down during the reap, the iunlink code will delete the blocks
38662f754f7fSDarrick J. Wongduring log recovery.
38672f754f7fSDarrick J. Wong
38682f754f7fSDarrick J. Wong**Note**: All space usage and inode indices in the filesystem *must* be
38692f754f7fSDarrick J. Wongconsistent to use a temporary file safely!
38702f754f7fSDarrick J. WongThis dependency is the reason why online repair can only use pageable kernel
38712f754f7fSDarrick J. Wongmemory to stage ondisk space usage information.
38722f754f7fSDarrick J. Wong
38732f754f7fSDarrick J. WongSwapping metadata extents with a temporary file requires the owner field of the
38742f754f7fSDarrick J. Wongblock headers to match the file being repaired and not the temporary file.  The
38752f754f7fSDarrick J. Wongdirectory, extended attribute, and symbolic link functions were all modified to
38762f754f7fSDarrick J. Wongallow callers to specify owner numbers explicitly.
38772f754f7fSDarrick J. Wong
38782f754f7fSDarrick J. WongThere is a downside to the reaping process -- if the system crashes during the
38792f754f7fSDarrick J. Wongreap phase and the fork extents are crosslinked, the iunlink processing will
38802f754f7fSDarrick J. Wongfail because freeing space will find the extra reverse mappings and abort.
38812f754f7fSDarrick J. Wong
38822f754f7fSDarrick J. WongTemporary files created for repair are similar to ``O_TMPFILE`` files created
38832f754f7fSDarrick J. Wongby userspace.
38842f754f7fSDarrick J. WongThey are not linked into a directory and the entire file will be reaped when
38852f754f7fSDarrick J. Wongthe last reference to the file is lost.
38862f754f7fSDarrick J. WongThe key differences are that these files must have no access permission outside
38872f754f7fSDarrick J. Wongthe kernel at all, they must be specially marked to prevent them from being
38882f754f7fSDarrick J. Wongopened by handle, and they must never be linked into the directory tree.
38892f754f7fSDarrick J. Wong
38902f754f7fSDarrick J. Wong+--------------------------------------------------------------------------+
38912f754f7fSDarrick J. Wong| **Historical Sidebar**:                                                  |
38922f754f7fSDarrick J. Wong+--------------------------------------------------------------------------+
38932f754f7fSDarrick J. Wong| In the initial iteration of file metadata repair, the damaged metadata   |
38942f754f7fSDarrick J. Wong| blocks would be scanned for salvageable data; the extents in the file    |
38952f754f7fSDarrick J. Wong| fork would be reaped; and then a new structure would be built in its     |
38962f754f7fSDarrick J. Wong| place.                                                                   |
38972f754f7fSDarrick J. Wong| This strategy did not survive the introduction of the atomic repair      |
38982f754f7fSDarrick J. Wong| requirement expressed earlier in this document.                          |
38992f754f7fSDarrick J. Wong|                                                                          |
39002f754f7fSDarrick J. Wong| The second iteration explored building a second structure at a high      |
39012f754f7fSDarrick J. Wong| offset in the fork from the salvage data, reaping the old extents, and   |
39022f754f7fSDarrick J. Wong| using a ``COLLAPSE_RANGE`` operation to slide the new extents into       |
39032f754f7fSDarrick J. Wong| place.                                                                   |
39042f754f7fSDarrick J. Wong|                                                                          |
39052f754f7fSDarrick J. Wong| This had many drawbacks:                                                 |
39062f754f7fSDarrick J. Wong|                                                                          |
39072f754f7fSDarrick J. Wong| - Array structures are linearly addressed, and the regular filesystem    |
39082f754f7fSDarrick J. Wong|   codebase does not have the concept of a linear offset that could be    |
39092f754f7fSDarrick J. Wong|   applied to the record offset computation to build an alternate copy.   |
39102f754f7fSDarrick J. Wong|                                                                          |
39112f754f7fSDarrick J. Wong| - Extended attributes are allowed to use the entire attr fork offset     |
39122f754f7fSDarrick J. Wong|   address space.                                                         |
39132f754f7fSDarrick J. Wong|                                                                          |
39142f754f7fSDarrick J. Wong| - Even if repair could build an alternate copy of a data structure in a  |
39152f754f7fSDarrick J. Wong|   different part of the fork address space, the atomic repair commit     |
39162f754f7fSDarrick J. Wong|   requirement means that online repair would have to be able to perform  |
39172f754f7fSDarrick J. Wong|   a log assisted ``COLLAPSE_RANGE`` operation to ensure that the old     |
39182f754f7fSDarrick J. Wong|   structure was completely replaced.                                     |
39192f754f7fSDarrick J. Wong|                                                                          |
39202f754f7fSDarrick J. Wong| - A crash after construction of the secondary tree but before the range  |
39212f754f7fSDarrick J. Wong|   collapse would leave unreachable blocks in the file fork.              |
39222f754f7fSDarrick J. Wong|   This would likely confuse things further.                              |
39232f754f7fSDarrick J. Wong|                                                                          |
39242f754f7fSDarrick J. Wong| - Reaping blocks after a repair is not a simple operation, and           |
39252f754f7fSDarrick J. Wong|   initiating a reap operation from a restarted range collapse operation  |
39262f754f7fSDarrick J. Wong|   during log recovery is daunting.                                       |
39272f754f7fSDarrick J. Wong|                                                                          |
39282f754f7fSDarrick J. Wong| - Directory entry blocks and quota records record the file fork offset   |
39292f754f7fSDarrick J. Wong|   in the header area of each block.                                      |
39302f754f7fSDarrick J. Wong|   An atomic range collapse operation would have to rewrite this part of  |
39312f754f7fSDarrick J. Wong|   each block header.                                                     |
39322f754f7fSDarrick J. Wong|   Rewriting a single field in block headers is not a huge problem, but   |
39332f754f7fSDarrick J. Wong|   it's something to be aware of.                                         |
39342f754f7fSDarrick J. Wong|                                                                          |
39352f754f7fSDarrick J. Wong| - Each block in a directory or extended attributes btree index contains  |
39362f754f7fSDarrick J. Wong|   sibling and child block pointers.                                      |
39372f754f7fSDarrick J. Wong|   Were the atomic commit to use a range collapse operation, each block   |
39382f754f7fSDarrick J. Wong|   would have to be rewritten very carefully to preserve the graph        |
39392f754f7fSDarrick J. Wong|   structure.                                                             |
39402f754f7fSDarrick J. Wong|   Doing this as part of a range collapse means rewriting a large number  |
39412f754f7fSDarrick J. Wong|   of blocks repeatedly, which is not conducive to quick repairs.         |
39422f754f7fSDarrick J. Wong|                                                                          |
39432f754f7fSDarrick J. Wong| This led to the introduction of temporary file staging.                  |
39442f754f7fSDarrick J. Wong+--------------------------------------------------------------------------+
39452f754f7fSDarrick J. Wong
39462f754f7fSDarrick J. WongUsing a Temporary File
39472f754f7fSDarrick J. Wong``````````````````````
39482f754f7fSDarrick J. Wong
39492f754f7fSDarrick J. WongOnline repair code should use the ``xrep_tempfile_create`` function to create a
39502f754f7fSDarrick J. Wongtemporary file inside the filesystem.
39512f754f7fSDarrick J. WongThis allocates an inode, marks the in-core inode private, and attaches it to
39522f754f7fSDarrick J. Wongthe scrub context.
39532f754f7fSDarrick J. WongThese files are hidden from userspace, may not be added to the directory tree,
39542f754f7fSDarrick J. Wongand must be kept private.
39552f754f7fSDarrick J. Wong
39562f754f7fSDarrick J. WongTemporary files only use two inode locks: the IOLOCK and the ILOCK.
39572f754f7fSDarrick J. WongThe MMAPLOCK is not needed here, because there must not be page faults from
39582f754f7fSDarrick J. Wonguserspace for data fork blocks.
39592f754f7fSDarrick J. WongThe usage patterns of these two locks are the same as for any other XFS file --
39602f754f7fSDarrick J. Wongaccess to file data is controlled via the IOLOCK, and access to file metadata
39612f754f7fSDarrick J. Wongis controlled via the ILOCK.
39622f754f7fSDarrick J. WongLocking helpers are provided so that the temporary file and its lock state can
39632f754f7fSDarrick J. Wongbe cleaned up by the scrub context.
39642f754f7fSDarrick J. WongTo comply with the nested locking strategy laid out in the :ref:`inode
39652f754f7fSDarrick J. Wonglocking<ilocking>` section, it is recommended that scrub functions use the
39662f754f7fSDarrick J. Wong``xrep_tempfile_ilock*_nowait`` lock helpers.
39672f754f7fSDarrick J. Wong
39682f754f7fSDarrick J. WongData can be written to a temporary file by two means:
39692f754f7fSDarrick J. Wong
39702f754f7fSDarrick J. Wong1. ``xrep_tempfile_copyin`` can be used to set the contents of a regular
39712f754f7fSDarrick J. Wong   temporary file from an xfile.
39722f754f7fSDarrick J. Wong
39732f754f7fSDarrick J. Wong2. The regular directory, symbolic link, and extended attribute functions can
39742f754f7fSDarrick J. Wong   be used to write to the temporary file.
39752f754f7fSDarrick J. Wong
39762f754f7fSDarrick J. WongOnce a good copy of a data file has been constructed in a temporary file, it
39772f754f7fSDarrick J. Wongmust be conveyed to the file being repaired, which is the topic of the next
39782f754f7fSDarrick J. Wongsection.
39792f754f7fSDarrick J. Wong
39802f754f7fSDarrick J. WongThe proposed patches are in the
39812f754f7fSDarrick J. Wong`repair temporary files
39822f754f7fSDarrick J. Wong<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-tempfiles>`_
39832f754f7fSDarrick J. Wongseries.
39842f754f7fSDarrick J. Wong
39852f754f7fSDarrick J. WongAtomic Extent Swapping
39862f754f7fSDarrick J. Wong----------------------
39872f754f7fSDarrick J. Wong
39882f754f7fSDarrick J. WongOnce repair builds a temporary file with a new data structure written into
39892f754f7fSDarrick J. Wongit, it must commit the new changes into the existing file.
39902f754f7fSDarrick J. WongIt is not possible to swap the inumbers of two files, so instead the new
39912f754f7fSDarrick J. Wongmetadata must replace the old.
39922f754f7fSDarrick J. WongThis suggests the need for the ability to swap extents, but the existing extent
39932f754f7fSDarrick J. Wongswapping code used by the file defragmenting tool ``xfs_fsr`` is not sufficient
39942f754f7fSDarrick J. Wongfor online repair because:
39952f754f7fSDarrick J. Wong
39962f754f7fSDarrick J. Wonga. When the reverse-mapping btree is enabled, the swap code must keep the
39972f754f7fSDarrick J. Wong   reverse mapping information up to date with every exchange of mappings.
39982f754f7fSDarrick J. Wong   Therefore, it can only exchange one mapping per transaction, and each
39992f754f7fSDarrick J. Wong   transaction is independent.
40002f754f7fSDarrick J. Wong
40012f754f7fSDarrick J. Wongb. Reverse-mapping is critical for the operation of online fsck, so the old
40022f754f7fSDarrick J. Wong   defragmentation code (which swapped entire extent forks in a single
40032f754f7fSDarrick J. Wong   operation) is not useful here.
40042f754f7fSDarrick J. Wong
40052f754f7fSDarrick J. Wongc. Defragmentation is assumed to occur between two files with identical
40062f754f7fSDarrick J. Wong   contents.
40072f754f7fSDarrick J. Wong   For this use case, an incomplete exchange will not result in a user-visible
40082f754f7fSDarrick J. Wong   change in file contents, even if the operation is interrupted.
40092f754f7fSDarrick J. Wong
40102f754f7fSDarrick J. Wongd. Online repair needs to swap the contents of two files that are by definition
40112f754f7fSDarrick J. Wong   *not* identical.
40122f754f7fSDarrick J. Wong   For directory and xattr repairs, the user-visible contents might be the
40132f754f7fSDarrick J. Wong   same, but the contents of individual blocks may be very different.
40142f754f7fSDarrick J. Wong
40152f754f7fSDarrick J. Wonge. Old blocks in the file may be cross-linked with another structure and must
40162f754f7fSDarrick J. Wong   not reappear if the system goes down mid-repair.
40172f754f7fSDarrick J. Wong
40182f754f7fSDarrick J. WongThese problems are overcome by creating a new deferred operation and a new type
40192f754f7fSDarrick J. Wongof log intent item to track the progress of an operation to exchange two file
40202f754f7fSDarrick J. Wongranges.
40212f754f7fSDarrick J. WongThe new deferred operation type chains together the same transactions used by
40222f754f7fSDarrick J. Wongthe reverse-mapping extent swap code.
40232f754f7fSDarrick J. WongThe new log item records the progress of the exchange to ensure that once an
40242f754f7fSDarrick J. Wongexchange begins, it will always run to completion, even if there are
40252f754f7fSDarrick J. Wonginterruptions.
40262f754f7fSDarrick J. WongThe new ``XFS_SB_FEAT_INCOMPAT_LOG_ATOMIC_SWAP`` log-incompatible feature flag
40272f754f7fSDarrick J. Wongin the superblock protects these new log item records from being replayed on
40282f754f7fSDarrick J. Wongold kernels.
40292f754f7fSDarrick J. Wong
40302f754f7fSDarrick J. WongThe proposed patchset is the
40312f754f7fSDarrick J. Wong`atomic extent swap
40322f754f7fSDarrick J. Wong<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=atomic-file-updates>`_
40332f754f7fSDarrick J. Wongseries.
40342f754f7fSDarrick J. Wong
40352f754f7fSDarrick J. Wong+--------------------------------------------------------------------------+
40362f754f7fSDarrick J. Wong| **Sidebar: Using Log-Incompatible Feature Flags**                        |
40372f754f7fSDarrick J. Wong+--------------------------------------------------------------------------+
40382f754f7fSDarrick J. Wong| Starting with XFS v5, the superblock contains a                          |
40392f754f7fSDarrick J. Wong| ``sb_features_log_incompat`` field to indicate that the log contains     |
40402f754f7fSDarrick J. Wong| records that might not be readable by all kernels that could mount this  |
40412f754f7fSDarrick J. Wong| filesystem.                                                              |
40422f754f7fSDarrick J. Wong| In short, log incompat features protect the log contents against kernels |
40432f754f7fSDarrick J. Wong| that will not understand the contents.                                   |
40442f754f7fSDarrick J. Wong| Unlike the other superblock feature bits, log incompat bits are          |
40452f754f7fSDarrick J. Wong| ephemeral because an empty (clean) log does not need protection.         |
40462f754f7fSDarrick J. Wong| The log cleans itself after its contents have been committed into the    |
40472f754f7fSDarrick J. Wong| filesystem, either as part of an unmount or because the system is        |
40482f754f7fSDarrick J. Wong| otherwise idle.                                                          |
40492f754f7fSDarrick J. Wong| Because upper level code can be working on a transaction at the same     |
40502f754f7fSDarrick J. Wong| time that the log cleans itself, it is necessary for upper level code to |
40512f754f7fSDarrick J. Wong| communicate to the log when it is going to use a log incompatible        |
40522f754f7fSDarrick J. Wong| feature.                                                                 |
40532f754f7fSDarrick J. Wong|                                                                          |
40542f754f7fSDarrick J. Wong| The log coordinates access to incompatible features through the use of   |
40552f754f7fSDarrick J. Wong| one ``struct rw_semaphore`` for each feature.                            |
40562f754f7fSDarrick J. Wong| The log cleaning code tries to take this rwsem in exclusive mode to      |
40572f754f7fSDarrick J. Wong| clear the bit; if the lock attempt fails, the feature bit remains set.   |
40582f754f7fSDarrick J. Wong| Filesystem code signals its intention to use a log incompat feature in a |
40592f754f7fSDarrick J. Wong| transaction by calling ``xlog_use_incompat_feat``, which takes the rwsem |
40602f754f7fSDarrick J. Wong| in shared mode.                                                          |
40612f754f7fSDarrick J. Wong| The code supporting a log incompat feature should create wrapper         |
40622f754f7fSDarrick J. Wong| functions to obtain the log feature and call                             |
40632f754f7fSDarrick J. Wong| ``xfs_add_incompat_log_feature`` to set the feature bits in the primary  |
40642f754f7fSDarrick J. Wong| superblock.                                                              |
40652f754f7fSDarrick J. Wong| The superblock update is performed transactionally, so the wrapper to    |
40662f754f7fSDarrick J. Wong| obtain log assistance must be called just prior to the creation of the   |
40672f754f7fSDarrick J. Wong| transaction that uses the functionality.                                 |
40682f754f7fSDarrick J. Wong| For a file operation, this step must happen after taking the IOLOCK      |
40692f754f7fSDarrick J. Wong| and the MMAPLOCK, but before allocating the transaction.                 |
40702f754f7fSDarrick J. Wong| When the transaction is complete, the ``xlog_drop_incompat_feat``        |
40712f754f7fSDarrick J. Wong| function is called to release the feature.                               |
40722f754f7fSDarrick J. Wong| The feature bit will not be cleared from the superblock until the log    |
40732f754f7fSDarrick J. Wong| becomes clean.                                                           |
40742f754f7fSDarrick J. Wong|                                                                          |
40752f754f7fSDarrick J. Wong| Log-assisted extended attribute updates and atomic extent swaps both use |
40762f754f7fSDarrick J. Wong| log incompat features and provide convenience wrappers around the        |
40772f754f7fSDarrick J. Wong| functionality.                                                           |
40782f754f7fSDarrick J. Wong+--------------------------------------------------------------------------+
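
The protocol described in the sidebar can be modelled in a few lines of
userspace C.
This is only a sketch: every name below is hypothetical, a plain counter
stands in for the shared holders of the per-feature ``struct rw_semaphore``,
and only a single feature is modelled.

.. code-block:: c

	#include <stdbool.h>
	#include <stdint.h>

	/* Hypothetical model; none of these names exist in the kernel. */
	struct model_log {
	    uint32_t        sb_features_log_incompat; /* models the sb field */
	    unsigned int    shared_holders;           /* models rwsem readers */
	};

	/* Model of taking the rwsem in shared mode and setting the bit. */
	static void model_use_incompat_feat(struct model_log *log, uint32_t feat)
	{
	    log->shared_holders++;
	    log->sb_features_log_incompat |= feat;
	}

	/* Model of releasing the shared hold; the feature bit stays set. */
	static void model_drop_incompat_feat(struct model_log *log)
	{
	    log->shared_holders--;
	}

	/*
	 * Model of log cleaning: the bit is cleared only if the exclusive
	 * trylock would succeed, i.e. nobody holds the feature.
	 */
	static bool model_log_clean(struct model_log *log, uint32_t feat)
	{
	    if (log->shared_holders > 0)
	        return false;   /* trylock failed; bit remains set */
	    log->sb_features_log_incompat &= ~feat;
	    return true;
	}

In this model, a cleaning attempt made between the use and drop calls leaves
the bit set, and a later attempt made after the drop clears it.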
40792f754f7fSDarrick J. Wong
40802f754f7fSDarrick J. WongMechanics of an Atomic Extent Swap
40812f754f7fSDarrick J. Wong``````````````````````````````````
40822f754f7fSDarrick J. Wong
40832f754f7fSDarrick J. WongSwapping entire file forks is a complex task.
40842f754f7fSDarrick J. WongThe goal is to exchange all file fork mappings between two file fork offset
40852f754f7fSDarrick J. Wongranges.
40862f754f7fSDarrick J. WongThere are likely to be many extent mappings in each fork, and the edges of
40872f754f7fSDarrick J. Wongthe mappings aren't necessarily aligned.
40882f754f7fSDarrick J. WongFurthermore, there may be other updates that need to happen after the swap,
40892f754f7fSDarrick J. Wongsuch as exchanging file sizes, inode flags, or conversion of fork data to local
40902f754f7fSDarrick J. Wongformat.
40912f754f7fSDarrick J. WongThis is roughly the format of the new deferred extent swap work item:
40922f754f7fSDarrick J. Wong
40932f754f7fSDarrick J. Wong.. code-block:: c
40942f754f7fSDarrick J. Wong
40952f754f7fSDarrick J. Wong	struct xfs_swapext_intent {
40962f754f7fSDarrick J. Wong	    /* Inodes participating in the operation. */
40972f754f7fSDarrick J. Wong	    struct xfs_inode    *sxi_ip1;
40982f754f7fSDarrick J. Wong	    struct xfs_inode    *sxi_ip2;
40992f754f7fSDarrick J. Wong
41002f754f7fSDarrick J. Wong	    /* File offset range information. */
41012f754f7fSDarrick J. Wong	    xfs_fileoff_t       sxi_startoff1;
41022f754f7fSDarrick J. Wong	    xfs_fileoff_t       sxi_startoff2;
41032f754f7fSDarrick J. Wong	    xfs_filblks_t       sxi_blockcount;
41042f754f7fSDarrick J. Wong
41052f754f7fSDarrick J. Wong	    /* Set these file sizes after the operation, unless negative. */
41062f754f7fSDarrick J. Wong	    xfs_fsize_t         sxi_isize1;
41072f754f7fSDarrick J. Wong	    xfs_fsize_t         sxi_isize2;
41082f754f7fSDarrick J. Wong
41092f754f7fSDarrick J. Wong	    /* XFS_SWAP_EXT_* log operation flags */
41102f754f7fSDarrick J. Wong	    uint64_t            sxi_flags;
41112f754f7fSDarrick J. Wong	};
41122f754f7fSDarrick J. Wong
41132f754f7fSDarrick J. WongThe new log intent item contains enough information to track two logical fork
41142f754f7fSDarrick J. Wongoffset ranges: ``(inode1, startoff1, blockcount)`` and ``(inode2, startoff2,
41152f754f7fSDarrick J. Wongblockcount)``.
41162f754f7fSDarrick J. WongEach step of a swap operation exchanges the largest file range mapping possible
41172f754f7fSDarrick J. Wongfrom one file to the other.
41182f754f7fSDarrick J. WongAfter each step in the swap operation, the two startoff fields are incremented
41192f754f7fSDarrick J. Wongand the blockcount field is decremented to reflect the progress made.
41202f754f7fSDarrick J. WongThe flags field captures behavioral parameters such as swapping the attr fork
41212f754f7fSDarrick J. Wonginstead of the data fork and other work to be done after the extent swap.
41222f754f7fSDarrick J. WongThe two isize fields are used to swap the file size at the end of the operation
41232f754f7fSDarrick J. Wongif the file data fork is the target of the swap operation.
41242f754f7fSDarrick J. Wong
41252f754f7fSDarrick J. WongWhen the extent swap is initiated, the sequence of operations is as follows:
41262f754f7fSDarrick J. Wong
41272f754f7fSDarrick J. Wong1. Create a deferred work item for the extent swap.
41282f754f7fSDarrick J. Wong   At the start, it should contain the entirety of the file ranges to be
41292f754f7fSDarrick J. Wong   swapped.
41302f754f7fSDarrick J. Wong
41312f754f7fSDarrick J. Wong2. Call ``xfs_defer_finish`` to process the exchange.
41322f754f7fSDarrick J. Wong   This is encapsulated in ``xrep_tempswap_contents`` for scrub operations.
41332f754f7fSDarrick J. Wong   This will log an extent swap intent item to the transaction for the deferred
41342f754f7fSDarrick J. Wong   extent swap work item.
41352f754f7fSDarrick J. Wong
41362f754f7fSDarrick J. Wong3. Until ``sxi_blockcount`` of the deferred extent swap work item is zero,
41372f754f7fSDarrick J. Wong
41382f754f7fSDarrick J. Wong   a. Read the block maps of both file ranges starting at ``sxi_startoff1`` and
41392f754f7fSDarrick J. Wong      ``sxi_startoff2``, respectively, and compute the longest extent that can
41402f754f7fSDarrick J. Wong      be swapped in a single step.
41412f754f7fSDarrick J. Wong      This is the minimum of the two ``br_blockcount`` values in the mappings.
41422f754f7fSDarrick J. Wong      Keep advancing through the file forks until at least one of the mappings
41432f754f7fSDarrick J. Wong      contains written blocks.
41442f754f7fSDarrick J. Wong      Mutual holes, unwritten extents, and extent mappings to the same physical
41452f754f7fSDarrick J. Wong      space are not exchanged.
41462f754f7fSDarrick J. Wong
41472f754f7fSDarrick J. Wong      For the next few steps, this document will refer to the mapping that came
41482f754f7fSDarrick J. Wong      from file 1 as "map1", and the mapping that came from file 2 as "map2".
41492f754f7fSDarrick J. Wong
41502f754f7fSDarrick J. Wong   b. Create a deferred block mapping update to unmap map1 from file 1.
41512f754f7fSDarrick J. Wong
41522f754f7fSDarrick J. Wong   c. Create a deferred block mapping update to unmap map2 from file 2.
41532f754f7fSDarrick J. Wong
41542f754f7fSDarrick J. Wong   d. Create a deferred block mapping update to map map1 into file 2.
41552f754f7fSDarrick J. Wong
41562f754f7fSDarrick J. Wong   e. Create a deferred block mapping update to map map2 into file 1.
41572f754f7fSDarrick J. Wong
41582f754f7fSDarrick J. Wong   f. Log the block, quota, and extent count updates for both files.
41592f754f7fSDarrick J. Wong
41602f754f7fSDarrick J. Wong   g. Extend the ondisk size of either file if necessary.
41612f754f7fSDarrick J. Wong
41622f754f7fSDarrick J. Wong   h. Log an extent swap done log item for the extent swap intent log item
41632f754f7fSDarrick J. Wong      that was read at the start of step 3.
41642f754f7fSDarrick J. Wong
41652f754f7fSDarrick J. Wong   i. Compute the amount of file range that has just been covered.
41662f754f7fSDarrick J. Wong      This quantity is ``(map1.br_startoff + map1.br_blockcount -
41672f754f7fSDarrick J. Wong      sxi_startoff1)``, because step 3a could have skipped holes.
41682f754f7fSDarrick J. Wong
41692f754f7fSDarrick J. Wong   j. Increase the starting offsets of ``sxi_startoff1`` and ``sxi_startoff2``
41702f754f7fSDarrick J. Wong      by the number of blocks computed in the previous step, and decrease
41712f754f7fSDarrick J. Wong      ``sxi_blockcount`` by the same quantity.
41722f754f7fSDarrick J. Wong      This advances the cursor.
41732f754f7fSDarrick J. Wong
41742f754f7fSDarrick J. Wong   k. Log a new extent swap intent log item reflecting the advanced state of
41752f754f7fSDarrick J. Wong      the work item.
41762f754f7fSDarrick J. Wong
41772f754f7fSDarrick J. Wong   l. Return the proper error code (EAGAIN) to the deferred operation manager
41782f754f7fSDarrick J. Wong      to inform it that there is more work to be done.
41792f754f7fSDarrick J. Wong      The operation manager completes the deferred work in steps 3b-3e before
41802f754f7fSDarrick J. Wong      moving back to the start of step 3.
41812f754f7fSDarrick J. Wong
41822f754f7fSDarrick J. Wong4. Perform any post-processing.
41832f754f7fSDarrick J. Wong   This will be discussed in more detail in subsequent sections.
41842f754f7fSDarrick J. Wong
41852f754f7fSDarrick J. WongIf the filesystem goes down in the middle of an operation, log recovery will
41862f754f7fSDarrick J. Wongfind the most recent unfinished extent swap log intent item and restart from
41872f754f7fSDarrick J. Wongthere.
41882f754f7fSDarrick J. WongThis is how extent swapping guarantees that an outside observer will either see
41892f754f7fSDarrick J. Wongthe old broken structure or the new one, and never a mishmash of both.
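
The cursor arithmetic of step 3 can be illustrated with a small userspace
model.
This is a sketch under loud assumptions, not the kernel implementation: each
fork is modelled as an array of physical block numbers indexed by file block
offset, ``HOLE`` marks an unmapped block, all names are hypothetical, and
transactions, log items, unwritten extents, and the skipping of identical
mappings are left out entirely.

.. code-block:: c

	#include <stddef.h>
	#include <stdint.h>

	#define HOLE 0  /* physical block 0 means "no mapping" in this model */

	/* Hypothetical cursor mirroring the offset fields of the work item. */
	struct swap_cursor {
	    size_t  startoff1;
	    size_t  startoff2;
	    size_t  blockcount;
	};

	/* Length of the contiguous mapping (or hole) at off, capped at max. */
	static size_t mapping_len(const uint64_t *fork, size_t off, size_t max)
	{
	    size_t len = 1;

	    while (len < max) {
	        if (fork[off] == HOLE) {
	            if (fork[off + len] != HOLE)
	                break;
	        } else if (fork[off + len] != fork[off] + len) {
	            break;
	        }
	        len++;
	    }
	    return len;
	}

	/*
	 * One pass through step 3: exchange the longest range that is a
	 * single mapping in both forks, then advance the cursor.
	 * Returns 1 if more work remains, 0 when the exchange is complete.
	 */
	static int swap_step(uint64_t *fork1, uint64_t *fork2,
	                     struct swap_cursor *cur)
	{
	    size_t len1, len2, n, i;

	    if (cur->blockcount == 0)
	        return 0;

	    len1 = mapping_len(fork1, cur->startoff1, cur->blockcount);
	    len2 = mapping_len(fork2, cur->startoff2, cur->blockcount);
	    n = len1 < len2 ? len1 : len2;

	    for (i = 0; i < n; i++) {
	        uint64_t tmp = fork1[cur->startoff1 + i];

	        fork1[cur->startoff1 + i] = fork2[cur->startoff2 + i];
	        fork2[cur->startoff2 + i] = tmp;
	    }

	    /* Advance the cursor, as in step 3j. */
	    cur->startoff1 += n;
	    cur->startoff2 += n;
	    cur->blockcount -= n;
	    return cur->blockcount > 0;
	}

Because the cursor always describes exactly the range that has not yet been
exchanged, calling ``swap_step`` again after an interruption resumes the
operation where it stopped, which is the property that the logged intent item
provides across a crash.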
41902f754f7fSDarrick J. Wong
41912f754f7fSDarrick J. WongPreparation for Extent Swapping
41922f754f7fSDarrick J. Wong```````````````````````````````
41932f754f7fSDarrick J. Wong
41942f754f7fSDarrick J. WongThere are a few things that need to be taken care of before initiating an
41952f754f7fSDarrick J. Wongatomic extent swap operation.
41962f754f7fSDarrick J. WongFirst, regular files require the page cache to be flushed to disk before the
41972f754f7fSDarrick J. Wongoperation begins, and directio writes to be quiesced.
41982f754f7fSDarrick J. WongLike any filesystem operation, extent swapping must determine the maximum
41992f754f7fSDarrick J. Wongamount of disk space and quota that can be consumed on behalf of both files in
42002f754f7fSDarrick J. Wongthe operation, and reserve that quantity of resources to avoid an unrecoverable
42012f754f7fSDarrick J. Wongout of space failure once it starts dirtying metadata.
42022f754f7fSDarrick J. WongThe preparation step scans the ranges of both files to estimate:
42032f754f7fSDarrick J. Wong
42042f754f7fSDarrick J. Wong- Data device blocks needed to handle the repeated updates to the fork
42052f754f7fSDarrick J. Wong  mappings.
42062f754f7fSDarrick J. Wong- Change in data and realtime block counts for both files.
42072f754f7fSDarrick J. Wong- Increase in quota usage for both files, if the two files do not share the
42082f754f7fSDarrick J. Wong  same set of quota ids.
42092f754f7fSDarrick J. Wong- The number of extent mappings that will be added to each file.
42102f754f7fSDarrick J. Wong- Whether or not there are partially written realtime extents.
42112f754f7fSDarrick J. Wong  User programs must never be able to access a realtime file extent that maps
42122f754f7fSDarrick J. Wong  to different extents on the realtime volume, which could happen if the
42132f754f7fSDarrick J. Wong  operation fails to run to completion.
42142f754f7fSDarrick J. Wong
42152f754f7fSDarrick J. WongThe need for precise estimation increases the run time of the swap operation,
42162f754f7fSDarrick J. Wongbut it is very important to maintain correct accounting.
42172f754f7fSDarrick J. WongThe filesystem must not run completely out of free space, nor can the extent
42182f754f7fSDarrick J. Wongswap ever add more extent mappings to a fork than it can support.
42192f754f7fSDarrick J. WongRegular users are required to abide by the quota limits, though metadata repairs
42202f754f7fSDarrick J. Wongmay exceed quota to resolve inconsistent metadata elsewhere.
42212f754f7fSDarrick J. Wong
42222f754f7fSDarrick J. WongSpecial Features for Swapping Metadata File Extents
42232f754f7fSDarrick J. Wong```````````````````````````````````````````````````
42242f754f7fSDarrick J. Wong
42252f754f7fSDarrick J. WongExtended attributes, symbolic links, and directories can set the fork format to
42262f754f7fSDarrick J. Wong"local" and treat the fork as a literal area for data storage.
42272f754f7fSDarrick J. WongMetadata repairs must take extra steps to support these cases:
42282f754f7fSDarrick J. Wong
42292f754f7fSDarrick J. Wong- If both forks are in local format and the fork areas are large enough, the
42302f754f7fSDarrick J. Wong  swap is performed by copying the incore fork contents, logging both forks,
42312f754f7fSDarrick J. Wong  and committing.
42322f754f7fSDarrick J. Wong  The atomic extent swap mechanism is not necessary, since this can be done
42332f754f7fSDarrick J. Wong  with a single transaction.
42342f754f7fSDarrick J. Wong
42352f754f7fSDarrick J. Wong- If both forks map blocks, then the regular atomic extent swap is used.
42362f754f7fSDarrick J. Wong
42372f754f7fSDarrick J. Wong- Otherwise, only one fork is in local format.
42382f754f7fSDarrick J. Wong  The contents of the local format fork are converted to a block to perform the
42392f754f7fSDarrick J. Wong  swap.
42402f754f7fSDarrick J. Wong  The conversion to block format must be done in the same transaction that
42412f754f7fSDarrick J. Wong  logs the initial extent swap intent log item.
42422f754f7fSDarrick J. Wong  The regular atomic extent swap is used to exchange the mappings.
42432f754f7fSDarrick J. Wong  Special flags are set on the swap operation so that the transaction can be
42442f754f7fSDarrick J. Wong  rolled one more time to convert the second file's fork back to local format
42452f754f7fSDarrick J. Wong  so that the second file will be ready to go as soon as the ILOCK is dropped.
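
The three cases above amount to a small decision table, sketched here with
hypothetical names.
One assumption is made that the text does not state: when both forks are in
local format but the fork areas are not large enough, this sketch falls
through to the convert-then-swap path.

.. code-block:: c

	#include <stdbool.h>

	/* Hypothetical names, for illustration only. */
	enum fork_format { FORK_LOCAL, FORK_BLOCKS };

	enum swap_strategy {
	    SWAP_COPY_LOCAL,      /* copy incore forks in one transaction */
	    SWAP_ATOMIC,          /* regular atomic extent swap */
	    SWAP_CONVERT_ATOMIC,  /* convert local data to a block, then swap */
	};

	static enum swap_strategy choose_strategy(enum fork_format f1,
	                                          enum fork_format f2,
	                                          bool local_areas_fit)
	{
	    if (f1 == FORK_LOCAL && f2 == FORK_LOCAL && local_areas_fit)
	        return SWAP_COPY_LOCAL;
	    if (f1 == FORK_BLOCKS && f2 == FORK_BLOCKS)
	        return SWAP_ATOMIC;
	    /* Assumption: every remaining case converts before swapping. */
	    return SWAP_CONVERT_ATOMIC;
	}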
42462f754f7fSDarrick J. Wong
42472f754f7fSDarrick J. WongExtended attributes and directories stamp the owning inode into every block,
42482f754f7fSDarrick J. Wongbut the buffer verifiers do not actually check the inode number!
42492f754f7fSDarrick J. WongAlthough there is no verification, it is still important to maintain
42502f754f7fSDarrick J. Wongreferential integrity, so prior to performing the extent swap, online repair
42512f754f7fSDarrick J. Wongbuilds every block in the new data structure with the owner field of the file
42522f754f7fSDarrick J. Wongbeing repaired.
42532f754f7fSDarrick J. Wong
42542f754f7fSDarrick J. WongAfter a successful swap operation, the repair operation must reap the old fork
42552f754f7fSDarrick J. Wongblocks by processing each fork mapping through the standard :ref:`file extent
42562f754f7fSDarrick J. Wongreaping <reaping>` mechanism that is done post-repair.
42572f754f7fSDarrick J. WongIf the filesystem should go down during the reap part of the repair, the
42582f754f7fSDarrick J. Wongiunlink processing at the end of recovery will free both the temporary file and
42592f754f7fSDarrick J. Wongwhatever blocks were not reaped.
42602f754f7fSDarrick J. WongHowever, this iunlink processing omits the cross-link detection of online
42612f754f7fSDarrick J. Wongrepair, and is not completely foolproof.
42622f754f7fSDarrick J. Wong
42632f754f7fSDarrick J. WongSwapping Temporary File Extents
42642f754f7fSDarrick J. Wong```````````````````````````````
42652f754f7fSDarrick J. Wong
42662f754f7fSDarrick J. WongTo repair a metadata file, online repair proceeds as follows:
42672f754f7fSDarrick J. Wong
42682f754f7fSDarrick J. Wong1. Create a temporary repair file.
42692f754f7fSDarrick J. Wong
42702f754f7fSDarrick J. Wong2. Use the staging data to write out new contents into the temporary repair
42712f754f7fSDarrick J. Wong   file.
42722f754f7fSDarrick J. Wong   The new contents must be written to the same fork that is being repaired.
42732f754f7fSDarrick J. Wong
42742f754f7fSDarrick J. Wong3. Commit the scrub transaction, since the swap estimation step must be
42752f754f7fSDarrick J. Wong   completed before transaction reservations are made.
42762f754f7fSDarrick J. Wong
42772f754f7fSDarrick J. Wong4. Call ``xrep_tempswap_trans_alloc`` to allocate a new scrub transaction with
42782f754f7fSDarrick J. Wong   the appropriate resource reservations and locks, and to fill out a
42792f754f7fSDarrick J. Wong   ``struct xfs_swapext_req`` with the details of the swap operation.
42802f754f7fSDarrick J. Wong
42812f754f7fSDarrick J. Wong5. Call ``xrep_tempswap_contents`` to swap the contents.
42822f754f7fSDarrick J. Wong
42832f754f7fSDarrick J. Wong6. Commit the transaction to complete the repair.
42842f754f7fSDarrick J. Wong
42852f754f7fSDarrick J. Wong.. _rtsummary:
42862f754f7fSDarrick J. Wong
42872f754f7fSDarrick J. WongCase Study: Repairing the Realtime Summary File
42882f754f7fSDarrick J. Wong^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
42892f754f7fSDarrick J. Wong
42902f754f7fSDarrick J. WongIn the "realtime" section of an XFS filesystem, free space is tracked via a
42912f754f7fSDarrick J. Wongbitmap, similar to Unix FFS.
42922f754f7fSDarrick J. WongEach bit in the bitmap represents one realtime extent, which is a multiple of
42932f754f7fSDarrick J. Wongthe filesystem block size between 4KiB and 1GiB in size.
42942f754f7fSDarrick J. WongThe realtime summary file indexes the number of free extents of a given size to
42952f754f7fSDarrick J. Wongthe offset of the block within the realtime free space bitmap where those free
42962f754f7fSDarrick J. Wongextents begin.
42972f754f7fSDarrick J. WongIn other words, the summary file helps the allocator find free extents by
42982f754f7fSDarrick J. Wonglength, similar to what the free space by count (cntbt) btree does for the data
42992f754f7fSDarrick J. Wongsection.
43002f754f7fSDarrick J. Wong
43012f754f7fSDarrick J. WongThe summary file itself is a flat file (with no block headers or checksums!)
43022f754f7fSDarrick J. Wongpartitioned into ``log2(total rt extents)`` sections containing enough 32-bit
43032f754f7fSDarrick J. Wongcounters to match the number of blocks in the rt bitmap.
43042f754f7fSDarrick J. WongEach counter records the number of free extents that start in that bitmap block
43052f754f7fSDarrick J. Wongand can satisfy a power-of-two allocation request.
43062f754f7fSDarrick J. Wong
43072f754f7fSDarrick J. WongTo check the summary file against the bitmap:
43082f754f7fSDarrick J. Wong
43092f754f7fSDarrick J. Wong1. Take the ILOCK of both the realtime bitmap and summary files.
43102f754f7fSDarrick J. Wong
43112f754f7fSDarrick J. Wong2. For each free space extent recorded in the bitmap:
43122f754f7fSDarrick J. Wong
43132f754f7fSDarrick J. Wong   a. Compute the position in the summary file that contains a counter that
43142f754f7fSDarrick J. Wong      represents this free extent.
43152f754f7fSDarrick J. Wong
43162f754f7fSDarrick J. Wong   b. Read the counter from the xfile.
43172f754f7fSDarrick J. Wong
43182f754f7fSDarrick J. Wong   c. Increment it, and write it back to the xfile.
43192f754f7fSDarrick J. Wong
43202f754f7fSDarrick J. Wong3. Compare the contents of the xfile against the ondisk file.
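
Steps 2a through 2c can be sketched as follows.
The layout assumed here (one section of counters per power-of-two size class,
each section holding one counter per bitmap block) follows the description
above, but the exact index arithmetic, the function names, and the use of a
plain array in place of the xfile are assumptions made for illustration.

.. code-block:: c

	#include <stdint.h>

	/* floor(log2(len)) for len > 0. */
	static unsigned int log2_floor(uint64_t len)
	{
	    unsigned int log = 0;

	    while (len >>= 1)
	        log++;
	    return log;
	}

	/*
	 * Index of the 32-bit counter for a free extent that is len rt
	 * extents long and starts in bitmap block bbno.
	 * rbmblocks is the number of blocks in the rt bitmap.
	 */
	static uint64_t summary_index(uint64_t rbmblocks, uint64_t bbno,
	                              uint64_t len)
	{
	    return (uint64_t)log2_floor(len) * rbmblocks + bbno;
	}

	/* Steps 2a-2c for one free extent; counters models the xfile. */
	static void summary_count_extent(uint32_t *counters, uint64_t rbmblocks,
	                                 uint64_t rtx_per_bmblock,
	                                 uint64_t start, uint64_t len)
	{
	    uint64_t bbno = start / rtx_per_bmblock;

	    counters[summary_index(rbmblocks, bbno, len)]++;
	}

For example, with four bitmap blocks that each cover 32768 rt extents, a free
extent of length 5 starting at rt extent 40000 lands in size class 2 and
bitmap block 1, so the model increments counter ``2 * 4 + 1 = 9``.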
43212f754f7fSDarrick J. Wong
43222f754f7fSDarrick J. WongTo repair the summary file, write the xfile contents into the temporary file
43232f754f7fSDarrick J. Wongand use atomic extent swap to commit the new contents.
43242f754f7fSDarrick J. WongThe temporary file is then reaped.
43252f754f7fSDarrick J. Wong
43262f754f7fSDarrick J. WongThe proposed patchset is the
43272f754f7fSDarrick J. Wong`realtime summary repair
43282f754f7fSDarrick J. Wong<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-rtsummary>`_
43292f754f7fSDarrick J. Wongseries.
43302f754f7fSDarrick J. Wong
43312f754f7fSDarrick J. WongCase Study: Salvaging Extended Attributes
43322f754f7fSDarrick J. Wong^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
43332f754f7fSDarrick J. Wong
43342f754f7fSDarrick J. WongIn XFS, extended attributes are implemented as a namespaced name-value store.
Values are limited in size to 64KiB, but there is no limit on the number of
43362f754f7fSDarrick J. Wongnames.
43372f754f7fSDarrick J. WongThe attribute fork is unpartitioned, which means that the root of the attribute
43382f754f7fSDarrick J. Wongstructure is always in logical block zero, but attribute leaf blocks, dabtree
43392f754f7fSDarrick J. Wongindex blocks, and remote value blocks are intermixed.
43402f754f7fSDarrick J. WongAttribute leaf blocks contain variable-sized records that associate
43412f754f7fSDarrick J. Wonguser-provided names with the user-provided values.
43422f754f7fSDarrick J. WongValues larger than a block are allocated separate extents and written there.
43432f754f7fSDarrick J. WongIf the leaf information expands beyond a single block, a directory/attribute
43442f754f7fSDarrick J. Wongbtree (``dabtree``) is created to map hashes of attribute names to entries
43452f754f7fSDarrick J. Wongfor fast lookup.
43462f754f7fSDarrick J. Wong
43472f754f7fSDarrick J. WongSalvaging extended attributes is done as follows:
43482f754f7fSDarrick J. Wong
43492f754f7fSDarrick J. Wong1. Walk the attr fork mappings of the file being repaired to find the attribute
43502f754f7fSDarrick J. Wong   leaf blocks.
43512f754f7fSDarrick J. Wong   When one is found,
43522f754f7fSDarrick J. Wong
43532f754f7fSDarrick J. Wong   a. Walk the attr leaf block to find candidate keys.
43542f754f7fSDarrick J. Wong      When one is found,
43552f754f7fSDarrick J. Wong
      1. Check the name for problems, and ignore the name if there are any.
43572f754f7fSDarrick J. Wong
43582f754f7fSDarrick J. Wong      2. Retrieve the value.
43592f754f7fSDarrick J. Wong         If that succeeds, add the name and value to the staging xfarray and
43602f754f7fSDarrick J. Wong         xfblob.
43612f754f7fSDarrick J. Wong
2. If the memory usage of the xfarray and xfblob exceeds a certain threshold
   or there are no more attr fork blocks to examine, unlock the file and
   add the staged extended attributes to the temporary file.
43652f754f7fSDarrick J. Wong
43662f754f7fSDarrick J. Wong3. Use atomic extent swapping to exchange the new and old extended attribute
43672f754f7fSDarrick J. Wong   structures.
43682f754f7fSDarrick J. Wong   The old attribute blocks are now attached to the temporary file.
43692f754f7fSDarrick J. Wong
43702f754f7fSDarrick J. Wong4. Reap the temporary file.
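
The stage-then-flush pattern of steps 1 and 2 can be sketched as follows.
This is a simplified model, not the kernel implementation: a Python list plays
the role of the xfarray and xfblob, ``name_ok`` is a hypothetical validity
check, and ``read_value`` and ``flush`` are caller-supplied callbacks invented
for the example.

```python
def name_ok(name):
    """Hypothetical check: nonempty, at most 255 bytes, no NUL bytes."""
    return 0 < len(name) <= 255 and b"\0" not in name


def salvage_xattrs(leaf_blocks, read_value, flush, max_staged_bytes):
    """Stage plausible (name, value) pairs and flush them in batches.

    leaf_blocks: iterable of lists of candidate names (bytes)
    read_value:  callable returning the value bytes, or None on failure
    flush:       callable receiving a list of staged (name, value) pairs
    """
    staged, staged_bytes = [], 0
    for block in leaf_blocks:
        for name in block:
            if not name_ok(name):
                continue                    # step 1.a.1: ignore bad names
            value = read_value(name)
            if value is None:
                continue                    # step 1.a.2: value unreadable
            staged.append((name, value))
            staged_bytes += len(name) + len(value)
        if staged_bytes > max_staged_bytes:
            flush(staged)                   # step 2: memory threshold hit
            staged, staged_bytes = [], 0
    if staged:
        flush(staged)                       # step 2: no more blocks
```

Flushing between leaf blocks rather than in the middle of one keeps the
sketch simple; the real code is free to choose a different granularity.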
43712f754f7fSDarrick J. Wong
43722f754f7fSDarrick J. WongThe proposed patchset is the
43732f754f7fSDarrick J. Wong`extended attribute repair
43742f754f7fSDarrick J. Wong<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-xattrs>`_
43752f754f7fSDarrick J. Wongseries.
4376a26aa252SDarrick J. Wong
4377a26aa252SDarrick J. WongFixing Directories
4378a26aa252SDarrick J. Wong------------------
4379a26aa252SDarrick J. Wong
4380a26aa252SDarrick J. WongFixing directories is difficult with currently available filesystem features,
4381a26aa252SDarrick J. Wongsince directory entries are not redundant.
4382a26aa252SDarrick J. WongThe offline repair tool scans all inodes to find files with nonzero link count,
4383a26aa252SDarrick J. Wongand then it scans all directories to establish parentage of those linked files.
4384a26aa252SDarrick J. WongDamaged files and directories are zapped, and files with no parent are
4385a26aa252SDarrick J. Wongmoved to the ``/lost+found`` directory.
4386a26aa252SDarrick J. WongIt does not try to salvage anything.
4387a26aa252SDarrick J. Wong
4388a26aa252SDarrick J. WongThe best that online repair can do at this time is to read directory data
4389a26aa252SDarrick J. Wongblocks and salvage any dirents that look plausible, correct link counts, and
4390a26aa252SDarrick J. Wongmove orphans back into the directory tree.
4391a26aa252SDarrick J. WongThe salvage process is discussed in the case study at the end of this section.
4392a26aa252SDarrick J. WongThe :ref:`file link count fsck <nlinks>` code takes care of fixing link counts
4393a26aa252SDarrick J. Wongand moving orphans to the ``/lost+found`` directory.
4394a26aa252SDarrick J. Wong
4395a26aa252SDarrick J. WongCase Study: Salvaging Directories
4396a26aa252SDarrick J. Wong`````````````````````````````````
4397a26aa252SDarrick J. Wong
4398a26aa252SDarrick J. WongUnlike extended attributes, directory blocks are all the same size, so
4399a26aa252SDarrick J. Wongsalvaging directories is straightforward:
4400a26aa252SDarrick J. Wong
4401a26aa252SDarrick J. Wong1. Find the parent of the directory.
   If the dotdot entry is readable, try to confirm that the alleged
4403a26aa252SDarrick J. Wong   parent has a child entry pointing back to the directory being repaired.
4404a26aa252SDarrick J. Wong   Otherwise, walk the filesystem to find it.
4405a26aa252SDarrick J. Wong
2. Walk the first partition of the data fork of the directory to find the
   directory entry data blocks.
4408a26aa252SDarrick J. Wong   When one is found,
4409a26aa252SDarrick J. Wong
4410a26aa252SDarrick J. Wong   a. Walk the directory data block to find candidate entries.
4411a26aa252SDarrick J. Wong      When an entry is found:
4412a26aa252SDarrick J. Wong
      i. Check the name for problems, and ignore the name if there are any.
4414a26aa252SDarrick J. Wong
4415a26aa252SDarrick J. Wong      ii. Retrieve the inumber and grab the inode.
4416a26aa252SDarrick J. Wong          If that succeeds, add the name, inode number, and file type to the
          staging xfarray and xfblob.
4418a26aa252SDarrick J. Wong
3. If the memory usage of the xfarray and xfblob exceeds a certain threshold
   or there are no more directory data blocks to examine, unlock the
4421a26aa252SDarrick J. Wong   directory and add the staged dirents into the temporary directory.
4422a26aa252SDarrick J. Wong   Truncate the staging files.
4423a26aa252SDarrick J. Wong
4424a26aa252SDarrick J. Wong4. Use atomic extent swapping to exchange the new and old directory structures.
4425a26aa252SDarrick J. Wong   The old directory blocks are now attached to the temporary file.
4426a26aa252SDarrick J. Wong
4427a26aa252SDarrick J. Wong5. Reap the temporary file.
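
Step 1 is the part of this procedure that differs most from the extended
attribute case, so here is a small sketch of it.
The in-memory ``directories`` mapping and the ``find_parent`` helper are
hypothetical stand-ins for ondisk structures.
The sketch also assumes that a failed confirmation falls back to the same
filesystem walk as an unreadable dotdot entry, and it does not model the root
directory, which is its own parent.

```python
def find_parent(dir_inum, dotdot, directories):
    """Return the inumber of the directory that links to dir_inum, or None.

    dotdot:      inumber read from the '..' entry, or None if unreadable
    directories: dict mapping directory inumber -> {name: child inumber}
    """
    def links_to_child(parent):
        entries = directories.get(parent, {})
        return any(name not in (".", "..") and child == dir_inum
                   for name, child in entries.items())

    # Trust the dotdot entry only if the alleged parent points back.
    if dotdot is not None and links_to_child(dotdot):
        return dotdot

    # Otherwise walk every directory looking for a link to this one.
    for parent in sorted(directories):
        if parent != dir_inum and links_to_child(parent):
            return parent
    return None
```

A return value of ``None`` corresponds to a directory with no discoverable
parent, which is the situation that the orphanage exists to handle.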
4428a26aa252SDarrick J. Wong
4429a26aa252SDarrick J. Wong**Future Work Question**: Should repair revalidate the dentry cache when
4430a26aa252SDarrick J. Wongrebuilding a directory?
4431a26aa252SDarrick J. Wong
4432a26aa252SDarrick J. Wong*Answer*: Yes, it should.
4433a26aa252SDarrick J. Wong
4434a26aa252SDarrick J. WongIn theory it is necessary to scan all dentry cache entries for a directory to
ensure that one of the following applies:
4436a26aa252SDarrick J. Wong
4437a26aa252SDarrick J. Wong1. The cached dentry reflects an ondisk dirent in the new directory.
4438a26aa252SDarrick J. Wong
4439a26aa252SDarrick J. Wong2. The cached dentry no longer has a corresponding ondisk dirent in the new
4440a26aa252SDarrick J. Wong   directory and the dentry can be purged from the cache.
4441a26aa252SDarrick J. Wong
4442a26aa252SDarrick J. Wong3. The cached dentry no longer has an ondisk dirent but the dentry cannot be
4443a26aa252SDarrick J. Wong   purged.
4444a26aa252SDarrick J. Wong   This is the problem case.
4445a26aa252SDarrick J. Wong
4446a26aa252SDarrick J. WongUnfortunately, the current dentry cache design doesn't provide a means to walk
4447a26aa252SDarrick J. Wongevery child dentry of a specific directory, which makes this a hard problem.
4448a26aa252SDarrick J. WongThere is no known solution.
4449a26aa252SDarrick J. Wong
4450a26aa252SDarrick J. WongThe proposed patchset is the
4451a26aa252SDarrick J. Wong`directory repair
4452a26aa252SDarrick J. Wong<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-dirs>`_
4453a26aa252SDarrick J. Wongseries.
4454a26aa252SDarrick J. Wong
4455a26aa252SDarrick J. WongParent Pointers
4456a26aa252SDarrick J. Wong```````````````
4457a26aa252SDarrick J. Wong
4458a26aa252SDarrick J. WongA parent pointer is a piece of file metadata that enables a user to locate the
4459a26aa252SDarrick J. Wongfile's parent directory without having to traverse the directory tree from the
4460a26aa252SDarrick J. Wongroot.
4461a26aa252SDarrick J. WongWithout them, reconstruction of directory trees is hindered in much the same
4462a26aa252SDarrick J. Wongway that the historic lack of reverse space mapping information once hindered
4463a26aa252SDarrick J. Wongreconstruction of filesystem space metadata.
4464a26aa252SDarrick J. WongThe parent pointer feature, however, makes total directory reconstruction
4465a26aa252SDarrick J. Wongpossible.
4466a26aa252SDarrick J. Wong
4467a26aa252SDarrick J. WongXFS parent pointers include the dirent name and location of the entry within
4468a26aa252SDarrick J. Wongthe parent directory.
4469a26aa252SDarrick J. WongIn other words, child files use extended attributes to store pointers to
4470a26aa252SDarrick J. Wongparents in the form ``(parent_inum, parent_gen, dirent_pos) → (dirent_name)``.
4471a26aa252SDarrick J. WongThe directory checking process can be strengthened to ensure that the target of
4472a26aa252SDarrick J. Wongeach dirent also contains a parent pointer pointing back to the dirent.
4473a26aa252SDarrick J. WongLikewise, each parent pointer can be checked by ensuring that the target of
4474a26aa252SDarrick J. Wongeach parent pointer is a directory and that it contains a dirent matching
4475a26aa252SDarrick J. Wongthe parent pointer.
4476a26aa252SDarrick J. WongBoth online and offline repair can use this strategy.
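
The two cross-checks described above are symmetric, which a short sketch makes
easier to see.
The dictionaries below are a hypothetical in-memory model of directories and
parent pointers using the ``(parent_inum, parent_gen, dirent_pos) →
(dirent_name)`` form; none of the names come from the actual code, and the
ondisk format may still change.

```python
def check_dirents(dirs, pptrs):
    """Return dirents whose target lacks a matching parent pointer.

    dirs:  parent inumber -> (generation, {dirent_pos: (name, child inumber)})
    pptrs: child inumber  -> {(parent_inum, parent_gen, dirent_pos): name}
    """
    bad = []
    for parent_inum, (parent_gen, entries) in dirs.items():
        for pos, (name, child_inum) in entries.items():
            key = (parent_inum, parent_gen, pos)
            if pptrs.get(child_inum, {}).get(key) != name:
                bad.append((parent_inum, pos, name))
    return bad


def check_pptrs(dirs, pptrs):
    """Return parent pointers whose target has no matching dirent."""
    bad = []
    for child_inum, ptrs in pptrs.items():
        for (parent_inum, parent_gen, pos), name in ptrs.items():
            parent = dirs.get(parent_inum)      # target must be a directory
            if (parent is None or parent[0] != parent_gen
                    or parent[1].get(pos) != (name, child_inum)):
                bad.append((child_inum, parent_inum, pos, name))
    return bad
```

A consistent filesystem yields two empty lists; each reported tuple is either
a dirent missing its back pointer or a parent pointer missing its dirent.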
4477a26aa252SDarrick J. Wong
4478a26aa252SDarrick J. Wong**Note**: The ondisk format of parent pointers is not yet finalized.
4479a26aa252SDarrick J. Wong
4480a26aa252SDarrick J. Wong+--------------------------------------------------------------------------+
4481a26aa252SDarrick J. Wong| **Historical Sidebar**:                                                  |
4482a26aa252SDarrick J. Wong+--------------------------------------------------------------------------+
4483a26aa252SDarrick J. Wong| Directory parent pointers were first proposed as an XFS feature more     |
4484a26aa252SDarrick J. Wong| than a decade ago by SGI.                                                |
4485a26aa252SDarrick J. Wong| Each link from a parent directory to a child file is mirrored with an    |
4486a26aa252SDarrick J. Wong| extended attribute in the child that could be used to identify the       |
4487a26aa252SDarrick J. Wong| parent directory.                                                        |
4488a26aa252SDarrick J. Wong| Unfortunately, this early implementation had major shortcomings and was  |
4489a26aa252SDarrick J. Wong| never merged into Linux XFS:                                             |
4490a26aa252SDarrick J. Wong|                                                                          |
4491a26aa252SDarrick J. Wong| 1. The XFS codebase of the late 2000s did not have the infrastructure to |
4492a26aa252SDarrick J. Wong|    enforce strong referential integrity in the directory tree.           |
4493a26aa252SDarrick J. Wong|    It did not guarantee that a change in a forward link would always be  |
4494a26aa252SDarrick J. Wong|    followed up with the corresponding change to the reverse links.       |
4495a26aa252SDarrick J. Wong|                                                                          |
4496a26aa252SDarrick J. Wong| 2. Referential integrity was not integrated into offline repair.         |
4497a26aa252SDarrick J. Wong|    Checking and repairs were performed on mounted filesystems without    |
4498a26aa252SDarrick J. Wong|    taking any kernel or inode locks to coordinate access.                |
4499a26aa252SDarrick J. Wong|    It is not clear how this actually worked properly.                    |
4500a26aa252SDarrick J. Wong|                                                                          |
4501a26aa252SDarrick J. Wong| 3. The extended attribute did not record the name of the directory entry |
4502a26aa252SDarrick J. Wong|    in the parent, so the SGI parent pointer implementation cannot be     |
4503a26aa252SDarrick J. Wong|    used to reconnect the directory tree.                                 |
4504a26aa252SDarrick J. Wong|                                                                          |
4505a26aa252SDarrick J. Wong| 4. Extended attribute forks only support 65,536 extents, which means     |
4506a26aa252SDarrick J. Wong|    that parent pointer attribute creation is likely to fail at some      |
4507a26aa252SDarrick J. Wong|    point before the maximum file link count is achieved.                 |
4508a26aa252SDarrick J. Wong|                                                                          |
4509a26aa252SDarrick J. Wong| The original parent pointer design was too unstable for something like   |
4510a26aa252SDarrick J. Wong| a file system repair to depend on.                                       |
4511a26aa252SDarrick J. Wong| Allison Henderson, Chandan Babu, and Catherine Hoang are working on a    |
4512a26aa252SDarrick J. Wong| second implementation that solves all shortcomings of the first.         |
4513a26aa252SDarrick J. Wong| During 2022, Allison introduced log intent items to track physical       |
4514a26aa252SDarrick J. Wong| manipulations of the extended attribute structures.                      |
4515a26aa252SDarrick J. Wong| This solves the referential integrity problem by making it possible to   |
4516a26aa252SDarrick J. Wong| commit a dirent update and a parent pointer update in the same           |
4517a26aa252SDarrick J. Wong| transaction.                                                             |
4518a26aa252SDarrick J. Wong| Chandan increased the maximum extent counts of both data and attribute   |
4519a26aa252SDarrick J. Wong| forks, thereby ensuring that the extended attribute structure can grow   |
4520a26aa252SDarrick J. Wong| to handle the maximum hardlink count of any file.                        |
4521a26aa252SDarrick J. Wong+--------------------------------------------------------------------------+
4522a26aa252SDarrick J. Wong
4523a26aa252SDarrick J. WongCase Study: Repairing Directories with Parent Pointers
4524a26aa252SDarrick J. Wong^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
4525a26aa252SDarrick J. Wong
4526a26aa252SDarrick J. WongDirectory rebuilding uses a :ref:`coordinated inode scan <iscan>` and
4527a26aa252SDarrick J. Wonga :ref:`directory entry live update hook <liveupdate>` as follows:
4528a26aa252SDarrick J. Wong
4529a26aa252SDarrick J. Wong1. Set up a temporary directory for generating the new directory structure,
4530a26aa252SDarrick J. Wong   an xfblob for storing entry names, and an xfarray for stashing directory
4531a26aa252SDarrick J. Wong   updates.
4532a26aa252SDarrick J. Wong
4533a26aa252SDarrick J. Wong2. Set up an inode scanner and hook into the directory entry code to receive
4534a26aa252SDarrick J. Wong   updates on directory operations.
4535a26aa252SDarrick J. Wong
4536a26aa252SDarrick J. Wong3. For each parent pointer found in each file scanned, decide if the parent
4537a26aa252SDarrick J. Wong   pointer references the directory of interest.
4538a26aa252SDarrick J. Wong   If so:
4539a26aa252SDarrick J. Wong
4540a26aa252SDarrick J. Wong   a. Stash an addname entry for this dirent in the xfarray for later.
4541a26aa252SDarrick J. Wong
4542a26aa252SDarrick J. Wong   b. When finished scanning that file, flush the stashed updates to the
4543a26aa252SDarrick J. Wong      temporary directory.
4544a26aa252SDarrick J. Wong
4545a26aa252SDarrick J. Wong4. For each live directory update received via the hook, decide if the child
4546a26aa252SDarrick J. Wong   has already been scanned.
4547a26aa252SDarrick J. Wong   If so:
4548a26aa252SDarrick J. Wong
4549a26aa252SDarrick J. Wong   a. Stash an addname or removename entry for this dirent update in the
4550a26aa252SDarrick J. Wong      xfarray for later.
4551a26aa252SDarrick J. Wong      We cannot write directly to the temporary directory because hook
4552a26aa252SDarrick J. Wong      functions are not allowed to modify filesystem metadata.
4553a26aa252SDarrick J. Wong      Instead, we stash updates in the xfarray and rely on the scanner thread
4554a26aa252SDarrick J. Wong      to apply the stashed updates to the temporary directory.
4555a26aa252SDarrick J. Wong
4556a26aa252SDarrick J. Wong5. When the scan is complete, atomically swap the contents of the temporary
4557a26aa252SDarrick J. Wong   directory and the directory being repaired.
4558a26aa252SDarrick J. Wong   The temporary directory now contains the damaged directory structure.
4559a26aa252SDarrick J. Wong
4560a26aa252SDarrick J. Wong6. Reap the temporary directory.
4561a26aa252SDarrick J. Wong
4562a26aa252SDarrick J. Wong7. Update the dirent position field of parent pointers as necessary.
4563a26aa252SDarrick J. Wong   This may require the queuing of a substantial number of xattr log intent
4564a26aa252SDarrick J. Wong   items.
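
The interplay between the scan and the live update hook in steps 3 and 4 can
be illustrated with a toy model.
Everything here is hypothetical: a dict stands in for the temporary directory,
a list for the xfarray, and the hook is assumed to be told which directory an
update applies to.
Locking, the atomic swap, and the reaping step are not modelled.

```python
class DirRebuild:
    """Toy model of rebuilding one directory from parent pointers."""

    def __init__(self, dir_inum):
        self.dir_inum = dir_inum
        self.scanned = set()        # children the scan has visited
        self.stash = []             # pending ("add"/"remove", name, child)
        self.tempdir = {}           # name -> child inumber

    def scan_file(self, child_inum, parent_pointers):
        """Step 3: stash an addname for each pointer to our directory."""
        for parent_inum, name in parent_pointers:
            if parent_inum == self.dir_inum:
                self.stash.append(("add", name, child_inum))
        self.scanned.add(child_inum)
        self.flush()

    def hook(self, op, parent_inum, name, child_inum):
        """Step 4: live update; only stash, never touch tempdir here."""
        if parent_inum == self.dir_inum and child_inum in self.scanned:
            self.stash.append((op, name, child_inum))

    def flush(self):
        """The scanner applies the stashed updates in arrival order."""
        for op, name, child_inum in self.stash:
            if op == "add":
                self.tempdir[name] = child_inum
            else:
                self.tempdir.pop(name, None)
        self.stash.clear()
```

Updates for children that have not been scanned yet are dropped by the hook
because the scan will observe their final state when it reaches them.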
4565a26aa252SDarrick J. Wong
4566a26aa252SDarrick J. WongThe proposed patchset is the
4567a26aa252SDarrick J. Wong`parent pointers directory repair
4568a26aa252SDarrick J. Wong<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=pptrs-online-dir-repair>`_
4569a26aa252SDarrick J. Wongseries.
4570a26aa252SDarrick J. Wong
4571a26aa252SDarrick J. Wong**Unresolved Question**: How will repair ensure that the ``dirent_pos`` fields
4572a26aa252SDarrick J. Wongmatch in the reconstructed directory?
4573a26aa252SDarrick J. Wong
4574a26aa252SDarrick J. Wong*Answer*: There are a few ways to solve this problem:
4575a26aa252SDarrick J. Wong
4576a26aa252SDarrick J. Wong1. The field could be designated advisory, since the other three values are
4577a26aa252SDarrick J. Wong   sufficient to find the entry in the parent.
4578a26aa252SDarrick J. Wong   However, this makes indexed key lookup impossible while repairs are ongoing.
4579a26aa252SDarrick J. Wong
4580a26aa252SDarrick J. Wong2. We could allow creating directory entries at specified offsets, which solves
4581a26aa252SDarrick J. Wong   the referential integrity problem but runs the risk that dirent creation
4582a26aa252SDarrick J. Wong   will fail due to conflicts with the free space in the directory.
4583a26aa252SDarrick J. Wong
4584a26aa252SDarrick J. Wong   These conflicts could be resolved by appending the directory entry and
4585a26aa252SDarrick J. Wong   amending the xattr code to support updating an xattr key and reindexing the
4586a26aa252SDarrick J. Wong   dabtree, though this would have to be performed with the parent directory
4587a26aa252SDarrick J. Wong   still locked.
4588a26aa252SDarrick J. Wong
4589a26aa252SDarrick J. Wong3. Same as above, but remove the old parent pointer entry and add a new one
4590a26aa252SDarrick J. Wong   atomically.
4591a26aa252SDarrick J. Wong
4592a26aa252SDarrick J. Wong4. Change the ondisk xattr format to ``(parent_inum, name) → (parent_gen)``,
4593a26aa252SDarrick J. Wong   which would provide the attr name uniqueness that we require, without
4594a26aa252SDarrick J. Wong   forcing repair code to update the dirent position.
4595a26aa252SDarrick J. Wong   Unfortunately, this requires changes to the xattr code to support attr
4596a26aa252SDarrick J. Wong   names as long as 263 bytes.
4597a26aa252SDarrick J. Wong
4598a26aa252SDarrick J. Wong5. Change the ondisk xattr format to ``(parent_inum, hash(name)) →
4599a26aa252SDarrick J. Wong   (name, parent_gen)``.
4600a26aa252SDarrick J. Wong   If the hash is sufficiently resistant to collisions (e.g. sha256) then
4601a26aa252SDarrick J. Wong   this should provide the attr name uniqueness that we require.
4602a26aa252SDarrick J. Wong   Names shorter than 247 bytes could be stored directly.
4603a26aa252SDarrick J. Wong
4604a26aa252SDarrick J. WongDiscussion is ongoing under the `parent pointers patch deluge
4605a26aa252SDarrick J. Wong<https://www.spinics.net/lists/linux-xfs/msg69397.html>`_.
4606a26aa252SDarrick J. Wong
4607a26aa252SDarrick J. WongCase Study: Repairing Parent Pointers
4608a26aa252SDarrick J. Wong^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
4609a26aa252SDarrick J. Wong
4610a26aa252SDarrick J. WongOnline reconstruction of a file's parent pointer information works similarly to
4611a26aa252SDarrick J. Wongdirectory reconstruction:
4612a26aa252SDarrick J. Wong
4613a26aa252SDarrick J. Wong1. Set up a temporary file for generating a new extended attribute structure,
   an xfblob for storing parent pointer names, and an xfarray for
   stashing parent pointer updates.
4616a26aa252SDarrick J. Wong
4617a26aa252SDarrick J. Wong2. Set up an inode scanner and hook into the directory entry code to receive
4618a26aa252SDarrick J. Wong   updates on directory operations.
4619a26aa252SDarrick J. Wong
4620a26aa252SDarrick J. Wong3. For each directory entry found in each directory scanned, decide if the
4621a26aa252SDarrick J. Wong   dirent references the file of interest.
4622a26aa252SDarrick J. Wong   If so:
4623a26aa252SDarrick J. Wong
4624a26aa252SDarrick J. Wong   a. Stash an addpptr entry for this parent pointer in the xfblob and xfarray
4625a26aa252SDarrick J. Wong      for later.
4626a26aa252SDarrick J. Wong
4627a26aa252SDarrick J. Wong   b. When finished scanning the directory, flush the stashed updates to the
      temporary file.
4629a26aa252SDarrick J. Wong
4630a26aa252SDarrick J. Wong4. For each live directory update received via the hook, decide if the parent
4631a26aa252SDarrick J. Wong   has already been scanned.
4632a26aa252SDarrick J. Wong   If so:
4633a26aa252SDarrick J. Wong
4634a26aa252SDarrick J. Wong   a. Stash an addpptr or removepptr entry for this dirent update in the
4635a26aa252SDarrick J. Wong      xfarray for later.
4636a26aa252SDarrick J. Wong      We cannot write parent pointers directly to the temporary file because
4637a26aa252SDarrick J. Wong      hook functions are not allowed to modify filesystem metadata.
4638a26aa252SDarrick J. Wong      Instead, we stash updates in the xfarray and rely on the scanner thread
4639a26aa252SDarrick J. Wong      to apply the stashed parent pointer updates to the temporary file.
4640a26aa252SDarrick J. Wong
4641a26aa252SDarrick J. Wong5. Copy all non-parent pointer extended attributes to the temporary file.
4642a26aa252SDarrick J. Wong
4643a26aa252SDarrick J. Wong6. When the scan is complete, atomically swap the attribute fork of the
4644a26aa252SDarrick J. Wong   temporary file and the file being repaired.
4645a26aa252SDarrick J. Wong   The temporary file now contains the damaged extended attribute structure.
4646a26aa252SDarrick J. Wong
4647a26aa252SDarrick J. Wong7. Reap the temporary file.
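
As a rough illustration of what the temporary file ends up containing, the
sketch below derives parent pointers from directory entries (step 3) and
combines them with the untouched extended attributes (step 5).
The data model and function names are hypothetical, and the live update hook
of step 4 is deliberately left out.

```python
def rebuild_parent_pointers(file_inum, directories):
    """Step 3: derive parent pointers from every dirent naming file_inum.

    directories: parent inumber -> (generation, {dirent_pos: (name, child)})
    """
    pptrs = {}
    for parent_inum, (parent_gen, entries) in directories.items():
        for pos, (name, child) in entries.items():
            if child == file_inum:
                pptrs[(parent_inum, parent_gen, pos)] = name
    return pptrs


def new_attr_fork(file_inum, directories, other_xattrs):
    """Steps 5-6: combine rebuilt pointers with the non-pointer xattrs."""
    return {
        "pptr": rebuild_parent_pointers(file_inum, directories),
        "xattr": dict(other_xattrs),        # copied unchanged
    }
```

A file with several hard links simply ends up with one rebuilt parent pointer
per directory entry that references it.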
4648a26aa252SDarrick J. Wong
4649a26aa252SDarrick J. WongThe proposed patchset is the
4650a26aa252SDarrick J. Wong`parent pointers repair
4651a26aa252SDarrick J. Wong<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=pptrs-online-parent-repair>`_
4652a26aa252SDarrick J. Wongseries.
4653a26aa252SDarrick J. Wong
4654a26aa252SDarrick J. WongDigression: Offline Checking of Parent Pointers
4655a26aa252SDarrick J. Wong^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
4656a26aa252SDarrick J. Wong
4657a26aa252SDarrick J. WongExamining parent pointers in offline repair works differently because corrupt
4658a26aa252SDarrick J. Wongfiles are erased long before directory tree connectivity checks are performed.
4659a26aa252SDarrick J. WongParent pointer checks are therefore a second pass to be added to the existing
4660a26aa252SDarrick J. Wongconnectivity checks:
4661a26aa252SDarrick J. Wong
4662a26aa252SDarrick J. Wong1. After the set of surviving files has been established (i.e. phase 6),
4663a26aa252SDarrick J. Wong   walk the surviving directories of each AG in the filesystem.
4664a26aa252SDarrick J. Wong   This is already performed as part of the connectivity checks.
4665a26aa252SDarrick J. Wong
4666a26aa252SDarrick J. Wong2. For each directory entry found, record the name in an xfblob, and store
4667a26aa252SDarrick J. Wong   ``(child_ag_inum, parent_inum, parent_gen, dirent_pos)`` tuples in a
4668a26aa252SDarrick J. Wong   per-AG in-memory slab.
4669a26aa252SDarrick J. Wong
4670a26aa252SDarrick J. Wong3. For each AG in the filesystem,
4671a26aa252SDarrick J. Wong
4672a26aa252SDarrick J. Wong   a. Sort the per-AG tuples in order of child_ag_inum, parent_inum, and
4673a26aa252SDarrick J. Wong      dirent_pos.
4674a26aa252SDarrick J. Wong
4675a26aa252SDarrick J. Wong   b. For each inode in the AG,
4676a26aa252SDarrick J. Wong
4677a26aa252SDarrick J. Wong      1. Scan the inode for parent pointers.
4678a26aa252SDarrick J. Wong         Record the names in a per-file xfblob, and store ``(parent_inum,
4679a26aa252SDarrick J. Wong         parent_gen, dirent_pos)`` tuples in a per-file slab.
4680a26aa252SDarrick J. Wong
      2. Sort the per-file tuples in order of parent_inum and dirent_pos.
4682a26aa252SDarrick J. Wong
4683a26aa252SDarrick J. Wong      3. Position one slab cursor at the start of the inode's records in the
4684a26aa252SDarrick J. Wong         per-AG tuple slab.
4685a26aa252SDarrick J. Wong         This should be trivial since the per-AG tuples are in child inumber
4686a26aa252SDarrick J. Wong         order.
4687a26aa252SDarrick J. Wong
4688a26aa252SDarrick J. Wong      4. Position a second slab cursor at the start of the per-file tuple slab.
4689a26aa252SDarrick J. Wong
      5. Iterate the two cursors in lockstep, comparing the parent_inum and
4691a26aa252SDarrick J. Wong         dirent_pos fields of the records under each cursor.
4692a26aa252SDarrick J. Wong
4693a26aa252SDarrick J. Wong         a. Tuples in the per-AG list but not the per-file list are missing and
4694a26aa252SDarrick J. Wong            need to be written to the inode.
4695a26aa252SDarrick J. Wong
4696a26aa252SDarrick J. Wong         b. Tuples in the per-file list but not the per-AG list are dangling
4697a26aa252SDarrick J. Wong            and need to be removed from the inode.
4698a26aa252SDarrick J. Wong
4699a26aa252SDarrick J. Wong         c. For tuples in both lists, update the parent_gen and name components
4700a26aa252SDarrick J. Wong            of the parent pointer if necessary.
4701a26aa252SDarrick J. Wong
4702a26aa252SDarrick J. Wong4. Move on to examining link counts, as we do today.
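
The lockstep comparison in step 3.b.5 is an ordinary merge of two sorted
lists.
The sketch below shows one way it might look; plain Python lists stand in for
the slabs, names are carried inline instead of in an xfblob, and the function
name and tuple layout are assumptions made for the example.

```python
def compare_pptrs(expected, found):
    """Walk two sorted tuple lists in lockstep.

    expected: sorted [(parent_inum, dirent_pos, parent_gen, name)] built from
              the directory entries that point at this file
    found:    sorted list of the same shape read from the file itself
    Returns (missing, dangling, stale).
    """
    missing, dangling, stale = [], [], []
    i = j = 0
    while i < len(expected) and j < len(found):
        e, f = expected[i], found[j]
        if e[:2] < f[:2]:           # only a dirent exists: pointer missing
            missing.append(e)
            i += 1
        elif e[:2] > f[:2]:         # only a pointer exists: it is dangling
            dangling.append(f)
            j += 1
        else:                       # both exist: check gen and name
            if e[2:] != f[2:]:
                stale.append(e)
            i += 1
            j += 1
    missing.extend(expected[i:])
    dangling.extend(found[j:])
    return missing, dangling, stale
```

The three returned lists correspond to cases a, b, and c respectively: entries
to write to the inode, entries to remove from it, and entries whose
``parent_gen`` or name must be updated.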
4703a26aa252SDarrick J. Wong
4704a26aa252SDarrick J. WongThe proposed patchset is the
4705a26aa252SDarrick J. Wong`offline parent pointers repair
4706a26aa252SDarrick J. Wong<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=pptrs-repair>`_
4707a26aa252SDarrick J. Wongseries.
4708a26aa252SDarrick J. Wong
4709a26aa252SDarrick J. WongRebuilding directories from parent pointers in offline repair is very
4710a26aa252SDarrick J. Wongchallenging because it currently uses a single-pass scan of the filesystem
4711a26aa252SDarrick J. Wongduring phase 3 to decide which files are corrupt enough to be zapped.
4712a26aa252SDarrick J. WongThis scan would have to be converted into a multi-pass scan:
4713a26aa252SDarrick J. Wong
4714a26aa252SDarrick J. Wong1. The first pass of the scan zaps corrupt inodes, forks, and attributes
4715a26aa252SDarrick J. Wong   much as it does now.
4716a26aa252SDarrick J. Wong   Corrupt directories are noted but not zapped.
4717a26aa252SDarrick J. Wong
4718a26aa252SDarrick J. Wong2. The next pass records parent pointers pointing to the directories noted
4719a26aa252SDarrick J. Wong   as being corrupt in the first pass.
4720a26aa252SDarrick J. Wong   This second pass may have to happen after the phase 4 scan for duplicate
4721a26aa252SDarrick J. Wong   blocks, if phase 4 is also capable of zapping directories.
4722a26aa252SDarrick J. Wong
4723a26aa252SDarrick J. Wong3. The third pass resets corrupt directories to an empty shortform directory.
4724a26aa252SDarrick J. Wong   Free space metadata has not been ensured yet, so repair cannot yet use the
4725a26aa252SDarrick J. Wong   directory building code in libxfs.
4726a26aa252SDarrick J. Wong
4727a26aa252SDarrick J. Wong4. At the start of phase 6, space metadata have been rebuilt.
4728a26aa252SDarrick J. Wong   Use the parent pointer information recorded during step 2 to reconstruct
4729a26aa252SDarrick J. Wong   the dirents and add them to the now-empty directories.
4730a26aa252SDarrick J. Wong
4731a26aa252SDarrick J. WongThis code has not yet been constructed.
4732a26aa252SDarrick J. Wong
4733a26aa252SDarrick J. Wong.. _orphanage:
4734a26aa252SDarrick J. Wong
4735a26aa252SDarrick J. WongThe Orphanage
4736a26aa252SDarrick J. Wong-------------
4737a26aa252SDarrick J. Wong
4738a26aa252SDarrick J. WongFilesystems present files as a directed, and hopefully acyclic, graph.
4739a26aa252SDarrick J. WongIn other words, a tree.
4740a26aa252SDarrick J. WongThe root of the filesystem is a directory, and each entry in a directory points
4741a26aa252SDarrick J. Wongdownwards either to more subdirectories or to non-directory files.
Unfortunately, a disruption in the directory graph pointers results in a
4743a26aa252SDarrick J. Wongdisconnected graph, which makes files impossible to access via regular path
4744a26aa252SDarrick J. Wongresolution.
4745a26aa252SDarrick J. Wong
4746a26aa252SDarrick J. WongWithout parent pointers, the directory parent pointer online scrub code can
4747a26aa252SDarrick J. Wongdetect a dotdot entry pointing to a parent directory that doesn't have a link
4748a26aa252SDarrick J. Wongback to the child directory and the file link count checker can detect a file
4749a26aa252SDarrick J. Wongthat isn't pointed to by any directory in the filesystem.
4750a26aa252SDarrick J. WongIf such a file has a positive link count, the file is an orphan.
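
The definition of an orphan can be stated as a short function.
This is a hypothetical model only: ``link_counts`` and ``directories`` are
in-memory stand-ins for inode and directory metadata, and the root directory
is excluded explicitly because no other directory is expected to point to it.

```python
def find_orphans(link_counts, directories, root_inum):
    """Return inumbers with a positive link count and no referencing dirent.

    link_counts: dict of inumber -> link count stored in the inode
    directories: dict of directory inumber -> {name: child inumber}
    """
    referenced = {root_inum}
    for entries in directories.values():
        for name, child in entries.items():
            if name not in (".", ".."):
                referenced.add(child)
    return sorted(inum for inum, nlink in link_counts.items()
                  if nlink > 0 and inum not in referenced)
```

Files with a link count of zero are not reported, since they are not expected
to be reachable from the directory tree in the first place.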
4751a26aa252SDarrick J. Wong
4752a26aa252SDarrick J. WongWith parent pointers, directories can be rebuilt by scanning parent pointers
4753a26aa252SDarrick J. Wongand parent pointers can be rebuilt by scanning directories.
4754a26aa252SDarrick J. WongThis should reduce the incidence of files ending up in ``/lost+found``.
4755a26aa252SDarrick J. Wong
4756a26aa252SDarrick J. WongWhen orphans are found, they should be reconnected to the directory tree.
4757a26aa252SDarrick J. WongOffline fsck solves the problem by creating a directory ``/lost+found`` to
4758a26aa252SDarrick J. Wongserve as an orphanage, and linking orphan files into the orphanage by using the
4759a26aa252SDarrick J. Wonginumber as the name.
4760a26aa252SDarrick J. WongReparenting a file to the orphanage does not reset any of its permissions or
4761a26aa252SDarrick J. WongACLs.
4762a26aa252SDarrick J. Wong
4763a26aa252SDarrick J. WongThis process is more involved in the kernel than it is in userspace.
4764a26aa252SDarrick J. WongThe directory and file link count repair setup functions must use the regular
4765a26aa252SDarrick J. WongVFS mechanisms to create the orphanage directory with all the necessary
4766a26aa252SDarrick J. Wongsecurity attributes and dentry cache entries, just like a regular directory
4767a26aa252SDarrick J. Wongtree modification.
4768a26aa252SDarrick J. Wong
4769a26aa252SDarrick J. WongOrphaned files are adopted by the orphanage as follows:
4770a26aa252SDarrick J. Wong
4771a26aa252SDarrick J. Wong1. Call ``xrep_orphanage_try_create`` at the start of the scrub setup function
4772a26aa252SDarrick J. Wong   to try to ensure that the lost and found directory actually exists.
4773a26aa252SDarrick J. Wong   This also attaches the orphanage directory to the scrub context.
4774a26aa252SDarrick J. Wong
4775a26aa252SDarrick J. Wong2. If the decision is made to reconnect a file, take the IOLOCK of both the
4776a26aa252SDarrick J. Wong   orphanage and the file being reattached.
4777a26aa252SDarrick J. Wong   The ``xrep_orphanage_iolock_two`` function follows the inode locking
4778a26aa252SDarrick J. Wong   strategy discussed earlier.
4779a26aa252SDarrick J. Wong
4780a26aa252SDarrick J. Wong3. Call ``xrep_orphanage_compute_blkres`` and ``xrep_orphanage_compute_name``
4781a26aa252SDarrick J. Wong   to compute the new name in the orphanage and the block reservation required.
4782a26aa252SDarrick J. Wong
4783a26aa252SDarrick J. Wong4. Use ``xrep_orphanage_adoption_prep`` to reserve resources to the repair
4784a26aa252SDarrick J. Wong   transaction.
4785a26aa252SDarrick J. Wong
4786a26aa252SDarrick J. Wong5. Call ``xrep_orphanage_adopt`` to reparent the orphaned file into the lost
4787a26aa252SDarrick J. Wong   and found, and update the kernel dentry cache.
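
The naming part of step 3 can be illustrated with a small userspace model.
This is not the kernel's ``xrep_orphanage_compute_name``; it only assumes,
following the offline fsck convention described above, that the inumber serves
as the name.
The ``.N`` suffix used to resolve collisions is a hypothetical choice made for
this sketch.

```python
def orphanage_name(inumber, existing_names):
    """Pick a name for an orphan inside the orphanage directory.

    Assumption: the base name is the decimal inode number, as offline fsck
    does.  The ".N" collision suffix is hypothetical.
    """
    base = str(inumber)
    if base not in existing_names:
        return base
    suffix = 1
    while f"{base}.{suffix}" in existing_names:
        suffix += 1
    return f"{base}.{suffix}"
```

For example, ``orphanage_name(133, {"133"})`` yields ``133.1`` in this model.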

The proposed patches are in the
`orphanage adoption
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-orphanage>`_
series.

6. Userspace Algorithms and Data Structures
===========================================

This section discusses the key algorithms and data structures of the userspace
program, ``xfs_scrub``, that provide the ability to drive metadata checks and
repairs in the kernel, verify file data, and look for other potential problems.

.. _scrubcheck:

Checking Metadata
-----------------

Recall the :ref:`phases of fsck work<scrubphases>` outlined earlier.
That structure follows naturally from the data dependencies designed into the
filesystem from its beginnings in 1993.
In XFS, there are several groups of metadata dependencies:

a. Filesystem summary counts depend on consistency within the inode indices,
   the allocation group space btrees, and the realtime volume space
   information.

b. Quota resource counts depend on consistency within the quota file data
   forks, inode indices, inode records, and the forks of every file on the
   system.

c. The naming hierarchy depends on consistency within the directory and
   extended attribute structures.
   This includes file link counts.

d. Directories, extended attributes, and file data depend on consistency within
   the file forks that map directory and extended attribute data to physical
   storage media.

e. The file forks depend on consistency within inode records and the space
   metadata indices of the allocation groups and the realtime volume.
   This includes quota and realtime metadata files.

f. Inode records depend on consistency within the inode metadata indices.

g. Realtime space metadata depend on the inode records and data forks of the
   realtime metadata inodes.

h. The allocation group metadata indices (free space, inodes, reference count,
   and reverse mapping btrees) depend on consistency within the AG headers and
   between all the AG metadata btrees.

i. ``xfs_scrub`` depends on the filesystem being mounted and kernel support
   for online fsck functionality.

Therefore, a metadata dependency graph is a convenient way to schedule checking
operations in the ``xfs_scrub`` program:

- Phase 1 checks that the provided path maps to an XFS filesystem and detects
  the kernel's scrubbing abilities, which validates group (i).

- Phase 2 scrubs groups (g) and (h) in parallel using a threaded workqueue.

- Phase 3 scans inodes in parallel.
  For each inode, groups (f), (e), and (d) are checked, in that order.

- Phase 4 repairs everything in groups (i) through (d) so that phases 5 and 6
  may run reliably.

- Phase 5 starts by checking groups (b) and (c) in parallel before moving on
  to checking names.

- Phase 6 depends on groups (i) through (b) to find file data blocks to verify,
  to read them, and to report which blocks of which files are affected.

- Phase 7 checks group (a), having validated everything else.

Notice that the data dependencies between groups are enforced by the structure
of the program flow.
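
The relationship between the phases and the dependency groups can be checked
mechanically.
The sketch below is a toy model, not code from ``xfs_scrub``: the
``DEPENDS_ON`` table is a simplified reading of the list above, and the helper
names are made up for illustration.

```python
# Groups checked by each phase, in the order given in the list above.
PHASE_CHECKS = {1: "i", 2: "gh", 3: "fed", 5: "bc", 7: "a"}

# Simplified reading of the dependency list.  Group (g) also depends on the
# records and forks of the realtime metadata inodes; that edge is left out
# because it would make this toy graph cyclic.
DEPENDS_ON = {
    "a": "gh", "b": "efh", "c": "d", "d": "e",
    "e": "fgh", "f": "h", "g": "i", "h": "i", "i": "",
}

def check_order():
    """Flatten the phases into the order in which groups are checked."""
    return [g for phase in sorted(PHASE_CHECKS) for g in PHASE_CHECKS[phase]]

def violations():
    """List (group, dependency) pairs where the dependency is checked later."""
    order = check_order()
    pos = {g: n for n, g in enumerate(order)}
    return [(g, d) for g in order for d in DEPENDS_ON[g] if pos[d] > pos[g]]
```

In this model ``violations()`` returns an empty list, meaning that no group is
checked before the groups it relies upon.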

Parallel Inode Scans
--------------------

An XFS filesystem can easily contain hundreds of millions of inodes.
Given that XFS targets installations with large high-performance storage,
it is desirable to scrub inodes in parallel to minimize runtime, particularly
if the program has been invoked manually from a command line.
This requires careful scheduling to keep the threads as evenly loaded as
possible.

Early iterations of the ``xfs_scrub`` inode scanner naïvely created a single
workqueue and scheduled a single workqueue item per AG.
Each workqueue item walked the inode btree (with ``XFS_IOC_INUMBERS``) to find
inode chunks and then called bulkstat (``XFS_IOC_BULKSTAT``) to gather enough
information to construct file handles.
The file handle was then passed to a function to generate scrub items for each
metadata object of each inode.
This simple algorithm leads to thread balancing problems in phase 3 if the
filesystem contains one AG with a few large sparse files and the rest of the
AGs contain many smaller files.
The inode scan dispatch function was not sufficiently granular; it should have
been dispatching at the level of individual inodes, or, to constrain memory
consumption, inode btree records.

Thanks to Dave Chinner, bounded workqueues in userspace enable ``xfs_scrub`` to
avoid this problem with ease by adding a second workqueue.
Just like before, the first workqueue is seeded with one workqueue item per AG,
and it uses INUMBERS to find inode btree chunks.
The second workqueue, however, is configured with an upper bound on the number
of items that can be waiting to be run.
Each inode btree chunk found by the first workqueue's workers is queued to the
second workqueue, and it is this second workqueue that queries BULKSTAT,
creates a file handle, and passes it to a function to generate scrub items for
each metadata object of each inode.
If the second workqueue is too full, the workqueue add function blocks the
first workqueue's workers until the backlog eases.
This doesn't completely solve the balancing problem, but reduces it enough to
move on to more pressing issues.
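
The two-stage arrangement can be modelled with a bounded queue.
The following is a minimal sketch and not the ``xfs_scrub`` workqueue code:
Python threads stand in for workqueue workers, the ``ags`` dictionary stands in
for the results of INUMBERS, and appending to a list stands in for BULKSTAT and
scrub item generation.

```python
import queue
import threading

def scan_inodes(ags, nr_workers=4, max_backlog=8):
    """ags maps AG number -> list of inode chunks (each a list of inumbers)."""
    chunks = queue.Queue(maxsize=max_backlog)   # bounded second workqueue
    scanned = []
    lock = threading.Lock()

    def walk_ag(agno):
        # First stage: one item per AG, standing in for the INUMBERS walk.
        for chunk in ags[agno]:
            chunks.put(chunk)      # blocks while the backlog is full

    def scan_chunks():
        # Second stage: stand-in for BULKSTAT plus per-inode scrub items.
        while True:
            chunk = chunks.get()
            if chunk is None:      # sentinel: no more work
                return
            with lock:
                scanned.extend(chunk)

    workers = [threading.Thread(target=scan_chunks) for _ in range(nr_workers)]
    walkers = [threading.Thread(target=walk_ag, args=(agno,)) for agno in ags]
    for t in workers + walkers:
        t.start()
    for t in walkers:
        t.join()
    for _ in workers:
        chunks.put(None)           # one sentinel per second-stage worker
    for t in workers:
        t.join()
    return sorted(scanned)
```

Because ``chunks.put`` blocks when the queue is full, an AG walker that runs
ahead of the scanners is throttled, which is the back-pressure behaviour
described above.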

The proposed patchsets are the scrub
`performance tweaks
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-performance-tweaks>`_
and the
`inode scan rebalance
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-iscan-rebalance>`_
series.

.. _scrubrepair:

Scheduling Repairs
------------------

During phase 2, corruptions and inconsistencies reported in any AGI header or
inode btree are repaired immediately, because phase 3 relies on proper
functioning of the inode indices to find inodes to scan.
Failed repairs are rescheduled to phase 4.
Problems reported in any other space metadata are deferred to phase 4.
Optimization opportunities are always deferred to phase 4, no matter their
origin.

During phase 3, corruptions and inconsistencies reported in any part of a
file's metadata are repaired immediately if all space metadata were validated
during phase 2.
Repairs that fail or cannot be performed immediately are scheduled for phase 4.
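
These rules amount to a small decision function.
The sketch below restates them in code; the function name, the string labels
for ``target``, and the parameter names are hypothetical and do not appear in
``xfs_scrub``.

```python
IMMEDIATE, DEFERRED = "repair now", "defer to phase 4"

def schedule_repair(phase, target, is_optimization=False,
                    space_metadata_ok=True, already_failed=False):
    """Decide when a problem found in phase 2 or 3 gets repaired.

    target is a hypothetical label: "agi", "inobt", "other_space", or "file".
    """
    # Optimizations and failed repairs always wait for phase 4.
    if is_optimization or already_failed:
        return DEFERRED
    if phase == 2:
        # Phase 3 needs working inode indices, so fix those right away.
        return IMMEDIATE if target in ("agi", "inobt") else DEFERRED
    if phase == 3 and target == "file":
        return IMMEDIATE if space_metadata_ok else DEFERRED
    raise ValueError("only phases 2 and 3 schedule repairs this way")
```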

In the original design of ``xfs_scrub``, it was thought that repairs would be
so infrequent that the ``struct xfs_scrub_metadata`` objects used to
communicate with the kernel could also be used as the primary object to
schedule repairs.
With recent increases in the number of optimizations possible for a given
filesystem object, it became much more memory-efficient to track all eligible
repairs for a given filesystem object with a single repair item.
Each repair item represents a single lockable object -- AGs, metadata files,
individual inodes, or a class of summary information.

Phase 4 is responsible for scheduling a lot of repair work in as quick a
manner as is practical.
The :ref:`data dependencies <scrubcheck>` outlined earlier still apply, which
means that ``xfs_scrub`` must try to complete the repair work scheduled by
phase 2 before trying repair work scheduled by phase 3.
The repair process is as follows:

1. Start a round of repair with a workqueue and enough workers to keep the CPUs
   as busy as the user desires.

   a. For each repair item queued by phase 2,

      i.   Ask the kernel to repair everything listed in the repair item for a
           given filesystem object.

      ii.  Make a note if the kernel made any progress in reducing the number
           of repairs needed for this object.

      iii. If the object no longer requires repairs, revalidate all metadata
           associated with this object.
           If the revalidation succeeds, drop the repair item.
           If not, requeue the item for more repairs.

   b. If any repairs were made, jump back to 1a to retry all the phase 2 items.

   c. For each repair item queued by phase 3,

      i.   Ask the kernel to repair everything listed in the repair item for a
           given filesystem object.

      ii.  Make a note if the kernel made any progress in reducing the number
           of repairs needed for this object.

      iii. If the object no longer requires repairs, revalidate all metadata
           associated with this object.
           If the revalidation succeeds, drop the repair item.
           If not, requeue the item for more repairs.

   d. If any repairs were made, jump back to 1c to retry all the phase 3 items.

2. If step 1 made any repair progress of any kind, jump back to step 1 to start
   another round of repair.

3. If there are items left to repair, run them all serially one more time.
   Complain if the repairs were not successful, since this is the last chance
   to repair anything.
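
The control flow of these steps can be sketched as follows.
This is a simplified, single-threaded model rather than the ``xfs_scrub``
implementation: ``RepairItem``, ``run_queue``, and ``phase4`` are illustrative
names, and the ``kernel_repair`` and ``revalidate`` callbacks are stand-ins for
the calls into the kernel.

```python
from dataclasses import dataclass, field

@dataclass
class RepairItem:
    obj: str                                   # one lockable filesystem object
    needed: set = field(default_factory=set)   # repairs still outstanding

def run_queue(items, kernel_repair, revalidate):
    """Steps 1a-1b (or 1c-1d): retry one queue until progress stops."""
    any_progress = False
    while items:
        progress = False
        for item in list(items):
            before = len(item.needed)
            item.needed = kernel_repair(item)          # step i
            progress |= len(item.needed) < before      # step ii
            if not item.needed and revalidate(item):   # step iii
                items.remove(item)
        any_progress |= progress
        if not progress:
            break
    return any_progress

def phase4(phase2_items, phase3_items, kernel_repair, revalidate):
    """Return the objects that could not be repaired."""
    while True:                                        # steps 1 and 2
        progress = run_queue(phase2_items, kernel_repair, revalidate)
        progress |= run_queue(phase3_items, kernel_repair, revalidate)
        if not progress:
            break
    leftovers = []                                     # step 3: last chance
    for item in phase2_items + phase3_items:
        item.needed = kernel_repair(item)
        if item.needed or not revalidate(item):
            leftovers.append(item.obj)
    return leftovers
```

``kernel_repair`` returns the set of repairs that are still outstanding for the
item, so progress is measured by that set shrinking.
Items queued by phase 2 are always attempted before items queued by phase 3,
which preserves the data dependency ordering.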

Corruptions and inconsistencies encountered during phases 5 and 7 are repaired
immediately.
Corrupt file data blocks reported by phase 6 cannot be recovered by the
filesystem.

The proposed patchsets are the
`repair warning improvements
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-better-repair-warnings>`_,
refactoring of the
`repair data dependency
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-repair-data-deps>`_
and
`object tracking
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-object-tracking>`_,
and the
`repair scheduling
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-repair-scheduling>`_
improvement series.

Checking Names for Confusable Unicode Sequences
-----------------------------------------------

If ``xfs_scrub`` succeeds in validating the filesystem metadata by the end of
phase 4, it moves on to phase 5, which checks for suspicious looking names in
the filesystem.
These names consist of the filesystem label, names in directory entries, and
the names of extended attributes.
Like most Unix filesystems, XFS imposes the sparest of constraints on the
contents of a name:

- Slashes and null bytes are not allowed in directory entries.

- Null bytes are not allowed in userspace-visible extended attributes.

- Null bytes are not allowed in the filesystem label.

Directory entries and attribute keys store the length of the name explicitly
ondisk, which means that nulls are not name terminators.
For this section, the term "naming domain" refers to any place where names are
presented together -- all the names in a directory, or all the attributes of a
file.
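
The three constraints listed above are simple enough to state in code.
The helper below is illustrative only; its name and the ``kind`` labels are
assumptions, and it checks nothing beyond the rules in the list, such as name
length limits.

```python
def name_is_storable(name: bytes, kind: str) -> bool:
    """kind is "dirent", "xattr" (userspace-visible), or "label"."""
    if kind not in ("dirent", "xattr", "label"):
        raise ValueError(f"unknown kind: {kind}")
    if b"\0" in name:
        return False               # null bytes are rejected everywhere
    if kind == "dirent" and b"/" in name:
        return False               # slashes are rejected in directory entries
    return True                    # any other byte sequence is accepted
```

Note that a byte sequence that is not valid UTF-8, such as ``b"\xff\xfe"``,
passes this check.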

Although the Unix naming constraints are very permissive, the reality of most
modern-day Linux systems is that programs work with Unicode character code
points to support international languages.
These programs typically encode those code points in UTF-8 when interfacing
with the C library because the kernel expects null-terminated names.
In the common case, therefore, names found in an XFS filesystem are actually
UTF-8 encoded Unicode data.

To maximize its expressiveness, the Unicode standard defines separate code
points for various characters that render similarly or identically in writing
systems around the world.
For example, the character "Cyrillic Small Letter A" U+0430 "а" often renders
identically to "Latin Small Letter A" U+0061 "a".

The standard also permits characters to be constructed in multiple ways --
either by using a defined code point, or by combining one code point with
various combining marks.
For example, the character "Angstrom Sign" U+212B "Å" can also be expressed
as "Latin Capital Letter A" U+0041 "A" followed by "Combining Ring Above"
U+030A "◌̊".
Both sequences render identically.
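
The two examples above behave differently under Unicode normalization, which
can be observed with the ``unicodedata`` module from the Python standard
library.
This snippet is only a demonstration of the Unicode behaviour and is unrelated
to how ``xfs_scrub`` is implemented.

```python
import unicodedata

angstrom_sign = "\u212b"   # Angstrom Sign
combined = "A\u030a"       # Latin Capital Letter A + Combining Ring Above

def nfd(name):
    """Return the canonical decomposition (NFD) of a name."""
    return unicodedata.normalize("NFD", name)

# The two spellings differ as code point sequences but share one NFD form.
same_after_nfd = nfd(angstrom_sign) == nfd(combined)

# Normalization does not unify look-alikes from different scripts.
cyrillic_equals_latin = nfd("\u0430") == nfd("a")
```

Here ``same_after_nfd`` is true while ``cyrillic_equals_latin`` is false, which
is why normalization alone is not enough to catch every pair of names that
render alike.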

Like the standards that preceded it, Unicode also defines various control
characters to alter the presentation of text.
For example, the character "Right-to-Left Override" U+202E can trick some
programs into rendering "moo\\xe2\\x80\\xaegnp.txt" as "mootxt.png".
A second category of rendering problems involves whitespace characters.
If the character "Zero Width Space" U+200B is encountered in a file name, the
name will render identically to a name that does not have the zero width
space.

If two names within a naming domain have different byte sequences but render
identically, a user may be confused by them.
The kernel, in its indifference to upper level encoding schemes, permits this.
Most filesystem drivers persist the byte sequence names that are given to them
by the VFS.

Techniques for detecting confusable names are explained in great detail in
sections 4 and 5 of the
`Unicode Security Mechanisms <https://unicode.org/reports/tr39/>`_
document.
When ``xfs_scrub`` detects UTF-8 encoding in use on a system, it uses the
Unicode normalization form NFD in conjunction with the confusable name
detection component of
`libicu <https://github.com/unicode-org/icu>`_
to identify names within a directory or within a file's extended attributes
that could be confused for each other.
Names are also checked for control characters, non-rendering characters, and
mixing of bidirectional characters.
All of these potential issues are reported to the system administrator during
phase 5.
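
A much reduced version of this check can be written with the Python standard
library alone.
The sketch below is an approximation under stated assumptions: it groups names
by NFD form and flags control and format characters, but it does not implement
the confusable detection of libicu, so cross-script look-alikes go unnoticed.
The function name is made up for this example.

```python
import unicodedata
from collections import defaultdict

def check_naming_domain(names):
    """Return (collisions, suspicious) for the names in one naming domain."""
    by_form = defaultdict(list)
    suspicious = []
    for name in names:
        by_form[unicodedata.normalize("NFD", name)].append(name)
        # "Cc" is the control category; "Cf" is the format category, which
        # covers bidi overrides and the zero width space.
        if any(unicodedata.category(ch) in ("Cc", "Cf") for ch in name):
            suspicious.append(name)
    collisions = [group for group in by_form.values() if len(group) > 1]
    return collisions, suspicious
```

Each entry in ``collisions`` is a group of distinct names that normalize to the
same sequence, and ``suspicious`` lists names such as the "Right-to-Left
Override" example above.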

Media Verification of File Data Extents
---------------------------------------

The system administrator can elect to initiate a media scan of all file data
blocks.
This scan runs after validation of all filesystem metadata (except for the
summary counters) as phase 6.
The scan starts by calling ``FS_IOC_GETFSMAP`` to scan the filesystem space map
to find areas that are allocated to file data fork extents.
Gaps between data fork extents that are smaller than 64k are treated as if
they were data fork extents to reduce the command setup overhead.
When the space map scan accumulates a region larger than 32MB, a media
verification request is sent to the disk as a directio read of the raw block
device.
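
The coalescing step can be sketched as follows.
This is a model of the behaviour described above and not the ``xfs_scrub``
code: extents are plain ``(offset, length)`` tuples in bytes rather than
``FS_IOC_GETFSMAP`` records, and the function name is hypothetical.

```python
GAP_LIMIT = 64 * 1024            # gaps smaller than this are read anyway
REGION_LIMIT = 32 * 1024 * 1024  # issue a read once a region grows past this

def build_read_requests(extents):
    """extents: sorted (offset, length) byte ranges holding file data.

    Returns the (offset, length) verification reads to issue.
    """
    requests = []
    start = end = None
    for offset, length in extents:
        if start is not None and offset - end < GAP_LIMIT:
            end = max(end, offset + length)      # absorb the small gap
        else:
            if start is not None:
                requests.append((start, end - start))
            start, end = offset, offset + length
        if end - start > REGION_LIMIT:
            requests.append((start, end - start))
            start = end = None
    if start is not None:
        requests.append((start, end - start))    # flush the final region
    return requests
```

Two 4k extents separated by a 4k gap therefore become a single 12k read, while
an extent that lies 1MB away is read separately.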

If the verification read fails, ``xfs_scrub`` retries with single-block reads
to narrow down the failure to the specific region of the media, which is then
recorded.
When it has finished issuing verification requests, it again uses the space
mapping ioctl to map the recorded media errors back to metadata structures
and report what has been lost.
For media errors in blocks owned by files, parent pointers can be used to
construct file paths from inode numbers for user-friendly reporting.
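
The retry strategy might look like the following sketch.
The ``read`` callback is a caller-supplied stand-in for a directio read of the
block device, and both the function name and the 4096-byte default block size
are assumptions made for illustration.

```python
def verify_region(read, offset, length, blocksize=4096):
    """Return the byte offsets of blocks that fail a verification read.

    read(offset, length) raises OSError when it hits a media error.
    """
    try:
        read(offset, length)       # fast path: one large read
        return []
    except OSError:
        pass
    bad = []
    # Slow path: reread block by block to isolate the failing blocks.
    for pos in range(offset, offset + length, blocksize):
        try:
            read(pos, min(blocksize, offset + length - pos))
        except OSError:
            bad.append(pos)
    return bad
```

The offsets returned here correspond to the recorded media errors that are
later mapped back to their owners.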

7. Conclusion and Future Work
=============================

It is hoped that the reader of this document has followed the designs laid out
in this document and now has some familiarity with how XFS performs online
rebuilding of its metadata indices, and how filesystem users can interact with
that functionality.
Although the scope of this work is daunting, it is hoped that this guide will
make it easier for code readers to understand what has been built, for whom it
has been built, and why.
Please feel free to contact the XFS mailing list with questions.

FIEXCHANGE_RANGE
----------------

As discussed earlier, a second frontend to the atomic extent swap mechanism is
a new ioctl call that userspace programs can use to commit updates to files
atomically.
This frontend has been out for review for several years now, though the
necessary refinements to online repair and lack of customer demand mean that
the proposal has not been pushed very hard.

Extent Swapping with Regular User Files
```````````````````````````````````````

As mentioned earlier, XFS has long had the ability to swap extents between
files, which is used almost exclusively by ``xfs_fsr`` to defragment files.
The earliest form of this was the fork swap mechanism, where the entire
contents of data forks could be exchanged between two files by exchanging the
raw bytes in each inode fork's immediate area.
When XFS v5 came along with self-describing metadata, this old mechanism grew
some log support to continue rewriting the owner fields of BMBT blocks during
log recovery.
When the reverse mapping btree was later added to XFS, the only way to maintain
the consistency of the fork mappings with the reverse mapping index was to
develop an iterative mechanism that used deferred bmap and rmap operations to
swap mappings one at a time.
This mechanism is identical to steps 2-3 from the procedure above except for
the new tracking items, because the atomic extent swap mechanism is an
iteration of an existing mechanism and not something totally novel.
For the narrow case of file defragmentation, the file contents must be
identical, so the recovery guarantees are not much of a gain.

Atomic extent swapping is much more flexible than the existing swapext
implementations because it can guarantee that the caller never sees a mix of
old and new contents even after a crash, and it can operate on two arbitrary
file fork ranges.
The extra flexibility enables several new use cases:

- **Atomic commit of file writes**: A userspace process opens a file that it
  wants to update.
  Next, it opens a temporary file and calls the file clone operation to reflink
  the first file's contents into the temporary file.
  Writes intended for the original file are directed to the temporary file
  instead.
  Finally, the process calls the atomic extent swap system call
  (``FIEXCHANGE_RANGE``) to exchange the file contents, thereby committing all
  of the updates to the original file, or none of them.

.. _swapext_if_unchanged:

- **Transactional file updates**: The same mechanism as above, but the caller
  only wants the commit to occur if the original file's contents have not
  changed.
  To make this happen, the calling process snapshots the file modification and
  change timestamps of the original file before reflinking its data to the
  temporary file.
  When the program is ready to commit the changes, it passes the timestamps
  into the kernel as arguments to the atomic extent swap system call.
  The kernel only commits the changes if the provided timestamps match the
  original file.

- **Emulation of atomic block device writes**: Export a block device with a
  logical sector size matching the filesystem block size to force all writes
  to be aligned to the filesystem block size.
  Stage all writes to a temporary file, and when that is complete, call the
  atomic extent swap system call with a flag to indicate that holes in the
  temporary file should be ignored.
  This emulates an atomic device write in software, and can support arbitrary
  scattered writes.
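
The snapshot-and-compare logic of the transactional update case can be
illustrated in plain userspace code.
The sketch below does not call ``FIEXCHANGE_RANGE``; it substitutes an ordinary
file copy for the reflink and ``os.replace`` for the extent swap, so unlike the
kernel mechanism the timestamp check is not atomic with the commit.
All function names are hypothetical.

```python
import os
import shutil

def snapshot(path):
    """Record the modification and change timestamps of a file."""
    st = os.stat(path)
    return (st.st_mtime_ns, st.st_ctime_ns)

def stage_copy(original, staged):
    """Stand-in for reflinking the original into a temporary file."""
    shutil.copyfile(original, staged)

def commit_if_unchanged(original, staged, snap):
    """Replace original with staged only if its timestamps still match.

    os.replace() is a userspace stand-in for the exchange call: it swaps in
    a whole new file, and the check can race with concurrent writers.
    """
    if snapshot(original) != snap:
        return False               # someone else changed the file; abort
    os.replace(staged, original)
    return True
```

A caller takes the snapshot first, stages its changes in the temporary file,
and treats a ``False`` return as a signal to start over from the new contents.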

Vectorized Scrub
----------------

As it turns out, the :ref:`refactoring <scrubrepair>` of repair items mentioned
earlier was a catalyst for enabling a vectorized scrub system call.
Since 2018, the cost of making a kernel call has increased considerably on some
systems because of mitigations for speculative execution attacks.
This incentivizes program authors to make as few system calls as possible to
reduce the number of times an execution path crosses a security boundary.

With vectorized scrub, userspace pushes to the kernel the identity of a
filesystem object, a list of scrub types to run against that object, and a
simple representation of the data dependencies between the selected scrub
types.
The kernel executes as much of the caller's plan as it can until it hits a
dependency that cannot be satisfied due to a corruption, and tells userspace
how much was accomplished.
It is hoped that ``io_uring`` will pick up enough of this functionality that
online fsck can use that instead of adding a separate vectorized scrub system
call to XFS.
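
The execution rule described above can be illustrated with a small model.
This is only a sketch under stated assumptions: ``run_scrub_vector`` is a
hypothetical name, the plan is assumed to be an ordered list of
``(scrub_type, depends_on)`` pairs, and ``check`` stands in for whatever the
kernel does to examine one piece of metadata.
The actual interface may encode dependencies differently.

```python
def run_scrub_vector(plan, check):
    """Run scrub items in order; stop at the first unsatisfiable dependency.

    plan:  list of (scrub_type, depends_on) pairs, where depends_on is a
           list of scrub types that must already have come back clean.
    check: callable(scrub_type) -> True if that metadata is clean.
    Returns a list of (scrub_type, clean) results for the items that ran,
    so the caller can tell how much of the plan was accomplished.
    """
    results = []
    clean = {}
    for scrub_type, depends_on in plan:
        # A dependency is unsatisfied if it never ran or was found corrupt.
        if not all(clean.get(dep, False) for dep in depends_on):
            break
        ok = bool(check(scrub_type))
        clean[scrub_type] = ok
        results.append((scrub_type, ok))
    return results
```

With a plan such as ``[("inode", []), ("bmap", ["inode"])]`` (illustrative
type names) and an ``inode`` check that reports corruption, only the first
item runs, and the length of the returned list tells the caller where the
plan stopped.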

The relevant patchsets are the
`kernel vectorized scrub
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=vectorized-scrub>`_
and
`userspace vectorized scrub
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=vectorized-scrub>`_
series.

Quality of Service Targets for Scrub
------------------------------------

One serious shortcoming of the online fsck code is that the amount of time that
it can spend in the kernel holding resource locks is basically unbounded.
Userspace is allowed to send a fatal signal to the process, which will cause
``xfs_scrub`` to exit when it reaches a good stopping point, but there's no way
for userspace to provide a time budget to the kernel.
Given that the scrub codebase has helpers to detect fatal signals, it shouldn't
be too much work to allow userspace to specify a timeout for a scrub/repair
operation and abort the operation if it exceeds its budget.
However, most repair functions have the property that once they begin to touch
ondisk metadata, the operation cannot be cancelled cleanly; past that point, a
QoS timeout is no longer useful.
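
One way to picture such a budget is a deadline that is polled at the same
places a fatal signal would be, and that stops mattering once the repair
starts modifying ondisk metadata.
The sketch below is a userspace model only: ``ScrubBudget``, ``commit``, and
``should_abort`` are hypothetical names, not existing kernel helpers.

```python
import time


class ScrubBudget:
    """Hypothetical time budget for one scrub/repair operation."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.clock = clock
        self.deadline = clock() + timeout_s
        self.committed = False

    def commit(self):
        # Called just before the repair first modifies ondisk metadata.
        # From here on the operation has to run to completion.
        self.committed = True

    def should_abort(self):
        # Polled wherever the operation would also check for a fatal
        # signal. Expiry is ignored once the repair has committed.
        return not self.committed and self.clock() >= self.deadline
```

Passing the clock in as a parameter keeps the model testable; a caller would
poll ``should_abort()`` during the scan phase and call ``commit()`` at the
point of no return.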

Defragmenting Free Space
------------------------

Over the years, many XFS users have requested the creation of a program to
clear a portion of the physical storage underlying a filesystem so that it
becomes a contiguous chunk of free space.
Call this free space defragmenter ``clearspace`` for short.

The first piece the ``clearspace`` program needs is the ability to read the
reverse mapping index from userspace.
This already exists in the form of the ``FS_IOC_GETFSMAP`` ioctl.
The second piece it needs is a new fallocate mode
(``FALLOC_FL_MAP_FREE_SPACE``) that allocates the free space in a region and
maps it to a file.
Call this file the "space collector" file.
The third piece is the ability to force an online repair.

To clear all the metadata out of a portion of physical storage, clearspace
uses the new fallocate map-freespace call to map any free space in that region
to the space collector file.
Next, clearspace finds all metadata blocks in that region by way of
``GETFSMAP`` and issues forced repair requests on the data structures that own
them.
This often results in the metadata being rebuilt somewhere that is not being
cleared.
After each relocation, clearspace calls the "map free space" function again to
collect any newly freed space in the region being cleared.

To clear all the file data out of a portion of the physical storage, clearspace
uses the FSMAP information to find relevant file data blocks.
Having identified a good target, it uses the ``FICLONERANGE`` call on that part
of the file to try to share the physical space with a dummy file.
Cloning the extent means that the original owners cannot overwrite the
contents; any changes will be written somewhere else via copy-on-write.
Clearspace makes its own copy of the frozen extent in an area that is not being
cleared, and uses ``FIDEDUPERANGE`` (or the :ref:`atomic extent swap
<swapext_if_unchanged>` feature) to change the target file's data extent
mapping away from the area being cleared.
When all other mappings have been moved, clearspace reflinks the space into the
space collector file so that it becomes unavailable.
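
The metadata and file data passes both start from the same question: which
reverse mapping records overlap the region, and what should be done with each?
The sketch below models only that planning step.
``Rmap``, ``STEPS``, and ``plan_clearing`` are hypothetical names, the record
layout is a simplification of what ``FS_IOC_GETFSMAP`` reports, and the owner
strings are made up for illustration.

```python
from collections import namedtuple

# Simplified stand-in for one reverse-mapping record.
# kind is one of "free", "metadata", or "data".
Rmap = namedtuple("Rmap", "start length kind owner")

# What the text says clearspace does with each kind of space.
STEPS = {
    "free": "map into space collector",
    "metadata": "force repair to rebuild elsewhere",
    "data": "clone, copy elsewhere, remap owner",
}


def plan_clearing(rmaps, region_start, region_len):
    """Return (start, length, owner, step) for each record in the region."""
    region_end = region_start + region_len
    plan = []
    for r in sorted(rmaps, key=lambda r: r.start):
        # Clamp the record to the region being cleared.
        lo = max(r.start, region_start)
        hi = min(r.start + r.length, region_end)
        if lo >= hi:
            continue  # record lies entirely outside the region
        plan.append((lo, hi - lo, r.owner, STEPS[r.kind]))
    return plan
```

A real program would rerun this planning after every relocation, since each
repair or remap can free more space inside the region.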

There are further optimizations that could apply to the above algorithm.
To clear a piece of physical storage that has a high sharing factor, it is
strongly desirable to retain this sharing factor.
In fact, these extents should be moved first to maximize the sharing factor
after the operation completes.
To make this work smoothly, clearspace needs a new ioctl
(``FS_IOC_GETREFCOUNTS``) to report reference count information to userspace.
With the refcount information exposed, clearspace can quickly find the longest,
most shared data extents in the filesystem, and target them first.
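
Given refcount records, the targeting order is a plain sort.
The sketch below assumes each record is a ``(start, length, refcount)`` tuple
and that sharing takes priority over length; both the record shape and the
tie-breaking order are assumptions, and ``order_targets`` is a hypothetical
name.

```python
def order_targets(extents):
    """Order (start, length, refcount) tuples for evacuation.

    Most shared extents come first, then longer ones; the start block is
    used only to keep the order deterministic.
    """
    return sorted(extents, key=lambda e: (-e[2], -e[1], e[0]))
```

Moving the most shared extents first means a single copy relocates the
largest number of mappings.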

**Future Work Question**: How might the filesystem move inode chunks?

*Answer*: To move inode chunks, Dave Chinner constructed a prototype program
that creates a new file with the old contents and then locklessly runs around
the filesystem updating directory entries.
The operation cannot complete if the filesystem goes down.
That problem isn't totally insurmountable: create an inode remapping table
hidden behind a jump label, and a log item that tracks the kernel walking the
filesystem to update directory entries.
The trouble is, the kernel can't do anything about open files, since it cannot
revoke them.

**Future Work Question**: Can static keys be used to minimize the cost of
supporting ``revoke()`` on XFS files?

*Answer*: Yes.
Until the first revocation, the bailout code need not be in the call path at
all.

The relevant patchsets are the
`kernel freespace defrag
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=defrag-freespace>`_
and
`userspace freespace defrag
<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=defrag-freespace>`_
series.

Shrinking Filesystems
---------------------

Removing the end of the filesystem ought to be a simple matter of evacuating
the data and metadata at the end of the filesystem, and handing the freed space
to the shrink code.
That requires an evacuation of the space at the end of the filesystem, which is
a use of free space defragmentation!