1.. SPDX-License-Identifier: GPL-2.0
2.. _xfs_self_describing_metadata:
3
4============================
5XFS Self Describing Metadata
6============================
7
8Introduction
9============
10
11The largest scalability problem facing XFS is not one of algorithmic
12scalability, but of verification of the filesystem structure. Scalabilty of the
13structures and indexes on disk and the algorithms for iterating them are
14adequate for supporting PB scale filesystems with billions of inodes, however it
15is this very scalability that causes the verification problem.
16
17Almost all metadata on XFS is dynamically allocated. The only fixed location
18metadata is the allocation group headers (SB, AGF, AGFL and AGI), while all
19other metadata structures need to be discovered by walking the filesystem
20structure in different ways. While this is already done by userspace tools for
21validating and repairing the structure, there are limits to what they can
22verify, and this in turn limits the supportable size of an XFS filesystem.
23
24For example, it is entirely possible to manually use xfs_db and a bit of
25scripting to analyse the structure of a 100TB filesystem when trying to
26determine the root cause of a corruption problem, but it is still mainly a
27manual task of verifying that things like single bit errors or misplaced writes
28weren't the ultimate cause of a corruption event. It may take a few hours to a
29few days to perform such forensic analysis, so for at this scale root cause
30analysis is entirely possible.
31
32However, if we scale the filesystem up to 1PB, we now have 10x as much metadata
33to analyse and so that analysis blows out towards weeks/months of forensic work.
34Most of the analysis work is slow and tedious, so as the amount of analysis goes
35up, the more likely that the cause will be lost in the noise.  Hence the primary
36concern for supporting PB scale filesystems is minimising the time and effort
37required for basic forensic analysis of the filesystem structure.
38
39
40Self Describing Metadata
41========================
42
43One of the problems with the current metadata format is that apart from the
44magic number in the metadata block, we have no other way of identifying what it
45is supposed to be. We can't even identify if it is the right place. Put simply,
46you can't look at a single metadata block in isolation and say "yes, it is
47supposed to be there and the contents are valid".
48
49Hence most of the time spent on forensic analysis is spent doing basic
50verification of metadata values, looking for values that are in range (and hence
51not detected by automated verification checks) but are not correct. Finding and
52understanding how things like cross linked block lists (e.g. sibling
53pointers in a btree end up with loops in them) are the key to understanding what
54went wrong, but it is impossible to tell what order the blocks were linked into
55each other or written to disk after the fact.
56
57Hence we need to record more information into the metadata to allow us to
58quickly determine if the metadata is intact and can be ignored for the purpose
59of analysis. We can't protect against every possible type of error, but we can
60ensure that common types of errors are easily detectable.  Hence the concept of
61self describing metadata.
62
63The first, fundamental requirement of self describing metadata is that the
64metadata object contains some form of unique identifier in a well known
65location. This allows us to identify the expected contents of the block and
66hence parse and verify the metadata object. IF we can't independently identify
67the type of metadata in the object, then the metadata doesn't describe itself
68very well at all!
69
70Luckily, almost all XFS metadata has magic numbers embedded already - only the
71AGFL, remote symlinks and remote attribute blocks do not contain identifying
72magic numbers. Hence we can change the on-disk format of all these objects to
73add more identifying information and detect this simply by changing the magic
74numbers in the metadata objects. That is, if it has the current magic number,
75the metadata isn't self identifying. If it contains a new magic number, it is
76self identifying and we can do much more expansive automated verification of the
77metadata object at runtime, during forensic analysis or repair.
78
79As a primary concern, self describing metadata needs some form of overall
80integrity checking. We cannot trust the metadata if we cannot verify that it has
81not been changed as a result of external influences. Hence we need some form of
82integrity check, and this is done by adding CRC32c validation to the metadata
83block. If we can verify the block contains the metadata it was intended to
84contain, a large amount of the manual verification work can be skipped.
85
86CRC32c was selected as metadata cannot be more than 64k in length in XFS and
87hence a 32 bit CRC is more than sufficient to detect multi-bit errors in
88metadata blocks. CRC32c is also now hardware accelerated on common CPUs so it is
89fast. So while CRC32c is not the strongest of possible integrity checks that
90could be used, it is more than sufficient for our needs and has relatively
91little overhead. Adding support for larger integrity fields and/or algorithms
92does really provide any extra value over CRC32c, but it does add a lot of
93complexity and so there is no provision for changing the integrity checking
94mechanism.
95
96Self describing metadata needs to contain enough information so that the
97metadata block can be verified as being in the correct place without needing to
98look at any other metadata. This means it needs to contain location information.
99Just adding a block number to the metadata is not sufficient to protect against
100mis-directed writes - a write might be misdirected to the wrong LUN and so be
101written to the "correct block" of the wrong filesystem. Hence location
102information must contain a filesystem identifier as well as a block number.
103
104Another key information point in forensic analysis is knowing who the metadata
105block belongs to. We already know the type, the location, that it is valid
106and/or corrupted, and how long ago that it was last modified. Knowing the owner
107of the block is important as it allows us to find other related metadata to
108determine the scope of the corruption. For example, if we have a extent btree
109object, we don't know what inode it belongs to and hence have to walk the entire
110filesystem to find the owner of the block. Worse, the corruption could mean that
111no owner can be found (i.e. it's an orphan block), and so without an owner field
112in the metadata we have no idea of the scope of the corruption. If we have an
113owner field in the metadata object, we can immediately do top down validation to
114determine the scope of the problem.
115
116Different types of metadata have different owner identifiers. For example,
117directory, attribute and extent tree blocks are all owned by an inode, while
118freespace btree blocks are owned by an allocation group. Hence the size and
119contents of the owner field are determined by the type of metadata object we are
120looking at.  The owner information can also identify misplaced writes (e.g.
121freespace btree block written to the wrong AG).
122
123Self describing metadata also needs to contain some indication of when it was
124written to the filesystem. One of the key information points when doing forensic
125analysis is how recently the block was modified. Correlation of set of corrupted
126metadata blocks based on modification times is important as it can indicate
127whether the corruptions are related, whether there's been multiple corruption
128events that lead to the eventual failure, and even whether there are corruptions
129present that the run-time verification is not detecting.
130
131For example, we can determine whether a metadata object is supposed to be free
132space or still allocated if it is still referenced by its owner by looking at
133when the free space btree block that contains the block was last written
134compared to when the metadata object itself was last written.  If the free space
135block is more recent than the object and the object's owner, then there is a
136very good chance that the block should have been removed from the owner.
137
138To provide this "written timestamp", each metadata block gets the Log Sequence
139Number (LSN) of the most recent transaction it was modified on written into it.
140This number will always increase over the life of the filesystem, and the only
141thing that resets it is running xfs_repair on the filesystem. Further, by use of
142the LSN we can tell if the corrupted metadata all belonged to the same log
143checkpoint and hence have some idea of how much modification occurred between
144the first and last instance of corrupt metadata on disk and, further, how much
145modification occurred between the corruption being written and when it was
146detected.
147
148Runtime Validation
149==================
150
151Validation of self-describing metadata takes place at runtime in two places:
152
153	- immediately after a successful read from disk
154	- immediately prior to write IO submission
155
156The verification is completely stateless - it is done independently of the
157modification process, and seeks only to check that the metadata is what it says
158it is and that the metadata fields are within bounds and internally consistent.
159As such, we cannot catch all types of corruption that can occur within a block
160as there may be certain limitations that operational state enforces of the
161metadata, or there may be corruption of interblock relationships (e.g. corrupted
162sibling pointer lists). Hence we still need stateful checking in the main code
163body, but in general most of the per-field validation is handled by the
164verifiers.
165
166For read verification, the caller needs to specify the expected type of metadata
167that it should see, and the IO completion process verifies that the metadata
168object matches what was expected. If the verification process fails, then it
169marks the object being read as EFSCORRUPTED. The caller needs to catch this
170error (same as for IO errors), and if it needs to take special action due to a
171verification error it can do so by catching the EFSCORRUPTED error value. If we
172need more discrimination of error type at higher levels, we can define new
173error numbers for different errors as necessary.
174
175The first step in read verification is checking the magic number and determining
176whether CRC validating is necessary. If it is, the CRC32c is calculated and
177compared against the value stored in the object itself. Once this is validated,
178further checks are made against the location information, followed by extensive
179object specific metadata validation. If any of these checks fail, then the
180buffer is considered corrupt and the EFSCORRUPTED error is set appropriately.
181
182Write verification is the opposite of the read verification - first the object
183is extensively verified and if it is OK we then update the LSN from the last
184modification made to the object, After this, we calculate the CRC and insert it
185into the object. Once this is done the write IO is allowed to continue. If any
186error occurs during this process, the buffer is again marked with a EFSCORRUPTED
187error for the higher layers to catch.
188
189Structures
190==========
191
192A typical on-disk structure needs to contain the following information::
193
194    struct xfs_ondisk_hdr {
195	    __be32  magic;		/* magic number */
196	    __be32  crc;		/* CRC, not logged */
197	    uuid_t  uuid;		/* filesystem identifier */
198	    __be64  owner;		/* parent object */
199	    __be64  blkno;		/* location on disk */
200	    __be64  lsn;		/* last modification in log, not logged */
201    };
202
203Depending on the metadata, this information may be part of a header structure
204separate to the metadata contents, or may be distributed through an existing
205structure. The latter occurs with metadata that already contains some of this
206information, such as the superblock and AG headers.
207
208Other metadata may have different formats for the information, but the same
209level of information is generally provided. For example:
210
211	- short btree blocks have a 32 bit owner (ag number) and a 32 bit block
212	  number for location. The two of these combined provide the same
213	  information as @owner and @blkno in eh above structure, but using 8
214	  bytes less space on disk.
215
216	- directory/attribute node blocks have a 16 bit magic number, and the
217	  header that contains the magic number has other information in it as
218	  well. hence the additional metadata headers change the overall format
219	  of the metadata.
220
221A typical buffer read verifier is structured as follows::
222
223    #define XFS_FOO_CRC_OFF		offsetof(struct xfs_ondisk_hdr, crc)
224
225    static void
226    xfs_foo_read_verify(
227	    struct xfs_buf	*bp)
228    {
229	struct xfs_mount *mp = bp->b_mount;
230
231	    if ((xfs_sb_version_hascrc(&mp->m_sb) &&
232		!xfs_verify_cksum(bp->b_addr, BBTOB(bp->b_length),
233					    XFS_FOO_CRC_OFF)) ||
234		!xfs_foo_verify(bp)) {
235		    XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, bp->b_addr);
236		    xfs_buf_ioerror(bp, EFSCORRUPTED);
237	    }
238    }
239
240The code ensures that the CRC is only checked if the filesystem has CRCs enabled
241by checking the superblock of the feature bit, and then if the CRC verifies OK
242(or is not needed) it verifies the actual contents of the block.
243
244The verifier function will take a couple of different forms, depending on
245whether the magic number can be used to determine the format of the block. In
246the case it can't, the code is structured as follows::
247
248    static bool
249    xfs_foo_verify(
250	    struct xfs_buf		*bp)
251    {
252	    struct xfs_mount	*mp = bp->b_mount;
253	    struct xfs_ondisk_hdr	*hdr = bp->b_addr;
254
255	    if (hdr->magic != cpu_to_be32(XFS_FOO_MAGIC))
256		    return false;
257
258	    if (!xfs_sb_version_hascrc(&mp->m_sb)) {
259		    if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid))
260			    return false;
261		    if (bp->b_bn != be64_to_cpu(hdr->blkno))
262			    return false;
263		    if (hdr->owner == 0)
264			    return false;
265	    }
266
267	    /* object specific verification checks here */
268
269	    return true;
270    }
271
272If there are different magic numbers for the different formats, the verifier
273will look like::
274
275    static bool
276    xfs_foo_verify(
277	    struct xfs_buf		*bp)
278    {
279	    struct xfs_mount	*mp = bp->b_mount;
280	    struct xfs_ondisk_hdr	*hdr = bp->b_addr;
281
282	    if (hdr->magic == cpu_to_be32(XFS_FOO_CRC_MAGIC)) {
283		    if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid))
284			    return false;
285		    if (bp->b_bn != be64_to_cpu(hdr->blkno))
286			    return false;
287		    if (hdr->owner == 0)
288			    return false;
289	    } else if (hdr->magic != cpu_to_be32(XFS_FOO_MAGIC))
290		    return false;
291
292	    /* object specific verification checks here */
293
294	    return true;
295    }
296
297Write verifiers are very similar to the read verifiers, they just do things in
298the opposite order to the read verifiers. A typical write verifier::
299
300    static void
301    xfs_foo_write_verify(
302	    struct xfs_buf	*bp)
303    {
304	    struct xfs_mount	*mp = bp->b_mount;
305	    struct xfs_buf_log_item	*bip = bp->b_fspriv;
306
307	    if (!xfs_foo_verify(bp)) {
308		    XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, bp->b_addr);
309		    xfs_buf_ioerror(bp, EFSCORRUPTED);
310		    return;
311	    }
312
313	    if (!xfs_sb_version_hascrc(&mp->m_sb))
314		    return;
315
316
317	    if (bip) {
318		    struct xfs_ondisk_hdr	*hdr = bp->b_addr;
319		    hdr->lsn = cpu_to_be64(bip->bli_item.li_lsn);
320	    }
321	    xfs_update_cksum(bp->b_addr, BBTOB(bp->b_length), XFS_FOO_CRC_OFF);
322    }
323
324This will verify the internal structure of the metadata before we go any
325further, detecting corruptions that have occurred as the metadata has been
326modified in memory. If the metadata verifies OK, and CRCs are enabled, we then
327update the LSN field (when it was last modified) and calculate the CRC on the
328metadata. Once this is done, we can issue the IO.
329
330Inodes and Dquots
331=================
332
333Inodes and dquots are special snowflakes. They have per-object CRC and
334self-identifiers, but they are packed so that there are multiple objects per
335buffer. Hence we do not use per-buffer verifiers to do the work of per-object
336verification and CRC calculations. The per-buffer verifiers simply perform basic
337identification of the buffer - that they contain inodes or dquots, and that
338there are magic numbers in all the expected spots. All further CRC and
339verification checks are done when each inode is read from or written back to the
340buffer.
341
342The structure of the verifiers and the identifiers checks is very similar to the
343buffer code described above. The only difference is where they are called. For
344example, inode read verification is done in xfs_inode_from_disk() when the inode
345is first read out of the buffer and the struct xfs_inode is instantiated. The
346inode is already extensively verified during writeback in xfs_iflush_int, so the
347only addition here is to add the LSN and CRC to the inode as it is copied back
348into the buffer.
349
350XXX: inode unlinked list modification doesn't recalculate the inode CRC! None of
351the unlinked list modifications check or update CRCs, neither during unlink nor
352log recovery. So, it's gone unnoticed until now. This won't matter immediately -
353repair will probably complain about it - but it needs to be fixed.
354