.. SPDX-License-Identifier: GPL-2.0
.. _xfs_self_describing_metadata:

============================
XFS Self Describing Metadata
============================

Introduction
============

The largest scalability problem facing XFS is not one of algorithmic
scalability, but of verification of the filesystem structure. The scalability
of the on-disk structures and indexes, and of the algorithms for iterating
them, is adequate for supporting PB scale filesystems with billions of inodes;
however, it is this very scalability that causes the verification problem.

Almost all metadata on XFS is dynamically allocated. The only fixed location
metadata is the allocation group headers (SB, AGF, AGFL and AGI), while all
other metadata structures need to be discovered by walking the filesystem
structure in different ways. While this is already done by userspace tools for
validating and repairing the structure, there are limits to what they can
verify, and this in turn limits the supportable size of an XFS filesystem.

For example, it is entirely possible to manually use xfs_db and a bit of
scripting to analyse the structure of a 100TB filesystem when trying to
determine the root cause of a corruption problem, but it is still mainly a
manual task of verifying that things like single bit errors or misplaced writes
weren't the ultimate cause of a corruption event. It may take a few hours to a
few days to perform such forensic analysis, so at this scale root cause
analysis is entirely possible.

However, if we scale the filesystem up to 1PB, we now have 10x as much metadata
to analyse and so that analysis blows out towards weeks/months of forensic work.
Most of the analysis work is slow and tedious, so as the amount of analysis goes
up, it becomes more likely that the cause will be lost in the noise. Hence the
primary concern for supporting PB scale filesystems is minimising the time and
effort required for basic forensic analysis of the filesystem structure.


Self Describing Metadata
========================

One of the problems with the current metadata format is that apart from the
magic number in the metadata block, we have no other way of identifying what it
is supposed to be. We can't even identify if it is in the right place. Put
simply, you can't look at a single metadata block in isolation and say "yes, it
is supposed to be there and the contents are valid".

Hence most of the time spent on forensic analysis is spent doing basic
verification of metadata values, looking for values that are in range (and hence
not detected by automated verification checks) but are not correct. Finding and
understanding how things like cross linked block lists came about (e.g. sibling
pointers in a btree that end up with loops in them) is the key to understanding
what went wrong, but it is impossible to tell after the fact what order the
blocks were linked into each other or written to disk.

Hence we need to record more information into the metadata to allow us to
quickly determine if the metadata is intact and can be ignored for the purpose
of analysis. We can't protect against every possible type of error, but we can
ensure that common types of errors are easily detectable.  Hence the concept of
self describing metadata.

The first, fundamental requirement of self describing metadata is that the
metadata object contains some form of unique identifier in a well known
location. This allows us to identify the expected contents of the block and
hence parse and verify the metadata object. If we can't independently identify
the type of metadata in the object, then the metadata doesn't describe itself
very well at all!

Luckily, almost all XFS metadata has magic numbers embedded already - only the
AGFL, remote symlinks and remote attribute blocks do not contain identifying
magic numbers. Hence we can change the on-disk format of all these objects to
add more identifying information and detect this simply by changing the magic
numbers in the metadata objects. That is, if it has the current magic number,
the metadata isn't self identifying. If it contains a new magic number, it is
self identifying and we can do much more expansive automated verification of the
metadata object at runtime, during forensic analysis or repair.

As a primary concern, self describing metadata needs some form of overall
integrity checking. We cannot trust the metadata if we cannot verify that it has
not been changed as a result of external influences. Hence we need some form of
integrity check, and this is done by adding CRC32c validation to the metadata
block. If we can verify the block contains the metadata it was intended to
contain, a large amount of the manual verification work can be skipped.

CRC32c was selected as metadata cannot be more than 64k in length in XFS and
hence a 32 bit CRC is more than sufficient to detect multi-bit errors in
metadata blocks. CRC32c is also now hardware accelerated on common CPUs so it is
fast. So while CRC32c is not the strongest of possible integrity checks that
could be used, it is more than sufficient for our needs and has relatively
little overhead. Adding support for larger integrity fields and/or algorithms
does not really provide any extra value over CRC32c, but it does add a lot of
complexity and so there is no provision for changing the integrity checking
mechanism.
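To make the integrity check concrete, the following is a minimal userspace
sketch of a bitwise CRC32c (the Castagnoli polynomial, in its reflected form
0x82F63B78) applied to a metadata block. The names ``crc32c_update()``,
``crc32c_sw()`` and ``block_cksum()`` are hypothetical, and treating the 4 byte
CRC field as zero while checksumming is an assumption of this sketch. It is not
the kernel's implementation, which uses optimised and hardware accelerated
routines.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC32c update using the reflected Castagnoli polynomial. */
static uint32_t crc32c_update(uint32_t crc, const uint8_t *buf, size_t len)
{
	for (size_t i = 0; i < len; i++) {
		crc ^= buf[i];
		for (int bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return crc;
}

/* Plain CRC32c of a byte string: all-ones seed, inverted result. */
static uint32_t crc32c_sw(const uint8_t *buf, size_t len)
{
	return ~crc32c_update(~0u, buf, len);
}

/*
 * Checksum a whole block while treating the 4 byte CRC field at crc_off
 * as zero (an assumption of this sketch), so that the stored CRC does not
 * feed into its own calculation.
 */
static uint32_t block_cksum(const uint8_t *blk, size_t len, size_t crc_off)
{
	static const uint8_t zero[4] = { 0 };
	uint32_t crc;

	assert(crc_off + sizeof(zero) <= len);
	crc = crc32c_update(~0u, blk, crc_off);
	crc = crc32c_update(crc, zero, sizeof(zero));
	crc = crc32c_update(crc, blk + crc_off + sizeof(zero),
			    len - crc_off - sizeof(zero));
	return ~crc;
}
```

A verifier built on this would recompute ``block_cksum()`` after a read and
compare it with the value stored in the block. Any single bit flip outside the
CRC field changes the result.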

Self describing metadata needs to contain enough information so that the
metadata block can be verified as being in the correct place without needing to
look at any other metadata. This means it needs to contain location information.
Just adding a block number to the metadata is not sufficient to protect against
mis-directed writes - a write might be misdirected to the wrong LUN and so be
written to the "correct block" of the wrong filesystem. Hence location
information must contain a filesystem identifier as well as a block number.
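The two-part location check can be sketched in userspace as follows. The
structure ``blk_location`` and the function ``location_matches()`` are
hypothetical names used only for illustration, and the fields are held in host
byte order rather than the big-endian on-disk form.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical in-memory view of the location fields of a metadata block. */
struct blk_location {
	uint8_t  uuid[16];	/* filesystem identifier stamped into the block */
	uint64_t blkno;		/* block number the block was written for */
};

/*
 * A block is in the right place only if it carries the UUID of the
 * filesystem we are reading and the block number we read it from.
 */
static bool location_matches(const struct blk_location *loc,
			     const uint8_t fs_uuid[16], uint64_t read_blkno)
{
	assert(loc != NULL);
	if (memcmp(loc->uuid, fs_uuid, 16) != 0)
		return false;	/* "correct block" of the wrong filesystem */
	return loc->blkno == read_blkno;
}
```

Checking the block number alone would accept a write that landed on the same
block of a different LUN, which is why the filesystem identifier is compared
first.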

Another key information point in forensic analysis is knowing who the metadata
block belongs to. We already know the type, the location, that it is valid
and/or corrupted, and how long ago it was last modified. Knowing the owner
of the block is important as it allows us to find other related metadata to
determine the scope of the corruption. For example, if we have an extent btree
object, we don't know what inode it belongs to and hence have to walk the entire
filesystem to find the owner of the block. Worse, the corruption could mean that
no owner can be found (i.e. it's an orphan block), and so without an owner field
in the metadata we have no idea of the scope of the corruption. If we have an
owner field in the metadata object, we can immediately do top down validation to
determine the scope of the problem.

Different types of metadata have different owner identifiers. For example,
directory, attribute and extent tree blocks are all owned by an inode, while
freespace btree blocks are owned by an allocation group. Hence the size and
contents of the owner field are determined by the type of metadata object we are
looking at.  The owner information can also identify misplaced writes (e.g.
freespace btree block written to the wrong AG).

Self describing metadata also needs to contain some indication of when it was
written to the filesystem. One of the key information points when doing forensic
analysis is how recently the block was modified. Correlation of a set of
corrupted metadata blocks based on modification times is important as it can
indicate whether the corruptions are related, whether there have been multiple
corruption events that led to the eventual failure, and even whether there are
corruptions present that the run-time verification is not detecting.

For example, if a metadata object is still referenced by its owner, we can
determine whether it is supposed to be free space or still allocated by
comparing when the free space btree block that contains the block was last
written with when the metadata object itself was last written.  If the free
space block is more recent than the object and the object's owner, then there
is a very good chance that the block should have been removed from the owner.

To provide this "written timestamp", each metadata block has the Log Sequence
Number (LSN) of the most recent transaction that modified it written into it.
This number will always increase over the life of the filesystem, and the only
thing that resets it is running xfs_repair on the filesystem. Further, by use of
the LSN we can tell if the corrupted metadata all belonged to the same log
checkpoint and hence have some idea of how much modification occurred between
the first and last instance of corrupt metadata on disk and, further, how much
modification occurred between the corruption being written and when it was
detected.
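The LSN comparison used in the free space example can be sketched as follows.
This assumes the usual XFS encoding of an LSN as a 64 bit value with the log
cycle number in the upper 32 bits and the log block number in the lower 32
bits. The function names, including ``probably_stale_reference()``, are
hypothetical and the heuristic is only the one described in the text, not a
definitive test.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed encoding: cycle number in the high word, block in the low word. */
static uint32_t lsn_cycle(uint64_t lsn) { return (uint32_t)(lsn >> 32); }
static uint32_t lsn_block(uint64_t lsn) { return (uint32_t)lsn; }

/* Returns <0, 0 or >0 if a was logged before, together with, or after b. */
static int lsn_cmp(uint64_t a, uint64_t b)
{
	if (lsn_cycle(a) != lsn_cycle(b))
		return lsn_cycle(a) < lsn_cycle(b) ? -1 : 1;
	if (lsn_block(a) != lsn_block(b))
		return lsn_block(a) < lsn_block(b) ? -1 : 1;
	return 0;
}

/*
 * Heuristic from the text: if the free space btree block covering an
 * object was written more recently than both the object and its owner,
 * the owner is probably holding a reference to a block that was freed.
 */
static bool probably_stale_reference(uint64_t freesp_lsn, uint64_t obj_lsn,
				     uint64_t owner_lsn)
{
	return lsn_cmp(freesp_lsn, obj_lsn) > 0 &&
	       lsn_cmp(freesp_lsn, owner_lsn) > 0;
}
```

Comparing the cycle first and the block second gives an ordering that matches
the order in which the modifications were logged.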

Runtime Validation
==================

Validation of self-describing metadata takes place at runtime in two places:

	- immediately after a successful read from disk
	- immediately prior to write IO submission

The verification is completely stateless - it is done independently of the
modification process, and seeks only to check that the metadata is what it says
it is and that the metadata fields are within bounds and internally consistent.
As such, we cannot catch all types of corruption that can occur within a block
as there may be certain limitations that operational state enforces on the
metadata, or there may be corruption of interblock relationships (e.g. corrupted
sibling pointer lists). Hence we still need stateful checking in the main code
body, but in general most of the per-field validation is handled by the
verifiers.

For read verification, the caller needs to specify the expected type of metadata
that it should see, and the IO completion process verifies that the metadata
object matches what was expected. If the verification process fails, then it
marks the object being read as EFSCORRUPTED. The caller needs to catch this
error (same as for IO errors), and if it needs to take special action due to a
verification error it can do so by catching the EFSCORRUPTED error value. If we
need more discrimination of error type at higher levels, we can define new
error numbers for different errors as necessary.
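A caller that wants to treat verification failures differently from plain IO
errors might be sketched like this. The ``read_action`` values and the function
``classify_read_error()`` are hypothetical, and the policy shown (retry on IO
error, report on corruption) is only an illustration. The sketch assumes the
XFS convention of mapping EFSCORRUPTED onto EUCLEAN and of returning errors as
negative errno values.

```c
#include <assert.h>
#include <errno.h>

/* Fallback for platforms whose errno.h does not provide EUCLEAN. */
#ifndef EUCLEAN
#define EUCLEAN 117
#endif
#define EFSCORRUPTED EUCLEAN

/* Hypothetical actions a caller could take after a metadata read. */
enum read_action {
	ACT_USE_BUFFER,
	ACT_RETRY_IO,
	ACT_REPORT_CORRUPTION,
	ACT_FAIL,
};

static enum read_action classify_read_error(int error)
{
	assert(error <= 0);	/* kernel style: zero or a negative errno */
	switch (error) {
	case 0:
		return ACT_USE_BUFFER;
	case -EIO:
		return ACT_RETRY_IO;		/* the media may recover */
	case -EFSCORRUPTED:
		return ACT_REPORT_CORRUPTION;	/* the verifier rejected it */
	default:
		return ACT_FAIL;
	}
}
```

Because the verifier reports its failure through the same error path as an IO
error, a caller that does not care about the distinction needs no extra code.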

The first step in read verification is checking the magic number and determining
whether CRC validation is necessary. If it is, the CRC32c is calculated and
compared against the value stored in the object itself. Once this is validated,
further checks are made against the location information, followed by extensive
object specific metadata validation. If any of these checks fail, then the
buffer is considered corrupt and the EFSCORRUPTED error is set appropriately.

Write verification is the opposite of the read verification - first the object
is extensively verified and if it is OK we then update the LSN from the last
modification made to the object. After this, we calculate the CRC and insert it
into the object. Once this is done the write IO is allowed to continue. If any
error occurs during this process, the buffer is again marked with an
EFSCORRUPTED error for the higher layers to catch.

Structures
==========

A typical on-disk structure needs to contain the following information::

    struct xfs_ondisk_hdr {
	    __be32  magic;		/* magic number */
	    __be32  crc;		/* CRC, not logged */
	    uuid_t  uuid;		/* filesystem identifier */
	    __be64  owner;		/* parent object */
	    __be64  blkno;		/* location on disk */
	    __be64  lsn;		/* last modification in log, not logged */
    };

Depending on the metadata, this information may be part of a header structure
separate to the metadata contents, or may be distributed through an existing
structure. The latter occurs with metadata that already contains some of this
information, such as the superblock and AG headers.

Other metadata may have different formats for the information, but the same
level of information is generally provided. For example:

	- short btree blocks have a 32 bit owner (ag number) and a 32 bit block
	  number for location. The two of these combined provide the same
	  information as @owner and @blkno in the above structure, but using 8
	  bytes less space on disk.

	- directory/attribute node blocks have a 16 bit magic number, and the
	  header that contains the magic number has other information in it as
	  well. Hence the additional metadata headers change the overall format
	  of the metadata.
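The 8 byte saving for short btree blocks can be checked with a small userspace
sketch. Both structures below are hypothetical stand-ins that use host-endian
integer types in place of ``__be32``/``__be64`` and a byte array in place of
``uuid_t``. ``short_hdr`` simply narrows the owner and block number as the text
describes and does not reproduce the real on-disk btree block header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the generic header layout shown above. */
struct generic_hdr {
	uint32_t magic;
	uint32_t crc;
	uint8_t  uuid[16];
	uint64_t owner;		/* 64 bit parent object */
	uint64_t blkno;		/* 64 bit location */
	uint64_t lsn;
};

/* Hypothetical compact layout with a 32 bit owner and block number. */
struct short_hdr {
	uint32_t magic;
	uint32_t crc;
	uint8_t  uuid[16];
	uint32_t owner;		/* AG number */
	uint32_t blkno;		/* 32 bit block number */
	uint64_t lsn;
};

static_assert(sizeof(struct generic_hdr) == 48, "generic header is 48 bytes");
static_assert(sizeof(struct short_hdr) == 40, "compact header is 40 bytes");

/* Bytes saved per block by the compact form. */
static size_t hdr_bytes_saved(void)
{
	return sizeof(struct generic_hdr) - sizeof(struct short_hdr);
}
```

The static assertions fail at compile time if padding ever changes the sizes,
which is a cheap way to pin down an on-disk layout.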

A typical buffer read verifier is structured as follows::

    #define XFS_FOO_CRC_OFF		offsetof(struct xfs_ondisk_hdr, crc)

    static void
    xfs_foo_read_verify(
	    struct xfs_buf	*bp)
    {
	    struct xfs_mount	*mp = bp->b_mount;

	    if ((xfs_sb_version_hascrc(&mp->m_sb) &&
		!xfs_verify_cksum(bp->b_addr, BBTOB(bp->b_length),
					    XFS_FOO_CRC_OFF)) ||
		!xfs_foo_verify(bp)) {
		    XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, bp->b_addr);
		    xfs_buf_ioerror(bp, EFSCORRUPTED);
	    }
    }

The code ensures that the CRC is only checked if the filesystem has CRCs enabled
by checking the superblock for the feature bit, and then if the CRC verifies OK
(or is not needed) it verifies the actual contents of the block.

The verifier function will take a couple of different forms, depending on
whether the magic number can be used to determine the format of the block. In
the case it can't, the code is structured as follows::

    static bool
    xfs_foo_verify(
	    struct xfs_buf		*bp)
    {
	    struct xfs_mount	*mp = bp->b_mount;
	    struct xfs_ondisk_hdr	*hdr = bp->b_addr;

	    if (hdr->magic != cpu_to_be32(XFS_FOO_MAGIC))
		    return false;

	    /* the self describing fields only exist when CRCs are enabled */
	    if (xfs_sb_version_hascrc(&mp->m_sb)) {
		    if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid))
			    return false;
		    if (bp->b_bn != be64_to_cpu(hdr->blkno))
			    return false;
		    if (hdr->owner == 0)
			    return false;
	    }

	    /* object specific verification checks here */

	    return true;
    }

If there are different magic numbers for the different formats, the verifier
will look like::

    static bool
    xfs_foo_verify(
	    struct xfs_buf		*bp)
    {
	    struct xfs_mount	*mp = bp->b_mount;
	    struct xfs_ondisk_hdr	*hdr = bp->b_addr;

	    if (hdr->magic == cpu_to_be32(XFS_FOO_CRC_MAGIC)) {
		    if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid))
			    return false;
		    if (bp->b_bn != be64_to_cpu(hdr->blkno))
			    return false;
		    if (hdr->owner == 0)
			    return false;
	    } else if (hdr->magic != cpu_to_be32(XFS_FOO_MAGIC))
		    return false;

	    /* object specific verification checks here */

	    return true;
    }

Write verifiers are very similar to the read verifiers; they just do things in
the opposite order. A typical write verifier::

    static void
    xfs_foo_write_verify(
	    struct xfs_buf	*bp)
    {
	    struct xfs_mount	*mp = bp->b_mount;
	    struct xfs_buf_log_item	*bip = bp->b_fspriv;

	    if (!xfs_foo_verify(bp)) {
		    XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, bp->b_addr);
		    xfs_buf_ioerror(bp, EFSCORRUPTED);
		    return;
	    }

	    if (!xfs_sb_version_hascrc(&mp->m_sb))
		    return;

	    if (bip) {
		    struct xfs_ondisk_hdr	*hdr = bp->b_addr;
		    hdr->lsn = cpu_to_be64(bip->bli_item.li_lsn);
	    }
	    xfs_update_cksum(bp->b_addr, BBTOB(bp->b_length), XFS_FOO_CRC_OFF);
    }

This will verify the internal structure of the metadata before we go any
further, detecting corruptions that have occurred as the metadata has been
modified in memory. If the metadata verifies OK, and CRCs are enabled, we then
update the LSN field (when it was last modified) and calculate the CRC on the
metadata. Once this is done, we can issue the IO.

Inodes and Dquots
=================

Inodes and dquots are special snowflakes. They have per-object CRC and
self-identifiers, but they are packed so that there are multiple objects per
buffer. Hence we do not use per-buffer verifiers to do the work of per-object
verification and CRC calculations. The per-buffer verifiers simply perform basic
identification of the buffer - that they contain inodes or dquots, and that
there are magic numbers in all the expected spots. All further CRC and
verification checks are done when each inode is read from or written back to the
buffer.
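The per-buffer identification step for an inode buffer can be sketched in
userspace as a scan for the inode magic number at every expected offset. The
function ``inode_buf_magic_ok()`` and its parameters are hypothetical. The
sketch assumes that each on-disk inode starts with a 16 bit big-endian magic
number of 0x494e ("IN") and that the buffer holds a whole number of fixed size
inodes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DINODE_MAGIC	0x494eu		/* "IN" */

/*
 * Basic buffer identification only: check that every inode sized slot
 * starts with the inode magic number. No per-inode CRC or field checks
 * are done here; those belong to the per-object verification.
 */
static bool inode_buf_magic_ok(const uint8_t *buf, size_t buflen,
			       size_t inode_size)
{
	assert(inode_size >= 2 && buflen % inode_size == 0);
	for (size_t off = 0; off < buflen; off += inode_size) {
		/* the magic is stored big-endian in the first two bytes */
		uint16_t magic = (uint16_t)((buf[off] << 8) | buf[off + 1]);

		if (magic != DINODE_MAGIC)
			return false;
	}
	return true;
}
```

Keeping this check cheap matters because it runs on every buffer read, while
the expensive per-object work is deferred until an individual inode is used.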

The structure of the verifiers and the identifier checks is very similar to the
buffer code described above. The only difference is where they are called. For
example, inode read verification is done in xfs_inode_from_disk() when the inode
is first read out of the buffer and the struct xfs_inode is instantiated. The
inode is already extensively verified during writeback in xfs_iflush_int(), so
the only addition here is to add the LSN and CRC to the inode as it is copied
back into the buffer.

XXX: inode unlinked list modification doesn't recalculate the inode CRC! None of
the unlinked list modifications check or update CRCs, neither during unlink nor
log recovery. So, it's gone unnoticed until now. This won't matter immediately -
repair will probably complain about it - but it needs to be fixed.