.. SPDX-License-Identifier: GPL-2.0

============================
XFS Self Describing Metadata
============================

Introduction
============

The largest scalability problem facing XFS is not one of algorithmic
scalability, but of verification of the filesystem structure. Scalability of the
structures and indexes on disk and the algorithms for iterating them are
adequate for supporting PB scale filesystems with billions of inodes; however, it
is this very scalability that causes the verification problem.

Almost all metadata on XFS is dynamically allocated. The only fixed location
metadata is the allocation group headers (SB, AGF, AGFL and AGI), while all
other metadata structures need to be discovered by walking the filesystem
structure in different ways. While this is already done by userspace tools for
validating and repairing the structure, there are limits to what they can
verify, and this in turn limits the supportable size of an XFS filesystem.

For example, it is entirely possible to manually use xfs_db and a bit of
scripting to analyse the structure of a 100TB filesystem when trying to
determine the root cause of a corruption problem, but it is still mainly a
manual task of verifying that things like single bit errors or misplaced writes
weren't the ultimate cause of a corruption event. It may take a few hours to a
few days to perform such forensic analysis, so at this scale root cause
analysis is entirely possible.

However, if we scale the filesystem up to 1PB, we now have 10x as much metadata
to analyse and so that analysis blows out towards weeks/months of forensic work.
Most of the analysis work is slow and tedious, so as the amount of analysis goes
up, the more likely it is that the cause will be lost in the noise. Hence the
primary concern for supporting PB scale filesystems is minimising the time and
effort required for basic forensic analysis of the filesystem structure.


Self Describing Metadata
========================

One of the problems with the current metadata format is that apart from the
magic number in the metadata block, we have no other way of identifying what it
is supposed to be. We can't even identify if it is in the right place. Put
simply, you can't look at a single metadata block in isolation and say "yes, it
is supposed to be there and the contents are valid".

Hence most of the time spent on forensic analysis is spent doing basic
verification of metadata values, looking for values that are in range (and hence
not detected by automated verification checks) but are not correct. Finding and
understanding how things like cross linked block lists (e.g. sibling pointers in
a btree that end up with loops in them) came about is the key to understanding
what went wrong, but it is impossible to tell what order the blocks were linked
into each other or written to disk after the fact.

Hence we need to record more information into the metadata to allow us to
quickly determine if the metadata is intact and can be ignored for the purpose
of analysis. We can't protect against every possible type of error, but we can
ensure that common types of errors are easily detectable.  Hence the concept of
self describing metadata.

The first, fundamental requirement of self describing metadata is that the
metadata object contains some form of unique identifier in a well known
location. This allows us to identify the expected contents of the block and
hence parse and verify the metadata object. If we can't independently identify
the type of metadata in the object, then the metadata doesn't describe itself
very well at all!

Luckily, almost all XFS metadata has magic numbers embedded already - only the
AGFL, remote symlinks and remote attribute blocks do not contain identifying
magic numbers. Hence we can change the on-disk format of all these objects to
add more identifying information and detect this simply by changing the magic
numbers in the metadata objects. That is, if it has the current magic number,
the metadata isn't self identifying. If it contains a new magic number, it is
self identifying and we can do much more expansive automated verification of the
metadata object at runtime, during forensic analysis or repair.

As a primary concern, self describing metadata needs some form of overall
integrity checking. We cannot trust the metadata if we cannot verify that it has
not been changed as a result of external influences. Hence we need some form of
integrity check, and this is done by adding CRC32c validation to the metadata
block. If we can verify the block contains the metadata it was intended to
contain, a large amount of the manual verification work can be skipped.

CRC32c was selected as metadata cannot be more than 64k in length in XFS and
hence a 32 bit CRC is more than sufficient to detect multi-bit errors in
metadata blocks. CRC32c is also now hardware accelerated on common CPUs so it is
fast. So while CRC32c is not the strongest of possible integrity checks that
could be used, it is more than sufficient for our needs and has relatively
little overhead. Adding support for larger integrity fields and/or algorithms
does not really provide any extra value over CRC32c, but it does add a lot of
complexity and so there is no provision for changing the integrity checking
mechanism.

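As a rough illustration of the integrity check described above, the following
userspace sketch computes a CRC32c over a metadata block while treating the
block's own 4-byte CRC field as zero, so that the stored checksum does not feed
into itself. It is not the kernel implementation: the bitwise loop stands in
for the hardware accelerated routine, and crc32c_update(), crc32c(),
block_cksum() and crc32c_selftest() are hypothetical helpers introduced only
for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC32c (Castagnoli), reflected polynomial 0x82F63B78. */
static uint32_t crc32c_update(uint32_t crc, const void *buf, size_t len)
{
	const uint8_t *p = buf;

	while (len--) {
		crc ^= *p++;
		for (int i = 0; i < 8; i++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return crc;
}

/* One-shot CRC32c with the conventional all-ones seed and final inversion. */
static uint32_t crc32c(const void *buf, size_t len)
{
	return ~crc32c_update(0xFFFFFFFFu, buf, len);
}

/*
 * Checksum a block as if the 4-byte CRC field at crc_off were zero.
 * Assumes crc_off + 4 <= len.
 */
static uint32_t block_cksum(const void *blk, size_t len, size_t crc_off)
{
	static const uint8_t zero[4];
	const uint8_t *p = blk;
	uint32_t crc = 0xFFFFFFFFu;

	crc = crc32c_update(crc, p, crc_off);
	crc = crc32c_update(crc, zero, sizeof(zero));
	crc = crc32c_update(crc, p + crc_off + 4, len - crc_off - 4);
	return ~crc;
}

/* Sanity check against the standard CRC32c check value for "123456789". */
static void crc32c_selftest(void)
{
	assert(crc32c("123456789", 9) == 0xE3069283u);
}
```

Because the CRC field is skipped, the same routine can be used both to stamp a
block before writing and to verify it after reading: any change outside the CRC
field, including a single flipped bit, produces a different checksum.
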
Self describing metadata needs to contain enough information so that the
metadata block can be verified as being in the correct place without needing to
look at any other metadata. This means it needs to contain location information.
Just adding a block number to the metadata is not sufficient to protect against
mis-directed writes - a write might be misdirected to the wrong LUN and so be
written to the "correct block" of the wrong filesystem. Hence location
information must contain a filesystem identifier as well as a block number.

Another key information point in forensic analysis is knowing who the metadata
block belongs to. We already know the type, the location, whether it is valid
or corrupted, and how long ago it was last modified. Knowing the owner
of the block is important as it allows us to find other related metadata to
determine the scope of the corruption. For example, if we have an extent btree
object, we don't know what inode it belongs to and hence have to walk the entire
filesystem to find the owner of the block. Worse, the corruption could mean that
no owner can be found (i.e. it's an orphan block), and so without an owner field
in the metadata we have no idea of the scope of the corruption. If we have an
owner field in the metadata object, we can immediately do top down validation to
determine the scope of the problem.

Different types of metadata have different owner identifiers. For example,
directory, attribute and extent tree blocks are all owned by an inode, while
freespace btree blocks are owned by an allocation group. Hence the size and
contents of the owner field are determined by the type of metadata object we are
looking at.  The owner information can also identify misplaced writes (e.g.
freespace btree block written to the wrong AG).

Self describing metadata also needs to contain some indication of when it was
written to the filesystem. One of the key information points when doing forensic
analysis is how recently the block was modified. Correlation of a set of
corrupted metadata blocks based on modification times is important as it can
indicate whether the corruptions are related, whether there have been multiple
corruption events that led to the eventual failure, and even whether there are
corruptions present that the run-time verification is not detecting.

For example, we can determine whether a metadata object is supposed to be free
space or still allocated if it is still referenced by its owner by looking at
when the free space btree block that contains the block was last written
compared to when the metadata object itself was last written.  If the free space
block is more recent than the object and the object's owner, then there is a
very good chance that the block should have been removed from the owner.

To provide this "written timestamp", each metadata block has the Log Sequence
Number (LSN) of the most recent transaction that modified it written into it.
This number will always increase over the life of the filesystem, and the only
thing that resets it is running xfs_repair on the filesystem. Further, by use of
the LSN we can tell if the corrupted metadata all belonged to the same log
checkpoint and hence have some idea of how much modification occurred between
the first and last instance of corrupt metadata on disk and, further, how much
modification occurred between the corruption being written and when it was
detected.

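The LSN comparisons described above can be sketched in a few lines of userspace
C. This is only an illustration of the reasoning, not code from XFS or its
tools: struct blk_stamp, reference_probably_stale() and lsn_window() are
hypothetical names, and the sketch assumes the LSNs have already been converted
to host byte order and can be compared as plain unsigned 64 bit integers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical summary of one metadata block found during analysis. */
struct blk_stamp {
	uint64_t	blkno;	/* where the block lives */
	uint64_t	lsn;	/* LSN recorded in the block header */
};

/*
 * True if the free space btree block was written more recently than both
 * the object and its owner, i.e. the object was probably freed and the
 * owner's reference to it is stale.
 */
static bool reference_probably_stale(uint64_t freesp_lsn, uint64_t obj_lsn,
				     uint64_t owner_lsn)
{
	return freesp_lsn > obj_lsn && freesp_lsn > owner_lsn;
}

/* Find the oldest and newest LSN in a set of corrupt blocks; n must be > 0. */
static void lsn_window(const struct blk_stamp *blks, size_t n,
		       uint64_t *oldest, uint64_t *newest)
{
	assert(n > 0);
	*oldest = *newest = blks[0].lsn;
	for (size_t i = 1; i < n; i++) {
		if (blks[i].lsn < *oldest)
			*oldest = blks[i].lsn;
		if (blks[i].lsn > *newest)
			*newest = blks[i].lsn;
	}
}
```

The window returned by lsn_window() gives a rough measure of how much log
activity separates the first and last corrupt block, which is the kind of
correlation the LSN field is intended to make cheap.
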
Runtime Validation
==================

Validation of self-describing metadata takes place at runtime in two places:

	- immediately after a successful read from disk
	- immediately prior to write IO submission

The verification is completely stateless - it is done independently of the
modification process, and seeks only to check that the metadata is what it says
it is and that the metadata fields are within bounds and internally consistent.
As such, we cannot catch all types of corruption that can occur within a block
as there may be certain limitations that operational state enforces on the
metadata, or there may be corruption of interblock relationships (e.g. corrupted
sibling pointer lists). Hence we still need stateful checking in the main code
body, but in general most of the per-field validation is handled by the
verifiers.

For read verification, the caller needs to specify the expected type of metadata
that it should see, and the IO completion process verifies that the metadata
object matches what was expected. If the verification process fails, then it
marks the object being read as EFSCORRUPTED. The caller needs to catch this
error (same as for IO errors), and if it needs to take special action due to a
verification error it can do so by catching the EFSCORRUPTED error value. If we
need more discrimination of error type at higher levels, we can define new
error numbers for different errors as necessary.

The first step in read verification is checking the magic number and determining
whether CRC validation is necessary. If it is, the CRC32c is calculated and
compared against the value stored in the object itself. Once this is validated,
further checks are made against the location information, followed by extensive
object specific metadata validation. If any of these checks fail, then the
buffer is considered corrupt and the EFSCORRUPTED error is set appropriately.

Write verification is the opposite of the read verification - first the object
is extensively verified and if it is OK we then update the LSN from the last
modification made to the object. After this, we calculate the CRC and insert it
into the object. Once this is done the write IO is allowed to continue. If any
error occurs during this process, the buffer is again marked with an
EFSCORRUPTED error for the higher layers to catch.

Structures
==========

A typical on-disk structure needs to contain the following information::

    struct xfs_ondisk_hdr {
	    __be32  magic;		/* magic number */
	    __be32  crc;		/* CRC, not logged */
	    uuid_t  uuid;		/* filesystem identifier */
	    __be64  owner;		/* parent object */
	    __be64  blkno;		/* location on disk */
	    __be64  lsn;		/* last modification in log, not logged */
    };

Depending on the metadata, this information may be part of a header structure
separate to the metadata contents, or may be distributed through an existing
structure. The latter occurs with metadata that already contains some of this
information, such as the superblock and AG headers.

Other metadata may have different formats for the information, but the same
level of information is generally provided. For example:

	- short btree blocks have a 32 bit owner (ag number) and a 32 bit block
	  number for location. The two of these combined provide the same
	  information as @owner and @blkno in the above structure, but using 8
	  bytes less space on disk.

	- directory/attribute node blocks have a 16 bit magic number, and the
	  header that contains the magic number has other information in it as
	  well. Hence the additional metadata headers change the overall format
	  of the metadata.

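The space saving claimed for short btree blocks can be checked with a small
userspace sketch. The two structures below are simplified stand-ins, not the
real XFS on-disk definitions: they use host-endian fixed width types in place
of __be32/__be64 and a 16 byte array in place of uuid_t, and the names
sketch_long_hdr and sketch_short_hdr are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the full header shown above. */
struct sketch_long_hdr {
	uint32_t	magic;
	uint32_t	crc;
	uint8_t		uuid[16];
	uint64_t	owner;		/* e.g. inode number */
	uint64_t	blkno;
	uint64_t	lsn;
};

/* Short form: 32 bit owner (AG number) and 32 bit block number. */
struct sketch_short_hdr {
	uint32_t	magic;
	uint32_t	crc;
	uint8_t		uuid[16];
	uint32_t	owner;
	uint32_t	blkno;
	uint64_t	lsn;
};

/* Compile-time layout checks (C11): no padding is expected in either form. */
static_assert(sizeof(struct sketch_long_hdr) == 48, "long header size");
static_assert(sizeof(struct sketch_short_hdr) == 40, "short header size");
static_assert(offsetof(struct sketch_long_hdr, crc) == 4, "CRC offset");
```

Narrowing the owner and block number to 32 bits each removes 4 bytes from each
field, which accounts for the 8 byte difference between the two layouts.
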
A typical buffer read verifier is structured as follows::

    #define XFS_FOO_CRC_OFF		offsetof(struct xfs_ondisk_hdr, crc)

    static void
    xfs_foo_read_verify(
	    struct xfs_buf	*bp)
    {
	    struct xfs_mount	*mp = bp->b_mount;

	    if ((xfs_sb_version_hascrc(&mp->m_sb) &&
		!xfs_verify_cksum(bp->b_addr, BBTOB(bp->b_length),
					    XFS_FOO_CRC_OFF)) ||
		!xfs_foo_verify(bp)) {
		    XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, bp->b_addr);
		    xfs_buf_ioerror(bp, EFSCORRUPTED);
	    }
    }

The code ensures that the CRC is only checked if the filesystem has CRCs enabled
by checking the superblock for the feature bit, and then if the CRC verifies OK
(or is not needed) it verifies the actual contents of the block.

The verifier function will take a couple of different forms, depending on
whether the magic number can be used to determine the format of the block. In
the case it can't, the code is structured as follows::

    static bool
    xfs_foo_verify(
	    struct xfs_buf		*bp)
    {
	    struct xfs_mount	*mp = bp->b_mount;
	    struct xfs_ondisk_hdr	*hdr = bp->b_addr;

	    if (hdr->magic != cpu_to_be32(XFS_FOO_MAGIC))
		    return false;

	    /* the self describing fields only exist when CRCs are enabled */
	    if (xfs_sb_version_hascrc(&mp->m_sb)) {
		    if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid))
			    return false;
		    if (bp->b_bn != be64_to_cpu(hdr->blkno))
			    return false;
		    if (hdr->owner == 0)
			    return false;
	    }

	    /* object specific verification checks here */

	    return true;
    }

If there are different magic numbers for the different formats, the verifier
will look like::

    static bool
    xfs_foo_verify(
	    struct xfs_buf		*bp)
    {
	    struct xfs_mount	*mp = bp->b_mount;
	    struct xfs_ondisk_hdr	*hdr = bp->b_addr;

	    if (hdr->magic == cpu_to_be32(XFS_FOO_CRC_MAGIC)) {
		    if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid))
			    return false;
		    if (bp->b_bn != be64_to_cpu(hdr->blkno))
			    return false;
		    if (hdr->owner == 0)
			    return false;
	    } else if (hdr->magic != cpu_to_be32(XFS_FOO_MAGIC))
		    return false;

	    /* object specific verification checks here */

	    return true;
    }

Write verifiers are very similar to the read verifiers; they just do things in
the opposite order to the read verifiers. A typical write verifier::

    static void
    xfs_foo_write_verify(
	    struct xfs_buf	*bp)
    {
	    struct xfs_mount	*mp = bp->b_mount;
	    struct xfs_buf_log_item	*bip = bp->b_fspriv;

	    if (!xfs_foo_verify(bp)) {
		    XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, bp->b_addr);
		    xfs_buf_ioerror(bp, EFSCORRUPTED);
		    return;
	    }

	    if (!xfs_sb_version_hascrc(&mp->m_sb))
		    return;

	    if (bip) {
		    struct xfs_ondisk_hdr	*hdr = bp->b_addr;
		    hdr->lsn = cpu_to_be64(bip->bli_item.li_lsn);
	    }
	    xfs_update_cksum(bp->b_addr, BBTOB(bp->b_length), XFS_FOO_CRC_OFF);
    }

This will verify the internal structure of the metadata before we go any
further, detecting corruptions that have occurred as the metadata has been
modified in memory. If the metadata verifies OK, and CRCs are enabled, we then
update the LSN field (when it was last modified) and calculate the CRC on the
metadata. Once this is done, we can issue the IO.

Inodes and Dquots
=================

Inodes and dquots are special snowflakes. They have per-object CRC and
self-identifiers, but they are packed so that there are multiple objects per
buffer. Hence we do not use per-buffer verifiers to do the work of per-object
verification and CRC calculations. The per-buffer verifiers simply perform basic
identification of the buffer - that it contains inodes or dquots, and that
there are magic numbers in all the expected spots. All further CRC and
verification checks are done when each inode is read from or written back to the
buffer.

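A minimal userspace sketch of the per-buffer identification step for an inode
buffer might look like the following. It only checks that every inode-sized
slot starts with a magic number and leaves CRC and field validation to the
per-object code. The layout is an assumption made for illustration (a 16 bit
big-endian magic at the start of each slot), and SKETCH_INODE_MAGIC, get_be16()
and inode_buf_magics_ok() are hypothetical names rather than kernel interfaces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_INODE_MAGIC	0x494e	/* "IN"; assumed value for this sketch */

/* Read a big-endian 16 bit value from a possibly unaligned byte pointer. */
static uint16_t get_be16(const uint8_t *p)
{
	return (uint16_t)((p[0] << 8) | p[1]);
}

/*
 * Per-buffer check: every inode-sized slot must start with the magic number.
 * Per-object CRC and field checks are deliberately not done here.
 */
static bool inode_buf_magics_ok(const uint8_t *buf, size_t buflen,
				size_t inode_size)
{
	assert(buf != NULL);

	/* the buffer must hold a whole number of inode slots */
	if (inode_size < 2 || buflen == 0 || buflen % inode_size != 0)
		return false;

	for (size_t off = 0; off < buflen; off += inode_size) {
		if (get_be16(buf + off) != SKETCH_INODE_MAGIC)
			return false;
	}
	return true;
}
```

Keeping the buffer-level check this shallow means a single damaged inode is
reported when that inode is read, rather than failing every object that
happens to share the buffer on CRC grounds.
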
The structure of the verifiers and the identifier checks is very similar to the
buffer code described above. The only difference is where they are called. For
example, inode read verification is done in xfs_iread() when the inode is first
read out of the buffer and the struct xfs_inode is instantiated. The inode is
already extensively verified during writeback in xfs_iflush_int(), so the only
addition here is to add the LSN and CRC to the inode as it is copied back into
the buffer.

XXX: inode unlinked list modification doesn't recalculate the inode CRC! None of
the unlinked list modifications check or update CRCs, neither during unlink nor
log recovery. So, it's gone unnoticed until now. This won't matter immediately -
repair will probably complain about it - but it needs to be fixed.