.. SPDX-License-Identifier: GPL-2.0

============================
XFS Self Describing Metadata
============================

Introduction
============

The largest scalability problem facing XFS is not one of algorithmic
scalability, but of verification of the filesystem structure. Scalability of the
structures and indexes on disk and the algorithms for iterating them are
adequate for supporting PB scale filesystems with billions of inodes; however,
it is this very scalability that causes the verification problem.

Almost all metadata on XFS is dynamically allocated. The only fixed location
metadata is the allocation group headers (SB, AGF, AGFL and AGI), while all
other metadata structures need to be discovered by walking the filesystem
structure in different ways. While this is already done by userspace tools for
validating and repairing the structure, there are limits to what they can
verify, and this in turn limits the supportable size of an XFS filesystem.

For example, it is entirely possible to manually use xfs_db and a bit of
scripting to analyse the structure of a 100TB filesystem when trying to
determine the root cause of a corruption problem, but it is still mainly a
manual task of verifying that things like single bit errors or misplaced writes
weren't the ultimate cause of a corruption event. It may take a few hours to a
few days to perform such forensic analysis, so at this scale root cause
analysis is entirely possible.

However, if we scale the filesystem up to 1PB, we now have 10x as much metadata
to analyse and so that analysis blows out towards weeks/months of forensic work.
Most of the analysis work is slow and tedious, so as the amount of analysis goes
up, the more likely it is that the cause will be lost in the noise. Hence the
primary concern for supporting PB scale filesystems is minimising the time and
effort required for basic forensic analysis of the filesystem structure.


Self Describing Metadata
========================

One of the problems with the current metadata format is that apart from the
magic number in the metadata block, we have no other way of identifying what it
is supposed to be. We can't even identify if it is in the right place.
Put simply, you can't look at a single metadata block in isolation and say
"yes, it is supposed to be there and the contents are valid".

Hence most of the time spent on forensic analysis is spent doing basic
verification of metadata values, looking for values that are in range (and hence
not detected by automated verification checks) but are not correct. Finding and
understanding things like cross linked block lists (e.g. sibling pointers in a
btree that end up with loops in them) is the key to understanding what went
wrong, but it is impossible to tell after the fact what order the blocks were
linked into each other or written to disk.

Hence we need to record more information into the metadata to allow us to
quickly determine if the metadata is intact and can be ignored for the purpose
of analysis. We can't protect against every possible type of error, but we can
ensure that common types of errors are easily detectable. Hence the concept of
self describing metadata.

The first, fundamental requirement of self describing metadata is that the
metadata object contains some form of unique identifier in a well known
location. This allows us to identify the expected contents of the block and
hence parse and verify the metadata object.
If we can't independently identify the type of metadata in the object, then the
metadata doesn't describe itself very well at all!

Luckily, almost all XFS metadata has magic numbers embedded already - only the
AGFL, remote symlinks and remote attribute blocks do not contain identifying
magic numbers. Hence we can change the on-disk format of all these objects to
add more identifying information and detect this simply by changing the magic
numbers in the metadata objects. That is, if it has the current magic number,
the metadata isn't self identifying. If it contains a new magic number, it is
self identifying and we can do much more expansive automated verification of the
metadata object at runtime, during forensic analysis or repair.

As a primary concern, self describing metadata needs some form of overall
integrity checking. We cannot trust the metadata if we cannot verify that it has
not been changed as a result of external influences. Hence we need some form of
integrity check, and this is done by adding CRC32c validation to the metadata
block. If we can verify the block contains the metadata it was intended to
contain, a large amount of the manual verification work can be skipped.
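
To make the integrity check concrete, the following is a minimal userspace
sketch of CRC32c, computed one bit at a time using the reflected Castagnoli
polynomial. It is illustrative only and is not the kernel's implementation;
the helper ``crc32c_detects_bit_flip()`` is a hypothetical name introduced
here to show that a single bit error changes the checksum::

    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Reflected form of the Castagnoli polynomial 0x1EDC6F41. */
    #define CRC32C_POLY_REFLECTED   0x82F63B78u

    /*
     * Bit-at-a-time CRC32c over len bytes. Slow but easy to audit; table
     * driven or hardware assisted versions compute the same value.
     */
    static uint32_t
    crc32c(const void *data, size_t len)
    {
            const uint8_t *p = data;
            uint32_t crc = 0xFFFFFFFFu;     /* standard initial value */

            assert(data != NULL || len == 0);

            while (len--) {
                    crc ^= *p++;
                    for (int bit = 0; bit < 8; bit++)
                            crc = (crc >> 1) ^
                                  (CRC32C_POLY_REFLECTED & -(crc & 1u));
            }
            return ~crc;                    /* standard final inversion */
    }

    /* Returns 1 if flipping bit number bitno of buf changes the checksum. */
    static int
    crc32c_detects_bit_flip(uint8_t *buf, size_t len, size_t bitno)
    {
            uint32_t before = crc32c(buf, len);
            uint32_t after;

            assert(bitno < len * 8);
            buf[bitno / 8] ^= (uint8_t)(1u << (bitno % 8));
            after = crc32c(buf, len);
            buf[bitno / 8] ^= (uint8_t)(1u << (bitno % 8)); /* restore */
            return before != after;
    }

Calling ``crc32c()`` on the nine ASCII bytes "123456789" should give the
well known CRC32c check value 0xE3069283, which is a quick way to confirm
that an implementation uses the right polynomial and bit order.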

CRC32c was selected as metadata cannot be more than 64k in length in XFS and
hence a 32 bit CRC is more than sufficient to detect multi-bit errors in
metadata blocks. CRC32c is also now hardware accelerated on common CPUs so it is
fast. So while CRC32c is not the strongest of possible integrity checks that
could be used, it is more than sufficient for our needs and has relatively
little overhead. Adding support for larger integrity fields and/or algorithms
does not really provide any extra value over CRC32c, but it does add a lot of
complexity and so there is no provision for changing the integrity checking
mechanism.

Self describing metadata needs to contain enough information so that the
metadata block can be verified as being in the correct place without needing to
look at any other metadata. This means it needs to contain location information.
Just adding a block number to the metadata is not sufficient to protect against
mis-directed writes - a write might be misdirected to the wrong LUN and so be
written to the "correct block" of the wrong filesystem. Hence location
information must contain a filesystem identifier as well as a block number.

Another key information point in forensic analysis is knowing who the metadata
block belongs to.
We already know the type, the location, that it is valid and/or corrupted, and
how long ago it was last modified. Knowing the owner of the block is important
as it allows us to find other related metadata to determine the scope of the
corruption. For example, if we have an extent btree object, we don't know what
inode it belongs to and hence have to walk the entire filesystem to find the
owner of the block. Worse, the corruption could mean that no owner can be found
(i.e. it's an orphan block), and so without an owner field in the metadata we
have no idea of the scope of the corruption. If we have an owner field in the
metadata object, we can immediately do top down validation to determine the
scope of the problem.

Different types of metadata have different owner identifiers. For example,
directory, attribute and extent tree blocks are all owned by an inode, while
freespace btree blocks are owned by an allocation group. Hence the size and
contents of the owner field are determined by the type of metadata object we are
looking at. The owner information can also identify misplaced writes (e.g.
freespace btree block written to the wrong AG).

Self describing metadata also needs to contain some indication of when it was
written to the filesystem.
One of the key information points when doing forensic analysis is how recently
the block was modified. Correlation of a set of corrupted metadata blocks based
on modification times is important as it can indicate whether the corruptions
are related, whether there have been multiple corruption events that led to the
eventual failure, and even whether there are corruptions present that the
run-time verification is not detecting.

For example, we can determine whether a metadata object is supposed to be free
space or still allocated if it is still referenced by its owner by looking at
when the free space btree block that contains the block was last written
compared to when the metadata object itself was last written. If the free space
block is more recent than the object and the object's owner, then there is a
very good chance that the block should have been removed from the owner.

To provide this "written timestamp", each metadata block gets the Log Sequence
Number (LSN) of the most recent transaction it was modified on written into it.
This number will always increase over the life of the filesystem, and the only
thing that resets it is running xfs_repair on the filesystem.
Further, by use of the LSN we can tell if the corrupted metadata all belonged
to the same log checkpoint and hence have some idea of how much modification
occurred between the first and last instance of corrupt metadata on disk and,
further, how much modification occurred between the corruption being written
and when it was detected.

Runtime Validation
==================

Validation of self-describing metadata takes place at runtime in two places:

        - immediately after a successful read from disk
        - immediately prior to write IO submission

The verification is completely stateless - it is done independently of the
modification process, and seeks only to check that the metadata is what it says
it is and that the metadata fields are within bounds and internally consistent.
As such, we cannot catch all types of corruption that can occur within a block
as there may be certain limitations that operational state enforces on the
metadata, or there may be corruption of interblock relationships (e.g. corrupted
sibling pointer lists). Hence we still need stateful checking in the main code
body, but in general most of the per-field validation is handled by the
verifiers.

For read verification, the caller needs to specify the expected type of metadata
that it should see, and the IO completion process verifies that the metadata
object matches what was expected. If the verification process fails, then it
marks the object being read as EFSCORRUPTED. The caller needs to catch this
error (same as for IO errors), and if it needs to take special action due to a
verification error it can do so by catching the EFSCORRUPTED error value. If we
need more discrimination of error type at higher levels, we can define new
error numbers for different errors as necessary.

The first step in read verification is checking the magic number and determining
whether CRC validation is necessary. If it is, the CRC32c is calculated and
compared against the value stored in the object itself. Once this is validated,
further checks are made against the location information, followed by extensive
object specific metadata validation. If any of these checks fail, then the
buffer is considered corrupt and the EFSCORRUPTED error is set appropriately.

Write verification is the opposite of the read verification - first the object
is extensively verified and if it is OK we then update the LSN from the last
modification made to the object. After this, we calculate the CRC and insert it
into the object. Once this is done the write IO is allowed to continue. If any
error occurs during this process, the buffer is again marked with an
EFSCORRUPTED error for the higher layers to catch.

Structures
==========

A typical on-disk structure needs to contain the following information::

    struct xfs_ondisk_hdr {
            __be32  magic;          /* magic number */
            __be32  crc;            /* CRC, not logged */
            uuid_t  uuid;           /* filesystem identifier */
            __be64  owner;          /* parent object */
            __be64  blkno;          /* location on disk */
            __be64  lsn;            /* last modification in log, not logged */
    };

Depending on the metadata, this information may be part of a header structure
separate to the metadata contents, or may be distributed through an existing
structure. The latter occurs with metadata that already contains some of this
information, such as the superblock and AG headers.
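
As a rough illustration of how these fields work together, the sketch below
models the header in userspace. Everything named ``demo_*`` or ``DEMO_*`` is
hypothetical, the fields are kept in host byte order rather than big-endian,
and we assume the checksum covers the whole block with the crc field treated
as zero while it is calculated. It is a model of the idea, not the kernel
code::

    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    #define DEMO_MAGIC      0x464f4f35u     /* hypothetical magic number */

    /* Userspace stand-in for the header above; fields are host-endian. */
    struct demo_hdr {
            uint32_t magic;
            uint32_t crc;
            uint8_t  uuid[16];
            uint64_t owner;
            uint64_t blkno;
            uint64_t lsn;
    };

    /* Bitwise CRC32c (Castagnoli), continuing from a running crc value. */
    static uint32_t
    demo_crc32c_update(uint32_t crc, const uint8_t *p, size_t len)
    {
            while (len--) {
                    crc ^= *p++;
                    for (int bit = 0; bit < 8; bit++)
                            crc = (crc >> 1) ^ (0x82F63B78u & -(crc & 1u));
            }
            return crc;
    }

    /* Checksum the whole block, feeding zeroes in place of the crc field. */
    static uint32_t
    demo_block_cksum(const uint8_t *blk, size_t len)
    {
            static const uint8_t zero[sizeof(uint32_t)];
            size_t off = offsetof(struct demo_hdr, crc);
            uint32_t crc = 0xFFFFFFFFu;

            assert(len >= sizeof(struct demo_hdr));
            crc = demo_crc32c_update(crc, blk, off);
            crc = demo_crc32c_update(crc, zero, sizeof(zero));
            crc = demo_crc32c_update(crc, blk + off + sizeof(zero),
                                     len - off - sizeof(zero));
            return ~crc;
    }

    /* Fill in the identifying fields, then stamp the checksum last. */
    static void
    demo_block_stamp(uint8_t *blk, size_t len, const uint8_t uuid[16],
                     uint64_t owner, uint64_t blkno, uint64_t lsn)
    {
            struct demo_hdr hdr = { .magic = DEMO_MAGIC, .owner = owner,
                                    .blkno = blkno, .lsn = lsn };

            memcpy(hdr.uuid, uuid, sizeof(hdr.uuid));
            memcpy(blk, &hdr, sizeof(hdr));
            hdr.crc = demo_block_cksum(blk, len);
            memcpy(blk, &hdr, sizeof(hdr));
    }

    /* Returns 1 only if the block is intact and is where it claims to be. */
    static int
    demo_block_verify(const uint8_t *blk, size_t len, const uint8_t uuid[16],
                      uint64_t expected_blkno)
    {
            struct demo_hdr hdr;

            if (len < sizeof(hdr))
                    return 0;
            memcpy(&hdr, blk, sizeof(hdr));
            if (hdr.magic != DEMO_MAGIC)
                    return 0;
            if (hdr.crc != demo_block_cksum(blk, len))
                    return 0;       /* contents changed after stamping */
            if (memcmp(hdr.uuid, uuid, sizeof(hdr.uuid)) != 0)
                    return 0;       /* right block, wrong filesystem */
            if (hdr.blkno != expected_blkno)
                    return 0;       /* misdirected write */
            return hdr.owner != 0;
    }

A block stamped with ``demo_block_stamp()`` passes ``demo_block_verify()``
only when it is checked against the same filesystem UUID and block number it
was stamped with, and only while none of its bytes have changed since.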

Other metadata may have different formats for the information, but the same
level of information is generally provided. For example:

        - short btree blocks have a 32 bit owner (ag number) and a 32 bit block
          number for location. The two of these combined provide the same
          information as @owner and @blkno in the above structure, but using 8
          bytes less space on disk.

        - directory/attribute node blocks have a 16 bit magic number, and the
          header that contains the magic number has other information in it as
          well. Hence the additional metadata headers change the overall format
          of the metadata.

A typical buffer read verifier is structured as follows::

    #define XFS_FOO_CRC_OFF         offsetof(struct xfs_ondisk_hdr, crc)

    static void
    xfs_foo_read_verify(
            struct xfs_buf  *bp)
    {
            struct xfs_mount *mp = bp->b_mount;

            if ((xfs_sb_version_hascrc(&mp->m_sb) &&
                 !xfs_verify_cksum(bp->b_addr, BBTOB(bp->b_length),
                                   XFS_FOO_CRC_OFF)) ||
                !xfs_foo_verify(bp)) {
                    XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, bp->b_addr);
                    xfs_buf_ioerror(bp, EFSCORRUPTED);
            }
    }

The code ensures that the CRC is only checked if the filesystem has CRCs enabled
by checking the superblock for the feature bit, and then if the CRC verifies OK
(or is not needed) it verifies the actual contents of the block.

The verifier function will take a couple of different forms, depending on
whether the magic number can be used to determine the format of the block.
In the case it can't, the code is structured as follows::

    static bool
    xfs_foo_verify(
            struct xfs_buf          *bp)
    {
            struct xfs_mount        *mp = bp->b_mount;
            struct xfs_ondisk_hdr   *hdr = bp->b_addr;

            if (hdr->magic != cpu_to_be32(XFS_FOO_MAGIC))
                    return false;

            /* the identifying fields only exist when CRCs are enabled */
            if (xfs_sb_version_hascrc(&mp->m_sb)) {
                    if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid))
                            return false;
                    if (bp->b_bn != be64_to_cpu(hdr->blkno))
                            return false;
                    if (hdr->owner == 0)
                            return false;
            }

            /* object specific verification checks here */

            return true;
    }

If there are different magic numbers for the different formats, the verifier
will look like::

    static bool
    xfs_foo_verify(
            struct xfs_buf          *bp)
    {
            struct xfs_mount        *mp = bp->b_mount;
            struct xfs_ondisk_hdr   *hdr = bp->b_addr;

            if (hdr->magic == cpu_to_be32(XFS_FOO_CRC_MAGIC)) {
                    if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid))
                            return false;
                    if (bp->b_bn != be64_to_cpu(hdr->blkno))
                            return false;
                    if (hdr->owner == 0)
                            return false;
            } else if (hdr->magic != cpu_to_be32(XFS_FOO_MAGIC))
                    return false;

            /* object specific verification checks here */

            return true;
    }

Write verifiers are very similar to the read verifiers; they just do things in
the opposite order to the read verifiers. A typical write verifier::

    static void
    xfs_foo_write_verify(
            struct xfs_buf  *bp)
    {
            struct xfs_mount        *mp = bp->b_mount;
            struct xfs_buf_log_item *bip = bp->b_fspriv;

            if (!xfs_foo_verify(bp)) {
                    XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, bp->b_addr);
                    xfs_buf_ioerror(bp, EFSCORRUPTED);
                    return;
            }

            if (!xfs_sb_version_hascrc(&mp->m_sb))
                    return;

            if (bip) {
                    struct xfs_ondisk_hdr   *hdr = bp->b_addr;
                    hdr->lsn = cpu_to_be64(bip->bli_item.li_lsn);
            }
            xfs_update_cksum(bp->b_addr, BBTOB(bp->b_length), XFS_FOO_CRC_OFF);
    }

This will verify the internal structure of the metadata before we go any
further, detecting corruptions that have occurred as the metadata has been
modified in memory. If the metadata verifies OK, and CRCs are enabled, we then
update the LSN field (when it was last modified) and calculate the CRC on the
metadata. Once this is done, we can issue the IO.

Inodes and Dquots
=================

Inodes and dquots are special snowflakes. They have per-object CRC and
self-identifiers, but they are packed so that there are multiple objects per
buffer. Hence we do not use per-buffer verifiers to do the work of per-object
verification and CRC calculations. The per-buffer verifiers simply perform basic
identification of the buffer - that they contain inodes or dquots, and that
there are magic numbers in all the expected spots. All further CRC and
verification checks are done when each inode is read from or written back to the
buffer.
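
The "magic numbers in all the expected spots" check can be pictured with the
userspace sketch below. The names ``demo_get_be16()`` and
``demo_buf_find_bad_magic()`` are hypothetical, and the fixed object size
passed in by the caller is an assumption of the sketch; it only shows the
shape of a per-buffer identification pass and does no CRC or per-field
validation::

    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>

    #define DEMO_OBJ_MAGIC  0x494eu /* "IN"; stands in for a per-object magic */

    /* Read a big-endian 16 bit value from a byte buffer. */
    static uint16_t
    demo_get_be16(const uint8_t *p)
    {
            return (uint16_t)((p[0] << 8) | p[1]);
    }

    /*
     * Per-buffer identification only: check that every obj_size sized slot
     * starts with the expected magic number. Returns the index of the first
     * slot that does not, or -1 if every slot looks right.
     */
    static long
    demo_buf_find_bad_magic(const uint8_t *buf, size_t buf_len,
                            size_t obj_size)
    {
            assert(obj_size >= 2 && buf_len % obj_size == 0);

            for (size_t i = 0; i < buf_len / obj_size; i++) {
                    if (demo_get_be16(buf + i * obj_size) != DEMO_OBJ_MAGIC)
                            return (long)i;
            }
            return -1;
    }

Returning the index of the first bad slot, rather than a plain pass/fail,
lets a caller report which object in the buffer failed identification.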

The structure of the verifiers and the identifier checks is very similar to the
buffer code described above. The only difference is where they are called. For
example, inode read verification is done in xfs_inode_from_disk() when the inode
is first read out of the buffer and the struct xfs_inode is instantiated. The
inode is already extensively verified during writeback in xfs_iflush_int(), so
the only addition here is to add the LSN and CRC to the inode as it is copied
back into the buffer.

XXX: inode unlinked list modification doesn't recalculate the inode CRC! None of
the unlinked list modifications check or update CRCs, neither during unlink nor
log recovery. So, it's gone unnoticed until now. This won't matter immediately -
repair will probably complain about it - but it needs to be fixed.