.. SPDX-License-Identifier: GPL-2.0

Journal (jbd2)
--------------

Introduced in ext3, the ext4 filesystem employs a journal to protect the
filesystem against corruption in the case of a system crash. A small
contiguous region of disk (default 128MiB) is reserved inside the
filesystem as a place to land “important” data writes on-disk as quickly
as possible. Once the important data transaction is fully written to the
disk and flushed from the disk write cache, a record of the data being
committed is also written to the journal. At some later point in time,
the journal code writes the transactions to their final locations on
disk (this could involve a lot of seeking or a lot of small
read-write-erases) before erasing the commit record. Should the system
crash during the second slow write, the journal can be replayed all the
way to the latest commit record, guaranteeing the atomicity of whatever
gets written through the journal to the disk. The effect of this is to
guarantee that the filesystem does not become stuck midway through a
metadata update.

For performance reasons, ext4 by default only writes filesystem metadata
through the journal. This means that file data blocks are *not*
guaranteed to be in any consistent state after a crash. If this default
guarantee level (``data=ordered``) is not satisfactory, there is a mount
option to control journal behavior. If ``data=journal``, all data and
metadata are written to disk through the journal. This is slower but
safest. If ``data=writeback``, dirty data blocks are not flushed to the
disk before the metadata are written to disk through the journal.

In ``data=ordered`` mode, ext4 also supports fast commits, which
help reduce commit latency significantly. The default ``data=ordered``
mode works by logging metadata blocks to the journal.
In fast commit
mode, ext4 only stores the minimal delta needed to recreate the
affected metadata in fast commit space that is shared with JBD2.
Once the fast commit area fills up, or if a fast commit is not possible,
or if the JBD2 commit timer goes off, ext4 performs a traditional full commit.
A full commit invalidates all the fast commits that happened before
it and thus it makes the fast commit area empty for further fast
commits. This feature needs to be enabled at mkfs time.

The journal inode is typically inode 8. The first 68 bytes of the
journal inode are replicated in the ext4 superblock. The journal itself
is a normal (but hidden) file within the filesystem. The file usually
consumes an entire block group, though mke2fs tries to put it in the
middle of the disk.

All fields in jbd2 are written to disk in big-endian order. This is the
opposite of ext4, which stores its fields in little-endian order.

NOTE: Both ext4 and ocfs2 use jbd2.

The maximum size of a journal embedded in an ext4 filesystem is 2^32
blocks. jbd2 itself does not seem to care.

Layout
~~~~~~

Generally speaking, the journal has this format:

.. list-table::
   :widths: 16 48 16
   :header-rows: 1

   * - Superblock
     - descriptor\_block (data\_blocks or revocation\_block) [more data or
       revocations] commit\_block
     - [more transactions...]
   * -
     - One transaction
     -

Notice that a transaction begins with either a descriptor and some data,
or a block revocation list. A finished transaction always ends with a
commit. If there is no commit record (or the checksums don't match), the
transaction will be discarded during replay.

External Journal
~~~~~~~~~~~~~~~~

Optionally, an ext4 filesystem can be created with an external journal
device (as opposed to an internal journal, which uses a reserved inode).
In this case, on the filesystem device, ``s_journal_inum`` should be
zero and ``s_journal_uuid`` should be set.
On the journal device there
will be an ext4 super block in the usual place, with a matching UUID.
The journal superblock will be in the next full block after the
superblock.

.. list-table::
   :widths: 12 12 12 32 12
   :header-rows: 1

   * - 1024 bytes of padding
     - ext4 Superblock
     - Journal Superblock
     - descriptor\_block (data\_blocks or revocation\_block) [more data or
       revocations] commit\_block
     - [more transactions...]
   * -
     -
     -
     - One transaction
     -

Block Header
~~~~~~~~~~~~

Every block in the journal starts with a common 12-byte header
``struct journal_header_s``:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - \_\_be32
     - h\_magic
     - jbd2 magic number, 0xC03B3998.
   * - 0x4
     - \_\_be32
     - h\_blocktype
     - Description of what this block contains. See the jbd2_blocktype_ table
       below.
   * - 0x8
     - \_\_be32
     - h\_sequence
     - The transaction ID that goes with this block.

.. _jbd2_blocktype:

The journal block type can be any one of:

.. list-table::
   :widths: 16 64
   :header-rows: 1

   * - Value
     - Description
   * - 1
     - Descriptor. This block precedes a series of data blocks that were
       written through the journal during a transaction.
   * - 2
     - Block commit record. This block signifies the completion of a
       transaction.
   * - 3
     - Journal superblock, v1.
   * - 4
     - Journal superblock, v2.
   * - 5
     - Block revocation records. This speeds up recovery by enabling the
       journal to skip writing blocks that were subsequently rewritten.

Super Block
~~~~~~~~~~~

The super block for the journal is much simpler than ext4's.
The key data kept within are the size of the journal, and where to find the
start of the log of transactions.
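
Because the journal superblock, like every other journal block, begins with
the common 12-byte big-endian header, a reader can recognise it from the magic
number and block type before interpreting any other field. The following is a
minimal sketch in Python, not kernel code; the function names
``parse_journal_header`` and ``superblock_version`` are hypothetical and exist
only for illustration.

```python
import struct

JBD2_MAGIC = 0xC03B3998                # h_magic value from the block header table
SUPERBLOCK_TYPES = {3: "v1", 4: "v2"}  # h_blocktype values that mark a superblock


def parse_journal_header(block: bytes):
    """Decode the 12-byte big-endian header at the start of a journal block.

    Returns (h_blocktype, h_sequence), or None if the magic number is absent.
    """
    if len(block) < 12:
        return None
    magic, blocktype, sequence = struct.unpack_from(">III", block, 0)
    if magic != JBD2_MAGIC:
        return None
    return blocktype, sequence


def superblock_version(block: bytes):
    """Return "v1" or "v2" if the block is a journal superblock, else None."""
    header = parse_journal_header(block)
    if header is None:
        return None
    return SUPERBLOCK_TYPES.get(header[0])


# Hypothetical example: a hand-built v2 superblock header padded to 1024 bytes.
example = struct.pack(">III", JBD2_MAGIC, 4, 0) + bytes(1012)
print(superblock_version(example))
```

The sketch checks only the header; it does not validate checksums or any of the
superblock fields that follow.
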

The journal superblock is recorded as ``struct journal_superblock_s``,
which is 1024 bytes long:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * -
     -
     -
     - Static information describing the journal.
   * - 0x0
     - journal\_header\_t (12 bytes)
     - s\_header
     - Common header identifying this as a superblock.
   * - 0xC
     - \_\_be32
     - s\_blocksize
     - Journal device block size.
   * - 0x10
     - \_\_be32
     - s\_maxlen
     - Total number of blocks in this journal.
   * - 0x14
     - \_\_be32
     - s\_first
     - First block of log information.
   * -
     -
     -
     - Dynamic information describing the current state of the log.
   * - 0x18
     - \_\_be32
     - s\_sequence
     - First commit ID expected in log.
   * - 0x1C
     - \_\_be32
     - s\_start
     - Block number of the start of log. Contrary to the comments, this field
       being zero does not imply that the journal is clean!
   * - 0x20
     - \_\_be32
     - s\_errno
     - Error value, as set by jbd2\_journal\_abort().
   * -
     -
     -
     - The remaining fields are only valid in a v2 superblock.
   * - 0x24
     - \_\_be32
     - s\_feature\_compat
     - Compatible feature set. See the table jbd2_compat_ below.
   * - 0x28
     - \_\_be32
     - s\_feature\_incompat
     - Incompatible feature set. See the table jbd2_incompat_ below.
   * - 0x2C
     - \_\_be32
     - s\_feature\_ro\_compat
     - Read-only compatible feature set. There aren't any of these currently.
   * - 0x30
     - \_\_u8
     - s\_uuid[16]
     - 128-bit uuid for journal. This is compared against the copy in the ext4
       super block at mount time.
   * - 0x40
     - \_\_be32
     - s\_nr\_users
     - Number of file systems sharing this journal.
   * - 0x44
     - \_\_be32
     - s\_dynsuper
     - Location of dynamic super block copy. (Not used?)
   * - 0x48
     - \_\_be32
     - s\_max\_transaction
     - Limit of journal blocks per transaction. (Not used?)
   * - 0x4C
     - \_\_be32
     - s\_max\_trans\_data
     - Limit of data blocks per transaction. (Not used?)
   * - 0x50
     - \_\_u8
     - s\_checksum\_type
     - Checksum algorithm used for the journal. See jbd2_checksum_type_ for
       more info.
   * - 0x51
     - \_\_u8[3]
     - s\_padding2
     -
   * - 0x54
     - \_\_u32
     - s\_padding[42]
     -
   * - 0xFC
     - \_\_be32
     - s\_checksum
     - Checksum of the entire superblock, with this field set to zero.
   * - 0x100
     - \_\_u8
     - s\_users[16\*48]
     - ids of all file systems sharing the log. e2fsprogs/Linux don't allow
       shared external journals, but I imagine Lustre (or ocfs2?), which use
       the jbd2 code, might.

.. _jbd2_compat:

The journal compat features are any combination of the following:

.. list-table::
   :widths: 16 64
   :header-rows: 1

   * - Value
     - Description
   * - 0x1
     - Journal maintains checksums on the data blocks.
       (JBD2\_FEATURE\_COMPAT\_CHECKSUM)

.. _jbd2_incompat:

The journal incompat features are any combination of the following:

.. list-table::
   :widths: 16 64
   :header-rows: 1

   * - Value
     - Description
   * - 0x1
     - Journal has block revocation records. (JBD2\_FEATURE\_INCOMPAT\_REVOKE)
   * - 0x2
     - Journal can deal with 64-bit block numbers.
       (JBD2\_FEATURE\_INCOMPAT\_64BIT)
   * - 0x4
     - Journal commits asynchronously. (JBD2\_FEATURE\_INCOMPAT\_ASYNC\_COMMIT)
   * - 0x8
     - This journal uses v2 of the checksum on-disk format. Each journal
       metadata block gets its own checksum, and the block tags in the
       descriptor table contain checksums for each of the data blocks in the
       journal. (JBD2\_FEATURE\_INCOMPAT\_CSUM\_V2)
   * - 0x10
     - This journal uses v3 of the checksum on-disk format. This is the same as
       v2, but the journal block tag size is fixed regardless of the size of
       block numbers. (JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3)

.. _jbd2_checksum_type:

Journal checksum type codes are one of the following. crc32 or crc32c are the
most likely choices.

.. list-table::
   :widths: 16 64
   :header-rows: 1

   * - Value
     - Description
   * - 1
     - CRC32
   * - 2
     - MD5
   * - 3
     - SHA1
   * - 4
     - CRC32C

Descriptor Block
~~~~~~~~~~~~~~~~

The descriptor block contains an array of journal block tags that
describe the final locations of the data blocks that follow in the
journal. Descriptor blocks are open-coded instead of being completely
described by a data structure, but here is the block structure anyway.
Descriptor blocks consume at least 36 bytes, but use a full block:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - journal\_header\_t
     - (open coded)
     - Common block header.
   * - 0xC
     - struct journal\_block\_tag\_s
     - open coded array[]
     - Enough tags either to fill up the block or to describe all the data
       blocks that follow this descriptor block.

Journal block tags have any of the following formats, depending on which
journal feature and block tag flags are set.

If JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3 is set, the journal block tag is
defined as ``struct journal_block_tag3_s``, which looks like the
following. The size is 16 or 32 bytes.

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - \_\_be32
     - t\_blocknr
     - Lower 32-bits of the location of where the corresponding data block
       should end up on disk.
   * - 0x4
     - \_\_be32
     - t\_flags
     - Flags that go with the descriptor. See the table jbd2_tag_flags_ for
       more info.
   * - 0x8
     - \_\_be32
     - t\_blocknr\_high
     - Upper 32-bits of the location of where the corresponding data block
       should end up on disk. This is zero if JBD2\_FEATURE\_INCOMPAT\_64BIT is
       not enabled.
   * - 0xC
     - \_\_be32
     - t\_checksum
     - Checksum of the journal UUID, the sequence number, and the data block.
   * -
     -
     -
     - This field appears to be open coded. It always comes at the end of the
       tag, after t_checksum. This field is not present if the "same UUID" flag
       is set.
   * - 0x10
     - char
     - uuid[16]
     - A UUID to go with this tag. This field appears to be copied from the
       ``j_uuid`` field in ``struct journal_s``, but only tune2fs touches that
       field.

.. _jbd2_tag_flags:

The journal tag flags are any combination of the following:

.. list-table::
   :widths: 16 64
   :header-rows: 1

   * - Value
     - Description
   * - 0x1
     - On-disk block is escaped. The first four bytes of the data block just
       happened to match the jbd2 magic number.
   * - 0x2
     - This block has the same UUID as previous, therefore the UUID field is
       omitted.
   * - 0x4
     - The data block was deleted by the transaction. (Not used?)
   * - 0x8
     - This is the last tag in this descriptor block.

If JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3 is NOT set, the journal block tag
is defined as ``struct journal_block_tag_s``, which looks like the
following. The size is 8, 12, 24, or 28 bytes:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - \_\_be32
     - t\_blocknr
     - Lower 32-bits of the location of where the corresponding data block
       should end up on disk.
   * - 0x4
     - \_\_be16
     - t\_checksum
     - Checksum of the journal UUID, the sequence number, and the data block.
       Note that only the lower 16 bits are stored.
   * - 0x6
     - \_\_be16
     - t\_flags
     - Flags that go with the descriptor. See the table jbd2_tag_flags_ for
       more info.
   * -
     -
     -
     - This next field is only present if the super block indicates support for
       64-bit block numbers.
   * - 0x8
     - \_\_be32
     - t\_blocknr\_high
     - Upper 32-bits of the location of where the corresponding data block
       should end up on disk.
   * -
     -
     -
     - This field appears to be open coded. It always comes at the end of the
       tag, after t_flags or t_blocknr_high. This field is not present if the
       "same UUID" flag is set.
   * - 0x8 or 0xC
     - char
     - uuid[16]
     - A UUID to go with this tag. This field appears to be copied from the
       ``j_uuid`` field in ``struct journal_s``, but only tune2fs touches that
       field.

If JBD2\_FEATURE\_INCOMPAT\_CSUM\_V2 or
JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3 is set, the end of the block is a
``struct jbd2_journal_block_tail``, which looks like this:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - \_\_be32
     - t\_checksum
     - Checksum of the journal UUID + the descriptor block, with this field set
       to zero.

Data Block
~~~~~~~~~~

In general, the data blocks being written to disk through the journal
are written verbatim into the journal file after the descriptor block.
However, if the first four bytes of the block match the jbd2 magic
number then those four bytes are replaced with zeroes and the “escaped”
flag is set in the descriptor block tag.

Revocation Block
~~~~~~~~~~~~~~~~

A revocation block is used to prevent replay of a block in an earlier
transaction. This is used to mark blocks that were journalled at one
time but are no longer journalled.
Typically this happens if a metadata
block is freed and re-allocated as a file data block; in this case, a
journal replay after the file block was written to disk will cause
corruption.

**NOTE**: This mechanism is NOT used to express “this journal block is
superseded by this other journal block”, as the author (djwong)
mistakenly thought. Any block being added to a transaction will cause
the removal of all existing revocation records for that block.

Revocation blocks are described in
``struct jbd2_journal_revoke_header_s``, are at least 16 bytes in
length, but use a full block:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - journal\_header\_t
     - r\_header
     - Common block header.
   * - 0xC
     - \_\_be32
     - r\_count
     - Number of bytes used in this block.
   * - 0x10
     - \_\_be32 or \_\_be64
     - blocks[0]
     - Blocks to revoke.

After r\_count is a linear array of block numbers that are effectively
revoked by this transaction. The size of each block number is 8 bytes if
the superblock advertises 64-bit block number support, or 4 bytes
otherwise.

If JBD2\_FEATURE\_INCOMPAT\_CSUM\_V2 or
JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3 are set, the end of the revocation
block is a ``struct jbd2_journal_revoke_tail``, which has this format:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - \_\_be32
     - r\_checksum
     - Checksum of the journal UUID + revocation block

Commit Block
~~~~~~~~~~~~

The commit block is a sentry that indicates that a transaction has been
completely written to the journal. Once this commit block reaches the
journal, the data stored with this transaction can be written to their
final locations on disk.
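
The role of the commit block as an end-of-transaction marker can be sketched as
follows. This is a simplified illustration in Python, not the kernel's recovery
code: it assumes the block headers have already been decoded into
``(h_blocktype, h_sequence)`` pairs, that transaction IDs increase by one from
the superblock's ``s_sequence``, and it ignores checksums and sequence
wraparound. The function name ``replayable_transactions`` is hypothetical.

```python
COMMIT_BLOCK = 2  # h_blocktype value for a block commit record


def replayable_transactions(headers, first_sequence):
    """Return, in order, the transaction IDs that end with a commit block.

    headers: iterable of (h_blocktype, h_sequence) pairs in log order.
    first_sequence: first commit ID expected in the log (s_sequence).

    The walk stops at the first transaction without a commit record,
    because such a transaction would be discarded during replay.
    """
    committed = {seq for blocktype, seq in headers if blocktype == COMMIT_BLOCK}
    result = []
    seq = first_sequence
    while seq in committed:
        result.append(seq)
        seq += 1
    return result


# Hypothetical log: transactions 10 and 11 are committed, 12 is incomplete.
log = [(1, 10), (2, 10), (5, 11), (1, 11), (2, 11), (1, 12)]
print(replayable_transactions(log, 10))
```

In this example transaction 12 has a descriptor block but no commit block, so
it is left out of the result.
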

The commit block is described by ``struct commit_header``, which is 60
bytes long (but uses a full block):

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - journal\_header\_s
     - (open coded)
     - Common block header.
   * - 0xC
     - unsigned char
     - h\_chksum\_type
     - The type of checksum to use to verify the integrity of the data blocks
       in the transaction. See jbd2_checksum_type_ for more info.
   * - 0xD
     - unsigned char
     - h\_chksum\_size
     - The number of bytes used by the checksum. Most likely 4.
   * - 0xE
     - unsigned char
     - h\_padding[2]
     -
   * - 0x10
     - \_\_be32
     - h\_chksum[JBD2\_CHECKSUM\_BYTES]
     - 32 bytes of space to store checksums. If
       JBD2\_FEATURE\_INCOMPAT\_CSUM\_V2 or JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3
       are set, the first ``__be32`` is the checksum of the journal UUID and
       the entire commit block, with this field zeroed. If
       JBD2\_FEATURE\_COMPAT\_CHECKSUM is set, the first ``__be32`` is the
       crc32 of all the blocks already written to the transaction.
   * - 0x30
     - \_\_be64
     - h\_commit\_sec
     - The time that the transaction was committed, in seconds since the epoch.
   * - 0x38
     - \_\_be32
     - h\_commit\_nsec
     - Nanoseconds component of the above timestamp.

Fast commits
~~~~~~~~~~~~

The fast commit area is organized as a log of tag-length-value (TLV) entries.
Each TLV starts with a ``struct ext4_fc_tl``, which stores the tag and the
length of the entire field. It is followed by a variable-length, tag-specific
value. Here is the list of supported tags and their meanings:

.. list-table::
   :widths: 8 20 20 32
   :header-rows: 1

   * - Tag
     - Meaning
     - Value struct
     - Description
   * - EXT4_FC_TAG_HEAD
     - Fast commit area header
     - ``struct ext4_fc_head``
     - Stores the TID of the transaction after which these fast commits should
       be applied.
   * - EXT4_FC_TAG_ADD_RANGE
     - Add extent to inode
     - ``struct ext4_fc_add_range``
     - Stores the inode number and the extent to be added to this inode.
   * - EXT4_FC_TAG_DEL_RANGE
     - Remove logical offsets from inode
     - ``struct ext4_fc_del_range``
     - Stores the inode number and the logical offset range that needs to be
       removed.
   * - EXT4_FC_TAG_CREAT
     - Create directory entry for a newly created file
     - ``struct ext4_fc_dentry_info``
     - Stores the parent inode number, inode number and directory entry of the
       newly created file.
   * - EXT4_FC_TAG_LINK
     - Link a directory entry to an inode
     - ``struct ext4_fc_dentry_info``
     - Stores the parent inode number, inode number and directory entry.
   * - EXT4_FC_TAG_UNLINK
     - Unlink a directory entry of an inode
     - ``struct ext4_fc_dentry_info``
     - Stores the parent inode number, inode number and directory entry.
   * - EXT4_FC_TAG_PAD
     - Padding (unused area)
     - None
     - Unused bytes in the fast commit area.
   * - EXT4_FC_TAG_TAIL
     - Mark the end of a fast commit
     - ``struct ext4_fc_tail``
     - Stores the TID of the commit and the CRC of the fast commit that this
       tag terminates.
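
Walking a TLV log of this kind can be sketched as below. This is a generic
illustration in Python rather than ext4's replay code. The text above does not
give the layout of ``struct ext4_fc_tl``, so the sketch assumes a 16-bit
little-endian tag followed by a 16-bit little-endian length that counts only
the value bytes; both points are assumptions to check against the kernel
headers. The tag numbers in the demo buffer are made up, and the name
``iter_tlvs`` is hypothetical.

```python
import struct

# Assumed header layout (not stated in the text above):
# little-endian 16-bit tag, then little-endian 16-bit length of the value.
TL_HEADER = struct.Struct("<HH")


def iter_tlvs(area: bytes):
    """Yield (tag, value) pairs from a buffer of tag-length-value entries."""
    offset = 0
    while offset + TL_HEADER.size <= len(area):
        tag, length = TL_HEADER.unpack_from(area, offset)
        start = offset + TL_HEADER.size
        end = start + length
        if end > len(area):
            break  # truncated entry: stop rather than read past the buffer
        yield tag, area[start:end]
        offset = end


# Demo buffer with made-up tag numbers 100 and 101 (not real ext4 tag values).
demo = TL_HEADER.pack(100, 4) + b"abcd" + TL_HEADER.pack(101, 0)
entries = list(iter_tlvs(demo))
print(entries)
```

A real reader would additionally dispatch on the tag to decode each value
struct and would verify the CRC stored with the tail tag.
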