.. SPDX-License-Identifier: GPL-2.0

Journal (jbd2)
--------------

The ext4 filesystem employs a journal, first introduced in ext3, to
protect the filesystem against corruption in the case of a system crash.
A small contiguous region of disk (default 128MiB) is reserved inside the
filesystem as a place to land “important” data writes on-disk as quickly
as possible. Once the important data transaction is fully written to the
disk and flushed from the disk write cache, a record of the data being
committed is also written to the journal. At some later point in time,
the journal code writes the transactions to their final locations on
disk (this could involve a lot of seeking or a lot of small
read-write-erases) before erasing the commit record. Should the system
crash during the second slow write, the journal can be replayed all the
way to the latest commit record, guaranteeing the atomicity of whatever
gets written through the journal to the disk. The effect of this is to
guarantee that the filesystem does not become stuck midway through a
metadata update.

For performance reasons, ext4 by default only writes filesystem metadata
through the journal. This means that file data blocks are *not*
guaranteed to be in any consistent state after a crash. If this default
guarantee level (``data=ordered``) is not satisfactory, there is a mount
option to control journal behavior. If ``data=journal``, all data and
metadata are written to disk through the journal. This is slower but
safest. If ``data=writeback``, dirty data blocks are not flushed to the
disk before the metadata are written to disk through the journal.

In ``data=ordered`` mode, ext4 also supports fast commits, which
help reduce commit latency significantly. The default ``data=ordered``
mode works by logging metadata blocks to the journal. In fast commit
mode, ext4 only stores the minimal delta needed to recreate the
affected metadata in fast commit space that is shared with JBD2.
Once the fast commit area fills up, or if a fast commit is not possible,
or if the JBD2 commit timer goes off, ext4 performs a traditional full
commit. A full commit invalidates all the fast commits that happened
before it, and thus empties the fast commit area for further fast
commits. This feature needs to be enabled at mkfs time.

The journal inode is typically inode 8. The first 68 bytes of the
journal inode are replicated in the ext4 superblock. The journal itself
is a normal (but hidden) file within the filesystem. The file usually
consumes an entire block group, though mke2fs tries to put it in the
middle of the disk.

All fields in jbd2 are written to disk in big-endian order. This is the
opposite of ext4, which is little-endian.

NOTE: Both ext4 and ocfs2 use jbd2.

The maximum size of a journal embedded in an ext4 filesystem is 2^32
blocks. jbd2 itself does not seem to care.

Layout
~~~~~~

Generally speaking, the journal has this format:

.. list-table::
   :widths: 16 48 16
   :header-rows: 1

   * - Superblock
     - descriptor\_block (data\_blocks or revocation\_block) [more data or
       revocations] commit\_block
     - [more transactions...]
   * -
     - One transaction
     -

Notice that a transaction begins with either a descriptor and some data,
or a block revocation list. A finished transaction always ends with a
commit. If there is no commit record (or the checksums don't match), the
transaction will be discarded during replay.
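
The replay rule above can be sketched roughly as follows. This is a toy
model, not the kernel's recovery code: the log is represented as a list of
``(block_type, sequence)`` pairs for the blocks that carry a header, data
blocks are left out, checksums are ignored, and the helper name
``committed_transactions`` is hypothetical.

```python
# Block type codes as listed in the block type table of this document.
DESCRIPTOR, COMMIT, REVOKE = 1, 2, 5


def committed_transactions(blocks):
    """Return the sequence numbers of transactions that end in a commit.

    `blocks` is a list of (block_type, sequence) pairs in log order.
    Scanning stops at the first transaction that has no commit record,
    because nothing from that point on may be replayed.
    """
    done = []
    current = None  # sequence number of the transaction being scanned
    for block_type, seq in blocks:
        if current is not None and seq != current:
            break  # the previous transaction never committed
        if block_type in (DESCRIPTOR, REVOKE):
            current = seq
        elif block_type == COMMIT:
            done.append(seq)
            current = None
    return done


log = [(1, 7), (2, 7), (5, 8), (1, 8), (2, 8), (1, 9)]
replayable = committed_transactions(log)
```

In this sample log, transactions 7 and 8 are complete, while transaction 9
has a descriptor but no commit block and is therefore dropped.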

External Journal
~~~~~~~~~~~~~~~~

Optionally, an ext4 filesystem can be created with an external journal
device (as opposed to an internal journal, which uses a reserved inode).
In this case, on the filesystem device, ``s_journal_inum`` should be
zero and ``s_journal_uuid`` should be set. On the journal device there
will be an ext4 super block in the usual place, with a matching UUID.
The journal superblock will be in the next full block after the
superblock.

.. list-table::
   :widths: 12 12 12 32 12
   :header-rows: 1

   * - 1024 bytes of padding
     - ext4 Superblock
     - Journal Superblock
     - descriptor\_block (data\_blocks or revocation\_block) [more data or
       revocations] commit\_block
     - [more transactions...]
   * -
     -
     -
     - One transaction
     -

Block Header
~~~~~~~~~~~~

Every block in the journal starts with a common 12-byte header
``struct journal_header_s``:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - \_\_be32
     - h\_magic
     - jbd2 magic number, 0xC03B3998.
   * - 0x4
     - \_\_be32
     - h\_blocktype
     - Description of what this block contains. See the jbd2_blocktype_ table
       below.
   * - 0x8
     - \_\_be32
     - h\_sequence
     - The transaction ID that goes with this block.

.. _jbd2_blocktype:

The journal block type can be any one of:

.. list-table::
   :widths: 16 64
   :header-rows: 1

   * - Value
     - Description
   * - 1
     - Descriptor. This block precedes a series of data blocks that were
       written through the journal during a transaction.
   * - 2
     - Block commit record. This block signifies the completion of a
       transaction.
   * - 3
     - Journal superblock, v1.
   * - 4
     - Journal superblock, v2.
   * - 5
     - Block revocation records. This speeds up recovery by enabling the
       journal to skip writing blocks that were subsequently rewritten.
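
As an illustration of the header and block type tables above, the following
sketch decodes the 12-byte header from a raw journal block using big-endian
field order. The function name ``parse_header`` and the sample block are
made up for this example.

```python
import struct

JBD2_MAGIC = 0xC03B3998
BLOCK_TYPES = {
    1: "descriptor",
    2: "commit",
    3: "superblock v1",
    4: "superblock v2",
    5: "revoke",
}

# Three big-endian 32-bit fields: h_magic, h_blocktype, h_sequence.
HEADER = struct.Struct(">III")


def parse_header(block):
    """Decode the common header; return None if the magic does not match."""
    magic, blocktype, sequence = HEADER.unpack_from(block, 0)
    if magic != JBD2_MAGIC:
        return None
    return {
        "type": BLOCK_TYPES.get(blocktype, "unknown"),
        "sequence": sequence,
    }


# A fabricated 4096-byte commit block for transaction 42.
sample = bytes.fromhex("c03b3998" "00000002" "0000002a") + bytes(4084)
header = parse_header(sample)
```

A block whose first four bytes are not the magic number is not a journal
metadata block, so the sketch returns ``None`` for it rather than guessing.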

Super Block
~~~~~~~~~~~

The super block for the journal is much simpler than ext4's.
The key data kept within are the size of the journal and where to find
the start of the log of transactions.

The journal superblock is recorded as ``struct journal_superblock_s``,
which is 1024 bytes long:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * -
     -
     -
     - Static information describing the journal.
   * - 0x0
     - journal\_header\_t (12 bytes)
     - s\_header
     - Common header identifying this as a superblock.
   * - 0xC
     - \_\_be32
     - s\_blocksize
     - Journal device block size.
   * - 0x10
     - \_\_be32
     - s\_maxlen
     - Total number of blocks in this journal.
   * - 0x14
     - \_\_be32
     - s\_first
     - First block of log information.
   * -
     -
     -
     - Dynamic information describing the current state of the log.
   * - 0x18
     - \_\_be32
     - s\_sequence
     - First commit ID expected in log.
   * - 0x1C
     - \_\_be32
     - s\_start
     - Block number of the start of log. Contrary to the comments, this field
       being zero does not imply that the journal is clean!
   * - 0x20
     - \_\_be32
     - s\_errno
     - Error value, as set by jbd2\_journal\_abort().
   * -
     -
     -
     - The remaining fields are only valid in a v2 superblock.
   * - 0x24
     - \_\_be32
     - s\_feature\_compat
     - Compatible feature set. See the table jbd2_compat_ below.
   * - 0x28
     - \_\_be32
     - s\_feature\_incompat
     - Incompatible feature set. See the table jbd2_incompat_ below.
   * - 0x2C
     - \_\_be32
     - s\_feature\_ro\_compat
     - Read-only compatible feature set. There aren't any of these currently.
   * - 0x30
     - \_\_u8
     - s\_uuid[16]
     - 128-bit uuid for journal. This is compared against the copy in the ext4
       super block at mount time.
   * - 0x40
     - \_\_be32
     - s\_nr\_users
     - Number of file systems sharing this journal.
   * - 0x44
     - \_\_be32
     - s\_dynsuper
     - Location of dynamic super block copy. (Not used?)
   * - 0x48
     - \_\_be32
     - s\_max\_transaction
     - Limit of journal blocks per transaction. (Not used?)
   * - 0x4C
     - \_\_be32
     - s\_max\_trans\_data
     - Limit of data blocks per transaction. (Not used?)
   * - 0x50
     - \_\_u8
     - s\_checksum\_type
     - Checksum algorithm used for the journal.  See jbd2_checksum_type_ for
       more info.
   * - 0x51
     - \_\_u8[3]
     - s\_padding2
     -
   * - 0x54
     - \_\_u32
     - s\_padding[42]
     -
   * - 0xFC
     - \_\_be32
     - s\_checksum
     - Checksum of the entire superblock, with this field set to zero.
   * - 0x100
     - \_\_u8
     - s\_users[16\*48]
     - IDs of all file systems sharing the log. e2fsprogs/Linux don't allow
       shared external journals, but I imagine Lustre (or ocfs2?), which use
       the jbd2 code, might.
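
For readers who want to inspect a journal superblock by hand, here is a
minimal sketch that decodes a few of the fields from the table above,
assuming the offsets listed there. The function name
``parse_journal_superblock`` and the sample buffer are invented for
illustration, and no checksum verification is attempted.

```python
import struct
import uuid

# s_header (3 x __be32) followed by s_blocksize .. s_feature_ro_compat,
# i.e. offsets 0x0 through 0x2F; all are big-endian 32-bit values.
SB_FIXED = struct.Struct(">12I")
SB_NAMES = (
    "h_magic", "h_blocktype", "h_sequence",
    "s_blocksize", "s_maxlen", "s_first",
    "s_sequence", "s_start", "s_errno",
    "s_feature_compat", "s_feature_incompat", "s_feature_ro_compat",
)


def parse_journal_superblock(buf):
    """Decode selected fields of a 1024-byte journal superblock."""
    if len(buf) < 1024:
        raise ValueError("journal superblock is 1024 bytes long")
    sb = dict(zip(SB_NAMES, SB_FIXED.unpack_from(buf, 0)))
    sb["s_uuid"] = uuid.UUID(bytes=bytes(buf[0x30:0x40]))
    (sb["s_nr_users"],) = struct.unpack_from(">I", buf, 0x40)
    sb["s_checksum_type"] = buf[0x50]
    (sb["s_checksum"],) = struct.unpack_from(">I", buf, 0xFC)
    return sb


# Fabricated v2 superblock: 4096-byte blocks, 32768 blocks, crc32c.
raw = bytearray(1024)
struct.pack_into(">12I", raw, 0,
                 0xC03B3998, 4, 0, 4096, 32768, 1, 17, 0, 0, 0, 0x12, 0)
raw[0x50] = 4
sb = parse_journal_superblock(bytes(raw))
```

Note that, per the table, the fields from ``s_feature_compat`` onward are
only meaningful when the header block type says this is a v2 superblock.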

.. _jbd2_compat:

The journal compat features are any combination of the following:

.. list-table::
   :widths: 16 64
   :header-rows: 1

   * - Value
     - Description
   * - 0x1
     - Journal maintains checksums on the data blocks.
       (JBD2\_FEATURE\_COMPAT\_CHECKSUM)

.. _jbd2_incompat:

The journal incompat features are any combination of the following:

.. list-table::
   :widths: 16 64
   :header-rows: 1

   * - Value
     - Description
   * - 0x1
     - Journal has block revocation records. (JBD2\_FEATURE\_INCOMPAT\_REVOKE)
   * - 0x2
     - Journal can deal with 64-bit block numbers.
       (JBD2\_FEATURE\_INCOMPAT\_64BIT)
   * - 0x4
     - Journal commits asynchronously. (JBD2\_FEATURE\_INCOMPAT\_ASYNC\_COMMIT)
   * - 0x8
     - This journal uses v2 of the checksum on-disk format. Each journal
       metadata block gets its own checksum, and the block tags in the
       descriptor table contain checksums for each of the data blocks in the
       journal. (JBD2\_FEATURE\_INCOMPAT\_CSUM\_V2)
   * - 0x10
     - This journal uses v3 of the checksum on-disk format. This is the same as
       v2, but the journal block tag size is fixed regardless of the size of
       block numbers. (JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3)
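
The incompat bits above can be turned into readable names with a small
helper. This is only a sketch: the short names are the
``JBD2_FEATURE_INCOMPAT_`` suffixes from the table, and ``decode_incompat``
is a hypothetical function, not part of any existing tool.

```python
INCOMPAT_FLAGS = {
    0x1: "REVOKE",
    0x2: "64BIT",
    0x4: "ASYNC_COMMIT",
    0x8: "CSUM_V2",
    0x10: "CSUM_V3",
}


def decode_incompat(value):
    """Split s_feature_incompat into known flag names and leftover bits."""
    names = [name for bit, name in INCOMPAT_FLAGS.items() if value & bit]
    known = 0
    for bit in INCOMPAT_FLAGS:
        known |= bit
    return names, value & ~known


names, unknown = decode_incompat(0x13)
```

Returning the leftover bits separately is deliberate: a reader that finds
incompat bits it does not understand should not attempt to interpret the
journal.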

.. _jbd2_checksum_type:

Journal checksum type codes are one of the following.  crc32 or crc32c are the
most likely choices.

.. list-table::
   :widths: 16 64
   :header-rows: 1

   * - Value
     - Description
   * - 1
     - CRC32
   * - 2
     - MD5
   * - 3
     - SHA1
   * - 4
     - CRC32C

Descriptor Block
~~~~~~~~~~~~~~~~

The descriptor block contains an array of journal block tags that
describe the final locations of the data blocks that follow in the
journal. Descriptor blocks are open-coded instead of being completely
described by a data structure, but here is the block structure anyway.
Descriptor blocks consume at least 36 bytes, but use a full block:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - journal\_header\_t
     - (open coded)
     - Common block header.
   * - 0xC
     - struct journal\_block\_tag\_s
     - open coded array[]
     - Enough tags either to fill up the block or to describe all the data
       blocks that follow this descriptor block.

Journal block tags have any of the following formats, depending on which
journal feature and block tag flags are set.

If JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3 is set, the journal block tag is
defined as ``struct journal_block_tag3_s``, which looks like the
following. The size is 16 or 32 bytes.

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - \_\_be32
     - t\_blocknr
     - Lower 32-bits of the location of where the corresponding data block
       should end up on disk.
   * - 0x4
     - \_\_be32
     - t\_flags
     - Flags that go with the descriptor. See the table jbd2_tag_flags_ for
       more info.
   * - 0x8
     - \_\_be32
     - t\_blocknr\_high
     - Upper 32-bits of the location of where the corresponding data block
       should end up on disk. This is zero if JBD2\_FEATURE\_INCOMPAT\_64BIT is
       not enabled.
   * - 0xC
     - \_\_be32
     - t\_checksum
     - Checksum of the journal UUID, the sequence number, and the data block.
   * -
     -
     -
     - This field appears to be open coded. It always comes at the end of the
       tag, after t_checksum. This field is not present if the "same UUID" flag
       is set.
   * - 0x10
     - char
     - uuid[16]
     - A UUID to go with this tag. This field appears to be copied from the
       ``j_uuid`` field in ``struct journal_s``, but only tune2fs touches that
       field.

.. _jbd2_tag_flags:

The journal tag flags are any combination of the following:

.. list-table::
   :widths: 16 64
   :header-rows: 1

   * - Value
     - Description
   * - 0x1
     - On-disk block is escaped. The first four bytes of the data block just
       happened to match the jbd2 magic number.
   * - 0x2
     - This block has the same UUID as previous, therefore the UUID field is
       omitted.
   * - 0x4
     - The data block was deleted by the transaction. (Not used?)
   * - 0x8
     - This is the last tag in this descriptor block.

If JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3 is NOT set, the journal block tag
is defined as ``struct journal_block_tag_s``, which looks like the
following. The size is 8, 12, 24, or 28 bytes:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - \_\_be32
     - t\_blocknr
     - Lower 32-bits of the location of where the corresponding data block
       should end up on disk.
   * - 0x4
     - \_\_be16
     - t\_checksum
     - Checksum of the journal UUID, the sequence number, and the data block.
       Note that only the lower 16 bits are stored.
   * - 0x6
     - \_\_be16
     - t\_flags
     - Flags that go with the descriptor. See the table jbd2_tag_flags_ for
       more info.
   * -
     -
     -
     - This next field is only present if the super block indicates support for
       64-bit block numbers.
   * - 0x8
     - \_\_be32
     - t\_blocknr\_high
     - Upper 32-bits of the location of where the corresponding data block
       should end up on disk.
   * -
     -
     -
     - This field appears to be open coded. It always comes at the end of the
       tag, after t_flags or t_blocknr_high. This field is not present if the
       "same UUID" flag is set.
   * - 0x8 or 0xC
     - char
     - uuid[16]
     - A UUID to go with this tag. This field appears to be copied from the
       ``j_uuid`` field in ``struct journal_s``, but only tune2fs touches that
       field.
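
The tag sizes quoted for the two formats follow directly from the feature
bits and the "same UUID" flag. A short sketch of that arithmetic, where
``tag_size`` is a hypothetical helper name:

```python
FLAG_SAME_UUID = 0x2  # "same UUID" tag flag from the tag flags table


def tag_size(csum_v3, is_64bit, flags):
    """Return the on-disk size in bytes of one journal block tag."""
    if csum_v3:
        size = 16   # t_blocknr, t_flags, t_blocknr_high, t_checksum
    elif is_64bit:
        size = 12   # t_blocknr, t_checksum, t_flags, t_blocknr_high
    else:
        size = 8    # t_blocknr, t_checksum, t_flags
    if not flags & FLAG_SAME_UUID:
        size += 16  # trailing uuid[16]
    return size


v3_sizes = {tag_size(True, b64, f) for b64 in (False, True) for f in (0, 2)}
old_sizes = {tag_size(False, b64, f) for b64 in (False, True) for f in (0, 2)}
```

Enumerating every combination gives 16 or 32 bytes for the v3 format and
8, 12, 24, or 28 bytes otherwise, matching the sizes stated above.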

If JBD2\_FEATURE\_INCOMPAT\_CSUM\_V2 or
JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3 is set, the end of the block is a
``struct jbd2_journal_block_tail``, which looks like this:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - \_\_be32
     - t\_checksum
     - Checksum of the journal UUID + the descriptor block, with this field set
       to zero.
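
Putting the descriptor block pieces together, the sketch below walks the
tag array of a descriptor block. It assumes the CSUM\_V3 tag format only
and does not verify any checksum; ``iter_tags_v3`` and the sample block are
invented for this example.

```python
import struct

FLAG_SAME_UUID = 0x2
FLAG_LAST_TAG = 0x8
# t_blocknr, t_flags, t_blocknr_high, t_checksum (all big-endian).
TAG3 = struct.Struct(">IIII")


def iter_tags_v3(block):
    """Yield (final_block_number, flags) for each tag in the block."""
    offset = 12             # skip the common block header
    limit = len(block) - 4  # the last 4 bytes hold the block tail
    while offset + TAG3.size <= limit:
        lo, flags, hi, _csum = TAG3.unpack_from(block, offset)
        offset += TAG3.size
        if not flags & FLAG_SAME_UUID:
            offset += 16    # skip the uuid that follows this tag
        yield (hi << 32) | lo, flags
        if flags & FLAG_LAST_TAG:
            break


# Fabricated descriptor block holding two tags.
blk = bytearray(1024)
struct.pack_into(">III", blk, 0, 0xC03B3998, 1, 7)
TAG3.pack_into(blk, 12, 100, 0, 0, 0)  # first tag, followed by a uuid
TAG3.pack_into(blk, 44, 5, 0xA, 1, 0)  # "same UUID" + "last tag"
tags = list(iter_tags_v3(bytes(blk)))
```

Each yielded tag corresponds, in order, to one data block following the
descriptor block in the journal.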

Data Block
~~~~~~~~~~

In general, the data blocks being written to disk through the journal
are written verbatim into the journal file after the descriptor block.
However, if the first four bytes of the block match the jbd2 magic
number then those four bytes are replaced with zeroes and the “escaped”
flag is set in the descriptor block tag.
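
The escaping rule can be illustrated with a pair of helpers. This is a
sketch of the rule as described above, with hypothetical function names,
not a copy of the jbd2 implementation.

```python
import struct

JBD2_MAGIC_BYTES = struct.pack(">I", 0xC03B3998)
FLAG_ESCAPE = 0x1  # "escaped" tag flag


def escape_for_journal(data):
    """Return (journal_copy, tag_flags) for a block headed to the log."""
    if data[:4] == JBD2_MAGIC_BYTES:
        return b"\x00\x00\x00\x00" + data[4:], FLAG_ESCAPE
    return data, 0


def unescape_on_replay(journal_copy, tag_flags):
    """Restore the magic bytes before writing to the final location."""
    if tag_flags & FLAG_ESCAPE:
        return JBD2_MAGIC_BYTES + journal_copy[4:]
    return journal_copy


original = JBD2_MAGIC_BYTES + b"payload"
logged, flags = escape_for_journal(original)
restored = unescape_on_replay(logged, flags)
```

Without the escape, a data block that happens to start with the magic
number could be mistaken for a journal metadata block during a scan.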

Revocation Block
~~~~~~~~~~~~~~~~

A revocation block is used to prevent replay of a block in an earlier
transaction. This is used to mark blocks that were journalled at one
time but are no longer journalled. Typically this happens if a metadata
block is freed and re-allocated as a file data block; in this case, a
journal replay after the file block was written to disk will cause
corruption.

**NOTE**: This mechanism is NOT used to express “this journal block is
superseded by this other journal block”, as the author (djwong)
mistakenly thought. Any block being added to a transaction will cause
the removal of all existing revocation records for that block.

Revocation blocks are described in
``struct jbd2_journal_revoke_header_s``, are at least 16 bytes in
length, but use a full block:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - journal\_header\_t
     - r\_header
     - Common block header.
   * - 0xC
     - \_\_be32
     - r\_count
     - Number of bytes used in this block.
   * - 0x10
     - \_\_be32 or \_\_be64
     - blocks[0]
     - Blocks to revoke.

After r\_count is a linear array of block numbers that are effectively
revoked by this transaction. The size of each block number is 8 bytes if
the superblock advertises 64-bit block number support, or 4 bytes
otherwise.
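
Reading the revoked block numbers back out might look like the sketch
below. It assumes that ``r_count`` counts bytes from the start of the
block, so that it includes the 16-byte header; ``parse_revoke_block`` is a
hypothetical helper name.

```python
import struct


def parse_revoke_block(block, is_64bit):
    """Return the list of block numbers stored in a revocation block."""
    (r_count,) = struct.unpack_from(">I", block, 0xC)
    width, fmt = (8, ">Q") if is_64bit else (4, ">I")
    revoked = []
    # Entries start right after the header and end at byte r_count.
    for offset in range(0x10, r_count, width):
        (blocknr,) = struct.unpack_from(fmt, block, offset)
        revoked.append(blocknr)
    return revoked


# Fabricated block revoking blocks 1234 and 5678 (32-bit entries).
blk = bytearray(1024)
struct.pack_into(">IIII", blk, 0, 0xC03B3998, 5, 9, 0x10 + 2 * 4)
struct.pack_into(">II", blk, 0x10, 1234, 5678)
revoked = parse_revoke_block(bytes(blk), is_64bit=False)
```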

If JBD2\_FEATURE\_INCOMPAT\_CSUM\_V2 or
JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3 is set, the end of the revocation
block is a ``struct jbd2_journal_revoke_tail``, which has this format:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - \_\_be32
     - r\_checksum
     - Checksum of the journal UUID + revocation block.

Commit Block
~~~~~~~~~~~~

The commit block is a sentry that indicates that a transaction has been
completely written to the journal. Once this commit block reaches the
journal, the data stored with this transaction can be written to their
final locations on disk.

The commit block is described by ``struct commit_header``, whose fields
below occupy 60 bytes (but which uses a full block):

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Type
     - Name
     - Description
   * - 0x0
     - journal\_header\_s
     - (open coded)
     - Common block header.
   * - 0xC
     - unsigned char
     - h\_chksum\_type
     - The type of checksum to use to verify the integrity of the data blocks
       in the transaction. See jbd2_checksum_type_ for more info.
   * - 0xD
     - unsigned char
     - h\_chksum\_size
     - The number of bytes used by the checksum. Most likely 4.
   * - 0xE
     - unsigned char
     - h\_padding[2]
     -
   * - 0x10
     - \_\_be32
     - h\_chksum[JBD2\_CHECKSUM\_BYTES]
     - 32 bytes of space to store checksums. If
       JBD2\_FEATURE\_INCOMPAT\_CSUM\_V2 or JBD2\_FEATURE\_INCOMPAT\_CSUM\_V3
       is set, the first ``__be32`` is the checksum of the journal UUID and
       the entire commit block, with this field zeroed. If
       JBD2\_FEATURE\_COMPAT\_CHECKSUM is set, the first ``__be32`` is the
       crc32 of all the blocks already written to the transaction.
   * - 0x30
     - \_\_be64
     - h\_commit\_sec
     - The time that the transaction was committed, in seconds since the epoch.
   * - 0x38
     - \_\_be32
     - h\_commit\_nsec
     - Nanoseconds component of the above timestamp.
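
The commit block layout in the table above maps onto a single big-endian
``struct`` format string. The following sketch decodes it. The name
``parse_commit_block`` and the sample values are made up, and no checksum
is verified.

```python
import struct
from datetime import datetime, timezone

# Header (3 x __be32), h_chksum_type, h_chksum_size, 2 padding bytes,
# 8 x __be32 of checksum space, then 64-bit seconds and 32-bit
# nanoseconds. ">" selects big-endian with no implicit alignment.
COMMIT = struct.Struct(">IIIBB2x8IQI")


def parse_commit_block(block):
    """Decode the fields of a commit block listed in the table above."""
    fields = COMMIT.unpack_from(block, 0)
    return {
        "sequence": fields[2],
        "chksum_type": fields[3],
        "chksum_size": fields[4],
        "chksum": fields[5],  # first __be32 of h_chksum
        "commit_time": datetime.fromtimestamp(fields[13], tz=timezone.utc),
        "commit_nsec": fields[14],
    }


# Fabricated commit block for transaction 42.
blk = bytearray(4096)
COMMIT.pack_into(blk, 0, 0xC03B3998, 2, 42, 4, 4,
                 0xDEADBEEF, 0, 0, 0, 0, 0, 0, 0,
                 1600000000, 500)
commit = parse_commit_block(bytes(blk))
```

The format string covers offsets 0x0 through 0x3B, so ``COMMIT.size`` is a
quick check that the offsets in the table add up.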

Fast commits
~~~~~~~~~~~~

The fast commit area is organized as a log of tag-length-value (TLV)
records. Each TLV begins with a ``struct ext4_fc_tl``, which stores the
tag and the length of the entire field. It is followed by a
variable-length, tag-specific value. Here is the list of supported tags
and their meanings:

.. list-table::
   :widths: 8 20 20 32
   :header-rows: 1

   * - Tag
     - Meaning
     - Value struct
     - Description
   * - EXT4_FC_TAG_HEAD
     - Fast commit area header
     - ``struct ext4_fc_head``
     - Stores the TID of the transaction after which these fast commits should
       be applied.
   * - EXT4_FC_TAG_ADD_RANGE
     - Add extent to inode
     - ``struct ext4_fc_add_range``
     - Stores the inode number and the extent to be added to this inode.
   * - EXT4_FC_TAG_DEL_RANGE
     - Remove logical offsets from inode
     - ``struct ext4_fc_del_range``
     - Stores the inode number and the logical offset range that needs to be
       removed.
   * - EXT4_FC_TAG_CREAT
     - Create directory entry for a newly created file
     - ``struct ext4_fc_dentry_info``
     - Stores the parent inode number, inode number and directory entry of the
       newly created file.
   * - EXT4_FC_TAG_LINK
     - Link a directory entry to an inode
     - ``struct ext4_fc_dentry_info``
     - Stores the parent inode number, inode number and directory entry.
   * - EXT4_FC_TAG_UNLINK
     - Unlink a directory entry of an inode
     - ``struct ext4_fc_dentry_info``
     - Stores the parent inode number, inode number and directory entry.
   * - EXT4_FC_TAG_PAD
     - Padding (unused area)
     - None
     - Unused bytes in the fast commit area.
   * - EXT4_FC_TAG_TAIL
     - Mark the end of a fast commit
     - ``struct ext4_fc_tail``
     - Stores the TID of the commit and the CRC of the fast commit that this
       tag terminates.
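
Walking a TLV log is mechanical once the header layout is known. The
sketch below makes assumptions that this document does not state: that
``struct ext4_fc_tl`` is a 16-bit tag followed by a 16-bit length, both
little-endian, and that the length counts only the value bytes. The tag
numbers in the sample are placeholders, not the real ``EXT4_FC_TAG_*``
values.

```python
import struct

# Assumed header layout: 16-bit tag, 16-bit length, little-endian,
# with the length covering the value bytes only.
FC_TL = struct.Struct("<HH")


def iter_tlvs(area):
    """Yield (tag, value) pairs from a fast commit area until it ends."""
    offset = 0
    while offset + FC_TL.size <= len(area):
        tag, length = FC_TL.unpack_from(area, offset)
        offset += FC_TL.size
        if offset + length > len(area):
            break  # truncated record, stop scanning
        yield tag, area[offset:offset + length]
        offset += length


# Two fabricated records with placeholder tag numbers 1 and 2.
area = FC_TL.pack(1, 3) + b"abc" + FC_TL.pack(2, 0)
records = list(iter_tlvs(area))
```

A real parser would additionally stop at the tail tag and check its CRC
before trusting any of the records that precede it.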