.. SPDX-License-Identifier: GPL-2.0

Index Nodes
-----------

In a regular UNIX filesystem, the inode stores all the metadata
pertaining to the file (time stamps, block maps, extended attributes,
etc.), not the directory entry. To find the information associated with a
file, one must traverse the directory files to find the directory entry
associated with a file, then load the inode to find the metadata for
that file. ext4 cheats a little (for performance reasons) by storing a
copy of the file type (normally stored in the inode) in the directory
entry. (Compare all this to FAT, which stores all the file information
directly in the directory entry, but does not support hard links and is
in general more seek-happy than ext4 due to its simpler block allocator
and extensive use of linked lists.)

The inode table is a linear array of ``struct ext4_inode``. The table is
sized to have enough blocks to store at least
``sb.s_inode_size * sb.s_inodes_per_group`` bytes. The number of the
block group containing an inode can be calculated as
``(inode_number - 1) / sb.s_inodes_per_group``, and the offset into the
group's table is ``(inode_number - 1) % sb.s_inodes_per_group``. There
is no inode 0.

The inode checksum is calculated against the FS UUID, the inode number,
and the inode structure itself.

The inode table entry is laid out in ``struct ext4_inode``.

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1
   :class: longtable

   * - Offset
     - Size
     - Name
     - Description
   * - 0x0
     - \_\_le16
     - i\_mode
     - File mode. See the table i_mode_ below.
   * - 0x2
     - \_\_le16
     - i\_uid
     - Lower 16-bits of Owner UID.
   * - 0x4
     - \_\_le32
     - i\_size\_lo
     - Lower 32-bits of size in bytes.
   * - 0x8
     - \_\_le32
     - i\_atime
     - Last access time, in seconds since the epoch.
       However, if the EA\_INODE inode flag is set, this inode stores an
       extended attribute value and this field contains the checksum of the
       value.
   * - 0xC
     - \_\_le32
     - i\_ctime
     - Last inode change time, in seconds since the epoch. However, if the
       EA\_INODE inode flag is set, this inode stores an extended attribute
       value and this field contains the lower 32 bits of the attribute value's
       reference count.
   * - 0x10
     - \_\_le32
     - i\_mtime
     - Last data modification time, in seconds since the epoch. However, if the
       EA\_INODE inode flag is set, this inode stores an extended attribute
       value and this field contains the number of the inode that owns the
       extended attribute.
   * - 0x14
     - \_\_le32
     - i\_dtime
     - Deletion time, in seconds since the epoch.
   * - 0x18
     - \_\_le16
     - i\_gid
     - Lower 16-bits of GID.
   * - 0x1A
     - \_\_le16
     - i\_links\_count
     - Hard link count. Normally, ext4 does not permit an inode to have more
       than 65,000 hard links. This applies to files as well as directories,
       which means that there cannot be more than 64,998 subdirectories in a
       directory (each subdirectory's '..' entry counts as a hard link, as does
       the '.' entry in the directory itself). With the DIR\_NLINK feature
       enabled, ext4 supports more than 64,998 subdirectories by setting this
       field to 1 to indicate that the number of hard links is not known.
   * - 0x1C
     - \_\_le32
     - i\_blocks\_lo
     - Lower 32-bits of “block” count. If the huge\_file feature flag is not
       set on the filesystem, the file consumes ``i_blocks_lo`` 512-byte blocks
       on disk. If huge\_file is set and EXT4\_HUGE\_FILE\_FL is NOT set in
       ``inode.i_flags``, then the file consumes
       ``i_blocks_lo + (i_blocks_hi << 32)`` 512-byte blocks on disk. If
       huge\_file is set and EXT4\_HUGE\_FILE\_FL IS set in ``inode.i_flags``,
       then this file consumes ``i_blocks_lo + (i_blocks_hi << 32)``
       filesystem blocks on disk.
   * - 0x20
     - \_\_le32
     - i\_flags
     - Inode flags. See the table i_flags_ below.
   * - 0x24
     - 4 bytes
     - i\_osd1
     - See the table i_osd1_ for more details.
   * - 0x28
     - 60 bytes
     - i\_block[EXT4\_N\_BLOCKS=15]
     - Block map or extent tree. See the section “The Contents of inode.i\_block”.
   * - 0x64
     - \_\_le32
     - i\_generation
     - File version (for NFS).
   * - 0x68
     - \_\_le32
     - i\_file\_acl\_lo
     - Lower 32-bits of extended attribute block. ACLs are only one of many
       possible extended attributes; the field name presumably dates from
       ACLs being the first use of extended attributes.
   * - 0x6C
     - \_\_le32
     - i\_size\_high / i\_dir\_acl
     - Upper 32-bits of file/directory size. In ext2/3 this field was named
       i\_dir\_acl, though it was usually set to zero and never used.
   * - 0x70
     - \_\_le32
     - i\_obso\_faddr
     - (Obsolete) fragment address.
   * - 0x74
     - 12 bytes
     - i\_osd2
     - See the table i_osd2_ for more details.
   * - 0x80
     - \_\_le16
     - i\_extra\_isize
     - Size of this inode - 128. Alternatively, the size of the extended inode
       fields beyond the original ext2 inode, including this field.
   * - 0x82
     - \_\_le16
     - i\_checksum\_hi
     - Upper 16-bits of the inode checksum.
   * - 0x84
     - \_\_le32
     - i\_ctime\_extra
     - Extra change time bits. This provides sub-second precision. See the
       Inode Timestamps section.
   * - 0x88
     - \_\_le32
     - i\_mtime\_extra
     - Extra modification time bits. This provides sub-second precision.
   * - 0x8C
     - \_\_le32
     - i\_atime\_extra
     - Extra access time bits. This provides sub-second precision.
   * - 0x90
     - \_\_le32
     - i\_crtime
     - File creation time, in seconds since the epoch.
   * - 0x94
     - \_\_le32
     - i\_crtime\_extra
     - Extra file creation time bits. This provides sub-second precision.
   * - 0x98
     - \_\_le32
     - i\_version\_hi
     - Upper 32-bits for version number.
   * - 0x9C
     - \_\_le32
     - i\_projid
     - Project ID.

.. _i_mode:

The ``i_mode`` value is a combination of the following flags:

.. list-table::
   :widths: 16 64
   :header-rows: 1

   * - Value
     - Description
   * - 0x1
     - S\_IXOTH (Others may execute)
   * - 0x2
     - S\_IWOTH (Others may write)
   * - 0x4
     - S\_IROTH (Others may read)
   * - 0x8
     - S\_IXGRP (Group members may execute)
   * - 0x10
     - S\_IWGRP (Group members may write)
   * - 0x20
     - S\_IRGRP (Group members may read)
   * - 0x40
     - S\_IXUSR (Owner may execute)
   * - 0x80
     - S\_IWUSR (Owner may write)
   * - 0x100
     - S\_IRUSR (Owner may read)
   * - 0x200
     - S\_ISVTX (Sticky bit)
   * - 0x400
     - S\_ISGID (Set GID)
   * - 0x800
     - S\_ISUID (Set UID)
   * -
     - These are mutually-exclusive file types:
   * - 0x1000
     - S\_IFIFO (FIFO)
   * - 0x2000
     - S\_IFCHR (Character device)
   * - 0x4000
     - S\_IFDIR (Directory)
   * - 0x6000
     - S\_IFBLK (Block device)
   * - 0x8000
     - S\_IFREG (Regular file)
   * - 0xA000
     - S\_IFLNK (Symbolic link)
   * - 0xC000
     - S\_IFSOCK (Socket)

.. _i_flags:

The ``i_flags`` field is a combination of these values:

.. list-table::
   :widths: 16 64
   :header-rows: 1

   * - Value
     - Description
   * - 0x1
     - This file requires secure deletion (EXT4\_SECRM\_FL). (not implemented)
   * - 0x2
     - This file should be preserved, should undeletion be desired
       (EXT4\_UNRM\_FL). (not implemented)
   * - 0x4
     - File is compressed (EXT4\_COMPR\_FL). (not really implemented)
   * - 0x8
     - All writes to the file must be synchronous (EXT4\_SYNC\_FL).
   * - 0x10
     - File is immutable (EXT4\_IMMUTABLE\_FL).
   * - 0x20
     - File can only be appended (EXT4\_APPEND\_FL).
   * - 0x40
     - The dump(1) utility should not dump this file (EXT4\_NODUMP\_FL).
   * - 0x80
     - Do not update access time (EXT4\_NOATIME\_FL).
   * - 0x100
     - Dirty compressed file (EXT4\_DIRTY\_FL). (not used)
   * - 0x200
     - File has one or more compressed clusters (EXT4\_COMPRBLK\_FL). (not used)
   * - 0x400
     - Do not compress file (EXT4\_NOCOMPR\_FL). (not used)
   * - 0x800
     - Encrypted inode (EXT4\_ENCRYPT\_FL). This bit value previously was
       EXT4\_ECOMPR\_FL (compression error), which was never used.
   * - 0x1000
     - Directory has hashed indexes (EXT4\_INDEX\_FL).
   * - 0x2000
     - AFS magic directory (EXT4\_IMAGIC\_FL).
   * - 0x4000
     - File data must always be written through the journal
       (EXT4\_JOURNAL\_DATA\_FL).
   * - 0x8000
     - File tail should not be merged (EXT4\_NOTAIL\_FL). (not used by ext4)
   * - 0x10000
     - All directory entry data should be written synchronously (see
       ``dirsync``) (EXT4\_DIRSYNC\_FL).
   * - 0x20000
     - Top of directory hierarchy (EXT4\_TOPDIR\_FL).
   * - 0x40000
     - This is a huge file (EXT4\_HUGE\_FILE\_FL).
   * - 0x80000
     - Inode uses extents (EXT4\_EXTENTS\_FL).
   * - 0x200000
     - Inode stores a large extended attribute value in its data blocks
       (EXT4\_EA\_INODE\_FL).
   * - 0x400000
     - This file has blocks allocated past EOF (EXT4\_EOFBLOCKS\_FL).
       (deprecated)
   * - 0x01000000
     - Inode is a snapshot (``EXT4_SNAPFILE_FL``). (not in mainline)
   * - 0x04000000
     - Snapshot is being deleted (``EXT4_SNAPFILE_DELETED_FL``). (not in
       mainline)
   * - 0x08000000
     - Snapshot shrink has completed (``EXT4_SNAPFILE_SHRUNK_FL``). (not in
       mainline)
   * - 0x10000000
     - Inode has inline data (EXT4\_INLINE\_DATA\_FL).
   * - 0x20000000
     - Create children with the same project ID (EXT4\_PROJINHERIT\_FL).
   * - 0x80000000
     - Reserved for ext4 library (EXT4\_RESERVED\_FL).
   * -
     - Aggregate flags:
   * - 0x4BDFFF
     - User-visible flags.
   * - 0x4B80FF
     - User-modifiable flags. Note that while EXT4\_JOURNAL\_DATA\_FL and
       EXT4\_EXTENTS\_FL can be set with setattr, they are not in the kernel's
       EXT4\_FL\_USER\_MODIFIABLE mask, since it needs to handle the setting of
       these flags in a special manner and they are masked out of the set of
       flags that are saved directly to i\_flags.

.. _i_osd1:

The ``osd1`` field has multiple meanings depending on the creator:

Linux:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Size
     - Name
     - Description
   * - 0x0
     - \_\_le32
     - l\_i\_version
     - Inode version. However, if the EA\_INODE inode flag is set, this inode
       stores an extended attribute value and this field contains the upper 32
       bits of the attribute value's reference count.

Hurd:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Size
     - Name
     - Description
   * - 0x0
     - \_\_le32
     - h\_i\_translator
     - ??

Masix:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Size
     - Name
     - Description
   * - 0x0
     - \_\_le32
     - m\_i\_reserved
     - ??

.. _i_osd2:

The ``osd2`` field has multiple meanings depending on the filesystem creator:

Linux:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Size
     - Name
     - Description
   * - 0x0
     - \_\_le16
     - l\_i\_blocks\_high
     - Upper 16-bits of the block count. Please see the note attached to
       i\_blocks\_lo.
   * - 0x2
     - \_\_le16
     - l\_i\_file\_acl\_high
     - Upper 16-bits of the extended attribute block (historically, the file
       ACL location). See the Extended Attributes section below.
   * - 0x4
     - \_\_le16
     - l\_i\_uid\_high
     - Upper 16-bits of the Owner UID.
   * - 0x6
     - \_\_le16
     - l\_i\_gid\_high
     - Upper 16-bits of the GID.
   * - 0x8
     - \_\_le16
     - l\_i\_checksum\_lo
     - Lower 16-bits of the inode checksum.
   * - 0xA
     - \_\_le16
     - l\_i\_reserved
     - Unused.

Hurd:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Size
     - Name
     - Description
   * - 0x0
     - \_\_le16
     - h\_i\_reserved1
     - ??
   * - 0x2
     - \_\_u16
     - h\_i\_mode\_high
     - Upper 16-bits of the file mode.
   * - 0x4
     - \_\_le16
     - h\_i\_uid\_high
     - Upper 16-bits of the Owner UID.
   * - 0x6
     - \_\_le16
     - h\_i\_gid\_high
     - Upper 16-bits of the GID.
   * - 0x8
     - \_\_u32
     - h\_i\_author
     - Author code?

Masix:

.. list-table::
   :widths: 8 8 24 40
   :header-rows: 1

   * - Offset
     - Size
     - Name
     - Description
   * - 0x0
     - \_\_le16
     - h\_i\_reserved1
     - ??
   * - 0x2
     - \_\_u16
     - m\_i\_file\_acl\_high
     - Upper 16-bits of the extended attribute block (historically, the file
       ACL location).
   * - 0x4
     - \_\_u32
     - m\_i\_reserved2[2]
     - ??

Inode Size
~~~~~~~~~~

In ext2 and ext3, the inode structure size was fixed at 128 bytes
(``EXT2_GOOD_OLD_INODE_SIZE``) and each inode had a disk record size of
128 bytes. Starting with ext4, it is possible to allocate a larger
on-disk inode at format time for all inodes in the filesystem to provide
space beyond the end of the original ext2 inode. The on-disk inode
record size is recorded in the superblock as ``s_inode_size``. The
number of bytes actually used by struct ext4\_inode beyond the original
128-byte ext2 inode is recorded in the ``i_extra_isize`` field for each
inode, which allows struct ext4\_inode to grow for a new kernel without
having to upgrade all of the on-disk inodes.
Access to fields beyond EXT2\_GOOD\_OLD\_INODE\_SIZE should be verified
to be within ``i_extra_isize``. By default, ext4 inode records are 256
bytes, and (as of October 2013) the inode structure is 156 bytes
(``i_extra_isize = 28``); counting the ``i_projid`` field shown in the
table above, the structure extends to 0xA0 (160) bytes. The extra space
between the end of the inode structure and the end of the inode record
can be used to store extended attributes. Each inode record can be as
large as the filesystem block size, though this is not terribly
efficient.

Finding an Inode
~~~~~~~~~~~~~~~~

Each block group contains ``sb->s_inodes_per_group`` inodes. Because
inode 0 is defined not to exist, this formula can be used to find the
block group that an inode lives in:
``bg = (inode_num - 1) / sb->s_inodes_per_group``. The particular inode
can be found within the block group's inode table at
``index = (inode_num - 1) % sb->s_inodes_per_group``. To get the byte
address within the inode table, use
``offset = index * sb->s_inode_size``.

Inode Timestamps
~~~~~~~~~~~~~~~~

Four timestamps are recorded in the lower 128 bytes of the inode
structure: inode change time (ctime), access time (atime), data
modification time (mtime), and deletion time (dtime). The four fields
are 32-bit signed integers that represent seconds since the Unix epoch
(1970-01-01 00:00:00 GMT), which means that the fields will overflow in
January 2038. For inodes that are not linked from any directory but are
still open (orphan inodes), the dtime field is overloaded for use with
the orphan list. The superblock field ``s_last_orphan`` points to the
first inode in the orphan list; dtime is then the number of the next
orphaned inode, or zero if there are no more orphans.
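
The orphan-list chaining described above amounts to a singly linked list
threaded through ``i_dtime``. The following is a minimal sketch of such a
walk, not kernel code: ``walk_orphan_list`` and the ``read_dtime`` callback
are hypothetical names introduced for illustration, and the callback is
assumed to return the raw on-disk ``i_dtime`` of a given inode number.

```python
from typing import Callable, Iterator


def walk_orphan_list(s_last_orphan: int,
                     read_dtime: Callable[[int], int]) -> Iterator[int]:
    """Yield orphaned inode numbers, starting from sb.s_last_orphan.

    read_dtime is a hypothetical callback: given an inode number, it
    returns that inode's i_dtime, which for an orphan holds the number
    of the next orphaned inode (or zero at the end of the list).
    """
    seen = set()
    ino = s_last_orphan
    while ino != 0:  # zero terminates the list
        if ino in seen:
            # A corrupted list could loop forever; stop instead.
            raise ValueError(f"orphan list loops back to inode {ino}")
        seen.add(ino)
        yield ino
        ino = read_dtime(ino)


# Toy in-memory stand-in for the inode table: inode number -> i_dtime.
dtimes = {12: 40, 40: 7, 7: 0}
orphans = list(walk_orphan_list(12, dtimes.__getitem__))
```

With the toy table, the walk visits inodes 12, 40 and 7 in that order. The
loop guard is a design choice of this sketch, since a corrupted chain could
otherwise never reach zero.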

If the inode structure size ``sb->s_inode_size`` is larger than 128
bytes and the ``i_extra_isize`` field is large enough to encompass the
respective ``i_[cma]time_extra`` field, the ctime, atime, and mtime
inode fields are widened to 64 bits. Within this “extra” 32-bit field,
the lower two bits are used to extend the 32-bit seconds field to be 34
bits wide; the upper 30 bits are used to provide nanosecond timestamp
accuracy. Therefore, timestamps should not overflow until May 2446.
dtime was not widened. There is also a fifth timestamp to record inode
creation time (crtime); this field is 64 bits wide and decoded in the
same manner as 64-bit [cma]time. Neither crtime nor dtime is accessible
through the regular stat() interface, though debugfs will report them.

We use the 32-bit signed time value plus (2^32 \* (extra epoch bits)).
In other words:

.. list-table::
   :widths: 20 20 20 20 20
   :header-rows: 1

   * - Extra epoch bits
     - MSB of 32-bit time
     - Adjustment for signed 32-bit to 64-bit tv\_sec
     - Decoded 64-bit tv\_sec
     - Valid time range
   * - 0 0
     - 1
     - 0
     - ``-0x80000000 - -0x00000001``
     - 1901-12-13 to 1969-12-31
   * - 0 0
     - 0
     - 0
     - ``0x000000000 - 0x07fffffff``
     - 1970-01-01 to 2038-01-19
   * - 0 1
     - 1
     - 0x100000000
     - ``0x080000000 - 0x0ffffffff``
     - 2038-01-19 to 2106-02-07
   * - 0 1
     - 0
     - 0x100000000
     - ``0x100000000 - 0x17fffffff``
     - 2106-02-07 to 2174-02-25
   * - 1 0
     - 1
     - 0x200000000
     - ``0x180000000 - 0x1ffffffff``
     - 2174-02-25 to 2242-03-16
   * - 1 0
     - 0
     - 0x200000000
     - ``0x200000000 - 0x27fffffff``
     - 2242-03-16 to 2310-04-04
   * - 1 1
     - 1
     - 0x300000000
     - ``0x280000000 - 0x2ffffffff``
     - 2310-04-04 to 2378-04-22
   * - 1 1
     - 0
     - 0x300000000
     - ``0x300000000 - 0x37fffffff``
     - 2378-04-22 to 2446-05-10

This is a somewhat odd encoding
since there are effectively seven times
as many positive values as negative values. There have also been
long-standing bugs decoding and encoding dates beyond 2038, which don't
seem to be fixed as of kernel 3.12 and e2fsprogs 1.42.8. 64-bit kernels
incorrectly use the extra epoch bits 1,1 for dates between 1901 and
1970. At some point the kernel will be fixed and e2fsck will fix this
situation, assuming that it is run before 2310.
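
The decoding rule from the table in this section can be sketched in a few
lines. This is an illustrative sketch that follows the table as written, not
the kernel's implementation (and not the buggy behaviour mentioned above);
``decode_ext4_time`` is a hypothetical helper name, and both arguments are
assumed to be the raw unsigned 32-bit values read from disk.

```python
import struct
from typing import Tuple


def decode_ext4_time(seconds_raw: int, extra: int) -> Tuple[int, int]:
    """Return (tv_sec, tv_nsec) for one timestamp.

    seconds_raw: unsigned 32-bit on-disk seconds field (e.g. i_mtime).
    extra: the matching 32-bit "extra" field (e.g. i_mtime_extra).
    """
    # Reinterpret the seconds field as a signed 32-bit integer.
    (signed_sec,) = struct.unpack("<i", struct.pack("<I", seconds_raw))
    epoch_bits = extra & 0x3   # lower two bits extend the seconds field
    nsec = extra >> 2          # upper 30 bits hold the nanoseconds
    # 32-bit signed time value plus 2^32 * (extra epoch bits).
    return signed_sec + (epoch_bits << 32), nsec


# Spot checks taken from the rows of the table.
before_1970 = decode_ext4_time(0xFFFFFFFF, 0)
after_2038 = decode_ext4_time(0x80000000, 1)
last_value = decode_ext4_time(0x7FFFFFFF, 3)
with_nsec = decode_ext4_time(0, (500 << 2) | 2)
```

Because the epoch bits are added after sign extension, a seconds field with
its MSB set and epoch bits ``0 1`` decodes into the 2038 to 2106 range rather
than to a negative value, matching the third row of the table.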