xref: /openbmc/linux/Documentation/filesystems/ext4/directory.rst (revision 023e41632e065d49bcbe31b3c4b336217f96a271)
1.. SPDX-License-Identifier: GPL-2.0
2
3Directory Entries
4-----------------
5
6In an ext4 filesystem, a directory is more or less a flat file that maps
7an arbitrary byte string (usually ASCII) to an inode number on the
8filesystem. There can be many directory entries across the filesystem
9that reference the same inode number--these are known as hard links, and
10that is why hard links cannot reference files on other filesystems. As
11such, directory entries are found by reading the data block(s)
12associated with a directory file for the particular directory entry that
13is desired.
14
15Linear (Classic) Directories
16~~~~~~~~~~~~~~~~~~~~~~~~~~~~
17
18By default, each directory lists its entries in an “almost-linear”
19array. I write “almost” because it's not a linear array in the memory
20sense because directory entries are not split across filesystem blocks.
21Therefore, it is more accurate to say that a directory is a series of
22data blocks and that each block contains a linear array of directory
23entries. The end of each per-block array is signified by reaching the
24end of the block; the last entry in the block has a record length that
25takes it all the way to the end of the block. The end of the entire
26directory is of course signified by reaching the end of the file. Unused
27directory entries are signified by inode = 0. By default the filesystem
28uses ``struct ext4_dir_entry_2`` for directory entries unless the
29“filetype” feature flag is not set, in which case it uses
30``struct ext4_dir_entry``.
31
32The original directory entry format is ``struct ext4_dir_entry``, which
33is at most 263 bytes long, though on disk you'll need to reference
34``dirent.rec_len`` to know for sure.
35
36.. list-table::
37   :widths: 8 8 24 40
38   :header-rows: 1
39
40   * - Offset
41     - Size
42     - Name
43     - Description
44   * - 0x0
45     - \_\_le32
46     - inode
47     - Number of the inode that this directory entry points to.
48   * - 0x4
49     - \_\_le16
50     - rec\_len
51     - Length of this directory entry. Must be a multiple of 4.
52   * - 0x6
53     - \_\_le16
54     - name\_len
55     - Length of the file name.
56   * - 0x8
57     - char
58     - name[EXT4\_NAME\_LEN]
59     - File name.
60
61Since file names cannot be longer than 255 bytes, the new directory
62entry format shortens the rec\_len field and uses the space for a file
63type flag, probably to avoid having to load every inode during directory
64tree traversal. This format is ``ext4_dir_entry_2``, which is at most
65263 bytes long, though on disk you'll need to reference
66``dirent.rec_len`` to know for sure.
67
68.. list-table::
69   :widths: 8 8 24 40
70   :header-rows: 1
71
72   * - Offset
73     - Size
74     - Name
75     - Description
76   * - 0x0
77     - \_\_le32
78     - inode
79     - Number of the inode that this directory entry points to.
80   * - 0x4
81     - \_\_le16
82     - rec\_len
83     - Length of this directory entry.
84   * - 0x6
85     - \_\_u8
86     - name\_len
87     - Length of the file name.
88   * - 0x7
89     - \_\_u8
90     - file\_type
91     - File type code, see ftype_ table below.
92   * - 0x8
93     - char
94     - name[EXT4\_NAME\_LEN]
95     - File name.
96
97.. _ftype:
98
99The directory file type is one of the following values:
100
101.. list-table::
102   :widths: 16 64
103   :header-rows: 1
104
105   * - Value
106     - Description
107   * - 0x0
108     - Unknown.
109   * - 0x1
110     - Regular file.
111   * - 0x2
112     - Directory.
113   * - 0x3
114     - Character device file.
115   * - 0x4
116     - Block device file.
117   * - 0x5
118     - FIFO.
119   * - 0x6
120     - Socket.
121   * - 0x7
122     - Symbolic link.
123
124In order to add checksums to these classic directory blocks, a phony
125``struct ext4_dir_entry`` is placed at the end of each leaf block to
126hold the checksum. The directory entry is 12 bytes long. The inode
127number and name\_len fields are set to zero to fool old software into
128ignoring an apparently empty directory entry, and the checksum is stored
129in the place where the name normally goes. The structure is
130``struct ext4_dir_entry_tail``:
131
132.. list-table::
133   :widths: 8 8 24 40
134   :header-rows: 1
135
136   * - Offset
137     - Size
138     - Name
139     - Description
140   * - 0x0
141     - \_\_le32
142     - det\_reserved\_zero1
143     - Inode number, which must be zero.
144   * - 0x4
145     - \_\_le16
146     - det\_rec\_len
147     - Length of this directory entry, which must be 12.
148   * - 0x6
149     - \_\_u8
150     - det\_reserved\_zero2
151     - Length of the file name, which must be zero.
152   * - 0x7
153     - \_\_u8
154     - det\_reserved\_ft
155     - File type, which must be 0xDE.
156   * - 0x8
157     - \_\_le32
158     - det\_checksum
159     - Directory leaf block checksum.
160
161The leaf directory block checksum is calculated against the FS UUID, the
162directory's inode number, the directory's inode generation number, and
163the entire directory entry block up to (but not including) the fake
164directory entry.
165
166Hash Tree Directories
167~~~~~~~~~~~~~~~~~~~~~
168
169A linear array of directory entries isn't great for performance, so a
170new feature was added to ext3 to provide a faster (but peculiar)
171balanced tree keyed off a hash of the directory entry name. If the
172EXT4\_INDEX\_FL (0x1000) flag is set in the inode, this directory uses a
173hashed btree (htree) to organize and find directory entries. For
174backwards read-only compatibility with ext2, this tree is actually
175hidden inside the directory file, masquerading as “empty” directory data
176blocks! It was stated previously that the end of the linear directory
177entry table was signified with an entry pointing to inode 0; this is
178(ab)used to fool the old linear-scan algorithm into thinking that the
179rest of the directory block is empty so that it moves on.
180
181The root of the tree always lives in the first data block of the
182directory. By ext2 custom, the '.' and '..' entries must appear at the
183beginning of this first block, so they are put here as two
184``struct ext4_dir_entry_2``\ s and not stored in the tree. The rest of
185the root node contains metadata about the tree and finally a hash->block
186map to find nodes that are lower in the htree. If
187``dx_root.info.indirect_levels`` is non-zero then the htree has two
188levels; the data block pointed to by the root node's map is an interior
189node, which is indexed by a minor hash. Interior nodes in this tree
190contains a zeroed out ``struct ext4_dir_entry_2`` followed by a
191minor\_hash->block map to find leafe nodes. Leaf nodes contain a linear
192array of all ``struct ext4_dir_entry_2``; all of these entries
193(presumably) hash to the same value. If there is an overflow, the
194entries simply overflow into the next leaf node, and the
195least-significant bit of the hash (in the interior node map) that gets
196us to this next leaf node is set.
197
198To traverse the directory as a htree, the code calculates the hash of
199the desired file name and uses it to find the corresponding block
200number. If the tree is flat, the block is a linear array of directory
201entries that can be searched; otherwise, the minor hash of the file name
202is computed and used against this second block to find the corresponding
203third block number. That third block number will be a linear array of
204directory entries.
205
206To traverse the directory as a linear array (such as the old code does),
207the code simply reads every data block in the directory. The blocks used
208for the htree will appear to have no entries (aside from '.' and '..')
209and so only the leaf nodes will appear to have any interesting content.
210
211The root of the htree is in ``struct dx_root``, which is the full length
212of a data block:
213
214.. list-table::
215   :widths: 8 8 24 40
216   :header-rows: 1
217
218   * - Offset
219     - Type
220     - Name
221     - Description
222   * - 0x0
223     - \_\_le32
224     - dot.inode
225     - inode number of this directory.
226   * - 0x4
227     - \_\_le16
228     - dot.rec\_len
229     - Length of this record, 12.
230   * - 0x6
231     - u8
232     - dot.name\_len
233     - Length of the name, 1.
234   * - 0x7
235     - u8
236     - dot.file\_type
237     - File type of this entry, 0x2 (directory) (if the feature flag is set).
238   * - 0x8
239     - char
240     - dot.name[4]
241     - “.\\0\\0\\0”
242   * - 0xC
243     - \_\_le32
244     - dotdot.inode
245     - inode number of parent directory.
246   * - 0x10
247     - \_\_le16
248     - dotdot.rec\_len
249     - block\_size - 12. The record length is long enough to cover all htree
250       data.
251   * - 0x12
252     - u8
253     - dotdot.name\_len
254     - Length of the name, 2.
255   * - 0x13
256     - u8
257     - dotdot.file\_type
258     - File type of this entry, 0x2 (directory) (if the feature flag is set).
259   * - 0x14
260     - char
261     - dotdot\_name[4]
262     - “..\\0\\0”
263   * - 0x18
264     - \_\_le32
265     - struct dx\_root\_info.reserved\_zero
266     - Zero.
267   * - 0x1C
268     - u8
269     - struct dx\_root\_info.hash\_version
270     - Hash type, see dirhash_ table below.
271   * - 0x1D
272     - u8
273     - struct dx\_root\_info.info\_length
274     - Length of the tree information, 0x8.
275   * - 0x1E
276     - u8
277     - struct dx\_root\_info.indirect\_levels
278     - Depth of the htree. Cannot be larger than 3 if the INCOMPAT\_LARGEDIR
279       feature is set; cannot be larger than 2 otherwise.
280   * - 0x1F
281     - u8
282     - struct dx\_root\_info.unused\_flags
283     -
284   * - 0x20
285     - \_\_le16
286     - limit
287     - Maximum number of dx\_entries that can follow this header, plus 1 for
288       the header itself.
289   * - 0x22
290     - \_\_le16
291     - count
292     - Actual number of dx\_entries that follow this header, plus 1 for the
293       header itself.
294   * - 0x24
295     - \_\_le32
296     - block
297     - The block number (within the directory file) that goes with hash=0.
298   * - 0x28
299     - struct dx\_entry
300     - entries[0]
301     - As many 8-byte ``struct dx_entry`` as fits in the rest of the data block.
302
303.. _dirhash:
304
305The directory hash is one of the following values:
306
307.. list-table::
308   :widths: 16 64
309   :header-rows: 1
310
311   * - Value
312     - Description
313   * - 0x0
314     - Legacy.
315   * - 0x1
316     - Half MD4.
317   * - 0x2
318     - Tea.
319   * - 0x3
320     - Legacy, unsigned.
321   * - 0x4
322     - Half MD4, unsigned.
323   * - 0x5
324     - Tea, unsigned.
325
326Interior nodes of an htree are recorded as ``struct dx_node``, which is
327also the full length of a data block:
328
329.. list-table::
330   :widths: 8 8 24 40
331   :header-rows: 1
332
333   * - Offset
334     - Type
335     - Name
336     - Description
337   * - 0x0
338     - \_\_le32
339     - fake.inode
340     - Zero, to make it look like this entry is not in use.
341   * - 0x4
342     - \_\_le16
343     - fake.rec\_len
344     - The size of the block, in order to hide all of the dx\_node data.
345   * - 0x6
346     - u8
347     - name\_len
348     - Zero. There is no name for this “unused” directory entry.
349   * - 0x7
350     - u8
351     - file\_type
352     - Zero. There is no file type for this “unused” directory entry.
353   * - 0x8
354     - \_\_le16
355     - limit
356     - Maximum number of dx\_entries that can follow this header, plus 1 for
357       the header itself.
358   * - 0xA
359     - \_\_le16
360     - count
361     - Actual number of dx\_entries that follow this header, plus 1 for the
362       header itself.
363   * - 0xE
364     - \_\_le32
365     - block
366     - The block number (within the directory file) that goes with the lowest
367       hash value of this block. This value is stored in the parent block.
368   * - 0x12
369     - struct dx\_entry
370     - entries[0]
371     - As many 8-byte ``struct dx_entry`` as fits in the rest of the data block.
372
373The hash maps that exist in both ``struct dx_root`` and
374``struct dx_node`` are recorded as ``struct dx_entry``, which is 8 bytes
375long:
376
377.. list-table::
378   :widths: 8 8 24 40
379   :header-rows: 1
380
381   * - Offset
382     - Type
383     - Name
384     - Description
385   * - 0x0
386     - \_\_le32
387     - hash
388     - Hash code.
389   * - 0x4
390     - \_\_le32
391     - block
392     - Block number (within the directory file, not filesystem blocks) of the
393       next node in the htree.
394
395(If you think this is all quite clever and peculiar, so does the
396author.)
397
398If metadata checksums are enabled, the last 8 bytes of the directory
399block (precisely the length of one dx\_entry) are used to store a
400``struct dx_tail``, which contains the checksum. The ``limit`` and
401``count`` entries in the dx\_root/dx\_node structures are adjusted as
402necessary to fit the dx\_tail into the block. If there is no space for
403the dx\_tail, the user is notified to run e2fsck -D to rebuild the
404directory index (which will ensure that there's space for the checksum.
405The dx\_tail structure is 8 bytes long and looks like this:
406
407.. list-table::
408   :widths: 8 8 24 40
409   :header-rows: 1
410
411   * - Offset
412     - Type
413     - Name
414     - Description
415   * - 0x0
416     - u32
417     - dt\_reserved
418     - Zero.
419   * - 0x4
420     - \_\_le32
421     - dt\_checksum
422     - Checksum of the htree directory block.
423
424The checksum is calculated against the FS UUID, the htree index header
425(dx\_root or dx\_node), all of the htree indices (dx\_entry) that are in
426use, and the tail block (dx\_tail).
427