.. SPDX-License-Identifier: GPL-2.0

Block and Inode Allocation Policy
---------------------------------

ext4 recognizes (better than ext3, anyway) that data locality is
generally a desirable quality of a filesystem. On a spinning disk,
keeping related blocks near each other reduces the amount of movement
that the head actuator and disk must perform to access a data block,
thus speeding up disk I/O. On an SSD there are of course no moving
parts, but locality can increase the size of each transfer request while
reducing the total number of requests. This locality may also have the
effect of concentrating writes on a single erase block, which can speed
up file rewrites significantly. Therefore, it is useful to reduce
fragmentation whenever possible.

The first tool that ext4 uses to combat fragmentation is the multi-block
allocator. When a file is first created, the block allocator
speculatively allocates 8KiB of disk space to the file on the assumption
that the space will get written soon. When the file is closed, the
unused speculative allocations are of course freed, but if the
speculation is correct (typically the case for full writes of small
files) then the file data gets written out in a single multi-block
extent. A second related trick that ext4 uses is delayed allocation.
Under this scheme, when a file needs more blocks to absorb file writes,
the filesystem defers deciding the exact placement on the disk until all
the dirty buffers are being written out to disk. By not committing to a
particular placement until it's absolutely necessary (the commit timeout
is hit, or sync() is called, or the kernel runs out of memory), the hope
is that the filesystem can make better location decisions.

The third trick that ext4 (and ext3) uses is that it tries to keep a
file's data blocks in the same block group as its inode. This cuts down
on the seek penalty when the filesystem first has to read a file's inode
to learn where the file's data blocks live and then seek over to the
file's data blocks to begin I/O operations.

The fourth trick is that all the inodes in a directory are placed in the
same block group as the directory, when feasible. The working assumption
here is that all the files in a directory might be related, therefore it
is useful to try to keep them all together.

The fifth trick is that the disk volume is cut up into 128MiB block
groups (assuming 4KiB blocks); these mini-containers are used as
outlined above to try to maintain data locality. However, there is a
deliberate quirk: when a directory is created in the root directory, the
inode allocator scans the block groups and puts that directory into the
least heavily loaded block group that it can find. This encourages
directories to spread out over a disk; as the top-level directory/file
blobs fill up one block group, the allocators simply move on to the next
block group. Allegedly this scheme evens out the loading on the block
groups, though the author suspects that the directories which are so
unlucky as to land towards the end of a spinning drive get a raw deal
performance-wise.

Of course, if all of these mechanisms fail, one can always use e4defrag
to defragment files.
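
To make the block group arithmetic behind the fifth trick concrete, the
following is a minimal sketch, not kernel code. It assumes 4KiB blocks,
a first data block of 0, and no bigalloc, so that one block-sized bitmap
tracks each group. All function names are hypothetical, and the "least
heavily loaded" picker is a deliberately simplified stand-in (most free
blocks wins) rather than the heuristic the real inode allocator uses.

```python
# Illustrative sketch only: every name below is hypothetical, not an ext4 API.
# Assumptions: first data block is 0, no bigalloc, one bitmap block per group.


def blocks_per_group(block_size: int) -> int:
    """A group's block bitmap fills one block, so it can track 8 bits per byte."""
    return 8 * block_size


def group_size_bytes(block_size: int) -> int:
    """Total bytes covered by one block group."""
    return blocks_per_group(block_size) * block_size


def group_of_block(block_nr: int, block_size: int) -> int:
    """Index of the block group that contains the given block number."""
    return block_nr // blocks_per_group(block_size)


def pick_least_loaded_group(free_blocks_per_group: list) -> int:
    """Toy placement rule for a new top-level directory.

    Returns the index of the group with the most free blocks; on a tie,
    the lowest-numbered group wins. This is a simplification for
    illustration, not the kernel's actual selection logic.
    """
    return max(
        range(len(free_blocks_per_group)),
        key=lambda g: free_blocks_per_group[g],
    )


if __name__ == "__main__":
    print(group_size_bytes(4096) // (1024 * 1024), "MiB per group")
    print("block 40000 lives in group", group_of_block(40000, 4096))
    print("new top-level dir goes to group",
          pick_least_loaded_group([10, 500, 500, 3]))
```

With 4KiB blocks, a group spans 32768 blocks, which is where the 128MiB
figure comes from; other block sizes give other group sizes under the
same assumptions.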