.. SPDX-License-Identifier: GPL-2.0

Layout
------

The layout of a standard block group is approximately as follows (each
of these fields is discussed in a separate section below):

.. list-table::
   :widths: 1 1 1 1 1 1 1 1
   :header-rows: 1

   * - Group 0 Padding
     - ext4 Super Block
     - Group Descriptors
     - Reserved GDT Blocks
     - Data Block Bitmap
     - inode Bitmap
     - inode Table
     - Data Blocks
   * - 1024 bytes
     - 1 block
     - many blocks
     - many blocks
     - 1 block
     - 1 block
     - many blocks
     - many more blocks

For the special case of block group 0, the first 1024 bytes are unused,
to allow for the installation of x86 boot sectors and other oddities.
The superblock starts at byte offset 1024, whichever block that happens
to be (usually block 0). However, if the block size is 1024 bytes, then
block 0 is marked in use and the superblock goes in block 1. For all
other block groups, there is no padding.
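
The placement rule for the primary superblock can be sketched in a few
lines of Python. This is only an illustration of the arithmetic described
above; the function name ``superblock_location`` is hypothetical and does
not correspond to any kernel or e2fsprogs interface.

.. code-block:: python

   SUPERBLOCK_OFFSET = 1024  # byte offset of the primary superblock


   def superblock_location(block_size):
       """Return (block number, byte offset inside that block)."""
       return divmod(SUPERBLOCK_OFFSET, block_size)


   for block_size in (1024, 2048, 4096):
       block, offset = superblock_location(block_size)
       print(f"block size {block_size}: block {block}, offset {offset}")

With a 1024-byte block size the superblock falls at the start of block 1;
with larger block sizes it sits 1024 bytes into block 0.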

The ext4 driver primarily works with the superblock and the group
descriptors that are found in block group 0. Redundant copies of the
superblock and group descriptors are written to some of the block groups
across the disk in case the beginning of the disk gets trashed, though
not all block groups necessarily host a redundant copy (see following
paragraph for more details). If the group does not have a redundant
copy, the block group begins with the data block bitmap. Note also that
when the filesystem is freshly formatted, mkfs will allocate “reserve
GDT block” space after the block group descriptors and before the start
of the block bitmaps to allow for future expansion of the filesystem. By
default, a filesystem is allowed to increase in size by a factor of
1024x over the original filesystem size.

The location of the inode table is given by ``grp.bg_inode_table_*``. It
is a contiguous range of blocks large enough to contain
``sb.s_inodes_per_group * sb.s_inode_size`` bytes.
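
As a rough illustration, the number of blocks occupied by one group's
inode table follows from that byte count by rounding up to whole blocks.
The helper name ``inode_table_blocks`` and the sample values below are
assumptions chosen for the example, not values read from a real
filesystem.

.. code-block:: python

   def inode_table_blocks(inodes_per_group, inode_size, block_size):
       """Blocks needed to hold inodes_per_group * inode_size bytes."""
       table_bytes = inodes_per_group * inode_size
       # Ceiling division: a partially filled block still takes a block.
       return -(-table_bytes // block_size)


   # Illustrative values: 8192 inodes per group, 256-byte inodes,
   # 4096-byte blocks.
   print(inode_table_blocks(8192, 256, 4096))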

As for the ordering of items in a block group, it is generally
established that the superblock and the group descriptor table, if
present, will be at the beginning of the block group. The bitmaps and
the inode table can be anywhere, and it is quite possible for the
bitmaps to come after the inode table, or for both to be in different
groups (flex_bg). Leftover space is used for file data blocks, indirect
block maps, extent tree blocks, and extended attributes.

Flexible Block Groups
---------------------

Starting in ext4, there is a new feature called flexible block groups
(flex_bg). In a flex_bg, several block groups are tied together as one
logical block group; the bitmap spaces and the inode table space in the
first block group of the flex_bg are expanded to include the bitmaps
and inode tables of all other block groups in the flex_bg. For example,
if the flex_bg size is 4, then group 0 will contain (in order) the
superblock, group descriptors, data block bitmaps for groups 0-3, inode
bitmaps for groups 0-3, inode tables for groups 0-3, and the remaining
space in group 0 is for file data. The effect of this is to group the
block group metadata close together for faster loading, and to enable
large files to be contiguous on disk. Backup copies of the superblock
and group descriptors are always at the beginning of block groups, even
if flex_bg is enabled. The number of block groups that make up a
flex_bg is given by 2 ^ ``sb.s_log_groups_per_flex``.
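
The relationship between a block group number and its flex_bg can be
sketched as follows. The function names are hypothetical, and the sketch
only reproduces the power-of-two arithmetic stated above; the actual
location of each bitmap and inode table is always recorded in the group
descriptors.

.. code-block:: python

   def groups_per_flex(s_log_groups_per_flex):
       """Number of block groups in one flex_bg: 2 ^ s_log_groups_per_flex."""
       return 1 << s_log_groups_per_flex


   def flex_group_of(group, s_log_groups_per_flex):
       """Index of the flex_bg that contains the given block group."""
       return group >> s_log_groups_per_flex


   def first_group_of_flex(group, s_log_groups_per_flex):
       """First block group of the flex_bg that contains the given group."""
       return (group >> s_log_groups_per_flex) << s_log_groups_per_flex


   # With s_log_groups_per_flex = 2 the flex_bg size is 4, so block
   # groups 4-7 form flex_bg 1, whose first group is block group 4.
   print(groups_per_flex(2), flex_group_of(5, 2), first_group_of_flex(7, 2))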

Meta Block Groups
-----------------

Without the META_BG option, for safety reasons, all block group
descriptor copies are kept in the first block group. Given the default
128MiB (2^27 bytes) block group size and 64-byte group descriptors, ext4
can have at most 2^27/64 = 2^21 block groups. This limits the entire
filesystem size to 2^21 * 2^27 = 2^48 bytes or 256TiB.
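
The limit quoted above can be checked with a short calculation. The
constant names are chosen for this example only.

.. code-block:: python

   BLOCK_GROUP_BYTES = 1 << 27   # default 128MiB block group
   DESC_SIZE = 64                # bytes per group descriptor
   TIB = 1 << 40

   # All descriptors must fit in the first block group.
   max_groups = BLOCK_GROUP_BYTES // DESC_SIZE
   max_fs_bytes = max_groups * BLOCK_GROUP_BYTES

   print(max_groups == 2**21)    # at most 2^21 block groups
   print(max_fs_bytes // TIB)    # filesystem size limit in TiB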

The solution to this problem is to use the metablock group feature
(META_BG), which is already in ext3 for all 2.6 releases. With the
META_BG feature, ext4 filesystems are partitioned into many metablock
groups. Each metablock group is a cluster of block groups whose group
descriptor structures can be stored in a single disk block. For ext4
filesystems with 4 KB block size, a single metablock group partition
includes 64 block groups, or 8 GiB of disk space. The metablock group
feature moves the location of the group descriptors from the congested
first block group of the whole filesystem into the first group of each
metablock group itself. The backups are in the second and last group of
each metablock group. This increases the 2^21 maximum block groups limit
to the hard limit 2^32, allowing support for a 512PiB filesystem.
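
The figures in this paragraph follow from the same kind of arithmetic,
assuming 4 KB blocks, 64-byte descriptors, and the default 128MiB block
group size. The constant names are chosen for this example only.

.. code-block:: python

   BLOCK_SIZE = 4096
   DESC_SIZE = 64
   BLOCK_GROUP_BYTES = 1 << 27   # default 128MiB block group
   GIB = 1 << 30
   PIB = 1 << 50

   # Block groups whose descriptors fit in a single disk block.
   groups_per_meta_bg = BLOCK_SIZE // DESC_SIZE
   meta_bg_bytes = groups_per_meta_bg * BLOCK_GROUP_BYTES

   # With the 2^32 hard limit on the number of block groups:
   max_fs_bytes = (1 << 32) * BLOCK_GROUP_BYTES

   print(groups_per_meta_bg, meta_bg_bytes // GIB, max_fs_bytes // PIB)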

The change in the filesystem format replaces the current scheme where
the superblock is followed by a variable-length set of block group
descriptors. Instead, the superblock and a single block group descriptor
block are placed at the beginning of the first, second, and last block
groups in a meta-block group. A meta-block group is a collection of
block groups which can be described by a single block group descriptor
block. Since the size of the block group descriptor structure is 64
bytes, a meta-block group contains 16 block groups for filesystems with
a 1KB block size, and 64 block groups for filesystems with a 4KB
blocksize. Filesystems can either be created using this new block group
descriptor layout, or existing filesystems can be resized on-line, and
the field s_first_meta_bg in the superblock will indicate the first
block group using this new layout.
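
A minimal sketch of the placement rule: given a block group number, find
the three groups of its meta-block group that carry the descriptor block.
The function name is hypothetical, and the sketch deliberately ignores
``s_first_meta_bg`` and a possibly shortened final meta-block group at
the end of the filesystem.

.. code-block:: python

   def meta_bg_descriptor_groups(group, block_size, desc_size=64):
       """Return the (first, second, last) groups of group's meta-block group."""
       groups_per_meta_bg = block_size // desc_size
       first = (group // groups_per_meta_bg) * groups_per_meta_bg
       return first, first + 1, first + groups_per_meta_bg - 1


   # 4KB blocks: 64 groups per meta-block group, so group 70 belongs to
   # the meta-block group spanning groups 64-127.
   print(meta_bg_descriptor_groups(70, 4096))

   # 1KB blocks: 16 groups per meta-block group.
   print(meta_bg_descriptor_groups(5, 1024))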

Please see an important note about ``BLOCK_UNINIT`` in the section about
block and inode bitmaps.

Lazy Block Group Initialization
-------------------------------

A new feature for ext4 is a set of three block group descriptor flags
that enable mkfs to skip initializing other parts of the block group
metadata. Specifically, the INODE_UNINIT and BLOCK_UNINIT flags mean
that the inode and block bitmaps for that group can be calculated and
therefore the on-disk bitmap blocks are not initialized. This is
generally the case for an empty block group or a block group containing
only fixed-location block group metadata. The INODE_ZEROED flag means
that the inode table has been initialized; mkfs will unset this flag and
rely on the kernel to initialize the inode tables in the background.
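
The meaning of the three flags can be illustrated by decoding a
descriptor's flags field. The numeric values below are assumed to match
the ``EXT4_BG_*`` definitions in the kernel's ``fs/ext4/ext4.h`` and
should be confirmed against that header; the class and function names
are hypothetical.

.. code-block:: python

   from enum import IntFlag


   class BgFlags(IntFlag):
       # Assumed values; verify against fs/ext4/ext4.h.
       INODE_UNINIT = 0x1   # inode bitmap not initialized on disk
       BLOCK_UNINIT = 0x2   # block bitmap not initialized on disk
       INODE_ZEROED = 0x4   # inode table has been zeroed


   def describe(bg_flags):
       """Summarize what the flags say about one block group."""
       flags = BgFlags(bg_flags)
       return {
           "inode_bitmap_on_disk": not (flags & BgFlags.INODE_UNINIT),
           "block_bitmap_on_disk": not (flags & BgFlags.BLOCK_UNINIT),
           "inode_table_zeroed": bool(flags & BgFlags.INODE_ZEROED),
       }


   # A freshly formatted, still untouched group might carry both
   # UNINIT flags and no INODE_ZEROED flag.
   print(describe(0x3))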

By not writing zeroes to the bitmaps and inode table, mkfs time is
reduced considerably. Note the feature flag is RO_COMPAT_GDT_CSUM,
but the dumpe2fs output prints this as “uninit_bg”. They are the same
thing.