.. SPDX-License-Identifier: GPL-2.0

======================================
Enhanced Read-Only File System - EROFS
======================================

Overview
========

EROFS stands for Enhanced Read-Only File System. Unlike other read-only file
systems, it is designed for flexibility and scalability while remaining simple
and high-performance.

It is designed as a better filesystem solution for the following scenarios:

 - read-only storage media;

 - part of a fully trusted read-only solution, which must be immutable and
   bit-for-bit identical to the official golden image of a release for
   security and other reasons;

 - cases that need to minimize extra storage space with guaranteed end-to-end
   performance by using a compact layout, transparent file compression and
   direct access, especially embedded devices with limited memory and
   high-density hosts with numerous containers.

Here are the main features of EROFS:

 - Little endian on-disk design;

 - Currently 4KB block size (nobh) and therefore maximum 16TB address space;

 - Metadata and data can be mixed by design;

 - 2 inode versions for different requirements:

   =====================  ============  =====================================
                          compact (v1)  extended (v2)
   =====================  ============  =====================================
   Inode metadata size    32 bytes      64 bytes
   Max file size          4 GB          16 EB (also limited by max. vol size)
   Max uids/gids          65536         4294967296
   File change time       no            yes (64 + 32-bit timestamp)
   Max hardlinks          65536         4294967296
   Metadata reserved      4 bytes       14 bytes
   =====================  ============  =====================================

 - Support extended attributes (xattrs) as an option;

 - Support inline xattrs and inline tail-end data for all files;

 - Support POSIX.1e ACLs by using xattrs;

 - Support transparent data compression as an option:
   LZ4 algorithm with the fixed-sized output compression for high performance;

 - Multiple device support for multi-layer container images.
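
The size limits quoted above follow from field widths. The arithmetic can be
sketched as follows, assuming 32-bit block addresses (implied by the 16TB
figure) and the 16/32-bit versus 32/64-bit field widths implied by the inode
table. The dictionary names are illustrative only and are not taken from the
kernel sources.

```python
BLOCK_SIZE = 4096  # 4KB blocks, as stated in the feature list

# Assumption: block addresses are 32 bits wide.
max_volume_bytes = BLOCK_SIZE * 2**32

# Assumption: compact inodes keep a 32-bit size and 16-bit uid/gid/nlink,
# extended inodes keep a 64-bit size and 32-bit uid/gid/nlink.
compact_limits = {"file_size": 2**32, "ids": 2**16, "hardlinks": 2**16}
extended_limits = {"file_size": 2**64, "ids": 2**32, "hardlinks": 2**32}

print(max_volume_bytes // 2**40, "TB address space")
print(compact_limits["file_size"] // 2**30, "GB max compact file size")
print(extended_limits["file_size"] // 2**60, "EB max extended file size")
```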

The following git tree provides the file system user-space tools under
development (e.g., the formatting tool mkfs.erofs):

- git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs-utils.git

Bug reports and patches are welcome. Please send them to the linux-erofs
mailing list:

- linux-erofs mailing list   <linux-erofs@lists.ozlabs.org>

Mount options
=============

===================    =========================================================
(no)user_xattr         Set up extended user attributes. Note: xattr is enabled
                       by default if CONFIG_EROFS_FS_XATTR is selected.
(no)acl                Set up POSIX Access Control Lists. Note: acl is enabled
                       by default if CONFIG_EROFS_FS_POSIX_ACL is selected.
cache_strategy=%s      Select a strategy for cached decompression from now on:

                       ==========  =============================================
                         disabled  In-place I/O decompression only;
                        readahead  Cache the last incomplete compressed physical
                                   cluster for further reading. It still does
                                   in-place I/O decompression for the rest of
                                   the compressed physical clusters;
                       readaround  Cache both ends of incomplete compressed
                                   physical clusters for further reading.
                                   It still does in-place I/O decompression
                                   for the rest of the compressed physical
                                   clusters.
                       ==========  =============================================
dax={always,never}     Use direct access (no page cache).  See
                       Documentation/filesystems/dax.rst.
dax                    A legacy option which is an alias for ``dax=always``.
device=%s              Specify a path to an extra device to be used together.
===================    =========================================================
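
As an illustration of how the options in the table combine, the sketch below
builds a comma-separated option string. ``erofs_mount_options`` is a
hypothetical helper, not part of any EROFS tool, and the device path is made
up. Only option names listed in the table are emitted.

```python
def erofs_mount_options(cache_strategy=None, dax=None, devices=(),
                        user_xattr=None, acl=None):
    """Build a comma-separated option string (hypothetical helper)."""
    opts = []
    if user_xattr is not None:
        opts.append("user_xattr" if user_xattr else "nouser_xattr")
    if acl is not None:
        opts.append("acl" if acl else "noacl")
    if cache_strategy is not None:
        if cache_strategy not in ("disabled", "readahead", "readaround"):
            raise ValueError(f"unknown cache_strategy: {cache_strategy}")
        opts.append(f"cache_strategy={cache_strategy}")
    if dax is not None:
        if dax not in ("always", "never"):
            raise ValueError(f"unknown dax mode: {dax}")
        opts.append(f"dax={dax}")
    # One device= option per extra device, in the order given.
    opts.extend(f"device={path}" for path in devices)
    return ",".join(opts)


options = erofs_mount_options(cache_strategy="readaround", acl=False,
                              devices=["/dev/loop1"])
print(options)
```

The resulting string would typically be handed to ``mount -t erofs -o``
together with the primary device and the mount point.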

Sysfs Entries
=============

Information about mounted erofs file systems can be found in /sys/fs/erofs.
Each mounted filesystem will have a directory in /sys/fs/erofs based on its
device name (e.g., /sys/fs/erofs/sda).
See also Documentation/ABI/testing/sysfs-fs-erofs.
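
A minimal sketch of enumerating these directories from user space is shown
below. ``list_erofs_mounts`` is a hypothetical helper. It lists whatever
directories it finds and does not filter out entries that may not correspond
to a device.

```python
import os


def list_erofs_mounts(sysfs_root="/sys/fs/erofs"):
    """Return {directory_name: [entry names]} for each subdirectory."""
    mounts = {}
    if not os.path.isdir(sysfs_root):
        return mounts  # erofs not available, or sysfs not mounted
    for name in sorted(os.listdir(sysfs_root)):
        path = os.path.join(sysfs_root, name)
        if os.path.isdir(path):
            mounts[name] = sorted(os.listdir(path))
    return mounts


print(list_erofs_mounts())
```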

On-disk details
===============

Summary
-------
Unlike other read-only file systems, an EROFS volume is designed to be as
simple as possible::

                                |-> aligned with the block size
   ____________________________________________________________
  | |SB| | ... | Metadata | ... | Data | Metadata | ... | Data |
  |_|__|_|_____|__________|_____|______|__________|_____|______|
  0 +1K

All data areas should be aligned with the block size, but metadata areas
need not be. All metadata can currently be observed in two different spaces
(views):

 1. Inode metadata space

    Each valid inode should be aligned with an inode slot, whose size is
    fixed (32 bytes) and designed to match the compact inode size.

    Each inode can be directly found with the following formula:
         inode offset = meta_blkaddr * block_size + 32 * nid

    ::

                                 |-> aligned with 8B
                                            |-> followed closely
     + meta_blkaddr blocks                                      |-> another slot
       _____________________________________________________________________
     |  ...   | inode |  xattrs  | extents  | data inline | ... | inode ...
     |________|_______|(optional)|(optional)|__(optional)_|_____|__________
              |-> aligned with the inode slot size
                   .                   .
                 .                         .
               .                              .
             .                                    .
           .                                         .
         .                                              .
       .____________________________________________________|-> aligned with 4B
       | xattr_ibody_header | shared xattrs | inline xattrs |
       |____________________|_______________|_______________|
       |->    12 bytes    <-|->x * 4 bytes<-|               .
                           .                .                 .
                     .                      .                   .
                .                           .                     .
            ._______________________________.______________________.
            | id | id | id | id |  ... | id | ent | ... | ent| ... |
            |____|____|____|____|______|____|_____|_____|____|_____|
                                            |-> aligned with 4B
                                                        |-> aligned with 4B

    An inode can be 32 or 64 bytes; the two sizes are distinguished by a
    field common to all inode versions, i_format::

        __________________               __________________
       |     i_format     |             |     i_format     |
       |__________________|             |__________________|
       |        ...       |             |        ...       |
       |                  |             |                  |
       |__________________| 32 bytes    |                  |
                                        |                  |
                                        |__________________| 64 bytes

    Xattrs, extents and inline data follow the corresponding inode with
    proper alignment, and they are optional depending on the data mapping.
    _Currently_, 5 data layouts are supported in total:

    ==  ====================================================================
     0  flat file data without data inline (no extent);
     1  fixed-sized output data compression (with non-compacted indexes);
     2  flat file data with tail packing data inline (no extent);
     3  fixed-sized output data compression (with compacted indexes, v5.3+);
     4  chunk-based file (v5.15+).
    ==  ====================================================================

    The size of the optional xattrs is indicated by i_xattr_count in the
    inode header. Large xattrs or xattrs shared by many different files can
    be stored in shared xattrs metadata rather than inlined right after the
    inode.

 2. Shared xattrs metadata space

    The shared xattrs space is similar to the inode space above: it starts
    at a specific block indicated by xattr_blkaddr, and its entries are laid
    out one after another with proper alignment.

    Each shared xattr can also be directly found by the following formula:
         xattr offset = xattr_blkaddr * block_size + 4 * xattr_id

::

                           |-> aligned by  4 bytes
    + xattr_blkaddr blocks                     |-> aligned with 4 bytes
     _________________________________________________________________________
    |  ...   | xattr_entry |  xattr data | ... |  xattr_entry | xattr data  ...
    |________|_____________|_____________|_____|______________|_______________
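
The two offset formulas above translate directly into code. The sketch below
uses made-up values for meta_blkaddr, xattr_blkaddr, nid and xattr_id, which
would normally come from the superblock and from on-disk references. The
function names are illustrative only.

```python
INODE_SLOT_SIZE = 32   # fixed inode slot size, in bytes
XATTR_ID_UNIT = 4      # shared xattr ids count in 4-byte units


def inode_offset(meta_blkaddr, nid, block_size=4096):
    """Byte offset of inode `nid` from the start of the volume."""
    return meta_blkaddr * block_size + INODE_SLOT_SIZE * nid


def shared_xattr_offset(xattr_blkaddr, xattr_id, block_size=4096):
    """Byte offset of shared xattr `xattr_id` from the start of the volume."""
    return xattr_blkaddr * block_size + XATTR_ID_UNIT * xattr_id


# Hypothetical values, for illustration only.
print(inode_offset(meta_blkaddr=1, nid=36))               # 4096 + 32 * 36
print(shared_xattr_offset(xattr_blkaddr=2, xattr_id=5))   # 8192 + 4 * 5
```

Since a slot is 32 bytes, a 64-byte extended inode (and any xattrs or inline
data that follow an inode) occupies more than one slot, so consecutive inodes
do not necessarily have consecutive nids.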

Directories
-----------
All directories are now organized in a compact on-disk format. Note that
each directory block is divided into index and name areas in order to support
random file lookup, and all directory entries are _strictly_ recorded in
alphabetical order in order to support an improved prefix binary search
algorithm (see the related source code for details).

::

                  ___________________________
                 /                           |
                /              ______________|________________
               /              /              | nameoff1       | nameoffN-1
  ____________.______________._______________v________________v__________
 | dirent | dirent | ... | dirent | filename | filename | ... | filename |
 |___.0___|____1___|_____|___N-1__|____0_____|____1_____|_____|___N-1____|
      \                           ^
       \                          |                           * could have
        \                         |                             trailing '\0'
         \________________________| nameoff0
                             Directory block

Note that apart from the offset of the first filename, nameoff0 also indicates
the total number of directory entries in this block, so there is no need to
introduce another on-disk field.
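
The following sketch shows how nameoff0 yields the entry count and how the
sorted names allow a binary search. It assumes a 12-byte little-endian dirent
made of a 64-bit nid, a 16-bit nameoff, an 8-bit file type and one reserved
byte. That layout is an assumption here and should be checked against
erofs_fs.h. The search is a plain binary search, not the improved prefix
variant used by the kernel, and ``build_dir_block`` is a hypothetical helper
that only exists to produce sample data.

```python
import struct

# Assumed dirent layout: nid (u64), nameoff (u16), file_type (u8), reserved (u8)
DIRENT = struct.Struct("<QHBB")


def parse_dir_block(block):
    """Return a list of (name, nid) tuples from one directory block."""
    nameoff0 = DIRENT.unpack_from(block, 0)[1]
    count = nameoff0 // DIRENT.size      # nameoff0 doubles as the entry count
    entries = [DIRENT.unpack_from(block, i * DIRENT.size) for i in range(count)]
    result = []
    for i, (nid, nameoff, _ftype, _reserved) in enumerate(entries):
        # A name ends where the next one starts, or at the end of the block.
        end = entries[i + 1][1] if i + 1 < count else len(block)
        name = block[nameoff:end].split(b"\0", 1)[0]   # drop trailing '\0'
        result.append((name, nid))
    return result


def lookup(block, target):
    """Binary search over the sorted names; returns the nid or None."""
    entries = parse_dir_block(block)
    lo, hi = 0, len(entries) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        name, nid = entries[mid]
        if name == target:
            return nid
        if name < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return None


def build_dir_block(names_to_nid, block_size=4096):
    """Pack sorted entries into one block (hypothetical sample-data helper)."""
    items = sorted(names_to_nid.items())
    first_nameoff = len(items) * DIRENT.size
    index, names = b"", b""
    for name, nid in items:
        index += DIRENT.pack(nid, first_nameoff + len(names), 0, 0)
        names += name
    return (index + names).ljust(block_size, b"\0")


block = build_dir_block({b"etc": 40, b"bin": 36, b"usr": 52})
print(parse_dir_block(block))
print(lookup(block, b"etc"))
```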

Chunk-based file
----------------
In order to support chunk-based data deduplication, a new inode data layout has
been supported since Linux v5.15: files are split into equal-sized data chunks,
with the ``extents`` area of the inode metadata indicating how to get the chunk
data. This area can be either a simple 4-byte block address array or an array
in the 8-byte chunk index form (see struct erofs_inode_chunk_index in
erofs_fs.h for more details).

Note that chunk-based files are all uncompressed for now.
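
For the 4-byte block address array form, mapping a file offset to an on-disk
position can be sketched as below. The sketch assumes that each array entry is
a little-endian 32-bit address of the first block of its chunk, and that the
chunk size is known from elsewhere. ``chunk_lookup`` and the sample values are
illustrative only.

```python
import struct


def chunk_lookup(blkaddr_array, offset, chunk_size, block_size=4096):
    """Map a logical file offset to a byte offset on the device.

    blkaddr_array holds one little-endian 4-byte block address per chunk
    (assumed to be the address of the chunk's first block).
    """
    chunk_index = offset // chunk_size
    (blkaddr,) = struct.unpack_from("<I", blkaddr_array, 4 * chunk_index)
    return blkaddr * block_size + offset % chunk_size


# Hypothetical file with 8KB chunks; chunks 0 and 2 are deduplicated and
# therefore point at the same blocks.
array = struct.pack("<3I", 100, 250, 100)
print(chunk_lookup(array, offset=8192 + 5000, chunk_size=8192))
print(chunk_lookup(array, offset=0, chunk_size=8192) ==
      chunk_lookup(array, offset=16384, chunk_size=8192))
```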

Data compression
----------------
EROFS implements LZ4 fixed-sized output compression, which generates fixed-sized
compressed data blocks from variable-sized input, in contrast to other existing
fixed-sized input solutions. Relatively higher compression ratios can be
achieved with fixed-sized output compression, since today's popular data
compression algorithms are mostly LZ77-based and a fixed-sized output approach
benefits from the historical dictionary (a.k.a. sliding window).
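
The idea of fixed-sized output compression can be illustrated with a small
sketch: instead of compressing a fixed amount of input, it looks for a long
input prefix whose compressed form still fits into one output block. zlib is
used here only because it ships with Python. It stands in for LZ4, and the
search strategy is an assumption for illustration, not the method used by
mkfs.erofs.

```python
import zlib


def compress_fixed_output(data, out_size=4096):
    """Find a long prefix of `data` whose compressed form fits in out_size.

    Returns (consumed_input_bytes, compressed_bytes). The binary search
    assumes the compressed size grows roughly with the input length; the
    result always fits, but may not be the longest possible prefix.
    """
    lo, hi = 0, len(data)            # invariant: data[:lo] fits in out_size
    best = zlib.compress(data[:0])
    while lo < hi:
        mid = (lo + hi + 1) // 2
        packed = zlib.compress(data[:mid])
        if len(packed) <= out_size:
            lo, best = mid, packed   # prefix fits, try a longer one
        else:
            hi = mid - 1             # too big, try a shorter one
    return lo, best


data = b"EROFS " * 50000
consumed, packed = compress_fixed_output(data)
print(consumed, "input bytes fit into", len(packed), "compressed bytes")
```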

In detail, the original (uncompressed) data is turned into several
variable-sized extents and, at the same time, compressed into physical clusters
(pclusters). In order to record each variable-sized extent, logical clusters
(lclusters) are introduced as the basic unit of compression indexes to indicate
whether a new extent is generated within the range (HEAD) or not (NONHEAD).
Lclusters are currently fixed at the block size, as illustrated below::

          |<-    variable-sized extent    ->|<-       VLE         ->|
        clusterofs                        clusterofs              clusterofs
          |                                 |                       |
 _________v_________________________________v_______________________v________
 ... |    .         |              |        .     |              |  .   ...
 ____|____._________|______________|________._____|______________|__.________
     |-> lcluster <-|-> lcluster <-|-> lcluster <-|-> lcluster <-|
          (HEAD)        (NONHEAD)       (HEAD)        (NONHEAD)    .
           .             CBLKCNT            .                    .
            .                               .                  .
             .                              .                .
       _______._____________________________.______________._________________
          ... |              |              |              | ...
       _______|______________|______________|______________|_________________
              |->      big pcluster       <-|-> pcluster <-|

A physical cluster can be seen as a container of physical compressed blocks
which contain compressed data. Previously, only lcluster-sized (4KB) pclusters
were supported. With the big pcluster feature (available since Linux v5.13),
a pcluster can be a multiple of the lcluster size.

For each HEAD lcluster, clusterofs is recorded to indicate where a new extent
starts and blkaddr is used to seek the compressed data. For each NONHEAD
lcluster, delta0 and delta1 are available instead of blkaddr to indicate the
distance to its HEAD lcluster and the next HEAD lcluster. A PLAIN lcluster is
also a HEAD lcluster except that its data is uncompressed. See the comments
around "struct z_erofs_vle_decompressed_index" in erofs_fs.h for more details.
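
How clusterofs and delta0 locate the extent that covers a given logical offset
can be sketched as below. The in-memory ``Head`` and ``NonHead`` tuples are a
hypothetical representation of decoded indexes, not the on-disk format. The
sketch also assumes that the first extent starts at offset 0 and that no
CBLKCNT lclusters are present.

```python
from collections import namedtuple

# Hypothetical decoded index entries, one per lcluster.
Head = namedtuple("Head", "clusterofs blkaddr")
NonHead = namedtuple("NonHead", "delta0 delta1")


def find_extent(index, offset, lcluster_size=4096):
    """Return (extent_start, blkaddr) for the extent covering `offset`."""
    lcn, ofs = divmod(offset, lcluster_size)
    entry = index[lcn]
    if isinstance(entry, Head) and ofs < entry.clusterofs:
        # The new extent starts later in this lcluster, so `offset` still
        # belongs to the extent that is open in the previous lcluster.
        lcn -= 1
        entry = index[lcn]
    if isinstance(entry, NonHead):
        lcn -= entry.delta0          # step back to the HEAD lcluster
        entry = index[lcn]
    return lcn * lcluster_size + entry.clusterofs, entry.blkaddr


# Sample indexes: extents start at 0, 10240 (lcluster 2) and 16896 (lcluster 4).
index = [Head(0, 10), NonHead(1, 1), Head(2048, 11), NonHead(1, 1), Head(512, 12)]
print(find_extent(index, 5000))
print(find_extent(index, 2 * 4096 + 100))
print(find_extent(index, 2 * 4096 + 3000))
```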

If big pcluster is enabled, the pcluster size in lclusters needs to be recorded
as well. To do so, the delta0 of the first NONHEAD lcluster stores the
compressed block count with a special flag; such an lcluster is called a
CBLKCNT NONHEAD lcluster. This is possible because the real delta0 of this
lcluster is always 1, as illustrated below::

   __________________________________________________________
  | HEAD |  NONHEAD  | NONHEAD | ... | NONHEAD | HEAD | HEAD |
  |__:___|_(CBLKCNT)_|_________|_____|_________|__:___|____:_|
     |<----- a big pcluster (with CBLKCNT) ------>|<--  -->|
           a lcluster-sized pcluster (without CBLKCNT) ^

If another HEAD follows a HEAD lcluster, there is no room to record CBLKCNT,
but it is clear that the size of such a pcluster is 1 lcluster.
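
The rule for finding the size of a pcluster can be sketched as follows. The
``Head`` and ``NonHead`` tuples are again a hypothetical decoded
representation, with ``cblkcnt`` standing for the special flag mentioned
above. This is a reading of the text, not a transcription of kernel code.

```python
from collections import namedtuple

# Hypothetical decoded index entries; cblkcnt marks a CBLKCNT NONHEAD lcluster.
Head = namedtuple("Head", "clusterofs blkaddr")
NonHead = namedtuple("NonHead", "delta0 delta1 cblkcnt")


def pcluster_blocks(index, head_lcn):
    """Compressed size, in blocks, of the pcluster of a HEAD lcluster."""
    nxt = index[head_lcn + 1] if head_lcn + 1 < len(index) else None
    if isinstance(nxt, NonHead) and nxt.cblkcnt:
        return nxt.delta0    # delta0 is reused as the compressed block count
    # No CBLKCNT lcluster follows: an lcluster-sized pcluster.
    return 1


# One big pcluster of 3 blocks spanning 4 lclusters, then two small ones.
index = [
    Head(0, 10),
    NonHead(3, 3, True),     # CBLKCNT: delta0 holds the block count
    NonHead(2, 2, False),
    NonHead(3, 1, False),
    Head(100, 13),
    Head(200, 14),
]
print([pcluster_blocks(index, lcn) for lcn in (0, 4, 5)])
```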
297