.. SPDX-License-Identifier: GPL-2.0

=======================
Squashfs 4.0 Filesystem
=======================

Squashfs is a compressed read-only filesystem for Linux.

It uses zlib, lz4, lzo, or xz compression to compress files, inodes and
directories.  Inodes in the system are very small and all blocks are packed to
minimise data overhead.  Block sizes greater than 4 KiB are supported, up to a
maximum of 1 MiB (the default block size is 128 KiB).

Squashfs is intended for general read-only filesystem use, for archival
use (i.e. in cases where a .tar.gz file may be used), and in constrained
block device/memory systems (e.g. embedded systems) where low overhead is
needed.

Mailing list: squashfs-devel@lists.sourceforge.net

Web site: www.squashfs.org

1. Filesystem Features
----------------------

Squashfs filesystem features versus Cramfs:

==============================  =========               ==========
                                Squashfs                Cramfs
==============================  =========               ==========
Max filesystem size             2^64                    256 MiB
Max file size                   ~ 2 TiB                 16 MiB
Max files                       unlimited               unlimited
Max directories                 unlimited               unlimited
Max entries per directory       unlimited               unlimited
Max block size                  1 MiB                   4 KiB
Metadata compression            yes                     no
Directory indexes               yes                     no
Sparse file support             yes                     no
Tail-end packing (fragments)    yes                     no
Exportable (NFS etc.)           yes                     no
Hard link support               yes                     no
"." and ".." in readdir         yes                     no
Real inode numbers              yes                     no
32-bit uids/gids                yes                     no
File creation time              yes                     no
Xattr support                   yes                     no
ACL support                     no                      no
==============================  =========               ==========

Squashfs compresses data, inodes and directories.  In addition, inode and
directory data are highly compacted, and packed on byte boundaries.  Each
compressed inode is on average 8 bytes in length (the exact length varies with
file type, i.e. regular file, directory, symbolic link, and block/char device
inodes have different sizes).

2. Using Squashfs
-----------------

As squashfs is a read-only filesystem, the mksquashfs program must be used to
create populated squashfs filesystems.  This and other squashfs utilities,
along with usage instructions, can be obtained from http://www.squashfs.org.

The squashfs-tools development tree is now located on kernel.org::

        git://git.kernel.org/pub/scm/fs/squashfs/squashfs-tools.git

3. Squashfs Filesystem Design
-----------------------------

A squashfs filesystem consists of a maximum of nine parts, packed together on a
byte alignment::

         ---------------
        |  superblock   |
        |---------------|
        |  compression  |
        |    options    |
        |---------------|
        |  datablocks   |
        |  & fragments  |
        |---------------|
        |  inode table  |
        |---------------|
        |   directory   |
        |     table     |
        |---------------|
        |   fragment    |
        |     table     |
        |---------------|
        |    export     |
        |     table     |
        |---------------|
        |    uid/gid    |
        |  lookup table |
        |---------------|
        |     xattr     |
        |     table     |
         ---------------

Compressed data blocks are written to the filesystem as files are read from
the source directory, and checked for duplicates.  Once all file data has been
written, the completed inode, directory, fragment, export, uid/gid lookup and
xattr tables are written.

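Because the tables are simply packed one after another, a reader has to take
their start offsets from the superblock.  The Python sketch below unpacks a
superblock into named fields.  The field order, the 96-byte size and the magic
value are assumptions based on struct squashfs_super_block in the kernel's
fs/squashfs/squashfs_fs.h, not something this document specifies, and the
function name is purely illustrative.

```python
import struct

# Assumed layout of the little-endian Squashfs 4.0 superblock (96 bytes).
SUPER_FORMAT = "<5I6H8Q"
SUPER_FIELDS = (
    "s_magic", "inodes", "mkfs_time", "block_size", "fragments",
    "compression", "block_log", "flags", "no_ids", "s_major", "s_minor",
    "root_inode", "bytes_used", "id_table_start", "xattr_id_table_start",
    "inode_table_start", "directory_table_start", "fragment_table_start",
    "lookup_table_start",
)
SQUASHFS_MAGIC = 0x73717368  # stored on disk as the bytes b"hsqs"


def parse_superblock(raw):
    """Return the superblock fields of a Squashfs image as a dict."""
    size = struct.calcsize(SUPER_FORMAT)
    values = struct.unpack(SUPER_FORMAT, raw[:size])
    sb = dict(zip(SUPER_FIELDS, values))
    if sb["s_magic"] != SQUASHFS_MAGIC:
        raise ValueError("not a Squashfs image")
    return sb
```

Passing the first 96 bytes of an image file to parse_superblock() would yield,
among others, the inode_table_start and directory_table_start offsets.
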
3.1 Compression options
-----------------------

Compressors can optionally support compression specific options (e.g.
dictionary size).  If non-default compression options have been used, then
these are stored in the compression options part, directly after the
superblock.

3.2 Inodes
----------

Metadata (inodes and directories) are compressed in 8 KiB blocks.  Each
compressed block is prefixed by a two-byte length; the top bit is set if the
block is uncompressed.  A block will be uncompressed if the mksquashfs -noI
option is set, or if the compressed block was larger than the uncompressed
block.

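The two-byte prefix described above can be decoded as in the following
sketch.  Little-endian byte order is an assumption here (this document does
not state it), and the function name is illustrative only.

```python
import struct

SQUASHFS_COMPRESSED_BIT = 1 << 15  # top bit of the 16-bit length word


def parse_metadata_header(raw):
    """Decode the two-byte header that precedes a metadata block.

    Returns (length, is_compressed), where length is the number of bytes
    stored on disk after the header.
    """
    (word,) = struct.unpack("<H", raw[:2])  # little-endian is assumed
    is_compressed = not (word & SQUASHFS_COMPRESSED_BIT)
    length = word & (SQUASHFS_COMPRESSED_BIT - 1)
    return length, is_compressed
```
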
Inodes are packed into the metadata blocks, and are not aligned to block
boundaries; an inode may therefore span two compressed blocks.  Inodes are
identified by a 48-bit number which encodes the location of the compressed
metadata block containing the inode, and the byte offset into that block where
the inode is placed (<block, offset>).

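A minimal sketch of how such a <block, offset> pair could be packed into one
number follows.  The split used here (block location in the upper 32 bits,
offset in the lower 16 bits) is an assumption, and both helper names are
hypothetical.

```python
def split_inode_ref(ref):
    """Split a 48-bit inode reference into (block, offset)."""
    block = (ref >> 16) & 0xFFFFFFFF  # location of the compressed metadata block
    offset = ref & 0xFFFF             # byte offset of the inode in that block
    return block, offset


def make_inode_ref(block, offset):
    """Inverse of split_inode_ref()."""
    return (block << 16) | offset
```
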
To maximise compression there are different inodes for each file type
(regular file, directory, device, etc.), the inode contents and length
varying with the type.

To further maximise compression, two types of regular file inode and
directory inode are defined: inodes optimised for frequently occurring
regular files and directories, and extended types where extra
information has to be stored.

3.3 Directories
---------------

Like inodes, directories are packed into compressed metadata blocks, stored
in a directory table.  Directories are accessed using the start address of
the metadata block containing the directory and the offset into the
decompressed block (<block, offset>).

Directories are organised in a slightly complex way, and are not simply
a list of file names.  The organisation takes advantage of the
fact that (in most cases) the inodes of the files will be in the same
compressed metadata block, and therefore can share the start block.
Directories are therefore organised in a two-level list: a directory
header containing the shared start block value, and a sequence of directory
entries, each of which shares that start block.  A new directory header
is written whenever the inode start block changes.  The directory
header/directory entry list is repeated as many times as necessary.

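The two-level organisation can be pictured with the following sketch, which
groups an already sorted listing into headers.  It only models the grouping
rule; the tuple layout and function name are hypothetical, and on-disk details
of the header and entry records are deliberately left out.

```python
from itertools import groupby


def group_directory(entries):
    """Group sorted (name, start_block, offset) tuples into headers.

    Each returned header is (start_block, [(name, offset), ...]); a new
    header begins whenever the inode start block changes.
    """
    listing = []
    for start_block, run in groupby(entries, key=lambda e: e[1]):
        listing.append((start_block, [(name, off) for name, _, off in run]))
    return listing
```

For a directory whose inodes all sit in one metadata block this produces a
single header, which is the common case the design is optimised for.
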
Directories are sorted, and can contain a directory index to speed up
file lookup.  Directory indexes store one entry per metadata block, each entry
storing the index/filename mapping to the first directory header
in each metadata block.  Directories are sorted in alphabetical order,
and at lookup the index is scanned linearly looking for the first filename
alphabetically larger than the filename being looked up.  At this point the
location of the metadata block the filename is in has been found: it is the
block referenced by the preceding index entry (or the first block of the
directory if there is none).
The general idea of the index is to ensure only one metadata block needs to be
decompressed to do a lookup irrespective of the length of the directory.
This scheme has the advantage that it doesn't require extra memory overhead
and doesn't require much extra storage on disk.

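The linear scan described above might look like the sketch below.  The index
is modelled as a list of (first_name, block) pairs, one per metadata block;
this representation, the first_block parameter and the function name are
assumptions made for illustration.

```python
def find_metadata_block(index, name, first_block=0):
    """Return the metadata block that would hold `name`.

    `index` is a list of (first_name, block) pairs sorted by first_name.
    `first_block` is the block where the directory starts; it is used for
    names that sort before every index entry.
    """
    block = first_block
    for first_name, index_block in index:
        if first_name > name:
            break  # first filename alphabetically larger: stop scanning
        block = index_block
    return block
```

Only the returned block then needs to be decompressed and searched for the
name.
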
3.4 File data
-------------

Regular files consist of a sequence of contiguous compressed blocks, and/or a
compressed fragment block (tail-end packed block).  The compressed size
of each datablock is stored in a block list contained within the
file inode.

To speed up access to datablocks when reading 'large' files (256 MiB or
larger), the code implements an index cache that caches the mapping from
block index to datablock location on disk.

The index cache allows Squashfs to handle large files (up to 1.75 TiB) while
retaining a simple and space-efficient block list on disk.  The cache
is split into slots, caching up to eight 224 GiB files (128 KiB blocks).
Larger files use multiple slots, with 1.75 TiB files using all 8 slots.
The index cache is designed to be memory efficient, and by default uses
16 KiB.

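The figures quoted above can be checked with a little arithmetic.  The slot
count, per-slot coverage and block size come from the text; the 4-byte block
list entry is an assumption, under which the 256 MiB threshold corresponds to
one 8 KiB metadata block full of block list entries.

```python
KIB, MIB, GIB, TIB = 2**10, 2**20, 2**30, 2**40

BLOCK_SIZE = 128 * KIB      # default data block size
METADATA_SIZE = 8 * KIB     # uncompressed metadata block size
ENTRY_SIZE = 4              # assumed size of one block list entry, in bytes
SLOTS = 8                   # number of index cache slots
PER_SLOT = 224 * GIB        # file size covered by one slot (from the text)

# Data covered by one metadata block full of block list entries.
per_metadata_block = (METADATA_SIZE // ENTRY_SIZE) * BLOCK_SIZE

# Largest file the cache can index when all slots serve the same file.
max_indexed = SLOTS * PER_SLOT

print(per_metadata_block // MIB, "MiB")   # 256 MiB
print(max_indexed / TIB, "TiB")           # 1.75 TiB
```
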
3.5 Fragment lookup table
-------------------------

Regular files can contain a fragment index which is mapped to a fragment
location on disk and compressed size using a fragment lookup table.  This
fragment lookup table is itself stored compressed into metadata blocks.
A second index table is used to locate these.  For speed of access (and
because it is small), this second index table is read at mount time and cached
in memory.

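The two-step lookup amounts to a division and a remainder, as sketched below.
The 16-byte fragment entry size is an assumption (the text does not give it),
and the function name is hypothetical.

```python
METADATA_SIZE = 8192        # uncompressed metadata block size in bytes
FRAGMENT_ENTRY_SIZE = 16    # assumed: 64-bit start, 32-bit size, 32-bit pad


def locate_fragment_entry(fragment_index):
    """Return (index_slot, offset) for a fragment index.

    index_slot selects an entry in the in-memory second index table, which
    gives the disk location of a metadata block; offset is the byte
    position of the fragment entry inside that block once decompressed.
    """
    byte_pos = fragment_index * FRAGMENT_ENTRY_SIZE
    return byte_pos // METADATA_SIZE, byte_pos % METADATA_SIZE
```

The same arithmetic would apply to the other lookup tables described below,
with a different entry size.
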
3.6 Uid/gid lookup table
------------------------

For space efficiency regular files store uid and gid indexes, which are
converted to 32-bit uids/gids using an id lookup table.  This table is
stored compressed into metadata blocks.  A second index table is used to
locate these.  For speed of access (and because it is small), this second
index table is read at mount time and cached in memory.

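The space saving comes from storing each distinct id once and referring to it
by a small index.  The sketch below shows the idea from the image builder's
side; sharing one table between uids and gids is an assumption, and the
function is illustrative rather than a description of what mksquashfs does.

```python
def build_id_table(owners):
    """Map each (uid, gid) pair to a pair of indexes into a shared id table.

    Returns (id_table, indexed), where id_table lists each distinct 32-bit
    id once and indexed holds (uid_index, gid_index) per input pair.
    """
    id_table = []
    position = {}

    def index_of(value):
        # Append the id on first sight, then reuse its position.
        if value not in position:
            position[value] = len(id_table)
            id_table.append(value)
        return position[value]

    indexed = [(index_of(uid), index_of(gid)) for uid, gid in owners]
    return id_table, indexed
```

Converting back is a plain table lookup: id_table[uid_index] gives the uid.
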
3.7 Export table
----------------

To enable Squashfs filesystems to be exportable (via NFS etc.), filesystems
can optionally (disabled with the -no-exports mksquashfs option) contain
an inode number to inode disk location lookup table.  This is required to
enable Squashfs to map inode numbers passed in filehandles to the inode
location on disk, which is necessary when the export code reinstantiates
expired/flushed inodes.

This table is stored compressed into metadata blocks.  A second index table is
used to locate these.  For speed of access (and because it is small), this
second index table is read at mount time and cached in memory.

3.8 Xattr table
---------------

The xattr table contains extended attributes for each inode.  The xattrs
for each inode are stored in a list, each list entry containing a type,
name and value field.  The type field encodes the xattr prefix
("user.", "trusted." etc.) and it also encodes how the name/value fields
should be interpreted.  Currently the type indicates whether the value
is stored inline (in which case the value field contains the xattr value),
or if it is stored out of line (in which case the value field stores a
reference to where the actual value is stored).  This allows large values
to be stored out of line, improving scanning and lookup performance.  It
also allows values to be de-duplicated, the value being stored once, and
all other occurrences holding an out of line reference to that value.

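As a sketch of how one type field can carry both pieces of information, the
following decoder keeps the prefix in the low byte and uses a flag bit for
out of line storage.  The numeric values are assumptions modelled on the
SQUASHFS_XATTR_* constants in the kernel sources and are not given by this
document; the function name is hypothetical.

```python
XATTR_PREFIXES = {0: "user.", 1: "trusted.", 2: "security."}  # assumed ids
XATTR_PREFIX_MASK = 0xFF    # low byte selects the name prefix
XATTR_VALUE_OOL = 0x100     # set when the value is stored out of line


def decode_xattr_type(type_field):
    """Return (prefix, out_of_line) for an xattr entry's type field."""
    prefix = XATTR_PREFIXES[type_field & XATTR_PREFIX_MASK]
    out_of_line = bool(type_field & XATTR_VALUE_OOL)
    return prefix, out_of_line
```
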
The xattr lists are packed into compressed 8 KiB metadata blocks.
To reduce overhead in inodes, rather than storing the on-disk
location of the xattr list inside each inode, a 32-bit xattr id
is stored.  This xattr id is mapped into the location of the xattr
list using a second xattr id lookup table.

4. TODOs and Outstanding Issues
-------------------------------

4.1 TODO list
-------------

Implement ACL support.

4.2 Squashfs Internal Cache
---------------------------

Blocks in Squashfs are compressed.  To avoid repeatedly decompressing
recently accessed data, Squashfs uses two small caches, one for metadata and
one for fragments.

The cache is not used for file datablocks; these are decompressed and cached in
the page cache in the normal way.  The cache is used to temporarily cache
fragment and metadata blocks which have been read as a result of a metadata
(i.e. inode or directory) or fragment access.  Because metadata and fragments
are packed together into blocks (to gain greater compression), the read of a
particular piece of metadata or fragment will retrieve other metadata/fragments
which have been packed with it.  Because of locality of reference, these may be
read in the near future.  Temporarily caching them ensures they are available
for near future access without requiring an additional read and decompress.

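The effect of such a cache can be illustrated with a small bounded cache of
decompressed blocks.  This is only a sketch: least-recently-used eviction, the
default capacity and all names are choices made for the example, and nothing
here describes the replacement policy or locking of the kernel implementation.

```python
from collections import OrderedDict


class BlockCache:
    """Tiny cache of decompressed blocks, keyed by their start on disk."""

    def __init__(self, read_block, capacity=8):
        self.read_block = read_block    # callable: start -> decompressed bytes
        self.capacity = capacity
        self.entries = OrderedDict()

    def get(self, start):
        if start in self.entries:
            self.entries.move_to_end(start)     # mark as recently used
            return self.entries[start]
        data = self.read_block(start)           # read and decompress once
        self.entries[start] = data
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)    # evict least recently used
        return data
```

Two inode lookups that land in the same metadata block then cost one read and
one decompression instead of two.
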
In the future this internal cache may be replaced with an implementation which
uses the kernel page cache.  Because the page cache operates on page sized
units this may introduce additional complexity in terms of locking and
associated race conditions.