.. SPDX-License-Identifier: GPL-2.0

=======================
Squashfs 4.0 Filesystem
=======================

Squashfs is a compressed read-only filesystem for Linux.

It uses zlib, lz4, lzo, or xz compression to compress files, inodes and
directories. Inodes in the system are very small and all blocks are packed to
minimise data overhead. Block sizes greater than 4 KiB are supported up to a
maximum of 1 MiB (default block size 128 KiB).

Squashfs is intended for general read-only filesystem use, for archival
use (i.e. in cases where a .tar.gz file may be used), and in constrained
block device/memory systems (e.g. embedded systems) where low overhead is
needed.

Mailing list: squashfs-devel@lists.sourceforge.net

Web site: www.squashfs.org

1. Filesystem Features
----------------------

Squashfs filesystem features versus Cramfs:

==============================  =========  ==========
Feature                         Squashfs   Cramfs
==============================  =========  ==========
Max filesystem size             2^64       256 MiB
Max file size                   ~ 2 TiB    16 MiB
Max files                       unlimited  unlimited
Max directories                 unlimited  unlimited
Max entries per directory       unlimited  unlimited
Max block size                  1 MiB      4 KiB
Metadata compression            yes        no
Directory indexes               yes        no
Sparse file support             yes        no
Tail-end packing (fragments)    yes        no
Exportable (NFS etc.)           yes        no
Hard link support               yes        no
"." and ".." in readdir         yes        no
Real inode numbers              yes        no
32-bit uids/gids                yes        no
File creation time              yes        no
Xattr support                   yes        no
ACL support                     no         no
==============================  =========  ==========

Squashfs compresses data, inodes and directories. In addition, inode and
directory data are highly compacted, and packed on byte boundaries.
Each compressed inode is on average 8 bytes in length (the exact length varies
with file type, i.e. regular file, directory, symbolic link, and block/char
device inodes have different sizes).

2. Using Squashfs
-----------------

As squashfs is a read-only filesystem, the mksquashfs program must be used to
create populated squashfs filesystems. This and other squashfs utilities
can be obtained from http://www.squashfs.org. Usage instructions can also be
obtained from this site.

The squashfs-tools development tree is now located on kernel.org::

        git://git.kernel.org/pub/scm/fs/squashfs/squashfs-tools.git

3. Squashfs Filesystem Design
-----------------------------

A squashfs filesystem consists of a maximum of nine parts, packed together on a
byte alignment::

                         ---------------
                        |  superblock   |
                        |---------------|
                        |  compression  |
                        |    options    |
                        |---------------|
                        |  datablocks   |
                        |  & fragments  |
                        |---------------|
                        |  inode table  |
                        |---------------|
                        |   directory   |
                        |     table     |
                        |---------------|
                        |   fragment    |
                        |    table      |
                        |---------------|
                        |    export     |
                        |    table      |
                        |---------------|
                        |    uid/gid    |
                        |  lookup table |
                        |---------------|
                        |     xattr     |
                        |     table     |
                         ---------------

Compressed data blocks are written to the filesystem as files are read from
the source directory, and checked for duplicates. Once all file data has been
written, the completed inode, directory, fragment, export, uid/gid lookup and
xattr tables are written.
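
As an illustration of how the parts above are located, the following Python
sketch decodes a superblock into the start offsets of the tables. The field
names and their order are an assumption taken from ``struct
squashfs_super_block`` in the kernel's ``fs/squashfs/squashfs_fs.h`` and should
be checked against the tree in use; ``parse_superblock`` is a hypothetical
helper, not part of any Squashfs tool.

```python
import struct

# Assumed layout of struct squashfs_super_block (little-endian, 96 bytes):
# five 32-bit fields, six 16-bit fields, then eight 64-bit fields.
SUPERBLOCK = struct.Struct("<5I6H8Q")
FIELDS = (
    "s_magic", "inodes", "mkfs_time", "block_size", "fragments",
    "compression", "block_log", "flags", "no_ids", "s_major", "s_minor",
    "root_inode", "bytes_used", "id_table_start", "xattr_id_table_start",
    "inode_table_start", "directory_table_start", "fragment_table_start",
    "lookup_table_start",
)
SQUASHFS_MAGIC = 0x73717368  # stored on disk as the bytes "hsqs"


def parse_superblock(raw):
    """Decode the first 96 bytes of an image into a dict of fields."""
    sb = dict(zip(FIELDS, SUPERBLOCK.unpack_from(raw, 0)))
    if sb["s_magic"] != SQUASHFS_MAGIC:
        raise ValueError("not a little-endian Squashfs image")
    return sb
```

Reading the first 96 bytes of an image file and passing them to
``parse_superblock`` yields, among others, ``inode_table_start`` and
``directory_table_start``, which are the byte offsets of the corresponding
parts in the diagram.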

3.1 Compression options
-----------------------

Compressors can optionally support compression-specific options (e.g.
dictionary size). If non-default compression options have been used, then
these are stored here.

3.2 Inodes
----------

Metadata (inodes and directories) are compressed in 8 KiB blocks. Each
compressed block is prefixed by a two-byte length; the top bit is set if the
block is uncompressed. A block will be uncompressed if the -noI option is set,
or if the compressed block was larger than the uncompressed block.

Inodes are packed into the metadata blocks, and are not aligned to block
boundaries, so an inode can straddle two compressed blocks. Inodes are
identified by a 48-bit number which encodes the location of the compressed
metadata block containing the inode, and the byte offset into that block where
the inode is placed (<block, offset>).

To maximise compression there are different inodes for each file type
(regular file, directory, device, etc.), the inode contents and length
varying with the type.
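
The two-byte metadata block header and the 48-bit inode number described in
this section can be decoded as in the following sketch. It assumes a
little-endian header, and assumes that the inode number holds the block
location in its upper 32 bits and the byte offset in its lower 16 bits. Both
function names are hypothetical.

```python
import struct

SQUASHFS_COMPRESSED_BIT = 0x8000  # top bit of the two-byte length


def parse_metadata_header(raw):
    """Return (stored length in bytes, is_compressed) for a metadata block."""
    (header,) = struct.unpack_from("<H", raw, 0)
    length = header & 0x7FFF
    # The top bit being set means the block is stored uncompressed.
    compressed = not (header & SQUASHFS_COMPRESSED_BIT)
    return length, compressed


def split_inode_ref(ref):
    """Split a 48-bit inode number into its (block, offset) pair."""
    return ref >> 16, ref & 0xFFFF
```

For example, a header of ``0x8010`` describes a 16-byte block stored
uncompressed, and an inode number of ``0x12340056`` refers to offset ``0x56``
in the metadata block located at ``0x1234``.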

To further maximise compression, two types of regular file inode and
directory inode are defined: inodes optimised for frequently occurring
regular files and directories, and extended types where extra
information has to be stored.

3.3 Directories
---------------

Like inodes, directories are packed into compressed metadata blocks, stored
in a directory table. Directories are accessed using the start address of
the metablock containing the directory and the offset into the
decompressed block (<block, offset>).

Directories are organised in a slightly complex way, and are not simply
a list of file names. The organisation takes advantage of the
fact that (in most cases) the inodes of the files will be in the same
compressed metadata block, and therefore can share the start block.
Directories are therefore organised in a two-level list: a directory
header containing the shared start block value, and a sequence of directory
entries, each of which shares that start block. A new directory header
is written whenever the inode start block changes. The directory
header/directory entry list is repeated as many times as necessary.
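
The two-level list can be sketched as a grouping step. The Python below is an
illustration only: it takes hypothetical ``(name, inode_block, inode_offset)``
tuples, sorts them by name, and starts a new header each time the inode start
block changes. Any further per-header limits or fields of the on-disk format
are ignored here.

```python
from itertools import groupby


def group_directory(entries):
    """Group (name, inode_block, inode_offset) tuples into header runs.

    Returns a list of (start_block, [(name, offset), ...]) pairs, one pair
    per directory header, in sorted filename order.
    """
    listing = []
    # Sorting the tuples orders them by name; groupby then yields one run
    # per stretch of consecutive entries sharing the same start block.
    for start_block, run in groupby(sorted(entries), key=lambda e: e[1]):
        listing.append((start_block, [(name, off) for name, _, off in run]))
    return listing
```

With entries ``a`` and ``b`` in block 0, ``c`` in block 3000 and ``d`` back in
block 0, the result has three headers: one for ``a`` and ``b``, one for ``c``,
and one for ``d``.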

Directories are sorted, and can contain a directory index to speed up
file lookup. Directory indexes store one entry per metablock, each entry
storing the index/filename mapping to the first directory header
in each metadata block. Directories are sorted in alphabetical order,
and at lookup the index is scanned linearly looking for the first filename
alphabetically larger than the filename being looked up. At this point the
location of the metadata block the filename is in has been found.
The general idea of the index is to ensure only one metadata block needs to be
decompressed to do a lookup irrespective of the length of the directory.
This scheme has the advantage that it doesn't require extra memory overhead
and doesn't require much extra storage on disk.

3.4 File data
-------------

Regular files consist of a sequence of contiguous compressed blocks, and/or a
compressed fragment block (tail-end packed block). The compressed size
of each datablock is stored in a block list contained within the
file inode.
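
Because the datablocks are contiguous and only their sizes are stored, the
disk location of block N is the file's start block plus the sizes of blocks
0 to N-1. The sketch below shows that walk. It assumes each block list entry
keeps the stored size in its low 24 bits and uses bit 24 to flag a block
stored uncompressed; ``locate_datablock`` and its parameters are hypothetical
names.

```python
SQUASHFS_COMPRESSED_BIT_BLOCK = 1 << 24  # assumed "stored uncompressed" flag


def locate_datablock(start_block, block_list, index):
    """Return (disk offset, stored size, is_compressed) of datablock `index`."""
    offset = start_block
    # Walk the block list, adding up the stored size of every earlier block.
    for entry in block_list[:index]:
        offset += entry & ~SQUASHFS_COMPRESSED_BIT_BLOCK
    entry = block_list[index]
    size = entry & ~SQUASHFS_COMPRESSED_BIT_BLOCK
    return offset, size, not (entry & SQUASHFS_COMPRESSED_BIT_BLOCK)
```

The cost of this walk grows with the block index, which is why caching the
mapping from block index to disk location pays off for large files.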

To speed up access to datablocks when reading 'large' files (256 MiB or
larger), the code implements an index cache that caches the mapping from
block index to datablock location on disk.

The index cache allows Squashfs to handle large files (up to 1.75 TiB) while
retaining a simple and space-efficient block list on disk. The cache
is split into slots, caching up to eight 224 GiB files (128 KiB blocks).
Larger files use multiple slots, with 1.75 TiB files using all 8 slots.
The index cache is designed to be memory efficient, and by default uses
16 KiB.

3.5 Fragment lookup table
-------------------------

Regular files can contain a fragment index which is mapped to a fragment
location on disk and compressed size using a fragment lookup table. This
fragment lookup table is itself stored compressed into metadata blocks.
A second index table is used to locate these. This second index table for
speed of access (and because it is small) is read at mount time and cached
in memory.
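
The two-level lookup amounts to a division and a remainder. The following
sketch assumes that each fragment lookup table entry occupies 16 bytes, so
that an 8 KiB metadata block holds 512 entries; the entry size and the
function name are assumptions made for illustration.

```python
SQUASHFS_METADATA_SIZE = 8192  # uncompressed metadata block size
FRAGMENT_ENTRY_SIZE = 16       # assumed size of one fragment table entry


def fragment_entry_position(fragment_index):
    """Return (index table slot, byte offset in the decompressed block)."""
    byte_pos = fragment_index * FRAGMENT_ENTRY_SIZE
    return byte_pos // SQUASHFS_METADATA_SIZE, byte_pos % SQUASHFS_METADATA_SIZE
```

The first value selects an entry of the in-memory index table, which gives
the disk location of the metadata block to decompress; the second value is
where the fragment entry sits inside that block.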

3.6 Uid/gid lookup table
------------------------

For space efficiency, regular files store uid and gid indexes, which are
converted to 32-bit uids/gids using an id lookup table. This table is
stored compressed into metadata blocks. A second index table is used to
locate these. This second index table for speed of access (and because it
is small) is read at mount time and cached in memory.

3.7 Export table
----------------

To enable Squashfs filesystems to be exportable (via NFS etc.) filesystems
can optionally (disabled with the -no-exports Mksquashfs option) contain
an inode number to inode disk location lookup table. This is required to
enable Squashfs to map inode numbers passed in filehandles to the inode
location on disk, which is necessary when the export code reinstantiates
expired/flushed inodes.

This table is stored compressed into metadata blocks. A second index table is
used to locate these. This second index table for speed of access (and because
it is small) is read at mount time and cached in memory.

3.8 Xattr table
---------------

The xattr table contains extended attributes for each inode. The xattrs
for each inode are stored in a list, each list entry containing a type,
name and value field. The type field encodes the xattr prefix
("user.", "trusted." etc.) and it also encodes how the name/value fields
should be interpreted. Currently the type indicates whether the value
is stored inline (in which case the value field contains the xattr value),
or if it is stored out of line (in which case the value field stores a
reference to where the actual value is stored). This allows large values
to be stored out of line, improving scanning and lookup performance, and it
also allows values to be de-duplicated, the value being stored once, and
all other occurrences holding an out of line reference to that value.

The xattr lists are packed into compressed 8 KiB metadata blocks.
To reduce overhead in inodes, rather than storing the on-disk
location of the xattr list inside each inode, a 32-bit xattr id
is stored. This xattr id is mapped into the location of the xattr
list using a second xattr id lookup table.

4. TODOs and Outstanding Issues
-------------------------------

4.1 TODO list
-------------

Implement ACL support.

4.2 Squashfs Internal Cache
---------------------------

Blocks in Squashfs are compressed. To avoid repeatedly decompressing
recently accessed data, Squashfs uses two small caches, one for metadata
and one for fragments.

These caches are not used for file datablocks; those are decompressed and
cached in the page-cache in the normal way. The caches temporarily hold
fragment and metadata blocks which have been read as a result of a metadata
(i.e. inode or directory) or fragment access. Because metadata and fragments
are packed together into blocks (to gain greater compression), the read of a
particular piece of metadata or fragment will retrieve other metadata/fragments
which have been packed with it. Because of locality of reference, these may be
read in the near future. Temporarily caching them ensures they are available
for near future access without requiring an additional read and decompress.

In the future this internal cache may be replaced with an implementation which
uses the kernel page cache.
Because the page cache operates on page-sized
units, this may introduce additional complexity in terms of locking and
associated race conditions.