.. SPDX-License-Identifier: GPL-2.0

Written by: Neil Brown
Please see MAINTAINERS file for where to send questions.

Overlay Filesystem
==================

This document describes a prototype for a new approach to providing
overlay-filesystem functionality in Linux (sometimes referred to as
union-filesystems). An overlay-filesystem tries to present a
filesystem which is the result of overlaying one filesystem on top
of the other.


Overlay objects
---------------

The overlay filesystem approach is 'hybrid', because the objects that
appear in the filesystem do not always appear to belong to that filesystem.
In many cases, an object accessed in the union will be indistinguishable
from accessing the corresponding object from the original filesystem.
This is most obvious from the 'st_dev' field returned by stat(2).

While directories will report an st_dev from the overlay-filesystem,
non-directory objects may report an st_dev from the lower filesystem or
upper filesystem that is providing the object. Similarly st_ino will
only be unique when combined with st_dev, and both of these can change
over the lifetime of a non-directory object. Many applications and
tools ignore these values and will not be affected.
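
As a minimal userspace sketch of why st_ino is only meaningful together
with st_dev, the following Python compares objects by the (st_dev, st_ino)
pair returned by lstat. The helper names ``object_identity`` and
``same_object`` are hypothetical and exist only for this illustration. On
an overlay mount the pair for a non-directory may change over the object's
lifetime, so a tool should not cache it for long.

```python
import os


def object_identity(path):
    """Return the (st_dev, st_ino) pair reported for the object at path."""
    st = os.lstat(path)  # lstat: do not follow a final symlink
    return (st.st_dev, st.st_ino)


def same_object(path_a, path_b):
    """True if both paths currently report the same (st_dev, st_ino) pair.

    Comparing st_ino alone is not enough: two objects on different
    devices may share an inode number.
    """
    return object_identity(path_a) == object_identity(path_b)
```

A tool that depends on such comparisons is among those that can be
affected by the behavior described above.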

In the special case of all overlay layers on the same underlying
filesystem, all objects will report an st_dev from the overlay
filesystem and st_ino from the underlying filesystem. This will
make the overlay mount more compliant with filesystem scanners and
overlay objects will be distinguishable from the corresponding
objects in the original filesystem.

On 64-bit systems, even if not all overlay layers are on the same
underlying filesystem, the same compliant behavior could be achieved
with the "xino" feature. The "xino" feature composes a unique object
identifier from the real object st_ino and an underlying fsid index.
The "xino" feature uses the high inode number bits for fsid, because the
underlying filesystems rarely use the high inode number bits. In case
the underlying inode number does overflow into the high xino bits, overlay
filesystem will fall back to the non xino behavior for that inode.

The "xino" feature can be enabled with the "-o xino=on" overlay mount option.
If all underlying filesystems support NFS file handles, the value of st_ino
for overlay filesystem objects is not only unique, but also persistent over
the lifetime of the filesystem. The "-o xino=auto" overlay mount option
enables the "xino" feature only if the persistent st_ino requirement is met.
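
The composition described above can be modelled in a few lines of Python.
This is an illustrative sketch only: the function name ``compose_xino``,
the ``xino_bits`` parameter and its default of 8 are assumptions made for
the example, and the exact bit layout used by the kernel may differ.

```python
def compose_xino(real_ino, fsid, xino_bits=8, width=64):
    """Model of an xino-style identifier (not the kernel's exact layout).

    The top `xino_bits` bits of a `width`-bit inode number hold the fsid
    index; the remaining low bits hold the real inode number.
    Returns None when the real inode number already uses the high bits,
    which corresponds to falling back to the non xino behavior.
    """
    if fsid < 0 or fsid >= (1 << xino_bits):
        raise ValueError("fsid does not fit in the reserved high bits")
    shift = width - xino_bits
    if real_ino >> shift:
        return None  # overflow into the high xino bits
    return (fsid << shift) | real_ino
```

With these assumed parameters, an inode number below 2**56 is combined
with the fsid index, while a larger one yields ``None``, mirroring the "ino
overflow" case.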

The following table summarizes what can be expected in different overlay
configurations.

Inode properties
````````````````

+--------------+------------+------------+-----------------+----------------+
|Configuration | Persistent | Uniform    | st_ino == d_ino | d_ino == i_ino |
|              | st_ino     | st_dev     |                 | [*]            |
+==============+=====+======+=====+======+========+========+========+=======+
|              | dir | !dir | dir | !dir |  dir   |  !dir  |  dir   | !dir  |
+--------------+-----+------+-----+------+--------+--------+--------+-------+
| All layers   |  Y  |  Y   |  Y  |  Y   |  Y     |   Y    |  Y     |  Y    |
| on same fs   |     |      |     |      |        |        |        |       |
+--------------+-----+------+-----+------+--------+--------+--------+-------+
| Layers not   |  N  |  N   |  Y  |  N   |  N     |   Y    |  N     |  Y    |
| on same fs,  |     |      |     |      |        |        |        |       |
| xino=off     |     |      |     |      |        |        |        |       |
+--------------+-----+------+-----+------+--------+--------+--------+-------+
| xino=on/auto |  Y  |  Y   |  Y  |  Y   |  Y     |   Y    |  Y     |  Y    |
+--------------+-----+------+-----+------+--------+--------+--------+-------+
| xino=on/auto,|  N  |  N   |  Y  |  N   |  N     |   Y    |  N     |  Y    |
| ino overflow |     |      |     |      |        |        |        |       |
+--------------+-----+------+-----+------+--------+--------+--------+-------+

[*] nfsd v3 readdirplus verifies d_ino == i_ino.
i_ino is exposed via several
/proc files, such as /proc/locks and /proc/self/fdinfo/<fd> of an inotify
file descriptor.

Upper and Lower
---------------

An overlay filesystem combines two filesystems - an 'upper' filesystem
and a 'lower' filesystem. When a name exists in both filesystems, the
object in the 'upper' filesystem is visible while the object in the
'lower' filesystem is either hidden or, in the case of directories,
merged with the 'upper' object.

It would be more correct to refer to an upper and lower 'directory
tree' rather than 'filesystem' as it is quite possible for both
directory trees to be in the same filesystem and there is no
requirement that the root of a filesystem be given for either upper or
lower.

A wide range of filesystems supported by Linux can be the lower filesystem,
but not all filesystems that are mountable by Linux have the features
needed for OverlayFS to work. The lower filesystem does not need to be
writable. The lower filesystem can even be another overlayfs. The upper
filesystem will normally be writable and if it is it must support the
creation of trusted.* and/or user.* extended attributes, and must provide
valid d_type in readdir responses, so NFS is not suitable.

A read-only overlay of two read-only filesystems may use any
filesystem type.

Directories
-----------

Overlaying mainly involves directories. If a given name appears in both
upper and lower filesystems and refers to a non-directory in either,
then the lower object is hidden - the name refers only to the upper
object.

Where both upper and lower objects are directories, a merged directory
is formed.

At mount time, the two directories given as mount options "lowerdir" and
"upperdir" are combined into a merged directory:

  mount -t overlay overlay -olowerdir=/lower,upperdir=/upper,\
  workdir=/work /merged

The "workdir" needs to be an empty directory on the same filesystem
as upperdir.

Then whenever a lookup is requested in such a merged directory, the
lookup is performed in each actual directory and the combined result
is cached in the dentry belonging to the overlay filesystem. If both
actual lookups find directories, both are stored and a merged
directory is created, otherwise only one is stored: the upper if it
exists, else the lower.

Only the lists of names from directories are merged.
Other content
such as metadata and extended attributes are reported for the upper
directory only. These attributes of the lower directory are hidden.

whiteouts and opaque directories
--------------------------------

In order to support rm and rmdir without changing the lower
filesystem, an overlay filesystem needs to record in the upper filesystem
that files have been removed. This is done using whiteouts and opaque
directories (non-directories are always opaque).

A whiteout is created as a character device with 0/0 device number.
When a whiteout is found in the upper level of a merged directory, any
matching name in the lower level is ignored, and the whiteout itself
is also hidden.

A directory is made opaque by setting the xattr "trusted.overlay.opaque"
to "y". Where the upper filesystem contains an opaque directory, any
directory in the lower filesystem with the same name is ignored.

readdir
-------

When a 'readdir' request is made on a merged directory, the upper and
lower directories are each read and the name lists merged in the
obvious way (upper is read first, then lower - entries that already
exist are not re-added). This merged name list is cached in the
'struct file' and so remains as long as the file is kept open.
If the
directory is opened and read by two processes at the same time, they
will each have separate caches. A seekdir to the start of the
directory (offset 0) followed by a readdir will cause the cache to be
discarded and rebuilt.

This means that changes to the merged directory do not appear while a
directory is being read. This is unlikely to be noticed by many
programs.

Seek offsets are assigned sequentially when the directories are read.
Thus if an application were to:

  - read part of a directory
  - remember an offset, and close the directory
  - re-open the directory some time later
  - seek to the remembered offset

there may be little correlation between the old and new locations in
the list of filenames, particularly if anything has changed in the
directory.

Readdir on directories that are not merged is simply handled by the
underlying directory (upper or lower).

renaming directories
--------------------

When renaming a directory that is on the lower layer or merged (i.e. the
directory was not created on the upper layer to start with) overlayfs can
handle it in two different ways:

1. return EXDEV error: this error is returned by rename(2) when trying to
   move a file or directory across filesystem boundaries. Hence
   applications are usually prepared to handle this error (mv(1) for example
   recursively copies the directory tree). This is the default behavior.

2. If the "redirect_dir" feature is enabled, then the directory will be
   copied up (but not the contents). Then the "trusted.overlay.redirect"
   extended attribute is set to the path of the original location from the
   root of the overlay. Finally the directory is moved to the new
   location.

There are several ways to tune the "redirect_dir" feature.

Kernel config options:

- OVERLAY_FS_REDIRECT_DIR:
    If this is enabled, then redirect_dir is turned on by default.
- OVERLAY_FS_REDIRECT_ALWAYS_FOLLOW:
    If this is enabled, then redirects are always followed by default. Enabling
    this results in a less secure configuration. Enable this option only when
    worried about backward compatibility with kernels that have the redirect_dir
    feature and follow redirects even if turned off.

Module options (can also be changed through /sys/module/overlay/parameters/):

- "redirect_dir=BOOL":
    See OVERLAY_FS_REDIRECT_DIR kernel config option above.
- "redirect_always_follow=BOOL":
    See OVERLAY_FS_REDIRECT_ALWAYS_FOLLOW kernel config option above.
- "redirect_max=NUM":
    The maximum number of bytes in an absolute redirect (default is 256).

Mount options:

- "redirect_dir=on":
    Redirects are enabled.
- "redirect_dir=follow":
    Redirects are not created, but followed.
- "redirect_dir=off":
    Redirects are not created and only followed if "redirect_always_follow"
    feature is enabled in the kernel/module config.
- "redirect_dir=nofollow":
    Redirects are not created and not followed (equivalent to "redirect_dir=off"
    if "redirect_always_follow" feature is not enabled).

When the NFS export feature is enabled, every copied up directory is
indexed by the file handle of the lower inode and a file handle of the
upper directory is stored in a "trusted.overlay.upper" extended attribute
on the index entry. On lookup of a merged directory, if the upper
directory does not match the file handle stored in the index, that is an
indication that multiple upper directories may be redirected to the same
lower directory. In that case, lookup returns an error and warns about
a possible inconsistency.
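
The "redirect_dir" mount option values listed earlier in this section can
be summarized as a small decision table. The Python below is only a
reading of that list, not kernel code: the function name
``redirect_policy`` is hypothetical, and "Redirects are enabled" is
interpreted here as "created and followed".

```python
def redirect_policy(redirect_dir, redirect_always_follow=False):
    """Return (create, follow) for a redirect_dir= mount option value.

    `redirect_always_follow` stands for the kernel/module level
    "redirect_always_follow" setting described in the text.
    """
    if redirect_dir == "on":
        return (True, True)    # assumed meaning of "redirects are enabled"
    if redirect_dir == "follow":
        return (False, True)   # not created, but followed
    if redirect_dir == "off":
        # not created; followed only if redirect_always_follow is set
        return (False, bool(redirect_always_follow))
    if redirect_dir == "nofollow":
        return (False, False)  # not created and not followed
    raise ValueError("unknown redirect_dir value: %r" % (redirect_dir,))
```

In this model "off" and "nofollow" give the same result whenever
``redirect_always_follow`` is false, which matches the equivalence stated
in the option list.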

Because lower layer redirects cannot be verified with the index, enabling
NFS export support on an overlay filesystem with no upper layer requires
turning off redirect follow (e.g. "redirect_dir=nofollow").


Non-directories
---------------

Objects that are not directories (files, symlinks, device-special
files etc.) are presented either from the upper or lower filesystem as
appropriate. When a file in the lower filesystem is accessed in a way
that requires write-access, such as opening for write access, changing
some metadata etc., the file is first copied from the lower filesystem
to the upper filesystem (copy_up). Note that creating a hard-link
also requires copy_up, though of course creation of a symlink does
not.

The copy_up may turn out to be unnecessary, for example if the file is
opened for read-write but the data is not modified.

The copy_up process first makes sure that the containing directory
exists in the upper filesystem - creating it and any parents as
necessary. It then creates the object with the same metadata (owner,
mode, mtime, symlink-target etc.) and then if the object is a file, the
data is copied from the lower to the upper filesystem. Finally any
extended attributes are copied up.
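
The copy_up steps above can be approximated in userspace. The following
Python is a rough model under stated assumptions, not the kernel
implementation: ``copy_up``, ``lower_root`` and ``upper_root`` are
hypothetical names, only regular files and symlinks are handled, ownership
is not copied (that would need privileges), and the data is written before
the metadata so that the copied mtime is not disturbed.

```python
import os
import shutil


def copy_up(lower_root, upper_root, relpath):
    """Userspace model of copy_up for a regular file or symlink."""
    upper_path = os.path.join(upper_root, relpath)
    if os.path.lexists(upper_path):
        return upper_path  # already present in the upper tree

    # Step 1: make sure the containing directory exists in the upper
    # tree, creating it and any parents with the lower directory's
    # mode and times.
    done = ""
    for part in os.path.dirname(relpath).split(os.sep):
        if not part:
            continue
        done = os.path.join(done, part)
        upper_dir = os.path.join(upper_root, done)
        if not os.path.isdir(upper_dir):
            os.mkdir(upper_dir)
            shutil.copystat(os.path.join(lower_root, done), upper_dir)

    # Step 2: create the object, then copy data and metadata.
    lower_path = os.path.join(lower_root, relpath)
    if os.path.islink(lower_path):
        os.symlink(os.readlink(lower_path), upper_path)
    else:
        shutil.copyfile(lower_path, upper_path)  # file data
        # mode and times; on Linux also extended attributes where possible
        shutil.copystat(lower_path, upper_path)
    return upper_path
```

The lower tree is left untouched, which is the property the overlay
relies on.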

Once the copy_up is complete, the overlay filesystem simply
provides direct access to the newly created file in the upper
filesystem - future operations on the file are barely noticed by the
overlay filesystem (though an operation on the name of the file such as
rename or unlink will of course be noticed and handled).


Permission model
----------------

Permission checking in the overlay filesystem follows these principles:

 1) permission check SHOULD return the same result before and after copy up

 2) task creating the overlay mount MUST NOT gain additional privileges

 3) non-mounting task MAY gain additional privileges through the overlay,
    compared to direct access on underlying lower or upper filesystems

This is achieved by performing two permission checks on each access:

 a) check if current task is allowed access based on local DAC (owner,
    group, mode and posix acl), as well as MAC checks

 b) check if mounting task would be allowed real operation on lower or
    upper layer based on underlying filesystem permissions, again including
    MAC checks

Check (a) ensures consistency (1) since owner, group, mode and posix acls
are copied up.
On the other hand it can result in server enforced
permissions (used by NFS, for example) being ignored (3).

Check (b) ensures that no task gains permissions to underlying layers that
the mounting task does not have (2). This also means that it is possible
to create setups where the consistency rule (1) does not hold; normally,
however, the mounting task will have sufficient privileges to perform all
operations.

Another way to demonstrate this model is drawing parallels between

  mount -t overlay overlay -olowerdir=/lower,upperdir=/upper,... /merged

and

  cp -a /lower /upper
  mount --bind /upper /merged

The resulting access permissions should be the same. The difference is in
the time of copy (on-demand vs. up-front).


Multiple lower layers
---------------------

Multiple lower layers can now be given using the colon (":") as a
separator character between the directory names. For example:

  mount -t overlay overlay -olowerdir=/lower1:/lower2:/lower3 /merged

As the example shows, "upperdir=" and "workdir=" may be omitted. In
that case the overlay will be read-only.

The specified lower directories will be stacked beginning from the
rightmost one and going left. In the above example lower1 will be the
top, lower2 the middle and lower3 the bottom layer.


Metadata only copy up
---------------------

When the metadata only copy up feature is enabled, overlayfs will only copy
up metadata (as opposed to the whole file) when a metadata specific operation
like chown/chmod is performed. The full file will be copied up later, when
the file is opened for a WRITE operation.

In other words, this is a delayed data copy up operation and data is copied
up when there is a need to actually modify data.

There are multiple ways to enable/disable this feature. A config option
CONFIG_OVERLAY_FS_METACOPY can be set/unset to enable/disable this feature
by default. Or one can enable/disable it at module load time with module
parameter metacopy=on/off. Lastly, there is also a per mount option
metacopy=on/off to enable/disable this feature per mount.

Do not use metacopy=on with untrusted upper/lower directories. Otherwise
it is possible that an attacker can create a handcrafted file with
appropriate REDIRECT and METACOPY xattrs, and gain access to the file on
lower pointed to by REDIRECT. This should not be possible on a local system
as setting
"trusted."
xattrs will require CAP_SYS_ADMIN. But it should be possible
for untrusted layers, such as those from a pen drive.

Note: redirect_dir={off|nofollow|follow[*]} and nfs_export=on mount options
conflict with metacopy=on, and will result in an error.

[*] redirect_dir=follow only conflicts with metacopy=on if upperdir=... is
given.

Sharing and copying layers
--------------------------

Lower layers may be shared among several overlay mounts and that is indeed
a very common practice. An overlay mount may use the same lower layer
path as another overlay mount and it may use a lower layer path that is
beneath or above the path of another overlay lower layer path.

Using an upper layer path and/or a workdir path that are already used by
another overlay mount is not allowed and may fail with EBUSY. Using
partially overlapping paths is not allowed and may fail with EBUSY.
If files are accessed from two overlayfs mounts which share or overlap the
upper layer and/or workdir path the behavior of the overlay is undefined,
though it will not result in a crash or deadlock.

Mounting an overlay using an upper layer path, where the upper layer path
was previously used by another mounted overlay in combination with a
different lower layer path, is allowed, unless the "inodes index" feature
or "metadata only copy up" feature is enabled.

With the "inodes index" feature, on the first time mount, an NFS file
handle of the lower layer root directory, along with the UUID of the lower
filesystem, are encoded and stored in the "trusted.overlay.origin" extended
attribute on the upper layer root directory. On subsequent mount attempts,
the lower root directory file handle and lower filesystem UUID are compared
to the stored origin in upper root directory. On failure to verify the
lower root origin, mount will fail with ESTALE. An overlayfs mount with
"inodes index" enabled will fail with EOPNOTSUPP if the lower filesystem
does not support NFS export, lower filesystem does not have a valid UUID or
if the upper filesystem does not support extended attributes.

For the "metadata only copy up" feature there is no verification mechanism
at mount time. So if the same upper is mounted with a different set of
lowers, the mount will probably succeed, but expect the unexpected later on.
So don't do it.

It is quite a common practice to copy overlay layers to a different
directory tree on the same or different underlying filesystem, and even
to a different machine. With the "inodes index" feature, trying to mount
the copied layers will fail the verification of the lower root file handle.


Non-standard behavior
---------------------

The current version of overlayfs can act as a mostly POSIX compliant
filesystem.

This is the list of cases that overlayfs doesn't currently handle:

a) POSIX mandates updating st_atime for reads. This is currently not
done in the case when the file resides on a lower layer.

b) If a file residing on a lower layer is opened for read-only and then
memory mapped with MAP_SHARED, then subsequent changes to the file are not
reflected in the memory mapping.

The following options allow overlayfs to act more like a standards
compliant filesystem:

1) "redirect_dir"

Enabled with the mount option or module option: "redirect_dir=on" or with
the kernel config option CONFIG_OVERLAY_FS_REDIRECT_DIR=y.

If this feature is disabled, then rename(2) on a lower or merged directory
will fail with EXDEV ("Invalid cross-device link").

2) "inode index"

Enabled with the mount option or module option "index=on" or with the
kernel config option CONFIG_OVERLAY_FS_INDEX=y.

If this feature is disabled and a file with multiple hard links is copied
up, then this will "break" the link. Changes will not be propagated to
other names referring to the same inode.

3) "xino"

Enabled with the mount option "xino=auto" or "xino=on", with the module
option "xino_auto=on" or with the kernel config option
CONFIG_OVERLAY_FS_XINO_AUTO=y. Also implicitly enabled by using the same
underlying filesystem for all layers making up the overlay.

If this feature is disabled or the underlying filesystem doesn't have
enough free bits in the inode number, then overlayfs will not be able to
guarantee that the values of st_ino and st_dev returned by stat(2) and the
value of d_ino returned by readdir(3) will act like on a normal filesystem.
E.g. the value of st_dev may be different for two objects in the same
overlay filesystem and the value of st_ino for filesystem objects may not be
persistent and could change even while the overlay filesystem is mounted, as
summarized in the `Inode properties`_ table above.
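
One way to observe the st_ino/d_ino relationship from userspace is to
compare the inode number reported by directory iteration with the one
reported by lstat. The Python below is a minimal sketch; the helper names
are hypothetical, and it relies on ``os.DirEntry.inode()`` returning the
inode number taken from the directory entry on Unix.

```python
import os


def compare_dir_inodes(directory):
    """Map each entry name in `directory` to a (d_ino, st_ino) pair."""
    result = {}
    with os.scandir(directory) as entries:
        for entry in entries:
            d_ino = entry.inode()                 # from the directory entry
            st_ino = os.lstat(entry.path).st_ino  # from a separate lstat
            result[entry.name] = (d_ino, st_ino)
    return result


def mismatched_entries(directory):
    """Return the sorted names whose d_ino differs from st_ino."""
    pairs = compare_dir_inodes(directory)
    return sorted(name for name, (d, s) in pairs.items() if d != s)
```

Run against a directory in an overlay mount, a non-empty result from
``mismatched_entries`` would correspond to an "N" in the
"st_ino == d_ino" column of the table.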


Changes to underlying filesystems
---------------------------------

Changes to the underlying filesystems while part of a mounted overlay
filesystem are not allowed. If the underlying filesystem is changed,
the behavior of the overlay is undefined, though it will not result in
a crash or deadlock.

Offline changes, when the overlay is not mounted, are allowed to the
upper tree. Offline changes to the lower tree are only allowed if the
"metadata only copy up", "inode index", "xino" and "redirect_dir" features
have not been used. If the lower tree is modified and any of these
features has been used, the behavior of the overlay is undefined,
though it will not result in a crash or deadlock.

When the overlay NFS export feature is enabled, overlay filesystem
behavior on offline changes of the underlying lower layer is different
from the behavior when NFS export is disabled.

On every copy_up, an NFS file handle of the lower inode and the UUID of
the lower filesystem are encoded and stored in an extended attribute
"trusted.overlay.origin" on the upper inode.
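
As a rough illustration, the stored origin can be inspected from user space
on an upper-layer file. The sketch below is an assumption-laden example, not
part of overlayfs: the helper name ``read_origin_xattr`` and its ``getxattr``
parameter are hypothetical, and the returned bytes are treated as opaque.
It relies on Python's ``os.getxattr``, which is available on Linux only, and
reading the "trusted." namespace normally requires CAP_SYS_ADMIN.

```python
import errno
import os
from typing import Callable, Optional

ORIGIN_XATTR = "trusted.overlay.origin"


def read_origin_xattr(
    upper_path: str,
    name: str = ORIGIN_XATTR,
    getxattr: Optional[Callable[..., bytes]] = None,
) -> Optional[bytes]:
    """Return the raw origin attribute of an upper inode, or None if absent.

    ``getxattr`` is a hypothetical injection point that makes the helper
    testable without an overlay mount; by default os.getxattr is used.
    """
    getter = getxattr if getxattr is not None else os.getxattr
    try:
        # Do not follow symlinks: the attribute lives on the inode itself.
        return getter(upper_path, name, follow_symlinks=False)
    except OSError as err:
        if err.errno == errno.ENODATA:
            # No such attribute (or not visible to this process).
            return None
        raise
```

A caller could print ``value.hex()`` for a file that has been copied up; a
pure-upper file is expected to have no such attribute.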

When the NFS export feature is enabled, a lookup of a merged directory,
that found a lower directory at the lookup path or at the path pointed
to by the "trusted.overlay.redirect" extended attribute, will verify
that the found lower directory file handle and lower filesystem UUID
match the origin file handle that was stored at copy_up time. If a
found lower directory does not match the stored origin, that directory
will not be merged with the upper directory.



NFS export
----------

When the underlying filesystems support NFS export and the "nfs_export"
feature is enabled, an overlay filesystem may be exported to NFS.

With the "nfs_export" feature, on copy_up of any lower object, an index
entry is created under the index directory. The index entry name is the
hexadecimal representation of the copy up origin file handle. For a
non-directory object, the index entry is a hard link to the upper inode.
For a directory object, the index entry has an extended attribute
"trusted.overlay.upper" with an encoded file handle of the upper
directory inode.

When encoding a file handle from an overlay filesystem object, the
following rules apply:

1. For a non-upper object, encode a lower file handle from lower inode
2. For an indexed object, encode a lower file handle from copy_up origin
3. For a pure-upper object and for an existing non-indexed upper object,
   encode an upper file handle from upper inode

The encoded overlay file handle includes:

- Header including path type information (e.g. lower/upper)
- UUID of the underlying filesystem
- Underlying filesystem encoding of underlying inode

This encoding format is identical to the encoding format of file handles
that are stored in the extended attribute "trusted.overlay.origin".

When decoding an overlay file handle, the following steps are followed:

1. Find underlying layer by UUID and path type information.
2. Decode the underlying filesystem file handle to underlying dentry.
3. For a lower file handle, lookup the handle in index directory by name.
4. If a whiteout is found in index, return ESTALE. This represents an
   overlay object that was deleted after its file handle was encoded.
5. For a non-directory, instantiate a disconnected overlay dentry from the
   decoded underlying dentry, the path type and index inode, if found.
6. For a directory, use the connected underlying decoded dentry, path type
   and index, to lookup a connected overlay dentry.

Decoding a non-directory file handle may return a disconnected dentry.
copy_up of that disconnected dentry will create an upper index entry with
no upper alias.

When overlay filesystem has multiple lower layers, a middle layer
directory may have a "redirect" to a lower directory. Because middle layer
"redirects" are not indexed, a lower file handle that was encoded from the
"redirect" origin directory cannot be used to find the middle or upper
layer directory. Similarly, a lower file handle that was encoded from a
descendant of the "redirect" origin directory cannot be used to
reconstruct a connected overlay path. To mitigate the cases of
directories that cannot be decoded from a lower file handle, these
directories are copied up on encode and encoded as an upper file handle.
On an overlay filesystem with no upper layer this mitigation cannot be
used, so NFS export in this setup requires turning off redirect follow
(e.g. "redirect_dir=nofollow").

The overlay filesystem does not support non-directory connectable file
handles, so exporting with the 'subtree_check' exportfs configuration will
cause failures to look up files over NFS.

When the NFS export feature is enabled, all directory index entries are
verified at mount time to check that upper file handles are not stale.
This verification may cause significant overhead in some cases.
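
The three encoding rules listed earlier in this section can be read as a
small decision function. The sketch below only restates those rules; the
``OverlayObject`` type and the ``handle_to_encode`` helper are hypothetical
names introduced for illustration, and the copy-up-on-encode mitigation for
redirected directories described above is deliberately not modelled.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class OverlayObject:
    """Hypothetical summary of an overlay object's state."""

    has_upper: bool  # an upper inode exists (pure upper or copied up)
    indexed: bool    # an index entry exists for the copy up origin


def handle_to_encode(obj: OverlayObject) -> Tuple[str, str]:
    """Return (handle type, source inode) according to the encoding rules."""
    if not obj.has_upper:
        # Rule 1: non-upper object.
        return ("lower", "lower inode")
    if obj.indexed:
        # Rule 2: indexed object.
        return ("lower", "copy_up origin")
    # Rule 3: pure-upper object or existing non-indexed upper object.
    return ("upper", "upper inode")
```

For example, ``handle_to_encode(OverlayObject(has_upper=True, indexed=True))``
selects a lower file handle taken from the copy_up origin, so that a handle
encoded before copy up and one encoded after it refer to the same origin.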

Note: the mount options index=off,nfs_export=on are conflicting for a
read-write mount and will result in an error.

Note: the mount option uuid=off can be used to replace the UUID of the
underlying filesystem in file handles with null, and effectively disable
UUID checks. This can be useful in case the underlying disk is copied and
the UUID of this copy is changed. This is only applicable if all
lower/upper/work directories are on the same filesystem, otherwise it will
fall back to normal behaviour.

Volatile mount
--------------

This is enabled with the "volatile" mount option. Volatile mounts are not
guaranteed to survive a crash. It is strongly recommended that volatile
mounts are only used if data written to the overlay can be recreated
without significant effort.

The advantage of mounting with the "volatile" option is that all forms of
sync calls to the upper filesystem are omitted.

In order to avoid giving a false sense of safety, the syncfs (and fsync)
semantics of volatile mounts are slightly different from those of the rest
of VFS. If any writeback error occurs on the upperdir's filesystem after a
volatile mount takes place, all sync functions will return an error. Once
this condition is reached, the filesystem will not recover, and every
subsequent sync call will return an error, even if the upperdir has not
experienced a new error since the last sync call.

When overlay is mounted with the "volatile" option, the directory
"$workdir/work/incompat/volatile" is created. During the next mount, overlay
checks for this directory and refuses to mount if it is present. This is a
strong indicator that the user should throw away the upper and work
directories and create fresh ones. In very limited cases where the user
knows that the system has not crashed and the contents of upperdir are
intact, the "volatile" directory can be removed.


User xattr
----------

The "-o userxattr" mount option forces overlayfs to use the
"user.overlay." xattr namespace instead of "trusted.overlay.". This is
useful for unprivileged mounting of overlayfs.


Testsuite
---------

There's a testsuite originally developed by David Howells and currently
maintained by Amir Goldstein at:

  https://github.com/amir73il/unionmount-testsuite.git

Run as root:

  # cd unionmount-testsuite
  # ./run --ov --verify