.. SPDX-License-Identifier: GPL-2.0

Written by: Neil Brown
Please see MAINTAINERS file for where to send questions.

Overlay Filesystem
==================

This document describes a prototype for a new approach to providing
overlay-filesystem functionality in Linux (sometimes referred to as
union-filesystems). An overlay-filesystem tries to present a
filesystem which is the result of overlaying one filesystem on top
of the other.


Overlay objects
---------------

The overlay filesystem approach is 'hybrid', because the objects that
appear in the filesystem do not always appear to belong to that filesystem.
In many cases, an object accessed in the union will be indistinguishable
from accessing the corresponding object from the original filesystem.
This is most obvious from the 'st_dev' field returned by stat(2).

While directories will report an st_dev from the overlay-filesystem,
non-directory objects may report an st_dev from the lower filesystem or
upper filesystem that is providing the object. Similarly st_ino will
only be unique when combined with st_dev, and both of these can change
over the lifetime of a non-directory object. Many applications and
tools ignore these values and will not be affected.

In the special case of all overlay layers on the same underlying
filesystem, all objects will report an st_dev from the overlay
filesystem and st_ino from the underlying filesystem. This will
make the overlay mount more compliant with filesystem scanners and
overlay objects will be distinguishable from the corresponding
objects in the original filesystem.

On 64bit systems, even if all overlay layers are not on the same
underlying filesystem, the same compliant behavior could be achieved
with the "xino" feature. The "xino" feature composes a unique object
identifier from the real object st_ino and an underlying fsid index.
If all underlying filesystems support NFS file handles and export file
handles with 32bit inode number encoding (e.g. ext4), overlay filesystem
will use the high inode number bits for fsid. Even when the underlying
filesystem uses 64bit inode numbers, users can still enable the "xino"
feature with the "-o xino=on" overlay mount option. That is useful for the
case of underlying filesystems like xfs and tmpfs, which use 64bit inode
numbers, but are very unlikely to use the high inode number bits.


Upper and Lower
---------------

An overlay filesystem combines two filesystems - an 'upper' filesystem
and a 'lower' filesystem. When a name exists in both filesystems, the
object in the 'upper' filesystem is visible while the object in the
'lower' filesystem is either hidden or, in the case of directories,
merged with the 'upper' object.

It would be more correct to refer to an upper and lower 'directory
tree' rather than 'filesystem' as it is quite possible for both
directory trees to be in the same filesystem and there is no
requirement that the root of a filesystem be given for either upper or
lower.

The lower filesystem can be any filesystem supported by Linux and does
not need to be writable. The lower filesystem can even be another
overlayfs. The upper filesystem will normally be writable and if it
is it must support the creation of trusted.* extended attributes, and
must provide valid d_type in readdir responses, so NFS is not suitable.

A read-only overlay of two read-only filesystems may use any
filesystem type.

Directories
-----------

Overlaying mainly involves directories. If a given name appears in both
upper and lower filesystems and refers to a non-directory in either,
then the lower object is hidden - the name refers only to the upper
object.

Where both upper and lower objects are directories, a merged directory
is formed.
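
The visibility rule above can be illustrated with a small user-space model.
This is only a sketch, not kernel code: the function name ``resolve`` and the
representation of a layer as a dictionary mapping a name to ``"dir"`` or
``"file"`` are assumptions made for illustration, and whiteouts and opaque
directories are ignored here.

```python
def resolve(name, upper, lower):
    """Return which layer provides `name`, or None if it is absent.

    `upper` and `lower` are hypothetical stand-ins for the two directory
    trees: dicts mapping a name to "dir" or "file" (any non-directory).
    """
    up = upper.get(name)
    low = lower.get(name)
    if up is None and low is None:
        return None        # the name exists in neither layer
    if up == "dir" and low == "dir":
        return "merged"    # both are directories: a merged directory
    if up is not None:
        return "upper"     # the upper object hides any lower object
    return "lower"         # only the lower object exists


upper = {"etc": "dir", "motd": "file", "var": "file"}
lower = {"etc": "dir", "motd": "file", "var": "dir", "usr": "dir"}
for name in ("etc", "motd", "var", "usr", "tmp"):
    print(name, resolve(name, upper, lower))
```

Note that "var" resolves to the upper object only: a non-directory in either
layer prevents merging.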

At mount time, the two directories given as mount options "lowerdir" and
"upperdir" are combined into a merged directory:

  mount -t overlay overlay -olowerdir=/lower,upperdir=/upper,\
  workdir=/work /merged

The "workdir" needs to be an empty directory on the same filesystem
as upperdir.

Then whenever a lookup is requested in such a merged directory, the
lookup is performed in each actual directory and the combined result
is cached in the dentry belonging to the overlay filesystem. If both
actual lookups find directories, both are stored and a merged
directory is created, otherwise only one is stored: the upper if it
exists, else the lower.

Only the lists of names from directories are merged. Other content
such as metadata and extended attributes are reported for the upper
directory only. These attributes of the lower directory are hidden.

whiteouts and opaque directories
--------------------------------

In order to support rm and rmdir without changing the lower
filesystem, an overlay filesystem needs to record in the upper filesystem
that files have been removed. This is done using whiteouts and opaque
directories (non-directories are always opaque).

A whiteout is created as a character device with 0/0 device number.
When a whiteout is found in the upper level of a merged directory, any
matching name in the lower level is ignored, and the whiteout itself
is also hidden.

A directory is made opaque by setting the xattr "trusted.overlay.opaque"
to "y". Where the upper filesystem contains an opaque directory, any
directory in the lower filesystem with the same name is ignored.

readdir
-------

When a 'readdir' request is made on a merged directory, the upper and
lower directories are each read and the name lists merged in the
obvious way (upper is read first, then lower - entries that already
exist are not re-added).
This merged name list is cached in the
'struct file' and so remains as long as the file is kept open. If the
directory is opened and read by two processes at the same time, they
will each have separate caches. A seekdir to the start of the
directory (offset 0) followed by a readdir will cause the cache to be
discarded and rebuilt.

This means that changes to the merged directory do not appear while a
directory is being read. This is unlikely to be noticed by many
programs.

Seek offsets are assigned sequentially when the directories are read.
Thus if a program were to

  - read part of a directory
  - remember an offset, and close the directory
  - re-open the directory some time later
  - seek to the remembered offset

there may be little correlation between the old and new locations in
the list of filenames, particularly if anything has changed in the
directory.

Readdir on directories that are not merged is simply handled by the
underlying directory (upper or lower).

renaming directories
--------------------

When renaming a directory that is on the lower layer or merged (i.e. the
directory was not created on the upper layer to start with) overlayfs can
handle it in two different ways:

1. return EXDEV error: this error is returned by rename(2) when trying to
   move a file or directory across filesystem boundaries. Hence
   applications are usually prepared to handle this error (mv(1) for example
   recursively copies the directory tree). This is the default behavior.

2. If the "redirect_dir" feature is enabled, then the directory will be
   copied up (but not the contents). Then the "trusted.overlay.redirect"
   extended attribute is set to the path of the original location from the
   root of the overlay. Finally the directory is moved to the new
   location.

There are several ways to tune the "redirect_dir" feature.

Kernel config options:

- OVERLAY_FS_REDIRECT_DIR:
    If this is enabled, then redirect_dir is turned on by default.
- OVERLAY_FS_REDIRECT_ALWAYS_FOLLOW:
    If this is enabled, then redirects are always followed by default. Enabling
    this results in a less secure configuration. Enable this option only when
    worried about backward compatibility with kernels that have the redirect_dir
    feature and follow redirects even if turned off.

Module options (can also be changed through /sys/module/overlay/parameters/):

- "redirect_dir=BOOL":
    See OVERLAY_FS_REDIRECT_DIR kernel config option above.
- "redirect_always_follow=BOOL":
    See OVERLAY_FS_REDIRECT_ALWAYS_FOLLOW kernel config option above.
- "redirect_max=NUM":
    The maximum number of bytes in an absolute redirect (default is 256).

Mount options:

- "redirect_dir=on":
    Redirects are enabled.
- "redirect_dir=follow":
    Redirects are not created, but followed.
- "redirect_dir=off":
    Redirects are not created and only followed if "redirect_always_follow"
    feature is enabled in the kernel/module config.
- "redirect_dir=nofollow":
    Redirects are not created and not followed (equivalent to "redirect_dir=off"
    if "redirect_always_follow" feature is not enabled).

When the NFS export feature is enabled, every copied up directory is
indexed by the file handle of the lower inode and a file handle of the
upper directory is stored in a "trusted.overlay.upper" extended attribute
on the index entry. On lookup of a merged directory, if the upper
directory does not match the file handle stored in the index, that is an
indication that multiple upper directories may be redirected to the same
lower directory. In that case, lookup returns an error and warns about
a possible inconsistency.
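
As a compact summary of the "redirect_dir" mount option values listed earlier
in this section, the following sketch maps each value to a pair of booleans.
The function name ``redirect_policy`` and the ``always_follow`` parameter
(standing in for the "redirect_always_follow" kernel/module setting) are
hypothetical, and reading "Redirects are enabled" as "created and followed"
is an interpretation of the text above, not a statement about the kernel
implementation.

```python
def redirect_policy(option, always_follow=False):
    """Return (create_redirects, follow_redirects) for a redirect_dir value.

    `always_follow` models the "redirect_always_follow" kernel/module
    setting; it only matters for redirect_dir=off.
    """
    if option == "on":
        return (True, True)            # assumed: created and followed
    if option == "follow":
        return (False, True)           # not created, but followed
    if option == "off":
        return (False, always_follow)  # followed only if always_follow
    if option == "nofollow":
        return (False, False)          # neither created nor followed
    raise ValueError("unknown redirect_dir value: %r" % (option,))


for opt in ("on", "follow", "off", "nofollow"):
    print(opt, redirect_policy(opt), redirect_policy(opt, always_follow=True))
```

With ``always_follow=False``, "off" and "nofollow" give the same result, which
matches the equivalence stated in the option list.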

Because lower layer redirects cannot be verified with the index, enabling
NFS export support on an overlay filesystem with no upper layer requires
turning off redirect follow (e.g. "redirect_dir=nofollow").


Non-directories
---------------

Objects that are not directories (files, symlinks, device-special
files etc.) are presented either from the upper or lower filesystem as
appropriate. When a file in the lower filesystem is accessed in a way
that requires write-access, such as opening for write access, changing
some metadata etc., the file is first copied from the lower filesystem
to the upper filesystem (copy_up). Note that creating a hard-link
also requires copy_up, though of course creation of a symlink does
not.

The copy_up may turn out to be unnecessary, for example if the file is
opened for read-write but the data is not modified.

The copy_up process first makes sure that the containing directory
exists in the upper filesystem - creating it and any parents as
necessary. It then creates the object with the same metadata (owner,
mode, mtime, symlink-target etc.) and then if the object is a file, the
data is copied from the lower to the upper filesystem. Finally any
extended attributes are copied up.

Once the copy_up is complete, the overlay filesystem simply
provides direct access to the newly created file in the upper
filesystem - future operations on the file are barely noticed by the
overlay filesystem (though an operation on the name of the file such as
rename or unlink will of course be noticed and handled).
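
The copy_up steps described above can be imitated in user space on two
ordinary directory trees. The following is a rough sketch under stated
assumptions, not the kernel implementation: the function ``copy_up`` and the
directory layout used in ``demo`` are made up for illustration, ownership is
not changed, and metadata and extended attributes are copied on a best-effort
basis by ``shutil.copystat``.

```python
import os
import shutil
import tempfile


def copy_up(lower_root, upper_root, relpath):
    """Imitate copy_up of a regular file or symlink from lower to upper."""
    src = os.path.join(lower_root, relpath)
    dst = os.path.join(upper_root, relpath)
    if os.path.lexists(dst):
        return dst  # already present in upper, nothing to copy

    # Make sure the containing directory exists in upper, creating it
    # and any parents with the metadata of their lower counterparts.
    parent = os.path.dirname(relpath)
    partial = ""
    for name in (parent.split(os.sep) if parent else []):
        partial = os.path.join(partial, name)
        upper_dir = os.path.join(upper_root, partial)
        if not os.path.isdir(upper_dir):
            os.mkdir(upper_dir)
            shutil.copystat(os.path.join(lower_root, partial), upper_dir)

    if os.path.islink(src):
        # A symlink is recreated with the same target.
        os.symlink(os.readlink(src), dst)
    else:
        # Copy the data, then mode, timestamps and (where the
        # filesystem allows it) extended attributes.
        shutil.copyfile(src, dst)
        shutil.copystat(src, dst)
    return dst


def demo():
    """Copy up lower/a/b/f and return the content seen in upper."""
    with tempfile.TemporaryDirectory() as tmp:
        lower = os.path.join(tmp, "lower")
        upper = os.path.join(tmp, "upper")
        os.makedirs(os.path.join(lower, "a", "b"))
        os.mkdir(upper)
        with open(os.path.join(lower, "a", "b", "f"), "w") as fh:
            fh.write("data")
        dst = copy_up(lower, upper, os.path.join("a", "b", "f"))
        with open(dst) as fh:
            return fh.read()


print(demo())
```

Unlike this sketch, the real copy_up happens on demand inside the filesystem,
triggered by the first access that requires write-access.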


Permission model
----------------

Permission checking in the overlay filesystem follows these principles:

 1) permission check SHOULD return the same result before and after copy up

 2) task creating the overlay mount MUST NOT gain additional privileges

 3) non-mounting task MAY gain additional privileges through the overlay,
    compared to direct access on underlying lower or upper filesystems

This is achieved by performing two permission checks on each access:

 a) check if current task is allowed access based on local DAC (owner,
    group, mode and posix acl), as well as MAC checks

 b) check if mounting task would be allowed real operation on lower or
    upper layer based on underlying filesystem permissions, again including
    MAC checks

Check (a) ensures consistency (1) since owner, group, mode and posix acls
are copied up. On the other hand it can result in server enforced
permissions (used by NFS, for example) being ignored (3).

Check (b) ensures that no task gains permissions to underlying layers that
the mounting task does not have (2). This also means that it is possible
to create setups where the consistency rule (1) does not hold; normally,
however, the mounting task will have sufficient privileges to perform all
operations.

Another way to demonstrate this model is drawing parallels between

  mount -t overlay overlay -olowerdir=/lower,upperdir=/upper,... /merged

and

  cp -a /lower /upper
  mount --bind /upper /merged

The resulting access permissions should be the same. The difference is in
the time of copy (on-demand vs. up-front).


Multiple lower layers
---------------------

Multiple lower layers can now be given using the colon (":") as a
separator character between the directory names.
For example:

  mount -t overlay overlay -olowerdir=/lower1:/lower2:/lower3 /merged

As the example shows, "upperdir=" and "workdir=" may be omitted. In
that case the overlay will be read-only.

The specified lower directories will be stacked beginning from the
rightmost one and going left. In the above example lower1 will be the
top, lower2 the middle and lower3 the bottom layer.


Metadata only copy up
---------------------

When metadata only copy up feature is enabled, overlayfs will only copy
up metadata (as opposed to whole file), when a metadata specific operation
like chown/chmod is performed. Full file will be copied up later when
file is opened for WRITE operation.

In other words, this is delayed data copy up operation and data is copied
up when there is a need to actually modify data.

There are multiple ways to enable/disable this feature. A config option
CONFIG_OVERLAY_FS_METACOPY can be set/unset to enable/disable this feature
by default. Or one can enable/disable it at module load time with module
parameter metacopy=on/off. Lastly, there is also a per mount option
metacopy=on/off to enable/disable this feature per mount.

Do not use metacopy=on with untrusted upper/lower directories. Otherwise
it is possible that an attacker can create a handcrafted file with
appropriate REDIRECT and METACOPY xattrs, and gain access to file on lower
pointed by REDIRECT. This should not be possible on local system as setting
"trusted." xattrs will require CAP_SYS_ADMIN. But it should be possible
for untrusted layers like from a pen drive.

Note: redirect_dir={off|nofollow|follow[*]} conflicts with metacopy=on, and
results in an error.

[*] redirect_dir=follow only conflicts with metacopy=on if upperdir=... is
given.
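
The conflict rule in the note above can be written down as a small predicate.
This is a sketch of the documented rule only: the function name
``metacopy_conflict`` and its parameters are hypothetical, and it does not
describe how the kernel parses or reports the options.

```python
def metacopy_conflict(redirect_dir, metacopy, has_upperdir):
    """Return True if the option combination is documented as an error.

    redirect_dir: one of "on", "follow", "off", "nofollow"
    metacopy:     True for metacopy=on, False for metacopy=off
    has_upperdir: True if an upperdir= option is given
    """
    if not metacopy:
        return False              # the rule only concerns metacopy=on
    if redirect_dir in ("off", "nofollow"):
        return True               # always conflicts with metacopy=on
    if redirect_dir == "follow":
        return has_upperdir       # conflicts only when upperdir is given
    return False                  # redirect_dir=on does not conflict


print(metacopy_conflict("follow", True, has_upperdir=True))
print(metacopy_conflict("follow", True, has_upperdir=False))
```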

Sharing and copying layers
--------------------------

Lower layers may be shared among several overlay mounts and that is indeed
a very common practice. An overlay mount may use the same lower layer
path as another overlay mount and it may use a lower layer path that is
beneath or above the path of another overlay lower layer path.

Using an upper layer path and/or a workdir path that are already used by
another overlay mount is not allowed and may fail with EBUSY. Using
partially overlapping paths is not allowed and may fail with EBUSY.
If files are accessed from two overlayfs mounts which share or overlap the
upper layer and/or workdir path the behavior of the overlay is undefined,
though it will not result in a crash or deadlock.

Mounting an overlay using an upper layer path, where the upper layer path
was previously used by another mounted overlay in combination with a
different lower layer path, is allowed, unless the "inodes index" feature
or "metadata only copy up" feature is enabled.

With the "inodes index" feature, on the first time mount, an NFS file
handle of the lower layer root directory, along with the UUID of the lower
filesystem, are encoded and stored in the "trusted.overlay.origin" extended
attribute on the upper layer root directory. On subsequent mount attempts,
the lower root directory file handle and lower filesystem UUID are compared
to the stored origin in upper root directory. On failure to verify the
lower root origin, mount will fail with ESTALE. An overlayfs mount with
"inodes index" enabled will fail with EOPNOTSUPP if the lower filesystem
does not support NFS export, lower filesystem does not have a valid UUID or
if the upper filesystem does not support extended attributes.

For "metadata only copy up" feature there is no verification mechanism at
mount time.
So if the same upper is mounted with a different set of lower layers, the
mount will probably succeed, but expect the unexpected later on. So don't
do it.

It is quite a common practice to copy overlay layers to a different
directory tree on the same or different underlying filesystem, and even
to a different machine. With the "inodes index" feature, trying to mount
the copied layers will fail the verification of the lower root file handle.


Non-standard behavior
---------------------

Current version of overlayfs can act as a mostly POSIX compliant
filesystem.

This is the list of cases that overlayfs doesn't currently handle:

a) POSIX mandates updating st_atime for reads. This is currently not
   done in the case when the file resides on a lower layer.

b) If a file residing on a lower layer is opened for read-only and then
   memory mapped with MAP_SHARED, then subsequent changes to the file are not
   reflected in the memory mapping.

The following options allow overlayfs to act more like a standards
compliant filesystem:

1) "redirect_dir"

Enabled with the mount option or module option: "redirect_dir=on" or with
the kernel config option CONFIG_OVERLAY_FS_REDIRECT_DIR=y.

If this feature is disabled, then rename(2) on a lower or merged directory
will fail with EXDEV ("Invalid cross-device link").

2) "inode index"

Enabled with the mount option or module option "index=on" or with the
kernel config option CONFIG_OVERLAY_FS_INDEX=y.

If this feature is disabled and a file with multiple hard links is copied
up, then this will "break" the link. Changes will not be propagated to
other names referring to the same inode.

3) "xino"

Enabled with the mount option "xino=auto" or "xino=on", with the module
option "xino_auto=on" or with the kernel config option
CONFIG_OVERLAY_FS_XINO_AUTO=y.
Also implicitly enabled by using the same
underlying filesystem for all layers making up the overlay.

If this feature is disabled or the underlying filesystem doesn't have
enough free bits in the inode number, then overlayfs will not be able to
guarantee that the values of st_ino and st_dev returned by stat(2) and the
value of d_ino returned by readdir(3) will act like on a normal filesystem.
E.g. the value of st_dev may be different for two objects in the same
overlay filesystem and the value of st_ino for directory objects may not be
persistent and could change even while the overlay filesystem is mounted.


Changes to underlying filesystems
---------------------------------

Offline changes, when the overlay is not mounted, are allowed to either
the upper or the lower trees.

Changes to the underlying filesystems while part of a mounted overlay
filesystem are not allowed. If the underlying filesystem is changed,
the behavior of the overlay is undefined, though it will not result in
a crash or deadlock.

When the overlay NFS export feature is enabled, the overlay filesystem's
behavior on offline changes of the underlying lower layer differs from
the behavior when NFS export is disabled.

On every copy_up, an NFS file handle of the lower inode, along with the
UUID of the lower filesystem, are encoded and stored in an extended
attribute "trusted.overlay.origin" on the upper inode.

When the NFS export feature is enabled, a lookup of a merged directory,
that found a lower directory at the lookup path or at the path pointed
to by the "trusted.overlay.redirect" extended attribute, will verify
that the found lower directory file handle and lower filesystem UUID
match the origin file handle that was stored at copy_up time. If a
found lower directory does not match the stored origin, that directory
will not be merged with the upper directory.
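
The origin check described above amounts to comparing two (UUID, file handle)
pairs. The following sketch models only that comparison. The ``Origin`` type,
the function ``merge_with_lower`` and the sample values are hypothetical, and
treating the check as skipped when NFS export is disabled is an assumption
drawn from the wording above rather than a description of the kernel code.

```python
from typing import NamedTuple


class Origin(NamedTuple):
    fs_uuid: str        # UUID of the lower filesystem
    file_handle: bytes  # NFS file handle of the lower inode


def merge_with_lower(stored: Origin, found: Origin, nfs_export: bool) -> bool:
    """Decide whether a found lower directory is merged with the upper one.

    `stored` models the content of "trusted.overlay.origin" written at
    copy_up time; `found` describes the lower directory located by lookup.
    """
    if not nfs_export:
        return True         # assumption: no verification in this case
    return stored == found  # both the UUID and the file handle must match


stored = Origin("1111-aaaa", b"\x01\x02")
print(merge_with_lower(stored, Origin("1111-aaaa", b"\x01\x02"), True))
print(merge_with_lower(stored, Origin("2222-bbbb", b"\x01\x02"), True))
```

The second call models a lower layer that was replaced offline by a copy on a
different filesystem: the UUID no longer matches, so the directory is not
merged.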



NFS export
----------

When the underlying filesystems support NFS export and the "nfs_export"
feature is enabled, an overlay filesystem may be exported to NFS.

With the "nfs_export" feature, on copy_up of any lower object, an index
entry is created under the index directory. The index entry name is the
hexadecimal representation of the copy up origin file handle. For a
non-directory object, the index entry is a hard link to the upper inode.
For a directory object, the index entry has an extended attribute
"trusted.overlay.upper" with an encoded file handle of the upper
directory inode.

When encoding a file handle from an overlay filesystem object, the
following rules apply:

1. For a non-upper object, encode a lower file handle from lower inode
2. For an indexed object, encode a lower file handle from copy_up origin
3. For a pure-upper object and for an existing non-indexed upper object,
   encode an upper file handle from upper inode

The encoded overlay file handle includes:

 - Header including path type information (e.g. lower/upper)
 - UUID of the underlying filesystem
 - Underlying filesystem encoding of underlying inode

This encoding format is identical to the encoding format of file handles
that are stored in extended attribute "trusted.overlay.origin".

When decoding an overlay file handle, the following steps are followed:

1. Find underlying layer by UUID and path type information.
2. Decode the underlying filesystem file handle to underlying dentry.
3. For a lower file handle, lookup the handle in index directory by name.
4. If a whiteout is found in index, return ESTALE. This represents an
   overlay object that was deleted after its file handle was encoded.
5. For a non-directory, instantiate a disconnected overlay dentry from the
   decoded underlying dentry, the path type and index inode, if found.
6. For a directory, use the connected underlying decoded dentry, path type
   and index, to lookup a connected overlay dentry.

Decoding a non-directory file handle may return a disconnected dentry.
copy_up of that disconnected dentry will create an upper index entry with
no upper alias.

When overlay filesystem has multiple lower layers, a middle layer
directory may have a "redirect" to lower directory. Because middle layer
"redirects" are not indexed, a lower file handle that was encoded from the
"redirect" origin directory, cannot be used to find the middle or upper
layer directory. Similarly, a lower file handle that was encoded from a
descendant of the "redirect" origin directory, cannot be used to
reconstruct a connected overlay path. To mitigate the cases of
directories that cannot be decoded from a lower file handle, these
directories are copied up on encode and encoded as an upper file handle.
On an overlay filesystem with no upper layer this mitigation cannot be
used, so NFS export in this setup requires turning off redirect follow
(e.g. "redirect_dir=nofollow").

The overlay filesystem does not support non-directory connectable file
handles, so exporting with the 'subtree_check' exportfs configuration will
cause failures to lookup files over NFS.

When the NFS export feature is enabled, all directory index entries are
verified at mount time to check that upper file handles are not stale.
This verification may cause significant overhead in some cases.


Testsuite
---------

There's a testsuite originally developed by David Howells and currently
maintained by Amir Goldstein at:

  https://github.com/amir73il/unionmount-testsuite.git

Run as root:

  # cd unionmount-testsuite
  # ./run --ov --verify