:orphan:

Making Filesystems Exportable
=============================

Overview
--------

All filesystem operations require a dentry (or two) as a starting
point. Local applications have a reference-counted hold on suitable
dentries via open file descriptors or cwd/root. However remote
applications that access a filesystem via a remote filesystem protocol
such as NFS may not be able to hold such a reference, and so need a
different way to refer to a particular dentry. As the alternative
form of reference needs to be stable across renames, truncates, and
server-reboot (among other things, though these tend to be the most
problematic), there is no simple answer like 'filename'.

The mechanism discussed here allows each filesystem implementation to
specify how to generate an opaque (outside of the filesystem) byte
string for any dentry, and how to find an appropriate dentry for any
given opaque byte string.
This byte string will be called a "filehandle fragment" as it
corresponds to part of an NFS filehandle.

A filesystem which supports the mapping between filehandle fragments
and dentries will be termed "exportable".
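
The idea of a reference that survives renames can be illustrated with a
small userspace model. This is only a sketch and not kernel code: every
``model_*`` name, the table size and the type value are hypothetical.
It assumes a fragment made of 32-bit words plus a one-byte type, holding
an inode number and a generation number, in the spirit of the default
encoding described under Filesystem Issues.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
#define MODEL_FH_INO_GEN 1   /* type: words[0] = ino, words[1] = generation */
#define MODEL_NR_INODES  4

struct model_inode {
	uint32_t ino;        /* inode number, reused after deletion */
	uint32_t generation; /* bumped each time the slot is reused */
	char name[16];       /* current name, may change on rename */
	int in_use;
};

struct model_fh {
	uint8_t type;        /* tells the decoder which words are valid */
	uint32_t words[4];   /* may be larger than what encode filled in */
};

static struct model_inode model_table[MODEL_NR_INODES];

static void model_set_name(struct model_inode *in, const char *name)
{
	strncpy(in->name, name, sizeof(in->name) - 1);
	in->name[sizeof(in->name) - 1] = '\0';
}

/* Create an object in the first free slot and bump its generation. */
struct model_inode *model_create(const char *name)
{
	for (size_t i = 0; i < MODEL_NR_INODES; i++) {
		struct model_inode *in = &model_table[i];

		if (!in->in_use) {
			in->ino = (uint32_t)i + 1;
			in->generation++;
			in->in_use = 1;
			model_set_name(in, name);
			return in;
		}
	}
	return NULL;
}

void model_rename(struct model_inode *in, const char *name)
{
	model_set_name(in, name);
}

void model_unlink(struct model_inode *in)
{
	in->in_use = 0;
}

/* Encode: zero the whole handle so that unused words are nul padding. */
void model_encode_fh(const struct model_inode *in, struct model_fh *fh)
{
	assert(in != NULL && fh != NULL);
	memset(fh, 0, sizeof(*fh));
	fh->type = MODEL_FH_INO_GEN;
	fh->words[0] = in->ino;
	fh->words[1] = in->generation;
}

/* Decode: rely on the type, not the size, and reject stale generations. */
struct model_inode *model_decode_fh(const struct model_fh *fh)
{
	uint32_t ino;

	if (fh == NULL || fh->type != MODEL_FH_INO_GEN)
		return NULL;
	ino = fh->words[0];
	if (ino == 0 || ino > MODEL_NR_INODES)
		return NULL;
	if (!model_table[ino - 1].in_use ||
	    model_table[ino - 1].generation != fh->words[1])
		return NULL;
	return &model_table[ino - 1];
}
```

In this model a handle encoded before ``model_rename()`` still decodes to
the same object afterwards, because the name is never part of the handle.
Once the object is unlinked and its inode number is reused, the old handle
is rejected because the generation no longer matches.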


Dcache Issues
-------------

The dcache normally contains a proper prefix of any given filesystem
tree. This means that if any filesystem object is in the dcache, then
all of the ancestors of that filesystem object are also in the dcache.
As normal access is by filename this prefix is created naturally and
maintained easily (by each object maintaining a reference count on
its parent).

However when objects are included into the dcache by interpreting a
filehandle fragment, there is no automatic creation of a path prefix
for the object. This leads to two related but distinct features of
the dcache that are not needed for normal filesystem access.

1. The dcache must sometimes contain objects that are not part of the
   proper prefix, i.e. that are not connected to the root.
2. The dcache must be prepared for a newly found (via ->lookup) directory
   to already have a (non-connected) dentry, and must be able to move
   that dentry into place (based on the parent and name in the
   ->lookup). This is particularly needed for directories as
   it is a dcache invariant that directories only have one dentry.

To implement these features, the dcache has:

a. A dentry flag DCACHE_DISCONNECTED which is set on
   any dentry that might not be part of the proper prefix.
   This is set when anonymous dentries are created, and cleared when a
   dentry is noticed to be a child of a dentry which is in the proper
   prefix. If the refcount on a dentry with this flag set
   becomes zero, the dentry is immediately discarded, rather than being
   kept in the dcache. If a dentry that is not already in the dcache
   is repeatedly accessed by filehandle (as NFSD might do), a new dentry
   will be allocated for each access, and discarded at the end of
   the access.

   Note that such a dentry can acquire children, name, ancestors, etc.
   without losing DCACHE_DISCONNECTED - that flag is only cleared when
   the subtree is successfully reconnected to root. Until then dentries
   in such a subtree are retained only as long as there are references;
   refcount reaching zero means immediate eviction, same as for unhashed
   dentries. That guarantees that we won't need to hunt them down upon
   umount.

b. A primitive for creation of secondary roots - d_obtain_root(inode).
   Those do _not_ bear DCACHE_DISCONNECTED.
   They are placed on the
   per-superblock list (->s_roots), so they can be located at umount
   time for eviction purposes.

c. Helper routines to allocate anonymous dentries, and to help attach
   loose directory dentries at lookup time. They are:

    d_obtain_alias(inode) will return a dentry for the given inode.
      If the inode already has a dentry, one of those is returned.

      If it doesn't, a new anonymous (IS_ROOT and
      DCACHE_DISCONNECTED) dentry is allocated and attached.

      In the case of a directory, care is taken that only one dentry
      can ever be attached.

    d_splice_alias(inode, dentry) will introduce a new dentry into the tree;
      either the passed-in dentry or a preexisting alias for the given inode
      (such as an anonymous one created by d_obtain_alias), if appropriate.
      It returns NULL when the passed-in dentry is used, following the calling
      convention of ->lookup.

Filesystem Issues
-----------------

For a filesystem to be exportable it must:

   1. provide the filehandle fragment routines described below.
   2. make sure that d_splice_alias is used rather than d_add
      when ->lookup finds an inode for a given parent and name.

      If inode is NULL, d_splice_alias(inode, dentry) is equivalent to::

                d_add(dentry, inode), NULL

      Similarly, d_splice_alias(ERR_PTR(err), dentry) = ERR_PTR(err)

      Typically the ->lookup routine will simply end with a::

                return d_splice_alias(inode, dentry);
        }



A file system implementation declares that instances of the filesystem
are exportable by setting the s_export_op field in the struct
super_block. This field must point to a "struct export_operations"
struct which has the following members:

  encode_fh (optional)
    Takes a dentry and creates a filehandle fragment which may later be used
    to find or create a dentry for the same object. The default
    implementation creates a filehandle fragment that encodes a 32-bit inode
    number and generation number for the inode encoded, and if necessary the
    same information for the parent.

  fh_to_dentry (mandatory)
    Given a filehandle fragment, this should find the implied object and
    create a dentry for it (possibly with d_obtain_alias).

  fh_to_parent (optional but strongly recommended)
    Given a filehandle fragment, this should find the parent of the
    implied object and create a dentry for it (possibly with
    d_obtain_alias). May fail if the filehandle fragment is too small.

  get_parent (optional but strongly recommended)
    When given a dentry for a directory, this should return a dentry for
    the parent. Quite possibly the parent dentry will have been allocated
    by d_obtain_alias. The default get_parent function just returns an error
    so any filehandle lookup that requires finding a parent will fail.
    ->lookup("..") is *not* used as a default as it can leave ".." entries
    in the dcache which are too messy to work with.

  get_name (optional)
    When given a parent dentry and a child dentry, this should find a name
    in the directory identified by the parent dentry, which leads to the
    object identified by the child dentry.
    If no get_name function is
    supplied, a default implementation is provided which uses vfs_readdir
    to find potential names, and matches inode numbers to find the correct
    match.

  flags
    Some filesystems may need to be handled differently than others. The
    export_operations struct also includes a flags field that allows the
    filesystem to communicate such information to nfsd. See the Export
    Operations Flags section below for more explanation.

A filehandle fragment consists of an array of 1 or more 4-byte words,
together with a one-byte "type".
The decoding routines (fh_to_dentry and fh_to_parent) should not depend
on the stated size that is passed to them. This size may be larger than
the original filehandle generated by encode_fh, in which case it will
have been padded with nuls. Rather, the encode_fh routine should choose
a "type" which indicates to the decoding routines how much of the
filehandle is valid, and how it should be interpreted.

Export Operations Flags
-----------------------
In addition to the operation vector pointers, struct export_operations also
contains a "flags" field that allows the filesystem to communicate to nfsd
that it may want to do things differently when dealing with it.
The
following flags are defined:

  EXPORT_OP_NOWCC - disable NFSv3 WCC attributes on this filesystem
    RFC 1813 recommends that servers always send weak cache consistency
    (WCC) data to the client after each operation. The server should
    atomically collect attributes about the inode, do an operation on it,
    and then collect the attributes afterward. This allows the client to
    skip issuing GETATTRs in some situations but means that the server
    is calling vfs_getattr for almost all RPCs. On some filesystems
    (particularly those that are clustered or networked) this is expensive
    and atomicity is difficult to guarantee. This flag indicates to nfsd
    that it should skip providing WCC attributes to the client in NFSv3
    replies when doing operations on this filesystem. Consider enabling
    this on filesystems that have an expensive ->getattr inode operation,
    or when atomicity between pre and post operation attribute collection
    is impossible to guarantee.

  EXPORT_OP_NOSUBTREECHK - disallow subtree checking on this fs
    Many NFS operations deal with filehandles, which the server must then
    vet to ensure that they live inside of an exported tree. When the
    export consists of an entire filesystem, this is trivial. nfsd can just
    ensure that the filehandle lives on the filesystem. When only part of a
    filesystem is exported however, then nfsd must walk the ancestors of the
    inode to ensure that it's within an exported subtree.
    This is an
    expensive operation and not all filesystems can support it properly.
    This flag exempts the filesystem from subtree checking and causes
    exportfs to get back an error if it tries to enable subtree checking
    on it.

  EXPORT_OP_CLOSE_BEFORE_UNLINK - always close cached files before unlinking
    On some exportable filesystems (such as NFS) unlinking a file that
    is still open can cause a fair bit of extra work. For instance,
    the NFS client will do a "sillyrename" to ensure that the file
    sticks around while it's still open. When reexporting, that open
    file is held by nfsd so we usually end up doing a sillyrename, and
    then immediately deleting the sillyrenamed file just afterward when
    the link count actually goes to zero. Sometimes this delete can race
    with other operations (for instance an rmdir of the parent directory).
    This flag causes nfsd to close any open files for this inode _before_
    calling into the vfs to do an unlink or a rename that would replace
    an existing file.

  EXPORT_OP_REMOTE_FS - Backing storage for this filesystem is remote
    PF_LOCAL_THROTTLE exists for loopback NFSD, where a thread needs to
    write to one bdi (the final bdi) in order to free up writes queued
    to another bdi (the client bdi). Such threads get a private balance
    of dirty pages so that dirty pages for the client bdi do not impact
    the daemon writing to the final bdi.
    For filesystems whose durable
    storage is not local (such as exported NFS filesystems), this
    constraint has negative consequences. EXPORT_OP_REMOTE_FS enables
    an export to disable writeback throttling.

  EXPORT_OP_NOATOMIC_ATTR - Filesystem does not update attributes atomically
    EXPORT_OP_NOATOMIC_ATTR indicates that the exported filesystem
    cannot provide the semantics required by the "atomic" boolean in
    NFSv4's change_info4. This boolean indicates to a client whether the
    returned before and after change attributes were obtained atomically
    with respect to the requested metadata operation (UNLINK,
    OPEN/CREATE, MKDIR, etc).

  EXPORT_OP_FLUSH_ON_CLOSE - Filesystem flushes file data on close(2)
    On most filesystems, inodes can remain under writeback after the
    file is closed. NFSD relies on client activity or local flusher
    threads to handle writeback. Certain filesystems, such as NFS, flush
    all of an inode's dirty data on last close. Exports that behave this
    way should set EXPORT_OP_FLUSH_ON_CLOSE so that NFSD knows to skip
    waiting for writeback when closing such files.
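
How a consumer such as nfsd might consult these flags can be sketched in
userspace C. This is a model under stated assumptions rather than the
kernel's implementation: the ``MODEL_OP_*`` constants stand in for the
``EXPORT_OP_*`` flags above with numeric values chosen for the sketch, and
the struct and helper names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Stand-ins for the flags described above. The numeric values are
 * assumptions made for this model, not copied from kernel headers.
 */
#define MODEL_OP_NOWCC               0x01UL
#define MODEL_OP_NOSUBTREECHK        0x02UL
#define MODEL_OP_CLOSE_BEFORE_UNLINK 0x04UL
#define MODEL_OP_REMOTE_FS           0x08UL
#define MODEL_OP_NOATOMIC_ATTR       0x10UL
#define MODEL_OP_FLUSH_ON_CLOSE      0x20UL

/* Hypothetical reduction of struct export_operations to its flags field. */
struct model_export_operations {
	unsigned long flags;
};

bool model_has_flag(const struct model_export_operations *ops,
		    unsigned long flag)
{
	return ops != NULL && (ops->flags & flag) != 0;
}

/* WCC attributes are sent unless the filesystem opted out. */
bool model_send_wcc(const struct model_export_operations *ops)
{
	return !model_has_flag(ops, MODEL_OP_NOWCC);
}

/* Enabling subtree checking is refused when the filesystem forbids it. */
bool model_may_subtree_check(const struct model_export_operations *ops)
{
	return !model_has_flag(ops, MODEL_OP_NOSUBTREECHK);
}

/* Cached open files are closed first only when the flag asks for it. */
bool model_close_before_unlink(const struct model_export_operations *ops)
{
	return model_has_flag(ops, MODEL_OP_CLOSE_BEFORE_UNLINK);
}

/* The change_info4 "atomic" boolean may be set only without NOATOMIC_ATTR. */
bool model_change_info_atomic(const struct model_export_operations *ops)
{
	return !model_has_flag(ops, MODEL_OP_NOATOMIC_ATTR);
}

/* Waiting for writeback on close is skipped if the fs flushes on close. */
bool model_wait_writeback_on_close(const struct model_export_operations *ops)
{
	return !model_has_flag(ops, MODEL_OP_FLUSH_ON_CLOSE);
}

/* A re-exported network filesystem might plausibly set every flag. */
const struct model_export_operations model_reexport_ops = {
	.flags = MODEL_OP_NOWCC | MODEL_OP_NOSUBTREECHK |
		 MODEL_OP_CLOSE_BEFORE_UNLINK | MODEL_OP_REMOTE_FS |
		 MODEL_OP_NOATOMIC_ATTR | MODEL_OP_FLUSH_ON_CLOSE,
};

/* A simple local filesystem might set none of them. */
const struct model_export_operations model_local_ops = {
	.flags = 0,
};
```

Keeping the flags in a single bitmask means a filesystem that sets none
of them gets the default behaviour described in each entry, and each
decision reduces to one bit test on the export's operations.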