.. SPDX-License-Identifier: GPL-2.0

====
FUSE
====

Definitions
===========

Userspace filesystem:
  A filesystem in which data and metadata are provided by an ordinary
  userspace process. The filesystem can be accessed normally through
  the kernel interface.

Filesystem daemon:
  The process(es) providing the data and metadata of the filesystem.

Non-privileged mount (or user mount):
  A userspace filesystem mounted by a non-privileged (non-root) user.
  The filesystem daemon is running with the privileges of the mounting
  user. NOTE: this is not the same as mounts allowed with the "user"
  option in /etc/fstab, which is not discussed here.

Filesystem connection:
  A connection between the filesystem daemon and the kernel. The
  connection exists until either the daemon dies, or the filesystem is
  umounted. Note that detaching (or lazy umounting) the filesystem
  does *not* break the connection; in this case it will exist until
  the last reference to the filesystem is released.

Mount owner:
  The user who does the mounting.

User:
  The user who is performing filesystem operations.

What is FUSE?
=============

FUSE is a userspace filesystem framework. It consists of a kernel
module (fuse.ko), a userspace library (libfuse.*) and a mount utility
(fusermount).

One of the most important features of FUSE is allowing secure,
non-privileged mounts. This opens up new possibilities for the use of
filesystems. A good example is sshfs: a secure network filesystem
using the sftp protocol.

The userspace library and utilities are available from the
`FUSE homepage: <https://github.com/libfuse/>`_

Filesystem type
===============

The filesystem type given to mount(2) can be one of the following:

    fuse
      This is the usual way to mount a FUSE filesystem. The first
      argument of the mount system call may contain an arbitrary string,
      which is not interpreted by the kernel.

    fuseblk
      The filesystem is block device based. The first argument of the
      mount system call is interpreted as the name of the device.

Mount options
=============

fd=N
  The file descriptor to use for communication between the userspace
  filesystem and the kernel. The file descriptor must have been
  obtained by opening the FUSE device ('/dev/fuse').

rootmode=M
  The file mode of the filesystem's root in octal representation.

user_id=N
  The numeric user id of the mount owner.

group_id=N
  The numeric group id of the mount owner.

default_permissions
  By default FUSE doesn't check file access permissions; the
  filesystem is free to implement its access policy or leave it to
  the underlying file access mechanism (e.g. in case of network
  filesystems). This option enables permission checking, restricting
  access based on file mode. It is usually useful together with the
  'allow_other' mount option.

allow_other
  This option overrides the security measure restricting file access
  to the user mounting the filesystem. This option is by default only
  allowed to root, but this restriction can be removed with a
  (userspace) configuration option.

max_read=N
  With this option the maximum size of read operations can be set.
  The default is infinite. Note that the size of read requests is
  limited anyway to 32 pages (which is 128kbyte on i386).

blksize=N
  Set the block size for the filesystem. The default is 512. This
  option is only valid for 'fuseblk' type mounts.

Control filesystem
==================

There's a control filesystem for FUSE, which can be mounted by::

  mount -t fusectl none /sys/fs/fuse/connections

Mounting it under the '/sys/fs/fuse/connections' directory makes it
backwards compatible with earlier versions.

Under the fuse control filesystem each connection has a directory
named by a unique number.

For each connection the following files exist within this directory:

    waiting
      The number of requests which are waiting to be transferred to
      userspace or being processed by the filesystem daemon. If there is
      no filesystem activity and 'waiting' is non-zero, then the
      filesystem is hung or deadlocked.

    abort
      Writing anything into this file will abort the filesystem
      connection. This means that all waiting requests will be aborted and
      an error returned for all aborted and new requests.

Only the owner of the mount may read or write these files.

Interrupting filesystem operations
==================================

If a process issuing a FUSE filesystem request is interrupted, the
following will happen:

  - If the request is not yet sent to userspace AND the signal is
    fatal (SIGKILL or unhandled fatal signal), then the request is
    dequeued and returns immediately.

  - If the request is not yet sent to userspace AND the signal is not
    fatal, then an interrupted flag is set for the request. When
    the request has been successfully transferred to userspace and
    this flag is set, an INTERRUPT request is queued.

  - If the request is already sent to userspace, then an INTERRUPT
    request is queued.

INTERRUPT requests take precedence over other requests, so the
userspace filesystem will receive queued INTERRUPTs before any others.

The userspace filesystem may ignore the INTERRUPT requests entirely,
or may honor them by sending a reply to the *original* request, with
the error set to EINTR.

It is also possible that there's a race between processing the
original request and its INTERRUPT request. There are two possibilities:

  1. The INTERRUPT request is processed before the original request is
     processed

  2. The INTERRUPT request is processed after the original request has
     been answered

If the filesystem cannot find the original request, it should wait for
some timeout and/or a number of new requests to arrive, after which it
should reply to the INTERRUPT request with an EAGAIN error. In
case 1) the INTERRUPT request will be requeued. In case 2) the
INTERRUPT reply will be ignored.

Aborting a filesystem connection
================================

It is possible to get into certain situations where the filesystem is
not responding. Reasons for this may be:

  a) Broken userspace filesystem implementation

  b) Network connection down

  c) Accidental deadlock

  d) Malicious deadlock

(For more on c) and d) see later sections)

In any of these cases it may be useful to abort the connection to
the filesystem. There are several ways to do this:

  - Kill the filesystem daemon. Works in case of a) and b)

  - Kill the filesystem daemon and all users of the filesystem. Works
    in all cases except some malicious deadlocks

  - Use forced umount (umount -f). Works in all cases but only if
    filesystem is still attached (it hasn't been lazy unmounted)

  - Abort filesystem through the FUSE control filesystem. Most
    powerful method, always works.

How do non-privileged mounts work?
==================================

Since the mount() system call is a privileged operation, a helper
program (fusermount) is needed, which is installed setuid root.

The implication of providing non-privileged mounts is that the mount
owner must not be able to use this capability to compromise the
system.
Obvious requirements arising from this are:

 A) mount owner should not be able to get elevated privileges with the
    help of the mounted filesystem

 B) mount owner should not get illegitimate access to information from
    other users' and the super user's processes

 C) mount owner should not be able to induce undesired behavior in
    other users' or the super user's processes

How are requirements fulfilled?
===============================

 A) The mount owner could gain elevated privileges by either:

    1. creating a filesystem containing a device file, then opening this device

    2. creating a filesystem containing a suid or sgid application, then executing this application

    The solution is not to allow opening device files and to ignore
    setuid and setgid bits when executing programs. To ensure this,
    fusermount always adds "nosuid" and "nodev" to the mount options
    for non-privileged mounts.

 B) If another user is accessing files or directories in the
    filesystem, the filesystem daemon serving requests can record the
    exact sequence and timing of operations performed. This
    information is otherwise inaccessible to the mount owner, so this
    counts as an information leak.

    The solution to this problem will be presented in point 2) of C).

 C) There are several ways in which the mount owner can induce
    undesired behavior in other users' processes, such as:

     1) mounting a filesystem over a file or directory which the mount
        owner would otherwise not be able to modify (or could only
        make limited modifications).

        This is solved in fusermount, by checking the access
        permissions on the mountpoint and only allowing the mount if
        the mount owner can do unlimited modification (has write
        access to the mountpoint, and mountpoint is not a "sticky"
        directory)

     2) Even if 1) is solved the mount owner can change the behavior
        of other users' processes.

         i) It can slow down or indefinitely delay the execution of a
            filesystem operation, creating a DoS against the user or the
            whole system. For example a suid application locking a
            system file, and then accessing a file on the mount owner's
            filesystem could be stopped, thus causing the system
            file to be locked forever.

         ii) It can present files or directories of unlimited length, or
             directory structures of unlimited depth, possibly causing a
             system process to eat up disk space, memory or other
             resources, again causing *DoS*.

        The solution to this as well as B) is not to allow processes
        to access the filesystem if they could not otherwise be
        monitored or manipulated by the mount owner. Since a mount
        owner who can ptrace a process can do all of the above
        without using a FUSE mount, the same criteria as used in
        ptrace can be used to check whether a process is allowed to
        access the filesystem.

        Note that the *ptrace* check is not strictly necessary to
        prevent C/2/i; it is enough to check if the mount owner has enough
        privilege to send a signal to the process accessing the
        filesystem, since *SIGSTOP* can be used to get a similar effect.

I think these limitations are unacceptable?
===========================================

If a sysadmin trusts the users enough, or can ensure through other
measures that system processes will never enter non-privileged
mounts, they can relax the last limitation in several ways:

  - With the 'user_allow_other' config option.
    If this config option is set, the mounting user can add the
    'allow_other' mount option which disables the check for other
    users' processes.

    User namespaces have an unintuitive interaction with 'allow_other':
    an unprivileged user - normally restricted from mounting with
    'allow_other' - could do so in a user namespace where they're
    privileged. If any process could access such an 'allow_other' mount
    this would give the mounting user the ability to manipulate
    processes in user namespaces where they're unprivileged. For this
    reason 'allow_other' restricts access to users in the same userns
    or a descendant.

  - With the 'allow_sys_admin_access' module option. If this option is
    set, super user's processes have unrestricted access to mounts
    irrespective of allow_other setting or user namespace of the
    mounting user.

Note that both of these relaxations expose the system to a potential
information leak or *DoS* as described in points B and C/2/i-ii in the
preceding section.

Kernel - userspace interface
============================

The following diagram shows how a filesystem operation (in this
example unlink) is performed in FUSE. ::

 |  "rm /mnt/fuse/file"               |  FUSE filesystem daemon
 |                                    |
 |                                    |  >sys_read()
 |                                    |    >fuse_dev_read()
 |                                    |      >request_wait()
 |                                    |        [sleep on fc->waitq]
 |                                    |
 |  >sys_unlink()                     |
 |    >fuse_unlink()                  |
 |      [get request from             |
 |       fc->unused_list]             |
 |      >request_send()               |
 |        [queue req on fc->pending]  |
 |        [wake up fc->waitq]         |        [woken up]
 |        >request_wait_answer()      |
 |          [sleep on req->waitq]     |
 |                                    |      <request_wait()
 |                                    |      [remove req from fc->pending]
 |                                    |      [copy req to read buffer]
 |                                    |      [add req to fc->processing]
 |                                    |    <fuse_dev_read()
 |                                    |  <sys_read()
 |                                    |
 |                                    |  [perform unlink]
 |                                    |
 |                                    |  >sys_write()
 |                                    |    >fuse_dev_write()
 |                                    |      [look up req in fc->processing]
 |                                    |      [remove from fc->processing]
 |                                    |      [copy write buffer to req]
 |          [woken up]                |      [wake up req->waitq]
 |                                    |    <fuse_dev_write()
 |                                    |  <sys_write()
 |        <request_wait_answer()      |
 |      <request_send()               |
 |      [add request to               |
 |       fc->unused_list]             |
 |    <fuse_unlink()                  |
 |  <sys_unlink()                     |

.. note:: Everything in the description above is greatly simplified

There are a couple of ways in which to deadlock a FUSE filesystem.
Since we are talking about unprivileged userspace programs,
something must be done about these.

**Scenario 1 - Simple deadlock**::

 |  "rm /mnt/fuse/file"               |  FUSE filesystem daemon
 |                                    |
 |  >sys_unlink("/mnt/fuse/file")     |
 |    [acquire inode semaphore        |
 |     for "file"]                    |
 |    >fuse_unlink()                  |
 |      [sleep on req->waitq]         |
 |                                    |  <sys_read()
 |                                    |  >sys_unlink("/mnt/fuse/file")
 |                                    |    [acquire inode semaphore
 |                                    |     for "file"]
 |                                    |    *DEADLOCK*

The solution for this is to allow the filesystem to be aborted.

**Scenario 2 - Tricky deadlock**

This one needs a carefully crafted filesystem. It's a variation on
the above, only the call back to the filesystem is not explicit,
but is caused by a pagefault. ::

 |  Kamikaze filesystem thread 1      |  Kamikaze filesystem thread 2
 |                                    |
 |  [fd = open("/mnt/fuse/file")]     |  [request served normally]
 |  [mmap fd to 'addr']               |
 |  [close fd]                        |  [FLUSH triggers 'magic' flag]
 |  [read a byte from addr]           |
 |    >do_page_fault()                |
 |      [find or create page]         |
 |      [lock page]                   |
 |      >fuse_readpage()              |
 |         [queue READ request]       |
 |         [sleep on req->waitq]      |
 |                                    |  [read request to buffer]
 |                                    |  [create reply header before addr]
 |                                    |  >sys_write(addr - headerlength)
 |                                    |    >fuse_dev_write()
 |                                    |      [look up req in fc->processing]
 |                                    |      [remove from fc->processing]
 |                                    |      [copy write buffer to req]
 |                                    |        >do_page_fault()
 |                                    |           [find or create page]
 |                                    |           [lock page]
 |                                    |           * DEADLOCK *

The solution is basically the same as above.

An additional problem is that while the write buffer is being copied
to the request, the request must not be interrupted/aborted. This is
because the destination address of the copy may not be valid after the
request has returned.


This is solved by doing the copy atomically, and allowing abort
while the page(s) belonging to the write buffer are faulted with
get_user_pages(). The 'req->locked' flag indicates when the copy is
taking place, and abort is delayed until this flag is unset.
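
Both deadlock scenarios above are resolved by aborting the connection,
and the control filesystem is described as the method that always works.
The following is a minimal sketch of how a mount owner might inspect and
abort connections from Python. It assumes the control filesystem is
mounted at '/sys/fs/fuse/connections' as shown in the "Control
filesystem" section. The helper names ``list_connections`` and
``abort_connection`` are hypothetical and are not part of FUSE or
libfuse; only the 'waiting' and 'abort' file names come from the text.

```python
import os

# Mount point named in the "Control filesystem" section (assumed mounted).
FUSECTL_ROOT = "/sys/fs/fuse/connections"


def list_connections(root=FUSECTL_ROOT):
    """Return {connection_id: waiting_count} for every readable connection.

    Hypothetical helper: each connection is a directory named by a unique
    number, containing a 'waiting' file with the pending request count.
    """
    result = {}
    for name in sorted(os.listdir(root)):
        waiting_path = os.path.join(root, name, "waiting")
        try:
            with open(waiting_path) as f:
                result[name] = int(f.read().strip())
        except (OSError, ValueError):
            # Not the mount owner, or not a connection directory: skip it.
            continue
    return result


def abort_connection(conn_id, root=FUSECTL_ROOT):
    """Abort one connection by writing to its 'abort' file.

    Hypothetical helper: the text states that writing anything into this
    file aborts the connection, so the value written here is arbitrary.
    """
    with open(os.path.join(root, conn_id, "abort"), "w") as f:
        f.write("1")
```

A non-zero 'waiting' count only indicates a hang when there is no
filesystem activity, so a caller would normally sample it more than once
before calling ``abort_connection``. The ``root`` parameter exists only
so the helpers can be tried against a scratch directory.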