.. SPDX-License-Identifier: GPL-2.0

===================================
File management in the Linux kernel
===================================

This document describes how locking for files (struct file)
and the file descriptor table (struct files_struct) works.

Up until 2.6.12, the file descriptor table was protected
with a lock (files->file_lock) and a reference count (files->count).
->file_lock protected accesses to all the file-related fields
of the table. ->count was used for sharing the file descriptor
table between tasks cloned with the CLONE_FILES flag. Typically
this would be the case for POSIX threads. As with the common
refcounting model in the kernel, the last task doing
a put_files_struct() frees the file descriptor (fd) table.
The files (struct file) themselves are protected using
a reference count (->f_count).

In the new lock-free model of file descriptor management,
the reference counting is similar, but the locking is
based on RCU. The file descriptor table contains multiple
elements: the fd sets (open_fds and close_on_exec), the
array of file pointers, the sizes of the sets and the array,
etc.
In order for the updates to appear atomic to
a lock-free reader, all the elements of the file descriptor
table are kept in a separate structure, struct fdtable.
files_struct contains a pointer to struct fdtable through
which the actual fd table is accessed. Initially the
fdtable is embedded in files_struct itself. On a subsequent
expansion of the fdtable, a new fdtable structure is allocated
and files->fdt points to the new structure. The fdtable
structure is freed with RCU, and lock-free readers see either
the old fdtable or the new fdtable, making the update
appear atomic. Here are the locking rules for
the fdtable structure:

1. All references to the fdtable must be done through
   the files_fdtable() macro::

        struct fdtable *fdt;

        rcu_read_lock();

        fdt = files_fdtable(files);
        ....
        if (n <= fdt->max_fds)
                ....
        ...
        rcu_read_unlock();

   files_fdtable() uses the rcu_dereference() macro, which takes care of
   the memory barrier requirements for lock-free dereference.
   The fdtable pointer must be read within the read-side
   critical section.

2. Reading of the fdtable as described above must be protected
   by rcu_read_lock()/rcu_read_unlock().

3. For any update to the fd table, files->file_lock must
   be held.

4. To look up the file structure given an fd, a reader
   must use either the lookup_fd_rcu() or files_lookup_fd_rcu() API. These
   take care of the barrier requirements due to lock-free lookup.

   An example::

        struct file *file;

        rcu_read_lock();
        file = lookup_fd_rcu(fd);
        if (file) {
                ...
        }
        ....
        rcu_read_unlock();

5. Handling of the file structures is special. Since the look-up
   of the fd (fget()/fget_light()) is lock-free, it is possible
   that a look-up may race with the last put() operation on the
   file structure. This is avoided using atomic_long_inc_not_zero()
   on ->f_count::

        rcu_read_lock();
        file = files_lookup_fd_rcu(files, fd);
        if (file) {
                if (atomic_long_inc_not_zero(&file->f_count))
                        *fput_needed = 1;
                else
                        /* Didn't get the reference, someone's freed */
                        file = NULL;
        }
        rcu_read_unlock();
        ....
        return file;

   atomic_long_inc_not_zero() detects if the refcount is already zero or
   goes to zero during the increment. If it does, fget()/fget_light()
   fails.

6. Since both fdtable and file structures can be looked up
   lock-free, they must be installed using the rcu_assign_pointer()
   API. If they are looked up lock-free, rcu_dereference()
   must be used. However, it is advisable to use files_fdtable()
   and lookup_fd_rcu()/files_lookup_fd_rcu(), which take care of these issues.

7. While updating, the fdtable pointer must be looked up while
   holding files->file_lock. If ->file_lock is dropped, then
   another thread may expand the files, thereby creating a new
   fdtable and making the earlier fdtable pointer stale.

   For example::

        spin_lock(&files->file_lock);
        fd = locate_fd(files, file, start);
        if (fd >= 0) {
                /* locate_fd() may have expanded fdtable, load the ptr */
                fdt = files_fdtable(files);
                __set_open_fd(fd, fdt);
                __clear_close_on_exec(fd, fdt);
                spin_unlock(&files->file_lock);
        .....

   Since locate_fd() can drop ->file_lock (and reacquire ->file_lock),
   the fdtable pointer (fdt) must be loaded after locate_fd().
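The "increment unless zero" step in rule 5 can be illustrated outside the
kernel. What follows is a minimal userspace sketch using C11 atomics, not the
kernel implementation: struct toy_file, toy_inc_not_zero(), toy_put() and
toy_get() are hypothetical names introduced only for this illustration, and
the sketch does not model RCU or the deferred freeing of the structure. It
only shows why a lookup that races with the last put must fail rather than
revive a dead object.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct file; only the refcount matters here. */
struct toy_file {
    atomic_long f_count;
};

/*
 * Userspace analogue of atomic_long_inc_not_zero(): increment the counter
 * only if it is currently non-zero. Returns true if a reference was taken.
 */
static bool toy_inc_not_zero(atomic_long *count)
{
    long old = atomic_load(count);

    while (old != 0) {
        /* On failure, 'old' is refreshed with the current value. */
        if (atomic_compare_exchange_weak(count, &old, old + 1))
            return true;
    }
    return false;
}

/* Drop one reference; returns true if this was the last one. */
static bool toy_put(atomic_long *count)
{
    long old = atomic_fetch_sub(count, 1);

    assert(old > 0);
    return old == 1;
}

/* Mirrors the lookup in rule 5: hand back the file only with a reference. */
static struct toy_file *toy_get(struct toy_file *file)
{
    if (file && !toy_inc_not_zero(&file->f_count))
        file = NULL;    /* the last reference is already gone */
    return file;
}
```

In this sketch, a caller that receives a non-NULL pointer from toy_get() owns
one reference and must later balance it with toy_put(). Once the count has
reached zero, every subsequent toy_get() returns NULL, which corresponds to
fget()/fget_light() failing in the rule above.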