.. SPDX-License-Identifier: GPL-2.0

===================================
File management in the Linux kernel
===================================

This document describes how locking for files (struct file)
and the file descriptor table (struct files_struct) works.

Up until 2.6.12, the file descriptor table was protected
with a lock (files->file_lock) and a reference count (files->count).
->file_lock protected accesses to all the file-related fields
of the table. ->count was used for sharing the file descriptor
table between tasks cloned with the CLONE_FILES flag. Typically
this would be the case for POSIX threads. As with the common
refcounting model in the kernel, the last task doing
a put_files_struct() frees the file descriptor (fd) table.
The files (struct file) themselves are protected using
a reference count (->f_count).

In the new lock-free model of file descriptor management,
the reference counting is similar, but the locking is
based on RCU. The file descriptor table contains multiple
elements: the fd sets (open_fds and close_on_exec), the
array of file pointers, the sizes of the sets and the array,
etc. In order for the updates to appear atomic to
a lock-free reader, all the elements of the file descriptor
table are in a separate structure, struct fdtable.
files_struct contains a pointer to struct fdtable through
which the actual fd table is accessed. Initially the
fdtable is embedded in files_struct itself. On a subsequent
expansion of fdtable, a new fdtable structure is allocated
and files->fdt points to the new structure. The fdtable
structure is freed with RCU and lock-free readers either
see the old fdtable or the new fdtable, making the update
appear atomic. Here are the locking rules for
the fdtable structure:

1. All references to the fdtable must be done through
   the files_fdtable() macro::

	struct fdtable *fdt;

	rcu_read_lock();

	fdt = files_fdtable(files);
	....
	if (n <= fdt->max_fds)
		....
	...
	rcu_read_unlock();

   files_fdtable() uses the rcu_dereference() macro, which takes care of
   the memory barrier requirements for lock-free dereference.
   The fdtable pointer must be read within the read-side
   critical section.

2. Reading of the fdtable as described above must be protected
   by rcu_read_lock()/rcu_read_unlock().

3. For any update to the fd table, files->file_lock must
   be held.

4. To look up the file structure given an fd, a reader
   must use either the lookup_fd_rcu() or the files_lookup_fd_rcu() API. These
   take care of barrier requirements due to lock-free lookup.

   An example::

	struct file *file;

	rcu_read_lock();
	file = lookup_fd_rcu(fd);
	if (file) {
		...
	}
	....
	rcu_read_unlock();

5. Handling of the file structures is special. Since the look-up
   of the fd (fget()/fget_light()) is lock-free, it is possible
   that the look-up may race with the last put() operation on the
   file structure. This is avoided using atomic_long_inc_not_zero()
   on ->f_count::

	rcu_read_lock();
	file = files_lookup_fd_rcu(files, fd);
	if (file) {
		if (atomic_long_inc_not_zero(&file->f_count))
			*fput_needed = 1;
		else
			/* Didn't get the reference, someone's freed */
			file = NULL;
	}
	rcu_read_unlock();
	....
	return file;

   atomic_long_inc_not_zero() detects if the refcount is already zero or
   goes to zero during the increment. If it does, we fail
   fget()/fget_light().

6. Since both fdtable and file structures can be looked up
   lock-free, they must be installed using the rcu_assign_pointer()
   API. If they are looked up lock-free, rcu_dereference()
   must be used. However, it is advisable to use files_fdtable()
   and lookup_fd_rcu()/files_lookup_fd_rcu(), which take care of these issues.

7. While updating, the fdtable pointer must be looked up while
   holding files->file_lock. If ->file_lock is dropped, then
   another thread may expand the files, thereby creating a new
   fdtable and making the earlier fdtable pointer stale.

   For example::

	spin_lock(&files->file_lock);
	fd = locate_fd(files, file, start);
	if (fd >= 0) {
		/* locate_fd() may have expanded fdtable, load the ptr */
		fdt = files_fdtable(files);
		__set_open_fd(fd, fdt);
		__clear_close_on_exec(fd, fdt);
		spin_unlock(&files->file_lock);
	.....

   Since locate_fd() can drop ->file_lock (and reacquire ->file_lock),
   the fdtable pointer (fdt) must be loaded after locate_fd().
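
   The staleness problem can be modeled in user space. The following is
   a minimal sketch under stated assumptions, not kernel code:
   struct demo_fdtable, struct demo_files, demo_expand() and
   demo_fdtable() are hypothetical names, the lock is omitted because
   the sketch is single-threaded, and the old table is simply left
   allocated where the kernel would free it with RCU::

	#include <stdatomic.h>
	#include <stdlib.h>
	#include <string.h>

	/* Hypothetical, much simplified model of struct fdtable. */
	struct demo_fdtable {
		unsigned int max_fds;
		void **fd;	/* array of max_fds file pointers */
	};

	/* Hypothetical model of files_struct: only the published pointer. */
	struct demo_files {
		_Atomic(struct demo_fdtable *) fdt;
	};

	/* The acquire load stands in for rcu_dereference(). */
	static struct demo_fdtable *demo_fdtable(struct demo_files *files)
	{
		return atomic_load_explicit(&files->fdt, memory_order_acquire);
	}

	/*
	 * Build a larger table, copy the old entries, then publish it.
	 * The release store stands in for rcu_assign_pointer().
	 * Returns 0 on success, -1 on allocation failure.
	 */
	static int demo_expand(struct demo_files *files, unsigned int nr)
	{
		struct demo_fdtable *old = demo_fdtable(files);
		struct demo_fdtable *nfdt;

		if (old && nr <= old->max_fds)
			return 0;	/* already large enough */

		nfdt = malloc(sizeof(*nfdt));
		if (!nfdt)
			return -1;
		nfdt->fd = calloc(nr, sizeof(*nfdt->fd));
		if (!nfdt->fd) {
			free(nfdt);
			return -1;
		}
		nfdt->max_fds = nr;
		if (old)
			memcpy(nfdt->fd, old->fd,
			       old->max_fds * sizeof(*old->fd));

		/* The old table is intentionally not freed in this sketch. */
		atomic_store_explicit(&files->fdt, nfdt, memory_order_release);
		return 0;
	}

   A caller that cached the result of demo_fdtable() before a call to
   demo_expand() still sees the old max_fds, which is why the pointer
   has to be reloaded after any call that may have expanded the table.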
129
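The conditional increment described in rule 5 can also be modeled in
user space with C11 atomics. The following is a minimal sketch, not the
kernel implementation: struct demo_file, demo_get() and demo_put() are
hypothetical names, and a compare-and-exchange loop stands in for
atomic_long_inc_not_zero()::

	#include <stdatomic.h>
	#include <stdbool.h>

	/* Hypothetical stand-in for struct file; only the refcount matters. */
	struct demo_file {
		atomic_long f_count;
	};

	/*
	 * Take a reference unless the count has already dropped to zero.
	 * Returns true if the reference was taken.
	 */
	static bool demo_get(struct demo_file *f)
	{
		long old = atomic_load(&f->f_count);

		while (old != 0) {
			/* On failure, old is reloaded with the current value. */
			if (atomic_compare_exchange_weak(&f->f_count, &old,
							 old + 1))
				return true;
		}
		return false;
	}

	/* Drop a reference; returns true if this was the last one. */
	static bool demo_put(struct demo_file *f)
	{
		return atomic_fetch_sub(&f->f_count, 1) == 1;
	}

A lookup that gets false from demo_get() must treat the file as gone,
in the same way that the rule 5 example sets file to NULL.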