.. _userfaultfd:

===========
Userfaultfd
===========

Objective
=========

Userfaults allow the implementation of on-demand paging from userland
and more generally they allow userland to take control of various
memory page faults, something otherwise only the kernel code could do.

For example userfaults allow a proper and more optimal implementation
of the PROT_NONE+SIGSEGV trick.

Design
======

Userfaults are delivered and resolved through the userfaultfd syscall.

The userfaultfd (aside from registering and unregistering virtual
memory ranges) provides two primary functionalities:

1) read/POLLIN protocol to notify a userland thread of the faults
   happening

2) various UFFDIO_* ioctls that can manage the virtual memory regions
   registered in the userfaultfd and that allow userland to efficiently
   resolve the userfaults it receives via 1) or to manage the virtual
   memory in the background

The real advantage of userfaults compared to regular virtual memory
management with mremap/mprotect is that userfaults, in all their
operations, never involve heavyweight structures like vmas (in fact the
userfaultfd runtime load never takes the mmap_sem for writing).

Vmas are not suitable for page- (or hugepage-) granular fault tracking
when dealing with virtual address spaces that could span
terabytes. Too many vmas would be needed for that.

Once opened by invoking the syscall, the userfaultfd can also be
passed over unix domain sockets to a manager process, so the same
manager process can handle the userfaults of a multitude of
different processes without them being aware of what is going on
(unless, of course, they later try to use the userfaultfd
themselves on the same region the manager is already tracking, which
is a corner case that would currently return -EBUSY).

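As an illustration of handing the descriptor to a manager, the sketch
below passes a file descriptor over a connected AF_UNIX socket using
SCM_RIGHTS ancillary data. The helper names send_fd() and recv_fd() are
hypothetical and not part of any userfaultfd API; the same code works
for any descriptor, a userfaultfd included.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Hypothetical helper: send one descriptor (e.g. a userfaultfd) over a
 * connected AF_UNIX socket. Returns 0 on success, -1 on error. */
int send_fd(int sock, int fd)
{
	char byte = 0;	/* at least one byte of real data must accompany it */
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;	/* forces suitable alignment */
	} ctrl;
	struct msghdr msg = { 0 };
	struct cmsghdr *cmsg;

	memset(&ctrl, 0, sizeof(ctrl));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = ctrl.buf;
	msg.msg_controllen = sizeof(ctrl.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Hypothetical helper: receive one descriptor sent by send_fd().
 * Returns the new descriptor in this process, or -1 on error. */
int recv_fd(int sock)
{
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} ctrl;
	struct msghdr msg = { 0 };
	struct cmsghdr *cmsg;
	int fd;

	memset(&ctrl, 0, sizeof(ctrl));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = ctrl.buf;
	msg.msg_controllen = sizeof(ctrl.buf);

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;

	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

The monitored process would call send_fd() once after enabling the
userfaultfd; the manager then polls the descriptor returned by
recv_fd() as if it had opened it itself.
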
API
===

When first opened, the userfaultfd must be enabled by invoking the
UFFDIO_API ioctl with uffdio_api.api set to UFFD_API (or
a later API version), which specifies the read/POLLIN protocol
userland intends to speak on the UFFD, and with the uffdio_api.features
userland requires. If successful (i.e. if the
requested uffdio_api.api is also spoken by the running kernel and the
requested features are going to be enabled), the UFFDIO_API ioctl returns
in uffdio_api.features and uffdio_api.ioctls two 64-bit bitmasks of,
respectively, all the available features of the read(2) protocol and
the generic ioctls available.

The uffdio_api.features bitmask returned by the UFFDIO_API ioctl
defines what memory types are supported by the userfaultfd and what
events, except page fault notifications, may be generated.

If the kernel supports registering userfaultfd ranges on hugetlbfs
virtual memory areas, UFFD_FEATURE_MISSING_HUGETLBFS will be set in
uffdio_api.features. Similarly, UFFD_FEATURE_MISSING_SHMEM will be
set if the kernel supports registering userfaultfd ranges on shared
memory (covering all shmem APIs, i.e. tmpfs, IPCSHM, /dev/zero
MAP_SHARED, memfd_create, etc).

A userland application that wants to use userfaultfd with hugetlbfs
or shared memory needs to set the corresponding flag in
uffdio_api.features to enable those features.

If userland wants to receive notifications for events other than
page faults, it has to verify that uffdio_api.features has the appropriate
UFFD_FEATURE_EVENT_* bits set. These events are described in more
detail below in the "Non-cooperative userfaultfd" section.

Once the userfaultfd has been enabled the UFFDIO_REGISTER ioctl should
be invoked (if present in the returned uffdio_api.ioctls bitmask) to
register a memory range in the userfaultfd by setting the
uffdio_register structure accordingly. The uffdio_register.mode
bitmask will specify to the kernel which kind of faults to track for
the range (UFFDIO_REGISTER_MODE_MISSING would track missing
pages). The UFFDIO_REGISTER ioctl will return the
uffdio_register.ioctls bitmask of ioctls that are suitable to resolve
userfaults on the range registered. Not all ioctls will necessarily be
supported for all memory types depending on the underlying virtual
memory backend (anonymous memory vs tmpfs vs real filebacked
mappings).

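A minimal sketch of the registration step, assuming an enabled
userfaultfd and a page-aligned range. The helper names
uffd_register_missing() and uffd_range_supports() are illustrative
only.

```c
#include <linux/userfaultfd.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

/* Register [addr, addr + len) for missing-page tracking. On success
 * returns 0 and stores the uffdio_register.ioctls bitmask in *ioctls;
 * on failure returns -1 and leaves *ioctls untouched. */
int uffd_register_missing(int uffd, void *addr, uint64_t len,
			  uint64_t *ioctls)
{
	struct uffdio_register reg;

	memset(&reg, 0, sizeof(reg));
	reg.range.start = (uint64_t)(uintptr_t)addr;
	reg.range.len = len;
	reg.mode = UFFDIO_REGISTER_MODE_MISSING;

	if (ioctl(uffd, UFFDIO_REGISTER, &reg) < 0)
		return -1;

	*ioctls = reg.ioctls;
	return 0;
}

/* Returns 1 if the ioctl with number ioctl_nr (e.g. _UFFDIO_COPY) is
 * usable on the registered range, 0 otherwise. */
int uffd_range_supports(uint64_t ioctls, unsigned int ioctl_nr)
{
	return (int)((ioctls >> ioctl_nr) & 1);
}
```

Checking uffd_range_supports(ioctls, _UFFDIO_ZEROPAGE) before relying
on UFFDIO_ZEROPAGE is one way to cope with backends that do not offer
every ioctl.
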
Userland can use the uffdio_register.ioctls to manage the virtual
address space in the background (to add or potentially also remove
memory from the userfaultfd registered range). This means a userfault
could trigger just before userland maps the user-faulted page in the
background.

The primary ioctl to resolve userfaults is UFFDIO_COPY. That
atomically copies a page into the userfault registered range and wakes
up the blocked userfaults (unless uffdio_copy.mode &
UFFDIO_COPY_MODE_DONTWAKE is set). Other ioctls work similarly to
UFFDIO_COPY. They're atomic in the sense that nothing can ever see a
half-copied page, since the access keeps userfaulting until the copy has
finished.

Notes:

- If you requested UFFDIO_REGISTER_MODE_MISSING when registering then
  you must provide some kind of page in your thread after reading from
  the uffd.  You must provide either UFFDIO_COPY or UFFDIO_ZEROPAGE.
  The normal behavior of the OS automatically providing a zero page on
  an anonymous mapping is not in place.

- None of the page-delivering ioctls default to the range that you
  registered with.  You must fill in all fields for the appropriate
  ioctl struct including the range.

- You get the address of the access that triggered the missing page
  event out of a struct uffd_msg that you read in the thread from the
  uffd.  You can supply as many pages as you want with UFFDIO_COPY or
  UFFDIO_ZEROPAGE.  Keep in mind that unless you used DONTWAKE then
  the first of any of those ioctls wakes up the faulting thread.

- Be sure to test for all errors including (pollfd[0].revents &
  POLLERR).  This can happen, e.g. when ranges supplied were
  incorrect.

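The notes above can be put together in a rough sketch of one iteration
of a fault-handling thread. The names uffd_wait_fault() and
uffd_copy_page() are hypothetical, and the page size is passed in by
the caller (it would typically come from sysconf(_SC_PAGESIZE)).

```c
#include <errno.h>
#include <linux/userfaultfd.h>
#include <poll.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Wait for one message on the uffd. Returns 1 and stores the
 * page-aligned fault address for a page fault, 0 if there was nothing
 * to resolve (timeout, -EAGAIN, or a non-fault event), -1 on error,
 * including POLLERR. */
int uffd_wait_fault(int uffd, int timeout_ms, uint64_t page_size,
		    uint64_t *page_addr)
{
	struct pollfd pfd = { .fd = uffd, .events = POLLIN };
	struct uffd_msg msg;
	ssize_t n;
	int ready = poll(&pfd, 1, timeout_ms);

	if (ready < 0 || (pfd.revents & POLLERR))
		return -1;
	if (ready == 0 || !(pfd.revents & POLLIN))
		return 0;

	n = read(uffd, &msg, sizeof(msg));
	if (n < 0)
		return errno == EAGAIN ? 0 : -1;
	if (n != (ssize_t)sizeof(msg))
		return -1;
	if (msg.event != UFFD_EVENT_PAGEFAULT)
		return 0;

	*page_addr = msg.arg.pagefault.address & ~(page_size - 1);
	return 1;
}

/* Resolve a missing-page fault by copying one page from src to dst.
 * Every field is filled in explicitly: nothing defaults to the
 * registered range. mode = 0 wakes the faulting thread. */
int uffd_copy_page(int uffd, uint64_t dst, const void *src,
		   uint64_t page_size)
{
	struct uffdio_copy copy;

	memset(&copy, 0, sizeof(copy));
	copy.dst = dst;
	copy.src = (uint64_t)(uintptr_t)src;
	copy.len = page_size;
	copy.mode = 0;
	return ioctl(uffd, UFFDIO_COPY, &copy) == 0 ? 0 : -1;
}
```

A handler thread would call uffd_wait_fault() in a loop and, whenever
it returns 1, pass the stored address to uffd_copy_page() together
with a buffer holding the page contents.
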
Write Protect Notifications
---------------------------

This is equivalent to (but faster than) using mprotect and a SIGSEGV
signal handler.

First you need to register a range with UFFDIO_REGISTER_MODE_WP.
Instead of using mprotect(2) you call ioctl(uffd, UFFDIO_WRITEPROTECT,
...) with a pointer to a struct uffdio_writeprotect whose mode is set
to UFFDIO_WRITEPROTECT_MODE_WP.  The range does not default to and does
not have to be identical to the range you registered with.  You can write
protect as many ranges as you like (inside the registered range).
Then, in the thread reading from the uffd, the message read will have
msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP set. Now you issue
ioctl(uffd, UFFDIO_WRITEPROTECT, ...) again, this time with
uffdio_writeprotect.mode not having UFFDIO_WRITEPROTECT_MODE_WP set.
This wakes up the thread, which will continue to run with writes
permitted. This allows you to do the bookkeeping about the write in the
uffd reading thread before the ioctl.

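As a sketch of the two UFFDIO_WRITEPROTECT calls described above, the
helpers below build the request and issue it. The names
uffd_wp_request() and uffd_write_protect() are hypothetical, and the
example assumes kernel headers recent enough to define
UFFDIO_WRITEPROTECT.

```c
#include <linux/userfaultfd.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

/* Build a write-protect request for [start, start + len).
 * protect != 0 sets UFFDIO_WRITEPROTECT_MODE_WP; protect == 0 clears
 * the protection, which also wakes a thread blocked on the range. */
struct uffdio_writeprotect uffd_wp_request(uint64_t start, uint64_t len,
					   int protect)
{
	struct uffdio_writeprotect wp;

	memset(&wp, 0, sizeof(wp));
	wp.range.start = start;
	wp.range.len = len;
	wp.mode = protect ? UFFDIO_WRITEPROTECT_MODE_WP : 0;
	return wp;
}

/* Apply or remove write protection. Returns 0 on success, -1 on error. */
int uffd_write_protect(int uffd, uint64_t start, uint64_t len, int protect)
{
	struct uffdio_writeprotect wp = uffd_wp_request(start, len, protect);

	return ioctl(uffd, UFFDIO_WRITEPROTECT, &wp) == 0 ? 0 : -1;
}
```

The reading thread would do its bookkeeping for the faulting page and
only then call uffd_write_protect(uffd, page, page_size, 0) to let the
writer continue.
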
If you registered with both UFFDIO_REGISTER_MODE_MISSING and
UFFDIO_REGISTER_MODE_WP then you need to think about the sequence in
which you supply a page and undo write protect.  Note that there is a
difference between writes into a WP area and into a !WP area.  The
former will have UFFD_PAGEFAULT_FLAG_WP set, the latter
UFFD_PAGEFAULT_FLAG_WRITE.  The latter did not fail on protection but
you still need to supply a page when UFFDIO_REGISTER_MODE_MISSING was
used.

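One possible way to tell these cases apart in the reading thread is
sketched below. The enum and the function classify_fault() are
invented for this example; the WP flag is tested first on the
assumption that a write-protect fault may carry the WRITE flag as
well.

```c
#include <linux/userfaultfd.h>
#include <stdint.h>

enum fault_kind {
	FAULT_NOT_A_FAULT,	/* message is some other event */
	FAULT_WP,		/* write into a write-protected area */
	FAULT_MISSING_WRITE,	/* write into a !WP area, page missing */
	FAULT_MISSING_READ,	/* read access, page missing */
};

/* Decide what a message read from the uffd is asking for. */
enum fault_kind classify_fault(const struct uffd_msg *msg)
{
	uint64_t flags;

	if (msg->event != UFFD_EVENT_PAGEFAULT)
		return FAULT_NOT_A_FAULT;

	flags = msg->arg.pagefault.flags;
	if (flags & UFFD_PAGEFAULT_FLAG_WP)
		return FAULT_WP;	/* undo write protect to resolve */
	if (flags & UFFD_PAGEFAULT_FLAG_WRITE)
		return FAULT_MISSING_WRITE;	/* supply a page */
	return FAULT_MISSING_READ;	/* supply a page */
}
```

For FAULT_WP the handler clears the write protection; for both
FAULT_MISSING_* results it supplies a page with UFFDIO_COPY or
UFFDIO_ZEROPAGE.
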
QEMU/KVM
========

QEMU/KVM uses the userfaultfd syscall to implement postcopy live
migration. Postcopy live migration is one form of memory
externalization consisting of a virtual machine running with part or
all of its memory residing on a different node in the cloud. The
userfaultfd abstraction is generic enough that not a single line of
KVM kernel code had to be modified in order to add postcopy live
migration to QEMU.

Guest async page faults, FOLL_NOWAIT and all other GUP features work
just fine in combination with userfaults. Userfaults trigger async
page faults in the guest scheduler so those guest processes that
aren't waiting for userfaults (i.e. network bound) can keep running in
the guest vcpus.

It is generally beneficial to run one pass of precopy live migration
just before starting postcopy live migration, in order to avoid
generating userfaults for readonly guest regions.

The implementation of postcopy live migration currently uses one
single bidirectional socket but in the future two different sockets
will be used (to reduce the latency of the userfaults to the minimum
possible without having to decrease /proc/sys/net/ipv4/tcp_wmem).

The QEMU in the source node writes all pages that it knows are missing
in the destination node, into the socket, and the migration thread of
the QEMU running in the destination node runs UFFDIO_COPY|ZEROPAGE
ioctls on the userfaultfd in order to map the received pages into the
guest (UFFDIO_ZEROPAGE is used if the source page was a zero page).

A different postcopy thread in the destination node listens with
poll() to the userfaultfd in parallel. When a POLLIN event is
generated after a userfault triggers, the postcopy thread calls read() on
the userfaultfd and receives the fault address (or -EAGAIN in case the
userfault was already resolved and woken by a UFFDIO_COPY|ZEROPAGE run
by the parallel QEMU migration thread).

After the QEMU postcopy thread (running in the destination node) gets
the userfault address it writes the information about the missing page
into the socket. The QEMU source node receives the information and
roughly "seeks" to that page address and continues sending all
remaining missing pages from that new page offset. Soon after that
(just the time to flush the tcp_wmem queue through the network) the
migration thread in the QEMU running in the destination node will
receive the page that triggered the userfault and it'll map it as
usual with the UFFDIO_COPY|ZEROPAGE (without actually knowing if it
was spontaneously sent by the source or if it was an urgent page
requested through a userfault).

By the time the userfaults start, the QEMU in the destination node
doesn't need to keep any per-page state bitmap relative to the live
migration around and a single per-page bitmap has to be maintained in
the QEMU running in the source node to know which pages are still
missing in the destination node. The bitmap in the source node is
checked to find which missing pages to send in round robin and we seek
over it when receiving incoming userfaults. After sending each page of
course the bitmap is updated accordingly. It's also useful to avoid
sending the same page twice (in case the userfault is read by the
postcopy thread just before UFFDIO_COPY|ZEROPAGE runs in the migration
thread).

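The source-side bookkeeping can be pictured with the toy model below.
It is not QEMU's implementation: the structure missing_map and the
functions map_seek() and map_next() are invented here purely to
illustrate "send in round robin, seek on userfault, never send a page
twice".

```c
#include <stddef.h>
#include <stdint.h>

/* Toy per-page bitmap: bit i set means page i is still missing in the
 * destination node. */
struct missing_map {
	uint8_t *bits;
	size_t npages;
	size_t cursor;	/* next page index to consider */
};

static int map_test(const struct missing_map *m, size_t i)
{
	return (m->bits[i / 8] >> (i % 8)) & 1;
}

static void map_clear(struct missing_map *m, size_t i)
{
	m->bits[i / 8] &= (uint8_t)~(1u << (i % 8));
}

/* An incoming userfault request moves the cursor to the urgent page. */
void map_seek(struct missing_map *m, size_t page)
{
	if (page < m->npages)
		m->cursor = page;
}

/* Pick the next missing page at or after the cursor, wrapping around
 * once. Clears its bit so it is not sent again. Returns 1 and stores
 * the page index, or 0 when no page is missing any more. */
int map_next(struct missing_map *m, size_t *page)
{
	size_t i;

	for (i = 0; i < m->npages; i++) {
		size_t idx = (m->cursor + i) % m->npages;

		if (map_test(m, idx)) {
			map_clear(m, idx);
			m->cursor = (idx + 1) % m->npages;
			*page = idx;
			return 1;
		}
	}
	return 0;
}
```

If a userfault arrives for a page whose bit is already clear, map_next()
simply continues with the following missing page, which mirrors the
"avoid sending the same page twice" remark above.
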
Non-cooperative userfaultfd
===========================

When the userfaultfd is monitored by an external manager, the manager
must be able to track changes in the process virtual memory
layout. Userfaultfd can notify the manager about such changes using
the same read(2) protocol as for the page fault notifications. The
manager has to explicitly enable these events by setting appropriate
bits in uffdio_api.features passed to UFFDIO_API ioctl:

UFFD_FEATURE_EVENT_FORK
	enable userfaultfd hooks for fork(). When this feature is
	enabled, the userfaultfd context of the parent process is
	duplicated into the newly created process. The manager
	receives UFFD_EVENT_FORK with the file descriptor of the new
	userfaultfd context in the uffd_msg.fork.

UFFD_FEATURE_EVENT_REMAP
	enable notifications about mremap() calls. When the
	non-cooperative process moves a virtual memory area to a
	different location, the manager will receive
	UFFD_EVENT_REMAP. The uffd_msg.remap will contain the old and
	new addresses of the area and its original length.

UFFD_FEATURE_EVENT_REMOVE
	enable notifications about madvise(MADV_REMOVE) and
	madvise(MADV_DONTNEED) calls. The event UFFD_EVENT_REMOVE will
	be generated upon these calls to madvise. The uffd_msg.remove
	will contain the start and end addresses of the removed area.

UFFD_FEATURE_EVENT_UNMAP
	enable notifications about memory unmapping. The manager will
	get UFFD_EVENT_UNMAP with uffd_msg.remove containing the start
	and end addresses of the unmapped area.

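To make the message layout concrete, the sketch below decodes the
events listed above into a short text description. The macro
MANAGER_EVENT_FEATURES and the function describe_event() are
hypothetical; the fields are reached through the arg union of struct
uffd_msg as declared in <linux/userfaultfd.h>.

```c
#include <inttypes.h>
#include <linux/userfaultfd.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Feature bits a non-cooperative manager might pass to UFFDIO_API. */
#define MANAGER_EVENT_FEATURES				\
	(UFFD_FEATURE_EVENT_FORK | UFFD_FEATURE_EVENT_REMAP |	\
	 UFFD_FEATURE_EVENT_REMOVE | UFFD_FEATURE_EVENT_UNMAP)

/* Format a one-line description of msg into buf. Returns the value of
 * snprintf(), i.e. the length the full description needs. */
int describe_event(const struct uffd_msg *msg, char *buf, size_t len)
{
	switch (msg->event) {
	case UFFD_EVENT_PAGEFAULT:
		return snprintf(buf, len, "pagefault: 0x%" PRIx64,
				(uint64_t)msg->arg.pagefault.address);
	case UFFD_EVENT_FORK:
		return snprintf(buf, len, "fork: new uffd %u",
				(unsigned int)msg->arg.fork.ufd);
	case UFFD_EVENT_REMAP:
		return snprintf(buf, len,
				"remap: 0x%" PRIx64 " -> 0x%" PRIx64
				" len 0x%" PRIx64,
				(uint64_t)msg->arg.remap.from,
				(uint64_t)msg->arg.remap.to,
				(uint64_t)msg->arg.remap.len);
	case UFFD_EVENT_REMOVE:
		return snprintf(buf, len,
				"remove: [0x%" PRIx64 ", 0x%" PRIx64 ")",
				(uint64_t)msg->arg.remove.start,
				(uint64_t)msg->arg.remove.end);
	case UFFD_EVENT_UNMAP:
		/* UNMAP reuses the "remove" member for its range. */
		return snprintf(buf, len,
				"unmap: [0x%" PRIx64 ", 0x%" PRIx64 ")",
				(uint64_t)msg->arg.remove.start,
				(uint64_t)msg->arg.remove.end);
	default:
		return snprintf(buf, len, "unknown event %u",
				(unsigned int)msg->event);
	}
}
```

A real manager would replace the snprintf() calls with updates to its
own view of the monitored address space, for example by starting to
poll the descriptor delivered with UFFD_EVENT_FORK.
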
Although UFFD_FEATURE_EVENT_REMOVE and UFFD_FEATURE_EVENT_UNMAP
are quite similar, they differ considerably in the action expected from
the userfaultfd manager. In the former case, the virtual memory is
removed but the area is not: the area remains monitored by the
userfaultfd, and if a page fault occurs in that area it will be
delivered to the manager. The proper resolution for such a page fault is
to zeromap the faulting address. However, in the latter case, when an
area is unmapped, either explicitly (with the munmap() system call) or
implicitly (e.g. during mremap()), the area is removed and in turn the
userfaultfd context for that area disappears too, and the manager will
not get further userland page faults from the removed area. Still, the
notification is required in order to prevent the manager from using
UFFDIO_COPY on the unmapped area.

Unlike userland page faults, which have to be synchronous and require
an explicit or implicit wakeup, all the events are delivered
asynchronously and the non-cooperative process resumes execution as
soon as the manager executes read(). The userfaultfd manager should
carefully synchronize calls to UFFDIO_COPY with the processing of
events. To aid the synchronization, the UFFDIO_COPY ioctl will
return -ENOSPC when the monitored process exits at the time of
UFFDIO_COPY, and -ENOENT when the non-cooperative process has changed
its virtual memory layout simultaneously with an outstanding UFFDIO_COPY
operation.

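A manager might fold these return values into its own state machine
along the lines of the sketch below. The enum and
classify_copy_result() are made up for illustration, and only the two
error codes named in the paragraph above are handled specially; a
given kernel may report other errors as well.

```c
#include <errno.h>

enum copy_outcome {
	COPY_OK,		/* page installed */
	COPY_PROCESS_GONE,	/* monitored process exited: stop tracking */
	COPY_LAYOUT_CHANGED,	/* layout changed: process events first */
	COPY_FAILED,		/* any other error */
};

/* ret is the return value of ioctl(uffd, UFFDIO_COPY, ...), err the
 * value of errno observed right after the call. */
enum copy_outcome classify_copy_result(int ret, int err)
{
	if (ret == 0)
		return COPY_OK;

	switch (err) {
	case ENOSPC:
		return COPY_PROCESS_GONE;
	case ENOENT:
		return COPY_LAYOUT_CHANGED;
	default:
		return COPY_FAILED;
	}
}
```

On COPY_LAYOUT_CHANGED the manager would typically drain pending
events with read() and re-check whether the destination range is still
registered before retrying.
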
The current asynchronous model of the event delivery is optimal for
single threaded non-cooperative userfaultfd manager implementations. A
synchronous event delivery model can be added later as a new
userfaultfd feature to facilitate multithreading enhancements of the
non-cooperative manager, for example to allow UFFDIO_COPY ioctls to
run in parallel to the event reception. Single threaded
implementations should continue to use the current async event
delivery model instead.