.. _userfaultfd:

===========
Userfaultfd
===========

Objective
=========

Userfaults allow the implementation of on-demand paging from userland
and, more generally, they allow userland to take control of various
memory page faults, something otherwise only the kernel code could do.

For example, userfaults allow a proper and more optimal implementation
of the PROT_NONE+SIGSEGV trick.

Design
======

Userfaults are delivered and resolved through the userfaultfd syscall.

The userfaultfd (aside from registering and unregistering virtual
memory ranges) provides two primary functionalities:

1) a read/POLLIN protocol to notify a userland thread of the faults
   happening

2) various UFFDIO_* ioctls that can manage the virtual memory regions
   registered in the userfaultfd, and that allow userland to
   efficiently resolve the userfaults it receives via 1) or to manage
   the virtual memory in the background

The real advantage of userfaults compared to regular virtual memory
management with mremap/mprotect is that userfaults never involve, in
any of their operations, heavyweight structures like vmas (in fact the
userfaultfd runtime load never takes the mmap_sem for writing).

Vmas are not suitable for page- (or hugepage-) granular fault tracking
when dealing with virtual address spaces that could span terabytes.
Too many vmas would be needed for that.

Once opened by invoking the syscall, the userfaultfd can also be
passed using unix domain sockets to a manager process, so the same
manager process could handle the userfaults of a multitude of
different processes without them being aware of what is going on
(unless, of course, they later try to use the userfaultfd themselves
on the same region the manager is already tracking, a corner case that
would currently return -EBUSY).

API
===

When first opened, the userfaultfd must be enabled by invoking the
UFFDIO_API ioctl with a uffdio_api.api value set to UFFD_API (or a
later API version), which specifies the read/POLLIN protocol userland
intends to speak on the UFFD, and with the uffdio_api.features
userland requires. If successful (i.e. if the requested uffdio_api.api
is also spoken by the running kernel and the requested features are
going to be enabled), the UFFDIO_API ioctl will return into
uffdio_api.features and uffdio_api.ioctls two 64bit bitmasks of,
respectively, all the available features of the read(2) protocol and
the generic ioctls available.

The uffdio_api.features bitmask returned by the UFFDIO_API ioctl
defines what memory types are supported by the userfaultfd and what
events, other than page fault notifications, may be generated.

If the kernel supports registering userfaultfd ranges on hugetlbfs
virtual memory areas, UFFD_FEATURE_MISSING_HUGETLBFS will be set in
uffdio_api.features. Similarly, UFFD_FEATURE_MISSING_SHMEM will be
set if the kernel supports registering userfaultfd ranges on shared
memory (covering all shmem APIs, i.e. tmpfs, IPCSHM, /dev/zero
MAP_SHARED, memfd_create, etc.).

A userland application that wants to use userfaultfd with hugetlbfs
or shared memory needs to set the corresponding flags in
uffdio_api.features to enable those features.

If userland desires to receive notifications for events other than
page faults, it has to verify that uffdio_api.features has the
appropriate UFFD_FEATURE_EVENT_* bits set. These events are described
in more detail below, in the "Non-cooperative userfaultfd" section.

Once the userfaultfd has been enabled, the UFFDIO_REGISTER ioctl
should be invoked (if present in the returned uffdio_api.ioctls
bitmask) to register a memory range in the userfaultfd by setting the
uffdio_register structure accordingly. The uffdio_register.mode
bitmask specifies to the kernel which kinds of faults to track for
the range (UFFDIO_REGISTER_MODE_MISSING would track missing
pages). The UFFDIO_REGISTER ioctl will return the
uffdio_register.ioctls bitmask of ioctls that are suitable to resolve
userfaults on the registered range. Not all ioctls will necessarily be
supported for all memory types, depending on the underlying virtual
memory backend (anonymous memory vs tmpfs vs real file-backed
mappings).

Userland can use the uffdio_register.ioctls to manage the virtual
address space in the background (to add or potentially also remove
memory from the userfaultfd registered range). This means a userfault
could be triggering just before userland maps the user-faulted page in
the background.

The primary ioctl to resolve userfaults is UFFDIO_COPY. It atomically
copies a page into the userfault-registered range and wakes up the
blocked userfaults (unless uffdio_copy.mode &
UFFDIO_COPY_MODE_DONTWAKE is set). Other ioctls work similarly to
UFFDIO_COPY. They're atomic in the sense that nothing can ever see a
half-copied page, since the faulting thread will keep userfaulting
until the copy has finished.

QEMU/KVM
========

QEMU/KVM is using the userfaultfd syscall to implement postcopy live
migration. Postcopy live migration is one form of memory
externalization consisting of a virtual machine running with part or
all of its memory residing on a different node in the cloud. The
userfaultfd abstraction is generic enough that not a single line of
KVM kernel code had to be modified in order to add postcopy live
migration to QEMU.

Guest async page faults, FOLL_NOWAIT and all other GUP features work
just fine in combination with userfaults. Userfaults trigger async
page faults in the guest scheduler, so guest processes that aren't
waiting for userfaults (i.e. network bound) can keep running in the
guest vcpus.

It is generally beneficial to run one pass of precopy live migration
just before starting postcopy live migration, in order to avoid
generating userfaults for readonly guest regions.

The implementation of postcopy live migration currently uses one
single bidirectional socket, but in the future two different sockets
will be used (to reduce the latency of the userfaults to the minimum
possible without having to decrease /proc/sys/net/ipv4/tcp_wmem).

The QEMU in the source node writes all pages that it knows are missing
in the destination node into the socket, and the migration thread of
the QEMU running in the destination node runs UFFDIO_COPY|ZEROPAGE
ioctls on the userfaultfd in order to map the received pages into the
guest (UFFDIO_ZEROPAGE is used if the source page was a zero page).

A different postcopy thread in the destination node listens on the
userfaultfd in parallel with poll(). When a POLLIN event is generated
after a userfault triggers, the postcopy thread read()s from the
userfaultfd and receives the fault address (or -EAGAIN in case the
userfault was already resolved and woken by a UFFDIO_COPY|ZEROPAGE
run by the parallel QEMU migration thread).

After the QEMU postcopy thread (running in the destination node) gets
the userfault address, it writes the information about the missing
page into the socket. The QEMU source node receives the information,
roughly "seeks" to that page address and continues sending all
remaining missing pages from that new page offset. Soon after that
(just the time to flush the tcp_wmem queue through the network), the
migration thread in the QEMU running in the destination node will
receive the page that triggered the userfault and map it as usual
with UFFDIO_COPY|ZEROPAGE (without actually knowing whether it was
spontaneously sent by the source or was an urgent page requested
through a userfault).

By the time the userfaults start, the QEMU in the destination node
doesn't need to keep any per-page state bitmap relative to the live
migration around; only a single per-page bitmap has to be maintained
in the QEMU running in the source node to know which pages are still
missing in the destination node. The bitmap in the source node is
checked to find which missing pages to send in round robin, and we
seek over it when receiving incoming userfaults. After sending each
page, of course, the bitmap is updated accordingly. This is also
useful to avoid sending the same page twice (in case the userfault is
read by the postcopy thread just before UFFDIO_COPY|ZEROPAGE runs in
the migration thread).

Non-cooperative userfaultfd
===========================

When the userfaultfd is monitored by an external manager, the manager
must be able to track changes in the process virtual memory
layout. Userfaultfd can notify the manager about such changes using
the same read(2) protocol as for the page fault notifications. The
manager has to explicitly enable these events by setting the
appropriate bits in uffdio_api.features passed to the UFFDIO_API
ioctl:

UFFD_FEATURE_EVENT_FORK
        enable userfaultfd hooks for fork(). When this feature is
        enabled, the userfaultfd context of the parent process is
        duplicated into the newly created process. The manager
        receives UFFD_EVENT_FORK with the file descriptor of the new
        userfaultfd context in uffd_msg.fork.

UFFD_FEATURE_EVENT_REMAP
        enable notifications about mremap() calls. When the
        non-cooperative process moves a virtual memory area to a
        different location, the manager will receive
        UFFD_EVENT_REMAP. The uffd_msg.remap will contain the old and
        new addresses of the area and its original length.

UFFD_FEATURE_EVENT_REMOVE
        enable notifications about madvise(MADV_REMOVE) and
        madvise(MADV_DONTNEED) calls. The event UFFD_EVENT_REMOVE will
        be generated upon these calls to madvise. The uffd_msg.remove
        will contain the start and end addresses of the removed area.

UFFD_FEATURE_EVENT_UNMAP
        enable notifications about memory unmapping. The manager will
        get UFFD_EVENT_UNMAP with uffd_msg.remove containing the start
        and end addresses of the unmapped area.

Although UFFD_FEATURE_EVENT_REMOVE and UFFD_FEATURE_EVENT_UNMAP are
pretty similar, they quite differ in the action expected from the
userfaultfd manager. In the former case, the virtual memory is
removed, but the area is not: the area remains monitored by the
userfaultfd, and if a page fault occurs in that area it will be
delivered to the manager. The proper resolution for such a page fault
is to zeromap the faulting address. In the latter case, however, when
an area is unmapped, either explicitly (with the munmap() system
call) or implicitly (e.g. during mremap()), the area is removed and
in turn the userfaultfd context for that area disappears too, and the
manager will not get further userland page faults from the removed
area. Still, the notification is required in order to prevent the
manager from using UFFDIO_COPY on the unmapped area.

Unlike userland page faults, which have to be synchronous and require
an explicit or implicit wakeup, all the events are delivered
asynchronously and the non-cooperative process resumes execution as
soon as the manager executes read(). The userfaultfd manager should
carefully synchronize calls to UFFDIO_COPY with event processing. To
aid the synchronization, the UFFDIO_COPY ioctl will return -ENOSPC
when the monitored process exits at the time of UFFDIO_COPY, and
-ENOENT when the non-cooperative process has changed its virtual
memory layout simultaneously with an outstanding UFFDIO_COPY
operation.

The current asynchronous model of event delivery is optimal for
single-threaded non-cooperative userfaultfd manager implementations.
A synchronous event delivery model could be added later as a new
userfaultfd feature, to facilitate multithreading enhancements of the
non-cooperative manager, for example to allow UFFDIO_COPY ioctls to
run in parallel with event reception. Single-threaded implementations
should continue to use the current async event delivery model.