=================
KVM VCPU Requests
=================

Overview
========

KVM supports an internal API enabling threads to request a VCPU thread to
perform some activity.  For example, a thread may request a VCPU to flush
its TLB with a VCPU request.  The API consists of the following functions::

  /* Check if any requests are pending for VCPU @vcpu. */
  bool kvm_request_pending(struct kvm_vcpu *vcpu);

  /* Check if VCPU @vcpu has request @req pending. */
  bool kvm_test_request(int req, struct kvm_vcpu *vcpu);

  /* Clear request @req for VCPU @vcpu. */
  void kvm_clear_request(int req, struct kvm_vcpu *vcpu);

  /*
   * Check if VCPU @vcpu has request @req pending. When the request is
   * pending it will be cleared and a memory barrier, which pairs with
   * another in kvm_make_request(), will be issued.
   */
  bool kvm_check_request(int req, struct kvm_vcpu *vcpu);

  /*
   * Make request @req of VCPU @vcpu. Issues a memory barrier, which pairs
   * with another in kvm_check_request(), prior to setting the request.
   */
  void kvm_make_request(int req, struct kvm_vcpu *vcpu);

  /* Make request @req of all VCPUs of the VM with struct kvm @kvm. */
  bool kvm_make_all_cpus_request(struct kvm *kvm, unsigned int req);

Typically a requester wants the VCPU to perform the activity as soon
as possible after making the request.  This means most requests
(kvm_make_request() calls) are followed by a call to kvm_vcpu_kick(),
and kvm_make_all_cpus_request() has the kicking of all VCPUs built
into it.
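
The difference between testing and checking a request can be illustrated
with a small userspace model.  This is only a sketch built on C11 atomics,
not the kernel implementation: ``struct model_vcpu`` and every ``model_*``
name are hypothetical stand-ins, and the model assumes request numbers
below 64 so that they fit in one 64-bit word.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for struct kvm_vcpu; only the bitmap is modeled. */
struct model_vcpu {
    _Atomic uint64_t requests;
};

/* Assumption: the low 8 bits of a request value carry its number. */
#define MODEL_REQUEST_MASK 0xffu

/* Assumes the masked request number is below 64. */
static inline uint64_t model_req_bit(unsigned int req)
{
    return UINT64_C(1) << (req & MODEL_REQUEST_MASK);
}

/* True if any request bit is set. */
static inline bool model_request_pending(struct model_vcpu *vcpu)
{
    return atomic_load_explicit(&vcpu->requests, memory_order_relaxed) != 0;
}

/* True if @req is set; the bit is left untouched. */
static inline bool model_test_request(unsigned int req, struct model_vcpu *vcpu)
{
    uint64_t bits = atomic_load_explicit(&vcpu->requests, memory_order_relaxed);

    return (bits & model_req_bit(req)) != 0;
}

/* Clear @req without any ordering guarantee. */
static inline void model_clear_request(unsigned int req, struct model_vcpu *vcpu)
{
    atomic_fetch_and_explicit(&vcpu->requests, ~model_req_bit(req),
                              memory_order_relaxed);
}

/* Set @req; release ordering publishes earlier writes to the receiver. */
static inline void model_make_request(unsigned int req, struct model_vcpu *vcpu)
{
    atomic_fetch_or_explicit(&vcpu->requests, model_req_bit(req),
                             memory_order_release);
}

/* Test and clear @req; the acquire fence pairs with model_make_request(). */
static inline bool model_check_request(unsigned int req, struct model_vcpu *vcpu)
{
    if (!model_test_request(req, vcpu))
        return false;
    model_clear_request(req, vcpu);
    atomic_thread_fence(memory_order_acquire);
    return true;
}
```

In this model a test leaves the request pending, while a check consumes it,
so a second check of the same request reports that nothing is pending.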

VCPU Kicks
----------

The goal of a VCPU kick is to bring a VCPU thread out of guest mode in
order to perform some KVM maintenance.  To do so, an IPI is sent, forcing
a guest mode exit.  However, a VCPU thread may not be in guest mode at the
time of the kick.  Therefore, depending on the mode and state of the VCPU
thread, there are two other actions a kick may take.  All three actions
are listed below:

1) Send an IPI.  This forces a guest mode exit.
2) Wake a sleeping VCPU.  Sleeping VCPUs are VCPU threads outside guest
   mode that wait on waitqueues.  Waking them removes the threads from
   the waitqueues, allowing the threads to run again.  This behavior
   may be suppressed; see KVM_REQUEST_NO_WAKEUP below.
3) Do nothing.  When the VCPU is not in guest mode and the VCPU thread is
   not sleeping, there is nothing to do.
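
The three actions above can be sketched as a single decision function.
This is an illustrative model only: the enum and function names are
hypothetical, and the ``sleeping`` flag stands in for whatever waitqueue
state a real implementation would inspect.

```c
#include <stdbool.h>

/* Hypothetical, simplified view of where the VCPU thread currently is. */
enum model_mode {
    MODEL_OUTSIDE_GUEST_MODE,
    MODEL_IN_GUEST_MODE,
};

enum model_kick_action {
    MODEL_KICK_NOTHING,  /* neither sleeping nor in guest mode */
    MODEL_KICK_WAKE,     /* remove the thread from its waitqueue */
    MODEL_KICK_IPI,      /* force a guest mode exit */
};

/* Pick the action a kick takes for the given VCPU state. */
static enum model_kick_action model_choose_kick(enum model_mode mode, bool sleeping)
{
    /* A sleeping VCPU is outside guest mode, so waking it is sufficient. */
    if (sleeping)
        return MODEL_KICK_WAKE;
    if (mode == MODEL_IN_GUEST_MODE)
        return MODEL_KICK_IPI;
    return MODEL_KICK_NOTHING;
}
```

The sketch ignores the KVM_REQUEST_NO_WAKEUP suppression, which is
described with the request flags later in this document.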

VCPU Mode
---------

VCPUs have a mode state, ``vcpu->mode``, that is used to track whether the
VCPU is running in guest mode or not, as well as some specific
outside guest mode states.  The architecture may use ``vcpu->mode`` to
ensure VCPU requests are seen by VCPUs (see "Ensuring Requests Are Seen"),
as well as to avoid sending unnecessary IPIs (see "IPI Reduction"), and
even to ensure IPI acknowledgements are waited upon (see "Waiting for
Acknowledgements").  The following modes are defined:

OUTSIDE_GUEST_MODE

  The VCPU thread is outside guest mode.

IN_GUEST_MODE

  The VCPU thread is in guest mode.

EXITING_GUEST_MODE

  The VCPU thread is transitioning from IN_GUEST_MODE to
  OUTSIDE_GUEST_MODE.

READING_SHADOW_PAGE_TABLES

  The VCPU thread is outside guest mode, but it wants the sender of
  certain VCPU requests, namely KVM_REQ_TLB_FLUSH, to wait until the VCPU
  thread is done reading the page tables.

VCPU Request Internals
======================

VCPU requests are simply bit indices of the ``vcpu->requests`` bitmap.
This means general bitops, like those documented in [atomic-ops]_, could
also be used, e.g. ::

  clear_bit(KVM_REQ_UNHALT & KVM_REQUEST_MASK, &vcpu->requests);

However, VCPU request users should refrain from doing so, as it would
break the abstraction.  The first 8 bits are reserved for architecture
independent requests; all additional bits are available for architecture
dependent requests.

Architecture Independent Requests
---------------------------------

KVM_REQ_TLB_FLUSH

  KVM's common MMU notifier may need to flush all of a guest's TLB
  entries, calling kvm_flush_remote_tlbs() to do so.  Architectures that
  choose to use the common kvm_flush_remote_tlbs() implementation will
  need to handle this VCPU request.

KVM_REQ_VM_DEAD

  This request informs all VCPUs that the VM is dead and unusable, e.g. due to
  fatal error or because the VM's state has been intentionally destroyed.

KVM_REQ_UNBLOCK

  This request tells the VCPU to exit kvm_vcpu_block().  It is used, for
  example, from timer handlers that run on the host on behalf of a VCPU,
  or in order to update the interrupt routing and ensure that assigned
  devices will wake up the VCPU.

KVM_REQ_UNHALT

  This request may be made from the KVM common function kvm_vcpu_block(),
  which is used to emulate an instruction that causes a CPU to halt until
  one of an architecture-specific set of events and/or interrupts is
  received (determined by checking kvm_arch_vcpu_runnable()).  When that
  event or interrupt arrives kvm_vcpu_block() makes the request.  This is
  in contrast to when kvm_vcpu_block() returns due to any other reason,
  such as a pending signal, which does not indicate the VCPU's halt
  emulation should stop, and therefore does not make the request.

KVM_REQUEST_MASK
----------------

VCPU requests should be masked by KVM_REQUEST_MASK before using them with
bitops.  This is because only the lower 8 bits are used to represent the
request's number.  The upper bits are used as flags.  Currently only two
flags are defined.

VCPU Request Flags
------------------

KVM_REQUEST_NO_WAKEUP

  This flag is applied to requests that only need immediate attention
  from VCPUs running in guest mode.  That is, sleeping VCPUs do not need
  to be awakened for these requests.  Sleeping VCPUs will handle the
  requests when they are awakened later for some other reason.

KVM_REQUEST_WAIT

  When requests with this flag are made with kvm_make_all_cpus_request(),
  then the caller will wait for each VCPU to acknowledge its IPI before
  proceeding.  This flag only applies to VCPUs that would receive IPIs.
  If, for example, the VCPU is sleeping, so no IPI is necessary, then
  the requesting thread does not wait.  This means that this flag may be
  safely combined with KVM_REQUEST_NO_WAKEUP.  See "Waiting for
  Acknowledgements" for more information about requests with
  KVM_REQUEST_WAIT.
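
The split between a request's number and its flags can be sketched as
follows.  The only detail taken from the text above is that the lower 8
bits hold the number; the flag bit positions, the two example requests and
all ``MODEL_*``/``model_*`` names are assumptions made for illustration and
should be checked against the kernel headers before relying on them.

```c
#include <stdbool.h>

#define MODEL_REQUEST_MASK       0xffu      /* bits 0-7: request number */
#define MODEL_REQUEST_NO_WAKEUP  (1u << 8)  /* assumed bit position */
#define MODEL_REQUEST_WAIT       (1u << 9)  /* assumed bit position */

/* Hypothetical request number 0 carrying both flags. */
#define MODEL_REQ_FLUSH  (0u | MODEL_REQUEST_WAIT | MODEL_REQUEST_NO_WAKEUP)
/* Hypothetical request number 5 carrying no flags. */
#define MODEL_REQ_PLAIN  5u

/* Bit index to use with bitops: the flags are masked away. */
static inline unsigned int model_req_nr(unsigned int req)
{
    return req & MODEL_REQUEST_MASK;
}

/* True if a sleeping VCPU should be woken for this request. */
static inline bool model_req_needs_wakeup(unsigned int req)
{
    return !(req & MODEL_REQUEST_NO_WAKEUP);
}

/* True if the requester waits for IPI acknowledgements. */
static inline bool model_req_waits(unsigned int req)
{
    return (req & MODEL_REQUEST_WAIT) != 0;
}
```

Because the flags live above bit 7, masking with ``MODEL_REQUEST_MASK``
always recovers a valid bit index, which is why the document asks for the
mask to be applied before any bitop.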

VCPU Requests with Associated State
===================================

Requesters that want the receiving VCPU to handle new state need to ensure
the newly written state is observable to the receiving VCPU thread's CPU
by the time it observes the request.  This means a write memory barrier
must be inserted after writing the new state and before setting the VCPU
request bit.  Additionally, on the receiving VCPU thread's side, a
corresponding read barrier must be inserted after reading the request bit
and before proceeding to read the new state associated with it.  See
scenario 3, Message and Flag, of [lwn-mb]_ and the kernel documentation
[memory-barriers]_.

The pair of functions, kvm_check_request() and kvm_make_request(), provide
the memory barriers, allowing this requirement to be handled internally by
the API.
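
The message-and-flag ordering described above can be sketched in userspace
C11, with release and acquire fences standing in for the kernel's write and
read barriers.  This is a model under stated assumptions, not kernel code:
``struct model_vcpu``, its ``new_state`` payload and both functions are
hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct model_vcpu {
    int new_state;          /* hypothetical payload tied to the request */
    atomic_bool requested;  /* stands in for a single request bit */
};

/* Requester: write the state, then a barrier, then set the flag. */
static void model_requester(struct model_vcpu *vcpu, int state)
{
    vcpu->new_state = state;
    atomic_thread_fence(memory_order_release);  /* plays the write barrier */
    atomic_store_explicit(&vcpu->requested, true, memory_order_relaxed);
}

/* Receiver: consume the flag, then a barrier, then read the state. */
static bool model_receiver(struct model_vcpu *vcpu, int *state_out)
{
    if (!atomic_exchange_explicit(&vcpu->requested, false,
                                  memory_order_relaxed))
        return false;
    atomic_thread_fence(memory_order_acquire);  /* plays the read barrier */
    *state_out = vcpu->new_state;
    return true;
}
```

If the receiver observes the flag, the fence pairing ensures it also
observes the state written before the flag was set.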

Ensuring Requests Are Seen
==========================

When making requests to VCPUs, we want to avoid the receiving VCPU
executing in guest mode for an arbitrarily long time without handling the
request.  We can be sure this won't happen as long as we ensure the VCPU
thread checks kvm_request_pending() before entering guest mode and that a
kick will send an IPI to force an exit from guest mode when necessary.
Extra care must be taken to cover the period after the VCPU thread's last
kvm_request_pending() check and before it has entered guest mode, as kick
IPIs will only trigger guest mode exits for VCPU threads that are in guest
mode or at least have already disabled interrupts in order to prepare to
enter guest mode.  This means that an optimized implementation (see "IPI
Reduction") must be certain when it's safe to not send the IPI.  One
solution, which all architectures except s390 apply, is to:

- set ``vcpu->mode`` to IN_GUEST_MODE between disabling the interrupts and
  the last kvm_request_pending() check;
- enable interrupts atomically when entering the guest.

This solution also requires memory barriers to be placed carefully in both
the requesting thread and the receiving VCPU.  With the memory barriers we
can exclude the possibility of a VCPU thread observing
!kvm_request_pending() on its last check and then not receiving an IPI for
the next request made of it, even if the request is made immediately after
the check.  This is done by way of the Dekker memory barrier pattern
(scenario 10 of [lwn-mb]_).  As the Dekker pattern requires two variables,
this solution pairs ``vcpu->mode`` with ``vcpu->requests``.  Substituting
them into the pattern gives::

  CPU1                                    CPU2
  =================                       =================
  local_irq_disable();
  WRITE_ONCE(vcpu->mode, IN_GUEST_MODE);  kvm_make_request(REQ, vcpu);
  smp_mb();                               smp_mb();
  if (kvm_request_pending(vcpu)) {        if (READ_ONCE(vcpu->mode) ==
                                              IN_GUEST_MODE) {
      ...abort guest entry...                 ...send IPI...
  }                                       }

As stated above, the IPI is only useful for VCPU threads in guest mode or
that have already disabled interrupts.  This is why this specific case of
the Dekker pattern has been extended to disable interrupts before setting
``vcpu->mode`` to IN_GUEST_MODE.  WRITE_ONCE() and READ_ONCE() are used to
pedantically implement the memory barrier pattern, guaranteeing the
compiler doesn't interfere with ``vcpu->mode``'s carefully planned
accesses.
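
The two columns of the pattern can be modeled in userspace C11, with
sequentially consistent fences standing in for smp_mb().  This is a rough
sketch: interrupt disabling has no userspace equivalent and is only noted in
a comment, and the structure and function names are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

enum { MODEL_OUTSIDE_GUEST_MODE, MODEL_IN_GUEST_MODE };

struct model_vcpu {
    atomic_int mode;
    atomic_uint requests;
};

/* VCPU side (CPU1): returns true if guest entry may proceed. */
static bool model_try_enter_guest(struct model_vcpu *vcpu)
{
    /* A real implementation disables interrupts before this point. */
    atomic_store_explicit(&vcpu->mode, MODEL_IN_GUEST_MODE,
                          memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);  /* models smp_mb() */
    if (atomic_load_explicit(&vcpu->requests, memory_order_relaxed)) {
        /* A request is pending: abort guest entry. */
        atomic_store_explicit(&vcpu->mode, MODEL_OUTSIDE_GUEST_MODE,
                              memory_order_relaxed);
        return false;
    }
    return true;
}

/* Requester side (CPU2): returns true if an IPI has to be sent. */
static bool model_request_needs_ipi(struct model_vcpu *vcpu, unsigned int req)
{
    atomic_fetch_or_explicit(&vcpu->requests, 1u << req,
                             memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);  /* models smp_mb() */
    return atomic_load_explicit(&vcpu->mode, memory_order_relaxed) ==
           MODEL_IN_GUEST_MODE;
}
```

With both fences in place, at least one side notices the other: either the
VCPU sees the pending request and aborts entry, or the requester sees
IN_GUEST_MODE and sends the IPI.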

IPI Reduction
-------------

Because only one IPI is needed to get a VCPU to check for any/all requests,
IPIs may be coalesced.  This is easily done by having the first IPI-sending
kick also change the VCPU mode to something !IN_GUEST_MODE.  The
transitional state, EXITING_GUEST_MODE, is used for this purpose.
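
One way to picture the coalescing is an atomic compare-and-exchange on the
mode, so that only the kick that performs the transition sends the IPI.
The sketch below is a userspace illustration with hypothetical names; the
use of a compare-and-exchange is an assumption of the sketch, since the
text above only states that the first kick changes the mode.

```c
#include <stdatomic.h>
#include <stdbool.h>

enum {
    MODEL_OUTSIDE_GUEST_MODE,
    MODEL_IN_GUEST_MODE,
    MODEL_EXITING_GUEST_MODE,
};

/*
 * Returns true only for the kick that moves the VCPU from IN_GUEST_MODE
 * to EXITING_GUEST_MODE; later kicks see the new mode and skip the IPI.
 */
static bool model_kick_should_send_ipi(atomic_int *mode)
{
    int expected = MODEL_IN_GUEST_MODE;

    return atomic_compare_exchange_strong(mode, &expected,
                                          MODEL_EXITING_GUEST_MODE);
}
```

A second kick arriving before the VCPU has left guest mode finds
EXITING_GUEST_MODE and sends nothing, since the pending exit will already
make the VCPU check all of its requests.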

Waiting for Acknowledgements
----------------------------

Some requests, those with the KVM_REQUEST_WAIT flag set, require IPIs to
be sent, and the acknowledgements to be waited upon, even when the target
VCPU threads are in modes other than IN_GUEST_MODE.  One such case
is when a target VCPU thread is in READING_SHADOW_PAGE_TABLES mode, which
is set after disabling interrupts.  To support these cases, the
KVM_REQUEST_WAIT flag changes the condition for sending an IPI from
checking that the VCPU is IN_GUEST_MODE to checking that it is not
OUTSIDE_GUEST_MODE.
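
The changed condition can be written as a small predicate.  This is a
sketch with hypothetical names, and the bit position used for the wait flag
is an assumption of the example.

```c
#include <stdbool.h>

enum model_mode {
    MODEL_OUTSIDE_GUEST_MODE,
    MODEL_IN_GUEST_MODE,
    MODEL_EXITING_GUEST_MODE,
    MODEL_READING_SHADOW_PAGE_TABLES,
};

#define MODEL_REQUEST_WAIT (1u << 9)  /* assumed flag bit */

/* Decide whether a request made while the VCPU is in @mode needs an IPI. */
static bool model_should_send_ipi(enum model_mode mode, unsigned int req)
{
    /* Waiting requests IPI every VCPU that is not fully outside the guest. */
    if (req & MODEL_REQUEST_WAIT)
        return mode != MODEL_OUTSIDE_GUEST_MODE;
    /* Other requests only IPI VCPUs that are actually in guest mode. */
    return mode == MODEL_IN_GUEST_MODE;
}
```

For a VCPU in READING_SHADOW_PAGE_TABLES mode, the predicate returns true
only when the wait flag is set, matching the case described above.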

Request-less VCPU Kicks
-----------------------

As the determination of whether or not to send an IPI depends on the
two-variable Dekker memory barrier pattern, it's clear that
request-less VCPU kicks are almost never correct.  Without the assurance
that a non-IPI generating kick will still result in an action by the
receiving VCPU, as the final kvm_request_pending() check does for
request-accompanying kicks, the kick may not do anything useful at
all.  If, for instance, a request-less kick was made to a VCPU that was
just about to set its mode to IN_GUEST_MODE, meaning no IPI is sent, then
the VCPU thread may continue its entry without actually having done
whatever it was the kick was meant to initiate.

One exception is x86's posted interrupt mechanism.  In this case, however,
even the request-less VCPU kick is coupled with the same
local_irq_disable() + smp_mb() pattern described above; the ON bit
(Outstanding Notification) in the posted interrupt descriptor takes the
role of ``vcpu->requests``.  When sending a posted interrupt, PIR.ON is
set before reading ``vcpu->mode``; dually, in the VCPU thread,
vmx_sync_pir_to_irr() reads PIR after setting ``vcpu->mode`` to
IN_GUEST_MODE.

Additional Considerations
=========================

Sleeping VCPUs
--------------

VCPU threads may need to consider requests before and/or after calling
functions that may put them to sleep, e.g. kvm_vcpu_block().  Whether they
do or not, and, if they do, which requests need consideration, is
architecture dependent.  kvm_vcpu_block() calls kvm_arch_vcpu_runnable()
to check if it should awaken.  One reason to do so is to provide
architectures a function where requests may be checked if necessary.

Clearing Requests
-----------------

Generally it only makes sense for the receiving VCPU thread to clear a
request.  However, in some circumstances the requesting thread and the
receiving VCPU thread are executed serially, for example because they are
the same thread or because they use some form of concurrency control to
temporarily execute synchronously.  In those cases it's possible to know
that the request may be cleared immediately, rather than waiting for the
receiving VCPU thread to handle the request in VCPU RUN.  The only current
examples of this are kvm_vcpu_block() calls made by VCPUs to block
themselves.  A possible side-effect of that call is to make the
KVM_REQ_UNHALT request, which may then be cleared immediately when the
VCPU returns from the call.

References
==========

.. [atomic-ops] Documentation/atomic_bitops.txt and Documentation/atomic_t.txt
.. [memory-barriers] Documentation/memory-barriers.txt
.. [lwn-mb] https://lwn.net/Articles/573436/