=================
KVM VCPU Requests
=================

Overview
========

KVM supports an internal API enabling threads to request a VCPU thread to
perform some activity. For example, a thread may request a VCPU to flush
its TLB with a VCPU request. The API consists of the following functions::

  /* Check if any requests are pending for VCPU @vcpu. */
  bool kvm_request_pending(struct kvm_vcpu *vcpu);

  /* Check if VCPU @vcpu has request @req pending. */
  bool kvm_test_request(int req, struct kvm_vcpu *vcpu);

  /* Clear request @req for VCPU @vcpu. */
  void kvm_clear_request(int req, struct kvm_vcpu *vcpu);

  /*
   * Check if VCPU @vcpu has request @req pending. When the request is
   * pending it will be cleared and a memory barrier, which pairs with
   * another in kvm_make_request(), will be issued.
   */
  bool kvm_check_request(int req, struct kvm_vcpu *vcpu);

  /*
   * Make request @req of VCPU @vcpu. Issues a memory barrier, which pairs
   * with another in kvm_check_request(), prior to setting the request.
   */
  void kvm_make_request(int req, struct kvm_vcpu *vcpu);

  /* Make request @req of all VCPUs of the VM with struct kvm @kvm. */
  bool kvm_make_all_cpus_request(struct kvm *kvm, unsigned int req);

Typically a requester wants the VCPU to perform the activity as soon
as possible after making the request. This means most requests
(kvm_make_request() calls) are followed by a call to kvm_vcpu_kick(),
and kvm_make_all_cpus_request() has the kicking of all VCPUs built
into it.

VCPU Kicks
----------

The goal of a VCPU kick is to bring a VCPU thread out of guest mode in
order to perform some KVM maintenance. To do so, an IPI is sent, forcing
a guest mode exit. However, a VCPU thread may not be in guest mode at the
time of the kick. Therefore, depending on the mode and state of the VCPU
thread, there are two other actions a kick may take.
All three actions are listed below:

1) Send an IPI. This forces a guest mode exit.
2) Wake a sleeping VCPU. Sleeping VCPUs are VCPU threads outside guest
   mode that wait on waitqueues. Waking them removes the threads from
   the waitqueues, allowing the threads to run again. This behavior
   may be suppressed; see KVM_REQUEST_NO_WAKEUP below.
3) Nothing. When the VCPU is not in guest mode and the VCPU thread is not
   sleeping, then there is nothing to do.

VCPU Mode
---------

VCPUs have a mode state, ``vcpu->mode``, that is used to track whether the
guest is running in guest mode or not, as well as some specific
outside guest mode states. The architecture may use ``vcpu->mode`` to
ensure VCPU requests are seen by VCPUs (see "Ensuring Requests Are Seen"),
as well as to avoid sending unnecessary IPIs (see "IPI Reduction"), and
even to ensure IPI acknowledgements are waited upon (see "Waiting for
Acknowledgements"). The following modes are defined:

OUTSIDE_GUEST_MODE

  The VCPU thread is outside guest mode.

IN_GUEST_MODE

  The VCPU thread is in guest mode.

EXITING_GUEST_MODE

  The VCPU thread is transitioning from IN_GUEST_MODE to
  OUTSIDE_GUEST_MODE.

READING_SHADOW_PAGE_TABLES

  The VCPU thread is outside guest mode, but it wants the sender of
  certain VCPU requests, namely KVM_REQ_TLB_FLUSH, to wait until the VCPU
  thread is done reading the page tables.

VCPU Request Internals
======================

VCPU requests are simply bit indices of the ``vcpu->requests`` bitmap.
This means general bitops, like those documented in [atomic-ops]_, could
also be used, e.g. ::

  clear_bit(KVM_REQ_UNHALT & KVM_REQUEST_MASK, &vcpu->requests);

However, VCPU request users should refrain from doing so, as it would
break the abstraction.
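
To make the bitmap representation concrete, the following is a minimal
user-space sketch using C11 atomics. It is not KVM code: every ``sk_``/``SK_``
name is hypothetical, the numeric request values are made up for
illustration, and the flag positions are an assumption based only on the
description that the low 8 bits carry the request number. It shows why a
request value is masked before it is used as a bit index.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Assumed layout: low 8 bits = request number, upper bits = flags. */
#define SK_REQUEST_MASK       0xffu
#define SK_REQUEST_NO_WAKEUP  (1u << 8)
#define SK_REQUEST_WAIT       (1u << 9)

/* Hypothetical request values; the numbers are illustrative only. */
#define SK_REQ_TLB_FLUSH  (0u | SK_REQUEST_WAIT | SK_REQUEST_NO_WAKEUP)
#define SK_REQ_UNHALT     (3u)

struct sk_vcpu {
	atomic_ulong requests;	/* one bit per request number */
};

/* Strip the flag bits so that only the request number selects a bit. */
static unsigned long sk_req_bit(unsigned int req)
{
	unsigned int nr = req & SK_REQUEST_MASK;

	assert(nr < 8 * sizeof(unsigned long));
	return 1ul << nr;
}

void sk_make_request(unsigned int req, struct sk_vcpu *vcpu)
{
	atomic_fetch_or(&vcpu->requests, sk_req_bit(req));
}

bool sk_test_request(unsigned int req, struct sk_vcpu *vcpu)
{
	return (atomic_load(&vcpu->requests) & sk_req_bit(req)) != 0;
}

void sk_clear_request(unsigned int req, struct sk_vcpu *vcpu)
{
	atomic_fetch_and(&vcpu->requests, ~sk_req_bit(req));
}

bool sk_request_pending(struct sk_vcpu *vcpu)
{
	return atomic_load(&vcpu->requests) != 0;
}
```

Without the mask, a request that carries flags would select a bit far
outside the range reserved for request numbers, which is one reason to
go through the request API rather than raw bitops.
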
The first 8 bits are reserved for architecture independent requests; all
additional bits are available for architecture dependent requests.

Architecture Independent Requests
---------------------------------

KVM_REQ_TLB_FLUSH

  KVM's common MMU notifier may need to flush all of a guest's TLB
  entries, calling kvm_flush_remote_tlbs() to do so. Architectures that
  choose to use the common kvm_flush_remote_tlbs() implementation will
  need to handle this VCPU request.

KVM_REQ_MMU_RELOAD

  When shadow page tables are used and memory slots are removed it's
  necessary to inform each VCPU to completely refresh the tables. This
  request is used for that.

KVM_REQ_PENDING_TIMER

  This request may be made from a timer handler run on the host on behalf
  of a VCPU. It informs the VCPU thread to inject a timer interrupt.

KVM_REQ_UNHALT

  This request may be made from the KVM common function kvm_vcpu_block(),
  which is used to emulate an instruction that causes a CPU to halt until
  one of an architecture-specific set of events and/or interrupts is
  received (determined by checking kvm_arch_vcpu_runnable()). When that
  event or interrupt arrives kvm_vcpu_block() makes the request. This is
  in contrast to when kvm_vcpu_block() returns due to any other reason,
  such as a pending signal, which does not indicate the VCPU's halt
  emulation should stop, and therefore does not make the request.

KVM_REQUEST_MASK
----------------

VCPU requests should be masked by KVM_REQUEST_MASK before using them with
bitops. This is because only the lower 8 bits are used to represent the
request's number. The upper bits are used as flags. Currently only two
flags are defined.

VCPU Request Flags
------------------

KVM_REQUEST_NO_WAKEUP

  This flag is applied to requests that only need immediate attention
  from VCPUs running in guest mode.
  That is, sleeping VCPUs do not need
  to be awakened for these requests. Sleeping VCPUs will handle the
  requests when they are awakened later for some other reason.

KVM_REQUEST_WAIT

  When requests with this flag are made with kvm_make_all_cpus_request(),
  then the caller will wait for each VCPU to acknowledge its IPI before
  proceeding. This flag only applies to VCPUs that would receive IPIs.
  If, for example, the VCPU is sleeping, so no IPI is necessary, then
  the requesting thread does not wait. This means that this flag may be
  safely combined with KVM_REQUEST_NO_WAKEUP. See "Waiting for
  Acknowledgements" for more information about requests with
  KVM_REQUEST_WAIT.

VCPU Requests with Associated State
===================================

Requesters that want the receiving VCPU to handle new state need to ensure
the newly written state is observable to the receiving VCPU thread's CPU
by the time it observes the request. This means a write memory barrier
must be inserted after writing the new state and before setting the VCPU
request bit. Additionally, on the receiving VCPU thread's side, a
corresponding read barrier must be inserted after reading the request bit
and before proceeding to read the new state associated with it. See
scenario 3, Message and Flag, of [lwn-mb]_ and the kernel documentation
[memory-barriers]_.

The pair of functions, kvm_check_request() and kvm_make_request(), provide
the memory barriers, allowing this requirement to be handled internally by
the API.

Ensuring Requests Are Seen
==========================

When making requests to VCPUs, we want to avoid the receiving VCPU
executing in guest mode for an arbitrarily long time without handling the
request.
We can be sure this won't happen as long as we ensure the VCPU
thread checks kvm_request_pending() before entering guest mode and that a
kick will send an IPI to force an exit from guest mode when necessary.
Extra care must be taken to cover the period after the VCPU thread's last
kvm_request_pending() check and before it has entered guest mode, as kick
IPIs will only trigger guest mode exits for VCPU threads that are in guest
mode or at least have already disabled interrupts in order to prepare to
enter guest mode. This means that an optimized implementation (see "IPI
Reduction") must be certain when it's safe to not send the IPI. One
solution, which all architectures except s390 apply, is to:

- set ``vcpu->mode`` to IN_GUEST_MODE between disabling the interrupts and
  the last kvm_request_pending() check;
- enable interrupts atomically when entering the guest.

This solution also requires memory barriers to be placed carefully in both
the requesting thread and the receiving VCPU. With the memory barriers we
can exclude the possibility of a VCPU thread observing
!kvm_request_pending() on its last check and then not receiving an IPI for
the next request made of it, even if the request is made immediately after
the check. This is done by way of the Dekker memory barrier pattern
(scenario 10 of [lwn-mb]_). As the Dekker pattern requires two variables,
this solution pairs ``vcpu->mode`` with ``vcpu->requests``. Substituting
them into the pattern gives::

  CPU1                                    CPU2
  =================                       =================
  local_irq_disable();
  WRITE_ONCE(vcpu->mode, IN_GUEST_MODE);  kvm_make_request(REQ, vcpu);
  smp_mb();                               smp_mb();
  if (kvm_request_pending(vcpu)) {        if (READ_ONCE(vcpu->mode) ==
                                              IN_GUEST_MODE) {
      ...abort guest entry...                 ...send IPI...
  }                                       }

As stated above, the IPI is only useful for VCPU threads in guest mode or
that have already disabled interrupts.
This is why this specific case of
the Dekker pattern has been extended to disable interrupts before setting
``vcpu->mode`` to IN_GUEST_MODE. WRITE_ONCE() and READ_ONCE() are used to
pedantically implement the memory barrier pattern, guaranteeing the
compiler doesn't interfere with ``vcpu->mode``'s carefully planned
accesses.

IPI Reduction
-------------

As only one IPI is needed to get a VCPU to check for any/all requests,
IPIs may be coalesced. This is easily done by having the first
IPI-sending kick also change the VCPU mode to something !IN_GUEST_MODE. The
transitional state, EXITING_GUEST_MODE, is used for this purpose.

Waiting for Acknowledgements
----------------------------

Some requests, those with the KVM_REQUEST_WAIT flag set, require IPIs to
be sent, and the acknowledgements to be waited upon, even when the target
VCPU threads are in modes other than IN_GUEST_MODE. For example, one case
is when a target VCPU thread is in READING_SHADOW_PAGE_TABLES mode, which
is set after disabling interrupts. To support these cases, the
KVM_REQUEST_WAIT flag changes the condition for sending an IPI from
checking that the VCPU is IN_GUEST_MODE to checking that it is not
OUTSIDE_GUEST_MODE.

Request-less VCPU Kicks
-----------------------

As the determination of whether or not to send an IPI depends on the
two-variable Dekker memory barrier pattern, it's clear that
request-less VCPU kicks are almost never correct. Without the assurance
that a non-IPI generating kick will still result in an action by the
receiving VCPU, as the final kvm_request_pending() check does for
request-accompanying kicks, the kick may not do anything useful at
all.
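
For contrast, a request-accompanying kick can be modelled in user space
roughly as follows. This is a minimal sketch with C11 atomics, not KVM
code: the ``sk_`` names are hypothetical, a sequentially consistent fence
stands in for smp_mb(), interrupt disabling is omitted, and the IPI is
only counted rather than sent. It illustrates how the receiver's final
pending check makes a skipped IPI harmless, and how moving the mode to an
exiting state on the first kick coalesces later IPIs.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

enum sk_mode {
	SK_OUTSIDE_GUEST_MODE,
	SK_IN_GUEST_MODE,
	SK_EXITING_GUEST_MODE,
};

struct sk_vcpu {
	atomic_int   mode;
	atomic_ulong requests;
	int          ipis_sent;	/* bookkeeping for this sketch only */
};

/* Receiver side: returns true if guest entry may proceed. */
bool sk_prepare_guest_entry(struct sk_vcpu *vcpu)
{
	/* A real implementation disables interrupts before this store. */
	atomic_store(&vcpu->mode, SK_IN_GUEST_MODE);
	atomic_thread_fence(memory_order_seq_cst);	/* stands in for smp_mb() */

	if (atomic_load(&vcpu->requests) != 0) {
		/* A request raced with entry: abort and go handle it. */
		atomic_store(&vcpu->mode, SK_OUTSIDE_GUEST_MODE);
		return false;
	}
	return true;
}

/* Requester side: returns true if an IPI was (notionally) sent. */
bool sk_request_and_kick(struct sk_vcpu *vcpu, unsigned int bit)
{
	int expected = SK_IN_GUEST_MODE;

	assert(bit < 8 * sizeof(unsigned long));
	atomic_fetch_or(&vcpu->requests, 1ul << bit);
	atomic_thread_fence(memory_order_seq_cst);	/* stands in for smp_mb() */

	/*
	 * Only the kick that moves the mode out of IN_GUEST_MODE sends
	 * an IPI; later kicks find the exiting state and skip it.
	 */
	if (atomic_compare_exchange_strong(&vcpu->mode, &expected,
					   SK_EXITING_GUEST_MODE)) {
		vcpu->ipis_sent++;
		return true;
	}
	return false;
}
```

In this model a request made while the VCPU is outside guest mode sends
no IPI, yet the next entry attempt still aborts because the request bit
is visible to the final check.
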
If, for instance, a request-less kick was made to a VCPU that was
just about to set its mode to IN_GUEST_MODE, meaning no IPI is sent, then
the VCPU thread may continue its entry without actually having done
whatever it was the kick was meant to initiate.

One exception is x86's posted interrupt mechanism. In this case, however,
even the request-less VCPU kick is coupled with the same
local_irq_disable() + smp_mb() pattern described above; the ON bit
(Outstanding Notification) in the posted interrupt descriptor takes the
role of ``vcpu->requests``. When sending a posted interrupt, PIR.ON is
set before reading ``vcpu->mode``; dually, in the VCPU thread,
vmx_sync_pir_to_irr() reads PIR after setting ``vcpu->mode`` to
IN_GUEST_MODE.

Additional Considerations
=========================

Sleeping VCPUs
--------------

VCPU threads may need to consider requests before and/or after calling
functions that may put them to sleep, e.g. kvm_vcpu_block(). Whether they
do or not, and, if they do, which requests need consideration, is
architecture dependent. kvm_vcpu_block() calls kvm_arch_vcpu_runnable()
to check if it should awaken. One reason to do so is to provide
architectures a function where requests may be checked if necessary.

Clearing Requests
-----------------

Generally it only makes sense for the receiving VCPU thread to clear a
request. However, in some circumstances, such as when the requesting
thread and the receiving VCPU thread are executed serially, such as when
they are the same thread, or when they are using some form of concurrency
control to temporarily execute synchronously, then it's possible to know
that the request may be cleared immediately, rather than waiting for the
receiving VCPU thread to handle the request in VCPU RUN. The only current
examples of this are kvm_vcpu_block() calls made by VCPUs to block
themselves.
A possible side-effect of that call is to make the
KVM_REQ_UNHALT request, which may then be cleared immediately when the
VCPU returns from the call.

References
==========

.. [atomic-ops] Documentation/core-api/atomic_ops.rst
.. [memory-barriers] Documentation/memory-barriers.txt
.. [lwn-mb] https://lwn.net/Articles/573436/