.. SPDX-License-Identifier: GPL-2.0

=================
KVM Lock Overview
=================

1. Acquisition Orders
---------------------

The acquisition orders for mutexes are as follows:

- kvm->lock is taken outside vcpu->mutex

- kvm->lock is taken outside kvm->slots_lock and kvm->irq_lock

- kvm->slots_lock is taken outside kvm->irq_lock, though acquiring
  them together is quite rare.

- kvm->mn_active_invalidate_count ensures that pairs of
  invalidate_range_start() and invalidate_range_end() callbacks
  use the same memslots array.  kvm->slots_lock and kvm->slots_arch_lock
  are taken on the waiting side in install_new_memslots, so MMU notifiers
  must not take either kvm->slots_lock or kvm->slots_arch_lock.

For SRCU:

- ``synchronize_srcu(&kvm->srcu)`` is called _inside_
  the kvm->slots_lock critical section, therefore kvm->slots_lock
  cannot be taken inside a kvm->srcu read-side critical section.
  Instead, kvm->slots_arch_lock is released before the call
  to ``synchronize_srcu()`` and _can_ be taken inside a
  kvm->srcu read-side critical section.

- kvm->lock is taken inside kvm->srcu, therefore
  ``synchronize_srcu(&kvm->srcu)`` cannot be called inside
  a kvm->lock critical section.  If you cannot delay the
  call until after kvm->lock is released, use ``call_srcu``.

On x86:

- vcpu->mutex is taken outside kvm->arch.hyperv.hv_lock

- kvm->arch.mmu_lock is an rwlock.  kvm->arch.tdp_mmu_pages_lock and
  kvm->arch.mmu_unsync_pages_lock are taken inside kvm->arch.mmu_lock, and
  cannot be taken without already holding kvm->arch.mmu_lock (typically with
  ``read_lock`` for the TDP MMU, thus the need for additional spinlocks).

Everything else is a leaf: no other lock is taken inside the critical
sections.

2. Exception
------------

Fast page fault:

Fast page fault is the fast path which fixes the guest page fault out of
the mmu-lock on x86.  Currently, the page fault can be fast in one of the
following two cases:

1. Access Tracking: The SPTE is not present, but it is marked for access
   tracking.  That means we need to restore the saved R/X bits.  This is
   described in more detail below.

2. Write-Protection: The SPTE is present and the fault is caused by
   write-protection.  That means we just need to change the W bit of the spte.

What we use to avoid all the races is the Host-writable bit and MMU-writable
bit on the spte:

- Host-writable means the gfn is writable in the host kernel page tables and in
  its KVM memslot.
- MMU-writable means the gfn is writable in the guest's mmu and it is not
  write-protected by shadow page write-protection.

On the fast page fault path, we will use cmpxchg to atomically set the spte W
bit if spte.HOST_WRITEABLE = 1 and spte.WRITE_PROTECT = 1, to restore the saved
R/X bits for an access-tracked spte, or both.  This is safe because any
concurrent change to these bits is detected by the cmpxchg.
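
In C terms, the lockless update boils down to a single compare-and-exchange of
the spte.  The following is a minimal, hypothetical sketch of the
write-protection case only; the real KVM code handles more cases and uses its
own spte helpers, and the function name here is purely illustrative::

  /*
   * Minimal sketch of the lockless W-bit fix-up, assuming the usual KVM
   * MMU definitions (PT_WRITABLE_MASK etc.).  Illustration only, not the
   * actual implementation.
   */
  static bool fast_pf_fix_spte_sketch(struct kvm_vcpu *vcpu, u64 *sptep,
                                      u64 old_spte, gfn_t gfn)
  {
          u64 new_spte = old_spte | PT_WRITABLE_MASK;

          /*
           * If anything modified the spte after old_spte was read (e.g. it
           * was zapped or write-protected again), the cmpxchg fails and the
           * fast path is simply retried or abandoned.
           */
          if (cmpxchg64(sptep, old_spte, new_spte) != old_spte)
                  return false;

          /* The gfn has just been made writable, log it as dirty. */
          mark_page_dirty(vcpu->kvm, gfn);
          return true;
  }

Note that the mark_page_dirty() call in this sketch is exactly what goes wrong
in the ABA scenario below: by the time the cmpxchg succeeds, the pfn behind the
spte may already belong to a different gfn, yet the old gfn is the one that
gets dirty-logged.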

But we need to carefully check these cases:

1) The mapping from gfn to pfn

The mapping from gfn to pfn may be changed since we can only ensure the pfn
is not changed during cmpxchg.  This is an ABA problem; for example, the
following can happen:

+------------------------------------------------------------------------+
| At the beginning::                                                      |
|                                                                         |
|      gpte = gfn1                                                        |
|      gfn1 is mapped to pfn1 on host                                     |
|      spte is the shadow page table entry corresponding with gpte and    |
|      spte = pfn1                                                        |
+------------------------------------------------------------------------+
| On fast page fault path:                                                |
+------------------------------------+-----------------------------------+
| CPU 0:                             | CPU 1:                            |
+------------------------------------+-----------------------------------+
| ::                                 |                                   |
|                                    |                                   |
|   old_spte = *spte;                |                                   |
+------------------------------------+-----------------------------------+
|                                    | pfn1 is swapped out::             |
|                                    |                                   |
|                                    |     spte = 0;                     |
|                                    |                                   |
|                                    | pfn1 is re-allocated for gfn2.    |
|                                    |                                   |
|                                    | gpte is changed to point to       |
|                                    | gfn2 by the guest::               |
|                                    |                                   |
|                                    |     spte = pfn1;                  |
+------------------------------------+-----------------------------------+
| ::                                                                      |
|                                                                         |
|   if (cmpxchg(spte, old_spte, old_spte+W)                               |
|       mark_page_dirty(vcpu->kvm, gfn1)                                  |
|            OOPS!!!                                                      |
+------------------------------------------------------------------------+

We dirty-log for gfn1, which means gfn2 is lost in the dirty bitmap.

For direct sp, we can easily avoid it since the spte of direct sp is fixed
to gfn.  For indirect sp, we disabled fast page fault for simplicity.

A solution for indirect sp could be to pin the gfn, for example via
kvm_vcpu_gfn_to_pfn_atomic, before the cmpxchg.  After the pinning:

- We have held the refcount of the pfn; that means the pfn can not be freed
  and reused for another gfn.
- The pfn is writable and therefore it cannot be shared between different gfns
  by KSM.

Then, we can ensure the dirty bitmap is correctly set for the gfn.

2) Dirty bit tracking

In the original code, the spte can be fast-updated (non-atomically) if the
spte is read-only and the Accessed bit has already been set, since the
Accessed bit and Dirty bit can not be lost.

But it is not true after fast page fault since the spte can be marked
writable between reading the spte and updating the spte, as in the case
below:

+------------------------------------------------------------------------+
| At the beginning::                                                      |
|                                                                         |
|      spte.W = 0                                                         |
|      spte.Accessed = 1                                                  |
+------------------------------------+-----------------------------------+
| CPU 0:                             | CPU 1:                            |
+------------------------------------+-----------------------------------+
| In mmu_spte_clear_track_bits()::   |                                   |
|                                    |                                   |
|  old_spte = *spte;                 |                                   |
|                                    |                                   |
|                                    |                                   |
|  /* 'if' condition is satisfied. */|                                   |
|  if (old_spte.Accessed == 1 &&     |                                   |
|       old_spte.W == 0)             |                                   |
|     spte = 0ull;                   |                                   |
+------------------------------------+-----------------------------------+
|                                    | on fast page fault path::         |
|                                    |                                   |
|                                    |    spte.W = 1                     |
|                                    |                                   |
|                                    | memory write on the spte::        |
|                                    |                                   |
|                                    |    spte.Dirty = 1                 |
+------------------------------------+-----------------------------------+
| ::                                 |                                   |
|                                    |                                   |
|  else                              |                                   |
|    old_spte = xchg(spte, 0ull)     |                                   |
|  if (old_spte.Accessed == 1)       |                                   |
|     kvm_set_pfn_accessed(spte.pfn);|                                   |
|  if (old_spte.Dirty == 1)          |                                   |
|     kvm_set_pfn_dirty(spte.pfn);   |                                   |
|     OOPS!!!                        |                                   |
+------------------------------------+-----------------------------------+

The Dirty bit is lost in this case.

In order to avoid this kind of issue, we always treat the spte as "volatile"
if it can be updated out of mmu-lock; see spte_has_volatile_bits().  This
means the spte is always atomically updated in this case.
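
A minimal sketch of what "treating the spte as volatile" means in practice,
loosely modeled on mmu_spte_clear_track_bits() (simplified and hypothetical;
the real code uses KVM's spte accessors and handles more state)::

  /*
   * Clear an spte whose W/Accessed/Dirty bits may be set concurrently by
   * hardware or by the fast page fault path.  Sketch only.
   */
  static u64 spte_clear_sketch(u64 *sptep)
  {
          u64 old_spte = *sptep;

          if (!spte_has_volatile_bits(old_spte))
                  /* No bit can change under us, a plain write is enough. */
                  WRITE_ONCE(*sptep, 0ull);
          else
                  /*
                   * The W/Accessed/Dirty bits may be set after old_spte was
                   * read, so clear the spte atomically and use the returned
                   * value; otherwise those bits would be lost.
                   */
                  old_spte = xchg(sptep, 0ull);

          if (old_spte & shadow_accessed_mask)
                  kvm_set_pfn_accessed(spte_to_pfn(old_spte));
          if (old_spte & shadow_dirty_mask)
                  kvm_set_pfn_dirty(spte_to_pfn(old_spte));

          return old_spte;
  }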

3) Flushing TLBs due to spte updates

If the spte is updated from writable to read-only, we should flush all TLBs,
otherwise rmap_write_protect will find a read-only spte, even though the
writable spte might be cached on a CPU's TLB.

As mentioned before, the spte can be updated to writable out of mmu-lock on
the fast page fault path.  In order to easily audit the path, we check in
mmu_spte_update() whether TLBs need to be flushed for this reason, since this
is a common function to update the spte (present -> present).

Since the spte is "volatile" if it can be updated out of mmu-lock, we always
atomically update the spte and the race caused by fast page fault can be
avoided.  See the comments in spte_has_volatile_bits() and mmu_spte_update().

Lockless Access Tracking:

This is used for Intel CPUs that are using EPT but do not support the EPT A/D
bits.  In this case, PTEs are tagged as A/D disabled (using ignored bits), and
when the KVM MMU notifier is called to track accesses to a page (via
kvm_mmu_notifier_clear_flush_young), it marks the PTE not-present in hardware
by clearing the RWX bits in the PTE and storing the original R & X bits in more
unused/ignored bits.  When the VM tries to access the page later on, a fault is
generated and the fast page fault mechanism described above is used to
atomically restore the PTE to a Present state.  The W bit is not saved when the
PTE is marked for access tracking; during restoration to the Present state, the
W bit is set depending on whether or not it was a write access.  If it wasn't,
then the W bit will remain clear until a write access happens, at which time it
will be set using the Dirty tracking mechanism described above.

3. Reference
------------

``kvm_lock``
^^^^^^^^^^^^

:Type:     mutex
:Arch:     any
:Protects: - vm_list

``kvm_count_lock``
^^^^^^^^^^^^^^^^^^

:Type:     raw_spinlock_t
:Arch:     any
:Protects: - hardware virtualization enable/disable
:Comment:  'raw' because hardware enabling/disabling must be atomic /wrt
           migration.

``kvm->mn_invalidate_lock``
^^^^^^^^^^^^^^^^^^^^^^^^^^^

:Type:     spinlock_t
:Arch:     any
:Protects: mn_active_invalidate_count, mn_memslots_update_rcuwait

``kvm_arch::tsc_write_lock``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

:Type:     raw_spinlock_t
:Arch:     x86
:Protects: - kvm_arch::{last_tsc_write,last_tsc_nsec,last_tsc_offset}
           - tsc offset in vmcb
:Comment:  'raw' because updating the tsc offsets must not be preempted.

``kvm->mmu_lock``
^^^^^^^^^^^^^^^^^
:Type:     spinlock_t or rwlock_t
:Arch:     any
:Protects: - shadow page/shadow tlb entry
:Comment:  it is a spinlock since it is used in mmu notifier.

``kvm->srcu``
^^^^^^^^^^^^^
:Type:     srcu lock
:Arch:     any
:Protects: - kvm->memslots
           - kvm->buses
:Comment:  The srcu read lock must be held while accessing memslots (e.g.
           when using gfn_to_* functions) and while accessing in-kernel
           MMIO/PIO address->device structure mapping (kvm->buses).
           The srcu index can be stored in kvm_vcpu->srcu_idx per vcpu
           if it is needed by multiple functions.
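
For instance, a (hypothetical) helper that translates a gfn outside of vcpu
context would bracket the memslot access like this::

  /* Hypothetical example; only meant to illustrate the rule above. */
  static kvm_pfn_t example_gfn_to_pfn(struct kvm *kvm, gfn_t gfn)
  {
          kvm_pfn_t pfn;
          int idx;

          idx = srcu_read_lock(&kvm->srcu);
          /*
           * gfn_to_pfn() dereferences kvm->memslots, so the kvm->srcu
           * read lock must be held across the call.
           */
          pfn = gfn_to_pfn(kvm, gfn);
          srcu_read_unlock(&kvm->srcu, idx);

          return pfn;
  }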

``kvm->slots_arch_lock``
^^^^^^^^^^^^^^^^^^^^^^^^
:Type:     mutex
:Arch:     any (only needed on x86 though)
:Protects: any arch-specific fields of memslots that have to be modified
           in a ``kvm->srcu`` read-side critical section.
:Comment:  must be held before reading the pointer to the current memslots,
           until after all changes to the memslots are complete

``wakeup_vcpus_on_cpu_lock``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
:Type:     spinlock_t
:Arch:     x86
:Protects: wakeup_vcpus_on_cpu
:Comment:  This is a per-CPU lock and it is used for VT-d posted-interrupts.
           When VT-d posted-interrupts are supported and the VM has assigned
           devices, we put the blocked vCPU on the per-CPU list
           wakeup_vcpus_on_cpu, protected by this lock.  When VT-d hardware
           issues a wakeup notification event because an external interrupt
           from an assigned device has arrived, we find the vCPU on that
           list and wake it up.
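
The blocking side of that scheme looks roughly as follows (an illustrative
sketch; the list, lock and function names mirror the description above but are
not the exact VMX code, and both per-CPU variables are assumed to be
initialised elsewhere with INIT_LIST_HEAD()/spin_lock_init())::

  static DEFINE_PER_CPU(struct list_head, wakeup_vcpus_on_cpu);
  static DEFINE_PER_CPU(spinlock_t, wakeup_vcpus_on_cpu_lock);

  /* Queue a blocked vCPU's wakeup node on the list of the CPU it ran on. */
  static void example_pi_block_vcpu(struct list_head *wakeup_node, int cpu)
  {
          spinlock_t *lock = &per_cpu(wakeup_vcpus_on_cpu_lock, cpu);
          unsigned long flags;

          /*
           * The wakeup notification is handled from interrupt context and
           * walks this list, so it must be updated with interrupts disabled.
           */
          spin_lock_irqsave(lock, flags);
          list_add_tail(wakeup_node, &per_cpu(wakeup_vcpus_on_cpu, cpu));
          spin_unlock_irqrestore(lock, flags);
  }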