.. SPDX-License-Identifier: GPL-2.0

=================
KVM Lock Overview
=================

1. Acquisition Orders
---------------------

The acquisition orders for mutexes are as follows:

- kvm->lock is taken outside vcpu->mutex

- kvm->lock is taken outside kvm->slots_lock and kvm->irq_lock

- kvm->slots_lock is taken outside kvm->irq_lock, though acquiring
  them together is quite rare.

- Unlike kvm->slots_lock, kvm->slots_arch_lock is released before
  synchronize_srcu(&kvm->srcu).  Therefore kvm->slots_arch_lock
  can be taken inside a kvm->srcu read-side critical section,
  while kvm->slots_lock cannot.

On x86:

- vcpu->mutex is taken outside kvm->arch.hyperv.hv_lock

- kvm->arch.mmu_lock is an rwlock.  kvm->arch.tdp_mmu_pages_lock is
  taken inside kvm->arch.mmu_lock, and cannot be taken without already
  holding kvm->arch.mmu_lock (typically with ``read_lock``, otherwise
  there's no need to take kvm->arch.tdp_mmu_pages_lock at all).

Everything else is a leaf: no other lock is taken inside the critical
sections.
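
For example, a path that needs both kvm->lock and a vcpu->mutex must take
them in the order above and release them in reverse.  A minimal sketch
follows; the helper itself is hypothetical, only the ordering is the point::

    static void example_lock_vm_and_vcpu(struct kvm *kvm,
                                         struct kvm_vcpu *vcpu)
    {
            mutex_lock(&kvm->lock);         /* outer lock first */
            mutex_lock(&vcpu->mutex);       /* then the inner lock */

            /* ... work that needs both the VM and the vCPU ... */

            mutex_unlock(&vcpu->mutex);     /* release in reverse order */
            mutex_unlock(&kvm->lock);
    }

Taking the same pair in the opposite order on another path would allow an
AB-BA deadlock, which is what these ordering rules exist to prevent.
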
2. Exception
------------

Fast page fault:

Fast page fault is the fast path which fixes the guest page fault out of
the mmu-lock on x86.  Currently, the page fault can be fast in one of the
following two cases:

1. Access Tracking: The SPTE is not present, but it is marked for access
   tracking.  That means we need to restore the saved R/X bits.  This is
   described in more detail below.

2. Write-Protection: The SPTE is present and the fault is caused by
   write-protect.  That means we just need to change the W bit of the spte.

What we use to avoid all these races is the Host-writable bit and the
MMU-writable bit on the spte:

- Host-writable means the gfn is writable in the host kernel page tables and
  in its KVM memslot.
- MMU-writable means the gfn is writable in the guest's mmu and it is not
  write-protected by shadow page write-protection.

On the fast page fault path, we will use cmpxchg to atomically set the spte W
bit if spte.HOST_WRITEABLE = 1 and spte.WRITE_PROTECT = 1, to restore the
saved R/X bits for an access-tracked spte, or both.  This is safe because any
concurrent change to these bits makes the cmpxchg fail.

But we need to carefully check these cases:

1) The mapping from gfn to pfn

The mapping from gfn to pfn may change, since we can only ensure the pfn
is not changed during the cmpxchg.  This is an ABA problem; for example,
the following case will happen:

+------------------------------------------------------------------------+
| At the beginning::                                                      |
|                                                                         |
|   gpte = gfn1                                                           |
|   gfn1 is mapped to pfn1 on host                                        |
|   spte is the shadow page table entry corresponding with gpte and      |
|   spte = pfn1                                                           |
+------------------------------------------------------------------------+
| On fast page fault path:                                                |
+------------------------------------+-----------------------------------+
| CPU 0:                             | CPU 1:                            |
+------------------------------------+-----------------------------------+
| ::                                 |                                   |
|                                    |                                   |
|   old_spte = *spte;                |                                   |
+------------------------------------+-----------------------------------+
|                                    | pfn1 is swapped out::             |
|                                    |                                   |
|                                    |   spte = 0;                       |
|                                    |                                   |
|                                    | pfn1 is re-allocated for gfn2.    |
|                                    |                                   |
|                                    | gpte is changed to point to       |
|                                    | gfn2 by the guest::               |
|                                    |                                   |
|                                    |   spte = pfn1;                    |
+------------------------------------+-----------------------------------+
| ::                                                                      |
|                                                                         |
|   if (cmpxchg(spte, old_spte, old_spte+W))                              |
|       mark_page_dirty(vcpu->kvm, gfn1)                                  |
|       OOPS!!!                                                           |
+------------------------------------------------------------------------+

We dirty-log for gfn1; that means gfn2 is lost in the dirty bitmap.

For direct sp, we can easily avoid it since the spte of direct sp is fixed
to gfn.  For indirect sp, we disabled fast page fault for simplicity.

A solution for indirect sp could be to pin the gfn, for example via
kvm_vcpu_gfn_to_pfn_atomic, before the cmpxchg.  After the pinning:

- We have held the refcount of the pfn; that means the pfn can not be freed
  and reused for another gfn.
- The pfn is writable and therefore it cannot be shared between different
  gfns by KSM.

Then we can ensure the dirty bitmap is correctly set for a gfn.

2) Dirty bit tracking

In the original code, the spte can be fast updated (non-atomically) if the
spte is read-only and the Accessed bit has already been set, since neither
the Accessed bit nor the Dirty bit can be lost in that case.

But this no longer holds with fast page fault, since the spte can be marked
writable between reading the spte and updating it, as in the case below:

+------------------------------------------------------------------------+
| At the beginning::                                                      |
|                                                                         |
|   spte.W = 0                                                            |
|   spte.Accessed = 1                                                     |
+------------------------------------+-----------------------------------+
| CPU 0:                             | CPU 1:                            |
+------------------------------------+-----------------------------------+
| In mmu_spte_clear_track_bits()::   |                                   |
|                                    |                                   |
|  old_spte = *spte;                 |                                   |
|                                    |                                   |
|  /* 'if' condition is satisfied. */|                                   |
|  if (old_spte.Accessed == 1 &&     |                                   |
|      old_spte.W == 0)              |                                   |
|     spte = 0ull;                   |                                   |
+------------------------------------+-----------------------------------+
|                                    | on fast page fault path::         |
|                                    |                                   |
|                                    |    spte.W = 1                     |
|                                    |                                   |
|                                    | memory write on the spte::        |
|                                    |                                   |
|                                    |    spte.Dirty = 1                 |
+------------------------------------+-----------------------------------+
| ::                                 |                                   |
|                                    |                                   |
|  else                              |                                   |
|     old_spte = xchg(spte, 0ull)    |                                   |
|  if (old_spte.Accessed == 1)       |                                   |
|     kvm_set_pfn_accessed(spte.pfn);|                                   |
|  if (old_spte.Dirty == 1)          |                                   |
|     kvm_set_pfn_dirty(spte.pfn);   |                                   |
|     OOPS!!!                        |                                   |
+------------------------------------+-----------------------------------+

The Dirty bit is lost in this case.

In order to avoid this kind of issue, we always treat the spte as "volatile"
if it can be updated out of mmu-lock; see spte_has_volatile_bits().  It
means the spte is always atomically updated in this case.

3) TLB flushes due to spte updates

If the spte is updated from writable to read-only, we should flush all TLBs,
otherwise rmap_write_protect will find a read-only spte, even though the
writable spte might be cached on a CPU's TLB.

As mentioned before, the spte can be updated to writable out of mmu-lock on
the fast page fault path.  In order to easily audit the path, we check in
mmu_spte_update() whether TLBs need to be flushed for this reason, since this
is the common function to update the spte (present -> present).

Since the spte is "volatile" if it can be updated out of mmu-lock, we always
atomically update the spte, and the race caused by fast page fault can be
avoided.  See the comments in spte_has_volatile_bits() and mmu_spte_update().

Lockless Access Tracking:

This is used for Intel CPUs that are using EPT but do not support the EPT A/D
bits.  In this case, PTEs are tagged as A/D disabled (using ignored bits), and
when the KVM MMU notifier is called to track accesses to a page (via
kvm_mmu_notifier_clear_flush_young), it marks the PTE not-present in hardware
by clearing the RWX bits in the PTE and storing the original R & X bits in
more unused/ignored bits.  When the VM tries to access the page later on, a
fault is generated and the fast page fault mechanism described above is used
to atomically restore the PTE to a Present state.  The W bit is not saved when
the PTE is marked for access tracking, and during restoration to the Present
state the W bit is set depending on whether or not the faulting access was a
write.  If it wasn't, then the W bit will remain clear until a write access
happens, at which time it will be set using the Dirty tracking mechanism
described above.
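
To make the lockless update concrete, here is a stripped-down sketch of the
W-bit fix-up described above.  This is an illustration only, not the upstream
implementation; the helper name is made up, while cmpxchg64(),
PT_WRITABLE_MASK and mark_page_dirty() are existing kernel primitives::

    /*
     * Illustrative sketch: make a write-protected spte writable on the
     * fast page fault path, without holding mmu-lock.
     */
    static bool fast_fix_spte(struct kvm_vcpu *vcpu, u64 *sptep,
                              u64 old_spte, gfn_t gfn)
    {
            u64 new_spte = old_spte | PT_WRITABLE_MASK; /* set the W bit */

            /*
             * If any bit of the spte changed between reading old_spte and
             * this point, the cmpxchg fails and the whole fault is
             * retried, so no concurrent update can be lost.
             */
            if (cmpxchg64(sptep, old_spte, new_spte) != old_spte)
                    return false;

            /* Safe: the spte we made writable still maps this gfn. */
            mark_page_dirty(vcpu->kvm, gfn);
            return true;
    }

When the cmpxchg fails, the caller re-reads the spte and retries the fast
path (or falls back to taking mmu-lock).  Note that, as case 1) above
explains, the cmpxchg alone cannot detect a gfn->pfn change, which is why the
fast path is restricted to direct sps.
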
3. Reference
------------

:Name:     kvm_lock
:Type:     mutex
:Arch:     any
:Protects: - vm_list

:Name:     kvm_count_lock
:Type:     raw_spinlock_t
:Arch:     any
:Protects: - hardware virtualization enable/disable
:Comment:  'raw' because hardware enabling/disabling must be atomic with
           respect to migration.

:Name:     kvm_arch::tsc_write_lock
:Type:     raw_spinlock_t
:Arch:     x86
:Protects: - kvm_arch::{last_tsc_write,last_tsc_nsec,last_tsc_offset}
           - tsc offset in vmcb
:Comment:  'raw' because updating the tsc offsets must not be preempted.

:Name:     kvm->mmu_lock
:Type:     spinlock_t or rwlock_t
:Arch:     any
:Protects: - shadow page/shadow tlb entry
:Comment:  it is a spinlock since it is used in the mmu notifier.

:Name:     kvm->srcu
:Type:     srcu lock
:Arch:     any
:Protects: - kvm->memslots
           - kvm->buses
:Comment:  The srcu read lock must be held while accessing memslots (e.g.
           when using gfn_to_* functions) and while accessing the in-kernel
           MMIO/PIO address->device structure mapping (kvm->buses).
           The srcu index can be stored in kvm_vcpu->srcu_idx per vcpu
           if it is needed by multiple functions.

:Name:     blocked_vcpu_on_cpu_lock
:Type:     spinlock_t
:Arch:     x86
:Protects: - blocked_vcpu_on_cpu
:Comment:  This is a per-CPU lock and it is used for VT-d posted-interrupts.
           When VT-d posted-interrupts are supported and the VM has assigned
           devices, we put the blocked vCPU on the list blocked_vcpu_on_cpu
           protected by blocked_vcpu_on_cpu_lock.  When VT-d hardware issues
           a wakeup notification event because an external interrupt from an
           assigned device arrived, we find the vCPU on the list and wake it
           up.
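
As an illustration of the kvm->srcu entry above, any code that resolves a
gfn through the memslots brackets the access with the srcu read lock.  A
minimal sketch, assuming a struct kvm \*kvm and a gfn are in scope, and
where use_slot() merely stands in for whatever the caller does with the
result::

    int idx;
    struct kvm_memory_slot *slot;

    idx = srcu_read_lock(&kvm->srcu);

    /* kvm->memslots must not be dereferenced outside this section. */
    slot = gfn_to_memslot(kvm, gfn);
    if (slot)
            use_slot(slot);

    srcu_read_unlock(&kvm->srcu, idx);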