.. SPDX-License-Identifier: GPL-2.0

======================
The x86 kvm shadow mmu
======================

The mmu (in arch/x86/kvm, files mmu.[ch] and paging_tmpl.h) is responsible
for presenting a standard x86 mmu to the guest, while translating guest
physical addresses to host physical addresses.

The mmu code attempts to satisfy the following requirements:

- correctness:
  the guest should not be able to determine that it is running
  on an emulated mmu except for timing (we attempt to comply
  with the specification, not emulate the characteristics of
  a particular implementation such as tlb size)
- security:
  the guest must not be able to touch host memory not assigned
  to it
- performance:
  minimize the performance penalty imposed by the mmu
- scaling:
  need to scale to large memory and large vcpu guests
- hardware:
  support the full range of x86 virtualization hardware
- integration:
  Linux memory management code must be in control of guest memory
  so that swapping, page migration, page merging, transparent
  hugepages, and similar features work without change
- dirty tracking:
  report writes to guest memory to enable live migration
  and framebuffer-based displays
- footprint:
  keep the amount of pinned kernel memory low (most memory
  should be shrinkable)
- reliability:
  avoid multipage or GFP_ATOMIC allocations

Acronyms
========

====  ====================================================================
pfn   host page frame number
hpa   host physical address
hva   host virtual address
gfn   guest frame number
gpa   guest physical address
gva   guest virtual address
ngpa  nested guest physical address
ngva  nested guest virtual address
pte   page table entry (used also to refer generically to paging
      structure entries)
gpte  guest pte (referring to gfns)
spte  shadow pte (referring to pfns)
tdp   two dimensional paging (vendor neutral term for NPT and EPT)
====  ====================================================================

Virtual and real hardware supported
===================================

The mmu supports first-generation mmu hardware, which allows an atomic switch
of the current paging mode and cr3 during guest entry, as well as
two-dimensional paging (AMD's NPT and Intel's EPT).  The emulated hardware
it exposes is the traditional 2/3/4 level x86 mmu, with support for global
pages, pae, pse, pse36, cr0.wp, and 1GB pages.
The emulated hardware is also
able to expose NPT capable hardware on NPT capable hosts.

Translation
===========

The primary job of the mmu is to program the processor's mmu to translate
addresses for the guest.  Different translations are required at different
times:

- when guest paging is disabled, we translate guest physical addresses to
  host physical addresses (gpa->hpa)
- when guest paging is enabled, we translate guest virtual addresses, to
  guest physical addresses, to host physical addresses (gva->gpa->hpa)
- when the guest launches a guest of its own, we translate nested guest
  virtual addresses, to nested guest physical addresses, to guest physical
  addresses, to host physical addresses (ngva->ngpa->gpa->hpa)

The primary challenge is to encode between 1 and 3 translations into hardware
that supports only 1 (traditional) and 2 (tdp) translations.  When the
number of required translations matches the hardware, the mmu operates in
direct mode; otherwise it operates in shadow mode (see below).

Memory
======

Guest memory (gpa) is part of the user address space of the process that is
using kvm.  Userspace defines the translation between guest addresses and user
addresses (gpa->hva); note that two gpas may alias to the same hva, but not
vice versa.
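
The gpa->hva step can be pictured with a toy memslot lookup.  This is an
illustrative sketch only: the names (``struct memslot``, ``gpa_to_hva``) are
hypothetical simplifications of KVM's actual ``struct kvm_memory_slot``
machinery.  Two slots deliberately share a userspace address to show how two
gpas may alias to the same hva.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12

/* Hypothetical, simplified memslot; KVM's real struct kvm_memory_slot
 * carries much more state (flags, dirty bitmap, arch data). */
struct memslot {
	uint64_t base_gfn;       /* first guest frame number covered */
	uint64_t npages;         /* number of 4k pages in the slot */
	uint64_t userspace_addr; /* hva backing the first page */
};

/* Translate a gpa to an hva by scanning the slot array. */
static uint64_t gpa_to_hva(const struct memslot *slots, size_t n, uint64_t gpa)
{
	uint64_t gfn = gpa >> PAGE_SHIFT;

	for (size_t i = 0; i < n; i++) {
		const struct memslot *s = &slots[i];

		if (gfn >= s->base_gfn && gfn < s->base_gfn + s->npages)
			return s->userspace_addr +
			       ((gfn - s->base_gfn) << PAGE_SHIFT) +
			       (gpa & ((1u << PAGE_SHIFT) - 1));
	}
	return 0; /* no slot: the gpa is not backed (e.g. mmio) */
}
```

Note that the aliasing direction matters: several slots may point at the same
userspace address, but a single gpa falls in at most one slot, so the
gpa->hva direction stays a function.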

These hvas may be backed using any method available to the host: anonymous
memory, file backed memory, and device memory.  Memory might be paged by the
host at any time.

Events
======

The mmu is driven by events, some from the guest, some from the host.

Guest generated events:

- writes to control registers (especially cr3)
- invlpg/invlpga instruction execution
- access to missing or protected translations

Host generated events:

- changes in the gpa->hpa translation (either through gpa->hva changes or
  through hva->hpa changes)
- memory pressure (the shrinker)

Shadow pages
============

The principal data structure is the shadow page, 'struct kvm_mmu_page'.  A
shadow page contains 512 sptes, which can be either leaf or nonleaf sptes.  A
shadow page may contain a mix of leaf and nonleaf sptes.

A nonleaf spte allows the hardware mmu to reach the leaf pages and
is not related to a translation directly.  It points to other shadow pages.

A leaf spte corresponds to either one or two translations encoded into
one paging structure entry.
These are always the lowest level of the
translation stack, with optional higher level translations left to NPT/EPT.
Leaf ptes point at guest pages.

The following table shows translations encoded by leaf ptes, with higher-level
translations in parentheses:

 Non-nested guests::

  nonpaging:     gpa->hpa
  paging:        gva->gpa->hpa
  paging, tdp:   (gva->)gpa->hpa

 Nested guests::

  non-tdp:       ngva->gpa->hpa  (*)
  tdp:           (ngva->)ngpa->gpa->hpa

 (*) the guest hypervisor will encode the ngva->gpa translation into its page
     tables if npt is not present

Shadow pages contain the following information:
  role.level:
    The level in the shadow paging hierarchy that this shadow page belongs to.
    1=4k sptes, 2=2M sptes, 3=1G sptes, etc.
  role.direct:
    If set, leaf sptes reachable from this page are for a linear range.
    Examples include real mode translation, large guest pages backed by small
    host pages, and gpa->hpa translations when NPT or EPT is active.
    The linear range starts at (gfn << PAGE_SHIFT) and its size is determined
    by role.level (2MB for first level, 1GB for second level, 0.5TB for third
    level, 256TB for fourth level).
    If clear, this page corresponds to a guest page table denoted by the gfn
    field.
  role.quadrant:
    When role.has_4_byte_gpte=1, the guest uses 32-bit gptes while the host
    uses 64-bit sptes.  That means a guest page table contains more ptes than
    the host, so multiple shadow pages are needed to shadow one guest page.
    For first-level shadow pages, role.quadrant can be 0 or 1 and denotes the
    first or second 512-gpte block in the guest page table.  For second-level
    page tables, each 32-bit gpte is converted to two 64-bit sptes
    (since each first-level guest page is shadowed by two first-level
    shadow pages) so role.quadrant takes values in the range 0..3.  Each
    quadrant maps 1GB virtual address space.
  role.access:
    Inherited guest access permissions from the parent ptes in the form uwx.
    Note execute permission is positive, not negative.
  role.invalid:
    The page is invalid and should not be used.  It is a root page that is
    currently pinned (by a cpu hardware register pointing to it); once it is
    unpinned it will be destroyed.
  role.has_4_byte_gpte:
    Reflects the size of the guest PTE for which the page is valid, i.e. '0'
    if direct map or 64-bit gptes are in use, '1' if 32-bit gptes are in use.
  role.efer_nx:
    Contains the value of efer.nx for which the page is valid.
  role.cr0_wp:
    Contains the value of cr0.wp for which the page is valid.
  role.smep_andnot_wp:
    Contains the value of cr4.smep && !cr0.wp for which the page is valid
    (pages for which this is true are different from other pages; see the
    treatment of cr0.wp=0 below).
  role.smap_andnot_wp:
    Contains the value of cr4.smap && !cr0.wp for which the page is valid
    (pages for which this is true are different from other pages; see the
    treatment of cr0.wp=0 below).
  role.smm:
    Is 1 if the page is valid in system management mode.  This field
    determines which of the kvm_memslots array was used to build this
    shadow page; it is also used to go back from a struct kvm_mmu_page
    to a memslot, through the kvm_memslots_for_spte_role macro and
    __gfn_to_memslot.
  role.ad_disabled:
    Is 1 if the MMU instance cannot use A/D bits.  EPT did not have A/D
    bits before Haswell; shadow EPT page tables also cannot use A/D bits
    if the L1 hypervisor does not enable them.
  role.passthrough:
    The page is not backed by a guest page table, but its first entry
    points to one.  This is set if NPT uses 5-level page tables (host
    CR4.LA57=1) and is shadowing L1's 4-level NPT (L1 CR4.LA57=0).
  gfn:
    Either the guest page table containing the translations shadowed by this
    page, or the base page frame for linear translations.  See role.direct.
  spt:
    A pageful of 64-bit sptes containing the translations for this page.
    Accessed by both kvm and hardware.
    The page pointed to by spt will have its page->private pointing back
    at the shadow page structure.
    sptes in spt point either at guest pages, or at lower-level shadow pages.
    Specifically, if sp1 and sp2 are shadow pages, then sp1->spt[n] may point
    at __pa(sp2->spt).  sp2 will point back at sp1 through parent_pte.
    The spt array forms a DAG structure with the shadow page as a node, and
    guest pages as leaves.
  gfns:
    An array of 512 guest frame numbers, one for each present pte.  Used to
    perform a reverse map from a pte to a gfn.  When role.direct is set, any
    element of this array can be calculated from the gfn field when used; in
    this case, the array of gfns is not allocated.  See role.direct and gfn.
  root_count:
    A counter keeping track of how many hardware registers (guest cr3 or
    pdptrs) are now pointing at the page.  While this counter is nonzero, the
    page cannot be destroyed.  See role.invalid.
  parent_ptes:
    The reverse mapping for the pte/ptes pointing at this page's spt.
If 233daec8d40SPaolo Bonzini parent_ptes bit 0 is zero, only one spte points at this page and 234daec8d40SPaolo Bonzini parent_ptes points at this single spte, otherwise, there exists multiple 235daec8d40SPaolo Bonzini sptes pointing at this page and (parent_ptes & ~0x1) points at a data 236daec8d40SPaolo Bonzini structure with a list of parent sptes. 237daec8d40SPaolo Bonzini unsync: 238daec8d40SPaolo Bonzini If true, then the translations in this page may not match the guest's 239daec8d40SPaolo Bonzini translation. This is equivalent to the state of the tlb when a pte is 240daec8d40SPaolo Bonzini changed but before the tlb entry is flushed. Accordingly, unsync ptes 241daec8d40SPaolo Bonzini are synchronized when the guest executes invlpg or flushes its tlb by 242daec8d40SPaolo Bonzini other means. Valid for leaf pages. 243daec8d40SPaolo Bonzini unsync_children: 244daec8d40SPaolo Bonzini How many sptes in the page point at pages that are unsync (or have 245daec8d40SPaolo Bonzini unsynchronized children). 246daec8d40SPaolo Bonzini unsync_child_bitmap: 247daec8d40SPaolo Bonzini A bitmap indicating which sptes in spt point (directly or indirectly) at 248*d56b699dSBjorn Helgaas pages that may be unsynchronized. Used to quickly locate all unsynchronized 249daec8d40SPaolo Bonzini pages reachable from a given page. 250daec8d40SPaolo Bonzini clear_spte_count: 251daec8d40SPaolo Bonzini Only present on 32-bit hosts, where a 64-bit spte cannot be written 252daec8d40SPaolo Bonzini atomically. The reader uses this while running out of the MMU lock 253daec8d40SPaolo Bonzini to detect in-progress updates and retry them until the writer has 254daec8d40SPaolo Bonzini finished the write. 255daec8d40SPaolo Bonzini write_flooding_count: 256daec8d40SPaolo Bonzini A guest may write to a page table many times, causing a lot of 257daec8d40SPaolo Bonzini emulations if the page needs to be write-protected (see "Synchronized 258daec8d40SPaolo Bonzini and unsynchronized pages" below). 
    Leaf pages can be unsynchronized
    so that they do not trigger frequent emulation, but this is not
    possible for non-leafs.  This field counts the number of emulations
    since the last time the page table was actually used; if emulation
    is triggered too frequently on this page, KVM will unmap the page
    to avoid emulation in the future.

Reverse map
===========

The mmu maintains a reverse mapping whereby all ptes mapping a page can be
reached given its gfn.  This is used, for example, when swapping out a page.

Synchronized and unsynchronized pages
=====================================

The guest uses two events to synchronize its tlb and page tables: tlb flushes
and page invalidations (invlpg).

A tlb flush means that we need to synchronize all sptes reachable from the
guest's cr3.  This is expensive, so we keep all guest page tables write
protected, and synchronize sptes to gptes when a gpte is written.

A special case is when a guest page table is reachable from the current
guest cr3.  In this case, the guest is obliged to issue an invlpg instruction
before using the translation.  We take advantage of that by removing write
protection from the guest page, and allowing the guest to modify it freely.
We synchronize modified gptes when the guest invokes invlpg.
This reduces
the amount of emulation we have to do when the guest modifies multiple gptes,
or when a guest page is no longer used as a page table and is used for
random guest data.

As a side effect we have to resynchronize all reachable unsynchronized shadow
pages on a tlb flush.


Reaction to events
==================

- guest page fault (or npt page fault, or ept violation)

This is the most complicated event.  The cause of a page fault can be:

- a true guest fault (the guest translation won't allow the access) (*)
- access to a missing translation
- access to a protected translation

  - when logging dirty pages, memory is write protected
  - synchronized shadow pages are write protected (*)

- access to untranslatable memory (mmio)

(*) not applicable in direct mode

Handling a page fault is performed as follows:

 - if the RSV bit of the error code is set, the page fault is caused by guest
   accessing MMIO and cached MMIO information is available.

   - walk shadow page table
   - check for valid generation number in the spte (see "Fast invalidation of
     MMIO sptes" below)
   - cache the information to vcpu->arch.mmio_gva, vcpu->arch.mmio_access and
     vcpu->arch.mmio_gfn, and call the emulator

 - If both P bit and R/W bit of error code are set, this could possibly
   be handled as a "fast page fault" (fixed without taking the MMU lock).  See
   the description in Documentation/virt/kvm/locking.rst.

 - if needed, walk the guest page tables to determine the guest translation
   (gva->gpa or ngpa->gpa)

   - if permissions are insufficient, reflect the fault back to the guest

 - determine the host page

   - if this is an mmio request, there is no host page; cache the info to
     vcpu->arch.mmio_gva, vcpu->arch.mmio_access and vcpu->arch.mmio_gfn

 - walk the shadow page table to find the spte for the translation,
   instantiating missing intermediate page tables as necessary

   - If this is an mmio request, cache the mmio info to the spte and set some
     reserved bit on the spte (see callers of kvm_mmu_set_mmio_spte_mask)

 - try to unsynchronize the page

   - if successful, we can let the guest continue and modify the gpte

 - emulate the instruction

   - if failed, unshadow the page and let the guest continue

 - update any translations that were modified by the instruction

invlpg handling:

  - walk the shadow page hierarchy and drop affected translations
  - try to reinstantiate the indicated translation in the hope that the
    guest will use it in the near future

Guest control register updates:

- mov to cr3

  - look up new shadow roots
  - synchronize newly reachable shadow pages

- mov to cr0/cr4/efer

  - set up mmu context for new paging mode
  - look up new shadow roots
  - synchronize newly reachable shadow pages

Host translation updates:

  - mmu notifier called with updated hva
  - look up affected sptes through reverse map
  - drop (or update) translations

Emulating cr0.wp
================

If tdp is not enabled, the host must keep cr0.wp=1 so page write protection
works for the guest kernel, not guest userspace.  When the guest
cr0.wp=1, this does not present a problem.
However when the guest
cr0.wp=0, we cannot map the permissions for gpte.u=1, gpte.w=0 to any spte
(the semantics require allowing any guest kernel access plus user read
access).

We handle this by mapping the permissions to two possible sptes, depending
on fault type:

- kernel write fault: spte.u=0, spte.w=1 (allows full kernel access,
  disallows user access)
- read fault: spte.u=1, spte.w=0 (allows full read access, disallows kernel
  write access)

(user write faults generate a #PF)

In the first case there are two additional complications:

- if CR4.SMEP is enabled: since we've turned the page into a kernel page,
  the kernel may now execute it.  We handle this by also setting spte.nx.
  If we get a user fetch or read fault, we'll change spte.u=1 and
  spte.nx=gpte.nx back.  For this to work, KVM forces EFER.NX to 1 when
  shadow paging is in use.
- if CR4.SMAP is disabled: since the page has been changed to a kernel
  page, it can not be reused when CR4.SMAP is enabled.  We set
  CR4.SMAP && !CR0.WP into shadow page's role to avoid this case.  Note,
  here we do not care about the case that CR4.SMAP is enabled since KVM
  will directly inject #PF to the guest due to the failed permission check.
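
The fault-type-to-spte mapping above can be sketched as a small decision
helper.  This is a hedged illustration, not KVM's actual code: the names
(``struct spte_perms``, ``wp0_spte_for_fault``) are made up for this sketch,
and KVM computes the equivalent result inline in its page fault path.

```c
#include <stdbool.h>

/* Hypothetical encoding of the resulting spte permissions. */
struct spte_perms {
	bool user;      /* spte.u */
	bool writable;  /* spte.w */
	bool inject_pf; /* reflect the fault back to the guest instead */
};

/*
 * For a guest running with cr0.wp=0 and a gpte with u=1, w=0, choose
 * spte permissions based on the faulting access, following the two
 * cases described in the text above.
 */
static struct spte_perms wp0_spte_for_fault(bool write_fault, bool user_fault)
{
	struct spte_perms p = { false, false, false };

	if (write_fault && user_fault) {
		/* user write: cr0.wp=0 does not lift gpte.w=0 for CPL 3 */
		p.inject_pf = true;
	} else if (write_fault) {
		/* kernel write fault: full kernel access, no user access */
		p.user = false;
		p.writable = true;
	} else {
		/* read fault: readable by everyone, writable by no one */
		p.user = true;
		p.writable = false;
	}
	return p;
}
```

Because the two resulting sptes are mutually exclusive, a kernel write after
a user read (or vice versa) triggers another fault and flips the spte, which
is exactly the behavior the two-way mapping trades for correctness.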

To prevent an spte that was converted into a kernel page with cr0.wp=0
from being written by the kernel after cr0.wp has changed to 1, we make
the value of cr0.wp part of the page role.  This means that an spte created
with one value of cr0.wp cannot be used when cr0.wp has a different value -
it will simply be missed by the shadow page lookup code.  A similar issue
exists when an spte created with cr0.wp=0 and cr4.smep=0 is used after
changing cr4.smep to 1.  To avoid this, the value of !cr0.wp && cr4.smep
is also made a part of the page role.

Large pages
===========

The mmu supports all combinations of large and small guest and host pages.
Supported page sizes include 4k, 2M, 4M, and 1G.  4M pages are treated as
two separate 2M pages, on both guest and host, since the mmu always uses PAE
paging.
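
The 4M-to-2M split comes down to simple gfn arithmetic: a 32-bit PSE guest
maps 4M large pages, but PAE shadow paging only has 2M large pages, so each
guest 4M page is covered by two consecutive 2M sptes.  A minimal sketch
(helper name invented for illustration; KVM does this inside its shadow
walker):

```c
#include <stdint.h>

#define PTES_PER_2M 512u /* 4k pages per 2M large page */

/*
 * Given the base gfn of a guest 4M page, compute the base gfns of the
 * two 2M sptes that shadow it.  The second half starts 512 small pages
 * (i.e. 2M) past the first.
 */
static void split_4m_gfn(uint64_t gfn_4m, uint64_t out[2])
{
	out[0] = gfn_4m;               /* first 2M half */
	out[1] = gfn_4m + PTES_PER_2M; /* second 2M half */
}
```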

To instantiate a large spte, four constraints must be satisfied:

- the spte must point to a large host page
- the guest pte must be a large pte of at least equivalent size (if tdp is
  enabled, there is no guest pte and this condition is satisfied)
- if the spte will be writeable, the large page frame may not overlap any
  write-protected pages
- the guest page must be wholly contained by a single memory slot

To check the last two conditions, the mmu maintains a ->disallow_lpage set of
arrays for each memory slot and large page size.  Every write protected page
causes its disallow_lpage to be incremented, thus preventing instantiation of
a large spte.  The frames at the end of an unaligned memory slot have
artificially inflated ->disallow_lpages so they can never be instantiated.

Fast invalidation of MMIO sptes
===============================

As mentioned in "Reaction to events" above, kvm will cache MMIO
information in leaf sptes.  When a new memslot is added or an existing
memslot is changed, this information may become stale and needs to be
invalidated.  This also needs to hold the MMU lock while walking all
shadow pages, and is made more scalable with a similar technique.

MMIO sptes have a few spare bits, which are used to store a
generation number.
The global generation number is stored in
kvm_memslots(kvm)->generation, and increased whenever guest memory info
changes.

When KVM finds an MMIO spte, it checks the generation number of the spte.
If the generation number of the spte does not equal the global generation
number, it will ignore the cached MMIO information and handle the page
fault through the slow path.

Since only 18 bits are used to store the generation number in an mmio spte,
all pages are zapped when there is an overflow.

Unfortunately, a single memory access might access kvm_memslots(kvm) multiple
times, the last one happening when the generation number is retrieved and
stored into the MMIO spte.  Thus, the MMIO spte might be created based on
out-of-date information, but with an up-to-date generation number.

To avoid this, the generation number is incremented again after
synchronize_srcu returns; thus, bit 63 of kvm_memslots(kvm)->generation is
set to 1 only during a memslot update, while some SRCU readers might be using
the old copy.  We do not want to use an MMIO spte created with an odd
generation number, and we can do this without losing a bit in the MMIO spte.
The "update in-progress" bit of the generation is not stored in the MMIO
spte, and so is implicitly zero when the generation is extracted out of the
spte.
If KVM is unlucky and creates an MMIO
spte while an update is in-progress, the next access to the spte will always
be a cache miss.  For example, a subsequent access during the update window
will miss due to the in-progress flag diverging, while an access after the
update window closes will have a higher generation number (as compared to the
spte).


Further reading
===============

- NPT presentation from KVM Forum 2008
  https://www.linux-kvm.org/images/c/c8/KvmForum2008%24kdf2008_21.pdf