==============================
Running nested guests with KVM
==============================

A nested guest is a guest that runs inside another guest (which itself
can be KVM-based or a different hypervisor). The straightforward
example is a KVM guest that in turn runs on a KVM guest (the rest of
this document is built on this example)::

               .----------------.  .----------------.
               |                |  |                |
               |      L2        |  |      L2        |
               | (Nested Guest) |  | (Nested Guest) |
               |                |  |                |
               |----------------'--'----------------|
               |                                    |
               |       L1 (Guest Hypervisor)        |
               |          KVM (/dev/kvm)            |
               |                                    |
      .------------------------------------------------------.
      |                  L0 (Host Hypervisor)                 |
      |                     KVM (/dev/kvm)                    |
      |------------------------------------------------------|
      |      Hardware (with virtualization extensions)        |
      '------------------------------------------------------'

Terminology:

- L0 – level-0; the bare metal host, running KVM

- L1 – level-1 guest; a VM running on L0; also called the "guest
  hypervisor", as it itself is capable of running KVM.

- L2 – level-2 guest; a VM running on L1; this is the "nested guest"

.. note:: The above diagram is modelled after the x86 architecture;
          s390x, ppc64 and other architectures are likely to have
          a different design for nesting.

          For example, s390x always has an LPAR (LogicalPARtition)
          hypervisor running on bare metal, adding another layer and
          resulting in at least four levels in a nested setup: L0 (bare
          metal, running the LPAR hypervisor), L1 (host hypervisor), L2
          (guest hypervisor), L3 (nested guest).

          This document will stick with the three-level terminology (L0,
          L1, and L2) for all architectures, and will largely focus on
          x86.


Use Cases
---------

There are several scenarios where nested KVM can be useful, to name a
few:

- As a developer, you want to test your software on different operating
  systems (OSes). Instead of renting multiple VMs from a Cloud
  Provider, using nested KVM lets you rent a large enough "guest
  hypervisor" (level-1 guest). This in turn allows you to create
  multiple nested guests (level-2 guests), running different OSes, on
  which you can develop and test your software.

- Live migration of "guest hypervisors" and their nested guests, for
  load balancing, disaster recovery, etc.

- VM image creation tools (e.g. ``virt-install``, etc) often run
  their own VM, and users expect these to work inside a VM.

- Some OSes use virtualization internally for security (e.g. to let
  applications run safely in isolation).


Enabling "nested" (x86)
-----------------------

From Linux kernel v4.20 onwards, the ``nested`` KVM parameter is enabled
by default for Intel and AMD. (Though your Linux distribution might
override this default.)

In case you are running a Linux kernel older than v4.20, to enable
nesting, set the ``nested`` KVM module parameter to ``Y`` or ``1``. To
persist this setting across reboots, add it to a config file, as shown
below:

1. On the bare metal host (L0), list the kernel modules and ensure that
   the KVM modules are loaded::

    $ lsmod | grep -i kvm
    kvm_intel             133627  0
    kvm                   435079  1 kvm_intel

2. Show information for the ``kvm_intel`` module::

    $ modinfo kvm_intel | grep -i nested
    parm:           nested:bool

3. For the nested KVM configuration to persist across reboots, place the
   below in ``/etc/modprobe.d/kvm_intel.conf`` (create the file if it
   doesn't exist)::

    $ cat /etc/modprobe.d/kvm_intel.conf
    options kvm-intel nested=y

4. Unload and re-load the KVM Intel module::

    $ sudo rmmod kvm-intel
    $ sudo modprobe kvm-intel

5. Verify that the ``nested`` parameter for KVM is enabled::

    $ cat /sys/module/kvm_intel/parameters/nested
    Y

For AMD hosts, the process is the same as above, except that the module
name is ``kvm-amd``.

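
As an illustration, the AMD equivalent of steps 3 to 5 could look like
the following. This is only a sketch: the config file name is arbitrary,
and on AMD the ``nested`` parameter is an integer, so it reads back as
``1`` rather than ``Y``::

    $ cat /etc/modprobe.d/kvm_amd.conf
    options kvm-amd nested=1

    $ sudo rmmod kvm-amd
    $ sudo modprobe kvm-amd

    $ cat /sys/module/kvm_amd/parameters/nested
    1
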

Additional nested-related kernel parameters (x86)
--------------------------------------------------

If your hardware is sufficiently advanced (an Intel Haswell processor or
later, which has newer hardware virt extensions), the following
additional features will also be enabled by default on your bare metal
host (L0): "Shadow VMCS (Virtual Machine Control Structure)", APIC
Virtualization, and EPT (Extended Page Tables). Parameters for Intel
hosts::

    $ cat /sys/module/kvm_intel/parameters/enable_shadow_vmcs
    Y

    $ cat /sys/module/kvm_intel/parameters/enable_apicv
    Y

    $ cat /sys/module/kvm_intel/parameters/ept
    Y

.. note:: If you suspect your L2 (i.e. nested guest) is running slowly,
          ensure the above are enabled (particularly
          ``enable_shadow_vmcs`` and ``ept``).


Starting a nested guest (x86)
-----------------------------

Once your bare metal host (L0) is configured for nesting, you should be
able to start an L1 guest with::

    $ qemu-kvm -cpu host [...]

The above will pass through the host CPU's capabilities as-is to the
guest; or, for better live migration compatibility, use a named CPU
model supported by QEMU, e.g.::

    $ qemu-kvm -cpu Haswell-noTSX-IBRS,vmx=on

Either way, the guest hypervisor will subsequently be capable of running
a nested guest with accelerated KVM.

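
To double-check from inside the L1 guest that nesting is actually
available to it, you can look for the hardware virtualization flag in
``/proc/cpuinfo`` and for the ``/dev/kvm`` device node. This is only an
illustrative sanity check: the flag is ``vmx`` on Intel and ``svm`` on
AMD, and the count shown will differ on your system::

    $ grep -c -w -E 'vmx|svm' /proc/cpuinfo
    4
    $ ls /dev/kvm
    /dev/kvm
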

Enabling "nested" (s390x)
-------------------------

1. On the host hypervisor (L0), enable the ``nested`` parameter on
   s390x::

    $ rmmod kvm
    $ modprobe kvm nested=1

.. note:: On s390x, the kernel parameter ``hpage`` is mutually exclusive
          with the ``nested`` parameter; i.e. to be able to enable
          ``nested``, the ``hpage`` parameter *must* be disabled.

2. The guest hypervisor (L1) must be provided with the ``sie`` CPU
   feature; with QEMU, this can be done by using "host passthrough"
   (via the command-line ``-cpu host``).

3. Now the KVM module can be loaded in the L1 (guest hypervisor)::

    $ modprobe kvm


Live migration with nested KVM
------------------------------

Migrating an L1 guest, with a *live* nested guest in it, to another
bare metal host, works as of Linux kernel 5.3 and QEMU 4.2.0 for
Intel x86 systems, and even on older versions for s390x.

On AMD systems, once an L1 guest has started an L2 guest, the L1 guest
should no longer be migrated or saved (refer to QEMU documentation on
"savevm"/"loadvm") until the L2 guest shuts down. Attempting to migrate
or save-and-load an L1 guest while an L2 guest is running will result in
undefined behavior. You might see a ``kernel BUG!`` entry in ``dmesg``, a
kernel 'oops', or an outright kernel panic. Such a migrated or loaded L1
guest can no longer be considered stable or secure, and must be restarted.

Migrating an L1 guest merely configured to support nesting, while not
actually running L2 guests, is expected to function normally even on AMD
systems, but may fail once L2 guests are started.

Migrating an L2 guest is always expected to succeed, so all the following
scenarios should work even on AMD systems:

- Migrating a nested guest (L2) to another L1 guest on the *same* bare
  metal host.

- Migrating a nested guest (L2) to another L1 guest on a *different*
  bare metal host.

- Migrating a nested guest (L2) to a bare metal host.


Reporting bugs from nested setups
---------------------------------

Debugging "nested" problems can involve sifting through log files across
L0, L1 and L2; this can result in a tedious back-and-forth between the
bug reporter and the bug fixer.

- Mention that you are in a "nested" setup. If you are running any kind
  of "nesting" at all, say so. Unfortunately, this needs to be called
  out because, when reporting bugs, people tend to forget to even
  *mention* that they're using nested virtualization.

- Ensure you are actually running KVM on KVM. Sometimes people do not
  have KVM enabled for their guest hypervisor (L1), which results in
  them running with pure emulation (what QEMU calls "TCG") while
  believing they're running nested KVM, thus confusing "nested virt"
  (which could also mean QEMU on KVM) with "nested KVM" (KVM on KVM).

Information to collect (generic)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The following is not an exhaustive list, but a very good starting point:

  - Kernel, libvirt, and QEMU version from L0

  - Kernel, libvirt, and QEMU version from L1

  - QEMU command-line of L1 -- when using libvirt, you'll find it here:
    ``/var/log/libvirt/qemu/instance.log``

  - QEMU command-line of L2 -- as above, when using libvirt, get the
    complete libvirt-generated QEMU command-line

  - ``cat /proc/cpuinfo`` from L0

  - ``cat /proc/cpuinfo`` from L1

  - ``lscpu`` from L0

  - ``lscpu`` from L1

  - Full ``dmesg`` output from L0

  - Full ``dmesg`` output from L1

x86-specific info to collect
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Both of the commands below, ``x86info`` and ``dmidecode``, should be
available under those names on most Linux distributions:

  - Output of: ``x86info -a`` from L0

  - Output of: ``x86info -a`` from L1

  - Output of: ``dmidecode`` from L0

  - Output of: ``dmidecode`` from L1

s390x-specific info to collect
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Along with the generic details mentioned earlier, the below is
also recommended:

  - ``/proc/sysinfo`` from L1; this will also include the info from L0
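
As an illustration, most of the generic details listed above can be
gathered on both L0 and L1 with a handful of commands. This is only a
sketch; binary and package names (e.g. ``qemu-system-x86_64`` vs.
``qemu-kvm``) vary by distribution and architecture::

    $ uname -r                        # kernel version
    $ qemu-system-x86_64 --version    # QEMU version
    $ virsh version                   # libvirt version, when using libvirt
    $ cat /proc/cpuinfo
    $ lscpu
    $ dmesg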