Multi-process QEMU
===================

.. note::

  This is the design document for multi-process QEMU. It does not
  necessarily reflect the status of the current implementation, which
  may lack features or be considerably different from what is described
  in this document. This document is still useful as a description of
  the goals and general direction of this feature.

  Please refer to the following wiki for the latest details:
  https://wiki.qemu.org/Features/MultiProcessQEMU

QEMU is often used as the hypervisor for virtual machines running in the
Oracle cloud. Since one of the advantages of cloud computing is the
ability to run many VMs from different tenants in the same cloud
infrastructure, a guest that compromised its hypervisor could
potentially use the hypervisor's access privileges to access data it is
not authorized for.

QEMU can be susceptible to security attacks because it is a large,
monolithic program that provides many features to the VMs it services.
Many of these features can be configured out of QEMU, but even a
reduced-configuration QEMU has a large amount of code a guest can
potentially attack. Separating QEMU reduces the attack surface by
helping to limit each component in the system to accessing only the
resources that it needs to perform its job.

QEMU services
-------------

QEMU can be broadly described as providing three main services. One is a
VM control point, where VMs can be created, migrated, re-configured, and
destroyed. A second is to emulate the CPU instructions within the VM,
often accelerated by HW virtualization features such as Intel's VT
extensions. Finally, it provides IO services to the VM by emulating HW
IO devices, such as disk and network devices.

A multi-process QEMU
~~~~~~~~~~~~~~~~~~~~

A multi-process QEMU involves separating QEMU services into separate
host processes. Each of these processes can be given only the privileges
it needs to provide its service, e.g., a disk service could be given
access only to the disk images it provides, and not be allowed to
access other files, or any network devices. An attacker who compromised
this service would not be able to use this exploit to access files or
devices beyond what the disk service was given access to.

A QEMU control process would remain, but in multi-process mode it would
have no direct interfaces to the VM. During VM execution, it would still
provide the user interface to hot-plug devices or live migrate the VM.

A first step in creating a multi-process QEMU is to separate IO services
from the main QEMU program, which would continue to provide CPU
emulation; i.e., the control process would also be the CPU emulation
process. In a later phase, CPU emulation could be separated from the
control process.

Separating IO services
----------------------

Separating IO services into individual host processes is a good place to
begin for a couple of reasons. One is that the sheer number of IO devices
QEMU can emulate provides a large surface of interfaces which could
potentially be exploited, and, indeed, have been a source of exploits in
the past. Another is that the modular nature of QEMU device emulation
code provides interface points where the QEMU functions that perform
device emulation can be separated from the QEMU functions that manage
the emulation of guest CPU instructions. The devices emulated in the
separate process are referred to as remote devices.

QEMU device emulation
~~~~~~~~~~~~~~~~~~~~~

QEMU uses an object-oriented SW architecture for device emulation code.
Configured objects are all compiled into the QEMU binary, then objects
are instantiated by name when used by the guest VM.
For example, the
code to emulate a device named "foo" is always present in QEMU, but its
instantiation code is only run when the device is included in the target
VM (e.g., via the QEMU command line as *-device foo*).

The object model is hierarchical, so device emulation code names its
parent object (such as "pci-device" for a PCI device) and QEMU will
instantiate a parent object before calling the device's instantiation
code.

Current separation models
~~~~~~~~~~~~~~~~~~~~~~~~~

In order to separate the device emulation code from the CPU emulation
code, the device object code must run in a different process. There are
a couple of existing QEMU features that can run emulation code
separately from the main QEMU process. These are examined below.

vhost user model
^^^^^^^^^^^^^^^^

Virtio guest device drivers can be connected to vhost user applications
in order to perform their IO operations. This model uses special virtio
device drivers in the guest and vhost user device objects in QEMU, but
once the QEMU vhost user code has configured the vhost user application,
mission-mode IO is performed by the application. The vhost user
application is a daemon process that can be contacted via a known UNIX
domain socket.

vhost socket
''''''''''''

As mentioned above, one of the tasks of the vhost device object within
QEMU is to contact the vhost application and send it configuration
information about this device instance. As part of the configuration
process, the application can also be sent other file descriptors over
the socket, which then can be used by the vhost user application in
various ways, some of which are described below.

vhost MMIO store acceleration
'''''''''''''''''''''''''''''

VMs are often run using HW virtualization features via the KVM kernel
driver. This driver allows QEMU to accelerate the emulation of guest CPU
instructions by running the guest in a virtual HW mode. When the guest
executes instructions that cannot be executed by virtual HW mode,
execution returns to the KVM driver so it can inform QEMU to emulate the
instructions in SW.

One of the events that can cause a return to QEMU is when a guest device
driver accesses an IO location. QEMU then dispatches the memory
operation to the corresponding QEMU device object. In the case of a
vhost user device, the memory operation would need to be sent over a
socket to the vhost application.
This path is accelerated by the QEMU
virtio code by setting up an eventfd file descriptor through which the
vhost application can directly receive MMIO store notifications from the
KVM driver, instead of needing them to be sent to the QEMU process first.

vhost interrupt acceleration
''''''''''''''''''''''''''''

Another optimization used by the vhost application is the ability to
directly inject interrupts into the VM via the KVM driver, again
bypassing the need to send the interrupt back to the QEMU process first.
The QEMU virtio setup code configures the KVM driver with an eventfd
that triggers the device interrupt in the guest when the eventfd is
written. This irqfd file descriptor is then passed to the vhost user
application program.

vhost access to guest memory
''''''''''''''''''''''''''''

The vhost application is also allowed to directly access guest memory,
instead of needing to send the data as messages to QEMU. This is also
done with file descriptors sent to the vhost user application by QEMU.
These descriptors can be passed to ``mmap()`` by the vhost application
to map the guest address space into the vhost application.

IOMMUs introduce another level of complexity, since the address given to
the guest virtio device to DMA to or from is not a guest physical
address.
This case is handled by having vhost code within QEMU register
as a listener for IOMMU mapping changes. The vhost application maintains
a cache of IOMMU translations, sending translation requests back to
QEMU on cache misses, and in turn receiving flush requests from QEMU
when mappings are purged.

applicability to device separation
''''''''''''''''''''''''''''''''''

Much of the vhost model can be re-used by separated device emulation. In
particular, the ideas of using a socket between QEMU and the device
emulation application, using a file descriptor to inject interrupts into
the VM via KVM, and allowing the application to ``mmap()`` guest memory
should be re-used.

There are, however, some notable differences between how a vhost
application works and the needs of separated device emulation. The most
basic is that vhost uses custom virtio device drivers which always
trigger IO with MMIO stores. A separated device emulation model must
work with existing IO device models and guest device drivers. MMIO loads
break vhost store acceleration since they are synchronous: guest
progress cannot continue until the load has been emulated. By contrast,
stores are asynchronous; the guest can continue after the store event
has been sent to the vhost application.

Another difference is that in the vhost user model, a single daemon can
support multiple QEMU instances. This is contrary to the security regime
desired, in which the emulation application should only be allowed to
access the files or devices the VM it's running on behalf of can access.

qemu-io model
^^^^^^^^^^^^^

``qemu-io`` is a test harness used to test changes to the QEMU block backend
object code (e.g., the code that implements disk images for disk driver
emulation). ``qemu-io`` is not a device emulation application per se, but it
does compile the QEMU block objects into a separate binary from the main
QEMU one. This could be useful for disk device emulation, since its
emulation applications will need to include the QEMU block objects.

New separation model based on proxy objects
-------------------------------------------

A different model based on proxy objects in the QEMU program
communicating with remote emulation programs could provide separation
while minimizing the changes needed to the device emulation code. The
rest of this section is a discussion of how a proxy object model would
work.

Remote emulation processes
~~~~~~~~~~~~~~~~~~~~~~~~~~

The remote emulation process will run the QEMU object hierarchy without
modification. The device emulation objects will also be based on the
QEMU code, because for anything but the simplest device, it would not be
tractable to re-implement both the object model and the many device
backends that QEMU has.

The processes will communicate with the QEMU process over UNIX domain
sockets. The processes can be executed either as standalone processes,
or be executed by QEMU. In both cases, the host backends an emulation
process will provide are specified on its command line, as they would
be for QEMU. For example:

::

  disk-proc -blockdev driver=file,node-name=file0,filename=disk-file0  \
  -blockdev driver=qcow2,node-name=drive0,file=file0

would indicate process *disk-proc* uses a qcow2 emulated disk named
*file0* as its backend.

Emulation processes may emulate more than one guest controller. A common
configuration might be to put all controllers of the same device class
(e.g., disk, network, etc.) in a single process, so that all backends of
the same type can be managed by a single QMP monitor.

communication with QEMU
^^^^^^^^^^^^^^^^^^^^^^^

The first argument to the remote emulation process will be a Unix domain
socket that connects with the Proxy object. This is a required argument.

::

  disk-proc <socket number> <backend list>

remote process QMP monitor
^^^^^^^^^^^^^^^^^^^^^^^^^^

Remote emulation processes can be monitored via QMP, similar to QEMU
itself. The QMP monitor socket is specified in the same way as for a
QEMU process:

::

  disk-proc -qmp unix:/tmp/disk-mon,server

can be monitored over the UNIX socket path */tmp/disk-mon*.

QEMU command line
~~~~~~~~~~~~~~~~~

Each remote device emulated in a remote process on the host is
represented as a *-device* of type *pci-proxy-dev*. A socket
sub-option to this option specifies the Unix socket that connects
to the remote process. An *id* sub-option is required, and it should
be the same id as used in the remote process.

::

  qemu-system-x86_64 ... \
      -device pci-proxy-dev,id=lsi0,socket=3

can be used to add a device emulated in a remote process.

QEMU management of remote processes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

QEMU is not aware of the type of the remote PCI device. It is
a pass-through device as far as QEMU is concerned.

communication with emulation process
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

primary channel
'''''''''''''''

The primary channel (referred to as com in the code) is used to bootstrap
the remote process. It is also used to pass on device-agnostic commands
like reset.

per-device channels
'''''''''''''''''''

Each remote device communicates with QEMU using a dedicated communication
channel. The proxy object sets up this channel using the primary
channel during its initialization.

QEMU device proxy objects
~~~~~~~~~~~~~~~~~~~~~~~~~

QEMU has an object model based on sub-classes inherited from the
"object" super-class. The sub-classes that are of interest here are the
"device" and "bus" sub-classes whose child sub-classes make up the
device tree of a QEMU emulated system.

The proxy object model will use device proxy objects to replace the
device emulation code within the QEMU process. These objects will live
in the same place in the object and bus hierarchies as the objects they
replace; i.e., the proxy object for an LSI SCSI controller will be a
sub-class of the "pci-device" class, and will have the same PCI bus
parent and the same SCSI bus child objects as the LSI controller object
it replaces.

It is worth noting that the same proxy object is used to mediate with
all types of remote PCI devices.

object initialization
^^^^^^^^^^^^^^^^^^^^^

The Proxy device objects are initialized in the exact same manner in
which any other QEMU device would be initialized.

In addition, the Proxy objects perform the following two tasks:

- Parse the "socket" sub-option and connect to the remote process
  using this channel
- Use the "id" sub-option to connect to the emulated device on the
  separate process

class\_init
'''''''''''

The ``class_init()`` method of a proxy object will, in general, behave
similarly to the object it replaces, including setting any static
properties and methods needed by the proxy.
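
The placement of a proxy in the type hierarchy can be illustrated with a
toy registry. This is only a sketch in Python, not QEMU's object model
code: ``register_type()``, ``new_object()`` and ``lineage()`` are
hypothetical helpers invented for this illustration, and the only names
taken from this document are the type names "object", "device",
"pci-device" and "pci-proxy-dev" and the *id* and *socket* sub-options.
It shows types being looked up by name and parent initializers running
before the proxy's own.

```python
# Toy model of name-based, hierarchical instantiation (not QEMU code).
REGISTRY = {}


def register_type(name, parent=None, instance_init=None):
    """Record a type under its name, together with the name of its parent."""
    REGISTRY[name] = {"parent": parent, "instance_init": instance_init}


def lineage(name):
    """Return the type names from the root ancestor down to `name`."""
    chain = []
    while name is not None:
        chain.append(name)
        name = REGISTRY[name]["parent"]
    return chain[::-1]


def new_object(name, **props):
    """Instantiate by name, running each ancestor's initializer first."""
    obj = {"type": name, "props": props, "init_order": []}
    for type_name in lineage(name):
        init = REGISTRY[type_name]["instance_init"]
        if init is not None:
            init(obj)
    return obj


def note(type_name):
    """Build an initializer that just records that it ran."""
    def init(obj):
        obj["init_order"].append(type_name)
    return init


register_type("object")
register_type("device", parent="object", instance_init=note("device"))
register_type("pci-device", parent="device", instance_init=note("pci-device"))
register_type("pci-proxy-dev", parent="pci-device",
              instance_init=note("pci-proxy-dev"))

# Mirrors "-device pci-proxy-dev,id=lsi0,socket=3" from the command line.
proxy = new_object("pci-proxy-dev", id="lsi0", socket=3)

if __name__ == "__main__":
    print(lineage("pci-proxy-dev"))
    print(proxy["init_order"])
```

Because the proxy registers "pci-device" as its parent, the rest of the
toy hierarchy treats it like any other PCI device, which is the property
the proxy model relies on.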

instance\_init / realize
''''''''''''''''''''''''

The ``instance_init()`` and ``realize()`` functions would only need to
perform tasks related to being a proxy, such as registering its own
MMIO handlers, or creating a child bus that other proxy devices can be
attached to later.

Other tasks will be device-specific. For example, PCI device objects
will initialize the PCI config space in order to make a valid PCI device
tree within the QEMU process.

address space registration
^^^^^^^^^^^^^^^^^^^^^^^^^^

Most devices are driven by guest device driver accesses to IO addresses
or ports. The QEMU device emulation code uses QEMU's memory region
function calls (such as ``memory_region_init_io()``) to add callback
functions that QEMU will invoke when the guest accesses the device's
areas of the IO address space. When a guest driver does access the
device, the VM will exit HW virtualization mode and return to QEMU,
which will then look up and execute the corresponding callback function.

A proxy object would need to mirror the memory region calls the actual
device emulator would perform in its initialization code, but with its
own callbacks.
When invoked by QEMU as a result of a guest IO operation,
they will forward the operation to the device emulation process.

PCI config space
^^^^^^^^^^^^^^^^

PCI devices also have a configuration space that can be accessed by the
guest driver. Guest accesses to this space are not handled by the device
emulation object, but by its PCI parent object. Much of this space is
read-only, but certain registers (especially BAR and MSI-related ones)
need to be propagated to the emulation process.

PCI parent proxy
''''''''''''''''

One way to propagate guest PCI config accesses is to create a
"pci-device-proxy" class that can serve as the parent of a PCI device
proxy object. This class's parent would be "pci-device" and it would
override the PCI parent's ``config_read()`` and ``config_write()``
methods with ones that forward these operations to the emulation
program.

interrupt receipt
^^^^^^^^^^^^^^^^^

A proxy for a device that generates interrupts will need to create a
socket to receive interrupt indications from the emulation process. An
incoming interrupt indication would then be sent up to its bus parent to
be injected into the guest. For example, a PCI device object may use
``pci_set_irq()``.
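
The forwarding callbacks described under "address space registration"
above could look roughly like the following sketch. It is written in
Python for brevity and is not the project's wire protocol: the message
layout (``MSG``), the opcodes, the ``ProxyMMIO`` class and the
``emulation_process()`` stand-in are all assumptions made for this
illustration. A thread plays the role of the remote process. Stores are
sent without waiting for a reply and loads block until the value comes
back; that split is a simplification chosen here, not something this
document specifies.

```python
import socket
import struct
import threading

# Hypothetical wire format: opcode, access size in bytes, address, value.
MSG = struct.Struct("<BBQQ")
OP_READ, OP_WRITE, OP_QUIT = 0, 1, 2


def recv_exact(sock, count):
    """Read exactly `count` bytes from a stream socket."""
    buf = b""
    while len(buf) < count:
        chunk = sock.recv(count - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the channel")
        buf += chunk
    return buf


def emulation_process(sock):
    """Stand-in for the remote process: a register file keyed by address."""
    regs = {}
    while True:
        op, size, addr, val = MSG.unpack(recv_exact(sock, MSG.size))
        if op == OP_QUIT:
            break
        mask = (1 << (8 * size)) - 1
        if op == OP_WRITE:
            regs[addr] = val & mask
        else:
            sock.sendall(struct.pack("<Q", regs.get(addr, 0) & mask))


class ProxyMMIO:
    """Callbacks a proxy might install in place of the real device's."""

    def __init__(self, sock):
        self.sock = sock

    def write(self, addr, val, size):
        # Forward the store; no reply is awaited in this sketch.
        self.sock.sendall(MSG.pack(OP_WRITE, size, addr, val))

    def read(self, addr, size):
        # Forward the load and block until the remote side answers.
        self.sock.sendall(MSG.pack(OP_READ, size, addr, 0))
        (val,) = struct.unpack("<Q", recv_exact(self.sock, 8))
        return val


def demo():
    qemu_end, remote_end = socket.socketpair()
    worker = threading.Thread(target=emulation_process, args=(remote_end,))
    worker.start()
    proxy = ProxyMMIO(qemu_end)
    proxy.write(0x10, 0xDEADBEEF, 4)
    proxy.write(0x20, 0x12345678, 2)   # truncated to the 2-byte access size
    result = (proxy.read(0x10, 4), proxy.read(0x20, 2), proxy.read(0x30, 4))
    qemu_end.sendall(MSG.pack(OP_QUIT, 0, 0, 0))
    worker.join()
    qemu_end.close()
    remote_end.close()
    return result


if __name__ == "__main__":
    print([hex(v) for v in demo()])
```

The same request and reply pattern would apply to the ``config_read()``
and ``config_write()`` overrides of a PCI parent proxy.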

live migration
^^^^^^^^^^^^^^

The proxy will register to save and restore any *vmstate* it needs over
a live migration event. The device proxy does not need to manage the
remote device's *vmstate*; that will be handled by the remote process
proxy (see below).

QEMU remote device operation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Generic device operations, such as DMA, will be performed by the remote
process proxy by sending messages to the remote process.

DMA operations
^^^^^^^^^^^^^^

DMA operations would be handled much like vhost applications do. One of
the initial messages sent to the emulation process is a guest memory
table. Each entry in this table consists of a file descriptor and size
that the emulation process can ``mmap()`` to directly access guest
memory, similar to ``vhost_user_set_mem_table()``. Note that guest memory
must be backed by shared file-backed memory, for example, using
*-object memory-backend-file,share=on* and setting that memory backend
as RAM for the machine.
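
A minimal sketch of the guest memory table exchange follows, assuming a
made-up message layout. The ``ENTRY`` format, the helper names and the
per-entry guest physical address are assumptions added for this
illustration (the text above only mentions a file descriptor and a size
per entry). Temporary files stand in for shared, file-backed guest RAM,
and both ends of the socket live in one Python process for simplicity.
The descriptors travel as ``SCM_RIGHTS`` ancillary data on a UNIX domain
socket.

```python
import array
import mmap
import os
import socket
import struct
import tempfile

# Hypothetical wire format for one entry: guest physical address, size.
ENTRY = struct.Struct("<QQ")


def send_mem_table(sock, entries):
    """entries: list of (gpa, size, fd); the fds travel as SCM_RIGHTS data."""
    payload = b"".join(ENTRY.pack(gpa, size) for gpa, size, _fd in entries)
    fds = array.array("i", [fd for _gpa, _size, fd in entries])
    sock.sendmsg([payload], [(socket.SOL_SOCKET, socket.SCM_RIGHTS, fds)])


def recv_mem_table(sock, max_entries=8):
    """Return a list of (gpa, size, mapping) built from the received table."""
    fds = array.array("i")
    payload, ancdata, _flags, _addr = sock.recvmsg(
        ENTRY.size * max_entries, socket.CMSG_LEN(max_entries * fds.itemsize))
    for level, ctype, data in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            fds.frombytes(data[:len(data) - (len(data) % fds.itemsize)])
    regions = []
    for index, fd in enumerate(fds):
        gpa, size = ENTRY.unpack_from(payload, index * ENTRY.size)
        regions.append((gpa, size, mmap.mmap(fd, size)))  # shared mapping
        os.close(fd)  # the mapping stays valid after the fd is closed
    return regions


def translate(regions, gpa, length):
    """Map a guest physical range to (mapping, offset) in this process."""
    for base, size, view in regions:
        if base <= gpa and gpa + length <= base + size:
            return view, gpa - base
    raise ValueError("guest address %#x is not backed by RAM" % gpa)


def demo():
    page = mmap.PAGESIZE
    high_base = 1 << 32
    qemu_end, remote_end = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
    with tempfile.TemporaryFile() as low, tempfile.TemporaryFile() as high:
        low.truncate(page)
        high.truncate(2 * page)
        qemu_view = mmap.mmap(high.fileno(), 2 * page)  # "QEMU" side mapping
        send_mem_table(qemu_end, [(0, page, low.fileno()),
                                  (high_base, 2 * page, high.fileno())])
        regions = recv_mem_table(remote_end)
        view, offset = translate(regions, high_base + 16, 4)
        view[offset:offset + 4] = b"DMA!"     # "device" writes guest memory
        seen = bytes(qemu_view[16:20])        # visible through QEMU's mapping
        qemu_view.close()
        for _base, _size, mapping in regions:
            mapping.close()
    qemu_end.close()
    remote_end.close()
    return seen


if __name__ == "__main__":
    print(demo())
```

The write is visible on the other side only because both mappings are
shared mappings of the same file, which is why the guest RAM backend
must be created with sharing enabled.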

IOMMU operations
^^^^^^^^^^^^^^^^

When the emulated system includes an IOMMU, the remote process proxy in
QEMU will need to create a socket for IOMMU requests from the emulation
process. It will handle those requests with an
``address_space_get_iotlb_entry()`` call. In order to handle IOMMU
unmaps, the remote process proxy will also register as a listener on the
device's DMA address space. When an IOMMU memory region is created
within the DMA address space, an IOMMU notifier for unmaps will be added
to the memory region that will forward unmaps to the emulation process
over the IOMMU socket.

device hot-plug via QMP
^^^^^^^^^^^^^^^^^^^^^^^

A QMP "device\_add" command can add a device emulated by a remote
process. The command will also take a "rid" option, just as the
*-device* command line option does. The remote process may either be one
started at QEMU startup, or be one added by the "add-process" QMP
command described above. In either case, the remote process proxy will
forward the new device's JSON description to the corresponding emulation
process.

live migration
^^^^^^^^^^^^^^

The remote process proxy will also register for live migration
notifications with ``vmstate_register()``.
When called to save state,
the proxy will send the remote process a secondary socket file
descriptor to save the remote process's device *vmstate* over. The
incoming byte stream length and data will be saved as the proxy's
*vmstate*. When the proxy is resumed on its new host, this *vmstate*
will be extracted, and a secondary socket file descriptor will be sent
to the new remote process through which it receives the *vmstate* in
order to restore the devices there.

device emulation in remote process
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The parts of QEMU that the emulation program will need include the
object model; the memory emulation objects; the device emulation objects
of the targeted device, and any dependent devices; and the device's
backends. It will also need code to set up the machine environment,
handle requests from the QEMU process, and route machine-level requests
(such as interrupts or IOMMU mappings) back to the QEMU process.

initialization
^^^^^^^^^^^^^^

The process initialization sequence will follow the same sequence
followed by QEMU. It will first initialize the backend objects, then
device emulation objects. The JSON descriptions sent by the QEMU process
will drive which objects need to be created.

- address spaces

Before the device objects are created, the initial address spaces and
memory regions must be configured with ``memory_map_init()``. This
creates a RAM memory region object (*system\_memory*) and an IO memory
region object (*system\_io*).

- RAM

RAM memory region creation will follow how ``pc_memory_init()`` creates
them, but must use ``memory_region_init_ram_from_fd()`` instead of
``memory_region_allocate_system_memory()``. The file descriptors needed
will be supplied by the guest memory table from above. Those RAM regions
would then be added to the *system\_memory* memory region with
``memory_region_add_subregion()``.

- PCI

IO initialization will be driven by the JSON descriptions sent from the
QEMU process. For a PCI device, a PCI bus will need to be created with
``pci_root_bus_new()``, and a PCI memory region will need to be created
and added to the *system\_memory* memory region with
``memory_region_add_subregion_overlap()``. The overlap version is
required for architectures where PCI memory overlaps with RAM memory.
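
Why the overlap variant matters can be seen in a small model of a region
tree. The sketch below is Python, not QEMU's memory API: ``ToyRegion``
and its methods are hypothetical stand-ins that only imitate the idea
that, where sub-regions overlap, the one with the higher priority is
seen, and that holes in a container let lower-priority regions show
through. The sizes, addresses and the priority of -1 are illustrative
assumptions.

```python
class ToyRegion:
    """A leaf (RAM or a BAR) or a container holding prioritised sub-regions."""

    def __init__(self, name, size, container=False):
        self.name = name
        self.size = size
        self.container = container
        self.subregions = []  # list of (offset, priority, region)

    def add_subregion(self, offset, region):
        """Add a sub-region at the default priority of 0."""
        self.add_subregion_overlap(offset, region, 0)

    def add_subregion_overlap(self, offset, region, priority):
        """Add a sub-region that is allowed to overlap its siblings."""
        self.subregions.append((offset, priority, region))

    def resolve(self, addr):
        """Return (leaf name, offset within the leaf), or None for a hole."""
        if not self.container:
            return (self.name, addr)
        hits = [(prio, off, sub) for off, prio, sub in self.subregions
                if off <= addr < off + sub.size]
        # Try the highest priority first; fall through holes in containers.
        for _prio, off, sub in sorted(hits, key=lambda h: h[0], reverse=True):
            found = sub.resolve(addr - off)
            if found is not None:
                return found
        return None


system_memory = ToyRegion("system_memory", 1 << 32, container=True)
ram = ToyRegion("ram", 0x80000000)
pci = ToyRegion("pci", 1 << 32, container=True)
bar0 = ToyRegion("bar0", 0x1000)
shadowed_bar = ToyRegion("shadowed_bar", 0x1000)

system_memory.add_subregion(0, ram)
# The PCI window spans the whole space but sits below RAM in priority.
system_memory.add_subregion_overlap(0, pci, -1)
pci.add_subregion(0xE0000000, bar0)
pci.add_subregion(0x1000, shadowed_bar)  # hidden: RAM wins at this address

if __name__ == "__main__":
    print(system_memory.resolve(0x1000))       # lands in RAM
    print(system_memory.resolve(0xE0000010))   # lands in bar0
```

With plain, non-overlapping sub-regions the PCI window could not be laid
over the RAM range at all, which is the situation the text above refers
to.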

MMIO handling
^^^^^^^^^^^^^

The device emulation objects will use ``memory_region_init_io()`` to
install their MMIO handlers, and ``pci_register_bar()`` to associate
those handlers with a PCI BAR, as they do within QEMU currently.

In order to use ``address_space_rw()`` in the emulation process to
handle MMIO requests from QEMU, the PCI physical addresses must be the
same in the QEMU process and the device emulation process. In order to
accomplish that, guest BAR programming must also be forwarded from QEMU
to the emulation process.

interrupt injection
^^^^^^^^^^^^^^^^^^^

When device emulation wants to inject an interrupt into the VM, the
request climbs the device's bus object hierarchy until the point where a
bus object knows how to signal the interrupt to the guest. The details
depend on the type of interrupt being raised.

- PCI pin interrupts

On x86 systems, there is an emulated IOAPIC object attached to the root
PCI bus object, and the root PCI object forwards interrupt requests to
it. The IOAPIC object, in turn, calls the KVM driver to inject the
corresponding interrupt into the VM.
The simplest way to handle this in
an emulation process would be to set up the root PCI bus driver (via
``pci_bus_irqs()``) to send an interrupt request back to the QEMU
process, and have the device proxy object reflect it up the PCI tree
there.

- PCI MSI/X interrupts

PCI MSI/X interrupts are implemented in HW as DMA writes to a
CPU-specific PCI address. In QEMU on x86, a KVM APIC object receives
these DMA writes, then calls into the KVM driver to inject the interrupt
into the VM. A simple emulation process implementation would be to send
the MSI DMA address from QEMU as a message at initialization, then
install an address space handler at that address which forwards the MSI
message back to QEMU.

DMA operations
^^^^^^^^^^^^^^

When an emulation object wants to DMA into or out of guest memory, it
first must use ``dma_memory_map()`` to convert the DMA address to a local
virtual address. The emulation process memory region objects set up above
will be used to translate the DMA address to a local virtual address the
device emulation code can access.
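
The translation step can be illustrated with a minimal, self-contained
sketch. It does not use QEMU's memory API; ``struct ram_region`` and
``dma_to_host()`` are hypothetical names standing in for the memory region
objects built from the guest memory table, and the sketch assumes a DMA
range never spans two regions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical entry of the guest memory table: one RAM region that the
 * emulation process has mapped into its own address space. */
struct ram_region {
    uint64_t gpa;   /* guest physical base address */
    uint64_t len;   /* region length in bytes */
    uint8_t *host;  /* local virtual address of the mapping */
};

/* Translate a DMA address range to a local pointer. Returns NULL when
 * the range is not fully contained in a single region. */
static void *dma_to_host(const struct ram_region *tbl, size_t n,
                         uint64_t addr, uint64_t size)
{
    assert(tbl != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        const struct ram_region *r = &tbl[i];
        /* Written to avoid overflow in addr + size. */
        if (addr >= r->gpa && size <= r->len &&
            addr - r->gpa <= r->len - size) {
            return r->host + (addr - r->gpa);
        }
    }
    return NULL;
}
```

A real implementation would also have to handle the IOMMU case described
next, where the DMA address is not yet a guest physical address.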

IOMMU
^^^^^

When an IOMMU is in use in QEMU, DMA translation uses IOMMU memory
regions to translate the DMA address to a guest physical address before
that physical address can be translated to a local virtual address. The
emulation process will need similar functionality.

- IOTLB cache

The emulation process will maintain a cache of recent IOMMU translations
(the IOTLB). When the ``translate()`` callback of an IOMMU memory region is
invoked, the IOTLB cache will be searched for an entry that will map the
DMA address to a guest PA. On a cache miss, a message will be sent back
to QEMU requesting the corresponding translation entry, which will both be
used to return a guest address and be added to the cache.

- IOTLB purge

The IOMMU emulation will also need to act on unmap requests from QEMU.
These happen when the guest IOMMU driver purges an entry from the
guest's translation table.

live migration
^^^^^^^^^^^^^^

When a remote process receives a live migration indication from QEMU, it
will set up a channel using the received file descriptor with
``qio_channel_socket_new_fd()``.
This channel will be used to create a
*QEMUFile* that can be passed to ``qemu_save_device_state()`` to send
the process's device state back to QEMU. This method will be reversed on
restore - the channel will be passed to ``qemu_loadvm_state()`` to
restore the device state.

Accelerating device emulation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The messages that are required to be sent between QEMU and the emulation
process can add considerable latency to IO operations. The optimizations
described below attempt to ameliorate this effect by allowing the
emulation process to communicate directly with the kernel KVM driver.
The KVM file descriptors created would be passed to the emulation process
via initialization messages, much as the guest memory table is.

MMIO acceleration
^^^^^^^^^^^^^^^^^

Vhost user applications can receive guest virtio driver stores directly
from KVM. The issue with the eventfd mechanism used by vhost user is
that it does not pass any data with the event indication, so it cannot
handle guest loads or guest stores that carry store data. This concept
could, however, be expanded to cover more cases.

The expanded idea would require a new type of KVM device:
*KVM\_DEV\_TYPE\_USER*.
This device has two file descriptors: a master
descriptor that QEMU can use for configuration, and a slave descriptor
that the emulation process can use to receive MMIO notifications. QEMU
would create both descriptors using the KVM driver, and pass the slave
descriptor to the emulation process via an initialization message.

data structures
^^^^^^^^^^^^^^^

- guest physical range

The guest physical range structure describes the address range that a
device will respond to. It includes the base and length of the range, as
well as which bus the range resides on (e.g., on an x86 machine, it can
specify whether the range refers to memory or IO addresses).

A device can have multiple physical address ranges it responds to (e.g.,
a PCI device can have multiple BARs), so the structure will also include
an enumerated identifier to specify which of the device's ranges is
being referred to.

+--------+----------------------------+
| Name   | Description                |
+========+============================+
| addr   | range base address         |
+--------+----------------------------+
| len    | range length               |
+--------+----------------------------+
| bus    | addr type (memory or IO)   |
+--------+----------------------------+
| id     | range ID (e.g., PCI BAR)   |
+--------+----------------------------+

- MMIO request structure

This structure describes an MMIO operation. It includes which guest
physical range the MMIO was within, the offset within that range, the
MMIO type (e.g., load or store), and its length and data. It also
includes a sequence number that can be used to reply to the MMIO, and
the CPU that issued the MMIO.

+----------+------------------------+
| Name     | Description            |
+==========+========================+
| rid      | range MMIO is within   |
+----------+------------------------+
| offset   | offset within *rid*    |
+----------+------------------------+
| type     | e.g., load or store    |
+----------+------------------------+
| len      | MMIO length            |
+----------+------------------------+
| data     | store data             |
+----------+------------------------+
| seq      | sequence ID            |
+----------+------------------------+

- MMIO request queues

MMIO request queues are FIFO arrays of MMIO request structures. There
are two queues: the pending queue is for MMIOs that haven't been read by
the emulation program, and the sent queue is for MMIOs that haven't been
acknowledged. The main use of the second queue is to validate MMIO
replies from the emulation program.

- scoreboard

Each CPU in the VM is emulated in QEMU by a separate thread, so multiple
MMIOs may be waiting to be consumed by an emulation program and multiple
threads may be waiting for MMIO replies.
The scoreboard would contain a
wait queue and sequence number for the per-CPU threads, allowing them to
be individually woken when the MMIO reply is received from the emulation
program. It also tracks the number of posted MMIO stores to the device
that haven't been replied to, in order to satisfy the PCI constraint
that a load to a device will not complete until all previous stores to
that device have been completed.

- device shadow memory

Some MMIO loads do not have device side-effects. These MMIOs can be
completed without sending an MMIO request to the emulation program if the
emulation program shares a shadow image of the device's memory image
with the KVM driver.

The emulation program will ask the KVM driver to allocate memory for the
shadow image, and will then use ``mmap()`` to directly access it. The
emulation program can control KVM access to the shadow image by sending
KVM an access map telling it which areas of the image have no
side-effects (and can be completed immediately), and which require an
MMIO request to the emulation program. The access map can also inform
the KVM driver which size accesses are allowed to the image.

master descriptor
^^^^^^^^^^^^^^^^^

The master descriptor is used by QEMU to configure the new KVM device.
The descriptor would be returned by the KVM driver when QEMU issues a
*KVM\_CREATE\_DEVICE* ``ioctl()`` with a *KVM\_DEV\_TYPE\_USER* type.

KVM\_DEV\_TYPE\_USER device ops

The *KVM\_DEV\_TYPE\_USER* operations vector will be registered by a
``kvm_register_device_ops()`` call when the KVM system is initialized by
``kvm_init()``. These device ops are called by the KVM driver when QEMU
executes certain ``ioctl()`` operations on its KVM file descriptor. They
include:

- create

This routine is called when QEMU issues a *KVM\_CREATE\_DEVICE*
``ioctl()`` on its per-VM file descriptor. It will allocate and
initialize a KVM user device specific data structure, and assign the
*kvm\_device* private field to it.

- ioctl

This routine is invoked when QEMU issues an ``ioctl()`` on the master
descriptor. The ``ioctl()`` commands supported are defined by the KVM
device type. The *KVM\_DEV\_TYPE\_USER* type will need several commands:

*KVM\_DEV\_USER\_SLAVE\_FD* creates the slave file descriptor that will
be passed to the device emulation program. Only one slave can be created
by each master descriptor. The file operations performed by this
descriptor are described below.

The *KVM\_DEV\_USER\_PA\_RANGE* command configures a guest physical
address range that the slave descriptor will receive MMIO notifications
for. The range is specified by a guest physical range structure
argument. For buses that assign addresses to devices dynamically, this
command can be executed while the guest is running, such as the case
when a guest changes a device's PCI BAR registers.

*KVM\_DEV\_USER\_PA\_RANGE* will use ``kvm_io_bus_register_dev()`` to
register *kvm\_io\_device\_ops* callbacks to be invoked when the guest
performs an MMIO operation within the range. When a range is changed,
``kvm_io_bus_unregister_dev()`` is used to remove the previous
instantiation.

*KVM\_DEV\_USER\_TIMEOUT* will configure a timeout value that specifies
how long KVM will wait for the emulation process to respond to an MMIO
indication.

- destroy

This routine is called when the VM instance is destroyed. It will need
to destroy the slave descriptor, and free any memory allocated by the
driver, as well as the *kvm\_device* structure itself.
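
As a rough illustration of the guest physical range structure and of what
a *KVM\_DEV\_USER\_PA\_RANGE* style update implies, the following
user-space sketch keeps a small table of ranges and looks up which range
an access falls in. Every name in it (``struct guest_pa_range``,
``struct user_dev``, ``user_dev_set_range()``, ``user_dev_find()``) is
hypothetical; no such kernel interface exists, and the real design would
register the ranges on the KVM IO bus instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bus types for the "bus" field; the names are illustrative only. */
enum range_bus { RANGE_BUS_MEMORY, RANGE_BUS_IO };

/* Mirrors the guest physical range table: addr, len, bus, id. */
struct guest_pa_range {
    uint64_t addr;       /* range base address */
    uint64_t len;        /* range length */
    enum range_bus bus;  /* addr type (memory or IO) */
    uint32_t id;         /* range ID (e.g., PCI BAR) */
};

#define MAX_RANGES 8

struct user_dev {
    struct guest_pa_range ranges[MAX_RANGES];
    size_t nranges;
};

/* Register a range, replacing any previous range with the same id, as
 * would happen when the guest reprograms a BAR. Returns 0 on success,
 * -1 if the range is empty or the table is full. */
static int user_dev_set_range(struct user_dev *d, struct guest_pa_range r)
{
    assert(d != NULL);
    if (r.len == 0) {
        return -1;
    }
    for (size_t i = 0; i < d->nranges; i++) {
        if (d->ranges[i].id == r.id) {
            d->ranges[i] = r;
            return 0;
        }
    }
    if (d->nranges == MAX_RANGES) {
        return -1;
    }
    d->ranges[d->nranges++] = r;
    return 0;
}

/* Find the range containing an access; on a hit, store the offset
 * within that range in *offset. */
static const struct guest_pa_range *
user_dev_find(const struct user_dev *d, enum range_bus bus,
              uint64_t addr, uint64_t *offset)
{
    for (size_t i = 0; i < d->nranges; i++) {
        const struct guest_pa_range *r = &d->ranges[i];
        if (r->bus == bus && addr >= r->addr && addr - r->addr < r->len) {
            *offset = addr - r->addr;
            return r;
        }
    }
    return NULL;
}
```

Replacing by ``id`` models the unregister-then-register sequence described
above: after a BAR is moved, accesses to the old address no longer match.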

slave descriptor
^^^^^^^^^^^^^^^^

The slave descriptor will have its own file operations vector, which
responds to system calls on the descriptor performed by the device
emulation program.

- read

A read returns any pending MMIO requests from the KVM driver as MMIO
request structures. Multiple structures can be returned if there are
multiple MMIO operations pending. The MMIO requests are moved from the
pending queue to the sent queue, and if there are threads waiting for
space in the pending queue to add new MMIO operations, they will be woken
here.

- write

A write also consists of a set of MMIO requests. They are compared to
the MMIO requests in the sent queue. Matches are removed from the sent
queue, and any threads waiting for the reply are woken. If a store is
removed, then the number of posted stores in the per-CPU scoreboard is
decremented. When the number is zero, and a non side-effect load was
waiting for posted stores to complete, the load is continued.

- ioctl

There are several ``ioctl()`` calls that can be performed on the slave
descriptor.

A *KVM\_DEV\_USER\_SHADOW\_SIZE* ``ioctl()`` causes the KVM driver to
allocate memory for the shadow image. This memory can later be
``mmap()``\ ed by the emulation process to share the emulation's view of
device memory with the KVM driver.

A *KVM\_DEV\_USER\_SHADOW\_CTRL* ``ioctl()`` controls access to the
shadow image. It will send the KVM driver a shadow control map, which
specifies which areas of the image can complete guest loads without
sending the load request to the emulation program. It will also specify
the size of load operations that are allowed.

- poll

An emulation program will use the ``poll()`` call with a *POLLIN* flag
to determine if there are MMIO requests waiting to be read. It will
return if the pending MMIO request queue is not empty.

- mmap

This call allows the emulation program to directly access the shadow
image allocated by the KVM driver. As device emulation updates device
memory, changes with no side-effects will be reflected in the shadow,
and the KVM driver can satisfy guest loads from the shadow image without
needing to wait for the emulation program.
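
The interplay of the pending and sent queues during ``read()`` and
``write()`` on the slave descriptor can be sketched in plain C. This is a
single-threaded model with hypothetical names (``struct mmio_queue``,
``slave_read()``, ``slave_write()``); it leaves out sleeping and waking of
threads, and it assumes replies arrive in the order the requests were
sent, which the design above does not require.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum mmio_type { MMIO_LOAD, MMIO_STORE };

/* Mirrors the MMIO request structure table. */
struct mmio_req {
    uint32_t rid;         /* range the MMIO is within */
    uint64_t offset;      /* offset within rid */
    enum mmio_type type;  /* load or store */
    uint32_t len;         /* MMIO length in bytes */
    uint64_t data;        /* store data */
    uint64_t seq;         /* sequence ID used to match the reply */
};

#define QUEUE_DEPTH 4

/* Fixed-size FIFO of MMIO requests. */
struct mmio_queue {
    struct mmio_req slot[QUEUE_DEPTH];
    size_t head;   /* index of the oldest entry */
    size_t count;  /* number of valid entries */
};

static bool queue_push(struct mmio_queue *q, struct mmio_req r)
{
    if (q->count == QUEUE_DEPTH) {
        return false;
    }
    q->slot[(q->head + q->count) % QUEUE_DEPTH] = r;
    q->count++;
    return true;
}

static bool queue_pop(struct mmio_queue *q, struct mmio_req *out)
{
    if (q->count == 0) {
        return false;
    }
    *out = q->slot[q->head];
    q->head = (q->head + 1) % QUEUE_DEPTH;
    q->count--;
    return true;
}

/* read(): copy up to max pending requests into buf and move each one to
 * the sent queue. Stops early if the sent queue is full. */
static size_t slave_read(struct mmio_queue *pending, struct mmio_queue *sent,
                         struct mmio_req *buf, size_t max)
{
    size_t n = 0;

    assert(buf != NULL || max == 0);
    while (n < max && pending->count > 0 && sent->count < QUEUE_DEPTH) {
        queue_pop(pending, &buf[n]);
        queue_push(sent, buf[n]);
        n++;
    }
    return n;
}

/* write(): accept a reply only if it matches the oldest sent request. */
static bool slave_write(struct mmio_queue *sent, uint64_t seq)
{
    struct mmio_req done;

    if (sent->count == 0 || sent->slot[sent->head].seq != seq) {
        return false;
    }
    return queue_pop(sent, &done);
}
```

Rejecting a reply whose sequence ID does not match is the validation role
the sent queue is described as having.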

kvm\_io\_device ops
^^^^^^^^^^^^^^^^^^^

Each KVM per-CPU thread can handle MMIO operations on behalf of the guest
VM. KVM will use the MMIO's guest physical address to search for a
matching *kvm\_io\_device* to see if the MMIO can be handled by the KVM
driver instead of exiting back to QEMU. If a match is found, the
corresponding callback will be invoked.

- read

This callback is invoked when the guest performs a load to the device.
Loads with side-effects must be handled synchronously, with the KVM
driver putting the QEMU thread to sleep waiting for the emulation
process reply before re-starting the guest. Loads that do not have
side-effects may be optimized by satisfying them from the shadow image,
if there are no outstanding stores to the device by this CPU. PCI memory
ordering demands that a load cannot complete before all older stores to
the same device have been completed.

- write

Stores can be handled asynchronously unless the pending MMIO request
queue is full. In this case, the QEMU thread must sleep waiting for
space in the queue. Stores will increment the number of posted stores in
the per-CPU scoreboard, in order to implement the PCI ordering
constraint above.
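
The ordering rule shared by the read and write callbacks reduces to a
counter check. The sketch below models only that counter; the names
(``struct scoreboard``, ``load_can_use_shadow()``) are hypothetical, and
the wait queue and sequence number that the scoreboard would also hold
are omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-CPU scoreboard entry; only the posted-store counter
 * from the text is modeled here. */
struct scoreboard {
    uint32_t posted_stores;  /* stores queued but not yet acknowledged */
};

/* Called from the store callback once the request has been queued. */
static void scoreboard_store_posted(struct scoreboard *sb)
{
    sb->posted_stores++;
}

/* Called when the emulation program acknowledges a store. */
static void scoreboard_store_done(struct scoreboard *sb)
{
    assert(sb->posted_stores > 0);
    sb->posted_stores--;
}

/* A load may be served from the shadow image only if it has no
 * side-effects and no older stores from this CPU are outstanding. */
static bool load_can_use_shadow(const struct scoreboard *sb,
                                bool side_effect_free)
{
    return side_effect_free && sb->posted_stores == 0;
}
```

When the check fails for a side-effect-free load, the load has to wait
until the counter drops to zero, as described for the slave descriptor's
write operation.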

interrupt acceleration
^^^^^^^^^^^^^^^^^^^^^^

This performance optimization would work much like a vhost user
application does, where the QEMU process sets up *eventfds* that cause
the device's corresponding interrupt to be triggered by the KVM driver.
These irq file descriptors are sent to the emulation process at
initialization, and are used when the emulation code raises a device
interrupt.

intx acceleration
'''''''''''''''''

Traditional PCI pin interrupts are level based, so, in addition to an
irq file descriptor, a re-sampling file descriptor needs to be sent to
the emulation program. This second file descriptor allows multiple
devices sharing an irq to be notified when the interrupt has been
acknowledged by the guest, so they can re-trigger the interrupt if their
device has not de-asserted its interrupt.

intx irq descriptor

The irq descriptors are created by the proxy object
using ``event_notifier_init()`` to create the irq and re-sampling
*eventfds*, and ``kvm_vm_ioctl(KVM_IRQFD)`` to bind them to an interrupt.
The interrupt route can be found with
``pci_device_route_intx_to_irq()``.
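
For readers unfamiliar with *eventfds*, the following Linux-only sketch
shows the raw mechanism that an irq descriptor relies on. It uses the
``eventfd()`` system call directly, not QEMU's event notifier wrappers,
and the helper names (``irq_fd_create()``, ``irq_fd_raise()``,
``irq_fd_drain()``) are hypothetical. In the design above the KVM driver,
not user code, would consume the signal once the descriptor is bound with
*KVM\_IRQFD*; the drain helper exists only to make the counter visible.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Create an irq descriptor. Returns the fd, or -1 on error. */
static int irq_fd_create(void)
{
    return eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
}

/* Signal the interrupt: adds 1 to the eventfd counter. */
static int irq_fd_raise(int fd)
{
    uint64_t one = 1;

    assert(fd >= 0);
    return write(fd, &one, sizeof(one)) == (ssize_t)sizeof(one) ? 0 : -1;
}

/* Read and reset the counter. Returns the number of signals since the
 * last drain, or 0 if none are pending. */
static uint64_t irq_fd_drain(int fd)
{
    uint64_t count = 0;

    if (read(fd, &count, sizeof(count)) != (ssize_t)sizeof(count)) {
        return 0;
    }
    return count;
}
```

The emulation process would only ever need the equivalent of
``irq_fd_raise()`` on a descriptor it received at initialization.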

intx routing changes

Intx routing can be changed when the guest programs the APIC the device
pin is connected to. The proxy object in QEMU will use
``pci_device_set_intx_routing_notifier()`` to be informed of any guest
changes to the route. This handler will broadly follow the VFIO
interrupt logic to change the route: de-assigning the existing irq
descriptor from its route, then assigning it the new route. (see
``vfio_intx_update()``)

MSI/X acceleration
''''''''''''''''''

MSI/X interrupts are sent as DMA transactions to the host. The interrupt
data contains a vector that is programmed by the guest. A device may have
multiple MSI interrupts associated with it, so multiple irq descriptors
may need to be sent to the emulation program.

MSI/X irq descriptor

This case will also follow the VFIO example. For each MSI/X interrupt,
an *eventfd* is created, a virtual interrupt is allocated by
``kvm_irqchip_add_msi_route()``, and the virtual interrupt is bound to
the eventfd with ``kvm_irqchip_add_irqfd_notifier()``.

MSI/X config space changes

The guest may dynamically update several MSI-related tables in the
device's PCI config space. These include per-MSI interrupt enables and
vector data. Additionally, MSIX tables exist in device memory space, not
config space. Much like the BAR case above, the proxy object must look
at guest config space programming to keep the MSI interrupt state
consistent between QEMU and the emulation program.

--------------

Disaggregated CPU emulation
---------------------------

After IO services have been disaggregated, a second phase would be to
separate a process to handle CPU instruction emulation from the main
QEMU control function. There are no object separation points for this
code, so the first task would be to create one.

Host access controls
--------------------

Separating QEMU relies on the host OS's access restriction mechanisms to
enforce that the differing processes can only access the objects they
are entitled to. There are a couple of types of mechanisms usually
provided by general purpose OSs.

Discretionary access control
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Discretionary access control allows each user to control who can access
their files. In Linux, this type of control is usually too coarse for
QEMU separation, since it only provides three separate access controls:
one for the same user ID, the second for user IDs with the same group
ID, and the third for all other user IDs. Each device instance would
need a separate user ID to provide access control, which is likely to be
unwieldy for dynamically created VMs.

Mandatory access control
~~~~~~~~~~~~~~~~~~~~~~~~

Mandatory access control allows the OS to add an additional set of
controls on top of discretionary access for the OS to control. It also
adds other attributes to processes and files such as types, roles, and
categories, and can establish rules for how processes and files can
interact.

Type enforcement
^^^^^^^^^^^^^^^^

Type enforcement assigns a *type* attribute to processes and files, and
allows rules to be written on what operations a process with a given
type can perform on a file with a given type.
QEMU separation could take
advantage of type enforcement by running the emulation processes with
different types, both from the main QEMU process, and from the emulation
processes of different classes of devices.

For example, guest disk images and disk emulation processes could have
types separate from the main QEMU process and non-disk emulation
processes, and the type rules could prevent processes other than disk
emulation ones from accessing guest disk images. Similarly, network
emulation processes can have a type separate from the main QEMU process
and non-network emulation processes, and only that type can access the
host tun/tap device used to provide guest networking.

Category enforcement
^^^^^^^^^^^^^^^^^^^^

Category enforcement assigns a set of numbers within a given range to
the process or file. The process is granted access to the file if the
process's set is a superset of the file's set. This enforcement can be
used to separate multiple instances of devices in the same class.

For example, if there are multiple disk devices provided to a guest,
each device emulation process could be provisioned with a separate
category. The different device emulation processes would not be able to
access each other's backing disk images.
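
The superset rule can be written down directly if a category set is
modeled as a bitmask. This is only an illustration of the rule, not of
any particular mandatory access control implementation; the 64-category
limit and the names ``category_set``, ``category()`` and
``category_allows()`` are assumptions of the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* A category set as a bitmask: bit n set means category n is present.
 * Limiting the range to 64 categories is an assumption of this sketch. */
typedef uint64_t category_set;

/* Build a set containing the single category n. */
static category_set category(unsigned n)
{
    assert(n < 64);
    return (category_set)1 << n;
}

/* Access is granted when the process's set is a superset of the file's
 * set, i.e. the file has no category that the process lacks. */
static bool category_allows(category_set process, category_set file)
{
    return (file & ~process) == 0;
}
```

With two disk emulation processes given categories 10 and 11, and each
backing image labeled with the matching category, neither process passes
the check for the other's image.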

Alternatively, categories could be used in lieu of the type enforcement
scheme described above. In this scenario, different categories would be
used to prevent device emulation processes in different classes from
accessing resources assigned to other classes.