Multi-process QEMU
===================

.. note::

  This is the design document for multi-process QEMU. It does not
  necessarily reflect the status of the current implementation, which
  may lack features or be considerably different from what is described
  in this document. This document is still useful as a description of
  the goals and general direction of this feature.

  Please refer to the following wiki for the latest details:
  https://wiki.qemu.org/Features/MultiProcessQEMU

QEMU is often used as the hypervisor for virtual machines running in the
Oracle cloud. Since one of the advantages of cloud computing is the
ability to run many VMs from different tenants in the same cloud
infrastructure, a guest that compromised its hypervisor could
potentially use the hypervisor's access privileges to access data it is
not authorized for.

QEMU can be susceptible to security attacks because it is a large,
monolithic program that provides many features to the VMs it services.
Many of these features can be configured out of QEMU, but even a reduced
configuration QEMU has a large amount of code a guest can potentially
attack. Separating QEMU into multiple processes reduces the attack
surface by helping to limit each component in the system to only access
the resources that it needs to perform its job.

QEMU services
-------------

QEMU can be broadly described as providing three main services. One is a
VM control point, where VMs can be created, migrated, re-configured, and
destroyed. A second is to emulate the CPU instructions within the VM,
often accelerated by HW virtualization features such as Intel's VT
extensions. Finally, it provides IO services to the VM by emulating HW
IO devices, such as disk and network devices.

A multi-process QEMU
~~~~~~~~~~~~~~~~~~~~

A multi-process QEMU involves separating QEMU services into separate
host processes.
Each of these processes can be given only the privileges
it needs to provide its service, e.g., a disk service could be given
access only to the disk images it provides, and not be allowed to
access other files, or any network devices. An attacker who compromised
this service would not be able to use this exploit to access files or
devices beyond what the disk service was given access to.

A QEMU control process would remain, but in multi-process mode, it would
have no direct interfaces to the VM. During VM execution, it would still
provide the user interface to hot-plug devices or live migrate the VM.

A first step in creating a multi-process QEMU is to separate IO services
from the main QEMU program, which would continue to provide CPU
emulation; i.e., the control process would also be the CPU emulation
process. In a later phase, CPU emulation could be separated from the
control process.

Separating IO services
----------------------

Separating IO services into individual host processes is a good place to
begin for a couple of reasons. One is that the sheer number of IO devices
QEMU can emulate provides a large surface of interfaces which could
potentially be exploited, and, indeed, has been a source of exploits in
the past. Another is that the modular nature of QEMU device emulation
code provides interface points where the QEMU functions that perform
device emulation can be separated from the QEMU functions that manage
the emulation of guest CPU instructions. The devices emulated in the
separate process are referred to as remote devices.

QEMU device emulation
~~~~~~~~~~~~~~~~~~~~~

QEMU uses an object oriented SW architecture for device emulation code.
Configured objects are all compiled into the QEMU binary, then objects
are instantiated by name when used by the guest VM.
For example, the
code to emulate a device named "foo" is always present in QEMU, but its
instantiation code is only run when the device is included in the target
VM (e.g., via the QEMU command line as *-device foo*).

The object model is hierarchical, so device emulation code names its
parent object (such as "pci-device" for a PCI device) and QEMU will
instantiate a parent object before calling the device's instantiation
code.

Current separation models
~~~~~~~~~~~~~~~~~~~~~~~~~

In order to separate the device emulation code from the CPU emulation
code, the device object code must run in a different process. There are
a couple of existing QEMU features that can run emulation code
separately from the main QEMU process. These are examined below.

vhost user model
^^^^^^^^^^^^^^^^

Virtio guest device drivers can be connected to vhost user applications
in order to perform their IO operations. This model uses special virtio
device drivers in the guest and vhost user device objects in QEMU, but
once the QEMU vhost user code has configured the vhost user application,
mission-mode IO is performed by the application. The vhost user
application is a daemon process that can be contacted via a known UNIX
domain socket.

vhost socket
''''''''''''

As mentioned above, one of the tasks of the vhost device object within
QEMU is to contact the vhost application and send it configuration
information about this device instance. As part of the configuration
process, the application can also be sent other file descriptors over
the socket, which then can be used by the vhost user application in
various ways, some of which are described below.

vhost MMIO store acceleration
'''''''''''''''''''''''''''''

VMs are often run using HW virtualization features via the KVM kernel
driver.
This driver allows QEMU to accelerate the emulation of guest CPU
instructions by running the guest in a virtual HW mode. When the guest
executes instructions that cannot be executed by virtual HW mode,
execution returns to the KVM driver so it can inform QEMU to emulate the
instructions in SW.

One of the events that can cause a return to QEMU is when a guest device
driver accesses an IO location. QEMU then dispatches the memory
operation to the corresponding QEMU device object. In the case of a
vhost user device, the memory operation would need to be sent over a
socket to the vhost application. This path is accelerated by the QEMU
virtio code by setting up an eventfd file descriptor through which the
vhost application can directly receive MMIO store notifications from the
KVM driver, instead of needing them to be sent to the QEMU process first.

vhost interrupt acceleration
''''''''''''''''''''''''''''

Another optimization used by the vhost application is the ability to
directly inject interrupts into the VM via the KVM driver, again
bypassing the need to send the interrupt back to the QEMU process first.
The QEMU virtio setup code configures the KVM driver with an eventfd
that triggers the device interrupt in the guest when the eventfd is
written. This irqfd file descriptor is then passed to the vhost user
application program.

vhost access to guest memory
''''''''''''''''''''''''''''

The vhost application is also allowed to directly access guest memory,
instead of needing to send the data as messages to QEMU. This is also
done with file descriptors sent to the vhost user application by QEMU.
These descriptors can be passed to ``mmap()`` by the vhost application
to map the guest address space into the vhost application.
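
The descriptor-based sharing described above can be illustrated with a
minimal sketch. It is not vhost or QEMU code: ``struct guest_region``,
``map_guest_region()`` and ``demo_shared_mapping()`` are hypothetical names
introduced here, and a temporary file stands in for the file-backed guest
RAM whose descriptor would normally arrive over the socket.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical description of one guest RAM region received over the socket. */
struct guest_region {
    int    fd;      /* descriptor backing the region */
    off_t  offset;  /* page-aligned offset of the region within the file */
    size_t size;    /* length of the region in bytes */
};

/* Map one region into this process; returns NULL on failure. */
void *map_guest_region(const struct guest_region *r)
{
    assert(r != NULL && r->size > 0);
    void *p = mmap(NULL, r->size, PROT_READ | PROT_WRITE, MAP_SHARED,
                   r->fd, r->offset);
    return p == MAP_FAILED ? NULL : p;
}

/*
 * Stand-in for the real setup: a temporary file plays the role of guest RAM
 * and is mapped twice, once for "QEMU" and once for "the application".
 * Returns 0 if a store through one mapping is visible through the other.
 */
int demo_shared_mapping(void)
{
    FILE *f = tmpfile();
    if (f == NULL) {
        return -1;
    }
    struct guest_region r = { .fd = fileno(f), .offset = 0, .size = 4096 };
    if (ftruncate(r.fd, (off_t)r.size) != 0) {
        fclose(f);
        return -1;
    }
    unsigned char *qemu_view = map_guest_region(&r);
    unsigned char *app_view = map_guest_region(&r);
    int ok = -1;
    if (qemu_view != NULL && app_view != NULL) {
        qemu_view[16] = 0x5a;                 /* store through one mapping */
        ok = (app_view[16] == 0x5a) ? 0 : -1; /* visible through the other */
    }
    if (qemu_view != NULL) {
        munmap(qemu_view, r.size);
    }
    if (app_view != NULL) {
        munmap(app_view, r.size);
    }
    fclose(f);
    return ok;
}
```

``MAP_SHARED`` is the important detail: with a private mapping, stores made
by one process would not be visible to the other, and the application could
not exchange data with the guest without messages through QEMU.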

IOMMUs introduce another level of complexity, since the address given to
the guest virtio device to DMA to or from is not a guest physical
address. This case is handled by having vhost code within QEMU register
as a listener for IOMMU mapping changes. The vhost application maintains
a cache of IOMMU translations: sending translation requests back to
QEMU on cache misses, and in turn receiving flush requests from QEMU
when mappings are purged.

applicability to device separation
''''''''''''''''''''''''''''''''''

Much of the vhost model can be re-used by separated device emulation. In
particular, the ideas of using a socket between QEMU and the device
emulation application, using a file descriptor to inject interrupts into
the VM via KVM, and allowing the application to ``mmap()`` guest memory
should be re-used.

There are, however, some notable differences between how a vhost
application works and the needs of separated device emulation. The most
basic is that vhost uses custom virtio device drivers which always
trigger IO with MMIO stores. A separated device emulation model must
work with existing IO device models and guest device drivers. MMIO loads
break vhost store acceleration since they are synchronous: the guest
cannot make progress until the load has been emulated. By contrast,
stores are asynchronous; the guest can continue after the store event
has been sent to the vhost application.

Another difference is that in the vhost user model, a single daemon can
support multiple QEMU instances. This is contrary to the security regime
desired, in which the emulation application should only be allowed to
access the files or devices the VM it's running on behalf of can access.

qemu-io model
^^^^^^^^^^^^^

``qemu-io`` is a test harness used to test changes to the QEMU block backend
object code (e.g., the code that implements disk images for disk driver
emulation).
``qemu-io`` is not a device emulation application per se, but it
does compile the QEMU block objects into a separate binary from the main
QEMU one. This could be useful for disk device emulation, since its
emulation applications will need to include the QEMU block objects.

New separation model based on proxy objects
-------------------------------------------

A different model based on proxy objects in the QEMU program
communicating with remote emulation programs could provide separation
while minimizing the changes needed to the device emulation code. The
rest of this section is a discussion of how a proxy object model would
work.

Remote emulation processes
~~~~~~~~~~~~~~~~~~~~~~~~~~

The remote emulation process will run the QEMU object hierarchy without
modification. The device emulation objects will also be based on the
QEMU code, because for anything but the simplest device, it would not be
tractable to re-implement both the object model and the many device
backends that QEMU has.

The processes will communicate with the QEMU process over UNIX domain
sockets. The processes can be executed either as standalone processes,
or be executed by QEMU. In both cases, the host backends an emulation
process will provide are specified on its command line, as they would
be for QEMU. For example:

::

  disk-proc -blockdev driver=file,node-name=file0,filename=disk-file0  \
  -blockdev driver=qcow2,node-name=drive0,file=file0

would indicate process *disk-proc* uses a qcow2 emulated disk (node
*drive0*, stored in the host file *disk-file0* through node *file0*) as
its backend.

Emulation processes may emulate more than one guest controller. A common
configuration might be to put all controllers of the same device class
(e.g., disk, network, etc.) in a single process, so that all backends of
the same type can be managed by a single QMP monitor.

communication with QEMU
^^^^^^^^^^^^^^^^^^^^^^^

The first argument to the remote emulation process will be a Unix domain
socket that connects with the Proxy object. This is a required argument.

::

  disk-proc <socket number> <backend list>

remote process QMP monitor
^^^^^^^^^^^^^^^^^^^^^^^^^^

Remote emulation processes can be monitored via QMP, similar to QEMU
itself. The QMP monitor socket is specified the same as for a QEMU
process:

::

  disk-proc -qmp unix:/tmp/disk-mon,server

can be monitored over the UNIX socket path */tmp/disk-mon*.

QEMU command line
~~~~~~~~~~~~~~~~~

Each remote device emulated in a remote process on the host is
represented as a *-device* of type *pci-proxy-dev*. A socket
sub-option to this option specifies the Unix socket that connects
to the remote process. An *id* sub-option is required, and it should
be the same id as used in the remote process.

::

  qemu-system-x86_64 ... -device pci-proxy-dev,id=lsi0,socket=3

can be used to add a device emulated in a remote process.


QEMU management of remote processes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

QEMU is not aware of the type of the remote PCI device. It is
a pass-through device as far as QEMU is concerned.

communication with emulation process
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

primary channel
'''''''''''''''

The primary channel (referred to as com in the code) is used to bootstrap
the remote process. It is also used to pass on device-agnostic commands
like reset.

per-device channels
'''''''''''''''''''

Each remote device communicates with QEMU using a dedicated communication
channel. The proxy object sets up this channel using the primary
channel during its initialization.

QEMU device proxy objects
~~~~~~~~~~~~~~~~~~~~~~~~~

QEMU has an object model based on sub-classes inherited from the
"object" super-class. The sub-classes that are of interest here are the
"device" and "bus" sub-classes whose child sub-classes make up the
device tree of a QEMU emulated system.

The proxy object model will use device proxy objects to replace the
device emulation code within the QEMU process. These objects will live
in the same place in the object and bus hierarchies as the objects they
replace; i.e., the proxy object for an LSI SCSI controller will be a
sub-class of the "pci-device" class, and will have the same PCI bus
parent and the same SCSI bus child objects as the LSI controller object
it replaces.

It is worth noting that the same proxy object is used to mediate with
all types of remote PCI devices.

object initialization
^^^^^^^^^^^^^^^^^^^^^

The Proxy device objects are initialized in the exact same manner in
which any other QEMU device would be initialized.

In addition, the Proxy objects perform the following two tasks:

- Parse the "socket" sub-option and connect to the remote process
  using this channel
- Use the "id" sub-option to connect to the emulated device on the
  separate process

class\_init
'''''''''''

The ``class_init()`` method of a proxy object will, in general, behave
similarly to the object it replaces, including setting any static
properties and methods needed by the proxy.

instance\_init / realize
''''''''''''''''''''''''

The ``instance_init()`` and ``realize()`` functions would only need to
perform tasks related to being a proxy, such as registering its own
MMIO handlers, or creating a child bus that other proxy devices can be
attached to later.

Other tasks will be device-specific.
For example, PCI device objects
will initialize the PCI config space in order to make a valid PCI device
tree within the QEMU process.

address space registration
^^^^^^^^^^^^^^^^^^^^^^^^^^

Most devices are driven by guest device driver accesses to IO addresses
or ports. The QEMU device emulation code uses QEMU's memory region
function calls (such as ``memory_region_init_io()``) to add callback
functions that QEMU will invoke when the guest accesses the device's
areas of the IO address space. When a guest driver does access the
device, the VM will exit HW virtualization mode and return to QEMU,
which will then look up and execute the corresponding callback function.

A proxy object would need to mirror the memory region calls the actual
device emulator would perform in its initialization code, but with its
own callbacks. When invoked by QEMU as a result of a guest IO operation,
they will forward the operation to the device emulation process.

PCI config space
^^^^^^^^^^^^^^^^

PCI devices also have a configuration space that can be accessed by the
guest driver. Guest accesses to this space are not handled by the device
emulation object, but by its PCI parent object. Much of this space is
read-only, but certain registers (especially BAR and MSI-related ones)
need to be propagated to the emulation process.

PCI parent proxy
''''''''''''''''

One way to propagate guest PCI config accesses is to create a
"pci-device-proxy" class that can serve as the parent of a PCI device
proxy object. This class's parent would be "pci-device" and it would
override the PCI parent's ``config_read()`` and ``config_write()``
methods with ones that forward these operations to the emulation
program.
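
The override described above can be sketched in plain C, outside of QEMU's
object model. Everything in this example is a hypothetical stand-in: the
``pci_dev`` and ``pci_dev_class`` structures only mimic the idea of a class
with replaceable ``config_read()`` and ``config_write()`` methods, and
``send_to_remote()`` records the message in memory instead of writing it to
a socket. The sketch also assumes that reads are answered from a local copy
of config space; a real proxy might wait for a reply from the emulation
process instead.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical wire message for one forwarded config-space access. */
struct cfg_msg {
    uint8_t  is_write;  /* 1 = config write, 0 = config read */
    uint32_t addr;      /* offset into config space */
    uint32_t val;       /* value written (unused for reads) */
    uint8_t  len;       /* access size in bytes: 1, 2 or 4 */
};

struct pci_dev;

/* Stand-in for a device class: methods that a sub-class can override. */
struct pci_dev_class {
    uint32_t (*config_read)(struct pci_dev *d, uint32_t addr, int len);
    void (*config_write)(struct pci_dev *d, uint32_t addr, uint32_t val,
                         int len);
};

struct pci_dev {
    const struct pci_dev_class *klass;
    uint8_t config[256];        /* local copy of config space */
    struct cfg_msg last_sent;   /* stands in for the socket to the remote */
    unsigned sent_count;
};

/* Parent behaviour: little-endian access to the local config space copy. */
uint32_t default_config_read(struct pci_dev *d, uint32_t addr, int len)
{
    uint32_t val = 0;
    assert(len > 0 && len <= 4 && addr + (uint32_t)len <= sizeof(d->config));
    for (int i = 0; i < len; i++) {
        val |= (uint32_t)d->config[addr + i] << (8 * i);
    }
    return val;
}

void default_config_write(struct pci_dev *d, uint32_t addr, uint32_t val,
                          int len)
{
    assert(len > 0 && len <= 4 && addr + (uint32_t)len <= sizeof(d->config));
    for (int i = 0; i < len; i++) {
        d->config[addr + i] = (uint8_t)(val >> (8 * i));
    }
}

/* A real proxy would write the message to its channel here. */
void send_to_remote(struct pci_dev *d, const struct cfg_msg *m)
{
    d->last_sent = *m;
    d->sent_count++;
}

/* Proxy overrides: keep the local copy valid, then forward the access. */
uint32_t proxy_config_read(struct pci_dev *d, uint32_t addr, int len)
{
    struct cfg_msg m = { .is_write = 0, .addr = addr, .val = 0,
                         .len = (uint8_t)len };
    send_to_remote(d, &m);
    return default_config_read(d, addr, len);
}

void proxy_config_write(struct pci_dev *d, uint32_t addr, uint32_t val,
                        int len)
{
    struct cfg_msg m = { .is_write = 1, .addr = addr, .val = val,
                         .len = (uint8_t)len };
    default_config_write(d, addr, val, len);
    send_to_remote(d, &m);
}

const struct pci_dev_class pci_proxy_class = {
    .config_read = proxy_config_read,
    .config_write = proxy_config_write,
};
```

A guest write of a BAR value would then reach the proxy as
``d.klass->config_write(&d, 0x10, value, 4)``, update the local config
space, and leave one message queued for the emulation process.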

interrupt receipt
^^^^^^^^^^^^^^^^^

A proxy for a device that generates interrupts will need to create a
socket to receive interrupt indications from the emulation process. An
incoming interrupt indication would then be sent up to its bus parent to
be injected into the guest. For example, a PCI device object may use
``pci_set_irq()``.

live migration
^^^^^^^^^^^^^^

The proxy will register to save and restore any *vmstate* it needs over
a live migration event. The device proxy does not need to manage the
remote device's *vmstate*; that will be handled by the remote process
proxy (see below).

QEMU remote device operation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Generic device operations, such as DMA, will be performed by the remote
process proxy by sending messages to the remote process.

DMA operations
^^^^^^^^^^^^^^

DMA operations would be handled much like vhost applications do. One of
the initial messages sent to the emulation process is a guest memory
table. Each entry in this table consists of a file descriptor and size
that the emulation process can ``mmap()`` to directly access guest
memory, similar to ``vhost_user_set_mem_table()``. Note that guest memory
must be backed by shared file-backed memory, for example, using
*-object memory-backend-file,share=on* and setting that memory backend
as RAM for the machine.

IOMMU operations
^^^^^^^^^^^^^^^^

When the emulated system includes an IOMMU, the remote process proxy in
QEMU will need to create a socket for IOMMU requests from the emulation
process. It will handle those requests with an
``address_space_get_iotlb_entry()`` call. In order to handle IOMMU
unmaps, the remote process proxy will also register as a listener on the
device's DMA address space.
When an IOMMU memory region is created
within the DMA address space, an IOMMU notifier for unmaps will be added
to the memory region that will forward unmaps to the emulation process
over the IOMMU socket.

device hot-plug via QMP
^^^^^^^^^^^^^^^^^^^^^^^

A QMP "device\_add" command can add a device emulated by a remote
process. The command will also accept a "rid" option, just as the
*-device* command line option does. The remote process may either be one
started at QEMU startup, or be one added by the "add-process" QMP
command described above. In either case, the remote process proxy will
forward the new device's JSON description to the corresponding emulation
process.

live migration
^^^^^^^^^^^^^^

The remote process proxy will also register for live migration
notifications with ``vmstate_register()``. When called to save state,
the proxy will send the remote process a secondary socket file
descriptor to save the remote process's device *vmstate* over. The
incoming byte stream length and data will be saved as the proxy's
*vmstate*. When the proxy is resumed on its new host, this *vmstate*
will be extracted, and a secondary socket file descriptor will be sent
to the new remote process through which it receives the *vmstate* in
order to restore the devices there.

device emulation in remote process
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The parts of QEMU that the emulation program will need include the
object model; the memory emulation objects; the device emulation objects
of the targeted device, and any dependent devices; and the device's
backends. It will also need code to set up the machine environment,
handle requests from the QEMU process, and route machine-level requests
(such as interrupts or IOMMU mappings) back to the QEMU process.

initialization
^^^^^^^^^^^^^^

The process initialization sequence will follow the same sequence
followed by QEMU. It will first initialize the backend objects, then
device emulation objects. The JSON descriptions sent by the QEMU process
will drive which objects need to be created.

-  address spaces

Before the device objects are created, the initial address spaces and
memory regions must be configured with ``memory_map_init()``. This
creates a RAM memory region object (*system\_memory*) and an IO memory
region object (*system\_io*).

-  RAM

RAM memory region creation will follow how ``pc_memory_init()`` creates
them, but must use ``memory_region_init_ram_from_fd()`` instead of
``memory_region_allocate_system_memory()``. The file descriptors needed
will be supplied by the guest memory table from above. Those RAM regions
would then be added to the *system\_memory* memory region with
``memory_region_add_subregion()``.

-  PCI

IO initialization will be driven by the JSON descriptions sent from the
QEMU process. For a PCI device, a PCI bus will need to be created with
``pci_root_bus_new()``, and a PCI memory region will need to be created
and added to the *system\_memory* memory region with
``memory_region_add_subregion_overlap()``. The overlap version is
required for architectures where PCI memory overlaps with RAM memory.

MMIO handling
^^^^^^^^^^^^^

The device emulation objects will use ``memory_region_init_io()`` to
install their MMIO handlers, and ``pci_register_bar()`` to associate
those handlers with a PCI BAR, as they do within QEMU currently.

In order to use ``address_space_rw()`` in the emulation process to
handle MMIO requests from QEMU, the PCI physical addresses must be the
same in the QEMU process and the device emulation process.
In order to
accomplish that, guest BAR programming must also be forwarded from QEMU
to the emulation process.

interrupt injection
^^^^^^^^^^^^^^^^^^^

When device emulation wants to inject an interrupt into the VM, the
request climbs the device's bus object hierarchy until the point where a
bus object knows how to signal the interrupt to the guest. The details
depend on the type of interrupt being raised.

-  PCI pin interrupts

On x86 systems, there is an emulated IOAPIC object attached to the root
PCI bus object, and the root PCI object forwards interrupt requests to
it. The IOAPIC object, in turn, calls the KVM driver to inject the
corresponding interrupt into the VM. The simplest way to handle this in
an emulation process would be to set up the root PCI bus driver (via
``pci_bus_irqs()``) to send an interrupt request back to the QEMU
process, and have the device proxy object reflect it up the PCI tree
there.

-  PCI MSI/X interrupts

PCI MSI/X interrupts are implemented in HW as DMA writes to a
CPU-specific PCI address. In QEMU on x86, a KVM APIC object receives
these DMA writes, then calls into the KVM driver to inject the interrupt
into the VM. A simple emulation process implementation would be to send
the MSI DMA address from QEMU as a message at initialization, then
install an address space handler at that address which forwards the MSI
message back to QEMU.

DMA operations
^^^^^^^^^^^^^^

When an emulation object wants to DMA into or out of guest memory, it
first must use ``dma_memory_map()`` to convert the DMA address to a local
virtual address. The emulation process memory region objects set up above
will be used to translate the DMA address to a local virtual address the
device emulation code can access.
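
The address translation described above can be illustrated with a
simplified sketch. It does not use QEMU's memory API: ``struct ram_region``
and ``dma_to_local()`` are hypothetical names, and the table stands in for
the RAM memory regions the emulation process created from the guest memory
table. It also assumes that no IOMMU is present, so that a DMA address is
a guest physical address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of one guest RAM region mapped into this process. */
struct ram_region {
    uint64_t gpa_base;  /* guest physical base address of the region */
    uint64_t size;      /* region length in bytes */
    uint8_t *host;      /* local virtual address the region is mapped at */
};

/*
 * Translate the guest physical range [addr, addr + len) to a local pointer.
 * Returns NULL when the range is not fully contained in a single region.
 * The bounds check is written so that it cannot overflow.
 */
void *dma_to_local(const struct ram_region *tbl, size_t n,
                   uint64_t addr, uint64_t len)
{
    assert(tbl != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        const struct ram_region *r = &tbl[i];
        if (addr >= r->gpa_base && len <= r->size &&
            addr - r->gpa_base <= r->size - len) {
            return r->host + (addr - r->gpa_base);
        }
    }
    return NULL;
}
```

Device emulation code would call ``dma_to_local()`` with the address taken
from a guest descriptor and then read or write through the returned
pointer. A ``NULL`` result corresponds to a DMA address that does not refer
to guest RAM known to the emulation process.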

IOMMU
^^^^^

When an IOMMU is in use in QEMU, DMA translation uses IOMMU memory
regions to translate the DMA address to a guest physical address before
that physical address can be translated to a local virtual address. The
emulation process will need similar functionality.

-  IOTLB cache

The emulation process will maintain a cache of recent IOMMU translations
(the IOTLB). When the ``translate()`` callback of an IOMMU memory region is
invoked, the IOTLB cache will be searched for an entry that will map the
DMA address to a guest PA. On a cache miss, a message will be sent back
to QEMU requesting the corresponding translation entry, which will both
be used to return a guest address and be added to the cache.

-  IOTLB purge

The IOMMU emulation will also need to act on unmap requests from QEMU.
These happen when the guest IOMMU driver purges an entry from the
guest's translation table.

live migration
^^^^^^^^^^^^^^

When a remote process receives a live migration indication from QEMU, it
will set up a channel using the received file descriptor with
``qio_channel_socket_new_fd()``. This channel will be used to create a
*QEMUFile* that can be passed to ``qemu_save_device_state()`` to send
the process's device state back to QEMU. This process is reversed on
restore: the channel will be passed to ``qemu_loadvm_state()`` to
restore the device state.

Accelerating device emulation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The messages that are required to be sent between QEMU and the emulation
process can add considerable latency to IO operations. The optimizations
described below attempt to ameliorate this effect by allowing the
emulation process to communicate directly with the kernel KVM driver.
The KVM file descriptors created would be passed to the emulation process
via initialization messages, much as the guest memory table is.

MMIO acceleration
^^^^^^^^^^^^^^^^^

Vhost user applications can receive guest virtio driver stores directly
from KVM. The issue with the eventfd mechanism used by vhost user is
that it does not pass any data with the event indication, so it cannot
handle guest loads or guest stores that carry store data. This concept
could, however, be expanded to cover more cases.

The expanded idea would require a new type of KVM device:
*KVM\_DEV\_TYPE\_USER*. This device has two file descriptors: a master
descriptor that QEMU can use for configuration, and a slave descriptor
that the emulation process can use to receive MMIO notifications. QEMU
would create both descriptors using the KVM driver, and pass the slave
descriptor to the emulation process via an initialization message.

data structures
^^^^^^^^^^^^^^^

-  guest physical range

The guest physical range structure describes the address range that a
device will respond to. It includes the base and length of the range, as
well as which bus the range resides on (e.g., on an x86 machine, it can
specify whether the range refers to memory or IO addresses).

A device can have multiple physical address ranges it responds to (e.g.,
a PCI device can have multiple BARs), so the structure will also include
an enumerated identifier to specify which of the device's ranges is
being referred to.

+--------+----------------------------+
| Name   | Description                |
+========+============================+
| addr   | range base address         |
+--------+----------------------------+
| len    | range length               |
+--------+----------------------------+
| bus    | addr type (memory or IO)   |
+--------+----------------------------+
| id     | range ID (e.g., PCI BAR)   |
+--------+----------------------------+

-  MMIO request structure

This structure describes an MMIO operation.
It includes which guest
physical range the MMIO was within, the offset within that range, the
MMIO type (e.g., load or store), and its length and data. It also
includes a sequence number that can be used to reply to the MMIO, and
the CPU that issued the MMIO.

+----------+------------------------+
| Name     | Description            |
+==========+========================+
| rid      | range MMIO is within   |
+----------+------------------------+
| offset   | offset within *rid*    |
+----------+------------------------+
| type     | e.g., load or store    |
+----------+------------------------+
| len      | MMIO length            |
+----------+------------------------+
| data     | store data             |
+----------+------------------------+
| seq      | sequence ID            |
+----------+------------------------+

-  MMIO request queues

MMIO request queues are FIFO arrays of MMIO request structures. There
are two queues: the pending queue is for MMIOs that haven't been read by
the emulation program, and the sent queue is for MMIOs that haven't been
acknowledged. The main use of the second queue is to validate MMIO
replies from the emulation program.

-  scoreboard

Each CPU in the VM is emulated in QEMU by a separate thread, so multiple
MMIOs may be waiting to be consumed by an emulation program and multiple
threads may be waiting for MMIO replies. The scoreboard would contain a
wait queue and sequence number for the per-CPU threads, allowing them to
be individually woken when the MMIO reply is received from the emulation
program. It also tracks the number of posted MMIO stores to the device
that haven't been replied to, in order to satisfy the PCI constraint
that a load to a device will not complete until all previous stores to
that device have been completed.

-  device shadow memory

Some MMIO loads do not have device side-effects.
These MMIOs can be
completed without sending an MMIO request to the emulation program if the
emulation program shares a shadow image of the device's memory image
with the KVM driver.

The emulation program will ask the KVM driver to allocate memory for the
shadow image, and will then use ``mmap()`` to directly access it. The
emulation program can control KVM access to the shadow image by sending
KVM an access map telling it which areas of the image have no
side-effects (and can be completed immediately), and which require an
MMIO request to the emulation program. The access map can also inform
the KVM driver which access sizes are allowed to the image.

master descriptor
^^^^^^^^^^^^^^^^^

The master descriptor is used by QEMU to configure the new KVM device.
The descriptor would be returned by the KVM driver when QEMU issues a
*KVM\_CREATE\_DEVICE* ``ioctl()`` with a *KVM\_DEV\_TYPE\_USER* type.

KVM\_DEV\_TYPE\_USER device ops
'''''''''''''''''''''''''''''''

The *KVM\_DEV\_TYPE\_USER* operations vector will be registered by a
``kvm_register_device_ops()`` call when the KVM system is initialized by
``kvm_init()``. These device ops are called by the KVM driver when QEMU
executes certain ``ioctl()`` operations on its KVM file descriptor. They
include:

-  create

This routine is called when QEMU issues a *KVM\_CREATE\_DEVICE*
``ioctl()`` on its per-VM file descriptor. It will allocate and
initialize a KVM user device specific data structure, and assign the
*kvm\_device* private field to it.

-  ioctl

This routine is invoked when QEMU issues an ``ioctl()`` on the master
descriptor. The ``ioctl()`` commands supported are defined by the KVM
device type. *KVM\_DEV\_TYPE\_USER* ones will need several commands:

*KVM\_DEV\_USER\_SLAVE\_FD* creates the slave file descriptor that will
be passed to the device emulation program.
Only one slave can be created
by each master descriptor. The file operations performed by this
descriptor are described below.

The *KVM\_DEV\_USER\_PA\_RANGE* command configures a guest physical
address range that the slave descriptor will receive MMIO notifications
for. The range is specified by a guest physical range structure
argument. For buses that assign addresses to devices dynamically, this
command can be executed while the guest is running, as is the case
when a guest changes a device's PCI BAR registers.

*KVM\_DEV\_USER\_PA\_RANGE* will use ``kvm_io_bus_register_dev()`` to
register *kvm\_io\_device\_ops* callbacks to be invoked when the guest
performs a MMIO operation within the range. When a range is changed,
``kvm_io_bus_unregister_dev()`` is used to remove the previous
instantiation.

*KVM\_DEV\_USER\_TIMEOUT* will configure a timeout value that specifies
how long KVM will wait for the emulation process to respond to a MMIO
indication.

- destroy

This routine is called when the VM instance is destroyed. It will need
to destroy the slave descriptor, and free any memory allocated by the
driver, as well as the *kvm\_device* structure itself.

slave descriptor
^^^^^^^^^^^^^^^^

The slave descriptor will have its own file operations vector, which
responds to system calls on the descriptor performed by the device
emulation program.

- read

A read returns any pending MMIO requests from the KVM driver as MMIO
request structures. Multiple structures can be returned if there are
multiple MMIO operations pending. The MMIO requests are moved from the
pending queue to the sent queue, and if there are threads waiting for
space in the pending queue to add new MMIO operations, they will be
woken here.

- write

A write also consists of a set of MMIO requests. They are compared to
the MMIO requests in the sent queue.
Matches are removed from the sent
queue, and any threads waiting for the reply are woken. If a store is
removed, then the number of posted stores in the per-CPU scoreboard is
decremented. When the number reaches zero, and a non side-effect load
was waiting for posted stores to complete, the load is continued.

- ioctl

There are several ``ioctl()`` commands that can be performed on the
slave descriptor.

A *KVM\_DEV\_USER\_SHADOW\_SIZE* ``ioctl()`` causes the KVM driver to
allocate memory for the shadow image. This memory can later be
``mmap()``\ ed by the emulation process to share the emulation's view of
device memory with the KVM driver.

A *KVM\_DEV\_USER\_SHADOW\_CTRL* ``ioctl()`` controls access to the
shadow image. It will send the KVM driver a shadow control map, which
specifies which areas of the image can complete guest loads without
sending the load request to the emulation program. It will also specify
the size of load operations that are allowed.

- poll

An emulation program will use the ``poll()`` call with a *POLLIN* flag
to determine if there are MMIO requests waiting to be read. It will
return if the pending MMIO request queue is not empty.

- mmap

This call allows the emulation program to directly access the shadow
image allocated by the KVM driver. As device emulation updates device
memory, changes with no side-effects will be reflected in the shadow,
and the KVM driver can satisfy guest loads from the shadow image without
needing to wait for the emulation program.

kvm\_io\_device ops
^^^^^^^^^^^^^^^^^^^

Each KVM per-CPU thread can handle MMIO operations on behalf of the
guest VM. KVM will use the MMIO's guest physical address to search for a
matching *kvm\_io\_device* to see if the MMIO can be handled by the KVM
driver instead of exiting back to QEMU. If a match is found, the
corresponding callback will be invoked.
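
As an illustration only, the address lookup described above can be
modelled in a few lines of Python. The names ``Range`` and
``find_range``, and the example addresses, are hypothetical and are not
part of the KVM API; a real driver would rely on the KVM IO bus lookup
rather than a linear scan. The sketch shows how a guest physical address
could be turned into the *rid* and *offset* fields of an MMIO request,
and how a miss corresponds to exiting back to QEMU.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Range:
    rid: int    # range ID reported in the MMIO request
    base: int   # guest physical start address
    size: int   # length of the range in bytes


def find_range(ranges: List[Range], gpa: int,
               length: int) -> Optional[Tuple[int, int]]:
    """Return (rid, offset) if the access lies fully inside a range."""
    for r in ranges:
        if r.base <= gpa and gpa + length <= r.base + r.size:
            return r.rid, gpa - r.base
    return None  # no match: the MMIO would exit back to QEMU


# Hypothetical ranges, as if configured by two PA_RANGE commands.
ranges = [Range(rid=0, base=0xFE000000, size=0x1000),
          Range(rid=1, base=0xFE100000, size=0x4000)]
hit = find_range(ranges, 0xFE100010, 4)
miss = find_range(ranges, 0xFD000000, 4)
```

In this model an access that straddles the end of a range is treated as
a miss; whether a real driver would split or reject such an access is
not specified here.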

- read

This callback is invoked when the guest performs a load to the device.
Loads with side-effects must be handled synchronously, with the KVM
driver putting the QEMU thread to sleep waiting for the emulation
process reply before re-starting the guest. Loads that do not have
side-effects may be optimized by satisfying them from the shadow image,
if there are no outstanding stores to the device by this CPU. PCI memory
ordering demands that a load cannot complete before all older stores to
the same device have been completed.

- write

Stores can be handled asynchronously unless the pending MMIO request
queue is full. In this case, the QEMU thread must sleep waiting for
space in the queue. Stores will increment the number of posted stores in
the per-CPU scoreboard, in order to implement the PCI ordering
constraint above.

interrupt acceleration
^^^^^^^^^^^^^^^^^^^^^^

This performance optimization would work much like a vhost user
application does, where the QEMU process sets up *eventfds* that cause
the device's corresponding interrupt to be triggered by the KVM driver.
These irq file descriptors are sent to the emulation process at
initialization, and are used when the emulation code raises a device
interrupt.

intx acceleration
'''''''''''''''''

Traditional PCI pin interrupts are level based, so, in addition to an
irq file descriptor, a re-sampling file descriptor needs to be sent to
the emulation program. This second file descriptor allows multiple
devices sharing an irq to be notified when the interrupt has been
acknowledged by the guest, so they can re-trigger the interrupt if their
device has not de-asserted its interrupt.
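
The re-sampling behaviour described above can be pictured with a small
Python model. This is a sketch under stated assumptions, not the KVM or
QEMU implementation: ``SharedIntxLine`` and ``Device`` are hypothetical
names, ``trigger()`` stands in for signalling the irq descriptor, and
``guest_ack()`` stands in for the notification delivered through the
re-sampling descriptor.

```python
class SharedIntxLine:
    """Toy model of one level-triggered irq shared by several devices."""

    def __init__(self):
        self.asserted = False   # level currently seen by the guest
        self.devices = []       # devices wired to this pin
        self.injections = 0     # number of times the irq was raised

    def trigger(self):
        # Stand-in for signalling the irq descriptor.
        if not self.asserted:
            self.asserted = True
            self.injections += 1

    def guest_ack(self):
        # The guest acknowledged the interrupt: lower the line, then
        # notify every device (stand-in for the re-sampling descriptor).
        self.asserted = False
        for dev in self.devices:
            dev.resample()


class Device:
    def __init__(self, name, line):
        self.name = name
        self.line = line
        self.pending = False    # device-level interrupt condition
        line.devices.append(self)

    def raise_irq(self):
        self.pending = True
        self.line.trigger()

    def service(self):
        # The guest driver handled the device, so it de-asserts.
        self.pending = False

    def resample(self):
        # Re-trigger only if this device still has work pending.
        if self.pending:
            self.line.trigger()


line = SharedIntxLine()
disk, nic = Device("disk", line), Device("nic", line)
disk.raise_irq()
nic.raise_irq()     # line is already high, so no second injection
disk.service()      # the guest services only the disk
line.guest_ack()    # nic is still pending, so it re-triggers
```

Without the re-sampling notification, the interrupt from ``nic`` in this
model would be lost once the guest acknowledged the shared line.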

intx irq descriptor


The irq descriptors are created by the proxy object, using
``event_notifier_init()`` to create the irq and re-sampling
*eventfds*, and ``kvm_vm_ioctl(KVM_IRQFD)`` to bind them to an interrupt.
The interrupt route can be found with
``pci_device_route_intx_to_irq()``.

intx routing changes


Intx routing can be changed when the guest programs the APIC the device
pin is connected to. The proxy object in QEMU will use
``pci_device_set_intx_routing_notifier()`` to be informed of any guest
changes to the route. This handler will broadly follow the VFIO
interrupt logic to change the route: de-assigning the existing irq
descriptor from its route, then assigning it the new route. (see
``vfio_intx_update()``)

MSI/X acceleration
''''''''''''''''''

MSI/X interrupts are sent as DMA transactions to the host. The interrupt
data contains a vector that is programmed by the guest. A device may have
multiple MSI interrupts associated with it, so multiple irq descriptors
may need to be sent to the emulation program.

MSI/X irq descriptor


This case will also follow the VFIO example. For each MSI/X interrupt,
an *eventfd* is created, a virtual interrupt is allocated by
``kvm_irqchip_add_msi_route()``, and the virtual interrupt is bound to
the eventfd with ``kvm_irqchip_add_irqfd_notifier()``.

MSI/X config space changes


The guest may dynamically update several MSI-related tables in the
device's PCI config space. These include per-MSI interrupt enables and
vector data. Additionally, MSIX tables exist in device memory space, not
config space. Much like the BAR case above, the proxy object must look
at guest config space programming to keep the MSI interrupt state
consistent between QEMU and the emulation program.
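
One way to picture the bookkeeping this implies is the following Python
sketch. It is a hypothetical model only: ``MsixVector``,
``ProxyMsixState``, and ``guest_update()`` are illustrative names and do
not correspond to QEMU functions. The model records the address, data,
and mask state the guest has programmed for each vector, and reports
whether a guest write actually changed a vector, which is the point at
which a proxy would presumably need to refresh the interrupt route and
inform the emulation program.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class MsixVector:
    addr: int = 0
    data: int = 0
    masked: bool = True   # vectors start masked in this model


class ProxyMsixState:
    """Hypothetical per-device view of guest-programmed MSI/X vectors."""

    def __init__(self, nvectors: int):
        self.vectors: Dict[int, MsixVector] = {
            i: MsixVector() for i in range(nvectors)
        }

    def guest_update(self, index: int, addr: int, data: int,
                     masked: bool) -> bool:
        """Record a guest write; return True if the vector changed."""
        new = MsixVector(addr, data, masked)
        changed = self.vectors[index] != new
        self.vectors[index] = new
        return changed

    def enabled(self) -> List[int]:
        """Vectors that currently need a live irq descriptor."""
        return [i for i, v in self.vectors.items() if not v.masked]


state = ProxyMsixState(nvectors=4)
first = state.guest_update(1, addr=0xFEE00000, data=0x41, masked=False)
repeat = state.guest_update(1, addr=0xFEE00000, data=0x41, masked=False)
```

Returning a change flag lets the caller skip redundant route updates
when the guest rewrites a vector with identical values.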

--------------

Disaggregated CPU emulation
---------------------------

After IO services have been disaggregated, a second phase would be to
separate a process to handle CPU instruction emulation from the main
QEMU control function. There are no object separation points for this
code, so the first task would be to create one.

Host access controls
--------------------

Separating QEMU relies on the host OS's access restriction mechanisms to
ensure that the different processes can only access the objects they
are entitled to. There are a couple of types of mechanisms usually
provided by general purpose OSs.

Discretionary access control
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Discretionary access control allows each user to control who can access
their files. In Linux, this type of control is usually too coarse for
QEMU separation, since it only provides three separate access controls:
one for the same user ID, the second for user IDs with the same group
ID, and the third for all other user IDs. Each device instance would
need a separate user ID to provide access control, which is likely to be
unwieldy for dynamically created VMs.

Mandatory access control
~~~~~~~~~~~~~~~~~~~~~~~~

Mandatory access control allows the OS to impose an additional set of
controls on top of discretionary access control. It also
adds other attributes to processes and files such as types, roles, and
categories, and can establish rules for how processes and files can
interact.

Type enforcement
^^^^^^^^^^^^^^^^

Type enforcement assigns a *type* attribute to processes and files, and
allows rules to be written on what operations a process with a given
type can perform on a file with a given type.
QEMU separation could take
advantage of type enforcement by running the emulation processes with
different types, both from the main QEMU process, and from the emulation
processes of different classes of devices.

For example, guest disk images and disk emulation processes could have
types separate from the main QEMU process and non-disk emulation
processes, and the type rules could prevent processes other than disk
emulation ones from accessing guest disk images. Similarly, network
emulation processes can have a type separate from the main QEMU process
and non-network emulation processes, and only that type can access the
host tun/tap device used to provide guest networking.

Category enforcement
^^^^^^^^^^^^^^^^^^^^

Category enforcement assigns a set of numbers within a given range to
the process or file. The process is granted access to the file if the
process's set is a superset of the file's set. This enforcement can be
used to separate multiple instances of devices in the same class.

For example, if there are multiple disk devices provided to a guest,
each device emulation process could be provisioned with a separate
category. The different device emulation processes would not be able to
access each other's backing disk images.

Alternatively, categories could be used in lieu of the type enforcement
scheme described above. In this scenario, different categories would be
used to prevent device emulation processes in different classes from
accessing resources assigned to other classes.
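
The superset rule can be illustrated with a short Python sketch. The
function name ``may_access`` and the category numbers are assumptions
made for the example; an actual deployment would express this through
the host's mandatory access control policy rather than application code.

```python
from typing import FrozenSet


def may_access(process_cats: FrozenSet[int],
               file_cats: FrozenSet[int]) -> bool:
    """Grant access when the process's set is a superset of the file's."""
    return process_cats >= file_cats


# Hypothetical category assignments for two disk emulation processes
# and the backing images they serve.
disk0_proc = frozenset({1})
disk1_proc = frozenset({2})
disk0_image = frozenset({1})
disk1_image = frozenset({2})

own_image = may_access(disk0_proc, disk0_image)     # allowed
other_image = may_access(disk0_proc, disk1_image)   # denied
```

With one category per device instance, each emulation process can open
its own backing image but not the image belonging to another instance.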