This is the design document for multi-process QEMU. It does not
necessarily reflect the status of the current implementation, which
may lack features or be considerably different from what is described
in this document. This document is still useful as a description of
the goals and general direction of this feature.

Please refer to the following wiki for the latest details:
https://wiki.qemu.org/Features/MultiProcessQEMU

Multi-process QEMU
===================

QEMU is often used as the hypervisor for virtual machines running in the
Oracle cloud. Since one of the advantages of cloud computing is the
ability to run many VMs from different tenants in the same cloud
infrastructure, a guest that compromised its hypervisor could
potentially use the hypervisor's access privileges to access data it is
not authorized for.

QEMU can be susceptible to security attacks because it is a large,
monolithic program that provides many features to the VMs it services.
Many of these features can be configured out of QEMU, but even a reduced
configuration QEMU has a large amount of code a guest can potentially
attack. Separating QEMU reduces the attack surface by helping to
limit each component in the system to only access the resources that
it needs to perform its job.

QEMU services
-------------

QEMU can be broadly described as providing three main services. One is a
VM control point, where VMs can be created, migrated, re-configured, and
destroyed. A second is to emulate the CPU instructions within the VM,
often accelerated by HW virtualization features such as Intel's VT
extensions. Finally, it provides IO services to the VM by emulating HW
IO devices, such as disk and network devices.

A multi-process QEMU
~~~~~~~~~~~~~~~~~~~~

A multi-process QEMU involves separating QEMU services into separate
host processes.
Each of these processes can be given only the privileges
it needs to provide its service, e.g., a disk service could be given
access only to the disk images it provides, and not be allowed to
access other files, or any network devices. An attacker who compromised
this service would not be able to use this exploit to access files or
devices beyond what the disk service was given access to.

A QEMU control process would remain, but in multi-process mode, will
have no direct interfaces to the VM. During VM execution, it would still
provide the user interface to hot-plug devices or live migrate the VM.

A first step in creating a multi-process QEMU is to separate IO services
from the main QEMU program, which would continue to provide CPU
emulation, i.e., the control process would also be the CPU emulation
process. In a later phase, CPU emulation could be separated from the
control process.

Separating IO services
----------------------

Separating IO services into individual host processes is a good place to
begin for a couple of reasons. One is that the sheer number of IO devices
QEMU can emulate provides a large surface of interfaces which could
potentially be exploited, and, indeed, have been a source of exploits in
the past. Another is that the modular nature of QEMU device emulation code
provides interface points where the QEMU functions that perform device
emulation can be separated from the QEMU functions that manage the
emulation of guest CPU instructions. The devices emulated in the separate
process are referred to as remote devices.

QEMU device emulation
~~~~~~~~~~~~~~~~~~~~~

QEMU uses an object-oriented SW architecture for device emulation code.
Configured objects are all compiled into the QEMU binary, then objects
are instantiated by name when used by the guest VM.
For example, the
code to emulate a device named "foo" is always present in QEMU, but its
instantiation code is only run when the device is included in the target
VM (e.g., via the QEMU command line as *-device foo*).

The object model is hierarchical, so device emulation code names its
parent object (such as "pci-device" for a PCI device) and QEMU will
instantiate a parent object before calling the device's instantiation
code.

Current separation models
~~~~~~~~~~~~~~~~~~~~~~~~~

In order to separate the device emulation code from the CPU emulation
code, the device object code must run in a different process. There are
a couple of existing QEMU features that can run emulation code
separately from the main QEMU process. These are examined below.

vhost user model
^^^^^^^^^^^^^^^^

Virtio guest device drivers can be connected to vhost user applications
in order to perform their IO operations. This model uses special virtio
device drivers in the guest and vhost user device objects in QEMU, but
once the QEMU vhost user code has configured the vhost user application,
mission-mode IO is performed by the application. The vhost user
application is a daemon process that can be contacted via a known UNIX
domain socket.

vhost socket
''''''''''''

As mentioned above, one of the tasks of the vhost device object within
QEMU is to contact the vhost application and send it configuration
information about this device instance. As part of the configuration
process, the application can also be sent other file descriptors over
the socket, which then can be used by the vhost user application in
various ways, some of which are described below.

vhost MMIO store acceleration
'''''''''''''''''''''''''''''

VMs are often run using HW virtualization features via the KVM kernel
driver.
This driver allows QEMU to accelerate the emulation of guest CPU
instructions by running the guest in a virtual HW mode. When the guest
executes instructions that cannot be executed by virtual HW mode,
execution returns to the KVM driver so it can inform QEMU to emulate the
instructions in SW.

One of the events that can cause a return to QEMU is when a guest device
driver accesses an IO location. QEMU then dispatches the memory
operation to the corresponding QEMU device object. In the case of a
vhost user device, the memory operation would need to be sent over a
socket to the vhost application. This path is accelerated by the QEMU
virtio code by setting up an eventfd file descriptor through which the
vhost application can directly receive MMIO store notifications from the
KVM driver, instead of needing them to be sent to the QEMU process first.

vhost interrupt acceleration
''''''''''''''''''''''''''''

Another optimization used by the vhost application is the ability to
directly inject interrupts into the VM via the KVM driver, again,
bypassing the need to send the interrupt back to the QEMU process first.
The QEMU virtio setup code configures the KVM driver with an eventfd
that triggers the device interrupt in the guest when the eventfd is
written. This irqfd file descriptor is then passed to the vhost user
application program.

vhost access to guest memory
''''''''''''''''''''''''''''

The vhost application is also allowed to directly access guest memory,
instead of needing to send the data as messages to QEMU. This is also
done with file descriptors sent to the vhost user application by QEMU.
These descriptors can be passed to ``mmap()`` by the vhost application
to map the guest address space into the vhost application.
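
As an illustration of that last step, the following C sketch maps one
guest memory region from a received file descriptor. The
``guest_mem_region`` structure and the ``map_guest_region()`` helper are
hypothetical names chosen for this example; they are not part of QEMU or
of the vhost user protocol. Only ``mmap()`` itself is a real interface.

```c
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>

/* Hypothetical description of one guest RAM region received from QEMU. */
struct guest_mem_region {
    int fd;           /* file descriptor backing the region */
    size_t size;      /* length of the region in bytes */
    off_t fd_offset;  /* offset of the region within the file */
};

/*
 * Map one region into the calling process. MAP_SHARED is used so that
 * stores are visible to every other process mapping the same file.
 * Returns NULL on failure.
 */
static void *map_guest_region(const struct guest_mem_region *r)
{
    void *p = mmap(NULL, r->size, PROT_READ | PROT_WRITE, MAP_SHARED,
                   r->fd, r->fd_offset);

    return p == MAP_FAILED ? NULL : p;
}
```

The shared mapping is what lets the application and QEMU observe each
other's writes; a ``MAP_PRIVATE`` mapping would give the application a
copy-on-write view instead.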

IOMMUs introduce another level of complexity, since the address given to
the guest virtio device to DMA to or from is not a guest physical
address. This case is handled by having vhost code within QEMU register
as a listener for IOMMU mapping changes. The vhost application maintains
a cache of IOMMU translations: sending translation requests back to
QEMU on cache misses, and in turn receiving flush requests from QEMU
when mappings are purged.

applicability to device separation
''''''''''''''''''''''''''''''''''

Much of the vhost model can be re-used by separated device emulation. In
particular, the ideas of using a socket between QEMU and the device
emulation application, using a file descriptor to inject interrupts into
the VM via KVM, and allowing the application to ``mmap()`` guest memory
should be re-used.

There are, however, some notable differences between how a vhost
application works and the needs of separated device emulation. The most
basic is that vhost uses custom virtio device drivers which always
trigger IO with MMIO stores. A separated device emulation model must
work with existing IO device models and guest device drivers. MMIO loads
break vhost store acceleration since they are synchronous - guest
progress cannot continue until the load has been emulated. By contrast,
stores are asynchronous: the guest can continue after the store event
has been sent to the vhost application.

Another difference is that in the vhost user model, a single daemon can
support multiple QEMU instances. This is contrary to the security regime
desired, in which the emulation application should only be allowed to
access the files or devices the VM it's running on behalf of can access.

qemu-io model
^^^^^^^^^^^^^

Qemu-io is a test harness used to test changes to the QEMU block backend
object code.
(E.g., the code that implements disk images for disk driver
emulation.) Qemu-io is not a device emulation application per se, but it
does compile the QEMU block objects into a separate binary from the main
QEMU one. This could be useful for disk device emulation, since its
emulation applications will need to include the QEMU block objects.

New separation model based on proxy objects
-------------------------------------------

A different model based on proxy objects in the QEMU program
communicating with remote emulation programs could provide separation
while minimizing the changes needed to the device emulation code. The
rest of this section is a discussion of how a proxy object model would
work.

Remote emulation processes
~~~~~~~~~~~~~~~~~~~~~~~~~~

The remote emulation process will run the QEMU object hierarchy without
modification. The device emulation objects will also be based on the
QEMU code, because for anything but the simplest device, it would not be
tractable to re-implement both the object model and the many device
backends that QEMU has.

The processes will communicate with the QEMU process over UNIX domain
sockets. The processes can be executed either as standalone processes,
or be executed by QEMU. In both cases, the host backends an emulation
process will provide are specified on its command line, as they would
be for QEMU. For example:

::

    disk-proc -blockdev driver=file,node-name=file0,filename=disk-file0  \
    -blockdev driver=qcow2,node-name=drive0,file=file0

would indicate process *disk-proc* uses a qcow2 emulated disk, node
*drive0* layered on the file node *file0*, as its backend.

Emulation processes may emulate more than one guest controller. A common
configuration might be to put all controllers of the same device class
(e.g., disk, network, etc.)
in a single process, so that all backends of
the same type can be managed by a single QMP monitor.

communication with QEMU
^^^^^^^^^^^^^^^^^^^^^^^

The first argument to the remote emulation process will be a UNIX domain
socket that connects with the proxy object. This is a required argument.

::

    disk-proc <socket number> <backend list>

remote process QMP monitor
^^^^^^^^^^^^^^^^^^^^^^^^^^

Remote emulation processes can be monitored via QMP, similar to QEMU
itself. The QMP monitor socket is specified the same way as for a QEMU
process:

::

    disk-proc -qmp unix:/tmp/disk-mon,server

can be monitored over the UNIX socket path */tmp/disk-mon*.

QEMU command line
~~~~~~~~~~~~~~~~~

Each remote device emulated in a remote process on the host is
represented as a *-device* of type *pci-proxy-dev*. A *socket*
sub-option to this option specifies the UNIX socket that connects
to the remote process. An *id* sub-option is required, and it should
be the same id as used in the remote process.

::

    qemu-system-x86_64 ... -device pci-proxy-dev,id=lsi0,socket=3

can be used to add a device emulated in a remote process.


QEMU management of remote processes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

QEMU is not aware of the type of the remote PCI device. It is
a pass-through device as far as QEMU is concerned.

communication with emulation process
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

primary channel
'''''''''''''''

The primary channel (referred to as com in the code) is used to bootstrap
the remote process. It is also used to pass on device-agnostic commands
like reset.

per-device channels
'''''''''''''''''''

Each remote device communicates with QEMU using a dedicated communication
channel. The proxy object sets up this channel using the primary
channel during its initialization.

QEMU device proxy objects
~~~~~~~~~~~~~~~~~~~~~~~~~

QEMU has an object model based on sub-classes inherited from the
"object" super-class. The sub-classes that are of interest here are the
"device" and "bus" sub-classes whose child sub-classes make up the
device tree of a QEMU emulated system.

The proxy object model will use device proxy objects to replace the
device emulation code within the QEMU process. These objects will live
in the same place in the object and bus hierarchies as the objects they
replace, i.e., the proxy object for an LSI SCSI controller will be a
sub-class of the "pci-device" class, and will have the same PCI bus
parent and the same SCSI bus child objects as the LSI controller object
it replaces.

It is worth noting that the same proxy object is used to mediate with
all types of remote PCI devices.

object initialization
^^^^^^^^^^^^^^^^^^^^^

The proxy device objects are initialized in the exact same manner in
which any other QEMU device would be initialized.

In addition, the proxy objects perform the following two tasks:

- Parse the "socket" sub-option and connect to the remote process
  using this channel
- Use the "id" sub-option to connect to the emulated device on the
  separate process

class\_init
'''''''''''

The ``class_init()`` method of a proxy object will, in general, behave
similarly to the object it replaces, including setting any static
properties and methods needed by the proxy.

instance\_init / realize
''''''''''''''''''''''''

The ``instance_init()`` and ``realize()`` functions would only need to
perform tasks related to being a proxy, such as registering its own
MMIO handlers, or creating a child bus that other proxy devices can be
attached to later.

Other tasks will be device-specific.
For example, PCI device objects
will initialize the PCI config space in order to make a valid PCI device
tree within the QEMU process.

address space registration
^^^^^^^^^^^^^^^^^^^^^^^^^^

Most devices are driven by guest device driver accesses to IO addresses
or ports. The QEMU device emulation code uses QEMU's memory region
function calls (such as ``memory_region_init_io()``) to add callback
functions that QEMU will invoke when the guest accesses the device's
areas of the IO address space. When a guest driver does access the
device, the VM will exit HW virtualization mode and return to QEMU,
which will then look up and execute the corresponding callback function.

A proxy object would need to mirror the memory region calls the actual
device emulator would perform in its initialization code, but with its
own callbacks. When invoked by QEMU as a result of a guest IO operation,
they will forward the operation to the device emulation process.

PCI config space
^^^^^^^^^^^^^^^^

PCI devices also have a configuration space that can be accessed by the
guest driver. Guest accesses to this space are not handled by the device
emulation object, but by its PCI parent object. Much of this space is
read-only, but certain registers (especially BAR and MSI-related ones)
need to be propagated to the emulation process.

PCI parent proxy
''''''''''''''''

One way to propagate guest PCI config accesses is to create a
"pci-device-proxy" class that can serve as the parent of a PCI device
proxy object. This class's parent would be "pci-device" and it would
override the PCI parent's ``config_read()`` and ``config_write()``
methods with ones that forward these operations to the emulation
program.

interrupt receipt
^^^^^^^^^^^^^^^^^

A proxy for a device that generates interrupts will need to create a
socket to receive interrupt indications from the emulation process. An
incoming interrupt indication would then be sent up to its bus parent to
be injected into the guest. For example, a PCI device object may use
``pci_set_irq()``.

live migration
^^^^^^^^^^^^^^

The proxy will register to save and restore any *vmstate* it needs over
a live migration event. The device proxy does not need to manage the
remote device's *vmstate*; that will be handled by the remote process
proxy (see below).

QEMU remote device operation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Generic device operations, such as DMA, will be performed by the remote
process proxy by sending messages to the remote process.

DMA operations
^^^^^^^^^^^^^^

DMA operations would be handled much as vhost applications handle them.
One of the initial messages sent to the emulation process is a guest
memory table. Each entry in this table consists of a file descriptor and
size that the emulation process can ``mmap()`` to directly access guest
memory, similar to ``vhost_user_set_mem_table()``. Note that guest memory
must be backed by file descriptors, such as when QEMU is given the
*-mem-path* command line option.

IOMMU operations
^^^^^^^^^^^^^^^^

When the emulated system includes an IOMMU, the remote process proxy in
QEMU will need to create a socket for IOMMU requests from the emulation
process. It will handle those requests with an
``address_space_get_iotlb_entry()`` call. In order to handle IOMMU
unmaps, the remote process proxy will also register as a listener on the
device's DMA address space.
When an IOMMU memory region is created
within the DMA address space, an IOMMU notifier for unmaps will be added
to the memory region that will forward unmaps to the emulation process
over the IOMMU socket.

device hot-plug via QMP
^^^^^^^^^^^^^^^^^^^^^^^

A QMP "device\_add" command can add a device emulated by a remote
process. The command will also take a "rid" option, just as the
*-device* command line option does. The remote process may either be one
started at QEMU startup, or be one added by the "add-process" QMP
command described above. In either case, the remote process proxy will
forward the new device's JSON description to the corresponding emulation
process.

live migration
^^^^^^^^^^^^^^

The remote process proxy will also register for live migration
notifications with ``vmstate_register()``. When called to save state,
the proxy will send the remote process a secondary socket file
descriptor to save the remote process's device *vmstate* over. The
incoming byte stream length and data will be saved as the proxy's
*vmstate*. When the proxy is resumed on its new host, this *vmstate*
will be extracted, and a secondary socket file descriptor will be sent
to the new remote process through which it receives the *vmstate* in
order to restore the devices there.

device emulation in remote process
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The parts of QEMU that the emulation program will need include the
object model; the memory emulation objects; the device emulation objects
of the targeted device, and any dependent devices; and the device's
backends. It will also need code to set up the machine environment,
handle requests from the QEMU process, and route machine-level requests
(such as interrupts or IOMMU mappings) back to the QEMU process.

initialization
^^^^^^^^^^^^^^

The process initialization sequence will follow the same sequence
followed by QEMU. It will first initialize the backend objects, then
device emulation objects. The JSON descriptions sent by the QEMU process
will drive which objects need to be created.

- address spaces

Before the device objects are created, the initial address spaces and
memory regions must be configured with ``memory_map_init()``. This
creates a RAM memory region object (*system\_memory*) and an IO memory
region object (*system\_io*).

- RAM

RAM memory region creation will follow how ``pc_memory_init()`` creates
them, but must use ``memory_region_init_ram_from_fd()`` instead of
``memory_region_allocate_system_memory()``. The file descriptors needed
will be supplied by the guest memory table from above. Those RAM regions
would then be added to the *system\_memory* memory region with
``memory_region_add_subregion()``.

- PCI

IO initialization will be driven by the JSON descriptions sent from the
QEMU process. For a PCI device, a PCI bus will need to be created with
``pci_root_bus_new()``, and a PCI memory region will need to be created
and added to the *system\_memory* memory region with
``memory_region_add_subregion_overlap()``. The overlap version is
required for architectures where PCI memory overlaps with RAM memory.

MMIO handling
^^^^^^^^^^^^^

The device emulation objects will use ``memory_region_init_io()`` to
install their MMIO handlers, and ``pci_register_bar()`` to associate
those handlers with a PCI BAR, as they do within QEMU currently.

In order to use ``address_space_rw()`` in the emulation process to
handle MMIO requests from QEMU, the PCI physical addresses must be the
same in the QEMU process and the device emulation process.
In order to
accomplish that, guest BAR programming must also be forwarded from QEMU
to the emulation process.

interrupt injection
^^^^^^^^^^^^^^^^^^^

When device emulation wants to inject an interrupt into the VM, the
request climbs the device's bus object hierarchy until the point where a
bus object knows how to signal the interrupt to the guest. The details
depend on the type of interrupt being raised.

- PCI pin interrupts

On x86 systems, there is an emulated IOAPIC object attached to the root
PCI bus object, and the root PCI object forwards interrupt requests to
it. The IOAPIC object, in turn, calls the KVM driver to inject the
corresponding interrupt into the VM. The simplest way to handle this in
an emulation process would be to set up the root PCI bus driver (via
``pci_bus_irqs()``) to send an interrupt request back to the QEMU
process, and have the device proxy object reflect it up the PCI tree
there.

- PCI MSI/X interrupts

PCI MSI/X interrupts are implemented in HW as DMA writes to a
CPU-specific PCI address. In QEMU on x86, a KVM APIC object receives
these DMA writes, then calls into the KVM driver to inject the interrupt
into the VM. A simple emulation process implementation would be to send
the MSI DMA address from QEMU as a message at initialization, then
install an address space handler at that address which forwards the MSI
message back to QEMU.

DMA operations
^^^^^^^^^^^^^^

When an emulation object wants to DMA into or out of guest memory, it
first must use ``dma_memory_map()`` to convert the DMA address to a local
virtual address. The emulation process memory region objects set up above
will be used to translate the DMA address to a local virtual address the
device emulation code can access.
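
The translation can be pictured as a lookup in a table of mapped RAM
regions. The sketch below is not how ``dma_memory_map()`` is
implemented; ``ram_map`` and ``dma_to_local()`` are hypothetical names,
and the sketch assumes no IOMMU is present, so that the DMA address is
already a guest physical address.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of one guest RAM region mapped into this process. */
struct ram_map {
    uint64_t gpa;    /* guest physical base address */
    uint64_t size;   /* region length in bytes */
    uint8_t *host;   /* local virtual address of the mapping */
};

/*
 * Translate the DMA range [addr, addr + len) into a local pointer.
 * Returns NULL if the range is empty or does not lie entirely within
 * a single mapped region.
 */
static void *dma_to_local(const struct ram_map *maps, size_t nmaps,
                          uint64_t addr, uint64_t len)
{
    for (size_t i = 0; i < nmaps; i++) {
        const struct ram_map *m = &maps[i];
        uint64_t off;

        if (len == 0 || addr < m->gpa) {
            continue;
        }
        off = addr - m->gpa;
        /* Compare against the remaining space to avoid overflowing addr + len. */
        if (off < m->size && len <= m->size - off) {
            return m->host + off;
        }
    }
    return NULL;
}
```

Rejecting ranges that cross a region boundary keeps the sketch simple; a
fuller implementation would have to split such a transfer or bounce it
through a temporary buffer.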

IOMMU
^^^^^

When an IOMMU is in use in QEMU, DMA translation uses IOMMU memory
regions to translate the DMA address to a guest physical address before
that physical address can be translated to a local virtual address. The
emulation process will need similar functionality.

- IOTLB cache

The emulation process will maintain a cache of recent IOMMU translations
(the IOTLB). When the ``translate()`` callback of an IOMMU memory region
is invoked, the IOTLB cache will be searched for an entry that will map
the DMA address to a guest PA. On a cache miss, a message will be sent
back to QEMU requesting the corresponding translation entry, which will
both be used to return a guest address and be added to the cache.

- IOTLB purge

The IOMMU emulation will also need to act on unmap requests from QEMU.
These happen when the guest IOMMU driver purges an entry from the
guest's translation table.

live migration
^^^^^^^^^^^^^^

When a remote process receives a live migration indication from QEMU, it
will set up a channel using the received file descriptor with
``qio_channel_socket_new_fd()``. This channel will be used to create a
*QEMUFile* that can be passed to ``qemu_save_device_state()`` to send
the process's device state back to QEMU. This method will be reversed on
restore - the channel will be passed to ``qemu_loadvm_state()`` to
restore the device state.

Accelerating device emulation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The messages that are required to be sent between QEMU and the emulation
process can add considerable latency to IO operations. The optimizations
described below attempt to ameliorate this effect by allowing the
emulation process to communicate directly with the kernel KVM driver.
The KVM file descriptors created would be passed to the emulation process
via initialization messages, much like the guest memory table is.
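
File descriptors are normally passed between processes as
``SCM_RIGHTS`` ancillary data on a UNIX domain socket. The sketch below
assumes that mechanism is used for the initialization messages; the
helper names ``send_fd()`` and ``recv_fd()`` are hypothetical, and the
message carries only a single placeholder byte rather than a real
initialization payload.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Control buffer sized and aligned for exactly one descriptor. */
union fd_ctrl {
    struct cmsghdr align;
    char buf[CMSG_SPACE(sizeof(int))];
};

/* Send 'fd' over the connected UNIX socket 'sock'. Returns 0 or -1. */
static int send_fd(int sock, int fd)
{
    char byte = 0;  /* portable code sends real data with ancillary data */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_ctrl u;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    memset(&u, 0, sizeof(u));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive a descriptor sent by send_fd(). Returns the new fd or -1. */
static int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_ctrl u;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd;

    memset(&u, 0, sizeof(u));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    if (recvmsg(sock, &msg, 0) != 1) {
        return -1;
    }
    cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int))) {
        return -1;
    }
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

The receiver gets a new descriptor number that refers to the same open
file, so the emulation process can use it without any further help from
QEMU.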

MMIO acceleration
^^^^^^^^^^^^^^^^^

Vhost user applications can receive guest virtio driver stores directly
from KVM. The issue with the eventfd mechanism used by vhost user is
that it does not pass any data with the event indication, so it cannot
handle guest loads or guest stores that carry store data. This concept
could, however, be expanded to cover more cases.

The expanded idea would require a new type of KVM device:
*KVM\_DEV\_TYPE\_USER*. This device has two file descriptors: a master
descriptor that QEMU can use for configuration, and a slave descriptor
that the emulation process can use to receive MMIO notifications. QEMU
would create both descriptors using the KVM driver, and pass the slave
descriptor to the emulation process via an initialization message.

data structures
^^^^^^^^^^^^^^^

- guest physical range

The guest physical range structure describes the address range that a
device will respond to. It includes the base and length of the range, as
well as which bus the range resides on (e.g., on an x86 machine, it can
specify whether the range refers to memory or IO addresses).

A device can have multiple physical address ranges it responds to (e.g.,
a PCI device can have multiple BARs), so the structure will also include
an enumerated identifier to specify which of the device's ranges is
being referred to.

+--------+----------------------------+
| Name   | Description                |
+========+============================+
| addr   | range base address         |
+--------+----------------------------+
| len    | range length               |
+--------+----------------------------+
| bus    | addr type (memory or IO)   |
+--------+----------------------------+
| id     | range ID (e.g., PCI BAR)   |
+--------+----------------------------+

- MMIO request structure

This structure describes an MMIO operation.
It includes which guest
physical range the MMIO was within, the offset within that range, the
MMIO type (e.g., load or store), and its length and data. It also
includes a sequence number that can be used to reply to the MMIO, and
the CPU that issued the MMIO.

+----------+------------------------+
| Name     | Description            |
+==========+========================+
| rid      | range MMIO is within   |
+----------+------------------------+
| offset   | offset within *rid*    |
+----------+------------------------+
| type     | e.g., load or store    |
+----------+------------------------+
| len      | MMIO length            |
+----------+------------------------+
| data     | store data             |
+----------+------------------------+
| seq      | sequence ID            |
+----------+------------------------+

- MMIO request queues

MMIO request queues are FIFO arrays of MMIO request structures. There
are two queues: the pending queue is for MMIOs that haven't been read by
the emulation program, and the sent queue is for MMIOs that haven't been
acknowledged. The main use of the second queue is to validate MMIO
replies from the emulation program.

- scoreboard

Each CPU in the VM is emulated in QEMU by a separate thread, so multiple
MMIOs may be waiting to be consumed by an emulation program and multiple
threads may be waiting for MMIO replies. The scoreboard would contain a
wait queue and sequence number for the per-CPU threads, allowing them to
be individually woken when the MMIO reply is received from the emulation
program. It also tracks the number of posted MMIO stores to the device
that haven't been replied to, in order to satisfy the PCI constraint
that a load to a device will not complete until all previous stores to
that device have been completed.

- device shadow memory

Some MMIO loads do not have device side-effects.
These MMIOs can be
completed without sending an MMIO request to the emulation program if the
emulation program shares a shadow image of the device's memory image
with the KVM driver.

The emulation program will ask the KVM driver to allocate memory for the
shadow image, and will then use ``mmap()`` to directly access it. The
emulation program can control KVM access to the shadow image by sending
KVM an access map telling it which areas of the image have no
side-effects (and can be completed immediately), and which require an
MMIO request to the emulation program. The access map can also inform
the KVM driver which size accesses are allowed to the image.

master descriptor
^^^^^^^^^^^^^^^^^

The master descriptor is used by QEMU to configure the new KVM device.
The descriptor would be returned by the KVM driver when QEMU issues a
*KVM\_CREATE\_DEVICE* ``ioctl()`` with a *KVM\_DEV\_TYPE\_USER* type.

KVM\_DEV\_TYPE\_USER device ops
''''''''''''''''''''''''''''''''

The *KVM\_DEV\_TYPE\_USER* operations vector will be registered by a
``kvm_register_device_ops()`` call when the KVM system is initialized by
``kvm_init()``. These device ops are called by the KVM driver when QEMU
executes certain ``ioctl()`` operations on its KVM file descriptor. They
include:

- create

This routine is called when QEMU issues a *KVM\_CREATE\_DEVICE*
``ioctl()`` on its per-VM file descriptor. It will allocate and
initialize a KVM user device specific data structure, and assign the
*kvm\_device* private field to it.

- ioctl

This routine is invoked when QEMU issues an ``ioctl()`` on the master
descriptor. The ``ioctl()`` commands supported are defined by the KVM
device type. *KVM\_DEV\_TYPE\_USER* ones will need several commands:

*KVM\_DEV\_USER\_SLAVE\_FD* creates the slave file descriptor that will
be passed to the device emulation program.
Only one slave can be created
by each master descriptor. The file operations performed by this
descriptor are described below.

The *KVM\_DEV\_USER\_PA\_RANGE* command configures a guest physical
address range that the slave descriptor will receive MMIO notifications
for. The range is specified by a guest physical range structure
argument. For buses that assign addresses to devices dynamically, this
command can be executed while the guest is running, as is the case
when a guest changes a device's PCI BAR registers.

*KVM\_DEV\_USER\_PA\_RANGE* will use ``kvm_io_bus_register_dev()`` to
register *kvm\_io\_device\_ops* callbacks to be invoked when the guest
performs an MMIO operation within the range. When a range is changed,
``kvm_io_bus_unregister_dev()`` is used to remove the previous
instantiation.

*KVM\_DEV\_USER\_TIMEOUT* will configure a timeout value that specifies
how long KVM will wait for the emulation process to respond to an MMIO
indication.

- destroy

This routine is called when the VM instance is destroyed. It will need
to destroy the slave descriptor, and free any memory allocated by the
driver, as well as the *kvm\_device* structure itself.

slave descriptor
^^^^^^^^^^^^^^^^

The slave descriptor will have its own file operations vector, which
responds to system calls on the descriptor performed by the device
emulation program.

- read

A read returns any pending MMIO requests from the KVM driver as MMIO
request structures. Multiple structures can be returned if there are
multiple MMIO operations pending. The MMIO requests are moved from the
pending queue to the sent queue, and if there are threads waiting for
space in the pending queue to add new MMIO operations, they will be
woken here.

- write

A write also consists of a set of MMIO requests. They are compared to
the MMIO requests in the sent queue.
Matches are removed from the sent
queue, and any threads waiting for the reply are woken. If a store is
removed, then the number of posted stores in the per-CPU scoreboard is
decremented. When the number reaches zero, and a non side-effect load
was waiting for posted stores to complete, the load is continued.

- ioctl

There are several ``ioctl()``\ s that can be performed on the slave
descriptor.

A *KVM\_DEV\_USER\_SHADOW\_SIZE* ``ioctl()`` causes the KVM driver to
allocate memory for the shadow image. This memory can later be
``mmap()``\ ed by the emulation process to share the emulation's view of
device memory with the KVM driver.

A *KVM\_DEV\_USER\_SHADOW\_CTRL* ``ioctl()`` controls access to the
shadow image. It will send the KVM driver a shadow control map, which
specifies which areas of the image can complete guest loads without
sending the load request to the emulation program. It will also specify
the size of load operations that are allowed.

- poll

An emulation program will use the ``poll()`` call with a *POLLIN* flag
to determine if there are MMIO requests waiting to be read. It will
return if the pending MMIO request queue is not empty.

- mmap

This call allows the emulation program to directly access the shadow
image allocated by the KVM driver. As device emulation updates device
memory, changes with no side-effects will be reflected in the shadow,
and the KVM driver can satisfy guest loads from the shadow image without
needing to wait for the emulation program.

kvm\_io\_device ops
^^^^^^^^^^^^^^^^^^^

Each KVM per-CPU thread can handle MMIO operations on behalf of the
guest VM. KVM will use the MMIO's guest physical address to search for a
matching *kvm\_io\_device* to see if the MMIO can be handled by the KVM
driver instead of exiting back to QEMU. If a match is found, the
corresponding callback will be invoked.
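
The address search described above can be illustrated with a small
user-space sketch. It is not the KVM implementation: ``struct pa_range``
and ``find_range()`` are hypothetical names invented for this example,
and a linear scan is assumed only for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical range record; real kvm_io_device bookkeeping differs. */
struct pa_range {
    uint64_t base;  /* guest physical start address */
    uint64_t len;   /* length of the range in bytes */
    int rid;        /* range ID reported in the MMIO request */
};

/*
 * Return the range that fully contains [gpa, gpa + size), or NULL if
 * the access is not covered (in which case KVM would exit to QEMU).
 * The comparisons are ordered so that no addition can overflow.
 */
const struct pa_range *find_range(const struct pa_range *tbl, size_t n,
                                  uint64_t gpa, uint64_t size)
{
    assert(tbl != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        const struct pa_range *r = &tbl[i];

        if (gpa >= r->base && size <= r->len &&
            gpa - r->base <= r->len - size) {
            return r;
        }
    }
    return NULL;
}
```

An access that straddles the end of a range is treated as a miss here,
which is an assumption of this sketch rather than something the design
specifies.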

- read

This callback is invoked when the guest performs a load to the device.
Loads with side-effects must be handled synchronously, with the KVM
driver putting the QEMU thread to sleep waiting for the emulation
process reply before re-starting the guest. Loads that do not have
side-effects may be optimized by satisfying them from the shadow image,
if there are no outstanding stores to the device by this CPU. PCI memory
ordering demands that a load cannot complete before all older stores to
the same device have been completed.

- write

Stores can be handled asynchronously unless the pending MMIO request
queue is full. In this case, the QEMU thread must sleep waiting for
space in the queue. Stores will increment the number of posted stores in
the per-CPU scoreboard, in order to implement the PCI ordering
constraint above.

interrupt acceleration
^^^^^^^^^^^^^^^^^^^^^^

This performance optimization would work much like a vhost user
application does, where the QEMU process sets up *eventfds* that cause
the device's corresponding interrupt to be triggered by the KVM driver.
These irq file descriptors are sent to the emulation process at
initialization, and are used when the emulation code raises a device
interrupt.

intx acceleration
'''''''''''''''''

Traditional PCI pin interrupts are level based, so, in addition to an
irq file descriptor, a re-sampling file descriptor needs to be sent to
the emulation program. This second file descriptor allows multiple
devices sharing an irq to be notified when the interrupt has been
acknowledged by the guest, so they can re-trigger the interrupt if their
device has not de-asserted its interrupt.
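
The raise and re-sample flow above could look roughly like the following
on the emulation program side. This is a minimal sketch using plain
Linux ``eventfd`` descriptors; ``struct intx_dev``, ``intx_raise()`` and
``intx_resample()`` are hypothetical names, and in the real design the
KVM driver (not the program itself) would consume the irq descriptor and
signal the re-sampling descriptor.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical per-device state held by the emulation program. */
struct intx_dev {
    int irq_fd;       /* written to raise the interrupt */
    int resample_fd;  /* becomes readable when the guest acknowledges */
    bool asserted;    /* current level of the device's interrupt line */
};

/* Raise the level interrupt by adding 1 to the irq eventfd counter. */
int intx_raise(struct intx_dev *d)
{
    uint64_t one = 1;

    d->asserted = true;
    return write(d->irq_fd, &one, sizeof(one)) == (ssize_t)sizeof(one)
           ? 0 : -1;
}

/*
 * Consume a re-sample notification, then re-trigger the irq if the
 * device has not de-asserted its interrupt.
 * Returns 1 if re-triggered, 0 if not, -1 on error.
 */
int intx_resample(struct intx_dev *d)
{
    uint64_t cnt;

    if (read(d->resample_fd, &cnt, sizeof(cnt)) != (ssize_t)sizeof(cnt)) {
        return -1;
    }
    if (!d->asserted) {
        return 0;
    }
    return intx_raise(d) == 0 ? 1 : -1;
}
```

A program would typically wait on ``resample_fd`` with ``poll()`` and
call ``intx_resample()`` when it becomes readable; device emulation
clears ``asserted`` when the guest services the device.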

intx irq descriptor


The irq descriptors are created by the proxy object
using ``event_notifier_init()`` to create the irq and re-sampling
*eventfds*, and ``kvm_vm_ioctl(KVM_IRQFD)`` to bind them to an interrupt.
The interrupt route can be found with
``pci_device_route_intx_to_irq()``.

intx routing changes


Intx routing can be changed when the guest programs the APIC the device
pin is connected to. The proxy object in QEMU will use
``pci_device_set_intx_routing_notifier()`` to be informed of any guest
changes to the route. This handler will broadly follow the VFIO
interrupt logic to change the route: de-assigning the existing irq
descriptor from its route, then assigning it the new route (see
``vfio_intx_update()``).

MSI/X acceleration
''''''''''''''''''

MSI/X interrupts are sent as DMA transactions to the host. The interrupt
data contains a vector that is programmed by the guest. A device may have
multiple MSI interrupts associated with it, so multiple irq descriptors
may need to be sent to the emulation program.

MSI/X irq descriptor


This case will also follow the VFIO example. For each MSI/X interrupt,
an *eventfd* is created, a virtual interrupt is allocated by
``kvm_irqchip_add_msi_route()``, and the virtual interrupt is bound to
the eventfd with ``kvm_irqchip_add_irqfd_notifier()``.

MSI/X config space changes


The guest may dynamically update several MSI-related tables in the
device's PCI config space. These include per-MSI interrupt enables and
vector data. Additionally, MSIX tables exist in device memory space, not
config space. Much like the BAR case above, the proxy object must look
at guest config space programming to keep the MSI interrupt state
consistent between QEMU and the emulation program.

--------------

Disaggregated CPU emulation
---------------------------

After IO services have been disaggregated, a second phase would be to
separate a process to handle CPU instruction emulation from the main
QEMU control function. There are no object separation points for this
code, so the first task would be to create one.

Host access controls
--------------------

Separating QEMU relies on the host OS's access restriction mechanisms to
enforce that the differing processes can only access the objects they
are entitled to. There are a couple of types of mechanisms usually
provided by general purpose OSs.

Discretionary access control
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Discretionary access control allows each user to control who can access
their files. In Linux, this type of control is usually too coarse for
QEMU separation, since it only provides three separate access controls:
one for the same user ID, the second for user IDs with the same group
ID, and the third for all other user IDs. Each device instance would
need a separate user ID to provide access control, which is likely to be
unwieldy for dynamically created VMs.

Mandatory access control
~~~~~~~~~~~~~~~~~~~~~~~~

Mandatory access control allows the OS to add an additional set of
controls, enforced by the OS itself, on top of discretionary access
control. It also adds other attributes to processes and files such as
types, roles, and categories, and can establish rules for how processes
and files can interact.

Type enforcement
^^^^^^^^^^^^^^^^

Type enforcement assigns a *type* attribute to processes and files, and
allows rules to be written on what operations a process with a given
type can perform on a file with a given type.
QEMU separation could take
advantage of type enforcement by running the emulation processes with
different types, both from the main QEMU process, and from the emulation
processes of different classes of devices.

For example, guest disk images and disk emulation processes could have
types separate from the main QEMU process and non-disk emulation
processes, and the type rules could prevent processes other than disk
emulation ones from accessing guest disk images. Similarly, network
emulation processes can have a type separate from the main QEMU process
and non-network emulation processes, and only that type can access the
host tun/tap device used to provide guest networking.

Category enforcement
^^^^^^^^^^^^^^^^^^^^

Category enforcement assigns a set of numbers within a given range to
the process or file. The process is granted access to the file if the
process's set is a superset of the file's set. This enforcement can be
used to separate multiple instances of devices in the same class.

For example, if there are multiple disk devices provided to a guest,
each device emulation process could be provisioned with a separate
category. The different device emulation processes would not be able to
access each other's backing disk images.

Alternatively, categories could be used in lieu of the type enforcement
scheme described above. In this scenario, different categories would be
used to prevent device emulation processes in different classes from
accessing resources assigned to other classes.
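
The superset rule used by category enforcement can be shown with a
short sketch. It models a category set as a 64-bit mask, which is an
assumption made for this example (real policies may use a larger
range); ``catset_t``, ``cat_add()`` and ``cat_may_access()`` are
hypothetical names, not part of any OS interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_CATEGORIES 64   /* assumed range, chosen to fit one word */

typedef uint64_t catset_t;  /* bit i set means category i is in the set */

/* Return a copy of the set with category 'cat' added. */
catset_t cat_add(catset_t set, unsigned cat)
{
    assert(cat < NUM_CATEGORIES);
    return set | ((catset_t)1 << cat);
}

/*
 * Access is granted only if the process's set is a superset of the
 * file's set, i.e. the file has no category the process lacks.
 */
bool cat_may_access(catset_t proc, catset_t file)
{
    return (file & ~proc) == 0;
}
```

With one category per disk emulation process and the same category on
its backing image, each process passes the check for its own image and
fails it for every other image.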