.. _vhost_user_proto:

===================
Vhost-user Protocol
===================

..
  Copyright 2014 Virtual Open Systems Sarl.
  Copyright 2019 Intel Corporation
  Licence: This work is licensed under the terms of the GNU GPL,
           version 2 or later. See the COPYING file in the top-level
           directory.

.. contents:: Table of Contents

Introduction
============

This protocol aims to complement the ``ioctl`` interface used to
control the vhost implementation in the Linux kernel. It implements
the control plane needed to establish virtqueue sharing with a user
space process on the same host. It uses communication over a Unix
domain socket to share file descriptors in the ancillary data of the
message.

The protocol defines 2 sides of the communication, *front-end* and
*back-end*. The *front-end* is the application that shares its virtqueues, in
our case QEMU. The *back-end* is the consumer of the virtqueues.

In the current implementation QEMU is the *front-end*, and the *back-end*
is the external process consuming the virtio queues, for example a
software Ethernet switch running in user space, such as Snabbswitch,
or a block device back-end processing read and write requests to a virtual
disk. In order to facilitate interoperability between various back-end
implementations, it is recommended to follow the :ref:`Backend program
conventions <backend_conventions>`.

The *front-end* and *back-end* can be either a client (i.e. connecting) or
server (listening) in the socket communication.

Support for platforms other than Linux
--------------------------------------

While vhost-user was initially developed targeting Linux, nowadays it
is supported on any platform that provides the following features:

- A way for requesting shared memory represented by a file descriptor
  so it can be passed over a UNIX domain socket and then mapped by the
  other process.

- AF_UNIX sockets with SCM_RIGHTS, so QEMU and the other process can
  exchange messages through them, including ancillary data when needed.

- Either eventfd or pipe/pipe2. On platforms where eventfd is not
  available, QEMU will automatically fall back to pipe2 or, as a last
  resort, pipe. Each file descriptor will be used for receiving or
  sending events by reading or writing (respectively) an 8-byte value
  to the corresponding descriptor. The 8-byte value itself has no meaning
  and should not be interpreted.

Message Specification
=====================

.. Note:: All numbers are in the machine native byte order.

A vhost-user message consists of 3 header fields and a payload.

+---------+-------+------+---------+
| request | flags | size | payload |
+---------+-------+------+---------+

Header
------

:request: 32-bit type of the request

:flags: 32-bit bit field

- Lower 2 bits are the version (currently 0x01)
- Bit 2 is the reply flag - needs to be sent on each reply from the back-end
- Bit 3 is the need_reply flag - see :ref:`REPLY_ACK <reply_ack>` for
  details.

:size: 32-bit size of the payload

Payload
-------

Depending on the request type, **payload** can be:

A single 64-bit integer
^^^^^^^^^^^^^^^^^^^^^^^

+-----+
| u64 |
+-----+

:u64: a 64-bit unsigned integer

A vring state description
^^^^^^^^^^^^^^^^^^^^^^^^^

+-------+-----+
| index | num |
+-------+-----+

:index: a 32-bit index

:num: a 32-bit number

A vring address description
^^^^^^^^^^^^^^^^^^^^^^^^^^^

+-------+-------+------+------------+------+-----------+-----+
| index | flags | size | descriptor | used | available | log |
+-------+-------+------+------------+------+-----------+-----+

:index: a 32-bit vring index

:flags: a 32-bit vring flags

:descriptor: a 64-bit ring address of the vring descriptor table

:used: a 64-bit ring address of the vring used ring

:available: a 64-bit ring address of the vring available ring

:log: a 64-bit guest address for logging

Note that a ring address is an IOVA if ``VIRTIO_F_IOMMU_PLATFORM`` has
been negotiated. Otherwise it is a user address.

Memory region description
^^^^^^^^^^^^^^^^^^^^^^^^^

+---------------+------+--------------+-------------+
| guest address | size | user address | mmap offset |
+---------------+------+--------------+-------------+

:guest address: a 64-bit guest address of the region

:size: a 64-bit size

:user address: a 64-bit user address

:mmap offset: 64-bit offset where region starts in the mapped memory

When the ``VHOST_USER_PROTOCOL_F_XEN_MMAP`` protocol feature has been
successfully negotiated, the memory region description contains two extra
fields at the end.

+---------------+------+--------------+-------------+----------------+-------+
| guest address | size | user address | mmap offset | xen mmap flags | domid |
+---------------+------+--------------+-------------+----------------+-------+

:xen mmap flags: 32-bit bit field

- Bit 0 is set for Xen foreign memory mapping.
- Bit 1 is set for Xen grant memory mapping.
- Bit 8 is set if the memory region can not be mapped in advance, and memory
  areas within this region must be mapped / unmapped only when required by the
  back-end. The back-end shouldn't try to map the entire region at once, as the
  front-end may not allow it. The back-end should rather map only the required
  amount of memory at once and unmap it after it is used.

:domid: a 32-bit Xen hypervisor specific domain id.

Single memory region description
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

+---------+--------+
| padding | region |
+---------+--------+

:padding: 64-bit

A region is represented by Memory region description.

Multiple Memory regions description
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

+-------------+---------+---------+-----+---------+
| num regions | padding | region0 | ... | region7 |
+-------------+---------+---------+-----+---------+

:num regions: a 32-bit number of regions

:padding: 32-bit

A region is represented by Memory region description.

Log description
^^^^^^^^^^^^^^^

+----------+------------+
| log size | log offset |
+----------+------------+

:log size: size of area used for logging

:log offset: offset from start of supplied file descriptor where
  logging starts (i.e.
  where guest address 0 would be
  logged)

An IOTLB message
^^^^^^^^^^^^^^^^

+------+------+--------------+-------------------+------+
| iova | size | user address | permissions flags | type |
+------+------+--------------+-------------------+------+

:iova: a 64-bit I/O virtual address programmed by the guest

:size: a 64-bit size

:user address: a 64-bit user address

:permissions flags: an 8-bit value:
  - 0: No access
  - 1: Read access
  - 2: Write access
  - 3: Read/Write access

:type: an 8-bit IOTLB message type:
  - 1: IOTLB miss
  - 2: IOTLB update
  - 3: IOTLB invalidate
  - 4: IOTLB access fail

Virtio device config space
^^^^^^^^^^^^^^^^^^^^^^^^^^

+--------+------+-------+---------+
| offset | size | flags | payload |
+--------+------+-------+---------+

:offset: a 32-bit offset of virtio device's configuration space

:size: a 32-bit configuration space access size in bytes

:flags: a 32-bit value:
  - 0: Vhost front-end messages used for writable fields
  - 1: Vhost front-end messages used for live migration

:payload: Size bytes array holding the contents of the virtio
          device's configuration space

Vring area description
^^^^^^^^^^^^^^^^^^^^^^

+-----+------+--------+
| u64 | size | offset |
+-----+------+--------+

:u64: a 64-bit integer containing the vring index and flags

:size: a 64-bit size of this area

:offset: a 64-bit offset of this area from the start of the
         supplied file descriptor

Inflight description
^^^^^^^^^^^^^^^^^^^^

+-----------+-------------+------------+------------+
| mmap size | mmap offset | num queues | queue size |
+-----------+-------------+------------+------------+

:mmap size: a 64-bit size of area to track inflight I/O

:mmap offset: a 64-bit offset of this area from the start
              of the supplied file descriptor

:num queues: a 16-bit number of virtqueues

:queue size: a 16-bit size of virtqueues

VhostUserShared
^^^^^^^^^^^^^^^

+------+
| UUID |
+------+

:UUID: 16-byte UUID, whose first three components (a 32-bit value, then
  two 16-bit values) are stored in big endian.

C structure
-----------

In QEMU the vhost-user message is implemented with the following struct:

.. code:: c

  typedef struct VhostUserMsg {
      VhostUserRequest request;
      uint32_t flags;
      uint32_t size;
      union {
          uint64_t u64;
          struct vhost_vring_state state;
          struct vhost_vring_addr addr;
          VhostUserMemory memory;
          VhostUserLog log;
          struct vhost_iotlb_msg iotlb;
          VhostUserConfig config;
          VhostUserVringArea area;
          VhostUserInflight inflight;
      };
  } QEMU_PACKED VhostUserMsg;

Communication
=============

The protocol for vhost-user is based on the existing implementation of
vhost for the Linux Kernel. Most messages that can be sent via the
Unix domain socket implementing vhost-user have an equivalent ioctl to
the kernel implementation.

The communication consists of the *front-end* sending message requests and
the *back-end* sending message replies. Most of the requests don't require
replies. Here is a list of the ones that do:

* ``VHOST_USER_GET_FEATURES``
* ``VHOST_USER_GET_PROTOCOL_FEATURES``
* ``VHOST_USER_GET_VRING_BASE``
* ``VHOST_USER_SET_LOG_BASE`` (if ``VHOST_USER_PROTOCOL_F_LOG_SHMFD``)
* ``VHOST_USER_GET_INFLIGHT_FD`` (if ``VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD``)

.. seealso::

   :ref:`REPLY_ACK <reply_ack>`
       The section on ``REPLY_ACK`` protocol extension.
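
Whether a message is a reply, or asks for one, is carried in the ``flags``
header field described in the Header section. The following is a minimal
sketch of helpers for that bit layout. The ``VU_*`` macros and ``vu_*``
functions are illustrative names chosen for this example; they are not
taken from QEMU or from any existing library.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit layout of the 32-bit flags header field (names are hypothetical). */
#define VU_FLAG_VERSION_MASK 0x3u       /* lower 2 bits: protocol version */
#define VU_FLAG_REPLY        (1u << 2)  /* bit 2: set on each reply */
#define VU_FLAG_NEED_REPLY   (1u << 3)  /* bit 3: sender asks for a reply */
#define VU_VERSION           0x1u       /* current version */

/* Extract the version from a flags word. */
static inline uint32_t vu_flags_version(uint32_t flags)
{
    return flags & VU_FLAG_VERSION_MASK;
}

/* True if the message is a reply. */
static inline bool vu_flags_is_reply(uint32_t flags)
{
    return (flags & VU_FLAG_REPLY) != 0;
}

/* True if the sender requested a reply (see REPLY_ACK). */
static inline bool vu_flags_need_reply(uint32_t flags)
{
    return (flags & VU_FLAG_NEED_REPLY) != 0;
}

/* Compose a flags word from a version and the two flag bits. */
static inline uint32_t vu_make_flags(uint32_t version, bool reply,
                                     bool need_reply)
{
    assert(version <= VU_FLAG_VERSION_MASK); /* must fit in 2 bits */
    return version |
           (reply ? VU_FLAG_REPLY : 0u) |
           (need_reply ? VU_FLAG_NEED_REPLY : 0u);
}
```

A back-end answering ``VHOST_USER_GET_FEATURES`` would, under these
assumptions, send ``vu_make_flags(VU_VERSION, true, false)`` in the reply
header.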

There are several messages that the front-end sends with file descriptors passed
in the ancillary data:

* ``VHOST_USER_ADD_MEM_REG``
* ``VHOST_USER_SET_MEM_TABLE``
* ``VHOST_USER_SET_LOG_BASE`` (if ``VHOST_USER_PROTOCOL_F_LOG_SHMFD``)
* ``VHOST_USER_SET_LOG_FD``
* ``VHOST_USER_SET_VRING_KICK``
* ``VHOST_USER_SET_VRING_CALL``
* ``VHOST_USER_SET_VRING_ERR``
* ``VHOST_USER_SET_BACKEND_REQ_FD`` (previous name ``VHOST_USER_SET_SLAVE_REQ_FD``)
* ``VHOST_USER_SET_INFLIGHT_FD`` (if ``VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD``)

If the *front-end* is unable to send the full message or receives a wrong
reply it will close the connection. An optional reconnection mechanism
can be implemented.

If the *back-end* detects some error such as incompatible features, it may also
close the connection. This should only happen in exceptional circumstances.

Any protocol extensions are gated by protocol feature bits, which
allows full backwards compatibility on both front-end and back-end. As
older back-ends don't support negotiating protocol features, a feature
bit was dedicated for this purpose::

  #define VHOST_USER_F_PROTOCOL_FEATURES 30

Note that ``VHOST_USER_F_PROTOCOL_FEATURES`` is the UNUSED (30) feature
bit defined in `VIRTIO 1.1 6.3 Legacy Interface: Reserved Feature Bits
<https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.html#x1-4130003>`_.
VIRTIO devices do not advertise this feature bit and therefore VIRTIO
drivers cannot negotiate it.

This reserved feature bit was reused by the vhost-user protocol to add
vhost-user protocol feature negotiation in a backwards compatible
fashion. Old vhost-user front-end and back-end implementations continue to
work even though they are not aware of vhost-user protocol feature
negotiation.
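
As an illustration of this gating, a front-end could test the reply to
``VHOST_USER_GET_FEATURES`` for bit 30 before sending any protocol feature
message. This is only a sketch: the helper names ``vu_has_feature`` and
``vu_can_negotiate_protocol_features`` are hypothetical, and only the
``VHOST_USER_F_PROTOCOL_FEATURES`` value comes from the text above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VHOST_USER_F_PROTOCOL_FEATURES 30

/* Test a single bit in a 64-bit feature bitmask (hypothetical helper). */
static inline bool vu_has_feature(uint64_t features, unsigned int bit)
{
    assert(bit < 64); /* the bitmask is a u64 payload */
    return ((features >> bit) & 1u) != 0;
}

/* VHOST_USER_GET_PROTOCOL_FEATURES and VHOST_USER_SET_PROTOCOL_FEATURES
 * are only legal if the back-end offered bit 30 in its feature bitmask. */
static inline bool vu_can_negotiate_protocol_features(uint64_t backend_features)
{
    return vu_has_feature(backend_features, VHOST_USER_F_PROTOCOL_FEATURES);
}
```

An old back-end that never sets bit 30 makes the second helper return
false, so the front-end simply skips protocol feature negotiation.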

Ring states
-----------

Rings can be in one of three states:

* stopped: the back-end must not process the ring at all.

* started but disabled: the back-end must process the ring without
  causing any side effects. For example, for a networking device,
  in the disabled state the back-end must not supply any new RX packets,
  but must process and discard any TX packets.

* started and enabled.

Each ring is initialized in a stopped state. The back-end must start a
ring upon receiving a kick (that is, detecting that the file descriptor is
readable) on the descriptor specified by ``VHOST_USER_SET_VRING_KICK``
or receiving the in-band message ``VHOST_USER_VRING_KICK`` if negotiated,
and stop a ring upon receiving ``VHOST_USER_GET_VRING_BASE``.

Rings can be enabled or disabled by ``VHOST_USER_SET_VRING_ENABLE``.

If ``VHOST_USER_F_PROTOCOL_FEATURES`` has not been negotiated, the
ring starts directly in the enabled state.

If ``VHOST_USER_F_PROTOCOL_FEATURES`` has been negotiated, the ring is
initialized in a disabled state and is enabled by
``VHOST_USER_SET_VRING_ENABLE`` with parameter 1.

While processing the rings (whether they are enabled or not), the back-end
must support changing some configuration aspects on the fly.

Multiple queue support
----------------------

Many devices have a fixed number of virtqueues. In this case the front-end
already knows the number of available virtqueues without communicating with the
back-end.

Some devices do not have a fixed number of virtqueues. Instead the maximum
number of virtqueues is chosen by the back-end. The number can depend on host
resource availability or back-end implementation details. Such devices are called
multiple queue devices.

Multiple queue support allows the back-end to advertise the maximum number of
queues.
This is treated as a protocol extension, hence the back-end has to
implement protocol features first. The multiple queues feature is supported
only when the protocol feature ``VHOST_USER_PROTOCOL_F_MQ`` (bit 0) is set.

The maximum number of queues the back-end supports can be queried with message
``VHOST_USER_GET_QUEUE_NUM``. The front-end should stop when the number of
requested queues is bigger than that.

As all queues share one connection, the front-end uses a unique index for each
queue in the sent message to identify a specified queue.

The front-end enables queues by sending message ``VHOST_USER_SET_VRING_ENABLE``.
vhost-user-net has historically automatically enabled the first queue pair.

Back-ends should always implement the ``VHOST_USER_PROTOCOL_F_MQ`` protocol
feature, even for devices with a fixed number of virtqueues, since it is simple
to implement and offers a degree of introspection.

Front-ends must not rely on the ``VHOST_USER_PROTOCOL_F_MQ`` protocol feature for
devices with a fixed number of virtqueues. Only true multiqueue devices
require this protocol feature.

Migration
---------

During live migration, the front-end may need to track the modifications
the back-end makes to the memory mapped regions. The front-end should mark
the dirty pages in a log. Once it complies with this logging, it may
declare the ``VHOST_F_LOG_ALL`` vhost feature.

To start/stop logging of data/used ring writes, the front-end may send
messages ``VHOST_USER_SET_FEATURES`` with ``VHOST_F_LOG_ALL`` and
``VHOST_USER_SET_VRING_ADDR`` with ``VHOST_VRING_F_LOG`` in ring's
flags set to 1/0, respectively.

All the modifications to memory pointed by vring "descriptor" should
be marked. Modifications to "used" vring should be marked if
``VHOST_VRING_F_LOG`` is part of ring's flags.
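
Marking a modification amounts to setting one bit per page in the log
bitmap, using the ``VHOST_LOG_PAGE`` size and the bit layout this section
defines. The sketch below shows one way to do that atomically. It assumes
a GCC or Clang compiler (for the ``__atomic_fetch_or`` builtin) and a log
already mapped as a byte array; ``vu_log_page`` and ``vu_log_write`` are
hypothetical helper names, not part of any existing implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VHOST_LOG_PAGE 0x1000

/* Mark the page containing guest physical address addr as dirty.
 * log_size is the size of the log in bytes; pages that fall outside
 * the log are ignored in this sketch. */
static inline void vu_log_page(uint8_t *log, uint64_t log_size, uint64_t addr)
{
    uint64_t page = addr / VHOST_LOG_PAGE;

    assert(log != NULL);
    if (page / 8 >= log_size) {
        return;
    }
    /* Atomic OR, because the log may be manipulated concurrently. */
    __atomic_fetch_or(&log[page / 8], (uint8_t)(1u << (page % 8)),
                      __ATOMIC_SEQ_CST);
}

/* Mark every page touched by a write of len bytes starting at addr. */
static inline void vu_log_write(uint8_t *log, uint64_t log_size,
                                uint64_t addr, uint64_t len)
{
    if (len == 0) {
        return;
    }
    assert(addr + len - 1 >= addr); /* the range must not wrap around */

    uint64_t first = addr / VHOST_LOG_PAGE;
    uint64_t last = (addr + len - 1) / VHOST_LOG_PAGE;

    for (uint64_t page = first; page <= last; page++) {
        vu_log_page(log, log_size, page * VHOST_LOG_PAGE);
    }
}
```

A write that straddles a page boundary sets two adjacent bits, which is
why the range helper walks from the first to the last touched page.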

Dirty pages are of size::

  #define VHOST_LOG_PAGE 0x1000

The log memory fd is provided in the ancillary data of
``VHOST_USER_SET_LOG_BASE`` message when the back-end has
``VHOST_USER_PROTOCOL_F_LOG_SHMFD`` protocol feature.

The size of the log is supplied as part of ``VhostUserMsg`` which
should be large enough to cover all known guest addresses. Log starts
at the supplied offset in the supplied file descriptor. The log
covers from address 0 to the maximum of guest regions. In pseudo-code,
to mark page at ``addr`` as dirty::

  page = addr / VHOST_LOG_PAGE
  log[page / 8] |= 1 << page % 8

Where ``addr`` is the guest physical address.

Use atomic operations, as the log may be concurrently manipulated.

Note that when logging modifications to the used ring (when
``VHOST_VRING_F_LOG`` is set for this ring), ``log_guest_addr`` should
be used to calculate the log offset: the write to first byte of the
used ring is logged at this offset from log start. Also note that this
value might be outside the legal guest physical address range
(i.e. does not have to be covered by the ``VhostUserMemory`` table), but
the bit offset of the last byte of the ring must fall within the size
supplied by ``VhostUserLog``.

``VHOST_USER_SET_LOG_FD`` is an optional message with an eventfd in
ancillary data; it may be used to inform the front-end that the log has
been modified.

Once the source has finished migration, rings will be stopped by the
source. No further update must be done before rings are restarted.

In postcopy migration the back-end is started before all the memory has
been received from the source host, and care must be taken to avoid
accessing pages that have yet to be received. The back-end opens a
'userfault'-fd and registers the memory with it; this fd is then
passed back over to the front-end.
The front-end services requests on the
userfaultfd for pages that are accessed and when the page is available
it performs WAKE ioctls on the userfaultfd to wake the stalled
back-end. The front-end indicates support for this via the
``VHOST_USER_PROTOCOL_F_PAGEFAULT`` feature.

Memory access
-------------

The front-end sends a list of vhost memory regions to the back-end using the
``VHOST_USER_SET_MEM_TABLE`` message. Each region has two base
addresses: a guest address and a user address.

Messages contain guest addresses and/or user addresses to reference locations
within the shared memory. The mapping of these addresses works as follows.

User addresses map to the vhost memory region containing that user address.

When the ``VIRTIO_F_IOMMU_PLATFORM`` feature has not been negotiated:

* Guest addresses map to the vhost memory region containing that guest
  address.

When the ``VIRTIO_F_IOMMU_PLATFORM`` feature has been negotiated:

* Guest addresses are also called I/O virtual addresses (IOVAs). They are
  translated to user addresses via the IOTLB.

* The vhost memory region guest address is not used.

IOMMU support
-------------

When the ``VIRTIO_F_IOMMU_PLATFORM`` feature has been negotiated, the
front-end sends IOTLB entry updates and invalidations by sending
``VHOST_USER_IOTLB_MSG`` requests to the back-end with a ``struct
vhost_iotlb_msg`` as payload. For update events, the ``iotlb`` payload
has to be filled with the update message type (2), the I/O virtual
address, the size, the user virtual address, and the permissions
flags. Addresses and size must be within vhost memory regions set via
the ``VHOST_USER_SET_MEM_TABLE`` request. For invalidation events, the
``iotlb`` payload has to be filled with the invalidation message type
(3), the I/O virtual address and the size.
On success, the back-end is
expected to reply with a zero payload, non-zero otherwise.

The back-end relies on the back-end communication channel (see :ref:`Back-end
communication <backend_communication>` section below) to send IOTLB miss
and access failure events, by sending ``VHOST_USER_BACKEND_IOTLB_MSG``
requests to the front-end with a ``struct vhost_iotlb_msg`` as
payload. For miss events, the iotlb payload has to be filled with the
miss message type (1), the I/O virtual address and the permissions
flags. For access failure events, the iotlb payload has to be filled
with the access failure message type (4), the I/O virtual address and
the permissions flags. For synchronization purposes, the back-end may
rely on the reply-ack feature, so the front-end may send a reply when
the operation is completed if the reply-ack feature is negotiated and
the back-end requests a reply. For miss events, completed operation means
either the front-end sent an update message containing the IOTLB entry
containing requested address and permission, or the front-end sent nothing if
the IOTLB miss message is invalid (invalid IOVA or permission).

The front-end isn't expected to take the initiative to send IOTLB update
messages, as the back-end sends IOTLB miss messages for the guest virtual
memory areas it needs to access.

.. _backend_communication:

Back-end communication
----------------------

An optional communication channel is provided if the back-end declares
``VHOST_USER_PROTOCOL_F_BACKEND_REQ`` protocol feature, to allow the
back-end to make requests to the front-end.

The fd is provided via ``VHOST_USER_SET_BACKEND_REQ_FD`` ancillary data.

A back-end may then send ``VHOST_USER_BACKEND_*`` messages to the front-end
using this fd communication channel.
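
The IOTLB miss described in the IOMMU support section is one example of
such a message. The sketch below fills the payload for a miss event
following the field list of the IOTLB message description. ``VuIotlbMsg``
is a local stand-in defined for this example; real code would normally
use ``struct vhost_iotlb_msg`` from the Linux vhost UAPI headers, and the
``VU_*`` constants and ``vu_iotlb_miss`` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of the IOTLB message payload (illustrative only). */
typedef struct VuIotlbMsg {
    uint64_t iova;   /* I/O virtual address programmed by the guest */
    uint64_t size;
    uint64_t uaddr;  /* user address */
    uint8_t perm;    /* permissions flags */
    uint8_t type;    /* IOTLB message type */
} VuIotlbMsg;

/* Permissions flags values from the payload description. */
enum { VU_ACCESS_NONE = 0, VU_ACCESS_RO = 1, VU_ACCESS_WO = 2, VU_ACCESS_RW = 3 };

/* IOTLB message types from the payload description. */
enum {
    VU_IOTLB_MISS = 1,
    VU_IOTLB_UPDATE = 2,
    VU_IOTLB_INVALIDATE = 3,
    VU_IOTLB_ACCESS_FAIL = 4
};

/* Build the payload of a miss event: only the type, the IOVA and the
 * permissions are filled in; the remaining fields stay zero. */
static inline VuIotlbMsg vu_iotlb_miss(uint64_t iova, uint8_t perm)
{
    VuIotlbMsg msg;

    assert(perm >= VU_ACCESS_RO && perm <= VU_ACCESS_RW);
    memset(&msg, 0, sizeof(msg));
    msg.iova = iova;
    msg.perm = perm;
    msg.type = VU_IOTLB_MISS;
    return msg;
}
```

The resulting structure would be sent as the payload of a
``VHOST_USER_BACKEND_IOTLB_MSG`` request over the back-end channel.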

If ``VHOST_USER_PROTOCOL_F_BACKEND_SEND_FD`` protocol feature is
negotiated, the back-end can send file descriptors (at most 8 descriptors in
each message) to the front-end via ancillary data using this fd communication
channel.

Inflight I/O tracking
---------------------

To support reconnecting after restart or crash, the back-end may need to
resubmit inflight I/Os. If the virtqueue is processed in order, we can
easily achieve that by getting the inflight descriptors from
descriptor table (split virtqueue) or descriptor ring (packed
virtqueue). However, it can't work when we process descriptors
out-of-order because some entries which store the information of
inflight descriptors in available ring (split virtqueue) or descriptor
ring (packed virtqueue) might be overwritten by new entries. To solve
this problem, the back-end needs to allocate an extra buffer to store this
information of inflight descriptors and share it with the front-end for
persistence. ``VHOST_USER_GET_INFLIGHT_FD`` and
``VHOST_USER_SET_INFLIGHT_FD`` are used to transfer this buffer
between front-end and back-end. The format of this buffer is described
below:

+---------------+---------------+-----+---------------+
| queue0 region | queue1 region | ... | queueN region |
+---------------+---------------+-----+---------------+

N is the number of available virtqueues. The back-end could get it from num
queues field of ``VhostUserInflight``.

For split virtqueue, queue region can be implemented as:

.. code:: c

   typedef struct DescStateSplit {
       /* Indicate whether this descriptor is inflight or not.
        * Only available for head-descriptor. */
       uint8_t inflight;

       /* Padding */
       uint8_t padding[5];

       /* Maintain a list for the last batch of used descriptors.
        * Only available when batching is used for submitting */
       uint16_t next;

       /* Used to preserve the order of fetching available descriptors.
        * Only available for head-descriptor. */
       uint64_t counter;
   } DescStateSplit;

   typedef struct QueueRegionSplit {
       /* The feature flags of this region. Now it's initialized to 0. */
       uint64_t features;

       /* The version of this region. It's 1 currently.
        * Zero value indicates an uninitialized buffer */
       uint16_t version;

       /* The size of DescStateSplit array. It's equal to the virtqueue size.
        * The back-end could get it from queue size field of VhostUserInflight. */
       uint16_t desc_num;

       /* The head of list that track the last batch of used descriptors. */
       uint16_t last_batch_head;

       /* Store the idx value of used ring */
       uint16_t used_idx;

       /* Used to track the state of each descriptor in descriptor table */
       DescStateSplit desc[];
   } QueueRegionSplit;

To track inflight I/O, the queue region should be processed as follows:

When receiving available buffers from the driver:

#. Get the next available head-descriptor index from available ring, ``i``

#. Set ``desc[i].counter`` to the value of global counter

#. Increase global counter by 1

#. Set ``desc[i].inflight`` to 1

When supplying used buffers to the driver:

1. Get corresponding used head-descriptor index, ``i``

2. Set ``desc[i].next`` to ``last_batch_head``

3. Set ``last_batch_head`` to ``i``

#. Steps 1,2,3 may be performed repeatedly if batching is possible

#. Increase the ``idx`` value of used ring by the size of the batch

#. Set the ``inflight`` field of each ``DescStateSplit`` entry in the batch to 0

#. Set ``used_idx`` to the ``idx`` value of used ring

When reconnecting:

#. If the value of ``used_idx`` does not match the ``idx`` value of
   used ring (means the inflight field of ``DescStateSplit`` entries in
   last batch may be incorrect),

   a. Subtract the value of ``used_idx`` from the ``idx`` value of
      used ring to get last batch size of ``DescStateSplit`` entries

   #. Set the ``inflight`` field of each ``DescStateSplit`` entry to 0 in last batch
      list which starts from ``last_batch_head``

   #. Set ``used_idx`` to the ``idx`` value of used ring

#. Resubmit inflight ``DescStateSplit`` entries in order of their
   counter value

For packed virtqueue, queue region can be implemented as:

.. code:: c

   typedef struct DescStatePacked {
       /* Indicate whether this descriptor is inflight or not.
        * Only available for head-descriptor. */
       uint8_t inflight;

       /* Padding */
       uint8_t padding;

       /* Link to the next free entry */
       uint16_t next;

       /* Link to the last entry of descriptor list.
        * Only available for head-descriptor. */
       uint16_t last;

       /* The length of descriptor list.
        * Only available for head-descriptor. */
       uint16_t num;

       /* Used to preserve the order of fetching available descriptors.
        * Only available for head-descriptor. */
       uint64_t counter;

       /* The buffer id */
       uint16_t id;

       /* The descriptor flags */
       uint16_t flags;

       /* The buffer length */
       uint32_t len;

       /* The buffer address */
       uint64_t addr;
   } DescStatePacked;

   typedef struct QueueRegionPacked {
       /* The feature flags of this region. Now it's initialized to 0. */
       uint64_t features;

       /* The version of this region. It's 1 currently.
        * Zero value indicates an uninitialized buffer */
       uint16_t version;

       /* The size of DescStatePacked array. It's equal to the virtqueue size.
        * The back-end could get it from queue size field of VhostUserInflight.
        */
       uint16_t desc_num;

       /* The head of free DescStatePacked entry list */
       uint16_t free_head;

       /* The old head of free DescStatePacked entry list */
       uint16_t old_free_head;

       /* The used index of descriptor ring */
       uint16_t used_idx;

       /* The old used index of descriptor ring */
       uint16_t old_used_idx;

       /* Device ring wrap counter */
       uint8_t used_wrap_counter;

       /* The old device ring wrap counter */
       uint8_t old_used_wrap_counter;

       /* Padding */
       uint8_t padding[7];

       /* Used to track the state of each descriptor fetched from descriptor ring */
       DescStatePacked desc[];
   } QueueRegionPacked;

To track inflight I/O, the queue region should be processed as follows:

When receiving available buffers from the driver:

#. Get the next available descriptor entry from descriptor ring, ``d``

#. If ``d`` is head descriptor,

   a. Set ``desc[old_free_head].num`` to 0

   #. Set ``desc[old_free_head].counter`` to the value of global counter

   #. Increase global counter by 1

   #. Set ``desc[old_free_head].inflight`` to 1

#. If ``d`` is last descriptor, set ``desc[old_free_head].last`` to
   ``free_head``

#. Increase ``desc[old_free_head].num`` by 1

#. Set ``desc[free_head].addr``, ``desc[free_head].len``,
   ``desc[free_head].flags``, ``desc[free_head].id`` to ``d.addr``,
   ``d.len``, ``d.flags``, ``d.id``

#. Set ``free_head`` to ``desc[free_head].next``

#. If ``d`` is last descriptor, set ``old_free_head`` to ``free_head``

When supplying used buffers to the driver:

1. Get corresponding used head-descriptor entry from descriptor ring,
   ``d``

2. Get corresponding ``DescStatePacked`` entry, ``e``

3. Set ``desc[e.last].next`` to ``free_head``

4. Set ``free_head`` to the index of ``e``

#. Steps 1,2,3,4 may be performed repeatedly if batching is possible

#. Increase ``used_idx`` by the size of the batch and update
   ``used_wrap_counter`` if needed

#. Update ``d.flags``

#. Set the ``inflight`` field of each head ``DescStatePacked`` entry
   in the batch to 0

#. Set ``old_free_head``, ``old_used_idx``, ``old_used_wrap_counter``
   to ``free_head``, ``used_idx``, ``used_wrap_counter``

When reconnecting:

#. If ``used_idx`` does not match ``old_used_idx`` (means the
   ``inflight`` field of ``DescStatePacked`` entries in last batch may
   be incorrect),

   a. Get the next descriptor ring entry through ``old_used_idx``, ``d``

   #. Use ``old_used_wrap_counter`` to calculate the available flags

   #. If ``d.flags`` is not equal to the calculated flags value (means
      back-end has submitted the buffer to guest driver before crash, so
      it has to commit the in-progress update), set ``old_free_head``,
      ``old_used_idx``, ``old_used_wrap_counter`` to ``free_head``,
      ``used_idx``, ``used_wrap_counter``

#. Set ``free_head``, ``used_idx``, ``used_wrap_counter`` to
   ``old_free_head``, ``old_used_idx``, ``old_used_wrap_counter``
   (roll back any in-progress update)

#. Set the ``inflight`` field of each ``DescStatePacked`` entry in
   free list to 0

#. Resubmit inflight ``DescStatePacked`` entries in order of their
   counter value

In-band notifications
---------------------

In some limited situations (e.g. for simulation) it is desirable to
have the kick, call and error (if used) signals done via in-band
messages instead of asynchronous eventfd notifications. This can be
done by negotiating the ``VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS``
protocol feature.

Note that due to the fact that too many messages on the sockets can
cause the sending application(s) to block, it is not advised to use
this feature unless absolutely necessary.
It is also considered an
error to negotiate this feature without also negotiating
``VHOST_USER_PROTOCOL_F_BACKEND_REQ`` and ``VHOST_USER_PROTOCOL_F_REPLY_ACK``;
the former is necessary for getting a message channel from the back-end
to the front-end, while the latter needs to be used with the in-band
notification messages to block until they are processed, both to avoid
blocking later and for proper processing (at least in the simulation
use case). As it has no other way of signalling this error, the back-end
should close the connection as a response to a
``VHOST_USER_SET_PROTOCOL_FEATURES`` message that sets the in-band
notifications feature flag without the other two.

Protocol features
-----------------

.. code:: c

  #define VHOST_USER_PROTOCOL_F_MQ                    0
  #define VHOST_USER_PROTOCOL_F_LOG_SHMFD             1
  #define VHOST_USER_PROTOCOL_F_RARP                  2
  #define VHOST_USER_PROTOCOL_F_REPLY_ACK             3
  #define VHOST_USER_PROTOCOL_F_MTU                   4
  #define VHOST_USER_PROTOCOL_F_BACKEND_REQ           5
  #define VHOST_USER_PROTOCOL_F_CROSS_ENDIAN          6
  #define VHOST_USER_PROTOCOL_F_CRYPTO_SESSION        7
  #define VHOST_USER_PROTOCOL_F_PAGEFAULT             8
  #define VHOST_USER_PROTOCOL_F_CONFIG                9
  #define VHOST_USER_PROTOCOL_F_BACKEND_SEND_FD      10
  #define VHOST_USER_PROTOCOL_F_HOST_NOTIFIER        11
  #define VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD       12
  #define VHOST_USER_PROTOCOL_F_RESET_DEVICE         13
  #define VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS 14
  #define VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS  15
  #define VHOST_USER_PROTOCOL_F_STATUS               16
  #define VHOST_USER_PROTOCOL_F_XEN_MMAP             17
  #define VHOST_USER_PROTOCOL_F_SHARED_OBJECT        18

Front-end message types
-----------------------

``VHOST_USER_GET_FEATURES``
  :id: 1
  :equivalent ioctl: ``VHOST_GET_FEATURES``
  :request payload: N/A
  :reply payload: ``u64``

  Get from the underlying vhost implementation the features bitmask.
  Feature bit ``VHOST_USER_F_PROTOCOL_FEATURES`` signals back-end support
  for ``VHOST_USER_GET_PROTOCOL_FEATURES`` and
  ``VHOST_USER_SET_PROTOCOL_FEATURES``.

``VHOST_USER_SET_FEATURES``
  :id: 2
  :equivalent ioctl: ``VHOST_SET_FEATURES``
  :request payload: ``u64``
  :reply payload: N/A

  Enable features in the underlying vhost implementation using a
  bitmask. Feature bit ``VHOST_USER_F_PROTOCOL_FEATURES`` signals
  back-end support for ``VHOST_USER_GET_PROTOCOL_FEATURES`` and
  ``VHOST_USER_SET_PROTOCOL_FEATURES``.

``VHOST_USER_GET_PROTOCOL_FEATURES``
  :id: 15
  :equivalent ioctl: ``VHOST_GET_FEATURES``
  :request payload: N/A
  :reply payload: ``u64``

  Get the protocol feature bitmask from the underlying vhost
  implementation. Only legal if feature bit
  ``VHOST_USER_F_PROTOCOL_FEATURES`` is present in
  ``VHOST_USER_GET_FEATURES``. It does not need to be acknowledged by
  ``VHOST_USER_SET_FEATURES``.

.. Note::
   Back-ends that report ``VHOST_USER_F_PROTOCOL_FEATURES`` must
   support this message even before ``VHOST_USER_SET_FEATURES`` was
   called.

``VHOST_USER_SET_PROTOCOL_FEATURES``
  :id: 16
  :equivalent ioctl: ``VHOST_SET_FEATURES``
  :request payload: ``u64``
  :reply payload: N/A

  Enable protocol features in the underlying vhost implementation.

  Only legal if feature bit ``VHOST_USER_F_PROTOCOL_FEATURES`` is present in
  ``VHOST_USER_GET_FEATURES``. It does not need to be acknowledged by
  ``VHOST_USER_SET_FEATURES``.

.. Note::
   Back-ends that report ``VHOST_USER_F_PROTOCOL_FEATURES`` must support
   this message even before ``VHOST_USER_SET_FEATURES`` was called.

``VHOST_USER_SET_OWNER``
  :id: 3
  :equivalent ioctl: ``VHOST_SET_OWNER``
  :request payload: N/A
  :reply payload: N/A

  Issued when a new connection is established. It marks the sender
  as the front-end that owns the session.
  This can be used on the *back-end*
  as a "session start" flag.

``VHOST_USER_RESET_OWNER``
  :id: 4
  :request payload: N/A
  :reply payload: N/A

.. admonition:: Deprecated

   This is no longer used. It used to be sent to request disabling all
   rings, but some back-ends interpreted it to also discard connection
   state (this interpretation would lead to bugs). It is recommended
   that back-ends either ignore this message, or use it to disable all
   rings.

``VHOST_USER_SET_MEM_TABLE``
  :id: 5
  :equivalent ioctl: ``VHOST_SET_MEM_TABLE``
  :request payload: multiple memory regions description
  :reply payload: (postcopy only) multiple memory regions description

  Sets the memory map regions on the back-end so it can translate the
  vring addresses. In the ancillary data there is an array of file
  descriptors for each memory mapped region. The size and ordering of
  the fds matches the number and ordering of memory regions.

  When ``VHOST_USER_POSTCOPY_LISTEN`` has been received,
  ``SET_MEM_TABLE`` replies with the bases of the memory mapped
  regions to the front-end. The back-end must have mmap'd the regions but
  not yet accessed them and should not yet generate a userfault
  event.

.. Note::
   ``NEED_REPLY_MASK`` is not set in this case. QEMU will then
   reply back to the list of mappings with an empty
   ``VHOST_USER_SET_MEM_TABLE`` as an acknowledgement; only upon
   reception of this message may the guest start accessing the memory
   and generating faults.

``VHOST_USER_SET_LOG_BASE``
  :id: 6
  :equivalent ioctl: ``VHOST_SET_LOG_BASE``
  :request payload: u64
  :reply payload: N/A

  Sets logging shared memory space.

  When the back-end has the ``VHOST_USER_PROTOCOL_F_LOG_SHMFD`` protocol
  feature, the log memory fd is provided in the ancillary data of the
  ``VHOST_USER_SET_LOG_BASE`` message; the size and offset of the shared
  memory area are provided in the message.

``VHOST_USER_SET_LOG_FD``
  :id: 7
  :equivalent ioctl: ``VHOST_SET_LOG_FD``
  :request payload: N/A
  :reply payload: N/A

  Sets the logging file descriptor, which is passed as ancillary data.

``VHOST_USER_SET_VRING_NUM``
  :id: 8
  :equivalent ioctl: ``VHOST_SET_VRING_NUM``
  :request payload: vring state description
  :reply payload: N/A

  Set the size of the queue.

``VHOST_USER_SET_VRING_ADDR``
  :id: 9
  :equivalent ioctl: ``VHOST_SET_VRING_ADDR``
  :request payload: vring address description
  :reply payload: N/A

  Sets the addresses of the different aspects of the vring.

``VHOST_USER_SET_VRING_BASE``
  :id: 10
  :equivalent ioctl: ``VHOST_SET_VRING_BASE``
  :request payload: vring state description
  :reply payload: N/A

  Sets the base offset in the available vring.

``VHOST_USER_GET_VRING_BASE``
  :id: 11
  :equivalent ioctl: ``VHOST_GET_VRING_BASE``
  :request payload: vring state description
  :reply payload: vring state description

  Get the available vring base offset.

``VHOST_USER_SET_VRING_KICK``
  :id: 12
  :equivalent ioctl: ``VHOST_SET_VRING_KICK``
  :request payload: ``u64``
  :reply payload: N/A

  Set the event file descriptor for adding buffers to the vring. It is
  passed in the ancillary data.

  Bits (0-7) of the payload contain the vring index. Bit 8 is the
  invalid FD flag. This flag is set when there is no file descriptor
  in the ancillary data. This signals that polling should be used
  instead of waiting for the kick.
  Note that if the protocol feature
  ``VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS`` has been negotiated,
  this message isn't necessary as the ring is also started on the
  ``VHOST_USER_VRING_KICK`` message; it may however still be used to
  set an event file descriptor (which will be preferred over the
  message) or to enable polling.

``VHOST_USER_SET_VRING_CALL``
  :id: 13
  :equivalent ioctl: ``VHOST_SET_VRING_CALL``
  :request payload: ``u64``
  :reply payload: N/A

  Set the event file descriptor to signal when buffers are used. It is
  passed in the ancillary data.

  Bits (0-7) of the payload contain the vring index. Bit 8 is the
  invalid FD flag. This flag is set when there is no file descriptor
  in the ancillary data. This signals that polling will be used
  instead of waiting for the call. Note that if the protocol features
  ``VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS`` and
  ``VHOST_USER_PROTOCOL_F_BACKEND_REQ`` have been negotiated, this message
  isn't necessary as the ``VHOST_USER_BACKEND_VRING_CALL`` message can be
  used; it may however still be used to set an event file descriptor
  or to enable polling.

``VHOST_USER_SET_VRING_ERR``
  :id: 14
  :equivalent ioctl: ``VHOST_SET_VRING_ERR``
  :request payload: ``u64``
  :reply payload: N/A

  Set the event file descriptor to signal when an error occurs. It is
  passed in the ancillary data.

  Bits (0-7) of the payload contain the vring index. Bit 8 is the
  invalid FD flag. This flag is set when there is no file descriptor
  in the ancillary data.
  Note that if the protocol features
  ``VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS`` and
  ``VHOST_USER_PROTOCOL_F_BACKEND_REQ`` have been negotiated, this message
  isn't necessary as the ``VHOST_USER_BACKEND_VRING_ERR`` message can be
  used; it may however still be used to set an event file descriptor
  (which will be preferred over the message).

``VHOST_USER_GET_QUEUE_NUM``
  :id: 17
  :equivalent ioctl: N/A
  :request payload: N/A
  :reply payload: u64

  Query how many queues the back-end supports.

  This request should be sent only when ``VHOST_USER_PROTOCOL_F_MQ``
  is set in queried protocol features by
  ``VHOST_USER_GET_PROTOCOL_FEATURES``.

``VHOST_USER_SET_VRING_ENABLE``
  :id: 18
  :equivalent ioctl: N/A
  :request payload: vring state description
  :reply payload: N/A

  Signal the back-end to enable or disable the corresponding vring.

  This request should be sent only when
  ``VHOST_USER_F_PROTOCOL_FEATURES`` has been negotiated.

``VHOST_USER_SEND_RARP``
  :id: 19
  :equivalent ioctl: N/A
  :request payload: ``u64``
  :reply payload: N/A

  Ask the vhost-user back-end to broadcast a fake RARP to notify that the
  migration is terminated, for guests that do not support GUEST_ANNOUNCE.

  Only legal if feature bit ``VHOST_USER_F_PROTOCOL_FEATURES`` is
  present in ``VHOST_USER_GET_FEATURES`` and protocol feature bit
  ``VHOST_USER_PROTOCOL_F_RARP`` is present in
  ``VHOST_USER_GET_PROTOCOL_FEATURES``. The first 6 bytes of the
  payload contain the MAC address of the guest to allow the vhost-user
  back-end to construct and broadcast the fake RARP.

``VHOST_USER_NET_SET_MTU``
  :id: 20
  :equivalent ioctl: N/A
  :request payload: ``u64``
  :reply payload: N/A

  Set host MTU value exposed to the guest.

  This request should be sent only when the ``VIRTIO_NET_F_MTU`` feature
  has been successfully negotiated, ``VHOST_USER_F_PROTOCOL_FEATURES``
  is present in ``VHOST_USER_GET_FEATURES`` and protocol feature bit
  ``VHOST_USER_PROTOCOL_F_NET_MTU`` is present in
  ``VHOST_USER_GET_PROTOCOL_FEATURES``.

  If ``VHOST_USER_PROTOCOL_F_REPLY_ACK`` is negotiated, the back-end must
  respond with zero in case the specified MTU is valid, or non-zero
  otherwise.

``VHOST_USER_SET_BACKEND_REQ_FD`` (previous name ``VHOST_USER_SET_SLAVE_REQ_FD``)
  :id: 21
  :equivalent ioctl: N/A
  :request payload: N/A
  :reply payload: N/A

  Set the socket file descriptor for back-end initiated requests. It is passed
  in the ancillary data.

  This request should be sent only when
  ``VHOST_USER_F_PROTOCOL_FEATURES`` has been negotiated, and protocol
  feature bit ``VHOST_USER_PROTOCOL_F_BACKEND_REQ`` is present in
  ``VHOST_USER_GET_PROTOCOL_FEATURES``. If
  ``VHOST_USER_PROTOCOL_F_REPLY_ACK`` is negotiated, the back-end must
  respond with zero for success, non-zero otherwise.

``VHOST_USER_IOTLB_MSG``
  :id: 22
  :equivalent ioctl: N/A (equivalent to ``VHOST_IOTLB_MSG`` message type)
  :request payload: ``struct vhost_iotlb_msg``
  :reply payload: ``u64``

  Send IOTLB messages with ``struct vhost_iotlb_msg`` as payload.

  The front-end sends such requests to update and invalidate entries in the
  device IOTLB. The back-end has to acknowledge the request by sending
  zero as ``u64`` payload for success, non-zero otherwise.

  This request should be sent only when the ``VIRTIO_F_IOMMU_PLATFORM``
  feature has been successfully negotiated.
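The IOTLB update and acknowledgement described above can be sketched as follows. This is a minimal illustration, not QEMU's implementation: the ``ex_`` names are hypothetical, and the structure and constants are assumed to mirror the Linux UAPI ``struct vhost_iotlb_msg`` (a real program would include the kernel header instead of redefining them).

```c
#include <assert.h>
#include <stdint.h>

/* Assumed to mirror the Linux UAPI values for vhost_iotlb_msg.perm/type. */
enum { EX_ACCESS_RO = 1, EX_ACCESS_WO = 2, EX_ACCESS_RW = 3 };
enum { EX_IOTLB_MISS = 1, EX_IOTLB_UPDATE = 2,
       EX_IOTLB_INVALIDATE = 3, EX_IOTLB_ACCESS_FAIL = 4 };

/* Local stand-in for struct vhost_iotlb_msg (hypothetical name). */
struct ex_iotlb_msg {
    uint64_t iova;   /* I/O virtual address of the mapping */
    uint64_t size;   /* length of the mapping in bytes */
    uint64_t uaddr;  /* user address backing the mapping */
    uint8_t perm;    /* EX_ACCESS_* */
    uint8_t type;    /* EX_IOTLB_* */
};

/* Build the payload a front-end would send to add an IOTLB entry. */
static struct ex_iotlb_msg ex_iotlb_update(uint64_t iova, uint64_t size,
                                           uint64_t uaddr, uint8_t perm)
{
    struct ex_iotlb_msg m = { iova, size, uaddr, perm, EX_IOTLB_UPDATE };

    assert(perm >= EX_ACCESS_RO && perm <= EX_ACCESS_RW);
    return m;
}

/* The back-end acknowledges with a u64: zero is success, non-zero failure. */
static int ex_iotlb_ack_ok(uint64_t reply)
{
    return reply == 0;
}
```

A front-end would place the returned structure in the payload of a ``VHOST_USER_IOTLB_MSG`` request and pass the ``u64`` reply to ``ex_iotlb_ack_ok()``.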

``VHOST_USER_SET_VRING_ENDIAN``
  :id: 23
  :equivalent ioctl: ``VHOST_SET_VRING_ENDIAN``
  :request payload: vring state description
  :reply payload: N/A

  Set the endianness of a VQ for legacy devices. Little-endian is
  indicated with state.num set to 0 and big-endian is indicated with
  state.num set to 1. Other values are invalid.

  This request should be sent only when
  ``VHOST_USER_PROTOCOL_F_CROSS_ENDIAN`` has been negotiated.
  Back-ends that negotiated this feature should handle both
  endiannesses and expect this message once (per VQ) during device
  configuration (i.e. before the front-end starts the VQ).

``VHOST_USER_GET_CONFIG``
  :id: 24
  :equivalent ioctl: N/A
  :request payload: virtio device config space
  :reply payload: virtio device config space

  When ``VHOST_USER_PROTOCOL_F_CONFIG`` is negotiated, this message is
  submitted by the vhost-user front-end to fetch the contents of the
  virtio device configuration space. The vhost-user back-end's payload size
  MUST match the front-end's request; the vhost-user back-end uses a
  zero-length payload to indicate an error to the vhost-user front-end.
  The vhost-user front-end may cache the contents to avoid repeated
  ``VHOST_USER_GET_CONFIG`` calls.

``VHOST_USER_SET_CONFIG``
  :id: 25
  :equivalent ioctl: N/A
  :request payload: virtio device config space
  :reply payload: N/A

  When ``VHOST_USER_PROTOCOL_F_CONFIG`` is negotiated, this message is
  submitted by the vhost-user front-end when the guest changes the virtio
  device configuration space. It can also be used for live migration
  on the destination host. The vhost-user back-end must check the flags
  field, and back-ends MUST NOT accept SET_CONFIG for read-only
  configuration space fields unless the live migration bit is set.
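The ``VHOST_USER_GET_CONFIG`` reply rule above (matching size on success, zero length on error) could be checked on the front-end side roughly as follows. This is a sketch under assumptions: the function and enum names are hypothetical, and "payload size" is taken to mean the size the front-end compares against what it requested.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical classification of a GET_CONFIG reply. */
enum ex_config_result {
    EX_CONFIG_OK,             /* reply size matches the request */
    EX_CONFIG_BACKEND_ERROR,  /* zero-length payload: back-end reported an error */
    EX_CONFIG_SIZE_MISMATCH   /* non-zero but different size: protocol violation */
};

static enum ex_config_result
ex_check_get_config_reply(uint32_t request_size, uint32_t reply_size)
{
    /* A zero-sized request could not be told apart from an error reply. */
    assert(request_size != 0);

    if (reply_size == 0) {
        return EX_CONFIG_BACKEND_ERROR;
    }
    if (reply_size != request_size) {
        return EX_CONFIG_SIZE_MISMATCH;
    }
    return EX_CONFIG_OK;
}
```

Separating the two failure cases lets a front-end log a back-end error differently from a malformed reply, although the specification itself only distinguishes success from error.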

``VHOST_USER_CREATE_CRYPTO_SESSION``
  :id: 26
  :equivalent ioctl: N/A
  :request payload: crypto session description
  :reply payload: crypto session description

  Create a session for crypto operation. The back-end must return
  the session id, 0 or positive for success, negative for failure.
  This request should be sent only when the
  ``VHOST_USER_PROTOCOL_F_CRYPTO_SESSION`` feature has been
  successfully negotiated. It's a required feature for crypto
  devices.

``VHOST_USER_CLOSE_CRYPTO_SESSION``
  :id: 27
  :equivalent ioctl: N/A
  :request payload: ``u64``
  :reply payload: N/A

  Close a session for crypto operation which was previously
  created by ``VHOST_USER_CREATE_CRYPTO_SESSION``.

  This request should be sent only when the
  ``VHOST_USER_PROTOCOL_F_CRYPTO_SESSION`` feature has been
  successfully negotiated. It's a required feature for crypto
  devices.

``VHOST_USER_POSTCOPY_ADVISE``
  :id: 28
  :request payload: N/A
  :reply payload: userfault fd

  When ``VHOST_USER_PROTOCOL_F_PAGEFAULT`` is supported, the front-end
  advises the back-end that a migration with postcopy enabled is underway;
  the back-end must open a userfaultfd for later use. Note that at this
  stage the migration is still in precopy mode.

``VHOST_USER_POSTCOPY_LISTEN``
  :id: 29
  :request payload: N/A
  :reply payload: N/A

  The front-end advises the back-end that a transition to postcopy mode has
  happened. The back-end must ensure that shared memory is registered
  with userfaultfd to cause faulting of non-present pages.

  This is always sent sometime after a ``VHOST_USER_POSTCOPY_ADVISE``,
  and thus only when ``VHOST_USER_PROTOCOL_F_PAGEFAULT`` is supported.
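The ordering constraint above (``VHOST_USER_POSTCOPY_LISTEN`` only after ``VHOST_USER_POSTCOPY_ADVISE``) can be tracked in a back-end with a small state machine. The sketch below is illustrative only: the ``ex_`` names are hypothetical, and treating a repeated ``ADVISE`` as an error is an assumption of this example, not something the specification states.

```c
#include <assert.h>
#include <stddef.h>

/* Request ids as listed in this document. */
enum { EX_POSTCOPY_ADVISE = 28, EX_POSTCOPY_LISTEN = 29 };

/* Hypothetical per-connection postcopy state. */
enum ex_postcopy_state { EX_PRECOPY, EX_ADVISED, EX_LISTENING };

/* Returns 0 if the request is acceptable in the current state, -1 otherwise. */
static int ex_postcopy_step(enum ex_postcopy_state *s, int request)
{
    assert(s != NULL);

    switch (request) {
    case EX_POSTCOPY_ADVISE:
        if (*s != EX_PRECOPY) {
            return -1;          /* assumption: ADVISE is sent once */
        }
        *s = EX_ADVISED;        /* back-end opens its userfaultfd here */
        return 0;
    case EX_POSTCOPY_LISTEN:
        if (*s != EX_ADVISED) {
            return -1;          /* LISTEN must follow ADVISE */
        }
        *s = EX_LISTENING;      /* back-end registers shared memory here */
        return 0;
    default:
        return 0;               /* other requests do not affect this tracker */
    }
}
```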

``VHOST_USER_POSTCOPY_END``
  :id: 30
  :request payload: N/A
  :reply payload: ``u64``

  The front-end advises that postcopy migration has now completed. The back-end
  must disable the userfaultfd. The reply is an acknowledgement
  only.

  When ``VHOST_USER_PROTOCOL_F_PAGEFAULT`` is supported, this message
  is sent at the end of the migration, after
  ``VHOST_USER_POSTCOPY_LISTEN`` was previously sent.

  The value returned is an error indication; 0 is success.

``VHOST_USER_GET_INFLIGHT_FD``
  :id: 31
  :equivalent ioctl: N/A
  :request payload: inflight description
  :reply payload: N/A

  When the ``VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD`` protocol feature has
  been successfully negotiated, this message is submitted by the front-end to
  get a shared buffer from the back-end. The shared buffer will be used by
  the back-end to track inflight I/O. QEMU should retrieve a new one when
  the VM is reset.

``VHOST_USER_SET_INFLIGHT_FD``
  :id: 32
  :equivalent ioctl: N/A
  :request payload: inflight description
  :reply payload: N/A

  When the ``VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD`` protocol feature has
  been successfully negotiated, this message is submitted by the front-end to
  send the shared inflight buffer back to the back-end so that the back-end
  can recover inflight I/O after a crash or restart.

``VHOST_USER_GPU_SET_SOCKET``
  :id: 33
  :equivalent ioctl: N/A
  :request payload: N/A
  :reply payload: N/A

  Sets the GPU protocol socket file descriptor, which is passed as
  ancillary data. The GPU protocol is used to inform the front-end of
  rendering state and updates. See vhost-user-gpu.rst for details.
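Several messages above pass a file descriptor as ancillary data. The following sketch shows the underlying POSIX mechanism (``SCM_RIGHTS`` over an ``AF_UNIX`` socket) in isolation. It is not the vhost-user wire format: the helper names are hypothetical, and a single dummy byte stands in for the real message header and payload.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send one dummy byte with one file descriptor attached as SCM_RIGHTS. */
static int ex_send_fd(int sock, int fd)
{
    char byte = 0;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        struct cmsghdr align;               /* forces correct alignment */
        char buf[CMSG_SPACE(sizeof(int))];
    } u;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    assert(fd >= 0);
    memset(&u, 0, sizeof(u));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive the dummy byte and return the attached descriptor, or -1. */
static int ex_recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        struct cmsghdr align;
        char buf[CMSG_SPACE(sizeof(int))];
    } u;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd;

    memset(&u, 0, sizeof(u));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    if (recvmsg(sock, &msg, 0) != 1) {
        return -1;
    }
    cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int))) {
        return -1;                          /* no descriptor attached */
    }
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

The received descriptor is a new descriptor in the receiving process that refers to the same open file, so the receiver is responsible for closing it.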

``VHOST_USER_RESET_DEVICE``
  :id: 34
  :equivalent ioctl: N/A
  :request payload: N/A
  :reply payload: N/A

  Ask the vhost user back-end to disable all rings and reset all
  internal device state to the initial state, ready to be
  reinitialized. The back-end retains ownership of the device
  throughout the reset operation.

  Only valid if the ``VHOST_USER_PROTOCOL_F_RESET_DEVICE`` protocol
  feature is set by the back-end.

``VHOST_USER_VRING_KICK``
  :id: 35
  :equivalent ioctl: N/A
  :request payload: vring state description
  :reply payload: N/A

  When the ``VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS`` protocol
  feature has been successfully negotiated, this message may be
  submitted by the front-end to indicate that a buffer was added to
  the vring instead of signalling it using the vring's kick file
  descriptor or having the back-end rely on polling.

  The state.num field is currently reserved and must be set to 0.

``VHOST_USER_GET_MAX_MEM_SLOTS``
  :id: 36
  :equivalent ioctl: N/A
  :request payload: N/A
  :reply payload: u64

  When the ``VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS`` protocol
  feature has been successfully negotiated, this message is submitted
  by the front-end to the back-end. The back-end should return the message with a
  u64 payload containing the maximum number of memory slots for
  QEMU to expose to the guest. The value returned by the back-end
  will be capped at the maximum number of ram slots which can be
  supported by the target platform.

``VHOST_USER_ADD_MEM_REG``
  :id: 37
  :equivalent ioctl: N/A
  :request payload: N/A
  :reply payload: single memory region description

  When the ``VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS`` protocol
  feature has been successfully negotiated, this message is submitted
  by the front-end to the back-end.
  The message payload contains a memory
  region descriptor struct, describing a region of guest memory which
  the back-end device must map in. When the
  ``VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS`` protocol feature has
  been successfully negotiated, along with the
  ``VHOST_USER_REM_MEM_REG`` message, this message is used to set and
  update the memory tables of the back-end device.

  Exactly one file descriptor from which the memory is mapped is
  passed in the ancillary data.

  In postcopy mode (see ``VHOST_USER_POSTCOPY_LISTEN``), the back-end
  replies with the bases of the memory mapped region to the front-end.
  For further details on postcopy, see ``VHOST_USER_SET_MEM_TABLE``.
  They apply to ``VHOST_USER_ADD_MEM_REG`` accordingly.

``VHOST_USER_REM_MEM_REG``
  :id: 38
  :equivalent ioctl: N/A
  :request payload: N/A
  :reply payload: single memory region description

  When the ``VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS`` protocol
  feature has been successfully negotiated, this message is submitted
  by the front-end to the back-end. The message payload contains a memory
  region descriptor struct, describing a region of guest memory which
  the back-end device must unmap. When the
  ``VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS`` protocol feature has
  been successfully negotiated, along with the
  ``VHOST_USER_ADD_MEM_REG`` message, this message is used to set and
  update the memory tables of the back-end device.

  The memory region to be removed is identified by its guest address,
  user address and size. The mmap offset is ignored.

  No file descriptors SHOULD be passed in the ancillary data. For
  compatibility with existing incorrect implementations, the back-end MAY
  accept messages with one file descriptor. If a file descriptor is
  passed, the back-end MUST close it without using it otherwise.
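The matching rule for ``VHOST_USER_REM_MEM_REG`` (guest address, user address and size must match; the mmap offset is ignored) can be sketched as a table lookup. The structure layout and all ``ex_`` names below are hypothetical and chosen for illustration; a real back-end would also ``munmap()`` the region it removes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory record of a mapped region. */
struct ex_mem_region {
    uint64_t guest_addr;
    uint64_t size;
    uint64_t user_addr;
    uint64_t mmap_offset;
};

/* Compare the identifying fields only; mmap_offset is deliberately ignored. */
static int ex_region_matches(const struct ex_mem_region *a,
                             const struct ex_mem_region *b)
{
    return a->guest_addr == b->guest_addr &&
           a->user_addr == b->user_addr &&
           a->size == b->size;
}

/* Remove the region described by desc. Returns 0, or -1 if none matches. */
static int ex_remove_region(struct ex_mem_region *table, size_t *count,
                            const struct ex_mem_region *desc)
{
    size_t i;

    assert(table != NULL && count != NULL && desc != NULL);
    for (i = 0; i < *count; i++) {
        if (ex_region_matches(&table[i], desc)) {
            table[i] = table[*count - 1];   /* fill the hole with the last entry */
            (*count)--;
            return 0;
        }
    }
    return -1;
}
```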

``VHOST_USER_SET_STATUS``
  :id: 39
  :equivalent ioctl: VHOST_VDPA_SET_STATUS
  :request payload: ``u64``
  :reply payload: N/A

  When the ``VHOST_USER_PROTOCOL_F_STATUS`` protocol feature has been
  successfully negotiated, this message is submitted by the front-end to
  notify the back-end with updated device status as defined in the Virtio
  specification.

``VHOST_USER_GET_STATUS``
  :id: 40
  :equivalent ioctl: VHOST_VDPA_GET_STATUS
  :request payload: N/A
  :reply payload: ``u64``

  When the ``VHOST_USER_PROTOCOL_F_STATUS`` protocol feature has been
  successfully negotiated, this message is submitted by the front-end to
  query the back-end for its device status as defined in the Virtio
  specification.

``VHOST_USER_GET_SHARED_OBJECT``
  :id: 41
  :equivalent ioctl: N/A
  :request payload: ``struct VhostUserShared``
  :reply payload: dmabuf fd

  When the ``VHOST_USER_PROTOCOL_F_SHARED_OBJECT`` protocol
  feature has been successfully negotiated, and the UUID is found
  in the exporters cache, this message is submitted by the front-end
  to retrieve a given dma-buf fd from a given back-end, determined by
  the requested UUID. The back-end replies passing the fd when the operation
  is successful, or no fd otherwise.

Back-end message types
----------------------

For this type of message, the request is sent by the back-end and the reply
is sent by the front-end.

``VHOST_USER_BACKEND_IOTLB_MSG`` (previous name ``VHOST_USER_SLAVE_IOTLB_MSG``)
  :id: 1
  :equivalent ioctl: N/A (equivalent to ``VHOST_IOTLB_MSG`` message type)
  :request payload: ``struct vhost_iotlb_msg``
  :reply payload: N/A

  Send IOTLB messages with ``struct vhost_iotlb_msg`` as payload.
  The back-end sends such requests to notify of an IOTLB miss, or an IOTLB
  access failure.
  If ``VHOST_USER_PROTOCOL_F_REPLY_ACK`` is
  negotiated, and the back-end sets the ``VHOST_USER_NEED_REPLY`` flag, the
  front-end must respond with zero when the operation is successfully
  completed, or non-zero otherwise. This request should be sent only when
  the ``VIRTIO_F_IOMMU_PLATFORM`` feature has been successfully
  negotiated.

``VHOST_USER_BACKEND_CONFIG_CHANGE_MSG`` (previous name ``VHOST_USER_SLAVE_CONFIG_CHANGE_MSG``)
  :id: 2
  :equivalent ioctl: N/A
  :request payload: N/A
  :reply payload: N/A

  When ``VHOST_USER_PROTOCOL_F_CONFIG`` is negotiated, the vhost-user
  back-end sends such messages to notify that the virtio device's
  configuration space has changed. For those host devices which can
  support such a feature, the host driver can send a ``VHOST_USER_GET_CONFIG``
  message to the back-end to get the latest content. If
  ``VHOST_USER_PROTOCOL_F_REPLY_ACK`` is negotiated, and the back-end sets the
  ``VHOST_USER_NEED_REPLY`` flag, the front-end must respond with zero when
  the operation is successfully completed, or non-zero otherwise.

``VHOST_USER_BACKEND_VRING_HOST_NOTIFIER_MSG`` (previous name ``VHOST_USER_SLAVE_VRING_HOST_NOTIFIER_MSG``)
  :id: 3
  :equivalent ioctl: N/A
  :request payload: vring area description
  :reply payload: N/A

  Sets host notifier for a specified queue. The queue index is
  contained in the ``u64`` field of the vring area description. The
  host notifier is described by the file descriptor (typically it's a
  VFIO device fd) which is passed as ancillary data and the size
  (which is mmap size and should be the same as host page size) and
  offset (which is mmap offset) carried in the vring area
  description. QEMU can mmap the file descriptor based on the size and
  offset to get a memory range. Registering a host notifier means
  mapping this memory range to the VM as the specified queue's notify
  MMIO region.
  The back-end sends this request to tell QEMU to de-register
  the existing notifier if any and register the new notifier if the
  request is sent with a file descriptor.

  This request should be sent only when the
  ``VHOST_USER_PROTOCOL_F_HOST_NOTIFIER`` protocol feature has been
  successfully negotiated.

``VHOST_USER_BACKEND_VRING_CALL`` (previous name ``VHOST_USER_SLAVE_VRING_CALL``)
  :id: 4
  :equivalent ioctl: N/A
  :request payload: vring state description
  :reply payload: N/A

  When the ``VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS`` protocol
  feature has been successfully negotiated, this message may be
  submitted by the back-end to indicate that a buffer was used from
  the vring instead of signalling this using the vring's call file
  descriptor or having the front-end rely on polling.

  The state.num field is currently reserved and must be set to 0.

``VHOST_USER_BACKEND_VRING_ERR`` (previous name ``VHOST_USER_SLAVE_VRING_ERR``)
  :id: 5
  :equivalent ioctl: N/A
  :request payload: vring state description
  :reply payload: N/A

  When the ``VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS`` protocol
  feature has been successfully negotiated, this message may be
  submitted by the back-end to indicate that an error occurred on the
  specific vring, instead of signalling the error file descriptor
  set by the front-end via ``VHOST_USER_SET_VRING_ERR``.

  The state.num field is currently reserved and must be set to 0.

``VHOST_USER_BACKEND_SHARED_OBJECT_ADD``
  :id: 6
  :equivalent ioctl: N/A
  :request payload: ``struct VhostUserShared``
  :reply payload: N/A

  When the ``VHOST_USER_PROTOCOL_F_SHARED_OBJECT`` protocol
  feature has been successfully negotiated, this message can be submitted
  by back-ends to add themselves as exporters to the virtio shared lookup
  table.
  The back-end device gets associated with a UUID in the shared table.
  The back-end is responsible for keeping its own table with exported dma-buf fds.
  When another back-end tries to import the resource associated with the UUID,
  it will send a message to the front-end, which will act as a proxy to the
  exporter back-end. If ``VHOST_USER_PROTOCOL_F_REPLY_ACK`` is negotiated, and
  the back-end sets the ``VHOST_USER_NEED_REPLY`` flag, the front-end must
  respond with zero when the operation is successfully completed, or non-zero
  otherwise.

``VHOST_USER_BACKEND_SHARED_OBJECT_REMOVE``
  :id: 7
  :equivalent ioctl: N/A
  :request payload: ``struct VhostUserShared``
  :reply payload: N/A

  When the ``VHOST_USER_PROTOCOL_F_SHARED_OBJECT`` protocol
  feature has been successfully negotiated, this message can be submitted
  by a back-end to remove itself from the virtio-dmabuf shared
  table. The shared table will remove the back-end device associated with
  the UUID. If ``VHOST_USER_PROTOCOL_F_REPLY_ACK`` is negotiated, and the
  back-end sets the ``VHOST_USER_NEED_REPLY`` flag, the front-end must respond
  with zero when the operation is successfully completed, or non-zero otherwise.

``VHOST_USER_BACKEND_SHARED_OBJECT_LOOKUP``
  :id: 8
  :equivalent ioctl: N/A
  :request payload: ``struct VhostUserShared``
  :reply payload: dmabuf fd and ``u64``

  When the ``VHOST_USER_PROTOCOL_F_SHARED_OBJECT`` protocol
  feature has been successfully negotiated, this message can be submitted
  by back-ends to retrieve a given dma-buf fd from the virtio-dmabuf
  shared table given a UUID. The front-end replies passing the fd and a zero
  when the operation is successful, or non-zero otherwise. Note that if the
  operation fails, no fd is sent to the back-end.

.. _reply_ack:

VHOST_USER_PROTOCOL_F_REPLY_ACK
-------------------------------

The original vhost-user specification only demands replies for certain
commands. This differs from the vhost protocol implementation where
commands are sent over an ``ioctl()`` call and block until the back-end
has completed.

With this protocol extension negotiated, the sender (QEMU) can set the
``need_reply`` [Bit 3] flag on any command. This indicates that the
back-end MUST respond with a Payload ``VhostUserMsg`` indicating success
or failure. The payload should be set to zero on success or non-zero
on failure, unless the message already has an explicit reply body.

The reply payload gives QEMU a deterministic indication of the result
of the command. Today, QEMU is expected to terminate the main vhost-user
loop upon receiving such errors. In future, QEMU could be taught to be more
resilient for selective requests.

For the message types that already solicit a reply from the back-end,
the presence of ``VHOST_USER_PROTOCOL_F_REPLY_ACK`` or need_reply bit
being set brings no behavioural change. (See the Communication_
section for details.)

.. _backend_conventions:

Backend program conventions
===========================

vhost-user back-ends can provide various devices & services and may
need to be configured manually depending on the use case. However, it
is a good idea to follow the conventions listed here when
possible. Users, QEMU or libvirt, can then rely on some common
behaviour to avoid heterogeneous configuration and management of the
back-end programs and facilitate interoperability.

Each back-end installed on a host system should come with at least one
JSON file that conforms to the vhost-user.json schema. Each file
informs the management applications about the back-end type, and binary
location.
In addition, it defines rules for management apps for
picking the highest priority back-end when multiple match the search
criteria (see ``@VhostUserBackend`` documentation in the schema file).

If the back-end is not capable of enabling a requested feature on the
host (such as 3D acceleration with virgl), or the initialization
failed, the back-end should fail to start early and exit with a status
!= 0. It may also print a message to stderr for further details.

The back-end program must not daemonize itself, but it may be
daemonized by the management layer. It may also have restricted
access to the system.

File descriptors 0, 1 and 2 will exist, and have regular
stdin/stdout/stderr usage (they may have been redirected to /dev/null
by the management layer, or to a log handler).

The back-end program must end (as quickly and cleanly as possible) when
the SIGTERM signal is received. Eventually, it may receive SIGKILL from
the management layer after a few seconds.

The following command line options have an expected behaviour. They
are mandatory, unless explicitly said differently:

--socket-path=PATH

  This option specifies the location of the vhost-user Unix domain socket.
  It is incompatible with --fd.

--fd=FDNUM

  When this argument is given, the back-end program is started with the
  vhost-user socket as file descriptor FDNUM. It is incompatible with
  --socket-path.

--print-capabilities

  Output to stdout the back-end capabilities in JSON format, and then
  exit successfully. Other options and arguments should be ignored, and
  the back-end program should not perform its normal function. The
  capabilities can be reported dynamically depending on the host
  capabilities.

The JSON output is described in the ``vhost-user.json`` schema, by
``@VHostUserBackendCapabilities``. Example:

.. code:: json

  {
    "type": "foo",
    "features": [
      "feature-a",
      "feature-b"
    ]
  }

vhost-user-input
----------------

Command line options:

--evdev-path=PATH

  Specify the Linux input device.

  (optional)

--no-grab

  Do not request exclusive access to the input device.

  (optional)

vhost-user-gpu
--------------

Command line options:

--render-node=PATH

  Specify the GPU DRM render node.

  (optional)

--virgl

  Enable virgl rendering support.

  (optional)

vhost-user-blk
--------------

Command line options:

--blk-file=PATH

  Specify block device or file path.

  (optional)

--read-only

  Enable read-only.

  (optional)
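The common option rules from the conventions above (``--socket-path`` and ``--fd`` are mutually exclusive, and ``--print-capabilities`` causes other options to be ignored) can be sketched as a small argument checker. This is an illustrative sketch only: the structure and function names are hypothetical, unknown options are silently skipped on the assumption that they are device-specific, and requiring exactly one of the two socket options is this example's reading of "mandatory".

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical parsed form of the common back-end options. */
struct ex_opts {
    const char *socket_path;    /* NULL if --socket-path was not given */
    int fd;                     /* -1 if --fd was not given */
    int print_capabilities;
};

/* Returns 0 if the option combination is acceptable, -1 otherwise. */
static int ex_parse_args(int argc, char **argv, struct ex_opts *o)
{
    static const char sp[] = "--socket-path=";
    static const char fdopt[] = "--fd=";
    int i, bad = 0;

    assert(o != NULL);
    o->socket_path = NULL;
    o->fd = -1;
    o->print_capabilities = 0;

    for (i = 1; i < argc; i++) {
        if (strncmp(argv[i], sp, sizeof(sp) - 1) == 0) {
            o->socket_path = argv[i] + sizeof(sp) - 1;
        } else if (strncmp(argv[i], fdopt, sizeof(fdopt) - 1) == 0) {
            const char *num = argv[i] + sizeof(fdopt) - 1;
            char *end;
            long v = strtol(num, &end, 10);

            if (end == num || *end != '\0' || v < 0 || v > INT_MAX) {
                bad = 1;        /* malformed FDNUM */
            } else {
                o->fd = (int)v;
            }
        } else if (strcmp(argv[i], "--print-capabilities") == 0) {
            o->print_capabilities = 1;
        }
        /* Anything else is assumed to be a device-specific option. */
    }

    if (o->print_capabilities) {
        return 0;               /* other options are ignored in this mode */
    }
    if (bad) {
        return -1;
    }
    /* Exactly one of --socket-path and --fd must be present. */
    if ((o->socket_path != NULL) == (o->fd >= 0)) {
        return -1;
    }
    return 0;
}
```

A back-end's ``main()`` would call this first, print its capabilities and exit successfully if ``print_capabilities`` is set, and otherwise exit with a non-zero status when the function returns -1.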