.. _vhost_user_proto:

===================
Vhost-user Protocol
===================

..
  Copyright 2014 Virtual Open Systems Sarl.
  Copyright 2019 Intel Corporation
  Licence: This work is licensed under the terms of the GNU GPL,
           version 2 or later. See the COPYING file in the top-level
           directory.

.. contents:: Table of Contents

Introduction
============

This protocol aims to complement the ``ioctl`` interface used to
control the vhost implementation in the Linux kernel. It implements
the control plane needed to establish virtqueue sharing with a user
space process on the same host. It uses communication over a Unix
domain socket to share file descriptors in the ancillary data of the
message.

The protocol defines 2 sides of the communication, *front-end* and
*back-end*. The *front-end* is the application that shares its virtqueues, in
our case QEMU. The *back-end* is the consumer of the virtqueues.

In the current implementation QEMU is the *front-end*, and the *back-end*
is the external process consuming the virtio queues, for example a
software Ethernet switch running in user space, such as Snabbswitch,
or a block device back-end processing reads and writes to a virtual
disk. In order to facilitate interoperability between various back-end
implementations, it is recommended to follow the :ref:`Backend program
conventions <backend_conventions>`.

The *front-end* and *back-end* can be either a client (i.e. connecting) or
server (listening) in the socket communication.

Support for platforms other than Linux
--------------------------------------

While vhost-user was initially developed targeting Linux, nowadays it
is supported on any platform that provides the following features:

- A way for requesting shared memory represented by a file descriptor
  so it can be passed over a UNIX domain socket and then mapped by the
  other process.

- AF_UNIX sockets with SCM_RIGHTS, so QEMU and the other process can
  exchange messages through them, including ancillary data when needed.

- Either eventfd or pipe/pipe2. On platforms where eventfd is not
  available, QEMU will automatically fall back to pipe2 or, as a last
  resort, pipe. Each file descriptor will be used for receiving or
  sending events by reading or writing (respectively) an 8-byte value
  to the corresponding descriptor. The 8-byte value itself has no
  meaning and should not be interpreted.

Message Specification
=====================

.. Note:: All numbers are in the machine native byte order.

A vhost-user message consists of 3 header fields and a payload.

+---------+-------+------+---------+
| request | flags | size | payload |
+---------+-------+------+---------+

Header
------

:request: 32-bit type of the request

:flags: 32-bit bit field

- Lower 2 bits are the version (currently 0x01)
- Bit 2 is the reply flag - needs to be sent on each reply from the back-end
- Bit 3 is the need_reply flag - see :ref:`REPLY_ACK <reply_ack>` for
  details.

:size: 32-bit size of the payload

Payload
-------

Depending on the request type, **payload** can be:

A single 64-bit integer
^^^^^^^^^^^^^^^^^^^^^^^

+-----+
| u64 |
+-----+

:u64: a 64-bit unsigned integer

A vring state description
^^^^^^^^^^^^^^^^^^^^^^^^^

+-------+-----+
| index | num |
+-------+-----+

:index: a 32-bit index

:num: a 32-bit number

A vring address description
^^^^^^^^^^^^^^^^^^^^^^^^^^^

+-------+-------+------+------------+------+-----------+-----+
| index | flags | size | descriptor | used | available | log |
+-------+-------+------+------------+------+-----------+-----+

:index: a 32-bit vring index

:flags: a 32-bit vring flags

:descriptor: a 64-bit ring address of the vring descriptor table

:used: a 64-bit ring address of the vring used ring

:available: a 64-bit ring address of the vring available ring

:log: a 64-bit guest address for logging

Note that a ring address is an IOVA if ``VIRTIO_F_IOMMU_PLATFORM`` has
been negotiated. Otherwise it is a user address.

Memory region description
^^^^^^^^^^^^^^^^^^^^^^^^^

+---------------+------+--------------+-------------+
| guest address | size | user address | mmap offset |
+---------------+------+--------------+-------------+

:guest address: a 64-bit guest address of the region

:size: a 64-bit size

:user address: a 64-bit user address

:mmap offset: 64-bit offset where region starts in the mapped memory

When the ``VHOST_USER_PROTOCOL_F_XEN_MMAP`` protocol feature has been
successfully negotiated, the memory region description contains two extra
fields at the end.

+---------------+------+--------------+-------------+----------------+-------+
| guest address | size | user address | mmap offset | xen mmap flags | domid |
+---------------+------+--------------+-------------+----------------+-------+

:xen mmap flags: 32-bit bit field

- Bit 0 is set for Xen foreign memory mapping.
- Bit 1 is set for Xen grant memory mapping.
- Bit 8 is set if the memory region can not be mapped in advance, and memory
  areas within this region must be mapped / unmapped only when required by the
  back-end. The back-end shouldn't try to map the entire region at once, as the
  front-end may not allow it. The back-end should rather map only the required
  amount of memory at once and unmap it after it is used.

:domid: a 32-bit Xen hypervisor specific domain id.

Single memory region description
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

+---------+--------+
| padding | region |
+---------+--------+

:padding: 64-bit

A region is represented by Memory region description.

Multiple Memory regions description
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

+-------------+---------+---------+-----+---------+
| num regions | padding | region0 | ... | region7 |
+-------------+---------+---------+-----+---------+

:num regions: a 32-bit number of regions

:padding: 32-bit

A region is represented by Memory region description.

Log description
^^^^^^^^^^^^^^^

+----------+------------+
| log size | log offset |
+----------+------------+

:log size: size of area used for logging

:log offset: offset from start of supplied file descriptor where
             logging starts (i.e.
             where guest address 0 would be
             logged)

An IOTLB message
^^^^^^^^^^^^^^^^

+------+------+--------------+-------------------+------+
| iova | size | user address | permissions flags | type |
+------+------+--------------+-------------------+------+

:iova: a 64-bit I/O virtual address programmed by the guest

:size: a 64-bit size

:user address: a 64-bit user address

:permissions flags: an 8-bit value:
  - 0: No access
  - 1: Read access
  - 2: Write access
  - 3: Read/Write access

:type: an 8-bit IOTLB message type:
  - 1: IOTLB miss
  - 2: IOTLB update
  - 3: IOTLB invalidate
  - 4: IOTLB access fail

Virtio device config space
^^^^^^^^^^^^^^^^^^^^^^^^^^

+--------+------+-------+---------+
| offset | size | flags | payload |
+--------+------+-------+---------+

:offset: a 32-bit offset of virtio device's configuration space

:size: a 32-bit configuration space access size in bytes

:flags: a 32-bit value:
  - 0: Vhost front-end messages used for writable fields
  - 1: Vhost front-end messages used for live migration

:payload: Size bytes array holding the contents of the virtio
          device's configuration space

Vring area description
^^^^^^^^^^^^^^^^^^^^^^

+-----+------+--------+
| u64 | size | offset |
+-----+------+--------+

:u64: a 64-bit integer containing the vring index and flags

:size: a 64-bit size of this area

:offset: a 64-bit offset of this area from the start of the
         supplied file descriptor

Inflight description
^^^^^^^^^^^^^^^^^^^^

+-----------+-------------+------------+------------+
| mmap size | mmap offset | num queues | queue size |
+-----------+-------------+------------+------------+

:mmap size: a 64-bit size of area to track inflight I/O

:mmap offset: a 64-bit offset of this area from the start
              of the supplied file descriptor

:num queues: a 16-bit number of virtqueues

:queue size: a 16-bit size of virtqueues

C structure
-----------

In QEMU the vhost-user message is implemented with the following struct:

.. code:: c

  typedef struct VhostUserMsg {
      VhostUserRequest request;
      uint32_t flags;
      uint32_t size;
      union {
          uint64_t u64;
          struct vhost_vring_state state;
          struct vhost_vring_addr addr;
          VhostUserMemory memory;
          VhostUserLog log;
          struct vhost_iotlb_msg iotlb;
          VhostUserConfig config;
          VhostUserVringArea area;
          VhostUserInflight inflight;
      };
  } QEMU_PACKED VhostUserMsg;

Communication
=============

The protocol for vhost-user is based on the existing implementation of
vhost for the Linux Kernel. Most messages that can be sent via the
Unix domain socket implementing vhost-user have an equivalent ioctl to
the kernel implementation.

The communication consists of the *front-end* sending message requests and
the *back-end* sending message replies. Most of the requests don't require
replies. Here is a list of the ones that do:

* ``VHOST_USER_GET_FEATURES``
* ``VHOST_USER_GET_PROTOCOL_FEATURES``
* ``VHOST_USER_GET_VRING_BASE``
* ``VHOST_USER_SET_LOG_BASE`` (if ``VHOST_USER_PROTOCOL_F_LOG_SHMFD``)
* ``VHOST_USER_GET_INFLIGHT_FD`` (if ``VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD``)

.. seealso::

   :ref:`REPLY_ACK <reply_ack>`
       The section on ``REPLY_ACK`` protocol extension.
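
To relate the request/reply exchange to the message header, the sketch
below shows how the ``flags`` field could be composed and checked. Only the
bit positions come from the header description in this document; the macro
and function names are illustrative assumptions, not part of the protocol.

.. code:: c

  #include <stdbool.h>
  #include <stdint.h>

  /* Bit layout of the 32-bit flags header field (names are illustrative). */
  #define VU_FLAG_VERSION_MASK 0x3u       /* lower 2 bits: protocol version */
  #define VU_FLAG_VERSION      0x1u       /* current version */
  #define VU_FLAG_REPLY        (1u << 2)  /* set on each back-end reply */
  #define VU_FLAG_NEED_REPLY   (1u << 3)  /* sender asks for an acknowledgement */

  /* Flags for a new request; need_reply only makes sense with REPLY_ACK. */
  static inline uint32_t vu_request_flags(bool need_reply)
  {
      return VU_FLAG_VERSION | (need_reply ? VU_FLAG_NEED_REPLY : 0u);
  }

  /* Flags for a reply: the version plus the reply bit. */
  static inline uint32_t vu_reply_flags(void)
  {
      return VU_FLAG_VERSION | VU_FLAG_REPLY;
  }

  /* Check that a received header carries the supported version. */
  static inline bool vu_version_ok(uint32_t flags)
  {
      return (flags & VU_FLAG_VERSION_MASK) == VU_FLAG_VERSION;
  }

  /* Check whether a received message is marked as a reply. */
  static inline bool vu_is_reply(uint32_t flags)
  {
      return (flags & VU_FLAG_REPLY) != 0;
  }

A receiver would typically reject a message when ``vu_version_ok()`` fails,
and a front-end waiting for one of the replies listed above would also check
``vu_is_reply()``.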

There are several messages that the front-end sends with file descriptors passed
in the ancillary data:

* ``VHOST_USER_ADD_MEM_REG``
* ``VHOST_USER_SET_MEM_TABLE``
* ``VHOST_USER_SET_LOG_BASE`` (if ``VHOST_USER_PROTOCOL_F_LOG_SHMFD``)
* ``VHOST_USER_SET_LOG_FD``
* ``VHOST_USER_SET_VRING_KICK``
* ``VHOST_USER_SET_VRING_CALL``
* ``VHOST_USER_SET_VRING_ERR``
* ``VHOST_USER_SET_BACKEND_REQ_FD`` (previous name ``VHOST_USER_SET_SLAVE_REQ_FD``)
* ``VHOST_USER_SET_INFLIGHT_FD`` (if ``VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD``)

If the *front-end* is unable to send the full message or receives a wrong
reply it will close the connection. An optional reconnection mechanism
can be implemented.

If the *back-end* detects some error such as incompatible features, it may also
close the connection. This should only happen in exceptional circumstances.

Any protocol extensions are gated by protocol feature bits, which
allows full backwards compatibility on both front-end and back-end. As
older back-ends don't support negotiating protocol features, a feature
bit was dedicated for this purpose::

  #define VHOST_USER_F_PROTOCOL_FEATURES 30

Note that ``VHOST_USER_F_PROTOCOL_FEATURES`` is the UNUSED (30) feature
bit defined in `VIRTIO 1.1 6.3 Legacy Interface: Reserved Feature Bits
<https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.html#x1-4130003>`_.
VIRTIO devices do not advertise this feature bit and therefore VIRTIO
drivers cannot negotiate it.

This reserved feature bit was reused by the vhost-user protocol to add
vhost-user protocol feature negotiation in a backwards compatible
fashion. Old vhost-user front-end and back-end implementations continue to
work even though they are not aware of vhost-user protocol feature
negotiation.
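
As a minimal sketch of this gating, a front-end could test bit 30 of the
``VHOST_USER_GET_FEATURES`` reply before relying on any protocol feature.
The helper names below are hypothetical and only illustrate the check; they
are not taken from QEMU.

.. code:: c

  #include <stdbool.h>
  #include <stdint.h>

  #define VHOST_USER_F_PROTOCOL_FEATURES 30

  /* True if the feature bitmask returned by VHOST_USER_GET_FEATURES says
   * that the back-end understands the protocol feature messages. */
  static inline bool vu_has_protocol_features(uint64_t features)
  {
      return ((features >> VHOST_USER_F_PROTOCOL_FEATURES) & 1) != 0;
  }

  /* True if protocol feature number `bit` is offered by the back-end:
   * bit 30 must be present in `features`, and `bit` must be present in the
   * bitmask returned by VHOST_USER_GET_PROTOCOL_FEATURES. */
  static inline bool vu_protocol_feature_offered(uint64_t features,
                                                 uint64_t protocol_features,
                                                 unsigned int bit)
  {
      return bit < 64 &&
             vu_has_protocol_features(features) &&
             ((protocol_features >> bit) & 1) != 0;
  }

An old back-end that never sets bit 30 makes both helpers return false, so
the front-end simply skips every protocol extension.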

Ring states
-----------

Rings can be in one of three states:

* stopped: the back-end must not process the ring at all.

* started but disabled: the back-end must process the ring without
  causing any side effects. For example, for a networking device,
  in the disabled state the back-end must not supply any new RX packets,
  but must process and discard any TX packets.

* started and enabled.

Each ring is initialized in a stopped state. The back-end must start
a ring upon receiving a kick (that is, detecting that the file descriptor
is readable) on the descriptor specified by ``VHOST_USER_SET_VRING_KICK``
or receiving the in-band message ``VHOST_USER_VRING_KICK`` if negotiated,
and stop the ring upon receiving ``VHOST_USER_GET_VRING_BASE``.

Rings can be enabled or disabled by ``VHOST_USER_SET_VRING_ENABLE``.

If ``VHOST_USER_F_PROTOCOL_FEATURES`` has not been negotiated, the
ring starts directly in the enabled state.

If ``VHOST_USER_F_PROTOCOL_FEATURES`` has been negotiated, the ring is
initialized in a disabled state and is enabled by
``VHOST_USER_SET_VRING_ENABLE`` with parameter 1.

While processing the rings (whether they are enabled or not), the back-end
must support changing some configuration aspects on the fly.

Multiple queue support
----------------------

Many devices have a fixed number of virtqueues. In this case the front-end
already knows the number of available virtqueues without communicating with the
back-end.

Some devices do not have a fixed number of virtqueues. Instead the maximum
number of virtqueues is chosen by the back-end. The number can depend on host
resource availability or back-end implementation details. Such devices are called
multiple queue devices.

Multiple queue support allows the back-end to advertise the maximum number of
queues.
This is treated as a protocol extension, hence the back-end has to
implement protocol features first. The multiple queues feature is supported
only when the protocol feature ``VHOST_USER_PROTOCOL_F_MQ`` (bit 0) is set.

The max number of queues the back-end supports can be queried with message
``VHOST_USER_GET_QUEUE_NUM``. The front-end should stop when the number of
requested queues is bigger than that.

As all queues share one connection, the front-end uses a unique index for each
queue in the sent message to identify a specified queue.

The front-end enables queues by sending message ``VHOST_USER_SET_VRING_ENABLE``.
vhost-user-net has historically automatically enabled the first queue pair.

Back-ends should always implement the ``VHOST_USER_PROTOCOL_F_MQ`` protocol
feature, even for devices with a fixed number of virtqueues, since it is simple
to implement and offers a degree of introspection.

Front-ends must not rely on the ``VHOST_USER_PROTOCOL_F_MQ`` protocol feature for
devices with a fixed number of virtqueues. Only true multiqueue devices
require this protocol feature.

Migration
---------

During live migration, the front-end may need to track the modifications
the back-end makes to the memory mapped regions. The front-end should mark
the dirty pages in a log. Once it complies with this logging, it may
declare the ``VHOST_F_LOG_ALL`` vhost feature.

To start/stop logging of data/used ring writes, the front-end may send
messages ``VHOST_USER_SET_FEATURES`` with ``VHOST_F_LOG_ALL`` and
``VHOST_USER_SET_VRING_ADDR`` with ``VHOST_VRING_F_LOG`` in ring's
flags set to 1/0, respectively.

All the modifications to memory pointed to by vring "descriptor" should
be marked. Modifications to "used" vring should be marked if
``VHOST_VRING_F_LOG`` is part of ring's flags.
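
The following is a minimal sketch of how such modifications might be
recorded, assuming the dirty log layout this section defines: a bitmap with
one bit per ``VHOST_LOG_PAGE``-sized page, indexed from guest address 0. The
helper name ``vu_log_write`` and the choice of C11 atomics are assumptions
made for illustration, and overflow of ``addr + len`` is not handled.

.. code:: c

  #include <stdatomic.h>
  #include <stdint.h>

  #define VHOST_LOG_PAGE 0x1000

  /* Mark every log page touched by a write of `len` bytes at guest address
   * `addr`. `log` points at the start of the log area and `log_size` is its
   * size in bytes; pages that fall beyond the log are ignored. */
  static void vu_log_write(_Atomic uint8_t *log, uint64_t log_size,
                           uint64_t addr, uint64_t len)
  {
      if (len == 0) {
          return;
      }

      uint64_t first = addr / VHOST_LOG_PAGE;
      uint64_t last = (addr + len - 1) / VHOST_LOG_PAGE;

      for (uint64_t page = first; page <= last; page++) {
          if (page / 8 >= log_size) {
              break;
          }
          /* Atomic OR, since the log may be updated concurrently. */
          atomic_fetch_or(&log[page / 8], (uint8_t)(1u << (page % 8)));
      }
  }

A write that straddles a page boundary sets two adjacent bits, which is why
the loop runs from the first to the last touched page instead of marking only
the page that contains ``addr``.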

Dirty pages are of size::

  #define VHOST_LOG_PAGE 0x1000

The log memory fd is provided in the ancillary data of
``VHOST_USER_SET_LOG_BASE`` message when the back-end has
``VHOST_USER_PROTOCOL_F_LOG_SHMFD`` protocol feature.

The size of the log is supplied as part of ``VhostUserMsg`` which
should be large enough to cover all known guest addresses. Log starts
at the supplied offset in the supplied file descriptor. The log
covers from address 0 to the maximum of guest regions. In pseudo-code,
to mark page at ``addr`` as dirty::

  page = addr / VHOST_LOG_PAGE
  log[page / 8] |= 1 << (page % 8)

Where ``addr`` is the guest physical address.

Use atomic operations, as the log may be concurrently manipulated.

Note that when logging modifications to the used ring (when
``VHOST_VRING_F_LOG`` is set for this ring), ``log_guest_addr`` should
be used to calculate the log offset: the write to first byte of the
used ring is logged at this offset from log start. Also note that this
value might be outside the legal guest physical address range
(i.e. does not have to be covered by the ``VhostUserMemory`` table), but
the bit offset of the last byte of the ring must fall within the size
supplied by ``VhostUserLog``.

``VHOST_USER_SET_LOG_FD`` is an optional message with an eventfd in
ancillary data; it may be used to inform the front-end that the log has
been modified.

Once the source has finished migration, rings will be stopped by the
source. No further update must be done before rings are restarted.

In postcopy migration the back-end is started before all the memory has
been received from the source host, and care must be taken to avoid
accessing pages that have yet to be received. The back-end opens a
'userfault'-fd and registers the memory with it; this fd is then
passed back over to the front-end.
The front-end services requests on the
userfaultfd for pages that are accessed and when the page is available
it performs WAKE ioctls on the userfaultfd to wake the stalled
back-end. The front-end indicates support for this via the
``VHOST_USER_PROTOCOL_F_PAGEFAULT`` feature.

Memory access
-------------

The front-end sends a list of vhost memory regions to the back-end using the
``VHOST_USER_SET_MEM_TABLE`` message. Each region has two base
addresses: a guest address and a user address.

Messages contain guest addresses and/or user addresses to reference locations
within the shared memory. The mapping of these addresses works as follows.

User addresses map to the vhost memory region containing that user address.

When the ``VIRTIO_F_IOMMU_PLATFORM`` feature has not been negotiated:

* Guest addresses map to the vhost memory region containing that guest
  address.

When the ``VIRTIO_F_IOMMU_PLATFORM`` feature has been negotiated:

* Guest addresses are also called I/O virtual addresses (IOVAs). They are
  translated to user addresses via the IOTLB.

* The vhost memory region guest address is not used.

IOMMU support
-------------

When the ``VIRTIO_F_IOMMU_PLATFORM`` feature has been negotiated, the
front-end sends IOTLB entry updates and invalidations by sending
``VHOST_USER_IOTLB_MSG`` requests to the back-end with a ``struct
vhost_iotlb_msg`` as payload. For update events, the ``iotlb`` payload
has to be filled with the update message type (2), the I/O virtual
address, the size, the user virtual address, and the permissions
flags. Addresses and size must be within vhost memory regions set via
the ``VHOST_USER_SET_MEM_TABLE`` request. For invalidation events, the
``iotlb`` payload has to be filled with the invalidation message type
(3), the I/O virtual address and the size.
On success, the back-end is
expected to reply with a zero payload, non-zero otherwise.

The back-end relies on the back-end communication channel (see :ref:`Back-end
communication <backend_communication>` section below) to send IOTLB miss
and access failure events, by sending ``VHOST_USER_BACKEND_IOTLB_MSG``
requests to the front-end with a ``struct vhost_iotlb_msg`` as
payload. For miss events, the iotlb payload has to be filled with the
miss message type (1), the I/O virtual address and the permissions
flags. For access failure events, the iotlb payload has to be filled
with the access failure message type (4), the I/O virtual address and
the permissions flags. For synchronization purposes, the back-end may
rely on the reply-ack feature, so the front-end may send a reply when
the operation is completed if the reply-ack feature is negotiated and
the back-end requests a reply. For miss events, completed operation means
either the front-end sent an update message containing the IOTLB entry
containing the requested address and permission, or the front-end sent
nothing if the IOTLB miss message is invalid (invalid IOVA or permission).

The front-end isn't expected to take the initiative to send IOTLB update
messages, as the back-end sends IOTLB miss messages for the guest virtual
memory areas it needs to access.

.. _backend_communication:

Back-end communication
----------------------

An optional communication channel is provided if the back-end declares
``VHOST_USER_PROTOCOL_F_BACKEND_REQ`` protocol feature, to allow the
back-end to make requests to the front-end.

The fd is provided via ``VHOST_USER_SET_BACKEND_REQ_FD`` ancillary data.

A back-end may then send ``VHOST_USER_BACKEND_*`` messages to the front-end
using this fd communication channel.
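
Throughout the protocol, file descriptors travel as ``SCM_RIGHTS`` ancillary
data on an ``AF_UNIX`` socket. The sketch below shows one way a sender might
attach descriptors to a message with ``sendmsg(2)``. The function name
``vu_send_with_fds`` and the ``VU_MAX_FDS`` macro are illustrative
assumptions; the cap of 8 mirrors the per-message limit this document states
for the back-end channel.

.. code:: c

  #include <string.h>
  #include <sys/socket.h>
  #include <sys/types.h>
  #include <sys/uio.h>

  #define VU_MAX_FDS 8 /* illustrative cap on descriptors per message */

  /* Send `len` bytes from `buf` over `sock`, attaching `nfds` descriptors
   * as SCM_RIGHTS ancillary data. Returns the number of bytes sent, or -1
   * on error (including nfds > VU_MAX_FDS). */
  static ssize_t vu_send_with_fds(int sock, const void *buf, size_t len,
                                  const int *fds, size_t nfds)
  {
      /* The union keeps the control buffer aligned for struct cmsghdr. */
      union {
          struct cmsghdr align;
          char buf[CMSG_SPACE(VU_MAX_FDS * sizeof(int))];
      } control;
      struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
      struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1 };

      if (nfds > VU_MAX_FDS) {
          return -1;
      }

      if (nfds > 0) {
          memset(&control, 0, sizeof(control));
          msg.msg_control = control.buf;
          msg.msg_controllen = CMSG_SPACE(nfds * sizeof(int));

          struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
          cmsg->cmsg_level = SOL_SOCKET;
          cmsg->cmsg_type = SCM_RIGHTS;
          cmsg->cmsg_len = CMSG_LEN(nfds * sizeof(int));
          memcpy(CMSG_DATA(cmsg), fds, nfds * sizeof(int));
      }

      return sendmsg(sock, &msg, 0);
  }

The receiving side would use ``recvmsg(2)`` with a control buffer of the same
size and walk the control messages to collect the descriptors.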

If ``VHOST_USER_PROTOCOL_F_BACKEND_SEND_FD`` protocol feature is
negotiated, back-end can send file descriptors (at most 8 descriptors in
each message) to front-end via ancillary data using this fd communication
channel.

Inflight I/O tracking
---------------------

To support reconnecting after restart or crash, back-end may need to
resubmit inflight I/Os. If virtqueue is processed in order, we can
easily achieve that by getting the inflight descriptors from
descriptor table (split virtqueue) or descriptor ring (packed
virtqueue). However, it can't work when we process descriptors
out-of-order because some entries which store the information of
inflight descriptors in available ring (split virtqueue) or descriptor
ring (packed virtqueue) might be overwritten by new entries. To solve
this problem, the back-end needs to allocate an extra buffer to store this
information of inflight descriptors and share it with front-end for
persistence. ``VHOST_USER_GET_INFLIGHT_FD`` and
``VHOST_USER_SET_INFLIGHT_FD`` are used to transfer this buffer
between front-end and back-end. The format of this buffer is described
below:

+---------------+---------------+-----+---------------+
| queue0 region | queue1 region | ... | queueN region |
+---------------+---------------+-----+---------------+

N is the number of available virtqueues. The back-end could get it from num
queues field of ``VhostUserInflight``.

For split virtqueue, queue region can be implemented as:

.. code:: c

  typedef struct DescStateSplit {
      /* Indicate whether this descriptor is inflight or not.
       * Only available for head-descriptor. */
      uint8_t inflight;

      /* Padding */
      uint8_t padding[5];

      /* Maintain a list for the last batch of used descriptors.
       * Only available when batching is used for submitting */
      uint16_t next;

      /* Used to preserve the order of fetching available descriptors.
       * Only available for head-descriptor. */
      uint64_t counter;
  } DescStateSplit;

  typedef struct QueueRegionSplit {
      /* The feature flags of this region. Now it's initialized to 0. */
      uint64_t features;

      /* The version of this region. It's 1 currently.
       * Zero value indicates an uninitialized buffer */
      uint16_t version;

      /* The size of DescStateSplit array. It's equal to the virtqueue size.
       * The back-end could get it from queue size field of VhostUserInflight. */
      uint16_t desc_num;

      /* The head of list that track the last batch of used descriptors. */
      uint16_t last_batch_head;

      /* Store the idx value of used ring */
      uint16_t used_idx;

      /* Used to track the state of each descriptor in descriptor table */
      DescStateSplit desc[];
  } QueueRegionSplit;

To track inflight I/O, the queue region should be processed as follows:

When receiving available buffers from the driver:

#. Get the next available head-descriptor index from available ring, ``i``

#. Set ``desc[i].counter`` to the value of global counter

#. Increase global counter by 1

#. Set ``desc[i].inflight`` to 1

When supplying used buffers to the driver:

1. Get corresponding used head-descriptor index, ``i``

2. Set ``desc[i].next`` to ``last_batch_head``

3. Set ``last_batch_head`` to ``i``

#. Steps 1,2,3 may be performed repeatedly if batching is possible

#. Increase the ``idx`` value of used ring by the size of the batch

#. Set the ``inflight`` field of each ``DescStateSplit`` entry in the batch to 0

#. Set ``used_idx`` to the ``idx`` value of used ring

When reconnecting:

#. If the value of ``used_idx`` does not match the ``idx`` value of
   used ring (means the inflight field of ``DescStateSplit`` entries in
   last batch may be incorrect),

   a. Subtract the value of ``used_idx`` from the ``idx`` value of
      used ring to get last batch size of ``DescStateSplit`` entries

   #. Set the ``inflight`` field of each ``DescStateSplit`` entry to 0 in last batch
      list which starts from ``last_batch_head``

   #. Set ``used_idx`` to the ``idx`` value of used ring

#. Resubmit inflight ``DescStateSplit`` entries in order of their
   counter value

For packed virtqueue, queue region can be implemented as:

.. code:: c

  typedef struct DescStatePacked {
      /* Indicate whether this descriptor is inflight or not.
       * Only available for head-descriptor. */
      uint8_t inflight;

      /* Padding */
      uint8_t padding;

      /* Link to the next free entry */
      uint16_t next;

      /* Link to the last entry of descriptor list.
       * Only available for head-descriptor. */
      uint16_t last;

      /* The length of descriptor list.
       * Only available for head-descriptor. */
      uint16_t num;

      /* Used to preserve the order of fetching available descriptors.
       * Only available for head-descriptor. */
      uint64_t counter;

      /* The buffer id */
      uint16_t id;

      /* The descriptor flags */
      uint16_t flags;

      /* The buffer length */
      uint32_t len;

      /* The buffer address */
      uint64_t addr;
  } DescStatePacked;

  typedef struct QueueRegionPacked {
      /* The feature flags of this region. Now it's initialized to 0. */
      uint64_t features;

      /* The version of this region. It's 1 currently.
       * Zero value indicates an uninitialized buffer */
      uint16_t version;

      /* The size of DescStatePacked array. It's equal to the virtqueue size.
       * The back-end could get it from queue size field of VhostUserInflight.
       */
      uint16_t desc_num;

      /* The head of free DescStatePacked entry list */
      uint16_t free_head;

      /* The old head of free DescStatePacked entry list */
      uint16_t old_free_head;

      /* The used index of descriptor ring */
      uint16_t used_idx;

      /* The old used index of descriptor ring */
      uint16_t old_used_idx;

      /* Device ring wrap counter */
      uint8_t used_wrap_counter;

      /* The old device ring wrap counter */
      uint8_t old_used_wrap_counter;

      /* Padding */
      uint8_t padding[7];

      /* Used to track the state of each descriptor fetched from descriptor ring */
      DescStatePacked desc[];
  } QueueRegionPacked;

To track inflight I/O, the queue region should be processed as follows:

When receiving available buffers from the driver:

#. Get the next available descriptor entry from descriptor ring, ``d``

#. If ``d`` is head descriptor,

   a. Set ``desc[old_free_head].num`` to 0

   #. Set ``desc[old_free_head].counter`` to the value of global counter

   #. Increase global counter by 1

   #. Set ``desc[old_free_head].inflight`` to 1

#. If ``d`` is last descriptor, set ``desc[old_free_head].last`` to
   ``free_head``

#. Increase ``desc[old_free_head].num`` by 1

#. Set ``desc[free_head].addr``, ``desc[free_head].len``,
   ``desc[free_head].flags``, ``desc[free_head].id`` to ``d.addr``,
   ``d.len``, ``d.flags``, ``d.id``

#. Set ``free_head`` to ``desc[free_head].next``

#. If ``d`` is last descriptor, set ``old_free_head`` to ``free_head``

When supplying used buffers to the driver:

1. Get corresponding used head-descriptor entry from descriptor ring,
   ``d``

2. Get corresponding ``DescStatePacked`` entry, ``e``

3. Set ``desc[e.last].next`` to ``free_head``

4. Set ``free_head`` to the index of ``e``

#. Steps 1,2,3,4 may be performed repeatedly if batching is possible

#. Increase ``used_idx`` by the size of the batch and update
   ``used_wrap_counter`` if needed

#. Update ``d.flags``

#. Set the ``inflight`` field of each head ``DescStatePacked`` entry
   in the batch to 0

#. Set ``old_free_head``, ``old_used_idx``, ``old_used_wrap_counter``
   to ``free_head``, ``used_idx``, ``used_wrap_counter``

When reconnecting:

#. If ``used_idx`` does not match ``old_used_idx`` (means the
   ``inflight`` field of ``DescStatePacked`` entries in last batch may
   be incorrect),

   a. Get the next descriptor ring entry through ``old_used_idx``, ``d``

   #. Use ``old_used_wrap_counter`` to calculate the available flags

   #. If ``d.flags`` is not equal to the calculated flags value (means
      back-end has submitted the buffer to guest driver before crash, so
      it has to commit the in-progress update), set ``old_free_head``,
      ``old_used_idx``, ``old_used_wrap_counter`` to ``free_head``,
      ``used_idx``, ``used_wrap_counter``

#. Set ``free_head``, ``used_idx``, ``used_wrap_counter`` to
   ``old_free_head``, ``old_used_idx``, ``old_used_wrap_counter``
   (roll back any in-progress update)

#. Set the ``inflight`` field of each ``DescStatePacked`` entry in
   free list to 0

#. Resubmit inflight ``DescStatePacked`` entries in order of their
   counter value

In-band notifications
---------------------

In some limited situations (e.g. for simulation) it is desirable to
have the kick, call and error (if used) signals done via in-band
messages instead of asynchronous eventfd notifications. This can be
done by negotiating the ``VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS``
protocol feature.

Note that because too many messages on the sockets can
cause the sending application(s) to block, it is not advised to use
this feature unless absolutely necessary.
It is also considered an
error to negotiate this feature without also negotiating
``VHOST_USER_PROTOCOL_F_BACKEND_REQ`` and ``VHOST_USER_PROTOCOL_F_REPLY_ACK``.
The former is necessary for getting a message channel from the back-end
to the front-end, while the latter needs to be used with the in-band
notification messages to block until they are processed, both to avoid
blocking later and for proper processing (at least in the simulation
use case). As it has no other way of signalling this error, the back-end
should close the connection as a response to a
``VHOST_USER_SET_PROTOCOL_FEATURES`` message that sets the in-band
notifications feature flag without the other two.

Protocol features
-----------------

.. code:: c

  #define VHOST_USER_PROTOCOL_F_MQ                    0
  #define VHOST_USER_PROTOCOL_F_LOG_SHMFD             1
  #define VHOST_USER_PROTOCOL_F_RARP                  2
  #define VHOST_USER_PROTOCOL_F_REPLY_ACK             3
  #define VHOST_USER_PROTOCOL_F_MTU                   4
  #define VHOST_USER_PROTOCOL_F_BACKEND_REQ           5
  #define VHOST_USER_PROTOCOL_F_CROSS_ENDIAN          6
  #define VHOST_USER_PROTOCOL_F_CRYPTO_SESSION        7
  #define VHOST_USER_PROTOCOL_F_PAGEFAULT             8
  #define VHOST_USER_PROTOCOL_F_CONFIG                9
  #define VHOST_USER_PROTOCOL_F_BACKEND_SEND_FD      10
  #define VHOST_USER_PROTOCOL_F_HOST_NOTIFIER        11
  #define VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD       12
  #define VHOST_USER_PROTOCOL_F_RESET_DEVICE         13
  #define VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS 14
  #define VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS  15
  #define VHOST_USER_PROTOCOL_F_STATUS               16
  #define VHOST_USER_PROTOCOL_F_XEN_MMAP             17

Front-end message types
-----------------------

``VHOST_USER_GET_FEATURES``
  :id: 1
  :equivalent ioctl: ``VHOST_GET_FEATURES``
  :request payload: N/A
  :reply payload: ``u64``

  Get the features bitmask from the underlying vhost implementation.
  Feature bit ``VHOST_USER_F_PROTOCOL_FEATURES`` signals back-end support
  for ``VHOST_USER_GET_PROTOCOL_FEATURES`` and
  ``VHOST_USER_SET_PROTOCOL_FEATURES``.

``VHOST_USER_SET_FEATURES``
  :id: 2
  :equivalent ioctl: ``VHOST_SET_FEATURES``
  :request payload: ``u64``
  :reply payload: N/A

  Enable features in the underlying vhost implementation using a
  bitmask. Feature bit ``VHOST_USER_F_PROTOCOL_FEATURES`` signals
  back-end support for ``VHOST_USER_GET_PROTOCOL_FEATURES`` and
  ``VHOST_USER_SET_PROTOCOL_FEATURES``.

``VHOST_USER_GET_PROTOCOL_FEATURES``
  :id: 15
  :equivalent ioctl: ``VHOST_GET_FEATURES``
  :request payload: N/A
  :reply payload: ``u64``

  Get the protocol feature bitmask from the underlying vhost
  implementation. Only legal if feature bit
  ``VHOST_USER_F_PROTOCOL_FEATURES`` is present in
  ``VHOST_USER_GET_FEATURES``. It does not need to be acknowledged by
  ``VHOST_USER_SET_FEATURES``.

.. Note::
   Back-ends that report ``VHOST_USER_F_PROTOCOL_FEATURES`` must
   support this message even before ``VHOST_USER_SET_FEATURES`` was
   called.

``VHOST_USER_SET_PROTOCOL_FEATURES``
  :id: 16
  :equivalent ioctl: ``VHOST_SET_FEATURES``
  :request payload: ``u64``
  :reply payload: N/A

  Enable protocol features in the underlying vhost implementation.

  Only legal if feature bit ``VHOST_USER_F_PROTOCOL_FEATURES`` is present in
  ``VHOST_USER_GET_FEATURES``. It does not need to be acknowledged by
  ``VHOST_USER_SET_FEATURES``.

.. Note::
   Back-ends that report ``VHOST_USER_F_PROTOCOL_FEATURES`` must support
   this message even before ``VHOST_USER_SET_FEATURES`` was called.

``VHOST_USER_SET_OWNER``
  :id: 3
  :equivalent ioctl: ``VHOST_SET_OWNER``
  :request payload: N/A
  :reply payload: N/A

  Issued when a new connection is established. It marks the sender
  as the front-end that owns the session.
  This can be used on the *back-end*
  as a "session start" flag.

``VHOST_USER_RESET_OWNER``
  :id: 4
  :request payload: N/A
  :reply payload: N/A

.. admonition:: Deprecated

   This is no longer used. Used to be sent to request disabling all
   rings, but some back-ends interpreted it to also discard connection
   state (this interpretation would lead to bugs). It is recommended
   that back-ends either ignore this message, or use it to disable all
   rings.

``VHOST_USER_SET_MEM_TABLE``
  :id: 5
  :equivalent ioctl: ``VHOST_SET_MEM_TABLE``
  :request payload: multiple memory regions description
  :reply payload: (postcopy only) multiple memory regions description

  Sets the memory map regions on the back-end so it can translate the
  vring addresses. In the ancillary data there is an array of file
  descriptors for each memory mapped region. The size and ordering of
  the fds matches the number and ordering of memory regions.

  When ``VHOST_USER_POSTCOPY_LISTEN`` has been received,
  ``SET_MEM_TABLE`` replies with the bases of the memory mapped
  regions to the front-end. The back-end must have mmap'd the regions but
  not yet accessed them and should not yet generate a userfault
  event.

.. Note::
   ``NEED_REPLY_MASK`` is not set in this case. QEMU will then
   reply back to the list of mappings with an empty
   ``VHOST_USER_SET_MEM_TABLE`` as an acknowledgement; only upon
   reception of this message may the guest start accessing the memory
   and generating faults.

``VHOST_USER_SET_LOG_BASE``
  :id: 6
  :equivalent ioctl: ``VHOST_SET_LOG_BASE``
  :request payload: ``u64``
  :reply payload: N/A

  Sets logging shared memory space.

  When the back-end has the ``VHOST_USER_PROTOCOL_F_LOG_SHMFD`` protocol
  feature, the log memory fd is provided in the ancillary data of the
  ``VHOST_USER_SET_LOG_BASE`` message, and the size and offset of the
  shared memory area are provided in the message.

``VHOST_USER_SET_LOG_FD``
  :id: 7
  :equivalent ioctl: ``VHOST_SET_LOG_FD``
  :request payload: N/A
  :reply payload: N/A

  Sets the logging file descriptor, which is passed as ancillary data.

``VHOST_USER_SET_VRING_NUM``
  :id: 8
  :equivalent ioctl: ``VHOST_SET_VRING_NUM``
  :request payload: vring state description
  :reply payload: N/A

  Set the size of the queue.

``VHOST_USER_SET_VRING_ADDR``
  :id: 9
  :equivalent ioctl: ``VHOST_SET_VRING_ADDR``
  :request payload: vring address description
  :reply payload: N/A

  Sets the addresses of the different aspects of the vring.

``VHOST_USER_SET_VRING_BASE``
  :id: 10
  :equivalent ioctl: ``VHOST_SET_VRING_BASE``
  :request payload: vring state description
  :reply payload: N/A

  Sets the base offset in the available vring.

``VHOST_USER_GET_VRING_BASE``
  :id: 11
  :equivalent ioctl: ``VHOST_GET_VRING_BASE``
  :request payload: vring state description
  :reply payload: vring state description

  Get the available vring base offset.

``VHOST_USER_SET_VRING_KICK``
  :id: 12
  :equivalent ioctl: ``VHOST_SET_VRING_KICK``
  :request payload: ``u64``
  :reply payload: N/A

  Set the event file descriptor for adding buffers to the vring. It is
  passed in the ancillary data.

  Bits (0-7) of the payload contain the vring index. Bit 8 is the
  invalid FD flag. This flag is set when there is no file descriptor
  in the ancillary data. This signals that polling should be used
  instead of waiting for the kick.
  Note that if the protocol feature
  ``VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS`` has been negotiated,
  this message isn't necessary as the ring is also started on the
  ``VHOST_USER_VRING_KICK`` message; it may, however, still be used to
  set an event file descriptor (which will be preferred over the
  message) or to enable polling.

``VHOST_USER_SET_VRING_CALL``
  :id: 13
  :equivalent ioctl: ``VHOST_SET_VRING_CALL``
  :request payload: ``u64``
  :reply payload: N/A

  Set the event file descriptor to signal when buffers are used. It is
  passed in the ancillary data.

  Bits (0-7) of the payload contain the vring index. Bit 8 is the
  invalid FD flag. This flag is set when there is no file descriptor
  in the ancillary data. This signals that polling will be used
  instead of waiting for the call. Note that if the protocol features
  ``VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS`` and
  ``VHOST_USER_PROTOCOL_F_BACKEND_REQ`` have been negotiated, this message
  isn't necessary as the ``VHOST_USER_BACKEND_VRING_CALL`` message can be
  used; it may, however, still be used to set an event file descriptor
  or to enable polling.

``VHOST_USER_SET_VRING_ERR``
  :id: 14
  :equivalent ioctl: ``VHOST_SET_VRING_ERR``
  :request payload: ``u64``
  :reply payload: N/A

  Set the event file descriptor to signal when an error occurs. It is
  passed in the ancillary data.

  Bits (0-7) of the payload contain the vring index. Bit 8 is the
  invalid FD flag. This flag is set when there is no file descriptor
  in the ancillary data.
  Note that if the protocol features
  ``VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS`` and
  ``VHOST_USER_PROTOCOL_F_BACKEND_REQ`` have been negotiated, this message
  isn't necessary as the ``VHOST_USER_BACKEND_VRING_ERR`` message can be
  used; it may, however, still be used to set an event file descriptor
  (which will be preferred over the message).

``VHOST_USER_GET_QUEUE_NUM``
  :id: 17
  :equivalent ioctl: N/A
  :request payload: N/A
  :reply payload: ``u64``

  Query how many queues the back-end supports.

  This request should be sent only when ``VHOST_USER_PROTOCOL_F_MQ``
  is set in queried protocol features by
  ``VHOST_USER_GET_PROTOCOL_FEATURES``.

``VHOST_USER_SET_VRING_ENABLE``
  :id: 18
  :equivalent ioctl: N/A
  :request payload: vring state description
  :reply payload: N/A

  Signal the back-end to enable or disable the corresponding vring.

  This request should be sent only when
  ``VHOST_USER_F_PROTOCOL_FEATURES`` has been negotiated.

``VHOST_USER_SEND_RARP``
  :id: 19
  :equivalent ioctl: N/A
  :request payload: ``u64``
  :reply payload: N/A

  Ask the vhost-user back-end to broadcast a fake RARP to notify that
  the migration has terminated, for guests that do not support
  GUEST_ANNOUNCE.

  Only legal if feature bit ``VHOST_USER_F_PROTOCOL_FEATURES`` is
  present in ``VHOST_USER_GET_FEATURES`` and protocol feature bit
  ``VHOST_USER_PROTOCOL_F_RARP`` is present in
  ``VHOST_USER_GET_PROTOCOL_FEATURES``. The first 6 bytes of the
  payload contain the MAC address of the guest to allow the vhost-user
  back-end to construct and broadcast the fake RARP.

``VHOST_USER_NET_SET_MTU``
  :id: 20
  :equivalent ioctl: N/A
  :request payload: ``u64``
  :reply payload: N/A

  Set host MTU value exposed to the guest.

  This request should be sent only when ``VIRTIO_NET_F_MTU`` feature
  has been successfully negotiated, ``VHOST_USER_F_PROTOCOL_FEATURES``
  is present in ``VHOST_USER_GET_FEATURES`` and protocol feature bit
  ``VHOST_USER_PROTOCOL_F_NET_MTU`` is present in
  ``VHOST_USER_GET_PROTOCOL_FEATURES``.

  If ``VHOST_USER_PROTOCOL_F_REPLY_ACK`` is negotiated, the back-end must
  respond with zero in case the specified MTU is valid, or non-zero
  otherwise.

``VHOST_USER_SET_BACKEND_REQ_FD`` (previous name ``VHOST_USER_SET_SLAVE_REQ_FD``)
  :id: 21
  :equivalent ioctl: N/A
  :request payload: N/A
  :reply payload: N/A

  Set the socket file descriptor for back-end initiated requests. It is passed
  in the ancillary data.

  This request should be sent only when
  ``VHOST_USER_F_PROTOCOL_FEATURES`` has been negotiated, and protocol
  feature bit ``VHOST_USER_PROTOCOL_F_BACKEND_REQ`` is present in
  ``VHOST_USER_GET_PROTOCOL_FEATURES``. If
  ``VHOST_USER_PROTOCOL_F_REPLY_ACK`` is negotiated, the back-end must
  respond with zero for success, non-zero otherwise.

``VHOST_USER_IOTLB_MSG``
  :id: 22
  :equivalent ioctl: N/A (equivalent to ``VHOST_IOTLB_MSG`` message type)
  :request payload: ``struct vhost_iotlb_msg``
  :reply payload: ``u64``

  Send IOTLB messages with ``struct vhost_iotlb_msg`` as payload.

  The front-end sends such requests to update and invalidate entries in the
  device IOTLB. The back-end has to acknowledge the request by sending
  zero as ``u64`` payload for success, non-zero otherwise.

  This request should be sent only when the ``VIRTIO_F_IOMMU_PLATFORM``
  feature has been successfully negotiated.
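
The ``u64`` payload shared by ``VHOST_USER_SET_VRING_KICK``,
``VHOST_USER_SET_VRING_CALL`` and ``VHOST_USER_SET_VRING_ERR`` (described
above) packs the vring index and the invalid FD flag into one value. A
minimal sketch of decoding it could look as follows; the macro, struct and
function names are illustrative assumptions and are not taken from any
implementation, only the bit layout comes from the message descriptions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names; the bit positions are the ones given in the text. */
#define VRING_FD_IDX_MASK   0xffULL      /* bits 0-7: vring index */
#define VRING_FD_NOFD_MASK  (1ULL << 8)  /* bit 8: invalid FD flag */

struct vring_fd_msg {
    uint8_t index;   /* vring the message refers to */
    bool    has_fd;  /* false when no fd is in the ancillary data */
};

/* Split the u64 payload into the vring index and the fd presence flag. */
static struct vring_fd_msg decode_vring_fd_payload(uint64_t payload)
{
    struct vring_fd_msg m;

    m.index = (uint8_t)(payload & VRING_FD_IDX_MASK);
    m.has_fd = (payload & VRING_FD_NOFD_MASK) == 0;
    return m;
}
```

When ``has_fd`` is false for a kick or call message, the receiver would fall
back to polling, as described for those messages.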

``VHOST_USER_SET_VRING_ENDIAN``
  :id: 23
  :equivalent ioctl: ``VHOST_SET_VRING_ENDIAN``
  :request payload: vring state description
  :reply payload: N/A

  Set the endianness of a VQ for legacy devices. Little-endian is
  indicated with state.num set to 0 and big-endian is indicated with
  state.num set to 1. Other values are invalid.

  This request should be sent only when
  ``VHOST_USER_PROTOCOL_F_CROSS_ENDIAN`` has been negotiated.
  Back-ends that negotiated this feature should handle both
  endiannesses and expect this message once (per VQ) during device
  configuration (i.e. before the front-end starts the VQ).

``VHOST_USER_GET_CONFIG``
  :id: 24
  :equivalent ioctl: N/A
  :request payload: virtio device config space
  :reply payload: virtio device config space

  When ``VHOST_USER_PROTOCOL_F_CONFIG`` is negotiated, this message is
  submitted by the vhost-user front-end to fetch the contents of the
  virtio device configuration space. The vhost-user back-end's payload
  size MUST match the front-end's request; the back-end uses a
  zero-length payload to indicate an error to the vhost-user front-end.
  The vhost-user front-end may cache the contents to avoid repeated
  ``VHOST_USER_GET_CONFIG`` calls.

``VHOST_USER_SET_CONFIG``
  :id: 25
  :equivalent ioctl: N/A
  :request payload: virtio device config space
  :reply payload: N/A

  When ``VHOST_USER_PROTOCOL_F_CONFIG`` is negotiated, this message is
  submitted by the vhost-user front-end when the guest changes the virtio
  device configuration space. It can also be used for live migration
  on the destination host. The vhost-user back-end must check the flags
  field, and back-ends MUST NOT accept SET_CONFIG for read-only
  configuration space fields unless the live migration bit is set.
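
The size rule for ``VHOST_USER_GET_CONFIG`` replies can be sketched as a
small check on the front-end side. This is only an illustration: the enum
and function names are hypothetical, and the sizes are assumed to be the
``size`` header fields of the request and of the reply.

```c
#include <stdint.h>

/* Hypothetical result codes for this sketch. */
enum config_reply_status {
    CONFIG_REPLY_OK,            /* reply size matches the request */
    CONFIG_REPLY_BACKEND_ERROR, /* zero-length payload: back-end signals an error */
    CONFIG_REPLY_MALFORMED      /* any other size violates the MUST above */
};

/* Classify a GET_CONFIG reply from the request and reply payload sizes. */
static enum config_reply_status
classify_get_config_reply(uint32_t request_size, uint32_t reply_size)
{
    if (reply_size == 0) {
        return CONFIG_REPLY_BACKEND_ERROR;
    }
    if (reply_size != request_size) {
        return CONFIG_REPLY_MALFORMED;
    }
    return CONFIG_REPLY_OK;
}
```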

``VHOST_USER_CREATE_CRYPTO_SESSION``
  :id: 26
  :equivalent ioctl: N/A
  :request payload: crypto session description
  :reply payload: crypto session description

  Create a session for crypto operation. The back-end must return
  the session id, 0 or positive for success, negative for failure.
  This request should be sent only when the
  ``VHOST_USER_PROTOCOL_F_CRYPTO_SESSION`` feature has been
  successfully negotiated. It's a required feature for crypto
  devices.

``VHOST_USER_CLOSE_CRYPTO_SESSION``
  :id: 27
  :equivalent ioctl: N/A
  :request payload: ``u64``
  :reply payload: N/A

  Close a session for crypto operation which was previously
  created by ``VHOST_USER_CREATE_CRYPTO_SESSION``.

  This request should be sent only when the
  ``VHOST_USER_PROTOCOL_F_CRYPTO_SESSION`` feature has been
  successfully negotiated. It's a required feature for crypto
  devices.

``VHOST_USER_POSTCOPY_ADVISE``
  :id: 28
  :request payload: N/A
  :reply payload: userfault fd

  When ``VHOST_USER_PROTOCOL_F_PAGEFAULT`` is supported, the front-end
  advises the back-end that a migration with postcopy enabled is underway;
  the back-end must open a userfaultfd for later use. Note that at this
  stage the migration is still in precopy mode.

``VHOST_USER_POSTCOPY_LISTEN``
  :id: 29
  :request payload: N/A
  :reply payload: N/A

  The front-end advises the back-end that a transition to postcopy mode has
  happened. The back-end must ensure that shared memory is registered
  with userfaultfd to cause faulting of non-present pages.

  This is always sent sometime after a ``VHOST_USER_POSTCOPY_ADVISE``,
  and thus only when ``VHOST_USER_PROTOCOL_F_PAGEFAULT`` is supported.
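
The ordering of the postcopy messages (``VHOST_USER_POSTCOPY_ADVISE``, then
``VHOST_USER_POSTCOPY_LISTEN``, then ``VHOST_USER_POSTCOPY_END``, which is
described next) can be tracked by a back-end with a small state machine.
The sketch below is an assumption about how such tracking might be written;
the state and function names are hypothetical, and rejecting a repeated
``ADVISE`` is a choice of this sketch, not a rule stated by the protocol.

```c
#include <stdbool.h>

/* Request ids as listed in the message descriptions. */
enum {
    POSTCOPY_ADVISE = 28,
    POSTCOPY_LISTEN = 29,
    POSTCOPY_END    = 30
};

/* Hypothetical back-end states for this sketch. */
enum postcopy_state { PC_IDLE, PC_ADVISED, PC_LISTENING };

/*
 * Advance *st for an incoming postcopy request. Returns false, leaving
 * *st untouched, when the request arrives out of order.
 */
static bool postcopy_step(enum postcopy_state *st, int request)
{
    switch (request) {
    case POSTCOPY_ADVISE:
        if (*st != PC_IDLE) {
            return false;
        }
        *st = PC_ADVISED;    /* back-end opens its userfaultfd here */
        return true;
    case POSTCOPY_LISTEN:
        if (*st != PC_ADVISED) {
            return false;    /* LISTEN is only sent after ADVISE */
        }
        *st = PC_LISTENING;  /* shared memory gets registered with userfaultfd */
        return true;
    case POSTCOPY_END:
        if (*st != PC_LISTENING) {
            return false;    /* END is only sent after LISTEN */
        }
        *st = PC_IDLE;       /* userfaultfd is disabled again */
        return true;
    default:
        return false;        /* not a postcopy request */
    }
}
```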

``VHOST_USER_POSTCOPY_END``
  :id: 30
  :request payload: N/A
  :reply payload: ``u64``

  The front-end advises that postcopy migration has now completed. The back-end
  must disable the userfaultfd. The reply is an acknowledgement
  only.

  When ``VHOST_USER_PROTOCOL_F_PAGEFAULT`` is supported, this message
  is sent at the end of the migration, after
  ``VHOST_USER_POSTCOPY_LISTEN`` was previously sent.

  The value returned is an error indication; 0 is success.

``VHOST_USER_GET_INFLIGHT_FD``
  :id: 31
  :equivalent ioctl: N/A
  :request payload: inflight description
  :reply payload: N/A

  When the ``VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD`` protocol feature has
  been successfully negotiated, this message is submitted by the front-end to
  get a shared buffer from the back-end. The shared buffer will be used by
  the back-end to track inflight I/O. QEMU should retrieve a new one when
  the VM is reset.

``VHOST_USER_SET_INFLIGHT_FD``
  :id: 32
  :equivalent ioctl: N/A
  :request payload: inflight description
  :reply payload: N/A

  When the ``VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD`` protocol feature has
  been successfully negotiated, this message is submitted by the front-end to
  send the shared inflight buffer back to the back-end so that the back-end
  can get inflight I/O after a crash or restart.

``VHOST_USER_GPU_SET_SOCKET``
  :id: 33
  :equivalent ioctl: N/A
  :request payload: N/A
  :reply payload: N/A

  Sets the GPU protocol socket file descriptor, which is passed as
  ancillary data. The GPU protocol is used to inform the front-end of
  rendering state and updates. See vhost-user-gpu.rst for details.
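
For the inflight buffer handed back with ``VHOST_USER_SET_INFLIGHT_FD``, the
first two packed-virtqueue reconnect steps listed earlier (commit or roll
back an interrupted update) can be sketched as below. This is an
illustration under stated assumptions, not a reference implementation: the
struct carries only the bookkeeping fields those steps mention,
``ring_flags`` is a hypothetical array holding the ``flags`` of each
descriptor ring entry, and comparing only the AVAIL and USED bits is an
assumption of this sketch. Clearing ``inflight`` in the free list and
resubmitting entries are left out.

```c
#include <stdint.h>

/* Packed-ring descriptor flag bits as defined by the Virtio specification. */
#define DESC_F_AVAIL (1u << 7)
#define DESC_F_USED  (1u << 15)

/* Subset of the per-queue inflight bookkeeping used by these steps. */
struct inflight_state {
    uint16_t free_head, old_free_head;
    uint16_t used_idx, old_used_idx;
    uint8_t  used_wrap_counter, old_used_wrap_counter;
};

/* Assumed encoding of "available, not yet used" for a given wrap counter. */
static uint16_t avail_flags(uint8_t wrap)
{
    return wrap ? DESC_F_AVAIL : DESC_F_USED;
}

static void inflight_recover(struct inflight_state *s,
                             const uint16_t *ring_flags)
{
    if (s->used_idx != s->old_used_idx) {
        uint16_t d_flags = ring_flags[s->old_used_idx] &
                           (DESC_F_AVAIL | DESC_F_USED);

        if (d_flags != avail_flags(s->old_used_wrap_counter)) {
            /* The buffer already reached the driver: commit the update. */
            s->old_free_head = s->free_head;
            s->old_used_idx = s->used_idx;
            s->old_used_wrap_counter = s->used_wrap_counter;
        }
    }

    /* Roll back any update that was not committed above. */
    s->free_head = s->old_free_head;
    s->used_idx = s->old_used_idx;
    s->used_wrap_counter = s->old_used_wrap_counter;
}
```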

``VHOST_USER_RESET_DEVICE``
  :id: 34
  :equivalent ioctl: N/A
  :request payload: N/A
  :reply payload: N/A

  Ask the vhost user back-end to disable all rings and reset all
  internal device state to the initial state, ready to be
  reinitialized. The back-end retains ownership of the device
  throughout the reset operation.

  Only valid if the ``VHOST_USER_PROTOCOL_F_RESET_DEVICE`` protocol
  feature is set by the back-end.

``VHOST_USER_VRING_KICK``
  :id: 35
  :equivalent ioctl: N/A
  :request payload: vring state description
  :reply payload: N/A

  When the ``VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS`` protocol
  feature has been successfully negotiated, this message may be
  submitted by the front-end to indicate that a buffer was added to
  the vring instead of signalling it using the vring's kick file
  descriptor or having the back-end rely on polling.

  The state.num field is currently reserved and must be set to 0.

``VHOST_USER_GET_MAX_MEM_SLOTS``
  :id: 36
  :equivalent ioctl: N/A
  :request payload: N/A
  :reply payload: ``u64``

  When the ``VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS`` protocol
  feature has been successfully negotiated, this message is submitted
  by the front-end to the back-end. The back-end should return the message with a
  ``u64`` payload containing the maximum number of memory slots for
  QEMU to expose to the guest. The value returned by the back-end
  will be capped at the maximum number of ram slots which can be
  supported by the target platform.

``VHOST_USER_ADD_MEM_REG``
  :id: 37
  :equivalent ioctl: N/A
  :request payload: N/A
  :reply payload: single memory region description

  When the ``VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS`` protocol
  feature has been successfully negotiated, this message is submitted
  by the front-end to the back-end.
  The message payload contains a memory
  region descriptor struct, describing a region of guest memory which
  the back-end device must map in. When the
  ``VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS`` protocol feature has
  been successfully negotiated, along with the
  ``VHOST_USER_REM_MEM_REG`` message, this message is used to set and
  update the memory tables of the back-end device.

  Exactly one file descriptor from which the memory is mapped is
  passed in the ancillary data.

  In postcopy mode (see ``VHOST_USER_POSTCOPY_LISTEN``), the back-end
  replies with the bases of the memory mapped region to the front-end.
  For further details on postcopy, see ``VHOST_USER_SET_MEM_TABLE``.
  They apply to ``VHOST_USER_ADD_MEM_REG`` accordingly.

``VHOST_USER_REM_MEM_REG``
  :id: 38
  :equivalent ioctl: N/A
  :request payload: N/A
  :reply payload: single memory region description

  When the ``VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS`` protocol
  feature has been successfully negotiated, this message is submitted
  by the front-end to the back-end. The message payload contains a memory
  region descriptor struct, describing a region of guest memory which
  the back-end device must unmap. When the
  ``VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS`` protocol feature has
  been successfully negotiated, along with the
  ``VHOST_USER_ADD_MEM_REG`` message, this message is used to set and
  update the memory tables of the back-end device.

  The memory region to be removed is identified by its guest address,
  user address and size. The mmap offset is ignored.

  No file descriptors SHOULD be passed in the ancillary data. For
  compatibility with existing incorrect implementations, the back-end MAY
  accept messages with one file descriptor. If a file descriptor is
  passed, the back-end MUST close it without using it otherwise.
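
The matching rule for ``VHOST_USER_REM_MEM_REG`` (guest address, user
address and size identify the region, the mmap offset is ignored) can be
sketched as follows. The struct mirrors the fields of the single memory
region description; the function names and the flat table are hypothetical,
and a real back-end would also unmap the region and close any stray file
descriptor.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Fields of the single memory region description. */
struct mem_region {
    uint64_t guest_addr;
    uint64_t size;
    uint64_t user_addr;
    uint64_t mmap_offset;
};

/* Compare the identifying fields only; mmap_offset is deliberately ignored. */
static bool region_matches(const struct mem_region *a,
                           const struct mem_region *b)
{
    return a->guest_addr == b->guest_addr &&
           a->user_addr == b->user_addr &&
           a->size == b->size;
}

/* Remove the first region matching req; returns true if one was found. */
static bool remove_region(struct mem_region *table, size_t *count,
                          const struct mem_region *req)
{
    for (size_t i = 0; i < *count; i++) {
        if (region_matches(&table[i], req)) {
            table[i] = table[*count - 1];  /* table order is not preserved */
            (*count)--;
            return true;
        }
    }
    return false;
}
```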

``VHOST_USER_SET_STATUS``
  :id: 39
  :equivalent ioctl: VHOST_VDPA_SET_STATUS
  :request payload: ``u64``
  :reply payload: N/A

  When the ``VHOST_USER_PROTOCOL_F_STATUS`` protocol feature has been
  successfully negotiated, this message is submitted by the front-end to
  notify the back-end with updated device status as defined in the Virtio
  specification.

``VHOST_USER_GET_STATUS``
  :id: 40
  :equivalent ioctl: VHOST_VDPA_GET_STATUS
  :request payload: N/A
  :reply payload: ``u64``

  When the ``VHOST_USER_PROTOCOL_F_STATUS`` protocol feature has been
  successfully negotiated, this message is submitted by the front-end to
  query the back-end for its device status as defined in the Virtio
  specification.


Back-end message types
----------------------

For this type of message, the request is sent by the back-end and the reply
is sent by the front-end.

``VHOST_USER_BACKEND_IOTLB_MSG`` (previous name ``VHOST_USER_SLAVE_IOTLB_MSG``)
  :id: 1
  :equivalent ioctl: N/A (equivalent to ``VHOST_IOTLB_MSG`` message type)
  :request payload: ``struct vhost_iotlb_msg``
  :reply payload: N/A

  Send IOTLB messages with ``struct vhost_iotlb_msg`` as payload.
  The back-end sends such requests to notify of an IOTLB miss, or an IOTLB
  access failure. If ``VHOST_USER_PROTOCOL_F_REPLY_ACK`` is
  negotiated, and the back-end sets the ``VHOST_USER_NEED_REPLY`` flag, the
  front-end must respond with zero when the operation is successfully
  completed, or non-zero otherwise. This request should be sent only when
  the ``VIRTIO_F_IOMMU_PLATFORM`` feature has been successfully
  negotiated.

``VHOST_USER_BACKEND_CONFIG_CHANGE_MSG`` (previous name ``VHOST_USER_SLAVE_CONFIG_CHANGE_MSG``)
  :id: 2
  :equivalent ioctl: N/A
  :request payload: N/A
  :reply payload: N/A

  When ``VHOST_USER_PROTOCOL_F_CONFIG`` is negotiated, the vhost-user
  back-end sends such messages to notify that the virtio device's
  configuration space has changed. For those host devices which can
  support such a feature, the host driver can send a
  ``VHOST_USER_GET_CONFIG`` message to the back-end to get the latest
  content. If ``VHOST_USER_PROTOCOL_F_REPLY_ACK`` is negotiated, and the
  back-end sets the ``VHOST_USER_NEED_REPLY`` flag, the front-end must
  respond with zero when the operation is successfully completed, or
  non-zero otherwise.

``VHOST_USER_BACKEND_VRING_HOST_NOTIFIER_MSG`` (previous name ``VHOST_USER_SLAVE_VRING_HOST_NOTIFIER_MSG``)
  :id: 3
  :equivalent ioctl: N/A
  :request payload: vring area description
  :reply payload: N/A

  Sets host notifier for a specified queue. The queue index is
  contained in the ``u64`` field of the vring area description. The
  host notifier is described by the file descriptor (typically it's a
  VFIO device fd) which is passed as ancillary data and the size
  (which is mmap size and should be the same as host page size) and
  offset (which is mmap offset) carried in the vring area
  description. QEMU can mmap the file descriptor based on the size and
  offset to get a memory range. Registering a host notifier means
  mapping this memory range to the VM as the specified queue's notify
  MMIO region. The back-end sends this request to tell QEMU to de-register
  the existing notifier if any and register the new notifier if the
  request is sent with a file descriptor.

  This request should be sent only when the
  ``VHOST_USER_PROTOCOL_F_HOST_NOTIFIER`` protocol feature has been
  successfully negotiated.
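
The mapping step of ``VHOST_USER_BACKEND_VRING_HOST_NOTIFIER_MSG`` can be
illustrated with a plain ``mmap()`` call on the received file descriptor,
using the size and offset from the vring area description. This is a
minimal sketch: the function name is hypothetical, the read/write shared
mapping is an assumption of this example, and registering the resulting
range as the queue's notify MMIO region is not shown.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>

/*
 * Map the host notifier area described by (fd, size, offset).
 * Returns the mapped address, or NULL if mmap() fails.
 */
static void *map_host_notifier(int fd, uint64_t size, uint64_t offset)
{
    void *addr = mmap(NULL, (size_t)size, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, (off_t)offset);

    return addr == MAP_FAILED ? NULL : addr;
}
```

The caller is expected to ``munmap()`` the range when the notifier is
de-registered.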

``VHOST_USER_BACKEND_VRING_CALL`` (previous name ``VHOST_USER_SLAVE_VRING_CALL``)
  :id: 4
  :equivalent ioctl: N/A
  :request payload: vring state description
  :reply payload: N/A

  When the ``VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS`` protocol
  feature has been successfully negotiated, this message may be
  submitted by the back-end to indicate that a buffer was used from
  the vring instead of signalling this using the vring's call file
  descriptor or having the front-end rely on polling.

  The state.num field is currently reserved and must be set to 0.

``VHOST_USER_BACKEND_VRING_ERR`` (previous name ``VHOST_USER_SLAVE_VRING_ERR``)
  :id: 5
  :equivalent ioctl: N/A
  :request payload: vring state description
  :reply payload: N/A

  When the ``VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS`` protocol
  feature has been successfully negotiated, this message may be
  submitted by the back-end to indicate that an error occurred on the
  specific vring, instead of signalling the error file descriptor
  set by the front-end via ``VHOST_USER_SET_VRING_ERR``.

  The state.num field is currently reserved and must be set to 0.

.. _reply_ack:

VHOST_USER_PROTOCOL_F_REPLY_ACK
-------------------------------

The original vhost-user specification only demands replies for certain
commands. This differs from the vhost protocol implementation where
commands are sent over an ``ioctl()`` call and block until the back-end
has completed.

With this protocol extension negotiated, the sender (QEMU) can set the
``need_reply`` [Bit 3] flag on any command. This indicates that the
back-end MUST respond with a Payload ``VhostUserMsg`` indicating success
or failure. The payload should be set to zero on success or non-zero
on failure, unless the message already has an explicit reply body.
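
As a rough illustration of the rule above, the following sketch decides
whether a receiver has to send the ``u64`` acknowledgement, and also checks
the dependency that the in-band notifications feature has on
``VHOST_USER_PROTOCOL_F_BACKEND_REQ`` and
``VHOST_USER_PROTOCOL_F_REPLY_ACK``. The feature bit numbers and header
flag bits come from this document; the ``HDR_*`` macro names and the
function names are assumptions made for the example.

```c
#include <stdbool.h>
#include <stdint.h>

#define VHOST_USER_PROTOCOL_F_REPLY_ACK             3
#define VHOST_USER_PROTOCOL_F_BACKEND_REQ           5
#define VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS 14

/* Header flag bits (hypothetical names): bit 2 reply, bit 3 need_reply. */
#define HDR_REPLY_FLAG      (1u << 2)
#define HDR_NEED_REPLY_FLAG (1u << 3)

static bool has_feature(uint64_t features, unsigned int bit)
{
    return ((features >> bit) & 1) != 0;
}

/*
 * True if the receiver must answer with a zero/non-zero u64 payload:
 * REPLY_ACK was negotiated, the sender set need_reply, and the message
 * has no explicit reply body of its own.
 */
static bool ack_required(uint64_t negotiated, uint32_t hdr_flags,
                         bool has_explicit_reply)
{
    return has_feature(negotiated, VHOST_USER_PROTOCOL_F_REPLY_ACK) &&
           (hdr_flags & HDR_NEED_REPLY_FLAG) != 0 &&
           !has_explicit_reply;
}

/* In-band notifications are only valid together with the other two. */
static bool inband_features_valid(uint64_t features)
{
    if (!has_feature(features, VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS)) {
        return true;
    }
    return has_feature(features, VHOST_USER_PROTOCOL_F_BACKEND_REQ) &&
           has_feature(features, VHOST_USER_PROTOCOL_F_REPLY_ACK);
}
```

A back-end that finds ``inband_features_valid()`` false for the mask in a
``VHOST_USER_SET_PROTOCOL_FEATURES`` message would close the connection, as
described in the in-band notifications section.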

The reply payload gives QEMU a deterministic indication of the result
of the command. Today, QEMU is expected to terminate the main vhost-user
loop upon receiving such errors. In future, QEMU could be taught to be more
resilient for selective requests.

For the message types that already solicit a reply from the back-end,
the presence of ``VHOST_USER_PROTOCOL_F_REPLY_ACK`` or need_reply bit
being set brings no behavioural change. (See the Communication_
section for details.)

.. _backend_conventions:

Backend program conventions
===========================

vhost-user back-ends can provide various devices & services and may
need to be configured manually depending on the use case. However, it
is a good idea to follow the conventions listed here when
possible. Users, QEMU or libvirt, can then rely on some common
behaviour to avoid heterogeneous configuration and management of the
back-end programs and facilitate interoperability.

Each back-end installed on a host system should come with at least one
JSON file that conforms to the vhost-user.json schema. Each file
informs the management applications about the back-end type, and binary
location. In addition, it defines rules for management apps for
picking the highest priority back-end when multiple match the search
criteria (see ``@VhostUserBackend`` documentation in the schema file).

If the back-end is not capable of enabling a requested feature on the
host (such as 3D acceleration with virgl), or the initialization
failed, the back-end should fail to start early and exit with a status
!= 0. It may also print a message to stderr for further details.

The back-end program must not daemonize itself, but it may be
daemonized by the management layer. It may also have restricted
access to the system.

File descriptors 0, 1 and 2 will exist, and have regular
stdin/stdout/stderr usage (they may have been redirected to /dev/null
by the management layer, or to a log handler).

The back-end program must end (as quickly and cleanly as possible) when
the SIGTERM signal is received. It may eventually be sent SIGKILL by
the management layer after a few seconds.

The following command line options have an expected behaviour. They
are mandatory, unless explicitly said differently:

--socket-path=PATH

  This option specifies the location of the vhost-user Unix domain socket.
  It is incompatible with --fd.

--fd=FDNUM

  When this argument is given, the back-end program is started with the
  vhost-user socket as file descriptor FDNUM. It is incompatible with
  --socket-path.

--print-capabilities

  Output to stdout the back-end capabilities in JSON format, and then
  exit successfully. Other options and arguments should be ignored, and
  the back-end program should not perform its normal function. The
  capabilities can be reported dynamically depending on the host
  capabilities.

The JSON output is described in the ``vhost-user.json`` schema, by
``@VHostUserBackendCapabilities``. Example:

.. code:: json

  {
    "type": "foo",
    "features": [
      "feature-a",
      "feature-b"
    ]
  }

vhost-user-input
----------------

Command line options:

--evdev-path=PATH

  Specify the Linux input device.

  (optional)

--no-grab

  Do not request exclusive access to the input device.

  (optional)

vhost-user-gpu
--------------

Command line options:

--render-node=PATH

  Specify the GPU DRM render node.

  (optional)

--virgl

  Enable virgl rendering support.

  (optional)

vhost-user-blk
--------------

Command line options:

--blk-file=PATH

  Specify block device or file path.

  (optional)

--read-only

  Enable read-only.

  (optional)
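
The common options listed under the back-end program conventions
(``--socket-path``, ``--fd`` and ``--print-capabilities``) could be parsed
as in the following sketch, which uses ``getopt_long()``. The struct and
function names are hypothetical. Treating "neither ``--socket-path`` nor
``--fd``" as an error is an assumption of this example; the conventions
only state that the two are incompatible.

```c
#include <getopt.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

struct backend_opts {
    const char *socket_path;  /* NULL when --socket-path was not given */
    int fd;                   /* -1 when --fd was not given */
    bool print_capabilities;
};

/* Returns 0 on success, -1 on an invalid option or combination. */
static int parse_backend_opts(int argc, char **argv, struct backend_opts *o)
{
    static const struct option longopts[] = {
        { "socket-path",        required_argument, NULL, 's' },
        { "fd",                 required_argument, NULL, 'f' },
        { "print-capabilities", no_argument,       NULL, 'c' },
        { NULL, 0, NULL, 0 }
    };
    bool bad = false;
    int c;

    o->socket_path = NULL;
    o->fd = -1;
    o->print_capabilities = false;
    optind = 1;  /* restart scanning so the function can be called again */
    opterr = 0;  /* keep getopt_long() from printing its own messages */

    while ((c = getopt_long(argc, argv, "", longopts, NULL)) != -1) {
        char *end;
        long v;

        switch (c) {
        case 's':
            o->socket_path = optarg;
            break;
        case 'f':
            v = strtol(optarg, &end, 10);
            if (end == optarg || *end != '\0' || v < 0) {
                bad = true;
            } else {
                o->fd = (int)v;
            }
            break;
        case 'c':
            o->print_capabilities = true;
            break;
        default:
            bad = true;
            break;
        }
    }

    if (o->print_capabilities) {
        return 0;  /* other options and arguments are ignored */
    }
    if (bad || (o->socket_path != NULL) == (o->fd >= 0)) {
        return -1; /* exactly one of --socket-path and --fd is expected */
    }
    return 0;
}
```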