.. _vhost_user_proto:

===================
Vhost-user Protocol
===================

..
  Copyright 2014 Virtual Open Systems Sarl.
  Copyright 2019 Intel Corporation
  Licence: This work is licensed under the terms of the GNU GPL,
           version 2 or later. See the COPYING file in the top-level
           directory.

.. contents:: Table of Contents

Introduction
============

This protocol aims to complement the ``ioctl`` interface used to
control the vhost implementation in the Linux kernel. It implements
the control plane needed to establish virtqueue sharing with a user
space process on the same host. It uses communication over a Unix
domain socket to share file descriptors in the ancillary data of the
message.

The protocol defines two sides of the communication, *front-end* and
*back-end*. The *front-end* is the application that shares its virtqueues, in
our case QEMU. The *back-end* is the consumer of the virtqueues.

In the current implementation QEMU is the *front-end*, and the *back-end*
is the external process consuming the virtio queues, for example a
software Ethernet switch running in user space, such as Snabbswitch,
or a block device back-end processing reads and writes to a virtual
disk. In order to facilitate interoperability between various back-end
implementations, it is recommended to follow the :ref:`Backend program
conventions <backend_conventions>`.

The *front-end* and *back-end* can be either a client (i.e. connecting) or
server (listening) in the socket communication.

Support for platforms other than Linux
--------------------------------------

While vhost-user was initially developed targeting Linux, nowadays it
is supported on any platform that provides the following features:

- A way of requesting shared memory represented by a file descriptor
  so it can be passed over a UNIX domain socket and then mapped by the
  other process.

- AF_UNIX sockets with SCM_RIGHTS, so QEMU and the other process can
  exchange messages through them, including ancillary data when needed.

- Either eventfd or pipe/pipe2. On platforms where eventfd is not
  available, QEMU will automatically fall back to pipe2 or, as a last
  resort, pipe. Each file descriptor will be used for receiving or
  sending events by reading or writing (respectively) an 8-byte value
  from or to the corresponding file descriptor. The 8-byte value itself
  has no meaning and should not be interpreted.
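
A minimal sketch of this event mechanism, assuming a POSIX platform and
using ``pipe`` (the last-resort fallback named above). The helper names
``event_notify`` and ``event_wait`` are hypothetical and are not taken
from QEMU:

.. code:: c

  #include <assert.h>
  #include <stdint.h>
  #include <unistd.h>

  /* Signal an event by writing an 8-byte value. The value carries no
   * meaning; 1 is an arbitrary non-zero choice. */
  static int event_notify(int wfd)
  {
      uint64_t value = 1;

      assert(wfd >= 0);
      return write(wfd, &value, sizeof(value)) == (ssize_t)sizeof(value)
             ? 0 : -1;
  }

  /* Consume one event by reading 8 bytes. The value read is ignored. */
  static int event_wait(int rfd)
  {
      uint64_t value;

      assert(rfd >= 0);
      return read(rfd, &value, sizeof(value)) == (ssize_t)sizeof(value)
             ? 0 : -1;
  }

With a pipe, ``wfd`` and ``rfd`` are the two ends of the pipe. With
eventfd, both roles would be served by the same descriptor.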

Message Specification
=====================

.. Note:: All numbers are in the machine native byte order.

A vhost-user message consists of 3 header fields and a payload.

+---------+-------+------+---------+
| request | flags | size | payload |
+---------+-------+------+---------+

Header
------

:request: 32-bit type of the request

:flags: 32-bit bit field

- Lower 2 bits are the version (currently 0x01)
- Bit 2 is the reply flag - needs to be sent on each reply from the back-end
- Bit 3 is the need_reply flag - see :ref:`REPLY_ACK <reply_ack>` for
  details.

:size: 32-bit size of the payload
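
The flag layout above can be sketched with a few masks. This is an
illustration only: the ``HDR_*`` macros, ``struct msg_header`` and the
helper functions are hypothetical names chosen for this example, not
identifiers from QEMU:

.. code:: c

  #include <assert.h>
  #include <stdbool.h>
  #include <stdint.h>

  /* Flag layout taken from the header description above. */
  #define HDR_VERSION_MASK 0x3u       /* lower 2 bits: version */
  #define HDR_VERSION      0x1u       /* current version */
  #define HDR_REPLY        (1u << 2)  /* set on each back-end reply */
  #define HDR_NEED_REPLY   (1u << 3)  /* see REPLY_ACK */

  /* Three 32-bit fields precede the payload. */
  struct msg_header {
      uint32_t request;
      uint32_t flags;
      uint32_t size;
  };

  static uint32_t hdr_version(uint32_t flags)
  {
      return flags & HDR_VERSION_MASK;
  }

  static bool hdr_is_reply(uint32_t flags)
  {
      return (flags & HDR_REPLY) != 0;
  }

  static bool hdr_needs_reply(uint32_t flags)
  {
      return (flags & HDR_NEED_REPLY) != 0;
  }

  /* Build the flags word of a reply: version plus the reply flag. */
  static uint32_t hdr_make_reply_flags(uint32_t request_flags)
  {
      assert(hdr_version(request_flags) == HDR_VERSION);
      return HDR_VERSION | HDR_REPLY;
  }

For example, a request with flags ``0x9`` carries version 1 with
need_reply set, and the matching reply would carry flags ``0x5``.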

Payload
-------

Depending on the request type, **payload** can be:

A single 64-bit integer
^^^^^^^^^^^^^^^^^^^^^^^

+-----+
| u64 |
+-----+

:u64: a 64-bit unsigned integer

A vring state description
^^^^^^^^^^^^^^^^^^^^^^^^^

+-------+-----+
| index | num |
+-------+-----+

:index: a 32-bit index

:num: a 32-bit number

A vring address description
^^^^^^^^^^^^^^^^^^^^^^^^^^^

+-------+-------+------------+------+-----------+-----+
| index | flags | descriptor | used | available | log |
+-------+-------+------------+------+-----------+-----+

:index: a 32-bit vring index

:flags: 32-bit vring flags

:descriptor: a 64-bit ring address of the vring descriptor table

:used: a 64-bit ring address of the vring used ring

:available: a 64-bit ring address of the vring available ring

:log: a 64-bit guest address for logging

Note that a ring address is an IOVA if ``VIRTIO_F_IOMMU_PLATFORM`` has
been negotiated. Otherwise it is a user address.

Memory region description
^^^^^^^^^^^^^^^^^^^^^^^^^

+---------------+------+--------------+-------------+
| guest address | size | user address | mmap offset |
+---------------+------+--------------+-------------+

:guest address: a 64-bit guest address of the region

:size: a 64-bit size

:user address: a 64-bit user address

:mmap offset: a 64-bit offset where the region starts in the mapped memory

When the ``VHOST_USER_PROTOCOL_F_XEN_MMAP`` protocol feature has been
successfully negotiated, the memory region description contains two extra
fields at the end.

+---------------+------+--------------+-------------+----------------+-------+
| guest address | size | user address | mmap offset | xen mmap flags | domid |
+---------------+------+--------------+-------------+----------------+-------+

:xen mmap flags: 32-bit bit field

- Bit 0 is set for Xen foreign memory mapping.
- Bit 1 is set for Xen grant memory mapping.
- Bit 8 is set if the memory region cannot be mapped in advance, and memory
  areas within this region must be mapped / unmapped only when required by the
  back-end. The back-end shouldn't try to map the entire region at once, as the
  front-end may not allow it. The back-end should rather map only the required
  amount of memory at once and unmap it after it is used.

:domid: a 32-bit Xen hypervisor specific domain id.

Single memory region description
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

+---------+--------+
| padding | region |
+---------+--------+

:padding: 64-bit

A region is represented by a Memory region description.

Multiple Memory regions description
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

+-------------+---------+---------+-----+---------+
| num regions | padding | region0 | ... | region7 |
+-------------+---------+---------+-----+---------+

:num regions: a 32-bit number of regions

:padding: 32-bit

A region is represented by a Memory region description.

Log description
^^^^^^^^^^^^^^^

+----------+------------+
| log size | log offset |
+----------+------------+

:log size: size of area used for logging

:log offset: offset from start of supplied file descriptor where
             logging starts (i.e. where guest address 0 would be
             logged)

An IOTLB message
^^^^^^^^^^^^^^^^

+------+------+--------------+-------------------+------+
| iova | size | user address | permissions flags | type |
+------+------+--------------+-------------------+------+

:iova: a 64-bit I/O virtual address programmed by the guest

:size: a 64-bit size

:user address: a 64-bit user address

:permissions flags: an 8-bit value:
  - 0: No access
  - 1: Read access
  - 2: Write access
  - 3: Read/Write access

:type: an 8-bit IOTLB message type:
  - 1: IOTLB miss
  - 2: IOTLB update
  - 3: IOTLB invalidate
  - 4: IOTLB access fail

Virtio device config space
^^^^^^^^^^^^^^^^^^^^^^^^^^

+--------+------+-------+---------+
| offset | size | flags | payload |
+--------+------+-------+---------+

:offset: a 32-bit offset into the virtio device's configuration space

:size: a 32-bit configuration space access size in bytes

:flags: a 32-bit value:
  - 0: Vhost front-end messages used for writable fields
  - 1: Vhost front-end messages used for live migration

:payload: an array of ``size`` bytes holding the contents of the virtio
          device's configuration space

Vring area description
^^^^^^^^^^^^^^^^^^^^^^

+-----+------+--------+
| u64 | size | offset |
+-----+------+--------+

:u64: a 64-bit integer containing the vring index and flags

:size: a 64-bit size of this area

:offset: a 64-bit offset of this area from the start of the
         supplied file descriptor

Inflight description
^^^^^^^^^^^^^^^^^^^^

+-----------+-------------+------------+------------+
| mmap size | mmap offset | num queues | queue size |
+-----------+-------------+------------+------------+

:mmap size: a 64-bit size of area to track inflight I/O

:mmap offset: a 64-bit offset of this area from the start
              of the supplied file descriptor

:num queues: a 16-bit number of virtqueues

:queue size: a 16-bit size of virtqueues

C structure
-----------

In QEMU the vhost-user message is implemented with the following struct:

.. code:: c

  typedef struct VhostUserMsg {
      VhostUserRequest request;
      uint32_t flags;
      uint32_t size;
      union {
          uint64_t u64;
          struct vhost_vring_state state;
          struct vhost_vring_addr addr;
          VhostUserMemory memory;
          VhostUserLog log;
          struct vhost_iotlb_msg iotlb;
          VhostUserConfig config;
          VhostUserVringArea area;
          VhostUserInflight inflight;
      };
  } QEMU_PACKED VhostUserMsg;

Communication
=============

The protocol for vhost-user is based on the existing implementation of
vhost for the Linux Kernel. Most messages that can be sent via the
Unix domain socket implementing vhost-user have an equivalent ioctl to
the kernel implementation.

The communication consists of the *front-end* sending message requests and
the *back-end* sending message replies. Most of the requests don't require
replies. Here is a list of the ones that do:

* ``VHOST_USER_GET_FEATURES``
* ``VHOST_USER_GET_PROTOCOL_FEATURES``
* ``VHOST_USER_GET_VRING_BASE``
* ``VHOST_USER_SET_LOG_BASE`` (if ``VHOST_USER_PROTOCOL_F_LOG_SHMFD``)
* ``VHOST_USER_GET_INFLIGHT_FD`` (if ``VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD``)

.. seealso::

   :ref:`REPLY_ACK <reply_ack>`
       The section on ``REPLY_ACK`` protocol extension.

There are several messages that the front-end sends with file descriptors passed
in the ancillary data:

* ``VHOST_USER_ADD_MEM_REG``
* ``VHOST_USER_SET_MEM_TABLE``
* ``VHOST_USER_SET_LOG_BASE`` (if ``VHOST_USER_PROTOCOL_F_LOG_SHMFD``)
* ``VHOST_USER_SET_LOG_FD``
* ``VHOST_USER_SET_VRING_KICK``
* ``VHOST_USER_SET_VRING_CALL``
* ``VHOST_USER_SET_VRING_ERR``
* ``VHOST_USER_SET_BACKEND_REQ_FD`` (previous name ``VHOST_USER_SET_SLAVE_REQ_FD``)
* ``VHOST_USER_SET_INFLIGHT_FD`` (if ``VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD``)

If the *front-end* is unable to send the full message or receives a wrong
reply, it will close the connection. An optional reconnection mechanism
can be implemented.

If the *back-end* detects an error such as incompatible features, it may also
close the connection. This should only happen in exceptional circumstances.

Any protocol extensions are gated by protocol feature bits, which
allows full backwards compatibility on both front-end and back-end.  As
older back-ends don't support negotiating protocol features, a feature
bit was dedicated for this purpose::

  #define VHOST_USER_F_PROTOCOL_FEATURES 30

Note that ``VHOST_USER_F_PROTOCOL_FEATURES`` is the UNUSED (30) feature
bit defined in `VIRTIO 1.1 6.3 Legacy Interface: Reserved Feature Bits
<https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.html#x1-4130003>`_.
VIRTIO devices do not advertise this feature bit and therefore VIRTIO
drivers cannot negotiate it.

This reserved feature bit was reused by the vhost-user protocol to add
vhost-user protocol feature negotiation in a backwards compatible
fashion. Old vhost-user front-end and back-end implementations continue to
work even though they are not aware of vhost-user protocol feature
negotiation.
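
As a small sketch of how a front-end might test for this bit in the
features reported by the back-end before attempting protocol feature
negotiation. The helper names ``feature_bit_set`` and
``can_negotiate_protocol_features`` are hypothetical:

.. code:: c

  #include <assert.h>
  #include <stdbool.h>
  #include <stdint.h>

  #define VHOST_USER_F_PROTOCOL_FEATURES 30

  /* Test a single bit in a 64-bit feature word. */
  static bool feature_bit_set(uint64_t features, unsigned int bit)
  {
      assert(bit < 64);
      return ((features >> bit) & 1) != 0;
  }

  /* Protocol feature negotiation is only possible when the back-end
   * offered the dedicated feature bit. */
  static bool can_negotiate_protocol_features(uint64_t backend_features)
  {
      return feature_bit_set(backend_features,
                             VHOST_USER_F_PROTOCOL_FEATURES);
  }

An older back-end that never sets bit 30 simply yields ``false`` here,
and the front-end can carry on without any protocol extensions.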

Ring states
-----------

Rings can be in one of three states:

* stopped: the back-end must not process the ring at all.

* started but disabled: the back-end must process the ring without
  causing any side effects.  For example, for a networking device,
  in the disabled state the back-end must not supply any new RX packets,
  but must process and discard any TX packets.

* started and enabled.

Each ring is initialized in a stopped state.  The back-end must start
the ring upon receiving a kick (that is, detecting that the file
descriptor is readable) on the descriptor specified by
``VHOST_USER_SET_VRING_KICK`` or receiving the in-band message
``VHOST_USER_VRING_KICK`` if negotiated, and stop the ring upon
receiving ``VHOST_USER_GET_VRING_BASE``.

Rings can be enabled or disabled by ``VHOST_USER_SET_VRING_ENABLE``.

If ``VHOST_USER_F_PROTOCOL_FEATURES`` has not been negotiated, the
ring starts directly in the enabled state.

If ``VHOST_USER_F_PROTOCOL_FEATURES`` has been negotiated, the ring is
initialized in a disabled state and is enabled by
``VHOST_USER_SET_VRING_ENABLE`` with parameter 1.

While processing the rings (whether they are enabled or not), the back-end
must support changing some configuration aspects on the fly.
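
The state rules above can be sketched as a tiny state machine. This is
one possible model, not QEMU's implementation: ``struct ring_state``
and the ``ring_*`` functions are hypothetical names, and the three
states are represented as two independent booleans:

.. code:: c

  #include <assert.h>
  #include <stdbool.h>
  #include <stddef.h>

  struct ring_state {
      bool started;   /* false: stopped, the ring must not be processed */
      bool enabled;   /* false: process the ring without side effects */
  };

  /* Every ring begins stopped. It begins enabled only when
   * VHOST_USER_F_PROTOCOL_FEATURES has not been negotiated. */
  static void ring_init(struct ring_state *r, bool protocol_features)
  {
      assert(r != NULL);
      r->started = false;
      r->enabled = !protocol_features;
  }

  /* A kick (eventfd readable or in-band message) starts the ring. */
  static void ring_on_kick(struct ring_state *r)
  {
      r->started = true;
  }

  /* VHOST_USER_GET_VRING_BASE stops the ring. */
  static void ring_on_get_vring_base(struct ring_state *r)
  {
      r->started = false;
  }

  /* VHOST_USER_SET_VRING_ENABLE with parameter 1 or 0. */
  static void ring_on_set_vring_enable(struct ring_state *r, bool enable)
  {
      r->enabled = enable;
  }

  static bool ring_may_process(const struct ring_state *r)
  {
      return r->started;
  }

  static bool ring_may_cause_side_effects(const struct ring_state *r)
  {
      return r->started && r->enabled;
  }

A networking back-end in this model would discard TX packets while
``ring_may_process`` is true but ``ring_may_cause_side_effects`` is
false.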

Multiple queue support
----------------------

Many devices have a fixed number of virtqueues.  In this case the front-end
already knows the number of available virtqueues without communicating with the
back-end.

Some devices do not have a fixed number of virtqueues.  Instead the maximum
number of virtqueues is chosen by the back-end.  The number can depend on host
resource availability or back-end implementation details.  Such devices are called
multiple queue devices.

Multiple queue support allows the back-end to advertise the maximum number of
queues.  This is treated as a protocol extension, hence the back-end has to
implement protocol features first. The multiple queues feature is supported
only when the protocol feature ``VHOST_USER_PROTOCOL_F_MQ`` (bit 0) is set.

The maximum number of queues the back-end supports can be queried with the
message ``VHOST_USER_GET_QUEUE_NUM``. The front-end should stop when the
number of requested queues is bigger than that.

As all queues share one connection, the front-end uses a unique index for each
queue in the sent message to identify a specified queue.

The front-end enables queues by sending the message ``VHOST_USER_SET_VRING_ENABLE``.
vhost-user-net has historically automatically enabled the first queue pair.

Back-ends should always implement the ``VHOST_USER_PROTOCOL_F_MQ`` protocol
feature, even for devices with a fixed number of virtqueues, since it is simple
to implement and offers a degree of introspection.

Front-ends must not rely on the ``VHOST_USER_PROTOCOL_F_MQ`` protocol feature for
devices with a fixed number of virtqueues.  Only true multiqueue devices
require this protocol feature.

Migration
---------

During live migration, the front-end may need to track the modifications
the back-end makes to the memory mapped regions. The back-end should mark
the dirty pages in a log. Once it complies with this logging, it may
declare the ``VHOST_F_LOG_ALL`` vhost feature.

To start/stop logging of data/used ring writes, the front-end may send
messages ``VHOST_USER_SET_FEATURES`` with ``VHOST_F_LOG_ALL`` and
``VHOST_USER_SET_VRING_ADDR`` with ``VHOST_VRING_F_LOG`` in the ring's
flags set to 1/0, respectively.

All modifications to memory pointed to by the vring "descriptor" should
be marked. Modifications to the "used" vring should be marked if
``VHOST_VRING_F_LOG`` is part of the ring's flags.

Dirty pages are of size::

  #define VHOST_LOG_PAGE 0x1000

The log memory fd is provided in the ancillary data of the
``VHOST_USER_SET_LOG_BASE`` message when the back-end has the
``VHOST_USER_PROTOCOL_F_LOG_SHMFD`` protocol feature.

The size of the log is supplied as part of ``VhostUserMsg`` and
should be large enough to cover all known guest addresses. The log starts
at the supplied offset in the supplied file descriptor.  The log
covers from address 0 to the maximum of guest regions. In pseudo-code,
to mark the page at ``addr`` as dirty::

  page = addr / VHOST_LOG_PAGE
  log[page / 8] |= 1 << (page % 8)

Where ``addr`` is the guest physical address.

Use atomic operations, as the log may be concurrently manipulated.
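
The pseudo-code above could be written with C11 atomics roughly as
follows. This is a sketch under assumptions: the function names are
hypothetical, the log is treated as an array of atomic bytes, and a
real back-end would obtain ``log`` by mapping the supplied file
descriptor at the supplied offset:

.. code:: c

  #include <assert.h>
  #include <stdatomic.h>
  #include <stddef.h>
  #include <stdint.h>

  #define VHOST_LOG_PAGE 0x1000

  /* Atomically set the dirty bit of the page containing guest
   * physical address addr. */
  static void log_mark_dirty(_Atomic uint8_t *log, size_t log_size,
                             uint64_t addr)
  {
      uint64_t page = addr / VHOST_LOG_PAGE;

      assert(page / 8 < log_size);
      atomic_fetch_or(&log[page / 8], (uint8_t)(1u << (page % 8)));
  }

  /* Mark every page touched by a write of len bytes starting at addr. */
  static void log_mark_range(_Atomic uint8_t *log, size_t log_size,
                             uint64_t addr, uint64_t len)
  {
      uint64_t first, last, page;

      if (len == 0) {
          return;
      }
      first = addr / VHOST_LOG_PAGE;
      last = (addr + len - 1) / VHOST_LOG_PAGE;
      for (page = first; page <= last; page++) {
          log_mark_dirty(log, log_size, page * VHOST_LOG_PAGE);
      }
  }

A two-byte write at ``0x8fff`` straddles pages 8 and 9, so
``log_mark_range`` sets bits 0 and 1 of ``log[1]``.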

Note that when logging modifications to the used ring (when
``VHOST_VRING_F_LOG`` is set for this ring), ``log_guest_addr`` should
be used to calculate the log offset: the write to the first byte of the
used ring is logged at this offset from log start. Also note that this
value might be outside the legal guest physical address range
(i.e. does not have to be covered by the ``VhostUserMemory`` table), but
the bit offset of the last byte of the ring must fall within the size
supplied by ``VhostUserLog``.

``VHOST_USER_SET_LOG_FD`` is an optional message with an eventfd in
ancillary data; it may be used to inform the front-end that the log has
been modified.

Once the source has finished migration, rings will be stopped by the
source. No further updates must be made until the rings are restarted.

In postcopy migration the back-end is started before all the memory has
been received from the source host, and care must be taken to avoid
accessing pages that have yet to be received.  The back-end opens a
'userfault'-fd and registers the memory with it; this fd is then
passed back over to the front-end.  The front-end services requests on the
userfaultfd for pages that are accessed and when the page is available
it performs WAKE ioctls on the userfaultfd to wake the stalled
back-end.  The front-end indicates support for this via the
``VHOST_USER_PROTOCOL_F_PAGEFAULT`` feature.

Memory access
-------------

The front-end sends a list of vhost memory regions to the back-end using the
``VHOST_USER_SET_MEM_TABLE`` message.  Each region has two base
addresses: a guest address and a user address.

Messages contain guest addresses and/or user addresses to reference locations
within the shared memory.  The mapping of these addresses works as follows.

User addresses map to the vhost memory region containing that user address.

When the ``VIRTIO_F_IOMMU_PLATFORM`` feature has not been negotiated:

* Guest addresses map to the vhost memory region containing that guest
  address.

When the ``VIRTIO_F_IOMMU_PLATFORM`` feature has been negotiated:

* Guest addresses are also called I/O virtual addresses (IOVAs).  They are
  translated to user addresses via the IOTLB.

* The vhost memory region guest address is not used.
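
The region lookup described above can be sketched as a linear search.
The names ``struct mem_region``, ``find_region_by_guest_addr`` and
``find_region_by_user_addr`` are hypothetical; the struct only mirrors
the four fields of the memory region description:

.. code:: c

  #include <assert.h>
  #include <stddef.h>
  #include <stdint.h>

  struct mem_region {
      uint64_t guest_addr;
      uint64_t size;
      uint64_t user_addr;
      uint64_t mmap_offset;
  };

  static int region_contains(uint64_t base, uint64_t size, uint64_t addr)
  {
      return addr >= base && addr - base < size;
  }

  /* Return the region containing guest_addr, or NULL. On success,
   * *offset is the distance from the start of that region. */
  static const struct mem_region *
  find_region_by_guest_addr(const struct mem_region *regions, size_t n,
                            uint64_t guest_addr, uint64_t *offset)
  {
      size_t i;

      assert(offset != NULL);
      for (i = 0; i < n; i++) {
          if (region_contains(regions[i].guest_addr, regions[i].size,
                              guest_addr)) {
              *offset = guest_addr - regions[i].guest_addr;
              return &regions[i];
          }
      }
      return NULL;
  }

  /* Same lookup keyed on the user address of each region. */
  static const struct mem_region *
  find_region_by_user_addr(const struct mem_region *regions, size_t n,
                           uint64_t user_addr, uint64_t *offset)
  {
      size_t i;

      assert(offset != NULL);
      for (i = 0; i < n; i++) {
          if (region_contains(regions[i].user_addr, regions[i].size,
                              user_addr)) {
              *offset = user_addr - regions[i].user_addr;
              return &regions[i];
          }
      }
      return NULL;
  }

A back-end that has mapped the region's file descriptor would then
typically add ``mmap_offset`` and the returned offset to its own
mapping base to obtain a usable local pointer.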

IOMMU support
-------------

When the ``VIRTIO_F_IOMMU_PLATFORM`` feature has been negotiated, the
front-end sends IOTLB entry updates and invalidations by sending
``VHOST_USER_IOTLB_MSG`` requests to the back-end with a ``struct
vhost_iotlb_msg`` as payload. For update events, the ``iotlb`` payload
has to be filled with the update message type (2), the I/O virtual
address, the size, the user virtual address, and the permissions
flags. Addresses and size must be within vhost memory regions set via
the ``VHOST_USER_SET_MEM_TABLE`` request. For invalidation events, the
``iotlb`` payload has to be filled with the invalidation message type
(3), the I/O virtual address and the size. On success, the back-end is
expected to reply with a zero payload, non-zero otherwise.

The back-end relies on the back-end communication channel (see :ref:`Back-end
communication <backend_communication>` section below) to send IOTLB miss
and access failure events, by sending ``VHOST_USER_BACKEND_IOTLB_MSG``
requests to the front-end with a ``struct vhost_iotlb_msg`` as
payload. For miss events, the ``iotlb`` payload has to be filled with the
miss message type (1), the I/O virtual address and the permissions
flags. For access failure events, the ``iotlb`` payload has to be filled
with the access failure message type (4), the I/O virtual address and
the permissions flags.  For synchronization purposes, the back-end may
rely on the reply-ack feature, so the front-end may send a reply when
the operation is completed if the reply-ack feature is negotiated and
the back-end requests a reply. For miss events, a completed operation means
that either the front-end sent an update message containing the IOTLB entry
covering the requested address and permission, or the front-end sent nothing
because the IOTLB miss message is invalid (invalid IOVA or permission).

The front-end isn't expected to take the initiative to send IOTLB update
messages, as the back-end sends IOTLB miss messages for the guest virtual
memory areas it needs to access.
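
The fields each IOTLB event must carry can be sketched with three
small constructors. ``struct iotlb_msg`` is a hypothetical stand-in
that follows the field order of the IOTLB message payload, and the
``iotlb_*`` helpers and enum names are assumptions made for this
example:

.. code:: c

  #include <assert.h>
  #include <stdint.h>

  enum { IOTLB_PERM_NONE = 0, IOTLB_PERM_RO = 1,
         IOTLB_PERM_WO = 2, IOTLB_PERM_RW = 3 };
  enum { IOTLB_MISS = 1, IOTLB_UPDATE = 2,
         IOTLB_INVALIDATE = 3, IOTLB_ACCESS_FAIL = 4 };

  struct iotlb_msg {
      uint64_t iova;
      uint64_t size;
      uint64_t uaddr;
      uint8_t perm;
      uint8_t type;
  };

  /* Front-end to back-end: install a translation. */
  static struct iotlb_msg iotlb_update(uint64_t iova, uint64_t size,
                                       uint64_t uaddr, uint8_t perm)
  {
      struct iotlb_msg m = {0};

      assert(perm <= IOTLB_PERM_RW);
      m.iova = iova;
      m.size = size;
      m.uaddr = uaddr;
      m.perm = perm;
      m.type = IOTLB_UPDATE;
      return m;
  }

  /* Front-end to back-end: drop a translation. */
  static struct iotlb_msg iotlb_invalidate(uint64_t iova, uint64_t size)
  {
      struct iotlb_msg m = {0};

      m.iova = iova;
      m.size = size;
      m.type = IOTLB_INVALIDATE;
      return m;
  }

  /* Back-end to front-end: request a missing translation. */
  static struct iotlb_msg iotlb_miss(uint64_t iova, uint8_t perm)
  {
      struct iotlb_msg m = {0};

      assert(perm <= IOTLB_PERM_RW);
      m.iova = iova;
      m.perm = perm;
      m.type = IOTLB_MISS;
      return m;
  }

An access failure message would be built like ``iotlb_miss`` with the
type set to ``IOTLB_ACCESS_FAIL``.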

.. _backend_communication:

Back-end communication
----------------------

An optional communication channel is provided if the back-end declares
the ``VHOST_USER_PROTOCOL_F_BACKEND_REQ`` protocol feature, to allow the
back-end to make requests to the front-end.

The fd is provided via ``VHOST_USER_SET_BACKEND_REQ_FD`` ancillary data.

A back-end may then send ``VHOST_USER_BACKEND_*`` messages to the front-end
using this fd communication channel.

If the ``VHOST_USER_PROTOCOL_F_BACKEND_SEND_FD`` protocol feature is
negotiated, the back-end can send file descriptors (at most 8 descriptors in
each message) to the front-end via ancillary data using this fd communication
channel.
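
Passing descriptors in ancillary data uses the standard ``SCM_RIGHTS``
control message. The following is a generic POSIX sketch, not code
from QEMU; ``send_with_fds``, ``recv_with_fds``, ``union fd_control``
and ``MAX_FDS`` are hypothetical names, and error handling is reduced
to returning the result of ``sendmsg``/``recvmsg``:

.. code:: c

  #include <assert.h>
  #include <string.h>
  #include <sys/socket.h>
  #include <sys/types.h>
  #include <sys/uio.h>

  #define MAX_FDS 8   /* at most 8 descriptors in each message */

  /* The union keeps the control buffer aligned for struct cmsghdr. */
  union fd_control {
      struct cmsghdr align;
      char buf[CMSG_SPACE(MAX_FDS * sizeof(int))];
  };

  /* Send len bytes plus nfds descriptors as SCM_RIGHTS data. */
  static ssize_t send_with_fds(int sock, const void *data, size_t len,
                               const int *fds, size_t nfds)
  {
      union fd_control control;
      struct iovec iov = { .iov_base = (void *)data, .iov_len = len };
      struct msghdr msg;

      assert(nfds <= MAX_FDS);
      memset(&msg, 0, sizeof(msg));
      memset(&control, 0, sizeof(control));
      msg.msg_iov = &iov;
      msg.msg_iovlen = 1;
      if (nfds > 0) {
          struct cmsghdr *cmsg;

          msg.msg_control = control.buf;
          msg.msg_controllen = CMSG_SPACE(nfds * sizeof(int));
          cmsg = CMSG_FIRSTHDR(&msg);
          cmsg->cmsg_level = SOL_SOCKET;
          cmsg->cmsg_type = SCM_RIGHTS;
          cmsg->cmsg_len = CMSG_LEN(nfds * sizeof(int));
          memcpy(CMSG_DATA(cmsg), fds, nfds * sizeof(int));
      }
      return sendmsg(sock, &msg, 0);
  }

  /* Receive data and up to MAX_FDS descriptors into fds, which must
   * have room for MAX_FDS entries. *nfds is the count received. */
  static ssize_t recv_with_fds(int sock, void *data, size_t len,
                               int *fds, size_t *nfds)
  {
      union fd_control control;
      struct iovec iov = { .iov_base = data, .iov_len = len };
      struct msghdr msg;
      struct cmsghdr *c;
      ssize_t n;

      memset(&msg, 0, sizeof(msg));
      msg.msg_iov = &iov;
      msg.msg_iovlen = 1;
      msg.msg_control = control.buf;
      msg.msg_controllen = sizeof(control.buf);
      *nfds = 0;
      n = recvmsg(sock, &msg, 0);
      if (n < 0) {
          return n;
      }
      for (c = CMSG_FIRSTHDR(&msg); c != NULL; c = CMSG_NXTHDR(&msg, c)) {
          if (c->cmsg_level == SOL_SOCKET && c->cmsg_type == SCM_RIGHTS) {
              *nfds = (c->cmsg_len - CMSG_LEN(0)) / sizeof(int);
              memcpy(fds, CMSG_DATA(c), *nfds * sizeof(int));
          }
      }
      return n;
  }

The receiver owns the descriptors it gets and is responsible for
closing them when they are no longer needed.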

Inflight I/O tracking
---------------------

To support reconnecting after restart or crash, the back-end may need to
resubmit inflight I/Os. If the virtqueue is processed in order, this is
easily achieved by getting the inflight descriptors from the
descriptor table (split virtqueue) or descriptor ring (packed
virtqueue). However, this does not work when descriptors are processed
out-of-order, because some entries which store the information of
inflight descriptors in the available ring (split virtqueue) or descriptor
ring (packed virtqueue) might be overwritten by new entries. To solve
this problem, the back-end needs to allocate an extra buffer to store this
information about inflight descriptors and share it with the front-end for
persistence. ``VHOST_USER_GET_INFLIGHT_FD`` and
``VHOST_USER_SET_INFLIGHT_FD`` are used to transfer this buffer
between front-end and back-end. The format of this buffer is described
below:

+---------------+---------------+-----+---------------+
| queue0 region | queue1 region | ... | queueN region |
+---------------+---------------+-----+---------------+

N is the number of available virtqueues. The back-end can get it from the
``num queues`` field of ``VhostUserInflight``.

For split virtqueue, queue region can be implemented as:

.. code:: c

  typedef struct DescStateSplit {
      /* Indicate whether this descriptor is inflight or not.
       * Only available for head-descriptor. */
      uint8_t inflight;

      /* Padding */
      uint8_t padding[5];

      /* Maintain a list for the last batch of used descriptors.
       * Only available when batching is used for submitting */
      uint16_t next;

      /* Used to preserve the order of fetching available descriptors.
       * Only available for head-descriptor. */
      uint64_t counter;
  } DescStateSplit;

  typedef struct QueueRegionSplit {
      /* The feature flags of this region. Now it's initialized to 0. */
      uint64_t features;

      /* The version of this region. It's 1 currently.
       * Zero value indicates an uninitialized buffer */
      uint16_t version;

      /* The size of DescStateSplit array. It's equal to the virtqueue size.
       * The back-end could get it from queue size field of VhostUserInflight. */
      uint16_t desc_num;

      /* The head of list that track the last batch of used descriptors. */
      uint16_t last_batch_head;

      /* Store the idx value of used ring */
      uint16_t used_idx;

      /* Used to track the state of each descriptor in descriptor table */
      DescStateSplit desc[];
  } QueueRegionSplit;

To track inflight I/O, the queue region should be processed as follows:

When receiving available buffers from the driver:

#. Get the next available head-descriptor index from available ring, ``i``

#. Set ``desc[i].counter`` to the value of global counter

#. Increase global counter by 1

#. Set ``desc[i].inflight`` to 1

When supplying used buffers to the driver:

1. Get corresponding used head-descriptor index, ``i``

2. Set ``desc[i].next`` to ``last_batch_head``

3. Set ``last_batch_head`` to ``i``

#. Steps 1,2,3 may be performed repeatedly if batching is possible

#. Increase the ``idx`` value of used ring by the size of the batch

#. Set the ``inflight`` field of each ``DescStateSplit`` entry in the batch to 0

#. Set ``used_idx`` to the ``idx`` value of used ring

When reconnecting:

#. If the value of ``used_idx`` does not match the ``idx`` value of
   used ring (meaning the ``inflight`` field of ``DescStateSplit`` entries
   in the last batch may be incorrect),

   a. Subtract the value of ``used_idx`` from the ``idx`` value of
      used ring to get last batch size of ``DescStateSplit`` entries

   #. Set the ``inflight`` field of each ``DescStateSplit`` entry to 0 in last batch
      list which starts from ``last_batch_head``

   #. Set ``used_idx`` to the ``idx`` value of used ring

#. Resubmit inflight ``DescStateSplit`` entries in order of their
   counter value
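
The split virtqueue steps above can be sketched as follows. This is an
illustration under assumptions, not the implementation used by any
particular back-end: the ``inflight_*`` function names are
hypothetical, the used ring ``idx`` is passed in by the caller, and the
memory barriers a real back-end needs between steps are omitted:

.. code:: c

  #include <assert.h>
  #include <stdint.h>

  typedef struct DescStateSplit {
      uint8_t inflight;
      uint8_t padding[5];
      uint16_t next;
      uint64_t counter;
  } DescStateSplit;

  typedef struct QueueRegionSplit {
      uint64_t features;
      uint16_t version;
      uint16_t desc_num;
      uint16_t last_batch_head;
      uint16_t used_idx;
      DescStateSplit desc[];
  } QueueRegionSplit;

  /* Receiving an available buffer with head-descriptor index i. */
  static void inflight_get(QueueRegionSplit *q, uint16_t i,
                           uint64_t *global_counter)
  {
      assert(i < q->desc_num);
      q->desc[i].counter = (*global_counter)++;
      q->desc[i].inflight = 1;
  }

  /* Steps 1 to 3 of supplying a used buffer: link i into the list
   * of the last batch. May be repeated for each buffer of a batch. */
  static void inflight_pre_put(QueueRegionSplit *q, uint16_t i)
  {
      assert(i < q->desc_num);
      q->desc[i].next = q->last_batch_head;
      q->last_batch_head = i;
  }

  /* After the used ring idx was increased by the batch size: clear
   * the inflight flag of each batch entry and record the new idx. */
  static void inflight_post_put(QueueRegionSplit *q, uint16_t batch,
                                uint16_t ring_used_idx)
  {
      uint16_t i = q->last_batch_head;
      uint16_t n;

      for (n = 0; n < batch; n++) {
          q->desc[i].inflight = 0;
          i = q->desc[i].next;
      }
      q->used_idx = ring_used_idx;
  }

  /* First reconnect step: finish a batch that was interrupted after
   * the used ring update but before the flags were cleared. */
  static void inflight_recover(QueueRegionSplit *q, uint16_t ring_used_idx)
  {
      if (q->used_idx != ring_used_idx) {
          uint16_t batch = (uint16_t)(ring_used_idx - q->used_idx);

          inflight_post_put(q, batch, ring_used_idx);
      }
  }

After ``inflight_recover``, the entries whose ``inflight`` field is
still 1 are the ones to resubmit, ordered by their ``counter`` value.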

For packed virtqueue, queue region can be implemented as:

.. code:: c

  typedef struct DescStatePacked {
      /* Indicate whether this descriptor is inflight or not.
       * Only available for head-descriptor. */
      uint8_t inflight;

      /* Padding */
      uint8_t padding;

      /* Link to the next free entry */
      uint16_t next;

      /* Link to the last entry of descriptor list.
       * Only available for head-descriptor. */
      uint16_t last;

      /* The length of descriptor list.
       * Only available for head-descriptor. */
      uint16_t num;

      /* Used to preserve the order of fetching available descriptors.
       * Only available for head-descriptor. */
      uint64_t counter;

      /* The buffer id */
      uint16_t id;

      /* The descriptor flags */
      uint16_t flags;

      /* The buffer length */
      uint32_t len;

      /* The buffer address */
      uint64_t addr;
  } DescStatePacked;

  typedef struct QueueRegionPacked {
      /* The feature flags of this region. Now it's initialized to 0. */
      uint64_t features;

      /* The version of this region. It's 1 currently.
       * Zero value indicates an uninitialized buffer */
      uint16_t version;

      /* The size of DescStatePacked array. It's equal to the virtqueue size.
       * The back-end could get it from queue size field of VhostUserInflight. */
      uint16_t desc_num;

      /* The head of free DescStatePacked entry list */
      uint16_t free_head;

      /* The old head of free DescStatePacked entry list */
      uint16_t old_free_head;

      /* The used index of descriptor ring */
      uint16_t used_idx;

      /* The old used index of descriptor ring */
      uint16_t old_used_idx;

      /* Device ring wrap counter */
      uint8_t used_wrap_counter;

      /* The old device ring wrap counter */
      uint8_t old_used_wrap_counter;

      /* Padding */
      uint8_t padding[7];

      /* Used to track the state of each descriptor fetched from descriptor ring */
      DescStatePacked desc[];
  } QueueRegionPacked;

To track inflight I/O, the queue region should be processed as follows:

When receiving available buffers from the driver:

#. Get the next available descriptor entry from descriptor ring, ``d``

#. If ``d`` is head descriptor,

   a. Set ``desc[old_free_head].num`` to 0

   #. Set ``desc[old_free_head].counter`` to the value of global counter

   #. Increase global counter by 1

   #. Set ``desc[old_free_head].inflight`` to 1

#. If ``d`` is last descriptor, set ``desc[old_free_head].last`` to
   ``free_head``

#. Increase ``desc[old_free_head].num`` by 1

#. Set ``desc[free_head].addr``, ``desc[free_head].len``,
   ``desc[free_head].flags``, ``desc[free_head].id`` to ``d.addr``,
   ``d.len``, ``d.flags``, ``d.id``

#. Set ``free_head`` to ``desc[free_head].next``

#. If ``d`` is last descriptor, set ``old_free_head`` to ``free_head``

When supplying used buffers to the driver:

1. Get corresponding used head-descriptor entry from descriptor ring,
   ``d``

2. Get corresponding ``DescStatePacked`` entry, ``e``

3. Set ``desc[e.last].next`` to ``free_head``

4. Set ``free_head`` to the index of ``e``

#. Steps 1,2,3,4 may be performed repeatedly if batching is possible

#. Increase ``used_idx`` by the size of the batch and update
   ``used_wrap_counter`` if needed

#. Update ``d.flags``

#. Set the ``inflight`` field of each head ``DescStatePacked`` entry
   in the batch to 0

#. Set ``old_free_head``,  ``old_used_idx``, ``old_used_wrap_counter``
   to ``free_head``, ``used_idx``, ``used_wrap_counter``

When reconnecting:

#. If ``used_idx`` does not match ``old_used_idx`` (meaning the
   ``inflight`` field of ``DescStatePacked`` entries in the last batch may
   be incorrect),

   a. Get the next descriptor ring entry through ``old_used_idx``, ``d``

   #. Use ``old_used_wrap_counter`` to calculate the available flags

   #. If ``d.flags`` is not equal to the calculated flags value (meaning
      the back-end has submitted the buffer to the guest driver before the
      crash, so it has to commit the in-progress update), set ``old_free_head``,
      ``old_used_idx``, ``old_used_wrap_counter`` to ``free_head``,
      ``used_idx``, ``used_wrap_counter``

#. Set ``free_head``, ``used_idx``, ``used_wrap_counter`` to
   ``old_free_head``, ``old_used_idx``, ``old_used_wrap_counter``
   (roll back any in-progress update)

#. Set the ``inflight`` field of each ``DescStatePacked`` entry in
   free list to 0

#. Resubmit inflight ``DescStatePacked`` entries in order of their
   counter value
841
842In-band notifications
843---------------------
844
845In some limited situations (e.g. for simulation) it is desirable to
846have the kick, call and error (if used) signals done via in-band
847messages instead of asynchronous eventfd notifications. This can be
848done by negotiating the ``VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS``
849protocol feature.
850
Note that because too many messages on the sockets can cause the
sending application(s) to block, it is not advised to use this feature
unless absolutely necessary. It is also considered an error to
negotiate this feature without also negotiating
``VHOST_USER_PROTOCOL_F_BACKEND_REQ`` and ``VHOST_USER_PROTOCOL_F_REPLY_ACK``.
The former is necessary for getting a message channel from the back-end
to the front-end, while the latter needs to be used with the in-band
notification messages to block until they are processed, both to avoid
blocking later and for proper processing (at least in the simulation
use case). As it has no other way of signalling this error, the back-end
should close the connection as a response to a
``VHOST_USER_SET_PROTOCOL_FEATURES`` message that sets the in-band
notifications feature flag without the other two.
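
A back-end could check this dependency when it handles
``VHOST_USER_SET_PROTOCOL_FEATURES``. The sketch below uses the bit numbers
listed under "Protocol features"; the helper names ``has_bit`` and
``inband_features_valid`` are illustrative and not part of any existing API.

.. code:: c

  #include <stdbool.h>
  #include <stdint.h>

  #define VHOST_USER_PROTOCOL_F_REPLY_ACK             3
  #define VHOST_USER_PROTOCOL_F_BACKEND_REQ           5
  #define VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS 14

  static bool has_bit(uint64_t features, unsigned int bit)
  {
      return (features >> bit) & 1;
  }

  /*
   * Returns false when the requested protocol feature set is an error,
   * in which case the back-end should close the connection.
   */
  static bool inband_features_valid(uint64_t features)
  {
      if (!has_bit(features, VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS)) {
          return true;
      }
      return has_bit(features, VHOST_USER_PROTOCOL_F_BACKEND_REQ) &&
             has_bit(features, VHOST_USER_PROTOCOL_F_REPLY_ACK);
  }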
864
865Protocol features
866-----------------
867
868.. code:: c
869
870  #define VHOST_USER_PROTOCOL_F_MQ                    0
871  #define VHOST_USER_PROTOCOL_F_LOG_SHMFD             1
872  #define VHOST_USER_PROTOCOL_F_RARP                  2
873  #define VHOST_USER_PROTOCOL_F_REPLY_ACK             3
874  #define VHOST_USER_PROTOCOL_F_MTU                   4
875  #define VHOST_USER_PROTOCOL_F_BACKEND_REQ           5
876  #define VHOST_USER_PROTOCOL_F_CROSS_ENDIAN          6
877  #define VHOST_USER_PROTOCOL_F_CRYPTO_SESSION        7
878  #define VHOST_USER_PROTOCOL_F_PAGEFAULT             8
879  #define VHOST_USER_PROTOCOL_F_CONFIG                9
880  #define VHOST_USER_PROTOCOL_F_BACKEND_SEND_FD      10
881  #define VHOST_USER_PROTOCOL_F_HOST_NOTIFIER        11
882  #define VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD       12
883  #define VHOST_USER_PROTOCOL_F_RESET_DEVICE         13
884  #define VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS 14
885  #define VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS  15
886  #define VHOST_USER_PROTOCOL_F_STATUS               16
887  #define VHOST_USER_PROTOCOL_F_XEN_MMAP             17
888
889Front-end message types
890-----------------------
891
892``VHOST_USER_GET_FEATURES``
893  :id: 1
894  :equivalent ioctl: ``VHOST_GET_FEATURES``
895  :request payload: N/A
896  :reply payload: ``u64``
897
898  Get from the underlying vhost implementation the features bitmask.
899  Feature bit ``VHOST_USER_F_PROTOCOL_FEATURES`` signals back-end support
900  for ``VHOST_USER_GET_PROTOCOL_FEATURES`` and
901  ``VHOST_USER_SET_PROTOCOL_FEATURES``.
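
  A front-end would typically test this bit in the reply before sending
  either protocol-feature message. A minimal sketch, assuming the bit
  position 30 used by QEMU's vhost-user headers (the value is not stated in
  this section) and a hypothetical helper name:

  .. code:: c

    #include <stdbool.h>
    #include <stdint.h>

    /* Bit position taken from QEMU's headers; an assumption here. */
    #define VHOST_USER_F_PROTOCOL_FEATURES 30

    /* features is the u64 reply payload of VHOST_USER_GET_FEATURES. */
    static bool backend_has_protocol_features(uint64_t features)
    {
        return (features >> VHOST_USER_F_PROTOCOL_FEATURES) & 1;
    }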
902
903``VHOST_USER_SET_FEATURES``
904  :id: 2
905  :equivalent ioctl: ``VHOST_SET_FEATURES``
906  :request payload: ``u64``
907  :reply payload: N/A
908
909  Enable features in the underlying vhost implementation using a
910  bitmask.  Feature bit ``VHOST_USER_F_PROTOCOL_FEATURES`` signals
911  back-end support for ``VHOST_USER_GET_PROTOCOL_FEATURES`` and
912  ``VHOST_USER_SET_PROTOCOL_FEATURES``.
913
914``VHOST_USER_GET_PROTOCOL_FEATURES``
915  :id: 15
916  :equivalent ioctl: ``VHOST_GET_FEATURES``
917  :request payload: N/A
918  :reply payload: ``u64``
919
920  Get the protocol feature bitmask from the underlying vhost
921  implementation.  Only legal if feature bit
922  ``VHOST_USER_F_PROTOCOL_FEATURES`` is present in
923  ``VHOST_USER_GET_FEATURES``.  It does not need to be acknowledged by
924  ``VHOST_USER_SET_FEATURES``.
925
926.. Note::
927   Back-ends that report ``VHOST_USER_F_PROTOCOL_FEATURES`` must
928   support this message even before ``VHOST_USER_SET_FEATURES`` was
929   called.
930
931``VHOST_USER_SET_PROTOCOL_FEATURES``
932  :id: 16
933  :equivalent ioctl: ``VHOST_SET_FEATURES``
934  :request payload: ``u64``
935  :reply payload: N/A
936
937  Enable protocol features in the underlying vhost implementation.
938
939  Only legal if feature bit ``VHOST_USER_F_PROTOCOL_FEATURES`` is present in
940  ``VHOST_USER_GET_FEATURES``.  It does not need to be acknowledged by
941  ``VHOST_USER_SET_FEATURES``.
942
943.. Note::
944   Back-ends that report ``VHOST_USER_F_PROTOCOL_FEATURES`` must support
945   this message even before ``VHOST_USER_SET_FEATURES`` was called.
946
947``VHOST_USER_SET_OWNER``
948  :id: 3
949  :equivalent ioctl: ``VHOST_SET_OWNER``
950  :request payload: N/A
951  :reply payload: N/A
952
953  Issued when a new connection is established. It marks the sender
  as the front-end that owns the session. This can be used on the *back-end*
955  as a "session start" flag.
956
957``VHOST_USER_RESET_OWNER``
958  :id: 4
959  :request payload: N/A
960  :reply payload: N/A
961
962.. admonition:: Deprecated
963
964   This is no longer used. Used to be sent to request disabling all
965   rings, but some back-ends interpreted it to also discard connection
966   state (this interpretation would lead to bugs).  It is recommended
967   that back-ends either ignore this message, or use it to disable all
968   rings.
969
970``VHOST_USER_SET_MEM_TABLE``
971  :id: 5
972  :equivalent ioctl: ``VHOST_SET_MEM_TABLE``
973  :request payload: multiple memory regions description
974  :reply payload: (postcopy only) multiple memory regions description
975
976  Sets the memory map regions on the back-end so it can translate the
977  vring addresses. In the ancillary data there is an array of file
978  descriptors for each memory mapped region. The size and ordering of
979  the fds matches the number and ordering of memory regions.
980
981  When ``VHOST_USER_POSTCOPY_LISTEN`` has been received,
982  ``SET_MEM_TABLE`` replies with the bases of the memory mapped
983  regions to the front-end.  The back-end must have mmap'd the regions but
984  not yet accessed them and should not yet generate a userfault
985  event.
986
987.. Note::
988   ``NEED_REPLY_MASK`` is not set in this case.  QEMU will then
989   reply back to the list of mappings with an empty
990   ``VHOST_USER_SET_MEM_TABLE`` as an acknowledgement; only upon
991   reception of this message may the guest start accessing the memory
992   and generating faults.
993
994``VHOST_USER_SET_LOG_BASE``
995  :id: 6
996  :equivalent ioctl: ``VHOST_SET_LOG_BASE``
  :request payload: ``u64``
998  :reply payload: N/A
999
1000  Sets logging shared memory space.
1001
  When the back-end has the ``VHOST_USER_PROTOCOL_F_LOG_SHMFD`` protocol
  feature, the log memory fd is provided in the ancillary data of the
  ``VHOST_USER_SET_LOG_BASE`` message, and the size and offset of the
  shared memory area are provided in the message.
1006
1007``VHOST_USER_SET_LOG_FD``
1008  :id: 7
1009  :equivalent ioctl: ``VHOST_SET_LOG_FD``
1010  :request payload: N/A
1011  :reply payload: N/A
1012
1013  Sets the logging file descriptor, which is passed as ancillary data.
1014
1015``VHOST_USER_SET_VRING_NUM``
1016  :id: 8
1017  :equivalent ioctl: ``VHOST_SET_VRING_NUM``
1018  :request payload: vring state description
1019  :reply payload: N/A
1020
1021  Set the size of the queue.
1022
1023``VHOST_USER_SET_VRING_ADDR``
1024  :id: 9
1025  :equivalent ioctl: ``VHOST_SET_VRING_ADDR``
1026  :request payload: vring address description
1027  :reply payload: N/A
1028
1029  Sets the addresses of the different aspects of the vring.
1030
1031``VHOST_USER_SET_VRING_BASE``
1032  :id: 10
1033  :equivalent ioctl: ``VHOST_SET_VRING_BASE``
1034  :request payload: vring state description
1035  :reply payload: N/A
1036
1037  Sets the base offset in the available vring.
1038
1039``VHOST_USER_GET_VRING_BASE``
1040  :id: 11
  :equivalent ioctl: ``VHOST_GET_VRING_BASE``
1042  :request payload: vring state description
1043  :reply payload: vring state description
1044
1045  Get the available vring base offset.
1046
1047``VHOST_USER_SET_VRING_KICK``
1048  :id: 12
1049  :equivalent ioctl: ``VHOST_SET_VRING_KICK``
1050  :request payload: ``u64``
1051  :reply payload: N/A
1052
1053  Set the event file descriptor for adding buffers to the vring. It is
1054  passed in the ancillary data.
1055
1056  Bits (0-7) of the payload contain the vring index. Bit 8 is the
1057  invalid FD flag. This flag is set when there is no file descriptor
1058  in the ancillary data. This signals that polling should be used
  instead of waiting for the kick. Note that if the protocol feature
  ``VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS`` has been negotiated
  this message isn't necessary, as the ring is also started on the
  ``VHOST_USER_VRING_KICK`` message. It may however still be used to
  set an event file descriptor (which will be preferred over the
  message) or to enable polling.
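
  The payload layout can be sketched as below. It also applies to
  ``VHOST_USER_SET_VRING_CALL`` and ``VHOST_USER_SET_VRING_ERR``, which use
  the same bit assignment. The macro and function names are illustrative
  only.

  .. code:: c

    #include <stdbool.h>
    #include <stdint.h>

    #define VRING_IDX_MASK  0xffu       /* bits 0-7: vring index */
    #define VRING_NOFD_MASK (1u << 8)   /* bit 8: invalid FD flag */

    /* Build the u64 payload; have_fd is false when no fd is attached. */
    static uint64_t vring_fd_payload(uint8_t index, bool have_fd)
    {
        return (uint64_t)index | (have_fd ? 0u : VRING_NOFD_MASK);
    }

    static uint8_t vring_fd_payload_index(uint64_t payload)
    {
        return payload & VRING_IDX_MASK;
    }

    /* False means the ancillary data carries no file descriptor. */
    static bool vring_fd_payload_has_fd(uint64_t payload)
    {
        return !(payload & VRING_NOFD_MASK);
    }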
1065
1066``VHOST_USER_SET_VRING_CALL``
1067  :id: 13
1068  :equivalent ioctl: ``VHOST_SET_VRING_CALL``
1069  :request payload: ``u64``
1070  :reply payload: N/A
1071
1072  Set the event file descriptor to signal when buffers are used. It is
1073  passed in the ancillary data.
1074
1075  Bits (0-7) of the payload contain the vring index. Bit 8 is the
1076  invalid FD flag. This flag is set when there is no file descriptor
1077  in the ancillary data. This signals that polling will be used
  instead of waiting for the call. Note that if the protocol features
  ``VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS`` and
  ``VHOST_USER_PROTOCOL_F_BACKEND_REQ`` have been negotiated this message
  isn't necessary, as the ``VHOST_USER_BACKEND_VRING_CALL`` message can be
  used. It may however still be used to set an event file descriptor
  or to enable polling.
1084
1085``VHOST_USER_SET_VRING_ERR``
1086  :id: 14
1087  :equivalent ioctl: ``VHOST_SET_VRING_ERR``
1088  :request payload: ``u64``
1089  :reply payload: N/A
1090
  Set the event file descriptor to signal when an error occurs. It is
1092  passed in the ancillary data.
1093
1094  Bits (0-7) of the payload contain the vring index. Bit 8 is the
1095  invalid FD flag. This flag is set when there is no file descriptor
  in the ancillary data. Note that if the protocol features
  ``VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS`` and
  ``VHOST_USER_PROTOCOL_F_BACKEND_REQ`` have been negotiated this message
  isn't necessary, as the ``VHOST_USER_BACKEND_VRING_ERR`` message can be
  used. It may however still be used to set an event file descriptor
  (which will be preferred over the message).
1102
1103``VHOST_USER_GET_QUEUE_NUM``
1104  :id: 17
1105  :equivalent ioctl: N/A
1106  :request payload: N/A
  :reply payload: ``u64``
1108
1109  Query how many queues the back-end supports.
1110
1111  This request should be sent only when ``VHOST_USER_PROTOCOL_F_MQ``
1112  is set in queried protocol features by
1113  ``VHOST_USER_GET_PROTOCOL_FEATURES``.
1114
1115``VHOST_USER_SET_VRING_ENABLE``
1116  :id: 18
1117  :equivalent ioctl: N/A
1118  :request payload: vring state description
1119  :reply payload: N/A
1120
  Signal the back-end to enable or disable the corresponding vring.
1122
1123  This request should be sent only when
1124  ``VHOST_USER_F_PROTOCOL_FEATURES`` has been negotiated.
1125
1126``VHOST_USER_SEND_RARP``
1127  :id: 19
1128  :equivalent ioctl: N/A
1129  :request payload: ``u64``
1130  :reply payload: N/A
1131
  Ask the vhost-user back-end to broadcast a fake RARP to notify that the
  migration has terminated, for guests that do not support GUEST_ANNOUNCE.
1134
1135  Only legal if feature bit ``VHOST_USER_F_PROTOCOL_FEATURES`` is
1136  present in ``VHOST_USER_GET_FEATURES`` and protocol feature bit
1137  ``VHOST_USER_PROTOCOL_F_RARP`` is present in
1138  ``VHOST_USER_GET_PROTOCOL_FEATURES``.  The first 6 bytes of the
1139  payload contain the mac address of the guest to allow the vhost user
1140  back-end to construct and broadcast the fake RARP.
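
  Extracting the address on the back-end side might look like the sketch
  below. It treats the payload as the 8 raw bytes received on the socket,
  which avoids any byte-order interpretation of the ``u64``; both function
  names are hypothetical.

  .. code:: c

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* payload points at the 8 raw bytes of the u64 payload as received. */
    static void rarp_payload_mac(const uint8_t payload[8], uint8_t mac[6])
    {
        memcpy(mac, payload, 6);
    }

    /* Format a MAC address as "aa:bb:cc:dd:ee:ff"; buf must hold 18 bytes. */
    static void mac_to_string(const uint8_t mac[6], char buf[18])
    {
        snprintf(buf, 18, "%02x:%02x:%02x:%02x:%02x:%02x",
                 mac[0], mac[1], mac[2], mac[3], mac[4], mac[5]);
    }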
1141
1142``VHOST_USER_NET_SET_MTU``
1143  :id: 20
1144  :equivalent ioctl: N/A
1145  :request payload: ``u64``
1146  :reply payload: N/A
1147
1148  Set host MTU value exposed to the guest.
1149
1150  This request should be sent only when ``VIRTIO_NET_F_MTU`` feature
1151  has been successfully negotiated, ``VHOST_USER_F_PROTOCOL_FEATURES``
1152  is present in ``VHOST_USER_GET_FEATURES`` and protocol feature bit
1153  ``VHOST_USER_PROTOCOL_F_NET_MTU`` is present in
1154  ``VHOST_USER_GET_PROTOCOL_FEATURES``.
1155
1156  If ``VHOST_USER_PROTOCOL_F_REPLY_ACK`` is negotiated, the back-end must
1157  respond with zero in case the specified MTU is valid, or non-zero
1158  otherwise.
1159
1160``VHOST_USER_SET_BACKEND_REQ_FD`` (previous name ``VHOST_USER_SET_SLAVE_REQ_FD``)
1161  :id: 21
1162  :equivalent ioctl: N/A
1163  :request payload: N/A
1164  :reply payload: N/A
1165
1166  Set the socket file descriptor for back-end initiated requests. It is passed
1167  in the ancillary data.
1168
1169  This request should be sent only when
  ``VHOST_USER_F_PROTOCOL_FEATURES`` has been negotiated, and protocol
  feature bit ``VHOST_USER_PROTOCOL_F_BACKEND_REQ`` is present in
1172  ``VHOST_USER_GET_PROTOCOL_FEATURES``.  If
1173  ``VHOST_USER_PROTOCOL_F_REPLY_ACK`` is negotiated, the back-end must
1174  respond with zero for success, non-zero otherwise.
1175
1176``VHOST_USER_IOTLB_MSG``
1177  :id: 22
1178  :equivalent ioctl: N/A (equivalent to ``VHOST_IOTLB_MSG`` message type)
1179  :request payload: ``struct vhost_iotlb_msg``
1180  :reply payload: ``u64``
1181
1182  Send IOTLB messages with ``struct vhost_iotlb_msg`` as payload.
1183
  The front-end sends such requests to update and invalidate entries in the
  device IOTLB. The back-end has to acknowledge the request by sending
  zero as ``u64`` payload for success, non-zero otherwise.

  This request should be sent only when the ``VIRTIO_F_IOMMU_PLATFORM``
  feature has been successfully negotiated.
1190
1191``VHOST_USER_SET_VRING_ENDIAN``
1192  :id: 23
1193  :equivalent ioctl: ``VHOST_SET_VRING_ENDIAN``
1194  :request payload: vring state description
1195  :reply payload: N/A
1196
1197  Set the endianness of a VQ for legacy devices. Little-endian is
1198  indicated with state.num set to 0 and big-endian is indicated with
1199  state.num set to 1. Other values are invalid.
1200
1201  This request should be sent only when
1202  ``VHOST_USER_PROTOCOL_F_CROSS_ENDIAN`` has been negotiated.
  Back-ends that negotiated this feature should handle both
  endiannesses and expect this message once (per VQ) during device
  configuration (i.e. before the front-end starts the VQ).
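
  A back-end could decode the ``state.num`` value along these lines; the
  enum and function names are illustrative only:

  .. code:: c

    #include <stdint.h>

    enum vq_endian {
        VQ_ENDIAN_INVALID = -1,
        VQ_ENDIAN_LITTLE  = 0,
        VQ_ENDIAN_BIG     = 1
    };

    /* Map state.num to an endianness; anything but 0 or 1 is invalid. */
    static enum vq_endian vq_endian_from_num(uint32_t num)
    {
        switch (num) {
        case 0:
            return VQ_ENDIAN_LITTLE;
        case 1:
            return VQ_ENDIAN_BIG;
        default:
            return VQ_ENDIAN_INVALID;
        }
    }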
1206
1207``VHOST_USER_GET_CONFIG``
1208  :id: 24
1209  :equivalent ioctl: N/A
1210  :request payload: virtio device config space
1211  :reply payload: virtio device config space
1212
  When ``VHOST_USER_PROTOCOL_F_CONFIG`` is negotiated, this message is
  submitted by the vhost-user front-end to fetch the contents of the
  virtio device configuration space. The payload size of the vhost-user
  back-end's reply MUST match the front-end's request; the back-end uses
  a zero-length payload to indicate an error to the vhost-user front-end.
  The vhost-user front-end may cache the contents to avoid repeated
  ``VHOST_USER_GET_CONFIG`` calls.
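
  On the front-end side, the size rules could be checked as in this sketch.
  The enum and the function ``check_get_config_reply`` are hypothetical, and
  both arguments are taken to be the payload sizes of the request and of
  the reply.

  .. code:: c

    #include <stdint.h>

    enum get_config_result {
        GET_CONFIG_OK,
        GET_CONFIG_BACKEND_ERROR,   /* zero-length payload: back-end error */
        GET_CONFIG_SIZE_MISMATCH    /* reply size differs from the request */
    };

    static enum get_config_result check_get_config_reply(uint32_t request_size,
                                                         uint32_t reply_size)
    {
        if (reply_size == 0) {
            return GET_CONFIG_BACKEND_ERROR;
        }
        if (reply_size != request_size) {
            return GET_CONFIG_SIZE_MISMATCH;
        }
        return GET_CONFIG_OK;
    }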
1220
1221``VHOST_USER_SET_CONFIG``
1222  :id: 25
1223  :equivalent ioctl: N/A
1224  :request payload: virtio device config space
1225  :reply payload: N/A
1226
  When ``VHOST_USER_PROTOCOL_F_CONFIG`` is negotiated, this message is
  submitted by the vhost-user front-end when the guest changes the virtio
  device configuration space. It can also be used for live migration
  on the destination host. The vhost-user back-end must check the flags
  field, and back-ends MUST NOT accept SET_CONFIG for read-only
  configuration space fields unless the live migration bit is set.
1233
1234``VHOST_USER_CREATE_CRYPTO_SESSION``
1235  :id: 26
1236  :equivalent ioctl: N/A
1237  :request payload: crypto session description
1238  :reply payload: crypto session description
1239
1240  Create a session for crypto operation. The back-end must return
1241  the session id, 0 or positive for success, negative for failure.
1242  This request should be sent only when
1243  ``VHOST_USER_PROTOCOL_F_CRYPTO_SESSION`` feature has been
1244  successfully negotiated.  It's a required feature for crypto
1245  devices.
1246
1247``VHOST_USER_CLOSE_CRYPTO_SESSION``
1248  :id: 27
1249  :equivalent ioctl: N/A
1250  :request payload: ``u64``
1251  :reply payload: N/A
1252
1253  Close a session for crypto operation which was previously
1254  created by ``VHOST_USER_CREATE_CRYPTO_SESSION``.
1255
1256  This request should be sent only when
1257  ``VHOST_USER_PROTOCOL_F_CRYPTO_SESSION`` feature has been
1258  successfully negotiated.  It's a required feature for crypto
1259  devices.
1260
1261``VHOST_USER_POSTCOPY_ADVISE``
1262  :id: 28
1263  :request payload: N/A
1264  :reply payload: userfault fd
1265
  When ``VHOST_USER_PROTOCOL_F_PAGEFAULT`` is supported, the front-end
  advises the back-end that a migration with postcopy enabled is underway;
  the back-end must open a userfaultfd for later use.  Note that at this
  stage the migration is still in precopy mode.
1270
1271``VHOST_USER_POSTCOPY_LISTEN``
1272  :id: 29
1273  :request payload: N/A
1274  :reply payload: N/A
1275
  The front-end advises the back-end that a transition to postcopy mode has
1277  happened.  The back-end must ensure that shared memory is registered
1278  with userfaultfd to cause faulting of non-present pages.
1279
1280  This is always sent sometime after a ``VHOST_USER_POSTCOPY_ADVISE``,
1281  and thus only when ``VHOST_USER_PROTOCOL_F_PAGEFAULT`` is supported.
1282
1283``VHOST_USER_POSTCOPY_END``
1284  :id: 30
1285  :request payload: N/A
1286  :reply payload: ``u64``
1287
1288  The front-end advises that postcopy migration has now completed.  The back-end
1289  must disable the userfaultfd. The reply is an acknowledgement
1290  only.
1291
1292  When ``VHOST_USER_PROTOCOL_F_PAGEFAULT`` is supported, this message
1293  is sent at the end of the migration, after
1294  ``VHOST_USER_POSTCOPY_LISTEN`` was previously sent.
1295
1296  The value returned is an error indication; 0 is success.
1297
1298``VHOST_USER_GET_INFLIGHT_FD``
1299  :id: 31
1300  :equivalent ioctl: N/A
1301  :request payload: inflight description
1302  :reply payload: N/A
1303
  When the ``VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD`` protocol feature has
  been successfully negotiated, this message is submitted by the front-end to
  get a shared buffer from the back-end. The shared buffer will be used by
  the back-end to track inflight I/O. QEMU should retrieve a new one when
  the VM is reset.
1309
1310``VHOST_USER_SET_INFLIGHT_FD``
1311  :id: 32
1312  :equivalent ioctl: N/A
1313  :request payload: inflight description
1314  :reply payload: N/A
1315
1316  When ``VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD`` protocol feature has
1317  been successfully negotiated, this message is submitted by the front-end to
  send the shared inflight buffer back to the back-end so that the back-end
  can get inflight I/O after a crash or restart.
1320
1321``VHOST_USER_GPU_SET_SOCKET``
1322  :id: 33
1323  :equivalent ioctl: N/A
1324  :request payload: N/A
1325  :reply payload: N/A
1326
1327  Sets the GPU protocol socket file descriptor, which is passed as
1328  ancillary data. The GPU protocol is used to inform the front-end of
1329  rendering state and updates. See vhost-user-gpu.rst for details.
1330
1331``VHOST_USER_RESET_DEVICE``
1332  :id: 34
1333  :equivalent ioctl: N/A
1334  :request payload: N/A
1335  :reply payload: N/A
1336
1337  Ask the vhost user back-end to disable all rings and reset all
1338  internal device state to the initial state, ready to be
1339  reinitialized. The back-end retains ownership of the device
1340  throughout the reset operation.
1341
1342  Only valid if the ``VHOST_USER_PROTOCOL_F_RESET_DEVICE`` protocol
1343  feature is set by the back-end.
1344
1345``VHOST_USER_VRING_KICK``
1346  :id: 35
1347  :equivalent ioctl: N/A
1348  :request payload: vring state description
1349  :reply payload: N/A
1350
1351  When the ``VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS`` protocol
1352  feature has been successfully negotiated, this message may be
1353  submitted by the front-end to indicate that a buffer was added to
1354  the vring instead of signalling it using the vring's kick file
1355  descriptor or having the back-end rely on polling.
1356
1357  The state.num field is currently reserved and must be set to 0.
1358
1359``VHOST_USER_GET_MAX_MEM_SLOTS``
1360  :id: 36
1361  :equivalent ioctl: N/A
1362  :request payload: N/A
  :reply payload: ``u64``
1364
1365  When the ``VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS`` protocol
1366  feature has been successfully negotiated, this message is submitted
1367  by the front-end to the back-end. The back-end should return the message with a
1368  u64 payload containing the maximum number of memory slots for
1369  QEMU to expose to the guest. The value returned by the back-end
1370  will be capped at the maximum number of ram slots which can be
1371  supported by the target platform.
1372
1373``VHOST_USER_ADD_MEM_REG``
1374  :id: 37
1375  :equivalent ioctl: N/A
1376  :request payload: N/A
1377  :reply payload: single memory region description
1378
1379  When the ``VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS`` protocol
1380  feature has been successfully negotiated, this message is submitted
1381  by the front-end to the back-end. The message payload contains a memory
1382  region descriptor struct, describing a region of guest memory which
1383  the back-end device must map in. When the
1384  ``VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS`` protocol feature has
1385  been successfully negotiated, along with the
1386  ``VHOST_USER_REM_MEM_REG`` message, this message is used to set and
1387  update the memory tables of the back-end device.
1388
1389  Exactly one file descriptor from which the memory is mapped is
1390  passed in the ancillary data.
1391
1392  In postcopy mode (see ``VHOST_USER_POSTCOPY_LISTEN``), the back-end
1393  replies with the bases of the memory mapped region to the front-end.
1394  For further details on postcopy, see ``VHOST_USER_SET_MEM_TABLE``.
1395  They apply to ``VHOST_USER_ADD_MEM_REG`` accordingly.
1396
1397``VHOST_USER_REM_MEM_REG``
1398  :id: 38
1399  :equivalent ioctl: N/A
1400  :request payload: N/A
1401  :reply payload: single memory region description
1402
1403  When the ``VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS`` protocol
1404  feature has been successfully negotiated, this message is submitted
1405  by the front-end to the back-end. The message payload contains a memory
1406  region descriptor struct, describing a region of guest memory which
1407  the back-end device must unmap. When the
1408  ``VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS`` protocol feature has
1409  been successfully negotiated, along with the
1410  ``VHOST_USER_ADD_MEM_REG`` message, this message is used to set and
1411  update the memory tables of the back-end device.
1412
1413  The memory region to be removed is identified by its guest address,
1414  user address and size. The mmap offset is ignored.
1415
1416  No file descriptors SHOULD be passed in the ancillary data. For
1417  compatibility with existing incorrect implementations, the back-end MAY
1418  accept messages with one file descriptor. If a file descriptor is
1419  passed, the back-end MUST close it without using it otherwise.
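
  The lookup of the region to remove can be sketched as follows. The struct
  mirrors the four 64-bit values of a memory region description (padding
  omitted), but the field and function names are chosen for illustration
  and are not taken from any particular implementation.

  .. code:: c

    #include <stdbool.h>
    #include <stdint.h>

    /* Illustrative view of a memory region description. */
    struct mem_region {
        uint64_t guest_phys_addr;
        uint64_t memory_size;
        uint64_t userspace_addr;
        uint64_t mmap_offset;
    };

    /* Compare guest address, user address and size; ignore mmap offset. */
    static bool region_matches(const struct mem_region *have,
                               const struct mem_region *req)
    {
        return have->guest_phys_addr == req->guest_phys_addr &&
               have->userspace_addr == req->userspace_addr &&
               have->memory_size == req->memory_size;
    }

    /* Returns the index of the region to unmap, or -1 if none matches. */
    static int find_region_to_remove(const struct mem_region *table, int count,
                                     const struct mem_region *req)
    {
        for (int i = 0; i < count; i++) {
            if (region_matches(&table[i], req)) {
                return i;
            }
        }
        return -1;
    }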
1420
1421``VHOST_USER_SET_STATUS``
1422  :id: 39
1423  :equivalent ioctl: VHOST_VDPA_SET_STATUS
1424  :request payload: ``u64``
1425  :reply payload: N/A
1426
1427  When the ``VHOST_USER_PROTOCOL_F_STATUS`` protocol feature has been
1428  successfully negotiated, this message is submitted by the front-end to
1429  notify the back-end with updated device status as defined in the Virtio
1430  specification.
1431
1432``VHOST_USER_GET_STATUS``
1433  :id: 40
1434  :equivalent ioctl: VHOST_VDPA_GET_STATUS
1435  :request payload: N/A
1436  :reply payload: ``u64``
1437
1438  When the ``VHOST_USER_PROTOCOL_F_STATUS`` protocol feature has been
1439  successfully negotiated, this message is submitted by the front-end to
1440  query the back-end for its device status as defined in the Virtio
1441  specification.
1442
1443
1444Back-end message types
1445----------------------
1446
1447For this type of message, the request is sent by the back-end and the reply
1448is sent by the front-end.
1449
1450``VHOST_USER_BACKEND_IOTLB_MSG`` (previous name ``VHOST_USER_SLAVE_IOTLB_MSG``)
1451  :id: 1
1452  :equivalent ioctl: N/A (equivalent to ``VHOST_IOTLB_MSG`` message type)
1453  :request payload: ``struct vhost_iotlb_msg``
1454  :reply payload: N/A
1455
1456  Send IOTLB messages with ``struct vhost_iotlb_msg`` as payload.
  The back-end sends such requests to notify of an IOTLB miss, or an IOTLB
  access failure. If ``VHOST_USER_PROTOCOL_F_REPLY_ACK`` is
  negotiated, and the back-end sets the ``VHOST_USER_NEED_REPLY`` flag, the
  front-end must respond with zero when the operation is successfully
  completed, or non-zero otherwise.  This request should be sent only when
  the ``VIRTIO_F_IOMMU_PLATFORM`` feature has been successfully
  negotiated.
1464
1465``VHOST_USER_BACKEND_CONFIG_CHANGE_MSG`` (previous name ``VHOST_USER_SLAVE_CONFIG_CHANGE_MSG``)
1466  :id: 2
1467  :equivalent ioctl: N/A
1468  :request payload: N/A
1469  :reply payload: N/A
1470
  When ``VHOST_USER_PROTOCOL_F_CONFIG`` is negotiated, the vhost-user
  back-end sends such messages to notify that the virtio device's
  configuration space has changed. For host devices that support this
  feature, the host driver can send a ``VHOST_USER_GET_CONFIG``
  message to the back-end to get the latest content. If
  ``VHOST_USER_PROTOCOL_F_REPLY_ACK`` is negotiated, and the back-end sets the
  ``VHOST_USER_NEED_REPLY`` flag, the front-end must respond with zero when
  the operation is successfully completed, or non-zero otherwise.
1479
1480``VHOST_USER_BACKEND_VRING_HOST_NOTIFIER_MSG`` (previous name ``VHOST_USER_SLAVE_VRING_HOST_NOTIFIER_MSG``)
1481  :id: 3
1482  :equivalent ioctl: N/A
1483  :request payload: vring area description
1484  :reply payload: N/A
1485
1486  Sets host notifier for a specified queue. The queue index is
1487  contained in the ``u64`` field of the vring area description. The
1488  host notifier is described by the file descriptor (typically it's a
1489  VFIO device fd) which is passed as ancillary data and the size
1490  (which is mmap size and should be the same as host page size) and
1491  offset (which is mmap offset) carried in the vring area
1492  description. QEMU can mmap the file descriptor based on the size and
1493  offset to get a memory range. Registering a host notifier means
1494  mapping this memory range to the VM as the specified queue's notify
1495  MMIO region. The back-end sends this request to tell QEMU to de-register
1496  the existing notifier if any and register the new notifier if the
1497  request is sent with a file descriptor.
1498
1499  This request should be sent only when
1500  ``VHOST_USER_PROTOCOL_F_HOST_NOTIFIER`` protocol feature has been
1501  successfully negotiated.
1502
1503``VHOST_USER_BACKEND_VRING_CALL`` (previous name ``VHOST_USER_SLAVE_VRING_CALL``)
1504  :id: 4
1505  :equivalent ioctl: N/A
1506  :request payload: vring state description
1507  :reply payload: N/A
1508
1509  When the ``VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS`` protocol
1510  feature has been successfully negotiated, this message may be
1511  submitted by the back-end to indicate that a buffer was used from
1512  the vring instead of signalling this using the vring's call file
  descriptor or having the front-end rely on polling.
1514
1515  The state.num field is currently reserved and must be set to 0.
1516
1517``VHOST_USER_BACKEND_VRING_ERR`` (previous name ``VHOST_USER_SLAVE_VRING_ERR``)
1518  :id: 5
1519  :equivalent ioctl: N/A
1520  :request payload: vring state description
1521  :reply payload: N/A
1522
1523  When the ``VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS`` protocol
1524  feature has been successfully negotiated, this message may be
1525  submitted by the back-end to indicate that an error occurred on the
1526  specific vring, instead of signalling the error file descriptor
1527  set by the front-end via ``VHOST_USER_SET_VRING_ERR``.
1528
1529  The state.num field is currently reserved and must be set to 0.
1530
1531.. _reply_ack:
1532
1533VHOST_USER_PROTOCOL_F_REPLY_ACK
1534-------------------------------
1535
1536The original vhost-user specification only demands replies for certain
1537commands. This differs from the vhost protocol implementation where
1538commands are sent over an ``ioctl()`` call and block until the back-end
1539has completed.
1540
With this protocol extension negotiated, the sender (QEMU) can set the
``need_reply`` [Bit 3] flag on any command. This indicates that the
back-end MUST respond with a ``VhostUserMsg`` whose payload indicates
success or failure. The payload should be set to zero on success or
non-zero on failure, unless the message already has an explicit reply body.

The reply payload gives QEMU a deterministic indication of the result
of the command. Today, QEMU is expected to terminate the main vhost-user
loop upon receiving such errors. In the future, QEMU could be taught to be
more resilient for selected requests.
1551
1552For the message types that already solicit a reply from the back-end,
1553the presence of ``VHOST_USER_PROTOCOL_F_REPLY_ACK`` or need_reply bit
1554being set brings no behavioural change. (See the Communication_
1555section for details.)
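
The decision a receiver makes for each incoming message can be sketched as
follows. The function name and the ``has_own_reply`` parameter are
illustrative; the flag position follows the ``need_reply`` [Bit 3]
definition above.

.. code:: c

  #include <stdbool.h>
  #include <stdint.h>

  #define VHOST_USER_PROTOCOL_F_REPLY_ACK 3
  #define FLAG_NEED_REPLY (1u << 3)   /* need_reply, bit 3 of the flags */

  /*
   * Returns true when the receiver must send an explicit zero/non-zero
   * acknowledgement. has_own_reply is true for message types that already
   * define a reply body; those are answered as usual with no extra ack.
   */
  static bool must_send_ack(uint64_t protocol_features, uint32_t hdr_flags,
                            bool has_own_reply)
  {
      if (has_own_reply) {
          return false;
      }
      if (!((protocol_features >> VHOST_USER_PROTOCOL_F_REPLY_ACK) & 1)) {
          return false;
      }
      return (hdr_flags & FLAG_NEED_REPLY) != 0;
  }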
1556
1557.. _backend_conventions:
1558
1559Backend program conventions
1560===========================
1561
1562vhost-user back-ends can provide various devices & services and may
1563need to be configured manually depending on the use case. However, it
1564is a good idea to follow the conventions listed here when
1565possible. Users, QEMU or libvirt, can then rely on some common
1566behaviour to avoid heterogeneous configuration and management of the
1567back-end programs and facilitate interoperability.
1568
1569Each back-end installed on a host system should come with at least one
1570JSON file that conforms to the vhost-user.json schema. Each file
1571informs the management applications about the back-end type, and binary
1572location. In addition, it defines rules for management apps for
1573picking the highest priority back-end when multiple match the search
1574criteria (see ``@VhostUserBackend`` documentation in the schema file).
1575
1576If the back-end is not capable of enabling a requested feature on the
1577host (such as 3D acceleration with virgl), or the initialization
1578failed, the back-end should fail to start early and exit with a status
1579!= 0. It may also print a message to stderr for further details.
1580
The back-end program must not daemonize itself, but it may be
daemonized by the management layer. It may also have restricted
access to the system.
1584
1585File descriptors 0, 1 and 2 will exist, and have regular
1586stdin/stdout/stderr usage (they may have been redirected to /dev/null
1587by the management layer, or to a log handler).
1588
The back-end program must end (as quickly and cleanly as possible) when
the SIGTERM signal is received. It may eventually receive SIGKILL from
the management layer after a few seconds.
1592
1593The following command line options have an expected behaviour. They
1594are mandatory, unless explicitly said differently:
1595
1596--socket-path=PATH
1597
  This option specifies the location of the vhost-user Unix domain socket.
1599  It is incompatible with --fd.
1600
1601--fd=FDNUM
1602
1603  When this argument is given, the back-end program is started with the
1604  vhost-user socket as file descriptor FDNUM. It is incompatible with
1605  --socket-path.
1606
1607--print-capabilities
1608
1609  Output to stdout the back-end capabilities in JSON format, and then
1610  exit successfully. Other options and arguments should be ignored, and
1611  the back-end program should not perform its normal function.  The
1612  capabilities can be reported dynamically depending on the host
1613  capabilities.
1614
The JSON output is described in the ``vhost-user.json`` schema, by
``@VHostUserBackendCapabilities``.  Example:
1617
1618.. code:: json
1619
1620  {
1621    "type": "foo",
1622    "features": [
1623      "feature-a",
1624      "feature-b"
1625    ]
1626  }
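
The handling of the three common options can be sketched with
``getopt_long``. This is an illustration rather than code from an existing
back-end: ``struct backend_opts`` and ``parse_backend_opts`` are
hypothetical names, and treating "exactly one of ``--socket-path`` and
``--fd``" as a requirement is an assumption derived from the two options
being documented as incompatible.

.. code:: c

  #include <getopt.h>
  #include <limits.h>
  #include <stdbool.h>
  #include <stdlib.h>

  struct backend_opts {
      const char *socket_path;    /* NULL if --socket-path was not given */
      int fd;                     /* -1 if --fd was not given */
      bool print_capabilities;
  };

  /* Returns 0 on success, -1 on a usage error. */
  static int parse_backend_opts(int argc, char **argv, struct backend_opts *o)
  {
      static const struct option longopts[] = {
          { "socket-path",        required_argument, NULL, 's' },
          { "fd",                 required_argument, NULL, 'f' },
          { "print-capabilities", no_argument,       NULL, 'c' },
          { NULL, 0, NULL, 0 }
      };
      bool bad = false;
      int c;

      o->socket_path = NULL;
      o->fd = -1;
      o->print_capabilities = false;
      optind = 1;     /* restart scanning so the function can be re-run */
      opterr = 0;     /* report errors through the return value only */

      while ((c = getopt_long(argc, argv, "", longopts, NULL)) != -1) {
          char *end;
          long v;

          switch (c) {
          case 's':
              o->socket_path = optarg;
              break;
          case 'f':
              v = strtol(optarg, &end, 10);
              if (*optarg == '\0' || *end != '\0' || v < 0 || v > INT_MAX) {
                  bad = true;
              } else {
                  o->fd = (int)v;
              }
              break;
          case 'c':
              o->print_capabilities = true;
              break;
          default:
              bad = true;
              break;
          }
      }

      if (o->print_capabilities) {
          return 0;   /* other options and arguments are ignored */
      }
      if (bad || (o->socket_path != NULL) == (o->fd >= 0)) {
          return -1;  /* bad option, or not exactly one socket source */
      }
      return 0;
  }

A caller would print the capabilities JSON and exit when
``print_capabilities`` is set, and otherwise either connect to
``socket_path`` or use the inherited descriptor ``fd``.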
1627
1628vhost-user-input
1629----------------
1630
1631Command line options:
1632
1633--evdev-path=PATH
1634
  Specify the Linux input device.
1636
1637  (optional)
1638
1639--no-grab
1640
  Do not request exclusive access to the input device.
1642
1643  (optional)
1644
1645vhost-user-gpu
1646--------------
1647
1648Command line options:
1649
1650--render-node=PATH
1651
1652  Specify the GPU DRM render node.
1653
1654  (optional)
1655
1656--virgl
1657
1658  Enable virgl rendering support.
1659
1660  (optional)
1661
1662vhost-user-blk
1663--------------
1664
1665Command line options:
1666
1667--blk-file=PATH
1668
1669  Specify block device or file path.
1670
1671  (optional)
1672
1673--read-only
1674
1675  Enable read-only.
1676
1677  (optional)
1678