.. _vhost_user_proto:

===================
Vhost-user Protocol
===================

..
  Copyright 2014 Virtual Open Systems Sarl.
  Copyright 2019 Intel Corporation
  Licence: This work is licensed under the terms of the GNU GPL,
           version 2 or later. See the COPYING file in the top-level
           directory.

.. contents:: Table of Contents

Introduction
============

This protocol aims to complement the ``ioctl`` interface used to
control the vhost implementation in the Linux kernel. It implements
the control plane needed to establish virtqueue sharing with a user
space process on the same host. It uses communication over a Unix
domain socket to share file descriptors in the ancillary data of the
message.

The protocol defines two sides of the communication, *master* and
*slave*. *Master* is the application that shares its virtqueues, in
our case QEMU. *Slave* is the consumer of the virtqueues.

In the current implementation QEMU is the *master*, and the *slave* is
the external process consuming the virtio queues, for example a
software Ethernet switch running in user space, such as Snabbswitch,
or a block device backend processing reads and writes to a virtual
disk. In order to facilitate interoperability between various backend
implementations, it is recommended to follow the :ref:`Backend program
conventions <backend_conventions>`.

*Master* and *slave* can be either a client (i.e. connecting) or
server (listening) in the socket communication.

Support for platforms other than Linux
--------------------------------------

While vhost-user was initially developed targeting Linux, nowadays it
is supported on any platform that provides the following features:

- A way for requesting shared memory represented by a file descriptor
  so it can be passed over a UNIX domain socket and then mapped by the
  other process.

- AF_UNIX sockets with SCM_RIGHTS, so QEMU and the other process can
  exchange messages through it, including ancillary data when needed.

- Either eventfd or pipe/pipe2. On platforms where eventfd is not
  available, QEMU will automatically fall back to pipe2 or, as a last
  resort, pipe. Each file descriptor will be used for receiving or
  sending events by reading or writing (respectively) an 8-byte value
  from or to it. The 8-byte value itself has no meaning and
  should not be interpreted.
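
As an illustration of the last item, the following minimal sketch signals
and consumes an event over such a descriptor. The helper names
``event_signal`` and ``event_consume`` are hypothetical and are not taken
from QEMU; the value ``1`` is an arbitrary choice that an eventfd also
accepts.

.. code:: c

  #include <stdint.h>
  #include <unistd.h>

  /* Signal an event by writing one 8-byte value. The receiver must not
   * interpret the value; 1 is used because an eventfd accepts it too. */
  int event_signal(int fd)
  {
      uint64_t value = 1;

      return write(fd, &value, sizeof(value)) == (ssize_t)sizeof(value) ? 0 : -1;
  }

  /* Consume one pending event by reading an 8-byte value and discarding it. */
  int event_consume(int fd)
  {
      uint64_t value;

      return read(fd, &value, sizeof(value)) == (ssize_t)sizeof(value) ? 0 : -1;
  }

With a ``pipe``, the write end is passed to ``event_signal`` and the read
end to ``event_consume``; with an eventfd, the same descriptor serves both.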

Message Specification
=====================

.. Note:: All numbers are in the machine native byte order.

A vhost-user message consists of 3 header fields and a payload.

+---------+-------+------+---------+
| request | flags | size | payload |
+---------+-------+------+---------+

Header
------

:request: 32-bit type of the request

:flags: 32-bit bit field

- Lower 2 bits are the version (currently 0x01)
- Bit 2 is the reply flag - needs to be sent on each reply from the slave
- Bit 3 is the need_reply flag - see :ref:`REPLY_ACK <reply_ack>` for
  details.

:size: 32-bit size of the payload
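
The flag layout above can be sketched in C as follows. The macro, struct
and function names are illustrative and are not QEMU's. The sketch also
assumes that a reply repeats the request type of the message it answers,
which this section does not state.

.. code:: c

  #include <stdbool.h>
  #include <stdint.h>

  /* Illustrative names; the bit positions come from the flags description. */
  #define VU_VERSION_MASK     0x3u        /* lower 2 bits: version */
  #define VU_VERSION          0x1u        /* current version */
  #define VU_REPLY_MASK       (1u << 2)   /* set on every reply from the slave */
  #define VU_NEED_REPLY_MASK  (1u << 3)   /* see REPLY_ACK */

  /* The three fixed header fields that precede the payload. */
  struct vu_header {
      uint32_t request;
      uint32_t flags;
      uint32_t size;
  };

  /* True if the header carries the protocol version this sketch knows. */
  bool vu_header_version_ok(const struct vu_header *h)
  {
      return (h->flags & VU_VERSION_MASK) == VU_VERSION;
  }

  /* Build the header of a reply to 'req' with 'payload_size' payload bytes.
   * Assumption: the reply reuses the request type of 'req'. */
  struct vu_header vu_make_reply(const struct vu_header *req,
                                 uint32_t payload_size)
  {
      struct vu_header reply = {
          .request = req->request,
          .flags = VU_VERSION | VU_REPLY_MASK,
          .size = payload_size,
      };

      return reply;
  }

A receiver would typically call ``vu_header_version_ok`` before looking at
``size`` and reading the payload.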

Payload
-------

Depending on the request type, **payload** can be:

A single 64-bit integer
^^^^^^^^^^^^^^^^^^^^^^^

+-----+
| u64 |
+-----+

:u64: a 64-bit unsigned integer

A vring state description
^^^^^^^^^^^^^^^^^^^^^^^^^

+-------+-----+
| index | num |
+-------+-----+

:index: a 32-bit index

:num: a 32-bit number

A vring address description
^^^^^^^^^^^^^^^^^^^^^^^^^^^

+-------+-------+------+------------+------+-----------+-----+
| index | flags | size | descriptor | used | available | log |
+-------+-------+------+------------+------+-----------+-----+

:index: a 32-bit vring index

:flags: a 32-bit vring flags

:descriptor: a 64-bit ring address of the vring descriptor table

:used: a 64-bit ring address of the vring used ring

:available: a 64-bit ring address of the vring available ring

:log: a 64-bit guest address for logging

Note that a ring address is an IOVA if ``VIRTIO_F_IOMMU_PLATFORM`` has
been negotiated. Otherwise it is a user address.

Memory regions description
^^^^^^^^^^^^^^^^^^^^^^^^^^

+-------------+---------+---------+-----+---------+
| num regions | padding | region0 | ... | region7 |
+-------------+---------+---------+-----+---------+

:num regions: a 32-bit number of regions

:padding: 32-bit

A region is:

+---------------+------+--------------+-------------+
| guest address | size | user address | mmap offset |
+---------------+------+--------------+-------------+

:guest address: a 64-bit guest address of the region

:size: a 64-bit size

:user address: a 64-bit user address

:mmap offset: 64-bit offset where region starts in the mapped memory

Single memory region description
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

+---------+---------------+------+--------------+-------------+
| padding | guest address | size | user address | mmap offset |
+---------+---------------+------+--------------+-------------+

:padding: 64-bit

:guest address: a 64-bit guest address of the region

:size: a 64-bit size

:user address: a 64-bit user address

:mmap offset: 64-bit offset where region starts in the mapped memory

Log description
^^^^^^^^^^^^^^^

+----------+------------+
| log size | log offset |
+----------+------------+

:log size: size of area used for logging

:log offset: offset from start of supplied file descriptor where
             logging starts (i.e. where guest address 0 would be
             logged)

An IOTLB message
^^^^^^^^^^^^^^^^

+------+------+--------------+-------------------+------+
| iova | size | user address | permissions flags | type |
+------+------+--------------+-------------------+------+

:iova: a 64-bit I/O virtual address programmed by the guest

:size: a 64-bit size

:user address: a 64-bit user address

:permissions flags: an 8-bit value:
  - 0: No access
  - 1: Read access
  - 2: Write access
  - 3: Read/Write access

:type: an 8-bit IOTLB message type:
  - 1: IOTLB miss
  - 2: IOTLB update
  - 3: IOTLB invalidate
  - 4: IOTLB access fail

Virtio device config space
^^^^^^^^^^^^^^^^^^^^^^^^^^

+--------+------+-------+---------+
| offset | size | flags | payload |
+--------+------+-------+---------+

:offset: a 32-bit offset of virtio device's configuration space

:size: a 32-bit configuration space access size in bytes

:flags: a 32-bit value:
  - 0: Vhost master messages used for writeable fields
  - 1: Vhost master messages used for live migration

:payload: Size bytes array holding the contents of the virtio
          device's configuration space

Vring area description
^^^^^^^^^^^^^^^^^^^^^^

+-----+------+--------+
| u64 | size | offset |
+-----+------+--------+

:u64: a 64-bit integer containing the vring index and flags

:size: a 64-bit size of this area

:offset: a 64-bit offset of this area from the start of the
         supplied file descriptor

Inflight description
^^^^^^^^^^^^^^^^^^^^

+-----------+-------------+------------+------------+
| mmap size | mmap offset | num queues | queue size |
+-----------+-------------+------------+------------+

:mmap size: a 64-bit size of area to track inflight I/O

:mmap offset: a 64-bit offset of this area from the start
              of the supplied file descriptor

:num queues: a 16-bit number of virtqueues

:queue size: a 16-bit size of virtqueues

C structure
-----------

In QEMU the vhost-user message is implemented with the following struct:

.. code:: c

  typedef struct VhostUserMsg {
      VhostUserRequest request;
      uint32_t flags;
      uint32_t size;
      union {
          uint64_t u64;
          struct vhost_vring_state state;
          struct vhost_vring_addr addr;
          VhostUserMemory memory;
          VhostUserLog log;
          struct vhost_iotlb_msg iotlb;
          VhostUserConfig config;
          VhostUserVringArea area;
          VhostUserInflight inflight;
      };
  } QEMU_PACKED VhostUserMsg;

Communication
=============

The protocol for vhost-user is based on the existing implementation of
vhost for the Linux Kernel. Most messages that can be sent via the
Unix domain socket implementing vhost-user have an equivalent ioctl to
the kernel implementation.

The communication consists of *master* sending message requests and
*slave* sending message replies. Most of the requests don't require
replies. Here is a list of the ones that do:

* ``VHOST_USER_GET_FEATURES``
* ``VHOST_USER_GET_PROTOCOL_FEATURES``
* ``VHOST_USER_GET_VRING_BASE``
* ``VHOST_USER_SET_LOG_BASE`` (if ``VHOST_USER_PROTOCOL_F_LOG_SHMFD``)
* ``VHOST_USER_GET_INFLIGHT_FD`` (if ``VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD``)

.. seealso::

   :ref:`REPLY_ACK <reply_ack>`
       The section on ``REPLY_ACK`` protocol extension.

There are several messages that the master sends with file descriptors passed
in the ancillary data:

* ``VHOST_USER_SET_MEM_TABLE``
* ``VHOST_USER_SET_LOG_BASE`` (if ``VHOST_USER_PROTOCOL_F_LOG_SHMFD``)
* ``VHOST_USER_SET_LOG_FD``
* ``VHOST_USER_SET_VRING_KICK``
* ``VHOST_USER_SET_VRING_CALL``
* ``VHOST_USER_SET_VRING_ERR``
* ``VHOST_USER_SET_SLAVE_REQ_FD``
* ``VHOST_USER_SET_INFLIGHT_FD`` (if ``VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD``)
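
Passing a descriptor in the ancillary data uses the standard
``SCM_RIGHTS`` mechanism of ``sendmsg``/``recvmsg``. The sketch below is a
generic illustration of that mechanism for a single descriptor. The
function names ``send_with_fd`` and ``recv_with_fd`` are hypothetical, and
the code is not QEMU's implementation.

.. code:: c

  #include <string.h>
  #include <sys/socket.h>
  #include <sys/types.h>
  #include <sys/uio.h>

  /* Control buffer sized for one descriptor, aligned for struct cmsghdr. */
  union fd_control {
      char buf[CMSG_SPACE(sizeof(int))];
      struct cmsghdr align;
  };

  /* Send 'len' bytes from 'buf' together with one file descriptor. */
  ssize_t send_with_fd(int sock, const void *buf, size_t len, int fd)
  {
      union fd_control control;
      struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
      struct msghdr msg = { 0 };
      struct cmsghdr *cmsg;

      memset(&control, 0, sizeof(control));
      msg.msg_iov = &iov;
      msg.msg_iovlen = 1;
      msg.msg_control = control.buf;
      msg.msg_controllen = sizeof(control.buf);

      cmsg = CMSG_FIRSTHDR(&msg);
      cmsg->cmsg_level = SOL_SOCKET;
      cmsg->cmsg_type = SCM_RIGHTS;
      cmsg->cmsg_len = CMSG_LEN(sizeof(int));
      memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

      return sendmsg(sock, &msg, 0);
  }

  /* Receive up to 'len' bytes; '*fd' gets a passed descriptor, or -1. */
  ssize_t recv_with_fd(int sock, void *buf, size_t len, int *fd)
  {
      union fd_control control;
      struct iovec iov = { .iov_base = buf, .iov_len = len };
      struct msghdr msg = { 0 };
      struct cmsghdr *cmsg;
      ssize_t n;

      msg.msg_iov = &iov;
      msg.msg_iovlen = 1;
      msg.msg_control = control.buf;
      msg.msg_controllen = sizeof(control.buf);

      *fd = -1;
      n = recvmsg(sock, &msg, 0);
      if (n < 0) {
          return n;
      }
      for (cmsg = CMSG_FIRSTHDR(&msg); cmsg; cmsg = CMSG_NXTHDR(&msg, cmsg)) {
          if (cmsg->cmsg_level == SOL_SOCKET && cmsg->cmsg_type == SCM_RIGHTS) {
              memcpy(fd, CMSG_DATA(cmsg), sizeof(int));
          }
      }
      return n;
  }

The received descriptor is a new descriptor number in the receiving
process that refers to the same open file as the one that was sent.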

If *master* is unable to send the full message or receives a wrong
reply it will close the connection. An optional reconnection mechanism
can be implemented.

If *slave* detects some error such as incompatible features, it may also
close the connection. This should only happen in exceptional circumstances.

Any protocol extensions are gated by protocol feature bits, which
allows full backwards compatibility on both master and slave.  As
older slaves don't support negotiating protocol features, a feature
bit was dedicated for this purpose::

  #define VHOST_USER_F_PROTOCOL_FEATURES 30

Starting and stopping rings
---------------------------

The client must only process a ring while it is started.

The client must only pass data between the ring and the backend while
the ring is enabled.

If a ring is started but disabled, the client must process the ring
without talking to the backend.

For example, for a networking device, in the disabled state the client
must not supply any new RX packets, but must process and discard any
TX packets.

If ``VHOST_USER_F_PROTOCOL_FEATURES`` has not been negotiated, the
ring is initialized in an enabled state.

If ``VHOST_USER_F_PROTOCOL_FEATURES`` has been negotiated, the ring is
initialized in a disabled state. The client must not pass data to/from
the backend until the ring is enabled by
``VHOST_USER_SET_VRING_ENABLE`` with parameter 1, or after it has been
disabled by ``VHOST_USER_SET_VRING_ENABLE`` with parameter 0.

Each ring is initialized in a stopped state. The client must not
process it until the ring is started, or after it has been stopped.

The client must start a ring upon receiving a kick (that is, detecting
that the file descriptor is readable) on the descriptor specified by
``VHOST_USER_SET_VRING_KICK`` or receiving the in-band message
``VHOST_USER_VRING_KICK`` if negotiated, and stop the ring upon
receiving ``VHOST_USER_GET_VRING_BASE``.
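
The start/stop and enable/disable rules above can be modelled with two
independent flags per ring. The following is a minimal sketch of that
model. The type ``ring_state`` and all function names are hypothetical;
they only restate the rules of this section.

.. code:: c

  #include <stdbool.h>

  /* Hypothetical per-ring state kept by the backend. */
  struct ring_state {
      bool started;   /* set by a kick, cleared by VHOST_USER_GET_VRING_BASE */
      bool enabled;   /* toggled by VHOST_USER_SET_VRING_ENABLE */
  };

  /* Rings start out stopped. They start out enabled only when
   * VHOST_USER_F_PROTOCOL_FEATURES has not been negotiated. */
  struct ring_state ring_init(bool protocol_features_negotiated)
  {
      struct ring_state r = {
          .started = false,
          .enabled = !protocol_features_negotiated,
      };

      return r;
  }

  void ring_on_kick(struct ring_state *r)
  {
      r->started = true;
  }

  void ring_on_get_vring_base(struct ring_state *r)
  {
      r->started = false;
  }

  void ring_on_set_vring_enable(struct ring_state *r, bool on)
  {
      r->enabled = on;
  }

  /* May descriptors be taken from the ring at all? */
  bool ring_may_process(const struct ring_state *r)
  {
      return r->started;
  }

  /* May data flow between the ring and the backend? */
  bool ring_may_pass_data(const struct ring_state *r)
  {
      return r->started && r->enabled;
  }

A started but disabled ring satisfies ``ring_may_process`` without
satisfying ``ring_may_pass_data``, which corresponds to the "process and
discard TX packets" case of a networking device.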

While processing the rings (whether they are enabled or not), the client
must support changing some configuration aspects on the fly.

Multiple queue support
----------------------

Many devices have a fixed number of virtqueues.  In this case the master
already knows the number of available virtqueues without communicating with the
slave.

Some devices do not have a fixed number of virtqueues.  Instead the maximum
number of virtqueues is chosen by the slave.  The number can depend on host
resource availability or slave implementation details.  Such devices are called
multiple queue devices.

Multiple queue support allows the slave to advertise the maximum number of
queues.  This is treated as a protocol extension, hence the slave has to
implement protocol features first. The multiple queues feature is supported
only when the protocol feature ``VHOST_USER_PROTOCOL_F_MQ`` (bit 0) is set.

The max number of queues the slave supports can be queried with message
``VHOST_USER_GET_QUEUE_NUM``. Master should stop when the number of requested
queues is bigger than that.

As all queues share one connection, the master uses a unique index for each
queue in the sent message to identify a specified queue.

The master enables queues by sending message ``VHOST_USER_SET_VRING_ENABLE``.
vhost-user-net has historically automatically enabled the first queue pair.

Slaves should always implement the ``VHOST_USER_PROTOCOL_F_MQ`` protocol
feature, even for devices with a fixed number of virtqueues, since it is simple
to implement and offers a degree of introspection.

Masters must not rely on the ``VHOST_USER_PROTOCOL_F_MQ`` protocol feature for
devices with a fixed number of virtqueues.  Only true multiqueue devices
require this protocol feature.

Migration
---------

During live migration, the master may need to track the modifications
the slave makes to the memory mapped regions. The client should mark
the dirty pages in a log. Once it complies with this logging, it may
declare the ``VHOST_F_LOG_ALL`` vhost feature.

To start/stop logging of data/used ring writes, server may send
messages ``VHOST_USER_SET_FEATURES`` with ``VHOST_F_LOG_ALL`` and
``VHOST_USER_SET_VRING_ADDR`` with ``VHOST_VRING_F_LOG`` in ring's
flags set to 1/0, respectively.

All the modifications to memory pointed to by vring "descriptor" should
be marked. Modifications to "used" vring should be marked if
``VHOST_VRING_F_LOG`` is part of ring's flags.

Dirty pages are of size::

  #define VHOST_LOG_PAGE 0x1000

The log memory fd is provided in the ancillary data of
``VHOST_USER_SET_LOG_BASE`` message when the slave has
``VHOST_USER_PROTOCOL_F_LOG_SHMFD`` protocol feature.

The size of the log is supplied as part of ``VhostUserMsg`` which
should be large enough to cover all known guest addresses. Log starts
at the supplied offset in the supplied file descriptor.  The log
covers from address 0 to the maximum of guest regions. In pseudo-code,
to mark page at ``addr`` as dirty::

  page = addr / VHOST_LOG_PAGE
  log[page / 8] |= 1 << page % 8

Where ``addr`` is the guest physical address.

Use atomic operations, as the log may be concurrently manipulated.
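
The pseudo-code can be turned into C with C11 atomics as sketched below.
The function names ``log_mark_page`` and ``log_mark_range`` are
hypothetical. The sketch assumes the caller has already mapped the log and
checked that it is large enough for the addresses being marked.

.. code:: c

  #include <stdatomic.h>
  #include <stdint.h>

  #define VHOST_LOG_PAGE 0x1000

  /* Atomically set the dirty bit of the page that contains 'addr'. */
  void log_mark_page(atomic_uchar *log, uint64_t addr)
  {
      uint64_t page = addr / VHOST_LOG_PAGE;

      atomic_fetch_or(&log[page / 8], (unsigned char)(1u << (page % 8)));
  }

  /* Mark every page touched by the byte range [addr, addr + len). */
  void log_mark_range(atomic_uchar *log, uint64_t addr, uint64_t len)
  {
      if (len == 0) {
          return;
      }

      uint64_t first = addr / VHOST_LOG_PAGE;
      uint64_t last = (addr + len - 1) / VHOST_LOG_PAGE;

      for (uint64_t page = first; page <= last; page++) {
          log_mark_page(log, page * VHOST_LOG_PAGE);
      }
  }

A write that straddles a page boundary sets two bits, one for each page it
touches.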

Note that when logging modifications to the used ring (when
``VHOST_VRING_F_LOG`` is set for this ring), ``log_guest_addr`` should
be used to calculate the log offset: the write to first byte of the
used ring is logged at this offset from log start. Also note that this
value might be outside the legal guest physical address range
(i.e. does not have to be covered by the ``VhostUserMemory`` table), but
the bit offset of the last byte of the ring must fall within the size
supplied by ``VhostUserLog``.

``VHOST_USER_SET_LOG_FD`` is an optional message with an eventfd in
ancillary data; it may be used to inform the master that the log has
been modified.

Once the source has finished migration, rings will be stopped by the
source. No further update must be done before rings are restarted.

In postcopy migration the slave is started before all the memory has
been received from the source host, and care must be taken to avoid
accessing pages that have yet to be received.  The slave opens a
'userfault'-fd and registers the memory with it; this fd is then
passed back over to the master.  The master services requests on the
userfaultfd for pages that are accessed and when the page is available
it performs WAKE ioctls on the userfaultfd to wake the stalled
slave.  The client indicates support for this via the
``VHOST_USER_PROTOCOL_F_PAGEFAULT`` feature.

Memory access
-------------

The master sends a list of vhost memory regions to the slave using the
``VHOST_USER_SET_MEM_TABLE`` message.  Each region has two base
addresses: a guest address and a user address.

Messages contain guest addresses and/or user addresses to reference locations
within the shared memory.  The mapping of these addresses works as follows.

User addresses map to the vhost memory region containing that user address.

When the ``VIRTIO_F_IOMMU_PLATFORM`` feature has not been negotiated:

* Guest addresses map to the vhost memory region containing that guest
  address.

When the ``VIRTIO_F_IOMMU_PLATFORM`` feature has been negotiated:

* Guest addresses are also called I/O virtual addresses (IOVAs).  They are
  translated to user addresses via the IOTLB.

* The vhost memory region guest address is not used.
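
For the case without ``VIRTIO_F_IOMMU_PLATFORM``, the lookup can be
sketched as follows. The struct ``mem_region`` and both functions are
hypothetical. The sketch assumes the slave mapped each region's file
descriptor from offset 0 and recorded the resulting pointer in
``mmap_base``, so that the region's data starts at
``mmap_base + mmap_offset``.

.. code:: c

  #include <stddef.h>
  #include <stdint.h>

  /* One region as seen by the slave. The first four fields mirror the
   * region payload; 'mmap_base' is where this process mapped the fd. */
  struct mem_region {
      uint64_t guest_addr;
      uint64_t size;
      uint64_t user_addr;
      uint64_t mmap_offset;
      uint8_t *mmap_base;
  };

  /* Translate a guest address to a local pointer, or NULL if no region
   * contains it. */
  void *guest_to_local(const struct mem_region *regions, size_t n,
                       uint64_t guest_addr)
  {
      for (size_t i = 0; i < n; i++) {
          const struct mem_region *r = &regions[i];

          if (guest_addr >= r->guest_addr &&
              guest_addr - r->guest_addr < r->size) {
              return r->mmap_base + r->mmap_offset +
                     (guest_addr - r->guest_addr);
          }
      }
      return NULL;
  }

  /* Same lookup keyed on the user address of the master. */
  void *user_to_local(const struct mem_region *regions, size_t n,
                      uint64_t user_addr)
  {
      for (size_t i = 0; i < n; i++) {
          const struct mem_region *r = &regions[i];

          if (user_addr >= r->user_addr &&
              user_addr - r->user_addr < r->size) {
              return r->mmap_base + r->mmap_offset +
                     (user_addr - r->user_addr);
          }
      }
      return NULL;
  }

Both helpers only translate a single address; a caller that accesses a
buffer also has to check that the whole buffer lies inside one region.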

IOMMU support
-------------

When the ``VIRTIO_F_IOMMU_PLATFORM`` feature has been negotiated, the
master sends IOTLB entry updates and invalidations by sending
``VHOST_USER_IOTLB_MSG`` requests to the slave with a ``struct
vhost_iotlb_msg`` as payload. For update events, the ``iotlb`` payload
has to be filled with the update message type (2), the I/O virtual
address, the size, the user virtual address, and the permissions
flags. Addresses and size must be within vhost memory regions set via
the ``VHOST_USER_SET_MEM_TABLE`` request. For invalidation events, the
``iotlb`` payload has to be filled with the invalidation message type
(3), the I/O virtual address and the size. On success, the slave is
expected to reply with a zero payload, non-zero otherwise.
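
The two payloads described above can be sketched as follows. The struct
``iotlb_msg`` is an illustrative stand-in that follows the field order of
the IOTLB message description; it is not the kernel's ``struct
vhost_iotlb_msg`` definition, and the enumerator and function names are
hypothetical.

.. code:: c

  #include <stdint.h>
  #include <string.h>

  /* Values taken from the IOTLB message payload description. */
  enum { IOTLB_PERM_NONE = 0, IOTLB_PERM_RO = 1,
         IOTLB_PERM_WO = 2, IOTLB_PERM_RW = 3 };
  enum { IOTLB_MISS = 1, IOTLB_UPDATE = 2,
         IOTLB_INVALIDATE = 3, IOTLB_ACCESS_FAIL = 4 };

  /* Illustrative stand-in for the IOTLB payload. */
  struct iotlb_msg {
      uint64_t iova;
      uint64_t size;
      uint64_t uaddr;
      uint8_t perm;
      uint8_t type;
  };

  /* Update: type, IOVA, size, user address and permissions are filled. */
  struct iotlb_msg iotlb_make_update(uint64_t iova, uint64_t size,
                                     uint64_t uaddr, uint8_t perm)
  {
      struct iotlb_msg m;

      memset(&m, 0, sizeof(m));
      m.iova = iova;
      m.size = size;
      m.uaddr = uaddr;
      m.perm = perm;
      m.type = IOTLB_UPDATE;
      return m;
  }

  /* Invalidate: only type, IOVA and size are filled. */
  struct iotlb_msg iotlb_make_invalidate(uint64_t iova, uint64_t size)
  {
      struct iotlb_msg m;

      memset(&m, 0, sizeof(m));
      m.iova = iova;
      m.size = size;
      m.type = IOTLB_INVALIDATE;
      return m;
  }

Zeroing the structure first keeps the unused fields and any padding at a
defined value before the message is sent.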

The slave relies on the slave communication channel (see :ref:`Slave
communication <slave_communication>` section below) to send IOTLB miss
and access failure events, by sending ``VHOST_USER_SLAVE_IOTLB_MSG``
requests to the master with a ``struct vhost_iotlb_msg`` as
payload. For miss events, the iotlb payload has to be filled with the
miss message type (1), the I/O virtual address and the permissions
flags. For access failure events, the iotlb payload has to be filled
with the access failure message type (4), the I/O virtual address and
the permissions flags.  For synchronization purposes, the slave may
rely on the reply-ack feature, so the master may send a reply when
the operation is completed if the reply-ack feature is negotiated and
the slave requests a reply. For miss events, a completed operation means
either the master sent an update message containing the IOTLB entry
containing the requested address and permission, or the master sent
nothing if the IOTLB miss message is invalid (invalid IOVA or permission).

The master isn't expected to take the initiative to send IOTLB update
messages, as the slave sends IOTLB miss messages for the guest virtual
memory areas it needs to access.

.. _slave_communication:

Slave communication
-------------------

An optional communication channel is provided if the slave declares
``VHOST_USER_PROTOCOL_F_SLAVE_REQ`` protocol feature, to allow the
slave to make requests to the master.

The fd is provided via ``VHOST_USER_SET_SLAVE_REQ_FD`` ancillary data.

A slave may then send ``VHOST_USER_SLAVE_*`` messages to the master
using this fd communication channel.

If ``VHOST_USER_PROTOCOL_F_SLAVE_SEND_FD`` protocol feature is
negotiated, slave can send file descriptors (at most 8 descriptors in
each message) to master via ancillary data using this fd communication
channel.

Inflight I/O tracking
---------------------

To support reconnecting after restart or crash, the slave may need to
resubmit inflight I/Os. If the virtqueue is processed in order, this
can easily be achieved by getting the inflight descriptors from the
descriptor table (split virtqueue) or descriptor ring (packed
virtqueue). However, this does not work when descriptors are processed
out-of-order, because some entries which store the information of
inflight descriptors in the available ring (split virtqueue) or
descriptor ring (packed virtqueue) might be overwritten by new entries.
To solve this problem, the slave needs to allocate an extra buffer to
store this information of inflight descriptors and share it with the
master for persistence. ``VHOST_USER_GET_INFLIGHT_FD`` and
``VHOST_USER_SET_INFLIGHT_FD`` are used to transfer this buffer
between master and slave. The format of this buffer is described
below:

+---------------+---------------+-----+---------------+
| queue0 region | queue1 region | ... | queueN region |
+---------------+---------------+-----+---------------+

N is the number of available virtqueues. Slave could get it from num
queues field of ``VhostUserInflight``.

For split virtqueue, queue region can be implemented as:

.. code:: c

  typedef struct DescStateSplit {
      /* Indicate whether this descriptor is inflight or not.
       * Only available for head-descriptor. */
      uint8_t inflight;

      /* Padding */
      uint8_t padding[5];

      /* Maintain a list for the last batch of used descriptors.
       * Only available when batching is used for submitting */
      uint16_t next;

      /* Used to preserve the order of fetching available descriptors.
       * Only available for head-descriptor. */
      uint64_t counter;
  } DescStateSplit;

  typedef struct QueueRegionSplit {
      /* The feature flags of this region. Now it's initialized to 0. */
      uint64_t features;

      /* The version of this region. It's 1 currently.
       * Zero value indicates an uninitialized buffer */
      uint16_t version;

      /* The size of DescStateSplit array. It's equal to the virtqueue
       * size. Slave could get it from queue size field of VhostUserInflight. */
      uint16_t desc_num;

      /* The head of list that track the last batch of used descriptors. */
      uint16_t last_batch_head;

      /* Store the idx value of used ring */
      uint16_t used_idx;

      /* Used to track the state of each descriptor in descriptor table */
      DescStateSplit desc[];
  } QueueRegionSplit;

To track inflight I/O, the queue region should be processed as follows:

When receiving available buffers from the driver:

#. Get the next available head-descriptor index from available ring, ``i``

#. Set ``desc[i].counter`` to the value of global counter

#. Increase global counter by 1

#. Set ``desc[i].inflight`` to 1

When supplying used buffers to the driver:

1. Get corresponding used head-descriptor index, ``i``

2. Set ``desc[i].next`` to ``last_batch_head``

3. Set ``last_batch_head`` to ``i``

#. Steps 1,2,3 may be performed repeatedly if batching is possible

#. Increase the ``idx`` value of used ring by the size of the batch

#. Set the ``inflight`` field of each ``DescStateSplit`` entry in the batch to 0

#. Set ``used_idx`` to the ``idx`` value of used ring

When reconnecting:

#. If the value of ``used_idx`` does not match the ``idx`` value of
   used ring (means the inflight field of ``DescStateSplit`` entries in
   last batch may be incorrect),

   a. Subtract the value of ``used_idx`` from the ``idx`` value of
      used ring to get last batch size of ``DescStateSplit`` entries

   #. Set the ``inflight`` field of each ``DescStateSplit`` entry to 0 in last batch
      list which starts from ``last_batch_head``

   #. Set ``used_idx`` to the ``idx`` value of used ring

#. Resubmit inflight ``DescStateSplit`` entries in order of their
   counter value
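
The split virtqueue steps above can be sketched in C as follows. This is a
simplified model, not a complete implementation: the region uses a fixed
``QUEUE_SIZE`` instead of ``desc_num``, the padding, ``features`` and
``version`` fields are left out, and the used ring itself is represented
only by the ``ring_used_idx`` argument. All type and function names are
hypothetical.

.. code:: c

  #include <stdint.h>

  #define QUEUE_SIZE 8   /* illustrative; a real region uses desc_num */

  struct desc_state {
      uint8_t inflight;
      uint16_t next;
      uint64_t counter;
  };

  struct queue_region {
      uint16_t last_batch_head;
      uint16_t used_idx;
      struct desc_state desc[QUEUE_SIZE];
  };

  /* Receiving an available buffer whose head-descriptor index is 'i'. */
  void inflight_get(struct queue_region *q, uint64_t *global_counter, uint16_t i)
  {
      q->desc[i].counter = (*global_counter)++;
      q->desc[i].inflight = 1;
  }

  /* Steps 1-3 of supplying a used buffer: link 'i' into the batch list. */
  void inflight_pre_put(struct queue_region *q, uint16_t i)
  {
      q->desc[i].next = q->last_batch_head;
      q->last_batch_head = i;
  }

  /* Remaining steps, run after the idx of the used ring was increased
   * to 'ring_used_idx': clear the batch and record the new idx. */
  void inflight_post_put(struct queue_region *q, uint16_t ring_used_idx)
  {
      uint16_t batch = (uint16_t)(ring_used_idx - q->used_idx);
      uint16_t i = q->last_batch_head;

      while (batch--) {
          q->desc[i].inflight = 0;
          i = q->desc[i].next;
      }
      q->used_idx = ring_used_idx;
  }

  /* Reconnect: repair the last batch if needed, then write the inflight
   * head indexes to 'out' ordered by counter. Returns their number. */
  int inflight_recover(struct queue_region *q, uint16_t ring_used_idx,
                       uint16_t *out)
  {
      int n = 0;

      if (q->used_idx != ring_used_idx) {
          inflight_post_put(q, ring_used_idx);
      }
      for (uint16_t i = 0; i < QUEUE_SIZE; i++) {
          if (q->desc[i].inflight) {
              out[n++] = i;
          }
      }
      /* Insertion sort by counter value, oldest first. */
      for (int a = 1; a < n; a++) {
          uint16_t v = out[a];
          int b = a - 1;

          while (b >= 0 && q->desc[out[b]].counter > q->desc[v].counter) {
              out[b + 1] = out[b];
              b--;
          }
          out[b + 1] = v;
      }
      return n;
  }

Recovery reuses ``inflight_post_put`` because the repair steps for a
mismatching ``used_idx`` are the same as the final steps of supplying a
batch of used buffers.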

For packed virtqueue, queue region can be implemented as:

.. code:: c

  typedef struct DescStatePacked {
      /* Indicate whether this descriptor is inflight or not.
       * Only available for head-descriptor. */
      uint8_t inflight;

      /* Padding */
      uint8_t padding;

      /* Link to the next free entry */
      uint16_t next;

      /* Link to the last entry of descriptor list.
       * Only available for head-descriptor. */
      uint16_t last;

      /* The length of descriptor list.
       * Only available for head-descriptor. */
      uint16_t num;

      /* Used to preserve the order of fetching available descriptors.
       * Only available for head-descriptor. */
      uint64_t counter;

      /* The buffer id */
      uint16_t id;

      /* The descriptor flags */
      uint16_t flags;

      /* The buffer length */
      uint32_t len;

      /* The buffer address */
      uint64_t addr;
  } DescStatePacked;

  typedef struct QueueRegionPacked {
      /* The feature flags of this region. Now it's initialized to 0. */
      uint64_t features;

      /* The version of this region. It's 1 currently.
       * Zero value indicates an uninitialized buffer */
      uint16_t version;

      /* The size of DescStatePacked array. It's equal to the virtqueue
       * size. Slave could get it from queue size field of VhostUserInflight. */
      uint16_t desc_num;

      /* The head of free DescStatePacked entry list */
      uint16_t free_head;

      /* The old head of free DescStatePacked entry list */
      uint16_t old_free_head;

      /* The used index of descriptor ring */
      uint16_t used_idx;

      /* The old used index of descriptor ring */
      uint16_t old_used_idx;

      /* Device ring wrap counter */
      uint8_t used_wrap_counter;

      /* The old device ring wrap counter */
      uint8_t old_used_wrap_counter;

      /* Padding */
      uint8_t padding[7];

      /* Used to track the state of each descriptor fetched from descriptor ring */
      DescStatePacked desc[];
  } QueueRegionPacked;

To track inflight I/O, the queue region should be processed as follows:

When receiving available buffers from the driver:

#. Get the next available descriptor entry from descriptor ring, ``d``

#. If ``d`` is head descriptor,

   a. Set ``desc[old_free_head].num`` to 0

   #. Set ``desc[old_free_head].counter`` to the value of global counter

   #. Increase global counter by 1

   #. Set ``desc[old_free_head].inflight`` to 1

#. If ``d`` is last descriptor, set ``desc[old_free_head].last`` to
   ``free_head``

#. Increase ``desc[old_free_head].num`` by 1

#. Set ``desc[free_head].addr``, ``desc[free_head].len``,
   ``desc[free_head].flags``, ``desc[free_head].id`` to ``d.addr``,
   ``d.len``, ``d.flags``, ``d.id``

#. Set ``free_head`` to ``desc[free_head].next``

#. If ``d`` is last descriptor, set ``old_free_head`` to ``free_head``

When supplying used buffers to the driver:

1. Get corresponding used head-descriptor entry from descriptor ring,
   ``d``

2. Get corresponding ``DescStatePacked`` entry, ``e``

3. Set ``desc[e.last].next`` to ``free_head``

4. Set ``free_head`` to the index of ``e``

#. Steps 1,2,3,4 may be performed repeatedly if batching is possible

#. Increase ``used_idx`` by the size of the batch and update
   ``used_wrap_counter`` if needed

#. Update ``d.flags``

#. Set the ``inflight`` field of each head ``DescStatePacked`` entry
   in the batch to 0

#. Set ``old_free_head``,  ``old_used_idx``, ``old_used_wrap_counter``
   to ``free_head``, ``used_idx``, ``used_wrap_counter``

When reconnecting:

#. If ``used_idx`` does not match ``old_used_idx`` (means the
   ``inflight`` field of ``DescStatePacked`` entries in last batch may
   be incorrect),

   a. Get the next descriptor ring entry through ``old_used_idx``, ``d``

   #. Use ``old_used_wrap_counter`` to calculate the available flags

   #. If ``d.flags`` is not equal to the calculated flags value (means
      slave has submitted the buffer to guest driver before crash, so
      it has to commit the in-progress update), set ``old_free_head``,
      ``old_used_idx``, ``old_used_wrap_counter`` to ``free_head``,
      ``used_idx``, ``used_wrap_counter``

#. Set ``free_head``, ``used_idx``, ``used_wrap_counter`` to
   ``old_free_head``, ``old_used_idx``, ``old_used_wrap_counter``
   (roll back any in-progress update)

#. Set the ``inflight`` field of each ``DescStatePacked`` entry in
   free list to 0

#. Resubmit inflight ``DescStatePacked`` entries in order of their
   counter value

In-band notifications
---------------------

In some limited situations (e.g. for simulation) it is desirable to
have the kick, call and error (if used) signals done via in-band
messages instead of asynchronous eventfd notifications. This can be
done by negotiating the ``VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS``
protocol feature.

Note that because too many messages on the sockets can cause the
sending application(s) to block, it is not advised to use this feature
unless absolutely necessary. It is also considered an error to
negotiate this feature without also negotiating
``VHOST_USER_PROTOCOL_F_SLAVE_REQ`` and ``VHOST_USER_PROTOCOL_F_REPLY_ACK``:
the former is necessary for getting a message channel from the slave
to the master, while the latter needs to be used with the in-band
notification messages to block until they are processed, both to avoid
blocking later and for proper processing (at least in the simulation
use case). As it has no other way of signalling this error, the slave
should close the connection as a response to a
``VHOST_USER_SET_PROTOCOL_FEATURES`` message that sets the in-band
notifications feature flag without the other two.
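
The dependency described above can be checked by a slave when it handles
``VHOST_USER_SET_PROTOCOL_FEATURES``. The following is a minimal sketch;
the bit numbers are those of the protocol features list, while the
function names ``has_feature`` and ``protocol_features_valid`` are
hypothetical.

.. code:: c

  #include <stdbool.h>
  #include <stdint.h>

  #define VHOST_USER_PROTOCOL_F_REPLY_ACK             3
  #define VHOST_USER_PROTOCOL_F_SLAVE_REQ             5
  #define VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS 14

  /* Test a single bit of a 64-bit feature mask. */
  bool has_feature(uint64_t features, unsigned int bit)
  {
      return (features >> bit) & 1;
  }

  /* Returns false when in-band notifications are requested without both
   * prerequisites; the slave should then close the connection. */
  bool protocol_features_valid(uint64_t features)
  {
      if (!has_feature(features, VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS)) {
          return true;
      }
      return has_feature(features, VHOST_USER_PROTOCOL_F_SLAVE_REQ) &&
             has_feature(features, VHOST_USER_PROTOCOL_F_REPLY_ACK);
  }

The same ``has_feature`` helper works for the regular feature mask, for
example to test bit ``VHOST_USER_F_PROTOCOL_FEATURES``.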

Protocol features
-----------------

.. code:: c

  #define VHOST_USER_PROTOCOL_F_MQ                    0
  #define VHOST_USER_PROTOCOL_F_LOG_SHMFD             1
  #define VHOST_USER_PROTOCOL_F_RARP                  2
  #define VHOST_USER_PROTOCOL_F_REPLY_ACK             3
  #define VHOST_USER_PROTOCOL_F_MTU                   4
  #define VHOST_USER_PROTOCOL_F_SLAVE_REQ             5
  #define VHOST_USER_PROTOCOL_F_CROSS_ENDIAN          6
  #define VHOST_USER_PROTOCOL_F_CRYPTO_SESSION        7
  #define VHOST_USER_PROTOCOL_F_PAGEFAULT             8
  #define VHOST_USER_PROTOCOL_F_CONFIG                9
  #define VHOST_USER_PROTOCOL_F_SLAVE_SEND_FD        10
  #define VHOST_USER_PROTOCOL_F_HOST_NOTIFIER        11
  #define VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD       12
  #define VHOST_USER_PROTOCOL_F_RESET_DEVICE         13
  #define VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS 14
  #define VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS  15
  #define VHOST_USER_PROTOCOL_F_STATUS               16
861
862Master message types
863--------------------
864
865``VHOST_USER_GET_FEATURES``
866  :id: 1
867  :equivalent ioctl: ``VHOST_GET_FEATURES``
868  :master payload: N/A
869  :slave payload: ``u64``
870
871  Get from the underlying vhost implementation the features bitmask.
872  Feature bit ``VHOST_USER_F_PROTOCOL_FEATURES`` signals slave support
873  for ``VHOST_USER_GET_PROTOCOL_FEATURES`` and
874  ``VHOST_USER_SET_PROTOCOL_FEATURES``.
875
876``VHOST_USER_SET_FEATURES``
877  :id: 2
878  :equivalent ioctl: ``VHOST_SET_FEATURES``
879  :master payload: ``u64``
880
881  Enable features in the underlying vhost implementation using a
882  bitmask.  Feature bit ``VHOST_USER_F_PROTOCOL_FEATURES`` signals
883  slave support for ``VHOST_USER_GET_PROTOCOL_FEATURES`` and
884  ``VHOST_USER_SET_PROTOCOL_FEATURES``.
885
886``VHOST_USER_GET_PROTOCOL_FEATURES``
887  :id: 15
888  :equivalent ioctl: ``VHOST_GET_FEATURES``
889  :master payload: N/A
890  :slave payload: ``u64``
891
892  Get the protocol feature bitmask from the underlying vhost
893  implementation.  Only legal if feature bit
894  ``VHOST_USER_F_PROTOCOL_FEATURES`` is present in
895  ``VHOST_USER_GET_FEATURES``.
896
.. Note::
   A slave that reported ``VHOST_USER_F_PROTOCOL_FEATURES`` must
   support this message even before ``VHOST_USER_SET_FEATURES`` has
   been called.
901
902``VHOST_USER_SET_PROTOCOL_FEATURES``
903  :id: 16
904  :equivalent ioctl: ``VHOST_SET_FEATURES``
905  :master payload: ``u64``
906
907  Enable protocol features in the underlying vhost implementation.
908
909  Only legal if feature bit ``VHOST_USER_F_PROTOCOL_FEATURES`` is present in
910  ``VHOST_USER_GET_FEATURES``.
911
.. Note::
   A slave that reported ``VHOST_USER_F_PROTOCOL_FEATURES`` must support
   this message even before ``VHOST_USER_SET_FEATURES`` has been called.
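
The gating of the two protocol feature messages on
``VHOST_USER_F_PROTOCOL_FEATURES`` can be sketched from the master's
side as follows. This is an illustrative sketch only: the helper names
are hypothetical, the bit number 30 for
``VHOST_USER_F_PROTOCOL_FEATURES`` is taken from the feature bit
definition elsewhere in this specification, and computing the set to
enable as the intersection of what both sides support is an assumption
about a reasonable master, not a requirement stated here.

```c
#include <stdbool.h>
#include <stdint.h>

#define VHOST_USER_F_PROTOCOL_FEATURES 30

/*
 * Hypothetical helper: true if the features bitmask returned by
 * VHOST_USER_GET_FEATURES allows the master to send
 * VHOST_USER_GET_PROTOCOL_FEATURES / VHOST_USER_SET_PROTOCOL_FEATURES.
 */
static bool protocol_features_allowed(uint64_t features)
{
    return (features >> VHOST_USER_F_PROTOCOL_FEATURES) & 1;
}

/*
 * Hypothetical helper (assumption): enable only the protocol features
 * that both the slave offers and the master implements.
 */
static uint64_t protocol_features_to_set(uint64_t slave_offered,
                                         uint64_t master_supported)
{
    return slave_offered & master_supported;
}
```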
915
916``VHOST_USER_SET_OWNER``
917  :id: 3
918  :equivalent ioctl: ``VHOST_SET_OWNER``
919  :master payload: N/A
920
921  Issued when a new connection is established. It sets the current
922  *master* as an owner of the session. This can be used on the *slave*
923  as a "session start" flag.
924
925``VHOST_USER_RESET_OWNER``
926  :id: 4
927  :master payload: N/A
928
929.. admonition:: Deprecated
930
931   This is no longer used. Used to be sent to request disabling all
932   rings, but some clients interpreted it to also discard connection
933   state (this interpretation would lead to bugs).  It is recommended
934   that clients either ignore this message, or use it to disable all
935   rings.
936
937``VHOST_USER_SET_MEM_TABLE``
938  :id: 5
939  :equivalent ioctl: ``VHOST_SET_MEM_TABLE``
940  :master payload: memory regions description
941  :slave payload: (postcopy only) memory regions description
942
943  Sets the memory map regions on the slave so it can translate the
944  vring addresses. In the ancillary data there is an array of file
945  descriptors for each memory mapped region. The size and ordering of
946  the fds matches the number and ordering of memory regions.
947
948  When ``VHOST_USER_POSTCOPY_LISTEN`` has been received,
949  ``SET_MEM_TABLE`` replies with the bases of the memory mapped
950  regions to the master.  The slave must have mmap'd the regions but
951  not yet accessed them and should not yet generate a userfault
952  event.
953
954.. Note::
955   ``NEED_REPLY_MASK`` is not set in this case.  QEMU will then
956   reply back to the list of mappings with an empty
957   ``VHOST_USER_SET_MEM_TABLE`` as an acknowledgement; only upon
958   reception of this message may the guest start accessing the memory
959   and generating faults.
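
As an illustration of how a slave might use the memory map received
with ``VHOST_USER_SET_MEM_TABLE``, the sketch below looks up an address
in a table of regions and converts it to a local pointer. It rests on
several assumptions that are not stated in this section: the struct
``mem_region`` and the function ``translate_user_addr`` are
hypothetical, the address being translated is expressed in the
master's user address space, and each file descriptor was mapped from
offset 0 so that the region data starts at ``mmap_base + mmap_offset``.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-region bookkeeping kept by the slave. */
struct mem_region {
    uint64_t user_addr;   /* region start in the master's address space */
    uint64_t size;        /* region size in bytes */
    uint64_t mmap_offset; /* offset of the region inside the mapped fd */
    uint8_t *mmap_base;   /* address returned by mmap() in the slave */
};

/*
 * Return a local pointer for addr, or NULL if no region contains it.
 */
static void *translate_user_addr(const struct mem_region *regions,
                                 size_t n, uint64_t addr)
{
    for (size_t i = 0; i < n; i++) {
        const struct mem_region *r = &regions[i];

        /* addr - user_addr cannot underflow after the first test. */
        if (addr >= r->user_addr && addr - r->user_addr < r->size) {
            return r->mmap_base + r->mmap_offset + (addr - r->user_addr);
        }
    }
    return NULL;
}
```

Returning ``NULL`` for an unmapped address lets the caller treat the
request as an error instead of dereferencing an arbitrary pointer.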
960
961``VHOST_USER_SET_LOG_BASE``
962  :id: 6
963  :equivalent ioctl: ``VHOST_SET_LOG_BASE``
964  :master payload: u64
965  :slave payload: N/A
966
  Sets the logging shared memory space.

  When the slave has the ``VHOST_USER_PROTOCOL_F_LOG_SHMFD`` protocol
  feature, the log memory fd is provided in the ancillary data of the
  ``VHOST_USER_SET_LOG_BASE`` message, and the size and offset of the
  shared memory area are provided in the message.
973
974``VHOST_USER_SET_LOG_FD``
975  :id: 7
976  :equivalent ioctl: ``VHOST_SET_LOG_FD``
977  :master payload: N/A
978
979  Sets the logging file descriptor, which is passed as ancillary data.
980
981``VHOST_USER_SET_VRING_NUM``
982  :id: 8
983  :equivalent ioctl: ``VHOST_SET_VRING_NUM``
984  :master payload: vring state description
985
986  Set the size of the queue.
987
988``VHOST_USER_SET_VRING_ADDR``
989  :id: 9
990  :equivalent ioctl: ``VHOST_SET_VRING_ADDR``
991  :master payload: vring address description
992  :slave payload: N/A
993
994  Sets the addresses of the different aspects of the vring.
995
996``VHOST_USER_SET_VRING_BASE``
997  :id: 10
998  :equivalent ioctl: ``VHOST_SET_VRING_BASE``
999  :master payload: vring state description
1000
1001  Sets the base offset in the available vring.
1002
1003``VHOST_USER_GET_VRING_BASE``
1004  :id: 11
  :equivalent ioctl: ``VHOST_GET_VRING_BASE``
1006  :master payload: vring state description
1007  :slave payload: vring state description
1008
1009  Get the available vring base offset.
1010
1011``VHOST_USER_SET_VRING_KICK``
1012  :id: 12
1013  :equivalent ioctl: ``VHOST_SET_VRING_KICK``
1014  :master payload: ``u64``
1015
1016  Set the event file descriptor for adding buffers to the vring. It is
1017  passed in the ancillary data.
1018
  Bits (0-7) of the payload contain the vring index. Bit 8 is the
  invalid FD flag. This flag is set when there is no file descriptor
  in the ancillary data. This signals that polling should be used
  instead of waiting for the kick. Note that if the protocol feature
  ``VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS`` has been negotiated,
  this message isn't necessary, as the ring is also started on the
  ``VHOST_USER_VRING_KICK`` message. It may, however, still be used to
  set an event file descriptor (which will be preferred over the
  message) or to enable polling.
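
The payload layout described above (vring index in bits 0-7, invalid FD
flag in bit 8) can be decoded as in the following sketch. The macro,
struct and function names are hypothetical and chosen for illustration
only.

```c
#include <stdbool.h>
#include <stdint.h>

#define VRING_IDX_MASK   0xffULL      /* bits 0-7: vring index */
#define VRING_NOFD_MASK  (1ULL << 8)  /* bit 8: invalid FD flag */

/* Hypothetical decoded form of the u64 payload. */
struct vring_fd_msg {
    unsigned int index; /* vring the message applies to */
    bool has_fd;        /* false when the invalid FD flag is set */
};

static struct vring_fd_msg decode_vring_fd_payload(uint64_t payload)
{
    struct vring_fd_msg msg;

    msg.index = (unsigned int)(payload & VRING_IDX_MASK);
    msg.has_fd = !(payload & VRING_NOFD_MASK);
    return msg;
}
```

When ``has_fd`` is false, no file descriptor should be expected in the
ancillary data.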
1028
1029``VHOST_USER_SET_VRING_CALL``
1030  :id: 13
1031  :equivalent ioctl: ``VHOST_SET_VRING_CALL``
1032  :master payload: ``u64``
1033
1034  Set the event file descriptor to signal when buffers are used. It is
1035  passed in the ancillary data.
1036
  Bits (0-7) of the payload contain the vring index. Bit 8 is the
  invalid FD flag. This flag is set when there is no file descriptor
  in the ancillary data. This signals that polling will be used
  instead of waiting for the call. Note that if the protocol features
  ``VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS`` and
  ``VHOST_USER_PROTOCOL_F_SLAVE_REQ`` have been negotiated, this message
  isn't necessary, as the ``VHOST_USER_SLAVE_VRING_CALL`` message can be
  used. It may, however, still be used to set an event file descriptor
  or to enable polling.
1046
1047``VHOST_USER_SET_VRING_ERR``
1048  :id: 14
1049  :equivalent ioctl: ``VHOST_SET_VRING_ERR``
1050  :master payload: ``u64``
1051
  Set the event file descriptor to signal when an error occurs. It is
  passed in the ancillary data.

  Bits (0-7) of the payload contain the vring index. Bit 8 is the
  invalid FD flag. This flag is set when there is no file descriptor
  in the ancillary data. Note that if the protocol features
  ``VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS`` and
  ``VHOST_USER_PROTOCOL_F_SLAVE_REQ`` have been negotiated, this message
  isn't necessary, as the ``VHOST_USER_SLAVE_VRING_ERR`` message can be
  used. It may, however, still be used to set an event file descriptor
  (which will be preferred over the message).
1063
1064``VHOST_USER_GET_QUEUE_NUM``
1065  :id: 17
1066  :equivalent ioctl: N/A
1067  :master payload: N/A
1068  :slave payload: u64
1069
1070  Query how many queues the backend supports.
1071
1072  This request should be sent only when ``VHOST_USER_PROTOCOL_F_MQ``
1073  is set in queried protocol features by
1074  ``VHOST_USER_GET_PROTOCOL_FEATURES``.
1075
1076``VHOST_USER_SET_VRING_ENABLE``
1077  :id: 18
1078  :equivalent ioctl: N/A
1079  :master payload: vring state description
1080
  Signal the slave to enable or disable the corresponding vring.
1082
1083  This request should be sent only when
1084  ``VHOST_USER_F_PROTOCOL_FEATURES`` has been negotiated.
1085
1086``VHOST_USER_SEND_RARP``
1087  :id: 19
1088  :equivalent ioctl: N/A
1089  :master payload: ``u64``
1090
  Ask the vhost-user backend to broadcast a fake RARP to notify that the
  migration has terminated, for guests that do not support
  GUEST_ANNOUNCE.

  Only legal if feature bit ``VHOST_USER_F_PROTOCOL_FEATURES`` is
  present in ``VHOST_USER_GET_FEATURES`` and protocol feature bit
  ``VHOST_USER_PROTOCOL_F_RARP`` is present in
  ``VHOST_USER_GET_PROTOCOL_FEATURES``.  The first 6 bytes of the
  payload contain the MAC address of the guest to allow the vhost-user
  backend to construct and broadcast the fake RARP.
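
Extracting the MAC address from the ``u64`` payload could look like the
sketch below. The function name ``rarp_payload_to_mac`` is
hypothetical, and the sketch assumes that "the first 6 bytes" means the
first 6 bytes of the payload as laid out in memory, which is why it
copies bytes instead of shifting the integer value.

```c
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: copy the guest MAC address out of the payload.
 * memcpy() takes the bytes in memory order, independent of the host's
 * integer endianness.
 */
static void rarp_payload_to_mac(uint64_t payload, uint8_t mac[6])
{
    memcpy(mac, &payload, 6);
}
```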
1100
1101``VHOST_USER_NET_SET_MTU``
1102  :id: 20
1103  :equivalent ioctl: N/A
1104  :master payload: ``u64``
1105
1106  Set host MTU value exposed to the guest.
1107
1108  This request should be sent only when ``VIRTIO_NET_F_MTU`` feature
1109  has been successfully negotiated, ``VHOST_USER_F_PROTOCOL_FEATURES``
1110  is present in ``VHOST_USER_GET_FEATURES`` and protocol feature bit
1111  ``VHOST_USER_PROTOCOL_F_NET_MTU`` is present in
1112  ``VHOST_USER_GET_PROTOCOL_FEATURES``.
1113
1114  If ``VHOST_USER_PROTOCOL_F_REPLY_ACK`` is negotiated, slave must
1115  respond with zero in case the specified MTU is valid, or non-zero
1116  otherwise.
1117
1118``VHOST_USER_SET_SLAVE_REQ_FD``
1119  :id: 21
1120  :equivalent ioctl: N/A
1121  :master payload: N/A
1122
1123  Set the socket file descriptor for slave initiated requests. It is passed
1124  in the ancillary data.
1125
  This request should be sent only when
  ``VHOST_USER_F_PROTOCOL_FEATURES`` has been negotiated, and protocol
  feature bit ``VHOST_USER_PROTOCOL_F_SLAVE_REQ`` is present in
  ``VHOST_USER_GET_PROTOCOL_FEATURES``.  If
  ``VHOST_USER_PROTOCOL_F_REPLY_ACK`` is negotiated, the slave must
  respond with zero for success, non-zero otherwise.
1132
1133``VHOST_USER_IOTLB_MSG``
1134  :id: 22
1135  :equivalent ioctl: N/A (equivalent to ``VHOST_IOTLB_MSG`` message type)
1136  :master payload: ``struct vhost_iotlb_msg``
1137  :slave payload: ``u64``
1138
1139  Send IOTLB messages with ``struct vhost_iotlb_msg`` as payload.
1140
  The master sends such requests to update and invalidate entries in the
  device IOTLB. The slave has to acknowledge the request by sending
  zero as ``u64`` payload for success, non-zero otherwise.

  This request should be sent only when the ``VIRTIO_F_IOMMU_PLATFORM``
  feature has been successfully negotiated.
1147
1148``VHOST_USER_SET_VRING_ENDIAN``
1149  :id: 23
1150  :equivalent ioctl: ``VHOST_SET_VRING_ENDIAN``
1151  :master payload: vring state description
1152
1153  Set the endianness of a VQ for legacy devices. Little-endian is
1154  indicated with state.num set to 0 and big-endian is indicated with
1155  state.num set to 1. Other values are invalid.
1156
  This request should be sent only when
  ``VHOST_USER_PROTOCOL_F_CROSS_ENDIAN`` has been negotiated.
  Backends that negotiated this feature should handle both
  endiannesses and expect this message once (per VQ) during device
  configuration (i.e. before the master starts the VQ).
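
A backend might validate the ``state.num`` value of
``VHOST_USER_SET_VRING_ENDIAN`` as sketched below. The enum and
function names are hypothetical, and treating ``state.num`` as a 32-bit
unsigned value is an assumption based on the vring state description.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical representation of a virtqueue's endianness. */
enum vq_endian {
    VQ_LITTLE_ENDIAN,
    VQ_BIG_ENDIAN
};

/*
 * Map state.num to an endianness. Returns false for any value other
 * than 0 (little-endian) or 1 (big-endian), leaving *out untouched.
 */
static bool parse_vring_endian(uint32_t num, enum vq_endian *out)
{
    switch (num) {
    case 0:
        *out = VQ_LITTLE_ENDIAN;
        return true;
    case 1:
        *out = VQ_BIG_ENDIAN;
        return true;
    default:
        return false;
    }
}
```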
1162
1163``VHOST_USER_GET_CONFIG``
1164  :id: 24
1165  :equivalent ioctl: N/A
1166  :master payload: virtio device config space
1167  :slave payload: virtio device config space
1168
  When ``VHOST_USER_PROTOCOL_F_CONFIG`` is negotiated, this message is
  submitted by the vhost-user master to fetch the contents of the
  virtio device configuration space. The vhost-user slave's payload
  size MUST match the master's request. The vhost-user slave uses a
  zero-length payload to indicate an error to the vhost-user
  master. The vhost-user master may cache the contents to avoid
  repeated ``VHOST_USER_GET_CONFIG`` calls.
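
On the master side, the two rules for the ``VHOST_USER_GET_CONFIG``
reply (matching size, zero length meaning error) could be checked as in
this sketch. The enum and function names are hypothetical, and the
sizes are assumed to be the payload sizes carried in the message
headers of the request and the reply.

```c
#include <stdint.h>

/* Hypothetical classification of a GET_CONFIG reply. */
enum config_reply_status {
    CONFIG_REPLY_OK,
    CONFIG_REPLY_SLAVE_ERROR,   /* slave sent a zero-length payload */
    CONFIG_REPLY_SIZE_MISMATCH  /* reply size differs from the request */
};

static enum config_reply_status
check_get_config_reply(uint32_t request_size, uint32_t reply_size)
{
    if (reply_size == 0) {
        return CONFIG_REPLY_SLAVE_ERROR;
    }
    if (reply_size != request_size) {
        return CONFIG_REPLY_SIZE_MISMATCH;
    }
    return CONFIG_REPLY_OK;
}
```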
1176
1177``VHOST_USER_SET_CONFIG``
1178  :id: 25
1179  :equivalent ioctl: N/A
1180  :master payload: virtio device config space
1181  :slave payload: N/A
1182
  When ``VHOST_USER_PROTOCOL_F_CONFIG`` is negotiated, this message is
  submitted by the vhost-user master when the guest changes the virtio
  device configuration space. It can also be used for live migration
  on the destination host. The vhost-user slave must check the flags
  field, and slaves MUST NOT accept SET_CONFIG for read-only
  configuration space fields unless the live migration bit is set.
1189
1190``VHOST_USER_CREATE_CRYPTO_SESSION``
1191  :id: 26
1192  :equivalent ioctl: N/A
1193  :master payload: crypto session description
1194  :slave payload: crypto session description
1195
1196  Create a session for crypto operation. The server side must return
1197  the session id, 0 or positive for success, negative for failure.
1198  This request should be sent only when
1199  ``VHOST_USER_PROTOCOL_F_CRYPTO_SESSION`` feature has been
1200  successfully negotiated.  It's a required feature for crypto
1201  devices.
1202
1203``VHOST_USER_CLOSE_CRYPTO_SESSION``
1204  :id: 27
1205  :equivalent ioctl: N/A
1206  :master payload: ``u64``
1207
1208  Close a session for crypto operation which was previously
1209  created by ``VHOST_USER_CREATE_CRYPTO_SESSION``.
1210
1211  This request should be sent only when
1212  ``VHOST_USER_PROTOCOL_F_CRYPTO_SESSION`` feature has been
1213  successfully negotiated.  It's a required feature for crypto
1214  devices.
1215
1216``VHOST_USER_POSTCOPY_ADVISE``
1217  :id: 28
1218  :master payload: N/A
1219  :slave payload: userfault fd
1220
1221  When ``VHOST_USER_PROTOCOL_F_PAGEFAULT`` is supported, the master
1222  advises slave that a migration with postcopy enabled is underway,
1223  the slave must open a userfaultfd for later use.  Note that at this
1224  stage the migration is still in precopy mode.
1225
1226``VHOST_USER_POSTCOPY_LISTEN``
1227  :id: 29
1228  :master payload: N/A
1229
1230  Master advises slave that a transition to postcopy mode has
1231  happened.  The slave must ensure that shared memory is registered
1232  with userfaultfd to cause faulting of non-present pages.
1233
1234  This is always sent sometime after a ``VHOST_USER_POSTCOPY_ADVISE``,
1235  and thus only when ``VHOST_USER_PROTOCOL_F_PAGEFAULT`` is supported.
1236
1237``VHOST_USER_POSTCOPY_END``
1238  :id: 30
1239  :slave payload: ``u64``
1240
1241  Master advises that postcopy migration has now completed.  The slave
1242  must disable the userfaultfd. The response is an acknowledgement
1243  only.
1244
1245  When ``VHOST_USER_PROTOCOL_F_PAGEFAULT`` is supported, this message
1246  is sent at the end of the migration, after
1247  ``VHOST_USER_POSTCOPY_LISTEN`` was previously sent.
1248
1249  The value returned is an error indication; 0 is success.
1250
1251``VHOST_USER_GET_INFLIGHT_FD``
1252  :id: 31
1253  :equivalent ioctl: N/A
1254  :master payload: inflight description
1255
  When the ``VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD`` protocol feature has
  been successfully negotiated, this message is submitted by the master
  to get a shared buffer from the slave. The shared buffer will be used
  by the slave to track inflight I/O. QEMU should retrieve a new one
  when the VM is reset.
1261
1262``VHOST_USER_SET_INFLIGHT_FD``
1263  :id: 32
1264  :equivalent ioctl: N/A
1265  :master payload: inflight description
1266
  When the ``VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD`` protocol feature has
  been successfully negotiated, this message is submitted by the master
  to send the shared inflight buffer back to the slave so that the slave
  can retrieve inflight I/O after a crash or restart.
1271
1272``VHOST_USER_GPU_SET_SOCKET``
1273  :id: 33
1274  :equivalent ioctl: N/A
1275  :master payload: N/A
1276
1277  Sets the GPU protocol socket file descriptor, which is passed as
1278  ancillary data. The GPU protocol is used to inform the master of
1279  rendering state and updates. See vhost-user-gpu.rst for details.
1280
1281``VHOST_USER_RESET_DEVICE``
1282  :id: 34
1283  :equivalent ioctl: N/A
1284  :master payload: N/A
1285  :slave payload: N/A
1286
1287  Ask the vhost user backend to disable all rings and reset all
1288  internal device state to the initial state, ready to be
1289  reinitialized. The backend retains ownership of the device
1290  throughout the reset operation.
1291
1292  Only valid if the ``VHOST_USER_PROTOCOL_F_RESET_DEVICE`` protocol
1293  feature is set by the backend.
1294
1295``VHOST_USER_VRING_KICK``
1296  :id: 35
1297  :equivalent ioctl: N/A
  :master payload: vring state description
  :slave payload: N/A
1300
1301  When the ``VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS`` protocol
1302  feature has been successfully negotiated, this message may be
1303  submitted by the master to indicate that a buffer was added to
1304  the vring instead of signalling it using the vring's kick file
1305  descriptor or having the slave rely on polling.
1306
1307  The state.num field is currently reserved and must be set to 0.
1308
1309``VHOST_USER_GET_MAX_MEM_SLOTS``
1310  :id: 36
1311  :equivalent ioctl: N/A
1312  :slave payload: u64
1313
1314  When the ``VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS`` protocol
1315  feature has been successfully negotiated, this message is submitted
1316  by master to the slave. The slave should return the message with a
1317  u64 payload containing the maximum number of memory slots for
1318  QEMU to expose to the guest. The value returned by the backend
1319  will be capped at the maximum number of ram slots which can be
1320  supported by the target platform.
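
The capping described for ``VHOST_USER_GET_MAX_MEM_SLOTS`` amounts to
taking the smaller of two limits. A minimal sketch, with a hypothetical
function name and with the platform limit passed in as a parameter
because its source is not specified here:

```c
#include <stdint.h>

/*
 * Hypothetical helper: number of memory slots the master can use,
 * given the backend's reply and the target platform's own limit.
 */
static uint64_t usable_mem_slots(uint64_t backend_max,
                                 uint64_t platform_max)
{
    return backend_max < platform_max ? backend_max : platform_max;
}
```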
1321
1322``VHOST_USER_ADD_MEM_REG``
1323  :id: 37
1324  :equivalent ioctl: N/A
1325  :slave payload: single memory region description
1326
1327  When the ``VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS`` protocol
1328  feature has been successfully negotiated, this message is submitted
1329  by the master to the slave. The message payload contains a memory
1330  region descriptor struct, describing a region of guest memory which
1331  the slave device must map in. When the
1332  ``VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS`` protocol feature has
1333  been successfully negotiated, along with the
1334  ``VHOST_USER_REM_MEM_REG`` message, this message is used to set and
1335  update the memory tables of the slave device.
1336
1337``VHOST_USER_REM_MEM_REG``
1338  :id: 38
1339  :equivalent ioctl: N/A
1340  :slave payload: single memory region description
1341
1342  When the ``VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS`` protocol
1343  feature has been successfully negotiated, this message is submitted
1344  by the master to the slave. The message payload contains a memory
1345  region descriptor struct, describing a region of guest memory which
1346  the slave device must unmap. When the
1347  ``VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS`` protocol feature has
1348  been successfully negotiated, along with the
1349  ``VHOST_USER_ADD_MEM_REG`` message, this message is used to set and
1350  update the memory tables of the slave device.
1351
1352``VHOST_USER_SET_STATUS``
1353  :id: 39
1354  :equivalent ioctl: VHOST_VDPA_SET_STATUS
1355  :slave payload: N/A
1356  :master payload: ``u64``
1357
1358  When the ``VHOST_USER_PROTOCOL_F_STATUS`` protocol feature has been
1359  successfully negotiated, this message is submitted by the master to
1360  notify the backend with updated device status as defined in the Virtio
1361  specification.
1362
1363``VHOST_USER_GET_STATUS``
1364  :id: 40
1365  :equivalent ioctl: VHOST_VDPA_GET_STATUS
1366  :slave payload: ``u64``
1367  :master payload: N/A
1368
1369  When the ``VHOST_USER_PROTOCOL_F_STATUS`` protocol feature has been
1370  successfully negotiated, this message is submitted by the master to
1371  query the backend for its device status as defined in the Virtio
1372  specification.
1373
1374
1375Slave message types
1376-------------------
1377
1378``VHOST_USER_SLAVE_IOTLB_MSG``
1379  :id: 1
1380  :equivalent ioctl: N/A (equivalent to ``VHOST_IOTLB_MSG`` message type)
1381  :slave payload: ``struct vhost_iotlb_msg``
1382  :master payload: N/A
1383
  Send IOTLB messages with ``struct vhost_iotlb_msg`` as payload.
  The slave sends such requests to notify of an IOTLB miss, or an IOTLB
  access failure. If ``VHOST_USER_PROTOCOL_F_REPLY_ACK`` is
  negotiated, and the slave set the ``VHOST_USER_NEED_REPLY`` flag, the
  master must respond with zero when the operation is successfully
  completed, or non-zero otherwise.  This request should be sent only
  when the ``VIRTIO_F_IOMMU_PLATFORM`` feature has been successfully
  negotiated.
1392
1393``VHOST_USER_SLAVE_CONFIG_CHANGE_MSG``
1394  :id: 2
1395  :equivalent ioctl: N/A
1396  :slave payload: N/A
1397  :master payload: N/A
1398
  When ``VHOST_USER_PROTOCOL_F_CONFIG`` is negotiated, the vhost-user
  slave sends such messages to notify that the virtio device's
  configuration space has changed. For host devices that support
  this feature, the host driver can send a ``VHOST_USER_GET_CONFIG``
  message to the slave to get the latest content. If
  ``VHOST_USER_PROTOCOL_F_REPLY_ACK`` is negotiated, and the slave set
  the ``VHOST_USER_NEED_REPLY`` flag, the master must respond with zero
  when the operation is successfully completed, or non-zero otherwise.
1407
1408``VHOST_USER_SLAVE_VRING_HOST_NOTIFIER_MSG``
1409  :id: 3
1410  :equivalent ioctl: N/A
1411  :slave payload: vring area description
1412  :master payload: N/A
1413
1414  Sets host notifier for a specified queue. The queue index is
1415  contained in the ``u64`` field of the vring area description. The
1416  host notifier is described by the file descriptor (typically it's a
1417  VFIO device fd) which is passed as ancillary data and the size
1418  (which is mmap size and should be the same as host page size) and
1419  offset (which is mmap offset) carried in the vring area
1420  description. QEMU can mmap the file descriptor based on the size and
1421  offset to get a memory range. Registering a host notifier means
1422  mapping this memory range to the VM as the specified queue's notify
1423  MMIO region. Slave sends this request to tell QEMU to de-register
1424  the existing notifier if any and register the new notifier if the
1425  request is sent with a file descriptor.
1426
1427  This request should be sent only when
1428  ``VHOST_USER_PROTOCOL_F_HOST_NOTIFIER`` protocol feature has been
1429  successfully negotiated.
1430
1431``VHOST_USER_SLAVE_VRING_CALL``
1432  :id: 4
1433  :equivalent ioctl: N/A
1434  :slave payload: vring state description
1435  :master payload: N/A
1436
  When the ``VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS`` protocol
  feature has been successfully negotiated, this message may be
  submitted by the slave to indicate that a buffer was used from
  the vring instead of signalling this using the vring's call file
  descriptor or having the master rely on polling.
1442
1443  The state.num field is currently reserved and must be set to 0.
1444
1445``VHOST_USER_SLAVE_VRING_ERR``
1446  :id: 5
1447  :equivalent ioctl: N/A
1448  :slave payload: vring state description
1449  :master payload: N/A
1450
1451  When the ``VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS`` protocol
1452  feature has been successfully negotiated, this message may be
1453  submitted by the slave to indicate that an error occurred on the
1454  specific vring, instead of signalling the error file descriptor
1455  set by the master via ``VHOST_USER_SET_VRING_ERR``.
1456
1457  The state.num field is currently reserved and must be set to 0.
1458
1459.. _reply_ack:
1460
1461VHOST_USER_PROTOCOL_F_REPLY_ACK
1462-------------------------------
1463
1464The original vhost-user specification only demands replies for certain
1465commands. This differs from the vhost protocol implementation where
1466commands are sent over an ``ioctl()`` call and block until the client
1467has completed.
1468
With this protocol extension negotiated, the sender (QEMU) can set the
``need_reply`` [Bit 3] flag on any command. This indicates that the
client MUST respond with a Payload ``VhostUserMsg`` indicating success
or failure. The payload should be set to zero on success or non-zero
on failure, unless the message already has an explicit reply body.

The response payload gives QEMU a deterministic indication of the result
of the command. Today, QEMU is expected to terminate the main vhost-user
loop upon receiving such errors. In the future, QEMU could be taught to
be more resilient for selective requests.
1479
1480For the message types that already solicit a reply from the client,
1481the presence of ``VHOST_USER_PROTOCOL_F_REPLY_ACK`` or need_reply bit
1482being set brings no behavioural change. (See the Communication_
1483section for details.)
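
The flag handling described in this section can be sketched as follows.
The helper names are hypothetical, and the sketch assumes the flags
field of the message header is a 32-bit value in which ``need_reply``
is bit 3.

```c
#include <stdbool.h>
#include <stdint.h>

#define VHOST_USER_NEED_REPLY_MASK (1U << 3) /* need_reply, bit 3 */

/* Sender side: mark a request as requiring an acknowledgement. */
static uint32_t request_ack(uint32_t flags)
{
    return flags | VHOST_USER_NEED_REPLY_MASK;
}

/* Receiver side: does this request ask for an acknowledgement? */
static bool wants_ack(uint32_t flags)
{
    return (flags & VHOST_USER_NEED_REPLY_MASK) != 0;
}

/* Sender side: a zero u64 payload means success, non-zero failure. */
static bool ack_indicates_success(uint64_t reply_payload)
{
    return reply_payload == 0;
}
```

The flag should only be set once ``VHOST_USER_PROTOCOL_F_REPLY_ACK``
has been negotiated.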
1484
1485.. _backend_conventions:
1486
1487Backend program conventions
1488===========================
1489
1490vhost-user backends can provide various devices & services and may
1491need to be configured manually depending on the use case. However, it
1492is a good idea to follow the conventions listed here when
1493possible. Users, QEMU or libvirt, can then rely on some common
1494behaviour to avoid heterogeneous configuration and management of the
1495backend programs and facilitate interoperability.
1496
1497Each backend installed on a host system should come with at least one
1498JSON file that conforms to the vhost-user.json schema. Each file
1499informs the management applications about the backend type, and binary
1500location. In addition, it defines rules for management apps for
1501picking the highest priority backend when multiple match the search
1502criteria (see ``@VhostUserBackend`` documentation in the schema file).
1503
1504If the backend is not capable of enabling a requested feature on the
1505host (such as 3D acceleration with virgl), or the initialization
1506failed, the backend should fail to start early and exit with a status
1507!= 0. It may also print a message to stderr for further details.
1508
The backend program must not daemonize itself, but it may be
daemonized by the management layer. It may also have restricted
access to the system.
1512
1513File descriptors 0, 1 and 2 will exist, and have regular
1514stdin/stdout/stderr usage (they may have been redirected to /dev/null
1515by the management layer, or to a log handler).
1516
The backend program must end (as quickly and cleanly as possible) when
the SIGTERM signal is received. Eventually, it may receive SIGKILL from
the management layer after a few seconds.
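
One common way to meet the SIGTERM requirement is to set a flag from
the signal handler and have the main loop poll it. The sketch below
uses only ISO C ``signal()`` for brevity; the variable and function
names are hypothetical, and a real backend may prefer ``sigaction()``
or a signal file descriptor integrated with its event loop.

```c
#include <signal.h>

/* Set asynchronously by the handler, polled by the main loop. */
static volatile sig_atomic_t stop_requested;

static void on_sigterm(int signo)
{
    (void)signo;
    stop_requested = 1; /* only async-signal-safe work here */
}

/* Returns 0 on success, -1 if the handler could not be installed. */
static int install_sigterm_handler(void)
{
    return signal(SIGTERM, on_sigterm) == SIG_ERR ? -1 : 0;
}
```

The main loop would then check ``stop_requested`` between requests and
leave as soon as it becomes non-zero.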
1520
The following command line options have an expected behaviour. They
are mandatory, unless explicitly stated otherwise:
1523
1524--socket-path=PATH
1525
  This option specifies the location of the vhost-user Unix domain
  socket. It is incompatible with --fd.
1528
1529--fd=FDNUM
1530
1531  When this argument is given, the backend program is started with the
1532  vhost-user socket as file descriptor FDNUM. It is incompatible with
1533  --socket-path.
1534
1535--print-capabilities
1536
1537  Output to stdout the backend capabilities in JSON format, and then
1538  exit successfully. Other options and arguments should be ignored, and
1539  the backend program should not perform its normal function.  The
1540  capabilities can be reported dynamically depending on the host
1541  capabilities.
1542
The JSON output is described in the ``vhost-user.json`` schema, by
``@VHostUserBackendCapabilities``.  Example:

.. code:: json

  {
    "type": "foo",
    "features": [
      "feature-a",
      "feature-b"
    ]
  }
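
Parsing of the three common options could be sketched with
``getopt_long()`` as shown below. The struct and function names are
hypothetical. Requiring exactly one of ``--socket-path`` and ``--fd``
when ``--print-capabilities`` is absent is an assumption of this
sketch, as the text above only states that the two are incompatible.

```c
#include <getopt.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

struct backend_opts {
    const char *socket_path; /* --socket-path=PATH, or NULL */
    int fd;                  /* --fd=FDNUM, or -1 */
    bool print_capabilities; /* --print-capabilities */
};

/* Returns 0 on success, -1 on a usage error. */
static int parse_backend_opts(int argc, char **argv,
                              struct backend_opts *o)
{
    static const struct option longopts[] = {
        { "socket-path",        required_argument, NULL, 's' },
        { "fd",                 required_argument, NULL, 'f' },
        { "print-capabilities", no_argument,       NULL, 'c' },
        { NULL, 0, NULL, 0 }
    };
    bool bad = false;
    int c;

    o->socket_path = NULL;
    o->fd = -1;
    o->print_capabilities = false;
    optind = 1; /* restart scanning so the function can be reused */

    while ((c = getopt_long(argc, argv, "", longopts, NULL)) != -1) {
        char *end;
        long v;

        switch (c) {
        case 's':
            o->socket_path = optarg;
            break;
        case 'f':
            v = strtol(optarg, &end, 10);
            if (*optarg == '\0' || *end != '\0' || v < 0 || v > INT_MAX) {
                bad = true;
            } else {
                o->fd = (int)v;
            }
            break;
        case 'c':
            o->print_capabilities = true;
            break;
        default:
            bad = true;
            break;
        }
    }
    if (o->print_capabilities) {
        return 0; /* other options and arguments are ignored */
    }
    if (bad || (o->socket_path && o->fd >= 0)) {
        return -1; /* invalid, or --socket-path together with --fd */
    }
    /* Assumption: one of the two must be given to do useful work. */
    return (o->socket_path || o->fd >= 0) ? 0 : -1;
}
```

When ``print_capabilities`` is set, the program would write its JSON
capabilities to stdout and exit successfully without performing its
normal function.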
1555
1556vhost-user-input
1557----------------
1558
1559Command line options:
1560
1561--evdev-path=PATH
1562
  Specify the Linux input device.
1564
1565  (optional)
1566
1567--no-grab
1568
  Do not request exclusive access to the input device.
1570
1571  (optional)
1572
1573vhost-user-gpu
1574--------------
1575
1576Command line options:
1577
1578--render-node=PATH
1579
1580  Specify the GPU DRM render node.
1581
1582  (optional)
1583
1584--virgl
1585
1586  Enable virgl rendering support.
1587
1588  (optional)
1589
1590vhost-user-blk
1591--------------
1592
1593Command line options:
1594
1595--blk-file=PATH
1596
1597  Specify block device or file path.
1598
1599  (optional)
1600
1601--read-only
1602
1603  Enable read-only.
1604
1605  (optional)
1606