.. _vhost_user_proto:

===================
Vhost-user Protocol
===================

..
  Copyright 2014 Virtual Open Systems Sarl.
  Copyright 2019 Intel Corporation
  Licence: This work is licensed under the terms of the GNU GPL,
           version 2 or later. See the COPYING file in the top-level
           directory.

.. contents:: Table of Contents

Introduction
============

This protocol aims to complement the ``ioctl`` interface used to
control the vhost implementation in the Linux kernel. It implements
the control plane needed to establish virtqueue sharing with a user
space process on the same host. It uses communication over a Unix
domain socket to share file descriptors in the ancillary data of the
message.

The protocol defines two sides of the communication, *master* and
*slave*. *Master* is the application that shares its virtqueues, in
our case QEMU. *Slave* is the consumer of the virtqueues.

In the current implementation QEMU is the *master*, and the *slave* is
the external process consuming the virtio queues, for example a
software Ethernet switch running in user space, such as Snabbswitch,
or a block device backend processing read and write requests to a
virtual disk. In order to facilitate interoperability between various
backend implementations, it is recommended to follow the
:ref:`Backend program conventions <backend_conventions>`.

*Master* and *slave* can be either a client (i.e. connecting) or
server (listening) in the socket communication.

Support for platforms other than Linux
--------------------------------------

While vhost-user was initially developed targeting Linux, nowadays it
is supported on any platform that provides the following features:

- A way for requesting shared memory represented by a file descriptor
  so it can be passed over a UNIX domain socket and then mapped by the
  other process.

- AF_UNIX sockets with SCM_RIGHTS, so QEMU and the other process can
  exchange messages through it, including ancillary data when needed.

- Either eventfd or pipe/pipe2. On platforms where eventfd is not
  available, QEMU will automatically fall back to pipe2 or, as a last
  resort, pipe. Each file descriptor will be used for receiving or
  sending events by reading or writing (respectively) an 8-byte value
  to the corresponding file descriptor. The 8-byte value itself has no
  meaning and should not be interpreted.
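
The event mechanism in the last item can be sketched as follows. This is
a minimal illustration, not QEMU's implementation: the helper names
``notify_signal`` and ``notify_consume`` are hypothetical, and the same
code is assumed to work whether ``fd`` refers to an eventfd or to one end
of a pipe.

.. code:: c

  #include <stdint.h>
  #include <unistd.h>

  /* Signal an event by writing an 8-byte value to the sending descriptor.
   * The value carries no meaning; 1 is used because an eventfd counter
   * only wakes readers once it becomes non-zero. */
  static int notify_signal(int fd)
  {
      uint64_t value = 1;
      ssize_t n = write(fd, &value, sizeof(value));

      return n == (ssize_t)sizeof(value) ? 0 : -1;
  }

  /* Consume an event by reading the 8-byte value from the receiving
   * descriptor. The value read is deliberately ignored. */
  static int notify_consume(int fd)
  {
      uint64_t value;
      ssize_t n = read(fd, &value, sizeof(value));

      return n == (ssize_t)sizeof(value) ? 0 : -1;
  }

With a pipe, ``notify_signal`` is called on the write end and
``notify_consume`` on the read end; with an eventfd both operate on the
same descriptor.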

Message Specification
=====================

.. Note:: All numbers are in the machine native byte order.

A vhost-user message consists of 3 header fields and a payload.

+---------+-------+------+---------+
| request | flags | size | payload |
+---------+-------+------+---------+

Header
------

:request: 32-bit type of the request

:flags: 32-bit bit field

- Lower 2 bits are the version (currently 0x01)
- Bit 2 is the reply flag - needs to be sent on each reply from the slave
- Bit 3 is the need_reply flag - see :ref:`REPLY_ACK <reply_ack>` for
  details.

:size: 32-bit size of the payload
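
As an illustration of the flag layout above, the following sketch builds
request and reply headers. The type ``VuHeader``, the ``VU_*`` macros and
the helper functions are hypothetical names chosen for this example, and
the sketch assumes that a reply repeats the request type of the message
it answers.

.. code:: c

  #include <stdbool.h>
  #include <stdint.h>

  #define VU_VERSION          0x1u        /* current protocol version */
  #define VU_VERSION_MASK     0x3u        /* lower 2 bits of flags */
  #define VU_REPLY_MASK       (1u << 2)   /* bit 2: message is a reply */
  #define VU_NEED_REPLY_MASK  (1u << 3)   /* bit 3: sender wants a reply */

  typedef struct VuHeader {
      uint32_t request;   /* type of the request */
      uint32_t flags;     /* version and flag bits */
      uint32_t size;      /* size of the payload in bytes */
  } VuHeader;

  /* Build the header of a request, optionally asking for a reply. */
  static VuHeader vu_make_request(uint32_t request, uint32_t size,
                                  bool need_reply)
  {
      VuHeader h = { request, VU_VERSION, size };

      if (need_reply) {
          h.flags |= VU_NEED_REPLY_MASK;
      }
      return h;
  }

  /* Build the header of a reply: the reply flag must be set. */
  static VuHeader vu_make_reply(const VuHeader *req, uint32_t size)
  {
      VuHeader h = { req->request, VU_VERSION | VU_REPLY_MASK, size };

      return h;
  }

  /* Check that the version bits of a received header are supported. */
  static bool vu_header_valid(const VuHeader *h)
  {
      return (h->flags & VU_VERSION_MASK) == VU_VERSION;
  }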

Payload
-------

Depending on the request type, **payload** can be:

A single 64-bit integer
^^^^^^^^^^^^^^^^^^^^^^^

+-----+
| u64 |
+-----+

:u64: a 64-bit unsigned integer

A vring state description
^^^^^^^^^^^^^^^^^^^^^^^^^

+-------+-----+
| index | num |
+-------+-----+

:index: a 32-bit index

:num: a 32-bit number

A vring address description
^^^^^^^^^^^^^^^^^^^^^^^^^^^

+-------+-------+------------+------+-----------+-----+
| index | flags | descriptor | used | available | log |
+-------+-------+------------+------+-----------+-----+

:index: a 32-bit vring index

:flags: 32-bit vring flags

:descriptor: a 64-bit ring address of the vring descriptor table

:used: a 64-bit ring address of the vring used ring

:available: a 64-bit ring address of the vring available ring

:log: a 64-bit guest address for logging

Note that a ring address is an IOVA if ``VIRTIO_F_IOMMU_PLATFORM`` has
been negotiated. Otherwise it is a user address.

Memory regions description
^^^^^^^^^^^^^^^^^^^^^^^^^^

+-------------+---------+---------+-----+---------+
| num regions | padding | region0 | ... | region7 |
+-------------+---------+---------+-----+---------+

:num regions: a 32-bit number of regions

:padding: 32-bit

A region is:

+---------------+------+--------------+-------------+
| guest address | size | user address | mmap offset |
+---------------+------+--------------+-------------+

:guest address: a 64-bit guest address of the region

:size: a 64-bit size

:user address: a 64-bit user address

:mmap offset: 64-bit offset where region starts in the mapped memory

Single memory region description
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

+---------+---------------+------+--------------+-------------+
| padding | guest address | size | user address | mmap offset |
+---------+---------------+------+--------------+-------------+

:padding: 64-bit

:guest address: a 64-bit guest address of the region

:size: a 64-bit size

:user address: a 64-bit user address

:mmap offset: 64-bit offset where region starts in the mapped memory

Log description
^^^^^^^^^^^^^^^

+----------+------------+
| log size | log offset |
+----------+------------+

:log size: size of area used for logging

:log offset: offset from start of supplied file descriptor where
             logging starts (i.e. where guest address 0 would be
             logged)

An IOTLB message
^^^^^^^^^^^^^^^^

+------+------+--------------+-------------------+------+
| iova | size | user address | permissions flags | type |
+------+------+--------------+-------------------+------+

:iova: a 64-bit I/O virtual address programmed by the guest

:size: a 64-bit size

:user address: a 64-bit user address

:permissions flags: an 8-bit value:
  - 0: No access
  - 1: Read access
  - 2: Write access
  - 3: Read/Write access

:type: an 8-bit IOTLB message type:
  - 1: IOTLB miss
  - 2: IOTLB update
  - 3: IOTLB invalidate
  - 4: IOTLB access fail

Virtio device config space
^^^^^^^^^^^^^^^^^^^^^^^^^^

+--------+------+-------+---------+
| offset | size | flags | payload |
+--------+------+-------+---------+

:offset: a 32-bit offset of virtio device's configuration space

:size: a 32-bit configuration space access size in bytes

:flags: a 32-bit value:
  - 0: Vhost master messages used for writeable fields
  - 1: Vhost master messages used for live migration

:payload: Size bytes array holding the contents of the virtio
          device's configuration space

Vring area description
^^^^^^^^^^^^^^^^^^^^^^

+-----+------+--------+
| u64 | size | offset |
+-----+------+--------+

:u64: a 64-bit integer containing the vring index and flags

:size: a 64-bit size of this area

:offset: a 64-bit offset of this area from the start of the
         supplied file descriptor

Inflight description
^^^^^^^^^^^^^^^^^^^^

+-----------+-------------+------------+------------+
| mmap size | mmap offset | num queues | queue size |
+-----------+-------------+------------+------------+

:mmap size: a 64-bit size of area to track inflight I/O

:mmap offset: a 64-bit offset of this area from the start
              of the supplied file descriptor

:num queues: a 16-bit number of virtqueues

:queue size: a 16-bit size of virtqueues

C structure
-----------

In QEMU the vhost-user message is implemented with the following struct:

.. code:: c

  typedef struct VhostUserMsg {
      VhostUserRequest request;
      uint32_t flags;
      uint32_t size;
      union {
          uint64_t u64;
          struct vhost_vring_state state;
          struct vhost_vring_addr addr;
          VhostUserMemory memory;
          VhostUserLog log;
          struct vhost_iotlb_msg iotlb;
          VhostUserConfig config;
          VhostUserVringArea area;
          VhostUserInflight inflight;
      };
  } QEMU_PACKED VhostUserMsg;

Communication
=============

The protocol for vhost-user is based on the existing implementation of
vhost for the Linux Kernel. Most messages that can be sent via the
Unix domain socket implementing vhost-user have an equivalent ioctl to
the kernel implementation.

The communication consists of *master* sending message requests and
*slave* sending message replies. Most of the requests don't require
replies. Here is a list of the ones that do:

* ``VHOST_USER_GET_FEATURES``
* ``VHOST_USER_GET_PROTOCOL_FEATURES``
* ``VHOST_USER_GET_VRING_BASE``
* ``VHOST_USER_SET_LOG_BASE`` (if ``VHOST_USER_PROTOCOL_F_LOG_SHMFD``)
* ``VHOST_USER_GET_INFLIGHT_FD`` (if ``VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD``)

.. seealso::

   :ref:`REPLY_ACK <reply_ack>`
       The section on ``REPLY_ACK`` protocol extension.

There are several messages that the master sends with file descriptors passed
in the ancillary data:

* ``VHOST_USER_ADD_MEM_REG``
* ``VHOST_USER_SET_MEM_TABLE``
* ``VHOST_USER_SET_LOG_BASE`` (if ``VHOST_USER_PROTOCOL_F_LOG_SHMFD``)
* ``VHOST_USER_SET_LOG_FD``
* ``VHOST_USER_SET_VRING_KICK``
* ``VHOST_USER_SET_VRING_CALL``
* ``VHOST_USER_SET_VRING_ERR``
* ``VHOST_USER_SET_SLAVE_REQ_FD``
* ``VHOST_USER_SET_INFLIGHT_FD`` (if ``VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD``)
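
Passing a file descriptor in the ancillary data of a Unix domain socket
message uses the POSIX ``SCM_RIGHTS`` control message. The sketch below
shows the general technique for a single descriptor; the function names
``send_with_fd`` and ``recv_with_fd`` are hypothetical, and the code is
not taken from QEMU, which also handles several descriptors per message
and partial transfers.

.. code:: c

  #include <string.h>
  #include <sys/socket.h>
  #include <sys/types.h>
  #include <sys/uio.h>

  /* Send len bytes of message data together with one file descriptor. */
  static ssize_t send_with_fd(int sock, const void *buf, size_t len, int fd)
  {
      union {
          char buf[CMSG_SPACE(sizeof(int))];
          struct cmsghdr align;   /* forces correct alignment */
      } control;
      struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
      struct msghdr msg;
      struct cmsghdr *cmsg;

      memset(&msg, 0, sizeof(msg));
      memset(&control, 0, sizeof(control));
      msg.msg_iov = &iov;
      msg.msg_iovlen = 1;
      msg.msg_control = control.buf;
      msg.msg_controllen = sizeof(control.buf);

      /* Describe the descriptor in a single SCM_RIGHTS control message. */
      cmsg = CMSG_FIRSTHDR(&msg);
      cmsg->cmsg_level = SOL_SOCKET;
      cmsg->cmsg_type = SCM_RIGHTS;
      cmsg->cmsg_len = CMSG_LEN(sizeof(int));
      memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

      return sendmsg(sock, &msg, 0);
  }

  /* Receive message data; *fd_out is set to the passed descriptor,
   * or to -1 if the message carried none. */
  static ssize_t recv_with_fd(int sock, void *buf, size_t len, int *fd_out)
  {
      union {
          char buf[CMSG_SPACE(sizeof(int))];
          struct cmsghdr align;
      } control;
      struct iovec iov = { .iov_base = buf, .iov_len = len };
      struct msghdr msg;
      struct cmsghdr *cmsg;
      ssize_t n;

      memset(&msg, 0, sizeof(msg));
      msg.msg_iov = &iov;
      msg.msg_iovlen = 1;
      msg.msg_control = control.buf;
      msg.msg_controllen = sizeof(control.buf);

      *fd_out = -1;
      n = recvmsg(sock, &msg, 0);
      if (n < 0) {
          return n;
      }
      for (cmsg = CMSG_FIRSTHDR(&msg); cmsg; cmsg = CMSG_NXTHDR(&msg, cmsg)) {
          if (cmsg->cmsg_level == SOL_SOCKET &&
              cmsg->cmsg_type == SCM_RIGHTS) {
              memcpy(fd_out, CMSG_DATA(cmsg), sizeof(int));
              break;
          }
      }
      return n;
  }

The received descriptor is a new descriptor in the receiving process that
refers to the same open file as the one sent.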

If *master* is unable to send the full message or receives a wrong
reply it will close the connection. An optional reconnection mechanism
can be implemented.

If *slave* detects some error such as incompatible features, it may also
close the connection. This should only happen in exceptional circumstances.

Any protocol extensions are gated by protocol feature bits, which
allows full backwards compatibility on both master and slave.  As
older slaves don't support negotiating protocol features, a feature
bit was dedicated for this purpose::

  #define VHOST_USER_F_PROTOCOL_FEATURES 30

Starting and stopping rings
---------------------------

Client must only process each ring when it is started.

Client must only pass data between the ring and the backend when the
ring is enabled.

If a ring is started but disabled, client must process the ring without
talking to the backend.

For example, for a networking device, in the disabled state client
must not supply any new RX packets, but must process and discard any
TX packets.

If ``VHOST_USER_F_PROTOCOL_FEATURES`` has not been negotiated, the
ring is initialized in an enabled state.

If ``VHOST_USER_F_PROTOCOL_FEATURES`` has been negotiated, the ring is
initialized in a disabled state. Client must not pass data to/from the
backend until the ring is enabled by ``VHOST_USER_SET_VRING_ENABLE`` with
parameter 1, or after it has been disabled by
``VHOST_USER_SET_VRING_ENABLE`` with parameter 0.

Each ring is initialized in a stopped state; client must not process
it until the ring is started, or after it has been stopped.

Client must start the ring upon receiving a kick (that is, detecting
that the file descriptor is readable) on the descriptor specified by
``VHOST_USER_SET_VRING_KICK`` or receiving the in-band message
``VHOST_USER_VRING_KICK`` if negotiated, and stop the ring upon receiving
``VHOST_USER_GET_VRING_BASE``.

While processing the rings (whether they are enabled or not), client
must support changing some configuration aspects on the fly.
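
The started/stopped and enabled/disabled rules above can be summarised as
a small state machine. The following is only a sketch of those rules: the
type ``RingState`` and the ``ring_*`` helpers are hypothetical names, and
a real backend tracks considerably more state per ring.

.. code:: c

  #include <stdbool.h>
  #include <stdint.h>

  #define VHOST_USER_F_PROTOCOL_FEATURES 30

  typedef struct RingState {
      bool started;   /* ring may be processed */
      bool enabled;   /* data may flow between ring and backend */
  } RingState;

  /* Rings start out stopped. They start out enabled only if
   * VHOST_USER_F_PROTOCOL_FEATURES has not been negotiated. */
  static void ring_init(RingState *r, uint64_t negotiated_features)
  {
      r->started = false;
      r->enabled = !(negotiated_features &
                     (1ULL << VHOST_USER_F_PROTOCOL_FEATURES));
  }

  /* A kick (eventfd readable or in-band message) starts the ring. */
  static void ring_on_kick(RingState *r)
  {
      r->started = true;
  }

  /* VHOST_USER_GET_VRING_BASE stops the ring. */
  static void ring_on_get_vring_base(RingState *r)
  {
      r->started = false;
  }

  /* VHOST_USER_SET_VRING_ENABLE with parameter 1 or 0. */
  static void ring_on_set_vring_enable(RingState *r, uint32_t num)
  {
      r->enabled = (num != 0);
  }

  /* The ring may be processed only while it is started. */
  static bool ring_may_process(const RingState *r)
  {
      return r->started;
  }

  /* Data may reach the backend only if the ring is started and enabled. */
  static bool ring_may_use_backend(const RingState *r)
  {
      return r->started && r->enabled;
  }

A started but disabled ring therefore satisfies ``ring_may_process`` but
not ``ring_may_use_backend``, which corresponds to the networking example
of discarding TX packets without supplying RX packets.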

Multiple queue support
----------------------

Many devices have a fixed number of virtqueues.  In this case the master
already knows the number of available virtqueues without communicating with the
slave.

Some devices do not have a fixed number of virtqueues.  Instead the maximum
number of virtqueues is chosen by the slave.  The number can depend on host
resource availability or slave implementation details.  Such devices are called
multiple queue devices.

Multiple queue support allows the slave to advertise the maximum number of
queues.  This is treated as a protocol extension, hence the slave has to
implement protocol features first. The multiple queues feature is supported
only when the protocol feature ``VHOST_USER_PROTOCOL_F_MQ`` (bit 0) is set.

The maximum number of queues the slave supports can be queried with message
``VHOST_USER_GET_QUEUE_NUM``. The master should stop when the number of
requested queues exceeds this maximum.

As all queues share one connection, the master uses a unique index for each
queue in the sent message to identify a specified queue.

The master enables queues by sending message ``VHOST_USER_SET_VRING_ENABLE``.
vhost-user-net has historically automatically enabled the first queue pair.

Slaves should always implement the ``VHOST_USER_PROTOCOL_F_MQ`` protocol
feature, even for devices with a fixed number of virtqueues, since it is simple
to implement and offers a degree of introspection.

Masters must not rely on the ``VHOST_USER_PROTOCOL_F_MQ`` protocol feature for
devices with a fixed number of virtqueues.  Only true multiqueue devices
require this protocol feature.

Migration
---------

During live migration, the master may need to track the modifications
the slave makes to the memory mapped regions. The client should mark
the dirty pages in a log. Once it complies with this logging, it may
declare the ``VHOST_F_LOG_ALL`` vhost feature.

To start/stop logging of data/used ring writes, server may send
messages ``VHOST_USER_SET_FEATURES`` with ``VHOST_F_LOG_ALL`` and
``VHOST_USER_SET_VRING_ADDR`` with ``VHOST_VRING_F_LOG`` in ring's
flags set to 1/0, respectively.

All the modifications to memory pointed to by vring "descriptor" should
be marked. Modifications to "used" vring should be marked if
``VHOST_VRING_F_LOG`` is part of ring's flags.

Dirty pages are of size::

  #define VHOST_LOG_PAGE 0x1000

The log memory fd is provided in the ancillary data of
``VHOST_USER_SET_LOG_BASE`` message when the slave has
``VHOST_USER_PROTOCOL_F_LOG_SHMFD`` protocol feature.

The size of the log is supplied as part of ``VhostUserMsg`` which
should be large enough to cover all known guest addresses. Log starts
at the supplied offset in the supplied file descriptor.  The log
covers from address 0 to the maximum of guest regions. In pseudo-code,
to mark page at ``addr`` as dirty::

  page = addr / VHOST_LOG_PAGE
  log[page / 8] |= 1 << (page % 8)

Where ``addr`` is the guest physical address.

Use atomic operations, as the log may be concurrently manipulated.
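
A possible C rendering of the pseudo-code, using C11 atomics, is sketched
below. The function names ``vu_log_page`` and ``vu_log_write`` are
hypothetical, and treating the mapped log as an array of
``_Atomic uint8_t`` is an assumption of this example; an implementation
may use compiler builtins or a wider word size instead.

.. code:: c

  #include <stdatomic.h>
  #include <stdint.h>

  #define VHOST_LOG_PAGE 0x1000

  /* Mark the page containing guest physical address addr as dirty. */
  static void vu_log_page(_Atomic uint8_t *log, uint64_t addr)
  {
      uint64_t page = addr / VHOST_LOG_PAGE;

      atomic_fetch_or(&log[page / 8], (uint8_t)(1u << (page % 8)));
  }

  /* Mark every page touched by a write of len bytes starting at addr.
   * Pages beyond log_size bytes of log are not recorded. */
  static void vu_log_write(_Atomic uint8_t *log, uint64_t log_size,
                           uint64_t addr, uint64_t len)
  {
      uint64_t page, last;

      if (len == 0) {
          return;
      }
      last = (addr + len - 1) / VHOST_LOG_PAGE;
      for (page = addr / VHOST_LOG_PAGE; page <= last; page++) {
          if (page / 8 >= log_size) {
              break;
          }
          vu_log_page(log, page * VHOST_LOG_PAGE);
      }
  }

A write that straddles a page boundary sets one bit per page touched,
which is why ``vu_log_write`` iterates over the whole range instead of
logging only the first address.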

Note that when logging modifications to the used ring (when
``VHOST_VRING_F_LOG`` is set for this ring), ``log_guest_addr`` should
be used to calculate the log offset: the write to the first byte of the
used ring is logged at this offset from log start. Also note that this
value might be outside the legal guest physical address range
(i.e. does not have to be covered by the ``VhostUserMemory`` table), but
the bit offset of the last byte of the ring must fall within the size
supplied by ``VhostUserLog``.

``VHOST_USER_SET_LOG_FD`` is an optional message with an eventfd in
ancillary data; it may be used to inform the master that the log has
been modified.

Once the source has finished migration, rings will be stopped by the
source. No further updates may be made before the rings are restarted.

In postcopy migration the slave is started before all the memory has
been received from the source host, and care must be taken to avoid
accessing pages that have yet to be received.  The slave opens a
'userfault'-fd and registers the memory with it; this fd is then
passed back over to the master.  The master services requests on the
userfaultfd for pages that are accessed and when the page is available
it performs WAKE ioctls on the userfaultfd to wake the stalled
slave.  The client indicates support for this via the
``VHOST_USER_PROTOCOL_F_PAGEFAULT`` feature.

Memory access
-------------

The master sends a list of vhost memory regions to the slave using the
``VHOST_USER_SET_MEM_TABLE`` message.  Each region has two base
addresses: a guest address and a user address.

Messages contain guest addresses and/or user addresses to reference locations
within the shared memory.  The mapping of these addresses works as follows.

User addresses map to the vhost memory region containing that user address.

When the ``VIRTIO_F_IOMMU_PLATFORM`` feature has not been negotiated:

* Guest addresses map to the vhost memory region containing that guest
  address.

When the ``VIRTIO_F_IOMMU_PLATFORM`` feature has been negotiated:

* Guest addresses are also called I/O virtual addresses (IOVAs).  They are
  translated to user addresses via the IOTLB.

* The vhost memory region guest address is not used.
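
The region lookup described above, for the case where
``VIRTIO_F_IOMMU_PLATFORM`` has not been negotiated, might look like the
following sketch. The type ``MemRegion``, its ``mmap_base`` field and the
two helper functions are hypothetical. The sketch also assumes that the
slave mapped each region's file descriptor from file offset 0, so that
the region's data begins ``mmap_offset`` bytes into the mapping.

.. code:: c

  #include <stddef.h>
  #include <stdint.h>

  typedef struct MemRegion {
      uint64_t guest_addr;   /* guest address of the region */
      uint64_t size;         /* size of the region in bytes */
      uint64_t user_addr;    /* user address in the master's address space */
      uint64_t mmap_offset;  /* offset of the region in the mapped memory */
      uint8_t *mmap_base;    /* assumption: local address of the mapping */
  } MemRegion;

  /* Translate a guest address to a local pointer, or NULL if unmapped. */
  static void *guest_to_local(const MemRegion *regions, size_t n,
                              uint64_t guest_addr)
  {
      for (size_t i = 0; i < n; i++) {
          const MemRegion *r = &regions[i];

          if (guest_addr >= r->guest_addr &&
              guest_addr - r->guest_addr < r->size) {
              return r->mmap_base + r->mmap_offset +
                     (guest_addr - r->guest_addr);
          }
      }
      return NULL;
  }

  /* Translate a user address to a local pointer, or NULL if unmapped. */
  static void *user_to_local(const MemRegion *regions, size_t n,
                             uint64_t user_addr)
  {
      for (size_t i = 0; i < n; i++) {
          const MemRegion *r = &regions[i];

          if (user_addr >= r->user_addr &&
              user_addr - r->user_addr < r->size) {
              return r->mmap_base + r->mmap_offset +
                     (user_addr - r->user_addr);
          }
      }
      return NULL;
  }

The range check is written as a subtraction followed by a comparison
against ``size`` so that it cannot overflow for regions near the top of
the 64-bit address space.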

IOMMU support
-------------

When the ``VIRTIO_F_IOMMU_PLATFORM`` feature has been negotiated, the
master sends IOTLB entry updates and invalidations by sending
``VHOST_USER_IOTLB_MSG`` requests to the slave with a
``struct vhost_iotlb_msg`` as payload. For update events, the ``iotlb``
payload has to be filled with the update message type (2), the I/O
virtual address, the size, the user virtual address, and the permissions
flags. Addresses and size must be within vhost memory regions set via
the ``VHOST_USER_SET_MEM_TABLE`` request. For invalidation events, the
``iotlb`` payload has to be filled with the invalidation message type
(3), the I/O virtual address and the size. On success, the slave is
expected to reply with a zero payload, non-zero otherwise.

The slave relies on the slave communication channel (see :ref:`Slave
communication <slave_communication>` section below) to send IOTLB miss
and access failure events, by sending ``VHOST_USER_SLAVE_IOTLB_MSG``
requests to the master with a ``struct vhost_iotlb_msg`` as
payload. For miss events, the iotlb payload has to be filled with the
miss message type (1), the I/O virtual address and the permissions
flags. For access failure events, the iotlb payload has to be filled
with the access failure message type (4), the I/O virtual address and
the permissions flags.  For synchronization purposes, the slave may
rely on the reply-ack feature, so the master may send a reply when the
operation is completed if the reply-ack feature is negotiated and the
slave requests a reply. For miss events, a completed operation means
either the master sent an update message containing the IOTLB entry
covering the requested address and permission, or the master sent
nothing if the IOTLB miss message is invalid (invalid IOVA or
permission).

The master isn't expected to take the initiative to send IOTLB update
messages, as the slave sends IOTLB miss messages for the guest virtual
memory areas it needs to access.
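
The miss/update exchange can be illustrated with the sketch below. The
struct ``IotlbMsg`` mirrors the fields of the IOTLB message payload
described earlier, but it and the helper names are hypothetical
stand-ins for this example rather than the kernel's
``struct vhost_iotlb_msg``. A real slave would also keep a cache of many
entries instead of a single one.

.. code:: c

  #include <stdint.h>
  #include <string.h>

  enum { IOTLB_PERM_NONE = 0, IOTLB_PERM_RO = 1,
         IOTLB_PERM_WO = 2, IOTLB_PERM_RW = 3 };
  enum { IOTLB_MISS = 1, IOTLB_UPDATE = 2,
         IOTLB_INVALIDATE = 3, IOTLB_ACCESS_FAIL = 4 };

  typedef struct IotlbMsg {
      uint64_t iova;    /* I/O virtual address */
      uint64_t size;
      uint64_t uaddr;   /* user address */
      uint8_t perm;     /* permissions flags */
      uint8_t type;     /* IOTLB message type */
  } IotlbMsg;

  /* Build a miss message: only type, iova and permissions are filled. */
  static IotlbMsg iotlb_make_miss(uint64_t iova, uint8_t perm)
  {
      IotlbMsg m;

      memset(&m, 0, sizeof(m));
      m.type = IOTLB_MISS;
      m.iova = iova;
      m.perm = perm;
      return m;
  }

  /* Translate iova using one entry received in an update message.
   * Returns 0 and stores the user address on success, -1 if the entry
   * does not cover iova or lacks the requested permissions. */
  static int iotlb_translate(const IotlbMsg *entry, uint64_t iova,
                             uint8_t perm, uint64_t *uaddr)
  {
      if (entry->type != IOTLB_UPDATE) {
          return -1;
      }
      if (iova < entry->iova || iova - entry->iova >= entry->size) {
          return -1;
      }
      if ((entry->perm & perm) != perm) {
          return -1;
      }
      *uaddr = entry->uaddr + (iova - entry->iova);
      return 0;
  }

When ``iotlb_translate`` fails for every cached entry, the slave would
send the message produced by ``iotlb_make_miss`` over the slave
communication channel and retry once the master has sent an update.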

.. _slave_communication:

Slave communication
-------------------

An optional communication channel is provided if the slave declares
``VHOST_USER_PROTOCOL_F_SLAVE_REQ`` protocol feature, to allow the
slave to make requests to the master.

The fd is provided via ``VHOST_USER_SET_SLAVE_REQ_FD`` ancillary data.

A slave may then send ``VHOST_USER_SLAVE_*`` messages to the master
using this fd communication channel.

If ``VHOST_USER_PROTOCOL_F_SLAVE_SEND_FD`` protocol feature is
negotiated, slave can send file descriptors (at most 8 descriptors in
each message) to master via ancillary data using this fd communication
channel.

Inflight I/O tracking
---------------------

To support reconnecting after restart or crash, the slave may need to
resubmit inflight I/Os. If the virtqueue is processed in order, this is
easily achieved by getting the inflight descriptors from the
descriptor table (split virtqueue) or descriptor ring (packed
virtqueue). However, this does not work when descriptors are processed
out of order, because some entries which store the information of
inflight descriptors in the available ring (split virtqueue) or
descriptor ring (packed virtqueue) might be overwritten by new entries.
To solve this problem, the slave needs to allocate an extra buffer to
store this information about inflight descriptors and share it with the
master for persistence. ``VHOST_USER_GET_INFLIGHT_FD`` and
``VHOST_USER_SET_INFLIGHT_FD`` are used to transfer this buffer
between master and slave. The format of this buffer is described
below:

+---------------+---------------+-----+---------------+
| queue0 region | queue1 region | ... | queueN region |
+---------------+---------------+-----+---------------+

N is the number of available virtqueues. The slave can get it from the
``num queues`` field of ``VhostUserInflight``.

For split virtqueue, queue region can be implemented as:

.. code:: c

  typedef struct DescStateSplit {
      /* Indicate whether this descriptor is inflight or not.
       * Only available for head-descriptor. */
      uint8_t inflight;

      /* Padding */
      uint8_t padding[5];

      /* Maintain a list for the last batch of used descriptors.
       * Only available when batching is used for submitting */
      uint16_t next;

      /* Used to preserve the order of fetching available descriptors.
       * Only available for head-descriptor. */
      uint64_t counter;
  } DescStateSplit;

  typedef struct QueueRegionSplit {
      /* The feature flags of this region. Now it's initialized to 0. */
      uint64_t features;

      /* The version of this region. It's 1 currently.
       * Zero value indicates an uninitialized buffer */
      uint16_t version;

      /* The size of DescStateSplit array. It's equal to the virtqueue
       * size. Slave could get it from queue size field of VhostUserInflight. */
      uint16_t desc_num;

      /* The head of list that track the last batch of used descriptors. */
      uint16_t last_batch_head;

      /* Store the idx value of used ring */
      uint16_t used_idx;

      /* Used to track the state of each descriptor in descriptor table */
      DescStateSplit desc[];
  } QueueRegionSplit;

To track inflight I/O, the queue region should be processed as follows:

When receiving available buffers from the driver:

#. Get the next available head-descriptor index from available ring, ``i``

#. Set ``desc[i].counter`` to the value of global counter

#. Increase global counter by 1

#. Set ``desc[i].inflight`` to 1

When supplying used buffers to the driver:

1. Get corresponding used head-descriptor index, ``i``

2. Set ``desc[i].next`` to ``last_batch_head``

3. Set ``last_batch_head`` to ``i``

#. Steps 1,2,3 may be performed repeatedly if batching is possible

#. Increase the ``idx`` value of used ring by the size of the batch

#. Set the ``inflight`` field of each ``DescStateSplit`` entry in the batch to 0

#. Set ``used_idx`` to the ``idx`` value of used ring

When reconnecting:

#. If the value of ``used_idx`` does not match the ``idx`` value of
   used ring (means the inflight field of ``DescStateSplit`` entries in
   last batch may be incorrect),

   a. Subtract the value of ``used_idx`` from the ``idx`` value of
      used ring to get last batch size of ``DescStateSplit`` entries

   #. Set the ``inflight`` field of each ``DescStateSplit`` entry to 0 in last batch
      list which starts from ``last_batch_head``

   #. Set ``used_idx`` to the ``idx`` value of used ring

#. Resubmit inflight ``DescStateSplit`` entries in order of their
   counter value
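
The three procedures for split virtqueues can be sketched in C as
follows. This is an illustration of the steps listed above, not the code
of any existing backend: the function names are hypothetical, the
flexible ``desc[]`` array is replaced by a fixed-size array of
``QUEUE_SIZE`` entries to keep the example self-contained, and the used
ring's ``idx`` field is represented by a plain ``uint16_t`` variable.

.. code:: c

  #include <stdint.h>

  #define QUEUE_SIZE 8   /* illustrative virtqueue size */

  typedef struct DescStateSplit {
      uint8_t inflight;
      uint8_t padding[5];
      uint16_t next;
      uint64_t counter;
  } DescStateSplit;

  typedef struct QueueRegionSplit {
      uint64_t features;
      uint16_t version;
      uint16_t desc_num;
      uint16_t last_batch_head;
      uint16_t used_idx;
      DescStateSplit desc[QUEUE_SIZE];   /* flexible array in the real layout */
  } QueueRegionSplit;

  static uint64_t global_counter;

  /* Receiving an available buffer whose head-descriptor index is i. */
  static void inflight_get(QueueRegionSplit *q, uint16_t i)
  {
      q->desc[i].counter = global_counter++;
      q->desc[i].inflight = 1;
  }

  /* Supplying a batch of n used buffers to the driver. */
  static void inflight_put_batch(QueueRegionSplit *q, const uint16_t *heads,
                                 uint16_t n, uint16_t *ring_used_idx)
  {
      for (uint16_t k = 0; k < n; k++) {        /* steps 1 to 3 */
          q->desc[heads[k]].next = q->last_batch_head;
          q->last_batch_head = heads[k];
      }
      *ring_used_idx = (uint16_t)(*ring_used_idx + n);
      for (uint16_t k = 0; k < n; k++) {
          q->desc[heads[k]].inflight = 0;
      }
      q->used_idx = *ring_used_idx;
  }

  /* Reconnecting: repair the last batch if needed, then store the head
   * indexes to resubmit in out[], ordered by counter. Returns their count. */
  static uint16_t inflight_recover(QueueRegionSplit *q, uint16_t ring_used_idx,
                                   uint16_t *out)
  {
      uint16_t n = 0;

      if (q->used_idx != ring_used_idx) {
          uint16_t batch = (uint16_t)(ring_used_idx - q->used_idx);
          uint16_t i = q->last_batch_head;

          while (batch--) {
              q->desc[i].inflight = 0;
              i = q->desc[i].next;
          }
          q->used_idx = ring_used_idx;
      }
      for (uint16_t i = 0; i < q->desc_num; i++) {
          if (q->desc[i].inflight) {
              out[n++] = i;
          }
      }
      for (uint16_t a = 1; a < n; a++) {        /* insertion sort by counter */
          uint16_t key = out[a];
          int b = a - 1;

          while (b >= 0 && q->desc[out[b]].counter > q->desc[key].counter) {
              out[b + 1] = out[b];
              b--;
          }
          out[b + 1] = key;
      }
      return n;
  }

The order of operations in ``inflight_put_batch`` matters: the used ring
index is advanced before the ``inflight`` fields are cleared, so a crash
between the two leaves ``used_idx`` behind the ring's ``idx`` and lets
``inflight_recover`` detect and repair the interrupted batch.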

For packed virtqueue, queue region can be implemented as:

.. code:: c

  typedef struct DescStatePacked {
      /* Indicate whether this descriptor is inflight or not.
       * Only available for head-descriptor. */
      uint8_t inflight;

      /* Padding */
      uint8_t padding;

      /* Link to the next free entry */
      uint16_t next;

      /* Link to the last entry of descriptor list.
       * Only available for head-descriptor. */
      uint16_t last;

      /* The length of descriptor list.
       * Only available for head-descriptor. */
      uint16_t num;

      /* Used to preserve the order of fetching available descriptors.
       * Only available for head-descriptor. */
      uint64_t counter;

      /* The buffer id */
      uint16_t id;

      /* The descriptor flags */
      uint16_t flags;

      /* The buffer length */
      uint32_t len;

      /* The buffer address */
      uint64_t addr;
  } DescStatePacked;

  typedef struct QueueRegionPacked {
      /* The feature flags of this region. Now it's initialized to 0. */
      uint64_t features;

      /* The version of this region. It's 1 currently.
       * Zero value indicates an uninitialized buffer */
      uint16_t version;

      /* The size of DescStatePacked array. It's equal to the virtqueue
       * size. Slave could get it from queue size field of VhostUserInflight. */
      uint16_t desc_num;

      /* The head of free DescStatePacked entry list */
      uint16_t free_head;

      /* The old head of free DescStatePacked entry list */
      uint16_t old_free_head;

      /* The used index of descriptor ring */
      uint16_t used_idx;

      /* The old used index of descriptor ring */
      uint16_t old_used_idx;

      /* Device ring wrap counter */
      uint8_t used_wrap_counter;

      /* The old device ring wrap counter */
      uint8_t old_used_wrap_counter;

      /* Padding */
      uint8_t padding[7];

      /* Used to track the state of each descriptor fetched from descriptor ring */
      DescStatePacked desc[];
  } QueueRegionPacked;

To track inflight I/O, the queue region should be processed as follows:

When receiving available buffers from the driver:

#. Get the next available descriptor entry from descriptor ring, ``d``

#. If ``d`` is head descriptor,

   a. Set ``desc[old_free_head].num`` to 0

   #. Set ``desc[old_free_head].counter`` to the value of global counter

   #. Increase global counter by 1

   #. Set ``desc[old_free_head].inflight`` to 1

#. If ``d`` is last descriptor, set ``desc[old_free_head].last`` to
   ``free_head``

#. Increase ``desc[old_free_head].num`` by 1

#. Set ``desc[free_head].addr``, ``desc[free_head].len``,
   ``desc[free_head].flags``, ``desc[free_head].id`` to ``d.addr``,
   ``d.len``, ``d.flags``, ``d.id``

#. Set ``free_head`` to ``desc[free_head].next``

#. If ``d`` is last descriptor, set ``old_free_head`` to ``free_head``

When supplying used buffers to the driver:

1. Get corresponding used head-descriptor entry from descriptor ring,
   ``d``

2. Get corresponding ``DescStatePacked`` entry, ``e``

3. Set ``desc[e.last].next`` to ``free_head``

4. Set ``free_head`` to the index of ``e``

#. Steps 1,2,3,4 may be performed repeatedly if batching is possible

#. Increase ``used_idx`` by the size of the batch and update
   ``used_wrap_counter`` if needed

#. Update ``d.flags``

#. Set the ``inflight`` field of each head ``DescStatePacked`` entry
   in the batch to 0

#. Set ``old_free_head``,  ``old_used_idx``, ``old_used_wrap_counter``
   to ``free_head``, ``used_idx``, ``used_wrap_counter``

When reconnecting:

#. If ``used_idx`` does not match ``old_used_idx`` (means the
   ``inflight`` field of ``DescStatePacked`` entries in last batch may
   be incorrect),

   a. Get the next descriptor ring entry through ``old_used_idx``, ``d``

   #. Use ``old_used_wrap_counter`` to calculate the available flags

   #. If ``d.flags`` is not equal to the calculated flags value (means
      slave has submitted the buffer to guest driver before crash, so
      it has to commit the in-progress update), set ``old_free_head``,
      ``old_used_idx``, ``old_used_wrap_counter`` to ``free_head``,
      ``used_idx``, ``used_wrap_counter``

#. Set ``free_head``, ``used_idx``, ``used_wrap_counter`` to
   ``old_free_head``, ``old_used_idx``, ``old_used_wrap_counter``
   (roll back any in-progress update)

#. Set the ``inflight`` field of each ``DescStatePacked`` entry in
   free list to 0

#. Resubmit inflight ``DescStatePacked`` entries in order of their
   counter value

In-band notifications
---------------------

In some limited situations (e.g. for simulation) it is desirable to
have the kick, call and error (if used) signals done via in-band
messages instead of asynchronous eventfd notifications. This can be
done by negotiating the ``VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS``
protocol feature.

Note that because too many messages on the sockets can cause the
sending application(s) to block, it is not advised to use this feature
unless absolutely necessary. It is also considered an error to
negotiate this feature without also negotiating
``VHOST_USER_PROTOCOL_F_SLAVE_REQ`` and ``VHOST_USER_PROTOCOL_F_REPLY_ACK``.
The former is necessary for getting a message channel from the slave
to the master, while the latter needs to be used with the in-band
notification messages to block until they are processed, both to avoid
blocking later and for proper processing (at least in the simulation
use case). As it has no other way of signalling this error, the slave
should close the connection as a response to a
``VHOST_USER_SET_PROTOCOL_FEATURES`` message that sets the in-band
notifications feature flag without the other two.
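
The dependency check a slave might apply when handling
``VHOST_USER_SET_PROTOCOL_FEATURES`` can be sketched as below. The helper
names ``has_bit`` and ``protocol_features_acceptable`` are hypothetical;
the bit numbers are the protocol feature bits defined by this
specification.

.. code:: c

  #include <stdbool.h>
  #include <stdint.h>

  #define VHOST_USER_PROTOCOL_F_REPLY_ACK             3
  #define VHOST_USER_PROTOCOL_F_SLAVE_REQ             5
  #define VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS 14

  /* Test a single bit of a feature bitmask. */
  static bool has_bit(uint64_t features, unsigned int bit)
  {
      return (features >> bit) & 1;
  }

  /* Return false if in-band notifications are requested without both
   * SLAVE_REQ and REPLY_ACK, in which case the slave should close the
   * connection. */
  static bool protocol_features_acceptable(uint64_t features)
  {
      if (!has_bit(features, VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS)) {
          return true;
      }
      return has_bit(features, VHOST_USER_PROTOCOL_F_SLAVE_REQ) &&
             has_bit(features, VHOST_USER_PROTOCOL_F_REPLY_ACK);
  }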
839
840Protocol features
841-----------------
842
.. code:: c

  #define VHOST_USER_PROTOCOL_F_MQ                    0
  #define VHOST_USER_PROTOCOL_F_LOG_SHMFD             1
  #define VHOST_USER_PROTOCOL_F_RARP                  2
  #define VHOST_USER_PROTOCOL_F_REPLY_ACK             3
  #define VHOST_USER_PROTOCOL_F_MTU                   4
  #define VHOST_USER_PROTOCOL_F_SLAVE_REQ             5
  #define VHOST_USER_PROTOCOL_F_CROSS_ENDIAN          6
  #define VHOST_USER_PROTOCOL_F_CRYPTO_SESSION        7
  #define VHOST_USER_PROTOCOL_F_PAGEFAULT             8
  #define VHOST_USER_PROTOCOL_F_CONFIG                9
  #define VHOST_USER_PROTOCOL_F_SLAVE_SEND_FD        10
  #define VHOST_USER_PROTOCOL_F_HOST_NOTIFIER        11
  #define VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD       12
  #define VHOST_USER_PROTOCOL_F_RESET_DEVICE         13
  #define VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS 14
  #define VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS  15
  #define VHOST_USER_PROTOCOL_F_STATUS               16
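
The dependency of in-band notifications on two other protocol features
can be checked with a small predicate. The following is a minimal
sketch, not taken from any implementation; the names ``proto_has`` and
``inband_notifications_ok`` are hypothetical.

.. code:: c

  #include <stdbool.h>
  #include <stdint.h>

  #define VHOST_USER_PROTOCOL_F_REPLY_ACK             3
  #define VHOST_USER_PROTOCOL_F_SLAVE_REQ             5
  #define VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS 14

  /* Hypothetical helper: true if protocol feature bit `bit` is set. */
  static inline bool proto_has(uint64_t features, unsigned bit)
  {
      return (features >> bit) & 1;
  }

  /*
   * Returns false when in-band notifications are requested without
   * both SLAVE_REQ and REPLY_ACK. In that case the slave is expected
   * to close the connection.
   */
  static bool inband_notifications_ok(uint64_t features)
  {
      if (!proto_has(features, VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS)) {
          return true;
      }
      return proto_has(features, VHOST_USER_PROTOCOL_F_SLAVE_REQ) &&
             proto_has(features, VHOST_USER_PROTOCOL_F_REPLY_ACK);
  }

A slave could call such a check on the ``u64`` payload of
``VHOST_USER_SET_PROTOCOL_FEATURES`` before accepting it.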
862
863Master message types
864--------------------
865
866``VHOST_USER_GET_FEATURES``
867  :id: 1
868  :equivalent ioctl: ``VHOST_GET_FEATURES``
869  :master payload: N/A
870  :slave payload: ``u64``
871
872  Get from the underlying vhost implementation the features bitmask.
873  Feature bit ``VHOST_USER_F_PROTOCOL_FEATURES`` signals slave support
874  for ``VHOST_USER_GET_PROTOCOL_FEATURES`` and
875  ``VHOST_USER_SET_PROTOCOL_FEATURES``.
876
877``VHOST_USER_SET_FEATURES``
878  :id: 2
879  :equivalent ioctl: ``VHOST_SET_FEATURES``
880  :master payload: ``u64``
881
882  Enable features in the underlying vhost implementation using a
883  bitmask.  Feature bit ``VHOST_USER_F_PROTOCOL_FEATURES`` signals
884  slave support for ``VHOST_USER_GET_PROTOCOL_FEATURES`` and
885  ``VHOST_USER_SET_PROTOCOL_FEATURES``.
886
887``VHOST_USER_GET_PROTOCOL_FEATURES``
888  :id: 15
889  :equivalent ioctl: ``VHOST_GET_FEATURES``
890  :master payload: N/A
891  :slave payload: ``u64``
892
893  Get the protocol feature bitmask from the underlying vhost
894  implementation.  Only legal if feature bit
895  ``VHOST_USER_F_PROTOCOL_FEATURES`` is present in
896  ``VHOST_USER_GET_FEATURES``.
897
.. Note::
   A slave that reported ``VHOST_USER_F_PROTOCOL_FEATURES`` must
   support this message even before ``VHOST_USER_SET_FEATURES`` was
   called.
902
903``VHOST_USER_SET_PROTOCOL_FEATURES``
904  :id: 16
905  :equivalent ioctl: ``VHOST_SET_FEATURES``
906  :master payload: ``u64``
907
908  Enable protocol features in the underlying vhost implementation.
909
910  Only legal if feature bit ``VHOST_USER_F_PROTOCOL_FEATURES`` is present in
911  ``VHOST_USER_GET_FEATURES``.
912
.. Note::
   A slave that reported ``VHOST_USER_F_PROTOCOL_FEATURES`` must support
   this message even before ``VHOST_USER_SET_FEATURES`` was called.
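
The gating of the two protocol feature messages can be sketched as
follows. This is an illustrative sketch only: the bit number 30 for
``VHOST_USER_F_PROTOCOL_FEATURES`` is assumed from the feature bit
definition elsewhere in this specification, both function names are
hypothetical, and enabling only the intersection of offered and
supported bits is an assumed policy rather than a stated requirement.

.. code:: c

  #include <stdbool.h>
  #include <stdint.h>

  /* Assumed value; see the feature bit definition in this document. */
  #define VHOST_USER_F_PROTOCOL_FEATURES 30

  /*
   * True if the master may send VHOST_USER_GET_PROTOCOL_FEATURES or
   * VHOST_USER_SET_PROTOCOL_FEATURES, given the u64 returned by
   * VHOST_USER_GET_FEATURES.
   */
  static bool may_use_protocol_features(uint64_t get_features_reply)
  {
      return (get_features_reply >> VHOST_USER_F_PROTOCOL_FEATURES) & 1;
  }

  /*
   * Assumed policy: enable only the protocol features that the slave
   * offered and the master itself understands.
   */
  static uint64_t select_protocol_features(uint64_t slave_offered,
                                           uint64_t master_supported)
  {
      return slave_offered & master_supported;
  }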
916
917``VHOST_USER_SET_OWNER``
918  :id: 3
919  :equivalent ioctl: ``VHOST_SET_OWNER``
920  :master payload: N/A
921
922  Issued when a new connection is established. It sets the current
923  *master* as an owner of the session. This can be used on the *slave*
924  as a "session start" flag.
925
926``VHOST_USER_RESET_OWNER``
927  :id: 4
928  :master payload: N/A
929
930.. admonition:: Deprecated
931
932   This is no longer used. Used to be sent to request disabling all
933   rings, but some clients interpreted it to also discard connection
934   state (this interpretation would lead to bugs).  It is recommended
935   that clients either ignore this message, or use it to disable all
936   rings.
937
938``VHOST_USER_SET_MEM_TABLE``
939  :id: 5
940  :equivalent ioctl: ``VHOST_SET_MEM_TABLE``
941  :master payload: memory regions description
942  :slave payload: (postcopy only) memory regions description
943
944  Sets the memory map regions on the slave so it can translate the
945  vring addresses. In the ancillary data there is an array of file
946  descriptors for each memory mapped region. The size and ordering of
947  the fds matches the number and ordering of memory regions.
948
949  When ``VHOST_USER_POSTCOPY_LISTEN`` has been received,
950  ``SET_MEM_TABLE`` replies with the bases of the memory mapped
951  regions to the master.  The slave must have mmap'd the regions but
952  not yet accessed them and should not yet generate a userfault
953  event.
954
955.. Note::
956   ``NEED_REPLY_MASK`` is not set in this case.  QEMU will then
957   reply back to the list of mappings with an empty
958   ``VHOST_USER_SET_MEM_TABLE`` as an acknowledgement; only upon
959   reception of this message may the guest start accessing the memory
960   and generating faults.
961
962``VHOST_USER_SET_LOG_BASE``
963  :id: 6
964  :equivalent ioctl: ``VHOST_SET_LOG_BASE``
965  :master payload: u64
966  :slave payload: N/A
967
968  Sets logging shared memory space.
969
  When the slave has the ``VHOST_USER_PROTOCOL_F_LOG_SHMFD`` protocol
  feature, the log memory fd is provided in the ancillary data of the
  ``VHOST_USER_SET_LOG_BASE`` message, and the size and offset of the
  shared memory area are provided in the message payload.
974
975``VHOST_USER_SET_LOG_FD``
976  :id: 7
977  :equivalent ioctl: ``VHOST_SET_LOG_FD``
978  :master payload: N/A
979
980  Sets the logging file descriptor, which is passed as ancillary data.
981
982``VHOST_USER_SET_VRING_NUM``
983  :id: 8
984  :equivalent ioctl: ``VHOST_SET_VRING_NUM``
985  :master payload: vring state description
986
987  Set the size of the queue.
988
989``VHOST_USER_SET_VRING_ADDR``
990  :id: 9
991  :equivalent ioctl: ``VHOST_SET_VRING_ADDR``
992  :master payload: vring address description
993  :slave payload: N/A
994
995  Sets the addresses of the different aspects of the vring.
996
997``VHOST_USER_SET_VRING_BASE``
998  :id: 10
999  :equivalent ioctl: ``VHOST_SET_VRING_BASE``
1000  :master payload: vring state description
1001
1002  Sets the base offset in the available vring.
1003
1004``VHOST_USER_GET_VRING_BASE``
1005  :id: 11
  :equivalent ioctl: ``VHOST_GET_VRING_BASE``
1007  :master payload: vring state description
1008  :slave payload: vring state description
1009
1010  Get the available vring base offset.
1011
1012``VHOST_USER_SET_VRING_KICK``
1013  :id: 12
1014  :equivalent ioctl: ``VHOST_SET_VRING_KICK``
1015  :master payload: ``u64``
1016
1017  Set the event file descriptor for adding buffers to the vring. It is
1018  passed in the ancillary data.
1019
  Bits (0-7) of the payload contain the vring index. Bit 8 is the
  invalid FD flag. This flag is set when there is no file descriptor
  in the ancillary data. This signals that polling should be used
  instead of waiting for the kick. Note that if the protocol feature
  ``VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS`` has been negotiated,
  this message isn't necessary, as the ring is also started on the
  ``VHOST_USER_VRING_KICK`` message. It may, however, still be used to
  set an event file descriptor (which will be preferred over the
  message) or to enable polling.
1029
1030``VHOST_USER_SET_VRING_CALL``
1031  :id: 13
1032  :equivalent ioctl: ``VHOST_SET_VRING_CALL``
1033  :master payload: ``u64``
1034
1035  Set the event file descriptor to signal when buffers are used. It is
1036  passed in the ancillary data.
1037
  Bits (0-7) of the payload contain the vring index. Bit 8 is the
  invalid FD flag. This flag is set when there is no file descriptor
  in the ancillary data. This signals that polling will be used
  instead of waiting for the call. Note that if the protocol features
  ``VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS`` and
  ``VHOST_USER_PROTOCOL_F_SLAVE_REQ`` have been negotiated, this
  message isn't necessary, as the ``VHOST_USER_SLAVE_VRING_CALL``
  message can be used. It may, however, still be used to set an event
  file descriptor or to enable polling.
1047
1048``VHOST_USER_SET_VRING_ERR``
1049  :id: 14
1050  :equivalent ioctl: ``VHOST_SET_VRING_ERR``
1051  :master payload: ``u64``
1052
  Set the event file descriptor to signal when an error occurs. It is
  passed in the ancillary data.

  Bits (0-7) of the payload contain the vring index. Bit 8 is the
  invalid FD flag. This flag is set when there is no file descriptor
  in the ancillary data. Note that if the protocol features
  ``VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS`` and
  ``VHOST_USER_PROTOCOL_F_SLAVE_REQ`` have been negotiated, this
  message isn't necessary, as the ``VHOST_USER_SLAVE_VRING_ERR``
  message can be used. It may, however, still be used to set an event
  file descriptor (which will be preferred over the message).
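
  The ``u64`` payload layout used by this message is shared with
  ``VHOST_USER_SET_VRING_KICK`` and ``VHOST_USER_SET_VRING_CALL``. A
  minimal sketch of encoding and decoding it is shown here; the macro
  and function names are illustrative and not mandated by the
  protocol.

  .. code:: c

    #include <stdbool.h>
    #include <stdint.h>

    #define VRING_IDX_MASK   0xffULL      /* bits 0-7: vring index */
    #define VRING_NOFD_MASK  (1ULL << 8)  /* bit 8: invalid FD flag */

    /* Build the payload; has_fd is false when no fd is attached. */
    static uint64_t vring_fd_payload(uint8_t index, bool has_fd)
    {
        return (uint64_t)index | (has_fd ? 0 : VRING_NOFD_MASK);
    }

    /* Extract the vring index from a received payload. */
    static unsigned vring_fd_payload_index(uint64_t payload)
    {
        return (unsigned)(payload & VRING_IDX_MASK);
    }

    /* True if a file descriptor is expected in the ancillary data. */
    static bool vring_fd_payload_has_fd(uint64_t payload)
    {
        return (payload & VRING_NOFD_MASK) == 0;
    }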
1064
1065``VHOST_USER_GET_QUEUE_NUM``
1066  :id: 17
1067  :equivalent ioctl: N/A
1068  :master payload: N/A
1069  :slave payload: u64
1070
1071  Query how many queues the backend supports.
1072
  This request should be sent only when ``VHOST_USER_PROTOCOL_F_MQ``
  is set in the protocol features returned by
  ``VHOST_USER_GET_PROTOCOL_FEATURES``.
1076
1077``VHOST_USER_SET_VRING_ENABLE``
1078  :id: 18
1079  :equivalent ioctl: N/A
1080  :master payload: vring state description
1081
  Signal the slave to enable or disable the corresponding vring.
1083
1084  This request should be sent only when
1085  ``VHOST_USER_F_PROTOCOL_FEATURES`` has been negotiated.
1086
1087``VHOST_USER_SEND_RARP``
1088  :id: 19
1089  :equivalent ioctl: N/A
1090  :master payload: ``u64``
1091
  Ask the vhost user backend to broadcast a fake RARP to notify that
  the migration has terminated, for guests that do not support
  ``GUEST_ANNOUNCE``.
1094
1095  Only legal if feature bit ``VHOST_USER_F_PROTOCOL_FEATURES`` is
1096  present in ``VHOST_USER_GET_FEATURES`` and protocol feature bit
1097  ``VHOST_USER_PROTOCOL_F_RARP`` is present in
  ``VHOST_USER_GET_PROTOCOL_FEATURES``.  The first 6 bytes of the
  payload contain the MAC address of the guest to allow the vhost user
  backend to construct and broadcast the fake RARP.
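
  As a sketch of how the MAC address might be carried, the helpers
  here copy the first 6 bytes of the ``u64`` as they are laid out in
  memory. The function names are hypothetical, and zeroing the two
  remaining bytes is an assumption of this sketch.

  .. code:: c

    #include <stdint.h>
    #include <string.h>

    /* Master side: place the MAC in the first 6 bytes of the payload. */
    static uint64_t rarp_payload_from_mac(const uint8_t mac[6])
    {
        uint64_t payload = 0;   /* remaining 2 bytes left as zero */

        memcpy(&payload, mac, 6);
        return payload;
    }

    /* Slave side: recover the MAC from the first 6 bytes. */
    static void rarp_payload_to_mac(uint64_t payload, uint8_t mac[6])
    {
        memcpy(mac, &payload, 6);
    }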
1101
1102``VHOST_USER_NET_SET_MTU``
1103  :id: 20
1104  :equivalent ioctl: N/A
1105  :master payload: ``u64``
1106
1107  Set host MTU value exposed to the guest.
1108
1109  This request should be sent only when ``VIRTIO_NET_F_MTU`` feature
1110  has been successfully negotiated, ``VHOST_USER_F_PROTOCOL_FEATURES``
1111  is present in ``VHOST_USER_GET_FEATURES`` and protocol feature bit
1112  ``VHOST_USER_PROTOCOL_F_NET_MTU`` is present in
1113  ``VHOST_USER_GET_PROTOCOL_FEATURES``.
1114
1115  If ``VHOST_USER_PROTOCOL_F_REPLY_ACK`` is negotiated, slave must
1116  respond with zero in case the specified MTU is valid, or non-zero
1117  otherwise.
1118
1119``VHOST_USER_SET_SLAVE_REQ_FD``
1120  :id: 21
1121  :equivalent ioctl: N/A
1122  :master payload: N/A
1123
1124  Set the socket file descriptor for slave initiated requests. It is passed
1125  in the ancillary data.
1126
1127  This request should be sent only when
1128  ``VHOST_USER_F_PROTOCOL_FEATURES`` has been negotiated, and protocol
  feature bit ``VHOST_USER_PROTOCOL_F_SLAVE_REQ`` is present in
1130  ``VHOST_USER_GET_PROTOCOL_FEATURES``.  If
1131  ``VHOST_USER_PROTOCOL_F_REPLY_ACK`` is negotiated, slave must
1132  respond with zero for success, non-zero otherwise.
1133
1134``VHOST_USER_IOTLB_MSG``
1135  :id: 22
1136  :equivalent ioctl: N/A (equivalent to ``VHOST_IOTLB_MSG`` message type)
1137  :master payload: ``struct vhost_iotlb_msg``
1138  :slave payload: ``u64``
1139
1140  Send IOTLB messages with ``struct vhost_iotlb_msg`` as payload.
1141
  Master sends such requests to update and invalidate entries in the
  device IOTLB. The slave has to acknowledge the request by sending
  zero as ``u64`` payload for success, non-zero otherwise.

  This request should be sent only when ``VIRTIO_F_IOMMU_PLATFORM``
  feature has been successfully negotiated.
1148
1149``VHOST_USER_SET_VRING_ENDIAN``
1150  :id: 23
1151  :equivalent ioctl: ``VHOST_SET_VRING_ENDIAN``
1152  :master payload: vring state description
1153
1154  Set the endianness of a VQ for legacy devices. Little-endian is
1155  indicated with state.num set to 0 and big-endian is indicated with
1156  state.num set to 1. Other values are invalid.
1157
1158  This request should be sent only when
1159  ``VHOST_USER_PROTOCOL_F_CROSS_ENDIAN`` has been negotiated.
1160  Backends that negotiated this feature should handle both
1161  endiannesses and expect this message once (per VQ) during device
  configuration (i.e. before the master starts the VQ).
1163
1164``VHOST_USER_GET_CONFIG``
1165  :id: 24
1166  :equivalent ioctl: N/A
1167  :master payload: virtio device config space
1168  :slave payload: virtio device config space
1169
  When ``VHOST_USER_PROTOCOL_F_CONFIG`` is negotiated, this message is
  submitted by the vhost-user master to fetch the contents of the
  virtio device configuration space. The vhost-user slave's payload
  size MUST match the master's request. The vhost-user slave uses a
  zero-length payload to indicate an error to the vhost-user master.
  The vhost-user master may cache the contents to avoid repeated
  ``VHOST_USER_GET_CONFIG`` calls.
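
  A master might validate the reply size along these lines. This is a
  sketch under the assumption that both sizes are the payload sizes
  carried in the respective message headers; the enum and function
  names are hypothetical.

  .. code:: c

    #include <stdint.h>

    enum config_reply_status {
        CONFIG_REPLY_OK,
        CONFIG_REPLY_SLAVE_ERROR,    /* zero-length payload */
        CONFIG_REPLY_SIZE_MISMATCH   /* slave violated the MUST */
    };

    /* Compare the reply payload size with the size that was requested. */
    static enum config_reply_status
    check_get_config_reply(uint32_t request_size, uint32_t reply_size)
    {
        if (reply_size == 0) {
            return CONFIG_REPLY_SLAVE_ERROR;
        }
        if (reply_size != request_size) {
            return CONFIG_REPLY_SIZE_MISMATCH;
        }
        return CONFIG_REPLY_OK;
    }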
1177
1178``VHOST_USER_SET_CONFIG``
1179  :id: 25
1180  :equivalent ioctl: N/A
1181  :master payload: virtio device config space
1182  :slave payload: N/A
1183
  When ``VHOST_USER_PROTOCOL_F_CONFIG`` is negotiated, this message is
  submitted by the vhost-user master when the guest changes the virtio
  device configuration space. It can also be used for live migration
  on the destination host. The vhost-user slave must check the flags
  field, and slaves MUST NOT accept SET_CONFIG for read-only
  configuration space fields unless the live migration bit is set.
1190
1191``VHOST_USER_CREATE_CRYPTO_SESSION``
1192  :id: 26
1193  :equivalent ioctl: N/A
1194  :master payload: crypto session description
1195  :slave payload: crypto session description
1196
1197  Create a session for crypto operation. The server side must return
1198  the session id, 0 or positive for success, negative for failure.
1199  This request should be sent only when
1200  ``VHOST_USER_PROTOCOL_F_CRYPTO_SESSION`` feature has been
1201  successfully negotiated.  It's a required feature for crypto
1202  devices.
1203
1204``VHOST_USER_CLOSE_CRYPTO_SESSION``
1205  :id: 27
1206  :equivalent ioctl: N/A
1207  :master payload: ``u64``
1208
1209  Close a session for crypto operation which was previously
1210  created by ``VHOST_USER_CREATE_CRYPTO_SESSION``.
1211
1212  This request should be sent only when
1213  ``VHOST_USER_PROTOCOL_F_CRYPTO_SESSION`` feature has been
1214  successfully negotiated.  It's a required feature for crypto
1215  devices.
1216
1217``VHOST_USER_POSTCOPY_ADVISE``
1218  :id: 28
1219  :master payload: N/A
1220  :slave payload: userfault fd
1221
  When ``VHOST_USER_PROTOCOL_F_PAGEFAULT`` is supported, the master
  advises the slave that a migration with postcopy enabled is underway;
  the slave must open a userfaultfd for later use.  Note that at this
  stage the migration is still in precopy mode.
1226
1227``VHOST_USER_POSTCOPY_LISTEN``
1228  :id: 29
1229  :master payload: N/A
1230
1231  Master advises slave that a transition to postcopy mode has
1232  happened.  The slave must ensure that shared memory is registered
1233  with userfaultfd to cause faulting of non-present pages.
1234
1235  This is always sent sometime after a ``VHOST_USER_POSTCOPY_ADVISE``,
1236  and thus only when ``VHOST_USER_PROTOCOL_F_PAGEFAULT`` is supported.
1237
1238``VHOST_USER_POSTCOPY_END``
1239  :id: 30
1240  :slave payload: ``u64``
1241
1242  Master advises that postcopy migration has now completed.  The slave
1243  must disable the userfaultfd. The response is an acknowledgement
1244  only.
1245
1246  When ``VHOST_USER_PROTOCOL_F_PAGEFAULT`` is supported, this message
1247  is sent at the end of the migration, after
1248  ``VHOST_USER_POSTCOPY_LISTEN`` was previously sent.
1249
1250  The value returned is an error indication; 0 is success.
1251
1252``VHOST_USER_GET_INFLIGHT_FD``
1253  :id: 31
1254  :equivalent ioctl: N/A
1255  :master payload: inflight description
1256
1257  When ``VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD`` protocol feature has
1258  been successfully negotiated, this message is submitted by master to
1259  get a shared buffer from slave. The shared buffer will be used to
  track inflight I/O by slave. QEMU should retrieve a new one when the
  VM is reset.
1262
1263``VHOST_USER_SET_INFLIGHT_FD``
1264  :id: 32
1265  :equivalent ioctl: N/A
1266  :master payload: inflight description
1267
1268  When ``VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD`` protocol feature has
1269  been successfully negotiated, this message is submitted by master to
  send the shared inflight buffer back to slave so that the slave can
  get inflight I/O after a crash or restart.
1272
1273``VHOST_USER_GPU_SET_SOCKET``
1274  :id: 33
1275  :equivalent ioctl: N/A
1276  :master payload: N/A
1277
1278  Sets the GPU protocol socket file descriptor, which is passed as
1279  ancillary data. The GPU protocol is used to inform the master of
1280  rendering state and updates. See vhost-user-gpu.rst for details.
1281
1282``VHOST_USER_RESET_DEVICE``
1283  :id: 34
1284  :equivalent ioctl: N/A
1285  :master payload: N/A
1286  :slave payload: N/A
1287
1288  Ask the vhost user backend to disable all rings and reset all
1289  internal device state to the initial state, ready to be
1290  reinitialized. The backend retains ownership of the device
1291  throughout the reset operation.
1292
1293  Only valid if the ``VHOST_USER_PROTOCOL_F_RESET_DEVICE`` protocol
1294  feature is set by the backend.
1295
1296``VHOST_USER_VRING_KICK``
1297  :id: 35
1298  :equivalent ioctl: N/A
  :master payload: vring state description
  :slave payload: N/A
1301
1302  When the ``VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS`` protocol
1303  feature has been successfully negotiated, this message may be
1304  submitted by the master to indicate that a buffer was added to
1305  the vring instead of signalling it using the vring's kick file
1306  descriptor or having the slave rely on polling.
1307
1308  The state.num field is currently reserved and must be set to 0.
1309
1310``VHOST_USER_GET_MAX_MEM_SLOTS``
1311  :id: 36
1312  :equivalent ioctl: N/A
1313  :slave payload: u64
1314
1315  When the ``VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS`` protocol
1316  feature has been successfully negotiated, this message is submitted
1317  by master to the slave. The slave should return the message with a
1318  u64 payload containing the maximum number of memory slots for
1319  QEMU to expose to the guest. The value returned by the backend
1320  will be capped at the maximum number of ram slots which can be
1321  supported by the target platform.
1322
1323``VHOST_USER_ADD_MEM_REG``
1324  :id: 37
1325  :equivalent ioctl: N/A
  :master payload: single memory region description
  :slave payload: (postcopy only) single memory region description
1327
  When the ``VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS`` protocol
  feature has been successfully negotiated, this message is submitted
  by the master to the slave. The message payload contains a memory
  region descriptor struct, describing a region of guest memory which
  the slave device must map in. Together with the
  ``VHOST_USER_REM_MEM_REG`` message, this message is used to set and
  update the memory tables of the slave device.
1337
1338  Exactly one file descriptor from which the memory is mapped is
1339  passed in the ancillary data.
1340
1341  In postcopy mode (see ``VHOST_USER_POSTCOPY_LISTEN``), the slave
1342  replies with the bases of the memory mapped region to the master.
1343  For further details on postcopy, see ``VHOST_USER_SET_MEM_TABLE``.
1344  They apply to ``VHOST_USER_ADD_MEM_REG`` accordingly.
1345
1346``VHOST_USER_REM_MEM_REG``
1347  :id: 38
1348  :equivalent ioctl: N/A
  :master payload: single memory region description
1350
  When the ``VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS`` protocol
  feature has been successfully negotiated, this message is submitted
  by the master to the slave. The message payload contains a memory
  region descriptor struct, describing a region of guest memory which
  the slave device must unmap. Together with the
  ``VHOST_USER_ADD_MEM_REG`` message, this message is used to set and
  update the memory tables of the slave device.
1360
1361  The memory region to be removed is identified by its guest address,
1362  user address and size. The mmap offset is ignored.
1363
1364  No file descriptors SHOULD be passed in the ancillary data. For
1365  compatibility with existing incorrect implementations, the slave MAY
1366  accept messages with one file descriptor. If a file descriptor is
1367  passed, the slave MUST close it without using it otherwise.
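
  The matching rule and the handling of a stray file descriptor could
  look roughly like this. It is a sketch only: ``struct mem_region``
  and its field names are hypothetical, modelled on the memory region
  description, and a real slave would also unmap the matched region.

  .. code:: c

    #include <stdbool.h>
    #include <stdint.h>
    #include <unistd.h>

    /* Hypothetical in-memory form of a memory region description. */
    struct mem_region {
        uint64_t guest_phys_addr;
        uint64_t memory_size;
        uint64_t userspace_addr;
        uint64_t mmap_offset;
    };

    /* A region matches on guest address, user address and size only. */
    static bool rem_mem_reg_matches(const struct mem_region *have,
                                    const struct mem_region *req)
    {
        return have->guest_phys_addr == req->guest_phys_addr &&
               have->userspace_addr == req->userspace_addr &&
               have->memory_size == req->memory_size;
    }

    /* Close a descriptor sent by a non-conforming master, unused. */
    static void rem_mem_reg_drop_fd(int fd)
    {
        if (fd >= 0) {
            close(fd);
        }
    }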
1368
1369``VHOST_USER_SET_STATUS``
1370  :id: 39
1371  :equivalent ioctl: VHOST_VDPA_SET_STATUS
1372  :slave payload: N/A
1373  :master payload: ``u64``
1374
1375  When the ``VHOST_USER_PROTOCOL_F_STATUS`` protocol feature has been
1376  successfully negotiated, this message is submitted by the master to
1377  notify the backend with updated device status as defined in the Virtio
1378  specification.
1379
1380``VHOST_USER_GET_STATUS``
1381  :id: 40
1382  :equivalent ioctl: VHOST_VDPA_GET_STATUS
1383  :slave payload: ``u64``
1384  :master payload: N/A
1385
1386  When the ``VHOST_USER_PROTOCOL_F_STATUS`` protocol feature has been
1387  successfully negotiated, this message is submitted by the master to
1388  query the backend for its device status as defined in the Virtio
1389  specification.
1390
1391
1392Slave message types
1393-------------------
1394
1395``VHOST_USER_SLAVE_IOTLB_MSG``
1396  :id: 1
1397  :equivalent ioctl: N/A (equivalent to ``VHOST_IOTLB_MSG`` message type)
1398  :slave payload: ``struct vhost_iotlb_msg``
1399  :master payload: N/A
1400
1401  Send IOTLB messages with ``struct vhost_iotlb_msg`` as payload.
1402  Slave sends such requests to notify of an IOTLB miss, or an IOTLB
1403  access failure. If ``VHOST_USER_PROTOCOL_F_REPLY_ACK`` is
1404  negotiated, and slave set the ``VHOST_USER_NEED_REPLY`` flag, master
1405  must respond with zero when operation is successfully completed, or
  non-zero otherwise.  This request should be sent only when
1407  ``VIRTIO_F_IOMMU_PLATFORM`` feature has been successfully
1408  negotiated.
1409
1410``VHOST_USER_SLAVE_CONFIG_CHANGE_MSG``
1411  :id: 2
1412  :equivalent ioctl: N/A
1413  :slave payload: N/A
1414  :master payload: N/A
1415
  When ``VHOST_USER_PROTOCOL_F_CONFIG`` is negotiated, the vhost-user
  slave sends such messages to notify that the virtio device's
  configuration space has changed. For those host devices which can
  support such a feature, the host driver can send a
  ``VHOST_USER_GET_CONFIG`` message to the slave to get the latest
  content. If ``VHOST_USER_PROTOCOL_F_REPLY_ACK`` is negotiated, and
  the slave set the ``VHOST_USER_NEED_REPLY`` flag, the master must
  respond with zero when the operation is successfully completed, or
  non-zero otherwise.
1424
1425``VHOST_USER_SLAVE_VRING_HOST_NOTIFIER_MSG``
1426  :id: 3
1427  :equivalent ioctl: N/A
1428  :slave payload: vring area description
1429  :master payload: N/A
1430
1431  Sets host notifier for a specified queue. The queue index is
1432  contained in the ``u64`` field of the vring area description. The
1433  host notifier is described by the file descriptor (typically it's a
1434  VFIO device fd) which is passed as ancillary data and the size
1435  (which is mmap size and should be the same as host page size) and
1436  offset (which is mmap offset) carried in the vring area
1437  description. QEMU can mmap the file descriptor based on the size and
1438  offset to get a memory range. Registering a host notifier means
1439  mapping this memory range to the VM as the specified queue's notify
1440  MMIO region. Slave sends this request to tell QEMU to de-register
1441  the existing notifier if any and register the new notifier if the
1442  request is sent with a file descriptor.
1443
1444  This request should be sent only when
1445  ``VHOST_USER_PROTOCOL_F_HOST_NOTIFIER`` protocol feature has been
1446  successfully negotiated.
1447
1448``VHOST_USER_SLAVE_VRING_CALL``
1449  :id: 4
1450  :equivalent ioctl: N/A
1451  :slave payload: vring state description
1452  :master payload: N/A
1453
1454  When the ``VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS`` protocol
1455  feature has been successfully negotiated, this message may be
1456  submitted by the slave to indicate that a buffer was used from
1457  the vring instead of signalling this using the vring's call file
  descriptor or having the master rely on polling.
1459
1460  The state.num field is currently reserved and must be set to 0.
1461
1462``VHOST_USER_SLAVE_VRING_ERR``
1463  :id: 5
1464  :equivalent ioctl: N/A
1465  :slave payload: vring state description
1466  :master payload: N/A
1467
1468  When the ``VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS`` protocol
1469  feature has been successfully negotiated, this message may be
1470  submitted by the slave to indicate that an error occurred on the
1471  specific vring, instead of signalling the error file descriptor
1472  set by the master via ``VHOST_USER_SET_VRING_ERR``.
1473
1474  The state.num field is currently reserved and must be set to 0.
1475
1476.. _reply_ack:
1477
1478VHOST_USER_PROTOCOL_F_REPLY_ACK
1479-------------------------------
1480
1481The original vhost-user specification only demands replies for certain
1482commands. This differs from the vhost protocol implementation where
1483commands are sent over an ``ioctl()`` call and block until the client
1484has completed.
1485
With this protocol extension negotiated, the sender (QEMU) can set the
``need_reply`` [Bit 3] flag on any command. This indicates that the
1488client MUST respond with a Payload ``VhostUserMsg`` indicating success
1489or failure. The payload should be set to zero on success or non-zero
1490on failure, unless the message already has an explicit reply body.
1491
The response payload gives QEMU a deterministic indication of the result
of the command. Today, QEMU is expected to terminate the main vhost-user
loop upon receiving such errors. In the future, QEMU could be taught to
be more resilient for selective requests.
1496
1497For the message types that already solicit a reply from the client,
1498the presence of ``VHOST_USER_PROTOCOL_F_REPLY_ACK`` or need_reply bit
1499being set brings no behavioural change. (See the Communication_
1500section for details.)
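
The decision a client has to make for each incoming message can be
sketched as follows. This is an illustrative sketch: the bit
positions follow the header flags layout described in this document
(bit 2 is the reply flag, bit 3 is ``need_reply``), while the macro
and function names are hypothetical.

.. code:: c

  #include <stdbool.h>
  #include <stdint.h>

  #define HDR_FLAG_REPLY       (1u << 2)  /* set on replies */
  #define HDR_FLAG_NEED_REPLY  (1u << 3)  /* sender requests an ack */

  /*
   * True if the receiver must send a dedicated u64 acknowledgement.
   * Messages that already have an explicit reply body get no extra
   * ack, so negotiating REPLY_ACK changes nothing for them.
   */
  static bool must_send_ack(uint32_t flags, bool reply_ack_negotiated,
                            bool has_explicit_reply)
  {
      return reply_ack_negotiated &&
             (flags & HDR_FLAG_NEED_REPLY) != 0 &&
             !has_explicit_reply;
  }

  /* Ack payload: zero on success, non-zero on failure. */
  static uint64_t ack_payload(int err)
  {
      return err ? 1 : 0;
  }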
1501
1502.. _backend_conventions:
1503
1504Backend program conventions
1505===========================
1506
1507vhost-user backends can provide various devices & services and may
1508need to be configured manually depending on the use case. However, it
1509is a good idea to follow the conventions listed here when
1510possible. Users, QEMU or libvirt, can then rely on some common
1511behaviour to avoid heterogeneous configuration and management of the
1512backend programs and facilitate interoperability.
1513
1514Each backend installed on a host system should come with at least one
1515JSON file that conforms to the vhost-user.json schema. Each file
1516informs the management applications about the backend type, and binary
1517location. In addition, it defines rules for management apps for
1518picking the highest priority backend when multiple match the search
1519criteria (see ``@VhostUserBackend`` documentation in the schema file).
1520
1521If the backend is not capable of enabling a requested feature on the
1522host (such as 3D acceleration with virgl), or the initialization
1523failed, the backend should fail to start early and exit with a status
1524!= 0. It may also print a message to stderr for further details.
1525
The backend program must not daemonize itself, but it may be
daemonized by the management layer. It may also have restricted
access to the system.
1529
1530File descriptors 0, 1 and 2 will exist, and have regular
1531stdin/stdout/stderr usage (they may have been redirected to /dev/null
1532by the management layer, or to a log handler).
1533
The backend program must end (as quickly and cleanly as possible) when
the SIGTERM signal is received. It may eventually receive SIGKILL from
the management layer after a few seconds.
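
One common way to meet the SIGTERM requirement is to set a flag in the
signal handler and let the main loop poll it. The following is a
minimal sketch of that pattern using POSIX ``sigaction``; the function
and variable names are hypothetical.

.. code:: c

  #define _POSIX_C_SOURCE 200809L
  #include <signal.h>
  #include <stdbool.h>
  #include <string.h>

  /* Set from the handler; only async-signal-safe work happens there. */
  static volatile sig_atomic_t quit_requested;

  static void on_sigterm(int signo)
  {
      (void)signo;
      quit_requested = 1;
  }

  /* Install the SIGTERM handler; returns 0 on success, -1 on error. */
  static int install_sigterm_handler(void)
  {
      struct sigaction sa;

      memset(&sa, 0, sizeof(sa));
      sa.sa_handler = on_sigterm;
      sigemptyset(&sa.sa_mask);
      return sigaction(SIGTERM, &sa, NULL);
  }

  /* Polled by the main loop to decide when to clean up and exit. */
  static bool backend_should_quit(void)
  {
      return quit_requested != 0;
  }

The main loop would check ``backend_should_quit()`` between requests
and exit promptly once it returns true.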
1537
1538The following command line options have an expected behaviour. They
1539are mandatory, unless explicitly said differently:
1540
1541--socket-path=PATH
1542
  This option specifies the location of the vhost-user Unix domain socket.
  It is incompatible with --fd.
1545
1546--fd=FDNUM
1547
1548  When this argument is given, the backend program is started with the
1549  vhost-user socket as file descriptor FDNUM. It is incompatible with
1550  --socket-path.
1551
1552--print-capabilities
1553
1554  Output to stdout the backend capabilities in JSON format, and then
1555  exit successfully. Other options and arguments should be ignored, and
1556  the backend program should not perform its normal function.  The
1557  capabilities can be reported dynamically depending on the host
1558  capabilities.
1559
1560The JSON output is described in the ``vhost-user.json`` schema, by
``@VHostUserBackendCapabilities``.  Example:
1562
.. code:: json

  {
    "type": "foo",
    "features": [
      "feature-a",
      "feature-b"
    ]
  }
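
The rules for the three common options can be sketched with
``getopt_long``. This is only an illustration: ``struct backend_opts``
and ``parse_backend_opts`` are hypothetical names, and requiring
exactly one of ``--socket-path`` and ``--fd`` when not printing
capabilities is an assumption of this sketch.

.. code:: c

  #include <getopt.h>
  #include <stdbool.h>
  #include <stddef.h>
  #include <stdlib.h>

  struct backend_opts {
      const char *socket_path;   /* NULL if not given */
      int fd;                    /* -1 if not given */
      bool print_capabilities;
  };

  /* Returns 0 if the options are acceptable, -1 otherwise. */
  static int parse_backend_opts(int argc, char **argv,
                                struct backend_opts *o)
  {
      static const struct option longopts[] = {
          { "socket-path",        required_argument, NULL, 's' },
          { "fd",                 required_argument, NULL, 'f' },
          { "print-capabilities", no_argument,       NULL, 'c' },
          { NULL, 0, NULL, 0 }
      };
      bool bad = false;
      int c;

      o->socket_path = NULL;
      o->fd = -1;
      o->print_capabilities = false;
      optind = 1;   /* restart scanning */
      opterr = 0;   /* no getopt diagnostics on stderr */

      while ((c = getopt_long(argc, argv, "", longopts, NULL)) != -1) {
          char *end;
          long v;

          switch (c) {
          case 's':
              o->socket_path = optarg;
              break;
          case 'f':
              v = strtol(optarg, &end, 10);
              if (*optarg == '\0' || *end != '\0' || v < 0) {
                  bad = true;
              } else {
                  o->fd = (int)v;
              }
              break;
          case 'c':
              o->print_capabilities = true;
              break;
          default:
              bad = true;
              break;
          }
      }
      if (o->print_capabilities) {
          return 0;   /* other options and arguments are ignored */
      }
      if (bad || (o->socket_path != NULL) == (o->fd >= 0)) {
          return -1;  /* --socket-path and --fd are incompatible */
      }
      return 0;
  }

When ``print_capabilities`` is set, the caller would write the JSON
document to stdout and exit successfully without starting the device.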
1572
1573vhost-user-input
1574----------------
1575
1576Command line options:
1577
1578--evdev-path=PATH
1579
  Specify the Linux input device.
1581
1582  (optional)
1583
1584--no-grab
1585
  Do not request exclusive access to the input device.
1587
1588  (optional)
1589
1590vhost-user-gpu
1591--------------
1592
1593Command line options:
1594
1595--render-node=PATH
1596
1597  Specify the GPU DRM render node.
1598
1599  (optional)
1600
1601--virgl
1602
1603  Enable virgl rendering support.
1604
1605  (optional)
1606
1607vhost-user-blk
1608--------------
1609
1610Command line options:
1611
1612--blk-file=PATH
1613
1614  Specify block device or file path.
1615
1616  (optional)
1617
1618--read-only
1619
  Enable read-only mode.
1621
1622  (optional)
1623