.. _vhost_user_proto:

===================
Vhost-user Protocol
===================

..
  Copyright 2014 Virtual Open Systems Sarl.
  Copyright 2019 Intel Corporation
  Licence: This work is licensed under the terms of the GNU GPL,
           version 2 or later. See the COPYING file in the top-level
           directory.

.. contents:: Table of Contents

Introduction
============

This protocol aims to complement the ``ioctl`` interface used to
control the vhost implementation in the Linux kernel. It implements
the control plane needed to establish virtqueue sharing with a user
space process on the same host. It uses communication over a Unix
domain socket to share file descriptors in the ancillary data of the
message.

The protocol defines two sides of the communication, *master* and
*slave*. *Master* is the application that shares its virtqueues, in
our case QEMU. *Slave* is the consumer of the virtqueues.

In the current implementation QEMU is the *master*, and the *slave* is
the external process consuming the virtio queues, for example a
software Ethernet switch running in user space, such as Snabbswitch,
or a block device backend processing reads and writes to a virtual
disk. In order to facilitate interoperability between various backend
implementations, it is recommended to follow the :ref:`Backend program
conventions <backend_conventions>`.

*Master* and *slave* can be either a client (i.e. connecting) or
server (listening) in the socket communication.

Message Specification
=====================

.. Note:: All numbers are in the machine native byte order.

A vhost-user message consists of 3 header fields and a payload.

+---------+-------+------+---------+
| request | flags | size | payload |
+---------+-------+------+---------+

Header
------

:request: 32-bit type of the request

:flags: 32-bit bit field

- Lower 2 bits are the version (currently 0x01)
- Bit 2 is the reply flag - needs to be sent on each reply from the slave
- Bit 3 is the need_reply flag - see :ref:`REPLY_ACK <reply_ack>` for
  details.

:size: 32-bit size of the payload
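
As an illustration, the header fields and flag bits listed above could
be modelled as follows. This is only a minimal sketch: every ``ex_``
and ``EX_`` name is hypothetical and not taken from QEMU, and only the
bit positions come from the list above.

.. code:: c

  #include <stdbool.h>
  #include <stdint.h>

  /* Hypothetical names; only the bit layout follows the text above. */
  #define EX_FLAG_VERSION_MASK 0x3u       /* lower 2 bits: version */
  #define EX_FLAG_REPLY        (1u << 2)  /* bit 2: reply flag */
  #define EX_FLAG_NEED_REPLY   (1u << 3)  /* bit 3: need_reply flag */
  #define EX_VERSION           0x1u       /* current version */

  struct ex_header {
      uint32_t request;  /* type of the request */
      uint32_t flags;    /* version, reply and need_reply bits */
      uint32_t size;     /* size of the payload that follows */
  };

  static inline uint32_t ex_version(const struct ex_header *h)
  {
      return h->flags & EX_FLAG_VERSION_MASK;
  }

  static inline bool ex_is_reply(const struct ex_header *h)
  {
      return (h->flags & EX_FLAG_REPLY) != 0;
  }

  static inline bool ex_needs_reply(const struct ex_header *h)
  {
      return (h->flags & EX_FLAG_NEED_REPLY) != 0;
  }

  /* Flags for a reply in this sketch: current version plus reply bit. */
  static inline uint32_t ex_reply_flags(void)
  {
      return EX_VERSION | EX_FLAG_REPLY;
  }

A receiver would typically check ``ex_version()`` before looking at
the payload, and only then dispatch on ``request``.
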

Payload
-------

Depending on the request type, **payload** can be:

A single 64-bit integer
^^^^^^^^^^^^^^^^^^^^^^^

+-----+
| u64 |
+-----+

:u64: a 64-bit unsigned integer

A vring state description
^^^^^^^^^^^^^^^^^^^^^^^^^

+-------+-----+
| index | num |
+-------+-----+

:index: a 32-bit index

:num: a 32-bit number

A vring address description
^^^^^^^^^^^^^^^^^^^^^^^^^^^

+-------+-------+------+------------+------+-----------+-----+
| index | flags | size | descriptor | used | available | log |
+-------+-------+------+------------+------+-----------+-----+

:index: a 32-bit vring index

:flags: 32-bit vring flags

:descriptor: a 64-bit ring address of the vring descriptor table

:used: a 64-bit ring address of the vring used ring

:available: a 64-bit ring address of the vring available ring

:log: a 64-bit guest address for logging

Note that a ring address is an IOVA if ``VIRTIO_F_IOMMU_PLATFORM`` has
been negotiated. Otherwise it is a user address.

Memory regions description
^^^^^^^^^^^^^^^^^^^^^^^^^^

+-------------+---------+---------+-----+---------+
| num regions | padding | region0 | ... | region7 |
+-------------+---------+---------+-----+---------+

:num regions: a 32-bit number of regions

:padding: 32-bit

A region is:

+---------------+------+--------------+-------------+
| guest address | size | user address | mmap offset |
+---------------+------+--------------+-------------+

:guest address: a 64-bit guest address of the region

:size: a 64-bit size

:user address: a 64-bit user address

:mmap offset: 64-bit offset where region starts in the mapped memory

Single memory region description
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

+---------+---------------+------+--------------+-------------+
| padding | guest address | size | user address | mmap offset |
+---------+---------------+------+--------------+-------------+

:padding: 64-bit

:guest address: a 64-bit guest address of the region

:size: a 64-bit size

:user address: a 64-bit user address

:mmap offset: 64-bit offset where region starts in the mapped memory

Log description
^^^^^^^^^^^^^^^

+----------+------------+
| log size | log offset |
+----------+------------+

:log size: size of area used for logging

:log offset: offset from start of supplied file descriptor where
             logging starts (i.e. where guest address 0 would be
             logged)

An IOTLB message
^^^^^^^^^^^^^^^^

+------+------+--------------+-------------------+------+
| iova | size | user address | permissions flags | type |
+------+------+--------------+-------------------+------+

:iova: a 64-bit I/O virtual address programmed by the guest

:size: a 64-bit size

:user address: a 64-bit user address

:permissions flags: an 8-bit value:
  - 0: No access
  - 1: Read access
  - 2: Write access
  - 3: Read/Write access

:type: an 8-bit IOTLB message type:
  - 1: IOTLB miss
  - 2: IOTLB update
  - 3: IOTLB invalidate
  - 4: IOTLB access fail

Virtio device config space
^^^^^^^^^^^^^^^^^^^^^^^^^^

+--------+------+-------+---------+
| offset | size | flags | payload |
+--------+------+-------+---------+

:offset: a 32-bit offset of virtio device's configuration space

:size: a 32-bit configuration space access size in bytes

:flags: a 32-bit value:
  - 0: Vhost master messages used for writeable fields
  - 1: Vhost master messages used for live migration

:payload: Size bytes array holding the contents of the virtio
          device's configuration space

Vring area description
^^^^^^^^^^^^^^^^^^^^^^

+-----+------+--------+
| u64 | size | offset |
+-----+------+--------+

:u64: a 64-bit integer containing the vring index and flags

:size: a 64-bit size of this area

:offset: a 64-bit offset of this area from the start of the
         supplied file descriptor

Inflight description
^^^^^^^^^^^^^^^^^^^^

+-----------+-------------+------------+------------+
| mmap size | mmap offset | num queues | queue size |
+-----------+-------------+------------+------------+

:mmap size: a 64-bit size of area to track inflight I/O

:mmap offset: a 64-bit offset of this area from the start
              of the supplied file descriptor

:num queues: a 16-bit number of virtqueues

:queue size: a 16-bit size of virtqueues

C structure
-----------

In QEMU the vhost-user message is implemented with the following struct:

.. code:: c

  typedef struct VhostUserMsg {
      VhostUserRequest request;
      uint32_t flags;
      uint32_t size;
      union {
          uint64_t u64;
          struct vhost_vring_state state;
          struct vhost_vring_addr addr;
          VhostUserMemory memory;
          VhostUserLog log;
          struct vhost_iotlb_msg iotlb;
          VhostUserConfig config;
          VhostUserVringArea area;
          VhostUserInflight inflight;
      };
  } QEMU_PACKED VhostUserMsg;

Communication
=============

The protocol for vhost-user is based on the existing implementation of
vhost for the Linux Kernel. Most messages that can be sent via the
Unix domain socket implementing vhost-user have an equivalent ioctl to
the kernel implementation.

The communication consists of *master* sending message requests and
*slave* sending message replies. Most of the requests don't require
replies. Here is a list of the ones that do:

* ``VHOST_USER_GET_FEATURES``
* ``VHOST_USER_GET_PROTOCOL_FEATURES``
* ``VHOST_USER_GET_VRING_BASE``
* ``VHOST_USER_SET_LOG_BASE`` (if ``VHOST_USER_PROTOCOL_F_LOG_SHMFD``)
* ``VHOST_USER_GET_INFLIGHT_FD`` (if ``VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD``)

.. seealso::

   :ref:`REPLY_ACK <reply_ack>`
       The section on ``REPLY_ACK`` protocol extension.

There are several messages that the master sends with file descriptors passed
in the ancillary data:

* ``VHOST_USER_SET_MEM_TABLE``
* ``VHOST_USER_SET_LOG_BASE`` (if ``VHOST_USER_PROTOCOL_F_LOG_SHMFD``)
* ``VHOST_USER_SET_LOG_FD``
* ``VHOST_USER_SET_VRING_KICK``
* ``VHOST_USER_SET_VRING_CALL``
* ``VHOST_USER_SET_VRING_ERR``
* ``VHOST_USER_SET_SLAVE_REQ_FD``
* ``VHOST_USER_SET_INFLIGHT_FD`` (if ``VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD``)
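
File descriptors travel over a Unix domain socket as ``SCM_RIGHTS``
ancillary data. The following sketch shows one possible way to send
and receive a buffer together with descriptors using the POSIX
``sendmsg()`` and ``recvmsg()`` calls. The ``ex_`` helper names and
the ``EX_MAX_FDS`` limit are assumptions made for this example and are
not part of the protocol or of QEMU.

.. code:: c

  #include <string.h>
  #include <sys/socket.h>
  #include <sys/types.h>
  #include <sys/uio.h>
  #include <unistd.h>

  #define EX_MAX_FDS 8  /* assumption: upper bound used by this sketch */

  /* A union keeps the control buffer aligned for struct cmsghdr. */
  union ex_control {
      char buf[CMSG_SPACE(sizeof(int) * EX_MAX_FDS)];
      struct cmsghdr align;
  };

  /* Send len bytes and nfds descriptors in a single message. */
  static ssize_t ex_send_with_fds(int sock, const void *buf, size_t len,
                                  const int *fds, size_t nfds)
  {
      union ex_control control;
      struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
      struct msghdr msg;

      if (nfds > EX_MAX_FDS) {
          return -1;
      }
      memset(&msg, 0, sizeof(msg));
      msg.msg_iov = &iov;
      msg.msg_iovlen = 1;
      if (nfds > 0) {
          struct cmsghdr *cmsg;

          memset(&control, 0, sizeof(control));
          msg.msg_control = control.buf;
          msg.msg_controllen = CMSG_SPACE(sizeof(int) * nfds);
          cmsg = CMSG_FIRSTHDR(&msg);
          cmsg->cmsg_level = SOL_SOCKET;
          cmsg->cmsg_type = SCM_RIGHTS;
          cmsg->cmsg_len = CMSG_LEN(sizeof(int) * nfds);
          memcpy(CMSG_DATA(cmsg), fds, sizeof(int) * nfds);
      }
      return sendmsg(sock, &msg, 0);
  }

  /* Receive up to len bytes; fds[] must have room for EX_MAX_FDS entries. */
  static ssize_t ex_recv_with_fds(int sock, void *buf, size_t len,
                                  int *fds, size_t *nfds)
  {
      union ex_control control;
      struct iovec iov = { .iov_base = buf, .iov_len = len };
      struct msghdr msg;
      struct cmsghdr *cmsg;
      ssize_t n;

      memset(&msg, 0, sizeof(msg));
      msg.msg_iov = &iov;
      msg.msg_iovlen = 1;
      msg.msg_control = control.buf;
      msg.msg_controllen = sizeof(control.buf);

      *nfds = 0;
      n = recvmsg(sock, &msg, 0);
      if (n < 0) {
          return -1;
      }
      for (cmsg = CMSG_FIRSTHDR(&msg); cmsg != NULL;
           cmsg = CMSG_NXTHDR(&msg, cmsg)) {
          if (cmsg->cmsg_level == SOL_SOCKET &&
              cmsg->cmsg_type == SCM_RIGHTS) {
              size_t count = (cmsg->cmsg_len - CMSG_LEN(0)) / sizeof(int);

              if (count > EX_MAX_FDS) {
                  count = EX_MAX_FDS;
              }
              memcpy(fds, CMSG_DATA(cmsg), count * sizeof(int));
              *nfds = count;
              break;
          }
      }
      return n;
  }

A real implementation would additionally loop on short reads, close
unexpected descriptors, and check ``MSG_CTRUNC``; these are left out to
keep the sketch short.
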

If *master* is unable to send the full message or receives a wrong
reply it will close the connection. An optional reconnection mechanism
can be implemented.

If *slave* detects some error such as incompatible features, it may also
close the connection. This should only happen in exceptional circumstances.

Any protocol extensions are gated by protocol feature bits, which
allows full backwards compatibility on both master and slave.  As
older slaves don't support negotiating protocol features, a feature
bit was dedicated for this purpose::

  #define VHOST_USER_F_PROTOCOL_FEATURES 30
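
For illustration, testing this bit in the value returned by
``VHOST_USER_GET_FEATURES``, and testing an individual protocol
feature bit afterwards, might look like the sketch below. The ``ex_``
helper names are hypothetical.

.. code:: c

  #include <stdbool.h>
  #include <stdint.h>

  #define VHOST_USER_F_PROTOCOL_FEATURES 30

  /* True if the feature bitmask advertises protocol feature support. */
  static inline bool ex_has_protocol_features(uint64_t features)
  {
      return ((features >> VHOST_USER_F_PROTOCOL_FEATURES) & 1) != 0;
  }

  /* True if protocol feature number 'bit' is set in the bitmask. */
  static inline bool ex_has_protocol_feature(uint64_t protocol_features,
                                             unsigned int bit)
  {
      return bit < 64 && ((protocol_features >> bit) & 1) != 0;
  }

Shifting the 64-bit value right, rather than shifting ``1`` left,
avoids accidentally shifting a 32-bit ``int`` past its width.
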

Starting and stopping rings
---------------------------

The client must only process each ring when it is started.

The client must only pass data between the ring and the backend when
the ring is enabled.

If a ring is started but disabled, the client must process the ring
without talking to the backend.

For example, for a networking device, in the disabled state the client
must not supply any new RX packets, but must process and discard any
TX packets.

If ``VHOST_USER_F_PROTOCOL_FEATURES`` has not been negotiated, the
ring is initialized in an enabled state.

If ``VHOST_USER_F_PROTOCOL_FEATURES`` has been negotiated, the ring is
initialized in a disabled state. The client must not pass data to/from
the backend until the ring is enabled by ``VHOST_USER_SET_VRING_ENABLE``
with parameter 1, or after it has been disabled by
``VHOST_USER_SET_VRING_ENABLE`` with parameter 0.

Each ring is initialized in a stopped state. The client must not
process it until the ring is started, or after it has been stopped.

The client must start a ring upon receiving a kick (that is, detecting
that the file descriptor is readable) on the descriptor specified by
``VHOST_USER_SET_VRING_KICK`` or receiving the in-band message
``VHOST_USER_VRING_KICK`` if negotiated, and stop the ring upon
receiving ``VHOST_USER_GET_VRING_BASE``.

While processing the rings (whether they are enabled or not), the client
must support changing some configuration aspects on the fly.
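
The rules above amount to two independent flags per ring. The
following sketch, with hypothetical ``ex_ring`` names that do not
appear in QEMU, shows one way a client could track them:

.. code:: c

  #include <stdbool.h>

  struct ex_ring {
      bool started;  /* set by a kick, cleared by GET_VRING_BASE */
      bool enabled;  /* controlled by SET_VRING_ENABLE */
  };

  /* Rings start stopped; enabled only without protocol features. */
  static void ex_ring_init(struct ex_ring *r, bool protocol_features)
  {
      r->started = false;
      r->enabled = !protocol_features;
  }

  /* Kick fd became readable, or in-band VHOST_USER_VRING_KICK arrived. */
  static void ex_ring_on_kick(struct ex_ring *r)
  {
      r->started = true;
  }

  /* VHOST_USER_GET_VRING_BASE stops the ring. */
  static void ex_ring_on_get_vring_base(struct ex_ring *r)
  {
      r->started = false;
  }

  /* VHOST_USER_SET_VRING_ENABLE with parameter 1 (on) or 0 (off). */
  static void ex_ring_on_set_enable(struct ex_ring *r, bool on)
  {
      r->enabled = on;
  }

  /* The ring may be processed at all only while started. */
  static bool ex_ring_may_process(const struct ex_ring *r)
  {
      return r->started;
  }

  /* Data may reach the backend only while started and enabled. */
  static bool ex_ring_may_pass_data(const struct ex_ring *r)
  {
      return r->started && r->enabled;
  }

With this model, a started but disabled network TX ring would be
processed (``ex_ring_may_process()`` is true) while its packets are
discarded (``ex_ring_may_pass_data()`` is false).
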

Multiple queue support
----------------------

Many devices have a fixed number of virtqueues.  In this case the master
already knows the number of available virtqueues without communicating with the
slave.

Some devices do not have a fixed number of virtqueues.  Instead the maximum
number of virtqueues is chosen by the slave.  The number can depend on host
resource availability or slave implementation details.  Such devices are called
multiple queue devices.

Multiple queue support allows the slave to advertise the maximum number of
queues.  This is treated as a protocol extension, hence the slave has to
implement protocol features first. The multiple queues feature is supported
only when the protocol feature ``VHOST_USER_PROTOCOL_F_MQ`` (bit 0) is set.

The max number of queues the slave supports can be queried with message
``VHOST_USER_GET_QUEUE_NUM``. Master should stop when the number of requested
queues is bigger than that.

As all queues share one connection, the master uses a unique index for each
queue in the sent message to identify a specified queue.

The master enables queues by sending message ``VHOST_USER_SET_VRING_ENABLE``.
vhost-user-net has historically automatically enabled the first queue pair.

Slaves should always implement the ``VHOST_USER_PROTOCOL_F_MQ`` protocol
feature, even for devices with a fixed number of virtqueues, since it is simple
to implement and offers a degree of introspection.

Masters must not rely on the ``VHOST_USER_PROTOCOL_F_MQ`` protocol feature for
devices with a fixed number of virtqueues.  Only true multiqueue devices
require this protocol feature.

Migration
---------

During live migration, the master may need to track the modifications
the slave makes to the memory mapped regions. The client should mark
the dirty pages in a log. Once it complies with this logging, it may
declare the ``VHOST_F_LOG_ALL`` vhost feature.

To start/stop logging of data/used ring writes, the server may send
messages ``VHOST_USER_SET_FEATURES`` with ``VHOST_F_LOG_ALL`` and
``VHOST_USER_SET_VRING_ADDR`` with ``VHOST_VRING_F_LOG`` in the ring's
flags set to 1/0, respectively.

All the modifications to memory pointed to by vring "descriptor" should
be marked. Modifications to "used" vring should be marked if
``VHOST_VRING_F_LOG`` is part of the ring's flags.

Dirty pages are of size::

  #define VHOST_LOG_PAGE 0x1000

The log memory fd is provided in the ancillary data of
``VHOST_USER_SET_LOG_BASE`` message when the slave has
``VHOST_USER_PROTOCOL_F_LOG_SHMFD`` protocol feature.

The size of the log is supplied as part of ``VhostUserMsg`` which
should be large enough to cover all known guest addresses. Log starts
at the supplied offset in the supplied file descriptor.  The log
covers from address 0 to the maximum of guest regions. In pseudo-code,
to mark page at ``addr`` as dirty::

  page = addr / VHOST_LOG_PAGE
  log[page / 8] |= 1 << page % 8

Where ``addr`` is the guest physical address.

Use atomic operations, as the log may be concurrently manipulated.
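
The pseudo-code above can be turned into C11 roughly as follows. This
is a sketch under assumptions: the ``ex_`` function names are
hypothetical, the log is assumed to be accessible as an array of
``_Atomic uint8_t``, and writes that fall outside the log are silently
ignored.

.. code:: c

  #include <stdatomic.h>
  #include <stdint.h>

  #define VHOST_LOG_PAGE 0x1000

  /* Atomically mark the page containing guest address addr as dirty. */
  static void ex_log_page(_Atomic uint8_t *log, uint64_t log_size,
                          uint64_t addr)
  {
      uint64_t page = addr / VHOST_LOG_PAGE;

      if (page / 8 >= log_size) {
          return;  /* assumption: out-of-range addresses are ignored */
      }
      atomic_fetch_or(&log[page / 8], (uint8_t)(1u << (page % 8)));
  }

  /* Mark every page touched by the write [addr, addr + len). */
  static void ex_log_write(_Atomic uint8_t *log, uint64_t log_size,
                           uint64_t addr, uint64_t len)
  {
      uint64_t first, last, page;

      if (len == 0) {
          return;
      }
      first = addr / VHOST_LOG_PAGE;
      last = (addr + len - 1) / VHOST_LOG_PAGE;
      for (page = first; page <= last; page++) {
          ex_log_page(log, log_size, page * VHOST_LOG_PAGE);
      }
  }

A write that straddles a page boundary sets one bit per page touched,
which is why ``ex_log_write()`` iterates from the first to the last
page of the range.
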

Note that when logging modifications to the used ring (when
``VHOST_VRING_F_LOG`` is set for this ring), ``log_guest_addr`` should
be used to calculate the log offset: the write to first byte of the
used ring is logged at this offset from log start. Also note that this
value might be outside the legal guest physical address range
(i.e. does not have to be covered by the ``VhostUserMemory`` table), but
the bit offset of the last byte of the ring must fall within the size
supplied by ``VhostUserLog``.

``VHOST_USER_SET_LOG_FD`` is an optional message with an eventfd in
ancillary data; it may be used to inform the master that the log has
been modified.

Once the source has finished migration, rings will be stopped by the
source. No further update must be done before rings are restarted.

In postcopy migration the slave is started before all the memory has
been received from the source host, and care must be taken to avoid
accessing pages that have yet to be received.  The slave opens a
'userfault'-fd and registers the memory with it; this fd is then
passed back over to the master.  The master services requests on the
userfaultfd for pages that are accessed and when the page is available
it performs WAKE ioctls on the userfaultfd to wake the stalled
slave.  The client indicates support for this via the
``VHOST_USER_PROTOCOL_F_PAGEFAULT`` feature.

Memory access
-------------

The master sends a list of vhost memory regions to the slave using the
``VHOST_USER_SET_MEM_TABLE`` message.  Each region has two base
addresses: a guest address and a user address.

Messages contain guest addresses and/or user addresses to reference locations
within the shared memory.  The mapping of these addresses works as follows.

User addresses map to the vhost memory region containing that user address.

When the ``VIRTIO_F_IOMMU_PLATFORM`` feature has not been negotiated:

* Guest addresses map to the vhost memory region containing that guest
  address.

When the ``VIRTIO_F_IOMMU_PLATFORM`` feature has been negotiated:

* Guest addresses are also called I/O virtual addresses (IOVAs).  They are
  translated to user addresses via the IOTLB.

* The vhost memory region guest address is not used.
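
As a sketch of the lookups described above for the case without
``VIRTIO_F_IOMMU_PLATFORM``, a slave could translate addresses into
pointers in its own address space as shown below. The ``ex_region``
structure and the ``mmap_base`` field (where the slave mapped the
region's file descriptor) are assumptions made for this example.

.. code:: c

  #include <stddef.h>
  #include <stdint.h>

  struct ex_region {
      uint64_t guest_addr;   /* guest address of the region */
      uint64_t size;         /* size of the region */
      uint64_t user_addr;    /* user address in the master */
      uint64_t mmap_offset;  /* where the region starts in the mapping */
      void *mmap_base;       /* hypothetical: local mapping of the fd */
  };

  /* Local pointer for 'off' bytes into region r. */
  static void *ex_region_ptr(const struct ex_region *r, uint64_t off)
  {
      return (uint8_t *)r->mmap_base + r->mmap_offset + off;
  }

  /* Translate a guest address; returns NULL if no region contains it. */
  static void *ex_guest_to_local(const struct ex_region *regions, size_t n,
                                 uint64_t guest_addr)
  {
      for (size_t i = 0; i < n; i++) {
          const struct ex_region *r = &regions[i];

          if (guest_addr >= r->guest_addr &&
              guest_addr - r->guest_addr < r->size) {
              return ex_region_ptr(r, guest_addr - r->guest_addr);
          }
      }
      return NULL;
  }

  /* Translate a user address; returns NULL if no region contains it. */
  static void *ex_user_to_local(const struct ex_region *regions, size_t n,
                                uint64_t user_addr)
  {
      for (size_t i = 0; i < n; i++) {
          const struct ex_region *r = &regions[i];

          if (user_addr >= r->user_addr &&
              user_addr - r->user_addr < r->size) {
              return ex_region_ptr(r, user_addr - r->user_addr);
          }
      }
      return NULL;
  }

The range check is written as ``addr - base < size`` so that it cannot
overflow when a region ends at the top of the 64-bit address space.
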

IOMMU support
-------------

When the ``VIRTIO_F_IOMMU_PLATFORM`` feature has been negotiated, the
master sends IOTLB entry updates and invalidations by sending
``VHOST_USER_IOTLB_MSG`` requests to the slave with a ``struct
vhost_iotlb_msg`` as payload. For update events, the ``iotlb`` payload
has to be filled with the update message type (2), the I/O virtual
address, the size, the user virtual address, and the permissions
flags. Addresses and size must be within vhost memory regions set via
the ``VHOST_USER_SET_MEM_TABLE`` request. For invalidation events, the
``iotlb`` payload has to be filled with the invalidation message type
(3), the I/O virtual address and the size. On success, the slave is
expected to reply with a zero payload, non-zero otherwise.

The slave relies on the slave communication channel (see :ref:`Slave
communication <slave_communication>` section below) to send IOTLB miss
and access failure events, by sending ``VHOST_USER_SLAVE_IOTLB_MSG``
requests to the master with a ``struct vhost_iotlb_msg`` as
payload. For miss events, the iotlb payload has to be filled with the
miss message type (1), the I/O virtual address and the permissions
flags. For access failure events, the iotlb payload has to be filled
with the access failure message type (4), the I/O virtual address and
the permissions flags.  For synchronization purposes, the slave may
rely on the reply-ack feature, so the master may send a reply when
the operation is completed if the reply-ack feature is negotiated and
the slave requests a reply. For miss events, a completed operation means
that either the master sent an update message with the IOTLB entry
containing the requested address and permission, or the master sent
nothing because the IOTLB miss message is invalid (invalid IOVA or
permission).

The master isn't expected to take the initiative to send IOTLB update
messages, as the slave sends IOTLB miss messages for the guest virtual
memory areas it needs to access.
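
The fields that each kind of IOTLB message has to fill can be
summarised in code. The sketch below uses a hypothetical
``ex_iotlb_msg`` structure that mirrors the fields of the IOTLB
message description in this document; it is not the ``struct
vhost_iotlb_msg`` from the Linux headers, whose exact layout should be
taken from there.

.. code:: c

  #include <stdint.h>

  /* Message types and permission values as listed in this document. */
  enum {
      EX_IOTLB_MISS = 1,
      EX_IOTLB_UPDATE = 2,
      EX_IOTLB_INVALIDATE = 3,
      EX_IOTLB_ACCESS_FAIL = 4,
  };

  enum {
      EX_PERM_NONE = 0,
      EX_PERM_READ = 1,
      EX_PERM_WRITE = 2,
      EX_PERM_RW = 3,
  };

  /* Hypothetical stand-in for the IOTLB message payload. */
  struct ex_iotlb_msg {
      uint64_t iova;
      uint64_t size;
      uint64_t user_addr;
      uint8_t perm;
      uint8_t type;
  };

  /* Master to slave: update needs iova, size, user address, perms. */
  static struct ex_iotlb_msg ex_iotlb_update(uint64_t iova, uint64_t size,
                                             uint64_t user_addr,
                                             uint8_t perm)
  {
      struct ex_iotlb_msg m = { .iova = iova, .size = size,
                                .user_addr = user_addr, .perm = perm,
                                .type = EX_IOTLB_UPDATE };
      return m;
  }

  /* Master to slave: invalidation needs only iova and size. */
  static struct ex_iotlb_msg ex_iotlb_invalidate(uint64_t iova,
                                                 uint64_t size)
  {
      struct ex_iotlb_msg m = { .iova = iova, .size = size,
                                .type = EX_IOTLB_INVALIDATE };
      return m;
  }

  /* Slave to master: a miss needs the iova and the wanted perms. */
  static struct ex_iotlb_msg ex_iotlb_miss(uint64_t iova, uint8_t perm)
  {
      struct ex_iotlb_msg m = { .iova = iova, .perm = perm,
                                .type = EX_IOTLB_MISS };
      return m;
  }

Fields that a message kind does not use are left at zero here; the
text above does not say what value they must carry.
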

.. _slave_communication:

Slave communication
-------------------

An optional communication channel is provided if the slave declares
``VHOST_USER_PROTOCOL_F_SLAVE_REQ`` protocol feature, to allow the
slave to make requests to the master.

The fd is provided via ``VHOST_USER_SET_SLAVE_REQ_FD`` ancillary data.

A slave may then send ``VHOST_USER_SLAVE_*`` messages to the master
using this fd communication channel.

If ``VHOST_USER_PROTOCOL_F_SLAVE_SEND_FD`` protocol feature is
negotiated, slave can send file descriptors (at most 8 descriptors in
each message) to master via ancillary data using this fd communication
channel.

Inflight I/O tracking
---------------------

To support reconnecting after restart or crash, the slave may need to
resubmit inflight I/Os. If the virtqueue is processed in order, this is
easily achieved by getting the inflight descriptors from the
descriptor table (split virtqueue) or descriptor ring (packed
virtqueue). However, this does not work when descriptors are processed
out of order, because some entries which store the information of
inflight descriptors in the available ring (split virtqueue) or
descriptor ring (packed virtqueue) might be overwritten by new
entries. To solve this problem, the slave needs to allocate an extra
buffer to store this information about inflight descriptors and share
it with the master for persistence. ``VHOST_USER_GET_INFLIGHT_FD`` and
``VHOST_USER_SET_INFLIGHT_FD`` are used to transfer this buffer
between master and slave. The format of this buffer is described
below:

+---------------+---------------+-----+---------------+
| queue0 region | queue1 region | ... | queueN region |
+---------------+---------------+-----+---------------+

N is the number of available virtqueues. The slave can get it from the
num queues field of ``VhostUserInflight``.

For split virtqueue, queue region can be implemented as:

.. code:: c

  typedef struct DescStateSplit {
      /* Indicate whether this descriptor is inflight or not.
       * Only available for head-descriptor. */
      uint8_t inflight;

      /* Padding */
      uint8_t padding[5];

      /* Maintain a list for the last batch of used descriptors.
       * Only available when batching is used for submitting */
      uint16_t next;

      /* Used to preserve the order of fetching available descriptors.
       * Only available for head-descriptor. */
      uint64_t counter;
  } DescStateSplit;

  typedef struct QueueRegionSplit {
      /* The feature flags of this region. Now it's initialized to 0. */
      uint64_t features;

      /* The version of this region. It's 1 currently.
       * Zero value indicates an uninitialized buffer */
      uint16_t version;

      /* The size of DescStateSplit array. It's equal to the virtqueue
       * size. Slave could get it from queue size field of VhostUserInflight. */
      uint16_t desc_num;

      /* The head of list that track the last batch of used descriptors. */
      uint16_t last_batch_head;

      /* Store the idx value of used ring */
      uint16_t used_idx;

      /* Used to track the state of each descriptor in descriptor table */
      DescStateSplit desc[];
  } QueueRegionSplit;

To track inflight I/O, the queue region should be processed as follows:

When receiving available buffers from the driver:

#. Get the next available head-descriptor index from available ring, ``i``

#. Set ``desc[i].counter`` to the value of global counter

#. Increase global counter by 1

#. Set ``desc[i].inflight`` to 1

When supplying used buffers to the driver:

1. Get corresponding used head-descriptor index, ``i``

2. Set ``desc[i].next`` to ``last_batch_head``

3. Set ``last_batch_head`` to ``i``

#. Steps 1,2,3 may be performed repeatedly if batching is possible

#. Increase the ``idx`` value of used ring by the size of the batch

#. Set the ``inflight`` field of each ``DescStateSplit`` entry in the batch to 0

#. Set ``used_idx`` to the ``idx`` value of used ring

When reconnecting:

#. If the value of ``used_idx`` does not match the ``idx`` value of
   used ring (means the inflight field of ``DescStateSplit`` entries in
   last batch may be incorrect),

   a. Subtract the value of ``used_idx`` from the ``idx`` value of
      used ring to get last batch size of ``DescStateSplit`` entries

   #. Set the ``inflight`` field of each ``DescStateSplit`` entry to 0 in last batch
      list which starts from ``last_batch_head``

   #. Set ``used_idx`` to the ``idx`` value of used ring

#. Resubmit inflight ``DescStateSplit`` entries in order of their
   counter value
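
The receive and reconnect steps for the split virtqueue can be
sketched in C as follows. The structures repeat the layout given
above without their comments; the ``ex_split_`` function names, the
``global_counter`` parameter and the ``out`` array are assumptions
made for this example, and memory barriers as well as the actual used
ring writes are omitted.

.. code:: c

  #include <stdint.h>

  typedef struct DescStateSplit {
      uint8_t inflight;
      uint8_t padding[5];
      uint16_t next;
      uint64_t counter;
  } DescStateSplit;

  typedef struct QueueRegionSplit {
      uint64_t features;
      uint16_t version;
      uint16_t desc_num;
      uint16_t last_batch_head;
      uint16_t used_idx;
      DescStateSplit desc[];
  } QueueRegionSplit;

  /* Receiving an available buffer whose head-descriptor index is i. */
  static void ex_split_on_avail(QueueRegionSplit *q,
                                uint64_t *global_counter, uint16_t i)
  {
      q->desc[i].counter = *global_counter;
      *global_counter += 1;
      q->desc[i].inflight = 1;
  }

  /*
   * Reconnecting: repair an interrupted last batch, then store the
   * indexes of the inflight entries in out[] (room for desc_num
   * entries), ordered by counter. Returns how many to resubmit.
   */
  static uint16_t ex_split_on_reconnect(QueueRegionSplit *q,
                                        uint16_t used_ring_idx,
                                        uint16_t *out)
  {
      uint16_t n = 0;

      if (q->used_idx != used_ring_idx) {
          uint16_t batch = (uint16_t)(used_ring_idx - q->used_idx);
          uint16_t i = q->last_batch_head;

          /* The bounds check guards against a corrupted region. */
          while (batch > 0 && i < q->desc_num) {
              q->desc[i].inflight = 0;
              i = q->desc[i].next;
              batch--;
          }
          q->used_idx = used_ring_idx;
      }

      for (uint16_t i = 0; i < q->desc_num; i++) {
          if (q->desc[i].inflight) {
              /* Insertion sort keeps out[] ordered by counter. */
              uint16_t pos = n++;

              while (pos > 0 &&
                     q->desc[out[pos - 1]].counter > q->desc[i].counter) {
                  out[pos] = out[pos - 1];
                  pos--;
              }
              out[pos] = i;
          }
      }
      return n;
  }

Here ``used_ring_idx`` stands for the ``idx`` value read from the used
ring after reconnecting. The subtraction is done in 16 bits so that
the batch size stays correct when ``idx`` wraps around.
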

For packed virtqueue, queue region can be implemented as:

.. code:: c

  typedef struct DescStatePacked {
      /* Indicate whether this descriptor is inflight or not.
       * Only available for head-descriptor. */
      uint8_t inflight;

      /* Padding */
      uint8_t padding;

      /* Link to the next free entry */
      uint16_t next;

      /* Link to the last entry of descriptor list.
       * Only available for head-descriptor. */
      uint16_t last;

      /* The length of descriptor list.
       * Only available for head-descriptor. */
      uint16_t num;

      /* Used to preserve the order of fetching available descriptors.
       * Only available for head-descriptor. */
      uint64_t counter;

      /* The buffer id */
      uint16_t id;

      /* The descriptor flags */
      uint16_t flags;

      /* The buffer length */
      uint32_t len;

      /* The buffer address */
      uint64_t addr;
  } DescStatePacked;

  typedef struct QueueRegionPacked {
      /* The feature flags of this region. Now it's initialized to 0. */
      uint64_t features;

      /* The version of this region. It's 1 currently.
       * Zero value indicates an uninitialized buffer */
      uint16_t version;

      /* The size of DescStatePacked array. It's equal to the virtqueue
       * size. Slave could get it from queue size field of VhostUserInflight. */
      uint16_t desc_num;

      /* The head of free DescStatePacked entry list */
      uint16_t free_head;

      /* The old head of free DescStatePacked entry list */
      uint16_t old_free_head;

      /* The used index of descriptor ring */
      uint16_t used_idx;

      /* The old used index of descriptor ring */
      uint16_t old_used_idx;

      /* Device ring wrap counter */
      uint8_t used_wrap_counter;

      /* The old device ring wrap counter */
      uint8_t old_used_wrap_counter;

      /* Padding */
      uint8_t padding[7];

      /* Used to track the state of each descriptor fetched from descriptor ring */
      DescStatePacked desc[];
  } QueueRegionPacked;

To track inflight I/O, the queue region should be processed as follows:

When receiving available buffers from the driver:

#. Get the next available descriptor entry from descriptor ring, ``d``

#. If ``d`` is head descriptor,

   a. Set ``desc[old_free_head].num`` to 0

   #. Set ``desc[old_free_head].counter`` to the value of global counter

   #. Increase global counter by 1

   #. Set ``desc[old_free_head].inflight`` to 1

#. If ``d`` is last descriptor, set ``desc[old_free_head].last`` to
   ``free_head``

#. Increase ``desc[old_free_head].num`` by 1

#. Set ``desc[free_head].addr``, ``desc[free_head].len``,
   ``desc[free_head].flags``, ``desc[free_head].id`` to ``d.addr``,
   ``d.len``, ``d.flags``, ``d.id``

#. Set ``free_head`` to ``desc[free_head].next``

#. If ``d`` is last descriptor, set ``old_free_head`` to ``free_head``

When supplying used buffers to the driver:

1. Get corresponding used head-descriptor entry from descriptor ring,
   ``d``

2. Get corresponding ``DescStatePacked`` entry, ``e``

3. Set ``desc[e.last].next`` to ``free_head``

4. Set ``free_head`` to the index of ``e``

#. Steps 1,2,3,4 may be performed repeatedly if batching is possible

#. Increase ``used_idx`` by the size of the batch and update
   ``used_wrap_counter`` if needed

#. Update ``d.flags``

#. Set the ``inflight`` field of each head ``DescStatePacked`` entry
   in the batch to 0

#. Set ``old_free_head``,  ``old_used_idx``, ``old_used_wrap_counter``
   to ``free_head``, ``used_idx``, ``used_wrap_counter``

When reconnecting:

#. If ``used_idx`` does not match ``old_used_idx`` (means the
   ``inflight`` field of ``DescStatePacked`` entries in last batch may
   be incorrect),

   a. Get the next descriptor ring entry through ``old_used_idx``, ``d``

   #. Use ``old_used_wrap_counter`` to calculate the available flags

   #. If ``d.flags`` is not equal to the calculated flags value (means
      slave has submitted the buffer to guest driver before crash, so
      it has to commit the in-progress update), set ``old_free_head``,
      ``old_used_idx``, ``old_used_wrap_counter`` to ``free_head``,
      ``used_idx``, ``used_wrap_counter``

#. Set ``free_head``, ``used_idx``, ``used_wrap_counter`` to
   ``old_free_head``, ``old_used_idx``, ``old_used_wrap_counter``
   (roll back any in-progress update)

#. Set the ``inflight`` field of each ``DescStatePacked`` entry in
   free list to 0

#. Resubmit inflight ``DescStatePacked`` entries in order of their
   counter value

In-band notifications
---------------------

In some limited situations (e.g. for simulation) it is desirable to
have the kick, call and error (if used) signals done via in-band
messages instead of asynchronous eventfd notifications. This can be
done by negotiating the ``VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS``
protocol feature.

Note that because too many messages on the sockets can cause the
sending application(s) to block, it is not advised to use this feature
unless absolutely necessary. It is also considered an error to
negotiate this feature without also negotiating
``VHOST_USER_PROTOCOL_F_SLAVE_REQ`` and ``VHOST_USER_PROTOCOL_F_REPLY_ACK``.
The former is necessary for getting a message channel from the slave
to the master, while the latter needs to be used with the in-band
notification messages to block until they are processed, both to avoid
blocking later and for proper processing (at least in the simulation
use case). As it has no other way of signalling this error, the slave
should close the connection as a response to a
``VHOST_USER_SET_PROTOCOL_FEATURES`` message that sets the in-band
notifications feature flag without the other two.
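
A slave could implement this check on the bitmask received in
``VHOST_USER_SET_PROTOCOL_FEATURES`` along the following lines. The
function name is hypothetical; the bit numbers are those listed in the
protocol features section of this document.

.. code:: c

  #include <stdbool.h>
  #include <stdint.h>

  #define VHOST_USER_PROTOCOL_F_REPLY_ACK             3
  #define VHOST_USER_PROTOCOL_F_SLAVE_REQ             5
  #define VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS 14

  /*
   * Returns false if in-band notifications are requested without
   * SLAVE_REQ and REPLY_ACK, in which case the slave should close
   * the connection.
   */
  static bool ex_inband_features_valid(uint64_t protocol_features)
  {
      const uint64_t inband =
          1ULL << VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS;
      const uint64_t required =
          (1ULL << VHOST_USER_PROTOCOL_F_SLAVE_REQ) |
          (1ULL << VHOST_USER_PROTOCOL_F_REPLY_ACK);

      if ((protocol_features & inband) == 0) {
          return true;  /* feature not requested: nothing to check */
      }
      return (protocol_features & required) == required;
  }
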

Protocol features
-----------------

.. code:: c

  #define VHOST_USER_PROTOCOL_F_MQ                    0
  #define VHOST_USER_PROTOCOL_F_LOG_SHMFD             1
  #define VHOST_USER_PROTOCOL_F_RARP                  2
  #define VHOST_USER_PROTOCOL_F_REPLY_ACK             3
  #define VHOST_USER_PROTOCOL_F_MTU                   4
  #define VHOST_USER_PROTOCOL_F_SLAVE_REQ             5
  #define VHOST_USER_PROTOCOL_F_CROSS_ENDIAN          6
  #define VHOST_USER_PROTOCOL_F_CRYPTO_SESSION        7
  #define VHOST_USER_PROTOCOL_F_PAGEFAULT             8
  #define VHOST_USER_PROTOCOL_F_CONFIG                9
  #define VHOST_USER_PROTOCOL_F_SLAVE_SEND_FD        10
  #define VHOST_USER_PROTOCOL_F_HOST_NOTIFIER        11
  #define VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD       12
  #define VHOST_USER_PROTOCOL_F_RESET_DEVICE         13
  #define VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS 14
  #define VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS  15
  #define VHOST_USER_PROTOCOL_F_STATUS               16

842Master message types
843--------------------
844
845``VHOST_USER_GET_FEATURES``
846  :id: 1
847  :equivalent ioctl: ``VHOST_GET_FEATURES``
848  :master payload: N/A
849  :slave payload: ``u64``
850
851  Get from the underlying vhost implementation the features bitmask.
852  Feature bit ``VHOST_USER_F_PROTOCOL_FEATURES`` signals slave support
853  for ``VHOST_USER_GET_PROTOCOL_FEATURES`` and
854  ``VHOST_USER_SET_PROTOCOL_FEATURES``.
855
856``VHOST_USER_SET_FEATURES``
857  :id: 2
858  :equivalent ioctl: ``VHOST_SET_FEATURES``
859  :master payload: ``u64``
860
861  Enable features in the underlying vhost implementation using a
862  bitmask.  Feature bit ``VHOST_USER_F_PROTOCOL_FEATURES`` signals
863  slave support for ``VHOST_USER_GET_PROTOCOL_FEATURES`` and
864  ``VHOST_USER_SET_PROTOCOL_FEATURES``.
865
866``VHOST_USER_GET_PROTOCOL_FEATURES``
867  :id: 15
868  :equivalent ioctl: ``VHOST_GET_FEATURES``
869  :master payload: N/A
870  :slave payload: ``u64``
871
872  Get the protocol feature bitmask from the underlying vhost
873  implementation.  Only legal if feature bit
874  ``VHOST_USER_F_PROTOCOL_FEATURES`` is present in
875  ``VHOST_USER_GET_FEATURES``.
876
877.. Note::
   A slave that reported ``VHOST_USER_F_PROTOCOL_FEATURES`` must
   support this message even before ``VHOST_USER_SET_FEATURES`` was
   called.
881
882``VHOST_USER_SET_PROTOCOL_FEATURES``
883  :id: 16
884  :equivalent ioctl: ``VHOST_SET_FEATURES``
885  :master payload: ``u64``
886
887  Enable protocol features in the underlying vhost implementation.
888
889  Only legal if feature bit ``VHOST_USER_F_PROTOCOL_FEATURES`` is present in
890  ``VHOST_USER_GET_FEATURES``.
891
892.. Note::
   A slave that reported ``VHOST_USER_F_PROTOCOL_FEATURES`` must support
   this message even before ``VHOST_USER_SET_FEATURES`` was called.
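
The legality rule for the two protocol feature messages can be
sketched as below. The bit number 30 for
``VHOST_USER_F_PROTOCOL_FEATURES`` is the value defined earlier in
this document. The helper names are hypothetical, and acknowledging
only the intersection of offered and locally supported bits is an
assumption about a reasonable master, not a requirement stated here.

.. code:: c

  #include <stdbool.h>
  #include <stdint.h>

  #define VHOST_USER_F_PROTOCOL_FEATURES 30

  /* `features` is the u64 returned by VHOST_USER_GET_FEATURES. */
  static bool vu_protocol_features_legal(uint64_t features)
  {
      return (features >> VHOST_USER_F_PROTOCOL_FEATURES) & 1;
  }

  /*
   * `offered` is the reply to VHOST_USER_GET_PROTOCOL_FEATURES and
   * `supported` is the set this (hypothetical) master implements.
   * The result is the payload for VHOST_USER_SET_PROTOCOL_FEATURES.
   */
  static uint64_t vu_protocol_features_to_set(uint64_t offered,
                                              uint64_t supported)
  {
      return offered & supported;
  }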
895
896``VHOST_USER_SET_OWNER``
897  :id: 3
898  :equivalent ioctl: ``VHOST_SET_OWNER``
899  :master payload: N/A
900
901  Issued when a new connection is established. It sets the current
902  *master* as an owner of the session. This can be used on the *slave*
903  as a "session start" flag.
904
905``VHOST_USER_RESET_OWNER``
906  :id: 4
907  :master payload: N/A
908
909.. admonition:: Deprecated
910
911   This is no longer used. Used to be sent to request disabling all
912   rings, but some clients interpreted it to also discard connection
913   state (this interpretation would lead to bugs).  It is recommended
914   that clients either ignore this message, or use it to disable all
915   rings.
916
917``VHOST_USER_SET_MEM_TABLE``
918  :id: 5
919  :equivalent ioctl: ``VHOST_SET_MEM_TABLE``
920  :master payload: memory regions description
921  :slave payload: (postcopy only) memory regions description
922
923  Sets the memory map regions on the slave so it can translate the
924  vring addresses. In the ancillary data there is an array of file
925  descriptors for each memory mapped region. The size and ordering of
926  the fds matches the number and ordering of memory regions.
927
928  When ``VHOST_USER_POSTCOPY_LISTEN`` has been received,
929  ``SET_MEM_TABLE`` replies with the bases of the memory mapped
930  regions to the master.  The slave must have mmap'd the regions but
931  not yet accessed them and should not yet generate a userfault
932  event.
933
934.. Note::
935   ``NEED_REPLY_MASK`` is not set in this case.  QEMU will then
936   reply back to the list of mappings with an empty
937   ``VHOST_USER_SET_MEM_TABLE`` as an acknowledgement; only upon
938   reception of this message may the guest start accessing the memory
939   and generating faults.
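
Once the regions are mapped, translating a vring address is a range
lookup. The sketch below assumes the vring addresses are master user
addresses (no IOMMU negotiated) and that the slave mapped each fd from
offset 0, so that the region starts ``mmap_offset`` bytes into the
mapping. ``struct vu_region`` and ``vu_user_to_local`` are
hypothetical names used only for illustration.

.. code:: c

  #include <stddef.h>
  #include <stdint.h>

  /* Hypothetical bookkeeping for one region after the slave mmap'd it. */
  struct vu_region {
      uint64_t guest_addr;   /* guest address of the region */
      uint64_t size;         /* size of the region in bytes */
      uint64_t user_addr;    /* master user address of the region */
      uint64_t mmap_offset;  /* offset of the region inside the mapping */
      uint8_t *mmap_base;    /* address returned by mmap() in the slave */
  };

  /* Translate a master user address into a slave pointer, or NULL. */
  static void *vu_user_to_local(const struct vu_region *regions, size_t n,
                                uint64_t user_addr)
  {
      for (size_t i = 0; i < n; i++) {
          const struct vu_region *r = &regions[i];

          if (user_addr >= r->user_addr &&
              user_addr - r->user_addr < r->size) {
              return r->mmap_base + r->mmap_offset +
                     (user_addr - r->user_addr);
          }
      }
      return NULL;   /* address is not covered by any region */
  }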
940
941``VHOST_USER_SET_LOG_BASE``
942  :id: 6
943  :equivalent ioctl: ``VHOST_SET_LOG_BASE``
  :master payload: ``u64``
945  :slave payload: N/A
946
947  Sets logging shared memory space.
948
  When the slave has the ``VHOST_USER_PROTOCOL_F_LOG_SHMFD`` protocol
  feature, the log memory fd is provided in the ancillary data of the
  ``VHOST_USER_SET_LOG_BASE`` message, and the size and offset of the
  shared memory area are provided in the message.
953
954``VHOST_USER_SET_LOG_FD``
955  :id: 7
956  :equivalent ioctl: ``VHOST_SET_LOG_FD``
957  :master payload: N/A
958
959  Sets the logging file descriptor, which is passed as ancillary data.
960
961``VHOST_USER_SET_VRING_NUM``
962  :id: 8
963  :equivalent ioctl: ``VHOST_SET_VRING_NUM``
964  :master payload: vring state description
965
966  Set the size of the queue.
967
968``VHOST_USER_SET_VRING_ADDR``
969  :id: 9
970  :equivalent ioctl: ``VHOST_SET_VRING_ADDR``
971  :master payload: vring address description
972  :slave payload: N/A
973
974  Sets the addresses of the different aspects of the vring.
975
976``VHOST_USER_SET_VRING_BASE``
977  :id: 10
978  :equivalent ioctl: ``VHOST_SET_VRING_BASE``
979  :master payload: vring state description
980
981  Sets the base offset in the available vring.
982
983``VHOST_USER_GET_VRING_BASE``
984  :id: 11
  :equivalent ioctl: ``VHOST_GET_VRING_BASE``
986  :master payload: vring state description
987  :slave payload: vring state description
988
989  Get the available vring base offset.
990
991``VHOST_USER_SET_VRING_KICK``
992  :id: 12
993  :equivalent ioctl: ``VHOST_SET_VRING_KICK``
994  :master payload: ``u64``
995
996  Set the event file descriptor for adding buffers to the vring. It is
997  passed in the ancillary data.
998
999  Bits (0-7) of the payload contain the vring index. Bit 8 is the
1000  invalid FD flag. This flag is set when there is no file descriptor
1001  in the ancillary data. This signals that polling should be used
  instead of waiting for the kick. Note that if the protocol feature
  ``VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS`` has been negotiated
  this message isn't necessary, as the ring is also started on the
  ``VHOST_USER_VRING_KICK`` message. It may, however, still be used to
  set an event file descriptor (which will be preferred over the
  message) or to enable polling.
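
  The payload layout described above can be decoded as in the
  following sketch. The macro and function names are hypothetical;
  only the bit positions come from the text.

  .. code:: c

    #include <stdbool.h>
    #include <stdint.h>

    #define VU_VRING_IDX_MASK  0xffULL      /* bits 0-7: vring index */
    #define VU_VRING_NOFD_MASK (1ULL << 8)  /* bit 8: invalid FD flag */

    /* Vring index carried in a SET_VRING_KICK/CALL/ERR payload. */
    static unsigned int vu_vring_index(uint64_t payload)
    {
        return (unsigned int)(payload & VU_VRING_IDX_MASK);
    }

    /* True if a file descriptor accompanies the message. */
    static bool vu_vring_has_fd(uint64_t payload)
    {
        return !(payload & VU_VRING_NOFD_MASK);
    }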
1008
1009``VHOST_USER_SET_VRING_CALL``
1010  :id: 13
1011  :equivalent ioctl: ``VHOST_SET_VRING_CALL``
1012  :master payload: ``u64``
1013
1014  Set the event file descriptor to signal when buffers are used. It is
1015  passed in the ancillary data.
1016
1017  Bits (0-7) of the payload contain the vring index. Bit 8 is the
1018  invalid FD flag. This flag is set when there is no file descriptor
1019  in the ancillary data. This signals that polling will be used
  instead of waiting for the call. Note that if the protocol features
  ``VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS`` and
  ``VHOST_USER_PROTOCOL_F_SLAVE_REQ`` have been negotiated this message
  isn't necessary, as the ``VHOST_USER_SLAVE_VRING_CALL`` message can be
  used. It may, however, still be used to set an event file descriptor
  or to enable polling.
1026
1027``VHOST_USER_SET_VRING_ERR``
1028  :id: 14
1029  :equivalent ioctl: ``VHOST_SET_VRING_ERR``
1030  :master payload: ``u64``
1031
  Set the event file descriptor to signal when an error occurs. It is
1033  passed in the ancillary data.
1034
1035  Bits (0-7) of the payload contain the vring index. Bit 8 is the
1036  invalid FD flag. This flag is set when there is no file descriptor
  in the ancillary data. Note that if the protocol features
  ``VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS`` and
  ``VHOST_USER_PROTOCOL_F_SLAVE_REQ`` have been negotiated this message
  isn't necessary, as the ``VHOST_USER_SLAVE_VRING_ERR`` message can be
  used. It may, however, still be used to set an event file descriptor
  (which will be preferred over the message).
1043
1044``VHOST_USER_GET_QUEUE_NUM``
1045  :id: 17
1046  :equivalent ioctl: N/A
1047  :master payload: N/A
  :slave payload: ``u64``
1049
1050  Query how many queues the backend supports.
1051
1052  This request should be sent only when ``VHOST_USER_PROTOCOL_F_MQ``
1053  is set in queried protocol features by
1054  ``VHOST_USER_GET_PROTOCOL_FEATURES``.
1055
1056``VHOST_USER_SET_VRING_ENABLE``
1057  :id: 18
1058  :equivalent ioctl: N/A
1059  :master payload: vring state description
1060
  Signal the slave to enable or disable the corresponding vring.
1062
1063  This request should be sent only when
1064  ``VHOST_USER_F_PROTOCOL_FEATURES`` has been negotiated.
1065
1066``VHOST_USER_SEND_RARP``
1067  :id: 19
1068  :equivalent ioctl: N/A
1069  :master payload: ``u64``
1070
  Ask the vhost-user backend to broadcast a fake RARP announcing that
  the migration has completed, for guests that do not support
  ``GUEST_ANNOUNCE``.
1073
1074  Only legal if feature bit ``VHOST_USER_F_PROTOCOL_FEATURES`` is
1075  present in ``VHOST_USER_GET_FEATURES`` and protocol feature bit
1076  ``VHOST_USER_PROTOCOL_F_RARP`` is present in
1077  ``VHOST_USER_GET_PROTOCOL_FEATURES``.  The first 6 bytes of the
  payload contain the MAC address of the guest to allow the vhost user
1079  backend to construct and broadcast the fake RARP.
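
  A minimal sketch of extracting the address, assuming the payload
  was read into a ``uint64_t`` exactly as it arrived on the socket.
  The function name is hypothetical.

  .. code:: c

    #include <stdint.h>
    #include <string.h>

    /* Copy the guest MAC out of a VHOST_USER_SEND_RARP payload. */
    static void vu_rarp_mac(const uint64_t *payload, uint8_t mac[6])
    {
        /* Take the first 6 bytes in memory, whatever the host byte order. */
        memcpy(mac, payload, 6);
    }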
1080
1081``VHOST_USER_NET_SET_MTU``
1082  :id: 20
1083  :equivalent ioctl: N/A
1084  :master payload: ``u64``
1085
1086  Set host MTU value exposed to the guest.
1087
1088  This request should be sent only when ``VIRTIO_NET_F_MTU`` feature
1089  has been successfully negotiated, ``VHOST_USER_F_PROTOCOL_FEATURES``
1090  is present in ``VHOST_USER_GET_FEATURES`` and protocol feature bit
1091  ``VHOST_USER_PROTOCOL_F_NET_MTU`` is present in
1092  ``VHOST_USER_GET_PROTOCOL_FEATURES``.
1093
1094  If ``VHOST_USER_PROTOCOL_F_REPLY_ACK`` is negotiated, slave must
1095  respond with zero in case the specified MTU is valid, or non-zero
1096  otherwise.
1097
1098``VHOST_USER_SET_SLAVE_REQ_FD``
1099  :id: 21
1100  :equivalent ioctl: N/A
1101  :master payload: N/A
1102
1103  Set the socket file descriptor for slave initiated requests. It is passed
1104  in the ancillary data.
1105
1106  This request should be sent only when
  ``VHOST_USER_F_PROTOCOL_FEATURES`` has been negotiated, and protocol
  feature bit ``VHOST_USER_PROTOCOL_F_SLAVE_REQ`` is present in
1109  ``VHOST_USER_GET_PROTOCOL_FEATURES``.  If
1110  ``VHOST_USER_PROTOCOL_F_REPLY_ACK`` is negotiated, slave must
1111  respond with zero for success, non-zero otherwise.
1112
1113``VHOST_USER_IOTLB_MSG``
1114  :id: 22
1115  :equivalent ioctl: N/A (equivalent to ``VHOST_IOTLB_MSG`` message type)
1116  :master payload: ``struct vhost_iotlb_msg``
1117  :slave payload: ``u64``
1118
1119  Send IOTLB messages with ``struct vhost_iotlb_msg`` as payload.
1120
  Master sends such requests to update and invalidate entries in the
  device IOTLB. The slave has to acknowledge the request by sending
  zero as ``u64`` payload for success, non-zero otherwise.

  This request should be sent only when ``VIRTIO_F_IOMMU_PLATFORM``
1126  feature has been successfully negotiated.
1127
1128``VHOST_USER_SET_VRING_ENDIAN``
1129  :id: 23
1130  :equivalent ioctl: ``VHOST_SET_VRING_ENDIAN``
1131  :master payload: vring state description
1132
1133  Set the endianness of a VQ for legacy devices. Little-endian is
1134  indicated with state.num set to 0 and big-endian is indicated with
1135  state.num set to 1. Other values are invalid.
1136
1137  This request should be sent only when
1138  ``VHOST_USER_PROTOCOL_F_CROSS_ENDIAN`` has been negotiated.
1139  Backends that negotiated this feature should handle both
1140  endiannesses and expect this message once (per VQ) during device
  configuration (i.e. before the master starts the VQ).
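
  One way a backend might honour this is sketched below: record the
  endianness per VQ, rejecting invalid values, then use it when
  reading 16-bit vring fields. Both helpers are hypothetical.

  .. code:: c

    #include <stdbool.h>
    #include <stdint.h>

    /* Record state.num from VHOST_USER_SET_VRING_ENDIAN; -1 if invalid. */
    static int vu_set_vring_endian(bool *big_endian, unsigned int num)
    {
        if (num > 1) {
            return -1;   /* only 0 (little) and 1 (big) are valid */
        }
        *big_endian = (num == 1);
        return 0;
    }

    /* Read a 16-bit vring field stored with the VQ's endianness. */
    static uint16_t vu_vring_read16(const uint8_t *p, bool big_endian)
    {
        return big_endian ? (uint16_t)((p[0] << 8) | p[1])
                          : (uint16_t)((p[1] << 8) | p[0]);
    }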
1142
1143``VHOST_USER_GET_CONFIG``
1144  :id: 24
1145  :equivalent ioctl: N/A
1146  :master payload: virtio device config space
1147  :slave payload: virtio device config space
1148
  When ``VHOST_USER_PROTOCOL_F_CONFIG`` is negotiated, this message is
  submitted by the vhost-user master to fetch the contents of the
  virtio device configuration space. The vhost-user slave's payload
  size MUST match the master's request; the slave uses a zero-length
  payload to indicate an error to the vhost-user master. The
  vhost-user master may cache the contents to avoid repeated
  ``VHOST_USER_GET_CONFIG`` calls.
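
  The size rules can be checked on the master side roughly as
  follows. This is a sketch: the function name and its return codes
  are hypothetical, and both sizes are assumed to be taken from the
  header ``size`` field of the request and of the reply.

  .. code:: c

    #include <stdint.h>

    /* Returns 0 if the reply is usable, a negative value otherwise. */
    static int vu_check_get_config_reply(uint32_t req_size,
                                         uint32_t reply_size)
    {
        if (reply_size == 0) {
            return -1;   /* the slave signalled an error */
        }
        if (reply_size != req_size) {
            return -2;   /* the reply does not match the request */
        }
        return 0;
    }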
1156
1157``VHOST_USER_SET_CONFIG``
1158  :id: 25
1159  :equivalent ioctl: N/A
1160  :master payload: virtio device config space
1161  :slave payload: N/A
1162
1163  When ``VHOST_USER_PROTOCOL_F_CONFIG`` is negotiated, this message is
  submitted by the vhost-user master when the guest changes the virtio
  device configuration space and can also be used for live migration
1166  on the destination host. The vhost-user slave must check the flags
1167  field, and slaves MUST NOT accept SET_CONFIG for read-only
1168  configuration space fields unless the live migration bit is set.
1169
1170``VHOST_USER_CREATE_CRYPTO_SESSION``
1171  :id: 26
1172  :equivalent ioctl: N/A
1173  :master payload: crypto session description
1174  :slave payload: crypto session description
1175
1176  Create a session for crypto operation. The server side must return
1177  the session id, 0 or positive for success, negative for failure.
1178  This request should be sent only when
1179  ``VHOST_USER_PROTOCOL_F_CRYPTO_SESSION`` feature has been
1180  successfully negotiated.  It's a required feature for crypto
1181  devices.
1182
1183``VHOST_USER_CLOSE_CRYPTO_SESSION``
1184  :id: 27
1185  :equivalent ioctl: N/A
1186  :master payload: ``u64``
1187
1188  Close a session for crypto operation which was previously
1189  created by ``VHOST_USER_CREATE_CRYPTO_SESSION``.
1190
1191  This request should be sent only when
1192  ``VHOST_USER_PROTOCOL_F_CRYPTO_SESSION`` feature has been
1193  successfully negotiated.  It's a required feature for crypto
1194  devices.
1195
1196``VHOST_USER_POSTCOPY_ADVISE``
1197  :id: 28
1198  :master payload: N/A
1199  :slave payload: userfault fd
1200
1201  When ``VHOST_USER_PROTOCOL_F_PAGEFAULT`` is supported, the master
  advises slave that a migration with postcopy enabled is underway;
1203  the slave must open a userfaultfd for later use.  Note that at this
1204  stage the migration is still in precopy mode.
1205
1206``VHOST_USER_POSTCOPY_LISTEN``
1207  :id: 29
1208  :master payload: N/A
1209
1210  Master advises slave that a transition to postcopy mode has
1211  happened.  The slave must ensure that shared memory is registered
1212  with userfaultfd to cause faulting of non-present pages.
1213
1214  This is always sent sometime after a ``VHOST_USER_POSTCOPY_ADVISE``,
1215  and thus only when ``VHOST_USER_PROTOCOL_F_PAGEFAULT`` is supported.
1216
1217``VHOST_USER_POSTCOPY_END``
1218  :id: 30
1219  :slave payload: ``u64``
1220
1221  Master advises that postcopy migration has now completed.  The slave
1222  must disable the userfaultfd. The response is an acknowledgement
1223  only.
1224
1225  When ``VHOST_USER_PROTOCOL_F_PAGEFAULT`` is supported, this message
1226  is sent at the end of the migration, after
1227  ``VHOST_USER_POSTCOPY_LISTEN`` was previously sent.
1228
1229  The value returned is an error indication; 0 is success.
1230
1231``VHOST_USER_GET_INFLIGHT_FD``
1232  :id: 31
1233  :equivalent ioctl: N/A
1234  :master payload: inflight description
1235
1236  When ``VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD`` protocol feature has
1237  been successfully negotiated, this message is submitted by master to
  get a shared buffer from slave. The shared buffer will be used to
  track inflight I/O by slave. QEMU should retrieve a new one when the
  VM is reset.
1241
1242``VHOST_USER_SET_INFLIGHT_FD``
1243  :id: 32
1244  :equivalent ioctl: N/A
1245  :master payload: inflight description
1246
1247  When ``VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD`` protocol feature has
1248  been successfully negotiated, this message is submitted by master to
  send the shared inflight buffer back to slave so that the slave can
  recover inflight I/O after a crash or restart.
1251
1252``VHOST_USER_GPU_SET_SOCKET``
1253  :id: 33
1254  :equivalent ioctl: N/A
1255  :master payload: N/A
1256
1257  Sets the GPU protocol socket file descriptor, which is passed as
1258  ancillary data. The GPU protocol is used to inform the master of
1259  rendering state and updates. See vhost-user-gpu.rst for details.
1260
1261``VHOST_USER_RESET_DEVICE``
1262  :id: 34
1263  :equivalent ioctl: N/A
1264  :master payload: N/A
1265  :slave payload: N/A
1266
1267  Ask the vhost user backend to disable all rings and reset all
1268  internal device state to the initial state, ready to be
1269  reinitialized. The backend retains ownership of the device
1270  throughout the reset operation.
1271
1272  Only valid if the ``VHOST_USER_PROTOCOL_F_RESET_DEVICE`` protocol
1273  feature is set by the backend.
1274
1275``VHOST_USER_VRING_KICK``
1276  :id: 35
1277  :equivalent ioctl: N/A
  :master payload: vring state description
  :slave payload: N/A
1280
1281  When the ``VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS`` protocol
1282  feature has been successfully negotiated, this message may be
1283  submitted by the master to indicate that a buffer was added to
1284  the vring instead of signalling it using the vring's kick file
1285  descriptor or having the slave rely on polling.
1286
1287  The state.num field is currently reserved and must be set to 0.
1288
1289``VHOST_USER_GET_MAX_MEM_SLOTS``
1290  :id: 36
1291  :equivalent ioctl: N/A
  :slave payload: ``u64``
1293
1294  When the ``VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS`` protocol
1295  feature has been successfully negotiated, this message is submitted
1296  by master to the slave. The slave should return the message with a
1297  u64 payload containing the maximum number of memory slots for
1298  QEMU to expose to the guest. The value returned by the backend
1299  will be capped at the maximum number of ram slots which can be
1300  supported by the target platform.
1301
1302``VHOST_USER_ADD_MEM_REG``
1303  :id: 37
1304  :equivalent ioctl: N/A
  :master payload: single memory region description
1306
1307  When the ``VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS`` protocol
1308  feature has been successfully negotiated, this message is submitted
1309  by the master to the slave. The message payload contains a memory
1310  region descriptor struct, describing a region of guest memory which
1311  the slave device must map in. When the
1312  ``VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS`` protocol feature has
1313  been successfully negotiated, along with the
1314  ``VHOST_USER_REM_MEM_REG`` message, this message is used to set and
1315  update the memory tables of the slave device.
1316
1317``VHOST_USER_REM_MEM_REG``
1318  :id: 38
1319  :equivalent ioctl: N/A
  :master payload: single memory region description
1321
1322  When the ``VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS`` protocol
1323  feature has been successfully negotiated, this message is submitted
1324  by the master to the slave. The message payload contains a memory
1325  region descriptor struct, describing a region of guest memory which
1326  the slave device must unmap. When the
1327  ``VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS`` protocol feature has
1328  been successfully negotiated, along with the
1329  ``VHOST_USER_ADD_MEM_REG`` message, this message is used to set and
1330  update the memory tables of the slave device.
1331
1332``VHOST_USER_SET_STATUS``
1333  :id: 39
  :equivalent ioctl: ``VHOST_VDPA_SET_STATUS``
1335  :slave payload: N/A
1336  :master payload: ``u64``
1337
1338  When the ``VHOST_USER_PROTOCOL_F_STATUS`` protocol feature has been
1339  successfully negotiated, this message is submitted by the master to
1340  notify the backend with updated device status as defined in the Virtio
1341  specification.
1342
1343``VHOST_USER_GET_STATUS``
1344  :id: 40
  :equivalent ioctl: ``VHOST_VDPA_GET_STATUS``
1346  :slave payload: ``u64``
1347  :master payload: N/A
1348
1349  When the ``VHOST_USER_PROTOCOL_F_STATUS`` protocol feature has been
1350  successfully negotiated, this message is submitted by the master to
1351  query the backend for its device status as defined in the Virtio
1352  specification.
1353
1354
1355Slave message types
1356-------------------
1357
1358``VHOST_USER_SLAVE_IOTLB_MSG``
1359  :id: 1
1360  :equivalent ioctl: N/A (equivalent to ``VHOST_IOTLB_MSG`` message type)
1361  :slave payload: ``struct vhost_iotlb_msg``
1362  :master payload: N/A
1363
1364  Send IOTLB messages with ``struct vhost_iotlb_msg`` as payload.
1365  Slave sends such requests to notify of an IOTLB miss, or an IOTLB
1366  access failure. If ``VHOST_USER_PROTOCOL_F_REPLY_ACK`` is
1367  negotiated, and slave set the ``VHOST_USER_NEED_REPLY`` flag, master
1368  must respond with zero when operation is successfully completed, or
  non-zero otherwise.  This request should be sent only when
1370  ``VIRTIO_F_IOMMU_PLATFORM`` feature has been successfully
1371  negotiated.
1372
1373``VHOST_USER_SLAVE_CONFIG_CHANGE_MSG``
1374  :id: 2
1375  :equivalent ioctl: N/A
1376  :slave payload: N/A
1377  :master payload: N/A
1378
  When ``VHOST_USER_PROTOCOL_F_CONFIG`` is negotiated, the vhost-user
  slave sends such messages to notify that the virtio device's
  configuration space has changed. For host devices that support
  this feature, the host driver can send a ``VHOST_USER_GET_CONFIG``
  message to the slave to get the latest content. If
1384  ``VHOST_USER_PROTOCOL_F_REPLY_ACK`` is negotiated, and slave set the
1385  ``VHOST_USER_NEED_REPLY`` flag, master must respond with zero when
1386  operation is successfully completed, or non-zero otherwise.
1387
1388``VHOST_USER_SLAVE_VRING_HOST_NOTIFIER_MSG``
1389  :id: 3
1390  :equivalent ioctl: N/A
1391  :slave payload: vring area description
1392  :master payload: N/A
1393
1394  Sets host notifier for a specified queue. The queue index is
1395  contained in the ``u64`` field of the vring area description. The
1396  host notifier is described by the file descriptor (typically it's a
1397  VFIO device fd) which is passed as ancillary data and the size
1398  (which is mmap size and should be the same as host page size) and
1399  offset (which is mmap offset) carried in the vring area
1400  description. QEMU can mmap the file descriptor based on the size and
1401  offset to get a memory range. Registering a host notifier means
1402  mapping this memory range to the VM as the specified queue's notify
1403  MMIO region. Slave sends this request to tell QEMU to de-register
1404  the existing notifier if any and register the new notifier if the
1405  request is sent with a file descriptor.
1406
1407  This request should be sent only when
1408  ``VHOST_USER_PROTOCOL_F_HOST_NOTIFIER`` protocol feature has been
1409  successfully negotiated.
1410
1411``VHOST_USER_SLAVE_VRING_CALL``
1412  :id: 4
1413  :equivalent ioctl: N/A
1414  :slave payload: vring state description
1415  :master payload: N/A
1416
1417  When the ``VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS`` protocol
1418  feature has been successfully negotiated, this message may be
1419  submitted by the slave to indicate that a buffer was used from
1420  the vring instead of signalling this using the vring's call file
  descriptor or having the master rely on polling.
1422
1423  The state.num field is currently reserved and must be set to 0.
1424
1425``VHOST_USER_SLAVE_VRING_ERR``
1426  :id: 5
1427  :equivalent ioctl: N/A
1428  :slave payload: vring state description
1429  :master payload: N/A
1430
1431  When the ``VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS`` protocol
1432  feature has been successfully negotiated, this message may be
1433  submitted by the slave to indicate that an error occurred on the
1434  specific vring, instead of signalling the error file descriptor
1435  set by the master via ``VHOST_USER_SET_VRING_ERR``.
1436
1437  The state.num field is currently reserved and must be set to 0.
1438
1439.. _reply_ack:
1440
1441VHOST_USER_PROTOCOL_F_REPLY_ACK
1442-------------------------------
1443
1444The original vhost-user specification only demands replies for certain
1445commands. This differs from the vhost protocol implementation where
1446commands are sent over an ``ioctl()`` call and block until the client
1447has completed.
1448
1449With this protocol extension negotiated, the sender (QEMU) can set the
1450``need_reply`` [Bit 3] flag to any command. This indicates that the
1451client MUST respond with a Payload ``VhostUserMsg`` indicating success
1452or failure. The payload should be set to zero on success or non-zero
1453on failure, unless the message already has an explicit reply body.
1454
1455The response payload gives QEMU a deterministic indication of the result
1456of the command. Today, QEMU is expected to terminate the main vhost-user
loop upon receiving such errors. In the future, QEMU could be taught to
be more resilient for selected requests.
1459
1460For the message types that already solicit a reply from the client,
1461the presence of ``VHOST_USER_PROTOCOL_F_REPLY_ACK`` or need_reply bit
1462being set brings no behavioural change. (See the Communication_
1463section for details.)
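
The flag handling can be sketched as follows. The bit positions are
those given in the Header section of this document (version in the
lower 2 bits, reply flag in bit 2, ``need_reply`` in bit 3); the macro
and function names are hypothetical.

.. code:: c

  #include <stdbool.h>
  #include <stdint.h>

  #define VU_FLAG_VERSION    0x1u        /* lower 2 bits: version */
  #define VU_FLAG_REPLY      (1u << 2)   /* bit 2: reply flag */
  #define VU_FLAG_NEED_REPLY (1u << 3)   /* bit 3: need_reply flag */

  /* Flags for a request, optionally asking for an acknowledgement. */
  static uint32_t vu_request_flags(bool reply_ack_negotiated, bool want_ack)
  {
      uint32_t flags = VU_FLAG_VERSION;

      if (reply_ack_negotiated && want_ack) {
          flags |= VU_FLAG_NEED_REPLY;
      }
      return flags;
  }

  /* Interpret an acknowledgement: a zero u64 payload means success. */
  static bool vu_ack_ok(uint32_t reply_flags, uint64_t payload)
  {
      return (reply_flags & VU_FLAG_REPLY) && payload == 0;
  }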
1464
1465.. _backend_conventions:
1466
1467Backend program conventions
1468===========================
1469
1470vhost-user backends can provide various devices & services and may
1471need to be configured manually depending on the use case. However, it
1472is a good idea to follow the conventions listed here when
1473possible. Users, QEMU or libvirt, can then rely on some common
1474behaviour to avoid heterogeneous configuration and management of the
1475backend programs and facilitate interoperability.
1476
1477Each backend installed on a host system should come with at least one
1478JSON file that conforms to the vhost-user.json schema. Each file
1479informs the management applications about the backend type, and binary
1480location. In addition, it defines rules for management apps for
1481picking the highest priority backend when multiple match the search
1482criteria (see ``@VhostUserBackend`` documentation in the schema file).
1483
1484If the backend is not capable of enabling a requested feature on the
1485host (such as 3D acceleration with virgl), or the initialization
1486failed, the backend should fail to start early and exit with a status
1487!= 0. It may also print a message to stderr for further details.
1488
1489The backend program must not daemonize itself, but it may be
daemonized by the management layer. It may also have restricted
access to the system.
1492
1493File descriptors 0, 1 and 2 will exist, and have regular
1494stdin/stdout/stderr usage (they may have been redirected to /dev/null
1495by the management layer, or to a log handler).
1496
1497The backend program must end (as quickly and cleanly as possible) when
the SIGTERM signal is received. The management layer may send it
SIGKILL after a few seconds.
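
A common way to meet this requirement is to have the signal handler
set a flag that the main loop polls. The sketch below assumes a POSIX
system; all names are hypothetical.

.. code:: c

  #define _POSIX_C_SOURCE 200809L   /* for sigaction() in strict C modes */
  #include <signal.h>
  #include <string.h>

  /* Set by the handler; the main loop should exit once it is non-zero. */
  static volatile sig_atomic_t vu_terminate;

  static void vu_on_sigterm(int sig)
  {
      (void)sig;
      vu_terminate = 1;   /* only set a flag; clean up in the main loop */
  }

  /* Install the handler; returns 0 on success, -1 on failure. */
  static int vu_install_sigterm_handler(void)
  {
      struct sigaction sa;

      memset(&sa, 0, sizeof(sa));
      sa.sa_handler = vu_on_sigterm;
      sigemptyset(&sa.sa_mask);
      return sigaction(SIGTERM, &sa, NULL);
  }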
1500
1501The following command line options have an expected behaviour. They
1502are mandatory, unless explicitly said differently:
1503
1504--socket-path=PATH
1505
  This option specifies the location of the vhost-user Unix domain socket.
1507  It is incompatible with --fd.
1508
1509--fd=FDNUM
1510
1511  When this argument is given, the backend program is started with the
1512  vhost-user socket as file descriptor FDNUM. It is incompatible with
1513  --socket-path.
1514
1515--print-capabilities
1516
1517  Output to stdout the backend capabilities in JSON format, and then
1518  exit successfully. Other options and arguments should be ignored, and
1519  the backend program should not perform its normal function.  The
1520  capabilities can be reported dynamically depending on the host
1521  capabilities.
1522
1523The JSON output is described in the ``vhost-user.json`` schema, by
``@VHostUserBackendCapabilities``.  Example:
1525
.. code:: json

  {
    "type": "foo",
    "features": [
      "feature-a",
      "feature-b"
    ]
  }
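
A backend might parse the three common options with ``getopt_long``
as sketched below. The structure and function names are hypothetical,
and requiring one of ``--socket-path`` or ``--fd`` when not printing
capabilities is an assumption of this sketch rather than a rule stated
above.

.. code:: c

  #include <getopt.h>
  #include <stdbool.h>
  #include <stdlib.h>

  struct vu_options {
      const char *socket_path;   /* --socket-path=PATH, or NULL */
      int fd;                    /* --fd=FDNUM, or -1 */
      bool print_capabilities;   /* --print-capabilities */
  };

  /* Parse the common options; returns 0 on success, -1 on a usage error. */
  static int vu_parse_options(int argc, char **argv, struct vu_options *opts)
  {
      static const struct option longopts[] = {
          { "socket-path",        required_argument, NULL, 's' },
          { "fd",                 required_argument, NULL, 'f' },
          { "print-capabilities", no_argument,       NULL, 'c' },
          { NULL, 0, NULL, 0 }
      };
      bool bad = false;
      char *end;
      long v;
      int c;

      opts->socket_path = NULL;
      opts->fd = -1;
      opts->print_capabilities = false;
      optind = 1;   /* restart scanning so the function can be reused */

      while ((c = getopt_long(argc, argv, "", longopts, NULL)) != -1) {
          switch (c) {
          case 's':
              opts->socket_path = optarg;
              break;
          case 'f':
              v = strtol(optarg, &end, 10);
              if (end == optarg || *end != '\0' || v < 0) {
                  bad = true;
              } else {
                  opts->fd = (int)v;
              }
              break;
          case 'c':
              opts->print_capabilities = true;
              break;
          default:
              bad = true;
              break;
          }
      }
      if (opts->print_capabilities) {
          return 0;   /* other options are ignored in this mode */
      }
      /* --socket-path and --fd are incompatible; assume one is needed. */
      if (bad || (opts->socket_path != NULL) == (opts->fd >= 0)) {
          return -1;
      }
      return 0;
  }

The caller would print the capabilities JSON and exit when
``print_capabilities`` is set, and otherwise either connect to
``socket_path`` or use the inherited descriptor ``fd``.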
1535
1536vhost-user-input
1537----------------
1538
1539Command line options:
1540
1541--evdev-path=PATH
1542
  Specify the Linux input device.
1544
1545  (optional)
1546
1547--no-grab
1548
  Do not request exclusive access to the input device.
1550
1551  (optional)
1552
1553vhost-user-gpu
1554--------------
1555
1556Command line options:
1557
1558--render-node=PATH
1559
1560  Specify the GPU DRM render node.
1561
1562  (optional)
1563
1564--virgl
1565
1566  Enable virgl rendering support.
1567
1568  (optional)
1569
1570vhost-user-blk
1571--------------
1572
1573Command line options:
1574
1575--blk-file=PATH
1576
1577  Specify block device or file path.
1578
1579  (optional)
1580
1581--read-only
1582
1583  Enable read-only.
1584
1585  (optional)
1586