.. _vhost_user_proto:

===================
Vhost-user Protocol
===================
:Copyright: 2014 Virtual Open Systems Sarl.
:Copyright: 2019 Intel Corporation
:Licence: This work is licensed under the terms of the GNU GPL,
          version 2 or later. See the COPYING file in the top-level
          directory.

.. contents:: Table of Contents

Introduction
============

This protocol aims to complement the ``ioctl`` interface used to
control the vhost implementation in the Linux kernel. It implements
the control plane needed to establish virtqueue sharing with a user
space process on the same host. It uses communication over a Unix
domain socket to share file descriptors in the ancillary data of the
message.

The protocol defines two sides of the communication, *master* and
*slave*. *Master* is the application that shares its virtqueues, in
our case QEMU. *Slave* is the consumer of the virtqueues.

In the current implementation QEMU is the *master*, and the *slave* is
the external process consuming the virtio queues, for example a
software Ethernet switch running in user space, such as Snabbswitch,
or a block device backend processing reads and writes to a virtual
disk. In order to facilitate interoperability between various backend
implementations, it is recommended to follow the :ref:`Backend program
conventions <backend_conventions>`.

*Master* and *slave* can be either a client (i.e. connecting) or
server (listening) in the socket communication.

Message Specification
=====================

.. Note:: All numbers are in the machine native byte order.

A vhost-user message consists of 3 header fields and a payload.

+---------+-------+------+---------+
| request | flags | size | payload |
+---------+-------+------+---------+

Header
------

:request: 32-bit type of the request

:flags: 32-bit bit field

- Lower 2 bits are the version (currently 0x01)
- Bit 2 is the reply flag; it must be set on each reply from the slave
- Bit 3 is the need_reply flag; see :ref:`REPLY_ACK <reply_ack>` for
  details.

:size: 32-bit size of the payload
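
As an illustration of the header layout, the following sketch decodes
the ``flags`` field. The struct, macro and helper names used here
(``VhostUserHeader``, ``VU_FLAG_*``, ``vu_flags_*``) are hypothetical
and are not taken from QEMU; only the field widths and bit positions
come from the description above.

.. code:: c

  #include <assert.h>
  #include <stdbool.h>
  #include <stdint.h>

  /* Hypothetical names; bit positions follow the header description. */
  #define VU_FLAG_VERSION_MASK 0x3u       /* lower 2 bits: version */
  #define VU_FLAG_REPLY        (1u << 2)  /* bit 2: set on slave replies */
  #define VU_FLAG_NEED_REPLY   (1u << 3)  /* bit 3: see REPLY_ACK */
  #define VU_VERSION           0x1u

  typedef struct VhostUserHeader {
      uint32_t request;  /* type of the request */
      uint32_t flags;    /* version, reply and need_reply bits */
      uint32_t size;     /* size of the payload */
  } VhostUserHeader;

  /* Three 32-bit fields, so no padding is expected. */
  static_assert(sizeof(VhostUserHeader) == 12, "unexpected header size");

  static inline uint32_t vu_flags_version(uint32_t flags)
  {
      return flags & VU_FLAG_VERSION_MASK;
  }

  static inline bool vu_flags_is_reply(uint32_t flags)
  {
      return (flags & VU_FLAG_REPLY) != 0;
  }

  static inline bool vu_flags_need_reply(uint32_t flags)
  {
      return (flags & VU_FLAG_NEED_REPLY) != 0;
  }

  /* Flags a slave would place on a reply: version plus the reply bit. */
  static inline uint32_t vu_make_reply_flags(void)
  {
      return VU_VERSION | VU_FLAG_REPLY;
  }

A receiver would typically check ``vu_flags_version()`` against the
version it implements before interpreting the payload.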

Payload
-------

Depending on the request type, **payload** can be:

A single 64-bit integer
^^^^^^^^^^^^^^^^^^^^^^^

+-----+
| u64 |
+-----+

:u64: a 64-bit unsigned integer

A vring state description
^^^^^^^^^^^^^^^^^^^^^^^^^

+-------+-----+
| index | num |
+-------+-----+

:index: a 32-bit index

:num: a 32-bit number

A vring address description
^^^^^^^^^^^^^^^^^^^^^^^^^^^

+-------+-------+------+------------+------+-----------+-----+
| index | flags | size | descriptor | used | available | log |
+-------+-------+------+------------+------+-----------+-----+

:index: a 32-bit vring index

:flags: 32-bit vring flags

:descriptor: a 64-bit ring address of the vring descriptor table

:used: a 64-bit ring address of the vring used ring

:available: a 64-bit ring address of the vring available ring

:log: a 64-bit guest address for logging

Note that a ring address is an IOVA if ``VIRTIO_F_IOMMU_PLATFORM`` has
been negotiated. Otherwise it is a user address.

Memory regions description
^^^^^^^^^^^^^^^^^^^^^^^^^^

+-------------+---------+---------+-----+---------+
| num regions | padding | region0 | ... | region7 |
+-------------+---------+---------+-----+---------+

:num regions: a 32-bit number of regions

:padding: 32-bit

A region is:

+---------------+------+--------------+-------------+
| guest address | size | user address | mmap offset |
+---------------+------+--------------+-------------+

:guest address: a 64-bit guest address of the region

:size: a 64-bit size

:user address: a 64-bit user address

:mmap offset: 64-bit offset where region starts in the mapped memory

Single memory region description
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

+---------+---------------+------+--------------+-------------+
| padding | guest address | size | user address | mmap offset |
+---------+---------------+------+--------------+-------------+

:padding: 64-bit

:guest address: a 64-bit guest address of the region

:size: a 64-bit size

:user address: a 64-bit user address

:mmap offset: 64-bit offset where region starts in the mapped memory

Log description
^^^^^^^^^^^^^^^

+----------+------------+
| log size | log offset |
+----------+------------+

:log size: size of area used for logging

:log offset: offset from start of supplied file descriptor where
             logging starts (i.e. where guest address 0 would be
             logged)

An IOTLB message
^^^^^^^^^^^^^^^^

+------+------+--------------+-------------------+------+
| iova | size | user address | permissions flags | type |
+------+------+--------------+-------------------+------+

:iova: a 64-bit I/O virtual address programmed by the guest

:size: a 64-bit size

:user address: a 64-bit user address

:permissions flags: an 8-bit value:
  - 0: No access
  - 1: Read access
  - 2: Write access
  - 3: Read/Write access

:type: an 8-bit IOTLB message type:
  - 1: IOTLB miss
  - 2: IOTLB update
  - 3: IOTLB invalidate
  - 4: IOTLB access fail

Virtio device config space
^^^^^^^^^^^^^^^^^^^^^^^^^^

+--------+------+-------+---------+
| offset | size | flags | payload |
+--------+------+-------+---------+

:offset: a 32-bit offset of virtio device's configuration space

:size: a 32-bit configuration space access size in bytes

:flags: a 32-bit value:
  - 0: Vhost master messages used for writeable fields
  - 1: Vhost master messages used for live migration

:payload: an array of ``size`` bytes holding the contents of the virtio
          device's configuration space

Vring area description
^^^^^^^^^^^^^^^^^^^^^^

+-----+------+--------+
| u64 | size | offset |
+-----+------+--------+

:u64: a 64-bit integer containing the vring index and flags

:size: a 64-bit size of this area

:offset: a 64-bit offset of this area from the start of the
         supplied file descriptor

Inflight description
^^^^^^^^^^^^^^^^^^^^

+-----------+-------------+------------+------------+
| mmap size | mmap offset | num queues | queue size |
+-----------+-------------+------------+------------+

:mmap size: a 64-bit size of area to track inflight I/O

:mmap offset: a 64-bit offset of this area from the start
              of the supplied file descriptor

:num queues: a 16-bit number of virtqueues

:queue size: a 16-bit size of virtqueues

C structure
-----------

In QEMU the vhost-user message is implemented with the following struct:

.. code:: c

  typedef struct VhostUserMsg {
      VhostUserRequest request;
      uint32_t flags;
      uint32_t size;
      union {
          uint64_t u64;
          struct vhost_vring_state state;
          struct vhost_vring_addr addr;
          VhostUserMemory memory;
          VhostUserLog log;
          struct vhost_iotlb_msg iotlb;
          VhostUserConfig config;
          VhostUserVringArea area;
          VhostUserInflight inflight;
      };
  } QEMU_PACKED VhostUserMsg;

Communication
=============

The protocol for vhost-user is based on the existing implementation of
vhost for the Linux Kernel. Most messages that can be sent via the
Unix domain socket implementing vhost-user have an equivalent ioctl to
the kernel implementation.

The communication consists of *master* sending message requests and
*slave* sending message replies. Most of the requests don't require
replies. Here is a list of the ones that do:

* ``VHOST_USER_GET_FEATURES``
* ``VHOST_USER_GET_PROTOCOL_FEATURES``
* ``VHOST_USER_GET_VRING_BASE``
* ``VHOST_USER_SET_LOG_BASE`` (if ``VHOST_USER_PROTOCOL_F_LOG_SHMFD``)
* ``VHOST_USER_GET_INFLIGHT_FD`` (if ``VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD``)

.. seealso::

   :ref:`REPLY_ACK <reply_ack>`
       The section on ``REPLY_ACK`` protocol extension.

There are several messages that the master sends with file descriptors passed
in the ancillary data:

* ``VHOST_USER_SET_MEM_TABLE``
* ``VHOST_USER_SET_LOG_BASE`` (if ``VHOST_USER_PROTOCOL_F_LOG_SHMFD``)
* ``VHOST_USER_SET_LOG_FD``
* ``VHOST_USER_SET_VRING_KICK``
* ``VHOST_USER_SET_VRING_CALL``
* ``VHOST_USER_SET_VRING_ERR``
* ``VHOST_USER_SET_SLAVE_REQ_FD``
* ``VHOST_USER_SET_INFLIGHT_FD`` (if ``VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD``)

If *master* is unable to send the full message or receives a wrong
reply it will close the connection. An optional reconnection mechanism
can be implemented.

If *slave* detects some error such as incompatible features, it may also
close the connection. This should only happen in exceptional circumstances.

Any protocol extensions are gated by protocol feature bits, which
allows full backwards compatibility on both master and slave.  As
older slaves don't support negotiating protocol features, a feature
bit was dedicated for this purpose::

  #define VHOST_USER_F_PROTOCOL_FEATURES 30
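
A minimal sketch of how either side might test for this bit in a
feature bitmask follows. The helper name
``vu_has_protocol_features`` is hypothetical; the bit number is the
one defined above.

.. code:: c

  #include <assert.h>
  #include <stdbool.h>
  #include <stdint.h>

  #define VHOST_USER_F_PROTOCOL_FEATURES 30

  /* The features bitmask is carried as a u64 payload. */
  static_assert(VHOST_USER_F_PROTOCOL_FEATURES < 64,
                "feature bit must fit in a 64-bit mask");

  /* True if the bitmask advertises protocol feature negotiation. */
  static inline bool vu_has_protocol_features(uint64_t features)
  {
      return (features &
              (UINT64_C(1) << VHOST_USER_F_PROTOCOL_FEATURES)) != 0;
  }

Only when this returns true for the negotiated features would the
protocol feature messages be exchanged.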

Starting and stopping rings
---------------------------

The client must only process a ring while it is started.

The client must only pass data between the ring and the backend while
the ring is enabled.

If a ring is started but disabled, the client must process the ring
without talking to the backend.

For example, for a networking device, in the disabled state the client
must not supply any new RX packets, but must process and discard any
TX packets.

If ``VHOST_USER_F_PROTOCOL_FEATURES`` has not been negotiated, the
ring is initialized in an enabled state.

If ``VHOST_USER_F_PROTOCOL_FEATURES`` has been negotiated, the ring is
initialized in a disabled state. The client must not pass data to/from
the backend until the ring is enabled by
``VHOST_USER_SET_VRING_ENABLE`` with parameter 1, or after it has been
disabled by ``VHOST_USER_SET_VRING_ENABLE`` with parameter 0.

Each ring is initialized in a stopped state; the client must not
process it until the ring is started, or after it has been stopped.

The client must start a ring upon receiving a kick (that is, detecting
that the file descriptor is readable) on the descriptor specified by
``VHOST_USER_SET_VRING_KICK`` or receiving the in-band message
``VHOST_USER_VRING_KICK`` if negotiated, and stop the ring upon
receiving ``VHOST_USER_GET_VRING_BASE``.

While processing the rings (whether they are enabled or not), the
client must support changing some configuration aspects on the fly.
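
The started/stopped and enabled/disabled rules above can be sketched
as two independent flags per ring. This is only an illustration: the
``RingState`` type and the ``ring_*`` helpers are hypothetical names,
and a real backend tracks far more state.

.. code:: c

  #include <assert.h>
  #include <stdbool.h>
  #include <stdint.h>

  /* Hypothetical per-ring state: two independent flags. */
  typedef struct RingState {
      bool started;  /* set by a kick, cleared by GET_VRING_BASE */
      bool enabled;  /* controlled by SET_VRING_ENABLE */
  } RingState;

  /* Rings start stopped; the initial enabled state depends on whether
   * VHOST_USER_F_PROTOCOL_FEATURES was negotiated. */
  static void ring_init(RingState *r, bool protocol_features_negotiated)
  {
      assert(r);
      r->started = false;
      r->enabled = !protocol_features_negotiated;
  }

  static void ring_on_kick(RingState *r)           { r->started = true; }
  static void ring_on_get_vring_base(RingState *r) { r->started = false; }

  /* param is the SET_VRING_ENABLE parameter: 1 enables, 0 disables. */
  static void ring_on_set_vring_enable(RingState *r, uint32_t param)
  {
      r->enabled = (param == 1);
  }

  /* The ring may be processed only while started. */
  static bool ring_may_process(const RingState *r)
  {
      return r->started;
  }

  /* Data may reach the backend only while started and enabled. */
  static bool ring_may_pass_data(const RingState *r)
  {
      return r->started && r->enabled;
  }

When ``ring_may_process()`` is true but ``ring_may_pass_data()`` is
false, a network backend would, for instance, consume and discard TX
buffers without forwarding them.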

Multiple queue support
----------------------

Many devices have a fixed number of virtqueues.  In this case the master
already knows the number of available virtqueues without communicating with the
slave.

Some devices do not have a fixed number of virtqueues.  Instead the maximum
number of virtqueues is chosen by the slave.  The number can depend on host
resource availability or slave implementation details.  Such devices are called
multiple queue devices.

Multiple queue support allows the slave to advertise the maximum number of
queues.  This is treated as a protocol extension, hence the slave has to
implement protocol features first. The multiple queues feature is supported
only when the protocol feature ``VHOST_USER_PROTOCOL_F_MQ`` (bit 0) is set.

The maximum number of queues the slave supports can be queried with message
``VHOST_USER_GET_QUEUE_NUM``. Master should stop when the number of requested
queues is bigger than that.

As all queues share one connection, the master uses a unique index for each
queue in the sent message to identify a specified queue.

The master enables queues by sending message ``VHOST_USER_SET_VRING_ENABLE``.
vhost-user-net has historically automatically enabled the first queue pair.

Slaves should always implement the ``VHOST_USER_PROTOCOL_F_MQ`` protocol
feature, even for devices with a fixed number of virtqueues, since it is simple
to implement and offers a degree of introspection.

Masters must not rely on the ``VHOST_USER_PROTOCOL_F_MQ`` protocol feature for
devices with a fixed number of virtqueues.  Only true multiqueue devices
require this protocol feature.

Migration
---------

During live migration, the master may need to track the modifications
the slave makes to the memory mapped regions. The client should mark
the dirty pages in a log. Once it complies with this logging, it may
declare the ``VHOST_F_LOG_ALL`` vhost feature.

To start/stop logging of data/used ring writes, server may send
messages ``VHOST_USER_SET_FEATURES`` with ``VHOST_F_LOG_ALL`` and
``VHOST_USER_SET_VRING_ADDR`` with ``VHOST_VRING_F_LOG`` in ring's
flags set to 1/0, respectively.

All the modifications to memory pointed to by vring "descriptor" should
be marked. Modifications to "used" vring should be marked if
``VHOST_VRING_F_LOG`` is part of ring's flags.

Dirty pages are of size::

  #define VHOST_LOG_PAGE 0x1000

The log memory fd is provided in the ancillary data of
``VHOST_USER_SET_LOG_BASE`` message when the slave has
``VHOST_USER_PROTOCOL_F_LOG_SHMFD`` protocol feature.

The size of the log is supplied as part of ``VhostUserMsg`` which
should be large enough to cover all known guest addresses. Log starts
at the supplied offset in the supplied file descriptor.  The log
covers from address 0 to the maximum of guest regions. In pseudo-code,
to mark page at ``addr`` as dirty::

  page = addr / VHOST_LOG_PAGE
  log[page / 8] |= 1 << page % 8

Where ``addr`` is the guest physical address.

Use atomic operations, as the log may be concurrently manipulated.
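
One way to express the pseudo-code above with C11 atomics is sketched
below. The helper names ``vu_log_page`` and ``vu_log_range`` are
hypothetical, and treating the mapped log as an array of
``_Atomic uint8_t`` is an assumption of this sketch rather than
something the protocol mandates. Bounds checking against the log size
is left to the caller.

.. code:: c

  #include <assert.h>
  #include <stdatomic.h>
  #include <stdint.h>

  #define VHOST_LOG_PAGE 0x1000

  static_assert((VHOST_LOG_PAGE & (VHOST_LOG_PAGE - 1)) == 0,
                "log page size is expected to be a power of two");

  /* Atomically set the dirty bit for the page containing addr, where
   * addr is a guest physical address. */
  static inline void vu_log_page(_Atomic uint8_t *log, uint64_t addr)
  {
      uint64_t page = addr / VHOST_LOG_PAGE;

      atomic_fetch_or(&log[page / 8], (uint8_t)(1u << (page % 8)));
  }

  /* Mark every page touched by the byte range [addr, addr + len). */
  static inline void vu_log_range(_Atomic uint8_t *log, uint64_t addr,
                                  uint64_t len)
  {
      if (len == 0) {
          return;
      }
      uint64_t first = addr / VHOST_LOG_PAGE;
      uint64_t last = (addr + len - 1) / VHOST_LOG_PAGE;

      for (uint64_t page = first; page <= last; page++) {
          atomic_fetch_or(&log[page / 8], (uint8_t)(1u << (page % 8)));
      }
  }

A write that straddles a page boundary dirties both pages, which is
why the range helper walks from the first to the last page touched.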

Note that when logging modifications to the used ring (when
``VHOST_VRING_F_LOG`` is set for this ring), ``log_guest_addr`` should
be used to calculate the log offset: the write to first byte of the
used ring is logged at this offset from log start. Also note that this
value might be outside the legal guest physical address range
(i.e. does not have to be covered by the ``VhostUserMemory`` table), but
the bit offset of the last byte of the ring must fall within the size
supplied by ``VhostUserLog``.

``VHOST_USER_SET_LOG_FD`` is an optional message with an eventfd in
ancillary data; it may be used to inform the master that the log has
been modified.

Once the source has finished migration, rings will be stopped by the
source. No further update must be done before rings are restarted.

In postcopy migration the slave is started before all the memory has
been received from the source host, and care must be taken to avoid
accessing pages that have yet to be received.  The slave opens a
'userfault'-fd and registers the memory with it; this fd is then
passed back over to the master.  The master services requests on the
userfaultfd for pages that are accessed and when the page is available
it performs WAKE ioctls on the userfaultfd to wake the stalled
slave.  The client indicates support for this via the
``VHOST_USER_PROTOCOL_F_PAGEFAULT`` feature.

Memory access
-------------

The master sends a list of vhost memory regions to the slave using the
``VHOST_USER_SET_MEM_TABLE`` message.  Each region has two base
addresses: a guest address and a user address.

Messages contain guest addresses and/or user addresses to reference locations
within the shared memory.  The mapping of these addresses works as follows.

User addresses map to the vhost memory region containing that user address.

When the ``VIRTIO_F_IOMMU_PLATFORM`` feature has not been negotiated:

* Guest addresses map to the vhost memory region containing that guest
  address.

When the ``VIRTIO_F_IOMMU_PLATFORM`` feature has been negotiated:

* Guest addresses are also called I/O virtual addresses (IOVAs).  They are
  translated to user addresses via the IOTLB.

* The vhost memory region guest address is not used.
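
The region lookup described above might be sketched as follows for a
slave that has mapped each region's file descriptor. The ``VuRegion``
type, its ``mmap_base`` field and the helper names are hypothetical.
The sketch also assumes that each fd was mapped from file offset 0, so
that the region's data begins ``mmap_offset`` bytes into the mapping.
The guest address helper covers only the case where
``VIRTIO_F_IOMMU_PLATFORM`` has not been negotiated.

.. code:: c

  #include <assert.h>
  #include <stddef.h>
  #include <stdint.h>

  /* Hypothetical slave-side record of one vhost memory region. */
  typedef struct VuRegion {
      uint64_t guest_addr;   /* guest address of the region */
      uint64_t size;         /* size of the region */
      uint64_t user_addr;    /* user address of the region */
      uint64_t mmap_offset;  /* where the region starts in the mapping */
      uint8_t *mmap_base;    /* assumed: local mmap() of the fd, offset 0 */
  } VuRegion;

  /* Translate a guest address to a local pointer, or NULL if no
   * region contains it. */
  static void *vu_guest_to_local(const VuRegion *regions, size_t n,
                                 uint64_t guest_addr)
  {
      assert(regions != NULL || n == 0);
      for (size_t i = 0; i < n; i++) {
          const VuRegion *r = &regions[i];

          if (guest_addr >= r->guest_addr &&
              guest_addr - r->guest_addr < r->size) {
              return r->mmap_base + r->mmap_offset +
                     (guest_addr - r->guest_addr);
          }
      }
      return NULL;
  }

  /* Same lookup keyed on the user address. */
  static void *vu_user_to_local(const VuRegion *regions, size_t n,
                                uint64_t user_addr)
  {
      assert(regions != NULL || n == 0);
      for (size_t i = 0; i < n; i++) {
          const VuRegion *r = &regions[i];

          if (user_addr >= r->user_addr &&
              user_addr - r->user_addr < r->size) {
              return r->mmap_base + r->mmap_offset +
                     (user_addr - r->user_addr);
          }
      }
      return NULL;
  }

A caller accessing a buffer of some length would additionally need to
check that the whole buffer, not just its first byte, lies inside the
region.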

IOMMU support
-------------

When the ``VIRTIO_F_IOMMU_PLATFORM`` feature has been negotiated, the
master sends IOTLB entry updates and invalidations by sending
``VHOST_USER_IOTLB_MSG`` requests to the slave with a ``struct
vhost_iotlb_msg`` as payload. For update events, the ``iotlb`` payload
has to be filled with the update message type (2), the I/O virtual
address, the size, the user virtual address, and the permissions
flags. Addresses and size must be within vhost memory regions set via
the ``VHOST_USER_SET_MEM_TABLE`` request. For invalidation events, the
``iotlb`` payload has to be filled with the invalidation message type
(3), the I/O virtual address and the size. On success, the slave is
expected to reply with a zero payload, non-zero otherwise.

The slave relies on the slave communication channel (see :ref:`Slave
communication <slave_communication>` section below) to send IOTLB miss
and access failure events, by sending ``VHOST_USER_SLAVE_IOTLB_MSG``
requests to the master with a ``struct vhost_iotlb_msg`` as
payload. For miss events, the iotlb payload has to be filled with the
miss message type (1), the I/O virtual address and the permissions
flags. For access failure events, the iotlb payload has to be filled
with the access failure message type (4), the I/O virtual address and
the permissions flags.  For synchronization purposes, the slave may
rely on the reply-ack feature, so the master may send a reply when the
operation is completed if the reply-ack feature is negotiated and the
slave requests a reply. For miss events, a completed operation means
that either the master sent an update message containing the IOTLB
entry for the requested address and permission, or the master sent
nothing because the IOTLB miss message is invalid (invalid IOVA or
permission).

The master isn't expected to take the initiative to send IOTLB update
messages, as the slave sends IOTLB miss messages for the guest virtual
memory areas it needs to access.
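
As a rough illustration of which fields each event fills in, the
sketch below builds miss and invalidation payloads. ``VuIotlbMsg`` and
the ``VU_IOTLB_*`` constants are hypothetical stand-ins that mirror
the field list of the IOTLB message described earlier; the sketch does
not claim to match the exact layout of ``struct vhost_iotlb_msg``.

.. code:: c

  #include <assert.h>
  #include <stdint.h>
  #include <string.h>

  /* Permission flags and message types, as listed for the IOTLB
   * message payload. The names are hypothetical. */
  enum {
      VU_IOTLB_PERM_NONE = 0,
      VU_IOTLB_PERM_RO   = 1,
      VU_IOTLB_PERM_WO   = 2,
      VU_IOTLB_PERM_RW   = 3,
  };

  enum {
      VU_IOTLB_MISS        = 1,
      VU_IOTLB_UPDATE      = 2,
      VU_IOTLB_INVALIDATE  = 3,
      VU_IOTLB_ACCESS_FAIL = 4,
  };

  /* Hypothetical stand-in for the IOTLB payload fields. */
  typedef struct VuIotlbMsg {
      uint64_t iova;
      uint64_t size;
      uint64_t uaddr;
      uint8_t perm;
      uint8_t type;
  } VuIotlbMsg;

  /* Slave to master: a miss carries the type, IOVA and permissions. */
  static VuIotlbMsg vu_iotlb_miss(uint64_t iova, uint8_t perm)
  {
      VuIotlbMsg m;

      assert(perm <= VU_IOTLB_PERM_RW);
      memset(&m, 0, sizeof(m));
      m.iova = iova;
      m.perm = perm;
      m.type = VU_IOTLB_MISS;
      return m;
  }

  /* Master to slave: an invalidation carries the type, IOVA and size. */
  static VuIotlbMsg vu_iotlb_invalidate(uint64_t iova, uint64_t size)
  {
      VuIotlbMsg m;

      memset(&m, 0, sizeof(m));
      m.iova = iova;
      m.size = size;
      m.type = VU_IOTLB_INVALIDATE;
      return m;
  }

Fields that an event does not use are zeroed here; that is a choice of
this sketch, not a stated protocol requirement.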

.. _slave_communication:

Slave communication
-------------------

An optional communication channel is provided if the slave declares
``VHOST_USER_PROTOCOL_F_SLAVE_REQ`` protocol feature, to allow the
slave to make requests to the master.

The fd is provided via ``VHOST_USER_SET_SLAVE_REQ_FD`` ancillary data.

A slave may then send ``VHOST_USER_SLAVE_*`` messages to the master
using this fd communication channel.

If ``VHOST_USER_PROTOCOL_F_SLAVE_SEND_FD`` protocol feature is
negotiated, slave can send file descriptors (at most 8 descriptors in
each message) to master via ancillary data using this fd communication
channel.

Inflight I/O tracking
---------------------

To support reconnecting after restart or crash, the slave may need to
resubmit inflight I/Os. If the virtqueue is processed in order, this
is easily achieved by getting the inflight descriptors from the
descriptor table (split virtqueue) or descriptor ring (packed
virtqueue). However, this does not work when descriptors are processed
out of order, because some entries which store the information of
inflight descriptors in the available ring (split virtqueue) or
descriptor ring (packed virtqueue) might be overwritten by new
entries. To solve this problem, the slave needs to allocate an extra
buffer to store this information about inflight descriptors and share
it with the master for persistence. ``VHOST_USER_GET_INFLIGHT_FD`` and
``VHOST_USER_SET_INFLIGHT_FD`` are used to transfer this buffer
between master and slave. The format of this buffer is described
below:

+---------------+---------------+-----+---------------+
| queue0 region | queue1 region | ... | queueN region |
+---------------+---------------+-----+---------------+

N is the number of available virtqueues. The slave can get it from the
num queues field of ``VhostUserInflight``.

For split virtqueue, queue region can be implemented as:

.. code:: c

  typedef struct DescStateSplit {
      /* Indicate whether this descriptor is inflight or not.
       * Only available for head-descriptor. */
      uint8_t inflight;

      /* Padding */
      uint8_t padding[5];

      /* Maintain a list for the last batch of used descriptors.
       * Only available when batching is used for submitting */
      uint16_t next;

      /* Used to preserve the order of fetching available descriptors.
       * Only available for head-descriptor. */
      uint64_t counter;
  } DescStateSplit;

  typedef struct QueueRegionSplit {
      /* The feature flags of this region. Now it's initialized to 0. */
      uint64_t features;

      /* The version of this region. It's 1 currently.
       * Zero value indicates an uninitialized buffer */
      uint16_t version;

      /* The size of DescStateSplit array. It's equal to the virtqueue
       * size. Slave could get it from queue size field of VhostUserInflight. */
      uint16_t desc_num;

      /* The head of list that track the last batch of used descriptors. */
      uint16_t last_batch_head;

      /* Store the idx value of used ring */
      uint16_t used_idx;

      /* Used to track the state of each descriptor in descriptor table */
      DescStateSplit desc[];
  } QueueRegionSplit;

To track inflight I/O, the queue region should be processed as follows:

When receiving available buffers from the driver:

#. Get the next available head-descriptor index from available ring, ``i``

#. Set ``desc[i].counter`` to the value of global counter

#. Increase global counter by 1

#. Set ``desc[i].inflight`` to 1

When supplying used buffers to the driver:

1. Get corresponding used head-descriptor index, ``i``

2. Set ``desc[i].next`` to ``last_batch_head``

3. Set ``last_batch_head`` to ``i``

#. Steps 1,2,3 may be performed repeatedly if batching is possible

#. Increase the ``idx`` value of used ring by the size of the batch

#. Set the ``inflight`` field of each ``DescStateSplit`` entry in the batch to 0

#. Set ``used_idx`` to the ``idx`` value of used ring

When reconnecting:

#. If the value of ``used_idx`` does not match the ``idx`` value of
   used ring (which means the ``inflight`` field of ``DescStateSplit``
   entries in the last batch may be incorrect),

   a. Subtract the value of ``used_idx`` from the ``idx`` value of
      used ring to get last batch size of ``DescStateSplit`` entries

   #. Set the ``inflight`` field of each ``DescStateSplit`` entry to 0 in last batch
      list which starts from ``last_batch_head``

   #. Set ``used_idx`` to the ``idx`` value of used ring

#. Resubmit inflight ``DescStateSplit`` entries in order of their
   counter value
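
The reconnect steps for the split virtqueue could be sketched as
below. The two helpers and the ``ResubmitEntry`` type are hypothetical
names, the structures repeat the definitions above without their
comments, and ``ring_used_idx`` stands for the ``idx`` value read from
the used ring. Wrap-around of the 64-bit counter is ignored.

.. code:: c

  #include <assert.h>
  #include <stdint.h>
  #include <stdlib.h>

  typedef struct DescStateSplit {
      uint8_t inflight;
      uint8_t padding[5];
      uint16_t next;
      uint64_t counter;
  } DescStateSplit;

  typedef struct QueueRegionSplit {
      uint64_t features;
      uint16_t version;
      uint16_t desc_num;
      uint16_t last_batch_head;
      uint16_t used_idx;
      DescStateSplit desc[];
  } QueueRegionSplit;

  /* Hypothetical helper type: one inflight head to resubmit. */
  typedef struct ResubmitEntry {
      uint64_t counter;
      uint16_t index;
  } ResubmitEntry;

  /* Step 1: clear the inflight flag of the last batch if the used
   * ring moved on but the region was not updated before the crash. */
  static void split_fixup_last_batch(QueueRegionSplit *r,
                                     uint16_t ring_used_idx)
  {
      if (r->used_idx == ring_used_idx) {
          return;
      }
      uint16_t batch = (uint16_t)(ring_used_idx - r->used_idx);
      uint16_t i = r->last_batch_head;

      while (batch-- > 0 && i < r->desc_num) {
          r->desc[i].inflight = 0;
          i = r->desc[i].next;
      }
      r->used_idx = ring_used_idx;
  }

  static int cmp_counter(const void *a, const void *b)
  {
      const ResubmitEntry *x = a, *y = b;

      return (x->counter > y->counter) - (x->counter < y->counter);
  }

  /* Step 2: gather inflight heads ordered by counter. out must have
   * room for desc_num entries; returns the number of entries. */
  static size_t split_collect_inflight(const QueueRegionSplit *r,
                                       ResubmitEntry *out)
  {
      size_t n = 0;

      assert(out != NULL);
      for (uint16_t i = 0; i < r->desc_num; i++) {
          if (r->desc[i].inflight) {
              out[n].counter = r->desc[i].counter;
              out[n].index = i;
              n++;
          }
      }
      qsort(out, n, sizeof(out[0]), cmp_counter);
      return n;
  }

After both steps, the backend would resubmit the heads in ``out`` in
array order, oldest first.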

For packed virtqueue, queue region can be implemented as:

.. code:: c

  typedef struct DescStatePacked {
      /* Indicate whether this descriptor is inflight or not.
       * Only available for head-descriptor. */
      uint8_t inflight;

      /* Padding */
      uint8_t padding;

      /* Link to the next free entry */
      uint16_t next;

      /* Link to the last entry of descriptor list.
       * Only available for head-descriptor. */
      uint16_t last;

      /* The length of descriptor list.
       * Only available for head-descriptor. */
      uint16_t num;

      /* Used to preserve the order of fetching available descriptors.
       * Only available for head-descriptor. */
      uint64_t counter;

      /* The buffer id */
      uint16_t id;

      /* The descriptor flags */
      uint16_t flags;

      /* The buffer length */
      uint32_t len;

      /* The buffer address */
      uint64_t addr;
  } DescStatePacked;

  typedef struct QueueRegionPacked {
      /* The feature flags of this region. Now it's initialized to 0. */
      uint64_t features;

      /* The version of this region. It's 1 currently.
       * Zero value indicates an uninitialized buffer */
      uint16_t version;

      /* The size of DescStatePacked array. It's equal to the virtqueue
       * size. Slave could get it from queue size field of VhostUserInflight. */
      uint16_t desc_num;

      /* The head of free DescStatePacked entry list */
      uint16_t free_head;

      /* The old head of free DescStatePacked entry list */
      uint16_t old_free_head;

      /* The used index of descriptor ring */
      uint16_t used_idx;

      /* The old used index of descriptor ring */
      uint16_t old_used_idx;

      /* Device ring wrap counter */
      uint8_t used_wrap_counter;

      /* The old device ring wrap counter */
      uint8_t old_used_wrap_counter;

      /* Padding */
      uint8_t padding[7];

      /* Used to track the state of each descriptor fetched from descriptor ring */
      DescStatePacked desc[];
  } QueueRegionPacked;

To track inflight I/O, the queue region should be processed as follows:

When receiving available buffers from the driver:

#. Get the next available descriptor entry from descriptor ring, ``d``

#. If ``d`` is head descriptor,

   a. Set ``desc[old_free_head].num`` to 0

   #. Set ``desc[old_free_head].counter`` to the value of global counter

   #. Increase global counter by 1

   #. Set ``desc[old_free_head].inflight`` to 1

#. If ``d`` is last descriptor, set ``desc[old_free_head].last`` to
   ``free_head``

#. Increase ``desc[old_free_head].num`` by 1

#. Set ``desc[free_head].addr``, ``desc[free_head].len``,
   ``desc[free_head].flags``, ``desc[free_head].id`` to ``d.addr``,
   ``d.len``, ``d.flags``, ``d.id``

#. Set ``free_head`` to ``desc[free_head].next``

#. If ``d`` is last descriptor, set ``old_free_head`` to ``free_head``
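
The receive steps above for the packed virtqueue translate fairly
directly into code. In this sketch the ``PackedDesc`` type, the
``packed_track_avail`` helper and its ``is_head`` / ``is_last``
parameters are hypothetical; how a backend decides that a descriptor
is the head or the last one of a chain is outside the scope of the
sketch. The structures repeat the definitions above without comments.

.. code:: c

  #include <assert.h>
  #include <stdbool.h>
  #include <stdint.h>

  typedef struct DescStatePacked {
      uint8_t inflight;
      uint8_t padding;
      uint16_t next;
      uint16_t last;
      uint16_t num;
      uint64_t counter;
      uint16_t id;
      uint16_t flags;
      uint32_t len;
      uint64_t addr;
  } DescStatePacked;

  typedef struct QueueRegionPacked {
      uint64_t features;
      uint16_t version;
      uint16_t desc_num;
      uint16_t free_head;
      uint16_t old_free_head;
      uint16_t used_idx;
      uint16_t old_used_idx;
      uint8_t used_wrap_counter;
      uint8_t old_used_wrap_counter;
      uint8_t padding[7];
      DescStatePacked desc[];
  } QueueRegionPacked;

  /* Hypothetical view of one descriptor ring entry, ``d``. */
  typedef struct PackedDesc {
      uint64_t addr;
      uint32_t len;
      uint16_t id;
      uint16_t flags;
  } PackedDesc;

  /* Record one available descriptor d in the inflight region. */
  static void packed_track_avail(QueueRegionPacked *r, const PackedDesc *d,
                                 bool is_head, bool is_last,
                                 uint64_t *global_counter)
  {
      assert(r->free_head < r->desc_num);
      assert(r->old_free_head < r->desc_num);

      DescStatePacked *head = &r->desc[r->old_free_head];
      DescStatePacked *cur = &r->desc[r->free_head];

      if (is_head) {
          head->num = 0;
          head->counter = (*global_counter)++;
          head->inflight = 1;
      }
      if (is_last) {
          head->last = r->free_head;
      }
      head->num++;

      cur->addr = d->addr;
      cur->len = d->len;
      cur->flags = d->flags;
      cur->id = d->id;

      r->free_head = cur->next;
      if (is_last) {
          r->old_free_head = r->free_head;
      }
  }

Because ``old_free_head`` only advances on the last descriptor of a
chain, a crash in the middle of a chain leaves it pointing at the
chain's head entry.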

When supplying used buffers to the driver:

1. Get corresponding used head-descriptor entry from descriptor ring,
   ``d``

2. Get corresponding ``DescStatePacked`` entry, ``e``

3. Set ``desc[e.last].next`` to ``free_head``

4. Set ``free_head`` to the index of ``e``

#. Steps 1,2,3,4 may be performed repeatedly if batching is possible

#. Increase ``used_idx`` by the size of the batch and update
   ``used_wrap_counter`` if needed

#. Update ``d.flags``

#. Set the ``inflight`` field of each head ``DescStatePacked`` entry
   in the batch to 0

#. Set ``old_free_head``,  ``old_used_idx``, ``old_used_wrap_counter``
   to ``free_head``, ``used_idx``, ``used_wrap_counter``

When reconnecting:

#. If ``used_idx`` does not match ``old_used_idx`` (which means the
   ``inflight`` field of ``DescStatePacked`` entries in the last batch
   may be incorrect),

   a. Get the next descriptor ring entry through ``old_used_idx``, ``d``

   #. Use ``old_used_wrap_counter`` to calculate the available flags

   #. If ``d.flags`` is not equal to the calculated flags value (which
      means the slave has submitted the buffer to the guest driver
      before the crash, so it has to commit the in-progress update),
      set ``old_free_head``, ``old_used_idx``,
      ``old_used_wrap_counter`` to ``free_head``, ``used_idx``,
      ``used_wrap_counter``

#. Set ``free_head``, ``used_idx``, ``used_wrap_counter`` to
   ``old_free_head``, ``old_used_idx``, ``old_used_wrap_counter``
   (roll back any in-progress update)

#. Set the ``inflight`` field of each ``DescStatePacked`` entry in
   free list to 0

#. Resubmit inflight ``DescStatePacked`` entries in order of their
   counter value

In-band notifications
---------------------

In some limited situations (e.g. for simulation) it is desirable to
have the kick, call and error (if used) signals done via in-band
messages instead of asynchronous eventfd notifications. This can be
done by negotiating the ``VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS``
protocol feature.

Note that because too many messages on the sockets can cause the
sending application(s) to block, it is not advised to use this feature
unless absolutely necessary. It is also considered an error to
negotiate this feature without also negotiating
``VHOST_USER_PROTOCOL_F_SLAVE_REQ`` and
``VHOST_USER_PROTOCOL_F_REPLY_ACK``. The former is necessary for
getting a message channel from the slave to the master, while the
latter needs to be used with the in-band notification messages to
block until they are processed, both to avoid blocking later and for
proper processing (at least in the simulation use case). As it has no
other way of signalling this error, the slave should close the
connection as a response to a ``VHOST_USER_SET_PROTOCOL_FEATURES``
message that sets the in-band notifications feature flag without the
other two.

Protocol features
-----------------

.. code:: c

  #define VHOST_USER_PROTOCOL_F_MQ                    0
  #define VHOST_USER_PROTOCOL_F_LOG_SHMFD             1
  #define VHOST_USER_PROTOCOL_F_RARP                  2
  #define VHOST_USER_PROTOCOL_F_REPLY_ACK             3
  #define VHOST_USER_PROTOCOL_F_MTU                   4
  #define VHOST_USER_PROTOCOL_F_SLAVE_REQ             5
  #define VHOST_USER_PROTOCOL_F_CROSS_ENDIAN          6
  #define VHOST_USER_PROTOCOL_F_CRYPTO_SESSION        7
  #define VHOST_USER_PROTOCOL_F_PAGEFAULT             8
  #define VHOST_USER_PROTOCOL_F_CONFIG                9
  #define VHOST_USER_PROTOCOL_F_SLAVE_SEND_FD        10
  #define VHOST_USER_PROTOCOL_F_HOST_NOTIFIER        11
  #define VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD       12
  #define VHOST_USER_PROTOCOL_F_RESET_DEVICE         13
  #define VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS 14
  #define VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS  15
  #define VHOST_USER_PROTOCOL_F_STATUS               16
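
The dependency stated in the in-band notifications section (that
feature requires both ``VHOST_USER_PROTOCOL_F_SLAVE_REQ`` and
``VHOST_USER_PROTOCOL_F_REPLY_ACK``) can be checked on the bitmask a
slave receives. The following is a minimal sketch; the ``VU_PF`` macro
and the ``vu_protocol_features_valid`` helper are hypothetical names.

.. code:: c

  #include <assert.h>
  #include <stdbool.h>
  #include <stdint.h>

  #define VHOST_USER_PROTOCOL_F_REPLY_ACK             3
  #define VHOST_USER_PROTOCOL_F_SLAVE_REQ             5
  #define VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS 14

  /* Hypothetical helper: turn a feature bit number into a mask. */
  #define VU_PF(bit) (UINT64_C(1) << (bit))

  static_assert(VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS < 64,
                "protocol feature bits must fit in a 64-bit mask");

  /* Returns false if in-band notifications are requested without the
   * two features they depend on. A slave seeing false would close the
   * connection, as it has no other way to signal the error. */
  static bool vu_protocol_features_valid(uint64_t pf)
  {
      const uint64_t required =
          VU_PF(VHOST_USER_PROTOCOL_F_REPLY_ACK) |
          VU_PF(VHOST_USER_PROTOCOL_F_SLAVE_REQ);

      if (pf & VU_PF(VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS)) {
          return (pf & required) == required;
      }
      return true;
  }

The check only covers this one dependency; other protocol features are
accepted as they are.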
839
Master message types
--------------------

``VHOST_USER_GET_FEATURES``
  :id: 1
  :equivalent ioctl: ``VHOST_GET_FEATURES``
  :master payload: N/A
  :slave payload: ``u64``

  Get the features bitmask from the underlying vhost implementation.
  Feature bit ``VHOST_USER_F_PROTOCOL_FEATURES`` signals slave support
  for ``VHOST_USER_GET_PROTOCOL_FEATURES`` and
  ``VHOST_USER_SET_PROTOCOL_FEATURES``.

``VHOST_USER_SET_FEATURES``
  :id: 2
  :equivalent ioctl: ``VHOST_SET_FEATURES``
  :master payload: ``u64``

  Enable features in the underlying vhost implementation using a
  bitmask.  Feature bit ``VHOST_USER_F_PROTOCOL_FEATURES`` signals
  slave support for ``VHOST_USER_GET_PROTOCOL_FEATURES`` and
  ``VHOST_USER_SET_PROTOCOL_FEATURES``.

``VHOST_USER_GET_PROTOCOL_FEATURES``
  :id: 15
  :equivalent ioctl: ``VHOST_GET_FEATURES``
  :master payload: N/A
  :slave payload: ``u64``

  Get the protocol feature bitmask from the underlying vhost
  implementation.  Only legal if feature bit
  ``VHOST_USER_F_PROTOCOL_FEATURES`` is present in
  ``VHOST_USER_GET_FEATURES``.

.. Note::
   A slave that reported ``VHOST_USER_F_PROTOCOL_FEATURES`` must
   support this message even before ``VHOST_USER_SET_FEATURES`` was
   called.

``VHOST_USER_SET_PROTOCOL_FEATURES``
  :id: 16
  :equivalent ioctl: ``VHOST_SET_FEATURES``
  :master payload: ``u64``

  Enable protocol features in the underlying vhost implementation.

  Only legal if feature bit ``VHOST_USER_F_PROTOCOL_FEATURES`` is present in
  ``VHOST_USER_GET_FEATURES``.

.. Note::
   A slave that reported ``VHOST_USER_F_PROTOCOL_FEATURES`` must support
   this message even before ``VHOST_USER_SET_FEATURES`` was called.

``VHOST_USER_SET_OWNER``
  :id: 3
  :equivalent ioctl: ``VHOST_SET_OWNER``
  :master payload: N/A

  Issued when a new connection is established. It sets the current
  *master* as the owner of the session. This can be used on the *slave*
  as a "session start" flag.

``VHOST_USER_RESET_OWNER``
  :id: 4
  :master payload: N/A

.. admonition:: Deprecated

   This is no longer used. It used to be sent to request disabling all
   rings, but some clients interpreted it to also discard connection
   state (this interpretation would lead to bugs).  It is recommended
   that clients either ignore this message, or use it to disable all
   rings.

``VHOST_USER_SET_MEM_TABLE``
  :id: 5
  :equivalent ioctl: ``VHOST_SET_MEM_TABLE``
  :master payload: memory regions description
  :slave payload: (postcopy only) memory regions description

  Sets the memory map regions on the slave so it can translate the
  vring addresses. The ancillary data carries an array of file
  descriptors, one for each memory mapped region. The size and ordering
  of the fds match the number and ordering of memory regions.

  When ``VHOST_USER_POSTCOPY_LISTEN`` has been received, the slave
  replies to ``VHOST_USER_SET_MEM_TABLE`` with the bases of the memory
  mapped regions.  The slave must have mmap'd the regions but
  not yet accessed them and should not yet generate a userfault
  event.

.. Note::
   ``NEED_REPLY_MASK`` is not set in this case.  QEMU will then
   reply back to the list of mappings with an empty
   ``VHOST_USER_SET_MEM_TABLE`` as an acknowledgement; only upon
   reception of this message may the guest start accessing the memory
   and generating faults.

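On a Unix domain socket, file descriptors carried in the ancillary
data are sent as an ``SCM_RIGHTS`` control message. The following is a
minimal sketch of sending a message buffer together with an array of
fds in region order. The names ``send_with_fds`` and ``MAX_FDS`` are
hypothetical, and a real implementation would also need to handle
partial writes and ``EINTR``:

.. code:: c

  #include <assert.h>
  #include <string.h>
  #include <sys/socket.h>
  #include <sys/types.h>
  #include <sys/uio.h>

  #define MAX_FDS 8 /* assumption: upper bound chosen for this sketch */

  /* Send 'len' bytes of 'buf' plus 'nfds' descriptors in one message.
   * Returns the sendmsg() result: bytes sent, or -1 on error. */
  static ssize_t send_with_fds(int sock, const void *buf, size_t len,
                               const int *fds, size_t nfds)
  {
      /* The union keeps the control buffer aligned for struct cmsghdr. */
      union {
          char buf[CMSG_SPACE(sizeof(int) * MAX_FDS)];
          struct cmsghdr align;
      } control;
      struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
      struct msghdr msg;

      assert(nfds <= MAX_FDS);
      memset(&msg, 0, sizeof(msg));
      memset(&control, 0, sizeof(control));
      msg.msg_iov = &iov;
      msg.msg_iovlen = 1;

      if (nfds > 0) {
          struct cmsghdr *cmsg;

          msg.msg_control = control.buf;
          msg.msg_controllen = CMSG_SPACE(sizeof(int) * nfds);
          cmsg = CMSG_FIRSTHDR(&msg);
          cmsg->cmsg_level = SOL_SOCKET;
          cmsg->cmsg_type = SCM_RIGHTS;
          cmsg->cmsg_len = CMSG_LEN(sizeof(int) * nfds);
          /* fds[i] must correspond to memory region i of the payload. */
          memcpy(CMSG_DATA(cmsg), fds, sizeof(int) * nfds);
      }
      return sendmsg(sock, &msg, 0);
  }

The receiver would use ``recvmsg()`` with a control buffer of the same
size and read the descriptors back in the same order.
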
``VHOST_USER_SET_LOG_BASE``
  :id: 6
  :equivalent ioctl: ``VHOST_SET_LOG_BASE``
  :master payload: ``u64``
  :slave payload: N/A

  Sets logging shared memory space.

  When the slave has the ``VHOST_USER_PROTOCOL_F_LOG_SHMFD`` protocol
  feature, the log memory fd is provided in the ancillary data of the
  ``VHOST_USER_SET_LOG_BASE`` message, and the size and offset of the
  shared memory area are provided in the message.

``VHOST_USER_SET_LOG_FD``
  :id: 7
  :equivalent ioctl: ``VHOST_SET_LOG_FD``
  :master payload: N/A

  Sets the logging file descriptor, which is passed as ancillary data.

``VHOST_USER_SET_VRING_NUM``
  :id: 8
  :equivalent ioctl: ``VHOST_SET_VRING_NUM``
  :master payload: vring state description

  Set the size of the queue.

``VHOST_USER_SET_VRING_ADDR``
  :id: 9
  :equivalent ioctl: ``VHOST_SET_VRING_ADDR``
  :master payload: vring address description
  :slave payload: N/A

  Sets the addresses of the different aspects of the vring.

``VHOST_USER_SET_VRING_BASE``
  :id: 10
  :equivalent ioctl: ``VHOST_SET_VRING_BASE``
  :master payload: vring state description

  Sets the base offset in the available vring.

``VHOST_USER_GET_VRING_BASE``
  :id: 11
  :equivalent ioctl: ``VHOST_GET_VRING_BASE``
  :master payload: vring state description
  :slave payload: vring state description

  Get the available vring base offset.

``VHOST_USER_SET_VRING_KICK``
  :id: 12
  :equivalent ioctl: ``VHOST_SET_VRING_KICK``
  :master payload: ``u64``

  Set the event file descriptor for adding buffers to the vring. It is
  passed in the ancillary data.

  Bits (0-7) of the payload contain the vring index. Bit 8 is the
  invalid FD flag. This flag is set when there is no file descriptor
  in the ancillary data. This signals that polling should be used
  instead of waiting for the kick. Note that if the protocol feature
  ``VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS`` has been negotiated,
  this message isn't necessary as the ring is also started on the
  ``VHOST_USER_VRING_KICK`` message. It may however still be used to
  set an event file descriptor (which will be preferred over the
  message) or to enable polling.

``VHOST_USER_SET_VRING_CALL``
  :id: 13
  :equivalent ioctl: ``VHOST_SET_VRING_CALL``
  :master payload: ``u64``

  Set the event file descriptor to signal when buffers are used. It is
  passed in the ancillary data.

  Bits (0-7) of the payload contain the vring index. Bit 8 is the
  invalid FD flag. This flag is set when there is no file descriptor
  in the ancillary data. This signals that polling will be used
  instead of waiting for the call. Note that if the protocol features
  ``VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS`` and
  ``VHOST_USER_PROTOCOL_F_SLAVE_REQ`` have been negotiated, this message
  isn't necessary as the ``VHOST_USER_SLAVE_VRING_CALL`` message can be
  used. It may however still be used to set an event file descriptor
  or to enable polling.

``VHOST_USER_SET_VRING_ERR``
  :id: 14
  :equivalent ioctl: ``VHOST_SET_VRING_ERR``
  :master payload: ``u64``

  Set the event file descriptor to signal when an error occurs. It is
  passed in the ancillary data.

  Bits (0-7) of the payload contain the vring index. Bit 8 is the
  invalid FD flag. This flag is set when there is no file descriptor
  in the ancillary data. Note that if the protocol features
  ``VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS`` and
  ``VHOST_USER_PROTOCOL_F_SLAVE_REQ`` have been negotiated, this message
  isn't necessary as the ``VHOST_USER_SLAVE_VRING_ERR`` message can be
  used. It may however still be used to set an event file descriptor
  (which will be preferred over the message).

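The ``u64`` payload shared by ``VHOST_USER_SET_VRING_KICK``,
``VHOST_USER_SET_VRING_CALL`` and ``VHOST_USER_SET_VRING_ERR`` can be
packed and unpacked with simple masks. This is a sketch only: the
macro and function names below are hypothetical, and only the bit
layout (index in bits 0-7, invalid FD flag in bit 8) comes from the
text above:

.. code:: c

  #include <stdbool.h>
  #include <stdint.h>

  #define VRING_IDX_MASK   0xffULL      /* bits 0-7: vring index */
  #define VRING_NOFD_MASK  (1ULL << 8)  /* bit 8: invalid FD flag */

  /* Build the payload; the flag is set when no fd is attached. */
  static uint64_t vring_fd_payload(uint8_t index, bool has_fd)
  {
      return (uint64_t)index | (has_fd ? 0 : VRING_NOFD_MASK);
  }

  /* Extract the vring index from a received payload. */
  static unsigned int vring_payload_index(uint64_t payload)
  {
      return (unsigned int)(payload & VRING_IDX_MASK);
  }

  /* True if the sender attached a file descriptor. */
  static bool vring_payload_has_fd(uint64_t payload)
  {
      return !(payload & VRING_NOFD_MASK);
  }
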
``VHOST_USER_GET_QUEUE_NUM``
  :id: 17
  :equivalent ioctl: N/A
  :master payload: N/A
  :slave payload: ``u64``

  Query how many queues the backend supports.

  This request should be sent only when ``VHOST_USER_PROTOCOL_F_MQ``
  is set in queried protocol features by
  ``VHOST_USER_GET_PROTOCOL_FEATURES``.

``VHOST_USER_SET_VRING_ENABLE``
  :id: 18
  :equivalent ioctl: N/A
  :master payload: vring state description

  Signal the slave to enable or disable the corresponding vring.

  This request should be sent only when
  ``VHOST_USER_F_PROTOCOL_FEATURES`` has been negotiated.

``VHOST_USER_SEND_RARP``
  :id: 19
  :equivalent ioctl: N/A
  :master payload: ``u64``

  Ask the vhost-user backend to broadcast a fake RARP to notify that
  the migration has terminated, for guests that do not support
  GUEST_ANNOUNCE.

  Only legal if feature bit ``VHOST_USER_F_PROTOCOL_FEATURES`` is
  present in ``VHOST_USER_GET_FEATURES`` and protocol feature bit
  ``VHOST_USER_PROTOCOL_F_RARP`` is present in
  ``VHOST_USER_GET_PROTOCOL_FEATURES``.  The first 6 bytes of the
  payload contain the MAC address of the guest to allow the vhost-user
  backend to construct and broadcast the fake RARP.

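Because the MAC address occupies the first 6 bytes of the payload as
laid out in memory, a backend can copy it out bytewise rather than
shifting the ``u64``. A minimal sketch, in which the helper name
``rarp_payload_to_mac`` and the ``MAC_LEN`` macro are hypothetical:

.. code:: c

  #include <stdint.h>
  #include <string.h>

  #define MAC_LEN 6

  /* Copy the guest MAC address out of a VHOST_USER_SEND_RARP payload.
   * The bytes are taken in memory order, so the result does not depend
   * on the host byte order. */
  static void rarp_payload_to_mac(uint64_t payload, uint8_t mac[MAC_LEN])
  {
      memcpy(mac, &payload, MAC_LEN);
  }
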
``VHOST_USER_NET_SET_MTU``
  :id: 20
  :equivalent ioctl: N/A
  :master payload: ``u64``

  Set host MTU value exposed to the guest.

  This request should be sent only when ``VIRTIO_NET_F_MTU`` feature
  has been successfully negotiated, ``VHOST_USER_F_PROTOCOL_FEATURES``
  is present in ``VHOST_USER_GET_FEATURES`` and protocol feature bit
  ``VHOST_USER_PROTOCOL_F_NET_MTU`` is present in
  ``VHOST_USER_GET_PROTOCOL_FEATURES``.

  If ``VHOST_USER_PROTOCOL_F_REPLY_ACK`` is negotiated, slave must
  respond with zero in case the specified MTU is valid, or non-zero
  otherwise.

``VHOST_USER_SET_SLAVE_REQ_FD``
  :id: 21
  :equivalent ioctl: N/A
  :master payload: N/A

  Set the socket file descriptor for slave initiated requests. It is passed
  in the ancillary data.

  This request should be sent only when
  ``VHOST_USER_F_PROTOCOL_FEATURES`` has been negotiated, and protocol
  feature bit ``VHOST_USER_PROTOCOL_F_SLAVE_REQ`` is present in
  ``VHOST_USER_GET_PROTOCOL_FEATURES``.  If
  ``VHOST_USER_PROTOCOL_F_REPLY_ACK`` is negotiated, slave must
  respond with zero for success, non-zero otherwise.

``VHOST_USER_IOTLB_MSG``
  :id: 22
  :equivalent ioctl: N/A (equivalent to ``VHOST_IOTLB_MSG`` message type)
  :master payload: ``struct vhost_iotlb_msg``
  :slave payload: ``u64``

  Send IOTLB messages with ``struct vhost_iotlb_msg`` as payload.

  Master sends such requests to update and invalidate entries in the
  device IOTLB. The slave has to acknowledge the request by sending
  zero as ``u64`` payload for success, non-zero otherwise.

  This request should be sent only when ``VIRTIO_F_IOMMU_PLATFORM``
  feature has been successfully negotiated.

``VHOST_USER_SET_VRING_ENDIAN``
  :id: 23
  :equivalent ioctl: ``VHOST_SET_VRING_ENDIAN``
  :master payload: vring state description

  Set the endianness of a VQ for legacy devices. Little-endian is
  indicated with state.num set to 0 and big-endian is indicated with
  state.num set to 1. Other values are invalid.

  This request should be sent only when
  ``VHOST_USER_PROTOCOL_F_CROSS_ENDIAN`` has been negotiated.
  Backends that negotiated this feature should handle both
  endiannesses and expect this message once (per VQ) during device
  configuration (i.e. before the master starts the VQ).

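A backend might validate the ``state.num`` value as sketched below.
The struct is a local stand-in for the vring state description (a
32-bit index followed by a 32-bit num), and the function name is
hypothetical:

.. code:: c

  #include <stdbool.h>
  #include <stdint.h>

  /* Local stand-in for the vring state description. */
  struct vring_state {
      uint32_t index;
      uint32_t num;
  };

  /* Store the requested endianness in '*big_endian' and return 0,
   * or return -1 if state.num holds an invalid value. */
  static int vring_endian_from_state(const struct vring_state *state,
                                     bool *big_endian)
  {
      switch (state->num) {
      case 0:
          *big_endian = false;
          return 0;
      case 1:
          *big_endian = true;
          return 0;
      default:
          return -1; /* other values are invalid */
      }
  }
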
``VHOST_USER_GET_CONFIG``
  :id: 24
  :equivalent ioctl: N/A
  :master payload: virtio device config space
  :slave payload: virtio device config space

  When ``VHOST_USER_PROTOCOL_F_CONFIG`` is negotiated, this message is
  submitted by the vhost-user master to fetch the contents of the
  virtio device configuration space. The vhost-user slave's payload
  size MUST match the master's request; the slave uses a zero-length
  payload to indicate an error to the vhost-user master. The vhost-user
  master may cache the contents to avoid repeated
  ``VHOST_USER_GET_CONFIG`` calls.

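On the master side, the two size rules can be checked before the reply
is used. A minimal sketch, assuming the caller has already read the
payload sizes of the request and of the reply; the enum and function
names are hypothetical:

.. code:: c

  #include <stdint.h>

  enum config_reply {
      CONFIG_REPLY_OK,        /* sizes match: contents can be used */
      CONFIG_REPLY_ERROR,     /* zero-length payload: slave reported an error */
      CONFIG_REPLY_MALFORMED  /* non-zero size that differs from the request */
  };

  /* Classify a VHOST_USER_GET_CONFIG reply by its payload size. */
  static enum config_reply check_get_config_reply(uint32_t request_size,
                                                  uint32_t reply_size)
  {
      if (reply_size == 0) {
          return CONFIG_REPLY_ERROR;
      }
      if (reply_size != request_size) {
          return CONFIG_REPLY_MALFORMED;
      }
      return CONFIG_REPLY_OK;
  }
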
``VHOST_USER_SET_CONFIG``
  :id: 25
  :equivalent ioctl: N/A
  :master payload: virtio device config space
  :slave payload: N/A

  When ``VHOST_USER_PROTOCOL_F_CONFIG`` is negotiated, this message is
  submitted by the vhost-user master when the guest changes the virtio
  device configuration space. It can also be used for live migration
  on the destination host. The vhost-user slave must check the flags
  field, and slaves MUST NOT accept SET_CONFIG for read-only
  configuration space fields unless the live migration bit is set.

``VHOST_USER_CREATE_CRYPTO_SESSION``
  :id: 26
  :equivalent ioctl: N/A
  :master payload: crypto session description
  :slave payload: crypto session description

  Create a session for crypto operations. The server side must return
  the session id: 0 or positive for success, negative for failure.
  This request should be sent only when
  ``VHOST_USER_PROTOCOL_F_CRYPTO_SESSION`` feature has been
  successfully negotiated.  It's a required feature for crypto
  devices.

``VHOST_USER_CLOSE_CRYPTO_SESSION``
  :id: 27
  :equivalent ioctl: N/A
  :master payload: ``u64``

  Close a session for crypto operations which was previously
  created by ``VHOST_USER_CREATE_CRYPTO_SESSION``.

  This request should be sent only when
  ``VHOST_USER_PROTOCOL_F_CRYPTO_SESSION`` feature has been
  successfully negotiated.  It's a required feature for crypto
  devices.

``VHOST_USER_POSTCOPY_ADVISE``
  :id: 28
  :master payload: N/A
  :slave payload: userfault fd

  When ``VHOST_USER_PROTOCOL_F_PAGEFAULT`` is supported, the master
  advises the slave that a migration with postcopy enabled is underway;
  the slave must open a userfaultfd for later use.  Note that at this
  stage the migration is still in precopy mode.

``VHOST_USER_POSTCOPY_LISTEN``
  :id: 29
  :master payload: N/A

  Master advises slave that a transition to postcopy mode has
  happened.  The slave must ensure that shared memory is registered
  with userfaultfd to cause faulting of non-present pages.

  This is always sent sometime after a ``VHOST_USER_POSTCOPY_ADVISE``,
  and thus only when ``VHOST_USER_PROTOCOL_F_PAGEFAULT`` is supported.

``VHOST_USER_POSTCOPY_END``
  :id: 30
  :slave payload: ``u64``

  Master advises that postcopy migration has now completed.  The slave
  must disable the userfaultfd. The response is an acknowledgement
  only.

  When ``VHOST_USER_PROTOCOL_F_PAGEFAULT`` is supported, this message
  is sent at the end of the migration, after
  ``VHOST_USER_POSTCOPY_LISTEN`` was previously sent.

  The value returned is an error indication; 0 is success.

``VHOST_USER_GET_INFLIGHT_FD``
  :id: 31
  :equivalent ioctl: N/A
  :master payload: inflight description

  When ``VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD`` protocol feature has
  been successfully negotiated, this message is submitted by master to
  get a shared buffer from slave. The shared buffer will be used to
  track inflight I/O by slave. QEMU should retrieve a new one when the
  VM is reset.

``VHOST_USER_SET_INFLIGHT_FD``
  :id: 32
  :equivalent ioctl: N/A
  :master payload: inflight description

  When ``VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD`` protocol feature has
  been successfully negotiated, this message is submitted by master to
  send the shared inflight buffer back to slave so that slave can
  get inflight I/O after a crash or restart.

``VHOST_USER_GPU_SET_SOCKET``
  :id: 33
  :equivalent ioctl: N/A
  :master payload: N/A

  Sets the GPU protocol socket file descriptor, which is passed as
  ancillary data. The GPU protocol is used to inform the master of
  rendering state and updates. See vhost-user-gpu.rst for details.

``VHOST_USER_RESET_DEVICE``
  :id: 34
  :equivalent ioctl: N/A
  :master payload: N/A
  :slave payload: N/A

  Ask the vhost-user backend to disable all rings and reset all
  internal device state to the initial state, ready to be
  reinitialized. The backend retains ownership of the device
  throughout the reset operation.

  Only valid if the ``VHOST_USER_PROTOCOL_F_RESET_DEVICE`` protocol
  feature is set by the backend.

``VHOST_USER_VRING_KICK``
  :id: 35
  :equivalent ioctl: N/A
  :master payload: vring state description
  :slave payload: N/A

  When the ``VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS`` protocol
  feature has been successfully negotiated, this message may be
  submitted by the master to indicate that a buffer was added to
  the vring instead of signalling it using the vring's kick file
  descriptor or having the slave rely on polling.

  The state.num field is currently reserved and must be set to 0.

``VHOST_USER_GET_MAX_MEM_SLOTS``
  :id: 36
  :equivalent ioctl: N/A
  :slave payload: ``u64``

  When the ``VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS`` protocol
  feature has been successfully negotiated, this message is submitted
  by master to the slave. The slave should return the message with a
  ``u64`` payload containing the maximum number of memory slots for
  QEMU to expose to the guest. The value returned by the backend
  will be capped at the maximum number of RAM slots which can be
  supported by the target platform.

``VHOST_USER_ADD_MEM_REG``
  :id: 37
  :equivalent ioctl: N/A
  :master payload: single memory region description

  When the ``VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS`` protocol
  feature has been successfully negotiated, this message is submitted
  by the master to the slave. The message payload contains a memory
  region descriptor struct, describing a region of guest memory which
  the slave device must map in. Together with the
  ``VHOST_USER_REM_MEM_REG`` message, this message is used to set and
  update the memory tables of the slave device.

``VHOST_USER_REM_MEM_REG``
  :id: 38
  :equivalent ioctl: N/A
  :master payload: single memory region description

  When the ``VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS`` protocol
  feature has been successfully negotiated, this message is submitted
  by the master to the slave. The message payload contains a memory
  region descriptor struct, describing a region of guest memory which
  the slave device must unmap. Together with the
  ``VHOST_USER_ADD_MEM_REG`` message, this message is used to set and
  update the memory tables of the slave device.

``VHOST_USER_SET_STATUS``
  :id: 39
  :equivalent ioctl: ``VHOST_VDPA_SET_STATUS``
  :slave payload: N/A
  :master payload: ``u64``

  When the ``VHOST_USER_PROTOCOL_F_STATUS`` protocol feature has been
  successfully negotiated, this message is submitted by the master to
  notify the backend with updated device status as defined in the Virtio
  specification.

``VHOST_USER_GET_STATUS``
  :id: 40
  :equivalent ioctl: ``VHOST_VDPA_GET_STATUS``
  :slave payload: ``u64``
  :master payload: N/A

  When the ``VHOST_USER_PROTOCOL_F_STATUS`` protocol feature has been
  successfully negotiated, this message is submitted by the master to
  query the backend for its device status as defined in the Virtio
  specification.


Slave message types
-------------------

``VHOST_USER_SLAVE_IOTLB_MSG``
  :id: 1
  :equivalent ioctl: N/A (equivalent to ``VHOST_IOTLB_MSG`` message type)
  :slave payload: ``struct vhost_iotlb_msg``
  :master payload: N/A

  Send IOTLB messages with ``struct vhost_iotlb_msg`` as payload.
  Slave sends such requests to notify of an IOTLB miss, or an IOTLB
  access failure. If ``VHOST_USER_PROTOCOL_F_REPLY_ACK`` is
  negotiated, and slave set the ``VHOST_USER_NEED_REPLY`` flag, master
  must respond with zero when operation is successfully completed, or
  non-zero otherwise.  This request should be sent only when
  ``VIRTIO_F_IOMMU_PLATFORM`` feature has been successfully
  negotiated.

``VHOST_USER_SLAVE_CONFIG_CHANGE_MSG``
  :id: 2
  :equivalent ioctl: N/A
  :slave payload: N/A
  :master payload: N/A

  When ``VHOST_USER_PROTOCOL_F_CONFIG`` is negotiated, the vhost-user
  slave sends such messages to notify that the virtio device's
  configuration space has changed. For host devices that support this
  feature, the host driver can send a ``VHOST_USER_GET_CONFIG``
  message to the slave to get the latest content. If
  ``VHOST_USER_PROTOCOL_F_REPLY_ACK`` is negotiated, and slave set the
  ``VHOST_USER_NEED_REPLY`` flag, master must respond with zero when
  operation is successfully completed, or non-zero otherwise.

``VHOST_USER_SLAVE_VRING_HOST_NOTIFIER_MSG``
  :id: 3
  :equivalent ioctl: N/A
  :slave payload: vring area description
  :master payload: N/A

  Sets host notifier for a specified queue. The queue index is
  contained in the ``u64`` field of the vring area description. The
  host notifier is described by the file descriptor (typically it's a
  VFIO device fd) which is passed as ancillary data and the size
  (which is mmap size and should be the same as host page size) and
  offset (which is mmap offset) carried in the vring area
  description. QEMU can mmap the file descriptor based on the size and
  offset to get a memory range. Registering a host notifier means
  mapping this memory range to the VM as the specified queue's notify
  MMIO region. Slave sends this request to tell QEMU to de-register
  the existing notifier if any and register the new notifier if the
  request is sent with a file descriptor.

  This request should be sent only when
  ``VHOST_USER_PROTOCOL_F_HOST_NOTIFIER`` protocol feature has been
  successfully negotiated.

``VHOST_USER_SLAVE_VRING_CALL``
  :id: 4
  :equivalent ioctl: N/A
  :slave payload: vring state description
  :master payload: N/A

  When the ``VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS`` protocol
  feature has been successfully negotiated, this message may be
  submitted by the slave to indicate that a buffer was used from
  the vring instead of signalling this using the vring's call file
  descriptor or having the master rely on polling.

  The state.num field is currently reserved and must be set to 0.

``VHOST_USER_SLAVE_VRING_ERR``
  :id: 5
  :equivalent ioctl: N/A
  :slave payload: vring state description
  :master payload: N/A

  When the ``VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS`` protocol
  feature has been successfully negotiated, this message may be
  submitted by the slave to indicate that an error occurred on the
  specific vring, instead of signalling the error file descriptor
  set by the master via ``VHOST_USER_SET_VRING_ERR``.

  The state.num field is currently reserved and must be set to 0.

.. _reply_ack:

VHOST_USER_PROTOCOL_F_REPLY_ACK
-------------------------------

The original vhost-user specification only demands replies for certain
commands. This differs from the vhost protocol implementation where
commands are sent over an ``ioctl()`` call and block until the client
has completed.

With this protocol extension negotiated, the sender (QEMU) can set the
``need_reply`` [Bit 3] flag on any command. This indicates that the
client MUST respond with a Payload ``VhostUserMsg`` indicating success
or failure. The payload should be set to zero on success or non-zero
on failure, unless the message already has an explicit reply body.

The response payload gives QEMU a deterministic indication of the result
of the command. Today, QEMU is expected to terminate the main vhost-user
loop upon receiving such errors. In future, QEMU could be taught to be more
resilient for selective requests.

For the message types that already solicit a reply from the client,
the presence of ``VHOST_USER_PROTOCOL_F_REPLY_ACK`` or need_reply bit
being set brings no behavioural change. (See the Communication_
section for details.)

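As a small sketch of the flag handling described in this section, the
sender sets bit 3 of the header ``flags`` field and then treats a zero
``u64`` reply payload as success. The macro and function names are
hypothetical; only the bit position and the zero/non-zero convention
come from the text:

.. code:: c

  #include <stdbool.h>
  #include <stdint.h>

  #define NEED_REPLY_MASK (1u << 3) /* need_reply flag: bit 3 of 'flags' */

  /* Return 'flags' with the need_reply bit set. */
  static uint32_t request_ack(uint32_t flags)
  {
      return flags | NEED_REPLY_MASK;
  }

  /* Interpret the u64 payload of an acknowledgement: zero is success. */
  static bool ack_ok(uint64_t payload)
  {
      return payload == 0;
  }
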
.. _backend_conventions:

Backend program conventions
===========================

vhost-user backends can provide various devices & services and may
need to be configured manually depending on the use case. However, it
is a good idea to follow the conventions listed here when
possible. Users, QEMU or libvirt, can then rely on some common
behaviour to avoid heterogeneous configuration and management of the
backend programs and facilitate interoperability.

Each backend installed on a host system should come with at least one
JSON file that conforms to the vhost-user.json schema. Each file
informs the management applications about the backend type, and binary
location. In addition, it defines rules for management apps for
picking the highest priority backend when multiple match the search
criteria (see ``@VhostUserBackend`` documentation in the schema file).

If the backend is not capable of enabling a requested feature on the
host (such as 3D acceleration with virgl), or the initialization
failed, the backend should fail to start early and exit with a status
!= 0. It may also print a message to stderr for further details.

The backend program must not daemonize itself, but it may be
daemonized by the management layer. It may also have restricted
access to the system.

File descriptors 0, 1 and 2 will exist, and have regular
stdin/stdout/stderr usage (they may have been redirected to /dev/null
by the management layer, or to a log handler).

The backend program must end (as quickly and cleanly as possible) when
the SIGTERM signal is received. It may eventually receive SIGKILL from
the management layer after a few seconds.

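One common way to meet the SIGTERM requirement is to have the signal
handler set a flag that the main loop checks before cleaning up and
exiting. The following is a sketch under that assumption, not a
required design; all names are hypothetical:

.. code:: c

  #define _POSIX_C_SOURCE 200809L
  #include <signal.h>
  #include <string.h>

  /* Set by the handler, polled by the backend's main loop. */
  static volatile sig_atomic_t terminate_requested;

  static void on_sigterm(int signo)
  {
      (void)signo;
      terminate_requested = 1; /* only set a flag; clean up in the main loop */
  }

  /* Install the SIGTERM handler; returns 0 on success, -1 on error. */
  static int install_sigterm_handler(void)
  {
      struct sigaction sa;

      memset(&sa, 0, sizeof(sa));
      sa.sa_handler = on_sigterm;
      sigemptyset(&sa.sa_mask);
      return sigaction(SIGTERM, &sa, NULL);
  }

The main loop would then run while ``terminate_requested`` is zero and
release its resources promptly once the flag is set.
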
The following command line options have an expected behaviour. They
are mandatory, unless explicitly said differently:

--socket-path=PATH

  This option specifies the location of the vhost-user Unix domain socket.
  It is incompatible with --fd.

--fd=FDNUM

  When this argument is given, the backend program is started with the
  vhost-user socket as file descriptor FDNUM. It is incompatible with
  --socket-path.

--print-capabilities

  Output to stdout the backend capabilities in JSON format, and then
  exit successfully. Other options and arguments should be ignored, and
  the backend program should not perform its normal function.  The
  capabilities can be reported dynamically depending on the host
  capabilities.

The JSON output is described in the ``vhost-user.json`` schema, by
``@VHostUserBackendCapabilities``.  Example:

.. code:: json

  {
    "type": "foo",
    "features": [
      "feature-a",
      "feature-b"
    ]
  }

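The three common options could be parsed with ``getopt_long()`` as
sketched below. This is an illustration only: the struct and function
names are hypothetical, error reporting is left to ``getopt_long()``,
and the sketch rejects unknown options outright, whereas a backend
given ``--print-capabilities`` should ignore them:

.. code:: c

  #include <getopt.h>
  #include <stdbool.h>
  #include <stdlib.h>

  struct backend_opts {
      const char *socket_path;  /* --socket-path=PATH, or NULL */
      int fd;                   /* --fd=FDNUM, or -1 */
      bool print_capabilities;  /* --print-capabilities */
  };

  /* Returns 0 on success, -1 on an unknown option or a conflicting pair. */
  static int parse_backend_opts(int argc, char **argv,
                                struct backend_opts *opts)
  {
      static const struct option longopts[] = {
          { "socket-path",        required_argument, NULL, 's' },
          { "fd",                 required_argument, NULL, 'f' },
          { "print-capabilities", no_argument,       NULL, 'c' },
          { NULL, 0, NULL, 0 }
      };
      int c;

      opts->socket_path = NULL;
      opts->fd = -1;
      opts->print_capabilities = false;
      optind = 1; /* restart scanning so the function can be called again */

      while ((c = getopt_long(argc, argv, "", longopts, NULL)) != -1) {
          switch (c) {
          case 's':
              opts->socket_path = optarg;
              break;
          case 'f':
              opts->fd = atoi(optarg);
              break;
          case 'c':
              opts->print_capabilities = true;
              break;
          default:
              return -1;
          }
      }
      if (opts->print_capabilities) {
          return 0; /* the conflict check is skipped in this mode */
      }
      if (opts->socket_path != NULL && opts->fd >= 0) {
          return -1; /* --socket-path and --fd are mutually exclusive */
      }
      return 0;
  }
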
vhost-user-input
----------------

Command line options:

--evdev-path=PATH

  Specify the Linux input device.

  (optional)

--no-grab

  Do not request exclusive access to the input device.

  (optional)

vhost-user-gpu
--------------

Command line options:

--render-node=PATH

  Specify the GPU DRM render node.

  (optional)

--virgl

  Enable virgl rendering support.

  (optional)

vhost-user-blk
--------------

Command line options:

--blk-file=PATH

  Specify block device or file path.

  (optional)

--read-only

  Enable read-only.

  (optional)