.. include:: <isonum.txt>
.. SPDX-License-Identifier: GPL-2.0-or-later

================================
vfio-user Protocol Specification
================================

.. contents:: Table of Contents

Introduction
============
vfio-user is a protocol that allows a device to be emulated in a separate
process outside of a Virtual Machine Monitor (VMM). vfio-user devices consist
of a generic VFIO device type, living inside the VMM, which we call the client,
and the core device implementation, living outside the VMM, which we call the
server.

The vfio-user specification is partly based on the
`Linux VFIO ioctl interface <https://www.kernel.org/doc/html/latest/driver-api/vfio.html>`_.

VFIO is a mature and stable API, backed by an extensively used framework. The
existing VFIO client implementation in QEMU (``qemu/hw/vfio/``) can be largely
re-used, though there is nothing in this specification that requires that
particular implementation. None of the VFIO kernel modules are required for
supporting the protocol, on either the client or server side. Some source
definitions in VFIO are re-used for vfio-user.

The main idea is to allow a virtual device to function in a separate process in
the same host over a UNIX domain socket. A UNIX domain socket (``AF_UNIX``) is
chosen because file descriptors can be trivially sent over it, which in turn
allows:

* Sharing of client memory for DMA with the server.
* Sharing of server memory with the client for fast MMIO.
* Efficient sharing of eventfds for triggering interrupts.

Other socket types could be used which allow the server to run in a separate
guest in the same host (``AF_VSOCK``) or remotely (``AF_INET``). Theoretically
the underlying transport does not necessarily have to be a socket; however, we
do not examine such alternatives. In this protocol version we focus on using a
UNIX domain socket and introduce basic support for the other two types of
sockets without considering performance implications.

While passing of file descriptors is desirable for performance reasons, support
is not necessary for either the client or the server in order to implement the
protocol. There is always an in-band, message-passing fallback mechanism.

Overview
========

VFIO is a framework that allows a physical device to be securely passed through
to a user space process; the device-specific kernel driver does not drive the
device at all.  Typically, the user space process is a VMM and the device is
passed through to it in order to achieve high performance. VFIO provides an API
and the required functionality in the kernel. QEMU has adopted VFIO to allow a
guest to directly access physical devices, instead of emulating them in
software.

vfio-user reuses the core VFIO concepts defined in its API, but implements them
as messages to be sent over a socket. It does not change the kernel-based VFIO
in any way; in fact, none of the VFIO kernel modules need to be loaded to use
vfio-user. It is also possible for the client to concurrently use the current
kernel-based VFIO for one device, and vfio-user for another device.

VFIO Device Model
-----------------

A device under VFIO presents a standard interface to the user process. Many of
the VFIO operations in the existing interface use the ``ioctl()`` system call, and
references to the existing interface are called the ``ioctl()`` implementation in
this document.

The following sections describe the set of messages that implement the vfio-user
interface over a socket. In many cases, the messages are analogous to data
structures used in the ``ioctl()`` implementation. Messages derived from the
``ioctl()`` will have a name derived from the ``ioctl()`` command name.  E.g., the
``VFIO_DEVICE_GET_INFO`` ``ioctl()`` command becomes a
``VFIO_USER_DEVICE_GET_INFO`` message.  The purpose of this reuse is to share as
much code as feasible with the ``ioctl()`` implementation.

Connection Initiation
^^^^^^^^^^^^^^^^^^^^^

After the client connects to the server, the initial client message is
``VFIO_USER_VERSION`` to propose a protocol version and set of capabilities to
apply to the session. The server replies with a compatible version and set of
capabilities it supports, or closes the connection if it cannot support the
advertised version.

Device Information
^^^^^^^^^^^^^^^^^^

The client uses a ``VFIO_USER_DEVICE_GET_INFO`` message to query the server for
information about the device. This information includes:

* The device type and whether it supports reset (``VFIO_DEVICE_FLAGS_``),
* the number of device regions, and
* the number of interrupt types the device supports.

Region Information
^^^^^^^^^^^^^^^^^^

The client uses ``VFIO_USER_DEVICE_GET_REGION_INFO`` messages to query the
server for information about the device's regions. This information describes:

* Read and write permissions, whether it can be memory mapped, and whether it
  supports additional capabilities (``VFIO_REGION_INFO_CAP_``).
* Region index, size, and offset.

When a device region can be mapped by the client, the server provides a file
descriptor which the client can ``mmap()``. The server is responsible for
polling for client updates to memory mapped regions.

Region Capabilities
"""""""""""""""""""

Some regions have additional capabilities that cannot be described adequately
by the region info data structure. These capabilities are returned in the
region info reply in a list similar to PCI capabilities in a PCI device's
configuration space.

Sparse Regions
""""""""""""""
A region can be memory-mappable in whole or in part. When only a subset of a
region can be mapped by the client, a ``VFIO_REGION_INFO_CAP_SPARSE_MMAP``
capability is included in the region info reply. This capability describes
which portions can be mapped by the client.

.. Note::
   For example, in a virtual NVMe controller, sparse regions can be used so
   that accesses to the NVMe registers (found in the beginning of BAR0) are
   trapped (an infrequent event), while allowing direct access to the doorbells
   (an extremely frequent event as every I/O submission requires a write to
   BAR0), found in the next page after the NVMe registers in BAR0.
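
As a rough illustration of how a client might use sparse mmap information, the
sketch below decides whether an access can go through a mapping or has to be
trapped and forwarded as a message. The ``(offset, size)`` list, the
``is_mappable`` helper, and the 4 KiB page layout are assumptions made for this
example; they are not defined by the protocol.

```python
from typing import List, Tuple

# Hypothetical in-memory form of a sparse mmap capability:
# a list of (offset, size) areas within the region that may be mapped.
SparseAreas = List[Tuple[int, int]]


def is_mappable(areas: SparseAreas, offset: int, length: int) -> bool:
    """Return True if [offset, offset + length) lies inside one mappable area."""
    return any(start <= offset and offset + length <= start + size
               for start, size in areas)


# Assumed layout mirroring the NVMe note: registers in the first page
# (trapped), doorbells in the following page (mappable).
PAGE = 4096
nvme_bar0_areas = [(PAGE, PAGE)]
```

An access for which ``is_mappable`` returns ``False`` would be forwarded to the
server instead of touching mapped memory.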

Device-Specific Regions
"""""""""""""""""""""""

A device can define regions additional to the standard ones (e.g. PCI indexes
0-8). This is achieved by including a ``VFIO_REGION_INFO_CAP_TYPE`` capability
in the region info reply of a device-specific region. Such regions are reflected
in ``struct vfio_user_device_info.num_regions``. Thus, for PCI devices this
value can be equal to, or higher than, ``VFIO_PCI_NUM_REGIONS``.

Region I/O via file descriptors
-------------------------------

For unmapped regions, region I/O from the client is done via
``VFIO_USER_REGION_READ/WRITE``.  As an optimization, ioeventfds or ioregionfds
may be configured for sub-regions of some regions. A client may request
information on these sub-regions via ``VFIO_USER_DEVICE_GET_REGION_IO_FDS``; by
configuring the returned file descriptors as ioeventfds or ioregionfds, the
server can be directly notified of I/O (for example, by KVM) without taking a
trip through the client.

Interrupts
^^^^^^^^^^

The client uses ``VFIO_USER_DEVICE_GET_IRQ_INFO`` messages to query the server
for the device's interrupt types. The interrupt types are specific to the bus
the device is attached to, and the client is expected to know the capabilities
of each interrupt type. The server can signal an interrupt by directly injecting
interrupts into the guest via an event file descriptor. The client configures
how the server signals an interrupt with ``VFIO_USER_DEVICE_SET_IRQS`` messages.

Device Read and Write
^^^^^^^^^^^^^^^^^^^^^

When the guest executes load or store operations to an unmapped device region,
the client forwards these operations to the server with
``VFIO_USER_REGION_READ`` or ``VFIO_USER_REGION_WRITE`` messages. The server
will reply with data from the device on read operations or an acknowledgement on
write operations. See `Read and Write Operations`_.

Client memory access
--------------------

The client uses ``VFIO_USER_DMA_MAP`` and ``VFIO_USER_DMA_UNMAP`` messages to
inform the server of the valid DMA ranges that the server can access on behalf
of a device (typically, VM guest memory). DMA memory may be accessed by the
server via ``VFIO_USER_DMA_READ`` and ``VFIO_USER_DMA_WRITE`` messages over the
socket. In this case, the "DMA" part of the naming is a misnomer.

Actual direct memory access of client memory from the server is possible if the
client provides file descriptors the server can ``mmap()``. Note that ``mmap()``
privileges cannot be revoked by the client, therefore file descriptors should
only be exported in environments where the client trusts the server not to
corrupt guest memory.

See `Read and Write Operations`_.

Client/server interactions
==========================

Socket
------

A server can serve:

1) one or more clients, and/or
2) one or more virtual devices, belonging to one or more clients.

The current protocol specification requires a dedicated socket per
client/server connection. It is a server-side implementation detail whether a
single server handles multiple virtual devices from the same or multiple
clients. The location of the socket is implementation-specific. Multiplexing
clients, devices, and servers over the same socket is not supported in this
version of the protocol.

Authentication
--------------

For ``AF_UNIX``, we rely on OS mandatory access controls on the socket files,
therefore it is up to the management layer to set up the socket as required.
Socket types that span guests or hosts will require a proper authentication
mechanism. Defining that mechanism is deferred to a future version of the
protocol.

Command Concurrency
-------------------

A client may pipeline multiple commands without waiting for previous command
replies.  The server will process commands in the order they are received.  A
consequence of this is that if a client issues a command with the *No_reply*
bit, then subsequently issues a command without *No_reply*, the older command
will have been processed before the reply to the younger command is sent by the
server.  The client must be aware of the device's capability to process
concurrent commands if pipelining is used.  For example, pipelining allows
multiple client threads to concurrently access device regions; the client must
ensure these accesses obey device semantics.

An example is a frame buffer device, where the device may allow concurrent
access to different areas of video memory, but may have indeterminate behavior
if concurrent accesses are performed to command or status registers.

Note that unrelated messages sent from the server to the client can appear in
between a client to server request/reply and vice versa.

Implementers should be prepared for certain commands to exhibit potentially
unbounded latencies.  For example, ``VFIO_USER_DEVICE_RESET`` may take an
arbitrarily long time to complete; clients should take care not to block
unnecessarily.

Socket Disconnection Behavior
-----------------------------
The server and the client can disconnect from each other, either intentionally
or unexpectedly. Both the client and the server need to know how to handle such
events.

Server Disconnection
^^^^^^^^^^^^^^^^^^^^
A server disconnecting from the client may indicate that:

1) A virtual device has been restarted, either intentionally (e.g. because of a
   device update) or unintentionally (e.g. because of a crash).
2) A virtual device has been shut down with no intention to be restarted.

It is impossible for the client to know whether or not a failure is
intermittent or innocuous and should be retried, therefore the client should
reset the VFIO device when it detects the socket has been disconnected.
Error recovery will be driven by the guest's device error handling
behavior.

Client Disconnection
^^^^^^^^^^^^^^^^^^^^
The client disconnecting from the server primarily means that the client
has exited. Currently, this means that the guest is shut down, so the device is
no longer needed and the server can automatically exit. However, there
can be cases where a client disconnection should not result in a server exit:

1) A single server serving multiple clients.
2) A multi-process QEMU upgrading itself step by step, which is not yet
   implemented.

Therefore in order for the protocol to be forward compatible, the server should
respond to a client disconnection as follows:

 - all client memory regions are unmapped and cleaned up (including closing any
   passed file descriptors)
 - all IRQ file descriptors passed from the old client are closed
 - the device state should otherwise be retained

The expectation is that when a client reconnects, it will re-establish IRQ and
client memory mappings.

If the client is gone for good (for example, QEMU really did exit), the control
stack will know about it and can clean up resources accordingly.
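
The server-side steps listed above for a client disconnection could be sketched
as follows. The ``ServerSession`` class and its attributes are hypothetical
names chosen for this example; a real server would also have to ``munmap()``
any mappings it created from the DMA file descriptors.

```python
import os


class ServerSession:
    """Hypothetical per-client state held by a vfio-user server."""

    def __init__(self):
        self.dma_fds = {}       # DMA address -> fd received with a DMA map
        self.irq_fds = []       # event fds received from the client
        self.device_state = {}  # emulated device state (registers, queues)

    def on_client_disconnect(self):
        # Clean up all client memory regions, closing any passed fds.
        for fd in self.dma_fds.values():
            os.close(fd)
        self.dma_fds.clear()
        # Close all IRQ file descriptors passed from the old client.
        for fd in self.irq_fds:
            os.close(fd)
        self.irq_fds.clear()
        # self.device_state is deliberately left untouched so that a
        # reconnecting client finds the device as it was.
```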

Security Considerations
-----------------------

Speaking generally, vfio-user clients should not trust servers, and vice versa.
Standard tools and mechanisms should be used on both sides to validate input and
protect against denial of service scenarios, buffer overflow, etc.

Request Retry and Response Timeout
----------------------------------
A failed command is a command that has been successfully sent and has been
responded to with an error code. Failure to send the command in the first place
(e.g. because the socket is disconnected) is a different type of error, examined
earlier in `Socket Disconnection Behavior`_.

.. Note::
   QEMU's VFIO retries certain operations if they fail. While this makes sense
   for real HW, we don't know for sure whether it makes sense for virtual
   devices.

Defining a retry and timeout scheme is deferred to a future version of the
protocol.

Message sizes
-------------

Some requests have an ``argsz`` field. In a request, it defines the maximum
expected reply payload size, which should be at least the size of the fixed
reply payload headers defined here. The *request* payload size is defined by the
usual ``msg_size`` field in the header, not the ``argsz`` field.

In a reply, the server sets the ``argsz`` field to the size needed for the full
payload. This may be less than the requested maximum size. It may also be
larger than the requested maximum size: in that case, the full payload is not
included in the reply, but the ``argsz`` field in the reply indicates the needed
size, allowing a client to allocate a larger buffer for holding the reply before
trying again.
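
The retry behaviour described above might look like the following on the client
side. This is only a sketch: ``send_request`` is a hypothetical callback that
sends a request with the given ``argsz`` and returns the reply payload, and the
code assumes ``argsz`` is the first 32-bit field of that payload, as in the
structures shown in this document.

```python
import struct


def query_with_argsz(send_request, initial_argsz):
    """Issue a request and retry once with a larger buffer if the reply's
    argsz reports that the full payload did not fit."""
    argsz = initial_argsz
    for _ in range(2):
        reply = send_request(argsz)          # must hold at least 4 bytes
        (needed,) = struct.unpack_from("=I", reply, 0)
        if needed <= argsz:
            return reply[:needed]            # full payload was returned
        argsz = needed                       # allocate more and try again
    raise RuntimeError("server kept asking for a larger argsz")
```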

In addition, during negotiation (see `Version`_), the client and server may
each specify a ``max_data_xfer_size`` value; this defines the maximum data that
may be read or written via one of the ``VFIO_USER_DMA/REGION_READ/WRITE``
messages; see `Read and Write Operations`_.

Protocol Specification
======================

To distinguish from the base VFIO symbols, all vfio-user symbols are prefixed
with ``vfio_user`` or ``VFIO_USER``. In this revision, all data is in the
endianness of the host system, although this may be relaxed in future
revisions in cases where the client and server run on different hosts
with different endianness.

Unless otherwise specified, all sizes should be presumed to be in bytes.

.. _Commands:

Commands
--------
The following table lists the vfio-user message command IDs, and whether the
message command is sent from the client or the server.

======================================  =========  =================
Name                                    Command    Request Direction
======================================  =========  =================
``VFIO_USER_VERSION``                   1          client -> server
``VFIO_USER_DMA_MAP``                   2          client -> server
``VFIO_USER_DMA_UNMAP``                 3          client -> server
``VFIO_USER_DEVICE_GET_INFO``           4          client -> server
``VFIO_USER_DEVICE_GET_REGION_INFO``    5          client -> server
``VFIO_USER_DEVICE_GET_REGION_IO_FDS``  6          client -> server
``VFIO_USER_DEVICE_GET_IRQ_INFO``       7          client -> server
``VFIO_USER_DEVICE_SET_IRQS``           8          client -> server
``VFIO_USER_REGION_READ``               9          client -> server
``VFIO_USER_REGION_WRITE``              10         client -> server
``VFIO_USER_DMA_READ``                  11         server -> client
``VFIO_USER_DMA_WRITE``                 12         server -> client
``VFIO_USER_DEVICE_RESET``              13         client -> server
``VFIO_USER_REGION_WRITE_MULTI``        15         client -> server
======================================  =========  =================
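
For illustration, the command table above can be transcribed into an
enumeration. The class and helper names below are choices made for this
example only; the numeric values are copied from the table, which does not
define a command 14.

```python
from enum import IntEnum


class VfioUserCommand(IntEnum):
    VERSION = 1
    DMA_MAP = 2
    DMA_UNMAP = 3
    DEVICE_GET_INFO = 4
    DEVICE_GET_REGION_INFO = 5
    DEVICE_GET_REGION_IO_FDS = 6
    DEVICE_GET_IRQ_INFO = 7
    DEVICE_SET_IRQS = 8
    REGION_READ = 9
    REGION_WRITE = 10
    DMA_READ = 11
    DMA_WRITE = 12
    DEVICE_RESET = 13
    REGION_WRITE_MULTI = 15


# The only requests that originate from the server.
SERVER_TO_CLIENT = {VfioUserCommand.DMA_READ, VfioUserCommand.DMA_WRITE}


def request_direction(cmd: VfioUserCommand) -> str:
    """Return the request direction listed in the table for a command."""
    return "server -> client" if cmd in SERVER_TO_CLIENT else "client -> server"
```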

Header
------

All messages, both command messages and reply messages, are preceded by a
16-byte header that contains basic information about the message. The header is
followed by message-specific data described in the sections below.

+----------------+--------+-------------+
| Name           | Offset | Size        |
+================+========+=============+
| Message ID     | 0      | 2           |
+----------------+--------+-------------+
| Command        | 2      | 2           |
+----------------+--------+-------------+
| Message size   | 4      | 4           |
+----------------+--------+-------------+
| Flags          | 8      | 4           |
+----------------+--------+-------------+
|                | +-----+------------+ |
|                | | Bit | Definition | |
|                | +=====+============+ |
|                | | 0-3 | Type       | |
|                | +-----+------------+ |
|                | | 4   | No_reply   | |
|                | +-----+------------+ |
|                | | 5   | Error      | |
|                | +-----+------------+ |
+----------------+--------+-------------+
| Error          | 12     | 4           |
+----------------+--------+-------------+
| <message data> | 16     | variable    |
+----------------+--------+-------------+

* *Message ID* identifies the message, and is echoed in the command's reply
  message. Message IDs belong entirely to the sender, can be re-used (even
  concurrently) and the receiver must not make any assumptions about their
  uniqueness.
* *Command* specifies the command to be executed, listed in Commands_. It is
  also set in the reply header.
* *Message size* contains the size of the entire message, including the header.
* *Flags* contains attributes of the message:

  * The *Type* bits indicate the message type.

    *  *Command* (value 0x0) indicates a command message.
    *  *Reply* (value 0x1) indicates a reply message acknowledging a previous
       command with the same message ID.

  * *No_reply* in a command message indicates that no reply is needed for this
    command.  This is commonly used when multiple commands are sent, and only
    the last needs acknowledgement.
  * *Error* in a reply message indicates the command being acknowledged had
    an error. In this case, the *Error* field will be valid.

* *Error* in a reply message is an optional UNIX errno value. It may be zero
  even if the Error bit is set in Flags. It is reserved in a command message.

Each command message in Commands_ must be replied to with a reply message,
unless the message sets the *No_reply* bit.  The reply consists of the header
with the *Reply* bit set, plus any additional data.

If an error occurs, the reply message must only include the reply header.

As the header is standard in both requests and replies, it is not included in
the command-specific specifications below; each message definition should be
appended to the standard header, and the offsets are given from the end of the
standard header.
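
A minimal sketch of packing and unpacking the 16-byte header with Python's
``struct`` module is shown below. The helper and constant names are invented
for this example; the ``=`` prefix selects host byte order without padding,
matching the host-endianness rule of this revision.

```python
import struct

# Message ID (2), Command (2), Message size (4), Flags (4), Error (4)
HEADER = struct.Struct("=HHIII")

TYPE_MASK = 0xF          # Flags bits 0-3
TYPE_COMMAND = 0x0
TYPE_REPLY = 0x1
FLAG_NO_REPLY = 1 << 4   # Flags bit 4
FLAG_ERROR = 1 << 5      # Flags bit 5


def pack_message(msg_id, command, payload=b"", flags=TYPE_COMMAND, error=0):
    """Prefix a payload with the header; Message size includes the header."""
    size = HEADER.size + len(payload)
    return HEADER.pack(msg_id, command, size, flags, error) + payload


def unpack_header(data):
    """Decode the header fields found at the start of a received message."""
    msg_id, command, size, flags, error = HEADER.unpack_from(data)
    return {
        "msg_id": msg_id,
        "command": command,
        "size": size,
        "type": flags & TYPE_MASK,
        "no_reply": bool(flags & FLAG_NO_REPLY),
        "is_error": bool(flags & FLAG_ERROR),
        "error": error,
    }
```

An error reply carries only the header, so it could be built with
``pack_message(msg_id, command, flags=TYPE_REPLY | FLAG_ERROR, error=errno_value)``.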

``VFIO_USER_VERSION``
---------------------

.. _Version:

This is the initial message sent by the client after the socket connection is
established; the same format is used for the server's reply.

Upon establishing a connection, the client must send a ``VFIO_USER_VERSION``
message proposing a protocol version and a set of capabilities. The server
compares these with the versions and capabilities it supports and sends a
``VFIO_USER_VERSION`` reply according to the following rules.

* The major version in the reply must be the same as proposed. If the server
  does not support the proposed major version, it closes the connection.
* The minor version in the reply must be equal to or less than the minor
  version proposed.
* The capability list must be a subset of those proposed. If the server
  requires a capability the client did not include, it closes the connection.

The protocol major version will only change when incompatible protocol changes
are made, such as changing the message format. The minor version may change
when compatible changes are made, such as adding new messages or capabilities.
Both the client and server must support all minor versions less than the
maximum minor version they support. E.g., an implementation that supports
version 1.3 must also support 1.0 through 1.2.
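
The negotiation rules above might be sketched on the server side as follows.
The function name, its parameters, and the modelling of capabilities as plain
sets of names are assumptions made for this example; how individual capability
values are reconciled is not covered here.

```python
def negotiate(server_major, server_max_minor, server_caps, required_caps,
              client_major, client_minor, client_caps):
    """Return (major, minor, caps) for the reply, or None if the server
    should close the connection instead of replying."""
    # The major version in the reply must be the same as proposed.
    if client_major != server_major:
        return None
    # A capability the server requires must have been proposed by the client.
    if not required_caps <= client_caps:
        return None
    # The minor version must be equal to or less than the one proposed.
    minor = min(client_minor, server_max_minor)
    # The capability list must be a subset of those proposed.
    caps = client_caps & server_caps
    return client_major, minor, caps
```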

When making a change to this specification, the protocol version number must
be included in the form "added in version X.Y".

Request
^^^^^^^

==============  ======  ====
Name            Offset  Size
==============  ======  ====
version major   0       2
version minor   2       2
version data    4       variable (including terminating NUL). Optional.
==============  ======  ====

The version data is an optional UTF-8 encoded JSON byte array with the following
format:

+--------------+--------+-----------------------------------+
| Name         | Type   | Description                       |
+==============+========+===================================+
| capabilities | object | Contains common capabilities that |
|              |        | the sender supports. Optional.    |
+--------------+--------+-----------------------------------+

Capabilities:

+--------------------+---------+------------------------------------------------+
| Name               | Type    | Description                                    |
+====================+=========+================================================+
| max_msg_fds        | number  | Maximum number of file descriptors that can be |
|                    |         | received by the sender in one message.         |
|                    |         | Optional. If not specified then the receiver   |
|                    |         | must assume a value of ``1``.                  |
+--------------------+---------+------------------------------------------------+
| max_data_xfer_size | number  | Maximum ``count`` for data transfer messages;  |
|                    |         | see `Read and Write Operations`_. Optional,    |
|                    |         | with a default value of 1048576 bytes.         |
+--------------------+---------+------------------------------------------------+
| pgsizes            | number  | Page sizes supported in DMA map operations     |
|                    |         | or'ed together. Optional, with a default value |
|                    |         | of supporting only 4k pages.                   |
+--------------------+---------+------------------------------------------------+
| max_dma_maps       | number  | Maximum number of DMA map windows that can be  |
|                    |         | valid simultaneously. Optional, with a default |
|                    |         | value of 65535 (64k-1).                        |
+--------------------+---------+------------------------------------------------+
| migration          | object  | Migration capability parameters. If missing    |
|                    |         | then migration is not supported by the sender. |
+--------------------+---------+------------------------------------------------+
| write_multiple     | boolean | ``VFIO_USER_REGION_WRITE_MULTI`` messages      |
|                    |         | are supported if the value is ``true``.        |
+--------------------+---------+------------------------------------------------+

The migration capability contains the following name/value pairs:

+-----------------+--------+--------------------------------------------------+
| Name            | Type   | Description                                      |
+=================+========+==================================================+
| pgsize          | number | Page size of dirty pages bitmap. The smallest    |
|                 |        | between the client and the server is used.       |
+-----------------+--------+--------------------------------------------------+
| max_bitmap_size | number | Maximum bitmap size in ``VFIO_USER_DIRTY_PAGES`` |
|                 |        | and ``VFIO_USER_DMA_UNMAP`` messages.  Optional, |
|                 |        | with a default value of 256MB.                   |
+-----------------+--------+--------------------------------------------------+
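
As an illustration of the request layout and capability defaults above, the
sketch below builds and parses a ``VFIO_USER_VERSION`` payload (the part that
follows the standard header). The function names are invented for this
example, and encoding the "only 4k pages" default of ``pgsizes`` as the
number 4096 is an assumption.

```python
import json
import struct

# Defaults a receiver assumes when a capability is not specified.
DEFAULT_CAPS = {
    "max_msg_fds": 1,
    "max_data_xfer_size": 1048576,
    "pgsizes": 4096,          # assumed encoding of "only 4k pages"
    "max_dma_maps": 65535,
}


def build_version_payload(major, minor, capabilities=None):
    """Pack major/minor, then optional NUL-terminated JSON version data."""
    payload = struct.pack("=HH", major, minor)
    if capabilities is not None:
        text = json.dumps({"capabilities": capabilities})
        payload += text.encode("utf-8") + b"\0"
    return payload


def parse_version_payload(payload):
    """Return (major, minor, caps) with defaults filled in."""
    major, minor = struct.unpack_from("=HH", payload)
    caps = dict(DEFAULT_CAPS)
    if len(payload) > 4:
        text = payload[4:].rstrip(b"\0").decode("utf-8")
        caps.update(json.loads(text).get("capabilities", {}))
    return major, minor, caps
```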

Reply
^^^^^

The same message format is used in the server's reply with the semantics
described above.

``VFIO_USER_DMA_MAP``
---------------------

This command message is sent by the client to the server to inform it of the
memory regions the server can access. It must be sent before the server can
perform any DMA to the client. It is normally sent directly after the version
handshake is completed, but may also occur when memory is added to the client,
or if the client uses a vIOMMU.

Request
^^^^^^^

The request payload for this message is a structure of the following format:

+-------------+--------+-------------+
| Name        | Offset | Size        |
+=============+========+=============+
| argsz       | 0      | 4           |
+-------------+--------+-------------+
| flags       | 4      | 4           |
+-------------+--------+-------------+
|             | +-----+------------+ |
|             | | Bit | Definition | |
|             | +=====+============+ |
|             | | 0   | readable   | |
|             | +-----+------------+ |
|             | | 1   | writeable  | |
|             | +-----+------------+ |
+-------------+--------+-------------+
| offset      | 8      | 8           |
+-------------+--------+-------------+
| address     | 16     | 8           |
+-------------+--------+-------------+
| size        | 24     | 8           |
+-------------+--------+-------------+

* *argsz* is the size of the above structure. Note there is no reply payload,
  so this field differs from other message types.
* *flags* contains the following region attributes:

  * *readable* indicates that the region can be read from.

  * *writeable* indicates that the region can be written to.

* *offset* is the file offset of the region with respect to the associated file
  descriptor, or zero if the region is not mappable.
* *address* is the base DMA address of the region.
* *size* is the size of the region.

This structure is 32 bytes in size, so the message size is 16 + 32 bytes.
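
The size arithmetic can be checked with a short sketch that packs the request
payload. The constant and function names are illustrative only; the layout
follows the table above, in host byte order.

```python
import struct

# argsz (4), flags (4), offset (8), address (8), size (8)
DMA_MAP = struct.Struct("=IIQQQ")
DMA_MAP_READABLE = 1 << 0
DMA_MAP_WRITEABLE = 1 << 1
HEADER_SIZE = 16  # size of the standard vfio-user header


def pack_dma_map(address, size, offset=0,
                 flags=DMA_MAP_READABLE | DMA_MAP_WRITEABLE):
    """Pack a VFIO_USER_DMA_MAP payload; argsz is the structure size."""
    return DMA_MAP.pack(DMA_MAP.size, flags, offset, address, size)


payload = pack_dma_map(address=0x100000, size=0x200000)
message_size = HEADER_SIZE + len(payload)  # 16 + 32
```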

If the DMA region being added can be directly mapped by the server, a file
descriptor must be sent as part of the message meta-data. The region can be
mapped via the ``mmap()`` system call. On ``AF_UNIX`` sockets, the file descriptor
must be passed as ``SCM_RIGHTS`` type ancillary data.  Otherwise, if the DMA
region cannot be directly mapped by the server, a file descriptor must not be
sent as part of the message meta-data, and the DMA region can be accessed by the
server using ``VFIO_USER_DMA_READ`` and ``VFIO_USER_DMA_WRITE`` messages,
explained in `Read and Write Operations`_. A command to map over an existing
region must be failed by the server with ``EEXIST`` set in the error field of
the reply.
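
Passing a file descriptor as ``SCM_RIGHTS`` ancillary data can be sketched
with the standard library helpers ``socket.send_fds()`` and
``socket.recv_fds()`` (available on Unix from Python 3.9). The wrapper
functions, the placeholder message bytes, and the temporary file standing in
for guest memory are assumptions made for this example.

```python
import os
import socket
import tempfile


def send_with_fd(sock, message, fd):
    """Send message bytes with one fd attached as SCM_RIGHTS data."""
    socket.send_fds(sock, [message], [fd])


def recv_with_fds(sock, bufsize=4096, maxfds=1):
    """Receive message bytes plus any file descriptors sent with them."""
    data, fds, _msg_flags, _addr = socket.recv_fds(sock, bufsize, maxfds)
    return data, fds


def demo():
    """Round-trip a placeholder message and an fd over a socket pair."""
    client, server = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    with client, server, tempfile.TemporaryFile() as backing:
        backing.write(b"guest memory")
        backing.flush()
        send_with_fd(client, b"dma-map", backing.fileno())
        data, fds = recv_with_fds(server)
        try:
            # The receiver can now read (or mmap) the same file.
            contents = os.pread(fds[0], 12, 0)
        finally:
            for fd in fds:
                os.close(fd)
    return data, contents
```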

Reply
^^^^^

There is no payload in the reply message.

``VFIO_USER_DMA_UNMAP``
-----------------------

This command message is sent by the client to the server to inform it that a
DMA region, previously made available via a ``VFIO_USER_DMA_MAP`` command
message, is no longer available for DMA. It typically occurs when memory is
removed from the client or if the client uses a vIOMMU. The DMA region is
described by the request structure below.

Request
^^^^^^^

The request payload for this message is a structure of the following format:

+--------------+--------+------------------------+
| Name         | Offset | Size                   |
+==============+========+========================+
| argsz        | 0      | 4                      |
+--------------+--------+------------------------+
| flags        | 4      | 4                      |
+--------------+--------+------------------------+
| address      | 8      | 8                      |
+--------------+--------+------------------------+
| size         | 16     | 8                      |
+--------------+--------+------------------------+

* *argsz* is the maximum size of the reply payload.
* *flags* is unused in this version.
* *address* is the base DMA address of the DMA region.
* *size* is the size of the DMA region.

The address and size of the DMA region being unmapped must match exactly a
previous mapping.
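
The exact-match requirement could be enforced on the server side roughly as
sketched below. The ``DmaTable`` class is hypothetical, and returning a
boolean stands in for whatever error reporting a real server would use, which
this section does not specify.

```python
import struct

# argsz (4), flags (4), address (8), size (8)
DMA_UNMAP = struct.Struct("=IIQQ")


class DmaTable:
    """Hypothetical server-side record of mapped DMA regions."""

    def __init__(self):
        self.regions = {}  # base DMA address -> size

    def map(self, address, size):
        self.regions[address] = size

    def unmap(self, payload):
        """Remove a region only if address and size match a prior mapping."""
        _argsz, _flags, address, size = DMA_UNMAP.unpack_from(payload)
        if self.regions.get(address) != size:
            return False  # no mapping with exactly this address and size
        del self.regions[address]
        return True
```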

Reply
^^^^^

Upon receiving a ``VFIO_USER_DMA_UNMAP`` command, if the file descriptor is
mapped then the server must release all references to that DMA region before
replying, which potentially includes in-flight DMA transactions.

The server responds with the original DMA entry in the request.


``VFIO_USER_DEVICE_GET_INFO``
-----------------------------

This command message is sent by the client to the server to query for basic
information about the device.

Request
^^^^^^^

+-------------+--------+--------------------------+
| Name        | Offset | Size                     |
+=============+========+==========================+
| argsz       | 0      | 4                        |
+-------------+--------+--------------------------+
| flags       | 4      | 4                        |
+-------------+--------+--------------------------+
|             | +-----+-------------------------+ |
|             | | Bit | Definition              | |
|             | +=====+=========================+ |
|             | | 0   | VFIO_DEVICE_FLAGS_RESET | |
|             | +-----+-------------------------+ |
|             | | 1   | VFIO_DEVICE_FLAGS_PCI   | |
|             | +-----+-------------------------+ |
+-------------+--------+--------------------------+
| num_regions | 8      | 4                        |
+-------------+--------+--------------------------+
| num_irqs    | 12     | 4                        |
+-------------+--------+--------------------------+

* *argsz* is the maximum size of the reply payload.
* all other fields must be zero.
677
678Reply
679^^^^^
680
681+-------------+--------+--------------------------+
682| Name        | Offset | Size                     |
683+=============+========+==========================+
684| argsz       | 0      | 4                        |
685+-------------+--------+--------------------------+
686| flags       | 4      | 4                        |
687+-------------+--------+--------------------------+
688|             | +-----+-------------------------+ |
689|             | | Bit | Definition              | |
690|             | +=====+=========================+ |
691|             | | 0   | VFIO_DEVICE_FLAGS_RESET | |
692|             | +-----+-------------------------+ |
693|             | | 1   | VFIO_DEVICE_FLAGS_PCI   | |
694|             | +-----+-------------------------+ |
695+-------------+--------+--------------------------+
696| num_regions | 8      | 4                        |
697+-------------+--------+--------------------------+
698| num_irqs    | 12     | 4                        |
699+-------------+--------+--------------------------+
700
701* *argsz* is the size required for the full reply payload (16 bytes today)
702* *flags* contains the following device attributes.
703
704  * ``VFIO_DEVICE_FLAGS_RESET`` indicates that the device supports the
705    ``VFIO_USER_DEVICE_RESET`` message.
706  * ``VFIO_DEVICE_FLAGS_PCI`` indicates that the device is a PCI device.
707
708* *num_regions* is the number of memory regions that the device exposes.
709* *num_irqs* is the number of distinct interrupt types that the device supports.
710
711This version of the protocol only supports PCI devices. Additional devices may
712be supported in future versions.
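
A minimal sketch of how a client might build the request and decode the reply
is shown here. The byte order (little-endian) and the helper names are
assumptions made for illustration; the flag bit positions come from the table
above.

```python
import struct

DEVICE_INFO_FMT = "<IIII"  # argsz, flags, num_regions, num_irqs
VFIO_DEVICE_FLAGS_RESET = 1 << 0
VFIO_DEVICE_FLAGS_PCI = 1 << 1


def pack_get_info_request():
    """Only argsz is set; every other field must be zero."""
    size = struct.calcsize(DEVICE_INFO_FMT)
    return struct.pack(DEVICE_INFO_FMT, size, 0, 0, 0)


def parse_get_info_reply(payload):
    argsz, flags, num_regions, num_irqs = struct.unpack_from(
        DEVICE_INFO_FMT, payload)
    return {
        "argsz": argsz,
        "supports_reset": bool(flags & VFIO_DEVICE_FLAGS_RESET),
        "is_pci": bool(flags & VFIO_DEVICE_FLAGS_PCI),
        "num_regions": num_regions,
        "num_irqs": num_irqs,
    }


# Hypothetical reply from a PCI device with 9 regions and 5 IRQ types.
reply = struct.pack(DEVICE_INFO_FMT, 16,
                    VFIO_DEVICE_FLAGS_RESET | VFIO_DEVICE_FLAGS_PCI, 9, 5)
info = parse_get_info_reply(reply)
```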
713
714``VFIO_USER_DEVICE_GET_REGION_INFO``
715------------------------------------
716
717This command message is sent by the client to the server to query for
718information about device regions. The VFIO region info structure is defined in
719``<linux/vfio.h>`` (``struct vfio_region_info``).
720
721Request
722^^^^^^^
723
724+------------+--------+------------------------------+
725| Name       | Offset | Size                         |
726+============+========+==============================+
727| argsz      | 0      | 4                            |
728+------------+--------+------------------------------+
729| flags      | 4      | 4                            |
730+------------+--------+------------------------------+
731| index      | 8      | 4                            |
732+------------+--------+------------------------------+
733| cap_offset | 12     | 4                            |
734+------------+--------+------------------------------+
735| size       | 16     | 8                            |
736+------------+--------+------------------------------+
737| offset     | 24     | 8                            |
738+------------+--------+------------------------------+
739
* *argsz* is the maximum size of the reply payload.
* *index* is the index of the memory region being queried. It is the only field
  that is required to be set in the command message.
* All other fields must be zero.
744
745Reply
746^^^^^
747
748+------------+--------+------------------------------+
749| Name       | Offset | Size                         |
750+============+========+==============================+
751| argsz      | 0      | 4                            |
752+------------+--------+------------------------------+
753| flags      | 4      | 4                            |
754+------------+--------+------------------------------+
755|            | +-----+-----------------------------+ |
756|            | | Bit | Definition                  | |
757|            | +=====+=============================+ |
758|            | | 0   | VFIO_REGION_INFO_FLAG_READ  | |
759|            | +-----+-----------------------------+ |
760|            | | 1   | VFIO_REGION_INFO_FLAG_WRITE | |
761|            | +-----+-----------------------------+ |
762|            | | 2   | VFIO_REGION_INFO_FLAG_MMAP  | |
763|            | +-----+-----------------------------+ |
764|            | | 3   | VFIO_REGION_INFO_FLAG_CAPS  | |
765|            | +-----+-----------------------------+ |
766+------------+--------+------------------------------+
768| index      | 8      | 4                            |
769+------------+--------+------------------------------+
770| cap_offset | 12     | 4                            |
771+------------+--------+------------------------------+
772| size       | 16     | 8                            |
773+------------+--------+------------------------------+
774| offset     | 24     | 8                            |
775+------------+--------+------------------------------+
776
777* *argsz* is the size required for the full reply payload (region info structure
778  plus the size of any region capabilities)
779* *flags* are attributes of the region:
780
781  * ``VFIO_REGION_INFO_FLAG_READ`` allows client read access to the region.
782  * ``VFIO_REGION_INFO_FLAG_WRITE`` allows client write access to the region.
783  * ``VFIO_REGION_INFO_FLAG_MMAP`` specifies the client can mmap() the region.
784    When this flag is set, the reply will include a file descriptor in its
785    meta-data. On ``AF_UNIX`` sockets, the file descriptors will be passed as
786    ``SCM_RIGHTS`` type ancillary data.
787  * ``VFIO_REGION_INFO_FLAG_CAPS`` indicates additional capabilities found in the
788    reply.
789
* *index* is the index of the memory region being queried. It is the only field
  that is required to be set in the command message.
* *cap_offset* describes where additional region capabilities can be found.
  cap_offset is relative to the beginning of the VFIO region info structure.
  The data structure it points to is a VFIO cap header defined in
  ``<linux/vfio.h>``.
796* *size* is the size of the region.
797* *offset* is the offset that should be given to the mmap() system call for
798  regions with the MMAP attribute. It is also used as the base offset when
799  mapping a VFIO sparse mmap area, described below.
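
The request and reply share one 32-byte layout, which the sketch below packs
and decodes. Little-endian byte order and all helper names are assumptions of
this example; only the field order and flag bits come from the tables above.

```python
import struct

# argsz, flags, index, cap_offset (4 bytes each), then size, offset (8 each).
REGION_INFO_FMT = "<IIIIQQ"
REGION_INFO_SIZE = struct.calcsize(REGION_INFO_FMT)  # 32 bytes

# Bit positions taken from the flags table in the reply.
FLAG_READ, FLAG_WRITE, FLAG_MMAP, FLAG_CAPS = (1 << bit for bit in range(4))


def pack_region_info_request(index, argsz=REGION_INFO_SIZE):
    """index is the only field set besides argsz; the rest are zero."""
    return struct.pack(REGION_INFO_FMT, argsz, 0, index, 0, 0, 0)


def parse_region_info(payload):
    argsz, flags, index, cap_offset, size, offset = struct.unpack_from(
        REGION_INFO_FMT, payload)
    return {
        "argsz": argsz, "index": index, "size": size, "offset": offset,
        "readable": bool(flags & FLAG_READ),
        "writable": bool(flags & FLAG_WRITE),
        "mappable": bool(flags & FLAG_MMAP),
        # cap_offset is only meaningful when the CAPS flag is set.
        "cap_offset": cap_offset if flags & FLAG_CAPS else None,
    }


# Hypothetical reply: a 16 KiB mappable region without capabilities.
reply = struct.pack(REGION_INFO_FMT, 32, FLAG_READ | FLAG_WRITE | FLAG_MMAP,
                    0, 0, 0x4000, 0)
region = parse_region_info(reply)
```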
800
801VFIO region capabilities
802""""""""""""""""""""""""
803
804The VFIO region information can also include a capabilities list. This list is
805similar to a PCI capability list - each entry has a common header that
806identifies a capability and where the next capability in the list can be found.
807The VFIO capability header format is defined in ``<linux/vfio.h>`` (``struct
808vfio_info_cap_header``).
809
810VFIO cap header format
811""""""""""""""""""""""
812
813+---------+--------+------+
814| Name    | Offset | Size |
815+=========+========+======+
816| id      | 0      | 2    |
817+---------+--------+------+
818| version | 2      | 2    |
819+---------+--------+------+
820| next    | 4      | 4    |
821+---------+--------+------+
822
823* *id* is the capability identity.
824* *version* is a capability-specific version number.
825* *next* specifies the offset of the next capability in the capability list. It
826  is relative to the beginning of the VFIO region info structure.
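
Walking the capability list can be sketched as follows. The example assumes
little-endian byte order and, mirroring the Linux VFIO convention, treats a
*next* value of 0 as the end of the list; ``walk_caps`` is a hypothetical
helper name.

```python
import struct

CAP_HEADER_FMT = "<HHI"  # id, version, next
CAP_HEADER_SIZE = struct.calcsize(CAP_HEADER_FMT)  # 8 bytes


def walk_caps(region_info, cap_offset):
    """Yield (id, version, offset) for each capability in the chain.

    Offsets are relative to the start of the region info structure.
    A next value of 0 ends the chain (assumption of this sketch).
    """
    seen = set()
    offset = cap_offset
    while offset != 0:
        # Guard against loops and headers that run past the reply.
        if offset in seen or offset + CAP_HEADER_SIZE > len(region_info):
            raise ValueError("malformed capability chain")
        seen.add(offset)
        cap_id, version, nxt = struct.unpack_from(
            CAP_HEADER_FMT, region_info, offset)
        yield cap_id, version, offset
        offset = nxt


# Hypothetical reply: 32-byte region info followed by two bare cap headers.
buf = (bytes(32)
       + struct.pack(CAP_HEADER_FMT, 1, 1, 40)
       + struct.pack(CAP_HEADER_FMT, 2, 1, 0))
caps = list(walk_caps(buf, 32))
```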
827
828VFIO sparse mmap cap header
829"""""""""""""""""""""""""""
830
831+------------------+----------------------------------+
832| Name             | Value                            |
833+==================+==================================+
834| id               | VFIO_REGION_INFO_CAP_SPARSE_MMAP |
835+------------------+----------------------------------+
836| version          | 0x1                              |
837+------------------+----------------------------------+
838| next             | <next>                           |
839+------------------+----------------------------------+
840| sparse mmap info | VFIO region info sparse mmap     |
841+------------------+----------------------------------+
842
843This capability is defined when only a subrange of the region supports
844direct access by the client via mmap(). The VFIO sparse mmap area is defined in
845``<linux/vfio.h>`` (``struct vfio_region_sparse_mmap_area`` and ``struct
846vfio_region_info_cap_sparse_mmap``).
847
848VFIO region info cap sparse mmap
849""""""""""""""""""""""""""""""""
850
851+----------+--------+------+
852| Name     | Offset | Size |
853+==========+========+======+
854| nr_areas | 0      | 4    |
855+----------+--------+------+
856| reserved | 4      | 4    |
857+----------+--------+------+
858| offset   | 8      | 8    |
859+----------+--------+------+
860| size     | 16     | 8    |
861+----------+--------+------+
862| ...      |        |      |
863+----------+--------+------+
864
865* *nr_areas* is the number of sparse mmap areas in the region.
* *offset* and *size* describe a single area that can be mapped by the client.
  There will be *nr_areas* pairs of offset and size. The offset will be added to
  the base offset given in the ``VFIO_USER_DEVICE_GET_REGION_INFO`` reply to
  form the offset argument of the subsequent mmap() call.
870
871The VFIO sparse mmap area is defined in ``<linux/vfio.h>`` (``struct
872vfio_region_info_cap_sparse_mmap``).
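
The offset arithmetic for sparse areas can be illustrated with a short sketch.
It assumes little-endian byte order and that the bytes passed in start directly
after the 8-byte cap header; the function names are hypothetical.

```python
import struct


def parse_sparse_mmap(cap_body):
    """Return a list of (offset, size) areas from the capability body."""
    nr_areas, _reserved = struct.unpack_from("<II", cap_body, 0)
    # Each area is a pair of 8-byte fields, starting at byte 8.
    return [struct.unpack_from("<QQ", cap_body, 8 + 16 * i)
            for i in range(nr_areas)]


def mmap_offsets(region_offset, areas):
    """Add the region's base offset to each area offset for mmap()."""
    return [(region_offset + off, size) for off, size in areas]


# Hypothetical capability: two 4 KiB areas with an unmappable hole between.
body = (struct.pack("<II", 2, 0)
        + struct.pack("<QQ", 0x0000, 0x1000)
        + struct.pack("<QQ", 0x2000, 0x1000))
areas = parse_sparse_mmap(body)
offsets = mmap_offsets(0x100000, areas)  # 0x100000 is a made-up base offset
```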
873
874
875``VFIO_USER_DEVICE_GET_REGION_IO_FDS``
876--------------------------------------
877
878Clients can access regions via ``VFIO_USER_REGION_READ/WRITE`` or, if provided, by
879``mmap()`` of a file descriptor provided by the server.
880
881``VFIO_USER_DEVICE_GET_REGION_IO_FDS`` provides an alternative access mechanism via
882file descriptors. This is an optional feature intended for performance
883improvements where an underlying sub-system (such as KVM) supports communication
884across such file descriptors to the vfio-user server, without needing to
885round-trip through the client.
886
887The server returns an array of sub-regions for the requested region. Each
888sub-region describes a span (offset and size) of a region, along with the
889requested file descriptor notification mechanism to use.  Each sub-region in the
890response message may choose to use a different method, as defined below.  The
891two mechanisms supported in this specification are ioeventfds and ioregionfds.
892
893The server in addition returns a file descriptor in the ancillary data; clients
894are expected to configure each sub-region's file descriptor with the requested
895notification method. For example, a client could configure KVM with the
896requested ioeventfd via a ``KVM_IOEVENTFD`` ``ioctl()``.
897
898Request
899^^^^^^^
900
901+-------------+--------+------+
902| Name        | Offset | Size |
903+=============+========+======+
904| argsz       | 0      | 4    |
905+-------------+--------+------+
906| flags       | 4      | 4    |
907+-------------+--------+------+
908| index       | 8      | 4    |
909+-------------+--------+------+
910| count       | 12     | 4    |
911+-------------+--------+------+
912
* *argsz* is the maximum size of the reply payload.
* *index* is the index of the memory region being queried.
* All other fields must be zero.
916
917The client must set ``flags`` to zero and specify the region being queried in
918the ``index``.
919
920Reply
921^^^^^
922
923+-------------+--------+------+
924| Name        | Offset | Size |
925+=============+========+======+
926| argsz       | 0      | 4    |
927+-------------+--------+------+
928| flags       | 4      | 4    |
929+-------------+--------+------+
930| index       | 8      | 4    |
931+-------------+--------+------+
932| count       | 12     | 4    |
933+-------------+--------+------+
934| sub-regions | 16     | ...  |
935+-------------+--------+------+
936
* *argsz* is the size of the region IO FD info structure plus the
  total size of the sub-region array. Thus, each array entry "i" is at offset
  i * ((argsz - 16) / count) from the start of the *sub-regions* array, since
  the fixed fields preceding the array occupy 16 bytes. Note that currently
  each entry is 40 bytes for both IO FD types, but this is not to be relied on.
  As elsewhere, this indicates the full reply payload size needed.
942* *flags* must be zero
943* *index* is the index of memory region being queried
944* *count* is the number of sub-regions in the array
945* *sub-regions* is the array of Sub-Region IO FD info structures
946
947The reply message will additionally include at least one file descriptor in the
948ancillary data. Note that more than one sub-region may share the same file
949descriptor.
950
951Note that it is the client's responsibility to verify the requested values (for
952example, that the requested offset does not exceed the region's bounds).
953
Each sub-region given in the response has one of two possible structures,
depending on whether *type* is ``VFIO_USER_IO_FD_TYPE_IOEVENTFD`` or
``VFIO_USER_IO_FD_TYPE_IOREGIONFD``:
957
958Sub-Region IO FD info format (ioeventfd)
959""""""""""""""""""""""""""""""""""""""""
960
961+-----------+--------+------+
962| Name      | Offset | Size |
963+===========+========+======+
964| offset    | 0      | 8    |
965+-----------+--------+------+
966| size      | 8      | 8    |
967+-----------+--------+------+
968| fd_index  | 16     | 4    |
969+-----------+--------+------+
970| type      | 20     | 4    |
971+-----------+--------+------+
972| flags     | 24     | 4    |
973+-----------+--------+------+
974| padding   | 28     | 4    |
975+-----------+--------+------+
976| datamatch | 32     | 8    |
977+-----------+--------+------+
978
979* *offset* is the offset of the start of the sub-region within the region
980  requested ("physical address offset" for the region)
981* *size* is the length of the sub-region. This may be zero if the access size is
982  not relevant, which may allow for optimizations
983* *fd_index* is the index in the ancillary data of the FD to use for ioeventfd
984  notification; it may be shared.
985* *type* is ``VFIO_USER_IO_FD_TYPE_IOEVENTFD``
986* *flags* is any of:
987
988  * ``KVM_IOEVENTFD_FLAG_DATAMATCH``
989  * ``KVM_IOEVENTFD_FLAG_PIO``
990  * ``KVM_IOEVENTFD_FLAG_VIRTIO_CCW_NOTIFY`` (FIXME: makes sense?)
991
992* *datamatch* is the datamatch value if needed
993
994See https://www.kernel.org/doc/Documentation/virtual/kvm/api.txt, *4.59
995KVM_IOEVENTFD* for further context on the ioeventfd-specific fields.
996
997Sub-Region IO FD info format (ioregionfd)
998"""""""""""""""""""""""""""""""""""""""""
999
1000+-----------+--------+------+
1001| Name      | Offset | Size |
1002+===========+========+======+
1003| offset    | 0      | 8    |
1004+-----------+--------+------+
1005| size      | 8      | 8    |
1006+-----------+--------+------+
1007| fd_index  | 16     | 4    |
1008+-----------+--------+------+
1009| type      | 20     | 4    |
1010+-----------+--------+------+
1011| flags     | 24     | 4    |
1012+-----------+--------+------+
1013| padding   | 28     | 4    |
1014+-----------+--------+------+
1015| user_data | 32     | 8    |
1016+-----------+--------+------+
1017
1018* *offset* is the offset of the start of the sub-region within the region
1019  requested ("physical address offset" for the region)
1020* *size* is the length of the sub-region. This may be zero if the access size is
1021  not relevant, which may allow for optimizations; ``KVM_IOREGION_POSTED_WRITES``
1022  must be set in *flags* in this case
1023* *fd_index* is the index in the ancillary data of the FD to use for ioregionfd
1024  messages; it may be shared
1025* *type* is ``VFIO_USER_IO_FD_TYPE_IOREGIONFD``
1026* *flags* is any of:
1027
1028  * ``KVM_IOREGION_PIO``
1029  * ``KVM_IOREGION_POSTED_WRITES``
1030
1031* *user_data* is an opaque value passed back to the server via a message on the
1032  file descriptor
1033
1034For further information on the ioregionfd-specific fields, see:
1035https://lore.kernel.org/kvm/cover.1613828726.git.eafanasova@gmail.com/
1036
1037(FIXME: update with final API docs.)
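
The following sketch shows one way a client might decode the reply and validate
each sub-region. It takes the 16 bytes of fixed fields from the reply table,
derives the entry stride from *argsz* rather than hard-coding 40, and assumes
little-endian byte order. The helper name and the ``extra`` key (holding either
*datamatch* or *user_data*) are hypothetical.

```python
import struct

IO_FDS_HDR_FMT = "<IIII"  # argsz, flags, index, count
IO_FDS_HDR_SIZE = struct.calcsize(IO_FDS_HDR_FMT)  # 16: sub-regions start here
# offset, size, fd_index, type, flags, padding, then datamatch or user_data.
SUB_REGION_FMT = "<QQIIIIQ"
SUB_REGION_SIZE = struct.calcsize(SUB_REGION_FMT)  # 40 today


def parse_io_fds_reply(payload, region_size, nr_fds):
    argsz, _flags, index, count = struct.unpack_from(IO_FDS_HDR_FMT, payload)
    if count == 0:
        return index, []
    # Derive the entry stride from argsz instead of hard-coding 40.
    stride = (argsz - IO_FDS_HDR_SIZE) // count
    if stride < SUB_REGION_SIZE or argsz > len(payload):
        raise ValueError("reply too short for its sub-region array")
    subs = []
    for i in range(count):
        off, size, fd_index, kind, flags, _pad, extra = struct.unpack_from(
            SUB_REGION_FMT, payload, IO_FDS_HDR_SIZE + i * stride)
        # Client-side validation of the values the server handed back.
        if off + size > region_size or fd_index >= nr_fds:
            raise ValueError(f"sub-region {i} is out of bounds")
        subs.append({"offset": off, "size": size, "fd_index": fd_index,
                     "type": kind, "flags": flags, "extra": extra})
    return index, subs


# Hypothetical reply: two 4-byte doorbells in region 2 sharing FD 0.
# The type field is left as 0 purely as a placeholder value.
entries = [struct.pack(SUB_REGION_FMT, off, 4, 0, 0, 0, 0, 0)
           for off in (0x0, 0x8)]
reply = struct.pack(IO_FDS_HDR_FMT, 16 + 2 * 40, 0, 2, 2) + b"".join(entries)
index, subs = parse_io_fds_reply(reply, region_size=0x1000, nr_fds=1)
```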
1038
1039``VFIO_USER_DEVICE_GET_IRQ_INFO``
1040---------------------------------
1041
1042This command message is sent by the client to the server to query for
1043information about device interrupt types. The VFIO IRQ info structure is
1044defined in ``<linux/vfio.h>`` (``struct vfio_irq_info``).
1045
1046Request
1047^^^^^^^
1048
1049+-------+--------+---------------------------+
1050| Name  | Offset | Size                      |
1051+=======+========+===========================+
1052| argsz | 0      | 4                         |
1053+-------+--------+---------------------------+
1054| flags | 4      | 4                         |
1055+-------+--------+---------------------------+
1056|       | +-----+--------------------------+ |
1057|       | | Bit | Definition               | |
1058|       | +=====+==========================+ |
1059|       | | 0   | VFIO_IRQ_INFO_EVENTFD    | |
1060|       | +-----+--------------------------+ |
1061|       | | 1   | VFIO_IRQ_INFO_MASKABLE   | |
1062|       | +-----+--------------------------+ |
1063|       | | 2   | VFIO_IRQ_INFO_AUTOMASKED | |
1064|       | +-----+--------------------------+ |
1065|       | | 3   | VFIO_IRQ_INFO_NORESIZE   | |
1066|       | +-----+--------------------------+ |
1067+-------+--------+---------------------------+
1068| index | 8      | 4                         |
1069+-------+--------+---------------------------+
1070| count | 12     | 4                         |
1071+-------+--------+---------------------------+
1072
1073* *argsz* is the maximum size of the reply payload (16 bytes today)
* *index* is the index of IRQ type being queried (e.g. ``VFIO_PCI_MSIX_IRQ_INDEX``)
1075* all other fields must be zero
1076
1077Reply
1078^^^^^
1079
1080+-------+--------+---------------------------+
1081| Name  | Offset | Size                      |
1082+=======+========+===========================+
1083| argsz | 0      | 4                         |
1084+-------+--------+---------------------------+
1085| flags | 4      | 4                         |
1086+-------+--------+---------------------------+
1087|       | +-----+--------------------------+ |
1088|       | | Bit | Definition               | |
1089|       | +=====+==========================+ |
1090|       | | 0   | VFIO_IRQ_INFO_EVENTFD    | |
1091|       | +-----+--------------------------+ |
1092|       | | 1   | VFIO_IRQ_INFO_MASKABLE   | |
1093|       | +-----+--------------------------+ |
1094|       | | 2   | VFIO_IRQ_INFO_AUTOMASKED | |
1095|       | +-----+--------------------------+ |
1096|       | | 3   | VFIO_IRQ_INFO_NORESIZE   | |
1097|       | +-----+--------------------------+ |
1098+-------+--------+---------------------------+
1099| index | 8      | 4                         |
1100+-------+--------+---------------------------+
1101| count | 12     | 4                         |
1102+-------+--------+---------------------------+
1103
1104* *argsz* is the size required for the full reply payload (16 bytes today)
1105* *flags* defines IRQ attributes:
1106
1107  * ``VFIO_IRQ_INFO_EVENTFD`` indicates the IRQ type can support server eventfd
1108    signalling.
1109  * ``VFIO_IRQ_INFO_MASKABLE`` indicates that the IRQ type supports the ``MASK``
1110    and ``UNMASK`` actions in a ``VFIO_USER_DEVICE_SET_IRQS`` message.
1111  * ``VFIO_IRQ_INFO_AUTOMASKED`` indicates the IRQ type masks itself after being
1112    triggered, and the client must send an ``UNMASK`` action to receive new
1113    interrupts.
  * ``VFIO_IRQ_INFO_NORESIZE`` indicates ``VFIO_USER_DEVICE_SET_IRQS``
    operations set up interrupts as a set, and new sub-indexes cannot be
    enabled without disabling the entire type.

* *index* is the index of IRQ type being queried.
* *count* describes the number of interrupts of the queried type.
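
Decoding the reply flags can be sketched as below. Little-endian byte order and
the helper name are assumptions; the bit positions are those listed in the
flags table, and the index and count in the sample reply are made-up values.

```python
import struct

IRQ_INFO_FMT = "<IIII"  # argsz, flags, index, count
IRQ_INFO_FLAGS = {
    0: "VFIO_IRQ_INFO_EVENTFD",
    1: "VFIO_IRQ_INFO_MASKABLE",
    2: "VFIO_IRQ_INFO_AUTOMASKED",
    3: "VFIO_IRQ_INFO_NORESIZE",
}


def parse_irq_info(payload):
    argsz, flags, index, count = struct.unpack_from(IRQ_INFO_FMT, payload)
    # Translate each set bit into its symbolic name, lowest bit first.
    names = [name for bit, name in IRQ_INFO_FLAGS.items()
             if flags & (1 << bit)]
    return {"argsz": argsz, "index": index, "count": count, "flags": names}


# Hypothetical reply: eventfd-capable, no-resize IRQ type with 8 interrupts.
reply = struct.pack(IRQ_INFO_FMT, 16, (1 << 0) | (1 << 3), 4, 8)
irq = parse_irq_info(reply)
```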
1119
1120``VFIO_USER_DEVICE_SET_IRQS``
1121-----------------------------
1122
1123This command message is sent by the client to the server to set actions for
1124device interrupt types. The VFIO IRQ set structure is defined in
1125``<linux/vfio.h>`` (``struct vfio_irq_set``).
1126
1127Request
1128^^^^^^^
1129
1130+-------+--------+------------------------------+
1131| Name  | Offset | Size                         |
1132+=======+========+==============================+
1133| argsz | 0      | 4                            |
1134+-------+--------+------------------------------+
1135| flags | 4      | 4                            |
1136+-------+--------+------------------------------+
1137|       | +-----+-----------------------------+ |
1138|       | | Bit | Definition                  | |
1139|       | +=====+=============================+ |
1140|       | | 0   | VFIO_IRQ_SET_DATA_NONE      | |
1141|       | +-----+-----------------------------+ |
1142|       | | 1   | VFIO_IRQ_SET_DATA_BOOL      | |
1143|       | +-----+-----------------------------+ |
1144|       | | 2   | VFIO_IRQ_SET_DATA_EVENTFD   | |
1145|       | +-----+-----------------------------+ |
1146|       | | 3   | VFIO_IRQ_SET_ACTION_MASK    | |
1147|       | +-----+-----------------------------+ |
1148|       | | 4   | VFIO_IRQ_SET_ACTION_UNMASK  | |
1149|       | +-----+-----------------------------+ |
1150|       | | 5   | VFIO_IRQ_SET_ACTION_TRIGGER | |
1151|       | +-----+-----------------------------+ |
1152+-------+--------+------------------------------+
1153| index | 8      | 4                            |
1154+-------+--------+------------------------------+
1155| start | 12     | 4                            |
1156+-------+--------+------------------------------+
1157| count | 16     | 4                            |
1158+-------+--------+------------------------------+
1159| data  | 20     | variable                     |
1160+-------+--------+------------------------------+
1161
1162* *argsz* is the size of the VFIO IRQ set request payload, including any *data*
1163  field. Note there is no reply payload, so this field differs from other
1164  message types.
* *flags* defines the action performed on the interrupt range. The ``DATA``
  flags describe the data field sent in the message; the ``ACTION`` flags
  describe the action to be performed. Within each of the two sets, the flags
  are mutually exclusive.
1169
1170  * ``VFIO_IRQ_SET_DATA_NONE`` indicates there is no data field in the command.
1171    The action is performed unconditionally.
1172  * ``VFIO_IRQ_SET_DATA_BOOL`` indicates the data field is an array of boolean
1173    bytes. The action is performed if the corresponding boolean is true.
1174  * ``VFIO_IRQ_SET_DATA_EVENTFD`` indicates an array of event file descriptors
1175    was sent in the message meta-data. These descriptors will be signalled when
1176    the action defined by the action flags occurs. In ``AF_UNIX`` sockets, the
1177    descriptors are sent as ``SCM_RIGHTS`` type ancillary data.
1178    If no file descriptors are provided, this de-assigns the specified
1179    previously configured interrupts.
1180  * ``VFIO_IRQ_SET_ACTION_MASK`` indicates a masking event. It can be used with
1181    ``VFIO_IRQ_SET_DATA_BOOL`` or ``VFIO_IRQ_SET_DATA_NONE`` to mask an interrupt,
1182    or with ``VFIO_IRQ_SET_DATA_EVENTFD`` to generate an event when the guest masks
1183    the interrupt.
1184  * ``VFIO_IRQ_SET_ACTION_UNMASK`` indicates an unmasking event. It can be used
1185    with ``VFIO_IRQ_SET_DATA_BOOL`` or ``VFIO_IRQ_SET_DATA_NONE`` to unmask an
1186    interrupt, or with ``VFIO_IRQ_SET_DATA_EVENTFD`` to generate an event when the
1187    guest unmasks the interrupt.
1188  * ``VFIO_IRQ_SET_ACTION_TRIGGER`` indicates a triggering event. It can be used
1189    with ``VFIO_IRQ_SET_DATA_BOOL`` or ``VFIO_IRQ_SET_DATA_NONE`` to trigger an
1190    interrupt, or with ``VFIO_IRQ_SET_DATA_EVENTFD`` to generate an event when the
1191    server triggers the interrupt.
1192
* *index* is the index of the IRQ type being set up.
* *start* is the first sub-index being set.
1195* *count* describes the number of sub-indexes being set. As a special case, a
1196  count (and start) of 0, with data flags of ``VFIO_IRQ_SET_DATA_NONE`` disables
1197  all interrupts of the index.
1198* *data* is an optional field included when the
1199  ``VFIO_IRQ_SET_DATA_BOOL`` flag is present. It contains an array of booleans
1200  that specify whether the action is to be performed on the corresponding
1201  index. It's used when the action is only performed on a subset of the range
1202  specified.
1203
Not all interrupt types support every combination of data and action flags.
The client must know the capabilities of the device and IRQ index before it
sends a ``VFIO_USER_DEVICE_SET_IRQS`` message.

In typical operation, a specific IRQ may operate as follows:

1. The client sends a ``VFIO_USER_DEVICE_SET_IRQS`` message with
   ``flags=(VFIO_IRQ_SET_DATA_EVENTFD|VFIO_IRQ_SET_ACTION_TRIGGER)`` along
   with an eventfd. This associates the IRQ with a particular eventfd on the
   server side.

#. The client may send a ``VFIO_USER_DEVICE_SET_IRQS`` message with
   ``flags=(VFIO_IRQ_SET_DATA_EVENTFD|VFIO_IRQ_SET_ACTION_MASK/UNMASK)`` along
   with another eventfd. This associates the given eventfd with the
   mask/unmask state on the server side.

#. The server may trigger the IRQ by writing 1 to the eventfd.

#. The server may mask/unmask an IRQ which will write 1 to the corresponding
   mask/unmask eventfd, if there is one.

#. A client may trigger a device IRQ itself, by sending a
   ``VFIO_USER_DEVICE_SET_IRQS`` message with
   ``flags=(VFIO_IRQ_SET_DATA_NONE/BOOL|VFIO_IRQ_SET_ACTION_TRIGGER)``.

#. A client may mask or unmask the IRQ, by sending a
   ``VFIO_USER_DEVICE_SET_IRQS`` message with
   ``flags=(VFIO_IRQ_SET_DATA_NONE/BOOL|VFIO_IRQ_SET_ACTION_MASK/UNMASK)``.
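
The first and last steps above can be sketched as request payloads. The example
assumes little-endian byte order and reads the flag rules as "exactly one
``DATA`` flag and exactly one ``ACTION`` flag", which is this sketch's
interpretation. ``pack_set_irqs`` and the IRQ index value 2 are hypothetical.

```python
import struct

DATA_NONE, DATA_BOOL, DATA_EVENTFD = 1 << 0, 1 << 1, 1 << 2
ACTION_MASK, ACTION_UNMASK, ACTION_TRIGGER = 1 << 3, 1 << 4, 1 << 5
DATA_BITS = DATA_NONE | DATA_BOOL | DATA_EVENTFD
ACTION_BITS = ACTION_MASK | ACTION_UNMASK | ACTION_TRIGGER

IRQ_SET_HDR_FMT = "<IIIII"  # argsz, flags, index, start, count
IRQ_SET_HDR_SIZE = struct.calcsize(IRQ_SET_HDR_FMT)  # 20; data follows


def _single_bit(value):
    return value != 0 and (value & (value - 1)) == 0


def pack_set_irqs(flags, index, start, count, bools=()):
    if not (_single_bit(flags & DATA_BITS)
            and _single_bit(flags & ACTION_BITS)):
        raise ValueError("need exactly one DATA flag and one ACTION flag")
    data = b""
    if flags & DATA_BOOL:
        if len(bools) != count:
            raise ValueError("one boolean byte per sub-index is required")
        data = bytes(1 if b else 0 for b in bools)
    # argsz covers the request itself, including the data field.
    argsz = IRQ_SET_HDR_SIZE + len(data)
    return struct.pack(IRQ_SET_HDR_FMT, argsz, flags, index, start, count) + data


# First step: bind a trigger eventfd (the fd travels as SCM_RIGHTS, not in data).
bind = pack_set_irqs(DATA_EVENTFD | ACTION_TRIGGER, index=2, start=0, count=1)
# Last step: mask sub-indexes 0 and 2 out of a range of three.
mask = pack_set_irqs(DATA_BOOL | ACTION_MASK, index=2, start=0, count=3,
                     bools=[True, False, True])
```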
1232
1233Reply
1234^^^^^
1235
1236There is no payload in the reply.
1237
1238.. _Read and Write Operations:
1239
1240Note that all of these operations must be supported by the client and/or server,
1241even if the corresponding memory or device region has been shared as mappable.
1242
1243The ``count`` field must not exceed the value of ``max_data_xfer_size`` of the
1244peer, for both reads and writes.
1245
1246``VFIO_USER_REGION_READ``
1247-------------------------
1248
1249If a device region is not mappable, it's not directly accessible by the client
1250via ``mmap()`` of the underlying file descriptor. In this case, a client can
1251read from a device region with this message.
1252
1253Request
1254^^^^^^^
1255
1256+--------+--------+----------+
1257| Name   | Offset | Size     |
1258+========+========+==========+
1259| offset | 0      | 8        |
1260+--------+--------+----------+
1261| region | 8      | 4        |
1262+--------+--------+----------+
1263| count  | 12     | 4        |
1264+--------+--------+----------+
1265
1266* *offset* into the region being accessed.
1267* *region* is the index of the region being accessed.
1268* *count* is the size of the data to be transferred.
1269
1270Reply
1271^^^^^
1272
1273+--------+--------+----------+
1274| Name   | Offset | Size     |
1275+========+========+==========+
1276| offset | 0      | 8        |
1277+--------+--------+----------+
1278| region | 8      | 4        |
1279+--------+--------+----------+
1280| count  | 12     | 4        |
1281+--------+--------+----------+
1282| data   | 16     | variable |
1283+--------+--------+----------+
1284
1285* *offset* into the region accessed.
1286* *region* is the index of the region accessed.
1287* *count* is the size of the data transferred.
1288* *data* is the data that was read from the device region.
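
A client-side sketch of this exchange follows. It assumes little-endian byte
order; the helper names are hypothetical, and ``max_data_xfer_size`` stands for
the limit the peer advertised.

```python
import struct

REGION_RW_FMT = "<QII"  # offset, region, count
REGION_RW_SIZE = struct.calcsize(REGION_RW_FMT)  # 16; data follows in replies


def pack_region_read(offset, region, count, max_data_xfer_size):
    if count > max_data_xfer_size:
        raise ValueError("count exceeds the peer's max_data_xfer_size")
    return struct.pack(REGION_RW_FMT, offset, region, count)


def parse_region_read_reply(payload):
    offset, region, count = struct.unpack_from(REGION_RW_FMT, payload)
    data = payload[REGION_RW_SIZE:REGION_RW_SIZE + count]
    if len(data) != count:
        raise ValueError("reply shorter than its count field")
    return offset, region, data


# Hypothetical 4-byte read at offset 0x10 of region 0.
request = pack_region_read(0x10, 0, 4, max_data_xfer_size=1 << 20)
reply = struct.pack(REGION_RW_FMT, 0x10, 0, 4) + b"\x78\x56\x34\x12"
result = parse_region_read_reply(reply)
```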
1289
1290``VFIO_USER_REGION_WRITE``
1291--------------------------
1292
If a device region is not mappable, it's not directly accessible by the client
via ``mmap()`` of the underlying file descriptor. In this case, a client can
write to a device region with this message.
1296
1297Request
1298^^^^^^^
1299
1300+--------+--------+----------+
1301| Name   | Offset | Size     |
1302+========+========+==========+
1303| offset | 0      | 8        |
1304+--------+--------+----------+
1305| region | 8      | 4        |
1306+--------+--------+----------+
1307| count  | 12     | 4        |
1308+--------+--------+----------+
1309| data   | 16     | variable |
1310+--------+--------+----------+
1311
1312* *offset* into the region being accessed.
1313* *region* is the index of the region being accessed.
1314* *count* is the size of the data to be transferred.
* *data* is the data to write.
1316
1317Reply
1318^^^^^
1319
1320+--------+--------+----------+
1321| Name   | Offset | Size     |
1322+========+========+==========+
1323| offset | 0      | 8        |
1324+--------+--------+----------+
1325| region | 8      | 4        |
1326+--------+--------+----------+
1327| count  | 12     | 4        |
1328+--------+--------+----------+
1329
1330* *offset* into the region accessed.
1331* *region* is the index of the region accessed.
1332* *count* is the size of the data transferred.
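
Because *count* may not exceed the peer's ``max_data_xfer_size``, a large write
has to be sent as several messages. The sketch below shows one way to split it,
assuming little-endian byte order. Whether a given device tolerates a write
being split is device-specific, and ``region_write_requests`` is a hypothetical
helper.

```python
import struct

REGION_RW_FMT = "<QII"  # offset, region, count; data follows


def region_write_requests(offset, region, data, max_data_xfer_size):
    """Yield one VFIO_USER_REGION_WRITE payload per chunk of data."""
    for pos in range(0, len(data), max_data_xfer_size):
        chunk = data[pos:pos + max_data_xfer_size]
        header = struct.pack(REGION_RW_FMT, offset + pos, region, len(chunk))
        yield header + chunk


# Hypothetical: 10 bytes written to region 1 with a 4-byte transfer limit.
messages = list(region_write_requests(0x100, 1, bytes(range(10)), 4))
```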
1333
1334``VFIO_USER_DMA_READ``
1335-----------------------
1336
1337If the client has not shared mappable memory, the server can use this message to
1338read from guest memory.
1339
1340Request
1341^^^^^^^
1342
1343+---------+--------+----------+
1344| Name    | Offset | Size     |
1345+=========+========+==========+
1346| address | 0      | 8        |
1347+---------+--------+----------+
1348| count   | 8      | 8        |
1349+---------+--------+----------+
1350
1351* *address* is the client DMA memory address being accessed. This address must have
1352  been previously exported to the server with a ``VFIO_USER_DMA_MAP`` message.
1353* *count* is the size of the data to be transferred.
1354
1355Reply
1356^^^^^
1357
1358+---------+--------+----------+
1359| Name    | Offset | Size     |
1360+=========+========+==========+
1361| address | 0      | 8        |
1362+---------+--------+----------+
1363| count   | 8      | 8        |
1364+---------+--------+----------+
1365| data    | 16     | variable |
1366+---------+--------+----------+
1367
1368* *address* is the client DMA memory address being accessed.
1369* *count* is the size of the data transferred.
1370* *data* is the data read.
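
On the server side, building the request and decoding the reply might look like
the sketch below. It assumes little-endian byte order and, as a simplification,
requires the whole range to fall inside a single exported mapping. The helper
names are hypothetical.

```python
import struct

DMA_RW_FMT = "<QQ"  # address, count
DMA_RW_SIZE = struct.calcsize(DMA_RW_FMT)  # 16; data follows in replies


def pack_dma_read(address, count, mapped, max_data_xfer_size):
    """mapped: iterable of (base, size) ranges exported via VFIO_USER_DMA_MAP."""
    if count > max_data_xfer_size:
        raise ValueError("count exceeds the peer's max_data_xfer_size")
    if not any(base <= address and address + count <= base + size
               for base, size in mapped):
        raise ValueError("range was not exported with VFIO_USER_DMA_MAP")
    return struct.pack(DMA_RW_FMT, address, count)


def parse_dma_read_reply(payload):
    address, count = struct.unpack_from(DMA_RW_FMT, payload)
    data = payload[DMA_RW_SIZE:DMA_RW_SIZE + count]
    if len(data) != count:
        raise ValueError("reply shorter than its count field")
    return address, data


mapped = [(0x10000, 0x4000)]  # hypothetical earlier mapping
request = pack_dma_read(0x10100, 64, mapped, max_data_xfer_size=1 << 20)
reply = struct.pack(DMA_RW_FMT, 0x10100, 4) + b"\xde\xad\xbe\xef"
result = parse_dma_read_reply(reply)
```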
1371
1372``VFIO_USER_DMA_WRITE``
1373-----------------------
1374
1375If the client has not shared mappable memory, the server can use this message to
1376write to guest memory.
1377
1378Request
1379^^^^^^^
1380
1381+---------+--------+----------+
1382| Name    | Offset | Size     |
1383+=========+========+==========+
1384| address | 0      | 8        |
1385+---------+--------+----------+
1386| count   | 8      | 8        |
1387+---------+--------+----------+
1388| data    | 16     | variable |
1389+---------+--------+----------+
1390
1391* *address* is the client DMA memory address being accessed. This address must have
1392  been previously exported to the server with a ``VFIO_USER_DMA_MAP`` message.
1393* *count* is the size of the data to be transferred.
* *data* is the data to write.
1395
1396Reply
1397^^^^^
1398
1399+---------+--------+----------+
1400| Name    | Offset | Size     |
1401+=========+========+==========+
1402| address | 0      | 8        |
1403+---------+--------+----------+
| count   | 8      | 8        |
1405+---------+--------+----------+
1406
1407* *address* is the client DMA memory address being accessed.
1408* *count* is the size of the data transferred.
1409
1410``VFIO_USER_DEVICE_RESET``
1411--------------------------
1412
1413This command message is sent from the client to the server to reset the device.
Neither the request nor the reply has a payload.
1415
``VFIO_USER_REGION_WRITE_MULTI``
--------------------------------

This message can be used to coalesce multiple device write operations
into a single message.  It is only used as an optimization when the
outgoing message queue is relatively full.

Request
^^^^^^^

+---------+--------+----------+
| Name    | Offset | Size     |
+=========+========+==========+
| wr_cnt  | 0      | 8        |
+---------+--------+----------+
| wrs     | 8      | variable |
+---------+--------+----------+

* *wr_cnt* is the number of device writes coalesced in the message.
* *wrs* is an array of device writes, each in the format defined below.

Single Device Write Format
""""""""""""""""""""""""""

+--------+--------+----------+
| Name   | Offset | Size     |
+========+========+==========+
| offset | 0      | 8        |
+--------+--------+----------+
| region | 8      | 4        |
+--------+--------+----------+
| count  | 12     | 4        |
+--------+--------+----------+
| data   | 16     | 8        |
+--------+--------+----------+

* *offset* is the offset into the region being accessed.
* *region* is the index of the region being accessed.
* *count* is the size of the data to be transferred.  This format can
  only describe writes of 8 bytes or less.
* *data* is the data to write.
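
To make the two layouts above concrete, here is a minimal sketch that packs a
list of small writes into one request payload. The names ``WRITE_ENTRY`` and
``pack_write_multi`` are hypothetical. Little-endian byte order, and padding
*data* with zero bytes up to its fixed 8-byte size, are assumptions of this
sketch.

```python
import struct

# One device write: offset (8), region (4), count (4), data (8) = 24 bytes.
# Little-endian byte order is an assumption of this sketch.
WRITE_ENTRY = struct.Struct("<QII8s")


def pack_write_multi(writes) -> bytes:
    """Pack (offset, region, data) tuples into a single request payload."""
    writes = list(writes)
    # wr_cnt is an 8-byte count of the entries that follow.
    parts = [struct.pack("<Q", len(writes))]
    for offset, region, data in writes:
        if len(data) > 8:
            raise ValueError("this format only describes writes of 8 bytes or less")
        # The "8s" field pads shorter data with zero bytes (an assumption
        # about how unused data bytes are filled).
        parts.append(WRITE_ENTRY.pack(offset, region, len(data), data))
    return b"".join(parts)
```

With this layout a payload carrying ``n`` writes is ``8 + 24 * n`` bytes long,
so a receiver can check *wr_cnt* against the payload size before decoding the
entries.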

Reply
^^^^^

+---------+--------+----------+
| Name    | Offset | Size     |
+=========+========+==========+
| wr_cnt  | 0      | 8        |
+---------+--------+----------+

* *wr_cnt* is the number of device writes completed.


Appendices
==========

Unused VFIO ``ioctl()`` commands
--------------------------------

The following VFIO commands do not have an equivalent vfio-user command:

* ``VFIO_GET_API_VERSION``
* ``VFIO_CHECK_EXTENSION``
* ``VFIO_SET_IOMMU``
* ``VFIO_GROUP_GET_STATUS``
* ``VFIO_GROUP_SET_CONTAINER``
* ``VFIO_GROUP_UNSET_CONTAINER``
* ``VFIO_GROUP_GET_DEVICE_FD``
* ``VFIO_IOMMU_GET_INFO``

However, once support for live migration for VFIO devices is finalized, some
of the above commands may have to be handled by the client in their
corresponding vfio-user form. This will be addressed in a future protocol
version.

VFIO groups and containers
^^^^^^^^^^^^^^^^^^^^^^^^^^

The current VFIO implementation includes group and container idioms that
describe how a device relates to the host IOMMU. In the vfio-user
implementation, the IOMMU is implemented in software by the client, and is not
visible to the server. The simplest approach is for the client to put each
device into its own group and container.

Backend Program Conventions
---------------------------

vfio-user backend program conventions are based on the vhost-user ones.

* The backend program must not daemonize itself.
* No assumptions must be made as to what access the backend program has on the
  system.
* File descriptors 0, 1 and 2 must exist, must have regular
  stdin/stdout/stderr semantics, and can be redirected.
* The backend program must honor the SIGTERM signal.
* The backend program must accept the following command line options:

  * ``--socket-path=PATH``: path to UNIX domain socket,
  * ``--fd=FDNUM``: file descriptor for UNIX domain socket, incompatible with
    ``--socket-path``

* The backend program must be accompanied by a JSON file stored under
  ``/usr/share/vfio-user``.
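
The command line convention above could be handled along the following lines.
This is only a sketch: the program name ``vfio-user-backend`` and the function
``build_parser`` are hypothetical. The sketch does not decide what a backend
should do when neither option is given.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the two socket options listed above."""
    # The program name is a placeholder chosen for this sketch.
    parser = argparse.ArgumentParser(prog="vfio-user-backend")
    # --fd is incompatible with --socket-path, so both go in one
    # mutually exclusive group.
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--socket-path", metavar="PATH",
                       help="path to UNIX domain socket")
    group.add_argument("--fd", metavar="FDNUM", type=int,
                       help="file descriptor for UNIX domain socket")
    return parser
```

Calling ``build_parser().parse_args(["--fd=3"])`` yields a namespace with
``fd`` set to ``3`` and ``socket_path`` set to ``None``; passing both options
makes ``argparse`` report an error and exit.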

TODO add schema similar to docs/interop/vhost-user.json.
