======================================================
Device Specification for Inter-VM shared memory device
======================================================

The Inter-VM shared memory device (ivshmem) is designed to share a
memory region between multiple QEMU processes running different guests
and the host.  In order for all guests to be able to pick up the
shared memory area, it is modeled by QEMU as a PCI device exposing
said memory to the guest as a PCI BAR.

The device can use a shared memory object on the host directly, or it
can obtain one from an ivshmem server.

In the latter case, the device can additionally interrupt its peers, and
get interrupted by its peers.

For information on configuring the ivshmem device on the QEMU
command line, see :doc:`../system/devices/ivshmem`.

The ivshmem PCI device's guest interface
========================================

The device has vendor ID 1af4, device ID 1110, revision 1.  Before
QEMU 2.6.0, it had revision 0.

PCI BARs
--------

The ivshmem PCI device has two or three BARs:

- BAR0 holds device registers (256 Byte MMIO)
- BAR1 holds MSI-X table and PBA (only ivshmem-doorbell)
- BAR2 maps the shared memory object

There are two ways to use this device:

- If you only need the shared memory part, BAR2 suffices.  This way,
  you have access to the shared memory in the guest and can use it as
  you see fit.

- If you additionally need the capability for peers to interrupt each
  other, you need BAR0 and BAR1.  You will most likely want to write a
  kernel driver to handle interrupts.  This requires the device to be
  configured for interrupts.

Before QEMU 2.6.0, BAR2 could initially be invalid if the device was
configured for interrupts.  It became safely accessible only after
the ivshmem server provided the shared memory.  These devices have PCI
revision 0 rather than 1.  Guest software should wait for the
IVPosition register (described below) to become non-negative before
accessing BAR2.

Revision 0 of the device cannot tell guest software whether it is
configured for interrupts.

PCI device registers
--------------------

BAR 0 contains the following registers:

::

    Offset  Size  Access      On reset  Function
        0     4   read/write        0   Interrupt Mask
                                        bit 0: peer interrupt (rev 0)
                                               reserved       (rev 1)
                                        bit 1..31: reserved
        4     4   read/write        0   Interrupt Status
                                        bit 0: peer interrupt (rev 0)
                                               reserved       (rev 1)
                                        bit 1..31: reserved
        8     4   read-only   0 or ID   IVPosition
       12     4   write-only      N/A   Doorbell
                                        bit 0..15: vector
                                        bit 16..31: peer ID
       16   240   none            N/A   reserved

Software should only access the registers as specified in column
"Access".  Reserved bits should be ignored on read, and preserved on
write.

In revision 0 of the device, the Interrupt Status and Interrupt Mask
registers together control the legacy INTx interrupt when the device
has no MSI-X capability: INTx is asserted when the bit-wise AND of
Status and Mask is non-zero.  Interrupt Status Register bit 0 becomes
1 when an interrupt request from a peer is received.  Reading the
register clears it.

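The INTx rule above can be modeled in a few lines of Python.  This is
only an illustrative sketch of the described behavior for a revision 0
device without MSI-X capability, not QEMU's implementation; the class
and method names are made up for this example.

```python
PEER_INTERRUPT = 1 << 0  # bit 0 of Interrupt Mask / Interrupt Status (rev 0)


class Rev0InterruptRegs:
    """Hypothetical model of the rev 0 Interrupt Mask and Status registers."""

    def __init__(self):
        self.mask = 0    # reset value 0
        self.status = 0  # reset value 0

    def peer_interrupt(self):
        # A peer's interrupt request sets Interrupt Status bit 0.
        self.status |= PEER_INTERRUPT

    def intx_asserted(self):
        # INTx is asserted when the bit-wise AND of Status and Mask is non-zero.
        return (self.status & self.mask) != 0

    def read_status(self):
        # Reading the Interrupt Status register clears it.
        value = self.status
        self.status = 0
        return value
```

With the mask left at its reset value of 0, a peer interrupt sets the
status bit but does not assert INTx; setting mask bit 0 lets it through,
and reading the status register deasserts it again.
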
IVPosition Register: if the device is not configured for interrupts,
this is zero.  Otherwise, it is the device's ID (between 0 and 65535).

Before QEMU 2.6.0, the register may read -1 for a short while after
reset.  These devices have PCI revision 0 rather than 1.

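Because the register is 32 bits wide, a raw read of -1 shows up as
0xFFFFFFFF.  The following Python sketch shows one way guest-side code
could interpret the raw value as signed before deciding whether to
proceed.  The function names are hypothetical, and how the raw value is
read from BAR0 is outside the scope of the example.

```python
def ivposition_signed(raw):
    """Interpret a raw 32-bit IVPosition read as a signed integer."""
    raw &= 0xFFFFFFFF
    return raw - (1 << 32) if raw & 0x80000000 else raw


def ivposition_valid(raw):
    """True once IVPosition is non-negative (always the case on revision 1)."""
    return ivposition_signed(raw) >= 0
```
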
There is no good way for software to find out whether the device is
configured for interrupts.  A positive IVPosition means interrupts,
but zero could be either.

Doorbell Register: writing this register requests to interrupt a peer.
The written value's high 16 bits are the ID of the peer to interrupt,
and its low 16 bits select an interrupt vector.

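As a sketch of this encoding, the following Python helper builds the
value to write.  The names ``DOORBELL_OFFSET`` and ``doorbell_value``
are illustrative and not part of any QEMU API; the actual MMIO write to
BAR0 is left out.

```python
DOORBELL_OFFSET = 12  # byte offset of the Doorbell register in BAR0


def doorbell_value(peer_id, vector):
    """Pack a peer ID (high 16 bits) and a vector (low 16 bits)."""
    if not 0 <= peer_id <= 0xFFFF:
        raise ValueError("peer ID must fit in 16 bits")
    if not 0 <= vector <= 0xFFFF:
        raise ValueError("vector must fit in 16 bits")
    return (peer_id << 16) | vector
```

For example, ``doorbell_value(3, 1)`` yields ``0x00030001``, which
requests vector 1 of the peer with ID 3.
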
If the device is not configured for interrupts, the write is ignored.

If the interrupt hasn't completed setup, the write is ignored.  The
device cannot tell guest software whether setup is complete.
Interrupts can regress to this state on migration.

If the peer with the requested ID isn't connected, or it has fewer
interrupt vectors connected, the write is ignored.  The device cannot
tell guest software what peers are connected, or how many interrupt
vectors are connected.

The peer's interrupt for this vector then becomes pending.  There is
no way for software to clear the pending bit, and a polling mode of
operation is therefore impossible.

If the peer is a revision 0 device without MSI-X capability, its
Interrupt Status register is set to 1.  This asserts INTx unless
masked by the Interrupt Mask register.  The device cannot communicate
the interrupt vector to guest software in that case.

With multiple MSI-X vectors, different vectors can be used to indicate
that different events have occurred.  The semantics of interrupt
vectors are left to the application.

Interrupt infrastructure
========================

When configured for interrupts, the peers share eventfd objects in
addition to shared memory.  The shared resources are managed by an
ivshmem server.

The ivshmem server
------------------

The server listens on a UNIX domain socket.

For each new client that connects to the server, the server

- picks an ID,
- creates eventfd file descriptors for the interrupt vectors,
- sends the ID and the file descriptor for the shared memory to the
  new client,
- sends connect notifications for the new client to the other clients
  (these contain file descriptors for sending interrupts),
- sends connect notifications for the other clients to the new client,
  and
- sends interrupt setup messages to the new client (these contain file
  descriptors for receiving interrupts).

The first client to connect to the server receives ID zero.

When a client disconnects from the server, the server sends disconnect
notifications to the other clients.

The next section describes the protocol in detail.

If the server terminates without sending disconnect notifications for
its connected clients, the clients can elect to continue.  They can
communicate with each other normally, but won't receive disconnect
notifications when a peer disconnects, and no new clients can connect.
There is no way for the clients to connect to a restarted server.  The
device cannot tell guest software whether the server is still up.

Example server code is in contrib/ivshmem-server/.  It is not meant to
be used in production.  It assumes all clients use the same number of
interrupt vectors.

A standalone client is in contrib/ivshmem-client/.  It can be useful
for debugging.

The ivshmem Client-Server Protocol
----------------------------------

An ivshmem device configured for interrupts connects to an ivshmem
server.  This section details the protocol between the two.

The connection is one-way: the server sends messages to the client.
Each message consists of a single 8 byte little-endian signed number,
and may be accompanied by a file descriptor via SCM_RIGHTS.  Both
client and server close the connection on error.

Note: QEMU currently doesn't close the connection right on error, but
only when the character device is destroyed.

On connect, the server sends the following messages in order:

1. The protocol version number, currently zero.  The client should
   close the connection on receipt of versions it can't handle.

2. The client's ID.  This is unique among all clients of this server.
   IDs must be between 0 and 65535, because the Doorbell register
   provides only 16 bits for them.

3. The number -1, accompanied by the file descriptor for the shared
   memory.

4. Connect notifications for existing other clients, if any.  This is
   a peer ID (number between 0 and 65535 other than the client's ID),
   repeated N times.  Each repetition is accompanied by one file
   descriptor.  These are for interrupting the peer with that ID using
   vector 0,..,N-1, in order.  If the client is configured for fewer
   vectors, it closes the extra file descriptors.  If it is configured
   for more, the extra vectors remain unconnected.

5. Interrupt setup.  This is the client's own ID, repeated N times.
   Each repetition is accompanied by one file descriptor.  These are
   for receiving interrupts from peers using vector 0,..,N-1, in
   order.  If the client is configured for fewer vectors, it closes
   the extra file descriptors.  If it is configured for more, the
   extra vectors remain unconnected.

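The on-connect sequence in steps 1 to 5 can be sketched from the
server's side as follows.  This is an illustration only, not the code
in contrib/ivshmem-server/: the function names are hypothetical, file
descriptors are represented by descriptive strings rather than real
descriptors, and every client is assumed to use the same number of
vectors N.

```python
import struct

PROTOCOL_VERSION = 0


def encode_message(value):
    """Encode one message as an 8-byte little-endian signed integer."""
    return struct.pack("<q", value)


def on_connect_messages(client_id, existing_peers, n_vectors):
    """Return the on-connect messages as (value, fd_description) pairs.

    fd_description is None when no file descriptor accompanies the message.
    """
    msgs = [
        (PROTOCOL_VERSION, None),      # 1. protocol version
        (client_id, None),             # 2. the client's ID
        (-1, "shared memory"),         # 3. -1 plus the shared memory fd
    ]
    for peer in existing_peers:        # 4. connect notifications
        for vector in range(n_vectors):
            msgs.append((peer, f"interrupt peer {peer}, vector {vector}"))
    for vector in range(n_vectors):    # 5. interrupt setup
        msgs.append((client_id, f"receive on vector {vector}"))
    return msgs
```

A real server would send each encoded value over the UNIX domain
socket and attach the corresponding file descriptor via SCM_RIGHTS.
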
From then on, the server sends these kinds of messages:

6. Connection / disconnection notification.  This is a peer ID.

  - If the number comes with a file descriptor, it's a connection
    notification, exactly like in step 4.

  - Else, it's a disconnection notification for the peer with that ID.

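On the client side, messages that carry an ID (steps 4 to 6) can be
told apart by whether a file descriptor is attached and whether the ID
is the client's own.  The following Python sketch shows one possible
bookkeeping scheme.  The ``PeerTable`` class is hypothetical and is not
how QEMU stores this state; descriptors that a real client would close
are merely recorded in a list here, and dropping a disconnected peer's
descriptors is an assumption of this example.

```python
import struct


def decode_message(payload):
    """Decode one message: an 8-byte little-endian signed integer."""
    (value,) = struct.unpack("<q", payload)
    return value


class PeerTable:
    """Hypothetical client-side state for peers and interrupt vectors."""

    def __init__(self, own_id, n_vectors):
        self.own_id = own_id
        self.n_vectors = n_vectors  # vectors this client is configured for
        self.peers = {}             # peer ID -> fds, indexed by vector
        self.own_fds = []           # fds for receiving, indexed by vector
        self.to_close = []          # fds a real client would close

    def handle(self, value, fd=None):
        if not 0 <= value <= 65535:
            raise ValueError("ID out of range")
        if fd is None:
            # No file descriptor: disconnection notification.
            self.to_close.extend(self.peers.pop(value, []))
            return "disconnect"
        if value == self.own_id:
            fds, kind = self.own_fds, "interrupt setup"
        else:
            fds, kind = self.peers.setdefault(value, []), "connect"
        if len(fds) < self.n_vectors:
            fds.append(fd)            # next vector, in order
        else:
            self.to_close.append(fd)  # extra vector: close it
        return kind
```
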
Known bugs:

* The protocol changed incompatibly in QEMU 2.5.  Before, messages
  were native endian long, and there was no version number.

* The protocol is poorly designed.

The ivshmem Client-Client Protocol
----------------------------------

An ivshmem device configured for interrupts receives eventfd file
descriptors for interrupting peers and getting interrupted by peers
from the server, as explained in the previous section.

To interrupt a peer, the device writes the 8-byte integer 1 in native
byte order to the respective file descriptor.

To receive an interrupt, the device reads and discards as many 8-byte
integers as it can.