==============
NVMe Emulation
==============

QEMU provides NVMe emulation through the ``nvme``, ``nvme-ns`` and
``nvme-subsys`` devices.

See the following sections for specific information on:

  * `Adding NVMe Devices`_, `additional namespaces`_ and `NVM subsystems`_.
  * Configuration of `Optional Features`_ such as `Controller Memory Buffer`_,
    `Simple Copy`_, `Zoned Namespaces`_, `metadata`_ and `End-to-End Data
    Protection`_.

Adding NVMe Devices
===================

Controller Emulation
--------------------

The QEMU emulated NVMe controller implements version 1.4 of the NVM Express
specification. All mandatory features are implemented, with a couple of
exceptions and limitations:

  * Accounting numbers in the SMART/Health log page are reset when the device
    is power cycled.
  * Interrupt Coalescing is not supported and is disabled by default.

The simplest way to attach an NVMe controller on the QEMU PCI bus is to add the
following parameters:

.. code-block:: console

    -drive file=nvm.img,if=none,id=nvm
    -device nvme,serial=deadbeef,drive=nvm

There are a number of optional general parameters for the ``nvme`` device. Some
are mentioned here, but see ``-device nvme,help`` to list all possible
parameters.

``max_ioqpairs=UINT32`` (default: ``64``)
  Set the maximum number of allowed I/O queue pairs. This replaces the
  deprecated ``num_queues`` parameter.

``msix_qsize=UINT16`` (default: ``65``)
  The number of MSI-X vectors that the device should support.

``mdts=UINT8`` (default: ``7``)
  Set the Maximum Data Transfer Size of the device.

``use-intel-id`` (default: ``off``)
  Since QEMU 5.2, the device uses a QEMU allocated "Red Hat" PCI Device and
  Vendor ID. Set this to ``on`` to revert to the unallocated Intel ID
  previously used.

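As a rough illustration of what the ``mdts`` value means in bytes, the sketch
below assumes that ``mdts`` is a power-of-two exponent in units of the minimum
memory page size (CAP.MPSMIN), as described for ``zoned.zasl`` in this
document, and that this page size is 4 KiB. The page size and the shell
variable names are assumptions made for the example.

.. code-block:: bash

   # mdts is the documented device default; page_size is an ASSUMED
   # CAP.MPSMIN page size of 4 KiB.
   mdts=7
   page_size=4096

   # Maximum transfer size = 2^mdts * page size.
   max_bytes=$(( (1 << mdts) * page_size ))
   echo "max transfer: $(( max_bytes / 1024 )) KiB"

Under these assumptions the default of ``7`` corresponds to 512 KiB per
command.
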
Additional Namespaces
---------------------

In the simplest possible invocation sketched above, the device only supports a
single namespace with the namespace identifier ``1``. To support multiple
namespaces and additional features, the ``nvme-ns`` device must be used.

.. code-block:: console

   -device nvme,id=nvme-ctrl-0,serial=deadbeef
   -drive file=nvm-1.img,if=none,id=nvm-1
   -device nvme-ns,drive=nvm-1
   -drive file=nvm-2.img,if=none,id=nvm-2
   -device nvme-ns,drive=nvm-2

The namespaces defined by the ``nvme-ns`` device will attach to the most
recently defined ``nvme-bus`` that is created by the ``nvme`` device. Namespace
identifiers are allocated automatically, starting from ``1``.

There are a number of parameters available:

``nsid`` (default: ``0``)
  Explicitly set the namespace identifier. The default (``0``) leaves the
  identifier to be allocated automatically.

``uuid`` (default: *autogenerated*)
  Set the UUID of the namespace. This will be reported as a "Namespace UUID"
  descriptor in the Namespace Identification Descriptor List.

``eui64``
  Set the EUI-64 of the namespace. This will be reported as an "IEEE Extended
  Unique Identifier" descriptor in the Namespace Identification Descriptor List.
  Since machine type 6.1, a non-zero default value is used if the parameter
  is not provided. For earlier machine types the field defaults to 0.

``bus``
  If more than one ``nvme`` device is defined, this parameter may be used to
  attach the namespace to a specific ``nvme`` device (identified by an ``id``
  parameter on the controller device).

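To illustrate the ``bus`` parameter, the following sketch defines two
controllers and explicitly attaches one namespace to each. The controller ids,
the second serial number and the image file names are placeholders chosen for
this example.

.. code-block:: console

   # placeholder ids, serials and image names
   -device nvme,id=nvme-ctrl-0,serial=deadbeef
   -device nvme,id=nvme-ctrl-1,serial=cafebabe
   -drive file=nvm-1.img,if=none,id=nvm-1
   -device nvme-ns,drive=nvm-1,bus=nvme-ctrl-0
   -drive file=nvm-2.img,if=none,id=nvm-2
   -device nvme-ns,drive=nvm-2,bus=nvme-ctrl-1

Without ``bus``, both namespaces would attach to the most recently defined
controller.
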
NVM Subsystems
--------------

Additional features become available if the controller device (``nvme``) is
linked to an NVM Subsystem device (``nvme-subsys``).

The NVM Subsystem emulation allows features such as shared namespaces and
multipath I/O.

.. code-block:: console

   -device nvme-subsys,id=nvme-subsys-0,nqn=subsys0
   -device nvme,serial=deadbeef,subsys=nvme-subsys-0
   -device nvme,serial=deadbeef,subsys=nvme-subsys-0

This will create an NVM subsystem with two controllers. Having controllers
linked to an ``nvme-subsys`` device allows additional ``nvme-ns`` parameters:

``shared`` (default: ``on`` since 6.2)
  Specifies that the namespace will be attached to all controllers in the
  subsystem. If set to ``off``, the namespace will remain a private namespace
  and may only be attached to a single controller at a time. Shared namespaces
  are always automatically attached to all controllers (also when controllers
  are hotplugged).

``detached`` (default: ``off``)
  If set to ``on``, the namespace will be available in the subsystem, but
  not attached to any controllers initially. A shared namespace with this set
  to ``on`` will never be automatically attached to controllers.

Thus, adding

.. code-block:: console

   -drive file=nvm-1.img,if=none,id=nvm-1
   -device nvme-ns,drive=nvm-1,nsid=1
   -drive file=nvm-2.img,if=none,id=nvm-2
   -device nvme-ns,drive=nvm-2,nsid=3,shared=off,detached=on

will cause NSID 1 to be a shared namespace that is initially attached to both
controllers. NSID 3 will be a private namespace due to ``shared=off`` and only
attachable to a single controller at a time. Additionally, it will not be
attached to any controller initially (due to ``detached=on``) or to hotplugged
controllers.

Optional Features
=================

Controller Memory Buffer
------------------------

``nvme`` device parameters related to the Controller Memory Buffer support:

``cmb_size_mb=UINT32`` (default: ``0``)
  This adds a Controller Memory Buffer of the given size at offset zero in BAR
  2.

``legacy-cmb`` (default: ``off``)
  By default, the device uses the "v1.4 scheme" for the Controller Memory
  Buffer support (i.e., the CMB is initially disabled and must be explicitly
  enabled by the host). Set this to ``on`` to behave as a v1.3 device with
  respect to the CMB.

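As a minimal sketch, a controller with a Controller Memory Buffer could be
configured as follows. The size of ``64`` and the image file name are arbitrary
values chosen for illustration, not recommendations.

.. code-block:: console

   # illustrative CMB size and image name
   -drive file=nvm.img,if=none,id=nvm
   -device nvme,serial=deadbeef,drive=nvm,cmb_size_mb=64
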
Simple Copy
-----------

The device includes support for TP 4065 ("Simple Copy Command"). A number of
additional ``nvme-ns`` device parameters may be used to control the Copy
command limits:

``mssrl=UINT16`` (default: ``128``)
  Set the Maximum Single Source Range Length (``MSSRL``). This is the maximum
  number of logical blocks that may be specified in each source range.

``mcl=UINT32`` (default: ``128``)
  Set the Maximum Copy Length (``MCL``). This is the maximum number of logical
  blocks that may be specified in a Copy command (the total for all source
  ranges).

``msrc=UINT8`` (default: ``127``)
  Set the Maximum Source Range Count (``MSRC``). This is the maximum number of
  source ranges that may be used in a Copy command. This is a 0's based value.

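The Copy limits might be adjusted along the following lines. The numbers are
illustrative only; since ``msrc`` is a 0's based value, ``msrc=15`` in this
sketch permits up to 16 source ranges per Copy command.

.. code-block:: console

   # illustrative limits and placeholder ids
   -device nvme,id=nvme-ctrl-0,serial=deadbeef
   -drive file=nvm-1.img,if=none,id=nvm-1
   -device nvme-ns,drive=nvm-1,mssrl=256,mcl=1024,msrc=15
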
Zoned Namespaces
----------------

A namespace may be "Zoned" as defined by TP 4053 ("Zoned Namespaces"). Set
``zoned=on`` on an ``nvme-ns`` device to configure it as a zoned namespace.

The namespace may be configured with additional parameters:

``zoned.zone_size=SIZE`` (default: ``128MiB``)
  Define the zone size (``ZSZE``).

``zoned.zone_capacity=SIZE`` (default: ``0``)
  Define the zone capacity (``ZCAP``). If left at the default (``0``), the zone
  capacity will equal the zone size.

``zoned.descr_ext_size=UINT32`` (default: ``0``)
  Set the Zone Descriptor Extension Size (``ZDES``). Must be a multiple of 64
  bytes.

``zoned.cross_read=BOOL`` (default: ``off``)
  Set to ``on`` to allow reads to cross zone boundaries.

``zoned.max_active=UINT32`` (default: ``0``)
  Set the maximum number of active resources (``MAR``). The default (``0``)
  allows all zones to be active.

``zoned.max_open=UINT32`` (default: ``0``)
  Set the maximum number of open resources (``MOR``). The default (``0``)
  allows all zones to be open. If ``zoned.max_active`` is specified, this value
  must be less than or equal to that.

``zoned.zasl=UINT8`` (default: ``0``)
  Set the maximum data transfer size for the Zone Append command. Like
  ``mdts``, the value is specified as a power of two (2^n) and is in units of
  the minimum memory page size (CAP.MPSMIN). The default value (``0``)
  makes this property inherit the ``mdts`` value.

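A zoned namespace could, for example, be set up as sketched below. The zone
geometry and resource limits are arbitrary illustrative values; they are chosen
so that the zone capacity does not exceed the zone size and ``zoned.max_open``
does not exceed ``zoned.max_active``.

.. code-block:: console

   # illustrative zone geometry and placeholder ids
   -device nvme,id=nvme-ctrl-0,serial=deadbeef
   -drive file=zns.img,if=none,id=zns
   -device nvme-ns,drive=zns,zoned=on,zoned.zone_size=64M,zoned.zone_capacity=62M,zoned.max_active=32,zoned.max_open=16
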
Metadata
--------

The virtual namespace device supports LBA metadata in the form of separate
metadata (``MPTR``-based) and extended LBAs.

``ms=UINT16`` (default: ``0``)
  Defines the number of metadata bytes per LBA.

``mset=UINT8`` (default: ``0``)
  Set to ``1`` to enable extended LBAs.

End-to-End Data Protection
--------------------------

The virtual namespace device supports DIF- and DIX-based protection information
(depending on ``mset``).

``pi=UINT8`` (default: ``0``)
  Enable protection information of the specified type (type ``1``, ``2`` or
  ``3``).

``pil=UINT8`` (default: ``0``)
  Controls the location of the protection information within the metadata. Set
  to ``1`` to transfer protection information as the first eight bytes of
  metadata. Otherwise, the protection information is transferred as the last
  eight bytes.

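Combining the metadata and protection information parameters, a namespace with
16 bytes of metadata per LBA, extended LBAs, and Type 1 protection information
placed in the first eight metadata bytes might be sketched as follows. The
specific values are assumptions picked for illustration.

.. code-block:: console

   # illustrative metadata/PI settings and placeholder ids
   -device nvme,id=nvme-ctrl-0,serial=deadbeef
   -drive file=nvm-1.img,if=none,id=nvm-1
   -device nvme-ns,drive=nvm-1,ms=16,mset=1,pi=1,pil=1
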
Virtualization Enhancements and SR-IOV (Experimental Support)
-------------------------------------------------------------

The ``nvme`` device supports Single Root I/O Virtualization and Sharing
along with Virtualization Enhancements. The controller has to be linked to
an NVM Subsystem device (``nvme-subsys``) for use with SR-IOV.

A number of parameters are present (**please note that they may be
subject to change**):

``sriov_max_vfs`` (default: ``0``)
  Indicates the maximum number of PCIe virtual functions supported
  by the controller. Specifying a non-zero value enables reporting of both
  SR-IOV and ARI (Alternative Routing-ID Interpretation) capabilities
  by the NVMe device. Virtual function controllers will not report SR-IOV.

``sriov_vq_flexible``
  Indicates the total number of flexible queue resources assignable to all
  the secondary controllers. Implicitly sets the number of primary
  controller's private resources to ``(max_ioqpairs - sriov_vq_flexible)``.

``sriov_vi_flexible``
  Indicates the total number of flexible interrupt resources assignable to
  all the secondary controllers. Implicitly sets the number of primary
  controller's private resources to ``(msix_qsize - sriov_vi_flexible)``.

``sriov_max_vi_per_vf`` (default: ``0``)
  Indicates the maximum number of virtual interrupt resources assignable
  to a secondary controller. The default ``0`` resolves to
  ``(sriov_vi_flexible / sriov_max_vfs)``.

``sriov_max_vq_per_vf`` (default: ``0``)
  Indicates the maximum number of virtual queue resources assignable to
  a secondary controller. The default ``0`` resolves to
  ``(sriov_vq_flexible / sriov_max_vfs)``.

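The resource split implied by these parameters can be worked through with
plain shell arithmetic. The sketch below uses the documented defaults for
``max_ioqpairs`` and ``msix_qsize`` together with illustrative SR-IOV settings
(one VF, two flexible queue resources, one flexible interrupt resource). The
shell variable names are local to this example and are not QEMU interfaces.

.. code-block:: bash

   # Documented defaults for the nvme device.
   max_ioqpairs=64
   msix_qsize=65

   # Illustrative SR-IOV settings (ASSUMED for this example).
   sriov_max_vfs=1
   sriov_vq_flexible=2
   sriov_vi_flexible=1

   # Private resources implicitly kept by the primary controller.
   private_vq=$(( max_ioqpairs - sriov_vq_flexible ))
   private_vi=$(( msix_qsize - sriov_vi_flexible ))

   # Per-VF limits when sriov_max_vq_per_vf / sriov_max_vi_per_vf are 0.
   max_vq_per_vf=$(( sriov_vq_flexible / sriov_max_vfs ))
   max_vi_per_vf=$(( sriov_vi_flexible / sriov_max_vfs ))

   echo "primary private: vq=$private_vq vi=$private_vi"
   echo "per-VF maximum:  vq=$max_vq_per_vf vi=$max_vi_per_vf"

With these inputs the primary controller keeps 62 queue and 64 interrupt
resources, and the single VF may be assigned up to two queue resources and one
interrupt resource.
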
The simplest possible invocation enables the capability to set up one VF
controller and assign an admin queue, an IO queue, and an MSI-X interrupt.

.. code-block:: console

   -device nvme-subsys,id=subsys0
   -device nvme,serial=deadbeef,subsys=subsys0,sriov_max_vfs=1,sriov_vq_flexible=2,sriov_vi_flexible=1

The minimum steps required to configure a functional NVMe secondary
controller are:

  * unbind flexible resources from the primary controller

.. code-block:: console

   nvme virt-mgmt /dev/nvme0 -c 0 -r 1 -a 1 -n 0
   nvme virt-mgmt /dev/nvme0 -c 0 -r 0 -a 1 -n 0

  * perform a Function Level Reset on the primary controller to actually
    release the resources

.. code-block:: console

   echo 1 > /sys/bus/pci/devices/0000:01:00.0/reset

  * enable VF

.. code-block:: console

   echo 1 > /sys/bus/pci/devices/0000:01:00.0/sriov_numvfs

  * assign the flexible resources to the VF and set it ONLINE

.. code-block:: console

   nvme virt-mgmt /dev/nvme0 -c 1 -r 1 -a 8 -n 1
   nvme virt-mgmt /dev/nvme0 -c 1 -r 0 -a 8 -n 2
   nvme virt-mgmt /dev/nvme0 -c 1 -r 0 -a 9 -n 0

  * bind the NVMe driver to the VF

.. code-block:: console

   echo 0000:01:00.1 > /sys/bus/pci/drivers/nvme/bind
323   echo 0000:01:00.1 > /sys/bus/pci/drivers/nvme/bind