==============
NVMe Emulation
==============

QEMU provides NVMe emulation through the ``nvme``, ``nvme-ns`` and
``nvme-subsys`` devices.

See the following sections for specific information on:

  * `Adding NVMe Devices`_, `additional namespaces`_ and `NVM subsystems`_.
  * Configuration of `Optional Features`_ such as `Controller Memory Buffer`_,
    `Simple Copy`_, `Zoned Namespaces`_, `metadata`_ and `End-to-End Data
    Protection`_.

Adding NVMe Devices
===================

Controller Emulation
--------------------

The QEMU emulated NVMe controller implements version 1.4 of the NVM Express
specification. All mandatory features are implemented, with a couple of
exceptions and limitations:

  * Accounting numbers in the SMART/Health log page are reset when the device
    is power cycled.
  * Interrupt Coalescing is not supported and is disabled by default.

The simplest way to attach an NVMe controller on the QEMU PCI bus is to add the
following parameters:

.. code-block:: console

    -drive file=nvm.img,if=none,id=nvm
    -device nvme,serial=deadbeef,drive=nvm

There are a number of optional general parameters for the ``nvme`` device. Some
are mentioned here, but see ``-device nvme,help`` to list all possible
parameters.

``max_ioqpairs=UINT32`` (default: ``64``)
  Set the maximum number of allowed I/O queue pairs. This replaces the
  deprecated ``num_queues`` parameter.

``msix_qsize=UINT16`` (default: ``65``)
  The number of MSI-X vectors that the device should support.

``mdts=UINT8`` (default: ``7``)
  Set the Maximum Data Transfer Size of the device. The value is specified as
  a power of two (2^n) and is in units of the minimum memory page size
  (CAP.MPSMIN).

``use-intel-id`` (default: ``off``)
  Since QEMU 5.2, the device uses a QEMU allocated "Red Hat" PCI Device and
  Vendor ID. Set this to ``on`` to revert to the unallocated Intel ID
  previously used.

``ocp`` (default: ``off``)
  The Open Compute Project defines the Datacenter NVMe SSD Specification that
  sits on top of NVMe. It describes additional commands and NVMe behaviors
  specific to the datacenter. When this option is ``on``, OCP features such as
  the SMART / Health information extended log become available in the
  controller. We emulate version 5 of this log page.
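
As a sketch of how these parameters combine, the following hypothetical
invocation limits the controller to 8 I/O queue pairs and lowers ``mdts``. The
values are illustrative only. ``msix_qsize=9`` assumes one vector per I/O queue
pair plus one for the admin queue, mirroring the relationship between the
defaults (``64`` and ``65``). ``mdts`` is a power-of-two exponent in units of
the minimum memory page size (as described for ``zoned.zasl``), so with
``mdts=5`` and an assumed minimum page size of 4 KiB, the maximum transfer
size would be 2^5 * 4 KiB = 128 KiB.

.. code-block:: console

    -drive file=nvm.img,if=none,id=nvm
    -device nvme,serial=deadbeef,drive=nvm,max_ioqpairs=8,msix_qsize=9,mdts=5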

Additional Namespaces
---------------------

In the simplest possible invocation sketched above, the device only supports a
single namespace with the namespace identifier ``1``. To support multiple
namespaces and additional features, the ``nvme-ns`` device must be used.

.. code-block:: console

   -device nvme,id=nvme-ctrl-0,serial=deadbeef
   -drive file=nvm-1.img,if=none,id=nvm-1
   -device nvme-ns,drive=nvm-1
   -drive file=nvm-2.img,if=none,id=nvm-2
   -device nvme-ns,drive=nvm-2

The namespaces defined by the ``nvme-ns`` device will attach to the most
recently defined ``nvme-bus`` that is created by the ``nvme`` device. Namespace
identifiers are allocated automatically, starting from ``1``.

There are a number of parameters available:

``nsid`` (default: ``0``)
  Explicitly set the namespace identifier. With the default (``0``), the
  identifier is allocated automatically.

``uuid`` (default: *autogenerated*)
  Set the UUID of the namespace. This will be reported as a "Namespace UUID"
  descriptor in the Namespace Identification Descriptor List.

``nguid``
  Set the NGUID of the namespace. This will be reported as a "Namespace Globally
  Unique Identifier" descriptor in the Namespace Identification Descriptor List.
  It is specified as a string of hexadecimal digits containing exactly 16 bytes
  or "auto" for a random value. An optional '-' separator may be used to group
  bytes. If not specified, the NGUID will remain all zeros.

``eui64``
  Set the EUI-64 of the namespace. This will be reported as an "IEEE Extended
  Unique Identifier" descriptor in the Namespace Identification Descriptor List.
  Since machine type 6.1, a non-zero default value is used if the parameter
  is not provided. For earlier machine types the field defaults to 0.

``bus``
  If more than one ``nvme`` device is defined, this parameter may be used to
  attach the namespace to a specific ``nvme`` device (identified by an ``id``
  parameter on the controller device).
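
For example, assuming two controllers with the hypothetical identifiers
``nvme-ctrl-0`` and ``nvme-ctrl-1``, the ``bus`` parameter can pin each
namespace to a controller, while ``nsid`` and ``nguid`` set the identifiers
explicitly. The serial numbers and image file names below are placeholders.

.. code-block:: console

   -device nvme,id=nvme-ctrl-0,serial=deadbeef
   -device nvme,id=nvme-ctrl-1,serial=cafebabe
   -drive file=nvm-1.img,if=none,id=nvm-1
   -device nvme-ns,drive=nvm-1,bus=nvme-ctrl-0,nsid=1
   -drive file=nvm-2.img,if=none,id=nvm-2
   -device nvme-ns,drive=nvm-2,bus=nvme-ctrl-1,nsid=2,nguid=auto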

NVM Subsystems
--------------

Additional features become available if the controller device (``nvme``) is
linked to an NVM Subsystem device (``nvme-subsys``).

The NVM Subsystem emulation allows features such as shared namespaces and
multipath I/O.

.. code-block:: console

   -device nvme-subsys,id=nvme-subsys-0,nqn=subsys0
   -device nvme,serial=deadbeef,subsys=nvme-subsys-0
   -device nvme,serial=deadbeef,subsys=nvme-subsys-0

This will create an NVM subsystem with two controllers. Having controllers
linked to an ``nvme-subsys`` device allows additional ``nvme-ns`` parameters:

``shared`` (default: ``on`` since 6.2)
  Specifies that the namespace will be attached to all controllers in the
  subsystem. If set to ``off``, the namespace will remain a private namespace
  and may only be attached to a single controller at a time. Shared namespaces
  are always automatically attached to all controllers (also when controllers
  are hotplugged).

``detached`` (default: ``off``)
  If set to ``on``, the namespace will be available in the subsystem, but
  not attached to any controllers initially. A shared namespace with this set
  to ``on`` will never be automatically attached to controllers.

Thus, adding

.. code-block:: console

   -drive file=nvm-1.img,if=none,id=nvm-1
   -device nvme-ns,drive=nvm-1,nsid=1
   -drive file=nvm-2.img,if=none,id=nvm-2
   -device nvme-ns,drive=nvm-2,nsid=3,shared=off,detached=on

will cause NSID 1 to be a shared namespace that is initially attached to both
controllers. NSID 3 will be a private namespace due to ``shared=off`` and only
attachable to a single controller at a time. Additionally, it will not be
attached to any controller initially (due to ``detached=on``) or to hotplugged
controllers.
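
From within a Linux guest, the initially detached namespace could later be
attached with ``nvme-cli``. This is a sketch only: it assumes that the first
controller appears as ``/dev/nvme0`` and that its controller identifier is
``0``. Both assumptions should be checked in the guest (the first command
prints the controller identifier) before attaching.

.. code-block:: console

   nvme id-ctrl /dev/nvme0 | grep cntlid
   nvme attach-ns /dev/nvme0 --namespace-id=3 --controllers=0
   nvme list-ns /dev/nvme0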

Optional Features
=================

Controller Memory Buffer
------------------------

``nvme`` device parameters related to the Controller Memory Buffer support:

``cmb_size_mb=UINT32`` (default: ``0``)
  This adds a Controller Memory Buffer of the given size at offset zero in BAR
  2.

``legacy-cmb`` (default: ``off``)
  By default, the device uses the "v1.4 scheme" for the Controller Memory
  Buffer support (i.e., the CMB is initially disabled and must be explicitly
  enabled by the host). Set this to ``on`` to behave as a v1.3 device with
  respect to the CMB.
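
A minimal sketch, reusing the ``nvm`` drive from the controller example and an
arbitrary illustrative size of 64 MiB:

.. code-block:: console

    -drive file=nvm.img,if=none,id=nvm
    -device nvme,serial=deadbeef,drive=nvm,cmb_size_mb=64

Appending ``legacy-cmb=on`` to the ``nvme`` device would select the v1.3
behavior described above.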

Simple Copy
-----------

The device includes support for TP 4065 ("Simple Copy Command"). A number of
additional ``nvme-ns`` device parameters may be used to control the Copy
command limits:

``mssrl=UINT16`` (default: ``128``)
  Set the Maximum Single Source Range Length (``MSSRL``). This is the maximum
  number of logical blocks that may be specified in each source range.

``mcl=UINT32`` (default: ``128``)
  Set the Maximum Copy Length (``MCL``). This is the maximum number of logical
  blocks that may be specified in a Copy command (the total for all source
  ranges).

``msrc=UINT8`` (default: ``127``)
  Set the Maximum Source Range Count (``MSRC``). This is the maximum number of
  source ranges that may be used in a Copy command. This is a 0's based value,
  so the default allows 128 source ranges.
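
As an illustration, the following sketch tightens the per-range limit and
raises the total copy length. The numbers and the drive name are arbitrary
assumptions. Because ``msrc`` is a 0's based value, ``msrc=15`` allows up to 16
source ranges per Copy command.

.. code-block:: console

   -drive file=nvm-1.img,if=none,id=nvm-1
   -device nvme-ns,drive=nvm-1,mssrl=64,mcl=256,msrc=15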

Zoned Namespaces
----------------

A namespace may be "Zoned" as defined by TP 4053 ("Zoned Namespaces"). Set
``zoned=on`` on an ``nvme-ns`` device to configure it as a zoned namespace.

The namespace may be configured with additional parameters:

``zoned.zone_size=SIZE`` (default: ``128MiB``)
  Define the zone size (``ZSZE``).

``zoned.zone_capacity=SIZE`` (default: ``0``)
  Define the zone capacity (``ZCAP``). If left at the default (``0``), the zone
  capacity will equal the zone size.

``zoned.descr_ext_size=UINT32`` (default: ``0``)
  Set the Zone Descriptor Extension Size (``ZDES``). Must be a multiple of 64
  bytes.

``zoned.cross_read=BOOL`` (default: ``off``)
  Set to ``on`` to allow reads to cross zone boundaries.

``zoned.max_active=UINT32`` (default: ``0``)
  Set the maximum number of active resources (``MAR``). The default (``0``)
  allows all zones to be active.

``zoned.max_open=UINT32`` (default: ``0``)
  Set the maximum number of open resources (``MOR``). The default (``0``)
  allows all zones to be open. If ``zoned.max_active`` is specified, this value
  must be less than or equal to that.

``zoned.zasl=UINT8`` (default: ``0``)
  Set the maximum data transfer size for the Zone Append command. Like
  ``mdts``, the value is specified as a power of two (2^n) and is in units of
  the minimum memory page size (CAP.MPSMIN). With the default value (``0``),
  this property inherits the ``mdts`` value.
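
A sketch of a zoned namespace using several of these parameters is shown
below. The sizes and limits are illustrative assumptions, chosen so that the
zone capacity is smaller than the zone size and ``zoned.max_open`` does not
exceed ``zoned.max_active``.

.. code-block:: console

   -drive file=nvm-1.img,if=none,id=nvm-1
   -device nvme-ns,drive=nvm-1,zoned=on,zoned.zone_size=64M,zoned.zone_capacity=48M,zoned.max_active=32,zoned.max_open=16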

Flexible Data Placement
-----------------------

The device may be configured to support TP 4146 ("Flexible Data Placement") by
configuring it (``fdp=on``) on the subsystem::

    -device nvme-subsys,id=nvme-subsys-0,nqn=subsys0,fdp=on,fdp.nruh=16

The subsystem emulates a single Endurance Group, on which Flexible Data
Placement will be supported. Also note that the device emulation deviates
slightly from the specification, by always enabling the "FDP Mode" feature on
the controller if the subsystem is configured for Flexible Data Placement.

Enabling Flexible Data Placement on the subsystem enables the following
parameters:

``fdp.nrg`` (default: ``1``)
  Set the number of Reclaim Groups.

``fdp.nruh`` (default: ``0``)
  Set the number of Reclaim Unit Handles. This is a mandatory parameter and
  must be non-zero.

``fdp.runs`` (default: ``96M``)
  Set the Reclaim Unit Nominal Size.

Namespaces within this subsystem may request Reclaim Unit Handles::

    -device nvme-ns,drive=nvm-1,fdp.ruhs=RUHLIST

The ``RUHLIST`` is a semicolon-separated list (e.g. ``0;1;2;3``) and may
include ranges (e.g. ``0;8-15``). If no reclaim unit handle list is specified,
the controller will assign the controller-specified reclaim unit handle to
placement handle identifier 0.
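
Putting the pieces together, a hypothetical configuration with 16 Reclaim Unit
Handles, some of which are requested by two namespaces, might look as follows.
The split between the namespaces, the drive names and the serial number are
illustrative assumptions. When the command line is typed into a shell, the
``-device`` argument has to be quoted, because an unquoted semicolon is
interpreted by the shell as a command separator::

    -device nvme-subsys,id=nvme-subsys-0,nqn=subsys0,fdp=on,fdp.nruh=16
    -device nvme,serial=deadbeef,subsys=nvme-subsys-0
    -drive file=nvm-1.img,if=none,id=nvm-1
    -device 'nvme-ns,drive=nvm-1,fdp.ruhs=0;1;2;3'
    -drive file=nvm-2.img,if=none,id=nvm-2
    -device 'nvme-ns,drive=nvm-2,fdp.ruhs=4;8-15'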

Metadata
--------

The virtual namespace device supports LBA metadata in the form of separate
metadata (``MPTR``-based) and extended LBAs.

``ms=UINT16`` (default: ``0``)
  Defines the number of metadata bytes per LBA.

``mset=UINT8`` (default: ``0``)
  Set to ``1`` to enable extended LBAs.
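
For instance, a namespace with 8 bytes of metadata per LBA, transferred as
part of an extended LBA, could be sketched as follows (the drive name and
metadata size are placeholders):

.. code-block:: console

   -drive file=nvm-1.img,if=none,id=nvm-1
   -device nvme-ns,drive=nvm-1,ms=8,mset=1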

End-to-End Data Protection
--------------------------

The virtual namespace device supports DIF- and DIX-based protection information
(depending on ``mset``).

``pi=UINT8`` (default: ``0``)
  Enable protection information of the specified type (type ``1``, ``2`` or
  ``3``).

``pil=UINT8`` (default: ``0``)
  Controls the location of the protection information within the metadata. Set
  to ``1`` to transfer protection information as the first bytes of metadata.
  Otherwise, the protection information is transferred as the last bytes of
  metadata.

``pif=UINT8`` (default: ``0``)
  By default, the namespace device uses the 16-bit guard protection information
  format (``pif=0``). Set to ``2`` to enable the 64-bit guard protection
  information format. This requires at least 16 bytes of metadata. Note that
  ``pif=1`` (32-bit guards) is currently not supported.
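
The following sketch shows two hypothetical namespaces with Type 1 protection
information: the first uses the default 16-bit guard format, the second the
64-bit guard format with the 16 bytes of metadata it requires. The choice of
``ms=8`` for the first namespace is an assumption about the metadata size
needed by the 16-bit guard format.

.. code-block:: console

   -drive file=nvm-1.img,if=none,id=nvm-1
   -device nvme-ns,drive=nvm-1,ms=8,pi=1
   -drive file=nvm-2.img,if=none,id=nvm-2
   -device nvme-ns,drive=nvm-2,ms=16,pi=1,pif=2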

Virtualization Enhancements and SR-IOV (Experimental Support)
-------------------------------------------------------------

The ``nvme`` device supports Single Root I/O Virtualization and Sharing
along with Virtualization Enhancements. The controller has to be linked to
an NVM Subsystem device (``nvme-subsys``) for use with SR-IOV.

A number of parameters are present (**please note that they may be
subject to change**):

``sriov_max_vfs`` (default: ``0``)
  Indicates the maximum number of PCIe virtual functions supported
  by the controller. Specifying a non-zero value enables reporting of both
  SR-IOV and ARI (Alternative Routing-ID Interpretation) capabilities
  by the NVMe device. Virtual function controllers will not report SR-IOV.

``sriov_vq_flexible``
  Indicates the total number of flexible queue resources assignable to all
  the secondary controllers. Implicitly sets the number of primary
  controller's private resources to ``(max_ioqpairs - sriov_vq_flexible)``.

``sriov_vi_flexible``
  Indicates the total number of flexible interrupt resources assignable to
  all the secondary controllers. Implicitly sets the number of primary
  controller's private resources to ``(msix_qsize - sriov_vi_flexible)``.

``sriov_max_vi_per_vf`` (default: ``0``)
  Indicates the maximum number of virtual interrupt resources assignable
  to a secondary controller. The default ``0`` resolves to
  ``(sriov_vi_flexible / sriov_max_vfs)``.

``sriov_max_vq_per_vf`` (default: ``0``)
  Indicates the maximum number of virtual queue resources assignable to
  a secondary controller. The default ``0`` resolves to
  ``(sriov_vq_flexible / sriov_max_vfs)``.
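
As a worked example of the implicit values above (the numbers are
illustrative), consider a controller that keeps the default ``max_ioqpairs=64``
and ``msix_qsize=65`` and is configured for four virtual functions:

.. code-block:: console

   -device nvme-subsys,id=subsys0
   -device nvme,serial=deadbeef,subsys=subsys0,sriov_max_vfs=4,sriov_vq_flexible=8,sriov_vi_flexible=4

Following the formulas above, the primary controller keeps 64 - 8 = 56 private
queue resources and 65 - 4 = 61 private interrupt resources, while each
secondary controller may be assigned at most 8 / 4 = 2 queue resources and
4 / 4 = 1 interrupt resource.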

The simplest possible invocation enables the capability to set up one VF
controller and assign an admin queue, an I/O queue, and an MSI-X interrupt.

.. code-block:: console

   -device nvme-subsys,id=subsys0
   -device nvme,serial=deadbeef,subsys=subsys0,sriov_max_vfs=1,
    sriov_vq_flexible=2,sriov_vi_flexible=1

The minimum steps required to configure a functional NVMe secondary
controller are:

  * unbind flexible resources from the primary controller

.. code-block:: console

   nvme virt-mgmt /dev/nvme0 -c 0 -r 1 -a 1 -n 0
   nvme virt-mgmt /dev/nvme0 -c 0 -r 0 -a 1 -n 0

  * perform a Function Level Reset on the primary controller to actually
    release the resources

.. code-block:: console

   echo 1 > /sys/bus/pci/devices/0000:01:00.0/reset

  * enable the VF

.. code-block:: console

   echo 1 > /sys/bus/pci/devices/0000:01:00.0/sriov_numvfs

  * assign the flexible resources to the VF and set it ONLINE

.. code-block:: console

   nvme virt-mgmt /dev/nvme0 -c 1 -r 1 -a 8 -n 1
   nvme virt-mgmt /dev/nvme0 -c 1 -r 0 -a 8 -n 2
   nvme virt-mgmt /dev/nvme0 -c 1 -r 0 -a 9 -n 0

  * bind the NVMe driver to the VF

.. code-block:: console

   echo 0000:01:00.1 > /sys/bus/pci/drivers/nvme/bind
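
The ``nvme virt-mgmt`` invocations above use short options. As a reading aid,
the assignment step is repeated below with what are assumed to be the
equivalent ``nvme-cli`` long options: ``--cntlid`` names the controller,
``--rt`` selects the resource type (``0`` for queue resources, ``1`` for
interrupt resources), ``--act`` selects the action (``8`` assigns resources,
``9`` sets the secondary controller online) and ``--nr`` gives the number of
resources. The option names should be checked against ``nvme virt-mgmt --help``
for the installed version. The final command, also an assumption about the
installed ``nvme-cli``, lists the secondary controllers so that the resulting
state can be inspected.

.. code-block:: console

   nvme virt-mgmt /dev/nvme0 --cntlid=1 --rt=1 --act=8 --nr=1
   nvme virt-mgmt /dev/nvme0 --cntlid=1 --rt=0 --act=8 --nr=2
   nvme virt-mgmt /dev/nvme0 --cntlid=1 --rt=0 --act=9 --nr=0
   nvme list-secondary /dev/nvme0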
379