==============
NVMe Emulation
==============

QEMU provides NVMe emulation through the ``nvme``, ``nvme-ns`` and
``nvme-subsys`` devices.

See the following sections for specific information on:

  * `Adding NVMe Devices`_, `additional namespaces`_ and `NVM subsystems`_.
  * Configuration of `Optional Features`_ such as `Controller Memory Buffer`_,
    `Simple Copy`_, `Zoned Namespaces`_, `metadata`_ and `End-to-End Data
    Protection`_.

Adding NVMe Devices
===================

Controller Emulation
--------------------

The QEMU emulated NVMe controller implements version 1.4 of the NVM Express
specification. All mandatory features are implemented, with a couple of
exceptions and limitations:

  * Accounting numbers in the SMART/Health log page are reset when the device
    is power cycled.
  * Interrupt Coalescing is not supported and is disabled by default.

The simplest way to attach an NVMe controller on the QEMU PCI bus is to add the
following parameters:

.. code-block:: console

    -drive file=nvm.img,if=none,id=nvm
    -device nvme,serial=deadbeef,drive=nvm

There are a number of optional general parameters for the ``nvme`` device. Some
are mentioned here, but see ``-device nvme,help`` to list all possible
parameters.

``max_ioqpairs=UINT32`` (default: ``64``)
  Set the maximum number of allowed I/O queue pairs. This replaces the
  deprecated ``num_queues`` parameter.

``msix_qsize=UINT16`` (default: ``65``)
  The number of MSI-X vectors that the device should support.

``mdts=UINT8`` (default: ``7``)
  Set the Maximum Data Transfer Size of the device.

``use-intel-id`` (default: ``off``)
  Since QEMU 5.2, the device uses a QEMU allocated "Red Hat" PCI Device and
  Vendor ID. Set this to ``on`` to revert to the unallocated Intel ID
  previously used.

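As a rough illustration of how ``mdts`` maps to a byte limit, the sketch below
assumes that the value is a power-of-two exponent in units of the minimum
memory page size (as described for ``zoned.zasl`` under `Zoned Namespaces`_)
and that this page size is 4 KiB. Both points are assumptions made for the
example, not guarantees about every configuration.

.. code-block:: shell

   # Assumption: the minimum memory page size (CAP.MPSMIN) is 4 KiB.
   page_size=4096
   mdts=7    # documented default of the mdts parameter

   # The limit is 2^mdts pages, expressed here in bytes.
   max_transfer_bytes=$(( (1 << mdts) * page_size ))
   echo "mdts=${mdts}: ${max_transfer_bytes} bytes"

Under these assumptions the default of ``7`` corresponds to 512 KiB per
command.
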
Additional Namespaces
---------------------

In the simplest possible invocation sketched above, the device only supports a
single namespace with the namespace identifier ``1``. To support multiple
namespaces and additional features, the ``nvme-ns`` device must be used.

.. code-block:: console

   -device nvme,id=nvme-ctrl-0,serial=deadbeef
   -drive file=nvm-1.img,if=none,id=nvm-1
   -device nvme-ns,drive=nvm-1
   -drive file=nvm-2.img,if=none,id=nvm-2
   -device nvme-ns,drive=nvm-2

The namespaces defined by the ``nvme-ns`` device will attach to the most
recently defined ``nvme-bus`` that is created by the ``nvme`` device. Namespace
identifiers are allocated automatically, starting from ``1``.

There are a number of parameters available:

``nsid`` (default: ``0``)
  Explicitly set the namespace identifier.

``uuid`` (default: *autogenerated*)
  Set the UUID of the namespace. This will be reported as a "Namespace UUID"
  descriptor in the Namespace Identification Descriptor List.

``eui64``
  Set the EUI-64 of the namespace. This will be reported as an "IEEE Extended
  Unique Identifier" descriptor in the Namespace Identification Descriptor List.
  Since machine type 6.1, a non-zero default value is used if the parameter
  is not provided. For earlier machine types the field defaults to 0.

``bus``
  If more than one ``nvme`` device is defined, this parameter may be used to
  attach the namespace to a specific ``nvme`` device (identified by an ``id``
  parameter on the controller device).

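The ``bus`` parameter can be illustrated with a configuration fragment like
the one below. It is only a sketch: the second controller, its ``id`` and its
serial number are made up for the example. Without ``bus``, the namespace
would attach to the most recently defined controller (``nvme-ctrl-1``); here
it is directed to the first one instead.

.. code-block:: console

   -device nvme,id=nvme-ctrl-0,serial=deadbeef
   -device nvme,id=nvme-ctrl-1,serial=deadbeee
   -drive file=nvm-1.img,if=none,id=nvm-1
   -device nvme-ns,drive=nvm-1,bus=nvme-ctrl-0
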
NVM Subsystems
--------------

Additional features become available if the controller device (``nvme``) is
linked to an NVM Subsystem device (``nvme-subsys``).

The NVM Subsystem emulation allows features such as shared namespaces and
multipath I/O.

.. code-block:: console

   -device nvme-subsys,id=nvme-subsys-0,nqn=subsys0
   -device nvme,serial=deadbeef,subsys=nvme-subsys-0
   -device nvme,serial=deadbeef,subsys=nvme-subsys-0

This will create an NVM subsystem with two controllers. Having controllers
linked to an ``nvme-subsys`` device allows additional ``nvme-ns`` parameters:

``shared`` (default: ``on`` since 6.2)
  Specifies that the namespace will be attached to all controllers in the
  subsystem. If set to ``off``, the namespace will remain a private namespace
  and may only be attached to a single controller at a time. Shared namespaces
  are always automatically attached to all controllers (also when controllers
  are hotplugged).

``detached`` (default: ``off``)
  If set to ``on``, the namespace will be available in the subsystem, but
  not attached to any controllers initially. A shared namespace with this set
  to ``on`` will never be automatically attached to controllers.

Thus, adding

.. code-block:: console

   -drive file=nvm-1.img,if=none,id=nvm-1
   -device nvme-ns,drive=nvm-1,nsid=1
   -drive file=nvm-2.img,if=none,id=nvm-2
   -device nvme-ns,drive=nvm-2,nsid=3,shared=off,detached=on

will cause NSID 1 to be a shared namespace that is initially attached to both
controllers. NSID 3 will be a private namespace due to ``shared=off`` and only
attachable to a single controller at a time. Additionally, it will not be
attached to any controller initially (due to ``detached=on``) or to hotplugged
controllers.

Optional Features
=================

Controller Memory Buffer
------------------------

``nvme`` device parameters related to the Controller Memory Buffer support:

``cmb_size_mb=UINT32`` (default: ``0``)
  This adds a Controller Memory Buffer of the given size at offset zero in BAR
  2.

``legacy-cmb`` (default: ``off``)
  By default, the device uses the "v1.4 scheme" for the Controller Memory
  Buffer support (i.e., the CMB is initially disabled and must be explicitly
  enabled by the host). Set this to ``on`` to behave as a v1.3 device with
  respect to the CMB.

Simple Copy
-----------

The device includes support for TP 4065 ("Simple Copy Command"). A number of
additional ``nvme-ns`` device parameters may be used to control the Copy
command limits:

``mssrl=UINT16`` (default: ``128``)
  Set the Maximum Single Source Range Length (``MSSRL``). This is the maximum
  number of logical blocks that may be specified in each source range.

``mcl=UINT32`` (default: ``128``)
  Set the Maximum Copy Length (``MCL``). This is the maximum number of logical
  blocks that may be specified in a Copy command (the total for all source
  ranges).

``msrc=UINT8`` (default: ``127``)
  Set the Maximum Source Range Count (``MSRC``). This is the maximum number of
  source ranges that may be used in a Copy command. This is a 0's based value.

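To make the interaction of the three limits concrete, the following sketch
checks a hypothetical Copy command against the documented defaults. The list
of source range lengths is invented for the example, and the script only
mirrors the descriptions above; it is not the device's actual validation code.

.. code-block:: shell

   # Namespace limits, set to the documented defaults.
   mssrl=128   # max logical blocks per source range
   mcl=128     # max logical blocks per Copy command
   msrc=127    # max source range count, 0's based (127 means 128 ranges)

   # Hypothetical Copy command: lengths of its source ranges in blocks.
   ranges="64 32 16"

   count=0
   total=0
   verdict=ok
   for nlb in $ranges; do
       count=$(( count + 1 ))
       total=$(( total + nlb ))
       # Each individual range must fit within MSSRL.
       if [ "$nlb" -gt "$mssrl" ]; then verdict="range exceeds mssrl"; fi
   done
   # MSRC is 0's based, so the real limit is msrc + 1 ranges.
   if [ "$count" -gt $(( msrc + 1 )) ]; then verdict="too many ranges"; fi
   # The sum over all ranges must fit within MCL.
   if [ "$total" -gt "$mcl" ]; then verdict="total exceeds mcl"; fi
   echo "ranges=${count} blocks=${total}: ${verdict}"

Note that with the defaults a single full-length range (128 blocks) already
exhausts ``mcl``, so raising ``mcl`` is usually needed before multi-range
copies of maximum-length ranges fit.
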
Zoned Namespaces
----------------

A namespace may be "Zoned" as defined by TP 4053 ("Zoned Namespaces"). Set
``zoned=on`` on an ``nvme-ns`` device to configure it as a zoned namespace.

The namespace may be configured with additional parameters:

``zoned.zone_size=SIZE`` (default: ``128MiB``)
  Define the zone size (``ZSZE``).

``zoned.zone_capacity=SIZE`` (default: ``0``)
  Define the zone capacity (``ZCAP``). If left at the default (``0``), the zone
  capacity will equal the zone size.

``zoned.descr_ext_size=UINT32`` (default: ``0``)
  Set the Zone Descriptor Extension Size (``ZDES``). Must be a multiple of 64
  bytes.

``zoned.cross_read=BOOL`` (default: ``off``)
  Set to ``on`` to allow reads to cross zone boundaries.

``zoned.max_active=UINT32`` (default: ``0``)
  Set the maximum number of active resources (``MAR``). The default (``0``)
  allows all zones to be active.

``zoned.max_open=UINT32`` (default: ``0``)
  Set the maximum number of open resources (``MOR``). The default (``0``)
  allows all zones to be open. If ``zoned.max_active`` is specified, this value
  must be less than or equal to that.

``zoned.zasl=UINT8`` (default: ``0``)
  Set the maximum data transfer size for the Zone Append command. Like
  ``mdts``, the value is specified as a power of two (2^n) and is in units of
  the minimum memory page size (CAP.MPSMIN). The default value (``0``)
  has this property inherit the ``mdts`` value.

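The two constraints stated above (``ZDES`` being a multiple of 64 bytes, and
``zoned.max_open`` not exceeding ``zoned.max_active``) can be checked before
launching QEMU. The sketch below uses made-up values and a drive id borrowed
from the earlier examples; it only prints the resulting ``-device`` option and
does not start QEMU.

.. code-block:: shell

   # Hypothetical values chosen for illustration.
   descr_ext_size=128
   max_active=16
   max_open=8

   config_ok=yes
   # ZDES must be a multiple of 64 bytes.
   if [ $(( descr_ext_size % 64 )) -ne 0 ]; then config_ok=no; fi
   # With a max_active limit set, max_open must not exceed it.
   if [ "$max_active" -ne 0 ] && [ "$max_open" -gt "$max_active" ]; then
       config_ok=no
   fi

   opts="zoned=on,zoned.descr_ext_size=${descr_ext_size}"
   opts="${opts},zoned.max_active=${max_active},zoned.max_open=${max_open}"
   echo "${config_ok}: -device nvme-ns,drive=nvm-1,${opts}"
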
Flexible Data Placement
-----------------------

The device may be configured to support TP4146 ("Flexible Data Placement") by
configuring it (``fdp=on``) on the subsystem::

    -device nvme-subsys,id=nvme-subsys-0,nqn=subsys0,fdp=on,fdp.nruh=16

The subsystem emulates a single Endurance Group, on which Flexible Data
Placement will be supported. Also note that the device emulation deviates
slightly from the specification, by always enabling the "FDP Mode" feature on
the controller if the subsystem is configured for Flexible Data Placement.

Enabling Flexible Data Placement on the subsystem enables the following
parameters:

``fdp.nrg`` (default: ``1``)
  Set the number of Reclaim Groups.

``fdp.nruh`` (default: ``0``)
  Set the number of Reclaim Unit Handles. This is a mandatory parameter and
  must be non-zero.

``fdp.runs`` (default: ``96M``)
  Set the Reclaim Unit Nominal Size. Defaults to 96 MiB.

Namespaces within this subsystem may request Reclaim Unit Handles::

    -device nvme-ns,drive=nvm-1,fdp.ruhs=RUHLIST

The ``RUHLIST`` is a semicolon-separated list (e.g. ``0;1;2;3``) and may
include ranges (e.g. ``0;8-15``). If no reclaim unit handle list is specified,
the controller will assign the controller-specified reclaim unit handle to
placement handle identifier 0.

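To show how a ``RUHLIST`` reads, the sketch below expands one into individual
handle numbers. The helper name ``expand_ruhlist`` is hypothetical, and the
expansion assumes that ranges are inclusive at both ends; it illustrates the
list syntax only and is not QEMU's own parser.

.. code-block:: shell

   # Expand a RUHLIST such as "0;8-15" into individual handle numbers.
   expand_ruhlist() {
       list=$1
       out=""
       old_ifs=$IFS
       IFS=';'               # entries are separated by semicolons
       for item in $list; do
           case $item in
               *-*)
                   # A range "first-last", assumed inclusive.
                   first=${item%-*}
                   last=${item#*-}
                   i=$first
                   while [ "$i" -le "$last" ]; do
                       out="${out}${out:+ }${i}"
                       i=$(( i + 1 ))
                   done
                   ;;
               *)
                   # A single handle number.
                   out="${out}${out:+ }${item}"
                   ;;
           esac
       done
       IFS=$old_ifs
       echo "$out"
   }

   expand_ruhlist "0;8-15"

When passing such a list on a shell command line, the semicolons need to be
quoted so that the shell does not treat them as command separators.
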
Metadata
--------

The virtual namespace device supports LBA metadata in the form of separate
metadata (``MPTR``-based) and extended LBAs.

``ms=UINT16`` (default: ``0``)
  Defines the number of metadata bytes per LBA.

``mset=UINT8`` (default: ``0``)
  Set to ``1`` to enable extended LBAs.

End-to-End Data Protection
--------------------------

The virtual namespace device supports DIF- and DIX-based protection information
(depending on ``mset``).

``pi=UINT8`` (default: ``0``)
  Enable protection information of the specified type (type ``1``, ``2`` or
  ``3``).

``pil=UINT8`` (default: ``0``)
  Controls the location of the protection information within the metadata. Set
  to ``1`` to transfer protection information as the first bytes of metadata.
  Otherwise, the protection information is transferred as the last bytes of
  metadata.

``pif=UINT8`` (default: ``0``)
  By default, the namespace device uses 16 bit guard protection information
  format (``pif=0``). Set to ``2`` to enable 64 bit guard protection
  information format. This requires at least 16 bytes of metadata. Note that
  ``pif=1`` (32 bit guards) is currently not supported.

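As a hedged example of combining the metadata and protection information
parameters, the fragment below sketches a namespace with 16 bytes of metadata
per LBA, extended LBAs, Type 1 protection information and the 64 bit guard
format. The drive id is borrowed from the earlier examples and the particular
combination of values is an assumption chosen to satisfy the 16-byte
requirement stated for ``pif=2``.

.. code-block:: console

   -drive file=nvm-1.img,if=none,id=nvm-1
   -device nvme-ns,drive=nvm-1,ms=16,mset=1,pi=1,pif=2
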
Virtualization Enhancements and SR-IOV (Experimental Support)
-------------------------------------------------------------

The ``nvme`` device supports Single Root I/O Virtualization and Sharing
along with Virtualization Enhancements. The controller has to be linked to
an NVM Subsystem device (``nvme-subsys``) for use with SR-IOV.

A number of parameters are present (**please note that they may be
subject to change**):

``sriov_max_vfs`` (default: ``0``)
  Indicates the maximum number of PCIe virtual functions supported
  by the controller. Specifying a non-zero value enables reporting of both
  SR-IOV and ARI (Alternative Routing-ID Interpretation) capabilities
  by the NVMe device. Virtual function controllers will not report SR-IOV.

``sriov_vq_flexible``
  Indicates the total number of flexible queue resources assignable to all
  the secondary controllers. Implicitly sets the number of primary
  controller's private resources to ``(max_ioqpairs - sriov_vq_flexible)``.

``sriov_vi_flexible``
  Indicates the total number of flexible interrupt resources assignable to
  all the secondary controllers. Implicitly sets the number of primary
  controller's private resources to ``(msix_qsize - sriov_vi_flexible)``.

``sriov_max_vi_per_vf`` (default: ``0``)
  Indicates the maximum number of virtual interrupt resources assignable
  to a secondary controller. The default ``0`` resolves to
  ``(sriov_vi_flexible / sriov_max_vfs)``.

``sriov_max_vq_per_vf`` (default: ``0``)
  Indicates the maximum number of virtual queue resources assignable to
  a secondary controller. The default ``0`` resolves to
  ``(sriov_vq_flexible / sriov_max_vfs)``.

The simplest possible invocation enables the capability to set up one VF
controller and assign an admin queue, an IO queue, and an MSI-X interrupt.

.. code-block:: console

   -device nvme-subsys,id=subsys0
   -device nvme,serial=deadbeef,subsys=subsys0,sriov_max_vfs=1,sriov_vq_flexible=2,sriov_vi_flexible=1

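The resource split implied by this invocation can be worked out from the
formulas given in the parameter descriptions. The sketch below assumes that
``max_ioqpairs`` and ``msix_qsize`` are left at their documented defaults; the
variable names for the results are invented for the example.

.. code-block:: shell

   max_ioqpairs=64      # documented default
   msix_qsize=65        # documented default
   sriov_max_vfs=1
   sriov_vq_flexible=2
   sriov_vi_flexible=1

   # Resources that stay private to the primary controller.
   primary_private_vq=$(( max_ioqpairs - sriov_vq_flexible ))
   primary_private_vi=$(( msix_qsize - sriov_vi_flexible ))

   # Per-VF maximums when sriov_max_v{q,i}_per_vf are left at 0.
   max_vq_per_vf=$(( sriov_vq_flexible / sriov_max_vfs ))
   max_vi_per_vf=$(( sriov_vi_flexible / sriov_max_vfs ))

   echo "primary: vq=${primary_private_vq} vi=${primary_private_vi}"
   echo "per VF:  vq=${max_vq_per_vf} vi=${max_vi_per_vf}"

With these values the single VF can be given two queue resources (one admin
queue and one IO queue) and one interrupt resource, matching the description
above.
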
The minimum steps required to configure a functional NVMe secondary
controller are:

  * unbind flexible resources from the primary controller

.. code-block:: console

   nvme virt-mgmt /dev/nvme0 -c 0 -r 1 -a 1 -n 0
   nvme virt-mgmt /dev/nvme0 -c 0 -r 0 -a 1 -n 0

  * perform a Function Level Reset on the primary controller to actually
    release the resources

.. code-block:: console

   echo 1 > /sys/bus/pci/devices/0000:01:00.0/reset

  * enable VF

.. code-block:: console

   echo 1 > /sys/bus/pci/devices/0000:01:00.0/sriov_numvfs

  * assign the flexible resources to the VF and set it ONLINE

.. code-block:: console

   nvme virt-mgmt /dev/nvme0 -c 1 -r 1 -a 8 -n 1
   nvme virt-mgmt /dev/nvme0 -c 1 -r 0 -a 8 -n 2
   nvme virt-mgmt /dev/nvme0 -c 1 -r 0 -a 9 -n 0

  * bind the NVMe driver to the VF

.. code-block:: console

   echo 0000:01:00.1 > /sys/bus/pci/drivers/nvme/bind
365