==============
NVMe Emulation
==============

QEMU provides NVMe emulation through the ``nvme``, ``nvme-ns`` and
``nvme-subsys`` devices.

See the following sections for specific information on:

  * `Adding NVMe Devices`_, `additional namespaces`_ and `NVM subsystems`_.
  * Configuration of `Optional Features`_ such as `Controller Memory Buffer`_,
    `Simple Copy`_, `Zoned Namespaces`_, `metadata`_ and `End-to-End Data
    Protection`_.

Adding NVMe Devices
===================

Controller Emulation
--------------------

The QEMU emulated NVMe controller implements version 1.4 of the NVM Express
specification. All mandatory features are implemented, with a couple of
exceptions and limitations:

  * Accounting numbers in the SMART/Health log page are reset when the device
    is power cycled.
  * Interrupt Coalescing is not supported and is disabled by default.

The simplest way to attach an NVMe controller on the QEMU PCI bus is to add the
following parameters:

.. code-block:: console

    -drive file=nvm.img,if=none,id=nvm
    -device nvme,serial=deadbeef,drive=nvm

There are a number of optional general parameters for the ``nvme`` device. Some
are mentioned here, but see ``-device nvme,help`` to list all possible
parameters.

``max_ioqpairs=UINT32`` (default: ``64``)
  Set the maximum number of allowed I/O queue pairs. This replaces the
  deprecated ``num_queues`` parameter.

``msix_qsize=UINT16`` (default: ``65``)
  The number of MSI-X vectors that the device should support.

``mdts=UINT8`` (default: ``7``)
  Set the Maximum Data Transfer Size of the device.

``use-intel-id`` (default: ``off``)
  Since QEMU 5.2, the device uses a QEMU allocated "Red Hat" PCI Device and
  Vendor ID. Set this to ``on`` to revert to the unallocated Intel ID
  previously used.
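
As a rough illustration of how these parameters combine, the following sketch
limits the controller to 8 I/O queue pairs and lowers the maximum transfer
size. The values are arbitrary choices for the example, not recommendations.
``msix_qsize=9`` simply mirrors the relationship between the defaults (one
vector more than the number of I/O queue pairs). ``mdts`` is a power-of-two
exponent in units of the minimum memory page size, so ``mdts=5`` would allow
2^5 = 32 pages per transfer.

.. code-block:: console

    -drive file=nvm.img,if=none,id=nvm
    -device nvme,serial=deadbeef,drive=nvm,max_ioqpairs=8,msix_qsize=9,mdts=5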

Additional Namespaces
---------------------

In the simplest possible invocation sketched above, the device only supports a
single namespace with the namespace identifier ``1``. To support multiple
namespaces and additional features, the ``nvme-ns`` device must be used.

.. code-block:: console

   -device nvme,id=nvme-ctrl-0,serial=deadbeef
   -drive file=nvm-1.img,if=none,id=nvm-1
   -device nvme-ns,drive=nvm-1
   -drive file=nvm-2.img,if=none,id=nvm-2
   -device nvme-ns,drive=nvm-2

The namespaces defined by the ``nvme-ns`` device will attach to the most
recently defined ``nvme-bus`` that is created by the ``nvme`` device. Namespace
identifiers are allocated automatically, starting from ``1``.

There are a number of parameters available:

``nsid`` (default: ``0``)
  Explicitly set the namespace identifier. The default (``0``) leaves the
  identifier to be allocated automatically.

``uuid`` (default: *autogenerated*)
  Set the UUID of the namespace. This will be reported as a "Namespace UUID"
  descriptor in the Namespace Identification Descriptor List.

``nguid``
  Set the NGUID of the namespace. This will be reported as a "Namespace Globally
  Unique Identifier" descriptor in the Namespace Identification Descriptor List.
  It is specified as a string of hexadecimal digits containing exactly 16 bytes
  or "auto" for a random value. An optional '-' separator may be used to group
  bytes. If not specified, the NGUID will remain all zeros.

``eui64``
  Set the EUI-64 of the namespace. This will be reported as an "IEEE Extended
  Unique Identifier" descriptor in the Namespace Identification Descriptor List.
  Since machine type 6.1, a non-zero default value is used if the parameter
  is not provided. For earlier machine types the field defaults to 0.

``bus``
  If more than one ``nvme`` device is defined, this parameter may be used to
  attach the namespace to a specific ``nvme`` device (identified by an ``id``
  parameter on the controller device).
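
A sketch of how ``bus`` and ``nsid`` might be combined when two controllers are
present. The controller ``id`` values, the second serial number and the image
file names are made up for the example. Without the ``bus`` parameters, both
namespaces would attach to the most recently defined controller
(``nvme-ctrl-1``).

.. code-block:: console

   -device nvme,id=nvme-ctrl-0,serial=deadbeef
   -device nvme,id=nvme-ctrl-1,serial=cafebabe
   -drive file=nvm-1.img,if=none,id=nvm-1
   -device nvme-ns,drive=nvm-1,bus=nvme-ctrl-0,nsid=1
   -drive file=nvm-2.img,if=none,id=nvm-2
   -device nvme-ns,drive=nvm-2,bus=nvme-ctrl-1,nsid=2,nguid=auto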

NVM Subsystems
--------------

Additional features become available if the controller device (``nvme``) is
linked to an NVM Subsystem device (``nvme-subsys``).

The NVM Subsystem emulation allows features such as shared namespaces and
multipath I/O.

.. code-block:: console

   -device nvme-subsys,id=nvme-subsys-0,nqn=subsys0
   -device nvme,serial=deadbeef,subsys=nvme-subsys-0
   -device nvme,serial=deadbeef,subsys=nvme-subsys-0

This will create an NVM subsystem with two controllers. Having controllers
linked to an ``nvme-subsys`` device allows additional ``nvme-ns`` parameters:

``shared`` (default: ``on`` since 6.2)
  Specifies that the namespace will be attached to all controllers in the
  subsystem. If set to ``off``, the namespace will remain a private namespace
  and may only be attached to a single controller at a time. Shared namespaces
  are always automatically attached to all controllers (also when controllers
  are hotplugged).

``detached`` (default: ``off``)
  If set to ``on``, the namespace will be available in the subsystem, but
  not attached to any controllers initially. A shared namespace with this set
  to ``on`` will never be automatically attached to controllers.

Thus, adding

.. code-block:: console

   -drive file=nvm-1.img,if=none,id=nvm-1
   -device nvme-ns,drive=nvm-1,nsid=1
   -drive file=nvm-2.img,if=none,id=nvm-2
   -device nvme-ns,drive=nvm-2,nsid=3,shared=off,detached=on

will cause NSID 1 to be a shared namespace that is initially attached to both
controllers. NSID 3 will be a private namespace due to ``shared=off`` and only
attachable to a single controller at a time. Additionally, it will not be
attached to any controller initially (due to ``detached=on``) or to hotplugged
controllers.

Optional Features
=================

Controller Memory Buffer
------------------------

``nvme`` device parameters related to the Controller Memory Buffer support:

``cmb_size_mb=UINT32`` (default: ``0``)
  This adds a Controller Memory Buffer of the given size at offset zero in BAR
  2.

``legacy-cmb`` (default: ``off``)
  By default, the device uses the "v1.4 scheme" for the Controller Memory
  Buffer support (i.e., the CMB is initially disabled and must be explicitly
  enabled by the host). Set this to ``on`` to behave as a v1.3 device with
  respect to the CMB.
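
For instance, the following sketch adds a 64 MiB Controller Memory Buffer and
opts in to the v1.3 behavior. The size is an arbitrary value chosen for the
example.

.. code-block:: console

    -drive file=nvm.img,if=none,id=nvm
    -device nvme,serial=deadbeef,drive=nvm,cmb_size_mb=64,legacy-cmb=on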

Simple Copy
-----------

The device includes support for TP 4065 ("Simple Copy Command"). A number of
additional ``nvme-ns`` device parameters may be used to control the Copy
command limits:

``mssrl=UINT16`` (default: ``128``)
  Set the Maximum Single Source Range Length (``MSSRL``). This is the maximum
  number of logical blocks that may be specified in each source range.

``mcl=UINT32`` (default: ``128``)
  Set the Maximum Copy Length (``MCL``). This is the maximum number of logical
  blocks that may be specified in a Copy command (the total for all source
  ranges).

``msrc=UINT8`` (default: ``127``)
  Set the Maximum Source Range Count (``MSRC``). This is the maximum number of
  source ranges that may be used in a Copy command. This is a 0's based value.
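
As an illustration, the sketch below restricts a namespace to Copy commands
with at most 16 source ranges (``msrc=15``, since the value is 0's based), each
covering at most 64 logical blocks, and at most 256 logical blocks in total.
The limits, the controller ``id`` and the image file name are arbitrary example
values.

.. code-block:: console

   -device nvme,id=nvme-ctrl-0,serial=deadbeef
   -drive file=nvm-1.img,if=none,id=nvm-1
   -device nvme-ns,drive=nvm-1,mssrl=64,mcl=256,msrc=15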

Zoned Namespaces
----------------

A namespace may be "Zoned" as defined by TP 4053 ("Zoned Namespaces"). Set
``zoned=on`` on an ``nvme-ns`` device to configure it as a zoned namespace.

The namespace may be configured with additional parameters:

``zoned.zone_size=SIZE`` (default: ``128MiB``)
  Define the zone size (``ZSZE``).

``zoned.zone_capacity=SIZE`` (default: ``0``)
  Define the zone capacity (``ZCAP``). If left at the default (``0``), the zone
  capacity will equal the zone size.

``zoned.descr_ext_size=UINT32`` (default: ``0``)
  Set the Zone Descriptor Extension Size (``ZDES``). Must be a multiple of 64
  bytes.

``zoned.cross_read=BOOL`` (default: ``off``)
  Set to ``on`` to allow reads to cross zone boundaries.

``zoned.max_active=UINT32`` (default: ``0``)
  Set the maximum number of active resources (``MAR``). The default (``0``)
  allows all zones to be active.

``zoned.max_open=UINT32`` (default: ``0``)
  Set the maximum number of open resources (``MOR``). The default (``0``)
  allows all zones to be open. If ``zoned.max_active`` is specified, this value
  must be less than or equal to that.

``zoned.zasl=UINT8`` (default: ``0``)
  Set the maximum data transfer size for the Zone Append command. Like
  ``mdts``, the value is specified as a power of two (2^n) and is in units of
  the minimum memory page size (CAP.MPSMIN). With the default value (``0``),
  this property inherits the ``mdts`` value.
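
A sketch of a zoned namespace with 64 MiB zones of which 48 MiB are usable,
allowing at most 16 active and 8 open zones (so that ``zoned.max_open`` stays
at or below ``zoned.max_active``). The numbers and the image file name are
arbitrary, and the ``M`` suffix assumes that the usual QEMU size syntax is
accepted for ``SIZE`` values.

.. code-block:: console

   -device nvme,id=nvme-ctrl-0,serial=deadbeef
   -drive file=zns.img,if=none,id=zns
   -device nvme-ns,drive=zns,zoned=on,zoned.zone_size=64M,zoned.zone_capacity=48M,zoned.max_active=16,zoned.max_open=8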

Flexible Data Placement
-----------------------

The device may be configured to support TP 4146 ("Flexible Data Placement") by
configuring it (``fdp=on``) on the subsystem::

    -device nvme-subsys,id=nvme-subsys-0,nqn=subsys0,fdp=on,fdp.nruh=16

The subsystem emulates a single Endurance Group, on which Flexible Data
Placement will be supported. Also note that the device emulation deviates
slightly from the specification, by always enabling the "FDP Mode" feature on
the controller if the subsystem is configured for Flexible Data Placement.

Enabling Flexible Data Placement on the subsystem enables the following
parameters:

``fdp.nrg`` (default: ``1``)
  Set the number of Reclaim Groups.

``fdp.nruh`` (default: ``0``)
  Set the number of Reclaim Unit Handles. This is a mandatory parameter and
  must be non-zero.

``fdp.runs`` (default: ``96M``)
  Set the Reclaim Unit Nominal Size. Defaults to 96 MiB.

Namespaces within this subsystem may request Reclaim Unit Handles::

    -device nvme-ns,drive=nvm-1,fdp.ruhs=RUHLIST

The ``RUHLIST`` is a semicolon-separated list (e.g. ``0;1;2;3``) and may
include ranges (e.g. ``0;8-15``). If no reclaim unit handle list is specified,
the controller will assign the controller-specified reclaim unit handle to
placement handle identifier 0.
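
Putting the pieces together, a complete invocation might look like the sketch
below, in which the namespace requests handle 0 and handles 8 through 15 out of
the 16 configured on the subsystem. The image file name is an example value.
The single quotes are an addition for interactive use: they keep a POSIX shell
from treating the semicolon in the ``fdp.ruhs`` list as a command separator::

    -device nvme-subsys,id=nvme-subsys-0,nqn=subsys0,fdp=on,fdp.nruh=16
    -device nvme,serial=deadbeef,subsys=nvme-subsys-0
    -drive file=nvm-1.img,if=none,id=nvm-1
    -device 'nvme-ns,drive=nvm-1,fdp.ruhs=0;8-15'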

Metadata
--------

The virtual namespace device supports LBA metadata in the form of separate
metadata (``MPTR``-based) and extended LBAs.

``ms=UINT16`` (default: ``0``)
  Defines the number of metadata bytes per LBA.

``mset=UINT8`` (default: ``0``)
  Set to ``1`` to enable extended LBAs.
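
For example, the following sketch gives each LBA 8 bytes of metadata and
transfers it as part of an extended LBA. Leaving ``mset`` at its default would
use separate (``MPTR``-based) metadata instead. The metadata size, controller
``id`` and image file name are example values.

.. code-block:: console

   -device nvme,id=nvme-ctrl-0,serial=deadbeef
   -drive file=nvm-1.img,if=none,id=nvm-1
   -device nvme-ns,drive=nvm-1,ms=8,mset=1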

End-to-End Data Protection
--------------------------

The virtual namespace device supports DIF- and DIX-based protection information
(depending on ``mset``).

``pi=UINT8`` (default: ``0``)
  Enable protection information of the specified type (type ``1``, ``2`` or
  ``3``).

``pil=UINT8`` (default: ``0``)
  Controls the location of the protection information within the metadata. Set
  to ``1`` to transfer protection information as the first bytes of metadata.
  Otherwise, the protection information is transferred as the last bytes of
  metadata.

``pif=UINT8`` (default: ``0``)
  By default, the namespace device uses 16 bit guard protection information
  format (``pif=0``). Set to ``2`` to enable 64 bit guard protection
  information format. This requires at least 16 bytes of metadata. Note that
  ``pif=1`` (32 bit guards) is currently not supported.
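
A sketch of two protected namespaces on one controller. The first uses Type 1
protection information with the default 16 bit guard format on extended LBAs.
The second uses the 64 bit guard format with separate metadata and therefore
sets ``ms=16`` to meet the 16 byte minimum. The choice of ``ms=8`` for the
first namespace assumes that 8 bytes are sufficient for the 16 bit guard
format. All identifiers and file names are example values.

.. code-block:: console

   -device nvme,id=nvme-ctrl-0,serial=deadbeef
   -drive file=nvm-1.img,if=none,id=nvm-1
   -device nvme-ns,drive=nvm-1,ms=8,mset=1,pi=1
   -drive file=nvm-2.img,if=none,id=nvm-2
   -device nvme-ns,drive=nvm-2,ms=16,pi=1,pif=2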

Virtualization Enhancements and SR-IOV (Experimental Support)
-------------------------------------------------------------

The ``nvme`` device supports Single Root I/O Virtualization and Sharing
along with Virtualization Enhancements. The controller has to be linked to
an NVM Subsystem device (``nvme-subsys``) for use with SR-IOV.

A number of parameters are present (**please note that they may be
subject to change**):

``sriov_max_vfs`` (default: ``0``)
  Indicates the maximum number of PCIe virtual functions supported
  by the controller. Specifying a non-zero value enables reporting of both
  SR-IOV and ARI (Alternative Routing-ID Interpretation) capabilities
  by the NVMe device. Virtual function controllers will not report SR-IOV.

``sriov_vq_flexible``
  Indicates the total number of flexible queue resources assignable to all
  the secondary controllers. Implicitly sets the number of primary
  controller's private resources to ``(max_ioqpairs - sriov_vq_flexible)``.

``sriov_vi_flexible``
  Indicates the total number of flexible interrupt resources assignable to
  all the secondary controllers. Implicitly sets the number of primary
  controller's private resources to ``(msix_qsize - sriov_vi_flexible)``.

``sriov_max_vi_per_vf`` (default: ``0``)
  Indicates the maximum number of virtual interrupt resources assignable
  to a secondary controller. The default ``0`` resolves to
  ``(sriov_vi_flexible / sriov_max_vfs)``.

``sriov_max_vq_per_vf`` (default: ``0``)
  Indicates the maximum number of virtual queue resources assignable to
  a secondary controller. The default ``0`` resolves to
  ``(sriov_vq_flexible / sriov_max_vfs)``.
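
To make the resource arithmetic concrete, consider the sketch below, which
allows two VFs and leaves ``max_ioqpairs`` (``64``) and ``msix_qsize`` (``65``)
at their defaults. Following the formulas above, the primary controller would
keep 64 - 4 = 60 private queue resources and 65 - 2 = 63 private interrupt
resources, and the per-VF limits would default to 4 / 2 = 2 queue resources
and 2 / 2 = 1 interrupt resource. The flexible resource counts are example
values only.

.. code-block:: console

   -device nvme-subsys,id=subsys0
   -device nvme,serial=deadbeef,subsys=subsys0,sriov_max_vfs=2,sriov_vq_flexible=4,sriov_vi_flexible=2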

The simplest possible invocation enables the capability to set up one VF
controller and assign an admin queue, an IO queue, and an MSI-X interrupt.

.. code-block:: console

   -device nvme-subsys,id=subsys0
   -device nvme,serial=deadbeef,subsys=subsys0,sriov_max_vfs=1,sriov_vq_flexible=2,sriov_vi_flexible=1

The minimum steps required to configure a functional NVMe secondary
controller are:

  * unbind flexible resources from the primary controller

.. code-block:: console

   nvme virt-mgmt /dev/nvme0 -c 0 -r 1 -a 1 -n 0
   nvme virt-mgmt /dev/nvme0 -c 0 -r 0 -a 1 -n 0

  * perform a Function Level Reset on the primary controller to actually
    release the resources

.. code-block:: console

   echo 1 > /sys/bus/pci/devices/0000:01:00.0/reset

  * enable VF

.. code-block:: console

   echo 1 > /sys/bus/pci/devices/0000:01:00.0/sriov_numvfs

  * assign the flexible resources to the VF and set it ONLINE

.. code-block:: console

   nvme virt-mgmt /dev/nvme0 -c 1 -r 1 -a 8 -n 1
   nvme virt-mgmt /dev/nvme0 -c 1 -r 0 -a 8 -n 2
   nvme virt-mgmt /dev/nvme0 -c 1 -r 0 -a 9 -n 0

  * bind the NVMe driver to the VF

.. code-block:: console

   echo 0000:01:00.1 > /sys/bus/pci/drivers/nvme/bind

372