==============
NVMe Emulation
==============

QEMU provides NVMe emulation through the ``nvme``, ``nvme-ns`` and
``nvme-subsys`` devices.

See the following sections for specific information on

  * `Adding NVMe Devices`_, `additional namespaces`_ and `NVM subsystems`_.
  * Configuration of `Optional Features`_ such as `Controller Memory Buffer`_,
    `Simple Copy`_, `Zoned Namespaces`_, `metadata`_ and `End-to-End Data
    Protection`_.

Adding NVMe Devices
===================

Controller Emulation
--------------------

The QEMU emulated NVMe controller implements version 1.4 of the NVM Express
specification. All mandatory features are implemented, with a couple of
exceptions and limitations:

  * Accounting numbers in the SMART/Health log page are reset when the device
    is power cycled.
  * Interrupt Coalescing is not supported and is disabled by default.

The simplest way to attach an NVMe controller on the QEMU PCI bus is to add the
following parameters:

.. code-block:: console

    -drive file=nvm.img,if=none,id=nvm
    -device nvme,serial=deadbeef,drive=nvm

There are a number of optional general parameters for the ``nvme`` device. Some
are mentioned here, but see ``-device nvme,help`` to list all possible
parameters.

``max_ioqpairs=UINT32`` (default: ``64``)
  Set the maximum number of allowed I/O queue pairs. This replaces the
  deprecated ``num_queues`` parameter.

``msix_qsize=UINT16`` (default: ``65``)
  The number of MSI-X vectors that the device should support.

``mdts=UINT8`` (default: ``7``)
  Set the Maximum Data Transfer Size of the device.

``use-intel-id`` (default: ``off``)
  Since QEMU 5.2, the device uses a QEMU allocated "Red Hat" PCI Device and
  Vendor ID. Set this to ``on`` to revert to the unallocated Intel ID
  previously used.

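
The ``mdts`` value is an exponent, not a byte count. The sketch below converts
it to bytes, assuming the usual NVMe encoding (a power of two in units of the
minimum memory page size, ``CAP.MPSMIN``) and assuming a 4 KiB minimum page
size. The helper name ``mdts_to_bytes`` and the page size are illustrative
assumptions and not part of QEMU.

.. code-block:: python

   from typing import Optional


   def mdts_to_bytes(mdts: int, min_page_size: int = 4096) -> Optional[int]:
       """Return the largest transfer in bytes, or None when unlimited.

       min_page_size is an assumption (4 KiB); check CAP.MPSMIN on the
       controller you are actually using.
       """
       if not 0 <= mdts <= 255:
           raise ValueError("mdts must fit in a UINT8")
       if mdts == 0:
           # In the NVMe specification a value of 0 means "no limit".
           return None
       return (1 << mdts) * min_page_size


   if __name__ == "__main__":
       print(mdts_to_bytes(7))

Under these assumptions the default ``mdts=7`` corresponds to 524288 bytes
(512 KiB) per command.
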

Additional Namespaces
---------------------

In the simplest possible invocation sketched above, the device only supports a
single namespace with the namespace identifier ``1``. To support multiple
namespaces and additional features, the ``nvme-ns`` device must be used.

.. code-block:: console

   -device nvme,id=nvme-ctrl-0,serial=deadbeef
   -drive file=nvm-1.img,if=none,id=nvm-1
   -device nvme-ns,drive=nvm-1
   -drive file=nvm-2.img,if=none,id=nvm-2
   -device nvme-ns,drive=nvm-2

The namespaces defined by the ``nvme-ns`` device will attach to the most
recently defined ``nvme-bus`` that is created by the ``nvme`` device. Namespace
identifiers are allocated automatically, starting from ``1``.

There are a number of parameters available:

``nsid`` (default: ``0``)
  Explicitly set the namespace identifier.

``uuid`` (default: *autogenerated*)
  Set the UUID of the namespace. This will be reported as a "Namespace UUID"
  descriptor in the Namespace Identification Descriptor List.

``nguid``
  Set the NGUID of the namespace. This will be reported as a "Namespace Globally
  Unique Identifier" descriptor in the Namespace Identification Descriptor List.
  It is specified as a string of hexadecimal digits containing exactly 16 bytes
  or "auto" for a random value. An optional '-' separator may be used to group
  bytes. If not specified, the NGUID will remain all zeros.

``eui64``
  Set the EUI-64 of the namespace. This will be reported as an "IEEE Extended
  Unique Identifier" descriptor in the Namespace Identification Descriptor List.
  Since machine type 6.1, a non-zero default value is used if the parameter
  is not provided. For earlier machine types the field defaults to 0.

``bus``
  If more than one ``nvme`` device is defined, this parameter may be used to
  attach the namespace to a specific ``nvme`` device (identified by an ``id``
  parameter on the controller device).

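
The ``nguid`` format described above can be checked before QEMU is launched.
The sketch below is a minimal validator, not QEMU's own parser: it assumes that
``-`` separators may appear anywhere between digits, and the function name
``normalize_nguid`` is hypothetical. QEMU may be stricter about separator
placement.

.. code-block:: python

   import re


   def normalize_nguid(value: str) -> str:
       """Return 'auto' or the 32 hex digits (16 bytes) of an NGUID string."""
       if value == "auto":
           return value
       # Assumption: separators only group bytes and carry no meaning.
       digits = value.replace("-", "")
       if not re.fullmatch(r"[0-9a-fA-F]{32}", digits):
           raise ValueError("nguid must contain exactly 16 bytes of hex digits")
       return digits.lower()


   if __name__ == "__main__":
       print(normalize_nguid("00112233-44556677-8899aabb-ccddeeff"))
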

NVM Subsystems
--------------

Additional features become available if the controller device (``nvme``) is
linked to an NVM Subsystem device (``nvme-subsys``).

The NVM Subsystem emulation allows features such as shared namespaces and
multipath I/O.

.. code-block:: console

   -device nvme-subsys,id=nvme-subsys-0,nqn=subsys0
   -device nvme,serial=deadbeef,subsys=nvme-subsys-0
   -device nvme,serial=deadbeef,subsys=nvme-subsys-0

This will create an NVM subsystem with two controllers. Having controllers
linked to an ``nvme-subsys`` device allows additional ``nvme-ns`` parameters:

``shared`` (default: ``on`` since 6.2)
  Specifies that the namespace will be attached to all controllers in the
  subsystem. If set to ``off``, the namespace will remain a private namespace
  and may only be attached to a single controller at a time. Shared namespaces
  are always automatically attached to all controllers (also when controllers
  are hotplugged).

``detached`` (default: ``off``)
  If set to ``on``, the namespace will be available in the subsystem, but
  not attached to any controllers initially. A shared namespace with this set
  to ``on`` will never be automatically attached to controllers.

Thus, adding

.. code-block:: console

   -drive file=nvm-1.img,if=none,id=nvm-1
   -device nvme-ns,drive=nvm-1,nsid=1
   -drive file=nvm-2.img,if=none,id=nvm-2
   -device nvme-ns,drive=nvm-2,nsid=3,shared=off,detached=on

will cause NSID 1 to be a shared namespace that is initially attached to both
controllers. NSID 3 will be a private namespace due to ``shared=off`` and only
attachable to a single controller at a time. Additionally, it will not be
attached to any controller initially (due to ``detached=on``) or to hotplugged
controllers.

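
The interaction between ``shared`` and ``detached`` can be summarised as a
small decision function. This is only a sketch of the rules as described
above, not QEMU code. The controller names are hypothetical, and it assumes
that a private namespace that is not detached is attached to the controller
whose bus it was defined on (``parent``).

.. code-block:: python

   from typing import List


   def initial_attachments(shared: bool, detached: bool,
                           controllers: List[str], parent: str) -> List[str]:
       """Return the controllers a namespace is attached to at startup."""
       if detached:
           # Available in the subsystem, but attached to nothing.
           return []
       if shared:
           # Shared namespaces attach to every controller in the subsystem.
           return list(controllers)
       # Assumption: a private namespace attaches to its parent controller.
       return [parent]


   def attaches_on_hotplug(shared: bool, detached: bool) -> bool:
       """Return True if a hotplugged controller gets the namespace."""
       return shared and not detached


   if __name__ == "__main__":
       ctrls = ["ctrl-a", "ctrl-b"]  # hypothetical controller ids
       print(initial_attachments(True, False, ctrls, "ctrl-b"))   # NSID 1
       print(initial_attachments(False, True, ctrls, "ctrl-b"))   # NSID 3
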

Optional Features
=================

Controller Memory Buffer
------------------------

``nvme`` device parameters related to the Controller Memory Buffer support:

``cmb_size_mb=UINT32`` (default: ``0``)
  This adds a Controller Memory Buffer of the given size at offset zero in BAR
  2.

``legacy-cmb`` (default: ``off``)
  By default, the device uses the "v1.4 scheme" for the Controller Memory
  Buffer support (i.e., the CMB is initially disabled and must be explicitly
  enabled by the host). Set this to ``on`` to behave as a v1.3 device with
  respect to the CMB.

Simple Copy
-----------

The device includes support for TP 4065 ("Simple Copy Command"). A number of
additional ``nvme-ns`` device parameters may be used to control the Copy
command limits:

``mssrl=UINT16`` (default: ``128``)
  Set the Maximum Single Source Range Length (``MSSRL``). This is the maximum
  number of logical blocks that may be specified in each source range.

``mcl=UINT32`` (default: ``128``)
  Set the Maximum Copy Length (``MCL``). This is the maximum number of logical
  blocks that may be specified in a Copy command (the total for all source
  ranges).

``msrc=UINT8`` (default: ``127``)
  Set the Maximum Source Range Count (``MSRC``). This is the maximum number of
  source ranges that may be used in a Copy command. This is a 0's based value.

Zoned Namespaces
----------------

A namespace may be "Zoned" as defined by TP 4053 ("Zoned Namespaces"). Set
``zoned=on`` on an ``nvme-ns`` device to configure it as a zoned namespace.

The namespace may be configured with additional parameters:

``zoned.zone_size=SIZE`` (default: ``128MiB``)
  Define the zone size (``ZSZE``).

``zoned.zone_capacity=SIZE`` (default: ``0``)
  Define the zone capacity (``ZCAP``). If left at the default (``0``), the zone
  capacity will equal the zone size.

``zoned.descr_ext_size=UINT32`` (default: ``0``)
  Set the Zone Descriptor Extension Size (``ZDES``). Must be a multiple of 64
  bytes.

``zoned.cross_read=BOOL`` (default: ``off``)
  Set to ``on`` to allow reads to cross zone boundaries.

``zoned.max_active=UINT32`` (default: ``0``)
  Set the maximum number of active resources (``MAR``). The default (``0``)
  allows all zones to be active.

``zoned.max_open=UINT32`` (default: ``0``)
  Set the maximum number of open resources (``MOR``). The default (``0``)
  allows all zones to be open. If ``zoned.max_active`` is specified, this value
  must be less than or equal to that.

``zoned.zasl=UINT8`` (default: ``0``)
  Set the maximum data transfer size for the Zone Append command. Like
  ``mdts``, the value is specified as a power of two (2^n) and is in units of
  the minimum memory page size (CAP.MPSMIN). With the default value (``0``),
  this property inherits the ``mdts`` value.

Flexible Data Placement
-----------------------

The device may be configured to support TP4146 ("Flexible Data Placement") by
configuring it (``fdp=on``) on the subsystem::

    -device nvme-subsys,id=nvme-subsys-0,nqn=subsys0,fdp=on,fdp.nruh=16

The subsystem emulates a single Endurance Group, on which Flexible Data
Placement will be supported. Also note that the device emulation deviates
slightly from the specification by always enabling the "FDP Mode" feature on
the controller if the subsystem is configured for Flexible Data Placement.

Enabling Flexible Data Placement on the subsystem enables the following
parameters:

``fdp.nrg`` (default: ``1``)
  Set the number of Reclaim Groups.

``fdp.nruh`` (default: ``0``)
  Set the number of Reclaim Unit Handles. This is a mandatory parameter and
  must be non-zero.

``fdp.runs`` (default: ``96M``)
  Set the Reclaim Unit Nominal Size. Defaults to 96 MiB.

Namespaces within this subsystem may request Reclaim Unit Handles::

    -device nvme-ns,drive=nvm-1,fdp.ruhs=RUHLIST

The ``RUHLIST`` is a semicolon separated list (e.g. ``0;1;2;3``) and may
include ranges (e.g. ``0;8-15``). If no reclaim unit handle list is specified,
the controller will assign the controller-specified reclaim unit handle to
placement handle identifier 0.

Metadata
--------

The virtual namespace device supports LBA metadata in the form of separate
metadata (``MPTR``-based) and extended LBAs.

``ms=UINT16`` (default: ``0``)
  Defines the number of metadata bytes per LBA.

``mset=UINT8`` (default: ``0``)
  Set to ``1`` to enable extended LBAs.

End-to-End Data Protection
--------------------------

The virtual namespace device supports DIF- and DIX-based protection information
(depending on ``mset``).

``pi=UINT8`` (default: ``0``)
  Enable protection information of the specified type (type ``1``, ``2`` or
  ``3``).

``pil=UINT8`` (default: ``0``)
  Controls the location of the protection information within the metadata. Set
  to ``1`` to transfer protection information as the first bytes of metadata.
  Otherwise, the protection information is transferred as the last bytes of
  metadata.

``pif=UINT8`` (default: ``0``)
  By default, the namespace device uses 16 bit guard protection information
  format (``pif=0``). Set to ``2`` to enable 64 bit guard protection
  information format. This requires at least 16 bytes of metadata. Note that
  ``pif=1`` (32 bit guards) is currently not supported.

Virtualization Enhancements and SR-IOV (Experimental Support)
-------------------------------------------------------------

The ``nvme`` device supports Single Root I/O Virtualization and Sharing
along with Virtualization Enhancements.
The controller has to be linked to
an NVM Subsystem device (``nvme-subsys``) for use with SR-IOV.

A number of parameters are present (**please note that they may be
subject to change**):

``sriov_max_vfs`` (default: ``0``)
  Indicates the maximum number of PCIe virtual functions supported
  by the controller. Specifying a non-zero value enables reporting of both
  SR-IOV and ARI (Alternative Routing-ID Interpretation) capabilities
  by the NVMe device. Virtual function controllers will not report SR-IOV.

``sriov_vq_flexible``
  Indicates the total number of flexible queue resources assignable to all
  the secondary controllers. Implicitly sets the number of primary
  controller's private resources to ``(max_ioqpairs - sriov_vq_flexible)``.

``sriov_vi_flexible``
  Indicates the total number of flexible interrupt resources assignable to
  all the secondary controllers. Implicitly sets the number of primary
  controller's private resources to ``(msix_qsize - sriov_vi_flexible)``.

``sriov_max_vi_per_vf`` (default: ``0``)
  Indicates the maximum number of virtual interrupt resources assignable
  to a secondary controller. The default ``0`` resolves to
  ``(sriov_vi_flexible / sriov_max_vfs)``.

``sriov_max_vq_per_vf`` (default: ``0``)
  Indicates the maximum number of virtual queue resources assignable to
  a secondary controller. The default ``0`` resolves to
  ``(sriov_vq_flexible / sriov_max_vfs)``.

The simplest possible invocation enables the capability to set up one VF
controller and assign an admin queue, an I/O queue, and an MSI-X interrupt.

.. code-block:: console

   -device nvme-subsys,id=subsys0
   -device nvme,serial=deadbeef,subsys=subsys0,sriov_max_vfs=1,
    sriov_vq_flexible=2,sriov_vi_flexible=1

The minimum steps required to configure a functional NVMe secondary
controller are:

  * unbind flexible resources from the primary controller

.. code-block:: console

   nvme virt-mgmt /dev/nvme0 -c 0 -r 1 -a 1 -n 0
   nvme virt-mgmt /dev/nvme0 -c 0 -r 0 -a 1 -n 0

  * perform a Function Level Reset on the primary controller to actually
    release the resources

.. code-block:: console

   echo 1 > /sys/bus/pci/devices/0000:01:00.0/reset

  * enable VF

.. code-block:: console

   echo 1 > /sys/bus/pci/devices/0000:01:00.0/sriov_numvfs

  * assign the flexible resources to the VF and set it ONLINE

.. code-block:: console

   nvme virt-mgmt /dev/nvme0 -c 1 -r 1 -a 8 -n 1
   nvme virt-mgmt /dev/nvme0 -c 1 -r 0 -a 8 -n 2
   nvme virt-mgmt /dev/nvme0 -c 1 -r 0 -a 9 -n 0

  * bind the NVMe driver to the VF

.. code-block:: console

   echo 0000:01:00.1 > /sys/bus/pci/drivers/nvme/bind
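
The resource arithmetic described for the ``sriov_*`` parameters above can be
worked through with a short sketch. It is an illustration of the documented
formulas only, not QEMU code. The function name ``sriov_resources`` and the
dictionary keys are hypothetical, and integer (floor) division is assumed for
the per-VF defaults.

.. code-block:: python

   from typing import Dict


   def sriov_resources(max_ioqpairs: int = 64, msix_qsize: int = 65,
                       sriov_max_vfs: int = 1,
                       sriov_vq_flexible: int = 2,
                       sriov_vi_flexible: int = 1,
                       sriov_max_vq_per_vf: int = 0,
                       sriov_max_vi_per_vf: int = 0) -> Dict[str, int]:
       """Derive private and per-VF resource counts from the parameters."""
       if sriov_max_vfs <= 0:
           raise ValueError("sriov_max_vfs must be non-zero to enable SR-IOV")
       return {
           # Resources that stay private to the primary controller.
           "primary_private_queues": max_ioqpairs - sriov_vq_flexible,
           "primary_private_interrupts": msix_qsize - sriov_vi_flexible,
           # A value of 0 resolves to an even share (floor division assumed).
           "max_vq_per_vf": sriov_max_vq_per_vf
           or sriov_vq_flexible // sriov_max_vfs,
           "max_vi_per_vf": sriov_max_vi_per_vf
           or sriov_vi_flexible // sriov_max_vfs,
       }


   if __name__ == "__main__":
       for name, count in sriov_resources().items():
           print(name, count)

With the values from the simplest invocation and the default ``max_ioqpairs``
and ``msix_qsize``, this leaves 62 private queue resources and 64 private
interrupt resources on the primary controller, and allows the single VF up to
2 queue resources and 1 interrupt resource.
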