==============
NVMe Emulation
==============

QEMU provides NVMe emulation through the ``nvme``, ``nvme-ns`` and
``nvme-subsys`` devices.

See the following sections for specific information on

  * `Adding NVMe Devices`_, `additional namespaces`_ and `NVM subsystems`_.
  * Configuration of `Optional Features`_ such as `Controller Memory Buffer`_,
    `Simple Copy`_, `Zoned Namespaces`_, `metadata`_ and `End-to-End Data
    Protection`_.

Adding NVMe Devices
===================

Controller Emulation
--------------------

The QEMU emulated NVMe controller implements version 1.4 of the NVM Express
specification. All mandatory features are implemented, with a couple of
exceptions and limitations:

  * Accounting numbers in the SMART/Health log page are reset when the device
    is power cycled.
  * Interrupt Coalescing is not supported and is disabled by default.

The simplest way to attach an NVMe controller on the QEMU PCI bus is to add the
following parameters:

.. code-block:: console

   -drive file=nvm.img,if=none,id=nvm
   -device nvme,serial=deadbeef,drive=nvm

There are a number of optional general parameters for the ``nvme`` device. Some
are mentioned here, but see ``-device nvme,help`` to list all possible
parameters.

``max_ioqpairs=UINT32`` (default: ``64``)
  Set the maximum number of allowed I/O queue pairs. This replaces the
  deprecated ``num_queues`` parameter.

``msix_qsize=UINT16`` (default: ``65``)
  The number of MSI-X vectors that the device should support.

``mdts=UINT8`` (default: ``7``)
  Set the Maximum Data Transfer Size of the device.

``use-intel-id`` (default: ``off``)
  Since QEMU 5.2, the device uses a QEMU allocated "Red Hat" PCI Device and
  Vendor ID. Set this to ``on`` to revert to the unallocated Intel ID
  previously used.

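
As an illustration, the optional parameters described above can be combined on
a single ``nvme`` device. The values below are arbitrary and chosen only for
the example. ``msix_qsize`` is kept one larger than ``max_ioqpairs``, mirroring
the relationship between the two defaults; that pairing is an assumption of
this sketch rather than a documented requirement:

.. code-block:: console

   -drive file=nvm.img,if=none,id=nvm
   -device nvme,serial=deadbeef,drive=nvm,max_ioqpairs=8,msix_qsize=9,mdts=5
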

Additional Namespaces
---------------------

In the simplest possible invocation sketched above, the device only supports a
single namespace with the namespace identifier ``1``. To support multiple
namespaces and additional features, the ``nvme-ns`` device must be used.

.. code-block:: console

   -device nvme,id=nvme-ctrl-0,serial=deadbeef
   -drive file=nvm-1.img,if=none,id=nvm-1
   -device nvme-ns,drive=nvm-1
   -drive file=nvm-2.img,if=none,id=nvm-2
   -device nvme-ns,drive=nvm-2

The namespaces defined by the ``nvme-ns`` device will attach to the most
recently defined ``nvme-bus`` that is created by the ``nvme`` device. Namespace
identifiers are allocated automatically, starting from ``1``.

There are a number of parameters available:

``nsid`` (default: ``0``)
  Explicitly set the namespace identifier.

``uuid`` (default: *autogenerated*)
  Set the UUID of the namespace. This will be reported as a "Namespace UUID"
  descriptor in the Namespace Identification Descriptor List.

``eui64``
  Set the EUI-64 of the namespace. This will be reported as an "IEEE Extended
  Unique Identifier" descriptor in the Namespace Identification Descriptor List.
  Since machine type 6.1, a non-zero default value is used if the parameter
  is not provided. For earlier machine types the field defaults to 0.

``bus``
  If more than one ``nvme`` device is defined, this parameter may be used to
  attach the namespace to a specific ``nvme`` device (identified by an ``id``
  parameter on the controller device).

NVM Subsystems
--------------

Additional features become available if the controller device (``nvme``) is
linked to an NVM Subsystem device (``nvme-subsys``).

The NVM Subsystem emulation allows features such as shared namespaces and
multipath I/O.

.. code-block:: console

   -device nvme-subsys,id=nvme-subsys-0,nqn=subsys0
   -device nvme,serial=deadbeef,subsys=nvme-subsys-0
   -device nvme,serial=deadbeef,subsys=nvme-subsys-0

This will create an NVM subsystem with two controllers. Having controllers
linked to an ``nvme-subsys`` device allows additional ``nvme-ns`` parameters:

``shared`` (default: ``on`` since 6.2)
  Specifies that the namespace will be attached to all controllers in the
  subsystem. If set to ``off``, the namespace will remain a private namespace
  and may only be attached to a single controller at a time. Shared namespaces
  are always automatically attached to all controllers (also when controllers
  are hotplugged).

``detached`` (default: ``off``)
  If set to ``on``, the namespace will be available in the subsystem, but
  not attached to any controllers initially. A shared namespace with this set
  to ``on`` will never be automatically attached to controllers.

Thus, adding

.. code-block:: console

   -drive file=nvm-1.img,if=none,id=nvm-1
   -device nvme-ns,drive=nvm-1,nsid=1
   -drive file=nvm-2.img,if=none,id=nvm-2
   -device nvme-ns,drive=nvm-2,nsid=3,shared=off,detached=on

will cause NSID 1 to be a shared namespace that is initially attached to both
controllers. NSID 3 will be a private namespace due to ``shared=off`` and only
attachable to a single controller at a time. Additionally, it will not be
attached to any controller initially (due to ``detached=on``) or to hotplugged
controllers.

Optional Features
=================

Controller Memory Buffer
------------------------

``nvme`` device parameters related to the Controller Memory Buffer support:

``cmb_size_mb=UINT32`` (default: ``0``)
  This adds a Controller Memory Buffer of the given size at offset zero in BAR
  2.


``legacy-cmb`` (default: ``off``)
  By default, the device uses the "v1.4 scheme" for the Controller Memory
  Buffer support (i.e., the CMB is initially disabled and must be explicitly
  enabled by the host). Set this to ``on`` to behave as a v1.3 device with
  respect to the CMB.

Simple Copy
-----------

The device includes support for TP 4065 ("Simple Copy Command"). A number of
additional ``nvme-ns`` device parameters may be used to control the Copy
command limits:

``mssrl=UINT16`` (default: ``128``)
  Set the Maximum Single Source Range Length (``MSSRL``). This is the maximum
  number of logical blocks that may be specified in each source range.

``mcl=UINT32`` (default: ``128``)
  Set the Maximum Copy Length (``MCL``). This is the maximum number of logical
  blocks that may be specified in a Copy command (the total for all source
  ranges).

``msrc=UINT8`` (default: ``127``)
  Set the Maximum Source Range Count (``MSRC``). This is the maximum number of
  source ranges that may be used in a Copy command. This is a 0's based value.

Zoned Namespaces
----------------

A namespace may be "Zoned" as defined by TP 4053 ("Zoned Namespaces"). Set
``zoned=on`` on an ``nvme-ns`` device to configure it as a zoned namespace.

The namespace may be configured with additional parameters:

``zoned.zone_size=SIZE`` (default: ``128MiB``)
  Define the zone size (``ZSZE``).

``zoned.zone_capacity=SIZE`` (default: ``0``)
  Define the zone capacity (``ZCAP``). If left at the default (``0``), the zone
  capacity will equal the zone size.

``zoned.descr_ext_size=UINT32`` (default: ``0``)
  Set the Zone Descriptor Extension Size (``ZDES``). Must be a multiple of 64
  bytes.

``zoned.cross_read=BOOL`` (default: ``off``)
  Set to ``on`` to allow reads to cross zone boundaries.


``zoned.max_active=UINT32`` (default: ``0``)
  Set the maximum number of active resources (``MAR``). The default (``0``)
  allows all zones to be active.

``zoned.max_open=UINT32`` (default: ``0``)
  Set the maximum number of open resources (``MOR``). The default (``0``)
  allows all zones to be open. If ``zoned.max_active`` is specified, this value
  must be less than or equal to that.

``zoned.zasl=UINT8`` (default: ``0``)
  Set the maximum data transfer size for the Zone Append command. Like
  ``mdts``, the value is specified as a power of two (2^n) and is in units of
  the minimum memory page size (CAP.MPSMIN). With the default value (``0``),
  this property inherits the ``mdts`` value.

Flexible Data Placement
-----------------------

The device may be configured to support TP4146 ("Flexible Data Placement") by
configuring it (``fdp=on``) on the subsystem::

    -device nvme-subsys,id=nvme-subsys-0,nqn=subsys0,fdp=on,fdp.nruh=16

The subsystem emulates a single Endurance Group, on which Flexible Data
Placement will be supported. Also note that the device emulation deviates
slightly from the specification, by always enabling the "FDP Mode" feature on
the controller if the subsystem is configured for Flexible Data Placement.

Enabling Flexible Data Placement on the subsystem enables the following
parameters:

``fdp.nrg`` (default: ``1``)
  Set the number of Reclaim Groups.

``fdp.nruh`` (default: ``0``)
  Set the number of Reclaim Unit Handles. This is a mandatory parameter and
  must be non-zero.

``fdp.runs`` (default: ``96M``)
  Set the Reclaim Unit Nominal Size. Defaults to 96 MiB.

Namespaces within this subsystem may request Reclaim Unit Handles::

    -device nvme-ns,drive=nvm-1,fdp.ruhs=RUHLIST

The ``RUHLIST`` is a semicolon separated list (e.g. ``0;1;2;3``) and may
include ranges (e.g. ``0;8-15``).
If no reclaim unit handle list is specified,
the controller will assign the controller-specified reclaim unit handle to
placement handle identifier 0.

Metadata
--------

The virtual namespace device supports LBA metadata in the form of separate
metadata (``MPTR``-based) and extended LBAs.

``ms=UINT16`` (default: ``0``)
  Defines the number of metadata bytes per LBA.

``mset=UINT8`` (default: ``0``)
  Set to ``1`` to enable extended LBAs.

End-to-End Data Protection
--------------------------

The virtual namespace device supports DIF- and DIX-based protection information
(depending on ``mset``).

``pi=UINT8`` (default: ``0``)
  Enable protection information of the specified type (type ``1``, ``2`` or
  ``3``).

``pil=UINT8`` (default: ``0``)
  Controls the location of the protection information within the metadata. Set
  to ``1`` to transfer protection information as the first bytes of metadata.
  Otherwise, the protection information is transferred as the last bytes of
  metadata.

``pif=UINT8`` (default: ``0``)
  By default, the namespace device uses the 16 bit guard protection information
  format (``pif=0``). Set to ``2`` to enable the 64 bit guard protection
  information format. This requires at least 16 bytes of metadata. Note that
  ``pif=1`` (32 bit guards) is currently not supported.

Virtualization Enhancements and SR-IOV (Experimental Support)
-------------------------------------------------------------

The ``nvme`` device supports Single Root I/O Virtualization and Sharing
along with Virtualization Enhancements. The controller has to be linked to
an NVM Subsystem device (``nvme-subsys``) for use with SR-IOV.

A number of parameters are present (**please note that they may be
subject to change**):

``sriov_max_vfs`` (default: ``0``)
  Indicates the maximum number of PCIe virtual functions supported
  by the controller. Specifying a non-zero value enables reporting of both
  SR-IOV and ARI (Alternative Routing-ID Interpretation) capabilities
  by the NVMe device. Virtual function controllers will not report SR-IOV.

``sriov_vq_flexible``
  Indicates the total number of flexible queue resources assignable to all
  the secondary controllers. Implicitly sets the number of primary
  controller's private resources to ``(max_ioqpairs - sriov_vq_flexible)``.

``sriov_vi_flexible``
  Indicates the total number of flexible interrupt resources assignable to
  all the secondary controllers. Implicitly sets the number of primary
  controller's private resources to ``(msix_qsize - sriov_vi_flexible)``.

``sriov_max_vi_per_vf`` (default: ``0``)
  Indicates the maximum number of virtual interrupt resources assignable
  to a secondary controller. The default ``0`` resolves to
  ``(sriov_vi_flexible / sriov_max_vfs)``.

``sriov_max_vq_per_vf`` (default: ``0``)
  Indicates the maximum number of virtual queue resources assignable to
  a secondary controller. The default ``0`` resolves to
  ``(sriov_vq_flexible / sriov_max_vfs)``.

The simplest possible invocation enables the capability to set up one VF
controller and assign an admin queue, an I/O queue, and an MSI-X interrupt.

.. code-block:: console

   -device nvme-subsys,id=subsys0
   -device nvme,serial=deadbeef,subsys=subsys0,sriov_max_vfs=1,
    sriov_vq_flexible=2,sriov_vi_flexible=1

The minimum steps required to configure a functional NVMe secondary
controller are:

  * unbind flexible resources from the primary controller

.. code-block:: console

   nvme virt-mgmt /dev/nvme0 -c 0 -r 1 -a 1 -n 0
   nvme virt-mgmt /dev/nvme0 -c 0 -r 0 -a 1 -n 0

  * perform a Function Level Reset on the primary controller to actually
    release the resources

.. code-block:: console

   echo 1 > /sys/bus/pci/devices/0000:01:00.0/reset

  * enable the VF

.. code-block:: console

   echo 1 > /sys/bus/pci/devices/0000:01:00.0/sriov_numvfs

  * assign the flexible resources to the VF and set it ONLINE

.. code-block:: console

   nvme virt-mgmt /dev/nvme0 -c 1 -r 1 -a 8 -n 1
   nvme virt-mgmt /dev/nvme0 -c 1 -r 0 -a 8 -n 2
   nvme virt-mgmt /dev/nvme0 -c 1 -r 0 -a 9 -n 0

  * bind the NVMe driver to the VF

.. code-block:: console

   echo 0000:01:00.1 > /sys/bus/pci/drivers/nvme/bind
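
As a worked example of the resource accounting described by the SR-IOV
parameters, consider the simplest invocation shown earlier and assume the
default values ``max_ioqpairs=64`` and ``msix_qsize=65``. With
``sriov_vq_flexible=2`` the primary controller keeps ``64 - 2 = 62`` private
queue resources, and with ``sriov_vi_flexible=1`` it keeps ``65 - 1 = 64``
private interrupt resources. With ``sriov_max_vfs=1`` the per-VF limits resolve
to ``2 / 1 = 2`` queue resources and ``1 / 1 = 1`` interrupt resource. The
following sketch spells those values out explicitly; under the formulas given
above it should describe the same configuration, but it is an illustration and
has not been taken from the QEMU sources:

.. code-block:: console

   -device nvme-subsys,id=subsys0
   -device nvme,serial=deadbeef,subsys=subsys0,max_ioqpairs=64,msix_qsize=65,
    sriov_max_vfs=1,sriov_vq_flexible=2,sriov_vi_flexible=1,
    sriov_max_vq_per_vf=2,sriov_max_vi_per_vf=1
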