==============
NVMe Emulation
==============

QEMU provides NVMe emulation through the ``nvme``, ``nvme-ns`` and
``nvme-subsys`` devices.

See the following sections for specific information on

  * `Adding NVMe Devices`_, `additional namespaces`_ and `NVM subsystems`_.
  * Configuration of `Optional Features`_ such as `Controller Memory Buffer`_,
    `Simple Copy`_, `Zoned Namespaces`_, `metadata`_ and `End-to-End Data
    Protection`_.

Adding NVMe Devices
===================

Controller Emulation
--------------------

The QEMU emulated NVMe controller implements version 1.4 of the NVM Express
specification. All mandatory features are implemented, with a couple of
exceptions and limitations:

  * Accounting numbers in the SMART/Health log page are reset when the device
    is power cycled.
  * Interrupt Coalescing is not supported and is disabled by default.

The simplest way to attach an NVMe controller on the QEMU PCI bus is to add the
following parameters:

.. code-block:: console

    -drive file=nvm.img,if=none,id=nvm
    -device nvme,serial=deadbeef,drive=nvm

There are a number of optional general parameters for the ``nvme`` device. Some
are mentioned here, but see ``-device nvme,help`` to list all possible
parameters.

``max_ioqpairs=UINT32`` (default: ``64``)
  Set the maximum number of allowed I/O queue pairs. This replaces the
  deprecated ``num_queues`` parameter.

``msix_qsize=UINT16`` (default: ``65``)
  The number of MSI-X vectors that the device should support.

``mdts=UINT8`` (default: ``7``)
  Set the Maximum Data Transfer Size of the device. The value is specified as
  a power of two (2^n) in units of the minimum memory page size
  (CAP.MPSMIN).

``use-intel-id`` (default: ``off``)
  Since QEMU 5.2, the device uses a QEMU allocated "Red Hat" PCI Device and
  Vendor ID. Set this to ``on`` to revert to the unallocated Intel ID
  previously used.

Additional Namespaces
---------------------

In the simplest possible invocation sketched above, the device supports only a
single namespace with the namespace identifier ``1``. To support multiple
namespaces and additional features, the ``nvme-ns`` device must be used.

.. code-block:: console

   -device nvme,id=nvme-ctrl-0,serial=deadbeef
   -drive file=nvm-1.img,if=none,id=nvm-1
   -device nvme-ns,drive=nvm-1
   -drive file=nvm-2.img,if=none,id=nvm-2
   -device nvme-ns,drive=nvm-2

The namespaces defined by the ``nvme-ns`` device will attach to the most
recently defined ``nvme-bus`` that is created by the ``nvme`` device. Namespace
identifiers are allocated automatically, starting from ``1``.

There are a number of parameters available:

``nsid`` (default: ``0``)
  Explicitly set the namespace identifier.

``uuid`` (default: *autogenerated*)
  Set the UUID of the namespace. This will be reported as a "Namespace UUID"
  descriptor in the Namespace Identification Descriptor List.

``eui64``
  Set the EUI-64 of the namespace. This will be reported as an "IEEE Extended
  Unique Identifier" descriptor in the Namespace Identification Descriptor
  List. Since machine type 6.1, a non-zero default value is used if the
  parameter is not provided. For earlier machine types the field defaults to 0.

``bus``
  If more than one ``nvme`` device is defined, this parameter may be used to
  attach the namespace to a specific ``nvme`` device (identified by an ``id``
  parameter on the controller device).

NVM Subsystems
--------------

Additional features become available if the controller device (``nvme``) is
linked to an NVM Subsystem device (``nvme-subsys``).

The NVM Subsystem emulation allows features such as shared namespaces and
multipath I/O.

.. code-block:: console

   -device nvme-subsys,id=nvme-subsys-0,nqn=subsys0
   -device nvme,serial=a,subsys=nvme-subsys-0
   -device nvme,serial=b,subsys=nvme-subsys-0

This will create an NVM subsystem with two controllers. Having controllers
linked to an ``nvme-subsys`` device allows additional ``nvme-ns`` parameters:

``shared`` (default: ``on`` since 6.2)
  Specifies that the namespace will be attached to all controllers in the
  subsystem. If set to ``off``, the namespace will remain a private namespace
  and may only be attached to a single controller at a time. Shared namespaces
  are always automatically attached to all controllers (also when controllers
  are hotplugged).

``detached`` (default: ``off``)
  If set to ``on``, the namespace will be available in the subsystem, but
  not attached to any controllers initially. A shared namespace with this set
  to ``on`` will never be automatically attached to controllers.

Thus, adding

.. code-block:: console

   -drive file=nvm-1.img,if=none,id=nvm-1
   -device nvme-ns,drive=nvm-1,nsid=1
   -drive file=nvm-2.img,if=none,id=nvm-2
   -device nvme-ns,drive=nvm-2,nsid=3,shared=off,detached=on

will cause NSID 1 to be a shared namespace that is initially attached to both
controllers. NSID 3 will be a private namespace due to ``shared=off`` and only
attachable to a single controller at a time. Additionally, it will not be
attached to any controller initially (due to ``detached=on``) or to hotplugged
controllers.

Optional Features
=================

Controller Memory Buffer
------------------------

``nvme`` device parameters related to the Controller Memory Buffer support:

``cmb_size_mb=UINT32`` (default: ``0``)
  This adds a Controller Memory Buffer of the given size at offset zero in BAR
  2.

``legacy-cmb`` (default: ``off``)
  By default, the device uses the "v1.4 scheme" for the Controller Memory
  Buffer support (i.e., the CMB is initially disabled and must be explicitly
  enabled by the host). Set this to ``on`` to behave as a v1.3 device with
  respect to the CMB.

Simple Copy
-----------

The device includes support for TP 4065 ("Simple Copy Command"). A number of
additional ``nvme-ns`` device parameters may be used to control the Copy command
limits:

``mssrl=UINT16`` (default: ``128``)
  Set the Maximum Single Source Range Length (``MSSRL``). This is the maximum
  number of logical blocks that may be specified in each source range.
``mcl=UINT32`` (default: ``128``)
  Set the Maximum Copy Length (``MCL``). This is the maximum number of logical
  blocks that may be specified in a Copy command (the total for all source
  ranges).

``msrc=UINT8`` (default: ``127``)
  Set the Maximum Source Range Count (``MSRC``). This is the maximum number of
  source ranges that may be used in a Copy command. This is a 0's based value.

Zoned Namespaces
----------------

A namespace may be "Zoned" as defined by TP 4053 ("Zoned Namespaces"). Set
``zoned=on`` on an ``nvme-ns`` device to configure it as a zoned namespace.

The namespace may be configured with additional parameters:

``zoned.zone_size=SIZE`` (default: ``128MiB``)
  Define the zone size (``ZSZE``).

``zoned.zone_capacity=SIZE`` (default: ``0``)
  Define the zone capacity (``ZCAP``). If left at the default (``0``), the zone
  capacity will equal the zone size.

``zoned.descr_ext_size=UINT32`` (default: ``0``)
  Set the Zone Descriptor Extension Size (``ZDES``). Must be a multiple of 64
  bytes.

``zoned.cross_read=BOOL`` (default: ``off``)
  Set to ``on`` to allow reads to cross zone boundaries.

``zoned.max_active=UINT32`` (default: ``0``)
  Set the maximum number of active resources (``MAR``). The default (``0``)
  allows all zones to be active.

``zoned.max_open=UINT32`` (default: ``0``)
  Set the maximum number of open resources (``MOR``). The default (``0``)
  allows all zones to be open. If ``zoned.max_active`` is specified, this value
  must be less than or equal to that.

``zoned.zasl=UINT8`` (default: ``0``)
  Set the maximum data transfer size for the Zone Append command. Like
  ``mdts``, the value is specified as a power of two (2^n) and is in units of
  the minimum memory page size (CAP.MPSMIN). The default value (``0``)
  has this property inherit the ``mdts`` value.

Metadata
--------

The virtual namespace device supports LBA metadata in the form of separate
metadata (``MPTR``-based) and extended LBAs.

``ms=UINT16`` (default: ``0``)
  Defines the number of metadata bytes per LBA.
``mset=UINT8`` (default: ``0``)
  Set to ``1`` to enable extended LBAs.

End-to-End Data Protection
--------------------------

The virtual namespace device supports DIF- and DIX-based protection information
(depending on ``mset``).

``pi=UINT8`` (default: ``0``)
  Enable protection information of the specified type (type ``1``, ``2`` or
  ``3``).

``pil=UINT8`` (default: ``0``)
  Controls the location of the protection information within the metadata. Set
  to ``1`` to transfer protection information as the first eight bytes of
  metadata. Otherwise, the protection information is transferred as the last
  eight bytes.

Virtualization Enhancements and SR-IOV (Experimental Support)
-------------------------------------------------------------

The ``nvme`` device supports Single Root I/O Virtualization and Sharing
along with Virtualization Enhancements. The controller has to be linked to
an NVM Subsystem device (``nvme-subsys``) for use with SR-IOV.

A number of parameters are present (**please note that they may be
subject to change**):

``sriov_max_vfs`` (default: ``0``)
  Indicates the maximum number of PCIe virtual functions supported
  by the controller. Specifying a non-zero value enables reporting of both
  SR-IOV and ARI (Alternative Routing-ID Interpretation) capabilities
  by the NVMe device. Virtual function controllers will not report SR-IOV.

``sriov_vq_flexible``
  Indicates the total number of flexible queue resources assignable to all
  the secondary controllers. Implicitly sets the number of the primary
  controller's private resources to ``(max_ioqpairs - sriov_vq_flexible)``.

``sriov_vi_flexible``
  Indicates the total number of flexible interrupt resources assignable to
  all the secondary controllers. Implicitly sets the number of the primary
  controller's private resources to ``(msix_qsize - sriov_vi_flexible)``.

``sriov_max_vi_per_vf`` (default: ``0``)
  Indicates the maximum number of virtual interrupt resources assignable
  to a secondary controller. The default ``0`` resolves to
  ``(sriov_vi_flexible / sriov_max_vfs)``.

``sriov_max_vq_per_vf`` (default: ``0``)
  Indicates the maximum number of virtual queue resources assignable to
  a secondary controller. The default ``0`` resolves to
  ``(sriov_vq_flexible / sriov_max_vfs)``.

The simplest possible invocation enables the capability to set up one VF
controller and assign an admin queue, an IO queue, and an MSI-X interrupt.

.. code-block:: console

   -device nvme-subsys,id=subsys0
   -device nvme,serial=deadbeef,subsys=subsys0,sriov_max_vfs=1,
    sriov_vq_flexible=2,sriov_vi_flexible=1

The minimum steps required to configure a functional NVMe secondary
controller are:

  * unbind flexible resources from the primary controller

.. code-block:: console

   nvme virt-mgmt /dev/nvme0 -c 0 -r 1 -a 1 -n 0
   nvme virt-mgmt /dev/nvme0 -c 0 -r 0 -a 1 -n 0

  * perform a Function Level Reset on the primary controller to actually
    release the resources

.. code-block:: console

   echo 1 > /sys/bus/pci/devices/0000:01:00.0/reset

  * enable VF

.. code-block:: console

   echo 1 > /sys/bus/pci/devices/0000:01:00.0/sriov_numvfs

  * assign the flexible resources to the VF and set it ONLINE

.. code-block:: console

   nvme virt-mgmt /dev/nvme0 -c 1 -r 1 -a 8 -n 1
   nvme virt-mgmt /dev/nvme0 -c 1 -r 0 -a 8 -n 2
   nvme virt-mgmt /dev/nvme0 -c 1 -r 0 -a 9 -n 0

  * bind the NVMe driver to the VF

.. code-block:: console

   echo 0000:01:00.1 > /sys/bus/pci/drivers/nvme/bind
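
The resource split implied by the simplest SR-IOV invocation above can be
checked with plain shell arithmetic. The following is a minimal sketch, assuming
the documented defaults ``max_ioqpairs=64`` and ``msix_qsize=65`` and that
``sriov_max_vq_per_vf`` and ``sriov_max_vi_per_vf`` are left at ``0``. The shell
variable names merely mirror the device parameters for illustration; they are
not QEMU or ``nvme-cli`` options.

.. code-block:: console

   # Values from the example invocation and the documented defaults.
   max_ioqpairs=64
   msix_qsize=65
   sriov_max_vfs=1
   sriov_vq_flexible=2
   sriov_vi_flexible=1

   # Private resources that remain with the primary controller.
   primary_vq=$((max_ioqpairs - sriov_vq_flexible))
   primary_vi=$((msix_qsize - sriov_vi_flexible))

   # Per-VF limits when the sriov_max_*_per_vf parameters are left at 0.
   max_vq_per_vf=$((sriov_vq_flexible / sriov_max_vfs))
   max_vi_per_vf=$((sriov_vi_flexible / sriov_max_vfs))

   echo "primary: ${primary_vq} queue, ${primary_vi} interrupt resources"
   echo "per VF: up to ${max_vq_per_vf} queue, ${max_vi_per_vf} interrupt resources"

Under these assumptions the per-VF limits come out as two queue resources and
one interrupt resource, which appears to line up with the ``-n 2`` and ``-n 1``
counts used when assigning flexible resources to the VF.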