QEMU Virtual NVDIMM
===================

This document explains how to use the virtual NVDIMM (vNVDIMM) feature,
which has been available since QEMU v2.6.0.

QEMU currently implements only the persistent memory mode of the vNVDIMM
device, not the block window mode.

Basic Usage
-----------

The storage of a vNVDIMM device in QEMU is provided by the memory
backend (i.e. memory-backend-file and memory-backend-ram). A simple
way to create a vNVDIMM device at startup time is to use the
following command line options:

    -machine pc,nvdimm
    -m $RAM_SIZE,slots=$N,maxmem=$MAX_SIZE
    -object memory-backend-file,id=mem1,share=on,mem-path=$PATH,size=$NVDIMM_SIZE
    -device nvdimm,id=nvdimm1,memdev=mem1

Where,

 - the "nvdimm" machine option enables the vNVDIMM feature.

 - "slots=$N" should be equal to or larger than the total number of
   normal RAM devices and vNVDIMM devices, e.g. $N should be >= 2 here.

 - "maxmem=$MAX_SIZE" should be equal to or larger than the total size
   of normal RAM devices and vNVDIMM devices, e.g. $MAX_SIZE should be
   >= $RAM_SIZE + $NVDIMM_SIZE here.

 - "object memory-backend-file,id=mem1,share=on,mem-path=$PATH,size=$NVDIMM_SIZE"
   creates backend storage of size $NVDIMM_SIZE on the file $PATH. All
   accesses to the virtual NVDIMM device go to the file $PATH.

   "share=on/off" controls the visibility of guest writes. If
   "share=on", guest writes are applied to the backend file. If
   another guest uses the same backend file with option "share=on",
   those writes are visible to it as well. If "share=off", guest
   writes are not applied to the backend file and are therefore
   invisible to other guests.

 - "device nvdimm,id=nvdimm1,memdev=mem1" creates a virtual NVDIMM
   device whose storage is provided by the above memory backend.

Multiple vNVDIMM devices can be created if multiple pairs of "-object"
and "-device" are provided.
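
As a rough sketch of the sizing rules above, the following shell
fragment derives the smallest acceptable "slots" and "maxmem" values for
one normal RAM device plus two vNVDIMM devices and assembles the
matching options. The sizes, the backend file paths under /tmp and the
variable names are assumptions chosen for illustration. The fragment
only prints the options and does not start QEMU.

```shell
# Illustrative values only (all sizes in GiB); adjust to the real setup.
RAM_SIZE=4
NVDIMM_SIZES="4 8"      # one entry per vNVDIMM device (hypothetical)

SLOTS=1                 # the normal RAM device counts once, as in the rule above
MAX_SIZE=$RAM_SIZE
OPTS=""
i=0
for size in $NVDIMM_SIZES; do
    i=$((i + 1))
    SLOTS=$((SLOTS + 1))              # each vNVDIMM device needs one slot
    MAX_SIZE=$((MAX_SIZE + size))     # maxmem must cover RAM plus all vNVDIMMs
    # /tmp/nvdimm$i.img is a hypothetical backend file path.
    OPTS="$OPTS -object memory-backend-file,id=mem$i,share=on,mem-path=/tmp/nvdimm$i.img,size=${size}G"
    OPTS="$OPTS -device nvdimm,id=nvdimm$i,memdev=mem$i"
done

echo "-machine pc,nvdimm -m ${RAM_SIZE}G,slots=$SLOTS,maxmem=${MAX_SIZE}G$OPTS"
```

These are lower bounds. Larger "slots" and "maxmem" values leave room
for devices added later.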

For the above command line options, if the guest OS has the proper
NVDIMM driver (e.g. "CONFIG_ACPI_NFIT=y" under Linux), it should be able
to detect an NVDIMM device which is in the persistent memory mode and
whose size is $NVDIMM_SIZE.

Note:

1. Prior to QEMU v2.8.0, if memory-backend-file is used and the actual
   backend file size is not equal to the size given by the "size"
   option, QEMU truncates the backend file with ftruncate(2), which
   corrupts existing data in the backend file, especially when the
   file shrinks.

   QEMU v2.8.0 and later check the backend file size against the "size"
   option. If they do not match, QEMU reports an error and aborts in
   order to avoid data corruption.

2. QEMU v2.6.0 only puts a basic alignment requirement on the "size"
   option of memory-backend-file, e.g. 4KB alignment on x86. However,
   QEMU v2.7.0 puts an additional alignment requirement, which may
   require a larger value than the basic one, e.g. 2MB on x86. This
   change breaks uses of memory-backend-file that only satisfy the
   basic alignment.

   QEMU v2.8.0 and later remove the additional alignment on non-s390x
   architectures, so such uses of memory-backend-file work again.

Label
-----

QEMU v2.7.0 and later implement label support for vNVDIMM devices.
To enable labels on a vNVDIMM device, add the "label-size=$SZ" option
to "-device nvdimm", e.g.

    -device nvdimm,id=nvdimm1,memdev=mem1,label-size=128K

Note:

1. The minimal label size is 128KB.

2. QEMU v2.7.0 and later store labels at the end of backend storage.
   If a memory backend file, which was previously used as the backend
   of a vNVDIMM device without labels, is now used for a vNVDIMM
   device with labels, the data in the label area at the end of the
   file will be inaccessible to the guest. If any useful data (e.g.
   the meta-data of the file system) was stored there, the latter
   usage may result in guest data corruption (e.g. breakage of the
   guest file system).

Hotplug
-------

QEMU v2.8.0 and later implement hotplug support for vNVDIMM devices.
Similarly to RAM hotplug, vNVDIMM hotplug is accomplished by two
monitor commands, "object_add" and "device_add".

For example, the following commands add another 4GB vNVDIMM device to
the guest:

    (qemu) object_add memory-backend-file,id=mem2,share=on,mem-path=new_nvdimm.img,size=4G
    (qemu) device_add nvdimm,id=nvdimm2,memdev=mem2

Note:

1. Each hotplugged vNVDIMM device consumes one memory slot. Users
   should always ensure the memory option "-m ...,slots=N" specifies
   enough slots, i.e.
     N >= number of RAM devices +
          number of statically plugged vNVDIMM devices +
          number of hotplugged vNVDIMM devices

2. A similar constraint applies to the memory option
   "-m ...,maxmem=M", i.e.
     M >= size of RAM devices +
          size of statically plugged vNVDIMM devices +
          size of hotplugged vNVDIMM devices

Alignment
---------

QEMU uses mmap(2) to map vNVDIMM backends and aligns the mapping
address to the page size (getpagesize(2)) by default. However, some
types of backends may require an alignment different from the page
size. For that case, QEMU v2.12.0 and later provide the 'align' option
of memory-backend-file to allow users to specify the proper alignment.

For example, device DAX requires 2 MB alignment, so the following QEMU
command line options can be used to make a DAX device (/dev/dax0.0) the
backend of a vNVDIMM:

    -object memory-backend-file,id=mem1,share=on,mem-path=/dev/dax0.0,size=4G,align=2M
    -device nvdimm,id=nvdimm1,memdev=mem1

Guest Data Persistence
----------------------

Though QEMU supports multiple types of vNVDIMM backends on Linux,
the only backends that can guarantee guest write persistence are:

A. DAX device (e.g., /dev/dax0.0), or
B. DAX file (on a file system mounted with the dax option)

When using B (a file supporting direct mapping of persistent memory)
as a backend, write persistence is guaranteed if the host kernel has
support for the MAP_SYNC flag in the mmap system call (available
since Linux 4.15 and on certain distro kernels) and additionally
both the 'pmem' and 'share' flags are set to 'on' on the backend.

If these conditions are not satisfied, that is, if either 'pmem' or
'share' is not set, if the backend file does not support DAX, or if
MAP_SYNC is not supported by the host kernel, write persistence is not
guaranteed after a system crash. For compatibility reasons, QEMU
ignores these conditions when they are not satisfied. Currently, no
way is provided to test for them.
For more details, please refer to the mmap(2) man page:
http://man7.org/linux/man-pages/man2/mmap.2.html.

When using other types of backends, it is suggested to set the
'unarmed' option of '-device nvdimm' to 'on', which sets the unarmed
flag of the guest NVDIMM region mapping structure. This flag indicates
to guest software that this vNVDIMM device contains a region that
cannot accept persistent writes. As a result, the guest Linux NVDIMM
driver, for example, marks such a vNVDIMM device as read-only.
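
The guidance in this section can be summarised as a small shell sketch
that picks the backend and device options by backend type. The helper
name nvdimm_opts, its "kind" labels and the example paths are
hypothetical. 'align=2M' for the DAX device follows the Alignment
section, and the sketch only prints options without starting QEMU.

```shell
# nvdimm_opts KIND PATH SIZE  (hypothetical helper; prints options only)
nvdimm_opts() {
    kind=$1 path=$2 size=$3
    backend="memory-backend-file,id=mem1,share=on,mem-path=$path,size=$size"
    device="nvdimm,id=nvdimm1,memdev=mem1"
    case $kind in
        dax-device) backend="$backend,align=2M" ;;   # A: e.g. /dev/dax0.0
        dax-file)   backend="$backend,pmem=on" ;;    # B: also needs MAP_SYNC on the host
        *)          device="$device,unarmed=on" ;;   # persistence not guaranteed
    esac
    echo "-object $backend -device $device"
}

nvdimm_opts dax-file /mnt/nvdimm.img 4G
nvdimm_opts other /tmp/nvdimm.img 4G
```

Marking every non-DAX backend as unarmed is the conservative choice: the
guest then knows not to rely on writes to that device being persistent.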

Backend File Setup Example
--------------------------

Here are two examples showing how to set up these persistent backends
on Linux using the tool ndctl [3].

A. DAX device

Use the following command to set up /dev/dax0.0 so that the entirety of
namespace0.0 can be exposed as an emulated NVDIMM to the guest:

    ndctl create-namespace -f -e namespace0.0 -m devdax

/dev/dax0.0 can then be used directly in the "mem-path" option.

B. DAX file

Individual files on a DAX host file system can be exposed as emulated
NVDIMMs. First an fsdax block device is created and partitioned, and
then it is mounted with the "dax" mount option:

    ndctl create-namespace -f -e namespace0.0 -m fsdax
    (partition /dev/pmem0 with name pmem0p1)
    mount -o dax /dev/pmem0p1 /mnt
    (create or copy a disk image file with qemu-img(1), cp(1), or dd(1)
     in /mnt)

The new file in /mnt can then be used in the "mem-path" option.

NVDIMM Persistence
------------------

ACPI 6.2 Errata A added support for a new Platform Capabilities Structure
which allows the platform to communicate what features it supports related to
NVDIMM data persistence. Users can provide a persistence value to a guest via
the optional "nvdimm-persistence" machine command line option:

    -machine pc,accel=kvm,nvdimm,nvdimm-persistence=cpu

There are currently two valid values for this option:

"mem-ctrl" - The platform supports flushing dirty data from the memory
             controller to the NVDIMMs in the event of power loss.

"cpu"      - The platform supports flushing dirty data from the CPU cache to
             the NVDIMMs in the event of power loss. This implies that the
             platform also supports flushing dirty data through the memory
             controller on power loss.
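
As a minimal sketch, a wrapper script could reject any value other than
the two listed above before building the machine option. The helper
name machine_opt is hypothetical, and the remaining machine options are
copied from the example above.

```shell
# Hypothetical helper: print the -machine option for a persistence value,
# rejecting anything other than the two values listed above.
machine_opt() {
    case $1 in
        mem-ctrl|cpu)
            echo "-machine pc,accel=kvm,nvdimm,nvdimm-persistence=$1" ;;
        *)
            echo "unsupported nvdimm-persistence value: $1" >&2
            return 1 ;;
    esac
}

machine_opt mem-ctrl
```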

If the vNVDIMM backend is in host persistent memory that can be accessed via
the SNIA NVM Programming Model [1] (e.g., Intel NVDIMM), it is suggested to set
the 'pmem' option of memory-backend-file to 'on'. When 'pmem' is 'on' and QEMU
is built with libpmem [2] support (configured with --enable-libpmem), QEMU
takes the operations necessary to guarantee the persistence of its own writes
to the vNVDIMM backend (e.g., in vNVDIMM label emulation and live migration).
If 'pmem' is 'on' but QEMU was built without libpmem support, QEMU exits and
reports a "lack of libpmem support" message rather than run without the
persistence guarantee.
For example, to ensure persistence for a backend file, use the QEMU command
line option:

    -object memory-backend-file,id=nv_mem,mem-path=/XXX/yyy,size=4G,pmem=on

References
----------

[1] NVM Programming Model (NPM)
    Version 1.2
    https://www.snia.org/sites/default/files/technical_work/final/NVMProgrammingModel_v1.2.pdf
[2] Persistent Memory Development Kit (PMDK), formerly known as NVML project, home page:
    http://pmem.io/pmdk/
[3] ndctl-create-namespace - provision or reconfigure a namespace
    http://pmem.io/ndctl/ndctl-create-namespace.html