QEMU Virtual NVDIMM
===================

This document explains the usage of the virtual NVDIMM (vNVDIMM) feature,
which is available since QEMU v2.6.0.

QEMU currently implements only the persistent memory mode of the vNVDIMM
device, not the block window mode.

Basic Usage
-----------

The storage of a vNVDIMM device in QEMU is provided by a memory
backend (i.e. memory-backend-file or memory-backend-ram). A simple
way to create a vNVDIMM device at startup time is to use the
following command line options:

 -machine pc,nvdimm=on
 -m $RAM_SIZE,slots=$N,maxmem=$MAX_SIZE
 -object memory-backend-file,id=mem1,share=on,mem-path=$PATH,size=$NVDIMM_SIZE,readonly=off
 -device nvdimm,id=nvdimm1,memdev=mem1,unarmed=off

Where,

 - the "nvdimm" machine option enables the vNVDIMM feature.

 - "slots=$N" should be equal to or larger than the total number of
   normal RAM devices and vNVDIMM devices, e.g. $N should be >= 2 here.

 - "maxmem=$MAX_SIZE" should be equal to or larger than the total size
   of normal RAM devices and vNVDIMM devices, e.g. $MAX_SIZE should be
   >= $RAM_SIZE + $NVDIMM_SIZE here.

 - "object memory-backend-file,id=mem1,share=on,mem-path=$PATH,
   size=$NVDIMM_SIZE,readonly=off" creates a backend storage of size
   $NVDIMM_SIZE on a file $PATH. All accesses to the virtual NVDIMM device go
   to the file $PATH.

   "share=on/off" controls the visibility of guest writes. If
   "share=on", then guest writes will be applied to the backend
   file. If another guest uses the same backend file with option
   "share=on", then the above writes will be visible to it as well. If
   "share=off", then guest writes won't be applied to the backend
   file and thus will be invisible to other guests.

   "readonly=on/off" controls whether the file $PATH is opened read-only or
   read/write (default).

 - "device nvdimm,id=nvdimm1,memdev=mem1,unarmed=off" creates a read/write
   virtual NVDIMM device whose storage is provided by the above memory backend
   device.

   "unarmed" controls the ACPI NFIT NVDIMM Region Mapping Structure "NVDIMM
   State Flags" Bit 3 indicating that the device is "unarmed" and cannot accept
   persistent writes. Linux guest drivers set the device to read-only when this
   bit is present. Set unarmed to on when the memdev has readonly=on.

Multiple vNVDIMM devices can be created if multiple pairs of "-object"
and "-device" are provided.

With the above command line options, if the guest OS has the proper NVDIMM
driver (e.g. "CONFIG_ACPI_NFIT=y" under Linux), it should be able to
detect an NVDIMM device which is in the persistent memory mode and whose
size is $NVDIMM_SIZE.

Note:

1. Prior to QEMU v2.8.0, if memory-backend-file is used and the actual
   backend file size is not equal to the size given by the "size" option,
   QEMU will truncate the backend file by ftruncate(2), which will
   corrupt the existing data in the backend file, especially for the
   shrink case.

   QEMU v2.8.0 and later check the backend file size and the "size"
   option. If they do not match, QEMU will report errors and abort in
   order to avoid the data corruption.

2. QEMU v2.6.0 only puts a basic alignment requirement on the "size"
   option of memory-backend-file, e.g. 4KB alignment on x86. However,
   QEMU v2.7.0 puts an additional alignment requirement, which may
   require a larger value than the basic one, e.g. 2MB on x86. This
   change breaks the usage of memory-backend-file that only satisfies
   the basic alignment.

   QEMU v2.8.0 and later remove the additional alignment on non-s390x
   architectures, so the broken memory-backend-file can work again.

Label
-----

QEMU v2.7.0 and later implement the label support for vNVDIMM devices.
To enable labels on a vNVDIMM device, users can simply add the
"label-size=$SZ" option to "-device nvdimm", e.g.

 -device nvdimm,id=nvdimm1,memdev=mem1,label-size=128K

Note:

1. The minimal label size is 128KB.

2. QEMU v2.7.0 and later store labels at the end of backend storage.
   If a memory backend file, which was previously used as the backend
   of a vNVDIMM device without labels, is now used for a vNVDIMM
   device with labels, the data in the label area at the end of the file
   will be inaccessible to the guest. If any useful data (e.g. the
   meta-data of the file system) was stored there, the latter usage
   may result in guest data corruption (e.g. breakage of the guest file
   system).

Hotplug
-------

QEMU v2.8.0 and later implement the hotplug support for vNVDIMM
devices. Similarly to the RAM hotplug, the vNVDIMM hotplug is
accomplished by two monitor commands "object_add" and "device_add".

For example, the following commands add another 4GB vNVDIMM device to
the guest:

 (qemu) object_add memory-backend-file,id=mem2,share=on,mem-path=new_nvdimm.img,size=4G
 (qemu) device_add nvdimm,id=nvdimm2,memdev=mem2

Note:

1. Each hotplugged vNVDIMM device consumes one memory slot. Users
   should always ensure the memory option "-m ...,slots=N" specifies
   a sufficient number of slots, i.e.
     N >= number of RAM devices +
          number of statically plugged vNVDIMM devices +
          number of hotplugged vNVDIMM devices

2. The same applies to the memory option "-m ...,maxmem=M", i.e.
     M >= size of RAM devices +
          size of statically plugged vNVDIMM devices +
          size of hotplugged vNVDIMM devices

Alignment
---------

QEMU uses mmap(2) to map vNVDIMM backends and aligns the mapping
address to the page size (getpagesize(2)) by default. However, some
types of backends may require an alignment different from the page
size.
In that case, QEMU v2.12.0 and later provide the 'align' option of
memory-backend-file to allow users to specify the proper alignment.
For device dax (e.g., /dev/dax0.0), this alignment needs to match the
alignment requirement of the device dax. The NUM of the 'align=NUM' option
must be larger than or equal to the 'align' of the device dax.
One of the following commands can be used to show the 'align' of device dax.

    ndctl list -X
    daxctl list -R

In order to get the proper 'align' of device dax, you need to install
the library 'libdaxctl'.

For example, if the device dax requires 2 MB alignment, the
following QEMU command line options use it (/dev/dax0.0) as the
backend of a vNVDIMM:

 -object memory-backend-file,id=mem1,share=on,mem-path=/dev/dax0.0,size=4G,align=2M
 -device nvdimm,id=nvdimm1,memdev=mem1

Guest Data Persistence
----------------------

Though QEMU supports multiple types of vNVDIMM backends on Linux,
the only backends that can guarantee the guest write persistence are:

A. DAX device (e.g., /dev/dax0.0), or
B. DAX file (mounted with the dax option)

When using B (a file supporting direct mapping of persistent memory)
as a backend, write persistence is guaranteed if the host kernel has
support for the MAP_SYNC flag in the mmap system call (available
since Linux 4.15 and on certain distro kernels) and additionally
both 'pmem' and 'share' flags are set to 'on' on the backend.

If these conditions are not satisfied, i.e. if either 'pmem' or 'share'
is not set, if the backend file does not support DAX, or if MAP_SYNC
is not supported by the host kernel, write persistence is not
guaranteed after a system crash. For compatibility reasons, these
conditions are ignored if not satisfied. Currently, no way is
provided to test for them.
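
QEMU itself offers no test for these conditions, but the parts that a
user controls can be written down as a checklist. The following is only
a rough sketch under stated assumptions: the function name
dax_file_persistence is hypothetical, the kernel check simply compares
the release string against 4.15 (so it misses distro kernels that
backported MAP_SYNC), and nothing here verifies that the backend file
really lives on a file system mounted with the dax option.

```shell
#!/bin/sh
# Hypothetical helper, not part of QEMU: checklist for a DAX *file* backend.
# Arguments: <pmem on|off> <share on|off> <host kernel release, e.g. 5.10.0>
# Prints "guaranteed" or "not-guaranteed".
# Assumption: "MAP_SYNC available" is approximated by "release >= 4.15".
dax_file_persistence() {
    pmem=$1 share=$2 release=$3

    # Both backend flags must be 'on'.
    [ "$pmem" = on ] && [ "$share" = on ] || { echo not-guaranteed; return; }

    # Split "MAJOR.MINOR..." into its first two numeric components.
    major=${release%%.*}
    rest=${release#*.}
    minor=${rest%%[!0-9]*}

    if [ "$major" -gt 4 ] || { [ "$major" -eq 4 ] && [ "${minor:-0}" -ge 15 ]; }; then
        echo guaranteed
    else
        echo not-guaranteed
    fi
}

# Example call with a fixed release string.
dax_file_persistence on on 4.15.0
```

On a real host the third argument could be "$(uname -r)". A
"not-guaranteed" result only means that one of the listed conditions is
known to be missing; a "guaranteed" result still depends on the
unchecked assumptions above.
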
For more details on MAP_SYNC, please refer to the mmap(2) man page:
http://man7.org/linux/man-pages/man2/mmap.2.html.

When using other types of backends, it's suggested to set the 'unarmed'
option of '-device nvdimm' to 'on', which sets the unarmed flag of the
guest NVDIMM region mapping structure. This unarmed flag indicates to
guest software that this vNVDIMM device contains a region that cannot
accept persistent writes. As a result, for example, the guest Linux
NVDIMM driver marks such a vNVDIMM device as read-only.

Backend File Setup Example
--------------------------

Here are two examples showing how to set up these persistent backends on
Linux using the tool ndctl [3].

A. DAX device

Use the following command to set up /dev/dax0.0 so that the entirety of
namespace0.0 can be exposed as an emulated NVDIMM to the guest:

    ndctl create-namespace -f -e namespace0.0 -m devdax

/dev/dax0.0 can then be used directly in the "mem-path" option.

B. DAX file

Individual files on a DAX host file system can be exposed as emulated
NVDIMMs. First an fsdax block device is created, partitioned, and then
mounted with the "dax" mount option:

    ndctl create-namespace -f -e namespace0.0 -m fsdax
    (partition /dev/pmem0 with name pmem0p1)
    mount -o dax /dev/pmem0p1 /mnt
    (create or copy a disk image file with qemu-img(1), cp(1), or dd(1)
     in /mnt)

The new file in /mnt can then be used in the "mem-path" option.

NVDIMM Persistence
------------------

ACPI 6.2 Errata A added support for a new Platform Capabilities Structure
which allows the platform to communicate what features it supports related to
NVDIMM data persistence.
Users can provide a persistence value to a guest via
the optional "nvdimm-persistence" machine command line option:

    -machine pc,accel=kvm,nvdimm,nvdimm-persistence=cpu

There are currently two valid values for this option:

"mem-ctrl" - The platform supports flushing dirty data from the memory
             controller to the NVDIMMs in the event of power loss.

"cpu"      - The platform supports flushing dirty data from the CPU cache to
             the NVDIMMs in the event of power loss. This implies that the
             platform also supports flushing dirty data through the memory
             controller on power loss.

If the vNVDIMM backend is in host persistent memory that can be accessed
through the SNIA NVM Programming Model [1] (e.g., Intel NVDIMM), it's
suggested to set the 'pmem' option of memory-backend-file to 'on'. When
'pmem' is 'on' and QEMU is built with libpmem [2] support (configured with
--enable-libpmem), QEMU will take necessary operations to guarantee the
persistence of its own writes to the vNVDIMM backend (e.g., in vNVDIMM label
emulation and live migration). If 'pmem' is 'on' while there is no libpmem
support, QEMU will exit and report a "lack of libpmem support" message to
ensure the persistence is available. For example, if we want to ensure the
persistence for some backend file, use the QEMU command line:

    -object memory-backend-file,id=nv_mem,mem-path=/XXX/yyy,size=4G,pmem=on

References
----------

[1] NVM Programming Model (NPM)
    Version 1.2
    https://www.snia.org/sites/default/files/technical_work/final/NVMProgrammingModel_v1.2.pdf
[2] Persistent Memory Development Kit (PMDK), formerly known as NVML project, home page:
    http://pmem.io/pmdk/
[3] ndctl-create-namespace - provision or reconfigure a namespace
    http://pmem.io/ndctl/ndctl-create-namespace.html