      1 QEMU Virtual NVDIMM
      2 ===================
      3 
      4 This document explains how to use the virtual NVDIMM (vNVDIMM) feature,
      5 which has been available since QEMU v2.6.0.
      6 
      7 QEMU currently implements only the persistent memory mode of vNVDIMM
      8 devices, not the block window mode.
      9 
     10 Basic Usage
     11 -----------
     12 
     13 The storage of a vNVDIMM device in QEMU is provided by a memory
     14 backend (i.e. memory-backend-file or memory-backend-ram). A simple
     15 way to create a vNVDIMM device at startup time is to use the
     16 following command line options:
     17 
     18  -machine pc,nvdimm=on
     19  -m $RAM_SIZE,slots=$N,maxmem=$MAX_SIZE
     20  -object memory-backend-file,id=mem1,share=on,mem-path=$PATH,size=$NVDIMM_SIZE,readonly=off
     21  -device nvdimm,id=nvdimm1,memdev=mem1,unarmed=off
     22 
     23 Where,
     24 
     25  - the "nvdimm" machine option enables the vNVDIMM feature.
     26 
     27  - "slots=$N" should be equal to or larger than the total number of
     28    normal RAM devices and vNVDIMM devices, e.g. $N should be >= 2 here.
     29 
     30  - "maxmem=$MAX_SIZE" should be equal to or larger than the total size
     31    of normal RAM devices and vNVDIMM devices, e.g. $MAX_SIZE should be
     32    >= $RAM_SIZE + $NVDIMM_SIZE here.
     33 
     34  - "object memory-backend-file,id=mem1,share=on,mem-path=$PATH,
     35    size=$NVDIMM_SIZE,readonly=off" creates backend storage of size
     36    $NVDIMM_SIZE in the file $PATH. All accesses to the virtual NVDIMM device
     37    go to the file $PATH.
     38 
     39    "share=on/off" controls the visibility of guest writes. If
     40    "share=on", then guest writes will be applied to the backend
     41    file. If another guest uses the same backend file with option
     42    "share=on", then those writes will be visible to it as well. If
     43    "share=off", then guest writes won't be applied to the backend
     44    file and thus will be invisible to other guests.
     45 
     46    "readonly=on/off" controls whether the file $PATH is opened read-only or
     47    read/write (default).
     48 
     49  - "device nvdimm,id=nvdimm1,memdev=mem1,unarmed=off" creates a read/write
     50    virtual NVDIMM device whose storage is provided by the memory backend
     51    above.
     52 
     53    "unarmed" controls the ACPI NFIT NVDIMM Region Mapping Structure "NVDIMM
     54    State Flags" Bit 3 indicating that the device is "unarmed" and cannot accept
     55    persistent writes. Linux guest drivers set the device to read-only when this
     56    bit is present. Set unarmed to on when the memdev has readonly=on.
     57 
     58 Multiple vNVDIMM devices can be created if multiple pairs of "-object"
     59 and "-device" are provided.
     60 
     61 With the above command line options, if the guest OS has the proper NVDIMM
     62 driver (e.g. "CONFIG_ACPI_NFIT=y" under Linux), it should be able to
     63 detect an NVDIMM device which is in the persistent memory mode and whose
     64 size is $NVDIMM_SIZE.
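
As a minimal sketch of the options above, the following shell fragment
fills in concrete values and prints the resulting option string. The
sizes (4G of RAM and a 4G vNVDIMM), the backend path /tmp/nvdimm.img and
all variable names are assumptions chosen for illustration. The fragment
only builds and prints the options; it does not start QEMU.

```shell
# Illustrative values only; adjust them to the actual host.
RAM_GIB=4                      # normal RAM, in GiB
NVDIMM_GIB=4                   # vNVDIMM size, in GiB
NVDIMM_PATH=/tmp/nvdimm.img    # hypothetical backend file

SLOTS=2                                 # one RAM device + one vNVDIMM device
MAXMEM_GIB=$((RAM_GIB + NVDIMM_GIB))    # lower bound: $RAM_SIZE + $NVDIMM_SIZE

# Backslash-newline inside double quotes joins the lines into one string.
NVDIMM_ARGS="-machine pc,nvdimm=on \
-m ${RAM_GIB}G,slots=${SLOTS},maxmem=${MAXMEM_GIB}G \
-object memory-backend-file,id=mem1,share=on,mem-path=${NVDIMM_PATH},size=${NVDIMM_GIB}G \
-device nvdimm,id=nvdimm1,memdev=mem1"

echo "$NVDIMM_ARGS"
```

The printed string can be appended to an existing QEMU invocation.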
     65 
     66 Note:
     67 
     68 1. Prior to QEMU v2.8.0, if memory-backend-file is used and the actual
     69    backend file size is not equal to the size given by the "size" option,
     70    QEMU truncates the backend file with ftruncate(2), which corrupts
     71    the existing data in the backend file, especially in the shrink
     72    case.
     73 
     74    QEMU v2.8.0 and later compare the backend file size with the "size"
     75    option. If they do not match, QEMU reports an error and aborts in
     76    order to avoid data corruption.
     77 
     78 2. QEMU v2.6.0 only puts a basic alignment requirement on the "size"
     79    option of memory-backend-file, e.g. 4KB alignment on x86.  However,
     80    QEMU v2.7.0 puts an additional alignment requirement, which may
     81    require a larger value than the basic one, e.g. 2MB on x86. This
     82    change breaks the usage of memory-backend-file that only satisfies
     83    the basic alignment.
     84 
     85    QEMU v2.8.0 and later remove the additional alignment on non-s390x
     86    architectures, so the broken memory-backend-file can work again.
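
The size mismatch described in note 1 can be caught before QEMU is
started. The helper below is a sketch under assumptions: the function
name backend_size_matches is hypothetical, and it applies only to
regular backend files (not to device nodes). It compares the file size
in bytes with the value intended for the "size" option.

```shell
# Succeed only if FILE is a regular file whose size equals EXPECTED_BYTES.
backend_size_matches() {
    file=$1
    expected_bytes=$2
    [ -f "$file" ] || return 1
    # $(( ... )) strips any padding that some wc implementations print.
    actual_bytes=$(( $(wc -c < "$file") ))
    [ "$actual_bytes" -eq "$expected_bytes" ]
}

# Usage (path and size are illustrative):
#   backend_size_matches /tmp/nvdimm.img $((4 * 1024 * 1024 * 1024)) ||
#       echo "backend file size does not match size=4G" >&2
```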
     87 
     88 Label
     89 -----
     90 
     91 QEMU v2.7.0 and later implement label support for vNVDIMM devices.
     92 To enable labels on a vNVDIMM device, add the "label-size=$SZ"
     93 option to "-device nvdimm", e.g.
     94 
     95  -device nvdimm,id=nvdimm1,memdev=mem1,label-size=128K
     96 
     97 Note:
     98 
     99 1. The minimum label size is 128KB.
    100 
    101 2. QEMU v2.7.0 and later store labels at the end of backend storage.
    102    If a memory backend file, which was previously used as the backend
    103    of a vNVDIMM device without labels, is now used for a vNVDIMM
    104    device with labels, the data in the label area at the end of the file
    105    will be inaccessible to the guest. If any useful data (e.g. the
    106    meta-data of the file system) was stored there, the latter usage
    107    may result in guest data corruption (e.g. breakage of the guest file
    108    system).
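
Because the label area occupies the end of the backend storage, the
space left for guest data is the backend size minus the label size. The
sketch below works through that arithmetic for a 4G backend with
label-size=128K. The function name guest_data_bytes is hypothetical, and
the calculation assumes the label area is carved out of the backend
exactly as described in note 2.

```shell
MIN_LABEL_BYTES=$((128 * 1024))    # minimum label size from note 1

# Print the bytes left for guest data; fail if the label is too small.
guest_data_bytes() {
    backend_bytes=$1
    label_bytes=$2
    if [ "$label_bytes" -lt "$MIN_LABEL_BYTES" ]; then
        echo "label-size must be at least 128K" >&2
        return 1
    fi
    echo $((backend_bytes - label_bytes))
}

# 4 GiB backend with a 128 KiB label area.
guest_data_bytes $((4 * 1024 * 1024 * 1024)) $((128 * 1024))
# prints 4294836224
```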
    109 
    110 Hotplug
    111 -------
    112 
    113 QEMU v2.8.0 and later implement hotplug support for vNVDIMM
    114 devices. Similar to RAM hotplug, vNVDIMM hotplug is
    115 accomplished with two monitor commands, "object_add" and "device_add".
    116 
    117 For example, the following commands add another 4GB vNVDIMM device to
    118 the guest:
    119 
    120  (qemu) object_add memory-backend-file,id=mem2,share=on,mem-path=new_nvdimm.img,size=4G
    121  (qemu) device_add nvdimm,id=nvdimm2,memdev=mem2
    122 
    123 Note:
    124 
    125 1. Each hotplugged vNVDIMM device consumes one memory slot. Users
    126    should always ensure the memory option "-m ...,slots=N" specifies
    127    enough slots, i.e.
    128      N >= number of RAM devices +
    129           number of statically plugged vNVDIMM devices +
    130           number of hotplugged vNVDIMM devices
    131 
    132 2. A similar requirement applies to the memory option "-m ...,maxmem=M", i.e.
    133      M >= size of RAM devices +
    134           size of statically plugged vNVDIMM devices +
    135           size of hotplugged vNVDIMM devices
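
The two lower bounds above can be computed ahead of time. The following
sketch assumes one 4G RAM device, one statically plugged 4G vNVDIMM and
one 4G vNVDIMM to be hotplugged later; the counts, sizes and variable
names are illustrative only.

```shell
# Device counts (illustrative).
RAM_DEVICES=1
STATIC_NVDIMMS=1
HOTPLUG_NVDIMMS=1

# Sizes in GiB (illustrative).
RAM_GIB=4
STATIC_NVDIMM_GIB=4
HOTPLUG_NVDIMM_GIB=4

# N >= RAM devices + static vNVDIMMs + hotplugged vNVDIMMs
MIN_SLOTS=$((RAM_DEVICES + STATIC_NVDIMMS + HOTPLUG_NVDIMMS))
# M >= RAM size + static vNVDIMM size + hotplugged vNVDIMM size
MIN_MAXMEM_GIB=$((RAM_GIB + STATIC_NVDIMM_GIB + HOTPLUG_NVDIMM_GIB))

MEM_OPT="-m ${RAM_GIB}G,slots=${MIN_SLOTS},maxmem=${MIN_MAXMEM_GIB}G"
echo "$MEM_OPT"
# prints -m 4G,slots=3,maxmem=12G
```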
    136 
    137 Alignment
    138 ---------
    139 
    140 QEMU uses mmap(2) to map vNVDIMM backends and aligns the mapping
    141 address to the page size (getpagesize(2)) by default. However, some
    142 types of backends may require an alignment different from the page
    143 size. In that case, QEMU v2.12.0 and later provide the 'align' option of
    144 memory-backend-file to allow users to specify the proper alignment.
    145 For device dax (e.g., /dev/dax0.0), this alignment needs to match the
    146 alignment requirement of the device dax. The NUM of the 'align=NUM' option
    147 must be larger than or equal to the 'align' of the device dax.
    148 Either of the following commands shows the 'align' of the device dax:
    149 
    150     ndctl list -X
    151     daxctl list -R
    152 
    153 In order to get the proper 'align' of device dax, you need to install
    154 the library 'libdaxctl'.
    155 
    156 For example, device dax requires 2 MB alignment, so we can use the
    157 following QEMU command line options to use it (/dev/dax0.0) as the
    158 backend of vNVDIMM:
    159 
    160  -object memory-backend-file,id=mem1,share=on,mem-path=/dev/dax0.0,size=4G,align=2M
    161  -device nvdimm,id=nvdimm1,memdev=mem1
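
The rule that NUM must be larger than or equal to the 'align' of the
device dax can be checked with a short helper. This is a sketch: the
function name align_ok is hypothetical, and the device alignment is
assumed to have been read by hand from the output of one of the commands
above and converted to bytes.

```shell
# Succeed if the requested 'align=NUM' (bytes) is >= the device alignment (bytes).
align_ok() {
    requested_align=$1
    device_align=$2
    [ "$requested_align" -ge "$device_align" ]
}

DEV_ALIGN=$((2 * 1024 * 1024))    # assumed: 2 MiB reported for /dev/dax0.0

if align_ok $((2 * 1024 * 1024)) "$DEV_ALIGN"; then
    echo "align=2M is acceptable"
fi
if ! align_ok 4096 "$DEV_ALIGN"; then
    echo "align=4K is too small"
fi
```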
    162 
    163 Guest Data Persistence
    164 ----------------------
    165 
    166 Though QEMU supports multiple types of vNVDIMM backends on Linux,
    167 the only backends that can guarantee guest write persistence are:
    168 
    169 A. DAX device (e.g., /dev/dax0.0) or
    170 B. DAX file (mounted with dax option)
    171 
    172 When using B (a file supporting direct mapping of persistent memory)
    173 as a backend, write persistence is guaranteed if the host kernel has
    174 support for the MAP_SYNC flag in the mmap system call (available
    175 since Linux 4.15 and on certain distro kernels) and additionally
    176 both 'pmem' and 'share' flags are set to 'on' on the backend.
    177 
    178 If these conditions are not satisfied, i.e. if either 'pmem' or 'share'
    179 is not set to 'on', if the backend file does not support DAX, or if MAP_SYNC
    180 is not supported by the host kernel, write persistence is not
    181 guaranteed after a system crash. For compatibility reasons, these
    182 conditions are ignored if not satisfied. Currently, no way is
    183 provided to test for them.
    184 For more details, please refer to the mmap(2) man page:
    185 http://man7.org/linux/man-pages/man2/mmap.2.html.
    186 
    187 When using other types of backends, it's suggested to set the 'unarmed'
    188 option of '-device nvdimm' to 'on', which sets the unarmed flag of the
    189 guest NVDIMM region mapping structure.  This unarmed flag indicates to
    190 guest software that this vNVDIMM device contains a region that cannot
    191 accept persistent writes. As a result, for example, the guest Linux
    192 NVDIMM driver marks such a vNVDIMM device as read-only.
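
The conditions in this section can be summarized as a small decision
helper. This is only a sketch: the function name suggest_unarmed and the
backend labels "devdax" and "fsdax-file" are hypothetical, and since
QEMU provides no way to test for MAP_SYNC support, the caller has to
supply that answer. Suggesting unarmed=on for a DAX file that does not
meet all conditions is an assumption that extends the advice given for
other backend types.

```shell
# Args: backend type, pmem (on/off), share (on/off), MAP_SYNC support (yes/no).
# Prints the suggested 'unarmed' setting for '-device nvdimm'.
suggest_unarmed() {
    backend=$1
    pmem=$2
    share=$3
    map_sync=$4
    case "$backend" in
        devdax)
            # Case A: DAX device.
            echo "unarmed=off"
            ;;
        fsdax-file)
            # Case B: DAX file; needs pmem=on, share=on and MAP_SYNC.
            if [ "$pmem" = on ] && [ "$share" = on ] && [ "$map_sync" = yes ]; then
                echo "unarmed=off"
            else
                echo "unarmed=on"
            fi
            ;;
        *)
            # Any other backend: persistence is not guaranteed.
            echo "unarmed=on"
            ;;
    esac
}

suggest_unarmed fsdax-file on on yes    # prints unarmed=off
suggest_unarmed fsdax-file on off yes   # prints unarmed=on
suggest_unarmed regular-file off on no  # prints unarmed=on
```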
    193 
    194 Backend File Setup Example
    195 --------------------------
    196 
    197 Here are two examples showing how to set up these persistent backends on
    198 Linux using the tool ndctl [3].
    199 
    200 A. DAX device
    201 
    202 Use the following command to set up /dev/dax0.0 so that the entirety of
    203 namespace0.0 can be exposed as an emulated NVDIMM to the guest:
    204 
    205     ndctl create-namespace -f -e namespace0.0 -m devdax
    206 
    207 /dev/dax0.0 can then be used directly in the "mem-path" option.
    208 
    209 B. DAX file
    210 
    211 Individual files on a DAX host file system can be exposed as emulated
    212 NVDIMMs.  First an fsdax block device is created, partitioned, and then
    213 mounted with the "dax" mount option:
    214 
    215     ndctl create-namespace -f -e namespace0.0 -m fsdax
    216     (partition /dev/pmem0 with name pmem0p1)
    217     mount -o dax /dev/pmem0p1 /mnt
    218     (create or copy a disk image file with qemu-img(1), cp(1), or dd(1)
    219      in /mnt)
    220 
    221 Then the new file in /mnt can be used in the "mem-path" option.
    222 
    223 NVDIMM Persistence
    224 ------------------
    225 
    226 ACPI 6.2 Errata A added support for a new Platform Capabilities Structure
    227 which allows the platform to communicate what features it supports related to
    228 NVDIMM data persistence.  Users can provide a persistence value to a guest via
    229 the optional "nvdimm-persistence" machine command line option:
    230 
    231     -machine pc,accel=kvm,nvdimm=on,nvdimm-persistence=cpu
    232 
    233 There are currently two valid values for this option:
    234 
    235 "mem-ctrl" - The platform supports flushing dirty data from the memory
    236              controller to the NVDIMMs in the event of power loss.
    237 
    238 "cpu"      - The platform supports flushing dirty data from the CPU cache to
    239              the NVDIMMs in the event of power loss.  This implies that the
    240              platform also supports flushing dirty data through the memory
    241              controller on power loss.
    242 
    243 If the vNVDIMM backend is in host persistent memory that can be accessed via
    244 the SNIA NVM Programming Model [1] (e.g., Intel NVDIMM), it's suggested to set
    245 the 'pmem' option of memory-backend-file to 'on'. When 'pmem' is 'on' and QEMU
    246 is built with libpmem [2] support (configured with --enable-libpmem), QEMU
    247 will take the necessary operations to guarantee the persistence of its own writes
    248 to the vNVDIMM backend (e.g., in vNVDIMM label emulation and live migration).
    249 If 'pmem' is 'on' while there is no libpmem support, QEMU will exit and report
    250 a "lack of libpmem support" message to ensure the persistence is available.
    251 For example, if we want to ensure the persistence for some backend file,
    252 use the QEMU command line:
    253 
    254     -object memory-backend-file,id=nv_mem,mem-path=/XXX/yyy,size=4G,pmem=on
    255 
    256 References
    257 ----------
    258 
    259 [1] NVM Programming Model (NPM)
    260     Version 1.2
    261     https://www.snia.org/sites/default/files/technical_work/final/NVMProgrammingModel_v1.2.pdf
    262 [2] Persistent Memory Development Kit (PMDK), formerly known as NVML project, home page:
    263     http://pmem.io/pmdk/
    264 [3] ndctl-create-namespace - provision or reconfigure a namespace
    265     http://pmem.io/ndctl/ndctl-create-namespace.html