QEMU Virtual NVDIMM
===================

This document explains how to use the virtual NVDIMM (vNVDIMM) feature,
which has been available since QEMU v2.6.0.

QEMU currently implements only the persistent memory mode of the vNVDIMM
device, not the block window mode.

Basic Usage
-----------

The storage of a vNVDIMM device in QEMU is provided by a memory
backend (i.e. memory-backend-file or memory-backend-ram). A simple
way to create a vNVDIMM device at startup time is to use the
following command line options:

 -machine pc,nvdimm
 -m $RAM_SIZE,slots=$N,maxmem=$MAX_SIZE
 -object memory-backend-file,id=mem1,share=on,mem-path=$PATH,size=$NVDIMM_SIZE
 -device nvdimm,id=nvdimm1,memdev=mem1

Where,

 - the "nvdimm" machine option enables the vNVDIMM feature.

 - "slots=$N" should be equal to or larger than the total number of
   normal RAM devices and vNVDIMM devices, e.g. $N should be >= 2 here.

 - "maxmem=$MAX_SIZE" should be equal to or larger than the total size
   of normal RAM devices and vNVDIMM devices, e.g. $MAX_SIZE should be
   >= $RAM_SIZE + $NVDIMM_SIZE here.

 - "object memory-backend-file,id=mem1,share=on,mem-path=$PATH,size=$NVDIMM_SIZE"
   creates a backend storage of size $NVDIMM_SIZE on a file $PATH. All
   accesses to the virtual NVDIMM device go to the file $PATH.

   "share=on/off" controls the visibility of guest writes. If
   "share=on", then guest writes will be applied to the backend
   file. If another guest uses the same backend file with option
   "share=on", then the above writes will be visible to it as well. If
   "share=off", then guest writes won't be applied to the backend
   file and thus will be invisible to other guests.

 - "device nvdimm,id=nvdimm1,memdev=mem1" creates a virtual NVDIMM
   device whose storage is provided by the above memory backend device.

Multiple vNVDIMM devices can be created if multiple pairs of "-object"
and "-device" are provided.
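
As a rough illustration of the rules above, the following Python sketch
assembles the options for several vNVDIMM devices and derives the
smallest "slots" and "maxmem" values that satisfy them. The helper name
build_nvdimm_args, the mem<i>/nvdimm<i> id scheme, and the choice to
count the initial RAM as one device (which matches the "$N should be
>= 2" example) are assumptions made for this example, not part of QEMU.

```python
def build_nvdimm_args(ram_mib, nvdimms):
    """Return a list of QEMU options for the given vNVDIMM backends.

    ram_mib -- initial RAM size in MiB
    nvdimms -- list of (backend_path, size_in_mib) tuples
    """
    # One slot per device, and maxmem covers the RAM plus every vNVDIMM.
    # Counting the initial RAM as one device is an assumption taken from
    # the "$N should be >= 2" example in the text.
    slots = 1 + len(nvdimms)
    maxmem_mib = ram_mib + sum(size for _, size in nvdimms)

    args = ["-machine", "pc,nvdimm",
            "-m", f"{ram_mib}M,slots={slots},maxmem={maxmem_mib}M"]

    # One "-object"/"-device" pair per vNVDIMM (ids are hypothetical).
    for i, (path, size_mib) in enumerate(nvdimms, start=1):
        args += ["-object",
                 f"memory-backend-file,id=mem{i},share=on,"
                 f"mem-path={path},size={size_mib}M",
                 "-device", f"nvdimm,id=nvdimm{i},memdev=mem{i}"]
    return args
```

These are minimum values; larger "slots" and "maxmem" values leave room
for devices added later.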

With the above command line options, if the guest OS has the proper NVDIMM
driver (e.g. "CONFIG_ACPI_NFIT=y" under Linux), it should be able to
detect an NVDIMM device which is in the persistent memory mode and whose
size is $NVDIMM_SIZE.

Note:

1. Prior to QEMU v2.8.0, if memory-backend-file is used and the actual
   backend file size is not equal to the size given by "size" option,
   QEMU will truncate the backend file by ftruncate(2), which will
   corrupt the existing data in the backend file, especially for the
   shrink case.

   QEMU v2.8.0 and later check the backend file size and the "size"
   option. If they do not match, QEMU will report errors and abort in
   order to avoid the data corruption.

2. QEMU v2.6.0 only puts a basic alignment requirement on the "size"
   option of memory-backend-file, e.g. 4KB alignment on x86.  However,
   QEMU v2.7.0 puts an additional alignment requirement, which may
   require a larger value than the basic one, e.g. 2MB on x86. This
   change breaks the usage of memory-backend-file that only satisfies
   the basic alignment.

   QEMU v2.8.0 and later remove the additional alignment on non-s390x
   architectures, so the broken memory-backend-file can work again.
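
Given the two notes above, it can help to check a backend file before
handing it to QEMU. The Python sketch below is one hedged way to do so.
check_backend_file is a hypothetical helper, and its 2 MiB default is
only the x86 example value from note 2, not a value QEMU requires in
general.

```python
import os


def check_backend_file(path, size, align=2 * 1024 * 1024):
    """Return a list of problems found; an empty list means none.

    path  -- backend file that will be passed as mem-path
    size  -- value (in bytes) that will be passed as the "size" option
    align -- alignment to test against (assumed example value)
    """
    problems = []

    # Note 1: a mismatch is truncated by old QEMU and rejected by new QEMU.
    actual = os.path.getsize(path)
    if actual != size:
        problems.append(f"file is {actual} bytes but size= is {size}")

    # Note 2: some QEMU versions require more than the basic alignment.
    if size % align != 0:
        problems.append(f"size= {size} is not a multiple of {align}")

    return problems
```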

Label
-----

QEMU v2.7.0 and later implement label support for vNVDIMM devices.
To enable labels on a vNVDIMM device, add the "label-size=$SZ" option
to "-device nvdimm", e.g.

 -device nvdimm,id=nvdimm1,memdev=mem1,label-size=128K

Note:

1. The minimal label size is 128KB.

2. QEMU v2.7.0 and later store labels at the end of backend storage.
   If a memory backend file, which was previously used as the backend
   of a vNVDIMM device without labels, is now used for a vNVDIMM
   device with labels, the data in the label area at the end of the file
   will be inaccessible to the guest. If any useful data (e.g. the
   meta-data of the file system) was stored there, the latter usage
   may result in guest data corruption (e.g. breakage of the guest file
   system).
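
To make the effect of note 2 concrete, the following Python sketch
computes how much of a backend remains for guest data once a label area
is reserved at its end. guest_data_size is a hypothetical helper, and
treating everything before the label area as guest-visible is an
assumption drawn from the note above rather than from the QEMU source.

```python
MIN_LABEL_SIZE = 128 * 1024  # minimal label size stated in note 1


def guest_data_size(backend_size, label_size):
    """Return the bytes left for guest data in a labelled vNVDIMM."""
    if label_size < MIN_LABEL_SIZE:
        raise ValueError("label-size must be at least 128 KiB")
    if label_size >= backend_size:
        raise ValueError("label area does not fit in the backend")
    # Labels live at the end of the backend, so the data area is assumed
    # to be everything that precedes them.
    return backend_size - label_size
```

For a 4 GiB backend with "label-size=128K", this leaves 4 GiB minus
128 KiB for data, so anything previously stored in the last 128 KiB of
the file is what becomes inaccessible.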

Hotplug
-------

QEMU v2.8.0 and later implement hotplug support for vNVDIMM
devices. As with RAM hotplug, vNVDIMM hotplug is accomplished
with two monitor commands, "object_add" and "device_add".

For example, the following commands add another 4GB vNVDIMM device to
the guest:

 (qemu) object_add memory-backend-file,id=mem2,share=on,mem-path=new_nvdimm.img,size=4G
 (qemu) device_add nvdimm,id=nvdimm2,memdev=mem2

Note:

1. Each hotplugged vNVDIMM device consumes one memory slot. Users
   should always ensure that the memory option "-m ...,slots=N"
   specifies enough slots, i.e.
     N >= number of RAM devices +
          number of statically plugged vNVDIMM devices +
          number of hotplugged vNVDIMM devices

2. A similar requirement applies to the memory option "-m ...,maxmem=M", i.e.
     M >= size of RAM devices +
          size of statically plugged vNVDIMM devices +
          size of hotplugged vNVDIMM devices
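
The two inequalities above can be checked before issuing "object_add".
The Python sketch below is only an illustration: can_hotplug is a
hypothetical helper, all sizes are in bytes, and the initial RAM is
assumed to count as a single RAM device.

```python
def can_hotplug(slots, maxmem, ram_size, plugged_sizes, new_size):
    """Return True if one more vNVDIMM of new_size bytes still fits.

    slots, maxmem -- values given to "-m ...,slots=N,maxmem=M"
    ram_size      -- size of the initial RAM
    plugged_sizes -- sizes of all vNVDIMMs already present, whether
                     statically plugged or hotplugged
    """
    used_slots = 1 + len(plugged_sizes)  # assumption: RAM is one device
    used_mem = ram_size + sum(plugged_sizes)
    return used_slots + 1 <= slots and used_mem + new_size <= maxmem
```

With "-m 4G,slots=3,maxmem=16G" and one 4 GB vNVDIMM already plugged,
this reports that the 4 GB device from the example above still fits.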

Alignment
---------

QEMU uses mmap(2) to map vNVDIMM backends and aligns the mapping
address to the page size (getpagesize(2)) by default. However, some
types of backends may require an alignment different from the page
size. For such cases, QEMU v2.12.0 and later provide the 'align' option
of memory-backend-file, which allows users to specify the proper alignment.

For example, device DAX requires 2 MB alignment, so the following
QEMU command line options can be used to make it (/dev/dax0.0) the
backend of a vNVDIMM:

 -object memory-backend-file,id=mem1,share=on,mem-path=/dev/dax0.0,size=4G,align=2M
 -device nvdimm,id=nvdimm1,memdev=mem1
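
When choosing "size" and "align" values, it can be useful to confirm
that the size is a whole multiple of the alignment. The Python sketch
below does this for the suffixed notation used in this document.
parse_size and is_aligned are hypothetical helpers, and the parser only
handles plain integers with an optional K, M, G or T suffix, which is
an assumption narrower than what QEMU itself accepts.

```python
UNITS = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30, "T": 1 << 40}


def parse_size(text):
    """Convert a size such as "4G" or "2M" to a number of bytes."""
    text = text.strip().upper()
    if text and text[-1] in UNITS:
        return int(text[:-1]) * UNITS[text[-1]]
    return int(text)  # no suffix: treated as bytes


def is_aligned(size, align):
    """Return True if size is a whole multiple of align."""
    return parse_size(size) % parse_size(align) == 0
```

For the options above, is_aligned("4G", "2M") holds.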

Guest Data Persistence
----------------------

Though QEMU supports multiple types of vNVDIMM backends on Linux,
currently the only one that can guarantee guest write persistence
is device DAX on a real NVDIMM device (e.g., /dev/dax0.0), because
guest accesses to it do not involve any host-side kernel cache.

When using other types of backends, it's suggested to set the 'unarmed'
option of '-device nvdimm' to 'on', which sets the unarmed flag of the
guest NVDIMM region mapping structure.  This unarmed flag indicates to
guest software that this vNVDIMM device contains a region that cannot
accept persistent writes. As a result, for example, the guest Linux
NVDIMM driver marks such a vNVDIMM device as read-only.
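
A launcher script could apply this suggestion automatically. The Python
sketch below adds "unarmed=on" unless the backend path looks like a
device DAX node. nvdimm_device_option is a hypothetical helper, and the
path pattern is only a naming heuristic: it does not verify that the
node actually sits on a real NVDIMM.

```python
import re

# Heuristic only: matches names such as /dev/dax0.0.
DEVICE_DAX = re.compile(r"^/dev/dax\d+\.\d+$")


def nvdimm_device_option(dev_id, memdev, backend_path):
    """Build the value for "-device", marking non-DAX backends unarmed."""
    option = f"nvdimm,id={dev_id},memdev={memdev}"
    if not DEVICE_DAX.match(backend_path):
        # Other backends cannot guarantee guest write persistence.
        option += ",unarmed=on"
    return option
```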

NVDIMM Persistence
------------------

ACPI 6.2 Errata A added support for a new Platform Capabilities Structure
which allows the platform to communicate what features it supports related to
NVDIMM data persistence.  Users can provide a persistence value to a guest via
the optional "nvdimm-persistence" machine command line option:

    -machine pc,accel=kvm,nvdimm,nvdimm-persistence=cpu

There are currently two valid values for this option:

"mem-ctrl" - The platform supports flushing dirty data from the memory
             controller to the NVDIMMs in the event of power loss.

"cpu"      - The platform supports flushing dirty data from the CPU cache to
             the NVDIMMs in the event of power loss.  This implies that the
             platform also supports flushing dirty data through the memory
             controller on power loss.
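
Since only these two values are valid, a script that generates the
"-machine" option may want to reject anything else early. The Python
sketch below shows one way to do that. machine_option is a hypothetical
helper, and the default machine type "pc" is simply taken from the
examples in this document.

```python
VALID_PERSISTENCE = ("mem-ctrl", "cpu")


def machine_option(persistence=None, machine="pc"):
    """Build the value for "-machine" with vNVDIMM support enabled."""
    option = f"{machine},nvdimm"
    if persistence is not None:
        if persistence not in VALID_PERSISTENCE:
            raise ValueError(
                f"nvdimm-persistence must be one of {VALID_PERSISTENCE}")
        option += f",nvdimm-persistence={persistence}"
    return option
```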

If the vNVDIMM backend is in host persistent memory that can be accessed
through the SNIA NVM Programming Model [1] (e.g., Intel NVDIMM), it's suggested
to set the 'pmem' option of memory-backend-file to 'on'. When 'pmem' is 'on' and
QEMU is built with libpmem [2] support (configured with --enable-libpmem), QEMU
will take the necessary steps to guarantee the persistence of its own writes
to the vNVDIMM backend (e.g., in vNVDIMM label emulation and live migration).
If 'pmem' is 'on' while there is no libpmem support, QEMU will exit and report
a "lack of libpmem support" message rather than run without that guarantee.
For example, to ensure persistence for some backend file, use the QEMU
command line:

    -object memory-backend-file,id=nv_mem,mem-path=/XXX/yyy,size=4G,pmem=on

References
----------

[1] NVM Programming Model (NPM), Version 1.2
    https://www.snia.org/sites/default/files/technical_work/final/NVMProgrammingModel_v1.2.pdf
[2] Persistent Memory Development Kit (PMDK), formerly known as the NVML project, home page:
    http://pmem.io/pmdk/