diff --git a/Documentation/nvdimm/nvdimm.txt b/Documentation/nvdimm/nvdimm.txt index 197a0b6b058215c60f3d79f3f40cc2aca4504013..e894de69915a38d3abd81b40e4b0e21a261fadcd 100644 --- a/Documentation/nvdimm/nvdimm.txt +++ b/Documentation/nvdimm/nvdimm.txt @@ -62,6 +62,12 @@ DAX: File system extensions to bypass the page cache and block layer to mmap persistent memory, from a PMEM block device, directly into a process address space. +DSM: Device Specific Method: ACPI method to to control specific +device - in this case the firmware. + +DCR: NVDIMM Control Region Structure defined in ACPI 6 Section 5.2.25.5. +It defines a vendor-id, device-id, and interface format for a given DIMM. + BTT: Block Translation Table: Persistent memory is byte addressable. Existing software may have an expectation that the power-fail-atomicity of writes is at least one sector, 512 bytes. The BTT is an indirection @@ -133,16 +139,16 @@ device driver: registered, can be immediately attached to nd_pmem. 2. BLK (nd_blk.ko): This driver performs I/O using a set of platform - defined apertures. A set of apertures will all access just one DIMM. - Multiple windows allow multiple concurrent accesses, much like + defined apertures. A set of apertures will access just one DIMM. + Multiple windows (apertures) allow multiple concurrent accesses, much like tagged-command-queuing, and would likely be used by different threads or different CPUs. The NFIT specification defines a standard format for a BLK-aperture, but the spec also allows for vendor specific layouts, and non-NFIT BLK - implementations may other designs for BLK I/O. For this reason "nd_blk" - calls back into platform-specific code to perform the I/O. One such - implementation is defined in the "Driver Writer's Guide" and "DSM + implementations may have other designs for BLK I/O. For this reason + "nd_blk" calls back into platform-specific code to perform the I/O. + One such implementation is defined in the "Driver Writer's Guide" and "DSM Interface Example". @@ -152,7 +158,7 @@ Why BLK? While PMEM provides direct byte-addressable CPU-load/store access to NVDIMM storage, it does not provide the best system RAS (recovery, availability, and serviceability) model. An access to a corrupted -system-physical-address address causes a cpu exception while an access +system-physical-address address causes a CPU exception while an access to a corrupted address through an BLK-aperture causes that block window to raise an error status in a register. The latter is more aligned with the standard error model that host-bus-adapter attached disks present. @@ -162,7 +168,7 @@ data could be interleaved in an opaque hardware specific manner across several DIMMs. PMEM vs BLK -BLK-apertures solve this RAS problem, but their presence is also the +BLK-apertures solve these RAS problems, but their presence is also the major contributing factor to the complexity of the ND subsystem. They complicate the implementation because PMEM and BLK alias in DPA space. Any given DIMM's DPA-range may contribute to one or more @@ -220,8 +226,8 @@ socket. Each unique interface (BLK or PMEM) to DPA space is identified by a region device with a dynamically assigned id (REGION0 - REGION5). 1. The first portion of DIMM0 and DIMM1 are interleaved as REGION0. A - single PMEM namespace is created in the REGION0-SPA-range that spans - DIMM0 and DIMM1 with a user-specified name of "pm0.0". Some of that + single PMEM namespace is created in the REGION0-SPA-range that spans most + of DIMM0 and DIMM1 with a user-specified name of "pm0.0". Some of that interleaved system-physical-address range is reclaimed as BLK-aperture accessed space starting at DPA-offset (a) into each DIMM. In that reclaimed space we create two BLK-aperture "namespaces" from REGION2 and @@ -230,13 +236,13 @@ by a region device with a dynamically assigned id (REGION0 - REGION5). 2. In the last portion of DIMM0 and DIMM1 we have an interleaved system-physical-address range, REGION1, that spans those two DIMMs as - well as DIMM2 and DIMM3. Some of REGION1 allocated to a PMEM namespace - named "pm1.0" the rest is reclaimed in 4 BLK-aperture namespaces (for + well as DIMM2 and DIMM3. Some of REGION1 is allocated to a PMEM namespace + named "pm1.0", the rest is reclaimed in 4 BLK-aperture namespaces (for each DIMM in the interleave set), "blk2.1", "blk3.1", "blk4.0", and "blk5.0". 3. The portion of DIMM2 and DIMM3 that do not participate in the REGION1 - interleaved system-physical-address range (i.e. the DPA address below + interleaved system-physical-address range (i.e. the DPA address past offset (b) are also included in the "blk4.0" and "blk5.0" namespaces. Note, that this example shows that BLK-aperture namespaces don't need to be contiguous in DPA-space. @@ -252,15 +258,15 @@ LIBNVDIMM Kernel Device Model and LIBNDCTL Userspace API What follows is a description of the LIBNVDIMM sysfs layout and a corresponding object hierarchy diagram as viewed through the LIBNDCTL -api. The example sysfs paths and diagrams are relative to the Example +API. The example sysfs paths and diagrams are relative to the Example NVDIMM Platform which is also the LIBNVDIMM bus used in the LIBNDCTL unit test. LIBNDCTL: Context -Every api call in the LIBNDCTL library requires a context that holds the +Every API call in the LIBNDCTL library requires a context that holds the logging parameters and other library instance state. The library is based on the libabc template: -https://git.kernel.org/cgit/linux/kernel/git/kay/libabc.git/ +https://git.kernel.org/cgit/linux/kernel/git/kay/libabc.git LIBNDCTL: instantiate a new library context example @@ -409,7 +415,7 @@ Bit 31:28 Reserved LIBNVDIMM/LIBNDCTL: Region ---------------------- -A generic REGION device is registered for each PMEM range orBLK-aperture +A generic REGION device is registered for each PMEM range or BLK-aperture set. Per the example there are 6 regions: 2 PMEM and 4 BLK-aperture sets on the "nfit_test.0" bus. The primary role of regions are to be a container of "mappings". A mapping is a tuple of REGION0 +---> NAMESPACE0.0 +--> PMEM8 "pm0.0" | diff --git a/drivers/nvdimm/e820.c b/drivers/nvdimm/e820.c index 8282db2ef99ea02077ef5609b85acfe0a8aae823..b0045a505dc88c9ef56dcc1094a198bd011d24a7 100644 --- a/drivers/nvdimm/e820.c +++ b/drivers/nvdimm/e820.c @@ -3,6 +3,7 @@ * Copyright (c) 2015, Intel Corporation. */ #include +#include #include #include @@ -25,6 +26,18 @@ static int e820_pmem_remove(struct platform_device *pdev) return 0; } +#ifdef CONFIG_MEMORY_HOTPLUG +static int e820_range_to_nid(resource_size_t addr) +{ + return memory_add_physaddr_to_nid(addr); +} +#else +static int e820_range_to_nid(resource_size_t addr) +{ + return NUMA_NO_NODE; +} +#endif + static int e820_pmem_probe(struct platform_device *pdev) { static struct nvdimm_bus_descriptor nd_desc; @@ -48,7 +61,7 @@ static int e820_pmem_probe(struct platform_device *pdev) memset(&ndr_desc, 0, sizeof(ndr_desc)); ndr_desc.res = p; ndr_desc.attr_groups = e820_pmem_region_attribute_groups; - ndr_desc.numa_node = NUMA_NO_NODE; + ndr_desc.numa_node = e820_range_to_nid(p->start); set_bit(ND_REGION_PAGEMAP, &ndr_desc.flags); if (!nvdimm_pmem_region_create(nvdimm_bus, &ndr_desc)) goto err; diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c index 012e0649f1ac48374de47a690a2d0ac1d1865cac..8ee79893d2f5f6f5165558baa22f236738b99c45 100644 --- a/drivers/nvdimm/pmem.c +++ b/drivers/nvdimm/pmem.c @@ -105,22 +105,11 @@ static long pmem_direct_access(struct block_device *bdev, sector_t sector, { struct pmem_device *pmem = bdev->bd_disk->private_data; resource_size_t offset = sector * 512 + pmem->data_offset; - resource_size_t size; - - if (pmem->data_offset) { - /* - * Limit the direct_access() size to what is covered by - * the memmap - */ - size = (pmem->size - offset) & ~ND_PFN_MASK; - } else - size = pmem->size - offset; - - /* FIXME convert DAX to comprehend that this mapping has a lifetime */ + *kaddr = pmem->virt_addr + offset; *pfn = (pmem->phys_addr + offset) >> PAGE_SHIFT; - return size; + return pmem->size - offset; } static const struct block_device_operations pmem_fops = { diff --git a/fs/dax.c b/fs/dax.c index 8e17b371aeb894620ac2cc5d7a7e2aaee7095310..d1e5cb7311a1de295ecd4348360a9a9947e71142 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -629,6 +629,13 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address, if ((length < PMD_SIZE) || (pfn & PG_PMD_COLOUR)) goto fallback; + /* + * TODO: teach vmf_insert_pfn_pmd() to support + * 'pte_special' for pmds + */ + if (pfn_valid(pfn)) + goto fallback; + if (buffer_unwritten(&bh) || buffer_new(&bh)) { int i; for (i = 0; i < PTRS_PER_PMD; i++) diff --git a/tools/testing/nvdimm/test/nfit.c b/tools/testing/nvdimm/test/nfit.c index dce346aa94eabbd3544189af7e3de528da960cb6..40ab4476c80a2039ab543cc94b09725789b2fe57 100644 --- a/tools/testing/nvdimm/test/nfit.c +++ b/tools/testing/nvdimm/test/nfit.c @@ -1135,7 +1135,7 @@ static void nfit_test1_setup(struct nfit_test *t) memdev->interleave_ways = 1; memdev->flags = ACPI_NFIT_MEM_SAVE_FAILED | ACPI_NFIT_MEM_RESTORE_FAILED | ACPI_NFIT_MEM_FLUSH_FAILED | ACPI_NFIT_MEM_HEALTH_OBSERVED - | ACPI_NFIT_MEM_ARMED; + | ACPI_NFIT_MEM_NOT_ARMED; offset += sizeof(*memdev); /* dcr-descriptor0 */