Commit 88793e5c, author: Linus Torvalds

Merge tag 'libnvdimm-for-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm

Pull libnvdimm subsystem from Dan Williams:
 "The libnvdimm sub-system introduces, in addition to the
  libnvdimm-core, 4 drivers / enabling modules:

  NFIT:
    Instantiates an "nvdimm bus" with the core and registers memory
    devices (NVDIMMs) enumerated by the ACPI 6.0 NFIT (NVDIMM Firmware
    Interface table).

    After registering NVDIMMs the NFIT driver then registers "region"
    devices.  A libnvdimm-region defines an access mode and the
    boundaries of persistent memory media.  A region may span multiple
    NVDIMMs that are interleaved by the hardware memory controller.  In
    turn, a libnvdimm-region can be carved into a "namespace" device and
    bound to the PMEM or BLK driver which will attach a Linux block
    device (disk) interface to the memory.

  PMEM:
    Initially merged in v4.1 this driver for contiguous spans of
    persistent memory address ranges is re-worked to drive
    PMEM-namespaces emitted by the libnvdimm-core.

    In this update the PMEM driver, on x86, gains the ability to assert
    that writes to persistent memory have been flushed all the way
    through the caches and buffers in the platform to persistent media.
    See memcpy_to_pmem() and wmb_pmem().

  BLK:
    This new driver enables access to persistent memory media through
    "Block Data Windows" as defined by the NFIT.  The primary difference
    of this driver to PMEM is that only a small window of persistent
    memory is mapped into system address space at any given point in
    time.

    Per-NVDIMM windows are reprogrammed at run time, per-I/O, to access
    different portions of the media.  BLK-mode, by definition, does not
    support DAX.

  BTT:
    This is a library, optionally consumed by either PMEM or BLK, that
    converts a byte-accessible namespace into a disk with atomic sector
    update semantics (prevents sector tearing on crash or power loss).

    The sinister aspect of sector tearing is that most applications do
    not know they have an atomic sector dependency.  At least today's
    disks rarely ever tear sectors and if they do one almost certainly
    gets a CRC error on access.  NVDIMMs will always tear and always
    silently.  Until an application is audited to be robust in the
    presence of sector-tearing the usage of BTT is recommended.

  Thanks to: Ross Zwisler, Jeff Moyer, Vishal Verma, Christoph Hellwig,
  Ingo Molnar, Neil Brown, Boaz Harrosh, Robert Elliott, Matthew Wilcox,
  Andy Rudoff, Linda Knippers, Toshi Kani, Nicholas Moulin, Rafael
  Wysocki, and Bob Moore"

* tag 'libnvdimm-for-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm: (33 commits)
  arch, x86: pmem api for ensuring durability of persistent memory updates
  libnvdimm: Add sysfs numa_node to NVDIMM devices
  libnvdimm: Set numa_node to NVDIMM devices
  acpi: Add acpi_map_pxm_to_online_node()
  libnvdimm, nfit: handle unarmed dimms, mark namespaces read-only
  pmem: flag pmem block devices as non-rotational
  libnvdimm: enable iostat
  pmem: make_request cleanups
  libnvdimm, pmem: fix up max_hw_sectors
  libnvdimm, blk: add support for blk integrity
  libnvdimm, btt: add support for blk integrity
  fs/block_dev.c: skip rw_page if bdev has integrity
  libnvdimm: Non-Volatile Devices
  tools/testing/nvdimm: libnvdimm unit test infrastructure
  libnvdimm, nfit, nd_blk: driver for BLK-mode access persistent memory
  nd_btt: atomic sector updates
  libnvdimm: infrastructure for btt devices
  libnvdimm: write blk label set
  libnvdimm: write pmem label set
  libnvdimm: blk labels and namespace instantiation
  ...
BTT - Block Translation Table
=============================
1. Introduction
---------------
Persistent memory based storage is able to perform IO at byte (or more
accurately, cache line) granularity. However, we often want to expose such
storage as traditional block devices. The block drivers for persistent memory
will do exactly this. However, they do not provide any atomicity guarantees.
Traditional SSDs typically provide protection against torn sectors in hardware,
using stored energy in capacitors to complete in-flight block writes, or perhaps
in firmware. We don't have this luxury with persistent memory - if a write is in
progress, and we experience a power failure, the block will contain a mix of old
and new data. Applications may not be prepared to handle such a scenario.
The Block Translation Table (BTT) provides atomic sector update semantics for
persistent memory devices, so that applications that rely on sector writes not
being torn can continue to do so. The BTT manifests itself as a stacked block
device, and reserves a portion of the underlying storage for its metadata. At
the heart of it, is an indirection table that re-maps all the blocks on the
volume. It can be thought of as an extremely simple file system that only
provides atomic sector updates.
2. Static Layout
----------------
The underlying storage on which a BTT can be laid out is not limited in any way.
The BTT, however, splits the available space into chunks of up to 512 GiB,
called "Arenas".
Each arena follows the same layout for its metadata, and all references in an
arena are internal to it (with the exception of one field that points to the
next arena). The following depicts the "On-disk" metadata layout:
  Backing Store     +------->  Arena
+---------------+   |   +------------------+
|               |   |   | Arena info block |
|    Arena 0    +---+   |       4K         |
|     512G      |       +------------------+
|               |       |                  |
+---------------+       |                  |
|               |       |                  |
|    Arena 1    |       |   Data Blocks    |
|     512G      |       |                  |
|               |       |                  |
+---------------+       |                  |
|       .       |       |                  |
|       .       |       |                  |
|       .       |       |                  |
|               |       |                  |
|               |       |                  |
+---------------+       +------------------+
                        |                  |
                        |     BTT Map      |
                        |                  |
                        |                  |
                        +------------------+
                        |                  |
                        |     BTT Flog     |
                        |                  |
                        +------------------+
                        | Info block copy  |
                        |       4K         |
                        +------------------+
3. Theory of Operation
----------------------
a. The BTT Map
--------------
The map is a simple lookup/indirection table that maps an LBA to an internal
block. Each map entry is 32 bits. The two most significant bits are special
flags, and the remaining form the internal block number.
Bit      Description
31 - 30  : Error and Zero flags - Used in the following way:
            Bit     Description
           31 30
           ---------------------------------------------------------------
            00      Initial state. Reads return zeroes; Premap = Postmap
            01      Zero state: Reads return zeroes
            10      Error state: Reads fail; Writes clear 'E' bit
            11      Normal Block - has valid postmap
29 - 0   : Mappings to internal 'postmap' blocks
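The bit layout above can be illustrated with a small user-space sketch. The macro, enum, and function names below are hypothetical and chosen for illustration only; they are not taken from the driver. The sketch also assumes that the low 30 bits are simply returned as-is for the Zero and Error states.

```c
#include <stdint.h>

/* Flag bits of a 32-bit BTT map entry, following the table above. */
#define MAP_ENT_ERROR (UINT32_C(1) << 31)                 /* bit 31: Error flag */
#define MAP_ENT_ZERO  (UINT32_C(1) << 30)                 /* bit 30: Zero flag  */
#define MAP_LBA_MASK  (~(MAP_ENT_ERROR | MAP_ENT_ZERO))   /* bits 29..0         */

enum map_state { MAP_INITIAL, MAP_ZERO, MAP_ERROR, MAP_NORMAL };

/* Classify an entry by its two most significant bits. */
static enum map_state map_entry_state(uint32_t entry)
{
	switch (entry >> 30) {
	case 0:  return MAP_INITIAL;  /* 00: premap == postmap */
	case 1:  return MAP_ZERO;     /* 01: reads return zeroes */
	case 2:  return MAP_ERROR;    /* 10: reads fail */
	default: return MAP_NORMAL;   /* 11: valid postmap */
	}
}

/*
 * Resolve the postmap ABA for a premap ABA. In the initial state the
 * mapping is the identity; otherwise the low 30 bits are returned.
 */
static uint32_t map_entry_postmap(uint32_t entry, uint32_t premap_aba)
{
	if (map_entry_state(entry) == MAP_INITIAL)
		return premap_aba;
	return entry & MAP_LBA_MASK;
}
```

For instance, an entry of 0xC0000040 decodes to a normal block whose postmap ABA is 64.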
Some of the terminology that will be subsequently used:
External LBA  : LBA as made visible to upper layers.
ABA           : Arena Block Address - Block offset/number within an arena
Premap ABA    : The block offset into an arena, which was decided upon by range
                checking the External LBA
Postmap ABA   : The block number in the "Data Blocks" area obtained after
                indirection from the map
nfree         : The number of free blocks that are maintained at any given time.
                This is the number of concurrent writes that can happen to the
                arena.
For example, after adding a BTT, we surface a disk of 1024G. We get a read for
the external LBA at 768G. This falls into the second arena, and of the 512G
worth of blocks that this arena contributes, this block is at 256G. Thus, the
premap ABA is 256G. We now refer to the map, and find out the mapping for block
'X' (256G) points to block 'Y', say '64'. Thus the postmap ABA is 64.
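The range check in this example can be sketched as follows. This is a simplified illustration that follows the arithmetic of the example: it assumes every arena contributes exactly 512 GiB of externally visible capacity and works in byte offsets, whereas a real arena gives up part of its space to metadata. The names ARENA_BYTES, struct arena_addr, and lba_to_arena are hypothetical.

```c
#include <stdint.h>

/* Assumed external capacity contributed by each arena (simplification). */
#define ARENA_BYTES (UINT64_C(512) << 30)

struct arena_addr {
	uint32_t arena;       /* index of the arena that holds the block */
	uint64_t premap_off;  /* offset of the block within that arena */
};

/* Split an external byte offset into (arena number, offset in arena). */
static struct arena_addr lba_to_arena(uint64_t external_off)
{
	struct arena_addr a;

	a.arena = (uint32_t)(external_off / ARENA_BYTES);
	a.premap_off = external_off % ARENA_BYTES;
	return a;
}
```

With these assumptions, an external offset of 768G lands in arena index 1 (the second arena) at offset 256G, matching the example.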
b. The BTT Flog
---------------
The BTT provides sector atomicity by making every write an "allocating write",
i.e. every write goes to a "free" block. A running list of free blocks is
maintained in the form of the BTT flog. 'Flog' is a combination of the words
"free list" and "log". The flog contains 'nfree' entries, and an entry contains:
lba      : The premap ABA that is being written to
old_map  : The old postmap ABA - after 'this' write completes, this will be a
           free block.
new_map  : The new postmap ABA. The map will be updated to reflect this
           lba->postmap_aba mapping, but we log it here in case we have to
           recover.
seq      : Sequence number to mark which of the 2 sections of this flog entry is
           valid/newest. It cycles between 01->10->11->01 (binary) under normal
           operation, with 00 indicating an uninitialized state.
lba'     : alternate lba entry
old_map' : alternate old postmap entry
new_map' : alternate new postmap entry
seq'     : alternate sequence number.
Each of the above fields is 32-bit, making one entry 32 bytes. Entries are also
padded to 64 bytes to avoid cache line sharing or aliasing. Flog updates are
done such that for any entry being written, it:
a. overwrites the 'old' section in the entry based on sequence numbers
b. writes the 'new' section such that the sequence number is written last.
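A minimal user-space sketch of this layout and update rule is shown below. The struct and function names are hypothetical, and plain stores stand in for what would need to be ordered, durable writes on real persistent memory; the sketch only illustrates how the sequence numbers select the section to overwrite.

```c
#include <stdint.h>

/* One section of a flog entry: four 32-bit fields (16 bytes). */
struct flog_section {
	uint32_t lba;
	uint32_t old_map;
	uint32_t new_map;
	uint32_t seq;      /* 0 = uninitialized, otherwise cycles 1 -> 2 -> 3 -> 1 */
};

/* Two sections (32 bytes) padded out to 64 bytes. */
struct flog_entry {
	struct flog_section sec[2];
	uint8_t pad[32];
};

static uint32_t flog_next_seq(uint32_t seq)
{
	return seq == 3 ? 1 : seq + 1;
}

/* Index (0 or 1) of the newest section, or -1 if that cannot be decided. */
static int flog_newest(const struct flog_entry *e)
{
	uint32_t s0 = e->sec[0].seq, s1 = e->sec[1].seq;

	if (s0 > 3 || s1 > 3 || s0 == s1)
		return -1;                       /* invalid, or both uninitialized */
	if (s0 == 0)
		return 1;
	if (s1 == 0)
		return 0;
	return s1 == flog_next_seq(s0) ? 1 : 0;
}

/* Overwrite the older section; the sequence number is stored last. */
static void flog_update(struct flog_entry *e, uint32_t lba,
			uint32_t old_map, uint32_t new_map)
{
	int newest = flog_newest(e);
	int target = (newest == 0) ? 1 : 0;
	uint32_t seq = (newest < 0) ? 1 : flog_next_seq(e->sec[newest].seq);

	e->sec[target].lba = lba;
	e->sec[target].old_map = old_map;
	e->sec[target].new_map = new_map;
	e->sec[target].seq = seq;                /* written last */
}
```

Because the sequence number is written last, a crash in the middle of an update leaves the previously newest section intact and still selected.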
c. The concept of lanes
-----------------------
While 'nfree' describes the number of concurrent IOs an arena can process,
'nlanes' is the number of IOs the BTT device as a whole can process
concurrently.
	nlanes = min(nfree, num_cpus)
A lane number is obtained at the start of any IO, and is used for indexing into
all the on-disk and in-memory data structures for the duration of the IO. If
there are more CPUs than the max number of available lanes, then lanes are
protected by spinlocks.
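The lane arithmetic can be sketched as below. The function names are hypothetical, and assigning a lane as "CPU number modulo nlanes" is an assumption made for this illustration rather than something stated in the text above.

```c
#include <stdbool.h>
#include <stdint.h>

/* nlanes = min(nfree, num_cpus) */
static uint32_t btt_nlanes(uint32_t nfree, uint32_t num_cpus)
{
	return nfree < num_cpus ? nfree : num_cpus;
}

/* Assumed lane assignment: the issuing CPU's number modulo nlanes. */
static uint32_t lane_for_cpu(uint32_t cpu, uint32_t nlanes)
{
	return cpu % nlanes;
}

/* Lanes are shared, and so need a lock, only when CPUs outnumber lanes. */
static bool lanes_need_locking(uint32_t nlanes, uint32_t num_cpus)
{
	return num_cpus > nlanes;
}
```

When there are at least as many lanes as CPUs, each CPU effectively owns its lane and no lock is required.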
d. In-memory data structure: Read Tracking Table (RTT)
------------------------------------------------------
Consider a case where we have two threads, one doing reads and the other,
writes. We can hit a condition where the writer thread grabs a free block to do
a new IO, but the (slow) reader thread is still reading from it. In other words,
the reader consulted a map entry, and started reading the corresponding block. A
writer started writing to the same external LBA, and finished the write updating
the map for that external LBA to point to its new postmap ABA. At this point the
internal, postmap block that the reader is (still) reading has been inserted
into the list of free blocks. If another write comes in for the same LBA, it can
grab this free block, and start writing to it, causing the reader to read
incorrect data. To prevent this, we introduce the RTT.
The RTT is a simple, per arena table with 'nfree' entries. Every reader inserts
into rtt[lane_number], the postmap ABA it is reading, and clears it after the
read is complete. Every writer thread, after grabbing a free block, checks the
RTT for its presence. If the postmap free block is in the RTT, it waits till the
reader clears the RTT entry, and only then starts writing to it.
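The reader and writer sides of the RTT can be sketched with C11 atomics as follows. NFREE, RTT_EMPTY, and all function names are hypothetical; the sentinel relies on a postmap ABA using only 30 bits. This is an illustration of the idea only and leaves out the map lookup and any re-validation a real implementation may need.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define NFREE     4           /* hypothetical number of lanes / free blocks */
#define RTT_EMPTY UINT32_MAX  /* hypothetical "no read in flight" sentinel */

static _Atomic uint32_t rtt[NFREE];

static void rtt_init(void)
{
	for (unsigned i = 0; i < NFREE; i++)
		atomic_store(&rtt[i], RTT_EMPTY);
}

/* Reader: publish the postmap ABA being read on this lane. */
static void rtt_reader_enter(unsigned lane, uint32_t postmap_aba)
{
	atomic_store(&rtt[lane], postmap_aba);
}

/* Reader: clear the entry once the read has completed. */
static void rtt_reader_exit(unsigned lane)
{
	atomic_store(&rtt[lane], RTT_EMPTY);
}

/* Writer: is any lane still reading the block we want to reuse? */
static bool rtt_block_in_use(uint32_t free_block)
{
	for (unsigned i = 0; i < NFREE; i++)
		if (atomic_load(&rtt[i]) == free_block)
			return true;
	return false;
}

/* Writer: spin until no reader references the free block. */
static void rtt_wait_for_readers(uint32_t free_block)
{
	while (rtt_block_in_use(free_block))
		;  /* busy-wait; a real driver would relax the CPU here */
}
```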
e. In-memory data structure: map locks
--------------------------------------
Consider a case where two writer threads are writing to the same LBA. There can
be a race in the following sequence of steps:
free[lane] = map[premap_aba]
map[premap_aba] = postmap_aba
Both threads can update their respective free[lane] with the same old, freed
postmap_aba. This has made the layout inconsistent by losing a free entry, and
at the same time, duplicating another free entry for two lanes.
To solve this, we could have a single map lock (per arena) that has to be taken
before performing the above sequence, but we feel that could be too contentious.
Instead we use an array of (nfree) map_locks that is indexed by
(premap_aba modulo nfree).
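A sketch of the hashed lock array is given below, using C11 atomic_flag as a stand-in spinlock. The array sizes and all names are hypothetical; the point is only that the two-step sequence runs under the lock selected by (premap_aba modulo nfree).

```c
#include <stdatomic.h>
#include <stdint.h>

#define NFREE   4    /* hypothetical number of free blocks / lanes */
#define NBLOCKS 16   /* hypothetical number of premap blocks */

static atomic_flag map_locks[NFREE];
static uint32_t map[NBLOCKS];
static uint32_t free_list[NFREE];

static void map_locks_init(void)
{
	for (unsigned i = 0; i < NFREE; i++)
		atomic_flag_clear(&map_locks[i]);
}

/* Swap in a new postmap block for premap_aba under its hashed lock. */
static void map_update(uint32_t lane, uint32_t premap_aba, uint32_t postmap_aba)
{
	atomic_flag *lock = &map_locks[premap_aba % NFREE];

	while (atomic_flag_test_and_set_explicit(lock, memory_order_acquire))
		;  /* spin until the lock is free */

	free_list[lane] = map[premap_aba];  /* old block becomes this lane's free block */
	map[premap_aba] = postmap_aba;

	atomic_flag_clear_explicit(lock, memory_order_release);
}
```

Two writers to the same premap ABA always hash to the same lock, so the second writer observes the first writer's new mapping as the block to free.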
f. Reconstruction from the Flog
-------------------------------
On startup, we analyze the BTT flog to create our list of free blocks. We walk
through all the entries, and for each lane, of the set of two possible
'sections', we always look at the most recent one only (based on the sequence
number). The reconstruction rules/steps are simple:
- Read map[log_entry.lba].
- If log_entry.new matches the map entry, then log_entry.old is free.
- If log_entry.new does not match the map entry, then log_entry.new is free.
(This case can only be caused by power-fails/unsafe shutdowns)
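The reconstruction rule can be written as a small helper. The struct and function names are hypothetical, and the map is assumed to hold plain postmap block numbers with the flag bits already stripped.

```c
#include <stdint.h>

/* The newest section of a lane's flog entry (sequence number omitted). */
struct flog_rec {
	uint32_t lba;      /* premap ABA that was written */
	uint32_t old_map;  /* postmap ABA before the write */
	uint32_t new_map;  /* postmap ABA after the write */
};

/* Return the block that is free for this lane after startup. */
static uint32_t flog_free_block(const uint32_t *map, const struct flog_rec *rec)
{
	if (map[rec->lba] == rec->new_map)
		return rec->old_map;  /* map was updated: the old block is free */
	return rec->new_map;          /* map update never landed: the new block is free */
}
```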
g. Summarizing - Read and Write flows
-------------------------------------
Read:
1. Convert external LBA to arena number + pre-map ABA
2. Get a lane (and take lane_lock)
3. Read map to get the entry for this pre-map ABA
4. Enter post-map ABA into RTT[lane]
5. If TRIM flag set in map, return zeroes, and end IO (go to step 8)
6. If ERROR flag set in map, end IO with EIO (go to step 8)
7. Read data from this block
8. Remove post-map ABA entry from RTT[lane]
9. Release lane (and lane_lock)
Write:
1. Convert external LBA to Arena number + pre-map ABA
2. Get a lane (and take lane_lock)
3. Use lane to index into in-memory free list and obtain a new block, next flog
index, next sequence number
4. Scan the RTT to check if free block is present, and spin/wait if it is.
5. Write data to this free block
6. Read map to get the existing post-map ABA entry for this pre-map ABA
7. Write flog entry: [premap_aba / old postmap_aba / new postmap_aba / seq_num]
8. Write new post-map ABA into map.
9. Write old post-map entry into the free list
10. Calculate next sequence number and write into the free list entry
11. Release lane (and lane_lock)
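The write and read flows can be walked through with a toy, single-threaded, in-memory model. Everything here (struct toy_btt, the sizes, the function names) is hypothetical; lane locking, the RTT scan, and the Zero/Error flags are omitted, so this is a sketch of the ordering of steps rather than of the driver.

```c
#include <stdint.h>
#include <string.h>

#define SECTOR  512
#define NBLOCKS 8   /* externally visible blocks */
#define NFREE   2   /* lanes, each owning one free block */

struct toy_flog { uint32_t lba, old_map, new_map, seq; };

struct toy_btt {
	uint8_t  data[NBLOCKS + NFREE][SECTOR];
	uint32_t map[NBLOCKS];      /* premap ABA -> postmap ABA */
	uint32_t free_blk[NFREE];   /* in-memory free list, one block per lane */
	uint32_t seq[NFREE];        /* next sequence number per lane */
	struct toy_flog flog[NFREE];
};

static void toy_init(struct toy_btt *b)
{
	memset(b, 0, sizeof(*b));
	for (uint32_t i = 0; i < NBLOCKS; i++)
		b->map[i] = i;                  /* identity mapping to start */
	for (uint32_t l = 0; l < NFREE; l++) {
		b->free_blk[l] = NBLOCKS + l;   /* spare blocks past the mapped ones */
		b->seq[l] = 1;
	}
}

static void toy_write(struct toy_btt *b, uint32_t lane, uint32_t premap,
		      const void *buf)
{
	uint32_t new_blk = b->free_blk[lane];        /* write step 3 */
	uint32_t seq = b->seq[lane];

	memcpy(b->data[new_blk], buf, SECTOR);       /* step 5: data first */
	uint32_t old_blk = b->map[premap];           /* step 6 */

	b->flog[lane].lba = premap;                  /* step 7: log the change */
	b->flog[lane].old_map = old_blk;
	b->flog[lane].new_map = new_blk;
	b->flog[lane].seq = seq;

	b->map[premap] = new_blk;                    /* step 8: publish */
	b->free_blk[lane] = old_blk;                 /* step 9 */
	b->seq[lane] = (seq == 3) ? 1 : seq + 1;     /* step 10 */
}

static void toy_read(const struct toy_btt *b, uint32_t premap, void *buf)
{
	memcpy(buf, b->data[b->map[premap]], SECTOR);  /* read steps 3 and 7 */
}
```

The data block is fully written before the flog and map are touched, so a reader following the map sees either the complete old sector or the complete new one.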
4. Error Handling
=================
An arena would be in an error state if any of the metadata is corrupted
irrecoverably, either due to a bug or a media error. The following conditions
indicate an error:
- Info block checksum does not match (and recovering from the copy also fails)
- All internal available blocks are not uniquely and entirely addressed by the
sum of mapped blocks and free blocks (from the BTT flog).
- Rebuilding free list from the flog reveals missing/duplicate/impossible
entries
- A map entry is out of bounds
If any of these error conditions are encountered, the arena is put into a read
only state using a flag in the info block.
5. In-kernel usage
==================
Any block driver that supports byte granularity IO to the storage may register
with the BTT. It will have to provide the rw_bytes interface in its
block_device_operations struct:
int (*rw_bytes)(struct gendisk *, void *, size_t, off_t, int rw);
It may register with the BTT after it adds its own gendisk, using btt_init:
struct btt *btt_init(struct gendisk *disk, unsigned long long rawsize,
u32 lbasize, u8 uuid[], int maxlane);
Note that maxlane is the maximum amount of concurrency the driver wishes to
allow the BTT to use.
The BTT 'disk' appears as a stacked block device that grabs the underlying block
device in the O_EXCL mode.
When the driver wishes to remove the backing disk, it should similarly call
btt_fini using the same struct btt* handle that was provided to it by btt_init.
void btt_fini(struct btt *btt);
This diff is collapsed.
......@@ -6102,6 +6102,39 @@ M: Sasha Levin <sasha.levin@oracle.com>
S: Maintained
F: tools/lib/lockdep/
LIBNVDIMM: NON-VOLATILE MEMORY DEVICE SUBSYSTEM
M: Dan Williams <dan.j.williams@intel.com>
L: linux-nvdimm@lists.01.org
Q: https://patchwork.kernel.org/project/linux-nvdimm/list/
S: Supported
F: drivers/nvdimm/*
F: include/linux/nd.h
F: include/linux/libnvdimm.h
F: include/uapi/linux/ndctl.h
LIBNVDIMM BLK: MMIO-APERTURE DRIVER
M: Ross Zwisler <ross.zwisler@linux.intel.com>
L: linux-nvdimm@lists.01.org
Q: https://patchwork.kernel.org/project/linux-nvdimm/list/
S: Supported
F: drivers/nvdimm/blk.c
F: drivers/nvdimm/region_devs.c
F: drivers/acpi/nfit*
LIBNVDIMM BTT: BLOCK TRANSLATION TABLE
M: Vishal Verma <vishal.l.verma@intel.com>
L: linux-nvdimm@lists.01.org
Q: https://patchwork.kernel.org/project/linux-nvdimm/list/
S: Supported
F: drivers/nvdimm/btt*
LIBNVDIMM PMEM: PERSISTENT MEMORY DRIVER
M: Ross Zwisler <ross.zwisler@linux.intel.com>
L: linux-nvdimm@lists.01.org
Q: https://patchwork.kernel.org/project/linux-nvdimm/list/
S: Supported
F: drivers/nvdimm/pmem.c
LINUX FOR IBM pSERIES (RS/6000)
M: Paul Mackerras <paulus@au.ibm.com>
W: http://www.ibm.com/linux/ltc/projects/ppc
......@@ -8363,12 +8396,6 @@ S: Maintained
F: Documentation/blockdev/ramdisk.txt
F: drivers/block/brd.c
-PERSISTENT MEMORY DRIVER
-M: Ross Zwisler <ross.zwisler@linux.intel.com>
-L: linux-nvdimm@lists.01.org
-S: Supported
-F: drivers/block/pmem.c
RANDOM NUMBER DRIVER
M: "Theodore Ts'o" <tytso@mit.edu>
S: Maintained
......
......@@ -158,6 +158,7 @@ static __init int is_reserve_region(efi_memory_desc_t *md)
case EFI_BOOT_SERVICES_CODE:
case EFI_BOOT_SERVICES_DATA:
case EFI_CONVENTIONAL_MEMORY:
case EFI_PERSISTENT_MEMORY:
return 0;
default:
break;
......
......@@ -1222,6 +1222,10 @@ efi_initialize_iomem_resources(struct resource *code_resource,
flags |= IORESOURCE_DISABLED;
break;
case EFI_PERSISTENT_MEMORY:
name = "Persistent Memory";
break;
case EFI_RESERVED_TYPE:
case EFI_RUNTIME_SERVICES_CODE:
case EFI_RUNTIME_SERVICES_DATA:
......
......@@ -27,6 +27,7 @@ config X86
select ARCH_HAS_ELF_RANDOMIZE
select ARCH_HAS_FAST_MULTIPLIER
select ARCH_HAS_GCOV_PROFILE_ALL
select ARCH_HAS_PMEM_API
select ARCH_HAS_SG_CHAIN
select ARCH_HAVE_NMI_SAFE_CMPXCHG
select ARCH_MIGHT_HAVE_ACPI_PDC if ACPI
......@@ -1419,6 +1420,9 @@ source "mm/Kconfig"
config X86_PMEM_LEGACY
bool "Support non-standard NVDIMMs and ADR protected memory"
depends on PHYS_ADDR_T_64BIT
depends on BLK_DEV
select LIBNVDIMM
help
Treat memory marked using the non-standard e820 type of 12 as used
by the Intel Sandy Bridge-EP reference BIOS as protected memory.
......
......@@ -1224,6 +1224,10 @@ static efi_status_t setup_e820(struct boot_params *params,
e820_type = E820_NVS;
break;
case EFI_PERSISTENT_MEMORY:
e820_type = E820_PMEM;
break;
default:
continue;
}
......
......@@ -4,6 +4,7 @@
/* Caches aren't brain-dead on the intel. */
#include <asm-generic/cacheflush.h>
#include <asm/special_insns.h>
#include <asm/uaccess.h>
/*
* The set_memory_* API can be used to change various attributes of a virtual
......@@ -108,4 +109,75 @@ static inline int rodata_test(void)
}
#endif
#ifdef ARCH_HAS_NOCACHE_UACCESS
/**
* arch_memcpy_to_pmem - copy data to persistent memory
* @dst: destination buffer for the copy
* @src: source buffer for the copy
* @n: length of the copy in bytes
*
* Copy data to persistent memory media via non-temporal stores so that
* a subsequent arch_wmb_pmem() can flush cpu and memory controller
* write buffers to guarantee durability.
*/
static inline void arch_memcpy_to_pmem(void __pmem *dst, const void *src,
size_t n)
{
int unwritten;
/*
* We are copying between two kernel buffers, if
* __copy_from_user_inatomic_nocache() returns an error (page
* fault) we would have already reported a general protection fault
* before the WARN+BUG.
*/
unwritten = __copy_from_user_inatomic_nocache((void __force *) dst,
(void __user *) src, n);
if (WARN(unwritten, "%s: fault copying %p <- %p unwritten: %d\n",
__func__, dst, src, unwritten))
BUG();
}
/**
* arch_wmb_pmem - synchronize writes to persistent memory
*
* After a series of arch_memcpy_to_pmem() operations this drains data
* from cpu write buffers and any platform (memory controller) buffers
* to ensure that written data is durable on persistent memory media.
*/
static inline void arch_wmb_pmem(void)
{
/*
* wmb() to 'sfence' all previous writes such that they are
* architecturally visible to 'pcommit'. Note, that we've
* already arranged for pmem writes to avoid the cache via
* arch_memcpy_to_pmem().
*/
wmb();
pcommit_sfence();
}
static inline bool __arch_has_wmb_pmem(void)
{
#ifdef CONFIG_X86_64
/*
* We require that wmb() be an 'sfence', that is only guaranteed on
* 64-bit builds
*/
return static_cpu_has(X86_FEATURE_PCOMMIT);
#else
return false;
#endif
}
#else /* ARCH_HAS_NOCACHE_UACCESS i.e. ARCH=um */
extern void arch_memcpy_to_pmem(void __pmem *dst, const void *src, size_t n);
extern void arch_wmb_pmem(void);
static inline bool __arch_has_wmb_pmem(void)
{
return false;
}
#endif
#endif /* _ASM_X86_CACHEFLUSH_H */
......@@ -248,6 +248,12 @@ static inline void flush_write_buffers(void)
#endif
}
static inline void __pmem *arch_memremap_pmem(resource_size_t offset,
unsigned long size)
{
return (void __force __pmem *) ioremap_cache(offset, size);
}
#endif /* __KERNEL__ */
extern void native_io_delay(void);
......
......@@ -32,6 +32,7 @@
#define E820_ACPI 3
#define E820_NVS 4
#define E820_UNUSABLE 5
#define E820_PMEM 7
/*
* This is a non-standardized way to represent ADR or NVDIMM regions that
......
......@@ -149,6 +149,7 @@ static void __init e820_print_type(u32 type)
case E820_UNUSABLE:
printk(KERN_CONT "unusable");
break;
case E820_PMEM:
case E820_PRAM:
printk(KERN_CONT "persistent (type %u)", type);
break;
......@@ -918,11 +919,32 @@ static inline const char *e820_type_to_string(int e820_type)
case E820_ACPI: return "ACPI Tables";
case E820_NVS: return "ACPI Non-volatile Storage";
case E820_UNUSABLE: return "Unusable memory";
-case E820_PRAM: return "Persistent RAM";
+case E820_PRAM: return "Persistent Memory (legacy)";
+case E820_PMEM: return "Persistent Memory";
default: return "reserved";
}
}
static bool do_mark_busy(u32 type, struct resource *res)
{
/* this is the legacy bios/dos rom-shadow + mmio region */
if (res->start < (1ULL<<20))
return true;
/*
* Treat persistent memory like device memory, i.e. reserve it
* for exclusive use of a driver
*/
switch (type) {
case E820_RESERVED:
case E820_PRAM:
case E820_PMEM:
return false;
default:
return true;
}
}
/*
* Mark e820 reserved areas as busy for the resource manager.
*/
......@@ -952,9 +974,7 @@ void __init e820_reserve_resources(void)
* pci device BAR resource and insert them later in
* pcibios_resource_survey()
*/
-if (((e820.map[i].type != E820_RESERVED) &&
-(e820.map[i].type != E820_PRAM)) ||
-res->start < (1ULL<<20)) {
+if (do_mark_busy(e820.map[i].type, res)) {
res->flags |= IORESOURCE_BUSY;
insert_resource(&iomem_resource, res);
}
......
 /*
  * Copyright (c) 2015, Christoph Hellwig.
+ * Copyright (c) 2015, Intel Corporation.
  */
-#include <linux/memblock.h>
 #include <linux/platform_device.h>
-#include <linux/slab.h>
+#include <linux/libnvdimm.h>
+#include <linux/module.h>
 #include <asm/e820.h>
-#include <asm/page_types.h>
-#include <asm/setup.h>
-static __init void register_pmem_device(struct resource *res)
+static void e820_pmem_release(struct device *dev)
 {
-	struct platform_device *pdev;
-	int error;
+	struct nvdimm_bus *nvdimm_bus = dev->platform_data;
-	pdev = platform_device_alloc("pmem", PLATFORM_DEVID_AUTO);
-	if (!pdev)
-		return;
+	if (nvdimm_bus)
+		nvdimm_bus_unregister(nvdimm_bus);
+}
-	error = platform_device_add_resources(pdev, res, 1);
-	if (error)
-		goto out_put_pdev;
+static struct platform_device e820_pmem = {
+	.name = "e820_pmem",
+	.id = -1,
+	.dev = {
+		.release = e820_pmem_release,
+	},
+};
-	error = platform_device_add(pdev);
-	if (error)
-		goto out_put_pdev;
-	return;
+static const struct attribute_group *e820_pmem_attribute_groups[] = {
+	&nvdimm_bus_attribute_group,
+	NULL,
+};
-out_put_pdev:
-	dev_warn(&pdev->dev, "failed to add 'pmem' (persistent memory) device!\n");
-	platform_device_put(pdev);
-}
+static const struct attribute_group *e820_pmem_region_attribute_groups[] = {
+	&nd_region_attribute_group,
+	&nd_device_attribute_group,
+	NULL,
+};
-static __init int register_pmem_devices(void)
+static __init int register_e820_pmem(void)
 {
-	int i;
+	static struct nvdimm_bus_descriptor nd_desc;
+	struct device *dev = &e820_pmem.dev;
+	struct nvdimm_bus *nvdimm_bus;
+	int rc, i;
+	rc = platform_device_register(&e820_pmem);
+	if (rc)
+		return rc;
+	nd_desc.attr_groups = e820_pmem_attribute_groups;
+	nd_desc.provider_name = "e820";
+	nvdimm_bus = nvdimm_bus_register(dev, &nd_desc);
+	if (!nvdimm_bus)
+		goto err;
+	dev->platform_data = nvdimm_bus;
 	for (i = 0; i < e820.nr_map; i++) {
 		struct e820entry *ei = &e820.map[i];
+		struct resource res = {
+			.flags = IORESOURCE_MEM,
+			.start = ei->addr,
+			.end = ei->addr + ei->size - 1,
+		};
+		struct nd_region_desc ndr_desc;
+		if (ei->type != E820_PRAM)
+			continue;
-		if (ei->type == E820_PRAM) {
-			struct resource res = {
-				.flags = IORESOURCE_MEM,
-				.start = ei->addr,
-				.end = ei->addr + ei->size - 1,
-			};
-			register_pmem_device(&res);
-		}
+		memset(&ndr_desc, 0, sizeof(ndr_desc));
+		ndr_desc.res = &res;
+		ndr_desc.attr_groups = e820_pmem_region_attribute_groups;
+		ndr_desc.numa_node = NUMA_NO_NODE;
+		if (!nvdimm_pmem_region_create(nvdimm_bus, &ndr_desc))
+			goto err;
 	}
 	return 0;
+err:
+	dev_err(dev, "failed to register legacy persistent memory ranges\n");
+	platform_device_unregister(&e820_pmem);
+	return -ENXIO;
 }
-device_initcall(register_pmem_devices);
+device_initcall(register_e820_pmem);
......@@ -174,6 +174,9 @@ static void __init do_add_efi_memmap(void)
case EFI_UNUSABLE_MEMORY:
e820_type = E820_UNUSABLE;
break;
case EFI_PERSISTENT_MEMORY:
e820_type = E820_PMEM;
break;
default:
/*
* EFI_RESERVED_TYPE EFI_RUNTIME_SERVICES_CODE
......
......@@ -182,4 +182,6 @@ source "drivers/thunderbolt/Kconfig"
source "drivers/android/Kconfig"
source "drivers/nvdimm/Kconfig"
endmenu
......@@ -64,6 +64,7 @@ obj-$(CONFIG_FB_INTEL) += video/fbdev/intelfb/
obj-$(CONFIG_PARPORT) += parport/
obj-y += base/ block/ misc/ mfd/ nfc/
obj-$(CONFIG_LIBNVDIMM) += nvdimm/
obj-$(CONFIG_DMA_SHARED_BUFFER) += dma-buf/
obj-$(CONFIG_NUBUS) += nubus/
obj-y += macintosh/
......
......@@ -386,6 +386,32 @@ config ACPI_REDUCED_HARDWARE_ONLY
If you are unsure what to do, do not enable this option.
config ACPI_NFIT
tristate "ACPI NVDIMM Firmware Interface Table (NFIT)"
depends on PHYS_ADDR_T_64BIT
depends on BLK_DEV
select LIBNVDIMM
help
Infrastructure to probe ACPI 6 compliant platforms for
NVDIMMs (NFIT) and register a libnvdimm device tree. In
addition to storage devices this also enables libnvdimm to pass
ACPI._DSM messages for platform/dimm configuration.
To compile this driver as a module, choose M here:
the module will be called nfit.
config ACPI_NFIT_DEBUG
bool "NFIT DSM debug"
depends on ACPI_NFIT
depends on DYNAMIC_DEBUG
default n
help
Enabling this option causes the nfit driver to dump the
input and output buffers of _DSM operations on the ACPI0012
device and its children. This can be very verbose, so leave
it disabled unless you are debugging a hardware / firmware
issue.
source "drivers/acpi/apei/Kconfig"
config ACPI_EXTLOG
......
......@@ -68,6 +68,7 @@ obj-$(CONFIG_ACPI_PCI_SLOT) += pci_slot.o
obj-$(CONFIG_ACPI_PROCESSOR) += processor.o
obj-y += container.o
obj-$(CONFIG_ACPI_THERMAL) += thermal.o
obj-$(CONFIG_ACPI_NFIT) += nfit.o
obj-y += acpi_memhotplug.o
obj-$(CONFIG_ACPI_HOTPLUG_IOAPIC) += ioapic.o
obj-$(CONFIG_ACPI_BATTERY) += battery.o
......
This diff is collapsed.
/*
* NVDIMM Firmware Interface Table - NFIT
*
* Copyright(c) 2013-2015 Intel Corporation. All rights reserved.
*
* This program is free software; you can redistribute it and/or modify
* it under the terms of version 2 of the GNU General Public License as
* published by the Free Software Foundation.
*
* This program is distributed in the hope that it will be useful, but
* WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
* General Public License for more details.
*/
#ifndef __NFIT_H__
#define __NFIT_H__
#include <linux/libnvdimm.h>
#include <linux/types.h>
#include <linux/uuid.h>
#include <linux/acpi.h>
#include <acpi/acuuid.h>
#define UUID_NFIT_BUS "2f10e7a4-9e91-11e4-89d3-123b93f75cba"
#define UUID_NFIT_DIMM "4309ac30-0d11-11e4-9191-0800200c9a66"
#define ACPI_NFIT_MEM_FAILED_MASK (ACPI_NFIT_MEM_SAVE_FAILED \
| ACPI_NFIT_MEM_RESTORE_FAILED | ACPI_NFIT_MEM_FLUSH_FAILED \
| ACPI_NFIT_MEM_ARMED)
enum nfit_uuids {
NFIT_SPA_VOLATILE,
NFIT_SPA_PM,
NFIT_SPA_DCR,
NFIT_SPA_BDW,
NFIT_SPA_VDISK,
NFIT_SPA_VCD,
NFIT_SPA_PDISK,
NFIT_SPA_PCD,
NFIT_DEV_BUS,
NFIT_DEV_DIMM,
NFIT_UUID_MAX,
};
struct nfit_spa {
struct acpi_nfit_system_address *spa;
struct list_head list;
};
struct nfit_dcr {
struct acpi_nfit_control_region *dcr;
struct list_head list;
};
struct nfit_bdw {
struct acpi_nfit_data_region *bdw;
struct list_head list;
};
struct nfit_idt {
struct acpi_nfit_interleave *idt;
struct list_head list;
};
struct nfit_memdev {
struct acpi_nfit_memory_map *memdev;
struct list_head list;
};
/* assembled tables for a given dimm/memory-device */
struct nfit_mem {
struct nvdimm *nvdimm;
struct acpi_nfit_memory_map *memdev_dcr;
struct acpi_nfit_memory_map *memdev_pmem;
struct acpi_nfit_memory_map *memdev_bdw;
struct acpi_nfit_control_region *dcr;
struct acpi_nfit_data_region *bdw;
struct acpi_nfit_system_address *spa_dcr;
struct acpi_nfit_system_address *spa_bdw;
struct acpi_nfit_interleave *idt_dcr;
struct acpi_nfit_interleave *idt_bdw;
struct list_head list;
struct acpi_device *adev;
unsigned long dsm_mask;
};
struct acpi_nfit_desc {
struct nvdimm_bus_descriptor nd_desc;
struct acpi_table_nfit *nfit;
struct mutex spa_map_mutex;
struct list_head spa_maps;
struct list_head memdevs;
struct list_head dimms;
struct list_head spas;
struct list_head dcrs;
struct list_head bdws;
struct list_head idts;
struct nvdimm_bus *nvdimm_bus;
struct device *dev;
unsigned long dimm_dsm_force_en;
int (*blk_do_io)(struct nd_blk_region *ndbr, resource_size_t dpa,
void *iobuf, u64 len, int rw);
};
enum nd_blk_mmio_selector {
BDW,
DCR,
};
struct nfit_blk {
struct nfit_blk_mmio {
union {
void __iomem *base;
void *aperture;
};
u64 size;
u64 base_offset;
u32 line_size;
u32 num_lines;
u32 table_size;
struct acpi_nfit_interleave *idt;
struct acpi_nfit_system_address *spa;
} mmio[2];
struct nd_region *nd_region;
u64 bdw_offset; /* post interleave offset */
u64 stat_offset;
u64 cmd_offset;
};
struct nfit_spa_mapping {
struct acpi_nfit_desc *acpi_desc;
struct acpi_nfit_system_address *spa;
struct list_head list;
struct kref kref;
void __iomem *iomem;
};
static inline struct nfit_spa_mapping *to_spa_map(struct kref *kref)
{
return container_of(kref, struct nfit_spa_mapping, kref);
}
static inline struct acpi_nfit_memory_map *__to_nfit_memdev(
struct nfit_mem *nfit_mem)
{
if (nfit_mem->memdev_dcr)
return nfit_mem->memdev_dcr;
return nfit_mem->memdev_pmem;
}
static inline struct acpi_nfit_desc *to_acpi_desc(
struct nvdimm_bus_descriptor *nd_desc)
{
return container_of(nd_desc, struct acpi_nfit_desc, nd_desc);
}
const u8 *to_nfit_uuid(enum nfit_uuids id);
int acpi_nfit_init(struct acpi_nfit_desc *nfit, acpi_size sz);
extern const struct attribute_group *acpi_nfit_attribute_groups[];
#endif /* __NFIT_H__ */
......@@ -29,6 +29,8 @@
#include <linux/errno.h>
#include <linux/acpi.h>
#include <linux/numa.h>
#include <linux/nodemask.h>
#include <linux/topology.h>
#define PREFIX "ACPI: "
......@@ -70,7 +72,12 @@ static void __acpi_map_pxm_to_node(int pxm, int node)
int acpi_map_pxm_to_node(int pxm)
{
-int node = pxm_to_node_map[pxm];
+int node;
+if (pxm < 0 || pxm >= MAX_PXM_DOMAINS)
+return NUMA_NO_NODE;
+node = pxm_to_node_map[pxm];
if (node == NUMA_NO_NODE) {
if (nodes_weight(nodes_found_map) >= MAX_NUMNODES)
......@@ -83,6 +90,45 @@ int acpi_map_pxm_to_node(int pxm)
return node;
}
/**
* acpi_map_pxm_to_online_node - Map proximity ID to online node
* @pxm: ACPI proximity ID
*
* This is similar to acpi_map_pxm_to_node(), but always returns an online
* node. When the mapped node from a given proximity ID is offline, it
* looks up the node distance table and returns the nearest online node.
*
* ACPI device drivers, which are called after the NUMA initialization has
* completed in the kernel, can call this interface to obtain their device
* NUMA topology from ACPI tables. Such drivers do not have to deal with
* offline nodes. A node may be offline when a device proximity ID is
* unique, SRAT memory entry does not exist, or NUMA is disabled, ex.
* "numa=off" on x86.
*/
int acpi_map_pxm_to_online_node(int pxm)
{
int node, n, dist, min_dist;
node = acpi_map_pxm_to_node(pxm);
if (node == NUMA_NO_NODE)
node = 0;
if (!node_online(node)) {
min_dist = INT_MAX;
for_each_online_node(n) {
dist = node_distance(node, n);
if (dist < min_dist) {
min_dist = dist;
node = n;
}
}
}
return node;
}
EXPORT_SYMBOL(acpi_map_pxm_to_online_node);
static void __init
acpi_table_print_srat_entry(struct acpi_subtable_header *header)
{
......@@ -328,8 +374,6 @@ int acpi_get_node(acpi_handle handle)
int pxm;
pxm = acpi_get_pxm(handle);
-if (pxm < 0 || pxm >= MAX_PXM_DOMAINS)
-return NUMA_NO_NODE;
return acpi_map_pxm_to_node(pxm);
}
......
......@@ -404,18 +404,6 @@ config BLK_DEV_RAM_DAX
and will prevent RAM block device backing store memory from being
allocated from highmem (only a problem for highmem systems).
-config BLK_DEV_PMEM
-tristate "Persistent memory block device support"
-depends on HAS_IOMEM
-help
-Saying Y here will allow you to use a contiguous range of reserved
-memory as one or more persistent block devices.
-To compile this driver as a module, choose M here: the module will be
-called 'pmem'.
-If unsure, say N.
config CDROM_PKTCDVD
tristate "Packet writing on CD/DVD media"
depends on !UML
......
@@ -14,7 +14,6 @@ obj-$(CONFIG_PS3_VRAM) += ps3vram.o
obj-$(CONFIG_ATARI_FLOPPY) += ataflop.o
obj-$(CONFIG_AMIGA_Z2RAM) += z2ram.o
obj-$(CONFIG_BLK_DEV_RAM) += brd.o
-obj-$(CONFIG_BLK_DEV_PMEM) += pmem.o
obj-$(CONFIG_BLK_DEV_LOOP) += loop.o
obj-$(CONFIG_BLK_CPQ_DA) += cpqarray.o
obj-$(CONFIG_BLK_CPQ_CISS_DA) += cciss.o
......
menuconfig LIBNVDIMM
tristate "NVDIMM (Non-Volatile Memory Device) Support"
depends on PHYS_ADDR_T_64BIT
depends on BLK_DEV
help
Generic support for non-volatile memory devices including
ACPI-6-NFIT defined resources. On platforms that define an
NFIT, or otherwise can discover NVDIMM resources, a libnvdimm
bus is registered to advertise PMEM (persistent memory)
namespaces (/dev/pmemX) and BLK (sliding mmio window(s))
namespaces (/dev/ndblkX.Y). A PMEM namespace refers to a
memory resource that may span multiple DIMMs and support DAX
(see CONFIG_DAX). A BLK namespace refers to an NVDIMM control
region which exposes an mmio register set for windowed access
mode to non-volatile memory.
if LIBNVDIMM
config BLK_DEV_PMEM
tristate "PMEM: Persistent memory block device support"
default LIBNVDIMM
depends on HAS_IOMEM
select ND_BTT if BTT
help
Memory ranges for PMEM are described by either an NFIT
(NVDIMM Firmware Interface Table, see CONFIG_ACPI_NFIT), a
non-standard OEM-specific E820 memory type (type-12, see
CONFIG_X86_PMEM_LEGACY), or it is manually specified by the
'memmap=nn[KMG]!ss[KMG]' kernel command line (see
Documentation/kernel-parameters.txt). This driver converts
these persistent memory ranges into block devices that are
capable of DAX (direct-access) file system mappings. See
Documentation/nvdimm/nvdimm.txt for more details.
Say Y if you want to use an NVDIMM.
config ND_BLK
tristate "BLK: Block data window (aperture) device support"
default LIBNVDIMM
select ND_BTT if BTT
help
Support NVDIMMs, or other devices, that implement a BLK-mode
access capability. BLK-mode access uses memory-mapped-i/o
apertures to access persistent media.
Say Y if your platform firmware emits an ACPI.NFIT table
(CONFIG_ACPI_NFIT), or otherwise exposes BLK-mode
capabilities.
config ND_BTT
tristate
config BTT
bool "BTT: Block Translation Table (atomic sector updates)"
default y if LIBNVDIMM
help
The Block Translation Table (BTT) provides atomic sector
update semantics for persistent memory devices, so that
applications that rely on sector writes not being torn (a
guarantee that typical disks provide) can continue to do so.
The BTT manifests itself as an alternate personality for an
NVDIMM namespace, i.e. a namespace can be in raw mode (pmemX,
ndblkX.Y, etc...), or 'sectored' mode, (pmemXs, ndblkX.Ys,
etc...).
Select Y if unsure.
endif
obj-$(CONFIG_LIBNVDIMM) += libnvdimm.o
obj-$(CONFIG_BLK_DEV_PMEM) += nd_pmem.o
obj-$(CONFIG_ND_BTT) += nd_btt.o
obj-$(CONFIG_ND_BLK) += nd_blk.o
nd_pmem-y := pmem.o
nd_btt-y := btt.o
nd_blk-y := blk.o
libnvdimm-y := core.o
libnvdimm-y += bus.o
libnvdimm-y += dimm_devs.o
libnvdimm-y += dimm.o
libnvdimm-y += region_devs.o
libnvdimm-y += region.o
libnvdimm-y += namespace_devs.o
libnvdimm-y += label.o
libnvdimm-$(CONFIG_BTT) += btt_devs.o
/*
* NVDIMM Block Window Driver
* Copyright (c) 2014, Intel Corporation.
*
* This program is free software; you can redistribute it and/or modify it
* under the terms and conditions of the GNU General Public License,
* version 2, as published by the Free Software Foundation.
*
* This program is distributed in the hope it will be useful, but WITHOUT
* ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
* FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
* more details.
*/
#include <linux/blkdev.h>
#include <linux/fs.h>
#include <linux/genhd.h>
#include <linux/module.h>
#include <linux/moduleparam.h>
#include <linux/nd.h>
#include <linux/sizes.h>
#include "nd.h"
struct nd_blk_device {
struct request_queue *queue;
struct gendisk *disk;
struct nd_namespace_blk *nsblk;
struct nd_blk_region *ndbr;
size_t disk_size;
u32 sector_size;
u32 internal_lbasize;
};
static int nd_blk_major;
static u32 nd_blk_meta_size(struct nd_blk_device *blk_dev)
{
return blk_dev->nsblk->lbasize - blk_dev->sector_size;
}
static resource_size_t to_dev_offset(struct nd_namespace_blk *nsblk,
resource_size_t ns_offset, unsigned int len)
{
int i;
for (i = 0; i < nsblk->num_resources; i++) {
if (ns_offset < resource_size(nsblk->res[i])) {
if (ns_offset + len > resource_size(nsblk->res[i])) {
dev_WARN_ONCE(&nsblk->common.dev, 1,
"illegal request\n");
return SIZE_MAX;
}
return nsblk->res[i]->start + ns_offset;
}
ns_offset -= resource_size(nsblk->res[i]);
}
dev_WARN_ONCE(&nsblk->common.dev, 1, "request out of range\n");
return SIZE_MAX;
}
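The translation done by to_dev_offset() can be illustrated in isolation. This is a userspace sketch under stated assumptions: `struct sketch_extent` is a hypothetical stand-in for the `struct resource` entries in `nsblk->res[]`, and `UINT64_MAX` plays the role of `SIZE_MAX` as the error sentinel. A namespace is treated as the concatenation of its extents, and a request that would straddle two extents is rejected.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one 'struct resource' extent of a namespace. */
struct sketch_extent {
	uint64_t start;	/* device-relative start of the extent */
	uint64_t size;	/* length of the extent in bytes */
};

/*
 * Translate a namespace-relative offset into a device offset.
 * Returns UINT64_MAX when the request is out of range or would
 * cross from one extent into the next.
 */
static uint64_t sketch_to_dev_offset(const struct sketch_extent *ext,
		size_t count, uint64_t ns_offset, uint64_t len)
{
	for (size_t i = 0; i < count; i++) {
		if (ns_offset < ext[i].size) {
			if (ns_offset + len > ext[i].size)
				return UINT64_MAX;
			return ext[i].start + ns_offset;
		}
		/* Not in this extent: make the offset relative to the next. */
		ns_offset -= ext[i].size;
	}
	return UINT64_MAX;
}
```

For two extents `{0x1000, 0x100}` and `{0x8000, 0x200}`, namespace offset `0x100` lands at device offset `0x8000`, the first byte of the second extent.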
#ifdef CONFIG_BLK_DEV_INTEGRITY
static int nd_blk_rw_integrity(struct nd_blk_device *blk_dev,
struct bio_integrity_payload *bip, u64 lba,
int rw)
{
unsigned int len = nd_blk_meta_size(blk_dev);
resource_size_t dev_offset, ns_offset;
struct nd_namespace_blk *nsblk;
struct nd_blk_region *ndbr;
int err = 0;
nsblk = blk_dev->nsblk;
ndbr = blk_dev->ndbr;
ns_offset = lba * blk_dev->internal_lbasize + blk_dev->sector_size;
dev_offset = to_dev_offset(nsblk, ns_offset, len);
if (dev_offset == SIZE_MAX)
return -EIO;
while (len) {
unsigned int cur_len;
struct bio_vec bv;
void *iobuf;
bv = bvec_iter_bvec(bip->bip_vec, bip->bip_iter);
/*
* The 'bv' obtained from bvec_iter_bvec has its .bv_len and
* .bv_offset already adjusted for iter->bi_bvec_done, and we
* can use those directly
*/
cur_len = min(len, bv.bv_len);
iobuf = kmap_atomic(bv.bv_page);
err = ndbr->do_io(ndbr, dev_offset, iobuf + bv.bv_offset,
cur_len, rw);
kunmap_atomic(iobuf);
if (err)
return err;
len -= cur_len;
dev_offset += cur_len;
bvec_iter_advance(bip->bip_vec, &bip->bip_iter, cur_len);
}
return err;
}
#else /* CONFIG_BLK_DEV_INTEGRITY */
static int nd_blk_rw_integrity(struct nd_blk_device *blk_dev,
struct bio_integrity_payload *bip, u64 lba,
int rw)
{
return 0;
}
#endif
static int nd_blk_do_bvec(struct nd_blk_device *blk_dev,
struct bio_integrity_payload *bip, struct page *page,
unsigned int len, unsigned int off, int rw,
sector_t sector)
{
struct nd_blk_region *ndbr = blk_dev->ndbr;
resource_size_t dev_offset, ns_offset;
int err = 0;
void *iobuf;
u64 lba;
while (len) {
unsigned int cur_len;
/*
* If we don't have an integrity payload, we don't have to
* split the bvec into sectors, as this would cause unnecessary
* Block Window setup/move steps. The do_io routine is capable
* of handling len <= PAGE_SIZE.
*/
cur_len = bip ? min(len, blk_dev->sector_size) : len;
lba = div_u64(sector << SECTOR_SHIFT, blk_dev->sector_size);
ns_offset = lba * blk_dev->internal_lbasize;
dev_offset = to_dev_offset(blk_dev->nsblk, ns_offset, cur_len);
if (dev_offset == SIZE_MAX)
return -EIO;
iobuf = kmap_atomic(page);
err = ndbr->do_io(ndbr, dev_offset, iobuf + off, cur_len, rw);
kunmap_atomic(iobuf);
if (err)
return err;
if (bip) {
err = nd_blk_rw_integrity(blk_dev, bip, lba, rw);
if (err)
return err;
}
len -= cur_len;
off += cur_len;
sector += blk_dev->sector_size >> SECTOR_SHIFT;
}
return err;
}
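The sector arithmetic in nd_blk_do_bvec() reduces to two steps: convert the 512-byte Linux sector number to a logical block address, then scale by the on-media stride. A small sketch of that arithmetic follows; the function name is hypothetical, and `SKETCH_SECTOR_SHIFT` is assumed to equal the kernel's `SECTOR_SHIFT` of 9.

```c
#include <stdint.h>

#define SKETCH_SECTOR_SHIFT 9	/* 512-byte Linux sectors (assumed) */

/* Namespace byte offset of the block that holds 'sector'. */
static uint64_t sketch_sector_to_ns_offset(uint64_t sector,
		uint32_t sector_size, uint32_t internal_lbasize)
{
	/* Bytes into the disk, divided by the exposed block size. */
	uint64_t lba = (sector << SKETCH_SECTOR_SHIFT) / sector_size;

	/* Each block occupies internal_lbasize bytes on media. */
	return lba * internal_lbasize;
}
```

For example, sector 16 on a device with 4096-byte blocks stored at a 4160-byte stride is LBA 2, which starts at namespace offset 8320.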
static void nd_blk_make_request(struct request_queue *q, struct bio *bio)
{
struct block_device *bdev = bio->bi_bdev;
struct gendisk *disk = bdev->bd_disk;
struct bio_integrity_payload *bip;
struct nd_blk_device *blk_dev;
struct bvec_iter iter;
unsigned long start;
struct bio_vec bvec;
int err = 0, rw;
bool do_acct;
/*
* bio_integrity_enabled also checks if the bio already has an
* integrity payload attached. If it does, we *don't* do a
* bio_integrity_prep here - the payload has been generated by
* another kernel subsystem, and we just pass it through.
*/
if (bio_integrity_enabled(bio) && bio_integrity_prep(bio)) {
err = -EIO;
goto out;
}
bip = bio_integrity(bio);
blk_dev = disk->private_data;
rw = bio_data_dir(bio);
do_acct = nd_iostat_start(bio, &start);
bio_for_each_segment(bvec, bio, iter) {
unsigned int len = bvec.bv_len;
BUG_ON(len > PAGE_SIZE);
err = nd_blk_do_bvec(blk_dev, bip, bvec.bv_page, len,
bvec.bv_offset, rw, iter.bi_sector);
if (err) {
dev_info(&blk_dev->nsblk->common.dev,
"io error in %s sector %lld, len %d,\n",
(rw == READ) ? "READ" : "WRITE",
(unsigned long long) iter.bi_sector, len);
break;
}
}
if (do_acct)
nd_iostat_end(bio, start);
out:
bio_endio(bio, err);
}
static int nd_blk_rw_bytes(struct nd_namespace_common *ndns,
resource_size_t offset, void *iobuf, size_t n, int rw)
{
struct nd_blk_device *blk_dev = dev_get_drvdata(ndns->claim);
struct nd_namespace_blk *nsblk = blk_dev->nsblk;
struct nd_blk_region *ndbr = blk_dev->ndbr;
resource_size_t dev_offset;
dev_offset = to_dev_offset(nsblk, offset, n);
if (unlikely(offset + n > blk_dev->disk_size)) {
dev_WARN_ONCE(&ndns->dev, 1, "request out of range\n");
return -EFAULT;
}
if (dev_offset == SIZE_MAX)
return -EIO;
return ndbr->do_io(ndbr, dev_offset, iobuf, n, rw);
}
static const struct block_device_operations nd_blk_fops = {
.owner = THIS_MODULE,
.revalidate_disk = nvdimm_revalidate_disk,
};
static int nd_blk_attach_disk(struct nd_namespace_common *ndns,
struct nd_blk_device *blk_dev)
{
resource_size_t available_disk_size;
struct gendisk *disk;
u64 internal_nlba;
internal_nlba = div_u64(blk_dev->disk_size, blk_dev->internal_lbasize);
available_disk_size = internal_nlba * blk_dev->sector_size;
blk_dev->queue = blk_alloc_queue(GFP_KERNEL);
if (!blk_dev->queue)
return -ENOMEM;
blk_queue_make_request(blk_dev->queue, nd_blk_make_request);
blk_queue_max_hw_sectors(blk_dev->queue, UINT_MAX);
blk_queue_bounce_limit(blk_dev->queue, BLK_BOUNCE_ANY);
blk_queue_logical_block_size(blk_dev->queue, blk_dev->sector_size);
queue_flag_set_unlocked(QUEUE_FLAG_NONROT, blk_dev->queue);
disk = blk_dev->disk = alloc_disk(0);
if (!disk) {
blk_cleanup_queue(blk_dev->queue);
return -ENOMEM;
}
disk->driverfs_dev = &ndns->dev;
disk->major = nd_blk_major;
disk->first_minor = 0;
disk->fops = &nd_blk_fops;
disk->private_data = blk_dev;
disk->queue = blk_dev->queue;
disk->flags = GENHD_FL_EXT_DEVT;
nvdimm_namespace_disk_name(ndns, disk->disk_name);
set_capacity(disk, 0);
add_disk(disk);
if (nd_blk_meta_size(blk_dev)) {
int rc = nd_integrity_init(disk, nd_blk_meta_size(blk_dev));
if (rc) {
del_gendisk(disk);
put_disk(disk);
blk_cleanup_queue(blk_dev->queue);
return rc;
}
}
set_capacity(disk, available_disk_size >> SECTOR_SHIFT);
revalidate_disk(disk);
return 0;
}
static int nd_blk_probe(struct device *dev)
{
struct nd_namespace_common *ndns;
struct nd_namespace_blk *nsblk;
struct nd_blk_device *blk_dev;
int rc;
ndns = nvdimm_namespace_common_probe(dev);
if (IS_ERR(ndns))
return PTR_ERR(ndns);
blk_dev = kzalloc(sizeof(*blk_dev), GFP_KERNEL);
if (!blk_dev)
return -ENOMEM;
nsblk = to_nd_namespace_blk(&ndns->dev);
blk_dev->disk_size = nvdimm_namespace_capacity(ndns);
blk_dev->ndbr = to_nd_blk_region(dev->parent);
blk_dev->nsblk = to_nd_namespace_blk(&ndns->dev);
blk_dev->internal_lbasize = roundup(nsblk->lbasize,
INT_LBASIZE_ALIGNMENT);
blk_dev->sector_size = ((nsblk->lbasize >= 4096) ? 4096 : 512);
dev_set_drvdata(dev, blk_dev);
ndns->rw_bytes = nd_blk_rw_bytes;
if (is_nd_btt(dev))
rc = nvdimm_namespace_attach_btt(ndns);
else if (nd_btt_probe(ndns, blk_dev) == 0) {
/* we'll come back as btt-blk */
rc = -ENXIO;
} else
rc = nd_blk_attach_disk(ndns, blk_dev);
if (rc)
kfree(blk_dev);
return rc;
}
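The geometry that nd_blk_probe() and nd_blk_attach_disk() derive from the namespace LBA size can be worked through numerically. The sketch below is illustrative only: the struct and function names are hypothetical, and the alignment of 64 bytes is an assumption about `INT_LBASIZE_ALIGNMENT`, which is defined outside the code shown here.

```c
#include <stdint.h>

#define SKETCH_INT_LBASIZE_ALIGNMENT 64	/* assumed value, not shown in this diff */

struct sketch_blk_geometry {
	uint32_t sector_size;		/* data bytes per block: 512 or 4096 */
	uint32_t meta_size;		/* integrity metadata bytes per block */
	uint32_t internal_lbasize;	/* on-media stride per block */
	uint64_t capacity;		/* data bytes exposed as the disk */
};

static struct sketch_blk_geometry sketch_blk_compute_geometry(
		uint32_t lbasize, uint64_t disk_size)
{
	struct sketch_blk_geometry g;
	uint32_t align = SKETCH_INT_LBASIZE_ALIGNMENT;

	g.sector_size = lbasize >= 4096 ? 4096 : 512;
	g.meta_size = lbasize - g.sector_size;
	/* Round the LBA size up to the next multiple of the alignment. */
	g.internal_lbasize = (lbasize + align - 1) / align * align;
	/* Whole blocks that fit, counted in data bytes only. */
	g.capacity = disk_size / g.internal_lbasize * g.sector_size;
	return g;
}
```

Under that assumption, a 520-byte LBA size on a 1 MiB namespace yields 512-byte sectors with 8 bytes of metadata, a 576-byte stride, 1820 blocks, and 931840 bytes of exposed capacity.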
static void nd_blk_detach_disk(struct nd_blk_device *blk_dev)
{
del_gendisk(blk_dev->disk);
put_disk(blk_dev->disk);
blk_cleanup_queue(blk_dev->queue);
}
static int nd_blk_remove(struct device *dev)
{
struct nd_blk_device *blk_dev = dev_get_drvdata(dev);
if (is_nd_btt(dev))
nvdimm_namespace_detach_btt(to_nd_btt(dev)->ndns);
else
nd_blk_detach_disk(blk_dev);
kfree(blk_dev);
return 0;
}
static struct nd_device_driver nd_blk_driver = {
.probe = nd_blk_probe,
.remove = nd_blk_remove,
.drv = {
.name = "nd_blk",
},
.type = ND_DRIVER_NAMESPACE_BLK,
};
static int __init nd_blk_init(void)
{
int rc;
rc = register_blkdev(0, "nd_blk");
if (rc < 0)
return rc;
nd_blk_major = rc;
rc = nd_driver_register(&nd_blk_driver);
if (rc < 0)
unregister_blkdev(nd_blk_major, "nd_blk");
return rc;
}
static void __exit nd_blk_exit(void)
{
driver_unregister(&nd_blk_driver.drv);
unregister_blkdev(nd_blk_major, "nd_blk");
}
MODULE_AUTHOR("Ross Zwisler <ross.zwisler@linux.intel.com>");
MODULE_LICENSE("GPL v2");
MODULE_ALIAS_ND_DEVICE(ND_DEVICE_NAMESPACE_BLK);
module_init(nd_blk_init);
module_exit(nd_blk_exit);
This diff is collapsed.
/*
* Block Translation Table library
* Copyright (c) 2014-2015, Intel Corporation.
*
* This program is free software; you can redistribute it and/or modify it
* under the terms and conditions of the GNU General Public License,
* version 2, as published by the Free Software Foundation.
*
* This program is distributed in the hope it will be useful, but WITHOUT
* ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
* FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
* more details.
*/
#ifndef _LINUX_BTT_H
#define _LINUX_BTT_H
#include <linux/types.h>
#define BTT_SIG_LEN 16
#define BTT_SIG "BTT_ARENA_INFO\0"
#define MAP_ENT_SIZE 4
#define MAP_TRIM_SHIFT 31
#define MAP_TRIM_MASK (1 << MAP_TRIM_SHIFT)
#define MAP_ERR_SHIFT 30
#define MAP_ERR_MASK (1 << MAP_ERR_SHIFT)
#define MAP_LBA_MASK (~((1 << MAP_TRIM_SHIFT) | (1 << MAP_ERR_SHIFT)))
#define MAP_ENT_NORMAL 0xC0000000
#define LOG_ENT_SIZE sizeof(struct log_entry)
#define ARENA_MIN_SIZE (1UL << 24) /* 16 MB */
#define ARENA_MAX_SIZE (1ULL << 39) /* 512 GB */
#define RTT_VALID (1UL << 31)
#define RTT_INVALID 0
#define BTT_PG_SIZE 4096
#define BTT_DEFAULT_NFREE ND_MAX_LANES
#define LOG_SEQ_INIT 1
#define IB_FLAG_ERROR 0x00000001
#define IB_FLAG_ERROR_MASK 0x00000001
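The MAP_* masks above split a 32-bit map entry into two flag bits and a 30-bit block number. The following sketch shows only that bit layout; it makes no claim about how the BTT code interprets the flags. The `sketch_` names are hypothetical, and the masks are rewritten with unsigned constants to avoid shifting into the sign bit of an `int`.

```c
#include <stdbool.h>
#include <stdint.h>

/* Unsigned variants of the mask macros (bit 31 and bit 30). */
#define SKETCH_MAP_TRIM_MASK	(1u << 31)
#define SKETCH_MAP_ERR_MASK	(1u << 30)
#define SKETCH_MAP_LBA_MASK	(~(SKETCH_MAP_TRIM_MASK | SKETCH_MAP_ERR_MASK))

struct sketch_map_entry {
	uint32_t lba;	/* low 30 bits */
	bool trim;	/* bit 31 */
	bool error;	/* bit 30 */
};

/* Split a raw (CPU-endian) map entry into its three fields. */
static struct sketch_map_entry sketch_decode_map(uint32_t raw)
{
	struct sketch_map_entry e = {
		.lba = raw & SKETCH_MAP_LBA_MASK,
		.trim = (raw & SKETCH_MAP_TRIM_MASK) != 0,
		.error = (raw & SKETCH_MAP_ERR_MASK) != 0,
	};
	return e;
}
```

Decoding `0xC0000005` gives block 5 with both flag bits set, which is the bit pattern that `MAP_ENT_NORMAL` ORs into an entry.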
enum btt_init_state {
INIT_UNCHECKED = 0,
INIT_NOTFOUND,
INIT_READY
};
struct log_entry {
__le32 lba;
__le32 old_map;
__le32 new_map;
__le32 seq;
__le64 padding[2];
};
struct btt_sb {
u8 signature[BTT_SIG_LEN];
u8 uuid[16];
u8 parent_uuid[16];
__le32 flags;
__le16 version_major;
__le16 version_minor;
__le32 external_lbasize;
__le32 external_nlba;
__le32 internal_lbasize;
__le32 internal_nlba;
__le32 nfree;
__le32 infosize;
__le64 nextoff;
__le64 dataoff;
__le64 mapoff;
__le64 logoff;
__le64 info2off;
u8 padding[3968];
__le64 checksum;
};
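The field sizes in struct btt_sb add up to exactly 4096 bytes, with the checksum occupying the final 8. A userspace mirror with compile-time checks makes that easy to confirm. This is a sketch: `struct sketch_btt_sb` is a hypothetical copy that substitutes fixed-width integer types for the kernel's `__le*` types.

```c
#include <stddef.h>
#include <stdint.h>

/* Userspace mirror of struct btt_sb using fixed-width types. */
struct sketch_btt_sb {
	uint8_t signature[16];
	uint8_t uuid[16];
	uint8_t parent_uuid[16];
	uint32_t flags;
	uint16_t version_major;
	uint16_t version_minor;
	uint32_t external_lbasize;
	uint32_t external_nlba;
	uint32_t internal_lbasize;
	uint32_t internal_nlba;
	uint32_t nfree;
	uint32_t infosize;
	uint64_t nextoff;
	uint64_t dataoff;
	uint64_t mapoff;
	uint64_t logoff;
	uint64_t info2off;
	uint8_t padding[3968];
	uint64_t checksum;
};

/* The layout has no implicit padding, so these hold at compile time. */
_Static_assert(sizeof(struct sketch_btt_sb) == 4096,
		"info block must fill 4096 bytes");
_Static_assert(offsetof(struct sketch_btt_sb, checksum) == 4088,
		"checksum is the last 8 bytes");
```

Because every field already sits on its natural alignment, no packing attribute is needed for the sizes to match.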
struct free_entry {
u32 block;
u8 sub;
u8 seq;
};
struct aligned_lock {
union {
spinlock_t lock;
u8 cacheline_padding[L1_CACHE_BYTES];
};
};
/**
* struct arena_info - handle for an arena
* @size: Size in bytes this arena occupies on the raw device.
* This includes arena metadata.
* @external_lba_start: The first external LBA in this arena.
* @internal_nlba: Number of internal blocks available in the arena
* including nfree reserved blocks
* @internal_lbasize: Internal and external lba sizes may be different as
* we can round up 'odd' external lbasizes such as 520B
* to be aligned.
* @external_nlba: Number of blocks contributed by the arena to the number
* reported to upper layers. (internal_nlba - nfree)
* @external_lbasize: LBA size as exposed to upper layers.
* @nfree: A reserve number of 'free' blocks that is used to
* handle incoming writes.
* @version_major: Metadata layout version major.
* @version_minor: Metadata layout version minor.
* @nextoff: Offset in bytes to the start of the next arena.
* @infooff: Offset in bytes to the info block of this arena.
* @dataoff: Offset in bytes to the data area of this arena.
* @mapoff: Offset in bytes to the map area of this arena.
* @logoff: Offset in bytes to the log area of this arena.
* @info2off: Offset in bytes to the backup info block of this arena.
* @freelist: Pointer to in-memory list of free blocks
* @rtt: Pointer to in-memory "Read Tracking Table"
* @map_locks: Spinlocks protecting concurrent map writes
* @nd_btt: Pointer to parent nd_btt structure.
* @list: List head for list of arenas
* @debugfs_dir: Debugfs dentry
* @flags: Arena flags - may signify error states.
*
* arena_info is a per-arena handle. Once an arena is narrowed down for an
* IO, this struct is passed around for the duration of the IO.
*/
struct arena_info {
u64 size; /* Total bytes for this arena */
u64 external_lba_start;
u32 internal_nlba;
u32 internal_lbasize;
u32 external_nlba;
u32 external_lbasize;
u32 nfree;
u16 version_major;
u16 version_minor;
/* Byte offsets to the different on-media structures */
u64 nextoff;
u64 infooff;
u64 dataoff;
u64 mapoff;
u64 logoff;
u64 info2off;
/* Pointers to other in-memory structures for this arena */
struct free_entry *freelist;
u32 *rtt;
struct aligned_lock *map_locks;
struct nd_btt *nd_btt;
struct list_head list;
struct dentry *debugfs_dir;
/* Arena flags */
u32 flags;
};
/**
* struct btt - handle for a BTT instance
* @btt_disk: Pointer to the gendisk for BTT device
* @btt_queue: Pointer to the request queue for the BTT device
* @arena_list: Head of the list of arenas
* @debugfs_dir: Debugfs dentry
* @nd_btt: Parent nd_btt struct
* @nlba: Number of logical blocks exposed to the upper layers
* after removing the amount of space needed by metadata
* @rawsize: Total size in bytes of the available backing device
* @lbasize: LBA size as requested and presented to upper layers.
* This is sector_size + size of any metadata.
* @sector_size: The Linux sector size - 512 or 4096
* @nd_region: Pointer to the parent nd_region structure
* @init_lock: Mutex used for the BTT initialization
* @init_state: Flag describing the initialization state for the BTT
* @num_arenas: Number of arenas in the BTT instance
*/
struct btt {
struct gendisk *btt_disk;
struct request_queue *btt_queue;
struct list_head arena_list;
struct dentry *debugfs_dir;
struct nd_btt *nd_btt;
u64 nlba;
unsigned long long rawsize;
u32 lbasize;
u32 sector_size;
struct nd_region *nd_region;
struct mutex init_lock;
int init_state;
int num_arenas;
};
#endif
/*
* Copyright(c) 2013-2015 Intel Corporation. All rights reserved.
*
* This program is free software; you can redistribute it and/or modify
* it under the terms of version 2 of the GNU General Public License as
* published by the Free Software Foundation.
*
* This program is distributed in the hope that it will be useful, but
* WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
* General Public License for more details.
*/
#include <linux/blkdev.h>
#include <linux/device.h>
#include <linux/genhd.h>
#include <linux/sizes.h>
#include <linux/slab.h>
#include <linux/fs.h>
#include <linux/mm.h>
#include "nd-core.h"
#include "btt.h"
#include "nd.h"
static void __nd_btt_detach_ndns(struct nd_btt *nd_btt)
{
struct nd_namespace_common *ndns = nd_btt->ndns;
dev_WARN_ONCE(&nd_btt->dev, !mutex_is_locked(&ndns->dev.mutex)
|| ndns->claim != &nd_btt->dev,
"%s: invalid claim\n", __func__);
ndns->claim = NULL;
nd_btt->ndns = NULL;
put_device(&ndns->dev);
}
static void nd_btt_detach_ndns(struct nd_btt *nd_btt)
{
struct nd_namespace_common *ndns = nd_btt->ndns;
if (!ndns)
return;
get_device(&ndns->dev);
device_lock(&ndns->dev);
__nd_btt_detach_ndns(nd_btt);
device_unlock(&ndns->dev);
put_device(&ndns->dev);
}
static bool __nd_btt_attach_ndns(struct nd_btt *nd_btt,
struct nd_namespace_common *ndns)
{
if (ndns->claim)
return false;
dev_WARN_ONCE(&nd_btt->dev, !mutex_is_locked(&ndns->dev.mutex)
|| nd_btt->ndns,
"%s: invalid claim\n", __func__);
ndns->claim = &nd_btt->dev;
nd_btt->ndns = ndns;
get_device(&ndns->dev);
return true;
}
static bool nd_btt_attach_ndns(struct nd_btt *nd_btt,
struct nd_namespace_common *ndns)
{
bool claimed;
device_lock(&ndns->dev);
claimed = __nd_btt_attach_ndns(nd_btt, ndns);
device_unlock(&ndns->dev);
return claimed;
}
static void nd_btt_release(struct device *dev)
{
struct nd_region *nd_region = to_nd_region(dev->parent);
struct nd_btt *nd_btt = to_nd_btt(dev);
dev_dbg(dev, "%s\n", __func__);
nd_btt_detach_ndns(nd_btt);
ida_simple_remove(&nd_region->btt_ida, nd_btt->id);
kfree(nd_btt->uuid);
kfree(nd_btt);
}
static struct device_type nd_btt_device_type = {
.name = "nd_btt",
.release = nd_btt_release,
};
bool is_nd_btt(struct device *dev)
{
return dev->type == &nd_btt_device_type;
}
EXPORT_SYMBOL(is_nd_btt);
struct nd_btt *to_nd_btt(struct device *dev)
{
struct nd_btt *nd_btt = container_of(dev, struct nd_btt, dev);
WARN_ON(!is_nd_btt(dev));
return nd_btt;
}
EXPORT_SYMBOL(to_nd_btt);
static const unsigned long btt_lbasize_supported[] = { 512, 520, 528,
4096, 4104, 4160, 4224, 0 };
static ssize_t sector_size_show(struct device *dev,
struct device_attribute *attr, char *buf)
{
struct nd_btt *nd_btt = to_nd_btt(dev);
return nd_sector_size_show(nd_btt->lbasize, btt_lbasize_supported, buf);
}
static ssize_t sector_size_store(struct device *dev,
struct device_attribute *attr, const char *buf, size_t len)
{
struct nd_btt *nd_btt = to_nd_btt(dev);
ssize_t rc;
device_lock(dev);
nvdimm_bus_lock(dev);
rc = nd_sector_size_store(dev, buf, &nd_btt->lbasize,
btt_lbasize_supported);
dev_dbg(dev, "%s: result: %zd wrote: %s%s", __func__,
rc, buf, buf[len - 1] == '\n' ? "" : "\n");
nvdimm_bus_unlock(dev);
device_unlock(dev);
return rc ? rc : len;
}
static DEVICE_ATTR_RW(sector_size);
static ssize_t uuid_show(struct device *dev,
struct device_attribute *attr, char *buf)
{
struct nd_btt *nd_btt = to_nd_btt(dev);
if (nd_btt->uuid)
return sprintf(buf, "%pUb\n", nd_btt->uuid);
return sprintf(buf, "\n");
}
static ssize_t uuid_store(struct device *dev,
struct device_attribute *attr, const char *buf, size_t len)
{
struct nd_btt *nd_btt = to_nd_btt(dev);
ssize_t rc;
device_lock(dev);
rc = nd_uuid_store(dev, &nd_btt->uuid, buf, len);
dev_dbg(dev, "%s: result: %zd wrote: %s%s", __func__,
rc, buf, buf[len - 1] == '\n' ? "" : "\n");
device_unlock(dev);
return rc ? rc : len;
}
static DEVICE_ATTR_RW(uuid);
static ssize_t namespace_show(struct device *dev,
struct device_attribute *attr, char *buf)
{
struct nd_btt *nd_btt = to_nd_btt(dev);
ssize_t rc;
nvdimm_bus_lock(dev);
rc = sprintf(buf, "%s\n", nd_btt->ndns
? dev_name(&nd_btt->ndns->dev) : "");
nvdimm_bus_unlock(dev);
return rc;
}
static int namespace_match(struct device *dev, void *data)
{
char *name = data;
return strcmp(name, dev_name(dev)) == 0;
}
static bool is_nd_btt_idle(struct device *dev)
{
struct nd_region *nd_region = to_nd_region(dev->parent);
struct nd_btt *nd_btt = to_nd_btt(dev);
if (nd_region->btt_seed == dev || nd_btt->ndns || dev->driver)
return false;
return true;
}
static ssize_t __namespace_store(struct device *dev,
struct device_attribute *attr, const char *buf, size_t len)
{
struct nd_btt *nd_btt = to_nd_btt(dev);
struct nd_namespace_common *ndns;
struct device *found;
char *name;
if (dev->driver) {
dev_dbg(dev, "%s: -EBUSY\n", __func__);
return -EBUSY;
}
name = kstrndup(buf, len, GFP_KERNEL);
if (!name)
return -ENOMEM;
strim(name);
if (strncmp(name, "namespace", 9) == 0 || strcmp(name, "") == 0)
/* pass */;
else {
len = -EINVAL;
goto out;
}
ndns = nd_btt->ndns;
if (strcmp(name, "") == 0) {
/* detach the namespace and destroy / reset the btt device */
nd_btt_detach_ndns(nd_btt);
if (is_nd_btt_idle(dev))
nd_device_unregister(dev, ND_ASYNC);
else {
nd_btt->lbasize = 0;
kfree(nd_btt->uuid);
nd_btt->uuid = NULL;
}
goto out;
} else if (ndns) {
dev_dbg(dev, "namespace already set to: %s\n",
dev_name(&ndns->dev));
len = -EBUSY;
goto out;
}
found = device_find_child(dev->parent, name, namespace_match);
if (!found) {
dev_dbg(dev, "'%s' not found under %s\n", name,
dev_name(dev->parent));
len = -ENODEV;
goto out;
}
ndns = to_ndns(found);
if (__nvdimm_namespace_capacity(ndns) < SZ_16M) {
dev_dbg(dev, "%s too small to host btt\n", name);
len = -ENXIO;
goto out_attach;
}
WARN_ON_ONCE(!is_nvdimm_bus_locked(&nd_btt->dev));
if (!nd_btt_attach_ndns(nd_btt, ndns)) {
dev_dbg(dev, "%s already claimed\n",
dev_name(&ndns->dev));
len = -EBUSY;
}
out_attach:
put_device(&ndns->dev); /* from device_find_child */
out:
kfree(name);
return len;
}
static ssize_t namespace_store(struct device *dev,
struct device_attribute *attr, const char *buf, size_t len)
{
ssize_t rc;
nvdimm_bus_lock(dev);
device_lock(dev);
rc = __namespace_store(dev, attr, buf, len);
dev_dbg(dev, "%s: result: %zd wrote: %s%s", __func__,
rc, buf, buf[len - 1] == '\n' ? "" : "\n");
device_unlock(dev);
nvdimm_bus_unlock(dev);
return rc;
}
static DEVICE_ATTR_RW(namespace);
static struct attribute *nd_btt_attributes[] = {
&dev_attr_sector_size.attr,
&dev_attr_namespace.attr,
&dev_attr_uuid.attr,
NULL,
};
static struct attribute_group nd_btt_attribute_group = {
.attrs = nd_btt_attributes,
};
static const struct attribute_group *nd_btt_attribute_groups[] = {
&nd_btt_attribute_group,
&nd_device_attribute_group,
&nd_numa_attribute_group,
NULL,
};
static struct device *__nd_btt_create(struct nd_region *nd_region,
unsigned long lbasize, u8 *uuid,
struct nd_namespace_common *ndns)
{
struct nd_btt *nd_btt;
struct device *dev;
nd_btt = kzalloc(sizeof(*nd_btt), GFP_KERNEL);
if (!nd_btt)
return NULL;
nd_btt->id = ida_simple_get(&nd_region->btt_ida, 0, 0, GFP_KERNEL);
if (nd_btt->id < 0) {
kfree(nd_btt);
return NULL;
}
nd_btt->lbasize = lbasize;
if (uuid)
uuid = kmemdup(uuid, 16, GFP_KERNEL);
nd_btt->uuid = uuid;
dev = &nd_btt->dev;
dev_set_name(dev, "btt%d.%d", nd_region->id, nd_btt->id);
dev->parent = &nd_region->dev;
dev->type = &nd_btt_device_type;
dev->groups = nd_btt_attribute_groups;
device_initialize(&nd_btt->dev);
if (ndns && !__nd_btt_attach_ndns(nd_btt, ndns)) {
dev_dbg(&ndns->dev, "%s failed, already claimed by %s\n",
__func__, dev_name(ndns->claim));
put_device(dev);
return NULL;
}
return dev;
}
struct device *nd_btt_create(struct nd_region *nd_region)
{
struct device *dev = __nd_btt_create(nd_region, 0, NULL, NULL);
if (dev)
__nd_device_register(dev);
return dev;
}
/*
* nd_btt_sb_checksum: compute checksum for btt info block
*
* Returns a fletcher64 checksum of everything in the given info block
* except the last field (since that's where the checksum lives).
*/
u64 nd_btt_sb_checksum(struct btt_sb *btt_sb)
{
u64 sum;
__le64 sum_save;
sum_save = btt_sb->checksum;
btt_sb->checksum = 0;
sum = nd_fletcher64(btt_sb, sizeof(*btt_sb), 1);
btt_sb->checksum = sum_save;
return sum;
}
EXPORT_SYMBOL(nd_btt_sb_checksum);
static int __nd_btt_probe(struct nd_btt *nd_btt,
struct nd_namespace_common *ndns, struct btt_sb *btt_sb)
{
u64 checksum;
if (!btt_sb || !ndns || !nd_btt)
return -ENODEV;
if (nvdimm_read_bytes(ndns, SZ_4K, btt_sb, sizeof(*btt_sb)))
return -ENXIO;
if (nvdimm_namespace_capacity(ndns) < SZ_16M)
return -ENXIO;
if (memcmp(btt_sb->signature, BTT_SIG, BTT_SIG_LEN) != 0)
return -ENODEV;
checksum = le64_to_cpu(btt_sb->checksum);
btt_sb->checksum = 0;
if (checksum != nd_btt_sb_checksum(btt_sb))
return -ENODEV;
btt_sb->checksum = cpu_to_le64(checksum);
nd_btt->lbasize = le32_to_cpu(btt_sb->external_lbasize);
nd_btt->uuid = kmemdup(btt_sb->uuid, 16, GFP_KERNEL);
if (!nd_btt->uuid)
return -ENOMEM;
__nd_device_register(&nd_btt->dev);
return 0;
}
int nd_btt_probe(struct nd_namespace_common *ndns, void *drvdata)
{
int rc;
struct device *dev;
struct btt_sb *btt_sb;
struct nd_region *nd_region = to_nd_region(ndns->dev.parent);
if (ndns->force_raw)
return -ENODEV;
nvdimm_bus_lock(&ndns->dev);
dev = __nd_btt_create(nd_region, 0, NULL, ndns);
nvdimm_bus_unlock(&ndns->dev);
if (!dev)
return -ENOMEM;
dev_set_drvdata(dev, drvdata);
btt_sb = kzalloc(sizeof(*btt_sb), GFP_KERNEL);
rc = __nd_btt_probe(to_nd_btt(dev), ndns, btt_sb);
kfree(btt_sb);
dev_dbg(&ndns->dev, "%s: btt: %s\n", __func__,
rc == 0 ? dev_name(dev) : "<none>");
if (rc < 0) {
__nd_btt_detach_ndns(to_nd_btt(dev));
put_device(dev);
}
return rc;
}
EXPORT_SYMBOL(nd_btt_probe);
This diff is collapsed.
/*
* Copyright(c) 2013-2015 Intel Corporation. All rights reserved.
*
* This program is free software; you can redistribute it and/or modify
* it under the terms of version 2 of the GNU General Public License as
* published by the Free Software Foundation.
*
* This program is distributed in the hope that it will be useful, but
* WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
* General Public License for more details.
*/
#include <linux/libnvdimm.h>
#include <linux/export.h>
#include <linux/module.h>
#include <linux/blkdev.h>
#include <linux/device.h>
#include <linux/ctype.h>
#include <linux/ndctl.h>
#include <linux/mutex.h>
#include <linux/slab.h>
#include "nd-core.h"
#include "nd.h"
LIST_HEAD(nvdimm_bus_list);
DEFINE_MUTEX(nvdimm_bus_list_mutex);
static DEFINE_IDA(nd_ida);
void nvdimm_bus_lock(struct device *dev)
{
struct nvdimm_bus *nvdimm_bus = walk_to_nvdimm_bus(dev);
if (!nvdimm_bus)
return;
mutex_lock(&nvdimm_bus->reconfig_mutex);
}
EXPORT_SYMBOL(nvdimm_bus_lock);
void nvdimm_bus_unlock(struct device *dev)
{
struct nvdimm_bus *nvdimm_bus = walk_to_nvdimm_bus(dev);
if (!nvdimm_bus)
return;
mutex_unlock(&nvdimm_bus->reconfig_mutex);
}
EXPORT_SYMBOL(nvdimm_bus_unlock);
bool is_nvdimm_bus_locked(struct device *dev)
{
struct nvdimm_bus *nvdimm_bus = walk_to_nvdimm_bus(dev);
if (!nvdimm_bus)
return false;
return mutex_is_locked(&nvdimm_bus->reconfig_mutex);
}
EXPORT_SYMBOL(is_nvdimm_bus_locked);
u64 nd_fletcher64(void *addr, size_t len, bool le)
{
u32 *buf = addr;
u32 lo32 = 0;
u64 hi32 = 0;
int i;
for (i = 0; i < len / sizeof(u32); i++) {
lo32 += le ? le32_to_cpu((__le32) buf[i]) : buf[i];
hi32 += lo32;
}
return hi32 << 32 | lo32;
}
EXPORT_SYMBOL_GPL(nd_fletcher64);
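The checksum loop can be reproduced in userspace to check values by hand. This is a sketch, not the kernel function: `sketch_fletcher64` and `sketch_le32` are hypothetical names, and the sketch always decodes words as little-endian (the `le == true` case) so that the result does not depend on host byte order.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one little-endian 32-bit word regardless of host byte order. */
static uint32_t sketch_le32(const uint8_t *p)
{
	return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
	       (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/*
 * Running sum over 32-bit words: lo32 accumulates the words (wrapping
 * at 32 bits), hi32 accumulates the successive values of lo32.
 * Trailing bytes that do not fill a whole word are ignored.
 */
static uint64_t sketch_fletcher64(const void *addr, size_t len)
{
	const uint8_t *buf = addr;
	uint32_t lo32 = 0;
	uint64_t hi32 = 0;

	for (size_t i = 0; i < len / sizeof(uint32_t); i++) {
		lo32 += sketch_le32(buf + i * sizeof(uint32_t));
		hi32 += lo32;
	}
	return hi32 << 32 | lo32;
}
```

For the three words 1, 2, 3 the low half is 1 + 2 + 3 = 6 and the high half is 1 + 3 + 6 = 10, giving `0x0000000A00000006`.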
static void nvdimm_bus_release(struct device *dev)
{
struct nvdimm_bus *nvdimm_bus;
nvdimm_bus = container_of(dev, struct nvdimm_bus, dev);
ida_simple_remove(&nd_ida, nvdimm_bus->id);
kfree(nvdimm_bus);
}
struct nvdimm_bus *to_nvdimm_bus(struct device *dev)
{
struct nvdimm_bus *nvdimm_bus;
nvdimm_bus = container_of(dev, struct nvdimm_bus, dev);
WARN_ON(nvdimm_bus->dev.release != nvdimm_bus_release);
return nvdimm_bus;
}
EXPORT_SYMBOL_GPL(to_nvdimm_bus);
struct nvdimm_bus_descriptor *to_nd_desc(struct nvdimm_bus *nvdimm_bus)
{
/* struct nvdimm_bus definition is private to libnvdimm */
return nvdimm_bus->nd_desc;
}
EXPORT_SYMBOL_GPL(to_nd_desc);
struct nvdimm_bus *walk_to_nvdimm_bus(struct device *nd_dev)
{
struct device *dev;
for (dev = nd_dev; dev; dev = dev->parent)
if (dev->release == nvdimm_bus_release)
break;
dev_WARN_ONCE(nd_dev, !dev, "invalid dev, not on nd bus\n");
if (dev)
return to_nvdimm_bus(dev);
return NULL;
}
static bool is_uuid_sep(char sep)
{
if (sep == '\n' || sep == '-' || sep == ':' || sep == '\0')
return true;
return false;
}
static int nd_uuid_parse(struct device *dev, u8 *uuid_out, const char *buf,
size_t len)
{
const char *str = buf;
u8 uuid[16];
int i;
for (i = 0; i < 16; i++) {
if (!isxdigit(str[0]) || !isxdigit(str[1])) {
dev_dbg(dev, "%s: pos: %d buf[%zd]: %c buf[%zd]: %c\n",
__func__, i, str - buf, str[0],
str + 1 - buf, str[1]);
return -EINVAL;
}
uuid[i] = (hex_to_bin(str[0]) << 4) | hex_to_bin(str[1]);
str += 2;
if (is_uuid_sep(*str))
str++;
}
memcpy(uuid_out, uuid, sizeof(uuid));
return 0;
}
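The parsing rule in nd_uuid_parse() is: read 16 pairs of hex digits, and after each pair skip at most one separator. A userspace sketch of that rule follows. The names are hypothetical, and the sketch deliberately differs in one respect: it does not skip over a NUL terminator, so a short input string fails cleanly instead of being read past its end.

```c
#include <ctype.h>
#include <stdint.h>
#include <string.h>

/* Value of one hex digit; the caller has already checked isxdigit(). */
static int sketch_hex_val(char c)
{
	if (c >= '0' && c <= '9')
		return c - '0';
	return tolower((unsigned char)c) - 'a' + 10;
}

/* Separators that may follow a byte; NUL is intentionally excluded. */
static int sketch_is_uuid_sep(char c)
{
	return c == '\n' || c == '-' || c == ':';
}

/* Parse 16 hex byte pairs into 'out'. Returns 0 on success, -1 on error. */
static int sketch_uuid_parse(uint8_t out[16], const char *str)
{
	uint8_t uuid[16];

	for (int i = 0; i < 16; i++) {
		if (!isxdigit((unsigned char)str[0]) ||
		    !isxdigit((unsigned char)str[1]))
			return -1;
		uuid[i] = (uint8_t)(sketch_hex_val(str[0]) << 4 |
				sketch_hex_val(str[1]));
		str += 2;
		if (sketch_is_uuid_sep(*str))
			str++;
	}
	/* Only touch the output once the whole string has parsed. */
	memcpy(out, uuid, sizeof(uuid));
	return 0;
}
```

Parsing into a local buffer first means a malformed string leaves the caller's UUID untouched, the same property the kernel function gets from its own temporary array.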
/**
* nd_uuid_store: common implementation for writing 'uuid' sysfs attributes
* @dev: container device for the uuid property
* @uuid_out: uuid buffer to replace
* @buf: raw sysfs buffer to parse
* @len: number of bytes in @buf
*
* Enforce that uuids can only be changed while the device is disabled
* (driver detached)
* LOCKING: expects device_lock() is held on entry
*/
int nd_uuid_store(struct device *dev, u8 **uuid_out, const char *buf,
size_t len)
{
u8 uuid[16];
int rc;
if (dev->driver)
return -EBUSY;
rc = nd_uuid_parse(dev, uuid, buf, len);
if (rc)
return rc;
kfree(*uuid_out);
*uuid_out = kmemdup(uuid, sizeof(uuid), GFP_KERNEL);
if (!(*uuid_out))
return -ENOMEM;
return 0;
}
ssize_t nd_sector_size_show(unsigned long current_lbasize,
const unsigned long *supported, char *buf)
{
ssize_t len = 0;
int i;
for (i = 0; supported[i]; i++)
if (current_lbasize == supported[i])
len += sprintf(buf + len, "[%lu] ", supported[i]);
else
len += sprintf(buf + len, "%lu ", supported[i]);
len += sprintf(buf + len, "\n");
return len;
}
ssize_t nd_sector_size_store(struct device *dev, const char *buf,
unsigned long *current_lbasize, const unsigned long *supported)
{
unsigned long lbasize;
int rc, i;
if (dev->driver)
return -EBUSY;
rc = kstrtoul(buf, 0, &lbasize);
if (rc)
return rc;
for (i = 0; supported[i]; i++)
if (lbasize == supported[i])
break;
if (supported[i]) {
*current_lbasize = lbasize;
return 0;
} else {
return -EINVAL;
}
}
void __nd_iostat_start(struct bio *bio, unsigned long *start)
{
struct gendisk *disk = bio->bi_bdev->bd_disk;
const int rw = bio_data_dir(bio);
int cpu = part_stat_lock();
*start = jiffies;
part_round_stats(cpu, &disk->part0);
part_stat_inc(cpu, &disk->part0, ios[rw]);
part_stat_add(cpu, &disk->part0, sectors[rw], bio_sectors(bio));
part_inc_in_flight(&disk->part0, rw);
part_stat_unlock();
}
EXPORT_SYMBOL(__nd_iostat_start);
void nd_iostat_end(struct bio *bio, unsigned long start)
{
struct gendisk *disk = bio->bi_bdev->bd_disk;
unsigned long duration = jiffies - start;
const int rw = bio_data_dir(bio);
int cpu = part_stat_lock();
part_stat_add(cpu, &disk->part0, ticks[rw], duration);
part_round_stats(cpu, &disk->part0);
part_dec_in_flight(&disk->part0, rw);
part_stat_unlock();
}
EXPORT_SYMBOL(nd_iostat_end);
static ssize_t commands_show(struct device *dev,
struct device_attribute *attr, char *buf)
{
int cmd, len = 0;
struct nvdimm_bus *nvdimm_bus = to_nvdimm_bus(dev);
struct nvdimm_bus_descriptor *nd_desc = nvdimm_bus->nd_desc;
for_each_set_bit(cmd, &nd_desc->dsm_mask, BITS_PER_LONG)
len += sprintf(buf + len, "%s ", nvdimm_bus_cmd_name(cmd));
len += sprintf(buf + len, "\n");
return len;
}
static DEVICE_ATTR_RO(commands);
static const char *nvdimm_bus_provider(struct nvdimm_bus *nvdimm_bus)
{
struct nvdimm_bus_descriptor *nd_desc = nvdimm_bus->nd_desc;
struct device *parent = nvdimm_bus->dev.parent;
if (nd_desc->provider_name)
return nd_desc->provider_name;
else if (parent)
return dev_name(parent);
else
return "unknown";
}
static ssize_t provider_show(struct device *dev,
struct device_attribute *attr, char *buf)
{
struct nvdimm_bus *nvdimm_bus = to_nvdimm_bus(dev);
return sprintf(buf, "%s\n", nvdimm_bus_provider(nvdimm_bus));
}
static DEVICE_ATTR_RO(provider);
static int flush_namespaces(struct device *dev, void *data)
{
device_lock(dev);
device_unlock(dev);
return 0;
}
static int flush_regions_dimms(struct device *dev, void *data)
{
device_lock(dev);
device_unlock(dev);
device_for_each_child(dev, NULL, flush_namespaces);
return 0;
}
static ssize_t wait_probe_show(struct device *dev,
struct device_attribute *attr, char *buf)
{
nd_synchronize();
device_for_each_child(dev, NULL, flush_regions_dimms);
return sprintf(buf, "1\n");
}
static DEVICE_ATTR_RO(wait_probe);
static struct attribute *nvdimm_bus_attributes[] = {
&dev_attr_commands.attr,
&dev_attr_wait_probe.attr,
&dev_attr_provider.attr,
NULL,
};
struct attribute_group nvdimm_bus_attribute_group = {
.attrs = nvdimm_bus_attributes,
};
EXPORT_SYMBOL_GPL(nvdimm_bus_attribute_group);
struct nvdimm_bus *__nvdimm_bus_register(struct device *parent,
struct nvdimm_bus_descriptor *nd_desc, struct module *module)
{
struct nvdimm_bus *nvdimm_bus;
int rc;
nvdimm_bus = kzalloc(sizeof(*nvdimm_bus), GFP_KERNEL);
if (!nvdimm_bus)
return NULL;
INIT_LIST_HEAD(&nvdimm_bus->list);
init_waitqueue_head(&nvdimm_bus->probe_wait);
nvdimm_bus->id = ida_simple_get(&nd_ida, 0, 0, GFP_KERNEL);
mutex_init(&nvdimm_bus->reconfig_mutex);
if (nvdimm_bus->id < 0) {
kfree(nvdimm_bus);
return NULL;
}
nvdimm_bus->nd_desc = nd_desc;
nvdimm_bus->module = module;
nvdimm_bus->dev.parent = parent;
nvdimm_bus->dev.release = nvdimm_bus_release;
nvdimm_bus->dev.groups = nd_desc->attr_groups;
dev_set_name(&nvdimm_bus->dev, "ndbus%d", nvdimm_bus->id);
rc = device_register(&nvdimm_bus->dev);
if (rc) {
dev_dbg(&nvdimm_bus->dev, "registration failed: %d\n", rc);
goto err;
}
rc = nvdimm_bus_create_ndctl(nvdimm_bus);
if (rc)
goto err;
mutex_lock(&nvdimm_bus_list_mutex);
list_add_tail(&nvdimm_bus->list, &nvdimm_bus_list);
mutex_unlock(&nvdimm_bus_list_mutex);
return nvdimm_bus;
err:
put_device(&nvdimm_bus->dev);
return NULL;
}
EXPORT_SYMBOL_GPL(__nvdimm_bus_register);
static int child_unregister(struct device *dev, void *data)
{
/*
* the singular ndctl class device per bus needs to be
* "device_destroy"ed, so skip it here
*
* i.e. remove classless children
*/
if (dev->class)
/* pass */;
else
nd_device_unregister(dev, ND_SYNC);
return 0;
}
void nvdimm_bus_unregister(struct nvdimm_bus *nvdimm_bus)
{
if (!nvdimm_bus)
return;
mutex_lock(&nvdimm_bus_list_mutex);
list_del_init(&nvdimm_bus->list);
mutex_unlock(&nvdimm_bus_list_mutex);
nd_synchronize();
device_for_each_child(&nvdimm_bus->dev, NULL, child_unregister);
nvdimm_bus_destroy_ndctl(nvdimm_bus);
device_unregister(&nvdimm_bus->dev);
}
EXPORT_SYMBOL_GPL(nvdimm_bus_unregister);
#ifdef CONFIG_BLK_DEV_INTEGRITY
static int nd_pi_nop_generate_verify(struct blk_integrity_iter *iter)
{
return 0;
}
int nd_integrity_init(struct gendisk *disk, unsigned long meta_size)
{
struct blk_integrity integrity = {
.name = "ND-PI-NOP",
.generate_fn = nd_pi_nop_generate_verify,
.verify_fn = nd_pi_nop_generate_verify,
.tuple_size = meta_size,
.tag_size = meta_size,
};
int ret;
if (meta_size == 0)
return 0;
ret = blk_integrity_register(disk, &integrity);
if (ret)
return ret;
blk_queue_max_integrity_segments(disk->queue, 1);
return 0;
}
EXPORT_SYMBOL(nd_integrity_init);
#else /* CONFIG_BLK_DEV_INTEGRITY */
int nd_integrity_init(struct gendisk *disk, unsigned long meta_size)
{
return 0;
}
EXPORT_SYMBOL(nd_integrity_init);
#endif
static __init int libnvdimm_init(void)
{
int rc;
rc = nvdimm_bus_init();
if (rc)
return rc;
rc = nvdimm_init();
if (rc)
goto err_dimm;
rc = nd_region_init();
if (rc)
goto err_region;
return 0;
err_region:
nvdimm_exit();
err_dimm:
nvdimm_bus_exit();
return rc;
}
static __exit void libnvdimm_exit(void)
{
WARN_ON(!list_empty(&nvdimm_bus_list));
nd_region_exit();
nvdimm_exit();
nvdimm_bus_exit();
}
MODULE_LICENSE("GPL v2");
MODULE_AUTHOR("Intel Corporation");
subsys_initcall(libnvdimm_init);
module_exit(libnvdimm_exit);
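For readers who want to experiment with the textual UUID format accepted by nd_uuid_parse() outside the kernel, the following is a minimal userspace sketch, not the kernel implementation. The names hex_val(), is_sep() and uuid_parse_sketch() are hypothetical. It assumes a NUL-terminated input string and, unlike the listing, does not treat the terminating NUL as a skippable separator, so it never reads past the end of a short string.

```c
#include <assert.h>
#include <ctype.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: value of one hex digit, or -1 if not a hex digit. */
static int hex_val(char c)
{
	if (c >= '0' && c <= '9')
		return c - '0';
	c = (char)tolower((unsigned char)c);
	if (c >= 'a' && c <= 'f')
		return c - 'a' + 10;
	return -1;
}

/* Separators that may follow a byte; NUL is deliberately excluded here. */
static int is_sep(char c)
{
	return c == '\n' || c == '-' || c == ':';
}

/*
 * Parse 16 bytes written as 32 hex digits, allowing one optional
 * separator after each byte. Returns 0 on success, -1 on malformed
 * input. The output buffer is only written on success.
 */
int uuid_parse_sketch(const char *buf, uint8_t out[16])
{
	uint8_t tmp[16];
	const char *s = buf;
	int i;

	assert(buf && out);
	for (i = 0; i < 16; i++) {
		int hi = hex_val(s[0]);
		/* Only look at s[1] when s[0] was valid (and so not NUL). */
		int lo = hi < 0 ? -1 : hex_val(s[1]);

		if (hi < 0 || lo < 0)
			return -1;
		tmp[i] = (uint8_t)((hi << 4) | lo);
		s += 2;
		if (is_sep(*s))
			s++;
	}
	memcpy(out, tmp, sizeof(tmp));
	return 0;
}
```

Calling uuid_parse_sketch("00112233-4455-6677-8899-aabbccddeeff", out) returns 0 and fills out with the bytes 0x00, 0x11, and so on up to 0xff. A truncated string returns -1 and leaves out untouched, mirroring how the listing parses into a temporary buffer before copying.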
/*
* Copyright(c) 2013-2015 Intel Corporation. All rights reserved.
*
* This program is free software; you can redistribute it and/or modify
* it under the terms of version 2 of the GNU General Public License as
* published by the Free Software Foundation.
*
* This program is distributed in the hope that it will be useful, but
* WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
* General Public License for more details.
*/
#ifndef __ND_CORE_H__
#define __ND_CORE_H__
#include <linux/libnvdimm.h>
#include <linux/device.h>
#include <linux/sizes.h>
#include <linux/mutex.h>
#include <linux/nd.h>
extern struct list_head nvdimm_bus_list;
extern struct mutex nvdimm_bus_list_mutex;
extern int nvdimm_major;
struct nvdimm_bus {
struct nvdimm_bus_descriptor *nd_desc;
wait_queue_head_t probe_wait;
struct module *module;
struct list_head list;
struct device dev;
int id, probe_active;
struct mutex reconfig_mutex;
};
struct nvdimm {
unsigned long flags;
void *provider_data;
unsigned long *dsm_mask;
struct device dev;
atomic_t busy;
int id;
};
bool is_nvdimm(struct device *dev);
bool is_nd_pmem(struct device *dev);
bool is_nd_blk(struct device *dev);
struct nvdimm_bus *walk_to_nvdimm_bus(struct device *nd_dev);
int __init nvdimm_bus_init(void);
void nvdimm_bus_exit(void);
void nd_region_probe_success(struct nvdimm_bus *nvdimm_bus, struct device *dev);
struct nd_region;
void nd_region_create_blk_seed(struct nd_region *nd_region);
void nd_region_create_btt_seed(struct nd_region *nd_region);
void nd_region_disable(struct nvdimm_bus *nvdimm_bus, struct device *dev);
int nvdimm_bus_create_ndctl(struct nvdimm_bus *nvdimm_bus);
void nvdimm_bus_destroy_ndctl(struct nvdimm_bus *nvdimm_bus);
void nd_synchronize(void);
int nvdimm_bus_register_dimms(struct nvdimm_bus *nvdimm_bus);
int nvdimm_bus_register_regions(struct nvdimm_bus *nvdimm_bus);
int nvdimm_bus_init_interleave_sets(struct nvdimm_bus *nvdimm_bus);
void __nd_device_register(struct device *dev);
int nd_match_dimm(struct device *dev, void *data);
struct nd_label_id;
char *nd_label_gen_id(struct nd_label_id *label_id, u8 *uuid, u32 flags);
bool nd_is_uuid_unique(struct device *dev, u8 *uuid);
struct nd_region;
struct nvdimm_drvdata;
struct nd_mapping;
resource_size_t nd_pmem_available_dpa(struct nd_region *nd_region,
struct nd_mapping *nd_mapping, resource_size_t *overlap);
resource_size_t nd_blk_available_dpa(struct nd_mapping *nd_mapping);
resource_size_t nd_region_available_dpa(struct nd_region *nd_region);
resource_size_t nvdimm_allocated_dpa(struct nvdimm_drvdata *ndd,
struct nd_label_id *label_id);
struct nd_mapping;
struct resource *nsblk_add_resource(struct nd_region *nd_region,
struct nvdimm_drvdata *ndd, struct nd_namespace_blk *nsblk,
resource_size_t start);
int nvdimm_num_label_slots(struct nvdimm_drvdata *ndd);
void get_ndd(struct nvdimm_drvdata *ndd);
resource_size_t __nvdimm_namespace_capacity(struct nd_namespace_common *ndns);
#endif /* __ND_CORE_H__ */
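The header above declares walk_to_nvdimm_bus(), whose definition earlier in this diff climbs dev->parent until it finds a device whose release callback is nvdimm_bus_release. The following userspace sketch illustrates that pattern of using a callback address as a type tag. Everything in it (struct mock_device, mock_bus_release(), mock_walk_to_bus(), mock_walk_selfcheck()) is a hypothetical stand-in and not part of libnvdimm.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct device: only the fields the walk needs. */
struct mock_device {
	struct mock_device *parent;
	void (*release)(struct mock_device *dev);
};

/* Stand-in for nvdimm_bus_release(); its address acts as the type tag. */
void mock_bus_release(struct mock_device *dev)
{
	(void)dev;
}

/* Climb the parent chain until a device tagged as a bus is found. */
struct mock_device *mock_walk_to_bus(struct mock_device *dev)
{
	for (; dev; dev = dev->parent)
		if (dev->release == mock_bus_release)
			return dev;
	return NULL; /* not on a bus */
}

/* Small self-check: bus <- region <- namespace, plus an unattached device. */
void mock_walk_selfcheck(void)
{
	struct mock_device bus = { NULL, mock_bus_release };
	struct mock_device region = { &bus, NULL };
	struct mock_device ns = { &region, NULL };
	struct mock_device orphan = { NULL, NULL };

	assert(mock_walk_to_bus(&ns) == &bus);
	assert(mock_walk_to_bus(&bus) == &bus);
	assert(mock_walk_to_bus(&orphan) == NULL);
}
```

The design choice this illustrates is that no separate type field is needed: because only bus devices are given that particular release function, comparing the pointer is enough to recognise one. The real function also emits a one-time warning when the walk fails, which the sketch omits.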
......@@ -261,8 +261,13 @@ extern void acpi_osi_setup(char *str);
extern bool acpi_osi_is_win8(void);
#ifdef CONFIG_ACPI_NUMA
int acpi_map_pxm_to_online_node(int pxm);
int acpi_get_node(acpi_handle handle);
#else
static inline int acpi_map_pxm_to_online_node(int pxm)
{
return 0;
}
static inline int acpi_get_node(acpi_handle handle)
{
return 0;
......