Commits · 0f8087ecdeac921fc4920f1328f55c15080bc6aa · openanolis / cloud-kernel

22 Oct, 2015 1 commit

block: Consolidate static integrity profile properties · 0f8087ec

Authored by Martin K. Petersen on Oct 21, 2015

We previously made a complete copy of a device's data integrity profile
even though several of the fields inside the blk_integrity struct are
pointers to fixed template entries in t10-pi.c.

Split the static and per-device portions so that we can reference the
template directly.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Reported-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

0f8087ec
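
The split described in this message can be pictured with a small user-space sketch. The type and function names below (`integrity_profile`, `device_integrity`, `device_integrity_register`) are hypothetical stand-ins chosen for illustration, not the kernel's actual definitions. The point is only that each device keeps a pointer to one shared, read-only template instead of a private copy of it.

```c
#include <stddef.h>

/* Hypothetical stand-in for the static part of a profile: shared, read-only. */
struct integrity_profile {
    const char *name;
    int (*verify)(const void *buf, size_t len);
};

/* Hypothetical stand-in for the per-device part: small and mutable. */
struct device_integrity {
    const struct integrity_profile *profile; /* points at the template, no copy */
    unsigned char tuple_size;
    unsigned char interval_exp;
};

static int always_ok(const void *buf, size_t len)
{
    (void)buf;
    (void)len;
    return 0;
}

/* One fixed template, playing the role of the entries said to live in t10-pi.c. */
static const struct integrity_profile example_profile = {
    .name = "EXAMPLE-PROFILE",
    .verify = always_ok,
};

/* Registration stores a reference to the template plus the per-device values. */
static void device_integrity_register(struct device_integrity *di,
                                      const struct integrity_profile *tmpl,
                                      unsigned char tuple_size,
                                      unsigned char interval_exp)
{
    di->profile = tmpl;
    di->tuple_size = tuple_size;
    di->interval_exp = interval_exp;
}
```

Two devices registered against `example_profile` end up sharing the same `profile` pointer while keeping their own `tuple_size` and `interval_exp`.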

17 Sep, 2015 3 commits

pmem: add proper fencing to pmem_rw_page() · ba8fe0f8

Authored by Ross Zwisler on Sep 16, 2015

pmem_rw_page() needs to call wmb_pmem() on writes to make sure that the
newly written data is durable.  This flow was added to pmem_rw_bytes()
and pmem_make_request() with this commit:

commit 61031952 ("arch, x86: pmem api for ensuring durability of
	persistent memory updates")

...the pmem_rw_page() path was missed.

Cc: <stable@vger.kernel.org>
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

ba8fe0f8
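
The write-then-fence flow this fix adds can be modeled in user space. The sketch below is a toy, not the driver: `toy_pmem`, `toy_memcpy_to_pmem()`, `toy_wmb_pmem()` and `toy_rw_page()` are hypothetical names, and durability is represented by a boolean rather than by real cache flushes.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Toy model of a pmem range: tracks whether any write is still unfenced. */
struct toy_pmem {
    unsigned char media[4096];
    bool unfenced_writes;
};

/* Stand-in for memcpy_to_pmem(): the data may still sit in caches or buffers. */
static void toy_memcpy_to_pmem(struct toy_pmem *p, size_t off,
                               const void *src, size_t len)
{
    memcpy(p->media + off, src, len);
    p->unfenced_writes = true;
}

/* Stand-in for wmb_pmem(): after this, earlier writes count as durable. */
static void toy_wmb_pmem(struct toy_pmem *p)
{
    p->unfenced_writes = false;
}

/* Page-style I/O path; the fence on the write branch is the step being added. */
static int toy_rw_page(struct toy_pmem *p, size_t off, void *page,
                       size_t len, bool is_write)
{
    if (off > sizeof(p->media) || len > sizeof(p->media) - off)
        return -1; /* request falls outside the modeled range */
    if (is_write) {
        toy_memcpy_to_pmem(p, off, page, len);
        toy_wmb_pmem(p);
    } else {
        memcpy(page, p->media + off, len);
    }
    return 0;
}
```

Without the `toy_wmb_pmem()` call, a write would return success while `unfenced_writes` was still set, which is the gap the commit message describes for the page path.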

libnvdimm: pfn_devs: Fix locking in namespace_store · 4ca8b57a

Authored by Axel Lin on Sep 16, 2015

Always take device_lock() before nvdimm_bus_lock() to prevent deadlock.

Signed-off-by: Axel Lin <axel.lin@ingics.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

4ca8b57a

libnvdimm: btt_devs: Fix locking in namespace_store · 4be9c1fc

Authored by Axel Lin on Sep 16, 2015

Always take device_lock() before nvdimm_bus_lock() to prevent deadlock.

Cc: <stable@vger.kernel.org>
Signed-off-by: Axel Lin <axel.lin@ingics.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

4be9c1fc
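
Both namespace_store locking fixes apply the same rule: every path takes the two locks in one fixed order. A minimal way to picture that rule is a rank check, loosely in the spirit of a lock-order validator. The ranks and the `lock_tracker` helper below are hypothetical and exist only for this sketch; they are not part of libnvdimm.

```c
#include <stdbool.h>

/* Hypothetical ranks: a lock with a lower rank must be taken first. */
enum lock_rank { RANK_DEVICE = 1, RANK_NVDIMM_BUS = 2 };

struct lock_tracker {
    int highest_held; /* 0 when nothing is held */
};

/* Returns false when the acquisition would violate the ordering rule. */
static bool tracker_acquire(struct lock_tracker *t, enum lock_rank rank)
{
    if ((int)rank <= t->highest_held)
        return false;
    t->highest_held = rank;
    return true;
}

/* Dropping everything resets the tracker for the next code path. */
static void tracker_release_all(struct lock_tracker *t)
{
    t->highest_held = 0;
}
```

Taking `RANK_DEVICE` and then `RANK_NVDIMM_BUS` passes the check. The reverse order is rejected, which corresponds to the ordering that can deadlock when another thread uses the correct order at the same time.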

29 Aug, 2015 3 commits

libnvdimm, pmem: direct map legacy pmem by default · 004f1afb

Authored by Dan Williams on Aug 24, 2015

The expectation is that the legacy / non-standard pmem discovery method
(e820 type-12) will only ever be used to describe small quantities of
persistent memory.  Larger capacities will be described via the ACPI
NFIT.  When "allocate struct page from pmem" support is added this default
policy can be overridden by assigning a legacy pmem namespace to a pfn
device, however this would only be necessary if a platform used the
legacy mechanism to define a very large range.

Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

004f1afb

libnvdimm, pmem: 'struct page' for pmem · 32ab0a3f

Authored by Dan Williams on Aug 01, 2015

Enable the pmem driver to handle PFN device instances.  Attaching a pmem
namespace to a pfn device triggers the driver to allocate and initialize
struct page entries for pmem.  Memory capacity for this allocation comes
exclusively from RAM for now which is suitable for low PMEM to RAM
ratios.  This mechanism will be expanded later for setting an "allocate
from PMEM" policy.

Cc: Boaz Harrosh <boaz@plexistor.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

32ab0a3f

libnvdimm, pfn: 'struct page' provider infrastructure · e1455744

Authored by Dan Williams on Jul 30, 2015

Implement the base infrastructure for libnvdimm PFN devices. Similar to
BTT devices they take a namespace as a backing device and layer
functionality on top. In this case the functionality is reserving space
for an array of 'struct page' entries to be handed out through
pfn_to_page(). For now this is just the basic libnvdimm-device-model for
configuring the base PFN device.

As the namespace claiming mechanism for PFN devices is mostly identical
to BTT devices drivers/nvdimm/claim.c is created to house the common
bits.

Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

e1455744

28 Aug, 2015 3 commits

x86, pmem: clarify that ARCH_HAS_PMEM_API implies PMEM mapped WB · 96601adb

Authored by Dan Williams on Aug 24, 2015

Given that a write-back (WB) mapping plus non-temporal stores is
expected to be the most efficient way to access PMEM, update the
definition of ARCH_HAS_PMEM_API to imply arch support for
WB-mapped-PMEM.  This is needed as a pre-requisite for adding PMEM to
the direct map and mapping it with struct page.

The above clarification for X86_64 means that memcpy_to_pmem() is
permitted to use the non-temporal arch_memcpy_to_pmem() rather than
needlessly fall back to default_memcpy_to_pmem() when the pcommit
instruction is not available.  When arch_memcpy_to_pmem() is not
guaranteed to flush writes out of cache, i.e. on older X86_32
implementations where non-temporal stores may just dirty cache,
ARCH_HAS_PMEM_API is simply disabled.

The default fall back for persistent memory handling remains.  Namely,
map it with the WT (write-through) cache-type and hope for the best.

arch_has_pmem_api() is updated to only indicate whether the arch
provides the proper helpers to meet the minimum "writes are visible
outside the cache hierarchy after memcpy_to_pmem() + wmb_pmem()".  Code
that cares whether wmb_pmem() actually flushes writes to pmem must now
call arch_has_wmb_pmem() directly.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
[hch: set ARCH_HAS_PMEM_API=n on x86_32]
Reviewed-by: Christoph Hellwig <hch@lst.de>
[toshi: x86_32 compile fixes]
Signed-off-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

96601adb

dax: drop size parameter to ->direct_access() · cb389b9c

Authored by Dan Williams on Aug 07, 2015

None of the implementations currently use it.  The common
bdev_direct_access() entry point handles all the size checks before
calling ->direct_access().

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

cb389b9c

nvdimm: change to use generic kvfree() · a06a7576

Authored by yalin wang on Aug 27, 2015

Signed-off-by: yalin wang <yalin.wang2010@gmail.com>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

a06a7576

21 Aug, 2015 1 commit

pmem, dax: have direct_access use __pmem annotation · e2e05394

Authored by Ross Zwisler on Aug 18, 2015

Update the annotation for the kaddr pointer returned by direct_access()
so that it is a __pmem pointer. This is consistent with the PMEM driver
and with how this direct_access() pointer is used in the DAX code.

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

e2e05394

19 Aug, 2015 1 commit

libnvdimm, e820: make CONFIG_X86_PMEM_LEGACY a tristate option · 7a67832c

Authored by Dan Williams on Aug 19, 2015

We currently register a platform device for e820 type-12 memory and
register a nvdimm bus beneath it.  Registering the platform device
triggers the device-core machinery to probe for a driver, but that
search currently comes up empty.  Building the nvdimm-bus registration
into the e820_pmem platform device registration in this way forces
libnvdimm to be built-in.  Instead, convert the built-in portion of
CONFIG_X86_PMEM_LEGACY to simply register a platform device and move the
rest of the logic to the driver for e820_pmem, for the following
reasons:

1/ Letting e820_pmem support be a module allows building and testing
   libnvdimm.ko changes without rebooting

2/ All the normal policy around modules can be applied to e820_pmem
   (unbind to disable and/or blacklisting the module from loading by
   default)

3/ Moving the driver to a generic location and converting it to scan
   "iomem_resource" rather than "e820.map" means any other architecture can
   take advantage of this simple nvdimm resource discovery mechanism by
   registering a resource named "Persistent Memory (legacy)"

Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

7a67832c

15 Aug, 2015 4 commits

pmem: switch to devm_ allocations · 708ab62b

Authored by Christoph Hellwig on Aug 10, 2015

Signed-off-by: Christoph Hellwig <hch@lst.de>
[djbw: tools/testing/nvdimm/ and memunmap_pmem support]
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

708ab62b

libnvdimm, btt: write and validate parent_uuid · 6ec68954

Authored by Vishal Verma on Jul 29, 2015

When a BTT is instantiated on a namespace it must validate the namespace
uuid matches the 'parent_uuid' stored in the btt superblock. This
property enforces that changing the namespace UUID invalidates all
former BTT instances on that storage. For "IO namespaces" that don't
have a label or UUID, the parent_uuid is set to zero, and this
validation is skipped. For such cases, old BTTs have to be invalidated
by forcing the namespace to raw mode, and overwriting the BTT info
blocks.

Based on a patch by Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

6ec68954
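
The parent_uuid rule can be written as a small predicate. This is a user-space sketch under stated assumptions: `parent_uuid_ok()` is a hypothetical helper, UUIDs are modeled as 16 raw bytes, and a namespace without a UUID is modeled as a NULL pointer.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define UUID_LEN 16

/*
 * ns_uuid:        UUID of the namespace, or NULL for an "IO namespace"
 *                 that has no label or UUID.
 * sb_parent_uuid: the parent_uuid field read from the BTT superblock.
 */
static bool parent_uuid_ok(const unsigned char *ns_uuid,
                           const unsigned char *sb_parent_uuid)
{
    /* No namespace UUID to compare against: validation is skipped. */
    if (ns_uuid == NULL)
        return true;
    /* Otherwise the stored parent_uuid must match the namespace UUID. */
    return memcmp(ns_uuid, sb_parent_uuid, UUID_LEN) == 0;
}
```

Changing the namespace UUID makes the comparison fail for every BTT written earlier, which is the invalidation property the message describes. The NULL case shows why UUID-less namespaces need their old BTT info blocks overwritten by hand.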

libnvdimm, btt: consolidate arena validation · ab45e763

Authored by Vishal Verma on Jul 29, 2015

Use arena_is_valid as a common routine for checking the validity of an
info block from both discover_arenas, and nd_btt_probe.

As a result, don't check for validity of the BTT's UUID, and lbasize.
The checksum in the BTT info block guarantees self-consistency, and when
we're called from nd_btt_probe, we don't have a valid uuid or lbasize
available to check against.

Also clean up to return a bool instead of an int.

Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

ab45e763

libnvdimm, btt: clean up internal interfaces · fbde1414

Authored by Vishal Verma on Jul 29, 2015

Consolidate the parameters passed to arena_is_valid into just nd_btt,
and an info block to increase re-usability.

Similarly, btt_arena_write_layout doesn't need to be passed a uuid, as
it can be obtained from arena->nd_btt.

Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

fbde1414

01 Aug, 2015 1 commit

nvdimm: fix inline function return type warning · f6ef5a2a

Authored by Randy Dunlap on Jul 28, 2015

Fix multiple build warnings when CONFIG_BTT is not enabled:

In file included from ../drivers/nvdimm/bus.c:29:0:
../drivers/nvdimm/nd.h:169:15: warning: return type defaults to 'int' [-Wreturn-type]
 static inline nd_btt_probe(struct nd_namespace_common *ndns, void *drvdata)
               ^

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: linux-nvdimm@lists.01.org
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

f6ef5a2a
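
The warning comes from a stub that omits its return type. A self-contained sketch of the usual "feature compiled out" stub pattern with the type spelled out is shown below. The `-ENODEV` return value is an assumption made for illustration, and the forward declaration merely stands in for the real header.

```c
#include <errno.h>

struct nd_namespace_common; /* opaque here; only pointers to it are used */

#ifdef CONFIG_BTT
/* With BTT enabled, the real implementation is provided elsewhere. */
int nd_btt_probe(struct nd_namespace_common *ndns, void *drvdata);
#else
/* Stub for builds without BTT; the explicit 'int' is what silences the warning. */
static inline int nd_btt_probe(struct nd_namespace_common *ndns, void *drvdata)
{
    (void)ndns;
    (void)drvdata;
    return -ENODEV; /* assumed error code for "not built in" */
}
#endif
```

Callers compile unchanged in both configurations; only the stub's declaration differs from the form that triggered the warning.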

29 Jul, 2015 1 commit

block: add a bi_error field to struct bio · 4246a0b6

Authored by Christoph Hellwig on Jul 20, 2015

Currently we have two different ways to signal an I/O error on a BIO:

 (1) by clearing the BIO_UPTODATE flag
 (2) by returning a Linux errno value to the bi_end_io callback

The first one has the drawback of only communicating a single possible
error (-EIO), and the second one has the drawback of not being persistent
when bios are queued up, and of not being passed along from child to parent
bio in the ever more popular chaining scenario.  Having both mechanisms
available has the additional drawback of utterly confusing driver authors
and introducing bugs where various I/O submitters only deal with one of
them, and the others have to add boilerplate code to deal with both kinds
of error returns.

So add a new bi_error field to store an errno value directly in struct
bio and remove the existing mechanisms to clean all this up.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

4246a0b6
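
The effect of carrying the error inside the bio can be shown with a toy structure. Only the field name `bi_error` and the `bi_end_io` callback come from the message; `toy_bio`, its `parent` pointer and `toy_bio_endio()` are hypothetical simplifications, and the "first error wins" policy is an assumption of this sketch.

```c
#include <errno.h>
#include <stddef.h>

/* Toy bio: only the fields needed to show error propagation. */
struct toy_bio {
    int bi_error;               /* 0 on success, negative errno on failure */
    struct toy_bio *parent;     /* set when this bio is chained to another */
    void (*bi_end_io)(struct toy_bio *bio);
};

/* Completion takes no separate error argument; it reads bio->bi_error. */
static void toy_bio_endio(struct toy_bio *bio)
{
    /* Hand the error to the parent so it survives chaining (first error wins). */
    if (bio->parent && bio->bi_error && !bio->parent->bi_error)
        bio->parent->bi_error = bio->bi_error;
    if (bio->bi_end_io)
        bio->bi_end_io(bio);
}
```

Because the errno is stored in the structure, it stays attached while the bio is queued and can be any value, not just -EIO.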

28 Jul, 2015 2 commits

libnvdimm, pmem: Change pmem physical sector size to PAGE_SIZE · 6b47496a

Authored by Vishal Verma on Jul 23, 2015

Based on patch c8fa3173 ("brd: Request from fdisk 4k alignment") by Boaz
Harrosh, allow fdisk to create properly aligned partitions for DAX. This
will also cause mkfs.ext4 to emit a warning if using a file system block
size of less than PAGE_SIZE.

Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Elliott, Robert <Elliott@hp.com>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
Acked-by: Boaz Harrosh <boaz@plexistor.com>
Acked-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

6b47496a

libnvdimm, btt: sparse fix · 5e329406

Authored by Dan Williams on Jul 11, 2015

Fix:
drivers/nvdimm/btt.c:635:29: warning: restricted __le64 degrades to integer

Signed-off-by: Dan Williams <dan.j.williams@intel.com>

5e329406

26 Jul, 2015 1 commit

libnvdimm: fix namespace seed creation · 8ca24353

Authored by Dan Williams on Jul 24, 2015

A new BLK namespace "seed" device is created whenever the current seed
is successfully probed. However, if that namespace is assigned to a BTT
it may never directly experience a successful probe as it is a
subordinate device to a BTT configuration.

The effect of the current code is that no new namespaces can be
instantiated, after the seed namespace, to consume available BLK DPA
capacity. Fix this by treating a successful BTT probe event as a
successful probe event for the backing namespace.

Reported-by: Nicholas Moulin <nicholas.w.moulin@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

8ca24353

01 Jul, 2015 2 commits

nvdimm: Fix return value of nvdimm_bus_init() if class_create() fails · daa1dee4

Authored by Axel Lin on Jun 28, 2015

Return proper error if class_create() fails.

Signed-off-by: Axel Lin <axel.lin@ingics.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

daa1dee4

libnvdimm: smatch cleanups in __nd_ioctl · af834d45

Authored by Dan Williams on Jun 30, 2015

Drop use of access_ok() since we are already using copy_{to|from}_user()
which do their own access_ok().

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

af834d45

26 Jun, 2015 12 commits

arch, x86: pmem api for ensuring durability of persistent memory updates · 61031952

Authored by Ross Zwisler on Jun 25, 2015

Based on an original patch by Ross Zwisler [1].

Writes to persistent memory have the potential to be posted to cpu
cache, cpu write buffers, and platform write buffers (memory controller)
before being committed to persistent media.  Provide apis,
memcpy_to_pmem(), wmb_pmem(), and memremap_pmem(), to write data to
pmem and assert that it is durable in PMEM (a persistent linear address
range).  A '__pmem' attribute is added so sparse can track proper usage
of pointers to pmem.

This continues the status quo of pmem being x86 only for 4.2, but
reworks to ioremap, and wider implementation of memremap() will enable
other archs in 4.3.

[1]: https://lists.01.org/pipermail/linux-nvdimm/2015-May/000932.html

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
[djbw: various reworks]
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

61031952

libnvdimm: Add sysfs numa_node to NVDIMM devices · 74ae66c3

Authored by Toshi Kani on Jun 19, 2015

Add support of sysfs 'numa_node' to I/O-related NVDIMM devices
under /sys/bus/nd/devices, regionN, namespaceN.0, and bttN.x.

An example of numa_node values on a 2-socket system with a single
NVDIMM range on each socket is shown below.
  /sys/bus/nd/devices
  |-- btt0.0/numa_node:0
  |-- btt1.0/numa_node:1
  |-- btt1.1/numa_node:1
  |-- namespace0.0/numa_node:0
  |-- namespace1.0/numa_node:1
  |-- region0/numa_node:0
  |-- region1/numa_node:1

These numa_node files are then linked under the block class of
their device names.
  /sys/class/block/pmem0/device/numa_node:0
  /sys/class/block/pmem1s/device/numa_node:1

This enables numactl(8) to accept 'block:' and 'file:' paths of
pmem and btt devices as shown in the examples below.
  numactl --preferred block:pmem0 --show
  numactl --preferred file:/dev/pmem1s --show

Signed-off-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

74ae66c3

libnvdimm: Set numa_node to NVDIMM devices · 41d7a6d6

Authored by Toshi Kani on Jun 19, 2015

ACPI NFIT table has System Physical Address Range Structure entries that
describe a proximity ID of each range when ACPI_NFIT_PROXIMITY_VALID is
set in the flags.

Change acpi_nfit_register_region() to map a proximity ID to its node ID,
and set it to a new numa_node field of nd_region_desc, which is then
conveyed to the nd_region device.

The device core arranges for btt and namespace devices to inherit their
node from their parent region.

Signed-off-by: Toshi Kani <toshi.kani@hp.com>
[djbw: move set_dev_node() from region.c to bus.c]
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

41d7a6d6

libnvdimm, nfit: handle unarmed dimms, mark namespaces read-only · 58138820

Authored by Dan Williams on Jun 23, 2015

Upon detection of an unarmed dimm in a region, arrange for descendant
BTT, PMEM, or BLK instances to be read-only. A dimm is primarily marked
"unarmed" via flags passed by platform firmware (NFIT).

The flags in the NFIT memory device sub-structure indicate the state of
the data on the nvdimm relative to its energy source or last "flush to
persistence". For the most part there is nothing the driver can do but
advertise the state of these flags in sysfs and emit a message if
firmware indicates that the contents of the device may be corrupted.
However, for the case of ACPI_NFIT_MEM_ARMED, the driver can arrange for
the block devices incorporating that nvdimm to be marked read-only.
This is a safe default as the data is still available and new writes are
held off until the administrator either forces read-write mode, or the
energy source becomes armed.

A 'read_only' attribute is added to REGION devices to allow for
overriding the default read-only policy of all descendant block devices.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>

58138820

pmem: flag pmem block devices as non-rotational · 0f51c4fa

Authored by Dan Williams on May 16, 2015

...since they are effectively SSDs as far as userspace is concerned.

Reviewed-by: Vishal Verma <vishal.l.verma@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

0f51c4fa

libnvdimm: enable iostat · f0dc089c

Authored by Dan Williams on May 16, 2015

This is disabled by default as the overhead is prohibitive, but if the
user takes the action to turn it on we'll oblige.

Reviewed-by: Vishal Verma <vishal.l.verma@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

f0dc089c

pmem: make_request cleanups · edc870e5

Authored by Dan Williams on May 16, 2015

Various cleanups:

1/ Kill the BUG_ON since we've already told the block layer we don't
   support DISCARD on all these drivers.

2/ Kill the 'rw' variable, no need to cache it.

3/ Kill the local 'sector' variable.  bio_for_each_segment() is already
   advancing the iterator's sector number by the bio_vec length.

4/ Kill the check for accessing past the end of the device;
   generic_make_request_checks() already does that.

Suggested-by: Christoph Hellwig <hch@lst.de>
[hch: kill access past end of the device check]
Reviewed-by: Vishal Verma <vishal.l.verma@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

edc870e5

libnvdimm, pmem: fix up max_hw_sectors · 43d3fa3a

Authored by Dan Williams on May 16, 2015

There is no hardware limit to enforce on the size of the i/o that can be passed
to an nvdimm block device, so set it to UINT_MAX.

Reviewed-by: Vishal Verma <vishal.l.verma@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

43d3fa3a

libnvdimm, blk: add support for blk integrity · fcae6957

Authored by Vishal Verma on Jun 25, 2015

Support multiple block sizes (sector + metadata) for nd_blk in the
same way as done for the BTT. Add the idea of an 'internal' lbasize,
which is properly aligned and padded, and store metadata in this space.

Signed-off-by: Vishal Verma <vishal.l.verma@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

fcae6957
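
The 'internal' lbasize idea amounts to rounding sector plus metadata up to an aligned size and laying blocks out at that stride. The sketch below illustrates the arithmetic only. The 64-byte alignment and both helper names are assumptions made for this example, since the message does not state them.

```c
#include <stdint.h>

/* Assumed alignment for the internal block size; not stated in the message. */
#define EXAMPLE_LBASIZE_ALIGN 64u

/* Round (sector + metadata) up to the next multiple of the alignment. */
static uint32_t internal_lbasize(uint32_t sector_size, uint32_t meta_size)
{
    uint32_t raw = sector_size + meta_size;

    return (raw + EXAMPLE_LBASIZE_ALIGN - 1) / EXAMPLE_LBASIZE_ALIGN
           * EXAMPLE_LBASIZE_ALIGN;
}

/* Byte offset of logical block 'lba' when blocks are stored back to back. */
static uint64_t block_offset(uint64_t lba, uint32_t sector_size,
                             uint32_t meta_size)
{
    return lba * internal_lbasize(sector_size, meta_size);
}
```

Under these assumptions a 512-byte sector with 8 bytes of metadata occupies 576 bytes internally, so the metadata always has room next to its sector and every block starts on an aligned boundary.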

libnvdimm, btt: add support for blk integrity · 41cd8b70

Authored by Vishal Verma on Jun 25, 2015

Support multiple block sizes (sector + metadata) using the blk integrity
framework. This registers a new integrity template that defines the
protection information tuple size based on the configured metadata size,
and simply acts as a passthrough for protection information generated by
another layer. The metadata is written to the storage as-is, and read back
with each sector.

Signed-off-by: Vishal Verma <vishal.l.verma@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

41cd8b70

libnvdimm, nfit, nd_blk: driver for BLK-mode access persistent memory · 047fc8a1

Authored by Ross Zwisler on Jun 25, 2015

The libnvdimm implementation handles allocating dimm address space (DPA)
between PMEM and BLK mode interfaces.  After DPA has been allocated from
a BLK-region to a BLK-namespace the nd_blk driver attaches to handle I/O
as a struct bio based block device. Unlike PMEM, BLK is required to
handle platform specific details like mmio register formats and memory
controller interleave.  For this reason the libnvdimm generic nd_blk
driver calls back into the bus provider to carry out the I/O.

This initial implementation handles the BLK interface defined by the
ACPI 6 NFIT [1] and the NVDIMM DSM Interface Example [2] composed from
DCR (dimm control region), BDW (block data window), IDT (interleave
descriptor) NFIT structures and the hardware register format.

[1]: http://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf
[2]: http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf

Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Boaz Harrosh <boaz@plexistor.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

047fc8a1

nd_btt: atomic sector updates · 5212e11f

Authored by Vishal Verma on Jun 25, 2015

BTT stands for Block Translation Table, and is a way to provide power
fail sector atomicity semantics for block devices that have the ability
to perform byte granularity IO. It relies on the capability of libnvdimm
namespace devices to do byte aligned IO.

The BTT works as a stacked block device, and reserves a chunk of space
from the backing device for its accounting metadata. It is a bio-based
driver because all IO is done synchronously, and there is no queuing or
asynchronous completions at either the device or the driver level.

The BTT uses 'lanes' to index into various 'on-disk' data structures,
and lanes also act as a synchronization mechanism in case there are more
CPUs than available lanes. We did a comparison between two lane lock
strategies - first where we kept an atomic counter around that tracked
which was the last lane that was used, and 'our' lane was determined by
atomically incrementing that. That way, for the nr_cpus > nr_lanes case,
theoretically, no CPU would be blocked waiting for a lane. The other
strategy was to use the number of the cpu we're scheduled on and hash it to
a lane number. Theoretically, this could block an IO that could've
otherwise run using a different, free lane. But some fio workloads
showed that the direct cpu -> lane hash performed faster than tracking
'last lane' - my reasoning is the cache thrash caused by moving the
atomic variable made that approach slower than simply waiting out the
in-progress IO. This supports the conclusion that the driver can be a
very simple bio-based one that does synchronous IOs instead of queuing.

Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Boaz Harrosh <boaz@plexistor.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Neil Brown <neilb@suse.de>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
[jmoyer: fix nmi watchdog timeout in btt_map_init]
[jmoyer: move btt initialization to module load path]
[jmoyer: fix memory leak in the btt initialization path]
[jmoyer: Don't overwrite corrupted arenas]
Signed-off-by: Vishal Verma <vishal.l.verma@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

5212e11f
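
The two lane-selection strategies compared in this message can be sketched side by side. Using a plain modulo as the "hash" is an assumption of this example, and `lane_for_cpu()`, `lane_counter` and `lane_round_robin()` are hypothetical names. The counter version is written without atomics to stay self-contained, whereas a real driver would need an atomic increment.

```c
#include <stdint.h>

/* Direct cpu -> lane mapping: no shared state, so nothing bounces between caches. */
static uint32_t lane_for_cpu(uint32_t cpu, uint32_t nr_lanes)
{
    return cpu % nr_lanes;
}

/* The alternative that was measured: a shared "last lane used" counter. */
struct lane_counter {
    uint32_t next;
};

static uint32_t lane_round_robin(struct lane_counter *c, uint32_t nr_lanes)
{
    /* Every caller writes this one variable, which is the contention source. */
    return c->next++ % nr_lanes;
}
```

With 4 lanes, cpus 1 and 5 both map to lane 1 and may wait on each other, while the counter hands out lanes 0, 1, 2, 3, 0 in turn. The message reports that the first approach was still faster in some fio workloads.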

25 Jun, 2015 5 commits

libnvdimm: infrastructure for btt devices · 8c2f7e86

Authored by Dan Williams on Jun 25, 2015

NVDIMM namespaces, in addition to accepting "struct bio" based requests,
also have the capability to perform byte-aligned accesses.  By default
only the bio/block interface is used.  However, if another driver can
make effective use of the byte-aligned capability it can claim the namespace
interface and use the byte-aligned ->rw_bytes() interface.

The BTT driver is the first consumer of this mechanism to allow
adding atomic sector update semantics to a pmem or blk namespace.  This
patch is the sysfs infrastructure to allow configuring a BTT instance
for a namespace.  Enabling that BTT and performing i/o is in a
subsequent patch.

Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

8c2f7e86

libnvdimm: write blk label set · 0ba1c634

Authored by Dan Williams on May 30, 2015

After 'uuid', 'size', 'sector_size', and optionally 'alt_name' have been
set to valid values the labels on the dimm can be updated.  The
difference with the pmem case is that blk namespaces are limited to one
dimm and can cover discontiguous ranges in dpa space.

Also, after allocating label slots, it is useful for userspace to know
how many slots are left.  Export this information in sysfs.

Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Neil Brown <neilb@suse.de>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

0ba1c634

libnvdimm: write pmem label set · f524bf27

Authored by Dan Williams on May 30, 2015

After 'uuid', 'size', and optionally 'alt_name' have been set to valid
values the labels on the dimms can be updated.

Write procedure is:
1/ Allocate and write new labels in the "next" index
2/ Free the old labels in the working copy
3/ Write the bitmap and the label space on the dimm
4/ Write the index to make the update valid

Label ranges directly mirror the dpa resource values for the given
label_id of the namespace.

Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Neil Brown <neilb@suse.de>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

f524bf27

libnvdimm: blk labels and namespace instantiation · 1b40e09a

Authored by Dan Williams on May 01, 2015

A blk label set describes a namespace comprised of one or more
discontiguous dpa ranges on a single dimm.  They may alias with one or
more pmem interleave sets that include the given dimm.

This is the runtime/volatile configuration infrastructure for sysfs
manipulation of 'alt_name', 'uuid', 'size', and 'sector_size'.  A later
patch will make these settings persistent by writing back the label(s).

Unlike pmem namespaces, multiple blk namespaces can be created per
region.  Once a blk namespace has been created a new seed device
(unconfigured child of a parent blk region) is instantiated.  As long as
a region has 'available_size' != 0 new child namespaces may be created.

Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Neil Brown <neilb@suse.de>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

1b40e09a

libnvdimm: pmem label sets and namespace instantiation. · bf9bccc1

Authored by Dan Williams on Jun 17, 2015

A complete label set is a PMEM-label per-dimm per-interleave-set where
all the UUIDs match and the interleave set cookie matches the hosting
interleave set.

Present sysfs attributes for manipulation of a PMEM-namespace's
'alt_name', 'uuid', and 'size' attributes.  A later patch will make
these settings persistent by writing back the label.

Note that PMEM allocations grow forwards from the start of an interleave
set (lowest dimm-physical-address (DPA)).  BLK-namespaces that alias
with a PMEM interleave set will grow allocations backward from the
highest DPA.

Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Neil Brown <neilb@suse.de>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

bf9bccc1
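
The allocation directions noted in this message (PMEM forward from the lowest DPA, aliased BLK backward from the highest) can be modeled with a two-ended allocator over one free span. `dpa_space`, `alloc_pmem()` and `alloc_blk()` are hypothetical names for this sketch; it ignores labels, interleaving and freeing.

```c
#include <stdint.h>

/* Free DPA span [lo, hi) on one dimm, shared by PMEM and BLK allocations. */
struct dpa_space {
    uint64_t lo;
    uint64_t hi;
};

/* PMEM grows forward from the lowest free DPA. Returns the start address,
 * or UINT64_MAX when the request does not fit. */
static uint64_t alloc_pmem(struct dpa_space *s, uint64_t size)
{
    uint64_t start;

    if (size > s->hi - s->lo)
        return UINT64_MAX;
    start = s->lo;
    s->lo += size;
    return start;
}

/* BLK grows backward from the highest free DPA. Same failure convention. */
static uint64_t alloc_blk(struct dpa_space *s, uint64_t size)
{
    if (size > s->hi - s->lo)
        return UINT64_MAX;
    s->hi -= size;
    return s->hi;
}
```

Growing from opposite ends lets either kind of namespace expand into the shared free space in the middle without relocating the other.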

openanolis / cloud-kernel synced successfully about 1 year ago
