Commits · 565851c972b50612f3a4542e26879ffb3e906fc2 · openanolis / cloud-kernel

13 Apr, 2017 · 1 commit

libnvdimm: add mechanism to publish badblocks at the region level · 6a6bef90

Authored by Dave Jiang on Apr 07, 2017

The badblocks sysfs file will be exported at the region level. When the
nvdimm event notifier fires for NVDIMM_REVALIDATE_POISON, the badblocks
in the region will be updated.
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

01 Mar, 2017 · 1 commit

nfit, libnvdimm: fix interleave set cookie calculation · 86ef58a4

Authored by Dan Williams on Feb 28, 2017

The interleave-set cookie is a sum that sanity checks the composition of
an interleave set has not changed from when the namespace was initially
created.  The checksum is calculated by sorting the DIMMs by their
location in the interleave-set. The comparison for the sort must be
64-bit wide, not byte-by-byte as performed by memcmp() in the broken
case.
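The difference between the two orderings can be illustrated with a small user-space sketch. The struct and function names below are hypothetical stand-ins, not the kernel's definitions. The point is only that memcmp() on a little-endian machine ranks values by their lowest-addressed byte first, while a 64-bit comparison ranks them numerically.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for one DIMM's entry in the cookie input. */
struct map_entry {
	uint64_t region_offset;
};

/* Byte-by-byte ordering: on little-endian hosts the least
 * significant byte of region_offset is compared first. */
int cmp_bytewise(const void *a, const void *b)
{
	return memcmp(a, b, sizeof(struct map_entry));
}

/* 64-bit wide ordering: compares the numeric values. */
int cmp_u64(const void *a, const void *b)
{
	const struct map_entry *x = a, *y = b;

	if (x->region_offset < y->region_offset)
		return -1;
	return x->region_offset > y->region_offset;
}
```

Either function can be passed to qsort(). On a little-endian host the two can produce different orders for the same input, and therefore different checksums over the sorted array.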

Fix the implementation to accept correct cookie values in addition to
the Linux "memcmp" order cookies, but only allow correct cookies to be
generated going forward. It does mean that namespaces created by
third-party-tooling, or created by newer kernels with this fix, will not
validate on older kernels. However, there are a couple mitigating
conditions:

    1/ platforms with namespace-label capable NVDIMMs are not widely
       available.

    2/ interleave-sets with a single-dimm are by definition not affected
       (nothing to sort). This covers the QEMU-KVM NVDIMM emulation case.

The cookie stored in the namespace label will be fixed by any write to the
namespace label; the most straightforward way to achieve this is to
write to the "alt_name" attribute of a namespace in sysfs.

Cc: <stable@vger.kernel.org>
Fixes: eaf96153 ("libnvdimm, nfit: add interleave-set state-tracking infrastructure")
Reported-by: Nicholas Moulin <nicholas.w.moulin@linux.intel.com>
Tested-by: Nicholas Moulin <nicholas.w.moulin@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

19 Oct, 2016 · 2 commits

libnvdimm: allow a platform to force enable label support · 42237e39

Authored by Dan Williams on Oct 15, 2016

Platforms like QEMU-KVM implement an NFIT table and label DSMs.
However, since that environment does not define an aliased
configuration, the labels are currently ignored and the kernel registers
a single full-sized pmem-namespace per region. Now that the kernel
supports sub-divisions of pmem regions the labels have a purpose.
Arrange for the labels to be honored when we find an existing / valid
namespace index block.

Cc: <qemu-devel@nongnu.org>
Cc: Haozhong Zhang <haozhong.zhang@intel.com>
Cc: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

libnvdimm: use generic iostat interfaces · 8d7c22ac

Authored by Toshi Kani on Oct 19, 2016

nd_iostat_start() and nd_iostat_end() implement the same functionality
that generic_start_io_acct() and generic_end_io_acct() already provide.

Change nd_iostat_start() and nd_iostat_end() to call the generic iostat
interfaces.  There is no change in the nd interfaces.
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

01 Oct, 2016 · 2 commits

libnvdimm, label: convert label tracking to a linked list · ae8219f1

Authored by Dan Williams on Sep 19, 2016

In preparation for enabling multiple namespaces per pmem region, convert
the label tracking to use a linked list. In particular this will allow
select_pmem_id() to move labels from the unvalidated state to the
validated state. Currently we only track one validated set per-region.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

libnvdimm, region: move region-mapping input-parameters to nd_mapping_desc · 44c462eb

Authored by Dan Williams on Sep 19, 2016

Before we add more libnvdimm-private fields to nd_mapping make it clear
which parameters are input vs libnvdimm internals. Use struct
nd_mapping_desc instead of struct nd_mapping in nd_region_desc and make
struct nd_mapping private to libnvdimm.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

25 Sep, 2016 · 1 commit

libnvdimm, region: fix flush hint table thinko · 595c7307

Authored by Dan Williams on Sep 23, 2016

The definition of the flush hint table as:

	void __iomem *flush_wpq[0][0];

...passed the unit test, but is broken as flush_wpq[0][1] and
flush_wpq[1][0] refer to the same entry.  Fix this to use a helper that
calculates a slot in the table based on the geometry of flush hints in
the region.  This is important to get right since virtualization
solutions use this mechanism to trigger hypervisor flushes to platform
persistence.
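A helper of the kind described can be sketched as a row-major index into a flat table. This is an illustrative user-space sketch: the struct, its fields, and the function name are assumptions, not the kernel's actual helper. A two-dimensional array with zero-length dimensions gives the compiler no row stride, so the stride has to come from the region geometry at run time.

```c
#include <stddef.h>

/* Hypothetical geometry of the flush hints in a region. */
struct flush_geometry {
	unsigned int num_dimms;
	unsigned int hints_per_dimm;
};

/* Index of (dimm, hint) in a flat, row-major table holding
 * num_dimms * hints_per_dimm entries. */
size_t flush_slot(const struct flush_geometry *g,
		  unsigned int dimm, unsigned int hint)
{
	return (size_t)dimm * g->hints_per_dimm + hint;
}
```

With four hints per DIMM, (0, 1) maps to slot 1 and (1, 0) maps to slot 4, so the two no longer collide.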
Reported-by: Dave Jiang <dave.jiang@intel.com>
Tested-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

02 Sep, 2016 · 1 commit

libnvdimm: Fix nvdimm_probe error on NVDIMM-N · aee65987

Authored by Toshi Kani on Aug 16, 2016

'ndctl list --buses --dimms' does not list any NVDIMM-Ns since
they are considered idle.  ndctl checks whether any driver is
attached to the nmem device.  nvdimm_probe() always fails in
nvdimm_init_nsarea() since NVDIMM-Ns do not implement the optional
ND_CMD_GET_CONFIG_DATA command.

Change nvdimm_probe() to accept the case that the CONFIG_DATA
command is not implemented for NVDIMM-Ns.  The driver attaches
without ndd, which keeps it no-op to the device.
Reported-by: Brian Boylston <brian.boylston@hpe.com>
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Tested-by: Johannes Thumshirn <jthumshirn@suse.de>
Acked-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

09 Aug, 2016 · 1 commit

nvdimm, btt: add a size attribute for BTTs · abe8b4e3

Authored by Vishal Verma on Jul 27, 2016

To be consistent with other namespaces, expose a 'size' attribute for
BTT devices also.

Cc: Dan Williams <dan.j.williams@intel.com>
Reported-by: Linda Knippers <linda.knippers@hpe.com>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

12 Jul, 2016 · 3 commits

libnvdimm: cycle flush hints · 0c27af60

Authored by Dan Williams on May 27, 2016

When the NFIT provides multiple flush hint addresses per-dimm it is
expressing that the platform is capable of processing multiple flush
requests in parallel. There is some fixed cost per flush request, let
the cost be shared in parallel on multiple cpus.

Since there may not be enough flush hint addresses for each cpu to have
one, keep a per-cpu index of the last used hint, hash it with current
pid, and assume that access pattern and scheduler randomness will keep
the flush-hint usage somewhat staggered across cpus.
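The selection scheme described above can be sketched in user space as follows. Everything here is an assumption made for illustration: the hash constant, the 8-bit hash width, and the function names. The `last_idx` pointer stands in for what would be a per-cpu variable in the kernel.

```c
#include <stdint.h>

/* Multiplicative hash producing 8 bits; the constant and width
 * are choices made for this sketch. */
uint32_t hash8(uint32_t val)
{
	return (val * 0x61C88647u) >> 24;
}

/* Advance the caller's last-used index by a pid-dependent step and
 * pick a hint slot. num_hints must be a power of two. */
unsigned int next_flush_hint(uint32_t *last_idx, uint32_t pid,
			     unsigned int num_hints)
{
	*last_idx += hash8(pid + *last_idx);
	return *last_idx & (num_hints - 1);
}
```

Because the step depends on both the previous index and the pid, two tasks that start from the same index tend to drift onto different hints without any shared state between cpus.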

Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

libnvdimm, nfit: move flush hint mapping to region-device driver-data · e5ae3b25

Authored by Dan Williams on Jun 07, 2016

In preparation for triggering flushes of a DIMM's writes-posted-queue
(WPQ) via the pmem driver move mapping of flush hint addresses to the
region driver.  Since this uses devm_nvdimm_memremap() the flush
addresses will remain mapped while any region to which the dimm belongs
is active.

We need to communicate more information to the nvdimm core to facilitate
this mapping, namely each dimm object now carries an array of flush hint
address resources.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

libnvdimm, nfit: remove nfit_spa_map() infrastructure · a8a6d2e0

Authored by Dan Williams on Jun 07, 2016

Now that all shared mappings are handled by devm_nvdimm_memremap() we no
longer need nfit_spa_map() nor do we need to trigger a callback to the
bus provider at region disable time.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

21 May, 2016 · 1 commit

libnvdimm, dax: autodetect support · c5ed9268

Authored by Dan Williams on May 18, 2016

For autodetecting a previously established dax configuration we need the
info block to indicate block-device vs device-dax mode, and we need to
have the default namespace probe hand-off the configuration to the
dax_pmem driver.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

10 May, 2016 · 1 commit

libnvdimm, dax: introduce device-dax infrastructure · cd03412a

Authored by Dan Williams on Mar 11, 2016

Device DAX is the device-centric analogue of Filesystem DAX
(CONFIG_FS_DAX).  It allows persistent memory ranges to be allocated and
mapped without need of an intervening file system.  This initial
infrastructure arranges for a libnvdimm pfn-device to be represented as
a different device-type so that it can be attached to a driver other
than the pmem driver.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

23 Apr, 2016 · 5 commits

libnvdimm, pmem, pfn: move pfn setup to the core · ac515c08

Authored by Dan Williams on Mar 22, 2016

Now that pmem internals have been disentangled from pfn setup, that code
can move to the core.  This is in preparation for adding another user of
the pfn-device capabilities.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

libnvdimm, pmem, pfn: make pmem_rw_bytes generic and refactor pfn setup · 200c79da

Authored by Dan Williams on Mar 22, 2016

In preparation for providing an alternative (to block device) access
mechanism to persistent memory, convert pmem_rw_bytes() to
nsio_rw_bytes().  This allows ->rw_bytes() functionality without
requiring a 'struct pmem_device' to be instantiated.

In other words, when ->rw_bytes() is in use i/o is driven through
'struct nd_namespace_io', otherwise it is driven through 'struct
pmem_device' and the block layer.  This consolidates the disjoint calls
to devm_exit_badblocks() and devm_memunmap() into a common
devm_nsio_disable() and cleans up the init path to use a unified
pmem_attach_disk() implementation.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

libnvdimm, btt, convert nd_btt_probe() to devm · e32bc729

Authored by Dan Williams on Mar 17, 2016

Pass the device performing the probe so we can use a devm allocation for
the btt superblock.

Cc: Vishal Verma <vishal.l.verma@intel.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

libnvdimm, pfn, convert nd_pfn_probe() to devm · bd032943

Authored by Dan Williams on Mar 17, 2016

Pass the device performing the probe so we can use a devm allocation for
the pfn superblock.

Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

libnvdimm, pmem: kill pmem->ndns · 298f2bc5

Authored by Dan Williams on Mar 15, 2016

We can derive the common namespace from other information.  We also do
not need to cache it because all the usages are in slow paths.
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

08 Apr, 2016 · 1 commit

libnvdimm, pfn: fix nvdimm_namespace_add_poison() vs section alignment · a3901802

Authored by Dan Williams on Apr 07, 2016

When section alignment padding is in effect we need to shift / truncate
the range that is queried for poison by the 'start_pad' or 'end_trunc'
reservations.

It's easiest if we just pass in an adjusted resource range rather than
deriving it from the passed in namespace.  With the resource range
resolution pushed out to the caller we can also push the
namespace-to-region lookup to the caller and drop the implicit pmem-type
assumption about the passed in namespace object.
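The adjustment a caller would make before querying for poison can be sketched as below. The struct and function names are hypothetical; only 'start_pad' and 'end_trunc' come from the text above.

```c
#include <stdint.h>

/* Inclusive physical address range, loosely modelled on a resource. */
struct addr_range {
	uint64_t start;
	uint64_t end;
};

/* Shrink the namespace range by the padding reserved at each end.
 * Assumes start_pad + end_trunc is smaller than the range size. */
struct addr_range poison_query_range(struct addr_range ns,
				     uint64_t start_pad, uint64_t end_trunc)
{
	struct addr_range r = { ns.start + start_pad, ns.end - end_trunc };

	return r;
}
```

The caller then passes the adjusted range, rather than the namespace object, to the poison lookup.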

Cc: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

10 Mar, 2016 · 1 commit

libnvdimm, pmem: clear poison on write · 59e64739

Authored by Dan Williams on Mar 08, 2016

If a write is directed at a known bad block perform the following:

1/ write the data

2/ send a clear poison command

3/ invalidate the poison out of the cache hierarchy
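The three steps can be modelled with a toy in-memory device. This is only a simulation under stated assumptions: the struct, the sector geometry, and the counter standing in for cache invalidation are all hypothetical, and on real hardware step 2 is a firmware command rather than a flag update.

```c
#include <stdbool.h>
#include <string.h>

#define SECTOR 512
#define NSECT  8

/* Toy device: data plus a per-sector "poisoned" flag. */
struct toy_pmem {
	unsigned char data[NSECT][SECTOR];
	bool bad[NSECT];
	int cache_invalidations; /* stand-in for step 3 */
};

int toy_write(struct toy_pmem *p, unsigned int sector, const void *buf)
{
	if (sector >= NSECT)
		return -1;
	memcpy(p->data[sector], buf, SECTOR);	/* 1/ write the data */
	if (p->bad[sector]) {
		p->bad[sector] = false;		/* 2/ clear the poison */
		p->cache_invalidations++;	/* 3/ invalidate cached poison */
	}
	return 0;
}
```

Writes to sectors that are not known to be bad skip steps 2 and 3, so the extra cost is paid only when poison is actually being cleared.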

Cc: <x86@kernel.org>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

06 Mar, 2016 · 1 commit

libnvdimm: async notification support · 71999466

Authored by Dan Williams on Feb 18, 2016

In preparation for asynchronous address range scrub support add an
ability for the pmem driver to dynamically consume address range scrub
results.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

10 Jan, 2016 · 3 commits

libnvdimm: convert to statically allocated badblocks · b95f5f43

Authored by Dan Williams on Jan 04, 2016

If a device will ever have badblocks it should always have a badblocks
instance available.  So, similar to md, embed a badblocks instance in
pmem_device.  This reduces pointer chasing in the i/o fast path, and
simplifies the init path.
Reported-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

libnvdimm, pmem: move definition of nvdimm_namespace_add_poison to nd.h · ad9a8bde

Authored by Dan Williams on Jan 06, 2016

nd-core.h is private to the libnvdimm core internals and should not be
used by drivers.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

libnvdimm: Add a poison list and export badblocks · 0caeef63

Authored by Vishal Verma on Dec 24, 2015

During region creation, perform Address Range Scrubs (ARS) for the SPA
(System Physical Address) ranges to retrieve known poison locations from
firmware. Add a new data structure 'nd_poison' which is used as a list
in nvdimm_bus to store these poison locations.

When creating a pmem namespace, if there is any known poison associated
with its physical address space, convert the poison ranges to bad sectors
that are exposed using the badblocks interface.
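The conversion from a poisoned byte range to bad sectors can be sketched as follows. The type and function names are hypothetical, and the 512-byte sector size is an assumption of this sketch. The poison range is first clipped to the namespace, then expressed in sectors relative to the namespace start, rounding outward so that a partially poisoned sector counts as bad.

```c
#include <stdint.h>

struct byte_range { uint64_t start, len; };
struct sector_range { uint64_t first, count; };

/* Returns 1 and fills *out when the poison overlaps the namespace,
 * 0 when there is no overlap. */
int poison_to_sectors(struct byte_range ns, struct byte_range poison,
		      struct sector_range *out)
{
	uint64_t ns_end = ns.start + ns.len;
	uint64_t p_end = poison.start + poison.len;
	uint64_t lo = poison.start > ns.start ? poison.start : ns.start;
	uint64_t hi = p_end < ns_end ? p_end : ns_end;

	if (lo >= hi)
		return 0;
	lo -= ns.start;
	hi -= ns.start;
	out->first = lo / 512;
	out->count = (hi + 511) / 512 - out->first;
	return 1;
}
```

For a namespace starting at 0x1000, poison covering bytes 0x1200 to 0x12ff lands entirely in sector 1, so one bad sector is reported.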
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

13 Dec, 2015 · 1 commit

libnvdimm, pfn: add 'align' attribute, default to HPAGE_SIZE · 315c5625

Authored by Dan Williams on Dec 10, 2015

When setting aside capacity for struct page it must be aligned to the
largest mapping size that is to be made available via DAX.  Make the
alignment configurable to enable support for 1GiB page-size mappings.

The offset for PFN_MODE_RAM may now be larger than SZ_8K, so fixup the
offset check in nvdimm_namespace_attach_pfn().
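The effect of a configurable alignment on the data offset can be shown with a little arithmetic. The function names and the notion of a single "reserved" byte count are assumptions of this sketch, not the kernel's code.

```c
#include <stdint.h>

/* Round x up to a multiple of align; align must be a power of two. */
uint64_t align_up(uint64_t x, uint64_t align)
{
	return (x + align - 1) & ~(align - 1);
}

/* Offset from the namespace start to the first usable data byte,
 * given the bytes reserved in front of the data. */
uint64_t pfn_data_offset(uint64_t ns_start, uint64_t reserved,
			 uint64_t align)
{
	return align_up(ns_start + reserved, align) - ns_start;
}
```

With only an 8KiB reservation and a 2MiB alignment, the offset for a namespace starting at 4GiB becomes 2MiB rather than 8KiB, which is why a check that expects exactly SZ_8K no longer holds.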
Reported-by: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

11 Dec, 2015 · 1 commit

libnvdimm, pfn: kill ND_PFN_ALIGN · 9f1e8cee

Authored by Dan Williams on Dec 10, 2015

The alignment constraint isn't necessary now that devm_memremap_pages()
allows for unaligned mappings.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

29 Aug, 2015 · 3 commits

libnvdimm, pmem: direct map legacy pmem by default · 004f1afb

Authored by Dan Williams on Aug 24, 2015

The expectation is that the legacy / non-standard pmem discovery method
(e820 type-12) will only ever be used to describe small quantities of
persistent memory.  Larger capacities will be described via the ACPI
NFIT.  When "allocate struct page from pmem" support is added this default
policy can be overridden by assigning a legacy pmem namespace to a pfn
device; however, this would only be necessary if a platform used the
legacy mechanism to define a very large range.

Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

libnvdimm, pmem: 'struct page' for pmem · 32ab0a3f

Authored by Dan Williams on Aug 01, 2015

Enable the pmem driver to handle PFN device instances.  Attaching a pmem
namespace to a pfn device triggers the driver to allocate and initialize
struct page entries for pmem.  Memory capacity for this allocation comes
exclusively from RAM for now which is suitable for low PMEM to RAM
ratios.  This mechanism will be expanded later for setting an "allocate
from PMEM" policy.

Cc: Boaz Harrosh <boaz@plexistor.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

libnvdimm, pfn: 'struct page' provider infrastructure · e1455744

Authored by Dan Williams on Jul 30, 2015

Implement the base infrastructure for libnvdimm PFN devices. Similar to
BTT devices they take a namespace as a backing device and layer
functionality on top. In this case the functionality is reserving space
for an array of 'struct page' entries to be handed out through
pfn_to_page(). For now this is just the basic libnvdimm-device-model for
configuring the base PFN device.

As the namespace claiming mechanism for PFN devices is mostly identical
to BTT devices drivers/nvdimm/claim.c is created to house the common
bits.

Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

15 Aug, 2015 · 1 commit

libnvdimm, btt: write and validate parent_uuid · 6ec68954

Authored by Vishal Verma on Jul 29, 2015

When a BTT is instantiated on a namespace it must validate the namespace
uuid matches the 'parent_uuid' stored in the btt superblock. This
property enforces that changing the namespace UUID invalidates all
former BTT instances on that storage. For "IO namespaces" that don't
have a label or UUID, the parent_uuid is set to zero, and this
validation is skipped. For such cases, old BTTs have to be invalidated
by forcing the namespace to raw mode, and overwriting the BTT info
blocks.
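The validation rule can be sketched as a small predicate. The function names and the raw 16-byte UUID representation are assumptions made for this illustration.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define UUID_LEN 16

static bool uuid_is_zero(const uint8_t *uuid)
{
	static const uint8_t zero[UUID_LEN];

	return memcmp(uuid, zero, UUID_LEN) == 0;
}

/* sb_parent_uuid: value stored in the BTT superblock.
 * ns_uuid: UUID of the namespace, or NULL if it has none. */
bool btt_parent_uuid_ok(const uint8_t *sb_parent_uuid, const uint8_t *ns_uuid)
{
	if (uuid_is_zero(sb_parent_uuid))
		return true;	/* "IO namespace" case: validation skipped */
	return ns_uuid && memcmp(sb_parent_uuid, ns_uuid, UUID_LEN) == 0;
}
```

Changing the namespace UUID makes the comparison fail, which is what invalidates BTT instances written under the previous UUID.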

Based on a patch by Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

01 Aug, 2015 · 1 commit

nvdimm: fix inline function return type warning · f6ef5a2a

Authored by Randy Dunlap on Jul 28, 2015

Fix multiple build warnings when CONFIG_BTT is not enabled:

In file included from ../drivers/nvdimm/bus.c:29:0:
../drivers/nvdimm/nd.h:169:15: warning: return type defaults to 'int' [-Wreturn-type]
 static inline nd_btt_probe(struct nd_namespace_common *ndns, void *drvdata)
               ^
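The warning points at a stub declared without a return type. A corrected stub would look roughly like the sketch below. The -ENODEV return value is an assumption about what a stub for a compiled-out feature reports, and the struct is left opaque.

```c
#include <errno.h>

struct nd_namespace_common; /* opaque for this sketch */

/* Spelling out 'int' removes the implicit-int fallback that
 * triggered -Wreturn-type. */
static inline int nd_btt_probe(struct nd_namespace_common *ndns,
			       void *drvdata)
{
	(void)ndns;
	(void)drvdata;
	return -ENODEV; /* assumed result when BTT is not built in */
}
```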
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: linux-nvdimm@lists.01.org
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

26 Jun, 2015 · 7 commits

libnvdimm: Set numa_node to NVDIMM devices · 41d7a6d6

Authored by Toshi Kani on Jun 19, 2015

ACPI NFIT table has System Physical Address Range Structure entries that
describe a proximity ID of each range when ACPI_NFIT_PROXIMITY_VALID is
set in the flags.

Change acpi_nfit_register_region() to map a proximity ID to its node ID,
and set it to a new numa_node field of nd_region_desc, which is then
conveyed to the nd_region device.

The device core arranges for btt and namespace devices to inherit their
node from their parent region.
Signed-off-by: Toshi Kani <toshi.kani@hp.com>
[djbw: move set_dev_node() from region.c to bus.c]
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

libnvdimm, nfit: handle unarmed dimms, mark namespaces read-only · 58138820

Authored by Dan Williams on Jun 23, 2015

Upon detection of an unarmed dimm in a region, arrange for descendant
BTT, PMEM, or BLK instances to be read-only. A dimm is primarily marked
"unarmed" via flags passed by platform firmware (NFIT).

The flags in the NFIT memory device sub-structure indicate the state of
the data on the nvdimm relative to its energy source or last "flush to
persistence". For the most part there is nothing the driver can do but
advertise the state of these flags in sysfs and emit a message if
firmware indicates that the contents of the device may be corrupted.
However, for the case of ACPI_NFIT_MEM_ARMED, the driver can arrange for
the block devices incorporating that nvdimm to be marked read-only.
This is a safe default as the data is still available and new writes are
held off until the administrator either forces read-write mode, or the
energy source becomes armed.

A 'read_only' attribute is added to REGION devices to allow for
overriding the default read-only policy of all descendant block devices.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

libnvdimm: enable iostat · f0dc089c

Authored by Dan Williams on May 16, 2015

This is disabled by default as the overhead is prohibitive, but if the
user takes the action to turn it on we'll oblige.
Reviewed-by: Vishal Verma <vishal.l.verma@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

libnvdimm, blk: add support for blk integrity · fcae6957

Authored by Vishal Verma on Jun 25, 2015

Support multiple block sizes (sector + metadata) for nd_blk in the
same way as done for the BTT. Add the idea of an 'internal' lbasize,
which is properly aligned and padded, and store metadata in this space.
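The relationship between the external block size, the metadata size, and the padded 'internal' lbasize can be sketched with simple arithmetic. The 64-byte alignment and the function names are assumptions of this sketch.

```c
#include <stdint.h>

#define INT_LBASIZE_ALIGNMENT 64u /* assumed alignment */

/* Metadata bytes per block: whatever exceeds the 512 or 4096 byte
 * data payload. */
uint32_t meta_size(uint32_t lbasize)
{
	return lbasize - (lbasize >= 4096 ? 4096u : 512u);
}

/* Block size as stored on media: lbasize rounded up to the alignment. */
uint32_t internal_lbasize(uint32_t lbasize)
{
	return (lbasize + INT_LBASIZE_ALIGNMENT - 1)
		/ INT_LBASIZE_ALIGNMENT * INT_LBASIZE_ALIGNMENT;
}
```

Under these assumptions a 520-byte block carries 8 bytes of metadata and occupies 576 bytes internally.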
Signed-off-by: Vishal Verma <vishal.l.verma@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

libnvdimm, btt: add support for blk integrity · 41cd8b70

Authored by Vishal Verma on Jun 25, 2015

Support multiple block sizes (sector + metadata) using the blk integrity
framework. This registers a new integrity template that defines the
protection information tuple size based on the configured metadata size,
and simply acts as a passthrough for protection information generated by
another layer. The metadata is written to the storage as-is, and read back
with each sector.
Signed-off-by: Vishal Verma <vishal.l.verma@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

libnvdimm, nfit, nd_blk: driver for BLK-mode access persistent memory · 047fc8a1

Authored by Ross Zwisler on Jun 25, 2015

The libnvdimm implementation handles allocating dimm address space (DPA)
between PMEM and BLK mode interfaces.  After DPA has been allocated from
a BLK-region to a BLK-namespace the nd_blk driver attaches to handle I/O
as a struct bio based block device. Unlike PMEM, BLK is required to
handle platform specific details like mmio register formats and memory
controller interleave.  For this reason the libnvdimm generic nd_blk
driver calls back into the bus provider to carry out the I/O.

This initial implementation handles the BLK interface defined by the
ACPI 6 NFIT [1] and the NVDIMM DSM Interface Example [2] composed from
DCR (dimm control region), BDW (block data window), IDT (interleave
descriptor) NFIT structures and the hardware register format.
[1]: http://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf
[2]: http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf

Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Boaz Harrosh <boaz@plexistor.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

nd_btt: atomic sector updates · 5212e11f

Authored by Vishal Verma on Jun 25, 2015

BTT stands for Block Translation Table, and is a way to provide power
fail sector atomicity semantics for block devices that have the ability
to perform byte granularity IO. It relies on the capability of libnvdimm
namespace devices to do byte aligned IO.

The BTT works as a stacked block device, and reserves a chunk of space
from the backing device for its accounting metadata. It is a bio-based
driver because all IO is done synchronously, and there is no queuing or
asynchronous completions at either the device or the driver level.

The BTT uses 'lanes' to index into various 'on-disk' data structures,
and lanes also act as a synchronization mechanism in case there are more
CPUs than available lanes. We did a comparison between two lane lock
strategies - first where we kept an atomic counter around that tracked
which was the last lane that was used, and 'our' lane was determined by
atomically incrementing that. That way, for the nr_cpus > nr_lanes case,
theoretically, no CPU would be blocked waiting for a lane. The other
strategy was to use the cpu number we're scheduled on and hash it to
a lane number. Theoretically, this could block an IO that could've
otherwise run using a different, free lane. But some fio workloads
showed that the direct cpu -> lane hash performed faster than tracking
'last lane' - my reasoning is the cache thrash caused by moving the
atomic variable made that approach slower than simply waiting out the
in-progress IO. This supports the conclusion that the driver can be a
very simple bio-based one that does synchronous IOs instead of queuing.
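The direct cpu -> lane strategy can be sketched as below. Using a plain modulo as the "hash" and the function names are assumptions of this sketch.

```c
#include <stdbool.h>

/* Direct mapping: no shared counter that has to move between
 * cpu caches on every I/O. */
unsigned int cpu_to_lane(unsigned int cpu, unsigned int nr_lanes)
{
	return cpu % nr_lanes;
}

/* Two cpus can only land on the same lane when there are more cpus
 * than lanes, so a per-lane lock is needed only in that case. */
bool lane_needs_lock(unsigned int nr_cpus, unsigned int nr_lanes)
{
	return nr_cpus > nr_lanes;
}
```

With 8 cpus and 4 lanes, cpus 1 and 5 share lane 1 and serialize on its lock, while cpus on other lanes proceed independently.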

Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Boaz Harrosh <boaz@plexistor.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Neil Brown <neilb@suse.de>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
[jmoyer: fix nmi watchdog timeout in btt_map_init]
[jmoyer: move btt initialization to module load path]
[jmoyer: fix memory leak in the btt initialization path]
[jmoyer: Don't overwrite corrupted arenas]
Signed-off-by: Vishal Verma <vishal.l.verma@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

25 Jun, 2015 · 1 commit

libnvdimm: infrastructure for btt devices · 8c2f7e86

Authored by Dan Williams on Jun 25, 2015

NVDIMM namespaces, in addition to accepting "struct bio" based requests,
also have the capability to perform byte-aligned accesses.  By default
only the bio/block interface is used.  However, if another driver can
make effective use of the byte-aligned capability it can claim the
namespace and use the byte-aligned ->rw_bytes() interface.

The BTT driver is the first consumer of this mechanism to allow
adding atomic sector update semantics to a pmem or blk namespace.  This
patch is the sysfs infrastructure to allow configuring a BTT instance
for a namespace.  Enabling that BTT and performing i/o is in a
subsequent patch.

Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

openanolis / cloud-kernel · synced successfully more than 1 year ago
