提交 · 87bf572e19a092cc9cc77d5a00d543a2b628c142 · openanolis / cloud-kernel

06 3月, 2016 5 次提交

nfit: disable userspace initiated ars during scrub · 87bf572e

由 Dan Williams 提交于 2月 22, 2016

While the nfit driver is issuing address range scrub commands and
reaping the results do not permit an ars_start command issued from
userspace. The scrub thread assumes that all ars completions are for
scrubs initiated by platform firmware at boot, or by the nfit driver.
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

87bf572e

nfit: scrub and register regions in a workqueue · 1cf03c00

由 Dan Williams 提交于 2月 17, 2016

Address range scrub is a potentially long running process that we want
to complete before any pmem regions are registered. Perform this
operation asynchronously to allow other drivers to load in the meantime.

Platform firmware may have initiated a partial scrub prior to the driver
loading, so we must be careful to consume those results before kicking
off kernel initiated scrubs on other regions.

This rework also makes the registration path more tolerant of scrub
errors in that it splits scrubbing into 2 phases. The first phase
synchronously waits for a platform-firmware initiated scrub to complete.
The second phase scans the remaining address ranges asynchronously and
notifies the related driver(s) when the scrub completes.
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

1cf03c00

nfit, libnvdimm: async region scrub workqueue · 7ae0fa43

由 Dan Williams 提交于 2月 19, 2016

Introduce a workqueue that will be used to run address range scrub
asynchronously with the rest of nvdimm device probing.

Userspace still wants notification when probing operations complete, so
introduce a new callback to flush this workqueue when userspace is
awaiting probe completion.
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

7ae0fa43

nfit, tools/testing/nvdimm: unify common init for acpi_nfit_desc · a61fe6f7

由 Dan Williams 提交于 2月 19, 2016

The nvdimm unit test infrastructure performs its own initialization of
an acpi_nfit_desc to specify test overrides over the native
implementation. Make it clear which attributes and operations it is
overriding by re-using acpi_nfit_init_desc() as a common starting point.
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

a61fe6f7

libnvdimm, nfit: centralize command status translation · aef25338

由 Dan Williams 提交于 2月 12, 2016

The return value from an 'ndctl_fn' reports the command execution
status, i.e. was the command properly formatted and was it successfully
submitted to the bus provider.  The new 'cmd_rc' parameter allows the bus
provider to communicate command specific results, translated into
common error codes.

Convert the ARS commands to this scheme to:

1/ Consolidate status reporting

2/ Prepare for for expanding ars unit test cases

3/ Make the implementation more generic

Cc: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

aef25338

05 3月, 2016 1 次提交

nfit: Continue init even if ARS commands are unimplemented · 6e2452df

由 Vishal Verma 提交于 3月 03, 2016

If firmware doesn't implement any of the ARS commands, take that to
mean that ARS is unsupported, and continue to initialize regions without
bad block lists. We cannot make the assumption that ARS commands will be
unconditionally supported on all NVDIMMs.
Reported-by: NHaozhong Zhang <haozhong.zhang@intel.com>
Signed-off-by: NVishal Verma <vishal.l.verma@intel.com>
Acked-by: NXiao Guangrong <guangrong.xiao@linux.intel.com>
Tested-by: NHaozhong Zhang <haozhong.zhang@intel.com>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

6e2452df

24 2月, 2016 1 次提交

nfit: update address range scrub commands to the acpi 6.1 format · 4577b066

由 Dan Williams 提交于 2月 17, 2016

The original format of these commands from the "NVDIMM DSM Interface
Example" [1] are superseded by the ACPI 6.1 definition of the "NVDIMM Root
Device _DSMs" [2].

[1]: http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
[2]: http://www.uefi.org/sites/default/files/resources/ACPI_6_1.pdf
     "9.20.7 NVDIMM Root Device _DSMs"

Changes include:
1/ New 'restart' fields in ars_status, unfortunately these are
   implemented in the middle of the existing definition so this change
   is not backwards compatible.  The expectation is that shipping
   platforms will only ever support the ACPI 6.1 definition.

2/ New status values for ars_start ('busy') and ars_status ('overflow').

Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Linda Knippers <linda.knippers@hpe.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

4577b066

20 2月, 2016 2 次提交

libnvdimm, tools/testing/nvdimm: fix 'ars_status' output buffer sizing · 747ffe11

由 Dan Williams 提交于 2月 19, 2016

Use the output length specified in the command to size the receive
buffer rather than the arbitrary 4K limit.

This bug was hiding the fact that the ndctl implementation of
ndctl_bus_cmd_new_ars_status() was not specifying an output buffer size.

Cc: <stable@vger.kernel.org>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

747ffe11

nfit: fix multi-interface dimm handling, acpi6.1 compatibility · 6697b2cf

由 Dan Williams 提交于 2月 04, 2016

ACPI 6.1 clarified that multi-interface dimms require multiple control
region entries (DCRs) per dimm.  Previously we were assuming that a
control region is only present when block-data-windows are present.
This implementation was done with an eye to be compatibility with the
looser ACPI 6.0 interpretation of this table.

1/ When coalescing the memory device (MEMDEV) tables for a single dimm,
coalesce on device_handle rather than control region index.

2/ Whenever we disocver a control region with non-zero block windows
re-scan for block-data-window (BDW) entries.

We may need to revisit this if a DIMM ever implements a format interface
outside of blk or pmem, but that is not on the foreseeable horizon.

Cc: <stable@vger.kernel.org>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

6697b2cf

10 1月, 2016 1 次提交

libnvdimm: Add a poison list and export badblocks · 0caeef63

由 Vishal Verma 提交于 12月 24, 2015

During region creation, perform Address Range Scrubs (ARS) for the SPA
(System Physical Address) ranges to retrieve known poison locations from
firmware. Add a new data structure 'nd_poison' which is used as a list
in nvdimm_bus to store these poison locations.

When creating a pmem namespace, if there is any known poison associated
with its physical address space, convert the poison ranges to bad sectors
that are exposed using the badblocks interface.
Signed-off-by: NVishal Verma <vishal.l.verma@intel.com>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

0caeef63

12 12月, 2015 1 次提交

nfit: acpi_nfit_notify(): Do not leave device locked · d91e8928

由 Alexey Khoroshilov 提交于 12月 11, 2015

Even if dev->driver is null because we are being removed,
it is safer to not leave device locked.

Found by Linux Driver Verification project (linuxtesting.org).
Signed-off-by: NAlexey Khoroshilov <khoroshilov@ispras.ru>
Reviewed-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: NVishal Verma <vishal.l.verma@intel.com>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

d91e8928

01 12月, 2015 3 次提交

nfit: Adjust for different _FIT and NFIT headers · 6b577c9d

由 Linda Knippers 提交于 11月 20, 2015

When support for _FIT was added, the code presumed that the data
returned by the _FIT method is identical to the NFIT table, which
starts with an acpi_table_header.  However, the _FIT is defined
to return a data in the format of a series of NFIT type structure
entries and as a method, has an acpi_object header rather tahn
an acpi_table_header.

To address the differences, explicitly save the acpi_table_header
from the NFIT, since it is accessible through /sys, and change
the nfit pointer in the acpi_desc structure to point to the
table entries rather than the headers.

Reported-by: Jeff Moyer (jmoyer@redhat.com>
Signed-off-by: NLinda Knippers <linda.knippers@hpe.com>
Acked-by: NVishal Verma <vishal.l.verma@intel.com>
[vishal: fix up unit test for new header assumptions]
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

6b577c9d

nfit: Fix the check for a successful NFIT merge · ff5a55f8

由 Linda Knippers 提交于 11月 20, 2015

Missed previously due to a lack of test coverage on a platform that
provided an valid response to _FIT.
Signed-off-by: NLinda Knippers <linda.knippers@hpe.com>
Acked-by: NVishal Verma <vishal.l.verma@intel.com>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

ff5a55f8

nfit: Account for table size length variation · 826c416f

由 Linda Knippers 提交于 11月 20, 2015

The size of NFIT tables don't necessarily match the size of the
data structures that we use for them.  For example, the NVDIMM
Control Region Structure table is shorter for a device with
no block control windows than for a device with block control windows.
Other tables, such as Flush Hint Address Structure and the Interleave
Structure are variable length by definition.

Account for the size difference when comparing table entries by
using the actual table size from the table header if it's less
than the structure size.
Signed-off-by: NLinda Knippers <linda.knippers@hpe.com>
Acked-by: NVishal Verma <vishal.l.verma@intel.com>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

826c416f

03 11月, 2015 2 次提交

acpi: nfit: Add support for hot-add · 20985164

由 Vishal Verma 提交于 10月 27, 2015

Add a .notify callback to the acpi_nfit_driver that gets called on a
hotplug event. From this, evaluate the _FIT ACPI method which returns
the updated NFIT with handles for the hot-plugged NVDIMM.

Iterate over the new NFIT, and add any new tables found, and
register/enable the corresponding regions.

In the nfit test framework, after normal initialization, update the NFIT
with a new hot-plugged NVDIMM, and directly call into the driver to
update its view of the available regions.

Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Toshi Kani <toshi.kani@hpe.com>
Cc: Elliott, Robert <elliott@hpe.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: <linux-acpi@vger.kernel.org>
Cc: <linux-nvdimm@lists.01.org>
Signed-off-by: NVishal Verma <vishal.l.verma@intel.com>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

20985164

nfit: in acpi_nfit_init, break on a 0-length table · 564d5011

由 Vishal Verma 提交于 10月 27, 2015

If acpi_nfit_init is called (such as from nfit_test), with an nfit table
that has more memory allocated than it needs (and a similarly large
'size' field, add_tables would happily keep adding null SPA Range tables
filling up all available memory.

Make it friendlier by breaking out if a 0-length header is found in any
of the tables.

Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: <linux-acpi@vger.kernel.org>
Cc: <linux-nvdimm@lists.01.org>
Signed-off-by: NVishal Verma <vishal.l.verma@intel.com>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

564d5011

22 10月, 2015 1 次提交

ACPICA: Update NFIT table to rename a flags field · ca321d1c

由 Bob Moore 提交于 10月 19, 2015

ACPICA commit 534deab97fb416a13bfede15c538e2c9eac9384a

Updated one of the memory subtable flags to clarify.

Link: https://github.com/acpica/acpica/commit/534deab9Signed-off-by: NBob Moore <robert.moore@intel.com>
Signed-off-by: NLv Zheng <lv.zheng@intel.com>
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>

ca321d1c

15 10月, 2015 1 次提交

move io-64-nonatomic*.h out of asm-generic · 2f8e2c87

由 Christoph Hellwig 提交于 8月 28, 2015

These are not implementations of default architecture code but helpers
for drivers. Move them to the place they belong to.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Acked-by: NDarren Hart <dvhart@linux.intel.com>
Acked-by: NHitoshi Mitake <mitake.hitoshi@lab.ntt.co.jp>
Signed-off-by: NArnd Bergmann <arnd@arndb.de>

2f8e2c87

28 8月, 2015 3 次提交

x86, pmem: clarify that ARCH_HAS_PMEM_API implies PMEM mapped WB · 96601adb

由 Dan Williams 提交于 8月 24, 2015

Given that a write-back (WB) mapping plus non-temporal stores is
expected to be the most efficient way to access PMEM, update the
definition of ARCH_HAS_PMEM_API to imply arch support for
WB-mapped-PMEM.  This is needed as a pre-requisite for adding PMEM to
the direct map and mapping it with struct page.

The above clarification for X86_64 means that memcpy_to_pmem() is
permitted to use the non-temporal arch_memcpy_to_pmem() rather than
needlessly fall back to default_memcpy_to_pmem() when the pcommit
instruction is not available.  When arch_memcpy_to_pmem() is not
guaranteed to flush writes out of cache, i.e. on older X86_32
implementations where non-temporal stores may just dirty cache,
ARCH_HAS_PMEM_API is simply disabled.

The default fall back for persistent memory handling remains.  Namely,
map it with the WT (write-through) cache-type and hope for the best.

arch_has_pmem_api() is updated to only indicate whether the arch
provides the proper helpers to meet the minimum "writes are visible
outside the cache hierarchy after memcpy_to_pmem() + wmb_pmem()".  Code
that cares whether wmb_pmem() actually flushes writes to pmem must now
call arch_has_wmb_pmem() directly.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Reviewed-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
[hch: set ARCH_HAS_PMEM_API=n on x86_32]
Reviewed-by: NChristoph Hellwig <hch@lst.de>
[toshi: x86_32 compile fixes]
Signed-off-by: NToshi Kani <toshi.kani@hp.com>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

96601adb

nd_blk: change aperture mapping from WC to WB · 67a3e8fe

由 Ross Zwisler 提交于 8月 27, 2015

This should result in a pretty sizeable performance gain for reads.  For
rough comparison I did some simple read testing using PMEM to compare
reads of write combining (WC) mappings vs write-back (WB).  This was
done on a random lab machine.

PMEM reads from a write combining mapping:
	# dd of=/dev/null if=/dev/pmem0 bs=4096 count=100000
	100000+0 records in
	100000+0 records out
	409600000 bytes (410 MB) copied, 9.2855 s, 44.1 MB/s

PMEM reads from a write-back mapping:
	# dd of=/dev/null if=/dev/pmem0 bs=4096 count=1000000
	1000000+0 records in
	1000000+0 records out
	4096000000 bytes (4.1 GB) copied, 3.44034 s, 1.2 GB/s

To be able to safely support a write-back aperture I needed to add
support for the "read flush" _DSM flag, as outlined in the DSM spec:

http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf

This flag tells the ND BLK driver that it needs to flush the cache lines
associated with the aperture after the aperture is moved but before any
new data is read.  This ensures that any stale cache lines from the
previous contents of the aperture will be discarded from the processor
cache, and the new data will be read properly from the DIMM.  We know
that the cache lines are clean and will be discarded without any
writeback because either a) the previous aperture operation was a read,
and we never modified the contents of the aperture, or b) the previous
aperture operation was a write and we must have written back the dirtied
contents of the aperture to the DIMM before the I/O was completed.

In order to add support for the "read flush" flag I needed to add a
generic routine to invalidate cache lines, mmio_flush_range().  This is
protected by the ARCH_HAS_MMIO_FLUSH Kconfig variable, and is currently
only supported on x86.
Signed-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

67a3e8fe

nfit: Clarify memory device state flags strings · 402bae59

由 Toshi Kani 提交于 8月 26, 2015

ACPI 6.0 NFIT Memory Device State Flags in Table 5-129 defines
NVDIMM status as follows.  These bits indicate multiple info,
such as failures, pending event, and capability.

  Bit [0] set to 1 to indicate that the previous SAVE to the
  Memory Device failed.
  Bit [1] set to 1 to indicate that the last RESTORE from the
  Memory Device failed.
  Bit [2] set to 1 to indicate that platform flush of data to
  Memory Device failed. As a result, the restored data content
  may be inconsistent even if SAVE and RESTORE do not indicate
  failure.
  Bit [3] set to 1 to indicate that the Memory Device is observed
  to be not armed prior to OSPM hand off. A Memory Device is
  considered armed if it is able to accept persistent writes.
  Bit [4] set to 1 to indicate that the Memory Device observed
  SMART and health events prior to OSPM handoff.

/sys/bus/nd/devices/nmemX/nfit/flags shows this flags info.
The output strings associated with the bits are "save", "restore",
"smart", etc., which can be confusing as they may be interpreted
as positive status, i.e. save succeeded.

Change also the dev_info() message in acpi_nfit_register_dimms()
to be consistent with the sysfs flags strings.
Reported-by: NRobert Elliott <elliott@hp.com>
Signed-off-by: NToshi Kani <toshi.kani@hp.com>
[ross: rename 'not_arm' to 'not_armed']
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
[djbw: defer adding bit5, HEALTH_ENABLED, for now]
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

402bae59

26 8月, 2015 1 次提交

nfit, nd_blk: BLK status register is only 32 bits · de4a196c

由 Ross Zwisler 提交于 8月 20, 2015

Only read 32 bits for the BLK status register in read_blk_stat().

The format and size of this register is defined in the
"NVDIMM Driver Writer's guide":

http://pmem.io/documents/NVDIMM_Driver_Writers_Guide.pdfSigned-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
Reported-by: NNicholas Moulin <nicholas.w.moulin@linux.intel.com>
Tested-by: NNicholas Moulin <nicholas.w.moulin@linux.intel.com>
Reviewed-by: NJeff Moyer <jmoyer@redhat.com>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

de4a196c

28 7月, 2015 2 次提交

nfit: Don't check _STA on NVDIMM devices · 60e95f43

由 Linda Knippers 提交于 7月 22, 2015

The _STA only applies to the root device, not the individual NVDIMMS,
so don't check here. NVDIMM device state flags are checked elsewhere.
Signed-off-by: NLinda Knippers <linda.knippers@hp.com>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

60e95f43

libnvdimm: Add DSM support for Address Range Scrub commands · 39c686b8

由 Vishal Verma 提交于 7月 09, 2015

Add support for the three ARS DSM commands:
- Query ARS Capabilities - Queries the firmware to check if a given
  range supports scrub, and if so, which type (persistent vs. volatile)
- Start ARS - Starts a scrub for a given range/type
- Query ARS Status - Checks status of a previously started scrub, and
  provides the error logs if any.

  The commands are described by the example DSM spec at:
  http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf

Also add these commands to the nfit_test test framework, and return
canned data.
Signed-off-by: NVishal Verma <vishal.l.verma@intel.com>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

39c686b8

11 7月, 2015 2 次提交

nfit: add support for NVDIMM "latch" flag · f0f2c072

由 Ross Zwisler 提交于 7月 10, 2015

Add support in the NFIT BLK I/O path for the "latch" flag
defined in the "Get Block NVDIMM Flags" _DSM function:

http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf

This flag requires the driver to read back the command register after it
is written in the block I/O path. This ensures that the hardware has
fully processed the new command and moved the aperture appropriately.
Signed-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
Acked-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

f0f2c072

nfit: update block I/O path to use PMEM API · c2ad2954

由 Ross Zwisler 提交于 7月 10, 2015

Update the nfit block I/O path to use the new PMEM API and to adhere to
the read/write flows outlined in the "NVDIMM Block Window Driver
Writer's Guide":

http://pmem.io/documents/NVDIMM_Driver_Writers_Guide.pdf

This includes adding support for targeted NVDIMM flushes called "flush
hints" in the ACPI 6.0 specification:

http://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf

For performance and media durability the mapping for a BLK aperture is
moved to a write-combining mapping which is consistent with
memcpy_to_pmem() and wmb_blk().
Signed-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
Acked-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

c2ad2954

01 7月, 2015 1 次提交

nfit: fix smatch "use after null check" report · 193ccca4

由 Dan Williams 提交于 6月 30, 2015

drivers/acpi/nfit.c:1224 acpi_nfit_blk_region_enable()
         error: we previously assumed 'nfit_mem' could be null (see line 1223)

drivers/acpi/nfit.c
  1222          nfit_mem = nvdimm_provider_data(nvdimm);
  1223          if (!nfit_mem || !nfit_mem->dcr || !nfit_mem->bdw) {
                     ^^^^^^^^
Check.

  1224                  dev_dbg(dev, "%s: missing%s%s%s\n", __func__,
  1225                                  nfit_mem ? "" : " nfit_mem",
  1226                                  nfit_mem->dcr ? "" : " dcr",
                                        ^^^^^^^^^^^^^
Unchecked dereference.
Reported-by: NDan Carpenter <dan.carpenter@oracle.com>
Acked-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

193ccca4

26 6月, 2015 6 次提交

libnvdimm: Add sysfs numa_node to NVDIMM devices · 74ae66c3

由 Toshi Kani 提交于 6月 19, 2015

Add support of sysfs 'numa_node' to I/O-related NVDIMM devices
under /sys/bus/nd/devices, regionN, namespaceN.0, and bttN.x.

An example of numa_node values on a 2-socket system with a single
NVDIMM range on each socket is shown below.
  /sys/bus/nd/devices
  |-- btt0.0/numa_node:0
  |-- btt1.0/numa_node:1
  |-- btt1.1/numa_node:1
  |-- namespace0.0/numa_node:0
  |-- namespace1.0/numa_node:1
  |-- region0/numa_node:0
  |-- region1/numa_node:1

These numa_node files are then linked under the block class of
their device names.
  /sys/class/block/pmem0/device/numa_node:0
  /sys/class/block/pmem1s/device/numa_node:1

This enables numactl(8) to accept 'block:' and 'file:' paths of
pmem and btt devices as shown in the examples below.
  numactl --preferred block:pmem0 --show
  numactl --preferred file:/dev/pmem1s --show
Signed-off-by: NToshi Kani <toshi.kani@hp.com>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

74ae66c3

libnvdimm: Set numa_node to NVDIMM devices · 41d7a6d6

由 Toshi Kani 提交于 6月 19, 2015

ACPI NFIT table has System Physical Address Range Structure entries that
describe a proximity ID of each range when ACPI_NFIT_PROXIMITY_VALID is
set in the flags.

Change acpi_nfit_register_region() to map a proximity ID to its node ID,
and set it to a new numa_node field of nd_region_desc, which is then
conveyed to the nd_region device.

The device core arranges for btt and namespace devices to inherit their
node from their parent region.
Signed-off-by: NToshi Kani <toshi.kani@hp.com>
[djbw: move set_dev_node() from region.c to bus.c]
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

41d7a6d6

libnvdimm, nfit: handle unarmed dimms, mark namespaces read-only · 58138820

由 Dan Williams 提交于 6月 23, 2015

Upon detection of an unarmed dimm in a region, arrange for descendant
BTT, PMEM, or BLK instances to be read-only. A dimm is primarily marked
"unarmed" via flags passed by platform firmware (NFIT).

The flags in the NFIT memory device sub-structure indicate the state of
the data on the nvdimm relative to its energy source or last "flush to
persistence". For the most part there is nothing the driver can do but
advertise the state of these flags in sysfs and emit a message if
firmware indicates that the contents of the device may be corrupted.
However, for the case of ACPI_NFIT_MEM_ARMED, the driver can arrange for
the block devices incorporating that nvdimm to be marked read-only.
This is a safe default as the data is still available and new writes are
held off until the administrator either forces read-write mode, or the
energy source becomes armed.

A 'read_only' attribute is added to REGION devices to allow for
overriding the default read-only policy of all descendant block devices.
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

58138820

tools/testing/nvdimm: libnvdimm unit test infrastructure · 6bc75619

由 Dan Williams 提交于 6月 17, 2015

'libnvdimm' is the first driver sub-system in the kernel to implement
mocking for unit test coverage.  The nfit_test module gets built as an
external module and arranges for external module replacements of nfit,
libnvdimm, nd_pmem, and nd_blk.  These replacements use the linker
--wrap option to redirect calls to ioremap() + request_mem_region() to
custom defined unit test resources.  The end result is a fully
functional nvdimm_bus, as far as userspace is concerned, but with the
capability to perform otherwise destructive tests on emulated resources.

Q: Why not use QEMU for this emulation?
QEMU is not suitable for unit testing.  QEMU's role is to faithfully
emulate the platform.  A unit test's role is to unfaithfully implement
the platform with the goal of triggering bugs in the corners of the
sub-system implementation.  As bugs are discovered in platforms, or the
sub-system itself, the unit tests are extended to backstop a fix with a
reproducer unit test.

Another problem with QEMU is that it would require coordination of 3
software projects instead of 2 (kernel + libndctl [1]) to maintain and
execute the tests.  The chances for bit rot and the difficulty of
getting the tests running goes up non-linearly the more components
involved.


Q: Why submit this to the kernel tree instead of external modules in
   libndctl?
Simple, to alleviate the same risk that out-of-tree external modules
face.  Updates to drivers/nvdimm/ can be immediately evaluated to see if
they have any impact on tools/testing/nvdimm/.


Q: What are the negative implications of merging this?
It is a unique maintenance burden because the purpose of mocking an
interface to enable a unit test is to purposefully short circuit the
semantics of a routine to enable testing.  For example
__wrap_ioremap_cache() fakes the pmem driver into "ioremap()'ing" a test
resource buffer allocated by dma_alloc_coherent().  The future
maintenance burden hits when someone changes the semantics of
ioremap_cache() and wonders what the implications are for the unit test.

[1]: https://github.com/pmem/ndctl

Cc: <linux-acpi@vger.kernel.org>
Cc: Lv Zheng <lv.zheng@intel.com>
Cc: Robert Moore <robert.moore@intel.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

6bc75619

libnvdimm, nfit, nd_blk: driver for BLK-mode access persistent memory · 047fc8a1

由 Ross Zwisler 提交于 6月 25, 2015

The libnvdimm implementation handles allocating dimm address space (DPA)
between PMEM and BLK mode interfaces.  After DPA has been allocated from
a BLK-region to a BLK-namespace the nd_blk driver attaches to handle I/O
as a struct bio based block device. Unlike PMEM, BLK is required to
handle platform specific details like mmio register formats and memory
controller interleave.  For this reason the libnvdimm generic nd_blk
driver calls back into the bus provider to carry out the I/O.

This initial implementation handles the BLK interface defined by the
ACPI 6 NFIT [1] and the NVDIMM DSM Interface Example [2] composed from
DCR (dimm control region), BDW (block data window), IDT (interleave
descriptor) NFIT structures and the hardware register format.
[1]: http://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf
[2]: http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf

Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Boaz Harrosh <boaz@plexistor.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
Acked-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

047fc8a1

nd_btt: atomic sector updates · 5212e11f

由 Vishal Verma 提交于 6月 25, 2015

BTT stands for Block Translation Table, and is a way to provide power
fail sector atomicity semantics for block devices that have the ability
to perform byte granularity IO. It relies on the capability of libnvdimm
namespace devices to do byte aligned IO.

The BTT works as a stacked blocked device, and reserves a chunk of space
from the backing device for its accounting metadata. It is a bio-based
driver because all IO is done synchronously, and there is no queuing or
asynchronous completions at either the device or the driver level.

The BTT uses 'lanes' to index into various 'on-disk' data structures,
and lanes also act as a synchronization mechanism in case there are more
CPUs than available lanes. We did a comparison between two lane lock
strategies - first where we kept an atomic counter around that tracked
which was the last lane that was used, and 'our' lane was determined by
atomically incrementing that. That way, for the nr_cpus > nr_lanes case,
theoretically, no CPU would be blocked waiting for a lane. The other
strategy was to use the cpu number we're scheduled on to and hash it to
a lane number. Theoretically, this could block an IO that could've
otherwise run using a different, free lane. But some fio workloads
showed that the direct cpu -> lane hash performed faster than tracking
'last lane' - my reasoning is the cache thrash caused by moving the
atomic variable made that approach slower than simply waiting out the
in-progress IO. This supports the conclusion that the driver can be a
very simple bio-based one that does synchronous IOs instead of queuing.

Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Boaz Harrosh <boaz@plexistor.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Neil Brown <neilb@suse.de>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
[jmoyer: fix nmi watchdog timeout in btt_map_init]
[jmoyer: move btt initialization to module load path]
[jmoyer: fix memory leak in the btt initialization path]
[jmoyer: Don't overwrite corrupted arenas]
Signed-off-by: NVishal Verma <vishal.l.verma@linux.intel.com>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

5212e11f

25 6月, 2015 7 次提交

libnvdimm, nfit: add interleave-set state-tracking infrastructure · eaf96153

由 Dan Williams 提交于 5月 01, 2015

On platforms that have firmware support for reading/writing per-dimm
label space, a portion of the dimm may be accessible via an interleave
set PMEM mapping in addition to the dimm's BLK (block-data-window
aperture(s)) interface.  A label, stored in a "configuration data
region" on the dimm, disambiguates which dimm addresses are accessed
through which exclusive interface.

Add infrastructure that allows the kernel to block modifications to a
label in the set while any member dimm is active.  Note that this is
meant only for enforcing "no modifications of active labels" via the
coarse ioctl command.  Adding/deleting namespaces from an active
interleave set is always possible via sysfs.

Another aspect of tracking interleave sets is tracking their integrity
when DIMMs in a set are physically re-ordered.  For this purpose we
generate an "interleave-set cookie" that can be recorded in a label and
validated against the current configuration.  It is the bus provider
implementation's responsibility to calculate the interleave set cookie
and attach it to a given region.

Cc: Neil Brown <neilb@suse.de>
Cc: <linux-acpi@vger.kernel.org>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Robert Moore <robert.moore@intel.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: NChristoph Hellwig <hch@lst.de>
Acked-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

eaf96153

libnvdimm: support for legacy (non-aliasing) nvdimms · 3d88002e

由 Dan Williams 提交于 5月 31, 2015

The libnvdimm region driver is an intermediary driver that translates
non-volatile "region"s into "namespace" sub-devices that are surfaced by
persistent memory block-device drivers (PMEM and BLK).

ACPI 6 introduces the concept that a given nvdimm may simultaneously
offer multiple access modes to its media through direct PMEM load/store
access, or windowed BLK mode. Existing nvdimms mostly implement a PMEM
interface, some offer a BLK-like mode, but never both as ACPI 6 defines.
If an nvdimm is single interfaced, then there is no need for dimm
metadata labels. For these devices we can take the region boundaries
directly to create a child namespace device (nd_namespace_io).
Acked-by: NChristoph Hellwig <hch@lst.de>
Tested-by: NToshi Kani <toshi.kani@hp.com>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

3d88002e

libnvdimm, nfit: regions (block-data-window, persistent memory, volatile memory) · 1f7df6f8

由 Dan Williams 提交于 6月 09, 2015

A "region" device represents the maximum capacity of a BLK range (mmio
block-data-window(s)), or a PMEM range (DAX-capable persistent memory or
volatile memory), without regard for aliasing.  Aliasing, in the
dimm-local address space (DPA), is resolved by metadata on a dimm to
designate which exclusive interface will access the aliased DPA ranges.
Support for the per-dimm metadata/label arrvies is in a subsequent
patch.

The name format of "region" devices is "regionN" where, like dimms, N is
a global ida index assigned at discovery time.  This id is not reliable
across reboots nor in the presence of hotplug.  Look to attributes of
the region or static id-data of the sub-namespace to generate a
persistent name.  However, if the platform configuration does not change
it is reasonable to expect the same region id to be assigned at the next
boot.

"region"s have 2 generic attributes "size", and "mapping"s where:
- size: the BLK accessible capacity or the span of the
  system physical address range in the case of PMEM.

- mappingN: a tuple describing a dimm's contribution to the region's
  capacity in the format (<nmemX>,<dpa>,<size>).  For a PMEM-region
  there will be at least one mapping per dimm in the interleave set.  For
  a BLK-region there is only "mapping0" listing the starting DPA of the
  BLK-region and the available DPA capacity of that space (matches "size"
  above).

The max number of mappings per "region" is hard coded per the
constraints of sysfs attribute groups.  That said the number of mappings
per region should never exceed the maximum number of possible dimms in
the system.  If the current number turns out to not be enough then the
"mappings" attribute clarifies how many there are supposed to be. "32
should be enough for anybody...".

Cc: Neil Brown <neilb@suse.de>
Cc: <linux-acpi@vger.kernel.org>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Robert Moore <robert.moore@intel.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: NChristoph Hellwig <hch@lst.de>
Acked-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: NToshi Kani <toshi.kani@hp.com>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

1f7df6f8

libnvdimm, nvdimm: dimm driver and base libnvdimm device-driver infrastructure · 4d88a97a

由 Dan Williams 提交于 5月 31, 2015

* Implement the device-model infrastructure for loading modules and
  attaching drivers to nvdimm devices.  This is a simple association of a
  nd-device-type number with a driver that has a bitmask of supported
  device types.  To facilitate userspace bind/unbind operations 'modalias'
  and 'devtype', that also appear in the uevent, are added as generic
  sysfs attributes for all nvdimm devices.  The reason for the device-type
  number is to support sub-types within a given parent devtype, be it a
  vendor-specific sub-type or otherwise.

* The first consumer of this infrastructure is the driver
  for dimm devices.  It simply uses control messages to retrieve and
  store the configuration-data image (label set) from each dimm.

Note: nd_device_register() arranges for asynchronous registration of
      nvdimm bus devices by default.

Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Neil Brown <neilb@suse.de>
Acked-by: NChristoph Hellwig <hch@lst.de>
Tested-by: NToshi Kani <toshi.kani@hp.com>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

4d88a97a

libnvdimm: control (ioctl) messages for nvdimm_bus and nvdimm devices · 62232e45

由 Dan Williams 提交于 6月 08, 2015

Most discovery/configuration of the nvdimm-subsystem is done via sysfs
attributes.  However, some nvdimm_bus instances, particularly the
ACPI.NFIT bus, define a small set of messages that can be passed to the
platform.  For convenience we derive the initial libnvdimm-ioctl command
formats directly from the NFIT DSM Interface Example formats.

    ND_CMD_SMART: media health and diagnostics
    ND_CMD_GET_CONFIG_SIZE: size of the label space
    ND_CMD_GET_CONFIG_DATA: read label space
    ND_CMD_SET_CONFIG_DATA: write label space
    ND_CMD_VENDOR: vendor-specific command passthrough
    ND_CMD_ARS_CAP: report address-range-scrubbing capabilities
    ND_CMD_ARS_START: initiate scrubbing
    ND_CMD_ARS_STATUS: report on scrubbing state
    ND_CMD_SMART_THRESHOLD: configure alarm thresholds for smart events

If a platform later defines different commands than this set it is
straightforward to extend support to those formats.

Most of the commands target a specific dimm.  However, the
address-range-scrubbing commands target the bus.  The 'commands'
attribute in sysfs of an nvdimm_bus, or nvdimm, enumerate the supported
commands for that object.

Cc: <linux-acpi@vger.kernel.org>
Cc: Robert Moore <robert.moore@intel.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reported-by: NNicholas Moulin <nicholas.w.moulin@linux.intel.com>
Acked-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

62232e45

libnvdimm, nfit: dimm/memory-devices · e6dfb2de

由 Dan Williams 提交于 4月 25, 2015

Enable nvdimm devices to be registered on a nvdimm_bus.  The kernel
assigned device id for nvdimm devicesis dynamic.  If userspace needs a
more static identifier it should consult a provider-specific attribute.
In the case where NFIT is the provider, the 'nmemX/nfit/handle' or
'nmemX/nfit/serial' attributes may be used for this purpose.

Cc: Neil Brown <neilb@suse.de>
Cc: <linux-acpi@vger.kernel.org>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Robert Moore <robert.moore@intel.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: NChristoph Hellwig <hch@lst.de>
Acked-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: NToshi Kani <toshi.kani@hp.com>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

e6dfb2de

libnvdimm: control character device and nvdimm_bus sysfs attributes · 45def22c

由 Dan Williams 提交于 4月 26, 2015

The control device for a nvdimm_bus is registered as an "nd" class
device.  The expectation is that there will usually only be one "nd" bus
registered under /sys/class/nd.  However, we allow for the possibility
of multiple buses and they will listed in discovery order as
ndctl0...ndctlN.  This character device hosts the ioctl for passing
control messages.  The initial command set has a 1:1 correlation with
the commands listed in the by the "NFIT DSM Example" document [1], but
this scheme is extensible to future command sets.

Note, nd_ioctl() and the backing ->ndctl() implementation are defined in
a subsequent patch.  This is simply the initial registrations and sysfs
attributes.

[1]: http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf

Cc: Neil Brown <neilb@suse.de>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: <linux-acpi@vger.kernel.org>
Cc: Robert Moore <robert.moore@intel.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: NChristoph Hellwig <hch@lst.de>
Acked-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: NToshi Kani <toshi.kani@hp.com>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>

45def22c

openanolis / cloud-kernel 1 年多 前同步成功

openanolis / cloud-kernel
1 年多前同步成功