Commits · b5ebc8ec693281c3c1efff7459a069cbd8b9a149 · openeuler / Kernel

10 Mar, 2016 1 commit

libnvdimm, pmem: fix kmap_atomic() leak in error path · b5ebc8ec

Authored by Dan Williams on Mar 06, 2016

When we encounter a bad block we need to kunmap_atomic() before
returning.

Cc: <stable@vger.kernel.org>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

b5ebc8ec
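
The fix described above follows a common error-path pattern: release the temporary mapping on every exit, including the bad-block one. The following is a minimal userspace sketch, not the driver code. `model_kmap()`, `model_kunmap()`, and `model_read_bvec()` are hypothetical stand-ins that only count outstanding mappings so a leak would be observable.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-ins for kmap_atomic()/kunmap_atomic(): they only
 * count outstanding mappings so that a leak becomes visible. */
static int mappings_outstanding;

static void *model_kmap(char *page)
{
    mappings_outstanding++;
    return page;
}

static void model_kunmap(void *addr)
{
    (void)addr;
    assert(mappings_outstanding > 0);
    mappings_outstanding--;
}

/* Copy len bytes from the "device" into page; is_bad simulates a hit
 * on a known bad block. */
static int model_read_bvec(char *page, const char *dev, size_t len, bool is_bad)
{
    int rc = 0;
    void *mem = model_kmap(page);

    if (is_bad)
        rc = -EIO;      /* do not return here: the mapping is still held */
    else
        memcpy(mem, dev, len);

    model_kunmap(mem);  /* single exit point releases the mapping on both paths */
    return rc;
}
```

Routing both outcomes through one unmap call is one way to make the leak impossible; an early `return -EIO` before the unmap would reproduce the bug class the commit describes.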

06 Mar, 2016 1 commit

libnvdimm: async notification support · 71999466

Authored by Dan Williams on Feb 18, 2016

In preparation for asynchronous address range scrub support add an
ability for the pmem driver to dynamically consume address range scrub
results.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

71999466

24 Feb, 2016 1 commit

nvdimm: use 'u64' for pfn flags · c4544205

Authored by Arnd Bergmann on Feb 22, 2016

A recent bugfix changed pfn_t to always be 64-bit wide, but did not
change the code in pmem.c, which is now broken on 32-bit architectures
as reported by gcc:

In file included from ../drivers/nvdimm/pmem.c:28:0:
drivers/nvdimm/pmem.c: In function 'pmem_alloc':
include/linux/pfn_t.h:15:17: error: large integer implicitly truncated to unsigned type [-Werror=overflow]
 #define PFN_DEV (1ULL << (BITS_PER_LONG_LONG - 3))

This changes the intermediate pfn_flags in struct pmem_device to
be 64 bit wide as well, so they can store the flags correctly.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: db78c222 ("mm: fix pfn_t vs highmem")
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

c4544205
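
The truncation that gcc reports can be reproduced in isolation. This sketch assumes a 32-bit `unsigned long`, modelled here with `uint32_t`; the `MODEL_` names and the two store helpers are illustrative and are not taken from the kernel headers.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_BITS_PER_LONG_LONG 64
/* Same shape as the PFN_DEV definition quoted in the gcc error above. */
#define MODEL_PFN_DEV (1ULL << (MODEL_BITS_PER_LONG_LONG - 3))

/* Models storing the flags in an 'unsigned long' on a 32-bit
 * architecture: the upper 32 bits are dropped. */
static uint32_t store_flags_32(uint64_t flags)
{
    return (uint32_t)flags;
}

/* Models the fixed field type: a 64-bit value keeps every flag bit. */
static uint64_t store_flags_64(uint64_t flags)
{
    return flags;
}
```

Because the flag lives in bit 61, nothing of it survives in a 32-bit field, which is why widening the intermediate field is sufficient.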

16 Jan, 2016 6 commits

mm, dax, pmem: introduce {get|put}_dev_pagemap() for dax-gup · 5c2c2587

Authored by Dan Williams on Jan 15, 2016

get_dev_pagemap() enables paths like get_user_pages() to pin a dynamically
mapped pfn-range (devm_memremap_pages()) while the resulting struct page
objects are in use.  Unlike get_page() it may fail if the device is, or
is in the process of being, disabled.  While the initial lookup of the
range may be an expensive list walk, the result is cached to speed up
subsequent lookups which are likely to be in the same mapped range.

devm_memremap_pages() now requires a reference counter to be specified
at init time.  For pmem this means moving request_queue allocation into
pmem_alloc() so the existing queue usage counter can track "device
pages".

ZONE_DEVICE pages always have an elevated count and will never be on an
lru reclaim list.  That space in 'struct page' can be redirected for
other uses, but for safety introduce a poison value that will always
trip __list_add() to assert.  This allows half of the struct list_head
storage to be reclaimed with some assurance to back up the assumption
that the page count never goes to zero and a list_add() is never
attempted.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Logan Gunthorpe <logang@deltatee.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

5c2c2587
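
The key behavioural difference from get_page() is that taking a reference can fail once the device is being disabled. A single-threaded sketch of that contract follows; `struct model_pagemap` and the `model_*` helpers are hypothetical, and the kernel's actual reference counter and lookup cache are not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a device page map. */
struct model_pagemap {
    unsigned long start_pfn;
    unsigned long end_pfn;   /* inclusive */
    bool live;               /* cleared when teardown starts */
    long refs;               /* pins held by in-flight users */
};

/* Returns pgmap with a reference held, or NULL if pfn is outside the
 * range or the device is (being) disabled. */
static struct model_pagemap *model_get_pagemap(struct model_pagemap *pgmap,
                                               unsigned long pfn)
{
    if (pfn < pgmap->start_pfn || pfn > pgmap->end_pfn)
        return NULL;
    if (!pgmap->live)
        return NULL;
    pgmap->refs++;
    return pgmap;
}

static void model_put_pagemap(struct model_pagemap *pgmap)
{
    if (!pgmap)
        return;
    assert(pgmap->refs > 0);
    pgmap->refs--;
}

/* Begin teardown: refuse new pins and report whether the range is
 * already idle. A real driver would wait for refs to drain. */
static bool model_kill(struct model_pagemap *pgmap)
{
    pgmap->live = false;
    return pgmap->refs == 0;
}
```

A caller pins the range with `model_get_pagemap()`, uses the pages, and drops the pin with `model_put_pagemap()`; after `model_kill()` every new pin attempt returns NULL.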

libnvdimm, pmem: move request_queue allocation earlier in probe · 468ded03

Authored by Dan Williams on Jan 15, 2016

Before the dynamically allocated struct pages from devm_memremap_pages()
can be put to use outside the driver, we need a mechanism to track
whether they are still in use at teardown.  Towards that goal reorder
the initialization sequence to allow the 'q_usage_counter' from the
request_queue to be used by the devm_memremap_pages() implementation (in
subsequent patches).
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

468ded03

libnvdimm, pfn, pmem: allocate memmap array in persistent memory · d2c0f041

Authored by Dan Williams on Jan 15, 2016

Use the new vmem_altmap capability to enable the pmem driver to arrange
for a struct page memmap to be established in persistent memory.

[linux@roeck-us.net: mn10300: declare __pfn_to_phys() to fix build error]
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

d2c0f041

x86, mm: introduce vmem_altmap to augment vmemmap_populate() · 4b94ffdc

Authored by Dan Williams on Jan 15, 2016

In support of providing struct page for large persistent memory
capacities, use struct vmem_altmap to change the default policy for
allocating memory for the memmap array.  The default vmemmap_populate()
allocates page table storage area from the page allocator.  Given
persistent memory capacities relative to DRAM it may not be feasible to
store the memmap in 'System Memory'.  Instead vmem_altmap represents
pre-allocated "device pages" to satisfy vmemmap_alloc_block_buf()
requests.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reported-by: kbuild test robot <lkp@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

4b94ffdc
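
The idea of satisfying memmap allocations from pre-allocated device pages can be sketched as a bump allocator over a reserved pfn range. This is a simplified model under stated assumptions: `struct model_altmap`, its field names, and `model_altmap_alloc()` are illustrative, and alignment handling is left out.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_NO_PFN UINT64_MAX

/* Hypothetical, simplified alternate memmap descriptor. */
struct model_altmap {
    uint64_t base_pfn;  /* first pfn of the device range */
    uint64_t reserve;   /* pages at the base that the driver keeps for itself */
    uint64_t free;      /* pages set aside for memmap storage */
    uint64_t alloc;     /* pages handed out so far */
};

/* Hand out nr_pages contiguous device pages for memmap storage, or
 * MODEL_NO_PFN when the set-aside area is exhausted. */
static uint64_t model_altmap_alloc(struct model_altmap *a, uint64_t nr_pages)
{
    uint64_t pfn;

    assert(a->alloc <= a->free);
    if (nr_pages > a->free - a->alloc)
        return MODEL_NO_PFN;

    pfn = a->base_pfn + a->reserve + a->alloc;
    a->alloc += nr_pages;
    return pfn;
}
```

The point of the design is that the cost of the memmap scales with the persistent memory capacity and is paid out of that capacity, instead of out of DRAM.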

mm: introduce find_dev_pagemap() · 9476df7d

Authored by Dan Williams on Jan 15, 2016

There are several scenarios where we need to retrieve and update
metadata associated with a given devm_memremap_pages() mapping, and the
only lookup key available is a pfn in the range:

1/ We want to augment vmemmap_populate() (called via arch_add_memory())
   to allocate memmap storage from pre-allocated pages reserved by the
   device driver.  At vmemmap_alloc_block_buf() time it grabs device pages
   rather than page allocator pages.  This is in support of
   devm_memremap_pages() mappings where the memmap is too large to fit in
   main memory (i.e. large persistent memory devices).

2/ Taking a reference against the mapping when inserting device pages
   into the address_space radix of a given inode.  This facilitates
   unmap_mapping_range() and truncate_inode_pages() operations when the
   driver is tearing down the mapping.

3/ get_user_pages() operations on ZONE_DEVICE memory require taking a
   reference against the mapping so that the driver teardown path can
   revoke and drain usage of device pages.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Tested-by: Logan Gunthorpe <logang@deltatee.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

9476df7d

mm, dax, pmem: introduce pfn_t · 34c0fd54

Authored by Dan Williams on Jan 15, 2016

For the purpose of communicating the optional presence of a 'struct
page' for the pfn returned from ->direct_access(), introduce a type that
encapsulates a page-frame-number plus flags.  These flags contain the
historical "page_link" encoding for a scatterlist entry, but can also
denote "device memory".  Where "device memory" is a set of pfns that are
not part of the kernel's linear mapping by default, but are accessed via
the same memory controller as ram.

The motivation for this new type is large capacity persistent memory
that needs struct page entries in the 'memmap' to support 3rd party DMA
(i.e.  O_DIRECT I/O with a persistent memory source/target).  However,
we also need it in support of maintaining a list of mapped inodes which
need to be unmapped at driver teardown or freeze_bdev() time.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Hansen <dave@sr71.net>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

34c0fd54
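
A type that carries a page-frame-number plus flags can be sketched by packing the flags into the top bits of a 64-bit value. The bit positions, the `MODEL_PFN_MAP` flag, and all `model_*` names below are assumptions made for illustration; only the general "pfn plus flags" encoding comes from the message above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a pfn-plus-flags type. */
typedef struct { uint64_t val; } model_pfn_t;

#define MODEL_PFN_DEV        (1ULL << 61)   /* "device memory" */
#define MODEL_PFN_MAP        (1ULL << 60)   /* assumed: device memory that has struct page */
#define MODEL_PFN_FLAGS_MASK (0xFULL << 60) /* top four bits reserved for flags */

static model_pfn_t model_pfn_to_pfn_t(uint64_t pfn, uint64_t flags)
{
    model_pfn_t p;

    assert((pfn & MODEL_PFN_FLAGS_MASK) == 0);  /* pfn must not spill into flag bits */
    p.val = pfn | (flags & MODEL_PFN_FLAGS_MASK);
    return p;
}

static uint64_t model_pfn_t_to_pfn(model_pfn_t p)
{
    return p.val & ~MODEL_PFN_FLAGS_MASK;
}

/* Ordinary RAM always has a struct page; device memory has one only
 * when it is flagged as mapped. */
static bool model_pfn_t_has_page(model_pfn_t p)
{
    return (p.val & MODEL_PFN_MAP) || !(p.val & MODEL_PFN_DEV);
}
```

Wrapping the value in a struct keeps callers from passing a flagged value where a raw pfn is expected, since the two are no longer the same C type.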

10 Jan, 2016 6 commits

libnvdimm, pmem: nvdimm_read_bytes() badblocks support · 710d69cc

Authored by Dan Williams on Jan 04, 2016

Support badblock checking in all the pmem read paths that do not go
through the block layer. This protects info block reads (btt or pfn) as
well as data reads to a pmem namespace via a btt instance.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

710d69cc

pmem, dax: disable dax in the presence of bad blocks · 57f7f317

Authored by Dan Williams on Jan 06, 2016

Longer term teach dax to punch "error" holes in mapping requests and
deliver SIGBUS to applications that consume a bad pmem page.  For now,
simply disable the dax performance optimization in the presence of known
errors.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

57f7f317

pmem: fail io-requests to known bad blocks · e10624f8

Authored by Dan Williams on Jan 06, 2016

Check the sectors specified in a read bio to see if they hit a known bad
block, and return an error code from pmem_do_bvec().

Note that the ->rw_page() is not in a position to return errors.  For
now, copy the same layering violation present in zram_rw_page() to avoid
crashes of the form:

 kernel BUG at mm/filemap.c:822!
 [..]
 Call Trace:
  [<ffffffff811c540e>] page_endio+0x1e/0x60
  [<ffffffff81290d29>] mpage_end_io+0x39/0x60
  [<ffffffff8141c4ef>] bio_endio+0x3f/0x60
  [<ffffffffa005c491>] pmem_make_request+0x111/0x230 [nd_pmem]

...i.e. unlock a page that was already unlocked via pmem_rw_page() =>
page_endio().
Reported-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

e10624f8
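
The read-side check can be sketched as an interval-overlap test against a list of known bad ranges. This is an illustrative model only: `struct model_bad_range`, `model_is_bad()`, and `model_do_bvec()` are hypothetical names, the list is a plain array, and the choice of `-EIO` as the error code is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bad-range record, in 512-byte sectors. */
struct model_bad_range {
    uint64_t start;
    uint64_t len;
};

/* True if [sector, sector + nr) overlaps any known bad range. */
static bool model_is_bad(const struct model_bad_range *bb, size_t n,
                         uint64_t sector, uint64_t nr)
{
    for (size_t i = 0; i < n; i++)
        if (sector < bb[i].start + bb[i].len && bb[i].start < sector + nr)
            return true;
    return false;
}

/* Only reads are failed; writes are passed through in this model. */
static int model_do_bvec(const struct model_bad_range *bb, size_t n,
                         bool is_write, uint64_t sector, uint64_t nr)
{
    if (!is_write && model_is_bad(bb, n, sector, nr))
        return -EIO;
    return 0;
}
```

A read that merely touches the edge of a bad range fails as a whole, which matches the conservative intent of refusing to return data from a known bad block.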

libnvdimm: convert to statically allocated badblocks · b95f5f43

Authored by Dan Williams on Jan 04, 2016

If a device will ever have badblocks it should always have a badblocks
instance available.  So, similar to md, embed a badblocks instance in
pmem_device.  This reduces pointer chasing in the i/o fast path, and
simplifies the init path.
Reported-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

b95f5f43

libnvdimm, pmem: move definition of nvdimm_namespace_add_poison to nd.h · ad9a8bde

Authored by Dan Williams on Jan 06, 2016

nd-core.h is private to the libnvdimm core internals and should not be
used by drivers.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

ad9a8bde

libnvdimm: Add a poison list and export badblocks · 0caeef63

Authored by Vishal Verma on Dec 24, 2015

During region creation, perform Address Range Scrubs (ARS) for the SPA
(System Physical Address) ranges to retrieve known poison locations from
firmware. Add a new data structure 'nd_poison' which is used as a list
in nvdimm_bus to store these poison locations.

When creating a pmem namespace, if there is any known poison associated
with its physical address space, convert the poison ranges to bad sectors
that are exposed using the badblocks interface.
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

0caeef63
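
Converting a poison range in system physical address space into bad sectors of a namespace amounts to clipping the range to the namespace and rounding outward to sector boundaries. The sketch below assumes 512-byte sectors and uses hypothetical names (`struct model_sectors`, `model_poison_to_sectors()`); it does not reproduce the kernel's list handling.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type: a run of bad sectors within a namespace. */
struct model_sectors {
    uint64_t start;
    uint64_t count;   /* 0 means the poison does not touch the namespace */
};

static struct model_sectors model_poison_to_sectors(uint64_t ns_start, uint64_t ns_size,
                                                    uint64_t p_start, uint64_t p_len)
{
    struct model_sectors s = { 0, 0 };
    uint64_t ns_end = ns_start + ns_size;
    uint64_t p_end = p_start + p_len;
    uint64_t lo, hi, first, last;

    if (p_len == 0 || p_end <= ns_start || p_start >= ns_end)
        return s;

    /* Clip the poison range to the namespace. */
    lo = p_start > ns_start ? p_start : ns_start;
    hi = p_end < ns_end ? p_end : ns_end;

    /* Round the start down and the end up to 512-byte sector boundaries. */
    first = (lo - ns_start) / 512;
    last = (hi - ns_start + 511) / 512;

    s.start = first;
    s.count = last - first;
    return s;
}
```

Rounding outward means a single poisoned byte marks its whole sector bad, which is the safe direction for an interface that works at sector granularity.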

25 Dec, 2015 1 commit

libnvdimm, pfn: fix nd_pfn_validate() return value handling · 3fa96268

Authored by Dan Williams on Dec 13, 2015

The -ENODEV case indicates that the info-block needs to be established.
All other return codes cause nd_pfn_init() to abort.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

3fa96268

14 Dec, 2015 1 commit

libnvdimm, pfn: add parent uuid validation · a34d5e8a

Authored by Dan Williams on Dec 12, 2015

Track and check the uuid of the namespace hosting a pfn instance.  This
forces the pfn info block to be invalidated if the namespace is
re-configured with a different uuid.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

a34d5e8a

13 Dec, 2015 1 commit

libnvdimm, pfn: add 'align' attribute, default to HPAGE_SIZE · 315c5625

Authored by Dan Williams on Dec 10, 2015

When setting aside capacity for struct page it must be aligned to the
largest mapping size that is to be made available via DAX.  Make the
alignment configurable to enable support for 1GiB page-size mappings.

The offset for PFN_MODE_RAM may now be larger than SZ_8K, so fixup the
offset check in nvdimm_namespace_attach_pfn().
Reported-by: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

315c5625
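
Why the data offset can exceed SZ_8K once alignment is configurable is easiest to see with numbers. The sketch below assumes the offset is computed by rounding the end of a reserved header area up to the chosen alignment; `model_align_up()` and `model_data_offset()` are hypothetical helpers, and the exact formula in the driver may differ.

```c
#include <assert.h>
#include <stdint.h>

/* Round x up to a power-of-two alignment a. */
static uint64_t model_align_up(uint64_t x, uint64_t a)
{
    assert(a != 0 && (a & (a - 1)) == 0);
    return (x + a - 1) & ~(a - 1);
}

/* Offset from the namespace start to the first usable, aligned byte,
 * given 'reserved' bytes of header at the start of the namespace. */
static uint64_t model_data_offset(uint64_t start, uint64_t reserved, uint64_t align)
{
    return model_align_up(start + reserved, align) - start;
}
```

With a namespace starting at 4 GiB and an 8 KiB reserved area, a 2 MiB alignment yields a 2 MiB offset and a 1 GiB alignment yields a 1 GiB offset, so a check that assumes the offset equals 8 KiB would reject valid configurations.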

11 Dec, 2015 1 commit

libnvdimm, pfn: kill ND_PFN_ALIGN · 9f1e8cee

Authored by Dan Williams on Dec 10, 2015

The alignment constraint isn't necessary now that devm_memremap_pages()
allows for unaligned mappings.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

9f1e8cee

13 Nov, 2015 1 commit

libnvdimm, pmem: fix size trim in pmem_direct_access() · 589e75d1

Authored by Dan Williams on Oct 24, 2015

This masking prevents access to the end of the device via dax_do_io(),
and is unnecessary as arch_add_memory() would have rejected an unaligned
allocation.

Cc: <stable@vger.kernel.org>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

589e75d1

08 Nov, 2015 1 commit

block: change ->make_request_fn() and users to return a queue cookie · dece1635

Authored by Jens Axboe on Nov 05, 2015

No functional changes in this patch, but it prepares us for returning
a more useful cookie related to the IO that was queued up.
Signed-off-by: Jens Axboe <axboe@fb.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Acked-by: Keith Busch <keith.busch@intel.com>

dece1635

10 Oct, 2015 3 commits

pmem, memremap: convert to numa aware allocations · 538ea4aa

Authored by Dan Williams on Oct 05, 2015

Given that pmem ranges come with numa-locality hints, arrange for the
resulting driver objects to be obtained from node-local memory.
Reviewed-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

538ea4aa

devm_memremap: convert to return ERR_PTR · b36f4761

Authored by Dan Williams on Sep 15, 2015

Make devm_memremap consistent with the error return scheme of
devm_memremap_pages to remove special casing in the pmem driver.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

b36f4761

pmem: kill memremap_pmem() · a639315d

Authored by Dan Williams on Sep 15, 2015

Now that the pmem-api is defined as "a set of apis that enables access
to WB mapped pmem",  the mapping type is implied.  Remove the wrapper
and push the functionality down into the pmem driver in preparation for
adding support for direct-mapped pmem.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

a639315d

17 Sep, 2015 1 commit

pmem: add proper fencing to pmem_rw_page() · ba8fe0f8

Authored by Ross Zwisler on Sep 16, 2015

pmem_rw_page() needs to call wmb_pmem() on writes to make sure that the
newly written data is durable.  This flow was added to pmem_rw_bytes()
and pmem_make_request() with this commit:

commit 61031952 ("arch, x86: pmem api for ensuring durability of
	persistent memory updates")

...the pmem_rw_page() path was missed.

Cc: <stable@vger.kernel.org>
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

ba8fe0f8

29 Aug, 2015 2 commits

libnvdimm, pmem: direct map legacy pmem by default · 004f1afb

Authored by Dan Williams on Aug 24, 2015

The expectation is that the legacy / non-standard pmem discovery method
(e820 type-12) will only ever be used to describe small quantities of
persistent memory.  Larger capacities will be described via the ACPI
NFIT.  When "allocate struct page from pmem" support is added this default
policy can be overridden by assigning a legacy pmem namespace to a pfn
device, however this would only be necessary if a platform used the
legacy mechanism to define a very large range.

Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

004f1afb

libnvdimm, pmem: 'struct page' for pmem · 32ab0a3f

Authored by Dan Williams on Aug 01, 2015

Enable the pmem driver to handle PFN device instances.  Attaching a pmem
namespace to a pfn device triggers the driver to allocate and initialize
struct page entries for pmem.  Memory capacity for this allocation comes
exclusively from RAM for now which is suitable for low PMEM to RAM
ratios.  This mechanism will be expanded later for setting an "allocate
from PMEM" policy.

Cc: Boaz Harrosh <boaz@plexistor.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

32ab0a3f

28 Aug, 2015 2 commits

x86, pmem: clarify that ARCH_HAS_PMEM_API implies PMEM mapped WB · 96601adb

Authored by Dan Williams on Aug 24, 2015

Given that a write-back (WB) mapping plus non-temporal stores is
expected to be the most efficient way to access PMEM, update the
definition of ARCH_HAS_PMEM_API to imply arch support for
WB-mapped-PMEM.  This is needed as a pre-requisite for adding PMEM to
the direct map and mapping it with struct page.

The above clarification for X86_64 means that memcpy_to_pmem() is
permitted to use the non-temporal arch_memcpy_to_pmem() rather than
needlessly fall back to default_memcpy_to_pmem() when the pcommit
instruction is not available.  When arch_memcpy_to_pmem() is not
guaranteed to flush writes out of cache, i.e. on older X86_32
implementations where non-temporal stores may just dirty cache,
ARCH_HAS_PMEM_API is simply disabled.

The default fall back for persistent memory handling remains.  Namely,
map it with the WT (write-through) cache-type and hope for the best.

arch_has_pmem_api() is updated to only indicate whether the arch
provides the proper helpers to meet the minimum "writes are visible
outside the cache hierarchy after memcpy_to_pmem() + wmb_pmem()".  Code
that cares whether wmb_pmem() actually flushes writes to pmem must now
call arch_has_wmb_pmem() directly.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
[hch: set ARCH_HAS_PMEM_API=n on x86_32]
Reviewed-by: Christoph Hellwig <hch@lst.de>
[toshi: x86_32 compile fixes]
Signed-off-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

96601adb

dax: drop size parameter to ->direct_access() · cb389b9c

Authored by Dan Williams on Aug 07, 2015

None of the implementations currently use it.  The common
bdev_direct_access() entry point handles all the size checks before
calling ->direct_access().
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

cb389b9c

21 Aug, 2015 1 commit

pmem, dax: have direct_access use __pmem annotation · e2e05394

Authored by Ross Zwisler on Aug 18, 2015

Update the annotation for the kaddr pointer returned by direct_access()
so that it is a __pmem pointer. This is consistent with the PMEM driver
and with how this direct_access() pointer is used in the DAX code.
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

e2e05394

15 Aug, 2015 1 commit

pmem: switch to devm_ allocations · 708ab62b

Authored by Christoph Hellwig on Aug 10, 2015

Signed-off-by: Christoph Hellwig <hch@lst.de>
[djbw: tools/testing/nvdimm/ and memunmap_pmem support]
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

708ab62b

29 Jul, 2015 1 commit

block: add a bi_error field to struct bio · 4246a0b6

Authored by Christoph Hellwig on Jul 20, 2015

Currently we have two different ways to signal an I/O error on a BIO:

 (1) by clearing the BIO_UPTODATE flag
 (2) by returning a Linux errno value to the bi_end_io callback

The first one has the drawback of only communicating a single possible
error (-EIO), and the second one has the drawback of not being persistent
when bios are queued up, and are not passed along from child to parent
bio in the ever more popular chaining scenario.  Having both mechanisms
available has the additional drawback of utterly confusing driver authors
and introducing bugs where various I/O submitters only deal with one of
them, and the others have to add boilerplate code to deal with both kinds
of error returns.

So add a new bi_error field to store an errno value directly in struct
bio and remove the existing mechanisms to clean all this up.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: NeilBrown <neilb@suse.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

4246a0b6
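
Storing the errno in the bio itself is what lets an error survive queuing and travel from child to parent in a chain. A toy model of that propagation follows; `struct model_bio`, its `remaining` counter, and `model_bio_endio()` are hypothetical simplifications and not the block layer's definitions.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, stripped-down bio: just the error field and chaining. */
struct model_bio {
    int bi_error;              /* 0 on success, negative errno on failure */
    struct model_bio *parent;  /* NULL for a top-level bio */
    int remaining;             /* children still in flight */
};

/* Complete a bio: hand its error to the parent (first error wins) and
 * complete the parent once its last child has finished. */
static void model_bio_endio(struct model_bio *bio)
{
    struct model_bio *parent = bio->parent;

    if (!parent)
        return;
    if (bio->bi_error && !parent->bi_error)
        parent->bi_error = bio->bi_error;

    assert(parent->remaining > 0);
    if (--parent->remaining == 0)
        model_bio_endio(parent);
}
```

Unlike a cleared "up to date" flag, the parent ends up with the specific errno a child reported, for example `-ENOSPC` instead of a generic `-EIO`.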

28 Jul, 2015 1 commit

libnvdimm, pmem: Change pmem physical sector size to PAGE_SIZE · 6b47496a

Authored by Vishal Verma on Jul 23, 2015

Based on a patch: c8fa3173 brd: Request from fdisk 4k alignment by Boaz
Harrosh, allow fdisk to create properly aligned partitions for DAX. This
will also cause mkfs.ext4 to emit a warning if using a file system block
size of less than PAGE_SIZE.

Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Elliott, Robert <Elliott@hp.com>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
Acked-by: Boaz Harrosh <boaz@plexistor.com>
Acked-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

6b47496a

26 Jun, 2015 7 commits

arch, x86: pmem api for ensuring durability of persistent memory updates · 61031952

Authored by Ross Zwisler on Jun 25, 2015

Based on an original patch by Ross Zwisler [1].

Writes to persistent memory have the potential to be posted to cpu
cache, cpu write buffers, and platform write buffers (memory controller)
before being committed to persistent media.  Provide apis,
memcpy_to_pmem(), wmb_pmem(), and memremap_pmem(), to write data to
pmem and assert that it is durable in PMEM (a persistent linear address
range).  A '__pmem' attribute is added so sparse can track proper usage
of pointers to pmem.

This continues the status quo of pmem being x86 only for 4.2, but
reworks to ioremap, and wider implementation of memremap() will enable
other archs in 4.3.

[1]: https://lists.01.org/pipermail/linux-nvdimm/2015-May/000932.html

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
[djbw: various reworks]
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

61031952

libnvdimm, nfit: handle unarmed dimms, mark namespaces read-only · 58138820

Authored by Dan Williams on Jun 23, 2015

Upon detection of an unarmed dimm in a region, arrange for descendant
BTT, PMEM, or BLK instances to be read-only. A dimm is primarily marked
"unarmed" via flags passed by platform firmware (NFIT).

The flags in the NFIT memory device sub-structure indicate the state of
the data on the nvdimm relative to its energy source or last "flush to
persistence". For the most part there is nothing the driver can do but
advertise the state of these flags in sysfs and emit a message if
firmware indicates that the contents of the device may be corrupted.
However, for the case of ACPI_NFIT_MEM_ARMED, the driver can arrange for
the block devices incorporating that nvdimm to be marked read-only.
This is a safe default as the data is still available and new writes are
held off until the administrator either forces read-write mode, or the
energy source becomes armed.

A 'read_only' attribute is added to REGION devices to allow for
overriding the default read-only policy of all descendant block devices.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

58138820

pmem: flag pmem block devices as non-rotational · 0f51c4fa

Authored by Dan Williams on May 16, 2015

...since they are effectively SSDs as far as userspace is concerned.
Reviewed-by: Vishal Verma <vishal.l.verma@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

0f51c4fa

libnvdimm: enable iostat · f0dc089c

Authored by Dan Williams on May 16, 2015

This is disabled by default as the overhead is prohibitive, but if the
user takes the action to turn it on we'll oblige.
Reviewed-by: Vishal Verma <vishal.l.verma@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

f0dc089c

pmem: make_request cleanups · edc870e5

Authored by Dan Williams on May 16, 2015

Various cleanups:

1/ Kill the BUG_ON since we've already told the block layer we don't
   support DISCARD on all these drivers.

2/ Kill the 'rw' variable, no need to cache it.

3/ Kill the local 'sector' variable.  bio_for_each_segment() is already
   advancing the iterator's sector number by the bio_vec length.

4/ Kill the check for accessing past the end of the device;
   generic_make_request_checks() already does that.
Suggested-by: Christoph Hellwig <hch@lst.de>
[hch: kill access past end of the device check]
Reviewed-by: Vishal Verma <vishal.l.verma@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

edc870e5

libnvdimm, pmem: fix up max_hw_sectors · 43d3fa3a

Authored by Dan Williams on May 16, 2015

There is no hardware limit to enforce on the size of the i/o that can be passed
to an nvdimm block device, so set it to UINT_MAX.
Reviewed-by: Vishal Verma <vishal.l.verma@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

43d3fa3a

nd_btt: atomic sector updates · 5212e11f

Authored by Vishal Verma on Jun 25, 2015

BTT stands for Block Translation Table, and is a way to provide power
fail sector atomicity semantics for block devices that have the ability
to perform byte granularity IO. It relies on the capability of libnvdimm
namespace devices to do byte aligned IO.

The BTT works as a stacked block device, and reserves a chunk of space
from the backing device for its accounting metadata. It is a bio-based
driver because all IO is done synchronously, and there is no queuing or
asynchronous completions at either the device or the driver level.

The BTT uses 'lanes' to index into various 'on-disk' data structures,
and lanes also act as a synchronization mechanism in case there are more
CPUs than available lanes. We did a comparison between two lane lock
strategies - first where we kept an atomic counter around that tracked
which was the last lane that was used, and 'our' lane was determined by
atomically incrementing that. That way, for the nr_cpus > nr_lanes case,
theoretically, no CPU would be blocked waiting for a lane. The other
strategy was to use the cpu number we're scheduled on and hash it to
a lane number. Theoretically, this could block an IO that could've
otherwise run using a different, free lane. But some fio workloads
showed that the direct cpu -> lane hash performed faster than tracking
'last lane' - my reasoning is the cache thrash caused by moving the
atomic variable made that approach slower than simply waiting out the
in-progress IO. This supports the conclusion that the driver can be a
very simple bio-based one that does synchronous IOs instead of queuing.

Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Boaz Harrosh <boaz@plexistor.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Neil Brown <neilb@suse.de>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
[jmoyer: fix nmi watchdog timeout in btt_map_init]
[jmoyer: move btt initialization to module load path]
[jmoyer: fix memory leak in the btt initialization path]
[jmoyer: Don't overwrite corrupted arenas]
Signed-off-by: Vishal Verma <vishal.l.verma@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>

5212e11f
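
The two lane-selection strategies compared above can be sketched side by side. Both helpers are hypothetical, the "hash" is modelled as a plain modulo (an assumption), and locking and atomics are omitted, so this illustrates only the mapping, not the synchronization.

```c
#include <assert.h>

/* Strategy that was kept: derive the lane directly from the CPU number.
 * Two CPUs that map to the same lane must wait for each other. */
static unsigned model_cpu_to_lane(unsigned cpu, unsigned nr_lanes)
{
    assert(nr_lanes > 0);
    return cpu % nr_lanes;
}

/* Strategy that was rejected: a shared counter hands out the next lane.
 * In the kernel this counter would be atomic and bounce between CPU
 * caches, which is the cost the message attributes the slowdown to. */
static unsigned model_next_lane(unsigned *counter, unsigned nr_lanes)
{
    assert(nr_lanes > 0);
    return (*counter)++ % nr_lanes;
}
```

With four lanes, CPUs 1 and 5 share lane 1 under the first strategy, while the second strategy spreads consecutive requests over lanes 0, 1, 2, 3 and then wraps, at the price of every CPU writing the same counter.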

openeuler / Kernel synced successfully more than 1 year ago