1. 02 2月, 2018 1 次提交
  2. 29 9月, 2017 1 次提交
  3. 05 8月, 2017 1 次提交
  4. 30 6月, 2017 2 次提交
  5. 28 6月, 2017 3 次提交
    • D
      libnvdimm, nfit: enable support for volatile ranges · c9e582aa
      Dan Williams 提交于
      Allow volatile nfit ranges to participate in all the same infrastructure
      provided for persistent memory regions. A resulting resulting namespace
      device will still be called "pmem", but the parent region type will be
      "nd_volatile". This is in preparation for disabling the dax ->flush()
      operation in the pmem driver when it is hosted on a volatile range.
      
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      c9e582aa
    • D
      libnvdimm, pmem: fix persistence warning · c00b396e
      Dan Williams 提交于
      The pmem driver assumes if platform firmware describes the memory
      devices associated with a persistent memory range and
      CONFIG_ARCH_HAS_PMEM_API=y that it has all the mechanism necessary to
      flush data to a power-fail safe zone. We warn if the firmware does not
      describe memory devices, but we also need to warn if the architecture
      does not claim pmem support.
      
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      c00b396e
    • D
      x86, libnvdimm, pmem: remove global pmem api · ca6a4657
      Dan Williams 提交于
      Now that all callers of the pmem api have been converted to dax helpers that
      call back to the pmem driver, we can remove include/linux/pmem.h and
      asm/pmem.h.
      
      Cc: <x86@kernel.org>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Oliver O'Halloran <oohall@gmail.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      ca6a4657
  6. 16 6月, 2017 1 次提交
  7. 10 6月, 2017 1 次提交
    • D
      x86, uaccess: introduce copy_from_iter_flushcache for pmem / cache-bypass operations · 0aed55af
      Dan Williams 提交于
      The pmem driver has a need to transfer data with a persistent memory
      destination and be able to rely on the fact that the destination writes are not
      cached. It is sufficient for the writes to be flushed to a cpu-store-buffer
      (non-temporal / "movnt" in x86 terms), as we expect userspace to call fsync()
      to ensure data-writes have reached a power-fail-safe zone in the platform. The
      fsync() triggers a REQ_FUA or REQ_FLUSH to the pmem driver which will turn
      around and fence previous writes with an "sfence".
      
      Implement a __copy_from_user_inatomic_flushcache, memcpy_page_flushcache, and
      memcpy_flushcache, that guarantee that the destination buffer is not dirty in
      the cpu cache on completion. The new copy_from_iter_flushcache and sub-routines
      will be used to replace the "pmem api" (include/linux/pmem.h +
      arch/x86/include/asm/pmem.h). The availability of copy_from_iter_flushcache()
      and memcpy_flushcache() are gated by the CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE
      config symbol, and fallback to copy_from_iter_nocache() and plain memcpy()
      otherwise.
      
      This is meant to satisfy the concern from Linus that if a driver wants to do
      something beyond the normal nocache semantics it should be something private to
      that driver [1], and Al's concern that anything uaccess related belongs with
      the rest of the uaccess code [2].
      
      The first consumer of this interface is a new 'copy_from_iter' dax operation so
      that pmem can inject cache maintenance operations without imposing this
      overhead on other dax-capable drivers.
      
      [1]: https://lists.01.org/pipermail/linux-nvdimm/2017-January/008364.html
      [2]: https://lists.01.org/pipermail/linux-nvdimm/2017-April/009942.html
      
      Cc: <x86@kernel.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Reviewed-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      0aed55af
  8. 05 5月, 2017 1 次提交
  9. 30 4月, 2017 1 次提交
    • D
      libnvdimm: rework region badblocks clearing · 23f49844
      Dan Williams 提交于
      Toshi noticed that the new support for a region-level badblocks missed
      the case where errors are cleared due to BTT I/O.
      
      An initial attempt to fix this ran into a "sleeping while atomic"
      warning due to taking the nvdimm_bus_lock() in the BTT I/O path to
      satisfy the locking requirements of __nvdimm_bus_badblocks_clear().
      However, that lock is not needed since we are not acting on any data that
      is subject to change under that lock. The badblocks instance has its own
      internal lock to handle mutations of the error list.
      
      So, in order to make it clear that we are just acting on region devices,
      rename __nvdimm_bus_badblocks_clear() to nvdimm_clear_badblocks_regions().
      Eliminate the lock and consolidate all support routines for the new
      nvdimm_account_cleared_poison() in drivers/nvdimm/bus.c. Finally, to the
      opportunity to cleanup to some unnecessary casts, make the calling
      convention of nvdimm_clear_badblocks_regions() clearer by replacing struct
      resource with the minimal struct clear_badblocks_context, and use the
      DEVICE_ATTR macro.
      
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Reported-by: NToshi Kani <toshi.kani@hpe.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      23f49844
  10. 29 4月, 2017 1 次提交
    • D
      libnvdimm, region: sysfs trigger for nvdimm_flush() · ab630891
      Dan Williams 提交于
      The nvdimm_flush() mechanism helps to reduce the impact of an ADR
      (asynchronous-dimm-refresh) failure. The ADR mechanism handles flushing
      platform WPQ (write-pending-queue) buffers when power is removed. The
      nvdimm_flush() mechanism performs that same function on-demand.
      
      When a pmem namespace is associated with a block device, an
      nvdimm_flush() is triggered with every block-layer REQ_FUA, or REQ_FLUSH
      request. These requests are typically associated with filesystem
      metadata updates. However, when a namespace is in device-dax mode,
      userspace (think database metadata) needs another path to perform the
      same flushing. In other words this is not required to make data
      persistent, but in the case of metadata it allows for a smaller failure
      domain in the unlikely event of an ADR failure.
      
      The new 'deep_flush' attribute is visible when the individual DIMMs
      backing a given interleave-set are described by platform firmware. In
      ACPI terms this is "NVDIMM Region Mapping Structures" and associated
      "Flush Hint Address Structures". Reads return "1" if the region supports
      triggering WPQ flushes on all DIMMs. Reads return "0" the flush
      operation is a platform nop, and in that case the attribute is
      read-only.
      
      Why sysfs and not an ioctl? An ioctl requires establishing a new
      ioctl function number space for device-dax. Given that this would be
      called on a device-dax fd an application could be forgiven for
      accidentally calling this on a filesystem-dax fd. Placing this interface
      in libnvdimm sysfs removes that potential for collision with a
      filesystem ioctl, and it keeps ioctls out of the generic device-dax
      implementation.
      
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      ab630891
  11. 25 4月, 2017 1 次提交
    • D
      libnvdimm, region: fix flush hint detection crash · bc042fdf
      Dan Williams 提交于
      In the case where a dimm does not have any associated flush hints the
      ndrd->flush_wpq array may be uninitialized leading to crashes with the
      following signature:
      
       BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
       IP: region_visible+0x10f/0x160 [libnvdimm]
      
       Call Trace:
        internal_create_group+0xbe/0x2f0
        sysfs_create_groups+0x40/0x80
        device_add+0x2d8/0x650
        nd_async_device_register+0x12/0x40 [libnvdimm]
        async_run_entry_fn+0x39/0x170
        process_one_work+0x212/0x6c0
        ? process_one_work+0x197/0x6c0
        worker_thread+0x4e/0x4a0
        kthread+0x10c/0x140
        ? process_one_work+0x6c0/0x6c0
        ? kthread_create_on_node+0x60/0x60
        ret_from_fork+0x31/0x40
      
      Cc: <stable@vger.kernel.org>
      Reviewed-by: NJeff Moyer <jmoyer@redhat.com>
      Fixes: f284a4f2 ("libnvdimm: introduce nvdimm_flush() and nvdimm_has_flush()")
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      bc042fdf
  12. 13 4月, 2017 2 次提交
  13. 01 3月, 2017 1 次提交
    • D
      nfit, libnvdimm: fix interleave set cookie calculation · 86ef58a4
      Dan Williams 提交于
      The interleave-set cookie is a sum that sanity checks the composition of
      an interleave set has not changed from when the namespace was initially
      created.  The checksum is calculated by sorting the DIMMs by their
      location in the interleave-set. The comparison for the sort must be
      64-bit wide, not byte-by-byte as performed by memcmp() in the broken
      case.
      
      Fix the implementation to accept correct cookie values in addition to
      the Linux "memcmp" order cookies, but only allow correct cookies to be
      generated going forward. It does mean that namespaces created by
      third-party-tooling, or created by newer kernels with this fix, will not
      validate on older kernels. However, there are a couple mitigating
      conditions:
      
          1/ platforms with namespace-label capable NVDIMMs are not widely
             available.
      
          2/ interleave-sets with a single-dimm are by definition not affected
             (nothing to sort). This covers the QEMU-KVM NVDIMM emulation case.
      
      The cookie stored in the namespace label will be fixed by any write the
      namespace label, the most straightforward way to achieve this is to
      write to the "alt_name" attribute of a namespace in sysfs.
      
      Cc: <stable@vger.kernel.org>
      Fixes: eaf96153 ("libnvdimm, nfit: add interleave-set state-tracking infrastructure")
      Reported-by: NNicholas Moulin <nicholas.w.moulin@linux.intel.com>
      Tested-by: NNicholas Moulin <nicholas.w.moulin@linux.intel.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      86ef58a4
  14. 16 12月, 2016 1 次提交
  15. 08 10月, 2016 2 次提交
    • D
      libnvdimm, namespace: allow creation of multiple pmem-namespaces per region · 98a29c39
      Dan Williams 提交于
      Similar to BLK regions, publish new seed namespace devices to allow
      unused PMEM region capacity to be consumed by additional namespaces.
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      98a29c39
    • D
      libnvdimm, region: update nd_region_available_dpa() for multi-pmem support · a1f3e4d6
      Dan Williams 提交于
      The free dpa (dimm-physical-address) space calculation reports how much
      free space is available with consideration for aliased BLK + PMEM
      regions.  Recall that BLK capacity is allocated from high addresses and
      PMEM is allocated from low addresses in their respective regions.
      
      nd_region_available_dpa() accounts for the fact that the largest
      encroachment (lowest starting address) into PMEM capacity by a BLK
      allocation limits the available capacity to that point, regardless if
      there is BLK allocation hole at a higher address.  Similarly, for the
      multi-pmem case we need to track the largest encroachment (highest
       ending address) of a PMEM allocation in BLK capacity regardless of
      whether there is an allocation hole that a BLK allocation could fill at
      a lower address.
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      a1f3e4d6
  16. 01 10月, 2016 3 次提交
  17. 25 9月, 2016 1 次提交
  18. 19 9月, 2016 1 次提交
  19. 12 7月, 2016 4 次提交
    • D
      libnvdimm: cycle flush hints · 0c27af60
      Dan Williams 提交于
      When the NFIT provides multiple flush hint addresses per-dimm it is
      expressing that the platform is capable of processing multiple flush
      requests in parallel.  There is some fixed cost per flush request, let
      the cost be shared in parallel on multiple cpus.
      
      Since there may not be enough flush hint addresses for each cpu to have
      one, keep a per-cpu index of the last used hint, hash it with current
      pid, and assume that access pattern and scheduler randomness will keep
      the flush-hint usage somewhat staggered across cpus.
      
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      0c27af60
    • D
      libnvdimm: introduce nvdimm_flush() and nvdimm_has_flush() · f284a4f2
      Dan Williams 提交于
      nvdimm_flush() is a replacement for the x86 'pcommit' instruction.  It is
      an optional write flushing mechanism that an nvdimm bus can provide for
      the pmem driver to consume.  In the case of the NFIT nvdimm-bus-provider
      nvdimm_flush() is implemented as a series of flush-hint-address [1]
      writes to each dimm in the interleave set (region) that backs the
      namespace.
      
      The nvdimm_has_flush() routine relies on platform firmware to describe
      the flushing capabilities of a platform.  It uses the heuristic of
      whether an nvdimm bus provider provides flush address data to return a
      ternary result:
      
            1: flush addresses defined
            0: dimm topology described without flush addresses (assume ADR)
       -errno: no topology information, unable to determine flush mechanism
      
      The pmem driver is expected to take the following actions on this ternary
      result:
      
            1: nvdimm_flush() in response to REQ_FUA / REQ_FLUSH and shutdown
            0: do not set, WC or FUA on the queue, take no further action
       -errno: warn and then operate as if nvdimm_has_flush() returned '0'
      
      The caveat of this heuristic is that it can not distinguish the "dimm
      does not have flush address" case from the "platform firmware is broken
      and failed to describe a flush address".  Given we are already
      explicitly trusting the NFIT there's not much more we can do beyond
      blacklisting broken firmwares if they are ever encountered.
      
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      f284a4f2
    • D
      libnvdimm, nfit: move flush hint mapping to region-device driver-data · e5ae3b25
      Dan Williams 提交于
      In preparation for triggering flushes of a DIMM's writes-posted-queue
      (WPQ) via the pmem driver move mapping of flush hint addresses to the
      region driver.  Since this uses devm_nvdimm_memremap() the flush
      addresses will remain mapped while any region to which the dimm belongs
      is active.
      
      We need to communicate more information to the nvdimm core to facilitate
      this mapping, namely each dimm object now carries an array of flush hint
      address resources.
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      e5ae3b25
    • D
      libnvdimm, nfit: remove nfit_spa_map() infrastructure · a8a6d2e0
      Dan Williams 提交于
      Now that all shared mappings are handled by devm_nvdimm_memremap() we no
      longer need nfit_spa_map() nor do we need to trigger a callback to the
      bus provider at region disable time.
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      a8a6d2e0
  20. 21 5月, 2016 1 次提交
  21. 10 5月, 2016 1 次提交
    • D
      libnvdimm, dax: introduce device-dax infrastructure · cd03412a
      Dan Williams 提交于
      Device DAX is the device-centric analogue of Filesystem DAX
      (CONFIG_FS_DAX).  It allows persistent memory ranges to be allocated and
      mapped without need of an intervening file system.  This initial
      infrastructure arranges for a libnvdimm pfn-device to be represented as
      a different device-type so that it can be attached to a driver other
      than the pmem driver.
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      cd03412a
  22. 06 1月, 2016 1 次提交
    • D
      libnvdimm: fix namespace object confusion in is_uuid_busy() · e07ecd76
      Dan Williams 提交于
      When btt devices were re-worked to be child devices of regions this
      routine was overlooked.  It mistakenly attempts to_nd_namespace_pmem()
      or to_nd_namespace_blk() conversions on btt and pfn devices.  By luck to
      date we have happened to be hitting valid memory leading to a uuid
      miscompare, but a recent change to struct nd_namespace_common causes:
      
       BUG: unable to handle kernel NULL pointer dereference at 0000000000000001
       IP: [<ffffffff814610dc>] memcmp+0xc/0x40
       [..]
       Call Trace:
        [<ffffffffa0028631>] is_uuid_busy+0xc1/0x2a0 [libnvdimm]
        [<ffffffffa0028570>] ? to_nd_blk_region+0x50/0x50 [libnvdimm]
        [<ffffffff8158c9c0>] device_for_each_child+0x50/0x90
      
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      e07ecd76
  23. 14 12月, 2015 1 次提交
  24. 09 12月, 2015 1 次提交
  25. 29 8月, 2015 2 次提交
    • D
      libnvdimm, pmem: direct map legacy pmem by default · 004f1afb
      Dan Williams 提交于
      The expectation is that the legacy / non-standard pmem discovery method
      (e820 type-12) will only ever be used to describe small quantities of
      persistent memory.  Larger capacities will be described via the ACPI
      NFIT.  When "allocate struct page from pmem" support is added this default
      policy can be overridden by assigning a legacy pmem namespace to a pfn
      device, however this would be only be necessary if a platform used the
      legacy mechanism to define a very large range.
      
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      004f1afb
    • D
      libnvdimm, pfn: 'struct page' provider infrastructure · e1455744
      Dan Williams 提交于
      Implement the base infrastructure for libnvdimm PFN devices. Similar to
      BTT devices they take a namespace as a backing device and layer
      functionality on top. In this case the functionality is reserving space
      for an array of 'struct page' entries to be handed out through
      pfn_to_page(). For now this is just the basic libnvdimm-device-model for
      configuring the base PFN device.
      
      As the namespace claiming mechanism for PFN devices is mostly identical
      to BTT devices drivers/nvdimm/claim.c is created to house the common
      bits.
      
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      e1455744
  26. 26 7月, 2015 1 次提交
    • D
      libnvdimm: fix namespace seed creation · 8ca24353
      Dan Williams 提交于
      A new BLK namespace "seed" device is created whenever the current seed
      is successfully probed.  However, if that namespace is assigned to a BTT
      it may never directly experience a successful probe as it is a
      subordinate device to a BTT configuration.
      
      The effect of the current code is that no new namespaces can be
      instantiated, after the seed namespace, to consume available BLK DPA
      capacity.  Fix this by treating a successful BTT probe event as a
      successful probe event for the backing namespace.
      Reported-by: NNicholas Moulin <nicholas.w.moulin@linux.intel.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      8ca24353
  27. 26 6月, 2015 3 次提交
    • T
      libnvdimm: Set numa_node to NVDIMM devices · 41d7a6d6
      Toshi Kani 提交于
      ACPI NFIT table has System Physical Address Range Structure entries that
      describe a proximity ID of each range when ACPI_NFIT_PROXIMITY_VALID is
      set in the flags.
      
      Change acpi_nfit_register_region() to map a proximity ID to its node ID,
      and set it to a new numa_node field of nd_region_desc, which is then
      conveyed to the nd_region device.
      
      The device core arranges for btt and namespace devices to inherit their
      node from their parent region.
      Signed-off-by: NToshi Kani <toshi.kani@hp.com>
      [djbw: move set_dev_node() from region.c to bus.c]
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      41d7a6d6
    • D
      libnvdimm, nfit: handle unarmed dimms, mark namespaces read-only · 58138820
      Dan Williams 提交于
      Upon detection of an unarmed dimm in a region, arrange for descendant
      BTT, PMEM, or BLK instances to be read-only.  A dimm is primarily marked
      "unarmed" via flags passed by platform firmware (NFIT).
      
      The flags in the NFIT memory device sub-structure indicate the state of
      the data on the nvdimm relative to its energy source or last "flush to
      persistence".  For the most part there is nothing the driver can do but
      advertise the state of these flags in sysfs and emit a message if
      firmware indicates that the contents of the device may be corrupted.
      However, for the case of ACPI_NFIT_MEM_ARMED, the driver can arrange for
      the block devices incorporating that nvdimm to be marked read-only.
      This is a safe default as the data is still available and new writes are
      held off until the administrator either forces read-write mode, or the
      energy source becomes armed.
      
      A 'read_only' attribute is added to REGION devices to allow for
      overriding the default read-only policy of all descendant block devices.
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      58138820
    • R
      libnvdimm, nfit, nd_blk: driver for BLK-mode access persistent memory · 047fc8a1
      Ross Zwisler 提交于
      The libnvdimm implementation handles allocating dimm address space (DPA)
      between PMEM and BLK mode interfaces.  After DPA has been allocated from
      a BLK-region to a BLK-namespace the nd_blk driver attaches to handle I/O
      as a struct bio based block device. Unlike PMEM, BLK is required to
      handle platform specific details like mmio register formats and memory
      controller interleave.  For this reason the libnvdimm generic nd_blk
      driver calls back into the bus provider to carry out the I/O.
      
      This initial implementation handles the BLK interface defined by the
      ACPI 6 NFIT [1] and the NVDIMM DSM Interface Example [2] composed from
      DCR (dimm control region), BDW (block data window), IDT (interleave
      descriptor) NFIT structures and the hardware register format.
      [1]: http://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf
      [2]: http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
      
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Boaz Harrosh <boaz@plexistor.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Acked-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      047fc8a1