1. 13 Apr, 2017 1 commit
  2. 01 Mar, 2017 1 commit
    • nfit, libnvdimm: fix interleave set cookie calculation · 86ef58a4
      Dan Williams authored
      The interleave-set cookie is a sum that sanity checks that the composition
      of an interleave set has not changed from when the namespace was initially
      created.  The checksum is calculated by sorting the DIMMs by their
      location in the interleave-set. The comparison for the sort must be
      64-bit wide, not byte-by-byte as performed by memcmp() in the broken
      case.
      
      Fix the implementation to accept correct cookie values in addition to
      the Linux "memcmp" order cookies, but only allow correct cookies to be
      generated going forward. It does mean that namespaces created by
      third-party-tooling, or created by newer kernels with this fix, will not
      validate on older kernels. However, there are a couple of mitigating
      conditions:
      
          1/ platforms with namespace-label capable NVDIMMs are not widely
             available.
      
          2/ interleave-sets with a single-dimm are by definition not affected
             (nothing to sort). This covers the QEMU-KVM NVDIMM emulation case.
      
      The cookie stored in the namespace label will be fixed by any write to the
      namespace label; the most straightforward way to achieve this is to
      write to the "alt_name" attribute of a namespace in sysfs.
      
      Cc: <stable@vger.kernel.org>
      Fixes: eaf96153 ("libnvdimm, nfit: add interleave-set state-tracking infrastructure")
      Reported-by: Nicholas Moulin <nicholas.w.moulin@linux.intel.com>
      Tested-by: Nicholas Moulin <nicholas.w.moulin@linux.intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
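
The ordering problem described in the commit above can be illustrated with a small user-space sketch. This is not the kernel code: `struct set_entry`, `cmp_bytes()`, `cmp_u64()` and `sort_entries()` are hypothetical names chosen for illustration. On a little-endian host the byte-wise comparator can order two offsets differently from the numeric comparator, which is the kind of mismatch the fix addresses.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for one DIMM's position in an interleave set. */
struct set_entry {
	uint64_t region_offset;
	uint32_t serial_number;
};

/* Byte-by-byte ordering: the result depends on the host byte order. */
static int cmp_bytes(const void *a, const void *b)
{
	const struct set_entry *x = a, *y = b;

	return memcmp(&x->region_offset, &y->region_offset,
		      sizeof(x->region_offset));
}

/* Numeric ordering: compare the offsets as 64-bit integers. */
static int cmp_u64(const void *a, const void *b)
{
	const struct set_entry *x = a, *y = b;

	if (x->region_offset < y->region_offset)
		return -1;
	if (x->region_offset > y->region_offset)
		return 1;
	return 0;
}

/* Sort the entries with either comparator before summing a checksum. */
static void sort_entries(struct set_entry *e, size_t n, int use_u64)
{
	assert(e != NULL || n == 0);
	qsort(e, n, sizeof(*e), use_u64 ? cmp_u64 : cmp_bytes);
}
```

For example, the offsets 0x100 and 0x01 sort as 0x01, 0x100 numerically, while a little-endian `memcmp()` sees the leading byte 0x00 of 0x100 first and places it ahead of 0x01.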
  3. 19 Oct, 2016 2 commits
    • libnvdimm: allow a platform to force enable label support · 42237e39
      Dan Williams authored
      Platforms like QEMU-KVM implement an NFIT table and label DSMs.
      However, since that environment does not define an aliased
      configuration, the labels are currently ignored and the kernel registers
      a single full-sized pmem-namespace per region. Now that the kernel
      supports sub-divisions of pmem regions the labels have a purpose.
      Arrange for the labels to be honored when we find an existing / valid
      namespace index block.
      
      Cc: <qemu-devel@nongnu.org>
      Cc: Haozhong Zhang <haozhong.zhang@intel.com>
      Cc: Xiao Guangrong <guangrong.xiao@linux.intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • libnvdimm: use generic iostat interfaces · 8d7c22ac
      Toshi Kani authored
      nd_iostat_start() and nd_iostat_end() implement the same functionality
      that generic_start_io_acct() and generic_end_io_acct() already provide.
      
      Change nd_iostat_start() and nd_iostat_end() to call the generic iostat
      interfaces.  There is no change in the nd interfaces.
      Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
  4. 01 Oct, 2016 2 commits
  5. 25 Sep, 2016 1 commit
  6. 02 Sep, 2016 1 commit
  7. 09 Aug, 2016 1 commit
  8. 12 Jul, 2016 3 commits
    • libnvdimm: cycle flush hints · 0c27af60
      Dan Williams authored
      When the NFIT provides multiple flush hint addresses per-dimm it is
      expressing that the platform is capable of processing multiple flush
      requests in parallel.  There is some fixed cost per flush request; let
      the cost be shared in parallel on multiple cpus.
      
      Since there may not be enough flush hint addresses for each cpu to have
      one, keep a per-cpu index of the last used hint, hash it with current
      pid, and assume that access pattern and scheduler randomness will keep
      the flush-hint usage somewhat staggered across cpus.
      
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
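
The hint selection described in the commit above might be sketched as follows. This is a user-space approximation under stated assumptions: `hash32()` is a generic multiplicative hash standing in for a kernel hashing helper, `struct cpu_flush_state` models the per-cpu index as a plain struct, and the number of hints is assumed to be a power of two expressed as `hint_shift`. None of these names come from the commit.

```c
#include <assert.h>
#include <stdint.h>

/* Multiplicative hash that keeps the top `bits` bits (assumed helper). */
static uint32_t hash32(uint32_t val, unsigned int bits)
{
	assert(bits >= 1 && bits <= 32);
	return (uint32_t)(val * 0x61C88647u) >> (32 - bits);
}

/* Stand-in for the per-cpu "last used hint" index. */
struct cpu_flush_state {
	uint32_t last_idx;
};

/*
 * Advance this cpu's index by a pid-dependent step, then fold it down
 * to one of (1 << hint_shift) flush hint slots.
 */
static unsigned int next_flush_hint(struct cpu_flush_state *st, uint32_t pid,
				    unsigned int hint_shift)
{
	st->last_idx += hash32(pid + st->last_idx, 8);
	if (hint_shift == 0)
		return 0; /* only one hint available */
	return hash32(st->last_idx, hint_shift);
}
```

Because the step depends on both the previous index and the pid, two cpus that start from the same state but run different tasks tend to drift onto different hints, which is the staggering the commit message relies on.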
    • libnvdimm, nfit: move flush hint mapping to region-device driver-data · e5ae3b25
      Dan Williams authored
      In preparation for triggering flushes of a DIMM's writes-posted-queue
      (WPQ) via the pmem driver move mapping of flush hint addresses to the
      region driver.  Since this uses devm_nvdimm_memremap() the flush
      addresses will remain mapped while any region to which the dimm belongs
      is active.
      
      We need to communicate more information to the nvdimm core to facilitate
      this mapping, namely each dimm object now carries an array of flush hint
      address resources.
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • libnvdimm, nfit: remove nfit_spa_map() infrastructure · a8a6d2e0
      Dan Williams authored
      Now that all shared mappings are handled by devm_nvdimm_memremap() we no
      longer need nfit_spa_map() nor do we need to trigger a callback to the
      bus provider at region disable time.
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
  9. 21 May, 2016 1 commit
  10. 10 May, 2016 1 commit
    • libnvdimm, dax: introduce device-dax infrastructure · cd03412a
      Dan Williams authored
      Device DAX is the device-centric analogue of Filesystem DAX
      (CONFIG_FS_DAX).  It allows persistent memory ranges to be allocated and
      mapped without need of an intervening file system.  This initial
      infrastructure arranges for a libnvdimm pfn-device to be represented as
      a different device-type so that it can be attached to a driver other
      than the pmem driver.
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
  11. 23 Apr, 2016 5 commits
  12. 08 Apr, 2016 1 commit
    • libnvdimm, pfn: fix nvdimm_namespace_add_poison() vs section alignment · a3901802
      Dan Williams authored
      When section alignment padding is in effect we need to shift / truncate
      the range that is queried for poison by the 'start_pad' or 'end_trunc'
      reservations.
      
      It's easiest if we just pass in an adjusted resource range rather than
      deriving it from the passed in namespace.  With the resource range
      resolution pushed out to the caller we can also push the
      namespace-to-region lookup to the caller and drop the implicit pmem-type
      assumption about the passed in namespace object.
      
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
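
The range adjustment described in the commit above amounts to shrinking the namespace resource from both ends before querying for poison. A minimal sketch, assuming an inclusive `[start, end]` range and using hypothetical names (`struct res_range`, `poison_query_range()`) that do not appear in the commit:

```c
#include <assert.h>
#include <stdint.h>

/* Inclusive physical address range, loosely modelled on a resource. */
struct res_range {
	uint64_t start;
	uint64_t end;
};

/*
 * Shift the start by 'start_pad' and pull the end in by 'end_trunc',
 * so only the usable part of the namespace is checked for poison.
 */
static struct res_range poison_query_range(struct res_range ns,
					   uint64_t start_pad,
					   uint64_t end_trunc)
{
	struct res_range r = ns;

	/* The reservations must leave a non-empty range behind. */
	assert(start_pad + end_trunc < ns.end - ns.start + 1);
	r.start += start_pad;
	r.end -= end_trunc;
	return r;
}
```

The caller computes the adjusted range and passes it in, which matches the commit's point that range resolution is pushed out to the caller.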
  13. 10 Mar, 2016 1 commit
  14. 06 Mar, 2016 1 commit
  15. 10 Jan, 2016 3 commits
  16. 13 Dec, 2015 1 commit
  17. 11 Dec, 2015 1 commit
  18. 29 Aug, 2015 3 commits
    • libnvdimm, pmem: direct map legacy pmem by default · 004f1afb
      Dan Williams authored
      The expectation is that the legacy / non-standard pmem discovery method
      (e820 type-12) will only ever be used to describe small quantities of
      persistent memory.  Larger capacities will be described via the ACPI
      NFIT.  When "allocate struct page from pmem" support is added this default
      policy can be overridden by assigning a legacy pmem namespace to a pfn
      device; however, this would only be necessary if a platform used the
      legacy mechanism to define a very large range.
      
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • libnvdimm, pmem: 'struct page' for pmem · 32ab0a3f
      Dan Williams authored
      Enable the pmem driver to handle PFN device instances.  Attaching a pmem
      namespace to a pfn device triggers the driver to allocate and initialize
      struct page entries for pmem.  Memory capacity for this allocation comes
      exclusively from RAM for now which is suitable for low PMEM to RAM
      ratios.  This mechanism will be expanded later for setting an "allocate
      from PMEM" policy.
      
      Cc: Boaz Harrosh <boaz@plexistor.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • libnvdimm, pfn: 'struct page' provider infrastructure · e1455744
      Dan Williams authored
      Implement the base infrastructure for libnvdimm PFN devices. Similar to
      BTT devices they take a namespace as a backing device and layer
      functionality on top. In this case the functionality is reserving space
      for an array of 'struct page' entries to be handed out through
      pfn_to_page(). For now this is just the basic libnvdimm-device-model for
      configuring the base PFN device.
      
      As the namespace claiming mechanism for PFN devices is mostly identical
      to BTT devices drivers/nvdimm/claim.c is created to house the common
      bits.
      
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
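
To give a feel for how much space a 'struct page' array needs, here is a rough sizing sketch. The function name `page_array_bytes()` and all parameters are illustrative assumptions; in particular, the 64-byte entry size and 4096-byte page size used in the example are typical values, not figures taken from the commit.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes needed for one metadata entry per page of a namespace,
 * rounded up to 'align'. All sizes are caller-supplied assumptions.
 */
static uint64_t page_array_bytes(uint64_t ns_bytes, uint64_t page_size,
				 uint64_t entry_size, uint64_t align)
{
	uint64_t npfns, raw;

	assert(page_size > 0 && align > 0);
	npfns = ns_bytes / page_size;	/* number of page frames */
	raw = npfns * entry_size;	/* unaligned array size */
	return (raw + align - 1) / align * align;
}
```

With 4096-byte pages and 64-byte entries the array costs about 1/64 of the namespace capacity, so a 1 GiB namespace needs 16 MiB.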
  19. 15 Aug, 2015 1 commit
    • libnvdimm, btt: write and validate parent_uuid · 6ec68954
      Vishal Verma authored
      When a BTT is instantiated on a namespace it must validate that the
      namespace uuid matches the 'parent_uuid' stored in the btt superblock. This
      property enforces that changing the namespace UUID invalidates all
      former BTT instances on that storage. For "IO namespaces" that don't
      have a label or UUID, the parent_uuid is set to zero, and this
      validation is skipped. For such cases, old BTTs have to be invalidated
      by forcing the namespace to raw mode, and overwriting the BTT info
      blocks.
      
      Based on a patch by Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
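
The validation rule in the commit above can be sketched as a small predicate. This is an illustration only: `btt_parent_uuid_ok()` is a hypothetical name, and a NULL `ns_uuid` is used here as an assumed way to represent an "IO namespace" that has no label or UUID.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define UUID_LEN 16

/*
 * Return true when a BTT superblock may be used on this namespace.
 * ns_uuid == NULL models a namespace without a UUID, for which the
 * check is skipped.
 */
static bool btt_parent_uuid_ok(const uint8_t sb_parent_uuid[UUID_LEN],
			       const uint8_t *ns_uuid)
{
	assert(sb_parent_uuid != NULL);
	if (ns_uuid == NULL)
		return true; /* nothing to validate against */
	return memcmp(sb_parent_uuid, ns_uuid, UUID_LEN) == 0;
}
```

Changing the namespace UUID makes the comparison fail, so any BTT written under the old UUID is treated as invalid.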
  20. 01 Aug, 2015 1 commit
  21. 26 Jun, 2015 7 commits
    • libnvdimm: Set numa_node to NVDIMM devices · 41d7a6d6
      Toshi Kani authored
      ACPI NFIT table has System Physical Address Range Structure entries that
      describe a proximity ID of each range when ACPI_NFIT_PROXIMITY_VALID is
      set in the flags.
      
      Change acpi_nfit_register_region() to map a proximity ID to its node ID,
      and set it to a new numa_node field of nd_region_desc, which is then
      conveyed to the nd_region device.
      
      The device core arranges for btt and namespace devices to inherit their
      node from their parent region.
      Signed-off-by: Toshi Kani <toshi.kani@hp.com>
      [djbw: move set_dev_node() from region.c to bus.c]
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • libnvdimm, nfit: handle unarmed dimms, mark namespaces read-only · 58138820
      Dan Williams authored
      Upon detection of an unarmed dimm in a region, arrange for descendant
      BTT, PMEM, or BLK instances to be read-only.  A dimm is primarily marked
      "unarmed" via flags passed by platform firmware (NFIT).
      
      The flags in the NFIT memory device sub-structure indicate the state of
      the data on the nvdimm relative to its energy source or last "flush to
      persistence".  For the most part there is nothing the driver can do but
      advertise the state of these flags in sysfs and emit a message if
      firmware indicates that the contents of the device may be corrupted.
      However, for the case of ACPI_NFIT_MEM_ARMED, the driver can arrange for
      the block devices incorporating that nvdimm to be marked read-only.
      This is a safe default as the data is still available and new writes are
      held off until the administrator either forces read-write mode, or the
      energy source becomes armed.
      
      A 'read_only' attribute is added to REGION devices to allow for
      overriding the default read-only policy of all descendant block devices.
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
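
The default policy and its override, as described in the commit above, could be modelled like this. The function `region_read_only()` and its parameters are hypothetical; the optional `admin_override` pointer stands in for a value written to the region's 'read_only' attribute.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Decide whether block devices below a region should be read-only.
 * Default: read-only if any dimm in the region is unarmed.
 * If the administrator supplied an override, that value wins.
 */
static bool region_read_only(const bool *dimm_unarmed, size_t nr_dimms,
			     const bool *admin_override)
{
	size_t i;

	assert(dimm_unarmed != NULL || nr_dimms == 0);
	if (admin_override)
		return *admin_override;
	for (i = 0; i < nr_dimms; i++)
		if (dimm_unarmed[i])
			return true;
	return false;
}
```

Keeping the default conservative means data stays readable while new writes are held off until someone explicitly opts in.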
    • libnvdimm: enable iostat · f0dc089c
      Dan Williams authored
      This is disabled by default as the overhead is prohibitive, but if the
      user takes the action to turn it on we'll oblige.
      Reviewed-by: Vishal Verma <vishal.l.verma@linux.intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • libnvdimm, blk: add support for blk integrity · fcae6957
      Vishal Verma authored
      Support multiple block sizes (sector + metadata) for nd_blk in the
      same way as done for the BTT. Add the idea of an 'internal' lbasize,
      which is properly aligned and padded, and store metadata in this space.
      Signed-off-by: Vishal Verma <vishal.l.verma@linux.intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
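
The idea of an 'internal' lbasize can be sketched with a little arithmetic. The 64-byte alignment, the 512/4096 sector split, and the helper names below are assumptions made for illustration; the commit message itself only says the internal size is "properly aligned and padded".

```c
#include <assert.h>
#include <stdint.h>

#define INT_LBASIZE_ALIGN 64u /* assumed alignment, not from the commit */

/* Data portion of a logical block: 512 or 4096 bytes (assumed split). */
static uint32_t sector_bytes(uint32_t lbasize)
{
	return lbasize >= 4096 ? 4096 : 512;
}

/* Metadata portion: whatever the block carries beyond the sector. */
static uint32_t meta_bytes(uint32_t lbasize)
{
	assert(lbasize >= 512);
	return lbasize - sector_bytes(lbasize);
}

/* On-media footprint: sector plus metadata, padded to the alignment. */
static uint32_t internal_lbasize(uint32_t lbasize)
{
	return (lbasize + INT_LBASIZE_ALIGN - 1) / INT_LBASIZE_ALIGN *
	       INT_LBASIZE_ALIGN;
}
```

Under these assumptions a 520-byte block (512 data + 8 metadata) occupies 576 bytes on media, with the metadata stored in the padded space after the sector.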
    • libnvdimm, btt: add support for blk integrity · 41cd8b70
      Vishal Verma authored
      Support multiple block sizes (sector + metadata) using the blk integrity
      framework. This registers a new integrity template that defines the
      protection information tuple size based on the configured metadata size,
      and simply acts as a passthrough for protection information generated by
      another layer. The metadata is written to the storage as-is, and read back
      with each sector.
      Signed-off-by: Vishal Verma <vishal.l.verma@linux.intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • libnvdimm, nfit, nd_blk: driver for BLK-mode access persistent memory · 047fc8a1
      Ross Zwisler authored
      The libnvdimm implementation handles allocating dimm address space (DPA)
      between PMEM and BLK mode interfaces.  After DPA has been allocated from
      a BLK-region to a BLK-namespace the nd_blk driver attaches to handle I/O
      as a struct bio based block device. Unlike PMEM, BLK is required to
      handle platform specific details like mmio register formats and memory
      controller interleave.  For this reason the libnvdimm generic nd_blk
      driver calls back into the bus provider to carry out the I/O.
      
      This initial implementation handles the BLK interface defined by the
      ACPI 6 NFIT [1] and the NVDIMM DSM Interface Example [2] composed from
      DCR (dimm control region), BDW (block data window), IDT (interleave
      descriptor) NFIT structures and the hardware register format.
      [1]: http://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf
      [2]: http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
      
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Boaz Harrosh <boaz@plexistor.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • nd_btt: atomic sector updates · 5212e11f
      Vishal Verma authored
      BTT stands for Block Translation Table, and is a way to provide power
      fail sector atomicity semantics for block devices that have the ability
      to perform byte granularity IO. It relies on the capability of libnvdimm
      namespace devices to do byte aligned IO.
      
      The BTT works as a stacked block device, and reserves a chunk of space
      from the backing device for its accounting metadata. It is a bio-based
      driver because all IO is done synchronously, and there is no queuing or
      asynchronous completions at either the device or the driver level.
      
      The BTT uses 'lanes' to index into various 'on-disk' data structures,
      and lanes also act as a synchronization mechanism in case there are more
      CPUs than available lanes. We did a comparison between two lane lock
      strategies - first where we kept an atomic counter around that tracked
      which was the last lane that was used, and 'our' lane was determined by
      atomically incrementing that. That way, for the nr_cpus > nr_lanes case,
      theoretically, no CPU would be blocked waiting for a lane. The other
      strategy was to use the cpu number we're scheduled on and hash it to
      a lane number. Theoretically, this could block an IO that could've
      otherwise run using a different, free lane. But some fio workloads
      showed that the direct cpu -> lane hash performed faster than tracking
      'last lane' - my reasoning is the cache thrash caused by moving the
      atomic variable made that approach slower than simply waiting out the
      in-progress IO. This supports the conclusion that the driver can be a
      very simple bio-based one that does synchronous IOs instead of queuing.
      
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Boaz Harrosh <boaz@plexistor.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      [jmoyer: fix nmi watchdog timeout in btt_map_init]
      [jmoyer: move btt initialization to module load path]
      [jmoyer: fix memory leak in the btt initialization path]
      [jmoyer: Don't overwrite corrupted arenas]
      Signed-off-by: Vishal Verma <vishal.l.verma@linux.intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
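
The two lane-selection strategies compared in the commit above can be sketched side by side. This is a simplified user-space model: the function names are hypothetical, C11 atomics stand in for kernel atomics, and a plain modulo is assumed as the cpu-to-lane "hash".

```c
#include <assert.h>
#include <stdatomic.h>

/*
 * Strategy 1: a shared atomic counter hands out lanes round-robin.
 * Every caller writes the same variable, so its cache line bounces
 * between cpus.
 */
static unsigned int lane_from_counter(atomic_uint *last_lane,
				      unsigned int nr_lanes)
{
	assert(nr_lanes > 0);
	return atomic_fetch_add(last_lane, 1) % nr_lanes;
}

/*
 * Strategy 2: derive the lane from the current cpu number.
 * No shared state is touched, but two cpus that map to the same lane
 * must take turns (the lane lock is not modelled here).
 */
static unsigned int lane_from_cpu(unsigned int cpu, unsigned int nr_lanes)
{
	assert(nr_lanes > 0);
	return cpu % nr_lanes;
}
```

The second form trades occasional waiting for the absence of a contended shared counter, which is the trade-off the fio results in the commit message favoured.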
  22. 25 Jun, 2015 1 commit
    • libnvdimm: infrastructure for btt devices · 8c2f7e86
      Dan Williams authored
      NVDIMM namespaces, in addition to accepting "struct bio" based requests,
      also have the capability to perform byte-aligned accesses.  By default
      only the bio/block interface is used.  However, if another driver can
      make effective use of the byte-aligned capability it can claim the
      namespace interface and use the byte-aligned ->rw_bytes() interface.
      
      The BTT driver is the first consumer of this mechanism to allow
      adding atomic sector update semantics to a pmem or blk namespace.  This
      patch is the sysfs infrastructure to allow configuring a BTT instance
      for a namespace.  Enabling that BTT and performing i/o is in a
      subsequent patch.
      
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: Neil Brown <neilb@suse.de>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>