1. 22 10月, 2015 1 次提交
  2. 17 9月, 2015 3 次提交
  3. 29 8月, 2015 3 次提交
    • D
      libnvdimm, pmem: direct map legacy pmem by default · 004f1afb
      Dan Williams 提交于
      The expectation is that the legacy / non-standard pmem discovery method
      (e820 type-12) will only ever be used to describe small quantities of
      persistent memory.  Larger capacities will be described via the ACPI
      NFIT.  When "allocate struct page from pmem" support is added this default
      policy can be overridden by assigning a legacy pmem namespace to a pfn
      device, however this would be only be necessary if a platform used the
      legacy mechanism to define a very large range.
      
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      004f1afb
    • D
      libnvdimm, pmem: 'struct page' for pmem · 32ab0a3f
      Dan Williams 提交于
      Enable the pmem driver to handle PFN device instances.  Attaching a pmem
      namespace to a pfn device triggers the driver to allocate and initialize
      struct page entries for pmem.  Memory capacity for this allocation comes
      exclusively from RAM for now which is suitable for low PMEM to RAM
      ratios.  This mechanism will be expanded later for setting an "allocate
      from PMEM" policy.
      
      Cc: Boaz Harrosh <boaz@plexistor.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      32ab0a3f
    • D
      libnvdimm, pfn: 'struct page' provider infrastructure · e1455744
      Dan Williams 提交于
      Implement the base infrastructure for libnvdimm PFN devices. Similar to
      BTT devices they take a namespace as a backing device and layer
      functionality on top. In this case the functionality is reserving space
      for an array of 'struct page' entries to be handed out through
      pfn_to_page(). For now this is just the basic libnvdimm-device-model for
      configuring the base PFN device.
      
      As the namespace claiming mechanism for PFN devices is mostly identical
      to BTT devices drivers/nvdimm/claim.c is created to house the common
      bits.
      
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      e1455744
  4. 28 8月, 2015 3 次提交
  5. 21 8月, 2015 1 次提交
  6. 19 8月, 2015 1 次提交
    • D
      libnvdimm, e820: make CONFIG_X86_PMEM_LEGACY a tristate option · 7a67832c
      Dan Williams 提交于
      We currently register a platform device for e820 type-12 memory and
      register a nvdimm bus beneath it.  Registering the platform device
      triggers the device-core machinery to probe for a driver, but that
      search currently comes up empty.  Building the nvdimm-bus registration
      into the e820_pmem platform device registration in this way forces
      libnvdimm to be built-in.  Instead, convert the built-in portion of
      CONFIG_X86_PMEM_LEGACY to simply register a platform device and move the
      rest of the logic to the driver for e820_pmem, for the following
      reasons:
      
      1/ Letting e820_pmem support be a module allows building and testing
         libnvdimm.ko changes without rebooting
      
      2/ All the normal policy around modules can be applied to e820_pmem
         (unbind to disable and/or blacklisting the module from loading by
         default)
      
      3/ Moving the driver to a generic location and converting it to scan
         "iomem_resource" rather than "e820.map" means any other architecture can
         take advantage of this simple nvdimm resource discovery mechanism by
         registering a resource named "Persistent Memory (legacy)"
      
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      7a67832c
  7. 15 8月, 2015 4 次提交
  8. 01 8月, 2015 1 次提交
  9. 29 7月, 2015 1 次提交
    • C
      block: add a bi_error field to struct bio · 4246a0b6
      Christoph Hellwig 提交于
      Currently we have two different ways to signal an I/O error on a BIO:
      
       (1) by clearing the BIO_UPTODATE flag
       (2) by returning a Linux errno value to the bi_end_io callback
      
      The first one has the drawback of only communicating a single possible
      error (-EIO), and the second one has the drawback of not beeing persistent
      when bios are queued up, and are not passed along from child to parent
      bio in the ever more popular chaining scenario.  Having both mechanisms
      available has the additional drawback of utterly confusing driver authors
      and introducing bugs where various I/O submitters only deal with one of
      them, and the others have to add boilerplate code to deal with both kinds
      of error returns.
      
      So add a new bi_error field to store an errno value directly in struct
      bio and remove the existing mechanisms to clean all this up.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NHannes Reinecke <hare@suse.de>
      Reviewed-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      4246a0b6
  10. 28 7月, 2015 2 次提交
  11. 26 7月, 2015 1 次提交
    • D
      libnvdimm: fix namespace seed creation · 8ca24353
      Dan Williams 提交于
      A new BLK namespace "seed" device is created whenever the current seed
      is successfully probed.  However, if that namespace is assigned to a BTT
      it may never directly experience a successful probe as it is a
      subordinate device to a BTT configuration.
      
      The effect of the current code is that no new namespaces can be
      instantiated, after the seed namespace, to consume available BLK DPA
      capacity.  Fix this by treating a successful BTT probe event as a
      successful probe event for the backing namespace.
      Reported-by: NNicholas Moulin <nicholas.w.moulin@linux.intel.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      8ca24353
  12. 01 7月, 2015 2 次提交
  13. 26 6月, 2015 12 次提交
    • R
      arch, x86: pmem api for ensuring durability of persistent memory updates · 61031952
      Ross Zwisler 提交于
      Based on an original patch by Ross Zwisler [1].
      
      Writes to persistent memory have the potential to be posted to cpu
      cache, cpu write buffers, and platform write buffers (memory controller)
      before being committed to persistent media.  Provide apis,
      memcpy_to_pmem(), wmb_pmem(), and memremap_pmem(), to write data to
      pmem and assert that it is durable in PMEM (a persistent linear address
      range).  A '__pmem' attribute is added so sparse can track proper usage
      of pointers to pmem.
      
      This continues the status quo of pmem being x86 only for 4.2, but
      reworks to ioremap, and wider implementation of memremap() will enable
      other archs in 4.3.
      
      [1]: https://lists.01.org/pipermail/linux-nvdimm/2015-May/000932.html
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      [djbw: various reworks]
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      61031952
    • T
      libnvdimm: Add sysfs numa_node to NVDIMM devices · 74ae66c3
      Toshi Kani 提交于
      Add support of sysfs 'numa_node' to I/O-related NVDIMM devices
      under /sys/bus/nd/devices, regionN, namespaceN.0, and bttN.x.
      
      An example of numa_node values on a 2-socket system with a single
      NVDIMM range on each socket is shown below.
        /sys/bus/nd/devices
        |-- btt0.0/numa_node:0
        |-- btt1.0/numa_node:1
        |-- btt1.1/numa_node:1
        |-- namespace0.0/numa_node:0
        |-- namespace1.0/numa_node:1
        |-- region0/numa_node:0
        |-- region1/numa_node:1
      
      These numa_node files are then linked under the block class of
      their device names.
        /sys/class/block/pmem0/device/numa_node:0
        /sys/class/block/pmem1s/device/numa_node:1
      
      This enables numactl(8) to accept 'block:' and 'file:' paths of
      pmem and btt devices as shown in the examples below.
        numactl --preferred block:pmem0 --show
        numactl --preferred file:/dev/pmem1s --show
      Signed-off-by: NToshi Kani <toshi.kani@hp.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      74ae66c3
    • T
      libnvdimm: Set numa_node to NVDIMM devices · 41d7a6d6
      Toshi Kani 提交于
      ACPI NFIT table has System Physical Address Range Structure entries that
      describe a proximity ID of each range when ACPI_NFIT_PROXIMITY_VALID is
      set in the flags.
      
      Change acpi_nfit_register_region() to map a proximity ID to its node ID,
      and set it to a new numa_node field of nd_region_desc, which is then
      conveyed to the nd_region device.
      
      The device core arranges for btt and namespace devices to inherit their
      node from their parent region.
      Signed-off-by: NToshi Kani <toshi.kani@hp.com>
      [djbw: move set_dev_node() from region.c to bus.c]
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      41d7a6d6
    • D
      libnvdimm, nfit: handle unarmed dimms, mark namespaces read-only · 58138820
      Dan Williams 提交于
      Upon detection of an unarmed dimm in a region, arrange for descendant
      BTT, PMEM, or BLK instances to be read-only.  A dimm is primarily marked
      "unarmed" via flags passed by platform firmware (NFIT).
      
      The flags in the NFIT memory device sub-structure indicate the state of
      the data on the nvdimm relative to its energy source or last "flush to
      persistence".  For the most part there is nothing the driver can do but
      advertise the state of these flags in sysfs and emit a message if
      firmware indicates that the contents of the device may be corrupted.
      However, for the case of ACPI_NFIT_MEM_ARMED, the driver can arrange for
      the block devices incorporating that nvdimm to be marked read-only.
      This is a safe default as the data is still available and new writes are
      held off until the administrator either forces read-write mode, or the
      energy source becomes armed.
      
      A 'read_only' attribute is added to REGION devices to allow for
      overriding the default read-only policy of all descendant block devices.
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      58138820
    • D
      pmem: flag pmem block devices as non-rotational · 0f51c4fa
      Dan Williams 提交于
      ...since they are effectively SSDs as far as userspace is concerned.
      Reviewed-by: NVishal Verma <vishal.l.verma@linux.intel.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      0f51c4fa
    • D
      libnvdimm: enable iostat · f0dc089c
      Dan Williams 提交于
      This is disabled by default as the overhead is prohibitive, but if the
      user takes the action to turn it on we'll oblige.
      Reviewed-by: NVishal Verma <vishal.l.verma@linux.intel.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      f0dc089c
    • D
      pmem: make_request cleanups · edc870e5
      Dan Williams 提交于
      Various cleanups:
      
      1/ Kill the BUG_ON since we've already told the block layer we don't
         support DISCARD on all these drivers.
      
      2/ Kill the 'rw' variable, no need to cache it.
      
      3/ Kill the local 'sector' variable.  bio_for_each_segment() is already
         advancing the iterator's sector number by the bio_vec length.
      
      4/ Kill the check for accessing past the end of device
         generic_make_request_checks() already does that.
      Suggested-by: NChristoph Hellwig <hch@lst.de>
      [hch: kill access past end of the device check]
      Reviewed-by: NVishal Verma <vishal.l.verma@linux.intel.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      edc870e5
    • D
      libnvdimm, pmem: fix up max_hw_sectors · 43d3fa3a
      Dan Williams 提交于
      There is no hardware limit to enforce on the size of the i/o that can be passed
      to an nvdimm block device, so set it to UINT_MAX.
      Reviewed-by: NVishal Verma <vishal.l.verma@linux.intel.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      43d3fa3a
    • V
      libnvdimm, blk: add support for blk integrity · fcae6957
      Vishal Verma 提交于
      Support multiple block sizes (sector + metadata) for nd_blk in the
      same way as done for the BTT. Add the idea of an 'internal' lbasize,
      which is properly aligned and padded, and store metadata in this space.
      Signed-off-by: NVishal Verma <vishal.l.verma@linux.intel.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      fcae6957
    • V
      libnvdimm, btt: add support for blk integrity · 41cd8b70
      Vishal Verma 提交于
      Support multiple block sizes (sector + metadata) using the blk integrity
      framework. This registers a new integrity template that defines the
      protection information tuple size based on the configured metadata size,
      and simply acts as a passthrough for protection information generated by
      another layer. The metadata is written to the storage as-is, and read back
      with each sector.
      Signed-off-by: NVishal Verma <vishal.l.verma@linux.intel.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      41cd8b70
    • R
      libnvdimm, nfit, nd_blk: driver for BLK-mode access persistent memory · 047fc8a1
      Ross Zwisler 提交于
      The libnvdimm implementation handles allocating dimm address space (DPA)
      between PMEM and BLK mode interfaces.  After DPA has been allocated from
      a BLK-region to a BLK-namespace the nd_blk driver attaches to handle I/O
      as a struct bio based block device. Unlike PMEM, BLK is required to
      handle platform specific details like mmio register formats and memory
      controller interleave.  For this reason the libnvdimm generic nd_blk
      driver calls back into the bus provider to carry out the I/O.
      
      This initial implementation handles the BLK interface defined by the
      ACPI 6 NFIT [1] and the NVDIMM DSM Interface Example [2] composed from
      DCR (dimm control region), BDW (block data window), IDT (interleave
      descriptor) NFIT structures and the hardware register format.
      [1]: http://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf
      [2]: http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
      
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Boaz Harrosh <boaz@plexistor.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Acked-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      047fc8a1
    • V
      nd_btt: atomic sector updates · 5212e11f
      Vishal Verma 提交于
      BTT stands for Block Translation Table, and is a way to provide power
      fail sector atomicity semantics for block devices that have the ability
      to perform byte granularity IO. It relies on the capability of libnvdimm
      namespace devices to do byte aligned IO.
      
      The BTT works as a stacked blocked device, and reserves a chunk of space
      from the backing device for its accounting metadata. It is a bio-based
      driver because all IO is done synchronously, and there is no queuing or
      asynchronous completions at either the device or the driver level.
      
      The BTT uses 'lanes' to index into various 'on-disk' data structures,
      and lanes also act as a synchronization mechanism in case there are more
      CPUs than available lanes. We did a comparison between two lane lock
      strategies - first where we kept an atomic counter around that tracked
      which was the last lane that was used, and 'our' lane was determined by
      atomically incrementing that. That way, for the nr_cpus > nr_lanes case,
      theoretically, no CPU would be blocked waiting for a lane. The other
      strategy was to use the cpu number we're scheduled on to and hash it to
      a lane number. Theoretically, this could block an IO that could've
      otherwise run using a different, free lane. But some fio workloads
      showed that the direct cpu -> lane hash performed faster than tracking
      'last lane' - my reasoning is the cache thrash caused by moving the
      atomic variable made that approach slower than simply waiting out the
      in-progress IO. This supports the conclusion that the driver can be a
      very simple bio-based one that does synchronous IOs instead of queuing.
      
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Boaz Harrosh <boaz@plexistor.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      [jmoyer: fix nmi watchdog timeout in btt_map_init]
      [jmoyer: move btt initialization to module load path]
      [jmoyer: fix memory leak in the btt initialization path]
      [jmoyer: Don't overwrite corrupted arenas]
      Signed-off-by: NVishal Verma <vishal.l.verma@linux.intel.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      5212e11f
  14. 25 6月, 2015 5 次提交
    • D
      libnvdimm: infrastructure for btt devices · 8c2f7e86
      Dan Williams 提交于
      NVDIMM namespaces, in addition to accepting "struct bio" based requests,
      also have the capability to perform byte-aligned accesses.  By default
      only the bio/block interface is used.  However, if another driver can
      make effective use of the byte-aligned capability it can claim namespace
      interface and use the byte-aligned ->rw_bytes() interface.
      
      The BTT driver is the initial first consumer of this mechanism to allow
      adding atomic sector update semantics to a pmem or blk namespace.  This
      patch is the sysfs infrastructure to allow configuring a BTT instance
      for a namespace.  Enabling that BTT and performing i/o is in a
      subsequent patch.
      
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: Neil Brown <neilb@suse.de>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      8c2f7e86
    • D
      libnvdimm: write blk label set · 0ba1c634
      Dan Williams 提交于
      After 'uuid', 'size', 'sector_size', and optionally 'alt_name' have been
      set to valid values the labels on the dimm can be updated.  The
      difference with the pmem case is that blk namespaces are limited to one
      dimm and can cover discontiguous ranges in dpa space.
      
      Also, after allocating label slots, it is useful for userspace to know
      how many slots are left.  Export this information in sysfs.
      
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: Neil Brown <neilb@suse.de>
      Acked-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      0ba1c634
    • D
      libnvdimm: write pmem label set · f524bf27
      Dan Williams 提交于
      After 'uuid', 'size', and optionally 'alt_name' have been set to valid
      values the labels on the dimms can be updated.
      
      Write procedure is:
      1/ Allocate and write new labels in the "next" index
      2/ Free the old labels in the working copy
      3/ Write the bitmap and the label space on the dimm
      4/ Write the index to make the update valid
      
      Label ranges directly mirror the dpa resource values for the given
      label_id of the namespace.
      
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: Neil Brown <neilb@suse.de>
      Acked-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      f524bf27
    • D
      libnvdimm: blk labels and namespace instantiation · 1b40e09a
      Dan Williams 提交于
      A blk label set describes a namespace comprised of one or more
      discontiguous dpa ranges on a single dimm.  They may alias with one or
      more pmem interleave sets that include the given dimm.
      
      This is the runtime/volatile configuration infrastructure for sysfs
      manipulation of 'alt_name', 'uuid', 'size', and 'sector_size'.  A later
      patch will make these settings persistent by writing back the label(s).
      
      Unlike pmem namespaces, multiple blk namespaces can be created per
      region.  Once a blk namespace has been created a new seed device
      (unconfigured child of a parent blk region) is instantiated.  As long as
      a region has 'available_size' != 0 new child namespaces may be created.
      
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: Neil Brown <neilb@suse.de>
      Acked-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      1b40e09a
    • D
      libnvdimm: pmem label sets and namespace instantiation. · bf9bccc1
      Dan Williams 提交于
      A complete label set is a PMEM-label per-dimm per-interleave-set where
      all the UUIDs match and the interleave set cookie matches the hosting
      interleave set.
      
      Present sysfs attributes for manipulation of a PMEM-namespace's
      'alt_name', 'uuid', and 'size' attributes.  A later patch will make
      these settings persistent by writing back the label.
      
      Note that PMEM allocations grow forwards from the start of an interleave
      set (lowest dimm-physical-address (DPA)).  BLK-namespaces that alias
      with a PMEM interleave set will grow allocations backward from the
      highest DPA.
      
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: Neil Brown <neilb@suse.de>
      Acked-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      bf9bccc1