1. 24 Mar, 2019 2 commits
    • W
      libnvdimm, pfn: Fix over-trim in trim_pfn_device() · 6a89ed7a
      Wei Yang committed
      commit f101ada7da6551127d192c2f1742c1e9e0f62799 upstream.
      
      When trying to see whether current nd_region intersects with others,
      trim_pfn_device() has already calculated the *size* to be expanded to
      SECTION size.
      
      Do not double append 'adjust' to 'size' when calculating whether the end
      of a region collides with the next pmem region.
      
      Fixes: ae86cbfef381 "libnvdimm, pfn: Pad pfn namespaces relative to other regions"
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      6a89ed7a
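      The double-padding problem described above can be illustrated with a
      small user-space sketch. This is not the kernel's trim_pfn_device();
      the 128 MiB section size, the helper names, and the definition of
      'adjust' as the padding added by alignment are assumptions made for
      illustration only.

```c
#include <stdbool.h>
#include <stdint.h>

#define SECTION_SIZE (128ULL << 20) /* assumed 128 MiB memory section */

/* Round x up to the next multiple of a (a must be a power of two). */
static uint64_t align_up(uint64_t x, uint64_t a)
{
	return (x + a - 1) & ~(a - 1);
}

/* True if [start, start + len) runs past the start of the next region. */
static bool collides(uint64_t start, uint64_t len, uint64_t next_start)
{
	return start + len > next_start;
}

/* Fixed check: 'size' is already section-aligned, so test it as-is. */
bool end_collides_fixed(uint64_t start, uint64_t raw_size, uint64_t next_start)
{
	uint64_t size = align_up(raw_size, SECTION_SIZE);

	return collides(start, size, next_start);
}

/* Buggy check: appends the padding a second time on top of the aligned size. */
bool end_collides_buggy(uint64_t start, uint64_t raw_size, uint64_t next_start)
{
	uint64_t size = align_up(raw_size, SECTION_SIZE);
	uint64_t adjust = size - raw_size; /* hypothetical definition of 'adjust' */

	return collides(start, size + adjust, next_start);
}
```

      With a 192 MiB namespace at offset 0 and a neighbour at 256 MiB, the
      fixed check sees no collision, while the buggy one reports a false
      collision and would trim more than necessary.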
    • D
      libnvdimm/label: Clear 'updating' flag after label-set update · 2b88d92e
      Dan Williams committed
      commit 966d23a006ca7b44ac8cf4d0c96b19785e0c3da0 upstream.
      
      The UEFI 2.7 specification sets expectations that the 'updating' flag is
      eventually cleared. To date, the libnvdimm core has never adhered to
      that protocol. The policy of the core matches the policy of other
      multi-device info-block formats like MD-Software-RAID that expect
      administrator intervention on inconsistent info-blocks, not automatic
      invalidation.
      
      However, some pre-boot environments may unfortunately attempt to "clean
      up" the labels and invalidate a set when it fails to find at least one
      "non-updating" label in the set. Clear the updating flag after set
      updates to minimize the window of vulnerability to aggressive pre-boot
      environments.
      
      Ideally implementations would not write to the label area outside of
      creating namespaces.
      
      Note that this only minimizes the window, it does not close it as the
      system can still crash while clearing the flag and the set can be
      subsequently deleted / invalidated by the pre-boot environment.
      
      Fixes: f524bf27 ("libnvdimm: write pmem label set")
      Cc: <stable@vger.kernel.org>
      Cc: Kelly Couch <kelly.j.couch@intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      2b88d92e
  2. 13 Jan, 2019 1 commit
  3. 13 Dec, 2018 1 commit
  4. 14 Nov, 2018 3 commits
  5. 21 Aug, 2018 1 commit
  6. 20 Aug, 2018 1 commit
    • V
      libnvdimm: fix ars_status output length calculation · 286e8771
      Vishal Verma committed
      Commit efda1b5d ("acpi, nfit, libnvdimm: fix / harden ars_status output length handling")
      introduced additional hardening for ambiguity in the ACPI spec for
      ars_status output sizing. However, it had a couple of cases mixed up.
      Where it should have been checking for (and returning) "out_field[1] -
      4" it was using "out_field[1] - 8" and vice versa.
      
      This caused a four byte discrepancy in the buffer size passed on to
      the command handler, and in some cases, this caused memory corruption
      like:
      
        ./daxdev-errors.sh: line 76: 24104 Aborted   (core dumped) ./daxdev-errors $busdev $region
        malloc(): memory corruption
        Program received signal SIGABRT, Aborted.
        [...]
        #5  0x00007ffff7865a2e in calloc () from /lib64/libc.so.6
        #6  0x00007ffff7bc2970 in ndctl_bus_cmd_new_ars_status (ars_cap=ars_cap@entry=0x6153b0) at ars.c:136
        #7  0x0000000000401644 in check_ars_status (check=0x7fffffffdeb0, bus=0x604c20) at daxdev-errors.c:144
        #8  test_daxdev_clear_error (region_name=<optimized out>, bus_name=<optimized out>)
            at daxdev-errors.c:332
      
      Cc: <stable@vger.kernel.org>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Lukasz Dorau <lukasz.dorau@intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Fixes: efda1b5d ("acpi, nfit, libnvdimm: fix / harden ars_status output length handling")
      Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
      Reviewed-by: Keith Busch <keith.busch@intel.com>
      Signed-off-by: Dave Jiang <dave.jiang@intel.com>
      286e8771
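      The corrected length calculation can be sketched as follows, reading
      the commit message literally: the 4-byte offset is the one that is
      checked against the remaining buffer space and returned, and the
      8-byte offset is the fallback. The function name, the parameter names,
      and the extra underflow guard are assumptions of this sketch, not the
      kernel source.

```c
#include <stdint.h>

/*
 * reported_len: the size the firmware reported (out_field[1] in the commit).
 * remainder:    bytes left in the caller's buffer for the variable payload.
 */
uint32_t ars_status_payload_len(uint32_t reported_len, uint32_t remainder)
{
	/* Too small to describe any payload at all. */
	if (reported_len < 4)
		return 0;

	/* Case 1: check for, and return, "reported_len - 4". */
	if (reported_len - 4 == remainder)
		return remainder;

	/* Case 2: fall back to "reported_len - 8" (guarded against underflow). */
	return reported_len >= 8 ? reported_len - 8 : 0;
}
```

      Swapping the two constants, as the earlier hardening commit did,
      shifts the result by four bytes in both cases, which matches the
      four-byte discrepancy the message describes.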
  7. 31 Jul, 2018 1 commit
  8. 26 Jul, 2018 2 commits
  9. 18 Jul, 2018 2 commits
    • M
      block: Add and use op_stat_group() for indexing disk_stat fields. · ddcf35d3
      Michael Callahan committed
      Add and use a new op_stat_group() function for indexing partition stat
      fields rather than indexing them by rq_data_dir() or bio_data_dir().
      This function works similarly to op_is_sync() in that it takes the
      request::cmd_flags or bio::bi_opf flags and determines which stats
      should get updated.
      
      In addition, the second parameter to generic_start_io_acct() and
      generic_end_io_acct() is now a REQ_OP rather than simply a read or
      write bit and it uses op_stat_group() on the parameter to determine
      the stat group.
      
      Note that the partition in_flight counts are not part of the per-cpu
      statistics and as such are not indexed via this function.  They are
      now indexed by op_is_write().
      
      tj: Refreshed on top of v4.17.  Updated to pass around REQ_OP.
      Signed-off-by: Michael Callahan <michaelcallahan@fb.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Joshua Morris <josh.h.morris@us.ibm.com>
      Cc: Philipp Reisner <philipp.reisner@linbit.com>
      Cc: Matias Bjorling <mb@lightnvm.io>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Cc: Alasdair Kergon <agk@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      ddcf35d3
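      The indexing scheme described in this commit can be sketched in user
      space. The enum values and the disk_stats layout below are local
      stand-ins chosen for the example, and the two-group (read/write) split
      is an assumption about the state at the time of this commit.

```c
#include <stdbool.h>

/* Local stand-ins for a few request operations (sketch-only values). */
enum req_op { OP_READ = 0, OP_WRITE = 1, OP_FLUSH = 2, OP_DISCARD = 3 };

/* Stat groups used to index the per-cpu style counters. */
enum stat_group { STAT_READ = 0, STAT_WRITE = 1, NR_STAT_GROUPS };

/* An operation counts as a write when its low bit is set. */
static bool op_is_write(enum req_op op)
{
	return (op & 1) != 0;
}

/* Map an operation to the stat group whose counters should be updated. */
int op_stat_group(enum req_op op)
{
	return op_is_write(op) ? STAT_WRITE : STAT_READ;
}

struct disk_stats {
	unsigned long ios[NR_STAT_GROUPS]; /* indexed by op_stat_group() */
	unsigned long in_flight[2];        /* indexed by op_is_write() */
};

/* Account the start of one I/O: stats by group, in_flight by direction. */
void account_start(struct disk_stats *s, enum req_op op)
{
	s->ios[op_stat_group(op)]++;
	s->in_flight[op_is_write(op)]++;
}
```

      Routing every counter update through one mapping function means a
      later change, such as giving discards their own group, only has to
      touch op_stat_group() and the group enum.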
    • T
      block: make bdev_ops->rw_page() take a REQ_OP instead of bool · 3f289dcb
      Tejun Heo committed
      c11f0c0b ("block/mm: make bdev_ops->rw_page() take a bool for
      read/write") replaced @OP with boolean @is_write, which limited the
      amount of information going into ->rw_page() and more importantly
      page_endio(), which removed the need to expose block internals to mm.
      
      Unfortunately, we want to track discards separately and @is_write
      isn't enough information.  This patch updates bdev_ops->rw_page() to
      take REQ_OP instead but leaves page_endio() to take bool @is_write.
      This allows the block part of operations to have enough information
      while not leaking it to mm.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Mike Christie <mchristi@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      3f289dcb
  10. 15 Jul, 2018 1 commit
  11. 29 Jun, 2018 2 commits
  12. 07 Jun, 2018 3 commits
    • R
      libnvdimm, pmem: Do not flush power-fail protected CPU caches · 546eb031
      Ross Zwisler committed
      This commit:
      
      5fdf8e5b ("libnvdimm: re-enable deep flush for pmem devices via fsync()")
      
      intended to make sure that deep flush was always available even on
      platforms which support a power-fail protected CPU cache.  An unintended
      side effect of this change was that we also lost the ability to skip
      flushing CPU caches on platforms with a power-fail protected CPU cache.
      
      Fix this by skipping the low level cache flushing in dax_flush() if we have
      CPU caches which are power-fail protected.  The user can still override this
      behavior by manually setting the write_cache state of a namespace.  See
      libndctl's ndctl_namespace_write_cache_is_enabled(),
      ndctl_namespace_enable_write_cache() and
      ndctl_namespace_disable_write_cache() functions.
      
      Cc: <stable@vger.kernel.org>
      Fixes: 5fdf8e5b ("libnvdimm: re-enable deep flush for pmem devices via fsync()")
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      546eb031
    • R
      libnvdimm, pmem: Unconditionally deep flush on *sync · ce7f11a2
      Ross Zwisler committed
      Prior to this commit we would only do a "deep flush" (have nvdimm_flush()
      write to each of the flush hints for a region) in response to an
      msync/fsync/sync call if nvdimm_has_cache() returned true at the time
      we were setting up the request queue.  This happens due to the write cache
      value passed in to blk_queue_write_cache(), which then causes the block
      layer to send down BIOs with REQ_FUA and REQ_PREFLUSH set.  We do have a
      "write_cache" sysfs entry for namespaces, i.e.:
      
        /sys/bus/nd/devices/pfn0.1/block/pmem0/dax/write_cache
      
      which can be used to control whether or not the kernel thinks a given
      namespace has a write cache, but this didn't modify the deep flush behavior
      that we set up when the driver was initialized.  Instead, it only modified
      whether or not DAX would flush CPU caches via dax_flush() in response to
      *sync calls.
      
      Simplify this by making the *sync deep flush always happen, regardless of
      the write cache setting of a namespace.  The DAX CPU cache flushing will
      still be controlled by the write_cache setting of the namespace.
      
      Cc: <stable@vger.kernel.org>
      Fixes: 5fdf8e5b ("libnvdimm: re-enable deep flush for pmem devices via fsync()")
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      ce7f11a2
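      Taken together, the two flush commits above describe a simple policy:
      the deep flush always happens on *sync, while CPU cache flushing
      follows the namespace's write_cache setting, which defaults to off
      when the CPU caches are power-fail protected. A minimal model of that
      policy, with hypothetical type and function names, might look like:

```c
#include <stdbool.h>

/* What a *sync call should trigger in this model. */
struct sync_plan {
	bool flush_cpu_caches; /* low-level cache flushing via dax_flush() */
	bool deep_flush;       /* write to the region's flush hints */
};

/* Default write_cache state: only needed when caches are not protected. */
bool default_write_cache(bool cpu_cache_power_fail_protected)
{
	return !cpu_cache_power_fail_protected;
}

/* Derive the actions for a *sync call from the current write_cache state. */
struct sync_plan plan_sync(bool write_cache)
{
	struct sync_plan plan = {
		.flush_cpu_caches = write_cache, /* user-overridable setting */
		.deep_flush = true,              /* unconditional on *sync */
	};

	return plan;
}
```

      An administrator who overrides write_cache changes only the first
      field of the plan; the deep flush is unaffected.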
    • R
      libnvdimm, pmem: Complete REQ_FLUSH => REQ_PREFLUSH · d2d6364d
      Ross Zwisler committed
      Complete the move from REQ_FLUSH to REQ_PREFLUSH that apparently started
      way back in v4.8.
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      d2d6364d
  13. 03 Jun, 2018 2 commits
    • D
      libnvdimm, e820: Register all pmem resources · d76401ad
      Dan Williams committed
      There is currently a mismatch between the resources that will trigger
      the e820_pmem driver to register/load and the resources that will
      actually be surfaced as pmem ranges. register_e820_pmem() uses
      walk_iomem_res_desc() which includes children and siblings. In contrast,
      e820_pmem_probe() only considers top level resources. For example the
      following resource tree results in the driver being loaded, but no
      resources being registered:
      
          398000000000-39bfffffffff : PCI Bus 0000:ae
            39be00000000-39bf07ffffff : PCI Bus 0000:af
              39be00000000-39beffffffff : 0000:af:00.0
                39be10000000-39beffffffff : Persistent Memory (legacy)
      
      Fix this up to allow definitions of "legacy" pmem ranges anywhere in
      system-physical address space. Not that it is recommended or safe to
      define a pmem range in PCI space, but it is useful for debug /
      experimentation, and the restriction on being a top-level resource was
      arbitrary.
      
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      d76401ad
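      The mismatch between the two walks can be reproduced on a toy resource
      tree. The struct and function names here are hypothetical; the sketch
      only contrasts a top-level scan with a walk that also descends into
      children, as described in the message above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy resource node: first child plus next sibling, as in a resource tree. */
struct res {
	bool is_pmem;
	struct res *child;
	struct res *sibling;
};

/* Count pmem ranges among top-level resources only. */
size_t count_top_level(const struct res *root)
{
	size_t n = 0;

	for (const struct res *r = root; r; r = r->sibling)
		n += r->is_pmem ? 1 : 0;
	return n;
}

/* Count pmem ranges anywhere in the tree (children and siblings). */
size_t count_all(const struct res *r)
{
	size_t n = 0;

	for (; r; r = r->sibling)
		n += (r->is_pmem ? 1 : 0) + count_all(r->child);
	return n;
}
```

      For the PCI example in the message, the full walk finds one pmem range
      (so the driver loads) while the top-level scan finds none (so nothing
      is registered).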
    • D
      libnvdimm: Debug probe times · 3f46833d
      Dan Williams committed
      Instrument nvdimm_bus_probe() to emit timestamps for the start and end
      of libnvdimm device probing. This is useful for identifying sources of
      libnvdimm sub-system initialization latency.
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      3f46833d
  14. 01 Jun, 2018 1 commit
    • R
      libnvdimm, pmem: Preserve read-only setting for pmem devices · 254a4cd5
      Robert Elliott committed
      The pmem driver does not honor a forced read-only setting for very long:
      	$ blockdev --setro /dev/pmem0
      	$ blockdev --getro /dev/pmem0
      	1
      
      followed by various commands like these:
      	$ blockdev --rereadpt /dev/pmem0
      	or
      	$ mkfs.ext4 /dev/pmem0
      
      results in this in the kernel serial log:
      	 nd_pmem namespace0.0: region0 read-write, marking pmem0 read-write
      
      with the read-only setting lost:
      	$ blockdev --getro /dev/pmem0
      	0
      
      That's from bus.c nvdimm_revalidate_disk(), which always applies the
      setting from nd_region (which is initially based on the ACPI NFIT
      NVDIMM state flags not_armed bit).
      
      In contrast, commit 20bd1d02 ("scsi: sd: Keep disk read-only when
      re-reading partition") fixed this issue for SCSI devices to preserve
      the previous setting if it was set to read-only.
      
      This patch modifies bus.c to preserve any previous read-only setting.
      It also eliminates the kernel serial log print except for cases where
      read-write is changed to read-only, so it doesn't print read-only to
      read-only non-changes.
      
      Cc: <stable@vger.kernel.org>
      Fixes: 58138820 ("libnvdimm, nfit: handle unarmed dimms, mark namespaces read-only")
      Signed-off-by: Robert Elliott <elliott@hpe.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      254a4cd5
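      The revalidation policy this patch describes can be written as a small
      decision function. This is a sketch of the stated behaviour, not the
      code in bus.c; the function name and the log_change out-parameter are
      assumptions.

```c
#include <stdbool.h>

/*
 * Decide the disk's read-only state on revalidation.
 * disk_ro:    current setting of the block device (e.g. blockdev --setro)
 * region_ro:  setting derived from the region
 * log_change: set to true only for a read-write to read-only transition
 */
bool revalidate_read_only(bool disk_ro, bool region_ro, bool *log_change)
{
	*log_change = false;

	/* Preserve a previous read-only setting, whatever the region says. */
	if (disk_ro)
		return true;

	/* Upgrade to read-only when the region is read-only, and log it. */
	if (region_ro) {
		*log_change = true;
		return true;
	}

	/* Both read-write: nothing changes and nothing is logged. */
	return false;
}
```

      The key property is that the function never returns false when
      disk_ro is true, so --rereadpt or mkfs can no longer silently undo a
      forced read-only setting.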
  15. 23 May, 2018 2 commits
  16. 22 May, 2018 1 commit
    • D
      mm: introduce MEMORY_DEVICE_FS_DAX and CONFIG_DEV_PAGEMAP_OPS · e7638488
      Dan Williams committed
      In preparation for fixing dax-dma-vs-unmap issues, filesystems need to
      be able to rely on the fact that they will get wakeups on dev_pagemap
      page-idle events. Introduce MEMORY_DEVICE_FS_DAX and
      generic_dax_page_free() as common indicator / infrastructure for dax
      filesystems to require. With this change there are no users of the
      MEMORY_DEVICE_HOST designation, so remove it.
      
      The HMM sub-system extended dev_pagemap to arrange a callback when a
      dev_pagemap managed page is freed. Since a dev_pagemap page is free /
      idle when its reference count is 1 it requires an additional branch to
      check the page-type at put_page() time. Given put_page() is a hot-path
      we do not want to incur that check if HMM is not in use, so a static
      branch is used to avoid that overhead when not necessary.
      
      Now, the FS_DAX implementation wants to reuse this mechanism for
      receiving dev_pagemap ->page_free() callbacks. Rework the HMM-specific
      static-key into a generic mechanism that either HMM or FS_DAX code paths
      can enable.
      
      For ARCH=um builds, and any other arch that lacks ZONE_DEVICE support,
      care must be taken to compile out the DEV_PAGEMAP_OPS infrastructure.
      However, we still need to support FS_DAX in the FS_DAX_LIMITED case
      implemented by the s390/dcssblk driver.
      
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Reported-by: kbuild test robot <lkp@intel.com>
      Reported-by: Thomas Meyer <thomas@m3y3r.de>
      Reported-by: Dave Jiang <dave.jiang@intel.com>
      Cc: "Jérôme Glisse" <jglisse@redhat.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      e7638488
  17. 15 May, 2018 1 commit
    • D
      x86/asm/memcpy_mcsafe: Return bytes remaining · 60622d68
      Dan Williams committed
      Machine check safe memory copies are currently deployed in the pmem
      driver whenever reading from persistent memory media, so that -EIO is
      returned rather than triggering a kernel panic. While this protects most
      pmem accesses, it is not complete in the filesystem-dax case. When
      filesystem-dax is enabled reads may bypass the block layer and the
      driver via dax_iomap_actor() and its usage of copy_to_iter().
      
      In preparation for creating a copy_to_iter() variant that can handle
      machine checks, teach memcpy_mcsafe() to return the number of bytes
      remaining rather than -EFAULT when an exception occurs.
      Co-developed-by: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: hch@lst.de
      Cc: linux-fsdevel@vger.kernel.org
      Cc: linux-nvdimm@lists.01.org
      Link: http://lkml.kernel.org/r/152539238119.31796.14318473522414462886.stgit@dwillia2-desk3.amr.corp.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      60622d68
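      The "return bytes remaining" convention can be illustrated with a
      user-space simulation. Real machine checks cannot be triggered
      portably, so the poison_off parameter below is a hypothetical stand-in
      for the offset at which the hardware would raise an exception; the
      function name is also an assumption.

```c
#include <stddef.h>

#define NO_POISON ((size_t)-1) /* sentinel: no simulated media error */

/*
 * Copy len bytes from src to dst, stopping at the simulated poisoned
 * offset. Returns the number of bytes NOT copied (0 on full success),
 * instead of a bare error code such as -EFAULT.
 */
size_t copy_mcsafe_sketch(unsigned char *dst, const unsigned char *src,
			  size_t len, size_t poison_off)
{
	for (size_t i = 0; i < len; i++) {
		if (i == poison_off)
			return len - i; /* bytes remaining after the fault */
		dst[i] = src[i];
	}
	return 0;
}
```

      A caller can compute the completed length as len minus the return
      value, which is what a copy_to_iter()-style helper needs in order to
      report a short read rather than failing the whole request.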
  18. 20 Apr, 2018 2 commits
  19. 16 Apr, 2018 1 commit
    • D
      libnvdimm, dimm: handle EACCES failures from label reads · e7c5a571
      Dan Williams committed
      The new support for the standard _LSR and _LSW methods neglected to also
      update the nvdimm_init_config_data() and nvdimm_set_config_data() to
      return the translated error code from failed commands. This precision is
      necessary because the locked status that was previously returned on
      ND_CMD_GET_CONFIG_SIZE commands is now returned on
      ND_CMD_{GET,SET}_CONFIG_DATA commands.
      
      If the kernel misses this indication it can inadvertently fall back to
      label-less mode when it should otherwise avoid all access to locked
      regions.
      
      Cc: <stable@vger.kernel.org>
      Fixes: 4b27db7e ("acpi, nfit: add support for the _LSI, _LSR, and...")
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      e7c5a571
  20. 10 Apr, 2018 1 commit
    • D
      libnvdimm, of_pmem: workaround OF_NUMA=n build error · 291717b6
      Dan Williams committed
      Stephen reports that an x86 allmodconfig build fails to build the
      of_pmem driver due to a missing definition of of_node_to_nid(). That
      helper is currently only exported in the OF_NUMA=y case. In other cases,
      ppc and sparc, it is a weak symbol, and outside of those platforms it is
      a static inline.
      
      Until an OF_NUMA=n configuration can reliably support usage of
      of_node_to_nid() in modules across architectures, mark this driver as
      'bool' instead of 'tristate'.
      
      Cc: Rob Herring <robh@kernel.org>
      Cc: Oliver O'Halloran <oohall@gmail.com>
      Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      291717b6
  21. 07 Apr, 2018 5 commits
  22. 04 Apr, 2018 1 commit
  23. 03 Apr, 2018 1 commit
  24. 22 Mar, 2018 2 commits
    • D
      libnvdimm, nfit: fix persistence domain reporting · fe9a552e
      Dan Williams committed
      The persistence domain is a point in the platform where once writes
      reach that destination the platform claims it will make them persistent
      relative to power loss. In the ACPI NFIT this is currently communicated
      as 2 bits in the "NFIT - Platform Capabilities Structure". The bits
      comprise a hierarchy, i.e. bit0 "CPU Cache Flush to NVDIMM Durability on
      Power Loss Capable" implies bit1 "Memory Controller Flush to NVDIMM
      Durability on Power Loss Capable".
      
      Commit 96c3a239 "libnvdimm: expose platform persistence attr..."
      shows the persistence domain as flags, but it's really an enumerated
      hierarchy.
      
      Fix this newly introduced user ABI to show the closest available
      persistence domain before userspace develops dependencies on seeing, or
      needing to develop code to tolerate, the raw NFIT flags communicated
      through the libnvdimm-generic region attribute.
      
      Fixes: 96c3a239 ("libnvdimm: expose platform persistence attr...")
      Reviewed-by: Dave Jiang <dave.jiang@intel.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      fe9a552e
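      Treating the two capability bits as a hierarchy rather than as
      independent flags can be sketched as below. The bit positions follow
      the message above (bit0 for CPU cache flush, bit1 for memory
      controller flush); the macro names, the function name, and the exact
      strings returned are assumptions of this example.

```c
#include <stddef.h>

#define CAP_CPU_CACHE_FLUSH (1u << 0) /* bit0: CPU cache flushed on power loss */
#define CAP_MEM_CTRL_FLUSH  (1u << 1) /* bit1: memory controller flushed */

/*
 * Return the closest (most capable) persistence domain for the given
 * capability bits, or NULL when the domain is unknown.
 */
const char *persistence_domain(unsigned int caps)
{
	/* bit0 implies bit1, so it wins whenever it is set. */
	if (caps & CAP_CPU_CACHE_FLUSH)
		return "cpu_cache";
	if (caps & CAP_MEM_CTRL_FLUSH)
		return "memory_controller";
	return NULL;
}
```

      Returning a single name keeps userspace from depending on raw flag
      combinations, and a NULL result gives the caller a natural way to skip
      emitting the attribute when the domain is unknown.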
    • D
      libnvdimm, region: hide persistence_domain when unknown · 896196dc
      Dan Williams committed
      Similar to other region attributes, do not emit the persistence_domain
      attribute if its contents are empty.
      
      Fixes: 96c3a239 ("libnvdimm: expose platform persistence attr...")
      Cc: Dave Jiang <dave.jiang@intel.com>
      Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      896196dc