1. 08 Oct, 2017 2 commits
    • D
      libnvdimm: introduce 'flags' attribute for DIMM 'lock' and 'alias' status · efbf6f50
      Dan Williams committed
      Given that we now have two mechanisms for a DIMM to indicate that it
      is locked:
      
          * NVDIMM_FAMILY_INTEL 'get_config_size' _DSM command
      
          * ACPI 6.2 Label Storage Read / Write commands
      
      ...export the generic libnvdimm DIMM status in a new 'flags' attribute.
      
      This attribute can also reflect the 'alias' state, which indicates
      whether the nvdimm core is enforcing labels for the aliased region
      capacity of which the given DIMM is an interleave-set member.
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • D
      acpi, nfit: add support for the _LSI, _LSR, and _LSW label methods · 4b27db7e
      Dan Williams committed
      ACPI 6.2 adds support for named methods to access the label storage area
      of an NVDIMM. We prefer these new methods if available and otherwise
      fall back to the NVDIMM_FAMILY_INTEL _DSMs. The kernel ioctls,
      ND_IOCTL_{GET,SET}_CONFIG_{SIZE,DATA}, remain generic and the driver
      translates the 'package' payloads into the NVDIMM_FAMILY_INTEL 'buffer'
      format to maintain compatibility with existing userspace and keep the
      output buffer parsing code in the driver common.
      
      The output payloads are mostly compatible save for the 'label area
      locked' status that moves from the 'config_size' (_LSI) command to the
      'config_read' (_LSR) command status.
      
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
  2. 29 Sep, 2017 5 commits
  3. 19 Sep, 2017 1 commit
    • D
      libnvdimm, namespace: fix btt claim class crash · 33a56086
      Dan Williams committed
      Maurice reports:
      
          BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
          IP: holder_class_store+0x253/0x2b0 [libnvdimm]
      
      ...while trying to reconfigure an NVDIMM-N namespace into 'sector' /
      'btt' mode. The crash points to this line:
      
          (gdb) li *(holder_class_store+0x253)
          0x7773 is in holder_class_store (drivers/nvdimm/namespace_devs.c:1420).
          1415            for (i = 0; i < nd_region->ndr_mappings; i++) {
          1416                    struct nd_mapping *nd_mapping = &nd_region->mapping[i];
          1417                    struct nvdimm_drvdata *ndd = to_ndd(nd_mapping);
          1418                    struct nd_namespace_index *nsindex;
          1419
          1420                    nsindex = to_namespace_index(ndd, ndd->ns_current);
      
      ...where we are failing because ndd is NULL due to NVDIMM-N dimms not
      supporting labels.
      
      Long story short, default to the BTTv1 format in the label-less /
      NVDIMM-N case.
      
      Fixes: 14e49454 ("libnvdimm, btt: BTT updates for UEFI 2.7 format")
      Cc: <stable@vger.kernel.org>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Reported-by: Maurice A. Saldivar <maurice.a.saldivar@hpe.com>
      Tested-by: Maurice A. Saldivar <maurice.a.saldivar@hpe.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
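      The fix described above can be pictured with a small userspace sketch.
      This is not the kernel code: `struct drvdata`, `enum claim_class` and
      `btt_class_for()` are hypothetical stand-ins, and the rule that a v1.1
      label index selects BTTv1 while newer labels select BTTv2 is an
      assumption made for illustration. The point is only the NULL check that
      runs before any label data is dereferenced.

      ```c
      #include <stddef.h>

      /* Hypothetical stand-ins for the kernel types named in the commit. */
      enum claim_class { CLAIM_CLASS_BTT, CLAIM_CLASS_BTT2 };

      struct drvdata {              /* models per-DIMM label data (ndd) */
          unsigned int label_major; /* assumed: label index major version */
          unsigned int label_minor; /* assumed: label index minor version */
      };

      /*
       * Pick a BTT format for one mapping. ndd is NULL when the DIMM has
       * no label support (the NVDIMM-N case), so check it before use.
       */
      enum claim_class btt_class_for(const struct drvdata *ndd)
      {
          if (ndd == NULL)
              return CLAIM_CLASS_BTT;   /* label-less: default to BTTv1 */

          /* Assumption: v1.1 labels pair with BTTv1, newer with BTTv2. */
          if (ndd->label_major == 1 && ndd->label_minor == 1)
              return CLAIM_CLASS_BTT;
          return CLAIM_CLASS_BTT2;
      }
      ```

      Calling `btt_class_for(NULL)` returns the BTTv1 class instead of
      dereferencing a NULL pointer, which is the behavior the commit asks for.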
  4. 11 Sep, 2017 1 commit
    • M
      dax: remove the pmem_dax_ops->flush abstraction · c3ca015f
      Mikulas Patocka committed
      Commit abebfbe2 ("dm: add ->flush() dax operation support") is
      buggy. A DM device may be composed of multiple underlying devices and
      all of them need to be flushed. That commit just routes the flush
      request to the first device and ignores the other devices.
      
      It could be fixed by adding more complex logic to the device mapper. But
      there is only one implementation of the method pmem_dax_ops->flush - that
      is pmem_dax_flush() - and it calls arch_wb_cache_pmem(). Consequently, we
      don't need the pmem_dax_ops->flush abstraction at all, we can call
      arch_wb_cache_pmem() directly from dax_flush() because dax_dev->ops->flush
      can't ever reach anything different from arch_wb_cache_pmem().
      
      It should also be pointed out that some uses of persistent memory need
      to flush only a very small amount of data (such as one cache line), and
      going through the device mapper machinery for a single flushed cache
      line would be overkill.
      
      Fix this by removing the pmem_dax_ops->flush abstraction and call
      arch_wb_cache_pmem() directly from dax_flush(). Also, remove the device
      mapper code that forwards the flushes.
      
      Fixes: abebfbe2 ("dm: add ->flush() dax operation support")
      Cc: stable@vger.kernel.org
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
  5. 10 Sep, 2017 1 commit
  6. 08 Sep, 2017 1 commit
  7. 07 Sep, 2017 1 commit
  8. 05 Sep, 2017 1 commit
    • M
      libnvdimm, nfit: move the check on nd_reserved2 to the endpoint · 9edcad53
      Meng Xu committed
      Delay the check of nd_reserved2 to the actual endpoint (acpi_nfit_ctl)
      that uses it, to prevent a potential double-fetch bug.
      
      While examining the kernel source code, I found a dangerous operation that
      could turn into a double-fetch situation (a race condition bug) where
      the same userspace memory region is fetched twice into the kernel, with
      sanity checks after the first fetch but no checks after the second fetch.
      
      In the case of _IOC_NR(ioctl_cmd) == ND_CMD_CALL:
      
      1. The first fetch happens in line 935 copy_from_user(&pkg, p, sizeof(pkg))
      
      2. subsequently `pkg.nd_reserved2` is asserted to be all zeroes
      (line 984 to 986).
      
      3. The second fetch happens in line 1022 copy_from_user(buf, p, buf_len)
      
      4. Given that `p` can be fully controlled in userspace, an attacker can
      race to overwrite the header part of `p`, say,
      `((struct nd_cmd_pkg *)p)->nd_reserved2`, with arbitrary values
      (say nine 0xFFFFFFFF words for `nd_reserved2`) after the first fetch but
      before the second fetch. The changed value will be copied to `buf`.
      
      5. There are no checks on the second fetch until its use in
      line 1034: nd_cmd_clear_to_send(nvdimm_bus, nvdimm, cmd, buf) and
      line 1038: nd_desc->ndctl(nd_desc, nvdimm, cmd, buf, buf_len, &cmd_rc),
      which means that the assumed relation, that `p->nd_reserved2` is all
      zeroes, might not hold after the second fetch. Once control passes to
      these functions, we lose the context needed to assert that relation.
      
      6. Based on my manual analysis, `p->nd_reserved2` is not used in function
      `nd_cmd_clear_to_send` and potential implementations of `nd_desc->ndctl`
      so there is no working exploit against it right now. However, this could
      easily turn into an exploitable one if careless developers start to use
      `p->nd_reserved2` later and assume that it is all zeroes.
      
      Move the validation of the nd_reserved2 field to the ->ndctl()
      implementation where it has a stable buffer to evaluate.
      Signed-off-by: Meng Xu <mengxu.gatech@gmail.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
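      The double-fetch pattern and its remedy can be modeled in userspace.
      The sketch below is a simplified illustration, not the driver code:
      `struct pkg_hdr` is a cut-down, hypothetical model of
      `struct nd_cmd_pkg`, `memcpy()` stands in for `copy_from_user()`, and
      `handle_call()` is an invented name. It validates the reserved words
      in the kernel-side copy, which userspace can no longer change, rather
      than in an earlier fetch of the same memory.

      ```c
      #include <stddef.h>
      #include <stdint.h>
      #include <string.h>

      #define RESERVED2_WORDS 9   /* the commit mentions nine 32-bit words */

      struct pkg_hdr {            /* hypothetical, simplified header model */
          uint32_t command;
          uint32_t reserved2[RESERVED2_WORDS];
      };

      /* Return 1 if every reserved word is zero, 0 otherwise. */
      static int reserved_is_zero(const struct pkg_hdr *h)
      {
          for (size_t i = 0; i < RESERVED2_WORDS; i++)
              if (h->reserved2[i] != 0)
                  return 0;
          return 1;
      }

      /*
       * Endpoint-style handling: fetch once into a stable buffer, then
       * validate that buffer. Later changes to user_mem cannot affect
       * what was checked.
       */
      int handle_call(const void *user_mem, struct pkg_hdr *buf)
      {
          memcpy(buf, user_mem, sizeof(*buf)); /* models copy_from_user() */
          if (!reserved_is_zero(buf))
              return -1;                       /* kernel would use -EINVAL */
          return 0;
      }
      ```

      Because the check reads `buf` and never re-reads `user_mem`, a
      concurrent writer in userspace has no window between check and use.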
  9. 01 Sep, 2017 8 commits
    • D
      libnvdimm: fix integer overflow static analysis warning · 58738c49
      Dan Williams committed
      Dan reports:
          The patch 62232e45: "libnvdimm: control (ioctl) messages for
          nvdimm_bus and nvdimm devices" from Jun 8, 2015, leads to the
          following static checker warning:
      
                  drivers/nvdimm/bus.c:1018 __nd_ioctl()
                  warn: integer overflows 'buf_len'
      
          From a casual review, this seems like it might be a real bug.  On
          the first iteration we load some data into in_env[].  On the second
          iteration we read a user controlled "in_size" from nd_cmd_in_size().
          It can go up to UINT_MAX - 1.  A high number means we will fill the
          whole in_env[] buffer.  But we potentially keep looping and adding
          more to in_len so now it can be any value.
      
          It's simple enough to change, but it feels weird that we keep looping
          even though in_env is totally full.  Shouldn't we just return an
          error if we don't have space for desc->in_num?
      
      We keep looping because the size of the total input is allowed to be
      bigger than the 'envelope', which is a subset of the payload that tells
      us how much data to expect. For safety, explicitly check that buf_len
      does not overflow, which is what the checker flagged.
      
      Cc: <stable@vger.kernel.org>
      Fixes: 62232e45 ("libnvdimm: control (ioctl) messages for nvdimm_bus...")
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
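      One common way to make such a length check overflow-safe is to do the
      addition in a wider type and compare against an upper bound. The sketch
      below assumes that approach; `checked_buf_len()` and the `MAX_BUF_LEN`
      cap are hypothetical names and values chosen for illustration, not
      taken from the patch.

      ```c
      #include <stddef.h>
      #include <stdint.h>

      /* Hypothetical upper bound on the combined payload size. */
      #define MAX_BUF_LEN ((uint64_t)4 * 1024 * 1024)

      /*
       * Combine user-controlled input and output lengths without wrapping.
       * The sum is computed in 64 bits, so two 32-bit values cannot
       * overflow it; anything above the cap is rejected.
       */
      int checked_buf_len(uint32_t in_len, uint32_t out_len, size_t *buf_len)
      {
          uint64_t total = (uint64_t)in_len + (uint64_t)out_len;

          if (total > MAX_BUF_LEN)
              return -1;            /* kernel code would return -EINVAL */
          *buf_len = (size_t)total;
          return 0;
      }
      ```

      With plain 32-bit arithmetic, `UINT32_MAX + 2` would wrap around to 1
      and pass a naive size check; here the same inputs are rejected.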
    • R
      libnvdimm, nd_blk: remove mmio_flush_range() · 5deb67f7
      Robin Murphy committed
      mmio_flush_range() suffers from a lack of clearly-defined semantics,
      and is somewhat ambiguous to port to other architectures where the
      scope of the writeback implied by "flush" and ordering might matter,
      but MMIO would tend to imply non-cacheable anyway. Per the rationale
      in 67a3e8fe ("nd_blk: change aperture mapping from WC to WB"), the
      only existing use is actually to invalidate clean cache lines for
      ARCH_MEMREMAP_PMEM type mappings *without* writeback. Since the recent
      cleanup of the pmem API, that also now happens to be the exact purpose
      of arch_invalidate_pmem(), which would be a far more well-defined tool
      for the job.
      
      Rather than risk potentially inconsistent implementations of
      mmio_flush_range() for the sake of one callsite, streamline things by
      removing it entirely and instead move the ARCH_MEMREMAP_PMEM related
      definitions up to the libnvdimm level, so they can be shared by NFIT
      as well. This allows NFIT to be enabled for arm64.
      Signed-off-by: Robin Murphy <robin.murphy@arm.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • V
      libnvdimm, btt: rework error clearing · d9b83c75
      Vishal Verma committed
      Clearing errors or badblocks during a BTT write requires sending an ACPI
      DSM, which means potentially sleeping. Since a BTT IO happens in atomic
      context (preemption disabled, spinlocks may be held), we cannot perform
      error clearing in the course of an IO. Due to this, error clearing for
      BTT IOs has hitherto been disabled.
      
      In this patch we move error clearing out of the atomic section, and thus
      re-enable error clearing with BTTs. When we are about to add a block to
      the free list, we check if it was previously marked as an error, and if
      it was, we add it to the freelist, but also set a flag that says error
      clearing will be required. We then drop the lane (ending the atomic
      context), and send a zero buffer so that the error can be cleared. The
      error flag in the free list is protected by the nd 'lane', and is set
      only by a thread while it holds that lane. When the error is cleared,
      the flag is cleared, but while holding a mutex for that freelist index.
      
      When writing, we check for two things -
      1/ If the freelist mutex is held or if the error flag is set. If so,
      this is an error block that is being (or about to be) cleared.
      2/ If the block is a known badblock based on nsio->bb
      
      The second check is required because the BTT map error flag for a map
      entry only gets set when an error LBA is read. If we write to a new
      location that may not have the map error flag set, but still might be in
      the region's badblock list, we can trigger an EIO on the write, which is
      undesirable and completely avoidable.
      
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • V
      libnvdimm: fix potential deadlock while clearing errors · 0930a750
      Vishal Verma committed
      With the ACPI NFIT 'DSM' methods, acpi can be called from IO paths.
      Specifically, the DSM to clear media errors is called during writes, so
      that we can provide a writes-fix-errors model.
      
      However it is easy to imagine a scenario like:
       -> write through the nvdimm driver
         -> acpi allocation
           -> writeback, causes more IO through the nvdimm driver
             -> deadlock
      
      Fix this by using memalloc_noio_{save,restore}, which sets the GFP_NOIO
      flag for the current scope when issuing commands/IOs that are expected
      to clear errors.
      
      Cc: <linux-acpi@vger.kernel.org>
      Cc: <linux-nvdimm@lists.01.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Robert Moore <robert.moore@intel.com>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
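      The scoped save/restore idea behind memalloc_noio_{save,restore} can be
      modeled in userspace. The sketch below is only a model under stated
      assumptions: `task_flags`, `MODEL_NOIO` and every function name are
      hypothetical, and a single global stands in for per-task state. It
      shows why the saved value is passed back on restore: nested scopes
      then unwind correctly.

      ```c
      #include <stdbool.h>

      /* Hypothetical userspace model of a per-task flag word. */
      #define MODEL_NOIO 0x1u
      static unsigned int task_flags;

      /* Enter a no-IO scope; returns the previous state of the NOIO bit. */
      unsigned int model_noio_save(void)
      {
          unsigned int old = task_flags & MODEL_NOIO;

          task_flags |= MODEL_NOIO;
          return old;
      }

      /* Leave the scope, putting the NOIO bit back to its saved state. */
      void model_noio_restore(unsigned int old)
      {
          task_flags = (task_flags & ~MODEL_NOIO) | old;
      }

      /* An allocator would consult this before starting writeback. */
      bool alloc_may_start_io(void)
      {
          return !(task_flags & MODEL_NOIO);
      }

      /* Hypothetical error-clearing path: the command runs in the scope. */
      bool clear_error_model(void)
      {
          unsigned int saved = model_noio_save();
          bool io_allowed_inside = alloc_may_start_io(); /* false in scope */

          model_noio_restore(saved);
          return io_allowed_inside;
      }
      ```

      Any allocation made while the error-clearing command is in flight sees
      the flag and avoids starting IO, which breaks the cycle in the scenario
      listed in the commit message.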
    • V
      libnvdimm, btt: cache sector_size in arena_info · 75892004
      Vishal Verma committed
      In preparation for the error clearing rework, add sector_size in the
      arena_info struct.
      Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • V
      libnvdimm, btt: ensure that flags were also unchanged during a map_read · 1398199d
      Vishal Verma committed
      In btt_map_read, we read the map twice to make sure that the map entry
      didn't change after we added it to the read tracking table. In
      anticipation of expanding the use of the error bit, also make sure that
      the error and zero flags are constant across the two map reads.
      Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • V
      libnvdimm, btt: refactor map entry operations with macros · 0595d539
      Vishal Verma committed
      Add helpers for converting a raw map entry to just the block number, or
      either of the 'e' or 'z' flags in preparation for actually using the
      error flag to mark blocks with media errors.
      Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
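      Helpers of this kind might look like the sketch below. It assumes a
      32-bit map entry whose top two bits carry the 'z' (zero) and 'e'
      (error) flags, with the remaining 30 bits holding the block number.
      The macro names and bit positions are illustrative assumptions, not
      a copy of the patch.

      ```c
      #include <stdint.h>

      /* Assumed layout: bit 31 = 'z' flag, bit 30 = 'e' flag, rest = LBA. */
      #define MAP_Z_MASK    (UINT32_C(1) << 31)
      #define MAP_E_MASK    (UINT32_C(1) << 30)
      #define MAP_LBA_MASK  (~(MAP_Z_MASK | MAP_E_MASK))

      /* Extract the block number from a raw map entry. */
      #define ent_lba(ent)     ((ent) & MAP_LBA_MASK)
      /* Extract each flag as 0 or 1. */
      #define ent_e_flag(ent)  (!!((ent) & MAP_E_MASK))
      #define ent_z_flag(ent)  (!!((ent) & MAP_Z_MASK))
      ```

      Keeping the masks in one place means callers never repeat the bit
      arithmetic, so a later change in how the error flag is used touches
      only these definitions.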
    • V
      libnvdimm, btt: fix a missed NVDIMM_IO_ATOMIC case in the write path · 1db1f3ce
      Vishal Verma committed
      The IO context conversion for rw_bytes missed a case in the BTT write
      path (btt_map_write) which should've been marked as atomic.
      
      In reality this should not cause a problem, because map writes are too
      small for nsio_rw_bytes to attempt error clearing, but it should be
      fixed for posterity.
      
      Add a might_sleep() in the non-atomic section of nsio_rw_bytes so that
      things like the nfit unit tests, which don't actually sleep, can catch
      bugs like this.
      
      Cc: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
  10. 30 Aug, 2017 2 commits
  11. 24 Aug, 2017 1 commit
    • C
      block: replace bi_bdev with a gendisk pointer and partitions index · 74d46992
      Christoph Hellwig committed
      This way we don't need a block_device structure to submit I/O.  The
      block_device has different lifetime rules from the gendisk and
      request_queue and is usually only available when the block device node
      is open.  Other callers need to explicitly create one (e.g. the lightnvm
      passthrough code, or the new nvme multipathing code).
      
      For the actual I/O path all that we need is the gendisk, which exists
      once per block device.  But given that the block layer also does
      partition remapping we additionally need a partition index, which is
      used for said remapping in generic_make_request.
      
      Note that all the block drivers generally want request_queue or
      sometimes the gendisk, so this removes a layer of indirection all
      over the stack.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  12. 16 Aug, 2017 1 commit
  13. 12 Aug, 2017 2 commits
  14. 10 Aug, 2017 1 commit
  15. 05 Aug, 2017 1 commit
  16. 26 Jul, 2017 1 commit
    • O
      libnvdimm: Stop using HPAGE_SIZE · 0dd69643
      Oliver O'Halloran committed
      Currently libnvdimm uses HPAGE_SIZE as the default alignment for DAX and
      PFN devices. HPAGE_SIZE is the default hugetlbfs page size and when
      hugetlbfs is disabled it defaults to PAGE_SIZE. Given DAX has more
      in common with THP than hugetlbfs we should probably be using
      HPAGE_PMD_SIZE, but this is undefined when THP is disabled so let's just
      give it a new name.
      
      The other usage of HPAGE_SIZE in libnvdimm is when determining how large
      the altmap should be. For the reasons mentioned above it doesn't really
      make sense to use HPAGE_SIZE here either. PMD_SIZE seems to be safe to
      use in generic code and it happens to match the vmemmap allocation block
      on x86 and Power. It's still a hack, but it's a slightly nicer hack.
      Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
  17. 18 Jul, 2017 1 commit
  18. 04 Jul, 2017 3 commits
  19. 01 Jul, 2017 4 commits
  20. 30 Jun, 2017 2 commits