1. 01 May, 2017 1 commit
    • mm, zone_device: Replace {get, put}_zone_device_page() with a single reference to fix pmem crash · 71389703
      Dan Williams committed
      The x86 conversion to the generic GUP code included a small change which causes
      crashes and data corruption in the pmem code - not good.
      
      The root cause is that the /dev/pmem driver code implicitly relies on the x86
      get_user_pages() implementation doing a get_page() on the page refcount, because
      get_page() does a get_zone_device_page() which properly refcounts pmem's separate
      page struct arrays that are not present in the regular page struct structures.
      (The pmem driver does this because it can cover huge memory areas.)
      
      But the x86 conversion to the generic GUP code changed the get_page() to
      page_cache_get_speculative() which is faster but doesn't do the
      get_zone_device_page() call the pmem code relies on.
      
      One way to solve the regression would be to change the generic GUP code to use
      get_page(), but that would slow things down a bit and punish other generic-GUP
      using architectures for an x86-ism they did not care about. (Arguably the pmem
      driver was probably not working reliably for them: but nvdimm is an Intel
      feature, so non-x86 exposure is probably still limited.)
      
      So restructure the pmem code's interface with the MM instead: get rid of the
      get/put_zone_device_page() distinction, integrate put_zone_device_page() into
      __put_page() and restructure the pmem completion-wait and teardown machinery:
      
      Kirill points out that the calls to {get,put}_dev_pagemap() can be
      removed from the mm fast path if we take a single get_dev_pagemap()
      reference to signify that the page is alive and use the final put of the
      page to drop that reference.
      
      This does require some care to make sure that any waits for the
      percpu_ref to drop to zero occur *after* devm_memremap_page_release(),
      since it now maintains its own elevated reference.
      
      This speeds up things while also making the pmem refcounting more robust going
      forward.
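
A minimal user-space sketch of the refcounting scheme described above, assuming a simplified model in which each page pins its pagemap exactly once for as long as the page is alive. All `toy_*` names are hypothetical and are not kernel APIs; the real code uses a percpu_ref and get_dev_pagemap().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the device pagemap and its reference count. */
struct toy_pagemap {
	long live_refs;		/* one reference per page that is still alive */
};

struct toy_page {
	int refcount;
	struct toy_pagemap *pgmap;
};

/* Setup: the page starts with one reference and pins the pagemap once. */
static void toy_page_init(struct toy_page *page, struct toy_pagemap *pgmap)
{
	page->refcount = 1;
	page->pgmap = pgmap;
	pgmap->live_refs++;
}

/* Fast path: only the page refcount changes, no pagemap accounting. */
static void toy_get_page(struct toy_page *page)
{
	assert(page->refcount > 0);
	page->refcount++;
}

/* The final put of the page drops the single pagemap reference. */
static void toy_put_page(struct toy_page *page)
{
	assert(page->refcount > 0);
	if (--page->refcount == 0)
		page->pgmap->live_refs--;
}

/* Teardown may only treat the pagemap as idle once every page is released. */
static int toy_pagemap_idle(const struct toy_pagemap *pgmap)
{
	return pgmap->live_refs == 0;
}
```

In this model a teardown path has to drop each page's initial reference before it waits for the pagemap count to reach zero, which mirrors the ordering requirement in the commit message.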
      Suggested-by: Kirill Shutemov <kirill.shutemov@linux.intel.com>
      Tested-by: Kirill Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/149339998297.24933.1129582806028305912.stgit@dwillia2-desk3.amr.corp.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  2. 29 Apr, 2017 1 commit
    • libnvdimm, pmem: fix a NULL pointer BUG in nd_pmem_notify · b2518c78
      Toshi Kani committed
      The following BUG was observed when nd_pmem_notify() was called
      for a BTT device.  The use of a pmem_device pointer is not valid
      with BTT.
      
       BUG: unable to handle kernel NULL pointer dereference at 0000000000000030
       IP: nd_pmem_notify+0x30/0xf0 [nd_pmem]
       Call Trace:
        nd_device_notify+0x40/0x50
        child_notify+0x10/0x20
        device_for_each_child+0x50/0x90
        nd_region_notify+0x20/0x30
        nd_device_notify+0x40/0x50
        nvdimm_region_notify+0x27/0x30
        acpi_nfit_scrub+0x341/0x590 [nfit]
        process_one_work+0x197/0x450
        worker_thread+0x4e/0x4a0
        kthread+0x109/0x140
      
      Fix nd_pmem_notify() by setting nd_region and badblocks pointers
      properly for BTT.
      
      Cc: <stable@vger.kernel.org>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Fixes: 71999466 ("libnvdimm: async notification support")
      Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
  3. 26 Apr, 2017 2 commits
    • x86, dax, pmem: remove indirection around memcpy_from_pmem() · 6abccd1b
      Dan Williams committed
      memcpy_from_pmem() maps directly to memcpy_mcsafe(). The wrapper
      serves no real benefit aside from affording a more generic function name
      than the x86-specific 'mcsafe'. However this would not be the first time
      that x86 terminology leaked into the global namespace. For lack of a
      better name, just use memcpy_mcsafe() directly.
      
      This conversion also catches a place where we should have been using
      plain memcpy, acpi_nfit_blk_single_io().
      
      Cc: <x86@kernel.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Acked-by: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • block: remove block_device_operations ->direct_access() · d4b29fd7
      Dan Williams committed
      Now that all the producers and consumers of dax interfaces have been
      converted to using dax_operations on a dax_device, remove the block
      device direct_access enabling.
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
  4. 20 Apr, 2017 1 commit
  5. 13 Jan, 2017 1 commit
  6. 17 Dec, 2016 1 commit
    • libnvdimm: fix mishandled nvdimm_clear_poison() return value · 868f036f
      Dan Williams committed
      Colin, via static analysis, reports that the length could be negative
      from nvdimm_clear_poison() in the error case. There was a similar
      problem with commit 0a3f27b9 "libnvdimm, namespace: avoid multiple
      sector calculations" that I noticed when merging the for-4.10/libnvdimm
      topic branch into libnvdimm-for-next, but I missed this one. Fix both of
      them to follow this procedure:
      
      * if we clear a block's worth of media, clear that many blocks in
        badblocks
      
      * if we clear less than the requested size of the transfer return an
        error
      
      * always invalidate cache after any non-error / non-zero
        nvdimm_clear_poison result
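
The three rules above can be sketched as a small user-space function. This is an illustration under stated assumptions, not the driver code: `toy_handle_clear()` and `struct toy_clear_result` are hypothetical names, `cleared` stands for the value returned by nvdimm_clear_poison(), and a 512-byte badblocks sector is assumed.

```c
#include <errno.h>

#define TOY_SECTOR_SIZE 512	/* assumed badblocks granularity */

struct toy_clear_result {
	long sectors_cleared;	/* sectors to clear in the badblocks list */
	int invalidate_cache;	/* nonzero if the cache must be invalidated */
	int rc;			/* 0 on success, -EIO on a short or failed clear */
};

static struct toy_clear_result toy_handle_clear(long cleared, unsigned int len)
{
	struct toy_clear_result res = { 0, 0, 0 };

	/* A negative (error) result or a short clear is reported as an error. */
	if (cleared < 0 || (unsigned long)cleared < len)
		res.rc = -EIO;

	/* Any non-error, non-zero result still updates badblocks and the cache. */
	if (cleared > 0) {
		res.sectors_cleared = cleared / TOY_SECTOR_SIZE;
		res.invalidate_cache = 1;
	}
	return res;
}
```

Checking the sign before comparing against the unsigned length avoids the mixed signed/unsigned comparison that lets a negative return value slip through.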
      
      Fixes: 82bf1037 ("libnvdimm: check and clear poison before writing to pmem")
      Fixes: 0a3f27b9 ("libnvdimm, namespace: avoid multiple sector calculations")
      Cc: Fabian Frederick <fabf@skynet.be>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Reported-by: Colin Ian King <colin.king@canonical.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
  7. 05 Dec, 2016 1 commit
  8. 29 Nov, 2016 1 commit
    • libnvdimm: use consistent naming for request_mem_region() · 450c6633
      Dan Williams committed
      Here is an example /proc/iomem listing for a system with 2 namespaces,
      one in "sector" mode and one in "memory" mode:
      
        1fc000000-2fbffffff : Persistent Memory (legacy)
          1fc000000-2fbffffff : namespace1.0
        340000000-34fffffff : Persistent Memory
          340000000-34fffffff : btt0.1
      
      Here is the corresponding ndctl listing:
      
        # ndctl list
        [
          {
            "dev":"namespace1.0",
            "mode":"memory",
            "size":4294967296,
            "blockdev":"pmem1"
          },
          {
            "dev":"namespace0.0",
            "mode":"sector",
            "size":267091968,
            "uuid":"f7594f86-badb-4592-875f-ded577da2eaf",
            "sector_size":4096,
            "blockdev":"pmem0s"
          }
        ]
      
      Notice that the ndctl listing is purely in terms of namespace devices,
      while the iomem listing leaks the internal "btt0.1" implementation
      detail. Given that ndctl requires the namespace device name to change
      the mode, for example:
      
        # ndctl create-namespace --reconfig=namespace0.0 --mode=raw --force
      
      ...use the namespace name in the iomem listing to keep the claiming
      device name consistent across different mode settings.
      
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
  9. 20 Oct, 2016 1 commit
    • pmem: report error on clear poison failure · 3115bb02
      Toshi Kani committed
      The ACPI Clear Uncorrectable Error DSM function may fail or may be
      unsupported on a platform.  pmem_clear_poison() returns without clearing
      badblocks in such cases.  This failure is detected only at the next read
      (-EIO).
      
      This behavior can lead to an issue when a user keeps writing but does not
      read immediately.  For instance, a flight recorder file may be read only
      when it is needed for troubleshooting.
      
      Change pmem_do_bvec() and pmem_clear_poison() to return -EIO so that
      the filesystem can log an error message on a write error.
      
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
  10. 01 Oct, 2016 1 commit
  11. 08 Aug, 2016 2 commits
    • block: rename bio bi_rw to bi_opf · 1eff9d32
      Jens Axboe committed
      Since commit 63a4cc24, bio->bi_rw contains flags in the lower
      portion and the op code in the higher portions. This means that
      old code that relies on manually setting bi_rw is most likely
      going to be broken. Instead of letting that brokenness linger,
      rename the member, to force old and out-of-tree code to break
      at compile time instead of at runtime.
      
      No intended functional changes in this commit.
      Signed-off-by: Jens Axboe <axboe@fb.com>
    • block/mm: make bdev_ops->rw_page() take a bool for read/write · c11f0c0b
      Jens Axboe committed
      Commit abf54548 changed it from an 'rw' flags type to the
      newer ops based interface, but now we're effectively leaking
      some bdev internals to the rest of the kernel. Since we only
      care about whether it's a read or a write at that level, just
      pass in a bool 'is_write' parameter instead.
      
      Then we can also move op_is_write() and friends back under
      CONFIG_BLOCK protection.
      Reviewed-by: Mike Christie <mchristi@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  12. 05 Aug, 2016 1 commit
  13. 24 Jul, 2016 1 commit
  14. 21 Jul, 2016 1 commit
  15. 13 Jul, 2016 2 commits
  16. 12 Jul, 2016 2 commits
    • libnvdimm, pmem: use REQ_FUA, REQ_FLUSH for nvdimm_flush() · 7e267a8c
      Dan Williams committed
      Given that nvdimm_flush() has higher overhead than wmb_pmem() (pointer
      chasing through nd_region), and that we otherwise assume a platform has
      ADR capability when flush hints are not present, move nvdimm_flush() to
      REQ_FLUSH context.
      
      Note that we still arrange for nvdimm_flush() to be called even in the
      ADR case. We need at least one wmb() fence to push buffered writes in
      the cpu out to the ADR protected domain.
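
A rough user-space sketch of moving the flush into request-flag context, assuming a flush before the data transfer for a flush request and a flush after it for a FUA request. The flag values, `struct toy_dev`, and the `toy_*` functions are hypothetical stand-ins and do not match the kernel's definitions.

```c
/* Hypothetical request flags; the values are for illustration only. */
#define TOY_REQ_FLUSH (1u << 0)
#define TOY_REQ_FUA   (1u << 1)

struct toy_dev {
	unsigned int flushes;	/* how many times the expensive flush ran */
	unsigned int writes;	/* how many data transfers were performed */
};

/* Stand-in for nvdimm_flush(): the costly operation we want to avoid per write. */
static void toy_flush(struct toy_dev *dev)
{
	dev->flushes++;
}

static void toy_submit(struct toy_dev *dev, unsigned int flags)
{
	if (flags & TOY_REQ_FLUSH)	/* make earlier writes durable first */
		toy_flush(dev);

	dev->writes++;			/* the data transfer itself */

	if (flags & TOY_REQ_FUA)	/* make this write durable before completing */
		toy_flush(dev);
}
```

A plain write in this model never pays for the flush; only requests that explicitly ask for durability do.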
      
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • libnvdimm: introduce nvdimm_flush() and nvdimm_has_flush() · f284a4f2
      Dan Williams committed
      nvdimm_flush() is a replacement for the x86 'pcommit' instruction.  It is
      an optional write flushing mechanism that an nvdimm bus can provide for
      the pmem driver to consume.  In the case of the NFIT nvdimm-bus-provider
      nvdimm_flush() is implemented as a series of flush-hint-address [1]
      writes to each dimm in the interleave set (region) that backs the
      namespace.
      
      The nvdimm_has_flush() routine relies on platform firmware to describe
      the flushing capabilities of a platform.  It uses the heuristic of
      whether an nvdimm bus provider provides flush address data to return a
      ternary result:
      
            1: flush addresses defined
            0: dimm topology described without flush addresses (assume ADR)
       -errno: no topology information, unable to determine flush mechanism
      
      The pmem driver is expected to take the following actions on this ternary
      result:
      
            1: nvdimm_flush() in response to REQ_FUA / REQ_FLUSH and shutdown
            0: do not set WC or FUA on the queue, take no further action
       -errno: warn and then operate as if nvdimm_has_flush() returned '0'
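
The mapping from the ternary result to driver behavior can be sketched as follows. This is a sketch based only on the table above; `struct toy_queue_policy` and `toy_flush_policy()` are hypothetical names, and `has_flush` stands for the value returned by nvdimm_has_flush().

```c
#include <errno.h>
#include <stdbool.h>

struct toy_queue_policy {
	bool write_cache;	/* advertise a volatile write cache (WC) */
	bool fua;		/* advertise FUA support */
	bool warn;		/* persistence of writes cannot be guaranteed */
};

static struct toy_queue_policy toy_flush_policy(int has_flush)
{
	struct toy_queue_policy policy = { false, false, false };

	if (has_flush < 0) {
		/* -errno: warn, then behave as if the result were 0. */
		policy.warn = true;
		return policy;
	}
	if (has_flush > 0) {
		/* 1: flush addresses defined, so honor REQ_FUA / REQ_FLUSH. */
		policy.write_cache = true;
		policy.fua = true;
	}
	/* 0: assume ADR, leave WC and FUA unset. */
	return policy;
}
```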
      
      The caveat of this heuristic is that it cannot distinguish the "dimm
      does not have flush address" case from the "platform firmware is broken
      and failed to describe a flush address" case.  Given we are already
      explicitly trusting the NFIT there's not much more we can do beyond
      blacklisting broken firmware if it is ever encountered.
      
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
  17. 28 Jun, 2016 1 commit
    • block: convert to device_add_disk() · 0d52c756
      Dan Williams committed
      For block drivers that specify a parent device, convert them to use
      device_add_disk().
      
      This conversion was done with the following semantic patch:
      
          @@
          struct gendisk *disk;
          expression E;
          @@
      
          - disk->driverfs_dev = E;
          ...
          - add_disk(disk);
          + device_add_disk(E, disk);
      
          @@
          struct gendisk *disk;
          expression E1, E2;
          @@
      
          - disk->driverfs_dev = E1;
          ...
          E2 = disk;
          ...
          - add_disk(E2);
          + device_add_disk(E1, E2);
      
      ...plus some manual fixups for a few missed conversions.
      
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: James Bottomley <James.Bottomley@hansenpartnership.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Martin K. Petersen <martin.petersen@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
  18. 25 Jun, 2016 1 commit
    • libnvdimm, pmem: allow nfit_test to override pmem_direct_access() · f295e53b
      Dan Williams committed
      Currently phys_to_pfn_t() is an exported symbol to allow nfit_test to
      override it and indicate that nfit_test-pmem is not device-mapped.  Now,
      we want to enable nfit_test to operate without DMA_CMA and the pmem it
      provides will no longer be physically contiguous, i.e. won't be capable
      of supporting direct_access requests larger than a page.  Make
      pmem_direct_access() a weak symbol so that it can be replaced by the
      tools/testing/nvdimm/ version, and move phys_to_pfn_t() to a static
      inline now that it no longer needs to be overridden.
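
For readers unfamiliar with weak symbols, here is a minimal sketch of the mechanism, assuming a GCC- or Clang-compatible toolchain. `toy_direct_access()` is a hypothetical function, not the driver's pmem_direct_access() signature.

```c
#include <stddef.h>

/*
 * Default implementation, marked weak. If another object file in the link
 * provides a regular (strong) definition of the same symbol, the linker
 * picks that one and silently drops this default.
 */
__attribute__((weak)) long toy_direct_access(size_t requested, size_t *granted)
{
	/* Assumes physically contiguous memory: grant the whole request. */
	*granted = requested;
	return 0;
}
```

A test build could then link in its own strong definition, for example one that never grants more than a single page, without any change to the callers.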
      Acked-by: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
  19. 16 Jun, 2016 1 commit
  20. 21 May, 2016 1 commit
  21. 19 May, 2016 1 commit
  22. 07 May, 2016 1 commit
  23. 01 May, 2016 1 commit
    • libnvdimm, pfn: fix memmap reservation sizing · 658922e5
      Dan Williams committed
      When configuring a pfn-device instance to allocate the memmap array it
      needs to account for the fact that vmemmap_populate_hugepages()
      allocates struct page blocks in HPAGE_SIZE chunks.  We need to align the
      reserved area size to 2MB otherwise arch_add_memory() runs out of memory
      while establishing the memmap:
      
       WARNING: CPU: 0 PID: 496 at arch/x86/mm/init_64.c:704 arch_add_memory+0xe7/0xf0
       [..]
       Call Trace:
        [<ffffffff8148bdb3>] dump_stack+0x85/0xc2
        [<ffffffff810a749b>] __warn+0xcb/0xf0
        [<ffffffff810a75cd>] warn_slowpath_null+0x1d/0x20
        [<ffffffff8106a497>] arch_add_memory+0xe7/0xf0
        [<ffffffff811d2097>] devm_memremap_pages+0x287/0x450
        [<ffffffff811d1ffa>] ? devm_memremap_pages+0x1ea/0x450
        [<ffffffffa0000298>] __wrap_devm_memremap_pages+0x58/0x70 [nfit_test_iomap]
        [<ffffffffa0047a58>] pmem_attach_disk+0x318/0x420 [nd_pmem]
        [<ffffffffa0047bcf>] nd_pmem_probe+0x6f/0x90 [nd_pmem]
        [<ffffffffa0009469>] nvdimm_bus_probe+0x69/0x110 [libnvdimm]
       [..]
        ndbus0: nd_pmem.probe(pfn3.0) = -12
       nd_pmem: probe of pfn3.0 failed with error -12
      libndctl: ndctl_pfn_enable: pfn3.0: failed to enable
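
The sizing arithmetic behind the fix can be sketched as rounding the memmap reservation up to a 2MB boundary. The helper names are hypothetical, and the size of struct page is passed in as a parameter because it depends on the kernel configuration (64 bytes is assumed in the usage note below).

```c
#include <stdint.h>

#define TOY_SZ_2M (2ULL * 1024 * 1024)

/* Round x up to the next multiple of a; a must be a power of two. */
static uint64_t toy_align_up(uint64_t x, uint64_t a)
{
	return (x + a - 1) & ~(a - 1);
}

/* Bytes to reserve for the memmap array covering npfns pages. */
static uint64_t toy_memmap_reserve(uint64_t npfns, uint64_t page_struct_size)
{
	return toy_align_up(npfns * page_struct_size, TOY_SZ_2M);
}
```

With an assumed 64-byte struct page, 65208 pages need 4173312 bytes of memmap, which this helper rounds up to 4194304 bytes (two 2MB chunks).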
      Reported-by: Namratha Kothapalli <namratha.n.kothapalli@intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
  24. 23 Apr, 2016 9 commits
  25. 16 Apr, 2016 1 commit
  26. 08 Apr, 2016 1 commit
    • libnvdimm, pfn: fix nvdimm_namespace_add_poison() vs section alignment · a3901802
      Dan Williams committed
      When section alignment padding is in effect we need to shift / truncate
      the range that is queried for poison by the 'start_pad' or 'end_trunc'
      reservations.
      
      It's easiest if we just pass in an adjusted resource range rather than
      deriving it from the passed in namespace.  With the resource range
      resolution pushed out to the caller we can also push the
      namespace-to-region lookup to the caller and drop the implicit pmem-type
      assumption about the passed in namespace object.
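
The range adjustment described above amounts to the following arithmetic. This is a sketch with hypothetical names (`struct toy_range`, `toy_adjust_range()`), assuming an inclusive end address and that the padding values do not exceed the size of the range.

```c
#include <stdint.h>

/* Hypothetical resource range with an inclusive end address. */
struct toy_range {
	uint64_t start;
	uint64_t end;
};

/*
 * Shift the start forward by start_pad and pull the end back by end_trunc,
 * so that only the usable part of the namespace is queried for poison.
 */
static struct toy_range toy_adjust_range(struct toy_range res,
					 uint64_t start_pad, uint64_t end_trunc)
{
	res.start += start_pad;
	res.end -= end_trunc;
	return res;
}
```

The caller computes the adjusted range once and passes it in, so the poison lookup itself no longer needs to know what kind of namespace it was derived from.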
      
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
  27. 05 Apr, 2016 1 commit
    • mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros · 09cbfeaf
      Kirill A. Shutemov committed
      PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
      ago with the promise that one day it would be possible to implement the
      page cache with bigger chunks than PAGE_SIZE.
      
      This promise never materialized, and it is unlikely that it ever will.
      
      We have many places where PAGE_CACHE_SIZE is assumed to be equal to
      PAGE_SIZE.  And it's a constant source of confusion on whether
      PAGE_CACHE_* or PAGE_* constant should be used in a particular case,
      especially on the border between fs and mm.
      
      Global switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
      breakage to be doable.
      
      Let's stop pretending that pages in page cache are special.  They are
      not.
      
      The changes are pretty straight-forward:
      
       - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
      
       - page_cache_get() -> get_page();
      
       - page_cache_release() -> put_page();
      
      This patch contains automated changes generated with coccinelle using
      script below.  For some reason, coccinelle doesn't patch header files.
      I've called spatch for them manually.
      
      The only adjustment after coccinelle is a revert of the changes to the
      PAGE_CACHE_ALIGN definition: we are going to drop it later.
      
      There are a few places in the code that coccinelle didn't reach.  I'll
      fix them manually in a separate patch.  Comments and documentation will
      also be addressed in a separate patch.
      
      virtual patch
      
      @@
      expression E;
      @@
      - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      expression E;
      @@
      - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      @@
      - PAGE_CACHE_SHIFT
      + PAGE_SHIFT
      
      @@
      @@
      - PAGE_CACHE_SIZE
      + PAGE_SIZE
      
      @@
      @@
      - PAGE_CACHE_MASK
      + PAGE_MASK
      
      @@
      expression E;
      @@
      - PAGE_CACHE_ALIGN(E)
      + PAGE_ALIGN(E)
      
      @@
      expression E;
      @@
      - page_cache_get(E)
      + get_page(E)
      
      @@
      expression E;
      @@
      - page_cache_release(E)
      + put_page(E)
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  28. 29 Mar, 2016 1 commit