1. 05 6月, 2017 1 次提交
  2. 11 5月, 2017 2 次提交
    • V
      libnvdimm, btt: ensure that initializing metadata clears poison · b177fe85
      Vishal Verma 提交于
      If we had badblocks/poison in the metadata area of a BTT, recreating the
      BTT would not clear the poison in all cases, notably the flog area. This
      is because rw_bytes will only clear errors if the request being sent
      down is 512B aligned and sized.
      
      Make sure that when writing the map and info blocks, the rw_bytes being
      sent are of the correct size/alignment. For the flog, instead of doing
      the smaller log_entry writes only, first do a 'wipe' of the entire area
      by writing zeroes in large enough chunks so that errors get cleared.
      
      Cc: Andy Rudoff <andy.rudoff@intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: NVishal Verma <vishal.l.verma@intel.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      b177fe85
    • V
      libnvdimm: add an atomic vs process context flag to rw_bytes · 3ae3d67b
      Vishal Verma 提交于
      nsio_rw_bytes can clear media errors, but this cannot be done while we
      are in an atomic context due to locking within ACPI. From the BTT,
      ->rw_bytes may be called either from atomic or process context depending
      on whether the calls happen during initialization or during IO.
      
      During init, we want to ensure error clearing happens, and the flag
      marking process context allows nsio_rw_bytes to do that. When called
      during IO, we're in atomic context, and error clearing can be skipped.
      
      Cc: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: NVishal Verma <vishal.l.verma@intel.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      3ae3d67b
  3. 09 5月, 2017 1 次提交
    • M
      treewide: use kv[mz]alloc* rather than opencoded variants · 752ade68
      Michal Hocko 提交于
      There are many code paths opencoding kvmalloc.  Let's use the helper
      instead.  The main difference to kvmalloc is that those users are
      usually not considering all the aspects of the memory allocator.  E.g.
      allocation requests <= 32kB (with 4kB pages) are basically never failing
      and invoke OOM killer to satisfy the allocation.  This sounds too
      disruptive for something that has a reasonable fallback - the vmalloc.
      On the other hand those requests might fallback to vmalloc even when the
      memory allocator would succeed after several more reclaim/compaction
      attempts previously.  There is no guarantee something like that happens
      though.
      
      This patch converts many of those places to kv[mz]alloc* helpers because
      they are more conservative.
      
      Link: http://lkml.kernel.org/r/20170306103327.2766-2-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> # Xen bits
      Acked-by: NKees Cook <keescook@chromium.org>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: Andreas Dilger <andreas.dilger@intel.com> # Lustre
      Acked-by: Christian Borntraeger <borntraeger@de.ibm.com> # KVM/s390
      Acked-by: Dan Williams <dan.j.williams@intel.com> # nvdim
      Acked-by: David Sterba <dsterba@suse.com> # btrfs
      Acked-by: Ilya Dryomov <idryomov@gmail.com> # Ceph
      Acked-by: Tariq Toukan <tariqt@mellanox.com> # mlx4
      Acked-by: Leon Romanovsky <leonro@mellanox.com> # mlx5
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Anton Vorontsov <anton@enomsg.org>
      Cc: Colin Cross <ccross@android.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Ben Skeggs <bskeggs@redhat.com>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Cc: Santosh Raspatur <santosh@chelsio.com>
      Cc: Hariprasad S <hariprasad@chelsio.com>
      Cc: Yishai Hadas <yishaih@mellanox.com>
      Cc: Oleg Drokin <oleg.drokin@intel.com>
      Cc: "Yan, Zheng" <zyan@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: David Miller <davem@davemloft.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      752ade68
  4. 05 5月, 2017 3 次提交
    • D
      libnvdimm, pfn: fix 'npfns' vs section alignment · d5483fed
      Dan Williams 提交于
      Fix failures to create namespaces due to the vmem_altmap not advertising
      enough free space to store the memmap.
      
       WARNING: CPU: 15 PID: 8022 at arch/x86/mm/init_64.c:656 arch_add_memory+0xde/0xf0
       [..]
       Call Trace:
        dump_stack+0x63/0x83
        __warn+0xcb/0xf0
        warn_slowpath_null+0x1d/0x20
        arch_add_memory+0xde/0xf0
        devm_memremap_pages+0x244/0x440
        pmem_attach_disk+0x37e/0x490 [nd_pmem]
        nd_pmem_probe+0x7e/0xa0 [nd_pmem]
        nvdimm_bus_probe+0x71/0x120 [libnvdimm]
        driver_probe_device+0x2bb/0x460
        bind_store+0x114/0x160
        drv_attr_store+0x25/0x30
      
      In commit 658922e5 "libnvdimm, pfn: fix memmap reservation sizing"
      we arranged for the capacity to be allocated, but failed to also update
      the 'npfns' parameter. This leads to cases where there is enough
      capacity reserved to hold all the allocated sections, but
      vmemmap_populate_hugepages() still encounters -ENOMEM from
      altmap_alloc_block_buf().
      
      This fix is a stop-gap until we can teach the core memory hotplug
      implementation to permit sub-section hotplug.
      
      Cc: <stable@vger.kernel.org>
      Fixes: 658922e5 ("libnvdimm, pfn: fix memmap reservation sizing")
      Reported-by: NAnisha Allada <anisha.allada@intel.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      d5483fed
    • D
      libnvdimm: handle locked label storage areas · 9d62ed96
      Dan Williams 提交于
      Per the latest version of the "NVDIMM DSM Interface Example" [1], the
      label data retrieval routine can report a "locked" status. In this case
      all regions associated with that DIMM are disabled until the label area
      is unlocked. Provide generic libnvdimm enabling for NVDIMMs with label
      data area locking capabilities.
      
      [1]: http://pmem.io/documents/Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      9d62ed96
    • D
      libnvdimm: convert NDD_ flags to use bitops, introduce NDD_LOCKED · 8f078b38
      Dan Williams 提交于
      This is a preparation patch for handling locked nvdimm label regions, a
      new concept as introduced by the latest DSM document on pmem.io [1]. A
      future patch will leverage nvdimm_set_locked() at DIMM probe time to
      flag regions that can not be enabled. There should be no functional
      difference resulting from this change.
      
      [1]: http://pmem.io/documents/NVDIMM_DSM_Interface_Example-V1.3.pdfSigned-off-by: NDan Williams <dan.j.williams@intel.com>
      8f078b38
  5. 02 5月, 2017 1 次提交
    • D
      libnvdimm: restore "libnvdimm: band aid btt vs clear poison locking" · a3e9af95
      Dan Williams 提交于
      This continues the 4.11 status quo of disabling of error clearing from
      the BTT I/O path. Toshi found that even though we have eliminated all
      the libnvdimm sources of sleeping-while-atomic triggers, we still have
      sleeping operations that will occur in the path to send the ACPI DSM to
      the DIMM to clear the error:
      
       BUG: sleeping function called from invalid context at mm/slab.h:432
       in_atomic(): 1, irqs_disabled(): 0, pid: 13353, name: dd
       Call Trace:
        dump_stack+0x86/0xc3
        ___might_sleep+0x17d/0x250
        __might_sleep+0x4a/0x80
        __kmalloc+0x1c0/0x2e0
        acpi_os_allocate_zeroed+0x2d/0x2f
        acpi_evaluate_object+0x59/0x3b1
        acpi_evaluate_dsm+0xbd/0x10c
        acpi_nfit_ctl+0x1ef/0x7c0 [nfit]
        ? nsio_rw_bytes+0x152/0x280
        nvdimm_clear_poison+0x77/0x140
        nsio_rw_bytes+0x18f/0x280
        btt_write_pg+0x1d4/0x3d0 [nd_btt]
        btt_make_request+0x119/0x2d0 [nd_btt]
      
      A solution for tracking and handling media errors natively in the BTT is
      needed.
      
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Reported-by: NToshi Kani <toshi.kani@hpe.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      a3e9af95
  6. 01 5月, 2017 2 次提交
    • D
      libnvdimm: fix nvdimm_bus_lock() vs device_lock() ordering · 452bae0a
      Dan Williams 提交于
      A debug patch to turn the standard device_lock() into something that
      lockdep can analyze yielded the following:
      
       ======================================================
       [ INFO: possible circular locking dependency detected ]
       4.11.0-rc4+ #106 Tainted: G           O
       -------------------------------------------------------
       lt-libndctl/1898 is trying to acquire lock:
        (&dev->nvdimm_mutex/3){+.+.+.}, at: [<ffffffffc023c948>] nd_attach_ndns+0x178/0x1b0 [libnvdimm]
      
       but task is already holding lock:
        (&nvdimm_bus->reconfig_mutex){+.+.+.}, at: [<ffffffffc022e0b1>] nvdimm_bus_lock+0x21/0x30 [libnvdimm]
      
       which lock already depends on the new lock.
      
       the existing dependency chain (in reverse order) is:
      
       -> #1 (&nvdimm_bus->reconfig_mutex){+.+.+.}:
              lock_acquire+0xf6/0x1f0
              __mutex_lock+0x88/0x980
              mutex_lock_nested+0x1b/0x20
              nvdimm_bus_lock+0x21/0x30 [libnvdimm]
              nvdimm_namespace_capacity+0x1b/0x40 [libnvdimm]
              nvdimm_namespace_common_probe+0x230/0x510 [libnvdimm]
              nd_pmem_probe+0x14/0x180 [nd_pmem]
              nvdimm_bus_probe+0xa9/0x260 [libnvdimm]
      
       -> #0 (&dev->nvdimm_mutex/3){+.+.+.}:
              __lock_acquire+0x1107/0x1280
              lock_acquire+0xf6/0x1f0
              __mutex_lock+0x88/0x980
              mutex_lock_nested+0x1b/0x20
              nd_attach_ndns+0x178/0x1b0 [libnvdimm]
              nd_namespace_store+0x308/0x3c0 [libnvdimm]
              namespace_store+0x87/0x220 [libnvdimm]
      
      In this case '&dev->nvdimm_mutex/3' mirrors '&dev->mutex'.
      
      Fix this by replacing the use of device_lock() with nvdimm_bus_lock() to protect
      nd_{attach,detach}_ndns() operations.
      
      Cc: <stable@vger.kernel.org>
      Fixes: 8c2f7e86 ("libnvdimm: infrastructure for btt devices")
      Reported-by: NYi Zhang <yizhan@redhat.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      452bae0a
    • D
      mm, zone_device: Replace {get, put}_zone_device_page() with a single reference to fix pmem crash · 71389703
      Dan Williams 提交于
      The x86 conversion to the generic GUP code included a small change which causes
      crashes and data corruption in the pmem code - not good.
      
      The root cause is that the /dev/pmem driver code implicitly relies on the x86
      get_user_pages() implementation doing a get_page() on the page refcount, because
      get_page() does a get_zone_device_page() which properly refcounts pmem's separate
      page struct arrays that are not present in the regular page struct structures.
      (The pmem driver does this because it can cover huge memory areas.)
      
      But the x86 conversion to the generic GUP code changed the get_page() to
      page_cache_get_speculative() which is faster but doesn't do the
      get_zone_device_page() call the pmem code relies on.
      
      One way to solve the regression would be to change the generic GUP code to use
      get_page(), but that would slow things down a bit and punish other generic-GUP
      using architectures for an x86-ism they did not care about. (Arguably the pmem
      driver was probably not working reliably for them: but nvdimm is an Intel
      feature, so non-x86 exposure is probably still limited.)
      
      So restructure the pmem code's interface with the MM instead: get rid of the
      get/put_zone_device_page() distinction, integrate put_zone_device_page() into
      __put_page() and and restructure the pmem completion-wait and teardown machinery:
      
      Kirill points out that the calls to {get,put}_dev_pagemap() can be
      removed from the mm fast path if we take a single get_dev_pagemap()
      reference to signify that the page is alive and use the final put of the
      page to drop that reference.
      
      This does require some care to make sure that any waits for the
      percpu_ref to drop to zero occur *after* devm_memremap_page_release(),
      since it now maintains its own elevated reference.
      
      This speeds up things while also making the pmem refcounting more robust going
      forward.
      Suggested-by: NKirill Shutemov <kirill.shutemov@linux.intel.com>
      Tested-by: NKirill Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      Reviewed-by: NLogan Gunthorpe <logang@deltatee.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/149339998297.24933.1129582806028305912.stgit@dwillia2-desk3.amr.corp.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      71389703
  7. 30 4月, 2017 1 次提交
    • D
      libnvdimm: rework region badblocks clearing · 23f49844
      Dan Williams 提交于
      Toshi noticed that the new support for a region-level badblocks missed
      the case where errors are cleared due to BTT I/O.
      
      An initial attempt to fix this ran into a "sleeping while atomic"
      warning due to taking the nvdimm_bus_lock() in the BTT I/O path to
      satisfy the locking requirements of __nvdimm_bus_badblocks_clear().
      However, that lock is not needed since we are not acting on any data that
      is subject to change under that lock. The badblocks instance has its own
      internal lock to handle mutations of the error list.
      
      So, in order to make it clear that we are just acting on region devices,
      rename __nvdimm_bus_badblocks_clear() to nvdimm_clear_badblocks_regions().
      Eliminate the lock and consolidate all support routines for the new
      nvdimm_account_cleared_poison() in drivers/nvdimm/bus.c. Finally, to the
      opportunity to cleanup to some unnecessary casts, make the calling
      convention of nvdimm_clear_badblocks_regions() clearer by replacing struct
      resource with the minimal struct clear_badblocks_context, and use the
      DEVICE_ATTR macro.
      
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Reported-by: NToshi Kani <toshi.kani@hpe.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      23f49844
  8. 29 4月, 2017 3 次提交
    • T
      libnvdimm: fix clear length of nvdimm_forget_poison() · 8d13c029
      Toshi Kani 提交于
      ND_CMD_CLEAR_ERROR command returns 'clear_err.cleared', the length
      of error actually cleared, which may be smaller than its requested
      'len'.
      
      Change nvdimm_clear_poison() to call nvdimm_forget_poison() with
      'clear_err.cleared' when this value is valid.
      
      Cc: <stable@vger.kernel.org>
      Fixes: e046114a ("libnvdimm: clear the internal poison_list when clearing badblocks")
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Signed-off-by: NToshi Kani <toshi.kani@hpe.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      8d13c029
    • T
      libnvdimm, pmem: fix a NULL pointer BUG in nd_pmem_notify · b2518c78
      Toshi Kani 提交于
      The following BUG was observed when nd_pmem_notify() was called
      for a BTT device.  The use of a pmem_device pointer is not valid
      with BTT.
      
       BUG: unable to handle kernel NULL pointer dereference at 0000000000000030
       IP: nd_pmem_notify+0x30/0xf0 [nd_pmem]
       Call Trace:
        nd_device_notify+0x40/0x50
        child_notify+0x10/0x20
        device_for_each_child+0x50/0x90
        nd_region_notify+0x20/0x30
        nd_device_notify+0x40/0x50
        nvdimm_region_notify+0x27/0x30
        acpi_nfit_scrub+0x341/0x590 [nfit]
        process_one_work+0x197/0x450
        worker_thread+0x4e/0x4a0
        kthread+0x109/0x140
      
      Fix nd_pmem_notify() by setting nd_region and badblocks pointers
      properly for BTT.
      
      Cc: <stable@vger.kernel.org>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Fixes: 71999466 ("libnvdimm: async notification support")
      Signed-off-by: NToshi Kani <toshi.kani@hpe.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      b2518c78
    • D
      libnvdimm, region: sysfs trigger for nvdimm_flush() · ab630891
      Dan Williams 提交于
      The nvdimm_flush() mechanism helps to reduce the impact of an ADR
      (asynchronous-dimm-refresh) failure. The ADR mechanism handles flushing
      platform WPQ (write-pending-queue) buffers when power is removed. The
      nvdimm_flush() mechanism performs that same function on-demand.
      
      When a pmem namespace is associated with a block device, an
      nvdimm_flush() is triggered with every block-layer REQ_FUA, or REQ_FLUSH
      request. These requests are typically associated with filesystem
      metadata updates. However, when a namespace is in device-dax mode,
      userspace (think database metadata) needs another path to perform the
      same flushing. In other words this is not required to make data
      persistent, but in the case of metadata it allows for a smaller failure
      domain in the unlikely event of an ADR failure.
      
      The new 'deep_flush' attribute is visible when the individual DIMMs
      backing a given interleave-set are described by platform firmware. In
      ACPI terms this is "NVDIMM Region Mapping Structures" and associated
      "Flush Hint Address Structures". Reads return "1" if the region supports
      triggering WPQ flushes on all DIMMs. Reads return "0" the flush
      operation is a platform nop, and in that case the attribute is
      read-only.
      
      Why sysfs and not an ioctl? An ioctl requires establishing a new
      ioctl function number space for device-dax. Given that this would be
      called on a device-dax fd an application could be forgiven for
      accidentally calling this on a filesystem-dax fd. Placing this interface
      in libnvdimm sysfs removes that potential for collision with a
      filesystem ioctl, and it keeps ioctls out of the generic device-dax
      implementation.
      
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      ab630891
  9. 28 4月, 2017 1 次提交
  10. 26 4月, 2017 2 次提交
    • D
      x86, dax, pmem: remove indirection around memcpy_from_pmem() · 6abccd1b
      Dan Williams 提交于
      memcpy_from_pmem() maps directly to memcpy_mcsafe(). The wrapper
      serves no real benefit aside from affording a more generic function name
      than the x86-specific 'mcsafe'. However this would not be the first time
      that x86 terminology leaked into the global namespace. For lack of
      better name, just use memcpy_mcsafe() directly.
      
      This conversion also catches a place where we should have been using
      plain memcpy, acpi_nfit_blk_single_io().
      
      Cc: <x86@kernel.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Acked-by: NTony Luck <tony.luck@intel.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      6abccd1b
    • D
      block: remove block_device_operations ->direct_access() · d4b29fd7
      Dan Williams 提交于
      Now that all the producers and consumers of dax interfaces have been
      converted to using dax_operations on a dax_device, remove the block
      device direct_access enabling.
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      d4b29fd7
  11. 25 4月, 2017 1 次提交
    • D
      libnvdimm, region: fix flush hint detection crash · bc042fdf
      Dan Williams 提交于
      In the case where a dimm does not have any associated flush hints the
      ndrd->flush_wpq array may be uninitialized leading to crashes with the
      following signature:
      
       BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
       IP: region_visible+0x10f/0x160 [libnvdimm]
      
       Call Trace:
        internal_create_group+0xbe/0x2f0
        sysfs_create_groups+0x40/0x80
        device_add+0x2d8/0x650
        nd_async_device_register+0x12/0x40 [libnvdimm]
        async_run_entry_fn+0x39/0x170
        process_one_work+0x212/0x6c0
        ? process_one_work+0x197/0x6c0
        worker_thread+0x4e/0x4a0
        kthread+0x10c/0x140
        ? process_one_work+0x6c0/0x6c0
        ? kthread_create_on_node+0x60/0x60
        ret_from_fork+0x31/0x40
      
      Cc: <stable@vger.kernel.org>
      Reviewed-by: NJeff Moyer <jmoyer@redhat.com>
      Fixes: f284a4f2 ("libnvdimm: introduce nvdimm_flush() and nvdimm_has_flush()")
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      bc042fdf
  12. 20 4月, 2017 1 次提交
  13. 15 4月, 2017 1 次提交
  14. 14 4月, 2017 1 次提交
    • D
      libnvdimm: fix clear poison locking with spinlock and GFP_NOWAIT allocation · b3b454f6
      Dave Jiang 提交于
      The following warning results from holding a lane spinlock,
      preempt_disable(), or the btt map spinlock and then trying to take the
      reconfig_mutex to walk the poison list and potentially add new entries.
      
      BUG: sleeping function called from invalid context at kernel/locking/mutex.
      c:747
      in_atomic(): 1, irqs_disabled(): 0, pid: 17159, name: dd
      [..]
      Call Trace:
      dump_stack+0x85/0xc8
      ___might_sleep+0x184/0x250
      __might_sleep+0x4a/0x90
      __mutex_lock+0x58/0x9b0
      ? nvdimm_bus_lock+0x21/0x30 [libnvdimm]
      ? __nvdimm_bus_badblocks_clear+0x2f/0x60 [libnvdimm]
      ? acpi_nfit_forget_poison+0x79/0x80 [nfit]
      ? _raw_spin_unlock+0x27/0x40
      mutex_lock_nested+0x1b/0x20
      nvdimm_bus_lock+0x21/0x30 [libnvdimm]
      nvdimm_forget_poison+0x25/0x50 [libnvdimm]
      nvdimm_clear_poison+0x106/0x140 [libnvdimm]
      nsio_rw_bytes+0x164/0x270 [libnvdimm]
      btt_write_pg+0x1de/0x3e0 [nd_btt]
      ? blk_queue_enter+0x30/0x290
      btt_make_request+0x11a/0x310 [nd_btt]
      ? blk_queue_enter+0xb7/0x290
      ? blk_queue_enter+0x30/0x290
      generic_make_request+0x118/0x3b0
      
      A spinlock is introduced to protect the poison list. This allows us to not
      having to acquire the reconfig_mutex for touching the poison list. The
      add_poison() function has been broken out into two helper functions. One to
      allocate the poison entry and the other to apppend the entry. This allows us
      to unlock the poison_lock in non-I/O path and continue to be able to allocate
      the poison entry with GFP_KERNEL. We will use GFP_NOWAIT in the I/O path in
      order to satisfy being in atomic context.
      Reviewed-by: NVishal Verma <vishal.l.verma@intel.com>
      Signed-off-by: NDave Jiang <dave.jiang@intel.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      b3b454f6
  15. 13 4月, 2017 3 次提交
  16. 11 4月, 2017 2 次提交
    • D
      libnvdimm: band aid btt vs clear poison locking · 4aa5615e
      Dan Williams 提交于
      The following warning results from holding a lane spinlock,
      preempt_disable(), or the btt map spinlock and then trying to take the
      reconfig_mutex to walk the poison list and potentially add new entries.
      
       BUG: sleeping function called from invalid context at kernel/locking/mutex.c:747
       in_atomic(): 1, irqs_disabled(): 0, pid: 17159, name: dd
       [..]
       Call Trace:
        dump_stack+0x85/0xc8
        ___might_sleep+0x184/0x250
        __might_sleep+0x4a/0x90
        __mutex_lock+0x58/0x9b0
        ? nvdimm_bus_lock+0x21/0x30 [libnvdimm]
        ? __nvdimm_bus_badblocks_clear+0x2f/0x60 [libnvdimm]
        ? acpi_nfit_forget_poison+0x79/0x80 [nfit]
        ? _raw_spin_unlock+0x27/0x40
        mutex_lock_nested+0x1b/0x20
        nvdimm_bus_lock+0x21/0x30 [libnvdimm]
        nvdimm_forget_poison+0x25/0x50 [libnvdimm]
        nvdimm_clear_poison+0x106/0x140 [libnvdimm]
        nsio_rw_bytes+0x164/0x270 [libnvdimm]
        btt_write_pg+0x1de/0x3e0 [nd_btt]
        ? blk_queue_enter+0x30/0x290
        btt_make_request+0x11a/0x310 [nd_btt]
        ? blk_queue_enter+0xb7/0x290
        ? blk_queue_enter+0x30/0x290
        generic_make_request+0x118/0x3b0
      
      As a minimal fix, disable error clearing when the BTT is enabled for the
      namespace. For the final fix a larger rework of the poison list locking
      is needed.
      
      Note that this is not a problem in the blk case since that path never
      calls nvdimm_clear_poison().
      
      Cc: <stable@vger.kernel.org>
      Fixes: 82bf1037 ("libnvdimm: check and clear poison before writing to pmem")
      Cc: Dave Jiang <dave.jiang@intel.com>
      [jeff: dynamically disable error clearing in the btt case]
      Suggested-by: NJeff Moyer <jmoyer@redhat.com>
      Reviewed-by: NJeff Moyer <jmoyer@redhat.com>
      Reported-by: NVishal Verma <vishal.l.verma@intel.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      4aa5615e
    • D
      libnvdimm: fix reconfig_mutex, mmap_sem, and jbd2_handle lockdep splat · 0beb2012
      Dan Williams 提交于
      Holding the reconfig_mutex over a potential userspace fault sets up a
      lockdep dependency chain between filesystem-DAX and the libnvdimm ioctl
      path. Move the user access outside of the lock.
      
           [ INFO: possible circular locking dependency detected ]
           4.11.0-rc3+ #13 Tainted: G        W  O
           -------------------------------------------------------
           fallocate/16656 is trying to acquire lock:
            (&nvdimm_bus->reconfig_mutex){+.+.+.}, at: [<ffffffffa00080b1>] nvdimm_bus_lock+0x21/0x30 [libnvdimm]
           but task is already holding lock:
            (jbd2_handle){++++..}, at: [<ffffffff813b4944>] start_this_handle+0x104/0x460
      
          which lock already depends on the new lock.
      
          the existing dependency chain (in reverse order) is:
      
          -> #2 (jbd2_handle){++++..}:
                  lock_acquire+0xbd/0x200
                  start_this_handle+0x16a/0x460
                  jbd2__journal_start+0xe9/0x2d0
                  __ext4_journal_start_sb+0x89/0x1c0
                  ext4_dirty_inode+0x32/0x70
                  __mark_inode_dirty+0x235/0x670
                  generic_update_time+0x87/0xd0
                  touch_atime+0xa9/0xd0
                  ext4_file_mmap+0x90/0xb0
                  mmap_region+0x370/0x5b0
                  do_mmap+0x415/0x4f0
                  vm_mmap_pgoff+0xd7/0x120
                  SyS_mmap_pgoff+0x1c5/0x290
                  SyS_mmap+0x22/0x30
                  entry_SYSCALL_64_fastpath+0x1f/0xc2
      
          -> #1 (&mm->mmap_sem){++++++}:
                  lock_acquire+0xbd/0x200
                  __might_fault+0x70/0xa0
                  __nd_ioctl+0x683/0x720 [libnvdimm]
                  nvdimm_ioctl+0x8b/0xe0 [libnvdimm]
                  do_vfs_ioctl+0xa8/0x740
                  SyS_ioctl+0x79/0x90
                  do_syscall_64+0x6c/0x200
                  return_from_SYSCALL_64+0x0/0x7a
      
          -> #0 (&nvdimm_bus->reconfig_mutex){+.+.+.}:
                  __lock_acquire+0x16b6/0x1730
                  lock_acquire+0xbd/0x200
                  __mutex_lock+0x88/0x9b0
                  mutex_lock_nested+0x1b/0x20
                  nvdimm_bus_lock+0x21/0x30 [libnvdimm]
                  nvdimm_forget_poison+0x25/0x50 [libnvdimm]
                  nvdimm_clear_poison+0x106/0x140 [libnvdimm]
                  pmem_do_bvec+0x1c2/0x2b0 [nd_pmem]
                  pmem_make_request+0xf9/0x270 [nd_pmem]
                  generic_make_request+0x118/0x3b0
                  submit_bio+0x75/0x150
      
      Cc: <stable@vger.kernel.org>
      Fixes: 62232e45 ("libnvdimm: control (ioctl) messages for nvdimm_bus and nvdimm devices")
      Cc: Dave Jiang <dave.jiang@intel.com>
      Reported-by: NVishal Verma <vishal.l.verma@intel.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      0beb2012
  17. 05 4月, 2017 1 次提交
    • D
      libnvdimm: fix blk free space accounting · fe514739
      Dan Williams 提交于
      Commit a1f3e4d6 "libnvdimm, region: update nd_region_available_dpa()
      for multi-pmem support" reworked blk dpa (DIMM Physical Address)
      accounting to comprehend multiple pmem namespace allocations aliasing
      with a given blk-dpa range.
      
      The following call trace is a result of failing to account for allocated
      blk capacity.
      
       WARNING: CPU: 1 PID: 2433 at tools/testing/nvdimm/../../../drivers/nvdimm/names
      4 size_store+0x6f3/0x930 [libnvdimm]
       nd_region region5: allocation underrun: 0x0 of 0x1000000 bytes
       [..]
       Call Trace:
        dump_stack+0x86/0xc3
        __warn+0xcb/0xf0
        warn_slowpath_fmt+0x5f/0x80
        size_store+0x6f3/0x930 [libnvdimm]
        dev_attr_store+0x18/0x30
      
      If a given blk-dpa allocation does not alias with any pmem ranges then
      the full allocation should be accounted as busy space, not the size of
      the current pmem contribution to the region.
      
      The thinkos that led to this confusion was not realizing that the struct
      resource management is already guaranteeing no collisions between pmem
      allocations and blk allocations on the same dimm. Also, we do not try to
      support blk allocations in aliased pmem holes.
      
      This patch also fixes a case where the available blk goes negative.
      
      Cc: <stable@vger.kernel.org>
      Fixes: a1f3e4d6 ("libnvdimm, region: update nd_region_available_dpa() for multi-pmem support").
      Reported-by: NDariusz Dokupil <dariusz.dokupil@intel.com>
      Reported-by: NDave Jiang <dave.jiang@intel.com>
      Reported-by: NVishal Verma <vishal.l.verma@intel.com>
      Tested-by: NDave Jiang <dave.jiang@intel.com>
      Tested-by: NVishal Verma <vishal.l.verma@intel.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      fe514739
  18. 01 3月, 2017 1 次提交
    • D
      nfit, libnvdimm: fix interleave set cookie calculation · 86ef58a4
      Dan Williams 提交于
      The interleave-set cookie is a sum that sanity checks the composition of
      an interleave set has not changed from when the namespace was initially
      created.  The checksum is calculated by sorting the DIMMs by their
      location in the interleave-set. The comparison for the sort must be
      64-bit wide, not byte-by-byte as performed by memcmp() in the broken
      case.
      
      Fix the implementation to accept correct cookie values in addition to
      the Linux "memcmp" order cookies, but only allow correct cookies to be
      generated going forward. It does mean that namespaces created by
      third-party-tooling, or created by newer kernels with this fix, will not
      validate on older kernels. However, there are a couple mitigating
      conditions:
      
          1/ platforms with namespace-label capable NVDIMMs are not widely
             available.
      
          2/ interleave-sets with a single-dimm are by definition not affected
             (nothing to sort). This covers the QEMU-KVM NVDIMM emulation case.
      
      The cookie stored in the namespace label will be fixed by any write the
      namespace label, the most straightforward way to achieve this is to
      write to the "alt_name" attribute of a namespace in sysfs.
      
      Cc: <stable@vger.kernel.org>
      Fixes: eaf96153 ("libnvdimm, nfit: add interleave-set state-tracking infrastructure")
      Reported-by: NNicholas Moulin <nicholas.w.moulin@linux.intel.com>
      Tested-by: NNicholas Moulin <nicholas.w.moulin@linux.intel.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      86ef58a4
  19. 05 2月, 2017 1 次提交
    • D
      libnvdimm, pfn: fix memmap reservation size versus 4K alignment · bfb34527
      Dan Williams 提交于
      When vmemmap_populate() allocates space for the memmap it does so in 2MB
      sized chunks. The libnvdimm-pfn driver incorrectly accounts for this
      when the alignment of the device is set to 4K. When this happens we
      trigger memory allocation failures in altmap_alloc_block_buf() and
      trigger warnings of the form:
      
       WARNING: CPU: 0 PID: 3376 at arch/x86/mm/init_64.c:656 arch_add_memory+0xe4/0xf0
       [..]
       Call Trace:
        dump_stack+0x86/0xc3
        __warn+0xcb/0xf0
        warn_slowpath_null+0x1d/0x20
        arch_add_memory+0xe4/0xf0
        devm_memremap_pages+0x29b/0x4e0
      
      Fixes: 315c5625 ("libnvdimm, pfn: add 'align' attribute, default to HPAGE_SIZE")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      bfb34527
  20. 01 2月, 2017 2 次提交
    • D
      libnvdimm, namespace: do not delete namespace-id 0 · 9d032f42
      Dan Williams 提交于
      Given that the naming of pmem devices changes from the pmemX form to the
      pmemX.Y form when namespace id is greater than 0, arrange for namespaces
      with id-0 to be exempt from deletion. Otherwise a simple reconfiguration
      of an existing namespace to a new mode results in a name change of the
      resulting block device:
      
          # ndctl list --namespace=namespace1.0
          {
            "dev":"namespace1.0",
            "mode":"raw",
            "size":2147483648,
            "uuid":"3dadf3dc-89b9-4b24-b20e-abc8a4707ce3",
            "blockdev":"pmem1"
          }
      
          # ndctl create-namespace --reconfig=namespace1.0 --mode=memory --force
          {
            "dev":"namespace1.1",
            "mode":"memory",
            "size":2111832064,
            "uuid":"7b4a6341-7318-4219-a02c-fb57c0bbf613",
            "blockdev":"pmem1.1"
          }
      
      This change does require tooling changes to explicitly look for
      namespaceX.0 if the seed has already advanced to another namespace.
      
      Cc: <stable@vger.kernel.org>
      Fixes: 98a29c39 ("libnvdimm, namespace: allow creation of multiple pmem-namespaces per region")
      Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      9d032f42
    • B
      nvdimm: constify device_type structures · 970d14e3
      Bhumika Goyal 提交于
      Declare device_type structure as const as it is only stored in the
      type field of a device structure. This field is of type const, so add
      const to declaration of device_type structure.
      
      File size before:
        text	   data	    bss	    dec	    hex	filename
        19278	   3199	     16	  22493	   57dd	nvdimm/namespace_devs.o
      
      File size after:
        text	   data	    bss	    dec	    hex	filename
        19929	   3160	     16	  23105	   5a41	nvdimm/namespace_devs.o
      Signed-off-by: NBhumika Goyal <bhumirks@gmail.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      970d14e3
  21. 14 1月, 2017 1 次提交
  22. 13 1月, 2017 1 次提交
  23. 17 12月, 2016 1 次提交
    • D
      libnvdimm: fix mishandled nvdimm_clear_poison() return value · 868f036f
      Dan Williams 提交于
      Colin, via static analysis, reports that the length could be negative
      from nvdimm_clear_poison() in the error case. There was a similar
      problem with commit 0a3f27b9 "libnvdimm, namespace: avoid multiple
      sector calculations" that I noticed when merging the for-4.10/libnvdimm
      topic branch into libnvdimm-for-next, but I missed this one. Fix both of
      them to the following procedure:
      
      * if we clear a block's worth of media, clear that many blocks in
        badblocks
      
      * if we clear less than the requested size of the transfer return an
        error
      
      * always invalidate cache after any non-error / non-zero
        nvdimm_clear_poison result
      
      Fixes: 82bf1037 ("libnvdimm: check and clear poison before writing to pmem")
      Fixes: 0a3f27b9 ("libnvdimm, namespace: avoid multiple sector calculations")
      Cc: Fabian Frederick <fabf@skynet.be>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Reported-by: NColin Ian King <colin.king@canonical.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      868f036f
  24. 16 12月, 2016 1 次提交
  25. 11 12月, 2016 1 次提交
  26. 07 12月, 2016 1 次提交
    • D
      acpi, nfit, libnvdimm: fix / harden ars_status output length handling · efda1b5d
      Dan Williams 提交于
      Given ambiguities in the ACPI 6.1 definition of the "Output (Size)"
      field of the ARS (Address Range Scrub) Status command, a firmware
      implementation may in practice return 0, 4, or 8 to indicate that there
      is no output payload to process.
      
      The specification states "Size of Output Buffer in bytes, including this
      field.". However, 'Output Buffer' is also the name of the entire
      payload, and earlier in the specification it states "Max Query ARS
      Status Output Buffer Size: Maximum size of buffer (including the Status
      and Extended Status fields)".
      
      Without this fix if the BIOS happens to return 0 it causes memory
      corruption as evidenced by this result from the acpi_nfit_ctl() unit
      test.
      
       ars_status00000000: 00020000 00000000                    ........
       BUG: stack guard page was hit at ffffc90001750000 (stack is ffffc9000174c000..ffffc9000174ffff)
       kernel stack overflow (page fault): 0000 [#1] SMP DEBUG_PAGEALLOC
       task: ffff8803332d2ec0 task.stack: ffffc9000174c000
       RIP: 0010:[<ffffffff814cfe72>]  [<ffffffff814cfe72>] __memcpy+0x12/0x20
       RSP: 0018:ffffc9000174f9a8  EFLAGS: 00010246
       RAX: ffffc9000174fab8 RBX: 0000000000000000 RCX: 000000001fffff56
       RDX: 0000000000000000 RSI: ffff8803231f5a08 RDI: ffffc90001750000
       RBP: ffffc9000174fa88 R08: ffffc9000174fab0 R09: ffff8803231f54b8
       R10: 0000000000000008 R11: 0000000000000001 R12: 0000000000000000
       R13: 0000000000000000 R14: 0000000000000003 R15: ffff8803231f54a0
       FS:  00007f3a611af640(0000) GS:ffff88033ed00000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: ffffc90001750000 CR3: 0000000325b20000 CR4: 00000000000406e0
       Stack:
        ffffffffa00bc60d 0000000000000008 ffffc90000000001 ffffc9000174faac
        0000000000000292 ffffffffa00c24e4 ffffffffa00c2914 0000000000000000
        0000000000000000 ffffffff00000003 ffff880331ae8ad0 0000000800000246
       Call Trace:
        [<ffffffffa00bc60d>] ? acpi_nfit_ctl+0x49d/0x750 [nfit]
        [<ffffffffa01f4fe0>] nfit_test_probe+0x670/0xb1b [nfit_test]
      
      Cc: <stable@vger.kernel.org>
      Fixes: 747ffe11 ("libnvdimm, tools/testing/nvdimm: fix 'ars_status' output buffer sizing")
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      efda1b5d
  27. 06 12月, 2016 1 次提交
  28. 05 12月, 2016 2 次提交