1. 30 4月, 2020 5 次提交
  2. 15 1月, 2020 1 次提交
    • D
      acpi/nfit, device-dax: Identify differentiated memory with a unique numa-node · a0a4e71f
      Dan Williams 提交于
      commit 8fc5c73554db0ac18c0c6ac5b2099ab917f83bdf upstream
      
      Persistent memory, as described by the ACPI NFIT (NVDIMM Firmware
      Interface Table), is the first known instance of a memory range
      described by a unique "target" proximity domain. Where "initiator" and
      "target" proximity domains is an approach that the ACPI HMAT
      (Heterogeneous Memory Attributes Table) uses to described the unique
      performance properties of a memory range relative to a given initiator
      (e.g. CPU or DMA device).
      
      Currently the numa-node for a /dev/pmemX block-device or /dev/daxX.Y
      char-device follows the traditional notion of 'numa-node' where the
      attribute conveys the closest online numa-node. That numa-node attribute
      is useful for cpu-binding and memory-binding processes *near* the
      device. However, when the memory range backing a 'pmem', or 'dax' device
      is onlined (memory hot-add) the memory-only-numa-node representing that
      address needs to be differentiated from the set of online nodes. In
      other words, the numa-node association of the device depends on whether
      you can bind processes *near* the cpu-numa-node in the offline
      device-case, or bind process *on* the memory-range directly after the
      backing address range is onlined.
      
      Allow for the case that platform firmware describes persistent memory
      with a unique proximity domain, i.e. when it is distinct from the
      proximity of DRAM and CPUs that are on the same socket. Plumb the Linux
      numa-node translation of that proximity through the libnvdimm region
      device to namespaces that are in device-dax mode. With this in place the
      proposed kmem driver [1] can optionally discover a unique numa-node
      number for the address range as it transitions the memory from an
      offline state managed by a device-driver to an online memory range
      managed by the core-mm.
      
      [1]: https://lore.kernel.org/lkml/20181022201317.8558C1D8@viggo.jf.intel.comReported-by: NFan Du <fan.du@intel.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Oliver O'Halloran" <oohall@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      [yshi: Removed PowerPC stuff which is not applicable 4.19]
      Signed-off-by: NYang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: NGavin Shan <shan.gavin@linux.alibaba.com>
      a0a4e71f
  3. 12 10月, 2019 1 次提交
    • A
      libnvdimm/region: Initialize bad block for volatile namespaces · 2e93d24a
      Aneesh Kumar K.V 提交于
      [ Upstream commit c42adf87e4e7ed77f6ffe288dc90f980d07d68df ]
      
      We do check for a bad block during namespace init and that use
      region bad block list. We need to initialize the bad block
      for volatile regions for this to work. We also observe a lockdep
      warning as below because the lock is not initialized correctly
      since we skip bad block init for volatile regions.
      
       INFO: trying to register non-static key.
       the code is fine but needs lockdep annotation.
       turning off the locking correctness validator.
       CPU: 2 PID: 1 Comm: swapper/0 Not tainted 5.3.0-rc1-15699-g3dee241c937e #149
       Call Trace:
       [c0000000f95cb250] [c00000000147dd84] dump_stack+0xe8/0x164 (unreliable)
       [c0000000f95cb2a0] [c00000000022ccd8] register_lock_class+0x308/0xa60
       [c0000000f95cb3a0] [c000000000229cc0] __lock_acquire+0x170/0x1ff0
       [c0000000f95cb4c0] [c00000000022c740] lock_acquire+0x220/0x270
       [c0000000f95cb580] [c000000000a93230] badblocks_check+0xc0/0x290
       [c0000000f95cb5f0] [c000000000d97540] nd_pfn_validate+0x5c0/0x7f0
       [c0000000f95cb6d0] [c000000000d98300] nd_dax_probe+0xd0/0x1f0
       [c0000000f95cb760] [c000000000d9b66c] nd_pmem_probe+0x10c/0x160
       [c0000000f95cb790] [c000000000d7f5ec] nvdimm_bus_probe+0x10c/0x240
       [c0000000f95cb820] [c000000000d0f844] really_probe+0x254/0x4e0
       [c0000000f95cb8b0] [c000000000d0fdfc] driver_probe_device+0x16c/0x1e0
       [c0000000f95cb930] [c000000000d10238] device_driver_attach+0x68/0xa0
       [c0000000f95cb970] [c000000000d1040c] __driver_attach+0x19c/0x1c0
       [c0000000f95cb9f0] [c000000000d0c4c4] bus_for_each_dev+0x94/0x130
       [c0000000f95cba50] [c000000000d0f014] driver_attach+0x34/0x50
       [c0000000f95cba70] [c000000000d0e208] bus_add_driver+0x178/0x2f0
       [c0000000f95cbb00] [c000000000d117c8] driver_register+0x108/0x170
       [c0000000f95cbb70] [c000000000d7edb0] __nd_driver_register+0xe0/0x100
       [c0000000f95cbbd0] [c000000001a6baa4] nd_pmem_driver_init+0x34/0x48
       [c0000000f95cbbf0] [c0000000000106f4] do_one_initcall+0x1d4/0x4b0
       [c0000000f95cbcd0] [c0000000019f499c] kernel_init_freeable+0x544/0x65c
       [c0000000f95cbdb0] [c000000000010d6c] kernel_init+0x2c/0x180
       [c0000000f95cbe20] [c00000000000b954] ret_from_kernel_thread+0x5c/0x68
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Link: https://lore.kernel.org/r/20190919083355.26340-1-aneesh.kumar@linux.ibm.comSigned-off-by: NDan Williams <dan.j.williams@intel.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      2e93d24a
  4. 09 8月, 2019 4 次提交
    • D
      libnvdimm/bus: Fix wait_nvdimm_bus_probe_idle() ABBA deadlock · 2364ed0d
      Dan Williams 提交于
      commit ca6bf264f6d856f959c4239cda1047b587745c67 upstream.
      
      A multithreaded namespace creation/destruction stress test currently
      deadlocks with the following lockup signature:
      
          INFO: task ndctl:2924 blocked for more than 122 seconds.
                Tainted: G           OE     5.2.0-rc4+ #3382
          "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
          ndctl           D    0  2924   1176 0x00000000
          Call Trace:
           ? __schedule+0x27e/0x780
           schedule+0x30/0xb0
           wait_nvdimm_bus_probe_idle+0x8a/0xd0 [libnvdimm]
           ? finish_wait+0x80/0x80
           uuid_store+0xe6/0x2e0 [libnvdimm]
           kernfs_fop_write+0xf0/0x1a0
           vfs_write+0xb7/0x1b0
           ksys_write+0x5c/0xd0
           do_syscall_64+0x60/0x240
      
           INFO: task ndctl:2923 blocked for more than 122 seconds.
                 Tainted: G           OE     5.2.0-rc4+ #3382
           "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
           ndctl           D    0  2923   1175 0x00000000
           Call Trace:
            ? __schedule+0x27e/0x780
            ? __mutex_lock+0x489/0x910
            schedule+0x30/0xb0
            schedule_preempt_disabled+0x11/0x20
            __mutex_lock+0x48e/0x910
            ? nvdimm_namespace_common_probe+0x95/0x4d0 [libnvdimm]
            ? __lock_acquire+0x23f/0x1710
            ? nvdimm_namespace_common_probe+0x95/0x4d0 [libnvdimm]
            nvdimm_namespace_common_probe+0x95/0x4d0 [libnvdimm]
            __dax_pmem_probe+0x5e/0x210 [dax_pmem_core]
            ? nvdimm_bus_probe+0x1d0/0x2c0 [libnvdimm]
            dax_pmem_probe+0xc/0x20 [dax_pmem]
            nvdimm_bus_probe+0x90/0x2c0 [libnvdimm]
            really_probe+0xef/0x390
            driver_probe_device+0xb4/0x100
      
      In this sequence an 'nd_dax' device is being probed and trying to take
      the lock on its backing namespace to validate that the 'nd_dax' device
      indeed has exclusive access to the backing namespace. Meanwhile, another
      thread is trying to update the uuid property of that same backing
      namespace. So one thread is in the probe path trying to acquire the
      lock, and the other thread has acquired the lock and tries to flush the
      probe path.
      
      Fix this deadlock by not holding the namespace device_lock over the
      wait_nvdimm_bus_probe_idle() synchronization step. In turn this requires
      the device_lock to be held on entry to wait_nvdimm_bus_probe_idle() and
      subsequently dropped internally to wait_nvdimm_bus_probe_idle().
      
      Cc: <stable@vger.kernel.org>
      Fixes: bf9bccc1 ("libnvdimm: pmem label sets and namespace instantiation")
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Tested-by: NJane Chu <jane.chu@oracle.com>
      Link: https://lore.kernel.org/r/156341210094.292348.2384694131126767789.stgit@dwillia2-desk3.amr.corp.intel.comSigned-off-by: NDan Williams <dan.j.williams@intel.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      2364ed0d
    • D
      libnvdimm/bus: Prepare the nd_ioctl() path to be re-entrant · 7f000e7b
      Dan Williams 提交于
      commit 6de5d06e657acdbcf9637dac37916a4a5309e0f4 upstream.
      
      In preparation for not holding a lock over the execution of nd_ioctl(),
      update the implementation to allow multiple threads to be attempting
      ioctls at the same time. The bus lock still prevents multiple in-flight
      ->ndctl() invocations from corrupting each other's state, but static
      global staging buffers are moved to the heap.
      Reported-by: NVishal Verma <vishal.l.verma@intel.com>
      Reviewed-by: NVishal Verma <vishal.l.verma@intel.com>
      Tested-by: NVishal Verma <vishal.l.verma@intel.com>
      Link: https://lore.kernel.org/r/156341208947.292348.10560140326807607481.stgit@dwillia2-desk3.amr.corp.intel.comSigned-off-by: NDan Williams <dan.j.williams@intel.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      7f000e7b
    • D
      libnvdimm/region: Register badblocks before namespaces · 32485369
      Dan Williams 提交于
      commit 700cd033a82d466ad8f9615f9985525e45f8960a upstream.
      
      Namespace activation expects to be able to reference region badblocks.
      The following warning sometimes triggers when asynchronous namespace
      activation races in front of the completion of namespace probing. Move
      all possible namespace probing after region badblocks initialization.
      
      Otherwise, lockdep sometimes catches the uninitialized state of the
      badblocks seqlock with stack trace signatures like:
      
          INFO: trying to register non-static key.
          pmem2: detected capacity change from 0 to 136365211648
          the code is fine but needs lockdep annotation.
          turning off the locking correctness validator.
          CPU: 9 PID: 358 Comm: kworker/u80:5 Tainted: G           OE     5.2.0-rc4+ #3382
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
          Workqueue: events_unbound async_run_entry_fn
          Call Trace:
           dump_stack+0x85/0xc0
          pmem1.12: detected capacity change from 0 to 8589934592
           register_lock_class+0x56a/0x570
           ? check_object+0x140/0x270
           __lock_acquire+0x80/0x1710
           ? __mutex_lock+0x39d/0x910
           lock_acquire+0x9e/0x180
           ? nd_pfn_validate+0x28f/0x440 [libnvdimm]
           badblocks_check+0x93/0x1f0
           ? nd_pfn_validate+0x28f/0x440 [libnvdimm]
           nd_pfn_validate+0x28f/0x440 [libnvdimm]
           ? lockdep_hardirqs_on+0xf0/0x180
           nd_dax_probe+0x9a/0x120 [libnvdimm]
           nd_pmem_probe+0x6d/0x180 [nd_pmem]
           nvdimm_bus_probe+0x90/0x2c0 [libnvdimm]
      
      Fixes: 48af2f7e52f4 ("libnvdimm, pfn: during init, clear errors...")
      Cc: <stable@vger.kernel.org>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Reviewed-by: NVishal Verma <vishal.l.verma@intel.com>
      Link: https://lore.kernel.org/r/156341208365.292348.1547528796026249120.stgit@dwillia2-desk3.amr.corp.intel.comSigned-off-by: NDan Williams <dan.j.williams@intel.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      32485369
    • D
      libnvdimm/bus: Prevent duplicate device_unregister() calls · d16bbdbb
      Dan Williams 提交于
      commit 8aac0e2338916e273ccbd438a2b7a1e8c61749f5 upstream.
      
      A multithreaded namespace creation/destruction stress test currently
      fails with signatures like the following:
      
          sysfs group 'power' not found for kobject 'dax1.1'
          RIP: 0010:sysfs_remove_group+0x76/0x80
          Call Trace:
           device_del+0x73/0x370
           device_unregister+0x16/0x50
           nd_async_device_unregister+0x1e/0x30 [libnvdimm]
           async_run_entry_fn+0x39/0x160
           process_one_work+0x23c/0x5e0
           worker_thread+0x3c/0x390
      
          BUG: kernel NULL pointer dereference, address: 0000000000000020
          RIP: 0010:klist_put+0x1b/0x6c
          Call Trace:
           klist_del+0xe/0x10
           device_del+0x8a/0x2c9
           ? __switch_to_asm+0x34/0x70
           ? __switch_to_asm+0x40/0x70
           device_unregister+0x44/0x4f
           nd_async_device_unregister+0x22/0x2d [libnvdimm]
           async_run_entry_fn+0x47/0x15a
           process_one_work+0x1a2/0x2eb
           worker_thread+0x1b8/0x26e
      
      Use the kill_device() helper to atomically resolve the race of multiple
      threads issuing kill, device_unregister(), requests.
      Reported-by: NJane Chu <jane.chu@oracle.com>
      Reported-by: NErwin Tsaur <erwin.tsaur@oracle.com>
      Fixes: 4d88a97a ("libnvdimm, nvdimm: dimm driver and base libnvdimm device-driver...")
      Cc: <stable@vger.kernel.org>
      Link: https://github.com/pmem/ndctl/issues/96Tested-by: NTested-by: Jane Chu <jane.chu@oracle.com>
      Link: https://lore.kernel.org/r/156341207846.292348.10435719262819764054.stgit@dwillia2-desk3.amr.corp.intel.comSigned-off-by: NDan Williams <dan.j.williams@intel.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      d16bbdbb
  5. 31 7月, 2019 1 次提交
  6. 26 7月, 2019 1 次提交
    • D
      libnvdimm/pfn: fix fsdax-mode namespace info-block zero-fields · 2c0222b4
      Dan Williams 提交于
      commit 7e3e888dfc138089f4c15a81b418e88f0978f744 upstream.
      
      At namespace creation time there is the potential for the "expected to
      be zero" fields of a 'pfn' info-block to be filled with indeterminate
      data.  While the kernel buffer is zeroed on allocation it is immediately
      overwritten by nd_pfn_validate() filling it with the current contents of
      the on-media info-block location.  For fields like, 'flags' and the
      'padding' it potentially means that future implementations can not rely on
      those fields being zero.
      
      In preparation to stop using the 'start_pad' and 'end_trunc' fields for
      section alignment, arrange for fields that are not explicitly
      initialized to be guaranteed zero.  Bump the minor version to indicate
      it is safe to assume the 'padding' and 'flags' are zero.  Otherwise,
      this corruption is expected to benign since all other critical fields
      are explicitly initialized.
      
      Note The cc: stable is about spreading this new policy to as many
      kernels as possible not fixing an issue in those kernels.  It is not
      until the change titled "libnvdimm/pfn: Stop padding pmem namespaces to
      section alignment" where this improper initialization becomes a problem.
      So if someone decides to backport "libnvdimm/pfn: Stop padding pmem
      namespaces to section alignment" (which is not tagged for stable), make
      sure this pre-requisite is flagged.
      
      Link: http://lkml.kernel.org/r/156092356065.979959.6681003754765958296.stgit@dwillia2-desk3.amr.corp.intel.com
      Fixes: 32ab0a3f ("libnvdimm, pmem: 'struct page' for pmem")
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>	[ppc64]
      Cc: <stable@vger.kernel.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Jane Chu <jane.chu@oracle.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richardw.yang@linux.intel.com>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      2c0222b4
  7. 19 6月, 2019 1 次提交
    • Q
      libnvdimm: Fix compilation warnings with W=1 · 90a56454
      Qian Cai 提交于
      [ Upstream commit c01dafad77fea8d64c4fdca0a6031c980842ad65 ]
      
      Several places (dimm_devs.c, core.c etc) include label.h but only
      label.c uses NSINDEX_SIGNATURE, so move its definition to label.c
      instead.
      
      In file included from drivers/nvdimm/dimm_devs.c:23:
      drivers/nvdimm/label.h:41:19: warning: 'NSINDEX_SIGNATURE' defined but
      not used [-Wunused-const-variable=]
      
      Also, some places abuse "/**" which is only reserved for the kernel-doc.
      
      drivers/nvdimm/bus.c:648: warning: cannot understand function prototype:
      'struct attribute_group nd_device_attribute_group = '
      drivers/nvdimm/bus.c:677: warning: cannot understand function prototype:
      'struct attribute_group nd_numa_attribute_group = '
      
      Those are just some member assignments for the "struct attribute_group"
      instances and it can't be expressed in the kernel-doc.
      Reviewed-by: NVishal Verma <vishal.l.verma@intel.com>
      Signed-off-by: NQian Cai <cai@lca.pw>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      90a56454
  8. 31 5月, 2019 1 次提交
    • D
      libnvdimm/pmem: Bypass CONFIG_HARDENED_USERCOPY overhead · ee6d3eb3
      Dan Williams 提交于
      commit 52f476a323f9efc959be1c890d0cdcf12e1582e0 upstream.
      
      Jeff discovered that performance improves from ~375K iops to ~519K iops
      on a simple psync-write fio workload when moving the location of 'struct
      page' from the default PMEM location to DRAM. This result is surprising
      because the expectation is that 'struct page' for dax is only needed for
      third party references to dax mappings. For example, a dax-mapped buffer
      passed to another system call for direct-I/O requires 'struct page' for
      sending the request down the driver stack and pinning the page. There is
      no usage of 'struct page' for first party access to a file via
      read(2)/write(2) and friends.
      
      However, this "no page needed" expectation is violated by
      CONFIG_HARDENED_USERCOPY and the check_copy_size() performed in
      copy_from_iter_full_nocache() and copy_to_iter_mcsafe(). The
      check_heap_object() helper routine assumes the buffer is backed by a
      slab allocator (DRAM) page and applies some checks.  Those checks are
      invalid, dax pages do not originate from the slab, and redundant,
      dax_iomap_actor() has already validated that the I/O is within bounds.
      Specifically that routine validates that the logical file offset is
      within bounds of the file, then it does a sector-to-pfn translation
      which validates that the physical mapping is within bounds of the block
      device.
      
      Bypass additional hardened usercopy overhead and call the 'no check'
      versions of the copy_{to,from}_iter operations directly.
      
      Fixes: 0aed55af ("x86, uaccess: introduce copy_from_iter_flushcache...")
      Cc: <stable@vger.kernel.org>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Matthew Wilcox <willy@infradead.org>
      Reported-and-tested-by: NJeff Smits <jeff.smits@intel.com>
      Acked-by: NKees Cook <keescook@chromium.org>
      Acked-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ee6d3eb3
  9. 22 5月, 2019 1 次提交
    • D
      libnvdimm/namespace: Fix label tracking error · 866f0111
      Dan Williams 提交于
      commit c4703ce11c23423d4b46e3d59aef7979814fd608 upstream.
      
      Users have reported intermittent occurrences of DIMM initialization
      failures due to duplicate allocations of address capacity detected in
      the labels, or errors of the form below, both have the same root cause.
      
          nd namespace1.4: failed to track label: 0
          WARNING: CPU: 17 PID: 1381 at drivers/nvdimm/label.c:863
      
          RIP: 0010:__pmem_label_update+0x56c/0x590 [libnvdimm]
          Call Trace:
           ? nd_pmem_namespace_label_update+0xd6/0x160 [libnvdimm]
           nd_pmem_namespace_label_update+0xd6/0x160 [libnvdimm]
           uuid_store+0x17e/0x190 [libnvdimm]
           kernfs_fop_write+0xf0/0x1a0
           vfs_write+0xb7/0x1b0
           ksys_write+0x57/0xd0
           do_syscall_64+0x60/0x210
      
      Unfortunately those reports were typically with a busy parallel
      namespace creation / destruction loop making it difficult to see the
      components of the bug. However, Jane provided a simple reproducer using
      the work-in-progress sub-section implementation.
      
      When ndctl is reconfiguring a namespace it may take an existing defunct
      / disabled namespace and reconfigure it with a new uuid and other
      parameters. Critically namespace_update_uuid() takes existing address
      resources and renames them for the new namespace to use / reconfigure as
      it sees fit. The bug is that this rename only happens in the resource
      tracking tree. Existing labels with the old uuid are not reaped leading
      to a scenario where multiple active labels reference the same span of
      address range.
      
      Teach namespace_update_uuid() to flag any references to the old uuid for
      reaping at the next label update attempt.
      
      Cc: <stable@vger.kernel.org>
      Fixes: bf9bccc1 ("libnvdimm: pmem label sets and namespace instantiation")
      Link: https://github.com/pmem/ndctl/issues/91Reported-by: NJane Chu <jane.chu@oracle.com>
      Reported-by: NJeff Moyer <jmoyer@redhat.com>
      Reported-by: NErwin Tsaur <erwin.tsaur@oracle.com>
      Cc: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      866f0111
  10. 17 5月, 2019 3 次提交
  11. 24 3月, 2019 4 次提交
    • O
      libnvdimm: Fix altmap reservation size calculation · 3b8da135
      Oliver O'Halloran 提交于
      commit 07464e88365e9236febaca9ed1a2e2006d8bc952 upstream.
      
      Libnvdimm reserves the first 8K of pfn and devicedax namespaces to
      store a superblock describing the namespace. This 8K reservation
      is contained within the altmap area which the kernel uses for the
      vmemmap backing for the pages within the namespace. The altmap
      allows for some pages at the start of the altmap area to be reserved
      and that mechanism is used to protect the superblock from being
      re-used as vmemmap backing.
      
      The number of PFNs to reserve is calculated using:
      
      	PHYS_PFN(SZ_8K)
      
      Which is implemented as:
      
       #define PHYS_PFN(x) ((unsigned long)((x) >> PAGE_SHIFT))
      
      So on systems where PAGE_SIZE is greater than 8K the reservation
      size is truncated to zero and the superblock area is re-used as
      vmemmap backing. As a result all the namespace information stored
      in the superblock (i.e. if it's a PFN or DAX namespace) is lost
      and the namespace needs to be re-created to get access to the
      contents.
      
      This patch fixes this by using PFN_UP() rather than PHYS_PFN() to ensure
      that at least one page is reserved. On systems with a 4K pages size this
      patch should have no effect.
      
      Cc: stable@vger.kernel.org
      Cc: Dan Williams <dan.j.williams@intel.com>
      Fixes: ac515c08 ("libnvdimm, pmem, pfn: move pfn setup to the core")
      Signed-off-by: NOliver O'Halloran <oohall@gmail.com>
      Reviewed-by: NVishal Verma <vishal.l.verma@intel.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3b8da135
    • D
      libnvdimm/pmem: Honor force_raw for legacy pmem regions · 696c3752
      Dan Williams 提交于
      commit fa7d2e639cd90442d868dfc6ca1d4cc9d8bf206e upstream.
      
      For recovery, where non-dax access is needed to a given physical address
      range, and testing, allow the 'force_raw' attribute to override the
      default establishment of a dev_pagemap.
      
      Otherwise without this capability it is possible to end up with a
      namespace that can not be activated due to corrupted info-block, and one
      that can not be repaired due to a section collision.
      
      Cc: <stable@vger.kernel.org>
      Fixes: 004f1afb ("libnvdimm, pmem: direct map legacy pmem by default")
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      696c3752
    • W
      libnvdimm, pfn: Fix over-trim in trim_pfn_device() · 6a89ed7a
      Wei Yang 提交于
      commit f101ada7da6551127d192c2f1742c1e9e0f62799 upstream.
      
      When trying to see whether current nd_region intersects with others,
      trim_pfn_device() has already calculated the *size* to be expanded to
      SECTION size.
      
      Do not double append 'adjust' to 'size' when calculating whether the end
      of a region collides with the next pmem region.
      
      Fixes: ae86cbfef381 "libnvdimm, pfn: Pad pfn namespaces relative to other regions"
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NWei Yang <richardw.yang@linux.intel.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      6a89ed7a
    • D
      libnvdimm/label: Clear 'updating' flag after label-set update · 2b88d92e
      Dan Williams 提交于
      commit 966d23a006ca7b44ac8cf4d0c96b19785e0c3da0 upstream.
      
      The UEFI 2.7 specification sets expectations that the 'updating' flag is
      eventually cleared. To date, the libnvdimm core has never adhered to
      that protocol. The policy of the core matches the policy of other
      multi-device info-block formats like MD-Software-RAID that expect
      administrator intervention on inconsistent info-blocks, not automatic
      invalidation.
      
      However, some pre-boot environments may unfortunately attempt to "clean
      up" the labels and invalidate a set when it fails to find at least one
      "non-updating" label in the set. Clear the updating flag after set
      updates to minimize the window of vulnerability to aggressive pre-boot
      environments.
      
      Ideally implementations would not write to the label area outside of
      creating namespaces.
      
      Note that this only minimizes the window, it does not close it as the
      system can still crash while clearing the flag and the set can be
      subsequently deleted / invalidated by the pre-boot environment.
      
      Fixes: f524bf27 ("libnvdimm: write pmem label set")
      Cc: <stable@vger.kernel.org>
      Cc: Kelly Couch <kelly.j.couch@intel.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      2b88d92e
  12. 13 1月, 2019 1 次提交
  13. 13 12月, 2018 1 次提交
  14. 14 11月, 2018 3 次提交
  15. 21 8月, 2018 1 次提交
  16. 20 8月, 2018 1 次提交
    • V
      libnvdimm: fix ars_status output length calculation · 286e8771
      Vishal Verma 提交于
      Commit efda1b5d ("acpi, nfit, libnvdimm: fix / harden ars_status output length handling")
      Introduced additional hardening for ambiguity in the ACPI spec for
      ars_status output sizing. However, it had a couple of cases mixed up.
      Where it should have been checking for (and returning) "out_field[1] -
      4" it was using "out_field[1] - 8" and vice versa.
      
      This caused a four byte discrepancy in the buffer size passed on to
      the command handler, and in some cases, this caused memory corruption
      like:
      
        ./daxdev-errors.sh: line 76: 24104 Aborted   (core dumped) ./daxdev-errors $busdev $region
        malloc(): memory corruption
        Program received signal SIGABRT, Aborted.
        [...]
        #5  0x00007ffff7865a2e in calloc () from /lib64/libc.so.6
        #6  0x00007ffff7bc2970 in ndctl_bus_cmd_new_ars_status (ars_cap=ars_cap@entry=0x6153b0) at ars.c:136
        #7  0x0000000000401644 in check_ars_status (check=0x7fffffffdeb0, bus=0x604c20) at daxdev-errors.c:144
        #8  test_daxdev_clear_error (region_name=<optimized out>, bus_name=<optimized out>)
            at daxdev-errors.c:332
      
      Cc: <stable@vger.kernel.org>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Lukasz Dorau <lukasz.dorau@intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Fixes: efda1b5d ("acpi, nfit, libnvdimm: fix / harden ars_status output length handling")
      Signed-off-by: NVishal Verma <vishal.l.verma@intel.com>
      Reviewed-by: NKeith Busch <keith.busch@intel.com>
      Signed-of-by: NDave Jiang <dave.jiang@intel.com>
      286e8771
  17. 31 7月, 2018 1 次提交
  18. 26 7月, 2018 2 次提交
  19. 18 7月, 2018 2 次提交
    • M
      block: Add and use op_stat_group() for indexing disk_stat fields. · ddcf35d3
      Michael Callahan 提交于
      Add and use a new op_stat_group() function for indexing partition stat
      fields rather than indexing them by rq_data_dir() or bio_data_dir().
      This function works similarly to op_is_sync() in that it takes the
      request::cmd_flags or bio::bi_opf flags and determines which stats
      should et updated.
      
      In addition, the second parameter to generic_start_io_acct() and
      generic_end_io_acct() is now a REQ_OP rather than simply a read or
      write bit and it uses op_stat_group() on the parameter to determine
      the stat group.
      
      Note that the partition in_flight counts are not part of the per-cpu
      statistics and as such are not indexed via this function.  It's now
      indexed by op_is_write().
      
      tj: Refreshed on top of v4.17.  Updated to pass around REQ_OP.
      Signed-off-by: NMichael Callahan <michaelcallahan@fb.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Joshua Morris <josh.h.morris@us.ibm.com>
      Cc: Philipp Reisner <philipp.reisner@linbit.com>
      Cc: Matias Bjorling <mb@lightnvm.io>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Cc: Alasdair Kergon <agk@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      ddcf35d3
    • T
      block: make bdev_ops->rw_page() take a REQ_OP instead of bool · 3f289dcb
      Tejun Heo 提交于
      c11f0c0b ("block/mm: make bdev_ops->rw_page() take a bool for
      read/write") replaced @OP with boolean @is_write, which limited the
      amount of information going into ->rw_page() and more importantly
      page_endio(), which removed the need to expose block internals to mm.
      
      Unfortunately, we want to track discards separately and @is_write
      isn't enough information.  This patch updates bdev_ops->rw_page() to
      take REQ_OP instead but leaves page_endio() to take bool @is_write.
      This allows the block part of operations to have enough information
      while not leaking it to mm.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Mike Christie <mchristi@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      3f289dcb
  20. 15 7月, 2018 1 次提交
  21. 29 6月, 2018 2 次提交
  22. 07 6月, 2018 2 次提交
    • R
      libnvdimm, pmem: Do not flush power-fail protected CPU caches · 546eb031
      Ross Zwisler 提交于
      This commit:
      
      5fdf8e5b ("libnvdimm: re-enable deep flush for pmem devices via fsync()")
      
      intended to make sure that deep flush was always available even on
      platforms which support a power-fail protected CPU cache.  An unintended
      side effect of this change was that we also lost the ability to skip
      flushing CPU caches on those power-fail protected CPU cache.
      
      Fix this by skipping the low level cache flushing in dax_flush() if we have
      CPU caches which are power-fail protected.  The user can still override this
      behavior by manually setting the write_cache state of a namespace.  See
      libndctl's ndctl_namespace_write_cache_is_enabled(),
      ndctl_namespace_enable_write_cache() and
      ndctl_namespace_disable_write_cache() functions.
      
      Cc: <stable@vger.kernel.org>
      Fixes: 5fdf8e5b ("libnvdimm: re-enable deep flush for pmem devices via fsync()")
      Signed-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      546eb031
    • R
      libnvdimm, pmem: Unconditionally deep flush on *sync · ce7f11a2
      Ross Zwisler 提交于
      Prior to this commit we would only do a "deep flush" (have nvdimm_flush()
      write to each of the flush hints for a region) in response to an
      msync/fsync/sync call if the nvdimm_has_cache() returned true at the time
      we were setting up the request queue.  This happens due to the write cache
      value passed in to blk_queue_write_cache(), which then causes the block
      layer to send down BIOs with REQ_FUA and REQ_PREFLUSH set.  We do have a
      "write_cache" sysfs entry for namespaces, i.e.:
      
        /sys/bus/nd/devices/pfn0.1/block/pmem0/dax/write_cache
      
      which can be used to control whether or not the kernel thinks a given
      namespace has a write cache, but this didn't modify the deep flush behavior
      that we set up when the driver was initialized.  Instead, it only modified
      whether or not DAX would flush CPU caches via dax_flush() in response to
      *sync calls.
      
      Simplify this by making the *sync deep flush always happen, regardless of
      the write cache setting of a namespace.  The DAX CPU cache flushing will
      still be controlled the write_cache setting of the namespace.
      
      Cc: <stable@vger.kernel.org>
      Fixes: 5fdf8e5b ("libnvdimm: re-enable deep flush for pmem devices via fsync()")
      Signed-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      ce7f11a2