1. 16 11月, 2017 3 次提交
    • M
      mm, pagevec: remove cold parameter for pagevecs · 86679820
      Mel Gorman 提交于
      Every pagevec_init user claims the pages being released are hot even in
      cases where it is unlikely the pages are hot.  As no one cares about the
      hotness of pages being released to the allocator, just ditch the
      parameter.
      
      No performance impact is expected as the overhead is marginal.  The
      parameter is removed simply because it is a bit stupid to have a useless
      parameter copied everywhere.
      
      Link: http://lkml.kernel.org/r/20171018075952.10627-6-mgorman@techsingularity.netSigned-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      86679820
    • M
      mm, truncate: do not check mapping for every page being truncated · c7df8ad2
      Mel Gorman 提交于
      During truncation, the mapping has already been checked for shmem and
      dax so it's known that workingset_update_node is required.
      
      This patch avoids the checks on mapping for each page being truncated.
      In all other cases, a lookup helper is used to determine if
      workingset_update_node() needs to be called.  The one danger is that the
      API is slightly harder to use as calling workingset_update_node directly
      without checking for dax or shmem mappings could lead to surprises.
      However, the API rarely needs to be used and hopefully the comment is
      enough to give people the hint.
      
      sparsetruncate (tiny)
                                    4.14.0-rc4             4.14.0-rc4
                                   oneirq-v1r1        pickhelper-v1r1
      Min          Time      141.00 (   0.00%)      140.00 (   0.71%)
      1st-qrtle    Time      142.00 (   0.00%)      141.00 (   0.70%)
      2nd-qrtle    Time      142.00 (   0.00%)      142.00 (   0.00%)
      3rd-qrtle    Time      143.00 (   0.00%)      143.00 (   0.00%)
      Max-90%      Time      144.00 (   0.00%)      144.00 (   0.00%)
      Max-95%      Time      147.00 (   0.00%)      145.00 (   1.36%)
      Max-99%      Time      195.00 (   0.00%)      191.00 (   2.05%)
      Max          Time      230.00 (   0.00%)      205.00 (  10.87%)
      Amean        Time      144.37 (   0.00%)      143.82 (   0.38%)
      Stddev       Time       10.44 (   0.00%)        9.00 (  13.74%)
      Coeff        Time        7.23 (   0.00%)        6.26 (  13.41%)
      Best99%Amean Time      143.72 (   0.00%)      143.34 (   0.26%)
      Best95%Amean Time      142.37 (   0.00%)      142.00 (   0.26%)
      Best90%Amean Time      142.19 (   0.00%)      141.85 (   0.24%)
      Best75%Amean Time      141.92 (   0.00%)      141.58 (   0.24%)
      Best50%Amean Time      141.69 (   0.00%)      141.31 (   0.27%)
      Best25%Amean Time      141.38 (   0.00%)      140.97 (   0.29%)
      
      As you'd expect, the gain is marginal but it can be detected.  The
      differences in bonnie are all within the noise which is not surprising
      given the impact on the microbenchmark.
      
      radix_tree_update_node_t is a callback for some radix operations that
      optionally passes in a private field.  The only user of the callback is
      workingset_update_node and as it no longer requires a mapping, the
      private field is removed.
      
      Link: http://lkml.kernel.org/r/20171018075952.10627-3-mgorman@techsingularity.netSigned-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c7df8ad2
    • J
      mm/mmu_notifier: avoid double notification when it is useless · 0f10851e
      Jérôme Glisse 提交于
      This patch only affects users of mmu_notifier->invalidate_range callback
      which are device drivers related to ATS/PASID, CAPI, IOMMUv2, SVM ...
      and it is an optimization for those users.  Everyone else is unaffected
      by it.
      
      When clearing a pte/pmd we are given a choice to notify the event under
      the page table lock (notify version of *_clear_flush helpers do call the
      mmu_notifier_invalidate_range).  But that notification is not necessary
      in all cases.
      
      This patch removes almost all cases where it is useless to have a call
      to mmu_notifier_invalidate_range before
      mmu_notifier_invalidate_range_end.  It also adds documentation in all
      those cases explaining why.
      
      Below is a more in depth analysis of why this is fine to do this:
      
      For secondary TLB (non CPU TLB) like IOMMU TLB or device TLB (when
      device use thing like ATS/PASID to get the IOMMU to walk the CPU page
      table to access a process virtual address space).  There is only 2 cases
      when you need to notify those secondary TLB while holding page table
      lock when clearing a pte/pmd:
      
        A) page backing address is free before mmu_notifier_invalidate_range_end
        B) a page table entry is updated to point to a new page (COW, write fault
           on zero page, __replace_page(), ...)
      
      Case A is obvious you do not want to take the risk for the device to write
      to a page that might now be used by something completely different.
      
      Case B is more subtle. For correctness it requires the following sequence
      to happen:
        - take page table lock
        - clear page table entry and notify (pmd/pte_huge_clear_flush_notify())
        - set page table entry to point to new page
      
      If clearing the page table entry is not followed by a notify before setting
      the new pte/pmd value then you can break memory model like C11 or C++11 for
      the device.
      
      Consider the following scenario (device use a feature similar to ATS/
      PASID):
      
      Two address addrA and addrB such that |addrA - addrB| >= PAGE_SIZE we
      assume they are write protected for COW (other case of B apply too).
      
      [Time N] -----------------------------------------------------------------
      CPU-thread-0  {try to write to addrA}
      CPU-thread-1  {try to write to addrB}
      CPU-thread-2  {}
      CPU-thread-3  {}
      DEV-thread-0  {read addrA and populate device TLB}
      DEV-thread-2  {read addrB and populate device TLB}
      [Time N+1] ---------------------------------------------------------------
      CPU-thread-0  {COW_step0: {mmu_notifier_invalidate_range_start(addrA)}}
      CPU-thread-1  {COW_step0: {mmu_notifier_invalidate_range_start(addrB)}}
      CPU-thread-2  {}
      CPU-thread-3  {}
      DEV-thread-0  {}
      DEV-thread-2  {}
      [Time N+2] ---------------------------------------------------------------
      CPU-thread-0  {COW_step1: {update page table point to new page for addrA}}
      CPU-thread-1  {COW_step1: {update page table point to new page for addrB}}
      CPU-thread-2  {}
      CPU-thread-3  {}
      DEV-thread-0  {}
      DEV-thread-2  {}
      [Time N+3] ---------------------------------------------------------------
      CPU-thread-0  {preempted}
      CPU-thread-1  {preempted}
      CPU-thread-2  {write to addrA which is a write to new page}
      CPU-thread-3  {}
      DEV-thread-0  {}
      DEV-thread-2  {}
      [Time N+3] ---------------------------------------------------------------
      CPU-thread-0  {preempted}
      CPU-thread-1  {preempted}
      CPU-thread-2  {}
      CPU-thread-3  {write to addrB which is a write to new page}
      DEV-thread-0  {}
      DEV-thread-2  {}
      [Time N+4] ---------------------------------------------------------------
      CPU-thread-0  {preempted}
      CPU-thread-1  {COW_step3: {mmu_notifier_invalidate_range_end(addrB)}}
      CPU-thread-2  {}
      CPU-thread-3  {}
      DEV-thread-0  {}
      DEV-thread-2  {}
      [Time N+5] ---------------------------------------------------------------
      CPU-thread-0  {preempted}
      CPU-thread-1  {}
      CPU-thread-2  {}
      CPU-thread-3  {}
      DEV-thread-0  {read addrA from old page}
      DEV-thread-2  {read addrB from new page}
      
      So here because at time N+2 the clear page table entry was not pair with a
      notification to invalidate the secondary TLB, the device see the new value
      for addrB before seing the new value for addrA.  This break total memory
      ordering for the device.
      
      When changing a pte to write protect or to point to a new write protected
      page with same content (KSM) it is ok to delay invalidate_range callback
      to mmu_notifier_invalidate_range_end() outside the page table lock.  This
      is true even if the thread doing page table update is preempted right
      after releasing page table lock before calling
      mmu_notifier_invalidate_range_end
      
      Thanks to Andrea for thinking of a problematic scenario for COW.
      
      [jglisse@redhat.com: v2]
        Link: http://lkml.kernel.org/r/20171017031003.7481-2-jglisse@redhat.com
      Link: http://lkml.kernel.org/r/20170901173011.10745-1-jglisse@redhat.comSigned-off-by: NJérôme Glisse <jglisse@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Alistair Popple <alistair@popple.id.au>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0f10851e
  2. 02 10月, 2017 1 次提交
  3. 11 9月, 2017 1 次提交
    • M
      dax: remove the pmem_dax_ops->flush abstraction · c3ca015f
      Mikulas Patocka 提交于
      Commit abebfbe2 ("dm: add ->flush() dax operation support") is
      buggy. A DM device may be composed of multiple underlying devices and
      all of them need to be flushed. That commit just routes the flush
      request to the first device and ignores the other devices.
      
      It could be fixed by adding more complex logic to the device mapper. But
      there is only one implementation of the method pmem_dax_ops->flush - that
      is pmem_dax_flush() - and it calls arch_wb_cache_pmem(). Consequently, we
      don't need the pmem_dax_ops->flush abstraction at all, we can call
      arch_wb_cache_pmem() directly from dax_flush() because dax_dev->ops->flush
      can't ever reach anything different from arch_wb_cache_pmem().
      
      It should be also pointed out that for some uses of persistent memory it
      is needed to flush only a very small amount of data (such as 1 cacheline),
      and it would be overkill if we go through that device mapper machinery for
      a single flushed cache line.
      
      Fix this by removing the pmem_dax_ops->flush abstraction and call
      arch_wb_cache_pmem() directly from dax_flush(). Also, remove the device
      mapper code that forwards the flushes.
      
      Fixes: abebfbe2 ("dm: add ->flush() dax operation support")
      Cc: stable@vger.kernel.org
      Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
      Reviewed-by: NDan Williams <dan.j.williams@intel.com>
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      c3ca015f
  4. 07 9月, 2017 7 次提交
    • N
      dax: initialize variable pfn before using it · 2f52074d
      Nicolas Iooss 提交于
      dax_pmd_insert_mapping() contains the following code:
      
              pfn_t pfn;
              if (bdev_dax_pgoff(bdev, sector, size, &pgoff) != 0)
                  goto fallback;
              /* ... */
          fallback:
            trace_dax_pmd_insert_mapping_fallback(inode, vmf, length, pfn, ret);
      
      When the condition in the if statement fails, the function calls
      trace_dax_pmd_insert_mapping_fallback() with an uninitialized pfn value.
      
      This issue has been found while building the kernel with clang.  The
      compiler reported:
      
          fs/dax.c:1280:6: error: variable 'pfn' is used uninitialized
          whenever 'if' condition is true [-Werror,-Wsometimes-uninitialized]
              if (bdev_dax_pgoff(bdev, sector, size, &pgoff) != 0)
                  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
          fs/dax.c:1310:60: note: uninitialized use occurs here
            trace_dax_pmd_insert_mapping_fallback(inode, vmf, length, pfn, ret);
                                                                           ^~~
      
      Link: http://lkml.kernel.org/r/20170903083000.587-1-nicolas.iooss_linux@m4x.orgSigned-off-by: NNicolas Iooss <nicolas.iooss_linux@m4x.org>
      Reviewed-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2f52074d
    • R
      dax: use PG_PMD_COLOUR instead of open coding · 917f3452
      Ross Zwisler 提交于
      Use ~PG_PMD_COLOUR in dax_entry_waitqueue() instead of open coding an
      equivalent page offset mask.
      
      Link: http://lkml.kernel.org/r/20170822222436.18926-2-ross.zwisler@linux.intel.comSigned-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: "Slusarz, Marcin" <marcin.slusarz@intel.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      917f3452
    • R
      dax: explain how read(2)/write(2) addresses are validated · a2e050f5
      Ross Zwisler 提交于
      Add a comment explaining how the user addresses provided to read(2) and
      write(2) are validated in the DAX I/O path.
      
      We call dax_copy_from_iter() or copy_to_iter() on these without calling
      access_ok() first in the DAX code, and there was a concern that the user
      might be able to read/write to arbitrary kernel addresses with this
      path.
      
      Link: http://lkml.kernel.org/r/20170816173615.10098-1-ross.zwisler@linux.intel.comSigned-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a2e050f5
    • R
      dax: move all DAX radix tree defs to fs/dax.c · 527b19d0
      Ross Zwisler 提交于
      Now that we no longer insert struct page pointers in DAX radix trees the
      page cache code no longer needs to know anything about DAX exceptional
      entries.  Move all the DAX exceptional entry definitions from dax.h to
      fs/dax.c.
      
      Link: http://lkml.kernel.org/r/20170724170616.25810-6-ross.zwisler@linux.intel.comSigned-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Suggested-by: NJan Kara <jack@suse.cz>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      527b19d0
    • R
      dax: remove DAX code from page_cache_tree_insert() · d01ad197
      Ross Zwisler 提交于
      Now that we no longer insert struct page pointers in DAX radix trees we
      can remove the special casing for DAX in page_cache_tree_insert().
      
      This also allows us to make dax_wake_mapping_entry_waiter() local to
      fs/dax.c, removing it from dax.h.
      
      Link: http://lkml.kernel.org/r/20170724170616.25810-5-ross.zwisler@linux.intel.comSigned-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Suggested-by: NJan Kara <jack@suse.cz>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d01ad197
    • R
      dax: use common 4k zero page for dax mmap reads · 91d25ba8
      Ross Zwisler 提交于
      When servicing mmap() reads from file holes the current DAX code
      allocates a page cache page of all zeroes and places the struct page
      pointer in the mapping->page_tree radix tree.
      
      This has three major drawbacks:
      
      1) It consumes memory unnecessarily. For every 4k page that is read via
         a DAX mmap() over a hole, we allocate a new page cache page. This
         means that if you read 1GiB worth of pages, you end up using 1GiB of
         zeroed memory. This is easily visible by looking at the overall
         memory consumption of the system or by looking at /proc/[pid]/smaps:
      
      	7f62e72b3000-7f63272b3000 rw-s 00000000 103:00 12   /root/dax/data
      	Size:            1048576 kB
      	Rss:             1048576 kB
      	Pss:             1048576 kB
      	Shared_Clean:          0 kB
      	Shared_Dirty:          0 kB
      	Private_Clean:   1048576 kB
      	Private_Dirty:         0 kB
      	Referenced:      1048576 kB
      	Anonymous:             0 kB
      	LazyFree:              0 kB
      	AnonHugePages:         0 kB
      	ShmemPmdMapped:        0 kB
      	Shared_Hugetlb:        0 kB
      	Private_Hugetlb:       0 kB
      	Swap:                  0 kB
      	SwapPss:               0 kB
      	KernelPageSize:        4 kB
      	MMUPageSize:           4 kB
      	Locked:                0 kB
      
      2) It is slower than using a common zero page because each page fault
         has more work to do. Instead of just inserting a common zero page we
         have to allocate a page cache page, zero it, and then insert it. Here
         are the average latencies of dax_load_hole() as measured by ftrace on
         a random test box:
      
          Old method, using zeroed page cache pages:	3.4 us
          New method, using the common 4k zero page:	0.8 us
      
         This was the average latency over 1 GiB of sequential reads done by
         this simple fio script:
      
           [global]
           size=1G
           filename=/root/dax/data
           fallocate=none
           [io]
           rw=read
           ioengine=mmap
      
      3) The fact that we had to check for both DAX exceptional entries and
         for page cache pages in the radix tree made the DAX code more
         complex.
      
      Solve these issues by following the lead of the DAX PMD code and using a
      common 4k zero page instead.  As with the PMD code we will now insert a
      DAX exceptional entry into the radix tree instead of a struct page
      pointer which allows us to remove all the special casing in the DAX
      code.
      
      Note that we do still pretty aggressively check for regular pages in the
      DAX radix tree, especially where we take action based on the bits set in
      the page.  If we ever find a regular page in our radix tree now that
      most likely means that someone besides DAX is inserting pages (which has
      happened lots of times in the past), and we want to find that out early
      and fail loudly.
      
      This solution also removes the extra memory consumption.  Here is that
      same /proc/[pid]/smaps after 1GiB of reading from a hole with the new
      code:
      
      	7f2054a74000-7f2094a74000 rw-s 00000000 103:00 12   /root/dax/data
      	Size:            1048576 kB
      	Rss:                   0 kB
      	Pss:                   0 kB
      	Shared_Clean:          0 kB
      	Shared_Dirty:          0 kB
      	Private_Clean:         0 kB
      	Private_Dirty:         0 kB
      	Referenced:            0 kB
      	Anonymous:             0 kB
      	LazyFree:              0 kB
      	AnonHugePages:         0 kB
      	ShmemPmdMapped:        0 kB
      	Shared_Hugetlb:        0 kB
      	Private_Hugetlb:       0 kB
      	Swap:                  0 kB
      	SwapPss:               0 kB
      	KernelPageSize:        4 kB
      	MMUPageSize:           4 kB
      	Locked:                0 kB
      
      Overall system memory consumption is similarly improved.
      
      Another major change is that we remove dax_pfn_mkwrite() from our fault
      flow, and instead rely on the page fault itself to make the PTE dirty
      and writeable.  The following description from the patch adding the
      vm_insert_mixed_mkwrite() call explains this a little more:
      
         "To be able to use the common 4k zero page in DAX we need to have our
          PTE fault path look more like our PMD fault path where a PTE entry
          can be marked as dirty and writeable as it is first inserted rather
          than waiting for a follow-up dax_pfn_mkwrite() =>
          finish_mkwrite_fault() call.
      
          Right now we can rely on having a dax_pfn_mkwrite() call because we
          can distinguish between these two cases in do_wp_page():
      
                  case 1: 4k zero page => writable DAX storage
                  case 2: read-only DAX storage => writeable DAX storage
      
          This distinction is made by via vm_normal_page(). vm_normal_page()
          returns false for the common 4k zero page, though, just as it does
          for DAX ptes. Instead of special casing the DAX + 4k zero page case
          we will simplify our DAX PTE page fault sequence so that it matches
          our DAX PMD sequence, and get rid of the dax_pfn_mkwrite() helper.
          We will instead use dax_iomap_fault() to handle write-protection
          faults.
      
          This means that insert_pfn() needs to follow the lead of
          insert_pfn_pmd() and allow us to pass in a 'mkwrite' flag. If
          'mkwrite' is set insert_pfn() will do the work that was previously
          done by wp_page_reuse() as part of the dax_pfn_mkwrite() call path"
      
      Link: http://lkml.kernel.org/r/20170724170616.25810-4-ross.zwisler@linux.intel.comSigned-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      91d25ba8
    • R
      dax: relocate some dax functions · e30331ff
      Ross Zwisler 提交于
      dax_load_hole() will soon need to call dax_insert_mapping_entry(), so it
      needs to be moved lower in dax.c so the definition exists.
      
      dax_wake_mapping_entry_waiter() will soon be removed from dax.h and be
      made static to dax.c, so we need to move its definition above all its
      callers.
      
      Link: http://lkml.kernel.org/r/20170724170616.25810-3-ross.zwisler@linux.intel.comSigned-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e30331ff
  5. 01 9月, 2017 1 次提交
    • J
      dax: update to new mmu_notifier semantic · a4d1a885
      Jérôme Glisse 提交于
      Replace all mmu_notifier_invalidate_page() calls by *_invalidate_range()
      and make sure it is bracketed by calls to *_invalidate_range_start()/end().
      
      Note that because we can not presume the pmd value or pte value we have
      to assume the worst and unconditionaly report an invalidation as
      happening.
      Signed-off-by: NJérôme Glisse <jglisse@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Bernhard Held <berny156@gmx.de>
      Cc: Adam Borowski <kilobyte@angband.pl>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Wanpeng Li <kernellwp@gmail.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Takashi Iwai <tiwai@suse.de>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: axie <axie@amd.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a4d1a885
  6. 26 8月, 2017 1 次提交
    • R
      dax: fix deadlock due to misaligned PMD faults · fffa281b
      Ross Zwisler 提交于
      In DAX there are two separate places where the 2MiB range of a PMD is
      defined.
      
      The first is in the page tables, where a PMD mapping inserted for a
      given address spans from (vmf->address & PMD_MASK) to ((vmf->address &
      PMD_MASK) + PMD_SIZE - 1).  That is, from the 2MiB boundary below the
      address to the 2MiB boundary above the address.
      
      So, for example, a fault at address 3MiB (0x30 0000) falls within the
      PMD that ranges from 2MiB (0x20 0000) to 4MiB (0x40 0000).
      
      The second PMD range is in the mapping->page_tree, where a given file
      offset is covered by a radix tree entry that spans from one 2MiB aligned
      file offset to another 2MiB aligned file offset.
      
      So, for example, the file offset for 3MiB (pgoff 768) falls within the
      PMD range for the order 9 radix tree entry that ranges from 2MiB (pgoff
      512) to 4MiB (pgoff 1024).
      
      This system works so long as the addresses and file offsets for a given
      mapping both have the same offsets relative to the start of each PMD.
      
      Consider the case where the starting address for a given file isn't 2MiB
      aligned - say our faulting address is 3 MiB (0x30 0000), but that
      corresponds to the beginning of our file (pgoff 0).  Now all the PMDs in
      the mapping are misaligned so that the 2MiB range defined in the page
      tables never matches up with the 2MiB range defined in the radix tree.
      
      The current code notices this case for DAX faults to storage with the
      following test in dax_pmd_insert_mapping():
      
      	if (pfn_t_to_pfn(pfn) & PG_PMD_COLOUR)
      		goto unlock_fallback;
      
      This test makes sure that the pfn we get from the driver is 2MiB
      aligned, and relies on the assumption that the 2MiB alignment of the pfn
      we get back from the driver matches the 2MiB alignment of the faulting
      address.
      
      However, faults to holes were not checked and we could hit the problem
      described above.
      
      This was reported in response to the NVML nvml/src/test/pmempool_sync
      TEST5:
      
      	$ cd nvml/src/test/pmempool_sync
      	$ make TEST5
      
      You can grab NVML here:
      
      	https://github.com/pmem/nvml/
      
      The dmesg warning you see when you hit this error is:
      
        WARNING: CPU: 13 PID: 2900 at fs/dax.c:641 dax_insert_mapping_entry+0x2df/0x310
      
      Where we notice in dax_insert_mapping_entry() that the radix tree entry
      we are about to replace doesn't match the locked entry that we had
      previously inserted into the tree.  This happens because the initial
      insertion was done in grab_mapping_entry() using a pgoff calculated from
      the faulting address (vmf->address), and the replacement in
      dax_pmd_load_hole() => dax_insert_mapping_entry() is done using
      vmf->pgoff.
      
      In our failure case those two page offsets (one calculated from
      vmf->address, one using vmf->pgoff) point to different order 9 radix
      tree entries.
      
      This failure case can result in a deadlock because the radix tree unlock
      also happens on the pgoff calculated from vmf->address.  This means that
      the locked radix tree entry that we swapped in to the tree in
      dax_insert_mapping_entry() using vmf->pgoff is never unlocked, so all
      future faults to that 2MiB range will block forever.
      
      Fix this by validating that the faulting address's PMD offset matches
      the PMD offset from the start of the file.  This check is done at the
      very beginning of the fault and covers faults that would have mapped to
      storage as well as faults to holes.  I left the COLOUR check in
      dax_pmd_insert_mapping() in place in case we ever hit the insanity
      condition where the alignment of the pfn we get from the driver doesn't
      match the alignment of the userspace address.
      
      Link: http://lkml.kernel.org/r/20170822222436.18926-1-ross.zwisler@linux.intel.comSigned-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Reported-by: N"Slusarz, Marcin" <marcin.slusarz@intel.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fffa281b
  7. 07 7月, 2017 1 次提交
  8. 06 7月, 2017 1 次提交
    • J
      dax: set errors in mapping when writeback fails · 819ec6b9
      Jeff Layton 提交于
      Jan Kara's description for this patch is much better than mine, so I'm
      quoting it verbatim here:
      
      DAX currently doesn't set errors in the mapping when cache flushing
      fails in dax_writeback_mapping_range(). Since this function can get
      called only from fsync(2) or sync(2), this is actually as good as it can
      currently get since we correctly propagate the error up from
      dax_writeback_mapping_range() to filemap_fdatawrite()
      
      However, in the future better writeback error handling will enable us to
      properly report these errors on fsync(2) even if there are multiple file
      descriptors open against the file or if sync(2) gets called before
      fsync(2). So convert DAX to using standard error reporting through the
      mapping.
      Signed-off-by: NJeff Layton <jlayton@redhat.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-and-tested-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      819ec6b9
  9. 28 6月, 2017 1 次提交
    • D
      x86, libnvdimm, pmem: remove global pmem api · ca6a4657
      Dan Williams 提交于
      Now that all callers of the pmem api have been converted to dax helpers that
      call back to the pmem driver, we can remove include/linux/pmem.h and
      asm/pmem.h.
      
      Cc: <x86@kernel.org>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Oliver O'Halloran <oohall@gmail.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      ca6a4657
  10. 24 6月, 2017 1 次提交
  11. 20 6月, 2017 1 次提交
    • I
      sched/wait: Rename wait_queue_t => wait_queue_entry_t · ac6424b9
      Ingo Molnar 提交于
      Rename:
      
      	wait_queue_t		=>	wait_queue_entry_t
      
      'wait_queue_t' was always a slight misnomer: its name implies that it's a "queue",
      but in reality it's a queue *entry*. The 'real' queue is the wait queue head,
      which had to carry the name.
      
      Start sorting this out by renaming it to 'wait_queue_entry_t'.
      
      This also allows the real structure name 'struct __wait_queue' to
      lose its double underscore and become 'struct wait_queue_entry',
      which is the more canonical nomenclature for such data types.
      
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      ac6424b9
  12. 16 6月, 2017 3 次提交
    • D
      x86, dax: replace clear_pmem() with open coded memset + dax_ops->flush · 81f55870
      Dan Williams 提交于
      The clear_pmem() helper simply combines a memset() plus a cache flush.
      Now that the flush routine is optionally provided by the dax device
      driver we can avoid unnecessary cache management on dax devices fronting
      volatile memory.
      
      With clear_pmem() gone we can follow on with a patch to make pmem cache
      management completely defined within the pmem driver.
      
      Cc: <x86@kernel.org>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      81f55870
    • D
      filesystem-dax: convert to dax_flush() · 6318770a
      Dan Williams 提交于
      Filesystem-DAX flushes caches whenever it writes to the address returned
      through dax_direct_access() and when writing back dirty radix entries.
      That flushing is only required in the pmem case, so the dax_flush()
      helper skips cache management work when the underlying driver does not
      specify a flush method.
      
      We still do all the dirty tracking since the radix entry will already be
      there for locking purposes. However, the work to clean the entry will be
      a nop for some dax drivers.
      
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      6318770a
    • D
      filesystem-dax: convert to dax_copy_from_iter() · fec53774
      Dan Williams 提交于
      Now that all possible providers of the dax_operations copy_from_iter
      method are implemented, switch filesytem-dax to call the driver rather
      than copy_to_iter_pmem.
      Reviewed-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      fec53774
  13. 03 6月, 2017 1 次提交
    • R
      dax: fix race between colliding PMD & PTE entries · e2093926
      Ross Zwisler 提交于
      We currently have two related PMD vs PTE races in the DAX code.  These
      can both be easily triggered by having two threads reading and writing
      simultaneously to the same private mapping, with the key being that
      private mapping reads can be handled with PMDs but private mapping
      writes are always handled with PTEs so that we can COW.
      
      Here is the first race:
      
        CPU 0					CPU 1
      
        (private mapping write)
        __handle_mm_fault()
          create_huge_pmd() - FALLBACK
          handle_pte_fault()
            passes check for pmd_devmap()
      
      					(private mapping read)
      					__handle_mm_fault()
      					  create_huge_pmd()
      					    dax_iomap_pmd_fault() inserts PMD
      
            dax_iomap_pte_fault() does a PTE fault, but we already have a DAX PMD
            			  installed in our page tables at this spot.
      
      Here's the second race:
      
        CPU 0					CPU 1
      
        (private mapping read)
        __handle_mm_fault()
          passes check for pmd_none()
          create_huge_pmd()
            dax_iomap_pmd_fault() inserts PMD
      
        (private mapping write)
        __handle_mm_fault()
          create_huge_pmd() - FALLBACK
      					(private mapping read)
      					__handle_mm_fault()
      					  passes check for pmd_none()
      					  create_huge_pmd()
      
          handle_pte_fault()
            dax_iomap_pte_fault() inserts PTE
      					    dax_iomap_pmd_fault() inserts PMD,
      					       but we already have a PTE at
      					       this spot.
      
      The core of the issue is that while there is isolation between faults to
      the same range in the DAX fault handlers via our DAX entry locking,
      there is no isolation between faults in the code in mm/memory.c.  This
      means for instance that this code in __handle_mm_fault() can run:
      
      	if (pmd_none(*vmf.pmd) && transparent_hugepage_enabled(vma)) {
      		ret = create_huge_pmd(&vmf);
      
      But by the time we actually get to run the fault handler called by
      create_huge_pmd(), the PMD is no longer pmd_none() because a racing PTE
      fault has installed a normal PMD here as a parent.  This is the cause of
      the 2nd race.  The first race is similar - there is the following check
      in handle_pte_fault():
      
      	} else {
      		/* See comment in pte_alloc_one_map() */
      		if (pmd_devmap(*vmf->pmd) || pmd_trans_unstable(vmf->pmd))
      			return 0;
      
      So if a pmd_devmap() PMD (a DAX PMD) has been installed at vmf->pmd, we
      will bail and retry the fault.  This is correct, but there is nothing
      preventing the PMD from being installed after this check but before we
      actually get to the DAX PTE fault handlers.
      
      In my testing these races result in the following types of errors:
      
        BUG: Bad rss-counter state mm:ffff8800a817d280 idx:1 val:1
        BUG: non-zero nr_ptes on freeing mm: 15
      
      Fix this issue by having the DAX fault handlers verify that it is safe
      to continue their fault after they have taken an entry lock to block
      other racing faults.
      
      [ross.zwisler@linux.intel.com: improve fix for colliding PMD & PTE entries]
        Link: http://lkml.kernel.org/r/20170526195932.32178-1-ross.zwisler@linux.intel.com
      Link: http://lkml.kernel.org/r/20170522215749.23516-2-ross.zwisler@linux.intel.comSigned-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Reported-by: NPawel Lebioda <pawel.lebioda@intel.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Pawel Lebioda <pawel.lebioda@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Xiong Zhou <xzhou@redhat.com>
      Cc: Eryu Guan <eguan@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e2093926
  14. 13 5月, 2017 4 次提交
    • R
      dax: fix PMD data corruption when fault races with write · 876f2946
      Ross Zwisler 提交于
      This is based on a patch from Jan Kara that fixed the equivalent race in
      the DAX PTE fault path.
      
      Currently DAX PMD read fault can race with write(2) in the following
      way:
      
      CPU1 - write(2)                 CPU2 - read fault
                                      dax_iomap_pmd_fault()
                                        ->iomap_begin() - sees hole
      
      dax_iomap_rw()
        iomap_apply()
          ->iomap_begin - allocates blocks
          dax_iomap_actor()
            invalidate_inode_pages2_range()
              - there's nothing to invalidate
      
                                        grab_mapping_entry()
      				  - we add huge zero page to the radix tree
      				    and map it to page tables
      
      The result is that hole page is mapped into page tables (and thus zeros
      are seen in mmap) while file has data written in that place.
      
      Fix the problem by locking exception entry before mapping blocks for the
      fault.  That way we are sure invalidate_inode_pages2_range() call for
      racing write will either block on entry lock waiting for the fault to
      finish (and unmap stale page tables after that) or read fault will see
      already allocated blocks by write(2).
      
      Fixes: 9f141d6e ("dax: Call ->iomap_begin without entry lock during dax fault")
      Link: http://lkml.kernel.org/r/20170510172700.18991-1-ross.zwisler@linux.intel.comSigned-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      876f2946
    • J
      dax: fix data corruption when fault races with write · 13e451fd
      Jan Kara 提交于
      Currently DAX read fault can race with write(2) in the following way:
      
      CPU1 - write(2)			CPU2 - read fault
      				dax_iomap_pte_fault()
      				  ->iomap_begin() - sees hole
      dax_iomap_rw()
        iomap_apply()
          ->iomap_begin - allocates blocks
          dax_iomap_actor()
            invalidate_inode_pages2_range()
              - there's nothing to invalidate
      				  grab_mapping_entry()
      				  - we add zero page in the radix tree
      				    and map it to page tables
      
      The result is that hole page is mapped into page tables (and thus zeros
      are seen in mmap) while file has data written in that place.
      
      Fix the problem by locking exception entry before mapping blocks for the
      fault.  That way we are sure invalidate_inode_pages2_range() call for
      racing write will either block on entry lock waiting for the fault to
      finish (and unmap stale page tables after that) or read fault will see
      already allocated blocks by write(2).
      
      Fixes: 9f141d6e
      Link: http://lkml.kernel.org/r/20170510085419.27601-5-jack@suse.czSigned-off-by: NJan Kara <jack@suse.cz>
      Reviewed-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      13e451fd
    • J
      mm: fix data corruption due to stale mmap reads · cd656375
      Jan Kara 提交于
      Currently, we didn't invalidate page tables during invalidate_inode_pages2()
      for DAX.  That could result in e.g. 2MiB zero page being mapped into
      page tables while there were already underlying blocks allocated and
      thus data seen through mmap were different from data seen by read(2).
      The following sequence reproduces the problem:
      
       - open an mmap over a 2MiB hole
      
       - read from a 2MiB hole, faulting in a 2MiB zero page
      
       - write to the hole with write(3p). The write succeeds but we
         incorrectly leave the 2MiB zero page mapping intact.
      
       - via the mmap, read the data that was just written. Since the zero
         page mapping is still intact we read back zeroes instead of the new
         data.
      
      Fix the problem by unconditionally calling invalidate_inode_pages2_range()
      in dax_iomap_actor() for new block allocations and by properly
      invalidating page tables in invalidate_inode_pages2_range() for DAX
      mappings.
      
      Fixes: c6dcf52c
      Link: http://lkml.kernel.org/r/20170510085419.27601-3-jack@suse.czSigned-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cd656375
    • R
      dax: prevent invalidation of mapped DAX entries · 4636e70b
      Ross Zwisler 提交于
      Patch series "mm,dax: Fix data corruption due to mmap inconsistency",
      v4.
      
      This series fixes data corruption that can happen for DAX mounts when
      page faults race with write(2) and as a result page tables get out of
      sync with block mappings in the filesystem and thus data seen through
      mmap is different from data seen through read(2).
      
      The series passes testing with t_mmap_stale test program from Ross and
      also other mmap related tests on DAX filesystem.
      
      This patch (of 4):
      
      dax_invalidate_mapping_entry() currently removes DAX exceptional entries
      only if they are clean and unlocked.  This is done via:
      
        invalidate_mapping_pages()
          invalidate_exceptional_entry()
            dax_invalidate_mapping_entry()
      
      However, for page cache pages removed in invalidate_mapping_pages()
      there is an additional criteria which is that the page must not be
      mapped.  This is noted in the comments above invalidate_mapping_pages()
      and is checked in invalidate_inode_page().
      
      For DAX entries this means that we can can end up in a situation where a
      DAX exceptional entry, either a huge zero page or a regular DAX entry,
      could end up mapped but without an associated radix tree entry.  This is
      inconsistent with the rest of the DAX code and with what happens in the
      page cache case.
      
      We aren't able to unmap the DAX exceptional entry because according to
      its comments invalidate_mapping_pages() isn't allowed to block, and
      unmap_mapping_range() takes a write lock on the mapping->i_mmap_rwsem.
      
      Since we essentially never have unmapped DAX entries to evict from the
      radix tree, just remove dax_invalidate_mapping_entry().
      
      Fixes: c6dcf52c ("mm: Invalidate DAX radix tree entries only if appropriate")
      Link: http://lkml.kernel.org/r/20170510085419.27601-2-jack@suse.czSigned-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Reported-by: NJan Kara <jack@suse.cz>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: <stable@vger.kernel.org>    [4.10+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4636e70b
  15. 11 5月, 2017 1 次提交
  16. 09 5月, 2017 6 次提交
    • R
      dax: add tracepoint to dax_insert_mapping() · b4440734
      Ross Zwisler 提交于
      Add a tracepoint to dax_insert_mapping(), following the same logging
      conventions as the rest of DAX.  This tracepoint, along with the one in
      dax_load_hole(), lets us know how a DAX PTE fault was serviced.
      
      Here is an example DAX fault that inserts a PTE mapping:
      
        small-1126  [007] ....
         145.451604: dax_pte_fault: dev 259:0 ino 0x1003 shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10420000 pgoff 0x220
      
        small-1126  [007] ....
         145.452317: dax_insert_mapping: dev 259:0 ino 0x1003 shared write address 0x10420000 radix_entry 0x100006
      
        small-1126  [007] ....
         145.452399: dax_pte_fault_done: dev 259:0 ino 0x1003 shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10420000 pgoff 0x220 MAJOR|NOPAGE
      
      Link: http://lkml.kernel.org/r/20170221195116.13278-7-ross.zwisler@linux.intel.comSigned-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b4440734
    • R
      dax: add tracepoint to dax_writeback_one() · f9bc3a07
      Ross Zwisler 提交于
      Add a tracepoint to dax_writeback_one(), following the same logging
      conventions as the rest of DAX.
      
      Here is an example range writeback which ends up flushing one PMD and
      one PTE:
      
        test-1265  [003] ....
         496.615250: dax_writeback_range: dev 259:0 ino 0x1003 pgoff 0x0-0x7ffffffffffff
      
        test-1265  [003] ....
         496.616263: dax_writeback_one: dev 259:0 ino 0x1003 pgoff 0x0 pglen 0x200
      
        test-1265  [003] ....
         496.616270: dax_writeback_one: dev 259:0 ino 0x1003 pgoff 0x305 pglen 0x1
      
        test-1265  [003] ....
         496.616272: dax_writeback_range_done: dev 259:0 ino 0x1003 pgoff 0x0-0x7ffffffffffff
      
      [akpm@linux-foundation.org: struct blk_dax_ctl has disappeared]
      Link: http://lkml.kernel.org/r/20170221195116.13278-6-ross.zwisler@linux.intel.comSigned-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f9bc3a07
    • R
      dax: add tracepoints to dax_writeback_mapping_range() · d14a3f48
      Ross Zwisler 提交于
      Add tracepoints to dax_writeback_mapping_range(), following the same
      logging conventions as the rest of DAX.
      
      Here is an example writeback call:
      
        msync-1085  [006] ....
         200.902565: dax_writeback_range: dev 259:0 ino 0x1003 pgoff 0x200-0x2ff
      
        msync-1085  [006] ....
         200.902579: dax_writeback_range_done: dev 259:0 ino 0x1003 pgoff 0x200-0x2ff
      
      [ross.zwisler@linux.intel.com: fix regression in dax_writeback_mapping_range()]
        Link: http://lkml.kernel.org/r/20170314215358.31451-1-ross.zwisler@linux.intel.com
      Link: http://lkml.kernel.org/r/20170221195116.13278-5-ross.zwisler@linux.intel.comSigned-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d14a3f48
    • R
      dax: add tracepoints to dax_load_hole() · 678c9fd0
      Ross Zwisler 提交于
      Add tracepoints to dax_load_hole(), following the same logging conventions
      as the rest of DAX.
      
      Here is the logging generated by a PTE read from a hole:
      
        read-1075  [002] ....
          62.362108: dax_pte_fault: dev 259:0 ino 0x1003 shared ALLOW_RETRY|KILLABLE|USER address 0x10480000 pgoff 0x280
      
        read-1075  [002] ....
          62.362140: dax_load_hole: dev 259:0 ino 0x1003 shared ALLOW_RETRY|KILLABLE|USER address 0x10480000 pgoff 0x280 NOPAGE
      
        read-1075  [002] ....
          62.362141: dax_pte_fault_done: dev 259:0 ino 0x1003 shared ALLOW_RETRY|KILLABLE|USER address 0x10480000 pgoff 0x280 NOPAGE
      
      Link: http://lkml.kernel.org/r/20170221195116.13278-4-ross.zwisler@linux.intel.comSigned-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      678c9fd0
    • R
      dax: add tracepoints to dax_pfn_mkwrite() · c3ff68d7
      Ross Zwisler 提交于
      Add tracepoints to dax_pfn_mkwrite(), following the same logging
      conventions as the rest of DAX.
      
      Here is an example PTE fault followed by a pfn_mkwrite:
      
        small_aligned-1094  [002] ....
         374.084998: dax_pte_fault: dev 259:0 ino 0x1003 shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10400000 pgoff 0x200
      
        small_aligned-1094  [002] ....
         374.085145: dax_pte_fault_done: dev 259:0 ino 0x1003 shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10400000 pgoff 0x200 MAJOR|NOPAGE
      
        small_aligned-1094  [002] ....
         374.085165: dax_pfn_mkwrite: dev 259:0 ino 0x1003 shared WRITE|MKWRITE|ALLOW_RETRY|KILLABLE|USER address 0x10400000 pgoff 0x200 NOPAGE
      
      Link: http://lkml.kernel.org/r/20170221195116.13278-3-ross.zwisler@linux.intel.comSigned-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c3ff68d7
    • R
      dax: add tracepoints to dax_iomap_pte_fault() · a9c42b33
      Ross Zwisler 提交于
      Patch series "second round of tracepoints for DAX".
      
      This second round of DAX tracepoint patches adds tracing to the PTE
      fault path (dax_iomap_pte_fault(), dax_pfn_mkwrite(), dax_load_hole(),
      dax_insert_mapping()) and to the writeback path
      (dax_writeback_mapping_range(), dax_writeback_one()).
      
      The purpose of this tracing is to give us a high level view of what DAX
      is doing, whether faults are being serviced by PMDs or PTEs, and by real
      storage or by zero pages covering holes.
      
      I do have some patches nearly ready which also add tracing to
      grab_mapping_entry() and dax_insert_mapping_entry().  These are more
      targeted at logging how we are interacting with the radix tree, how we
      use empty entries for locking, whether we "downgrade" huge zero pages to
      4k PTE sized allocations, etc.  In the end it seemed to me that this
      might be too detailed to have as constantly present tracepoints, but if
      anyone sees value in having tracepoints like this in the DAX code
      permanently (Jan?), please let me know and I'll add those last two
      patches.
      
      All these tracepoints were done to be consistent with the style of the
      XFS tracepoints and with the existing DAX PMD tracepoints.
      
      This patch (of 6):
      
      Add tracepoints to dax_iomap_pte_fault(), following the same logging
      conventions as the rest of DAX.
      
      Here is an example fault that initially tries to be serviced by the PMD
      fault handler but which falls back to PTEs because the VMA isn't large
      enough to hold a PMD:
      
        small-1086  [005] ....
         71.140014: xfs_filemap_huge_fault: dev 259:0 ino 0x1003
      
        small-1086  [005] ....
          71.140027: dax_pmd_fault: dev 259:0 ino 0x1003 shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10420000 vm_start 0x10200000 vm_end 0x10500000 pgoff 0x220 max_pgoff 0x1400
      
        small-1086  [005] ....
          71.140028: dax_pmd_fault_done: dev 259:0 ino 0x1003 shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10420000 vm_start 0x10200000 vm_end 0x10500000 pgoff 0x220 max_pgoff 0x1400 FALLBACK
      
        small-1086  [005] ....
          71.140035: dax_pte_fault: dev 259:0 ino 0x1003 shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10420000 pgoff 0x220
      
        small-1086  [005] ....
          71.140396: dax_pte_fault_done: dev 259:0 ino 0x1003 shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10420000 pgoff 0x220 MAJOR|NOPAGE
      
      Link: http://lkml.kernel.org/r/20170221195116.13278-2-ross.zwisler@linux.intel.comSigned-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a9c42b33
  17. 26 4月, 2017 2 次提交
    • D
      filesystem-dax: convert to dax_direct_access() · cccbce67
      Dan Williams 提交于
      Now that a dax_device is plumbed through all dax-capable drivers we can
      switch from block_device_operations to dax_operations for invoking
      ->direct_access.
      
      This also lets us kill off some usages of struct blk_dax_ctl on the way
      to its eventual removal.
      Suggested-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      cccbce67
    • D
      Revert "block: use DAX for partition table reads" · a41fe02b
      Dan Williams 提交于
      commit d1a5f2b4 ("block: use DAX for partition table reads") was
      part of a stalled effort to allow dax mappings of block devices. Since
      then the device-dax mechanism has filled the role of dax-mapping static
      device ranges.
      
      Now that we are moving ->direct_access() from a block_device operation
      to a dax_inode operation we would need block devices to map and carry
      their own dax_inode reference.
      
      Unless / until we decide to revive dax mapping of raw block devices
      through the dax_inode scheme, there is no need to carry
      read_dax_sector(). Its removal in turn allows for the removal of
      bdev_direct_access() and should have been included in commit
      22375701 ("block_dev: remove DAX leftovers").
      
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      a41fe02b
  18. 09 4月, 2017 1 次提交
  19. 08 4月, 2017 1 次提交
    • R
      dax: fix radix tree insertion race · e11f8b7b
      Ross Zwisler 提交于
      While running generic/340 in my test setup I hit the following race.  It
      can happen with kernels that support FS DAX PMDs, so v4.10 thru
      v4.11-rc5.
      
      Thread 1				Thread 2
      --------				--------
      dax_iomap_pmd_fault()
        grab_mapping_entry()
          spin_lock_irq()
          get_unlocked_mapping_entry()
          'entry' is NULL, can't call lock_slot()
          spin_unlock_irq()
          radix_tree_preload()
      					dax_iomap_pmd_fault()
      					  grab_mapping_entry()
      					    spin_lock_irq()
      					    get_unlocked_mapping_entry()
      					    ...
      					    lock_slot()
      					    spin_unlock_irq()
      					  dax_pmd_insert_mapping()
      					    <inserts a PMD mapping>
          spin_lock_irq()
          __radix_tree_insert() fails with -EEXIST
          <fall back to 4k fault, and die horribly
           when inserting a 4k entry where a PMD exists>
      
      The issue is that we have to drop mapping->tree_lock while calling
      radix_tree_preload(), but since we didn't have a radix tree entry to
      lock (unlike in the pmd_downgrade case) we have no protection against
      Thread 2 coming along and inserting a PMD at the same index.  For 4k
      entries we handled this with a special-case response to -EEXIST coming
      from the __radix_tree_insert(), but this doesn't save us for PMDs
      because the -EEXIST case can also mean that we collided with a 4k entry
      in the radix tree at a different index, but one that is covered by our
      PMD range.
      
      So, correctly handle both the 4k and 2M collision cases by explicitly
      re-checking the radix tree for an entry at our index once we reacquire
      mapping->tree_lock.
      
      This patch has made it through a clean xfstests run with the current
      v4.11-rc5 based linux/master, and it also ran generic/340 500 times in a
      loop.  It used to fail within the first 10 iterations.
      
      Link: http://lkml.kernel.org/r/20170406212944.2866-1-ross.zwisler@linux.intel.comSigned-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: <stable@vger.kernel.org>    [4.10+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e11f8b7b
  20. 02 3月, 2017 1 次提交
  21. 28 2月, 2017 1 次提交