1. 03 November 2017, 1 commit
  2. 11 September 2017, 1 commit
    • dax: remove the pmem_dax_ops->flush abstraction · c3ca015f
      Committed by Mikulas Patocka
      Commit abebfbe2 ("dm: add ->flush() dax operation support") is
      buggy. A DM device may be composed of multiple underlying devices and
      all of them need to be flushed. That commit just routes the flush
      request to the first device and ignores the other devices.
      
      It could be fixed by adding more complex logic to the device mapper. But
      there is only one implementation of the method pmem_dax_ops->flush - that
      is pmem_dax_flush() - and it calls arch_wb_cache_pmem(). Consequently, we
      don't need the pmem_dax_ops->flush abstraction at all, we can call
      arch_wb_cache_pmem() directly from dax_flush() because dax_dev->ops->flush
      can't ever reach anything different from arch_wb_cache_pmem().
      
      It should also be pointed out that some uses of persistent memory need
      to flush only a very small amount of data (such as one cacheline), and
      going through the device mapper machinery for a single flushed cache
      line would be overkill.
      
      Fix this by removing the pmem_dax_ops->flush abstraction and call
      arch_wb_cache_pmem() directly from dax_flush(). Also, remove the device
      mapper code that forwards the flushes.
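      
      As a hedged sketch (assuming the dax_write_cache_enabled() gate used
      by the dax core; the exact guards in the actual patch may differ),
      the simplified helper ends up looking roughly like:
      
        #ifdef CONFIG_ARCH_HAS_PMEM_API
        void dax_flush(struct dax_device *dax_dev, void *addr, size_t size)
        {
        	/* skip cache management for dax devices without a write cache */
        	if (unlikely(!dax_write_cache_enabled(dax_dev)))
        		return;
        
        	arch_wb_cache_pmem(addr, size);
        }
        #else
        void dax_flush(struct dax_device *dax_dev, void *addr, size_t size)
        {
        }
        #endif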
      
      Fixes: abebfbe2 ("dm: add ->flush() dax operation support")
      Cc: stable@vger.kernel.org
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
  3. 07 September 2017, 7 commits
    • dax: initialize variable pfn before using it · 2f52074d
      Committed by Nicolas Iooss
      dax_pmd_insert_mapping() contains the following code:
      
              pfn_t pfn;
              if (bdev_dax_pgoff(bdev, sector, size, &pgoff) != 0)
                  goto fallback;
              /* ... */
          fallback:
            trace_dax_pmd_insert_mapping_fallback(inode, vmf, length, pfn, ret);
      
      When bdev_dax_pgoff() fails (i.e. the if condition is true), the
      function calls trace_dax_pmd_insert_mapping_fallback() with an
      uninitialized pfn value.
      
      This issue has been found while building the kernel with clang.  The
      compiler reported:
      
          fs/dax.c:1280:6: error: variable 'pfn' is used uninitialized
          whenever 'if' condition is true [-Werror,-Wsometimes-uninitialized]
              if (bdev_dax_pgoff(bdev, sector, size, &pgoff) != 0)
                  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
          fs/dax.c:1310:60: note: uninitialized use occurs here
            trace_dax_pmd_insert_mapping_fallback(inode, vmf, length, pfn, ret);
                                                                           ^~~
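      
      One minimal way to silence this (a sketch; the upstream fix may
      differ in detail) is to zero-initialize pfn at its declaration so the
      fallback tracepoint always sees a defined value:
      
        pfn_t pfn = {};		/* defined even on the fallback path */
        
        if (bdev_dax_pgoff(bdev, sector, size, &pgoff) != 0)
        	goto fallback;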
      
      Link: http://lkml.kernel.org/r/20170903083000.587-1-nicolas.iooss_linux@m4x.org
      Signed-off-by: Nicolas Iooss <nicolas.iooss_linux@m4x.org>
      Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • dax: use PG_PMD_COLOUR instead of open coding · 917f3452
      Committed by Ross Zwisler
      Use ~PG_PMD_COLOUR in dax_entry_waitqueue() instead of open coding an
      equivalent page offset mask.
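      
      For reference, a sketch of the definitions involved (PG_PMD_COLOUR
      lives in fs/dax.c; masking it off makes every page of a PMD-sized
      entry hash to the same waitqueue):
      
        /* pgoff bits selecting a page inside one PMD-sized entry */
        #define PG_PMD_COLOUR	((PMD_SIZE >> PAGE_SHIFT) - 1)
        
        /* in dax_entry_waitqueue(): hash on the start of the 2MiB range */
        if (dax_is_pmd_entry(entry))
        	index &= ~PG_PMD_COLOUR;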
      
      Link: http://lkml.kernel.org/r/20170822222436.18926-2-ross.zwisler@linux.intel.com
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: "Slusarz, Marcin" <marcin.slusarz@intel.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • dax: explain how read(2)/write(2) addresses are validated · a2e050f5
      Committed by Ross Zwisler
      Add a comment explaining how the user addresses provided to read(2) and
      write(2) are validated in the DAX I/O path.
      
      We call dax_copy_from_iter() or copy_to_iter() on these without calling
      access_ok() first in the DAX code, and there was a concern that the user
      might be able to read/write to arbitrary kernel addresses with this
      path.
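      
      A hedged paraphrase of the kind of comment being added (the exact
      wording in fs/dax.c may differ):
      
        /*
         * The userspace address ranges have already been checked with
         * access_ok() when the iov_iter was built in the generic
         * read(2)/write(2) syscall paths, so the copies below cannot be
         * used to reach arbitrary kernel addresses.
         */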
      
      Link: http://lkml.kernel.org/r/20170816173615.10098-1-ross.zwisler@linux.intel.com
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • dax: move all DAX radix tree defs to fs/dax.c · 527b19d0
      Committed by Ross Zwisler
      Now that we no longer insert struct page pointers in DAX radix trees the
      page cache code no longer needs to know anything about DAX exceptional
      entries.  Move all the DAX exceptional entry definitions from dax.h to
      fs/dax.c.
      
      Link: http://lkml.kernel.org/r/20170724170616.25810-6-ross.zwisler@linux.intel.com
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Suggested-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • dax: remove DAX code from page_cache_tree_insert() · d01ad197
      Committed by Ross Zwisler
      Now that we no longer insert struct page pointers in DAX radix trees we
      can remove the special casing for DAX in page_cache_tree_insert().
      
      This also allows us to make dax_wake_mapping_entry_waiter() local to
      fs/dax.c, removing it from dax.h.
      
      Link: http://lkml.kernel.org/r/20170724170616.25810-5-ross.zwisler@linux.intel.com
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Suggested-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • dax: use common 4k zero page for dax mmap reads · 91d25ba8
      Committed by Ross Zwisler
      When servicing mmap() reads from file holes the current DAX code
      allocates a page cache page of all zeroes and places the struct page
      pointer in the mapping->page_tree radix tree.
      
      This has three major drawbacks:
      
      1) It consumes memory unnecessarily. For every 4k page that is read via
         a DAX mmap() over a hole, we allocate a new page cache page. This
         means that if you read 1GiB worth of pages, you end up using 1GiB of
         zeroed memory. This is easily visible by looking at the overall
         memory consumption of the system or by looking at /proc/[pid]/smaps:
      
      	7f62e72b3000-7f63272b3000 rw-s 00000000 103:00 12   /root/dax/data
      	Size:            1048576 kB
      	Rss:             1048576 kB
      	Pss:             1048576 kB
      	Shared_Clean:          0 kB
      	Shared_Dirty:          0 kB
      	Private_Clean:   1048576 kB
      	Private_Dirty:         0 kB
      	Referenced:      1048576 kB
      	Anonymous:             0 kB
      	LazyFree:              0 kB
      	AnonHugePages:         0 kB
      	ShmemPmdMapped:        0 kB
      	Shared_Hugetlb:        0 kB
      	Private_Hugetlb:       0 kB
      	Swap:                  0 kB
      	SwapPss:               0 kB
      	KernelPageSize:        4 kB
      	MMUPageSize:           4 kB
      	Locked:                0 kB
      
      2) It is slower than using a common zero page because each page fault
         has more work to do. Instead of just inserting a common zero page we
         have to allocate a page cache page, zero it, and then insert it. Here
         are the average latencies of dax_load_hole() as measured by ftrace on
         a random test box:
      
          Old method, using zeroed page cache pages:	3.4 us
          New method, using the common 4k zero page:	0.8 us
      
         This was the average latency over 1 GiB of sequential reads done by
         this simple fio script:
      
           [global]
           size=1G
           filename=/root/dax/data
           fallocate=none
           [io]
           rw=read
           ioengine=mmap
      
      3) The fact that we had to check for both DAX exceptional entries and
         for page cache pages in the radix tree made the DAX code more
         complex.
      
      Solve these issues by following the lead of the DAX PMD code and using a
      common 4k zero page instead.  As with the PMD code we will now insert a
      DAX exceptional entry into the radix tree instead of a struct page
      pointer which allows us to remove all the special casing in the DAX
      code.
      
      Note that we do still pretty aggressively check for regular pages in the
      DAX radix tree, especially where we take action based on the bits set in
      the page.  If we ever find a regular page in our radix tree now that
      most likely means that someone besides DAX is inserting pages (which has
      happened lots of times in the past), and we want to find that out early
      and fail loudly.
      
      This solution also removes the extra memory consumption.  Here is that
      same /proc/[pid]/smaps after 1GiB of reading from a hole with the new
      code:
      
      	7f2054a74000-7f2094a74000 rw-s 00000000 103:00 12   /root/dax/data
      	Size:            1048576 kB
      	Rss:                   0 kB
      	Pss:                   0 kB
      	Shared_Clean:          0 kB
      	Shared_Dirty:          0 kB
      	Private_Clean:         0 kB
      	Private_Dirty:         0 kB
      	Referenced:            0 kB
      	Anonymous:             0 kB
      	LazyFree:              0 kB
      	AnonHugePages:         0 kB
      	ShmemPmdMapped:        0 kB
      	Shared_Hugetlb:        0 kB
      	Private_Hugetlb:       0 kB
      	Swap:                  0 kB
      	SwapPss:               0 kB
      	KernelPageSize:        4 kB
      	MMUPageSize:           4 kB
      	Locked:                0 kB
      
      Overall system memory consumption is similarly improved.
      
      Another major change is that we remove dax_pfn_mkwrite() from our fault
      flow, and instead rely on the page fault itself to make the PTE dirty
      and writeable.  The following description from the patch adding the
      vm_insert_mixed_mkwrite() call explains this a little more:
      
         "To be able to use the common 4k zero page in DAX we need to have our
          PTE fault path look more like our PMD fault path where a PTE entry
          can be marked as dirty and writeable as it is first inserted rather
          than waiting for a follow-up dax_pfn_mkwrite() =>
          finish_mkwrite_fault() call.
      
          Right now we can rely on having a dax_pfn_mkwrite() call because we
          can distinguish between these two cases in do_wp_page():
      
                  case 1: 4k zero page => writable DAX storage
                  case 2: read-only DAX storage => writeable DAX storage
      
    This distinction is made via vm_normal_page(). vm_normal_page()
          returns false for the common 4k zero page, though, just as it does
          for DAX ptes. Instead of special casing the DAX + 4k zero page case
          we will simplify our DAX PTE page fault sequence so that it matches
          our DAX PMD sequence, and get rid of the dax_pfn_mkwrite() helper.
          We will instead use dax_iomap_fault() to handle write-protection
          faults.
      
          This means that insert_pfn() needs to follow the lead of
          insert_pfn_pmd() and allow us to pass in a 'mkwrite' flag. If
          'mkwrite' is set insert_pfn() will do the work that was previously
          done by wp_page_reuse() as part of the dax_pfn_mkwrite() call path"
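      
      As a hedged sketch of the resulting hole-fault path (helper names
      taken from the surrounding DAX code; the real function carries more
      locking, error handling, and tracing):
      
        static int dax_load_hole(struct address_space *mapping, void *entry,
        			 struct vm_fault *vmf)
        {
        	unsigned long vaddr = vmf->address;
        	/* the common zero page instead of a freshly zeroed page */
        	pfn_t pfn = pfn_to_pfn_t(my_zero_pfn(vaddr));
        
        	/* record a zero-page exceptional entry in the radix tree
        	 * (error handling elided in this sketch) */
        	entry = dax_insert_mapping_entry(mapping, vmf, entry, 0,
        					 RADIX_DAX_ZERO_PAGE);
        
        	if (vm_insert_mixed(vmf->vma, vaddr, pfn))
        		return VM_FAULT_SIGBUS;
        	return VM_FAULT_NOPAGE;
        }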
      
      Link: http://lkml.kernel.org/r/20170724170616.25810-4-ross.zwisler@linux.intel.com
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • dax: relocate some dax functions · e30331ff
      Committed by Ross Zwisler
      dax_load_hole() will soon need to call dax_insert_mapping_entry(), so it
      needs to be moved lower in dax.c so the definition exists.
      
      dax_wake_mapping_entry_waiter() will soon be removed from dax.h and be
      made static to dax.c, so we need to move its definition above all its
      callers.
      
      Link: http://lkml.kernel.org/r/20170724170616.25810-3-ross.zwisler@linux.intel.com
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 01 September 2017, 1 commit
    • dax: update to new mmu_notifier semantic · a4d1a885
      Committed by Jérôme Glisse
      Replace all mmu_notifier_invalidate_page() calls by *_invalidate_range()
      and make sure it is bracketed by calls to *_invalidate_range_start()/end().
      
      Note that because we cannot presume the pmd or pte value, we have to
      assume the worst and unconditionally report an invalidation as
      happening.
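      
      The conversion pattern is roughly (sketch; start/end bracket the
      affected virtual address range):
      
        -	mmu_notifier_invalidate_page(mm, address);
        +	mmu_notifier_invalidate_range_start(mm, start, end);
        +	/* ... pte/pmd modification happens here ... */
        +	mmu_notifier_invalidate_range_end(mm, start, end);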
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Bernhard Held <berny156@gmx.de>
      Cc: Adam Borowski <kilobyte@angband.pl>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Wanpeng Li <kernellwp@gmail.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Takashi Iwai <tiwai@suse.de>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: axie <axie@amd.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. 26 August 2017, 1 commit
    • dax: fix deadlock due to misaligned PMD faults · fffa281b
      Committed by Ross Zwisler
      In DAX there are two separate places where the 2MiB range of a PMD is
      defined.
      
      The first is in the page tables, where a PMD mapping inserted for a
      given address spans from (vmf->address & PMD_MASK) to ((vmf->address &
      PMD_MASK) + PMD_SIZE - 1).  That is, from the 2MiB boundary below the
      address to the 2MiB boundary above the address.
      
      So, for example, a fault at address 3MiB (0x30 0000) falls within the
      PMD that ranges from 2MiB (0x20 0000) to 4MiB (0x40 0000).
      
      The second PMD range is in the mapping->page_tree, where a given file
      offset is covered by a radix tree entry that spans from one 2MiB aligned
      file offset to another 2MiB aligned file offset.
      
      So, for example, the file offset for 3MiB (pgoff 768) falls within the
      PMD range for the order 9 radix tree entry that ranges from 2MiB (pgoff
      512) to 4MiB (pgoff 1024).
      
      This system works so long as the addresses and file offsets for a given
      mapping both have the same offsets relative to the start of each PMD.
      
      Consider the case where the starting address for a given file isn't 2MiB
      aligned - say our faulting address is 3 MiB (0x30 0000), but that
      corresponds to the beginning of our file (pgoff 0).  Now all the PMDs in
      the mapping are misaligned so that the 2MiB range defined in the page
      tables never matches up with the 2MiB range defined in the radix tree.
      
      The current code notices this case for DAX faults to storage with the
      following test in dax_pmd_insert_mapping():
      
      	if (pfn_t_to_pfn(pfn) & PG_PMD_COLOUR)
      		goto unlock_fallback;
      
      This test makes sure that the pfn we get from the driver is 2MiB
      aligned, and relies on the assumption that the 2MiB alignment of the pfn
      we get back from the driver matches the 2MiB alignment of the faulting
      address.
      
      However, faults to holes were not checked and we could hit the problem
      described above.
      
      This was reported in response to the NVML nvml/src/test/pmempool_sync
      TEST5:
      
      	$ cd nvml/src/test/pmempool_sync
      	$ make TEST5
      
      You can grab NVML here:
      
      	https://github.com/pmem/nvml/
      
      The dmesg warning you see when you hit this error is:
      
        WARNING: CPU: 13 PID: 2900 at fs/dax.c:641 dax_insert_mapping_entry+0x2df/0x310
      
      Where we notice in dax_insert_mapping_entry() that the radix tree entry
      we are about to replace doesn't match the locked entry that we had
      previously inserted into the tree.  This happens because the initial
      insertion was done in grab_mapping_entry() using a pgoff calculated from
      the faulting address (vmf->address), and the replacement in
      dax_pmd_load_hole() => dax_insert_mapping_entry() is done using
      vmf->pgoff.
      
      In our failure case those two page offsets (one calculated from
      vmf->address, one using vmf->pgoff) point to different order 9 radix
      tree entries.
      
      This failure case can result in a deadlock because the radix tree unlock
      also happens on the pgoff calculated from vmf->address.  This means that
      the locked radix tree entry that we swapped in to the tree in
      dax_insert_mapping_entry() using vmf->pgoff is never unlocked, so all
      future faults to that 2MiB range will block forever.
      
      Fix this by validating that the faulting address's PMD offset matches
      the PMD offset from the start of the file.  This check is done at the
      very beginning of the fault and covers faults that would have mapped to
      storage as well as faults to holes.  I left the COLOUR check in
      dax_pmd_insert_mapping() in place in case we ever hit the insanity
      condition where the alignment of the pfn we get from the driver doesn't
      match the alignment of the userspace address.
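      
      Reconstructed from the description above, the added check looks
      roughly like this (a sketch, placed at the very beginning of
      dax_iomap_pmd_fault()):
      
        /*
         * Make sure that the faulting address's PMD offset (colour)
         * matches the PMD offset from the start of the file.  Otherwise
         * fall back to PTEs.
         */
        if ((vmf->pgoff & PG_PMD_COLOUR) !=
            ((vmf->address >> PAGE_SHIFT) & PG_PMD_COLOUR))
        	goto fallback;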
      
      Link: http://lkml.kernel.org/r/20170822222436.18926-1-ross.zwisler@linux.intel.com
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Reported-by: "Slusarz, Marcin" <marcin.slusarz@intel.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. 07 July 2017, 1 commit
  7. 06 July 2017, 1 commit
    • dax: set errors in mapping when writeback fails · 819ec6b9
      Committed by Jeff Layton
      Jan Kara's description for this patch is much better than mine, so I'm
      quoting it verbatim here:
      
      DAX currently doesn't set errors in the mapping when cache flushing
      fails in dax_writeback_mapping_range(). Since this function can get
      called only from fsync(2) or sync(2), this is actually as good as it can
      currently get since we correctly propagate the error up from
      dax_writeback_mapping_range() to filemap_fdatawrite().
      
      However, in the future better writeback error handling will enable us to
      properly report these errors on fsync(2) even if there are multiple file
      descriptors open against the file or if sync(2) gets called before
      fsync(2). So convert DAX to using standard error reporting through the
      mapping.
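      
      Concretely, that means recording flush failures with
      mapping_set_error() so they surface to later fsync(2)/sync(2)
      callers (a hedged sketch of the writeback loop):
      
        ret = dax_writeback_one(bdev, dax_dev, mapping,
        			indices[i], pvec.pages[i]);
        if (ret < 0) {
        	/* remember the error for fsync(2)/sync(2) reporting */
        	mapping_set_error(mapping, ret);
        	goto out;
        }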
      Signed-off-by: Jeff Layton <jlayton@redhat.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-and-tested-by: Ross Zwisler <ross.zwisler@linux.intel.com>
  8. 28 June 2017, 1 commit
    • x86, libnvdimm, pmem: remove global pmem api · ca6a4657
      Committed by Dan Williams
      Now that all callers of the pmem api have been converted to dax helpers that
      call back to the pmem driver, we can remove include/linux/pmem.h and
      asm/pmem.h.
      
      Cc: <x86@kernel.org>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Oliver O'Halloran <oohall@gmail.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
  9. 24 June 2017, 1 commit
  10. 20 June 2017, 1 commit
    • sched/wait: Rename wait_queue_t => wait_queue_entry_t · ac6424b9
      Committed by Ingo Molnar
      Rename:
      
      	wait_queue_t		=>	wait_queue_entry_t
      
      'wait_queue_t' was always a slight misnomer: its name implies that it's a "queue",
      but in reality it's a queue *entry*. The 'real' queue is the wait queue head,
      which had to carry the name.
      
      Start sorting this out by renaming it to 'wait_queue_entry_t'.
      
      This also allows the real structure name 'struct __wait_queue' to
      lose its double underscore and become 'struct wait_queue_entry',
      which is the more canonical nomenclature for such data types.
      
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  11. 16 June 2017, 3 commits
    • x86, dax: replace clear_pmem() with open coded memset + dax_ops->flush · 81f55870
      Committed by Dan Williams
      The clear_pmem() helper simply combines a memset() plus a cache flush.
      Now that the flush routine is optionally provided by the dax device
      driver we can avoid unnecessary cache management on dax devices fronting
      volatile memory.
      
      With clear_pmem() gone we can follow on with a patch to make pmem cache
      management completely defined within the pmem driver.
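      
      A former clear_pmem() call site now ends up looking roughly like this
      (sketch; dax_flush() still took a pgoff argument at this point in the
      series):
      
        /* open coded clear_pmem(): zero the range, then flush it
         * through the dax driver */
        memset(kaddr, 0, size);
        dax_flush(dax_dev, pgoff, kaddr, size);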
      
      Cc: <x86@kernel.org>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • filesystem-dax: convert to dax_flush() · 6318770a
      Committed by Dan Williams
      Filesystem-DAX flushes caches whenever it writes to the address returned
      through dax_direct_access() and when writing back dirty radix entries.
      That flushing is only required in the pmem case, so the dax_flush()
      helper skips cache management work when the underlying driver does not
      specify a flush method.
      
      We still do all the dirty tracking since the radix entry will already be
      there for locking purposes. However, the work to clean the entry will be
      a nop for some dax drivers.
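      
      A hedged sketch of the skip logic (the real helper also checks that
      the dax_device is still alive before dereferencing its ops):
      
        void dax_flush(struct dax_device *dax_dev, pgoff_t pgoff,
        	       void *addr, size_t size)
        {
        	/* drivers fronting volatile memory register no flush method */
        	if (!dax_dev->ops->flush)
        		return;
        
        	dax_dev->ops->flush(dax_dev, pgoff, addr, size);
        }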
      
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • filesystem-dax: convert to dax_copy_from_iter() · fec53774
      Committed by Dan Williams
      Now that all possible providers of the dax_operations copy_from_iter
      method are implemented, switch filesystem-dax to call the driver rather
      than copy_from_iter_pmem.
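      
      The call-site change amounts to routing the copy through the
      dax_operations method (sketch):
      
        /* was: copy_from_iter_pmem(kaddr, map_len, iter) */
        map_len = dax_copy_from_iter(dax_dev, pgoff, kaddr, map_len, iter);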
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
  12. 03 June 2017, 1 commit
    • dax: fix race between colliding PMD & PTE entries · e2093926
      Committed by Ross Zwisler
      We currently have two related PMD vs PTE races in the DAX code.  These
      can both be easily triggered by having two threads reading and writing
      simultaneously to the same private mapping, with the key being that
      private mapping reads can be handled with PMDs but private mapping
      writes are always handled with PTEs so that we can COW.
      
      Here is the first race:
      
        CPU 0					CPU 1
      
        (private mapping write)
        __handle_mm_fault()
          create_huge_pmd() - FALLBACK
          handle_pte_fault()
            passes check for pmd_devmap()
      
      					(private mapping read)
      					__handle_mm_fault()
      					  create_huge_pmd()
      					    dax_iomap_pmd_fault() inserts PMD
      
            dax_iomap_pte_fault() does a PTE fault, but we already have a DAX PMD
            			  installed in our page tables at this spot.
      
      Here's the second race:
      
        CPU 0					CPU 1
      
        (private mapping read)
        __handle_mm_fault()
          passes check for pmd_none()
          create_huge_pmd()
            dax_iomap_pmd_fault() inserts PMD
      
        (private mapping write)
        __handle_mm_fault()
          create_huge_pmd() - FALLBACK
      					(private mapping read)
      					__handle_mm_fault()
      					  passes check for pmd_none()
      					  create_huge_pmd()
      
          handle_pte_fault()
            dax_iomap_pte_fault() inserts PTE
      					    dax_iomap_pmd_fault() inserts PMD,
      					       but we already have a PTE at
      					       this spot.
      
      The core of the issue is that while there is isolation between faults to
      the same range in the DAX fault handlers via our DAX entry locking,
      there is no isolation between faults in the code in mm/memory.c.  This
      means for instance that this code in __handle_mm_fault() can run:
      
      	if (pmd_none(*vmf.pmd) && transparent_hugepage_enabled(vma)) {
      		ret = create_huge_pmd(&vmf);
      
      But by the time we actually get to run the fault handler called by
      create_huge_pmd(), the PMD is no longer pmd_none() because a racing PTE
      fault has installed a normal PMD here as a parent.  This is the cause of
      the 2nd race.  The first race is similar - there is the following check
      in handle_pte_fault():
      
      	} else {
      		/* See comment in pte_alloc_one_map() */
      		if (pmd_devmap(*vmf->pmd) || pmd_trans_unstable(vmf->pmd))
      			return 0;
      
      So if a pmd_devmap() PMD (a DAX PMD) has been installed at vmf->pmd, we
      will bail and retry the fault.  This is correct, but there is nothing
      preventing the PMD from being installed after this check but before we
      actually get to the DAX PTE fault handlers.
      
      In my testing these races result in the following types of errors:
      
        BUG: Bad rss-counter state mm:ffff8800a817d280 idx:1 val:1
        BUG: non-zero nr_ptes on freeing mm: 15
      
      Fix this issue by having the DAX fault handlers verify that it is safe
      to continue their fault after they have taken an entry lock to block
      other racing faults.
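      
      Concretely, the PTE handler re-checks the page tables once the entry
      lock is held (a sketch of the added check; the PMD handler does the
      analogous test against a racing PTE fault):
      
        /*
         * Re-check after taking the entry lock: a racing PMD fault may
         * have installed a huge entry where our PTE should go.  If so,
         * just return and let the fault be retried.
         */
        if (pmd_trans_huge(*vmf->pmd) || pmd_devmap(*vmf->pmd)) {
        	vmf_ret = 0;
        	goto unlock_entry;
        }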
      
      [ross.zwisler@linux.intel.com: improve fix for colliding PMD & PTE entries]
        Link: http://lkml.kernel.org/r/20170526195932.32178-1-ross.zwisler@linux.intel.com
      Link: http://lkml.kernel.org/r/20170522215749.23516-2-ross.zwisler@linux.intel.com
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Reported-by: Pawel Lebioda <pawel.lebioda@intel.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Pawel Lebioda <pawel.lebioda@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Xiong Zhou <xzhou@redhat.com>
      Cc: Eryu Guan <eguan@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  13. 13 May 2017, 4 commits
    • dax: fix PMD data corruption when fault races with write · 876f2946
      Committed by Ross Zwisler
      This is based on a patch from Jan Kara that fixed the equivalent race in
      the DAX PTE fault path.
      
      Currently DAX PMD read fault can race with write(2) in the following
      way:
      
      CPU1 - write(2)                 CPU2 - read fault
                                      dax_iomap_pmd_fault()
                                        ->iomap_begin() - sees hole
      
      dax_iomap_rw()
        iomap_apply()
          ->iomap_begin - allocates blocks
          dax_iomap_actor()
            invalidate_inode_pages2_range()
              - there's nothing to invalidate
      
                                        grab_mapping_entry()
      				  - we add huge zero page to the radix tree
      				    and map it to page tables
      
      The result is that hole page is mapped into page tables (and thus zeros
      are seen in mmap) while file has data written in that place.
      
      Fix the problem by locking exception entry before mapping blocks for the
      fault.  That way we are sure invalidate_inode_pages2_range() call for
      racing write will either block on entry lock waiting for the fault to
      finish (and unmap stale page tables after that) or read fault will see
      already allocated blocks by write(2).
      
      Fixes: 9f141d6e ("dax: Call ->iomap_begin without entry lock during dax fault")
      Link: http://lkml.kernel.org/r/20170510172700.18991-1-ross.zwisler@linux.intel.com
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • dax: fix data corruption when fault races with write · 13e451fd
      Committed by Jan Kara
      Currently DAX read fault can race with write(2) in the following way:
      
      CPU1 - write(2)			CPU2 - read fault
      				dax_iomap_pte_fault()
      				  ->iomap_begin() - sees hole
      dax_iomap_rw()
        iomap_apply()
          ->iomap_begin - allocates blocks
          dax_iomap_actor()
            invalidate_inode_pages2_range()
              - there's nothing to invalidate
      				  grab_mapping_entry()
      				  - we add zero page in the radix tree
      				    and map it to page tables
      
      The result is that hole page is mapped into page tables (and thus zeros
      are seen in mmap) while file has data written in that place.
      
      Fix the problem by locking exception entry before mapping blocks for the
      fault.  That way we are sure invalidate_inode_pages2_range() call for
      racing write will either block on entry lock waiting for the fault to
      finish (and unmap stale page tables after that) or read fault will see
      already allocated blocks by write(2).
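      
      The resulting ordering in the fault handler, roughly (sketch): lock
      the radix tree entry first, and only then ask the filesystem for the
      block mapping:
      
        /* take the exception entry lock before ->iomap_begin() */
        entry = grab_mapping_entry(mapping, vmf->pgoff, 0);
        if (IS_ERR(entry))
        	return dax_fault_return(PTR_ERR(entry));
        
        /* a racing write(2) now blocks in invalidate_inode_pages2_range()
         * until the fault finishes and drops the entry lock */
        error = ops->iomap_begin(inode, pos, PAGE_SIZE, flags, &iomap);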
      
      Fixes: 9f141d6e ("dax: Call ->iomap_begin without entry lock during dax fault")
      Link: http://lkml.kernel.org/r/20170510085419.27601-5-jack@suse.cz
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: fix data corruption due to stale mmap reads · cd656375
      Committed by Jan Kara
      Currently we don't invalidate page tables during invalidate_inode_pages2()
      for DAX.  That can result in e.g. a 2MiB zero page being mapped into
      page tables while the underlying blocks are already allocated, and thus
      data seen through mmap differs from data seen by read(2).
      The following sequence reproduces the problem:
      
       - open an mmap over a 2MiB hole
      
       - read from a 2MiB hole, faulting in a 2MiB zero page
      
       - write to the hole with write(3p). The write succeeds but we
         incorrectly leave the 2MiB zero page mapping intact.
      
       - via the mmap, read the data that was just written. Since the zero
         page mapping is still intact we read back zeroes instead of the new
         data.
      
      Fix the problem by unconditionally calling invalidate_inode_pages2_range()
      in dax_iomap_actor() for new block allocations and by properly
      invalidating page tables in invalidate_inode_pages2_range() for DAX
      mappings.
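      
      The dax_iomap_actor() side of the fix, roughly (sketch; IOMAP_F_NEW
      marks freshly allocated blocks):
      
        /* newly allocated blocks may be shadowed by a mapped zero page,
         * so drop page cache and page table entries over the range */
        if (iomap->flags & IOMAP_F_NEW) {
        	invalidate_inode_pages2_range(inode->i_mapping,
        			pos >> PAGE_SHIFT,
        			(end - 1) >> PAGE_SHIFT);
        }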
      
      Fixes: c6dcf52c ("mm: Invalidate DAX radix tree entries only if appropriate")
      Link: http://lkml.kernel.org/r/20170510085419.27601-3-jack@suse.cz
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • dax: prevent invalidation of mapped DAX entries · 4636e70b
      Committed by Ross Zwisler
      Patch series "mm,dax: Fix data corruption due to mmap inconsistency",
      v4.
      
      This series fixes data corruption that can happen for DAX mounts when
      page faults race with write(2) and as a result page tables get out of
      sync with block mappings in the filesystem and thus data seen through
      mmap is different from data seen through read(2).
      
      The series passes testing with t_mmap_stale test program from Ross and
      also other mmap related tests on DAX filesystem.
      
      This patch (of 4):
      
      dax_invalidate_mapping_entry() currently removes DAX exceptional entries
      only if they are clean and unlocked.  This is done via:
      
        invalidate_mapping_pages()
          invalidate_exceptional_entry()
            dax_invalidate_mapping_entry()
      
      However, for page cache pages removed in invalidate_mapping_pages()
      there is an additional criteria which is that the page must not be
      mapped.  This is noted in the comments above invalidate_mapping_pages()
      and is checked in invalidate_inode_page().
      
      For DAX entries this means that we can end up in a situation where a
      DAX exceptional entry, either a huge zero page or a regular DAX entry,
      is mapped but has no associated radix tree entry.  This is
      inconsistent with the rest of the DAX code and with what happens in the
      page cache case.
      
      We aren't able to unmap the DAX exceptional entry because according to
      its comments invalidate_mapping_pages() isn't allowed to block, and
      unmap_mapping_range() takes a write lock on the mapping->i_mmap_rwsem.
      
      Since we essentially never have unmapped DAX entries to evict from the
      radix tree, just remove dax_invalidate_mapping_entry().
      
      Fixes: c6dcf52c ("mm: Invalidate DAX radix tree entries only if appropriate")
      Link: http://lkml.kernel.org/r/20170510085419.27601-2-jack@suse.cz
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reported-by: Jan Kara <jack@suse.cz>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: <stable@vger.kernel.org>    [4.10+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  14. 11 May 2017, 1 commit
  15. 09 May 2017, 6 commits
    • dax: add tracepoint to dax_insert_mapping() · b4440734
      Committed by Ross Zwisler
      Add a tracepoint to dax_insert_mapping(), following the same logging
      conventions as the rest of DAX.  This tracepoint, along with the one in
      dax_load_hole(), lets us know how a DAX PTE fault was serviced.
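      
      The instrumentation itself is a single call in the style of the
      existing DAX tracepoints (sketch; the arguments follow the dev/ino/vmf
      convention visible in the log lines below):
      
        trace_dax_insert_mapping(mapping->host, vmf, entry);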
      
      Here is an example DAX fault that inserts a PTE mapping:
      
        small-1126  [007] ....
         145.451604: dax_pte_fault: dev 259:0 ino 0x1003 shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10420000 pgoff 0x220
      
        small-1126  [007] ....
         145.452317: dax_insert_mapping: dev 259:0 ino 0x1003 shared write address 0x10420000 radix_entry 0x100006
      
        small-1126  [007] ....
         145.452399: dax_pte_fault_done: dev 259:0 ino 0x1003 shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10420000 pgoff 0x220 MAJOR|NOPAGE
      
      Link: http://lkml.kernel.org/r/20170221195116.13278-7-ross.zwisler@linux.intel.com
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • dax: add tracepoint to dax_writeback_one() · f9bc3a07
      Committed by Ross Zwisler
      Add a tracepoint to dax_writeback_one(), following the same logging
      conventions as the rest of DAX.
      
      Here is an example range writeback which ends up flushing one PMD and
      one PTE:
      
        test-1265  [003] ....
         496.615250: dax_writeback_range: dev 259:0 ino 0x1003 pgoff 0x0-0x7ffffffffffff
      
        test-1265  [003] ....
         496.616263: dax_writeback_one: dev 259:0 ino 0x1003 pgoff 0x0 pglen 0x200
      
        test-1265  [003] ....
         496.616270: dax_writeback_one: dev 259:0 ino 0x1003 pgoff 0x305 pglen 0x1
      
        test-1265  [003] ....
         496.616272: dax_writeback_range_done: dev 259:0 ino 0x1003 pgoff 0x0-0x7ffffffffffff
      
      [akpm@linux-foundation.org: struct blk_dax_ctl has disappeared]
      Link: http://lkml.kernel.org/r/20170221195116.13278-6-ross.zwisler@linux.intel.com
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • dax: add tracepoints to dax_writeback_mapping_range() · d14a3f48
      Committed by Ross Zwisler
      Add tracepoints to dax_writeback_mapping_range(), following the same
      logging conventions as the rest of DAX.
      
      Here is an example writeback call:
      
        msync-1085  [006] ....
         200.902565: dax_writeback_range: dev 259:0 ino 0x1003 pgoff 0x200-0x2ff
      
        msync-1085  [006] ....
         200.902579: dax_writeback_range_done: dev 259:0 ino 0x1003 pgoff 0x200-0x2ff
      
      [ross.zwisler@linux.intel.com: fix regression in dax_writeback_mapping_range()]
        Link: http://lkml.kernel.org/r/20170314215358.31451-1-ross.zwisler@linux.intel.com
      Link: http://lkml.kernel.org/r/20170221195116.13278-5-ross.zwisler@linux.intel.com
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • dax: add tracepoints to dax_load_hole() · 678c9fd0
      Committed by Ross Zwisler
      Add tracepoints to dax_load_hole(), following the same logging conventions
      as the rest of DAX.
      
      Here is the logging generated by a PTE read from a hole:
      
        read-1075  [002] ....
          62.362108: dax_pte_fault: dev 259:0 ino 0x1003 shared ALLOW_RETRY|KILLABLE|USER address 0x10480000 pgoff 0x280
      
        read-1075  [002] ....
          62.362140: dax_load_hole: dev 259:0 ino 0x1003 shared ALLOW_RETRY|KILLABLE|USER address 0x10480000 pgoff 0x280 NOPAGE
      
        read-1075  [002] ....
          62.362141: dax_pte_fault_done: dev 259:0 ino 0x1003 shared ALLOW_RETRY|KILLABLE|USER address 0x10480000 pgoff 0x280 NOPAGE
      
      Link: http://lkml.kernel.org/r/20170221195116.13278-4-ross.zwisler@linux.intel.com
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • dax: add tracepoints to dax_pfn_mkwrite() · c3ff68d7
      Committed by Ross Zwisler
      Add tracepoints to dax_pfn_mkwrite(), following the same logging
      conventions as the rest of DAX.
      
      Here is an example PTE fault followed by a pfn_mkwrite:
      
        small_aligned-1094  [002] ....
         374.084998: dax_pte_fault: dev 259:0 ino 0x1003 shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10400000 pgoff 0x200
      
        small_aligned-1094  [002] ....
         374.085145: dax_pte_fault_done: dev 259:0 ino 0x1003 shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10400000 pgoff 0x200 MAJOR|NOPAGE
      
        small_aligned-1094  [002] ....
         374.085165: dax_pfn_mkwrite: dev 259:0 ino 0x1003 shared WRITE|MKWRITE|ALLOW_RETRY|KILLABLE|USER address 0x10400000 pgoff 0x200 NOPAGE
      
      Link: http://lkml.kernel.org/r/20170221195116.13278-3-ross.zwisler@linux.intel.com
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • dax: add tracepoints to dax_iomap_pte_fault() · a9c42b33
      Committed by Ross Zwisler
      Patch series "second round of tracepoints for DAX".
      
      This second round of DAX tracepoint patches adds tracing to the PTE
      fault path (dax_iomap_pte_fault(), dax_pfn_mkwrite(), dax_load_hole(),
      dax_insert_mapping()) and to the writeback path
      (dax_writeback_mapping_range(), dax_writeback_one()).
      
      The purpose of this tracing is to give us a high level view of what DAX
      is doing, whether faults are being serviced by PMDs or PTEs, and by real
      storage or by zero pages covering holes.
      
      I do have some patches nearly ready which also add tracing to
      grab_mapping_entry() and dax_insert_mapping_entry().  These are more
      targeted at logging how we are interacting with the radix tree, how we
      use empty entries for locking, whether we "downgrade" huge zero pages to
      4k PTE sized allocations, etc.  In the end it seemed to me that this
      might be too detailed to have as constantly present tracepoints, but if
      anyone sees value in having tracepoints like this in the DAX code
      permanently (Jan?), please let me know and I'll add those last two
      patches.
      
      All these tracepoints were done to be consistent with the style of the
      XFS tracepoints and with the existing DAX PMD tracepoints.
      
      This patch (of 6):
      
      Add tracepoints to dax_iomap_pte_fault(), following the same logging
      conventions as the rest of DAX.
      
      Here is an example fault that initially tries to be serviced by the PMD
      fault handler but which falls back to PTEs because the VMA isn't large
      enough to hold a PMD:
      
        small-1086  [005] ....
         71.140014: xfs_filemap_huge_fault: dev 259:0 ino 0x1003
      
        small-1086  [005] ....
          71.140027: dax_pmd_fault: dev 259:0 ino 0x1003 shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10420000 vm_start 0x10200000 vm_end 0x10500000 pgoff 0x220 max_pgoff 0x1400
      
        small-1086  [005] ....
          71.140028: dax_pmd_fault_done: dev 259:0 ino 0x1003 shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10420000 vm_start 0x10200000 vm_end 0x10500000 pgoff 0x220 max_pgoff 0x1400 FALLBACK
      
        small-1086  [005] ....
          71.140035: dax_pte_fault: dev 259:0 ino 0x1003 shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10420000 pgoff 0x220
      
        small-1086  [005] ....
          71.140396: dax_pte_fault_done: dev 259:0 ino 0x1003 shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10420000 pgoff 0x220 MAJOR|NOPAGE
      
      Link: http://lkml.kernel.org/r/20170221195116.13278-2-ross.zwisler@linux.intel.com
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  16. 26 April 2017, 2 commits
    • filesystem-dax: convert to dax_direct_access() · cccbce67
      Committed by Dan Williams
      Now that a dax_device is plumbed through all dax-capable drivers we can
      switch from block_device_operations to dax_operations for invoking
      ->direct_access.
      
      This also lets us kill off some usages of struct blk_dax_ctl on the way
      to its eventual removal.
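      
      Callers now resolve a kernel address and pfn through the dax_device,
      bracketed by dax_read_lock() (a hedged sketch of the new calling
      convention):
      
        void *kaddr;
        pfn_t pfn;
        long map_len;
        int id;
        
        id = dax_read_lock();
        map_len = dax_direct_access(dax_dev, pgoff, PHYS_PFN(size),
        			    &kaddr, &pfn);
        dax_read_unlock(id);
        if (map_len < 0)
        	return map_len;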
      Suggested-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • Revert "block: use DAX for partition table reads" · a41fe02b
      Committed by Dan Williams
      commit d1a5f2b4 ("block: use DAX for partition table reads") was
      part of a stalled effort to allow dax mappings of block devices. Since
      then the device-dax mechanism has filled the role of dax-mapping static
      device ranges.
      
      Now that we are moving ->direct_access() from a block_device operation
      to a dax_inode operation we would need block devices to map and carry
      their own dax_inode reference.
      
      Unless / until we decide to revive dax mapping of raw block devices
      through the dax_inode scheme, there is no need to carry
      read_dax_sector(). Its removal in turn allows for the removal of
      bdev_direct_access() and should have been included in commit
      22375701 ("block_dev: remove DAX leftovers").
      
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
  17. 09 April 2017, 1 commit
  18. 08 April 2017, 1 commit
    • dax: fix radix tree insertion race · e11f8b7b
      Committed by Ross Zwisler
      While running generic/340 in my test setup I hit the following race.  It
      can happen with kernels that support FS DAX PMDs, so v4.10 through
      v4.11-rc5.
      
      Thread 1				Thread 2
      --------				--------
      dax_iomap_pmd_fault()
        grab_mapping_entry()
          spin_lock_irq()
          get_unlocked_mapping_entry()
          'entry' is NULL, can't call lock_slot()
          spin_unlock_irq()
          radix_tree_preload()
      					dax_iomap_pmd_fault()
      					  grab_mapping_entry()
      					    spin_lock_irq()
      					    get_unlocked_mapping_entry()
      					    ...
      					    lock_slot()
      					    spin_unlock_irq()
      					  dax_pmd_insert_mapping()
      					    <inserts a PMD mapping>
          spin_lock_irq()
          __radix_tree_insert() fails with -EEXIST
          <fall back to 4k fault, and die horribly
           when inserting a 4k entry where a PMD exists>
      
      The issue is that we have to drop mapping->tree_lock while calling
      radix_tree_preload(), but since we didn't have a radix tree entry to
      lock (unlike in the pmd_downgrade case) we have no protection against
      Thread 2 coming along and inserting a PMD at the same index.  For 4k
      entries we handled this with a special-case response to -EEXIST coming
      from the __radix_tree_insert(), but this doesn't save us for PMDs
      because the -EEXIST case can also mean that we collided with a 4k entry
      in the radix tree at a different index, but one that is covered by our
      PMD range.
      
      So, correctly handle both the 4k and 2M collision cases by explicitly
      re-checking the radix tree for an entry at our index once we reacquire
      mapping->tree_lock.
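      
      A sketch of the re-check (hedged; the real patch restructures
      grab_mapping_entry() around its restart label):
      
        spin_lock_irq(&mapping->tree_lock);
        err = __radix_tree_insert(&mapping->page_tree, index,
        			  dax_radix_order(entry), entry);
        radix_tree_preload_end();
        if (err) {
        	spin_unlock_irq(&mapping->tree_lock);
        	/*
        	 * Lost a race while tree_lock was dropped: re-look-up the
        	 * index.  This also catches a 4k entry that collides with
        	 * our PMD range at a different index.
        	 */
        	goto restart;
        }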
      
      This patch has made it through a clean xfstests run with the current
      v4.11-rc5 based linux/master, and it also ran generic/340 500 times in a
      loop.  It used to fail within the first 10 iterations.
      
      Link: http://lkml.kernel.org/r/20170406212944.2866-1-ross.zwisler@linux.intel.com
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: <stable@vger.kernel.org>    [4.10+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  19. 02 March 2017, 1 commit
  20. 28 February 2017, 1 commit
  21. 25 February 2017, 3 commits