1. 09 5月, 2017 4 次提交
    • R
      dax: add tracepoints to dax_writeback_mapping_range() · d14a3f48
      Ross Zwisler 提交于
      Add tracepoints to dax_writeback_mapping_range(), following the same
      logging conventions as the rest of DAX.
      
      Here is an example writeback call:
      
        msync-1085  [006] ....
         200.902565: dax_writeback_range: dev 259:0 ino 0x1003 pgoff 0x200-0x2ff
      
        msync-1085  [006] ....
         200.902579: dax_writeback_range_done: dev 259:0 ino 0x1003 pgoff 0x200-0x2ff
      
      [ross.zwisler@linux.intel.com: fix regression in dax_writeback_mapping_range()]
        Link: http://lkml.kernel.org/r/20170314215358.31451-1-ross.zwisler@linux.intel.com
      Link: http://lkml.kernel.org/r/20170221195116.13278-5-ross.zwisler@linux.intel.comSigned-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d14a3f48
    • R
      dax: add tracepoints to dax_load_hole() · 678c9fd0
      Ross Zwisler 提交于
      Add tracepoints to dax_load_hole(), following the same logging conventions
      as the rest of DAX.
      
      Here is the logging generated by a PTE read from a hole:
      
        read-1075  [002] ....
          62.362108: dax_pte_fault: dev 259:0 ino 0x1003 shared ALLOW_RETRY|KILLABLE|USER address 0x10480000 pgoff 0x280
      
        read-1075  [002] ....
          62.362140: dax_load_hole: dev 259:0 ino 0x1003 shared ALLOW_RETRY|KILLABLE|USER address 0x10480000 pgoff 0x280 NOPAGE
      
        read-1075  [002] ....
          62.362141: dax_pte_fault_done: dev 259:0 ino 0x1003 shared ALLOW_RETRY|KILLABLE|USER address 0x10480000 pgoff 0x280 NOPAGE
      
      Link: http://lkml.kernel.org/r/20170221195116.13278-4-ross.zwisler@linux.intel.comSigned-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      678c9fd0
    • R
      dax: add tracepoints to dax_pfn_mkwrite() · c3ff68d7
      Ross Zwisler 提交于
      Add tracepoints to dax_pfn_mkwrite(), following the same logging
      conventions as the rest of DAX.
      
      Here is an example PTE fault followed by a pfn_mkwrite:
      
        small_aligned-1094  [002] ....
         374.084998: dax_pte_fault: dev 259:0 ino 0x1003 shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10400000 pgoff 0x200
      
        small_aligned-1094  [002] ....
         374.085145: dax_pte_fault_done: dev 259:0 ino 0x1003 shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10400000 pgoff 0x200 MAJOR|NOPAGE
      
        small_aligned-1094  [002] ....
         374.085165: dax_pfn_mkwrite: dev 259:0 ino 0x1003 shared WRITE|MKWRITE|ALLOW_RETRY|KILLABLE|USER address 0x10400000 pgoff 0x200 NOPAGE
      
      Link: http://lkml.kernel.org/r/20170221195116.13278-3-ross.zwisler@linux.intel.comSigned-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c3ff68d7
    • R
      dax: add tracepoints to dax_iomap_pte_fault() · a9c42b33
      Ross Zwisler 提交于
      Patch series "second round of tracepoints for DAX".
      
      This second round of DAX tracepoint patches adds tracing to the PTE
      fault path (dax_iomap_pte_fault(), dax_pfn_mkwrite(), dax_load_hole(),
      dax_insert_mapping()) and to the writeback path
      (dax_writeback_mapping_range(), dax_writeback_one()).
      
      The purpose of this tracing is to give us a high level view of what DAX
      is doing, whether faults are being serviced by PMDs or PTEs, and by real
      storage or by zero pages covering holes.
      
      I do have some patches nearly ready which also add tracing to
      grab_mapping_entry() and dax_insert_mapping_entry().  These are more
      targeted at logging how we are interacting with the radix tree, how we
      use empty entries for locking, whether we "downgrade" huge zero pages to
      4k PTE sized allocations, etc.  In the end it seemed to me that this
      might be too detailed to have as constantly present tracepoints, but if
      anyone sees value in having tracepoints like this in the DAX code
      permanently (Jan?), please let me know and I'll add those last two
      patches.
      
      All these tracepoints were done to be consistent with the style of the
      XFS tracepoints and with the existing DAX PMD tracepoints.
      
      This patch (of 6):
      
      Add tracepoints to dax_iomap_pte_fault(), following the same logging
      conventions as the rest of DAX.
      
      Here is an example fault that initially tries to be serviced by the PMD
      fault handler but which falls back to PTEs because the VMA isn't large
      enough to hold a PMD:
      
        small-1086  [005] ....
         71.140014: xfs_filemap_huge_fault: dev 259:0 ino 0x1003
      
        small-1086  [005] ....
          71.140027: dax_pmd_fault: dev 259:0 ino 0x1003 shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10420000 vm_start 0x10200000 vm_end 0x10500000 pgoff 0x220 max_pgoff 0x1400
      
        small-1086  [005] ....
          71.140028: dax_pmd_fault_done: dev 259:0 ino 0x1003 shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10420000 vm_start 0x10200000 vm_end 0x10500000 pgoff 0x220 max_pgoff 0x1400 FALLBACK
      
        small-1086  [005] ....
          71.140035: dax_pte_fault: dev 259:0 ino 0x1003 shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10420000 pgoff 0x220
      
        small-1086  [005] ....
          71.140396: dax_pte_fault_done: dev 259:0 ino 0x1003 shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x10420000 pgoff 0x220 MAJOR|NOPAGE
      
      Link: http://lkml.kernel.org/r/20170221195116.13278-2-ross.zwisler@linux.intel.comSigned-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a9c42b33
  2. 26 4月, 2017 2 次提交
    • D
      filesystem-dax: convert to dax_direct_access() · cccbce67
      Dan Williams 提交于
      Now that a dax_device is plumbed through all dax-capable drivers we can
      switch from block_device_operations to dax_operations for invoking
      ->direct_access.
      
      This also lets us kill off some usages of struct blk_dax_ctl on the way
      to its eventual removal.
      Suggested-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      cccbce67
    • D
      Revert "block: use DAX for partition table reads" · a41fe02b
      Dan Williams 提交于
      commit d1a5f2b4 ("block: use DAX for partition table reads") was
      part of a stalled effort to allow dax mappings of block devices. Since
      then the device-dax mechanism has filled the role of dax-mapping static
      device ranges.
      
      Now that we are moving ->direct_access() from a block_device operation
      to a dax_inode operation we would need block devices to map and carry
      their own dax_inode reference.
      
      Unless / until we decide to revive dax mapping of raw block devices
      through the dax_inode scheme, there is no need to carry
      read_dax_sector(). Its removal in turn allows for the removal of
      bdev_direct_access() and should have been included in commit
      22375701 ("block_dev: remove DAX leftovers").
      
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      a41fe02b
  3. 09 4月, 2017 1 次提交
  4. 08 4月, 2017 1 次提交
    • R
      dax: fix radix tree insertion race · e11f8b7b
      Ross Zwisler 提交于
      While running generic/340 in my test setup I hit the following race.  It
      can happen with kernels that support FS DAX PMDs, so v4.10 thru
      v4.11-rc5.
      
      Thread 1				Thread 2
      --------				--------
      dax_iomap_pmd_fault()
        grab_mapping_entry()
          spin_lock_irq()
          get_unlocked_mapping_entry()
          'entry' is NULL, can't call lock_slot()
          spin_unlock_irq()
          radix_tree_preload()
      					dax_iomap_pmd_fault()
      					  grab_mapping_entry()
      					    spin_lock_irq()
      					    get_unlocked_mapping_entry()
      					    ...
      					    lock_slot()
      					    spin_unlock_irq()
      					  dax_pmd_insert_mapping()
      					    <inserts a PMD mapping>
          spin_lock_irq()
          __radix_tree_insert() fails with -EEXIST
          <fall back to 4k fault, and die horribly
           when inserting a 4k entry where a PMD exists>
      
      The issue is that we have to drop mapping->tree_lock while calling
      radix_tree_preload(), but since we didn't have a radix tree entry to
      lock (unlike in the pmd_downgrade case) we have no protection against
      Thread 2 coming along and inserting a PMD at the same index.  For 4k
      entries we handled this with a special-case response to -EEXIST coming
      from the __radix_tree_insert(), but this doesn't save us for PMDs
      because the -EEXIST case can also mean that we collided with a 4k entry
      in the radix tree at a different index, but one that is covered by our
      PMD range.
      
      So, correctly handle both the 4k and 2M collision cases by explicitly
      re-checking the radix tree for an entry at our index once we reacquire
      mapping->tree_lock.
      
      This patch has made it through a clean xfstests run with the current
      v4.11-rc5 based linux/master, and it also ran generic/340 500 times in a
      loop.  It used to fail within the first 10 iterations.
      
      Link: http://lkml.kernel.org/r/20170406212944.2866-1-ross.zwisler@linux.intel.comSigned-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: <stable@vger.kernel.org>    [4.10+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e11f8b7b
  5. 02 3月, 2017 1 次提交
  6. 28 2月, 2017 1 次提交
  7. 25 2月, 2017 3 次提交
  8. 23 2月, 2017 5 次提交
  9. 09 2月, 2017 1 次提交
  10. 04 2月, 2017 1 次提交
    • M
      fs: break out of iomap_file_buffered_write on fatal signals · d1908f52
      Michal Hocko 提交于
      Tetsuo has noticed that an OOM stress test which performs large write
      requests can cause the full memory reserves depletion.  He has tracked
      this down to the following path
      
      	__alloc_pages_nodemask+0x436/0x4d0
      	alloc_pages_current+0x97/0x1b0
      	__page_cache_alloc+0x15d/0x1a0          mm/filemap.c:728
      	pagecache_get_page+0x5a/0x2b0           mm/filemap.c:1331
      	grab_cache_page_write_begin+0x23/0x40   mm/filemap.c:2773
      	iomap_write_begin+0x50/0xd0             fs/iomap.c:118
      	iomap_write_actor+0xb5/0x1a0            fs/iomap.c:190
      	? iomap_write_end+0x80/0x80             fs/iomap.c:150
      	iomap_apply+0xb3/0x130                  fs/iomap.c:79
      	iomap_file_buffered_write+0x68/0xa0     fs/iomap.c:243
      	? iomap_write_end+0x80/0x80
      	xfs_file_buffered_aio_write+0x132/0x390 [xfs]
      	? remove_wait_queue+0x59/0x60
      	xfs_file_write_iter+0x90/0x130 [xfs]
      	__vfs_write+0xe5/0x140
      	vfs_write+0xc7/0x1f0
      	? syscall_trace_enter+0x1d0/0x380
      	SyS_write+0x58/0xc0
      	do_syscall_64+0x6c/0x200
      	entry_SYSCALL64_slow_path+0x25/0x25
      
      the oom victim has access to all memory reserves to make a forward
      progress to exit easier.  But iomap_file_buffered_write and other
      callers of iomap_apply loop to complete the full request.  We need to
      check for fatal signals and back off with a short write instead.
      
      As the iomap_apply delegates all the work down to the actor we have to
      hook into those.  All callers that work with the page cache are calling
      iomap_write_begin so we will check for signals there.  dax_iomap_actor
      has to handle the situation explicitly because it copies data to the
      userspace directly.  Other callers like iomap_page_mkwrite work on a
      single page or iomap_fiemap_actor do not allocate memory based on the
      given len.
      
      Fixes: 68a9f5e7 ("xfs: implement iomap based buffered write path")
      Link: http://lkml.kernel.org/r/20170201092706.9966-2-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Reported-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: <stable@vger.kernel.org>	[4.8+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d1908f52
  11. 31 1月, 2017 1 次提交
  12. 25 1月, 2017 1 次提交
  13. 11 1月, 2017 1 次提交
    • R
      dax: wrprotect pmd_t in dax_mapping_entry_mkclean · f729c8c9
      Ross Zwisler 提交于
      Currently dax_mapping_entry_mkclean() fails to clean and write protect
      the pmd_t of a DAX PMD entry during an *sync operation.  This can result
      in data loss in the following sequence:
      
      1) mmap write to DAX PMD, dirtying PMD radix tree entry and making the
         pmd_t dirty and writeable
      2) fsync, flushing out PMD data and cleaning the radix tree entry. We
         currently fail to mark the pmd_t as clean and write protected.
      3) more mmap writes to the PMD.  These don't cause any page faults since
         the pmd_t is dirty and writeable.  The radix tree entry remains clean.
      4) fsync, which fails to flush the dirty PMD data because the radix tree
         entry was clean.
      5) crash - dirty data that should have been fsync'd as part of 4) could
         still have been in the processor cache, and is lost.
      
      Fix this by marking the pmd_t clean and write protected in
      dax_mapping_entry_mkclean(), which is called as part of the fsync
      operation 2).  This will cause the writes in step 3) above to generate
      page faults where we'll re-dirty the PMD radix tree entry, resulting in
      flushes in the fsync that happens in step 4).
      
      Fixes: 4b4bb46d ("dax: clear dirty entry tags on cache flush")
      Link: http://lkml.kernel.org/r/1482272586-21177-3-git-send-email-ross.zwisler@linux.intel.comSigned-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f729c8c9
  14. 27 12月, 2016 4 次提交
  15. 15 12月, 2016 5 次提交
  16. 13 12月, 2016 4 次提交
  17. 21 11月, 2016 1 次提交
  18. 10 11月, 2016 1 次提交
  19. 08 11月, 2016 2 次提交
    • R
      dax: add struct iomap based DAX PMD support · 642261ac
      Ross Zwisler 提交于
      DAX PMDs have been disabled since Jan Kara introduced DAX radix tree based
      locking.  This patch allows DAX PMDs to participate in the DAX radix tree
      based locking scheme so that they can be re-enabled using the new struct
      iomap based fault handlers.
      
      There are currently three types of DAX 4k entries: 4k zero pages, 4k DAX
      mappings that have an associated block allocation, and 4k DAX empty
      entries.  The empty entries exist to provide locking for the duration of a
      given page fault.
      
      This patch adds three equivalent 2MiB DAX entries: Huge Zero Page (HZP)
      entries, PMD DAX entries that have associated block allocations, and 2 MiB
      DAX empty entries.
      
      Unlike the 4k case where we insert a struct page* into the radix tree for
      4k zero pages, for HZP we insert a DAX exceptional entry with the new
      RADIX_DAX_HZP flag set.  This is because we use a single 2 MiB zero page in
      every 2MiB hole mapping, and it doesn't make sense to have that same struct
      page* with multiple entries in multiple trees.  This would cause contention
      on the single page lock for the one Huge Zero Page, and it would break the
      page->index and page->mapping associations that are assumed to be valid in
      many other places in the kernel.
      
      One difficult use case is when one thread is trying to use 4k entries in
      radix tree for a given offset, and another thread is using 2 MiB entries
      for that same offset.  The current code handles this by making the 2 MiB
      user fall back to 4k entries for most cases.  This was done because it is
      the simplest solution, and because the use of 2MiB pages is already
      opportunistic.
      
      If we were to try to upgrade from 4k pages to 2MiB pages for a given range,
      we run into the problem of how we lock out 4k page faults for the entire
      2MiB range while we clean out the radix tree so we can insert the 2MiB
      entry.  We can solve this problem if we need to, but I think that the cases
      where both 2MiB entries and 4K entries are being used for the same range
      will be rare enough and the gain small enough that it probably won't be
      worth the complexity.
      Signed-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      642261ac
    • R
      dax: move put_(un)locked_mapping_entry() in dax.c · 422476c4
      Ross Zwisler 提交于
      No functional change.
      
      The static functions put_locked_mapping_entry() and
      put_unlocked_mapping_entry() will soon be used in error cases in
      grab_mapping_entry(), so move their definitions above this function.
      Signed-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      422476c4