1. 22 May 2018, 3 commits
    • xfs, dax: introduce xfs_break_dax_layouts() · d6dc57e2
      Authored by Dan Williams
      xfs_break_dax_layouts(), similar to xfs_break_leased_layouts(), scans
      for busy / pinned dax pages and waits for those pages to go idle before
      any potential extent unmap operation.
      
      dax_layout_busy_page() handles synchronizing against new page-busy
      events (get_user_pages). It invalidates all mappings to trigger the
      get_user_pages slow path which will eventually block on the xfs inode
      lock held in XFS_MMAPLOCK_EXCL mode. If dax_layout_busy_page() finds a
      busy page it returns it for xfs to wait for the page-idle event that
      will fire when the page reference count reaches 1 (recall ZONE_DEVICE
      pages are idle at count 1, see generic_dax_pagefree()).
      
      While waiting, the XFS_MMAPLOCK_EXCL lock is dropped in order to not
      deadlock the process that might be trying to elevate the page count of
      more pages before arranging for any of them to go idle. I.e. the typical
      case of submitting I/O is that iov_iter_get_pages() elevates the
      reference count of all pages in the I/O before starting I/O on the first
      page. The process of elevating the reference count of all pages involved
      in an I/O may cause faults that need to take XFS_MMAPLOCK_EXCL.
      
      Although XFS_MMAPLOCK_EXCL is dropped while waiting, XFS_IOLOCK_EXCL is
      held while sleeping. Holding it prevents starvation of the truncate
      path: if that lock were dropped as well, continuous submission of
      direct I/O could starve truncate indefinitely.
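
      For illustration, a hedged C sketch of the wait loop described above.
      dax_layout_busy_page() and the lock modes come from this changelog; the
      unlock point and the wait primitive shown here are assumptions made for
      readability, not the literal patch:

      	/*
      	 * Sketch only: the unlock bookkeeping and the wait primitive are
      	 * illustrative assumptions, not the exact upstream code.
      	 */
      	static int
      	xfs_break_dax_layouts(
      		struct inode		*inode,
      		bool			*did_unlock)
      	{
      		struct xfs_inode	*ip = XFS_I(inode);
      		struct page		*page;

      		ASSERT(xfs_isilocked(ip, XFS_MMAPLOCK_EXCL));

      		/* Unmap everything and look for a page still pinned, e.g. by DMA. */
      		page = dax_layout_busy_page(inode->i_mapping);
      		if (!page)
      			return 0;

      		/*
      		 * Drop XFS_MMAPLOCK_EXCL so in-flight get_user_pages() callers
      		 * can finish pinning their pages; the caller keeps
      		 * XFS_IOLOCK_EXCL held so truncate is not starved.  Sleep until
      		 * the page goes idle, i.e. refcount 1 for ZONE_DEVICE pages.
      		 */
      		xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);
      		*did_unlock = true;
      		return wait_var_event_interruptible(&page->_refcount,
      						    page_ref_count(page) == 1);
      	}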
      
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Reported-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Acked-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • xfs: prepare xfs_break_layouts() for another layout type · 69eb5fa1
      Authored by Dan Williams
      When xfs is operating as the back-end of a pNFS block server, it
      prevents collisions between local and remote operations by requiring a
      lease to be held for remotely accessed blocks. Local filesystem
      operations break those leases before writing or mutating the extent map
      of the file.
      
      A similar mechanism is needed to prevent operations on pinned dax
      mappings, like device-DMA, from colliding with extent unmap operations.
      
      BREAK_WRITE and BREAK_UNMAP are introduced as two distinct levels of
      layout breaking.
      
      Layouts are broken in the BREAK_WRITE case to ensure that layout-holders
      do not collide with local writes. Additionally, layouts are broken in
      the BREAK_UNMAP case to make sure the layout-holder has a consistent
      view of the file's extent map. While BREAK_WRITE breaks can be satisfied
      by recalling FL_LAYOUT leases, BREAK_UNMAP breaks additionally require
      waiting for busy dax-pages to go idle while holding XFS_MMAPLOCK_EXCL.
      
      After this refactoring xfs_break_layouts() becomes the entry point for
      coordinating both types of breaks. Finally, xfs_break_leased_layouts()
      becomes just the BREAK_WRITE handler.
      
      Note that the unlock tracking is needed for a follow-on change, which
      will coordinate retrying either break handler until both successfully
      test for a lease break while maintaining the lock state.
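
      To make the resulting call structure concrete, here is a hedged sketch
      of the dispatch. The BREAK_WRITE/BREAK_UNMAP names come from this
      changelog; the retry loop reflects the follow-on change mentioned above
      and is an assumption, not the literal patch:

      	enum layout_break_reason {
      		BREAK_WRITE,	/* recall FL_LAYOUT leases before a local write */
      		BREAK_UNMAP,	/* additionally wait for busy dax pages */
      	};

      	int
      	xfs_break_layouts(
      		struct inode		*inode,
      		uint			*iolock,
      		enum layout_break_reason reason)
      	{
      		bool			retry;
      		int			error;

      		do {
      			retry = false;
      			switch (reason) {
      			case BREAK_UNMAP:
      				/* Wait for dax pages to go idle (may cycle MMAPLOCK). */
      				error = xfs_break_dax_layouts(inode, &retry);
      				if (error || retry)
      					break;
      				/* fall through: an unmap must also break leases */
      			case BREAK_WRITE:
      				error = xfs_break_leased_layouts(inode, iolock, &retry);
      				break;
      			default:
      				WARN_ON_ONCE(1);
      				error = -EINVAL;
      			}
      		} while (error == 0 && retry);

      		return error;
      	}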
      
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
      Reported-by: Dave Chinner <david@fromorbit.com>
      Reported-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
    • xfs: prepare xfs_break_layouts() to be called with XFS_MMAPLOCK_EXCL · c63a8eae
      Authored by Dan Williams
      In preparation for adding coordination between extent unmap operations
      and busy dax-pages, update xfs_break_layouts() to permit it to be called
      with the mmap lock held. This lock scheme will be required for
      coordinating the break of 'dax layouts' (non-idle dax (ZONE_DEVICE)
      pages mapped into the file's address space). Breaking dax layouts will
      be added to xfs_break_layouts() in a future patch; for now this prepares
      the unmap call sites to take and hold XFS_MMAPLOCK_EXCL over the call to
      xfs_break_layouts(), as sketched below.
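
      A hedged sketch of the prepared call-site pattern; the function name and
      the hole-punch body are hypothetical stand-ins for the real unmap call
      sites (fallocate, truncate, etc.):

      	/* Hypothetical example of an unmap call site after this prep. */
      	static int
      	xfs_unmap_example(
      		struct xfs_inode	*ip,
      		xfs_off_t		offset,
      		xfs_off_t		len)
      	{
      		struct inode		*inode = VFS_I(ip);
      		uint			iolock = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
      		int			error;

      		/* Take and hold both locks across the layout break. */
      		xfs_ilock(ip, iolock);
      		error = xfs_break_layouts(inode, &iolock);
      		if (error)
      			goto out_unlock;

      		/* ... punch the hole while no new mappings can be created ... */
      		error = xfs_free_file_space(ip, offset, len);

      	out_unlock:
      		xfs_iunlock(ip, iolock);
      		return error;
      	}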
      
      Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Suggested-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: "Darrick J. Wong" <darrick.wong@oracle.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
  2. 16 March 2018, 1 commit
  3. 15 March 2018, 1 commit
  4. 08 January 2018, 1 commit
  5. 03 November 2017, 3 commits
  6. 27 October 2017, 1 commit
  7. 24 October 2017, 1 commit
  8. 12 October 2017, 1 commit
  9. 27 September 2017, 1 commit
  10. 26 September 2017, 2 commits
  11. 07 September 2017, 1 commit
    • dax: use common 4k zero page for dax mmap reads · 91d25ba8
      Authored by Ross Zwisler
      When servicing mmap() reads from file holes the current DAX code
      allocates a page cache page of all zeroes and places the struct page
      pointer in the mapping->page_tree radix tree.
      
      This has three major drawbacks:
      
      1) It consumes memory unnecessarily. For every 4k page that is read via
         a DAX mmap() over a hole, we allocate a new page cache page. This
         means that if you read 1GiB worth of pages, you end up using 1GiB of
         zeroed memory. This is easily visible by looking at the overall
         memory consumption of the system or by looking at /proc/[pid]/smaps:
      
      	7f62e72b3000-7f63272b3000 rw-s 00000000 103:00 12   /root/dax/data
      	Size:            1048576 kB
      	Rss:             1048576 kB
      	Pss:             1048576 kB
      	Shared_Clean:          0 kB
      	Shared_Dirty:          0 kB
      	Private_Clean:   1048576 kB
      	Private_Dirty:         0 kB
      	Referenced:      1048576 kB
      	Anonymous:             0 kB
      	LazyFree:              0 kB
      	AnonHugePages:         0 kB
      	ShmemPmdMapped:        0 kB
      	Shared_Hugetlb:        0 kB
      	Private_Hugetlb:       0 kB
      	Swap:                  0 kB
      	SwapPss:               0 kB
      	KernelPageSize:        4 kB
      	MMUPageSize:           4 kB
      	Locked:                0 kB
      
      2) It is slower than using a common zero page because each page fault
         has more work to do. Instead of just inserting a common zero page we
         have to allocate a page cache page, zero it, and then insert it. Here
         are the average latencies of dax_load_hole() as measured by ftrace on
         a random test box:
      
          Old method, using zeroed page cache pages:	3.4 us
          New method, using the common 4k zero page:	0.8 us
      
         This was the average latency over 1 GiB of sequential reads done by
         this simple fio script:
      
           [global]
           size=1G
           filename=/root/dax/data
           fallocate=none
           [io]
           rw=read
           ioengine=mmap
      
      3) The fact that we had to check for both DAX exceptional entries and
         for page cache pages in the radix tree made the DAX code more
         complex.
      
      Solve these issues by following the lead of the DAX PMD code and using a
      common 4k zero page instead.  As with the PMD code we will now insert a
      DAX exceptional entry into the radix tree instead of a struct page
      pointer which allows us to remove all the special casing in the DAX
      code.
      
      Note that we do still pretty aggressively check for regular pages in the
      DAX radix tree, especially where we take action based on the bits set in
      the page.  If we ever find a regular page in our radix tree now that
      most likely means that someone besides DAX is inserting pages (which has
      happened lots of times in the past), and we want to find that out early
      and fail loudly.
      
      This solution also removes the extra memory consumption.  Here is that
      same /proc/[pid]/smaps after 1GiB of reading from a hole with the new
      code:
      
      	7f2054a74000-7f2094a74000 rw-s 00000000 103:00 12   /root/dax/data
      	Size:            1048576 kB
      	Rss:                   0 kB
      	Pss:                   0 kB
      	Shared_Clean:          0 kB
      	Shared_Dirty:          0 kB
      	Private_Clean:         0 kB
      	Private_Dirty:         0 kB
      	Referenced:            0 kB
      	Anonymous:             0 kB
      	LazyFree:              0 kB
      	AnonHugePages:         0 kB
      	ShmemPmdMapped:        0 kB
      	Shared_Hugetlb:        0 kB
      	Private_Hugetlb:       0 kB
      	Swap:                  0 kB
      	SwapPss:               0 kB
      	KernelPageSize:        4 kB
      	MMUPageSize:           4 kB
      	Locked:                0 kB
      
      Overall system memory consumption is similarly improved.
      
      Another major change is that we remove dax_pfn_mkwrite() from our fault
      flow, and instead rely on the page fault itself to make the PTE dirty
      and writeable.  The following description from the patch adding the
      vm_insert_mixed_mkwrite() call explains this a little more:
      
         "To be able to use the common 4k zero page in DAX we need to have our
          PTE fault path look more like our PMD fault path where a PTE entry
          can be marked as dirty and writeable as it is first inserted rather
          than waiting for a follow-up dax_pfn_mkwrite() =>
          finish_mkwrite_fault() call.
      
          Right now we can rely on having a dax_pfn_mkwrite() call because we
          can distinguish between these two cases in do_wp_page():
      
                  case 1: 4k zero page => writable DAX storage
                  case 2: read-only DAX storage => writeable DAX storage
      
          This distinction is made via vm_normal_page(). vm_normal_page()
          returns NULL for the common 4k zero page, though, just as it does
          for DAX ptes. Instead of special casing the DAX + 4k zero page case
          we will simplify our DAX PTE page fault sequence so that it matches
          our DAX PMD sequence, and get rid of the dax_pfn_mkwrite() helper.
          We will instead use dax_iomap_fault() to handle write-protection
          faults.
      
          This means that insert_pfn() needs to follow the lead of
          insert_pfn_pmd() and allow us to pass in a 'mkwrite' flag. If
          'mkwrite' is set insert_pfn() will do the work that was previously
          done by wp_page_reuse() as part of the dax_pfn_mkwrite() call path"
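
      To illustrate the resulting PTE fault shape, here is a hedged sketch of
      a hole-read fault using the common zero page. dax_insert_zero_entry()
      is a hypothetical stand-in for the radix-tree bookkeeping (the real code
      stores a DAX exceptional entry rather than a struct page pointer); the
      rest uses helpers named in this changelog:

      	static int dax_load_hole_sketch(struct vm_fault *vmf)
      	{
      		struct page *zero_page = ZERO_PAGE(0);	/* shared 4k zero page */
      		pfn_t pfn = page_to_pfn_t(zero_page);
      		int rc;

      		/*
      		 * Hypothetical helper: record a zero-page DAX exceptional entry
      		 * in the mapping's radix tree instead of a page cache page.
      		 */
      		dax_insert_zero_entry(vmf->vma->vm_file->f_mapping, vmf, pfn);

      		/*
      		 * Read fault: map the shared zero page read-only.  A later write
      		 * fault goes through dax_iomap_fault(), which installs the PTE
      		 * dirty and writeable via vm_insert_mixed_mkwrite(), so no
      		 * dax_pfn_mkwrite() follow-up is needed.
      		 */
      		rc = vm_insert_mixed(vmf->vma, vmf->address, pfn);
      		return (rc == -ENOMEM) ? VM_FAULT_OOM :
      		       (rc < 0 && rc != -EBUSY) ? VM_FAULT_SIGBUS : VM_FAULT_NOPAGE;
      	}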
      
      Link: http://lkml.kernel.org/r/20170724170616.25810-4-ross.zwisler@linux.intel.com
      Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  12. 05 September 2017, 1 commit
  13. 02 September 2017, 2 commits
  14. 06 July 2017, 1 commit
  15. 03 July 2017, 1 commit
  16. 28 June 2017, 1 commit
  17. 21 June 2017, 1 commit
  18. 20 June 2017, 1 commit
  19. 26 May 2017, 4 commits
  20. 28 February 2017, 1 commit
  21. 25 February 2017, 3 commits
  22. 23 February 2017, 2 commits
  23. 07 February 2017, 2 commits
  24. 03 February 2017, 1 commit
  25. 31 January 2017, 1 commit
    • xfs: sync eofblocks scans under iolock are livelock prone · c3155097
      Authored by Brian Foster
      The xfs_eofblocks.eof_scan_owner field is an internal field to
      facilitate invoking eofb scans from the kernel while under the iolock.
      This is necessary because the eofb scan acquires the iolock of each
      inode. Synchronous scans are invoked on certain buffered write failures
      while under iolock. In such cases, the scan owner indicates that the
      context for the scan already owns the particular iolock and prevents a
      double lock deadlock.
      
      eofblocks scans while under iolock are still livelock prone in the event
      of multiple parallel scans, however. If multiple buffered writes to
      different inodes fail and invoke eofblocks scans at the same time, each
      scan avoids a deadlock with its own inode by virtue of the
      eof_scan_owner field, but will never be able to acquire the iolock of
      the inode from the parallel scan. Because the low free space scans are
      invoked with SYNC_WAIT, the scan will not return until it has processed
      every tagged inode and thus both scans will spin indefinitely on the
      iolock being held across the opposite scan. This problem can be
      reproduced reliably by generic/224 on systems with higher cpu counts
      (x16).
      
      To avoid this problem, simplify the semantics of eofblocks scans to
      never invoke a scan while under iolock. This means that the buffered
      write context must drop the iolock before the scan. It must reacquire
      the lock before the write retry and also repeat the initial write
      checks, as the original state might no longer be valid once the iolock
      was dropped.
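
      A hedged sketch of the resulting retry shape in the buffered write path
      (approximating xfs_file_buffered_aio_write(); the exact helpers and
      flags shown are assumptions, not the literal patch):

      	/* Sketch only: helper names and flags are approximations. */
      write_retry:
      	ret = iomap_file_buffered_write(iocb, from, &xfs_iomap_ops);
      	if (ret == -ENOSPC && !cleared_space) {
      		struct xfs_eofblocks eofb = {0};

      		cleared_space = true;
      		xfs_flush_inodes(ip->i_mount);

      		/* Drop the iolock before the synchronous eofblocks scan ... */
      		xfs_iunlock(ip, iolock);
      		eofb.eof_flags = XFS_EOF_FLAGS_SYNC;
      		xfs_icache_free_eofblocks(ip->i_mount, &eofb);

      		/* ... then retake it and redo the initial write checks. */
      		xfs_ilock(ip, iolock);
      		ret = xfs_file_aio_write_checks(iocb, from, &iolock);
      		if (!ret)
      			goto write_retry;
      	}
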
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
  26. 10 December 2016, 1 commit
    • fs: try to clone files first in vfs_copy_file_range · a76b5b04
      Authored by Christoph Hellwig
      A clone is a perfectly fine implementation of a file copy, so most
      file systems just implement the copy that way.  Instead of duplicating
      this logic, move it to the VFS.  Currently btrfs and XFS implement
      copies the same way as clones, so there is no behavior change for them;
      cifs only implements clones and grows support for copy_file_range with
      this patch.  NFS implements both, so this will allow copy_file_range to
      work on servers that only implement CLONE and be a lot more efficient
      on servers that implement both CLONE and COPY.
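
      A hedged sketch of the clone-first fallback inside vfs_copy_file_range();
      surrounding permission checks and the final splice fallback are
      abbreviated, so treat the exact control flow as an assumption:

      	/*
      	 * Try cloning first: it is supported by more file systems and is
      	 * more efficient when both clone and copy are offered (e.g. NFS).
      	 */
      	if (file_in->f_op->clone_file_range) {
      		ret = file_in->f_op->clone_file_range(file_in, pos_in,
      				file_out, pos_out, len);
      		if (ret == 0) {
      			ret = len;	/* a full clone satisfies the copy */
      			goto done;
      		}
      		/* Otherwise fall back to ->copy_file_range() or splice. */
      	}

      	if (file_out->f_op->copy_file_range) {
      		ret = file_out->f_op->copy_file_range(file_in, pos_in,
      				file_out, pos_out, len, flags);
      		if (ret != -EOPNOTSUPP)
      			goto done;
      	}

      	ret = do_splice_direct(file_in, &pos_in, file_out, &pos_out,
      			len > MAX_RW_COUNT ? MAX_RW_COUNT : len, 0);
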
      Signed-off-by: Christoph Hellwig <hch@lst.de>
  27. 09 December 2016, 1 commit