1. 10 5月, 2018 1 次提交
  2. 03 5月, 2018 1 次提交
  3. 18 4月, 2018 1 次提交
    • D
      xfs: prevent creating negative-sized file via INSERT_RANGE · 7d83fb14
      Darrick J. Wong 提交于
      
      During the "insert range" fallocate operation, i_size grows by the
      specified 'len' bytes.  XFS verifies that i_size + len < s_maxbytes, as
      it should.  But this comparison is done using the signed 'loff_t', and
      'i_size + len' can wrap around to a negative value, causing the check to
      incorrectly pass, resulting in an inode with "negative" i_size.  This is
      possible on 64-bit platforms, where XFS sets s_maxbytes = LLONG_MAX.
      ext4 and f2fs don't run into this because they set a smaller s_maxbytes.
      
      Fix it by using subtraction instead.
      
      Reproducer:
          xfs_io -f file -c "truncate $(((1<<63)-1))" -c "finsert 0 4096"
      
      Fixes: a904b1ca ("xfs: Add support FALLOC_FL_INSERT_RANGE for fallocate")
      Cc: <stable@vger.kernel.org> # v4.1+
      Originally-From: Eric Biggers <ebiggers@google.com>
      Signed-off-by: NEric Biggers <ebiggers@google.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      [darrick: fix signed integer addition overflow too]
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      7d83fb14
  4. 16 3月, 2018 1 次提交
  5. 15 3月, 2018 1 次提交
  6. 08 1月, 2018 1 次提交
  7. 03 11月, 2017 3 次提交
  8. 27 10月, 2017 1 次提交
  9. 24 10月, 2017 1 次提交
  10. 12 10月, 2017 1 次提交
  11. 27 9月, 2017 1 次提交
  12. 26 9月, 2017 2 次提交
  13. 07 9月, 2017 1 次提交
    • R
      dax: use common 4k zero page for dax mmap reads · 91d25ba8
      Ross Zwisler 提交于
      When servicing mmap() reads from file holes the current DAX code
      allocates a page cache page of all zeroes and places the struct page
      pointer in the mapping->page_tree radix tree.
      
      This has three major drawbacks:
      
      1) It consumes memory unnecessarily. For every 4k page that is read via
         a DAX mmap() over a hole, we allocate a new page cache page. This
         means that if you read 1GiB worth of pages, you end up using 1GiB of
         zeroed memory. This is easily visible by looking at the overall
         memory consumption of the system or by looking at /proc/[pid]/smaps:
      
      	7f62e72b3000-7f63272b3000 rw-s 00000000 103:00 12   /root/dax/data
      	Size:            1048576 kB
      	Rss:             1048576 kB
      	Pss:             1048576 kB
      	Shared_Clean:          0 kB
      	Shared_Dirty:          0 kB
      	Private_Clean:   1048576 kB
      	Private_Dirty:         0 kB
      	Referenced:      1048576 kB
      	Anonymous:             0 kB
      	LazyFree:              0 kB
      	AnonHugePages:         0 kB
      	ShmemPmdMapped:        0 kB
      	Shared_Hugetlb:        0 kB
      	Private_Hugetlb:       0 kB
      	Swap:                  0 kB
      	SwapPss:               0 kB
      	KernelPageSize:        4 kB
      	MMUPageSize:           4 kB
      	Locked:                0 kB
      
      2) It is slower than using a common zero page because each page fault
         has more work to do. Instead of just inserting a common zero page we
         have to allocate a page cache page, zero it, and then insert it. Here
         are the average latencies of dax_load_hole() as measured by ftrace on
         a random test box:
      
          Old method, using zeroed page cache pages:	3.4 us
          New method, using the common 4k zero page:	0.8 us
      
         This was the average latency over 1 GiB of sequential reads done by
         this simple fio script:
      
           [global]
           size=1G
           filename=/root/dax/data
           fallocate=none
           [io]
           rw=read
           ioengine=mmap
      
      3) The fact that we had to check for both DAX exceptional entries and
         for page cache pages in the radix tree made the DAX code more
         complex.
      
      Solve these issues by following the lead of the DAX PMD code and using a
      common 4k zero page instead.  As with the PMD code we will now insert a
      DAX exceptional entry into the radix tree instead of a struct page
      pointer which allows us to remove all the special casing in the DAX
      code.
      
      Note that we do still pretty aggressively check for regular pages in the
      DAX radix tree, especially where we take action based on the bits set in
      the page.  If we ever find a regular page in our radix tree now that
      most likely means that someone besides DAX is inserting pages (which has
      happened lots of times in the past), and we want to find that out early
      and fail loudly.
      
      This solution also removes the extra memory consumption.  Here is that
      same /proc/[pid]/smaps after 1GiB of reading from a hole with the new
      code:
      
      	7f2054a74000-7f2094a74000 rw-s 00000000 103:00 12   /root/dax/data
      	Size:            1048576 kB
      	Rss:                   0 kB
      	Pss:                   0 kB
      	Shared_Clean:          0 kB
      	Shared_Dirty:          0 kB
      	Private_Clean:         0 kB
      	Private_Dirty:         0 kB
      	Referenced:            0 kB
      	Anonymous:             0 kB
      	LazyFree:              0 kB
      	AnonHugePages:         0 kB
      	ShmemPmdMapped:        0 kB
      	Shared_Hugetlb:        0 kB
      	Private_Hugetlb:       0 kB
      	Swap:                  0 kB
      	SwapPss:               0 kB
      	KernelPageSize:        4 kB
      	MMUPageSize:           4 kB
      	Locked:                0 kB
      
      Overall system memory consumption is similarly improved.
      
      Another major change is that we remove dax_pfn_mkwrite() from our fault
      flow, and instead rely on the page fault itself to make the PTE dirty
      and writeable.  The following description from the patch adding the
      vm_insert_mixed_mkwrite() call explains this a little more:
      
         "To be able to use the common 4k zero page in DAX we need to have our
          PTE fault path look more like our PMD fault path where a PTE entry
          can be marked as dirty and writeable as it is first inserted rather
          than waiting for a follow-up dax_pfn_mkwrite() =>
          finish_mkwrite_fault() call.
      
          Right now we can rely on having a dax_pfn_mkwrite() call because we
          can distinguish between these two cases in do_wp_page():
      
                  case 1: 4k zero page => writable DAX storage
                  case 2: read-only DAX storage => writeable DAX storage
      
          This distinction is made by via vm_normal_page(). vm_normal_page()
          returns false for the common 4k zero page, though, just as it does
          for DAX ptes. Instead of special casing the DAX + 4k zero page case
          we will simplify our DAX PTE page fault sequence so that it matches
          our DAX PMD sequence, and get rid of the dax_pfn_mkwrite() helper.
          We will instead use dax_iomap_fault() to handle write-protection
          faults.
      
          This means that insert_pfn() needs to follow the lead of
          insert_pfn_pmd() and allow us to pass in a 'mkwrite' flag. If
          'mkwrite' is set insert_pfn() will do the work that was previously
          done by wp_page_reuse() as part of the dax_pfn_mkwrite() call path"
      
      Link: http://lkml.kernel.org/r/20170724170616.25810-4-ross.zwisler@linux.intel.comSigned-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      91d25ba8
  14. 05 9月, 2017 1 次提交
  15. 02 9月, 2017 2 次提交
  16. 06 7月, 2017 1 次提交
  17. 03 7月, 2017 1 次提交
  18. 28 6月, 2017 1 次提交
  19. 21 6月, 2017 1 次提交
  20. 20 6月, 2017 1 次提交
  21. 26 5月, 2017 4 次提交
  22. 28 2月, 2017 1 次提交
  23. 25 2月, 2017 3 次提交
  24. 23 2月, 2017 2 次提交
  25. 07 2月, 2017 2 次提交
  26. 03 2月, 2017 1 次提交
  27. 31 1月, 2017 1 次提交
    • B
      xfs: sync eofblocks scans under iolock are livelock prone · c3155097
      Brian Foster 提交于
      The xfs_eofblocks.eof_scan_owner field is an internal field to
      facilitate invoking eofb scans from the kernel while under the iolock.
      This is necessary because the eofb scan acquires the iolock of each
      inode. Synchronous scans are invoked on certain buffered write failures
      while under iolock. In such cases, the scan owner indicates that the
      context for the scan already owns the particular iolock and prevents a
      double lock deadlock.
      
      eofblocks scans while under iolock are still livelock prone in the event
      of multiple parallel scans, however. If multiple buffered writes to
      different inodes fail and invoke eofblocks scans at the same time, each
      scan avoids a deadlock with its own inode by virtue of the
      eof_scan_owner field, but will never be able to acquire the iolock of
      the inode from the parallel scan. Because the low free space scans are
      invoked with SYNC_WAIT, the scan will not return until it has processed
      every tagged inode and thus both scans will spin indefinitely on the
      iolock being held across the opposite scan. This problem can be
      reproduced reliably by generic/224 on systems with higher cpu counts
      (x16).
      
      To avoid this problem, simplify the semantics of eofblocks scans to
      never invoke a scan while under iolock. This means that the buffered
      write context must drop the iolock before the scan. It must reacquire
      the lock before the write retry and also repeat the initial write
      checks, as the original state might no longer be valid once the iolock
      was dropped.
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      c3155097
  28. 10 12月, 2016 1 次提交
    • C
      fs: try to clone files first in vfs_copy_file_range · a76b5b04
      Christoph Hellwig 提交于
      A clone is a perfectly fine implementation of a file copy, so most
      file systems just implement the copy that way.  Instead of duplicating
      this logic move it to the VFS.  Currently btrfs and XFS implement copies
      the same way as clones and there is no behavior change for them, cifs
      only implements clones and grow support for copy_file_range with this
      patch.  NFS implements both, so this will allow copy_file_range to work
      on servers that only implement CLONE and be lot more efficient on servers
      that implements CLONE and COPY.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      a76b5b04
  29. 09 12月, 2016 1 次提交