1. 19 Aug 2015, 1 commit
    • xfs: flush entire file on dio read/write to cached file · 3d751af2
      Authored by Brian Foster
      Filesystems are responsible for managing file coherency between the page
      cache and direct I/O. The generic dio code flushes dirty pages over the
      range of a dio to ensure that the dio read or a future buffered read
      returns the correct data. XFS has generally followed this pattern,
      though traditionally has flushed and invalidated the range from the
      start of the I/O all the way to the end of the file. This changed after
      the following commit:
      
      	7d4ea3ce xfs: use ranged writeback and invalidation for direct IO
      
      ... as the full-file flush was no longer necessary to deal with the
      strange post-EOF delalloc issues that have since been fixed. Unfortunately,
      we have since received complaints about performance degradation due to
      the increased exclusive iolock cycles (which lock out parallel dio
      submission) that occur when a file has cached pages. This does not occur
      on filesystems that use the generic dio code, because the generic path
      does not take this locking at all.
      
      The exclusive iolock is acquired any time the inode mapping has cached
      pages, regardless of whether they reside in the range of the I/O or not.
      If they do not, the flush/invalidation calls do no work and the lock is
      cycled for no reason.
      
      Under consideration of the cost of the exclusive iolock, update the dio
      read and write handlers to flush and invalidate the entire mapping when
      cached pages exist. In most cases, this increases the cost of the
      initial flush sequence but eliminates the need for further lock cycles
      and flushes so long as the workload does not actively mix direct and
      buffered I/O. This also more closely matches historical behavior and
      performance characteristics that users have come to expect.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
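      A minimal sketch of this whole-mapping flush, using the stock page-cache
      helpers; the helper name xfs_dio_flush_mapping is illustrative, not a
      function from the patch itself:

	/*
	 * Sketch only: flush and invalidate the entire mapping ahead of a
	 * dio, but only when cached pages actually exist, so a clean
	 * mapping never pays for the flush or the lock cycle.
	 */
	static int xfs_dio_flush_mapping(struct address_space *mapping)
	{
		int ret = 0;

		if (mapping->nrpages) {
			/* write back every dirty page, start of file to EOF */
			ret = filemap_write_and_wait_range(mapping, 0, -1);
			if (!ret)
				/* toss the cached pages so later buffered reads refetch */
				ret = invalidate_inode_pages2(mapping);
		}
		return ret;
	}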
  2. 24 Jun 2015, 2 commits
  3. 04 Jun 2015, 5 commits
  4. 02 Jun 2015, 1 commit
    • writeback: separate out include/linux/backing-dev-defs.h · 66114cad
      Authored by Tejun Heo
      With the planned cgroup writeback support, backing-dev related
      declarations will be more widely used across block and cgroup;
      unfortunately, including backing-dev.h from include/linux/blkdev.h
      makes cyclic include dependency quite likely.
      
      This patch separates out backing-dev-defs.h which only has the
      essential definitions and updates blkdev.h to include it.  C files
      which need access to more backing-dev details now include
      backing-dev.h directly.  This takes backing-dev.h off the common
      include dependency chain making it a lot easier to use it across block
      and cgroup.
      
      v2: fs/fat build failure fixed.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Jens Axboe <axboe@fb.com>
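      The split is a standard way to break an include cycle; a generic sketch
      with illustrative names (foo-defs.h/foo.h stand in for
      backing-dev-defs.h/backing-dev.h):

	/* foo-defs.h: only the bare definitions, no heavy includes */
	#ifndef FOO_DEFS_H
	#define FOO_DEFS_H

	struct foo_info {
		unsigned long	state;
	};

	#endif /* FOO_DEFS_H */

	/* foo.h: the full API; builds on the defs header */
	#include "foo-defs.h"
	void foo_register(struct foo_info *info);

	/*
	 * blkdev.h-style headers include only foo-defs.h, getting the
	 * struct definition without dragging the full foo.h (and whatever
	 * it includes) onto every common include chain.
	 */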
  5. 29 May 2015, 1 commit
  6. 16 Apr 2015, 3 commits
    • xfs: using generic_file_direct_write() is unnecessary · 0cefb29e
      Authored by Dave Chinner
      generic_file_direct_write() does all sorts of things to make DIO
      work "sorta ok" with mixed buffered IO workloads. We already do
      most of this work in xfs_file_aio_dio_write() because of the locking
      requirements, so there are only a couple of things it does for us.
      
      The first thing is that it does a page cache invalidation after the
      ->direct_IO callout. This can easily be added to the XFS code.
      
      The second thing it does is that if data was written, it updates the
      iov_iter structure to reflect the data written, and then does EOF
      size updates if necessary. For XFS, these EOF size updates are now
      not necessary, as we do them safely and race-free in IO completion
      context. That leaves just the iov_iter update, and that's also moved
      to the XFS code.
      
      Therefore we don't need to call generic_file_direct_write() and in
      doing so remove redundant buffered writeback and page cache
      invalidation calls from the DIO submission path. We also remove a
      racy EOF size update, and make the DIO submission code in XFS much
      easier to follow. Wins all round, really.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
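      What moves into XFS is roughly the tail of generic_file_direct_write();
      a sketch of the two pieces (variable names are illustrative; the
      page-cache and iov_iter calls are the standard kernel ones, spelled
      PAGE_CACHE_SHIFT rather than PAGE_SHIFT in kernels of this era):

	/* after ->direct_IO returns 'ret' bytes written at offset 'pos' */
	if (ret > 0) {
		/* 1. invalidate the page-cache pages the dio just wrote over */
		invalidate_inode_pages2_range(mapping,
				pos >> PAGE_SHIFT,
				(pos + ret - 1) >> PAGE_SHIFT);

		/* 2. advance the iov_iter past the data written */
		iov_iter_advance(from, ret);
	}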
    • xfs: direct IO EOF zeroing needs to drain AIO · 40c63fbc
      Authored by Dave Chinner
      When we are doing AIO DIO writes, the IOLOCK only provides an IO
      submission barrier. When we need to do EOF zeroing, we need to ensure
      that no other IO is in progress and all pending in-core EOF updates
      have been completed. This requires us to wait for all outstanding
      AIO DIO writes to the inode to complete and, if necessary, run their
      EOF updates.
      
      Once all the EOF updates are complete, we can then restart
      xfs_file_aio_write_checks() while holding the IOLOCK_EXCL, knowing
      that EOF is up to date and we have exclusive IO access to the file
      so we can run EOF block zeroing if we need to without interference.
      This gives EOF zeroing the same exclusivity against other IO as we
      provide truncate operations.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
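      A sketch of the drain, assuming this era's xfs_rw_ilock()/xfs_rw_iunlock()
      wrappers and an illustrative 'zeroing_needed' flag; inode_dio_wait() is
      the standard kernel primitive for waiting out in-flight direct IO:

	/*
	 * Sketch: EOF zeroing is needed, so upgrade to the exclusive
	 * iolock and drain all outstanding AIO DIO before re-checking.
	 */
	if (zeroing_needed && *iolock == XFS_IOLOCK_SHARED) {
		xfs_rw_iunlock(ip, *iolock);
		*iolock = XFS_IOLOCK_EXCL;
		xfs_rw_ilock(ip, *iolock);

		/* wait for AIO DIO to complete and run its EOF updates */
		inode_dio_wait(VFS_I(ip));
		goto restart;	/* re-run xfs_file_aio_write_checks() */
	}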
    • xfs: DIO write completion size updates race · b9d59846
      Authored by Dave Chinner
      xfs_end_io_direct_write() can race with other IO completions when
      updating the in-core inode size. IO completion processing is not
      serialised for direct IO - it is done under the IOLOCK_SHARED for
      non-AIO DIO, and without any IOLOCK held at all during AIO DIO
      completion. Hence the non-atomic test-and-set update of the in-core
      inode size is racy and can result in the in-core inode size going
      backwards if the race is hit just right.
      
      If the inode size goes backwards, this can trigger the EOF zeroing
      code to run incorrectly on the next IO, which then will zero data
      that has successfully been written to disk by a previous DIO.
      
      To fix this bug, we need to serialise the test/set updates of the
      in-core inode size. This first patch introduces locking around the
      relevant updates and checks in the DIO path. Because we now have an
      ioend in xfs_end_io_direct_write(), we know exactly when we are
      doing an IO that requires an in-core EOF update, and we know that
      it is not running in interrupt context. As such, we do not need to
      use irqsave() spinlock variants to protect against interrupts while
      the lock is held.
      
      Hence we can use an existing spinlock in the inode to do this
      serialisation and so not need to grow the struct xfs_inode just to
      work around this problem.
      
      This patch does not address the test/set EOF update in
      generic_file_write_direct() for various reasons - that will be done
      as a followup with separate explanation.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
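      The serialisation is a short critical section in the completion path;
      a sketch using an existing inode spinlock as the commit describes
      (i_flags_lock is one plausible candidate for "an existing spinlock
      in the inode"):

	/* dio completion: 'offset' + 'size' describe the IO just finished */
	spin_lock(&ip->i_flags_lock);	/* plain spin_lock: never in irq context */
	if (offset + size > i_size_read(inode))
		i_size_write(inode, offset + size);
	spin_unlock(&ip->i_flags_lock);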
  7. 13 Apr 2015, 1 commit
  8. 12 Apr 2015, 5 commits
  9. 26 Mar 2015, 1 commit
  10. 25 Mar 2015, 1 commit
  11. 23 Feb 2015, 4 commits
    • xfs: ensure truncate forces zeroed blocks to disk · 5885ebda
      Authored by Dave Chinner
      A new fsync vs power fail test in xfstests indicated that XFS can
      have unreliable data consistency when doing extending truncates that
      require block zeroing. The blocks beyond EOF get zeroed in memory,
      but we never force those changes to disk before we run the
      transaction that extends the file size and exposes those blocks to
      userspace. This can result in the blocks not being correctly zeroed
      after a crash.
      
      Because in-memory behaviour is correct, tools like fsx don't pick up
      any coherency problems - it's not until the filesystem is shutdown
      or the system crashes after writing the truncate transaction to the
      journal but before the zeroed data in the page cache is flushed that
      the issue is exposed.
      
      Fix this by also flushing the dirty in-memory data over the region
      between the old size and the new size when we've found blocks that
      need zeroing in the truncate process.
      Reported-by: Liu Bo <bo.li.liu@oracle.com>
      cc: <stable@vger.kernel.org>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
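      A sketch of the fix in the truncate path (did_zeroing, oldsize and
      newsize are illustrative locals; the writeback call is the standard
      kernel one):

	/*
	 * Blocks beyond the old EOF were zeroed only in the page cache.
	 * Flush them before the transaction that exposes the new size,
	 * so a crash cannot reveal stale data in the zeroed region.
	 */
	if (did_zeroing) {
		error = filemap_write_and_wait_range(VFS_I(ip)->i_mapping,
						     oldsize, newsize - 1);
		if (error)
			return error;
	}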
    • xfs: take i_mmap_lock on extent manipulation operations · e8e9ad42
      Authored by Dave Chinner
      Now that the i_mmaplock is held across the page fault IO path, we
      add exclusion for extent manipulation operations by taking the lock
      in the paths that directly modify extent maps. This
      includes truncate, hole punching and other fallocate based
      operations. The operations will now take both the i_iolock and the
      i_mmaplock in exclusive mode, thereby ensuring that all IO and page
      faults block without holding any page locks while the extent
      manipulation is in progress.
      
      This gives us a lock order during truncate of i_iolock ->
      i_mmaplock -> page_lock -> i_lock, providing the same ordering
      guarantees for extent manipulation that the iolock provides for
      the normal IO path, without involving the mmap_sem.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
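      The extent-manipulating paths then take both locks up front; a sketch
      of the entry to a truncate/hole-punch/fallocate operation (lock flag
      names as used by XFS in this series):

	/* exclude both new IO and new page faults for the duration */
	xfs_ilock(ip, XFS_IOLOCK_EXCL);
	xfs_ilock(ip, XFS_MMAPLOCK_EXCL);

	/* ... modify the extent map: truncate, punch hole, fallocate ... */

	xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);
	xfs_iunlock(ip, XFS_IOLOCK_EXCL);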
    • xfs: use i_mmaplock on write faults · 075a924d
      Authored by Dave Chinner
      Take the i_mmaplock over write page faults. These come through the
      ->page_mkwrite callout, so we need to wrap that callout with the
      i_mmaplock.
      
      This gives us a lock order of mmap_sem -> i_mmaplock -> page_lock
      -> i_lock.
      
      Also, move the page_mkwrite wrapper to the same region of xfs_file.c
      as the read fault wrappers and add a tracepoint.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
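      The wrapper is small; a sketch in the shape of this era's page_mkwrite
      handlers (block_page_mkwrite() and block_page_mkwrite_return() are the
      generic kernel routines; the tracepoint name is illustrative):

	STATIC int
	xfs_filemap_page_mkwrite(
		struct vm_area_struct	*vma,
		struct vm_fault		*vmf)
	{
		struct xfs_inode	*ip = XFS_I(file_inode(vma->vm_file));
		int			ret;

		trace_xfs_filemap_page_mkwrite(ip);

		xfs_ilock(ip, XFS_MMAPLOCK_SHARED);
		ret = block_page_mkwrite(vma, vmf, xfs_get_blocks);
		xfs_iunlock(ip, XFS_MMAPLOCK_SHARED);

		return block_page_mkwrite_return(ret);
	}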
    • xfs: use i_mmaplock on read faults · de0e8c20
      Authored by Dave Chinner
      Take the i_mmaplock over read page faults. These come through the
      ->fault callout, so we need to wrap the generic implementation
      with the i_mmaplock. While there, add tracepoints for the read
      fault as it passes through XFS.
      
      This gives us a lock order of mmap_sem -> i_mmaplock -> page_lock
      -> i_lock.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
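      The read-fault side is the same pattern wrapped around the generic
      handler; a sketch (filemap_fault() is the stock VM fault routine of
      this era; the tracepoint name is illustrative):

	STATIC int
	xfs_filemap_fault(
		struct vm_area_struct	*vma,
		struct vm_fault		*vmf)
	{
		struct xfs_inode	*ip = XFS_I(file_inode(vma->vm_file));
		int			ret;

		trace_xfs_filemap_fault(ip);

		xfs_ilock(ip, XFS_MMAPLOCK_SHARED);
		ret = filemap_fault(vma, vmf);	/* generic read fault path */
		xfs_iunlock(ip, XFS_MMAPLOCK_SHARED);

		return ret;
	}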
  12. 16 Feb 2015, 1 commit
  13. 11 Feb 2015, 1 commit
  14. 02 Feb 2015, 1 commit
  15. 21 Jan 2015, 1 commit
  16. 01 Dec 2014, 1 commit
  17. 28 Nov 2014, 3 commits
  18. 09 Sep 2014, 2 commits
  19. 02 Sep 2014, 3 commits
  20. 04 Aug 2014, 1 commit
    • xfs: kill xfs_vnode.h · b92cc59f
      Authored by Dave Chinner
      Move the IO flag definitions to xfs_inode.h and kill the header file
      as it is now empty.
      
      Removing the xfs_vnode.h file showed up an implicit header include
      path:
      	xfs_linux.h -> xfs_vnode.h -> xfs_fs.h
      
      And so every xfs header file has implicitly been including xfs_fs.h
      whether it is needed or not. Hence the removal of xfs_vnode.h causes
      all sorts of build issues because BBTOB() and friends are no longer
      automatically included in the build. This also gets fixed.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
  21. 24 Jul 2014, 1 commit
    • xfs: run an eofblocks scan on ENOSPC/EDQUOT · dc06f398
      Authored by Brian Foster
      From: Brian Foster <bfoster@redhat.com>
      
      Speculative preallocation and the associated throttling metrics
      assume we're working with large files on large filesystems. Users have
      reported inefficiencies in these mechanisms when we happen to be dealing
      with large files on smaller filesystems. This can occur because while
      prealloc throttling is aggressive under low free space conditions, it is
      not active until we reach 5% free space or less.
      
      For example, a 40GB filesystem has enough space for several files large
      enough to have multi-GB preallocations at any given time. If those files
      are slow growing, they might reserve preallocation for long periods of
      time as well as avoid the background scanner due to frequent
      modification. If a new file is written under these conditions, said file
      has no access to this already reserved space and premature ENOSPC is
      imminent.
      
      To handle this scenario, modify the buffered write ENOSPC handling and
      retry sequence to invoke an eofblocks scan. In the smaller filesystem
      scenario, the eofblocks scan resets the usage of preallocation such that
      when the 5% free space threshold is met, throttling effectively takes
      over to provide fair and efficient preallocation until legitimate
      ENOSPC.
      
      The eofblocks scan is selective based on the nature of the failure. For
      example, an EDQUOT failure in a particular quota will use a filtered
      scan for that quota. Because we don't know which quota might have caused
      an allocation failure at any given time, we include each applicable
      quota determined to be under low free space conditions in the scan.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
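      A sketch of the retry logic in the buffered write path, using the
      helper names from this series (xfs_inode_free_quota_eofblocks() and
      xfs_icache_free_eofblocks()); the 'enospc' local is illustrative and
      guards against retrying forever:

	/* buffered write failed; run one selective scan, then retry */
	if (ret == -EDQUOT && !enospc) {
		/* filtered scan: only inodes under the failing quotas */
		enospc = xfs_inode_free_quota_eofblocks(ip);
		if (enospc)
			goto write_retry;
	} else if (ret == -ENOSPC && !enospc) {
		struct xfs_eofblocks eofb = { 0 };

		enospc = 1;
		eofb.eof_flags = XFS_EOF_FLAGS_SYNC;
		/* trim speculative preallocation across the filesystem */
		xfs_icache_free_eofblocks(ip->i_mount, &eofb);
		goto write_retry;
	}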