1. 03 11月, 2015 3 次提交
    • D
      xfs: add ->pfn_mkwrite support for DAX · 3af49285
      Dave Chinner 提交于
      ->pfn_mkwrite support is needed so that when a page with allocated
      backing store takes a write fault we can check that the fault has
      not raced with a truncate and is pointing to a region beyond the
      current end of file.
      
      This also allows us to update the timestamp on the inode, too, which
      fixes a generic/080 failure.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      3af49285
    • D
      xfs: DAX does not use IO completion callbacks · 01a155e6
      Dave Chinner 提交于
      For DAX, we are now doing block zeroing during allocation. This
      means we no longer need a special DAX fault IO completion callback
      to do unwritten extent conversion. Because mmap never extends the
      file size (it SEGVs the process) we don't need a callback to update
      the file size, either. Hence we can remove the completion callbacks
      from the __dax_fault and __dax_mkwrite calls.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      01a155e6
    • D
      xfs: fix inode size update overflow in xfs_map_direct() · 3e12dbbd
      Dave Chinner 提交于
      Both direct IO and DAX pass an offset and count into get_blocks that
      will overflow a s64 variable when an IO goes into the last supported
      block in a file (i.e. at offset 2^63 - 1FSB bytes). This can be seen
      from the tracing:
      
      xfs_get_blocks_alloc: [...] offset 0x7ffffffffffff000 count 4096
      xfs_gbmap_direct:     [...] offset 0x7ffffffffffff000 count 4096
      xfs_gbmap_direct_none:[...] offset 0x7ffffffffffff000 count 4096
      
      0x7ffffffffffff000 + 4096 = 0x8000000000000000, and hence that
      overflows the s64 offset and we fail to detect the need for a
      filesize update and an ioend is not allocated.
      
      This is *mostly* avoided for direct IO because such extending IOs
      occur with full block allocation, and so the "IS_UNWRITTEN()" check
      still evaluates as true and we get an ioend that way. However, doing
      single sector extending IOs to this last block will expose the fact
      that file size updates will not occur after the first allocating
      direct IO as the overflow will then be exposed.
      
      There is one further complexity: the DAX page fault path also
      exposes the same issue in block allocation. However, page faults
      cannot extend the file size, so in this case we want to allocate the
      block but do not want to allocate an ioend to enable file size
      update at IO completion. Hence we now need to distinguish between
      the direct IO patch allocation and dax fault path allocation to
      avoid leaking ioend structures.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      3e12dbbd
  2. 09 9月, 2015 1 次提交
  3. 19 8月, 2015 1 次提交
    • B
      xfs: flush entire file on dio read/write to cached file · 3d751af2
      Brian Foster 提交于
      Filesystems are responsible to manage file coherency between the page
      cache and direct I/O. The generic dio code flushes dirty pages over the
      range of a dio to ensure that the dio read or a future buffered read
      returns the correct data. XFS has generally followed this pattern,
      though traditionally has flushed and invalidated the range from the
      start of the I/O all the way to the end of the file. This changed after
      the following commit:
      
      	7d4ea3ce xfs: use ranged writeback and invalidation for direct IO
      
      ... as the full file flush was no longer necessary to deal with the
      strange post-eof delalloc issues that were since fixed. Unfortunately,
      we have since received complaints about performance degradation due to
      the increased exclusive iolock cycles (which locks out parallel dio
      submission) that occur when a file has cached pages. This does not occur
      on filesystems that use the generic code as it also does not incorporate
      locking.
      
      The exclusive iolock is acquired any time the inode mapping has cached
      pages, regardless of whether they reside in the range of the I/O or not.
      If not, the flush/inval calls do no work and the lock was cycled for no
      reason.
      
      Under consideration of the cost of the exclusive iolock, update the dio
      read and write handlers to flush and invalidate the entire mapping when
      cached pages exist. In most cases, this increases the cost of the
      initial flush sequence but eliminates the need for further lock cycles
      and flushes so long as the workload does not actively mix direct and
      buffered I/O. This also more closely matches historical behavior and
      performance characteristics that users have come to expect.
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      3d751af2
  4. 29 7月, 2015 1 次提交
    • D
      xfs: call dax_fault on read page faults for DAX · b2442c5a
      Dave Chinner 提交于
      When modifying the patch series to handle the XFS MMAP_LOCK nesting
      of page faults, I botched the conversion of the read page fault
      path, and so it is only every calling through the page cache. Re-add
      the necessary __dax_fault() call for such files.
      
      Because the get_blocks callback on read faults may not set up the
      mapping buffer correctly to allow unwritten extent completion to be
      run, we need to allow callers of __dax_fault() to pass a null
      complete_unwritten() callback. The DAX code always zeros the
      unwritten page when it is read faulted so there are no stale data
      exposure issues with not doing the conversion. The only downside
      will be the potential for increased CPU overhead on repeated read
      faults of the same page. If this proves to be a problem, then the
      filesystem needs to fix it's get_block callback and provide a
      convert_unwritten() callback to the read fault path.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NMatthew Wilcox <willy@linux.intel.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      b2442c5a
  5. 24 6月, 2015 2 次提交
  6. 04 6月, 2015 5 次提交
  7. 02 6月, 2015 1 次提交
    • T
      writeback: separate out include/linux/backing-dev-defs.h · 66114cad
      Tejun Heo 提交于
      With the planned cgroup writeback support, backing-dev related
      declarations will be more widely used across block and cgroup;
      unfortunately, including backing-dev.h from include/linux/blkdev.h
      makes cyclic include dependency quite likely.
      
      This patch separates out backing-dev-defs.h which only has the
      essential definitions and updates blkdev.h to include it.  c files
      which need access to more backing-dev details now include
      backing-dev.h directly.  This takes backing-dev.h off the common
      include dependency chain making it a lot easier to use it across block
      and cgroup.
      
      v2: fs/fat build failure fixed.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      66114cad
  8. 29 5月, 2015 1 次提交
  9. 16 4月, 2015 3 次提交
    • D
      xfs: using generic_file_direct_write() is unnecessary · 0cefb29e
      Dave Chinner 提交于
      generic_file_direct_write() does all sorts of things to make DIO
      work "sorta ok" with mixed buffered IO workloads. We already do
      most of this work in xfs_file_aio_dio_write() because of the locking
      requirements, so there's only a couple of things it does for us.
      
      The first thing is that it does a page cache invalidation after the
      ->direct_IO callout. This can easily be added to the XFS code.
      
      The second thing it does is that if data was written, it updates the
      iov_iter structure to reflect the data written, and then does EOF
      size updates if necessary. For XFS, these EOF size updates are now
      not necessary, as we do them safely and race-free in IO completion
      context. That leaves just the iov_iter update, and that's also moved
      to the XFS code.
      
      Therefore we don't need to call generic_file_direct_write() and in
      doing so remove redundant buffered writeback and page cache
      invalidation calls from the DIO submission path. We also remove a
      racy EOF size update, and make the DIO submission code in XFS much
      easier to follow. Wins all round, really.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      0cefb29e
    • D
      xfs: direct IO EOF zeroing needs to drain AIO · 40c63fbc
      Dave Chinner 提交于
      When we are doing AIO DIO writes, the IOLOCK only provides an IO
      submission barrier. When we need to do EOF zeroing, we need to ensure
      that no other IO is in progress and all pending in-core EOF updates
      have been completed. This requires us to wait for all outstanding
      AIO DIO writes to the inode to complete and, if necessary, run their
      EOF updates.
      
      Once all the EOF updates are complete, we can then restart
      xfs_file_aio_write_checks() while holding the IOLOCK_EXCL, knowing
      that EOF is up to date and we have exclusive IO access to the file
      so we can run EOF block zeroing if we need to without interference.
      This gives EOF zeroing the same exclusivity against other IO as we
      provide truncate operations.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      40c63fbc
    • D
      xfs: DIO write completion size updates race · b9d59846
      Dave Chinner 提交于
      xfs_end_io_direct_write() can race with other IO completions when
      updating the in-core inode size. The IO completion processing is not
      serialised for direct IO - they are done either under the
      IOLOCK_SHARED for non-AIO DIO, and without any IOLOCK held at all
      during AIO DIO completion. Hence the non-atomic test-and-set update
      of the in-core inode size is racy and can result in the in-core
      inode size going backwards if the race if hit just right.
      
      If the inode size goes backwards, this can trigger the EOF zeroing
      code to run incorrectly on the next IO, which then will zero data
      that has successfully been written to disk by a previous DIO.
      
      To fix this bug, we need to serialise the test/set updates of the
      in-core inode size. This first patch introduces locking around the
      relevant updates and checks in the DIO path. Because we now have an
      ioend in xfs_end_io_direct_write(), we know exactly then we are
      doing an IO that requires an in-core EOF update, and we know that
      they are not running in interrupt context. As such, we do not need to
      use irqsave() spinlock variants to protect against interrupts while
      the lock is held.
      
      Hence we can use an existing spinlock in the inode to do this
      serialisation and so not need to grow the struct xfs_inode just to
      work around this problem.
      
      This patch does not address the test/set EOF update in
      generic_file_write_direct() for various reasons - that will be done
      as a followup with separate explanation.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      b9d59846
  10. 13 4月, 2015 1 次提交
  11. 12 4月, 2015 5 次提交
  12. 26 3月, 2015 1 次提交
  13. 25 3月, 2015 1 次提交
  14. 23 2月, 2015 4 次提交
    • D
      xfs: ensure truncate forces zeroed blocks to disk · 5885ebda
      Dave Chinner 提交于
      A new fsync vs power fail test in xfstests indicated that XFS can
      have unreliable data consistency when doing extending truncates that
      require block zeroing. The blocks beyond EOF get zeroed in memory,
      but we never force those changes to disk before we run the
      transaction that extends the file size and exposes those blocks to
      userspace. This can result in the blocks not being correctly zeroed
      after a crash.
      
      Because in-memory behaviour is correct, tools like fsx don't pick up
      any coherency problems - it's not until the filesystem is shutdown
      or the system crashes after writing the truncate transaction to the
      journal but before the zeroed data in the page cache is flushed that
      the issue is exposed.
      
      Fix this by also flushing the dirty data in memory region between
      the old size and new size when we've found blocks that need zeroing
      in the truncate process.
      Reported-by: NLiu Bo <bo.li.liu@oracle.com>
      cc: <stable@vger.kernel.org>
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      5885ebda
    • D
      xfs: take i_mmap_lock on extent manipulation operations · e8e9ad42
      Dave Chinner 提交于
      Now we have the i_mmap_lock being held across the page fault IO
      path, we now add extent manipulation operation exclusion by adding
      the lock to the paths that directly modify extent maps. This
      includes truncate, hole punching and other fallocate based
      operations. The operations will now take both the i_iolock and the
      i_mmaplock in exclusive mode, thereby ensuring that all IO and page
      faults block without holding any page locks while the extent
      manipulation is in progress.
      
      This gives us the lock order during truncate of i_iolock ->
      i_mmaplock -> page_lock -> i_lock, hence providing the same
      lock order as the iolock provides the normal IO path without
      involving the mmap_sem.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      e8e9ad42
    • D
      xfs: use i_mmaplock on write faults · 075a924d
      Dave Chinner 提交于
      Take the i_mmaplock over write page faults. These come through the
      ->page_mkwrite callout, so we need to wrap that calls with the
      i_mmaplock.
      
      This gives us a lock order of mmap_sem -> i_mmaplock -> page_lock
      -> i_lock.
      
      Also, move the page_mkwrite wrapper to the same region of xfs_file.c
      as the read fault wrappers and add a tracepoint.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      075a924d
    • D
      xfs: use i_mmaplock on read faults · de0e8c20
      Dave Chinner 提交于
      Take the i_mmaplock over read page faults. These come through the
      ->fault callout, so we need to wrap the generic implementation
      with the i_mmaplock. While there, add tracepoints for the read
      fault as it passes through XFS.
      
      This gives us a lock order of mmap_sem -> i_mmaplock -> page_lock
      -> i_lock.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      de0e8c20
  15. 16 2月, 2015 1 次提交
  16. 11 2月, 2015 1 次提交
  17. 02 2月, 2015 1 次提交
  18. 21 1月, 2015 1 次提交
  19. 01 12月, 2014 1 次提交
  20. 28 11月, 2014 3 次提交
  21. 09 9月, 2014 2 次提交