1. 24 10月, 2013 1 次提交
    • D
      xfs: create a shared header file for format-related information · 70a9883c
      Dave Chinner 提交于
      All of the buffer operations structures are needed to be exported
      for xfs_db, so move them all to a common location rather than
      spreading them all over the place. They are verifying the on-disk
      format, so while xfs_format.h might be a good place, it is not part
      of the on disk format.
      
      Hence we need to create a new header file that we centralise these
      related definitions. Start by moving the bffer operations
      structures, and then also move all the other definitions that have
      crept into xfs_log_format.h and xfs_format.h as there was no other
      shared header file to put them in.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      70a9883c
  2. 18 10月, 2013 1 次提交
  3. 02 10月, 2013 1 次提交
  4. 13 8月, 2013 4 次提交
  5. 28 6月, 2013 1 次提交
    • D
      xfs: don't use speculative prealloc for small files · 133eeb17
      Dave Chinner 提交于
      Dedicated small file workloads have been seeing significant free
      space fragmentation causing premature inode allocation failure
      when large inode sizes are in use. A particular test case showed
      that a workload that runs to a real ENOSPC on 256 byte inodes would
      fail inode allocation with ENOSPC about about 80% full with 512 byte
      inodes, and at about 50% full with 1024 byte inodes.
      
      The same workload, when run with -o allocsize=4096 on 1024 byte
      inodes would run to being 100% full before giving ENOSPC. That is,
      no freespace fragmentation at all.
      
      The issue was caused by the specific IO pattern the application had
      - the framework it was using did not support direct IO, and so it
      was emulating it by using fadvise(DONT_NEED). The result was that
      the data was getting written back before the speculative prealloc
      had been trimmed from memory by the close(), and so small single
      block files were being allocated with 2 blocks, and then having one
      truncated away. The result was lots of small 4k free space extents,
      and hence each new 8k allocation would take another 8k from
      contiguous free space and turn it into 4k of allocated space and 4k
      of free space.
      
      Hence inode allocation, which requires contiguous, aligned
      allocation of 16k (256 byte inodes), 32k (512 byte inodes) or 64k
      (1024 byte inodes) can fail to find sufficiently large freespace and
      hence fail while there is still lots of free space available.
      
      There's a simple fix for this, and one that has precendence in the
      allocator code already - don't do speculative allocation unless the
      size of the file is larger than a certain size. In this case, that
      size is the minimum default preallocation size:
      mp->m_writeio_blocks. And to keep with the concept of being nice to
      people when the files are still relatively small, cap the prealloc
      to mp->m_writeio_blocks until the file goes over a stripe unit is
      size, at which point we'll fall back to the current behaviour based
      on the last extent size.
      
      This will effectively turn off speculative prealloc for very small
      files, keep preallocation low for small files, and behave as it
      currently does for any file larger than a stripe unit. This
      completely avoids the freespace fragmentation problem this
      particular IO pattern was causing.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NMark Tinguely <tinguely@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      133eeb17
  6. 23 3月, 2013 4 次提交
  7. 19 3月, 2013 2 次提交
  8. 08 3月, 2013 3 次提交
  9. 15 2月, 2013 1 次提交
    • D
      xfs: limit speculative prealloc size on sparse files · a1e16c26
      Dave Chinner 提交于
      Speculative preallocation based on the current file size works well
      for contiguous files, but is sub-optimal for sparse files where the
      EOF preallocation can fill holes and result in large amounts of
      zeros being written when it is not necessary.
      
      The algorithm is modified to prevent EOF speculative preallocation
      from triggering larger allocations on IO patterns of
      truncate--to-zero-seek-write-seek-write-....  which results in
      non-sparse files for large files. This, unfortunately, is the way cp
      now behaves when copying sparse files and so needs to be fixed.
      
      What this code does is that it looks at the existing extent adjacent
      to the current EOF and if it determines that it is a hole we disable
      speculative preallocation altogether. To avoid the next write from
      doing a large prealloc, it takes the size of subsequent
      preallocations from the current size of the existing EOF extent.
      IOWs, if you leave a hole in the file, it resets preallocation
      behaviour to the same as if it was a zero size file.
      
      Example new behaviour:
      
      $ xfs_io -f -c "pwrite 0 31m" \
                  -c "pwrite 33m 1m" \
                  -c "pwrite 128m 1m" \
                  -c "fiemap -v" /mnt/scratch/blah
      wrote 32505856/32505856 bytes at offset 0
      31 MiB, 7936 ops; 0.0000 sec (1.608 GiB/sec and 421432.7439 ops/sec)
      wrote 1048576/1048576 bytes at offset 34603008
      1 MiB, 256 ops; 0.0000 sec (1.462 GiB/sec and 383233.5329 ops/sec)
      wrote 1048576/1048576 bytes at offset 134217728
      1 MiB, 256 ops; 0.0000 sec (1.719 GiB/sec and 450704.2254 ops/sec)
      /mnt/scratch/blah:
       EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
         0: [0..65535]:      96..65631        65536   0x0
         1: [65536..67583]:  hole              2048
         2: [67584..69631]:  67680..69727      2048   0x0
         3: [69632..262143]: hole             192512
         4: [262144..264191]: 262240..264287    2048   0x1
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NMark Tinguely <tinguely@sgi.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      a1e16c26
  10. 29 1月, 2013 1 次提交
    • D
      xfs: limit speculative prealloc near ENOSPC thresholds · f2a45956
      Dave Chinner 提交于
      There is a window on small filesytsems where specualtive
      preallocation can be larger than that ENOSPC throttling thresholds,
      resulting in specualtive preallocation trying to reserve more space
      than there is space available. This causes immediate ENOSPC to be
      triggered, prealloc to be turned off and flushing to occur. One the
      next write (i.e. next 4k page), we do exactly the same thing, and so
      effective drive into synchronous 4k writes by triggering ENOSPC
      flushing on every page while in the window between the prealloc size
      and the ENOSPC prealloc throttle threshold.
      
      Fix this by checking to see if the prealloc size would consume all
      free space, and throttle it appropriately to avoid premature
      ENOSPC...
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      f2a45956
  11. 25 1月, 2013 1 次提交
    • D
      xfs: limit speculative prealloc near ENOSPC thresholds · 4d559a3b
      Dave Chinner 提交于
      There is a window on small filesytsems where specualtive
      preallocation can be larger than that ENOSPC throttling thresholds,
      resulting in specualtive preallocation trying to reserve more space
      than there is space available. This causes immediate ENOSPC to be
      triggered, prealloc to be turned off and flushing to occur. One the
      next write (i.e. next 4k page), we do exactly the same thing, and so
      effective drive into synchronous 4k writes by triggering ENOSPC
      flushing on every page while in the window between the prealloc size
      and the ENOSPC prealloc throttle threshold.
      
      Fix this by checking to see if the prealloc size would consume all
      free space, and throttle it appropriately to avoid premature
      ENOSPC...
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      4d559a3b
  12. 09 11月, 2012 2 次提交
    • B
      xfs: add EOFBLOCKS inode tagging/untagging · 27b52867
      Brian Foster 提交于
      Add the XFS_ICI_EOFBLOCKS_TAG inode tag to identify inodes with
      speculatively preallocated blocks beyond EOF. An inode is tagged
      when speculative preallocation occurs and untagged either via
      truncate down or when post-EOF blocks are freed via release or
      reclaim.
      
      The tag management is intentionally not aggressive to prefer
      simplicity over the complexity of handling all the corner cases
      under which post-EOF blocks could be freed (i.e., forward
      truncation, fallocate, write error conditions, etc.). This means
      that a tagged inode may or may not have post-EOF blocks after a
      period of time. The tag is eventually cleared when the inode is
      released or reclaimed.
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NMark Tinguely <tinguely@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      27b52867
    • D
      xfs: introduce XFS_BMAPI_STACK_SWITCH · 326c0355
      Dave Chinner 提交于
      Certain allocation paths through xfs_bmapi_write() are in situations
      where we have limited stack available. These are almost always in
      the buffered IO writeback path when convertion delayed allocation
      extents to real extents.
      
      The current stack switch occurs for userdata allocations, which
      means we also do stack switches for preallocation, direct IO and
      unwritten extent conversion, even those these call chains have never
      been implicated in a stack overrun.
      
      Hence, let's target just the single stack overun offended for stack
      switches. To do that, introduce a XFS_BMAPI_STACK_SWITCH flag that
      the caller can pass xfs_bmapi_write() to indicate it should switch
      stacks if it needs to do allocation.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NMark Tinguely <tinguely@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      326c0355
  13. 19 10月, 2012 1 次提交
    • D
      xfs: introduce XFS_BMAPI_STACK_SWITCH · 2455881c
      Dave Chinner 提交于
      Certain allocation paths through xfs_bmapi_write() are in situations
      where we have limited stack available. These are almost always in
      the buffered IO writeback path when convertion delayed allocation
      extents to real extents.
      
      The current stack switch occurs for userdata allocations, which
      means we also do stack switches for preallocation, direct IO and
      unwritten extent conversion, even those these call chains have never
      been implicated in a stack overrun.
      
      Hence, let's target just the single stack overun offended for stack
      switches. To do that, introduce a XFS_BMAPI_STACK_SWITCH flag that
      the caller can pass xfs_bmapi_write() to indicate it should switch
      stacks if it needs to do allocation.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NMark Tinguely <tinguely@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      2455881c
  14. 18 10月, 2012 1 次提交
    • D
      xfs: xfs_sync_data is redundant. · 9aa05000
      Dave Chinner 提交于
      We don't do any data writeback from XFS any more - the VFS is
      completely responsible for that, including for freeze. We can
      replace the remaining caller with a VFS level function that
      achieves the same thing, but without conflicting with current
      writeback work.
      
      This means we can remove the flush_work and xfs_flush_inodes() - the
      VFS functionality completely replaces the internal flush queue for
      doing this writeback work in a separate context to avoid stack
      overruns.
      
      This does have one complication - it cannot be called with page
      locks held.  Hence move the flushing of delalloc space when ENOSPC
      occurs back up into xfs_file_aio_buffered_write when we don't hold
      any locks that will stall writeback.
      
      Unfortunately, writeback_inodes_sb_if_idle() is not sufficient to
      trigger delalloc conversion fast enough to prevent spurious ENOSPC
      whent here are hundreds of writers, thousands of small files and GBs
      of free RAM.  Hence we need to use sync_sb_inodes() to block callers
      while we wait for writeback like the previous xfs_flush_inodes
      implementation did.
      
      That means we have to hold the s_umount lock here, but because this
      call can nest inside i_mutex (the parent directory in the create
      case, held by the VFS), we have to use down_read_trylock() to avoid
      potential deadlocks. In practice, this trylock will succeed on
      almost every attempt as unmount/remount type operations are
      exceedingly rare.
      
      Note: we always need to pass a count of zero to
      generic_file_buffered_write() as the previously written byte count.
      We only do this by accident before this patch by the virtue of ret
      always being zero when there are no errors. Make this explicit
      rather than needing to specifically zero ret in the ENOSPC retry
      case.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Tested-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      9aa05000
  15. 31 7月, 2012 1 次提交
    • J
      xfs: Convert to new freezing code · d9457dc0
      Jan Kara 提交于
      Generic code now blocks all writers from standard write paths. So we add
      blocking of all writers coming from ioctl (we get a protection of ioctl against
      racing remount read-only as a bonus) and convert xfs_file_aio_write() to a
      non-racy freeze protection. We also keep freeze protection on transaction
      start to block internal filesystem writes such as removal of preallocated
      blocks.
      
      CC: Ben Myers <bpm@sgi.com>
      CC: Alex Elder <elder@kernel.org>
      CC: xfs@oss.sgi.com
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      d9457dc0
  16. 15 6月, 2012 2 次提交
  17. 21 5月, 2012 1 次提交
    • D
      xfs: fix delalloc quota accounting on failure · ea562ed6
      Dave Chinner 提交于
      xfstest 270 was causing quota reservations way beyond what was sane
      (ten to hundreds of TB) for a 4GB filesystem. There's a sign problem
      in the error handling path of xfs_bmapi_reserve_delalloc() because
      xfs_trans_unreserve_quota_nblks() simple negates the value passed -
      which doesn't work for an unsigned variable. This causes
      reservations of close to 2^32 block instead of removing a
      reservation of a handful of blocks.
      
      Fix the same problem in the other xfs_trans_unreserve_quota_nblks()
      callers where unsigned integer variables are used, too.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NEric Sandeen <sandeen@redhat.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      ea562ed6
  18. 15 5月, 2012 5 次提交
  19. 06 3月, 2012 1 次提交
  20. 18 1月, 2012 1 次提交
    • C
      xfs: remove the i_size field in struct xfs_inode · ce7ae151
      Christoph Hellwig 提交于
      There is no fundamental need to keep an in-memory inode size copy in the XFS
      inode.  We already have the on-disk value in the dinode, and the separate
      in-memory copy that we need for regular files only in the XFS inode.
      
      Remove the xfs_inode i_size field and change the XFS_ISIZE macro to use the
      VFS inode i_size field for regular files.  Switch code that was directly
      accessing the i_size field in the xfs_inode to XFS_ISIZE, or in cases where
      we are limited to regular files direct access of the VFS inode i_size field.
      
      This also allows dropping some fairly complicated code in the write path
      which dealt with keeping the xfs_inode i_size uptodate with the VFS i_size
      that is getting updated inside ->write_end.
      
      Note that we do not bother resetting the VFS i_size when truncating a file
      that gets freed to zero as there is no point in doing so because the VFS inode
      is no longer in use at this point.  Just relax the assert in xfs_ifree to
      only check the on-disk size instead.
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      ce7ae151
  21. 14 1月, 2012 1 次提交
  22. 12 10月, 2011 4 次提交
    • C
      xfs: simplify xfs_trans_ijoin* again · ddc3415a
      Christoph Hellwig 提交于
      There is no reason to keep a reference to the inode even if we unlock
      it during transaction commit because we never drop a reference between
      the ijoin and commit.  Also use this fact to merge xfs_trans_ijoin_ref
      back into xfs_trans_ijoin - the third argument decides if an unlock
      is needed now.
      
      I'm actually starting to wonder if allowing inodes to be unlocked
      at transaction commit really is worth the effort.  The only real
      benefit is that they can be unlocked earlier when commiting a
      synchronous transactions, but that could be solved by doing the
      log force manually after the unlock, too.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAlex Elder <aelder@sgi.com>
      
      ddc3415a
    • D
      xfs: rename xfs_bmapi to xfs_bmapi_write · c0dc7828
      Dave Chinner 提交于
      Now that all the read-only users of xfs_bmapi have been converted to
      use xfs_bmapi_read(), we can remove all the read-only handling cases
      from xfs_bmapi().
      
      Once this is done, rename xfs_bmapi to xfs_bmapi_write to reflect
      the fact it is for allocation only. This enables us to kill the
      XFS_BMAPI_WRITE flag as well.
      
      Also clean up xfs_bmapi_write to the style used in the newly added
      xfs_bmapi_read/delay functions.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAlex Elder <aelder@sgi.com>
      c0dc7828
    • C
      xfs: introduce xfs_bmapi_delay() · 4403280a
      Christoph Hellwig 提交于
      Delalloc reservations are much simpler than allocations, so give
      them a separate bmapi-level interface.  Using the previously added
      xfs_bmapi_reserve_delalloc we get a function that is only minimally
      more complicated than xfs_bmapi_read, which is far from the complexity
      in xfs_bmapi.  Also remove the XFS_BMAPI_DELAY code after switching
      over the only user to xfs_bmapi_delay.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAlex Elder <aelder@sgi.com>
      
      4403280a
    • D
      xfs: introduce xfs_bmapi_read() · 5c8ed202
      Dave Chinner 提交于
      xfs_bmapi() currently handles both extent map reading and
      allocation. As a result, the code is littered with "if (wr)"
      branches to conditionally do allocation operations if required.
      This makes the code much harder to follow and causes significant
      indent issues with the code.
      
      Given that read mapping is much simpler than allocation, we can
      split out read mapping from xfs_bmapi() and reuse the logic that
      we have already factored out do do all the hard work of handling the
      extent map manipulations. The results in a much simpler function for
      the common extent read operations, and will allow the allocation
      code to be simplified in another commit.
      
      Once xfs_bmapi_read() is implemented, convert all the callers of
      xfs_bmapi() that are only reading extents to use the new function.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAlex Elder <aelder@sgi.com>
      
      5c8ed202