1. 25 Mar 2015, 1 commit
  2. 28 Nov 2014, 3 commits
  3. 30 Oct 2014, 1 commit
    • B
      xfs: rework zero range to prevent invalid i_size updates · 5d11fb4b
      Authored by Brian Foster
      The zero range operation is analogous to fallocate with the exception of
      converting the range to zeroes. I.e., it attempts to allocate zeroed
      blocks over the range specified by the caller. The XFS implementation
      kills all delalloc blocks currently over the aligned range, converts the
      range to allocated zero blocks (unwritten extents) and handles the
      partial pages at the ends of the range by sending writes through the
      pagecache.
      
      The current implementation suffers from several problems associated with
      inode size. If the aligned range covers an extending I/O, said I/O is
      discarded and an inode size update from a previous write never makes it
      to disk. Further, if an unaligned zero range extends beyond eof, the
      page write induced for the partial end page can itself increase the
      inode size, even if the zero range request is not supposed to update
      i_size (via KEEP_SIZE, similar to an fallocate beyond EOF).
      
      The latter behavior not only incorrectly increases the inode size, but
      can lead to stray delalloc blocks on the inode. Typically, post-eof
      preallocation blocks are either truncated on release or inode eviction
      or explicitly written to by xfs_zero_eof() on natural file size
      extension. If the inode size increases due to zero range, however,
      associated blocks leak into the address space having never been
      converted or mapped to pagecache pages. A direct I/O to such an
      uncovered range cannot convert the extent via writeback and will BUG().
      For example:
      
      $ xfs_io -fc "pwrite 0 128k" -c "fzero -k 1m 54321" <file>
      ...
      $ xfs_io -d -c "pread 128k 128k" <file>
      <BUG>
      
      If the entire delalloc extent happens to not have page coverage
      whatsoever (e.g., delalloc conversion couldn't find a large enough free
      space extent), even a full file writeback won't convert what's left of
      the extent and we'll assert on inode eviction.
      
      Rework xfs_zero_file_space() to avoid buffered I/O for partial pages.
      Use the existing hole punch and prealloc mechanisms as primitives for
      zero range. This implementation is neither efficient nor ideal, as we
      write back dirty data over the range and remove existing extents rather
      than converting them to unwritten. The writeback, however, is currently
      the only mechanism available to ensure consistency between pagecache and
      extent state. Even a pagecache truncate/delalloc punch prior to hole
      punch has led to inconsistencies due to racing with writeback.
      
      This provides a consistent, correct implementation of zero range that
      survives fsstress/fsx testing without assert failures. The
      implementation can be optimized from this point forward once the
      fundamental issue of pagecache and delalloc extent state consistency is
      addressed.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      
      5d11fb4b
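      For reference, the reworked operation reduces to a hole punch followed
      by preallocation. A condensed sketch of that shape (simplified from the
      patch; the rounding details are from memory and may not match exactly):
      
      	int
      	xfs_zero_file_space(
      		struct xfs_inode	*ip,
      		xfs_off_t		offset,
      		xfs_off_t		len)
      	{
      		struct xfs_mount	*mp = ip->i_mount;
      		uint			blksize = 1 << mp->m_sb.sb_blocklog;
      		int			error;
      
      		/*
      		 * Punch a hole first: this flushes dirty pagecache over the
      		 * range, zeroes the partial blocks at each end and removes
      		 * the existing extents, avoiding buffered IO entirely.
      		 */
      		error = xfs_free_file_space(ip, offset, len);
      		if (error)
      			return error;
      
      		/* put zeroed (unwritten) blocks back over the aligned range */
      		return xfs_alloc_file_space(ip, round_down(offset, blksize),
      					round_up(offset + len, blksize) -
      					round_down(offset, blksize),
      					XFS_BMAPI_PREALLOC);
      	}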
  4. 02 Oct 2014, 3 commits
    • B
      xfs: flush the range before zero range conversion · da5f1096
      Authored by Brian Foster
      XFS currently discards delalloc blocks within the target range of a
      zero range request. Unaligned start and end offsets are zeroed
      through the page cache and the internal, aligned blocks are
      converted to unwritten extents.
      
      If EOF is page aligned and covered by a delayed allocation extent,
      the inode size is not updated until I/O completion. If a zero range
      request discards a delalloc range that covers page-aligned EOF as
      such, the inode size update never occurs. For example:
      
      $ rm -f /mnt/file
      $ xfs_io -fc "pwrite 0 64k" -c "zero 60k 4k" /mnt/file
      $ stat -c "%s" /mnt/file
      65536
      $ umount /mnt
      $ mount <dev> /mnt
      $ stat -c "%s" /mnt/file
      61440
      
      Update xfs_zero_file_space() to flush the range rather than discard
      delalloc blocks to ensure that inode size updates occur
      appropriately.
      
      [dchinner: Note that this is really a workaround to avoid the
      underlying problems. More work is needed (and ongoing) to fix those
      issues so this fix is being added as a temporary stop-gap measure. ]
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      da5f1096
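      The shape of the fix is a ranged flush ahead of touching the delalloc
      blocks. A minimal sketch, assuming page-granularity rounding:
      
      	/*
      	 * Flush the target range so a pending i_size update from a prior
      	 * extending write reaches the log before delalloc blocks are
      	 * discarded or converted.
      	 */
      	error = filemap_write_and_wait_range(VFS_I(ip)->i_mapping,
      				round_down(offset, PAGE_CACHE_SIZE),
      				round_up(offset + len, PAGE_CACHE_SIZE) - 1);
      	if (error)
      		return error;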
    • C
      xfs: simplify xfs_zero_remaining_bytes · 8c156125
      Authored by Christoph Hellwig
      xfs_zero_remaining_bytes() open codes a lot of buffer manipulations
      to do a read followed by a write. It can simply be replaced by an
      uncached read followed by an xfs_bwrite() call.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      8c156125
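      The replacement pattern looks roughly like the following (buffer target
      selection and block mapping elided; helper names as in the XFS tree of
      that era, exact signatures hedged):
      
      	bp = xfs_buf_read_uncached(target, daddr, numblks, 0, NULL);
      	if (!bp)
      		return -ENOMEM;
      
      	/* zero the relevant byte range of the block in memory... */
      	memset(bp->b_addr + boff, 0, nbytes);
      
      	/* ...write it straight back out, then drop the buffer */
      	error = xfs_bwrite(bp);
      	xfs_buf_relse(bp);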
    • D
      xfs: introduce xfs_buf_submit[_wait] · 595bff75
      Authored by Dave Chinner
      There is a lot of cookie-cutter code that looks like:
      
      	if (shutdown)
      		handle buffer error
      	xfs_buf_iorequest(bp)
      	error = xfs_buf_iowait(bp)
      	if (error)
      		handle buffer error
      
      spread through XFS. There's significant complexity now in
      xfs_buf_iorequest() to specifically handle this sort of synchronous
      IO pattern, but there's all sorts of nasty surprises in different
      error handling code dependent on who owns the buffer references and
      the locks.
      
      Pull this pattern into a single helper, where we can hide all the
      synchronous IO warts and hence make the error handling for all the
      callers much saner. This removes the need for a special extra
      reference to protect IO completion processing, as we can now hold a
      single reference across dispatch and waiting, simplifying the sync
      IO semantics and error handling.
      
      In doing this, also rename xfs_buf_iorequest to xfs_buf_submit and
      make it explicitly handle only asynchronous IO. This forces all users
      to be switched specifically to one interface or the other and
      removes any ambiguity between how the interfaces are to be used. It
      also means that xfs_buf_iowait() goes away.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      595bff75
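      After the change, the synchronous callers collapse to (in the same
      pseudo-code style as above):
      
      	error = xfs_buf_submit_wait(bp)	/* dispatch + wait, one reference */
      	if (error)
      		handle buffer error
      
      while asynchronous callers move to xfs_buf_submit(bp) and handle
      completion asynchronously.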
  5. 23 Sep 2014, 4 commits
    • D
      xfs: xfs_swap_extent_flush can be static · 7abbb8f9
      Authored by Dave Chinner
      Fix sparse warning introduced by commit 4ef897a2 ("xfs: flush both
      inodes in xfs_swap_extents").
      Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      7abbb8f9
    • B
      xfs: only writeback and truncate pages for the freed range · 8b5279e3
      Authored by Brian Foster
      xfs_free_file_space() only affects the range of the file for which space
      is being freed. It currently writes back and truncates the page cache
      from the start offset of the freed range all the way to EOF.
      
      Modify xfs_free_file_space() to write back and truncate page cache of
      just the range being freed.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      8b5279e3
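      In code terms the change is from a flush-to-EOF to a ranged flush and
      invalidation, roughly (page rounding assumed, in the page-cache units
      of that era):
      
      	ioffset = round_down(offset, PAGE_CACHE_SIZE);
      	iendoffset = round_up(offset + len, PAGE_CACHE_SIZE) - 1;
      	error = filemap_write_and_wait_range(VFS_I(ip)->i_mapping,
      					     ioffset, iendoffset);
      	if (error)
      		goto out;
      	truncate_pagecache_range(VFS_I(ip), ioffset, iendoffset);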
    • B
      xfs: writeback and inval. file range to be shifted by collapse · f71721d0
      Authored by Brian Foster
      The collapse range operation currently writes the entire file before
      starting the collapse to avoid changes in the in-core extent list due to
      writeback causing the extent count to change. Now that collapse range is
      fsb based rather than extent index based it can sustain changes in the
      extent list during the shift sequence without disruption.
      
      Modify xfs_collapse_file_space() to writeback and invalidate pages
      associated with the range of the file to be shifted.
      xfs_free_file_space() currently has similar behavior, but the space free
      need only affect the region of the file that is freed and this could
      change in the future.
      
      Also update the comments to reflect the current implementation. We
      retain the eofblocks trim permanently as a best option for dealing with
      delalloc extents. We don't shift delalloc extents because this scenario
      only occurs with post-eof preallocation (since data must be flushed such
      that the cache can be invalidated and data can be shifted). That means
      said space must also be initialized before being shifted into the
      accessible region of the file only to be immediately truncated off as
      the last part of the collapse. In other words, the eofblocks trim will
      happen anyway; we just run it first to ensure the file remains in a
      consistent state throughout the collapse.
      
      Finally, detect and fail explicitly in the event of a delalloc extent
      during the extent shift. The implementation does not support delalloc
      extents and the caller is expected to prevent this scenario in advance
      as is done by collapse.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      f71721d0
    • B
      xfs: track collapse via file offset rather than extent index · 2c845f5a
      Authored by Brian Foster
      The collapse range implementation uses a transaction per extent shift.
      The progress of the overall operation is tracked via the current extent
      index of the in-core extent list. This is racy because the ilock must be
      dropped and reacquired for each transaction according to locking and log
      reservation rules. Therefore, writeback to prior regions of the file is
      possible and can change the extent count. This changes the extent to
      which the current index refers and causes the collapse to fail
      mid-operation. To avoid this problem, the entire file is currently written
      back before the collapse operation starts.
      
      To eliminate the need to flush the entire file, use the file offset
      (fsb) to track the progress of the overall extent shift operation rather
      than the extent index. Modify xfs_bmap_shift_extents() to
      unconditionally convert the start_fsb parameter to an extent index and
      return the file offset of the extent where the shift left off, if
      further extents exist. The bulk of this function can remain based on
      extent index, as the ilock is held by the caller. xfs_collapse_file_space()
      now uses the fsb output as the starting point for the subsequent shift.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      2c845f5a
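      The resulting loop structure, sketched with the argument list condensed
      from memory (the real function carries additional allocation state):
      
      	next_fsb = XFS_B_TO_FSB(mp, offset + len);
      	shift_fsb = XFS_B_TO_FSB(mp, len);
      
      	while (!error && !done) {
      		/*
      		 * One transaction per shift; the ilock is dropped and
      		 * reacquired around each, so progress is tracked by file
      		 * offset: next_fsb returns where the shift left off.
      		 */
      		error = xfs_bmap_shift_extents(tp, ip, next_fsb, shift_fsb,
      					       &done, &next_fsb,
      					       &first_block, &free_list,
      					       XFS_BMAP_MAX_SHIFT_EXTENTS);
      		...
      	}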
  6. 02 Sep 2014, 2 commits
    • B
      xfs: trim eofblocks before collapse range · 41b9d726
      Authored by Brian Foster
      xfs_collapse_file_space() currently writes back the entire file
      undergoing collapse range to settle things down for the extent shift
      algorithm. While this prevents changes to the extent list during the
      collapse operation, the writeback itself is not enough to prevent
      unnecessary collapse failures.
      
      The current shift algorithm uses the extent index to iterate the in-core
      extent list. If a post-eof delalloc extent persists after the writeback
      (e.g., a prior zero range op where the end of the range aligns with eof
      can separate the post-eof blocks such that they are not written back and
      converted), xfs_bmap_shift_extents() becomes confused over the encoded
      br_startblock value and fails the collapse.
      
      As with the full writeback, this is a temporary fix until the algorithm
      is improved to cope with a volatile extent list and avoid attempts to
      shift post-eof extents.
      Signed-off-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      
      41b9d726
    • D
      xfs: xfs_file_collapse_range is delalloc challenged · 1669a8ca
      Authored by Dave Chinner
      If we have delalloc extents on a file before we run a collapse range
      operation, we sync the range that we are going to collapse to
      convert delalloc extents in that region to real extents to simplify
      the shift operation.
      
      However, the shift operation then assumes that the extent list is
      not going to change as it iterates over the extent list moving
      things about. Unfortunately, this isn't true because we can't hold
      the ILOCK over all the operations. We can prevent new IO from
      modifying the extent list by holding the IOLOCK, but that doesn't
      prevent writeback from running....
      
      And when writeback runs, it can convert delalloc extents in the
      range of the file prior to the region being collapsed, and this
      changes the indexes of all the extents in the file. That causes the
      collapse range operation to Go Bad.
      
      The right fix is to rewrite the extent shift operation not to be
      dependent on the extent list not changing across the entire
      operation, but this is a fairly significant piece of work to do.
      Hence, as a short-term workaround for the problem, sync the entire
      file before starting a collapse operation to remove all delalloc
      ranges from the file and so avoid the problem of concurrent
      writeback changing the extent list.
      Diagnosed-and-Reported-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      
      
      1669a8ca
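      The workaround itself amounts to one call at the top of the collapse
      path, something like:
      
      	/*
      	 * Short-term: write back the whole file so no delalloc extents
      	 * remain for concurrent writeback to convert while the shift
      	 * walks the extent list.
      	 */
      	error = filemap_write_and_wait(VFS_I(ip)->i_mapping);
      	if (error)
      		return error;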
  7. 04 Aug 2014, 4 commits
  8. 30 Jul 2014, 1 commit
  9. 15 Jul 2014, 2 commits
    • D
      xfs: refine the allocation stack switch · cf11da9c
      Authored by Dave Chinner
      The allocation stack switch at xfs_bmapi_allocate() has served its
      purpose, but is no longer a sufficient solution to the stack usage
      problem we have in the XFS allocation path.
      
      Whilst the kernel stack size is now 16k, that is not a valid reason
      for undoing all our "keep stack usage down" modifications. What it
      does allow us to do is have the freedom to refine and perfect the
      modifications knowing that if we get it wrong it won't blow up in
      our faces - we have a safety net now.
      
      This is important because we still have the issue of older kernels
      having smaller stacks and that they are still supported and are
      demonstrating a wide range of different stack overflows.  Red Hat
      has several open bugs for allocation based stack overflows from
      directory modifications and direct IO block allocation and these
      problems still need to be solved. If we can solve them upstream,
      then distros won't need to bake their own unique solutions.
      
      To that end, I've observed that every allocation based stack
      overflow report has had a specific characteristic - it has happened
      during or directly after a bmap btree block split. That event
      requires a new block to be allocated to the tree, and so we
      effectively stack one allocation stack on top of another, and that's
      when we get into trouble.
      
      A further observation is that bmap btree block splits are much rarer
      than writeback allocation - over a range of different workloads I've
      observed the ratio of bmap btree inserts to splits ranges from 100:1
      (xfstests run) to 10000:1 (local VM image server with sparse files
      that range in the hundreds of thousands to millions of extents).
      Either way, bmap btree split events are much, much rarer than
      allocation events.
      
      Finally, we have to move the kswapd state to the allocation workqueue
      work when allocation is done on behalf of kswapd. This is proving to
      cause significant perturbation in performance under memory pressure
      and appears to be generating allocation deadlock warnings under some
      workloads, so avoiding the use of a workqueue for the majority of
      kswapd writeback allocation will minimise the impact of such
      behaviour.
      
      Hence it makes sense to move the stack switch to xfs_btree_split()
      and only do it for bmap btree splits. Stack switches during
      allocation will be much rarer, so there won't be significant
      performance overhead caused by switching stacks. The worst case
      stack from all allocation paths will be split, not just writeback.
      And the majority of memory allocations will be done in the correct
      context (e.g. kswapd) without causing additional latency, and so we
      simplify the memory reclaim interactions between processes,
      workqueues and kswapd.
      
      The worst stack I've been able to generate with this patch in place
      is 5600 bytes deep. It's very revealing because we exit XFS at:
      
      37)     1768      64   kmem_cache_alloc+0x13b/0x170
      
      about 1800 bytes of stack consumed, and the remaining 3800 bytes
      (and 36 functions) is memory reclaim, swap and the IO stack. And
      this occurs in the inode allocation from an open(O_CREAT) syscall,
      not writeback.
      
      The amount of stack being used is much less than I've previously been
      able to generate - fs_mark testing has been able to generate stack
      usage of around 7k without too much trouble; with this patch it's
      only just getting to 5.5k. This is primarily because the metadata
      allocation paths (e.g. directory blocks) are no longer causing
      double splits on the same stack, and hence stack tracing now shows
      swapping as the worst stack consumer rather than XFS.
      
      Performance of fs_mark inode create workloads is unchanged.
      Performance of fs_mark async fsync workloads is consistently good
      with context switches reduced by around 150,000/s (30%).
      Performance of dbench, streaming IO and postmark is unchanged.
      Allocation deadlock warnings have not been seen on the workloads
      that generated them since adding this patch.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      cf11da9c
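      The dispatch side of the relocated stack switch, condensed from the
      patch (the args structure packs the parameters shown; the worker
      definition is omitted):
      
      	STATIC int
      	xfs_btree_split(
      		struct xfs_btree_cur	*cur,
      		int			level,
      		union xfs_btree_ptr	*ptrp,
      		union xfs_btree_key	*key,
      		struct xfs_btree_cur	**curp,
      		int			*stat)
      	{
      		struct xfs_btree_split_args	args;
      		DECLARE_COMPLETION_ONSTACK(done);
      
      		/* only bmap btree splits need the fresh stack */
      		if (cur->bc_btnum != XFS_BTNUM_BMAP)
      			return __xfs_btree_split(cur, level, ptrp, key,
      						 curp, stat);
      
      		args.cur = cur;
      		args.level = level;
      		args.ptrp = ptrp;
      		args.key = key;
      		args.curp = curp;
      		args.stat = stat;
      		args.done = &done;
      		args.kswapd = current_is_kswapd();
      		INIT_WORK_ONSTACK(&args.work, xfs_btree_split_worker);
      		queue_work(xfs_alloc_wq, &args.work);
      		wait_for_completion(&done);
      		destroy_work_on_stack(&args.work);
      		return args.result;
      	}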
    • D
      Revert "xfs: block allocation work needs to be kswapd aware" · aa182e64
      Authored by Dave Chinner
      This reverts commit 1f6d6482.
      
      This commit resulted in regressions in performance in low
      memory situations where kswapd was doing writeback of delayed
      allocation blocks. It resulted in significant parallelism of the
      kswapd work, and with the special kswapd flags meant that hundreds of
      active allocations could dip into kswapd-specific memory reserves and
      avoid being throttled. This caused a large amount of performance
      variation, as well as random OOM-killer invocations that didn't
      previously exist.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      aa182e64
  10. 25 Jun 2014, 1 commit
    • D
      xfs: global error sign conversion · 2451337d
      Authored by Dave Chinner
      Convert all the errors in the core XFS code to negative error numbers,
      like the rest of the kernel, and remove all the sign conversion we
      do in the interface layers.
      
      Errors for conversion (and comparison) found via searches like:
      
      $ git grep " E" fs/xfs
      $ git grep "return E" fs/xfs
      $ git grep " E[A-Z].*;$" fs/xfs
      
      Negation points found via searches like:
      
      $ git grep "= -[a-z,A-Z]" fs/xfs
      $ git grep "return -[a-z,A-D,F-Z]" fs/xfs
      $ git grep " -[a-z].*;" fs/xfs
      
      [ with some bits I missed from Brian Foster ]
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      2451337d
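      The flavour of the conversion, as a before/after fragment (illustrative;
      XFS_ERROR() was the old positive-error wrapper removed around the same
      time):
      
      	/* before: positive error, negated at the interface layer */
      	if (!capable(CAP_SYS_ADMIN))
      		return XFS_ERROR(EPERM);
      
      	/* after: negative error throughout, like the rest of the kernel */
      	if (!capable(CAP_SYS_ADMIN))
      		return -EPERM;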
  11. 22 Jun 2014, 1 commit
  12. 06 Jun 2014, 1 commit
    • D
      xfs: block allocation work needs to be kswapd aware · 1f6d6482
      Authored by Dave Chinner
      Upon memory pressure, kswapd calls xfs_vm_writepage() from
      shrink_page_list(). This can result in delayed allocation occurring,
      and that gets deferred to the allocation workqueue.
      
      The allocation then runs outside kswapd context, which means if it
      needs memory (and it does, to demand-page metadata from disk) it can
      block in shrink_inactive_list() waiting for IO congestion. These
      blocking waits are normally avoided in kswapd context, so under
      memory pressure writeback from kswapd can be arbitrarily delayed by
      memory reclaim.
      
      To avoid this, pass the kswapd context to the allocation being done
      by the workqueue, so that memory reclaim understands correctly that
      the work is being done for kswapd and therefore it is not blocked
      and does not delay memory reclaim.
      
      To avoid issues with int->char conversion of flag fields (as noticed
      in v1 of this patch) convert the flag fields in the struct
      xfs_bmalloca to bool types. pahole indicates these variables are
      still single byte variables, so no extra space is consumed by this
      change.
      
      cc: <stable@vger.kernel.org>
      Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      
      1f6d6482
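      The mechanism, condensed from the patch: the dispatching task records
      whether it is kswapd, and the worker temporarily inherits the matching
      task flags (current_set_flags_nested() is an XFS-internal helper; flag
      set as in the 3.x tree):
      
      	/* at dispatch */
      	args->kswapd = current_is_kswapd();
      
      	/* in the allocation worker */
      	unsigned long	pflags;
      	unsigned long	new_pflags = PF_FSTRANS;
      
      	if (args->kswapd)
      		new_pflags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
      
      	current_set_flags_nested(&pflags, new_pflags);
      	args->result = __xfs_bmapi_allocate(args);
      	complete(args->done);
      	current_restore_flags_nested(&pflags, new_pflags);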
  13. 20 May 2014, 1 commit
  14. 14 Apr 2014, 1 commit
  15. 24 Feb 2014, 1 commit
  16. 11 Jan 2014, 1 commit
  17. 10 Jan 2014, 1 commit
  18. 19 Dec 2013, 4 commits
  19. 17 Dec 2013, 1 commit
    • C
      xfs: remove xfsbdstrat error · 83a0adc3
      Authored by Christoph Hellwig
      The xfsbdstrat helper is a small but useless wrapper for xfs_buf_iorequest that
      handles the case of a shut down filesystem.  Most of the users have private,
      uncached buffers that can just be freed in this case, but the complex error
      handling in xfs_bioerror_relse messes up the case when it's called without
      a locked buffer.
      
      Remove xfsbdstrat and opencode the error handling in the callers.  All but
      one can simply return an error and don't need to deal with buffer state,
      and the one caller that cares about the buffer state could do with a major
      cleanup as well, but we'll defer that to later.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Ben Myers <bpm@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      83a0adc3
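      Opencoded at a typical call site, the replacement looks roughly like
      this (note errors were still positive-valued in this era of the tree):
      
      	if (XFS_FORCED_SHUTDOWN(mp)) {
      		/* private, uncached buffer: just release it and bail */
      		xfs_buf_relse(bp);
      		return XFS_ERROR(EIO);
      	}
      
      	xfs_buf_iorequest(bp);
      	error = xfs_buf_iowait(bp);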
  20. 24 Oct 2013, 4 commits
    • D
      xfs: decouple inode and bmap btree header files · a4fbe6ab
      Authored by Dave Chinner
      Currently the xfs_inode.h header has a dependency on the definition
      of the BMAP btree records, as the inode fork includes an array of
      xfs_bmbt_rec_host_t objects in its definition.
      
      Move all the btree format definitions from xfs_btree.h,
      xfs_bmap_btree.h, xfs_alloc_btree.h and xfs_ialloc_btree.h to
      xfs_format.h to continue the process of centralising the on-disk
      format definitions. With this done, the xfs inode definitions are no
      longer dependent on btree header files.
      
      This enables a massive culling of unnecessary includes, with close to
      200 #include directives removed from the XFS kernel code base.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Ben Myers <bpm@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      a4fbe6ab
    • D
      xfs: decouple log and transaction headers · 239880ef
      Authored by Dave Chinner
      xfs_trans.h has a dependency on xfs_log.h for a couple of
      structures. Most code that does transactions doesn't need to know
      anything about the log, but this dependency means that they have to
      include xfs_log.h. Decouple the xfs_trans.h and xfs_log.h header
      files and clean up the includes to be in dependency order.
      
      In doing this, remove the direct include of xfs_trans_reserve.h from
      xfs_trans.h so that we remove the dependency between xfs_trans.h and
      xfs_mount.h. Hence the xfs_trans.h include can be moved to
      indicate the actual dependencies other header files have on it.
      
      Note that these are kernel only header files, so this does not
      translate to any userspace changes at all.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Ben Myers <bpm@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      239880ef
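      After the reshuffle a typical .c file pulls headers in dependency
      order, along these lines (a representative ordering, not a verbatim
      copy of any one file):
      
      	#include "xfs.h"
      	#include "xfs_fs.h"
      	#include "xfs_format.h"
      	#include "xfs_log_format.h"
      	#include "xfs_trans_resv.h"
      	#include "xfs_mount.h"
      	#include "xfs_trans.h"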
    • D
      xfs: unify directory/attribute format definitions · 57062787
      Authored by Dave Chinner
      The on-disk format definitions for the directory and attribute
      structures are spread across 3 header files right now, only one of
      which is dedicated to defining on-disk structures and their
      manipulation (xfs_dir2_format.h). Pull all the format definitions
      into a single header file - xfs_da_format.h - and switch all the
      code over to point at that.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Ben Myers <bpm@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      57062787
    • D
      xfs: create a shared header file for format-related information · 70a9883c
      Authored by Dave Chinner
      All of the buffer operations structures need to be exported
      for xfs_db, so move them all to a common location rather than
      spreading them all over the place. They verify the on-disk
      format, so while xfs_format.h might seem a good place, they are
      not themselves part of the on-disk format.
      
      Hence we need to create a new header file in which we centralise these
      related definitions. Start by moving the buffer operations
      structures, and then also move all the other definitions that have
      crept into xfs_log_format.h and xfs_format.h, as there was no other
      shared header file to put them in.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      70a9883c
  21. 22 Oct 2013, 2 commits