1. 15 Jul 2014, 2 commits
    • xfs: refine the allocation stack switch · cf11da9c
      Authored by Dave Chinner
      The allocation stack switch at xfs_bmapi_allocate() has served its
      purpose, but is no longer a sufficient solution to the stack usage
      problem we have in the XFS allocation path.
      
      Whilst the kernel stack size is now 16k, that is not a valid reason
      for undoing all our "keep stack usage down" modifications. What it
      does allow us to do is have the freedom to refine and perfect the
      modifications knowing that if we get it wrong it won't blow up in
      our faces - we have a safety net now.
      
      This is important because older kernels have smaller stacks, are
      still supported, and are demonstrating a wide range of different
      stack overflows.  Red Hat has several open bugs for allocation-based
      stack overflows from directory modifications and direct IO block
      allocation, and these problems still need to be solved. If we can
      solve them upstream, then distros won't need to bake their own
      unique solutions.
      
      To that end, I've observed that every allocation-based stack
      overflow report has had a specific characteristic - it has happened
      during or directly after a bmap btree block split. That event
      requires a new block to be allocated to the tree, and so we
      effectively stack one allocation stack on top of another, and that's
      when we get into trouble.
      
      A further observation is that bmap btree block splits are much rarer
      than writeback allocation - over a range of different workloads I've
      observed the ratio of bmap btree inserts to splits ranges from 100:1
      (xfstests run) to 10000:1 (local VM image server with sparse files
      that range in the hundreds of thousands to millions of extents).
      Either way, bmap btree split events are much, much rarer than
      allocation events.
      
      Finally, we have to move the kswapd state to the allocation workqueue
      work when allocation is done on behalf of kswapd. This is proving to
      cause significant perturbation in performance under memory pressure
      and appears to be generating allocation deadlock warnings under some
      workloads, so avoiding the use of a workqueue for the majority of
      kswapd writeback allocation will minimise the impact of such
      behaviour.
      
      Hence it makes sense to move the stack switch to xfs_btree_split()
      and only do it for bmap btree splits. Stack switches during
      allocation will be much rarer, so there won't be significant
      performance overhead caused by switching stacks. The worst-case
      stack from all allocation paths will be split, not just writeback.
      And the majority of memory allocations will be done in the correct
      context (e.g. kswapd) without causing additional latency, and so we
      simplify the memory reclaim interactions between processes,
      workqueues and kswapd.
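      
      In outline, that change looks something like the sketch below. This
      is a minimal sketch of the mechanism described above, not the patch
      itself; the xfs_btree_split_args layout and the __xfs_btree_split()
      split-out are illustrative:
      
          /* Package the split parameters, run the real split via a
           * workqueue item (i.e. on a fresh stack), and sleep until the
           * worker signals completion. */
          struct xfs_btree_split_args {
                  struct xfs_btree_cur    *cur;
                  int                     level;
                  union xfs_btree_ptr     *ptrp;
                  union xfs_btree_key     *key;
                  struct xfs_btree_cur    **curp;
                  int                     *stat;
                  struct work_struct      work;
                  struct completion       *done;
                  int                     result;
          };
      
          static void
          xfs_btree_split_worker(
                  struct work_struct      *work)
          {
                  struct xfs_btree_split_args *args = container_of(work,
                                  struct xfs_btree_split_args, work);
      
                  args->result = __xfs_btree_split(args->cur, args->level,
                                  args->ptrp, args->key, args->curp,
                                  args->stat);
                  complete(args->done);
          }
      
          STATIC int                              /* error */
          xfs_btree_split(
                  struct xfs_btree_cur    *cur,
                  int                     level,
                  union xfs_btree_ptr     *ptrp,
                  union xfs_btree_key     *key,
                  struct xfs_btree_cur    **curp,
                  int                     *stat)
          {
                  struct xfs_btree_split_args     args;
                  DECLARE_COMPLETION_ONSTACK(done);
      
                  args.cur = cur;
                  args.level = level;
                  args.ptrp = ptrp;
                  args.key = key;
                  args.curp = curp;
                  args.stat = stat;
                  args.done = &done;
                  INIT_WORK_ONSTACK(&args.work, xfs_btree_split_worker);
                  queue_work(xfs_alloc_wq, &args.work);
                  wait_for_completion(&done);
                  destroy_work_on_stack(&args.work);
                  return args.result;
          }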
      
      The worst stack I've been able to generate with this patch in place
      is 5600 bytes deep. It's very revealing because we exit XFS at:
      
      37)     1768      64   kmem_cache_alloc+0x13b/0x170
      
      about 1800 bytes of stack consumed, and the remaining 3800 bytes
      (and 36 functions) is memory reclaim, swap and the IO stack. And
      this occurs in the inode allocation from an open(O_CREAT) syscall,
      not writeback.
      
      The amount of stack being used is much less than I've previously
      been able to generate - fs_mark testing has been able to generate
      stack usage of around 7k without too much trouble; with this patch
      it's only just getting to 5.5k. This is primarily because the
      metadata allocation paths (e.g. directory blocks) are no longer
      causing double splits on the same stack, and stack tracing now shows
      swapping as the worst stack consumer rather than XFS.
      
      Performance of fs_mark inode create workloads is unchanged.
      Performance of fs_mark async fsync workloads is consistently good
      with context switches reduced by around 150,000/s (30%).
      Performance of dbench, streaming IO and postmark is unchanged.
      Allocation deadlock warnings have not been seen on the workloads
      that generated them since adding this patch.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
    • Revert "xfs: block allocation work needs to be kswapd aware" · aa182e64
      Authored by Dave Chinner
      This reverts commit 1f6d6482.
      
      This commit resulted in performance regressions in low memory
      situations where kswapd was doing writeback of delayed allocation
      blocks. It resulted in significant parallelism of the kswapd work,
      and the special kswapd flags meant that hundreds of active
      allocations could dip into kswapd-specific memory reserves and
      avoid being throttled. This caused a large amount of performance
      variation, as well as random OOM-killer invocations that didn't
      previously exist.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
  2. 06 Jun 2014, 1 commit
    • xfs: block allocation work needs to be kswapd aware · 1f6d6482
      Authored by Dave Chinner
      Upon memory pressure, kswapd calls xfs_vm_writepage() from
      shrink_page_list(). This can result in delayed allocation occurring
      and that gets deferred to the allocation workqueue.
      
      The allocation then runs outside kswapd context, which means if it
      needs memory (and it does, to demand-page metadata from disk) it can
      block in shrink_inactive_list() waiting for IO congestion. These
      blocking waits are normally avoided in kswapd context, so under
      memory pressure writeback from kswapd can be arbitrarily delayed by
      memory reclaim.
      
      To avoid this, pass the kswapd context to the allocation being done
      by the workqueue, so that memory reclaim understands that the work
      is being done on kswapd's behalf. Reclaim then does not block the
      work, and the work does not delay memory reclaim.
      
      To avoid issues with int->char conversion of flag fields (as noticed
      in v1 of this patch), convert the flag fields in struct xfs_bmalloca
      to bool types. pahole indicates these variables are
      still single byte variables, so no extra space is consumed by this
      change.
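      
      A minimal sketch of that flow is below; __xfs_bmapi_allocate() stands
      in for the core allocator and the worker name is illustrative:
      
          static void
          xfs_bmapi_allocate_worker(
                  struct work_struct      *work)
          {
                  struct xfs_bmalloca *args = container_of(work,
                                  struct xfs_bmalloca, work);
      
                  /* Re-establish the submitter's kswapd state on this
                   * worker thread so memory reclaim treats the allocation
                   * as kswapd's own work and won't block it waiting on
                   * IO congestion. */
                  if (args->kswapd)
                          current->flags |= PF_KSWAPD;
      
                  args->result = __xfs_bmapi_allocate(args);
      
                  if (args->kswapd)
                          current->flags &= ~PF_KSWAPD;
                  complete(args->done);
          }
      
      At submission time the caller's context would be captured with
      something like args->kswapd = current_is_kswapd(), before the work
      item is queued.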
      
      cc: <stable@vger.kernel.org>
      Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      
  3. 24 Feb 2014, 1 commit
  4. 22 Oct 2013, 2 commits
  5. 13 Aug 2013, 3 commits
    • xfs: consolidate extent swap code · a133d952
      Authored by Dave Chinner
      Since we no longer need xfs_dfrag.h in userspace, move the extent
      swap ioctl structure definition to xfs_fs.h, where most of the other
      ioctl structure definitions are.
      
      Now that we don't need separate files for extent swapping, separate
      the basic file descriptor checking code to xfs_ioctl.c, and the code
      that does the extent swap operation to xfs_bmap_util.c.  This
      cleanly separates the user interface code from the physical
      mechanism used to do the extent swap.
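      
      A rough sketch of the resulting split (the checks and signatures are
      approximate, not lifted from the patch):
      
          /* xfs_ioctl.c: user interface - validate the descriptors only */
          long
          xfs_ioc_swapext(
                  struct xfs_swapext      *sxp)
          {
                  struct fd       f = fdget((int)sxp->sx_fdtarget);
                  struct fd       tmp = fdget((int)sxp->sx_fdtmp);
                  int             error = -EINVAL;
      
                  if (!f.file || !tmp.file)
                          goto out;
                  if (!(f.file->f_mode & FMODE_WRITE) ||
                      !(tmp.file->f_mode & FMODE_WRITE) ||
                      (f.file->f_flags & O_APPEND))
                          goto out;
      
                  /* ...same-mount and same-filesystem checks here... */
      
                  /* xfs_bmap_util.c: the physical extent swap mechanism */
                  error = xfs_swap_extents(XFS_I(file_inode(f.file)),
                                           XFS_I(file_inode(tmp.file)), sxp);
          out:
                  fdput(tmp);
                  fdput(f);
                  return error;
          }
      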
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
    • xfs: kill xfs_vnodeops.[ch] · c24b5dfa
      Authored by Dave Chinner
      Now that we have xfs_inode.c for holding kernel-only XFS inode
      operations, move all the inode operations from xfs_vnodeops.c to
      this new file, as it holds another set of kernel-only inode
      operations. The name of this file traces back to the days of Irix
      and its vnodes, which we don't have anymore.
      
      Essentially this move consolidates the inode locking functions
      and a bunch of XFS inode operations into the one file. Eventually
      the high level functions will be merged into the VFS interface
      functions in xfs_iops.c.
      
      This leaves only internal preallocation, EOF block manipulation and
      hole punching functions in vnodeops.c. Move these to xfs_bmap_util.c
      where we are already consolidating various in-kernel physical extent
      manipulation and querying functions.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
    • xfs: create xfs_bmap_util.[ch] · 68988114
      Authored by Dave Chinner
      There is a bunch of code in xfs_bmap.c that is kernel specific and
      not shared with userspace. To minimise the difference between the
      kernel and userspace code, shift this unshared code to
      xfs_bmap_util.c, and the declarations to xfs_bmap_util.h.
      
      The biggest issue here is xfs_bmap_finish() - userspace has its own
      definition of this function, and so we need to move it out of
      xfs_bmap.[ch]. This means several other files need to include
      xfs_bmap_util.h as well.
      
      It also introduces an interesting dance for the stack switching
      code in xfs_bmapi_allocate(). The stack switching/workqueue code is
      actually moved to xfs_bmap_util.c, so that userspace can simply use
      a #define in a header file to connect the dots without needing to
      know about the stack switch code at all, as sketched below.
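      
      In outline (a hedged sketch; the wrapper/core naming is illustrative):
      
          /* kernel: xfs_bmap_util.c supplies the stack-switching wrapper
           * around the core allocator. */
          int xfs_bmapi_allocate(struct xfs_bmalloca *args);
      
          /* userspace: no workqueues exist, so a header simply connects
           * the dots straight to the core allocator. */
          #define xfs_bmapi_allocate(args)        __xfs_bmapi_allocate(args)
      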
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>