1. 02 7月, 2012 1 次提交
    • D
      xfs: struct xfs_buf_log_format isn't variable sized. · 77c1a08f
      Dave Chinner 提交于
      The struct xfs_buf_log_format wants to think the dirty bitmap is
      variable sized.  In fact, it is variable size on disk simply due to
      the way we map it from the in-memory structure, but we still just
      use a fixed size memory allocation for the in-memory structure.
      
      Hence it makes no sense to set the function up as a variable sized
      structure when we already know it's maximum size, and we always
      allocate it as such. Simplify the structure by making the dirty
      bitmap a fixed sized array and just using the size of the structure
      for the allocation size.
      
      This will make it much simpler to allocate and manipulate an array
      of format structures for discontiguous buffer support.
      
      The previous struct xfs_buf_log_item size according to
      /proc/slabinfo was 224 bytes. pahole doesn't give the same size
      because of the variable size definition. With this modification,
      pahole reports the same as /proc/slabinfo:
      
      	/* size: 224, cachelines: 4, members: 6 */
      
      Because the xfs_buf_log_item size is now determined by the maximum
      supported block size we introduce a dependency on xfs_alloc_btree.h.
      Avoid this dependency by moving the idefines for the maximum block
      sizes supported to xfs_types.h with all the other max/min type
      defines to avoid any new dependencies.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      77c1a08f
  2. 15 5月, 2012 4 次提交
    • D
      xfs: clean up xfs_bit.h includes · ad1e95c5
      Dave Chinner 提交于
      With the removal of xfs_rw.h and other changes over time, xfs_bit.h
      is being included in many files that don't actually need it. Clean
      up the includes as necessary.
      
      Also move the only-used-once xfs_ialloc_find_free() static inline
      function out of a header file that is widely included to reduce
      the number of needless dependencies on xfs_bit.h.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NMark Tinguely <tinguely@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      ad1e95c5
    • D
      xfs: Do background CIL flushes via a workqueue · 4c2d542f
      Dave Chinner 提交于
      Doing background CIL flushes adds significant latency to whatever
      async transaction that triggers it. To avoid blocking async
      transactions on things like waiting for log buffer IO to complete,
      move the CIL push off into a workqueue.  By moving the push work
      into a workqueue, we remove all the latency that the commit adds
      from the foreground transaction commit path. This also means that
      single threaded workloads won't do the CIL push procssing, leaving
      them more CPU to do more async transactions.
      
      To do this, we need to keep track of the sequence number we have
      pushed work for. This avoids having many transaction commits
      attempting to schedule work for the same sequence, and ensures that
      we only ever have one push (background or forced) in progress at a
      time. It also means that we don't need to take the CIL lock in write
      mode to check for potential background push races, which reduces
      lock contention.
      
      To avoid potential issues with "smart" IO schedulers, don't use the
      workqueue for log force triggered flushes. Instead, do them directly
      so that the log IO is done directly by the process issuing the log
      force and so doesn't get stuck on IO elevator queue idling
      incorrectly delaying the log IO from the workqueue.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NMark Tinguely <tinguely@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      4c2d542f
    • C
      xfs: on-stack delayed write buffer lists · 43ff2122
      Christoph Hellwig 提交于
      Queue delwri buffers on a local on-stack list instead of a per-buftarg one,
      and write back the buffers per-process instead of by waking up xfsbufd.
      
      This is now easily doable given that we have very few places left that write
      delwri buffers:
      
       - log recovery:
      	Only done at mount time, and already forcing out the buffers
      	synchronously using xfs_flush_buftarg
      
       - quotacheck:
      	Same story.
      
       - dquot reclaim:
      	Writes out dirty dquots on the LRU under memory pressure.  We might
      	want to look into doing more of this via xfsaild, but it's already
      	more optimal than the synchronous inode reclaim that writes each
      	buffer synchronously.
      
       - xfsaild:
      	This is the main beneficiary of the change.  By keeping a local list
      	of buffers to write we reduce latency of writing out buffers, and
      	more importably we can remove all the delwri list promotions which
      	were hitting the buffer cache hard under sustained metadata loads.
      
      The implementation is very straight forward - xfs_buf_delwri_queue now gets
      a new list_head pointer that it adds the delwri buffers to, and all callers
      need to eventually submit the list using xfs_buf_delwi_submit or
      xfs_buf_delwi_submit_nowait.  Buffers that already are on a delwri list are
      skipped in xfs_buf_delwri_queue, assuming they already are on another delwri
      list.  The biggest change to pass down the buffer list was done to the AIL
      pushing. Now that we operate on buffers the trylock, push and pushbuf log
      item methods are merged into a single push routine, which tries to lock the
      item, and if possible add the buffer that needs writeback to the buffer list.
      This leads to much simpler code than the previous split but requires the
      individual IOP_PUSH instances to unlock and reacquire the AIL around calls
      to blocking routines.
      
      Given that xfsailds now also handle writing out buffers, the conditions for
      log forcing and the sleep times needed some small changes.  The most
      important one is that we consider an AIL busy as long we still have buffers
      to push, and the other one is that we do increment the pushed LSN for
      buffers that are under flushing at this moment, but still count them towards
      the stuck items for restart purposes.  Without this we could hammer on stuck
      items without ever forcing the log and not make progress under heavy random
      delete workloads on fast flash storage devices.
      
      [ Dave Chinner:
      	- rebase on previous patches.
      	- improved comments for XBF_DELWRI_Q handling
      	- fix XBF_ASYNC handling in queue submission (test 106 failure)
      	- rename delwri submit function buffer list parameters for clarity
      	- xfs_efd_item_push() should return XFS_ITEM_PINNED ]
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NMark Tinguely <tinguely@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      43ff2122
    • S
      xfs: using GFP_NOFS for blkdev_issue_flush · 7582df51
      Shaohua Li 提交于
      Issuing a block device flush request in transaction context using GFP_KERNEL
      directly can cause deadlocks due to memory reclaim recursion. Use GFP_NOFS to
      avoid recursion from reclaim context.
      Signed-off-by: NShaohua Li <shli@fusionio.com>
      Reviewed-by: NMark Tinguely <tinguely@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      7582df51
  3. 06 5月, 2012 1 次提交
  4. 18 4月, 2012 1 次提交
    • D
      xfs: Ensure inode reclaim can run during quotacheck · 8a00ebe4
      Dave Chinner 提交于
      Because the mount process can run a quotacheck and consume lots of
      inodes, we need to be able to run periodic inode reclaim during the
      mount process. This will prevent running the system out of memory
      during quota checks.
      
      This essentially reverts 2bcf6e97, but that is safe to do now that
      the quota sync code that was causing problems during long quotacheck
      executions is now gone.
      
      The reclaim work is currently protected from running during the
      unmount process by a check against MS_ACTIVE. Unfortunately, this
      also means that the reclaim work cannot run during mount.  The
      unmount process should stop the reclaim cleanly before freeing
      anything that the reclaim work depends on, so there is no need to
      have this guard in place.
      
      Also, the inode reclaim work is demand driven, so there is no need
      to start it immediately during mount. It will be started the moment
      an inode is queued for reclaim, so qutoacheck will trigger it just
      fine.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NMark Tinguely <tinguely@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      8a00ebe4
  5. 17 4月, 2012 1 次提交
  6. 27 3月, 2012 1 次提交
    • D
      xfs: don't cache inodes read through bulkstat · 5132ba8f
      Dave Chinner 提交于
      When we read inodes via bulkstat, we generally only read them once
      and then throw them away - they never get used again. If we retain
      them in cache, then it simply causes the working set of inodes and
      other cached items to be reclaimed just so the inode cache can grow.
      
      Avoid this problem by marking inodes read by bulkstat not to be
      cached and check this flag in .drop_inode to determine whether the
      inode should be added to the VFS LRU or not. If the inode lookup
      hits an already cached inode, then don't set the flag. If the inode
      lookup hits an inode marked with no cache flag, remove the flag and
      allow it to be cached once the current reference goes away.
      
      Inodes marked as not cached will get cleaned up by the background
      inode reclaim or via memory pressure, so they will still generate
      some short term cache pressure. They will, however, be reclaimed
      much sooner and in preference to cache hot inodes.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      5132ba8f
  7. 23 3月, 2012 1 次提交
    • D
      xfs: introduce an allocation workqueue · c999a223
      Dave Chinner 提交于
      We currently have significant issues with the amount of stack that
      allocation in XFS uses, especially in the writeback path. We can
      easily consume 4k of stack between mapping the page, manipulating
      the bmap btree and allocating blocks from the free list. Not to
      mention btree block readahead and other functionality that issues IO
      in the allocation path.
      
      As a result, we can no longer fit allocation in the writeback path
      in the stack space provided on x86_64. To alleviate this problem,
      introduce an allocation workqueue and move all allocations to a
      seperate context. This can be easily added as an interposing layer
      into xfs_alloc_vextent(), which takes a single argument structure
      and does not return until the allocation is complete or has failed.
      
      To do this, add a work structure and a completion to the allocation
      args structure. This allows xfs_alloc_vextent to queue the args onto
      the workqueue and wait for it to be completed by the worker. This
      can be done completely transparently to the caller.
      
      The worker function needs to ensure that it sets and clears the
      PF_TRANS flag appropriately as it is being run in an active
      transaction context. Work can also be queued in a memory reclaim
      context, so a rescuer is needed for the workqueue.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      c999a223
  8. 21 3月, 2012 2 次提交
  9. 15 3月, 2012 1 次提交
  10. 14 3月, 2012 2 次提交
  11. 06 3月, 2012 1 次提交
  12. 04 2月, 2012 1 次提交
  13. 18 1月, 2012 2 次提交
  14. 07 1月, 2012 1 次提交
  15. 24 12月, 2011 2 次提交
    • C
      xfs: log all dirty inodes in xfs_fs_sync_fs · be4f1ac8
      Christoph Hellwig 提交于
      Since Linux 2.6.36 the writeback code has introduces various measures for
      live lock prevention during sync().  Unfortunately some of these are
      actively harmful for the XFS model, where the inode gets marked dirty for
      metadata from the data I/O handler.
      
      The older_than_this checks that are now more strictly enforced since
      
          writeback: avoid livelocking WB_SYNC_ALL writeback
      
      by only calling into __writeback_inodes_sb and thus only sampling the
      current cut off time once.  But on a slow enough devices the previous
      asynchronous sync pass might not have fully completed yet, and thus XFS
      might mark metadata dirty only after that sampling of the cut off time for
      the blocking pass already happened.  I have not myself reproduced this
      myself on a real system, but by introducing artificial delay into the
      XFS I/O completion workqueues it can be reproduced easily.
      
      Fix this by iterating over all XFS inodes in ->sync_fs and log all that
      are dirty.  This might log inode that only got redirtied after the
      previous pass, but given how cheap delayed logging of inodes is it
      isn't a major concern for performance.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Tested-by: NMark Tinguely <tinguely@sgi.com>
      Reviewed-by: NMark Tinguely <tinguely@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      be4f1ac8
    • C
      xfs: log the inode in ->write_inode calls for kupdate · 0b8fd303
      Christoph Hellwig 提交于
      If the writeback code writes back an inode because it has expired we currently
      use the non-blockin ->write_inode path.  This means any inode that is pinned
      is skipped.  With delayed logging and a workload that has very little log
      traffic otherwise it is very likely that an inode that gets constantly
      written to is always pinned, and thus we keep refusing to write it.  The VM
      writeback code at that point redirties it and doesn't try to write it again
      for another 30 seconds.  This means under certain scenarious time based
      metadata writeback never happens.
      
      Fix this by calling into xfs_log_inode for kupdate in addition to data
      integrity syncs, and thus transfer the inode to the log ASAP.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Tested-by: NMark Tinguely <tinguely@sgi.com>
      Reviewed-by: NMark Tinguely <tinguely@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      0b8fd303
  16. 20 12月, 2011 1 次提交
    • C
      xfs: mark the xfssyncd workqueue as non-reentrant · 40d344ec
      Christoph Hellwig 提交于
      On a system with lots of memory pressure that is stuck on synchronous inode
      reclaim the workqueue code will run one instance of the inode reclaim work
      item on every CPU. which is not what we want.  Make sure to mark the
      xfssyncd workqueue as non-reentrant to make sure there only is one instace
      of each running globally.  Also stop using special paramater for the
      workqueue; now that we guarantee each fs has only running one of each works
      at a time there is no need to artificially lower max_active and compensate
      for that by setting the WQ_CPU_INTENSIVE flag.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      40d344ec
  17. 13 12月, 2011 1 次提交
  18. 09 12月, 2011 1 次提交
  19. 12 10月, 2011 4 次提交
  20. 01 9月, 2011 1 次提交
    • C
      xfs: fix ->write_inode return values · 58d84c4e
      Christoph Hellwig 提交于
      Currently we always redirty an inode that was attempted to be written out
      synchronously but has been cleaned by an AIL pushed internall, which is
      rather bogus.  Fix that by doing the i_update_core check early on and
      return 0 for it.  Also include async calls for it, as doing any work for
      those is just as pointless.  While we're at it also fix the sign for the
      EIO return in case of a filesystem shutdown, and fix the completely
      non-sensical locking around xfs_log_inode.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NAlex Elder <aelder@sgi.com>
      (cherry picked from commit 297db93bb74cf687510313eb235a7aec14d67e97)
      Signed-off-by: NAlex Elder <aelder@sgi.com>
      58d84c4e
  21. 25 8月, 2011 1 次提交
  22. 13 8月, 2011 1 次提交
    • C
      xfs: remove subdirectories · c59d87c4
      Christoph Hellwig 提交于
      Use the move from Linux 2.6 to Linux 3.x as an excuse to kill the
      annoying subdirectories in the XFS source code.  Besides the large
      amount of file rename the only changes are to the Makefile, a few
      files including headers with the subdirectory prefix, and the binary
      sysctl compat code that includes a header under fs/xfs/ from
      kernel/.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAlex Elder <aelder@sgi.com>
      c59d87c4
  23. 21 7月, 2011 1 次提交
  24. 13 7月, 2011 2 次提交
  25. 16 6月, 2011 1 次提交
    • C
      xfs: make log devices with write back caches work · a27a263b
      Christoph Hellwig 提交于
      There's no reason not to support cache flushing on external log devices.
      The only thing this really requires is flushing the data device first
      both in fsync and log commits.  A side effect is that we also have to
      remove the barrier write test during mount, which has been superflous
      since the new FLUSH+FUA code anyway.  Also use the chance to flush the
      RT subvolume write cache before the fsync commit, which is required
      for correct semantics.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAlex Elder <aelder@sgi.com>
      a27a263b
  26. 27 5月, 2011 1 次提交
    • C
      fs: pass exact type of data dirties to ->dirty_inode · aa385729
      Christoph Hellwig 提交于
      Tell the filesystem if we just updated timestamp (I_DIRTY_SYNC) or
      anything else, so that the filesystem can track internally if it
      needs to push out a transaction for fdatasync or not.
      
      This is just the prototype change with no user for it yet.  I plan
      to push large XFS changes for the next merge window, and getting
      this trivial infrastructure in this window would help a lot to avoid
      tree interdependencies.
      
      Also remove incorrect comments that ->dirty_inode can't block.  That
      has been changed a long time ago, and many implementations rely on it.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      aa385729
  27. 25 5月, 2011 1 次提交
    • C
      xfs: add online discard support · e84661aa
      Christoph Hellwig 提交于
      Now that we have reliably tracking of deleted extents in a
      transaction we can easily implement "online" discard support
      which calls blkdev_issue_discard once a transaction commits.
      
      The actual discard is a two stage operation as we first have
      to mark the busy extent as not available for reuse before we
      can start the actual discard.  Note that we don't bother
      supporting discard for the non-delaylog mode.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAlex Elder <aelder@sgi.com>
      e84661aa
  28. 20 5月, 2011 1 次提交
  29. 12 4月, 2011 1 次提交