1. 22 Jun 2012, 1 commit
  2. 16 May 2012, 1 commit
    • xfs: protect xfs_sync_worker with s_umount semaphore · 1307bbd2
      Ben Myers authored
      xfs_sync_worker checks the MS_ACTIVE flag in s_flags to avoid doing
      work during mount and unmount.  This flag can be cleared by unmount
      after the xfs_sync_worker checks it but before the work is completed.
      This has caused crashes in the completion handler for the dummy
      transaction committed by xfs_sync_worker:
      
      PID: 27544  TASK: ffff88013544e040  CPU: 3   COMMAND: "kworker/3:0"
       #0 [ffff88016fdff930] machine_kexec at ffffffff810244e9
       #1 [ffff88016fdff9a0] crash_kexec at ffffffff8108d053
       #2 [ffff88016fdffa70] oops_end at ffffffff813ad1b8
       #3 [ffff88016fdffaa0] no_context at ffffffff8102bd48
       #4 [ffff88016fdffaf0] __bad_area_nosemaphore at ffffffff8102c04d
       #5 [ffff88016fdffb40] bad_area_nosemaphore at ffffffff8102c12e
       #6 [ffff88016fdffb50] do_page_fault at ffffffff813afaee
       #7 [ffff88016fdffc60] page_fault at ffffffff813ac635
          [exception RIP: xlog_get_lowest_lsn+0x30]
          RIP: ffffffffa04a9910  RSP: ffff88016fdffd10  RFLAGS: 00010246
          RAX: ffffc90014e48000  RBX: ffff88014d879980  RCX: ffff88014d879980
          RDX: ffff8802214ee4c0  RSI: 0000000000000000  RDI: 0000000000000000
          RBP: ffff88016fdffd10   R8: ffff88014d879a80   R9: 0000000000000000
          R10: 0000000000000001  R11: 0000000000000000  R12: ffff8802214ee400
          R13: ffff88014d879980  R14: 0000000000000000  R15: ffff88022fd96605
          ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
       #8 [ffff88016fdffd18] xlog_state_do_callback at ffffffffa04aa186 [xfs]
       #9 [ffff88016fdffd98] xlog_state_done_syncing at ffffffffa04aa568 [xfs]
      
      Protect xfs_sync_worker by using the s_umount semaphore at the read
      level to provide exclusion with unmount while work is progressing.
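
      A minimal sketch of the guard (the read-only check and error
      handling are simplified; the worker reaches the superblock via
      mp->m_super):

      	/* a failed trylock means unmount is in progress: skip out */
      	if (down_read_trylock(&mp->m_super->s_umount)) {
      		if (!(mp->m_flags & XFS_MOUNT_RDONLY))
      			error = xfs_fs_log_dummy(mp);
      		up_read(&mp->m_super->s_umount);
      	}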
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      1307bbd2
  3. 15 May 2012, 7 commits
    • xfs: clean up xfs_bit.h includes · ad1e95c5
      Dave Chinner authored
      With the removal of xfs_rw.h and other changes over time, xfs_bit.h
      is being included in many files that don't actually need it. Clean
      up the includes as necessary.
      
      Also move the only-used-once xfs_ialloc_find_free() static inline
      function out of a header file that is widely included to reduce
      the number of needless dependencies on xfs_bit.h.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      ad1e95c5
    • xfs: pass shutdown method into xfs_trans_ail_delete_bulk · 04913fdd
      Dave Chinner authored
      xfs_trans_ail_delete_bulk() can be called from different contexts, so
      if the item is not in the AIL we need a different shutdown action for
      each context.  Pass in the shutdown method needed so the correct
      action can be taken.
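
      A sketch of the changed interface; the new shutdown_type argument
      (a SHUTDOWN_* reason such as SHUTDOWN_CORRUPT_INCORE) is simply
      forwarded when an item is unexpectedly not in the AIL:

      	void
      	xfs_trans_ail_delete_bulk(
      		struct xfs_ail		*ailp,
      		struct xfs_log_item	**log_items,
      		int			nr_items,
      		int			shutdown_type)	/* new argument */
      	{
      		/* ... for each item not marked XFS_LI_IN_AIL: ... */
      		if (!XFS_FORCED_SHUTDOWN(mp))
      			xfs_force_shutdown(mp, shutdown_type);
      		/* ... otherwise bulk-remove the items as before ... */
      	}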
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      04913fdd
    • xfs: on-stack delayed write buffer lists · 43ff2122
      Christoph Hellwig authored
      Queue delwri buffers on a local on-stack list instead of a per-buftarg one,
      and write back the buffers per-process instead of by waking up xfsbufd.
      
      This is now easily doable given that we have very few places left that write
      delwri buffers:
      
       - log recovery:
      	Only done at mount time, and already forcing out the buffers
      	synchronously using xfs_flush_buftarg
      
       - quotacheck:
      	Same story.
      
       - dquot reclaim:
      	Writes out dirty dquots on the LRU under memory pressure.  We might
      	want to look into doing more of this via xfsaild, but it's already
      	more optimal than the synchronous inode reclaim that writes each
      	buffer synchronously.
      
       - xfsaild:
      	This is the main beneficiary of the change.  By keeping a local list
      	of buffers to write we reduce latency of writing out buffers, and
      	more importantly we can remove all the delwri list promotions which
      	were hitting the buffer cache hard under sustained metadata loads.
      
      The implementation is very straightforward - xfs_buf_delwri_queue now
      gets a new list_head pointer that it adds the delwri buffers to, and
      all callers need to eventually submit the list using
      xfs_buf_delwri_submit or xfs_buf_delwri_submit_nowait.  Buffers that
      are already on a delwri list are skipped in xfs_buf_delwri_queue,
      since they are assumed to be handled by that other delwri list.  The
      biggest change needed to pass down the buffer list was to the AIL
      pushing.  Now that we operate on buffers, the trylock, push and
      pushbuf log item methods are merged into a single push routine, which
      tries to lock the item and, if possible, adds the buffer that needs
      writeback to the buffer list.  This leads to much simpler code than
      the previous split, but requires the individual IOP_PUSH instances to
      unlock and reacquire the AIL lock around calls to blocking routines.
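
      In API terms the pattern at each call site becomes (a sketch, with
      error handling elided):

      	LIST_HEAD(buffer_list);

      	/* collect dirty buffers on the local on-stack list ... */
      	xfs_buf_delwri_queue(bp, &buffer_list);

      	/* ... then write them all back from this process */
      	error = xfs_buf_delwri_submit(&buffer_list);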
      
      Given that xfsailds now also handle writing out buffers, the
      conditions for log forcing and the sleep times needed some small
      changes.  The most important one is that we consider an AIL busy as
      long as we still have buffers to push, and the other one is that we
      do increment the pushed LSN for buffers that are under flushing at
      this moment, but still count them towards the stuck items for restart
      purposes.  Without this we could hammer on stuck items without ever
      forcing the log and not make progress under heavy random delete
      workloads on fast flash storage devices.
      
      [ Dave Chinner:
      	- rebase on previous patches.
      	- improved comments for XBF_DELWRI_Q handling
      	- fix XBF_ASYNC handling in queue submission (test 106 failure)
      	- rename delwri submit function buffer list parameters for clarity
      	- xfs_efd_item_push() should return XFS_ITEM_PINNED ]
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      43ff2122
    • xfs: do not write the buffer from xfs_iflush · 4c46819a
      Christoph Hellwig authored
      Instead of writing the buffer directly from inside xfs_iflush, return
      it to the caller and let the caller decide what to do with the
      buffer.  Also remove the pincount check in xfs_iflush that all
      non-blocking callers already implement, and the now unused flags
      parameter.
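
      A sketch of the new calling convention from a non-blocking caller,
      using the delwri list interface from the adjacent commit (the buffer
      comes back locked):

      	struct xfs_buf	*bp = NULL;

      	error = xfs_iflush(ip, &bp);
      	if (!error) {
      		/* the caller now owns the write-out decision */
      		xfs_buf_delwri_queue(bp, buffer_list);
      		xfs_buf_relse(bp);
      	}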
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      4c46819a
    • xfs: don't flush inodes from background inode reclaim · 8a48088f
      Christoph Hellwig authored
      We already flush dirty inodes through the AIL regularly, there is no
      reason to have a second thread compete with it and disturb the I/O
      pattern.  We still do write inodes when doing a synchronous reclaim
      from the shrinker or during unmount for now.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      8a48088f
    • xfs: implement freezing by emptying the AIL · 211e4d43
      Christoph Hellwig authored
      Now that we write back all metadata either synchronously or through
      the AIL we can simply implement metadata freezing in terms of
      emptying the AIL.
      
      The implementation for this is fairly simple and straightforward:
      A new routine is added that asks the xfsaild to push the AIL to the
      end and waits for it to complete and send a wakeup. The routine will
      then loop if the AIL is not actually empty, and continue to do so
      until the AIL is completely empty.
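
      A sketch of that routine, assuming an xa_empty waitqueue that the
      xfsaild wakes once the AIL drains:

      	void
      	xfs_ail_push_all_sync(
      		struct xfs_ail	*ailp)
      	{
      		struct xfs_log_item	*lip;
      		DEFINE_WAIT(wait);

      		spin_lock(&ailp->xa_lock);
      		while ((lip = xfs_ail_max(ailp)) != NULL) {
      			prepare_to_wait(&ailp->xa_empty, &wait,
      					TASK_UNINTERRUPTIBLE);
      			ailp->xa_target = lip->li_lsn;
      			wake_up_process(ailp->xa_task);
      			spin_unlock(&ailp->xa_lock);
      			schedule();
      			spin_lock(&ailp->xa_lock);
      		}
      		spin_unlock(&ailp->xa_lock);
      		finish_wait(&ailp->xa_empty, &wait);
      	}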
      
      We keep an inode reclaim pass in the freeze process to avoid memory
      pressure having to reclaim inodes that would require dirtying the
      filesystem after the freeze has completed.  This means we can also
      treat unmount in the exact same way as freeze.
      
      As an upside we can now remove the radix tree based inode writeback
      and xfs_unmountfs_writesb.
      
      [ Dave Chinner:
      	- Cleaned up commit message.
      	- Added inode reclaim passes back into freeze.
      	- Cleaned up wakeup mechanism to avoid the use of a new
      	  sleep counter variable. ]
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      211e4d43
    • xfs: remove log item from AIL in xfs_iflush after a shutdown · 32ce90a4
      Christoph Hellwig authored
      If a filesystem has been forced shutdown we are never going to write inodes
      to disk, which means the inode items will stay in the AIL until we free
      the inode. Currently that is not a problem, but a pending change requires us
      to empty the AIL before shutting down the filesystem. In that case leaving
      the inode in the AIL is lethal. Make sure to remove the log item from the AIL
      to allow emptying the AIL on shutdown filesystems.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      32ce90a4
  4. 18 Apr 2012, 1 commit
    • xfs: Ensure inode reclaim can run during quotacheck · 8a00ebe4
      Dave Chinner authored
      Because the mount process can run a quotacheck and consume lots of
      inodes, we need to be able to run periodic inode reclaim during the
      mount process. This will prevent running the system out of memory
      during quota checks.
      
      This essentially reverts 2bcf6e97, but that is safe to do now that
      the quota sync code that was causing problems during long quotacheck
      executions is gone.
      
      The reclaim work is currently protected from running during the
      unmount process by a check against MS_ACTIVE. Unfortunately, this
      also means that the reclaim work cannot run during mount.  The
      unmount process should stop the reclaim cleanly before freeing
      anything that the reclaim work depends on, so there is no need to
      have this guard in place.
      
      Also, the inode reclaim work is demand driven, so there is no need
      to start it immediately during mount. It will be started the moment
      an inode is queued for reclaim, so quotacheck will trigger it just
      fine.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      8a00ebe4
  5. 14 Mar 2012, 1 commit
  6. 26 Feb 2012, 1 commit
    • xfs: only take the ILOCK in xfs_reclaim_inode() · ad637a10
      Alex Elder authored
      At the end of xfs_reclaim_inode(), the inode is locked in order to
      wait for a possible concurrent lookup to complete before the inode
      is freed.  This synchronization step was taking both the ILOCK and
      the IOLOCK, but the latter was causing lockdep to produce reports of
      the possibility of deadlock.
      
      It turns out that there's no need to acquire the IOLOCK at this
      point anyway.  It may have been required in some earlier version of
      the code, but there should be no need to take the IOLOCK in
      xfs_iget(), so there's no (longer) any need to get it here for
      synchronization.  Add an assertion in xfs_iget() as a reminder
      of this assumption.
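
      The assertion itself is a one-liner at the top of xfs_iget(), along
      these lines (lock_flags being the caller's requested locking):

      	/* reject any attempt to take the IOLOCK via the lookup path */
      	ASSERT((lock_flags & (XFS_IOLOCK_EXCL | XFS_IOLOCK_SHARED)) == 0);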
      
      Dave Chinner diagnosed this on IRC, and Christoph Hellwig suggested
      no longer including the IOLOCK.  I just put together the patch.
      Signed-off-by: Alex Elder <elder@dreamhost.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      ad637a10
  7. 18 Jan 2012, 1 commit
    • xfs: replace i_flock with a sleeping bitlock · 474fce06
      Christoph Hellwig authored
      We almost never block on i_flock, the exception being synchronous
      inode flushing.  Instead of bloating the inode with a 16/24-byte
      completion that we abuse as a semaphore, just implement it as a
      bitlock that uses a bit waitqueue for the rare sleeping path.  This
      is primarily a tradeoff between a much smaller inode and a faster
      non-blocking path vs faster wakeups, and we are much better off with
      the former.
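
      A sketch of the flush lock as a bitlock, assuming a __XFS_IFLOCK_BIT
      flag bit in ip->i_flags and a __xfs_iflock() slow path that sleeps
      on a bit waitqueue:

      	static inline int xfs_iflock_nowait(struct xfs_inode *ip)
      	{
      		return !xfs_iflags_test_and_set(ip, XFS_IFLOCK);
      	}

      	static inline void xfs_iflock(struct xfs_inode *ip)
      	{
      		if (!xfs_iflock_nowait(ip))
      			__xfs_iflock(ip);	/* rare sleeping path */
      	}

      	static inline void xfs_ifunlock(struct xfs_inode *ip)
      	{
      		xfs_iflags_clear(ip, XFS_IFLOCK);
      		wake_up_bit(&ip->i_flags, __XFS_IFLOCK_BIT);
      	}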
      
      A small downside is that we will lose lockdep checking for i_flock, but
      given that it's always taken inside the ilock that should be acceptable.
      
      Note that for example the inode writeback locking is implemented in a
      very similar way.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Alex Elder <aelder@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      474fce06
  8. 24 Dec 2011, 1 commit
    • xfs: log all dirty inodes in xfs_fs_sync_fs · be4f1ac8
      Christoph Hellwig authored
      Since Linux 2.6.36 the writeback code has introduced various measures
      for livelock prevention during sync().  Unfortunately some of these
      are actively harmful for the XFS model, where the inode gets marked
      dirty for metadata from the data I/O handler.
      
      The older_than_this checks are now more strictly enforced since

          writeback: avoid livelocking WB_SYNC_ALL writeback

      by only calling into __writeback_inodes_sb and thus only sampling the
      current cut-off time once.  But on a slow enough device the previous
      asynchronous sync pass might not have fully completed yet, and thus
      XFS might mark metadata dirty only after the sampling of the cut-off
      time for the blocking pass has already happened.  I have not
      reproduced this myself on a real system, but by introducing
      artificial delay into the XFS I/O completion workqueues it can be
      reproduced easily.
      
      Fix this by iterating over all XFS inodes in ->sync_fs and logging
      all that are dirty.  This might log inodes that only got redirtied
      after the previous pass, but given how cheap delayed logging of
      inodes is it isn't a major concern for performance.
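
      A sketch of the per-inode helper, assuming a small transaction that
      relogs the inode core whenever unlogged (timestamp) updates are
      pending; names and reservations are modelled on the fsync timestamp
      path:

      	STATIC int
      	xfs_log_dirty_inode(
      		struct xfs_inode	*ip,
      		struct xfs_perag	*pag,
      		int			flags)
      	{
      		struct xfs_mount	*mp = ip->i_mount;
      		struct xfs_trans	*tp;
      		int			error;

      		if (!ip->i_update_core)
      			return 0;

      		tp = xfs_trans_alloc(mp, XFS_TRANS_FSYNC_TS);
      		error = xfs_trans_reserve(tp, 0, XFS_FSYNC_TS_LOG_RES(mp),
      					  0, 0, 0);
      		if (error) {
      			xfs_trans_cancel(tp, 0);
      			return error;
      		}

      		xfs_ilock(ip, XFS_ILOCK_EXCL);
      		xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
      		xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
      		return xfs_trans_commit(tp, 0);
      	}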
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Tested-by: Mark Tinguely <tinguely@sgi.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      be4f1ac8
  9. 13 Dec 2011, 1 commit
  10. 30 Nov 2011, 1 commit
  11. 12 Oct 2011, 3 commits
  12. 13 Aug 2011, 1 commit
    • xfs: remove subdirectories · c59d87c4
      Christoph Hellwig authored
      Use the move from Linux 2.6 to Linux 3.x as an excuse to kill the
      annoying subdirectories in the XFS source code.  Besides the large
      number of file renames, the only changes are to the Makefile, a few
      files including headers with the subdirectory prefix, and the binary
      sysctl compat code that includes a header under fs/xfs/ from
      kernel/.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Alex Elder <aelder@sgi.com>
      c59d87c4
  13. 26 Jul 2011, 1 commit
  14. 21 Jul 2011, 2 commits
  15. 08 Jul 2011, 1 commit
    • xfs: improve sync behaviour in the face of aggressive dirtying · 33b8f7c2
      Christoph Hellwig authored
      The following script from Wu Fengguang shows very bad behaviour when
      aggressively dirtying data during a sync on XFS, with sync times up
      to almost 10 times as long as on ext4.
      
      A large part of the issue is that XFS writes data out itself two
      times in the ->sync_fs method, overriding the livelock protection in
      the core writeback code, and another issue is the lock-less
      xfs_ioend_wait call, which doesn't prevent new ioends from being
      queued up while waiting for the count to reach zero.
      
      This patch removes the XFS-internal sync calls and relies on the VFS
      to do its work just like all other filesystems do.  Note that the
      rather suboptimal i_iocount wait is simply removed here.  We already
      do it in ->write_inode, which keeps the current suboptimal behaviour.
      We'll eventually need to remove that as well, but that's material for
      a separate commit.
      
      ------------------------------ snip ------------------------------
      #!/bin/sh
      
      umount /dev/sda7
      mkfs.xfs -f /dev/sda7
      # mkfs.ext4 /dev/sda7
      # mkfs.btrfs /dev/sda7
      mount /dev/sda7 /fs
      
      echo $((50<<20)) > /proc/sys/vm/dirty_bytes
      
      pid=
      for i in `seq 10`
      do
      	dd if=/dev/zero of=/fs/zero-$i bs=1M count=1000 &
      	pid="$pid $!"
      done
      
      sleep 1
      
      tic=$(date +'%s')
      sync
      tac=$(date +'%s')
      
      echo
      echo sync time: $((tac-tic))
      egrep '(Dirty|Writeback|NFS_Unstable)' /proc/meminfo
      
      pidof dd > /dev/null && { kill -9 $pid; echo sync NOT livelocked; }
      ------------------------------ snip ------------------------------
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reported-by: Wu Fengguang <fengguang.wu@intel.com>
      Reviewed-by: Alex Elder <aelder@sgi.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      33b8f7c2
  16. 25 May 2011, 1 commit
  17. 20 May 2011, 1 commit
    • xfs: avoid getting stuck during async inode flushes · ee58abdf
      Dave Chinner authored
      When the underlying inode buffer is locked and xfs_sync_inode_attr()
      is doing a non-blocking flush, xfs_iflush() can return EAGAIN.  When
      this happens, clear the error rather than returning it to
      xfs_inode_ag_walk(), as returning EAGAIN will result in the AG walk
      delaying for a short while and trying again. This can result in
      background walks getting stuck on the one AG until the inode buffer
      is unlocked by some other means.
      
      This behaviour was noticed when analysing event traces followed by
      code inspection and verification of the fix via further traces.
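
      A sketch of the fix at the xfs_iflush() call in the attribute sync
      path:

      	error = xfs_iflush(ip, flags);
      	/*
      	 * EAGAIN only means the inode cluster buffer is currently
      	 * locked; swallow the error so the AG walk moves on rather
      	 * than stalling on this AG.
      	 */
      	if (error == EAGAIN)
      		error = 0;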
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Alex Elder <aelder@sgi.com>
      ee58abdf
  18. 10 May 2011, 2 commits
    • xfs: ensure reclaim cursor is reset correctly at end of AG · 228d62dd
      Dave Chinner authored
      On a 32 bit highmem PowerPC machine, the XFS inode cache was growing
      without bound and exhausting low memory causing the OOM killer to be
      triggered. After some effort, the problem was reproduced on a 32 bit
      x86 highmem machine.
      
      The problem is that the per-ag inode reclaim index cursor was not
      getting reset to the start of the AG if the radix tree tag lookup
      found no more reclaimable inodes. Hence every further reclaim
      attempt started at the same index beyond where any reclaimable
      inodes lay, and no further background reclaim ever occurred from the
      AG.
      
      Without background inode reclaim the VM driven cache shrinker
      simply cannot keep up with cache growth, and OOM is the result.
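
      A sketch of the fix in the per-AG reclaim walk, assuming the cursor
      is only preserved for trylock (background) passes that stopped
      part-way through the AG:

      	if (!nr_found) {
      		/*
      		 * No more reclaimable inodes: flag the walk as done so
      		 * the cursor is reset instead of being left mid-AG.
      		 */
      		done = 1;
      		break;
      	}

      	/* ... after the walk completes ... */
      	if (trylock && !done)
      		pag->pag_ici_reclaim_cursor = first_index;
      	else
      		pag->pag_ici_reclaim_cursor = 0;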
      
      While the change that exposed the problem was the conversion of the
      inode reclaim to use work queues for background reclaim, it was not
      the cause of the bug. The bug was introduced when the cursor code
      was added, just waiting for some weird configuration to strike....
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Tested-by: Christian Kujau <lists@nerdbynature.de>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Alex Elder <aelder@sgi.com>
      
      (cherry picked from commit b2232219)
      228d62dd
    • xfs: ensure reclaim cursor is reset correctly at end of AG · b2232219
      Dave Chinner authored
      On a 32 bit highmem PowerPC machine, the XFS inode cache was growing
      without bound and exhausting low memory causing the OOM killer to be
      triggered. After some effort, the problem was reproduced on a 32 bit
      x86 highmem machine.
      
      The problem is that the per-ag inode reclaim index cursor was not
      getting reset to the start of the AG if the radix tree tag lookup
      found no more reclaimable inodes. Hence every further reclaim
      attempt started at the same index beyond where any reclaimable
      inodes lay, and no further background reclaim ever occurred from the
      AG.
      
      Without background inode reclaim the VM driven cache shrinker
      simply cannot keep up with cache growth, and OOM is the result.
      
      While the change that exposed the problem was the conversion of the
      inode reclaim to use work queues for background reclaim, it was not
      the cause of the bug. The bug was introduced when the cursor code
      was added, just waiting for some weird configuration to strike....
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Tested-by: Christian Kujau <lists@nerdbynature.de>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Alex Elder <aelder@sgi.com>
      b2232219
  19. 08 Apr 2011, 4 commits
    • xfs: push the AIL from memory reclaim and periodic sync · fd074841
      Dave Chinner authored
      When we are short on memory, we want to expedite the cleaning of
      dirty objects.  Hence when we run short on memory, we need to kick
      the AIL flushing into action to clean as many dirty objects as
      quickly as possible.  To implement this, sample the lsn of the log
      item at the head of the AIL and use that as the push target for the
      AIL flush.
      
      Further, we keep dirty items in the AIL that are not tracked any
      other way, so we can get objects sitting in the AIL that don't get
      written back until the AIL is pushed. Hence to get the filesystem to
      the idle state, we might need to push the AIL to flush out any
      remaining dirty objects sitting in the AIL. This requires the same
      push mechanism as the reclaim push.
      
      This patch also renames xfs_trans_ail_tail() to xfs_ail_min_lsn() to
      match the new xfs_ail_max_lsn() function introduced in this patch.
      Similarly for xfs_trans_ail_push -> xfs_ail_push.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Alex Elder <aelder@sgi.com>
      fd074841
    • xfs: introduce background inode reclaim work · a7b339f1
      Dave Chinner authored
      Background inode reclaim needs to run more frequently than the XFS
      syncd work is run, as 30s is too long between optimal reclaim runs.
      Add a new periodic work item to the xfs syncd workqueue to run a
      fast, non-blocking inode reclaim scan.
      
      Background inode reclaim is kicked by the act of marking inodes for
      reclaim.  When an AG is first marked as having reclaimable inodes,
      the background reclaim work is kicked. It will continue to run
      periodically until it detects that there are no more reclaimable
      inodes. It will be kicked again when the first inode is queued for
      reclaim.
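
      A sketch of the kick, assuming the reclaim work runs at a fraction
      of the syncd period and is only queued while some AG is tagged as
      holding reclaimable inodes:

      	static void
      	xfs_syncd_queue_reclaim(
      		struct xfs_mount	*mp)
      	{
      		if (radix_tree_tagged(&mp->m_perag_tree, XFS_ICI_RECLAIM_TAG))
      			queue_delayed_work(xfs_syncd_wq, &mp->m_reclaim_work,
      				msecs_to_jiffies(xfs_syncd_centisecs / 6 * 10));
      	}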
      
      To ensure shrinker-based inode reclaim throttles to the inode
      cleaning and reclaim rate but still reclaims inodes efficiently,
      make it kick the background inode reclaim so that when we are low on
      memory we are trying to reclaim inodes as efficiently as possible.
      This kick should not be necessary, but it will protect against
      failures to kick the background reclaim when inodes are first
      dirtied.
      
      To provide the rate throttling, make the shrinker pass do
      synchronous inode reclaim so that it blocks on inodes under IO. This
      means that the shrinker will reclaim inodes rather than just
      skipping over them, but it does not adversely affect the rate of
      reclaim because most dirty inodes are already under IO due to the
      background reclaim work the shrinker kicked.
      
      These two modifications solve one of the two OOM killer invocations
      Chris Mason reported recently when running a stress testing script.
      The particular workload trigger for the OOM killer invocation is
      where there are more threads than CPUs all unlinking files in an
      extremely memory constrained environment. Unlike other solutions,
      this one does not have an impact on performance when memory is not
      constrained or when the number of concurrent threads operating is
      less than or equal to the number of CPUs.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Alex Elder <aelder@sgi.com>
      a7b339f1
    • xfs: convert ENOSPC inode flushing to use new syncd workqueue · 89e4cb55
      Dave Chinner authored
      One of the problems with the current inode flush at ENOSPC is that
      we queue a flush per ENOSPC event, regardless of how many are
      already queued. This can result in hundreds of queued flushes, most
      of which simply burn CPU scanning and do no real work. This simply
      slows down allocation at ENOSPC.
      
      We really only need one active flush at a time, and we can easily
      implement that via the new xfs_syncd_wq. All we need to do is queue
      a flush if one is not already active, then block waiting for the
      currently active flush to complete. The result is that we only ever
      have a single ENOSPC inode flush active at a time and this greatly
      reduces the overhead of ENOSPC processing.
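
      A sketch of the resulting flush path; queue_work() is a no-op if
      the flush work is already queued, and the flush then waits for the
      active one to finish:

      	void
      	xfs_flush_inodes(
      		struct xfs_inode	*ip)
      	{
      		struct xfs_mount	*mp = ip->i_mount;

      		queue_work(xfs_syncd_wq, &mp->m_flush_work);
      		flush_work_sync(&mp->m_flush_work);
      	}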
      
      On my 2p test machine, this results in tests exercising ENOSPC
      conditions running significantly faster - 042 halves execution time,
      083 drops from 60s to 5s, etc - while not introducing test
      regressions.
      
      This allows us to remove the old xfssyncd threads and infrastructure
      as they are no longer used.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Alex Elder <aelder@sgi.com>
      89e4cb55
    • xfs: introduce a xfssyncd workqueue · c6d09b66
      Dave Chinner authored
      All of the work xfssyncd does is background functionality. There is
      no need for a thread per filesystem to do this work - it can all be
      managed by a global workqueue now that workqueues manage concurrency
      effectively.
      
      Introduce a new global xfssyncd workqueue, and convert the periodic
      work to use this new functionality. To do this, use a delayed work
      construct to schedule the next running of the periodic sync work
      for the filesystem. When the sync work is complete, queue a new
      delayed work for the next running of the sync work.
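
      A sketch of the scheduling, with the work function requeueing
      itself:

      	static void
      	xfs_syncd_queue_sync(
      		struct xfs_mount	*mp)
      	{
      		queue_delayed_work(xfs_syncd_wq, &mp->m_sync_work,
      				msecs_to_jiffies(xfs_syncd_centisecs * 10));
      	}

      	STATIC void
      	xfs_sync_worker(
      		struct work_struct *work)
      	{
      		struct xfs_mount *mp = container_of(to_delayed_work(work),
      						struct xfs_mount, m_sync_work);

      		/* ... the periodic sync work itself ... */

      		/* queue us up again */
      		xfs_syncd_queue_sync(mp);
      	}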
      
      For laptop mode, we wait for the sync work to complete, so ensure
      that the sync work queuing interface can flush and wait for work to
      complete, allowing the workqueue infrastructure to replace the
      current sequence number and wakeup mechanism that is used.
      
      Because the sync work does non-trivial amounts of work, mark the
      new work queue as CPU intensive.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Alex Elder <aelder@sgi.com>
      c6d09b66
  20. 31 Mar 2011, 1 commit
  21. 26 Mar 2011, 1 commit
    • xfs: introduce inode cluster buffer trylocks for xfs_iflush · 1bfd8d04
      Dave Chinner authored
      There is an ABBA deadlock between synchronous inode flushing in
      xfs_reclaim_inode and xfs_ifree_cluster. xfs_ifree_cluster locks the
      buffer, then takes inode ilocks, whilst synchronous reclaim takes
      the ilock followed by the buffer lock in xfs_iflush().
      
      To avoid this deadlock, separate the inode cluster buffer locking
      semantics from the synchronous inode flush semantics, allowing
      callers to attempt to lock the buffer but still issue synchronous IO
      if they can get the buffer. This requires xfs_iflush() calls that
      currently use non-blocking semantics to pass SYNC_TRYLOCK rather
      than 0 as the flags parameter.
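
      In the reclaim path the non-blocking attempt then looks roughly
      like this (restart handling elided):

      	error = xfs_iflush(ip, SYNC_TRYLOCK);
      	if (error == EAGAIN) {
      		/*
      		 * The cluster buffer is locked: drop the ilock and
      		 * retry the reclaim of this inode later.
      		 */
      		xfs_iunlock(ip, XFS_ILOCK_EXCL);
      		return error;
      	}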
      
      This allows xfs_reclaim_inode to avoid the deadlock on the buffer
      lock and detect the failure so that it can drop the inode ilock and
      restart the reclaim attempt on the inode. This allows
      xfs_ifree_cluster to obtain the inode lock, mark the inode stale and
      release it and hence defuse the deadlock situation. It also has the
      pleasant side effect of avoiding IO in xfs_reclaim_inode when it
      tries to next reclaim the inode as it is now marked stale.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Alex Elder <aelder@sgi.com>
      1bfd8d04
  22. 07 Mar 2011, 1 commit
  23. 12 Jan 2011, 1 commit
    • xfs: ensure log covering transactions are synchronous · c58efdb4
      Dave Chinner authored
      To ensure the log is covered and the filesystem idles correctly, we
      need to ensure that dummy transactions hit the disk and do not stay
      pinned in memory.  If the superblock is pinned in memory, it can't
      be flushed so the log covering cannot make progress. The result is
      dependent on timing - more often than not we continue to issue a
      log covering transaction every 36s rather than idling after ~90s.
      
      Fix this by making the log covering transaction synchronous. To
      avoid additional log force from xfssyncd, make the log covering
      transaction take the place of the existing log force in the xfssyncd
      background sync process.
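
      A sketch of the change in the dummy transaction, using the existing
      synchronous commit hook:

      	/* log covering only makes progress if the commit hits the disk */
      	xfs_trans_set_sync(tp);
      	error = xfs_trans_commit(tp, 0);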
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Alex Elder <aelder@sgi.com>
      c58efdb4
  24. 16 Dec 2010, 1 commit
  25. 17 Dec 2010, 1 commit
    • xfs: convert inode cache lookups to use RCU locking · 1a3e8f3d
      Dave Chinner authored
      With delayed logging greatly increasing the sustained parallelism of
      inode operations, the inode cache locking is showing significant
      read vs write contention when inode reclaim runs at the same time as
      lookups. There are also a lot more write lock acquisitions than
      there are read locks (a 4:1 ratio), so the read locking is not
      really buying us much in the way of parallelism.
      
      To avoid the read vs write contention, change the cache to use RCU
      locking on the read side. To avoid needing to RCU free every single
      inode, use the built-in slab RCU freeing mechanism. This requires us
      to be able to detect lookups of freed inodes, so ensure that every
      freed inode has an inode number of zero and the XFS_IRECLAIM flag
      set. We already check the XFS_IRECLAIM flag in the cache hit lookup
      path, but also add a check for a zero inode number as well.

      We can then convert all the read-locking lookups to use RCU read
      side locking and hence remove all the existing read side locking.
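
      A sketch of the cache-hit validation under RCU, assuming the inode
      slab is created with SLAB_DESTROY_BY_RCU (so inode memory can be
      reused, but not returned to the page allocator, within a grace
      period):

      	rcu_read_lock();
      	ip = radix_tree_lookup(&pag->pag_ici_root, agino);
      	if (ip) {
      		spin_lock(&ip->i_flags_lock);
      		if (ip->i_ino != ino ||
      		    __xfs_iflags_test(ip, XFS_IRECLAIM)) {
      			/* freed or recycled inode: skip it and retry */
      			spin_unlock(&ip->i_flags_lock);
      			rcu_read_unlock();
      			return EAGAIN;
      		}
      		/* ... normal cache hit handling ... */
      	}
      	rcu_read_unlock();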
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Alex Elder <aelder@sgi.com>
      1a3e8f3d
  26. 11 Nov 2010, 1 commit
  27. 19 Oct 2010, 1 commit
    • xfs: serialise inode reclaim within an AG · 69b491c2
      Dave Chinner authored
      Memory reclaim via shrinkers has a terrible habit of having N+M
      concurrent shrinker executions (N = num CPUs, M = num kswapds) all
      trying to shrink the same cache. When the cache they are all working
      on is protected by a single spinlock, massive contention and
      slowdowns occur.
      
      Wrap the per-ag inode caches with a reclaim mutex to serialise
      reclaim access to the AG. This will block concurrent reclaim in each
      AG but still allow reclaim to scan multiple AGs concurrently. Allow
      shrinkers to move on to the next AG if they can't get the lock, and
      if we can't get any AG, then start blocking on locks.
      
      To prevent reclaimers from continually scanning the same inodes in
      each AG, add a cursor that tracks where the last reclaim got up to
      and start from that point on the next reclaim. This should avoid
      only ever scanning a small number of inodes at the start of each AG
      and not making progress. If we have a non-shrinker based reclaim
      pass, ignore the cursor and reset it to zero once we are done.
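
      A sketch of the serialisation and cursor handling, assuming a
      pag_ici_reclaim_lock mutex and pag_ici_reclaim_cursor field on the
      per-AG structure:

      	if (trylock) {
      		if (!mutex_trylock(&pag->pag_ici_reclaim_lock)) {
      			/* another reclaimer owns this AG: try the next */
      			skipped++;
      			continue;
      		}
      		first_index = pag->pag_ici_reclaim_cursor;
      	} else {
      		mutex_lock(&pag->pag_ici_reclaim_lock);
      	}

      	/* ... reclaim pass over this AG starting at first_index ... */

      	if (trylock)
      		pag->pag_ici_reclaim_cursor = first_index;
      	else
      		pag->pag_ici_reclaim_cursor = 0;
      	mutex_unlock(&pag->pag_ici_reclaim_lock);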
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Alex Elder <aelder@sgi.com>
      69b491c2