1. 20 May, 2014 1 commit
    • D
      xfs: log vector rounding leaks log space · 110dc24a
      Dave Chinner committed
      The addition of direct formatting of log items into the CIL
      linear buffer added alignment restrictions that the start of each
      vector needed to be 64 bit aligned. Hence padding was added in
      xlog_finish_iovec() to round up the vector length to ensure the next
      vector started with the correct alignment.
      
      This adds a small number of bytes to the size of
      the linear buffer that is otherwise unused. The issue is that we
      then use the linear buffer size to determine the log space used by
      the log item, and this includes the unused space. Hence when we
      account for space used by the log item, it's more than is actually
      written into the iclogs, and hence we slowly leak this space.
      
      This results in log hangs when reserving space, with threads getting
      stuck with these stack traces:
      
      Call Trace:
      [<ffffffff81d15989>] schedule+0x29/0x70
      [<ffffffff8150d3a2>] xlog_grant_head_wait+0xa2/0x1a0
      [<ffffffff8150d55d>] xlog_grant_head_check+0xbd/0x140
      [<ffffffff8150ee33>] xfs_log_reserve+0x103/0x220
      [<ffffffff814b7f05>] xfs_trans_reserve+0x2f5/0x310
      .....
      
      The 4 bytes is significant. Brian Foster did all the hard work in
      tracking down a reproducible leak to inode chunk allocation (it went
      away with the ikeep mount option). His rough numbers were that
      creating 50,000 inodes leaked 11 log blocks. This turns out to be
      roughly 800 inode chunks or 1600 inode cluster buffers. That
      works out at roughly 4 bytes per cluster buffer logged, and at that
      I started looking for a 4 byte leak in the buffer logging code.
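
The estimate above can be reproduced with a few lines of arithmetic. This is only a rough sketch and is not part of the commit: the 64 inodes per chunk, two cluster buffers per chunk, and 512-byte log block are assumptions chosen to match the rough numbers quoted.

```python
# Back-of-the-envelope check of the leak estimate quoted above.
# ASSUMPTIONS (not stated in the commit message): 64 inodes per inode
# chunk, 2 cluster buffers per chunk, and a 512-byte log block.
INODES_PER_CHUNK = 64
CLUSTER_BUFFERS_PER_CHUNK = 2
LOG_BLOCK_BYTES = 512


def leaked_bytes_per_buffer(inodes_created, log_blocks_leaked):
    """Average number of log bytes leaked per logged inode cluster buffer."""
    chunks = inodes_created / INODES_PER_CHUNK
    cluster_buffers = chunks * CLUSTER_BUFFERS_PER_CHUNK
    return log_blocks_leaked * LOG_BLOCK_BYTES / cluster_buffers


if __name__ == "__main__":
    per_buffer = leaked_bytes_per_buffer(50_000, 11)
    print(f"~{per_buffer:.1f} bytes leaked per cluster buffer")
```

Under these assumptions the result lands between 3 and 4 bytes per buffer, which is consistent with the "roughly 4 bytes" figure.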
      
      What I found was that a struct xfs_buf_log_format structure for an
      inode cluster buffer is 28 bytes in length. This gets rounded up to
      32 bytes, but the vector length remains 28 bytes. Hence the CIL
      ticket reservation is decremented by 32 bytes (via lv->lv_buf_len)
      for that vector rather than 28 bytes which are written into the log.
      
      The fix for this problem is to separately track the bytes used by
      the log vectors in the item and use that instead of the buffer
      length when accounting for the log space that will be used by the
      formatted log item.
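
The difference between the two lengths can be modelled in a few lines. The sketch below is a userspace illustration, not the kernel code: the class and field names are hypothetical, and only the 8-byte alignment and the 28-byte structure size come from the text above.

```python
# Model of the accounting bug: each vector is padded to an 8-byte
# boundary in the buffer, but only the unpadded bytes reach the log.
ALIGNMENT = 8  # 64-bit alignment, as described in the commit message


def round_up(length, alignment=ALIGNMENT):
    """Round length up to the next multiple of alignment."""
    return (length + alignment - 1) // alignment * alignment


class LogVector:
    """Hypothetical stand-in for a log vector; field names are illustrative."""

    def __init__(self):
        self.buf_len = 0  # space consumed in the linear buffer (with padding)
        self.bytes = 0    # bytes that will actually be written to the log

    def finish_iovec(self, length):
        """Account for one formatted vector of the given unpadded length."""
        self.buf_len += round_up(length)
        self.bytes += length


if __name__ == "__main__":
    lv = LogVector()
    lv.finish_iovec(28)  # e.g. a 28-byte format structure
    print(lv.buf_len, lv.bytes, lv.buf_len - lv.bytes)
```

Charging the reservation with the unpadded byte count rather than the padded buffer length is what removes the 4-byte discrepancy in this model.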
      
      Again, thanks to Brian Foster for doing all the hard work and long
      hours to isolate this leak and make finding the bug relatively
      simple.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      
      110dc24a
  2. 27 February, 2014 1 commit
    • D
      xfs: always do log forces via the workqueue · f876e446
      Dave Chinner committed
      Log forces can occur deep in the call chain when we have relatively
      little stack free. Log forces can also happen at close to the call
      chain leaves (e.g. xfs_buf_lock()) and hence we can trigger IO from
      places where we really don't want to add more stack overhead.
      
      This stack overhead occurs because log forces do foreground CIL
      pushes (xlog_cil_push_foreground()) rather than waking the
      background push wq and waiting for the push to complete.
      This foreground push was done to avoid confusing the CFQ IO
      scheduler when fsync()s were issued, as it has trouble dealing with
      dependent IOs being issued from different process contexts.
      
      Avoiding blowing the stack is much more critical than performance
      optimisations for CFQ, especially as we've been recommending against
      the use of CFQ for XFS since 3.2 kernels were released because of
      its problems with multi-threaded IO workloads.
      
      Hence convert xlog_cil_push_foreground() to move the push work
      to the CIL workqueue. We already do the waiting for the push to
      complete in xlog_cil_force_lsn(), so there's nothing else we need to
      modify to make this work.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Brian Foster <bfoster@redhat.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      f876e446
  3. 10 February, 2014 1 commit
  4. 07 February, 2014 1 commit
  5. 13 December, 2013 2 commits
  6. 24 October, 2013 2 commits
    • D
      xfs: decouple log and transaction headers · 239880ef
      Dave Chinner committed
      xfs_trans.h has a dependency on xfs_log.h for a couple of
      structures. Most code that does transactions doesn't need to know
      anything about the log, but this dependency means that they have to
      include xfs_log.h. Decouple the xfs_trans.h and xfs_log.h header
      files and clean up the includes to be in dependency order.
      
      In doing this, remove the direct include of xfs_trans_reserve.h from
      xfs_trans.h so that we remove the dependency between xfs_trans.h and
      xfs_mount.h. Hence the xfs_trans.h include can be moved to
      indicate the actual dependencies other header files have on it.
      
      Note that these are kernel only header files, so this does not
      translate to any userspace changes at all.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Ben Myers <bpm@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      239880ef
    • D
      xfs: create a shared header file for format-related information · 70a9883c
      Dave Chinner committed
      All of the buffer operations structures need to be exported
      for xfs_db, so move them all to a common location rather than
      spreading them all over the place. They are verifying the on-disk
      format, so while xfs_format.h might be a good place, it is not part
      of the on disk format.
      
      Hence we need to create a new header file in which we centralise these
      related definitions. Start by moving the buffer operations
      structures, and then also move all the other definitions that have
      crept into xfs_log_format.h and xfs_format.h as there was no other
      shared header file to put them in.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      70a9883c
  7. 17 October, 2013 1 commit
    • D
      xfs: prevent deadlock trying to cover an active log · 2c6e24ce
      Dave Chinner committed
      Recent analysis of a deadlocked XFS filesystem from a kernel
      crash dump indicated that the filesystem was stuck waiting for log
      space. The short story of the hang on the RHEL6 kernel is this:
      
      	- the tail of the log is pinned by an inode
      	- the inode has been pushed by the xfsaild
      	- the inode has been flushed to its backing buffer and is
      	  currently flush locked and hence waiting for backing
      	  buffer IO to complete and remove it from the AIL
      	- the backing buffer is marked for write - it is on the
      	  delayed write queue
      	- the inode buffer has been modified directly and logged
      	  recently due to unlinked inode list modification
      	- the backing buffer is pinned in memory as it is in the
      	  active CIL context.
      	- the xfsbufd won't start buffer writeback because it is
      	  pinned
      	- xfssyncd won't force the log because it sees the log as
      	  needing to be covered and hence wants to issue a dummy
      	  transaction to move the log covering state machine along.
      
      Hence there is no trigger to force the CIL to the log and hence
      unpin the inode buffer and therefore complete the inode IO, remove
      it from the AIL and hence move the tail of the log along, allowing
      transactions to start again.
      
      Mainline kernels also have the same deadlock, though the signature
      is slightly different - the inode buffer never reaches the delayed
      write lists because xfs_buf_item_push() sees that it is pinned and
      hence never adds it to the delayed write list that the xfsaild
      flushes.
      
      There are two possible solutions here. The first is to simply force
      the log before trying to cover the log and so ensure that the CIL is
      emptied before we try to reserve space for the dummy transaction in
      the xfs_log_worker(). While this might work most of the time, it is
      still racy and is no guarantee that we don't get stuck in
      xfs_trans_reserve waiting for log space to come free. Hence it's not
      the best way to solve the problem.
      
      The second solution is to modify xfs_log_need_covered() to be aware
      of the CIL. We only should be attempting to cover the log if there
      is no current activity in the log - covering the log is the process
      of ensuring that the head and tail in the log on disk are identical
      (i.e. the log is clean and at idle). Hence, by definition, if there
      are items in the CIL then the log is not at idle and so we don't
      need to attempt to cover it.
      
      When we don't need to cover the log because it is active or idle, we
      issue a log force from xfs_log_worker() - if the log is idle, then
      this does nothing.  However, if the log is active due to there being
      items in the CIL, it will force the items in the CIL to the log and
      unpin them.
      
      In the case of the above deadlock scenario, instead of
      xfs_log_worker() getting stuck in xfs_trans_reserve() attempting to
      cover the log, it will instead force the log, thereby unpinning the
      inode buffer, allowing IO to be issued and complete and hence
      removing the inode that was pinning the tail of the log from the
      AIL. At that point, everything will start moving along again. i.e.
      the xfs_log_worker turns back into a watchdog that can alleviate
      deadlocks based around pinned items that prevent the tail of the log
      from being moved...
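
The decision described above can be sketched as a small model. This is a simplification under stated assumptions, not the kernel implementation: `LogState`, its fields, and the returned strings are hypothetical, and the covering state machine is reduced to a single flag.

```python
from dataclasses import dataclass


@dataclass
class LogState:
    """Hypothetical, simplified view of the log; not the kernel's structures."""
    cil_items: int          # items currently queued in the CIL
    wants_covering: bool    # covering state machine wants a dummy transaction


def need_covered(state):
    """Only consider covering the log when nothing is queued in the CIL."""
    if state.cil_items > 0:
        return False        # items in the CIL mean the log is not idle
    return state.wants_covering


def log_worker(state):
    """Return the action the periodic worker would take in this model."""
    if need_covered(state):
        return "dummy transaction"
    return "log force"      # harmless when idle, unpins CIL items when active
```

In the deadlock scenario the CIL holds the pinned inode buffer, so `log_worker(LogState(cil_items=1, wants_covering=True))` yields a log force rather than a dummy transaction that would block on log space.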
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Eric Sandeen <sandeen@redhat.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      2c6e24ce
  8. 14 August, 2013 5 commits
    • D
      xfs: split the CIL lock · 4bb928cd
      Dave Chinner committed
      The xc_cil_lock is used for two purposes - to protect the CIL
      itself, and to protect the push/commit state and lists. These are
      two logically separate structures and operations, so can have their
      own locks. This means that pushing on the CIL and the commit wait
      ordering won't contend for a lock with other transactions that are
      completing concurrently. As the CIL insertion is the hottest path
      through the CIL, this is a big win.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      4bb928cd
    • D
      xfs: Combine CIL insert and prepare passes · 991aaf65
      Dave Chinner committed
      Now that all the log item preparation and formatting is done under
      the CIL lock, we can get rid of the intermediate log vector chain
      used to track items to be inserted into the CIL.
      
      We can already find all the items to be committed from the
      transaction handle, so as long as we attach the log vectors to the
      item before we insert the items into the CIL, we don't need to
      create a log vector chain to pass around.
      
      This means we can rework all the item insertion code and optimise
      it into a pair of simple passes across all the items in the
      transaction. The first pass does the formatting and accounting, the
      second inserts them all into the CIL.
      
      We keep this two pass split so that we can separate the CIL
      insertion - which must be done under the CIL spinlock - from the
      formatting. We could insert each item into the CIL with a single
      pass, but that massively increases the number of times we have to
      grab the CIL spinlock. It is much more efficient (and hence
      scalable) to do a batch operation and insert all objects in a single
      lock grab.
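
The two-pass structure can be illustrated with a small userspace model. This is a sketch under assumptions, not the kernel code: `CountingLock`, `format_item`, and `commit_items` are hypothetical names, and a plain Python lock stands in for the CIL spinlock.

```python
import threading


class CountingLock:
    """A lock wrapper that counts how often it is taken (for illustration)."""

    def __init__(self):
        self._lock = threading.Lock()
        self.acquisitions = 0

    def __enter__(self):
        self._lock.acquire()
        self.acquisitions += 1
        return self

    def __exit__(self, *exc):
        self._lock.release()


def format_item(item):
    """Hypothetical formatting step: turn an item into its logged bytes."""
    return item.encode()


def commit_items(items, cil, cil_lock):
    """Format every item first, then insert them all under one lock grab."""
    # Pass 1: formatting and accounting, done without the spinlock.
    formatted = [format_item(item) for item in items]
    space_used = sum(len(buf) for buf in formatted)
    # Pass 2: batch insertion under a single lock acquisition.
    with cil_lock:
        cil.extend(formatted)
    return space_used
```

Inserting each item in its own pass would take the lock once per item; in this model the lock is taken once per transaction regardless of how many items it carries.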
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      991aaf65
    • D
      xfs: avoid CIL allocation during insert · f5baac35
      Dave Chinner committed
      Now that we have the size of the log vector that has been allocated,
      we can determine if we need to allocate a new log vector for
      formatting and insertion. We only need to allocate a new vector if
      it won't fit into the existing buffer.
      
      However, we need to hold the CIL context lock while we do this so
      that we can't race with a push draining the currently queued log
      vectors. It is safe to do this as long as we do GFP_NOFS allocation
      to avoid memory allocation recursing into the filesystem.
      Hence we can safely overwrite the existing log vector on the CIL if
      it is large enough to hold all the dirty regions of the current
      item.
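
The reuse decision can be sketched as follows. This is an illustrative model only: `LogVector` and `get_log_vector` are hypothetical names, and the locking and GFP_NOFS concerns described above are deliberately left out.

```python
class LogVector:
    """Hypothetical log vector that remembers the size of its buffer."""

    def __init__(self, size):
        self.size = size            # allocated buffer size in bytes
        self.buf = bytearray(size)
        self.used = 0


def get_log_vector(existing, needed):
    """Reuse the existing vector if the dirty regions fit, else allocate."""
    if existing is not None and existing.size >= needed:
        existing.used = 0           # overwrite in place, no new allocation
        return existing
    return LogVector(needed)
```

For example, relogging an item that needs 48 bytes into a vector with a 64-byte buffer returns the same object, while a 128-byte requirement produces a fresh allocation.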
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      f5baac35
    • D
      xfs: Reduce allocations during CIL insertion · 7492c5b4
      Dave Chinner committed
      Now that we have the size of the object before the formatting pass
      is called, we can allocate the log vector and its buffer in a
      single allocation rather than two separate allocations.
      
      Store the size of the allocated buffer in the log vector so that
      we potentially avoid allocation for future modifications of the
      object.
      
      While touching this code, remove the IOP_FORMAT definition.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      7492c5b4
    • D
      xfs: return log item size in IOP_SIZE · 166d1368
      Dave Chinner committed
      To begin optimising the CIL commit process, we need to have IOP_SIZE
      return both the number of vectors and the size of the data pointed
      to by the vectors. This enables us to calculate the size of the
      memory allocation needed before the formatting step and reduces the
      number of memory allocations per item by one.
      
      While there, kill the IOP_SIZE macro.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      166d1368
  9. 28 June, 2013 1 commit
    • D
      xfs: Introduce ordered log vector support · fd63875c
      Dave Chinner committed
      An "ordered log vector" is a log vector that is used for
      tracking a log item through the CIL and into the AIL as part of the
      log checkpointing. These ordered log vectors are special in that
      they are not written to the journal in any way, and are not accounted
      to the checkpoint being written.
      
      The reason for this behaviour is to allow operations to attach items
      to transactions and have them follow the normal transactional
      lifecycle without actually having to write them to the journal. This
      allows logging of items that track high level logical changes and
      writing them to the log, while the physical items being modified
      pass through into the AIL and pin the tail of the log (and therefore
      the logical item in the log) until all the modified items are
      physically written to disk.
      
      IOWs, it allows us to write metadata without physically logging
      every individual change but still maintain the full transactional
      integrity guarantees we currently have w.r.t. crash recovery.
      
      This change modifies some of the CIL item insertion loops, as
      ordered log vectors introduce some new constraints as they don't
      track any data. One advantage of this change is that it combines
      two log vector chain walks into a single pass, so there is less
      overhead in the transaction commit pass as well. It also kills some
      unused code in the log vector walk loop when committing the CIL.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      fd63875c
  10. 25 May, 2013 1 commit
  11. 21 May, 2013 1 commit
  12. 17 April, 2013 1 commit
  13. 22 June, 2012 2 commits
  14. 15 May, 2012 5 commits
    • D
      xfs: clean up xfs_bit.h includes · ad1e95c5
      Dave Chinner committed
      With the removal of xfs_rw.h and other changes over time, xfs_bit.h
      is being included in many files that don't actually need it. Clean
      up the includes as necessary.
      
      Also move the only-used-once xfs_ialloc_find_free() static inline
      function out of a header file that is widely included to reduce
      the number of needless dependencies on xfs_bit.h.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      ad1e95c5
    • D
      xfs: clean up busy extent naming · 4ecbfe63
      Dave Chinner committed
      Now that the busy extent tracking has been moved out of the
      allocation files, clean up the namespace it uses to
      "xfs_extent_busy" rather than a mix of "xfs_busy" and
      "xfs_alloc_busy".
      
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      4ecbfe63
    • D
      xfs: move busy extent handling to it's own file · efc27b52
      Dave Chinner committed
      To make it easier to handle userspace code merges, move all the busy
      extent handling out of the allocation code and into its own file.
      The userspace code does not need the busy extent code, so this
      simplifies the merging of the kernel code into the userspace
      xfsprogs library.
      
      Because the busy extent code has been almost completely rewritten
      over the past couple of years, also update the copyright on this new
      file to include the authors that made all those changes.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      efc27b52
    • D
      xfs: move xfsagino_t to xfs_types.h · 60a34607
      Dave Chinner committed
      Untangle the header file includes a bit by moving the definition of
      xfs_agino_t to xfs_types.h. This removes the dependency that xfs_ag.h has on
      xfs_inum.h, meaning we don't need to include xfs_inum.h everywhere we include
      xfs_ag.h.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      60a34607
    • D
      xfs: Do background CIL flushes via a workqueue · 4c2d542f
      Dave Chinner committed
      Doing background CIL flushes adds significant latency to whatever
      async transaction that triggers it. To avoid blocking async
      transactions on things like waiting for log buffer IO to complete,
      move the CIL push off into a workqueue.  By moving the push work
      into a workqueue, we remove all the latency that the commit adds
      from the foreground transaction commit path. This also means that
      single threaded workloads won't do the CIL push processing, leaving
      them more CPU to do more async transactions.
      
      To do this, we need to keep track of the sequence number we have
      pushed work for. This avoids having many transaction commits
      attempting to schedule work for the same sequence, and ensures that
      we only ever have one push (background or forced) in progress at a
      time. It also means that we don't need to take the CIL lock in write
      mode to check for potential background push races, which reduces
      lock contention.
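
The sequence tracking described above can be modelled roughly as follows. This is a hypothetical sketch, not the kernel implementation: the `Cil` class, its field names, and the list standing in for a workqueue are all assumptions made for illustration.

```python
class Cil:
    """Hypothetical model of the state needed to schedule background pushes."""

    def __init__(self):
        self.current_seq = 1   # sequence of the context accepting commits
        self.push_seq = 0      # highest sequence a push has been queued for
        self.workqueue = []    # stand-in for the real workqueue

    def queue_background_push(self):
        """Queue push work once per sequence; return True if work was queued."""
        if self.push_seq >= self.current_seq:
            return False       # a push for this sequence is already scheduled
        self.push_seq = self.current_seq
        self.workqueue.append(self.current_seq)
        return True

    def run_push(self):
        """Worker side: push the oldest queued sequence and start a new one."""
        seq = self.workqueue.pop(0)
        self.current_seq = seq + 1
        return seq
```

However many commits call `queue_background_push()` for the same sequence, only the first one queues work in this model; the rest return immediately.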
      
      To avoid potential issues with "smart" IO schedulers, don't use the
      workqueue for log force triggered flushes. Instead, do them directly
      so that the log IO is done directly by the process issuing the log
      force and so doesn't get stuck on IO elevator queue idling
      incorrectly delaying the log IO from the workqueue.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
      4c2d542f
  15. 09 December, 2011 3 commits
  16. 02 December, 2011 1 commit
  17. 25 May, 2011 1 commit
    • C
      xfs: add online discard support · e84661aa
      Christoph Hellwig committed
      Now that we have reliable tracking of deleted extents in a
      transaction we can easily implement "online" discard support
      which calls blkdev_issue_discard once a transaction commits.
      
      The actual discard is a two stage operation as we first have
      to mark the busy extent as not available for reuse before we
      can start the actual discard.  Note that we don't bother
      supporting discard for the non-delaylog mode.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Alex Elder <aelder@sgi.com>
      e84661aa
  18. 29 April, 2011 1 commit
  19. 28 January, 2011 1 commit
  20. 27 January, 2011 1 commit
    • D
      xfs: fix log ticket leak on forced shutdown. · 7db37c5e
      Dave Chinner committed
      The kmemleak detector shows this after test 139:
      
      unreferenced object 0xffff880079b88bb0 (size 264):
        comm "xfs_io", pid 4904, jiffies 4294909382 (age 276.824s)
        hex dump (first 32 bytes):
          00 00 00 00 ad 4e ad de ff ff ff ff 00 00 00 00  .....N..........
          ff ff ff ff ff ff ff ff 48 7b c9 82 ff ff ff ff  ........H{......
        backtrace:
          [<ffffffff81afb04d>] kmemleak_alloc+0x2d/0x60
          [<ffffffff8115c6cf>] kmem_cache_alloc+0x13f/0x2b0
          [<ffffffff814aaa97>] kmem_zone_alloc+0x77/0xf0
          [<ffffffff814aab2e>] kmem_zone_zalloc+0x1e/0x50
          [<ffffffff8148f394>] xlog_ticket_alloc+0x34/0x170
          [<ffffffff81494444>] xlog_cil_push+0xa4/0x3f0
          [<ffffffff81494eca>] xlog_cil_force_lsn+0x15a/0x160
          [<ffffffff814933a5>] _xfs_log_force_lsn+0x75/0x2d0
          [<ffffffff814a264d>] _xfs_trans_commit+0x2bd/0x2f0
          [<ffffffff8148bfdd>] xfs_iomap_write_allocate+0x1ad/0x350
          [<ffffffff814ac17f>] xfs_map_blocks+0x21f/0x370
          [<ffffffff814ad1b7>] xfs_vm_writepage+0x1c7/0x550
          [<ffffffff8112200a>] __writepage+0x1a/0x50
          [<ffffffff81122df2>] write_cache_pages+0x1c2/0x4c0
          [<ffffffff81123117>] generic_writepages+0x27/0x30
          [<ffffffff814aba5d>] xfs_vm_writepages+0x5d/0x80
      
      By inspection, the leak occurs when xlog_write() returns an error
      and we jump to the abort path without dropping the reference on the
      active ticket.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Alex Elder <aelder@sgi.com>
      7db37c5e
  21. 21 December, 2010 1 commit
  22. 20 December, 2010 1 commit
    • D
      xfs: bulk AIL insertion during transaction commit · 0e57f6a3
      Dave Chinner committed
      When inserting items into the AIL from the transaction committed
      callbacks, we take the AIL lock for every single item that is to be
      inserted. For a CIL checkpoint commit, this can be tens of thousands
      of individual inserts, yet almost all of the items will be inserted
      at the same point in the AIL because they have the same index.
      
      To reduce the overhead and contention on the AIL lock for such
      operations, introduce a "bulk insert" operation which allows a list
      of log items with the same LSN to be inserted in a single operation
      via a list splice. To do this, we need to pre-sort the log items
      being committed into a temporary list for insertion.
      
      The complexity is that not every log item will end up with the same
      LSN, and not every item is actually inserted into the AIL. Items
      that don't match the commit LSN will be inserted and unpinned as per
      the current one-at-a-time method (relatively rare), while items that
      are not to be inserted will be unpinned and freed immediately. Items
      that are to be inserted at the given commit lsn are placed in a
      temporary array and inserted into the AIL in bulk each time the
      array fills up.
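
The batching scheme can be sketched in userspace as follows. This is an illustration under assumptions rather than the kernel code: the `Ail` class, the `committed` helper, and the batch size of 4 are hypothetical, and a counter stands in for taking the AIL lock.

```python
BATCH_SIZE = 4  # illustrative; the commit text does not state the array size


class Ail:
    """Hypothetical AIL model that counts lock round trips."""

    def __init__(self):
        self.items = []        # (lsn, name) pairs
        self.lock_trips = 0

    def insert_bulk(self, batch, lsn):
        """Insert a whole batch at one LSN under a single lock hold."""
        self.lock_trips += 1
        self.items.extend((lsn, name) for name in batch)


def committed(ail, log_items, commit_lsn):
    """log_items: (name, item_lsn) pairs; item_lsn None means 'do not insert'."""
    batch = []
    for name, item_lsn in log_items:
        if item_lsn is None:
            continue                            # unpinned and freed immediately
        if item_lsn != commit_lsn:
            ail.insert_bulk([name], item_lsn)   # rare one-at-a-time path
            continue
        batch.append(name)
        if len(batch) == BATCH_SIZE:
            ail.insert_bulk(batch, commit_lsn)  # array is full, flush it
            batch = []
    if batch:
        ail.insert_bulk(batch, commit_lsn)      # flush the partial last batch
```

With ten items at the commit LSN, one item at a different LSN, and one item that is not inserted, this model takes the lock four times instead of eleven.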
      
      As a result of this, we trade off AIL hold time for a significant
      reduction in traffic. lock_stat output shows that the worst case
      hold time is unchanged, but contention from AIL inserts drops by an
      order of magnitude and the number of lock traversals decreases
      significantly.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      0e57f6a3
  23. 19 October, 2010 1 commit
    • D
      xfs: reduce the number of CIL lock round trips during commit · d1583a38
      Dave Chinner committed
      When committing a transaction, we do a CIL state lock round trip
      on every single log vector we insert into the CIL. This is resulting
      in the lock being as hot as the inode and dcache locks on 8-way
      create workloads. Rework the insertion loops to bring the number
      of lock round trips to one per transaction for log vectors, and one
      more to do the busy extents.
      
      Also change the allocation of the log vector buffer not to zero it
      as we copy over the entire allocated buffer anyway.
      
      This patch also includes a structural cleanup to the CIL item
      insertion provided by Christoph Hellwig.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Alex Elder <aelder@sgi.com>
      d1583a38
  24. 29 September, 2010 1 commit
    • D
      xfs: force background CIL push under sustained load · 80168676
      Dave Chinner committed
      I have been seeing occasional pauses in transaction throughput up to
      30s long under heavy parallel workloads. The only notable thing was
      that the xfsaild was trying to be active during the pauses, but
      making no progress. It was running exactly 20 times a second (on the
      50ms no-progress backoff), and the number of pushbuf events was
      constant across this time as well.  IOWs, the xfsaild appeared to be
      stuck on buffers that it could not push out.
      
      Further investigation indicated that it was trying to push out inode
      buffers that were pinned and/or locked. The xfsbufd was also getting
      woken at the same frequency (by the xfsaild, no doubt) to push out
      delayed write buffers. The xfsbufd was not making any progress
      because all the buffers in the delwri queue were pinned. This scan-
      and-make-no-progress dance went on in the trace for some seconds,
      before the xfssyncd came along and issued a log force, and then
      things started going again.
      
      However, I noticed something strange about the log force - there
      were way too many IO's issued. 516 log buffers were written, to be
      exact. That added up to 129MB of log IO, which got me very
      interested because it's almost exactly 25% of the size of the log.
      The delayed logging code is supposed to aggregate the minimum of 25%
      of the log or 8MB worth of changes before flushing. That's what
      really puzzled me - why did a log force write 129MB instead of only
      8MB?
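
The numbers above can be checked with a short calculation. This sketch rests on two assumptions that the commit message does not state: a 256 KiB log buffer size, which happens to reproduce the 129MB figure, and a 512 MiB log, inferred from "almost exactly 25%".

```python
KIB = 1024
MIB = 1024 * KIB

# ASSUMPTION: 256 KiB log buffers; the commit message does not state the
# buffer size, but this value reproduces the 129MB figure it quotes.
LOG_BUFFER_BYTES = 256 * KIB


def log_io_bytes(buffers_written):
    """Total log IO for a given number of full log buffers."""
    return buffers_written * LOG_BUFFER_BYTES


def cil_push_threshold(log_bytes):
    """Intended flush point: the minimum of 25% of the log or 8MB."""
    return min(log_bytes // 4, 8 * MIB)


if __name__ == "__main__":
    print(log_io_bytes(516) // MIB, "MiB written by the log force")
    # ASSUMPTION: a 512 MiB log, inferred from "almost exactly 25%".
    print(cil_push_threshold(512 * MIB) // MIB, "MiB intended threshold")
```

Under these assumptions the log force wrote roughly sixteen times more than the intended 8MB aggregation limit.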
      
      Essentially what has happened is that no CIL pushes had occurred
      since the previous tail push which cleared out 25% of the log space.
      That caused all the new transactions to block because there wasn't
      log space for them, but they kick the xfsaild to push the tail.
      However, the xfsaild was not making progress because there were
      buffers it could not lock and flush, and the xfsbufd could not flush
      them because they were pinned. As a result, both the xfsaild and the
      xfsbufd could not move the tail of the log forward without the CIL
      first committing.
      
      The cause of the problem was that the background CIL push, which
      should happen when 8MB of aggregated changes have been committed, is
      being held off by the concurrent transaction commit load. The
      background push does a down_write_trylock() which will fail if there
      is a concurrent transaction commit holding the push lock in read
      mode. With 8 CPUs all doing transactions as fast as they can, there
      were enough concurrent transaction commits to hold off the background
      push until tail-pushing could no longer free log space, and the halt
      would occur.
      
      It should be noted that there is no reason why it would halt at 25%
      of log space used by a single CIL checkpoint. This bug could
      definitely violate the "no transaction should be larger than half
      the log" requirement and hence result in corruption if the system
      crashed under heavy load. This sort of bug is exactly the reason why
      delayed logging was tagged as experimental....
      
      The fix is to start blocking background pushes once the threshold
      has been exceeded. Rework the threshold calculations to keep the
      amount of log space a CIL checkpoint can use to below that of the
      AIL push threshold to avoid the problem completely.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Alex Elder <aelder@sgi.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      80168676
  25. 24 August, 2010 3 commits
    • D
      xfs: don't do memory allocation under the CIL context lock · 3b93c7aa
      Dave Chinner committed
      Formatting items requires memory allocation when using delayed
      logging. Currently that memory allocation is done while holding the
      CIL context lock in read mode. This means that if memory allocation
      takes some time (e.g. enters reclaim), we cannot push on the CIL
      until the allocation(s) required by formatting complete. This can
      stall CIL pushes for some time, and once a push is stalled so are
      all new transaction commits.
      
      Fix this by splitting the item formatting into two steps. The first
      step which does the allocation and memcpy() into the allocated
      buffer is now done outside the CIL context lock, and only the CIL
      insert is done inside the CIL context lock. This avoids the stall
      issue.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      3b93c7aa
    • D
      xfs: Reduce log force overhead for delayed logging · a44f13ed
      Dave Chinner committed
      Delayed logging adds some serialisation to the log force process to
      ensure that it does not dereference a bad commit context structure
      when determining if a CIL push is necessary or not. It does this by
      grabbing the CIL context lock exclusively, then dropping it before
      pushing the CIL if necessary. This causes serialisation of all log
      forces and pushes regardless of whether a force is necessary or not.
      As a result fsync heavy workloads (like dbench) can be significantly
      slower with delayed logging than without.
      
      To avoid this penalty, copy the current sequence from the context to
      the CIL structure when they are swapped. This allows us to do
      unlocked checks on the current sequence without having to worry
      about dereferencing context structures that may have already been
      freed. Hence we can remove the CIL context locking in the forcing
      code and only call into the push code if the current context matches
      the sequence we need to force.
      
      By passing the sequence into the push code, we can check the
      sequence again once we have the CIL lock held exclusive and abort if
      the sequence has already been pushed. This avoids a lock round-trip
      and unnecessary CIL pushes when we have racing push calls.
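
The unlocked check followed by a recheck under the lock can be sketched like this. It is a simplified userspace model, not the kernel code: the `Cil` class and its fields are hypothetical, and a plain mutex stands in for the CIL context lock held exclusive.

```python
import threading


class Cil:
    """Hypothetical CIL model; the real structures and locks differ."""

    def __init__(self):
        self.ctx_lock = threading.Lock()  # stand-in for the CIL context lock
        self.current_seq = 1              # copied from the context on each swap
        self.pushes = 0

    def push(self, seq):
        """Push the context with sequence seq; abort if it was already pushed."""
        with self.ctx_lock:
            if seq != self.current_seq:
                return False              # a racing push got there first
            self.pushes += 1
            self.current_seq += 1         # swap in a new context
            return True

    def force(self, seq):
        """Force sequence seq to the log; only take the lock when needed."""
        if seq != self.current_seq:       # unlocked check, no context dereference
            return False
        return self.push(seq)
```

A second `force(1)` after the first has completed returns without touching the lock, which is the serialisation this change is described as removing.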
      
      The result is that the regression in dbench performance goes away -
      this change improves dbench performance on a ramdisk from ~2100MB/s
      to ~2500MB/s. This compares favourably to not using delayed logging
      which returns ~2500MB/s for the same workload.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      a44f13ed
    • D
      xfs: unlock items before allowing the CIL to commit · d17c701c
      Dave Chinner committed
      When we commit a transaction using delayed logging, we need to
      unlock the items in the transaction before we unlock the CIL context
      and allow it to be checkpointed. If we unlock them after we release
      the CIL context lock, the CIL can checkpoint and complete before
      we free the log items. This breaks stale buffer item unlock and
      unpin processing as there is an implicit assumption that the unlock
      will occur before the unpin.
      
      Also, some log items need to store the LSN of the transaction commit
      in the item (inodes and EFIs) and so can race with other transaction
      completions if we don't prevent the CIL from checkpointing before
      the unlock occurs.
      
      Cc: <stable@kernel.org>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      d17c701c