1. 02 2月, 2013 2 次提交
  2. 31 7月, 2012 1 次提交
    • J
      xfs: Convert to new freezing code · d9457dc0
      Jan Kara 提交于
      Generic code now blocks all writers from standard write paths. So we add
      blocking of all writers coming from ioctl (we get a protection of ioctl against
      racing remount read-only as a bonus) and convert xfs_file_aio_write() to a
      non-racy freeze protection. We also keep freeze protection on transaction
      start to block internal filesystem writes such as removal of preallocated
      blocks.
      
      CC: Ben Myers <bpm@sgi.com>
      CC: Alex Elder <elder@kernel.org>
      CC: xfs@oss.sgi.com
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      d9457dc0
  3. 30 5月, 2012 1 次提交
  4. 15 5月, 2012 4 次提交
  5. 23 2月, 2012 1 次提交
  6. 14 2月, 2012 2 次提交
  7. 09 12月, 2011 3 次提交
  8. 12 10月, 2011 2 次提交
    • C
      xfs: simplify xfs_trans_ijoin* again · ddc3415a
      Christoph Hellwig 提交于
      There is no reason to keep a reference to the inode even if we unlock
      it during transaction commit because we never drop a reference between
      the ijoin and commit.  Also use this fact to merge xfs_trans_ijoin_ref
      back into xfs_trans_ijoin - the third argument decides if an unlock
      is needed now.
      
      I'm actually starting to wonder if allowing inodes to be unlocked
      at transaction commit really is worth the effort.  The only real
      benefit is that they can be unlocked earlier when commiting a
      synchronous transactions, but that could be solved by doing the
      log force manually after the unlock, too.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAlex Elder <aelder@sgi.com>
      
      ddc3415a
    • C
      xfs: unlock the inode before log force in xfs_fsync · b1037058
      Christoph Hellwig 提交于
      Only read the LSN we need to push to with the ilock held, and then release
      it before we do the log force to improve concurrency.
      
      This also removes the only direct caller of _xfs_trans_commit, thus
      allowing it to be merged into the plain xfs_trans_commit again.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAlex Elder <aelder@sgi.com>
      
      b1037058
  9. 21 7月, 2011 1 次提交
    • D
      xfs: use a cursor for bulk AIL insertion · 1d8c95a3
      Dave Chinner 提交于
      Delayed logging can insert tens of thousands of log items into the
      AIL at the same LSN. When the committing of log commit records
      occur, we can get insertions occurring at an LSN that is not at the
      end of the AIL. If there are thousands of items in the AIL on the
      tail LSN, each insertion has to walk the AIL to find the correct
      place to insert the new item into the AIL. This can consume large
      amounts of CPU time and block other operations from occurring while
      the traversals are in progress.
      
      To avoid this repeated walk, use a AIL cursor to record
      where we should be inserting the new items into the AIL without
      having to repeat the walk. The cursor infrastructure already
      provides this functionality for push walks, so is a simple extension
      of existing code. While this will not avoid the initial walk, it
      will avoid repeating it tens of thousands of times during a single
      checkpoint commit.
      
      This version includes logic improvements from Christoph Hellwig.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAlex Elder <aelder@sgi.com>
      1d8c95a3
  10. 11 7月, 2011 1 次提交
  11. 08 7月, 2011 1 次提交
  12. 07 7月, 2011 1 次提交
    • D
      xfs: unpin stale inodes directly in IOP_COMMITTED · 1316d4da
      Dave Chinner 提交于
      When inodes are marked stale in a transaction, they are treated
      specially when the inode log item is being inserted into the AIL.
      It tries to avoid moving the log item forward in the AIL due to a
      race condition with the writing the underlying buffer back to disk.
      The was "fixed" in commit de25c181 ("xfs: avoid moving stale inodes
      in the AIL").
      
      To avoid moving the item forward, we return a LSN smaller than the
      commit_lsn of the completing transaction, thereby trying to trick
      the commit code into not moving the inode forward at all. I'm not
      sure this ever worked as intended - it assumes the inode is already
      in the AIL, but I don't think the returned LSN would have been small
      enough to prevent moving the inode. It appears that the reason it
      worked is that the lower LSN of the inodes meant they were inserted
      into the AIL and flushed before the inode buffer (which was moved to
      the commit_lsn of the transaction).
      
      The big problem is that with delayed logging, the returning of the
      different LSN means insertion takes the slow, non-bulk path.  Worse
      yet is that insertion is to a position -before- the commit_lsn so it
      is doing a AIL traversal on every insertion, and has to walk over
      all the items that have already been inserted into the AIL. It's
      expensive.
      
      To compound the matter further, with delayed logging inodes are
      likely to go from clean to stale in a single checkpoint, which means
      they aren't even in the AIL at all when we come across them at AIL
      insertion time. Hence these were all getting inserted into the AIL
      when they simply do not need to be as inodes marked XFS_ISTALE are
      never written back.
      
      Transactional/recovery integrity is maintained in this case by the
      other items in the unlink transaction that were modified (e.g. the
      AGI btree blocks) and committed in the same checkpoint.
      
      So to fix this, simply unpin the stale inodes directly in
      xfs_inode_item_committed() and return -1 to indicate that the AIL
      insertion code does not need to do any further processing of these
      inodes.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAlex Elder <aelder@sgi.com>
      1316d4da
  13. 25 5月, 2011 1 次提交
    • C
      xfs: add online discard support · e84661aa
      Christoph Hellwig 提交于
      Now that we have reliably tracking of deleted extents in a
      transaction we can easily implement "online" discard support
      which calls blkdev_issue_discard once a transaction commits.
      
      The actual discard is a two stage operation as we first have
      to mark the busy extent as not available for reuse before we
      can start the actual discard.  Note that we don't bother
      supporting discard for the non-delaylog mode.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAlex Elder <aelder@sgi.com>
      e84661aa
  14. 29 4月, 2011 1 次提交
  15. 28 1月, 2011 2 次提交
    • D
      xfs: handle CIl transaction commit failures correctly · c6f990d1
      Dave Chinner 提交于
      Failure to commit a transaction into the CIL is not handled
      correctly. This currently can only happen when racing with a
      shutdown and requires an explicit shutdown check, so it rare and can
      be avoided. Remove the shutdown check and make the CIL commit a void
      function to indicate it will always succeed, thereby removing the
      incorrectly handled failure case.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NAlex Elder <aelder@sgi.com>
      c6f990d1
    • D
      xfs: fix efi item leak on forced shutdown · e34a314c
      Dave Chinner 提交于
      After test 139, kmemleak shows:
      
      unreferenced object 0xffff880078b405d8 (size 400):
        comm "xfs_io", pid 4904, jiffies 4294909383 (age 1186.728s)
        hex dump (first 32 bytes):
          60 c1 17 79 00 88 ff ff 60 c1 17 79 00 88 ff ff  `..y....`..y....
          00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        backtrace:
          [<ffffffff81afb04d>] kmemleak_alloc+0x2d/0x60
          [<ffffffff8115c6cf>] kmem_cache_alloc+0x13f/0x2b0
          [<ffffffff814aaa97>] kmem_zone_alloc+0x77/0xf0
          [<ffffffff814aab2e>] kmem_zone_zalloc+0x1e/0x50
          [<ffffffff8147cd6b>] xfs_efi_init+0x4b/0xb0
          [<ffffffff814a4ee8>] xfs_trans_get_efi+0x58/0x90
          [<ffffffff81455fab>] xfs_bmap_finish+0x8b/0x1d0
          [<ffffffff814851b4>] xfs_itruncate_finish+0x2c4/0x5d0
          [<ffffffff814a970f>] xfs_setattr+0x8df/0xa70
          [<ffffffff814b5c7b>] xfs_vn_setattr+0x1b/0x20
          [<ffffffff8117dc00>] notify_change+0x170/0x2e0
          [<ffffffff81163bf6>] do_truncate+0x66/0xa0
          [<ffffffff81163d0b>] sys_ftruncate+0xdb/0xe0
          [<ffffffff8103a002>] system_call_fastpath+0x16/0x1b
          [<ffffffffffffffff>] 0xffffffffffffffff
      
      The cause of the leak is that the "remove" parameter of IOP_UNPIN()
      is never set when a CIL push is aborted. This means that the EFI
      item is never freed if it was in the push being cancelled. The
      problem is specific to delayed logging, but has uncovered a couple
      of problems with the handling of IOP_UNPIN(remove).
      
      Firstly, we cannot safely call xfs_trans_del_item() from IOP_UNPIN()
      in the CIL commit failure path or the iclog write failure path
      because for delayed loging we have no transaction context. Hence we
      must only call xfs_trans_del_item() if the log item being unpinned
      has an active log item descriptor.
      
      Secondly, xfs_trans_uncommit() does not handle log item descriptor
      freeing during the traversal of log items on a transaction. It can
      reference a freed log item descriptor when unpinning an EFI item.
      Hence it needs to use a safe list traversal method to allow items to
      be removed from the transaction during IOP_UNPIN().
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NAlex Elder <aelder@sgi.com>
      e34a314c
  16. 12 1月, 2011 1 次提交
  17. 20 12月, 2010 1 次提交
    • D
      xfs: bulk AIL insertion during transaction commit · 0e57f6a3
      Dave Chinner 提交于
      When inserting items into the AIL from the transaction committed
      callbacks, we take the AIL lock for every single item that is to be
      inserted. For a CIL checkpoint commit, this can be tens of thousands
      of individual inserts, yet almost all of the items will be inserted
      at the same point in the AIL because they have the same index.
      
      To reduce the overhead and contention on the AIL lock for such
      operations, introduce a "bulk insert" operation which allows a list
      of log items with the same LSN to be inserted in a single operation
      via a list splice. To do this, we need to pre-sort the log items
      being committed into a temporary list for insertion.
      
      The complexity is that not every log item will end up with the same
      LSN, and not every item is actually inserted into the AIL. Items
      that don't match the commit LSN will be inserted and unpinned as per
      the current one-at-a-time method (relatively rare), while items that
      are not to be inserted will be unpinned and freed immediately. Items
      that are to be inserted at the given commit lsn are placed in a
      temporary array and inserted into the AIL in bulk each time the
      array fills up.
      
      As a result of this, we trade off AIL hold time for a significant
      reduction in traffic. lock_stat output shows that the worst case
      hold time is unchanged, but contention from AIL inserts drops by an
      order of magnitude and the number of lock traversal decreases
      significantly.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      0e57f6a3
  18. 19 10月, 2010 4 次提交
  19. 24 8月, 2010 1 次提交
    • D
      xfs: unlock items before allowing the CIL to commit · d17c701c
      Dave Chinner 提交于
      When we commit a transaction using delayed logging, we need to
      unlock the items in the transaciton before we unlock the CIL context
      and allow it to be checkpointed. If we unlock them after we release
      the CIl context lock, the CIL can checkpoint and complete before
      we free the log items. This breaks stale buffer item unlock and
      unpin processing as there is an implicit assumption that the unlock
      will occur before the unpin.
      
      Also, some log items need to store the LSN of the transaction commit
      in the item (inodes and EFIs) and so can race with other transaction
      completions if we don't prevent the CIL from checkpointing before
      the unlock occurs.
      
      Cc: <stable@kernel.org>
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      d17c701c
  20. 27 7月, 2010 6 次提交
    • D
      xfs: fix xfs_trans_add_item() lockdep warnings · 43869706
      Dave Chinner 提交于
      xfs_trans_add_item() is called with ip->i_ilock held, which means it
      is unsafe for memory reclaim to recurse back into the filesystem
      (ilock is required in writeback). Hence the allocation needs to be
      KM_NOFS to avoid recursion.
      
      Lockdep report indicating memory allocation being called with the
      ip->i_ilock held is as follows:
      
      [ 1749.866796] =================================
      [ 1749.867788] [ INFO: inconsistent lock state ]
      [ 1749.868327] 2.6.35-rc3-dgc+ #25
      [ 1749.868741] ---------------------------------
      [ 1749.868741] inconsistent {IN-RECLAIM_FS-W} -> {RECLAIM_FS-ON-W} usage.
      [ 1749.868741] dd/2835 [HC0[0]:SC0[0]:HE1:SE1] takes:
      [ 1749.868741]  (&(&ip->i_lock)->mr_lock){++++?.}, at: [<ffffffff813170fb>] xfs_ilock+0x10b/0x190
      [ 1749.868741] {IN-RECLAIM_FS-W} state was registered at:
      [ 1749.868741]   [<ffffffff810b3a97>] __lock_acquire+0x437/0x1450
      [ 1749.868741]   [<ffffffff810b4b56>] lock_acquire+0xa6/0x160
      [ 1749.868741]   [<ffffffff810a20b5>] down_write_nested+0x65/0xb0
      [ 1749.868741]   [<ffffffff813170fb>] xfs_ilock+0x10b/0x190
      [ 1749.868741]   [<ffffffff8134e819>] xfs_reclaim_inode+0x99/0x310
      [ 1749.868741]   [<ffffffff8134f56b>] xfs_inode_ag_walk+0x8b/0x150
      [ 1749.868741]   [<ffffffff8134f6bb>] xfs_inode_ag_iterator+0x8b/0xf0
      [ 1749.868741]   [<ffffffff8134f7a8>] xfs_reclaim_inode_shrink+0x88/0x90
      [ 1749.868741]   [<ffffffff81119d07>] shrink_slab+0x137/0x1a0
      [ 1749.868741]   [<ffffffff8111bbe1>] balance_pgdat+0x421/0x6a0
      [ 1749.868741]   [<ffffffff8111bf7d>] kswapd+0x11d/0x320
      [ 1749.868741]   [<ffffffff8109ce56>] kthread+0x96/0xa0
      [ 1749.868741]   [<ffffffff81035de4>] kernel_thread_helper+0x4/0x10
      [ 1749.868741] irq event stamp: 4234335
      [ 1749.868741] hardirqs last  enabled at (4234335): [<ffffffff81147d25>] kmem_cache_free+0x115/0x220
      [ 1749.868741] hardirqs last disabled at (4234334): [<ffffffff81147c4d>] kmem_cache_free+0x3d/0x220
      [ 1749.868741] softirqs last  enabled at (4233112): [<ffffffff81084dd2>] __do_softirq+0x142/0x260
      [ 1749.868741] softirqs last disabled at (4233095): [<ffffffff81035edc>] call_softirq+0x1c/0x50
      [ 1749.868741] 
      [ 1749.868741] other info that might help us debug this:
      [ 1749.868741] 2 locks held by dd/2835:
      [ 1749.868741]  #0:  (&(&ip->i_iolock)->mr_lock#2){+.+.+.}, at: [<ffffffff81316edd>] xfs_ilock_nowait+0xed/0x200
      [ 1749.868741]  #1:  (&(&ip->i_lock)->mr_lock){++++?.}, at: [<ffffffff813170fb>] xfs_ilock+0x10b/0x190
      [ 1749.868741] 
      [ 1749.868741] stack backtrace:
      [ 1749.868741] Pid: 2835, comm: dd Not tainted 2.6.35-rc3-dgc+ #25
      [ 1749.868741] Call Trace:
      [ 1749.868741]  [<ffffffff810b1faa>] print_usage_bug+0x18a/0x190
      [ 1749.868741]  [<ffffffff8104264f>] ? save_stack_trace+0x2f/0x50
      [ 1749.868741]  [<ffffffff810b2400>] ? check_usage_backwards+0x0/0xf0
      [ 1749.868741]  [<ffffffff810b2f11>] mark_lock+0x331/0x400
      [ 1749.868741]  [<ffffffff810b3047>] mark_held_locks+0x67/0x90
      [ 1749.868741]  [<ffffffff810b3111>] lockdep_trace_alloc+0xa1/0xe0
      [ 1749.868741]  [<ffffffff81147419>] kmem_cache_alloc+0x39/0x1e0
      [ 1749.868741]  [<ffffffff8133f954>] kmem_zone_alloc+0x94/0xe0
      [ 1749.868741]  [<ffffffff8133f9be>] kmem_zone_zalloc+0x1e/0x50
      [ 1749.868741]  [<ffffffff81335f02>] xfs_trans_add_item+0x72/0xb0
      [ 1749.868741]  [<ffffffff81339e41>] xfs_trans_ijoin+0xa1/0xd0
      [ 1749.868741]  [<ffffffff81319f82>] xfs_itruncate_finish+0x312/0x5d0
      [ 1749.868741]  [<ffffffff8133cb87>] xfs_free_eofblocks+0x227/0x280
      [ 1749.868741]  [<ffffffff8133cd18>] xfs_release+0x138/0x190
      [ 1749.868741]  [<ffffffff813464c5>] xfs_file_release+0x15/0x20
      [ 1749.868741]  [<ffffffff81150ebf>] fput+0x13f/0x260
      [ 1749.868741]  [<ffffffff8114d8c2>] filp_close+0x52/0x80
      [ 1749.868741]  [<ffffffff8114d9a9>] sys_close+0xb9/0x120
      [ 1749.868741]  [<ffffffff81034ff2>] system_call_fastpath+0x16/0x1b
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NAlex Elder <aelder@sgi.com>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      43869706
    • C
      xfs: simplify inode to transaction joining · 898621d5
      Christoph Hellwig 提交于
      Currently we need to either call IHOLD or xfs_trans_ihold on an inode when
      joining it to a transaction via xfs_trans_ijoin.
      
      This patches instead makes xfs_trans_ijoin usable on it's own by doing
      an implicity xfs_trans_ihold, which also allows us to drop the third
      argument.  For the case where we want to hold a reference on the inode
      a xfs_trans_ijoin_ref wrapper is added which does the IHOLD and marks
      the inode for needing an xfs_iput.  In addition to the cleaner interface
      to the caller this also simplifies the implementation.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      898621d5
    • C
      xfs: merge iop_unpin_remove into iop_unpin · 9412e318
      Christoph Hellwig 提交于
      The unpin_remove item operation instances always share most of the
      implementation with the respective unpin implementation.  So instead
      of keeping two different entry points add a remove flag to the unpin
      operation and share the code more easily.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      9412e318
    • C
      xfs: simplify log item descriptor tracking · e98c414f
      Christoph Hellwig 提交于
      Currently we track log item descriptor belonging to a transaction using a
      complex opencoded chunk allocator.  This code has been there since day one
      and seems to work around the lack of an efficient slab allocator.
      
      This patch replaces it with dynamically allocated log item descriptors
      from a dedicated slab pool, linked to the transaction by a linked list.
      
      This allows to greatly simplify the log item descriptor tracking to the
      point where it's just a couple hundred lines in xfs_trans.c instead of
      a separate file.  The external API has also been simplified while we're
      at it - the xfs_trans_add_item and xfs_trans_del_item functions to add/
      delete items from a transaction have been simplified to the bare minium,
      and the xfs_trans_find_item function is replaced with a direct dereference
      of the li_desc field.  All debug code walking the list of log items in
      a transaction is down to a simple list_for_each_entry.
      
      Note that we could easily use a singly linked list here instead of the
      double linked list from list.h as the fastpath only does deletion from
      sequential traversal.  But given that we don't have one available as
      a library function yet I use the list.h functions for simplicity.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      e98c414f
    • C
      xfs: remove unneeded #include statements · 3400777f
      Christoph Hellwig 提交于
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDave Chinner <david@fromorbit.com>
      3400777f
    • C
      xfs: drop dmapi hooks · 288699fe
      Christoph Hellwig 提交于
      Dmapi support was never merged upstream, but we still have a lot of hooks
      bloating XFS for it, all over the fast pathes of the filesystem.
      
      This patch drops over 700 lines of dmapi overhead.  If we'll ever get HSM
      support in mainline at least the namespace events can be done much saner
      in the VFS instead of the individual filesystem, so it's not like this
      is much help for future work.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      288699fe
  21. 29 5月, 2010 1 次提交
  22. 24 5月, 2010 2 次提交
    • D
      xfs: Introduce delayed logging core code · 71e330b5
      Dave Chinner 提交于
      The delayed logging code only changes in-memory structures and as
      such can be enabled and disabled with a mount option. Add the mount
      option and emit a warning that this is an experimental feature that
      should not be used in production yet.
      
      We also need infrastructure to track committed items that have not
      yet been written to the log. This is what the Committed Item List
      (CIL) is for.
      
      The log item also needs to be extended to track the current log
      vector, the associated memory buffer and it's location in the Commit
      Item List. Extend the log item and log vector structures to enable
      this tracking.
      
      To maintain the current log format for transactions with delayed
      logging, we need to introduce a checkpoint transaction and a context
      for tracking each checkpoint from initiation to transaction
      completion.  This includes adding a log ticket for tracking space
      log required/used by the context checkpoint.
      
      To track all the changes we need an io vector array per log item,
      rather than a single array for the entire transaction. Using the new
      log vector structure for this requires two passes - the first to
      allocate the log vector structures and chain them together, and the
      second to fill them out.  This log vector chain can then be passed
      to the CIL for formatting, pinning and insertion into the CIL.
      
      Formatting of the log vector chain is relatively simple - it's just
      a loop over the iovecs on each log vector, but it is made slightly
      more complex because we re-write the iovec after the copy to point
      back at the memory buffer we just copied into.
      
      This code also needs to pin log items. If the log item is not
      already tracked in this checkpoint context, then it needs to be
      pinned. Otherwise it is already pinned and we don't need to pin it
      again.
      
      The only other complexity is calculating the amount of new log space
      the formatting has consumed. This needs to be accounted to the
      transaction in progress, and the accounting is made more complex
      becase we need also to steal space from it for log metadata in the
      checkpoint transaction. Calculate all this at insert time and update
      all the tickets, counters, etc correctly.
      
      Once we've formatted all the log items in the transaction, attach
      the busy extents to the checkpoint context so the busy extents live
      until checkpoint completion and can be processed at that point in
      time. Transactions can then be freed at this point in time.
      
      Now we need to issue checkpoints - we are tracking the amount of log space
      used by the items in the CIL, so we can trigger background checkpoints when the
      space usage gets to a certain threshold. Otherwise, checkpoints need ot be
      triggered when a log synchronisation point is reached - a log force event.
      
      Because the log write code already handles chained log vectors, writing the
      transaction is trivial, too. Construct a transaction header, add it
      to the head of the chain and write it into the log, then issue a
      commit record write. Then we can release the checkpoint log ticket
      and attach the context to the log buffer so it can be called during
      Io completion to complete the checkpoint.
      
      We also need to allow for synchronising multiple in-flight
      checkpoints. This is needed for two things - the first is to ensure
      that checkpoint commit records appear in the log in the correct
      sequence order (so they are replayed in the correct order). The
      second is so that xfs_log_force_lsn() operates correctly and only
      flushes and/or waits for the specific sequence it was provided with.
      
      To do this we need a wait variable and a list tracking the
      checkpoint commits in progress. We can walk this list and wait for
      the checkpoints to change state or complete easily, an this provides
      the necessary synchronisation for correct operation in both cases.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAlex Elder <aelder@sgi.com>
      71e330b5
    • D
      xfs: Improve scalability of busy extent tracking · ed3b4d6c
      Dave Chinner 提交于
      When we free a metadata extent, we record it in the per-AG busy
      extent array so that it is not re-used before the freeing
      transaction hits the disk. This array is fixed size, so when it
      overflows we make further allocation transactions synchronous
      because we cannot track more freed extents until those transactions
      hit the disk and are completed. Under heavy mixed allocation and
      freeing workloads with large log buffers, we can overflow this array
      quite easily.
      
      Further, the array is sparsely populated, which means that inserts
      need to search for a free slot, and array searches often have to
      search many more slots that are actually used to check all the
      busy extents. Quite inefficient, really.
      
      To enable this aspect of extent freeing to scale better, we need
      a structure that can grow dynamically. While in other areas of
      XFS we have used radix trees, the extents being freed are at random
      locations on disk so are better suited to being indexed by an rbtree.
      
      So, use a per-AG rbtree indexed by block number to track busy
      extents.  This incures a memory allocation when marking an extent
      busy, but should not occur too often in low memory situations. This
      should scale to an arbitrary number of extents so should not be a
      limitation for features such as in-memory aggregation of
      transactions.
      
      However, there are still situations where we can't avoid allocating
      busy extents (such as allocation from the AGFL). To minimise the
      overhead of such occurences, we need to avoid doing a synchronous
      log force while holding the AGF locked to ensure that the previous
      transactions are safely on disk before we use the extent. We can do
      this by marking the transaction doing the allocation as synchronous
      rather issuing a log force.
      
      Because of the locking involved and the ordering of transactions,
      the synchronous transaction provides the same guarantees as a
      synchronous log force because it ensures that all the prior
      transactions are already on disk when the synchronous transaction
      hits the disk. i.e. it preserves the free->allocate order of the
      extent correctly in recovery.
      
      By doing this, we avoid holding the AGF locked while log writes are
      in progress, hence reducing the length of time the lock is held and
      therefore we increase the rate at which we can allocate and free
      from the allocation group, thereby increasing overall throughput.
      
      The only problem with this approach is that when a metadata buffer is
      marked stale (e.g. a directory block is removed), then buffer remains
      pinned and locked until the log goes to disk. The issue here is that
      if that stale buffer is reallocated in a subsequent transaction, the
      attempt to lock that buffer in the transaction will hang waiting
      the log to go to disk to unlock and unpin the buffer. Hence if
      someone tries to lock a pinned, stale, locked buffer we need to
      push on the log to get it unlocked ASAP. Effectively we are trading
      off a guaranteed log force for a much less common trigger for log
      force to occur.
      
      Ideally we should not reallocate busy extents. That is a much more
      complex fix to the problem as it involves direct intervention in the
      allocation btree searches in many places. This is left to a future
      set of modifications.
      
      Finally, now that we track busy extents in allocated memory, we
      don't need the descriptors in the transaction structure to point to
      them. We can replace the complex busy chunk infrastructure with a
      simple linked list of busy extents. This allows us to remove a large
      chunk of code, making the overall change a net reduction in code
      size.
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAlex Elder <aelder@sgi.com>
      ed3b4d6c