1. 16 Nov, 2012 1 commit
    • xfs: verify superblocks as they are read from disk · 98021821
      Committed by Dave Chinner
      Add a superblock verify callback function and pass it into the
      buffer read functions. Remove the now redundant verification code
      that is currently in use.
      
      Adding verification shows that secondary superblocks never have
      their "sb_inprogress" flag cleared by mkfs.xfs, so when validating
      the secondary superblocks during a grow operation we have to avoid
      checking this field. Even if we fix mkfs, we will still have to
      ignore this field for verification purposes unless a version of mkfs
      that does not have this bug was used.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Phil White <pwhite@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
  2. 09 Nov, 2012 1 commit
  3. 18 Oct, 2012 4 commits
    • xfs: rename xfs_sync.[ch] to xfs_icache.[ch] · 6d8b79cf
      Committed by Dave Chinner
      xfs_sync.c now only contains inode reclaim functions and inode cache
      iteration functions. It is not related to sync operations anymore.
      Rename to xfs_icache.c to reflect its contents and prepare for
      consolidation with the other inode cache file that exists
      (xfs_iget.c).
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
    • xfs: syncd workqueue is no more · 5889608d
      Committed by Dave Chinner
      With the syncd functions moved to the log and/or removed, the syncd
      workqueue is the only remaining bit left. It is used by the log
      covering/ail pushing work, as well as by the inode reclaim work.
      
      Given how cheap workqueues are these days, give the log and inode
      reclaim work their own work queues and kill the syncd work queue.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Ben Myers <bpm@sgi.com>
    • xfs: xfs_sync_data is redundant. · 9aa05000
      Committed by Dave Chinner
      We don't do any data writeback from XFS any more - the VFS is
      completely responsible for that, including for freeze. We can
      replace the remaining caller with a VFS level function that
      achieves the same thing, but without conflicting with current
      writeback work.
      
      This means we can remove the flush_work and xfs_flush_inodes() - the
      VFS functionality completely replaces the internal flush queue for
      doing this writeback work in a separate context to avoid stack
      overruns.
      
      This does have one complication - it cannot be called with page
      locks held.  Hence move the flushing of delalloc space when ENOSPC
      occurs back up into xfs_file_aio_buffered_write when we don't hold
      any locks that will stall writeback.
      
      Unfortunately, writeback_inodes_sb_if_idle() is not sufficient to
      trigger delalloc conversion fast enough to prevent spurious ENOSPC
      when there are hundreds of writers, thousands of small files and GBs
      of free RAM.  Hence we need to use sync_sb_inodes() to block callers
      while we wait for writeback like the previous xfs_flush_inodes
      implementation did.
      
      That means we have to hold the s_umount lock here, but because this
      call can nest inside i_mutex (the parent directory in the create
      case, held by the VFS), we have to use down_read_trylock() to avoid
      potential deadlocks. In practice, this trylock will succeed on
      almost every attempt as unmount/remount type operations are
      exceedingly rare.
      
      Note: we always need to pass a count of zero to
      generic_file_buffered_write() as the previously written byte count.
      Before this patch we only did this by accident, by virtue of ret
      always being zero when there are no errors. Make this explicit
      rather than needing to specifically zero ret in the ENOSPC retry
      case.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Tested-by: Brian Foster <bfoster@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Ben Myers <bpm@sgi.com>
    • xfs: sync work is now only periodic log work · f661f1e0
      Committed by Dave Chinner
      The only thing the periodic sync work does now is flush the AIL and
      idle the log. These are really functions of the log code, so move
      the work to xfs_log.c and rename it appropriately.
      
      The only wart that this leaves behind is the xfssyncd_centisecs
      sysctl; otherwise xfssyncd is dead. Clean up any comments that
      relate to xfssyncd to reflect its passing.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Ben Myers <bpm@sgi.com>
  4. 17 Aug, 2012 1 commit
    • xfs: kill struct declarations in xfs_mount.h · 1ed845df
      Committed by Alex Elder
      I noticed that "struct xfs_mount_args" was still declared in
      "fs/xfs/xfs_mount.h".  That struct doesn't even exist any more (and
      is obviously not referenced elsewhere in that header file).  While
      in there, delete four other unneeded struct declarations in that
      file.
      
      Doing so highlights that "fs/xfs/xfs_trace.h" was relying indirectly
      on "xfs_mount.h" to be #included in order to declare "struct
      xfs_bmbt_irec", so add that declaration to resolve that issue.
      Signed-off-by: Alex Elder <elder@inktank.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
  5. 31 Jul, 2012 1 commit
    • xfs: Convert to new freezing code · d9457dc0
      Committed by Jan Kara
      Generic code now blocks all writers from standard write paths. So we add
      blocking of all writers coming from ioctl (we get a protection of ioctl against
      racing remount read-only as a bonus) and convert xfs_file_aio_write() to a
      non-racy freeze protection. We also keep freeze protection on transaction
      start to block internal filesystem writes such as removal of preallocated
      blocks.
      
      CC: Ben Myers <bpm@sgi.com>
      CC: Alex Elder <elder@kernel.org>
      CC: xfs@oss.sgi.com
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  6. 22 Jun, 2012 2 commits
  7. 15 Jun, 2012 2 commits
  8. 15 May, 2012 2 commits
    • xfs: Do background CIL flushes via a workqueue · 4c2d542f
      Committed by Dave Chinner
      Doing background CIL flushes adds significant latency to whatever
      async transaction that triggers it. To avoid blocking async
      transactions on things like waiting for log buffer IO to complete,
      move the CIL push off into a workqueue.  By moving the push work
      into a workqueue, we remove all the latency that the commit adds
      from the foreground transaction commit path. This also means that
      single threaded workloads won't do the CIL push processing, leaving
      them more CPU to do more async transactions.
      
      To do this, we need to keep track of the sequence number we have
      pushed work for. This avoids having many transaction commits
      attempting to schedule work for the same sequence, and ensures that
      we only ever have one push (background or forced) in progress at a
      time. It also means that we don't need to take the CIL lock in write
      mode to check for potential background push races, which reduces
      lock contention.
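      The sequence tracking described above can be sketched in user-space
      C as follows. This is an illustration only, not the kernel code:
      the struct, its fields, and cil_request_push() are hypothetical
      names, the queued counter stands in for queuing work on a
      workqueue, and the locking the kernel would need is omitted.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the CIL push state. */
struct cil_sketch {
    uint64_t push_seq;    /* highest sequence a push was requested for */
    unsigned int queued;  /* stand-in for work items handed to a workqueue */
};

/*
 * Request a background push of checkpoint sequence 'seq'.
 * Returns true only for the first caller asking for a given sequence;
 * later callers see that push_seq already covers it and do nothing.
 * Single-threaded sketch: real code would serialise access to push_seq.
 */
static bool cil_request_push(struct cil_sketch *cil, uint64_t seq)
{
    if (seq <= cil->push_seq)
        return false;     /* a push for this sequence is already scheduled */

    cil->push_seq = seq;
    cil->queued++;        /* real code would queue the push work here */
    return true;
}
```

      Many commits racing to push the same sequence collapse into one
      queued work item, which is the property the commit message relies
      on.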
      
      To avoid potential issues with "smart" IO schedulers, don't use the
      workqueue for log force triggered flushes. Instead, do them directly
      so that the log IO is done directly by the process issuing the log
      force and so doesn't get stuck on IO elevator queue idling
      incorrectly delaying the log IO from the workqueue.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
    • xfs: implement freezing by emptying the AIL · 211e4d43
      Committed by Christoph Hellwig
      Now that we write back all metadata either synchronously or through
      the AIL we can simply implement metadata freezing in terms of
      emptying the AIL.
      
      The implementation for this is fairly simple and straightforward:
      A new routine is added that asks the xfsaild to push the AIL to the
      end and waits for it to complete and send a wakeup. The routine will
      then loop if the AIL is not actually empty, and continue to do so
      until the AIL is completely empty.
      
      We keep an inode reclaim pass in the freeze process to avoid having
      memory pressure have to reclaim inodes that require dirtying the
      filesystem to be reclaimed after the freeze has completed. This
      means we can also treat unmount in the exact same way as freeze.
      
      As an upside we can now remove the radix tree based inode writeback
      and xfs_unmountfs_writesb.
      
      [ Dave Chinner:
      	- Cleaned up commit message.
      	- Added inode reclaim passes back into freeze.
      	- Cleaned up wakeup mechanism to avoid the use of a new
      	  sleep counter variable. ]
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Mark Tinguely <tinguely@sgi.com>
      Signed-off-by: Ben Myers <bpm@sgi.com>
  9. 06 Mar, 2012 1 commit
  10. 04 Feb, 2012 1 commit
  11. 09 Dec, 2011 1 commit
  12. 21 Jul, 2011 1 commit
  13. 25 May, 2011 1 commit
    • xfs: add online discard support · e84661aa
      Committed by Christoph Hellwig
      Now that we have reliable tracking of deleted extents in a
      transaction we can easily implement "online" discard support
      which calls blkdev_issue_discard once a transaction commits.
      
      The actual discard is a two stage operation as we first have
      to mark the busy extent as not available for reuse before we
      can start the actual discard.  Note that we don't bother
      supporting discard for the non-delaylog mode.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Alex Elder <aelder@sgi.com>
  14. 08 Apr, 2011 3 commits
    • xfs: introduce background inode reclaim work · a7b339f1
      Committed by Dave Chinner
      Background inode reclaim needs to run more frequently than the XFS
      syncd work is run, as 30s is too long between optimal reclaim runs.
      Add a new periodic work item to the xfs syncd workqueue to run a
      fast, non-blocking inode reclaim scan.
      
      Background inode reclaim is kicked by the act of marking inodes for
      reclaim.  When an AG is first marked as having reclaimable inodes,
      the background reclaim work is kicked. It will continue to run
      periodically until it detects that there are no more reclaimable
      inodes. It will be kicked again when the first inode is queued for
      reclaim.
      
      To ensure shrinker based inode reclaim throttles to the inode
      cleaning and reclaim rate but still reclaims inodes efficiently,
      make it kick the background inode reclaim so that when we are low
      on memory we are trying to reclaim inodes as efficiently as
      possible. This kick should not be necessary, but it will protect
      against failures to kick the background reclaim when inodes are
      first dirtied.
      
      To provide the rate throttling, make the shrinker pass do
      synchronous inode reclaim so that it blocks on inodes under IO. This
      means that the shrinker will reclaim inodes rather than just
      skipping over them, but it does not adversely affect the rate of
      reclaim because most dirty inodes are already under IO due to the
      background reclaim work the shrinker kicked.
      
      These two modifications solve one of the two OOM killer invocations
      Chris Mason reported recently when running a stress testing script.
      The particular workload trigger for the OOM killer invocation is
      where there are more threads than CPUs all unlinking files in an
      extremely memory constrained environment. Unlike other solutions,
      this one does not have a performance impact when memory is not
      constrained or the number of concurrent threads operating is less
      than or equal to the number of CPUs.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Alex Elder <aelder@sgi.com>
    • xfs: convert ENOSPC inode flushing to use new syncd workqueue · 89e4cb55
      Committed by Dave Chinner
      One of the problems with the current inode flush at ENOSPC is that we
      queue a flush per ENOSPC event, regardless of how many are already
      queued. This can result in hundreds of queued flushes, most of
      which simply burn CPU scanning and do no real work. This simply slows
      down allocation at ENOSPC.
      
      We really only need one active flush at a time, and we can easily
      implement that via the new xfs_syncd_wq. All we need to do is queue
      a flush if one is not already active, then block waiting for the
      currently active flush to complete. The result is that we only ever
      have a single ENOSPC inode flush active at a time and this greatly
      reduces the overhead of ENOSPC processing.
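      A minimal user-space sketch of the "only one active flush" rule,
      using a C11 atomic flag. All names here are hypothetical, the
      queued counter stands in for queuing work on xfs_syncd_wq, and the
      step where callers block until the active flush completes is
      deliberately left out.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flush state; 'queued' counts flushes actually started. */
struct flush_sketch {
    atomic_flag active;   /* set while a flush is queued or running */
    unsigned int queued;
};

/*
 * Called on each ENOSPC event. Only the caller that finds the flag
 * clear queues a new flush; everyone else would just wait for the
 * flush that is already in flight (waiting is not modelled here).
 */
static bool enospc_flush_request(struct flush_sketch *fs)
{
    if (atomic_flag_test_and_set(&fs->active))
        return false;     /* a flush is already active */

    fs->queued++;         /* real code would queue the flush work here */
    return true;
}

/* Called by the flush worker once the flush has finished. */
static void enospc_flush_done(struct flush_sketch *fs)
{
    atomic_flag_clear(&fs->active);
}
```

      However many ENOSPC events arrive while a flush is in flight, the
      counter only advances once per completed flush cycle.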
      
      On my 2p test machine, this results in tests exercising ENOSPC
      conditions running significantly faster - 042 halves execution time,
      083 drops from 60s to 5s, etc - while not introducing test
      regressions.
      
      This allows us to remove the old xfssyncd threads and infrastructure
      as they are no longer used.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Alex Elder <aelder@sgi.com>
    • xfs: introduce a xfssyncd workqueue · c6d09b66
      Committed by Dave Chinner
      All of the work xfssyncd does is background functionality. There is
      no need for a thread per filesystem to do this work - it can all be
      managed by a global workqueue now that workqueues manage concurrency
      effectively.
      
      Introduce a new global xfssyncd workqueue, and convert the periodic
      work to use this new functionality. To do this, use a delayed work
      construct to schedule the next running of the periodic sync work
      for the filesystem. When the sync work is complete, queue a new
      delayed work for the next running of the sync work.
      
      For laptop mode, we wait on completion for the sync works, so ensure
      that the sync work queuing interface can flush and wait for work to
      complete to enable the work queue infrastructure to replace the
      current sequence number and wakeup that is used.
      
      Because the sync work does non-trivial amounts of work, mark the
      new work queue as CPU intensive.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Alex Elder <aelder@sgi.com>
  15. 04 Jan, 2011 1 commit
    • xfs: dynamic speculative EOF preallocation · 055388a3
      Committed by Dave Chinner
      Currently the size of the speculative preallocation during delayed
      allocation is fixed by either the allocsize mount option or a
      default size. We are seeing a lot of cases where we need to
      recommend using the allocsize mount option to prevent fragmentation
      when buffered writes land in the same AG.
      
      Rather than using a fixed preallocation size by default (up to 64k),
      make it dynamic by basing it on the current inode size. That way the
      EOF preallocation will increase as the file size increases.  Hence
      for streaming writes we are much more likely to get large
      preallocations exactly when we need them to reduce fragmentation.
      
      For default settings, the size of the initial extents is determined
      by the number of parallel writers and the amount of memory in the
      machine. For 4GB RAM and 4 concurrent 32GB file writes:
      
       EXT: FILE-OFFSET           BLOCK-RANGE          AG AG-OFFSET                 TOTAL
         0: [0..1048575]:         1048672..2097247      0 (1048672..2097247)      1048576
         1: [1048576..2097151]:   5242976..6291551      0 (5242976..6291551)      1048576
         2: [2097152..4194303]:   12583008..14680159    0 (12583008..14680159)    2097152
         3: [4194304..8388607]:   25165920..29360223    0 (25165920..29360223)    4194304
         4: [8388608..16777215]:  58720352..67108959    0 (58720352..67108959)    8388608
         5: [16777216..33554423]: 117440584..134217791  0 (117440584..134217791) 16777208
         6: [33554424..50331511]: 184549056..201326143  0 (184549056..201326143) 16777088
         7: [50331512..67108599]: 251657408..268434495  0 (251657408..268434495) 16777088
      
      and for 16 concurrent 16GB file writes:
      
       EXT: FILE-OFFSET           BLOCK-RANGE          AG AG-OFFSET                 TOTAL
         0: [0..262143]:          2490472..2752615      0 (2490472..2752615)       262144
         1: [262144..524287]:     6291560..6553703      0 (6291560..6553703)       262144
         2: [524288..1048575]:    13631592..14155879    0 (13631592..14155879)     524288
         3: [1048576..2097151]:   30408808..31457383    0 (30408808..31457383)    1048576
         4: [2097152..4194303]:   52428904..54526055    0 (52428904..54526055)    2097152
         5: [4194304..8388607]:   104857704..109052007  0 (104857704..109052007)  4194304
         6: [8388608..16777215]:  209715304..218103911  0 (209715304..218103911)  8388608
         7: [16777216..33554423]: 452984848..469762055  0 (452984848..469762055) 16777208
      
      Because it is hard to take back speculative preallocation, cases
      where there are large slow growing log files on a nearly full
      filesystem may cause premature ENOSPC. Hence as the filesystem nears
      full, the maximum dynamic prealloc size is reduced according to this
      table (based on 4k block size):
      
      freespace       max prealloc size
        >5%             full extent (8GB)
        4-5%             2GB (8GB >> 2)
        3-4%             1GB (8GB >> 3)
        2-3%           512MB (8GB >> 4)
        1-2%           256MB (8GB >> 5)
        <1%            128MB (8GB >> 6)
      
      This should reduce the amount of space held in speculative
      preallocation for such cases.
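      The throttling table above can be read as a shift applied to the
      8GB maximum. The following is a small user-space sketch of that
      reading, not the kernel implementation: the function names are
      hypothetical, free space is rounded down to a whole percent, and
      exactly 5% free is treated as "full extent" (the table leaves the
      band edges open, so that choice is an assumption).

```c
#include <assert.h>
#include <stdint.h>

/* Largest dynamic preallocation in the table: a full 8GB extent. */
#define MAX_PREALLOC_BYTES (8ULL << 30)

/*
 * Hypothetical helper: map the free-space percentage to the right
 * shift listed in the table. Band edges are an assumption (see above).
 */
static unsigned int prealloc_shift(uint64_t free_blocks, uint64_t total_blocks)
{
    assert(total_blocks > 0 && free_blocks <= total_blocks);

    /* free_blocks * 100 can overflow for block counts above about 2^57. */
    uint64_t pct = free_blocks * 100 / total_blocks;

    if (pct >= 5)
        return 0;   /* >5%:  full extent (8GB) */
    if (pct == 4)
        return 2;   /* 4-5%: 2GB   */
    if (pct == 3)
        return 3;   /* 3-4%: 1GB   */
    if (pct == 2)
        return 4;   /* 2-3%: 512MB */
    if (pct == 1)
        return 5;   /* 1-2%: 256MB */
    return 6;       /* <1%:  128MB */
}

/* Maximum dynamic preallocation size, in bytes, for the given free space. */
static uint64_t max_prealloc_bytes(uint64_t free_blocks, uint64_t total_blocks)
{
    return MAX_PREALLOC_BYTES >> prealloc_shift(free_blocks, total_blocks);
}
```

      For example, a filesystem with 45 of 1000 blocks free falls in the
      4-5% band, so the cap drops from 8GB to 2GB.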
      
      The allocsize mount option turns off the dynamic behaviour and fixes
      the prealloc size to whatever the mount option specifies. i.e. the
      behaviour is unchanged.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
  16. 19 Oct, 2010 4 commits
  17. 27 Jul, 2010 2 commits
  18. 20 Jul, 2010 1 commit
  19. 24 May, 2010 1 commit
    • xfs: Introduce delayed logging core code · 71e330b5
      Committed by Dave Chinner
      The delayed logging code only changes in-memory structures and as
      such can be enabled and disabled with a mount option. Add the mount
      option and emit a warning that this is an experimental feature that
      should not be used in production yet.
      
      We also need infrastructure to track committed items that have not
      yet been written to the log. This is what the Committed Item List
      (CIL) is for.
      
      The log item also needs to be extended to track the current log
      vector, the associated memory buffer and its location in the Commit
      Item List. Extend the log item and log vector structures to enable
      this tracking.
      
      To maintain the current log format for transactions with delayed
      logging, we need to introduce a checkpoint transaction and a context
      for tracking each checkpoint from initiation to transaction
      completion.  This includes adding a log ticket for tracking the log
      space required/used by the context checkpoint.
      
      To track all the changes we need an io vector array per log item,
      rather than a single array for the entire transaction. Using the new
      log vector structure for this requires two passes - the first to
      allocate the log vector structures and chain them together, and the
      second to fill them out.  This log vector chain can then be passed
      to the CIL for formatting, pinning and insertion into the CIL.
      
      Formatting of the log vector chain is relatively simple - it's just
      a loop over the iovecs on each log vector, but it is made slightly
      more complex because we re-write the iovec after the copy to point
      back at the memory buffer we just copied into.
      
      This code also needs to pin log items. If the log item is not
      already tracked in this checkpoint context, then it needs to be
      pinned. Otherwise it is already pinned and we don't need to pin it
      again.
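      The pin-once rule above can be illustrated with a short user-space
      sketch. The types and names are hypothetical, and tracking
      membership by checkpoint sequence number is an assumption made for
      brevity rather than a description of how the patch records which
      context an item belongs to.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical checkpoint context; sequence numbers start at 1. */
struct ckpt_ctx_sketch {
    uint64_t sequence;
};

/* Hypothetical log item; ctx_seq == 0 means "not tracked by any context". */
struct log_item_sketch {
    uint64_t ctx_seq;
    unsigned int pin_count;
};

/*
 * Insert an item into the current checkpoint context.
 * The item is pinned only the first time it joins this context;
 * relogging it within the same checkpoint leaves the pin count alone.
 * Returns true if a new pin was taken.
 */
static bool cil_insert_item(struct log_item_sketch *item,
                            const struct ckpt_ctx_sketch *ctx)
{
    if (item->ctx_seq == ctx->sequence)
        return false;             /* already tracked, already pinned */

    item->ctx_seq = ctx->sequence;
    item->pin_count++;
    return true;
}
```

      Relogging the same item repeatedly inside one checkpoint therefore
      costs a single pin, while joining a later checkpoint takes a new
      one.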
      
      The only other complexity is calculating the amount of new log space
      the formatting has consumed. This needs to be accounted to the
      transaction in progress, and the accounting is made more complex
      because we also need to steal space from it for log metadata in the
      checkpoint transaction. Calculate all this at insert time and update
      all the tickets, counters, etc correctly.
      
      Once we've formatted all the log items in the transaction, attach
      the busy extents to the checkpoint context so the busy extents live
      until checkpoint completion and can be processed at that point in
      time. Transactions can then be freed at this point in time.
      
      Now we need to issue checkpoints - we are tracking the amount of log
      space used by the items in the CIL, so we can trigger background
      checkpoints when the space usage gets to a certain threshold.
      Otherwise, checkpoints need to be triggered when a log
      synchronisation point is reached - a log force event.
      
      Because the log write code already handles chained log vectors,
      writing the transaction is trivial, too. Construct a transaction
      header, add it to the head of the chain and write it into the log,
      then issue a commit record write. Then we can release the checkpoint
      log ticket and attach the context to the log buffer so it can be
      called during IO completion to complete the checkpoint.
      
      We also need to allow for synchronising multiple in-flight
      checkpoints. This is needed for two things - the first is to ensure
      that checkpoint commit records appear in the log in the correct
      sequence order (so they are replayed in the correct order). The
      second is so that xfs_log_force_lsn() operates correctly and only
      flushes and/or waits for the specific sequence it was provided with.
      
      To do this we need a wait variable and a list tracking the
      checkpoint commits in progress. We can walk this list and wait for
      the checkpoints to change state or complete easily, and this provides
      the necessary synchronisation for correct operation in both cases.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Alex Elder <aelder@sgi.com>
  20. 30 Apr, 2010 1 commit
    • xfs: add a shrinker to background inode reclaim · 9bf729c0
      Committed by Dave Chinner
      On low memory boxes or those with highmem, the kernel can OOM before
      the background reclaims inodes via xfssyncd. Add a shrinker to run
      inode reclaim so that inode reclaim is expedited when memory is low.
      
      This is more complex than it needs to be because the VM folk don't
      want a context added to the shrinker infrastructure. Hence we need
      to add a global list of XFS mount structures so the shrinker can
      traverse them.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
  21. 02 Mar, 2010 1 commit
  22. 17 Feb, 2010 1 commit
    • percpu: add __percpu sparse annotations to fs · 003cb608
      Committed by Tejun Heo
      Add __percpu sparse annotations to fs.
      
      These annotations are to make sparse consider percpu variables to be
      in a different address space and warn if accessed without going
      through percpu accessors.  This patch doesn't affect normal builds.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: Alex Elder <aelder@sgi.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
  23. 09 Feb, 2010 1 commit
  24. 26 Jan, 2010 1 commit
    • xfs: don't hold onto reserved blocks on remount,ro · cbe132a8
      Committed by Dave Chinner
      If we hold onto reserved blocks when doing a remount,ro we end
      up writing the blocks used count to disk that includes the reserved
      blocks. Reserved blocks are not actually used, so this results in
      the values in the superblock being incorrect.
      
      Hence if we run xfs_check or xfs_repair -n while the filesystem is
      remounted read-only (remount,ro), an inconsistent filesystem is
      reported. Also, running xfs_copy on the remount,ro filesystem will
      result in an inconsistent image being generated.
      
      To fix this, unreserve the blocks when doing the remount,ro, and
      reserve them again on remount,rw. This way a remount,ro filesystem
      will appear consistent on disk to all utilities.
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
  25. 20 Jan, 2010 1 commit
  26. 16 Jan, 2010 3 commits
    • xfs: Add trace points for per-ag refcount debugging. · 0fa800fb
      Committed by Dave Chinner
      Uninline xfs_perag_{get,put} so that tracepoints can be inserted
      into them to speed debugging of reference count problems.
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Alex Elder <aelder@sgi.com>
    • xfs: Reference count per-ag structures · aed3bb90
      Committed by Dave Chinner
      Reference count the per-ag structures to ensure that we keep get/put
      pairs balanced. Assert that the reference counts are zero at unmount
      time to catch leaks. In future, reference counts will enable us to
      safely remove perag structures by allowing us to detect when they
      are no longer in use.
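      As a rough user-space illustration of balanced get/put pairs and
      the unmount-time check, consider the sketch below. The names are
      hypothetical and a plain int is used for the count; the kernel
      would need an atomic counter, which is omitted here for brevity.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a per-AG structure with a reference count. */
struct perag_sketch {
    int ref;
};

/* Take a reference; returns the new count. */
static int perag_get_sketch(struct perag_sketch *pag)
{
    return ++pag->ref;
}

/* Drop a reference; an unbalanced put trips the assertion. */
static int perag_put_sketch(struct perag_sketch *pag)
{
    assert(pag->ref > 0);
    return --pag->ref;
}

/* Unmount-time check: a non-zero count means a get without a put. */
static bool perag_is_unreferenced(const struct perag_sketch *pag)
{
    return pag->ref == 0;
}
```

      A leaked reference shows up as perag_is_unreferenced() returning
      false at teardown, which mirrors the assertion the commit adds at
      unmount time.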
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Alex Elder <aelder@sgi.com>
    • xfs: Replace per-ag array with a radix tree · 1c1c6ebc
      Committed by Dave Chinner
      The use of an array for the per-ag structures requires reallocation
      of the array when growing the filesystem. This requires locking
      access to the array to avoid use after free situations, and the
      locking is difficult to get right. To avoid needing to reallocate an
      array, change the per-ag structures to an allocated object per ag
      and index them using a tree structure.
      
      The AGs are always densely indexed (hence the use of an array), but
      the number supported is 2^32 and lookups tend to be random and hence
      indexing needs to scale. A simple choice is a radix tree - it works
      well with this sort of index.  This change also removes another
      large contiguous allocation from the mount/growfs path in XFS.
      
      The growing process now needs to change to only initialise the new
      AGs required for the extra space, and as such only needs to
      exclusively lock the tree for inserts. The rest of the code only
      needs to lock the tree while doing lookups, and hence this will
      remove all the deadlocks that currently occur on the m_perag_lock as
      it is now an innermost lock. The lock is also changed to a spinlock
      from a read/write lock as the hold time is now extremely short.
      
      To complete the picture, the per-ag structures will need to be
      reference counted to ensure that we don't free/modify them while
      they are still in use.  This will be done in a subsequent patch.
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Alex Elder <aelder@sgi.com>