1. 22 4月, 2013 1 次提交
    • D
      xfs: add CRC checks to the AGF · 4e0e6040
      Dave Chinner 提交于
      The AGF already has some self identifying fields (e.g. the sequence
      number) so we only need to add the uuid to it to identify the
      filesystem it belongs to. The location is fixed based on the
      sequence number, so there's no need to add a block number, either.
      
      Hence the only additional fields are the CRC and LSN fields. These
      are unlogged, so place some space between the end of the logged
      fields and them so that future expansion of the AGF for logged
      fields can be placed adjacent to the existing logged fields and
      hence not complicate the field-derived range based logging we
      currently have.
      
      Based originally on a patch from myself, modified further by
      Christoph Hellwig and then modified again to fit into the
      verifier structure with additional fields by myself. The multiple
      signed-off-by tags indicate the age and history of this patch.
      Signed-off-by: NDave Chinner <dgc@sgi.com>
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NBen Myers <bpm@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      4e0e6040
  2. 16 11月, 2012 1 次提交
    • D
      xfs: convert buffer verifiers to an ops structure. · 1813dd64
      Dave Chinner 提交于
      To separate the verifiers from iodone functions and associate read
      and write verifiers at the same time, introduce a buffer verifier
      operations structure to the xfs_buf.
      
      This avoids the need for assigning the write verifier, clearing the
      iodone function and re-running ioend processing in the read
      verifier, and gets rid of the nasty "b_pre_io" name for the write
      verifier function pointer. If we ever need to, it will also be
      easier to add further content specific callbacks to a buffer with an
      ops structure in place.
      
      We also avoid needing to export verifier functions, instead we
      can simply export the ops structures for those that are needed
      outside the function they are defined in.
      
      This patch also fixes a directory block readahead verifier issue
      it exposed.
      
      This patch also adds ops callbacks to the inode/alloc btree blocks
      initialised by growfs. These will need more work before they will
      work with CRCs.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NPhil White <pwhite@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      1813dd64
  3. 09 11月, 2012 1 次提交
    • B
      xfs: add EOFBLOCKS inode tagging/untagging · 27b52867
      Brian Foster 提交于
      Add the XFS_ICI_EOFBLOCKS_TAG inode tag to identify inodes with
      speculatively preallocated blocks beyond EOF. An inode is tagged
      when speculative preallocation occurs and untagged either via
      truncate down or when post-EOF blocks are freed via release or
      reclaim.
      
      The tag management is intentionally not aggressive to prefer
      simplicity over the complexity of handling all the corner cases
      under which post-EOF blocks could be freed (i.e., forward
      truncation, fallocate, write error conditions, etc.). This means
      that a tagged inode may or may not have post-EOF blocks after a
      period of time. The tag is eventually cleared when the inode is
      released or reclaimed.
      Signed-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NMark Tinguely <tinguely@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      27b52867
  4. 15 5月, 2012 1 次提交
  5. 26 7月, 2011 1 次提交
  6. 25 5月, 2011 2 次提交
    • C
      xfs: do not discard alloc btree blocks · 55a7bc5a
      Christoph Hellwig 提交于
      Blocks for the allocation btree are allocated from and released to
      the AGFL, and thus frequently reused.  Even worse we do not have an
      easy way to avoid using an AGFL block when it is discarded due to
      the simple FILO list of free blocks, and thus can frequently stall
      on blocks that are currently undergoing a discard.
      
      Add a flag to the busy extent tracking structure to skip the discard
      for allocation btree blocks.  In normal operation these blocks are
      reused frequently enough that there is no need to discard them
      anyway, but if they spill over to the allocation btree as part of a
      balance we "leak" blocks that we would otherwise discard.  We could
      fix this by adding another flag and keeping these block in the
      rbtree even after they aren't busy any more so that we could discard
      them when they migrate out of the AGFL.  Given that this would cause
      significant overhead I don't think it's worthwile for now.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAlex Elder <aelder@sgi.com>
      55a7bc5a
    • C
      xfs: add online discard support · e84661aa
      Christoph Hellwig 提交于
      Now that we have reliably tracking of deleted extents in a
      transaction we can easily implement "online" discard support
      which calls blkdev_issue_discard once a transaction commits.
      
      The actual discard is a two stage operation as we first have
      to mark the busy extent as not available for reuse before we
      can start the actual discard.  Note that we don't bother
      supporting discard for the non-delaylog mode.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAlex Elder <aelder@sgi.com>
      e84661aa
  7. 29 4月, 2011 1 次提交
    • C
      xfs: exact busy extent tracking · 97d3ac75
      Christoph Hellwig 提交于
      Update the extent tree in case we have to reuse a busy extent, so that it
      always is kept uptodate.  This is done by replacing the busy list searches
      with a new xfs_alloc_busy_reuse helper, which updates the busy extent tree
      in case of a reuse.  This allows us to allow reusing metadata extents
      unconditionally, and thus avoid log forces especially for allocation btree
      blocks.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAlex Elder <aelder@sgi.com>
      97d3ac75
  8. 16 12月, 2010 1 次提交
  9. 19 10月, 2010 3 次提交
    • D
      xfs: convert buffer cache hash to rbtree · 74f75a0c
      Dave Chinner 提交于
      The buffer cache hash is showing typical hash scalability problems.
      In large scale testing the number of cached items growing far larger
      than the hash can efficiently handle. Hence we need to move to a
      self-scaling cache indexing mechanism.
      
      I have selected rbtrees for indexing becuse they can have O(log n)
      search scalability, and insert and remove cost is not excessive,
      even on large trees. Hence we should be able to cache large numbers
      of buffers without incurring the excessive cache miss search
      penalties that the hash is imposing on us.
      
      To ensure we still have parallel access to the cache, we need
      multiple trees. Rather than hashing the buffers by disk address to
      select a tree, it seems more sensible to separate trees by typical
      access patterns. Most operations use buffers from within a single AG
      at a time, so rather than searching lots of different lists,
      separate the buffer indexes out into per-AG rbtrees. This means that
      searches during metadata operation have a much higher chance of
      hitting cache resident nodes, and that updates of the tree are less
      likely to disturb trees being accessed on other CPUs doing
      independent operations.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NAlex Elder <aelder@sgi.com>
      74f75a0c
    • D
      xfs: serialise inode reclaim within an AG · 69b491c2
      Dave Chinner 提交于
      Memory reclaim via shrinkers has a terrible habit of having N+M
      concurrent shrinker executions (N = num CPUs, M = num kswapds) all
      trying to shrink the same cache. When the cache they are all working
      on is protected by a single spinlock, massive contention an
      slowdowns occur.
      
      Wrap the per-ag inode caches with a reclaim mutex to serialise
      reclaim access to the AG. This will block concurrent reclaim in each
      AG but still allow reclaim to scan multiple AGs concurrently. Allow
      shrinkers to move on to the next AG if it can't get the lock, and if
      we can't get any AG, then start blocking on locks.
      
      To prevent reclaimers from continually scanning the same inodes in
      each AG, add a cursor that tracks where the last reclaim got up to
      and start from that point on the next reclaim. This should avoid
      only ever scanning a small number of inodes at the satart of each AG
      and not making progress. If we have a non-shrinker based reclaim
      pass, ignore the cursor and reset it to zero once we are done.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NAlex Elder <aelder@sgi.com>
      69b491c2
    • D
      xfs: lockless per-ag lookups · e176579e
      Dave Chinner 提交于
      When we start taking a reference to the per-ag for every cached
      buffer in the system, kernel lockstat profiling on an 8-way create
      workload shows the mp->m_perag_lock has higher acquisition rates
      than the inode lock and has significantly more contention. That is,
      it becomes the highest contended lock in the system.
      
      The perag lookup is trivial to convert to lock-less RCU lookups
      because perag structures never go away. Hence the only thing we need
      to protect against is tree structure changes during a grow. This can
      be done simply by replacing the locking in xfs_perag_get() with RCU
      read locking. This removes the mp->m_perag_lock completely from this
      path.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NAlex Elder <aelder@sgi.com>
      e176579e
  10. 29 5月, 2010 1 次提交
    • C
      xfs: fix access to upper inodes without inode64 · fb3b504a
      Christoph Hellwig 提交于
      If a filesystem is mounted without the inode64 mount option we
      should still be able to access inodes not fitting into 32 bits, just
      not created new ones.  For this to work we need to make sure the
      inode cache radix tree is initialized for all allocation groups, not
      just those we plan to allocate inodes from.  This patch makes sure
      we initialize the inode cache radix tree for all allocation groups,
      and also cleans xfs_initialize_perag up a bit to separate the
      inode32 logical from the general perag structure setup.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAlex Elder <aelder@sgi.com>
      fb3b504a
  11. 24 5月, 2010 1 次提交
    • D
      xfs: Improve scalability of busy extent tracking · ed3b4d6c
      Dave Chinner 提交于
      When we free a metadata extent, we record it in the per-AG busy
      extent array so that it is not re-used before the freeing
      transaction hits the disk. This array is fixed size, so when it
      overflows we make further allocation transactions synchronous
      because we cannot track more freed extents until those transactions
      hit the disk and are completed. Under heavy mixed allocation and
      freeing workloads with large log buffers, we can overflow this array
      quite easily.
      
      Further, the array is sparsely populated, which means that inserts
      need to search for a free slot, and array searches often have to
      search many more slots that are actually used to check all the
      busy extents. Quite inefficient, really.
      
      To enable this aspect of extent freeing to scale better, we need
      a structure that can grow dynamically. While in other areas of
      XFS we have used radix trees, the extents being freed are at random
      locations on disk so are better suited to being indexed by an rbtree.
      
      So, use a per-AG rbtree indexed by block number to track busy
      extents.  This incures a memory allocation when marking an extent
      busy, but should not occur too often in low memory situations. This
      should scale to an arbitrary number of extents so should not be a
      limitation for features such as in-memory aggregation of
      transactions.
      
      However, there are still situations where we can't avoid allocating
      busy extents (such as allocation from the AGFL). To minimise the
      overhead of such occurences, we need to avoid doing a synchronous
      log force while holding the AGF locked to ensure that the previous
      transactions are safely on disk before we use the extent. We can do
      this by marking the transaction doing the allocation as synchronous
      rather issuing a log force.
      
      Because of the locking involved and the ordering of transactions,
      the synchronous transaction provides the same guarantees as a
      synchronous log force because it ensures that all the prior
      transactions are already on disk when the synchronous transaction
      hits the disk. i.e. it preserves the free->allocate order of the
      extent correctly in recovery.
      
      By doing this, we avoid holding the AGF locked while log writes are
      in progress, hence reducing the length of time the lock is held and
      therefore we increase the rate at which we can allocate and free
      from the allocation group, thereby increasing overall throughput.
      
      The only problem with this approach is that when a metadata buffer is
      marked stale (e.g. a directory block is removed), then buffer remains
      pinned and locked until the log goes to disk. The issue here is that
      if that stale buffer is reallocated in a subsequent transaction, the
      attempt to lock that buffer in the transaction will hang waiting
      the log to go to disk to unlock and unpin the buffer. Hence if
      someone tries to lock a pinned, stale, locked buffer we need to
      push on the log to get it unlocked ASAP. Effectively we are trading
      off a guaranteed log force for a much less common trigger for log
      force to occur.
      
      Ideally we should not reallocate busy extents. That is a much more
      complex fix to the problem as it involves direct intervention in the
      allocation btree searches in many places. This is left to a future
      set of modifications.
      
      Finally, now that we track busy extents in allocated memory, we
      don't need the descriptors in the transaction structure to point to
      them. We can replace the complex busy chunk infrastructure with a
      simple linked list of busy extents. This allows us to remove a large
      chunk of code, making the overall change a net reduction in code
      size.
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAlex Elder <aelder@sgi.com>
      ed3b4d6c
  12. 30 4月, 2010 1 次提交
    • D
      xfs: add a shrinker to background inode reclaim · 9bf729c0
      Dave Chinner 提交于
      On low memory boxes or those with highmem, kernel can OOM before the
      background reclaims inodes via xfssyncd. Add a shrinker to run inode
      reclaim so that it inode reclaim is expedited when memory is low.
      
      This is more complex than it needs to be because the VM folk don't
      want a context added to the shrinker infrastructure. Hence we need
      to add a global list of XFS mount structures so the shrinker can
      traverse them.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      9bf729c0
  13. 16 1月, 2010 3 次提交
  14. 15 12月, 2009 1 次提交
    • C
      xfs: event tracing support · 0b1b213f
      Christoph Hellwig 提交于
      Convert the old xfs tracing support that could only be used with the
      out of tree kdb and xfsidbg patches to use the generic event tracer.
      
      To use it make sure CONFIG_EVENT_TRACING is enabled and then enable
      all xfs trace channels by:
      
         echo 1 > /sys/kernel/debug/tracing/events/xfs/enable
      
      or alternatively enable single events by just doing the same in one
      event subdirectory, e.g.
      
         echo 1 > /sys/kernel/debug/tracing/events/xfs/xfs_ihold/enable
      
      or set more complex filters, etc. In Documentation/trace/events.txt
      all this is desctribed in more detail.  To reads the events do a
      
         cat /sys/kernel/debug/tracing/trace
      
      Compared to the last posting this patch converts the tracing mostly to
      the one tracepoint per callsite model that other users of the new
      tracing facility also employ.  This allows a very fine-grained control
      of the tracing, a cleaner output of the traces and also enables the
      perf tool to use each tracepoint as a virtual performance counter,
           allowing us to e.g. count how often certain workloads git various
           spots in XFS.  Take a look at
      
          http://lwn.net/Articles/346470/
      
      for some examples.
      
      Also the btree tracing isn't included at all yet, as it will require
      additional core tracing features not in mainline yet, I plan to
      deliver it later.
      
      And the really nice thing about this patch is that it actually removes
      many lines of code while adding this nice functionality:
      
       fs/xfs/Makefile                |    8
       fs/xfs/linux-2.6/xfs_acl.c     |    1
       fs/xfs/linux-2.6/xfs_aops.c    |   52 -
       fs/xfs/linux-2.6/xfs_aops.h    |    2
       fs/xfs/linux-2.6/xfs_buf.c     |  117 +--
       fs/xfs/linux-2.6/xfs_buf.h     |   33
       fs/xfs/linux-2.6/xfs_fs_subr.c |    3
       fs/xfs/linux-2.6/xfs_ioctl.c   |    1
       fs/xfs/linux-2.6/xfs_ioctl32.c |    1
       fs/xfs/linux-2.6/xfs_iops.c    |    1
       fs/xfs/linux-2.6/xfs_linux.h   |    1
       fs/xfs/linux-2.6/xfs_lrw.c     |   87 --
       fs/xfs/linux-2.6/xfs_lrw.h     |   45 -
       fs/xfs/linux-2.6/xfs_super.c   |  104 ---
       fs/xfs/linux-2.6/xfs_super.h   |    7
       fs/xfs/linux-2.6/xfs_sync.c    |    1
       fs/xfs/linux-2.6/xfs_trace.c   |   75 ++
       fs/xfs/linux-2.6/xfs_trace.h   | 1369 +++++++++++++++++++++++++++++++++++++++++
       fs/xfs/linux-2.6/xfs_vnode.h   |    4
       fs/xfs/quota/xfs_dquot.c       |  110 ---
       fs/xfs/quota/xfs_dquot.h       |   21
       fs/xfs/quota/xfs_qm.c          |   40 -
       fs/xfs/quota/xfs_qm_syscalls.c |    4
       fs/xfs/support/ktrace.c        |  323 ---------
       fs/xfs/support/ktrace.h        |   85 --
       fs/xfs/xfs.h                   |   16
       fs/xfs/xfs_ag.h                |   14
       fs/xfs/xfs_alloc.c             |  230 +-----
       fs/xfs/xfs_alloc.h             |   27
       fs/xfs/xfs_alloc_btree.c       |    1
       fs/xfs/xfs_attr.c              |  107 ---
       fs/xfs/xfs_attr.h              |   10
       fs/xfs/xfs_attr_leaf.c         |   14
       fs/xfs/xfs_attr_sf.h           |   40 -
       fs/xfs/xfs_bmap.c              |  507 +++------------
       fs/xfs/xfs_bmap.h              |   49 -
       fs/xfs/xfs_bmap_btree.c        |    6
       fs/xfs/xfs_btree.c             |    5
       fs/xfs/xfs_btree_trace.h       |   17
       fs/xfs/xfs_buf_item.c          |   87 --
       fs/xfs/xfs_buf_item.h          |   20
       fs/xfs/xfs_da_btree.c          |    3
       fs/xfs/xfs_da_btree.h          |    7
       fs/xfs/xfs_dfrag.c             |    2
       fs/xfs/xfs_dir2.c              |    8
       fs/xfs/xfs_dir2_block.c        |   20
       fs/xfs/xfs_dir2_leaf.c         |   21
       fs/xfs/xfs_dir2_node.c         |   27
       fs/xfs/xfs_dir2_sf.c           |   26
       fs/xfs/xfs_dir2_trace.c        |  216 ------
       fs/xfs/xfs_dir2_trace.h        |   72 --
       fs/xfs/xfs_filestream.c        |    8
       fs/xfs/xfs_fsops.c             |    2
       fs/xfs/xfs_iget.c              |  111 ---
       fs/xfs/xfs_inode.c             |   67 --
       fs/xfs/xfs_inode.h             |   76 --
       fs/xfs/xfs_inode_item.c        |    5
       fs/xfs/xfs_iomap.c             |   85 --
       fs/xfs/xfs_iomap.h             |    8
       fs/xfs/xfs_log.c               |  181 +----
       fs/xfs/xfs_log_priv.h          |   20
       fs/xfs/xfs_log_recover.c       |    1
       fs/xfs/xfs_mount.c             |    2
       fs/xfs/xfs_quota.h             |    8
       fs/xfs/xfs_rename.c            |    1
       fs/xfs/xfs_rtalloc.c           |    1
       fs/xfs/xfs_rw.c                |    3
       fs/xfs/xfs_trans.h             |   47 +
       fs/xfs/xfs_trans_buf.c         |   62 -
       fs/xfs/xfs_vnodeops.c          |    8
       70 files changed, 2151 insertions(+), 2592 deletions(-)
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAlex Elder <aelder@sgi.com>
      0b1b213f
  15. 02 9月, 2009 1 次提交
  16. 01 9月, 2009 2 次提交
  17. 03 7月, 2009 1 次提交
  18. 02 7月, 2009 1 次提交
  19. 08 6月, 2009 1 次提交
    • D
      xfs: introduce a per-ag inode iterator · 75f3cb13
      Dave Chinner 提交于
      Given that we walk across the per-ag inode lists so often, it makes sense to
      introduce an iterator for this.
      
      Convert the sync and reclaim code to use this new iterator, quota code will
      follow in the next patch.
      
      Also change xfs_reclaim_inode to return -EGAIN instead of 1 for an inode
      already under reclaim.  This simplifies the AG iterator and doesn't
      matter for the only other caller.
      
      [hch: merged the lookup and execute callbacks back into one to get the
       pag_ici_lock locking correct and simplify the code flow]
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NEric Sandeen <sandeen@sandeen.net>
      75f3cb13
  20. 09 2月, 2009 1 次提交
  21. 19 1月, 2009 1 次提交
  22. 16 1月, 2009 1 次提交
  23. 09 1月, 2009 1 次提交
  24. 01 12月, 2008 2 次提交
  25. 30 10月, 2008 2 次提交
  26. 07 2月, 2008 1 次提交
  27. 15 10月, 2007 1 次提交
    • D
      [XFS] Radix tree based inode caching · da353b0d
      David Chinner 提交于
      One of the perpetual scaling problems XFS has is indexing it's incore
      inodes. We currently uses hashes and the default hash sizes chosen can
      only ever be a tradeoff between memory consumption and the maximum
      realistic size of the cache.
      
      As a result, anyone who has millions of inodes cached on a filesystem
      needs to tunes the size of the cache via the ihashsize mount option to
      allow decent scalability with inode cache operations.
      
      A further problem is the separate inode cluster hash, whose size is based
      on the ihashsize but is smaller, and so under certain conditions (sparse
      cluster cache population) this can become a limitation long before the
      inode hash is causing issues.
      
      The following patchset removes the inode hash and cluster hash and
      replaces them with radix trees to avoid the scalability limitations of the
      hashes. It also reduces the size of the inodes by 3 pointers....
      
      SGI-PV: 969561
      SGI-Modid: xfs-linux-melb:xfs-kern:29481a
      Signed-off-by: NDavid Chinner <dgc@sgi.com>
      Signed-off-by: NChristoph Hellwig <hch@infradead.org>
      Signed-off-by: NTim Shimmin <tes@sgi.com>
      da353b0d
  28. 14 7月, 2007 2 次提交
    • D
      [XFS] Concurrent Multi-File Data Streams · 2a82b8be
      David Chinner 提交于
      In media spaces, video is often stored in a frame-per-file format. When
      dealing with uncompressed realtime HD video streams in this format, it is
      crucial that files do not get fragmented and that multiple files a placed
      contiguously on disk.
      
      When multiple streams are being ingested and played out at the same time,
      it is critical that the filesystem does not cross the streams and
      interleave them together as this creates seek and readahead cache miss
      latency and prevents both ingest and playout from meeting frame rate
      targets.
      
      This patch set creates a "stream of files" concept into the allocator to
      place all the data from a single stream contiguously on disk so that RAID
      array readahead can be used effectively. Each additional stream gets
      placed in different allocation groups within the filesystem, thereby
      ensuring that we don't cross any streams. When an AG fills up, we select a
      new AG for the stream that is not in use.
      
      The core of the functionality is the stream tracking - each inode that we
      create in a directory needs to be associated with the directories' stream.
      Hence every time we create a file, we look up the directories' stream
      object and associate the new file with that object.
      
      Once we have a stream object for a file, we use the AG that the stream
      object point to for allocations. If we can't allocate in that AG (e.g. it
      is full) we move the entire stream to another AG. Other inodes in the same
      stream are moved to the new AG on their next allocation (i.e. lazy
      update).
      
      Stream objects are kept in a cache and hold a reference on the inode.
      Hence the inode cannot be reclaimed while there is an outstanding stream
      reference. This means that on unlink we need to remove the stream
      association and we also need to flush all the associations on certain
      events that want to reclaim all unreferenced inodes (e.g. filesystem
      freeze).
      
      SGI-PV: 964469
      SGI-Modid: xfs-linux-melb:xfs-kern:29096a
      Signed-off-by: NDavid Chinner <dgc@sgi.com>
      Signed-off-by: NBarry Naujok <bnaujok@sgi.com>
      Signed-off-by: NDonald Douwsma <donaldd@sgi.com>
      Signed-off-by: NChristoph Hellwig <hch@infradead.org>
      Signed-off-by: NTim Shimmin <tes@sgi.com>
      Signed-off-by: NVlad Apostolov <vapo@sgi.com>
      2a82b8be
    • D
      [XFS] Lazy Superblock Counters · 92821e2b
      David Chinner 提交于
      When we have a couple of hundred transactions on the fly at once, they all
      typically modify the on disk superblock in some way.
      create/unclink/mkdir/rmdir modify inode counts, allocation/freeing modify
      free block counts.
      
      When these counts are modified in a transaction, they must eventually lock
      the superblock buffer and apply the mods. The buffer then remains locked
      until the transaction is committed into the incore log buffer. The result
      of this is that with enough transactions on the fly the incore superblock
      buffer becomes a bottleneck.
      
      The result of contention on the incore superblock buffer is that
      transaction rates fall - the more pressure that is put on the superblock
      buffer, the slower things go.
      
      The key to removing the contention is to not require the superblock fields
      in question to be locked. We do that by not marking the superblock dirty
      in the transaction. IOWs, we modify the incore superblock but do not
      modify the cached superblock buffer. In short, we do not log superblock
      modifications to critical fields in the superblock on every transaction.
      In fact we only do it just before we write the superblock to disk every
      sync period or just before unmount.
      
      This creates an interesting problem - if we don't log or write out the
      fields in every transaction, then how do the values get recovered after a
      crash? the answer is simple - we keep enough duplicate, logged information
      in other structures that we can reconstruct the correct count after log
      recovery has been performed.
      
      It is the AGF and AGI structures that contain the duplicate information;
      after recovery, we walk every AGI and AGF and sum their individual
      counters to get the correct value, and we do a transaction into the log to
      correct them. An optimisation of this is that if we have a clean unmount
      record, we know the value in the superblock is correct, so we can avoid
      the summation walk under normal conditions and so mount/recovery times do
      not change under normal operation.
      
      One wrinkle that was discovered during development was that the blocks
      used in the freespace btrees are never accounted for in the AGF counters.
      This was once a valid optimisation to make; when the filesystem is full,
      the free space btrees are empty and consume no space. Hence when it
      matters, the "accounting" is correct. But that means the when we do the
      AGF summations, we would not have a correct count and xfs_check would
      complain. Hence a new counter was added to track the number of blocks used
      by the free space btrees. This is an *on-disk format change*.
      
      As a result of this, lazy superblock counters are a mkfs option and at the
      moment on linux there is no way to convert an old filesystem. This is
      possible - xfs_db can be used to twiddle the right bits and then
      xfs_repair will do the format conversion for you. Similarly, you can
      convert backwards as well. At some point we'll add functionality to
      xfs_admin to do the bit twiddling easily....
      
      SGI-PV: 964999
      SGI-Modid: xfs-linux-melb:xfs-kern:28652a
      Signed-off-by: NDavid Chinner <dgc@sgi.com>
      Signed-off-by: NChristoph Hellwig <hch@infradead.org>
      Signed-off-by: NTim Shimmin <tes@sgi.com>
      92821e2b
  29. 28 9月, 2006 1 次提交
  30. 29 3月, 2006 1 次提交
  31. 02 11月, 2005 1 次提交