1. 16 11月, 2012 1 次提交
  2. 19 10月, 2012 2 次提交
    • D
      xfs: move allocation stack switch up to xfs_bmapi_allocate · e04426b9
      Dave Chinner 提交于
      Switching stacks are xfs_alloc_vextent can cause deadlocks when we
      run out of worker threads on the allocation workqueue. This can
      occur because xfs_bmap_btalloc can make multiple calls to
      xfs_alloc_vextent() and even if xfs_alloc_vextent() fails it can
      return with the AGF locked in the current allocation transaction.
      
      If we then need to make another allocation, and all the allocation
      worker contexts are exhausted because the are blocked waiting for
      the AGF lock, holder of the AGF cannot get it's xfs-alloc_vextent
      work completed to release the AGF.  Hence allocation effectively
      deadlocks.
      
      To avoid this, move the stack switch one layer up to
      xfs_bmapi_allocate() so that all of the allocation attempts in a
      single switched stack transaction occur in a single worker context.
      This avoids the problem of an allocation being blocked waiting for
      a worker thread whilst holding the AGF.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NMark Tinguely <tinguely@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      e04426b9
    • D
      xfs: introduce XFS_BMAPI_STACK_SWITCH · 2455881c
      Dave Chinner 提交于
      Certain allocation paths through xfs_bmapi_write() are in situations
      where we have limited stack available. These are almost always in
      the buffered IO writeback path when convertion delayed allocation
      extents to real extents.
      
      The current stack switch occurs for userdata allocations, which
      means we also do stack switches for preallocation, direct IO and
      unwritten extent conversion, even those these call chains have never
      been implicated in a stack overrun.
      
      Hence, let's target just the single stack overun offended for stack
      switches. To do that, introduce a XFS_BMAPI_STACK_SWITCH flag that
      the caller can pass xfs_bmapi_write() to indicate it should switch
      stacks if it needs to do allocation.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NMark Tinguely <tinguely@sgi.com>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      2455881c
  3. 15 5月, 2012 1 次提交
  4. 28 3月, 2012 1 次提交
    • D
      xfs: fix fstrim offset calculations · a66d6363
      Dave Chinner 提交于
      xfs_ioc_fstrim() doesn't treat the incoming offset and length
      correctly. It treats them as a filesystem block address, rather than
      a disk address. This is wrong because the range passed in is a
      linear representation, while the filesystem block address notation
      is a sparse representation. Hence we cannot convert the range direct
      to filesystem block units and then use that for calculating the
      range to trim.
      
      While this sounds dangerous, the problem is limited to calculating
      what AGs need to be trimmed. The code that calcuates the actual
      ranges to trim gets the right result (i.e. only ever discards free
      space), even though it uses the wrong ranges to limit what is
      trimmed. Hence this is not a bug that endangers user data.
      
      Fix this by treating the range as a disk address range and use the
      appropriate functions to convert the range into the desired formats
      for calculations.
      
      Further, fix the first free extent lookup (the longest) to actually
      find the largest free extent. Currently this lookup uses a <=
      lookup, which results in finding the extent to the left of the
      largest because we can never get an exact match on the largest
      extent. This is due to the fact that while we know it's size, we
      don't know it's location and so the exact match fails and we move
      one record to the left to get the next largest extent. Instead, use
      a >= search so that the lookup returns the largest extent regardless
      of the fact we don't get an exact match on it.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      a66d6363
  5. 23 3月, 2012 1 次提交
    • D
      xfs: introduce an allocation workqueue · c999a223
      Dave Chinner 提交于
      We currently have significant issues with the amount of stack that
      allocation in XFS uses, especially in the writeback path. We can
      easily consume 4k of stack between mapping the page, manipulating
      the bmap btree and allocating blocks from the free list. Not to
      mention btree block readahead and other functionality that issues IO
      in the allocation path.
      
      As a result, we can no longer fit allocation in the writeback path
      in the stack space provided on x86_64. To alleviate this problem,
      introduce an allocation workqueue and move all allocations to a
      seperate context. This can be easily added as an interposing layer
      into xfs_alloc_vextent(), which takes a single argument structure
      and does not return until the allocation is complete or has failed.
      
      To do this, add a work structure and a completion to the allocation
      args structure. This allows xfs_alloc_vextent to queue the args onto
      the workqueue and wait for it to be completed by the worker. This
      can be done completely transparently to the caller.
      
      The worker function needs to ensure that it sets and clears the
      PF_TRANS flag appropriately as it is being run in an active
      transaction context. Work can also be queued in a memory reclaim
      context, so a rescuer is needed for the workqueue.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NBen Myers <bpm@sgi.com>
      c999a223
  6. 25 5月, 2011 2 次提交
    • C
      xfs: do not discard alloc btree blocks · 55a7bc5a
      Christoph Hellwig 提交于
      Blocks for the allocation btree are allocated from and released to
      the AGFL, and thus frequently reused.  Even worse we do not have an
      easy way to avoid using an AGFL block when it is discarded due to
      the simple FILO list of free blocks, and thus can frequently stall
      on blocks that are currently undergoing a discard.
      
      Add a flag to the busy extent tracking structure to skip the discard
      for allocation btree blocks.  In normal operation these blocks are
      reused frequently enough that there is no need to discard them
      anyway, but if they spill over to the allocation btree as part of a
      balance we "leak" blocks that we would otherwise discard.  We could
      fix this by adding another flag and keeping these block in the
      rbtree even after they aren't busy any more so that we could discard
      them when they migrate out of the AGFL.  Given that this would cause
      significant overhead I don't think it's worthwile for now.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAlex Elder <aelder@sgi.com>
      55a7bc5a
    • C
      xfs: add online discard support · e84661aa
      Christoph Hellwig 提交于
      Now that we have reliably tracking of deleted extents in a
      transaction we can easily implement "online" discard support
      which calls blkdev_issue_discard once a transaction commits.
      
      The actual discard is a two stage operation as we first have
      to mark the busy extent as not available for reuse before we
      can start the actual discard.  Note that we don't bother
      supporting discard for the non-delaylog mode.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAlex Elder <aelder@sgi.com>
      e84661aa
  7. 29 4月, 2011 2 次提交
  8. 28 1月, 2011 1 次提交
    • D
      xfs: limit extent length for allocation to AG size · 14b064ce
      Dave Chinner 提交于
      Delayed allocation extents can be larger than AGs, so when trying to
      convert a large range we may scan every AG inside
      xfs_bmap_alloc_nullfb() trying to find an AG with a size larger than
      an AG. We should stop when we find the first AG with a maximum
      possible allocation size. This causes excessive CPU usage when there
      are lots of AGs.
      
      The same problem occurs when doing preallocation of a range larger
      than an AG.
      
      Fix the problem by limiting real allocation lengths to the maximum
      that an AG can support. This means if we have empty AGs, we'll stop
      the search at the first of them. If there are no empty AGs, we'll
      still scan them all, but that is a different problem....
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NAlex Elder <aelder@sgi.com>
      14b064ce
  9. 12 1月, 2011 1 次提交
  10. 27 7月, 2010 1 次提交
  11. 24 5月, 2010 1 次提交
    • D
      xfs: Improve scalability of busy extent tracking · ed3b4d6c
      Dave Chinner 提交于
      When we free a metadata extent, we record it in the per-AG busy
      extent array so that it is not re-used before the freeing
      transaction hits the disk. This array is fixed size, so when it
      overflows we make further allocation transactions synchronous
      because we cannot track more freed extents until those transactions
      hit the disk and are completed. Under heavy mixed allocation and
      freeing workloads with large log buffers, we can overflow this array
      quite easily.
      
      Further, the array is sparsely populated, which means that inserts
      need to search for a free slot, and array searches often have to
      search many more slots that are actually used to check all the
      busy extents. Quite inefficient, really.
      
      To enable this aspect of extent freeing to scale better, we need
      a structure that can grow dynamically. While in other areas of
      XFS we have used radix trees, the extents being freed are at random
      locations on disk so are better suited to being indexed by an rbtree.
      
      So, use a per-AG rbtree indexed by block number to track busy
      extents.  This incures a memory allocation when marking an extent
      busy, but should not occur too often in low memory situations. This
      should scale to an arbitrary number of extents so should not be a
      limitation for features such as in-memory aggregation of
      transactions.
      
      However, there are still situations where we can't avoid allocating
      busy extents (such as allocation from the AGFL). To minimise the
      overhead of such occurences, we need to avoid doing a synchronous
      log force while holding the AGF locked to ensure that the previous
      transactions are safely on disk before we use the extent. We can do
      this by marking the transaction doing the allocation as synchronous
      rather issuing a log force.
      
      Because of the locking involved and the ordering of transactions,
      the synchronous transaction provides the same guarantees as a
      synchronous log force because it ensures that all the prior
      transactions are already on disk when the synchronous transaction
      hits the disk. i.e. it preserves the free->allocate order of the
      extent correctly in recovery.
      
      By doing this, we avoid holding the AGF locked while log writes are
      in progress, hence reducing the length of time the lock is held and
      therefore we increase the rate at which we can allocate and free
      from the allocation group, thereby increasing overall throughput.
      
      The only problem with this approach is that when a metadata buffer is
      marked stale (e.g. a directory block is removed), then buffer remains
      pinned and locked until the log goes to disk. The issue here is that
      if that stale buffer is reallocated in a subsequent transaction, the
      attempt to lock that buffer in the transaction will hang waiting
      the log to go to disk to unlock and unpin the buffer. Hence if
      someone tries to lock a pinned, stale, locked buffer we need to
      push on the log to get it unlocked ASAP. Effectively we are trading
      off a guaranteed log force for a much less common trigger for log
      force to occur.
      
      Ideally we should not reallocate busy extents. That is a much more
      complex fix to the problem as it involves direct intervention in the
      allocation btree searches in many places. This is left to a future
      set of modifications.
      
      Finally, now that we track busy extents in allocated memory, we
      don't need the descriptors in the transaction structure to point to
      them. We can replace the complex busy chunk infrastructure with a
      simple linked list of busy extents. This allows us to remove a large
      chunk of code, making the overall change a net reduction in code
      size.
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAlex Elder <aelder@sgi.com>
      ed3b4d6c
  12. 15 12月, 2009 1 次提交
    • C
      xfs: event tracing support · 0b1b213f
      Christoph Hellwig 提交于
      Convert the old xfs tracing support that could only be used with the
      out of tree kdb and xfsidbg patches to use the generic event tracer.
      
      To use it make sure CONFIG_EVENT_TRACING is enabled and then enable
      all xfs trace channels by:
      
         echo 1 > /sys/kernel/debug/tracing/events/xfs/enable
      
      or alternatively enable single events by just doing the same in one
      event subdirectory, e.g.
      
         echo 1 > /sys/kernel/debug/tracing/events/xfs/xfs_ihold/enable
      
      or set more complex filters, etc. In Documentation/trace/events.txt
      all this is desctribed in more detail.  To reads the events do a
      
         cat /sys/kernel/debug/tracing/trace
      
      Compared to the last posting this patch converts the tracing mostly to
      the one tracepoint per callsite model that other users of the new
      tracing facility also employ.  This allows a very fine-grained control
      of the tracing, a cleaner output of the traces and also enables the
      perf tool to use each tracepoint as a virtual performance counter,
           allowing us to e.g. count how often certain workloads git various
           spots in XFS.  Take a look at
      
          http://lwn.net/Articles/346470/
      
      for some examples.
      
      Also the btree tracing isn't included at all yet, as it will require
      additional core tracing features not in mainline yet, I plan to
      deliver it later.
      
      And the really nice thing about this patch is that it actually removes
      many lines of code while adding this nice functionality:
      
       fs/xfs/Makefile                |    8
       fs/xfs/linux-2.6/xfs_acl.c     |    1
       fs/xfs/linux-2.6/xfs_aops.c    |   52 -
       fs/xfs/linux-2.6/xfs_aops.h    |    2
       fs/xfs/linux-2.6/xfs_buf.c     |  117 +--
       fs/xfs/linux-2.6/xfs_buf.h     |   33
       fs/xfs/linux-2.6/xfs_fs_subr.c |    3
       fs/xfs/linux-2.6/xfs_ioctl.c   |    1
       fs/xfs/linux-2.6/xfs_ioctl32.c |    1
       fs/xfs/linux-2.6/xfs_iops.c    |    1
       fs/xfs/linux-2.6/xfs_linux.h   |    1
       fs/xfs/linux-2.6/xfs_lrw.c     |   87 --
       fs/xfs/linux-2.6/xfs_lrw.h     |   45 -
       fs/xfs/linux-2.6/xfs_super.c   |  104 ---
       fs/xfs/linux-2.6/xfs_super.h   |    7
       fs/xfs/linux-2.6/xfs_sync.c    |    1
       fs/xfs/linux-2.6/xfs_trace.c   |   75 ++
       fs/xfs/linux-2.6/xfs_trace.h   | 1369 +++++++++++++++++++++++++++++++++++++++++
       fs/xfs/linux-2.6/xfs_vnode.h   |    4
       fs/xfs/quota/xfs_dquot.c       |  110 ---
       fs/xfs/quota/xfs_dquot.h       |   21
       fs/xfs/quota/xfs_qm.c          |   40 -
       fs/xfs/quota/xfs_qm_syscalls.c |    4
       fs/xfs/support/ktrace.c        |  323 ---------
       fs/xfs/support/ktrace.h        |   85 --
       fs/xfs/xfs.h                   |   16
       fs/xfs/xfs_ag.h                |   14
       fs/xfs/xfs_alloc.c             |  230 +-----
       fs/xfs/xfs_alloc.h             |   27
       fs/xfs/xfs_alloc_btree.c       |    1
       fs/xfs/xfs_attr.c              |  107 ---
       fs/xfs/xfs_attr.h              |   10
       fs/xfs/xfs_attr_leaf.c         |   14
       fs/xfs/xfs_attr_sf.h           |   40 -
       fs/xfs/xfs_bmap.c              |  507 +++------------
       fs/xfs/xfs_bmap.h              |   49 -
       fs/xfs/xfs_bmap_btree.c        |    6
       fs/xfs/xfs_btree.c             |    5
       fs/xfs/xfs_btree_trace.h       |   17
       fs/xfs/xfs_buf_item.c          |   87 --
       fs/xfs/xfs_buf_item.h          |   20
       fs/xfs/xfs_da_btree.c          |    3
       fs/xfs/xfs_da_btree.h          |    7
       fs/xfs/xfs_dfrag.c             |    2
       fs/xfs/xfs_dir2.c              |    8
       fs/xfs/xfs_dir2_block.c        |   20
       fs/xfs/xfs_dir2_leaf.c         |   21
       fs/xfs/xfs_dir2_node.c         |   27
       fs/xfs/xfs_dir2_sf.c           |   26
       fs/xfs/xfs_dir2_trace.c        |  216 ------
       fs/xfs/xfs_dir2_trace.h        |   72 --
       fs/xfs/xfs_filestream.c        |    8
       fs/xfs/xfs_fsops.c             |    2
       fs/xfs/xfs_iget.c              |  111 ---
       fs/xfs/xfs_inode.c             |   67 --
       fs/xfs/xfs_inode.h             |   76 --
       fs/xfs/xfs_inode_item.c        |    5
       fs/xfs/xfs_iomap.c             |   85 --
       fs/xfs/xfs_iomap.h             |    8
       fs/xfs/xfs_log.c               |  181 +----
       fs/xfs/xfs_log_priv.h          |   20
       fs/xfs/xfs_log_recover.c       |    1
       fs/xfs/xfs_mount.c             |    2
       fs/xfs/xfs_quota.h             |    8
       fs/xfs/xfs_rename.c            |    1
       fs/xfs/xfs_rtalloc.c           |    1
       fs/xfs/xfs_rw.c                |    3
       fs/xfs/xfs_trans.h             |   47 +
       fs/xfs/xfs_trans_buf.c         |   62 -
       fs/xfs/xfs_vnodeops.c          |    8
       70 files changed, 2151 insertions(+), 2592 deletions(-)
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAlex Elder <aelder@sgi.com>
      0b1b213f
  13. 16 3月, 2009 1 次提交
  14. 30 10月, 2008 1 次提交
  15. 14 7月, 2007 1 次提交
    • D
      [XFS] Lazy Superblock Counters · 92821e2b
      David Chinner 提交于
      When we have a couple of hundred transactions on the fly at once, they all
      typically modify the on disk superblock in some way.
      create/unclink/mkdir/rmdir modify inode counts, allocation/freeing modify
      free block counts.
      
      When these counts are modified in a transaction, they must eventually lock
      the superblock buffer and apply the mods. The buffer then remains locked
      until the transaction is committed into the incore log buffer. The result
      of this is that with enough transactions on the fly the incore superblock
      buffer becomes a bottleneck.
      
      The result of contention on the incore superblock buffer is that
      transaction rates fall - the more pressure that is put on the superblock
      buffer, the slower things go.
      
      The key to removing the contention is to not require the superblock fields
      in question to be locked. We do that by not marking the superblock dirty
      in the transaction. IOWs, we modify the incore superblock but do not
      modify the cached superblock buffer. In short, we do not log superblock
      modifications to critical fields in the superblock on every transaction.
      In fact we only do it just before we write the superblock to disk every
      sync period or just before unmount.
      
      This creates an interesting problem - if we don't log or write out the
      fields in every transaction, then how do the values get recovered after a
      crash? the answer is simple - we keep enough duplicate, logged information
      in other structures that we can reconstruct the correct count after log
      recovery has been performed.
      
      It is the AGF and AGI structures that contain the duplicate information;
      after recovery, we walk every AGI and AGF and sum their individual
      counters to get the correct value, and we do a transaction into the log to
      correct them. An optimisation of this is that if we have a clean unmount
      record, we know the value in the superblock is correct, so we can avoid
      the summation walk under normal conditions and so mount/recovery times do
      not change under normal operation.
      
      One wrinkle that was discovered during development was that the blocks
      used in the freespace btrees are never accounted for in the AGF counters.
      This was once a valid optimisation to make; when the filesystem is full,
      the free space btrees are empty and consume no space. Hence when it
      matters, the "accounting" is correct. But that means the when we do the
      AGF summations, we would not have a correct count and xfs_check would
      complain. Hence a new counter was added to track the number of blocks used
      by the free space btrees. This is an *on-disk format change*.
      
      As a result of this, lazy superblock counters are a mkfs option and at the
      moment on linux there is no way to convert an old filesystem. This is
      possible - xfs_db can be used to twiddle the right bits and then
      xfs_repair will do the format conversion for you. Similarly, you can
      convert backwards as well. At some point we'll add functionality to
      xfs_admin to do the bit twiddling easily....
      
      SGI-PV: 964999
      SGI-Modid: xfs-linux-melb:xfs-kern:28652a
      Signed-off-by: NDavid Chinner <dgc@sgi.com>
      Signed-off-by: NChristoph Hellwig <hch@infradead.org>
      Signed-off-by: NTim Shimmin <tes@sgi.com>
      92821e2b
  16. 07 9月, 2006 1 次提交
    • D
      [XFS] Prevent free space oversubscription and xfssyncd looping. · 4be536de
      David Chinner 提交于
      The fix for recent ENOSPC deadlocks introduced certain limitations on
      allocations. The fix could cause xfssyncd to loop endlessly if we did not
      leave some space free for the allocator to work correctly. Basically, we
      needed to ensure that we had at least 4 blocks free for an AG free list
      and a block for the inode bmap btree at all times.
      
      However, this did not take into account the fact that each AG has a free
      list that needs 4 blocks. Hence any filesystem with more than one AG could
      cause oversubscription of free space and make xfssyncd spin forever trying
      to allocate space needed for AG freelists that was not available in the
      AG.
      
      The following patch reserves space for the free lists in all AGs plus the
      inode bmap btree which prevents oversubscription. It also prevents those
      blocks from being reported as free space (as they can never be used) and
      makes the SMP in-core superblock accounting code and the reserved block
      ioctl respect this requirement.
      
      SGI-PV: 955674
      SGI-Modid: xfs-linux-melb:xfs-kern:26894a
      Signed-off-by: NDavid Chinner <dgc@sgi.com>
      Signed-off-by: NDavid Chatterton <chatz@sgi.com>
      4be536de
  17. 09 6月, 2006 1 次提交
    • Y
      [XFS] In actual allocation of file system blocks and freeing extents, the · d210a28c
      Yingping Lu 提交于
      transaction within each such operation may involve multiple locking of AGF
      buffer. While the freeing extent function has sorted the extents based on
      AGF number before entering into transaction, however, when the file system
      space is very limited, the allocation of space would try every AGF to get
      space allocated, this could potentially cause out-of-order locking, thus
      deadlock could happen. This fix mitigates the scarce space for allocation
      by setting aside a few blocks without reservation, and avoid deadlock by
      maintaining ascending order of AGF locking.
      
      SGI-PV: 947395
      SGI-Modid: xfs-linux-melb:xfs-kern:210801a
      Signed-off-by: NYingping Lu <yingping@sgi.com>
      Signed-off-by: NNathan Scott <nathans@sgi.com>
      d210a28c
  18. 29 3月, 2006 1 次提交
  19. 02 11月, 2005 1 次提交
  20. 17 4月, 2005 1 次提交
    • L
      Linux-2.6.12-rc2 · 1da177e4
      Linus Torvalds 提交于
      Initial git repository build. I'm not bothering with the full history,
      even though we have it. We can create a separate "historical" git
      archive of that later if we want to, and in the meantime it's about
      3.2GB when imported into git - space that would just make the early
      git days unnecessarily complicated, when we don't have a lot of good
      infrastructure for it.
      
      Let it rip!
      1da177e4