1. 29 4月, 2011 1 次提交
    • C
      xfs: do not immediately reuse busy extent ranges · e26f0501
      Christoph Hellwig 提交于
      Every time we reallocate a busy extent, we cause a synchronous log force
      to occur to ensure the freeing transaction is on disk before we continue
      and use the newly allocated extent.  This is extremely sub-optimal as we
      have to mark every transaction with blocks that get reused as synchronous.
      
      Instead of searching the busy extent list after deciding on the extent to
      allocate, check each candidate extent during the allocation decisions as
      to whether they are in the busy list.  If they are in the busy list, we
      trim the busy range out of the extent we have found and determine if that
      trimmed range is still OK for allocation. In many cases, this check can
      be incorporated into the allocation extent alignment code which already
      does trimming of the found extent before determining if it is a valid
      candidate for allocation.
      
      Based on earlier patches from Dave Chinner.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAlex Elder <aelder@sgi.com>
      e26f0501
  2. 12 1月, 2011 1 次提交
  3. 21 12月, 2010 4 次提交
    • D
      xfs: introduce new locks for the log grant ticket wait queues · 3f16b985
      Dave Chinner 提交于
      The log grant ticket wait queues are currently protected by the log
      grant lock.  However, the queues are functionally independent from
      each other, and operations on them only require serialisation
      against other queue operations now that all of the other log
      variables they use are atomic values.
      
      Hence, we can make them independent of the grant lock by introducing
      new locks just to protect the lists operations. because the lists
      are independent, we can use a lock per list and ensure that reserve
      and write head queuing do not contend.
      
      To ensure forced shutdowns work correctly in conjunction with the
      new fast paths, ensure that we check whether the log has been shut
      down in the grant functions once we hold the relevant spin locks but
      before we go to sleep. This is needed to co-ordinate correctly with
      the wakeups that are issued on the ticket queues so we don't leave
      any processes sleeping on the queues during a shutdown.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      3f16b985
    • D
      xfs: convert l_tail_lsn to an atomic variable. · 1c3cb9ec
      Dave Chinner 提交于
      log->l_tail_lsn is currently protected by the log grant lock. The
      lock is only needed for serialising readers against writers, so we
      don't really need the lock if we make the l_tail_lsn variable an
      atomic. Converting the l_tail_lsn variable to an atomic64_t means we
      can start to peel back the grant lock from various operations.
      
      Also, provide functions to safely crack an atomic LSN variable into
      it's component pieces and to recombined the components into an
      atomic variable. Use them where appropriate.
      
      This also removes the need for explicitly holding a spinlock to read
      the l_tail_lsn on 32 bit platforms.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      
      1c3cb9ec
    • D
      xfs: combine grant heads into a single 64 bit integer · a69ed03c
      Dave Chinner 提交于
      Prepare for switching the grant heads to atomic variables by
      combining the two 32 bit values that make up the grant head into a
      single 64 bit variable.  Provide wrapper functions to combine and
      split the grant heads appropriately for calculations and use them as
      necessary.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      a69ed03c
    • D
      xfs: convert log grant ticket queues to list heads · 10547941
      Dave Chinner 提交于
      The grant write and reserve queues use a roll-your-own double linked
      list, so convert it to a standard list_head structure and convert
      all the list traversals to use list_for_each_entry(). We can also
      get rid of the XLOG_TIC_IN_Q flag as we can use the list_empty()
      check to tell if the ticket is in a list or not.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      10547941
  4. 17 12月, 2010 2 次提交
  5. 19 10月, 2010 2 次提交
  6. 10 9月, 2010 1 次提交
  7. 10 8月, 2010 1 次提交
  8. 27 7月, 2010 5 次提交
    • T
      xfs: Fix build when CONFIG_XFS_POSIX_ACL=n · 0f1a932f
      Tony Luck 提交于
      When CONFIG_XFS_POSIX_ACL is not set "xfs_check_acl" is #defined
      to NULL - which breaks the code attempting to add a tracepoint
      on this function.
      
      Only define the tracepoint when the function exists.
      Signed-off-by: NTony Luck <tony.luck@intel.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      0f1a932f
    • C
      xfs: split xfs_itrace_entry · cca28fb8
      Christoph Hellwig 提交于
      Replace the xfs_itrace_entry catchall with specific trace points.  For
      most simple callers we now use the simple inode class, which used to
      be the iget class, but add more details tracing for namespace events,
      which now includes the name of the directory entries manipulated.
      
      Remove the xfs_inactive trace point, which is a duplicate of the clear_inode
      one, and the xfs_change_file_space trace point, which is immediately
      followed by the more specific alloc/free space trace points.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      cca28fb8
    • C
      xfs: some iget tracing cleanups / fixes · d2e078c3
      Christoph Hellwig 提交于
      The xfs_iget_alloc/found tracepoints are a bit misnamed and misplaced.
      Rename them to xfs_iget_hit/xfs_iget_miss and move them to the beggining
      of the xfs_iget_cache_hit/miss functions.  Add a new xfs_iget_reclaim_fail
      tracepoint for the case where we fail to re-initialize a VFS inode,
      and add a second instance of the xfs_iget_skip tracepoint for the case
      of a failed igrab() call.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      d2e078c3
    • C
      xfs: simplify xfs_vm_writepage · 20cb52eb
      Christoph Hellwig 提交于
      The writepage implementation in XFS still tries to deal with dirty but
      unmapped buffers which used to caused by writes through shared mmaps.  Since
      the introduction of ->page_mkwrite these can't happen anymore, so remove the
      code dealing with them.
      
      Note that the all_bh variable which causes us to start I/O on all buffers on
      the pages was controlled by the count of unmapped buffers, which also
      included those not actually dirty.  It's now unconditionally initialized to
      0 but set to 1 for the case of small file size extensions.  It probably can
      be removed entirely, but that's left for another patch.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      20cb52eb
    • C
      xfs: simplify buffer pinning · 4d16e924
      Christoph Hellwig 提交于
      Get rid of the xfs_buf_pin/xfs_buf_unpin/xfs_buf_ispin helpers and opencode
      them in their only callers, just like we did for the inode pinning a while
      ago.  Also remove duplicate trace points - the bufitem tracepoints cover
      all the information that is present in a buffer tracepoint.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      4d16e924
  9. 20 7月, 2010 1 次提交
    • D
      xfs: track AGs with reclaimable inodes in per-ag radix tree · 16fd5367
      Dave Chinner 提交于
      https://bugzilla.kernel.org/show_bug.cgi?id=16348
      
      When the filesystem grows to a large number of allocation groups,
      the summing of recalimable inodes gets expensive. In many cases,
      most AGs won't have any reclaimable inodes and so we are wasting CPU
      time aggregating over these AGs. This is particularly important for
      the inode shrinker that gets called frequently under memory
      pressure.
      
      To avoid the overhead, track AGs with reclaimable inodes in the
      per-ag radix tree so that we can find all the AGs with reclaimable
      inodes via a simple gang tag lookup. This involves setting the tag
      when the first reclaimable inode is tracked in the AG, and removing
      the tag when the last reclaimable inode is removed from the tree.
      Then the summation process becomes a loop walking the radix tree
      summing AGs with the reclaim tag set.
      
      This significantly reduces the overhead of scanning - a 6400 AG
      filesystea now only uses about 25% of a cpu in kswapd while slab
      reclaim progresses instead of being permanently stuck at 100% CPU
      and making little progress. Clean filesystems filesystems will see
      no overhead and the overhead only increases linearly with the number
      of dirty AGs.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      16fd5367
  10. 29 5月, 2010 1 次提交
  11. 24 5月, 2010 1 次提交
    • D
      xfs: Improve scalability of busy extent tracking · ed3b4d6c
      Dave Chinner 提交于
      When we free a metadata extent, we record it in the per-AG busy
      extent array so that it is not re-used before the freeing
      transaction hits the disk. This array is fixed size, so when it
      overflows we make further allocation transactions synchronous
      because we cannot track more freed extents until those transactions
      hit the disk and are completed. Under heavy mixed allocation and
      freeing workloads with large log buffers, we can overflow this array
      quite easily.
      
      Further, the array is sparsely populated, which means that inserts
      need to search for a free slot, and array searches often have to
      search many more slots that are actually used to check all the
      busy extents. Quite inefficient, really.
      
      To enable this aspect of extent freeing to scale better, we need
      a structure that can grow dynamically. While in other areas of
      XFS we have used radix trees, the extents being freed are at random
      locations on disk so are better suited to being indexed by an rbtree.
      
      So, use a per-AG rbtree indexed by block number to track busy
      extents.  This incures a memory allocation when marking an extent
      busy, but should not occur too often in low memory situations. This
      should scale to an arbitrary number of extents so should not be a
      limitation for features such as in-memory aggregation of
      transactions.
      
      However, there are still situations where we can't avoid allocating
      busy extents (such as allocation from the AGFL). To minimise the
      overhead of such occurences, we need to avoid doing a synchronous
      log force while holding the AGF locked to ensure that the previous
      transactions are safely on disk before we use the extent. We can do
      this by marking the transaction doing the allocation as synchronous
      rather issuing a log force.
      
      Because of the locking involved and the ordering of transactions,
      the synchronous transaction provides the same guarantees as a
      synchronous log force because it ensures that all the prior
      transactions are already on disk when the synchronous transaction
      hits the disk. i.e. it preserves the free->allocate order of the
      extent correctly in recovery.
      
      By doing this, we avoid holding the AGF locked while log writes are
      in progress, hence reducing the length of time the lock is held and
      therefore we increase the rate at which we can allocate and free
      from the allocation group, thereby increasing overall throughput.
      
      The only problem with this approach is that when a metadata buffer is
      marked stale (e.g. a directory block is removed), then buffer remains
      pinned and locked until the log goes to disk. The issue here is that
      if that stale buffer is reallocated in a subsequent transaction, the
      attempt to lock that buffer in the transaction will hang waiting
      the log to go to disk to unlock and unpin the buffer. Hence if
      someone tries to lock a pinned, stale, locked buffer we need to
      push on the log to get it unlocked ASAP. Effectively we are trading
      off a guaranteed log force for a much less common trigger for log
      force to occur.
      
      Ideally we should not reallocate busy extents. That is a much more
      complex fix to the problem as it involves direct intervention in the
      allocation btree searches in many places. This is left to a future
      set of modifications.
      
      Finally, now that we track busy extents in allocated memory, we
      don't need the descriptors in the transaction structure to point to
      them. We can replace the complex busy chunk infrastructure with a
      simple linked list of busy extents. This allows us to remove a large
      chunk of code, making the overall change a net reduction in code
      size.
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAlex Elder <aelder@sgi.com>
      ed3b4d6c
  12. 19 5月, 2010 4 次提交
  13. 02 3月, 2010 2 次提交
  14. 02 2月, 2010 1 次提交
    • D
      xfs: Don't issue buffer IO direct from AIL push V2 · d808f617
      Dave Chinner 提交于
      All buffers logged into the AIL are marked as delayed write.
      When the AIL needs to push the buffer out, it issues an async write of the
      buffer. This means that IO patterns are dependent on the order of
      buffers in the AIL.
      
      Instead of flushing the buffer, promote the buffer in the delayed
      write list so that the next time the xfsbufd is run the buffer will
      be flushed by the xfsbufd. Return the state to the xfsaild that the
      buffer was promoted so that the xfsaild knows that it needs to cause
      the xfsbufd to run to flush the buffers that were promoted.
      
      Using the xfsbufd for issuing the IO allows us to dispatch all
      buffer IO from the one queue. This means that we can make much more
      enlightened decisions on what order to flush buffers to disk as
      we don't have multiple places issuing IO. Optimisations to xfsbufd
      will be in a future patch.
      
      Version 2
      - kill XFS_ITEM_FLUSHING as it is now unused.
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      d808f617
  15. 16 1月, 2010 2 次提交
  16. 11 1月, 2010 2 次提交
    • D
      xfs: Ensure we force all busy extents in range to disk · fd45e478
      Dave Chinner 提交于
      When we search for and find a busy extent during allocation we
      force the log out to ensure the extent free transaction is on
      disk before the allocation transaction. The current implementation
      has a subtle bug in it--it does not handle multiple overlapping
      ranges.
      
      That is, if we free lots of little extents into a single
      contiguous extent, then allocate the contiguous extent, the busy
      search code stops searching at the first extent it finds that
      overlaps the allocated range. It then uses the commit LSN of the
      transaction to force the log out to.
      
      Unfortunately, the other busy ranges might have more recent
      commit LSNs than the first busy extent that is found, and this
      results in xfs_alloc_search_busy() returning before all the
      extent free transactions are on disk for the range being
      allocated. This can lead to potential metadata corruption or
      stale data exposure after a crash because log replay won't replay
      all the extent free transactions that cover the allocation range.
      Modified-by: NAlex Elder <aelder@sgi.com>
      
      (Dropped the "found" argument from the xfs_alloc_busysearch trace
      event.)
      Signed-off-by: NDave Chinner <david@fromorbit.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAlex Elder <aelder@sgi.com>
      fd45e478
    • C
      xfs: use DECLARE_EVENT_CLASS · ea9a4888
      Christoph Hellwig 提交于
      Using DECLARE_EVENT_CLASS allows us to to use trace event code
      instead of duplicating it in the binary.  This was not available
      before 2.6.33 so it had to be done as a separate step once the
      prerequisite was merged.
      
      This only requires changes to xfs_trace.h and the results are
      rather impressive:
      
      hch@brick:~/work/linux-2.6/obj-kvm$ size fs/xfs/xfs.o*
      text	   data	    bss	    dec	    hex	filename
       607732	  41884	   3616	 653232	  9f7b0	fs/xfs/xfs.o
      1026732	  41884	   3808	1072424	 105d28	fs/xfs/xfs.o.old
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAlex Elder <aelder@sgi.com>
      ea9a4888
  17. 09 1月, 2010 1 次提交
  18. 15 12月, 2009 1 次提交
    • C
      xfs: event tracing support · 0b1b213f
      Christoph Hellwig 提交于
      Convert the old xfs tracing support that could only be used with the
      out of tree kdb and xfsidbg patches to use the generic event tracer.
      
      To use it make sure CONFIG_EVENT_TRACING is enabled and then enable
      all xfs trace channels by:
      
         echo 1 > /sys/kernel/debug/tracing/events/xfs/enable
      
      or alternatively enable single events by just doing the same in one
      event subdirectory, e.g.
      
         echo 1 > /sys/kernel/debug/tracing/events/xfs/xfs_ihold/enable
      
      or set more complex filters, etc. In Documentation/trace/events.txt
      all this is desctribed in more detail.  To reads the events do a
      
         cat /sys/kernel/debug/tracing/trace
      
      Compared to the last posting this patch converts the tracing mostly to
      the one tracepoint per callsite model that other users of the new
      tracing facility also employ.  This allows a very fine-grained control
      of the tracing, a cleaner output of the traces and also enables the
      perf tool to use each tracepoint as a virtual performance counter,
           allowing us to e.g. count how often certain workloads git various
           spots in XFS.  Take a look at
      
          http://lwn.net/Articles/346470/
      
      for some examples.
      
      Also the btree tracing isn't included at all yet, as it will require
      additional core tracing features not in mainline yet, I plan to
      deliver it later.
      
      And the really nice thing about this patch is that it actually removes
      many lines of code while adding this nice functionality:
      
       fs/xfs/Makefile                |    8
       fs/xfs/linux-2.6/xfs_acl.c     |    1
       fs/xfs/linux-2.6/xfs_aops.c    |   52 -
       fs/xfs/linux-2.6/xfs_aops.h    |    2
       fs/xfs/linux-2.6/xfs_buf.c     |  117 +--
       fs/xfs/linux-2.6/xfs_buf.h     |   33
       fs/xfs/linux-2.6/xfs_fs_subr.c |    3
       fs/xfs/linux-2.6/xfs_ioctl.c   |    1
       fs/xfs/linux-2.6/xfs_ioctl32.c |    1
       fs/xfs/linux-2.6/xfs_iops.c    |    1
       fs/xfs/linux-2.6/xfs_linux.h   |    1
       fs/xfs/linux-2.6/xfs_lrw.c     |   87 --
       fs/xfs/linux-2.6/xfs_lrw.h     |   45 -
       fs/xfs/linux-2.6/xfs_super.c   |  104 ---
       fs/xfs/linux-2.6/xfs_super.h   |    7
       fs/xfs/linux-2.6/xfs_sync.c    |    1
       fs/xfs/linux-2.6/xfs_trace.c   |   75 ++
       fs/xfs/linux-2.6/xfs_trace.h   | 1369 +++++++++++++++++++++++++++++++++++++++++
       fs/xfs/linux-2.6/xfs_vnode.h   |    4
       fs/xfs/quota/xfs_dquot.c       |  110 ---
       fs/xfs/quota/xfs_dquot.h       |   21
       fs/xfs/quota/xfs_qm.c          |   40 -
       fs/xfs/quota/xfs_qm_syscalls.c |    4
       fs/xfs/support/ktrace.c        |  323 ---------
       fs/xfs/support/ktrace.h        |   85 --
       fs/xfs/xfs.h                   |   16
       fs/xfs/xfs_ag.h                |   14
       fs/xfs/xfs_alloc.c             |  230 +-----
       fs/xfs/xfs_alloc.h             |   27
       fs/xfs/xfs_alloc_btree.c       |    1
       fs/xfs/xfs_attr.c              |  107 ---
       fs/xfs/xfs_attr.h              |   10
       fs/xfs/xfs_attr_leaf.c         |   14
       fs/xfs/xfs_attr_sf.h           |   40 -
       fs/xfs/xfs_bmap.c              |  507 +++------------
       fs/xfs/xfs_bmap.h              |   49 -
       fs/xfs/xfs_bmap_btree.c        |    6
       fs/xfs/xfs_btree.c             |    5
       fs/xfs/xfs_btree_trace.h       |   17
       fs/xfs/xfs_buf_item.c          |   87 --
       fs/xfs/xfs_buf_item.h          |   20
       fs/xfs/xfs_da_btree.c          |    3
       fs/xfs/xfs_da_btree.h          |    7
       fs/xfs/xfs_dfrag.c             |    2
       fs/xfs/xfs_dir2.c              |    8
       fs/xfs/xfs_dir2_block.c        |   20
       fs/xfs/xfs_dir2_leaf.c         |   21
       fs/xfs/xfs_dir2_node.c         |   27
       fs/xfs/xfs_dir2_sf.c           |   26
       fs/xfs/xfs_dir2_trace.c        |  216 ------
       fs/xfs/xfs_dir2_trace.h        |   72 --
       fs/xfs/xfs_filestream.c        |    8
       fs/xfs/xfs_fsops.c             |    2
       fs/xfs/xfs_iget.c              |  111 ---
       fs/xfs/xfs_inode.c             |   67 --
       fs/xfs/xfs_inode.h             |   76 --
       fs/xfs/xfs_inode_item.c        |    5
       fs/xfs/xfs_iomap.c             |   85 --
       fs/xfs/xfs_iomap.h             |    8
       fs/xfs/xfs_log.c               |  181 +----
       fs/xfs/xfs_log_priv.h          |   20
       fs/xfs/xfs_log_recover.c       |    1
       fs/xfs/xfs_mount.c             |    2
       fs/xfs/xfs_quota.h             |    8
       fs/xfs/xfs_rename.c            |    1
       fs/xfs/xfs_rtalloc.c           |    1
       fs/xfs/xfs_rw.c                |    3
       fs/xfs/xfs_trans.h             |   47 +
       fs/xfs/xfs_trans_buf.c         |   62 -
       fs/xfs/xfs_vnodeops.c          |    8
       70 files changed, 2151 insertions(+), 2592 deletions(-)
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAlex Elder <aelder@sgi.com>
      0b1b213f