1. 21 Jul 2011 (1 commit)
  2. 13 Jul 2011 (1 commit)
  3. 11 Jul 2011 (1 commit)
  4. 09 Jul 2011 (1 commit)
  5. 08 Jul 2011 (2 commits)
  6. 29 Apr 2011 (1 commit)
  7. 07 Mar 2011 (3 commits)
  8. 16 Dec 2010 (1 commit)
  9. 04 Jan 2011 (1 commit)
    • D
      xfs: dynamic speculative EOF preallocation · 055388a3
      Authored by Dave Chinner
      Currently the size of the speculative preallocation during delayed
      allocation is fixed by either the allocsize mount option or a
      default size. We are seeing a lot of cases where we need to
      recommend using the allocsize mount option to prevent fragmentation
      when buffered writes land in the same AG.
      
      Rather than using a fixed preallocation size by default (up to 64k),
      make it dynamic by basing it on the current inode size. That way the
      EOF preallocation will increase as the file size increases.  Hence
      for streaming writes we are much more likely to get large
      preallocations exactly when we need them to reduce fragmentation.
      
      For default settings, the size of the initial extents is determined
      by the number of parallel writers and the amount of memory in the
      machine. For 4GB RAM and 4 concurrent 32GB file writes:
      
      EXT: FILE-OFFSET           BLOCK-RANGE          AG AG-OFFSET                 TOTAL
         0: [0..1048575]:         1048672..2097247      0 (1048672..2097247)      1048576
         1: [1048576..2097151]:   5242976..6291551      0 (5242976..6291551)      1048576
         2: [2097152..4194303]:   12583008..14680159    0 (12583008..14680159)    2097152
         3: [4194304..8388607]:   25165920..29360223    0 (25165920..29360223)    4194304
         4: [8388608..16777215]:  58720352..67108959    0 (58720352..67108959)    8388608
         5: [16777216..33554423]: 117440584..134217791  0 (117440584..134217791) 16777208
         6: [33554424..50331511]: 184549056..201326143  0 (184549056..201326143) 16777088
         7: [50331512..67108599]: 251657408..268434495  0 (251657408..268434495) 16777088
      
      and for 16 concurrent 16GB file writes:
      
       EXT: FILE-OFFSET           BLOCK-RANGE          AG AG-OFFSET                 TOTAL
         0: [0..262143]:          2490472..2752615      0 (2490472..2752615)       262144
         1: [262144..524287]:     6291560..6553703      0 (6291560..6553703)       262144
         2: [524288..1048575]:    13631592..14155879    0 (13631592..14155879)     524288
         3: [1048576..2097151]:   30408808..31457383    0 (30408808..31457383)    1048576
         4: [2097152..4194303]:   52428904..54526055    0 (52428904..54526055)    2097152
         5: [4194304..8388607]:   104857704..109052007  0 (104857704..109052007)  4194304
         6: [8388608..16777215]:  209715304..218103911  0 (209715304..218103911)  8388608
         7: [16777216..33554423]: 452984848..469762055  0 (452984848..469762055) 16777208
      
      Because it is hard to take back speculative preallocation, cases
      where there are large, slow-growing log files on a nearly full
      filesystem may cause premature ENOSPC. Hence as the filesystem nears
      full, the maximum dynamic prealloc size is reduced according to this
      table (based on 4k block size):
      
      freespace       max prealloc size
        >5%             full extent (8GB)
        4-5%             2GB (8GB >> 2)
        3-4%             1GB (8GB >> 3)
        2-3%           512MB (8GB >> 4)
        1-2%           256MB (8GB >> 5)
        <1%            128MB (8GB >> 6)
      
      This should reduce the amount of space held in speculative
      preallocation for such cases.
      
      The allocsize mount option turns off the dynamic behaviour and fixes
      the prealloc size to whatever the mount option specifies. i.e. the
      behaviour is unchanged.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      055388a3
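      A minimal user-space sketch of the throttling heuristic described in the
      table above; the helper name, its arguments and the standalone program
      around it are illustrative assumptions, not the actual XFS code:

      /* Model of the dynamic EOF prealloc sizing: scale with the inode
       * size, cap at a full extent (8GB at 4k blocks), then shift the
       * result down as free space runs out, per the table above. */
      #include <stdint.h>
      #include <stdio.h>

      #define MAX_PREALLOC_BLOCKS  2097152ULL   /* 8GB in 4k blocks */

      static uint64_t prealloc_blocks(uint64_t isize_blocks,
                                      uint64_t free_blocks,
                                      uint64_t total_blocks)
      {
          uint64_t alloc = isize_blocks;              /* grow with the file */
          uint64_t free_pct = free_blocks * 100 / total_blocks;
          int shift = 0;

          if (alloc > MAX_PREALLOC_BLOCKS)
              alloc = MAX_PREALLOC_BLOCKS;

          if (free_pct < 1)
              shift = 6;                              /* 128MB cap */
          else if (free_pct < 2)
              shift = 5;                              /* 256MB */
          else if (free_pct < 3)
              shift = 4;                              /* 512MB */
          else if (free_pct < 4)
              shift = 3;                              /* 1GB */
          else if (free_pct < 5)
              shift = 2;                              /* 2GB */

          return alloc >> shift;
      }

      int main(void)
      {
          /* 16GB file on a 100GB filesystem with ~3.5% free, 4k blocks:
           * expect 262144 blocks (1GB), matching the 3-4% row above. */
          printf("%llu blocks\n", (unsigned long long)
                 prealloc_blocks(4194304, 917504, 26214400));
          return 0;
      }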
  10. 11 Nov 2010 (1 commit)
  11. 19 Oct 2010 (11 commits)
    • C
      xfs: remove xfs_buf wrappers · 1a1a3e97
      Authored by Christoph Hellwig
      Stop having two different names for many buffer functions and use
      the more descriptive xfs_buf_* names directly.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Alex Elder <aelder@sgi.com>
      1a1a3e97
    • C
      xfs: do not use xfs_mod_incore_sb_batch for per-cpu counters · 1b040712
      Authored by Christoph Hellwig
      Update the per-cpu counters manually in xfs_trans_unreserve_and_mod_sb
      and remove support for per-cpu counters from xfs_mod_incore_sb_batch
      to simplify it.  An added benefit is that we don't have to take
      m_sb_lock for transactions that only modify per-cpu counters.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Alex Elder <aelder@sgi.com>
      1b040712
    • C
      xfs: do not use xfs_mod_incore_sb for per-cpu counters · 96540c78
      Authored by Christoph Hellwig
      Export xfs_icsb_modify_counters and always use it for modifying
      the per-cpu counters.  Remove support for per-cpu counters from
      xfs_mod_incore_sb to simplify it.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Alex Elder <aelder@sgi.com>
      96540c78
    • C
      xfs: remove XFS_MOUNT_NO_PERCPU_SB · 61ba35de
      Authored by Christoph Hellwig
      Fail the mount if we can't allocate memory for the per-CPU counters.
      This is consistent with how we handle everything else in the mount
      path and makes the superblock counter modification a lot simpler.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Alex Elder <aelder@sgi.com>
      61ba35de
    • D
      xfs: convert buffer cache hash to rbtree · 74f75a0c
      Authored by Dave Chinner
      The buffer cache hash is showing typical hash scalability problems.
      In large-scale testing the number of cached items grows far larger
      than the hash can efficiently handle. Hence we need to move to a
      self-scaling cache indexing mechanism.
      
      I have selected rbtrees for indexing because they can have O(log n)
      search scalability, and insert and remove cost is not excessive,
      even on large trees. Hence we should be able to cache large numbers
      of buffers without incurring the excessive cache miss search
      penalties that the hash is imposing on us.
      
      To ensure we still have parallel access to the cache, we need
      multiple trees. Rather than hashing the buffers by disk address to
      select a tree, it seems more sensible to separate trees by typical
      access patterns. Most operations use buffers from within a single AG
      at a time, so rather than searching lots of different lists,
      separate the buffer indexes out into per-AG rbtrees. This means that
      searches during metadata operation have a much higher chance of
      hitting cache resident nodes, and that updates of the tree are less
      likely to disturb trees being accessed on other CPUs doing
      independent operations.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Alex Elder <aelder@sgi.com>
      74f75a0c
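      A minimal user-space sketch of the per-AG indexing idea, using POSIX
      tsearch()/tfind() as a stand-in for the kernel rbtree; the structure,
      field names and the missing locking are illustrative assumptions, not
      the XFS buffer cache:

      #include <search.h>
      #include <stdio.h>

      struct buf {
          unsigned int  agno;   /* allocation group holding the buffer */
          unsigned long bno;    /* block number within that AG */
      };

      #define NUM_AGS 4
      static void *ag_tree[NUM_AGS];   /* one search tree root per AG */

      static int buf_cmp(const void *a, const void *b)
      {
          const struct buf *x = a, *y = b;

          if (x->bno < y->bno)
              return -1;
          return x->bno > y->bno;
      }

      /* Insert a buffer into the tree for its own AG only. */
      static void buf_insert(struct buf *bp)
      {
          tsearch(bp, &ag_tree[bp->agno], buf_cmp);
      }

      /* A lookup walks a single per-AG tree, so metadata operations in
       * different AGs never contend on or pollute the same index. */
      static struct buf *buf_lookup(unsigned int agno, unsigned long bno)
      {
          struct buf key = { .agno = agno, .bno = bno };
          void *found = tfind(&key, &ag_tree[agno], buf_cmp);

          return found ? *(struct buf **)found : NULL;
      }

      int main(void)
      {
          struct buf b = { .agno = 1, .bno = 4096 };

          buf_insert(&b);
          printf("hit: %d\n", buf_lookup(1, 4096) != NULL);
          return 0;
      }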
    • D
      xfs: serialise inode reclaim within an AG · 69b491c2
      Authored by Dave Chinner
      Memory reclaim via shrinkers has a terrible habit of having N+M
      concurrent shrinker executions (N = num CPUs, M = num kswapds) all
      trying to shrink the same cache. When the cache they are all working
      on is protected by a single spinlock, massive contention and
      slowdowns occur.
      
      Wrap the per-ag inode caches with a reclaim mutex to serialise
      reclaim access to the AG. This will block concurrent reclaim in each
      AG but still allow reclaim to scan multiple AGs concurrently. Allow
      shrinkers to move on to the next AG if they can't get the lock, and if
      we can't get any AG, then start blocking on locks.
      
      To prevent reclaimers from continually scanning the same inodes in
      each AG, add a cursor that tracks where the last reclaim got up to
      and start from that point on the next reclaim. This should avoid
      only ever scanning a small number of inodes at the start of each AG
      and not making progress. If we have a non-shrinker based reclaim
      pass, ignore the cursor and reset it to zero once we are done.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Alex Elder <aelder@sgi.com>
      69b491c2
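      A minimal pthreads sketch of the reclaim serialisation described above:
      try each AG's mutex without blocking first, remember where the walk
      stopped, and only fall back to blocking once every AG has been skipped.
      The names and the placeholder reclaim work are illustrative assumptions:

      #include <pthread.h>
      #include <stdio.h>

      #define NUM_AGS 4

      struct ag {
          pthread_mutex_t reclaim_lock;    /* serialises reclaim in this AG */
          unsigned long   reclaim_cursor;  /* where the last walk stopped */
      };

      static struct ag ags[NUM_AGS];

      /* Placeholder: scan one batch of inodes from the cursor onwards and
       * return the new cursor position. */
      static unsigned long reclaim_some(struct ag *ag, unsigned long cursor)
      {
          return cursor + 32;
      }

      static void reclaim_pass(int blocking)
      {
          for (int agno = 0; agno < NUM_AGS; agno++) {
              struct ag *ag = &ags[agno];

              if (!blocking) {
                  /* Skip AGs that another reclaimer is already working on. */
                  if (pthread_mutex_trylock(&ag->reclaim_lock) != 0)
                      continue;
              } else {
                  pthread_mutex_lock(&ag->reclaim_lock);
              }

              ag->reclaim_cursor = reclaim_some(ag, ag->reclaim_cursor);
              pthread_mutex_unlock(&ag->reclaim_lock);
          }
      }

      int main(void)
      {
          for (int i = 0; i < NUM_AGS; i++)
              pthread_mutex_init(&ags[i].reclaim_lock, NULL);

          reclaim_pass(0);    /* non-blocking, shrinker-style pass */
          reclaim_pass(1);    /* block on locks if nothing could be freed */
          printf("cursor in AG0: %lu\n", ags[0].reclaim_cursor);
          return 0;
      }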
    • D
      xfs: split inode AG walking into separate code for reclaim · 65d0f205
      Authored by Dave Chinner
      The reclaim walk requires different locking and has a slightly
      different walk algorithm, so separate it out so that it can be
      optimised separately.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Alex Elder <aelder@sgi.com>
      65d0f205
    • D
      xfs: use unhashed buffers for size checks · 1922c949
      Authored by Dave Chinner
      When we are checking that we can access the last block of each device, we
      do not need to use cached buffers as they will be tossed away
      immediately. Use uncached buffers for size checks so that all IO
      prior to full in-memory structure initialisation does not use the
      buffer cache.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Alex Elder <aelder@sgi.com>
      1922c949
    • D
      xfs: kill XBF_FS_MANAGED buffers · 26af6552
      Authored by Dave Chinner
      Filesystem level managed buffers are buffers that have their
      lifecycle controlled by the filesystem layer, not the buffer cache.
      We currently cache these buffers, which makes cleanup and cache
      walking somewhat troublesome. Convert the fs managed buffers to
      uncached buffers obtained via xfs_buf_get_uncached(), and remove
      the XBF_FS_MANAGED special cases from the buffer cache.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Alex Elder <aelder@sgi.com>
      26af6552
    • D
      xfs: lockless per-ag lookups · e176579e
      Authored by Dave Chinner
      When we start taking a reference to the per-ag for every cached
      buffer in the system, kernel lockstat profiling on an 8-way create
      workload shows the mp->m_perag_lock has higher acquisition rates
      than the inode lock and has significantly more contention. That is,
      it becomes the highest contended lock in the system.
      
      The perag lookup is trivial to convert to lock-less RCU lookups
      because perag structures never go away. Hence the only thing we need
      to protect against is tree structure changes during a grow. This can
      be done simply by replacing the locking in xfs_perag_get() with RCU
      read locking. This removes the mp->m_perag_lock completely from this
      path.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Alex Elder <aelder@sgi.com>
      e176579e
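      A user-space sketch of the lock-free lookup idea, using C11 atomics as
      a stand-in for the kernel's RCU read lock: because the per-AG structures
      are published once and never freed, readers only need an acquire load.
      All names are illustrative assumptions, not the XFS implementation:

      #include <stdatomic.h>
      #include <stdio.h>
      #include <stdlib.h>

      #define MAX_AGS 64

      struct perag {
          unsigned int agno;
          atomic_int   ref;    /* active references to this AG */
      };

      static _Atomic(struct perag *) perag_table[MAX_AGS];

      /* Writer side (mount/growfs): allocate and publish a new AG entry.
       * Entries are only ever added, never removed or reused. */
      static void perag_publish(unsigned int agno)
      {
          struct perag *pag = calloc(1, sizeof(*pag));

          pag->agno = agno;
          /* Release store pairs with the acquire load in perag_get(). */
          atomic_store_explicit(&perag_table[agno], pag, memory_order_release);
      }

      /* Reader side: no lock taken, the entry can never disappear. */
      static struct perag *perag_get(unsigned int agno)
      {
          struct perag *pag =
              atomic_load_explicit(&perag_table[agno], memory_order_acquire);

          if (pag)
              atomic_fetch_add(&pag->ref, 1);
          return pag;
      }

      static void perag_put(struct perag *pag)
      {
          atomic_fetch_sub(&pag->ref, 1);
      }

      int main(void)
      {
          perag_publish(0);

          struct perag *pag = perag_get(0);
          printf("got AG %u\n", pag->agno);
          perag_put(pag);
          return 0;
      }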
    • D
      xfs: remove debug assert for per-ag reference counting · bd32d25a
      Authored by Dave Chinner
      When we start taking references per cached buffer to the perag
      it is cached on, it will blow the current debug maximum reference
      count assert out of the water. The assert has never caught a bug,
      and we have tracing to track changes if there ever is a problem,
      so just remove it.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Alex Elder <aelder@sgi.com>
      bd32d25a
  12. 27 Jul 2010 (2 commits)
  13. 24 Jun 2010 (1 commit)
  14. 29 May 2010 (3 commits)
  15. 19 May 2010 (1 commit)
  16. 06 Mar 2010 (1 commit)
    • D
      xfs: Increase the default size of the reserved blocks pool · 8babd8a2
      Authored by Dave Chinner
      The current default size of the reserved blocks pool is easy to deplete
      with certain workloads, in particular workloads that do lots of concurrent
      delayed allocation extent conversions.  If enough transactions are running
      in parallel and the entire pool is consumed then subsequent calls to
      xfs_trans_reserve() will fail with ENOSPC.  Also add a rate-limited
      warning so we know if this starts happening again.
      
      This is an updated version of an old patch from Lachlan McIlroy.
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      Signed-off-by: Alex Elder <aelder@sgi.com>
      8babd8a2
  17. 02 Mar 2010 (1 commit)
  18. 06 Feb 2010 (1 commit)
    • D
      xfs: Use delayed write for inodes rather than async V2 · c854363e
      Authored by Dave Chinner
      We currently do background inode flush asynchronously, resulting in
      inodes being written in whatever order the background writeback
      issues them. Not only that, there are also blocking and non-blocking
      asynchronous inode flushes, depending on where the flush comes from.
      
      This patch completely removes asynchronous inode writeback. It
      removes all the strange writeback modes and replaces them with
      either a synchronous flush or a non-blocking delayed write flush.
      That is, inode flushes will only issue IO directly if they are
      synchronous, and background flushing may do nothing if the operation
      would block (e.g. on a pinned inode or buffer lock).
      
      Delayed write flushes will now result in the inode buffer sitting in
      the delwri queue of the buffer cache to be flushed by either an AIL
      push or by the xfsbufd timing out the buffer. This will allow
      accumulation of dirty inode buffers in memory and allow optimisation
      of inode cluster writeback at the xfsbufd level where we have much
      greater queue depths than the block layer elevators. We will also
      get adjacent inode cluster buffer IO merging for free when a later
      patch in the series allows sorting of the delayed write buffers
      before dispatch.
      
      This effectively means that any inode that is written back by
      background writeback will be seen as flush locked during AIL
      pushing, and will result in the buffers being pushed from there.
      This writeback path is currently non-optimal, but the next patch
      in the series will fix that problem.
      
      A side effect of this delayed write mechanism is that background
      inode reclaim will no longer directly flush inodes, nor can it wait
      on the flush lock. The result is that inode reclaim must leave the
      inode in the reclaimable state until it is clean. Hence attempts to
      reclaim a dirty inode in the background will simply skip the inode
      until it is clean and this allows other mechanisms (i.e. xfsbufd) to
      do more optimal writeback of the dirty buffers. As a result, the
      inode reclaim code has been rewritten so that it no longer relies on
      the ambiguous return values of xfs_iflush() to determine whether it
      is safe to reclaim an inode.
      
      Portions of this patch are derived from patches by Christoph
      Hellwig.
      
      Version 2:
      - cleanup reclaim code as suggested by Christoph
      - log background reclaim inode flush errors
      - just pass sync flags to xfs_iflush
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      c854363e
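      A minimal sketch of the two remaining flush modes described above: a
      synchronous flush issues IO immediately, while a delayed-write flush
      never blocks, queueing the backing buffer for a later AIL push or
      xfsbufd timeout and skipping the inode entirely if it is busy.  The
      types and helpers are illustrative assumptions, not the XFS code:

      #include <stdbool.h>
      #include <stdio.h>

      enum flush_mode { FLUSH_SYNC, FLUSH_DELWRI };

      struct inode_buf {
          bool locked;     /* buffer lock held by someone else */
          bool pinned;     /* pinned in the log, cannot be written yet */
          bool on_delwri;  /* sitting on the delayed write queue */
      };

      /* Returns true if the inode was handled, false if it was skipped and
       * must stay reclaimable until a later pass finds it clean. */
      static bool flush_inode(struct inode_buf *bp, enum flush_mode mode)
      {
          if (mode == FLUSH_SYNC) {
              /* Synchronous callers are the only ones issuing IO directly. */
              printf("write buffer now and wait\n");
              return true;
          }

          /* Background (delayed write) flush: never block. */
          if (bp->locked || bp->pinned)
              return false;

          bp->on_delwri = true;   /* written later by xfsbufd or AIL push */
          printf("queued on delwri list\n");
          return true;
      }

      int main(void)
      {
          struct inode_buf busy = { .pinned = true };
          struct inode_buf idle = { 0 };

          flush_inode(&idle, FLUSH_DELWRI);        /* queued for later */
          if (!flush_inode(&busy, FLUSH_DELWRI))   /* skipped, retried later */
              printf("skipped pinned inode\n");
          flush_inode(&idle, FLUSH_SYNC);          /* direct IO path */
          return 0;
      }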
  19. 09 Feb 2010 (1 commit)
  20. 22 Jan 2010 (2 commits)
  21. 20 Jan 2010 (1 commit)
  22. 16 Jan 2010 (2 commits)