1. 19 Oct 2010 (25 commits)
    • xfs: do not use xfs_mod_incore_sb for per-cpu counters · 96540c78
      Christoph Hellwig authored
      Export xfs_icsb_modify_counters and always use it for modifying
      the per-cpu counters.  Remove support for per-cpu counters from
      xfs_mod_incore_sb to simplify it.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Alex Elder <aelder@sgi.com>
    • xfs: remove XFS_MOUNT_NO_PERCPU_SB · 61ba35de
      Christoph Hellwig authored
      Fail the mount if we can't allocate memory for the per-CPU counters.
      This is consistent with how we handle everything else in the mount
      path and makes the superblock counter modification a lot simpler.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Alex Elder <aelder@sgi.com>
    • xfs: pack xfs_buf structure more tightly · 50f59e8e
      Dave Chinner authored
      pahole reports that struct xfs_buf has quite a few holes in it, so
      packing the structure better will reduce its size by 16 bytes.
      Also, move all the fields used in cache lookups into the first
      cacheline.

      Before, on x86_64:

              /* size: 320, cachelines: 5 */
              /* sum members: 298, holes: 6, sum holes: 22 */

      After, on x86_64:

              /* size: 304, cachelines: 5 */
              /* padding: 6 */
              /* last cacheline: 48 bytes */
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Alex Elder <aelder@sgi.com>
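      The technique generalizes: reordering members so same-size fields sit
      together removes the alignment holes pahole reports. A minimal
      standalone C sketch (hypothetical fields, not the real xfs_buf layout):

          #include <stdio.h>

          struct holey {            /* 24 bytes on x86_64 */
                  char  flag;       /*  1 byte + 7 byte hole */
                  long  blkno;      /*  8 bytes */
                  short count;      /*  2 bytes + 6 bytes tail padding */
          };

          struct packed_tight {     /* 16 bytes on x86_64 */
                  long  blkno;      /*  8 bytes */
                  short count;      /*  2 bytes */
                  char  flag;       /*  1 byte + 5 bytes tail padding */
          };

          int main(void)
          {
                  printf("holey: %zu, packed: %zu\n",
                         sizeof(struct holey), sizeof(struct packed_tight));
                  return 0;
          }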
    • xfs: convert buffer cache hash to rbtree · 74f75a0c
      Dave Chinner authored
      The buffer cache hash is showing typical hash scalability problems:
      in large scale testing, the number of cached items grows far larger
      than the hash can efficiently handle. Hence we need to move to a
      self-scaling cache indexing mechanism.

      I have selected rbtrees for indexing because they give O(log n)
      search scalability, and insert and remove cost is not excessive,
      even on large trees. Hence we should be able to cache large numbers
      of buffers without incurring the excessive cache miss search
      penalties that the hash is imposing on us.

      To ensure we still have parallel access to the cache, we need
      multiple trees. Rather than hashing the buffers by disk address to
      select a tree, it seems more sensible to separate trees by typical
      access patterns. Most operations use buffers from within a single AG
      at a time, so rather than searching lots of different lists,
      separate the buffer indexes out into per-AG rbtrees. This means that
      searches during metadata operations have a much higher chance of
      hitting cache resident nodes, and that updates of the tree are less
      likely to disturb trees being accessed on other CPUs doing
      independent operations.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Alex Elder <aelder@sgi.com>
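      A sketch of the find-or-insert walk this kind of index uses, keyed by
      disk block number with the kernel rbtree API; struct cached_buf and
      its fields are illustrative, not the actual XFS symbols:

          #include <linux/rbtree.h>
          #include <linux/types.h>

          struct cached_buf {
                  struct rb_node  rb_node;
                  u64             blkno;          /* lookup key */
          };

          /* Return an existing buffer at new->blkno, or link in @new. */
          static struct cached_buf *
          buf_tree_find_or_insert(struct rb_root *root, struct cached_buf *new)
          {
                  struct rb_node **pp = &root->rb_node;
                  struct rb_node *parent = NULL;

                  while (*pp) {
                          struct cached_buf *buf;

                          parent = *pp;
                          buf = rb_entry(parent, struct cached_buf, rb_node);
                          if (new->blkno < buf->blkno)
                                  pp = &(*pp)->rb_left;
                          else if (new->blkno > buf->blkno)
                                  pp = &(*pp)->rb_right;
                          else
                                  return buf;     /* cache hit */
                  }
                  /* miss: link the new node where the search fell off */
                  rb_link_node(&new->rb_node, parent, pp);
                  rb_insert_color(&new->rb_node, root);
                  return new;
          }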
    • xfs: serialise inode reclaim within an AG · 69b491c2
      Dave Chinner authored
      Memory reclaim via shrinkers has a terrible habit of having N+M
      concurrent shrinker executions (N = num CPUs, M = num kswapds) all
      trying to shrink the same cache. When the cache they are all working
      on is protected by a single spinlock, massive contention and
      slowdowns occur.

      Wrap the per-AG inode caches with a reclaim mutex to serialise
      reclaim access to the AG. This will block concurrent reclaim in each
      AG but still allow reclaim to scan multiple AGs concurrently. Allow
      a shrinker to move on to the next AG if it can't get the lock, and
      if it can't get any AG, then start blocking on locks.

      To prevent reclaimers from continually scanning the same inodes in
      each AG, add a cursor that tracks where the last reclaim got up to
      and start from that point on the next reclaim. This should avoid
      only ever scanning a small number of inodes at the start of each AG
      and not making progress. If we have a non-shrinker based reclaim
      pass, ignore the cursor and reset it to zero once we are done.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Alex Elder <aelder@sgi.com>
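      A sketch of the trylock-then-block pattern described above; the types
      and the reclaim_ag() helper are illustrative stand-ins, not the
      actual XFS code:

          #include <linux/mutex.h>

          struct my_ag {
                  struct mutex    reclaim_lock;   /* serialises reclaim here */
                  /* ... per-AG inode index, reclaim cursor ... */
          };

          static void reclaim_ag(struct my_ag *ag);       /* scan one AG */

          static void reclaim_all_ags(struct my_ag *ags, int ag_count)
          {
                  int ag, done = 0;

                  /* Pass 1: skip AGs another reclaimer already holds. */
                  for (ag = 0; ag < ag_count; ag++) {
                          if (!mutex_trylock(&ags[ag].reclaim_lock))
                                  continue;
                          reclaim_ag(&ags[ag]);
                          mutex_unlock(&ags[ag].reclaim_lock);
                          done++;
                  }

                  /* Pass 2: every AG was busy, so block rather than spin. */
                  if (!done) {
                          mutex_lock(&ags[0].reclaim_lock);
                          reclaim_ag(&ags[0]);
                          mutex_unlock(&ags[0].reclaim_lock);
                  }
          }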
    • xfs: batch inode reclaim lookup · e3a20c0b
      Dave Chinner authored
      Batch and optimise the per-AG inode lookup for reclaim to minimise
      scanning overhead. This involves gang lookups on the radix trees to
      get multiple inodes during each tree walk, and tighter validation of
      which inodes can be reclaimed without blocking before we take any
      locks.

      This is based on ideas suggested in a proof-of-concept patch
      posted by Nick Piggin.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Alex Elder <aelder@sgi.com>
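      A sketch of the gang-lookup batching idea: one radix tree descent
      returns a batch of entries instead of one. LOOKUP_BATCH, struct
      my_inode and try_reclaim_inode() are illustrative assumptions, not
      the XFS symbols:

          #include <linux/radix-tree.h>

          #define LOOKUP_BATCH    32

          struct my_inode {
                  unsigned long   ino;
          };

          static int try_reclaim_inode(struct my_inode *ip);  /* may say no */

          static void reclaim_walk(struct radix_tree_root *tree)
          {
                  void *batch[LOOKUP_BATCH];
                  unsigned long first_index = 0;
                  unsigned int nr, i;

                  do {
                          /* one descent yields up to LOOKUP_BATCH inodes */
                          nr = radix_tree_gang_lookup(tree, batch, first_index,
                                                      LOOKUP_BATCH);
                          for (i = 0; i < nr; i++) {
                                  struct my_inode *ip = batch[i];

                                  first_index = ip->ino + 1;  /* resume point */
                                  try_reclaim_inode(ip);
                          }
                  } while (nr == LOOKUP_BATCH);   /* short batch = tree done */
          }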
    • xfs: implement batched inode lookups for AG walking · 78ae5256
      Dave Chinner authored
      With the reclaim code separated from the generic walking code, it is
      simple to implement batched lookups for the generic walk code.
      Separate out the inode validation from the execute operations and
      modify the tree lookups to get a batch of inodes at a time.

      Reclaim operations will be optimised separately.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Alex Elder <aelder@sgi.com>
    • xfs: split out inode walk inode grabbing · e13de955
      Dave Chinner authored
      When doing read side inode cache walks, the code to validate and
      grab an inode is common to all callers. Split it out of the execute
      callbacks in preparation for batching lookups. Similarly, split out
      the inode reference dropping from the execute callbacks into the
      main lookup loop to be symmetric with the grab.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Alex Elder <aelder@sgi.com>
    • xfs: split inode AG walking into separate code for reclaim · 65d0f205
      Dave Chinner authored
      The reclaim walk requires different locking and has a slightly
      different walk algorithm, so separate it out so that it can be
      optimised separately.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Alex Elder <aelder@sgi.com>
    • xfs: remove buftarg hash for external devices · 69d6cc76
      Dave Chinner authored
      We never use hashed buffers on RT and external log devices now.
      Remove the buftarg hash tables that are set up for them.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Alex Elder <aelder@sgi.com>
    • xfs: use unhashed buffers for size checks · 1922c949
      Dave Chinner authored
      When we check that we can access the last block of each device, we
      do not need to use cached buffers as they will be tossed away
      immediately. Use uncached buffers for size checks so that all IO
      prior to full in-memory structure initialisation does not use the
      buffer cache.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Alex Elder <aelder@sgi.com>
    • xfs: kill XBF_FS_MANAGED buffers · 26af6552
      Dave Chinner authored
      Filesystem level managed buffers are buffers that have their
      lifecycle controlled by the filesystem layer, not the buffer cache.
      We currently cache these buffers, which makes cleanup and cache
      walking somewhat troublesome. Convert the fs managed buffers to
      uncached buffers obtained via xfs_buf_get_uncached(), and remove
      the XBF_FS_MANAGED special cases from the buffer cache.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Alex Elder <aelder@sgi.com>
    • xfs: store xfs_mount in the buftarg instead of in the xfs_buf · ebad861b
      Dave Chinner authored
      Each buffer contains both a buftarg pointer and a mount pointer. If
      we add a mount pointer to the buftarg, we can avoid needing the
      b_mount field in every buffer and grab it from the buftarg when
      needed instead. This shrinks the xfs_buf by 8 bytes.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Alex Elder <aelder@sgi.com>
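      The trade is one extra pointer hop for eight bytes in every cached
      buffer. A hedged sketch of the indirection, with layouts assumed
      from the description rather than the real XFS definitions:

          struct xfs_mount;

          struct xfs_buftarg {
                  struct xfs_mount        *bt_mount;      /* one per device */
          };

          struct xfs_buf {
                  struct xfs_buftarg      *b_target;
                  /* b_mount removed: 8 bytes saved in every buffer */
          };

          static inline struct xfs_mount *buf_mount(struct xfs_buf *bp)
          {
                  return bp->b_target->bt_mount;  /* one extra pointer hop */
          }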
    • xfs: introduce an uncached buffer read primitive · 5adc94c2
      Dave Chinner authored
      To avoid the need to use cached buffers for single-shot operations
      or for buffers cached at the filesystem level, introduce a new
      buffer read primitive that bypasses the cache and reads directly
      from disk.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Alex Elder <aelder@sgi.com>
    • xfs: rename xfs_buf_get_nodaddr to be more appropriate · 686865f7
      Dave Chinner authored
      xfs_buf_get_nodaddr() is really used to allocate a buffer that is
      uncached. While such a buffer is not directly assigned a disk
      address, the fact that it is not cached is the more important
      distinction. With the upcoming uncached buffer read primitive, we
      should be consistent with this distinction.

      While there, make page allocation in xfs_buf_get_nodaddr() safe
      against memory reclaim re-entrancy into the filesystem by allowing
      a flags parameter to be passed.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Alex Elder <aelder@sgi.com>
    • xfs: don't use vfs writeback for pure metadata modifications · dcd79a14
      Dave Chinner authored
      Under heavy multi-way parallel create workloads, the VFS struggles
      to write back all the inodes that have been changed in age order.
      The bdi flusher thread becomes CPU bound, spending 85% of its time
      in the VFS code, mostly traversing the superblock dirty inode list
      to separate out the dirty inodes old enough to flush.

      We already keep an index of all metadata changes in age order - in
      the AIL - and continued log pressure will do age ordered writeback
      without any extra overhead at all. If there is no pressure on the
      log, xfssyncd will periodically write back metadata in ascending
      disk address offset order, so it will be very efficient.

      Hence we can stop marking VFS inodes dirty during transaction commit
      or when changing timestamps during transactions. This will limit the
      inodes in the superblock dirty list to those containing data or
      unlogged metadata changes.

      However, the timestamp changes are slightly more complex than this -
      there are a couple of places that do unlogged updates of the
      timestamps, and the VFS needs to be informed of these. Hence add a
      new function xfs_trans_ichgtime() for transactional changes,
      and leave xfs_ichgtime() for the non-transactional changes.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Alex Elder <aelder@sgi.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
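      A hypothetical wrapper showing the split, assuming the in-tree XFS
      headers; the touch_inode() helper is invented for illustration and
      is not an XFS function:

          static void touch_inode(struct xfs_trans *tp, struct xfs_inode *ip)
          {
                  if (tp)         /* logged: the AIL orders the writeback */
                          xfs_trans_ichgtime(tp, ip,
                                          XFS_ICHGTIME_MOD | XFS_ICHGTIME_CHG);
                  else            /* unlogged: the VFS must still be told */
                          xfs_ichgtime(ip,
                                          XFS_ICHGTIME_MOD | XFS_ICHGTIME_CHG);
          }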
    • xfs: lockless per-ag lookups · e176579e
      Dave Chinner authored
      When we start taking a reference to the per-ag for every cached
      buffer in the system, kernel lockstat profiling on an 8-way create
      workload shows the mp->m_perag_lock has higher acquisition rates
      than the inode lock and significantly more contention. That is,
      it becomes the most contended lock in the system.

      The perag lookup is trivial to convert to lock-less RCU lookups
      because perag structures never go away. Hence the only thing we need
      to protect against is tree structure changes during a grow. This can
      be done simply by replacing the locking in xfs_perag_get() with RCU
      read locking. This removes the mp->m_perag_lock completely from this
      path.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Alex Elder <aelder@sgi.com>
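      Roughly what the lockless lookup ends up looking like (a simplified
      sketch from the description; assertions and error handling omitted):

          struct xfs_perag *
          xfs_perag_get(struct xfs_mount *mp, xfs_agnumber_t agno)
          {
                  struct xfs_perag *pag;

                  rcu_read_lock();        /* replaces mp->m_perag_lock */
                  pag = radix_tree_lookup(&mp->m_perag_tree, agno);
                  if (pag)
                          atomic_inc(&pag->pag_ref);  /* pin before unlock */
                  rcu_read_unlock();
                  return pag;
          }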
    • xfs: remove debug assert for per-ag reference counting · bd32d25a
      Dave Chinner authored
      When we start taking references per cached buffer to the perag
      it is cached on, it will blow the current debug maximum reference
      count assert out of the water. The assert has never caught a bug,
      and we have tracing to track changes if there ever is a problem,
      so just remove it.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Alex Elder <aelder@sgi.com>
    • xfs: reduce the number of CIL lock round trips during commit · d1583a38
      Dave Chinner authored
      When committing a transaction, we do a CIL state lock round trip
      on every single log vector we insert into the CIL. This is resulting
      in the lock being as hot as the inode and dcache locks on 8-way
      create workloads. Rework the insertion loops to bring the number
      of lock round trips down to one per transaction for log vectors, and
      one more for the busy extents.

      Also change the allocation of the log vector buffer not to zero it,
      as we copy over the entire allocated buffer anyway.

      This patch also includes a structural cleanup to the CIL item
      insertion provided by Christoph Hellwig.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Alex Elder <aelder@sgi.com>
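      The general shape of the optimisation: stage items on a private list
      without the lock, then splice them in with a single lock round trip.
      The types here are illustrative, not the actual CIL structures:

          #include <linux/list.h>
          #include <linux/spinlock.h>

          struct log_item {
                  struct list_head        li_list;
          };

          static void insert_batch(spinlock_t *lock, struct list_head *cil,
                                   struct log_item *items, int nr)
          {
                  LIST_HEAD(staging);
                  int i;

                  for (i = 0; i < nr; i++)        /* lock-free staging */
                          list_add_tail(&items[i].li_list, &staging);

                  spin_lock(lock);                 /* one round trip ... */
                  list_splice_tail(&staging, cil); /* ... for the batch */
                  spin_unlock(lock);
          }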
    • xfs: eliminate some newly-reported gcc warnings · 9c169915
      Poyo VL authored
      Ionut Gabriel Popescu <poyo_vl@yahoo.com> submitted a simple change
      to eliminate some "may be used uninitialized" warnings when building
      XFS.  The reported condition seems to be something that GCC did not
      previously recognize or report.  The warnings were produced by:

          gcc version 4.5.0 20100604
          [gcc-4_5-branch revision 160292] (SUSE Linux)
      Signed-off-by: Ionut Gabriel Popescu <poyo_vl@yahoo.com>
      Signed-off-by: Alex Elder <aelder@sgi.com>
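      A generic illustration of this warning class and the usual fix of
      initializing at declaration (not the actual XFS change):

          #include <stdio.h>

          static int lookup(int key, int *value)
          {
                  if (key < 0)
                          return -1;      /* *value untouched on error */
                  *value = key * 2;
                  return 0;
          }

          int main(void)
          {
                  int v = 0;      /* init silences gcc 4.5's "may be used
                                     uninitialized" false positive */

                  if (lookup(3, &v) == 0)
                          printf("%d\n", v);
                  return 0;
          }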
    • xfs: remove the ->kill_root btree operation · c0e59e1a
      Christoph Hellwig authored
      The implementations of ->kill_root differ only in that the inode
      allocation btree simply zeroes out the now unused buffer in the
      btree cursor, while the allocation btree uses xfs_btree_setbuf.

      Initially both of them used xfs_btree_setbuf, but the use in the
      ialloc btree was removed early on because it interacted badly with
      xfs_trans_binval.

      In addition to zeroing out the buffer in the cursor, xfs_btree_setbuf
      updates the bc_ra array in the btree cursor and calls
      xfs_trans_brelse on the buffer previously occupying the slot.

      The bc_ra update should be done for the alloc btree too, although
      its absence does not cause serious problems.  The xfs_trans_brelse
      call on the other hand is effectively a no-op in the end - it keeps
      decrementing the bli_recur refcount until it hits zero, and then
      just skips out because the buffer will always be dirty at this
      point.  So removing it for the allocation btree is just fine.

      So unify the code and move it to xfs_btree.c.  While we're at it,
      also replace the call to xfs_btree_setbuf with a NULL bp argument in
      xfs_btree_del_cursor with a direct call to xfs_trans_brelse, given
      that the cursor is being freed just after this and the state
      updates are superfluous.  After this, xfs_btree_setbuf is only used
      with a non-NULL bp argument and can thus be simplified.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Alex Elder <aelder@sgi.com>
    • xfs: stop using xfs_qm_dqtobp in xfs_qm_dqflush · acecf1b5
      Christoph Hellwig authored
      In xfs_qm_dqflush we know that q_blkno must already be initialized
      by a previous xfs_qm_dqread.  So instead of calling xfs_qm_dqtobp we
      can simply read the quota buffer directly.  This also saves us from
      a duplicate xfs_qm_dqcheck call and allows xfs_qm_dqtobp to be
      simplified now that it is always called for a newly initialized
      dquot.  In addition, properly unwind all locks in xfs_qm_dqflush
      when xfs_qm_dqcheck fails.

      This mirrors a similar cleanup in the inode lookup done earlier.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Alex Elder <aelder@sgi.com>
    • xfs: simplify xfs_qm_dqusage_adjust · 52fda114
      Christoph Hellwig authored
      There is no need to have the user and group/project quotas locked at
      the same time.  Get rid of xfs_qm_dqget_noattach and just do a
      xfs_qm_dqget inside xfs_qm_quotacheck_dqadjust for the quota we are
      operating on right now.  The new version of
      xfs_qm_quotacheck_dqadjust holds the inode lock over its operations,
      which is not a problem as it simply increments counters and there is
      no concern about log contention during mount time.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Alex Elder <aelder@sgi.com>
    • xfs: Introduce XFS_IOC_ZERO_RANGE · 44722352
      Dave Chinner authored
      XFS_IOC_ZERO_RANGE is the equivalent of an atomic XFS_IOC_UNRESVSP/
      XFS_IOC_RESVSP call pair. It enables ranges of written data to be
      turned into zeroes without requiring IO or having to free and
      reallocate the extents in the range given, as would occur if we had
      to punch and then preallocate them separately.  This enables
      applications to zero parts of files very quickly without changing
      the layout of the files in any way.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
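      A hedged userspace usage sketch, assuming the xfsprogs header that
      defines XFS_IOC_ZERO_RANGE and struct xfs_flock64; offsets chosen
      for illustration only:

          #include <string.h>
          #include <unistd.h>
          #include <sys/ioctl.h>
          #include <xfs/xfs_fs.h>

          /* Zero 1 MiB of already-written data at offset 4 MiB. */
          int zero_range(int fd)
          {
                  struct xfs_flock64 fl;

                  memset(&fl, 0, sizeof(fl));
                  fl.l_whence = SEEK_SET;         /* l_start is absolute */
                  fl.l_start  = 4 * 1024 * 1024;
                  fl.l_len    = 1 * 1024 * 1024;  /* bytes to zero */

                  return ioctl(fd, XFS_IOC_ZERO_RANGE, &fl);
          }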
    • xfs: use range primitives for xfs page cache operations · 3ae4c9de
      Dave Chinner authored
      While XFS passes ranges to operate on from the core code, the
      functions being called ignore either the entire range or the end
      of the range. This is historical: when the functions were written,
      Linux didn't have the necessary range operations. Update the
      functions to use the correct operations.
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
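      A sketch of the kind of range-limited primitives meant here, using
      generic kernel calls; the specific XFS call sites are not shown:

          #include <linux/fs.h>
          #include <linux/mm.h>

          static int flush_and_toss_range(struct address_space *mapping,
                                          loff_t start, loff_t end)
          {
                  int error;

                  /* write back only the dirty pages in [start, end] */
                  error = filemap_write_and_wait_range(mapping, start, end);
                  if (error)
                          return error;

                  /* then drop just that range from the page cache */
                  truncate_inode_pages_range(mapping, start, end);
                  return 0;
          }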
  2. 15 Oct 2010 (4 commits)
    • Linux 2.6.36-rc8 · cd07202c
      Linus Torvalds authored
    • Un-inline the core-dump helper functions · 3aa0ce82
      Linus Torvalds authored
      Tony Luck reports that the addition of the access_ok() check in commit
      0eead9ab ("Don't dump task struct in a.out core-dumps") broke the
      ia64 compile due to missing the necessary header file includes.

      Rather than add yet another include (<asm/unistd.h>) to make everything
      happy, just uninline the silly core dump helper functions and move the
      bodies to fs/exec.c where they make a lot more sense.

      dump_seek() in particular was too big to be an inline function anyway,
      and none of them are in any way performance-critical.  And we really
      don't need to mess up our include file headers more than they already
      are.
      Reported-and-tested-by: Tony Luck <tony.luck@gmail.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
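      The uninlined write helper is roughly the following, a close
      approximation recalled from this change rather than a verbatim copy
      (access_ok still took a VERIFY_READ type argument in this era):

          int dump_write(struct file *file, const void *addr, int nr)
          {
                  return access_ok(VERIFY_READ, addr, nr) &&
                          file->f_op->write(file, addr, nr,
                                            &file->f_pos) == nr;
          }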
    • Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6 · ae42d8d4
      Linus Torvalds authored
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6:
      * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6:
        ehea: Fix a checksum issue on the receive path
        net: allow FEC driver to use fixed PHY support
        tg3: restore rx_dropped accounting
        b44: fix carrier detection on bind
        net: clear heap allocations for privileged ethtool actions
        NET: wimax, fix use after free
        ATM: iphase, remove sleep-inside-atomic
        ATM: mpc, fix use after free
        ATM: solos-pci, remove use after free
        net/fec: carrier off initially to avoid root mount failure
        r8169: use device model DMA API
        r8169: allocate with GFP_KERNEL flag when able to sleep
    • Don't dump task struct in a.out core-dumps · 0eead9ab
      Linus Torvalds authored
      akiphie points out that a.out core-dumps have that odd task struct
      dumping that was never used and was never really a good idea (it goes
      back into the mists of history, probably the original core-dumping
      code).  Just remove it.

      Also do the access_ok() check on dump_write().  It probably doesn't
      matter (since normal filesystems all seem to do it anyway), but he
      points out that it's normally done by the VFS layer, so ...

      [ I suspect that we should possibly do "vfs_write()" instead of
        calling ->write directly.  That also does the whole fsnotify and write
        statistics thing, which may or may not be a good idea. ]

      And just to be anal, do this all for the x86-64 32-bit a.out emulation
      code too, even though it's not enabled (and won't currently even
      compile).
      Reported-by: akiphie <akiphie@lavabit.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 14 Oct 2010 (11 commits)