1. 15 Nov, 2012 1 commit
  2. 14 Nov, 2012 1 commit
    • GFS2: skip dlm_unlock calls in unmount · fb6791d1
      Committed by David Teigland
      When unmounting, gfs2 does a full dlm_unlock operation on every
      cached lock.  This can create a very large amount of work and can
      take a long time to complete.  However, the vast majority of these
      dlm unlock operations are unnecessary because after all the unlocks
      are done, gfs2 leaves the dlm lockspace, which automatically clears
      the locks of the leaving node, without unlocking each one individually.
      So, gfs2 can skip explicit dlm unlocks, and use dlm_release_lockspace to
      remove the locks implicitly.  The one exception is when the lock's lvb is
      being used.  In this case, dlm_unlock is called because it may update the
      lvb of the resource.
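
      The decision described above can be sketched in userspace C. This is an
      illustration only: struct cached_lock, needs_explicit_unlock() and
      count_unlocks() are hypothetical names, not the kernel's, and the real
      code works on glocks and the dlm lockspace rather than a plain array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of a cached lock at unmount time. */
struct cached_lock {
    bool uses_lvb;   /* the lock value block is in use for this resource */
};

/*
 * Returns true when an explicit unlock is still required.
 * During unmount, locks without an LVB are left for the lockspace
 * release to clear implicitly.
 */
static bool needs_explicit_unlock(const struct cached_lock *lk, bool unmounting)
{
    assert(lk != NULL);
    if (!unmounting)
        return true;        /* normal operation: always unlock explicitly */
    return lk->uses_lvb;    /* unmount: only LVB users need the unlock */
}

/* Counts how many explicit unlocks an unmount would issue. */
static unsigned count_unlocks(const struct cached_lock *locks, unsigned n)
{
    unsigned issued = 0;

    for (unsigned i = 0; i < n; i++)
        if (needs_explicit_unlock(&locks[i], true))
            issued++;
    return issued;
}
```

      With three cached locks of which one uses its lvb, count_unlocks()
      reports a single explicit unlock; the other two are cleared when the
      lockspace is released.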
      Signed-off-by: David Teigland <teigland@redhat.com>
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
  3. 07 Nov, 2012 2 commits
  4. 24 Sep, 2012 4 commits
  5. 11 Jun, 2012 2 commits
  6. 08 Jun, 2012 2 commits
    • GFS2: Use lvbs for storing rgrp information with mount option · 90306c41
      Committed by Benjamin Marzinski
      Instead of reading in the resource groups when gfs2 is checking
      for free space to allocate from, gfs2 can store the necessary information
      in the resource group's lvb.  Also, instead of searching for unlinked
      inodes in every resource group that's checked for free space, gfs2 can
      store the number of unlinked inodes in the lvb, and only check for
      unlinked inodes if it will find some.
      
      The first time a resource group is locked, the lvb must be initialized.
      Since this involves counting the unlinked inodes in the resource group,
      this takes a little extra time.  But after that, if the resource group
      is locked with GL_SKIP, the buffer head won't be read in unless it's
      actually needed.
      
      Enabling the resource group lvbs is done via the rgrplvb mount option.  If
      this option isn't set, the lvbs will still be set and updated, but they won't
      be verified or used by the filesystem.  To safely turn on this option, all of
      the nodes mounting the filesystem must be running code with this patch, and
      the filesystem must have been completely unmounted since they were updated.
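
      As a rough illustration of the idea, assuming a hypothetical summary
      structure (rgrp_lvb_sketch and both helpers are invented for this sketch
      and do not reflect the on-disk or in-kernel lvb layout), the allocator
      can consult the lvb before deciding to read a resource group:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical summary of a resource group kept in its lock value block. */
struct rgrp_lvb_sketch {
    bool initialized;      /* false until the group has been locked once */
    uint32_t free_blocks;  /* free blocks recorded in the lvb */
    uint32_t unlinked;     /* unlinked inodes recorded in the lvb */
};

/* True when the group is worth reading in to satisfy the request. */
static bool must_read_rgrp(const struct rgrp_lvb_sketch *lvb, uint32_t wanted)
{
    assert(lvb != NULL);
    if (!lvb->initialized)
        return true;                    /* nothing recorded yet */
    return lvb->free_blocks >= wanted;  /* skip groups that cannot help */
}

/* True when scanning the group for unlinked inodes could find some. */
static bool should_scan_unlinked(const struct rgrp_lvb_sketch *lvb)
{
    assert(lvb != NULL);
    return !lvb->initialized || lvb->unlinked > 0;
}
```

      A group whose lvb records too few free blocks and no unlinked inodes is
      skipped without reading its buffer heads.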
      Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
    • GFS2: Cache last hash bucket for glock seq_files · ba1ddcb6
      Committed by Steven Whitehouse
      For the glocks and glstats seq_files, which are exposed via debugfs
      we should cache the most recent hash bucket, along with the offset
      into that bucket. This allows us to restart from that point, rather
      than having to begin at the beginning each time.
      
      This is an idea from Eric Dumazet, however I've slightly extended it
      so that if the position from which we are due to start is at any
      point beyond the last cached point, we start from the last cached
      point, plus whatever is the appropriate offset. I don't really expect
      people to be lseeking around these files, but if they did so with only
      positive offsets, then we'd still get some of the benefit of using a
      cached offset.
      
      With my simple test of around 200k entries in the file, I'm seeing
      an approx 10x speed up.
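
      The caching scheme can be sketched as follows. This is a userspace
      model under stated assumptions: the table, the iter_cache structure and
      seek_entry() are hypothetical, buckets are represented only by their
      lengths, and no locking or RCU is shown.

```c
#include <assert.h>
#include <stddef.h>

#define NBUCKETS 4

/* Hypothetical table: bucket_len[b] is the number of entries in bucket b. */
struct table {
    size_t bucket_len[NBUCKETS];
};

/* Cached iterator state: entry number pos lives at (bucket, offset). */
struct iter_cache {
    size_t pos;
    size_t bucket;
    size_t offset;
    size_t buckets_skipped;  /* bookkeeping for illustration only */
};

/*
 * Locate entry number pos. Returns 1 and updates the cache on success,
 * 0 if pos is past the end. The scan restarts from the cached point
 * whenever the request is at or beyond it, and from bucket 0 otherwise.
 */
static int seek_entry(const struct table *t, struct iter_cache *c, size_t pos)
{
    size_t bucket = 0;
    size_t base = 0;   /* position of the first entry in the current bucket */

    assert(t != NULL && c != NULL);
    if (pos >= c->pos) {
        bucket = c->bucket;
        base = c->pos - c->offset;
    }
    for (; bucket < NBUCKETS; bucket++) {
        if (pos - base < t->bucket_len[bucket]) {
            c->pos = pos;
            c->bucket = bucket;
            c->offset = pos - base;
            return 1;
        }
        base += t->bucket_len[bucket];
        c->buckets_skipped++;
    }
    return 0;
}
```

      A sequential reader then walks forward from the cached bucket on each
      call instead of rescanning every earlier bucket, which is where the
      saving comes from.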
      
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
  7. 07 Jun, 2012 1 commit
  8. 29 Feb, 2012 1 commit
    • GFS2: glock statistics gathering · a245769f
      Committed by Steven Whitehouse
      The stats are divided into two sets: those relating to the
      super block and those relating to an individual glock. The
      super block stats are done on a per cpu basis in order to
      try and reduce the overhead of gathering them. They are also
      further divided by glock type.
      
      In the case of both the super block and glock statistics,
      the same information is gathered in each case. The super
      block statistics are used to provide default values for
      most of the glock statistics, so that newly created glocks
      should have, as far as possible, a sensible starting point.
      
      The statistics are divided into three pairs of mean and
      variance, plus two counters. The mean/variance pairs are
      smoothed exponential estimates and the algorithm used is
      one which will be very familiar to anyone who has worked on the
      calculation of round trip times in network code.
      
      The three pairs of mean/variance measure the following
      things:
      
       1. DLM lock time (non-blocking requests)
       2. DLM lock time (blocking requests)
       3. Inter-request time (again to the DLM)
      
      A non-blocking request is one which will complete right
      away, whatever the state of the DLM lock in question. That
      currently means any requests when (a) the current state of
      the lock is exclusive (b) the requested state is either null
      or unlocked or (c) the "try lock" flag is set. A blocking
      request covers all the other lock requests.
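
      The classification rule above can be written as a small predicate. The
      enum values and struct below are hypothetical stand-ins for the glock
      states and request flags, chosen only for this sketch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical lock states for illustration. */
enum lock_state { ST_UNLOCKED, ST_NULL, ST_SHARED, ST_EXCLUSIVE };

struct lock_request {
    enum lock_state current;    /* state the lock is in now */
    enum lock_state requested;  /* state being asked for */
    bool try_lock;              /* the "try lock" flag */
};

/* True for requests that complete right away, per rules (a), (b), (c). */
static bool is_non_blocking(const struct lock_request *rq)
{
    assert(rq != NULL);
    return rq->current == ST_EXCLUSIVE ||    /* (a) */
           rq->requested == ST_NULL ||       /* (b) */
           rq->requested == ST_UNLOCKED ||   /* (b) */
           rq->try_lock;                     /* (c) */
}
```

      Any request for which is_non_blocking() returns false would be counted
      in the blocking set of statistics.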
      
      There are two counters. The first is there primarily to show
      how many lock requests have been made, and thus how much data
      has gone into the mean/variance calculations. The other counter
      is counting queueing of holders at the top layer of the glock
      code. Hopefully that number will be a lot larger than the number
      of dlm lock requests issued.
      
      So why gather these statistics? There are several reasons
      we'd like to get a better idea of these timings:
      
      1. To be able to better set the glock "min hold time"
      2. To spot performance issues more easily
      3. To improve the algorithm for selecting resource groups for
      allocation (to base it on lock wait time, rather than blindly
      using a "try lock")

      Due to the smoothing action of the updates, a step change in
      some input quantity being sampled will only fully be taken
      into account after 8 samples (or 4 for the variance) and this
      needs to be carefully considered when interpreting the
      results.
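
      A minimal sketch of such an estimator, in the style of the TCP round
      trip time calculation. The gains of 1/8 for the mean and 1/4 for the
      variance are assumed from the "8 samples (or 4 for the variance)"
      remark above; struct smoothed and update_stats() are hypothetical names
      and the kernel code may differ in detail.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical pair of smoothed estimates, in nanoseconds. */
struct smoothed {
    int64_t mean;
    int64_t var;   /* smoothed mean absolute deviation */
};

/* Fold one sample in: gain 1/8 for the mean, 1/4 for the variance. */
static void update_stats(struct smoothed *s, int64_t sample)
{
    int64_t delta;
    int64_t dev;

    assert(s != NULL);
    delta = sample - s->mean;
    dev = delta < 0 ? -delta : delta;
    s->mean += delta / 8;
    s->var += (dev - s->var) / 4;
}
```

      Starting from zero, a first sample of 800 ns moves the mean only to
      100 ns, which shows why a step change in the input takes several
      samples to appear in the reported values.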
      
      Knowing both the time it takes a lock request to complete and
      the average time between lock requests for a glock means we
      can compute the total percentage of the time for which the
      node is able to use a glock vs. time that the rest of the
      cluster has its share. That will be very useful when setting
      the lock min hold time.
      
      The other point to remember is that all times are in
      nanoseconds. Great care has been taken to ensure that we
      measure exactly the quantities that we want, as accurately
      as possible. There are always inaccuracies in any
      measuring system, but I hope this is as accurate as we
      can reasonably make it.
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
  9. 28 Feb, 2012 1 commit
  10. 11 Jan, 2012 1 commit
  11. 15 Jul, 2011 1 commit
  12. 25 May, 2011 2 commits
    • vmscan: change shrinker API by passing shrink_control struct · 1495f230
      Committed by Ying Han
      Change each shrinker's API by consolidating the existing parameters into
      a shrink_control struct.  This simplifies adding further features without
      touching every file that implements a shrinker.
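
      The shape of the change can be illustrated in userspace. The struct
      below is a stand-in with the two parameters being consolidated
      (gfp_mask and nr_to_scan); shrink_control_sketch, cache_objects and
      shrink_cache() are hypothetical names for this sketch, not kernel code.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for the parameters a shrinker callback receives. */
struct shrink_control_sketch {
    unsigned gfp_mask;
    unsigned long nr_to_scan;
};

/* Hypothetical cache holding a count of reclaimable objects. */
static unsigned long cache_objects = 10;

/*
 * Callback taking one struct instead of separate arguments, so a new
 * field can be added later without changing every callback's signature.
 * Frees up to nr_to_scan objects and returns how many remain; with
 * nr_to_scan == 0 it only reports the current count.
 */
static unsigned long shrink_cache(const struct shrink_control_sketch *sc)
{
    unsigned long n;

    assert(sc != NULL);
    n = sc->nr_to_scan < cache_objects ? sc->nr_to_scan : cache_objects;
    cache_objects -= n;
    return cache_objects;
}
```

      Adding a field to the struct later leaves the signature of
      shrink_cache(), and of every other callback, unchanged.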
      
      [akpm@linux-foundation.org: fix build]
      [akpm@linux-foundation.org: fix warning]
      [kosaki.motohiro@jp.fujitsu.com: fix up new shrinker API]
      [akpm@linux-foundation.org: fix xfs warning]
      [akpm@linux-foundation.org: update gfs2]
      Signed-off-by: Ying Han <yinghan@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Acked-by: Pavel Emelyanov <xemul@openvz.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • GFS2: Processes waiting on inode glock that no processes are holding · f90e5b5b
      Committed by Bob Peterson
      This patch fixes a race in the GFS2 glock state machine that may
      result in lockups.  The symptom is that all nodes but one will
      hang, waiting for a particular glock.  All the holder records
      will have the "W" (Waiting) bit set.  The other node will
      typically have the glock stuck in Exclusive mode (EX) with no
      holder records, but the dinode will be cached.  In other words,
      an entry with "I:" will appear in the glock dump for that glock,
      but nothing else.
      
      The race has to do with the glock "Pending Demote" bit, which
      can be set, then immediately reset, thus losing the fact that
      another node needs the glock.  The sequence of events is:
      
      1. Something schedules the glock workqueue (e.g. glock request from fs)
      2. The glock workqueue gets to the point between the test of the reply pending
      bit and the spin lock:
      
              if (test_and_clear_bit(GLF_REPLY_PENDING, &gl->gl_flags)) {
                      finish_xmote(gl, gl->gl_reply);
                      drop_ref = 1;
              }
              down_read(&gfs2_umount_flush_sem);         <---- i.e. here
              spin_lock(&gl->gl_spin);
      
      3. In comes (a) the reply to our EX lock request setting GLF_REPLY_PENDING and
                  (b) the demote request which sets GLF_PENDING_DEMOTE
      
      4. The following test is executed:
      
              if (test_and_clear_bit(GLF_PENDING_DEMOTE, &gl->gl_flags) &&
                  gl->gl_state != LM_ST_UNLOCKED &&
                  gl->gl_demote_state != LM_ST_EXCLUSIVE) {
      
      This resets the pending demote flag, and gl->gl_demote_state is not equal to
      exclusive.  However, because the reply from the dlm arrived after we checked for
      the GLF_REPLY_PENDING flag, gl->gl_state is still equal to unlocked, so
      although we reset the GLF_PENDING_DEMOTE flag, we didn't then set the
      GLF_DEMOTE flag or reinstate the GLF_PENDING_DEMOTE flag.
      
      The patch closes the timing window by only transitioning the
      "Pending demote" bit to the "demote" flag once we know the
      other conditions (not unlocked and not exclusive) are met.
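
      The difference between the racy and the fixed ordering can be modelled
      in single-threaded userspace C. This is a sketch only: the flag macros,
      struct glock_sketch and both functions are hypothetical, and plain
      bit operations stand in for the kernel's atomic bitops.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the glock flags and states named above. */
#define F_PENDING_DEMOTE 0x1u
#define F_DEMOTE         0x2u

enum state { ST_UNLOCKED, ST_SHARED, ST_EXCLUSIVE };

struct glock_sketch {
    unsigned flags;
    enum state state;
    enum state demote_state;
};

/* Racy form: clears the pending bit before checking the other conditions. */
static void promote_racy(struct glock_sketch *gl)
{
    bool was_pending;

    assert(gl != NULL);
    was_pending = (gl->flags & F_PENDING_DEMOTE) != 0;
    gl->flags &= ~F_PENDING_DEMOTE;
    if (was_pending && gl->state != ST_UNLOCKED &&
        gl->demote_state != ST_EXCLUSIVE)
        gl->flags |= F_DEMOTE;
}

/* Fixed form: the pending bit is cleared only when it becomes F_DEMOTE. */
static void promote_fixed(struct glock_sketch *gl)
{
    assert(gl != NULL);
    if ((gl->flags & F_PENDING_DEMOTE) && gl->state != ST_UNLOCKED &&
        gl->demote_state != ST_EXCLUSIVE) {
        gl->flags &= ~F_PENDING_DEMOTE;
        gl->flags |= F_DEMOTE;
    }
}
```

      With the state still unlocked, promote_racy() leaves no flag set and the
      demote request is lost, while promote_fixed() keeps the pending bit so a
      later pass can convert it once the state has changed.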
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
  13. 05 May, 2011 1 commit
  14. 26 Apr, 2011 1 commit
  15. 20 Apr, 2011 4 commits
    • GFS2: Make writeback more responsive to system conditions · 4667a0ec
      Committed by Steven Whitehouse
      This patch adds writeback_control to writing back the AIL
      list. This means that we can then take advantage of the
      information we get in ->write_inode() in order to set off
      some pre-emptive writeback.
      
      In addition, the AIL code is cleaned up a bit to make it
      a bit simpler to understand.
      
      There is still more which can usefully be done in this area,
      but this is a good start at least.
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
    • GFS2: Optimise glock lru and end of life inodes · f42ab085
      Committed by Steven Whitehouse
      The GLF_LRU flag introduced in the previous patch can be
      used to check if a glock is on the lru list when a new
      holder is queued and if so remove it, without having first
      to get the lru_lock.
      
      The main purpose of this patch however is to optimise the
      glocks left over when an inode at end of life is being
      evicted. Previously such glocks were left with the GLF_LFLUSH
      flag set, so that when reclaimed, each one required a log flush.
      This patch resets the GLF_LFLUSH flag when there is nothing
      left to flush thus preventing later log flushes as glocks are
      reused or demoted.
      
      In order to do this, we need to keep track of the number of
      revokes which are outstanding, and also to clear the GLF_LFLUSH
      bit after a log commit when only revokes have been processed.
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
    • GFS2: Improve tracing support (adds two flags) · 627c10b7
      Committed by Steven Whitehouse
      This adds support for two new flags. One keeps track of whether
      the glock is on the LRU list or not. The other isn't really a
      flag as such, but an indication of whether the glock has an
      attached object or not. This indication is reported without
      any locking, which is ok since we do not dereference the object
      pointer but merely report whether it is NULL or not.
      
      Also, this fixes one place where a tracepoint was missing, which
      was at the point we remove deallocated blocks from the journal.
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
    • GFS2: Alter point of entry to glock lru list for glocks with an address_space · 29687a2a
      Committed by Steven Whitehouse
      Rather than allowing the glocks to be scheduled for possible
      reclaim as soon as they have exited the journal, this patch
      delays their entry to the list until the glocks in question
      are no longer in use.
      
      This means that we will rely on the vm for writeback of all
      dirty data and metadata from now on. When glocks are added
      to the lru list they should be freeable much faster since all
      the I/O required to free them should have already been completed.
      
      This should lead to much better I/O patterns under low memory
      conditions.
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
  16. 31 Mar, 2011 1 commit
  17. 15 Mar, 2011 1 commit
  18. 11 Mar, 2011 1 commit
  19. 09 Mar, 2011 1 commit
    • GFS2: Fix glock deallocation race · fc0e38da
      Committed by Steven Whitehouse
      This patch fixes a race in deallocating glocks which was introduced
      in the RCU glock patch. We need to ensure that the glock count is
      kept correct even in the case that there is a race to add a new
      glock into the hash table. Also, to avoid having to wait for an
      RCU grace period, the glock counter can be decremented before
      call_rcu() is called.
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
  20. 17 Feb, 2011 1 commit
  21. 31 Jan, 2011 1 commit
  22. 21 Jan, 2011 1 commit
    • GFS2: Use RCU for glock hash table · bc015cb8
      Committed by Steven Whitehouse
      This has a number of advantages:
      
       - Reduces contention on the hash table lock
       - Makes the code smaller and simpler
       - Should speed up glock dumps when under load
       - Removes ref count changing in examine_bucket
       - No longer need hash chain lock in glock_put() in common case
      
      There are some further changes which this enables and which
      we may do in the future. One is to look at using SLAB_RCU,
      and another is to look at using a per-cpu counter for the
      per-sb glock counter, since that is touched twice in the
      lifetime of each glock (but only used at umount time).
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
  23. 30 Nov, 2010 5 commits
  24. 15 Nov, 2010 1 commit
    • GFS2: Fix inode deallocation race · 044b9414
      Committed by Steven Whitehouse
      This area of the code has always been a bit delicate due to the
      subtleties of lock ordering. The problem is that for "normal"
      alloc/dealloc, we always grab the inode locks first and the rgrp lock
      later.
      
      In order to ensure no races in looking up the unlinked, but still
      allocated inodes, we need to hold the rgrp lock when we do the lookup,
      which means that we can't take the inode glock.
      
      The solution is to borrow the technique already used by NFS to solve
      what is essentially the same problem (given an inode number, look up
      the inode carefully, checking that it really is in the expected
      state).
      
      We cannot do that directly from the allocation code (lock ordering
      again) so we give the job to the pre-existing delete workqueue and
      carry on with the allocation as normal.
      
      If we find there is no space, we do a journal flush (required anyway
      if space from a deallocation is to be released) which should block
      against the pending deallocations, so we should always get the space
      back.
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
  25. 29 Sep, 2010 1 commit
  26. 20 Sep, 2010 1 commit
    • GFS2: Use new workqueue scheme · 9fa0ea9f
      Committed by Steven Whitehouse
      The recovery workqueue can be freezable since
      we want it to finish what it is doing if the system is to
      be frozen (although why you'd want to freeze a cluster node
      is beyond me since it will result in it being ejected from
      the cluster). It does still make sense for single node
      GFS2 filesystems though.
      
      The glock workqueue will benefit from being able to run more
      work items concurrently. A test running postmark shows
      improved performance and multi-threaded workloads are likely
      to benefit even more. It needs to be high priority because
      the latency directly affects the latency of filesystem glock
      operations.
      
      The delete workqueue is similar to the recovery workqueue in
      that it must not get blocked by memory allocations, and may
      run for a long time.
      
      Potentially other GFS2 threads might also be converted to
      workqueues, but I'll leave that for a later patch.
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
      Acked-by: Tejun Heo <tj@kernel.org>