1. 18 Jul, 2014 4 commits
    • GFS2: Allow flocks to use normal glock dq rather than dq_wait · 5bef3e7c
      Committed by Bob Peterson
      This patch allows flock glocks to use a non-blocking dequeue rather
      than dq_wait. It also reverts the previous patch I had posted regarding
      dq_wait. The reverted patch isn't necessarily a bad idea, but I decided
      this might avoid unforeseen side effects, and was therefore safer.
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
    • GFS2: Use GFP_NOFS when allocating glocks · fe0bbd29
      Committed by Steven Whitehouse
      Normally GFP_KERNEL is ok here, but there is now a rarely used code path
      relating to deallocation of unlinked inodes (in certain corner cases)
      which if hit at times of memory shortage can cause recursion while
      trying to free memory.
      
      One solution would be to try and move the gfs2_glock_get() call so
      that it is no longer called while another glock is held, but that
      doesn't look at all easy, so GFP_NOFS is the best solution for the
      time being.
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
    • GFS2: Fix race in glock lru glock disposal · 94a09a39
      Committed by Steven Whitehouse
      We must not leave items on the LRU list with GLF_LOCK set, since
      they can be removed if the glock is brought back into use, which
      may then potentially result in a hang, waiting for GLF_LOCK to
      clear.
      
      It doesn't happen very often, since it requires a glock that has
      not been used for a long time to be brought back into use at the
      same moment that the shrinker is part way through disposing of
      glocks.
      
      The fix is to set GLF_LOCK at a later time, when we already know
      that the other locks can be obtained. Also, we now only release
      the lru_lock in case a resched is needed, rather than on every
      iteration.
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
    • GFS2: Only wait for demote when last holder is dequeued · 79272b35
      Committed by Bob Peterson
      Function gfs2_glock_dq_wait is supposed to dequeue a glock and then
      wait for the lock to be demoted. The problem is, if this is a shared
      lock, its demote will depend on the other holders, which means you
      might end up waiting forever because the other process is blocked.
      This problem is especially apparent when dealing with nested flocks.
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
  2. 18 Apr, 2014 1 commit
  3. 12 Mar, 2014 1 commit
  4. 07 Mar, 2014 2 commits
  5. 16 Jan, 2014 1 commit
    • GFS2: Don't use ENOBUFS when ENOMEM is the correct error code · ac3beb6a
      Committed by Steven Whitehouse
      Al Viro has tactfully pointed out that we are using the incorrect
      error code in some cases. This patch fixes that, and also removes
      the (unused) return value for glock dumping.
      
      >        * gfs2_iget() - ENOBUFS instead of ENOMEM.  ENOBUFS is
      > "No buffer space available (POSIX.1 (XSI STREAMS option))" and since
      > we don't support STREAMS it's probably fair game, but... what the hell?
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
  6. 02 Jan, 2014 1 commit
  7. 21 Nov, 2013 1 commit
  8. 15 Oct, 2013 1 commit
    • GFS2: Use lockref for glocks · e66cf161
      Committed by Steven Whitehouse
      Currently glocks have an atomic reference count and also a spinlock
      which covers various internal fields, such as the state. The intent of
      this patch is to replace the spinlock and the atomic reference count
      with a lockref structure. This contains a spinlock which we can continue
      to use as before, and a reference counter which is used in conjunction
      with the spinlock to replace the previous atomic counter.
      
      As a result of this there are some new rules for reference counting on
      glocks. We need to distinguish between reference count changes under
      gl_spin (which are now just increment or decrement of the new counter,
      provided the count cannot hit zero) and those which are outside of
      gl_spin, but which now take gl_spin internally.
      
      The conversion is relatively straightforward. There is probably some
      further clean up which can be done, but the priority at this stage is to
      make the change in as simple a manner as possible.
      
      A consequence of this change is that the reference count is being
      decoupled from the lru list processing. This should allow future
      adoption of the lru_list code with glocks in due course.
      
      The reason for using the "dead" state and not just relying on 0 being
      the "invalid state" is so that in due course 0 ref counts can be
      allowable. The intent is to eventually be able to remove the ref count
      changes which are currently hidden away in state_change().
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
  9. 11 Sep, 2013 2 commits
    • fs: convert fs shrinkers to new scan/count API · 1ab6c499
      Committed by Dave Chinner
      Convert the filesystem shrinkers to use the new API, and standardise some
      of the behaviours of the shrinkers at the same time.  For example,
      nr_to_scan means the number of objects to scan, not the number of objects
      to free.
      
      I refactored the CIFS idmap shrinker a little - it really needs to be
      broken up into a shrinker per tree and keep an item count with the tree
      root so that we don't need to walk the tree every time the shrinker needs
      to count the number of objects in the tree (i.e.  all the time under
      memory pressure).
      
      [glommer@openvz.org: fixes for ext4, ubifs, nfs, cifs and glock. Fixes are needed mainly due to new code merged in the tree]
      [assorted fixes folded in]
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Glauber Costa <glommer@openvz.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Acked-by: Jan Kara <jack@suse.cz>
      Acked-by: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Cc: Arve Hjønnevåg <arve@android.com>
      Cc: Carlos Maiolino <cmaiolino@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: J. Bruce Fields <bfields@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Kent Overstreet <koverstreet@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Thomas Hellstrom <thellstrom@vmware.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • super: fix calculation of shrinkable objects for small numbers · 55f841ce
      Committed by Glauber Costa
      The sysctl knob sysctl_vfs_cache_pressure is used to determine which
      percentage of the shrinkable objects in our cache we should actively try
      to shrink.
      
      It works great in situations in which we have many objects (at least more
      than 100), because the approximation errors will be negligible.  But if
      this is not the case, especially when total_objects < 100, we may end up
      concluding that we have no objects at all (total / 100 = 0, if total <
      100).
      
      This is certainly not the biggest killer in the world, but may matter in
      very low kernel memory situations.
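      
      The truncation described above is easy to reproduce outside the kernel.
      The following is a minimal userspace sketch, not the patch itself: the
      function names are hypothetical, and the second variant only mirrors the
      quotient-plus-remainder idea behind the kernel's mult_frac() helper.

```c
/* Illustrative only: both function names are assumptions for this sketch. */

/* Divide first, then scale: any total below 100 collapses to 0. */
static unsigned long scale_divide_first(unsigned long total,
                                        unsigned long pressure)
{
    return (total / 100) * pressure;
}

/* Scale the quotient and the remainder separately, so that small
 * totals still contribute and the product cannot overflow early. */
static unsigned long scale_keep_remainder(unsigned long total,
                                          unsigned long pressure)
{
    unsigned long quot = total / 100;
    unsigned long rem = total % 100;

    return quot * pressure + (rem * pressure) / 100;
}
```

      With a pressure of 100 and 50 objects, the first variant reports 0
      objects while the second reports 50.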
      Signed-off-by: Glauber Costa <glommer@openvz.org>
      Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Cc: Arve Hjønnevåg <arve@android.com>
      Cc: Carlos Maiolino <cmaiolino@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: J. Bruce Fields <bfields@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Kent Overstreet <koverstreet@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Thomas Hellstrom <thellstrom@vmware.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  10. 04 Sep, 2013 1 commit
  11. 20 Aug, 2013 1 commit
  12. 19 Aug, 2013 1 commit
  13. 29 Apr, 2013 1 commit
  14. 26 Apr, 2013 1 commit
    • GFS2: Flush work queue before clearing glock hash tables · 222cb538
      Committed by Bob Peterson
      There was a timing window when a GFS2 file system was unmounted
      that caused GFS2 to call BUG() and panic the kernel. The call
      to BUG() is meant to ensure that the glock reference count,
      gl_ref, never gets down to zero and bounces back up again. What was
      happening during umount is that function gfs2_put_super was dequeuing
      its glocks for well-known files. In particular, we saw it on the
      journal glock, sd_jinode_gh. The dequeue caused delayed work to be
      queued for the glock state machine, to transition the lock to an
      "unlocked" state. While the work was still queued, gfs2_put_super
      called gfs2_gl_hash_clear to clear out the glock hash tables.
      If the timing was just so, the glock work function would drop the
      reference count at the time when it was being checked for zero,
      and that caused BUG() to be called. This patch calls
      flush_workqueue before clearing the glock hash tables, thereby
      ensuring that the delayed work is executed before the hash tables
      are cleared, and therefore the reference count never goes to zero
      until the glock is cleared.
      Signed-off-by: Bob Peterson <rpeterso@redhat.com>
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
  15. 10 Apr, 2013 2 commits
    • GFS2: Add origin indicator to glock demote tracing · 7bd8b2eb
      Committed by Steven Whitehouse
      This adds the origin indicator to the trace point for glock
      demotion, so that it is possible to see where demote requests
      have come from.
      
      Note that requests generated from the demote_rq sysfs interface
      will show as remote, since they are intended to replicate
      exactly the effect of a demote request from a remote node. It
      is still possible to tell these apart by looking at the process
      which initiated the demote request.
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
    • GFS2: Add origin indicator to glock callbacks · 81ffbf65
      Committed by Steven Whitehouse
      This patch adds a bool indicating whether the demote
      request was originated locally or remotely. This is then
      used by the iopen ->go_callback() to make 100% sure that
      it will only respond to remote callbacks.
      
      Since ->evict_inode() uses GL_NOCACHE when it attempts to
      get an exclusive lock on the iopen lock, this may result
      in extra scheduling of the workqueue in case that the
      exclusive promotion request failed. This patch prevents
      that from happening.
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
  16. 08 Apr, 2013 1 commit
    • GFS2: Remove gfs2_refresh_inode from inode creation path · 28fb3027
      Committed by Steven Whitehouse
      The original method for creating inodes used in GFS2 was to fill
      out a buffer, with all the information, and then to read that
      buffer into the in-core inode, using gfs2_refresh_inode().
      
      The problem with this approach is that all the inode's fields
      need to be calculated ahead of time, and were stored in various
      variables making the code rather complicated.
      
      The new approach is simply to allocate the in-core inode earlier
      and fill in as many fields as possible ahead of time. These can
      then be used to initialise the on disk representation. The
      code has been working towards the point where it is possible
      to remove gfs2_refresh_inode() because all the fields are
      correctly initialised ahead of time. We've now reached that
      milestone, and have reversed the order of setting up the in
      core and on disk inodes.
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
  17. 02 Feb, 2013 1 commit
    • GFS2: Split glock lru processing into two parts · 4506a519
      Committed by Steven Whitehouse
      The intent here is to split the processing of the glock lru
      list into two parts, so that the selection of glocks and the
      disposal are separate functions. The plan is then, that further
      updates can then be made to these functions in the future
      to improve the selection of glocks and also the efficiency of
      glock disposal.
      
      The new feature which this patch brings is sorting the
      glocks to be disposed of into glock number (and thus also
      disk block number) order. Not all glocks will need i/o in
      order to dispose of them, but some will, and at least we'll
      generate mostly disk block order i/o now.
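      
      The ordering step can be pictured with a small userspace sketch. It is
      an illustration under stated assumptions rather than the kernel code:
      struct glock_stub, its number field, and sort_for_disposal() are
      hypothetical names, and qsort() on an array stands in for whatever list
      sorting the kernel implementation actually uses.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a glock that has been selected for disposal. */
struct glock_stub {
    uint64_t number;  /* glock number, assumed to track the disk block */
};

/* Order two entries by ascending glock number. */
static int cmp_glock_number(const void *a, const void *b)
{
    const struct glock_stub *ga = a;
    const struct glock_stub *gb = b;

    if (ga->number < gb->number)
        return -1;
    if (ga->number > gb->number)
        return 1;
    return 0;
}

/* Sort the selected entries so that any i/o needed to dispose of
 * them is issued in roughly ascending disk block order. */
static void sort_for_disposal(struct glock_stub *v, size_t n)
{
    qsort(v, n, sizeof(*v), cmp_glock_number);
}
```

      Keeping selection and sorting in separate functions matches the split
      described above: the selection policy can change without touching the
      ordering used for disposal.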
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
  18. 29 Jan, 2013 1 commit
  19. 12 Dec, 2012 1 commit
    • mm: redefine address_space.assoc_mapping · 252aa6f5
      Committed by Rafael Aquini
      Overhaul struct address_space.assoc_mapping renaming it to
      address_space.private_data and its type is redefined to void*.  By this
      approach we consistently name the .private_* elements from struct
      address_space as well as allow extended usage for address_space
      association with other data structures through ->private_data.
      
      Also, all users of old ->assoc_mapping element are converted to reflect
      its new name and type change (->private_data).
      Signed-off-by: Rafael Aquini <aquini@redhat.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  20. 15 Nov, 2012 2 commits
  21. 14 Nov, 2012 1 commit
    • GFS2: skip dlm_unlock calls in unmount · fb6791d1
      Committed by David Teigland
      When unmounting, gfs2 does a full dlm_unlock operation on every
      cached lock.  This can create a very large amount of work and can
      take a long time to complete.  However, the vast majority of these
      dlm unlock operations are unnecessary because after all the unlocks
      are done, gfs2 leaves the dlm lockspace, which automatically clears
      the locks of the leaving node, without unlocking each one individually.
      So, gfs2 can skip explicit dlm unlocks, and use dlm_release_lockspace to
      remove the locks implicitly.  The one exception is when the lock's lvb is
      being used.  In this case, dlm_unlock is called because it may update the
      lvb of the resource.
      Signed-off-by: David Teigland <teigland@redhat.com>
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
  22. 07 Nov, 2012 2 commits
  23. 24 Sep, 2012 4 commits
  24. 11 Jun, 2012 2 commits
  25. 08 Jun, 2012 2 commits
    • GFS2: Use lvbs for storing rgrp information with mount option · 90306c41
      Committed by Benjamin Marzinski
      Instead of reading in the resource groups when gfs2 is checking
      for free space to allocate from, gfs2 can store the necessary information
      in the resource group's lvb.  Also, instead of searching for unlinked
      inodes in every resource group that's checked for free space, gfs2 can
      store the number of unlinked inodes in the lvb, and only check for
      unlinked inodes if it will find some.
      
      The first time a resource group is locked, the lvb must be initialized.
      Since this involves counting the unlinked inodes in the resource group,
      this takes a little extra time.  But after that, if the resource group
      is locked with GL_SKIP, the buffer head won't be read in unless it's
      actually needed.
      
      Enabling the resource group lvbs is done via the rgrplvb mount option.  If
      this option isn't set, the lvbs will still be set and updated, but they won't
      be verified or used by the filesystem.  To safely turn on this option, all of
      the nodes mounting the filesystem must be running code with this patch, and
      the filesystem must have been completely unmounted since they were updated.
      Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
    • GFS2: Cache last hash bucket for glock seq_files · ba1ddcb6
      Committed by Steven Whitehouse
      For the glocks and glstats seq_files, which are exposed via debugfs
      we should cache the most recent hash bucket, along with the offset
      into that bucket. This allows us to restart from that point, rather
      than having to begin at the beginning each time.
      
      This is an idea from Eric Dumazet, however I've slightly extended it
      so that if the position from which we are due to start is at any
      point beyond the last cached point, we start from the last cached
      point, plus whatever is the appropriate offset. I don't really expect
      people to be lseeking around these files, but if they did so with only
      positive offsets, then we'd still get some of the benefit of using a
      cached offset.
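      
      The caching scheme can be sketched in userspace as follows. This is an
      illustration of the idea only, with assumptions throughout: the table is
      reduced to an array of per-bucket entry counts, and NBUCKETS,
      struct walk_cache and walk_seek() are hypothetical names that do not
      appear in the GFS2 code.

```c
#include <stddef.h>

#define NBUCKETS 4  /* hypothetical table size for the sketch */

/* Most recently located entry. */
struct walk_cache {
    size_t pos;     /* absolute index of the cached entry */
    size_t bucket;  /* bucket that holds it */
    size_t offset;  /* index of the entry within that bucket */
};

/* Locate entry number pos, where counts[] gives the entries per bucket.
 * Returns 1 and updates the cache on success, 0 if pos is past the end. */
static int walk_seek(const size_t counts[NBUCKETS], struct walk_cache *c,
                     size_t pos)
{
    size_t bucket = 0;
    size_t base = 0;  /* absolute index of the first entry in bucket */

    if (pos >= c->pos) {
        /* Forward seek: resume from the cached bucket. */
        bucket = c->bucket;
        base = c->pos - c->offset;
    }

    for (; bucket < NBUCKETS; base += counts[bucket], bucket++) {
        if (pos < base + counts[bucket]) {
            c->pos = pos;
            c->bucket = bucket;
            c->offset = pos - base;
            return 1;
        }
    }
    return 0;
}
```

      A sequential reader always hits the forward case, so each call walks
      only the buckets between the cached point and the target. A backward
      seek falls back to scanning from the first bucket.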
      
      With my simple test of around 200k entries in the file, I'm seeing
      an approx 10x speed up.
      
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
  26. 07 Jun, 2012 1 commit
  27. 29 Feb, 2012 1 commit
    • GFS2: glock statistics gathering · a245769f
      Committed by Steven Whitehouse
      The stats are divided into two sets: those relating to the
      super block and those relating to an individual glock. The
      super block stats are done on a per cpu basis in order to
      try and reduce the overhead of gathering them. They are also
      further divided by glock type.
      
      In the case of both the super block and glock statistics,
      the same information is gathered in each case. The super
      block statistics are used to provide default values for
      most of the glock statistics, so that newly created glocks
      should have, as far as possible, a sensible starting point.
      
      The statistics are divided into three pairs of mean and
      variance, plus two counters. The mean/variance pairs are
      smoothed exponential estimates and the algorithm used is
      one which will be very familiar to those used to calculation
      of round trip times in network code.
      
      The three pairs of mean/variance measure the following
      things:
      
       1. DLM lock time (non-blocking requests)
       2. DLM lock time (blocking requests)
       3. Inter-request time (again to the DLM)
      
      A non-blocking request is one which will complete right
      away, whatever the state of the DLM lock in question. That
      currently means any requests when (a) the current state of
      the lock is exclusive (b) the requested state is either null
      or unlocked or (c) the "try lock" flag is set. A blocking
      request covers all the other lock requests.
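      
      The three conditions (a) to (c) amount to a simple predicate. The sketch
      below restates them in C; the enum values, the REQ_TRY flag and the
      function name are hypothetical and are not the identifiers used by GFS2
      or the DLM.

```c
#include <stdbool.h>

/* Hypothetical lock states, named for this sketch only. */
enum lock_state { ST_UNLOCKED, ST_NULL, ST_SHARED, ST_EXCLUSIVE };

/* Hypothetical stand-in for the "try lock" flag. */
#define REQ_TRY 0x1u

/* A request is non-blocking if it completes right away whatever the
 * state of the lock: conditions (a), (b) and (c) from the text. */
static bool request_is_nonblocking(enum lock_state cur, enum lock_state req,
                                   unsigned int flags)
{
    if (cur == ST_EXCLUSIVE)                     /* (a) */
        return true;
    if (req == ST_NULL || req == ST_UNLOCKED)    /* (b) */
        return true;
    if (flags & REQ_TRY)                         /* (c) */
        return true;
    return false;                                /* everything else blocks */
}
```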
      
      There are two counters. The first is there primarily to show
      how many lock requests have been made, and thus how much data
      has gone into the mean/variance calculations. The other counter
      is counting queueing of holders at the top layer of the glock
      code. Hopefully that number will be a lot larger than the number
      of dlm lock requests issued.
      
      So why gather these statistics? There are several reasons
      we'd like to get a better idea of these timings:
      
      1. To be able to better set the glock "min hold time"
      2. To spot performance issues more easily
      3. To improve the algorithm for selecting resource groups for
      allocation (to base it on lock wait time, rather than blindly
      using a "try lock")
      
      Due to the smoothing action of the updates, a step change in
      some input quantity being sampled will only fully be taken
      into account after 8 samples (or 4 for the variance) and this
      needs to be carefully considered when interpreting the
      results.
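      
      A minimal sketch of this kind of smoothing, modelled on the familiar
      round trip time estimator, is shown below. The gains of 1/8 for the mean
      and 1/4 for the variance are inferred from the "8 samples (or 4 for the
      variance)" remark; struct smoothed_stat and stat_update() are
      hypothetical names, and the "variance" here is a smoothed mean absolute
      deviation, as in the network estimator.

```c
#include <stdint.h>

/* Hypothetical pair of smoothed estimates, in nanoseconds. */
struct smoothed_stat {
    int64_t mean;
    int64_t var;
};

/* Fold one new sample into the estimates. */
static void stat_update(struct smoothed_stat *s, int64_t sample)
{
    int64_t delta = sample - s->mean;
    int64_t abs_delta = delta < 0 ? -delta : delta;

    s->mean += delta / 8;               /* gain 1/8 for the mean */
    s->var += (abs_delta - s->var) / 4; /* gain 1/4 for the deviation */
}
```

      Starting from zero, a first sample of 800 moves the mean to 100 and the
      deviation to 200, which shows how slowly a step change is absorbed.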
      
      Knowing both the time it takes a lock request to complete and
      the average time between lock requests for a glock means we
      can compute the total percentage of the time for which the
      node is able to use a glock vs. time that the rest of the
      cluster has its share. That will be very useful when setting
      the lock min hold time.
      
      The other point to remember is that all times are in
      nanoseconds. Great care has been taken to ensure that we
      measure exactly the quantities that we want, as accurately
      as possible. There are always inaccuracies in any
      measuring system, but I hope this is as accurate as we
      can reasonably make it.
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>