1. 11 September 2013, 4 commits
    • dcache: remove dentries from LRU before putting on dispose list · dd1f6b2e
      Committed by Dave Chinner
      One of the big problems with modifying the way the dcache shrinker and LRU
      implementation works is that the LRU is abused in several ways.  One of
      these is shrink_dentry_list().
      
      Basically, we can move a dentry off the LRU onto a different list without
      doing any accounting changes, and then use dentry_lru_prune() to remove it
      from whatever list it is now on to do the LRU accounting at that point.
      
      This makes it -really hard- to change the LRU implementation.  The use of
      the per-sb LRU lock serialises movement of the dentries between the
      different lists and the removal of them, and this is the only reason that
      it works.  If we want to break up the dentry LRU lock and lists into, say,
      per-node lists, we remove the only serialisation that allows this lru
      list/dispose list abuse to work.
      
      To make this work effectively, the dispose list has to be isolated from
      the LRU list - dentries have to be removed from the LRU *before* being
      placed on the dispose list.  This means that the LRU accounting and
      isolation is completed before disposal is started, and that means we can
      change the LRU implementation freely in future.
      
      This means that dentries *must* be marked with DCACHE_SHRINK_LIST when
      they are placed on the dispose list so that we don't think that parent
      dentries found in try_prune_one_dentry() are on the LRU when they are
      actually on the dispose list.  This would result in accounting the dentry
      to the LRU a second time.  Hence dentry_lru_del() has to handle the
      DCACHE_SHRINK_LIST case.
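      
      A minimal userspace sketch of the ordering described above (invented
      types and names, not the kernel code): an entry leaves the LRU and has
      its accounting fixed up *before* it is parked on the dispose list, and a
      DCACHE_SHRINK_LIST-style flag records that it now lives there.
      
      	/* Sketch only: models the LRU -> dispose-list hand-off. */
      	#include <assert.h>
      	#include <stdio.h>
      
      	#define FLAG_LRU_LIST    0x1	/* analogue of "on the LRU"       */
      	#define FLAG_SHRINK_LIST 0x2	/* analogue of DCACHE_SHRINK_LIST */
      
      	struct entry {
      		struct entry *next;
      		unsigned int flags;
      	};
      
      	static struct entry *lru_head, *dispose_head;
      	static long nr_unused;		/* LRU accounting, like nr_unused */
      
      	/* Take one entry off the LRU, fix the accounting, then flag it and
      	 * park it on the dispose list: the LRU is done with it first. */
      	static void isolate_one(void)
      	{
      		struct entry *e = lru_head;
      
      		if (!e)
      			return;
      		lru_head = e->next;
      		e->flags &= ~FLAG_LRU_LIST;
      		nr_unused--;			/* accounting completed here */
      
      		e->flags |= FLAG_SHRINK_LIST;	/* later code must not treat  */
      		e->next = dispose_head;		/* this entry as an LRU entry */
      		dispose_head = e;
      	}
      
      	int main(void)
      	{
      		struct entry a = { 0 }, b = { 0 };
      
      		a.flags = b.flags = FLAG_LRU_LIST;
      		a.next = &b;
      		lru_head = &a;
      		nr_unused = 2;
      
      		isolate_one();
      		assert(nr_unused == 1);
      		assert(dispose_head == &a && (a.flags & FLAG_SHRINK_LIST));
      		printf("isolated one entry, %ld left on the LRU\n", nr_unused);
      		return 0;
      	}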
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Glauber Costa <glommer@openvz.org>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Cc: Arve Hjønnevåg <arve@android.com>
      Cc: Carlos Maiolino <cmaiolino@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: J. Bruce Fields <bfields@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Kent Overstreet <koverstreet@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Thomas Hellstrom <thellstrom@vmware.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      dd1f6b2e
    • dentry: move to per-sb LRU locks · 19156840
      Committed by Dave Chinner
      With the dentry LRUs being per-sb structures, there is no real need for
      a global dentry_lru_lock. The locking can be made more fine-grained by
      moving to a per-sb LRU lock, isolating the LRU operations of different
      filesystems completely from each other. The need for this is independent
      of any performance consideration that may arise: in the interest of
      abstracting the LRU operations away, it is mandatory that each LRU
      operates under its own lock instead of a global lock shared by all of them.
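      
      For illustration, a userspace model of the locking change (pthread
      mutexes stand in for spinlocks; the structure is simplified and the
      field names are invented): each "superblock" carries its own LRU lock
      and list, so LRU operations on different filesystems never contend.
      
      	/* Model only: per-superblock LRU locking instead of one global lock. */
      	#include <pthread.h>
      	#include <stdio.h>
      
      	struct node { struct node *next; };
      
      	struct sb_like {
      		pthread_mutex_t lru_lock;	/* was: one global dentry_lru_lock */
      		struct node *lru;		/* this sb's LRU list              */
      		long nr_unused;
      	};
      
      	/* Adding to the LRU only takes the lock of the owning superblock, so
      	 * two different filesystems never serialise against each other here. */
      	static void lru_add(struct sb_like *sb, struct node *n)
      	{
      		pthread_mutex_lock(&sb->lru_lock);
      		n->next = sb->lru;
      		sb->lru = n;
      		sb->nr_unused++;
      		pthread_mutex_unlock(&sb->lru_lock);
      	}
      
      	int main(void)
      	{
      		struct sb_like sb1 = { PTHREAD_MUTEX_INITIALIZER, NULL, 0 };
      		struct sb_like sb2 = { PTHREAD_MUTEX_INITIALIZER, NULL, 0 };
      		struct node a, b;
      
      		lru_add(&sb1, &a);
      		lru_add(&sb2, &b);	/* independent lock: no contention with sb1 */
      		printf("sb1=%ld sb2=%ld\n", sb1.nr_unused, sb2.nr_unused);
      		return 0;
      	}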
      
      [glommer@openvz.org: updated changelog ]
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Glauber Costa <glommer@openvz.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Cc: Arve Hjønnevåg <arve@android.com>
      Cc: Carlos Maiolino <cmaiolino@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: J. Bruce Fields <bfields@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Kent Overstreet <koverstreet@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Thomas Hellstrom <thellstrom@vmware.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      19156840
    • dcache: convert dentry_stat.nr_unused to per-cpu counters · 62d36c77
      Committed by Dave Chinner
      Before we split up the dcache_lru_lock, the unused dentry counter needs to
      be made independent of the global dcache_lru_lock.  Convert it to per-cpu
      counters to do this.
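      
      A rough userspace model of the per-cpu counter idea (the cpu slots are
      invented for illustration): increments touch only a local slot, and the
      slots are summed only when somebody actually reads the total.
      
      	/* Model of a per-cpu style counter: cheap local updates, summed on read. */
      	#include <stdatomic.h>
      	#include <stdio.h>
      
      	#define NR_CPUS 4
      
      	static _Atomic long nr_unused[NR_CPUS];	/* one slot per "cpu" */
      
      	static void cpu_inc_unused(int cpu)
      	{
      		/* The kernel uses this_cpu_inc(); relaxed atomics model the
      		 * "no shared lock, no shared cacheline" property here. */
      		atomic_fetch_add_explicit(&nr_unused[cpu], 1, memory_order_relaxed);
      	}
      
      	static long get_nr_unused(void)
      	{
      		long sum = 0;
      
      		for (int i = 0; i < NR_CPUS; i++)
      			sum += atomic_load_explicit(&nr_unused[i], memory_order_relaxed);
      		return sum;	/* only the occasional reader pays this cost */
      	}
      
      	int main(void)
      	{
      		cpu_inc_unused(0);
      		cpu_inc_unused(3);
      		printf("unused dentries: %ld\n", get_nr_unused());
      		return 0;
      	}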
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Glauber Costa <glommer@openvz.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Cc: Arve Hjønnevåg <arve@android.com>
      Cc: Carlos Maiolino <cmaiolino@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: J. Bruce Fields <bfields@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Kent Overstreet <koverstreet@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Thomas Hellstrom <thellstrom@vmware.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      62d36c77
    • fs: bump inode and dentry counters to long · 3942c07c
      Committed by Glauber Costa
      This series reworks our current object cache shrinking infrastructure in
      two main ways:
      
       * Noticing that a lot of users copy and paste their own version of LRU
         lists for objects, we put some effort in providing a generic version.
         It is modeled after the filesystem users: dentries, inodes, and xfs
         (for various tasks), but we expect that other users could benefit in
         the near future with little or no modification.  Let us know if you
         have any issues.
      
       * The underlying list_lru being proposed automatically and
         transparently keeps the elements in per-node lists, and is able to
         manipulate the node lists individually.  Given this infrastructure, we
         are able to modify the up-to-now hammer called shrink_slab to proceed
         with node-reclaim instead of always searching memory from all over like
         it has been doing.
      
      Per-node lru lists are also expected to lead to less contention in the lru
      locks on multi-node scans, since we are now no longer fighting for a
      global lock.  The locks usually disappear from the profilers with this
      change.
      
      Although we have no official benchmarks for this version - be our guest to
      independently evaluate this - earlier versions of this series were
      performance tested (details at
      http://permalink.gmane.org/gmane.linux.kernel.mm/100537) yielding no
      visible performance regressions while yielding a better qualitative
      behavior in NUMA machines.
      
      With this infrastructure in place, we can use the list_lru entry point to
      provide memcg isolation and per-memcg targeted reclaim.  Historically,
      those two pieces of work have been posted together.  This version presents
      only the infrastructure work, deferring the memcg work for a later time,
      so we can focus on getting this part tested.  You can see more about the
      history of such work at http://lwn.net/Articles/552769/
      
      Dave Chinner (18):
        dcache: convert dentry_stat.nr_unused to per-cpu counters
        dentry: move to per-sb LRU locks
        dcache: remove dentries from LRU before putting on dispose list
        mm: new shrinker API
        shrinker: convert superblock shrinkers to new API
        list: add a new LRU list type
        inode: convert inode lru list to generic lru list code.
        dcache: convert to use new lru list infrastructure
        list_lru: per-node list infrastructure
        shrinker: add node awareness
        fs: convert inode and dentry shrinking to be node aware
        xfs: convert buftarg LRU to generic code
        xfs: rework buffer dispose list tracking
        xfs: convert dquot cache lru to list_lru
        fs: convert fs shrinkers to new scan/count API
        drivers: convert shrinkers to new count/scan API
        shrinker: convert remaining shrinkers to count/scan API
        shrinker: Kill old ->shrink API.
      
      Glauber Costa (7):
        fs: bump inode and dentry counters to long
        super: fix calculation of shrinkable objects for small numbers
        list_lru: per-node API
        vmscan: per-node deferred work
        i915: bail out earlier when shrinker cannot acquire mutex
        hugepage: convert huge zero page shrinker to new shrinker API
        list_lru: dynamically adjust node arrays
      
      This patch:
      
      There are situations on very large machines in which we can have a large
      quantity of dirty inodes, unused dentries, etc.  This is particularly true
      when unmounting a filesystem, since every live object will eventually be
      discarded.
      
      Dave Chinner reported a problem with this while experimenting with the
      shrinker revamp patchset.  So we believe it is time for a change.  This
      patch simply converts the counters from int to long.  Machines where this
      matters will have 64-bit longs anyway.
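      
      A tiny standalone illustration of why the width matters (the numbers are
      made up, not taken from the report): a large machine can accumulate more
      cached objects than a 32-bit counter can represent, while long is 64 bits
      wide on the machines where such counts occur.
      
      	#include <stdio.h>
      
      	int main(void)
      	{
      		/* e.g. ~50 million cached objects per node on a 64-node box */
      		long objects = 50L * 1000 * 1000 * 64;
      
      		int  as_int  = (int)objects;	/* what an int counter would hold */
      		long as_long = objects;		/* what the patch switches to     */
      
      		printf("true count: %ld\n", objects);
      		printf("as int:     %d (overflowed)\n", as_int);
      		printf("as long:    %ld\n", as_long);
      		return 0;
      	}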
      Signed-off-by: Glauber Costa <glommer@openvz.org>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Cc: Arve Hjønnevåg <arve@android.com>
      Cc: Carlos Maiolino <cmaiolino@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: J. Bruce Fields <bfields@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Kent Overstreet <koverstreet@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Thomas Hellstrom <thellstrom@vmware.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      3942c07c
  2. 10 September 2013, 2 commits
    • split read_seqretry_or_unlock(), convert d_walk() to resulting primitives · 48f5ec21
      Committed by Al Viro
      Separate "check if we need to retry" from "unlock if we are done and
      had seq_writelock"; that allows to use these guys in d_walk(), where
      we need to recheck every time we ascend back to parent, but do *not*
      want to unlock until the very end.  Lift rcu_read_lock/rcu_read_unlock
      out into callers.
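      
      A userspace model of the split (a mutex and an atomic counter stand in
      for the seqlock, and the helper names here are invented, not the kernel
      API): checking whether a retry is needed and dropping the lock become two
      separate steps, so a walk can re-check the sequence at every ascent and
      only unlock once at the very end.
      
      	#include <pthread.h>
      	#include <stdatomic.h>
      	#include <stdbool.h>
      	#include <stdio.h>
      
      	static pthread_mutex_t rename_lock = PTHREAD_MUTEX_INITIALIZER;
      	static _Atomic unsigned rename_seq;	/* bumped by writers */
      
      	static void walk_begin_or_lock(int *seq)
      	{
      		if (!(*seq & 1))		/* even: lockless attempt */
      			*seq = atomic_load(&rename_seq) & ~1u;
      		else				/* odd: take the lock     */
      			pthread_mutex_lock(&rename_lock);
      	}
      
      	static bool walk_need_retry(int seq)
      	{
      		return !(seq & 1) && atomic_load(&rename_seq) != (unsigned)seq;
      	}
      
      	static void walk_done(int seq)
      	{
      		if (seq & 1)			/* unlock only if we locked */
      			pthread_mutex_unlock(&rename_lock);
      	}
      
      	int main(void)
      	{
      		int seq = 0;
      
      		walk_begin_or_lock(&seq);
      		/* ... walk; call walk_need_retry(seq) at every ascent ... */
      		if (walk_need_retry(seq)) {
      			seq = 1;		/* switch to the locked mode */
      			walk_begin_or_lock(&seq);
      		}
      		/* ... rest of the walk ... */
      		walk_done(seq);			/* single unlock at the end  */
      		printf("walk finished (seq=%d)\n", seq);
      		return 0;
      	}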
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      48f5ec21
    • dcache: Translating dentry into pathname without taking rename_lock · 232d2d60
      Committed by Waiman Long
      When running AIM7's short workload, Linus' lockref patch eliminated
      most of the spinlock contention. However, there were still some left:
      
           8.46%     reaim  [kernel.kallsyms]     [k] _raw_spin_lock
                       |--42.21%-- d_path
                       |          proc_pid_readlink
                       |          SyS_readlinkat
                       |          SyS_readlink
                       |          system_call
                       |          __GI___readlink
                       |
                       |--40.97%-- sys_getcwd
                       |          system_call
                       |          __getcwd
      
      The big one here is the rename_lock (seqlock) contention in d_path()
      and the getcwd system call. This patch will eliminate the need to take
      the rename_lock while translating dentries into the full pathnames.
      
      The rename_lock is taken to make sure that no rename operation can be
      ongoing while the translation is in progress.  However, only one thread
      can take the rename_lock for writing, blocking all the other threads
      that need it even though the translation process won't make any change
      to the dentries.
      
      This patch will replace the writer's write_seqlock/write_sequnlock
      sequence of the rename_lock in the callers of the prepend_path() and
      __dentry_path() functions with the reader's read_seqbegin/read_seqretry
      sequence within these two functions.  As a result, the code will have to
      retry if one or more rename operations have been performed.  In addition,
      the RCU read lock will be taken during the translation process to make
      sure that no dentries will go away.  To prevent livelock, the code will
      switch back to taking the rename_lock if read_seqretry() fails three
      times.
      
      To further reduce spinlock contention, this patch does not take the
      dentry's d_lock when copying the filename from the dentries.  Instead,
      it treats the name pointer and length as unreliable and just copies the
      string byte by byte until it hits a null byte or the end of the string
      as specified by the length.  This should avoid stepping into an invalid
      memory address.  The error cases are left to be handled by the sequence
      number check.
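      
      A small standalone sketch of that copy strategy (hypothetical helper, not
      the kernel's prepend_name()): stop at a NUL or after len bytes, whichever
      comes first, and rely on a later sequence check to discard torn results.
      
      	#include <stdio.h>
      
      	/* Copy at most len bytes of a possibly-unstable name, stopping early
      	 * at a NUL; never read or write past the stated bounds. */
      	static void copy_unreliable_name(char *dst, unsigned long dstlen,
      					 const char *name, unsigned long len)
      	{
      		unsigned long i;
      
      		for (i = 0; i < len && i + 1 < dstlen; i++) {
      			char c = name[i];	/* may be stale: that is fine */
      			if (!c)
      				break;
      			dst[i] = c;
      		}
      		dst[i] = '\0';
      	}
      
      	int main(void)
      	{
      		char buf[16];
      
      		/* Claimed length 10, but the name really ends after 3 bytes. */
      		copy_unreliable_name(buf, sizeof(buf), "abc\0stale-tail", 10);
      		printf("copied \"%s\"\n", buf);	/* prints: copied "abc" */
      		return 0;
      	}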
      
      The following refactorings are also made:
      1. Move prepend('/') into prepend_name() to remove one conditional
         check.
      2. Move the global root check in prepend_path() back to the top of
         the while loop.
      
      With this patch, the _raw_spin_lock will now account for only 1.2%
      of the total CPU cycles for the short workload. This patch also
      reduces the impact that running perf has on its own profile,
      since the perf command itself can be a heavy user of the d_path()
      function depending on the complexity of the workload.
      
      When taking the perf profile of the high-systime workload, the amount
      of spinlock contention contributed by running perf without this patch
      was about 16%. With this patch, the spinlock contention caused by
      the running of perf will go away and we will have a more accurate
      perf profile.
      Signed-off-by: Waiman Long <Waiman.Long@hp.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      232d2d60
  3. 09 September 2013, 2 commits
    • vfs: use lockref "dead" flag to mark unrecoverably dead dentries · 0d98439e
      Committed by Linus Torvalds
      This simplifies the RCU-to-refcounting code in particular.
      
      I was originally intending to leave this for later, but walking through
      all the dput() logic (see previous commit), I realized that the dput()
      "might_sleep()" check was misleadingly weak.  And I removed it as
      misleading, both for performance profiling and for debugging.
      
      However, the might_sleep() debugging case is actually true: the final
      dput() can indeed sleep, if the inode of the dentry that you are
      releasing ends up sleeping at iput time (see dentry_iput()).  So the
      problem with the might_sleep() in dput() wasn't that it wasn't true, it
      was that it wasn't actually testing and triggering on the interesting
      case.
      
      In particular, just about *any* dput() can indeed sleep, if you happen
      to race with another thread deleting the file in question, and you then
      lose the race to be the last dput() for that file.  But because it's
      a very rare race, the debugging code would never trigger it in practice.
      
      Why is this problematic? The new d_rcu_to_refcount() (see commit
      15570086: "vfs: reimplement d_rcu_to_refcount() using
      lockref_get_or_lock()") does a dput() for the failure case, and it does
      it under the RCU lock.  So potentially sleeping really is a bug.
      
      But there's no way I'm going to fix this with the previous complicated
      "lockref_get_or_lock()" interface.  And rather than revert to the old
      and crufty nested dentry locking code (which did get this right by
      delaying the reference count updates until they were verified to be
      safe), let's make forward progress.
      
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0d98439e
    • vfs: reorganize dput() memory accesses · 8aab6a27
      Committed by Linus Torvalds
      This is me being a bit OCD after all the dentry optimization work this
      merge window: profiles end up showing 'dput()' as a rather expensive
      operation, and there were two unrelated bad reasons for that.
      
      The first reason was reading d_lockref.count for debugging purposes,
      which touches the lockref cacheline (for reads) before we really need to.
      More importantly, the debugging test in question is _wrong_, and has
      hidden bugs.  It's true that we can only sleep when the count goes down
      to zero, but the test as-is hides the much more subtle bug that happens
      if we race with somebody else deleting the file.
      
      Anyway we _will_ touch that cacheline, but let's do it for a write and
      in the right routine (ie in "lockref_put_or_lock()") which annotates the
      costs better.  So remove the misleading debug code.
      
      The other was an unnecessary access to the cacheline that contains the
      d_lru list, just to check whether we already were on the LRU list or
      not.  This is exactly what we have d_flags for, so that we can avoid
      touching extra cache lines for the common case.  So just add another bit
      for "is this dentry on the LRU".
      
      Finally, mark the tests properly likely/unlikely, so that the common
      fast-paths are dense in the instruction stream.
      
      This makes the profiles look much saner.
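      
      An illustrative userspace model of the cacheline argument (layout and
      names are invented): the hot path tests a flag bit that sits next to data
      it already reads, instead of dereferencing the LRU list pointers that
      live in a different cacheline.
      
      	#include <stdbool.h>
      	#include <stdio.h>
      
      	#define FLAG_LRU_LIST 0x80	/* analogue of the new d_flags LRU bit */
      
      	struct dentry_like {
      		unsigned int flags;				/* hot cacheline        */
      		char pad[64];					/* ...cacheline gap...  */
      		struct dentry_like *lru_prev, *lru_next;	/* cold: LRU list links */
      	};
      
      	static bool on_lru(const struct dentry_like *d)
      	{
      		/* Before: list_empty(&d->d_lru) touches lru_prev/lru_next.
      		 * After:  a single flag test on data the caller already has. */
      		return d->flags & FLAG_LRU_LIST;
      	}
      
      	int main(void)
      	{
      		struct dentry_like d = { .flags = FLAG_LRU_LIST };
      
      		printf("on LRU: %d\n", on_lru(&d));
      		return 0;
      	}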
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8aab6a27
  4. 06 September 2013, 4 commits
    • vfs: check unlinked ancestors before mount · eed81007
      Committed by Miklos Szeredi
      We check submounts before doing d_drop() on a non-empty directory dentry in
      NFS (have_submounts()), but we do not exclude a racing mount.  Nor do we
      prevent mounts from being added to the disconnected subtree using relative paths
      after the d_drop().
      
      This patch fixes these issues by checking for unlinked (unhashed, non-root)
      ancestors before proceeding with the mount.  This is done with rename
      seqlock taken for write and with ->d_lock grabbed on each ancestor in turn,
      including our dentry itself.  This ensures that only one of
      check_submounts_and_drop() or has_unlinked_ancestor() can succeed.
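      
      A simplified, lock-free userspace sketch of the ancestor scan (the real
      code additionally holds the rename seqlock for write and takes ->d_lock
      on each dentry it inspects): walk from the dentry towards the root and
      fail if anything on the way has been unhashed.
      
      	#include <stdbool.h>
      	#include <stdio.h>
      
      	struct dent {
      		struct dent *parent;	/* a root dentry is its own parent */
      		bool hashed;
      	};
      
      	static bool has_unlinked_ancestor(struct dent *d)
      	{
      		for (; d->parent != d; d = d->parent) {
      			if (!d->hashed)		/* unlinked somewhere above us */
      				return true;
      		}
      		return false;			/* reached the root: all hashed */
      	}
      
      	int main(void)
      	{
      		struct dent root = { .parent = &root, .hashed = true };
      		struct dent dir  = { .parent = &root, .hashed = false };	/* d_drop()ed */
      		struct dent leaf = { .parent = &dir,  .hashed = true };
      
      		printf("mount under leaf allowed: %s\n",
      		       has_unlinked_ancestor(&leaf) ? "no" : "yes");
      		return 0;
      	}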
      Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      eed81007
    • vfs: check submounts and drop atomically · 848ac114
      Committed by Miklos Szeredi
      We check submounts before doing d_drop() on a non-empty directory dentry in
      NFS (have_submounts()), but we do not exclude a racing mount.
      
       Process A: have_submounts() -> returns false
       Process B: mount() -> success
       Process A: d_drop()
      
      This patch prepares the ground for the fix by doing the following
      operations all under the same rename lock:
      
        have_submounts()
        shrink_dcache_parent()
        d_drop()
      
      This is actually an optimization since have_submounts() and
      shrink_dcache_parent() both traverse the same dentry tree separately.
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
      CC: David Howells <dhowells@redhat.com>
      CC: Steven Whitehouse <swhiteho@redhat.com>
      CC: Trond Myklebust <Trond.Myklebust@netapp.com>
      CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      848ac114
    • vfs: add d_walk() · db14fc3a
      Committed by Miklos Szeredi
      This one replaces three instances of open-coded tree walking (have_submounts,
      select_parent, d_genocide) with a common helper.
      
      In addition to slightly reducing the kernel size, this simplifies the
      callers and makes them less bug prone.
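      
      A condensed userspace sketch of the idea (recursive and lock-free for
      brevity; the kernel's d_walk() iterates and also handles locking, renames
      and restarts): one generic walker, with per-caller behaviour supplied as
      a callback.
      
      	#include <stdio.h>
      
      	enum walk_ret { WALK_CONTINUE, WALK_QUIT };
      
      	struct node {
      		const char *name;
      		struct node *child, *sibling;
      	};
      
      	/* have_submounts/select_parent/d_genocide become different "enter"
      	 * callbacks instead of three hand-rolled tree walks. */
      	static void walk(struct node *n,
      			 enum walk_ret (*enter)(void *data, struct node *n),
      			 void *data)
      	{
      		if (enter(data, n) == WALK_QUIT)
      			return;
      		for (struct node *c = n->child; c; c = c->sibling)
      			walk(c, enter, data);
      	}
      
      	static enum walk_ret count_cb(void *data, struct node *n)
      	{
      		(void)n;
      		++*(int *)data;
      		return WALK_CONTINUE;
      	}
      
      	int main(void)
      	{
      		struct node b = { "b", NULL, NULL };
      		struct node a = { "a", NULL, &b };
      		struct node root = { "/", &a, NULL };
      		int count = 0;
      
      		walk(&root, count_cb, &count);
      		printf("visited %d nodes\n", count);
      		return 0;
      	}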
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      db14fc3a
    • vfs: restructure d_genocide() · 01ddc4ed
      Committed by Miklos Szeredi
      It shouldn't matter when we decrement the refcount during the walk as long
      as we do it exactly once.
      
      Restructure d_genocide() to do the killing on entering the dentry instead
      of when leaving it.  This helps in creating a common helper for tree walking.
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      01ddc4ed
  5. 04 September 2013, 1 commit
    • vfs: call d_op->d_prune() before unhashing dentry · 590fb51f
      Committed by Yan, Zheng
      The d_prune dentry operation is used to notify the filesystem when the
      VFS is about to prune a hashed dentry from the dcache. There are three
      code paths that prune dentries: shrink_dcache_for_umount_subtree(),
      prune_dcache_sb() and d_prune_aliases(). For the d_prune_aliases()
      case, the VFS unhashes the dentry first, then calls the d_prune dentry
      operation. This confuses ceph_d_prune() (ceph uses the d_prune
      dentry operation to maintain a flag indicating whether the complete
      contents of a directory are in the dcache; pruning an unhashed dentry
      does not affect the directory's completeness).
      
      This patch fixes the issue by calling the d_prune dentry operation
      in d_prune_aliases(), before unhashing the dentry. It also makes the VFS
      call the d_prune dentry operation only for hashed dentries, to avoid
      calling the d_prune dentry operation twice when a dentry is pruned
      by d_prune_aliases().
      Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      590fb51f
  6. 03 September 2013, 2 commits
    • vfs: reimplement d_rcu_to_refcount() using lockref_get_or_lock() · 15570086
      Committed by Linus Torvalds
      This moves __d_rcu_to_refcount() from <linux/dcache.h> into fs/namei.c
      and re-implements it using the lockref infrastructure instead.  It also
      adds a lot of comments about what is actually going on, because turning
      a dentry that was looked up using RCU into a long-lived reference
      counted entry is one of the more subtle parts of the rcu walk.
      
      We also used to be _particularly_ subtle in unlazy_walk() where we
      re-validate both the dentry and its parent using the same sequence
      count.  We used to do it by nesting the locks and then verifying the
      sequence count just once.
      
      That was silly, because nested locking is expensive, but the sequence
      count check is not.  So this just re-validates the dentry and the parent
      separately, avoiding the nested locking, and making the lockref lookup
      possible.
      Acked-by: Waiman Long <waiman.long@hp.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      15570086
    • vfs: use lockref_get_not_zero() for optimistic lockless dget_parent() · df3d0bbc
      Committed by Waiman Long
      A valid parent pointer is always going to have a non-zero reference
      count, but if we look up the parent optimistically without locking, we
      have to protect against the (very unlikely) race against renaming
      changing the parent from under us.
      
      We do that by using lockref_get_not_zero(), and then re-checking the
      parent pointer after getting a valid reference.
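      
      A userspace model of that pattern (C11 atomics stand in for lockref's
      cmpxchg fast path and for the RCU/READ_ONCE handling; all names are
      invented): take a reference only if the count is non-zero, then re-check
      that the pointer still names our parent, otherwise undo and fall back.
      
      	#include <stdatomic.h>
      	#include <stdbool.h>
      	#include <stdio.h>
      
      	struct dent {
      		_Atomic long count;
      		struct dent *_Atomic parent;
      	};
      
      	/* Roughly what lockref_get_not_zero() guarantees: bump the count only
      	 * if it was not zero, in one atomic step, without taking a lock. */
      	static bool get_not_zero(struct dent *d)
      	{
      		long c = atomic_load(&d->count);
      
      		while (c > 0) {
      			if (atomic_compare_exchange_weak(&d->count, &c, c + 1))
      				return true;
      		}
      		return false;
      	}
      
      	/* Returns the parent with an extra reference, or NULL if the caller
      	 * must fall back to the locked slow path. */
      	static struct dent *get_parent_fast(struct dent *d)
      	{
      		struct dent *p = atomic_load(&d->parent);
      
      		if (!get_not_zero(p))
      			return NULL;			/* count was zero          */
      		if (atomic_load(&d->parent) == p)
      			return p;			/* still our parent: done  */
      		atomic_fetch_sub(&p->count, 1);		/* raced with rename: undo */
      		return NULL;
      	}
      
      	int main(void)
      	{
      		struct dent root  = { 1, NULL };
      		struct dent child = { 1, &root };
      		struct dent *p = get_parent_fast(&child);
      
      		printf("parent taken: %s, parent count now %ld\n",
      		       p ? "yes" : "no", atomic_load(&root.count));
      		return 0;
      	}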
      
      [ This is a re-implementation of a chunk from the original patch by
        Waiman Long: "dcache: Enable lockless update of dentry's refcount".
        I've completely rewritten the patch-series and split it up, but I'm
        attributing this part to Waiman as it's close enough to his earlier
        patch  - Linus ]
      Signed-off-by: Waiman Long <Waiman.Long@hp.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      df3d0bbc
  7. 29 August 2013, 1 commit
    • vfs: make the dentry cache use the lockref infrastructure · 98474236
      Committed by Waiman Long
      This just replaces the dentry count/lock combination with the lockref
      structure that contains both a count and a spinlock, and does the
      mechanical conversion to use the lockref infrastructure.
      
      There are no semantic changes here, it's purely syntactic.  The
      reference lockref implementation uses the spinlock exactly the same way
      that the old dcache code did, and the bulk of this patch is just
      expanding the internal "d_count" use in the dcache code to use
      "d_lockref.count" instead.
      
      This is purely preparation for the real change to make the reference
      count updates be lockless during the 3.12 merge window.
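      
      A minimal userspace model of the shape being introduced (a pthread mutex
      stands in for the spinlock; this is only the "count plus lock in one
      structure" idea, not the later lockless implementation):
      
      	#include <pthread.h>
      	#include <stdio.h>
      
      	struct lockref_like {
      		pthread_mutex_t lock;	/* plays the role of dentry->d_lock  */
      		int count;		/* plays the role of dentry->d_count */
      	};
      
      	struct dentry_like {
      		struct lockref_like d_lockref;
      		/* name, inode, lists, ... */
      	};
      
      	/* Same locking discipline as before; only the field moved, so the
      	 * old "d_count" uses become "d_lockref.count". */
      	static void dget_like(struct dentry_like *d)
      	{
      		pthread_mutex_lock(&d->d_lockref.lock);
      		d->d_lockref.count++;
      		pthread_mutex_unlock(&d->d_lockref.lock);
      	}
      
      	int main(void)
      	{
      		struct dentry_like d = { { PTHREAD_MUTEX_INITIALIZER, 1 } };
      
      		dget_like(&d);
      		printf("count = %d\n", d.d_lockref.count);
      		return 0;
      	}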
      
      [ As with the previous commit, this is a rewritten version of a concept
        originally from Waiman, so credit goes to him, blame for any errors
        goes to me.
      
        Waiman's patch had some semantic differences for taking advantage of
        the lockless update in dget_parent(), while this patch is
        intentionally a pure search-and-replace change with no semantic
        changes.     - Linus ]
      Signed-off-by: Waiman Long <Waiman.Long@hp.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      98474236
  8. 25 August 2013, 1 commit
  9. 29 June 2013, 3 commits
  10. 14 June 2013, 1 commit
    • GFS2: Add atomic_open support · 6d4ade98
      Committed by Steven Whitehouse
      I've restricted atomic_open to only operate on regular files, although
      I still don't understand why atomic_open should not be possible also for
      directories on GFS2. That can always be added in later though, if it
      makes sense.
      
      The ->atomic_open function can be passed negative dentries, which
      in most cases means either ENOENT (->lookup) or a call to d_instantiate
      (->create). In the GFS2 case though, we need to actually perform the
      look up, since we do not know whether there has been a new inode created
      on another node. The look up calls d_splice_alias which then tries to
      rehash the dentry - so the solution here is to simply check for that
      in d_splice_alias. The same issue is likely to affect any other cluster
      filesystem implementing ->atomic_open.
      Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Cc: Jeff Layton <jlayton@redhat.com>
      6d4ade98
  11. 05 May 2013, 2 commits
  12. 01 May 2013, 1 commit
    • fs/dcache.c: add cond_resched() to shrink_dcache_parent() · 421348f1
      Committed by Greg Thelen
      Call cond_resched() in shrink_dcache_parent() to maintain interactivity.
      
      Before this patch:
      
      	void shrink_dcache_parent(struct dentry * parent)
      	{
      		while ((found = select_parent(parent, &dispose)) != 0)
      			shrink_dentry_list(&dispose);
      	}
      
      select_parent() populates the dispose list with dentries which
      shrink_dentry_list() then deletes.  select_parent() carefully uses
      need_resched() to avoid doing too much work at once.  But neither
      shrink_dcache_parent() nor its called functions call cond_resched().  So
      once need_resched() is set, select_parent() will return a dispose list
      containing a single dentry, which is then deleted by shrink_dentry_list().
      This is inefficient when there are a lot of dentries to process.  It can
      cause softlockups and hurts interactivity on non-preemptible kernels.
      
      This change adds cond_resched() in shrink_dcache_parent().  The benefit
      of this is that need_resched() is quickly cleared so that future calls
      to select_parent() are able to efficiently return a big batch of dentries.
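      
      For comparison, the post-patch loop is roughly the excerpt above with a
      cond_resched() after each batch (shown in the same abbreviated style as
      the snippet above, not as complete kernel code):
      
      	void shrink_dcache_parent(struct dentry * parent)
      	{
      		while ((found = select_parent(parent, &dispose)) != 0) {
      			shrink_dentry_list(&dispose);
      			cond_resched();	/* clear need_resched() so the next
      					   select_parent() can batch again */
      		}
      	}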
      
      These additional cond_resched() calls do not seem to impact performance, at
      least for the workload below.
      
      Here is a program which can cause soft lockup if other system activity
      sets need_resched().
      
      	#include <err.h>
      	#include <fcntl.h>
      	#include <stdio.h>
      	#include <stdlib.h>
      	#include <sys/resource.h>
      	#include <sys/stat.h>
      	#include <sys/time.h>
      	#include <unistd.h>
      
      	int main(void)
      	{
      	        struct rlimit rlim;
      	        int i;
      	        int f[100000];
      	        char buf[20];
      	        struct timeval t1, t2;
      	        double diff;
      
      	        /* cleanup past run */
      	        system("rm -rf x");
      
      	        /* boost nfile rlimit */
      	        rlim.rlim_cur = 200000;
      	        rlim.rlim_max = 200000;
      	        if (setrlimit(RLIMIT_NOFILE, &rlim))
      	                err(1, "setrlimit");
      
      	        /* make directory for files */
      	        if (mkdir("x", 0700))
      	                err(1, "mkdir");
      
      	        if (gettimeofday(&t1, NULL))
      	                err(1, "gettimeofday");
      
      	        /* populate directory with open files */
      	        for (i = 0; i < 100000; i++) {
      	                snprintf(buf, sizeof(buf), "x/%d", i);
      	                f[i] = open(buf, O_CREAT, 0600);
      	                if (f[i] == -1)
      	                        err(1, "open");
      	        }
      
      	        /* close some of the files */
      	        for (i = 0; i < 85000; i++)
      	                close(f[i]);
      
      	        /* unlink all files, even open ones */
      	        system("rm -rf x");
      
      	        if (gettimeofday(&t2, NULL))
      	                err(1, "gettimeofday");
      
      	        diff = (((double)t2.tv_sec * 1000000 + t2.tv_usec) -
      	                ((double)t1.tv_sec * 1000000 + t1.tv_usec));
      
      	        printf("done: %g elapsed\n", diff/1e6);
      	        return 0;
      	}
      Signed-off-by: Greg Thelen <gthelen@google.com>
      Signed-off-by: Dave Chinner <david@fromorbit.com>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      421348f1
  13. 27 March 2013, 1 commit
    • Nest rename_lock inside vfsmount_lock · 7ea600b5
      Committed by Al Viro
      ... lest we get livelocks between path_is_under() and d_path() and friends.
      
      The thing is, wrt fairness lglocks are more similar to rwsems than to rwlocks;
      it is possible to have thread B spin on attempt to take lock shared while thread
      A is already holding it shared, if B is on lower-numbered CPU than A and there's
      a thread C spinning on attempt to take the same lock exclusive.
      
      As a result, we need consistent ordering between vfsmount_lock (lglock) and
      rename_lock (seqlock), even though everything that takes both is going to take
      vfsmount_lock only shared.
      Spotted-by: Brad Spengler <spender@grsecurity.net>
      Cc: stable@vger.kernel.org
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      7ea600b5
  14. 28 February 2013, 1 commit
    • hlist: drop the node parameter from iterators · b67bfe0d
      Committed by Sasha Levin
      I'm not sure why, but the hlist for-each-entry iterators were not conceived
      to look like the plain list iterator:
      
              list_for_each_entry(pos, head, member)
      
      The hlist ones were greedy and wanted an extra parameter:
      
              hlist_for_each_entry(tpos, pos, head, member)
      
      Why did they need an extra pos parameter? I'm not quite sure. Not only
      do they not really need it, it also prevents the iterator from looking
      exactly like the list iterator, which is unfortunate.
      
      Besides the semantic patch, there was some manual work required:
      
       - Fix up the actual hlist iterators in linux/list.h
       - Fix up the declaration of other iterators based on the hlist ones.
       - A very small amount of places were using the 'node' parameter, this
       was modified to use 'obj->member' instead.
       - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
       properly, so those had to be fixed up manually.
      
      The semantic patch which is mostly the work of Peter Senna Tschudin is here:
      
      @@
      iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;
      
      type T;
      expression a,c,d,e;
      identifier b;
      statement S;
      @@
      
      -T b;
          <+... when != b
      (
      hlist_for_each_entry(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_continue(a,
      - b,
      c) S
      |
      hlist_for_each_entry_from(a,
      - b,
      c) S
      |
      hlist_for_each_entry_rcu(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_rcu_bh(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_continue_rcu_bh(a,
      - b,
      c) S
      |
      for_each_busy_worker(a, c,
      - b,
      d) S
      |
      ax25_uid_for_each(a,
      - b,
      c) S
      |
      ax25_for_each(a,
      - b,
      c) S
      |
      inet_bind_bucket_for_each(a,
      - b,
      c) S
      |
      sctp_for_each_hentry(a,
      - b,
      c) S
      |
      sk_for_each(a,
      - b,
      c) S
      |
      sk_for_each_rcu(a,
      - b,
      c) S
      |
      sk_for_each_from
      -(a, b)
      +(a)
      S
      + sk_for_each_from(a) S
      |
      sk_for_each_safe(a,
      - b,
      c, d) S
      |
      sk_for_each_bound(a,
      - b,
      c) S
      |
      hlist_for_each_entry_safe(a,
      - b,
      c, d, e) S
      |
      hlist_for_each_entry_continue_rcu(a,
      - b,
      c) S
      |
      nr_neigh_for_each(a,
      - b,
      c) S
      |
      nr_neigh_for_each_safe(a,
      - b,
      c, d) S
      |
      nr_node_for_each(a,
      - b,
      c) S
      |
      nr_node_for_each_safe(a,
      - b,
      c, d) S
      |
      - for_each_gfn_sp(a, c, d, b) S
      + for_each_gfn_sp(a, c, d) S
      |
      - for_each_gfn_indirect_valid_sp(a, c, d, b) S
      + for_each_gfn_indirect_valid_sp(a, c, d) S
      |
      for_each_host(a,
      - b,
      c) S
      |
      for_each_host_safe(a,
      - b,
      c, d) S
      |
      for_each_mesh_entry(a,
      - b,
      c, d) S
      )
          ...+>
      
      [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
      [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
      [akpm@linux-foundation.org: checkpatch fixes]
      [akpm@linux-foundation.org: fix warnings]
      [akpm@linux-foundation.org: redo intrusive kvm changes]
      Tested-by: Peter Senna Tschudin <peter.senna@gmail.com>
      Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b67bfe0d
  15. 26 February 2013, 2 commits
    • vfs: kill FS_REVAL_DOT by adding a d_weak_revalidate dentry op · ecf3d1f1
      Committed by Jeff Layton
      The following set of operations on an NFS client and server will cause
      a spurious ESTALE error:
      
          server# mkdir a
          client# cd a
          server# mv a a.bak
          client# sleep 30  # (or whatever the dir attrcache timeout is)
          client# stat .
          stat: cannot stat `.': Stale NFS file handle
      
      Obviously, we should not be getting an ESTALE error back there since the
      inode still exists on the server. The problem is that the lookup code
      will call d_revalidate on the dentry that "." refers to, because NFS has
      FS_REVAL_DOT set.
      
      nfs_lookup_revalidate will see that the parent directory has changed and
      will try to reverify the dentry by redoing a LOOKUP. That of course
      fails, so the lookup code returns ESTALE.
      
      The problem here is that d_revalidate is really a bad fit for this case.
      What we really want to know at this point is whether the inode is still
      good or not, but we don't really care what name it goes by or whether
      the dcache is still valid.
      
      Add a new d_op->d_weak_revalidate operation and have complete_walk call
      that instead of d_revalidate. The intent there is to allow for a
      "weaker" d_revalidate that just checks to see whether the inode is still
      good. This also gives us an opportunity to kill off the FS_REVAL_DOT
      special casing.
      
      [AV: changed method name, added note in porting, fixed confusion re
      having it possibly called from RCU mode (it won't be)]
      
      Cc: NeilBrown <neilb@suse.de>
      Signed-off-by: Jeff Layton <jlayton@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      ecf3d1f1
    • d_hash_and_lookup(): export, switch open-coded instances · 4f522a24
      Committed by Al Viro
      * calling conventions change - ERR_PTR() is returned on ->d_hash() errors;
      NULL is just for dcache miss now.
      * exported, open-coded instances in ncpfs and cifs converted.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      4f522a24
  16. 23 February 2013, 4 commits
  17. 21 December 2012, 2 commits
    • vfs: d_obtain_alias() needs to use "/" as default name. · b911a6bd
      Committed by NeilBrown
      NFS appears to use d_obtain_alias() to create the root dentry rather than
      d_make_root.  This can cause 'prepend_path()' to complain that the root
      has a weird name if an NFS filesystem is lazily unmounted.  e.g.  if
      "/mnt" is an NFS mount then
      
       { cd /mnt; umount -l /mnt ; ls -l /proc/self/cwd; }
      
      will cause a WARN message like
         WARNING: at /home/git/linux/fs/dcache.c:2624 prepend_path+0x1d7/0x1e0()
         ...
         Root dentry has weird name <>
      
      to appear in kernel logs.
      
      So change d_obtain_alias() to use "/" rather than "" as the anonymous
      name.
      Signed-off-by: NeilBrown <neilb@suse.de>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      b911a6bd
    • vfs: remove DCACHE_NEED_LOOKUP · 39e3c955
      Committed by Jeff Layton
      The code that relied on that flag was ripped out of btrfs quite some
      time ago, and never added back. Josef indicated that he was going to
      take a different approach to the problem in btrfs, and that we
      could just eliminate this flag.
      
      Cc: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Jeff Layton <jlayton@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      39e3c955
  18. 30 September 2012, 1 commit
    • vfs: dcache: fix deadlock in tree traversal · 8110e16d
      Committed by Miklos Szeredi
      IBM reported a deadlock in select_parent().  This was found to be caused
      by taking rename_lock when already locked when restarting the tree
      traversal.
      
      There are two cases when the traversal needs to be restarted:
      
       1) concurrent d_move(); this can only happen when not already locked,
          since taking rename_lock protects against concurrent d_move().
      
       2) racing with final d_put() on child just at the moment of ascending
          to parent; rename_lock doesn't protect against this rare race, so it
          can happen when already locked.
      
      Because of case 2, we need to be able to handle restarting the traversal
      when rename_lock is already held.  This patch fixes all three callers of
      try_to_ascend().
      
      IBM reported that the deadlock is gone with this patch.
      
      [ I rewrote the patch to be smaller and just do the "goto again" if the
        lock was already held, but credit goes to Miklos for the real work.
         - Linus ]
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: stable@vger.kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8110e16d
  19. 28 September 2012, 1 commit
  20. 27 September 2012, 1 commit
  21. 19 September 2012, 1 commit
    • vfs: dcache: use DCACHE_DENTRY_KILLED instead of DCACHE_DISCONNECTED in d_kill() · b161dfa6
      Committed by Miklos Szeredi
      IBM reported a soft lockup after applying the fix for the rename_lock
      deadlock.  Commit c83ce989 ("VFS: Fix the nfs sillyrename regression
      in kernel 2.6.38") was found to be the culprit.
      
      The nfs sillyrename fix used DCACHE_DISCONNECTED to indicate that the
      dentry was killed.  This flag can be set on non-killed dentries too,
      which results in infinite retries when trying to traverse the dentry
      tree.
      
      This patch introduces a separate flag: DCACHE_DENTRY_KILLED, which is
      only set in d_kill() and makes try_to_ascend() test only this flag.
      
      IBM reported successful test results with this patch.
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b161dfa6
  22. 14 July 2012, 2 commits