1. 11 9月, 2013 8 次提交
    • D
      dentry: move to per-sb LRU locks · 19156840
      Dave Chinner 提交于
      With the dentry LRUs being per-sb structures, there is no real need for
      a global dentry_lru_lock. The locking can be made more fine-grained by
      moving to a per-sb LRU lock, isolating the LRU operations of different
      filesytsems completely from each other. The need for this is independent
      of any performance consideration that may arise: in the interest of
      abstracting the lru operations away, it is mandatory that each lru works
      around its own lock instead of a global lock for all of them.
      
      [glommer@openvz.org: updated changelog ]
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NGlauber Costa <glommer@openvz.org>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Cc: Arve Hjønnevåg <arve@android.com>
      Cc: Carlos Maiolino <cmaiolino@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: J. Bruce Fields <bfields@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Kent Overstreet <koverstreet@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Thomas Hellstrom <thellstrom@vmware.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      19156840
    • D
      dcache: convert dentry_stat.nr_unused to per-cpu counters · 62d36c77
      Dave Chinner 提交于
      Before we split up the dcache_lru_lock, the unused dentry counter needs to
      be made independent of the global dcache_lru_lock.  Convert it to per-cpu
      counters to do this.
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Signed-off-by: NGlauber Costa <glommer@openvz.org>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Cc: Arve Hjønnevåg <arve@android.com>
      Cc: Carlos Maiolino <cmaiolino@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: J. Bruce Fields <bfields@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Kent Overstreet <koverstreet@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Thomas Hellstrom <thellstrom@vmware.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      62d36c77
    • G
      super: fix calculation of shrinkable objects for small numbers · 55f841ce
      Glauber Costa 提交于
      The sysctl knob sysctl_vfs_cache_pressure is used to determine which
      percentage of the shrinkable objects in our cache we should actively try
      to shrink.
      
      It works great in situations in which we have many objects (at least more
      than 100), because the aproximation errors will be negligible.  But if
      this is not the case, specially when total_objects < 100, we may end up
      concluding that we have no objects at all (total / 100 = 0, if total <
      100).
      
      This is certainly not the biggest killer in the world, but may matter in
      very low kernel memory situations.
      Signed-off-by: NGlauber Costa <glommer@openvz.org>
      Reviewed-by: NCarlos Maiolino <cmaiolino@redhat.com>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Cc: Arve Hjønnevåg <arve@android.com>
      Cc: Carlos Maiolino <cmaiolino@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: J. Bruce Fields <bfields@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Kent Overstreet <koverstreet@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Thomas Hellstrom <thellstrom@vmware.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      55f841ce
    • G
      fs: bump inode and dentry counters to long · 3942c07c
      Glauber Costa 提交于
      This series reworks our current object cache shrinking infrastructure in
      two main ways:
      
       * Noticing that a lot of users copy and paste their own version of LRU
         lists for objects, we put some effort in providing a generic version.
         It is modeled after the filesystem users: dentries, inodes, and xfs
         (for various tasks), but we expect that other users could benefit in
         the near future with little or no modification.  Let us know if you
         have any issues.
      
       * The underlying list_lru being proposed automatically and
         transparently keeps the elements in per-node lists, and is able to
         manipulate the node lists individually.  Given this infrastructure, we
         are able to modify the up-to-now hammer called shrink_slab to proceed
         with node-reclaim instead of always searching memory from all over like
         it has been doing.
      
      Per-node lru lists are also expected to lead to less contention in the lru
      locks on multi-node scans, since we are now no longer fighting for a
      global lock.  The locks usually disappear from the profilers with this
      change.
      
      Although we have no official benchmarks for this version - be our guest to
      independently evaluate this - earlier versions of this series were
      performance tested (details at
      http://permalink.gmane.org/gmane.linux.kernel.mm/100537) yielding no
      visible performance regressions while yielding a better qualitative
      behavior in NUMA machines.
      
      With this infrastructure in place, we can use the list_lru entry point to
      provide memcg isolation and per-memcg targeted reclaim.  Historically,
      those two pieces of work have been posted together.  This version presents
      only the infrastructure work, deferring the memcg work for a later time,
      so we can focus on getting this part tested.  You can see more about the
      history of such work at http://lwn.net/Articles/552769/
      
      Dave Chinner (18):
        dcache: convert dentry_stat.nr_unused to per-cpu counters
        dentry: move to per-sb LRU locks
        dcache: remove dentries from LRU before putting on dispose list
        mm: new shrinker API
        shrinker: convert superblock shrinkers to new API
        list: add a new LRU list type
        inode: convert inode lru list to generic lru list code.
        dcache: convert to use new lru list infrastructure
        list_lru: per-node list infrastructure
        shrinker: add node awareness
        fs: convert inode and dentry shrinking to be node aware
        xfs: convert buftarg LRU to generic code
        xfs: rework buffer dispose list tracking
        xfs: convert dquot cache lru to list_lru
        fs: convert fs shrinkers to new scan/count API
        drivers: convert shrinkers to new count/scan API
        shrinker: convert remaining shrinkers to count/scan API
        shrinker: Kill old ->shrink API.
      
      Glauber Costa (7):
        fs: bump inode and dentry counters to long
        super: fix calculation of shrinkable objects for small numbers
        list_lru: per-node API
        vmscan: per-node deferred work
        i915: bail out earlier when shrinker cannot acquire mutex
        hugepage: convert huge zero page shrinker to new shrinker API
        list_lru: dynamically adjust node arrays
      
      This patch:
      
      There are situations in very large machines in which we can have a large
      quantity of dirty inodes, unused dentries, etc.  This is particularly true
      when umounting a filesystem, where eventually since every live object will
      eventually be discarded.
      
      Dave Chinner reported a problem with this while experimenting with the
      shrinker revamp patchset.  So we believe it is time for a change.  This
      patch just moves int to longs.  Machines where it matters should have a
      big long anyway.
      Signed-off-by: NGlauber Costa <glommer@openvz.org>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Cc: Arve Hjønnevåg <arve@android.com>
      Cc: Carlos Maiolino <cmaiolino@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: J. Bruce Fields <bfields@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Kent Overstreet <koverstreet@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Thomas Hellstrom <thellstrom@vmware.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      3942c07c
    • D
      da5338c7
    • A
      bcce56d5
    • C
      fs: remove vfs_follow_link · aac34df1
      Christoph Hellwig 提交于
      For a long time no filesystem has been using vfs_follow_link, and as seen
      by recent filesystem submissions any new use is accidental as well.
      
      Remove vfs_follow_link, document the replacement in
      Documentation/filesystems/porting and also rename __vfs_follow_link
      to match its only caller better.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      aac34df1
    • L
      vfs: make sure we don't have a stale root path if unlazy_walk() fails · d0d27277
      Linus Torvalds 提交于
      When I moved the RCU walk termination into unlazy_walk(), I didn't copy
      quite all of it: for the successful RCU termination we properly add the
      necessary reference counts to our temporary copy of the root path, but
      for the failure case we need to make sure that any temporary root path
      information is cleared out (since it does _not_ have the proper
      reference counts from the RCU lookup).
      
      We could clean up this mess by just always dropping the temporary root
      information, but Al points out that that would mean that a single lookup
      through symlinks could see multiple different root entries if it races
      with another thread doing chroot.  Not that I think we should really
      care (we had that before too, back before we had a copy of the root path
      in the nameidata).
      
      Al says he has a cunning plan.  In the meantime, this is the minimal fix
      for the problem, even if it's not all that pretty.
      Reported-by: NMace Moneta <moneta.mace@gmail.com>
      Acked-by: NAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d0d27277
  2. 10 9月, 2013 3 次提交
    • A
      split read_seqretry_or_unlock(), convert d_walk() to resulting primitives · 48f5ec21
      Al Viro 提交于
      Separate "check if we need to retry" from "unlock if we are done and
      had seq_writelock"; that allows to use these guys in d_walk(), where
      we need to recheck every time we ascend back to parent, but do *not*
      want to unlock until the very end.  Lift rcu_read_lock/rcu_read_unlock
      out into callers.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      48f5ec21
    • O
      direct-io: Use return from cmpxchg to decide of assignment happened · 45150c43
      Olof Johansson 提交于
      Not using the return value can in the generic case be racy, so it's
      in general good practice to check the return value instead.
      
      This also resolved the warning caused on ARM and other architectures:
      
        fs/direct-io.c: In function 'sb_init_dio_done_wq':
        fs/direct-io.c:557:2: warning: value computed is not used [-Wunused-value]
      Signed-off-by: NOlof Johansson <olof@lixom.net>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: H Peter Anvin <hpa@zytor.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      45150c43
    • W
      dcache: Translating dentry into pathname without taking rename_lock · 232d2d60
      Waiman Long 提交于
      When running the AIM7's short workload, Linus' lockref patch eliminated
      most of the spinlock contention. However, there were still some left:
      
           8.46%     reaim  [kernel.kallsyms]     [k] _raw_spin_lock
                       |--42.21%-- d_path
                       |          proc_pid_readlink
                       |          SyS_readlinkat
                       |          SyS_readlink
                       |          system_call
                       |          __GI___readlink
                       |
                       |--40.97%-- sys_getcwd
                       |          system_call
                       |          __getcwd
      
      The big one here is the rename_lock (seqlock) contention in d_path()
      and the getcwd system call. This patch will eliminate the need to take
      the rename_lock while translating dentries into the full pathnames.
      
      The need to take the rename_lock is to make sure that no rename
      operation can be ongoing while the translation is in progress. However,
      only one thread can take the rename_lock thus blocking all the other
      threads that need it even though the translation process won't make
      any change to the dentries.
      
      This patch will replace the writer's write_seqlock/write_sequnlock
      sequence of the rename_lock of the callers of the prepend_path() and
      __dentry_path() functions with the reader's read_seqbegin/read_seqretry
      sequence within these 2 functions. As a result, the code will have to
      retry if one or more rename operations had been performed. In addition,
      RCU read lock will be taken during the translation process to make sure
      that no dentries will go away. To prevent live-lock from happening,
      the code will switch back to take the rename_lock if read_seqretry()
      fails for three times.
      
      To further reduce spinlock contention, this patch does not take the
      dentry's d_lock when copying the filename from the dentries. Instead,
      it treats the name pointer and length as unreliable and just copy
      the string byte-by-byte over until it hits a null byte or the end of
      string as specified by the length. This should avoid stepping into
      invalid memory address. The error cases are left to be handled by
      the sequence number check.
      
      The following code re-factoring are also made:
      1. Move prepend('/') into prepend_name() to remove one conditional
         check.
      2. Move the global root check in prepend_path() back to the top of
         the while loop.
      
      With this patch, the _raw_spin_lock will now account for only 1.2%
      of the total CPU cycles for the short workload. This patch also has
      the effect of reducing the effect of running perf on its profile
      since the perf command itself can be a heavy user of the d_path()
      function depending on the complexity of the workload.
      
      When taking the perf profile of the high-systime workload, the amount
      of spinlock contention contributed by running perf without this patch
      was about 16%. With this patch, the spinlock contention caused by
      the running of perf will go away and we will have a more accurate
      perf profile.
      Signed-off-by: NWaiman Long <Waiman.Long@hp.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      232d2d60
  3. 09 9月, 2013 7 次提交
    • I
      autofs4 - fix device ioctl mount lookup · ac838719
      Ian Kent 提交于
      When reconnecting to automounts at startup an autofs ioctl is used
      to find the device and inode of existing mounts so they can be used
      to open a file descriptor of possibly covered mounts.
      
      At this time the the caller might not yet "own" the mount so it can
      trigger calling ->d_automount(). This causes automount to hang when
      trying to reconnect to direct or offset mount types.
      
      Consequently kern_path() can't be used but kern_path_mountpoint() can be.
      Signed-off-by: NIan Kent <raven@themaw.net>
      Cc: Jeff Layton <jlayton@redhat.com>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      ac838719
    • L
      vfs: fix dentry RCU to refcounting possibly sleeping dput() · e5c832d5
      Linus Torvalds 提交于
      This is the fix that the last two commits indirectly led up to - making
      sure that we don't call dput() in a bad context on the dentries we've
      looked up in RCU mode after the sequence count validation fails.
      
      This basically expands d_rcu_to_refcount() into the callers, and then
      fixes the callers to delay the dput() in the failure case until _after_
      we've dropped all locks and are no longer in an RCU-locked region.
      
      The case of 'complete_walk()' was trivial, since its failure case did
      the unlock_rcu_walk() directly after the call to d_rcu_to_refcount(),
      and as such that is just a pure expansion of the function with a trivial
      movement of the resulting dput() to after 'unlock_rcu_walk()'.
      
      In contrast, the unlazy_walk() case was much more complicated, because
      not only does convert two different dentries from RCU to be reference
      counted, but it used to not call unlock_rcu_walk() at all, and instead
      just returned an error and let the caller clean everything up in
      "terminate_walk()".
      
      Happily, one of the dentries in question (called "parent" inside
      unlazy_walk()) is the dentry of "nd->path", which terminate_walk() wants
      a refcount to anyway for the non-RCU case.
      
      So what the new and improved unlazy_walk() does is to first turn that
      dentry into a refcounted one, and once that is set up, the error cases
      can continue to use the terminate_walk() helper for cleanup, but for the
      non-RCU case.  Which makes it possible to drop out of RCU mode if we
      actually hit the sequence number failure case.
      Acked-by: NAl Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e5c832d5
    • A
      introduce kern_path_mountpoint() · 2d864651
      Al Viro 提交于
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      2d864651
    • A
      rename user_path_umountat() to user_path_mountpoint_at() · 197df04c
      Al Viro 提交于
      ... and move the extern from linux/namei.h to fs/internal.h,
      along with that of vfs_path_lookup().
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      197df04c
    • A
      take unlazy_walk() into umount_lookup_last() · 35759521
      Al Viro 提交于
      ... and massage it a bit to reduce nesting
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      35759521
    • L
      vfs: use lockred "dead" flag to mark unrecoverably dead dentries · 0d98439e
      Linus Torvalds 提交于
      This simplifies the RCU to refcounting code in particular.
      
      I was originally intending to leave this for later, but walking through
      all the dput() logic (see previous commit), I realized that the dput()
      "might_sleep()" check was misleadingly weak.  And I removed it as
      misleading, both for performance profiling and for debugging.
      
      However, the might_sleep() debugging case is actually true: the final
      dput() can indeed sleep, if the inode of the dentry that you are
      releasing ends up sleeping at iput time (see dentry_iput()).  So the
      problem with the might_sleep() in dput() wasn't that it wasn't true, it
      was that it wasn't actually testing and triggering on the interesting
      case.
      
      In particular, just about *any* dput() can indeed sleep, if you happen
      to race with another thread deleting the file in question, and you then
      lose the race to the be the last dput() for that file.  But because it's
      a very rare race, the debugging code would never trigger it in practice.
      
      Why is this problematic? The new d_rcu_to_refcount() (see commit
      15570086: "vfs: reimplement d_rcu_to_refcount() using
      lockref_get_or_lock()") does a dput() for the failure case, and it does
      it under the RCU lock.  So potentially sleeping really is a bug.
      
      But there's no way I'm going to fix this with the previous complicated
      "lockref_get_or_lock()" interface.  And rather than revert to the old
      and crufty nested dentry locking code (which did get this right by
      delaying the reference count updates until they were verified to be
      safe), let's make forward progress.
      
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0d98439e
    • L
      vfs: reorganize dput() memory accesses · 8aab6a27
      Linus Torvalds 提交于
      This is me being a bit OCD after all the dentry optimization work this
      merge window: profiles end up showing 'dput()' as a rather expensive
      operation, and there were two unrelated bad reasons for that.
      
      The first reason was reading d_lockref.count for debugging purposes,
      which touches the lockref cacheline (for reads) before really need to.
      More importantly, the debugging test in question is _wrong_, and has
      hidden bugs.  It's true that we can only sleep when the count goes down
      to zero, but the test as-is hides the much more subtle bug that happens
      if we race with somebody else deleting the file.
      
      Anyway we _will_ touch that cacheline, but let's do it for a write and
      in the right routine (ie in "lockref_put_or_lock()") which annotates the
      costs better.  So remove the misleading debug code.
      
      The other was an unnecessary access to the cacheline that contains the
      d_lru list, just to check whether we already were on the LRU list or
      not.  This is exactly what we have d_flags for, so that we can avoid
      touching extra cache lines for the common case.  So just add another bit
      for "is this dentry on the LRU".
      
      Finally, mark the tests properly likely/unlikely, so that the common
      fast-paths are dense in the instruction stream.
      
      This makes the profiles look much saner.
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8aab6a27
  4. 08 9月, 2013 10 次提交
  5. 07 9月, 2013 10 次提交
    • R
      um: hostfs: Fix writeback · 65984ff9
      Richard Weinberger 提交于
      We have to implement ->release() and trigger writeback from it.
      Otherwise we might lose dirty pages at munmap().
      Signed-off-by: NRichard Weinberger <richard@nod.at>
      65984ff9
    • Y
      ceph: use d_invalidate() to invalidate aliases · a8d436f0
      Yan, Zheng 提交于
      d_invalidate() is the standard VFS method to invalidate dentry.
      compare to d_delete(), it also try shrinking children dentries.
      Signed-off-by: NYan, Zheng <zheng.z.yan@intel.com>
      Reviewed-by: NSage Weil <sage@inktank.com>
      a8d436f0
    • Y
      ceph: remove ceph_lookup_inode() · ed284c49
      Yan, Zheng 提交于
      commit 6f60f889 (ceph: fix freeing inode vs removing session caps race)
      introduced ceph_lookup_inode(). But there is already a ceph_find_inode()
      which provides similar function. So remove ceph_lookup_inode(), use
      ceph_find_inode() instead.
      Signed-off-by: NYan, Zheng <zheng.z.yan@intel.com>
      Reviewed-by: NAlex Elder <alex.elder@linary.org>
      Reviewed-by: NSage Weil <sage@inktank.com>
      ed284c49
    • A
      NFSv4.1 Use MDS auth flavor for data server connection · 0e20162e
      Andy Adamson 提交于
      Commit 4edaa308 "NFS: Use "krb5i" to establish NFSv4 state whenever possible"
      uses the nfs_client cl_rpcclient for all state management operations, and
      will use krb5i or auth_sys with no regard to the mount command authflavor
      choice.
      
      The MDS, as any NFSv4.1 mount point, uses the nfs_server rpc client for all
      non-state management operations with a different nfs_server for each fsid
      encountered traversing the mount point, each with a potentially different
      auth flavor.
      
      pNFS data servers are not mounted in the normal sense as there is no associated
      nfs_server structure. Data servers can also export multiple fsids, each with
      a potentially different auth flavor.
      
      Data servers need to use the same authflavor as the MDS server rpc client for
      non-state management operations. Populate a list of rpc clients with the MDS
      server rpc client auth flavor for the DS to use.
      Signed-off-by: NAndy Adamson <andros@netapp.com>
      Signed-off-by: NTrond Myklebust <Trond.Myklebust@netapp.com>
      0e20162e
    • M
      ceph: trivial buildbot warnings fix · 971f0bde
      Milosz Tanski 提交于
      The linux-next build bot found a three of warnings, this addresses all of them.
      
       * non-ANSI function declaration of function 'ceph_fscache_register' and
         'ceph_fscache_unregister'
       * symbol 'ceph_cache_netfs' was not declared, now it's extern in the header.
       * warning: "pr_fmt" redefined
      Signed-off-by: NMilosz Tanski <milosz@adfin.com>
      971f0bde
    • M
      ceph: Do not do invalidate if the filesystem is mounted nofsc · e81568eb
      Milosz Tanski 提交于
      Previously we would always try to enqueue work even if the filesystem is not
      mounted with fscache enabled (or the file has no cookie). In the case of the
      filesystem mouned nofsc (but with fscache compiled in) this would lead to a
      crash.
      Signed-off-by: NMilosz Tanski <milosz@adfin.com>
      e81568eb
    • M
      ceph: page still marked private_2 · d4d3aa38
      Milosz Tanski 提交于
      Previous patch that allowed us to cleanup most of the issues with pages marked
      as private_2 when calling ceph_readpages. However, there seams to be a case in
      the error case clean up in start read that still trigers this from time to
      time. I've only seen this one a couple times.
      
      BUG: Bad page state in process petabucket  pfn:335b82
      page:ffffea000cd6e080 count:0 mapcount:0 mapping:          (null) index:0x0
      page flags: 0x200000000001000(private_2)
      Call Trace:
       [<ffffffff81563442>] dump_stack+0x46/0x58
       [<ffffffff8112c7f7>] bad_page+0xc7/0x120
       [<ffffffff8112cd9e>] free_pages_prepare+0x10e/0x120
       [<ffffffff8112e580>] free_hot_cold_page+0x40/0x160
       [<ffffffff81132427>] __put_single_page+0x27/0x30
       [<ffffffff81132d95>] put_page+0x25/0x40
       [<ffffffffa02cb409>] ceph_readpages+0x2e9/0x6f0 [ceph]
       [<ffffffff811313cf>] __do_page_cache_readahead+0x1af/0x260
      Signed-off-by: NMilosz Tanski <milosz@adfin.com>
      Signed-off-by: NSage Weil <sage@inktank.com>
      d4d3aa38
    • M
      ceph: ceph_readpage_to_fscache didn't check if marked · 9b8dd1e8
      Milosz Tanski 提交于
      Previously ceph_readpage_to_fscache did not call if page was marked as cached
      before calling fscache_write_page resulting in a BUG inside of fscache.
      
      FS-Cache: Assertion failed
      ------------[ cut here ]------------
      kernel BUG at fs/fscache/page.c:874!
      invalid opcode: 0000 [#1] SMP
      Call Trace:
       [<ffffffffa02e6566>] __ceph_readpage_to_fscache+0x66/0x80 [ceph]
       [<ffffffffa02caf84>] readpage_nounlock+0x124/0x210 [ceph]
       [<ffffffffa02cb08d>] ceph_readpage+0x1d/0x40 [ceph]
       [<ffffffff81126db6>] generic_file_aio_read+0x1f6/0x700
       [<ffffffffa02c6fcc>] ceph_aio_read+0x5fc/0xab0 [ceph]
      Signed-off-by: NMilosz Tanski <milosz@adfin.com>
      Signed-off-by: NSage Weil <sage@inktank.com>
      9b8dd1e8
    • M
      ceph: clean PgPrivate2 on returning from readpages · 76be778b
      Milosz Tanski 提交于
      In some cases the ceph readapages code code bails without filling all the pages
      already marked by fscache. When we return back to readahead code this causes
      a BUG.
      Signed-off-by: NMilosz Tanski <milosz@adfin.com>
      76be778b
    • M
      ceph: use fscache as a local presisent cache · 99ccbd22
      Milosz Tanski 提交于
      Adding support for fscache to the Ceph filesystem. This would bring it to on
      par with some of the other network filesystems in Linux (like NFS, AFS, etc...)
      
      In order to mount the filesystem with fscache the 'fsc' mount option must be
      passed.
      Signed-off-by: NMilosz Tanski <milosz@adfin.com>
      Signed-off-by: NSage Weil <sage@inktank.com>
      99ccbd22
  6. 06 9月, 2013 2 次提交
    • T
      NFS: Don't check lock owner compatability unless file is locked (part 2) · 4109bb74
      Trond Myklebust 提交于
      When coalescing requests into a single READ or WRITE RPC call, and there
      is no file locking involved, we don't have to refuse coalescing for
      requests where the lock owner information doesn't match.
      Reported-by: NJeff Layton <jlayton@redhat.com>
      Signed-off-by: NTrond Myklebust <Trond.Myklebust@netapp.com>
      4109bb74
    • M
      fscache: Netfs function for cleanup post readpages · 5a6f282a
      Milosz Tanski 提交于
      Currently the fscache code expect the netfs to call fscache_readpages_or_alloc
      inside the aops readpages callback.  It marks all the pages in the list
      provided by readahead with PG_private_2.  In the cases that the netfs fails to
      read all the pages (which is legal) it ends up returning to the readahead and
      triggering a BUG.  This happens because the page list still contains marked
      pages.
      
      This patch implements a simple fscache_readpages_cancel function that the netfs
      should call before returning from readpages.  It will revoke the pages from the
      underlying cache backend and unmark them.
      
      The problem was originally worked out in the Ceph devel tree, but it also
      occurs in CIFS.  It appears that NFS, AFS and 9P are okay as read_cache_pages()
      will clean up the unprocessed pages in the case of an error.
      
      This can be used to address the following oops:
      
      [12410647.597278] BUG: Bad page state in process petabucket  pfn:3d504e
      [12410647.597292] page:ffffea000f541380 count:0 mapcount:0 mapping:
      	(null) index:0x0
      [12410647.597298] page flags: 0x200000000001000(private_2)
      
      ...
      
      [12410647.597334] Call Trace:
      [12410647.597345]  [<ffffffff815523f2>] dump_stack+0x19/0x1b
      [12410647.597356]  [<ffffffff8111def7>] bad_page+0xc7/0x120
      [12410647.597359]  [<ffffffff8111e49e>] free_pages_prepare+0x10e/0x120
      [12410647.597361]  [<ffffffff8111fc80>] free_hot_cold_page+0x40/0x170
      [12410647.597363]  [<ffffffff81123507>] __put_single_page+0x27/0x30
      [12410647.597365]  [<ffffffff81123df5>] put_page+0x25/0x40
      [12410647.597376]  [<ffffffffa02bdcf9>] ceph_readpages+0x2e9/0x6e0 [ceph]
      [12410647.597379]  [<ffffffff81122a8f>] __do_page_cache_readahead+0x1af/0x260
      [12410647.597382]  [<ffffffff81122ea1>] ra_submit+0x21/0x30
      [12410647.597384]  [<ffffffff81118f64>] filemap_fault+0x254/0x490
      [12410647.597387]  [<ffffffff8113a74f>] __do_fault+0x6f/0x4e0
      [12410647.597391]  [<ffffffff810125bd>] ? __switch_to+0x16d/0x4a0
      [12410647.597395]  [<ffffffff810865ba>] ? finish_task_switch+0x5a/0xc0
      [12410647.597398]  [<ffffffff8113d856>] handle_pte_fault+0xf6/0x930
      [12410647.597401]  [<ffffffff81008c33>] ? pte_mfn_to_pfn+0x93/0x110
      [12410647.597403]  [<ffffffff81008cce>] ? xen_pmd_val+0xe/0x10
      [12410647.597405]  [<ffffffff81005469>] ? __raw_callee_save_xen_pmd_val+0x11/0x1e
      [12410647.597407]  [<ffffffff8113f361>] handle_mm_fault+0x251/0x370
      [12410647.597411]  [<ffffffff812b0ac4>] ? call_rwsem_down_read_failed+0x14/0x30
      [12410647.597414]  [<ffffffff8155bffa>] __do_page_fault+0x1aa/0x550
      [12410647.597418]  [<ffffffff8108011d>] ? up_write+0x1d/0x20
      [12410647.597422]  [<ffffffff8113141c>] ? vm_mmap_pgoff+0xbc/0xe0
      [12410647.597425]  [<ffffffff81143bb8>] ? SyS_mmap_pgoff+0xd8/0x240
      [12410647.597427]  [<ffffffff8155c3ae>] do_page_fault+0xe/0x10
      [12410647.597431]  [<ffffffff81558818>] page_fault+0x28/0x30
      Signed-off-by: NMilosz Tanski <milosz@adfin.com>
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      5a6f282a