1. 11 Sep, 2013 10 commits
    • list_lru: per-node list infrastructure · 3b1d58a4
      Dave Chinner committed
      Now that we have an LRU list API, we can start to enhance the
      implementation.  This splits the single LRU list into per-node lists and
      locks to enhance scalability.  Items are placed on lists according to the
      node the memory belongs to.  To make scanning the lists efficient, also
      track whether the per-node lists have entries in them in an active
      nodemask.
      
      Note: We use a fixed-size array for the node LRU, so this struct can be very
      big if MAX_NUMNODES is big.  If this becomes a problem, it is fixable by
      turning the array into a pointer and dynamically allocating it to
      nr_node_ids.  This quantity is firmware-provided, and still would provide
      room for all nodes at the cost of a pointer lookup and an extra
      allocation.  Because that allocation will most likely come from a may very
      well fail.
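
      The layout described above can be sketched in userspace C.  This is
      only an illustration under stated assumptions: SKETCH_MAX_NODES, struct
      lru_sketch and the lru_* helpers are hypothetical names, the node id is
      passed in by the caller instead of being derived from the item's
      memory, the per-node locks are left out, and a plain bitmask stands in
      for the active nodemask.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_MAX_NODES 4  /* stand-in for MAX_NUMNODES (hypothetical value) */

struct list_head { struct list_head *prev, *next; };

struct lru_node {
    struct list_head list;  /* items whose memory lives on this node */
    long nr_items;          /* kernel code would pair this with a per-node lock */
};

struct lru_sketch {
    struct lru_node node[SKETCH_MAX_NODES]; /* fixed-size array, as in the note */
    unsigned long active_nodes;             /* bit n set: node n is non-empty */
};

void list_init(struct list_head *h) { h->prev = h->next = h; }

void lru_init(struct lru_sketch *lru)
{
    lru->active_nodes = 0;
    for (int nid = 0; nid < SKETCH_MAX_NODES; nid++) {
        list_init(&lru->node[nid].list);
        lru->node[nid].nr_items = 0;
    }
}

/* Queue an item on the list of node nid; false if it is already queued. */
bool lru_add(struct lru_sketch *lru, struct list_head *item, int nid)
{
    assert(nid >= 0 && nid < SKETCH_MAX_NODES);
    struct lru_node *n = &lru->node[nid];

    if (item->next != item)
        return false;
    item->prev = n->list.prev;
    item->next = &n->list;
    n->list.prev->next = item;
    n->list.prev = item;
    if (n->nr_items++ == 0)
        lru->active_nodes |= 1UL << nid;    /* node just became non-empty */
    return true;
}

/* Remove an item from the list of node nid; false if it was not queued. */
bool lru_del(struct lru_sketch *lru, struct list_head *item, int nid)
{
    assert(nid >= 0 && nid < SKETCH_MAX_NODES);
    struct lru_node *n = &lru->node[nid];

    if (item->next == item)
        return false;
    item->prev->next = item->next;
    item->next->prev = item->prev;
    list_init(item);
    if (--n->nr_items == 0)
        lru->active_nodes &= ~(1UL << nid); /* node just became empty */
    return true;
}

/* Count all items, skipping nodes the mask says are empty. */
long lru_count(const struct lru_sketch *lru)
{
    long total = 0;

    for (int nid = 0; nid < SKETCH_MAX_NODES; nid++)
        if (lru->active_nodes & (1UL << nid))
            total += lru->node[nid].nr_items;
    return total;
}
```

      Items must be initialised with list_init() before their first
      lru_add().  The mask lets a scan skip empty nodes without reading
      their list heads.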
      
      [glommer@openvz.org: fix warnings, added note about node lru]
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Glauber Costa <glommer@openvz.org>
      Reviewed-by: Greg Thelen <gthelen@google.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Cc: Arve Hjønnevåg <arve@android.com>
      Cc: Carlos Maiolino <cmaiolino@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: J. Bruce Fields <bfields@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Kent Overstreet <koverstreet@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Thomas Hellstrom <thellstrom@vmware.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • dcache: convert to use new lru list infrastructure · f6041567
      Dave Chinner committed
      [glommer@openvz.org: don't reintroduce double decrement of nr_unused_dentries, adapted for new LRU return codes]
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Glauber Costa <glommer@openvz.org>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Cc: Arve Hjønnevåg <arve@android.com>
      Cc: Carlos Maiolino <cmaiolino@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: J. Bruce Fields <bfields@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Kent Overstreet <koverstreet@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Thomas Hellstrom <thellstrom@vmware.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • inode: convert inode lru list to generic lru list code. · bc3b14cb
      Dave Chinner committed
      [glommer@openvz.org: adapted for new LRU return codes]
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Glauber Costa <glommer@openvz.org>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Cc: Arve Hjønnevåg <arve@android.com>
      Cc: Carlos Maiolino <cmaiolino@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: J. Bruce Fields <bfields@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Kent Overstreet <koverstreet@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Thomas Hellstrom <thellstrom@vmware.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • list: add a new LRU list type · a38e4082
      Dave Chinner committed
      Several subsystems use the same construct for LRU lists - a list head, a
      spin lock and an item count.  They also use exactly the same code for
      adding and removing items from the LRU.  Create a generic type for these
      LRU lists.
      
      This is the beginning of generic, node aware LRUs for shrinkers to work
      with.
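
      As a rough userspace sketch of such a generic type (not the code added
      by this patch): a list head plus an item count, with a walk helper
      that lets the caller decide per item what happens to it.  All names
      here are illustrative assumptions, and the spin lock is omitted since
      the sketch is single-threaded.

```c
#include <assert.h>
#include <stddef.h>

struct list_head { struct list_head *prev, *next; };

/* What the callback decided for one item (illustrative names). */
enum lru_status { LRU_REMOVED, LRU_ROTATE, LRU_SKIP };

struct simple_lru {
    struct list_head list;  /* head side is the least recently used end */
    long nr_items;          /* kernel code would guard both with a spinlock */
};

typedef enum lru_status (*lru_walk_cb)(struct list_head *item, void *arg);

void list_init(struct list_head *h) { h->prev = h->next = h; }

static void list_unlink(struct list_head *e)
{
    e->prev->next = e->next;
    e->next->prev = e->prev;
    list_init(e);
}

static void list_add_tail(struct list_head *e, struct list_head *head)
{
    e->prev = head->prev;
    e->next = head;
    head->prev->next = e;
    head->prev = e;
}

void simple_lru_init(struct simple_lru *lru)
{
    list_init(&lru->list);
    lru->nr_items = 0;
}

/* Returns 1 if the item was queued, 0 if it was already on a list. */
int simple_lru_add(struct simple_lru *lru, struct list_head *item)
{
    if (item->next != item)
        return 0;
    list_add_tail(item, &lru->list);
    lru->nr_items++;
    return 1;
}

/* Visit up to nr_to_walk items from the LRU end; returns how many were removed. */
long simple_lru_walk(struct simple_lru *lru, lru_walk_cb cb, void *arg,
                     long nr_to_walk)
{
    /* Cap the budget at the current length so rotated items are not revisited. */
    long budget = nr_to_walk < lru->nr_items ? nr_to_walk : lru->nr_items;
    struct list_head *pos = lru->list.next;
    long removed = 0;

    while (budget-- > 0) {
        struct list_head *next = pos->next;  /* saved before pos is moved */

        switch (cb(pos, arg)) {
        case LRU_REMOVED:
            list_unlink(pos);
            lru->nr_items--;
            removed++;
            break;
        case LRU_ROTATE:
            list_unlink(pos);
            list_add_tail(pos, &lru->list);
            break;
        case LRU_SKIP:
            break;
        }
        pos = next;
    }
    return removed;
}

/* Example user: an object that embeds its LRU linkage. */
struct cached_obj {
    struct list_head lru;
    int in_use;
};

/* Drop idle objects, give busy ones another pass through the list. */
enum lru_status reclaim_idle(struct list_head *item, void *arg)
{
    struct cached_obj *obj =
        (struct cached_obj *)((char *)item - offsetof(struct cached_obj, lru));

    (void)arg;
    return obj->in_use ? LRU_ROTATE : LRU_REMOVED;
}
```

      Keeping the add, delete and walk logic in one place is what lets each
      subsystem drop its private copy of the same code.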
      
      [glommer@openvz.org: enum defined constants for lru. Suggested by gthelen, don't relock over retry]
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Glauber Costa <glommer@openvz.org>
      Reviewed-by: Greg Thelen <gthelen@google.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Cc: Arve Hjønnevåg <arve@android.com>
      Cc: Carlos Maiolino <cmaiolino@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: J. Bruce Fields <bfields@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Kent Overstreet <koverstreet@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Thomas Hellstrom <thellstrom@vmware.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • shrinker: convert superblock shrinkers to new API · 0a234c6d
      Dave Chinner committed
      Convert superblock shrinker to use the new count/scan API, and propagate
      the API changes through to the filesystem callouts.  The filesystem
      callouts already use a count/scan API, so it's just changing counters to
      longs to match the VM API.
      
      This requires the dentry and inode shrinker callouts to be converted to
      the count/scan API.  This is mainly a mechanical change.
      
      [glommer@openvz.org: use mult_frac for fractional proportions, build fixes]
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Glauber Costa <glommer@openvz.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Cc: Arve Hjønnevåg <arve@android.com>
      Cc: Carlos Maiolino <cmaiolino@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: J. Bruce Fields <bfields@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Kent Overstreet <koverstreet@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Thomas Hellstrom <thellstrom@vmware.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • mm: new shrinker API · 24f7c6b9
      Dave Chinner committed
      The current shrinker callout API uses a single shrinker call for
      multiple functions.  To determine the function, a special magical value is
      passed in a parameter to change the behaviour.  This complicates the
      implementation and return value specification for the different
      behaviours.
      
      Separate the two different behaviours into separate operations, one to
      return a count of freeable objects in the cache, and another to scan a
      certain number of objects in the cache for freeing.  In defining these new
      operations, ensure the return values and resultant behaviours are clearly
      defined and documented.
      
      Modify shrink_slab() to use the new API and implement the callouts for all
      the existing shrinkers.
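
      The split into a count operation and a scan operation might look
      roughly like this in a userspace model.  The struct and function names
      (cache_shrinker, shrink_request, demo_cache, shrink_cache) are
      hypothetical and the driver loop is a heavy simplification, not the
      actual shrink_slab() logic.

```c
#include <assert.h>

struct shrink_request {
    unsigned long nr_to_scan;   /* how many objects this call may scan */
};

struct cache_shrinker {
    /* Report how many objects could be freed; must not free anything. */
    unsigned long (*count_objects)(struct cache_shrinker *s);
    /* Scan up to req->nr_to_scan objects; return how many were freed. */
    unsigned long (*scan_objects)(struct cache_shrinker *s,
                                  struct shrink_request *req);
    void *private_data;
};

/* A toy cache in which every freeable object can be freed immediately. */
struct demo_cache {
    unsigned long nr_freeable;
};

unsigned long demo_count(struct cache_shrinker *s)
{
    struct demo_cache *c = s->private_data;

    return c->nr_freeable;
}

unsigned long demo_scan(struct cache_shrinker *s, struct shrink_request *req)
{
    struct demo_cache *c = s->private_data;
    unsigned long freed = req->nr_to_scan < c->nr_freeable
                        ? req->nr_to_scan : c->nr_freeable;

    c->nr_freeable -= freed;
    return freed;
}

/* Simplified caller: count first, then scan in batches up to nr_wanted. */
unsigned long shrink_cache(struct cache_shrinker *s, unsigned long nr_wanted,
                           unsigned long batch)
{
    unsigned long total = s->count_objects(s);
    unsigned long freed = 0;

    if (total > nr_wanted)
        total = nr_wanted;

    while (total > 0) {
        struct shrink_request req = {
            .nr_to_scan = total < batch ? total : batch,
        };
        unsigned long n = s->scan_objects(s, &req);

        if (n == 0)
            break;              /* no progress, stop early */
        freed += n;
        total -= req.nr_to_scan;
    }
    return freed;
}
```

      With two callbacks, each return value has a single meaning: one is
      always a count of freeable objects, the other always a number of
      objects freed.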
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Glauber Costa <glommer@parallels.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Cc: Arve Hjønnevåg <arve@android.com>
      Cc: Carlos Maiolino <cmaiolino@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: J. Bruce Fields <bfields@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Kent Overstreet <koverstreet@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Thomas Hellstrom <thellstrom@vmware.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • dentry: move to per-sb LRU locks · 19156840
      Dave Chinner committed
      With the dentry LRUs being per-sb structures, there is no real need for
      a global dentry_lru_lock. The locking can be made more fine-grained by
      moving to a per-sb LRU lock, isolating the LRU operations of different
      filesystems completely from each other. The need for this is independent
      of any performance consideration that may arise: in the interest of
      abstracting the lru operations away, it is mandatory that each lru works
      around its own lock instead of a global lock for all of them.
      
      [glommer@openvz.org: updated changelog ]
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Glauber Costa <glommer@openvz.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Cc: Arve Hjønnevåg <arve@android.com>
      Cc: Carlos Maiolino <cmaiolino@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: J. Bruce Fields <bfields@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Kent Overstreet <koverstreet@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Thomas Hellstrom <thellstrom@vmware.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • super: fix calculation of shrinkable objects for small numbers · 55f841ce
      Glauber Costa committed
      The sysctl knob sysctl_vfs_cache_pressure is used to determine which
      percentage of the shrinkable objects in our cache we should actively try
      to shrink.
      
      It works great in situations in which we have many objects (at least more
      than 100), because the approximation errors will be negligible.  But if
      this is not the case, especially when total_objects < 100, we may end up
      concluding that we have no objects at all (total / 100 = 0, if total <
      100).
      
      This is certainly not the biggest killer in the world, but may matter in
      very low kernel memory situations.
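
      The truncation is easy to reproduce in plain C.  The sketch below
      compares the divide-first form with a variant that also scales the
      remainder, modelled on the mult_frac helper named in a neighbouring
      changelog.  Whether this patch uses exactly that form is an
      assumption, and the function names are hypothetical.

```c
#include <assert.h>

/* Divide first: anything below 100 objects is truncated to zero. */
long pressure_divide_first(long total, long pressure)
{
    return (total / 100) * pressure;
}

/* Scale quotient and remainder separately so small totals survive. */
long pressure_frac(long total, long pressure)
{
    long quot = total / 100;
    long rem = total % 100;

    return quot * pressure + (rem * pressure) / 100;
}
```

      For 50 objects at a pressure of 100, the first form yields 0 and the
      second yields 50.  Splitting into quotient and remainder also avoids
      overflowing on total * pressure for very large totals.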
      Signed-off-by: Glauber Costa <glommer@openvz.org>
      Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Cc: Arve Hjønnevåg <arve@android.com>
      Cc: Carlos Maiolino <cmaiolino@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: J. Bruce Fields <bfields@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Kent Overstreet <koverstreet@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Thomas Hellstrom <thellstrom@vmware.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • fs: bump inode and dentry counters to long · 3942c07c
      Glauber Costa committed
      This series reworks our current object cache shrinking infrastructure in
      two main ways:
      
       * Noticing that a lot of users copy and paste their own version of LRU
         lists for objects, we put some effort in providing a generic version.
         It is modeled after the filesystem users: dentries, inodes, and xfs
         (for various tasks), but we expect that other users could benefit in
         the near future with little or no modification.  Let us know if you
         have any issues.
      
       * The underlying list_lru being proposed automatically and
         transparently keeps the elements in per-node lists, and is able to
         manipulate the node lists individually.  Given this infrastructure, we
         are able to modify the up-to-now hammer called shrink_slab to proceed
         with node-reclaim instead of always searching memory from all over like
         it has been doing.
      
      Per-node lru lists are also expected to lead to less contention in the lru
      locks on multi-node scans, since we are now no longer fighting for a
      global lock.  The locks usually disappear from the profilers with this
      change.
      
      Although we have no official benchmarks for this version - be our guest to
      independently evaluate this - earlier versions of this series were
      performance tested (details at
      http://permalink.gmane.org/gmane.linux.kernel.mm/100537) yielding no
      visible performance regressions while yielding a better qualitative
      behavior in NUMA machines.
      
      With this infrastructure in place, we can use the list_lru entry point to
      provide memcg isolation and per-memcg targeted reclaim.  Historically,
      those two pieces of work have been posted together.  This version presents
      only the infrastructure work, deferring the memcg work for a later time,
      so we can focus on getting this part tested.  You can see more about the
      history of such work at http://lwn.net/Articles/552769/
      
      Dave Chinner (18):
        dcache: convert dentry_stat.nr_unused to per-cpu counters
        dentry: move to per-sb LRU locks
        dcache: remove dentries from LRU before putting on dispose list
        mm: new shrinker API
        shrinker: convert superblock shrinkers to new API
        list: add a new LRU list type
        inode: convert inode lru list to generic lru list code.
        dcache: convert to use new lru list infrastructure
        list_lru: per-node list infrastructure
        shrinker: add node awareness
        fs: convert inode and dentry shrinking to be node aware
        xfs: convert buftarg LRU to generic code
        xfs: rework buffer dispose list tracking
        xfs: convert dquot cache lru to list_lru
        fs: convert fs shrinkers to new scan/count API
        drivers: convert shrinkers to new count/scan API
        shrinker: convert remaining shrinkers to count/scan API
        shrinker: Kill old ->shrink API.
      
      Glauber Costa (7):
        fs: bump inode and dentry counters to long
        super: fix calculation of shrinkable objects for small numbers
        list_lru: per-node API
        vmscan: per-node deferred work
        i915: bail out earlier when shrinker cannot acquire mutex
        hugepage: convert huge zero page shrinker to new shrinker API
        list_lru: dynamically adjust node arrays
      
      This patch:
      
      There are situations in very large machines in which we can have a large
      quantity of dirty inodes, unused dentries, etc.  This is particularly true
      when umounting a filesystem, since every live object will eventually be
      discarded.
      
      Dave Chinner reported a problem with this while experimenting with the
      shrinker revamp patchset.  So we believe it is time for a change.  This
      patch just moves ints to longs.  Machines where it matters should have a
      big long anyway.
      Signed-off-by: Glauber Costa <glommer@openvz.org>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Cc: Arve Hjønnevåg <arve@android.com>
      Cc: Carlos Maiolino <cmaiolino@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: J. Bruce Fields <bfields@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Kent Overstreet <koverstreet@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Thomas Hellstrom <thellstrom@vmware.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • fs: remove vfs_follow_link · aac34df1
      Christoph Hellwig committed
      For a long time no filesystem has been using vfs_follow_link, and as seen
      by recent filesystem submissions any new use is accidental as well.
      
      Remove vfs_follow_link, document the replacement in
      Documentation/filesystems/porting and also rename __vfs_follow_link
      to match its only caller better.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  2. 10 Sep, 2013 1 commit
  3. 09 Sep, 2013 3 commits
    • introduce kern_path_mountpoint() · 2d864651
      Al Viro committed
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • rename user_path_umountat() to user_path_mountpoint_at() · 197df04c
      Al Viro committed
      ... and move the extern from linux/namei.h to fs/internal.h,
      along with that of vfs_path_lookup().
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • vfs: reorganize dput() memory accesses · 8aab6a27
      Linus Torvalds committed
      This is me being a bit OCD after all the dentry optimization work this
      merge window: profiles end up showing 'dput()' as a rather expensive
      operation, and there were two unrelated bad reasons for that.
      
      The first reason was reading d_lockref.count for debugging purposes,
      which touches the lockref cacheline (for reads) before we really need to.
      More importantly, the debugging test in question is _wrong_, and has
      hidden bugs.  It's true that we can only sleep when the count goes down
      to zero, but the test as-is hides the much more subtle bug that happens
      if we race with somebody else deleting the file.
      
      Anyway we _will_ touch that cacheline, but let's do it for a write and
      in the right routine (ie in "lockref_put_or_lock()") which annotates the
      costs better.  So remove the misleading debug code.
      
      The other was an unnecessary access to the cacheline that contains the
      d_lru list, just to check whether we already were on the LRU list or
      not.  This is exactly what we have d_flags for, so that we can avoid
      touching extra cache lines for the common case.  So just add another bit
      for "is this dentry on the LRU".
      
      Finally, mark the tests properly likely/unlikely, so that the common
      fast-paths are dense in the instruction stream.
      
      This makes the profiles look much saner.
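
      The flag-bit idea can be illustrated with a small userspace sketch.
      The names (struct entry, ENTRY_ON_LRU, entry_lru_add) are hypothetical,
      the likely()/unlikely() macros are defined locally on top of the
      GCC/Clang __builtin_expect builtin, and the sketch assumes that
      "already on the LRU" is the common case.

```c
#include <assert.h>
#include <stdbool.h>

/* Branch hints; __builtin_expect is a GCC/Clang builtin. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

#define ENTRY_ON_LRU 0x1u       /* hypothetical "is on the LRU" flag bit */

struct link { struct link *prev, *next; };

struct entry {
    unsigned int flags;   /* read on every put, so it is already in cache */
    struct link lru;      /* only touched when list membership changes */
};

/* Returns true if the entry had to be linked onto the LRU. */
bool entry_lru_add(struct entry *e, struct link *lru_head)
{
    /* Common case: membership is answered from flags, lru is not read. */
    if (likely(e->flags & ENTRY_ON_LRU))
        return false;

    e->lru.prev = lru_head;
    e->lru.next = lru_head->next;
    lru_head->next->prev = &e->lru;
    lru_head->next = &e->lru;
    e->flags |= ENTRY_ON_LRU;
    return true;
}
```

      The point is that the membership test reads only a field that is
      already hot, instead of inspecting the list linkage itself.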
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 08 Sep, 2013 3 commits
    • Kill indirect include of file.h from eventfd.h, use fdget() in cgroup.c · 4e10f3c9
      Al Viro committed
      kernel/cgroup.c is the only place in the tree that relies on eventfd.h
      pulling file.h; move that include there.  Switch from eventfd_fget()/fput()
      to fdget()/fdput(), while we are at it - eventfd_ctx_fileget() will fail
      on non-eventfd descriptors just fine, no need to do that check twice...
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • lockref: add ability to mark lockrefs "dead" · e7d33bb5
      Linus Torvalds committed
      The only actual current lockref user (dcache) uses zero reference counts
      even for perfectly live dentries, because it's a cache: there may not be
      any users, but that doesn't mean that we want to throw away the dentry.
      
      At the same time, the dentry cache does have a notion of a truly "dead"
      dentry that we must not even increment the reference count of, because
      we have pruned it and it is not valid.
      
      Currently that distinction is not visible in the lockref itself, and the
      dentry cache validation uses "lockref_get_or_lock()" to either get a new
      reference to a dentry that already had existing references (and thus
      cannot be dead), or get the dentry lock so that we can then verify the
      dentry and increment the reference count under the lock if that
      verification was successful.
      
      That's all somewhat complicated.
      
      This adds the concept of being "dead" to the lockref itself, by simply
      using a count that is negative.  This allows a usage scenario where we
      can increment the refcount of a dentry without having to validate it,
      and pushing the special "we killed it" case into the lockref code.
      
      The dentry code itself doesn't actually use this yet, and it's probably
      too late in the merge window to do that code (the dentry_kill() code
      with its "should I decrement the count" logic really is pretty complex
      code), but let's introduce the concept at the lockref level now.
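
      A minimal userspace model of the "negative count means dead" idea,
      using C11 atomics.  This is a sketch under stated assumptions, not the
      lockref implementation: the real structure also carries a spinlock,
      and struct ref_sketch, ref_get_not_dead() and ref_mark_dead() are
      illustrative names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct ref_sketch {
    atomic_int count;   /* >= 0: live (zero is still live), < 0: dead */
};

/* Take a reference unless the object is dead. No separate validation step. */
bool ref_get_not_dead(struct ref_sketch *ref)
{
    int old = atomic_load(&ref->count);

    do {
        if (old < 0)
            return false;       /* pruned: the count must not be bumped */
        /* on failure, compare_exchange reloads old and the check reruns */
    } while (!atomic_compare_exchange_weak(&ref->count, &old, old + 1));

    return true;
}

/* Mark the object dead; kernel code would do this with the lock held. */
void ref_mark_dead(struct ref_sketch *ref)
{
    atomic_store(&ref->count, -128);    /* any negative value would do here */
}
```

      A count of zero still allows a new reference, which matches the
      cache semantics described above.  Only a negative count refuses it.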
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Revert "Input: introduce BTN/ABS bits for drums and guitars" · b04c99e3
      Linus Torvalds committed
      This reverts commits 61e00655, 73f8645d and 8e22ecb6:
        "Input: introduce BTN/ABS bits for drums and guitars"
        "HID: wiimote: add support for Guitar-Hero drums"
        "HID: wiimote: add support for Guitar-Hero guitars"
      
      The extra new ABS_xx values resulted in ABS_MAX no longer being a
      power-of-two, which broke the comparison logic.  It also caused the
      ioctl numbers to overflow into the next byte, causing problems for that.
      
      We'll try again for 3.13.
      Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
      Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
      Acked-by: David Herrmann <dh.herrmann@gmail.com>
      Acked-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
      Cc: Benjamin Tissoires <benjamin.tissoires@gmail.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. 07 Sep, 2013 3 commits
  6. 06 Sep, 2013 5 commits
    • fscache: Netfs function for cleanup post readpages · 5a6f282a
      Milosz Tanski committed
      Currently the fscache code expects the netfs to call fscache_readpages_or_alloc
      inside the aops readpages callback.  It marks all the pages in the list
      provided by readahead with PG_private_2.  In the cases that the netfs fails to
      read all the pages (which is legal) it ends up returning to the readahead and
      triggering a BUG.  This happens because the page list still contains marked
      pages.
      
      This patch implements a simple fscache_readpages_cancel function that the netfs
      should call before returning from readpages.  It will revoke the pages from the
      underlying cache backend and unmark them.
      
      The problem was originally worked out in the Ceph devel tree, but it also
      occurs in CIFS.  It appears that NFS, AFS and 9P are okay as read_cache_pages()
      will clean up the unprocessed pages in the case of an error.
      
      This can be used to address the following oops:
      
      [12410647.597278] BUG: Bad page state in process petabucket  pfn:3d504e
      [12410647.597292] page:ffffea000f541380 count:0 mapcount:0 mapping: (null) index:0x0
      [12410647.597298] page flags: 0x200000000001000(private_2)
      
      ...
      
      [12410647.597334] Call Trace:
      [12410647.597345]  [<ffffffff815523f2>] dump_stack+0x19/0x1b
      [12410647.597356]  [<ffffffff8111def7>] bad_page+0xc7/0x120
      [12410647.597359]  [<ffffffff8111e49e>] free_pages_prepare+0x10e/0x120
      [12410647.597361]  [<ffffffff8111fc80>] free_hot_cold_page+0x40/0x170
      [12410647.597363]  [<ffffffff81123507>] __put_single_page+0x27/0x30
      [12410647.597365]  [<ffffffff81123df5>] put_page+0x25/0x40
      [12410647.597376]  [<ffffffffa02bdcf9>] ceph_readpages+0x2e9/0x6e0 [ceph]
      [12410647.597379]  [<ffffffff81122a8f>] __do_page_cache_readahead+0x1af/0x260
      [12410647.597382]  [<ffffffff81122ea1>] ra_submit+0x21/0x30
      [12410647.597384]  [<ffffffff81118f64>] filemap_fault+0x254/0x490
      [12410647.597387]  [<ffffffff8113a74f>] __do_fault+0x6f/0x4e0
      [12410647.597391]  [<ffffffff810125bd>] ? __switch_to+0x16d/0x4a0
      [12410647.597395]  [<ffffffff810865ba>] ? finish_task_switch+0x5a/0xc0
      [12410647.597398]  [<ffffffff8113d856>] handle_pte_fault+0xf6/0x930
      [12410647.597401]  [<ffffffff81008c33>] ? pte_mfn_to_pfn+0x93/0x110
      [12410647.597403]  [<ffffffff81008cce>] ? xen_pmd_val+0xe/0x10
      [12410647.597405]  [<ffffffff81005469>] ? __raw_callee_save_xen_pmd_val+0x11/0x1e
      [12410647.597407]  [<ffffffff8113f361>] handle_mm_fault+0x251/0x370
      [12410647.597411]  [<ffffffff812b0ac4>] ? call_rwsem_down_read_failed+0x14/0x30
      [12410647.597414]  [<ffffffff8155bffa>] __do_page_fault+0x1aa/0x550
      [12410647.597418]  [<ffffffff8108011d>] ? up_write+0x1d/0x20
      [12410647.597422]  [<ffffffff8113141c>] ? vm_mmap_pgoff+0xbc/0xe0
      [12410647.597425]  [<ffffffff81143bb8>] ? SyS_mmap_pgoff+0xd8/0x240
      [12410647.597427]  [<ffffffff8155c3ae>] do_page_fault+0xe/0x10
      [12410647.597431]  [<ffffffff81558818>] page_fault+0x28/0x30
      Signed-off-by: Milosz Tanski <milosz@adfin.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
    • FS-Cache: Add interface to check consistency of a cached object · da9803bc
      David Howells committed
      Extend the fscache netfs API so that the netfs can ask as to whether a cache
      object is up to date with respect to its corresponding netfs object:
      
      	int fscache_check_consistency(struct fscache_cookie *cookie)
      
      This will call back to the netfs to check whether the auxiliary data associated
      with a cookie is correct.  It returns 0 if it is and -ESTALE if it isn't; it
      may also return -ENOMEM and -ERESTARTSYS.
      
      The backends now have to implement a mandatory operation pointer:
      
      	int (*check_consistency)(struct fscache_object *object)
      
      that corresponds to the above API call.  FS-Cache takes care of pinning the
      object and the cookie in memory and managing this call with respect to the
      object state.
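The relationship between the netfs-facing call and the backend operation can be illustrated with a small user-space mock. This is only a sketch: every `demo_*` name is hypothetical, the version comparison stands in for the netfs auxiliary-data check, and the pinning, state management, and the -ENOMEM / -ERESTARTSYS paths described above are left out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct demo_object;

/* Hypothetical backend operation table with the one mandatory pointer. */
struct demo_cache_ops {
    int (*check_consistency)(struct demo_object *object);
};

/* Hypothetical cache object; the two version fields stand in for the
 * auxiliary data that the netfs would compare. */
struct demo_object {
    const struct demo_cache_ops *ops;
    unsigned long cached_version;
    unsigned long netfs_version;
};

/* Backend side: report 0 when up to date, -ESTALE otherwise. */
static int demo_backend_check(struct demo_object *object)
{
    return object->cached_version == object->netfs_version ? 0 : -ESTALE;
}

const struct demo_cache_ops demo_ops = {
    .check_consistency = demo_backend_check,
};

/* Front-end side: dispatch to the backend's mandatory operation. */
int demo_check_consistency(struct demo_object *object)
{
    assert(object != NULL && object->ops->check_consistency != NULL);
    return object->ops->check_consistency(object);
}
```

A caller would treat 0 as "cache is usable" and -ESTALE as "discard and refetch".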
      
      Original-author: Hongyi Jia <jiayisuse@gmail.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: Hongyi Jia <jiayisuse@gmail.com>
      cc: Milosz Tanski <milosz@adfin.com>
      da9803bc
    • A
      constify dcache.c inlined helpers where possible · f0d3b3de
      Al Viro committed
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      f0d3b3de
    • M
      vfs: check submounts and drop atomically · 848ac114
      Miklos Szeredi committed
      We check submounts before doing d_drop() on a non-empty directory dentry in
      NFS (have_submounts()), but we do not exclude a racing mount.
      
       Process A: have_submounts() -> returns false
       Process B: mount() -> success
       Process A: d_drop()
      
      This patch prepares the ground for the fix by doing the following
      operations all under the same rename lock:
      
        have_submounts()
        shrink_dcache_parent()
        d_drop()
      
      This is actually an optimization since have_submounts() and
      shrink_dcache_parent() both traverse the same dentry tree separately.
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
      CC: David Howells <dhowells@redhat.com>
      CC: Steven Whitehouse <swhiteho@redhat.com>
      CC: Trond Myklebust <Trond.Myklebust@netapp.com>
      CC: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      848ac114
    • J
      vxlan: Notify drivers for listening UDP port changes · 53cf5275
      Joseph Gasparakis committed
      This patch adds two more ndo ops: ndo_add_rx_vxlan_port() and
      ndo_del_rx_vxlan_port().
      
      Drivers can get notifications through the above functions about changes
      of the UDP listening port of VXLAN. Also, when physical ports come up,
      they can now call vxlan_get_rx_port() to obtain the port number(s) of
      any VXLAN interfaces that were already up before them.
      
      This information about the listening UDP port would be used for VXLAN
      related offloads.
      
      A big thank you to John Fastabend (john.r.fastabend@intel.com) for his
      input and his suggestions on this patch set.
      
      CC: John Fastabend <john.r.fastabend@intel.com>
      CC: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: Joseph Gasparakis <joseph.gasparakis@intel.com>
      Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      53cf5275
  7. 05 Sep, 2013 15 commits
    • W
      nfs4.1: Add SP4_MACH_CRED write and commit support · 8c21c62c
      Weston Andros Adamson committed
      WRITE and COMMIT can use the machine credential.
      
      If WRITE is supported and COMMIT is not, make all (mach cred) writes FILE_SYNC4.
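The fallback rule can be read as a small decision function. The sketch below is an illustration only, not the client's code: the `demo_*` names are hypothetical, and the enum values merely mirror the NFSv4 stable_how4 constants (UNSTABLE4, DATA_SYNC4, FILE_SYNC4).

```c
#include <assert.h>
#include <stdbool.h>

/* Values follow the NFSv4 stable_how4 enumeration; the names are ours. */
enum demo_stable_how {
    DEMO_UNSTABLE4  = 0,
    DEMO_DATA_SYNC4 = 1,
    DEMO_FILE_SYNC4 = 2
};

/* Pick the stability level for a write sent with the machine credential.
 * write_ok / commit_ok: whether the server allows WRITE / COMMIT with it. */
enum demo_stable_how demo_mach_cred_write_mode(bool write_ok, bool commit_ok,
                                               enum demo_stable_how requested)
{
    /* Precondition of this sketch: only called for machine-credential writes. */
    assert(write_ok);

    /* Without COMMIT, an unstable write could never be committed later,
     * so the data has to be made stable by the WRITE itself. */
    if (!commit_ok)
        return DEMO_FILE_SYNC4;

    return requested;
}
```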
      Signed-off-by: Weston Andros Adamson <dros@netapp.com>
      Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
      8c21c62c
    • W
      nfs4.1: Add SP4_MACH_CRED stateid support · 3787d506
      Weston Andros Adamson committed
      TEST_STATEID and FREE_STATEID can use the machine credential.
      Signed-off-by: Weston Andros Adamson <dros@netapp.com>
      Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
      3787d506
    • W
      nfs4.1: Add SP4_MACH_CRED secinfo support · 8b5bee2e
      Weston Andros Adamson committed
      SECINFO and SECINFO_NONAME can use the machine credential.
      Signed-off-by: Weston Andros Adamson <dros@netapp.com>
      Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
      8b5bee2e
    • W
      nfs4.1: Add SP4_MACH_CRED cleanup support · fa940720
      Weston Andros Adamson committed
      CLOSE and LOCKU can use the machine credential.
      Signed-off-by: Weston Andros Adamson <dros@netapp.com>
      Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
      fa940720
    • W
      nfs4.1: Minimal SP4_MACH_CRED implementation · 2031cd1a
      Weston Andros Adamson committed
      This is a minimal client side implementation of SP4_MACH_CRED.  It will
      attempt to negotiate SP4_MACH_CRED iff the EXCHANGE_ID is using
      krb5i or krb5p auth.  SP4_MACH_CRED will be used if the server supports the
      minimal operations:
      
       BIND_CONN_TO_SESSION
       EXCHANGE_ID
       CREATE_SESSION
       DESTROY_SESSION
       DESTROY_CLIENTID
      
      This patch only includes the EXCHANGE_ID negotiation code because
      the client will already use the machine cred for these operations.
      
      If the server doesn't support SP4_MACH_CRED or doesn't support the minimal
      operations, the exchange id will be resent with SP4_NONE.
      Signed-off-by: Weston Andros Adamson <dros@netapp.com>
      Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
      2031cd1a
    • T
      SUNRPC: Replace pointer values with task->tk_pid and rpc_clnt->cl_clid · 92cb6c5b
      Trond Myklebust committed
      Instead of the pointer values, use the task and client identifier values
      for tracing purposes.
      Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
      92cb6c5b
    • T
      SUNRPC: Add an identifier for struct rpc_clnt · 2f048db4
      Trond Myklebust committed
      Add an identifier in order to aid debugging.
      Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
      2f048db4
    • A
      drm/exynos: fimd: replace struct fb_videomode with videomode · 111e6055
      Andrzej Hajda committed
      The patch replaces all occurrences of struct fb_videomode with the
      more accurate struct videomode. The change allows removing the
      mode conversion function and simplifies the clock divider calculation.
      Clock configuration is moved to a separate function.
      Signed-off-by: Andrzej Hajda <a.hajda@samsung.com>
      Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
      Signed-off-by: Inki Dae <inki.dae@samsung.com>
      111e6055
    • D
      net: ipv6: mld: get rid of MLDV2_MRC and simplify calculation · e3f5b170
      Daniel Borkmann committed
      Get rid of MLDV2_MRC and use our new macros for mantissa and
      exponent to calculate Maximum Response Delay out of the Maximum
      Response Code.
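For reference, the mantissa/exponent decoding of the Maximum Response Code defined in RFC 3810, section 5.1.3 can be sketched as follows. This is not the kernel's macro set: the helper names are hypothetical, and the code assumes the 16-bit code has already been converted to host byte order.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers; field layout per RFC 3810, section 5.1.3:
 * bit 15 = 1, bits 14..12 = exponent, bits 11..0 = mantissa. */
static uint32_t demo_mrc_exponent(uint16_t mrc) { return (mrc >> 12) & 0x7; }
static uint32_t demo_mrc_mantissa(uint16_t mrc) { return mrc & 0x0fff; }

/* Convert a Maximum Response Code into a Maximum Response Delay (ms). */
uint32_t demo_mld_max_response_delay_ms(uint16_t mrc)
{
    /* Codes below 32768 are the delay itself. */
    if (mrc < 32768)
        return mrc;

    /* Larger codes are a floating-point style encoding:
     * delay = (mantissa | 0x1000) << (exponent + 3). */
    return (demo_mrc_mantissa(mrc) | 0x1000) << (demo_mrc_exponent(mrc) + 3);
}
```

With this encoding the default Query Response Interval of 10000 maps to itself, since it is below the 32768 threshold.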
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e3f5b170
    • D
      net: ipv6: mld: fix v1/v2 switchback timeout to rfc3810, 9.12. · 89225d1c
      Daniel Borkmann committed
      i) RFC3810, 9.2. Query Interval [QI] says:
      
         The Query Interval variable denotes the interval between General
         Queries sent by the Querier. Default value: 125 seconds. [...]
      
      ii) RFC3810, 9.3. Query Response Interval [QRI] says:
      
        The Maximum Response Delay used to calculate the Maximum Response
        Code inserted into the periodic General Queries. Default value:
        10000 (10 seconds) [...] The number of seconds represented by the
        [Query Response Interval] must be less than the [Query Interval].
      
      iii) RFC3810, 9.12. Older Version Querier Present Timeout [OVQPT] says:
      
        The Older Version Querier Present Timeout is the time-out for
        transitioning a host back to MLDv2 Host Compatibility Mode. When an
        MLDv1 query is received, MLDv2 hosts set their Older Version Querier
        Present Timer to [Older Version Querier Present Timeout].
      
        This value MUST be ([Robustness Variable] times (the [Query Interval]
        in the last Query received)) plus ([Query Response Interval]).
      
      Hence, on *default* the timeout results in:
      
        [RV] = 2, [QI] = 125sec, [QRI] = 10sec
        [OVQPT] = [RV] * [QI] + [QRI] = 260sec
      
      That said, we currently calculate [OVQPT] (here given as the 'switchback'
      variable) as ...
      
        switchback = (idev->mc_qrv + 1) * max_delay
      
      RFC3810, 9.12. says "the [Query Interval] in the last Query received". In
      section "9.14. Configuring timers", it is said:
      
        This section is meant to provide advice to network administrators on
        how to tune these settings to their network. Ambitious router
        implementations might tune these settings dynamically based upon
        changing characteristics of the network. [...]
      
      iv) RFC3810, 9.14.2. Query Interval:
      
        The overall level of periodic MLD traffic is inversely proportional
        to the Query Interval. A longer Query Interval results in a lower
        overall level of MLD traffic. The value of the Query Interval MUST
        be equal to or greater than the Maximum Response Delay used to
        calculate the Maximum Response Code inserted in General Query
        messages.
      
      I assume that was why switchback is calculated as is (3 * max_delay), although
      this setting seems to be meant for routers only to configure their [QI]
      interval for non-default intervals. So usage here like this is clearly wrong.
      
      In conclusion, the current behaviour of IPv6's multicast code does not
      conform to the RFC because switchback is calculated wrongly. That is, its
      value is too small, so MLDv2 hosts switch back to MLDv2 far too early,
      i.e. after ~30secs instead of ~260secs by default.
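The two figures can be reproduced with a few lines of arithmetic. The sketch below uses hypothetical function names and millisecond units; it only restates the RFC 3810, 9.12 formula and the old (qrv + 1) * max_delay expression quoted above, and is not the helper code added by the patch.

```c
#include <assert.h>

/* Older Version Querier Present Timeout per RFC 3810, 9.12:
 * [RV] * [QI] + [QRI], all times in milliseconds. */
unsigned long demo_ovqpt_rfc3810(unsigned long rv, unsigned long qi_ms,
                                 unsigned long qri_ms)
{
    return rv * qi_ms + qri_ms;
}

/* The previous calculation: (qrv + 1) * max_delay. */
unsigned long demo_switchback_old(unsigned long qrv, unsigned long max_delay_ms)
{
    return (qrv + 1) * max_delay_ms;
}
```

With the defaults [RV] = 2, [QI] = 125000 ms and [QRI] = 10000 ms, the first function yields 260000 ms, while the second yields 30000 ms for a robustness value of 2 and a maximum delay of 10000 ms.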
      
      Hence, introduce necessary helper functions and fix this up properly as it
      should be.
      
      Introduced in 06da92283 ("[IPV6]: Add MLDv2 support."). Credits to Hannes
      Frederic Sowa, who also had a hand in this. Also thanks to Hangbin Liu,
      who did initial testing.
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Cc: David Stevens <dlstevens@us.ibm.com>
      Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      89225d1c
    • T
    • A
      vfio-pci: PCI hot reset interface · 8b27ee60
      Alex Williamson committed
      The current VFIO_DEVICE_RESET interface only maps to PCI use cases
      where we can isolate the reset to the individual PCI function.  This
      means the device must support FLR (PCIe or AF), PM reset on D3hot->D0
      transition, device specific reset, or be a singleton device on a bus
      for a secondary bus reset.  FLR does not have widespread support,
      PM reset is not very reliable, and bus topology is dictated by the
      system and device design.  We need to provide a means for a user to
      induce a bus reset in cases where the existing mechanisms are not
      available or not reliable.
      
      This device specific extension to VFIO provides the user with this
      ability.  Two new ioctls are introduced:
       - VFIO_DEVICE_PCI_GET_HOT_RESET_INFO
       - VFIO_DEVICE_PCI_HOT_RESET
      
      The first provides the user with information about the extent of
      devices affected by a hot reset.  This is essentially a list of
      devices and the IOMMU groups they belong to.  The user may then
      initiate a hot reset by calling the second ioctl.  We must be
      careful that the user has ownership of all the affected devices
      found via the first ioctl, so the second ioctl takes a list of file
      descriptors for the VFIO groups affected by the reset.  Each group
      must have IOMMU protection established for the ioctl to succeed.
      Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
      8b27ee60
    • C
      net: sync some IP headers with glibc · cfd280c9
      Carlos O'Donell committed
      Solution:
      =========
      
      - Synchronize linux's `include/uapi/linux/in6.h'
        with glibc's `inet/netinet/in.h'.
      - Synchronize glibc's `inet/netinet/in.h' with linux's
        `include/uapi/linux/in6.h'.
      - Allow including the headers in either order.
      - First header included defines the structures and macros.
      
      Details:
      ========
      
      The kernel promises not to break the UAPI ABI so I don't
      see why we can't just have the two userspace headers
      coordinate?
      
      If you include the kernel headers first you get those,
      and if you include the glibc headers first you get those,
      and the following patch arranges a coordination and
      synchronization between the two.
      
      Let's handle `include/uapi/linux/in6.h' from linux,
      and `inet/netinet/in.h' from glibc and ensure they compile
      in any order and preserve the required ABI.
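The "first header included defines the structures" idea can be sketched in a single file with two guarded sections standing in for the two headers. Every name below is hypothetical, including the `DEMO_HAVE_MREQ` coordination macro and the member-name alias; the actual macros used by the kernel and glibc are not reproduced here.

```c
#include <assert.h>
#include <stddef.h>

/* --- stand-in for the libc header (hypothetical names throughout) --- */
#ifndef DEMO_LIBC_IN_H
#define DEMO_LIBC_IN_H
#ifndef DEMO_HAVE_MREQ            /* nobody defined the struct yet */
#define DEMO_HAVE_MREQ 1
#define DEMO_MREQ_OWNER "libc"
struct demo_mreq { unsigned char addr[16]; unsigned int iface; };
#endif
#endif

/* --- stand-in for the kernel UAPI header --- */
#ifndef DEMO_UAPI_IN6_H
#define DEMO_UAPI_IN6_H
#ifndef DEMO_HAVE_MREQ            /* skipped: the libc section came first */
#define DEMO_HAVE_MREQ 1
#define DEMO_MREQ_OWNER "kernel"
struct demo_mreq { unsigned char addr[16]; unsigned int iface; };
#endif
/* One side uses a different member name; a macro maps it onto the
 * shared definition so existing users keep compiling. */
#define demo_ifindex iface
#endif

/* Reports which section ended up providing the definition. */
const char *demo_mreq_owner(void) { return DEMO_MREQ_OWNER; }

unsigned int demo_get_ifindex(const struct demo_mreq *m)
{
    assert(m != NULL);
    return m->demo_ifindex;       /* expands to m->iface */
}
```

Swapping the order of the two sections would make the "kernel" section provide the struct instead, while code using either member name would still compile against the same layout.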
      
      These two patches pass the following compile tests:
      
      cat >> test1.c <<EOF
      #include <netinet/in.h>
      #include <linux/in6.h>
      int main (void) {
        return 0;
      }
      EOF
      gcc -c test1.c
      
      cat >> test2.c <<EOF
      #include <linux/in6.h>
      #include <netinet/in.h>
      int main (void) {
        return 0;
      }
      EOF
      gcc -c test2.c
      
      One wrinkle is that the kernel has a different name for one of
      the members in ipv6_mreq. In the kernel patch we create a macro
      to cover the uses of the old name, and while that's not entirely
      clean it's one of the best solutions (aside from an anonymous
      union which has other issues).
      
      I've reviewed the code and it looks to me like the ABI is
      assured and everything matches on both sides.
      
      Notes:
      - You want netinet/in.h to include bits/in.h as early as possible,
        but it needs in_addr so define in_addr early.
      - You want bits/in.h included as early as possible so you can use
        the linux specific code to define __USE_KERNEL_DEFS based on
        the _UAPI_* macro definition and use those to cull in.h.
      - glibc was missing IPPROTO_MH, added here.
      
      Compile tested and inspected.
      Reported-by: Thomas Backlund <tmb@mageia.org>
      Cc: Thomas Backlund <tmb@mageia.org>
      Cc: libc-alpha@sourceware.org
      Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
      Cc: David S. Miller <davem@davemloft.net>
      Tested-by: Cong Wang <amwang@redhat.com>
      Signed-off-by: Carlos O'Donell <carlos@redhat.com>
      Signed-off-by: Cong Wang <amwang@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      cfd280c9
    • N
      NFSv4: Don't try to recover NFSv4 locks when they are lost. · ef1820f9
      NeilBrown committed
      When an NFSv4 client loses contact with the server it can lose any
      locks that it holds.
      
      Currently when it reconnects to the server it simply tries to reclaim
      those locks.  This might succeed even though some other client has
      held and released a lock in the meantime.  So the first client might
      think the file is unchanged, but it isn't.  This isn't good.
      
      If, when recovery happens, the locks cannot be claimed because some
      other client still holds the lock, then we get a message in the kernel
      logs, but the client can still write.  So two clients can both think
      they have a lock and can both write at the same time.  This is equally
      not good.
      
      There was a patch a while ago
        http://comments.gmane.org/gmane.linux.nfs/41917
      
      which tried to address some of this, but it didn't seem to go
      anywhere.  That patch would also send a signal to the process.  That
      might be useful but for now this patch just causes writes to fail.
      
      For NFSv4 (unlike v2/v3) there is a strong link between the lock and
      the write request so we can fairly easily fail any IO if the lock is
      gone.  While some applications might not expect this, it is still
      safer than allowing the write to succeed.
      
      Because this is a fairly big change in behaviour a module parameter,
      "recover_locks", is introduced which defaults to true (the current
      behaviour) but can be set to "false" to tell the client not to try to
      recover things that were lost.
      Signed-off-by: NeilBrown <neilb@suse.de>
      Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
      ef1820f9
    • T
      SUNRPC: Add tracepoints to help debug socket connection issues · 40b5ea0c
      Trond Myklebust committed
      Add client side debugging to help trace socket connection/disconnection
      and unexpected state change issues.
      Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
      40b5ea0c