  1. 20 Nov, 2014 · 2 commits
  2. 05 Nov, 2014 · 1 commit
  3. 09 Oct, 2014 · 2 commits
  4. 04 Aug, 2014 · 8 commits
    • NFS: fix two problems in lookup_revalidate in RCU-walk · 50d77739
      NeilBrown committed
      1/ rcu_dereference isn't correct: that field isn't
         RCU protected.   It could potentially change at any time
         so ACCESS_ONCE might be justified.
      
         changes to ->d_parent are protected by ->d_seq.  However
         that isn't always checked after ->d_revalidate is called,
         so it is safest to keep the double-check that ->d_parent
         hasn't changed at the end of these functions.
      
      2/ in nfs4_lookup_revalidate, "->d_parent" was forgotten.
         So 'parent' was not the parent of 'dentry'.
         This fails safe, as the context is that dentry->d_inode is
         NULL, and the result of parent->d_inode being NULL is
         that ECHILD is returned, which is always safe.
      Reported-by: kbuild test robot <fengguang.wu@intel.com>
      Signed-off-by: NeilBrown <neilb@suse.de>
      Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
    • NFS: allow lockless access to access_cache · f682a398
      NeilBrown committed
      The access cache is used during RCU-walk path lookups, so it is best
      to avoid locking if possible as taking a lock kills concurrency.
      
      The rbtree is not rcu-safe and cannot easily be made so.
      Instead we simply check the last (i.e. most recent) entry on the LRU
      list.  If this doesn't match, then we return -ECHILD and retry in
      lock/refcount mode.
      
      This requires freeing the nfs_access_entry struct with rcu, and
      requires using rcu access primitives when adding entries to the lru, and
      when examining the last entry.
      
      Calling put_rpccred before kfree_rcu looks a bit odd, but as
      put_rpccred already provides rcu protection, we know that the cred will
      not actually be freed until the next grace period, so any concurrent
      access will be safe.
      
      This patch provides about 5% performance improvement on a stat-heavy
      synthetic work load with 4 threads on a 2-core CPU.
      Signed-off-by: NeilBrown <neilb@suse.de>
      Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
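The fast path described in the commit above can be illustrated in userspace C. This is only a minimal sketch under stated assumptions: every name in it (`demo_access_entry`, `demo_cache_lookup_fast`, and so on) is hypothetical, C11 atomics stand in for the kernel's RCU accessors, and memory reclamation (the `kfree_rcu` side) is left out entirely. It shows only the idea of checking the single most recent entry and returning -ECHILD on a mismatch so the caller can retry with locking.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical cached access result for one credential. */
struct demo_access_entry {
    int cred_id;  /* stand-in for the credential pointer */
    int mask;     /* cached access bits */
};

/* Only the most recently added entry is reachable without a lock. */
struct demo_access_cache {
    _Atomic(struct demo_access_entry *) most_recent;
};

static void demo_cache_init(struct demo_access_cache *cache)
{
    atomic_init(&cache->most_recent, NULL);
}

/* Publish a fully initialised entry; release ordering makes its
 * fields visible to any reader that observes the pointer. */
static void demo_cache_publish(struct demo_access_cache *cache,
                               struct demo_access_entry *entry)
{
    atomic_store_explicit(&cache->most_recent, entry, memory_order_release);
}

/* Lockless fast path: inspect only the most recent entry.
 * Returns 0 on a hit, or -ECHILD to ask the caller to retry
 * on the slower, locked path. */
static int demo_cache_lookup_fast(struct demo_access_cache *cache,
                                  int cred_id, int *mask_out)
{
    assert(mask_out != NULL);
    struct demo_access_entry *entry =
        atomic_load_explicit(&cache->most_recent, memory_order_acquire);

    if (entry == NULL || entry->cred_id != cred_id)
        return -ECHILD;
    *mask_out = entry->mask;
    return 0;
}
```

A caller would try `demo_cache_lookup_fast()` first and fall back to a locked lookup only when it returns -ECHILD.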
    • NFS: teach nfs_lookup_verify_inode to handle LOOKUP_RCU · 1fa1e384
      NeilBrown committed
      It fails with -ECHILD rather than make an RPC call.
      
      This allows nfs_lookup_revalidate to call it in RCU-walk mode.
      Signed-off-by: NeilBrown <neilb@suse.de>
      Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
    • NFS: teach nfs_neg_need_reval to understand LOOKUP_RCU · 912a108d
      NeilBrown committed
      This requires nfs_check_verifier to take an rcu_walk flag, and requires
      an rcu version of nfs_revalidate_inode which returns -ECHILD rather
      than making an RPC call.
      
      With this, nfs_lookup_revalidate can call nfs_neg_need_reval in
      RCU-walk mode.
      
      We can also move the LOOKUP_RCU check past the nfs_check_verifier()
      call in nfs_lookup_revalidate.
      
      If RCU_WALK prevents nfs_check_verifier or nfs_neg_need_reval from
      doing a full check, they return a status indicating that a revalidation
      is required.  As this revalidation will not be possible in RCU_WALK
      mode, -ECHILD will ultimately be returned, which is the desired result.
      Signed-off-by: NeilBrown <neilb@suse.de>
      Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
    • NFS: support RCU_WALK in nfs_permission() · f3324a2a
      NeilBrown committed
      nfs_permission makes two calls which are not always safe in RCU_WALK,
      rpc_lookup_cred and nfs_do_access.
      
      The second can easily be made rcu-safe by aborting with -ECHILD before
      making the RPC call.
      
      The former can be made rcu-safe by calling rpc_lookup_cred_nonblock()
      instead.
      As this will almost always succeed, we use it even when RCU_WALK
      isn't being used as it still saves some spinlocks in a common case.
      We only fall back to rpc_lookup_cred() if rpc_lookup_cred_nonblock()
      fails and MAY_NOT_BLOCK isn't set.
      
      This optimisation (always trying rpc_lookup_cred_nonblock()) is
      particularly important when a security module is active.
      In that case inode_permission() may return -ECHILD from
      security_inode_permission() even though ->permission() succeeded in
      RCU_WALK mode.
      This leads to may_lookup() retrying inode_permission after performing
      unlazy_walk().  The spinlock that rpc_lookup_cred() takes is often
      more expensive than anything security_inode_permission() does, so that
      spinlock becomes the main bottleneck.
      Signed-off-by: NeilBrown <neilb@suse.de>
      Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
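The credential lookup order described in the commit above (always try the non-blocking lookup, and fall back to the blocking one only when blocking is allowed) might be sketched as follows. This is an illustration, not the kernel code: `demo_cred`, `demo_get_cred`, and the two stub lookups are hypothetical names, a NULL return stands in for the kernel's error pointers, and -EACCES is an arbitrary placeholder for a failed blocking lookup.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a credential object. */
struct demo_cred {
    int uid;
};

/* A lookup returns a credential, or NULL if it could not find one. */
typedef struct demo_cred *(*demo_cred_lookup_fn)(void);

static struct demo_cred demo_cached_cred = { 1000 };

/* Stub lookups used to exercise both outcomes. */
static struct demo_cred *demo_lookup_hit(void)  { return &demo_cached_cred; }
static struct demo_cred *demo_lookup_miss(void) { return NULL; }

/* Try the cheap non-blocking lookup first, whether or not the caller
 * may block. Fall back to the blocking lookup only when the fast one
 * failed and blocking is permitted; otherwise report -ECHILD so the
 * caller can leave RCU-walk and retry. */
static int demo_get_cred(demo_cred_lookup_fn nonblock,
                         demo_cred_lookup_fn blocking,
                         bool may_not_block,
                         struct demo_cred **out)
{
    assert(out != NULL);
    struct demo_cred *cred = nonblock();

    if (cred == NULL) {
        if (may_not_block)
            return -ECHILD;
        cred = blocking();
        if (cred == NULL)
            return -EACCES;  /* placeholder error for this sketch */
    }
    *out = cred;
    return 0;
}
```

Trying the non-blocking variant unconditionally mirrors the commit's point that the fast path is worth taking even outside RCU-walk, because it avoids a spinlock in the common case.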
    • NFS: prepare for RCU-walk support by pushing tests later in code. · d51ac1a8
      NeilBrown committed
      nfs_lookup_revalidate, nfs4_lookup_revalidate, and nfs_permission
      all need to understand and handle RCU-walk for NFS to gain the
      benefits of RCU-walk for cached information.
      
      Currently these functions all immediately return -ECHILD
      if the relevant flag (LOOKUP_RCU or MAY_NOT_BLOCK) is set.
      
      This patch pushes those tests later in the code so that we only abort
      immediately before we enter rcu-unsafe code.  As subsequent patches
      make that rcu-unsafe code rcu-safe, several of these new tests will
      disappear.
      
      With this patch there are several paths through the code which will no
      longer return -ECHILD during an RCU-walk.  However these are mostly
      error paths or other uninteresting cases.
      
      A noteworthy change in nfs_lookup_revalidate is that we don't take
      (or put) the reference to ->d_parent when LOOKUP_RCU is set.
      Rather we rcu_dereference ->d_parent, and check that ->d_inode
      is not NULL.  We also check that ->d_parent hasn't changed after
      all the tests.
      
      In nfs4_lookup_revalidate we simply avoid testing LOOKUP_RCU on the
      path that only calls nfs_lookup_revalidate() as that function
      already performs the required test.
      Signed-off-by: NeilBrown <neilb@suse.de>
      Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
    • NFS: nfs4_lookup_revalidate: only evaluate parent if it will be used. · 49317a7f
      NeilBrown committed
      nfs4_lookup_revalidate only uses 'parent' to get 'dir', and only
      uses 'dir' if 'inode == NULL'.
      
      So we don't need to find out what 'parent' or 'dir' is until we
      know that 'inode' is NULL.
      
      By moving 'dget_parent' inside the 'if', we can reduce the number of
      call sites for 'dput(parent)'.
      Signed-off-by: NeilBrown <neilb@suse.de>
      Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
    • NFS: Enforce an upper limit on the number of cached access calls · 3a505845
      Trond Myklebust committed
      This may be used to limit the number of cached credentials building up
      inside the access cache.
      Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
  5. 18 Apr, 2014 · 1 commit
  6. 05 Apr, 2014 · 1 commit
  7. 18 Mar, 2014 · 1 commit
  8. 12 Feb, 2014 · 1 commit
  9. 11 Feb, 2014 · 1 commit
  10. 29 Jan, 2014 · 1 commit
  11. 28 Jan, 2014 · 1 commit
    • NFS: fix the handling of NFS_INO_INVALID_DATA flag in nfs_revalidate_mapping · d529ef83
      Jeff Layton committed
      There is a possible race in how the nfs_invalidate_mapping function is
      handled.  Currently, we go and invalidate the pages in the file and then
      clear NFS_INO_INVALID_DATA.
      
      The problem is that it's possible for a stale page to creep into the
      mapping after the page was invalidated (i.e., via readahead). If another
      writer comes along and sets the flag after that happens but before
      invalidate_inode_pages2 returns then we could clear the flag
      without the cache having been properly invalidated.
      
      So, we must clear the flag first and then invalidate the pages. Doing
      this however, opens another race:
      
      It's possible to have two concurrent read() calls that end up in
      nfs_revalidate_mapping at the same time. The first one clears the
      NFS_INO_INVALID_DATA flag and then goes to call nfs_invalidate_mapping.
      
      Just before calling that though, the other task races in, checks the
      flag and finds it cleared. At that point, it trusts that the mapping is
      good and gets the lock on the page, allowing the read() to be satisfied
      from the cache even though the data is no longer valid.
      
      These effects are easily manifested by running diotest3 from the LTP
      test suite on NFS. That program does a series of DIO writes and buffered
      reads. The operations are serialized and page-aligned but the existing
      code fails the test since it occasionally allows a read to come out of
      the cache incorrectly. While mixing direct and buffered I/O isn't
      recommended, I believe it's possible to hit this in other ways that just
      use buffered I/O, though that situation is much harder to reproduce.
      
      The problem is that the checking/clearing of that flag and the
      invalidation of the mapping really need to be atomic. Fix this by
      serializing concurrent invalidations with a bitlock.
      
      At the same time, we also need to allow other places that check
      NFS_INO_INVALID_DATA to check whether we might be in the middle of
      invalidating the file, so fix up a couple of places that do that
      to look for the new NFS_INO_INVALIDATING flag.
      
      Doing this requires us to be careful not to set the bitlock
      unnecessarily, so this code only does that if it believes it will
      be doing an invalidation.
      Signed-off-by: Jeff Layton <jlayton@redhat.com>
      Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
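The serialization scheme described in the commit above can be sketched in userspace C with C11 atomics. This is an illustration only: the flag names `DEMO_INVALID_DATA` and `DEMO_INVALIDATING`, the `demo_inode` type, and both functions are hypothetical, and the busy-wait stands in for whatever waiting primitive the kernel's bitlock uses. It shows the three ideas from the message: the bitlock is taken only when an invalidation looks necessary, the data-invalid flag is cleared before the invalidation runs, and other checkers treat "invalidation in progress" the same as "invalid".

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_INVALID_DATA  (1u << 0)  /* cached pages are stale */
#define DEMO_INVALIDATING  (1u << 1)  /* bitlock: invalidation in progress */

struct demo_inode {
    atomic_uint flags;
    unsigned int invalidations;  /* counts simulated cache flushes */
};

/* Returns true if this call performed an invalidation. */
static bool demo_revalidate_mapping(struct demo_inode *ino)
{
    assert(ino != NULL);

    /* Cheap check first, so the bitlock is not taken needlessly. */
    if (!(atomic_load(&ino->flags) & DEMO_INVALID_DATA))
        return false;

    /* Acquire the bitlock; a real implementation would sleep, not spin. */
    while (atomic_fetch_or(&ino->flags, DEMO_INVALIDATING) & DEMO_INVALIDATING)
        ;

    /* Clear the flag first, then invalidate, both under the bitlock. */
    unsigned int old = atomic_fetch_and(&ino->flags, ~DEMO_INVALID_DATA);
    bool did_work = false;
    if (old & DEMO_INVALID_DATA) {
        ino->invalidations++;  /* stand-in for dropping cached pages */
        did_work = true;
    }

    /* Release the bitlock. */
    atomic_fetch_and(&ino->flags, ~DEMO_INVALIDATING);
    return did_work;
}

/* Readers must not trust the cache while an invalidation is pending
 * or still running. */
static bool demo_cache_is_trustworthy(struct demo_inode *ino)
{
    unsigned int f = atomic_load(&ino->flags);
    return !(f & (DEMO_INVALID_DATA | DEMO_INVALIDATING));
}
```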
  12. 06 Jan, 2014 · 1 commit
  13. 29 Oct, 2013 · 1 commit
  14. 25 Oct, 2013 · 1 commit
  15. 28 Sep, 2013 · 1 commit
    • NFS: Use i_writecount to control whether to get an fscache cookie in nfs_open() · f1fe29b4
      David Howells committed
      Use i_writecount to control whether to get an fscache cookie in nfs_open() as
      NFS does not do write caching yet.  I *think* this is the cause of a problem
      encountered by Mark Moseley whereby __fscache_uncache_page() gets a NULL
      pointer dereference because cookie->def is NULL:
      
      BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
      IP: [<ffffffff812a1903>] __fscache_uncache_page+0x23/0x160
      PGD 0
      Thread overran stack, or stack corrupted
      Oops: 0000 [#1] SMP
      Modules linked in: ...
      CPU: 7 PID: 18993 Comm: php Not tainted 3.11.1 #1
      Hardware name: Dell Inc. PowerEdge R420/072XWF, BIOS 1.3.5 08/21/2012
      task: ffff8804203460c0 ti: ffff880420346640
      RIP: 0010:[<ffffffff812a1903>] __fscache_uncache_page+0x23/0x160
      RSP: 0018:ffff8801053af878 EFLAGS: 00210286
      RAX: 0000000000000000 RBX: ffff8800be2f8780 RCX: ffff88022ffae5e8
      RDX: 0000000000004c66 RSI: ffffea00055ff440 RDI: ffff8800be2f8780
      RBP: ffff8801053af898 R08: 0000000000000001 R09: 0000000000000003
      R10: 0000000000000000 R11: 0000000000000000 R12: ffffea00055ff440
      R13: 0000000000001000 R14: ffff8800c50be538 R15: 0000000000000000
      FS: 0000000000000000(0000) GS:ffff88042fc60000(0063) knlGS:00000000e439c700
      CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033
      CR2: 0000000000000010 CR3: 0000000001d8f000 CR4: 00000000000607f0
      Stack:
      ...
      Call Trace:
      [<ffffffff81365a72>] __nfs_fscache_invalidate_page+0x42/0x70
      [<ffffffff813553d5>] nfs_invalidate_page+0x75/0x90
      [<ffffffff811b8f5e>] truncate_inode_page+0x8e/0x90
      [<ffffffff811b90ad>] truncate_inode_pages_range.part.12+0x14d/0x620
      [<ffffffff81d6387d>] ? __mutex_lock_slowpath+0x1fd/0x2e0
      [<ffffffff811b95d3>] truncate_inode_pages_range+0x53/0x70
      [<ffffffff811b969d>] truncate_inode_pages+0x2d/0x40
      [<ffffffff811b96ff>] truncate_pagecache+0x4f/0x70
      [<ffffffff81356840>] nfs_setattr_update_inode+0xa0/0x120
      [<ffffffff81368de4>] nfs3_proc_setattr+0xc4/0xe0
      [<ffffffff81357f78>] nfs_setattr+0xc8/0x150
      [<ffffffff8122d95b>] notify_change+0x1cb/0x390
      [<ffffffff8120a55b>] do_truncate+0x7b/0xc0
      [<ffffffff8121f96c>] do_last+0xa4c/0xfd0
      [<ffffffff8121ffbc>] path_openat+0xcc/0x670
      [<ffffffff81220a0e>] do_filp_open+0x4e/0xb0
      [<ffffffff8120ba1f>] do_sys_open+0x13f/0x2b0
      [<ffffffff8126aaf6>] compat_SyS_open+0x36/0x50
      [<ffffffff81d7204c>] sysenter_dispatch+0x7/0x24
      
      The code at the instruction pointer was disassembled:
      
      > (gdb) disas __fscache_uncache_page
      > Dump of assembler code for function __fscache_uncache_page:
      > ...
      > 0xffffffff812a18ff <+31>: mov 0x48(%rbx),%rax
      > 0xffffffff812a1903 <+35>: cmpb $0x0,0x10(%rax)
      > 0xffffffff812a1907 <+39>: je 0xffffffff812a19cd <__fscache_uncache_page+237>
      
      These instructions make up:
      
      	ASSERTCMP(cookie->def->type, !=, FSCACHE_COOKIE_TYPE_INDEX);
      
      That cmpb is the faulting instruction (%rax is 0).  So cookie->def is NULL -
      which presumably means that the cookie has already been at least partway
      through __fscache_relinquish_cookie().
      
      What I think may be happening is something like a three-way race on the same
      file:
      
      	PROCESS 1	PROCESS 2	PROCESS 3
      	===============	===============	===============
      	open(O_TRUNC|O_WRONLY)
      			open(O_RDONLY)
      					open(O_WRONLY)
      	-->nfs_open()
      	-->nfs_fscache_set_inode_cookie()
      	nfs_fscache_inode_lock()
      	nfs_fscache_disable_inode_cookie()
      	__fscache_relinquish_cookie()
      	nfs_inode->fscache = NULL
      	<--nfs_fscache_set_inode_cookie()
      
      			-->nfs_open()
      			-->nfs_fscache_set_inode_cookie()
      			nfs_fscache_inode_lock()
      			nfs_fscache_enable_inode_cookie()
      			__fscache_acquire_cookie()
      			nfs_inode->fscache = cookie
      			<--nfs_fscache_set_inode_cookie()
      	<--nfs_open()
      	-->nfs_setattr()
      	...
      	...
      	-->nfs_invalidate_page()
      	-->__nfs_fscache_invalidate_page()
      	cookie = nfsi->fscache
      					-->nfs_open()
      					-->nfs_fscache_set_inode_cookie()
      					nfs_fscache_inode_lock()
      					nfs_fscache_disable_inode_cookie()
      					-->__fscache_relinquish_cookie()
      	-->__fscache_uncache_page(cookie)
      	<crash>
      					<--__fscache_relinquish_cookie()
      					nfs_inode->fscache = NULL
      					<--nfs_fscache_set_inode_cookie()
      
      What is needed is something to prevent process #2 from reacquiring the cookie
      - and I think checking i_writecount should do the trick.
      
      It's also possible to have a two-way race on this if the file is opened
      O_TRUNC|O_RDONLY instead.
      Reported-by: Mark Moseley <moseleymark@gmail.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
  16. 26 Sep, 2013 · 1 commit
  17. 17 Sep, 2013 · 1 commit
    • nfs: set FILE_CREATED · 01c919ab
      Miklos Szeredi committed
      Set FILE_CREATED on O_CREAT|O_EXCL.  If the NFS server honored our request
      for exclusivity then this must be correct.
      
      Currently this is a no-op, since the VFS sets FILE_CREATED anyway.  The
      next patch will, however, require this flag to be always set by
      filesystems.
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  18. 11 Sep, 2013 · 2 commits
    • fs: convert fs shrinkers to new scan/count API · 1ab6c499
      Dave Chinner committed
      Convert the filesystem shrinkers to use the new API, and standardise some
      of the behaviours of the shrinkers at the same time.  For example,
      nr_to_scan means the number of objects to scan, not the number of objects
      to free.
      
      I refactored the CIFS idmap shrinker a little - it really needs to be
      broken up into a shrinker per tree and keep an item count with the tree
      root so that we don't need to walk the tree every time the shrinker needs
      to count the number of objects in the tree (i.e.  all the time under
      memory pressure).
      
      [glommer@openvz.org: fixes for ext4, ubifs, nfs, cifs and glock. Fixes are needed mainly due to new code merged in the tree]
      [assorted fixes folded in]
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Glauber Costa <glommer@openvz.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Acked-by: Jan Kara <jack@suse.cz>
      Acked-by: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Cc: Arve Hjønnevåg <arve@android.com>
      Cc: Carlos Maiolino <cmaiolino@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: J. Bruce Fields <bfields@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Kent Overstreet <koverstreet@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Thomas Hellstrom <thellstrom@vmware.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
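The split between counting and scanning that the commit above refers to can be illustrated with a toy cache. Everything here is hypothetical (`demo_cache`, `demo_shrink_control`, and both callbacks are invented for the sketch and are not the kernel's shrinker types); it only demonstrates the stated convention that `nr_to_scan` bounds the number of objects examined, not the number freed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_CACHE_SLOTS 8

/* Toy cache: a slot may hold an object, and an object may be pinned. */
struct demo_cache {
    bool present[DEMO_CACHE_SLOTS];
    bool pinned[DEMO_CACHE_SLOTS];
};

/* Hypothetical request passed to the scan callback. */
struct demo_shrink_control {
    unsigned long nr_to_scan;  /* objects to examine, not objects to free */
};

/* "count" callback: report how many objects could be scanned. */
static unsigned long demo_count_objects(const struct demo_cache *cache)
{
    unsigned long n = 0;
    for (size_t i = 0; i < DEMO_CACHE_SLOTS; i++)
        if (cache->present[i])
            n++;
    return n;
}

/* "scan" callback: examine at most nr_to_scan objects, free the
 * unpinned ones, and return how many were actually freed. */
static unsigned long demo_scan_objects(struct demo_cache *cache,
                                       const struct demo_shrink_control *sc)
{
    assert(sc != NULL);
    unsigned long scanned = 0, freed = 0;

    for (size_t i = 0; i < DEMO_CACHE_SLOTS && scanned < sc->nr_to_scan; i++) {
        if (!cache->present[i])
            continue;
        scanned++;               /* pinned objects still count as scanned */
        if (cache->pinned[i])
            continue;
        cache->present[i] = false;
        freed++;
    }
    return freed;
}
```

With two pinned objects at the front, a request to scan three objects frees only one, which is the behaviour the "scan, not free" wording implies.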
    • super: fix calculation of shrinkable objects for small numbers · 55f841ce
      Glauber Costa committed
      The sysctl knob sysctl_vfs_cache_pressure is used to determine which
      percentage of the shrinkable objects in our cache we should actively try
      to shrink.
      
      It works great in situations in which we have many objects (at least more
      than 100), because the approximation errors will be negligible.  But if
      this is not the case, especially when total_objects < 100, we may end up
      concluding that we have no objects at all (total / 100 = 0, if total <
      100).
      
      This is certainly not the biggest killer in the world, but may matter in
      very low kernel memory situations.
      Signed-off-by: Glauber Costa <glommer@openvz.org>
      Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Adrian Hunter <adrian.hunter@intel.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
      Cc: Arve Hjønnevåg <arve@android.com>
      Cc: Carlos Maiolino <cmaiolino@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: J. Bruce Fields <bfields@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Kent Overstreet <koverstreet@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Cc: Thomas Hellstrom <thellstrom@vmware.com>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
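The truncation problem described in the commit above is plain integer arithmetic and is easy to reproduce. The sketch below uses hypothetical helper names (`demo_scale_naive`, `demo_mul_frac`); splitting the value into quotient and remainder before scaling is one common way to avoid the loss, and is shown here as an illustration rather than as the exact change the commit made.

```c
#include <assert.h>

/* Divide first, then multiply: any total below 100 collapses to 0. */
static unsigned long demo_scale_naive(unsigned long total,
                                      unsigned long pressure)
{
    return (total / 100) * pressure;
}

/* Compute total * numer / denom without losing the remainder and
 * without forming the full product total * numer. */
static unsigned long demo_mul_frac(unsigned long total,
                                   unsigned long numer,
                                   unsigned long denom)
{
    assert(denom != 0);
    unsigned long quot = total / denom;
    unsigned long rem = total % denom;

    return quot * numer + (rem * numer) / denom;
}
```

For example, with 50 objects and a pressure of 100, `demo_scale_naive(50, 100)` yields 0 while `demo_mul_frac(50, 100, 100)` yields 50.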
  19. 06 Sep, 2013 · 1 commit
  20. 04 Sep, 2013 · 1 commit
  21. 30 Aug, 2013 · 1 commit
  22. 22 Aug, 2013 · 7 commits
  23. 21 Aug, 2013 · 1 commit
  24. 08 Aug, 2013 · 1 commit