1. 14 Jul, 2017: 4 commits
    • NFS: guard against confused server in nfs_atomic_open() · eaa2b82c
      Committed by NeilBrown
      A confused server could return, for an NFSv4 OPEN request, a filehandle
      that it had previously returned for a directory.
      So the inode returned by ->open_context() in nfs_atomic_open()
      could conceivably be a directory inode.
      
      This has particular implications for the call to
      nfs_file_set_open_context() in nfs_finish_open().
      If that is called on a directory inode, then the nfs_open_context
      that gets stored in the filp->private_data will be linked to
      nfs_inode->open_files.
      
      When the directory is closed, nfs_closedir() will (ultimately)
      free the ->private_data, but not unlink it from nfs_inode->open_files
      (because it doesn't expect an nfs_open_context there).
      
      Subsequently the memory could get used for something else and eventually
      if the ->open_files list is walked, the walker will fall off the end and
      crash.
      
      So: change nfs_finish_open() to only call nfs_file_set_open_context()
      for regular-file inodes.
      
      This failure mode has been seen in a production setting (unknown NFS
      server implementation).  The kernel was v3.0 and the specific sequence
      seen would not affect more recent kernels, but I think a risk is still
      present, and caution is wise.
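      A minimal userspace model of the rule follows. It uses hypothetical simplified types (model_inode, model_ctx) in place of the kernel's nfs_inode and nfs_open_context. A context is linked into the inode's open-files list only when the inode is a regular file.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-ins for nfs_inode and
 * nfs_open_context; only the list linkage matters for this sketch. */
enum model_ftype { MODEL_REG, MODEL_DIR };

struct model_ctx {
    struct model_ctx *next;        /* link in the inode's open_files list */
};

struct model_inode {
    enum model_ftype type;         /* stands in for S_ISREG(i_mode) etc. */
    struct model_ctx *open_files;  /* head of a singly linked list */
};

/* Link the open context into ->open_files only for regular files.
 * A directory inode handed back by a confused server is left untouched,
 * so closing that directory cannot free a context still on the list.
 * Returns true if the context was linked. */
static bool model_finish_open(struct model_inode *inode, struct model_ctx *ctx)
{
    assert(inode != NULL && ctx != NULL);
    if (inode->type != MODEL_REG)
        return false;
    ctx->next = inode->open_files;
    inode->open_files = ctx;
    return true;
}
```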
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • NFS: only invalidate dentrys that are clearly invalid. · cc89684c
      Committed by NeilBrown
      Since commit bafc9b75 ("vfs: More precise tests in d_invalidate")
      in v3.18, a return of '0' from ->d_revalidate() will cause the dentry
      to be invalidated even if it has filesystems mounted on it or on a
      descendant.  The mounted filesystem is unmounted.
      
      This means we need to be careful not to return 0 unless the directory
      referred to truly is invalid.  So -ESTALE or -ENOENT should invalidate
      the directory.  Other errors such as -EPERM or -ERESTARTSYS should be
      returned from ->d_revalidate() so they are propagated to the caller.
      
      A particular problem can be demonstrated by:
      
      1/ mount an NFS filesystem using NFSv3 on /mnt
      2/ mount any other filesystem on /mnt/foo
      3/ ls /mnt/foo
      4/ turn off network, or otherwise make the server unable to respond
      5/ ls /mnt/foo &
      6/ cat /proc/$!/stack # note that nfs_lookup_revalidate is in the call stack
      7/ kill -9 $! # this results in -ERESTARTSYS being returned
      8/ observe that /mnt/foo has been unmounted.
      
      This patch changes nfs_lookup_revalidate() to only treat
        -ESTALE from nfs_lookup_verify_inode() and
        -ESTALE or -ENOENT from ->lookup()
      as indicating an invalid inode.  Other errors are returned.
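      The classification above can be sketched as a small pure function. The enum and function names are hypothetical. The return convention (1 = valid, 0 = invalidate, negative errno = propagate) is the ->d_revalidate() convention the message relies on. -ERESTARTSYS is kernel-internal, so it appears only in a comment.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical labels for where the error came from: in the kernel these
 * correspond to nfs_lookup_verify_inode() and the ->lookup() call made
 * inside nfs_lookup_revalidate(). */
enum model_err_source { FROM_VERIFY_INODE, FROM_LOOKUP };

/* ->d_revalidate() convention: 1 = dentry valid, 0 = invalidate it,
 * negative errno = pass the error back to the caller unchanged. */
static int model_revalidate_result(enum model_err_source src, int err)
{
    assert(err <= 0);                  /* errors are negative errno values */
    if (err == 0)
        return 1;                      /* verification succeeded */
    if (err == -ESTALE)
        return 0;                      /* stale from either source: invalid */
    if (src == FROM_LOOKUP && err == -ENOENT)
        return 0;                      /* server says the name is gone */
    return err;                        /* e.g. -EPERM, or -ERESTARTSYS on a signal */
}
```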
      
      Also nfs_check_inode_attributes() is changed to return -ESTALE rather
      than -EIO.  This is consistent with the error returned in similar
      circumstances from nfs_update_inode().
      
      As this bug allows any user to unmount a filesystem mounted on an NFS
      filesystem, this fix is suitable for stable kernels.
      
      Fixes: bafc9b75 ("vfs: More precise tests in d_invalidate")
      Cc: stable@vger.kernel.org (v3.18+)
      Signed-off-by: NeilBrown <neilb@suse.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • NFS: nfs_rename() - revalidate directories on -ERESTARTSYS · 818a8dbe
      Committed by Benjamin Coddington
      An interrupted rename will leave the old dentry behind if the rename
      succeeds.  Fix this by forcing a lookup the next time through
      ->d_revalidate.
      
      A previous attempt at solving this problem took the approach of completing
      the work of the rename asynchronously; however, that approach was wrong
      since it would allow the d_move() to occur after the directory's i_mutex
      had been dropped by the original process.
      Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
      Reviewed-by: Jeff Layton <jlayton@redhat.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • NFS: convert flags to bool · a7a3b1e9
      Committed by Benjamin Coddington
      NFS uses a mix of int, unsigned int :1 bitfields, and bool as flags in
      structs and args.  Assert the preference for uniformly replacing these
      with the bool type.
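      The sketch below shows one general C hazard that uniform bool flags avoid. This is general C background, not something the commit states. Storing a masked bit into an unsigned :1 bitfield keeps only the low bit, while conversion to bool turns any nonzero value into true. The flag name and value are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_FLAG_PLUS 0x2u   /* hypothetical flag bit (not bit 0) */

struct model_bitfield_args { unsigned int plus : 1; };
struct model_bool_args     { bool plus; };

/* Storing into a 1-bit unsigned bitfield keeps only the low bit,
 * so 0x2 silently becomes 0. */
static unsigned int model_store_bitfield(unsigned int flags)
{
    struct model_bitfield_args a;
    a.plus = flags & MODEL_FLAG_PLUS;
    return a.plus;
}

/* Storing into a bool converts any nonzero value to true. */
static bool model_store_bool(unsigned int flags)
{
    struct model_bool_args a;
    a.plus = flags & MODEL_FLAG_PLUS;
    return a.plus;
}

/* Demonstrates the difference for a flags word with only bit 1 set. */
static void model_demo(void)
{
    assert(model_store_bitfield(MODEL_FLAG_PLUS) == 0);   /* bit lost */
    assert(model_store_bool(MODEL_FLAG_PLUS) == true);    /* normalized */
}
```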
      Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
  2. 06 May, 2017: 1 commit
  3. 21 Apr, 2017: 1 commit
    • NFS: switch back to to ->iterate() · b044f645
      Committed by Benjamin Coddington
      NFS has some optimizations for readdir to choose between using READDIR or
      READDIRPLUS based on workload, and which NFS operation to use is determined
      by subsequent interactions with lookup, d_revalidate, and getattr.
      
      Concurrent use of nfs_readdir() via ->iterate_shared() can cause those
      optimizations to repeatedly invalidate the pagecache used to store
      directory entries during readdir(), which causes very poor performance
      for directories with many entries (more than about 10000).
      
      There are a couple of ways to fix this in NFS, but no fix would be as
      simple as going back to ->iterate() to serialize nfs_readdir(), and
      neither fix I tested performed as well as going back to ->iterate().
      
      The first required taking the directory's i_lock for each entry, which
      resulted in terrible contention.
      
      The second way adds another flag to the nfs_inode, and so keeps the
      optimizations working for large directories.  The difference from using
      ->iterate() here is that much more memory is consumed for a given workload
      without any performance gain.
      
      The workings of nfs_readdir() are such that concurrent users are serialized
      within read_cache_page() waiting to retrieve pages of entries from the
      server.  By serializing this work in iterate_dir() instead, contention for
      cache pages is reduced.  Waiting processes can have an uncontended pass at
      the entirety of the directory's pagecache once previous processes have
      completed filling it.
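      Here is a rough model of the choice iterate_dir() makes, assuming the VFS behaviour of this period. The directory lock is taken shared when ->iterate_shared() is provided and exclusively when only ->iterate() is. The exclusive case is what serializes nfs_readdir() after this change. The struct is a hypothetical, cut-down file_operations.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, cut-down file_operations with only the two readdir hooks. */
struct model_dir_ops {
    int (*iterate)(void *ctx);          /* called with the dir lock held exclusively */
    int (*iterate_shared)(void *ctx);   /* called with the dir lock held shared */
};

enum model_lock_mode { MODEL_LOCK_NONE, MODEL_LOCK_SHARED, MODEL_LOCK_EXCLUSIVE };

/* Which lock mode an iterate_dir()-style caller would take: shared if the
 * filesystem offers ->iterate_shared(), exclusive if it offers only
 * ->iterate() (serializing all readers), none if it offers neither. */
static enum model_lock_mode model_iterate_lock_mode(const struct model_dir_ops *ops)
{
    assert(ops != NULL);
    if (ops->iterate_shared)
        return MODEL_LOCK_SHARED;
    if (ops->iterate)
        return MODEL_LOCK_EXCLUSIVE;
    return MODEL_LOCK_NONE;
}

/* Trivial stand-in readdir callback for the examples. */
static int model_readdir(void *ctx) { (void)ctx; return 0; }
```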
      
      v2 - Keep the bits needed for parallel lookup
      Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
      Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
  4. 28 Mar, 2017: 1 commit
  5. 09 Feb, 2017: 1 commit
  6. 20 Dec, 2016: 2 commits
  7. 10 Dec, 2016: 1 commit
  8. 05 Dec, 2016: 1 commit
  9. 03 Dec, 2016: 3 commits
  10. 02 Dec, 2016: 1 commit
    • NFSv4: add flock_owner to open context · 532d4def
      Committed by NeilBrown
      An open file description (struct file) in a given process can be
      associated with two different lock owners.
      
      It can have a Posix lock owner which will be different in each process
      that has a fd on the file.
      It can have a Flock owner which will be the same in all processes.
      
      When searching for a lock stateid to use, we need to consider both of these
      owners.
      
      So add a new "flock_owner" to the "nfs_open_context" (of which there
      is one for each open file description).
      
      This flock_owner does not need to be reference-counted as there is a
      1-1 relation between 'struct file' and nfs open contexts,
      and it will never be part of a list of contexts.  So there is no need
      for a 'flock_context' - just the owner is enough.
      
      The io_count included in the (Posix) lock_context provides no
      guarantee that all read-aheads that could use the state have
      completed, so not supporting it for flock locks is not a serious
      problem.  Synchronization between flock and read-ahead can be added
      later if needed.
      
      When creating an open_context for a non-opening create call, we don't have
      a 'struct file' to pass in, so the lock context gets initialized with
      a NULL owner, but this will never be used.
      
      The flock_owner is not used at all in this patch, that will come later.
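      The sketch below shows how a lock-state search could consider both owners, as the message says it must (this patch itself only adds the field). The types are hypothetical. Preferring the POSIX owner over the flock owner is an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: a lock state records the owner it was created for.
 * POSIX locks are owned per process, flock locks per open file description. */
struct model_lock_state {
    const void *owner;
    int stateid;                 /* stands in for the NFSv4 lock stateid */
};

struct model_open_ctx {
    const void *flock_owner;     /* the struct file this context belongs to */
};

/* Find a lock state usable by this I/O: prefer one owned by the POSIX
 * owner, otherwise fall back to one owned by the context's flock owner
 * (this preference order is an assumption). Returns NULL if none match. */
static const struct model_lock_state *
model_find_lock_state(const struct model_lock_state *states, size_t n,
                      const void *posix_owner, const struct model_open_ctx *ctx)
{
    const struct model_lock_state *flock_match = NULL;

    assert(ctx != NULL);
    for (size_t i = 0; i < n; i++) {
        if (states[i].owner == posix_owner)
            return &states[i];
        if (flock_match == NULL && ctx->flock_owner != NULL &&
            states[i].owner == ctx->flock_owner)
            flock_match = &states[i];
    }
    return flock_match;
}
```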
      Acked-by: Jeff Layton <jlayton@redhat.com>
      Signed-off-by: NeilBrown <neilb@suse.com>
      Reviewed-by: Jeff Layton <jlayton@redhat.com>
      Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
  11. 28 Sep, 2016: 1 commit
  12. 27 Sep, 2016: 1 commit
    • fs: make remaining filesystems use .rename2 · 1cd66c93
      Committed by Miklos Szeredi
      This is trivial to do:
      
       - add flags argument to foo_rename()
       - check if flags is zero
       - assign foo_rename() to .rename2 instead of .rename
      
      This doesn't mean it's impossible to support RENAME_NOREPLACE for these
      filesystems, but it is not as trivial as it is for local filesystems.
      RENAME_NOREPLACE must guarantee atomicity (i.e. it shouldn't be possible
      for a file to be created on one host while it is overwritten by rename on
      another host).
      
      Filesystems converted:
      
      9p, afs, ceph, coda, ecryptfs, kernfs, lustre, ncpfs, nfs, ocfs2, orangefs.
      
      After this, we can get rid of the duplicate interfaces for rename.
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Acked-by: David Howells <dhowells@redhat.com> [AFS]
      Acked-by: Mike Marshall <hubcap@omnibond.com>
      Cc: Eric Van Hensbergen <ericvh@gmail.com>
      Cc: Ilya Dryomov <idryomov@gmail.com>
      Cc: Jan Harkes <jaharkes@cs.cmu.edu>
      Cc: Tyler Hicks <tyhicks@canonical.com>
      Cc: Oleg Drokin <oleg.drokin@intel.com>
      Cc: Trond Myklebust <trond.myklebust@primarydata.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
  13. 23 Sep, 2016: 1 commit
  14. 06 Jul, 2016: 2 commits
  15. 27 Jun, 2016: 1 commit
    • make nfs_atomic_open() call d_drop() on all ->open_context() errors. · d20cb71d
      Committed by Al Viro
      In "NFSv4: Move dentry instantiation into the NFSv4-specific atomic open code",
      the unconditional d_drop() after ->open_context() had been removed.  That was
      correct for the success cases (there, ->open_context() itself had been doing
      the dcache manipulations), but not for the error ones.  Only one of those
      (ENOENT) got a compensatory d_drop() added in that commit, but in fact it
      should have been done for all errors.  As it is, an O_CREAT non-exclusive
      open on a hashed negative dentry, racing with e.g. symlink creation from
      another client, ended up with ->open_context() getting an error and
      proceeding to call nfs_lookup() on a hashed dentry, which would have
      instantly triggered BUG_ON() in d_materialise_unique() (or, these days,
      its equivalent in d_splice_alias()).
      
      Cc: stable@vger.kernel.org # v3.10+
      Tested-by: Oleg Drokin <green@linuxhacker.ru>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
  16. 25 Jun, 2016: 2 commits
  17. 16 Jun, 2016: 1 commit
  18. 11 Jun, 2016: 1 commit
    • vfs: make the string hashes salt the hash · 8387ff25
      Committed by Linus Torvalds
      We always mixed in the parent pointer into the dentry name hash, but we
      did it late at lookup time.  It turns out that we can simplify that
      lookup-time action by salting the hash with the parent pointer early
      instead of late.
      
      A few other users of our string hashes also wanted to mix in their own
      pointers into the hash, and those are updated to use the same mechanism.
      
      Hash users that don't have any particular initial salt can just use the
      NULL pointer as a no-salt.
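      The sketch below illustrates salting early. The salt (for example the parent dentry pointer, or NULL for no salt) seeds the hash state before any name bytes are mixed in. This is similar to how full_name_hash(salt, name, len) takes the salt as its first argument after this change. FNV-1a stands in for the kernel's own string hash, which is a different function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* FNV-1a constants; FNV-1a is used only because it is short. */
#define MODEL_FNV_OFFSET 0xcbf29ce484222325ULL
#define MODEL_FNV_PRIME  0x100000001b3ULL

/* Salted name hash: the salt seeds the state before the name bytes are
 * mixed in, instead of being combined afterwards. A NULL salt means
 * "no salt". Each step (xor, multiply by an odd prime mod 2^64) is a
 * bijection, so distinct salts always give distinct hashes for the
 * same name. */
static uint64_t model_salted_hash(const void *salt, const char *name, size_t len)
{
    uint64_t h = MODEL_FNV_OFFSET ^ (uint64_t)(uintptr_t)salt;

    assert(name != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        h ^= (unsigned char)name[i];
        h *= MODEL_FNV_PRIME;
    }
    return h;
}
```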
      
      Cc: Vegard Nossum <vegard.nossum@oracle.com>
      Cc: George Spelvin <linux@sciencehorizons.net>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  19. 30 May, 2016: 2 commits
  20. 09 May, 2016: 1 commit
    • nfs: per-name sillyunlink exclusion · 884be175
      Committed by Al Viro
      Use d_alloc_parallel() for sillyunlink/lookup exclusion, and an
      explicit rwsem (nfs_rmdir() being a writer and nfs_call_unlink()
      a reader) for rmdir/sillyunlink exclusion.
      
      That ought to make lookup/readdir/!O_CREAT atomic_open really
      parallel on NFS.
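      Here is a single-threaded toy model of the rmdir/sillyunlink exclusion. Any number of readers (sillyunlinks via nfs_call_unlink()) may hold the semaphore together, but a writer (nfs_rmdir()) excludes them all. The try-lock helpers are hypothetical; a real rwsem sleeps instead of returning false.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, single-threaded model of a reader/writer semaphore, just
 * enough to show which operations exclude which. */
struct model_rwsem {
    int readers;
    bool writer;
};

static bool model_try_down_read(struct model_rwsem *s)
{
    if (s->writer)
        return false;           /* an rmdir is in progress */
    s->readers++;
    return true;
}

static void model_up_read(struct model_rwsem *s)
{
    assert(s->readers > 0);
    s->readers--;
}

static bool model_try_down_write(struct model_rwsem *s)
{
    if (s->writer || s->readers > 0)
        return false;           /* sillyunlinks (or another rmdir) pending */
    s->writer = true;
    return true;
}

static void model_up_write(struct model_rwsem *s)
{
    assert(s->writer);
    s->writer = false;
}
```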
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  21. 03 May, 2016: 1 commit
  22. 05 Apr, 2016: 1 commit
    • mm, fs: get rid of PAGE_CACHE_* and page_cache_{get,release} macros · 09cbfeaf
      Committed by Kirill A. Shutemov
      PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} macros were introduced a *long* time
      ago with the promise that one day it would be possible to implement the
      page cache with bigger chunks than PAGE_SIZE.
      
      This promise never materialized.  And it likely never will.
      
      We have many places where PAGE_CACHE_SIZE is assumed to be equal to
      PAGE_SIZE.  And it's a constant source of confusion whether
      PAGE_CACHE_* or PAGE_* constants should be used in a particular case,
      especially on the border between fs and mm.
      
      Globally switching to PAGE_CACHE_SIZE != PAGE_SIZE would cause too much
      breakage to be doable.
      
      Let's stop pretending that pages in page cache are special.  They are
      not.
      
      The changes are pretty straight-forward:
      
       - <foo> << (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - <foo> >> (PAGE_CACHE_SHIFT - PAGE_SHIFT) -> <foo>;
      
       - PAGE_CACHE_{SIZE,SHIFT,MASK,ALIGN} -> PAGE_{SIZE,SHIFT,MASK,ALIGN};
      
       - page_cache_get() -> get_page();
      
       - page_cache_release() -> put_page();
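      The sketch below shows why the first two rewrites are safe, using hypothetical MODEL_* macros. PAGE_CACHE_SHIFT was always defined equal to PAGE_SHIFT, so the shift amount is zero and the expression reduces to <foo>. The value 12 (4 KiB pages) is an assumption; it varies by architecture.

```c
#include <assert.h>

/* Hypothetical values; PAGE_SHIFT is 12 (4 KiB) on many architectures. */
#define MODEL_PAGE_SHIFT        12
#define MODEL_PAGE_SIZE         (1UL << MODEL_PAGE_SHIFT)
#define MODEL_PAGE_MASK         (~(MODEL_PAGE_SIZE - 1))
/* The old alias was always identical, which is why the rewrite is safe. */
#define MODEL_PAGE_CACHE_SHIFT  MODEL_PAGE_SHIFT

/* Before: converting a page-cache index to a page index. */
static unsigned long model_index_before(unsigned long idx)
{
    return idx << (MODEL_PAGE_CACHE_SHIFT - MODEL_PAGE_SHIFT); /* shift by 0 */
}

/* After: the shift by zero disappears. */
static unsigned long model_index_after(unsigned long idx)
{
    return idx;
}

/* Both forms agree for any index. */
static void model_check_equivalent(unsigned long idx)
{
    assert(model_index_before(idx) == model_index_after(idx));
}
```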
      
      This patch contains automated changes generated with coccinelle using
      the script below.  For some reason, coccinelle doesn't patch header files.
      I've called spatch for them manually.
      
      The only adjustment after coccinelle is the revert of changes to the
      PAGE_CACHE_ALIGN definition: we are going to drop it later.
      
      There are a few places in the code where coccinelle didn't reach.  I'll
      fix them manually in a separate patch.  Comments and documentation will
      also be addressed in a separate patch.
      
      virtual patch
      
      @@
      expression E;
      @@
      - E << (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      expression E;
      @@
      - E >> (PAGE_CACHE_SHIFT - PAGE_SHIFT)
      + E
      
      @@
      @@
      - PAGE_CACHE_SHIFT
      + PAGE_SHIFT
      
      @@
      @@
      - PAGE_CACHE_SIZE
      + PAGE_SIZE
      
      @@
      @@
      - PAGE_CACHE_MASK
      + PAGE_MASK
      
      @@
      expression E;
      @@
      - PAGE_CACHE_ALIGN(E)
      + PAGE_ALIGN(E)
      
      @@
      expression E;
      @@
      - page_cache_get(E)
      + get_page(E)
      
      @@
      expression E;
      @@
      - page_cache_release(E)
      + put_page(E)
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  23. 27 Mar, 2016: 1 commit
  24. 14 Mar, 2016: 1 commit
  25. 23 Jan, 2016: 1 commit
    • wrappers for ->i_mutex access · 5955102c
      Committed by Al Viro
      Wrappers parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
      with inode_foo(inode) being mutex_foo(&inode->i_mutex).
      
      Please use these for access to ->i_mutex; over the coming cycle
      ->i_mutex will become an rwsem, with ->lookup() done with it held
      only shared.
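      Here is the wrapper pattern in miniature, with a hypothetical non-blocking toy lock in place of struct mutex. Callers use inode_lock()/inode_unlock()-style helpers and never touch ->i_mutex directly. The lock type inside the inode can therefore change later without touching those callers.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy lock standing in for struct mutex; it only tracks
 * state and does not block. */
struct model_mutex { bool locked; };

static void model_mutex_lock(struct model_mutex *m)
{
    assert(!m->locked);         /* a real mutex would sleep here */
    m->locked = true;
}

static void model_mutex_unlock(struct model_mutex *m)
{
    assert(m->locked);
    m->locked = false;
}

struct model_inode { struct model_mutex i_mutex; };

/* inode_foo(inode) is mutex_foo(&inode->i_mutex): only these two
 * functions know which lock type lives inside the inode. */
static void model_inode_lock(struct model_inode *inode)
{
    model_mutex_lock(&inode->i_mutex);
}

static void model_inode_unlock(struct model_inode *inode)
{
    model_mutex_unlock(&inode->i_mutex);
}
```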
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  26. 15 Jan, 2016: 1 commit
    • Make sure that highmem pages are not added to symlink page cache · e8ecde25
      Committed by Al Viro
      inode_nohighmem() is sufficient to make sure that page_get_link()
      won't try to allocate a highmem page.  Moreover, it is sufficient
      to make sure that page_symlink/__page_symlink won't do the same
      thing.  However, any filesystem that manually preseeds the symlink's
      page cache upon symlink(2) needs to make sure that the page it
      inserts there won't be a highmem one.
      
      Fortunately, only nfs and shmem have run afoul of that...
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  27. 29 Dec, 2015: 2 commits
  28. 04 Nov, 2015: 1 commit
  29. 18 Aug, 2015: 2 commits