1. 14 1月, 2011 4 次提交
    • N
      fs: namei fix ->put_link on wrong inode in do_filp_open · 7b9337aa
      Nick Piggin 提交于
      J. R. Okajima noticed that ->put_link is being attempted on the
      wrong inode, and suggested the way to fix it. I changed it a bit
      according to Al's suggestion to keep an explicit link path around.
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      7b9337aa
    • J
      fs: fix do_last error case when need_reval_dot · f20877d9
      J. R. Okajima 提交于
      When open(2) without O_DIRECTORY opens an existing dir, it should return
      EISDIR. In do_last(), the variable 'error' is initialized EISDIR, but it
      is changed by d_revalidate() which returns any positive to represent
      'the target dir is valid.'
      
      Should we keep and return the initialized 'error' in this case.
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      f20877d9
    • N
      fs: fix dropping of rcu-walk from force_reval_path · 90dbb77b
      Nick Piggin 提交于
      As J. R. Okajima noted, force_reval_path passes in the same dentry to
      d_revalidate as the one in the nameidata structure (other callers pass in a
      child), so the locking breaks. This can oops with a chrooted nfs mount, for
      example. Similarly there can be other problems with revalidating a dentry
      which is already in nameidata of the path walk.
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      90dbb77b
    • N
      fs: force_reval_path drop rcu-walk before d_invalidate · bb20c18d
      Nick Piggin 提交于
      d_revalidate can return in rcu-walk mode even when it returns 0.  We can't just
      call any old dcache function on rcu-walk dentry (the dentry is unstable, so
      even through d_lock can safely be taken, the result may no longer be what we
      expect -- careful re-checks would be required). So just drop rcu in this case.
      
      (I missed this conversion when switching to the rcu-walk convention that Linus
      suggested)
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      bb20c18d
  2. 07 1月, 2011 9 次提交
    • N
      fs: scale mntget/mntput · b3e19d92
      Nick Piggin 提交于
      The problem that this patch aims to fix is vfsmount refcounting scalability.
      We need to take a reference on the vfsmount for every successful path lookup,
      which often go to the same mount point.
      
      The fundamental difficulty is that a "simple" reference count can never be made
      scalable, because any time a reference is dropped, we must check whether that
      was the last reference. To do that requires communication with all other CPUs
      that may have taken a reference count.
      
      We can make refcounts more scalable in a couple of ways, involving keeping
      distributed counters, and checking for the global-zero condition less
      frequently.
      
      - check the global sum once every interval (this will delay zero detection
        for some interval, so it's probably a showstopper for vfsmounts).
      
      - keep a local count and only taking the global sum when local reaches 0 (this
        is difficult for vfsmounts, because we can't hold preempt off for the life of
        a reference, so a counter would need to be per-thread or tied strongly to a
        particular CPU which requires more locking).
      
      - keep a local difference of increments and decrements, which allows us to sum
        the total difference and hence find the refcount when summing all CPUs. Then,
        keep a single integer "long" refcount for slow and long lasting references,
        and only take the global sum of local counters when the long refcount is 0.
      
      This last scheme is what I implemented here. Attached mounts and process root
      and working directory references are "long" references, and everything else is
      a short reference.
      
      This allows scalable vfsmount references during path walking over mounted
      subtrees and unattached (lazy umounted) mounts with processes still running
      in them.
      
      This results in one fewer atomic op in the fastpath: mntget is now just a
      per-CPU inc, rather than an atomic inc; and mntput just requires a spinlock
      and non-atomic decrement in the common case. However code is otherwise bigger
      and heavier, so single threaded performance is basically a wash.
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      b3e19d92
    • N
      fs: provide rcu-walk aware permission i_ops · b74c79e9
      Nick Piggin 提交于
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      b74c79e9
    • N
      fs: rcu-walk aware d_revalidate method · 34286d66
      Nick Piggin 提交于
      Require filesystems be aware of .d_revalidate being called in rcu-walk
      mode (nd->flags & LOOKUP_RCU). For now do a simple push down, returning
      -ECHILD from all implementations.
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      34286d66
    • N
      fs: dcache reduce branches in lookup path · fb045adb
      Nick Piggin 提交于
      Reduce some branches and memory accesses in dcache lookup by adding dentry
      flags to indicate common d_ops are set, rather than having to check them.
      This saves a pointer memory access (dentry->d_op) in common path lookup
      situations, and saves another pointer load and branch in cases where we
      have d_op but not the particular operation.
      
      Patched with:
      
      git grep -E '[.>]([[:space:]])*d_op([[:space:]])*=' | xargs sed -e 's/\([^\t ]*\)->d_op = \(.*\);/d_set_d_op(\1, \2);/' -e 's/\([^\t ]*\)\.d_op = \(.*\);/d_set_d_op(\&\1, \2);/' -i
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      fb045adb
    • N
      fs: fs_struct use seqlock · c28cc364
      Nick Piggin 提交于
      Use a seqlock in the fs_struct to enable us to take an atomic copy of the
      complete cwd and root paths. Use this in the RCU lookup path to avoid a
      thread-shared spinlock in RCU lookup operations.
      
      Multi-threaded apps may now perform path lookups with scalability matching
      multi-process apps. Operations such as stat(2) become very scalable for
      multi-threaded workload.
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      c28cc364
    • N
      fs: rcu-walk for path lookup · 31e6b01f
      Nick Piggin 提交于
      Perform common cases of path lookups without any stores or locking in the
      ancestor dentry elements. This is called rcu-walk, as opposed to the current
      algorithm which is a refcount based walk, or ref-walk.
      
      This results in far fewer atomic operations on every path element,
      significantly improving path lookup performance. It also avoids cacheline
      bouncing on common dentries, significantly improving scalability.
      
      The overall design is like this:
      * LOOKUP_RCU is set in nd->flags, which distinguishes rcu-walk from ref-walk.
      * Take the RCU lock for the entire path walk, starting with the acquiring
        of the starting path (eg. root/cwd/fd-path). So now dentry refcounts are
        not required for dentry persistence.
      * synchronize_rcu is called when unregistering a filesystem, so we can
        access d_ops and i_ops during rcu-walk.
      * Similarly take the vfsmount lock for the entire path walk. So now mnt
        refcounts are not required for persistence. Also we are free to perform mount
        lookups, and to assume dentry mount points and mount roots are stable up and
        down the path.
      * Have a per-dentry seqlock to protect the dentry name, parent, and inode,
        so we can load this tuple atomically, and also check whether any of its
        members have changed.
      * Dentry lookups (based on parent, candidate string tuple) recheck the parent
        sequence after the child is found in case anything changed in the parent
        during the path walk.
      * inode is also RCU protected so we can load d_inode and use the inode for
        limited things.
      * i_mode, i_uid, i_gid can be tested for exec permissions during path walk.
      * i_op can be loaded.
      
      When we reach the destination dentry, we lock it, recheck lookup sequence,
      and increment its refcount and mountpoint refcount. RCU and vfsmount locks
      are dropped. This is termed "dropping rcu-walk". If the dentry refcount does
      not match, we can not drop rcu-walk gracefully at the current point in the
      lokup, so instead return -ECHILD (for want of a better errno). This signals the
      path walking code to re-do the entire lookup with a ref-walk.
      
      Aside from the final dentry, there are other situations that may be encounted
      where we cannot continue rcu-walk. In that case, we drop rcu-walk (ie. take
      a reference on the last good dentry) and continue with a ref-walk. Again, if
      we can drop rcu-walk gracefully, we return -ECHILD and do the whole lookup
      using ref-walk. But it is very important that we can continue with ref-walk
      for most cases, particularly to avoid the overhead of double lookups, and to
      gain the scalability advantages on common path elements (like cwd and root).
      
      The cases where rcu-walk cannot continue are:
      * NULL dentry (ie. any uncached path element)
      * parent with d_inode->i_op->permission or ACLs
      * dentries with d_revalidate
      * Following links
      
      In future patches, permission checks and d_revalidate become rcu-walk aware. It
      may be possible eventually to make following links rcu-walk aware.
      
      Uncached path elements will always require dropping to ref-walk mode, at the
      very least because i_mutex needs to be grabbed, and objects allocated.
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      31e6b01f
    • N
      fs: dcache remove dcache_lock · b5c84bf6
      Nick Piggin 提交于
      dcache_lock no longer protects anything. remove it.
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      b5c84bf6
    • N
      fs: dcache scale dentry refcount · b7ab39f6
      Nick Piggin 提交于
      Make d_count non-atomic and protect it with d_lock. This allows us to ensure a
      0 refcount dentry remains 0 without dcache_lock. It is also fairly natural when
      we start protecting many other dentry members with d_lock.
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      b7ab39f6
    • N
      fs: change d_hash for rcu-walk · b1e6a015
      Nick Piggin 提交于
      Change d_hash so it may be called from lock-free RCU lookups. See similar
      patch for d_compare for details.
      
      For in-tree filesystems, this is just a mechanical change.
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      b1e6a015
  3. 08 12月, 2010 1 次提交
  4. 29 10月, 2010 1 次提交
    • A
      fix open/umount race · d893f1bc
      Al Viro 提交于
      nameidata_to_filp() drops nd->path or transfers it to opened
      file.  In the former case it's a Bad Idea(tm) to do mnt_drop_write()
      on nd->path.mnt, since we might race with umount and vfsmount in
      question might be gone already.
      
      Fix: don't drop it, then...  IOW, have nameidata_to_filp() grab nd->path
      in case it transfers it to file and do path_drop() in callers.  After
      they are through with accessing nd->path...
      Reported-by: NMiklos Szeredi <miklos@szeredi.hu>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      d893f1bc
  5. 26 10月, 2010 2 次提交
  6. 18 8月, 2010 4 次提交
    • N
      fs: brlock vfsmount_lock · 99b7db7b
      Nick Piggin 提交于
      fs: brlock vfsmount_lock
      
      Use a brlock for the vfsmount lock. It must be taken for write whenever
      modifying the mount hash or associated fields, and may be taken for read when
      performing mount hash lookups.
      
      A new lock is added for the mnt-id allocator, so it doesn't need to take
      the heavy vfsmount write-lock.
      
      The number of atomics should remain the same for fastpath rlock cases, though
      code would be slightly slower due to per-cpu access. Scalability is not not be
      much improved in common cases yet, due to other locks (ie. dcache_lock) getting
      in the way. However path lookups crossing mountpoints should be one case where
      scalability is improved (currently requiring the global lock).
      
      The slowpath is slower due to use of brlock. On a 64 core, 64 socket, 32 node
      Altix system (high latency to remote nodes), a simple umount microbenchmark
      (mount --bind mnt mnt2 ; umount mnt2 loop 1000 times), before this patch it
      took 6.8s, afterwards took 7.1s, about 5% slower.
      
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      99b7db7b
    • N
      fs: remove extra lookup in __lookup_hash · b04f784e
      Nick Piggin 提交于
      fs: remove extra lookup in __lookup_hash
      
      Optimize lookup for create operations, where no dentry should often be
      common-case. In cases where it is not, such as unlink, the added overhead
      is much smaller than the removed.
      
      Also, move comments about __d_lookup racyness to the __d_lookup call site.
      d_lookup is intuitive; __d_lookup is what needs commenting. So in that same
      vein, add kerneldoc comments to __d_lookup and clean up some of the comments:
      
      - We are interested in how the RCU lookup works here, particularly with
        renames. Make that explicit, and point to the document where it is explained
        in more detail.
      - RCU is pretty standard now, and macros make implementations pretty mindless.
        If we want to know about RCU barrier details, we look in RCU code.
      - Delete some boring legacy comments because we don't care much about how the
        code used to work, more about the interesting parts of how it works now. So
        comments about lazy LRU may be interesting, but would better be done in the
        LRU or refcount management code.
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      b04f784e
    • N
      fs: dentry allocation consolidation · baa03890
      Nick Piggin 提交于
      fs: dentry allocation consolidation
      
      There are 2 duplicate copies of code in dentry allocation in path lookup.
      Consolidate them into a single function.
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      baa03890
    • N
      fs: fix do_lookup false negative · 2e2e88ea
      Nick Piggin 提交于
      fs: fix do_lookup false negative
      
      In do_lookup, if we initially find no dentry, we take the directory i_mutex and
      re-check the lookup. If we find a dentry there, then we revalidate it if
      needed. However if that revalidate asks for the dentry to be invalidated, we
      return -ENOENT from do_lookup. What should happen instead is an attempt to
      allocate and lookup a new dentry.
      
      This is probably not noticed because it is rare. It is only reached if a
      concurrent create races in first (in which case, the dentry probably won't be
      invalidated anyway), or if the racy __d_lookup has failed due to a
      false-negative (which is very rare).
      
      Fix this by removing code and have it use the normal reval path.
      Signed-off-by: NNick Piggin <npiggin@kernel.dk>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      2e2e88ea
  7. 11 8月, 2010 1 次提交
  8. 02 8月, 2010 2 次提交
  9. 28 7月, 2010 1 次提交
    • E
      fsnotify: use unsigned char * for dentry->d_name.name · 59b0df21
      Eric Paris 提交于
      fsnotify was using char * when it passed around the d_name.name string
      internally but it is actually an unsigned char *.  This patch switches
      fsnotify to use unsigned and should silence some pointer signess warnings
      which have popped out of xfs.  I do not add -Wpointer-sign to the fsnotify
      code as there are still issues with kstrdup and strlen which would pop
      out needless warnings.
      Signed-off-by: NEric Paris <eparis@redhat.com>
      59b0df21
  10. 28 5月, 2010 1 次提交
    • N
      VFS: fix recent breakage of FS_REVAL_DOT · 176306f5
      Neil Brown 提交于
      Commit 1f36f774 broke FS_REVAL_DOT semantics.
      
      In particular, before this patch, the command
         ls -l
      in an NFS mounted directory would always check if the directory on the server
      had changed and if so would flush and refill the pagecache for the dir.
      After this patch, the same "ls -l" will repeatedly return stale date until
      the cached attributes for the directory time out.
      
      The following patch fixes this by ensuring the d_revalidate is called by
      do_last when "." is being looked-up.
      link_path_walk has already called d_revalidate, but in that case LOOKUP_OPEN
      is not set so nfs_lookup_verify_inode chooses not to do any validation.
      
      The following patch restores the original behaviour.
      
      Cc: stable@kernel.org
      Signed-off-by: NNeilBrown <neilb@suse.de>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      176306f5
  11. 22 5月, 2010 1 次提交
  12. 15 5月, 2010 1 次提交
    • A
      Fix the regression created by "set S_DEAD on unlink()..." commit · d83c49f3
      Al Viro 提交于
      1) i_flags simply doesn't work for mount/unlink race prevention;
      we may have many links to file and rm on one of those obviously
      shouldn't prevent bind on top of another later on.  To fix it
      right way we need to mark _dentry_ as unsuitable for mounting
      upon; new flag (DCACHE_CANT_MOUNT) is protected by d_flags and
      i_mutex on the inode in question.  Set it (with dont_mount(dentry))
      in unlink/rmdir/etc., check (with cant_mount(dentry)) in places
      in namespace.c that used to check for S_DEAD.  Setting S_DEAD
      is still needed in places where we used to set it (for directories
      getting killed), since we rely on it for readdir/rmdir race
      prevention.
      
      2) rename()/mount() protection has another bogosity - we unhash
      the target before we'd checked that it's not a mountpoint.  Fixed.
      
      3) ancient bogosity in pivot_root() - we locked i_mutex on the
      right directory, but checked S_DEAD on the different (and wrong)
      one.  Noticed and fixed.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      d83c49f3
  13. 13 5月, 2010 1 次提交
  14. 27 3月, 2010 1 次提交
  15. 07 3月, 2010 1 次提交
  16. 05 3月, 2010 9 次提交