1. 09 Oct, 2014 4 commits
    • vfs: Lazily remove mounts on unlinked files and directories. · 8ed936b5
      Eric W. Biederman committed
      With the introduction of mount namespaces and bind mounts it became
      possible to access files and directories that on some paths are mount
      points but are not mount points on other paths.  It is very confusing
      when rm -rf somedir returns -EBUSY simply because somedir is mounted
      somewhere else.  With the addition of user namespaces allowing
      unprivileged mounts this condition has gone from annoying to allowing
      a DoS attack on other users in the system.
      
      The possibility for mischief is removed by updating the vfs to support
      rename, unlink and rmdir on a dentry that is a mountpoint and by
      lazily unmounting mountpoints on deleted dentries.
      
      In particular this change allows rename, unlink and rmdir system calls
      on a dentry without a mountpoint in the current mount namespace to
      succeed, and it allows rename, unlink, and rmdir performed on a
      distributed filesystem to update the vfs cache even when there is a
      mount in some namespace on the original dentry.
      
      There are two common patterns of maintaining mounts: Mounts on trusted
      paths with the parent directory of the mount point and all ancestor
      directories up to / owned by root and modifiable only by root
      (i.e. /media/xxx, /dev, /dev/pts, /proc, /sys, /sys/fs/cgroup/{cpu,
      cpuacct, ...}, /usr, /usr/local).  Mounts on unprivileged directories
      maintained by fusermount.
      
      In the case of mounts in trusted directories owned by root and
      modifiable only by root the current parent directory permissions are
      sufficient to ensure a mount point on a trusted path is not removed
      or renamed by anyone other than root, even if there is a context
      where there are no mount points to prevent this.
      
      In the case of mounts in directories owned by less privileged users
      races with users modifying the path of a mount point are already a
      danger.  fusermount already uses a combination of chdir,
      /proc/<pid>/fd/NNN, and UMOUNT_NOFOLLOW to prevent these races.  The
      removal of global rename, unlink, and rmdir protection really adds
      nothing new to consider, only a widening of the attack window, and
      fusermount is already safe against unprivileged users modifying the
      directory simultaneously.
      
      In principle for perfect userspace programs returning -EBUSY for
      unlink, rmdir, and rename of dentries that have mounts in the local
      namespace is actually unnecessary.  Unfortunately not all userspace
      programs are perfect so retaining -EBUSY for unlink, rmdir and rename
      of dentries that have mounts in the current mount namespace plays an
      important role of maintaining consistency with historical behavior and
      making imperfect userspace applications hard to exploit.
      
      v2: Remove spurious old_dentry.
      v3: Optimized shrink_submounts_and_drop
          Removed unused afs label
      v4: Simplified the changes to check_submounts_and_drop
          Do not rename check_submounts_and_drop shrink_submounts_and_drop
          Document why we need atomicity in check_submounts_and_drop
          Rely on the parent inode mutex to make d_revalidate and d_invalidate
          an atomic unit.
      v5: Refcount the mountpoint to detach in case of simultaneous
          renames.
      Reviewed-by: Miklos Szeredi <miklos@szeredi.hu>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      8ed936b5
    • vfs: More precise tests in d_invalidate · bafc9b75
      Eric W. Biederman committed
      The current comments in d_invalidate about what and why it is doing
      what it is doing are wildly off-base.  Which is not surprising as
      the comments date back to a last-minute bug fix of the 2.2 kernel.
      
      The big fat lie of a comment said: If it's a directory, we can't drop
      it for fear of somebody re-populating it with children (even though
      dropping it would make it unreachable from that root, we still might
      repopulate it if it was a working directory or similar).
      
      [AV] What we really need to avoid is multiple dentry aliases of the
      same directory inode; on all filesystems that have ->d_revalidate()
      we either declare all positive dentries always valid (and thus never
      fed to d_invalidate()) or use d_materialise_unique() and/or d_splice_alias(),
      which take care of alias prevention.
      
      The current rules are:
      - To prevent mount point leaks, dentries that are mount points or that
        have children that are mount points may not be unhashed.
      - All dentries may be unhashed.
      - Directories may be rehashed with d_materialise_unique
      
      check_submounts_and_drop implements this already for well maintained
      remote filesystems so implement the current rules in d_invalidate
      by just calling check_submounts_and_drop.
      
      The one difference between d_invalidate and check_submounts_and_drop
      is that d_invalidate must respect it when a d_revalidate method has
      earlier called d_drop so preserve the d_unhashed check in
      d_invalidate.
      Reviewed-by: Miklos Szeredi <miklos@szeredi.hu>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      bafc9b75
    • vfs: Document the effect of d_revalidate on d_find_alias · 3ccb354d
      Eric W. Biederman committed
      d_drop or check_submounts_and_drop called from d_revalidate can result
      in renamed directories with child dentries being unhashed.  These
      renamed and dropped directory dentries can be rehashed after
      d_materialise_unique uses d_find_alias to find them.
      Reviewed-by: Miklos Szeredi <miklos@szeredi.hu>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      3ccb354d
    • Allow sharing external names after __d_move() · 8d85b484
      Al Viro committed
      * external dentry names get a small structure prepended to them
      (struct external_name).
      * it contains an atomic refcount, matching the number of struct dentry
      instances that have ->d_name.name pointing to that external name.  The
      first thing free_dentry() does is decrementing refcount of external name,
      so the instances that are between the call of free_dentry() and
      RCU-delayed actual freeing do not contribute.
      * __d_move(x, y, false) makes the name of x equal to the name of y,
      external or not.  If y has an external name, extra reference is grabbed
      and put into x->d_name.name.  If x used to have an external name, the
      reference to the old name is dropped and, should it reach zero, freeing
      is scheduled via kfree_rcu().
      * free_dentry() in dentry with external name decrements the refcount of
      that name and, should it reach zero, does RCU-delayed call that will
      free both the dentry and external name.  Otherwise it does what it
      used to do, except that __d_free() doesn't even look at ->d_name.name;
      it simply frees the dentry.
      
      All non-RCU accesses to dentry external name are safe wrt freeing since they
      all should happen before free_dentry() is called.  RCU accesses might run
      into a dentry seen by free_dentry() or into an old name that got already
      dropped by __d_move(); however, in both cases dentry must have been
      alive and refer to that name at some point after we'd done rcu_read_lock(),
      which means that any freeing must be still pending.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      8d85b484
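The refcounting scheme described in commit 8d85b484 can be illustrated with a small userspace sketch. This is not the kernel code: `struct ext_name`, `ext_name_alloc()`, `ext_name_get()` and `ext_name_put()` are hypothetical names, C11 atomics stand in for the kernel's atomic type, and the last put frees immediately where the kernel would defer the free past an RCU grace period.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical userspace stand-in for the header prepended to an external name. */
struct ext_name {
    atomic_int count;   /* number of holders whose name pointer refers to name[] */
    char name[];        /* NUL-terminated name stored right after the header */
};

/* Allocate header and name in one block; the caller holds the first reference. */
static struct ext_name *ext_name_alloc(const char *s)
{
    size_t len = strlen(s);
    struct ext_name *p = malloc(sizeof(*p) + len + 1);

    if (!p)
        return NULL;
    atomic_init(&p->count, 1);
    memcpy(p->name, s, len + 1);
    return p;
}

/* A second holder starts sharing the same name (the "extra reference"). */
static void ext_name_get(struct ext_name *p)
{
    atomic_fetch_add(&p->count, 1);
}

/*
 * Drop one reference.  Returns true if this was the last holder and the
 * block was freed.  Assumption: an immediate free() replaces the
 * RCU-delayed freeing the commit message describes.
 */
static bool ext_name_put(struct ext_name *p)
{
    if (atomic_fetch_sub(&p->count, 1) != 1)
        return false;
    free(p);
    return true;
}
```

With two holders sharing one name, the first `ext_name_put()` only decrements the count and the second one releases the block, which mirrors the "should it reach zero" wording in the message.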
  2. 30 Sep, 2014 1 commit
    • missing data dependency barrier in prepend_name() · 6d13f694
      Al Viro committed
      AFAICS, prepend_name() is broken on SMP alpha.  Disclaimer: I don't have
      SMP alpha boxen to reproduce it on.  However, it really looks like the race
      is real.
      
      CPU1: d_path() on /mnt/ramfs/<255-character>/foo
      CPU2: mv /mnt/ramfs/<255-character> /mnt/ramfs/<63-character>
      
      CPU2 does d_alloc(), which allocates an external name, stores the name there
      including terminating NUL, does smp_wmb() and stores its address in
      dentry->d_name.name.  It proceeds to d_add(dentry, NULL) and d_move()
      old dentry over to that.  ->d_name.name value ends up in that dentry.
      
      In the meanwhile, CPU1 gets to prepend_name() for that dentry.  It fetches
      ->d_name.name and ->d_name.len; the former ends up pointing to new name
      (64-byte kmalloc'ed array), the latter - 255 (length of the old name).
      Nothing to force the ordering there, and normally that would be OK, since we'd
      run into the terminating NUL and stop.  Except that it's alpha, and we'd need
      a data dependency barrier to guarantee that we see that store of NUL
      __d_alloc() has done.  In a similar situation dentry_cmp() would survive; it
      does explicit smp_read_barrier_depends() after fetching ->d_name.name.
      prepend_name() doesn't and it risks walking past the end of kmalloc'ed object
      and possibly oops due to taking a page fault in kernel mode.
      
      Cc: stable@vger.kernel.org # 3.12+
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      6d13f694
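The race in commit 6d13f694 comes down to a reader pairing a new name pointer with a stale length and relying on the terminating NUL to stop. A minimal userspace sketch of that pattern follows. `struct name_ref`, `publish_name()` and `copy_name()` are hypothetical names, and a C11 acquire load is used here in place of the kernel's data dependency barrier; it is stronger than strictly required but gives the same guarantee that the NUL written before publication is visible.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a name pointer and length that a reader may see out of sync. */
struct name_ref {
    _Atomic(const char *) name;
    _Atomic unsigned int len;
};

/* Writer: the name bytes, including the NUL, must be visible before the pointer is. */
static void publish_name(struct name_ref *r, const char *s, unsigned int len)
{
    atomic_store_explicit(&r->len, len, memory_order_relaxed);
    atomic_store_explicit(&r->name, s, memory_order_release);
}

/*
 * Reader: nothing orders the len and name loads against each other, so
 * len may be stale.  The acquire load pairs with the release store above,
 * so the bytes behind the pointer (and their NUL) are guaranteed visible,
 * and the loop stops at the NUL even when len is too large.
 * out must have room for at least one byte.
 */
static size_t copy_name(struct name_ref *r, char *out, size_t outsz)
{
    unsigned int len = atomic_load_explicit(&r->len, memory_order_relaxed);
    const char *s = atomic_load_explicit(&r->name, memory_order_acquire);
    size_t n = 0;

    while (n < len && n + 1 < outsz && s[n] != '\0') {
        out[n] = s[n];
        n++;
    }
    out[n] = '\0';
    return n;
}
```

Calling `copy_name()` on a short name while the stored length still says 255 copies only up to the NUL, which is the behaviour the commit message says is "normally OK" once the ordering is guaranteed.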
  3. 28 Sep, 2014 2 commits
    • vfs: Don't exchange "short" filenames unconditionally. · d2fa4a84
      Mikhail Efremov committed
      Only exchange source and destination filenames
      if flags contain RENAME_EXCHANGE.
      If an executable file was running and was replaced by another
      file, /proc/PID/exe should still show the correct file name,
      not the old name of the file that replaced it.
      
      The scenario when this bug manifests itself was like this:
      * ALT Linux uses rpm and start-stop-daemon;
      * during a package upgrade rpm creates a temporary file
        for an executable to rename it upon successful unpacking;
      * start-stop-daemon is run subsequently and it obtains
        the (nonexistent) temporary filename via /proc/PID/exe
        thus failing to identify the running process.
      
      Note that "long" filenames (> DNAME_INLINE_LEN) are still
      exchanged without RENAME_EXCHANGE and this behaviour has existed
      for a long time (it apparently should be fixed too).
      So this patch is just an interim workaround that restores
      behavior for "short" names as it was before changes
      introduced by commit da1ce067 ("vfs: add cross-rename").
      
      See https://lkml.org/lkml/2014/9/7/6 for details.
      
      AV: the comments about being more careful with ->d_name.hash
      than with ->d_name.name are from back in 2.3.40s; they
      became obsolete by 2.3.60s, when we started to unhash the
      target instead of swapping hash chain positions followed
      by d_delete() as we used to do when dcache was first
      introduced.
      Acked-by: Miklos Szeredi <mszeredi@suse.cz>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Cc: stable@vger.kernel.org
      Fixes: da1ce067 "vfs: add cross-rename"
      Signed-off-by: Mikhail Efremov <sem@altlinux.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      d2fa4a84
    • fold swapping ->d_name.hash into switch_names() · a28ddb87
      Linus Torvalds committed
      and do it along with ->d_name.len there
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      a28ddb87
  4. 27 Sep, 2014 6 commits
  5. 14 Sep, 2014 2 commits
    • move the call of __d_drop(anon) into __d_materialise_unique(dentry, anon) · 6f18493e
      Al Viro committed
      and lock the right list there
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      6f18493e
    • vfs: fix bad hashing of dentries · 99d263d4
      Linus Torvalds committed
      Josef Bacik found a performance regression between 3.2 and 3.10 and
      narrowed it down to commit bfcfaa77 ("vfs: use 'unsigned long'
      accesses for dcache name comparison and hashing"). He reports:
      
       "The test case is essentially
      
            for (i = 0; i < 1000000; i++)
                    mkdir("a$i");
      
        On xfs on a fio card this goes at about 20k dir/sec with 3.2, and 12k
        dir/sec with 3.10.  This is because we spend waaaaay more time in
        __d_lookup on 3.10 than in 3.2.
      
        The new hashing function for strings is suboptimal for <
        sizeof(unsigned long) string names (and hell even > sizeof(unsigned
        long) string names that I've tested).  I broke out the old hashing
        function and the new one into a userspace helper to get real numbers
        and this is what I'm getting:
      
            Old hash table had 1000000 entries, 0 dupes, 0 max dupes
            New hash table had 12628 entries, 987372 dupes, 900 max dupes
            We had 11400 buckets with a p50 of 30 dupes, p90 of 240 dupes, p99 of 567 dupes for the new hash
      
        My test does the hash, and then does the d_hash into an integer pointer
        array the same size as the dentry hash table on my system, and then
        just increments the value at the address we got to see how many
        entries we overlap with.
      
        As you can see the old hash function ended up with all 1 million
        entries in their own bucket, whereas the new one they are only
        distributed among ~12.5k buckets, which is why we're using so much
        more CPU in __d_lookup".
      
      The reason for this hash regression is two-fold:
      
       - On 64-bit architectures the down-mixing of the original 64-bit
         word-at-a-time hash into the final 32-bit hash value is very
         simplistic and suboptimal, and just adds the two 32-bit parts
         together.
      
         In particular, because there is no bit shuffling and the mixing
         boundary is also a byte boundary, similar character patterns in the
         low and high word easily end up just canceling each other out.
      
       - the old byte-at-a-time hash mixed each byte into the final hash as it
         hashed the path component name, resulting in the low bits of the hash
         generally being a good source of hash data.  That is not true for the
         word-at-a-time case, and the hash data is distributed among all the
         bits.
      
      The fix is the same in both cases: do a better job of mixing the bits up
      and using as much of the hash data as possible.  We already have the
      "hash_32|64()" functions to do that.
      Reported-by: Josef Bacik <jbacik@fb.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Chris Mason <clm@fb.com>
      Cc: linux-fsdevel@vger.kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      99d263d4
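The first cause named in commit 99d263d4, folding a 64-bit word into 32 bits by adding its halves, can be shown in a few lines of userspace C. This is a sketch under assumptions, not the kernel's hash: `load_word()`, `fold_add()` and `fold_mul()` are hypothetical helpers, and the multiplier is a commonly used 64-bit golden-ratio constant chosen for illustration rather than taken from the kernel source. The collision shown (two names whose 4-byte halves are swapped) is just one simple failure mode of an additive fold.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Load 8 name bytes as one word, the way a word-at-a-time hash consumes input. */
static uint64_t load_word(const char *s)
{
    uint64_t w;

    memcpy(&w, s, sizeof(w));
    return w;
}

/* Simplistic down-mix: add the two 32-bit halves, with no bit shuffling. */
static uint32_t fold_add(uint64_t w)
{
    return (uint32_t)w + (uint32_t)(w >> 32);
}

/*
 * Multiplicative down-mix: multiply by a large odd constant and keep the
 * top 32 bits, so every input bit can influence the result.
 * Assumption: the constant is illustrative, not the kernel's.
 */
static uint32_t fold_mul(uint64_t w)
{
    return (uint32_t)((w * UINT64_C(0x9E3779B97F4A7C15)) >> 32);
}
```

Because addition is commutative, `fold_add()` gives the same value for "abcdwxyz" and "wxyzabcd" even though the two words differ; `fold_mul()` has no such symmetry between the halves.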
  6. 08 Aug, 2014 9 commits
  7. 12 Jun, 2014 1 commit
    • lock_parent: don't step on stale ->d_parent of all-but-freed one · c2338f2d
      Al Viro committed
      Dentry that had been through (or into) __dentry_kill() might be seen
      by shrink_dentry_list(); that's normal, it'll be taken off the shrink
      list and freed if __dentry_kill() has already finished.  The problem
      is, its ->d_parent might be pointing to already freed dentry, so
      lock_parent() needs to be careful.
      
      We need to check that dentry hasn't already gone into __dentry_kill()
      *and* grab rcu_read_lock() before dropping ->d_lock - the latter makes
      sure that whatever we see in ->d_parent after dropping ->d_lock it
      won't be freed until we drop rcu_read_lock().
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      c2338f2d
  8. 07 Jun, 2014 1 commit
  9. 01 Jun, 2014 1 commit
    • dcache: add missing lockdep annotation · 9f12600f
      Linus Torvalds committed
      lock_parent() very much on purpose does nested locking of dentries, and
      is careful to maintain the right order (lock parent first).  But because
      it didn't annotate the nested locking order, lockdep thought it might be
      a deadlock on d_lock, and complained.
      
      Add the proper annotation for the inner locking of the child dentry to
      make lockdep happy.
      
      Introduced by commit 046b961b ("shrink_dentry_list(): take parent's
      ->d_lock earlier").
      Reported-and-tested-by: Josh Boyer <jwboyer@fedoraproject.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9f12600f
  10. 30 May, 2014 3 commits
    • dentry_kill() doesn't need the second argument now · 8cbf74da
      Al Viro committed
      it's 1 in the only remaining caller.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      8cbf74da
    • dealing with the rest of shrink_dentry_list() livelock · b2b80195
      Al Viro committed
      We have the same problem with ->d_lock order in the inner loop, where
      we are dropping references to ancestors.  Same solution, basically -
      instead of using dentry_kill() we use lock_parent() (introduced in the
      previous commit) to get that lock in a safe way, recheck ->d_count
      (in case lock_parent() has ended up dropping and retaking ->d_lock
      and somebody managed to grab a reference during that window), trylock
      the inode->i_lock and use __dentry_kill() to do the rest.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      b2b80195
    • shrink_dentry_list(): take parent's ->d_lock earlier · 046b961b
      Al Viro committed
      The cause of livelocks there is that we are taking ->d_lock on
      dentry and its parent in the wrong order, forcing us to use
      trylock on the parent's one.  d_walk() takes them in the right
      order, and unfortunately it's not hard to create a situation
      when shrink_dentry_list() can't make progress since trylock
      keeps failing, and shrink_dcache_parent() or check_submounts_and_drop()
      keeps calling d_walk() disrupting the very shrink_dentry_list() it's
      waiting for.
      
      Solution is straightforward - if that trylock fails, let's unlock
      the dentry itself and take locks in the right order.  We need to
      stabilize ->d_parent without holding ->d_lock, but that's doable
      using RCU.  And we'd better do that in the very beginning of the
      loop in shrink_dentry_list(), since the checks on refcount, etc.
      would need to be redone anyway.
      
      That deals with half of the problem - killing dentries on the
      shrink list itself.  Another one (dropping their parents) is
      in the next commit.
      
      Locking the parent is interesting - it would be easy to do rcu_read_lock(),
      lock whatever we think is a parent, lock dentry itself and check
      if the parent is still the right one.  Except that we need to check
      that *before* locking the dentry, or we are risking taking ->d_lock
      out of order.  Fortunately, once the D1 is locked, we can check if
      D2->d_parent is equal to D1 without the need to lock D2; D2->d_parent
      can start or stop pointing to D1 only under D1->d_lock, so taking
      D1->d_lock is enough.  In other words, the right solution is
      rcu_read_lock/lock what looks like parent right now/check if it's
      still our parent/rcu_read_unlock/lock the child.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      046b961b
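The lock ordering that closes commit 046b961b (lock what looks like the parent, recheck that it is still the parent, then lock the child) can be sketched in userspace. Everything here is hypothetical: `struct node` is a miniature stand-in for a dentry, an `atomic_flag` spinlock replaces ->d_lock, and the sketch assumes nodes are never freed, which is the guarantee the kernel gets from rcu_read_lock(). It also assumes the child is not its own parent.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical miniature dentry: a spinlock and a parent pointer. */
struct node {
    atomic_flag lock;
    struct node *_Atomic parent;
};

static void node_init(struct node *n, struct node *parent)
{
    atomic_flag_clear(&n->lock);
    atomic_init(&n->parent, parent);
}

static void node_lock(struct node *n)
{
    while (atomic_flag_test_and_set_explicit(&n->lock, memory_order_acquire))
        ;   /* spin until the holder clears the flag */
}

static void node_unlock(struct node *n)
{
    atomic_flag_clear_explicit(&n->lock, memory_order_release);
}

/*
 * Take the parent's lock first, then the child's, and return the locked
 * parent.  The parent pointer is rechecked under the candidate parent's
 * lock, before the child is locked, so locks are never taken out of
 * order.  Assumptions: nodes are never freed (no RCU in this sketch) and
 * child->parent != child.
 */
static struct node *lock_parent_then_child(struct node *child)
{
    struct node *p;

    for (;;) {
        p = atomic_load(&child->parent);
        node_lock(p);
        if (atomic_load(&child->parent) == p)
            break;              /* still our parent: the ordering is safe */
        node_unlock(p);         /* reparented in the window: retry */
    }
    node_lock(child);
    return p;
}
```

The recheck works for the reason the message gives: the child's parent pointer can start or stop pointing at a node only while that node's lock is held, so holding the candidate's lock is enough to make the comparison stable.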
  11. 29 May, 2014 2 commits
  12. 28 May, 2014 1 commit
  13. 04 May, 2014 3 commits
    • dcache: don't need rcu in shrink_dentry_list() · 60942f2f
      Miklos Szeredi committed
      Since now the shrink list is private and nobody can free the dentry while
      it is on the shrink list, we can remove RCU protection from this.
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      60942f2f
    • more graceful recovery in umount_collect() · 9c8c10e2
      Al Viro committed
      Start with shrink_dcache_parent(), then scan what remains.
      
      First of all, BUG() is very much an overkill here; we are holding
      ->s_umount, and hitting BUG() means that a lot of interesting stuff
      will be hanging after that point (sync(2), for example).  Moreover,
      in cases when there had been more than one leak, we'll be better
      off reporting all of them.  And more than just the last component
      of pathname - %pd is there for just such uses...
      
      That was the last user of dentry_lru_del(), so kill it off...
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      9c8c10e2
    • don't remove from shrink list in select_collect() · fe91522a
      Al Viro committed
      If we find something already on a shrink list, just increment
      data->found and do nothing else.  Loops in shrink_dcache_parent() and
      check_submounts_and_drop() will do the right thing - everything we
      did put into our list will be evicted and if there had been nothing,
      but data->found got non-zero, well, we have somebody else shrinking
      those guys; just try again.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      fe91522a
  14. 01 May, 2014 4 commits