提交 · 5542aa2fa7f6cddb03c4ac3135e390adffda98ca · xiphi1978 / linux

09 10月, 2014 6 次提交

vfs: Make d_invalidate return void · 5542aa2f

由 Eric W. Biederman 提交于 2月 13, 2014

Now that d_invalidate can no longer fail, stop returning a useless
return code.  For the few callers that checked the return code update
remove the handling of d_invalidate failure.
Reviewed-by: NMiklos Szeredi <miklos@szeredi.hu>
Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

5542aa2f

vfs: Merge check_submounts_and_drop and d_invalidate · 1ffe46d1

由 Eric W. Biederman 提交于 2月 13, 2014

Now that d_invalidate is the only caller of check_submounts_and_drop,
expand check_submounts_and_drop inline in d_invalidate.
Reviewed-by: NMiklos Szeredi <miklos@szeredi.hu>
Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

1ffe46d1

vfs: Lazily remove mounts on unlinked files and directories. · 8ed936b5

由 Eric W. Biederman 提交于 10月 01, 2013

With the introduction of mount namespaces and bind mounts it became
possible to access files and directories that on some paths are mount
points but are not mount points on other paths.  It is very confusing
when rm -rf somedir returns -EBUSY simply because somedir is mounted
somewhere else.  With the addition of user namespaces allowing
unprivileged mounts this condition has gone from annoying to allowing
a DOS attack on other users in the system.

The possibility for mischief is removed by updating the vfs to support
rename, unlink and rmdir on a dentry that is a mountpoint and by
lazily unmounting mountpoints on deleted dentries.

In particular this change allows rename, unlink and rmdir system calls
on a dentry without a mountpoint in the current mount namespace to
succeed, and it allows rename, unlink, and rmdir performed on a
distributed filesystem to update the vfs cache even if when there is a
mount in some namespace on the original dentry.

There are two common patterns of maintaining mounts: Mounts on trusted
paths with the parent directory of the mount point and all ancestory
directories up to / owned by root and modifiable only by root
(i.e. /media/xxx, /dev, /dev/pts, /proc, /sys, /sys/fs/cgroup/{cpu,
cpuacct, ...}, /usr, /usr/local).  Mounts on unprivileged directories
maintained by fusermount.

In the case of mounts in trusted directories owned by root and
modifiable only by root the current parent directory permissions are
sufficient to ensure a mount point on a trusted path is not removed
or renamed by anyone other than root, even if there is a context
where the there are no mount points to prevent this.

In the case of mounts in directories owned by less privileged users
races with users modifying the path of a mount point are already a
danger.  fusermount already uses a combination of chdir,
/proc/<pid>/fd/NNN, and UMOUNT_NOFOLLOW to prevent these races.  The
removable of global rename, unlink, and rmdir protection really adds
nothing new to consider only a widening of the attack window, and
fusermount is already safe against unprivileged users modifying the
directory simultaneously.

In principle for perfect userspace programs returning -EBUSY for
unlink, rmdir, and rename of dentires that have mounts in the local
namespace is actually unnecessary.  Unfortunately not all userspace
programs are perfect so retaining -EBUSY for unlink, rmdir and rename
of dentries that have mounts in the current mount namespace plays an
important role of maintaining consistency with historical behavior and
making imperfect userspace applications hard to exploit.

v2: Remove spurious old_dentry.
v3: Optimized shrink_submounts_and_drop
    Removed unsued afs label
v4: Simplified the changes to check_submounts_and_drop
    Do not rename check_submounts_and_drop shrink_submounts_and_drop
    Document what why we need atomicity in check_submounts_and_drop
    Rely on the parent inode mutex to make d_revalidate and d_invalidate
    an atomic unit.
v5: Refcount the mountpoint to detach in case of simultaneous
    renames.
Reviewed-by: NMiklos Szeredi <miklos@szeredi.hu>
Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

8ed936b5

vfs: More precise tests in d_invalidate · bafc9b75

由 Eric W. Biederman 提交于 2月 13, 2014

The current comments in d_invalidate about what and why it is doing
what it is doing are wildly off-base.  Which is not surprising as
the comments date back to last minute bug fix of the 2.2 kernel.

The big fat lie of a comment said: If it's a directory, we can't drop
it for fear of somebody re-populating it with children (even though
dropping it would make it unreachable from that root, we still might
repopulate it if it was a working directory or similar).

[AV] What we really need to avoid is multiple dentry aliases of the
same directory inode; on all filesystems that have ->d_revalidate()
we either declare all positive dentries always valid (and thus never
fed to d_invalidate()) or use d_materialise_unique() and/or d_splice_alias(),
which take care of alias prevention.

The current rules are:
- To prevent mount point leaks dentries that are mount points or that
  have childrent that are mount points may not be be unhashed.
- All dentries may be unhashed.
- Directories may be rehashed with d_materialise_unique

check_submounts_and_drop implements this already for well maintained
remote filesystems so implement the current rules in d_invalidate
by just calling check_submounts_and_drop.

The one difference between d_invalidate and check_submounts_and_drop
is that d_invalidate must respect it when a d_revalidate method has
earlier called d_drop so preserve the d_unhashed check in
d_invalidate.
Reviewed-by: NMiklos Szeredi <miklos@szeredi.hu>
Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

bafc9b75

vfs: Document the effect of d_revalidate on d_find_alias · 3ccb354d

由 Eric W. Biederman 提交于 2月 12, 2014

d_drop or check_submounts_and_drop called from d_revalidate can result
in renamed directories with child dentries being unhashed.  These
renamed and drop directory dentries can be rehashed after
d_materialise_unique uses d_find_alias to find them.
Reviewed-by: NMiklos Szeredi <miklos@szeredi.hu>
Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

3ccb354d

Allow sharing external names after __d_move() · 8d85b484

由 Al Viro 提交于 9月 29, 2014

* external dentry names get a small structure prepended to them
(struct external_name).
* it contains an atomic refcount, matching the number of struct dentry
instances that have ->d_name.name pointing to that external name.  The
first thing free_dentry() does is decrementing refcount of external name,
so the instances that are between the call of free_dentry() and
RCU-delayed actual freeing do not contribute.
* __d_move(x, y, false) makes the name of x equal to the name of y,
external or not.  If y has an external name, extra reference is grabbed
and put into x->d_name.name.  If x used to have an external name, the
reference to the old name is dropped and, should it reach zero, freeing
is scheduled via kfree_rcu().
* free_dentry() in dentry with external name decrements the refcount of
that name and, should it reach zero, does RCU-delayed call that will
free both the dentry and external name.  Otherwise it does what it
used to do, except that __d_free() doesn't even look at ->d_name.name;
it simply frees the dentry.

All non-RCU accesses to dentry external name are safe wrt freeing since they
all should happen before free_dentry() is called.  RCU accesses might run
into a dentry seen by free_dentry() or into an old name that got already
dropped by __d_move(); however, in both cases dentry must have been
alive and refer to that name at some point after we'd done rcu_read_lock(),
which means that any freeing must be still pending.
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

8d85b484

30 9月, 2014 1 次提交

missing data dependency barrier in prepend_name() · 6d13f694

由 Al Viro 提交于 9月 29, 2014

AFAICS, prepend_name() is broken on SMP alpha.  Disclaimer: I don't have
SMP alpha boxen to reproduce it on.  However, it really looks like the race
is real.

CPU1: d_path() on /mnt/ramfs/<255-character>/foo
CPU2: mv /mnt/ramfs/<255-character> /mnt/ramfs/<63-character>

CPU2 does d_alloc(), which allocates an external name, stores the name there
including terminating NUL, does smp_wmb() and stores its address in
dentry->d_name.name.  It proceeds to d_add(dentry, NULL) and d_move()
old dentry over to that.  ->d_name.name value ends up in that dentry.

In the meanwhile, CPU1 gets to prepend_name() for that dentry.  It fetches
->d_name.name and ->d_name.len; the former ends up pointing to new name
(64-byte kmalloc'ed array), the latter - 255 (length of the old name).
Nothing to force the ordering there, and normally that would be OK, since we'd
run into the terminating NUL and stop.  Except that it's alpha, and we'd need
a data dependency barrier to guarantee that we see that store of NUL
__d_alloc() has done.  In a similar situation dentry_cmp() would survive; it
does explicit smp_read_barrier_depends() after fetching ->d_name.name.
prepend_name() doesn't and it risks walking past the end of kmalloc'ed object
and possibly oops due to taking a page fault in kernel mode.

Cc: stable@vger.kernel.org # 3.12+
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

6d13f694

28 9月, 2014 2 次提交

vfs: Don't exchange "short" filenames unconditionally. · d2fa4a84

由 Mikhail Efremov 提交于 9月 24, 2014

Only exchange source and destination filenames
if flags contain RENAME_EXCHANGE.
In case if executable file was running and replaced by
other file /proc/PID/exe should still show correct file name,
not the old name of the file by which it was replaced.

The scenario when this bug manifests itself was like this:
* ALT Linux uses rpm and start-stop-daemon;
* during a package upgrade rpm creates a temporary file
  for an executable to rename it upon successful unpacking;
* start-stop-daemon is run subsequently and it obtains
  the (nonexistant) temporary filename via /proc/PID/exe
  thus failing to identify the running process.

Note that "long" filenames (> DNAiME_INLINE_LEN) are still
exchanged without RENAME_EXCHANGE and this behaviour exists
long enough (should be fixed too apparently).
So this patch is just an interim workaround that restores
behavior for "short" names as it was before changes
introduced by commit da1ce067 ("vfs: add cross-rename").

See https://lkml.org/lkml/2014/9/7/6 for details.

AV: the comments about being more careful with ->d_name.hash
than with ->d_name.name are from back in 2.3.40s; they
became obsolete by 2.3.60s, when we started to unhash the
target instead of swapping hash chain positions followed
by d_delete() as we used to do when dcache was first
introduced.
Acked-by: NMiklos Szeredi <mszeredi@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Cc: stable@vger.kernel.org
Fixes: da1ce067 "vfs: add cross-rename"
Signed-off-by: NMikhail Efremov <sem@altlinux.org>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

d2fa4a84

fold swapping ->d_name.hash into switch_names() · a28ddb87

由 Linus Torvalds 提交于 9月 24, 2014

and do it along with ->d_name.len there
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

a28ddb87

27 9月, 2014 6 次提交

fold unlocking the children into dentry_unlock_parents_for_move() · 986c0194

由 Al Viro 提交于 9月 26, 2014

... renaming it into dentry_unlock_for_move() and making it more
symmetric with dentry_lock_for_move().
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

986c0194

A
kill __d_materialise_dentry() · 63cf427a
由 Al Viro 提交于 9月 26, 2014
```
it folds into __d_move() now
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
63cf427a

__d_materialise_dentry(): flip the order of arguments · 4453641f

由 Al Viro 提交于 9月 26, 2014

... thus making it much closer to (now unreachable, BTW) IS_ROOT(dentry)
case in __d_move().  A bit more and it'll fold in.
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

4453641f

__d_move(): fold manipulations with ->d_child/->d_subdirs · 9d8cd306

由 Al Viro 提交于 9月 26, 2014

list_del() + list_add() is a slightly pessimised list_move()
list_del() + INIT_LIST_HEAD() is a slightly pessimised list_del_init()

Interleaving those makes the resulting code even worse. And harder to follow...
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

9d8cd306

A
don't open-code d_rehash() in d_materialise_unique() · 8527dd71
由 Al Viro 提交于 9月 26, 2014
```
... and get rid of duplicate BUG_ON() there
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
8527dd71
A
pull rehashing and unlocking the target dentry into __d_materialise_dentry() · 5cc3821b
由 Al Viro 提交于 9月 26, 2014
```
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
5cc3821b

14 9月, 2014 2 次提交

A
move the call of __d_drop(anon) into __d_materialise_unique(dentry, anon) · 6f18493e
由 Al Viro 提交于 9月 11, 2014
```
and lock the right list there

Cc: stable@vger.kernel.org
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
6f18493e

vfs: fix bad hashing of dentries · 99d263d4

由 Linus Torvalds 提交于 9月 13, 2014

Josef Bacik found a performance regression between 3.2 and 3.10 and
narrowed it down to commit bfcfaa77 ("vfs: use 'unsigned long'
accesses for dcache name comparison and hashing"). He reports:

 "The test case is essentially

      for (i = 0; i < 1000000; i++)
              mkdir("a$i");

  On xfs on a fio card this goes at about 20k dir/sec with 3.2, and 12k
  dir/sec with 3.10.  This is because we spend waaaaay more time in
  __d_lookup on 3.10 than in 3.2.

  The new hashing function for strings is suboptimal for <
  sizeof(unsigned long) string names (and hell even > sizeof(unsigned
  long) string names that I've tested).  I broke out the old hashing
  function and the new one into a userspace helper to get real numbers
  and this is what I'm getting:

      Old hash table had 1000000 entries, 0 dupes, 0 max dupes
      New hash table had 12628 entries, 987372 dupes, 900 max dupes
      We had 11400 buckets with a p50 of 30 dupes, p90 of 240 dupes, p99 of 567 dupes for the new hash

  My test does the hash, and then does the d_hash into a integer pointer
  array the same size as the dentry hash table on my system, and then
  just increments the value at the address we got to see how many
  entries we overlap with.

  As you can see the old hash function ended up with all 1 million
  entries in their own bucket, whereas the new one they are only
  distributed among ~12.5k buckets, which is why we're using so much
  more CPU in __d_lookup".

The reason for this hash regression is two-fold:

 - On 64-bit architectures the down-mixing of the original 64-bit
   word-at-a-time hash into the final 32-bit hash value is very
   simplistic and suboptimal, and just adds the two 32-bit parts
   together.

   In particular, because there is no bit shuffling and the mixing
   boundary is also a byte boundary, similar character patterns in the
   low and high word easily end up just canceling each other out.

 - the old byte-at-a-time hash mixed each byte into the final hash as it
   hashed the path component name, resulting in the low bits of the hash
   generally being a good source of hash data.  That is not true for the
   word-at-a-time case, and the hash data is distributed among all the
   bits.

The fix is the same in both cases: do a better job of mixing the bits up
and using as much of the hash data as possible.  We already have the
"hash_32|64()" functions to do that.
Reported-by: NJosef Bacik <jbacik@fb.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Chris Mason <clm@fb.com>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

99d263d4

08 8月, 2014 9 次提交

fs: mark __d_obtain_alias static · 49c7dd28

由 Fengguang Wu 提交于 7月 31, 2014

Signed-off-by: NFengguang Wu <fengguang.wu@intel.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

49c7dd28

dcache: d_splice_alias should detect loops · 95ad5c29

由 J. Bruce Fields 提交于 3月 12, 2014

I believe this can only happen in the case of a corrupted filesystem.
So -EIO looks like the appropriate error.
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

95ad5c29

dcache: d_find_alias needn't recheck IS_ROOT && DCACHE_DISCONNECTED · 8d80d7da

由 J. Bruce Fields 提交于 1月 16, 2014

If we get to this point and discover the dentry is not a root dentry, or
not DCACHE_DISCONNECTED--great, we always prefer that anyway.
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

8d80d7da

dcache: remove unused d_find_alias parameter · 52ed46f0

由 J. Bruce Fields 提交于 1月 16, 2014

Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

52ed46f0

dcache: d_obtain_alias callers don't all want DISCONNECTED · 1a0a397e

由 J. Bruce Fields 提交于 2月 14, 2014

There are a few d_obtain_alias callers that are using it to get the
root of a filesystem which may already have an alias somewhere else.

This is not the same as the filehandle-lookup case, and none of them
actually need DCACHE_DISCONNECTED set.

It isn't really a serious problem, but it would really be clearer if we
reserved DCACHE_DISCONNECTED for those cases where it's actually needed.

In the btrfs case this was causing a spurious printk from
nfsd/nfsfh.c:fh_verify when it found an unexpected DCACHE_DISCONNECTED
dentry.  Josef worked around this by unsetting DCACHE_DISCONNECTED
manually in 3a0dfa6a "Btrfs: unset DCACHE_DISCONNECTED when mounting
default subvol", and this replaces that workaround.

Cc: Josef Bacik <jbacik@fb.com>
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

1a0a397e

dcache: d_splice_alias should ignore DCACHE_DISCONNECTED · da093a9b

由 J. Bruce Fields 提交于 2月 17, 2014

Any IS_ROOT() alias should be safe to use; there's nothing special about
DCACHE_DISCONNECTED dentries.

Note that this is in fact useful for filesystems such as btrfs which can
legimately encounter a directory with a preexisting IS_ROOT alias on a
lookup that crosses into a subvolume.  (Those aliases are currently
marked DCACHE_DISCONNECTED--but not really for any good reason, and
we'll change that soon.)
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

da093a9b

dcache: d_splice_alias mustn't create directory aliases · 908790fa

由 J. Bruce Fields 提交于 2月 17, 2014

Currently if d_splice_alias finds a directory with an alias that is not
IS_ROOT or not DCACHE_DISCONNECTED, it creates a duplicate directory.

Duplicate directory dentries are unacceptable; it is better just to
error out.

(In the case of a local filesystem the most likely case is filesystem
corruption: for example, perhaps two directories point to the same child
directory, and the other parent has already been found and cached.)

Note that distributed filesystems may encounter this case in normal
operation if a remote host moves a directory to a location different
from the one we last cached in the dcache.  For that reason, such
filesystems should instead use d_materialise_unique, which tries to move
the old directory alias to the right place instead of erroring out.
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

908790fa

dcache: close d_move race in d_splice_alias · 75a2352d

由 J. Bruce Fields 提交于 2月 17, 2014

d_splice_alias will d_move an IS_ROOT() directory dentry into place if
one exists.  This should be safe as long as the dentry remains IS_ROOT,
but I can't see what guarantees that: once we drop the i_lock all we
hold here is the i_mutex on an unrelated parent directory.

Instead copy the logic of d_materialise_unique.
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

75a2352d

dcache: move d_splice_alias · 3f70bd51

由 J. Bruce Fields 提交于 2月 18, 2014

Just a trivial move to locate it near (similar) d_materialise_unique
code and save some forward references in a following patch.
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

3f70bd51

12 6月, 2014 1 次提交

lock_parent: don't step on stale ->d_parent of all-but-freed one · c2338f2d

由 Al Viro 提交于 6月 12, 2014

Dentry that had been through (or into) __dentry_kill() might be seen
by shrink_dentry_list(); that's normal, it'll be taken off the shrink
list and freed if __dentry_kill() has already finished.  The problem
is, its ->d_parent might be pointing to already freed dentry, so
lock_parent() needs to be careful.

We need to check that dentry hasn't already gone into __dentry_kill()
*and* grab rcu_read_lock() before dropping ->d_lock - the latter makes
sure that whatever we see in ->d_parent after dropping ->d_lock it
won't be freed until we drop rcu_read_lock().
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

c2338f2d

07 6月, 2014 1 次提交

fs: convert use of typedef ctl_table to struct ctl_table · 1f7e0616

由 Joe Perches 提交于 6月 06, 2014

This typedef is unnecessary and should just be removed.
Signed-off-by: NJoe Perches <joe@perches.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

1f7e0616

01 6月, 2014 1 次提交

dcache: add missing lockdep annotation · 9f12600f

由 Linus Torvalds 提交于 5月 31, 2014

lock_parent() very much on purpose does nested locking of dentries, and
is careful to maintain the right order (lock parent first).  But because
it didn't annotate the nested locking order, lockdep thought it might be
a deadlock on d_lock, and complained.

Add the proper annotation for the inner locking of the child dentry to
make lockdep happy.

Introduced by commit 046b961b ("shrink_dentry_list(): take parent's
->d_lock earlier").
Reported-and-tested-by: NJosh Boyer <jwboyer@fedoraproject.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

9f12600f

30 5月, 2014 3 次提交

A
dentry_kill() doesn't need the second argument now · 8cbf74da
由 Al Viro 提交于 5月 29, 2014
```
it's 1 in the only remaining caller.
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
8cbf74da

dealing with the rest of shrink_dentry_list() livelock · b2b80195

由 Al Viro 提交于 5月 29, 2014

We have the same problem with ->d_lock order in the inner loop, where
we are dropping references to ancestors.  Same solution, basically -
instead of using dentry_kill() we use lock_parent() (introduced in the
previous commit) to get that lock in a safe way, recheck ->d_count
(in case if lock_parent() has ended up dropping and retaking ->d_lock
and somebody managed to grab a reference during that window), trylock
the inode->i_lock and use __dentry_kill() to do the rest.
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

b2b80195

shrink_dentry_list(): take parent's ->d_lock earlier · 046b961b

由 Al Viro 提交于 5月 29, 2014

The cause of livelocks there is that we are taking ->d_lock on
dentry and its parent in the wrong order, forcing us to use
trylock on the parent's one.  d_walk() takes them in the right
order, and unfortunately it's not hard to create a situation
when shrink_dentry_list() can't make progress since trylock
keeps failing, and shrink_dcache_parent() or check_submounts_and_drop()
keeps calling d_walk() disrupting the very shrink_dentry_list() it's
waiting for.

Solution is straightforward - if that trylock fails, let's unlock
the dentry itself and take locks in the right order.  We need to
stabilize ->d_parent without holding ->d_lock, but that's doable
using RCU.  And we'd better do that in the very beginning of the
loop in shrink_dentry_list(), since the checks on refcount, etc.
would need to be redone anyway.

That deals with a half of the problem - killing dentries on the
shrink list itself.  Another one (dropping their parents) is
in the next commit.

locking parent is interesting - it would be easy to do rcu_read_lock(),
lock whatever we think is a parent, lock dentry itself and check
if the parent is still the right one.  Except that we need to check
that *before* locking the dentry, or we are risking taking ->d_lock
out of order.  Fortunately, once the D1 is locked, we can check if
D2->d_parent is equal to D1 without the need to lock D2; D2->d_parent
can start or stop pointing to D1 only under D1->d_lock, so taking
D1->d_lock is enough.  In other words, the right solution is
rcu_read_lock/lock what looks like parent right now/check if it's
still our parent/rcu_read_unlock/lock the child.
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

046b961b

29 5月, 2014 2 次提交

expand dentry_kill(dentry, 0) in shrink_dentry_list() · ff2fde99

由 Al Viro 提交于 5月 28, 2014

Result will be massaged to saner shape in the next commits.  It is
ugly, no questions - the point of that one is to be a provably
equivalent transformation (and it might be worth splitting a bit
more).
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

ff2fde99

split dentry_kill() · e55fd011

由 Al Viro 提交于 5月 28, 2014

... into trylocks and everything else.  The latter (actual killing)
is __dentry_kill().
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

e55fd011

28 5月, 2014 1 次提交

lift the "already marked killed" case into shrink_dentry_list() · 64fd72e0

由 Al Viro 提交于 5月 28, 2014

It can happen only when dentry_kill() is called with unlock_on_failure
equal to 0 - other callers had dentry pinned until the moment they've
got ->d_lock and DCACHE_DENTRY_KILLED is set only after lockref_mark_dead().

IOW, only one of three call sites of dentry_kill() might end up reaching
that code. Just move it there.
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

64fd72e0

04 5月, 2014 3 次提交

dcache: don't need rcu in shrink_dentry_list() · 60942f2f

由 Miklos Szeredi 提交于 5月 02, 2014

Since now the shrink list is private and nobody can free the dentry while
it is on the shrink list, we can remove RCU protection from this.
Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

60942f2f

more graceful recovery in umount_collect() · 9c8c10e2

由 Al Viro 提交于 5月 02, 2014

Start with shrink_dcache_parent(), then scan what remains.

First of all, BUG() is very much an overkill here; we are holding
->s_umount, and hitting BUG() means that a lot of interesting stuff
will be hanging after that point (sync(2), for example).  Moreover,
in cases when there had been more than one leak, we'll be better
off reporting all of them.  And more than just the last component
of pathname - %pd is there for just such uses...

That was the last user of dentry_lru_del(), so kill it off...
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

9c8c10e2

don't remove from shrink list in select_collect() · fe91522a

由 Al Viro 提交于 5月 03, 2014

	If we find something already on a shrink list, just increment
data->found and do nothing else.  Loops in shrink_dcache_parent() and
check_submounts_and_drop() will do the right thing - everything we
did put into our list will be evicted and if there had been nothing,
but data->found got non-zero, well, we have somebody else shrinking
those guys; just try again.
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

fe91522a

01 5月, 2014 2 次提交

dentry_kill(): don't try to remove from shrink list · 41edf278

由 Al Viro 提交于 5月 01, 2014

If the victim in on the shrink list, don't remove it from there.
If shrink_dentry_list() manages to remove it from the list before
we are done - fine, we'll just free it as usual.  If not - mark
it with new flag (DCACHE_MAY_FREE) and leave it there.

Eventually, shrink_dentry_list() will get to it, remove the sucker
from shrink list and call dentry_kill(dentry, 0).  Which is where
we'll deal with freeing.

Since now dentry_kill(dentry, 0) may happen after or during
dentry_kill(dentry, 1), we need to recognize that (by seeing
DCACHE_DENTRY_KILLED already set), unlock everything
and either free the sucker (in case DCACHE_MAY_FREE has been
set) or leave it for ongoing dentry_kill(dentry, 1) to deal with.
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

41edf278

A
expand the call of dentry_lru_del() in dentry_kill() · 01b60351
由 Al Viro 提交于 4月 29, 2014
```
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
01b60351