- 09 10月, 2014 4 次提交
-
-
由 Eric W. Biederman 提交于
With the introduction of mount namespaces and bind mounts it became possible to access files and directories that on some paths are mount points but are not mount points on other paths. It is very confusing when rm -rf somedir returns -EBUSY simply because somedir is mounted somewhere else. With the addition of user namespaces allowing unprivileged mounts this condition has gone from annoying to allowing a DOS attack on other users in the system. The possibility for mischief is removed by updating the vfs to support rename, unlink and rmdir on a dentry that is a mountpoint and by lazily unmounting mountpoints on deleted dentries. In particular this change allows rename, unlink and rmdir system calls on a dentry without a mountpoint in the current mount namespace to succeed, and it allows rename, unlink, and rmdir performed on a distributed filesystem to update the vfs cache even if when there is a mount in some namespace on the original dentry. There are two common patterns of maintaining mounts: Mounts on trusted paths with the parent directory of the mount point and all ancestory directories up to / owned by root and modifiable only by root (i.e. /media/xxx, /dev, /dev/pts, /proc, /sys, /sys/fs/cgroup/{cpu, cpuacct, ...}, /usr, /usr/local). Mounts on unprivileged directories maintained by fusermount. In the case of mounts in trusted directories owned by root and modifiable only by root the current parent directory permissions are sufficient to ensure a mount point on a trusted path is not removed or renamed by anyone other than root, even if there is a context where the there are no mount points to prevent this. In the case of mounts in directories owned by less privileged users races with users modifying the path of a mount point are already a danger. fusermount already uses a combination of chdir, /proc/<pid>/fd/NNN, and UMOUNT_NOFOLLOW to prevent these races. The removable of global rename, unlink, and rmdir protection really adds nothing new to consider only a widening of the attack window, and fusermount is already safe against unprivileged users modifying the directory simultaneously. In principle for perfect userspace programs returning -EBUSY for unlink, rmdir, and rename of dentires that have mounts in the local namespace is actually unnecessary. Unfortunately not all userspace programs are perfect so retaining -EBUSY for unlink, rmdir and rename of dentries that have mounts in the current mount namespace plays an important role of maintaining consistency with historical behavior and making imperfect userspace applications hard to exploit. v2: Remove spurious old_dentry. v3: Optimized shrink_submounts_and_drop Removed unsued afs label v4: Simplified the changes to check_submounts_and_drop Do not rename check_submounts_and_drop shrink_submounts_and_drop Document what why we need atomicity in check_submounts_and_drop Rely on the parent inode mutex to make d_revalidate and d_invalidate an atomic unit. v5: Refcount the mountpoint to detach in case of simultaneous renames. Reviewed-by: NMiklos Szeredi <miklos@szeredi.hu> Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Eric W. Biederman 提交于
The current comments in d_invalidate about what and why it is doing what it is doing are wildly off-base. Which is not surprising as the comments date back to last minute bug fix of the 2.2 kernel. The big fat lie of a comment said: If it's a directory, we can't drop it for fear of somebody re-populating it with children (even though dropping it would make it unreachable from that root, we still might repopulate it if it was a working directory or similar). [AV] What we really need to avoid is multiple dentry aliases of the same directory inode; on all filesystems that have ->d_revalidate() we either declare all positive dentries always valid (and thus never fed to d_invalidate()) or use d_materialise_unique() and/or d_splice_alias(), which take care of alias prevention. The current rules are: - To prevent mount point leaks dentries that are mount points or that have childrent that are mount points may not be be unhashed. - All dentries may be unhashed. - Directories may be rehashed with d_materialise_unique check_submounts_and_drop implements this already for well maintained remote filesystems so implement the current rules in d_invalidate by just calling check_submounts_and_drop. The one difference between d_invalidate and check_submounts_and_drop is that d_invalidate must respect it when a d_revalidate method has earlier called d_drop so preserve the d_unhashed check in d_invalidate. Reviewed-by: NMiklos Szeredi <miklos@szeredi.hu> Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Eric W. Biederman 提交于
d_drop or check_submounts_and_drop called from d_revalidate can result in renamed directories with child dentries being unhashed. These renamed and drop directory dentries can be rehashed after d_materialise_unique uses d_find_alias to find them. Reviewed-by: NMiklos Szeredi <miklos@szeredi.hu> Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Al Viro 提交于
* external dentry names get a small structure prepended to them (struct external_name). * it contains an atomic refcount, matching the number of struct dentry instances that have ->d_name.name pointing to that external name. The first thing free_dentry() does is decrementing refcount of external name, so the instances that are between the call of free_dentry() and RCU-delayed actual freeing do not contribute. * __d_move(x, y, false) makes the name of x equal to the name of y, external or not. If y has an external name, extra reference is grabbed and put into x->d_name.name. If x used to have an external name, the reference to the old name is dropped and, should it reach zero, freeing is scheduled via kfree_rcu(). * free_dentry() in dentry with external name decrements the refcount of that name and, should it reach zero, does RCU-delayed call that will free both the dentry and external name. Otherwise it does what it used to do, except that __d_free() doesn't even look at ->d_name.name; it simply frees the dentry. All non-RCU accesses to dentry external name are safe wrt freeing since they all should happen before free_dentry() is called. RCU accesses might run into a dentry seen by free_dentry() or into an old name that got already dropped by __d_move(); however, in both cases dentry must have been alive and refer to that name at some point after we'd done rcu_read_lock(), which means that any freeing must be still pending. Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 30 9月, 2014 1 次提交
-
-
由 Al Viro 提交于
AFAICS, prepend_name() is broken on SMP alpha. Disclaimer: I don't have SMP alpha boxen to reproduce it on. However, it really looks like the race is real. CPU1: d_path() on /mnt/ramfs/<255-character>/foo CPU2: mv /mnt/ramfs/<255-character> /mnt/ramfs/<63-character> CPU2 does d_alloc(), which allocates an external name, stores the name there including terminating NUL, does smp_wmb() and stores its address in dentry->d_name.name. It proceeds to d_add(dentry, NULL) and d_move() old dentry over to that. ->d_name.name value ends up in that dentry. In the meanwhile, CPU1 gets to prepend_name() for that dentry. It fetches ->d_name.name and ->d_name.len; the former ends up pointing to new name (64-byte kmalloc'ed array), the latter - 255 (length of the old name). Nothing to force the ordering there, and normally that would be OK, since we'd run into the terminating NUL and stop. Except that it's alpha, and we'd need a data dependency barrier to guarantee that we see that store of NUL __d_alloc() has done. In a similar situation dentry_cmp() would survive; it does explicit smp_read_barrier_depends() after fetching ->d_name.name. prepend_name() doesn't and it risks walking past the end of kmalloc'ed object and possibly oops due to taking a page fault in kernel mode. Cc: stable@vger.kernel.org # 3.12+ Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 28 9月, 2014 2 次提交
-
-
由 Mikhail Efremov 提交于
Only exchange source and destination filenames if flags contain RENAME_EXCHANGE. In case if executable file was running and replaced by other file /proc/PID/exe should still show correct file name, not the old name of the file by which it was replaced. The scenario when this bug manifests itself was like this: * ALT Linux uses rpm and start-stop-daemon; * during a package upgrade rpm creates a temporary file for an executable to rename it upon successful unpacking; * start-stop-daemon is run subsequently and it obtains the (nonexistant) temporary filename via /proc/PID/exe thus failing to identify the running process. Note that "long" filenames (> DNAiME_INLINE_LEN) are still exchanged without RENAME_EXCHANGE and this behaviour exists long enough (should be fixed too apparently). So this patch is just an interim workaround that restores behavior for "short" names as it was before changes introduced by commit da1ce067 ("vfs: add cross-rename"). See https://lkml.org/lkml/2014/9/7/6 for details. AV: the comments about being more careful with ->d_name.hash than with ->d_name.name are from back in 2.3.40s; they became obsolete by 2.3.60s, when we started to unhash the target instead of swapping hash chain positions followed by d_delete() as we used to do when dcache was first introduced. Acked-by: NMiklos Szeredi <mszeredi@suse.cz> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Cc: stable@vger.kernel.org Fixes: da1ce067 "vfs: add cross-rename" Signed-off-by: NMikhail Efremov <sem@altlinux.org> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Linus Torvalds 提交于
and do it along with ->d_name.len there Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 27 9月, 2014 6 次提交
-
-
由 Al Viro 提交于
... renaming it into dentry_unlock_for_move() and making it more symmetric with dentry_lock_for_move(). Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Al Viro 提交于
it folds into __d_move() now Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Al Viro 提交于
... thus making it much closer to (now unreachable, BTW) IS_ROOT(dentry) case in __d_move(). A bit more and it'll fold in. Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Al Viro 提交于
list_del() + list_add() is a slightly pessimised list_move() list_del() + INIT_LIST_HEAD() is a slightly pessimised list_del_init() Interleaving those makes the resulting code even worse. And harder to follow... Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Al Viro 提交于
... and get rid of duplicate BUG_ON() there Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Al Viro 提交于
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 14 9月, 2014 2 次提交
-
-
由 Al Viro 提交于
and lock the right list there Cc: stable@vger.kernel.org Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Linus Torvalds 提交于
Josef Bacik found a performance regression between 3.2 and 3.10 and narrowed it down to commit bfcfaa77 ("vfs: use 'unsigned long' accesses for dcache name comparison and hashing"). He reports: "The test case is essentially for (i = 0; i < 1000000; i++) mkdir("a$i"); On xfs on a fio card this goes at about 20k dir/sec with 3.2, and 12k dir/sec with 3.10. This is because we spend waaaaay more time in __d_lookup on 3.10 than in 3.2. The new hashing function for strings is suboptimal for < sizeof(unsigned long) string names (and hell even > sizeof(unsigned long) string names that I've tested). I broke out the old hashing function and the new one into a userspace helper to get real numbers and this is what I'm getting: Old hash table had 1000000 entries, 0 dupes, 0 max dupes New hash table had 12628 entries, 987372 dupes, 900 max dupes We had 11400 buckets with a p50 of 30 dupes, p90 of 240 dupes, p99 of 567 dupes for the new hash My test does the hash, and then does the d_hash into a integer pointer array the same size as the dentry hash table on my system, and then just increments the value at the address we got to see how many entries we overlap with. As you can see the old hash function ended up with all 1 million entries in their own bucket, whereas the new one they are only distributed among ~12.5k buckets, which is why we're using so much more CPU in __d_lookup". The reason for this hash regression is two-fold: - On 64-bit architectures the down-mixing of the original 64-bit word-at-a-time hash into the final 32-bit hash value is very simplistic and suboptimal, and just adds the two 32-bit parts together. In particular, because there is no bit shuffling and the mixing boundary is also a byte boundary, similar character patterns in the low and high word easily end up just canceling each other out. - the old byte-at-a-time hash mixed each byte into the final hash as it hashed the path component name, resulting in the low bits of the hash generally being a good source of hash data. That is not true for the word-at-a-time case, and the hash data is distributed among all the bits. The fix is the same in both cases: do a better job of mixing the bits up and using as much of the hash data as possible. We already have the "hash_32|64()" functions to do that. Reported-by: NJosef Bacik <jbacik@fb.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christoph Hellwig <hch@infradead.org> Cc: Chris Mason <clm@fb.com> Cc: linux-fsdevel@vger.kernel.org Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 08 8月, 2014 9 次提交
-
-
由 Fengguang Wu 提交于
Signed-off-by: NFengguang Wu <fengguang.wu@intel.com> Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 J. Bruce Fields 提交于
I believe this can only happen in the case of a corrupted filesystem. So -EIO looks like the appropriate error. Signed-off-by: NJ. Bruce Fields <bfields@redhat.com> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 J. Bruce Fields 提交于
If we get to this point and discover the dentry is not a root dentry, or not DCACHE_DISCONNECTED--great, we always prefer that anyway. Signed-off-by: NJ. Bruce Fields <bfields@redhat.com> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 J. Bruce Fields 提交于
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 J. Bruce Fields 提交于
There are a few d_obtain_alias callers that are using it to get the root of a filesystem which may already have an alias somewhere else. This is not the same as the filehandle-lookup case, and none of them actually need DCACHE_DISCONNECTED set. It isn't really a serious problem, but it would really be clearer if we reserved DCACHE_DISCONNECTED for those cases where it's actually needed. In the btrfs case this was causing a spurious printk from nfsd/nfsfh.c:fh_verify when it found an unexpected DCACHE_DISCONNECTED dentry. Josef worked around this by unsetting DCACHE_DISCONNECTED manually in 3a0dfa6a "Btrfs: unset DCACHE_DISCONNECTED when mounting default subvol", and this replaces that workaround. Cc: Josef Bacik <jbacik@fb.com> Signed-off-by: NJ. Bruce Fields <bfields@redhat.com> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 J. Bruce Fields 提交于
Any IS_ROOT() alias should be safe to use; there's nothing special about DCACHE_DISCONNECTED dentries. Note that this is in fact useful for filesystems such as btrfs which can legimately encounter a directory with a preexisting IS_ROOT alias on a lookup that crosses into a subvolume. (Those aliases are currently marked DCACHE_DISCONNECTED--but not really for any good reason, and we'll change that soon.) Signed-off-by: NJ. Bruce Fields <bfields@redhat.com> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 J. Bruce Fields 提交于
Currently if d_splice_alias finds a directory with an alias that is not IS_ROOT or not DCACHE_DISCONNECTED, it creates a duplicate directory. Duplicate directory dentries are unacceptable; it is better just to error out. (In the case of a local filesystem the most likely case is filesystem corruption: for example, perhaps two directories point to the same child directory, and the other parent has already been found and cached.) Note that distributed filesystems may encounter this case in normal operation if a remote host moves a directory to a location different from the one we last cached in the dcache. For that reason, such filesystems should instead use d_materialise_unique, which tries to move the old directory alias to the right place instead of erroring out. Signed-off-by: NJ. Bruce Fields <bfields@redhat.com> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 J. Bruce Fields 提交于
d_splice_alias will d_move an IS_ROOT() directory dentry into place if one exists. This should be safe as long as the dentry remains IS_ROOT, but I can't see what guarantees that: once we drop the i_lock all we hold here is the i_mutex on an unrelated parent directory. Instead copy the logic of d_materialise_unique. Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJ. Bruce Fields <bfields@redhat.com> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 J. Bruce Fields 提交于
Just a trivial move to locate it near (similar) d_materialise_unique code and save some forward references in a following patch. Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJ. Bruce Fields <bfields@redhat.com> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 12 6月, 2014 1 次提交
-
-
由 Al Viro 提交于
Dentry that had been through (or into) __dentry_kill() might be seen by shrink_dentry_list(); that's normal, it'll be taken off the shrink list and freed if __dentry_kill() has already finished. The problem is, its ->d_parent might be pointing to already freed dentry, so lock_parent() needs to be careful. We need to check that dentry hasn't already gone into __dentry_kill() *and* grab rcu_read_lock() before dropping ->d_lock - the latter makes sure that whatever we see in ->d_parent after dropping ->d_lock it won't be freed until we drop rcu_read_lock(). Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 07 6月, 2014 1 次提交
-
-
由 Joe Perches 提交于
This typedef is unnecessary and should just be removed. Signed-off-by: NJoe Perches <joe@perches.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 01 6月, 2014 1 次提交
-
-
由 Linus Torvalds 提交于
lock_parent() very much on purpose does nested locking of dentries, and is careful to maintain the right order (lock parent first). But because it didn't annotate the nested locking order, lockdep thought it might be a deadlock on d_lock, and complained. Add the proper annotation for the inner locking of the child dentry to make lockdep happy. Introduced by commit 046b961b ("shrink_dentry_list(): take parent's ->d_lock earlier"). Reported-and-tested-by: NJosh Boyer <jwboyer@fedoraproject.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 30 5月, 2014 3 次提交
-
-
由 Al Viro 提交于
it's 1 in the only remaining caller. Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Al Viro 提交于
We have the same problem with ->d_lock order in the inner loop, where we are dropping references to ancestors. Same solution, basically - instead of using dentry_kill() we use lock_parent() (introduced in the previous commit) to get that lock in a safe way, recheck ->d_count (in case if lock_parent() has ended up dropping and retaking ->d_lock and somebody managed to grab a reference during that window), trylock the inode->i_lock and use __dentry_kill() to do the rest. Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Al Viro 提交于
The cause of livelocks there is that we are taking ->d_lock on dentry and its parent in the wrong order, forcing us to use trylock on the parent's one. d_walk() takes them in the right order, and unfortunately it's not hard to create a situation when shrink_dentry_list() can't make progress since trylock keeps failing, and shrink_dcache_parent() or check_submounts_and_drop() keeps calling d_walk() disrupting the very shrink_dentry_list() it's waiting for. Solution is straightforward - if that trylock fails, let's unlock the dentry itself and take locks in the right order. We need to stabilize ->d_parent without holding ->d_lock, but that's doable using RCU. And we'd better do that in the very beginning of the loop in shrink_dentry_list(), since the checks on refcount, etc. would need to be redone anyway. That deals with a half of the problem - killing dentries on the shrink list itself. Another one (dropping their parents) is in the next commit. locking parent is interesting - it would be easy to do rcu_read_lock(), lock whatever we think is a parent, lock dentry itself and check if the parent is still the right one. Except that we need to check that *before* locking the dentry, or we are risking taking ->d_lock out of order. Fortunately, once the D1 is locked, we can check if D2->d_parent is equal to D1 without the need to lock D2; D2->d_parent can start or stop pointing to D1 only under D1->d_lock, so taking D1->d_lock is enough. In other words, the right solution is rcu_read_lock/lock what looks like parent right now/check if it's still our parent/rcu_read_unlock/lock the child. Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 29 5月, 2014 2 次提交
-
-
由 Al Viro 提交于
Result will be massaged to saner shape in the next commits. It is ugly, no questions - the point of that one is to be a provably equivalent transformation (and it might be worth splitting a bit more). Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Al Viro 提交于
... into trylocks and everything else. The latter (actual killing) is __dentry_kill(). Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 28 5月, 2014 1 次提交
-
-
由 Al Viro 提交于
It can happen only when dentry_kill() is called with unlock_on_failure equal to 0 - other callers had dentry pinned until the moment they've got ->d_lock and DCACHE_DENTRY_KILLED is set only after lockref_mark_dead(). IOW, only one of three call sites of dentry_kill() might end up reaching that code. Just move it there. Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 04 5月, 2014 3 次提交
-
-
由 Miklos Szeredi 提交于
Since now the shrink list is private and nobody can free the dentry while it is on the shrink list, we can remove RCU protection from this. Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Al Viro 提交于
Start with shrink_dcache_parent(), then scan what remains. First of all, BUG() is very much an overkill here; we are holding ->s_umount, and hitting BUG() means that a lot of interesting stuff will be hanging after that point (sync(2), for example). Moreover, in cases when there had been more than one leak, we'll be better off reporting all of them. And more than just the last component of pathname - %pd is there for just such uses... That was the last user of dentry_lru_del(), so kill it off... Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Al Viro 提交于
If we find something already on a shrink list, just increment data->found and do nothing else. Loops in shrink_dcache_parent() and check_submounts_and_drop() will do the right thing - everything we did put into our list will be evicted and if there had been nothing, but data->found got non-zero, well, we have somebody else shrinking those guys; just try again. Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 01 5月, 2014 4 次提交
-
-
由 Al Viro 提交于
If the victim in on the shrink list, don't remove it from there. If shrink_dentry_list() manages to remove it from the list before we are done - fine, we'll just free it as usual. If not - mark it with new flag (DCACHE_MAY_FREE) and leave it there. Eventually, shrink_dentry_list() will get to it, remove the sucker from shrink list and call dentry_kill(dentry, 0). Which is where we'll deal with freeing. Since now dentry_kill(dentry, 0) may happen after or during dentry_kill(dentry, 1), we need to recognize that (by seeing DCACHE_DENTRY_KILLED already set), unlock everything and either free the sucker (in case DCACHE_MAY_FREE has been set) or leave it for ongoing dentry_kill(dentry, 1) to deal with. Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Al Viro 提交于
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Al Viro 提交于
The part of old d_free() that dealt with actual freeing of dentry. Taken out of dentry_kill() into a separate function. Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Al Viro 提交于
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-