Commits · f70cac8d9c7125f83048f8b3d1c60f5a041a165c · openanolis / cloud-kernel

10 Sep, 2011 1 commit

vfs: automount should ignore LOOKUP_FOLLOW · 0ec26fd0

Committed by Miklos Szeredi on Sep 05, 2011

Prior to 2.6.38 automount would not trigger on either stat(2) or
lstat(2) on the automount point.

After 2.6.38, with the introduction of the ->d_automount()
infrastructure, stat(2) and others would start triggering automount
while lstat(2), etc. still would not.  This is a regression and a
userspace ABI change.

Problem originally reported here:

  http://thread.gmane.org/gmane.linux.kernel.autofs/6098

It appears that there was an attempt at fixing various userspace tools
to not trigger the automount.  But since the stat system call is
rather common it is impossible to "fix" all userspace.

This patch restores the original behavior, which is to not trigger on
stat(2) and other symlink-following syscalls.

[ It's not really clear what the right behavior is.  Apparently Solaris
  does the "automount on stat, leave alone on lstat".  And some programs
  can get unhappy when "stat+open+fstat" ends up giving a different
  result from the fstat than from the initial stat.

  But the change in 2.6.38 resulted in problems for some people, so
  we're going back to old behavior.  Maybe we can re-visit this
  discussion at some future date  - Linus ]
Reported-by: Leonardo Chiquitto <leonardo.lists@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Acked-by: Ian Kent <raven@themaw.net>
Cc: David Howells <dhowells@redhat.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
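
The behavior change described in this message can be pictured with a small userspace model. This is only a sketch, not the kernel code: the `MODEL_LOOKUP_*` names and values are hypothetical stand-ins for the kernel's LOOKUP_* bits, and the set of other flags that still trigger an automount is an assumption of the sketch rather than something stated in the message. stat(2) is modeled as a lookup with only the follow bit set, lstat(2) as a lookup with no bits set.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's LOOKUP_* bits; values are arbitrary. */
enum model_lookup_flags {
    MODEL_LOOKUP_FOLLOW    = 1 << 0, /* follow a trailing symlink (stat) */
    MODEL_LOOKUP_DIRECTORY = 1 << 1, /* caller wants a directory */
    MODEL_LOOKUP_PARENT    = 1 << 2, /* more path components remain */
    MODEL_LOOKUP_OPEN      = 1 << 3  /* lookup is part of an open */
};

/* Assumed 2.6.38 behavior: the follow bit alone is enough to trigger. */
static bool model_triggers_automount_2_6_38(unsigned int flags)
{
    return (flags & (MODEL_LOOKUP_FOLLOW | MODEL_LOOKUP_DIRECTORY |
                     MODEL_LOOKUP_PARENT | MODEL_LOOKUP_OPEN)) != 0;
}

/* Behavior after this patch: the follow bit is ignored for this decision. */
static bool model_triggers_automount_patched(unsigned int flags)
{
    return (flags & (MODEL_LOOKUP_DIRECTORY | MODEL_LOOKUP_PARENT |
                     MODEL_LOOKUP_OPEN)) != 0;
}
```

In this model a stat-like lookup (`MODEL_LOOKUP_FOLLOW` only) triggers in the 2.6.38 variant and not in the patched one, while an lstat-like lookup (no flags) triggers in neither.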

08 Aug, 2011 1 commit

vfs: rename 'do_follow_link' to 'should_follow_link' · 7813b94a

Committed by Linus Torvalds on Aug 07, 2011

Al points out that the do_follow_link() helper function really is
misnamed - it's about whether we should try to follow a symlink or not,
not about actually doing the following.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

07 Aug, 2011 2 commits

Fix POSIX ACL permission check · 206b1d09

Committed by Ari Savolainen on Aug 06, 2011

After commit 3567866b: "RCUify freeing acls, let check_acl() go ahead in
RCU mode if acl is cached" posix_acl_permission is being called with an
unsupported flag and the permission check fails. This patch fixes the issue.
Signed-off-by: Ari Savolainen <ari.m.savolainen@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

vfs: optimize inode cache access patterns · 3ddcd056

Committed by Linus Torvalds on Aug 06, 2011

The inode structure layout is largely random, and some of the vfs paths
really do care.  The path lookup in particular is already quite D$
intensive, and profiles show that accessing the 'inode->i_op->xyz'
fields is quite costly.

We already optimized the dcache to not unnecessarily load the d_op
structure for members that are often NULL using the DCACHE_OP_xyz bits
in dentry->d_flags, and this does something very similar for the inode
ops that are used during pathname lookup.

It also re-orders the fields so that the fields accessed by 'stat' are
together at the beginning of the inode structure, and roughly in the
order accessed.

The effect of this seems to be in the 1-2% range for an empty kernel
"make -j" run (which is fairly kernel-intensive, mostly in filename
lookup), so it's visible.  The numbers are fairly noisy, though, and
likely depend a lot on exact microarchitecture.  So there's more tuning
to be done.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

03 Aug, 2011 1 commit

RCUify freeing acls, let check_acl() go ahead in RCU mode if acl is cached · 3567866b

Committed by Al Viro on Aug 02, 2011

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

01 Aug, 2011 1 commit

VFS: Fix automount for negative autofs dentries · 5a30d8a2

Committed by David Howells on Jul 11, 2011

Autofs may set the DCACHE_NEED_AUTOMOUNT flag on negative dentries.  These
need attention from the automounter daemon regardless of the LOOKUP_FOLLOW flag.
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

26 Jul, 2011 4 commits

vfs: fix check_acl compile error when CONFIG_FS_POSIX_ACL is not set · 84635d68

Committed by Linus Torvalds on Jul 25, 2011

Commit e77819e5 ("vfs: move ACL cache lookup into generic code")
didn't take the FS_POSIX_ACL config variable into account - when that is
not set, ACL's go away, and the cache helper functions do not exist,
causing compile errors like

fs/namei.c: In function 'check_acl':
fs/namei.c:191:10: error: implicit declaration of function 'negative_cached_acl'
fs/namei.c:196:2: error: implicit declaration of function 'get_cached_acl'
fs/namei.c:196:6: warning: assignment makes pointer from integer without a cast
fs/namei.c:212:11: error: implicit declaration of function 'set_cached_acl'
Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Acked-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

vfs: make gcc generate more obvious code for acl permission checking · 14067ff5

Committed by Linus Torvalds on Jul 25, 2011

The "fsuid is the inode owner" case is not necessarily always the likely
case, but it's the case that doesn't do anything odd and that we want in
straight-line code.  Make gcc not generate random "jump around for the
fun of it" code.

This just helps me read profiles.  That thing is one of the hottest
parts of the whole pathname lookup.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

fs: take the ACL checks to common code · 4e34e719

Committed by Christoph Hellwig on Jul 23, 2011

Replace the ->check_acl method with a ->get_acl method that simply reads an
ACL from disk after having a cache miss. This means we can replace the ACL
checking boilerplate code with a single implementation in namei.c.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

vfs: move ACL cache lookup into generic code · e77819e5

Committed by Linus Torvalds on Jul 22, 2011

This moves logic for checking the cached ACL values from low-level
filesystems into generic code.  The end result is a streamlined ACL
check that doesn't need to load the inode->i_op->check_acl pointer at
all for the common cached case.

The filesystems also don't need to check for a non-blocking RCU walk
case in their acl_check() functions, because that is all handled at a
VFS layer.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

21 Jul, 2011 1 commit

VFS: Fixup kerneldoc for generic_permission() · 8c5dc70a

Committed by Tobias Klauser on Jul 01, 2011

The flags parameter went away in
d749519b444db985e40b897f73ce1898b11f997e
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

20 Jul, 2011 24 commits

unexport kern_path_parent() · e3c3d9c8

Committed by Al Viro on Jun 27, 2011

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

switch vfs_path_lookup() to struct path · e0a01249

Committed by Al Viro on Jun 27, 2011

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

kill lookup_create() · ed75e95d

Committed by Al Viro on Jun 27, 2011

folded into the only caller (kern_path_create())
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

new helpers: kern_path_create/user_path_create · dae6ad8f

Committed by Al Viro on Jun 26, 2011

combination of kern_path_parent() and lookup_create().  Does *not*
expose struct nameidata to caller.  Syscalls converted to that...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

kill LOOKUP_CONTINUE · 49084c3b

Committed by Al Viro on Jun 25, 2011

LOOKUP_PARENT is equivalent to it now
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

don't transliterate lower bits of ->intent.open.flags to FMODE_... · 8a5e929d

Committed by Al Viro on Jun 25, 2011

->create() instances are much happier that way...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

Don't pass nameidata when calling vfs_create() from mknod() · 554a8b9f

Committed by Al Viro on Jun 23, 2011

All instances can cope with that now (and ceph one actually
starts working properly).
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

merge do_revalidate() into its only caller · d2d9e9fb

Committed by Al Viro on Jun 20, 2011

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

no reason to keep exec_permission() separate now · 4ad5abb3

Committed by Al Viro on Jun 20, 2011

cache footprint alone makes it a bad idea...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

massage generic_permission() to treat directories on a separate path · d594e7ec

Committed by Al Viro on Jun 20, 2011

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

->permission() sanitizing: don't pass flags to exec_permission() · eecdd358

Committed by Al Viro on Jun 20, 2011

pass mask instead; kill security_inode_exec_permission() since we can use
security_inode_permission() instead.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

->permission() sanitizing: don't pass flags to ->permission() · 10556cb2

Committed by Al Viro on Jun 20, 2011

not used by the instances anymore.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

->permission() sanitizing: don't pass flags to generic_permission() · 2830ba7f

Committed by Al Viro on Jun 20, 2011

redundant; all callers get it duplicated in mask & MAY_NOT_BLOCK and none of
them removes that bit.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

->permission() sanitizing: don't pass flags to ->check_acl() · 7e40145e

Committed by Al Viro on Jun 20, 2011

not used in the instances anymore.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

->permission() sanitizing: pass MAY_NOT_BLOCK to ->check_acl() · 9c2c7039

Committed by Al Viro on Jun 20, 2011

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

->permission() sanitizing: MAY_NOT_BLOCK · 1fc0f78c

Committed by Al Viro on Jun 20, 2011

Duplicate the flags argument into mask bitmap.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

kill check_acl callback of generic_permission() · 178ea735

Committed by Al Viro on Jun 20, 2011

its value depends only on inode and does not change; we might as
well store it in ->i_op->check_acl and be done with that.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

lockless get_write_access/deny_write_access · 07b8ce1e

Committed by Al Viro on Jun 20, 2011

new helpers: atomic_inc_unless_negative()/atomic_dec_unless_positive()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
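
The idea behind the two helpers named above can be illustrated in userspace with C11 atomics. This is a minimal sketch under stated assumptions, not the kernel implementation: the function names `inc_unless_negative()` and `dec_unless_positive()` are made up for this example, and the counter convention (positive values count one kind of user, negative values the other) is assumed from the commit title. Each helper retries a compare-and-swap until it either succeeds or sees a value with the wrong sign.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Increment *v unless it is negative; returns true if the increment happened. */
static bool inc_unless_negative(atomic_int *v)
{
    int cur = atomic_load(v);

    while (cur >= 0) {
        /* On failure, cur is reloaded with the current value and rechecked. */
        if (atomic_compare_exchange_weak(v, &cur, cur + 1))
            return true;
    }
    return false;
}

/* Decrement *v unless it is positive; returns true if the decrement happened. */
static bool dec_unless_positive(atomic_int *v)
{
    int cur = atomic_load(v);

    while (cur <= 0) {
        if (atomic_compare_exchange_weak(v, &cur, cur - 1))
            return true;
    }
    return false;
}
```

With a counter starting at 0, a successful `inc_unless_negative()` makes later `dec_unless_positive()` calls fail until the count drops back to 0, and vice versa, without taking a lock. Overflow at the integer limits is not handled in this sketch.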

move exec_permission() up to the rest of permission-related functions · f4d6ff89

Committed by Al Viro on Jun 19, 2011

... and convert the comment before it into linuxdoc form.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

kill file_permission() completely · 3bfa784a

Committed by Al Viro on Jun 19, 2011

convert the last remaining caller to inode_permission()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

switch path_init() to exec_permission() · 78f32a9b

Committed by Al Viro on Jun 19, 2011

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

make exec_permission(dir) really equivalent to inode_permission(dir, MAY_EXEC) · 4cf27141

Committed by Al Viro on Jun 19, 2011

capability overrides apply only to the default case; if fs has ->permission()
that does _not_ call generic_permission(), we have no business doing them.
Moreover, if it has ->permission() that does call generic_permission(), we
have no need to recheck capabilities.

Besides, the capability overrides should apply only if we got EACCES from
acl_permission_check(); any other value (-EIO, etc.) should be returned
to caller, capabilities or not capabilities.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
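
The rule in the second paragraph of this message (capability overrides apply only when the basic check returned EACCES) can be sketched as a tiny decision function. This is a simplified model, not the kernel code: `model_exec_permission()` and its parameters are hypothetical, with `acl_result` standing in for the value returned by acl_permission_check() and `has_capability` standing in for the outcome of the capability test.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical model of the default-case decision:
 *   acl_result      0 on success, or a negative errno from the basic check
 *   has_capability  whether the caller holds an overriding capability
 */
static int model_exec_permission(int acl_result, bool has_capability)
{
    /* Success and unrelated errors (for example -EIO) pass through untouched. */
    if (acl_result != -EACCES)
        return acl_result;

    /* Only a plain access denial may be overridden by a capability. */
    return has_capability ? 0 : -EACCES;
}
```

In this model an -EIO result is returned as -EIO even for a caller that holds the capability, which is the point the message makes about "any other value".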


fs: add a DCACHE_NEED_LOOKUP flag for d_flags · 44396f4b

Committed by Josef Bacik on May 31, 2011

Btrfs (and I'd venture most other fs's) stores its indexes in nice disk order
for readdir, but unfortunately in the case of anything that stats the files in
order that readdir spits back (like oh say ls) that means we still have to do
the normal lookup of the file, which means looking up our other index and then
looking up the inode. What I want is a way to create dummy dentries when we
find them in readdir so that when ls or anything else subsequently does a
stat(), we already have the location information in the dentry and can go
straight to the inode itself. The lookup stuff just assumes that if it finds a
dentry it is done, it doesn't perform a lookup. So add a DCACHE_NEED_LOOKUP
flag so that the lookup code knows it still needs to run i_op->lookup() on the
parent to get the inode for the dentry. I have tested this with btrfs and I
went from something that looks like this

http://people.redhat.com/jwhiter/ls-noreada.png

To this

http://people.redhat.com/jwhiter/ls-good.png

That's a savings of 1300 seconds, or 22 minutes. That is a significant savings.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

vfs: fix race in rcu lookup of pruned dentry · 59430262

Committed by Linus Torvalds on Jul 18, 2011

Don't update *inode in __follow_mount_rcu() until we'd verified that
there is mountpoint there.  Kudos to Hugh Dickins for catching that
one in the first place and eventually figuring out the solution (and
catching a braino in the earlier version of patch).
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

13 Jul, 2011 1 commit

Fix ->d_lock locking order in unlazy_walk() · 94c0d4ec

Committed by Al Viro on Jul 12, 2011

Make sure that child is still a child of parent before nested locking
of child->d_lock in unlazy_walk(); otherwise we are risking a violation
of locking order and deadlocks.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

20 Jun, 2011 2 commits

fix comment in generic_permission() · 8e833fd2

Committed by Al Viro on Jun 19, 2011

CAP_DAC_OVERRIDE is enough for MAY_EXEC on directory, even if
no exec bits are set.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

kill obsolete comment for follow_down() · 6291176b

Committed by Al Viro on Jun 17, 2011

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

16 Jun, 2011 2 commits

VFS: Fix vfsmount overput on simultaneous automount · 8aef1884

Committed by Al Viro on Jun 16, 2011

[Kudos to dhowells for tracking that crap down]

If two processes attempt to cause automounting on the same mountpoint at the
same time, the vfsmount holding the mountpoint will be left with one too few
references on it, causing a BUG when the kernel tries to clean up.

The problem is that lock_mount() drops the caller's reference to the
mountpoint's vfsmount in the case where it finds something already mounted on
the mountpoint as it transits to the mounted filesystem and replaces path->mnt
with the new mountpoint vfsmount.

During a pathwalk, however, we don't take a reference on the vfsmount if it is
the same as the one in the nameidata struct, but do_add_mount() doesn't know
this.

The fix is to make sure we have a ref on the vfsmount of the mountpoint before
calling do_add_mount().  However, if lock_mount() doesn't transit, we're then
left with an extra ref on the mountpoint vfsmount which needs releasing.
We can handle that in follow_managed() by not making assumptions about what
we can and what we cannot get from lookup_mnt() as the current code does.

The callers of follow_managed() expect that reference to path->mnt will be
grabbed iff path->mnt has been changed.  follow_managed() and follow_automount()
keep track of whether such reference has been grabbed and assume that it'll
happen in those and only those cases that'll have us return with changed
path->mnt.  That assumption is almost correct - it breaks in case of
racing automounts and in an even harder to hit race between following a
mountpoint and a couple of mount --move.  The thing is, we don't need to
make that assumption at all - after the end of the loop in follow_managed()
we can check if path->mnt has ended up unchanged and do mntput() if needed.

The BUG can be reproduced with the following test program:

	#include <stdio.h>
	#include <sys/types.h>
	#include <sys/stat.h>
	#include <unistd.h>
	#include <sys/wait.h>
	int main(int argc, char **argv)
	{
		int pid, ws;
		struct stat buf;
		pid = fork();
		stat(argv[1], &buf);
		if (pid > 0) wait(&ws);
		return 0;
	}

and the following procedure:

 (1) Mount an NFS volume that on the server has something else mounted on a
     subdirectory.  For instance, I can mount / from my server:

	mount warthog:/ /mnt -t nfs4 -r

     On the server /data has another filesystem mounted on it, so NFS will see
     a change in FSID as it walks down the path, and will mark /mnt/data as
     being a mountpoint.  This will cause the automount code to be triggered.

     !!! Do not look inside the mounted fs at this point !!!

 (2) Run the above program on a file within the submount to generate two
     simultaneous automount requests:

	/tmp/forkstat /mnt/data/testfile

 (3) Unmount the automounted submount:

	umount /mnt/data

 (4) Unmount the original mount:

	umount /mnt

     At this point the kernel should throw a BUG with something like the
     following:

	BUG: Dentry ffff880032e3c5c0{i=2,n=} still in use (1) [unmount of nfs4 0:12]

Note that the bug appears on the root dentry of the original mount, not the
mountpoint and not the submount because sys_umount() hasn't got to its final
mntput_no_expire() yet, but this isn't so obvious from the call trace:

 [<ffffffff8117cd82>] shrink_dcache_for_umount+0x69/0x82
 [<ffffffff8116160e>] generic_shutdown_super+0x37/0x15b
 [<ffffffffa00fae56>] ? nfs_super_return_all_delegations+0x2e/0x1b1 [nfs]
 [<ffffffff811617f3>] kill_anon_super+0x1d/0x7e
 [<ffffffffa00d0be1>] nfs4_kill_super+0x60/0xb6 [nfs]
 [<ffffffff81161c17>] deactivate_locked_super+0x34/0x83
 [<ffffffff811629ff>] deactivate_super+0x6f/0x7b
 [<ffffffff81186261>] mntput_no_expire+0x18d/0x199
 [<ffffffff811862a8>] mntput+0x3b/0x44
 [<ffffffff81186d87>] release_mounts+0xa2/0xbf
 [<ffffffff811876af>] sys_umount+0x47a/0x4ba
 [<ffffffff8109e1ca>] ? trace_hardirqs_on_caller+0x1fd/0x22f
 [<ffffffff816ea86b>] system_call_fastpath+0x16/0x1b

as do_umount() is inlined.  However, you can see release_mounts() in there.

Note also that it may be necessary to have multiple CPU cores to be able to
trigger this bug.
Tested-by: Jeff Layton <jlayton@redhat.com>
Tested-by: Ian Kent <raven@themaw.net>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

fix wrong iput on d_inode introduced by · 50338b88

Committed by Török Edwin on Jun 16, 2011

Git bisection shows that commit e6bc45d6 causes
BUG_ONs under high I/O load:

kernel BUG at fs/inode.c:1368!
[ 2862.501007] Call Trace:
[ 2862.501007]  [<ffffffff811691d8>] d_kill+0xf8/0x140
[ 2862.501007]  [<ffffffff81169c19>] dput+0xc9/0x190
[ 2862.501007]  [<ffffffff8115577f>] fput+0x15f/0x210
[ 2862.501007]  [<ffffffff81152171>] filp_close+0x61/0x90
[ 2862.501007]  [<ffffffff81152251>] sys_close+0xb1/0x110
[ 2862.501007]  [<ffffffff814c14fb>] system_call_fastpath+0x16/0x1b

A reliable way to reproduce this bug is:
Login to KDE, run 'rsnapshot sync', and apt-get install openjdk-6-jdk,
and apt-get remove openjdk-6-jdk.

The buggy part of the patch is this:
	struct inode *inode = NULL;
.....
-               if (nd.last.name[nd.last.len])
-                       goto slashes;
                inode = dentry->d_inode;
-               if (inode)
-                       ihold(inode);
+               if (nd.last.name[nd.last.len] || !inode)
+                       goto slashes;
+               ihold(inode)
...
	if (inode)
		iput(inode);	/* truncate the inode here */

If nd.last.name[nd.last.len] is nonzero (and thus goto slashes branch is taken),
and dentry->d_inode is non-NULL, then this code now does an additional iput on
the inode, which is wrong.

Fix this by only setting the inode variable if nd.last.name[nd.last.len] is 0.

Reference: https://lkml.org/lkml/2011/6/15/50
Reported-by: Norbert Preining <preining@logic.at>
Reported-by: Török Edwin <edwintorok@gmail.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Török Edwin <edwintorok@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
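
The reference-count imbalance described in this message can be reproduced in a small userspace model. This is a sketch only: `struct model_inode`, `model_ihold()`, `model_iput()` and the two `model_unlink_*()` functions are hypothetical names, the trailing-slash test is reduced to a boolean, and the control flow is simplified so that the `slashes` label falls straight through to the final put.

```c
#include <stddef.h>

/* Hypothetical stand-in for an inode with a plain reference count. */
struct model_inode {
    int refcount;
};

static void model_ihold(struct model_inode *inode) { inode->refcount++; }
static void model_iput(struct model_inode *inode)  { inode->refcount--; }

/* Buggy ordering: inode is assigned before the trailing-slash check. */
static void model_unlink_buggy(struct model_inode *d_inode, int trailing_slash)
{
    struct model_inode *inode = NULL;

    inode = d_inode;
    if (trailing_slash || !inode)
        goto slashes;           /* skips the hold but keeps inode set */
    model_ihold(inode);
slashes:
    if (inode)
        model_iput(inode);      /* unbalanced put on the slashes path */
}

/* Fixed ordering: inode is only set once the slash check has passed. */
static void model_unlink_fixed(struct model_inode *d_inode, int trailing_slash)
{
    struct model_inode *inode = NULL;

    if (trailing_slash)
        goto slashes;
    inode = d_inode;
    if (!inode)
        goto slashes;
    model_ihold(inode);
slashes:
    if (inode)
        model_iput(inode);      /* always paired with the hold above */
}
```

Starting from a reference count of 1 and taking the trailing-slash path, the buggy variant leaves the count at 0 while the fixed variant leaves it at 1.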


openanolis / cloud-kernel synced successfully almost 2 years ago