提交 · e36cb0b89ce20b4f8786a57e8a6bc8476f577650 · openeuler / Kernel

23 2月, 2015 1 次提交

VFS: (Scripted) Convert S_ISLNK/DIR/REG(dentry->d_inode) to d_is_*(dentry) · e36cb0b8

由 David Howells 提交于 1月 29, 2015

Convert the following where appropriate:

 (1) S_ISLNK(dentry->d_inode) to d_is_symlink(dentry).

 (2) S_ISREG(dentry->d_inode) to d_is_reg(dentry).

 (3) S_ISDIR(dentry->d_inode) to d_is_dir(dentry).  This is actually more
     complicated than it appears as some calls should be converted to
     d_can_lookup() instead.  The difference is whether the directory in
     question is a real dir with a ->lookup op or whether it's a fake dir with
     a ->d_automount op.

In some circumstances, we can subsume checks for dentry->d_inode not being
NULL into this, provided we the code isn't in a filesystem that expects
d_inode to be NULL if the dirent really *is* negative (ie. if we're going to
use d_inode() rather than d_backing_inode() to get the inode pointer).

Note that the dentry type field may be set to something other than
DCACHE_MISS_TYPE when d_inode is NULL in the case of unionmount, where the VFS
manages the fall-through from a negative dentry to a lower layer.  In such a
case, the dentry type of the negative union dentry is set to the same as the
type of the lower dentry.

However, if you know d_inode is not NULL at the call site, then you can use
the d_is_xxx() functions even in a filesystem.

There is one further complication: a 0,0 chardev dentry may be labelled
DCACHE_WHITEOUT_TYPE rather than DCACHE_SPECIAL_TYPE.  Strictly, this was
intended for special directory entry types that don't have attached inodes.

The following perl+coccinelle script was used:

use strict;

my @callers;
open($fd, 'git grep -l \'S_IS[A-Z].*->d_inode\' |') ||
    die "Can't grep for S_ISDIR and co. callers";
@callers = <$fd>;
close($fd);
unless (@callers) {
    print "No matches\n";
    exit(0);
}

my @cocci = (
    '@@',
    'expression E;',
    '@@',
    '',
    '- S_ISLNK(E->d_inode->i_mode)',
    '+ d_is_symlink(E)',
    '',
    '@@',
    'expression E;',
    '@@',
    '',
    '- S_ISDIR(E->d_inode->i_mode)',
    '+ d_is_dir(E)',
    '',
    '@@',
    'expression E;',
    '@@',
    '',
    '- S_ISREG(E->d_inode->i_mode)',
    '+ d_is_reg(E)' );

my $coccifile = "tmp.sp.cocci";
open($fd, ">$coccifile") || die $coccifile;
print($fd "$_\n") || die $coccifile foreach (@cocci);
close($fd);

foreach my $file (@callers) {
    chomp $file;
    print "Processing ", $file, "\n";
    system("spatch", "--sp-file", $coccifile, $file, "--in-place", "--no-show-diff") == 0 ||
	die "spatch failed";
}

[AV: overlayfs parts skipped]
Signed-off-by: NDavid Howells <dhowells@redhat.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

e36cb0b8

23 1月, 2015 6 次提交

audit: replace getname()/putname() hacks with reference counters · 55422d0b

由 Paul Moore 提交于 1月 22, 2015

In order to ensure that filenames are not released before the audit
subsystem is done with the strings there are a number of hacks built
into the fs and audit subsystems around getname() and putname().  To
say these hacks are "ugly" would be kind.

This patch removes the filename hackery in favor of a more
conventional reference count based approach.  The diffstat below tells
most of the story; lots of audit/fs specific code is replaced with a
traditional reference count based approach that is easily understood,
even by those not familiar with the audit and/or fs subsystems.

CC: viro@zeniv.linux.org.uk
CC: linux-fsdevel@vger.kernel.org
Signed-off-by: NPaul Moore <pmoore@redhat.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

55422d0b

audit: enable filename recording via getname_kernel() · fd3522fd

由 Paul Moore 提交于 1月 22, 2015

Enable recording of filenames in getname_kernel() and remove the
kludgy workaround in __audit_inode() now that we have proper filename
logging for kernel users.

CC: viro@zeniv.linux.org.uk
CC: linux-fsdevel@vger.kernel.org
Signed-off-by: NPaul Moore <pmoore@redhat.com>
Reviewed-by: NRichard Guy Briggs <rgb@redhat.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

fd3522fd

simpler calling conventions for filename_mountpoint() · cbaab2db

由 Al Viro 提交于 1月 22, 2015

a) make it accept ERR_PTR() as filename (and return its PTR_ERR() in that case)
b) make it putname() the sucker in the end otherwise

simplifies life for callers...
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

cbaab2db

fs: create proper filename objects using getname_kernel() · 51689104

由 Paul Moore 提交于 1月 22, 2015

There are several areas in the kernel that create temporary filename
objects using the following pattern:

	int func(const char *name)
	{
		struct filename *file = { .name = name };
		...
		return 0;
	}

... which for the most part works okay, but it causes havoc within the
audit subsystem as the filename object does not persist beyond the
lifetime of the function.  This patch converts all of these temporary
filename objects into proper filename objects using getname_kernel()
and putname() which ensure that the filename object persists until the
audit subsystem is finished with it.

Also, a special thanks to Al Viro, Guenter Roeck, and Sabrina Dubroca
for helping resolve a difficult kernel panic on boot related to a
use-after-free problem in kern_path_create(); the thread can be seen
at the link below:

 * https://lkml.org/lkml/2015/1/20/710

This patch includes code that was either based on, or directly written
by Al in the above thread.

CC: viro@zeniv.linux.org.uk
CC: linux@roeck-us.net
CC: sd@queasysnail.net
CC: linux-fsdevel@vger.kernel.org
Signed-off-by: NPaul Moore <pmoore@redhat.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

51689104

fs: rework getname_kernel to handle up to PATH_MAX sized filenames · 08518549

由 Paul Moore 提交于 1月 21, 2015

In preparation for expanded use in the kernel, make getname_kernel()
more useful by allowing it to handle any legal filename length.

Thanks to Guenter Roeck for his suggestion to substitute memcpy() for
strlcpy().

CC: linux@roeck-us.net
CC: viro@zeniv.linux.org.uk
CC: linux-fsdevel@vger.kernel.org
Signed-off-by: NPaul Moore <pmoore@redhat.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

08518549

cut down the number of do_path_lookup() callers · fa14a0b8

由 Al Viro 提交于 1月 22, 2015

... and don't bother with new struct filename when we already have one
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

fa14a0b8

14 12月, 2014 1 次提交

syscalls: implement execveat() system call · 51f39a1f

由 David Drysdale 提交于 12月 12, 2014

This patchset adds execveat(2) for x86, and is derived from Meredydd
Luff's patch from Sept 2012 (https://lkml.org/lkml/2012/9/11/528).

The primary aim of adding an execveat syscall is to allow an
implementation of fexecve(3) that does not rely on the /proc filesystem,
at least for executables (rather than scripts).  The current glibc version
of fexecve(3) is implemented via /proc, which causes problems in sandboxed
or otherwise restricted environments.

Given the desire for a /proc-free fexecve() implementation, HPA suggested
(https://lkml.org/lkml/2006/7/11/556) that an execveat(2) syscall would be
an appropriate generalization.

Also, having a new syscall means that it can take a flags argument without
back-compatibility concerns.  The current implementation just defines the
AT_EMPTY_PATH and AT_SYMLINK_NOFOLLOW flags, but other flags could be
added in future -- for example, flags for new namespaces (as suggested at
https://lkml.org/lkml/2006/7/11/474).

Related history:
 - https://lkml.org/lkml/2006/12/27/123 is an example of someone
   realizing that fexecve() is likely to fail in a chroot environment.
 - http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=514043 covered
   documenting the /proc requirement of fexecve(3) in its manpage, to
   "prevent other people from wasting their time".
 - https://bugzilla.redhat.com/show_bug.cgi?id=241609 described a
   problem where a process that did setuid() could not fexecve()
   because it no longer had access to /proc/self/fd; this has since
   been fixed.

This patch (of 4):

Add a new execveat(2) system call.  execveat() is to execve() as openat()
is to open(): it takes a file descriptor that refers to a directory, and
resolves the filename relative to that.

In addition, if the filename is empty and AT_EMPTY_PATH is specified,
execveat() executes the file to which the file descriptor refers.  This
replicates the functionality of fexecve(), which is a system call in other
UNIXen, but in Linux glibc it depends on opening "/proc/self/fd/<fd>" (and
so relies on /proc being mounted).

The filename fed to the executed program as argv[0] (or the name of the
script fed to a script interpreter) will be of the form "/dev/fd/<fd>"
(for an empty filename) or "/dev/fd/<fd>/<filename>", effectively
reflecting how the executable was found.  This does however mean that
execution of a script in a /proc-less environment won't work; also, script
execution via an O_CLOEXEC file descriptor fails (as the file will not be
accessible after exec).

Based on patches by Meredydd Luff.
Signed-off-by: NDavid Drysdale <drysdale@google.com>
Cc: Meredydd Luff <meredydd@senatehouse.org>
Cc: Shuah Khan <shuah.kh@samsung.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Rich Felker <dalias@aerifal.cx>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

51f39a1f

12 12月, 2014 4 次提交
- A
  fs/namei.c: fold link_path_walk() call into path_init() · d465887f
  由 Al Viro 提交于 11月 20, 2014
```
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
  d465887f
- A
  path_init(): don't bother with LOOKUP_PARENT in argument · 980f3ea2
  由 Al Viro 提交于 11月 20, 2014
```
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
  980f3ea2
- A
  fs/namei.c: new helper (path_cleanup()) · 893b7775
  由 Al Viro 提交于 11月 20, 2014
```
All callers of path_init() proceed to do the identical cleanup when
they are done with nameidata.  Don't open-code it...
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
  893b7775
- A
  path_init(): store the "base" pointer to file in nameidata itself · 5e53084d
  由 Al Viro 提交于 11月 20, 2014
```
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
  5e53084d
11 12月, 2014 1 次提交
- A
  make nameidata completely opaque outside of fs/namei.c · 1f55a6ec
  由 Al Viro 提交于 11月 01, 2014
```
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
  1f55a6ec
31 10月, 2014 1 次提交

fs: allow open(dir, O_TMPFILE|..., 0) with mode 0 · 69a91c23

由 Eric Rannaud 提交于 10月 30, 2014

The man page for open(2) indicates that when O_CREAT is specified, the
'mode' argument applies only to future accesses to the file:

	Note that this mode applies only to future accesses of the newly
	created file; the open() call that creates a read-only file
	may well return a read/write file descriptor.

The man page for open(2) implies that 'mode' is treated identically by
O_CREAT and O_TMPFILE.

O_TMPFILE, however, behaves differently:

	int fd = open("/tmp", O_TMPFILE | O_RDWR, 0);
	assert(fd == -1);
	assert(errno == EACCES);

	int fd = open("/tmp", O_TMPFILE | O_RDWR, 0600);
	assert(fd > 0);

For O_CREAT, do_last() sets acc_mode to MAY_OPEN only:

	if (*opened & FILE_CREATED) {
		/* Don't check for write permission, don't truncate */
		open_flag &= ~O_TRUNC;
		will_truncate = false;
		acc_mode = MAY_OPEN;
		path_to_nameidata(path, nd);
		goto finish_open_created;
	}

But for O_TMPFILE, do_tmpfile() passes the full op->acc_mode to
may_open().

This patch lines up the behavior of O_TMPFILE with O_CREAT. After the
inode is created, may_open() is called with acc_mode = MAY_OPEN, in
do_tmpfile().

A different, but related glibc bug revealed the discrepancy:
https://sourceware.org/bugzilla/show_bug.cgi?id=17523

The glibc lazily loads the 'mode' argument of open() and openat() using
va_arg() only if O_CREAT is present in 'flags' (to support both the 2
argument and the 3 argument forms of open; same idea for openat()).
However, the glibc ignores the 'mode' argument if O_TMPFILE is in
'flags'.

On x86_64, for open(), it magically works anyway, as 'mode' is in
RDX when entering open(), and is still in RDX on SYSCALL, which is where
the kernel looks for the 3rd argument of a syscall.

But openat() is not quite so lucky: 'mode' is in RCX when entering the
glibc wrapper for openat(), while the kernel looks for the 4th argument
of a syscall in R10. Indeed, the syscall calling convention differs from
the regular calling convention in this respect on x86_64. So the kernel
sees mode = 0 when trying to use glibc openat() with O_TMPFILE, and
fails with EACCES.
Signed-off-by: NEric Rannaud <e@nanocritical.com>
Acked-by: NAndy Lutomirski <luto@amacapital.net>
Cc: stable@vger.kernel.org
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

69a91c23

29 10月, 2014 1 次提交

overlayfs: fix lockdep misannotation · d1b72cc6

由 Miklos Szeredi 提交于 10月 27, 2014

In an overlay directory that shadows an empty lower directory, say
/mnt/a/empty102, do:

 	touch /mnt/a/empty102/x
 	unlink /mnt/a/empty102/x
 	rmdir /mnt/a/empty102

It's actually harmless, but needs another level of nesting between
I_MUTEX_CHILD and I_MUTEX_NORMAL.
Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>
Tested-by: NDavid Howells <dhowells@redhat.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

d1b72cc6

24 10月, 2014 5 次提交

vfs: add RENAME_WHITEOUT · 0d7a8555

由 Miklos Szeredi 提交于 10月 24, 2014

This adds a new RENAME_WHITEOUT flag.  This flag makes rename() create a
whiteout of source.  The whiteout creation is atomic relative to the
rename.
Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>

0d7a8555

vfs: add whiteout support · 787fb6bc

由 Miklos Szeredi 提交于 10月 24, 2014

Whiteout isn't actually a new file type, but is represented as a char
device (Linus's idea) with 0/0 device number.

This has several advantages compared to introducing a new whiteout file
type:

 - no userspace API changes (e.g. trivial to make backups of upper layer
   filesystem, without losing whiteouts)

 - no fs image format changes (you can boot an old kernel/fsck without
   whiteout support and things won't break)

 - implementation is trivial
Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>

787fb6bc

vfs: export check_sticky() · cbdf35bc

由 Miklos Szeredi 提交于 10月 24, 2014

It's already duplicated in btrfs and about to be used in overlayfs too.

Move the sticky bit check to an inline helper and call the out-of-line
helper only in the unlikly case of the sticky bit being set.
Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>

cbdf35bc

vfs: export __inode_permission() to modules · bd5d0856

由 Miklos Szeredi 提交于 10月 24, 2014

We need to be able to check inode permissions (but not filesystem implied
permissions) for stackable filesystems. Expose this interface for overlayfs.
Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>

bd5d0856

vfs: add i_op->dentry_open() · 4aa7c634

由 Miklos Szeredi 提交于 10月 24, 2014

Add a new inode operation i_op->dentry_open(). This is for stacked filesystems
that want to return a struct file from a different filesystem.
Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>

4aa7c634

13 10月, 2014 1 次提交

let path_init() failures treated the same way as subsequent link_path_walk() · 115cbfdc

由 Al Viro 提交于 10月 11, 2014

As it is, path_lookupat() and path_mounpoint() might end up leaking struct file
reference in some cases.
Spotted-by: NEric Biggers <ebiggers3@gmail.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

115cbfdc

09 10月, 2014 3 次提交

vfs: Make d_invalidate return void · 5542aa2f

由 Eric W. Biederman 提交于 2月 13, 2014

Now that d_invalidate can no longer fail, stop returning a useless
return code.  For the few callers that checked the return code update
remove the handling of d_invalidate failure.
Reviewed-by: NMiklos Szeredi <miklos@szeredi.hu>
Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

5542aa2f

vfs: Lazily remove mounts on unlinked files and directories. · 8ed936b5

由 Eric W. Biederman 提交于 10月 01, 2013

With the introduction of mount namespaces and bind mounts it became
possible to access files and directories that on some paths are mount
points but are not mount points on other paths.  It is very confusing
when rm -rf somedir returns -EBUSY simply because somedir is mounted
somewhere else.  With the addition of user namespaces allowing
unprivileged mounts this condition has gone from annoying to allowing
a DOS attack on other users in the system.

The possibility for mischief is removed by updating the vfs to support
rename, unlink and rmdir on a dentry that is a mountpoint and by
lazily unmounting mountpoints on deleted dentries.

In particular this change allows rename, unlink and rmdir system calls
on a dentry without a mountpoint in the current mount namespace to
succeed, and it allows rename, unlink, and rmdir performed on a
distributed filesystem to update the vfs cache even if when there is a
mount in some namespace on the original dentry.

There are two common patterns of maintaining mounts: Mounts on trusted
paths with the parent directory of the mount point and all ancestory
directories up to / owned by root and modifiable only by root
(i.e. /media/xxx, /dev, /dev/pts, /proc, /sys, /sys/fs/cgroup/{cpu,
cpuacct, ...}, /usr, /usr/local).  Mounts on unprivileged directories
maintained by fusermount.

In the case of mounts in trusted directories owned by root and
modifiable only by root the current parent directory permissions are
sufficient to ensure a mount point on a trusted path is not removed
or renamed by anyone other than root, even if there is a context
where the there are no mount points to prevent this.

In the case of mounts in directories owned by less privileged users
races with users modifying the path of a mount point are already a
danger.  fusermount already uses a combination of chdir,
/proc/<pid>/fd/NNN, and UMOUNT_NOFOLLOW to prevent these races.  The
removable of global rename, unlink, and rmdir protection really adds
nothing new to consider only a widening of the attack window, and
fusermount is already safe against unprivileged users modifying the
directory simultaneously.

In principle for perfect userspace programs returning -EBUSY for
unlink, rmdir, and rename of dentires that have mounts in the local
namespace is actually unnecessary.  Unfortunately not all userspace
programs are perfect so retaining -EBUSY for unlink, rmdir and rename
of dentries that have mounts in the current mount namespace plays an
important role of maintaining consistency with historical behavior and
making imperfect userspace applications hard to exploit.

v2: Remove spurious old_dentry.
v3: Optimized shrink_submounts_and_drop
    Removed unsued afs label
v4: Simplified the changes to check_submounts_and_drop
    Do not rename check_submounts_and_drop shrink_submounts_and_drop
    Document what why we need atomicity in check_submounts_and_drop
    Rely on the parent inode mutex to make d_revalidate and d_invalidate
    an atomic unit.
v5: Refcount the mountpoint to detach in case of simultaneous
    renames.
Reviewed-by: NMiklos Szeredi <miklos@szeredi.hu>
Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

8ed936b5

vfs: Don't allow overwriting mounts in the current mount namespace · 7af1364f

由 Eric W. Biederman 提交于 10月 04, 2013

In preparation for allowing mountpoints to be renamed and unlinked
in remote filesystems and in other mount namespaces test if on a dentry
there is a mount in the local mount namespace before allowing it to
be renamed or unlinked.

The primary motivation here are old versions of fusermount unmount
which is not safe if the a path can be renamed or unlinked while it is
verifying the mount is safe to unmount.  More recent versions are simpler
and safer by simply using UMOUNT_NOFOLLOW when unmounting a mount
in a directory owned by an arbitrary user.

Miklos Szeredi <miklos@szeredi.hu> reports this is approach is good
enough to remove concerns about new kernels mixed with old versions
of fusermount.

A secondary motivation for restrictions here is that it removing empty
directories that have non-empty mount points on them appears to
violate the rule that rmdir can not remove empty directories.  As
Linus Torvalds pointed out this is useful for programs (like git) that
test if a directory is empty with rmdir.

Therefore this patch arranges to enforce the existing mount point
semantics for local mount namespace.

v2: Rewrote the test to be a drop in replacement for d_mountpoint
v3: Use bool instead of int as the return type of is_local_mountpoint
Reviewed-by: NMiklos Szeredi <miklos@szeredi.hu>
Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

7af1364f

16 9月, 2014 2 次提交

vfs: workaround gcc <4.6 build error in link_path_walk() · a060dc50

由 James Hogan 提交于 9月 16, 2014

Commit d6bb3e90 ("vfs: simplify and shrink stack frame of
link_path_walk()") introduced build problems with GCC versions older
than 4.6 due to the initialisation of a member of an anonymous union in
struct qstr without enclosing braces.

This hits GCC bug 10676 [1] (which was fixed in GCC 4.6 by [2]), and
causes the following build error:

  fs/namei.c: In function 'link_path_walk':
  fs/namei.c:1778: error: unknown field 'hash_len' specified in initializer

This is worked around by adding explicit braces.

[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=10676
[2] https://gcc.gnu.org/viewcvs/gcc?view=revision&revision=159206

Fixes: d6bb3e90 (vfs: simplify and shrink stack frame of link_path_walk())
Signed-off-by: NJames Hogan <james.hogan@imgtec.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-metag@vger.kernel.org
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

a060dc50

vfs: simplify and shrink stack frame of link_path_walk() · d6bb3e90

由 Linus Torvalds 提交于 9月 15, 2014

Commit 9226b5b4 ("vfs: avoid non-forwarding large load after small
store in path lookup") made link_path_walk() always access the
"hash_len" field as a single 64-bit entity, in order to avoid mixed size
accesses to the members.

However, what I didn't notice was that that effectively means that the
whole "struct qstr this" is now basically redundant.  We already
explicitly track the "const char *name", and if we just use "u64
hash_len" instead of "long len", there is nothing else left of the
"struct qstr".

We do end up wanting the "struct qstr" if we have a filesystem with a
"d_hash()" function, but that's a rare case, and we might as well then
just squirrell away the name and hash_len at that point.

End result: fewer live variables in the loop, a smaller stack frame, and
better code generation.  And we don't need to pass in pointers variables
to helper functions any more, because the return value contains all the
relevant information.  So this removes more lines than it adds, and the
source code is clearer too.
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

d6bb3e90

15 9月, 2014 3 次提交

vfs: avoid non-forwarding large load after small store in path lookup · 9226b5b4

由 Linus Torvalds 提交于 9月 14, 2014

The performance regression that Josef Bacik reported in the pathname
lookup (see commit 99d263d4 "vfs: fix bad hashing of dentries") made
me look at performance stability of the dcache code, just to verify that
the problem was actually fixed.  That turned up a few other problems in
this area.

There are a few cases where we exit RCU lookup mode and go to the slow
serializing case when we shouldn't, Al has fixed those and they'll come
in with the next VFS pull.

But my performance verification also shows that link_path_walk() turns
out to have a very unfortunate 32-bit store of the length and hash of
the name we look up, followed by a 64-bit read of the combined hash_len
field.  That screws up the processor store to load forwarding, causing
an unnecessary hickup in this critical routine.

It's caused by the ugly calling convention for the "hash_name()"
function, and easily fixed by just making hash_name() fill in the whole
'struct qstr' rather than passing it a pointer to just the hash value.

With that, the profile for this function looks much smoother.
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

9226b5b4

be careful with nd->inode in path_init() and follow_dotdot_rcu() · 4023bfc9

由 Al Viro 提交于 9月 13, 2014

in the former we simply check if dentry is still valid after picking
its ->d_inode; in the latter we fetch ->d_inode in the same places
where we fetch dentry and its ->d_seq, under the same checks.

Cc: stable@vger.kernel.org # 2.6.38+
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

4023bfc9

don't bugger nd->seq on set_root_rcu() from follow_dotdot_rcu() · 7bd88377

由 Al Viro 提交于 9月 13, 2014

return the value instead, and have path_init() do the assignment.  Broken by
"vfs: Fix absolute RCU path walk failures due to uninitialized seq number",
which was Cc-stable with 2.6.38+ as destination.  This one should go where
it went.

To avoid dummy value returned in case when root is already set (it would do
no harm, actually, since the only caller that doesn't ignore the return value
is guaranteed to have nd->root *not* set, but it's more obvious that way),
lift the check into callers.  And do the same to set_root(), to keep them
in sync.

Cc: stable@vger.kernel.org # 2.6.38+
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

7bd88377

14 9月, 2014 2 次提交

fix bogus read_seqretry() checks introduced in · f5be3e29

由 Al Viro 提交于 9月 13, 2014

read_seqretry() returns true on mismatch, not on match...

Cc: stable@vger.kernel.org # 3.15+
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

f5be3e29

vfs: fix bad hashing of dentries · 99d263d4

由 Linus Torvalds 提交于 9月 13, 2014

Josef Bacik found a performance regression between 3.2 and 3.10 and
narrowed it down to commit bfcfaa77 ("vfs: use 'unsigned long'
accesses for dcache name comparison and hashing"). He reports:

 "The test case is essentially

      for (i = 0; i < 1000000; i++)
              mkdir("a$i");

  On xfs on a fio card this goes at about 20k dir/sec with 3.2, and 12k
  dir/sec with 3.10.  This is because we spend waaaaay more time in
  __d_lookup on 3.10 than in 3.2.

  The new hashing function for strings is suboptimal for <
  sizeof(unsigned long) string names (and hell even > sizeof(unsigned
  long) string names that I've tested).  I broke out the old hashing
  function and the new one into a userspace helper to get real numbers
  and this is what I'm getting:

      Old hash table had 1000000 entries, 0 dupes, 0 max dupes
      New hash table had 12628 entries, 987372 dupes, 900 max dupes
      We had 11400 buckets with a p50 of 30 dupes, p90 of 240 dupes, p99 of 567 dupes for the new hash

  My test does the hash, and then does the d_hash into a integer pointer
  array the same size as the dentry hash table on my system, and then
  just increments the value at the address we got to see how many
  entries we overlap with.

  As you can see the old hash function ended up with all 1 million
  entries in their own bucket, whereas the new one they are only
  distributed among ~12.5k buckets, which is why we're using so much
  more CPU in __d_lookup".

The reason for this hash regression is two-fold:

 - On 64-bit architectures the down-mixing of the original 64-bit
   word-at-a-time hash into the final 32-bit hash value is very
   simplistic and suboptimal, and just adds the two 32-bit parts
   together.

   In particular, because there is no bit shuffling and the mixing
   boundary is also a byte boundary, similar character patterns in the
   low and high word easily end up just canceling each other out.

 - the old byte-at-a-time hash mixed each byte into the final hash as it
   hashed the path component name, resulting in the low bits of the hash
   generally being a good source of hash data.  That is not true for the
   word-at-a-time case, and the hash data is distributed among all the
   bits.

The fix is the same in both cases: do a better job of mixing the bits up
and using as much of the hash data as possible.  We already have the
"hash_32|64()" functions to do that.
Reported-by: NJosef Bacik <jbacik@fb.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Chris Mason <clm@fb.com>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

99d263d4

09 9月, 2014 1 次提交

ima: pass 'opened' flag to identify newly created files · 3034a146

由 Dmitry Kasatkin 提交于 6月 27, 2014

Empty files and missing xattrs do not guarantee that a file was
just created.  This patch passes FILE_CREATED flag to IMA to
reliably identify new files.
Signed-off-by: NDmitry Kasatkin <d.kasatkin@samsung.com>
Signed-off-by: NMimi Zohar <zohar@linux.vnet.ibm.com>
Cc: <stable@vger.kernel.org>  3.14+

3034a146

08 8月, 2014 3 次提交

namei: trivial fix to vfs_rename_dir comment · d03b29a2

由 J. Bruce Fields 提交于 2月 17, 2014

Looks like the directory loop check is actually done in renameat?
Whatever, leave this out rather than trying to keep it up to date with
the code.
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

d03b29a2

VFS: allow ->d_manage() to declare -EISDIR in rcu_walk mode. · b8faf035

由 NeilBrown 提交于 8月 04, 2014

In REF-walk mode, ->d_manage can return -EISDIR to indicate
that the dentry is not really a mount trap (or even a mount point)
and that any mounts or any DCACHE_NEED_AUTOMOUNT flag should be
ignored.

RCU-walk mode doesn't currently support this, so if there is a dentry
with DCACHE_NEED_AUTOMOUNT set but which shouldn't be a mount-trap,
lookup_fast() will always drop in REF-walk mode.

With this patch, an -EISDIR from ->d_manage will always cause mounts
and automounts to be ignored, both in REF-walk and RCU-walk.
Bug-fixed-by: NDan Carpenter <dan.carpenter@oracle.com>
Cc: Ian Kent <raven@themaw.net>
Signed-off-by: NNeilBrown <neilb@suse.de>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

b8faf035

fs: call rename2 if exists · 7177a9c4

由 Miklos Szeredi 提交于 7月 23, 2014

Christoph Hellwig suggests:

1) make vfs_rename call ->rename2 if it exists instead of ->rename
2) switch all filesystems that you're adding NOREPLACE support for to
   use ->rename2
3) see how many ->rename instances we'll have left after a few
   iterations of 2.
Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

7177a9c4

24 7月, 2014 1 次提交

fs: umount on symlink leaks mnt count · 295dc39d

由 Vasily Averin 提交于 7月 21, 2014

Currently umount on symlink blocks following umount:

/vz is separate mount

# ls /vz/ -al | grep test
drwxr-xr-x.  2 root root       4096 Jul 19 01:14 testdir
lrwxrwxrwx.  1 root root         11 Jul 19 01:16 testlink -> /vz/testdir
# umount -l /vz/testlink
umount: /vz/testlink: not mounted (expected)

# lsof /vz
# umount /vz
umount: /vz: device is busy. (unexpected)

In this case mountpoint_last() gets an extra refcount on path->mnt
Signed-off-by: NVasily Averin <vvs@openvz.org>
Acked-by: NIan Kent <raven@themaw.net>
Acked-by: NJeff Layton <jlayton@primarydata.com>
Cc: stable@vger.kernel.org
Signed-off-by: NChristoph Hellwig <hch@lst.de>

295dc39d

11 6月, 2014 1 次提交

fs,userns: Change inode_capable to capable_wrt_inode_uidgid · 23adbe12

由 Andy Lutomirski 提交于 6月 10, 2014

The kernel has no concept of capabilities with respect to inodes; inodes
exist independently of namespaces.  For example, inode_capable(inode,
CAP_LINUX_IMMUTABLE) would be nonsense.

This patch changes inode_capable to check for uid and gid mappings and
renames it to capable_wrt_inode_uidgid, which should make it more
obvious what it does.

Fixes CVE-2014-4014.

Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Serge Hallyn <serge.hallyn@ubuntu.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: stable@vger.kernel.org
Signed-off-by: NAndy Lutomirski <luto@amacapital.net>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

23adbe12

20 4月, 2014 1 次提交

fix races between __d_instantiate() and checks of dentry flags · 22213318

由 Al Viro 提交于 4月 19, 2014

in non-lazy walk we need to be careful about dentry switching from
negative to positive - both ->d_flags and ->d_inode are updated,
and in some places we might see only one store.  The cases where
dentry has been obtained by dcache lookup with ->i_mutex held on
parent are safe - ->d_lock and ->i_mutex provide all the barriers
we need.  However, there are several places where we run into
trouble:
	* do_last() fetches ->d_inode, then checks ->d_flags and
assumes that inode won't be NULL unless d_is_negative() is true.
Race with e.g. creat() - we might have fetched the old value of
->d_inode (still NULL) and new value of ->d_flags (already not
DCACHE_MISS_TYPE).  Lin Ming has observed and reported the resulting
oops.
	* a bunch of places checks ->d_inode for being non-NULL,
then checks ->d_flags for "is it a symlink".  Race with symlink(2)
in case if our CPU sees ->d_inode update first - we see non-NULL
there, but ->d_flags still contains DCACHE_MISS_TYPE instead of
DCACHE_SYMLINK_TYPE.  Result: false negative on "should we follow
link here?", with subsequent unpleasantness.

Cc: stable@vger.kernel.org # 3.13 and 3.14 need that one
Reported-and-tested-by: NLin Ming <minggr@gmail.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

22213318

02 4月, 2014 2 次提交
- A
  new helper: readlink_copy() · 5d826c84
  由 Al Viro 提交于 3月 14, 2014
```
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
  5d826c84
- A
  namei.c: move EXPORT_SYMBOL to corresponding definitions · 4d359507
  由 Al Viro 提交于 3月 14, 2014
```
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
  4d359507

openeuler / Kernel 11 个月 前同步成功

openeuler / Kernel
11 个月前同步成功