提交 · 7c6893e3c9abf6a9676e060a1e35e5caca673d57 · openeuler / raspberrypi-kernel

05 9月, 2017 2 次提交

ovl: don't allow writing ioctl on lower layer · 7c6893e3

由 Miklos Szeredi 提交于 9月 05, 2017

Problem with ioctl() is that it's a file operation, yet often used as an
inode operation (i.e. modify the inode despite the file being opened for
read-only).

mnt_want_write_file() is used by filesystems in such cases to get write
access on an arbitrary open file.

Since overlayfs lets filesystems do all file operations, including ioctl,
this can lead to mnt_want_write_file() returning OK for a lower file and
modification of that lower file.

This patch prevents modification by checking if the file is from an
overlayfs lower layer and returning EPERM in that case.

Need to introduce a mnt_want_write_file_path() variant that still does the
old thing for inode operations that can do the copy up + modification
correctly in such cases (fchown, fsetxattr, fremovexattr).

This does not address the correctness of such ioctls on overlayfs (the
correct way would be to copy up and attempt to perform ioctl on upper
file).

In theory this could be a regression.  We very much hope that nobody is
relying on such a hack in any sane setup.

While this patch meddles in VFS code, it has no effect on non-overlayfs
filesystems.
Reported-by: N"zhangyi (F)" <yi.zhang@huawei.com>
Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>

7c6893e3

vfs: add flags to d_real() · 495e6429

由 Miklos Szeredi 提交于 9月 04, 2017

Add a separate flags argument (in addition to the open flags) to control
the behavior of d_real().
Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>

495e6429

06 7月, 2017 1 次提交

fs: new infrastructure for writeback error handling and reporting · 5660e13d

由 Jeff Layton 提交于 7月 06, 2017

Most filesystems currently use mapping_set_error and
filemap_check_errors for setting and reporting/clearing writeback errors
at the mapping level. filemap_check_errors is indirectly called from
most of the filemap_fdatawait_* functions and from
filemap_write_and_wait*. These functions are called from all sorts of
contexts to wait on writeback to finish -- e.g. mostly in fsync, but
also in truncate calls, getattr, etc.

The non-fsync callers are problematic. We should be reporting writeback
errors during fsync, but many places spread over the tree clear out
errors before they can be properly reported, or report errors at
nonsensical times.

If I get -EIO on a stat() call, there is no reason for me to assume that
it is because some previous writeback failed. The fact that it also
clears out the error such that a subsequent fsync returns 0 is a bug,
and a nasty one since that's potentially silent data corruption.

This patch adds a small bit of new infrastructure for setting and
reporting errors during address_space writeback. While the above was my
original impetus for adding this, I think it's also the case that
current fsync semantics are just problematic for userland. Most
applications that call fsync do so to ensure that the data they wrote
has hit the backing store.

In the case where there are multiple writers to the file at the same
time, this is really hard to determine. The first one to call fsync will
see any stored error, and the rest get back 0. The processes with open
fds may not be associated with one another in any way. They could even
be in different containers, so ensuring coordination between all fsync
callers is not really an option.

One way to remedy this would be to track what file descriptor was used
to dirty the file, but that's rather cumbersome and would likely be
slow. However, there is a simpler way to improve the semantics here
without incurring too much overhead.

This set adds an errseq_t to struct address_space, and a corresponding
one is added to struct file. Writeback errors are recorded in the
mapping's errseq_t, and the one in struct file is used as the "since"
value.

This changes the semantics of the Linux fsync implementation such that
applications can now use it to determine whether there were any
writeback errors since fsync(fd) was last called (or since the file was
opened in the case of fsync having never been called).

Note that those writeback errors may have occurred when writing data
that was dirtied via an entirely different fd, but that's the case now
with the current mapping_set_error/filemap_check_error infrastructure.
This will at least prevent you from getting a false report of success.

The new behavior is still consistent with the POSIX spec, and is more
reliable for application developers. This patch just adds some basic
infrastructure for doing this, and ensures that the f_wb_err "cursor"
is properly set when a file is opened. Later patches will change the
existing code to use this new infrastructure for reporting errors at
fsync time.
Signed-off-by: NJeff Layton <jlayton@redhat.com>
Reviewed-by: NJan Kara <jack@suse.cz>

5660e13d

28 6月, 2017 1 次提交

fs: add fcntl() interface for setting/getting write life time hints · c75b1d94

由 Jens Axboe 提交于 6月 27, 2017

Define a set of write life time hints:

RWH_WRITE_LIFE_NOT_SET	No hint information set
RWH_WRITE_LIFE_NONE	No hints about write life time
RWH_WRITE_LIFE_SHORT	Data written has a short life time
RWH_WRITE_LIFE_MEDIUM	Data written has a medium life time
RWH_WRITE_LIFE_LONG	Data written has a long life time
RWH_WRITE_LIFE_EXTREME	Data written has an extremely long life time

The intent is for these values to be relative to each other, no
absolute meaning should be attached to these flag names.

Add an fcntl interface for querying these flags, and also for
setting them as well:

F_GET_RW_HINT		Returns the read/write hint set on the
			underlying inode.

F_SET_RW_HINT		Set one of the above write hints on the
			underlying inode.

F_GET_FILE_RW_HINT	Returns the read/write hint set on the
			file descriptor.

F_SET_FILE_RW_HINT	Set one of the above write hints on the
			file descriptor.

The user passes in a 64-bit pointer to get/set these values, and
the interface returns 0/-1 on success/error.

Sample program testing/implementing basic setting/getting of write
hints is below.

Add support for storing the write life time hint in the inode flags
and in struct file as well, and pass them to the kiocb flags. If
both a file and its corresponding inode has a write hint, then we
use the one in the file, if available. The file hint can be used
for sync/direct IO, for buffered writeback only the inode hint
is available.

This is in preparation for utilizing these hints in the block layer,
to guide on-media data placement.

/*
 * writehint.c: get or set an inode write hint
 */
 #include <stdio.h>
 #include <fcntl.h>
 #include <stdlib.h>
 #include <unistd.h>
 #include <stdbool.h>
 #include <inttypes.h>

 #ifndef F_GET_RW_HINT
 #define F_LINUX_SPECIFIC_BASE	1024
 #define F_GET_RW_HINT		(F_LINUX_SPECIFIC_BASE + 11)
 #define F_SET_RW_HINT		(F_LINUX_SPECIFIC_BASE + 12)
 #endif

static char *str[] = { "RWF_WRITE_LIFE_NOT_SET", "RWH_WRITE_LIFE_NONE",
			"RWH_WRITE_LIFE_SHORT", "RWH_WRITE_LIFE_MEDIUM",
			"RWH_WRITE_LIFE_LONG", "RWH_WRITE_LIFE_EXTREME" };

int main(int argc, char *argv[])
{
	uint64_t hint;
	int fd, ret;

	if (argc < 2) {
		fprintf(stderr, "%s: file <hint>\n", argv[0]);
		return 1;
	}

	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 2;
	}

	if (argc > 2) {
		hint = atoi(argv[2]);
		ret = fcntl(fd, F_SET_RW_HINT, &hint);
		if (ret < 0) {
			perror("fcntl: F_SET_RW_HINT");
			return 4;
		}
	}

	ret = fcntl(fd, F_GET_RW_HINT, &hint);
	if (ret < 0) {
		perror("fcntl: F_GET_RW_HINT");
		return 3;
	}

	printf("%s: hint %s\n", argv[1], str[hint]);
	close(fd);
	return 0;
}
Reviewed-by: NMartin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

c75b1d94

27 4月, 2017 1 次提交

fs: completely ignore unknown open flags · 629e014b

由 Christoph Hellwig 提交于 4月 27, 2017

Currently we just stash anything we got into file->f_flags, and the
report it in fcntl(F_GETFD).  This patch just clears out all unknown
flags so that we don't pass them to the fs or report them.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

629e014b

22 4月, 2017 1 次提交
- A
  make sure that fchdir() won't accept referral points, etc. · 159b0956
  由 Al Viro 提交于 4月 15, 2017
```
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
  159b0956
20 4月, 2017 1 次提交

vfs: ftruncate check IS_APPEND() on real upper inode · 78757af6

由 Amir Goldstein 提交于 4月 08, 2017

ftruncate an overlayfs inode was checking IS_APPEND() on
overlay inode, but overlay inode does not have the S_APPEND flag.

Check IS_APPEND() on real upper inode instead.
Signed-off-by: NAmir Goldstein <amir73il@gmail.com>
Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>

78757af6

18 4月, 2017 1 次提交
- A
  open: move compat syscalls from compat.c · e35d49f6
  由 Al Viro 提交于 4月 08, 2017
```
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
  e35d49f6
07 2月, 2017 2 次提交

vfs: wrap write f_ops with file_{start,end}_write() · bfe219d3

由 Amir Goldstein 提交于 1月 31, 2017

Before calling write f_ops, call file_start_write() instead
of sb_start_write().

Replace {sb,file}_start_write() for {copy,clone}_file_range() and
for fallocate().

Beyond correct semantics, this avoids freeze protection to sb when
operating on special inodes, such as fallocate() on a blockdev.
Reviewed-by: NJan Kara <jack@suse.cz>
Signed-off-by: NAmir Goldstein <amir73il@gmail.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>

bfe219d3

vfs: deny fallocate() on directory · 9e79b132

由 Amir Goldstein 提交于 1月 31, 2017

There was an obscure use case of fallocate of directory inode
in the vfs helper with the comment:
"Let individual file system decide if it supports preallocation
 for directories or not."

But there is no in-tree file system that implements fallocate
for directory operations.

Deny an attempt to fallocate a directory with EISDIR error.

This change is needed prior to converting sb_start_write()
to  file_start_write(), so freeze protection is correctly
handled for cases of fallocate file and blockdev.

Cc: linux-api@vger.kernel.org
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NAmir Goldstein <amir73il@gmail.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>

9e79b132

25 12月, 2016 1 次提交

Replace <asm/uaccess.h> with <linux/uaccess.h> globally · 7c0f6ba6

由 Linus Torvalds 提交于 12月 24, 2016

This was entirely automated, using the script by Al:

  PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*<asm/uaccess.h>'
  sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \
        $(git grep -l "$PATT"|grep -v ^include/linux/uaccess.h)

to do the replacement at the end of the merge window.
Requested-by: NAl Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

7c0f6ba6

12 10月, 2016 1 次提交

block: implement (some of) fallocate for block devices · 25f4c414

由 Darrick J. Wong 提交于 10月 11, 2016

After much discussion, it seems that the fallocate feature flag
FALLOC_FL_ZERO_RANGE maps nicely to SCSI WRITE SAME; and the feature
FALLOC_FL_PUNCH_HOLE maps nicely to the devices that have been whitelisted
for zeroing SCSI UNMAP.  Punch still requires that FALLOC_FL_KEEP_SIZE is
set.  A length that goes past the end of the device will be clamped to the
device size if KEEP_SIZE is set; or will return -EINVAL if not.  Both
start and length must be aligned to the device's logical block size.

Since the semantics of fallocate are fairly well established already, wire
up the two pieces.  The other fallocate variants (collapse range, insert
range, and allocate blocks) are not supported.

Link: http://lkml.kernel.org/r/147518379992.22791.8849838163218235007.stgit@birch.djwong.orgSigned-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: NHannes Reinecke <hare@suse.com>
Reviewed-by: NBart Van Assche <bart.vanassche@sandisk.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Mike Snitzer <snitzer@redhat.com> # tweaked header
Cc: Brian Foster <bfoster@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

25f4c414

04 10月, 2016 1 次提交

vfs: add a FALLOC_FL_UNSHARE mode to fallocate to unshare a range of blocks · 71be6b49

由 Darrick J. Wong 提交于 10月 03, 2016

Add a new fallocate mode flag that explicitly unshares blocks on
filesystems that support such features.  The new flag can only
be used with an allocate-mode fallocate call.
Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>

71be6b49

16 9月, 2016 2 次提交

vfs: do get_write_access() on upper layer of overlayfs · 4d0c5ba2

由 Miklos Szeredi 提交于 9月 16, 2016

The problem with writecount is: we want consistent handling of it for
underlying filesystems as well as overlayfs. Making sure i_writecount is
correct on all layers is difficult. Instead this patch makes sure that
when write access is acquired, it's always done on the underlying writable
layer (called the upper layer). We must also make sure to look at the
writecount on this layer when checking for conflicting leases.

Open for write already updates the upper layer's writecount. Leaving only
truncate.

For truncate copy up must happen before get_write_access() so that the
writecount is updated on the upper layer. Problem with this is if
something fails after that, then copy-up was done needlessly. E.g. if
break_lease() was interrupted. Probably not a big deal in practice.

Another interesting case is if there's a denywrite on a lower file that is
then opened for write or truncated. With this patch these will succeed,
which is somewhat counterintuitive. But I think it's still acceptable,
considering that the copy-up does actually create a different file, so the
old, denywrite mapping won't be touched.

On non-overlayfs d_real() is an identity function and d_real_inode() is
equivalent to d_inode() so this patch doesn't change behavior in that case.
Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
Acked-by: NJeff Layton <jlayton@poochiereds.net>
Cc: "J. Bruce Fields" <bfields@fieldses.org>

4d0c5ba2

locks: fix file locking on overlayfs · c568d683

由 Miklos Szeredi 提交于 9月 16, 2016

This patch allows flock, posix locks, ofd locks and leases to work
correctly on overlayfs.

Instead of using the underlying inode for storing lock context use the
overlay inode.  This allows locks to be persistent across copy-up.

This is done by introducing locks_inode() helper and using it instead of
file_inode() to get the inode in locking code.  For non-overlayfs the two
are equivalent, except for an extra pointer dereference in locks_inode().

Since lock operations are in "struct file_operations" we must also make
sure not to call underlying filesystem's lock operations.  Introcude a
super block flag MS_NOREMOTELOCK to this effect.
Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
Acked-by: NJeff Layton <jlayton@poochiereds.net>
Cc: "J. Bruce Fields" <bfields@fieldses.org>

c568d683

30 6月, 2016 1 次提交

vfs: merge .d_select_inode() into .d_real() · 2d902671

由 Miklos Szeredi 提交于 6月 30, 2016

The two methods essentially do the same: find the real dentry/inode
belonging to an overlay dentry.  The difference is in the usage:

vfs_open() uses ->d_select_inode() and expects the function to perform
copy-up if necessary based on the open flags argument.

file_dentry() uses ->d_real() passing in the overlay dentry as well as the
underlying inode.

vfs_rename() uses ->d_select_inode() but passes zero flags.  ->d_real()
with a zero inode would have worked just as well here.

This patch merges the functionality of ->d_select_inode() into ->d_real()
by adding an 'open_flags' argument to the latter.

[Al Viro] Make the signature of d_real() match that of ->d_real() again.
And constify the inode argument, while we are at it.
Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>

2d902671

11 5月, 2016 1 次提交
- M
  vfs: add vfs_select_inode() helper · 54d5ca87
  由 Miklos Szeredi 提交于 5月 11, 2016
```
Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
Cc: <stable@vger.kernel.org> # v4.2+
```
  54d5ca87
03 5月, 2016 1 次提交
- A
  give readdir(2)/getdents(2)/etc. uniform exclusion with lseek() · 63b6df14
  由 Al Viro 提交于 4月 20, 2016
```
same as read() on regular files has, and for the same reason.
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
  63b6df14
31 3月, 2016 1 次提交

fs: add filp_clone_open API · 9a08c352

由 James Bottomley 提交于 2月 17, 2016

I need an API that allows me to obtain a clone of the current file
pointer to pass in to an exec handler.  I've labelled this as an
internal API because I can't see how it would be useful outside of the
fs subsystem.  The use case will be a persistent binfmt_misc handler.
Signed-off-by: NJames Bottomley <James.Bottomley@HansenPartnership.com>
Acked-by: NSerge Hallyn <serge.hallyn@canonical.com>
Acked-by: NJan Kara <jack@suse.cz>

9a08c352

28 3月, 2016 3 次提交
- A
  constify chmod_common/security_path_chmod · be01f9f2
  由 Al Viro 提交于 3月 25, 2016
```
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
  be01f9f2
- A
  constify chown_common/security_path_chown · 7fd25dac
  由 Al Viro 提交于 3月 25, 2016
```
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
  7fd25dac
- A
  constify vfs_truncate() · 7df818b2
  由 Al Viro 提交于 3月 25, 2016
```
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
  7df818b2
23 3月, 2016 1 次提交

fs/coredump: prevent fsuid=0 dumps into user-controlled directories · 378c6520

由 Jann Horn 提交于 3月 22, 2016

This commit fixes the following security hole affecting systems where
all of the following conditions are fulfilled:

 - The fs.suid_dumpable sysctl is set to 2.
 - The kernel.core_pattern sysctl's value starts with "/". (Systems
   where kernel.core_pattern starts with "|/" are not affected.)
 - Unprivileged user namespace creation is permitted. (This is
   true on Linux >=3.8, but some distributions disallow it by
   default using a distro patch.)

Under these conditions, if a program executes under secure exec rules,
causing it to run with the SUID_DUMP_ROOT flag, then unshares its user
namespace, changes its root directory and crashes, the coredump will be
written using fsuid=0 and a path derived from kernel.core_pattern - but
this path is interpreted relative to the root directory of the process,
allowing the attacker to control where a coredump will be written with
root privileges.

To fix the security issue, always interpret core_pattern for dumps that
are written under SUID_DUMP_ROOT relative to the root directory of init.
Signed-off-by: NJann Horn <jann@thejh.net>
Acked-by: NKees Cook <keescook@chromium.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

378c6520

23 1月, 2016 1 次提交

wrappers for ->i_mutex access · 5955102c

由 Al Viro 提交于 1月 22, 2016

parallel to mutex_{lock,unlock,trylock,is_locked,lock_nested},
inode_foo(inode) being mutex_foo(&inode->i_mutex).

Please, use those for access to ->i_mutex; over the coming cycle
->i_mutex will become rwsem, with ->lookup() done with it held
only shared.
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

5955102c

04 1月, 2016 1 次提交
- A
  don't carry MAY_OPEN in op->acc_mode · 62fb4a15
  由 Al Viro 提交于 12月 26, 2015
```
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
  62fb4a15
10 7月, 2015 1 次提交

vfs: Commit to never having exectuables on proc and sysfs. · 90f8572b

由 Eric W. Biederman 提交于 6月 29, 2015

Today proc and sysfs do not contain any executable files.  Several
applications today mount proc or sysfs without noexec and nosuid and
then depend on there being no exectuables files on proc or sysfs.
Having any executable files show on proc or sysfs would cause
a user space visible regression, and most likely security problems.

Therefore commit to never allowing executables on proc and sysfs by
adding a new flag to mark them as filesystems without executables and
enforce that flag.

Test the flag where MNT_NOEXEC is tested today, so that the only user
visible effect will be that exectuables will be treated as if the
execute bit is cleared.

The filesystems proc and sysfs do not currently incoporate any
executable files so this does not result in any user visible effects.

This makes it unnecessary to vet changes to proc and sysfs tightly for
adding exectuable files or changes to chattr that would modify
existing files, as no matter what the individual file say they will
not be treated as exectuable files by the vfs.

Not having to vet changes to closely is important as without this we
are only one proc_create call (or another goof up in the
implementation of notify_change) from having problematic executables
on proc.  Those mistakes are all too easy to make and would create
a situation where there are security issues or the assumptions of
some program having to be broken (and cause userspace regressions).
Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>

90f8572b

24 6月, 2015 2 次提交

fs: Call security_ops->inode_killpriv on truncate · 45f147a1

由 Jan Kara 提交于 5月 21, 2015

Comment in include/linux/security.h says that ->inode_killpriv() should
be called when setuid bit is being removed and that similar security
labels (in fact this applies only to file capabilities) should be
removed at this time as well. However we don't call ->inode_killpriv()
when we remove suid bit on truncate.

We fix the problem by calling ->inode_need_killpriv() and subsequently
->inode_killpriv() on truncate the same way as we do it on file write.

After this patch there's only one user of should_remove_suid() - ocfs2 -
and indeed it's buggy because it doesn't call ->inode_killpriv() on
write. However fixing it is difficult because of special locking
constraints.
Signed-off-by: NJan Kara <jack@suse.cz>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

45f147a1

vfs: add file_path() helper · 9bf39ab2

由 Miklos Szeredi 提交于 6月 19, 2015

Turn
	d_path(&file->f_path, ...);
into
	file_path(file, ...);
Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

9bf39ab2

19 6月, 2015 1 次提交

overlayfs: Make f_path always point to the overlay and f_inode to the underlay · 4bacc9c9

由 David Howells 提交于 6月 18, 2015

Make file->f_path always point to the overlay dentry so that the path in
/proc/pid/fd is correct and to ensure that label-based LSMs have access to the
overlay as well as the underlay (path-based LSMs probably don't need it).

Using my union testsuite to set things up, before the patch I see:

	[root@andromeda union-testsuite]# bash 5</mnt/a/foo107
	[root@andromeda union-testsuite]# ls -l /proc/$$/fd/
	...
	lr-x------. 1 root root 64 Jun  5 14:38 5 -> /a/foo107
	[root@andromeda union-testsuite]# stat /mnt/a/foo107
	...
	Device: 23h/35d Inode: 13381       Links: 1
	...
	[root@andromeda union-testsuite]# stat -L /proc/$$/fd/5
	...
	Device: 23h/35d Inode: 13381       Links: 1
	...

After the patch:

	[root@andromeda union-testsuite]# bash 5</mnt/a/foo107
	[root@andromeda union-testsuite]# ls -l /proc/$$/fd/
	...
	lr-x------. 1 root root 64 Jun  5 14:22 5 -> /mnt/a/foo107
	[root@andromeda union-testsuite]# stat /mnt/a/foo107
	...
	Device: 23h/35d Inode: 40346       Links: 1
	...
	[root@andromeda union-testsuite]# stat -L /proc/$$/fd/5
	...
	Device: 23h/35d Inode: 40346       Links: 1
	...

Note the change in where /proc/$$/fd/5 points to in the ls command.  It was
pointing to /a/foo107 (which doesn't exist) and now points to /mnt/a/foo107
(which is correct).

The inode accessed, however, is the lower layer.  The union layer is on device
25h/37d and the upper layer on 24h/36d.
Signed-off-by: NDavid Howells <dhowells@redhat.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

4bacc9c9

11 5月, 2015 1 次提交

VFS: Handle lower layer dentry/inode in pathwalk · 63afdfc7

由 David Howells 提交于 5月 06, 2015

Make use of d_backing_inode() in pathwalk to gain access to an
inode or dentry that's on a lower layer.
Signed-off-by: NDavid Howells <dhowells@redhat.com>

63afdfc7

12 4月, 2015 3 次提交

A
->aio_read and ->aio_write removed · 84363182
由 Al Viro 提交于 4月 04, 2015
```
no remaining users
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
84363182

NFS: fix BUG() crash in notify_change() with patch to chown_common() · c1b8940b

由 Andrew Elble 提交于 2月 23, 2015

We have observed a BUG() crash in fs/attr.c:notify_change(). The crash
occurs during an rsync into a filesystem that is exported via NFS.

1.) fs/attr.c:notify_change() modifies the caller's version of attr.
2.) 6de0ec00 ("VFS: make notify_change pass ATTR_KILL_S*ID to
    setattr operations") introduced a BUG() restriction such that "no
    function will ever call notify_change() with both ATTR_MODE and
    ATTR_KILL_S*ID set". Under some circumstances though, it will have
    assisted in setting the caller's version of attr to this very
    combination.
3.) 27ac0ffe ("locks: break delegations on any attribute
    modification") introduced code to handle breaking
    delegations. This can result in notify_change() being re-called. attr
    _must_ be explicitly reset to avoid triggering the BUG() established
    in #2.
4.) The path that that triggers this is via fs/open.c:chmod_common().
    The combination of attr flags set here and in the first call to
    notify_change() along with a later failed break_deleg_wait()
    results in notify_change() being called again via retry_deleg
    without resetting attr.

Solution is to move retry_deleg in chmod_common() a bit further up to
ensure attr is completely reset.

There are other places where this seemingly could occur, such as
fs/utimes.c:utimes_common(), but the attr flags are not initially
set in such a way to trigger this.

Fixes: 27ac0ffe ("locks: break delegations on any attribute modification")
Reported-by: NEric Meddaugh <etmsys@rit.edu>
Tested-by: NEric Meddaugh <etmsys@rit.edu>
Signed-off-by: NAndrew Elble <aweits@rit.edu>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

c1b8940b

drop bogus check in file_open_root() · e5b811e3

由 Al Viro 提交于 3月 08, 2015

For one thing, LOOKUP_DIRECTORY will be dealt with in do_last().
For another, name can be an empty string, but not NULL - no callers
pass that and it would oops immediately if they would.
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

e5b811e3

25 3月, 2015 1 次提交

fs: Add support FALLOC_FL_INSERT_RANGE for fallocate · dd46c787

由 Namjae Jeon 提交于 3月 25, 2015

FALLOC_FL_INSERT_RANGE command is the opposite command of
FALLOC_FL_COLLAPSE_RANGE that is needed for someone who wants to add
some data in the middle of file.

FALLOC_FL_INSERT_RANGE will create space for writing new data within
a file after shifting extents to right as given length. This command
also has same limitations as FALLOC_FL_COLLAPSE_RANGE in that
operations need to be filesystem block boundary aligned and cannot
cross the current EOF.
Signed-off-by: NNamjae Jeon <namjae.jeon@samsung.com>
Signed-off-by: NAshish Sangwan <a.sangwan@samsung.com>
Reviewed-by: NDave Chinner <dchinner@redhat.com>
Signed-off-by: NDave Chinner <david@fromorbit.com>

dd46c787

17 2月, 2015 1 次提交

vfs: remove get_xip_mem · e748dcd0

由 Matthew Wilcox 提交于 2月 16, 2015

All callers of get_xip_mem() are now gone.  Remove checks for it,
initialisers of it, documentation of it and the only implementation of it.
 Also remove mm/filemap_xip.c as it is now empty.  Also remove
documentation of the long-gone get_xip_page().
Signed-off-by: NMatthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Andreas Dilger <andreas.dilger@intel.com>
Cc: Boaz Harrosh <boaz@plexistor.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

e748dcd0

23 1月, 2015 1 次提交

fs: create proper filename objects using getname_kernel() · 51689104

由 Paul Moore 提交于 1月 22, 2015

There are several areas in the kernel that create temporary filename
objects using the following pattern:

	int func(const char *name)
	{
		struct filename *file = { .name = name };
		...
		return 0;
	}

... which for the most part works okay, but it causes havoc within the
audit subsystem as the filename object does not persist beyond the
lifetime of the function.  This patch converts all of these temporary
filename objects into proper filename objects using getname_kernel()
and putname() which ensure that the filename object persists until the
audit subsystem is finished with it.

Also, a special thanks to Al Viro, Guenter Roeck, and Sabrina Dubroca
for helping resolve a difficult kernel panic on boot related to a
use-after-free problem in kern_path_create(); the thread can be seen
at the link below:

 * https://lkml.org/lkml/2015/1/20/710

This patch includes code that was either based on, or directly written
by Al in the above thread.

CC: viro@zeniv.linux.org.uk
CC: linux@roeck-us.net
CC: sd@queasysnail.net
CC: linux-fsdevel@vger.kernel.org
Signed-off-by: NPaul Moore <pmoore@redhat.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

51689104

14 12月, 2014 1 次提交

fallocate: create FAN_MODIFY and IN_MODIFY events · 820c12d5

由 Heinrich Schuchardt 提交于 12月 12, 2014

The fanotify and the inotify API can be used to monitor changes of the
file system.  System call fallocate() modifies files.  Hence it should
trigger the corresponding fanotify (FAN_MODIFY) and inotify (IN_MODIFY)
events.  The most interesting case is FALLOC_FL_COLLAPSE_RANGE because
this value allows to create arbitrary file content from random data.

This patch adds the missing call to fsnotify_modify().

The FAN_MODIFY and IN_MODIFY event will be created when fallocate()
succeeds.  It will even be created if the file length remains unchanged,
e.g.  when calling fanotify with flag FALLOC_FL_KEEP_SIZE.

This logic was primarily chosen to keep the coding simple.

It resembles the logic of the write() system call.

When we call write() we always create a FAN_MODIFY event, even in the case
of overwriting with identical data.

Events FAN_MODIFY and IN_MODIFY do not provide any guarantee that data was
actually changed.

Furthermore even if if the filesize remains unchanged, fallocate() may
influence whether a subsequent write() will succeed and hence the
fallocate() call may be considered a modification.

The fallocate(2) man page teaches: After a successful call, subsequent
writes into the range specified by offset and len are guaranteed not to
fail because of lack of disk space.

So calling fallocate(fd, FALLOC_FL_KEEP_SIZE, offset, len) may result in
different outcomes of a subsequent write depending on the values of offset
and len.
Signed-off-by: NHeinrich Schuchardt <xypron.glpk@gmx.de>
Reviewed-by: NJan Kara <jack@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Eric Paris <eparis@parisplace.org>
Cc: John McCutchan <john@johnmccutchan.com>
Cc: Robert Love <rlove@rlove.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

820c12d5

20 11月, 2014 1 次提交

new helper: audit_file() · 9f45f5bf

由 Al Viro 提交于 10月 31, 2014

... for situations when we don't have any candidate in pathnames - basically,
in descriptor-based syscalls.

[Folded the build fix for !CONFIG_AUDITSYSCALL configs from Chen Gang]
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

9f45f5bf

08 11月, 2014 1 次提交

VFS: Rename do_fallocate() to vfs_fallocate() · 72c72bdf

由 Anna Schumaker 提交于 11月 07, 2014

This function needs to be exported so it can be used by the NFSD module
when responding to the new ALLOCATE and DEALLOCATE operations in NFS
v4.2.  Christoph Hellwig suggested renaming the function to stay
consistent with how other vfs functions are named.
Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>

72c72bdf

24 10月, 2014 1 次提交

vfs: add i_op->dentry_open() · 4aa7c634

由 Miklos Szeredi 提交于 10月 24, 2014

Add a new inode operation i_op->dentry_open(). This is for stacked filesystems
that want to return a struct file from a different filesystem.
Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>

4aa7c634