提交 · a6795a585929d94ca3e931bc8518f8deb8bbe627 · openeuler / Kernel

18 7月, 2018 3 次提交

vfs: fix freeze protection in mnt_want_write_file() for overlayfs · a6795a58

由 Miklos Szeredi 提交于 7月 18, 2018

The underlying real file used by overlayfs still contains the overlay path.
This results in mnt_want_write_file() calls by the filesystem getting
freeze protection on the wrong inode (the overlayfs one instead of the real
one).

Fix by using file_inode(file)->i_sb instead of file->f_path.mnt->mnt_sb.

Reported-by: Amir Goldstein <amir73il@gmail.com> 
Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>

a6795a58

Revert "ovl: don't allow writing ioctl on lower layer" · 6742cee0

由 Miklos Szeredi 提交于 7月 18, 2018

This reverts commit 7c6893e3.

Overlayfs no longer relies on the vfs for checking writability of files.
Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>

6742cee0

Revert "ovl: fix may_write_real() for overlayfs directories" · d561f218

由 Miklos Szeredi 提交于 7月 18, 2018

This reverts commit 954c736f.

Overlayfs no longer relies on the vfs for checking writability of files.
Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>

d561f218

25 5月, 2018 1 次提交

fs: Allow superblock owner to access do_remount_sb() · bc6155d1

由 Eric W. Biederman 提交于 9月 18, 2017

Superblock level remounts are currently restricted to global
CAP_SYS_ADMIN, as is the path for changing the root mount to
read only on umount. Loosen both of these permission checks to
also allow CAP_SYS_ADMIN in any namespace which is privileged
towards the userns which originally mounted the filesystem.
Signed-off-by: NSeth Forshee <seth.forshee@canonical.com>
Acked-by: N"Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: NSerge Hallyn <serge@hallyn.com>
Acked-by: NChristian Brauner <christian@brauner.io>
Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>

bc6155d1

21 4月, 2018 1 次提交

vfs: Undo an overly zealous MS_RDONLY -> SB_RDONLY conversion · a9e5b732

由 David Howells 提交于 4月 20, 2018

In do_mount() when the MS_* flags are being converted to MNT_* flags,
MS_RDONLY got accidentally convered to SB_RDONLY.

Undo this change.

Fixes: e462ec50 ("VFS: Differentiate mount flags (MS_*) from internal superblock flags")
Signed-off-by: NDavid Howells <dhowells@redhat.com>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

a9e5b732

20 4月, 2018 1 次提交

Don't leak MNT_INTERNAL away from internal mounts · 16a34adb

由 Al Viro 提交于 4月 19, 2018

We want it only for the stuff created by SB_KERNMOUNT mounts, *not* for
their copies.  As it is, creating a deep stack of bindings of /proc/*/ns/*
somewhere in a new namespace and exiting yields a stack overflow.

Cc: stable@kernel.org
Reported-by: NAlexander Aring <aring@mojatatu.com>
Bisected-by: NKirill Tkhai <ktkhai@virtuozzo.com>
Tested-by: NKirill Tkhai <ktkhai@virtuozzo.com>
Tested-by: NAlexander Aring <aring@mojatatu.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

16a34adb

03 4月, 2018 2 次提交

fs: add ksys_umount() helper; remove in-kernel call to sys_umount() · 3a18ef5c

由 Dominik Brodowski 提交于 3月 11, 2018

Using this helper allows us to avoid the in-kernel call to the sys_umount()
syscall. The ksys_ prefix denotes that this function is meant as a drop-in
replacement for the syscall. In particular, it uses the same calling
convention as ksys_umount().

In the near future, the only fs-external caller of ksys_umount() should be
converted to call do_umount() directly. Then, ksys_umount() can be moved
within sys_umount() again.

This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NDominik Brodowski <linux@dominikbrodowski.net>

3a18ef5c

fs: add ksys_mount() helper; remove in-kernel calls to sys_mount() · 312db1aa

由 Dominik Brodowski 提交于 3月 11, 2018

Using this helper allows us to avoid the in-kernel calls to the sys_mount()
syscall. The ksys_ prefix denotes that this function is meant as a drop-in
replacement for the syscall. In particular, it uses the same calling
convention as sys_mount().

In the near future, all callers of ksys_mount() should be converted to call
do_mount() directly.

This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NDominik Brodowski <linux@dominikbrodowski.net>

312db1aa

10 12月, 2017 1 次提交

VFS: Handle lazytime in do_mount() · d7ee9469

由 Markus Trippelsdorf 提交于 10月 11, 2017

Since commit e462ec50 ("VFS: Differentiate mount flags (MS_*) from
internal superblock flags") the lazytime mount option doesn't get passed
on anymore.

Fix the issue by handling the option in do_mount().
Reviewed-by: NLukas Czerner <lczerner@redhat.com>
Signed-off-by: NMarkus Trippelsdorf <markus@trippelsdorf.de>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

d7ee9469

09 11月, 2017 1 次提交

vfs: fix mounting a filesystem with i_version · 46cdc6d5

由 Mimi Zohar 提交于 10月 07, 2017

The mount i_version flag is not enabled in the new sb_flags.  This patch
adds the missing SB_I_VERSION flag.

Fixes: e462ec50 "VFS: Differentiate mount flags (MS_*) from internal
       superblock flags"
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: NMimi Zohar <zohar@linux.vnet.ibm.com>

46cdc6d5

25 10月, 2017 1 次提交

locking/atomics: COCCINELLE/treewide: Convert trivial ACCESS_ONCE() patterns... · 6aa7de05

由 Mark Rutland 提交于 10月 23, 2017

locking/atomics: COCCINELLE/treewide: Convert trivial ACCESS_ONCE() patterns to READ_ONCE()/WRITE_ONCE()

Please do not apply this to mainline directly, instead please re-run the
coccinelle script shown below and apply its output.

For several reasons, it is desirable to use {READ,WRITE}_ONCE() in
preference to ACCESS_ONCE(), and new code is expected to use one of the
former. So far, there's been no reason to change most existing uses of
ACCESS_ONCE(), as these aren't harmful, and changing them results in
churn.

However, for some features, the read/write distinction is critical to
correct operation. To distinguish these cases, separate read/write
accessors must be used. This patch migrates (most) remaining
ACCESS_ONCE() instances to {READ,WRITE}_ONCE(), using the following
coccinelle script:

----
// Convert trivial ACCESS_ONCE() uses to equivalent READ_ONCE() and
// WRITE_ONCE()

// $ make coccicheck COCCI=/home/mark/once.cocci SPFLAGS="--include-headers" MODE=patch

virtual patch

@ depends on patch @
expression E1, E2;
@@

- ACCESS_ONCE(E1) = E2
+ WRITE_ONCE(E1, E2)

@ depends on patch @
expression E;
@@

- ACCESS_ONCE(E)
+ READ_ONCE(E)
----
Signed-off-by: NMark Rutland <mark.rutland@arm.com>
Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: davem@davemloft.net
Cc: linux-arch@vger.kernel.org
Cc: mpe@ellerman.id.au
Cc: shuah@kernel.org
Cc: snitzer@redhat.com
Cc: thor.thayer@linux.intel.com
Cc: tj@kernel.org
Cc: viro@zeniv.linux.org.uk
Cc: will.deacon@arm.com
Link: http://lkml.kernel.org/r/1508792849-3115-19-git-send-email-paulmck@linux.vnet.ibm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

6aa7de05

17 10月, 2017 1 次提交

vfs: fix mounting a filesystem with i_version · 917086ff

由 Mimi Zohar 提交于 10月 08, 2017

The mount i_version flag is not enabled in the new sb_flags.  This patch
adds the missing SB_I_VERSION flag.

Fixes: e462ec50 "VFS: Differentiate mount flags (MS_*) from internal
       superblock flags"
Signed-off-by: NMimi Zohar <zohar@linux.vnet.ibm.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

917086ff

05 10月, 2017 1 次提交

ovl: fix may_write_real() for overlayfs directories · 954c736f

由 Amir Goldstein 提交于 9月 18, 2017

Overlayfs directory file_inode() is the overlay inode whether the real
inode is upper or lower.

This fixes a regression in xfstest generic/158.

Fixes: 7c6893e3 ("ovl: don't allow writing ioctl on lower layer")
Signed-off-by: NAmir Goldstein <amir73il@gmail.com>
Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>

954c736f

05 9月, 2017 1 次提交

ovl: don't allow writing ioctl on lower layer · 7c6893e3

由 Miklos Szeredi 提交于 9月 05, 2017

Problem with ioctl() is that it's a file operation, yet often used as an
inode operation (i.e. modify the inode despite the file being opened for
read-only).

mnt_want_write_file() is used by filesystems in such cases to get write
access on an arbitrary open file.

Since overlayfs lets filesystems do all file operations, including ioctl,
this can lead to mnt_want_write_file() returning OK for a lower file and
modification of that lower file.

This patch prevents modification by checking if the file is from an
overlayfs lower layer and returning EPERM in that case.

Need to introduce a mnt_want_write_file_path() variant that still does the
old thing for inode operations that can do the copy up + modification
correctly in such cases (fchown, fsetxattr, fremovexattr).

This does not address the correctness of such ioctls on overlayfs (the
correct way would be to copy up and attempt to perform ioctl on upper
file).

In theory this could be a regression.  We very much hope that nobody is
relying on such a hack in any sane setup.

While this patch meddles in VFS code, it has no effect on non-overlayfs
filesystems.
Reported-by: N"zhangyi (F)" <yi.zhang@huawei.com>
Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>

7c6893e3

28 8月, 2017 1 次提交

namespace.c: Don't reinvent the wheel but use existing llist API · 29785735

由 Byungchul Park 提交于 8月 07, 2017

Although llist provides proper APIs, they are not used. Make them used.
Signed-off-by: NByungchul Park <byungchul.park@lge.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

29785735

17 7月, 2017 2 次提交

VFS: Differentiate mount flags (MS_*) from internal superblock flags · e462ec50

由 David Howells 提交于 7月 17, 2017

Differentiate the MS_* flags passed to mount(2) from the internal flags set
in the super_block's s_flags.  s_flags are now called SB_*, with the names
and the values for the moment mirroring the MS_* flags that they're
equivalent to.

In this patch, just the headers are altered and some kernel code where
blind automated conversion isn't necessarily correct.

Note that this shows up some interesting issues:

 (1) Some MS_* flags get translated to MNT_* flags (such as MS_NODEV ->
     MNT_NODEV) without passing this on to the filesystem, but some
     filesystems set such flags anyway.

 (2) The ->remount_fs() methods of some filesystems adjust the *flags
     argument by setting MS_* flags in it, such as MS_NOATIME - but these
     flags are then scrubbed by do_remount_sb() (only the occupants of
     MS_RMT_MASK are permitted: MS_RDONLY, MS_SYNCHRONOUS, MS_MANDLOCK,
     MS_I_VERSION and MS_LAZYTIME)

I'm not sure what's the best way to solve all these cases.
Suggested-by: NAl Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: NDavid Howells <dhowells@redhat.com>

e462ec50

VFS: Convert sb->s_flags & MS_RDONLY to sb_rdonly(sb) · bc98a42c

由 David Howells 提交于 7月 17, 2017

Firstly by applying the following with coccinelle's spatch:

	@@ expression SB; @@
	-SB->s_flags & MS_RDONLY
	+sb_rdonly(SB)

to effect the conversion to sb_rdonly(sb), then by applying:

	@@ expression A, SB; @@
	(
	-(!sb_rdonly(SB)) && A
	+!sb_rdonly(SB) && A
	|
	-A != (sb_rdonly(SB))
	+A != sb_rdonly(SB)
	|
	-A == (sb_rdonly(SB))
	+A == sb_rdonly(SB)
	|
	-!(sb_rdonly(SB))
	+!sb_rdonly(SB)
	|
	-A && (sb_rdonly(SB))
	+A && sb_rdonly(SB)
	|
	-A || (sb_rdonly(SB))
	+A || sb_rdonly(SB)
	|
	-(sb_rdonly(SB)) != A
	+sb_rdonly(SB) != A
	|
	-(sb_rdonly(SB)) == A
	+sb_rdonly(SB) == A
	|
	-(sb_rdonly(SB)) && A
	+sb_rdonly(SB) && A
	|
	-(sb_rdonly(SB)) || A
	+sb_rdonly(SB) || A
	)

	@@ expression A, B, SB; @@
	(
	-(sb_rdonly(SB)) ? 1 : 0
	+sb_rdonly(SB)
	|
	-(sb_rdonly(SB)) ? A : B
	+sb_rdonly(SB) ? A : B
	)

to remove left over excess bracketage and finally by applying:

	@@ expression A, SB; @@
	(
	-(A & MS_RDONLY) != sb_rdonly(SB)
	+(bool)(A & MS_RDONLY) != sb_rdonly(SB)
	|
	-(A & MS_RDONLY) == sb_rdonly(SB)
	+(bool)(A & MS_RDONLY) == sb_rdonly(SB)
	)

to make comparisons against the result of sb_rdonly() (which is a bool)
work correctly.
Signed-off-by: NDavid Howells <dhowells@redhat.com>

bc98a42c

11 7月, 2017 1 次提交

VFS: Kill off s_options and helpers · 1d278a87

由 David Howells 提交于 7月 05, 2017

Kill off s_options, save/replace_mount_options() and generic_show_options()
as all filesystems now implement ->show_options() for themselves. This
should make it easier to implement a context-based mount where the mount
options can be passed individually over a file descriptor.
Signed-off-by: NDavid Howells <dhowells@redhat.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

1d278a87

07 7月, 2017 1 次提交

mm: update callers to use HASH_ZERO flag · 3d375d78

由 Pavel Tatashin 提交于 7月 06, 2017

Update dcache, inode, pid, mountpoint, and mount hash tables to use
HASH_ZERO, and remove initialization after allocations.  In case of
places where HASH_EARLY was used such as in __pv_init_lock_hash the
zeroed hash table was already assumed, because memblock zeroes the
memory.

CPU: SPARC M6, Memory: 7T
Before fix:
  Dentry cache hash table entries: 1073741824
  Inode-cache hash table entries: 536870912
  Mount-cache hash table entries: 16777216
  Mountpoint-cache hash table entries: 16777216
  ftrace: allocating 20414 entries in 40 pages
  Total time: 11.798s

After fix:
  Dentry cache hash table entries: 1073741824
  Inode-cache hash table entries: 536870912
  Mount-cache hash table entries: 16777216
  Mountpoint-cache hash table entries: 16777216
  ftrace: allocating 20414 entries in 40 pages
  Total time: 3.198s

CPU: Intel Xeon E5-2630, Memory: 2.2T:
Before fix:
  Dentry cache hash table entries: 536870912
  Inode-cache hash table entries: 268435456
  Mount-cache hash table entries: 8388608
  Mountpoint-cache hash table entries: 8388608
  CPU: Physical Processor ID: 0
  Total time: 3.245s

After fix:
  Dentry cache hash table entries: 536870912
  Inode-cache hash table entries: 268435456
  Mount-cache hash table entries: 8388608
  Mountpoint-cache hash table entries: 8388608
  CPU: Physical Processor ID: 0
  Total time: 3.244s

Link: http://lkml.kernel.org/r/1488432825-92126-4-git-send-email-pasha.tatashin@oracle.comSigned-off-by: NPavel Tatashin <pasha.tatashin@oracle.com>
Reviewed-by: NBabu Moger <babu.moger@oracle.com>
Cc: David Miller <davem@davemloft.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

3d375d78

06 7月, 2017 1 次提交

VFS: Clean up whitespace in fs/namespace.c and fs/super.c · dd111b31

由 David Howells 提交于 7月 04, 2017

Clean up line terminal whitespace in fs/namespace.c and fs/super.c.
Signed-off-by: NDavid Howells <dhowells@redhat.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

dd111b31

15 6月, 2017 1 次提交

fs: don't forget to put old mntns in mntns_install · 4068367c

由 Andrei Vagin 提交于 6月 08, 2017

Fixes: 4f757f3c ("make sure that mntns_install() doesn't end up with referral for root")
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NAndrei Vagin <avagin@openvz.org>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

4068367c

23 5月, 2017 2 次提交

mnt: In propgate_umount handle visiting mounts in any order · 99b19d16

由 Eric W. Biederman 提交于 10月 24, 2016

While investigating some poor umount performance I realized that in
the case of overlapping mount trees where some of the mounts are locked
the code has been failing to unmount all of the mounts it should
have been unmounting.

This failure to unmount all of the necessary
mounts can be reproduced with:

$ cat locked_mounts_test.sh

mount -t tmpfs test-base /mnt
mount --make-shared /mnt
mkdir -p /mnt/b

mount -t tmpfs test1 /mnt/b
mount --make-shared /mnt/b
mkdir -p /mnt/b/10

mount -t tmpfs test2 /mnt/b/10
mount --make-shared /mnt/b/10
mkdir -p /mnt/b/10/20

mount --rbind /mnt/b /mnt/b/10/20

unshare -Urm --propagation unchaged /bin/sh -c 'sleep 5; if [ $(grep test /proc/self/mountinfo | wc -l) -eq 1 ] ; then echo SUCCESS ; else echo FAILURE ; fi'
sleep 1
umount -l /mnt/b
wait %%

$ unshare -Urm ./locked_mounts_test.sh

This failure is corrected by removing the prepass that marks mounts
that may be umounted.

A first pass is added that umounts mounts if possible and if not sets
mount mark if they could be unmounted if they weren't locked and adds
them to a list to umount possibilities.  This first pass reconsiders
the mounts parent if it is on the list of umount possibilities, ensuring
that information of umoutability will pass from child to mount parent.

A second pass then walks through all mounts that are umounted and processes
their children unmounting them or marking them for reparenting.

A last pass cleans up the state on the mounts that could not be umounted
and if applicable reparents them to their first parent that remained
mounted.

While a bit longer than the old code this code is much more robust
as it allows information to flow up from the leaves and down
from the trunk making the order in which mounts are encountered
in the umount propgation tree irrelevant.

Cc: stable@vger.kernel.org
Fixes: 0c56fe31 ("mnt: Don't propagate unmounts to locked mounts")
Reviewed-by: NAndrei Vagin <avagin@virtuozzo.com>
Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>

99b19d16

mnt: In umount propagation reparent in a separate pass · 570487d3

由 Eric W. Biederman 提交于 5月 15, 2017

It was observed that in some pathlogical cases that the current code
does not unmount everything it should.  After investigation it
was determined that the issue is that mnt_change_mntpoint can
can change which mounts are available to be unmounted during mount
propagation which is wrong.

The trivial reproducer is:
$ cat ./pathological.sh

mount -t tmpfs test-base /mnt
cd /mnt
mkdir 1 2 1/1
mount --bind 1 1
mount --make-shared 1
mount --bind 1 2
mount --bind 1/1 1/1
mount --bind 1/1 1/1
echo
grep test-base /proc/self/mountinfo
umount 1/1
echo
grep test-base /proc/self/mountinfo

$ unshare -Urm ./pathological.sh

The expected output looks like:
46 31 0:25 / /mnt rw,relatime - tmpfs test-base rw,uid=1000,gid=1000
47 46 0:25 /1 /mnt/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
48 46 0:25 /1 /mnt/2 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
49 54 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
50 53 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
51 49 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
54 47 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
53 48 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
52 50 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000

46 31 0:25 / /mnt rw,relatime - tmpfs test-base rw,uid=1000,gid=1000
47 46 0:25 /1 /mnt/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
48 46 0:25 /1 /mnt/2 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000

The output without the fix looks like:
46 31 0:25 / /mnt rw,relatime - tmpfs test-base rw,uid=1000,gid=1000
47 46 0:25 /1 /mnt/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
48 46 0:25 /1 /mnt/2 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
49 54 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
50 53 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
51 49 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
54 47 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
53 48 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
52 50 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000

46 31 0:25 / /mnt rw,relatime - tmpfs test-base rw,uid=1000,gid=1000
47 46 0:25 /1 /mnt/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
48 46 0:25 /1 /mnt/2 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
52 48 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000

That last mount in the output was in the propgation tree to be unmounted but
was missed because the mnt_change_mountpoint changed it's parent before the walk
through the mount propagation tree observed it.

Cc: stable@vger.kernel.org
Fixes: 1064f874 ("mnt: Tuck mounts under others instead of creating shadow/side mounts.")
Acked-by: NAndrei Vagin <avagin@virtuozzo.com>
Reviewed-by: NRam Pai <linuxram@us.ibm.com>
Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>

570487d3

22 4月, 2017 1 次提交

make sure that mntns_install() doesn't end up with referral for root · 4f757f3c

由 Al Viro 提交于 4月 15, 2017

new flag: LOOKUP_DOWN.  If the starting point is overmounted, cross
into whatever's mounted on top, triggering referrals et.al.

Use that instead of follow_down_one() loop in mntns_install(), handle
errors properly.
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

4f757f3c

10 4月, 2017 2 次提交

fsnotify: Free fsnotify_mark_connector when there is no mark attached · 08991e83

由 Jan Kara 提交于 2月 01, 2017

Currently we free fsnotify_mark_connector structure only when inode /
vfsmount is getting freed. This can however impose noticeable memory
overhead when marks get attached to inodes only temporarily. So free the
connector structure once the last mark is detached from the object.
Since notification infrastructure can be working with the connector
under the protection of fsnotify_mark_srcu, we have to be careful and
free the fsnotify_mark_connector only after SRCU period passes.
Reviewed-by: NMiklos Szeredi <mszeredi@redhat.com>
Reviewed-by: NAmir Goldstein <amir73il@gmail.com>
Signed-off-by: NJan Kara <jack@suse.cz>

08991e83

fsnotify: Move mark list head from object into dedicated structure · 9dd813c1

由 Jan Kara 提交于 3月 14, 2017

Currently notification marks are attached to object (inode or vfsmnt) by
a hlist_head in the object. The list is also protected by a spinlock in
the object. So while there is any mark attached to the list of marks,
the object must be pinned in memory (and thus e.g. last iput() deleting
inode cannot happen). Also for list iteration in fsnotify() to work, we
must hold fsnotify_mark_srcu lock so that mark itself and
mark->obj_list.next cannot get freed. Thus we are required to wait for
response to fanotify events from userspace process with
fsnotify_mark_srcu lock held. That causes issues when userspace process
is buggy and does not reply to some event - basically the whole
notification subsystem gets eventually stuck.

So to be able to drop fsnotify_mark_srcu lock while waiting for
response, we have to pin the mark in memory and make sure it stays in
the object list (as removing the mark waiting for response could lead to
lost notification events for groups later in the list). However we don't
want inode reclaim to block on such mark as that would lead to system
just locking up elsewhere.

This commit is the first in the series that paves way towards solving
these conflicting lifetime needs. Instead of anchoring the list of marks
directly in the object, we anchor it in a dedicated structure
(fsnotify_mark_connector) and just point to that structure from the
object. The following commits will also add spinlock protecting the list
and object pointer to the structure.
Reviewed-by: NMiklos Szeredi <mszeredi@redhat.com>
Reviewed-by: NAmir Goldstein <amir73il@gmail.com>
Signed-off-by: NJan Kara <jack@suse.cz>

9dd813c1

02 3月, 2017 2 次提交

sched/headers: Prepare to move 'init_task' and 'init_thread_union' from... · 9164bb4a

由 Ingo Molnar 提交于 2月 04, 2017

sched/headers: Prepare to move 'init_task' and 'init_thread_union' from <linux/sched.h> to <linux/sched/task.h>

Update all usage sites first.
Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: NIngo Molnar <mingo@kernel.org>

9164bb4a

sched/headers: Prepare to remove <linux/cred.h> inclusion from <linux/sched.h> · 5b825c3a

由 Ingo Molnar 提交于 2月 02, 2017

Add #include <linux/cred.h> dependencies to all .c files rely on sched.h
doing that for them.

Note that even if the count where we need to add extra headers seems high,
it's still a net win, because <linux/sched.h> is included in over
2,200 files ...
Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: NIngo Molnar <mingo@kernel.org>

5b825c3a

03 2月, 2017 1 次提交

mnt: Tuck mounts under others instead of creating shadow/side mounts. · 1064f874

由 Eric W. Biederman 提交于 1月 20, 2017

Ever since mount propagation was introduced in cases where a mount in
propagated to parent mount mountpoint pair that is already in use the
code has placed the new mount behind the old mount in the mount hash
table.

This implementation detail is problematic as it allows creating
arbitrary length mount hash chains.

Furthermore it invalidates the constraint maintained elsewhere in the
mount code that a parent mount and a mountpoint pair will have exactly
one mount upon them.  Making it hard to deal with and to talk about
this special case in the mount code.

Modify mount propagation to notice when there is already a mount at
the parent mount and mountpoint where a new mount is propagating to
and place that preexisting mount on top of the new mount.

Modify unmount propagation to notice when a mount that is being
unmounted has another mount on top of it (and no other children), and
to replace the unmounted mount with the mount on top of it.

Move the MNT_UMUONT test from __lookup_mnt_last into
__propagate_umount as that is the only call of __lookup_mnt_last where
MNT_UMOUNT may be set on any mount visible in the mount hash table.

These modifications allow:
 - __lookup_mnt_last to be removed.
 - attach_shadows to be renamed __attach_mnt and its shadow
   handling to be removed.
 - commit_tree to be simplified
 - copy_tree to be simplified

The result is an easier to understand tree of mounts that does not
allow creation of arbitrary length hash chains in the mount hash table.

The result is also a very slight userspace visible difference in semantics.
The following two cases now behave identically, where before order
mattered:

case 1: (explicit user action)
	B is a slave of A
	mount something on A/a , it will propagate to B/a
	and than mount something on B/a

case 2: (tucked mount)
	B is a slave of A
	mount something on B/a
	and than mount something on A/a

Histroically umount A/a would fail in case 1 and succeed in case 2.
Now umount A/a succeeds in both configurations.

This very small change in semantics appears if anything to be a bug
fix to me and my survey of userspace leads me to believe that no programs
will notice or care of this subtle semantic change.

v2: Updated to mnt_change_mountpoint to not call dput or mntput
and instead to decrement the counts directly.  It is guaranteed
that there will be other references when mnt_change_mountpoint is
called so this is safe.

v3: Moved put_mountpoint under mount_lock in attach_recursive_mnt
    As the locking in fs/namespace.c changed between v2 and v3.

v4: Reworked the logic in propagate_mount_busy and __propagate_umount
    that detects when a mount completely covers another mount.

v5: Removed unnecessary tests whose result is alwasy true in
    find_topper and attach_recursive_mnt.

v6: Document the user space visible semantic difference.

Cc: stable@vger.kernel.org
Fixes: b90fa9ae ("[PATCH] shared mount handling: bind and rbind")
Tested-by: NAndrei Vagin <avagin@virtuozzo.com>
Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>

1064f874

01 2月, 2017 1 次提交

fs: Better permission checking for submounts · 93faccbb

由 Eric W. Biederman 提交于 2月 01, 2017

To support unprivileged users mounting filesystems two permission
checks have to be performed: a test to see if the user allowed to
create a mount in the mount namespace, and a test to see if
the user is allowed to access the specified filesystem.

The automount case is special in that mounting the original filesystem
grants permission to mount the sub-filesystems, to any user who
happens to stumble across the their mountpoint and satisfies the
ordinary filesystem permission checks.

Attempting to handle the automount case by using override_creds
almost works.  It preserves the idea that permission to mount
the original filesystem is permission to mount the sub-filesystem.
Unfortunately using override_creds messes up the filesystems
ordinary permission checks.

Solve this by being explicit that a mount is a submount by introducing
vfs_submount, and using it where appropriate.

vfs_submount uses a new mount internal mount flags MS_SUBMOUNT, to let
sget and friends know that a mount is a submount so they can take appropriate
action.

sget and sget_userns are modified to not perform any permission checks
on submounts.

follow_automount is modified to stop using override_creds as that
has proven problemantic.

do_mount is modified to always remove the new MS_SUBMOUNT flag so
that we know userspace will never by able to specify it.

autofs4 is modified to stop using current_real_cred that was put in
there to handle the previous version of submount permission checking.

cifs is modified to pass the mountpoint all of the way down to vfs_submount.

debugfs is modified to pass the mountpoint all of the way down to
trace_automount by adding a new parameter.  To make this change easier
a new typedef debugfs_automount_t is introduced to capture the type of
the debugfs automount function.

Cc: stable@vger.kernel.org
Fixes: 069d5ac9 ("autofs:  Fix automounts by using current_real_cred()->uid")
Fixes: aeaa4a79 ("fs: Call d_automount with the filesystems creds")
Reviewed-by: NTrond Myklebust <trond.myklebust@primarydata.com>
Reviewed-by: NSeth Forshee <seth.forshee@canonical.com>
Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>

93faccbb

10 1月, 2017 1 次提交

mnt: Protect the mountpoint hashtable with mount_lock · 3895dbf8

由 Eric W. Biederman 提交于 1月 03, 2017

Protecting the mountpoint hashtable with namespace_sem was sufficient
until a call to umount_mnt was added to mntput_no_expire.  At which
point it became possible for multiple calls of put_mountpoint on
the same hash chain to happen on the same time.

Kristen Johansen <kjlx@templeofstupid.com> reported:
> This can cause a panic when simultaneous callers of put_mountpoint
> attempt to free the same mountpoint.  This occurs because some callers
> hold the mount_hash_lock, while others hold the namespace lock.  Some
> even hold both.
>
> In this submitter's case, the panic manifested itself as a GP fault in
> put_mountpoint() when it called hlist_del() and attempted to dereference
> a m_hash.pprev that had been poisioned by another thread.

Al Viro observed that the simple fix is to switch from using the namespace_sem
to the mount_lock to protect the mountpoint hash table.

I have taken Al's suggested patch moved put_mountpoint in pivot_root
(instead of taking mount_lock an additional time), and have replaced
new_mountpoint with get_mountpoint a function that does the hash table
lookup and addition under the mount_lock.   The introduction of get_mounptoint
ensures that only the mount_lock is needed to manipulate the mountpoint
hashtable.

d_set_mounted is modified to only set DCACHE_MOUNTED if it is not
already set.  This allows get_mountpoint to use the setting of
DCACHE_MOUNTED to ensure adding a struct mountpoint for a dentry
happens exactly once.

Cc: stable@vger.kernel.org
Fixes: ce07d891 ("mnt: Honor MNT_LOCKED when detaching mounts")
Reported-by: NKrister Johansen <kjlx@templeofstupid.com>
Suggested-by: NAl Viro <viro@ZenIV.linux.org.uk>
Acked-by: NAl Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>

3895dbf8

17 12月, 2016 3 次提交

reorganize do_make_slave() · 5235d448

由 Al Viro 提交于 11月 20, 2016

Make sure that clone_mnt() never returns a mount with MNT_SHARED in
flags, but without a valid ->mnt_group_id.  That allows to demystify
do_make_slave() quite a bit, among other things.
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

5235d448

A
clone_private_mount() doesn't need to touch namespace_sem · 066715d3
由 Al Viro 提交于 11月 19, 2016
```
not for CL_PRIVATE clone_mnt()
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
066715d3
A
remove a bogus claim about namespace_sem being held by callers of mnt_alloc_id() · f4cc1c38
由 Al Viro 提交于 11月 19, 2016
```
Hadn't been true for quite a while
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
f4cc1c38

06 12月, 2016 2 次提交

A
namespace.c: constify struct path passed to a bunch of primitives · ca71cf71
由 Al Viro 提交于 11月 20, 2016
```
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
ca71cf71

fs: Constify path_is_under()'s arguments · 640eb7e7

由 Mickaël Salaün 提交于 11月 14, 2016

The function path_is_under() doesn't modify the paths pointed by its
arguments but only browse them. Constifying this pointers make a cleaner
interface to be used by (future) code which may only have access to
const struct path pointers (e.g. LSM hooks).
Signed-off-by: NMickaël Salaün <mic@digikod.net>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

640eb7e7

04 12月, 2016 1 次提交

vfs: add path_is_mountpoint() helper · c6609c0a

由 Ian Kent 提交于 11月 24, 2016

d_mountpoint() can only be used reliably to establish if a dentry is
not mounted in any namespace. It isn't aware of the possibility there
may be multiple mounts using a given dentry that may be in a different
namespace.

Add helper functions, path_is_mountpoint(), that checks if a struct path
is a mountpoint for this case.

Link: http://lkml.kernel.org/r/20161011053358.27645.9729.stgit@pluto.themaw.netSigned-off-by: NIan Kent <raven@themaw.net>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Omar Sandoval <osandov@osandov.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

c6609c0a

11 10月, 2016 1 次提交

latent_entropy: Mark functions with __latent_entropy · 0766f788

由 Emese Revfy 提交于 6月 20, 2016

The __latent_entropy gcc attribute can be used only on functions and
variables.  If it is on a function then the plugin will instrument it for
gathering control-flow entropy. If the attribute is on a variable then
the plugin will initialize it with random contents.  The variable must
be an integer, an integer array type or a structure with integer fields.

These specific functions have been selected because they are init
functions (to help gather boot-time entropy), are called at unpredictable
times, or they have variable loops, each of which provide some level of
latent entropy.
Signed-off-by: NEmese Revfy <re.emese@gmail.com>
[kees: expanded commit message]
Signed-off-by: NKees Cook <keescook@chromium.org>

0766f788

01 10月, 2016 1 次提交

mnt: Add a per mount namespace limit on the number of mounts · d2921684

由 Eric W. Biederman 提交于 9月 28, 2016

CAI Qian <caiqian@redhat.com> pointed out that the semantics
of shared subtrees make it possible to create an exponentially
increasing number of mounts in a mount namespace.

    mkdir /tmp/1 /tmp/2
    mount --make-rshared /
    for i in $(seq 1 20) ; do mount --bind /tmp/1 /tmp/2 ; done

Will create create 2^20 or 1048576 mounts, which is a practical problem
as some people have managed to hit this by accident.

As such CVE-2016-6213 was assigned.

Ian Kent <raven@themaw.net> described the situation for autofs users
as follows:

> The number of mounts for direct mount maps is usually not very large because of
> the way they are implemented, large direct mount maps can have performance
> problems. There can be anywhere from a few (likely case a few hundred) to less
> than 10000, plus mounts that have been triggered and not yet expired.
>
> Indirect mounts have one autofs mount at the root plus the number of mounts that
> have been triggered and not yet expired.
>
> The number of autofs indirect map entries can range from a few to the common
> case of several thousand and in rare cases up to between 30000 and 50000. I've
> not heard of people with maps larger than 50000 entries.
>
> The larger the number of map entries the greater the possibility for a large
> number of active mounts so it's not hard to expect cases of a 1000 or somewhat
> more active mounts.

So I am setting the default number of mounts allowed per mount
namespace at 100,000.  This is more than enough for any use case I
know of, but small enough to quickly stop an exponential increase
in mounts.  Which should be perfect to catch misconfigurations and
malfunctioning programs.

For anyone who needs a higher limit this can be changed by writing
to the new /proc/sys/fs/mount-max sysctl.
Tested-by: NCAI Qian <caiqian@redhat.com>
Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>

d2921684

23 9月, 2016 1 次提交

kernel: add a helper to get an owning user namespace for a namespace · bcac25a5

由 Andrey Vagin 提交于 9月 06, 2016

Return -EPERM if an owning user namespace is outside of a process
current user namespace.

v2: In a first version ns_get_owner returned ENOENT for init_user_ns.
    This special cases was removed from this version. There is nothing
    outside of init_user_ns, so we can return EPERM.
v3: rename ns->get_owner() to ns->owner(). get_* usually means that it
grabs a reference.
Acked-by: NSerge Hallyn <serge@hallyn.com>
Signed-off-by: NAndrei Vagin <avagin@openvz.org>
Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>

bcac25a5

openeuler / Kernel 大约 1 年 前同步成功

openeuler / Kernel
大约 1 年前同步成功