Commits · 83f936c75e3689a63253d89c47a4d239c56d7410 · Linux-御风守护者 / linux

02 Apr, 2014 3 commits

mark struct file that had write access grabbed by open() · 83f936c7

Committed by Al Viro on Mar 14, 2014

New flag in ->f_mode - FMODE_WRITER. Set by do_dentry_open() in case
when it has grabbed write access, checked by __fput() to decide whether
it wants to drop the sucker. Allows us to stop bothering with mnt_clone_write()
in alloc_file(), along with fewer special_file() checks.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
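The idea of recording "write access was grabbed" in the file itself can be illustrated with a small userspace model. This is only a sketch: every `model_*` / `MODEL_*` name and the flag values are hypothetical, and the real do_dentry_open() and __fput() do considerably more.

```c
#include <stdbool.h>

/* Hypothetical flag values; the real FMODE_* constants differ. */
#define MODEL_FMODE_WRITE  0x2u
#define MODEL_FMODE_WRITER 0x10000u

struct model_mount { int writers; };           /* stand-in for the per-mount writer count */
struct model_file  { unsigned int f_mode; struct model_mount *mnt; };

/* Open: grab write access only for writable, non-special files and
 * remember that fact in f_mode. */
static void model_open(struct model_file *f, struct model_mount *m,
                       unsigned int mode, bool special)
{
    f->f_mode = mode;
    f->mnt = m;
    if ((mode & MODEL_FMODE_WRITE) && !special) {
        m->writers++;
        f->f_mode |= MODEL_FMODE_WRITER;
    }
}

/* Release: no need to re-derive the condition, the flag already says
 * whether there is write access to drop. */
static void model_fput(struct model_file *f)
{
    if (f->f_mode & MODEL_FMODE_WRITER)
        f->mnt->writers--;
    f->f_mode = 0;
}
```

In this model, opening a special file for writing leaves `writers` untouched on both open and release, because the flag was never set.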

reduce m_start() cost... · c7999c36

Committed by Al Viro on Feb 27, 2014

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

smarter propagate_mnt() · f2ebb3a9

Committed by Al Viro on Feb 27, 2014

The current mainline has copies propagated to *all* nodes, then
tears down the copies we made for nodes that do not contain
counterparts of the desired mountpoint.  That sets the right
propagation graph for the copies (at teardown time we move
the slaves of removed node to a surviving peer or directly
to master), but we end up paying a fairly steep price in
useless allocations.  It's fairly easy to create a situation
where N calls of mount(2) create exactly N bindings, with
O(N^2) vfsmounts allocated and freed in process.

Fortunately, it is possible to avoid those allocations/freeings.
The trick is to create copies in the right order and find which
one would've eventually become a master with the current algorithm.
It turns out to be possible in O(nodes getting propagation) time
and with no extra allocations at all.

One part is that we need to make sure that eventual master will be
created before its slaves, so we need to walk the propagation
tree in a different order - by peer groups.  And iterate through
the peers before dealing with the next group.

Another thing is finding the (earlier) copy that will be a master
of one we are about to create; to do that we are (temporarily) marking
the masters of mountpoints we are attaching the copies to.

Either we are in a peer of the last mountpoint we'd dealt with,
or we have the following situation: we are attaching to mountpoint M,
the last copy S_0 had been attached to M_0 and there are sequences
S_0...S_n, M_0...M_n such that S_{i+1} is a master of S_{i},
S_{i} mounted on M_{i} and we need to create a slave of the first S_{k}
such that M is getting propagation from M_{k}.  It means that the master
of M_{k} will be among the sequence of masters of M.  On the
other hand, the nearest marked node in that sequence will either
be the master of M_{k} or the master of M_{k-1} (the latter -
in the case if M_{k-1} is a slave of something M gets propagation
from, but in a wrong peer group).

So we go through the sequence of masters of M until we find
a marked one (P).  Let N be the one before it.  Then we go through
the sequence of masters of S_0 until we find one (say, S) mounted
on a node D that has P as master and check if D is a peer of N.
If it is, S will be the master of new copy, if not - the master of S
will be.

That's it for the hard part; the rest is fairly simple.  Iterator
is in next_group(), handling of one prospective mountpoint is
propagate_one().

It seems to survive all tests and gives a noticeably better performance
than the current mainline for setups that are seriously using shared
subtrees.

Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

31 Mar, 2014 4 commits

switch mnt_hash to hlist · 38129a13

Committed by Al Viro on Mar 20, 2014

fixes RCU bug - walking through hlist is safe in face of element moves,
since it's self-terminating.  Cyclic lists are not - if we end up jumping
to another hash chain, we'll loop infinitely without ever hitting the
original list head.

[fix for dumb braino folded]

Spotted by: Max Kellermann <mk@cm4all.com>
Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
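The difference between the two list shapes can be shown with a toy model. The sketch below is an illustration only, with hypothetical types and helpers (`hnode`, `cnode`, `walk_hlist`, `walk_cyclic`) and no RCU or locking at all: a NULL-terminated walk always ends, while a circular walk that has drifted onto another chain never sees the head it is waiting for.

```c
#include <stddef.h>

struct hnode { struct hnode *next; };   /* NULL-terminated chain, hlist-like */
struct cnode { struct cnode *next; };   /* circular list, the head is a node too */

/* Walk until NULL. Terminates no matter which chain pos ended up on. */
static int walk_hlist(const struct hnode *pos)
{
    int steps = 0;
    for (; pos; pos = pos->next)
        steps++;
    return steps;
}

/* Walk until we are back at head. If pos was moved to a different ring,
 * head is never reached; the limit exists only so the demo can report it. */
static int walk_cyclic(const struct cnode *head, const struct cnode *pos, int limit)
{
    int steps = 0;
    for (; pos != head; pos = pos->next)
        if (++steps > limit)
            return -1;      /* would have looped forever */
    return steps;
}
```

A walker that started on chain A and is now standing on an element that was moved to chain B gets a finite step count from `walk_hlist()`, but `walk_cyclic()` called with A's head only stops because of the artificial limit.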

don't bother with propagate_mnt() unless the target is shared · 0b1b901b

Committed by Al Viro on Mar 21, 2014

If the dest_mnt is not shared, propagate_mnt() does nothing -
there are no mounts to propagate to and thus no copies to create.
Might as well not bother calling it in that case.

Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

keep shadowed vfsmounts together · 1d6a32ac

Committed by Al Viro on Mar 20, 2014

preparation for switching mnt_hash to hlist

Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

resizable namespace.c hashes · 0818bf27

Committed by Al Viro on Feb 28, 2014

* switch allocation to alloc_large_system_hash()
* make sizes overridable by boot parameters (mhash_entries=, mphash_entries=)
* switch mountpoint_hashtable from list_head to hlist_head

Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

30 Nov, 2013 1 commit

sysfs, kernfs: prepare mount path for kernfs · 4b93dc9b

Committed by Tejun Heo on Nov 28, 2013

We're in the process of separating out core sysfs functionality into
kernfs which will deal with sysfs_dirents directly.  This patch
rearranges mount path so that the kernfs and sysfs parts are separate.

* As sysfs_super_info won't be visible outside kernfs proper,
  kernfs_super_ns() is added to allow kernfs users to access a
  super_block's namespace tag.

* Generic mount operation is separated out into kernfs_mount_ns().
  sysfs_mount() now just performs sysfs-specific permission check,
  acquires namespace tag, and invokes kernfs_mount_ns().

* Generic superblock release is separated out into kernfs_kill_sb()
  which can be used directly as file_system_type->kill_sb().  As sysfs
  needs to put the namespace tag, sysfs_kill_sb() wraps
  kernfs_kill_sb() with ns tag put.

* sysfs_dir_cachep init and sysfs_inode_init() are separated out into
  kernfs_init().  kernfs_init() uses only small amount of memory and
  trying to handle and propagate kernfs_init() failure doesn't make
  much sense.  Use SLAB_PANIC for sysfs_dir_cachep and make
  sysfs_inode_init() panic on failure.

  After this change, kernfs_init() should be called before
  sysfs_init(), fs/namespace.c::mnt_init() modified accordingly.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: linux-fsdevel@vger.kernel.org
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

27 Nov, 2013 1 commit

vfs: Fix a regression in mounting proc · 41301ae7

Committed by Eric W. Biederman on Nov 14, 2013

Gao feng <gaofeng@cn.fujitsu.com> reported that commit e51db735
("userns: Better restrictions on when proc and sysfs can be mounted")
caused a regression on mounting a new instance of proc in a mount
namespace created with user namespace privileges, when binfmt_misc
is mounted on /proc/sys/fs/binfmt_misc.

This is an unintended regression caused by the absolutely bogus empty
directory check in fs_fully_visible.  The check fs_fully_visible replaced
didn't even bother to attempt to verify proc was fully visible and
hiding proc files with any kind of mount is rare.  So for now fix
the userspace regression by allowing directory with nlink == 1
as /proc/sys/fs/binfmt_misc has.

I will have a better patch but it is not stable material, or
last minute kernel material.  So it will have to wait.

Cc: stable@vger.kernel.org
Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Acked-by: Gao feng <gaofeng@cn.fujitsu.com>
Tested-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

09 Nov, 2013 1 commit

RCU'd vfsmounts · 48a066e7

Committed by Al Viro on Sep 29, 2013

* RCU-delayed freeing of vfsmounts
* vfsmount_lock replaced with a seqlock (mount_lock)
* sequence number from mount_lock is stored in nameidata->m_seq and
used when we exit RCU mode
* new vfsmount flag - MNT_SYNC_UMOUNT.  Set by umount_tree() when its
caller knows that vfsmount will have no surviving references.
* synchronize_rcu() done between unlocking namespace_sem in namespace_unlock()
and doing pending mntput().
* new helper: legitimize_mnt(mnt, seq).  Checks the mount_lock sequence
number against seq, then grabs reference to mnt.  Then it rechecks mount_lock
again to close the race and either returns success or drops the reference it
has acquired.  The subtle point is that in case of MNT_SYNC_UMOUNT we can
simply decrement the refcount and sod off - aforementioned synchronize_rcu()
makes sure that final mntput() won't come until we leave RCU mode.  We need
that, since we don't want to end up with some lazy pathwalk racing with
umount() and stealing the final mntput() from it - caller of umount() may
expect it to return only once the fs is shut down and we don't want to break
that.  In other cases (i.e. with MNT_SYNC_UMOUNT absent) we have to do
full-blown mntput() in case of mount_lock sequence number mismatch happening
just as we'd grabbed the reference, but in those cases we won't be stealing
the final mntput() from anything that would care.
* mntput_no_expire() doesn't lock anything on the fast path now.  Incidentally,
SMP and UP cases are handled the same way - no ifdefs there.
* normal pathname resolution does *not* do any writes to mount_lock.  It does,
of course, bump the refcounts of vfsmount and dentry in the very end, but that's
it.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
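The check, grab, recheck sequence described for legitimize_mnt() can be modelled in a few lines. This is a single-threaded sketch under loud assumptions: all `model_*` names are hypothetical, a plain counter stands in for the mount_lock seqlock, there are no memory barriers or RCU, and `model_race_hook` exists only so a concurrent writer can be simulated.

```c
#include <stdbool.h>
#include <stddef.h>

struct model_mnt { int count; bool sync_umount; };

static unsigned int model_mount_seq;        /* stands in for the mount_lock sequence */
static void (*model_race_hook)(void);       /* demo only: runs between grab and recheck */

static void model_bump_seq(void) { model_mount_seq += 2; }  /* a "writer" came and went */

static bool model_seq_retry(unsigned int seq) { return model_mount_seq != seq; }

/* Full-blown put; the real mntput() may end up doing the final teardown. */
static void model_mntput(struct model_mnt *m) { m->count--; }

/* Returns true if the caller now legitimately holds a reference. */
static bool model_legitimize_mnt(struct model_mnt *m, unsigned int seq)
{
    if (model_seq_retry(seq))
        return false;               /* already stale, nothing grabbed */
    if (!m)
        return true;
    m->count++;                     /* grab the reference */
    if (model_race_hook)
        model_race_hook();
    if (!model_seq_retry(seq))
        return true;                /* recheck passed, the reference is good */
    if (m->sync_umount) {
        m->count--;                 /* plain decrement, never the final put */
        return false;
    }
    model_mntput(m);                /* other cases: do the full put */
    return false;
}
```

In the model both failure branches leave the count where it started; the distinction only matters in the real code, where the full put may be the one that triggers teardown and the plain decrement is guaranteed not to be.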

25 Oct, 2013 13 commits

split __lookup_mnt() in two functions · 474279dc

Committed by Al Viro on Oct 01, 2013

Instead of passing the direction as argument (and checking it on every
step through the hash chain), just have separate __lookup_mnt() and
__lookup_mnt_last().  And use the standard iterators...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

new helpers: lock_mount_hash/unlock_mount_hash · 719ea2fb

Committed by Al Viro on Sep 29, 2013

aka br_write_{lock,unlock} of vfsmount_lock.  Inlines in fs/mount.h,
vfsmount_lock extern moved over there as well.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

namespace.c: get rid of mnt_ghosts · aba809cf

Committed by Al Viro on Sep 28, 2013

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

fold dup_mnt_ns() into its only surviving caller · 9559f689

Committed by Al Viro on Sep 28, 2013

should've been done 6 years ago...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

mnt_set_expiry() doesn't need vfsmount_lock · f6b742d8

Committed by Al Viro on Sep 28, 2013

->mnt_expire is protected by namespace_sem

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

finish_automount() doesn't need vfsmount_lock for removal from expiry list · 22a79192

Committed by Al Viro on Sep 28, 2013

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

fs/namespace.c: bury long-dead define · 085e83ff

Committed by Al Viro on Sep 28, 2013

MNT_WRITER_UNDERFLOW_LIMIT was missed 4 years ago when it became unused.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

fold mntfree() into mntput_no_expire() · 649a795a

Committed by Al Viro on Sep 28, 2013

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

do_remount(): pull touch_mnt_namespace() up · 6339dab8

Committed by Al Viro on Sep 16, 2013

... and don't bother with dropping and regaining vfsmount_lock

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

dup_mnt_ns(): get rid of pointless grabbing of vfsmount_lock · aa7a574d

Committed by Al Viro on Sep 16, 2013

mnt_list is protected by namespace_sem, not vfsmount_lock

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

fs_is_visible only needs namespace_sem held shared · 44bb4385

Committed by Al Viro on Sep 16, 2013

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

initialize namespace_sem statically · 59aa0da8

Committed by Al Viro on Sep 16, 2013

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

put_mnt_ns(): use drop_collected_mounts() · 7b00ed6f

Committed by Al Viro on Sep 16, 2013

... rather than open-coding it

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

12 Sep, 2013 1 commit

initmpfs: move rootfs code from fs/ramfs/ to init/ · 57f150a5

Committed by Rob Landley on Sep 11, 2013

When the rootfs code was a wrapper around ramfs, having them in the same
file made sense.  Now that it can wrap another filesystem type, move it in
with the init code instead.

This also allows a subsequent patch to access rootfstype= command line
arg.

Signed-off-by: Rob Landley <rob@landley.net>
Cc: Jeff Layton <jlayton@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Stephen Warren <swarren@nvidia.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Jim Cromie <jim.cromie@gmail.com>
Cc: Sam Ravnborg <sam@ravnborg.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

09 Sep, 2013 1 commit

rename user_path_umountat() to user_path_mountpoint_at() · 197df04c

Committed by Al Viro on Sep 08, 2013

... and move the extern from linux/namei.h to fs/internal.h,
along with that of vfs_path_lookup().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

06 Sep, 2013 1 commit

vfs: check unlinked ancestors before mount · eed81007

Committed by Miklos Szeredi on Sep 05, 2013

We check submounts before doing d_drop() on a non-empty directory dentry in
NFS (have_submounts()), but we do not exclude a racing mount. Nor do we
prevent mounts from being added to the disconnected subtree using relative paths
after the d_drop().

This patch fixes these issues by checking for unlinked (unhashed, non-root)
ancestors before proceeding with the mount. This is done with rename
seqlock taken for write and with ->d_lock grabbed on each ancestor in turn,
including our dentry itself. This ensures that only one of
check_submounts_and_drop() or has_unlinked_ancestor() can succeed.

Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
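The ancestor walk itself is simple once locking is set aside. The sketch below assumes a hypothetical `model_dentry` type in which a root is its own parent, and it deliberately omits the rename seqlock and the per-dentry ->d_lock that the commit message relies on for correctness.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a dentry: a root points to itself as parent. */
struct model_dentry {
    struct model_dentry *parent;
    bool hashed;
};

static bool model_is_root(const struct model_dentry *d)
{
    return d->parent == d;
}

/* True if d or any of its non-root ancestors is unlinked (unhashed). */
static bool model_has_unlinked_ancestor(const struct model_dentry *d)
{
    for (; !model_is_root(d); d = d->parent)
        if (!d->hashed)
            return true;
    return false;
}
```

A mount attempt in this model would be refused whenever the function returns true, which covers both the dentry being mounted on and every directory above it up to, but not including, the root.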

04 Sep, 2013 1 commit

vfs: allow umount to handle mountpoints without revalidating them · 8033426e

Committed by Jeff Layton on Jul 26, 2013

Christopher reported a regression where he was unable to unmount an NFS
filesystem where the root had gone stale. The problem is that
d_revalidate handles the root of the filesystem differently from other
dentries, but d_weak_revalidate does not. We could simply fix this by
making d_weak_revalidate return success on IS_ROOT dentries, but there
are cases where we do want to revalidate the root of the fs.

A umount is really a special case. We generally aren't interested in
anything but the dentry and vfsmount that's attached at that point. If
the inode turns out to be stale we just don't care since the intent is
to stop using it anyway.

Try to handle this situation better by treating umount as a special
case in the lookup code. Have it resolve the parent using normal
means, and then do a lookup of the final dentry without revalidating
it. In most cases, the final lookup will come out of the dcache, but
the case where there's a trailing symlink or !LAST_NORM entry on the
end complicates things a bit.

Cc: Neil Brown <neilb@suse.de>
Reported-by: Christopher T Vogan <cvogan@us.ibm.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

31 Aug, 2013 1 commit

userns: Kill nsown_capable it makes the wrong thing easy · c7b96acf

Committed by Eric W. Biederman on Mar 20, 2013

nsown_capable is a special case of ns_capable essentially for just CAP_SETUID and
CAP_SETGID. For the existing users it doesn't noticeably simplify things and
from the suggested patches I have seen it encourages people to do the wrong
thing. So remove nsown_capable.

Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

27 Aug, 2013 2 commits

userns: Better restrictions on when proc and sysfs can be mounted · e51db735

Committed by Eric W. Biederman on Mar 30, 2013

Rely on the fact that another flavor of the filesystem is already
mounted and do not rely on state in the user namespace.

Verify that the mounted filesystem is not covered in any significant
way.  I would love to verify that the previously mounted filesystem
has no mounts on top but there are at least the directories
/proc/sys/fs/binfmt_misc and /sys/fs/cgroup/ that exist explicitly
for other filesystems to mount on top of.

Refactor the test into a function named fs_fully_visible and call that
function from the mount routines of proc and sysfs.  This makes this
test local to the filesystems involved and the results current as of when
the mounts take place, removing a weird threading of the user
namespace, the mount namespace and the filesystems themselves.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

vfs: Don't copy mount bind mounts of /proc/<pid>/ns/mnt between namespaces · 4ce5d2b1

Committed by Eric W. Biederman on Mar 30, 2013

Don't copy bind mounts of /proc/<pid>/ns/mnt between namespaces.
These files hold references to a mount namespace and copying them
between namespaces could result in a reference counting loop.

The current mnt_ns_loop test prevents loops on the assumption that
mounts don't cross between namespaces. Unfortunately unsharing a
mount namespace and shared subtrees can both cause mounts to
propagate between mount namespaces.

Two flags, CL_COPY_UNBINDABLE and CL_COPY_MNT_NS_FILE, are added to
control this behavior, and CL_COPY_ALL is redefined as both of them.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

25 Aug, 2013 1 commit

VFS: collect_mounts() should return an ERR_PTR · 52e220d3

Committed by Dan Carpenter on Aug 14, 2013

This should actually be returning an ERR_PTR on error instead of NULL.
That was how it was designed and all the callers expect it.

[AV: actually, that's what "VFS: Make clone_mnt()/copy_tree()/collect_mounts()
return errors" missed - originally collect_mounts() was expected to return
NULL on failure]

Cc: <stable@vger.kernel.org> # 3.10+
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
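Why a NULL return is dangerous when callers expect an ERR_PTR can be seen from a userspace imitation of the convention. The helpers below are assumptions modelled on the kernel's ERR_PTR()/IS_ERR()/PTR_ERR() idea, not the kernel definitions, and `model_collect()` is a hypothetical stand-in for a function like collect_mounts().

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_MAX_ERRNO 4095

/* Encode a negative errno in the top page of the address space. */
static inline void *model_err_ptr(long err) { return (void *)(intptr_t)err; }
static inline long model_ptr_err(const void *p) { return (long)(intptr_t)p; }
static inline int model_is_err(const void *p)
{
    return (uintptr_t)p >= (uintptr_t)-MODEL_MAX_ERRNO;
}

struct model_tree { int nodes; };
static struct model_tree model_the_tree = { 3 };

/* On failure return an encoded errno, never NULL. */
static struct model_tree *model_collect(int fail)
{
    if (fail)
        return model_err_ptr(-ENOMEM);
    return &model_the_tree;
}
```

Note that `model_is_err(NULL)` is false, so a caller that tests only for the error-pointer range would treat a NULL return as success and go on to dereference it.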

25 Jul, 2013 1 commit

vfs: Lock in place mounts from more privileged users · 5ff9d8a6

Committed by Eric W. Biederman on Mar 29, 2013

When creating a less privileged mount namespace or propagating mounts
from a more privileged to a less privileged mount namespace lock the
submounts so they may not be unmounted individually in the child mount
namespace revealing what is under them.

This enforces the reasonable expectation that it is not possible to
see under a mount point.  Most of the time mounts are on empty
directories and revealing that does not matter, however I have seen an
occasionally sloppy configuration where there were interesting things
concealed under a mount point that probably should not be revealed.

Expirable submounts are not locked because they will eventually
unmount automatically so whatever is under them already needs
to be safe for unprivileged users to access.

From a practical standpoint these restrictions do not appear to be
significant for unprivileged users of the mount namespace.  Recursive
bind mounts and pivot_root continue to work, and mounts that are
created in a mount namespace may be unmounted there.  All of which
means that the common idiom of keeping a directory of interesting
files and using pivot_root to throw everything else away continues to
work just fine.

Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Acked-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

05 May, 2013 2 commits

create_mnt_ns: unidiomatic use of list_add() · b1983cd8

Committed by Al Viro on May 04, 2013

while list_add(A, B) and list_add(B, A) are equivalent when both A and B
are guaranteed to be empty, the usual idiom is list_add(what, where),
not the other way round...  Not a bug per se, but only by accident and
it makes RTFS harder for no good reason.

Spotted-by: Rajat Sharma <fs.rajat@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
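The equivalence claimed above is easy to check with a minimal doubly linked ring. The sketch assumes a hypothetical `model_list` type and helpers written in the style of the kernel's list_head, not the kernel's own implementation.

```c
/* Minimal circular doubly linked list in the style of list_head. */
struct model_list { struct model_list *next, *prev; };

/* An empty list points to itself in both directions. */
static void model_list_init(struct model_list *l)
{
    l->next = l;
    l->prev = l;
}

/* Insert entry right after head: the usual (what, where) order. */
static void model_list_add(struct model_list *entry, struct model_list *head)
{
    struct model_list *next = head->next;

    next->prev = entry;
    entry->next = next;
    entry->prev = head;
    head->next = entry;
}
```

With both `a` and `b` freshly initialised, `model_list_add(&a, &b)` and `model_list_add(&b, &a)` each produce the same two-element ring in which `a` and `b` point at each other. As soon as one of them already has other members, the two calls stop being interchangeable, which is why the (what, where) order is the one worth reading for.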

do_mount(): fix a leak introduced in 3.9 ("mount: consolidate permission checks") · 0d5cadb8

Committed by Al Viro on May 04, 2013

Cc: stable@vger.kernel.org
Bisected-by: Michael Leun <lkml20130126@newton.leun.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

02 May, 2013 1 commit

proc: Split the namespace stuff out into linux/proc_ns.h · 0bb80f24

Committed by David Howells on Apr 12, 2013

Split the proc namespace stuff out into linux/proc_ns.h.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: netdev@vger.kernel.org
cc: Serge E. Hallyn <serge.hallyn@ubuntu.com>
cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

10 Apr, 2013 5 commits

fold release_mounts() into namespace_unlock() · 97216be0

Committed by Al Viro on Mar 16, 2013

... and provide namespace_lock() as a trivial wrapper;
switch to those two consistently.

Result is patterned after rtnl_lock/rtnl_unlock pair.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
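The lock/unlock pair with a deferred kill list can be sketched as follows. This is a single-threaded illustration with hypothetical `model_*` names: an integer stands in for namespace_sem, and "releasing" a mount just sets a flag.

```c
#include <stddef.h>

struct model_mnt { struct model_mnt *next; int released; };

static struct model_mnt *model_unmounted;   /* kill list, only touched with the lock held */
static int model_sem_held;                  /* stand-in for namespace_sem */

static void model_namespace_lock(void)
{
    model_sem_held = 1;
}

/* Called with the lock held: queue the mount instead of releasing it now. */
static void model_umount(struct model_mnt *m)
{
    m->next = model_unmounted;
    model_unmounted = m;
}

/* Steal the list while still locked, drop the lock, then release the mounts. */
static void model_namespace_unlock(void)
{
    struct model_mnt *head = model_unmounted;

    model_unmounted = NULL;
    model_sem_held = 0;
    while (head) {
        struct model_mnt *next = head->next;
        head->released = 1;
        head = next;
    }
}
```

The point of the pattern is that callers never handle the kill list themselves: whatever was queued under the lock is released as a side effect of unlocking, after the lock has been dropped.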

switch unlock_mount() to namespace_unlock(), convert all umount_tree() callers · 328e6d90

Committed by Al Viro on Mar 16, 2013

which allows us to kill the last argument of umount_tree() and make release_mounts()
static.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

more conversions to namespace_unlock() · 3ab6abee

Committed by Al Viro on Mar 16, 2013

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

get rid of the second argument of shrink_submounts() · b54b9be7

Committed by Al Viro on Mar 16, 2013

... it's always &unmounted.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

saner umount_tree()/release_mounts(), part 1 · e3197d83

Committed by Al Viro on Mar 16, 2013

global list of release_mounts() fodder, protected by namespace_sem;
eventually, all umount_tree() callers will use it as kill list.
Helper picking the contents of that list, releasing namespace_sem
and doing release_mounts() on what it got.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
