Commits · 18815a18085364d8514c0d0c4c986776cb74272c · openanolis / cloud-kernel

03 May, 2012 · 3 commits

userns: Convert capabilities related permsion checks · 18815a18

Authored by Eric W. Biederman on Feb 07, 2012

- Use uid_eq when comparing kuids
  Use gid_eq when comparing kgids
- Use make_kuid(user_ns, 0) to talk about the user_namespace root uid

Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>

18815a18
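
To illustrate why the commit above routes comparisons through uid_eq and gid_eq, here is a minimal userspace sketch. It models kuid_t and kgid_t as single-member structs, so a raw `==` between them, or a uid passed where a gid is expected, fails to compile. The `struct user_ns_model` type, its single-offset mapping, and the `is_ns_root` helper are assumptions made for this example and are not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative userspace model only; not the kernel's actual types. */
typedef struct { uint32_t val; } kuid_t;
typedef struct { uint32_t val; } kgid_t;

/* Hypothetical stand-in for struct user_namespace: one offset mapping. */
struct user_ns_model {
    uint32_t first_kernel_id; /* kernel-internal id that namespace uid 0 maps to */
};

/* Wrapped values cannot be compared with ==, so helpers are required. */
static inline bool uid_eq(kuid_t a, kuid_t b) { return a.val == b.val; }
static inline bool gid_eq(kgid_t a, kgid_t b) { return a.val == b.val; }

/* Translate a namespace-local uid into the flat kernel-internal space. */
static inline kuid_t make_kuid(const struct user_ns_model *ns, uint32_t uid)
{
    assert(ns != NULL);
    kuid_t k = { ns->first_kernel_id + uid };
    return k;
}

/* "Is this the root user of that namespace?" expressed without a raw 0. */
static inline bool is_ns_root(const struct user_ns_model *ns, kuid_t who)
{
    return uid_eq(who, make_kuid(ns, 0));
}
```

With a namespace whose uid 0 maps to kernel id 100000, `is_ns_root` is true only for that kernel id. Comparing against a literal 0 would wrongly match the global root instead.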

userns: Store uid and gid values in struct cred with kuid_t and kgid_t types · 078de5f7

Authored by Eric W. Biederman on Feb 08, 2012

cred.h and a few trivial users of struct cred are changed. The rest of the users
of struct cred are left for other patches, as there are too many changes to make
in one go and still leave the change reviewable. If user namespace support is
disabled and CONFIG_UIDGID_STRICT_TYPE_CHECKS is disabled, the code will continue
to compile and behave correctly.

Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>

078de5f7

userns: Convert group_info values from gid_t to kgid_t. · ae2975bc

Authored by Eric W. Biederman on Nov 14, 2011

As a first step to converting struct cred to be all kuid_t and kgid_t
values, convert the group values stored in group_info to always be
kgid_t values.  Unless user namespaces are used, this change should
have no effect.

Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>

ae2975bc

26 Apr, 2012 · 1 commit

userns: Rework the user_namespace adding uid/gid mapping support · 22d917d8

Authored by Eric W. Biederman on Nov 17, 2011

- Convert the old uid mapping functions into compatibility wrappers
- Add a uid/gid mapping layer from user space uids and gids to kernel
  internal uids and gids that is extent based for simplicity and speed.
  * Working with number space after mapping uids/gids into their kernel
    internal version adds only mapping complexity over what we have today,
    leaving the kernel code easy to understand and test.
- Add proc files /proc/self/uid_map /proc/self/gid_map
  These files display the mapping and allow a mapping to be added
  if a mapping does not exist.
- Allow entering the user namespace without a uid or gid mapping.
  Since we are starting with an existing user, our uids and gids
  still have global mappings, so they are still valid and useful; they
  just don't have local mappings.  The requirement for things to work is
  a global uid and gid, so it is odd but perfectly fine not to have a
  local uid and gid mapping.
  Not requiring global uid and gid mappings greatly simplifies
  the logic of setting up the uid and gid mappings by allowing
  the mappings to be set after the namespace is created which makes the
  slight weirdness worth it.
- Make the mappings in the initial user namespace to the global
  uid/gid space explicit.  Today it is an identity mapping
  but in the future we may want to twist this for debugging, similar
  to what we do with jiffies.
- Document the memory ordering requirements of setting the uid and
  gid mappings.  We only allow the mappings to be set once
  and there are no pointers involved, so the requirements are
  trivial but a little atypical.

Performance:

In this scheme for the permission checks the performance is expected to
stay the same, as the actual machine instructions should remain the same.

The worst case I could think of is ls -l on a large directory where
all of the stat results need to be translated from kuids and
kgids to uids and gids.  So I benchmarked that case on my laptop
with a dual core hyperthreaded Intel i5-2520M cpu with 3 MB of cpu cache.

My benchmark consisted of going to single user mode where nothing else
was running, then, on an ext4 filesystem, opening 1,000,000 files and
looping through all of the files 1000 times, calling fstat on the
individual files.  This was to ensure I was benchmarking stat times
where the inodes were in the kernel's cache, but the inode values were
not in the processor's cache.  My results:

v3.4-rc1:         ~= 156ns (unmodified v3.4-rc1 with user namespace support disabled)
v3.4-rc1-userns-: ~= 155ns (v3.4-rc1 with my user namespace patches and user namespace support disabled)
v3.4-rc1-userns+: ~= 164ns (v3.4-rc1 with my user namespace patches and user namespace support enabled)

All of the configurations ran in roughly 120ns when I performed tests
that ran in the cpu cache.

So in summary the performance impact is:
1ns improvement in the worst case with user namespace support compiled out.
8ns aka 5% slowdown in the worst case with user namespace support compiled in.

Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>

22d917d8
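
The extent-based mapping layer described in this commit can be sketched in userspace C as follows. This is a simplified model under stated assumptions: the struct layout, the `MAX_EXTENTS` limit, the `INVALID_ID` sentinel, and the function names are illustrative choices for this example, not a copy of the kernel source. Each extent maps a run of consecutive namespace ids onto a run of consecutive kernel-internal ids, so a lookup is a short linear scan with one subtraction and one addition.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_EXTENTS 5          /* illustrative limit (an assumption) */
#define INVALID_ID  UINT32_MAX /* sentinel meaning "no mapping" */

struct id_extent {
    uint32_t first;       /* first id as seen inside the namespace */
    uint32_t lower_first; /* first id in the kernel-internal space */
    uint32_t count;       /* number of consecutive ids covered */
};

struct id_map {
    uint32_t nr_extents;
    struct id_extent extent[MAX_EXTENTS];
};

/* Namespace id -> kernel-internal id, or INVALID_ID when unmapped. */
static uint32_t map_id_down(const struct id_map *map, uint32_t id)
{
    assert(map->nr_extents <= MAX_EXTENTS);
    for (uint32_t i = 0; i < map->nr_extents; i++) {
        const struct id_extent *e = &map->extent[i];
        if (id >= e->first && id - e->first < e->count)
            return e->lower_first + (id - e->first);
    }
    return INVALID_ID;
}

/* Kernel-internal id -> namespace id, or INVALID_ID when unmapped. */
static uint32_t map_id_up(const struct id_map *map, uint32_t kid)
{
    assert(map->nr_extents <= MAX_EXTENTS);
    for (uint32_t i = 0; i < map->nr_extents; i++) {
        const struct id_extent *e = &map->extent[i];
        if (kid >= e->lower_first && kid - e->lower_first < e->count)
            return e->first + (kid - e->lower_first);
    }
    return INVALID_ID;
}
```

A single extent `{0, 100000, 65536}` corresponds to a map line of the form `0 100000 65536`: namespace uid 1000 becomes kernel id 101000, and any id outside the 65536-wide window has no mapping.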

08 Apr, 2012 · 3 commits

userns: Disassociate user_struct from the user_namespace. · 7b44ab97

Authored by Eric W. Biederman on Nov 16, 2011

Modify alloc_uid to take a kuid and make the user hash table global.
Stop holding a reference to the user namespace in struct user_struct.

This simplifies the code and makes the per user accounting not
care about which user namespace a uid happens to appear in.

Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>

7b44ab97

userns: Replace the hard to write inode_userns with inode_capable. · 1a48e2ac

Authored by Eric W. Biederman on Nov 14, 2011

This represents a change in strategy of how to handle user namespaces.
Instead of tagging everything explicitly with a user namespace and bulking
up all of the comparisons of uids and gids in the kernel, all uids and gids
in use will have a mapping to flat kuid and kgid spaces respectively.  This
allows much more of the existing logic to be preserved and in general
allows for faster code.

In this new and improved world we allow someone to utilize capabilities
over an inode if the inode's owner maps into the capability holder's user
namespace and the user has capabilities in their user namespace, which
is simple and efficient.

Moving the fs uid comparisons to be comparisons in a flat kuid space
follows in later patches, something that is only significant if you
are using user namespaces.

Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>

1a48e2ac
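
The rule stated in this commit (capabilities apply to an inode only if its owner maps into the holder's user namespace and the holder has the capability there) can be sketched as a two-part check. Everything below is a hypothetical userspace model: `ns_model`, `task_model`, `inode_model`, and the `_model` function are names invented for this example, and the single contiguous range stands in for a real mapping.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical namespace model: one contiguous range of kernel-internal uids. */
struct ns_model {
    uint32_t lower_first; /* first kernel-internal uid mapped into the namespace */
    uint32_t count;       /* size of the mapped range */
};

/* Hypothetical task model: its namespace plus a capability bitmask held there. */
struct task_model {
    const struct ns_model *ns;
    uint64_t caps_in_ns;
};

/* Hypothetical inode model: only the owner's kernel-internal uid matters here. */
struct inode_model {
    uint32_t i_kuid;
};

/* True if the kernel-internal uid is visible inside the namespace. */
static bool kuid_has_mapping(const struct ns_model *ns, uint32_t kuid)
{
    return kuid >= ns->lower_first && kuid - ns->lower_first < ns->count;
}

/* Capability over an inode: held in the namespace AND the owner maps into it. */
static bool inode_capable_model(const struct task_model *t,
                                const struct inode_model *inode, int cap)
{
    assert(cap >= 0 && cap < 64);
    bool has_cap = ((t->caps_in_ns >> cap) & 1u) != 0;
    return has_cap && kuid_has_mapping(t->ns, inode->i_kuid);
}
```

In this model, a task holding capability bit 3 in a namespace covering kernel uids 100000 to 165535 passes the check for an inode owned by 100500, and fails it for an inode owned by the global uid 0.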

userns: Use cred->user_ns instead of cred->user->user_ns · c4a4d603

Authored by Eric W. Biederman on Nov 16, 2011

Optimize performance and prepare for the removal of the user_ns reference
from user_struct.  Remove the slow long walk through cred->user->user_ns and
instead go straight to cred->user_ns.

Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>

c4a4d603

03 Apr, 2012 · 1 commit

vfs: Don't allow a user namespace root to make device nodes · 975d6b39

Authored by Eric W. Biederman on Nov 13, 2011

Safely making device nodes in a container is solvable, but simply
having the capability in a user namespace is not sufficient to make
this work.

Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>

975d6b39

01 Apr, 2012 · 20 commits

vfs: fix out-of-date dentry_unhash() comment · c0d02594

Authored by J. Bruce Fields on Feb 15, 2012

Commit 64252c75 ("vfs: remove dget() from dentry_unhash()") changed the
implementation but not the comment.

Cc: Sage Weil <sage@newdream.net>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

c0d02594

vfs: split __lookup_hash · bad61189

Authored by Miklos Szeredi on Mar 26, 2012

Split __lookup_hash into two component functions:

 lookup_dcache - tries cached lookup, returns whether real lookup is needed
 lookup_real - calls i_op->lookup

This eliminates code duplication between d_alloc_and_lookup() and
d_inode_lookup().

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

bad61189
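
The shape of the split described in this commit can be sketched with a toy cache. This is a loose userspace analogy, not the VFS code: `dir_model`, the string-based cache, and every `_model` function are assumptions made for this example. The only point is the control flow, where the cached step reports whether the slow step is needed and the slow step stands in for calling i_op->lookup.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define CACHE_SLOTS 4 /* illustrative size (an assumption) */

struct dir_model {
    const char *cached[CACHE_SLOTS]; /* names already in the toy cache */
    size_t nr_cached;
    unsigned real_lookups;           /* how often the slow path ran */
};

/* Cached lookup only; sets *need_lookup when the slow path is required. */
static const char *lookup_dcache_model(struct dir_model *dir, const char *name,
                                       bool *need_lookup)
{
    *need_lookup = false;
    for (size_t i = 0; i < dir->nr_cached; i++)
        if (strcmp(dir->cached[i], name) == 0)
            return dir->cached[i];
    *need_lookup = true;
    return NULL;
}

/* Slow path: stands in for i_op->lookup, then remembers the result. */
static const char *lookup_real_model(struct dir_model *dir, const char *name)
{
    dir->real_lookups++;
    if (dir->nr_cached < CACHE_SLOTS)
        dir->cached[dir->nr_cached++] = name;
    return name;
}

/* The combined helper is now just the two steps in sequence. */
static const char *lookup_hash_model(struct dir_model *dir, const char *name)
{
    bool need_lookup;
    assert(dir != NULL && name != NULL);
    const char *found = lookup_dcache_model(dir, name, &need_lookup);
    if (!need_lookup)
        return found;
    return lookup_real_model(dir, name);
}
```

Looking up the same name twice runs the slow path once. Other callers can reuse either half on its own, which is how a split like this removes duplicated logic.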

untangling do_lookup() - take __lookup_hash()-calling case out of line. · 81e6f520

Authored by Al Viro on Mar 30, 2012

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

81e6f520

untangling do_lookup() - switch to calling __lookup_hash() · a3255546

Authored by Al Viro on Mar 30, 2012

Now we have __lookup_hash() open-coded in the !dentry case;
just call the damn thing instead...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

a3255546

untangling do_lookup() - merge d_alloc_and_lookup() callers · a6ecdfcf

Authored by Al Viro on Mar 30, 2012

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

a6ecdfcf

untangling do_lookup() - merge failure exits in !dentry case · ec335e91

Authored by Al Viro on Mar 30, 2012

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

ec335e91

untangling do_lookup() - massage !dentry case towards __lookup_hash() · d774a058

Authored by Al Viro on Mar 30, 2012

Reorder if-else cases for starters...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

d774a058

untangling do_lookup() - get rid of need_reval in !dentry case · 08b0ab7c

Authored by Al Viro on Mar 30, 2012

Everything arriving into if (!dentry) will have need_reval = 1.
Indeed, the only way to get there with need_reval reset to 0 would
be via
	if (unlikely(d_need_lookup(dentry)))
		goto unlazy;
	if (unlikely(dentry->d_flags & DCACHE_OP_REVALIDATE)) {
		status = d_revalidate(dentry, nd);
		if (unlikely(status <= 0)) {
			if (status != -ECHILD)
				need_reval = 0;
			goto unlazy;
...
unlazy:
	/* no assignments to dentry */
	if (dentry && unlikely(d_need_lookup(dentry))) {
		dput(dentry);
		dentry = NULL;
	}
and if d_need_lookup() had already been false the first time around, it
will remain false on the second call as well.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

08b0ab7c

untangling do_lookup() - eliminate a loop. · acc9cb3c

Authored by Al Viro on Mar 30, 2012

d_lookup() *will* fail after successful d_invalidate(), if we are
holding i_mutex all along.  IOW, we don't need to jump back to
l: - we know what path will be taken there and can do that (i.e.
d_alloc_and_lookup()) directly.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

acc9cb3c

untangling do_lookup() - expand the area under ->i_mutex · 37c17e1f

Authored by Al Viro on Mar 30, 2012

keep holding ->i_mutex over revalidation parts

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

37c17e1f

untangling do_lookup() - isolate !dentry stuff from the rest of it. · 3f6c7c71

Authored by Al Viro on Mar 30, 2012

Duplicate the revalidation-related parts into the if (!dentry) branch.
Next step will be to pull them under i_mutex.

This and the next 8 commits are more or less a split-up of a patch
by Miklos; folks, when you are working with something that convoluted,
carve your patches up into easily reviewed steps, especially when
a lot of the codepaths involved are rarely hit...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

3f6c7c71

vfs: move MAY_EXEC check from __lookup_hash() · cda309de

Authored by Miklos Szeredi on Mar 26, 2012

The only caller of __lookup_hash() that needs the exec permission check on
parent is lookup_one_len().

All lookup_hash() callers already checked permission in LOOKUP_PARENT walk.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

cda309de

vfs: don't revalidate just looked up dentry · 3637c05d

Authored by Miklos Szeredi on Mar 26, 2012

__lookup_hash() calls ->lookup() if the dentry needs lookup and on success
revalidates the dentry (all under dir->i_mutex).

While this is harmless, it doesn't make a lot of sense.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

3637c05d

vfs: fix d_need_lookup/d_revalidate order in do_lookup · fa4ee159

Authored by Miklos Szeredi on Mar 26, 2012

Doing revalidate on a dentry which has not yet been looked up makes no sense.

Move the d_need_lookup() check before d_revalidate().

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

fa4ee159

ext3: move headers to fs/ext3/ · 4613ad18

Authored by Al Viro on Mar 29, 2012

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

4613ad18

migrate ext2_fs.h guts to fs/ext2/ext2.h · f7699f2b

Authored by Al Viro on Mar 23, 2012

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

f7699f2b

get rid of pointless includes of ext2_fs.h · 2f99c369

Authored by Al Viro on Mar 23, 2012

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

2f99c369

pstore: trim pstore_get_inode() · 22a71c30

Authored by Al Viro on Mar 22, 2012

move mode-dependent parts to callers, kill unused arguments

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

22a71c30

aio: take final put_ioctx() into callers of io_destroy() · a2e1859a

Authored by Al Viro on Mar 20, 2012

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

a2e1859a

aio: merge aio_cancel_all() with wait_for_all_aios() · 06af121e

Authored by Al Viro on Mar 20, 2012

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

06af121e

30 Mar, 2012 · 3 commits

Revert "ext4: don't release page refs in ext4_end_bio()" · 6268b325

Authored by Linus Torvalds on Mar 29, 2012

This reverts commit b43d17f3.

Dave Jones reports that it causes lockups on his laptop, and his debug
output showed a lot of processes hung waiting for page_writeback (or
more commonly - processes hung waiting for a lock that was held during
that writeback wait).

The page_writeback hint made Ted suggest that Dave look at this commit,
and Dave verified that reverting it makes his problems go away.

Ted says:
 "That commit fixes a race which is seen when you write into fallocated
  (and hence uninitialized) disk blocks under *very* heavy memory
  pressure.  Furthermore, although theoretically it could trigger under
  normal direct I/O writes, it only seems to trigger if you are issuing
  a huge number of AIO writes, such that a just-written page can get
  evicted from memory, and then read back into memory, before the
  workqueue has a chance to update the extent tree.

  This race has been around for a little over a year, and no one noticed
  until two months ago; it only happens under fairly exotic conditions,
  and in fact even after trying very hard to create a simple repro under
  lab conditions, we could only reproduce the problem and confirm the
  fix on production servers running MySQL on very fast PCIe-attached
  flash devices.

  Given that Dave was able to hit this problem pretty quickly, if we
  confirm that this commit is at fault, the only reasonable thing to do
  is to revert it IMO."

Reported-and-tested-by: Dave Jones <davej@redhat.com>
Acked-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

6268b325

pagemap: remove remaining unneeded spin_lock() · 10bdfb5e

Authored by Naoya Horiguchi on Mar 29, 2012

Commit 025c5b24 ("thp: optimize away unnecessary page table
locking") moves spin_lock() into pmd_trans_huge_lock() in order to avoid
locking unless pmd is for thp.  So this spin_lock() is a bug.

Reported-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

10bdfb5e

Btrfs: update the checks for mixed block groups with big metadata blocks · bc3f116f

Authored by Chris Mason on Mar 29, 2012

Dave Sterba had put in patches to look for mixed data/metadata groups
with metadata bigger than 4KB.  But these ended up in the wrong place
and weren't testing the feature flag correctly.

This updates the tests to make sure our sizes match.

Signed-off-by: Chris Mason <chris.mason@oracle.com>

bc3f116f

29 Mar, 2012 · 9 commits

Btrfs: update to the right index of defragment · e1f041e1

Authored by Liu Bo on Mar 29, 2012

When we use autodefrag, we forget to update the index which indicates
the last page we've dirtied, so we set dirty flags on the same set of
pages again and again.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

e1f041e1

Btrfs: do not bother to defrag an extent if it is a big real extent · 66c26892

Authored by Liu Bo on Mar 29, 2012

$ mkfs.btrfs /dev/sdb7
$ mount /dev/sdb7 /mnt/btrfs/ -oautodefrag
$ dd if=/dev/zero of=/mnt/btrfs/foobar bs=4k count=10 oflag=direct 2>/dev/null
$ filefrag -v /mnt/btrfs/foobar
Filesystem type is: 9123683e
File size of /mnt/btrfs/foobar is 40960 (10 blocks, blocksize 4096)
 ext logical physical expected length flags
   0       0     3072              10 eof
/mnt/btrfs/foobar: 1 extent found

Now we have a big real extent [0, 40960), but autodefrag will still defrag it.

$ sync
$ filefrag -v /mnt/btrfs/foobar
Filesystem type is: 9123683e
File size of /mnt/btrfs/foobar is 40960 (10 blocks, blocksize 4096)
 ext logical physical expected length flags
   0       0     3082              10 eof
/mnt/btrfs/foobar: 1 extent found

So if we already have a big real extent, that is fine; just skip it.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

66c26892

Btrfs: add a check to decide if we should defrag the range · 17ce6ef8

Authored by Liu Bo on Mar 29, 2012

If our file's layout is as follows:
| hole | data1 | hole | data2 |

we do not need to defrag this file, because this file has holes and
cannot be merged into one extent.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

17ce6ef8
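
The reasoning about the `| hole | data1 | hole | data2 |` layout can be made concrete with a small check. This is a simplified userspace model, not the Btrfs implementation: `extent_model` and `next_is_mergeable` are hypothetical names, and the rule modeled here is only "an extent is worth defragging if the extent right after it is real data that starts exactly where this one ends".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one file extent, sorted by start offset. */
struct extent_model {
    uint64_t start;  /* byte offset in the file */
    uint64_t len;    /* length in bytes */
    bool is_hole;    /* true when no data is allocated for this range */
};

/*
 * Returns true if extent i could be merged with its successor, i.e. both
 * are real data and they are contiguous in the file. Holes on either side
 * mean defragmenting cannot produce a single larger extent.
 */
static bool next_is_mergeable(const struct extent_model *ext, size_t n, size_t i)
{
    assert(i < n);
    if (ext[i].is_hole || i + 1 >= n)
        return false;
    const struct extent_model *next = &ext[i + 1];
    return !next->is_hole && next->start == ext[i].start + ext[i].len;
}
```

For the four-extent layout from the commit message, the check is false at every index, so the range would be skipped. Two adjacent data extents make it true for the first one.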

Btrfs: fix recursive defragment with autodefrag option · 4cb13e5d

Authored by Liu Bo on Mar 29, 2012

$ mkfs.btrfs disk
$ mount disk /mnt -o autodefrag
$ dd if=/dev/zero of=/mnt/foobar bs=4k count=10 2>/dev/null && sync
$ for i in `seq 9 -2 0`; do dd if=/dev/zero of=/mnt/foobar bs=4k count=1 \
  seek=$i conv=notrunc 2> /dev/null; done && sync

then we'll get to defrag "foobar" again and again.
The same happens with the option "-o autodefrag,compress".

Reasons:
When the cleaner kthread fetches inodes from the defrag tree and defrags
them, it dirties pages and submits them. This leads to another DATA COW,
where the inode being processed is inserted into the defrag tree again.

This patch sets a rule for the COW code: only insert an inode when we are
really going to do some defragmentation.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

4cb13e5d

Btrfs: fix the mismatch of page->mapping · 1f12bd06

Authored by Liu Bo on Mar 29, 2012

Commit 600a45e1
(Btrfs: fix deadlock on page lock when doing auto-defragment)
fixes the deadlock on the page lock, but it also introduces another bug.

A page may have been truncated between the unlock and the re-lock,
so we need to look it up again to get the right one.

And since we hold the i_mutex lock, the inode size remains unchanged and
we can drop the isize overflow checks.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

1f12bd06

Btrfs: fix race between direct io and autodefrag · ecb8bea8

Authored by Liu Bo on Mar 29, 2012

The bug is from running xfstests 209 with autodefrag.

The race is as follows:
       t1                       t2(autodefrag)
   direct IO
     invalidate pagecache
     dio(old data)             add_inode_defrag
     invalidate pagecache
   endio

   direct IO
     invalidate pagecache
                                run_defrag
                                  readpage(old data)
                                  set page dirty (old data)
     dio(new data, rewrite)
     invalidate pagecache (*)
     endio

t2(autodefrag) will get old data into pagecache via readpage and set
pagecache dirty.  Meanwhile, invalidate pagecache(*) will fail due to
dirty flags in pages.  So the old data may be flushed into disk by
flush thread, which will lead to data loss.

The same applies to user defragment programs.

The patch fixes this race by holding i_mutex when we readpage and set page dirty.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

ecb8bea8

Btrfs: fix deadlock during allocating chunks · 15d1ff81

Authored by Liu Bo on Mar 29, 2012

This deadlock comes from xfstests 251.

We hold the chunk_mutex throughout the whole of a chunk allocation.
If we find that we've used up the system chunk space, we need to allocate a
new system chunk, but this leads to a recursive chunk allocation and ends
up in a deadlock on chunk_mutex.
So instead we need to allocate the system chunk first if we find we're in ENOSPC.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

15d1ff81

Btrfs: show useful info in space reservation tracepoint · 2bcc0328

Authored by Liu Bo on Mar 29, 2012

o For space info, the type of space info is useful for debugging.
o For transaction handle, its transid is useful.

Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

2bcc0328

nfsd: only register cld pipe notifier when CONFIG_NFSD_V4 is enabled · 797a9d79

Authored by Jeff Layton on Mar 29, 2012

Otherwise, we get a warning or error similar to this when building with
CONFIG_NFSD_V4 disabled:

    ERROR: "nfsd4_cld_block" [fs/nfsd/nfsd.ko] undefined!

Fix this by wrapping the calls to rpc_pipefs_notifier_register and
..._unregister in another function and providing no-op replacements
when CONFIG_NFSD_V4 is disabled.

Reported-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>

797a9d79
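
The fix pattern described in this commit (wrap the calls in helper functions and supply no-op replacements when the feature is configured out) can be sketched as below. This is an illustrative userspace model: `CONFIG_NFSD_V4_MODEL`, the wrapper names, and the `nfsd_*_model` functions are assumptions for this example rather than the actual patch. The caller is written once and never references a symbol that only exists when the feature is built in.

```c
#include <assert.h>

/*
 * CONFIG_NFSD_V4_MODEL is a hypothetical macro standing in for the real
 * config option. When it is defined, the wrappers are provided elsewhere
 * and touch the feature-only notifier block; otherwise they compile away.
 */
#ifdef CONFIG_NFSD_V4_MODEL
int register_cld_notifier(void);
void unregister_cld_notifier(void);
#else
static inline int register_cld_notifier(void) { return 0; }
static inline void unregister_cld_notifier(void) { }
#endif

static int model_initialized; /* tracks init state for this sketch only */

/* Common init path: identical source whether or not the feature is built. */
static int nfsd_init_model(void)
{
    int ret = register_cld_notifier();
    if (ret)
        return ret;
    model_initialized = 1;
    return 0;
}

/* Common exit path: undoes the init in the same config-independent way. */
static void nfsd_exit_model(void)
{
    assert(model_initialized);
    unregister_cld_notifier();
    model_initialized = 0;
}
```

With the macro undefined, the init path succeeds without any reference to the feature-only symbol. That unresolved reference is what produced the undefined-symbol error quoted in the commit message.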

openanolis / cloud-kernel: last synced successfully almost 2 years ago