Commits · 07d5f69b457019eda4ca568923b1d62b7ada89e1 · openanolis / cloud-kernel

21 Mar, 2011 1 commit

fuse: reduce size of struct fuse_request · 07d5f69b

Authored by Miklos Szeredi on Mar 21, 2011

Reduce the size of struct fuse_request by removing cuse_init_out from
the request structure and allocating it dynamically instead.

CC: Tejun Heo <tj@kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>

07d5f69b

15 Mar, 2011 1 commit

Fix corrupted OSF partition table parsing · 1eafbfeb

Authored by Timo Warns on Mar 14, 2011

The kernel automatically evaluates partition tables of storage devices.
The code for evaluating OSF partitions contains a bug that leaks data
from kernel heap memory to userspace for certain corrupted OSF
partitions.

In more detail:

  for (i = 0 ; i < le16_to_cpu(label->d_npartitions); i++, partition++) {

iterates from 0 to d_npartitions - 1, where d_npartitions is read from
the partition table without validation and partition is a pointer to an
array of at most 8 d_partitions.

Add the proper and obvious validation.
Signed-off-by: Timo Warns <warns@pre-sense.de>
Cc: stable@kernel.org
[ Changed the patch trivially to not repeat the whole le16_to_cpu()
  thing, and to use an explicit constant for the magic value '8' ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

1eafbfeb
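For the OSF partition fix above, the idea of validating an untrusted count before using it as a loop bound can be sketched in plain C. This is a minimal userspace illustration, not the kernel patch: the struct `osf_label_sketch`, the function `osf_checked_npartitions()`, and the constant name `MAX_OSF_PARTITIONS` are assumptions made for the example. The message does not say whether the real change rejects or clamps an oversized count, so the sketch simply rejects it, and it ignores endianness conversion.

```c
#include <stdint.h>

/* Assumed name for the "magic value 8" mentioned in the message. */
#define MAX_OSF_PARTITIONS 8

/* Hypothetical stand-in for the on-disk label; not the kernel's struct. */
struct osf_label_sketch {
    uint16_t d_npartitions;                 /* untrusted, read from disk */
    uint32_t p_size[MAX_OSF_PARTITIONS];    /* fixed-size partition table */
};

/*
 * Return the number of table entries that may be visited, or -1 if the
 * on-disk count exceeds the size of the table and must not be trusted.
 */
int osf_checked_npartitions(const struct osf_label_sketch *label)
{
    unsigned int n = label->d_npartitions;

    if (n > MAX_OSF_PARTITIONS)
        return -1;      /* corrupted label: refuse to walk past the array */
    return (int)n;
}
```

A caller would use the returned value, rather than the raw `d_npartitions` field, as the loop bound.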

14 Mar, 2011 1 commit

compat breakage in preadv() and pwritev() · c44ed965

Authored by Al Viro on Mar 13, 2011

Fix for a dumb preadv()/pwritev() compat bug - unlike the native
variants, the compat_...  ones forget to check FMODE_P{READ,WRITE}, so
e.g.  on pipe the native preadv() will fail with -ESPIPE and compat one
will act as readv() and succeed.

Not critical, but it's a clear bug with trivial fix, so IMO it's OK for
-final.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

c44ed965
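The preadv()/pwritev() compat bug above comes down to one entry point skipping a mode check that the other performs. A minimal userspace sketch of sharing that check follows. The flag value `SKETCH_FMODE_PREAD` and all function names are hypothetical and only mirror the FMODE_PREAD idea from the message; this is not the kernel code.

```c
#include <errno.h>

/* Hypothetical flag: set when the file supports positional reads. */
#define SKETCH_FMODE_PREAD 0x1u

/* Shared check: positional reads on a non-seekable file fail with -ESPIPE. */
int sketch_check_pread(unsigned int f_mode)
{
    return (f_mode & SKETCH_FMODE_PREAD) ? 0 : -ESPIPE;
}

/* Native entry point: performs the check before doing any I/O. */
int sketch_native_preadv(unsigned int f_mode)
{
    int err = sketch_check_pread(f_mode);

    if (err)
        return err;
    return 0;   /* the actual read would happen here */
}

/* Compat entry point: must perform the very same check. */
int sketch_compat_preadv(unsigned int f_mode)
{
    int err = sketch_check_pread(f_mode);

    if (err)
        return err;
    return 0;   /* the actual read would happen here */
}
```

Routing both paths through one helper makes it harder for the two variants to drift apart again.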

12 Mar, 2011 7 commits

Btrfs: break out of shrink_delalloc earlier · 36e39c40

Authored by Chris Mason on Mar 12, 2011

Josef had changed shrink_delalloc to exit after three shrink
attempts, which wasn't quite enough because new writers could
race in and steal free space.

But it also fixed deadlocks and stalls as we tried to recover
delalloc reservations.  The code was tweaked to loop 1024
times, and would reset the counter any time a small amount
of progress was made.  This was too drastic, and with a
lot of writers we can end up stuck in shrink_delalloc forever.

The shrink_delalloc loop is fairly complex because the caller is looping
too, and the caller will go ahead and force a transaction commit to make
sure we reclaim space.

This reworks things to exit shrink_delalloc when we've forced some
writeback and the delalloc reservations have gone down.  This means
the writeback has not just started but has also finished at
least some of the metadata changes required to reclaim delalloc
space.

If we've got this wrong, we're returning ENOSPC too early, which
is a big improvement over the current behavior of hanging the machine.

Test 224 in xfstests hammers on this nicely, and with 1000 writers
trying to fill a 1GB drive we get our first ENOSPC at 93% full.  The
other writers are able to continue until we get 100%.

This is a worst case test for btrfs because the 1000 writers are doing
small IO, and the small FS size means we don't have a lot of room
for metadata chunks.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

36e39c40

NFS: NFSROOT should default to "proto=udp" · 53d47375

Authored by Chuck Lever on Mar 11, 2011

There have been a number of recent reports that NFSROOT is no longer
working with default mount options, but fails only with certain NICs.

Brian Downing <bdowning@lavos.net> bisected to commit 56463e50 "NFS:
Use super.c for NFSROOT mount option parsing".  Among other things,
this commit changes the default mount options for NFSROOT to use TCP
instead of UDP as the underlying transport.

TCP seems less able to deal with NICs that are slow to initialize.
The system logs that have accompanied reports of problems all show
that NFSROOT attempts to establish a TCP connection before the NIC is
fully initialized, and thus the TCP connection attempt fails.

When a TCP connection attempt fails during a mount operation, the
NFS stack needs to fail the operation.  Usually user space knows how
and when to retry it.  The network layer does not report a distinct
error code for this particular failure mode.  Thus, there isn't a
clean way for the RPC client to see that it needs to retry in this
case, but not in others.

Because NFSROOT is used in some environments where it is not possible
to update the kernel command line to specify "udp", the proper thing
to do is change NFSROOT to use UDP by default, as it did before commit
56463e50.

To make it easier to see how to change default mount options for
NFSROOT and to distinguish default settings from mandatory settings,
I've adjusted a couple of areas to document the specifics.

root_nfs_cat() is also modified to deal with commas properly when
concatenating strings containing mount option lists.  This keeps
root_nfs_cat() call sites simpler, now that we may be concatenating
multiple mount option strings.
Tested-by: Brian Downing <bdowning@lavos.net>
Tested-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: <stable@kernel.org> # 2.6.37
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

53d47375

nfs4: remove duplicated #include · 57df216b

Authored by Huang Weiyi on Mar 08, 2011

Remove duplicated #include('s) in
  fs/nfs/nfs4proc.c
Signed-off-by: Huang Weiyi <weiyi.huang@gmail.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

57df216b

NFSv4: nfs4_state_mark_reclaim_nograce() should be static · f9feab1e

Authored by Trond Myklebust on Mar 09, 2011

There are no more external users of nfs4_state_mark_reclaim_nograce() or
nfs4_state_mark_reclaim_reboot(), so mark them as static.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

f9feab1e

NFSv4: Fix the setlk error handler · ecac799a

Authored by Trond Myklebust on Mar 09, 2011

Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

ecac799a

NFSv4.1: Fix the handling of the SEQUENCE status bits · b4410c2f

Authored by Trond Myklebust on Mar 09, 2011

We want SEQUENCE status bits to be handled by the state manager in order
to avoid threading issues.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

b4410c2f

NFSv4/4.1: Fix nfs4_schedule_state_recovery abuses · 0400a6b0

Authored by Trond Myklebust on Mar 09, 2011

nfs4_schedule_state_recovery() should only be used when we need to force
the state manager to check the lease. If we just want to start the
state manager in order to handle a state recovery situation, we should be
using nfs4_schedule_state_manager().

This patch fixes the abuses of nfs4_schedule_state_recovery() by replacing
its use with a set of helper functions that do the right thing.
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

0400a6b0

11 Mar, 2011 10 commits

NFSv4.1 reclaim complete must wait for completion · c34c32ea

Authored by Andy Adamson on Mar 09, 2011

Signed-off-by: Andy Adamson <andros@netapp.com>
[Trond: fix whitespace errors]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

c34c32ea

NFSv4: remove duplicate clientid in struct nfs_client · 114f64b5

Authored by Andy Adamson on Mar 09, 2011

Signed-off-by: Andy Adamson <andros@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

114f64b5

NFSv4.1: Retry CREATE_SESSION on NFS4ERR_DELAY · 7d6d63d6

Authored by Ricardo Labiaga on Mar 09, 2011

Fix bug where we currently retry the EXCHANGEID call, even though
we already have a valid clientid.  Instead, delay and retry the CREATE_SESSION
call.
Signed-off-by: Ricardo Labiaga <Ricardo.Labiaga@netapp.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

7d6d63d6

(try3-resend) Fix nfs_compat_user_ino64 so it doesn't cause problems if bit 31... · 3fa0b4e2

Authored by Frank Filz on Dec 02, 2010

(try3-resend) Fix nfs_compat_user_ino64 so it doesn't cause problems if bit 31 or 63 are set in fileid

The problem was use of an int32, which when converted to a uint64
is sign extended resulting in a fileid that doesn't fit in 32 bits
even though the intent of the function is to fit the fileid into
32 bits.
Signed-off-by: Frank Filz <ffilzlnx@us.ibm.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
[Trond: Added an include for compat.h]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

3fa0b4e2
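The sign-extension problem described in the nfs_compat_user_ino64 fix above can be reproduced in a few lines of userspace C. This is a hedged sketch only: both function names are hypothetical, the XOR fold is an assumption about how a 64-bit fileid might be squeezed into 32 bits, and the narrowing conversion to `int32_t` relies on the wrap-around behaviour of common compilers such as GCC and Clang.

```c
#include <stdint.h>

/* Problematic variant: the folded value lives in a signed 32-bit variable. */
uint64_t sketch_fold_signed(uint64_t fileid)
{
    int32_t ino = (int32_t)(uint32_t)fileid;        /* low 32 bits */

    ino ^= (int32_t)(uint32_t)(fileid >> 32);       /* fold in high 32 bits */
    return (uint64_t)ino;   /* sign-extends when bit 31 of ino is set */
}

/* Safer variant: an unsigned 32-bit variable is zero-extended instead. */
uint64_t sketch_fold_unsigned(uint64_t fileid)
{
    uint32_t ino = (uint32_t)fileid;

    ino ^= (uint32_t)(fileid >> 32);
    return ino;             /* always fits in 32 bits */
}
```

With a fileid that has only bit 31 set, the signed variant returns a value whose upper 32 bits are all ones, while the unsigned variant returns the fileid unchanged.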

nfs: fix compilation warning · 43b7c3f0

Authored by Jovi Zhang on Mar 02, 2011

This commit fixes the following compilation warning:
linux-2.6/fs/nfs/nfs4proc.c:3265: warning: comparison of distinct pointer types lacks a cast
Signed-off-by: Jovi Zhang <bookjovi@gmail.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

43b7c3f0

nfs: add kmalloc return value check in decode_and_add_ds · b9f81057

Authored by Stanislav Fomichev on Feb 05, 2011

Add a kmalloc return value check in decode_and_add_ds.
Signed-off-by: Stanislav Fomichev <kernel@fomichev.me>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

b9f81057

nfs: close NFSv4 COMMIT vs. CLOSE race · d2224e7a

Authored by Jeff Layton on Mar 06, 2011

I've been adding in more artificial delays in the NFSv4 commit and close
codepaths to uncover races. The kernel I'm testing has the patch to
close the race in __rpc_wait_for_completion_task that's in Trond's
cthon2011 branch. The reproducer I've been using does this in a loop:

	mkdir("DIR");
	fd = open("DIR/FILE", O_WRONLY|O_CREAT|O_EXCL, 0644);
	write(fd, "abcdefg", 7);
	close(fd);
	unlink("DIR/FILE");
	rmdir("DIR");

The above reproducer shouldn't result in any silly-renaming. However,
when I add a "msleep(100)" just after the nfs_commit_clear_lock call in
nfs_commit_release, I can almost always force one to occur. If I can
force it to occur with that, then it can happen without that delay
given the right timing.

nfs_commit_inode waits for the NFS_INO_COMMIT bit to clear when called
with FLUSH_SYNC set. nfs_commit_rpcsetup on the other hand does not wait
for the task to complete before putting its reference to it, so the last
reference gets put in rpc_release task and gets queued to a workqueue.

In this situation, the last open context reference may be put by the
COMMIT release instead of the close() syscall. The close() syscall
returns too quickly and the unlink runs while the d_count is still
high since the COMMIT release hasn't put its dentry reference yet.

Fix this by having nfs_commit_rpcsetup wait for the RPC call to complete
before putting the task reference when FLUSH_SYNC is set. With this, the
last reference is put by the process that's initiating the FLUSH_SYNC
commit and the race is closed.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

d2224e7a

SUNRPC: Close a race in __rpc_wait_for_completion_task() · bf294b41

Authored by Trond Myklebust on Feb 21, 2011

Although they run as rpciod background tasks, under normal operation
(i.e. no SIGKILL), functions like nfs_sillyrename(), nfs4_proc_unlck()
and nfs4_do_close() want to be fully synchronous. This means that when we
exit, we want all references to the rpc_task to be gone, and we want
any dentry references etc. held by that task to be released.

For this reason these functions call __rpc_wait_for_completion_task(),
followed by rpc_put_task() in the expectation that the latter will be
releasing the last reference to the rpc_task, and thus ensuring that the
callback_ops->rpc_release() has been called synchronously.

This patch fixes a race which exists due to the fact that
rpciod calls rpc_complete_task() (in order to wake up the callers of
__rpc_wait_for_completion_task()) and then subsequently calls
rpc_put_task() without ensuring that these two steps are done atomically.

In order to avoid adding new spin locks, the patch uses the existing
waitqueue spin lock to order the rpc_task reference count releases between
the waiting process and rpciod.
The common case where nobody is waiting for completion is optimised for by
checking if the RPC_TASK_ASYNC flag is cleared and/or if the rpc_task
reference count is 1: in those cases we drop trying to grab the spin lock,
and immediately free up the rpc_task.

Those few processes that need to put the rpc_task from inside an
asynchronous context and that do not care about ordering are given a new
helper: rpc_put_task_async().
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>

bf294b41

btrfs: fix not enough reserved space · 7e6b6465

Authored by Miao Xie on Feb 18, 2011

btrfs_link() will insert 3 items (inode ref, dir name item and dir index item)
into the b+ tree and update 2 items (its inode, and parent's inode) in the b+
tree. So we should reserve space for these 5 items, not 3 items.
Reported-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

7e6b6465

btrfs: fix dip leak · b4966b77

Authored by Daniel J Blueman on Mar 09, 2011

The btrfs DIO code leaks dip structs when dip->csums allocation
fails; bio->bi_end_io isn't set at the point where the free_ordered
branch is consequently taken, thus bio_endio doesn't call the function
which would free it in the normal case. Fix.
Signed-off-by: Daniel J Blueman <daniel.blueman@gmail.com>
Acked-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

b4966b77

10 Mar, 2011 10 commits

fs/dcache: allow d_obtain_alias() to return unhashed dentries · d891eedb

Authored by J. Bruce Fields on Jan 18, 2011

Without this patch, inodes are not promptly freed on last close of an
unlinked file by an nfs client:

	client$ mount -tnfs4 server:/export/ /mnt/
	client$ tail -f /mnt/FOO
	...
	server$ df -i /export
	server$ rm /export/FOO
	(^C the tail -f)
	server$ df -i /export
	server$ echo 2 >/proc/sys/vm/drop_caches
	server$ df -i /export

the df's will show that the inode is not freed on the filesystem until
the last step, when it could have been freed after killing the client's
tail -f. On-disk data won't be deallocated either, leading to possible
spurious ENOSPC.

This occurs because when the client does the close, it arrives in a
compound with a putfh and a close, processed like:

	- putfh: look up the filehandle.  The only alias found for the
	  inode will be the DCACHE_UNHASHED alias referenced by the
	  filp, so it creates a new DCACHE_DISCONNECTED dentry and
	  returns that instead.
	- close: closes the existing filp, which is destroyed
	  immediately by dput() since it's DCACHE_UNHASHED.
	- end of the compound: release the reference
	  to the current filehandle, and dput() the new
	  DCACHE_DISCONNECTED dentry, which gets put on the
	  unused list instead of being destroyed immediately.

Nick Piggin suggested fixing this by allowing d_obtain_alias to return
the unhashed dentry that is referenced by the filp, instead of making it
create a new dentry.

Leave __d_find_alias() alone to avoid changing behavior of other
callers.

Also nfsd doesn't need all the checks of __d_find_alias(); any dentry,
hashed or unhashed, disconnected or not, should work.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

d891eedb

Check for immutable/append flag in fallocate path · 1ca551c6

Authored by Marco Stornelli on Mar 05, 2011

In the fallocate path the kernel doesn't check for the immutable/append
flag. It's possible to have a race condition in this scenario: an
application opens a file for read/write and does something; meanwhile
root sets the immutable flag on the file, and the application can at
that point still call fallocate successfully. In addition, on an
append-only file we allow only the reserve operation, not any unreserve one.
Signed-off-by: Marco Stornelli <marco.stornelli@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

1ca551c6
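The policy described in the fallocate commit above can be written down as a small decision function. The following is a userspace sketch under stated assumptions: the struct, the function name, and the choice of `-EPERM` as the error code are all hypothetical, since the message names neither the helpers nor the errno used.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical view of the two inode flags the message talks about. */
struct sketch_inode_flags {
    bool immutable;     /* file may not be changed at all */
    bool append_only;   /* file may only grow */
};

/*
 * Decide whether a fallocate-style request is allowed.
 * is_unreserve is true for requests that release space.
 * Returns 0 when allowed, -EPERM (assumed error code) otherwise.
 */
int sketch_fallocate_permitted(const struct sketch_inode_flags *flags,
                               bool is_unreserve)
{
    if (flags->immutable)
        return -EPERM;              /* nothing may touch an immutable file */
    if (flags->append_only && is_unreserve)
        return -EPERM;              /* append-only: reserving is fine, releasing is not */
    return 0;
}
```

Checking the flags at call time, rather than only at open time, is what closes the window in which root sets the flag after the file was opened.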

fat: fix d_revalidate oopsen on NFS exports · 9177ada9

Authored by Al Viro on Mar 10, 2011

can't blindly check nd->flags in ->d_revalidate()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

9177ada9
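The series of d_revalidate fixes that starts with the fat commit above shares one pattern: do not dereference `nd` before checking it. A minimal userspace sketch of that pattern follows. It assumes that `nd` can be NULL when the call comes from the NFS export path, which the one-line message implies but does not state. The struct, the flag value `SKETCH_LOOKUP_RCU`, and the return values are hypothetical stand-ins.

```c
#include <stddef.h>
#include <errno.h>

/* Hypothetical flag marking a lockless (RCU) path walk. */
#define SKETCH_LOOKUP_RCU 0x40u

/* Hypothetical stand-in for the lookup context passed to ->d_revalidate(). */
struct sketch_nameidata {
    unsigned int flags;
};

/*
 * Returns -ECHILD to ask for a retry outside RCU mode, 1 when the dentry
 * is considered valid.  The nd pointer is tested before it is dereferenced.
 */
int sketch_d_revalidate(const struct sketch_nameidata *nd)
{
    if (nd && (nd->flags & SKETCH_LOOKUP_RCU))
        return -ECHILD;
    return 1;
}
```

Without the `nd &&` guard, a caller that passes no lookup context would trigger a NULL pointer dereference.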

jfs: fix d_revalidate oopsen on NFS exports · 8ce84eeb

Authored by Al Viro on Mar 10, 2011

can't blindly check nd->flags in ->d_revalidate()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

8ce84eeb

ocfs2: fix d_revalidate oopsen on NFS exports · 4714e637

Authored by Al Viro on Mar 10, 2011

can't blindly check nd->flags in ->d_revalidate()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

4714e637

gfs2: fix d_revalidate oopsen on NFS exports · 53fe9241

Authored by Al Viro on Mar 10, 2011

can't blindly check nd->flags in ->d_revalidate()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

53fe9241

fuse: fix d_revalidate oopsen on NFS exports · 529c5f95

Authored by Al Viro on Mar 10, 2011

can't blindly check nd->flags in ->d_revalidate()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

529c5f95

ceph: fix d_revalidate oopsen on NFS exports · 0eb980e3

Authored by Al Viro on Mar 10, 2011

can't blindly check nd->flags in ->d_revalidate()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

0eb980e3

reiserfs xattr ->d_revalidate() shouldn't care about RCU · c78f4cc5

Authored by Al Viro on Feb 16, 2011

... it returns an error unconditionally
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

c78f4cc5

/proc/self is never going to be invalidated... · ae50adcb

Authored by Al Viro on Feb 16, 2011

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

ae50adcb

09 Mar, 2011 3 commits

nd->inode is not set on the second attempt in path_walk() · b306419a

Authored by Al Viro on Mar 08, 2011

We leave it at whatever it had been pointing to after the
first link_path_walk() had failed with -ESTALE.  Things
do not work well after that...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

b306419a

nfsd: wrong index used in inner loop · 3ec07aa9

Authored by roel on Mar 08, 2011

Index i was already used in the outer loop

Cc: stable@kernel.org
Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>

3ec07aa9

Btrfs: make sure not to return overlapping extents to fiemap · ea8efc74

Authored by Chris Mason on Mar 08, 2011

The btrfs fiemap code was incorrectly returning duplicate or overlapping
extents in some cases.  cp was blindly trusting this result and we would
end up with a destination file that was bigger than the original because
some bytes were copied twice.

The fix here adjusts our offsets to make sure we're always moving
forward in the fiemap results.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

ea8efc74
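The "always moving forward" rule from the fiemap commit above can be illustrated with a small filter over an extent list. This is a sketch of the general idea only, not the btrfs change: the struct, the function name, and the assumption that input extents arrive sorted by start offset are all made up for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical extent record: a byte range within a file. */
struct sketch_extent {
    uint64_t start;
    uint64_t len;
};

/*
 * Copy extents from in[] to out[], trimming or dropping anything that
 * would report bytes already covered by an earlier extent.
 * out[] must have room for n entries.  Returns the number written.
 */
size_t sketch_emit_forward_only(const struct sketch_extent *in, size_t n,
                                struct sketch_extent *out)
{
    uint64_t next = 0;      /* first byte offset not yet reported */
    size_t i, k = 0;

    for (i = 0; i < n; i++) {
        uint64_t start = in[i].start;
        uint64_t end = start + in[i].len;

        if (end <= next)
            continue;       /* duplicate: entirely covered already */
        if (start < next)
            start = next;   /* overlap: trim the part already reported */

        out[k].start = start;
        out[k].len = end - start;
        k++;
        next = end;
    }
    return k;
}
```

A consumer such as cp that trusts the reported ranges then never copies the same byte twice.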

08 Mar, 2011 3 commits

unfuck proc_sysctl ->d_compare() · dfef6dcd

Authored by Al Viro on Mar 08, 2011

a) struct inode is not going to be freed under ->d_compare();
however, the thing PROC_I(inode)->sysctl points to just might.
Fortunately, it's enough to make freeing that sucker delayed,
provided that we don't step on its ->unregistering, clear
the pointer to it in PROC_I(inode) before dropping the reference
and check if it's NULL in ->d_compare().

b) I'm not sure that we *can* walk into NULL inode here (we recheck
dentry->seq between verifying that it's still hashed / fetching
dentry->d_inode and passing it to ->d_compare() and there's no
negative hashed dentries in /proc/sys/*), but if we can walk into
that, we really should not have ->d_compare() return 0 on it!
Said that, I really suspect that this check can be simply killed.
Nick?
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

dfef6dcd

nfsd4: fix bad pointer on failure to find delegation · 32b007b4

Authored by J. Bruce Fields on Mar 06, 2011

In case of a nonempty list, the return on error here is obviously bogus;
it ends up being a pointer to the list head instead of to any valid
delegation on the list.

In particular, if nfsd4_delegreturn() hits this case, and you're quite unlucky,
then renew_client may oops, and it may take an embarrassingly long time to
figure out why.  Facepalm.

BUG: unable to handle kernel NULL pointer dereference at 0000000000000090
IP: [<ffffffff81292965>] nfsd4_delegreturn+0x125/0x200
...

Cc: stable@kernel.org
Signed-off-by: J. Bruce Fields <bfields@redhat.com>

32b007b4
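The pitfall behind the nfsd4 delegation fix above is a common one with circular lists: when a search loop runs to completion, the cursor points at the list head, not at an element. A minimal userspace sketch of a lookup that returns NULL on a miss is shown below. The list type, the `sketch_deleg` struct, and the function name are hypothetical and much simpler than the kernel's list helpers.

```c
#include <stddef.h>

/* Minimal circular singly linked list node, embedded in each element. */
struct sketch_node {
    struct sketch_node *next;
};

/* Hypothetical delegation record with an embedded list node. */
struct sketch_deleg {
    int id;
    struct sketch_node link;
};

/* Recover the containing record from a pointer to its embedded node. */
#define SKETCH_NODE_TO_DELEG(n) \
    ((struct sketch_deleg *)((char *)(n) - offsetof(struct sketch_deleg, link)))

/* Return the delegation with the given id, or NULL if none matches. */
struct sketch_deleg *sketch_find_deleg(struct sketch_node *head, int id)
{
    struct sketch_node *n;

    for (n = head->next; n != head; n = n->next) {
        struct sketch_deleg *dp = SKETCH_NODE_TO_DELEG(n);

        if (dp->id == id)
            return dp;
    }
    /*
     * Here n == head.  Converting it with SKETCH_NODE_TO_DELEG() would
     * yield a pointer to memory around the list head, not a delegation,
     * so report the miss explicitly.
     */
    return NULL;
}
```

Callers can then test for NULL instead of receiving a plausible-looking but bogus pointer.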

Btrfs: deal with short returns from copy_from_user · 31339acd

Authored by Chris Mason on Mar 07, 2011

When copy_from_user is only able to copy some of the bytes we requested,
we may end up creating a partially up to date page.  To avoid garbage in
the page, we need to treat a partial copy as a zero length copy.

This makes the rest of the file_write code drop the page and
retry the whole copy instead of marking the partially up to
date page as dirty.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
cc: stable@kernel.org

31339acd
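The rule stated in the short-copy commit above, treating a partial copy as a zero-length copy, fits in a single helper. The sketch below is a userspace illustration with a hypothetical function name. It applies the rule unconditionally, whereas the real code may take page state into account, which the message does not spell out.

```c
#include <stddef.h>

/*
 * Given how many bytes were requested and how many were actually copied,
 * return the number of bytes the caller should treat as written.
 * A short copy is reported as 0 so that the caller drops the page and
 * retries the whole copy instead of keeping half-filled data.
 */
size_t sketch_accept_copied(size_t requested, size_t copied)
{
    if (copied < requested)
        return 0;
    return copied;
}
```

The caller's retry loop then only ever sees two outcomes per page: everything copied, or nothing copied.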

07 Mar, 2011 1 commit

Btrfs: fix regressions in copy_from_user handling · b1bf862e

Authored by Chris Mason on Feb 28, 2011

Commit 914ee295 fixed deadlocks in
btrfs_file_write where we would catch page faults on pages we had
locked.

But, there were a few problems:

1) The x86-32 iov_iter_copy_from_user_atomic code always fails to copy
data when the amount to copy is more than 4K and the offset to start
copying from is not page aligned.  The result was btrfs_file_write
looping forever retrying the iov_iter_copy_from_user_atomic call.

We deal with this by changing btrfs_file_write to drop down to single
page copies when iov_iter_copy_from_user_atomic starts returning failure.

2) The btrfs_file_write code was leaking delalloc reservations when
iov_iter_copy_from_user_atomic returned zero.  The looping above would
result in the entire filesystem running out of delalloc reservations and
constantly trying to flush things to disk.

3) btrfs_file_write will lock down page cache pages, make sure
any writeback is finished, do the copy_from_user and then release them.
Before the loop runs we check the first and last pages in the write to
see if they are only being partially modified.  If the start or end of
the write isn't aligned, we make sure the corresponding pages are
up to date so that we don't introduce garbage into the file.

With the copy_from_user changes, we're allowing the VM to reclaim the
pages after a partial update from copy_from_user, but we're not
making sure the page cache page is up to date when we loop around to
resume the write.

We deal with this by pushing the up to date checks down into the page
prep code.  This fits better with how the rest of file_write works.
Signed-off-by: Chris Mason <chris.mason@oracle.com>
Reported-by: Mitch Harder <mitch.harder@sabayonlinux.org>
cc: stable@kernel.org

b1bf862e

05 Mar, 2011 3 commits

nfs4: Ensure that ACL pages sent over NFS were not allocated from the slab (v3) · e9e3d724

Authored by Neil Horman on Mar 04, 2011

The "bad_page()" page allocator sanity check was reported recently (call
chain as follows):

  bad_page+0x69/0x91
  free_hot_cold_page+0x81/0x144
  skb_release_data+0x5f/0x98
  __kfree_skb+0x11/0x1a
  tcp_ack+0x6a3/0x1868
  tcp_rcv_established+0x7a6/0x8b9
  tcp_v4_do_rcv+0x2a/0x2fa
  tcp_v4_rcv+0x9a2/0x9f6
  do_timer+0x2df/0x52c
  ip_local_deliver+0x19d/0x263
  ip_rcv+0x539/0x57c
  netif_receive_skb+0x470/0x49f
  :virtio_net:virtnet_poll+0x46b/0x5c5
  net_rx_action+0xac/0x1b3
  __do_softirq+0x89/0x133
  call_softirq+0x1c/0x28
  do_softirq+0x2c/0x7d
  do_IRQ+0xec/0xf5
  default_idle+0x0/0x50
  ret_from_intr+0x0/0xa
  default_idle+0x29/0x50
  cpu_idle+0x95/0xb8
  start_kernel+0x220/0x225
  _sinittext+0x22f/0x236

It occurs because an skb with a fraglist was freed from the tcp
retransmit queue when it was acked, but a page on that fraglist had
PG_Slab set, indicating it was allocated from the Slab allocator, which
means the free path above can't safely free it via put_page.

We tracked this back to an nfsv4 setacl operation, in which the nfs code
attempted to convert the passed-in buffer to an array of pages in
__nfs4_proc_set_acl, which gets used by the skb->frags list in
xs_sendpages.  __nfs4_proc_set_acl just converts each page in the buffer
to a page struct via virt_to_page, but the vfs allocates the buffer via
kmalloc, meaning the PG_slab bit is set.  We can't create a buffer with
kmalloc and free it later in the tcp ack path with put_page, so we need
to either:

1) ensure that when we create the list of pages, no page struct has
   PG_Slab set

 or

2) not use a page list to send this data

Given that these buffers can be multiple pages and arbitrarily sized, I
think (1) is the right way to go.  I've written the below patch to
allocate a page from the buddy allocator directly and copy the data over
to it.  This ensures that we have a put_page free-able page for every
entry that winds up on an skb frag list, so it can be safely freed when
the frame is acked.  We do a put_page on each entry after the
rpc_call_sync call so as to drop our own reference count to the page,
leaving only the ref count taken by tcp_sendpages.  This way the data
will be properly freed when the ack comes in.

Successfully tested by myself to solve the above oops.

Note, as this is the result of a setacl operation that exceeded a page
of data, I think this amounts to a local DoS triggerable by an
unprivileged user, so I'm CCing security on this as well.
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: Trond Myklebust <Trond.Myklebust@netapp.com>
CC: security@kernel.org
CC: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

e9e3d724

ceph: no .snap inside of snapped namespace · 455cec0a

Authored by Sage Weil on Mar 03, 2011

Otherwise you can do things like

# mkdir .snap/foo
# cd .snap/foo/.snap
# ls
<badness>
Signed-off-by: Sage Weil <sage@newdream.net>

455cec0a

minimal fix for do_filp_open() race · 1858efd4

Authored by Al Viro on Mar 04, 2011

failure exits on the no-O_CREAT side of do_filp_open() merge with
those of O_CREAT one; unfortunately, if do_path_lookup() returns
-ESTALE, we'll get to out_filp:, notice that we are about to return
-ESTALE without having tried to create the sucker with LOOKUP_REVAL
and jump right into the O_CREAT side of code.  And proceed to try
and create a file.  Usually that'll fail with -ESTALE again, but
we can race and get that attempt of pathname resolution to succeed.

open() without O_CREAT really shouldn't end up creating files, races
or not.  The real fix is to rearchitect the whole do_filp_open(),
but for now splitting the failure exits will do.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

1858efd4

openanolis / cloud-kernel synced successfully more than 1 year ago