Commits · 1bccf513ac49d44604ba1cddcc29f5886e70f1b6 · openeuler / raspberrypi-kernel

20 Nov, 2009 9 commits

FS-Cache: Fix lock misorder in fscache_write_op() · 1bccf513

Authored by David Howells on Nov 19, 2009

FS-Cache has two structs internally for keeping track of the internal state of
a cached file: the fscache_cookie struct, which represents the netfs's state,
and fscache_object struct, which represents the cache's state.  Each has a
pointer that points to the other (when both are in existence), and each has a
spinlock for pointer maintenance.

Since netfs operations approach these structures from the cookie side, they get
the cookie lock first, then the object lock.  Cache operations, on the other
hand, approach from the object side, and get the object lock first.  It is not
then permitted for a cache operation to get the cookie lock whilst it is
holding the object lock lest deadlock occur; instead, it must do one of two
things:

 (1) increment the cookie usage counter, drop the object lock and then get both
     locks in order, or

 (2) simply hold the object lock as certain parts of the cookie may not be
     altered whilst the object lock is held.

It is also not permitted to follow either pointer without holding the lock at
the end you start with.  To break the pointers between the cookie and the
object, both locks must be held.

fscache_write_op(), however, violates the locking rules: It attempts to get the
cookie lock without (a) checking that the cookie pointer is a valid pointer,
and (b) holding the object lock to protect the cookie pointer whilst it follows
it.  This is so that it can access the pending page store tree without
interference from __fscache_write_page().

This is fixed by splitting the cookie lock, such that the page store tracking
tree is protected by its own lock, and checking that the cookie pointer is
non-NULL before we attempt to follow it whilst holding the object lock.

The new lock is subordinate to both the cookie lock and the object lock, and so
should be taken after those.
Signed-off-by: David Howells <dhowells@redhat.com>
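
The locking rules described in this commit can be sketched in userspace C. This is only an illustration under assumptions: the struct layouts, the function names and the use of pthread mutexes in place of kernel spinlocks are all hypothetical and are not the FS-Cache code. It shows the netfs side taking the cookie lock, then the object lock, then the new store-tracking lock, and the cache side holding the object lock and checking the cookie pointer for NULL before following it.

```c
#include <pthread.h>
#include <stddef.h>

struct demo_object;

/* Hypothetical stand-in for the netfs-side state. */
struct demo_cookie {
    pthread_mutex_t lock;         /* guards the cookie <-> object pointers */
    pthread_mutex_t stores_lock;  /* guards only the pending store tracking */
    struct demo_object *object;
    int pending_stores;
};

/* Hypothetical stand-in for the cache-side state. */
struct demo_object {
    pthread_mutex_t lock;
    struct demo_cookie *cookie;   /* NULL once the cookie is detached */
};

/* Netfs side: cookie lock, then object lock, then stores lock. */
int netfs_side_add_store(struct demo_cookie *c)
{
    int done = 0;

    pthread_mutex_lock(&c->lock);
    if (c->object != NULL) {
        struct demo_object *obj = c->object;

        pthread_mutex_lock(&obj->lock);
        pthread_mutex_lock(&c->stores_lock);
        c->pending_stores++;
        pthread_mutex_unlock(&c->stores_lock);
        pthread_mutex_unlock(&obj->lock);
        done = 1;
    }
    pthread_mutex_unlock(&c->lock);
    return done;
}

/* Cache side: never takes the cookie lock.  It holds the object lock,
 * checks the cookie pointer, and only then takes the subordinate lock. */
int cache_side_complete_store(struct demo_object *obj)
{
    int done = 0;

    pthread_mutex_lock(&obj->lock);
    if (obj->cookie != NULL) {
        struct demo_cookie *c = obj->cookie;

        pthread_mutex_lock(&c->stores_lock);
        if (c->pending_stores > 0)
            c->pending_stores--;
        pthread_mutex_unlock(&c->stores_lock);
        done = 1;
    }
    pthread_mutex_unlock(&obj->lock);
    return done;
}
```

Because the store-tracking lock is always taken last on both paths, neither path can end up waiting for a lock that the other holds out of order.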

FS-Cache: The object-available state can't rely on the cookie to be available · 6897e3df

Authored by David Howells on Nov 19, 2009

The object-available state in the object processing state machine (as
processed by fscache_object_available()) can't rely on the cookie to be
available because the FSCACHE_COOKIE_CREATING bit may have been cleared by
fscache_obtained_object() prior to the object being put into the
FSCACHE_OBJECT_AVAILABLE state.

Clearing the FSCACHE_COOKIE_CREATING bit on a cookie permits
__fscache_relinquish_cookie() to proceed and detach the cookie from the
object.

To deal with this, we don't dereference object->cookie in
fscache_object_available() if the object has already been detached.

In addition, a couple of assertions are added into fscache_drop_object() to
make sure the object is unbound from the cookie before it gets there.
Signed-off-by: David Howells <dhowells@redhat.com>

FS-Cache: Permit cache retrieval ops to be interrupted in the initial wait phase · 5753c441

Authored by David Howells on Nov 19, 2009

Permit the operations to retrieve data from the cache or to allocate space in
the cache for future writes to be interrupted whilst they're waiting for
permission for the operation to proceed. Typically this wait occurs whilst the
cache object is being looked up on disk in the background.

If an interruption occurs, and the operation has not yet been given the
go-ahead to run, the operation is dequeued and cancelled, and control returns
to the read operation of the netfs routine with none of the requested pages
having been read or in any way marked as known by the cache.

This means that the initial wait is done interruptibly rather than
uninterruptibly.

In addition, extra stats values are made available to show the number of ops
cancelled and the number of cache space allocations interrupted.
Signed-off-by: David Howells <dhowells@redhat.com>

FS-Cache: Use radix tree preload correctly in tracking of pages to be stored · b34df792

Authored by David Howells on Nov 19, 2009

__fscache_write_page() attempts to load the radix tree preallocation pool for
the CPU it is on before calling radix_tree_insert(), as the insertion must be
done inside a pair of spinlocks.

Use of the preallocation pool, however, is contingent on the radix tree being
initialised without __GFP_WAIT specified.  __fscache_acquire_cookie() was
passing GFP_NOFS to INIT_RADIX_TREE() - but that includes __GFP_WAIT.

The solution is to AND out __GFP_WAIT.

Additionally, the banner comment to radix_tree_preload() is altered to make
note of this prerequisite.  Possibly there should be a WARN_ON() too.

Without this fix, I have seen the following recursive deadlock caused by
radix_tree_insert() attempting to allocate memory inside the spinlocked
region, which resulted in FS-Cache being called back into to release memory -
which required the spinlock already held.

=============================================
[ INFO: possible recursive locking detected ]
2.6.32-rc6-cachefs #24
---------------------------------------------
nfsiod/7916 is trying to acquire lock:
 (&cookie->lock){+.+.-.}, at: [<ffffffffa0076872>] __fscache_uncache_page+0xdb/0x160 [fscache]

but task is already holding lock:
 (&cookie->lock){+.+.-.}, at: [<ffffffffa0076acc>] __fscache_write_page+0x15c/0x3f3 [fscache]

other info that might help us debug this:
5 locks held by nfsiod/7916:
 #0:  (nfsiod){+.+.+.}, at: [<ffffffff81048290>] worker_thread+0x19a/0x2e2
 #1:  (&task->u.tk_work#2){+.+.+.}, at: [<ffffffff81048290>] worker_thread+0x19a/0x2e2
 #2:  (&cookie->lock){+.+.-.}, at: [<ffffffffa0076acc>] __fscache_write_page+0x15c/0x3f3 [fscache]
 #3:  (&object->lock#2){+.+.-.}, at: [<ffffffffa0076b07>] __fscache_write_page+0x197/0x3f3 [fscache]
 #4:  (&cookie->stores_lock){+.+...}, at: [<ffffffffa0076b0f>] __fscache_write_page+0x19f/0x3f3 [fscache]

stack backtrace:
Pid: 7916, comm: nfsiod Not tainted 2.6.32-rc6-cachefs #24
Call Trace:
 [<ffffffff8105ac7f>] __lock_acquire+0x1649/0x16e3
 [<ffffffff81059ded>] ? __lock_acquire+0x7b7/0x16e3
 [<ffffffff8100e27d>] ? dump_trace+0x248/0x257
 [<ffffffff8105ad70>] lock_acquire+0x57/0x6d
 [<ffffffffa0076872>] ? __fscache_uncache_page+0xdb/0x160 [fscache]
 [<ffffffff8135467c>] _spin_lock+0x2c/0x3b
 [<ffffffffa0076872>] ? __fscache_uncache_page+0xdb/0x160 [fscache]
 [<ffffffffa0076872>] __fscache_uncache_page+0xdb/0x160 [fscache]
 [<ffffffffa0077eb7>] ? __fscache_check_page_write+0x0/0x71 [fscache]
 [<ffffffffa00b4755>] nfs_fscache_release_page+0x86/0xc4 [nfs]
 [<ffffffffa00907f0>] nfs_release_page+0x3c/0x41 [nfs]
 [<ffffffff81087ffb>] try_to_release_page+0x32/0x3b
 [<ffffffff81092c2b>] shrink_page_list+0x316/0x4ac
 [<ffffffff81058a9b>] ? mark_held_locks+0x52/0x70
 [<ffffffff8135451b>] ? _spin_unlock_irq+0x2b/0x31
 [<ffffffff81093153>] shrink_inactive_list+0x392/0x67c
 [<ffffffff81058a9b>] ? mark_held_locks+0x52/0x70
 [<ffffffff810934ca>] shrink_list+0x8d/0x8f
 [<ffffffff81093744>] shrink_zone+0x278/0x33c
 [<ffffffff81052c70>] ? ktime_get_ts+0xad/0xba
 [<ffffffff8109453b>] try_to_free_pages+0x22e/0x392
 [<ffffffff8109184c>] ? isolate_pages_global+0x0/0x212
 [<ffffffff8108e16b>] __alloc_pages_nodemask+0x3dc/0x5cf
 [<ffffffff810ae24a>] cache_alloc_refill+0x34d/0x6c1
 [<ffffffff811bcf74>] ? radix_tree_node_alloc+0x52/0x5c
 [<ffffffff810ae929>] kmem_cache_alloc+0xb2/0x118
 [<ffffffff811bcf74>] radix_tree_node_alloc+0x52/0x5c
 [<ffffffff811bcfd5>] radix_tree_insert+0x57/0x19c
 [<ffffffffa0076b53>] __fscache_write_page+0x1e3/0x3f3 [fscache]
 [<ffffffffa00b4248>] __nfs_readpage_to_fscache+0x58/0x11e [nfs]
 [<ffffffffa009bb77>] nfs_readpage_release+0x34/0x9b [nfs]
 [<ffffffffa009c0d9>] nfs_readpage_release_full+0x32/0x4b [nfs]
 [<ffffffffa0006cff>] rpc_release_calldata+0x12/0x14 [sunrpc]
 [<ffffffffa0006e2d>] rpc_free_task+0x59/0x61 [sunrpc]
 [<ffffffffa0006f03>] rpc_async_release+0x10/0x12 [sunrpc]
 [<ffffffff810482e5>] worker_thread+0x1ef/0x2e2
 [<ffffffff81048290>] ? worker_thread+0x19a/0x2e2
 [<ffffffff81352433>] ? thread_return+0x3e/0x101
 [<ffffffffa0006ef3>] ? rpc_async_release+0x0/0x12 [sunrpc]
 [<ffffffff8104bff5>] ? autoremove_wake_function+0x0/0x34
 [<ffffffff81058d25>] ? trace_hardirqs_on+0xd/0xf
 [<ffffffff810480f6>] ? worker_thread+0x0/0x2e2
 [<ffffffff8104bd21>] kthread+0x7a/0x82
 [<ffffffff8100beda>] child_rip+0xa/0x20
 [<ffffffff8100b87c>] ? restore_args+0x0/0x30
 [<ffffffff8104c2b9>] ? add_wait_queue+0x15/0x44
 [<ffffffff8104bca7>] ? kthread+0x0/0x82
 [<ffffffff8100bed0>] ? child_rip+0x0/0x20
Signed-off-by: David Howells <dhowells@redhat.com>
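
The "AND out __GFP_WAIT" step is a plain bit-mask operation. The sketch below uses made-up DEMO_ flag values and helper names purely for illustration; the real flag definitions live in the kernel headers and are not reproduced here. It shows a NOFS-style mask that includes the "may sleep" bit, and the same mask with that bit cleared before being handed to the tree.

```c
/* Illustrative flag bits only, not the kernel's definitions. */
#define DEMO_GFP_WAIT 0x10u                      /* allocation may sleep */
#define DEMO_GFP_IO   0x40u                      /* allocation may start I/O */
#define DEMO_GFP_NOFS (DEMO_GFP_WAIT | DEMO_GFP_IO)

/* Mask to give the tree at init time: same as base, minus "may sleep". */
unsigned int demo_tree_gfp_mask(unsigned int base)
{
    return base & ~DEMO_GFP_WAIT;
}

/* Non-zero if an allocation made with this mask is allowed to sleep. */
int demo_may_sleep(unsigned int mask)
{
    return (mask & DEMO_GFP_WAIT) != 0;
}
```

With the wait bit cleared, an insertion done under a spinlock has no reason to enter reclaim and is expected to draw on the preloaded nodes instead, which is the behaviour the commit message asks for.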

FS-Cache: Clear netfs pointers in cookie after detaching object, not before · 7e311a20

Authored by David Howells on Nov 19, 2009

Clear the pointers from the fscache_cookie struct to netfs private data after
clearing the pointer to the cookie from the fscache_object struct and
releasing the object lock, rather than before.

This allows the netfs private data pointers to be relied on simply by holding
the object lock, rather than having to hold the cookie lock. This makes
things simpler as the cookie lock has to be taken before the object lock, but
sometimes the object pointer is all that the code has.
Signed-off-by: David Howells <dhowells@redhat.com>

FS-Cache: Add counters for entry/exit to/from cache operation functions · 52bd75fd

Authored by David Howells on Nov 19, 2009

Count entries to and exits from cache operation table functions. Maintain
these as a single counter that's added to or removed from as appropriate.
Signed-off-by: David Howells <dhowells@redhat.com>

FS-Cache: Allow the current state of all objects to be dumped · 4fbf4291

Authored by David Howells on Nov 19, 2009

Allow the current state of all fscache objects to be dumped by doing:

	cat /proc/fs/fscache/objects

By default, all objects and all fields will be shown.  This can be restricted
by adding a suitable key to one of the caller's keyrings (such as the session
keyring):

	keyctl add user fscache:objlist "<restrictions>" @s

The <restrictions> are:

	K	Show hexdump of object key (don't show if not given)
	A	Show hexdump of object aux data (don't show if not given)

And paired restrictions:

	C	Show objects that have a cookie
	c	Show objects that don't have a cookie
	B	Show objects that are busy
	b	Show objects that aren't busy
	W	Show objects that have pending writes
	w	Show objects that don't have pending writes
	R	Show objects that have outstanding reads
	r	Show objects that don't have outstanding reads
	S	Show objects that have slow work queued
	s	Show objects that don't have slow work queued

If neither side of a restriction pair is given, then both are implied.  For
example:

	keyctl add user fscache:objlist KB @s

shows objects that are busy, and lists their object keys, but does not dump
their auxiliary data.  It also implies "CcWwRrSs", but as 'B' is given, 'b' is
not implied.
Signed-off-by: David Howells <dhowells@redhat.com>
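
The pairing rule for the restriction letters can be modelled as a small parser. This is a hypothetical userspace sketch, not the kernel implementation: the bit layout and the names objlist_letters, objlist_bit() and objlist_parse() are assumptions made for illustration. Each letter sets one bit, and any pair with neither side given gets both sides switched on.

```c
#include <string.h>

/* Bit i corresponds to objlist_letters[i].  K and A stand alone; the
 * remaining letters form pairs starting at index 2 ("Cc", "Bb", ...). */
static const char objlist_letters[] = "KACcBbWwRrSs";

/* Bit for one restriction letter, or 0 if the letter is unknown. */
unsigned int objlist_bit(char ch)
{
    const char *p = (ch != '\0') ? strchr(objlist_letters, ch) : NULL;

    return p ? 1u << (unsigned int)(p - objlist_letters) : 0u;
}

/* Turn a restriction string such as "KB" into a bit mask. */
unsigned int objlist_parse(const char *restrictions)
{
    unsigned int mask = 0;
    unsigned int bit;
    const char *s;

    for (s = restrictions; *s != '\0'; s++)
        mask |= objlist_bit(*s);

    /* If neither side of a pair was given, both are implied. */
    for (bit = 2; bit < sizeof(objlist_letters) - 1; bit += 2) {
        unsigned int pair = 3u << bit;

        if ((mask & pair) == 0)
            mask |= pair;
    }
    return mask;
}
```

For the string "KB" this yields K and B plus the implied "CcWwRrSs", while 'A' and 'b' stay clear, matching the example in the commit message.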

FS-Cache: Annotate slow-work runqueue proc lines for FS-Cache work items · 440f0aff

Authored by David Howells on Nov 19, 2009

Annotate slow-work runqueue proc lines for FS-Cache work items. Objects
include the object ID and the state. Operations include the object ID, the
operation ID and the operation type and state.
Signed-off-by: David Howells <dhowells@redhat.com>

SLOW_WORK: Wait for outstanding work items belonging to a module to clear · 3d7a641e

Authored by David Howells on Nov 19, 2009

Wait for outstanding slow work items belonging to a module to clear when
unregistering that module as a user of the facility. This prevents the put_ref
code of a work item from being taken away before it returns.
Signed-off-by: David Howells <dhowells@redhat.com>

18 Nov, 2009 4 commits

fcntl: rename F_OWNER_GID to F_OWNER_PGRP · 978b4053

Authored by Peter Zijlstra on Nov 17, 2009

This is for consistency with various ioctl() operations that include the
suffix "PGRP" in their names, and also for consistency with PRIO_PGRP,
used with setpriority() and getpriority().  Also, using PGRP instead of
GID avoids confusion with the common abbreviation of "group ID".

I'm fine with anything that makes it more consistent, and if PGRP is what
is the predominant abbreviation then I see no need to further confuse
matters by adding a third one.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

procfs: fix /proc/<pid>/stat stack pointer for kernel threads · 9ebd4eba

Authored by Stefani Seibold on Nov 17, 2009

Fix a small issue for the stack pointer in /proc/<pid>/stat.  In case of a
kernel thread the value of the printed stack pointer should be 0.
Signed-off-by: Stefani Seibold <stefani@seibold.net>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

xfs: copy li_lsn before dropping AIL lock · 6c06f072

Authored by Nathaniel W. Turner on Nov 16, 2009

Access to log items on the AIL is generally protected by m_ail_lock;
this is particularly needed when we're getting or setting the 64-bit
li_lsn on a 32-bit platform.  This patch fixes a couple places where we
were accessing the log item after dropping the AIL lock on 32-bit
machines.

This can result in a partially-zeroed log->l_tail_lsn if
xfs_trans_ail_delete is racing with xfs_trans_ail_update, and in at
least some cases, this can leave the l_tail_lsn with a zero cycle
number, which means xlog_space_left will think the log is full (unless
CONFIG_XFS_DEBUG is set, in which case we'll trip an ASSERT), leading to
processes stuck forever in xlog_grant_log_space.

Thanks to Adrian VanderSpek for first spotting the race potential and to
Dave Chinner for debug assistance.
Signed-off-by: Nathaniel W. Turner <nate@houseofnate.net>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>
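
The pattern behind this fix, copying a 64-bit field into a local variable while the lock is still held, can be sketched as follows. The types and names (demo_item, demo_ail, demo_tail_lsn) are hypothetical and a pthread mutex stands in for the AIL lock; this is not the XFS code. On a 32-bit machine a 64-bit read may be performed as two separate halves, so reading the field after unlocking could observe a half-updated value.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical log item carrying a 64-bit sequence number. */
struct demo_item {
    uint64_t lsn;   /* cycle number in the high 32 bits */
};

/* Hypothetical list head; 'lock' plays the role of the AIL lock. */
struct demo_ail {
    pthread_mutex_t lock;
    struct demo_item *min_item;
};

/* Snapshot the LSN while the lock is held, then return the copy. */
uint64_t demo_tail_lsn(struct demo_ail *ail)
{
    uint64_t lsn = 0;

    pthread_mutex_lock(&ail->lock);
    if (ail->min_item != NULL)
        lsn = ail->min_item->lsn;   /* copied under the lock */
    pthread_mutex_unlock(&ail->lock);

    /* Only the local copy is used from here on. */
    return lsn;
}
```

The caller never touches the item itself after the unlock, so a concurrent update cannot hand it a value with a zeroed cycle number.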

XFS bug in log recover with quota (bugzilla id 855) · 8ec6dba2

Authored by Jan Rekorajski on Nov 16, 2009

Hi,
I was hit by a bug in Linux 2.6.31 where XFS is not able to recover the
log after a crash if the fs was mounted with quotas. Gory details in XFS
bugzilla: http://oss.sgi.com/bugzilla/show_bug.cgi?id=855.

It looks like the wrong struct is used in the buffer length check, and the
following patch should fix the problem.

xfs_dqblk_t has a size of 104+32 bytes, while xfs_disk_dquot_t is 104 bytes
long, and this is exactly what I see in system logs - "XFS: dquot too small
(104) in xlog_recover_do_dquot_trans."
Signed-off-by: Jan Rekorajski <baggins@sith.mimuw.edu.pl>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Elder <aelder@sgi.com>

16 Nov, 2009 1 commit

cifs: clear server inode number flag while autodisabling · f534dc99

Authored by Suresh Jayaraman on Nov 16, 2009

Fix commit ec06aedd, which was intended to turn off querying for server
inode numbers when the server doesn't consistently support them. The commit
didn't actually clear the CIFS_MOUNT_SERVER_INUM flag, presumably because of
a typo.
Signed-off-by: Suresh Jayaraman <sjayaraman@suse.de>
Acked-by: Jeff Layton <jlayton@redhat.com>
Cc: Stable <stable@kernel.org>
Signed-off-by: Steve French <sfrench@us.ibm.com>

15 Nov, 2009 2 commits

nilfs2: deleted inconsistent comment in nilfs_load_inode_block() · 18dafac1

Authored by Jiro SEKIBA on Nov 15, 2009

The comment says, "Caller of this function MUST lock s_inode_lock",
however just above the comment, it locks s_inode_lock in the function.
Signed-off-by: Jiro SEKIBA <jir@unicus.jp>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>

Fix memory corruption caused by nfsd readdir+ · 479c2553

Authored by Petr Vandrovec on Nov 14, 2009

Commit 8177e6d6 ("nfsd: clean up readdirplus encoding") introduced a
single-character typo in the nfs3 readdir+ implementation.  Unfortunately
that typo has quite bad side effects: random memory corruption, followed
(on my box) by an immediate spontaneous reboot.

Using 'p1' instead of 'p' fixes my Linux box rebooting whenever a VMware
ESXi box tries to list the contents of my home directory.
Signed-off-by: Petr Vandrovec <petr@vandrovec.name>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Neil Brown <neilb@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

13 Nov, 2009 1 commit

nilfs2: fix lock order reversal in chcp operation · c1ea985c

Authored by Ryusuke Konishi on Nov 12, 2009

This fixes the following lock order reversal detected by lockdep:

=======================================================
[ INFO: possible circular locking dependency detected ]
2.6.32-rc6 #7
-------------------------------------------------------
chcp/30157 is trying to acquire lock:
 (&nilfs->ns_mount_mutex){+.+.+.}, at: [<fed7cfcc>] nilfs_cpfile_change_cpmode+0x46/0x752 [nilfs2]

but task is already holding lock:
 (&nilfs->ns_segctor_sem){++++.+}, at: [<fed7ca32>] nilfs_transaction_begin+0xba/0x110 [nilfs2]

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #2 (&nilfs->ns_segctor_sem){++++.+}:
       [<c105799c>] __lock_acquire+0x109c/0x139d
       [<c1057d26>] lock_acquire+0x89/0xa0
       [<c14151e2>] down_read+0x31/0x45
       [<fed6d77b>] nilfs_attach_checkpoint+0x8f/0x16b [nilfs2]
       [<fed6e393>] nilfs_get_sb+0x3e7/0x653 [nilfs2]
       [<c10c0ccb>] vfs_kern_mount+0x8b/0x124
       [<c10c0db2>] do_kern_mount+0x37/0xc3
       [<c10d7517>] do_mount+0x64d/0x69d
       [<c10d75cd>] sys_mount+0x66/0x95
       [<c1002a14>] sysenter_do_call+0x12/0x32

-> #1 (&type->s_umount_key#31/1){+.+.+.}:
       [<c105799c>] __lock_acquire+0x109c/0x139d
       [<c1057d26>] lock_acquire+0x89/0xa0
       [<c104c0f3>] down_write_nested+0x34/0x52
       [<c10c08fe>] sget+0x22e/0x389
       [<fed6e133>] nilfs_get_sb+0x187/0x653 [nilfs2]
       [<c10c0ccb>] vfs_kern_mount+0x8b/0x124
       [<c10c0db2>] do_kern_mount+0x37/0xc3
       [<c10d7517>] do_mount+0x64d/0x69d
       [<c10d75cd>] sys_mount+0x66/0x95
       [<c1002a14>] sysenter_do_call+0x12/0x32

-> #0 (&nilfs->ns_mount_mutex){+.+.+.}:
       [<c1057727>] __lock_acquire+0xe27/0x139d
       [<c1057d26>] lock_acquire+0x89/0xa0
       [<c1414d63>] mutex_lock_nested+0x41/0x23e
       [<fed7cfcc>] nilfs_cpfile_change_cpmode+0x46/0x752 [nilfs2]
       [<fed801b2>] nilfs_ioctl+0x11a/0x7da [nilfs2]
       [<c10cca12>] vfs_ioctl+0x27/0x6e
       [<c10ccf93>] do_vfs_ioctl+0x491/0x4db
       [<c10cd022>] sys_ioctl+0x45/0x5f
       [<c1002a14>] sysenter_do_call+0x12/0x32
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>

12 Nov, 2009 15 commits

__generic_block_fiemap(): fix for files bigger than 4GB · e04b5ef8

Authored by Mike Hommey on Nov 11, 2009

Because of an integer overflow on start_blk, various kinds of wrong results
would be returned by the generic_block_fiemap() handler, such as no
extents when there is a 4GB+ hole at the beginning of the file, or wrong
fe_logical when an extent starts after the first 4GB.
Signed-off-by: Mike Hommey <mh@glandium.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Steven Whitehouse <swhiteho@redhat.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Eric Sandeen <sandeen@sgi.com>
Cc: Josef Bacik <jbacik@redhat.com>
Cc: Mark Fasheh <mfasheh@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
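
The class of bug described here, a block number converted to a byte offset in arithmetic that is too narrow, can be reproduced in a few lines. The helper names below are hypothetical and the sketch does not reproduce the kernel's actual macros or the exact variable types involved; it only shows why an offset at or beyond 4GB is lost when the shift is done in 32 bits.

```c
#include <stdint.h>

/* Shift performed in 32-bit arithmetic: the high bits are discarded
 * before the result is widened. */
uint64_t demo_blk_to_logical_narrow(uint32_t blk, unsigned int blkbits)
{
    return (uint32_t)(blk << blkbits);
}

/* Widen first, then shift: the full byte offset is preserved. */
uint64_t demo_blk_to_logical_wide(uint32_t blk, unsigned int blkbits)
{
    return (uint64_t)blk << blkbits;
}
```

With 4KB blocks (blkbits of 12), block number 1 << 20 sits exactly at the 4GB mark: the narrow version returns 0, while the wide version returns 4294967296.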

exec: setup_arg_pages() fails to return errors · fc63cf23

Authored by Anton Blanchard on Nov 11, 2009

In setup_arg_pages we work hard to assign a value to ret, but on exit we
always return 0.

Also remove a now duplicated exit path and branch to out_unlock instead.
Signed-off-by: Anton Blanchard <anton@samba.org>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Reviewed-by: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
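
The shape of this fix, a single exit label that releases the lock and returns the accumulated error code, looks roughly like the sketch below. Everything in it is hypothetical: demo_setup(), the fail_at_step parameter and the integer flag standing in for a lock are illustration only, and the steps do not correspond to what setup_arg_pages() really does.

```c
#include <errno.h>

/* Stand-in for a real lock so the sketch stays single-threaded. */
int demo_lock_held;

/* Run two steps under the "lock"; fail_at_step selects which one fails. */
int demo_setup(int fail_at_step)
{
    int ret;

    demo_lock_held = 1;

    ret = (fail_at_step == 1) ? -ENOMEM : 0;
    if (ret)
        goto out_unlock;

    ret = (fail_at_step == 2) ? -EFAULT : 0;
    if (ret)
        goto out_unlock;

out_unlock:
    demo_lock_held = 0;
    /* Returning a constant 0 here would hide every error set above. */
    return ret;
}
```

Funnelling both the success path and the error paths through one label is what makes a separate, duplicated exit path unnecessary.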

fs: add missing compat_ptr handling for FS_IOC_RESVSP ioctl · 7779d7be

Authored by Heiko Carstens on Nov 11, 2009

For FS_IOC_RESVSP and FS_IOC_RESVSP64, compat_sys_ioctl() uses its
arg argument as a pointer to userspace. However, it is missing a
call to compat_ptr(), which does the proper pointer conversion.

This was introduced with 3e63cbb1 "fs: Add new pre-allocation ioctls
to vfs for compatibility with legacy xfs ioctls".
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Ankit Jain <me@ankitjain.org>
Acked-by: Christoph Hellwig <hch@lst.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Arnd Bergmann <arndbergmann@googlemail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Cc: <stable@kernel.org>		[2.6.31.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

pidns: fix a leak in /proc dentries and inodes with pid namespaces. · 29f12ca3

Authored by Sukadev Bhattiprolu on Nov 11, 2009

Daniel Lezcano reported a leak in 'struct pid' and 'struct pid_namespace'
that is discussed in:

	http://lkml.org/lkml/2009/10/2/159.

To summarize the thread, when container-init is terminated, it sets the
PF_EXITING flag, zaps other processes in the container and waits to reap
them.  As a part of reaping, the container-init should flush any /proc
dentries associated with the processes.  But because the container-init is
itself exiting and the following PF_EXITING check, the dentries are not
flushed, resulting in a leak of /proc inodes and dentries.

This fix reverts the commit 7766755a ("Fix /proc dcache deadlock
in do_exit") which introduced the check for PF_EXITING.  At the time of
the commit, shrink_dcache_parent() flushed dentries from other filesystems
also and could have caused a deadlock which the commit fixed.  But as
pointed out by Eric Biederman, after commit 0feae5c4,
shrink_dcache_parent() no longer affects other filesystems.  So reverting
the commit is now safe.

As pointed out by Jan Kara, the leak is not as critical since the
unclaimed space will be reclaimed under memory pressure or by:

	echo 3 > /proc/sys/vm/drop_caches

But since this check is no longer required, it's best to remove it.
Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Reported-by: Daniel Lezcano <dlezcano@fr.ibm.com>
Acked-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Jan Kara <jack@ucw.cz>
Cc: Andrea Arcangeli <andrea@cpushare.com>
Cc: Serge Hallyn <serue@us.ibm.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

fs/jbd: Export log_start_commit to fix ext3 build. · ff5e4b51

Authored by Stefan Schmidt on Nov 12, 2009

This fixes:
ERROR: "log_start_commit" [fs/ext3/ext3.ko] undefined!
Signed-off-by: Stefan Schmidt <stefan@datenfreihafen.org>

Btrfs: fix panic when trying to destroy a newly allocated · a6dbd429

Authored by Josef Bacik on Nov 11, 2009

There is a problem where iget5_locked will look for an inode, not find it, and
then subsequently try to allocate it. Another CPU will have raced in and
allocated the inode instead, so when iget5_locked gets the inode spin lock again
and does a search, it finds the new inode. So it goes ahead and calls
destroy_inode on the inode it just allocated. The problem is we don't set
BTRFS_I(inode)->root until the new inode is completely initialized. This patch
makes us set root to NULL when alloc'ing a new inode, so when we get to
btrfs_destroy_inode and we see that root is NULL we can just free up the memory
and continue on. This fixes the panic

http://www.kerneloops.org/submitresult.php?number=812690

Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: allow more metadata chunk preallocation · 33b25808

Authored by Chris Mason on Nov 11, 2009

On an FS where all of the space has not been allocated into chunks yet,
the ENOSPC code can return ENOSPC just because the existing metadata chunks
are full.

We get around this by allowing more metadata chunks to be allocated up
to a certain limit, and finding the right limit is a little fuzzy.  The
problem is the reservations for delalloc would preallocate way too much
of the FS as metadata.  We need to start saying no and just force some
IO to happen.

But we also need to let a reasonable amount of the FS become metadata.
This bumps the hard limit up; later releases will have a better system.
Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: fallback on uncompressed io if compressed io fails · f5a84ee3

Authored by Josef Bacik on Nov 10, 2009

Currently compressed IO does not deal with not having its entire extent able to
be allocated. So if we have enough free space to allocate for the extent, but
it's not contiguous, it will fail spectacularly. This patch fixes this by
falling back on uncompressed IO which lets us spread the delalloc extent across
multiple extents. I tested this by making us randomly think the reservation had
failed to make it fallback on the uncompressed io way and it seemed to work
fine. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: find ideal block group for caching · ccf0e725

Authored by Josef Bacik on Nov 10, 2009

This patch changes a few things. Hopefully the comments are helpful, but
I'll try to be just as verbose here.

Problem:

My fedora box was taking 1 minute and 21 seconds to boot with btrfs as root.
Part of this problem was we pick the first block group we can find and start
caching it, even if it may not have enough free space. The other problem is
we only search for cached block groups the first time around, and we won't
find any because this is a newly mounted fs, so we end up caching several
block groups during bootup, which with a lot of fragmentation takes around
30-45 seconds to complete and bogs down the system.

Solution:

1) Don't cache block groups willy-nilly at first. Instead try and figure out
which block group has the most free, and therefore will take the least amount
of time to cache.

2) Don't be so picky about cached block groups. The other problem is once
we've filled up a cluster, if the block group isn't finished caching the next
time we try and do the allocation we'll completely ignore the cluster and
start searching from the beginning of the space, which makes us cache more
block groups, which slows us down even more. So instead of skipping block
groups that are not finished caching when we have a hint, only skip the block
group if it hasn't started caching yet.

There is one other tweak in here. Before if we allocated a chunk and still
couldn't find new space, we'd end up switching the space info to force another
chunk allocation. This could make us end up with way too many chunks, so keep
track of this particular case.

With this patch and my previous cluster fixes my fedora box now boots in 43
seconds, and according to the bootchart is not held up by our block group
caching at all.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: avoid null deref in unpin_extent_cache() · 4eb3991c

Authored by Dan Carpenter on Nov 10, 2009

I reordered the checks to avoid dereferencing "em" if it was null.

Found by smatch static checker.
Signed-off-by: Dan Carpenter <error27@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>
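
Reordering checks so that the NULL test comes first relies on the short-circuit evaluation of && and || in C. The sketch below is a hypothetical illustration only; demo_extent_map and demo_em_matches() are made-up names and the condition is not copied from unpin_extent_cache().

```c
#include <stddef.h>

/* Hypothetical stand-in for an extent map entry. */
struct demo_extent_map {
    unsigned long long start;
};

/* Returns 1 only if em exists and begins at 'start'.  The NULL test is
 * evaluated first, so em->start is never read through a NULL pointer. */
int demo_em_matches(const struct demo_extent_map *em,
                    unsigned long long start)
{
    return em != NULL && em->start == start;
}
```

Had the two operands been written in the opposite order, the field access would run before the NULL test, which is the kind of ordering a static checker can flag.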

Btrfs: skip btrfs_release_path in btrfs_update_root and btrfs_del_root · df66916e

Authored by Li Dongyang on Nov 06, 2009

We don't need to call btrfs_release_path because btrfs_free_path will do
that for us.
Signed-off-by: Li Dongyang <Jerry87905@gmail.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: fix some metadata enospc issues · 5df6a9f6

Authored by Josef Bacik on Nov 10, 2009

We weren't reserving metadata space for rename, rmdir and unlink, which could
cause problems.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: fix how we set max_size for free space clusters · 01dea1ef

Authored by Josef Bacik on Nov 10, 2009

This patch fixes a problem where max_size can be set to 0 even though we
filled the cluster properly. We set max_size to 0 if we restart the cluster
window, but if the new start entry is big enough to be our new cluster then we
could return with a max_size set to 0, which will mean the next time we try to
allocate from this cluster it will fail. So set max_extent to the entry's
size. Tested this on my box and now we actually allocate from the cluster
after we fill it. Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: cleanup transaction starting and fix journal_info usage · 249ac1e5

Authored by Josef Bacik on Nov 10, 2009

We use journal_info to tell if we're in a nested transaction to make sure we
don't commit the transaction within a nested transaction. We use another
method to see if there are any outstanding ioctl trans handles, so if we're
starting one do not set current->journal_info, since it will screw with other
filesystems. This patch also cleans up the starting stuff so there aren't any
magic numbers.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

Btrfs: fix data allocation hint start · 6346c939

Authored by Josef Bacik on Nov 10, 2009

Sometimes our start allocation hint when we cow a file can be either
EXTENT_HOLE or some other such place holder, which is not optimal. So if we
find that our em->block_start is one of these special values, check to see
where the first block of the inode is stored, and use that as a hint. If that
block is also a special value, just fallback on a hint of 0 and let the
allocator figure out a good place to put the data.
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Chris Mason <chris.mason@oracle.com>

11 Nov, 2009 3 commits

JBD/JBD2: free j_wbuf if journal init fails. · 7b02bec0

Authored by Tao Ma on Nov 10, 2009

If journal init fails, we need to free j_wbuf.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Tao Ma <tao.ma@oracle.com>
Signed-off-by: Jan Kara <jack@suse.cz>

ext3: Wait for proper transaction commit on fsync · fe8bc91c

Authored by Jan Kara on Oct 16, 2009

We cannot rely on buffer dirty bits during fsync because pdflush can come
before fsync is called and clear dirty bits without forcing a transaction
commit. Instead, we track which transaction last changed the inode and
which transaction last changed its allocation, and force those to disk on
fsync.
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>

ext3: retry failed direct IO allocations · ea0174a7

Authored by Eric Sandeen on Oct 12, 2009

On a 256M 4k block filesystem, doing this in a loop:

    dd if=/dev/zero of=test oflag=direct bs=1M count=64
    rm -f test

eventually leads to spurious ENOSPC:

    dd: writing `test': No space left on device

As with other block allocation callers, it looks like we need to
potentially retry the allocations on the initial ENOSPC.

A similar patch went into ext4 (commit fbbf6945).
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Jan Kara <jack@suse.cz>
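
Retrying an allocation on the initial ENOSPC can be sketched as a bounded loop. This is a generic illustration under assumptions: demo_alloc_fn, demo_flaky_alloc() and demo_alloc_with_retry() are hypothetical names, and the sketch does not model whatever the filesystem does between attempts to let space become available.

```c
#include <errno.h>

/* Hypothetical allocator callback: 0 on success, -ENOSPC when full. */
typedef int (*demo_alloc_fn)(void *ctx);

/* Demo allocator: fails with -ENOSPC while *ctx (an int) is positive. */
int demo_flaky_alloc(void *ctx)
{
    int *failures_left = ctx;

    if (*failures_left > 0) {
        (*failures_left)--;
        return -ENOSPC;
    }
    return 0;
}

/* Call try_alloc once, then up to max_retries more times on -ENOSPC. */
int demo_alloc_with_retry(demo_alloc_fn try_alloc, void *ctx,
                          int max_retries)
{
    int retries = 0;
    int ret;

    do {
        ret = try_alloc(ctx);
    } while (ret == -ENOSPC && retries++ < max_retries);

    return ret;
}
```

Any error other than -ENOSPC is returned immediately, and a persistent -ENOSPC is still reported once the retry budget is used up, so a genuinely full filesystem is not masked.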

09 Nov, 2009 1 commit

ext4: partial revert to fix double brelse WARNING() · 1e424a34

Authored by Theodore Ts'o on Nov 08, 2009

This is a partial revert of commit 6487a9d3 (only the changes made to
fs/ext4/namei.c), since it is causing the following brelse()
double-free warning when running fsstress on a file system with 1k
blocksize and we run into a block allocation failure while converting
a single-block directory to a multi-block hash-tree indexed directory.

WARNING: at fs/buffer.c:1197 __brelse+0x2e/0x33()
Hardware name: 
VFS: brelse: Trying to free free buffer
Modules linked in:
Pid: 2226, comm: jbd2/sdd-8 Not tainted 2.6.32-rc6-00577-g0003f55 #101
Call Trace:
 [<c01587fb>] warn_slowpath_common+0x65/0x95
 [<c0158869>] warn_slowpath_fmt+0x29/0x2c
 [<c021168e>] __brelse+0x2e/0x33
 [<c0288a9f>] jbd2_journal_refile_buffer+0x67/0x6c
 [<c028a9ed>] jbd2_journal_commit_transaction+0x319/0x14d8
 [<c0164d73>] ? try_to_del_timer_sync+0x58/0x60
 [<c0175bcc>] ? sched_clock_cpu+0x12a/0x13e
 [<c017f6b4>] ? trace_hardirqs_off+0xb/0xd
 [<c0175c1f>] ? cpu_clock+0x3f/0x5b
 [<c017f6ec>] ? lock_release_holdtime+0x36/0x137
 [<c0664ad0>] ? _spin_unlock_irqrestore+0x44/0x51
 [<c0180af3>] ? trace_hardirqs_on_caller+0x103/0x124
 [<c0180b1f>] ? trace_hardirqs_on+0xb/0xd
 [<c0164d73>] ? try_to_del_timer_sync+0x58/0x60
 [<c0290d1c>] kjournald2+0x11a/0x310
 [<c017118e>] ? autoremove_wake_function+0x0/0x38
 [<c0290c02>] ? kjournald2+0x0/0x310
 [<c0170ee6>] kthread+0x66/0x6b
 [<c0170e80>] ? kthread+0x0/0x6b
 [<c01251b3>] kernel_thread_helper+0x7/0x10
---[ end trace 5579351b86af61e3 ]---

Commit 6487a9d3 was an attempt to fix some buffer head leaks in an ENOSPC
error path, but in some cases it actually results in an excess brelse(),
as shown above.  Fixing this means moving the responsibility for
releasing the buffer heads from the callee to the caller of
add_dirent_to_buf().

Since that's a relatively complex change, and we're late in the rcX
development cycle, I'm reverting this now, and holding back a more
complete fix until after 2.6.32 ships.  We've lived with this
buffer_head leak on ENOSPC in ext3 and ext4 for a very long time; a
few more months won't kill us.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: Curt Wohlgemuth <curtw@google.com>

1e424a34

08 Nov, 2009 2 commits

nilfs2: fix missing cleanup of gc cache on error cases · c083234f

Committed by Ryusuke Konishi on Nov 08, 2009

This fixes an -rc1 regression brought by the commit:
1cf58fa8 ("nilfs2: shorten freeze
period due to GC in write operation v3").

Although the patch moved the call to nilfs_ioctl_move_blocks() from
nilfs_ioctl_prepare_clean_segments() to nilfs_ioctl_clean_segments(),
it didn't move the corresponding cleanup job needed for the error case.

This moves the missing cleanup job to the destination function.
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Acked-by: Jiro SEKIBA <jir@unicus.jp>

c083234f

nilfs2: fix kernel oops in error case of nilfs_ioctl_move_blocks · 5399dd1f

Committed by Ryusuke Konishi on Nov 07, 2009

This fixes a kernel oops reported by Markus Trippelsdorf in the email
titled "[NILFS users] kernel Oops while running nilfs_cleanerd".

The oops was caused by a bug in the error path of the
nilfs_ioctl_move_blocks() function, which was inlined in
nilfs_ioctl_clean_segments().

nilfs_ioctl_move_blocks() checks for duplicate blocks among those to be
moved in garbage collection.  However, the check should have been done
within nilfs_ioctl_move_inode_block() to prevent list corruption among
buffers storing the target blocks.

To fix the kernel oops, this moves the duplication check forward so
that it runs before the list insertion.

I also tested this for stable trees [2.6.30, 2.6.31].
Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@lab.ntt.co.jp>
Cc: stable <stable@kernel.org>

5399dd1f

07 Nov, 2009 2 commits

cifs: don't use CIFSGetSrvInodeNumber in is_path_accessible · f475f677

Committed by Jeff Layton on Nov 06, 2009

Because it's lighter weight, CIFS tries to use CIFSGetSrvInodeNumber to
verify the accessibility of the root inode and then falls back to doing a
full QPathInfo if that fails with -EOPNOTSUPP. I have at least one report
of a server that returns NT_STATUS_INTERNAL_ERROR rather than something
that translates to EOPNOTSUPP.

Rather than trying to be clever with that call, just have
is_path_accessible do a normal QPathInfo. That call is widely
supported and it shouldn't increase the overhead significantly.

Cc: Stable <stable@kernel.org>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>

f475f677

cifs: clean up handling when server doesn't consistently support inode numbers · ec06aedd

Committed by Jeff Layton on Nov 06, 2009

It's possible that a server will return a valid FileID when we query the
FILE_INTERNAL_INFO for the root inode, but then zeroed out inode numbers
when we do a FindFile with an infolevel of
SMB_FIND_FILE_ID_FULL_DIR_INFO.

In this situation, turn off querying for server inode numbers, generate a
warning for the user and just generate an inode number using iunique.
Once we generate any inode number with iunique, we can no longer use any
server inode numbers or we risk collisions, so ensure that we don't do
that in cifs_get_inode_info either.

Cc: Stable <stable@kernel.org>
Reported-by: Timothy Normand Miller <theosib@gmail.com>
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Steve French <sfrench@us.ibm.com>

ec06aedd
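
The one-way fallback described in the cifs inode number commit above can be sketched as follows. This is a simplified user-space model, not the CIFS code: `toy_sb` and `toy_pick_ino()` are hypothetical, and a plain counter stands in for iunique, which in the kernel also verifies that the generated number is not already in use.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-mount state. */
struct toy_sb {
    bool     use_server_inums; /* still trusting inode numbers from the server? */
    uint64_t next_local_ino;   /* counter standing in for iunique */
};

/*
 * Pick the inode number to use for an entry. A zeroed server number
 * switches the mount to locally generated numbers for good, so server
 * and local numbers are never mixed after that point.
 */
uint64_t toy_pick_ino(struct toy_sb *sb, uint64_t server_ino)
{
    if (sb->use_server_inums && server_ino == 0) {
        /* The real code would also warn the user at this point. */
        sb->use_server_inums = false;
    }

    if (sb->use_server_inums)
        return server_ino;

    return sb->next_local_ino++;
}
```

Once the flag is cleared it is never set again, which reflects the rule that server inode numbers must not be used after any number has been generated locally.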