Commits · 49317a7fdaa462b09b9bb4942b64c3a3316bd564 · openeuler / Kernel

04 Aug, 2014 10 commits

NFS: nfs4_lookup_revalidate: only evaluate parent if it will be used. · 49317a7f

Authored by NeilBrown on Jul 14, 2014

nfs4_lookup_revalidate only uses 'parent' to get 'dir', and only
uses 'dir' if 'inode == NULL'.

So we don't need to find out what 'parent' or 'dir' is until we
know that 'inode' is NULL.

By moving 'dget_parent' inside the 'if', we can reduce the number of
call sites for 'dput(parent)'.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

49317a7f

NFS: add checks for returned value of try_module_get() · 1f70ef96

Authored by Alexey Khoroshilov on Jul 18, 2014

There are a couple of places in client code where the return value
of try_module_get() is ignored. As a result there is a small chance
of premature module unload because of unbalanced refcounting.

The patch adds error handling in those places.

Found by Linux Driver Verification project (linuxtesting.org).
Signed-off-by: Alexey Khoroshilov <khoroshilov@ispras.ru>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

1f70ef96
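The refcount imbalance described in the commit above can be illustrated with a small userspace model. This is only a sketch: `struct fake_module`, `fake_try_module_get()` and `fake_module_put()` are hypothetical stand-ins, not the kernel's module API, and the model assumes that a failed "get" leaves the count untouched.

```c
#include <stdbool.h>

/* Hypothetical model of a module with a reference count. */
struct fake_module {
	int refcount;
	bool unloading;	/* once set, new references are refused */
};

/* Stand-in for try_module_get(): may fail, and then takes no reference. */
static bool fake_try_module_get(struct fake_module *m)
{
	if (m->unloading)
		return false;
	m->refcount++;
	return true;
}

static void fake_module_put(struct fake_module *m)
{
	m->refcount--;
}

/* Unchecked pattern: the put runs even when the get failed. */
static void use_unchecked(struct fake_module *m)
{
	(void)fake_try_module_get(m);
	fake_module_put(m);
}

/* Checked pattern: bail out early so get and put stay balanced. */
static int use_checked(struct fake_module *m)
{
	if (!fake_try_module_get(m))
		return -1;
	fake_module_put(m);
	return 0;
}
```

In this model, calling `use_unchecked()` on a module that refuses new references drops a reference that was never taken, while `use_checked()` leaves the count unchanged.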

nfs: clear_request_commit while holding i_lock · 411a99ad

Authored by Weston Andros Adamson on Jul 17, 2014

Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

411a99ad

pnfs: add pnfs_put_lseg_async · e6cf82d1

Authored by Weston Andros Adamson on Jul 17, 2014

This is useful when lsegs need to be released while holding locks.
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

e6cf82d1

pnfs: find swapped pages on pnfs commit lists too · 02d1426c

Authored by Weston Andros Adamson on Jul 17, 2014

nfs_page_find_head_request_locked looks through the regular nfs commit lists
when the page is swapped out, but doesn't look through the pnfs commit lists.

I'm not sure if anyone has hit any issues caused by this.
Suggested-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

02d1426c

nfs: fix comment and add warn_on for PG_INODE_REF · b412ddf0

Authored by Weston Andros Adamson on Jul 17, 2014

Fix the comment in nfs_page.h for PG_INODE_REF to reflect that it's no longer
set only on head requests. Also add a WARN_ON_ONCE in nfs_inode_remove_request
as PG_INODE_REF should always be set.
Suggested-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

b412ddf0

nfs: check wait_on_bit_lock err in page_group_lock · e7029206

Authored by Weston Andros Adamson on Jul 17, 2014

Return errors from wait_on_bit_lock from nfs_page_group_lock.

Add a bool argument @wait to nfs_page_group_lock. If true, loop over
wait_on_bit_lock until it returns cleanly. If false, return the error
from wait_on_bit_lock.
Signed-off-by: Weston Andros Adamson <dros@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

e7029206
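The `wait` semantics described in the commit above can be sketched in userspace as follows. Everything here is an assumption made for illustration: `fake_wait_on_bit_lock()` simulates an interruptible lock wait, `page_group_lock()` is a simplified model rather than the actual `nfs_page_group_lock()`, and `FAKE_ERESTARTSYS` mirrors the kernel-internal error value 512.

```c
#include <stdbool.h>

#define FAKE_ERESTARTSYS 512	/* defined locally for this sketch */

/* Number of times the simulated wait will still be interrupted. */
static int interruptions_left;

/* Hypothetical stand-in for wait_on_bit_lock(): 0 on success, negative on interruption. */
static int fake_wait_on_bit_lock(void)
{
	if (interruptions_left > 0) {
		interruptions_left--;
		return -FAKE_ERESTARTSYS;
	}
	return 0;
}

/*
 * Simplified model of the new behaviour:
 *  - wait == true:  retry until the lock is taken cleanly
 *  - wait == false: return the first result, error or not
 */
static int page_group_lock(bool wait)
{
	int ret;

	do {
		ret = fake_wait_on_bit_lock();
	} while (wait && ret != 0);

	return ret;
}
```

With two pending interruptions, a non-waiting call returns the error after one attempt, and a subsequent waiting call absorbs the remaining interruption and returns 0.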

NFS: nfs4_do_open should add negative results to the dcache. · 4fa2c54b

Authored by NeilBrown on Jul 21, 2014

If you have an NFSv4 mounted directory which does not contain 'foo'
and:

  ls -l foo
  ssh $server touch foo
  cat foo

then the 'cat' will fail (usually, depending a bit on the various
cache ages).  This is correct as negative lookups are cached by default.
However with the same initial conditions:

  cat foo
  ssh $server touch foo
  cat foo

will usually succeed.  This is because an "open" does not add a
negative dentry to the dcache, while a "lookup" does.

This can have negative performance effects.  When "gcc" searches for
an include file, it will try to "open" the file in every directory in
the search path.  Without caching of negative "open" results, this
generates much more traffic to the server than it should (or than
NFSv3 does).

The root of the problem is that _nfs4_open_and_get_state() will call
d_add_unique() on a positive result, but not on a negative result.
Compare with nfs_lookup() which calls d_materialise_unique on both
a positive result and on ENOENT.

This patch adds a call to d_add() in the ENOENT case for
_nfs4_open_and_get_state() and also calls nfs_set_verifier().

With it, many fewer "open" requests for known-non-existent files are
sent to the server.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

4fa2c54b
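The traffic effect described in the commit above can be modelled with a toy client-side name cache. This is a minimal sketch under stated assumptions: `struct client`, `server_open()` and `client_open()` are hypothetical, the "server" is a single `strcmp()`, and no cache expiry or revalidation is modelled. It only shows why remembering ENOENT results reduces round trips.

```c
#include <stdbool.h>
#include <string.h>

#define CACHE_SLOTS 8

struct cache_entry {
	char name[32];
	bool used;
	bool exists;	/* false marks a negative entry */
};

struct client {
	struct cache_entry cache[CACHE_SLOTS];
	int server_requests;	/* round trips to the simulated server */
	bool cache_negative;	/* true models the behaviour with the patch */
};

/* Simulated server: only "present.h" exists. */
static bool server_open(struct client *c, const char *name)
{
	c->server_requests++;
	return strcmp(name, "present.h") == 0;
}

static struct cache_entry *cache_find(struct client *c, const char *name)
{
	for (int i = 0; i < CACHE_SLOTS; i++)
		if (c->cache[i].used && strcmp(c->cache[i].name, name) == 0)
			return &c->cache[i];
	return NULL;
}

static void cache_add(struct client *c, const char *name, bool exists)
{
	for (int i = 0; i < CACHE_SLOTS; i++) {
		struct cache_entry *e = &c->cache[i];

		if (e->used)
			continue;
		strncpy(e->name, name, sizeof(e->name) - 1);
		e->name[sizeof(e->name) - 1] = '\0';
		e->used = true;
		e->exists = exists;
		return;
	}
	/* Cache full: this sketch simply skips caching. */
}

/* Returns true if the open succeeds. */
static bool client_open(struct client *c, const char *name)
{
	struct cache_entry *e = cache_find(c, name);

	if (e)
		return e->exists;	/* answered from the cache */

	bool exists = server_open(c, name);

	/* Without negative caching, ENOENT results are forgotten. */
	if (exists || c->cache_negative)
		cache_add(c, name, exists);
	return exists;
}
```

Opening the same missing name repeatedly costs one server request per attempt when `cache_negative` is false, and a single request when it is true, which is the include-path search pattern the commit message describes.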

nfs3_list_one_acl(): check get_acl() result with IS_ERR_OR_NULL · 7a9e75a1

Authored by Andrey Utkin on Jul 26, 2014

There was a check for result being not NULL. But get_acl() may return
NULL, or ERR_PTR, or actual pointer.
The purpose of the function where current change is done is to "list
ACLs only when they are available", so any error condition of get_acl()
mustn't be elevated, and returning 0 there is still valid.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=81111
Signed-off-by: Andrey Utkin <andrey.krieger.utkin@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Fixes: 74adf83f (nfs: only show Posix ACLs in listxattr if actually...)
Cc: stable@vger.kernel.org # 3.14+
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

7a9e75a1
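The three possible results mentioned in the commit above (NULL, an encoded error, or a real pointer) can be sketched in userspace. The macros below are simplified re-implementations modelled on the kernel's `include/linux/err.h` helpers; `fake_get_acl()` and `list_one_acl()` are hypothetical stand-ins and not the NFS code itself.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified userspace versions of the kernel's error-pointer helpers. */
#define MAX_ERRNO 4095

static inline void *ERR_PTR(long error)
{
	return (void *)(intptr_t)error;
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* Error pointers occupy the top MAX_ERRNO values of the address space. */
static inline bool IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

static inline bool IS_ERR_OR_NULL(const void *ptr)
{
	return ptr == NULL || IS_ERR(ptr);
}

struct acl {
	int entries;
};

static struct acl sample_acl = { 3 };

/* Hypothetical stand-in for get_acl(); mode selects the outcome. */
static struct acl *fake_get_acl(int mode)
{
	if (mode == 0)
		return NULL;		/* no ACL present */
	if (mode == 1)
		return ERR_PTR(-EIO);	/* lookup failed */
	return &sample_acl;		/* valid ACL */
}

/* Lists the ACL only when it is available; errors are reported as "nothing to list". */
static int list_one_acl(int mode)
{
	struct acl *acl = fake_get_acl(mode);

	if (IS_ERR_OR_NULL(acl))
		return 0;
	return acl->entries;
}
```

A plain `acl != NULL` test would let the `ERR_PTR(-EIO)` case through and dereference an invalid address; the combined check treats both non-results the same way.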

NFS: Enforce an upper limit on the number of cached access call · 3a505845

Authored by Trond Myklebust on Jul 21, 2014

This may be used to limit the number of cached credentials building up
inside the access cache.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

3a505845

20 Jul, 2014 2 commits

btrfs: test for valid bdev before kobj removal in btrfs_rm_device · 0bfaa9c5

Authored by Eric Sandeen on Jul 07, 2014

commit 99994cde btrfs: dev delete should remove sysfs entry
added a btrfs_kobj_rm_device, which dereferences device->bdev...
right after we check whether device->bdev might be NULL.

I don't honestly know if it's possible to have a NULL device->bdev
here, but assuming that it is (given the test), we need to move
the kobject removal to be under that test.

(Coverity spotted this)
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Chris Mason <clm@fb.com>

0bfaa9c5

Btrfs: fix abnormal long waiting in fsync · 98ce2ded

Authored by Liu Bo on Jul 17, 2014

xfstests generic/127 detected this problem.

With commit 7fc34a62, now fsync will only flush
data within the passed range.  This is the cause of the above problem,
-- btrfs's fsync has a stage called 'sync log' which will wait for all the
ordered extents it has recorded to finish.

In xfstests/generic/127, with mixed operations such as truncate, fallocate,
punch hole, and mapwrite, we get some pre-allocated extents, and mapwrite will
mmap, and then msync.  And I find that msync will wait for quite a long time
(about 20s in my case), thanks to ftrace, it turns out that the previous
fallocate calls 'btrfs_wait_ordered_range()' to flush dirty pages, but as the
range of dirty pages may be larger than 'btrfs_wait_ordered_range()' wants,
there can be some ordered extents created but not getting corresponding pages
flushed, then they're left in memory until we fsync which runs into the
stage 'sync log', and fsync will just wait for the system writeback thread
to flush those pages and get ordered extents finished, so the latency is
inevitable.

This adds a flush similar to btrfs_start_ordered_extent() in
btrfs_wait_logged_extents() to fix that.
Reviewed-by: Miao Xie <miaox@cn.fujitsu.com>
Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
Signed-off-by: Chris Mason <clm@fb.com>

98ce2ded

18 Jul, 2014 8 commits

GFS2: fs/gfs2/rgrp.c: kernel-doc warning fixes · 27ff6a0f

Authored by Fabian Frederick on Jul 02, 2014

Cc: cluster-devel@redhat.com
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>

27ff6a0f

GFS2: memcontrol: Spelling s/invlidate/invalidate/ · 6b49d1d9

Authored by Geert Uytterhoeven on Jun 29, 2014

Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: cluster-devel@redhat.com
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>

6b49d1d9

GFS2: Allow caching of glocks for flock · 97a4f1d7

Authored by Bob Peterson on Jun 26, 2014

This patch removes the GLF_NOCACHE flag from the glocks associated with
flocks. There should be no good reason not to cache glocks for flocks:
they only force the glock to be demoted before they can be reacquired,
which can slow down performance and even cause glock hangs, especially
in cases where the flocks are held in Shared (SH) mode.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>

97a4f1d7

GFS2: Allow flocks to use normal glock dq rather than dq_wait · 5bef3e7c

Authored by Bob Peterson on Jun 26, 2014

This patch allows flock glocks to use a non-blocking dequeue rather
than dq_wait. It also reverts the previous patch I had posted regarding
dq_wait. The reverted patch isn't necessarily a bad idea, but I decided
this might avoid unforeseen side effects, and was therefore safer.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>

5bef3e7c

GFS2: replace count*size kzalloc by kcalloc · 6ec43b18

Authored by Fabian Frederick on Jun 25, 2014

kcalloc manages count*sizeof overflow.

Cc: cluster-devel@redhat.com
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>

6ec43b18
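The overflow concern behind the kzalloc-to-kcalloc conversion above can be shown with a userspace analogue. This is a sketch only: `checked_calloc()` is a hypothetical helper that mimics the overflow check, and `unchecked_bytes()` shows the open-coded product it replaces; neither is kernel code.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Open-coded size computation: the product can wrap around silently. */
static size_t unchecked_bytes(size_t count, size_t size)
{
	return count * size;
}

/*
 * Hypothetical userspace analogue of kcalloc(): returns zeroed memory
 * for count elements, or NULL if count * size would overflow size_t.
 */
static void *checked_calloc(size_t count, size_t size)
{
	void *p;

	if (size != 0 && count > SIZE_MAX / size)
		return NULL;	/* refuse instead of under-allocating */

	p = malloc(count * size);
	if (p)
		memset(p, 0, count * size);
	return p;
}
```

With a `count` of `SIZE_MAX / 2 + 1` and a `size` of 4, the open-coded product wraps to 0, so an unchecked allocation would succeed with far too little memory, whereas the checked helper fails cleanly.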

GFS2: Use GFP_NOFS when allocating glocks · fe0bbd29

Authored by Steven Whitehouse on Jun 23, 2014

Normally GFP_KERNEL is ok here, but there is now a rarely used code path
relating to deallocation of unlinked inodes (in certain corner cases)
which if hit at times of memory shortage can cause recursion while
trying to free memory.

One solution would be to try and move the gfs2_glock_get() call so
that it is no longer called while another glock is held, but that
doesn't look at all easy, so GFP_NOFS is the best solution for the
time being.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>

fe0bbd29

GFS2: Fix race in glock lru glock disposal · 94a09a39

Authored by Steven Whitehouse on Jun 23, 2014

We must not leave items on the LRU list with GLF_LOCK set, since
they can be removed if the glock is brought back into use, which
may then potentially result in a hang, waiting for GLF_LOCK to
clear.

It doesn't happen very often, since it requires a glock that has
not been used for a long time to be brought back into use at the
same moment that the shrinker is part way through disposing of
glocks.

The fix is to set GLF_LOCK at a later time, when we already know
that the other locks can be obtained. Also, we now only release
the lru_lock in case a resched is needed, rather than on every
iteration.
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>

94a09a39

GFS2: Only wait for demote when last holder is dequeued · 79272b35

Authored by Bob Peterson on Jun 20, 2014

Function gfs2_glock_dq_wait is supposed to dequeue a glock and then
wait for the lock to be demoted. The problem is, if this is a shared
lock, its demote will depend on the other holders, which means you
might end up waiting forever because the other process is blocked.
This problem is especially apparent when dealing with nested flocks.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Steven Whitehouse <swhiteho@redhat.com>

79272b35

16 Jul, 2014 1 commit

quota: missing lock in dqcache_shrink_scan() · d68aab6b

Authored by Niu Yawei on Jun 04, 2014

Commit 1ab6c499 (fs: convert fs shrinkers to new scan/count API)
accidentally removed locking from quota shrinker. Fix it -
dqcache_shrink_scan() should use dq_list_lock to protect the
scan on free_dquots list.

CC: stable@vger.kernel.org
Fixes: 1ab6c499
Signed-off-by: Niu Yawei <yawei.niu@intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>

d68aab6b

15 Jul, 2014 4 commits

xfs: null unused quota inodes when quota is on · 03e01349

Authored by Dave Chinner on Jul 15, 2014

When quota is on, it is expected that unused quota inodes have a
value of NULLFSINO. The changes to support a separate project quota
in 3.12 broke this rule for non-project quota inode enabled
filesystems, as the code now refuses to write the group quota inode
if neither group nor project quotas are enabled. This regression was
introduced by commit d892d586 ("xfs: Start using pquotaino from the
superblock").

In this case, we should be writing NULLFSINO rather than nothing to
ensure that we leave the group quota inode in a valid state while
quotas are enabled.

Failure to do so doesn't cause a current kernel to break - the
separate project quota inodes introduced translation code to always
treat a zero inode as NULLFSINO. This was introduced by commit
01026297 ("xfs: Initialize all quota inodes to be NULLFSINO") which is
also in 3.12 but older kernels do not do this and hence taking a
filesystem back to an older kernel can result in quotas failing
initialisation at mount time. When that happens, we see this in
dmesg:

[ 1649.215390] XFS (sdb): Mounting Filesystem
[ 1649.316894] XFS (sdb): Failed to initialize disk quotas.
[ 1649.316902] XFS (sdb): Ending clean mount

By ensuring that we write NULLFSINO to quota inodes that aren't
active, we avoid this problem. We have to be really careful when
determining if the quota inodes are active or not, because we don't
want to write a NULLFSINO if the quota inodes are active and we
simply aren't updating them.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

03e01349

xfs: refine the allocation stack switch · cf11da9c

Authored by Dave Chinner on Jul 15, 2014

The allocation stack switch at xfs_bmapi_allocate() has served its
purpose, but is no longer a sufficient solution to the stack usage
problem we have in the XFS allocation path.

Whilst the kernel stack size is now 16k, that is not a valid reason
for undoing all our "keep stack usage down" modifications. What it
does allow us to do is have the freedom to refine and perfect the
modifications knowing that if we get it wrong it won't blow up in
our faces - we have a safety net now.

This is important because we still have the issue of older kernels
having smaller stacks and that they are still supported and are
demonstrating a wide range of different stack overflows.  Red Hat
has several open bugs for allocation based stack overflows from
directory modifications and direct IO block allocation and these
problems still need to be solved. If we can solve them upstream,
then distros won't need to bake their own unique solutions.

To that end, I've observed that every allocation based stack
overflow report has had a specific characteristic - it has happened
during or directly after a bmap btree block split. That event
requires a new block to be allocated to the tree, and so we
effectively stack one allocation stack on top of another, and that's
when we get into trouble.

A further observation is that bmap btree block splits are much rarer
than writeback allocation - over a range of different workloads I've
observed the ratio of bmap btree inserts to splits ranges from 100:1
(xfstests run) to 10000:1 (local VM image server with sparse files
that range in the hundreds of thousands to millions of extents).
Either way, bmap btree split events are much, much rarer than
allocation events.

Finally, we have to move the kswapd state to the allocation workqueue
work when allocation is done on behalf of kswapd. This is proving to
cause significant perturbation in performance under memory pressure
and appears to be generating allocation deadlock warnings under some
workloads, so avoiding the use of a workqueue for the majority of
kswapd writeback allocation will minimise the impact of such
behaviour.

Hence it makes sense to move the stack switch to xfs_btree_split()
and only do it for bmap btree splits. Stack switches during
allocation will be much rarer, so there won't be significant
performance overhead caused by switching stacks. The worst case
stack from all allocation paths will be split, not just writeback.
And the majority of memory allocations will be done in the correct
context (e.g. kswapd) without causing additional latency, and so we
simplify the memory reclaim interactions between processes,
workqueues and kswapd.

The worst stack I've been able to generate with this patch in place
is 5600 bytes deep. It's very revealing because we exit XFS at:

37)     1768      64   kmem_cache_alloc+0x13b/0x170

about 1800 bytes of stack consumed, and the remaining 3800 bytes
(and 36 functions) is memory reclaim, swap and the IO stack. And
this occurs in the inode allocation from an open(O_CREAT) syscall,
not writeback.

The amount of stack being used is much less than I've previously been
able to generate - fs_mark testing has been able to generate stack
usage of around 7k without too much trouble; with this patch it's
only just getting to 5.5k. This is primarily because the metadata
allocation paths (e.g. directory blocks) are no longer causing
double splits on the same stack, and hence now stack tracing is
showing swapping being the worst stack consumer rather than XFS.

Performance of fs_mark inode create workloads is unchanged.
Performance of fs_mark async fsync workloads is consistently good
with context switches reduced by around 150,000/s (30%).
Performance of dbench, streaming IO and postmark is unchanged.
Allocation deadlock warnings have not been seen on the workloads
that generated them since adding this patch.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

cf11da9c

Revert "xfs: block allocation work needs to be kswapd aware" · aa182e64

Authored by Dave Chinner on Jul 15, 2014

This reverts commit 1f6d6482.

This commit resulted in regressions in performance in low
memory situations where kswapd was doing writeback of delayed
allocation blocks. It resulted in significant parallelism of the
kswapd work and with the special kswapd flags meant that hundreds of
active allocations could dip into kswapd specific memory reserves and
avoid being throttled. This caused a large amount of performance
variation, as well as random OOM-killer invocations that didn't
previously exist.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>

aa182e64

aio: protect reqs_available updates from changes in interrupt handlers · 263782c1

Authored by Benjamin LaHaise on Jul 14, 2014

As of commit f8567a38 it is now possible to
have put_reqs_available() called from irq context.  While put_reqs_available()
is per cpu, it did not protect itself from interrupts on the same CPU.  This
led to aio_complete() corrupting the available io requests count when run
under heavy O_DIRECT workloads as reported by Robert Elliott.  Fix this by
disabling irq updates around the per cpu batch updates of reqs_available.

Many thanks to Robert and folks for testing and tracking this down.
Reported-by: Robert Elliot <Elliott@hp.com>
Tested-by: Robert Elliot <Elliott@hp.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Cc: Jens Axboe <axboe@kernel.dk>, Christoph Hellwig <hch@infradead.org>
Cc: stable@vger.kenel.org

263782c1

14 Jul, 2014 3 commits

fuse: replace count*size kzalloc by kcalloc · f2b3455e

Authored by Fabian Frederick on Jun 23, 2014

kcalloc manages count*sizeof overflow.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>

f2b3455e

fuse: release temporary page if fuse_writepage_locked() failed · 27f1b363

Authored by Maxim Patlasov on Jul 10, 2014

tmp_page should be freed if fuse_write_file_get() returns NULL.
Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>

27f1b363

NFS: Don't reset pg_moreio in __nfs_pageio_add_request · f563b89b

Authored by Trond Myklebust on Jul 13, 2014

Once we've started sending unstable NFS writes, we do not want to
clear pg_moreio, or we may end up sending the very last request as
a stable write if the commit lists are still empty.

Do, however, reset pg_moreio in the case where we end up having to
recoalesce the write if an attempt to use pNFS failed.
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

f563b89b

13 Jul, 2014 12 commits

NFS: use ARRAY_SIZE instead of sizeof/sizeof[0] · 00216026

Authored by Fabian Frederick on Jun 30, 2014

Use macro definition

Cc: Trond Myklebust <trond.myklebust@primarydata.com>
Cc: linux-nfs@vger.kernel.org
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

00216026
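For readers unfamiliar with the idiom in the commit above, the following sketch contrasts the open-coded element count with the macro. The macro here is a simplified userspace form; the kernel's `ARRAY_SIZE` additionally rejects non-array arguments at compile time. The `versions` table is a hypothetical example.

```c
#include <stddef.h>

/* Simplified form of the kernel macro, without the compile-time array check. */
#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))

/* Hypothetical lookup table used only for illustration. */
static const char *const versions[] = { "v2", "v3", "v4" };

/* Before: element count spelled out by hand. */
static size_t count_open_coded(void)
{
	return sizeof(versions) / sizeof(versions[0]);
}

/* After: the same value, with the intent stated by the macro name. */
static size_t count_with_macro(void)
{
	return ARRAY_SIZE(versions);
}
```

Both functions evaluate to the same constant; the change is about readability and about having one audited definition of the idiom.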

NFSv4: Drop cast · 8ee2b78a

Authored by Himangi Saraogi on Jun 27, 2014

This patch does away with the cast on void * as it is unnecessary.

The following Coccinelle semantic patch was used for making the change:

@r@
expression x;
void* e;
type T;
identifier f;
@@

(
  *((T *)e)
|
  ((T *)x)[...]
|
  ((T *)x)->f
|
- (T *)
  e
)
Signed-off-by: Himangi Saraogi <himangi774@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

8ee2b78a
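The last branch of the semantic patch above removes a cast applied directly to a `void *` expression. A minimal illustration in plain C, using a hypothetical `struct call_data`, shows why the cast is redundant: C converts `void *` to any object pointer type implicitly.

```c
/* Hypothetical payload type, used only for illustration. */
struct call_data {
	int status;
};

/* Before: explicit cast from void *, which C does not require. */
static int with_cast(void *data)
{
	struct call_data *d = (struct call_data *)data;

	return d->status;
}

/* After: the implicit conversion yields exactly the same pointer. */
static int without_cast(void *data)
{
	struct call_data *d = data;

	return d->status;
}
```

This holds for C only; C++ would require the cast, which is one reason the pattern sometimes appears in code written by people used to both languages.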

fs/nfs_common/nfsacl.c: move EXPORT symbol after functions · 57b696fb

Authored by Fabian Frederick on May 28, 2014

Fix checkpatch warnings:

"WARNING: EXPORT_SYMBOL(foo); should immediately follow its function/variable"

Cc: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

57b696fb

nfs4: copy acceptor name from context to nfs_client · f11b2a1c

Authored by Jeff Layton on Jun 21, 2014

The current CB_COMPOUND handling code tries to compare the principal
name of the request with the cl_hostname in the client. This is not
guaranteed to ever work, particularly if the client happened to mount
a CNAME of the server or a non-fqdn.

Fix this by instead comparing the cr_principal string with the acceptor
name that we get from gssd. In the event that gssd didn't send one
down (i.e. it was too old), then we fall back to trying to use the
cl_hostname as we do today.
Signed-off-by: Jeff Layton <jlayton@poochiereds.net>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

f11b2a1c

nfs4: turn free_lock_state into a void return operation · f1cdae87

Authored by Jeff Layton on May 01, 2014

Nothing checks its return value.
Signed-off-by: Jeff Layton <jlayton@poochiereds.net>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

f1cdae87

nfs4: queue free_lock_state job submission to nfsiod · 49a4bda2

Authored by Jeff Layton on May 01, 2014

We got a report of the following warning in Fedora:

BUG: sleeping function called from invalid context at mm/slub.c:969
in_atomic(): 1, irqs_disabled(): 0, pid: 533, name: bash
3 locks held by bash/533:
 #0:  (&sp->so_delegreturn_mutex){+.+...}, at: [<ffffffffa033da62>] nfs4_proc_lock+0x262/0x910 [nfsv4]
 #1:  (&nfsi->rwsem){.+.+.+}, at: [<ffffffffa033da6a>] nfs4_proc_lock+0x26a/0x910 [nfsv4]
 #2:  (&sb->s_type->i_lock_key#23){+.+...}, at: [<ffffffff812998dc>] flock_lock_file_wait+0x8c/0x3a0
CPU: 0 PID: 533 Comm: bash Not tainted 3.15.0-0.rc1.git1.1.fc21.x86_64 #1
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
 0000000000000000 00000000d664ff3c ffff880078b69a70 ffffffff817e82e0
 0000000000000000 ffff880078b69a98 ffffffff810cf1a4 0000000000000050
 0000000000000050 ffff88007cc01a00 ffff880078b69ad8 ffffffff8121449e
Call Trace:
 [<ffffffff817e82e0>] dump_stack+0x4d/0x66
 [<ffffffff810cf1a4>] __might_sleep+0x184/0x240
 [<ffffffff8121449e>] kmem_cache_alloc_trace+0x4e/0x330
 [<ffffffffa0331124>] ? nfs4_release_lockowner+0x74/0x110 [nfsv4]
 [<ffffffffa0331124>] nfs4_release_lockowner+0x74/0x110 [nfsv4]
 [<ffffffffa0352340>] nfs4_put_lock_state+0x90/0xb0 [nfsv4]
 [<ffffffffa0352375>] nfs4_fl_release_lock+0x15/0x20 [nfsv4]
 [<ffffffff81297515>] locks_free_lock+0x45/0x90
 [<ffffffff8129996c>] flock_lock_file_wait+0x11c/0x3a0
 [<ffffffffa033da6a>] ? nfs4_proc_lock+0x26a/0x910 [nfsv4]
 [<ffffffffa033301e>] do_vfs_lock+0x1e/0x30 [nfsv4]
 [<ffffffffa033da79>] nfs4_proc_lock+0x279/0x910 [nfsv4]
 [<ffffffff810dbb26>] ? local_clock+0x16/0x30
 [<ffffffff810f5a3f>] ? lock_release_holdtime.part.28+0xf/0x200
 [<ffffffffa02f820c>] do_unlk+0x8c/0xc0 [nfs]
 [<ffffffffa02f85c5>] nfs_flock+0xa5/0xf0 [nfs]
 [<ffffffff8129a6f6>] locks_remove_file+0xb6/0x1e0
 [<ffffffff812159d8>] ? kfree+0xd8/0x2d0
 [<ffffffff8123bc63>] __fput+0xd3/0x210
 [<ffffffff8123bdee>] ____fput+0xe/0x10
 [<ffffffff810bfb6d>] task_work_run+0xcd/0xf0
 [<ffffffff81019cd1>] do_notify_resume+0x61/0x90
 [<ffffffff817fbea2>] int_signal+0x12/0x17

The problem is that NFSv4 is trying to do an allocation from
fl_release_private (in order to send a RELEASE_LOCKOWNER call). That
function can be called while holding the inode->i_lock, and it's
currently set up to do __GFP_WAIT allocations. v4.1 code has a
similar problem.

This patch adds a work_struct to the nfs4_lock_state and has the code
queue the free_lock_state operation to nfsiod.
Reported-by: Josh Stone <jistone@redhat.com>
Signed-off-by: Jeff Layton <jlayton@poochiereds.net>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

49a4bda2

nfs4: treat lock owners as opaque values · 8003d3c4

Authored by Jeff Layton on May 01, 2014

Do the following set of ops with a file on a NFSv4 mount:

    exec 3>>/file/on/nfsv4
    flock -x 3
    exec 3>&-

You'll see the LOCK request go across the wire, but no LOCKU when the
file is closed.

What happens is that the fd is passed across a fork, and the final close
is done in a different process than the opener. That makes
__nfs4_find_lock_state miss finding the correct lock state because it
uses the fl_pid as a search key. A new one is created, and the locking
code treats it as a delegation stateid (because NFS_LOCK_INITIALIZED
isn't set).

The root cause of this breakage seems to be commit 77041ed9
(NFSv4: Ensure the lockowners are labelled using the fl_owner and/or
fl_pid).

That changed it so that flock lockowners are allocated based on the
fl_pid. I think this is incorrect. flock locks should be "owned" by the
struct file, and that is already accounted for in the fl_owner field of
the lock request when it comes through nfs_flock.

This patch basically reverts the above commit and with it, a LOCKU is
sent in the above reproducer.
Signed-off-by: Jeff Layton <jlayton@poochiereds.net>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

8003d3c4

nfs41: layout return on close in delegation return · 039b756a

Authored by Peng Tao on Jul 03, 2014

If the file is not opened by anyone, we do layout return on close
in delegation return.
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

039b756a

nfs41: return layout on last close · fe08c546

Authored by Peng Tao on Jul 03, 2014

If the client has a valid delegation, do not return layout on close at all.
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

fe08c546

nfs4: add nfs4_check_delegation · 15bb3afe

Authored by Peng Tao on Jul 03, 2014

Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

15bb3afe

pnfs/filelayout: retry ds commit if nfs_commitdata_alloc fails · 0b0bc6ea

Authored by Peng Tao on Jul 03, 2014

Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

0b0bc6ea

pnfs/filelayout: fix race between mark_request_commit and scan_commit_lists · c8a3292d

Authored by Peng Tao on Jul 03, 2014

We need to hold cinfo lock while setting bucket->wlseg and adding req to nwritten
list at the same time. Otherwise there might be a window where nwritten list
is empty yet we set bucket->wlseg, in which case ff_layout_scan_ds_commit_list()
may end up clearing bucket->wlseg incorrectly, causing the client to oops later on.

This was found when testing flexfile layout but filelayout has the same problem.
Signed-off-by: Peng Tao <tao.peng@primarydata.com>
Signed-off-by: Tom Haynes <Thomas.Haynes@primarydata.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>

c8a3292d

openeuler / Kernel: last synced successfully more than 1 year ago