提交 · 3c87b9b7c05d7775a3d942de588296025023c6d2 · openeuler / raspberrypi-kernel

09 7月, 2014 19 次提交

nfsd: lock owners are not per open stateid · 3c87b9b7

由 Trond Myklebust 提交于 6月 30, 2014

In the NFSv4 spec, lock stateids are per-file objects. Lockowners are not.
This patch replaces the current list of lock owners in the open stateids
with a list of lock stateids.
Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>

3c87b9b7

nfsd: clean up nfsd4_close_open_stateid · acf9295b

由 Trond Myklebust 提交于 6月 30, 2014

Minor cleanup that should introduce no behavioral changes.

Currently this function just unhashes the stateid and leaves the caller
to do the work of the CLOSE processing.

Change nfsd4_close_open_stateid so that it handles doing all of the work
of closing a stateid. Move the handling of the unhashed stateid into it
instead of doing that work in nfsd4_close. This will help isolate some
coming changes to stateid handling from nfsd4_close.
Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>

acf9295b

nfsd: declare v4.1+ openowners confirmed on creation · db24b3b4

由 Jeff Layton 提交于 6月 30, 2014

There's no need to confirm an openowner in v4.1 and above, so we can
go ahead and set NFS4_OO_CONFIRMED when we create openowners in
those versions. This will also be necessary when we remove the
client_mutex, as it'll be possible for two concurrent opens to race
in versions >4.0.
Signed-off-by: NJeff Layton <jlayton@primarydata.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>

db24b3b4

nfsd: Cleanup nfs4svc_encode_compoundres · b607664e

由 Trond Myklebust 提交于 6月 30, 2014

Move the slot return, put session etc into a helper in fs/nfsd/nfs4state.c
instead of open coding in nfs4svc_encode_compoundres.
Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>

b607664e

nfsd: nfs4_preprocess_seqid_op should only set *stpp on success · e17f99b7

由 Trond Myklebust 提交于 6月 30, 2014

Not technically a bugfix, since nothing tries to use the return pointer
if this function doesn't return success, but it could be a problem
with some coming changes.
Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: NJeff Layton <jlayton@primarydata.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>

e17f99b7

nfsd: add a new /proc/fs/nfsd/max_connections file · 5b8db00b

由 Jeff Layton 提交于 7月 02, 2014

Currently, the maximum number of connections that nfsd will allow
is based on the number of threads spawned. While this is fine for a
default, there really isn't a clear relationship between the two.

The number of threads corresponds to the number of concurrent requests
that we want to allow the server to process at any given time. The
connection limit corresponds to the maximum number of clients that we
want to allow the server to handle. These are two entirely different
quantities.

Break the dependency on increasing threads in order to allow for more
connections, by adding a new per-net parameter that can be set to a
non-zero value. The default is still to base it on the number of threads,
so there should be no behavior change for anyone who doesn't use it.

Cc: Trond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: NJeff Layton <jlayton@primarydata.com>
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>

5b8db00b

nfsd: Ensure that nfsd_create_setattr commits files to stable storage · 0f3a24b4

由 Trond Myklebust 提交于 7月 01, 2014

Since nfsd_create_setattr strips the mode from the struct iattr, it
is quite possible that it will optimise away the call to nfsd_setattr
altogether.
If this is the case, then we never call commit_metadata() on the
newly created file.

Also ensure that both nfsd_setattr() and nfsd_create_setattr() fail
when the call to commit_metadata fails.
Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>

0f3a24b4

NFSD: Remove iattr parameter from nfsd_symlink() · 1e444f5b

由 Kinglong Mee 提交于 7月 01, 2014

Commit db2e747b (vfs: remove mode parameter from vfs_symlink())
have remove mode parameter from vfs_symlink.
So that, iattr isn't needed by nfsd_symlink now, just remove it.
Signed-off-by: NKinglong Mee <kinglongmee@gmail.com>
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>

1e444f5b

nfsd: Protect addition to the file_hashtbl · 950e0118

由 Trond Myklebust 提交于 6月 30, 2014

Current code depends on the client_mutex to guarantee a single struct
nfs4_file per inode in the file_hashtbl and make addition atomic with
respect to lookup.  Rely instead on the state_Lock, to make it easier to
stop taking the client_mutex here later.

To prevent an i_lock/state_lock inversion, change nfsd4_init_file to
use ihold instead if igrab. That's also more efficient anyway as we
definitely hold a reference to the inode at that point.
Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>
Signed-off-by: NJeff Layton <jlayton@primarydata.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>

950e0118

nfsd: fix file access refcount leak when nfsd4_truncate fails · 7e6a72e5

由 Christoph Hellwig 提交于 6月 30, 2014

nfsd4_process_open2 will currently will get access to the file, and then
call nfsd4_truncate to (possibly) truncate it. If that operation fails
though, then the access references will never be released as the
nfs4_ol_stateid is never initialized.

Fix by moving the nfsd4_truncate call into nfs4_get_vfs_file, ensuring
that the refcounts are properly put if the truncate fails.
Signed-off-by: NJeff Layton <jlayton@primarydata.com>
Signed-off-by: NChristoph Hellwig <hch@infradead.org>
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>

7e6a72e5

NFSD: Avoid warning message when compile at i686 arch · 1055414f

由 Kinglong Mee 提交于 6月 29, 2014

fs/nfsd/nfs4xdr.c: In function 'nfsd4_encode_readv':
>> fs/nfsd/nfs4xdr.c:3137:148: warning: comparison of distinct pointer types lacks a cast [enabled by default]
thislen = min(len, ((void *)xdr->end - (void *)xdr->p));
Reported-by: NFengguang Wu <fengguang.wu@intel.com>
Signed-off-by: NKinglong Mee <kinglongmee@gmail.com>
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>

1055414f

nfsd4: replace defer_free by svcxdr_tmpalloc · d5e23383

由 J. Bruce Fields 提交于 6月 24, 2014

Avoid an extra allocation for the tmpbuf struct itself, and stop
ignoring some allocation failures.
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>

d5e23383

nfsd4: remove nfs4_acl_new · bcaab953

由 J. Bruce Fields 提交于 6月 24, 2014

This is a not-that-useful kmalloc wrapper.  And I'd like one of the
callers to actually use something other than kmalloc.
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>

bcaab953

J
nfsd4: define svcxdr_dupstr to share some common code · 29c353b3
由 J. Bruce Fields 提交于 6月 24, 2014
```
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
```
29c353b3

nfsd4: remove unused defer_free argument · ce043ac8

由 J. Bruce Fields 提交于 6月 24, 2014

28e05dd8 "knfsd: nfsd4: represent nfsv4 acl with array instead of
linked list" removed the last user that wanted a custom free function.
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>

ce043ac8

nfsd4: rename cr_linkname->cr_data · 7fb84306

由 J. Bruce Fields 提交于 6月 24, 2014

The name of a link is currently stored in cr_name and cr_namelen, and
the content in cr_linkname and cr_linklen.  That's confusing.
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>

7fb84306

nfsd: let nfsd_symlink assume null-terminated data · 52ee0433

由 J. Bruce Fields 提交于 6月 20, 2014

Currently nfsd_symlink has a weird hack to serve callers who don't
null-terminate symlink data: it looks ahead at the next byte to see if
it's zero, and copies it to a new buffer to null-terminate if not.

That means callers don't have to null-terminate, but they *do* have to
ensure that the byte following the end of the data is theirs to read.

That's a bit subtle, and the NFSv4 code actually got this wrong.

So let's just throw out that code and let callers pass null-terminated
strings; we've already fixed them to do that.
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>

52ee0433

nfsd: make NFSv2 null terminate symlink data · 0aeae33f

由 J. Bruce Fields 提交于 6月 20, 2014

It's simple enough for NFSv2 to null-terminate the symlink data.

A bit weird (it depends on knowing that we've already read the following
byte, which is either padding or part of the mode), but no worse than
the conditional kstrdup it otherwise relies on in nfsd_symlink().
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>

0aeae33f

nfsd: fix rare symlink decoding bug · b829e919

由 J. Bruce Fields 提交于 6月 19, 2014

An NFS operation that creates a new symlink includes the symlink data,
which is xdr-encoded as a length followed by the data plus 0 to 3 bytes
of zero-padding as required to reach a 4-byte boundary.

The vfs, on the other hand, wants null-terminated data.

The simple way to handle this would be by copying the data into a newly
allocated buffer with space for the final null.

The current nfsd_symlink code tries to be more clever by skipping that
step in the (likely) case where the byte following the string is already
0.

But that assumes that the byte following the string is ours to look at.
In fact, it might be the first byte of a page that we can't read, or of
some object that another task might modify.

Worse, the NFSv4 code tries to fix the problem by actually writing to
that byte.

In the NFSv2/v3 cases this actually appears to be safe:

	- nfs3svc_decode_symlinkargs explicitly null-terminates the data
	  (after first checking its length and copying it to a new
	  page).
	- NFSv2 limits symlinks to 1k.  The buffer holding the rpc
	  request is always at least a page, and the link data (and
	  previous fields) have maximum lengths that prevent the request
	  from reaching the end of a page.

In the NFSv4 case the CREATE op is potentially just one part of a long
compound so can end up on the end of a page if you're unlucky.

The minimal fix here is to copy and null-terminate in the NFSv4 case.
The nfsd_symlink() interface here seems too fragile, though.  It should
really either do the copy itself every time or just require a
null-terminated string.
Reported-by: NJeff Layton <jlayton@primarydata.com>
Cc: stable@vger.kernel.org
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>

b829e919

23 6月, 2014 10 次提交

nfsd: properly handle embedded newlines in fault_injection input · d4c8e34f

由 Jeff Layton 提交于 6月 18, 2014

Currently rpc_pton() fails to handle the case where you echo an address
into the file, as it barfs on the newline. Ensure that we NULL out the
first occurrence of any newline.
Signed-off-by: NJeff Layton <jlayton@primarydata.com>
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>

d4c8e34f

nfsd: fix return of nfs4_acl_write_who · f7ce5d28

由 Jeff Layton 提交于 6月 17, 2014

AFAICT, the only way to hit this error is to pass this function a bogus
"who" value. In that case, we probably don't want to return -1 as that
could get sent back to the client. Turn this into nfserr_serverfault,
which is a more appropriate error for a server bug like this.
Signed-off-by: NJeff Layton <jlayton@primarydata.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>

f7ce5d28

nfsd: add appropriate __force directives to filehandle generation code · 94ec938b

由 Jeff Layton 提交于 6月 17, 2014

The filehandle structs all use host-endian values, but will sometimes
stuff big-endian values into those fields. This is OK since these
values are opaque to the client, but it confuses sparse. Add __force to
make it clear that we are doing this intentionally.
Signed-off-by: NJeff Layton <jlayton@primarydata.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>

94ec938b

nfsd: nfsd_splice_read and nfsd_readv should return __be32 · e2afc819

由 Jeff Layton 提交于 6月 17, 2014

The callers expect a __be32 return and the functions they call return
__be32, so having these return int is just wrong. Also, nfsd_finish_read
can be made static.
Signed-off-by: NJeff Layton <jlayton@primarydata.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>

e2afc819

nfsd: clean up sparse endianness warnings in nfscache.c · b3d8d128

由 Jeff Layton 提交于 6月 17, 2014

We currently hash the XID to determine a hash bucket to use for the
reply cache entry, which is fed into hash_32 without byte-swapping it.
Add __force to make sparse happy, and add some comments to explain
why.
Signed-off-by: NJeff Layton <jlayton@primarydata.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>

b3d8d128

nfsd: add __force to opaque verifier field casts · f419992c

由 Jeff Layton 提交于 6月 17, 2014

sparse complains that we're stuffing non-byte-swapped values into
__be32's here. Since they're supposed to be opaque, it doesn't matter
much. Just add __force to make sparse happy.
Signed-off-by: NJeff Layton <jlayton@primarydata.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>

f419992c

NFSD: Using exp_get for export getting · bf18f163

由 Kinglong Mee 提交于 6月 10, 2014

Don't using cache_get besides export.h, using exp_get for export.
Signed-off-by: NKinglong Mee <kinglongmee@gmail.com>
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>

bf18f163

NFSD: Using path_get when assigning path for export · 0da22a91

由 Kinglong Mee 提交于 6月 10, 2014

Signed-off-by: NKinglong Mee <kinglongmee@gmail.com>
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>

0da22a91

SUNRPC/NFSD: Change to type of bool for rq_usedeferral and rq_splice_ok · f15a5cf9

由 Kinglong Mee 提交于 6月 10, 2014

rq_usedeferral and rq_splice_ok are used as 0 and 1, just defined to bool.
Signed-off-by: NKinglong Mee <kinglongmee@gmail.com>
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>

f15a5cf9

NFSD: Using min/max/min_t/max_t for calculate · 3c7aa15d

由 Kinglong Mee 提交于 6月 10, 2014

Signed-off-by: NKinglong Mee <kinglongmee@gmail.com>
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>

3c7aa15d

20 6月, 2014 9 次提交

Btrfs: fix wrong error handle when the device is missing or is not writeable · 8408c716

由 Miao Xie 提交于 6月 19, 2014

The original bio might be submitted, so we shoud increase bi_remaining to
account for it when we deal with the error that the device is missing or
is not writeable, or we would skip the endio handle.
Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
Signed-off-by: NChris Mason <clm@fb.com>

8408c716

Btrfs: fix deadlock when mounting a degraded fs · c55f1396

由 Miao Xie 提交于 6月 19, 2014

The deadlock happened when we mount degraded filesystem, the reproduced
steps are following:
 # mkfs.btrfs -f -m raid1 -d raid1 <dev0> <dev1>
 # echo 1 > /sys/block/`basename <dev0>`/device/delete
 # mount -o degraded <dev1> <mnt>

The reason was that the counter -- bi_remaining was wrong. If the missing
or unwriteable device was the last device in the mapping array, we would
not submit the original bio, so we shouldn't increase bi_remaining of it
in btrfs_end_bio(), or we would skip the final endio handle.

Fix this problem by adding a flag into btrfs bio structure. If we submit
the original bio, we will set the flag, and we increase bi_remaining counter,
or we don't.

Though there is another way to fix it -- decrease bi_remaining counter of the
original bio when we make sure the original bio is not submitted, this method
need add more check and is easy to make mistake.
Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
Reviewed-by: NLiu Bo <bo.li.liu@oracle.com>
Signed-off-by: NChris Mason <clm@fb.com>

c55f1396

M
Btrfs: use bio_endio_nodec instead of open code · e990f167
由 Miao Xie 提交于 6月 19, 2014
```
Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
Signed-off-by: NChris Mason <clm@fb.com>
```
e990f167

Btrfs: fix NULL pointer crash when running balance and scrub concurrently · 298a8f9c

由 Wang Shilong 提交于 6月 19, 2014

While running balance, scrub, fsstress concurrently we hit the
following kernel crash:

[56561.448845] BTRFS info (device sde): relocating block group 11005853696 flags 132
[56561.524077] BUG: unable to handle kernel NULL pointer dereference at 0000000000000078
[56561.524237] IP: [<ffffffffa038956d>] scrub_chunk.isra.12+0xdd/0x130 [btrfs]
[56561.524297] PGD 9be28067 PUD 7f3dd067 PMD 0
[56561.524325] Oops: 0000 [#1] SMP
[....]
[56561.527237] Call Trace:
[56561.527309]  [<ffffffffa038980e>] scrub_enumerate_chunks+0x24e/0x490 [btrfs]
[56561.527392]  [<ffffffff810abe00>] ? abort_exclusive_wait+0x50/0xb0
[56561.527476]  [<ffffffffa038add4>] btrfs_scrub_dev+0x1a4/0x530 [btrfs]
[56561.527561]  [<ffffffffa0368107>] btrfs_ioctl+0x13f7/0x2a90 [btrfs]
[56561.527639]  [<ffffffff811c82f0>] do_vfs_ioctl+0x2e0/0x4c0
[56561.527712]  [<ffffffff8109c384>] ? vtime_account_user+0x54/0x60
[56561.527788]  [<ffffffff810f768c>] ? __audit_syscall_entry+0x9c/0xf0
[56561.527870]  [<ffffffff811c8551>] SyS_ioctl+0x81/0xa0
[56561.527941]  [<ffffffff815707f7>] tracesys+0xdd/0xe2
[...]
[56561.528304] RIP  [<ffffffffa038956d>] scrub_chunk.isra.12+0xdd/0x130 [btrfs]
[56561.528395]  RSP <ffff88004c0f5be8>
[56561.528454] CR2: 0000000000000078

This is because in btrfs_relocate_chunk(), we will free @bdev directly while
scrub may still hold extent mapping, and may access freed memory.

Fix this problem by wrapping freeing @bdev work into free_extent_map() which
is based on reference count.
Reported-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: NWang Shilong <wangsl.fnst@cn.fujitsu.com>
Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
Signed-off-by: NChris Mason <clm@fb.com>

298a8f9c

btrfs: Skip scrubbing removed chunks to avoid -ENOENT. · ced96edc

由 Qu Wenruo 提交于 6月 19, 2014

When run scrub with balance, sometimes -ENOENT will be returned, since
in scrub_enumerate_chunks() will search dev_extent in *COMMIT_ROOT*, but
btrfs_lookup_block_group() will search block group in *MEMORY*, so if a
chunk is removed but not committed, -ENOENT will be returned.

However, there is no need to stop scrubbing since other chunks may be
scrubbed without problem.

So this patch changes the behavior to skip removed chunks and continue
to scrub the rest.
Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
Signed-off-by: NChris Mason <clm@fb.com>

ced96edc

Btrfs: fix broken free space cache after the system crashed · e570fd27

由 Miao Xie 提交于 6月 19, 2014

When we mounted the filesystem after the crash, we got the following
message:
  BTRFS error (device xxx): block group xxxx has wrong amount of free space
  BTRFS error (device xxx): failed to load free space cache for block group xxx

It is because we didn't update the metadata of the allocated space (in extent
tree) until the file data was written into the disk. During this time, there was
no information about the allocated spaces in either the extent tree nor the
free space cache. when we wrote out the free space cache at this time (commit
transaction), those spaces were lost. In fact, only the free space that is
used to store the file data had this problem, the others didn't because
the metadata of them is updated in the same transaction context.

There are many methods which can fix the above problem
- track the allocated space, and write it out when we write out the free
  space cache
- account the size of the allocated space that is used to store the file
  data, if the size is not zero, don't write out the free space cache.

The first one is complex and may make the performance drop down.
This patch chose the second method, we use a per-block-group variant to
account the size of that allocated space. Besides that, we also introduce
a per-block-group read-write semaphore to avoid the race between
the allocation and the free space cache write out.
Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
Signed-off-by: NChris Mason <clm@fb.com>

e570fd27

Btrfs: make free space cache write out functions more readable · 5349d6c3

由 Miao Xie 提交于 6月 19, 2014

This patch makes the free space cache write out functions more readable,
and beisdes that, it also reduces the stack space that the function --
__btrfs_write_out_cache uses from 194bytes to 144bytes.
Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
Signed-off-by: NChris Mason <clm@fb.com>

5349d6c3

Btrfs: remove unused wait queue in struct extent_buffer · 46fefe41

由 Filipe Manana 提交于 6月 16, 2014

The lock_wq wait queue is not used anywhere, therefore just remove it.
On a x86_64 system, this reduced sizeof(struct extent_buffer) from 320
bytes down to 296 bytes, which means a 4Kb page can now be used for
13 extent buffers instead of 12.
Signed-off-by: NFilipe David Borba Manana <fdmanana@gmail.com>
Signed-off-by: NChris Mason <clm@fb.com>

46fefe41

Btrfs: fix deadlocks with trylock on tree nodes · ea4ebde0

由 Chris Mason 提交于 6月 19, 2014

The Btrfs tree trylock function is poorly named.  It always takes
the spinlock and backs off if the blocking lock is held.  This
can lead to surprising lockups because people expect it to really be a
trylock.

This commit makes it a pure trylock, both for the spinlock and the
blocking lock.  It also reworks the nested lock handling slightly to
avoid taking the read lock while a spinning write lock might be held.
Signed-off-by: NChris Mason <clm@fb.com>

ea4ebde0

18 6月, 2014 2 次提交

NFSD: fix bug for readdir of pseudofs · f41c5ad2

由 Kinglong Mee 提交于 6月 13, 2014

Commit 561f0ed4 (nfsd4: allow large readdirs) introduces a bug
about readdir the root of pseudofs.

Call xdr_truncate_encode() revert encoded name when skipping.
Signed-off-by: NKinglong Mee <kinglongmee@gmail.com>
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>

f41c5ad2

NFSD: Don't hand out delegations for 30 seconds after recalling them. · 6282cd56

由 NeilBrown 提交于 6月 04, 2014

If nfsd needs to recall a delegation for some reason it implies that there is
contention on the file, so further delegations should not be handed out.

The current code fails to do so, and the result is effectively a
live-lock under some workloads: a client attempting a conflicting
operation on a read-delegated file receives NFS4ERR_DELAY and retries
the operation, but by the time it retries the server may already have
given out another delegation.

We could simply avoid delegations for (say) 30 seconds after any recall, but
this is probably too heavy handed.

We could keep a list of inodes (or inode numbers or filehandles) for recalled
delegations, but that requires memory allocation and searching.

The approach taken here is to use a bloom filter to record the filehandles
which are currently blocked from delegation, and to accept the cost of a few
false positives.

We have 2 bloom filters, each of which is valid for 30 seconds.   When a
delegation is recalled the filehandle is added to one filter and will remain
disabled for between 30 and 60 seconds.

We keep a count of the number of filehandles that have been added, so when
that count is zero we can bypass all other tests.

The bloom filters have 256 bits and 3 hash functions.  This should allow a
couple of dozen blocked  filehandles with minimal false positives.  If many
more filehandles are all blocked at once, behaviour will degrade towards
rejecting all delegations for between 30 and 60 seconds, then resetting and
allowing new delegations.
Signed-off-by: NNeilBrown <neilb@suse.de>
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>

6282cd56