Commits · 9a301cafc8619c7f30032d314da6e65d9d913d57 · openeuler / Kernel

26 Apr, 2021 19 commits

xprtrdma: Move fr_linv_done field to struct rpcrdma_mr · 9a301caf

Authored by Chuck Lever on Apr 19, 2021

Clean up: Move more of struct rpcrdma_frwr into its parent.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

xprtrdma: Move cqe to struct rpcrdma_mr · e10fa96d

Authored by Chuck Lever on Apr 19, 2021

Clean up.

- Simplify variable initialization in the completion handlers.

- Move another field out of struct rpcrdma_frwr.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

xprtrdma: Move fr_cid to struct rpcrdma_mr · 0a26d10e

Authored by Chuck Lever on Apr 19, 2021

Clean up (for several purposes):

- The MR's cid is initialized sooner so that tracepoints can show
  something reasonable even if the MR is never posted.
- The MR's res.id doesn't change so the cid won't change either.
  Initializing the cid once is sufficient.
- struct rpcrdma_frwr is going away soon.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

xprtrdma: Remove the RPC/RDMA QP event handler · e1648eb2

Authored by Chuck Lever on Apr 19, 2021

Clean up: The handler only recorded a trace event. If indeed no
action is needed by the RPC/RDMA consumer, then the event can be
ignored.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

xprtrdma: Add tracepoints showing FastReg WRs and remote invalidation · 4ddd0fc3

Authored by Chuck Lever on Apr 19, 2021

The Send signaling logic is a little subtle, so add some
observability around it. For every xprtrdma_mr_fastreg event, there
should be an xprtrdma_mr_localinv or xprtrdma_mr_reminv event.

When these tracepoints are enabled, we can see exactly when an MR is
DMA-mapped, registered, invalidated (either locally or remotely) and
then DMA-unmapped.

kworker/u25:2-190 [000] 787.979512: xprtrdma_mr_map: task:351@5 mr.id=4 nents=2 5608@0x8679e0c8f6f56000:0x00000503 (TO_DEVICE)
kworker/u25:2-190 [000] 787.979515: xprtrdma_chunk_read: task:351@5 pos=148 5608@0x8679e0c8f6f56000:0x00000503 (last)
kworker/u25:2-190 [000] 787.979519: xprtrdma_marshal: task:351@5 xid=0x8679e0c8: hdr=52 xdr=148/5608/0 read list/inline
kworker/u25:2-190 [000] 787.979525: xprtrdma_mr_fastreg: task:351@5 mr.id=4 nents=2 5608@0x8679e0c8f6f56000:0x00000503 (TO_DEVICE)
kworker/u25:2-190 [000] 787.979526: xprtrdma_post_send: task:351@5 cq.id=0 cid=73 (2 SGEs)

...

kworker/5:1H-219 [005] 787.980567: xprtrdma_wc_receive: cq.id=1 cid=161 status=SUCCESS (0/0x0) received=164
kworker/5:1H-219 [005] 787.980571: xprtrdma_post_recvs: peer=[192.168.100.55]:20049 r_xprt=0xffff8884974d4000: 0 new recvs, 70 active (rc 0)
kworker/5:1H-219 [005] 787.980573: xprtrdma_reply: task:351@5 xid=0x8679e0c8 credits=64
kworker/5:1H-219 [005] 787.980576: xprtrdma_mr_reminv: task:351@5 mr.id=4 nents=2 5608@0x8679e0c8f6f56000:0x00000503 (TO_DEVICE)
kworker/5:1H-219 [005] 787.980577: xprtrdma_mr_unmap: mr.id=4 nents=2 5608@0x8679e0c8f6f56000:0x00000503 (TO_DEVICE)

Note that I've moved the xprtrdma_post_send tracepoint so that event
always appears after the xprtrdma_mr_fastreg tracepoint. Otherwise
the event log looks counterintuitive (FastReg is always supposed to
happen before Send).
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

xprtrdma: Avoid Send Queue wrapping · b3ce7a25

Authored by Chuck Lever on Apr 19, 2021

Send WRs can be signaled or unsignaled. A signaled Send WR
always has a matching Send completion, while an unsignaled Send
has a completion only if the Send WR fails.

xprtrdma has a Send accounting mechanism that is designed to reduce
the number of signaled Send WRs. This in turn mitigates the
interrupt rate of the underlying device.

RDMA consumers can't leave all Sends unsignaled, however, because
providers rely on Send completions to maintain their Send Queue head
and tail pointers. xprtrdma counts the number of unsignaled Send WRs
that have been posted to ensure that Sends are signaled often
enough to prevent the Send Queue from wrapping.

This mechanism neglected to account for FastReg WRs, which are
posted on the Send Queue but never signaled. As a result, the
Send Queue wrapped on occasion, resulting in duplicate completions
of FastReg and LocalInv WRs.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

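As a rough illustration of the accounting described in this commit message, the sketch below models the unsignaled-WR budget in plain userspace C. It is not the xprtrdma implementation: `struct sendq_model`, `sq_init()`, `sq_post()`, and the batch size are hypothetical names chosen for the example. The point is that FastReg WRs chained ahead of a Send draw from the same budget as the Send itself, so a signaled Send comes due before the queue can wrap.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of Send Queue signaling; all names here are hypothetical. */
struct sendq_model {
	int batch;      /* max unsignaled WRs allowed between signaled Sends */
	int remaining;  /* unsignaled WRs still allowed before the next signal */
};

void sq_init(struct sendq_model *sq, int batch)
{
	assert(batch > 0);
	sq->batch = batch;
	sq->remaining = batch;
}

/*
 * Account for one Send plus the nr_fastreg FastReg WRs chained ahead of it.
 * Returns true when the Send must be posted signaled.
 */
bool sq_post(struct sendq_model *sq, int nr_fastreg)
{
	assert(nr_fastreg >= 0);
	sq->remaining -= nr_fastreg + 1;
	if (sq->remaining > 0)
		return false;
	sq->remaining = sq->batch;
	return true;
}
```

With a batch of 4, a Send that carries one FastReg WR consumes two slots, so the signaled Send arrives sooner than it would if only Sends were counted.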

xprtrdma: Do not wake RPC consumer on a failed LocalInv · 8a053433

Authored by Chuck Lever on Apr 19, 2021

Throw away any reply where the LocalInv flushes or could not be
posted. The registered memory region is in an unknown state until
the disconnect completes.

rpcrdma_xprt_disconnect() will find and release the MR. No need to
put it back on the MR free list in this case.

The client retransmits pending RPC requests once it reestablishes a
fresh connection, so a replacement reply should be forthcoming on
the next connection instance.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

xprtrdma: Do not recycle MR after FastReg/LocalInv flushes · e4b52ca0

Authored by Chuck Lever on Apr 19, 2021

Better not to touch MRs involved in a flush or post error until the
Send and Receive Queues are drained and the transport is fully
quiescent. Simply don't insert such MRs back onto the free list.
They remain on mr_all and will be released when the connection is
torn down.

I had thought that recycling would prevent hardware resources from
being tied up for a long time. However, since v5.7, a transport
disconnect destroys the QP and other hardware-owned resources. The
MRs get cleaned up nicely at that point.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

xprtrdma: Clarify use of barrier in frwr_wc_localinv_done() · 44438ad9

Authored by Chuck Lever on Apr 19, 2021

Clean up: The comment and the placement of the memory barrier are
confusing. Humans want to read the function statements from head
to tail.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

xprtrdma: Rename frwr_release_mr() · f912af77

Authored by Chuck Lever on Apr 19, 2021

Clean up: To be consistent with other functions in this source file,
follow the naming convention of putting the object being acted upon
before the action itself.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

xprtrdma: rpcrdma_mr_pop() already does list_del_init() · 1363e638

Authored by Chuck Lever on Apr 19, 2021

The rpcrdma_mr_pop() earlier in the function has already cleared
out mr_list, so it must not be done again in the error path.

Fixes: 84756894 ("xprtrdma: Remove fr_state")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

xprtrdma: Delete rpcrdma_recv_buffer_put() · c35ca60d

Authored by Chuck Lever on Apr 19, 2021

Clean up: The name recv_buffer_put() is a vestige of older code,
and the function is just a wrapper for the newer rpcrdma_rep_put().
In most of the existing call sites, a pointer to the owning
rpcrdma_buffer is already available.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

xprtrdma: Fix cwnd update ordering · 35d8b10a

Authored by Chuck Lever on Apr 19, 2021

After a reconnect, the reply handler is opening the cwnd (and thus
enabling more RPC Calls to be sent) /before/ rpcrdma_post_recvs()
can post enough Receive WRs to receive their replies. This causes an
RNR and the new connection is lost immediately.

The race is most clearly exposed when KASAN and disconnect injection
are enabled. This slows down rpcrdma_rep_create() enough to allow
the send side to post a bunch of RPC Calls before the Receive
completion handler can invoke ib_post_recv().

Fixes: 2ae50ad6 ("xprtrdma: Close window between waking RPC senders and posting Receives")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

xprtrdma: Improve locking around rpcrdma_rep creation · 9e3ca33b

Authored by Chuck Lever on Apr 19, 2021

Defensive clean up: Protect the rb_all_reps list during rep
creation.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

xprtrdma: Improve commentary around rpcrdma_reps_unmap() · 8b5292be

Authored by Chuck Lever on Apr 19, 2021

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

xprtrdma: Improve locking around rpcrdma_rep destruction · eaf86e8c

Authored by Chuck Lever on Apr 24, 2021

Currently rpcrdma_reps_destroy() assumes that, at transport
tear-down, the content of the rb_free_reps list is the same as the
content of the rb_all_reps list. Although that is usually true,
using the rb_all_reps list should be more reliable because of
the way it's managed. And, rpcrdma_reps_unmap() uses rb_all_reps;
these two functions should both traverse the "all" list.

Ensure that all rpcrdma_reps are always destroyed whether they are
on the rep free list or not.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

xprtrdma: Put flushed Receives on free list instead of destroying them · 5030c9a9

Authored by Chuck Lever on Apr 19, 2021

Defer destruction of an rpcrdma_rep until transport tear-down to
preserve the rb_all_reps list while Receives flush.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Tom Talpey <tom@talpey.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

xprtrdma: Do not refresh Receive Queue while it is draining · 15788d1d

Authored by Chuck Lever on Apr 19, 2021

Currently the Receive completion handler refreshes the Receive Queue
whenever a successful Receive completion occurs.

On disconnect, xprtrdma drains the Receive Queue. The first few
Receive completions after a disconnect are typically successful,
until the first flushed Receive.

This means the Receive completion handler continues to post more
Receive WRs after the drain sentinel has been posted. The late-
posted Receives flush after the drain sentinel has completed,
leading to a crash later in rpcrdma_xprt_disconnect().

To prevent this crash, xprtrdma has to ensure that the Receive
handler stops posting Receives before ib_drain_rq() posts its
drain sentinel.
Suggested-by: Tom Talpey <tom@talpey.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

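The ordering requirement in this commit message can be pictured with a small userspace model. This is only a sketch under assumptions: `struct rq_model`, `rq_refresh()`, and `rq_begin_drain()` are hypothetical names, and the real transport has to do this safely against a concurrently running completion handler, which a single-threaded model does not capture.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical receive-side state; not the kernel's actual structure. */
struct rq_model {
	bool draining;  /* set before the drain sentinel is posted */
	int posted;     /* Receive WRs posted so far */
};

/* Called from the Receive completion handler to top up the queue. */
int rq_refresh(struct rq_model *rq, int wanted)
{
	assert(wanted >= 0);
	if (rq->draining)
		return 0;       /* never post behind the drain sentinel */
	rq->posted += wanted;
	return wanted;
}

/* Called on disconnect, before the drain sentinel is posted. */
void rq_begin_drain(struct rq_model *rq)
{
	rq->draining = true;
}
```

Once `rq_begin_drain()` has run, a late successful Receive completion can no longer add WRs that would flush after the sentinel.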

xprtrdma: Avoid Receive Queue wrapping · 32e6b681

Authored by Chuck Lever on Apr 19, 2021

Commit e340c2d6 ("xprtrdma: Reduce the doorbell rate (Receive)")
increased the number of Receive WRs that are posted by the client,
but did not increase the size of the Receive Queue allocated during
transport set-up.

This is usually not an issue because RPCRDMA_BACKWARD_WRS is defined
as (32) when SUNRPC_BACKCHANNEL is defined. In cases where it isn't,
there is a real risk of Receive Queue wrapping.

Fixes: e340c2d6 ("xprtrdma: Reduce the doorbell rate (Receive)")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Tom Talpey <tom@talpey.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

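The sizing argument can be checked with simple arithmetic. The sketch below is a model only: the batch size of 7 and the helper names `rq_depth()` and `rq_fits()` are assumptions made for the example, and the credit limit of 64 is borrowed from the trace output earlier in this log. The value 32 for the backchannel WRs is the one quoted in the commit message.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_MAX_RECV_BATCH 7u  /* assumed batch size, for illustration */

/* Receive Queue depth requested at transport set-up. */
unsigned int rq_depth(unsigned int credits, unsigned int backward_wrs,
		      bool count_batch)
{
	unsigned int depth = credits + backward_wrs;

	if (count_batch)
		depth += MODEL_MAX_RECV_BATCH;
	return depth;
}

/* True if posting a full batch on top of the credit limit fits the queue. */
bool rq_fits(unsigned int depth, unsigned int credits)
{
	assert(depth > 0);
	return credits + MODEL_MAX_RECV_BATCH <= depth;
}
```

With 64 credits and no backchannel WRs, a depth of 64 cannot hold 71 posted Receives. The 32 backchannel WRs hide the shortfall, and counting the batch in the depth removes it.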

14 Apr, 2021 4 commits

SUNRPC: Handle major timeout in xprt_adjust_timeout() · 09252177

Authored by Chris Dion on Apr 04, 2021

Currently if a major timeout value is reached, but the minor value has
not been reached, an ETIMEDOUT will not be sent back to the caller.
This can occur if the v4 server is not responding to requests and
retrans is configured larger than the default of two.

For example, a TCP mount with a configured timeout value of 50 and a
retransmission count of 3 to a v4 server which is not responding:

1. Initial value and increment set to 5s, maxval set to 20s, retries at 3
2. Major timeout is set to 20s, minor timeout set to 5s initially
3. xprt_adjust_timeout() is called after 5s, retry with 10s timeout,
   minor timeout is bumped to 10s
4. And again after another 10s, 15s total time with minor timeout set
   to 15s
5. After 20s total time xprt_adjust_timeout() is called as major timeout is
   reached, but skipped because the minor timeout is not reached
   - After this time the cpu spins continually calling
     xprt_adjust_timeout() and returning 0 for 10 seconds.
     As seen on perf sched:
     39243.913182 [0005]  mount.nfs[3794] 4607.938      0.017   9746.863
6. This continues until the 15s minor timeout condition is reached (in
   this case for 10 seconds). After which the ETIMEDOUT is processed
   back to the caller, the cpu spinning stops, and normal operations
   continue

Fixes: 7de62bc0 ("SUNRPC dont update timeout value on connection reset")
Signed-off-by: Chris Dion <Christopher.Dion@dell.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

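The timeline in the numbered steps above can be reproduced with a small userspace model. This is a sketch, not the logic of xprt_adjust_timeout(): `struct to_model` and `to_error_time()` are hypothetical, and the model only tracks when the minor timer expires relative to the major deadline.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy timeline of linear backoff; field names are illustrative. */
struct to_model {
	int initval;    /* first minor timeout, seconds */
	int increment;  /* added to the minor timeout on each retry */
	int major;      /* major timeout, seconds from the first transmit */
};

/*
 * Return the second at which the caller sees the timeout error.
 * With honor_major false, the error is delayed until the minor timer
 * that is running when the major deadline passes finally expires.
 */
int to_error_time(const struct to_model *to, bool honor_major)
{
	int now = 0;
	int minor = to->initval;

	assert(to->initval > 0 && to->increment >= 0 && to->major > 0);
	for (;;) {
		int expiry = now + minor;

		if (expiry >= to->major)
			return honor_major ? to->major : expiry;
		now = expiry;
		minor += to->increment;
	}
}
```

For the values in the example (5s initial, 5s increment, 20s major), the model reports the error at 30s when only the minor timer is consulted and at 20s when the major deadline is honored, which matches the 10 seconds of spinning described in step 5.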

SUNRPC: Remove trace_xprt_transmit_queued · 6cf23783

Authored by Chuck Lever on Mar 31, 2021

This tracepoint can crash when dereferencing snd_task because
when some transports connect, they put a cookie in that field
instead of a pointer to an rpc_task.

BUG: KASAN: use-after-free in trace_event_raw_event_xprt_writelock_event+0x141/0x18e [sunrpc]
Read of size 2 at addr ffff8881a83bd3a0 by task git/331872

CPU: 11 PID: 331872 Comm: git Tainted: G S                5.12.0-rc2-00007-g3ab6e585a7f9 #1453
Hardware name: Supermicro SYS-6028R-T/X10DRi, BIOS 1.1a 10/16/2015
Call Trace:
 dump_stack+0x9c/0xcf
 print_address_description.constprop.0+0x18/0x239
 kasan_report+0x174/0x1b0
 trace_event_raw_event_xprt_writelock_event+0x141/0x18e [sunrpc]
 xprt_prepare_transmit+0x8e/0xc1 [sunrpc]
 call_transmit+0x4d/0xc6 [sunrpc]

Fixes: 9ce07ae5 ("SUNRPC: Replace dprintk() call site in xprt_prepare_transmit")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

SUNRPC: Add tracepoint that fires when an RPC is retransmitted · e936a597

Authored by Chuck Lever on Mar 31, 2021

A separate tracepoint can be left enabled all the time to capture
rare but important retransmission events. So for example:

kworker/u26:3-568 [009] 156.967933: xprt_retransmit: task:44093@5 xid=0xa25dbc79 nfsv3 WRITE ntrans=2

Or, for example, enable all nfs and nfs4 tracepoints, and set up a
trigger to disable tracing when xprt_retransmit fires to capture
everything that leads up to it.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

SUNRPC: Move fault injection call sites · 7638e0bf

Authored by Chuck Lever on Mar 31, 2021

I've hit some crashes that occur in the xprt_rdma_inject_disconnect
path. It appears that, for some providers, rdma_disconnect() can
take so long that the transport can disconnect and release its
hardware resources while rdma_disconnect() is still running,
resulting in a UAF in the provider.

The transport's fault injection method may depend on the stability
of transport data structures. That means it needs to be invoked
only from contexts that hold the transport write lock.

Fixes: 4a068258 ("SUNRPC: Transport fault injection")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

05 Apr, 2021 3 commits

SUNRPC: Ensure the transport backchannel association · 98b5cee3

Authored by Benjamin Coddington on Mar 22, 2021

If the server sends CB_ calls on a connection that is not associated
with the backchannel, refuse to process the call and shut down the
connection.  This avoids a NULL dereference crash in
xprt_complete_bc_request().  There's not much more we can do in this
situation unless we want to look into allowing all connections to be
associated with the fore and back channel.
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

sunrpc: honor rpc_task's timeout value in rpcb_create() · 6b996476

Authored by Eryu Guan on Mar 22, 2021

Currently the rpcbind client is created without setting an RPC timeout
(thus using the default value). If the rpc_task already has a customized
timeout in its tk_client field, that timeout is ignored as well.

Let's use the same timeout setting in rpc_task->tk_client->cl_timeout
for the rpcbind connection.
Signed-off-by: Eryu Guan <eguan@linux.alibaba.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

SUNRPC: Set TCP_CORK until the transmit queue is empty · d737e5d4

Authored by Trond Myklebust on Feb 09, 2021

When we have multiple RPC requests queued up, it makes sense to set the
TCP_CORK option while the transmit queue is non-empty.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

12 Mar, 2021 1 commit

svcrdma: Revert "svcrdma: Reduce Receive doorbell rate" · bade4be6

Authored by Chuck Lever on Mar 11, 2021

I tested commit 43042b90 ("svcrdma: Reduce Receive doorbell
rate") with mlx4 (IB) and software iWARP and didn't find any
issues. However, I recently got my hardware iWARP setup back on
line (FastLinQ) and it's crashing hard on this commit (confirmed
via bisect).

The failure mode is complex.
 - After a connection is established, the first Receive completes
   normally.
 - But the second and third Receives have garbage in their Receive
   buffers. The server responds with ERR_VERS as a result.
 - When the client tears down the connection to retry, a couple
   of posted Receives flush twice, and that corrupts the recv_ctxt
   free list.
 - __svc_rdma_free then faults or loops infinitely while destroying
   the xprt's recv_ctxts.

Since 43042b90 ("svcrdma: Reduce Receive doorbell rate") does
not fix a bug but is a scalability enhancement, it's safe and
appropriate to revert it while working on a replacement.

Fixes: 43042b90 ("svcrdma: Reduce Receive doorbell rate")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

09 Mar, 2021 1 commit

SUNRPC: Set memalloc_nofs_save() for sync tasks · f0940f4b

Authored by Benjamin Coddington on Mar 03, 2021

We could recurse into NFS doing memory reclaim while sending a sync task,
which might result in a deadlock.  Set memalloc_nofs_save for sync task
execution.

Fixes: a1231fda ("SUNRPC: Set memalloc_nofs_save() on all rpciod/xprtiod jobs")
Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

07 Mar, 2021 4 commits

rpc: fix NULL dereference on kmalloc failure · 0ddc9423

Authored by J. Bruce Fields on Mar 02, 2021

I think this is unlikely but possible:

svc_authenticate sets rq_authop and calls svcauth_gss_accept. The
kmalloc(sizeof(*svcdata), GFP_KERNEL) fails, leaving rq_auth_data NULL,
and returning SVC_DENIED.

This causes svc_process_common to go to err_bad_auth, and eventually
call svc_authorise. That calls ->release == svcauth_gss_release, which
tries to dereference rq_auth_data.
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Link: https://lore.kernel.org/linux-nfs/3F1B347F-B809-478F-A1E9-0BE98E22B0F0@oracle.com/T/#t
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

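The failure path traced in this commit message follows a common pattern: a release hook that also runs on error paths has to tolerate state that was never set up. The sketch below shows that pattern in userspace C. The types and the functions `req_accept()` and `req_release()` are hypothetical stand-ins, not the svcauth_gss code.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical request and per-request auth data. */
struct auth_data { int refs; };
struct request { struct auth_data *auth_data; };

/* Accept step: leaves auth_data NULL if the allocation fails. */
int req_accept(struct request *rq, int simulate_oom)
{
	assert(rq != NULL);
	rq->auth_data = simulate_oom ? NULL : calloc(1, sizeof(*rq->auth_data));
	return rq->auth_data ? 0 : -1;
}

/* Release step: runs on the error path too, so it must tolerate NULL. */
int req_release(struct request *rq)
{
	assert(rq != NULL);
	if (!rq->auth_data)
		return 0;       /* nothing was set up; nothing to undo */
	free(rq->auth_data);
	rq->auth_data = NULL;
	return 1;
}
```

Calling `req_release()` after a failed `req_accept()` returns 0 instead of dereferencing a NULL pointer.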

sunrpc: fix refcount leak for rpc auth modules · f1442d63

Authored by Daniel Kobras on Feb 27, 2021

If an auth module's accept op returns SVC_CLOSE, svc_process_common()
enters a call path that does not call svc_authorise() before leaving the
function, and thus leaks a reference on the auth module's refcount. Hence,
make sure calls to svc_authenticate() and svc_authorise() are paired for
all call paths, to make sure rpc auth modules can be unloaded.
Signed-off-by: Daniel Kobras <kobras@puzzle-itc.de>
Fixes: 4d712ef1 ("svcauth_gss: Close connection when dropping an incoming message")
Link: https://lore.kernel.org/linux-nfs/3F1B347F-B809-478F-A1E9-0BE98E22B0F0@oracle.com/T/#t
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

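The pairing rule in this commit message can be sketched as follows. This is an illustration only: the `enum verdict` values, `struct auth_module`, and the three functions are hypothetical and share little beyond their names with svc_authenticate() and svc_authorise(). The model shows that a reference taken on entry is dropped on every outcome, including the close path.

```c
#include <assert.h>

enum verdict { V_OK, V_DENIED, V_CLOSE };  /* illustrative outcomes */

struct auth_module { int refs; };          /* hypothetical module refcount */

enum verdict authenticate(struct auth_module *m, enum verdict outcome)
{
	m->refs++;              /* reference taken whatever the outcome */
	return outcome;
}

void authorise(struct auth_module *m)
{
	assert(m->refs > 0);
	m->refs--;              /* the matching release */
}

/* Every outcome reaches the same authorise() call before returning. */
int process(struct auth_module *m, enum verdict outcome)
{
	int ret = 0;

	switch (authenticate(m, outcome)) {
	case V_OK:
		ret = 1;        /* the request would be dispatched here */
		break;
	case V_DENIED:
	case V_CLOSE:
		break;          /* error outcomes still reach the release */
	}
	authorise(m);
	return ret;
}
```

After any call to `process()`, the module's reference count is back to its starting value, so the module can be unloaded.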

NFSD: Repair misuse of sv_lock in 5.10.16-rt30. · c7de87ff

Authored by Joe Korty on Feb 26, 2021

[ This problem is in mainline, but only rt has the chops to be
able to detect it. ]

Lockdep reports a circular lock dependency between serv->sv_lock and
softirq_ctrl.lock on system shutdown, when using a kernel built with
CONFIG_PREEMPT_RT=y, and an NFS mount exists.

This is due to the definition of spin_lock_bh on rt:

	local_bh_disable();
	rt_spin_lock(lock);

which forces a softirq_ctrl.lock -> serv->sv_lock dependency.  This is
not a problem as long as _every_ lock of serv->sv_lock is a:

	spin_lock_bh(&serv->sv_lock);

but there is one of the form:

	spin_lock(&serv->sv_lock);

This is what is causing the circular dependency splat.  The spin_lock()
grabs the lock without first grabbing softirq_ctrl.lock via local_bh_disable.
If later on in the critical region, someone does a local_bh_disable, we
get a serv->sv_lock -> softirq_ctrl.lock dependency established.  Deadlock.

Fix is to make serv->sv_lock be locked with spin_lock_bh everywhere, no
exceptions.

[  OK  ] Stopped target NFS client services.
         Stopping Logout off all iSCSI sessions on shutdown...
         Stopping NFS server and services...
[  109.442380]
[  109.442385] ======================================================
[  109.442386] WARNING: possible circular locking dependency detected
[  109.442387] 5.10.16-rt30 #1 Not tainted
[  109.442389] ------------------------------------------------------
[  109.442390] nfsd/1032 is trying to acquire lock:
[  109.442392] ffff994237617f60 ((softirq_ctrl.lock).lock){+.+.}-{2:2}, at: __local_bh_disable_ip+0xd9/0x270
[  109.442405]
[  109.442405] but task is already holding lock:
[  109.442406] ffff994245cb00b0 (&serv->sv_lock){+.+.}-{0:0}, at: svc_close_list+0x1f/0x90
[  109.442415]
[  109.442415] which lock already depends on the new lock.
[  109.442415]
[  109.442416]
[  109.442416] the existing dependency chain (in reverse order) is:
[  109.442417]
[  109.442417] -> #1 (&serv->sv_lock){+.+.}-{0:0}:
[  109.442421]        rt_spin_lock+0x2b/0xc0
[  109.442428]        svc_add_new_perm_xprt+0x42/0xa0
[  109.442430]        svc_addsock+0x135/0x220
[  109.442434]        write_ports+0x4b3/0x620
[  109.442438]        nfsctl_transaction_write+0x45/0x80
[  109.442440]        vfs_write+0xff/0x420
[  109.442444]        ksys_write+0x4f/0xc0
[  109.442446]        do_syscall_64+0x33/0x40
[  109.442450]        entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  109.442454]
[  109.442454] -> #0 ((softirq_ctrl.lock).lock){+.+.}-{2:2}:
[  109.442457]        __lock_acquire+0x1264/0x20b0
[  109.442463]        lock_acquire+0xc2/0x400
[  109.442466]        rt_spin_lock+0x2b/0xc0
[  109.442469]        __local_bh_disable_ip+0xd9/0x270
[  109.442471]        svc_xprt_do_enqueue+0xc0/0x4d0
[  109.442474]        svc_close_list+0x60/0x90
[  109.442476]        svc_close_net+0x49/0x1a0
[  109.442478]        svc_shutdown_net+0x12/0x40
[  109.442480]        nfsd_destroy+0xc5/0x180
[  109.442482]        nfsd+0x1bc/0x270
[  109.442483]        kthread+0x194/0x1b0
[  109.442487]        ret_from_fork+0x22/0x30
[  109.442492]
[  109.442492] other info that might help us debug this:
[  109.442492]
[  109.442493]  Possible unsafe locking scenario:
[  109.442493]
[  109.442493]        CPU0                    CPU1
[  109.442494]        ----                    ----
[  109.442495]   lock(&serv->sv_lock);
[  109.442496]                                lock((softirq_ctrl.lock).lock);
[  109.442498]                                lock(&serv->sv_lock);
[  109.442499]   lock((softirq_ctrl.lock).lock);
[  109.442501]
[  109.442501]  *** DEADLOCK ***
[  109.442501]
[  109.442501] 3 locks held by nfsd/1032:
[  109.442503]  #0: ffffffff93b49258 (nfsd_mutex){+.+.}-{3:3}, at: nfsd+0x19a/0x270
[  109.442508]  #1: ffff994245cb00b0 (&serv->sv_lock){+.+.}-{0:0}, at: svc_close_list+0x1f/0x90
[  109.442512]  #2: ffffffff93a81b20 (rcu_read_lock){....}-{1:2}, at: rt_spin_lock+0x5/0xc0
[  109.442518]
[  109.442518] stack backtrace:
[  109.442519] CPU: 0 PID: 1032 Comm: nfsd Not tainted 5.10.16-rt30 #1
[  109.442522] Hardware name: Supermicro X9DRL-3F/iF/X9DRL-3F/iF, BIOS 3.2 09/22/2015
[  109.442524] Call Trace:
[  109.442527]  dump_stack+0x77/0x97
[  109.442533]  check_noncircular+0xdc/0xf0
[  109.442546]  __lock_acquire+0x1264/0x20b0
[  109.442553]  lock_acquire+0xc2/0x400
[  109.442564]  rt_spin_lock+0x2b/0xc0
[  109.442570]  __local_bh_disable_ip+0xd9/0x270
[  109.442573]  svc_xprt_do_enqueue+0xc0/0x4d0
[  109.442577]  svc_close_list+0x60/0x90
[  109.442581]  svc_close_net+0x49/0x1a0
[  109.442585]  svc_shutdown_net+0x12/0x40
[  109.442588]  nfsd_destroy+0xc5/0x180
[  109.442590]  nfsd+0x1bc/0x270
[  109.442595]  kthread+0x194/0x1b0
[  109.442600]  ret_from_fork+0x22/0x30
[  109.518225] nfsd: last server has exited, flushing export cache
[  OK  ] Stopped NFSv4 ID-name mapping service.
[  OK  ] Stopped GSSAPI Proxy Daemon.
[  OK  ] Stopped NFS Mount Daemon.
[  OK  ] Stopped NFS status monitor for NFSv2/3 locking..

Fixes: 719f8bcc ("svcrpc: fix xpt_list traversal locking on shutdown")
Signed-off-by: Joe Korty <joe.korty@concurrent-rt.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

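The inversion that lockdep reports above can be reduced to a two-lock ordering check. The sketch below is a toy, far simpler than lockdep: `struct order_graph`, `order_init()`, and `order_record()` are hypothetical, and only direct pairs of lock classes are tracked rather than full dependency chains.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

enum { LK_SOFTIRQ, LK_SV, LK_MAX };  /* the two lock classes in the report */

/* seen[a][b] is true once lock b has been taken while a was held. */
struct order_graph { bool seen[LK_MAX][LK_MAX]; };

void order_init(struct order_graph *g)
{
	memset(g, 0, sizeof(*g));
}

/* Record "held -> next"; return false if the reverse order was seen before. */
bool order_record(struct order_graph *g, int held, int next)
{
	assert(held >= 0 && held < LK_MAX && next >= 0 && next < LK_MAX);
	g->seen[held][next] = true;
	return !g->seen[next][held];
}
```

Recording `LK_SOFTIRQ -> LK_SV` (the spin_lock_bh() order) succeeds. Recording `LK_SV -> LK_SOFTIRQ` afterwards (a plain spin_lock() followed by local_bh_disable()) returns false, which is the inversion the fix removes.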

svcrdma: disable timeouts on rdma backchannel · 6820bf77

Authored by Timo Rothenpieler on Feb 23, 2021

This brings it in line with the regular TCP backchannel, which also has
all those timeouts disabled.

This prevents the backchannel from timing out, which left some async
operations, such as server-side copy, stuck indefinitely on the client side.
Signed-off-by: Timo Rothenpieler <timo@rothenpieler.org>
Fixes: 5d252f90 ("svcrdma: Add class for RDMA backwards direction transport")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

17 Feb, 2021 3 commits

SUNRPC: Further clean up svc_tcp_sendmsg() · 4d12b727

Authored by Chuck Lever on Feb 16, 2021

Clean up: The msghdr is no longer needed in the caller.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

SUNRPC: Remove redundant socket flags from svc_tcp_sendmsg() · 987c7b1d

Authored by Trond Myklebust on Feb 16, 2021

Now that the caller controls the TCP_CORK socket option, it is redundant
to set MSG_MORE and MSG_SENDPAGE_NOTLAST in the calls to
kernel_sendpage().
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

SUNRPC: Use TCP_CORK to optimise send performance on the server · e0a912e8

Authored by Trond Myklebust on Feb 16, 2021

Use a counter to keep track of how many requests are queued behind the
xprt->xpt_mutex, and keep TCP_CORK set until the queue is empty.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Link: https://lore.kernel.org/linux-nfs/20210213202532.23146-1-trondmy@kernel.org/T/#u
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

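The counter-based corking described in this commit message can be modeled without a real socket. The sketch below is single-threaded and purely illustrative: `struct cork_model`, `cm_enqueue()`, and `cm_send_one()` are hypothetical, the cork state is only recorded rather than applied with a socket option, and a real implementation would need an atomic counter because senders increment it before they hold the mutex.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a transport; cork state is just recorded. */
struct cork_model {
	int sendqlen;   /* senders queued or sending */
	bool corked;    /* what TCP_CORK would currently be set to */
	int uncorks;    /* how many times the socket was uncorked */
};

/* A sender announces itself before waiting for the send mutex. */
void cm_enqueue(struct cork_model *cm)
{
	cm->sendqlen++;
}

/* Performed with the send mutex held: cork, send, uncork if last. */
void cm_send_one(struct cork_model *cm)
{
	assert(cm->sendqlen > 0);
	cm->corked = true;
	/* ... the reply would be written to the socket here ... */
	if (--cm->sendqlen == 0) {
		cm->corked = false;
		cm->uncorks++;
	}
}
```

With three senders queued, the socket stays corked across the first two sends and is uncorked once, after the third.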

15 Feb, 2021 1 commit

svcrdma: Hold private mutex while invoking rdma_accept() · 0ac24c32

Authored by Chuck Lever on Feb 09, 2021

RDMA core mutex locking was restructured by commit d114c6fe
("RDMA/cma: Add missing locking to rdma_accept()") [Aug 2020]. When
lock debugging is enabled, the RPC/RDMA server trips over the new
lockdep assertion in rdma_accept() because it doesn't call
rdma_accept() from its CM event handler.

As a temporary fix, have svc_rdma_accept() take the handler_mutex
explicitly. In the meantime, let's consider how to restructure the
RPC/RDMA transport to invoke rdma_accept() from the proper context.

Calls to svc_rdma_accept() are serialized with calls to
svc_rdma_free() by the generic RPC server layer.
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/linux-rdma/20210209154014.GO4247@nvidia.com/
Fixes: d114c6fe ("RDMA/cma: Add missing locking to rdma_accept()")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>

06 Feb, 2021 4 commits

xprtrdma: Clean up rpcrdma_prepare_readch() · 586a0787

Authored by Chuck Lever on Feb 05, 2021

Since commit 9ed5af26 ("SUNRPC: Clean up the handling of page
padding in rpc_prepare_reply_pages()") [Dec 2020] the NFS client
passes payload data to the transport with the padding in xdr->pages
instead of in the send buffer's tail kvec. There's no need for the
extra logic to advance the base of the tail kvec because the upper
layer no longer places XDR padding there.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

xprtrdma: Pad optimization, revisited · 2324fbed

Authored by Chuck Lever on Feb 04, 2021

The NetApp Linux team discovered that with NFS/RDMA servers that do
not support RFC 8797, the Linux client is forming NFSv4.x WRITE
requests incorrectly.

In this case, the Linux NFS client disables implicit chunk round-up
for odd-length Read and Write chunks. The goal was to support old
servers that needed that padding to be sent explicitly by clients.

In that case the Linux NFS client included the tail kvec in the Read
chunk, since the tail contains any needed padding. That meant a separate
memory registration was needed for the tail kvec, adding to the cost
of forming such requests. To avoid that cost for a mere 3 bytes of
zeroes that are always ignored by receivers, we try to use implicit
roundup when possible.

For NFSv4.x, the tail kvec also sometimes contains a trailing
GETATTR operation. The Linux NFS client unintentionally includes
that GETATTR operation in the Read chunk as well as inline.

The fix is simply to /never/ include the tail kvec when forming a
data payload Read chunk. The padding is thus now always present.

Note that since commit 9ed5af26 ("SUNRPC: Clean up the handling
of page padding in rpc_prepare_reply_pages()") [Dec 2020] the NFS
client passes payload data to the transport with the padding in
xdr->pages instead of in the send buffer's tail kvec. So now the
Linux NFS client appends XDR padding to all odd-sized Read chunks.
This shouldn't be a problem because:

 - RFC 8166-compliant servers are supposed to work with or without
   that XDR padding in Read chunks.

 - Since the padding is now in the same memory region as the data
   payload, a separate memory registration is not needed. In
   addition, the link layer extends data in RDMA Read responses to
   4-byte boundaries anyway. Thus there are now no savings when the
   padding is not included.

Because older kernels include the payload's XDR padding in the
tail kvec, a fix there will be more complicated. Thus backporting
this patch is not recommended.

Reported-by: Olga Kornievskaia <Olga.Kornievskaia@netapp.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Tom Talpey <tom@talpey.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

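The "mere 3 bytes of zeroes" mentioned in this commit message follows from XDR's 4-byte alignment rule. The arithmetic can be sketched as below. The helper names `pad_to_quad()` and `chunk_len_with_pad()` are made up for this example and are not the kernel's helpers.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes of zero padding needed to reach the next 4-byte XDR boundary. */
size_t pad_to_quad(size_t len)
{
	return (4 - (len & 3)) & 3;
}

/* Length of a chunk once the payload's XDR padding is included. */
size_t chunk_len_with_pad(size_t payload_len)
{
	size_t total = payload_len + pad_to_quad(payload_len);

	assert((total & 3) == 0);
	return total;
}
```

A 5608-byte payload, as in the trace output earlier in this log, needs no padding, while a 1021-byte payload is rounded up to 1024 bytes. The padding never exceeds 3 bytes.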

rpcrdma: Fix comments about reverse-direction operation · 84dff5eb

Authored by Chuck Lever on Feb 04, 2021

During the final stages of publication of RFC 8167, reviewers
requested that we use the term "reverse direction" rather than
"backwards direction". Update comments to reflect this preference.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Tom Talpey <tom@talpey.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

xprtrdma: Refactor invocations of offset_in_page() · 67b16625

Authored by Chuck Lever on Feb 04, 2021

Clean up so that offset_in_page() is invoked less often in the
most common case, which is mapping xdr->pages.
Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
Reviewed-by: NTom Talpey <tom@talpey.com>
Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>

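One way to picture this clean-up is a page walk in which only the first page can start at a non-zero offset. The sketch below is a userspace approximation with assumed names: `MODEL_PAGE_SIZE`, `model_offset_in_page()`, and `pages_spanned()` are hypothetical, and the 4096-byte page size is an assumption for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE 4096u  /* assumed page size for the example */

/* Userspace equivalent of the offset-within-page computation. */
size_t model_offset_in_page(uintptr_t addr)
{
	return addr & (MODEL_PAGE_SIZE - 1);
}

/*
 * Count the pages spanned by [base, base + len). The offset is computed
 * once for the first page; every later page starts at offset zero.
 */
size_t pages_spanned(uintptr_t base, size_t len)
{
	size_t off = model_offset_in_page(base);
	size_t pages = 0;

	assert(len > 0);
	while (len > 0) {
		size_t seg = MODEL_PAGE_SIZE - off;

		if (seg > len)
			seg = len;
		len -= seg;
		off = 0;
		pages++;
	}
	return pages;
}
```

An 8-byte range starting 4 bytes before a page boundary spans two pages, and the offset computation runs only once however many pages follow.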

openeuler / Kernel: last synced successfully almost 2 years ago