- 07 May 2018 (1 commit)
-
-
Committed by Chuck Lever
Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
- 11 April 2018 (3 commits)
-
-
Committed by Chuck Lever
With FRWR, the client transport can perform memory registration and post a Send with just a single ib_post_send. This reduces contention between the send_request path and the Send Completion handlers, and reduces the overhead of registering a chunk that has multiple segments. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
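The single-post technique chains the registration work request ahead of the Send work request so the HCA handles both in order from one doorbell. A minimal sketch of the idea, assuming the MR and SGE list are already built by the caller; the function and variable names here are illustrative, not xprtrdma's:

```c
#include <rdma/ib_verbs.h>

/* Sketch: chain a REG_MR WR ahead of the Send WR so both are handed to
 * the HCA with a single ib_post_send() call.  'qp', 'mr', 'sge' and
 * 'num_sge' are assumed to be set up by the caller.
 */
static int post_reg_and_send(struct ib_qp *qp, struct ib_mr *mr,
			     struct ib_sge *sge, int num_sge)
{
	struct ib_reg_wr reg_wr = { };
	struct ib_send_wr send_wr = { };
	struct ib_send_wr *bad_wr;

	reg_wr.wr.opcode = IB_WR_REG_MR;
	reg_wr.mr = mr;
	reg_wr.key = mr->rkey;
	reg_wr.access = IB_ACCESS_REMOTE_READ;
	reg_wr.wr.next = &send_wr;	/* registration is processed first */

	send_wr.opcode = IB_WR_SEND;
	send_wr.sg_list = sge;
	send_wr.num_sge = num_sge;
	send_wr.send_flags = IB_SEND_SIGNALED;

	/* One doorbell covers both work requests. */
	return ib_post_send(qp, &reg_wr.wr, &bad_wr);
}
```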
-
Committed by Chuck Lever
Clean up: The generic rq_connect_cookie is sufficient to detect RPC Call retransmission. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Committed by Chuck Lever
With v4.15, on one of my NFS/RDMA clients I measured nearly a doubling in the latency of small read and write system calls. There was no change in server round-trip time. The extra latency appears in the whole RPC execution path. "git bisect" settled on commit ccede759 ("xprtrdma: Spread reply processing over more CPUs"). After some experimentation, I found that leaving the WQ bound and allowing the scheduler to pick the dispatch CPU seems to eliminate the long latencies, and it does not introduce any new regressions. The fix is implemented by reverting only the part of commit ccede759 ("xprtrdma: Spread reply processing over more CPUs") that dispatches RPC replies specifically on the CPU where the matching RPC call was made. Interestingly, saving the CPU number and later queuing reply processing there was effective _only_ for NFS READ and WRITE requests. On my NUMA client, in-kernel RPC reply processing for asynchronous RPCs was dispatched on the same CPU where the RPC call was made, as expected. However, synchronous RPCs seem to get their reply dispatched on some other CPU than where the call was placed, every time. Fixes: ccede759 ("xprtrdma: Spread reply processing over ... ") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Cc: stable@vger.kernel.org # v4.15+ Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
- 23 January 2018 (2 commits)
-
-
Committed by Chuck Lever
Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Committed by Chuck Lever
Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
- 17 January 2018 (11 commits)
-
-
Committed by Chuck Lever
Clean up: Code review suggested that a common bit of code can be placed into a helper function, which gives us fewer places to stick an "I DMA unmapped something" trace point. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Committed by Chuck Lever
Clean up: struct rpcrdma_mw was named after Memory Windows, but xprtrdma no longer supports a Memory Window registration mode. Rename rpcrdma_mw and its fields to reduce confusion and make the code more sensible to read. Renaming "mw" was suggested by Tom Talpey, the author of the original xprtrdma implementation. It's a good idea, but I haven't done this until now because it's a huge diffstat for no benefit other than code readability. However, I'm about to introduce static trace points that expose a few of xprtrdma's internal data structures. They should make sense in the trace report, and it's reasonable to treat trace points as a kernel API contract which might be difficult to change later. While I'm churning things up, two additional changes: rename variables unhelpfully called "r" to "mr", to improve code clarity, and rename the MR-related helper functions using the form "rpcrdma_mr_<verb>", to be consistent with other areas of the code. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Committed by Chuck Lever
Clean up: Over time, the industry has adopted the term "frwr" instead of "frmr", and "frwr" is now more widely recognized. For the past couple of years I've attempted to add new code using "frwr", but there still remains plenty of older code that uses "frmr". Replace all usage of "frmr" with "frwr" to avoid confusion. While we're churning code, rename variables unhelpfully called "f" to "frwr", to improve code clarity. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Committed by Chuck Lever
Clean up. @rqst is set up differently for backchannel Replies. For example, rqst->rq_task and task->tk_client are both NULL. So it is easier to understand and maintain this code path if it is separated. Also, we can get rid of the confusing rl_connect_cookie hack in rpcrdma_bc_receive_call. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Committed by Chuck Lever
Since commit 5a6d1db4 ("SUNRPC: Add a transport-specific private field in rpc_rqst"), the rpc_rqst's for RPC-over-RDMA backchannel operations leave rq_buffer set to NULL. xprt_release does not invoke ->op->buf_free when rq_buffer is NULL. The RPCRDMA_REQ_F_BACKCHANNEL check in xprt_rdma_free is therefore redundant because xprt_rdma_free is not invoked for backchannel requests. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Committed by Chuck Lever
Save more space in struct rpcrdma_xprt by removing the redundant "addr" field from struct rpcrdma_create_data_internal. Wherever we have rpcrdma_xprt, we also have the rpc_xprt, which has a sockaddr_storage field with the same content. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Committed by Chuck Lever
This makes the address strings available for debugging messages in earlier stages of transport set up. The first benefit is to get rid of the single-use rep_remote_addr field, saving 128+ bytes in struct rpcrdma_ep. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Committed by Chuck Lever
Clean up. Remove fields that should have been removed by commit b3221d6a ("xprtrdma: Remove logic that constructs RDMA_MSGP type calls"). Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Committed by Chuck Lever
Clean up. Commit b5f0afbe ("xprtrdma: Per-connection pad optimization") should have removed this. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Committed by Chuck Lever
Refactoring change: Remote Invalidation is particular to the memory registration mode that is in use. Use a callout instead of a generic function to handle Remote Invalidation. This gets rid of the 8-byte flags field in struct rpcrdma_mw, of which only a single bit flag has been allocated. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
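One way to picture the callout is as a per-registration-mode method pointer alongside the other memory-registration operations: FRWR supplies a real handler, other modes leave it NULL. This is only a sketch under that assumption; the member and handler names below are illustrative, not necessarily the ones xprtrdma settled on:

```c
#include <linux/list.h>

struct rpcrdma_rep;

/* Sketch: Remote Invalidation becomes a per-registration-mode method
 * instead of a generic helper gated on a flags field.  Names here are
 * illustrative.
 */
struct my_memreg_ops {
	/* ... map/unmap methods elided ... */
	void (*ro_reminv)(struct rpcrdma_rep *rep, struct list_head *mrs);
};

/* FRWR handler: mark MRs matching the remotely invalidated rkey done. */
static void frwr_reminv(struct rpcrdma_rep *rep, struct list_head *mrs)
{
	/* walk 'mrs', clear the one whose rkey the server invalidated */
}

/* Reply path: modes that cannot use Remote Invalidation leave the
 * callout NULL and skip it entirely.
 */
static inline void reply_handler_reminv(const struct my_memreg_ops *ops,
					struct rpcrdma_rep *rep,
					struct list_head *mrs)
{
	if (ops->ro_reminv)
		ops->ro_reminv(rep, mrs);
}
```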
-
Committed by Chuck Lever
The backchannel code uses rpcrdma_recv_buffer_put to add new reps to the free rep list. This also decrements rb_recv_count, which spoofs the receive overrun logic in rpcrdma_buffer_get_rep. Commit 9b06688b ("xprtrdma: Fix additional uses of spin_lock_irqsave(rb_lock)") replaced the original open-coded list_add with a call to rpcrdma_recv_buffer_put(), but then a year later, commit 05c97466 ("xprtrdma: Fix receive buffer accounting") added rep accounting to rpcrdma_recv_buffer_put. It was an oversight to let the backchannel continue to use this function. To fix this, combine the "add to free list" logic with rpcrdma_create_rep. Also, do not allocate RPCRDMA_MAX_BC_REQUESTS rpcrdma_reps in rpcrdma_buffer_create and then allocate additional rpcrdma_reps in rpcrdma_bc_setup_reps. Allocating the extra reps during backchannel set-up is sufficient. Fixes: 05c97466 ("xprtrdma: Fix receive buffer accounting") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
- 16 December 2017 (1 commit)
-
-
Committed by Chuck Lever
Commit d8f532d2 ("xprtrdma: Invoke rpcrdma_reply_handler directly from RECV completion") introduced a performance regression for NFS I/O small enough to not need memory registration. In multi-threaded benchmarks that generate primarily small I/O requests, IOPS throughput is reduced by nearly a third. This patch restores the previous level of throughput. Because workqueues are typically BOUND (in particular ib_comp_wq, nfsiod_workqueue, and rpciod_workqueue), NFS/RDMA workloads tend to aggregate on the CPU that is handling Receive completions. The usual approach to addressing this problem is to create a QP and CQ for each CPU, and then schedule transactions on the QP for the CPU where you want the transaction to complete. The transaction then does not require an extra context switch during completion to end up on the same CPU where the transaction was started. This approach doesn't work for the Linux NFS/RDMA client because currently the Linux NFS client does not support multiple connections per client-server pair, and the RDMA core API does not make it straightforward for ULPs to determine which CPU is responsible for handling Receive completions for a CQ. So for the moment, record the CPU number in the rpcrdma_req before the transport sends each RPC Call. Then during Receive completion, queue the RPC completion on that same CPU. Additionally, move all RPC completion processing to the deferred handler so that even RPCs with simple small replies complete on the CPU that sent the corresponding RPC Call. Fixes: d8f532d2 ("xprtrdma: Invoke rpcrdma_reply_handler ...") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
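The workaround amounts to remembering which CPU issued the Call and steering the deferred reply work back to it. A minimal sketch of that pattern, assuming a per-request work item and a dedicated reply workqueue; the struct, field, and function names here are illustrative:

```c
#include <linux/smp.h>
#include <linux/workqueue.h>

struct my_req {
	int			cpu;		/* CPU that sent the RPC Call */
	struct work_struct	reply_work;	/* deferred Reply processing  */
};

/* Call side: record the submitting CPU just before posting the Send. */
static void record_send_cpu(struct my_req *req)
{
	req->cpu = raw_smp_processor_id();
}

/* Receive completion: queue Reply processing on that same CPU. */
static void dispatch_reply(struct workqueue_struct *wq, struct my_req *req)
{
	queue_work_on(req->cpu, wq, &req->reply_work);
}
```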
-
- 18 November 2017 (12 commits)
-
-
Committed by Chuck Lever
Credit work contributed by Oracle engineers since 2014. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Committed by Chuck Lever
Clean up: C-structure style XDR encoding and decoding logic has been replaced over the past several merge windows on both the client and server. These data structures are no longer used. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Devesh Sharma <devesh.sharma@broadcom.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Committed by Chuck Lever
The sendctx circular queue now guarantees that xprtrdma cannot overflow the Send Queue, so remove the remaining bits of the original Send WQE counting mechanism. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Committed by Chuck Lever
When an RPC Call includes a file data payload, that payload can come from pages in the page cache, or a user buffer (for direct I/O). If the payload can fit inline, xprtrdma includes it in the Send using a scatter-gather technique. xprtrdma mustn't allow the RPC consumer to re-use the memory where that payload resides before the Send completes. Otherwise, the new contents of that memory would be exposed by an HCA retransmit of the Send operation. So, block RPC completion on Send completion, but only in the case where a separate file data payload is part of the Send. This prevents the reuse of that memory while it is still part of a Send operation without an undue cost to other cases. Waiting is avoided in the common case because typically the Send will have completed long before the RPC Reply arrives. These days, an RPC timeout will trigger a disconnect, which tears down the QP. The disconnect flushes all waiting Sends. This bounds the amount of time the reply handler has to wait for a Send completion. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
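In effect, the reply path completes the RPC immediately unless the Send carried a gathered payload, in which case it waits for the Send completion first. A stripped-down sketch of that gate using a struct completion; the flag and field names are illustrative, not xprtrdma's actual mechanism:

```c
#include <linux/completion.h>

struct my_req {
	bool			payload_in_send;	/* gathered file data in the Send? */
	struct completion	send_done;		/* fired by the Send completion     */
};

/* Send completion handler: the HCA no longer owns the caller's pages. */
static void on_send_done(struct my_req *req)
{
	complete(&req->send_done);
}

/* Reply handler: block only when the Send still references caller memory. */
static void complete_rpc(struct my_req *req)
{
	if (req->payload_in_send)
		wait_for_completion(&req->send_done);
	/* ...hand the Reply up to the RPC layer... */
}
```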
-
Committed by Chuck Lever
Invoke a common routine for releasing hardware resources (for example, invalidating MRs). This needs to be done whether an RPC Reply has arrived or the RPC was terminated early. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Committed by Chuck Lever
We have one boolean flag in rpcrdma_req today. I'd like to add more flags, so convert that boolean to a bit flag. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
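Converting the lone boolean into a numbered bit in a flags word leaves room for the additional flags later patches need, and the atomic bitops keep updates safe from concurrent contexts. A small sketch of the pattern; the flag and struct names are illustrative:

```c
#include <linux/bitops.h>

/* Bit numbers within req->flags; more can be added without growing the
 * structure.  These names are illustrative.
 */
enum {
	MY_REQ_F_BACKCHANNEL = 0,
	MY_REQ_F_PENDING,
};

struct my_req {
	unsigned long flags;
};

static void mark_backchannel(struct my_req *req)
{
	set_bit(MY_REQ_F_BACKCHANNEL, &req->flags);
}

static bool is_backchannel(const struct my_req *req)
{
	return test_bit(MY_REQ_F_BACKCHANNEL, &req->flags);
}
```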
-
Committed by Chuck Lever
Problem statement: Recently Sagi Grimberg <sagi@grimberg.me> observed that kernel RDMA-enabled storage initiators don't handle delayed Send completion correctly. If Send completion is delayed beyond the end of a ULP transaction, the ULP may release resources that are still being used by the HCA to complete a long-running Send operation. This is a common design trait amongst our initiators. Most Send operations are faster than the ULP transaction they are part of. Waiting for a completion for these is typically unnecessary. Infrequently, a network partition or some other problem crops up where an ordering problem can occur. In NFS parlance, the RPC Reply arrives and completes the RPC, but the HCA is still retrying the Send WR that conveyed the RPC Call. In this case, the HCA can try to use memory that has been invalidated or DMA unmapped, and the connection is lost. If that memory has been re-used for something else (possibly not related to NFS), the Send retransmission exposes that data on the wire. Thus we cannot assume that it is safe to release Send-related resources just because a ULP reply has arrived. After some analysis, we have determined that the completion housekeeping will not be difficult for xprtrdma: Inline Send buffers are registered via the local DMA key, and are already left DMA mapped for the lifetime of a transport connection, thus no additional handling is necessary for those. Gathered Sends involving page cache pages _will_ need to DMA unmap those pages after the Send completes. But like inline send buffers, they are registered via the local DMA key, and thus will not need to be invalidated. In addition, RPC completion will need to wait for Send completion in the latter case. However, nearly always, the Send that conveys the RPC Call will have completed long before the RPC Reply arrives, and thus no additional latency will be accrued. Design notes: In this patch, the rpcrdma_sendctx object is introduced, and a lock-free circular queue is added to manage a set of them per transport. The RPC client's send path already prevents sending more than one RPC Call at the same time. This allows us to treat the consumer side of the queue (rpcrdma_sendctx_get_locked) as if there is a single consumer thread. The producer side of the queue (rpcrdma_sendctx_put_locked) is invoked only from the Send completion handler, which is a single thread of execution (soft IRQ). The only care that needs to be taken is with the tail index, which is shared between the producer and consumer. Only the producer updates the tail index. The consumer compares the head with the tail to ensure that a sendctx that is in use is never handed out again (or, expressed more conventionally, the queue is empty). When the sendctx queue empties completely, there are enough Sends outstanding that posting more Send operations can result in a Send Queue overflow. In this case, the ULP is told to wait and try again. This introduces strong Send Queue accounting to xprtrdma. As a final touch, Jason Gunthorpe <jgunthorpe@obsidianresearch.com> suggested a mechanism that does not require signaling every Send. We signal once every N Sends, and perform SGE unmapping of N Send operations during that one completion. Reported-by: Sagi Grimberg <sagi@grimberg.me> Suggested-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
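The queue described here is a single-producer/single-consumer ring: the send path consumes entries, and the Send completion handler produces them back, so only the shared tail index needs care. A minimal sketch of the get/put pair under those assumptions, with a power-of-two ring size; the names are illustrative, not the xprtrdma functions:

```c
#include <linux/atomic.h>

#define SENDCTX_RING_SIZE 64	/* power of two, illustrative */

struct my_sendctx {
	int unmap_count;	/* SGEs to DMA unmap at completion, etc. */
};

struct my_buffer {
	unsigned long		head;	/* consumer index: next slot handed out */
	unsigned long		tail;	/* producer index: last slot retired    */
	struct my_sendctx	*ring[SENDCTX_RING_SIZE];
};

/* Consumer: called from the send path, which is already serialized. */
static struct my_sendctx *sendctx_get(struct my_buffer *buf)
{
	unsigned long next = (buf->head + 1) & (SENDCTX_RING_SIZE - 1);

	/* Every sendctx is still in flight: posting another Send could
	 * overflow the Send Queue, so tell the caller to wait.
	 */
	if (next == smp_load_acquire(&buf->tail))
		return NULL;
	buf->head = next;
	return buf->ring[next];
}

/* Producer: called only from the Send completion handler. */
static void sendctx_put(struct my_buffer *buf)
{
	unsigned long next = (buf->tail + 1) & (SENDCTX_RING_SIZE - 1);

	/* Publish the retired slot so the consumer may reuse it. */
	smp_store_release(&buf->tail, next);
}
```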
-
Committed by Chuck Lever
Clean up: Make rpcrdma_prepare_send_sges() return a negative errno instead of a bool. Soon callers will want distinct treatments of different types of failures. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Committed by Chuck Lever
We need to decode and save the incoming rdma_credits field _after_ we know that the direction of the message is "forward direction Reply". Otherwise, the credits value in reverse direction Calls is also used to update the forward direction credits. It is safe to decode the rdma_credits field in rpcrdma_reply_handler now that rpcrdma_reply_handler is single-threaded. Receives complete in the same order as they were sent on the NFS server. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Committed by Chuck Lever
I noticed that the soft IRQ thread looked pretty busy under heavy I/O workloads. perf suggested one area that was expensive was the queue_work() call in rpcrdma_wc_receive. That gave me some ideas. Instead of scheduling a separate worker to process RPC Replies, promote the Receive completion handler to IB_POLL_WORKQUEUE, and invoke rpcrdma_reply_handler directly. Note that the poll workqueue is single-threaded. In order to keep memory invalidation from serializing all RPC Replies, handle any necessary invalidation tasks in a separate multi-threaded workqueue. This provides a two-tier scheme, similar to OS I/O interrupt handlers: a fast interrupt handler that schedules the slow handler and re-enables the interrupt, and a slower handler that is invoked for any needed heavy lifting. Benefits include: one less context switch for RPCs that don't register memory; Receive completion handling is moved out of soft IRQ context to make room for other users of soft IRQ; and the same CPU core now DMA syncs and XDR decodes the Receive buffer. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
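The two-tier arrangement falls out of the CQ polling context plus a separate workqueue for the heavier invalidation work. A sketch of the setup under those assumptions; the workqueue name and queue depth handling here are illustrative:

```c
#include <rdma/ib_verbs.h>
#include <linux/workqueue.h>

/* Receive CQ polled from the IB workqueue context, so the completion
 * handler may call the reply handler directly instead of queueing it.
 */
static struct ib_cq *alloc_recv_cq(struct ib_device *device, void *ctx,
				   int depth)
{
	return ib_alloc_cq(device, ctx, depth, 0 /* comp_vector */,
			   IB_POLL_WORKQUEUE);
}

/* Separate multi-threaded queue for memory invalidation, so one slow
 * Reply cannot serialize all the others.
 */
static struct workqueue_struct *alloc_deferred_wq(void)
{
	return alloc_workqueue("my_reply_deferred", WQ_MEM_RECLAIM, 0);
}
```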
-
Committed by Chuck Lever
Clean up: I'd like to be able to invoke the tail of rpcrdma_reply_handler in two different places. Split the tail out into its own helper function. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Committed by Chuck Lever
Clean up: Make it easier to pass the decoded XID, vers, credits, and proc fields around by moving these variables into struct rpcrdma_rep. Note: the credits field will be handled in a subsequent patch. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
- 17 October 2017 (1 commit)
-
-
Committed by Chuck Lever
Clean up: There are no remaining callers of this method. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
- 06 September 2017 (1 commit)
-
-
Committed by Chuck Lever
Adopt the use of xprt_pin_rqst to eliminate contention between Call-side users of rb_lock and the use of rb_lock in rpcrdma_reply_handler. This replaces the mechanism introduced in 431af645 ("xprtrdma: Fix client lock-up after application signal fires"). Use recv_lock to quickly find the completing rqst, pin it, then drop the lock. At that point invalidation and pull-up of the Reply XDR can be done. Both are often expensive operations. Finally, take recv_lock again to signal completion to the RPC layer. It also protects adjustment of "cwnd". This greatly reduces the amount of time a lock is held by the reply handler. Comparing lock_stat results shows a marked decrease in contention on rb_lock and recv_lock. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> [trond.myklebust@primarydata.com: Remove call to rpcrdma_buffer_put() from the "out_norqst:" path in rpcrdma_reply_handler.] Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
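The shape of the reply handler after this change is: look up and pin the rqst under recv_lock, drop the lock for the expensive invalidation and XDR pull-up, then retake it only to complete the RPC. A sketch of that sequence using the generic SUNRPC helpers, with error handling trimmed; the recv_lock field name reflects this era of the code and is an assumption here:

```c
#include <linux/sunrpc/xprt.h>

/* Sketch: pin the rqst so it cannot be released while the lock is
 * dropped for invalidation and Reply pull-up.
 */
static void handle_reply(struct rpc_xprt *xprt, __be32 xid, int copied)
{
	struct rpc_rqst *rqst;

	spin_lock(&xprt->recv_lock);
	rqst = xprt_lookup_rqst(xprt, xid);
	if (!rqst) {
		spin_unlock(&xprt->recv_lock);
		return;
	}
	xprt_pin_rqst(rqst);
	spin_unlock(&xprt->recv_lock);

	/* ...invalidate MRs, pull up the Reply XDR (both may sleep)... */

	spin_lock(&xprt->recv_lock);
	xprt_complete_rqst(rqst->rq_task, copied);
	xprt_unpin_rqst(rqst);
	spin_unlock(&xprt->recv_lock);
}
```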
-
- 23 August 2017 (1 commit)
-
-
Committed by Chuck Lever
To reduce false cacheline sharing, separate counters that are likely to be accessed in the Call path from those accessed in the Reply path. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
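Keeping the Call-path counters and the Reply-path counters on separate cachelines stops a store on one path from bouncing the line the other path is reading. A tiny sketch of the layout trick; the member names are illustrative:

```c
#include <linux/cache.h>

/* Counters bumped on the Call (send) path share one cacheline, while
 * counters bumped on the Reply (receive) path start on a fresh one,
 * so the two paths no longer write-share a line.
 */
struct my_xprt_stats {
	/* Call path */
	unsigned long	read_chunk_count;
	unsigned long	write_chunk_count;
	unsigned long	reply_chunk_count;

	/* Reply path */
	unsigned long	fixup_copy_count ____cacheline_aligned_in_smp;
	unsigned long	bad_reply_count;
};
```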
-
- 16 August 2017 (1 commit)
-
-
Committed by Chuck Lever
Re-arrange the pointer arithmetic in the chunk list encoders to eliminate several more integer multiplication instructions during Transport Header encoding. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
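The saving comes from walking the segment array with a pointer that is bumped once per iteration instead of recomputing base + i * sizeof(seg) on every access. A generic illustration of the transformation, not the actual encoder code:

```c
#include <linux/types.h>

struct seg {
	u32 handle;
	u32 length;
	u64 offset;
};

/* Indexed form: each seg[i] access hides an i * sizeof(struct seg)
 * multiply inside the addressing.
 */
static u32 sum_lengths_indexed(const struct seg *seg, int nsegs)
{
	u32 total = 0;
	int i;

	for (i = 0; i < nsegs; i++)
		total += seg[i].length;
	return total;
}

/* Pointer-advance form: one add per iteration, no multiplies. */
static u32 sum_lengths_ptr(const struct seg *seg, int nsegs)
{
	u32 total = 0;

	for (; nsegs; nsegs--, seg++)
		total += seg->length;
	return total;
}
```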
-
- 12 August 2017 (2 commits)
-
-
Committed by Chuck Lever
Initialize an xdr_stream at the top of rpcrdma_marshal_req(), and use it to encode the fixed transport header fields. This xdr_stream will be used to encode the chunk lists in a subsequent patch. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Committed by Chuck Lever
Clean up: The caller already has rpcrdma_xprt, so pass that directly instead. And provide a documenting comment for this critical function. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
- 08 August 2017 (3 commits)
-
-
Committed by Chuck Lever
Clean up: Replace C-structure based XDR decoding for consistency with other areas. struct rpcrdma_rep is rearranged slightly so that the relevant fields are in cache when the Receive completion handler is invoked. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Committed by Chuck Lever
This field is no longer used outside the Receive completion handler. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-
Committed by Chuck Lever
Transport header decoding deals with untrusted input data, therefore decoding this header needs to be hardened. Adopt the same infrastructure that is used when XDR decoding NFS replies. This is slightly more CPU-intensive than the replaced code, but we're not adding new atomics, locking, or context switches. The cost is manageable. Start by initializing an xdr_stream in rpcrdma_reply_handler(). Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
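Hardened decoding means every fixed-size read goes through the xdr_stream, so a short or malformed header fails cleanly instead of running off the end of the receive buffer. A sketch of the first step (the four fixed header words), assuming the classic three-argument xdr_init_decode of this era; the helper name is illustrative:

```c
#include <linux/errno.h>
#include <linux/sunrpc/xdr.h>

/* Sketch: parse the four fixed RPC-over-RDMA header words through an
 * xdr_stream.  A truncated or garbage header makes xdr_inline_decode()
 * return NULL rather than letting us read past the buffer.
 */
static int decode_rdma_header(struct xdr_stream *xdr, struct xdr_buf *buf,
			      __be32 *p, __be32 *xid, u32 *vers,
			      u32 *credits, u32 *proc)
{
	xdr_init_decode(xdr, buf, p);

	p = xdr_inline_decode(xdr, 4 * sizeof(*p));
	if (unlikely(!p))
		return -EIO;

	*xid = *p++;
	*vers = be32_to_cpup(p++);
	*credits = be32_to_cpup(p++);
	*proc = be32_to_cpup(p);
	return 0;
}
```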
-
- 14 July 2017 (1 commit)
-
-
Committed by Chuck Lever
After a signal, the RPC client aborts synchronous RPCs running on behalf of the signaled application. The server is still executing those RPCs, and will write the results back into the client's memory when it's done. By the time the server writes the results, that memory is likely being used for other purposes. Therefore xprtrdma has to immediately invalidate all memory regions used by those aborted RPCs to prevent the server's writes from clobbering that re-used memory. With FMR memory registration, invalidation takes a relatively long time. In fact, the invalidation is often still running when the server tries to write the results into the memory regions that are being invalidated. This sets up a race between two processes: 1. After the signal, xprt_rdma_free calls ro_unmap_safe. 2. While ro_unmap_safe is still running, the server replies and rpcrdma_reply_handler runs, calling ro_unmap_sync. Both processes invoke ib_unmap_fmr on the same FMR. The mlx4 driver allows two ib_unmap_fmr calls on the same FMR at the same time, but HCAs generally don't tolerate this. Sometimes this can result in a system crash. If the HCA happens to survive, rpcrdma_reply_handler continues. It removes the rpc_rqst from rq_list and releases the transport_lock. This enables xprt_rdma_free to run in another process, and the rpc_rqst is released while rpcrdma_reply_handler is still waiting for the ib_unmap_fmr call to finish. But further down in rpcrdma_reply_handler, the transport_lock is taken again, and "rqst" is dereferenced. If "rqst" has already been released, this triggers a general protection fault. Since bottom-halves are disabled, the system locks up. Address both issues by reversing the order of the xprt_lookup_rqst call and the ro_unmap_sync call. Introduce a separate lookup mechanism for rpcrdma_req's to enable calling ro_unmap_sync before xprt_lookup_rqst. Now the handler takes the transport_lock once and holds it for the XID lookup and RPC completion. BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=305 Fixes: 68791649 ('xprtrdma: Invalidate in the RPC reply ... ') Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
-