提交 · b6e717cbf28c8348d34be472f119b0ea82e5e8e7 · openanolis / cloud-kernel

12 5月, 2018 1 次提交

xprtrdma: Prepare RPC/RDMA includes for server-side trace points · b6e717cb

由 Chuck Lever 提交于 5月 07, 2018

Clean up: Move #include <trace/events/rpcrdma.h> into source files,
similar to how it is done with trace/events/sunrpc.h.

Server-side trace points will be part of the rpcrdma subsystem,
just like the client-side trace points.
Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>

b6e717cb

11 4月, 2018 2 次提交

xprtrdma: ->send_request returns -EAGAIN when there are no free MRs · 9e679d5e

由 Chuck Lever 提交于 2月 28, 2018

Currently, when the MR free list is exhausted during marshaling, the
RPC/RDMA transport places the RPC task on the delayq, which forces a
wait for HZ >> 2 before the marshal and send is retried.

With this change, the transport now places such an RPC task on the
pending queue, and wakes it just as soon as more MRs have been
created. Creating more MRs typically takes less than a millisecond,
and this waking mechanism is less deadlock-prone.

Moreover, the waiting RPC task is holding the transport's write
lock, which blocks the transport from sending RPCs. Therefore faster
recovery from MR exhaustion is desirable.

This is the same mechanism that the TCP transport utilizes when
handling write buffer space exhaustion.
Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>

9e679d5e

xprtrdma: Fix latency regression on NUMA NFS/RDMA clients · 6720a899

由 Chuck Lever 提交于 2月 28, 2018

With v4.15, on one of my NFS/RDMA clients I measured a nearly
doubling in the latency of small read and write system calls. There
was no change in server round trip time. The extra latency appears
in the whole RPC execution path.

"git bisect" settled on commit ccede759 ("xprtrdma: Spread reply
processing over more CPUs") .

After some experimentation, I found that leaving the WQ bound and
allowing the scheduler to pick the dispatch CPU seems to eliminate
the long latencies, and it does not introduce any new regressions.

The fix is implemented by reverting only the part of
commit ccede759 ("xprtrdma: Spread reply processing over more
CPUs") that dispatches RPC replies specifically on the CPU where the
matching RPC call was made.

Interestingly, saving the CPU number and later queuing reply
processing there was effective _only_ for a NFS READ and WRITE
request. On my NUMA client, in-kernel RPC reply processing for
asynchronous RPCs was dispatched on the same CPU where the RPC call
was made, as expected. However synchronous RPCs seem to get their
reply dispatched on some other CPU than where the call was placed,
every time.

Fixes: ccede759 ("xprtrdma: Spread reply processing over ... ")
Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
Cc: stable@vger.kernel.org # v4.15+
Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>

6720a899

03 2月, 2018 1 次提交

xprtrdma: Fix calculation of ri_max_send_sges · 1179e2c2

由 Chuck Lever 提交于 1月 31, 2018

Commit 16f906d6 ("xprtrdma: Reduce required number of send
SGEs") introduced the rpcrdma_ia::ri_max_send_sges field. This fixes
a problem where xprtrdma would not work if the device's max_sge
capability was small (low single digits).

At least RPCRDMA_MIN_SEND_SGES are needed for the inline parts of
each RPC. ri_max_send_sges is set to this value:

  ia->ri_max_send_sges = max_sge - RPCRDMA_MIN_SEND_SGES;

Then when marshaling each RPC, rpcrdma_args_inline uses that value
to determine whether the device has enough Send SGEs to convey an
NFS WRITE payload inline, or whether instead a Read chunk is
required.

More recently, commit ae72950a ("xprtrdma: Add data structure to
manage RDMA Send arguments") used the ri_max_send_sges value to
calculate the size of an array, but that commit erroneously assumed
ri_max_send_sges contains a value similar to the device's max_sge,
and not one that was reduced by the minimum SGE count.

This assumption results in the calculated size of the sendctx's
Send SGE array to be too small. When the array is used to marshal
an RPC, the code can write Send SGEs into the following sendctx
element in that array, corrupting it. When the device's max_sge is
large, this issue is entirely harmless; but it results in an oops
in the provider's post_send method, if dev.attrs.max_sge is small.

So let's straighten this out: ri_max_send_sges will now contain a
value with the same meaning as dev.attrs.max_sge, which makes
the code easier to understand, and enables rpcrdma_sendctx_create
to calculate the size of the SGE array correctly.
Reported-by: NMichal Kalderon <Michal.Kalderon@cavium.com>
Fixes: 16f906d6 ("xprtrdma: Reduce required number of send SGEs")
Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
Tested-by: NMichal Kalderon <Michal.Kalderon@cavium.com>
Cc: stable@vger.kernel.org # v4.10+
Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>

1179e2c2

23 1月, 2018 5 次提交

xprtrdma: Fix "bytes registered" accounting · aae2349c

由 Chuck Lever 提交于 1月 03, 2018

The contents of seg->mr_len changed when ->ro_map stopped returning
the full chunk length in the first segment. Count the full length of
each Write chunk, not the length of the first segment (which now can
only be as large as a page).

Fixes: 9d6b0409 ("xprtrdma: Place registered MWs on a ... ")
Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>

aae2349c

xprtrdma: Add trace points in reply decoder path · e11b7c96

由 Chuck Lever 提交于 12月 20, 2017

This includes decoding Write and Reply chunks, and fixing up inline
payloads.
Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>

e11b7c96

C
xprtrdma: Add trace points to instrument memory registration · 58f10ad4
由 Chuck Lever 提交于 12月 20, 2017
```
Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
```
58f10ad4

xprtrdma: Add trace points in the RPC Reply handler paths · b4a7f91c

由 Chuck Lever 提交于 12月 20, 2017

Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>

b4a7f91c

xprtrdma: Add trace points in RPC Call transmit paths · ab03eff5

由 Chuck Lever 提交于 12月 20, 2017

Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>

ab03eff5

17 1月, 2018 4 次提交

xprtrdma: Remove usage of "mw" · 96ceddea

由 Chuck Lever 提交于 12月 14, 2017

Clean up: struct rpcrdma_mw was named after Memory Windows, but
xprtrdma no longer supports a Memory Window registration mode.
Rename rpcrdma_mw and its fields to reduce confusion and make
the code more sensible to read.

Renaming "mw" was suggested by Tom Talpey, the author of the
original xprtrdma implementation. It's a good idea, but I haven't
done this until now because it's a huge diffstat for no benefit
other than code readability.

However, I'm about to introduce static trace points that expose
a few of xprtrdma's internal data structures. They should make sense
in the trace report, and it's reasonable to treat trace points as a
kernel API contract which might be difficult to change later.

While I'm churning things up, two additional changes:
- rename variables unhelpfully called "r" to "mr", to improve code
  clarity, and
- rename the MR-related helper functions using the form
  "rpcrdma_mr_<verb>", to be consistent with other areas of the
  code.
Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>

96ceddea

xprtrdma: Split xprt_rdma_send_request · cf73daf5

由 Chuck Lever 提交于 12月 14, 2017

Clean up. @rqst is set up differently for backchannel Replies. For
example, rqst->rq_task and task->tk_client are both NULL. So it is
easier to understand and maintain this code path if it is separated.

Also, we can get rid of the confusing rl_connect_cookie hack in
rpcrdma_bc_receive_call.
Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>

cf73daf5

xprtrdma: Move unmap-safe logic to rpcrdma_marshal_req · a2b6470b

由 Chuck Lever 提交于 12月 14, 2017

Clean up. This logic is related to marshaling the request, and I'd
like to keep everything that touches req->rl_registered close
together, for CPU cache efficiency.
Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>

a2b6470b

xprtrdma: Per-mode handling for Remote Invalidation · c3441618

由 Chuck Lever 提交于 12月 14, 2017

Refactoring change: Remote Invalidation is particular to the memory
registration mode that is use. Use a callout instead of a generic
function to handle Remote Invalidation.

This gets rid of the 8-byte flags field in struct rpcrdma_mw, of
which only a single bit flag has been allocated.
Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>

c3441618

16 12月, 2017 1 次提交

xprtrdma: Spread reply processing over more CPUs · ccede759

由 Chuck Lever 提交于 12月 04, 2017

Commit d8f532d2 ("xprtrdma: Invoke rpcrdma_reply_handler
directly from RECV completion") introduced a performance regression
for NFS I/O small enough to not need memory registration. In multi-
threaded benchmarks that generate primarily small I/O requests,
IOPS throughput is reduced by nearly a third. This patch restores
the previous level of throughput.

Because workqueues are typically BOUND (in particular ib_comp_wq,
nfsiod_workqueue, and rpciod_workqueue), NFS/RDMA workloads tend
to aggregate on the CPU that is handling Receive completions.

The usual approach to addressing this problem is to create a QP
and CQ for each CPU, and then schedule transactions on the QP
for the CPU where you want the transaction to complete. The
transaction then does not require an extra context switch during
completion to end up on the same CPU where the transaction was
started.

This approach doesn't work for the Linux NFS/RDMA client because
currently the Linux NFS client does not support multiple connections
per client-server pair, and the RDMA core API does not make it
straightforward for ULPs to determine which CPU is responsible for
handling Receive completions for a CQ.

So for the moment, record the CPU number in the rpcrdma_req before
the transport sends each RPC Call. Then during Receive completion,
queue the RPC completion on that same CPU.

Additionally, move all RPC completion processing to the deferred
handler so that even RPCs with simple small replies complete on
the CPU that sent the corresponding RPC Call.

Fixes: d8f532d2 ("xprtrdma: Invoke rpcrdma_reply_handler ...")
Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>

ccede759

18 11月, 2017 14 次提交

xprtrdma: Update copyright notices · 62b56a67

由 Chuck Lever 提交于 10月 30, 2017

Credit work contributed by Oracle engineers since 2014.
Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>

62b56a67

rpcrdma: Remove C structure definitions of XDR data items · 2232df5e

由 Chuck Lever 提交于 10月 30, 2017

Clean up: C-structure style XDR encoding and decoding logic has
been replaced over the past several merge windows on both the
client and server. These data structures are no longer used.
Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
Reviewed-by: NDevesh Sharma <devesh.sharma@broadcom.com>
Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>

2232df5e

xprtrdma: RPC completion should wait for Send completion · 01bb35c8