1. 12 Jul 2016, 5 commits
  2. 18 May 2016, 6 commits
    • xprtrdma: Add ro_unmap_safe memreg method · ead3f26e
      Committed by Chuck Lever
      There needs to be a safe method of releasing registered memory
      resources when an RPC terminates. Safe can mean a number of things:
      
      + Doesn't have to sleep
      
      + Doesn't rely on having a QP in RTS
      
      ro_unmap_safe will be that safe method. It can be used in cases
      where synchronous memory invalidation can deadlock or requires
      an active QP.
      
      The important case is fencing an RPC's memory regions after it is
      signaled (^C) and before it exits. If this is not done, there is a
      window where the server can write an RPC reply into memory that the
      client has released and re-used for some other purpose.
      
      Note that this is a full solution for FRWR, but FMR and physical
      still have some gaps where a particularly bad server can wreak
      some havoc on the client. These gaps are not made worse by this
      patch and are expected to be exceptionally rare and timing-based.
      They are noted in documenting comments.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Tested-by: Steve Wise <swise@opengridcomputing.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Remove rpcrdma_create_chunks() · 3c19409b
      Committed by Chuck Lever
      rpcrdma_create_chunks() has been replaced, and can be removed.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Tested-by: Steve Wise <swise@opengridcomputing.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Allow Read list and Reply chunk simultaneously · 94f58c58
      Committed by Chuck Lever
      rpcrdma_marshal_req() makes a simplifying assumption: that NFS
      operations with large Call messages have small Reply messages, and
      vice versa. Therefore with RPC-over-RDMA, only one chunk type is
      ever needed for each Call/Reply pair, because one direction needs
      chunks, the other direction will always fit inline.
      
      In fact, this assumption is asserted in the code:
      
        if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) {
                dprintk("RPC:       %s: cannot marshal multiple chunk lists\n",
                        __func__);
                return -EIO;
        }
      
      But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p
      perform data transformation on RPC messages before they are
      transmitted, direct data placement techniques cannot be used, thus
      RPC messages must be sent via a Long call in both directions.
      All such calls are sent with a Position Zero Read chunk, and all
      such replies are handled with a Reply chunk. Thus the client must
      provide every Call/Reply pair with both a Read list and a Reply
      chunk.
      
      Without any special security in effect, NFSv4 WRITEs may now also
      use the Read list and provide a Reply chunk. The marshal_req
      logic was preventing that, meaning an NFSv4 WRITE with a large
      payload that included a GETATTR result larger than the inline
      threshold would fail.
      
      The code that encodes each chunk list is now completely contained in
      its own function. There is some code duplication, but the trade-off
      is that the overall logic should be more clear.
      
      Note that all three chunk lists now share the rl_segments array.
      Some additional per-req accounting is necessary to track this
      usage. For the same reasons that the above simplifying assumption
      has held true for so long, I don't expect more array elements are
      needed at this time.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Tested-by: Steve Wise <swise@opengridcomputing.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Update comments in rpcrdma_marshal_req() · 88b18a12
      Committed by Chuck Lever
      Update documenting comments to reflect code changes over the past
      year.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Tested-by: Steve Wise <swise@opengridcomputing.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Avoid using Write list for small NFS READ requests · cce6deeb
      Committed by Chuck Lever
      Avoid the latency and interrupt overhead of registering a Write
      chunk when handling NFS READ requests of a few hundred bytes or
      less.
      
      This change does not interoperate with Linux NFS/RDMA servers
      that do not have commit 9d11b51c ('svcrdma: Fix send_reply()
      scatter/gather set-up'). Commit 9d11b51c was introduced in v4.3,
      and is included in 4.2.y, 4.1.y, and 3.18.y.
      
      Oracle bug 22925946 has been filed to request that the above fix
      be included in the Oracle Linux UEK4 NFS/RDMA server.
      
      Red Hat bugzillas 1327280 and 1327554 have been filed to request
      that RHEL NFS/RDMA server backports include the above fix.
      
      Workaround: Replace the "proto=rdma,port=20049" mount options
      with "proto=tcp" until commit 9d11b51c is applied to your
      NFS server.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Tested-by: Steve Wise <swise@opengridcomputing.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Prevent inline overflow · 302d3deb
      Committed by Chuck Lever
      When deciding whether to send a Call inline, rpcrdma_marshal_req
      doesn't take into account header bytes consumed by chunk lists.
      This results in Call messages on the wire that are sometimes larger
      than the inline threshold.
      
      Likewise, when a Write list or Reply chunk is in play, the server's
      reply has to emit an RDMA Send that includes a larger-than-minimal
      RPC-over-RDMA header.
      
      The actual size of a Call message cannot be estimated until after
      the chunk lists have been registered. Thus the size of each
      RPC-over-RDMA header can be estimated only after chunks are
      registered; but the decision to register chunks is based on the size
      of that header. Chicken, meet egg.
      
      The best a client can do is estimate header size based on the
      largest header that might occur, and then ensure that inline content
      is always smaller than that.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Tested-by: Steve Wise <swise@opengridcomputing.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
  3. 15 Mar 2016, 4 commits
  4. 19 Dec 2015, 1 commit
    • xprtrdma: Invalidate in the RPC reply handler · 68791649
      Committed by Chuck Lever
      There is a window between the time the RPC reply handler wakes the
      waiting RPC task and when xprt_release() invokes ops->buf_free.
      During this time, memory regions containing the data payload may
      still be accessed by a broken or malicious server, but the RPC
      application has already been allowed access to the memory containing
      the RPC request's data payloads.
      
      The server should be fenced from client memory containing RPC data
      payloads _before_ the RPC application is allowed to continue.
      
      This change also more strongly enforces send queue accounting. There
      is a maximum number of RPC calls allowed to be outstanding. When an
      RPC/RDMA transport is set up, just enough send queue resources are
      allocated to handle registration, Send, and invalidation WRs for
      each of those RPCs at the same time.
      
      Before, additional RPC calls could be dispatched while invalidation
      WRs were still consuming send WQEs. When invalidation WRs backed
      up, dispatching additional RPCs resulted in a send queue overrun.
      
      Now, the reply handler prevents RPC dispatch until invalidation is
      complete. This prevents RPC call dispatch until there are enough
      send queue resources to proceed.
      
      Still to do: If an RPC exits early (say, ^C), the reply handler has
      no opportunity to perform invalidation. Currently, xprt_rdma_free()
      still frees remaining RDMA resources, which could deadlock.
      Additional changes are needed to handle invalidation properly in this
      case.
      Reported-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
  5. 03 Nov 2015, 4 commits
  6. 06 Aug 2015, 7 commits
    • xprtrdma: Count RDMA_NOMSG type calls · 860477d1
      Committed by Chuck Lever
      RDMA_NOMSG type calls are less efficient than RDMA_MSG. Count NOMSG
      calls so administrators can tell if they happen to be used more than
      expected.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Fix large NFS SYMLINK calls · 2fcc213a
      Committed by Chuck Lever
      Repair how rpcrdma_marshal_req() chooses which RDMA message type
      to use for large non-WRITE operations so that it picks RDMA_NOMSG
      in the correct situations, and sets up the marshaling logic to
      SEND only the RPC/RDMA header.
      
      Large NFSv2 SYMLINK requests now use RDMA_NOMSG calls. The Linux NFS
      server XDR decoder for NFSv2 SYMLINK does not handle having the
      pathname argument arrive in a separate buffer. The decoder could be
      fixed, but this is simpler and RDMA_NOMSG can be used in a variety
      of other situations.
      
      Ensure that the Linux client continues to use "RDMA_MSG + read
      list" when sending large NFSv3 SYMLINK requests, which is more
      efficient than using RDMA_NOMSG.
      
      Large NFSv4 CREATE(NF4LNK) requests are changed to use "RDMA_MSG +
      read list" just like NFSv3 (see Section 5 of RFC 5667). Before,
      these did not work at all.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Fix XDR tail buffer marshalling · 677eb17e
      Committed by Chuck Lever
      Currently xprtrdma appends an extra chunk element to the RPC/RDMA
      read chunk list of each NFSv4 WRITE compound. The extra element
      contains the final GETATTR operation in the compound.
      
      The result is an extra RDMA READ operation to transfer a very short
      piece of each NFS WRITE compound (typically 16 bytes). This is
      inefficient.
      
      It is also incorrect.
      
      The client is sending the trailing GETATTR at the same Position as
      the preceding WRITE data payload. Whether or not RFC 5667 allows
      the GETATTR to appear in a read chunk, RFC 5666 requires that these
      two separate RPC arguments appear at two distinct Positions.
      
      It can also be argued that the GETATTR operation is not bulk data,
      and therefore RFC 5667 forbids its appearance in a read chunk at
      all.
      
      Although RFC 5667 is not precise about when using a read list with
      NFSv4 COMPOUND is allowed, the intent is that only data arguments
      not touched by NFS (i.e., read and write payloads) are to be sent
      using RDMA READ or WRITE.
      
      The NFS client constructs GETATTR arguments itself, and therefore is
      required to send the trailing GETATTR operation as additional inline
      content, not as a data payload.
      
      NB: This change is not backwards compatible. Some older servers do
      not accept inline content following the read list. The Linux NFS
      server should handle this content correctly as of commit
      a97c331f ("svcrdma: Handle additional inline content").
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Don't provide a reply chunk when expecting a short reply · 33943b29
      Committed by Chuck Lever
      Currently Linux always offers a reply chunk, even when the reply
      can be sent inline (i.e., is smaller than 1KB).
      
      On the client, registering a memory region can be expensive. A
      server may choose not to use the reply chunk, wasting the cost of
      the registration.
      
      This is a change only for RPC replies smaller than 1KB which the
      server constructs in the RPC reply send buffer. Because the elements
      of the reply must be XDR encoded, a copy-free data transfer has no
      benefit in this case.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
      Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Always provide a write list when sending NFS READ · 02eb57d8
      Committed by Chuck Lever
      The client has been setting up a reply chunk for NFS READs that are
      smaller than the inline threshold. This is not efficient: both the
      server and client CPUs have to copy the reply's data payload into
      and out of the memory region that is then transferred via RDMA.
      
      Using the write list, the data payload is moved by the device and no
      extra data copying is necessary.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Reviewed-by: Devesh Sharma <devesh.sharma@avagotech.com>
      Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
      Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Account for RPC/RDMA header size when deciding to inline · 5457ced0
      Committed by Chuck Lever
      When the size of the RPC message is near the inline threshold (1KB),
      the client would allow messages to be sent that were a few bytes too
      large.
      
      When marshaling RPC/RDMA requests, ensure the combined size of
      RPC/RDMA header and RPC header do not exceed the inline threshold.
      Endpoints typically reject RPC/RDMA messages that exceed the size
      of their receive buffers.
      
      The two server implementations I test with (Linux and Solaris) use
      receive buffers that are larger than the client's inline threshold.
      Thus so far this has been benign, observed only by code inspection.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Reviewed-by: Devesh Sharma <devesh.sharma@avagotech.com>
      Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Remove logic that constructs RDMA_MSGP type calls · b3221d6a
      Committed by Chuck Lever
      RDMA_MSGP type calls insert a zero pad in the middle of the RPC
      message to align the RPC request's data payload to the server's
      alignment preferences. A server can then "page flip" the payload
      into place to avoid a data copy in certain circumstances. However:
      
      1. The client has to have a priori knowledge of the server's
         preferred alignment
      
      2. Requests eligible for RDMA_MSGP are requests that are small
         enough to have been sent inline, and convey a data payload
         at the _end_ of the RPC message
      
      Today, item 1 is handled with a sysctl, a global setting that is
      copied during mount. Linux does not support CCP to query the
      server's preferences (RFC 5666, Section 6).
      
      A small-ish NFSv3 WRITE might use RDMA_MSGP, but no NFSv4
      compound fits bullet 2.
      
      Thus the Linux client currently leaves RDMA_MSGP disabled. The
      Linux server handles RDMA_MSGP, but does not use any special
      page flipping, so it confers no benefit.
      
      Clean up the marshaling code by removing the logic that constructs
      RDMA_MSGP type calls. This also reduces the maximum send iovec size
      from four to just two elements.
      
      /proc/sys/sunrpc/rdma_inline_write_padding is a kernel API, and
      thus is left in place.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
  7. 13 Jun 2015, 3 commits
  8. 31 Mar 2015, 3 commits
  9. 24 Feb 2015, 1 commit
    • xprtrdma: Store RDMA credits in unsigned variables · 9b1dcbc8
      Committed by Chuck Lever
      Dan Carpenter's static checker pointed out:
      
         net/sunrpc/xprtrdma/rpc_rdma.c:879 rpcrdma_reply_handler()
         warn: can 'credits' be negative?
      
      "credits" is defined as an int. The credits value comes from the
      server as a 32-bit unsigned integer.
      
      A malicious or broken server can plant a large unsigned integer in
      that field which would result in an underflow in the following
      logic, potentially triggering a deadlock of the mount point by
      blocking the client from issuing more RPC requests.
      
      net/sunrpc/xprtrdma/rpc_rdma.c:
      
        876          credits = be32_to_cpu(headerp->rm_credit);
        877          if (credits == 0)
        878                  credits = 1;    /* don't deadlock */
        879          else if (credits > r_xprt->rx_buf.rb_max_requests)
        880                  credits = r_xprt->rx_buf.rb_max_requests;
        881
        882          cwnd = xprt->cwnd;
        883          xprt->cwnd = credits << RPC_CWNDSHIFT;
        884          if (xprt->cwnd > cwnd)
        885                  xprt_release_rqst_cong(rqst->rq_task);
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
      Fixes: eba8ff66 ("xprtrdma: Move credit update to RPC reply handler")
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
  10. 30 Jan 2015, 6 commits
    • xprtrdma: Allocate zero pad separately from rpcrdma_buffer · c05fbb5a
      Committed by Chuck Lever
      Use the new rpcrdma_alloc_regbuf() API to shrink the amount of
      contiguous memory needed for a buffer pool by moving the zero
      pad buffer into a regbuf.
      
      This is for consistency with the other uses of internally
      registered memory.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Reviewed-by: Steve Wise <swise@opengridcomputing.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Allocate RPC/RDMA receive buffer separately from struct rpcrdma_rep · 6b1184cd
      Committed by Chuck Lever
      The rr_base field is currently the buffer where RPC replies land.
      
      An RPC/RDMA reply header lands in this buffer. In some cases an RPC
      reply header also lands in this buffer, just after the RPC/RDMA
      header.
      
      The inline threshold is an agreed-on size limit for RDMA SEND
      operations that pass between client and server. The sum of the
      RPC/RDMA reply header size and the RPC reply header size must be
      less than this threshold.
      
      The largest RDMA RECV that the client should have to handle is the
      size of the inline threshold. The receive buffer should thus be the
      size of the inline threshold, and not related to RPCRDMA_MAX_SEGS.
      
      RPC replies received via RDMA WRITE (long replies) are caught in
      rq_rcv_buf, which is the second half of the RPC send buffer. That
      is, such replies are not involved in any way with rr_base.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Allocate RPC/RDMA send buffer separately from struct rpcrdma_req · 85275c87
      Committed by Chuck Lever
      The rl_base field is currently the buffer where each RPC/RDMA call
      header is built.
      
      The inline threshold is an agreed-on size limit for RDMA SEND
      operations that pass between client and server. The sum of the
      RPC/RDMA header size and the RPC header size must be less than or
      equal to this threshold.
      
      Increasing the r/wsize maximum will require MAX_SEGS to grow
      significantly, but the inline threshold size won't change (both
      sides agree on it). The server's inline threshold doesn't change.
      
      Since an RPC/RDMA header can never be larger than the inline
      threshold, make all RPC/RDMA header buffers the size of the
      inline threshold.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Reviewed-by: Steve Wise <swise@opengridcomputing.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Allocate RPC send buffer separately from struct rpcrdma_req · 0ca77dc3
      Committed by Chuck Lever
      Because internal memory registration is an expensive and synchronous
      operation, xprtrdma pre-registers send and receive buffers at mount
      time, and then re-uses them for each RPC.
      
      A "hardway" allocation is a memory allocation and registration that
      replaces a send buffer during the processing of an RPC. Hardway must
      be done if the RPC send buffer is too small to accommodate an RPC's
      call and reply headers.
      
      For xprtrdma, each RPC send buffer is currently part of struct
      rpcrdma_req so that xprt_rdma_free(), which is passed nothing but
      the address of an RPC send buffer, can find its matching struct
      rpcrdma_req and rpcrdma_rep quickly via container_of / offsetof.
      
      That means that hardway currently has to replace a whole rpcrdma_req
      when it replaces an RPC send buffer. This is often a fairly hefty
      chunk of contiguous memory due to the size of the rl_segments array
      and the fact that both the send and receive buffers are part of
      struct rpcrdma_req.
      
      Some obscure re-use of fields in rpcrdma_req is done so that
      xprt_rdma_free() can detect replaced rpcrdma_req structs, and
      restore the original.
      
      This commit breaks apart the RPC send buffer and struct rpcrdma_req
      so that increasing the size of the rl_segments array does not change
      the alignment of each RPC send buffer. (Increasing rl_segments is
      needed to bump up the maximum r/wsize for NFS/RDMA).
      
      This change opens up some interesting possibilities for improving
      the design of xprt_rdma_allocate().
      
      xprt_rdma_allocate() is now the one place where RPC send buffers
      are allocated or re-allocated, and they are now always left in place
      by xprt_rdma_free().
      
      A large re-allocation that includes both the rl_segments array and
      the RPC send buffer is no longer needed. Send buffer re-allocation
      becomes quite rare. Good send buffer alignment is guaranteed no
      matter what the size of the rl_segments array is.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Reviewed-by: Steve Wise <swise@opengridcomputing.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Remove rpcrdma_ep::rep_func and ::rep_xprt · afadc468
      Committed by Chuck Lever
      Clean up: The rep_func field always refers to rpcrdma_conn_func().
      rep_func should have been removed by commit b45ccfd2 ("xprtrdma:
      Remove MEMWINDOWS registration modes").
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Reviewed-by: Steve Wise <swise@opengridcomputing.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Move credit update to RPC reply handler · eba8ff66
      Committed by Chuck Lever
      Reduce work in the receive CQ handler, which can be run at hardware
      interrupt level, by moving the RPC/RDMA credit update logic to the
      RPC reply handler.
      
      This has some additional benefits: More header sanity checking is
      done before trusting the incoming credit value, and the receive CQ
      handler no longer touches the RPC/RDMA header (the CPU stalls while
      waiting for the header contents to be brought into the cache).
      
      This further extends work begun by commit e7ce710a ("xprtrdma:
      Avoid deadlock when credit window is reset").
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Reviewed-by: Steve Wise <swise@opengridcomputing.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>