Commits · 6ea8e71150ecdc235fab31f76ed9953d82313923 · openeuler / Kernel

20 Sep 2016 · 11 commits

xprtrdma: Move recv_wr to struct rpcrdma_rep · 6ea8e711

Committed by Chuck Lever on Sep 15, 2016

Clean up: The fields in the recv_wr do not vary. There is no need to
initialize them before each ib_post_recv(). This removes a large-ish
data structure from the stack.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

xprtrdma: Move send_wr to struct rpcrdma_req · 90aab602

Committed by Chuck Lever on Sep 15, 2016

Clean up: Most of the fields in each send_wr do not vary. There is
no need to initialize them before each ib_post_send(). This removes
a large-ish data structure from the stack.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

xprtrdma: Simplify rpcrdma_ep_post_recv() · b157380a

Committed by Chuck Lever on Sep 15, 2016

Clean up.

Since commit fc664485 ("xprtrdma: Split the completion queue"),
rpcrdma_ep_post_recv() no longer uses the "ep" argument.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

xprtrdma: Eliminate "ia" argument in rpcrdma_{alloc, free}_regbuf · 13650c23

Committed by Chuck Lever on Sep 15, 2016

Clean up. The "ia" argument is no longer used.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

xprtrdma: Delay DMA mapping Send and Receive buffers · 54cbd6b0

Committed by Chuck Lever on Sep 15, 2016

Currently, each regbuf is allocated and DMA mapped at the same time.
This is done during transport creation.

When a device driver is unloaded, every DMA-mapped buffer in use by
a transport has to be unmapped, and then remapped to the new
device if the driver is loaded again. Remapping will have to be done
_after_ the connect worker has set up the new device.

But there's an ordering problem:

call_allocate, which invokes xprt_rdma_allocate which calls
rpcrdma_alloc_regbuf to allocate Send buffers, happens _before_
the connect worker can run to set up the new device.

Instead, at transport creation, allocate each buffer, but leave it
unmapped. Once the RPC carries these buffers into ->send_request, by
which time a transport connection should have been established,
check to see that the RPC's buffers have been DMA mapped. If not,
map them there.

When device driver unplug support is added, it will simply unmap all
the transport's regbufs, but it doesn't have to deallocate the
underlying memory.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
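
The allocate-now, map-later idea in the commit above can be sketched in plain userspace C. This is only a model under stated assumptions: `model_regbuf` and its helpers are hypothetical names, and setting a flag stands in for the real DMA mapping call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical userspace model of a registered buffer (not the kernel struct). */
struct model_regbuf {
	void   *base;       /* backing memory, allocated at transport creation */
	size_t  len;
	bool    mapped;     /* true once the buffer counts as "DMA mapped" */
	int     map_count;  /* how many times a mapping was performed */
};

/* Allocate the memory but leave it unmapped. */
struct model_regbuf *model_regbuf_alloc(size_t len)
{
	struct model_regbuf *rb = calloc(1, sizeof(*rb));

	if (!rb)
		return NULL;
	rb->base = malloc(len);
	if (!rb->base) {
		free(rb);
		return NULL;
	}
	rb->len = len;
	return rb;
}

/* Send path: map only if the buffer is not already mapped. */
bool model_regbuf_map_if_needed(struct model_regbuf *rb)
{
	assert(rb != NULL);
	if (!rb->mapped) {
		rb->mapped = true;	/* stands in for the real DMA mapping */
		rb->map_count++;
	}
	return rb->mapped;
}

/* Device removal: drop the mapping but keep the memory. */
void model_regbuf_unmap(struct model_regbuf *rb)
{
	rb->mapped = false;
}
```

In this model a second send re-uses the existing mapping, and a send after `model_regbuf_unmap()` maps the same memory again without reallocating it.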

xprtrdma: Replace DMA_BIDIRECTIONAL · 99ef4db3

Committed by Chuck Lever on Sep 15, 2016

The use of DMA_BIDIRECTIONAL is discouraged by DMA-API.txt.
Fortunately, xprtrdma now knows which direction I/O is going as
soon as it allocates each regbuf.

The RPC Call and Reply buffers are no longer the same regbuf. They
can each be labeled correctly now. The RPC Reply buffer is never
part of either a Send or Receive WR, but it can be part of a Reply
chunk, which is mapped and registered via ->ro_map. So it is not
DMA mapped when it is allocated (DMA_NONE), to avoid a double-
mapping.

Since Receive buffers are no longer DMA_BIDIRECTIONAL and their
contents are never modified by the host CPU, DMA-API-HOWTO.txt
suggests that a DMA sync before posting each buffer should be
unnecessary. (See my_card_interrupt_handler).
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

xprtrdma: Use smaller buffers for RPC-over-RDMA headers · 08cf2efd

Committed by Chuck Lever on Sep 15, 2016

Commit 94931746 ("xprtrdma: Limit number of RDMA segments in
RPC-over-RDMA headers") capped the number of chunks that may appear
in RPC-over-RDMA headers. The maximum header size can be estimated
and fixed to avoid allocating buffer space that is never used.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

xprtrdma: Initialize separate RPC call and reply buffers · 9c40c49f

Committed by Chuck Lever on Sep 15, 2016

RPC-over-RDMA needs to separate its RPC call and reply buffers.

 o When an RPC Call is sent, rq_snd_buf is DMA mapped for an RDMA
   Send operation using DMA_TO_DEVICE

 o If the client expects a large RPC reply, it DMA maps rq_rcv_buf
   as part of a Reply chunk using DMA_FROM_DEVICE

The two mappings are for data movement in opposite directions.

DMA-API.txt suggests that if these mappings share a DMA cacheline,
bad things can happen. This could occur in the final bytes of
rq_snd_buf and the first bytes of rq_rcv_buf if the two buffers
happen to share a DMA cacheline.

On x86_64 the cacheline size is typically 8 bytes, and RPC call
messages are usually much smaller than the send buffer, so this
hasn't been a noticeable problem. But the DMA cacheline size can be
larger on other platforms.

Also, often rq_rcv_buf starts most of the way into a page, thus
an additional RDMA segment is needed to map and register the end of
that buffer. Try to avoid that scenario to reduce the cost of
registering and invalidating Reply chunks.

Instead of carrying a single regbuf that covers both rq_snd_buf and
rq_rcv_buf, each struct rpcrdma_req now carries one regbuf for
rq_snd_buf and one regbuf for rq_rcv_buf.

Some incidental changes worth noting:

- To clear out some spaghetti, refactor xprt_rdma_allocate.
- The value stored in rg_size is the same as the value stored in
  the iov.length field, so eliminate rg_size
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
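
To make the cacheline concern in the commit above concrete, here is a small userspace sketch that checks whether two byte ranges touch a common cache line. The function name is hypothetical, and the line size is a parameter because it differs between platforms.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return true if [a, a + a_len) and [b, b + b_len) touch at least one
 * common cache line of line_size bytes.
 */
bool model_share_cacheline(uintptr_t a, size_t a_len,
			   uintptr_t b, size_t b_len, size_t line_size)
{
	assert(a_len > 0 && b_len > 0 && line_size > 0);

	/* Index of the first and last cache line each range touches. */
	uintptr_t a_first = a / line_size;
	uintptr_t a_last  = (a + a_len - 1) / line_size;
	uintptr_t b_first = b / line_size;
	uintptr_t b_last  = (b + b_len - 1) / line_size;

	/* Two closed intervals overlap when each starts before the other ends. */
	return a_first <= b_last && b_first <= a_last;
}
```

With a single buffer, a send region ending at byte 99 and a receive region starting at byte 100 share a line for any line size above one byte; giving each region its own allocation, aligned to the line size, avoids that.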

SUNRPC: Add a transport-specific private field in rpc_rqst · 5a6d1db4

Committed by Chuck Lever on Sep 15, 2016

Currently there's a hidden and indirect mechanism for finding the
rpcrdma_req that goes with an rpc_rqst. It depends on getting from
the rq_buffer pointer in struct rpc_rqst to the struct
rpcrdma_regbuf that controls that buffer, and then to the struct
rpcrdma_req it goes with.

This was done back in the day to avoid the need to add a per-rqst
pointer or to alter the buf_free API when support for RPC-over-RDMA
was introduced.

I'm about to change the way regbufs work to support larger inline
thresholds. Now is a good time to replace this indirect mechanism
with something that is more straightforward. I guess this should be
considered a clean up.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
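
The difference between the indirect lookup and a per-request private pointer, as described in the commit above, can be illustrated with a userspace sketch. All struct and field names here are hypothetical (in particular `xprt_private` is an assumed name, not taken from the patch).

```c
#include <assert.h>
#include <stddef.h>

struct model_rdma_req;

/* Hypothetical stand-in for a registered buffer that records its owner. */
struct model_regbuf {
	struct model_rdma_req *owner;
	char data[64];
};

struct model_rdma_req {
	int id;
};

/* Hypothetical stand-in for the generic per-RPC request. */
struct model_rqst {
	void *rq_buffer;     /* points at model_regbuf.data */
	void *xprt_private;  /* assumed name for the transport-private field */
};

/* Old way: buffer pointer -> enclosing regbuf -> owning request. */
struct model_rdma_req *model_req_indirect(const struct model_rqst *rqst)
{
	struct model_regbuf *rb;

	assert(rqst->rq_buffer != NULL);
	rb = (struct model_regbuf *)((char *)rqst->rq_buffer -
				     offsetof(struct model_regbuf, data));
	return rb->owner;
}

/* New way: the transport stores its own pointer and reads it back. */
struct model_rdma_req *model_req_direct(const struct model_rqst *rqst)
{
	return rqst->xprt_private;
}
```

The direct form no longer depends on where the buffer lives, which matters once the buffer layout is about to change.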

SUNRPC: Generalize the RPC buffer release API · 3435c74a

Committed by Chuck Lever on Sep 15, 2016

xprtrdma needs to allocate the Call and Reply buffers separately.
TBH, the reliance on using a single buffer for the pair of XDR
buffers is transport implementation-specific.

Instead of passing just the rq_buffer into the buf_free method, pass
the task structure and let buf_free take care of freeing both
XDR buffers at once.

There's a micro-optimization here. In the common case, both
xprt_release and the transport's buf_free method were checking if
rq_buffer was NULL. Now the check is done only once per RPC.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

xprtrdma: Eliminate INLINE_THRESHOLD macros · eb342e9a

Committed by Chuck Lever on Sep 15, 2016

Clean up: r_xprt is already available everywhere these macros are
invoked, so just dereference that directly.

RPCRDMA_INLINE_PAD_VALUE is no longer used, so it can simply be
removed.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

07 Sep 2016 · 1 commit

xprtrdma: Fix receive buffer accounting · 05c97466

Committed by Chuck Lever on Sep 06, 2016

An RPC can terminate before its reply arrives, if a credential
problem or a soft timeout occurs. After this happens, xprtrdma
reports it is out of Receive buffers.

A Receive buffer is posted before each RPC is sent, and returned to
the buffer pool when a reply is received. If no reply is received
for an RPC, that Receive buffer remains posted. But xprtrdma tries
to post another when the next RPC is sent.

If this happens a few dozen times, there are no receive buffers left
to be posted at send time. I don't see a way for a transport
connection to recover at that point, and it will spit warnings and
unnecessarily delay RPCs on occasion for its remaining lifetime.

Commit 1e465fd4 ("xprtrdma: Replace send and receive arrays")
removed a little bit of logic to detect this case and not provide
a Receive buffer so no more buffers are posted, and then transport
operation continues correctly. We didn't understand what that logic
did, and it wasn't commented, so it was removed as part of the
overhaul to support backchannel requests.

Restore it, but be wary of the need to keep extra Receives posted
to deal with backchannel requests.

Fixes: 1e465fd4 ("xprtrdma: Replace send and receive arrays")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
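
The accounting problem described above can be modeled with a single counter. The sketch below is a simplification and not the patch's code: the struct, the function names, and the idea of folding the backchannel allowance into one `max_posted` limit are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: tracks how many Receives are currently posted. */
struct model_recv_acct {
	int posted;      /* Receives posted and not yet completed */
	int max_posted;  /* credit limit plus any extras kept for backchannel */
};

/* Send path: return true if a new Receive should be posted for this RPC. */
bool model_post_recv_for_send(struct model_recv_acct *a)
{
	assert(a->posted >= 0);
	if (a->posted >= a->max_posted)
		return false;	/* earlier unanswered RPCs left Receives posted */
	a->posted++;
	return true;
}

/* Receive completion: an incoming reply consumed one posted Receive. */
void model_recv_done(struct model_recv_acct *a)
{
	assert(a->posted > 0);
	a->posted--;
}
```

When an RPC terminates without a reply, `model_recv_done()` is never called for it, so the counter stays high and later sends skip posting rather than exhausting the pool.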

12 Jul 2016 · 9 commits

xprtrdma: Chunk list encoders no longer share one rl_segments array · 5ab81428

Committed by Chuck Lever on Jun 29, 2016

Currently, all three chunk list encoders each use a portion of the
one rl_segments array in rpcrdma_req. This is because the MWs for
each chunk list were preserved in rl_segments so that ro_unmap could
find and invalidate them after the RPC was complete.

However, now that MWs are placed on a per-req linked list as they
are registered, there is no longer any information in rpcrdma_mr_seg
that is shared between ro_map and ro_unmap_{sync,safe}, and thus
nothing in rl_segments needs to be preserved after
rpcrdma_marshal_req is complete.

Thus the rl_segments array can be used now just for the needs of
each rpcrdma_convert_iovs call. Once each chunk list is encoded, the
next chunk list encoder is free to re-use all of rl_segments.

This means all three chunk lists in one RPC request can now each
encode a full size data payload with no increase in the size of
rl_segments.

This is a key requirement for Kerberos support, since both the Call
and Reply for a single RPC transaction are conveyed via Long
messages (RDMA Read/Write). Both can be large.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

xprtrdma: Place registered MWs on a per-req list · 9d6b0409

Committed by Chuck Lever on Jun 29, 2016

Instead of placing registered MWs sparsely into the rl_segments
array, place these MWs on a per-req list.

ro_unmap_{sync,safe} can then simply pull those MWs off the list
instead of walking through the array.

This change significantly reduces the size of struct rpcrdma_req
by removing nsegs and rl_mw from every array element.

As an additional clean-up, chunk co-ordinates are returned in the
"*mw" output argument so they are no longer needed in every
array element.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
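
A per-request list of registered MWs, as opposed to scanning a sparse array, might look like the following userspace sketch. The types and functions are hypothetical, and a hand-rolled singly linked list stands in for whatever list primitive the kernel code uses.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a registered memory window. */
struct model_mw {
	int handle;
	struct model_mw *next;
};

/* Hypothetical model of a request that owns a list of registered MWs. */
struct model_req {
	struct model_mw *registered;
};

/* Registration path: remember the MW on the request's list. */
void model_req_add_mw(struct model_req *req, struct model_mw *mw)
{
	mw->next = req->registered;
	req->registered = mw;
}

/* Unmap path: detach every MW from the request and return how many. */
int model_req_release_mws(struct model_req *req)
{
	int n = 0;

	while (req->registered) {
		struct model_mw *mw = req->registered;

		req->registered = mw->next;
		mw->next = NULL;
		n++;
	}
	return n;
}
```

The unmap path visits only MWs that were actually registered, and the request itself carries one pointer instead of per-element bookkeeping.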

xprtrdma: Allocate MRs on demand · e2ac236c

Committed by Chuck Lever on Jun 29, 2016

Frequent MR list exhaustion can impact I/O throughput, so enough MRs
are always created during transport set-up to prevent running out.
This means more MRs are created than most workloads need.

Commit 94f58c58 ("xprtrdma: Allow Read list and Reply chunk
simultaneously") introduced support for sending two chunk lists per
RPC, which consumes more MRs per RPC.

Instead of trying to provision more MRs, introduce a mechanism for
allocating MRs on demand. A few MRs are allocated during transport
set-up to kick things off.

This significantly reduces the average number of MRs per transport
while allowing the MR count to grow for workloads or devices that
need more MRs.

FRWR with mlx4 allocated almost 400 MRs per transport before this
patch. Now it starts with 32.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
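
The on-demand policy above can be sketched as a pool that starts small and grows only when it runs dry. This is a simplified, synchronous userspace model with hypothetical names; the batch size is an assumed parameter, and the real code may create MRs asynchronously.

```c
#include <assert.h>

/* Hypothetical model of an MR pool that grows on demand. */
struct model_mr_pool {
	int free;     /* MRs available right now */
	int total;    /* MRs created so far */
	int grow_by;  /* assumed batch size when the pool runs dry */
};

void model_pool_init(struct model_mr_pool *p, int initial, int grow_by)
{
	assert(initial > 0 && grow_by > 0);
	p->free = initial;
	p->total = initial;
	p->grow_by = grow_by;
}

/* Take one MR, creating another batch only if none are free. */
void model_pool_get(struct model_mr_pool *p)
{
	if (p->free == 0) {
		p->free += p->grow_by;
		p->total += p->grow_by;
	}
	p->free--;
}

/* Return one MR to the pool. */
void model_pool_put(struct model_mr_pool *p)
{
	p->free++;
}
```

Starting from 32 MRs, a workload that never holds more than 32 at once never grows the pool; only the 33rd concurrent request triggers another batch.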

xprtrdma: Clean up device capability detection · b54054ca

Committed by Chuck Lever on Jun 29, 2016

Clean up: Move device capability detection into memreg-specific
source files.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

xprtrdma: Remove rpcrdma_map_one() and friends · a473018c

Committed by Chuck Lever on Jun 29, 2016

Clean up: ALLPHYSICAL is gone and FMR has been converted to use
scatterlists. There are no more users of these functions.

This patch shrinks the size of struct rpcrdma_req by about 3500
bytes on x86_64. There is one of these structs for each RPC credit
(128 credits per transport connection).
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

xprtrdma: Remove ALLPHYSICAL memory registration mode · 2dc3a69d

Committed by Chuck Lever on Jun 29, 2016

No HCA or RNIC in the kernel tree requires the use of ALLPHYSICAL.

ALLPHYSICAL advertises in the clear on the network fabric an R_key
that is good for all of the client's memory. No known exploit
exists, but theoretically any user on the server can use that R_key
on the client's QP to read or update any part of the client's memory.

ALLPHYSICAL exposes the client to server bugs, including:
 o base/bounds errors causing data outside the i/o buffer to be
   accessed
 o RDMA access after reply causing data corruption and/or integrity
   fail

ALLPHYSICAL can't protect application memory regions from server
update after a local signal or soft timeout has terminated an RPC.

ALLPHYSICAL chunks are no larger than a page. Special cases to
handle small chunks and long chunk lists have been a source of
implementation complexity and bugs.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

xprtrdma: Refactor MR recovery work queues · 505bbe64

Committed by Chuck Lever on Jun 29, 2016

I found that commit ead3f26e ("xprtrdma: Add ro_unmap_safe
memreg method"), which introduces ro_unmap_safe, never wired up the
FMR recovery worker.

The FMR and FRWR recovery work queues both do the same thing.
Instead of setting up separate individual work queues for this,
schedule a delayed worker to deal with them, since recovering MRs is
not performance-critical.

Fixes: ead3f26e ("xprtrdma: Add ro_unmap_safe memreg method")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

xprtrdma: Rename fields in rpcrdma_fmr · 88975ebe

Committed by Chuck Lever on Jun 29, 2016

Clean up: Use the same naming convention used in other
RPC/RDMA-related data structures.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

xprtrdma: Create common scatterlist fields in rpcrdma_mw · 564471d2

Committed by Chuck Lever on Jun 29, 2016

Clean up: FMR is about to replace the rpcrdma_map_one code with
scatterlists. Move the scatterlist fields out of the FRWR-specific
union and into the generic part of rpcrdma_mw.

One minor change: -EIO is now returned if FRWR registration fails.
The RPC is terminated immediately, since the problem is likely due
to a software bug, thus retrying likely won't help.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

18 May 2016 · 10 commits

xprtrdma: Remove qplock · 6e14a92c

Committed by Chuck Lever on May 04, 2016

Clean up.

After "xprtrdma: Remove ro_unmap() from all registration modes",
there are no longer any sites that take rpcrdma_ia::qplock for read.
The one site that takes it for write is always single-threaded. It
is safe to remove it.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

xprtrdma: Remove ro_unmap() from all registration modes · 0b043b9f

Committed by Chuck Lever on May 02, 2016

Clean up: The ro_unmap method is no longer used.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

xprtrdma: Add ro_unmap_safe memreg method · ead3f26e

Committed by Chuck Lever on May 02, 2016

There needs to be a safe method of releasing registered memory
resources when an RPC terminates. Safe can mean a number of things:

+ Doesn't have to sleep

+ Doesn't rely on having a QP in RTS

ro_unmap_safe will be that safe method. It can be used in cases
where synchronous memory invalidation can deadlock, or needs to have
an active QP.

The important case is fencing an RPC's memory regions after it is
signaled (^C) and before it exits. If this is not done, there is a
window where the server can write an RPC reply into memory that the
client has released and re-used for some other purpose.

Note that this is a full solution for FRWR, but FMR and physical
still have some gaps where a particularly bad server can wreak
some havoc on the client. These gaps are not made worse by this
patch and are expected to be exceptionally rare and timing-based.
They are noted in documenting comments.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

xprtrdma: Move fr_xprt and fr_worker to struct rpcrdma_mw · 766656b0

Committed by Chuck Lever on May 02, 2016

In a subsequent patch, the fr_xprt and fr_worker fields will be
needed by another memory registration mode. Move them into the
generic rpcrdma_mw structure that wraps struct rpcrdma_frmr.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

xprtrdma: Save I/O direction in struct rpcrdma_frwr · a3aa8b2b

Committed by Chuck Lever on May 02, 2016

Move the I/O direction field from rpcrdma_mr_seg into the
rpcrdma_frmr.

This makes it possible to DMA-unmap the frwr long after an RPC has
exited and its rpcrdma_mr_seg array has been released and re-used.
This might occur if an RPC times out while waiting for a new
connection to be established.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

xprtrdma: Rename rpcrdma_frwr::sg and sg_nents · 55fdfce1

Committed by Chuck Lever on May 02, 2016

Clean up: Follow same naming convention as other fields in struct
rpcrdma_frwr.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

xprtrdma: Allow Read list and Reply chunk simultaneously · 94f58c58

Committed by Chuck Lever on May 02, 2016

rpcrdma_marshal_req() makes a simplifying assumption: that NFS
operations with large Call messages have small Reply messages, and
vice versa. Therefore with RPC-over-RDMA, only one chunk type is
ever needed for each Call/Reply pair, because one direction needs
chunks, the other direction will always fit inline.

In fact, this assumption is asserted in the code:

  if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) {
          dprintk("RPC:       %s: cannot marshal multiple chunk lists\n",
                  __func__);
          return -EIO;
  }

But RPCSEC_GSS breaks this assumption. Because krb5i and krb5p
perform data transformation on RPC messages before they are
transmitted, direct data placement techniques cannot be used, thus
RPC messages must be sent via a Long call in both directions.
All such calls are sent with a Position Zero Read chunk, and all
such replies are handled with a Reply chunk. Thus the client must
provide every Call/Reply pair with both a Read list and a Reply
chunk.

Without any special security in effect, NFSv4 WRITEs may now also
use the Read list and provide a Reply chunk. The marshal_req
logic was preventing that, meaning an NFSv4 WRITE with a large
payload that included a GETATTR result larger than the inline
threshold would fail.

The code that encodes each chunk list is now completely contained in
its own function. There is some code duplication, but the trade-off
is that the overall logic should be more clear.

Note that all three chunk lists now share the rl_segments array.
Some additional per-req accounting is necessary to track this
usage. For the same reasons that the above simplifying assumption
has held true for so long, I don't expect more array elements are
needed at this time.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
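
Choosing the Call-side and Reply-side chunk types independently, as the commit above describes, can be sketched as follows. This is a simplified userspace model: the enum values, the `model_rpc` fields, and the decision rules are assumptions for illustration, not the actual marshaling code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum model_chunktype {
	MODEL_NOCH,     /* message fits inline, no chunks needed */
	MODEL_READCH,   /* Read list carrying a data payload */
	MODEL_AREADCH,  /* Position Zero Read chunk (Long call) */
	MODEL_WRITECH,  /* Write list carrying a data payload */
	MODEL_REPLYCH,  /* Reply chunk (Long reply) */
};

/* Hypothetical description of one RPC, for illustration only. */
struct model_rpc {
	size_t call_len;
	size_t reply_len;
	bool ddp_allowed;        /* false when krb5i/krb5p transforms the data */
	bool has_write_payload;  /* e.g. an NFS WRITE */
	bool has_read_payload;   /* e.g. an NFS READ */
};

/* Pick the chunk type for each direction without looking at the other. */
void model_select_chunks(const struct model_rpc *rpc, size_t inline_max,
			 enum model_chunktype *rtype,
			 enum model_chunktype *wtype)
{
	if (rpc->call_len <= inline_max)
		*rtype = MODEL_NOCH;
	else if (rpc->ddp_allowed && rpc->has_write_payload)
		*rtype = MODEL_READCH;
	else
		*rtype = MODEL_AREADCH;

	if (rpc->reply_len <= inline_max)
		*wtype = MODEL_NOCH;
	else if (rpc->ddp_allowed && rpc->has_read_payload)
		*wtype = MODEL_WRITECH;
	else
		*wtype = MODEL_REPLYCH;
}
```

Under these rules a krb5p request with a large Call and a large Reply ends up with both a Position Zero Read chunk and a Reply chunk, which the old single-chunk-type assertion would have rejected.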

xprtrdma: Prevent inline overflow · 302d3deb

Committed by Chuck Lever on May 02, 2016

When deciding whether to send a Call inline, rpcrdma_marshal_req
doesn't take into account header bytes consumed by chunk lists.
This results in Call messages on the wire that are sometimes larger
than the inline threshold.

Likewise, when a Write list or Reply chunk is in play, the server's
reply has to emit an RDMA Send that includes a larger-than-minimal
RPC-over-RDMA header.

The actual size of a Call message cannot be estimated until after
the chunk lists have been registered. Thus the size of each
RPC-over-RDMA header can be estimated only after chunks are
registered; but the decision to register chunks is based on the size
of that header. Chicken, meet egg.

The best a client can do is estimate header size based on the
largest header that might occur, and then ensure that inline content
is always smaller than that.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
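
The worst-case estimate described above can be sketched in userspace C. The constants are assumptions for illustration: 28 bytes for a header with three empty chunk lists and 24 bytes per Read list entry follow the RPC-over-RDMA version 1 wire layout, and 8 is the segment cap from the related commit; the patch's own estimate may be computed differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed sizes, for illustration only. */
enum {
	MODEL_HDR_MIN   = 28,  /* fixed fields plus three empty chunk lists */
	MODEL_SEG_BYTES = 24,  /* one Read list entry on the wire */
	MODEL_MAX_SEGS  = 8,   /* cap on segments per chunk list */
};

/* Largest header this model expects to see. */
size_t model_max_header(void)
{
	return MODEL_HDR_MIN + MODEL_MAX_SEGS * MODEL_SEG_BYTES;
}

/*
 * Decide up front, before any chunk is registered, whether an RPC
 * message of rpc_len bytes can be sent inline.
 */
bool model_call_fits_inline(size_t rpc_len, size_t inline_threshold)
{
	size_t hdr = model_max_header();

	assert(inline_threshold > hdr);
	return rpc_len <= inline_threshold - hdr;
}
```

Budgeting for the largest possible header breaks the circular dependency: the decision no longer needs the real header size, at the cost of sending some borderline messages as Long messages.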

xprtrdma: Limit number of RDMA segments in RPC-over-RDMA headers · 94931746

Committed by Chuck Lever on May 02, 2016

Send buffer space is shared between the RPC-over-RDMA header and
an RPC message. A large RPC-over-RDMA header means less space is
available for the associated RPC message, which then has to be
moved via an RDMA Read or Write.

As more segments are added to the chunk lists, the header increases
in size.  Typical modern hardware needs only a few segments to
convey the maximum payload size, but some devices and registration
modes may need a lot of segments to convey data payload. Sometimes
so many are needed that the remaining space in the Send buffer is
not enough for the RPC message. Sending such a message usually
fails.

To ensure a transport can always make forward progress, cap the
number of RDMA segments that are allowed in chunk lists. This
prevents less-capable devices and memory registrations from
consuming a large portion of the Send buffer by reducing the
maximum data payload that can be conveyed with such devices.

For now I choose an arbitrary maximum of 8 RDMA segments. This
allows a maximum size RPC-over-RDMA header to fit nicely in the
current 1024 byte inline threshold with over 700 bytes remaining
for an inline RPC message.

The current maximum data payload of NFS READ or WRITE requests is
one megabyte. To convey that payload on a client with 4KB pages,
each chunk segment would need to handle 32 or more data pages. This
is well within the capabilities of FMR. For physical registration,
the maximum payload size on platforms with 4KB pages is reduced to
32KB.

For FRWR, a device's maximum page list depth would need to be at
least 34 to support the maximum 1MB payload. A device with a smaller
maximum page list depth means the maximum data payload is reduced
when using that device.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
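
The payload arithmetic in the commit above (1 MB over 8 segments with 4 KB pages, and the 32 KB limit for physical registration) can be checked with two small helper functions. The function names are hypothetical; the numbers come from the commit text.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Largest payload that max_segs segments can convey when each segment
 * covers pages_per_seg pages of page_size bytes.
 */
size_t model_max_payload(size_t max_segs, size_t pages_per_seg,
			 size_t page_size)
{
	return max_segs * pages_per_seg * page_size;
}

/*
 * Pages each segment must cover to convey payload bytes using at most
 * max_segs segments (rounded up).
 */
size_t model_pages_per_seg_needed(size_t payload, size_t max_segs,
				  size_t page_size)
{
	assert(max_segs > 0 && page_size > 0);

	size_t pages = (payload + page_size - 1) / page_size;

	return (pages + max_segs - 1) / max_segs;
}
```

A 1 MB payload is 256 pages of 4 KB, so 8 segments must each cover 32 pages. A mode limited to one page per segment tops out at 8 x 4 KB = 32 KB.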

sunrpc: Advertise maximum backchannel payload size · 6b26cc8c

Committed by Chuck Lever on May 02, 2016

RPC-over-RDMA transports have a limit on how large a backward
direction (backchannel) RPC message can be. Ensure that the NFSv4.x
CREATE_SESSION operation advertises this limit to servers.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Steve Wise <swise@opengridcomputing.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

15 Mar 2016 · 4 commits

xprtrdma: Use new CQ API for RPC-over-RDMA client send CQs · 2fa8f88d

Committed by Chuck Lever on Mar 04, 2016

Calling ib_poll_cq() to sort through WCs during a completion is a
common pattern amongst RDMA consumers. Since commit 14d3a3b2
("IB: add a proper completion queue abstraction"), WC sorting can
be handled by the IB core.

By converting to this new API, xprtrdma is made a better neighbor to
other RDMA consumers, as it allows the core to schedule the delivery
of completions more fairly amongst all active consumers.

Because each ib_cqe carries a pointer to a completion method, the
core can now post its own operations on a consumer's QP, and handle
the completions itself, without changes to the consumer.

Send completions were previously handled entirely in the completion
upcall handler (ie, deferring to a process context is unneeded).
Thus IB_POLL_SOFTIRQ is a direct replacement for the current
xprtrdma send code path.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Devesh Sharma <devesh.sharma@broadcom.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
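
The idea that each completion entry carries a pointer to its own completion method can be modeled in userspace C. This sketch uses hypothetical `model_*` types and a simplified handler signature; it is not the IB core API, only the dispatch pattern.

```c
#include <assert.h>
#include <stddef.h>

struct model_wc;

/* Completion entry: just a pointer to the handler for this work request. */
struct model_cqe {
	void (*done)(struct model_wc *wc);
};

/* Work completion delivered by the core. */
struct model_wc {
	struct model_cqe *cqe;
	int status;
};

/* Consumer-side object that embeds a completion entry. */
struct model_send_ctx {
	int completions;
	struct model_cqe cqe;
};

#define model_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Consumer's handler: recover the enclosing object from the cqe pointer. */
void model_send_done(struct model_wc *wc)
{
	struct model_send_ctx *ctx =
		model_container_of(wc->cqe, struct model_send_ctx, cqe);

	ctx->completions++;
}

/* Core side: no sorting by opcode or ID, just call the handler. */
void model_core_dispatch(struct model_wc *wc)
{
	assert(wc->cqe != NULL && wc->cqe->done != NULL);
	wc->cqe->done(wc);
}
```

Because dispatch depends only on the embedded function pointer, a different owner can post its own work requests with its own handler on the same queue without the consumer's handler ever seeing them.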

xprtrdma: Use an anonymous union in struct rpcrdma_mw · c882a655

Committed by Chuck Lever on Mar 04, 2016

Clean up: Make code more readable.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Devesh Sharma <devesh.sharma@broadcom.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

xprtrdma: Use new CQ API for RPC-over-RDMA client receive CQs · 552bf225

Committed by Chuck Lever on Mar 04, 2016

Calling ib_poll_cq() to sort through WCs during a completion is a
common pattern amongst RDMA consumers. Since commit 14d3a3b2
("IB: add a proper completion queue abstraction"), WC sorting can
be handled by the IB core.

By converting to this new API, xprtrdma is made a better neighbor to
other RDMA consumers, as it allows the core to schedule the delivery
of completions more fairly amongst all active consumers.

Because each ib_cqe carries a pointer to a completion method, the
core can now post its own operations on a consumer's QP, and handle
the completions itself, without changes to the consumer.

xprtrdma's reply processing is already handled in a work queue, but
there is some initial order-dependent processing that is done in the
soft IRQ context before a work item is scheduled.

IB_POLL_SOFTIRQ is a direct replacement for the current xprtrdma
receive code path.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Devesh Sharma <devesh.sharma@broadcom.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

xprtrdma: Serialize credit accounting again · 23826c7a

Committed by Chuck Lever on Mar 04, 2016

Commit fe97b47c ("xprtrdma: Use workqueue to process RPC/RDMA
replies") replaced the reply tasklet with a workqueue that allows
RPC replies to be processed in parallel. Thus the credit values in
RPC-over-RDMA replies can be applied in a different order than in
which the server sent them.

To fix this, revert commit eba8ff66 ("xprtrdma: Move credit
update to RPC reply handler"). Reverting is done by hand to
accommodate code changes that have occurred since then.

Fixes: fe97b47c ("xprtrdma: Use workqueue to process . . .")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

20 Jan 2016 · 2 commits

svcrdma: Add class for RDMA backwards direction transport · 5d252f90

Committed by Chuck Lever on Jan 07, 2016

To support the server-side of an NFSv4.1 backchannel on RDMA
connections, add a transport class that enables backward
direction messages on an existing forward channel connection.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Acked-by: Bruce Fields <bfields@fieldses.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>

svcrdma: Remove unused req_map and ctxt kmem_caches · 71810ef3

Committed by Chuck Lever on Jan 07, 2016

Clean up.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Acked-by: Bruce Fields <bfields@fieldses.org>
Signed-off-by: Doug Ledford <dledford@redhat.com>

23 Dec 2015 · 1 commit

xprtrdma: Avoid calling ib_query_device · e3e45b1b

Committed by Or Gerlitz on Dec 18, 2015

Instead, use the cached copy of the attributes present on the device.
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>

19 Dec 2015 · 2 commits

xprtrdma: Revert commit ('xprtrdma: Cap req_cqinit'). · 26ae9d1c

Committed by Chuck Lever on Dec 16, 2015

The root of the problem was that sends (especially unsignalled
FASTREG and LOCAL_INV Work Requests) were not properly flow-
controlled, which allowed a send queue overrun.

Now that the RPC/RDMA reply handler waits for invalidation to
complete, the send queue is properly flow-controlled. Thus this
limit is no longer necessary.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

xprtrdma: Add ro_unmap_sync method for FRWR · c9918ff5

Committed by Chuck Lever on Dec 16, 2015

FRWR's ro_unmap is asynchronous. The new ro_unmap_sync posts
LOCAL_INV Work Requests and waits for them to complete before
returning.

Note also, DMA unmapping is now done _after_ invalidation.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
