Commits · c209e49ceac0ff479f79ac5cd2fbf8be80621203 · openeuler / Kernel

26 Apr, 2019 14 commits

xprtrdma: More Send completion batching · c209e49c

Authored by Chuck Lever on Apr 24, 2019

Instead of using a fixed number, allow the amount of Send completion
batching to vary based on the client's maximum credit limit.

- A larger default gives a small boost to IOPS throughput

- Reducing it based on max_requests gives a safe result when the
  max credit limit is cranked down (e.g. when the device has a small
  max_qp_wr).
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
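The scaling described in this commit could look roughly like the sketch below. It is not taken from the patch: the helper name send_batch_for, the divide-by-8 ratio, and the cap of 16 are all assumptions chosen for illustration.

```c
/* Hypothetical cap on the batch size; the real limit is not given above. */
#define SKETCH_MAX_SEND_BATCH 16u

/*
 * Hypothetical helper: derive the Send completion batch size from the
 * transport's credit limit instead of using one fixed constant.
 */
static unsigned int send_batch_for(unsigned int max_requests)
{
	/* Assumed ratio: signal about once per eighth of the credit limit. */
	unsigned int batch = max_requests / 8;

	if (batch > SKETCH_MAX_SEND_BATCH)
		batch = SKETCH_MAX_SEND_BATCH;

	/* A result of 0 means every Send is signaled, which is always safe. */
	return batch;
}
```

With a small credit limit the batch shrinks toward zero, so the number of unsignaled Sends cannot outgrow a small send queue.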

xprtrdma: Clean up sendctx functions · dbcc53a5

Authored by Chuck Lever on Apr 24, 2019

Minor clean-ups I've stumbled on since sendctx was merged last year.
In particular, making Send completion processing more efficient
appears to have a measurable impact on IOPS throughput.

Note: test_and_clear_bit() returns a value, thus an explicit memory
barrier is not necessary.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
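The note about test_and_clear_bit() relies on the kernel rule that value-returning atomic read-modify-write operations are fully ordered. A userspace analogue using C11 atomics is sketched below; sketch_test_and_clear_bit is a hypothetical name and this is not the kernel implementation.

```c
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Userspace analogue of test_and_clear_bit(): atomically clear one bit
 * and report whether it was previously set.
 */
static bool sketch_test_and_clear_bit(unsigned int nr, atomic_ulong *word)
{
	unsigned long mask = 1UL << nr;

	/*
	 * atomic_fetch_and() defaults to memory_order_seq_cst, so the
	 * caller needs no separate fence before or after this call.
	 */
	unsigned long old = atomic_fetch_and(word, ~mask);

	return (old & mask) != 0;
}
```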

xprtrdma: Trace marshaling failures · 17e4c443

Authored by Chuck Lever on Apr 24, 2019

Record an event when rpcrdma_marshal_req returns a non-zero return
value to help track down why an xprt close might have occurred.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

xprtrdma: Increase maximum number of backchannel requests · 4ba02e8d

Authored by Chuck Lever on Apr 24, 2019

Reflects the change introduced in commit 067c4696 ("NFSv4.1:
Bump the default callback session slot count to 16").
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

xprtrdma: Backchannel can use GFP_KERNEL allocations · 3f9c7e76

Authored by Chuck Lever on Apr 24, 2019

The Receive handler runs in process context, thus can use on-demand
GFP_KERNEL allocations instead of pre-allocation.

This makes the xprtrdma backchannel independent of the number of
backchannel session slots provisioned by the Upper Layer protocol.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

xprtrdma: Clean up regbuf helpers · d2832af3

Authored by Chuck Lever on Apr 24, 2019

For code legibility, clean up the function names to be consistent
with the pattern: "rpcrdma" _ object-type _ action

Also rpcrdma_regbuf_alloc and rpcrdma_regbuf_free no longer have any
callers outside of verbs.c, and can thus be made static.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

xprtrdma: De-duplicate "allocate new, free old regbuf" · 0f665ceb

Authored by Chuck Lever on Apr 24, 2019

Clean up by providing an API to do this common task.

At this point, the difference between rpcrdma_get_sendbuf and
rpcrdma_get_recvbuf has become tiny. These can be collapsed into a
single helper.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

xprtrdma: Allocate req's regbufs at xprt create time · bb93a1ae

Authored by Chuck Lever on Apr 24, 2019

Allocating an rpcrdma_req's regbufs at xprt create time enables
a pair of micro-optimizations:

First, if these regbufs are always there, we can eliminate two
conditional branches from the hot xprt_rdma_allocate path.

Second, by allocating a 1KB buffer, it places a lower bound on the
size of these buffers, without adding yet another conditional
branch. The lower bound reduces the number of hardway
re-allocations. In fact, for some workloads it completely eliminates
hardway allocations.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
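The two micro-optimizations can be illustrated with a userspace sketch. The struct and function names below are hypothetical, and malloc stands in for the kernel's regbuf allocation; only the 1KB floor comes from the commit message.

```c
#include <stddef.h>
#include <stdlib.h>

#define SKETCH_MIN_BUF 1024	/* the 1KB floor mentioned above */

struct sketch_buf {
	void *data;
	size_t len;
};

/* Done once at creation time, so the buffer always exists afterwards. */
static int sketch_buf_create(struct sketch_buf *b)
{
	b->data = malloc(SKETCH_MIN_BUF);
	b->len = b->data ? SKETCH_MIN_BUF : 0;
	return b->data ? 0 : -1;
}

/*
 * Hot path: no "is there a buffer yet?" branch, only a size comparison.
 * Returns 0 when the existing buffer is reused, 1 when the slow
 * re-allocation path ran, and -1 on allocation failure.
 */
static int sketch_buf_reserve(struct sketch_buf *b, size_t need)
{
	void *p;

	if (need <= b->len)
		return 0;

	p = malloc(need);
	if (!p)
		return -1;
	free(b->data);
	b->data = p;
	b->len = need;
	return 1;
}
```

Any request of 1KB or less is served from the buffer created up front, which is why small-message workloads may never reach the slow path.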

xprtrdma: rpcrdma_regbuf alignment · 8cec3dba

Authored by Chuck Lever on Apr 24, 2019

Allocate the struct rpcrdma_regbuf separately from the I/O buffer
to better guarantee the alignment of the I/O buffer and eliminate
the wasted space between the rpcrdma_regbuf metadata and the buffer
itself.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
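A userspace sketch of the layout change: the metadata and the I/O buffer come from two separate allocations, so the buffer's alignment no longer depends on the size of the metadata struct. The names and the 64-byte alignment are assumptions, and aligned_alloc stands in for whatever allocator the kernel code uses.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_ALIGN 64	/* assumed alignment, for illustration only */

/* Metadata lives apart from the buffer instead of sitting in front of it. */
struct sketch_regbuf {
	void *data;
	size_t len;
};

static struct sketch_regbuf *sketch_regbuf_alloc(size_t len)
{
	struct sketch_regbuf *rb = malloc(sizeof(*rb));
	/* aligned_alloc() wants a size that is a multiple of the alignment. */
	size_t rounded = (len + SKETCH_ALIGN - 1) / SKETCH_ALIGN * SKETCH_ALIGN;

	if (!rb)
		return NULL;
	rb->data = aligned_alloc(SKETCH_ALIGN, rounded);
	if (!rb->data) {
		free(rb);
		return NULL;
	}
	rb->len = len;
	return rb;
}

static void sketch_regbuf_free(struct sketch_regbuf *rb)
{
	if (!rb)
		return;
	free(rb->data);
	free(rb);
}
```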

xprtrdma: Clean up rpcrdma_create_rep() and rpcrdma_destroy_rep() · 23146500

Authored by Chuck Lever on Apr 24, 2019

For code legibility, clean up the function names to be consistent
with the pattern: "rpcrdma" _ object-type _ action
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

xprtrdma: Clean up rpcrdma_create_req() · 1769e6a8

Authored by Chuck Lever on Apr 24, 2019

Eventually, I'd like to invoke rpcrdma_create_req() during the
call_reserve step. Memory allocation there probably needs to use
GFP_NOIO. Therefore a set of GFP flags needs to be passed in.

As an additional clean up, just return a pointer or NULL, because
the only error return code here is -ENOMEM.

Lastly, clean up the function names to be consistent with the
pattern: "rpcrdma" _ object-type _ action
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

xprtrdma: Fix an frwr_map recovery nit · b2ca473b

Authored by Chuck Lever on Apr 24, 2019

After a DMA map failure in frwr_map, mark the MR so that recycling
won't attempt to DMA unmap it.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Fixes: e2f34e26 ("xprtrdma: Yet another double DMA-unmap")
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

SUNRPC: Avoid digging into the ATOMIC pool · 52db6f9a

Authored by Chuck Lever on Apr 24, 2019

Page allocation requests made when the SPARSE_PAGES flag is set are
allowed to fail, and are not critical. No need to spend a rare
resource.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

SUNRPC: Refactor xprt_request_wait_receive() · 8ba6a92d

Authored by Trond Myklebust on Apr 07, 2019

Convert the transport callback to actually put the request to sleep
instead of just setting a timeout. This is in preparation for
rpc_sleep_on_timeout().
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

12 Apr, 2019 1 commit

xprtrdma: Fix helper that drains the transport · e1ede312

Authored by Chuck Lever on Apr 09, 2019

We want to drain only the RQ first. Otherwise the transport can
deadlock on ->close if there are outstanding Send completions.

Fixes: 6d2d0ee2 ("xprtrdma: Replace rpcrdma_receive_wq ... ")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: stable@vger.kernel.org # v5.0+
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

21 Feb, 2019 1 commit

SUNRPC: Ensure rq_bytes_sent is reset before request transmission · b9779a54

Authored by Trond Myklebust on Jan 02, 2019

When we resend a request, ensure that the 'rq_bytes_sent' is reset
to zero.
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>

14 Feb, 2019 2 commits

SUNRPC: Remove rpc_xprt::tsh_size · 067fb11b

Authored by Chuck Lever on Feb 11, 2019

tsh_size was added to accommodate transports that send a pre-amble
before each RPC message. However, this assumes the pre-amble is
fixed in size, which isn't true for some transports. That makes
tsh_size not very generic.

Also I'd like to make the estimation of RPC send and receive
buffer sizes more precise. tsh_size doesn't currently appear to be
accounted for at all by call_allocate.

Therefore let's just remove the tsh_size concept, and make the only
transports that have a non-zero tsh_size employ a direct approach.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

SUNRPC: Add xdr_stream::rqst field · 0ccc61b1

Authored by Chuck Lever on Feb 11, 2019

Having access to the controlling rpc_rqst means a trace point in the
XDR code can report:

 - the XID
 - the task ID and client ID
 - the p_name of the RPC being processed

Subsequent patches will introduce such trace points.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

13 Feb, 2019 4 commits

xprtrdma: Reduce the doorbell rate (Receive) · e340c2d6

Authored by Chuck Lever on Feb 11, 2019

Post RECV WRs in batches to reduce the hardware doorbell rate per
transport. This helps the RPC-over-RDMA client scale better in
number of transports.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
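Batching here means chaining several Receive work requests together and handing the whole chain to the device in one post call. The sketch below models that in plain C; sketch_post_recv is a stand-in for a verbs post-receive call, and every name in it is hypothetical.

```c
#include <stddef.h>

struct sketch_recv_wr {
	struct sketch_recv_wr *next;
	int id;
};

/* Counts how many times the (simulated) hardware doorbell was rung. */
static unsigned int sketch_doorbells;

/*
 * Stand-in for a post-receive call: one invocation rings the doorbell
 * once, regardless of how many WRs are chained. Returns the WR count.
 */
static unsigned int sketch_post_recv(struct sketch_recv_wr *chain)
{
	unsigned int n = 0;

	sketch_doorbells++;
	for (; chain; chain = chain->next)
		n++;
	return n;
}

/* Link all WRs into one chain, then post the chain with a single call. */
static unsigned int sketch_post_recvs(struct sketch_recv_wr *wrs,
				      unsigned int count)
{
	struct sketch_recv_wr *chain = NULL;
	unsigned int i;

	for (i = 0; i < count; i++) {
		wrs[i].next = chain;
		chain = &wrs[i];
	}
	return chain ? sketch_post_recv(chain) : 0;
}
```

Posting eight Receives this way costs one doorbell instead of eight, and that saving is multiplied across every transport sharing the device.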

xprtrdma: Check inline size before providing a Write chunk · d4550bbe

Authored by Chuck Lever on Feb 11, 2019

In very rare cases, an NFS READ operation might predict that the
non-payload part of the RPC Call is large. For instance, an
NFSv4 COMPOUND with a large GETATTR result, in combination with a
large Kerberos credential, could push the non-payload part to be
several kilobytes.

If the non-payload part is larger than the connection's inline
threshold, the client is required to provision a Reply chunk. The
current Linux client does not check for this case. There are two
obvious ways to handle it:

a. Provision a Write chunk for the payload and a Reply chunk for
   the non-payload part

b. Provision a Reply chunk for the whole RPC Reply

Some testing at a recent NFS bake-a-thon showed that servers can
mostly handle a. but there are some corner cases that do not work
yet. b. already works (it has to, to handle krb5i/p), but could be
somewhat less efficient. However, I expect this scenario to be very
rare -- no-one has reported a problem yet.

So I'm going to implement b. Sometime later I will provide some
patches to help make b. a little more efficient by more carefully
choosing the Reply chunk's segment sizes to ensure the payload is
optimally aligned.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
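The decision implemented by option b. can be summarized as a small function. This is a sketch under stated assumptions rather than the marshaling code itself: the enum, the function name, and the three length parameters are hypothetical stand-ins for the values the client actually estimates.

```c
#include <stddef.h>

enum sketch_reply_type {
	SKETCH_INLINE,		/* whole reply fits in the inline buffer */
	SKETCH_WRITE_CHUNK,	/* payload via Write chunk, rest inline */
	SKETCH_REPLY_CHUNK	/* whole RPC Reply via Reply chunk */
};

static enum sketch_reply_type
sketch_choose_reply(size_t nonpayload_len, size_t payload_len,
		    size_t inline_threshold)
{
	/* The new check: a large non-payload part forces a Reply chunk. */
	if (nonpayload_len > inline_threshold)
		return SKETCH_REPLY_CHUNK;

	if (nonpayload_len + payload_len <= inline_threshold)
		return SKETCH_INLINE;

	/* Non-payload part fits inline, so only the payload needs a chunk. */
	return SKETCH_WRITE_CHUNK;
}
```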

xprtrdma: Fix sparse warnings · ec482cc1

Authored by Chuck Lever on Feb 11, 2019

linux/net/sunrpc/xprtrdma/rpc_rdma.c:375:63: warning: incorrect type in argument 5 (different base types)
linux/net/sunrpc/xprtrdma/rpc_rdma.c:375:63: expected unsigned int [usertype] xid
linux/net/sunrpc/xprtrdma/rpc_rdma.c:375:63: got restricted __be32 [usertype] rq_xid
linux/net/sunrpc/xprtrdma/rpc_rdma.c:432:62: warning: incorrect type in argument 5 (different base types)
linux/net/sunrpc/xprtrdma/rpc_rdma.c:432:62: expected unsigned int [usertype] xid
linux/net/sunrpc/xprtrdma/rpc_rdma.c:432:62: got restricted __be32 [usertype] rq_xid
linux/net/sunrpc/xprtrdma/rpc_rdma.c:489:62: warning: incorrect type in argument 5 (different base types)
linux/net/sunrpc/xprtrdma/rpc_rdma.c:489:62: expected unsigned int [usertype] xid
linux/net/sunrpc/xprtrdma/rpc_rdma.c:489:62: got restricted __be32 [usertype] rq_xid

Fixes: 0a93fbcb ("xprtrdma: Plant XID in on-the-wire RDMA ... ")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
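These warnings arise because rq_xid is stored in wire (big-endian) order while the callee expects a host-order integer. One common way to resolve that kind of mismatch is an explicit conversion at the call site. The userspace sketch below uses ntohl() in place of the kernel's be32_to_cpu(); the function name is hypothetical, and whether the patch converts the value or changes the parameter type is not stated above.

```c
#include <arpa/inet.h>
#include <stdint.h>

/*
 * rq_xid_be holds the XID exactly as it appears on the wire.
 * Convert it before treating it as an ordinary host integer.
 */
static uint32_t sketch_xid_for_trace(uint32_t rq_xid_be)
{
	return ntohl(rq_xid_be);
}
```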

xprtrdma: Make sure Send CQ is allocated on an existing compvec · a4cb5bdb

Authored by Nicolas Morey-Chaisemartin on Feb 05, 2019

Make sure the device has at least 2 completion vectors
before allocating to compvec#1

Fixes: a4699f56 ("xprtrdma: Put Send CQ in IB_POLL_WORKQUEUE mode")
Signed-off-by: Nicolas Morey-Chaisemartin <nmoreychaisemartin@suse.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
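The check amounts to falling back to vector 0 when the device exposes only one completion vector. A minimal sketch, with a hypothetical function name and the vector count passed in as a plain integer:

```c
/*
 * Pick the completion vector for the Send CQ: prefer vector 1, but only
 * if the device actually has at least two vectors.
 */
static int sketch_send_compvec(int num_comp_vectors)
{
	return num_comp_vectors > 1 ? 1 : 0;
}
```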

07 Feb, 2019 5 commits

svcrdma: Remove syslog warnings in work completion handlers · 8820bcaa

Authored by Chuck Lever on Feb 06, 2019

These can result in a lot of log noise, and are able to be triggered
by client misbehavior. Since there are trace points in these
handlers now, there's no need to spam the log.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>

svcrdma: Squelch compiler warning when SUNRPC_DEBUG is disabled · c7920f06

Authored by Chuck Lever on Feb 06, 2019

  CC [M]  net/sunrpc/xprtrdma/svc_rdma_transport.o
linux/net/sunrpc/xprtrdma/svc_rdma_transport.c: In function ‘svc_rdma_accept’:
linux/net/sunrpc/xprtrdma/svc_rdma_transport.c:452:19: warning: variable ‘sap’ set but not used [-Wunused-but-set-variable]
  struct sockaddr *sap;
                   ^
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>

svcrdma: Use struct_size() in kmalloc() · 14cfbd94

Authored by Gustavo A. R. Silva on Jan 15, 2019

One of the more common cases of allocation size calculations is finding
the size of a structure that has a zero-sized array at the end, along
with memory for some number of elements for that array. For example:

struct foo {
    int stuff;
    struct boo entry[];
};

instance = kmalloc(sizeof(struct foo) + count * sizeof(struct boo), GFP_KERNEL);

Instead of leaving these open-coded and prone to type mistakes, we can
now use the new struct_size() helper:

instance = kmalloc(struct_size(instance, entry, count), GFP_KERNEL);

This code was detected with the help of Coccinelle.
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
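For readers outside the kernel, the benefit of struct_size() over the open-coded sum is that the multiplication and addition are checked for overflow and saturate at SIZE_MAX, so the allocation fails instead of coming back too small. A userspace sketch of the same idea, reusing the struct foo and struct boo from the message (sketch_foo_size is a hypothetical name, and this is not the kernel macro):

```c
#include <stddef.h>
#include <stdint.h>

struct boo {
	int value;
};

struct foo {
	int stuff;
	struct boo entry[];	/* flexible array member */
};

/*
 * Size of a struct foo followed by count entries. Returns SIZE_MAX if
 * the arithmetic would overflow, so a later allocation fails cleanly.
 */
static size_t sketch_foo_size(size_t count)
{
	if (count > (SIZE_MAX - sizeof(struct foo)) / sizeof(struct boo))
		return SIZE_MAX;
	return sizeof(struct foo) + count * sizeof(struct boo);
}
```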

svcrpc: fix unlikely races preventing queueing of sockets · 95503d29

Authored by J. Bruce Fields on Jan 11, 2019

In the RPC server, when something happens that might be a reason to wake
up a thread to do something, what we do is

	- modify xpt_flags, sk_sock->flags, xpt_reserved, or
	  xpt_nr_rqsts to indicate the new situation
	- call svc_xprt_enqueue() to decide whether to wake up a thread.

svc_xprt_enqueue may require multiple conditions to be true before
queueing up a thread to handle the xprt.  In the SMP case, one of the
other CPUs may have set another required condition, and in that case,
although both CPUs run svc_xprt_enqueue(), it's possible that neither
call sees the writes done by the other CPU in time, and neither one
recognizes that all the required conditions have been set.  A socket
could therefore be ignored indefinitely.

Add memory barriers to ensure that any svc_xprt_enqueue() call will
always see the conditions changed by other CPUs before deciding to
ignore a socket.

I've never seen this race reported.  In the unlikely event it happens,
another event will usually come along and the problem will fix itself.
So I don't think this is worth backporting to stable.

Chuck tried this patch and said "I don't see any performance
regressions, but my server has only a single last-level CPU cache."
Tested-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
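The race is the classic pattern in which each side stores its own flag and then reads the other's. A userspace sketch with C11 atomics follows; the flag and function names are hypothetical, and the seq_cst fence plays the role of a full kernel memory barrier. The exact barriers the patch adds are not listed above.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Two conditions that must both be true before a thread is woken. */
static atomic_bool sketch_cond_a, sketch_cond_b;

/* Publish condition A, then decide whether everything is now ready. */
static bool sketch_set_a_and_check(void)
{
	atomic_store_explicit(&sketch_cond_a, true, memory_order_relaxed);
	/* Full fence: the store above is ordered before the load below. */
	atomic_thread_fence(memory_order_seq_cst);
	return atomic_load_explicit(&sketch_cond_b, memory_order_relaxed);
}

/* Same for condition B, typically running on another CPU. */
static bool sketch_set_b_and_check(void)
{
	atomic_store_explicit(&sketch_cond_b, true, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);
	return atomic_load_explicit(&sketch_cond_a, memory_order_relaxed);
}
```

With both fences in place, two concurrent callers cannot both return false, so at least one of them sees that all conditions are set. Without the fences, both loads may observe stale values and the work is never queued.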

svcrdma: Remove max_sge check at connect time · e248aa7b

Authored by Chuck Lever on Jan 25, 2019

Two and a half years ago, the client was changed to use gathered
Send for larger inline messages, in commit 655fec69 ("xprtrdma:
Use gathered Send for large inline messages"). Several fixes were
required because there are a few in-kernel device drivers whose
max_sge is 3, and these were broken by the change.

Apparently my memory is going, because some time later, I submitted
commit 25fd86ec ("svcrdma: Don't overrun the SGE array in
svc_rdma_send_ctxt"), and after that, commit f3c1fd0e ("svcrdma:
Reduce max_send_sges"). These too incorrectly assumed in-kernel
device drivers would have more than a few Send SGEs available.

The fix for the server side is not the same. This is because the
fundamental problem on the server is that, whether or not the client
has provisioned a chunk for the RPC reply, the server must squeeze
even the most complex RPC replies into a single RDMA Send. Failing
in the send path because of Send SGE exhaustion should never be an
option.

Therefore, instead of failing when the send path runs out of SGEs,
switch to using a bounce buffer mechanism to handle RPC replies that
are too complex for the device to send directly. That allows us to
remove the max_sge check to enable drivers with small max_sge to
work again.
Reported-by: Don Dutile <ddutile@redhat.com>
Fixes: 25fd86ec ("svcrdma: Don't overrun the SGE array in ...")
Cc: stable@vger.kernel.org
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
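The bounce buffer fallback can be sketched as follows: if a reply has more segments than the device can gather in one Send, copy the segments into one contiguous buffer and send that as a single SGE. All names here are hypothetical and memcpy stands in for the server's real data movement.

```c
#include <stddef.h>
#include <string.h>

struct sketch_seg {
	const void *addr;
	size_t len;
};

/*
 * Returns the number of SGEs the Send will use: nsegs when the device
 * can gather the segments directly, 1 after pulling them up into the
 * bounce buffer, or 0 if the bounce buffer is too small.
 */
static unsigned int sketch_prepare_send(const struct sketch_seg *segs,
					unsigned int nsegs,
					unsigned int max_sge,
					char *bounce, size_t bounce_len)
{
	size_t off = 0;
	unsigned int i;

	if (nsegs <= max_sge)
		return nsegs;

	/* Too complex for the device: copy everything into one buffer. */
	for (i = 0; i < nsegs; i++) {
		if (segs[i].len > bounce_len - off)
			return 0;
		memcpy(bounce + off, segs[i].addr, segs[i].len);
		off += segs[i].len;
	}
	return 1;
}
```

The copy costs CPU time, but it only runs for replies that would otherwise fail outright, which is the trade the commit message argues for.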

09 Jan, 2019 2 commits

xprtrdma: Double free in rpcrdma_sendctxs_create() · 6e17f58c

Authored by Dan Carpenter on Jan 05, 2019

The clean up is handled by the caller, rpcrdma_buffer_create(), so this
call to rpcrdma_sendctxs_destroy() leads to a double free.

Fixes: ae72950a ("xprtrdma: Add data structure to manage RDMA Send arguments")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>

xprtrdma: Fix error code in rpcrdma_buffer_create() · 4429b668

Authored by Dan Carpenter on Jan 05, 2019

This should return -ENOMEM if __alloc_workqueue_key() fails, but it
returns success.

Fixes: 6d2d0ee2 ("xprtrdma: Replace rpcrdma_receive_wq with a per-xprt workqueue")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
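This class of bug appears when an error path jumps to a shared exit label while the return variable still holds 0 from an earlier successful step. A userspace sketch of the corrected shape, with hypothetical names and malloc standing in for the workqueue allocation:

```c
#include <errno.h>
#include <stdlib.h>

struct sketch_buffer {
	void *wq;
};

/* simulate_failure forces the allocation to fail, for demonstration. */
static int sketch_buffer_create(struct sketch_buffer *buf, int simulate_failure)
{
	int rc = 0;

	buf->wq = simulate_failure ? NULL : malloc(16);
	if (!buf->wq) {
		/* Without this assignment the function would return 0. */
		rc = -ENOMEM;
		goto out;
	}

out:
	return rc;
}
```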

03 Jan, 2019 11 commits

xprtrdma: Prevent leak of rpcrdma_rep objects · 07e10308

Authored by Chuck Lever on Dec 07, 2018

If a reply has been processed but the RPC is later retransmitted
anyway, the req->rl_reply field still contains the only pointer to
the old rpcrdma rep. When the next reply comes in, the reply handler
will stomp on the rl_reply field, leaking the old rep.

A trace event is added to capture such leaks.

This problem seems to be worsened by the restructuring of the RPC
Call path in v4.20. Fully addressing this issue will require at
least a re-architecture of the disconnect logic, which is not
appropriate during -rc.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
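The leak and one way to avoid it can be shown with a small sketch: before the reply handler overwrites the pointer, it releases whatever the field still refers to. Only the rl_reply field name comes from the message; the struct and function names are hypothetical, and the actual patch may handle the stale rep differently.

```c
#include <stddef.h>

struct sketch_rep {
	int in_use;
};

struct sketch_req {
	struct sketch_rep *rl_reply;	/* only pointer to the current rep */
};

/* Counts released reps so the behavior can be observed. */
static unsigned int sketch_released;

static void sketch_rep_put(struct sketch_rep *rep)
{
	rep->in_use = 0;
	sketch_released++;
}

/* Attach a new reply, releasing a stale one left by an earlier reply. */
static void sketch_req_set_reply(struct sketch_req *req,
				 struct sketch_rep *rep)
{
	if (req->rl_reply)
		sketch_rep_put(req->rl_reply);
	req->rl_reply = rep;
}
```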

xprtrdma: Don't leak freed MRs · f85adb1b