提交 · 0c78789e3a030615c6650fde89546cadf40ec2cc · openeuler / raspberrypi-kernel

30 8月, 2015 1 次提交

SUNRPC: xs_reset_transport must mark the connection as disconnected · 0c78789e

由 Trond Myklebust 提交于 8月 29, 2015

In case the reconnection attempt fails.

Cc: stable@vger.kernel.org
Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>

0c78789e

20 8月, 2015 1 次提交

SUNRPC: Allow sockets to do GFP_NOIO allocations · c2126157

由 Trond Myklebust 提交于 8月 19, 2015

Follow up to commit c4a7ca77 ("SUNRPC: Allow waiting on memory
allocation"). Allows the RPC socket code to do non-IO blocking.
Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>

c2126157

18 8月, 2015 1 次提交

SUNRPC: Fix a thinko in xs_connect() · 99b1a4c3

由 Trond Myklebust 提交于 8月 13, 2015

It is rather pointless to test the value of transport->inet after
calling xs_reset_transport(), since it will always be zero, and
so we will never see any exponential back off behaviour.
Also don't force early connections for SOFTCONN tasks. If the server
disconnects us, we should respect the exponential backoff.

Cc: stable@vger.kernel.org # 4.0+
Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>

99b1a4c3

13 8月, 2015 1 次提交

sunrpc: increase UNX_MAXNODENAME from 32 to __NEW_UTS_LEN bytes · 24a9a961

由 Jeff Layton 提交于 8月 03, 2015

The current limit of 32 bytes artificially limits the name string that
we end up stuffing into NFSv4.x client ID blobs. If you have multiple
hosts with long hostnames that only differ near the end, then this can
cause NFSv4 client ID collisions.

Linux nodenames are actually limited to __NEW_UTS_LEN bytes (64), so use
that as the limit instead. Also, use XDR_QUADLEN to specify the slack
length, just for clarity and in case someone in the future changes this
to something not evenly divisible by 4.
Reported-by: NMichael Skralivetsky <michael.skralivetsky@primarydata.com>
Signed-off-by: NJeff Layton <jeff.layton@primarydata.com>
Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>

24a9a961

06 8月, 2015 14 次提交

xprtrdma: take HCA driver refcount at client · d0f36c46

由 Devesh Sharma 提交于 8月 03, 2015

This is a rework of the following patch sent almost a year back:
http://www.mail-archive.com/linux-rdma%40vger.kernel.org/msg20730.html

In presence of active mount if someone tries to rmmod vendor-driver, the
command remains stuck forever waiting for destruction of all rdma-cm-id.
in worst case client can crash during shutdown with active mounts.

The existing code assumes that ia->ri_id->device cannot change during
the lifetime of a transport. xprtrdma do not have support for
DEVICE_REMOVAL event either. Lifting that assumption and adding support
for DEVICE_REMOVAL event is a long chain of work, and is in plan.

The community decided that preventing the hang right now is more
important than waiting for architectural changes.

Thus, this patch introduces a temporary workaround to acquire HCA driver
module reference count during the mount of a nfs-rdma mount point.
Signed-off-by: NDevesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
Reviewed-by: NSagi Grimberg <sagig@dev.mellanox.co.il>
Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>

d0f36c46

xprtrdma: Count RDMA_NOMSG type calls · 860477d1

由 Chuck Lever 提交于 8月 03, 2015

RDMA_NOMSG type calls are less efficient than RDMA_MSG. Count NOMSG
calls so administrators can tell if they happen to be used more than
expected.
Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
Tested-by: NDevesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>

860477d1

xprtrdma: Clean up xprt_rdma_print_stats() · 763f7e4e

由 Chuck Lever 提交于 8月 03, 2015

checkpatch.pl complained about the seq_printf() format string split
across lines and the use of %Lu.
Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
Tested-by: NDevesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>

763f7e4e

xprtrdma: Fix large NFS SYMLINK calls · 2fcc213a

由 Chuck Lever 提交于 8月 03, 2015

Repair how rpcrdma_marshal_req() chooses which RDMA message type
to use for large non-WRITE operations so that it picks RDMA_NOMSG
in the correct situations, and sets up the marshaling logic to
SEND only the RPC/RDMA header.

Large NFSv2 SYMLINK requests now use RDMA_NOMSG calls. The Linux NFS
server XDR decoder for NFSv2 SYMLINK does not handle having the
pathname argument arrive in a separate buffer. The decoder could be
fixed, but this is simpler and RDMA_NOMSG can be used in a variety
of other situations.

Ensure that the Linux client continues to use "RDMA_MSG + read
list" when sending large NFSv3 SYMLINK requests, which is more
efficient than using RDMA_NOMSG.

Large NFSv4 CREATE(NF4LNK) requests are changed to use "RDMA_MSG +
read list" just like NFSv3 (see Section 5 of RFC 5667). Before,
these did not work at all.
Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
Tested-by: NDevesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>

2fcc213a

xprtrdma: Fix XDR tail buffer marshalling · 677eb17e

由 Chuck Lever 提交于 8月 03, 2015

Currently xprtrdma appends an extra chunk element to the RPC/RDMA
read chunk list of each NFSv4 WRITE compound. The extra element
contains the final GETATTR operation in the compound.

The result is an extra RDMA READ operation to transfer a very short
piece of each NFS WRITE compound (typically 16 bytes). This is
inefficient.

It is also incorrect.

The client is sending the trailing GETATTR at the same Position as
the preceding WRITE data payload. Whether or not RFC 5667 allows
the GETATTR to appear in a read chunk, RFC 5666 requires that these
two separate RPC arguments appear at two distinct Positions.

It can also be argued that the GETATTR operation is not bulk data,
and therefore RFC 5667 forbids its appearance in a read chunk at
all.

Although RFC 5667 is not precise about when using a read list with
NFSv4 COMPOUND is allowed, the intent is that only data arguments
not touched by NFS (ie, read and write payloads) are to be sent
using RDMA READ or WRITE.

The NFS client constructs GETATTR arguments itself, and therefore is
required to send the trailing GETATTR operation as additional inline
content, not as a data payload.

NB: This change is not backwards compatible. Some older servers do
not accept inline content following the read list. The Linux NFS
server should handle this content correctly as of commit
a97c331f ("svcrdma: Handle additional inline content").
Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
Tested-by: NDevesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>

677eb17e

xprtrdma: Don't provide a reply chunk when expecting a short reply · 33943b29

由 Chuck Lever 提交于 8月 03, 2015

Currently Linux always offers a reply chunk, even when the reply
can be sent inline (ie. is smaller than 1KB).

On the client, registering a memory region can be expensive. A
server may choose not to use the reply chunk, wasting the cost of
the registration.

This is a change only for RPC replies smaller than 1KB which the
server constructs in the RPC reply send buffer. Because the elements
of the reply must be XDR encoded, a copy-free data transfer has no
benefit in this case.
Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
Reviewed-by: NSagi Grimberg <sagig@mellanox.com>
Tested-by: NDevesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>

33943b29

xprtrdma: Always provide a write list when sending NFS READ · 02eb57d8

由 Chuck Lever 提交于 8月 03, 2015

The client has been setting up a reply chunk for NFS READs that are
smaller than the inline threshold. This is not efficient: both the
server and client CPUs have to copy the reply's data payload into
and out of the memory region that is then transferred via RDMA.

Using the write list, the data payload is moved by the device and no
extra data copying is necessary.
Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
Reviewed-by: NDevesh Sharma <devesh.sharma@avagotech.com>
Reviewed-By: NSagi Grimberg <sagig@mellanox.com>
Tested-by: NDevesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>

02eb57d8

xprtrdma: Account for RPC/RDMA header size when deciding to inline · 5457ced0

由 Chuck Lever 提交于 8月 03, 2015

When the size of the RPC message is near the inline threshold (1KB),
the client would allow messages to be sent that were a few bytes too
large.

When marshaling RPC/RDMA requests, ensure the combined size of
RPC/RDMA header and RPC header do not exceed the inline threshold.
Endpoints typically reject RPC/RDMA messages that exceed the size
of their receive buffers.

The two server implementations I test with (Linux and Solaris) use
receive buffers that are larger than the client’s inline threshold.
Thus so far this has been benign, observed only by code inspection.
Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
Reviewed-by: NDevesh Sharma <devesh.sharma@avagotech.com>
Tested-by: NDevesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>

5457ced0

xprtrdma: Remove logic that constructs RDMA_MSGP type calls · b3221d6a

由 Chuck Lever 提交于 8月 03, 2015

RDMA_MSGP type calls insert a zero pad in the middle of the RPC
message to align the RPC request's data payload to the server's
alignment preferences. A server can then "page flip" the payload
into place to avoid a data copy in certain circumstances. However:

1. The client has to have a priori knowledge of the server's
   preferred alignment

2. Requests eligible for RDMA_MSGP are requests that are small
   enough to have been sent inline, and convey a data payload
   at the _end_ of the RPC message

Today 1. is done with a sysctl, and is a global setting that is
copied during mount. Linux does not support CCP to query the
server's preferences (RFC 5666, Section 6).

A small-ish NFSv3 WRITE might use RDMA_MSGP, but no NFSv4
compound fits bullet 2.

Thus the Linux client currently leaves RDMA_MSGP disabled. The
Linux server handles RDMA_MSGP, but does not use any special
page flipping, so it confers no benefit.

Clean up the marshaling code by removing the logic that constructs
RDMA_MSGP type calls. This also reduces the maximum send iovec size
from four to just two elements.

/proc/sys/sunrpc/rdma_inline_write_padding is a kernel API, and
thus is left in place.
Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
Tested-by: NDevesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>

b3221d6a

xprtrdma: Clean up rpcrdma_ia_open() · d1ed857e

由 Chuck Lever 提交于 8月 03, 2015

Untangle the end of rpcrdma_ia_open() by moving DMA MR set-up, which
is different for each registration method, to the .ro_open functions.

This is refactoring only. No behavior change is expected.
Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
Tested-by: NDevesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>

d1ed857e

xprtrdma: Remove last ib_reg_phys_mr() call site · e531dcab

由 Chuck Lever 提交于 8月 03, 2015

All HCA providers have an ib_get_dma_mr() verb. Thus
rpcrdma_ia_open() will either grab the device's local_dma_key if one
is available, or it will call ib_get_dma_mr(). If ib_get_dma_mr()
fails, rpcrdma_ia_open() fails and no transport is created.

Therefore execution never reaches the ib_reg_phys_mr() call site in
rpcrdma_register_internal(), so it can be removed.

The remaining logic in rpcrdma_{de}register_internal() is folded
into rpcrdma_{alloc,free}_regbuf().

This is clean up only. No behavior change is expected.
Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
Reviewed-by: NDevesh Sharma <devesh.sharma@avagotech.com>
Reviewed-By: NSagi Grimberg <sagig@mellanox.com>
Tested-by: NDevesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>

e531dcab

xprtrdma: Don't fall back to PHYSICAL memory registration · d2310930

由 Chuck Lever 提交于 8月 03, 2015

PHYSICAL memory registration uses a single rkey for all of the
client's memory, thus is insecure. It is still useful in some cases
for testing.

Retain the ability to select PHYSICAL memory registration capability
via /proc/sys/sunrpc/rdma_memreg_strategy, but don't fall back to it
if the HCA does not support FRWR or FMR.

This means amso1100 no longer works out of the box with NFS/RDMA.
When using amso1100 HCAs, set the memreg_strategy sysctl to 6 before
performing NFS/RDMA mounts.
Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
Tested-by: NDevesh Sharma <devesh.sharma@avagotech.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>

d2310930

xprtrdma: Raise maximum payload size to one megabyte · 864be126

由 Chuck Lever 提交于 8月 03, 2015

The point of larger rsize and wsize is to reduce the per-byte cost
of memory registration and deregistration. Modern HCAs can typically
handle a megabyte or more with a single registration operation.
Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
Reviewed-by: NDevesh Sharma <devesh.sharma@avagotech.com>
Reviewed-By: NSagi Grimberg <sagig@mellanox.com>
Tested-by: NDevesh Sharma <devesh.sharma@avagotech.com>
Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>

864be126

xprtrdma: Make xprt_setup_rdma() agnostic to family of server address · 5231eb97

由 Chuck Lever 提交于 8月 03, 2015

In particular, recognize when an IPv6 connection is bound.
Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
Tested-by: NDevesh Sharma <devesh.sharma@avagotech.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>

5231eb97

28 7月, 2015 1 次提交
- T
  SUNRPC: Report TCP errors to the caller · f580dd04
  由 Trond Myklebust 提交于 7月 11, 2015
```
Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>
```
  f580dd04
27 7月, 2015 1 次提交

sunrpc: translate -EAGAIN to -ENOBUFS when socket is writable. · 743c69e7

由 NeilBrown 提交于 7月 27, 2015

The networking layer does not reliably report the distinction between
a non-block write failing because:
 1/ the queue is too full already and
 2/ a memory allocation attempt failed.

The distinction is important because in the first case it is
appropriate to retry as soon as the socket reports that it is
writable, and in the second case a small delay is required as the
socket will most likely report as writable but kmalloc could still
fail.

sk_stream_wait_memory() exhibits this distinction nicely, setting
'vm_wait' if a small wait is needed.  However in the non-blocking case
it always returns -EAGAIN no matter the cause of the failure.  This
-EAGAIN call get all the way to sunrpc.

The sunrpc layer expects EAGAIN to indicate the first cause, and
ENOBUFS to indicate the second.  Various documentation suggests that
this is not unreasonable, but does not guarantee the desired error
codes.

The result of getting -EAGAIN when -ENOBUFS is expected is that the
send is tried again in a tight loop and soft lockups are reported.

so: add tests after calls to xs_sendpages() to translate -EAGAIN into
-ENOBUFS if the socket is writable.  This cannot happen inside
xs_sendpages() as the test for "is socket writable" is different
between TCP and UDP.

With this change, the tight loop retrying xs_sendpages() becomes a
loop which only retries every 250ms, and so will not trigger a
soft-lockup warning.

It is possible that the write did fail because the queue was too full
and by the time xs_sendpages() completed, the queue was writable
again.  In this case an extra 250ms delay is inserted that isn't
really needed.  This circumstance suggests a degree of congestion so a
delay is not necessarily a bad thing, and it can only cause a single
250ms delay, not a series of them.
Signed-off-by: NNeilBrown <neilb@suse.com>
Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>

743c69e7

23 7月, 2015 2 次提交

SUNRPC: xprt_complete_bc_request must also decrement the free slot count · 1980bd4d

由 Trond Myklebust 提交于 7月 22, 2015

Calling xprt_complete_bc_request() effectively causes the slot to be allocated,
so it needs to decrement the backchannel free slot count as well.

Fixes: 0d2a970d ("SUNRPC: Fix a backchannel race")
Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>

1980bd4d

SUNRPC: Fix a backchannel deadlock · 68514471

由 Trond Myklebust 提交于 7月 22, 2015

xprt_alloc_bc_request() cannot call xprt_free_bc_request() without
deadlocking, since it already holds the xprt->bc_pa_lock.
Reported-by: NChuck Lever <chuck.lever@oracle.com>
Fixes: 0d2a970d ("SUNRPC: Fix a backchannel race")
Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>

68514471

03 7月, 2015 2 次提交

SUNRPC: Don't confuse ENOBUFS with a write_space issue · b5872f0c

由 Trond Myklebust 提交于 7月 03, 2015

ENOBUFS means that memory allocations are failing due to an actual
low memory situation. It should not be confused with being out of
socket buffer space.

Handle the problem by just punting to the delay in call_status.
Reported-by: NNeil Brown <neilb@suse.com>
Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>

b5872f0c

SUNRPC: Don't reencode message if transmission failed with ENOBUFS · 93aa6c7b

由 Trond Myklebust 提交于 7月 03, 2015

If we're running out of buffer memory when transmitting data, then
we want to just delay for a moment, and then continue transmitting
the remainder of the message.
Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>

93aa6c7b

23 6月, 2015 1 次提交

sunrpc: use sg_init_one() in krb5_rc4_setup_enc/seq_key() · 901f1379

由 Fabian Frederick 提交于 6月 16, 2015

Don't opencode sg_init_one()
Signed-off-by: NFabian Frederick <fabf@skynet.be>
Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>

901f1379

21 6月, 2015 1 次提交

SUNRPC: Set the TCP user timeout option on client sockets · 775f06ab

由 Trond Myklebust 提交于 6月 20, 2015

Use the TCP_USER_TIMEOUT socket option to advertise to the server
how long we will keep the connection open if there is unacknowledged
data. See RFC5482.
Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>

775f06ab

20 6月, 2015 2 次提交

SUNRPC: Ensure we release the TCP socket once it has been closed · 4876cc77

由 Trond Myklebust 提交于 6月 19, 2015

This fixes a regression introduced by commit caf4ccd4 ("SUNRPC:
Make xs_tcp_close() do a socket shutdown rather than a sock_release").
Prior to that commit, the autoclose feature would ensure that an
idle connection would result in the socket being both disconnected and
released, whereas now only gets disconnected.

While the current behaviour is harmless, it does leave the port bound
until either RPC traffic resumes or the RPC client is shut down.
Reported-by: NSteven Rostedt <rostedt@goodmis.org>
Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>

4876cc77

SUNRPC: Handle connection issues correctly on the back channel · 3832591e

由 Trond Myklebust 提交于 6月 19, 2015

If the back channel is disconnected, we can and should just fail the
transmission. The expectation is that the NFSv4.1 server will always
retransmit any outstanding callbacks once the connection is
re-established.
Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>

3832591e

18 6月, 2015 1 次提交

sunrpc: use sg_init_one() in krb5_rc4_setup_enc/seq_key() · d1381929

由 Fabian Frederick 提交于 6月 16, 2015

Don't opencode sg_init_one()
Signed-off-by: NFabian Frederick <fabf@skynet.be>
Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>

d1381929

16 6月, 2015 1 次提交

SUNRPC: never enqueue a ->rq_cong request on ->sending · 29807318

由 Neil Brown 提交于 6月 15, 2015

If the sending queue has a task without ->rq_cong set at the front,
and then a number of tasks with ->rq_cong set such that they use
the entire congestion window, then the queue deadlocks.  The first
entry cannot be processed until later entries complete.

This scenario has been seen with a client using UDP to access a server,
and the network connection breaking for a period of time - it doesn't
recover.

It never really makes sense for an ->rq_cong request to be on the ->sending
queue, but it can happen when a request is being retried, and finds
the transport if locked (XPRT_LOCKED).  In this case we simple call
__xprt_put_cong() and the deadlock goes away.
Signed-off-by: NNeilBrown <neilb@suse.de>
Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>

29807318

13 6月, 2015 10 次提交

IB/core: Change ib_create_cq to use struct ib_cq_init_attr · 8e37210b

由 Matan Barak 提交于 6月 11, 2015

Currently, ib_create_cq uses cqe and comp_vecotr instead
of the extendible ib_cq_init_attr struct.

Earlier patches already changed the vendors to work with
ib_cq_init_attr. This patch changes the consumers too.
Signed-off-by: NMatan Barak <matanb@mellanox.com>
Signed-off-by: NOr Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: NDoug Ledford <dledford@redhat.com>

8e37210b

xprtrdma: Reduce per-transport MR allocation · 40c6ed0c

由 Chuck Lever 提交于 5月 26, 2015

Reduce resource consumption per-transport to make way for increasing
the credit limit and maximum r/wsize. Pre-allocate fewer MRs.
Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
Reviewed-by: NSteve Wise <swise@opengridcomputing.com>
Reviewed-by: NSagi Grimberg <sagig@mellanox.com>
Reviewed-by: NDevesh Sharma <devesh.sharma@avagotech.com>
Tested-By: NDevesh Sharma <devesh.sharma@avagotech.com>
Reviewed-by: NDoug Ledford <dledford@redhat.com>
Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>

40c6ed0c

xprtrdma: Stack relief in fmr_op_map() · acb9da7a

由 Chuck Lever 提交于 5月 26, 2015

fmr_op_map() declares a 64 element array of u64 in automatic
storage. This is 512 bytes (8 * 64) on the stack.

Instead, when FMR memory registration is in use, pre-allocate a
physaddr array for each rpcrdma_mw.

This is a pre-requisite for increasing the r/wsize maximum for
FMR on platforms with 4KB pages.
Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
Reviewed-by: NSteve Wise <swise@opengridcomputing.com>
Reviewed-by: NSagi Grimberg <sagig@mellanox.com>
Reviewed-by: NDevesh Sharma <devesh.sharma@avagotech.com>
Tested-By: NDevesh Sharma <devesh.sharma@avagotech.com>
Reviewed-by: NDoug Ledford <dledford@redhat.com>
Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>

acb9da7a

xprtrdma: Split rb_lock · 58d1dcf5

由 Chuck Lever 提交于 5月 26, 2015

/proc/lock_stat showed contention between rpcrdma_buffer_get/put
and the MR allocation functions during I/O intensive workloads.

Now that MRs are no longer allocated in rpcrdma_buffer_get(),
there's no reason the rb_mws list has to be managed using the
same lock as the send/receive buffers. Split that lock. The
new lock does not need to disable interrupts because buffer
get/put is never called in an interrupt context.

struct rpcrdma_buffer is re-arranged to ensure rb_mwlock and rb_mws
are always in a different cacheline than rb_lock and the buffer
pointers.
Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
Reviewed-by: NSteve Wise <swise@opengridcomputing.com>
Reviewed-by: NSagi Grimberg <sagig@mellanox.com>
Tested-By: NDevesh Sharma <devesh.sharma@avagotech.com>
Reviewed-by: NDoug Ledford <dledford@redhat.com>
Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>

58d1dcf5

xprtrdma: Remove rpcrdma_ia::ri_memreg_strategy · 7e53df11

由 Chuck Lever 提交于 5月 26, 2015

Clean up: This field is no longer used.
Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
Reviewed-by: NSteve Wise <swise@opengridcomputing.com>
Reviewed-by: NSagi Grimberg <sagig@mellanox.com>
Reviewed-by: NDevesh Sharma <devesh.sharma@avagotech.com>
Tested-By: NDevesh Sharma <devesh.sharma@avagotech.com>
Reviewed-by: NDoug Ledford <dledford@redhat.com>
Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>

7e53df11

xprtrdma: Remove ->ro_reset · 3269a94b

由 Chuck Lever 提交于 5月 26, 2015

An RPC can exit at any time. When it does so, xprt_rdma_free() is
called, and it calls ->op_unmap().

If ->ro_reset() is running due to a transport disconnect, the two
methods can race while processing the same rpcrdma_mw. The results
are unpredictable.

Because of this, in previous patches I've altered ->ro_map() to
handle MR reset. ->ro_reset() is no longer needed and can be
removed.
Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
Reviewed-by: NSteve Wise <swise@opengridcomputing.com>
Reviewed-by: NSagi Grimberg <sagig@mellanox.com>
Reviewed-by: NDevesh Sharma <devesh.sharma@avagotech.com>
Tested-By: NDevesh Sharma <devesh.sharma@avagotech.com>
Reviewed-by: NDoug Ledford <dledford@redhat.com>
Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>

3269a94b

xprtrdma: Remove unused LOCAL_INV recovery logic · 06b00880

由 Chuck Lever 提交于 5月 26, 2015

Clean up: Remove functions no longer used to recover broken FRMRs.
Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
Reviewed-by: NSteve Wise <swise@opengridcomputing.com>
Reviewed-by: NSagi Grimberg <sagig@mellanox.com>
Reviewed-by: NDevesh Sharma <devesh.sharma@avagotech.com>
Tested-By: NDevesh Sharma <devesh.sharma@avagotech.com>
Reviewed-by: NDoug Ledford <dledford@redhat.com>
Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>

06b00880

xprtrdma: Acquire MRs in rpcrdma_register_external() · c14d86e5

由 Chuck Lever 提交于 5月 26, 2015

Acquiring 64 MRs in rpcrdma_buffer_get() while holding the buffer
pool lock is expensive, and unnecessary because most modern adapters
can transfer 100s of KBs of payload using just a single MR.

Instead, acquire MRs one-at-a-time as chunks are registered, and
return them to rb_mws immediately during deregistration.

Note: commit 539431a4 ("xprtrdma: Don't invalidate FRMRs if
registration fails") is reverted: There is now a valid case where
registration can fail (with -ENOMEM) but the QP is still in RTS.
Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
Reviewed-by: NSteve Wise <swise@opengridcomputing.com>
Tested-By: NDevesh Sharma <devesh.sharma@avagotech.com>
Reviewed-by: NDoug Ledford <dledford@redhat.com>
Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>

c14d86e5

xprtrdma: Introduce an FRMR recovery workqueue · 951e721c

由 Chuck Lever 提交于 5月 26, 2015

After a transport disconnect, FRMRs can be left in an undetermined
state. In particular, the MR's rkey is no good.

Currently, FRMRs are fixed up by the transport connect worker, but
that can race with ->ro_unmap if an RPC happens to exit while the
transport connect worker is running.

A better way of dealing with broken FRMRs is to detect them before
they are re-used by ->ro_map. Such FRMRs are either already invalid
or are owned by the sending RPC, and thus no race with ->ro_unmap
is possible.

Introduce a mechanism for handing broken FRMRs to a workqueue to be
reset in a context that is appropriate for allocating resources
(ie. an ib_alloc_fast_reg_mr() API call).

This mechanism is not yet used, but will be in subsequent patches.
Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
Reviewed-by: NSteve Wise <swise@opengridcomputing.com>
Reviewed-By: NDevesh Sharma <devesh.sharma@avagotech.com>
Tested-By: NDevesh Sharma <devesh.sharma@avagotech.com>
Reviewed-by: NDoug Ledford <dledford@redhat.com>
Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>

951e721c

xprtrdma: Acquire FMRs in rpcrdma_fmr_register_external() · fc7fbb59

由 Chuck Lever 提交于 5月 26, 2015

Acquiring 64 FMRs in rpcrdma_buffer_get() while holding the buffer
pool lock is expensive, and unnecessary because FMR mode can
transfer up to a 1MB payload using just a single ib_fmr.

Instead, acquire ib_fmrs one-at-a-time as chunks are registered, and
return them to rb_mws immediately during deregistration.
Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
Reviewed-by: NSteve Wise <swise@opengridcomputing.com>
Tested-By: NDevesh Sharma <devesh.sharma@avagotech.com>
Reviewed-by: NDoug Ledford <dledford@redhat.com>
Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>

fc7fbb59