- 03 10月, 2018 7 次提交
-
-
由 Chuck Lever 提交于
Clean up: Eliminate the FALLTHROUGH into the default arm to make the switch easier to understand. Also, as long as I'm here, do not display the memory address of the target rpcrdma_ep. A hashed memory address is of marginal use here. Signed-off-by: NChuck Lever <chuck.lever@oracle.com> Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
-
由 Chuck Lever 提交于
Clean up. Since commit 173b8f49 ("xprtrdma: Demote "connect" log messages") there has been no need to initialize connstat to zero. In fact, in this code path there's now no reason not to set rep_connected directly. Signed-off-by: NChuck Lever <chuck.lever@oracle.com> Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
-
由 Chuck Lever 提交于
Clean up: The convention throughout other parts of xprtrdma is to name variables of type struct rpcrdma_xprt "r_xprt", not "xprt". This convention enables the use of the name "xprt" for a "struct rpc_xprt" type variable, as in other parts of the RPC client. Signed-off-by: NChuck Lever <chuck.lever@oracle.com> Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
-
由 Chuck Lever 提交于
Clean up: Use a function name that is consistent with the RDMA core API and with other consumers. Because this is a function that is invoked from outside the rpcrdma.ko module, add an appropriate documenting comment. Signed-off-by: NChuck Lever <chuck.lever@oracle.com> Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
-
由 Chuck Lever 提交于
Clean up the names of trace events related to MRs so that it's easy to enable these with a glob. Signed-off-by: NChuck Lever <chuck.lever@oracle.com> Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
-
由 Chuck Lever 提交于
When a memory operation fails, the MR's driver state might not match its hardware state. The only reliable recourse is to dereg the MR. This is done in ->ro_recover_mr, which then attempts to allocate a fresh MR to replace the released MR. Since commit e2ac236c ("xprtrdma: Allocate MRs on demand"), xprtrdma dynamically allocates MRs. It can add more MRs whenever they are needed. That makes it possible to simply release an MR when a memory operation fails, instead of "recovering" it. It will automatically be replaced by the on-demand MR allocator. This commit is a little larger than I wanted, but it replaces ->ro_recover_mr, rb_recovery_lock, rb_recovery_worker, and the rb_stale_mrs list with a generic work queue. Since MRs are no longer orphaned, the mrs_orphaned metric is no longer used. Signed-off-by: NChuck Lever <chuck.lever@oracle.com> Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
-
由 Chuck Lever 提交于
Some devices require more than 3 MRs to build a single 1MB I/O. Ensure that rpcrdma_mrs_create() will add enough MRs to build that I/O. In a subsequent patch I'm changing the MR recovery logic to just toss out the MRs. In that case it's possible for ->send_request to loop acquiring some MRs, not getting enough, getting called again, recycling the previous MRs, then not getting enough, lather rinse repeat. Thus first we need to ensure enough MRs are created to prevent that loop. I'm "reusing" ia->ri_max_segs. All of its accessors seem to want the maximum number of data segments plus two, so I'm going to bake that into the initial calculation. Signed-off-by: NChuck Lever <chuck.lever@oracle.com> Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
-
- 09 8月, 2018 1 次提交
-
-
由 Chuck Lever 提交于
I found that injecting disconnects with v4.18-rc resulted in random failures of the multi-threaded git regression test. The root cause appears to be that, after a reconnect, the RPC/RDMA transport is waking pending RPCs before the transport has posted enough Receive buffers to receive the Replies. If a Reply arrives before enough Receive buffers are posted, the connection is dropped. A few connection drops happen in quick succession as the client and server struggle to regain credit synchronization. This regression was introduced with commit 7c8d9e7c ("xprtrdma: Move Receive posting to Receive handler"). The client is supposed to post a single Receive when a connection is established because it's not supposed to send more than one RPC Call before it gets a fresh credit grant in the first RPC Reply [RFC 8166, Section 3.3.3]. Unfortunately there appears to be a longstanding bug in the Linux client's credit accounting mechanism. On connect, it simply dumps all pending RPC Calls onto the new connection. It's possible it has done this ever since the RPC/RDMA transport was added to the kernel ten years ago. Servers have so far been tolerant of this bad behavior. Currently no server implementation ever changes its credit grant over reconnects, and servers always repost enough Receives before connections are fully established. The Linux client implementation used to post a Receive before each of these Calls. This has covered up the flooding send behavior. I could try to correct this old bug so that the client sends exactly one RPC Call and waits for a Reply. Since we are so close to the next merge window, I'm going to instead provide a simple patch to post enough Receives before a reconnect completes (based on the number of credits granted to the previous connection). The spurious disconnects will be gone, but the client will still send multiple RPC Calls immediately after a reconnect. Addressing the latter problem will wait for a merge window because a) I expect it to be a large change requiring lots of testing, and b) obviously the Linux client has interoperated successfully since day zero while still being broken. Fixes: 7c8d9e7c ("xprtrdma: Move Receive posting to ... ") Cc: stable@vger.kernel.org # v4.18+ Signed-off-by: NChuck Lever <chuck.lever@oracle.com> Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
-
- 31 7月, 2018 1 次提交
-
-
由 Bart Van Assche 提交于
Since neither ib_post_send() nor ib_post_recv() modify the data structure their second argument points at, declare that argument const. This change makes it necessary to declare the 'bad_wr' argument const too and also to modify all ULPs that call ib_post_send(), ib_post_recv() or ib_post_srq_recv(). This patch does not change any functionality but makes it possible for the compiler to verify whether the ib_post_(send|recv|srq_recv) really do not modify the posted work request. To make this possible, only one cast had to be introduce that casts away constness, namely in rpcrdma_post_recvs(). The only way I can think of to avoid that cast is to introduce an additional loop in that function or to change the data type of bad_wr from struct ib_recv_wr ** into int (an index that refers to an element in the work request list). However, both approaches would require even more extensive changes than this patch. Signed-off-by: NBart Van Assche <bart.vanassche@wdc.com> Reviewed-by: NChuck Lever <chuck.lever@oracle.com> Signed-off-by: NJason Gunthorpe <jgg@mellanox.com>
-
- 19 6月, 2018 1 次提交
-
-
由 Steve Wise 提交于
This patch replaces the ib_device_attr.max_sge with max_send_sge and max_recv_sge. It allows ulps to take advantage of devices that have very different send and recv sge depths. For example cxgb4 has a max_recv_sge of 4, yet a max_send_sge of 16. Splitting out these attributes allows much more efficient use of the SQ for cxgb4 with ulps that use the RDMA_RW API. Consider a large RDMA WRITE that has 16 scattergather entries. With max_sge of 4, the ulp would send 4 WRITE WRs, but with max_sge of 16, it can be done with 1 WRITE WR. Acked-by: NSagi Grimberg <sagi@grimberg.me> Acked-by: NChristoph Hellwig <hch@lst.de> Acked-by: NSelvin Xavier <selvin.xavier@broadcom.com> Acked-by: NShiraz Saleem <shiraz.saleem@intel.com> Acked-by: NDennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: NSteve Wise <swise@opengridcomputing.com> Signed-off-by: NJason Gunthorpe <jgg@mellanox.com>
-
- 02 6月, 2018 1 次提交
-
-
由 Chuck Lever 提交于
Currently, when the sendctx queue is exhausted during marshaling, the RPC/RDMA transport places the RPC task on the delayq, which forces a wait for HZ >> 2 before the marshal and send is retried. With this change, the transport now places such an RPC task on the pending queue, and wakes it just as soon as more sendctxs become available. This typically takes less than a millisecond, and the write_space waking mechanism is less deadlock-prone. Moreover, the waiting RPC task is holding the transport's write lock, which blocks the transport from sending RPCs. Therefore faster recovery from sendctx queue exhaustion is desirable. Cf. commit 5804891455d5 ("xprtrdma: ->send_request returns -EAGAIN when there are no free MRs"). Signed-off-by: NChuck Lever <chuck.lever@oracle.com> Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
-
- 12 5月, 2018 1 次提交
-
-
由 Chuck Lever 提交于
Clean up: Move #include <trace/events/rpcrdma.h> into source files, similar to how it is done with trace/events/sunrpc.h. Server-side trace points will be part of the rpcrdma subsystem, just like the client-side trace points. Signed-off-by: NChuck Lever <chuck.lever@oracle.com> Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
-
- 07 5月, 2018 10 次提交
-
-
由 Chuck Lever 提交于
Clean up: The only call site is in the same file as the function's definition. Signed-off-by: NChuck Lever <chuck.lever@oracle.com> Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
-
由 Chuck Lever 提交于
Clean up: There is only one remaining call site for this helper. Signed-off-by: NChuck Lever <chuck.lever@oracle.com> Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
-
由 Chuck Lever 提交于
Clean up. There is only one call-site for this helper, and it can be simplified by using list_first_entry_or_null(). Signed-off-by: NChuck Lever <chuck.lever@oracle.com> Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
-
由 Chuck Lever 提交于
Clean up: These functions are no longer used. Signed-off-by: NChuck Lever <chuck.lever@oracle.com> Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
-
由 Chuck Lever 提交于
Receive completion and Reply handling are done by a BOUND workqueue, meaning they run on only one CPU. Posting receives is currently done in the send_request path, which on large systems is typically done on a different CPU than the one handling Receive completions. This results in movement of Receive-related cachelines between the sending and receiving CPUs. More importantly, it means that currently Receives are posted while the transport's write lock is held, which is unnecessary and costly. Finally, allocation of Receive buffers is performed on-demand in the Receive completion handler. This helps guarantee that they are allocated on the same NUMA node as the CPU that handles Receive completions. Signed-off-by: NChuck Lever <chuck.lever@oracle.com> Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
-
由 Chuck Lever 提交于
For clarity, report the posting and completion of Receive CQEs. Also, the wc->byte_len field contains garbage if wc->status is non-zero, and the vendor error field contains garbage if wc->status is zero. For readability, don't save those fields in those cases. Signed-off-by: NChuck Lever <chuck.lever@oracle.com> Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
-
由 Chuck Lever 提交于
For FRWR, the computation of max_send_wr is split between frwr_op_open and rpcrdma_ep_create, which makes it difficult to tell that the max_send_wr result is currently incorrect if frwr_op_open has to reduce the credit limit to accommodate a small max_qp_wr. This is a problem now that extra WRs are needed for backchannel operations and a drain CQE. So, refactor the computation so that it is all done in ->ro_open, and fix the FRWR version of this computation so that it accommodates HCAs with small max_qp_wr correctly. Signed-off-by: NChuck Lever <chuck.lever@oracle.com> Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
-
由 Chuck Lever 提交于
Set up RPC/RDMA transport in mount.nfs's network namespace. This passes the correct namespace information to the RDMA core, similar to how RPC sockets are created (see xs_create_sock). Signed-off-by: NChuck Lever <chuck.lever@oracle.com> Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
-
由 Chuck Lever 提交于
rdma_resolve_addr(3) says: > This call is used to map a given destination IP address to a > usable RDMA address. The IP to RDMA address mapping is done > using the local routing tables, or via ARP. If this can't be done, there's no local device that can be used to establish an RDMA-capable network path to the remote. In this case, the RDMA CM very quickly posts an RDMA_CM_EVENT_ADDR_ERROR upcall. Currently rpcrdma_conn_upcall() converts RDMA_CM_EVENT_ADDR_ERROR to EHOSTUNREACH. mount.nfs seems to want to retry EHOSTUNREACH forever, thinking that this is a temporary situation. This makes mount.nfs appear to hang if I try to mount with proto=rdma through, say, a conventional Ethernet device. If the admin has specified proto=rdma along with a server IP address that requires a network path that does not support RDMA, instead let's fail with a permanent error. -EPROTONOSUPPORT is returned when NFSv4 or one of its minor versions is not supported. -EPROTO is not (currently) retried by mount.nfs. There are potentially other similar cases where -EPROTO is an appropriate return code. Signed-off-by: NChuck Lever <chuck.lever@oracle.com> Tested-by: NOlga Kornievskaia <kolga@netapp.com> Tested-by: NAnna Schumaker <Anna.Schumaker@netapp.com> Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
-
由 Chuck Lever 提交于
Signed-off-by: NChuck Lever <chuck.lever@oracle.com> Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
-
- 02 5月, 2018 1 次提交
-
-
由 Chuck Lever 提交于
The ro_release_mr methods check whether mr->mr_list is empty. Therefore, be sure to always use list_del_init when removing an MR linked into a list using that field. Otherwise, when recovering from transport failures or device removal, list corruption can result, or MRs can get mapped or unmapped an odd number of times, resulting in IOMMU-related failures. In general this fix is appropriate back to v4.8. However, code changes since then make it impossible to apply this patch directly to stable kernels. The fix would have to be applied by hand or reworked for kernels earlier than v4.16. Backport guidance -- there are several cases: - When creating an MR, initialize mr_list so that using list_empty on an as-yet-unused MR is safe. - When an MR is being handled by the remote invalidation path, ensure that mr_list is reinitialized when it is removed from rl_registered. - When an MR is being handled by rpcrdma_destroy_mrs, it is removed from mr_all, but it may still be on an rl_registered list. In that case, the MR needs to be removed from that list before being released. - Other cases are covered by using list_del_init in rpcrdma_mr_pop. Fixes: 9d6b0409 ('xprtrdma: Place registered MWs on a ... ') Signed-off-by: NChuck Lever <chuck.lever@oracle.com> Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
-
- 11 4月, 2018 7 次提交
-
-
由 Chuck Lever 提交于
Michal Kalderon has found some corner cases around device unload with active NFS mounts that I didn't have the imagination to test when xprtrdma device removal was added last year. - The ULP device removal handler is responsible for deallocating the PD. That wasn't clear to me initially, and my own testing suggested it was not necessary, but that is incorrect. - The transport destruction path can no longer assume that there is a valid ID. - When destroying a transport, ensure that ib_free_cq() is not invoked on a CQ that was already released. Reported-by: NMichal Kalderon <Michal.Kalderon@cavium.com> Fixes: bebd0318 ("xprtrdma: Support unplugging an HCA from ...") Signed-off-by: NChuck Lever <chuck.lever@oracle.com> Cc: stable@vger.kernel.org # v4.12+ Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
-
由 Chuck Lever 提交于
Refactor: Both rpcrdma_create_req call sites have to allocate the buffer where the transport header is built, so just move that allocation into rpcrdma_create_req. This buffer is a fixed size. There's no needed information available in call_allocate that is not also available when the transport is created. The original purpose for allocating these buffers on demand was to reduce the possibility that an allocation failure during transport creation will hork the mount operation during low memory scenarios. Some relief for this rare possibility is coming up in the next few patches. Signed-off-by: NChuck Lever <chuck.lever@oracle.com> Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
-
由 Chuck Lever 提交于
With FRWR, the client transport can perform memory registration and post a Send with just a single ib_post_send. This reduces contention between the send_request path and the Send Completion handlers, and reduces the overhead of registering a chunk that has multiple segments. Signed-off-by: NChuck Lever <chuck.lever@oracle.com> Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
-
由 Chuck Lever 提交于
Create fewer MRs on average. Many workloads don't need as many as 32 MRs, and the transport can now quickly restock the MR free list. Signed-off-by: NChuck Lever <chuck.lever@oracle.com> Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
-
由 Chuck Lever 提交于
Currently, when the MR free list is exhausted during marshaling, the RPC/RDMA transport places the RPC task on the delayq, which forces a wait for HZ >> 2 before the marshal and send is retried. With this change, the transport now places such an RPC task on the pending queue, and wakes it just as soon as more MRs have been created. Creating more MRs typically takes less than a millisecond, and this waking mechanism is less deadlock-prone. Moreover, the waiting RPC task is holding the transport's write lock, which blocks the transport from sending RPCs. Therefore faster recovery from MR exhaustion is desirable. This is the same mechanism that the TCP transport utilizes when handling write buffer space exhaustion. Signed-off-by: NChuck Lever <chuck.lever@oracle.com> Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
-
由 Chuck Lever 提交于
Clean up: The generic rq_connect_cookie is sufficient to detect RPC Call retransmission. Signed-off-by: NChuck Lever <chuck.lever@oracle.com> Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
-
由 Chuck Lever 提交于
Clean up: We need to check only that the value does not exceed the range of the u8 field it's going into. Signed-off-by: NChuck Lever <chuck.lever@oracle.com> Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
-
- 03 2月, 2018 2 次提交
-
-
由 Chuck Lever 提交于
Michal Kalderon reports a BUG that occurs just after device removal: [ 169.112490] rpcrdma: removing device qedr0 for 192.168.110.146:20049 [ 169.143909] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010 [ 169.181837] IP: rpcrdma_dma_unmap_regbuf+0xa/0x60 [rpcrdma] The RPC/RDMA client transport attempts to allocate some resources on demand. Registered buffers are one such resource. These are allocated (or re-allocated) by xprt_rdma_allocate to hold RPC Call and Reply messages. A hardware resource is associated with each of these buffers, as they can be used for a Send or Receive Work Request. If a device is removed from under an NFS/RDMA mount, the transport layer is responsible for releasing all hardware resources before the device can be finally unplugged. A BUG results when the NFS mount hasn't yet seen much activity: the transport tries to release resources that haven't yet been allocated. rpcrdma_free_regbuf() already checks for this case, so just move that check to cover the DEVICE_REMOVAL case as well. Reported-by: NMichal Kalderon <Michal.Kalderon@cavium.com> Fixes: bebd0318 ("xprtrdma: Support unplugging an HCA ...") Signed-off-by: NChuck Lever <chuck.lever@oracle.com> Tested-by: NMichal Kalderon <Michal.Kalderon@cavium.com> Cc: stable@vger.kernel.org # v4.12+ Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
-
由 Chuck Lever 提交于
Commit 16f906d6 ("xprtrdma: Reduce required number of send SGEs") introduced the rpcrdma_ia::ri_max_send_sges field. This fixes a problem where xprtrdma would not work if the device's max_sge capability was small (low single digits). At least RPCRDMA_MIN_SEND_SGES are needed for the inline parts of each RPC. ri_max_send_sges is set to this value: ia->ri_max_send_sges = max_sge - RPCRDMA_MIN_SEND_SGES; Then when marshaling each RPC, rpcrdma_args_inline uses that value to determine whether the device has enough Send SGEs to convey an NFS WRITE payload inline, or whether instead a Read chunk is required. More recently, commit ae72950a ("xprtrdma: Add data structure to manage RDMA Send arguments") used the ri_max_send_sges value to calculate the size of an array, but that commit erroneously assumed ri_max_send_sges contains a value similar to the device's max_sge, and not one that was reduced by the minimum SGE count. This assumption results in the calculated size of the sendctx's Send SGE array to be too small. When the array is used to marshal an RPC, the code can write Send SGEs into the following sendctx element in that array, corrupting it. When the device's max_sge is large, this issue is entirely harmless; but it results in an oops in the provider's post_send method, if dev.attrs.max_sge is small. So let's straighten this out: ri_max_send_sges will now contain a value with the same meaning as dev.attrs.max_sge, which makes the code easier to understand, and enables rpcrdma_sendctx_create to calculate the size of the SGE array correctly. Reported-by: NMichal Kalderon <Michal.Kalderon@cavium.com> Fixes: 16f906d6 ("xprtrdma: Reduce required number of send SGEs") Signed-off-by: NChuck Lever <chuck.lever@oracle.com> Tested-by: NMichal Kalderon <Michal.Kalderon@cavium.com> Cc: stable@vger.kernel.org # v4.10+ Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
-
- 23 1月, 2018 8 次提交
-
-
由 Chuck Lever 提交于
Fix kernel-doc warnings in net/sunrpc/xprtrdma/ . net/sunrpc/xprtrdma/verbs.c:1575: warning: No description found for parameter 'count' net/sunrpc/xprtrdma/verbs.c:1575: warning: Excess function parameter 'min_reqs' description in 'rpcrdma_ep_post_extra_recv' net/sunrpc/xprtrdma/backchannel.c:288: warning: No description found for parameter 'r_xprt' net/sunrpc/xprtrdma/backchannel.c:288: warning: Excess function parameter 'xprt' description in 'rpcrdma_bc_receive_call' Signed-off-by: NChuck Lever <chuck.lever@oracle.com> Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
-
由 Chuck Lever 提交于
Signed-off-by: NChuck Lever <chuck.lever@oracle.com> Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
-
由 Chuck Lever 提交于
Signed-off-by: NChuck Lever <chuck.lever@oracle.com> Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
-
由 Chuck Lever 提交于
Signed-off-by: NChuck Lever <chuck.lever@oracle.com> Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
-
由 Chuck Lever 提交于
Signed-off-by: NChuck Lever <chuck.lever@oracle.com> Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
-
由 Chuck Lever 提交于
Signed-off-by: NChuck Lever <chuck.lever@oracle.com> Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
-
由 Chuck Lever 提交于
Signed-off-by: NChuck Lever <chuck.lever@oracle.com> Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
-
由 Chuck Lever 提交于
Signed-off-by: NChuck Lever <chuck.lever@oracle.com> Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
-