1. 03 Jan 2019: 6 commits
    • xprtrdma: Remove support for FMR memory registration · ba69cd12
      Committed by Chuck Lever
      FMR is not supported on most recent RDMA devices. It is also less
      secure than FRWR because an FMR memory registration can expose
      adjacent bytes to remote reading or writing. As discussed during the
      RDMA BoF at LPC 2018, it is time to remove support for FMR in the
      NFS/RDMA client stack.
      
      Note that NFS/RDMA server-side uses either local memory registration
      or FRWR. FMR is not used.
      
      There are a few Infiniband/RoCE devices in the kernel tree that do
      not appear to support MEM_MGT_EXTENSIONS (FRWR), and therefore will
      not support client-side NFS/RDMA after this patch. These are:
      
       - mthca
       - qib
       - hns (RoCE)
      
      Users of these devices can use NFS/TCP on IPoIB instead.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Don't wake pending tasks until disconnect is done · 0c0829bc
      Committed by Chuck Lever
      Transport disconnect processing does a "wake pending tasks" at
      various points.
      
      Suppose an RPC Reply is being processed. The RPC task that Reply
      goes with is waiting on the pending queue. If a disconnect wake-up
      happens before reply processing is done, that reply, even if it is
      good, is thrown away, and the RPC has to be sent again.
      
      This window apparently does not exist for socket transports because
      there is a lock held while a reply is being received which prevents
      the wake-up call until after reply processing is done.
      
      To resolve this, all RPC replies being processed on an RPC-over-RDMA
      transport have to complete before pending tasks are awoken due to a
      transport disconnect.
      
      Callers that already hold the transport write lock may invoke
      ->ops->close directly. Others use a generic helper that schedules
      a close when the write lock can be taken safely.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: No qp_event disconnect · 3d433ad8
      Committed by Chuck Lever
      After thinking about this more, and auditing other kernel ULP
      implementations, I believe that a DISCONNECT cm_event will occur after a
      fatal QP event. If that's the case, there's no need for an explicit
      disconnect in the QP event handler.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Replace rpcrdma_receive_wq with a per-xprt workqueue · 6d2d0ee2
      Committed by Chuck Lever
      To address a connection-close ordering problem, we need the ability
      to drain the RPC completions running on rpcrdma_receive_wq for just
      one transport. Give each transport its own RPC completion workqueue,
      and drain that workqueue when disconnecting the transport.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Refactor Receive accounting · 6ceea368
      Committed by Chuck Lever
      Clean up: Divide the work cleanly:
      
      - rpcrdma_wc_receive is responsible only for RDMA Receives
      - rpcrdma_reply_handler is responsible only for RPC Replies
      - the posted send and receive counts both belong in rpcrdma_ep
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Yet another double DMA-unmap · e2f34e26
      Committed by Chuck Lever
      While chasing yet another set of DMAR fault reports, I noticed that
      the frwr recycler conflates whether or not an MR has been DMA
      unmapped with frwr->fr_state. Actually the two have only an indirect
      relationship. It's in fact impossible to guess reliably whether the
      MR has been DMA unmapped based on its fr_state field, especially as
      the surrounding code and its assumptions have changed over time.
      
      A better approach is to track the DMA mapping status explicitly so
      that the recycler is less brittle to unexpected situations, and
      attempts to DMA-unmap a second time are prevented.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Cc: stable@vger.kernel.org # v4.20
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
  2. 04 Oct 2018: 2 commits
  3. 03 Oct 2018: 10 commits
  4. 09 Aug 2018: 1 commit
    • xprtrdma: Fix disconnect regression · 8d4fb8ff
      Committed by Chuck Lever
      I found that injecting disconnects with v4.18-rc resulted in
      random failures of the multi-threaded git regression test.
      
      The root cause appears to be that, after a reconnect, the
      RPC/RDMA transport is waking pending RPCs before the transport has
      posted enough Receive buffers to receive the Replies. If a Reply
      arrives before enough Receive buffers are posted, the connection
      is dropped. A few connection drops happen in quick succession as
      the client and server struggle to regain credit synchronization.
      
      This regression was introduced with commit 7c8d9e7c ("xprtrdma:
      Move Receive posting to Receive handler"). The client is supposed to
      post a single Receive when a connection is established because
      it's not supposed to send more than one RPC Call before it gets
      a fresh credit grant in the first RPC Reply [RFC 8166, Section
      3.3.3].
      
      Unfortunately there appears to be a longstanding bug in the Linux
      client's credit accounting mechanism. On connect, it simply dumps
      all pending RPC Calls onto the new connection. It's possible it has
      done this ever since the RPC/RDMA transport was added to the kernel
      ten years ago.
      
      Servers have so far been tolerant of this bad behavior. Currently no
      server implementation ever changes its credit grant over reconnects,
      and servers always repost enough Receives before connections are
      fully established.
      
      The Linux client implementation used to post a Receive before each
      of these Calls. This has covered up the flooding send behavior.
      
      I could try to correct this old bug so that the client sends exactly
      one RPC Call and waits for a Reply. Since we are so close to the
      next merge window, I'm going to instead provide a simple patch to
      post enough Receives before a reconnect completes (based on the
      number of credits granted to the previous connection).
      
      The spurious disconnects will be gone, but the client will still
      send multiple RPC Calls immediately after a reconnect.
      
      Addressing the latter problem will wait for a merge window because
      a) I expect it to be a large change requiring lots of testing, and
      b) obviously the Linux client has interoperated successfully since
      day zero while still being broken.
      
      Fixes: 7c8d9e7c ("xprtrdma: Move Receive posting to ... ")
      Cc: stable@vger.kernel.org # v4.18+
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
  5. 31 Jul 2018: 1 commit
    • RDMA, core and ULPs: Declare ib_post_send() and ib_post_recv() arguments const · d34ac5cd
      Committed by Bart Van Assche
      Since neither ib_post_send() nor ib_post_recv() modify the data structure
      their second argument points at, declare that argument const. This change
      makes it necessary to declare the 'bad_wr' argument const too and also to
      modify all ULPs that call ib_post_send(), ib_post_recv() or
      ib_post_srq_recv(). This patch does not change any functionality but makes
      it possible for the compiler to verify whether the
      ib_post_(send|recv|srq_recv) functions really do not modify the posted
      work request.
      
      To make this possible, only one cast that casts away constness had to be
      introduced, namely in rpcrdma_post_recvs(). The only way I can think of to
      avoid that cast is to introduce an additional loop in that function or to
      change the data type of bad_wr from struct ib_recv_wr ** into int
      (an index that refers to an element in the work request list). However,
      both approaches would require even more extensive changes than this
      patch.
      Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
      Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
  6. 19 Jun 2018: 1 commit
  7. 02 Jun 2018: 1 commit
    • xprtrdma: Wait on empty sendctx queue · 2fad6592
      Committed by Chuck Lever
      Currently, when the sendctx queue is exhausted during marshaling, the
      RPC/RDMA transport places the RPC task on the delayq, which forces a
      wait for HZ >> 2 before the marshal and send is retried.
      
      With this change, the transport now places such an RPC task on the
      pending queue, and wakes it just as soon as more sendctxs become
      available. This typically takes less than a millisecond, and the
      write_space waking mechanism is less deadlock-prone.
      
      Moreover, the waiting RPC task is holding the transport's write
      lock, which blocks the transport from sending RPCs. Therefore faster
      recovery from sendctx queue exhaustion is desirable.
      
      Cf. commit 5804891455d5 ("xprtrdma: ->send_request returns -EAGAIN
      when there are no free MRs").
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
  8. 12 May 2018: 1 commit
  9. 07 May 2018: 10 commits
  10. 02 May 2018: 1 commit
    • xprtrdma: Fix list corruption / DMAR errors during MR recovery · 054f1557
      Committed by Chuck Lever
      The ro_release_mr methods check whether mr->mr_list is empty.
      Therefore, be sure to always use list_del_init when removing an MR
      linked into a list using that field. Otherwise, when recovering from
      transport failures or device removal, list corruption can result, or
      MRs can get mapped or unmapped an odd number of times, resulting in
      IOMMU-related failures.
      
      In general this fix is appropriate back to v4.8. However, code
      changes since then make it impossible to apply this patch directly
      to stable kernels. The fix would have to be applied by hand or
      reworked for kernels earlier than v4.16.
      
      Backport guidance -- there are several cases:
      - When creating an MR, initialize mr_list so that using list_empty
        on an as-yet-unused MR is safe.
      - When an MR is being handled by the remote invalidation path,
        ensure that mr_list is reinitialized when it is removed from
        rl_registered.
      - When an MR is being handled by rpcrdma_destroy_mrs, it is removed
        from mr_all, but it may still be on an rl_registered list. In
        that case, the MR needs to be removed from that list before being
        released.
      - Other cases are covered by using list_del_init in rpcrdma_mr_pop.
      
      Fixes: 9d6b0409 ('xprtrdma: Place registered MWs on a ... ')
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
  11. 11 Apr 2018: 6 commits