1. 03 Jan, 2019 8 commits
  2. 03 Oct, 2018 2 commits
    • xprtrdma: Simplify RPC wake-ups on connect · 31e62d25
      Committed by Chuck Lever
      Currently, when a connection is established, rpcrdma_conn_upcall
      invokes rpcrdma_conn_func and then
      wake_up_all(&ep->rep_connect_wait). The former wakes waiting RPCs,
      but the connect worker is not done yet, and that leads to races,
      double wakes, and difficulty understanding how this logic is
      supposed to work.
      
      Instead, collect all the "connection established" logic in the
      connect worker (xprt_rdma_connect_worker). A disconnect worker is
      retained to handle provider upcalls safely.
      
      Fixes: 254f91e2 ("xprtrdma: RPC/RDMA must invoke ... ")
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Explicitly resetting MRs is no longer necessary · 61da886b
      Committed by Chuck Lever
      When a memory operation fails, the MR's driver state might not match
      its hardware state. The only reliable recourse is to dereg the MR.
      This is done in ->ro_recover_mr, which then attempts to allocate a
      fresh MR to replace the released MR.
      
      Since commit e2ac236c ("xprtrdma: Allocate MRs on demand"),
      xprtrdma dynamically allocates MRs. It can add more MRs whenever
      they are needed.
      
      That makes it possible to simply release an MR when a memory
      operation fails, instead of "recovering" it. It will automatically
      be replaced by the on-demand MR allocator.
      
      This commit is a little larger than I wanted, but it replaces
      ->ro_recover_mr, rb_recovery_lock, rb_recovery_worker, and the
      rb_stale_mrs list with a generic work queue.
      
      Since MRs are no longer orphaned, the mrs_orphaned metric is no
      longer used.
      
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
  3. 02 Jun, 2018 1 commit
    • xprtrdma: Wait on empty sendctx queue · 2fad6592
      Committed by Chuck Lever
      Currently, when the sendctx queue is exhausted during marshaling, the
      RPC/RDMA transport places the RPC task on the delayq, which forces a
      wait for HZ >> 2 before the marshal and send are retried.
      
      With this change, the transport now places such an RPC task on the
      pending queue, and wakes it just as soon as more sendctxs become
      available. This typically takes less than a millisecond, and the
      write_space waking mechanism is less deadlock-prone.
      
      Moreover, the waiting RPC task is holding the transport's write
      lock, which blocks the transport from sending RPCs. Therefore, faster
      recovery from sendctx queue exhaustion is desirable.
      
      Cf. commit 5804891455d5 ("xprtrdma: ->send_request returns -EAGAIN
      when there are no free MRs").
      
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
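To put the delayq wait described above in concrete terms, the following user-space sketch converts HZ >> 2 jiffies to milliseconds for a few common tick rates. The helper names (demo_jiffies_to_ms, demo_delay_ms) are hypothetical and not part of the kernel; the point is only that the old wait is roughly a quarter of a second whatever HZ is configured, compared with the sub-millisecond wake-up the commit message reports for the pending queue.

```c
#include <assert.h>

/* Hypothetical helper: convert a jiffies count to milliseconds
 * for a given tick rate (hz ticks per second). */
static unsigned int demo_jiffies_to_ms(unsigned int jiffies, unsigned int hz)
{
	assert(hz > 0);
	return jiffies * 1000u / hz;
}

/* The old delayq wait was HZ >> 2 jiffies, i.e. a quarter of a second
 * (rounded down to a whole number of ticks). */
static unsigned int demo_delay_ms(unsigned int hz)
{
	return demo_jiffies_to_ms(hz >> 2, hz);
}
```

For example, demo_delay_ms(1000) and demo_delay_ms(100) both evaluate to 250, while demo_delay_ms(250) evaluates to 248 because 250 >> 2 truncates to 62 ticks.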
  4. 12 May, 2018 1 commit
  5. 07 May, 2018 5 commits
  6. 02 May, 2018 1 commit
    • xprtrdma: Fix list corruption / DMAR errors during MR recovery · 054f1557
      Committed by Chuck Lever
      The ro_release_mr methods check whether mr->mr_list is empty.
      Therefore, be sure to always use list_del_init when removing an MR
      linked into a list using that field. Otherwise, when recovering from
      transport failures or device removal, list corruption can result, or
      MRs can get mapped or unmapped an odd number of times, resulting in
      IOMMU-related failures.
      
      In general this fix is appropriate back to v4.8. However, code
      changes since then make it impossible to apply this patch directly
      to stable kernels. The fix would have to be applied by hand or
      reworked for kernels earlier than v4.16.
      
      Backport guidance -- there are several cases:
      - When creating an MR, initialize mr_list so that using list_empty
        on an as-yet-unused MR is safe.
      - When an MR is being handled by the remote invalidation path,
        ensure that mr_list is reinitialized when it is removed from
        rl_registered.
      - When an MR is being handled by rpcrdma_destroy_mrs, it is removed
        from mr_all, but it may still be on an rl_registered list. In
        that case, the MR needs to be removed from that list before being
        released.
      - Other cases are covered by using list_del_init in rpcrdma_mr_pop.
      
      Fixes: 9d6b0409 ('xprtrdma: Place registered MWs on a ... ')
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
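The reason list_del_init matters when a release path tests list_empty on mr_list can be shown with a small user-space sketch. The list helpers below are simplified stand-ins for the kernel's list API (the real list_del writes poison pointers rather than leaving stale ones), and struct demo_mr and demo_looks_linked_after_removal are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified user-space stand-in for the kernel's struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

static void INIT_LIST_HEAD(struct list_head *h) { h->next = h; h->prev = h; }
static bool list_empty(const struct list_head *h) { return h->next == h; }

static void list_add(struct list_head *entry, struct list_head *head)
{
	entry->next = head->next;
	entry->prev = head;
	head->next->prev = entry;
	head->next = entry;
}

/* Unlink only: the entry's own pointers are left stale. */
static void list_del(struct list_head *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
}

/* Unlink and reinitialize, so list_empty(entry) is true afterwards. */
static void list_del_init(struct list_head *entry)
{
	list_del(entry);
	INIT_LIST_HEAD(entry);
}

/* Hypothetical MR carrying only the field discussed above. */
struct demo_mr {
	struct list_head mr_list;
};

/* Returns true if a later list_empty(&mr->mr_list) check would still
 * treat the MR as linked after it was removed from its list. */
static bool demo_looks_linked_after_removal(bool use_del_init)
{
	struct list_head registered;
	struct demo_mr mr;

	INIT_LIST_HEAD(&registered);
	INIT_LIST_HEAD(&mr.mr_list);
	list_add(&mr.mr_list, &registered);

	if (use_del_init)
		list_del_init(&mr.mr_list);
	else
		list_del(&mr.mr_list);

	assert(list_empty(&registered));	/* the list itself is empty either way */
	return !list_empty(&mr.mr_list);
}
```

With plain list_del the removed MR still appears to be on a list, so a release path that checks list_empty would try to unlink it a second time; with list_del_init the check correctly reports that it is unlinked.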
  7. 11 Apr, 2018 3 commits
    • xprtrdma: Chain Send to FastReg WRs · f2877623
      Committed by Chuck Lever
      With FRWR, the client transport can perform memory registration and
      post a Send with just a single ib_post_send.
      
      This reduces contention between the send_request path and the Send
      Completion handlers, and reduces the overhead of registering a chunk
      that has multiple segments.
      
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
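The idea of chaining registration work requests to the Send can be sketched in user space. In the kernel, ib_post_send accepts a list of work requests linked through their next pointers; everything below (struct demo_wr, struct demo_qp, demo_post_send, demo_send_chained) is a hypothetical model of that pattern, not the xprtrdma code, and it assumes the registration WRs are placed ahead of the Send WR in the chain.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical work request types standing in for FastReg and Send. */
enum demo_opcode { DEMO_WR_REG_MR, DEMO_WR_SEND };

struct demo_wr {
	struct demo_wr *next;	/* next WR in the chain, or NULL */
	enum demo_opcode opcode;
};

/* Hypothetical queue pair that only counts what is posted to it. */
struct demo_qp {
	int post_calls;		/* number of post operations */
	int wrs_posted;		/* total WRs handed over */
};

/* Models a post call that consumes a whole chain of WRs at once. */
static int demo_post_send(struct demo_qp *qp, struct demo_wr *first)
{
	qp->post_calls++;
	for (struct demo_wr *wr = first; wr != NULL; wr = wr->next)
		qp->wrs_posted++;
	return 0;
}

/* Link nregs registration WRs in front of one Send WR, then post once. */
static int demo_send_chained(struct demo_qp *qp, struct demo_wr *regs,
			     size_t nregs, struct demo_wr *send)
{
	struct demo_wr *first = send;

	assert(send != NULL);
	send->opcode = DEMO_WR_SEND;
	send->next = NULL;

	/* Walk backwards so regs[0] ends up at the head of the chain. */
	for (size_t i = nregs; i-- > 0; ) {
		regs[i].opcode = DEMO_WR_REG_MR;
		regs[i].next = first;
		first = &regs[i];
	}
	return demo_post_send(qp, first);
}
```

In this model a chunk with three segments costs one post call carrying four WRs, rather than separate posts for registration and for the Send.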
    • xprtrdma: Remove xprt-specific connect cookie · 8a14793e
      Committed by Chuck Lever
      Clean up: The generic rq_connect_cookie is sufficient to detect RPC
      Call retransmission.
      
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Fix latency regression on NUMA NFS/RDMA clients · 6720a899
      Committed by Chuck Lever
      With v4.15, on one of my NFS/RDMA clients I measured a near
      doubling of the latency of small read and write system calls. There
      was no change in server round trip time. The extra latency appears
      in the whole RPC execution path.
      
      "git bisect" settled on commit ccede759 ("xprtrdma: Spread reply
      processing over more CPUs").
      
      After some experimentation, I found that leaving the WQ bound and
      allowing the scheduler to pick the dispatch CPU seems to eliminate
      the long latencies, and it does not introduce any new regressions.
      
      The fix is implemented by reverting only the part of
      commit ccede759 ("xprtrdma: Spread reply processing over more
      CPUs") that dispatches RPC replies specifically on the CPU where the
      matching RPC call was made.
      
      Interestingly, saving the CPU number and later queuing reply
      processing there was effective _only_ for NFS READ and WRITE
      requests. On my NUMA client, in-kernel RPC reply processing for
      asynchronous RPCs was dispatched on the same CPU where the RPC call
      was made, as expected. However, synchronous RPCs seem to get their
      reply dispatched on a CPU other than the one where the call was
      placed, every time.
      
      Fixes: ccede759 ("xprtrdma: Spread reply processing over ... ")
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Cc: stable@vger.kernel.org # v4.15+
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
  8. 23 Jan, 2018 2 commits
  9. 17 Jan, 2018 11 commits
  10. 16 Dec, 2017 1 commit
    • xprtrdma: Spread reply processing over more CPUs · ccede759
      Committed by Chuck Lever
      Commit d8f532d2 ("xprtrdma: Invoke rpcrdma_reply_handler
      directly from RECV completion") introduced a performance regression
      for NFS I/O small enough to not need memory registration. In
      multi-threaded benchmarks that generate primarily small I/O
      requests, IOPS throughput is reduced by nearly a third. This patch restores
      the previous level of throughput.
      
      Because workqueues are typically BOUND (in particular ib_comp_wq,
      nfsiod_workqueue, and rpciod_workqueue), NFS/RDMA workloads tend
      to aggregate on the CPU that is handling Receive completions.
      
      The usual approach to addressing this problem is to create a QP
      and CQ for each CPU, and then schedule transactions on the QP
      for the CPU where you want the transaction to complete. The
      transaction then does not require an extra context switch during
      completion to end up on the same CPU where the transaction was
      started.
      
      This approach doesn't work for the Linux NFS/RDMA client because
      currently the Linux NFS client does not support multiple connections
      per client-server pair, and the RDMA core API does not make it
      straightforward for ULPs to determine which CPU is responsible for
      handling Receive completions for a CQ.
      
      So for the moment, record the CPU number in the rpcrdma_req before
      the transport sends each RPC Call. Then during Receive completion,
      queue the RPC completion on that same CPU.
      
      Additionally, move all RPC completion processing to the deferred
      handler so that even RPCs with simple small replies complete on
      the CPU that sent the corresponding RPC Call.
      
      Fixes: d8f532d2 ("xprtrdma: Invoke rpcrdma_reply_handler ...")
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
  11. 18 Nov, 2017 5 commits