1. 03 October 2018, 2 commits
    • xprtrdma: Simplify RPC wake-ups on connect · 31e62d25
      Committed by Chuck Lever
      Currently, when a connection is established, rpcrdma_conn_upcall
      invokes rpcrdma_conn_func and then
      wake_up_all(&ep->rep_connect_wait). The former wakes waiting RPCs,
      but the connect worker is not done yet, and that leads to races,
      double wakes, and difficulty understanding how this logic is
      supposed to work.
      
      Instead, collect all the "connection established" logic in the
      connect worker (xprt_rdma_connect_worker). A disconnect worker is
      retained to handle provider upcalls safely.
      
      Fixes: 254f91e2 ("xprtrdma: RPC/RDMA must invoke ... ")
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
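
      A minimal sketch of the arrangement this patch moves to, using
      invented demo_* names rather than the actual xprtrdma symbols
      (initialization and error handling omitted): the connect worker
      finishes all of its "connection established" work first, and only
      then issues the single wake-up that releases RPCs sleeping on
      rep_connect_wait.

      /* Sketch: complete the connect work before waking waiters, so the
       * wait condition is stable by the time an RPC is released.
       */
      #include <linux/kernel.h>
      #include <linux/wait.h>
      #include <linux/workqueue.h>

      struct demo_ep {
              wait_queue_head_t rep_connect_wait;
              int rep_connected;      /* 1 = connected, <0 = error */
              struct work_struct connect_worker;
      };

      static void demo_connect_worker(struct work_struct *work)
      {
              struct demo_ep *ep = container_of(work, struct demo_ep,
                                                connect_worker);
              int rc;

              rc = 0;  /* ... establish connection, set up resources ... */

              /* Everything is finished before any waiter is woken. */
              ep->rep_connected = rc ? rc : 1;
              wake_up_all(&ep->rep_connect_wait);
      }

      /* Caller side: sleep until the worker reports a final state. */
      static int demo_wait_for_connect(struct demo_ep *ep)
      {
              wait_event(ep->rep_connect_wait, ep->rep_connected != 0);
              return ep->rep_connected > 0 ? 0 : ep->rep_connected;
      }
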
    • xprtrdma: Explicitly resetting MRs is no longer necessary · 61da886b
      Committed by Chuck Lever
      When a memory operation fails, the MR's driver state might not match
      its hardware state. The only reliable recourse is to dereg the MR.
      This is done in ->ro_recover_mr, which then attempts to allocate a
      fresh MR to replace the released MR.
      
      Since commit e2ac236c ("xprtrdma: Allocate MRs on demand"),
      xprtrdma dynamically allocates MRs. It can add more MRs whenever
      they are needed.
      
      That makes it possible to simply release an MR when a memory
      operation fails, instead of "recovering" it. It will automatically
      be replaced by the on-demand MR allocator.
      
      This commit is a little larger than I wanted, but it replaces
      ->ro_recover_mr, rb_recovery_lock, rb_recovery_worker, and the
      rb_stale_mrs list with a generic work queue.
      
      Since MRs are no longer orphaned, the mrs_orphaned metric is no
      longer used.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
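
      A minimal sketch of the release-and-replace approach described
      above, with invented demo_* names rather than the real xprtrdma
      symbols: a failed MR is handed to a work item that deregisters and
      frees it, and the on-demand allocator supplies a replacement later.

      /* Sketch: recycle a failed MR via a generic work queue instead of
       * trying to repair it in place.
       */
      #include <linux/kernel.h>
      #include <linux/slab.h>
      #include <linux/workqueue.h>

      struct demo_mr {
              struct work_struct mr_recycle;
              /* ... provider handle, DMA mapping state ... */
      };

      static void demo_mr_recycle_worker(struct work_struct *work)
      {
              struct demo_mr *mr = container_of(work, struct demo_mr,
                                                mr_recycle);

              /* deregister with the provider, DMA-unmap, then free */
              kfree(mr);
      }

      /* Called when a memory operation fails: no recovery, just release. */
      static void demo_mr_recycle(struct demo_mr *mr)
      {
              INIT_WORK(&mr->mr_recycle, demo_mr_recycle_worker);
              queue_work(system_wq, &mr->mr_recycle);
      }
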
  2. 02 June 2018, 1 commit
    • xprtrdma: Wait on empty sendctx queue · 2fad6592
      Committed by Chuck Lever
      Currently, when the sendctx queue is exhausted during marshaling, the
      RPC/RDMA transport places the RPC task on the delayq, which forces a
      wait for HZ >> 2 before the marshal and send is retried.
      
      With this change, the transport now places such an RPC task on the
      pending queue, and wakes it just as soon as more sendctxs become
      available. This typically takes less than a millisecond, and the
      write_space waking mechanism is less deadlock-prone.
      
      Moreover, the waiting RPC task is holding the transport's write
      lock, which blocks the transport from sending RPCs. Therefore faster
      recovery from sendctx queue exhaustion is desirable.
      
      Cf. commit 5804891455d5 ("xprtrdma: ->send_request returns -EAGAIN
      when there are no free MRs").
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
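
      A minimal sketch of the wait-for-space pattern described above,
      assuming invented demo_* names: the marshal path fails fast when no
      sendctx is free, and the Send completion path wakes the waiter as
      soon as one is returned, rather than the waiter sleeping for a
      fixed interval.

      /* Sketch: wake a waiting sender from the completion path. */
      #include <linux/atomic.h>
      #include <linux/errno.h>
      #include <linux/wait.h>

      static DECLARE_WAIT_QUEUE_HEAD(demo_sendctx_waitq);
      static atomic_t demo_free_sendctxs = ATOMIC_INIT(0);

      /* Marshal path: fail fast; the caller then waits for space. */
      static int demo_get_sendctx(void)
      {
              if (atomic_add_unless(&demo_free_sendctxs, -1, 0))
                      return 0;
              return -ENOBUFS;        /* wait on demo_sendctx_waitq */
      }

      /* Send completion path: returning a sendctx wakes a waiter promptly. */
      static void demo_put_sendctx(void)
      {
              atomic_inc(&demo_free_sendctxs);
              wake_up(&demo_sendctx_waitq);
      }
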
  3. 12 May 2018, 1 commit
  4. 07 May 2018, 5 commits
  5. 02 May 2018, 1 commit
    • xprtrdma: Fix list corruption / DMAR errors during MR recovery · 054f1557
      Committed by Chuck Lever
      The ro_release_mr methods check whether mr->mr_list is empty.
      Therefore, be sure to always use list_del_init when removing an MR
      linked into a list using that field. Otherwise, when recovering from
      transport failures or device removal, list corruption can result, or
      MRs can get mapped or unmapped an odd number of times, resulting in
      IOMMU-related failures.
      
      In general this fix is appropriate back to v4.8. However, code
      changes since then make it impossible to apply this patch directly
      to stable kernels. The fix would have to be applied by hand or
      reworked for kernels earlier than v4.16.
      
      Backport guidance -- there are several cases:
      - When creating an MR, initialize mr_list so that using list_empty
        on an as-yet-unused MR is safe.
      - When an MR is being handled by the remote invalidation path,
        ensure that mr_list is reinitialized when it is removed from
        rl_registered.
      - When an MR is being handled by rpcrdma_destroy_mrs, it is removed
        from mr_all, but it may still be on an rl_registered list. In
        that case, the MR needs to be removed from that list before being
        released.
      - Other cases are covered by using list_del_init in rpcrdma_mr_pop.
      
      Fixes: 9d6b0409 ('xprtrdma: Place registered MWs on a ... ')
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
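
      A minimal sketch of why list_del_init matters here (the demo_*
      names are invented; only the list handling is real kernel API): a
      plain list_del leaves the node's pointers poisoned, so a later
      list_empty test on it is meaningless, whereas list_del_init
      re-initializes the node.

      /* Sketch: keep mr_list in a state where list_empty() is always a
       * meaningful test.
       */
      #include <linux/kernel.h>
      #include <linux/list.h>
      #include <linux/types.h>

      struct demo_mr {
              struct list_head mr_list;  /* linked into rl_registered, etc. */
      };

      static void demo_mr_init(struct demo_mr *mr)
      {
              /* safe to call list_empty() on a never-listed MR */
              INIT_LIST_HEAD(&mr->mr_list);
      }

      static void demo_mr_unlink(struct demo_mr *mr)
      {
              /* list_del_init() re-initializes the node, so a later
               * list_empty(&mr->mr_list) reliably reads "not on a list"
               */
              list_del_init(&mr->mr_list);
      }

      static bool demo_mr_is_listed(const struct demo_mr *mr)
      {
              return !list_empty(&mr->mr_list);
      }
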
  6. 11 April 2018, 3 commits
    • xprtrdma: Chain Send to FastReg WRs · f2877623
      Committed by Chuck Lever
      With FRWR, the client transport can perform memory registration and
      post a Send with just a single ib_post_send.
      
      This reduces contention between the send_request path and the Send
      Completion handlers, and reduces the overhead of registering a chunk
      that has multiple segments.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
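
      A minimal sketch of chaining the registration WR in front of the
      Send WR so both go down in one ib_post_send call. SGE setup,
      signaling policy, and error handling are omitted; the ib_post_send
      prototype shown (non-const WR pointers) matches kernels of roughly
      this era, and the demo_* wrapper is invented.

      #include <rdma/ib_verbs.h>

      static int demo_post_reg_and_send(struct ib_qp *qp, struct ib_mr *mr,
                                        struct ib_sge *sge, int num_sge)
      {
              struct ib_reg_wr reg_wr = { };
              struct ib_send_wr send_wr = { };
              struct ib_send_wr *bad_wr;

              reg_wr.wr.opcode = IB_WR_REG_MR;
              reg_wr.mr = mr;
              reg_wr.key = mr->rkey;
              reg_wr.access = IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE;
              reg_wr.wr.next = &send_wr;   /* registration, then the Send */

              send_wr.opcode = IB_WR_SEND;
              send_wr.sg_list = sge;
              send_wr.num_sge = num_sge;
              send_wr.send_flags = IB_SEND_SIGNALED;

              /* one doorbell for both work requests */
              return ib_post_send(qp, &reg_wr.wr, &bad_wr);
      }
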
    • xprtrdma: Remove xprt-specific connect cookie · 8a14793e
      Committed by Chuck Lever
      Clean up: The generic rq_connect_cookie is sufficient to detect RPC
      Call retransmission.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Fix latency regression on NUMA NFS/RDMA clients · 6720a899
      Committed by Chuck Lever
      With v4.15, on one of my NFS/RDMA clients I measured nearly a
      doubling in the latency of small read and write system calls. There
      was no change in server round trip time. The extra latency appears
      in the whole RPC execution path.
      
      "git bisect" settled on commit ccede759 ("xprtrdma: Spread reply
      processing over more CPUs").
      
      After some experimentation, I found that leaving the WQ bound and
      allowing the scheduler to pick the dispatch CPU seems to eliminate
      the long latencies, and it does not introduce any new regressions.
      
      The fix is implemented by reverting only the part of
      commit ccede759 ("xprtrdma: Spread reply processing over more
      CPUs") that dispatches RPC replies specifically on the CPU where the
      matching RPC call was made.
      
      Interestingly, saving the CPU number and later queuing reply
      processing there was effective _only_ for NFS READ and WRITE
      requests. On my NUMA client, in-kernel RPC reply processing for
      asynchronous RPCs was dispatched on the same CPU where the RPC call
      was made, as expected. However, synchronous RPCs seem to get their
      replies dispatched on a CPU other than the one where the call was
      placed, every time.
      
      Fixes: ccede759 ("xprtrdma: Spread reply processing over ... ")
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Cc: stable@vger.kernel.org # v4.15+
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
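
      In code terms the revert amounts to queuing the reply work without
      a CPU hint and letting the scheduler place it; a minimal sketch
      with invented demo_* names:

      #include <linux/workqueue.h>

      struct demo_rep {
              struct work_struct rr_work;
              int rr_cpu;     /* CPU recorded at send time; no longer used */
      };

      static void demo_queue_reply(struct workqueue_struct *wq,
                                   struct demo_rep *rep)
      {
              /* before: queue_work_on(rep->rr_cpu, wq, &rep->rr_work); */
              queue_work(wq, &rep->rr_work);
      }
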
  7. 23 January 2018, 2 commits
  8. 17 January 2018, 11 commits
  9. 16 December 2017, 1 commit
    • xprtrdma: Spread reply processing over more CPUs · ccede759
      Committed by Chuck Lever
      Commit d8f532d2 ("xprtrdma: Invoke rpcrdma_reply_handler
      directly from RECV completion") introduced a performance regression
      for NFS I/O small enough to not need memory registration. In multi-
      threaded benchmarks that generate primarily small I/O requests,
      IOPS throughput is reduced by nearly a third. This patch restores
      the previous level of throughput.
      
      Because workqueues are typically BOUND (in particular ib_comp_wq,
      nfsiod_workqueue, and rpciod_workqueue), NFS/RDMA workloads tend
      to aggregate on the CPU that is handling Receive completions.
      
      The usual approach to addressing this problem is to create a QP
      and CQ for each CPU, and then schedule transactions on the QP
      for the CPU where you want the transaction to complete. The
      transaction then does not require an extra context switch during
      completion to end up on the same CPU where the transaction was
      started.
      
      This approach doesn't work for the Linux NFS/RDMA client because
      currently the Linux NFS client does not support multiple connections
      per client-server pair, and the RDMA core API does not make it
      straightforward for ULPs to determine which CPU is responsible for
      handling Receive completions for a CQ.
      
      So for the moment, record the CPU number in the rpcrdma_req before
      the transport sends each RPC Call. Then during Receive completion,
      queue the RPC completion on that same CPU.
      
      Additionally, move all RPC completion processing to the deferred
      handler so that even RPCs with simple small replies complete on
      the CPU that sent the corresponding RPC Call.
      
      Fixes: d8f532d2 ("xprtrdma: Invoke rpcrdma_reply_handler ...")
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
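
      A minimal sketch of the record-the-CPU approach this commit takes
      (demo_* names are invented; the real code keeps the value in
      rpcrdma_req): remember the sending CPU at Call time, then queue
      reply processing on that CPU at Receive completion.

      #include <linux/smp.h>
      #include <linux/workqueue.h>

      struct demo_req {
              int rl_cpu;                     /* CPU that sent the Call */
      };

      struct demo_rep {
              struct work_struct rr_work;     /* deferred reply processing */
              struct demo_req *rr_req;
      };

      static void demo_send_rpc(struct demo_req *req)
      {
              req->rl_cpu = raw_smp_processor_id();
              /* ... post the RDMA Send carrying the RPC Call ... */
      }

      static void demo_recv_done(struct workqueue_struct *wq,
                                 struct demo_rep *rep)
      {
              /* complete the RPC on the CPU where the Call was issued */
              queue_work_on(rep->rr_req->rl_cpu, wq, &rep->rr_work);
      }
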
  10. 18 November 2017, 12 commits
    • xprtrdma: Update copyright notices · 62b56a67
      Committed by Chuck Lever
      Credit work contributed by Oracle engineers since 2014.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • rpcrdma: Remove C structure definitions of XDR data items · 2232df5e
      Committed by Chuck Lever
      Clean up: C-structure style XDR encoding and decoding logic has
      been replaced over the past several merge windows on both the
      client and server. These data structures are no longer used.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Reviewed-by: Devesh Sharma <devesh.sharma@broadcom.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Remove atomic send completion counting · 6f0afc28
      Committed by Chuck Lever
      The sendctx circular queue now guarantees that xprtrdma cannot
      overflow the Send Queue, so remove the remaining bits of the
      original Send WQE counting mechanism.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: RPC completion should wait for Send completion · 01bb35c8
      Committed by Chuck Lever
      When an RPC Call includes a file data payload, that payload can come
      from pages in the page cache, or a user buffer (for direct I/O).
      
      If the payload can fit inline, xprtrdma includes it in the Send
      using a scatter-gather technique. xprtrdma mustn't allow the RPC
      consumer to re-use the memory where that payload resides before the
      Send completes. Otherwise, the new contents of that memory would be
      exposed by an HCA retransmit of the Send operation.
      
      So, block RPC completion on Send completion, but only in the case
      where a separate file data payload is part of the Send. This
      prevents the reuse of that memory while it is still part of a Send
      operation without an undue cost to other cases.
      
      Waiting is avoided in the common case because typically the Send
      will have completed long before the RPC Reply arrives.
      
      These days, an RPC timeout will trigger a disconnect, which tears
      down the QP. The disconnect flushes all waiting Sends. This bounds
      the amount of time the reply handler has to wait for a Send
      completion.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
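
      A minimal sketch of gating RPC completion on Send completion only
      when a gathered payload was posted. The flag and field names are
      invented, not the xprtrdma ones; waitqueue initialization is
      omitted.

      #include <linux/atomic.h>
      #include <linux/bitops.h>
      #include <linux/types.h>
      #include <linux/wait.h>

      #define DEMO_REQ_F_TX_PENDING   0       /* Send not yet completed */

      struct demo_req {
              unsigned long rl_flags;
              wait_queue_head_t rl_send_wait;
      };

      /* Send path: mark the request only when the payload was gathered. */
      static void demo_post_send(struct demo_req *req, bool gathered_payload)
      {
              if (gathered_payload)
                      set_bit(DEMO_REQ_F_TX_PENDING, &req->rl_flags);
              /* ... ib_post_send() ... */
      }

      /* Send completion handler: clear the flag, wake the reply handler. */
      static void demo_send_done(struct demo_req *req)
      {
              clear_bit(DEMO_REQ_F_TX_PENDING, &req->rl_flags);
              smp_mb__after_atomic();
              wake_up(&req->rl_send_wait);
      }

      /* Reply handler: in the common case the Send completed long ago. */
      static void demo_rpc_done(struct demo_req *req)
      {
              wait_event(req->rl_send_wait,
                         !test_bit(DEMO_REQ_F_TX_PENDING, &req->rl_flags));
              /* ... now safe for the consumer to reuse the payload memory ... */
      }
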
    • xprtrdma: Refactor rpcrdma_deferred_completion · 0ba6f370
      Committed by Chuck Lever
      Invoke a common routine for releasing hardware resources (for
      example, invalidating MRs). This needs to be done whether an
      RPC Reply has arrived or the RPC was terminated early.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Add a field of bit flags to struct rpcrdma_req · 531cca0c
      Committed by Chuck Lever
      We have one boolean flag in rpcrdma_req today. I'd like to add more
      flags, so convert that boolean to a bit flag.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
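
      A minimal sketch of the bool-to-bit-flags conversion; the demo_*
      names and the backchannel flag shown here are illustrative
      assumptions, not the actual field names.

      #include <linux/bitops.h>
      #include <linux/types.h>

      enum {
              DEMO_REQ_F_BACKCHANNEL = 0,     /* replaces a lone bool */
              DEMO_REQ_F_PENDING,             /* room for future flags */
      };

      struct demo_req {
              unsigned long rl_flags;
      };

      static inline void demo_mark_backchannel(struct demo_req *req)
      {
              set_bit(DEMO_REQ_F_BACKCHANNEL, &req->rl_flags);
      }

      static inline bool demo_is_backchannel(const struct demo_req *req)
      {
              return test_bit(DEMO_REQ_F_BACKCHANNEL, &req->rl_flags);
      }
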
    • xprtrdma: Add data structure to manage RDMA Send arguments · ae72950a
      Committed by Chuck Lever
      Problem statement:
      
      Recently Sagi Grimberg <sagi@grimberg.me> observed that kernel RDMA-
      enabled storage initiators don't handle delayed Send completion
      correctly. If Send completion is delayed beyond the end of a ULP
      transaction, the ULP may release resources that are still being used
      by the HCA to complete a long-running Send operation.
      
      This is a common design trait amongst our initiators. Most Send
      operations are faster than the ULP transaction they are part of.
      Waiting for a completion for these is typically unnecessary.
      
      Infrequently, a network partition or some other problem crops up
      where an ordering problem can occur. In NFS parlance, the RPC Reply
      arrives and completes the RPC, but the HCA is still retrying the
      Send WR that conveyed the RPC Call. In this case, the HCA can try
      to use memory that has been invalidated or DMA unmapped, and the
      connection is lost. If that memory has been re-used for something
      else (possibly not related to NFS), the Send retransmission exposes
      that data on the wire.
      
      Thus we cannot assume that it is safe to release Send-related
      resources just because a ULP reply has arrived.
      
      After some analysis, we have determined that the completion
      housekeeping will not be difficult for xprtrdma:
      
       - Inline Send buffers are registered via the local DMA key, and
         are already left DMA mapped for the lifetime of a transport
         connection, thus no additional handling is necessary for those
       - Gathered Sends involving page cache pages _will_ need to
         DMA unmap those pages after the Send completes. But like
         inline send buffers, they are registered via the local DMA key,
         and thus will not need to be invalidated
      
      In addition, RPC completion will need to wait for Send completion
      in the latter case. However, nearly always, the Send that conveys
      the RPC Call will have completed long before the RPC Reply
      arrives, and thus no additional latency will be accrued.
      
      Design notes:
      
      In this patch, the rpcrdma_sendctx object is introduced, and a
      lock-free circular queue is added to manage a set of them per
      transport.
      
      The RPC client's send path already prevents sending more than one
      RPC Call at the same time. This allows us to treat the consumer
      side of the queue (rpcrdma_sendctx_get_locked) as if there is a
      single consumer thread.
      
      The producer side of the queue (rpcrdma_sendctx_put_locked) is
      invoked only from the Send completion handler, which is a single
      thread of execution (soft IRQ).
      
      The only care that needs to be taken is with the tail index, which
      is shared between the producer and consumer. Only the producer
      updates the tail index. The consumer compares the head with the
      tail to ensure that a sendctx that is in use is never handed out
      again (or, expressed more conventionally, the queue is empty).
      
      When the sendctx queue empties completely, there are enough Sends
      outstanding that posting more Send operations can result in a Send
      Queue overflow. In this case, the ULP is told to wait and try again.
      This introduces strong Send Queue accounting to xprtrdma.
      
      As a final touch, Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
      suggested a mechanism that does not require signaling every Send.
      We signal once every N Sends, and perform SGE unmapping of N Send
      operations during that one completion.
      Reported-by: Sagi Grimberg <sagi@grimberg.me>
      Suggested-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
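
      A minimal sketch of a single-producer/single-consumer ring of the
      sort described above, with invented demo_* names and the real
      queue's sizing and barrier details simplified. The consumer (send
      path, already serialized) advances the head; the producer (Send
      completion handler) advances the tail; the consumer never hands out
      the slot at the tail, so a sendctx that is still in use cannot be
      handed out twice.

      #include <linux/atomic.h>
      #include <linux/compiler.h>

      struct demo_sendctx {
              unsigned long sc_slot;          /* position in the ring */
              /* ... SGEs to unmap when the Send completes ... */
      };

      struct demo_sendctx_queue {
              unsigned long head;             /* consumer: last slot handed out */
              unsigned long tail;             /* producer: last slot completed */
              unsigned long size;             /* number of slots, a power of two */
              struct demo_sendctx **ctxs;     /* pre-allocated contexts */
      };

      /* Consumer side (send path); callers are already serialized, so
       * only one thread ever advances the head.
       */
      static struct demo_sendctx *demo_sendctx_get(struct demo_sendctx_queue *q)
      {
              unsigned long next = (q->head + 1) & (q->size - 1);

              if (next == READ_ONCE(q->tail))
                      return NULL;            /* empty: wait for a completion */
              q->head = next;
              return q->ctxs[next];
      }

      /* Producer side, invoked only from the Send completion handler:
       * everything up to the completed slot becomes reusable.
       */
      static void demo_sendctx_put(struct demo_sendctx_queue *q,
                                   struct demo_sendctx *sc)
      {
              smp_store_release(&q->tail, sc->sc_slot);
      }
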
    • xprtrdma: Change return value of rpcrdma_prepare_send_sges() · 857f9aca
      Committed by Chuck Lever
      Clean up: Make rpcrdma_prepare_send_sges() return a negative errno
      instead of a bool. Soon callers will want distinct treatments of
      different types of failures.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Decode credits field in rpcrdma_reply_handler · be798f90
      Committed by Chuck Lever
      We need to decode and save the incoming rdma_credits field _after_
      we know that the direction of the message is "forward direction
      Reply". Otherwise, the credits value in reverse direction Calls is
      also used to update the forward direction credits.
      
      It is safe to decode the rdma_credits field in rpcrdma_reply_handler
      now that rpcrdma_reply_handler is single-threaded. Receives complete
      in the same order as they were sent on the NFS server.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Invoke rpcrdma_reply_handler directly from RECV completion · d8f532d2
      Committed by Chuck Lever
      I noticed that the soft IRQ thread looked pretty busy under heavy
      I/O workloads. perf suggested one area that was expensive was the
      queue_work() call in rpcrdma_wc_receive. That gave me some ideas.
      
      Instead of scheduling a separate worker to process RPC Replies,
      promote the Receive completion handler to IB_POLL_WORKQUEUE, and
      invoke rpcrdma_reply_handler directly.
      
      Note that the poll workqueue is single-threaded. In order to keep
      memory invalidation from serializing all RPC Replies, handle any
      necessary invalidation tasks in a separate multi-threaded workqueue.
      
      This provides a two-tier scheme, similar to OS I/O interrupt
      handlers: A fast interrupt handler that schedules the slow handler
      and re-enables the interrupt, and a slower handler that is invoked
      for any needed heavy lifting.
      
      Benefits include:
      - One less context switch for RPCs that don't register memory
      - Receive completion handling is moved out of soft IRQ context to
        make room for other users of soft IRQ
      - The same CPU core now DMA syncs and XDR decodes the Receive buffer
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
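
      A minimal sketch of the two-tier scheme: the Receive CQ is
      allocated with IB_POLL_WORKQUEUE, so the completion handler runs in
      process context and can call the reply handler directly, deferring
      only work that needs memory invalidation to a separate
      multi-threaded workqueue. The demo_* names are invented; only the
      ib_* calls are real API.

      #include <linux/kernel.h>
      #include <linux/workqueue.h>
      #include <rdma/ib_verbs.h>

      struct demo_rep {
              struct ib_cqe rr_cqe;           /* rr_cqe.done = demo_wc_receive */
              struct work_struct rr_work;     /* slow path: invalidate + reply */
              bool rr_needs_invalidation;
      };

      static struct workqueue_struct *demo_reply_wq;  /* multi-threaded */

      static void demo_reply_handler(struct demo_rep *rep)
      {
              /* ... DMA sync, XDR decode, complete the RPC ... */
      }

      /* Upper half: runs in the CQ's workqueue context, not soft IRQ. */
      static void demo_wc_receive(struct ib_cq *cq, struct ib_wc *wc)
      {
              struct demo_rep *rep = container_of(wc->wr_cqe,
                                                  struct demo_rep, rr_cqe);

              if (rep->rr_needs_invalidation)
                      queue_work(demo_reply_wq, &rep->rr_work);
              else
                      demo_reply_handler(rep); /* no extra context switch */
      }

      /* IB_POLL_WORKQUEUE is what makes the direct call above safe. */
      static struct ib_cq *demo_create_recv_cq(struct ib_device *dev, int depth)
      {
              return ib_alloc_cq(dev, NULL, depth, 0, IB_POLL_WORKQUEUE);
      }
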
    • xprtrdma: Refactor rpcrdma_reply_handler some more · e1352c96
      Committed by Chuck Lever
      Clean up: I'd like to be able to invoke the tail of
      rpcrdma_reply_handler in two different places. Split the tail out
      into its own helper function.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Move decoded header fields into rpcrdma_rep · 5381e0ec
      Committed by Chuck Lever
      Clean up: Make it easier to pass the decoded XID, vers, credits, and
      proc fields around by moving these variables into struct rpcrdma_rep.
      
      Note: the credits field will be handled in a subsequent patch.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
  11. 17 October 2017, 1 commit