1. 24 October 2019 (2 commits)
    • xprtrdma: Close window between waking RPC senders and posting Receives · 2ae50ad6
      Committed by Chuck Lever
      A recent clean-up attempted to separate Receive handling from
      RPC Reply processing, in the name of clean layering.
      
      Unfortunately, we can't do this because the Receive Queue has to be
      refilled _after_ the most recent credit update from the responder
      is parsed from the transport header, but _before_ we wake up the
      next RPC sender. That is right in the middle of
      rpcrdma_reply_handler().
      
      Usually this isn't a problem because current responder
      implementations don't vary their credit grant. The one exception is
      when a connection is established: the grant goes from one to a much
      larger number on the first Receive. The requester MUST post enough
      Receives right then so that any outstanding requests can be sent
      without risking RNR and connection loss.
      
      Fixes: 6ceea368 ("xprtrdma: Refactor Receive accounting")
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
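
      The constraint is easier to see in code. The following is a
      hedged sketch of the required ordering, not the patch itself;
      rpcrdma_decode_credits() is a hypothetical helper standing in
      for the header parsing, and error handling is omitted:

        /* Sketch: ordering inside rpcrdma_reply_handler() */
        static void rpcrdma_reply_handler(struct rpcrdma_rep *rep)
        {
            struct rpcrdma_xprt *r_xprt = rep->rr_rxprt;
            struct rpc_xprt *xprt = &r_xprt->rx_xprt;
            u32 credits = rpcrdma_decode_credits(rep); /* hypothetical */

            /* Refill the Receive Queue to cover the (possibly much
             * larger) grant BEFORE any sender can be awoken, so that
             * a burst of Sends cannot provoke an RNR and connection
             * loss. */
            rpcrdma_post_recvs(r_xprt, false);

            /* Only now publish the new grant; this may wake the next
             * RPC sender. */
            spin_lock(&xprt->transport_lock);
            xprt->cwnd = credits << RPC_CWNDSHIFT;
            spin_unlock(&xprt->transport_lock);
        }
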
    • xprtrdma: Initialize rb_credits in one place · eea63ca7
      Committed by Chuck Lever
      Clean up/code de-duplication.
      
      Nit: RPC_CWNDSHIFT is incorrect as the initial value for xprt->cwnd.
      This mistake does not appear to have operational consequences, since
      the cwnd value is replaced with a valid value upon the first Receive
      completion.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
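
      As a hedged illustration of "one place", a connect-time reset
      helper might look like this (the helper name follows xprtrdma
      style, but the body is a sketch, not the exact patch):

        static void rpcrdma_reset_cwnd(struct rpcrdma_xprt *r_xprt)
        {
            struct rpc_xprt *xprt = &r_xprt->rx_xprt;

            spin_lock(&xprt->transport_lock);
            xprt->cwnd = RPC_CWNDSHIFT;  /* sic: see the nit above; the
                                          * natural initial value would
                                          * be 1 << RPC_CWNDSHIFT */
            spin_unlock(&xprt->transport_lock);
            r_xprt->rx_buf.rb_credits = 1;  /* initial RPC/RDMA grant */
        }
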
  2. 27 August 2019 (2 commits)
  3. 22 August 2019 (1 commit)
  4. 21 August 2019 (3 commits)
  5. 09 July 2019 (8 commits)
    • xprtrdma: Refactor chunk encoding · 6a6c6def
      Committed by Chuck Lever
      Clean up.
      
      Move the "not present" case into the individual chunk encoders. This
      improves code organization and readability.
      
      The reason for the original organization was to optimize for the
      case where there are no chunks. That optimization turned out to
      be inconsequential, so let's err on the side of code readability.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
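
      In RPC-over-RDMA, a chunk list that is not present is encoded
      as a single 32-bit zero. A hedged sketch of an encoder that owns
      its own "not present" case (the function shape follows xprtrdma
      conventions; segment encoding is elided):

        static int rpcrdma_encode_read_list(struct rpcrdma_xprt *r_xprt,
                                            struct rpcrdma_req *req,
                                            struct rpc_rqst *rqst,
                                            enum rpcrdma_chunktype rtype)
        {
            __be32 *p;

            if (rtype == rpcrdma_noch)
                goto done;  /* the "not present" case lives here now */

            /* ... encode one read segment per registered MR ... */

        done:
            /* an absent list, or a list terminator, is a lone zero */
            p = xdr_reserve_space(&req->rl_stream, sizeof(*p));
            if (unlikely(!p))
                return -EMSGSIZE;
            *p = xdr_zero;
            return 0;
        }
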
    • xprtrdma: Wake RPCs directly in rpcrdma_wc_send path · 0ab11523
      Committed by Chuck Lever
      Eliminate a context switch in the path that handles RPC wake-ups
      when a Receive completion has to wait for a Send completion.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
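
      A hedged sketch of the shape of the change: the CQ is polled in
      process context (IB_POLL_WORKQUEUE), so the Send completion
      handler can release resources and wake the RPC itself rather
      than queuing work. The sendctx field names are illustrative:

        static void rpcrdma_wc_send(struct ib_cq *cq, struct ib_wc *wc)
        {
            struct rpcrdma_sendctx *sc =
                container_of(wc->wr_cqe, struct rpcrdma_sendctx, sc_cqe);

            /* WARNING: only wr_cqe and status are reliable here */
            rpcrdma_sendctx_put_locked(sc);  /* unmaps the Send buffers
                                              * and wakes the waiting
                                              * RPC directly */
        }
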
    • xprtrdma: Reduce context switching due to Local Invalidation · d8099fed
      Committed by Chuck Lever
      Since commit ba69cd12 ("xprtrdma: Remove support for FMR memory
      registration"), FRWR is the only supported memory registration mode.
      
      We can take advantage of the asynchronous nature of FRWR's LOCAL_INV
      Work Requests to get rid of the completion wait by having the
      LOCAL_INV completion handler take care of DMA unmapping MRs and
      waking the upper layer RPC waiter.
      
      This eliminates two context switches when local invalidation is
      necessary. As a side benefit, we will no longer need the per-xprt
      deferred completion work queue.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
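
      A hedged sketch of an asynchronous LOCAL_INV completion handler
      that does the unmap and wake itself. mr_device and the
      mr->mr_req->rl_reply back-pointer are illustrative; the other
      names follow xprtrdma conventions:

        static void frwr_wc_localinv_done(struct ib_cq *cq, struct ib_wc *wc)
        {
            struct rpcrdma_frwr *frwr =
                container_of(wc->wr_cqe, struct rpcrdma_frwr, fr_cqe);
            struct rpcrdma_mr *mr =
                container_of(frwr, struct rpcrdma_mr, frwr);
            struct rpcrdma_rep *rep = mr->mr_req->rl_reply;

            /* Runs in process context (IB_POLL_WORKQUEUE): DMA
             * unmapping and waking a sleeping RPC are both safe here,
             * with no hand-off to a deferred-completion workqueue. */
            ib_dma_unmap_sg(mr->mr_device, mr->mr_sg, mr->mr_nents,
                            mr->mr_dir);
            rpcrdma_mr_put(mr);          /* back on the free list */
            rpcrdma_complete_rqst(rep);  /* wake the upper-layer waiter */
        }
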
    • xprtrdma: Add mechanism to place MRs back on the free list · 40088f0e
      Committed by Chuck Lever
      When a marshal operation fails, any MRs that were already set up for
      that request are recycled. Recycling releases MRs and creates new
      ones, which is expensive.
      
      Since commit f2877623 ("xprtrdma: Chain Send to FastReg WRs")
      was merged, recycling FRWRs is unnecessary. Before that commit,
      frwr_map posted the FAST_REG Work Requests itself, so ownership
      of the MRs had already passed to the NIC and dealing with them
      had to be delayed until those WRs completed.
      
      Since that commit, however, FAST_REG WRs are posted at the same time
      as the Send WR. This means that if marshaling fails, we are certain
      the MRs are safe to simply unmap and place back on the free list
      because neither the Send nor the FAST_REG WRs have been posted yet.
      The kernel still has ownership of the MRs at this point.
      
      This reduces the total number of MRs that the xprt has to create
      under heavy workloads and makes the marshaling logic less brittle.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
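
      A hedged sketch of the reset path, close in spirit to the patch
      (helper names follow xprtrdma conventions):

        /* Called on marshaling failure, before the Send or FAST_REG
         * WRs have been posted: the kernel still owns these MRs, so
         * simply unmap them and put them back on the free list. */
        static void frwr_reset(struct rpcrdma_req *req)
        {
            struct rpcrdma_mr *mr;

            while ((mr = rpcrdma_mr_pop(&req->rl_registered)))
                rpcrdma_mr_unmap_and_put(mr);  /* no dereg, no re-create */
        }
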
    • xprtrdma: Remove fr_state · 84756894
      Committed by Chuck Lever
      Now that both the Send and Receive completions are handled in
      process context, it is safe to DMA unmap and return MRs to the
      free or recycle lists directly in the completion handlers.
      
      Doing this means a VALID or FLUSHED MR can no longer appear on
      an xprt's MR free list, so rpcrdma_frwr no longer needs to track
      each MR's registration state.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Remove the RPCRDMA_REQ_F_PENDING flag · 5809ea4f
      Committed by Chuck Lever
      Commit 9590d083 ("xprtrdma: Use xprt_pin_rqst in
      rpcrdma_reply_handler") pins incoming RPC/RDMA replies so they
      can be left in the pending requests queue while they are being
      processed without introducing a race between ->buf_free and the
      transport's reply handler. Therefore RPCRDMA_REQ_F_PENDING is no
      longer necessary.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
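
      A hedged sketch of why pinning makes the flag redundant (the
      lock name varies across kernel versions, and the body is heavily
      simplified):

        static void rpcrdma_reply_handler(struct rpcrdma_rep *rep)
        {
            struct rpc_xprt *xprt = &rep->rr_rxprt->rx_xprt;
            struct rpc_rqst *rqst;

            spin_lock(&xprt->queue_lock);
            rqst = xprt_lookup_rqst(xprt, rep->rr_xid);
            if (rqst)
                xprt_pin_rqst(rqst);  /* replaces RPCRDMA_REQ_F_PENDING */
            spin_unlock(&xprt->queue_lock);
            if (!rqst)
                return;

            /* ... process the reply; ->buf_free cannot release the
             * rqst while it is pinned ... */

            spin_lock(&xprt->queue_lock);
            xprt_unpin_rqst(rqst);
            spin_unlock(&xprt->queue_lock);
        }
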
    • xprtrdma: Fix occasional transport deadlock · 05eb06d8
      Committed by Chuck Lever
      Under high I/O workloads, I've noticed that an RPC/RDMA transport
      occasionally deadlocks (IOPS goes to zero, and doesn't recover).
      Diagnosis shows that the sendctx queue is empty, but when sendctxs
      are returned to the queue, the xprt_write_space wake-up never
      occurs. The wake-up logic in rpcrdma_sendctx_put_locked is racy.
      
      I noticed that both EMPTY_SCQ and XPRT_WRITE_SPACE are implemented
      via an atomic bit. Just one of those is sufficient. Removing
      EMPTY_SCQ in favor of the generic bit mechanism makes the deadlock
      un-reproducible.
      
      Without EMPTY_SCQ, rpcrdma_buffer::rb_flags is no longer used and
      is therefore removed.
      
      Unfortunately this patch does not apply cleanly to stable. If
      needed, someone will have to port it and test it.
      
      Fixes: 2fad6592 ("xprtrdma: Wait on empty sendctx queue")
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
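
      A hedged sketch of the non-racy wake-up (the function signature
      is simplified): the producer side now funnels through the one
      generic bit instead of a private EMPTY_SCQ flag:

        static void rpcrdma_sendctx_put_locked(struct rpcrdma_xprt *r_xprt,
                                               struct rpcrdma_sendctx *sc)
        {
            /* ... return sc to the circular sendctx queue ... */

            /* xprt_write_space() wakes a waiter only if the generic
             * XPRT_WRITE_SPACE bit is set, and clears that bit
             * atomically, so the wake-up cannot be lost between the
             * "queue empty" test and the sender going to sleep. */
            xprt_write_space(&r_xprt->rx_xprt);
        }
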
    • xprtrdma: Replace use of xdr_stream_pos in rpcrdma_marshal_req · 1310051c
      Committed by Chuck Lever
      This is a latent bug. xdr_stream_pos works by subtracting
      xdr_stream::nwords from xdr_buf::len. But xdr_stream::nwords is not
      initialized by xdr_init_encode().
      
      It works today only because all fields in rpcrdma_req::rl_stream
      are initialized to zero by rpcrdma_req_create, making the
      subtraction in xdr_stream_pos always a no-op.
      
      I found this issue via code inspection. It was introduced by commit
      39f4cd9e ("xprtrdma: Harden chunk list encoding against send
      buffer overflow"), but the code has changed enough since then that
      this fix can't be automatically applied to stable.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
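
      A hedged sketch of a replacement that sidesteps
      xdr_stream::nwords entirely by measuring how far the encode
      pointer has advanced; the helper name is illustrative, and
      rdmab_data() follows xprtrdma naming:

        /* Bytes of transport header encoded so far. Unlike
         * xdr_stream_pos(), this never reads xdr_stream::nwords,
         * which xdr_init_encode() leaves uninitialized. */
        static unsigned int rpcrdma_hdrlen(const struct rpcrdma_req *req)
        {
            const struct xdr_stream *xdr = &req->rl_stream;

            return (const char *)xdr->p -
                   (const char *)rdmab_data(req->rl_rdmabuf);
        }
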
  6. 07 July 2019 (1 commit)
  7. 26 April 2019 (6 commits)
  8. 14 February 2019 (1 commit)
  9. 13 February 2019 (1 commit)
    • xprtrdma: Check inline size before providing a Write chunk · d4550bbe
      Committed by Chuck Lever
      In very rare cases, an NFS READ operation might predict that the
      non-payload part of the RPC Call is large. For instance, an
      NFSv4 COMPOUND with a large GETATTR result, in combination with a
      large Kerberos credential, could push the non-payload part to be
      several kilobytes.
      
      If the non-payload part is larger than the connection's inline
      threshold, the client is required to provision a Reply chunk. The
      current Linux client does not check for this case. There are two
      obvious ways to handle it:
      
      a. Provision a Write chunk for the payload and a Reply chunk for
         the non-payload part
      
      b. Provision a Reply chunk for the whole RPC Reply
      
      Some testing at a recent NFS bake-a-thon showed that servers can
      mostly handle option a., but some corner cases do not work yet.
      Option b. already works (it has to, in order to handle krb5i/p),
      though it can be somewhat less efficient. However, I expect this
      scenario to be very rare; no one has reported a problem yet.
      
      So I'm going to implement b. Sometime later I will provide some
      patches to help make b. a little more efficient by more carefully
      choosing the Reply chunk's segment sizes to ensure the payload is
      optimally aligned.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
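
      A hedged sketch of the decision that option b. implies; the
      predicate follows the shape of the patch, while
      rpcrdma_choose_wtype() is a hypothetical wrapper for the
      call-site logic:

        /* Can the non-payload part of the Reply (head + tail of the
         * receive buffer) fit within the inline threshold? */
        static bool rpcrdma_nonpayload_inline(const struct rpcrdma_xprt *r_xprt,
                                              const struct rpc_rqst *rqst)
        {
            const struct xdr_buf *buf = &rqst->rq_rcv_buf;

            return (buf->head[0].iov_len + buf->tail[0].iov_len) <
                    r_xprt->rx_ep.rep_max_inline_recv;
        }

        static enum rpcrdma_chunktype
        rpcrdma_choose_wtype(struct rpcrdma_xprt *r_xprt,
                             struct rpc_rqst *rqst, bool ddp_allowed)
        {
            if (rpcrdma_results_inline(r_xprt, rqst))
                return rpcrdma_noch;     /* whole Reply fits inline */
            if (ddp_allowed && rpcrdma_nonpayload_inline(r_xprt, rqst))
                return rpcrdma_writech;  /* Write chunk for the payload */
            return rpcrdma_replych;      /* option b.: one Reply chunk */
        }
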
  10. 03 January 2019 (10 commits)
  11. 03 October 2018 (3 commits)
    • xprtrdma: Explicitly resetting MRs is no longer necessary · 61da886b
      Committed by Chuck Lever
      When a memory operation fails, the MR's driver state might not match
      its hardware state. The only reliable recourse is to dereg the MR.
      This is done in ->ro_recover_mr, which then attempts to allocate a
      fresh MR to replace the released MR.
      
      Since commit e2ac236c ("xprtrdma: Allocate MRs on demand"),
      xprtrdma dynamically allocates MRs. It can add more MRs whenever
      they are needed.
      
      That makes it possible to simply release an MR when a memory
      operation fails, instead of "recovering" it. It will automatically
      be replaced by the on-demand MR allocator.
      
      This commit is a little larger than I wanted, but it replaces
      ->ro_recover_mr, rb_recovery_lock, rb_recovery_worker, and the
      rb_stale_mrs list with a generic work queue.
      
      Since MRs are no longer orphaned, the mrs_orphaned metric is no
      longer used.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
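
      A hedged sketch of the simpler failure path (mr_device is an
      illustrative back-pointer, and the real patch routes this work
      through a generic workqueue rather than calling it inline):

        /* After a failed memory operation the MR's hardware state is
         * unknown, so deregister it rather than try to repair it. */
        static void rpcrdma_mr_recycle(struct rpcrdma_mr *mr)
        {
            ib_dma_unmap_sg(mr->mr_device, mr->mr_sg, mr->mr_nents,
                            mr->mr_dir);
            ib_dereg_mr(mr->frwr.fr_mr);  /* the only reliable recourse */
            kfree(mr);
            /* no recovery list: rpcrdma_mrs_create() replenishes the
             * pool on demand */
        }
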
    • xprtrdma: Create more MRs at a time · c421ece6
      Committed by Chuck Lever
      Some devices require more than 3 MRs to build a single 1MB I/O.
      Ensure that rpcrdma_mrs_create() will add enough MRs to build that
      I/O.
      
      In a subsequent patch I'm changing the MR recovery logic to just
      toss out the MRs. In that case it's possible for ->send_request
      to loop: it acquires some MRs but not enough, gets called again,
      recycles the previous MRs, and again comes up short; lather,
      rinse, repeat. Thus we first need to ensure enough MRs are
      created to prevent that loop.
      
      I'm "reusing" ia->ri_max_segs. All of its accessors seem to want the
      maximum number of data segments plus two, so I'm going to bake that
      into the initial calculation.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
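
      A hedged sketch of a batch allocator sized so that one maximal
      I/O can always be marshaled (FRWR initialization and list
      handling are elided):

        static void rpcrdma_mrs_create(struct rpcrdma_xprt *r_xprt)
        {
            unsigned int count;

            /* ri_max_segs already has the "+ 2" described above baked
             * in, so one pass creates enough MRs for a maximal I/O. */
            for (count = 0; count < r_xprt->rx_ia.ri_max_segs; count++) {
                struct rpcrdma_mr *mr = kzalloc(sizeof(*mr), GFP_KERNEL);

                if (!mr)
                    break;
                /* ... frwr_init_mr(mr), then add mr to rx_buf.rb_mrs ... */
            }
        }
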
    • xprtrdma: xprt_release_rqst_cong is called outside of transport_lock · 91ca1866
      Committed by Chuck Lever
      Since commit ce7c252a ("SUNRPC: Add a separate spinlock to
      protect the RPC request receive list") the RPC/RDMA reply handler
      has been calling xprt_release_rqst_cong without holding
      xprt->transport_lock.
      
      I think the only way this call is ever made is if the credit grant
      increases and there are RPCs pending. Current server implementations
      do not change their credit grant during operation (except at
      connect time).
      
      Commit e7ce710a ("xprtrdma: Avoid deadlock when credit window is
      reset") added the ->release_rqst call because UDP invokes
      xprt_adjust_cwnd(), which calls __xprt_put_cong() after adjusting
      xprt->cwnd. Both xprt_release() and ->xprt_release_xprt already wake
      another task in this case, so it is safe to remove this call from
      the reply handler.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
  12. 01 October 2018 (2 commits)