1. 02 June 2018 (1 commit)
    • xprtrdma: Wait on empty sendctx queue · 2fad6592
      Committed by Chuck Lever
      Currently, when the sendctx queue is exhausted during marshaling, the
      RPC/RDMA transport places the RPC task on the delayq, which forces a
      wait of HZ >> 2 jiffies before the marshal and send are retried.
      
      With this change, the transport now places such an RPC task on the
      pending queue, and wakes it just as soon as more sendctxs become
      available. This typically takes less than a millisecond, and the
      write_space waking mechanism is less deadlock-prone.
      
      Moreover, the waiting RPC task is holding the transport's write
      lock, which blocks the transport from sending RPCs. Therefore faster
      recovery from sendctx queue exhaustion is desirable.
      
      Cf. commit 5804891455d5 ("xprtrdma: ->send_request returns -EAGAIN
      when there are no free MRs").
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
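
      As a minimal userspace sketch of the wake-on-free pattern this
      commit describes (all names are illustrative; this is not the
      kernel code), the marshaling path blocks on a condition variable
      and the Send completion path signals it the moment a sendctx is
      returned, rather than polling on a timer:

        #include <pthread.h>
        #include <stdio.h>
        #include <unistd.h>

        static pthread_mutex_t lock  = PTHREAD_MUTEX_INITIALIZER;
        static pthread_cond_t  space = PTHREAD_COND_INITIALIZER;
        static int free_sendctxs;       /* 0 models an exhausted queue */

        static void sendctx_get(void)   /* models the marshaling path */
        {
                pthread_mutex_lock(&lock);
                while (free_sendctxs == 0)      /* wakes as soon as one is freed */
                        pthread_cond_wait(&space, &lock);
                free_sendctxs--;
                pthread_mutex_unlock(&lock);
        }

        static void *send_completion(void *unused)  /* models Send completion */
        {
                sleep(1);               /* pretend the HCA is busy */
                pthread_mutex_lock(&lock);
                free_sendctxs++;
                pthread_cond_signal(&space);    /* wake the pending RPC task */
                pthread_mutex_unlock(&lock);
                return NULL;
        }

        int main(void)
        {
                pthread_t t;

                pthread_create(&t, NULL, send_completion, NULL);
                sendctx_get();          /* blocks until signaled, no HZ >> 2 wait */
                printf("sendctx acquired right after completion\n");
                pthread_join(t, NULL);
                return 0;
        }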
  2. 07 May 2018 (10 commits)
  3. 11 April 2018 (7 commits)
  4. 03 February 2018 (2 commits)
    • xprtrdma: Fix BUG after a device removal · e89e8d8f
      Committed by Chuck Lever
      Michal Kalderon reports a BUG that occurs just after device removal:
      
      [  169.112490] rpcrdma: removing device qedr0 for 192.168.110.146:20049
      [  169.143909] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
      [  169.181837] IP: rpcrdma_dma_unmap_regbuf+0xa/0x60 [rpcrdma]
      
      The RPC/RDMA client transport attempts to allocate some resources
      on demand. Registered buffers are one such resource. These are
      allocated (or re-allocated) by xprt_rdma_allocate to hold RPC Call
      and Reply messages. A hardware resource is associated with each of
      these buffers, as they can be used for a Send or Receive Work
      Request.
      
      If a device is removed from under an NFS/RDMA mount, the transport
      layer is responsible for releasing all hardware resources before
      the device can be finally unplugged. A BUG results when the NFS
      mount hasn't yet seen much activity: the transport tries to release
      resources that haven't yet been allocated.
      
      rpcrdma_free_regbuf() already checks for this case, so just move
      that check to cover the DEVICE_REMOVAL case as well.
      Reported-by: Michal Kalderon <Michal.Kalderon@cavium.com>
      Fixes: bebd0318 ("xprtrdma: Support unplugging an HCA ...")
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Tested-by: Michal Kalderon <Michal.Kalderon@cavium.com>
      Cc: stable@vger.kernel.org # v4.12+
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
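
      A compact, compilable illustration of the guard being moved (the
      struct and field names here are invented for the sketch; the
      kernel keeps the equivalent state in its regbuf structure):

        #include <stdio.h>

        struct regbuf {
                void *rg_device;        /* NULL until first DMA mapping */
                void *rg_base;
        };

        static void regbuf_dma_unmap(struct regbuf *rb)
        {
                /* DEVICE_REMOVAL can arrive before the buffer was ever
                 * mapped; without this check, the unmap path chases a
                 * NULL pointer, which is the reported BUG. */
                if (!rb || !rb->rg_device)
                        return;
                /* ... the real code would ib_dma_unmap_single() here ... */
                rb->rg_device = NULL;
        }

        int main(void)
        {
                struct regbuf rb = { 0 };   /* allocated but never mapped */

                regbuf_dma_unmap(&rb);      /* now a harmless no-op */
                printf("unmapping an unmapped regbuf is a no-op\n");
                return 0;
        }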
    • xprtrdma: Fix calculation of ri_max_send_sges · 1179e2c2
      Committed by Chuck Lever
      Commit 16f906d6 ("xprtrdma: Reduce required number of send
      SGEs") introduced the rpcrdma_ia::ri_max_send_sges field. That
      commit fixed a problem where xprtrdma would not work if the
      device's max_sge capability was small (in the low single digits).
      
      At least RPCRDMA_MIN_SEND_SGES are needed for the inline parts of
      each RPC. ri_max_send_sges is set to this value:
      
        ia->ri_max_send_sges = max_sge - RPCRDMA_MIN_SEND_SGES;
      
      Then when marshaling each RPC, rpcrdma_args_inline uses that value
      to determine whether the device has enough Send SGEs to convey an
      NFS WRITE payload inline, or whether instead a Read chunk is
      required.
      
      More recently, commit ae72950a ("xprtrdma: Add data structure to
      manage RDMA Send arguments") used the ri_max_send_sges value to
      calculate the size of an array, but that commit erroneously assumed
      ri_max_send_sges contains a value similar to the device's max_sge,
      and not one that was reduced by the minimum SGE count.
      
      This assumption causes the calculated size of the sendctx's Send
      SGE array to be too small. When the array is used to marshal an
      RPC, the code can write Send SGEs into the following sendctx
      element in that array, corrupting it. When the device's max_sge is
      large, this issue is entirely harmless; but when dev.attrs.max_sge
      is small, it results in an oops in the provider's post_send method.
      
      So let's straighten this out: ri_max_send_sges will now contain a
      value with the same meaning as dev.attrs.max_sge, which makes
      the code easier to understand, and enables rpcrdma_sendctx_create
      to calculate the size of the SGE array correctly.
      Reported-by: Michal Kalderon <Michal.Kalderon@cavium.com>
      Fixes: 16f906d6 ("xprtrdma: Reduce required number of send SGEs")
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Tested-by: Michal Kalderon <Michal.Kalderon@cavium.com>
      Cc: stable@vger.kernel.org # v4.10+
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
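
      The arithmetic can be seen in a few lines. The constant's value
      (3) and the device's max_sge (4) below are assumptions chosen to
      mirror the "low single digits" case, not values taken from the
      kernel source:

        #include <stdio.h>

        #define RPCRDMA_MIN_SEND_SGES 3     /* assumed value */

        int main(void)
        {
                unsigned int max_sge = 4;   /* a device with a small max_sge */

                /* Before the fix, ri_max_send_sges held the reduced count: */
                unsigned int ri_max_send_sges = max_sge - RPCRDMA_MIN_SEND_SGES;

                /* But sendctx creation sized its Send SGE array from this
                 * field as if it held the device's max_sge, so marshaling
                 * could write past the array into the next sendctx: */
                printf("array sized for %u SGEs; device can post %u\n",
                       ri_max_send_sges, max_sge);

                /* After the fix, the field carries the device's value: */
                ri_max_send_sges = max_sge;
                printf("array now sized for %u SGEs\n", ri_max_send_sges);
                return 0;
        }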
  5. 23 January 2018 (8 commits)
  6. 17 January 2018 (8 commits)
  7. 16 December 2017 (1 commit)
    • xprtrdma: Spread reply processing over more CPUs · ccede759
      Committed by Chuck Lever
      Commit d8f532d2 ("xprtrdma: Invoke rpcrdma_reply_handler
      directly from RECV completion") introduced a performance regression
      for NFS I/O small enough to not need memory registration. In multi-
      threaded benchmarks that generate primarily small I/O requests,
      IOPS throughput is reduced by nearly a third. This patch restores
      the previous level of throughput.
      
      Because workqueues are typically BOUND (in particular ib_comp_wq,
      nfsiod_workqueue, and rpciod_workqueue), NFS/RDMA workloads tend
      to aggregate on the CPU that is handling Receive completions.
      
      The usual approach to addressing this problem is to create a QP
      and CQ for each CPU, and then schedule transactions on the QP
      for the CPU where you want the transaction to complete. The
      transaction then does not require an extra context switch during
      completion to end up on the same CPU where the transaction was
      started.
      
      This approach doesn't work for the Linux NFS/RDMA client because
      currently the Linux NFS client does not support multiple connections
      per client-server pair, and the RDMA core API does not make it
      straightforward for ULPs to determine which CPU is responsible for
      handling Receive completions for a CQ.
      
      So for the moment, record the CPU number in the rpcrdma_req before
      the transport sends each RPC Call. Then during Receive completion,
      queue the RPC completion on that same CPU.
      
      Additionally, move all RPC completion processing to the deferred
      handler so that even RPCs with simple small replies complete on
      the CPU that sent the corresponding RPC Call.
      
      Fixes: d8f532d2 ("xprtrdma: Invoke rpcrdma_reply_handler ...")
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
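
      A userspace model of the steering described above (the kernel
      does this with queue_work_on(); the names and the affinity-based
      mechanism below are illustrative only):

        #define _GNU_SOURCE
        #include <pthread.h>
        #include <sched.h>
        #include <stdio.h>

        struct rpc_call {
                int sending_cpu;        /* recorded before the Call is sent */
        };

        static void *reply_handler(void *arg)   /* models deferred completion */
        {
                struct rpc_call *call = arg;
                cpu_set_t set;

                CPU_ZERO(&set);
                CPU_SET(call->sending_cpu, &set);
                /* Steer reply processing onto the sending CPU. */
                pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
                printf("reply processed on CPU %d (Call sent from CPU %d)\n",
                       sched_getcpu(), call->sending_cpu);
                return NULL;
        }

        int main(void)
        {
                struct rpc_call call = { .sending_cpu = sched_getcpu() };
                pthread_t t;

                pthread_create(&t, NULL, reply_handler, &call);
                pthread_join(t, NULL);
                return 0;
        }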
  8. 18 November 2017 (3 commits)