1. 03 Oct 2018, 7 commits
  2. 09 Aug 2018, 1 commit
    • xprtrdma: Fix disconnect regression · 8d4fb8ff
      By Chuck Lever
      I found that injecting disconnects with v4.18-rc resulted in
      random failures of the multi-threaded git regression test.
      
      The root cause appears to be that, after a reconnect, the
      RPC/RDMA transport is waking pending RPCs before the transport has
      posted enough Receive buffers to receive the Replies. If a Reply
      arrives before enough Receive buffers are posted, the connection
      is dropped. A few connection drops happen in quick succession as
      the client and server struggle to regain credit synchronization.
      
      This regression was introduced with commit 7c8d9e7c ("xprtrdma:
      Move Receive posting to Receive handler"). The client is supposed to
      post a single Receive when a connection is established because
      it's not supposed to send more than one RPC Call before it gets
      a fresh credit grant in the first RPC Reply [RFC 8166, Section
      3.3.3].
      
      Unfortunately there appears to be a longstanding bug in the Linux
      client's credit accounting mechanism. On connect, it simply dumps
      all pending RPC Calls onto the new connection. It's possible it has
      done this ever since the RPC/RDMA transport was added to the kernel
      ten years ago.
      
      Servers have so far been tolerant of this bad behavior. Currently no
      server implementation ever changes its credit grant over reconnects,
      and servers always repost enough Receives before connections are
      fully established.
      
      The Linux client implementation used to post a Receive before each
      of these Calls. This has covered up the flooding send behavior.
      
      I could try to correct this old bug so that the client sends exactly
      one RPC Call and waits for a Reply. Since we are so close to the
      next merge window, I'm going to instead provide a simple patch to
      post enough Receives before a reconnect completes (based on the
      number of credits granted to the previous connection).
      
      The spurious disconnects will be gone, but the client will still
      send multiple RPC Calls immediately after a reconnect.
      
      Addressing the latter problem will wait for a merge window because
      a) I expect it to be a large change requiring lots of testing, and
      b) obviously the Linux client has interoperated successfully since
      day zero while still being broken.
      
      Fixes: 7c8d9e7c ("xprtrdma: Move Receive posting to ... ")
      Cc: stable@vger.kernel.org # v4.18+
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
  3. 31 Jul 2018, 1 commit
    • RDMA, core and ULPs: Declare ib_post_send() and ib_post_recv() arguments const · d34ac5cd
      By Bart Van Assche
      Since neither ib_post_send() nor ib_post_recv() modify the data structure
      their second argument points at, declare that argument const. This change
      makes it necessary to declare the 'bad_wr' argument const too and also to
      modify all ULPs that call ib_post_send(), ib_post_recv() or
      ib_post_srq_recv(). This patch does not change any functionality but makes
      it possible for the compiler to verify whether the
      ib_post_(send|recv|srq_recv) really do not modify the posted work request.
      
      To make this possible, only one cast that casts away constness had
      to be introduced, namely in rpcrdma_post_recvs(). The only way I
      can think of to avoid that cast is to introduce an additional
      loop in that function or to
      change the data type of bad_wr from struct ib_recv_wr ** into int
      (an index that refers to an element in the work request list). However,
      both approaches would require even more extensive changes than this
      patch.
      Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
      Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
  4. 19 Jun 2018, 1 commit
  5. 02 Jun 2018, 1 commit
    • xprtrdma: Wait on empty sendctx queue · 2fad6592
      By Chuck Lever
      Currently, when the sendctx queue is exhausted during marshaling, the
      RPC/RDMA transport places the RPC task on the delayq, which forces a
      wait for HZ >> 2 before the marshal and send is retried.
      
      With this change, the transport now places such an RPC task on the
      pending queue, and wakes it just as soon as more sendctxs become
      available. This typically takes less than a millisecond, and the
      write_space waking mechanism is less deadlock-prone.
      
      Moreover, the waiting RPC task is holding the transport's write
      lock, which blocks the transport from sending RPCs. Therefore faster
      recovery from sendctx queue exhaustion is desirable.
      
      Cf. commit 5804891455d5 ("xprtrdma: ->send_request returns -EAGAIN
      when there are no free MRs").
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
  6. 12 May 2018, 1 commit
  7. 07 May 2018, 10 commits
  8. 02 May 2018, 1 commit
    • xprtrdma: Fix list corruption / DMAR errors during MR recovery · 054f1557
      By Chuck Lever
      The ro_release_mr methods check whether mr->mr_list is empty.
      Therefore, be sure to always use list_del_init when removing an MR
      linked into a list using that field. Otherwise, when recovering from
      transport failures or device removal, list corruption can result, or
      MRs can get mapped or unmapped an odd number of times, resulting in
      IOMMU-related failures.
      
      In general this fix is appropriate back to v4.8. However, code
      changes since then make it impossible to apply this patch directly
      to stable kernels. The fix would have to be applied by hand or
      reworked for kernels earlier than v4.16.
      
      Backport guidance -- there are several cases:
      - When creating an MR, initialize mr_list so that using list_empty
        on an as-yet-unused MR is safe.
      - When an MR is being handled by the remote invalidation path,
        ensure that mr_list is reinitialized when it is removed from
        rl_registered.
      - When an MR is being handled by rpcrdma_destroy_mrs, it is removed
        from mr_all, but it may still be on an rl_registered list. In
        that case, the MR needs to be removed from that list before being
        released.
      - Other cases are covered by using list_del_init in rpcrdma_mr_pop.
      
      Fixes: 9d6b0409 ('xprtrdma: Place registered MWs on a ... ')
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
  9. 11 Apr 2018, 7 commits
  10. 03 Feb 2018, 2 commits
    • xprtrdma: Fix BUG after a device removal · e89e8d8f
      By Chuck Lever
      Michal Kalderon reports a BUG that occurs just after device removal:
      
      [  169.112490] rpcrdma: removing device qedr0 for 192.168.110.146:20049
      [  169.143909] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
      [  169.181837] IP: rpcrdma_dma_unmap_regbuf+0xa/0x60 [rpcrdma]
      
      The RPC/RDMA client transport attempts to allocate some resources
      on demand. Registered buffers are one such resource. These are
      allocated (or re-allocated) by xprt_rdma_allocate to hold RPC Call
      and Reply messages. A hardware resource is associated with each of
      these buffers, as they can be used for a Send or Receive Work
      Request.
      
      If a device is removed from under an NFS/RDMA mount, the transport
      layer is responsible for releasing all hardware resources before
      the device can be finally unplugged. A BUG results when the NFS
      mount hasn't yet seen much activity: the transport tries to release
      resources that haven't yet been allocated.
      
      rpcrdma_free_regbuf() already checks for this case, so just move
      that check to cover the DEVICE_REMOVAL case as well.
      Reported-by: Michal Kalderon <Michal.Kalderon@cavium.com>
      Fixes: bebd0318 ("xprtrdma: Support unplugging an HCA ...")
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Tested-by: Michal Kalderon <Michal.Kalderon@cavium.com>
      Cc: stable@vger.kernel.org # v4.12+
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Fix calculation of ri_max_send_sges · 1179e2c2
      By Chuck Lever
      Commit 16f906d6 ("xprtrdma: Reduce required number of send
      SGEs") introduced the rpcrdma_ia::ri_max_send_sges field. This fixes
      a problem where xprtrdma would not work if the device's max_sge
      capability was small (low single digits).
      
      At least RPCRDMA_MIN_SEND_SGES are needed for the inline parts of
      each RPC. ri_max_send_sges is set to this value:
      
        ia->ri_max_send_sges = max_sge - RPCRDMA_MIN_SEND_SGES;
      
      Then when marshaling each RPC, rpcrdma_args_inline uses that value
      to determine whether the device has enough Send SGEs to convey an
      NFS WRITE payload inline, or whether instead a Read chunk is
      required.
      
      More recently, commit ae72950a ("xprtrdma: Add data structure to
      manage RDMA Send arguments") used the ri_max_send_sges value to
      calculate the size of an array, but that commit erroneously assumed
      ri_max_send_sges contains a value similar to the device's max_sge,
      and not one that was reduced by the minimum SGE count.
      
      This assumption results in the calculated size of the sendctx's
      Send SGE array to be too small. When the array is used to marshal
      an RPC, the code can write Send SGEs into the following sendctx
      element in that array, corrupting it. When the device's max_sge is
      large, this issue is entirely harmless; but it results in an oops
      in the provider's post_send method, if dev.attrs.max_sge is small.
      
      So let's straighten this out: ri_max_send_sges will now contain a
      value with the same meaning as dev.attrs.max_sge, which makes
      the code easier to understand, and enables rpcrdma_sendctx_create
      to calculate the size of the SGE array correctly.
      Reported-by: Michal Kalderon <Michal.Kalderon@cavium.com>
      Fixes: 16f906d6 ("xprtrdma: Reduce required number of send SGEs")
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Tested-by: Michal Kalderon <Michal.Kalderon@cavium.com>
      Cc: stable@vger.kernel.org # v4.10+
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
  11. 23 Jan 2018, 8 commits