1. 03 10月, 2018 3 次提交
    • C
      xprtrdma: Name MR trace events consistently · d379eaa8
      Chuck Lever 提交于
      Clean up the names of trace events related to MRs so that it's
      easy to enable these with a glob.
      Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
      d379eaa8
    • C
      xprtrdma: Explicitly resetting MRs is no longer necessary · 61da886b
      Chuck Lever 提交于
      When a memory operation fails, the MR's driver state might not match
      its hardware state. The only reliable recourse is to dereg the MR.
      This is done in ->ro_recover_mr, which then attempts to allocate a
      fresh MR to replace the released MR.
      
      Since commit e2ac236c ("xprtrdma: Allocate MRs on demand"),
      xprtrdma dynamically allocates MRs. It can add more MRs whenever
      they are needed.
      
      That makes it possible to simply release an MR when a memory
      operation fails, instead of "recovering" it. It will automatically
      be replaced by the on-demand MR allocator.
      
      This commit is a little larger than I wanted, but it replaces
      ->ro_recover_mr, rb_recovery_lock, rb_recovery_worker, and the
      rb_stale_mrs list with a generic work queue.
      
      Since MRs are no longer orphaned, the mrs_orphaned metric is no
      longer used.
      Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
      61da886b
    • C
      xprtrdma: Create more MRs at a time · c421ece6
      Chuck Lever 提交于
      Some devices require more than 3 MRs to build a single 1MB I/O.
      Ensure that rpcrdma_mrs_create() will add enough MRs to build that
      I/O.
      
      In a subsequent patch I'm changing the MR recovery logic to just
      toss out the MRs. In that case it's possible for ->send_request to
      loop acquiring some MRs, not getting enough, getting called again,
      recycling the previous MRs, then not getting enough, lather rinse
      repeat. Thus first we need to ensure enough MRs are created to
      prevent that loop.
      
      I'm "reusing" ia->ri_max_segs. All of its accessors seem to want the
      maximum number of data segments plus two, so I'm going to bake that
      into the initial calculation.
      Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
      c421ece6
  2. 09 8月, 2018 1 次提交
    • C
      xprtrdma: Fix disconnect regression · 8d4fb8ff
      Chuck Lever 提交于
      I found that injecting disconnects with v4.18-rc resulted in
      random failures of the multi-threaded git regression test.
      
      The root cause appears to be that, after a reconnect, the
      RPC/RDMA transport is waking pending RPCs before the transport has
      posted enough Receive buffers to receive the Replies. If a Reply
      arrives before enough Receive buffers are posted, the connection
      is dropped. A few connection drops happen in quick succession as
      the client and server struggle to regain credit synchronization.
      
      This regression was introduced with commit 7c8d9e7c ("xprtrdma:
      Move Receive posting to Receive handler"). The client is supposed to
      post a single Receive when a connection is established because
      it's not supposed to send more than one RPC Call before it gets
      a fresh credit grant in the first RPC Reply [RFC 8166, Section
      3.3.3].
      
      Unfortunately there appears to be a longstanding bug in the Linux
      client's credit accounting mechanism. On connect, it simply dumps
      all pending RPC Calls onto the new connection. It's possible it has
      done this ever since the RPC/RDMA transport was added to the kernel
      ten years ago.
      
      Servers have so far been tolerant of this bad behavior. Currently no
      server implementation ever changes its credit grant over reconnects,
      and servers always repost enough Receives before connections are
      fully established.
      
      The Linux client implementation used to post a Receive before each
      of these Calls. This has covered up the flooding send behavior.
      
      I could try to correct this old bug so that the client sends exactly
      one RPC Call and waits for a Reply. Since we are so close to the
      next merge window, I'm going to instead provide a simple patch to
      post enough Receives before a reconnect completes (based on the
      number of credits granted to the previous connection).
      
      The spurious disconnects will be gone, but the client will still
      send multiple RPC Calls immediately after a reconnect.
      
      Addressing the latter problem will wait for a merge window because
      a) I expect it to be a large change requiring lots of testing, and
      b) obviously the Linux client has interoperated successfully since
      day zero while still being broken.
      
      Fixes: 7c8d9e7c ("xprtrdma: Move Receive posting to ... ")
      Cc: stable@vger.kernel.org # v4.18+
      Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
      8d4fb8ff
  3. 31 7月, 2018 1 次提交
    • B
      RDMA, core and ULPs: Declare ib_post_send() and ib_post_recv() arguments const · d34ac5cd
      Bart Van Assche 提交于
      Since neither ib_post_send() nor ib_post_recv() modify the data structure
      their second argument points at, declare that argument const. This change
      makes it necessary to declare the 'bad_wr' argument const too and also to
      modify all ULPs that call ib_post_send(), ib_post_recv() or
      ib_post_srq_recv(). This patch does not change any functionality but makes
      it possible for the compiler to verify whether the
      ib_post_(send|recv|srq_recv) really do not modify the posted work request.
      
      To make this possible, only one cast had to be introduce that casts away
      constness, namely in rpcrdma_post_recvs(). The only way I can think of to
      avoid that cast is to introduce an additional loop in that function or to
      change the data type of bad_wr from struct ib_recv_wr ** into int
      (an index that refers to an element in the work request list). However,
      both approaches would require even more extensive changes than this
      patch.
      Signed-off-by: NBart Van Assche <bart.vanassche@wdc.com>
      Reviewed-by: NChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: NJason Gunthorpe <jgg@mellanox.com>
      d34ac5cd
  4. 19 6月, 2018 1 次提交
  5. 02 6月, 2018 1 次提交
    • C
      xprtrdma: Wait on empty sendctx queue · 2fad6592
      Chuck Lever 提交于
      Currently, when the sendctx queue is exhausted during marshaling, the
      RPC/RDMA transport places the RPC task on the delayq, which forces a
      wait for HZ >> 2 before the marshal and send is retried.
      
      With this change, the transport now places such an RPC task on the
      pending queue, and wakes it just as soon as more sendctxs become
      available. This typically takes less than a millisecond, and the
      write_space waking mechanism is less deadlock-prone.
      
      Moreover, the waiting RPC task is holding the transport's write
      lock, which blocks the transport from sending RPCs. Therefore faster
      recovery from sendctx queue exhaustion is desirable.
      
      Cf. commit 5804891455d5 ("xprtrdma: ->send_request returns -EAGAIN
      when there are no free MRs").
      Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
      2fad6592
  6. 12 5月, 2018 1 次提交
  7. 07 5月, 2018 10 次提交
  8. 02 5月, 2018 1 次提交
    • C
      xprtrdma: Fix list corruption / DMAR errors during MR recovery · 054f1557
      Chuck Lever 提交于
      The ro_release_mr methods check whether mr->mr_list is empty.
      Therefore, be sure to always use list_del_init when removing an MR
      linked into a list using that field. Otherwise, when recovering from
      transport failures or device removal, list corruption can result, or
      MRs can get mapped or unmapped an odd number of times, resulting in
      IOMMU-related failures.
      
      In general this fix is appropriate back to v4.8. However, code
      changes since then make it impossible to apply this patch directly
      to stable kernels. The fix would have to be applied by hand or
      reworked for kernels earlier than v4.16.
      
      Backport guidance -- there are several cases:
      - When creating an MR, initialize mr_list so that using list_empty
        on an as-yet-unused MR is safe.
      - When an MR is being handled by the remote invalidation path,
        ensure that mr_list is reinitialized when it is removed from
        rl_registered.
      - When an MR is being handled by rpcrdma_destroy_mrs, it is removed
        from mr_all, but it may still be on an rl_registered list. In
        that case, the MR needs to be removed from that list before being
        released.
      - Other cases are covered by using list_del_init in rpcrdma_mr_pop.
      
      Fixes: 9d6b0409 ('xprtrdma: Place registered MWs on a ... ')
      Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
      054f1557
  9. 11 4月, 2018 7 次提交
  10. 03 2月, 2018 2 次提交
    • C
      xprtrdma: Fix BUG after a device removal · e89e8d8f
      Chuck Lever 提交于
      Michal Kalderon reports a BUG that occurs just after device removal:
      
      [  169.112490] rpcrdma: removing device qedr0 for 192.168.110.146:20049
      [  169.143909] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
      [  169.181837] IP: rpcrdma_dma_unmap_regbuf+0xa/0x60 [rpcrdma]
      
      The RPC/RDMA client transport attempts to allocate some resources
      on demand. Registered buffers are one such resource. These are
      allocated (or re-allocated) by xprt_rdma_allocate to hold RPC Call
      and Reply messages. A hardware resource is associated with each of
      these buffers, as they can be used for a Send or Receive Work
      Request.
      
      If a device is removed from under an NFS/RDMA mount, the transport
      layer is responsible for releasing all hardware resources before
      the device can be finally unplugged. A BUG results when the NFS
      mount hasn't yet seen much activity: the transport tries to release
      resources that haven't yet been allocated.
      
      rpcrdma_free_regbuf() already checks for this case, so just move
      that check to cover the DEVICE_REMOVAL case as well.
      Reported-by: NMichal Kalderon <Michal.Kalderon@cavium.com>
      Fixes: bebd0318 ("xprtrdma: Support unplugging an HCA ...")
      Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
      Tested-by: NMichal Kalderon <Michal.Kalderon@cavium.com>
      Cc: stable@vger.kernel.org # v4.12+
      Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
      e89e8d8f
    • C
      xprtrdma: Fix calculation of ri_max_send_sges · 1179e2c2
      Chuck Lever 提交于
      Commit 16f906d6 ("xprtrdma: Reduce required number of send
      SGEs") introduced the rpcrdma_ia::ri_max_send_sges field. This fixes
      a problem where xprtrdma would not work if the device's max_sge
      capability was small (low single digits).
      
      At least RPCRDMA_MIN_SEND_SGES are needed for the inline parts of
      each RPC. ri_max_send_sges is set to this value:
      
        ia->ri_max_send_sges = max_sge - RPCRDMA_MIN_SEND_SGES;
      
      Then when marshaling each RPC, rpcrdma_args_inline uses that value
      to determine whether the device has enough Send SGEs to convey an
      NFS WRITE payload inline, or whether instead a Read chunk is
      required.
      
      More recently, commit ae72950a ("xprtrdma: Add data structure to
      manage RDMA Send arguments") used the ri_max_send_sges value to
      calculate the size of an array, but that commit erroneously assumed
      ri_max_send_sges contains a value similar to the device's max_sge,
      and not one that was reduced by the minimum SGE count.
      
      This assumption results in the calculated size of the sendctx's
      Send SGE array to be too small. When the array is used to marshal
      an RPC, the code can write Send SGEs into the following sendctx
      element in that array, corrupting it. When the device's max_sge is
      large, this issue is entirely harmless; but it results in an oops
      in the provider's post_send method, if dev.attrs.max_sge is small.
      
      So let's straighten this out: ri_max_send_sges will now contain a
      value with the same meaning as dev.attrs.max_sge, which makes
      the code easier to understand, and enables rpcrdma_sendctx_create
      to calculate the size of the SGE array correctly.
      Reported-by: NMichal Kalderon <Michal.Kalderon@cavium.com>
      Fixes: 16f906d6 ("xprtrdma: Reduce required number of send SGEs")
      Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
      Tested-by: NMichal Kalderon <Michal.Kalderon@cavium.com>
      Cc: stable@vger.kernel.org # v4.10+
      Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
      1179e2c2
  11. 23 1月, 2018 8 次提交
  12. 17 1月, 2018 4 次提交
    • C
      xprtrdma: Introduce rpcrdma_mw_unmap_and_put · ec12e479
      Chuck Lever 提交于
      Clean up: Code review suggested that a common bit of code can be
      placed into a helper function, and this gives us fewer places to
      stick an "I DMA unmapped something" trace point.
      Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
      ec12e479
    • C
      xprtrdma: Remove usage of "mw" · 96ceddea
      Chuck Lever 提交于
      Clean up: struct rpcrdma_mw was named after Memory Windows, but
      xprtrdma no longer supports a Memory Window registration mode.
      Rename rpcrdma_mw and its fields to reduce confusion and make
      the code more sensible to read.
      
      Renaming "mw" was suggested by Tom Talpey, the author of the
      original xprtrdma implementation. It's a good idea, but I haven't
      done this until now because it's a huge diffstat for no benefit
      other than code readability.
      
      However, I'm about to introduce static trace points that expose
      a few of xprtrdma's internal data structures. They should make sense
      in the trace report, and it's reasonable to treat trace points as a
      kernel API contract which might be difficult to change later.
      
      While I'm churning things up, two additional changes:
      - rename variables unhelpfully called "r" to "mr", to improve code
        clarity, and
      - rename the MR-related helper functions using the form
        "rpcrdma_mr_<verb>", to be consistent with other areas of the
        code.
      Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
      96ceddea
    • C
      xprtrdma: Replace all usage of "frmr" with "frwr" · ce5b3717
      Chuck Lever 提交于
      Clean up: Over time, the industry has adopted the term "frwr"
      instead of "frmr". The term "frwr" is now more widely recognized.
      
      For the past couple of years I've attempted to add new code using
      "frwr" , but there still remains plenty of older code that still
      uses "frmr". Replace all usage of "frmr" to avoid confusion.
      
      While we're churning code, rename variables unhelpfully called "f"
      to "frwr", to improve code clarity.
      Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
      ce5b3717
    • C
      xprtrdma: Remove another sockaddr_storage field (cdata::addr) · dd229cee
      Chuck Lever 提交于
      Save more space in struct rpcrdma_xprt by removing the redundant
      "addr" field from struct rpcrdma_create_data_internal. Wherever
      we have rpcrdma_xprt, we also have the rpc_xprt, which has a
      sockaddr_storage field with the same content.
      Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
      dd229cee