1. 18 Nov 2017, 5 commits
  2. 17 Oct 2017, 1 commit
  3. 06 Sep 2017, 1 commit
    • xprtrdma: Use xprt_pin_rqst in rpcrdma_reply_handler · 9590d083
      Committed by Chuck Lever
      Adopt the use of xprt_pin_rqst to eliminate contention between
      Call-side users of rb_lock and the use of rb_lock in
      rpcrdma_reply_handler.
      
      This replaces the mechanism introduced in 431af645 ("xprtrdma:
      Fix client lock-up after application signal fires").
      
      Use recv_lock to quickly find the completing rqst, pin it, then
      drop the lock. At that point invalidation and pull-up of the Reply
      XDR can be done. Both are often expensive operations.
      
      Finally, take recv_lock again to signal completion to the RPC
      layer. It also protects adjustment of "cwnd".
      
      This greatly reduces the amount of time a lock is held by the
      reply handler. Comparing lock_stat results shows a marked decrease
      in contention on rb_lock and recv_lock.
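
      In outline, the new flow looks like this (a minimal sketch, not the
      actual handler; "status" stands for the number of Reply bytes copied):

          static void reply_handler_sketch(struct rpc_xprt *xprt,
                                           __be32 xid, int status)
          {
                  struct rpc_rqst *rqst;

                  spin_lock(&xprt->recv_lock);
                  rqst = xprt_lookup_rqst(xprt, xid);
                  if (!rqst) {
                          spin_unlock(&xprt->recv_lock);
                          return;
                  }
                  xprt_pin_rqst(rqst);    /* rqst can't be freed while pinned */
                  spin_unlock(&xprt->recv_lock);

                  /* Expensive work happens here with no lock held:
                   * invalidate MRs, pull up the Reply XDR. */

                  spin_lock(&xprt->recv_lock);
                  /* ... adjust xprt->cwnd here ... */
                  xprt_complete_rqst(rqst->rq_task, status);
                  xprt_unpin_rqst(rqst);
                  spin_unlock(&xprt->recv_lock);
          }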
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      [trond.myklebust@primarydata.com: Remove call to rpcrdma_buffer_put() from
         the "out_norqst:" path in rpcrdma_reply_handler.]
      Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
  4. 23 Aug 2017, 1 commit
  5. 16 Aug 2017, 1 commit
  6. 12 Aug 2017, 2 commits
  7. 08 Aug 2017, 3 commits
  8. 14 Jul 2017, 4 commits
    • xprtrdma: Fix client lock-up after application signal fires · 431af645
      Committed by Chuck Lever
      After a signal, the RPC client aborts synchronous RPCs running on
      behalf of the signaled application.
      
      The server is still executing those RPCs, and will write the results
      back into the client's memory when it's done. By the time the server
      writes the results, that memory is likely being used for other
      purposes. Therefore xprtrdma has to immediately invalidate all
      memory regions used by those aborted RPCs to prevent the server's
      writes from clobbering that re-used memory.
      
      With FMR memory registration, invalidation takes a relatively long
      time. In fact, the invalidation is often still running when the
      server tries to write the results into the memory regions that are
      being invalidated.
      
      This sets up a race between two processes:
      
      1.  After the signal, xprt_rdma_free calls ro_unmap_safe.
      2.  While ro_unmap_safe is still running, the server replies and
          rpcrdma_reply_handler runs, calling ro_unmap_sync.
      
      Both processes invoke ib_unmap_fmr on the same FMR.
      
      The mlx4 driver allows two ib_unmap_fmr calls on the same FMR at
      the same time, but HCAs generally don't tolerate this. Sometimes
      this can result in a system crash.
      
      If the HCA happens to survive, rpcrdma_reply_handler continues. It
      removes the rpc_rqst from rq_list and releases the transport_lock.
      This enables xprt_rdma_free to run in another process, and the
      rpc_rqst is released while rpcrdma_reply_handler is still waiting
      for the ib_unmap_fmr call to finish.
      
      But further down in rpcrdma_reply_handler, the transport_lock is
      taken again, and "rqst" is dereferenced. If "rqst" has already been
      released, this triggers a general protection fault. Since
      bottom-halves are disabled, the system locks up.
      
      Address both issues by reversing the order of the xprt_lookup_rqst
      call and the ro_unmap_sync call. Introduce a separate lookup
      mechanism for rpcrdma_req's to enable calling ro_unmap_sync before
      xprt_lookup_rqst. Now the handler takes the transport_lock once
      and holds it for the XID lookup and RPC completion.
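
      Sketched against the new shape of the handler (names here are
      illustrative; rpcrdma_lookup_req_locked is a hypothetical name for
      the separate lookup mechanism this patch introduces):

          /* 1. Find the rpcrdma_req under rb_lock, not transport_lock. */
          spin_lock(&buf->rb_lock);
          req = rpcrdma_lookup_req_locked(buf, xid);  /* hypothetical name */
          spin_unlock(&buf->rb_lock);

          /* 2. Invalidate before the rqst can possibly be re-used. */
          r_xprt->rx_ia.ri_ops->ro_unmap_sync(r_xprt, &req->rl_registered);

          /* 3. One critical section for XID lookup and completion. */
          spin_lock_bh(&xprt->transport_lock);
          rqst = xprt_lookup_rqst(xprt, xid);
          if (rqst)
                  xprt_complete_rqst(rqst->rq_task, status);
          spin_unlock_bh(&xprt->transport_lock);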
      
      BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=305
      Fixes: 68791649 ('xprtrdma: Invalidate in the RPC reply ... ')
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Rename rpcrdma_req::rl_free · a80d66c9
      Committed by Chuck Lever
      Clean up: I'm about to use the rl_free field for purposes other than
      a free list. So use a more generic name.
      
      This is a refactoring change only.
      
      BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=305
      Fixes: 68791649 ('xprtrdma: Invalidate in the RPC reply ... ')
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Pass only the list of registered MRs to ro_unmap_sync · 451d26e1
      Committed by Chuck Lever
      There are rare cases where an rpcrdma_req can be re-used (via
      rpcrdma_buffer_put) while the RPC reply handler is still running.
      This is due to a signal firing at just the wrong instant.
      
      Since commit 9d6b0409 ("xprtrdma: Place registered MWs on a
      per-req list"), rpcrdma_mws are self-contained; ie., they fully
      describe an MR and scatterlist, and no part of that information is
      stored in struct rpcrdma_req.
      
      As part of closing the above race window, pass only the req's list
      of registered MRs to ro_unmap_sync, rather than the rpcrdma_req
      itself.
      
      Some extra transport header sanity checking is removed. Since the
      client depends on its own recollection of what memory had been
      registered, there doesn't seem to be a way to abuse this change.
      
      And, the check was not terribly effective. If the client had sent
      Read chunks, the "list_empty" test is negative in both of the
      removed cases, which are actually looking for Write or Reply
      chunks.
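
      The method signature change can be pictured as follows (a sketch of
      the ro_unmap_sync member of struct rpcrdma_memreg_ops):

          /* Before: the whole rpcrdma_req was passed in. */
          void (*ro_unmap_sync)(struct rpcrdma_xprt *r_xprt,
                                struct rpcrdma_req *req);

          /* After: only the list of registered MRs. */
          void (*ro_unmap_sync)(struct rpcrdma_xprt *r_xprt,
                                struct list_head *mws);

          /* Caller side: */
          ia->ri_ops->ro_unmap_sync(r_xprt, &req->rl_registered);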
      
      BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=305
      Fixes: 68791649 ('xprtrdma: Invalidate in the RPC reply ... ')
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Pre-mark remotely invalidated MRs · 4b196dc6
      Committed by Chuck Lever
      There are rare cases where an rpcrdma_req and its matched
      rpcrdma_rep can be re-used, via rpcrdma_buffer_put, while the RPC
      reply handler is still using that req. This is typically due to a
      signal firing at just the wrong instant.
      
      As part of closing this race window, avoid using the wrong
      rpcrdma_rep to detect remotely invalidated MRs. Mark MRs as
      invalidated while we are sure the rep is still OK to use.
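
      A sketch of the pre-marking, done while "rep" is known to be valid
      (RPCRDMA_MW_F_RI and rr_inv_rkey are the flag and field names
      assumed here):

          if (rep->rr_wc_flags & IB_WC_WITH_INVALIDATE) {
                  struct rpcrdma_mw *mw;

                  list_for_each_entry(mw, &req->rl_registered, mw_list)
                          if (mw->mw_handle == rep->rr_inv_rkey)
                                  mw->mw_flags |= RPCRDMA_MW_F_RI;
          }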
      
      BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=305
      Fixes: 68791649 ('xprtrdma: Invalidate in the RPC reply ... ')
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
  9. 26 Apr 2017, 4 commits
    • xprtrdma: Remove rpcrdma_buffer::rb_pool · 2be1fce9
      Committed by Chuck Lever
      Since commit 1e465fd4 ("xprtrdma: Replace send and receive
      arrays"), this field is no longer used.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Support unplugging an HCA from under an NFS mount · bebd0318
      Committed by Chuck Lever
      The device driver for the underlying physical device associated
      with an RPC-over-RDMA transport can be removed while RPC-over-RDMA
      transports are still in use (i.e., while NFS filesystems are still
      mounted and active). The IB core performs a connection event upcall
      to request that consumers free all RDMA resources associated with
      a transport.
      
      There may be pending RPCs when this occurs. Care must be taken to
      release associated resources without leaving references that can
      trigger a subsequent crash if a signal or soft timeout occurs. We
      rely on the caller of the transport's ->close method to ensure that
      the previous RPC task has invoked xprt_release but the transport
      remains write-locked.
      
      A DEVICE_REMOVE upcall forces a disconnect then sleeps. When ->close
      is invoked, it destroys the transport's H/W resources, then wakes
      the upcall, which completes and allows the core driver unload to
      continue.
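
      The handshake can be sketched like this (assuming a struct
      completion, here called ri_remove_done, embedded in the transport's
      rpcrdma_ia):

          /* In the connection event upcall: */
          case RDMA_CM_EVENT_DEVICE_REMOVAL:
                  init_completion(&ia->ri_remove_done);
                  set_bit(RPCRDMA_IAF_REMOVING, &ia->ri_flags);
                  /* ... force a disconnect, then block this upcall: */
                  wait_for_completion(&ia->ri_remove_done);
                  break;

          /* In ->close, after the H/W resources are destroyed: */
          if (test_bit(RPCRDMA_IAF_REMOVING, &ia->ri_flags))
                  complete(&ia->ri_remove_done);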
      
      BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=266
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Use same device when mapping or syncing DMA buffers · 91a10c52
      Committed by Chuck Lever
      When the underlying device driver is reloaded, ia->ri_device will be
      replaced. All cached copies of that device pointer have to be
      updated as well.
      
      Commit 54cbd6b0 ("xprtrdma: Delay DMA mapping Send and Receive
      buffers") added the rg_device field to each regbuf. As part of
      handling a device removal, rpcrdma_dma_unmap_regbuf is invoked on
      all regbufs for a transport.
      
      Simply calling rpcrdma_dma_map_regbuf for each Receive buffer after
      the driver has been reloaded should reinitialize rg_device correctly
      for every case except rpcrdma_wc_receive, which still uses
      rpcrdma_rep::rr_device.
      
      Ensure the same device that was used to map a Receive buffer is also
      used to sync it in rpcrdma_wc_receive by using rg_device there
      instead of rr_device.
      
      This is the only use of rr_device, so it can be removed.
      
      The use of regbufs in the send path is also updated, for
      completeness.
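
      After the change, the Receive completion handler syncs with the
      device that performed the mapping, along these lines (rdmab_device
      is assumed to be a helper returning regbuf::rg_device):

          static void rpcrdma_wc_receive(struct ib_cq *cq, struct ib_wc *wc)
          {
                  struct rpcrdma_rep *rep =
                          container_of(wc->wr_cqe, struct rpcrdma_rep, rr_cqe);

                  ib_dma_sync_single_for_cpu(rdmab_device(rep->rr_rdmabuf),
                                             rdmab_addr(rep->rr_rdmabuf),
                                             wc->byte_len, DMA_FROM_DEVICE);
                  /* ... hand the Reply off for processing ... */
          }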
      
      Fixes: 54cbd6b0 ("xprtrdma: Delay DMA mapping Send and ... ")
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Refactor rpcrdma_ia_open() · fff09594
      Committed by Chuck Lever
      In order to unload a device driver and reload it, xprtrdma will need
      to close a transport's interface adapter, and then call
      rpcrdma_ia_open again, possibly finding a different interface
      adapter.
      
      Make rpcrdma_ia_open safe to call on the same transport multiple
      times.
      
      This is a refactoring change only.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
  10. 11 Feb 2017, 4 commits
  11. 30 Nov 2016, 3 commits
    • xprtrdma: Relocate connection helper functions · 3a72dc77
      Committed by Chuck Lever
      Clean up: Disentangle connection helpers from RPC-over-RDMA reply
      decoding functions.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Support for SG_GAP devices · 5e9fc6a0
      Committed by Chuck Lever
      Some devices (such as the Mellanox CX-4) can register, under a
      single R_key, a set of memory regions that are not contiguous. When
      this is done, all the segments in a Reply list, say, can then be
      invalidated in a single LocalInv Work Request (or via Remote
      Invalidation, which can invalidate exactly one R_key when completing
      a Receive).
      
      This means a single FastReg WR is used to register, and one or zero
      LocalInv WRs can invalidate, the memory involved with RDMA transfers
      on behalf of an RPC.
      
      In addition, xprtrdma constructs some Reply chunks from three or
      more segments. By registering them with SG_GAP, only one segment
      is needed for the Reply chunk, allowing the whole chunk to be
      invalidated remotely.
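
      MR allocation might then choose the SG_GAP type whenever the device
      advertises it, roughly (IB_DEVICE_SG_GAPS_REG and IB_MR_TYPE_SG_GAPS
      are the upstream verbs names):

          enum ib_mr_type mr_type = IB_MR_TYPE_MEM_REG;

          if (ia->ri_device->attrs.device_cap_flags & IB_DEVICE_SG_GAPS_REG)
                  mr_type = IB_MR_TYPE_SG_GAPS;

          frmr->fr_mr = ib_alloc_mr(ia->ri_pd, mr_type, depth);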
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Make FRWR send queue entry accounting more accurate · 8d38de65
      Committed by Chuck Lever
      Verbs providers may perform house-keeping on the Send Queue during
      each signaled send completion. It is necessary therefore for a verbs
      consumer (like xprtrdma) to occasionally force a signaled send
      completion if it runs unsignaled most of the time.
      
      xprtrdma does not require signaled completions for Send or FastReg
      Work Requests, but does signal some LocalInv Work Requests. To
      ensure that Send Queue house-keeping can run before the Send Queue
      is more than half-consumed, xprtrdma forces a signaled completion
      on occasion by counting the number of Send Queue Entries it
      consumes. It currently does this by counting each ib_post_send as
      one Entry.
      
      Commit c9918ff5 ("xprtrdma: Add ro_unmap_sync method for FRWR")
      introduced the ability for frwr_op_unmap_sync to post more than one
      Work Request with a single post_send. Thus the underlying assumption
      of one Send Queue Entry per ib_post_send is no longer true.
      
      Also, FastReg Work Requests are currently never signaled. They
      should be signaled once in a while, just as Send is, to keep the
      accounting of consumed SQEs accurate.
      
      While we're here, convert the CQCOUNT macros to the currently
      preferred kernel coding style, which is inline functions.
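
      The macro-to-inline conversion looks roughly like this (a sketch;
      rep_send_batch and rep_send_count are assumed fields of struct
      rpcrdma_ep):

          static inline void rpcrdma_set_cqcount(struct rpcrdma_ep *ep,
                                                 int count)
          {
                  ep->rep_send_count = count;
          }

          static inline void rpcrdma_init_cqcount(struct rpcrdma_ep *ep,
                                                  int count)
          {
                  rpcrdma_set_cqcount(ep, ep->rep_send_batch - count);
          }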
      
      Fixes: c9918ff5 ("xprtrdma: Add ro_unmap_sync method for FRWR")
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
  12. 11 Nov 2016, 1 commit
  13. 20 Sep 2016, 10 commits
    • xprtrdma: Eliminate rpcrdma_receive_worker() · 496b77a5
      Committed by Chuck Lever
      Clean up: the extra layer of indirection doesn't add value.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Use gathered Send for large inline messages · 655fec69
      Committed by Chuck Lever
      An RPC Call message that is sent inline but that has a data payload
      (i.e., one or more items in rq_snd_buf's page list) must be "pulled
      up":
      
      - call_allocate has to reserve enough RPC Call buffer space to
      accommodate the data payload
      
      - call_transmit has to memcopy the rq_snd_buf's page list and tail
      into its head iovec before it is sent
      
      As the inline threshold is increased beyond its current 1KB default,
      however, this means data payloads of more than a few KB are copied
      by the host CPU. For example, if the inline threshold is increased
      just to 4KB, then NFS WRITE requests up to 4KB would involve a
      memcpy of the NFS WRITE's payload data into the RPC Call buffer.
      This is an undesirable amount of participation by the host CPU.
      
      The inline threshold may be much larger than 4KB in the future,
      after negotiation with a peer server.
      
      Instead of copying the components of rq_snd_buf into its head iovec,
      construct a gather list of these components, and send them all in
      place. The same approach is already used in the Linux server's
      RPC-over-RDMA reply path.
      
      This mechanism also eliminates the need for rpcrdma_tail_pullup,
      which is used to manage the XDR pad and trailing inline content when
      a Read list is present.
      
      This requires that the pages in rq_snd_buf's page list be DMA-mapped
      during marshaling, and unmapped when a data-bearing RPC is
      completed. This is slightly less efficient for very small I/O
      payloads, but significantly more efficient as data payload size and
      inline threshold increase past a kilobyte.
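
      Schematically, marshaling DMA-maps each payload page and appends it
      to a gather list that is handed to a single Send (a simplified
      sketch; rl_send_sge and the lkey source are assumed names):

          for (i = 0; i < page_count; i++) {
                  sge[sge_no].addr = ib_dma_map_page(device, ppages[i],
                                                     page_base, len,
                                                     DMA_TO_DEVICE);
                  sge[sge_no].length = len;
                  sge[sge_no].lkey = lkey;
                  sge_no++;
          }

          send_wr->sg_list = sge;
          send_wr->num_sge = sge_no;  /* header, head, pages, tail: in place */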
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Basic support for Remote Invalidation · c8b920bb
      Committed by Chuck Lever
      Have frwr's ro_unmap_sync recognize an invalidated rkey that appears
      as part of a Receive completion. Local invalidation can be skipped
      for that rkey.
      
      Use an out-of-band signaling mechanism to indicate to the server
      that the client is prepared to receive RDMA Send With Invalidate.
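
      Recognizing a remotely invalidated rkey at Receive completion is a
      simple check on the work completion, roughly (rr_inv_rkey is the
      field assumed to carry the rkey to ro_unmap_sync):

          if (wc->wc_flags & IB_WC_WITH_INVALIDATE)
                  rep->rr_inv_rkey = wc->ex.invalidate_rkey;
          else
                  rep->rr_inv_rkey = 0;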
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Client-side support for rpcrdma_connect_private · 87cfb9a0
      Committed by Chuck Lever
      Send an RDMA-CM private message on connect, and look for one during
      a connection-established event.
      
      Both sides can communicate their various implementation limits.
      Implementations that don't support this sideband protocol ignore it.
      
      Once the client knows the server's inline threshold maxima, it can
      adjust the use of Reply chunks, and eliminate most use of Position
      Zero Read chunks. Moderately-sized I/O can be done using a pure
      inline RDMA Send instead of RDMA operations that require memory
      registration.
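
      The private message itself is a small fixed-size structure carried
      in the RDMA-CM connection parameters; its layout is roughly:

          struct rpcrdma_connect_private {
                  __be32  cp_magic;
                  u8      cp_version;
                  u8      cp_flags;
                  u8      cp_send_size;   /* encoded inline threshold */
                  u8      cp_recv_size;   /* encoded inline threshold */
          } __packed;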
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Move recv_wr to struct rpcrdma_rep · 6ea8e711
      Committed by Chuck Lever
      Clean up: The fields in the recv_wr do not vary. There is no need to
      initialize them before each ib_post_recv(). This removes a large-ish
      data structure from the stack.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Move send_wr to struct rpcrdma_req · 90aab602
      Committed by Chuck Lever
      Clean up: Most of the fields in each send_wr do not vary. There is
      no need to initialize them before each ib_post_send(). This removes
      a large-ish data structure from the stack.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Simplify rpcrdma_ep_post_recv() · b157380a
      Committed by Chuck Lever
      Clean up.
      
      Since commit fc664485 ("xprtrdma: Split the completion queue"),
      rpcrdma_ep_post_recv() no longer uses the "ep" argument.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Eliminate "ia" argument in rpcrdma_{alloc, free}_regbuf · 13650c23
      Committed by Chuck Lever
      Clean up. The "ia" argument is no longer used.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Delay DMA mapping Send and Receive buffers · 54cbd6b0
      Committed by Chuck Lever
      Currently, each regbuf is allocated and DMA mapped at the same time.
      This is done during transport creation.
      
      When a device driver is unloaded, every DMA-mapped buffer in use by
      a transport has to be unmapped, and then remapped to the new
      device if the driver is loaded again. Remapping will have to be done
      _after_ the connect worker has set up the new device.
      
      But there's an ordering problem:
      
      call_allocate, which invokes xprt_rdma_allocate, which calls
      rpcrdma_alloc_regbuf to allocate Send buffers, happens _before_
      the connect worker can run to set up the new device.
      
      Instead, at transport creation, allocate each buffer, but leave it
      unmapped. Once the RPC carries these buffers into ->send_request, by
      which time a transport connection should have been established,
      check to see that the RPC's buffers have been DMA mapped. If not,
      map them there.
      
      When device driver unplug support is added, it will simply unmap all
      the transport's regbufs, but it doesn't have to deallocate the
      underlying memory.
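
      The lazy-mapping check in the send path can be sketched as follows
      (assuming rg_device doubles as the "already mapped" indicator):

          static inline bool rpcrdma_regbuf_is_mapped(struct rpcrdma_regbuf *rb)
          {
                  return rb->rg_device != NULL;
          }

          /* In ->send_request, before posting: */
          if (!rpcrdma_regbuf_is_mapped(rb) &&
              !__rpcrdma_dma_map_regbuf(ia, rb))
                  goto out_map_err;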
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Replace DMA_BIDIRECTIONAL · 99ef4db3
      Committed by Chuck Lever
      The use of DMA_BIDIRECTIONAL is discouraged by DMA-API.txt.
      Fortunately, xprtrdma now knows which direction I/O is going as
      soon as it allocates each regbuf.
      
      The RPC Call and Reply buffers are no longer the same regbuf. They
      can each be labeled correctly now. The RPC Reply buffer is never
      part of either a Send or Receive WR, but it can be part of a Reply
      chunk, which is mapped and registered via ->ro_map. So it is not
      DMA mapped when it is allocated (DMA_NONE), to avoid a
      double-mapping.
      
      Since Receive buffers are no longer DMA_BIDIRECTIONAL and their
      contents are never modified by the host CPU, DMA-API-HOWTO.txt
      suggests that a DMA sync before posting each buffer should be
      unnecessary. (See my_card_interrupt_handler).
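
      After the change, each regbuf carries an explicit direction from
      allocation onward, along these lines (a sketch; the direction
      argument to rpcrdma_alloc_regbuf is assumed):

          req->rl_rdmabuf = rpcrdma_alloc_regbuf(hdrsize, DMA_TO_DEVICE, flags);
          req->rl_sendbuf = rpcrdma_alloc_regbuf(sndsize, DMA_TO_DEVICE, flags);
          req->rl_recvbuf = rpcrdma_alloc_regbuf(rcvsize, DMA_NONE, flags);
          rep->rr_rdmabuf = rpcrdma_alloc_regbuf(rcvsize, DMA_FROM_DEVICE, flags);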
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>