1. 18 Nov, 2017 7 commits
    • xprtrdma: Put Send CQ in IB_POLL_WORKQUEUE mode · a4699f56
      Committed by Chuck Lever
      Lift the Send and LocalInv completion handlers out of soft IRQ mode
      to make room for other work. Also, move the Send CQ to a different
      CPU than the CPU where the Receive CQ is running, for improved
      scalability.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Reviewed-by: Devesh Sharma <devesh.sharma@broadcom.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Remove atomic send completion counting · 6f0afc28
      Committed by Chuck Lever
      The sendctx circular queue now guarantees that xprtrdma cannot
      overflow the Send Queue, so remove the remaining bits of the
      original Send WQE counting mechanism.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: RPC completion should wait for Send completion · 01bb35c8
      Committed by Chuck Lever
      When an RPC Call includes a file data payload, that payload can come
      from pages in the page cache, or a user buffer (for direct I/O).
      
      If the payload can fit inline, xprtrdma includes it in the Send
      using a scatter-gather technique. xprtrdma mustn't allow the RPC
      consumer to re-use the memory where that payload resides before the
      Send completes. Otherwise, the new contents of that memory would be
      exposed by an HCA retransmit of the Send operation.
      
      So, block RPC completion on Send completion, but only in the case
      where a separate file data payload is part of the Send. This
      prevents the reuse of that memory while it is still part of a Send
      operation without an undue cost to other cases.
      
      Waiting is avoided in the common case because typically the Send
      will have completed long before the RPC Reply arrives.
      
      These days, an RPC timeout will trigger a disconnect, which tears
      down the QP. The disconnect flushes all waiting Sends. This bounds
      the amount of time the reply handler has to wait for a Send
      completion.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Add a field of bit flags to struct rpcrdma_req · 531cca0c
      Committed by Chuck Lever
      We have one boolean flag in rpcrdma_req today. I'd like to add more
      flags, so convert that boolean to a bit flag.
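
The conversion described above can be sketched in plain C. This is only an illustration of the bool-to-bit-flag pattern: the flag names and helper functions below are hypothetical and are not taken from the kernel's struct rpcrdma_req.

```c
#include <stdbool.h>

/* Hypothetical flag bits; the names are illustrative only. */
enum req_flag {
	REQ_F_BACKCHANNEL = 1u << 0,	/* what used to be a lone bool */
	REQ_F_PENDING     = 1u << 1,	/* room for flags added later */
};

struct req {
	unsigned long flags;		/* one word holds every flag */
};

static inline void req_set_flag(struct req *r, unsigned long f)
{
	r->flags |= f;
}

static inline void req_clear_flag(struct req *r, unsigned long f)
{
	r->flags &= ~f;
}

static inline bool req_test_flag(const struct req *r, unsigned long f)
{
	return (r->flags & f) != 0;
}
```

Adding another flag then costs one more enum entry rather than another struct member.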
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Add data structure to manage RDMA Send arguments · ae72950a
      Committed by Chuck Lever
      Problem statement:
      
      Recently Sagi Grimberg <sagi@grimberg.me> observed that kernel RDMA-
      enabled storage initiators don't handle delayed Send completion
      correctly. If Send completion is delayed beyond the end of a ULP
      transaction, the ULP may release resources that are still being used
      by the HCA to complete a long-running Send operation.
      
      This is a common design trait amongst our initiators. Most Send
      operations are faster than the ULP transaction they are part of.
      Waiting for a completion for these is typically unnecessary.
      
      Infrequently, a network partition or some other problem crops up
      where an ordering problem can occur. In NFS parlance, the RPC Reply
      arrives and completes the RPC, but the HCA is still retrying the
      Send WR that conveyed the RPC Call. In this case, the HCA can try
      to use memory that has been invalidated or DMA unmapped, and the
      connection is lost. If that memory has been re-used for something
      else (possibly not related to NFS), the Send retransmission
      exposes that data on the wire.
      
      Thus we cannot assume that it is safe to release Send-related
      resources just because a ULP reply has arrived.
      
      After some analysis, we have determined that the completion
      housekeeping will not be difficult for xprtrdma:
      
       - Inline Send buffers are registered via the local DMA key, and
         are already left DMA mapped for the lifetime of a transport
         connection, thus no additional handling is necessary for those.
       - Gathered Sends involving page cache pages _will_ need to
         DMA unmap those pages after the Send completes. But like
         inline send buffers, they are registered via the local DMA key,
         and thus will not need to be invalidated.
      
      In addition, RPC completion will need to wait for Send completion
      in the latter case. However, nearly always, the Send that conveys
      the RPC Call will have completed long before the RPC Reply
      arrives, and thus no additional latency will be accrued.
      
      Design notes:
      
      In this patch, the rpcrdma_sendctx object is introduced, and a
      lock-free circular queue is added to manage a set of them per
      transport.
      
      The RPC client's send path already prevents sending more than one
      RPC Call at the same time. This allows us to treat the consumer
      side of the queue (rpcrdma_sendctx_get_locked) as if there is a
      single consumer thread.
      
      The producer side of the queue (rpcrdma_sendctx_put_locked) is
      invoked only from the Send completion handler, which is a single
      thread of execution (soft IRQ).
      
      The only care that needs to be taken is with the tail index, which
      is shared between the producer and consumer. Only the producer
      updates the tail index. The consumer compares the head with the
      tail to ensure that a sendctx that is in use is never handed
      out again (or, expressed more conventionally, the queue is empty).
      
      When the sendctx queue empties completely, there are enough Sends
      outstanding that posting more Send operations can result in a Send
      Queue overflow. In this case, the ULP is told to wait and try again.
      This introduces strong Send Queue accounting to xprtrdma.
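
A minimal userspace sketch of the queue discipline described in the design notes, assuming a single send path and a single completion path. The type and function names (sc_queue, sc_get, sc_put) and the depth SC_COUNT are hypothetical; the kernel code uses its own structures and memory barriers rather than C11 atomics.

```c
#include <stdatomic.h>
#include <stddef.h>

#define SC_COUNT 4	/* hypothetical queue depth */

struct sendctx {
	int id;
};

struct sc_queue {
	struct sendctx	ctx[SC_COUNT];
	unsigned long	head;	/* touched only by the send path */
	atomic_ulong	tail;	/* written only by the completion path */
};

static unsigned long sc_next(unsigned long i)
{
	return (i + 1) % SC_COUNT;
}

static void sc_init(struct sc_queue *q)
{
	for (int i = 0; i < SC_COUNT; i++)
		q->ctx[i].id = i;
	q->head = 0;
	atomic_init(&q->tail, 0);
}

/* Send path: hand out the next context, or NULL when every usable
 * context is still owned by an outstanding Send. */
static struct sendctx *sc_get(struct sc_queue *q)
{
	unsigned long next = sc_next(q->head);

	if (next == atomic_load_explicit(&q->tail, memory_order_acquire))
		return NULL;	/* caller must wait and try again */
	q->head = next;
	return &q->ctx[next];
}

/* Completion path: a completion for sc releases every context up to
 * and including sc, which is what allows signaling only some Sends. */
static void sc_put(struct sc_queue *q, struct sendctx *sc)
{
	unsigned long tail = (unsigned long)(sc - q->ctx);

	atomic_store_explicit(&q->tail, tail, memory_order_release);
}
```

Because only the completion path writes the tail and only the send path writes the head, no lock is needed; one slot is sacrificed so that "head catches up with tail" can be told apart from "nothing outstanding".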
      
      As a final touch, Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
      suggested a mechanism that does not require signaling every Send.
      We signal once every N Sends, and perform SGE unmapping of N Send
      operations during that one completion.
      Reported-by: Sagi Grimberg <sagi@grimberg.me>
      Suggested-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Decode credits field in rpcrdma_reply_handler · be798f90
      Committed by Chuck Lever
      We need to decode and save the incoming rdma_credits field _after_
      we know that the direction of the message is "forward direction
      Reply". Otherwise, the credits value in reverse direction Calls is
      also used to update the forward direction credits.
      
      It is safe to decode the rdma_credits field in rpcrdma_reply_handler
      now that rpcrdma_reply_handler is single-threaded. Receives complete
      in the same order as they were sent on the NFS server.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Invoke rpcrdma_reply_handler directly from RECV completion · d8f532d2
      Committed by Chuck Lever
      I noticed that the soft IRQ thread looked pretty busy under heavy
      I/O workloads. perf suggested one area that was expensive was the
      queue_work() call in rpcrdma_wc_receive. That gave me some ideas.
      
      Instead of scheduling a separate worker to process RPC Replies,
      promote the Receive completion handler to IB_POLL_WORKQUEUE, and
      invoke rpcrdma_reply_handler directly.
      
      Note that the poll workqueue is single-threaded. In order to keep
      memory invalidation from serializing all RPC Replies, handle any
      necessary invalidation tasks in a separate multi-threaded workqueue.
      
      This provides a two-tier scheme, similar to OS I/O interrupt
      handlers: A fast interrupt handler that schedules the slow handler
      and re-enables the interrupt, and a slower handler that is invoked
      for any needed heavy lifting.
      
      Benefits include:
      - One less context switch for RPCs that don't register memory
      - Receive completion handling is moved out of soft IRQ context to
        make room for other users of soft IRQ
      - The same CPU core now DMA syncs and XDR decodes the Receive buffer
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
  2. 06 Sep, 2017 1 commit
    • xprtrdma: Use xprt_pin_rqst in rpcrdma_reply_handler · 9590d083
      Committed by Chuck Lever
      Adopt the use of xprt_pin_rqst to eliminate contention between
      Call-side users of rb_lock and the use of rb_lock in
      rpcrdma_reply_handler.
      
      This replaces the mechanism introduced in 431af645 ("xprtrdma:
      Fix client lock-up after application signal fires").
      
      Use recv_lock to quickly find the completing rqst, pin it, then
      drop the lock. At that point invalidation and pull-up of the Reply
      XDR can be done. Both are often expensive operations.
      
      Finally, take recv_lock again to signal completion to the RPC
      layer. It also protects adjustment of "cwnd".
      
      This greatly reduces the amount of time a lock is held by the
      reply handler. Comparing lock_stat results shows a marked decrease
      in contention on rb_lock and recv_lock.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      [trond.myklebust@primarydata.com: Remove call to rpcrdma_buffer_put() from
         the "out_norqst:" path in rpcrdma_reply_handler.]
      Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
  3. 08 Aug, 2017 4 commits
  4. 14 Jul, 2017 4 commits
    • xprtrdma: Demote "connect" log messages · 173b8f49
      Committed by Chuck Lever
      Some have complained about the log messages generated when xprtrdma
      opens or closes a connection to a server. When an NFS mount is
      mostly idle these can appear every few minutes as the client idles
      out the connection and reconnects.
      
      Connection and disconnection are a normal part of operation, and not
      exceptional, so change these to dprintk's for now. At some point
      all of these will be converted to tracepoints, but that's for
      another day.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Fix client lock-up after application signal fires · 431af645
      Committed by Chuck Lever
      After a signal, the RPC client aborts synchronous RPCs running on
      behalf of the signaled application.
      
      The server is still executing those RPCs, and will write the results
      back into the client's memory when it's done. By the time the server
      writes the results, that memory is likely being used for other
      purposes. Therefore xprtrdma has to immediately invalidate all
      memory regions used by those aborted RPCs to prevent the server's
      writes from clobbering that re-used memory.
      
      With FMR memory registration, invalidation takes a relatively long
      time. In fact, the invalidation is often still running when the
      server tries to write the results into the memory regions that are
      being invalidated.
      
      This sets up a race between two processes:
      
      1.  After the signal, xprt_rdma_free calls ro_unmap_safe.
      2.  While ro_unmap_safe is still running, the server replies and
          rpcrdma_reply_handler runs, calling ro_unmap_sync.
      
      Both processes invoke ib_unmap_fmr on the same FMR.
      
      The mlx4 driver allows two ib_unmap_fmr calls on the same FMR at
      the same time, but HCAs generally don't tolerate this. Sometimes
      this can result in a system crash.
      
      If the HCA happens to survive, rpcrdma_reply_handler continues. It
      removes the rpc_rqst from rq_list and releases the transport_lock.
      This enables xprt_rdma_free to run in another process, and the
      rpc_rqst is released while rpcrdma_reply_handler is still waiting
      for the ib_unmap_fmr call to finish.
      
      But further down in rpcrdma_reply_handler, the transport_lock is
      taken again, and "rqst" is dereferenced. If "rqst" has already been
      released, this triggers a general protection fault. Since bottom-
      halves are disabled, the system locks up.
      
      Address both issues by reversing the order of the xprt_lookup_rqst
      call and the ro_unmap_sync call. Introduce a separate lookup
      mechanism for rpcrdma_req's to enable calling ro_unmap_sync before
      xprt_lookup_rqst. Now the handler takes the transport_lock once
      and holds it for the XID lookup and RPC completion.
      
      BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=305
      Fixes: 68791649 ('xprtrdma: Invalidate in the RPC reply ... ')
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Rename rpcrdma_req::rl_free · a80d66c9
      Committed by Chuck Lever
      Clean up: I'm about to use the rl_free field for purposes other than
      a free list. So use a more generic name.
      
      This is a refactoring change only.
      
      BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=305
      Fixes: 68791649 ('xprtrdma: Invalidate in the RPC reply ... ')
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Pre-mark remotely invalidated MRs · 4b196dc6
      Committed by Chuck Lever
      There are rare cases where an rpcrdma_req and its matched
      rpcrdma_rep can be re-used, via rpcrdma_buffer_put, while the RPC
      reply handler is still using that req. This is typically due to a
      signal firing at just the wrong instant.
      
      As part of closing this race window, avoid using the wrong
      rpcrdma_rep to detect remotely invalidated MRs. Mark MRs as
      invalidated while we are sure the rep is still OK to use.
      
      BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=305
      Fixes: 68791649 ('xprtrdma: Invalidate in the RPC reply ... ')
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
  5. 26 Apr, 2017 8 commits
  6. 18 Mar, 2017 1 commit
    • xprtrdma: Squelch kbuild sparse complaint · eed50879
      Committed by Chuck Lever
      New complaint from kbuild for 4.9.y:
      
      net/sunrpc/xprtrdma/verbs.c:489:19: sparse: incompatible types in
          comparison expression (different type sizes)
      
      verbs.c:
      489	max_sge = min(ia->ri_device->attrs.max_sge, RPCRDMA_MAX_SEND_SGES);
      
      I can't reproduce this running sparse here. Likewise, "make W=1
      net/sunrpc/xprtrdma/verbs.o" never indicated any issue.
      
      A little poking suggests that because the range of its values is
      small, gcc can make the actual width of RPCRDMA_MAX_SEND_SGES
      smaller than the width of an unsigned integer.
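
The message above does not say how the warning was silenced. One common way to avoid this class of complaint is to cast both operands to a single named type before comparing, as the kernel's min_t() does. The following is a userspace sketch of that idea only: MIN_T, MAX_SEND_SGES and its value, and clamp_send_sges are hypothetical stand-ins.

```c
/* Userspace stand-in for a typed minimum: both operands are cast to
 * one explicitly named type, so the comparison never mixes widths.
 * Note that this simple macro evaluates its arguments more than once. */
#define MIN_T(type, a, b) \
	((type)(a) < (type)(b) ? (type)(a) : (type)(b))

enum { MAX_SEND_SGES = 18 };	/* hypothetical limit, for illustration */

/* Clamp a device-reported SGE limit to the transport's own maximum. */
static unsigned int clamp_send_sges(int device_max_sge)
{
	return MIN_T(unsigned int, device_max_sge, MAX_SEND_SGES);
}
```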
      
      Fixes: 16f906d6 ("xprtrdma: Reduce required number of send SGEs")
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Cc: stable@kernel.org
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
  7. 11 Feb, 2017 5 commits
    • xprtrdma: Refactor management of mw_list field · 9a5c63e9
      Committed by Chuck Lever
      Clean up some duplicate code.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Handle stale connection rejection · 0a90487b
      Committed by Chuck Lever
      A server rejects a connection attempt with STALE_CONNECTION when a
      client attempts to connect to a working remote service, but uses a
      QPN and GUID that correspond to an old connection that was
      abandoned. This might occur after a client crashes and restarts.
      
      Fix rpcrdma_conn_upcall() to distinguish between a normal rejection
      and rejection of stale connection parameters.
      
      As an additional clean-up, remove the code that retries the
      connection attempt with different ORD/IRD values. Code audit of
      other ULP initiators shows no similar special case handling of
      initiator_depth or responder_resources.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Reduce required number of send SGEs · 16f906d6
      Committed by Chuck Lever
      The MAX_SEND_SGES check introduced in commit 655fec69
      ("xprtrdma: Use gathered Send for large inline messages") fails
      for devices that have a small max_sge.
      
      Instead of checking for a large fixed maximum number of SGEs,
      check for a minimum small number. RPC-over-RDMA will switch to
      using a Read chunk if an xdr_buf has more pages than can fit in
      the device's max_sge limit. This is considerably better than
      failing altogether to mount the server.
      
      This fix supports devices that have as few as three send SGEs
      available.
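
The two checks described above (a small floor at setup time, and a per-request fallback to a Read chunk) might look roughly like this. It is a sketch under stated assumptions: the function names are hypothetical, and the choice to reserve two SGEs for the transport header and the RPC head is an assumption made for illustration, not something the message states.

```c
#include <stdbool.h>

#define MIN_SEND_SGES 3	/* the message names three SGEs as the floor */

/* Setup-time check: reject only devices below the small floor,
 * instead of requiring a large fixed maximum. */
static bool device_usable(unsigned int dev_max_sge)
{
	return dev_max_sge >= MIN_SEND_SGES;
}

/* Per-request check: true means the payload pages fit in one gathered
 * Send; false means the caller should fall back to a Read chunk.
 * Only call this for devices that passed device_usable(). */
static bool fits_in_gathered_send(unsigned int dev_max_sge,
				  unsigned int payload_pages)
{
	/* Assumption: two SGEs are reserved for header and RPC head. */
	unsigned int reserved = 2;

	return payload_pages <= dev_max_sge - reserved;
}
```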
      Reported-by: Selvin Xavier <selvin.xavier@broadcom.com>
      Reported-by: Devesh Sharma <devesh.sharma@broadcom.com>
      Reported-by: Honggang Li <honli@redhat.com>
      Reported-by: Ram Amrani <Ram.Amrani@cavium.com>
      Fixes: 655fec69 ("xprtrdma: Use gathered Send for large ...")
      Cc: stable@vger.kernel.org # v4.9+
      Tested-by: Honggang Li <honli@redhat.com>
      Tested-by: Ram Amrani <Ram.Amrani@cavium.com>
      Tested-by: Steve Wise <swise@opengridcomputing.com>
      Reviewed-by: Parav Pandit <parav@mellanox.com>
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Disable pad optimization by default · c95a3c6b
      Committed by Chuck Lever
      Commit d5440e27 ("xprtrdma: Enable pad optimization") made the
      Linux client omit XDR round-up padding in normal Read and Write
      chunks so that the client doesn't have to register and invalidate
      3-byte memory regions that contain no real data.
      
      Unfortunately, my cheery 2014 assessment that this optimization "is
      supported now by both Linux and Solaris servers" was premature.
      We've found bugs in Solaris in this area since commit d5440e27
      ("xprtrdma: Enable pad optimization") was merged (SYMLINK is the
      main offender).
      
      So for maximum interoperability, I'm disabling this optimization
      again. If a CM private message is exchanged when connecting, the
      client recognizes that the server is Linux, and enables the
      optimization for that connection.
      
      Until now the Solaris server bugs did not impact common operations,
      and were thus largely benign. Soon, less capable devices on Linux
      NFS/RDMA clients will make use of Read chunks more often, and these
      Solaris bugs will prevent interoperation in more cases.
      
      Fixes: 677eb17e ("xprtrdma: Fix XDR tail buffer marshalling")
      Cc: stable@vger.kernel.org # v4.9+
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Per-connection pad optimization · b5f0afbe
      Committed by Chuck Lever
      Pad optimization is changed by echoing into
      /proc/sys/sunrpc/rdma_pad_optimize. This is a global setting,
      affecting all RPC-over-RDMA connections to all servers.
      
      The marshaling code picks up that value and uses it for decisions
      about how to construct each RPC-over-RDMA frame. Having it change
      suddenly in mid-operation can result in unexpected failures. And
      some servers a client mounts might need chunk round-up, while
      others don't.
      
      So instead, copy the pad_optimize setting into each connection's
      rpcrdma_ia when the transport is created, and use the copy, which
      can't change during the life of the connection, instead.
      
      This also removes a hack: rpcrdma_convert_iovs was using
      the remote-invalidation-expected flag to predict when it could leave
      out Write chunk padding. This is because the Linux server handles
      implicit XDR padding on Write chunks correctly, and only Linux
      servers can set the connection's remote-invalidation-expected flag.
      
      It's more sensible to use the pad optimization setting instead.
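
The snapshot-at-creation idea can be illustrated with a short sketch. The names below (pad_optimize_global, struct conn, conn_init, conn_needs_roundup) are hypothetical stand-ins for the sysctl value and for the per-connection rpcrdma_ia field.

```c
#include <stdbool.h>

/* Stand-in for the global tunable that an administrator can change
 * at any moment. */
static bool pad_optimize_global = true;

struct conn {
	bool pad_optimize;	/* snapshot taken when the transport is created */
};

static void conn_init(struct conn *c)
{
	c->pad_optimize = pad_optimize_global;
}

/* Marshaling decisions consult only the per-connection copy, so a
 * later change to the global cannot affect a live connection. */
static bool conn_needs_roundup(const struct conn *c)
{
	return !c->pad_optimize;
}
```

A connection created after the global changes picks up the new value, while existing connections keep the behavior they started with.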
      
      Fixes: 677eb17e ("xprtrdma: Fix XDR tail buffer marshalling")
      Cc: stable@vger.kernel.org # v4.9+
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
  8. 30 Nov, 2016 4 commits
    • xprtrdma: Shorten QP access error message · 2f6922ca
      Committed by Chuck Lever
      Clean up: The convention for this type of warning message is not to
      show the function name or "RPC:       ".
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Squelch "max send, max recv" messages at connect time · 6d6bf72d
      Committed by Chuck Lever
      Clean up: This message was intended to be a dprintk, as it is on the
      server-side.
      
      Fixes: 87cfb9a0 ('xprtrdma: Client-side support for ...')
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Address coverity complaint about wait_for_completion() · 109b88ab
      Committed by Chuck Lever
      > ** CID 114101:  Error handling issues  (CHECKED_RETURN)
      > /net/sunrpc/xprtrdma/verbs.c: 355 in rpcrdma_create_id()
      
      Commit 5675add3 ("RPC/RDMA: harden connection logic against
      missing/late rdma_cm upcalls.") replaced wait_for_completion() calls
      with these two call sites.
      
      The original wait_for_completion() calls were added in the initial
      commit of verbs.c, which was commit c56c65fb ("RPCRDMA: rpc rdma
      verbs interface implementation"), but these returned void.
      
      rpcrdma_create_id() is called by the RDMA connect worker, which
      probably won't ever be interrupted. It is also called by
      rpcrdma_ia_open which is in the synchronous mount path, and ^C is
      possible there.
      
      Add a bit of logic at those two call sites to return if the waits
      return ERESTARTSYS.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Make FRWR send queue entry accounting more accurate · 8d38de65
      Committed by Chuck Lever
      Verbs providers may perform house-keeping on the Send Queue during
      each signaled send completion. It is necessary therefore for a verbs
      consumer (like xprtrdma) to occasionally force a signaled send
      completion if it runs unsignaled most of the time.
      
      xprtrdma does not require signaled completions for Send or FastReg
      Work Requests, but does signal some LocalInv Work Requests. To
      ensure that Send Queue house-keeping can run before the Send Queue
      is more than half-consumed, xprtrdma forces a signaled completion
      on occasion by counting the number of Send Queue Entries it
      consumes. It currently does this by counting each ib_post_send as
      one Entry.
      
      Commit c9918ff5 ("xprtrdma: Add ro_unmap_sync method for FRWR")
      introduced the ability for frwr_op_unmap_sync to post more than one
      Work Request with a single post_send. Thus the underlying assumption
      of one Send Queue Entry per ib_post_send is no longer true.
      
      Also, FastReg Work Requests are currently never signaled. They
      should be signaled once in a while, just as Send is, to keep the
      accounting of consumed SQEs accurate.
      
      While we're here, convert the CQCOUNT macros to the currently
      preferred kernel coding style, which is inline functions.
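
The accounting change can be sketched as a small budget that is charged per Work Request rather than per post. The names (sq_budget, sq_budget_charge) are hypothetical, and the reset period is left to the caller; the message only says the signal must arrive before the Send Queue is more than half consumed.

```c
#include <stdbool.h>

struct sq_budget {
	int remaining;	/* SQEs left before the next forced signal */
	int period;	/* reset value, chosen by the caller */
};

static inline void sq_budget_init(struct sq_budget *b, int period)
{
	b->period = period;
	b->remaining = period;
}

/* Charge wr_count Send Queue Entries for a single post, which may
 * chain several Work Requests. Returns true when the caller should
 * request a signaled completion on this post. */
static inline bool sq_budget_charge(struct sq_budget *b, int wr_count)
{
	b->remaining -= wr_count;
	if (b->remaining > 0)
		return false;
	b->remaining = b->period;
	return true;
}
```

Charging by wr_count is the point of the fix: a post that chains three Work Requests consumes three entries, not one.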
      
      Fixes: c9918ff5 ("xprtrdma: Add ro_unmap_sync method for FRWR")
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
  9. 24 Sep, 2016 1 commit
  10. 20 Sep, 2016 5 commits
    • xprtrdma: Eliminate rpcrdma_receive_worker() · 496b77a5
      Committed by Chuck Lever
      Clean up: the extra layer of indirection doesn't add value.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Rename rpcrdma_receive_wc() · 1519e969
      Committed by Chuck Lever
      Clean up: When converting xprtrdma to use the new CQ API, I missed a
      spot. The naming convention elsewhere is:
      
        {svc_rdma,rpcrdma}_wc_{operation}
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Use gathered Send for large inline messages · 655fec69
      Committed by Chuck Lever
      An RPC Call message that is sent inline but that has a data payload
      (i.e., one or more items in rq_snd_buf's page list) must be "pulled
      up":
      
      - call_allocate has to reserve enough RPC Call buffer space to
      accommodate the data payload
      
      - call_transmit has to memcopy the rq_snd_buf's page list and tail
      into its head iovec before it is sent
      
      As the inline threshold is increased beyond its current 1KB default,
      however, this means data payloads of more than a few KB are copied
      by the host CPU. For example, if the inline threshold is increased
      just to 4KB, then NFS WRITE requests up to 4KB would involve a
      memcpy of the NFS WRITE's payload data into the RPC Call buffer.
      This is an undesirable amount of participation by the host CPU.
      
      The inline threshold may be much larger than 4KB in the future,
      after negotiation with a peer server.
      
      Instead of copying the components of rq_snd_buf into its head iovec,
      construct a gather list of these components, and send them all in
      place. The same approach is already used in the Linux server's
      RPC-over-RDMA reply path.
      
      This mechanism also eliminates the need for rpcrdma_tail_pullup,
      which is used to manage the XDR pad and trailing inline content when
      a Read list is present.
      
      This requires that the pages in rq_snd_buf's page list be DMA-mapped
      during marshaling, and unmapped when a data-bearing RPC is
      completed. This is slightly less efficient for very small I/O
      payloads, but significantly more efficient as data payload size and
      inline threshold increase past a kilobyte.
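
The difference between pull-up and a gathered Send can be illustrated in userspace with POSIX struct iovec. This is a sketch only: struct xbuf and build_gather_list are hypothetical, and real code would also DMA-map each element, which is omitted here.

```c
#include <stddef.h>
#include <sys/uio.h>

/* Simplified stand-in for an xdr_buf: a head, a page list, a tail. */
struct xbuf {
	struct iovec	head;
	struct iovec	*pages;
	size_t		npages;
	struct iovec	tail;
};

/* Describe the components in place instead of copying them into the
 * head. Returns the number of elements filled in, or 0 if the buffer
 * needs more elements than max_sge allows. */
static size_t build_gather_list(const struct xbuf *x, struct iovec *sge,
				size_t max_sge)
{
	size_t needed = 1 + x->npages + (x->tail.iov_len ? 1 : 0);
	size_t n = 0;

	if (needed > max_sge)
		return 0;

	sge[n++] = x->head;
	for (size_t i = 0; i < x->npages; i++)
		sge[n++] = x->pages[i];
	if (x->tail.iov_len)
		sge[n++] = x->tail;
	return n;
}
```

No payload byte is touched by the CPU here; only the descriptors are written, which is why the cost stays flat as the payload grows.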
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Basic support for Remote Invalidation · c8b920bb
      Committed by Chuck Lever
      Have frwr's ro_unmap_sync recognize an invalidated rkey that appears
      as part of a Receive completion. Local invalidation can be skipped
      for that rkey.
      
      Use an out-of-band signaling mechanism to indicate to the server
      that the client is prepared to receive RDMA Send With Invalidate.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Client-side support for rpcrdma_connect_private · 87cfb9a0
      Committed by Chuck Lever
      Send an RDMA-CM private message on connect, and look for one during
      a connection-established event.
      
      Both sides can communicate their various implementation limits.
      Implementations that don't support this sideband protocol ignore it.
      
      Once the client knows the server's inline threshold maxima, it can
      adjust the use of Reply chunks, and eliminate most use of Position
      Zero Read chunks. Moderately-sized I/O can be done using a pure
      inline RDMA Send instead of RDMA operations that require memory
      registration.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>