1. 30 11月, 2016 4 次提交
    • C
      xprtrdma: Shorten QP access error message · 2f6922ca
      Chuck Lever 提交于
      Clean up: The convention for this type of warning message is not to
      show the function name or "RPC:       ".
      Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
      2f6922ca
    • C
      xprtrdma: Squelch "max send, max recv" messages at connect time · 6d6bf72d
      Chuck Lever 提交于
      Clean up: This message was intended to be a dprintk, as it is on the
      server-side.
      
      Fixes: 87cfb9a0 ('xprtrdma: Client-side support for ...')
      Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
      6d6bf72d
    • C
      xprtrdma: Address coverity complaint about wait_for_completion() · 109b88ab
      Chuck Lever 提交于
      > ** CID 114101:  Error handling issues  (CHECKED_RETURN)
      > /net/sunrpc/xprtrdma/verbs.c: 355 in rpcrdma_create_id()
      
      Commit 5675add3 ("RPC/RDMA: harden connection logic against
      missing/late rdma_cm upcalls.") replaced wait_for_completion() calls
      with these two call sites.
      
      The original wait_for_completion() calls were added in the initial
      commit of verbs.c, which was commit c56c65fb ("RPCRDMA: rpc rdma
      verbs interface implementation"), but these returned void.
      
      rpcrdma_create_id() is called by the RDMA connect worker, which
      probably won't ever be interrupted. It is also called by
      rpcrdma_ia_open which is in the synchronous mount path, and ^C is
      possible there.
      
      Add a bit of logic at those two call sites to return if the waits
      return ERESTARTSYS.
      Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
      109b88ab
    • C
      xprtrdma: Make FRWR send queue entry accounting more accurate · 8d38de65
      Chuck Lever 提交于
      Verbs providers may perform house-keeping on the Send Queue during
      each signaled send completion. It is necessary therefore for a verbs
      consumer (like xprtrdma) to occasionally force a signaled send
      completion if it runs unsignaled most of the time.
      
      xprtrdma does not require signaled completions for Send or FastReg
      Work Requests, but does signal some LocalInv Work Requests. To
      ensure that Send Queue house-keeping can run before the Send Queue
      is more than half-consumed, xprtrdma forces a signaled completion
      on occasion by counting the number of Send Queue Entries it
      consumes. It currently does this by counting each ib_post_send as
      one Entry.
      
      Commit c9918ff5 ("xprtrdma: Add ro_unmap_sync method for FRWR")
      introduced the ability for frwr_op_unmap_sync to post more than one
      Work Request with a single post_send. Thus the underlying assumption
      of one Send Queue Entry per ib_post_send is no longer true.
      
      Also, FastReg Work Requests are currently never signaled. They
      should be signaled once in a while, just as Send is, to keep the
      accounting of consumed SQEs accurate.
      
      While we're here, convert the CQCOUNT macros to the currently
      preferred kernel coding style, which is inline functions.
      
      Fixes: c9918ff5 ("xprtrdma: Add ro_unmap_sync method for FRWR")
      Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
      8d38de65
  2. 24 9月, 2016 1 次提交
  3. 20 9月, 2016 13 次提交
    • C
      xprtrdma: Eliminate rpcrdma_receive_worker() · 496b77a5
      Chuck Lever 提交于
      Clean up: the extra layer of indirection doesn't add value.
      Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
      496b77a5
    • C
      xprtrdma: Rename rpcrdma_receive_wc() · 1519e969
      Chuck Lever 提交于
      Clean up: When converting xprtrdma to use the new CQ API, I missed a
      spot. The naming convention elsewhere is:
      
        {svc_rdma,rpcrdma}_wc_{operation}
      Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
      1519e969
    • C
      xprtrdma: Use gathered Send for large inline messages · 655fec69
      Chuck Lever 提交于
      An RPC Call message that is sent inline but that has a data payload
      (ie, one or more items in rq_snd_buf's page list) must be "pulled
      up:"
      
      - call_allocate has to reserve enough RPC Call buffer space to
      accommodate the data payload
      
      - call_transmit has to memcopy the rq_snd_buf's page list and tail
      into its head iovec before it is sent
      
      As the inline threshold is increased beyond its current 1KB default,
      however, this means data payloads of more than a few KB are copied
      by the host CPU. For example, if the inline threshold is increased
      just to 4KB, then NFS WRITE requests up to 4KB would involve a
      memcpy of the NFS WRITE's payload data into the RPC Call buffer.
      This is an undesirable amount of participation by the host CPU.
      
      The inline threshold may be much larger than 4KB in the future,
      after negotiation with a peer server.
      
      Instead of copying the components of rq_snd_buf into its head iovec,
      construct a gather list of these components, and send them all in
      place. The same approach is already used in the Linux server's
      RPC-over-RDMA reply path.
      
      This mechanism also eliminates the need for rpcrdma_tail_pullup,
      which is used to manage the XDR pad and trailing inline content when
      a Read list is present.
      
      This requires that the pages in rq_snd_buf's page list be DMA-mapped
      during marshaling, and unmapped when a data-bearing RPC is
      completed. This is slightly less efficient for very small I/O
      payloads, but significantly more efficient as data payload size and
      inline threshold increase past a kilobyte.
      Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
      655fec69
    • C
      xprtrdma: Basic support for Remote Invalidation · c8b920bb
      Chuck Lever 提交于
      Have frwr's ro_unmap_sync recognize an invalidated rkey that appears
      as part of a Receive completion. Local invalidation can be skipped
      for that rkey.
      
      Use an out-of-band signaling mechanism to indicate to the server
      that the client is prepared to receive RDMA Send With Invalidate.
      Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
      c8b920bb
    • C
      xprtrdma: Client-side support for rpcrdma_connect_private · 87cfb9a0
      Chuck Lever 提交于
      Send an RDMA-CM private message on connect, and look for one during
      a connection-established event.
      
      Both sides can communicate their various implementation limits.
      Implementations that don't support this sideband protocol ignore it.
      
      Once the client knows the server's inline threshold maxima, it can
      adjust the use of Reply chunks, and eliminate most use of Position
      Zero Read chunks. Moderately-sized I/O can be done using a pure
      inline RDMA Send instead of RDMA operations that require memory
      registration.
      Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
      87cfb9a0
    • C
      xprtrdma: Move recv_wr to struct rpcrdma_rep · 6ea8e711
      Chuck Lever 提交于
      Clean up: The fields in the recv_wr do not vary. There is no need to
      initialize them before each ib_post_recv(). This removes a large-ish
      data structure from the stack.
      Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
      6ea8e711
    • C
      xprtrdma: Move send_wr to struct rpcrdma_req · 90aab602
      Chuck Lever 提交于
      Clean up: Most of the fields in each send_wr do not vary. There is
      no need to initialize them before each ib_post_send(). This removes
      a large-ish data structure from the stack.
      Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
      90aab602
    • C
      xprtrdma: Simplify rpcrdma_ep_post_recv() · b157380a
      Chuck Lever 提交于
      Clean up.
      
      Since commit fc664485 ("xprtrdma: Split the completion queue"),
      rpcrdma_ep_post_recv() no longer uses the "ep" argument.
      Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
      b157380a
    • C
      xprtrdma: Eliminate "ia" argument in rpcrdma_{alloc, free}_regbuf · 13650c23
      Chuck Lever 提交于
      Clean up. The "ia" argument is no longer used.
      Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
      13650c23
    • C
      xprtrdma: Delay DMA mapping Send and Receive buffers · 54cbd6b0
      Chuck Lever 提交于
      Currently, each regbuf is allocated and DMA mapped at the same time.
      This is done during transport creation.
      
      When a device driver is unloaded, every DMA-mapped buffer in use by
      a transport has to be unmapped, and then remapped to the new
      device if the driver is loaded again. Remapping will have to be done
      _after_ the connect worker has set up the new device.
      
      But there's an ordering problem:
      
      call_allocate, which invokes xprt_rdma_allocate which calls
      rpcrdma_alloc_regbuf to allocate Send buffers, happens _before_
      the connect worker can run to set up the new device.
      
      Instead, at transport creation, allocate each buffer, but leave it
      unmapped. Once the RPC carries these buffers into ->send_request, by
      which time a transport connection should have been established,
      check to see that the RPC's buffers have been DMA mapped. If not,
      map them there.
      
      When device driver unplug support is added, it will simply unmap all
      the transport's regbufs, but it doesn't have to deallocate the
      underlying memory.
      Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
      54cbd6b0
    • C
      xprtrdma: Replace DMA_BIDIRECTIONAL · 99ef4db3
      Chuck Lever 提交于
      The use of DMA_BIDIRECTIONAL is discouraged by DMA-API.txt.
      Fortunately, xprtrdma now knows which direction I/O is going as
      soon as it allocates each regbuf.
      
      The RPC Call and Reply buffers are no longer the same regbuf. They
      can each be labeled correctly now. The RPC Reply buffer is never
      part of either a Send or Receive WR, but it can be part of Reply
      chunk, which is mapped and registered via ->ro_map . So it is not
      DMA mapped when it is allocated (DMA_NONE), to avoid a double-
      mapping.
      
      Since Receive buffers are no longer DMA_BIDIRECTIONAL and their
      contents are never modified by the host CPU, DMA-API-HOWTO.txt
      suggests that a DMA sync before posting each buffer should be
      unnecessary. (See my_card_interrupt_handler).
      Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
      99ef4db3
    • C
      xprtrdma: Initialize separate RPC call and reply buffers · 9c40c49f
      Chuck Lever 提交于
      RPC-over-RDMA needs to separate its RPC call and reply buffers.
      
       o When an RPC Call is sent, rq_snd_buf is DMA mapped for an RDMA
         Send operation using DMA_TO_DEVICE
      
       o If the client expects a large RPC reply, it DMA maps rq_rcv_buf
         as part of a Reply chunk using DMA_FROM_DEVICE
      
      The two mappings are for data movement in opposite directions.
      
      DMA-API.txt suggests that if these mappings share a DMA cacheline,
      bad things can happen. This could occur in the final bytes of
      rq_snd_buf and the first bytes of rq_rcv_buf if the two buffers
      happen to share a DMA cacheline.
      
      On x86_64 the cacheline size is typically 8 bytes, and RPC call
      messages are usually much smaller than the send buffer, so this
      hasn't been a noticeable problem. But the DMA cacheline size can be
      larger on other platforms.
      
      Also, often rq_rcv_buf starts most of the way into a page, thus
      an additional RDMA segment is needed to map and register the end of
      that buffer. Try to avoid that scenario to reduce the cost of
      registering and invalidating Reply chunks.
      
      Instead of carrying a single regbuf that covers both rq_snd_buf and
      rq_rcv_buf, each struct rpcrdma_req now carries one regbuf for
      rq_snd_buf and one regbuf for rq_rcv_buf.
      
      Some incidental changes worth noting:
      
      - To clear out some spaghetti, refactor xprt_rdma_allocate.
      - The value stored in rg_size is the same as the value stored in
        the iov.length field, so eliminate rg_size
      Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
      9c40c49f
    • C
      SUNRPC: Add a transport-specific private field in rpc_rqst · 5a6d1db4
      Chuck Lever 提交于
      Currently there's a hidden and indirect mechanism for finding the
      rpcrdma_req that goes with an rpc_rqst. It depends on getting from
      the rq_buffer pointer in struct rpc_rqst to the struct
      rpcrdma_regbuf that controls that buffer, and then to the struct
      rpcrdma_req it goes with.
      
      This was done back in the day to avoid the need to add a per-rqst
      pointer or to alter the buf_free API when support for RPC-over-RDMA
      was introduced.
      
      I'm about to change the way regbuf's work to support larger inline
      thresholds. Now is a good time to replace this indirect mechanism
      with something that is more straightforward. I guess this should be
      considered a clean up.
      Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
      5a6d1db4
  4. 07 9月, 2016 2 次提交
    • C
      xprtrdma: Fix receive buffer accounting · 05c97466
      Chuck Lever 提交于
      An RPC can terminate before its reply arrives, if a credential
      problem or a soft timeout occurs. After this happens, xprtrdma
      reports it is out of Receive buffers.
      
      A Receive buffer is posted before each RPC is sent, and returned to
      the buffer pool when a reply is received. If no reply is received
      for an RPC, that Receive buffer remains posted. But xprtrdma tries
      to post another when the next RPC is sent.
      
      If this happens a few dozen times, there are no receive buffers left
      to be posted at send time. I don't see a way for a transport
      connection to recover at that point, and it will spit warnings and
      unnecessarily delay RPCs on occasion for its remaining lifetime.
      
      Commit 1e465fd4 ("xprtrdma: Replace send and receive arrays")
      removed a little bit of logic to detect this case and not provide
      a Receive buffer so no more buffers are posted, and then transport
      operation continues correctly. We didn't understand what that logic
      did, and it wasn't commented, so it was removed as part of the
      overhaul to support backchannel requests.
      
      Restore it, but be wary of the need to keep extra Receives posted
      to deal with backchannel requests.
      
      Fixes: 1e465fd4 ("xprtrdma: Replace send and receive arrays")
      Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
      Reviewed-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
      Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>
      05c97466
    • C
      xprtrdma: Revert 3d4cf35b ("xprtrdma: Reply buffer exhaustion...") · 78d506e1
      Chuck Lever 提交于
      Receive buffer exhaustion, if it were to actually occur, would be
      catastrophic. However, when there are no reply buffers to post, that
      means all of them have already been posted and are waiting for
      incoming replies. By design, there can never be more RPCs in flight
      than there are available receive buffers.
      
      A receive buffer can be left posted after an RPC exits without a
      received reply; say, due to a credential problem or a soft timeout.
      This does not result in fewer posted receive buffers than there are
      pending RPCs, and there is already logic in xprtrdma to deal
      appropriately with this case.
      
      It also looks like the "+ 2" that was removed was accidentally
      accommodating the number of extra receive buffers needed for
      receiving backchannel requests. That will need to be addressed by
      another patch.
      
      Fixes: 3d4cf35b ("xprtrdma: Reply buffer exhaustion can be...")
      Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
      Reviewed-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
      Signed-off-by: NTrond Myklebust <trond.myklebust@primarydata.com>
      78d506e1
  5. 20 7月, 2016 1 次提交
  6. 12 7月, 2016 8 次提交
  7. 18 5月, 2016 4 次提交
    • C
      xprtrdma: Remove qplock · 6e14a92c
      Chuck Lever 提交于
      Clean up.
      
      After "xprtrdma: Remove ro_unmap() from all registration modes",
      there are no longer any sites that take rpcrdma_ia::qplock for read.
      The one site that takes it for write is always single-threaded. It
      is safe to remove it.
      Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
      6e14a92c
    • C
      xprtrdma: Faster server reboot recovery · b2dde94b
      Chuck Lever 提交于
      In a cluster failover scenario, it is desirable for the client to
      attempt to reconnect quickly, as an alternate NFS server is already
      waiting to take over for the down server. The client can't see that
      a server IP address has moved to a new server until the existing
      connection is gone.
      
      For fabrics and devices where it is meaningful, set a definite upper
      bound on the amount of time before it is determined that a
      connection is no longer valid. This allows the RPC client to detect
      connection loss in a timely matter, then perform a fresh resolution
      of the server GUID in case it has changed (cluster failover).
      Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
      Tested-by: NSteve Wise <swise@opengridcomputing.com>
      Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
      b2dde94b
    • C
      xprtrdma: Use core ib_drain_qp() API · 550d7502
      Chuck Lever 提交于
      Clean up: Replace rpcrdma_flush_cqs() and rpcrdma_clean_cqs() with
      the new ib_drain_qp() API.
      Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
      Reviewed-By: NLeon Romanovsky <leonro@mellanox.com>
      Tested-by: NSteve Wise <swise@opengridcomputing.com>
      Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
      550d7502
    • C
      xprtrdma: Limit number of RDMA segments in RPC-over-RDMA headers · 94931746
      Chuck Lever 提交于
      Send buffer space is shared between the RPC-over-RDMA header and
      an RPC message. A large RPC-over-RDMA header means less space is
      available for the associated RPC message, which then has to be
      moved via an RDMA Read or Write.
      
      As more segments are added to the chunk lists, the header increases
      in size.  Typical modern hardware needs only a few segments to
      convey the maximum payload size, but some devices and registration
      modes may need a lot of segments to convey data payload. Sometimes
      so many are needed that the remaining space in the Send buffer is
      not enough for the RPC message. Sending such a message usually
      fails.
      
      To ensure a transport can always make forward progress, cap the
      number of RDMA segments that are allowed in chunk lists. This
      prevents less-capable devices and memory registrations from
      consuming a large portion of the Send buffer by reducing the
      maximum data payload that can be conveyed with such devices.
      
      For now I choose an arbitrary maximum of 8 RDMA segments. This
      allows a maximum size RPC-over-RDMA header to fit nicely in the
      current 1024 byte inline threshold with over 700 bytes remaining
      for an inline RPC message.
      
      The current maximum data payload of NFS READ or WRITE requests is
      one megabyte. To convey that payload on a client with 4KB pages,
      each chunk segment would need to handle 32 or more data pages. This
      is well within the capabilities of FMR. For physical registration,
      the maximum payload size on platforms with 4KB pages is reduced to
      32KB.
      
      For FRWR, a device's maximum page list depth would need to be at
      least 34 to support the maximum 1MB payload. A device with a smaller
      maximum page list depth means the maximum data payload is reduced
      when using that device.
      Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
      Tested-by: NSteve Wise <swise@opengridcomputing.com>
      Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
      94931746
  8. 15 3月, 2016 3 次提交
    • C
      xprtrdma: Use new CQ API for RPC-over-RDMA client send CQs · 2fa8f88d
      Chuck Lever 提交于
      Calling ib_poll_cq() to sort through WCs during a completion is a
      common pattern amongst RDMA consumers. Since commit 14d3a3b2
      ("IB: add a proper completion queue abstraction"), WC sorting can
      be handled by the IB core.
      
      By converting to this new API, xprtrdma is made a better neighbor to
      other RDMA consumers, as it allows the core to schedule the delivery
      of completions more fairly amongst all active consumers.
      
      Because each ib_cqe carries a pointer to a completion method, the
      core can now post its own operations on a consumer's QP, and handle
      the completions itself, without changes to the consumer.
      
      Send completions were previously handled entirely in the completion
      upcall handler (ie, deferring to a process context is unneeded).
      Thus IB_POLL_SOFTIRQ is a direct replacement for the current
      xprtrdma send code path.
      Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
      Reviewed-by: NDevesh Sharma <devesh.sharma@broadcom.com>
      Reviewed-by: NSagi Grimberg <sagig@mellanox.com>
      Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
      2fa8f88d
    • C
      xprtrdma: Use new CQ API for RPC-over-RDMA client receive CQs · 552bf225
      Chuck Lever 提交于
      Calling ib_poll_cq() to sort through WCs during a completion is a
      common pattern amongst RDMA consumers. Since commit 14d3a3b2
      ("IB: add a proper completion queue abstraction"), WC sorting can
      be handled by the IB core.
      
      By converting to this new API, xprtrdma is made a better neighbor to
      other RDMA consumers, as it allows the core to schedule the delivery
      of completions more fairly amongst all active consumers.
      
      Because each ib_cqe carries a pointer to a completion method, the
      core can now post its own operations on a consumer's QP, and handle
      the completions itself, without changes to the consumer.
      
      xprtrdma's reply processing is already handled in a work queue, but
      there is some initial order-dependent processing that is done in the
      soft IRQ context before a work item is scheduled.
      
      IB_POLL_SOFTIRQ is a direct replacement for the current xprtrdma
      receive code path.
      Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
      Reviewed-by: NDevesh Sharma <devesh.sharma@broadcom.com>
      Reviewed-by: NSagi Grimberg <sagig@mellanox.com>
      Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
      552bf225
    • C
      xprtrdma: Serialize credit accounting again · 23826c7a
      Chuck Lever 提交于
      Commit fe97b47c ("xprtrdma: Use workqueue to process RPC/RDMA
      replies") replaced the reply tasklet with a workqueue that allows
      RPC replies to be processed in parallel. Thus the credit values in
      RPC-over-RDMA replies can be applied in a different order than in
      which the server sent them.
      
      To fix this, revert commit eba8ff66 ("xprtrdma: Move credit
      update to RPC reply handler"). Reverting is done by hand to
      accommodate code changes that have occurred since then.
      
      Fixes: fe97b47c ("xprtrdma: Use workqueue to process . . .")
      Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
      Reviewed-by: NSagi Grimberg <sagig@mellanox.com>
      Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
      23826c7a
  9. 23 12月, 2015 1 次提交
  10. 19 12月, 2015 3 次提交