1. 30 November 2016, 2 commits
  2. 20 September 2016, 8 commits
    • xprtrdma: Eliminate rpcrdma_receive_worker() · 496b77a5
      Committed by Chuck Lever
      Clean up: the extra layer of indirection doesn't add value.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Use gathered Send for large inline messages · 655fec69
      Committed by Chuck Lever
      An RPC Call message that is sent inline but that has a data payload
      (i.e., one or more items in rq_snd_buf's page list) must be "pulled
      up":
      
      - call_allocate has to reserve enough RPC Call buffer space to
      accommodate the data payload
      
      - call_transmit has to memcpy rq_snd_buf's page list and tail
      into its head iovec before it is sent
      
      As the inline threshold is increased beyond its current 1KB default,
      however, this means data payloads of more than a few KB are copied
      by the host CPU. For example, if the inline threshold is increased
      just to 4KB, then NFS WRITE requests up to 4KB would involve a
      memcpy of the NFS WRITE's payload data into the RPC Call buffer.
      This is an undesirable amount of participation by the host CPU.
      
      The inline threshold may be much larger than 4KB in the future,
      after negotiation with a peer server.
      
      Instead of copying the components of rq_snd_buf into its head iovec,
      construct a gather list of these components, and send them all in
      place. The same approach is already used in the Linux server's
      RPC-over-RDMA reply path.
      
      This mechanism also eliminates the need for rpcrdma_tail_pullup,
      which is used to manage the XDR pad and trailing inline content when
      a Read list is present.
      
      This requires that the pages in rq_snd_buf's page list be DMA-mapped
      during marshaling, and unmapped when a data-bearing RPC is
      completed. This is slightly less efficient for very small I/O
      payloads, but significantly more efficient as data payload size and
      inline threshold increase past a kilobyte.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
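      A minimal sketch of the gathered-Send idea, assuming the head
      iovec is already DMA-mapped and the tail iovec (handled
      analogously) is omitted; the function and constant names here
      are illustrative, not xprtrdma's own:

        #include <rdma/ib_verbs.h>
        #include <linux/sunrpc/xdr.h>

        #define MAX_SGES 16    /* illustrative cap on gather entries */

        static int post_gathered_send(struct ib_qp *qp, struct ib_device *dev,
                                      u32 lkey, struct xdr_buf *xdr,
                                      u64 head_addr)
        {
                struct ib_sge sge[MAX_SGES];
                struct ib_send_wr wr = { }, *bad_wr;
                struct page **ppages = xdr->pages + (xdr->page_base >> PAGE_SHIFT);
                unsigned int off = xdr->page_base & ~PAGE_MASK;
                unsigned int remaining = xdr->page_len;
                unsigned int n = 0;

                /* Head iovec: already mapped, sent in place */
                sge[n].addr = head_addr;
                sge[n].length = xdr->head[0].iov_len;
                sge[n++].lkey = lkey;

                /* Page list: DMA-map each page in place; no pull-up */
                while (remaining && n < MAX_SGES) {
                        unsigned int len = min_t(unsigned int, remaining,
                                                 PAGE_SIZE - off);

                        sge[n].addr = ib_dma_map_page(dev, *ppages++, off,
                                                      len, DMA_TO_DEVICE);
                        if (ib_dma_mapping_error(dev, sge[n].addr))
                                return -EIO;
                        sge[n].length = len;
                        sge[n++].lkey = lkey;
                        remaining -= len;
                        off = 0;
                }
                if (remaining)
                        return -EIO;    /* payload too large for this sketch */

                wr.opcode = IB_WR_SEND;
                wr.send_flags = IB_SEND_SIGNALED;
                wr.sg_list = sge;
                wr.num_sge = n;
                return ib_post_send(qp, &wr, &bad_wr);
        }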
    • xprtrdma: Basic support for Remote Invalidation · c8b920bb
      Committed by Chuck Lever
      Have frwr's ro_unmap_sync recognize an invalidated rkey that appears
      as part of a Receive completion. Local invalidation can be skipped
      for that rkey.
      
      Use an out-of-band signaling mechanism to indicate to the server
      that the client is prepared to receive RDMA Send With Invalidate.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
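      A condensed sketch of the Receive-completion side (the
      rr_inv_rkey field follows the upstream patch; the handler body
      is abbreviated):

        static void receive_done(struct ib_cq *cq, struct ib_wc *wc)
        {
                struct rpcrdma_rep *rep =
                        container_of(wc->wr_cqe, struct rpcrdma_rep, rr_cqe);

                /* Record a remotely invalidated rkey so that
                 * ro_unmap_sync can skip the LOCAL_INV for it */
                if (wc->wc_flags & IB_WC_WITH_INVALIDATE)
                        rep->rr_inv_rkey = wc->ex.invalidate_rkey;
                else
                        rep->rr_inv_rkey = 0;

                /* normal reply processing continues here */
        }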
    • xprtrdma: Client-side support for rpcrdma_connect_private · 87cfb9a0
      Committed by Chuck Lever
      Send an RDMA-CM private message on connect, and look for one during
      a connection-established event.
      
      Both sides can communicate their various implementation limits.
      Implementations that don't support this sideband protocol ignore it.
      
      Once the client knows the server's inline threshold maxima, it can
      adjust its use of Reply chunks and eliminate most use of Position
      Zero Read chunks. Moderately sized I/O can be done using a pure
      inline RDMA Send instead of RDMA operations that require memory
      registration.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
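      A sketch of attaching the private message at connect time. The
      struct layout shown is illustrative of the approach; the
      authoritative format is defined in the kernel's RPC-over-RDMA
      headers:

        #include <rdma/rdma_cm.h>

        struct cm_private_msg {          /* illustrative layout */
                __be32  magic;           /* identifies this sideband protocol */
                u8      version;
                u8      flags;           /* e.g., "Send With Invalidate OK" */
                u8      send_size;       /* encoded inline send maximum */
                u8      recv_size;       /* encoded inline receive maximum */
        };

        static int connect_with_private(struct rdma_cm_id *id,
                                        struct cm_private_msg *pmsg)
        {
                struct rdma_conn_param param = { };

                param.private_data = pmsg;
                param.private_data_len = sizeof(*pmsg);
                return rdma_connect(id, &param);
        }

      A peer that does not recognize the magic value simply ignores
      the private data, which is what makes the protocol safely
      optional.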
    • xprtrdma: Move send_wr to struct rpcrdma_req · 90aab602
      Committed by Chuck Lever
      Clean up: Most of the fields in each send_wr do not vary. There is
      no need to initialize them before each ib_post_send(). This removes
      a large-ish data structure from the stack.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
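      A condensed sketch of the one-time initialization this enables
      (field names follow the patch; allocation code is omitted, and
      nsegs/signaled stand in for the per-call values):

        /* At rpcrdma_req creation: invariant Send WR fields set once */
        req->rl_send_wr.next = NULL;
        req->rl_send_wr.wr_cqe = &req->rl_cqe;
        req->rl_send_wr.sg_list = req->rl_send_iov;
        req->rl_send_wr.opcode = IB_WR_SEND;

        /* Before each ib_post_send(): only the varying fields change */
        req->rl_send_wr.num_sge = nsegs;
        req->rl_send_wr.send_flags = signaled ? IB_SEND_SIGNALED : 0;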
    • xprtrdma: Simplify rpcrdma_ep_post_recv() · b157380a
      Committed by Chuck Lever
      Clean up.
      
      Since commit fc664485 ("xprtrdma: Split the completion queue"),
      rpcrdma_ep_post_recv() no longer uses the "ep" argument.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Delay DMA mapping Send and Receive buffers · 54cbd6b0
      Committed by Chuck Lever
      Currently, each regbuf is allocated and DMA mapped at the same time.
      This is done during transport creation.
      
      When a device driver is unloaded, every DMA-mapped buffer in use by
      a transport has to be unmapped, and then remapped to the new
      device if the driver is loaded again. Remapping will have to be done
      _after_ the connect worker has set up the new device.
      
      But there's an ordering problem:
      
      call_allocate, which invokes xprt_rdma_allocate, which in turn calls
      rpcrdma_alloc_regbuf to allocate Send buffers, happens _before_
      the connect worker can run to set up the new device.
      
      Instead, at transport creation, allocate each buffer but leave it
      unmapped. By the time an RPC carries these buffers into
      ->send_request, a transport connection should have been
      established; at that point, check whether the RPC's buffers have
      been DMA mapped, and if not, map them there.
      
      When device driver unplug support is added, it will simply unmap all
      the transport's regbufs, but it won't have to deallocate the
      underlying memory.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
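      A minimal sketch of the lazy-mapping check in the send path;
      the struct and helper names are illustrative, not xprtrdma's:

        #include <rdma/ib_verbs.h>

        struct lazy_regbuf {
                struct ib_device *rb_device;  /* non-NULL once mapped */
                struct ib_sge    rb_iov;
                void             *rb_data;
                unsigned int     rb_len;
        };

        static bool regbuf_dma_map(struct ib_device *dev,
                                   struct lazy_regbuf *rb)
        {
                if (rb->rb_device)            /* already mapped */
                        return true;

                rb->rb_iov.addr = ib_dma_map_single(dev, rb->rb_data,
                                                    rb->rb_len,
                                                    DMA_BIDIRECTIONAL);
                if (ib_dma_mapping_error(dev, rb->rb_iov.addr))
                        return false;

                rb->rb_iov.length = rb->rb_len;
                rb->rb_device = dev;          /* marks the buffer mapped */
                return true;
        }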
    • xprtrdma: Eliminate INLINE_THRESHOLD macros · eb342e9a
      Committed by Chuck Lever
      Clean up: r_xprt is already available everywhere these macros are
      invoked, so just dereference that directly.
      
      RPCRDMA_INLINE_PAD_VALUE is no longer used, so it can simply be
      removed.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
  3. 12 July 2016, 9 commits
  4. 18 May 2016, 6 commits
    • xprtrdma: Add ro_unmap_safe memreg method · ead3f26e
      Committed by Chuck Lever
      There needs to be a safe method of releasing registered memory
      resources when an RPC terminates. Safe can mean a number of things:
      
      + Doesn't have to sleep
      
      + Doesn't rely on having a QP in RTS
      
      ro_unmap_safe will be that safe method. It can be used in cases
      where synchronous memory invalidation could deadlock or would
      require an active QP.
      
      The important case is fencing an RPC's memory regions after it is
      signaled (^C) and before it exits. If this is not done, there is a
      window where the server can write an RPC reply into memory that the
      client has released and re-used for some other purpose.
      
      Note that this is a full solution for FRWR, but FMR and physical
      still have some gaps where a particularly bad server can wreak
      some havoc on the client. These gaps are not made worse by this
      patch and are expected to be exceptionally rare and timing-based.
      They are noted in documenting comments.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Tested-by: Steve Wise <swise@opengridcomputing.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
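      A condensed view of where the new method sits, abridged from
      the upstream rpcrdma_memreg_ops table (other methods omitted):

        struct rpcrdma_memreg_ops {
                int  (*ro_map)(struct rpcrdma_xprt *,
                               struct rpcrdma_mr_seg *, int, bool);
                void (*ro_unmap_sync)(struct rpcrdma_xprt *,
                                      struct rpcrdma_req *);
                /* may be called where sleeping is forbidden or the
                 * QP is no longer in RTS */
                void (*ro_unmap_safe)(struct rpcrdma_xprt *,
                                      struct rpcrdma_req *, bool);
        };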
    • xprtrdma: Remove rpcrdma_create_chunks() · 3c19409b
      Committed by Chuck Lever
      rpcrdma_create_chunks() has been replaced, and can be removed.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Tested-by: Steve Wise <swise@opengridcomputing.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Allow Read list and Reply chunk simultaneously · 94f58c58
      Committed by Chuck Lever
      rpcrdma_marshal_req() makes a simplifying assumption: that NFS
      operations with large Call messages have small Reply messages, and
      vice versa. Therefore, with RPC-over-RDMA, only one chunk type is
      ever needed for each Call/Reply pair: if one direction needs
      chunks, the other will always fit inline.
      
      In fact, this assumption is asserted in the code:
      
        if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) {
                dprintk("RPC:       %s: cannot marshal multiple chunk lists\n",
                        __func__);
                return -EIO;
        }
      
      But RPCSEC_GSS breaks this assumption. Because krb5i and krb5p
      perform data transformation on RPC messages before they are
      transmitted, direct data placement techniques cannot be used, and
      RPC messages must be sent via a Long call in both directions.
      All such calls are sent with a Position Zero Read chunk, and all
      such replies are handled with a Reply chunk. Thus the client must
      provide every Call/Reply pair with both a Read list and a Reply
      chunk.
      
      Without any special security in effect, NFSv4 WRITEs may now also
      use the Read list and provide a Reply chunk. The marshal_req
      logic was preventing that, meaning an NFSv4 WRITE with a large
      payload whose reply included a GETATTR result larger than the
      inline threshold would fail.
      
      The code that encodes each chunk list is now completely contained in
      its own function. There is some code duplication, but the trade-off
      is that the overall logic should be more clear.
      
      Note that all three chunk lists now share the rl_segments array.
      Some additional per-req accounting is necessary to track this
      usage. For the same reasons that the above simplifying assumption
      has held true for so long, I don't expect more array elements will
      be needed at this time.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Tested-by: Steve Wise <swise@opengridcomputing.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
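      A self-contained sketch of the decision this patch enables: the
      two directions are classified independently, so a single
      Call/Reply pair may carry both chunk lists. The enum and
      function names here are illustrative:

        enum chunktype { noch, readch, areadch, writech, replych };

        static void choose_chunk_types(bool call_fits_inline,
                                       bool reply_fits_inline,
                                       bool ddp_allowed,
                                       enum chunktype *rtype,
                                       enum chunktype *wtype)
        {
                /* Call direction: inline, a Read list, or a Position
                 * Zero Read chunk (e.g., for krb5i/krb5p) */
                if (call_fits_inline)
                        *rtype = noch;
                else
                        *rtype = ddp_allowed ? readch : areadch;

                /* Reply direction is decided independently; readch
                 * plus replych is now a legal combination */
                if (reply_fits_inline)
                        *wtype = noch;
                else
                        *wtype = ddp_allowed ? writech : replych;
        }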
    • xprtrdma: Update comments in rpcrdma_marshal_req() · 88b18a12
      Committed by Chuck Lever
      Update documenting comments to reflect code changes over the past
      year.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Tested-by: Steve Wise <swise@opengridcomputing.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Avoid using Write list for small NFS READ requests · cce6deeb
      Committed by Chuck Lever
      Avoid the latency and interrupt overhead of registering a Write
      chunk when handling NFS READ requests of a few hundred bytes or
      less.
      
      This change does not interoperate with Linux NFS/RDMA servers
      that do not have commit 9d11b51c ('svcrdma: Fix send_reply()
      scatter/gather set-up'). Commit 9d11b51c was introduced in v4.3,
      and is included in 4.2.y, 4.1.y, and 3.18.y.
      
      Oracle bug 22925946 has been filed to request that the above fix
      be included in the Oracle Linux UEK4 NFS/RDMA server.
      
      Red Hat Bugzilla entries 1327280 and 1327554 have been filed to
      request that RHEL NFS/RDMA server backports include the above fix.
      
      Workaround: Replace the "proto=rdma,port=20049" mount options
      with "proto=tcp" until commit 9d11b51c is applied to your
      NFS server.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Tested-by: Steve Wise <swise@opengridcomputing.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Prevent inline overflow · 302d3deb
      Committed by Chuck Lever
      When deciding whether to send a Call inline, rpcrdma_marshal_req
      doesn't take into account header bytes consumed by chunk lists.
      This results in Call messages on the wire that are sometimes larger
      than the inline threshold.
      
      Likewise, when a Write list or Reply chunk is in play, the server's
      reply has to emit an RDMA Send that includes a larger-than-minimal
      RPC-over-RDMA header.
      
      The actual size of a Call message cannot be estimated until after
      the chunk lists have been registered. Thus the size of each
      RPC-over-RDMA header can be estimated only after chunks are
      registered; but the decision to register chunks is based on the size
      of that header. Chicken, meet egg.
      
      The best a client can do is estimate header size based on the
      largest header that might occur, and then ensure that inline content
      is always smaller than that.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Tested-by: Steve Wise <swise@opengridcomputing.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
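      A sketch of the resulting worst-case estimate. The word counts
      follow the RPC-over-RDMA XDR layout (a chunk-less header is 7
      XDR words; each Read-list entry adds 6); the function name is
      illustrative:

        #include <linux/types.h>

        enum {
                fixed_hdr_words = 7,  /* xid, vers, credits, proc, and
                                       * three empty chunk-list
                                       * discriminators */
                read_seg_words  = 6,  /* item discriminator, position,
                                       * handle, length, 2-word offset */
        };

        /* Largest payload that may go inline once the worst-case
         * header (a maximal Read list) is accounted for */
        static unsigned int max_inline_payload(unsigned int threshold,
                                               unsigned int max_segs)
        {
                unsigned int hdr = (fixed_hdr_words +
                                    max_segs * read_seg_words) *
                                   sizeof(__be32);

                return threshold > hdr ? threshold - hdr : 0;
        }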
  5. 15 March 2016, 4 commits
  6. 19 December 2015, 1 commit
    • xprtrdma: Invalidate in the RPC reply handler · 68791649
      Committed by Chuck Lever
      There is a window between the time the RPC reply handler wakes the
      waiting RPC task and when xprt_release() invokes ops->buf_free.
      During this time, memory regions containing the data payload may
      still be accessed by a broken or malicious server, even though the
      RPC application has already been allowed access to the memory
      containing the RPC request's data payloads.
      
      The server should be fenced from client memory containing RPC data
      payloads _before_ the RPC application is allowed to continue.
      
      This change also more strongly enforces send queue accounting. There
      is a maximum number of RPC calls allowed to be outstanding. When an
      RPC/RDMA transport is set up, just enough send queue resources are
      allocated to handle registration, Send, and invalidation WRs for
      each of those RPCs at the same time.
      
      Before, additional RPC calls could be dispatched while invalidation
      WRs were still consuming send WQEs. When invalidation WRs backed
      up, dispatching additional RPCs resulted in a send queue overrun.
      
      Now, the reply handler prevents further RPC dispatch until
      invalidation is complete, which guarantees there are enough send
      queue resources to proceed.
      
      Still to do: If an RPC exits early (say, ^C), the reply handler has
      no opportunity to perform invalidation. Currently, xprt_rdma_free()
      still frees remaining RDMA resources, which could deadlock.
      Additional changes are needed to handle invalidation properly in this
      case.
      Reported-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
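      A condensed sketch of the ordering the reply handler now
      enforces, abridged from the era's rpcrdma_reply_handler() with
      error handling omitted:

        static void reply_handler(struct rpcrdma_xprt *r_xprt,
                                  struct rpcrdma_req *req,
                                  struct rpc_rqst *rqst, int status)
        {
                struct rpc_xprt *xprt = &r_xprt->rx_xprt;

                /* Fence the server from the RPC's memory regions
                 * before the RPC application can reuse them */
                if (req->rl_nchunks)
                        r_xprt->rx_ia.ri_ops->ro_unmap_sync(r_xprt, req);

                spin_lock_bh(&xprt->transport_lock);
                xprt_complete_rqst(rqst->rq_task, status);
                spin_unlock_bh(&xprt->transport_lock);
        }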
  7. 03 November 2015, 4 commits
  8. 06 August 2015, 6 commits