1. 12 Jul 2016, 5 commits
  2. 18 May 2016, 6 commits
    • xprtrdma: Add ro_unmap_safe memreg method · ead3f26e
      Committed by Chuck Lever
      There needs to be a safe method of releasing registered memory
      resources when an RPC terminates. Safe can mean a number of things:
      
      + Doesn't have to sleep
      
      + Doesn't rely on having a QP in RTS
      
      ro_unmap_safe will be that safe method. It can be used in cases
      where synchronous memory invalidation can deadlock or requires
      an active QP.
      
      The important case is fencing an RPC's memory regions after it is
      signaled (^C) and before it exits. If this is not done, there is a
      window where the server can write an RPC reply into memory that the
      client has released and re-used for some other purpose.
      
      Note that this is a full solution for FRWR, but FMR and physical
      still have some gaps where a particularly bad server can wreak
      some havoc on the client. These gaps are not made worse by this
      patch and are expected to be exceptionally rare and timing-based.
      They are noted in documenting comments.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Tested-by: Steve Wise <swise@opengridcomputing.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Remove rpcrdma_create_chunks() · 3c19409b
      Committed by Chuck Lever
      rpcrdma_create_chunks() has been replaced, and can be removed.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Tested-by: Steve Wise <swise@opengridcomputing.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Allow Read list and Reply chunk simultaneously · 94f58c58
      Committed by Chuck Lever
      rpcrdma_marshal_req() makes a simplifying assumption: that NFS
      operations with large Call messages have small Reply messages, and
      vice versa. Therefore with RPC-over-RDMA, only one chunk type is
      ever needed for each Call/Reply pair, because one direction needs
      chunks, the other direction will always fit inline.
      
      In fact, this assumption is asserted in the code:
      
        if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) {
                dprintk("RPC:       %s: cannot marshal multiple chunk lists\n",
                        __func__);
                return -EIO;
        }
      
      But RPCGSS_SEC breaks this assumption. Because krb5i and krb5p
      perform data transformation on RPC messages before they are
      transmitted, direct data placement techniques cannot be used, thus
      RPC messages must be sent via a Long call in both directions.
      All such calls are sent with a Position Zero Read chunk, and all
      such replies are handled with a Reply chunk. Thus the client must
      provide every Call/Reply pair with both a Read list and a Reply
      chunk.
      
      Without any special security in effect, NFSv4 WRITEs may now also
      use the Read list and provide a Reply chunk. The marshal_req
      logic was preventing that, meaning an NFSv4 WRITE with a large
      payload that included a GETATTR result larger than the inline
      threshold would fail.
      
      The code that encodes each chunk list is now completely contained in
      its own function. There is some code duplication, but the trade-off
      is that the overall logic should be more clear.
      
      Note that all three chunk lists now share the rl_segments array.
      Some additional per-req accounting is necessary to track this
      usage. For the same reasons that the above simplifying assumption
      has held true for so long, I don't expect more array elements are
      needed at this time.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Tested-by: Steve Wise <swise@opengridcomputing.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Update comments in rpcrdma_marshal_req() · 88b18a12
      Committed by Chuck Lever
      Update documenting comments to reflect code changes over the past
      year.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Tested-by: Steve Wise <swise@opengridcomputing.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Avoid using Write list for small NFS READ requests · cce6deeb
      Committed by Chuck Lever
      Avoid the latency and interrupt overhead of registering a Write
      chunk when handling NFS READ requests of a few hundred bytes or
      less.
      
      This change does not interoperate with Linux NFS/RDMA servers
      that do not have commit 9d11b51c ('svcrdma: Fix send_reply()
      scatter/gather set-up'). Commit 9d11b51c was introduced in v4.3,
      and is included in 4.2.y, 4.1.y, and 3.18.y.
      
      Oracle bug 22925946 has been filed to request that the above fix
      be included in the Oracle Linux UEK4 NFS/RDMA server.
      
      Red Hat bugzillas 1327280 and 1327554 have been filed to request
      that RHEL NFS/RDMA server backports include the above fix.
      
      Workaround: Replace the "proto=rdma,port=20049" mount options
      with "proto=tcp" until commit 9d11b51c is applied to your
      NFS server.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Tested-by: Steve Wise <swise@opengridcomputing.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Prevent inline overflow · 302d3deb
      Committed by Chuck Lever
      When deciding whether to send a Call inline, rpcrdma_marshal_req
      doesn't take into account header bytes consumed by chunk lists.
      This results in Call messages on the wire that are sometimes larger
      than the inline threshold.
      
      Likewise, when a Write list or Reply chunk is in play, the server's
      reply has to emit an RDMA Send that includes a larger-than-minimal
      RPC-over-RDMA header.
      
      The actual size of a Call message cannot be estimated until after
      the chunk lists have been registered. Thus the size of each
      RPC-over-RDMA header can be estimated only after chunks are
      registered; but the decision to register chunks is based on the size
      of that header. Chicken, meet egg.
      
      The best a client can do is estimate header size based on the
      largest header that might occur, and then ensure that inline content
      is always smaller than that.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Tested-by: Steve Wise <swise@opengridcomputing.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
  3. 15 Mar 2016, 4 commits
  4. 19 Dec 2015, 1 commit
    • xprtrdma: Invalidate in the RPC reply handler · 68791649
      Committed by Chuck Lever
      There is a window between the time the RPC reply handler wakes the
      waiting RPC task and when xprt_release() invokes ops->buf_free.
      During this time, memory regions containing the data payload may
      still be accessed by a broken or malicious server, but the RPC
      application has already been allowed access to the memory containing
      the RPC request's data payloads.
      
      The server should be fenced from client memory containing RPC data
      payloads _before_ the RPC application is allowed to continue.
      
      This change also more strongly enforces send queue accounting. There
      is a maximum number of RPC calls allowed to be outstanding. When an
      RPC/RDMA transport is set up, just enough send queue resources are
      allocated to handle registration, Send, and invalidation WRs for
      each of those RPCs at the same time.
      
      Before, additional RPC calls could be dispatched while invalidation
      WRs were still consuming send WQEs. When invalidation WRs backed
      up, dispatching additional RPCs resulted in a send queue overrun.
      
      Now, the reply handler prevents RPC dispatch until invalidation is
      complete. This prevents RPC call dispatch until there are enough
      send queue resources to proceed.
      
      Still to do: If an RPC exits early (say, ^C), the reply handler has
      no opportunity to perform invalidation. Currently, xprt_rdma_free()
      still frees remaining RDMA resources, which could deadlock.
      Additional changes are needed to handle invalidation properly in this
      case.
      Reported-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
  5. 03 Nov 2015, 4 commits
  6. 06 Aug 2015, 7 commits
    • xprtrdma: Count RDMA_NOMSG type calls · 860477d1
      Committed by Chuck Lever
      RDMA_NOMSG type calls are less efficient than RDMA_MSG. Count NOMSG
      calls so administrators can tell if they happen to be used more than
      expected.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Fix large NFS SYMLINK calls · 2fcc213a
      Committed by Chuck Lever
      Repair how rpcrdma_marshal_req() chooses which RDMA message type
      to use for large non-WRITE operations so that it picks RDMA_NOMSG
      in the correct situations, and sets up the marshaling logic to
      SEND only the RPC/RDMA header.
      
      Large NFSv2 SYMLINK requests now use RDMA_NOMSG calls. The Linux NFS
      server XDR decoder for NFSv2 SYMLINK does not handle having the
      pathname argument arrive in a separate buffer. The decoder could be
      fixed, but this is simpler and RDMA_NOMSG can be used in a variety
      of other situations.
      
      Ensure that the Linux client continues to use "RDMA_MSG + read
      list" when sending large NFSv3 SYMLINK requests, which is more
      efficient than using RDMA_NOMSG.
      
      Large NFSv4 CREATE(NF4LNK) requests are changed to use "RDMA_MSG +
      read list" just like NFSv3 (see Section 5 of RFC 5667). Before,
      these did not work at all.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Fix XDR tail buffer marshalling · 677eb17e
      Committed by Chuck Lever
      Currently xprtrdma appends an extra chunk element to the RPC/RDMA
      read chunk list of each NFSv4 WRITE compound. The extra element
      contains the final GETATTR operation in the compound.
      
      The result is an extra RDMA READ operation to transfer a very short
      piece of each NFS WRITE compound (typically 16 bytes). This is
      inefficient.
      
      It is also incorrect.
      
      The client is sending the trailing GETATTR at the same Position as
      the preceding WRITE data payload. Whether or not RFC 5667 allows
      the GETATTR to appear in a read chunk, RFC 5666 requires that these
      two separate RPC arguments appear at two distinct Positions.
      
      It can also be argued that the GETATTR operation is not bulk data,
      and therefore RFC 5667 forbids its appearance in a read chunk at
      all.
      
      Although RFC 5667 is not precise about when using a read list with
      NFSv4 COMPOUND is allowed, the intent is that only data arguments
      not touched by NFS (i.e., read and write payloads) are to be sent
      using RDMA READ or WRITE.
      
      The NFS client constructs GETATTR arguments itself, and therefore is
      required to send the trailing GETATTR operation as additional inline
      content, not as a data payload.
      
      NB: This change is not backwards compatible. Some older servers do
      not accept inline content following the read list. The Linux NFS
      server should handle this content correctly as of commit
      a97c331f ("svcrdma: Handle additional inline content").
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Don't provide a reply chunk when expecting a short reply · 33943b29
      Committed by Chuck Lever
      Currently Linux always offers a reply chunk, even when the reply
      can be sent inline (i.e., is smaller than 1KB).
      
      On the client, registering a memory region can be expensive. A
      server may choose not to use the reply chunk, wasting the cost of
      the registration.
      
      This is a change only for RPC replies smaller than 1KB which the
      server constructs in the RPC reply send buffer. Because the elements
      of the reply must be XDR encoded, a copy-free data transfer has no
      benefit in this case.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
      Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Always provide a write list when sending NFS READ · 02eb57d8
      Committed by Chuck Lever
      The client has been setting up a reply chunk for NFS READs that are
      smaller than the inline threshold. This is not efficient: both the
      server and client CPUs have to copy the reply's data payload into
      and out of the memory region that is then transferred via RDMA.
      
      Using the write list, the data payload is moved by the device and no
      extra data copying is necessary.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Reviewed-by: Devesh Sharma <devesh.sharma@avagotech.com>
      Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
      Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Account for RPC/RDMA header size when deciding to inline · 5457ced0
      Committed by Chuck Lever
      When the size of the RPC message is near the inline threshold (1KB),
      the client would allow messages to be sent that were a few bytes too
      large.
      
      When marshaling RPC/RDMA requests, ensure the combined size of
      RPC/RDMA header and RPC header do not exceed the inline threshold.
      Endpoints typically reject RPC/RDMA messages that exceed the size
      of their receive buffers.
      
      The two server implementations I test with (Linux and Solaris) use
      receive buffers that are larger than the client's inline threshold.
      Thus so far this has been benign, observed only by code inspection.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Reviewed-by: Devesh Sharma <devesh.sharma@avagotech.com>
      Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Remove logic that constructs RDMA_MSGP type calls · b3221d6a
      Committed by Chuck Lever
      RDMA_MSGP type calls insert a zero pad in the middle of the RPC
      message to align the RPC request's data payload to the server's
      alignment preferences. A server can then "page flip" the payload
      into place to avoid a data copy in certain circumstances. However:
      
      1. The client has to have a priori knowledge of the server's
         preferred alignment
      
      2. Requests eligible for RDMA_MSGP are requests that are small
         enough to have been sent inline, and convey a data payload
         at the _end_ of the RPC message
      
      Today, item 1 is handled with a sysctl, a global setting that is
      copied during mount. Linux does not support CCP to query the
      server's preferences (RFC 5666, Section 6).
      
      A small-ish NFSv3 WRITE might use RDMA_MSGP, but no NFSv4
      compound fits bullet 2.
      
      Thus the Linux client currently leaves RDMA_MSGP disabled. The
      Linux server handles RDMA_MSGP, but does not use any special
      page flipping, so it confers no benefit.
      
      Clean up the marshaling code by removing the logic that constructs
      RDMA_MSGP type calls. This also reduces the maximum send iovec size
      from four to just two elements.
      
      /proc/sys/sunrpc/rdma_inline_write_padding is a kernel API, and
      thus is left in place.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
  7. 13 Jun 2015, 3 commits
  8. 31 Mar 2015, 3 commits
  9. 24 Feb 2015, 1 commit
    • xprtrdma: Store RDMA credits in unsigned variables · 9b1dcbc8
      Committed by Chuck Lever
      Dan Carpenter's static checker pointed out:
      
         net/sunrpc/xprtrdma/rpc_rdma.c:879 rpcrdma_reply_handler()
         warn: can 'credits' be negative?
      
      "credits" is defined as an int. The credits value comes from the
      server as a 32-bit unsigned integer.
      
      A malicious or broken server can plant a large unsigned integer in
      that field which would result in an underflow in the following
      logic, potentially triggering a deadlock of the mount point by
      blocking the client from issuing more RPC requests.
      
      net/sunrpc/xprtrdma/rpc_rdma.c:
      
        876          credits = be32_to_cpu(headerp->rm_credit);
        877          if (credits == 0)
        878                  credits = 1;    /* don't deadlock */
        879          else if (credits > r_xprt->rx_buf.rb_max_requests)
        880                  credits = r_xprt->rx_buf.rb_max_requests;
        881
        882          cwnd = xprt->cwnd;
        883          xprt->cwnd = credits << RPC_CWNDSHIFT;
        884          if (xprt->cwnd > cwnd)
        885                  xprt_release_rqst_cong(rqst->rq_task);
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
      Fixes: eba8ff66 ("xprtrdma: Move credit update to RPC reply handler")
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
  10. 30 Jan 2015, 6 commits
    • xprtrdma: Allocate zero pad separately from rpcrdma_buffer · c05fbb5a
      Committed by Chuck Lever
      Use the new rpcrdma_alloc_regbuf() API to shrink the amount of
      contiguous memory needed for a buffer pool by moving the zero
      pad buffer into a regbuf.
      
      This is for consistency with the other uses of internally
      registered memory.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Reviewed-by: Steve Wise <swise@opengridcomputing.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Allocate RPC/RDMA receive buffer separately from struct rpcrdma_rep · 6b1184cd
      Committed by Chuck Lever
      The rr_base field is currently the buffer where RPC replies land.
      
      An RPC/RDMA reply header lands in this buffer. In some cases an RPC
      reply header also lands in this buffer, just after the RPC/RDMA
      header.
      
      The inline threshold is an agreed-on size limit for RDMA SEND
      operations that pass between client and server. The sum of the
      RPC/RDMA reply header size and the RPC reply header size must be
      less than this threshold.
      
      The largest RDMA RECV that the client should have to handle is the
      size of the inline threshold. The receive buffer should thus be the
      size of the inline threshold, and not related to RPCRDMA_MAX_SEGS.
      
      RPC replies received via RDMA WRITE (long replies) are caught in
      rq_rcv_buf, which is the second half of the RPC send buffer. That
      is, such replies are not involved in any way with rr_base.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Allocate RPC/RDMA send buffer separately from struct rpcrdma_req · 85275c87
      Committed by Chuck Lever
      The rl_base field is currently the buffer where each RPC/RDMA call
      header is built.
      
      The inline threshold is an agreed-on size limit for RDMA SEND
      operations that pass between client and server. The sum of the
      RPC/RDMA header size and the RPC header size must be less than or
      equal to this threshold.
      
      Increasing the r/wsize maximum will require MAX_SEGS to grow
      significantly, but the inline threshold size won't change (both
      sides agree on it). The server's inline threshold doesn't change.
      
      Since an RPC/RDMA header can never be larger than the inline
      threshold, make all RPC/RDMA header buffers the size of the
      inline threshold.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Reviewed-by: Steve Wise <swise@opengridcomputing.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Allocate RPC send buffer separately from struct rpcrdma_req · 0ca77dc3
      Committed by Chuck Lever
      Because internal memory registration is an expensive and synchronous
      operation, xprtrdma pre-registers send and receive buffers at mount
      time, and then re-uses them for each RPC.
      
      A "hardway" allocation is a memory allocation and registration that
      replaces a send buffer during the processing of an RPC. Hardway must
      be done if the RPC send buffer is too small to accommodate an RPC's
      call and reply headers.
      
      For xprtrdma, each RPC send buffer is currently part of struct
      rpcrdma_req so that xprt_rdma_free(), which is passed nothing but
      the address of an RPC send buffer, can find its matching struct
      rpcrdma_req and rpcrdma_rep quickly via container_of / offsetof.
      
      That means that hardway currently has to replace a whole rpcrdma_req
      when it replaces an RPC send buffer. This is often a fairly hefty
      chunk of contiguous memory due to the size of the rl_segments array
      and the fact that both the send and receive buffers are part of
      struct rpcrdma_req.
      
      Some obscure re-use of fields in rpcrdma_req is done so that
      xprt_rdma_free() can detect replaced rpcrdma_req structs, and
      restore the original.
      
      This commit breaks apart the RPC send buffer and struct rpcrdma_req
      so that increasing the size of the rl_segments array does not change
      the alignment of each RPC send buffer. (Increasing rl_segments is
      needed to bump up the maximum r/wsize for NFS/RDMA).
      
      This change opens up some interesting possibilities for improving
      the design of xprt_rdma_allocate().
      
      xprt_rdma_allocate() is now the one place where RPC send buffers
      are allocated or re-allocated, and they are now always left in place
      by xprt_rdma_free().
      
      A large re-allocation that includes both the rl_segments array and
      the RPC send buffer is no longer needed. Send buffer re-allocation
      becomes quite rare. Good send buffer alignment is guaranteed no
      matter what the size of the rl_segments array is.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Reviewed-by: Steve Wise <swise@opengridcomputing.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Remove rpcrdma_ep::rep_func and ::rep_xprt · afadc468
      Committed by Chuck Lever
      Clean up: The rep_func field always refers to rpcrdma_conn_func().
      rep_func should have been removed by commit b45ccfd2 ("xprtrdma:
      Remove MEMWINDOWS registration modes").
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Reviewed-by: Steve Wise <swise@opengridcomputing.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Move credit update to RPC reply handler · eba8ff66
      Committed by Chuck Lever
      Reduce work in the receive CQ handler, which can be run at hardware
      interrupt level, by moving the RPC/RDMA credit update logic to the
      RPC reply handler.
      
      This has some additional benefits: More header sanity checking is
      done before trusting the incoming credit value, and the receive CQ
      handler no longer touches the RPC/RDMA header (the CPU stalls while
      waiting for the header contents to be brought into the cache).
      
      This further extends work begun by commit e7ce710a ("xprtrdma:
      Avoid deadlock when credit window is reset").
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Reviewed-by: Steve Wise <swise@opengridcomputing.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>