1. 18 May 2016, 4 commits
    • xprtrdma: Allow Read list and Reply chunk simultaneously · 94f58c58
      Committed by Chuck Lever
      rpcrdma_marshal_req() makes a simplifying assumption: that NFS
      operations with large Call messages have small Reply messages, and
      vice versa. Therefore, with RPC-over-RDMA, at most one chunk type is
      ever needed for each Call/Reply pair: when one direction needs
      chunks, the other direction always fits inline.
      
      In fact, this assumption is asserted in the code:
      
        if (rtype != rpcrdma_noch && wtype != rpcrdma_noch) {
        	dprintk("RPC:       %s: cannot marshal multiple chunk lists\n",
        		__func__);
        	return -EIO;
        }
      
      But RPCSEC_GSS breaks this assumption. Because krb5i and krb5p
      perform data transformation on RPC messages before they are
      transmitted, direct data placement techniques cannot be used, and
      such RPC messages must be sent via a Long call in both directions.
      All such calls are sent with a Position Zero Read chunk, and all
      such replies are handled with a Reply chunk. Thus the client must
      provide every Call/Reply pair with both a Read list and a Reply
      chunk.
      
      Without any special security in effect, NFSv4 WRITEs may now also
      use the Read list and provide a Reply chunk. The marshal_req
      logic was preventing that, meaning an NFSv4 WRITE with a large
      payload that included a GETATTR result larger than the inline
      threshold would fail.
      
      The code that encodes each chunk list is now completely contained in
      its own function. There is some code duplication, but the trade-off
      is that the overall logic should be more clear.
      
      Note that all three chunk lists now share the rl_segments array.
      Some additional per-req accounting is necessary to track this
      usage. For the same reasons that the above simplifying assumption
      has held true for so long, I don't expect more array elements are
      needed at this time.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Tested-by: Steve Wise <swise@opengridcomputing.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Update comments in rpcrdma_marshal_req() · 88b18a12
      Committed by Chuck Lever
      Update documenting comments to reflect code changes over the past
      year.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Tested-by: Steve Wise <swise@opengridcomputing.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Avoid using Write list for small NFS READ requests · cce6deeb
      Committed by Chuck Lever
      Avoid the latency and interrupt overhead of registering a Write
      chunk when handling NFS READ requests of a few hundred bytes or
      less.
      
      This change does not interoperate with Linux NFS/RDMA servers
      that do not have commit 9d11b51c ('svcrdma: Fix send_reply()
      scatter/gather set-up'). Commit 9d11b51c was introduced in v4.3,
      and is included in 4.2.y, 4.1.y, and 3.18.y.
      
      Oracle bug 22925946 has been filed to request that the above fix
      be included in the Oracle Linux UEK4 NFS/RDMA server.
      
      Red Hat bugzillas 1327280 and 1327554 have been filed to request
      that RHEL NFS/RDMA server backports include the above fix.
      
      Workaround: Replace the "proto=rdma,port=20049" mount options
      with "proto=tcp" until commit 9d11b51c is applied to your
      NFS server.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Tested-by: Steve Wise <swise@opengridcomputing.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Prevent inline overflow · 302d3deb
      Committed by Chuck Lever
      When deciding whether to send a Call inline, rpcrdma_marshal_req
      doesn't take into account header bytes consumed by chunk lists.
      This results in Call messages on the wire that are sometimes larger
      than the inline threshold.
      
      Likewise, when a Write list or Reply chunk is in play, the server's
      reply has to emit an RDMA Send that includes a larger-than-minimal
      RPC-over-RDMA header.
      
      The actual size of a Call message cannot be estimated until after
      the chunk lists have been registered. Thus the size of each
      RPC-over-RDMA header can be estimated only after chunks are
      registered; but the decision to register chunks is based on the size
      of that header. Chicken, meet egg.
      
      The best a client can do is estimate header size based on the
      largest header that might occur, and then ensure that inline content
      is always smaller than that.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Tested-by: Steve Wise <swise@opengridcomputing.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
  2. 15 Mar 2016, 4 commits
  3. 19 Dec 2015, 1 commit
    • xprtrdma: Invalidate in the RPC reply handler · 68791649
      Committed by Chuck Lever
      There is a window between the time the RPC reply handler wakes the
      waiting RPC task and when xprt_release() invokes ops->buf_free.
      During this time, memory regions containing the data payload may
      still be accessed by a broken or malicious server, but the RPC
      application has already been allowed access to the memory containing
      the RPC request's data payloads.
      
      The server should be fenced from client memory containing RPC data
      payloads _before_ the RPC application is allowed to continue.
      
      This change also more strongly enforces send queue accounting. There
      is a maximum number of RPC calls allowed to be outstanding. When an
      RPC/RDMA transport is set up, just enough send queue resources are
      allocated to handle registration, Send, and invalidation WRs for
      each of those RPCs at the same time.
      
      Before, additional RPC calls could be dispatched while invalidation
      WRs were still consuming send WQEs. When invalidation WRs backed
      up, dispatching additional RPCs resulted in a send queue overrun.
      
      Now, the reply handler prevents RPC dispatch until invalidation is
      complete. This prevents RPC call dispatch until there are enough
      send queue resources to proceed.
      
      Still to do: If an RPC exits early (say, ^C), the reply handler has
      no opportunity to perform invalidation. Currently, xprt_rdma_free()
      still frees remaining RDMA resources, which could deadlock.
      Additional changes are needed to handle invalidation properly in this
      case.
      Reported-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
  4. 03 Nov 2015, 4 commits
  5. 06 Aug 2015, 7 commits
    • xprtrdma: Count RDMA_NOMSG type calls · 860477d1
      Committed by Chuck Lever
      RDMA_NOMSG type calls are less efficient than RDMA_MSG. Count NOMSG
      calls so administrators can tell if they happen to be used more than
      expected.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Fix large NFS SYMLINK calls · 2fcc213a
      Committed by Chuck Lever
      Repair how rpcrdma_marshal_req() chooses which RDMA message type
      to use for large non-WRITE operations so that it picks RDMA_NOMSG
      in the correct situations, and sets up the marshaling logic to
      SEND only the RPC/RDMA header.
      
      Large NFSv2 SYMLINK requests now use RDMA_NOMSG calls. The Linux NFS
      server XDR decoder for NFSv2 SYMLINK does not handle having the
      pathname argument arrive in a separate buffer. The decoder could be
      fixed, but this is simpler and RDMA_NOMSG can be used in a variety
      of other situations.
      
      Ensure that the Linux client continues to use "RDMA_MSG + read
      list" when sending large NFSv3 SYMLINK requests, which is more
      efficient than using RDMA_NOMSG.
      
      Large NFSv4 CREATE(NF4LNK) requests are changed to use "RDMA_MSG +
      read list" just like NFSv3 (see Section 5 of RFC 5667). Before,
      these did not work at all.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Fix XDR tail buffer marshalling · 677eb17e
      Committed by Chuck Lever
      Currently xprtrdma appends an extra chunk element to the RPC/RDMA
      read chunk list of each NFSv4 WRITE compound. The extra element
      contains the final GETATTR operation in the compound.
      
      The result is an extra RDMA READ operation to transfer a very short
      piece of each NFS WRITE compound (typically 16 bytes). This is
      inefficient.
      
      It is also incorrect.
      
      The client is sending the trailing GETATTR at the same Position as
      the preceding WRITE data payload. Whether or not RFC 5667 allows
      the GETATTR to appear in a read chunk, RFC 5666 requires that these
      two separate RPC arguments appear at two distinct Positions.
      
      It can also be argued that the GETATTR operation is not bulk data,
      and therefore RFC 5667 forbids its appearance in a read chunk at
      all.
      
      Although RFC 5667 is not precise about when using a read list with
      NFSv4 COMPOUND is allowed, the intent is that only data arguments
      not touched by NFS (i.e., read and write payloads) are to be sent
      using RDMA READ or WRITE.
      
      The NFS client constructs GETATTR arguments itself, and therefore is
      required to send the trailing GETATTR operation as additional inline
      content, not as a data payload.
      
      NB: This change is not backwards compatible. Some older servers do
      not accept inline content following the read list. The Linux NFS
      server should handle this content correctly as of commit
      a97c331f ("svcrdma: Handle additional inline content").
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Don't provide a reply chunk when expecting a short reply · 33943b29
      Committed by Chuck Lever
      Currently Linux always offers a reply chunk, even when the reply
      can be sent inline (i.e., smaller than 1KB).
      
      On the client, registering a memory region can be expensive. A
      server may choose not to use the reply chunk, wasting the cost of
      the registration.
      
      This is a change only for RPC replies smaller than 1KB which the
      server constructs in the RPC reply send buffer. Because the elements
      of the reply must be XDR encoded, a copy-free data transfer has no
      benefit in this case.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
      Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Always provide a write list when sending NFS READ · 02eb57d8
      Committed by Chuck Lever
      The client has been setting up a reply chunk for NFS READs that are
      smaller than the inline threshold. This is not efficient: both the
      server and client CPUs have to copy the reply's data payload into
      and out of the memory region that is then transferred via RDMA.
      
      Using the write list, the data payload is moved by the device and no
      extra data copying is necessary.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Reviewed-by: Devesh Sharma <devesh.sharma@avagotech.com>
      Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
      Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Account for RPC/RDMA header size when deciding to inline · 5457ced0
      Committed by Chuck Lever
      When the size of the RPC message is near the inline threshold (1KB),
      the client would allow messages to be sent that were a few bytes too
      large.
      
      When marshaling RPC/RDMA requests, ensure the combined size of
      RPC/RDMA header and RPC header do not exceed the inline threshold.
      Endpoints typically reject RPC/RDMA messages that exceed the size
      of their receive buffers.
      
      The two server implementations I test with (Linux and Solaris) use
      receive buffers that are larger than the client’s inline threshold.
      Thus so far this has been benign, observed only by code inspection.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Reviewed-by: Devesh Sharma <devesh.sharma@avagotech.com>
      Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
    • xprtrdma: Remove logic that constructs RDMA_MSGP type calls · b3221d6a
      Committed by Chuck Lever
      RDMA_MSGP type calls insert a zero pad in the middle of the RPC
      message to align the RPC request's data payload to the server's
      alignment preferences. A server can then "page flip" the payload
      into place to avoid a data copy in certain circumstances. However:
      
      1. The client has to have a priori knowledge of the server's
         preferred alignment
      
      2. Requests eligible for RDMA_MSGP are requests that are small
         enough to have been sent inline, and convey a data payload
         at the _end_ of the RPC message
      
      Today 1. is done with a sysctl, and is a global setting that is
      copied during mount. Linux does not support CCP to query the
      server's preferences (RFC 5666, Section 6).
      
      A small-ish NFSv3 WRITE might use RDMA_MSGP, but no NFSv4
      compound fits bullet 2.
      
      Thus the Linux client currently leaves RDMA_MSGP disabled. The
      Linux server handles RDMA_MSGP, but does not use any special
      page flipping, so it confers no benefit.
      
      Clean up the marshaling code by removing the logic that constructs
      RDMA_MSGP type calls. This also reduces the maximum send iovec size
      from four to just two elements.
      
      /proc/sys/sunrpc/rdma_inline_write_padding is a kernel API, and
      thus is left in place.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
  6. 13 Jun 2015, 3 commits
  7. 31 Mar 2015, 3 commits
  8. 24 Feb 2015, 1 commit
    • xprtrdma: Store RDMA credits in unsigned variables · 9b1dcbc8
      Committed by Chuck Lever
      Dan Carpenter's static checker pointed out:
      
         net/sunrpc/xprtrdma/rpc_rdma.c:879 rpcrdma_reply_handler()
         warn: can 'credits' be negative?
      
      "credits" is defined as an int. The credits value comes from the
      server as a 32-bit unsigned integer.
      
      A malicious or broken server can plant a large unsigned integer in
      that field which would result in an underflow in the following
      logic, potentially triggering a deadlock of the mount point by
      blocking the client from issuing more RPC requests.
      
      net/sunrpc/xprtrdma/rpc_rdma.c:
      
        876          credits = be32_to_cpu(headerp->rm_credit);
        877          if (credits == 0)
        878                  credits = 1;    /* don't deadlock */
        879          else if (credits > r_xprt->rx_buf.rb_max_requests)
        880                  credits = r_xprt->rx_buf.rb_max_requests;
        881
        882          cwnd = xprt->cwnd;
        883          xprt->cwnd = credits << RPC_CWNDSHIFT;
        884          if (xprt->cwnd > cwnd)
        885                  xprt_release_rqst_cong(rqst->rq_task);
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
      Fixes: eba8ff66 ("xprtrdma: Move credit update to RPC . . .")
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
  9. 30 Jan 2015, 9 commits
  10. 25 Nov 2014, 1 commit
  11. 01 Aug 2014, 2 commits
  12. 04 Jun 2014, 1 commit
    • xprtrdma: Disconnect on registration failure · c93c6223
      Committed by Chuck Lever
      If rpcrdma_register_external() fails during request marshaling, the
      current RPC request is killed. Instead, this RPC should be retried
      after reconnecting the transport instance.
      
      The most likely reason for registration failure with FRMR is a
      failed post_send, which would be due to a remote transport
      disconnect or memory exhaustion. These issues can be recovered
      by a retry.
      
      Problems encountered in the marshaling logic itself will not be
      corrected by trying again, so these should still kill a request.
      
      Now that we've added a clean exit for marshaling errors, take the
      opportunity to defang some BUG_ON's.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>