1. 03 Nov 2015, 1 commit
  2. 06 Aug 2015, 7 commits
    • xprtrdma: Count RDMA_NOMSG type calls · 860477d1
      Committed by Chuck Lever
      RDMA_NOMSG type calls are less efficient than RDMA_MSG. Count NOMSG
      calls so administrators can tell whether they are being used more
      often than expected.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
      860477d1
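
      A minimal user-space sketch of the counting idea; xprt_stats and
      count_call_type are illustrative names, not the kernel's
      identifiers:

        /* Bump a per-transport counter whenever an RDMA_NOMSG call is
         * marshaled, so the tally can be exposed to administrators in
         * the transport's statistics output.
         */
        #include <stdio.h>

        enum rdma_proc { RDMA_MSG, RDMA_NOMSG };

        struct xprt_stats {
                unsigned long long nomsg_call_count;
        };

        static void count_call_type(struct xprt_stats *stats,
                                    enum rdma_proc type)
        {
                if (type == RDMA_NOMSG)
                        stats->nomsg_call_count++;
        }

        int main(void)
        {
                struct xprt_stats stats = { 0 };

                count_call_type(&stats, RDMA_NOMSG);
                count_call_type(&stats, RDMA_MSG);
                printf("nomsg calls: %llu\n", stats.nomsg_call_count);
                return 0;
        }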
    • xprtrdma: Fix large NFS SYMLINK calls · 2fcc213a
      Committed by Chuck Lever
      Repair how rpcrdma_marshal_req() chooses which RDMA message type
      to use for large non-WRITE operations so that it picks RDMA_NOMSG
      in the correct situations, and sets up the marshaling logic to
      SEND only the RPC/RDMA header.
      
      Large NFSv2 SYMLINK requests now use RDMA_NOMSG calls. The Linux NFS
      server XDR decoder for NFSv2 SYMLINK does not handle having the
      pathname argument arrive in a separate buffer. The decoder could be
      fixed, but this is simpler and RDMA_NOMSG can be used in a variety
      of other situations.
      
      Ensure that the Linux client continues to use "RDMA_MSG + read
      list" when sending large NFSv3 SYMLINK requests, which is more
      efficient than using RDMA_NOMSG.
      
      Large NFSv4 CREATE(NF4LNK) requests are changed to use "RDMA_MSG +
      read list" just like NFSv3 (see Section 5 of RFC 5667). Before,
      these did not work at all.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
      2fcc213a
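
      A hedged sketch of the resulting selection logic in self-contained
      C; choose_call_type, inline_max, and payload_has_position are
      assumed names for illustration:

        #include <stdbool.h>
        #include <stddef.h>

        enum rdma_proc { RDMA_MSG, RDMA_NOMSG };

        /* Choose how to marshal a call:
         * - whole call fits inline: plain RDMA_MSG;
         * - large payload at a known XDR position (NFSv3 SYMLINK
         *   pathname, NFSv4 CREATE(NF4LNK) linkdata): RDMA_MSG plus
         *   a read list for the payload;
         * - otherwise RDMA_NOMSG: the SEND carries only the RPC/RDMA
         *   header, and the entire RPC message moves via the read list.
         */
        enum rdma_proc choose_call_type(size_t call_len, size_t inline_max,
                                        bool payload_has_position)
        {
                if (call_len <= inline_max)
                        return RDMA_MSG;   /* pure inline */
                if (payload_has_position)
                        return RDMA_MSG;   /* inline + read list */
                return RDMA_NOMSG;         /* header-only SEND */
        }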
    • xprtrdma: Fix XDR tail buffer marshalling · 677eb17e
      Committed by Chuck Lever
      Currently xprtrdma appends an extra chunk element to the RPC/RDMA
      read chunk list of each NFSv4 WRITE compound. The extra element
      contains the final GETATTR operation in the compound.
      
      The result is an extra RDMA READ operation to transfer a very short
      piece of each NFS WRITE compound (typically 16 bytes). This is
      inefficient.
      
      It is also incorrect.
      
      The client is sending the trailing GETATTR at the same Position as
      the preceding WRITE data payload. Whether or not RFC 5667 allows
      the GETATTR to appear in a read chunk, RFC 5666 requires that these
      two separate RPC arguments appear at two distinct Positions.
      
      It can also be argued that the GETATTR operation is not bulk data,
      and therefore RFC 5667 forbids its appearance in a read chunk at
      all.
      
      Although RFC 5667 is not precise about when using a read list with
      NFSv4 COMPOUND is allowed, the intent is that only data arguments
       not touched by NFS (i.e., read and write payloads) are to be sent
      using RDMA READ or WRITE.
      
      The NFS client constructs GETATTR arguments itself, and therefore is
      required to send the trailing GETATTR operation as additional inline
      content, not as a data payload.
      
      NB: This change is not backwards compatible. Some older servers do
      not accept inline content following the read list. The Linux NFS
      server should handle this content correctly as of commit
      a97c331f ("svcrdma: Handle additional inline content").
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
      677eb17e
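
      A rough sketch of the corrected layout, using a simplified
      xdr_buf-like structure (names and sizes are illustrative): the
      WRITE payload is referenced by a read chunk at its own XDR
      position, and the trailing GETATTR stays inline rather than
      becoming a second chunk element at the same Position.

        #include <stddef.h>
        #include <stdio.h>

        struct kvec_s { size_t len; };

        struct xdr_buf_s {
                struct kvec_s head;   /* inline: ops before the payload */
                size_t page_len;      /* bulk data: the WRITE payload */
                struct kvec_s tail;   /* inline: trailing GETATTR */
        };

        int main(void)
        {
                struct xdr_buf_s buf = { { 512 }, 4096, { 16 } };

                /* one read chunk, at a Position distinct from the
                 * inline content */
                size_t chunk_pos = buf.head.len;
                /* head and tail are both carried inline in the SEND */
                size_t inline_len = buf.head.len + buf.tail.len;

                printf("read chunk: %zu bytes at position %zu\n",
                       buf.page_len, chunk_pos);
                printf("inline send: %zu bytes (head + tail)\n",
                       inline_len);
                return 0;
        }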
    • xprtrdma: Don't provide a reply chunk when expecting a short reply · 33943b29
      Committed by Chuck Lever
      Currently Linux always offers a reply chunk, even when the reply
       can be sent inline (i.e., is smaller than 1KB).
      
      On the client, registering a memory region can be expensive. A
      server may choose not to use the reply chunk, wasting the cost of
      the registration.
      
      This change affects only RPC replies smaller than 1KB, which the
      server constructs in the RPC reply send buffer. Because the elements
      of the reply must be XDR encoded, a copy-free data transfer has no
      benefit in this case.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
      Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
      33943b29
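
      A minimal sketch of the decision, assuming a 1KB inline threshold;
      INLINE_READ_MAX and need_reply_chunk are illustrative names:

        #include <stdbool.h>
        #include <stddef.h>

        #define INLINE_READ_MAX 1024   /* assumed 1KB inline threshold */

        /* Offer a reply chunk only when the expected reply cannot be
         * received inline; registering memory for a chunk the server
         * may never use just wastes the registration cost.
         */
        bool need_reply_chunk(size_t expected_reply_len)
        {
                return expected_reply_len > INLINE_READ_MAX;
        }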
    • xprtrdma: Always provide a write list when sending NFS READ · 02eb57d8
      Committed by Chuck Lever
      The client has been setting up a reply chunk for NFS READs that are
      smaller than the inline threshold. This is not efficient: both the
      server and client CPUs have to copy the reply's data payload into
      and out of the memory region that is then transferred via RDMA.
      
      Using the write list, the data payload is moved by the device and no
      extra data copying is necessary.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Reviewed-by: Devesh Sharma <devesh.sharma@avagotech.com>
      Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
      Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
      02eb57d8
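
      A sketch of the before/after behavior under assumed names
      (choose_reply_type is not the kernel's function):

        #include <stdbool.h>
        #include <stddef.h>

        enum reply_type { NOCHUNK, WRITE_LIST, REPLY_CHUNK };

        /* Before: small READ replies were copied through a reply chunk.
         * After: READ replies always get a write list, so the adapter
         * places the payload directly and neither CPU copies it.
         */
        enum reply_type choose_reply_type(bool is_read, size_t reply_len,
                                          size_t inline_max)
        {
                if (is_read)
                        return WRITE_LIST;   /* payload moved by the device */
                if (reply_len <= inline_max)
                        return NOCHUNK;      /* fits the receive buffer */
                return REPLY_CHUNK;          /* large non-READ reply */
        }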
    • xprtrdma: Account for RPC/RDMA header size when deciding to inline · 5457ced0
      Committed by Chuck Lever
      When the size of the RPC message is near the inline threshold (1KB),
      the client would allow messages to be sent that were a few bytes too
      large.
      
      When marshaling RPC/RDMA requests, ensure that the combined size of
      the RPC/RDMA header and the RPC header does not exceed the inline
      threshold. Endpoints typically reject RPC/RDMA messages that exceed
      the size of their receive buffers.
      
      The two server implementations I test with (Linux and Solaris) use
      receive buffers that are larger than the client's inline threshold,
      so the defect has so far been benign and was caught only by code
      inspection.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Reviewed-by: Devesh Sharma <devesh.sharma@avagotech.com>
      Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
      5457ced0
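
      A one-function sketch of the corrected check; fits_inline is an
      illustrative name:

        #include <stdbool.h>
        #include <stddef.h>

        /* A call may go inline only if the RPC/RDMA header and the RPC
         * message together fit under the threshold; testing the RPC
         * message alone can overrun the peer's receive buffer by the
         * size of the transport header.
         */
        bool fits_inline(size_t rpcrdma_hdr_len, size_t rpc_msg_len,
                         size_t inline_max)
        {
                return rpcrdma_hdr_len + rpc_msg_len <= inline_max;
        }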
    • xprtrdma: Remove logic that constructs RDMA_MSGP type calls · b3221d6a
      Committed by Chuck Lever
      RDMA_MSGP type calls insert a zero pad in the middle of the RPC
      message to align the RPC request's data payload to the server's
      alignment preferences. A server can then "page flip" the payload
      into place to avoid a data copy in certain circumstances. However:
      
      1. The client has to have a priori knowledge of the server's
         preferred alignment
      
      2. Requests eligible for RDMA_MSGP are requests that are small
         enough to have been sent inline, and convey a data payload
         at the _end_ of the RPC message
      
      Today, requirement 1 is met with a sysctl: a global setting that is
      copied at mount time. Linux does not support CCP to query the
      server's preferences (RFC 5666, Section 6).
      
      A small-ish NFSv3 WRITE might use RDMA_MSGP, but no NFSv4
      compound satisfies criterion 2.
      
      Thus the Linux client currently leaves RDMA_MSGP disabled. The
      Linux server handles RDMA_MSGP, but does not use any special
      page flipping, so it confers no benefit.
      
      Clean up the marshaling code by removing the logic that constructs
      RDMA_MSGP type calls. This also reduces the maximum send iovec size
      from four to just two elements.
      
      /proc/sys/sunrpc/rdma_inline_write_padding is a kernel API, and
      thus is left in place.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Tested-by: Devesh Sharma <devesh.sharma@avagotech.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
      b3221d6a
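
      For reference, a sketch of the alignment arithmetic RDMA_MSGP
      relied on; msgp_pad_len is an illustrative name, and this is the
      sort of logic the patch removes:

        #include <stddef.h>

        /* Length of the zero pad that places the payload on the
         * server's preferred alignment boundary, enabling page
         * flipping on receive.
         */
        size_t msgp_pad_len(size_t payload_offset, size_t server_align)
        {
                if (server_align == 0)
                        return 0;
                return (server_align - payload_offset % server_align)
                       % server_align;
        }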
  3. 13 Jun 2015, 3 commits
  4. 31 Mar 2015, 3 commits
  5. 24 Feb 2015, 1 commit
    • xprtrdma: Store RDMA credits in unsigned variables · 9b1dcbc8
      Committed by Chuck Lever
      Dan Carpenter's static checker pointed out:
      
         net/sunrpc/xprtrdma/rpc_rdma.c:879 rpcrdma_reply_handler()
         warn: can 'credits' be negative?
      
      "credits" is defined as an int. The credits value comes from the
      server as a 32-bit unsigned integer.
      
      A malicious or broken server can plant a large unsigned integer in
      that field which would result in an underflow in the following
      logic, potentially triggering a deadlock of the mount point by
      blocking the client from issuing more RPC requests.
      
      net/sunrpc/xprtrdma/rpc_rdma.c:
      
        876          credits = be32_to_cpu(headerp->rm_credit);
        877          if (credits == 0)
        878                  credits = 1;    /* don't deadlock */
        879          else if (credits > r_xprt->rx_buf.rb_max_requests)
        880                  credits = r_xprt->rx_buf.rb_max_requests;
        881
        882          cwnd = xprt->cwnd;
        883          xprt->cwnd = credits << RPC_CWNDSHIFT;
        884          if (xprt->cwnd > cwnd)
        885                  xprt_release_rqst_cong(rqst->rq_task);
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
      Fixes: eba8ff66 ("xprtrdma: Move credit update to RPC . . .")
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
      9b1dcbc8
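
      A self-contained sketch of the fix, holding the credit value in an
      unsigned type so a hostile value clamps instead of going negative;
      clamp_credits is an illustrative name:

        #include <stdint.h>
        #include <stdio.h>

        uint32_t clamp_credits(uint32_t credits, uint32_t max_requests)
        {
                if (credits == 0)
                        credits = 1;              /* don't deadlock */
                else if (credits > max_requests)
                        credits = max_requests;   /* clamps; no underflow */
                return credits;
        }

        int main(void)
        {
                /* 0xFFFFFFFF is negative as an int, so the old clamp
                 * was skipped; as a u32 it clamps to the maximum. */
                printf("%u\n", clamp_credits(0xFFFFFFFFu, 32)); /* 32 */
                printf("%u\n", clamp_credits(0, 32));           /* 1  */
                return 0;
        }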
  6. 30 Jan 2015, 9 commits
  7. 25 Nov 2014, 1 commit
  8. 01 Aug 2014, 2 commits
  9. 04 Jun 2014, 10 commits
  10. 18 Mar 2014, 1 commit
    • SUNRPC: Fix large reads on NFS/RDMA · 2b7bbc96
      Committed by Chuck Lever
      After commit a11a2bf4, "SUNRPC: Optimise away unnecessary data moves
      in xdr_align_pages", Thu Aug 2 13:21:43 2012, READs larger than a
      few hundred bytes via NFS/RDMA no longer work.  This commit exposed
      a long-standing bug in rpcrdma_inline_fixup().
      
      I reproduce this with an rsize=4096 mount using the cthon04 basic
      tests.  Test 5 fails with an EIO error.
      
      For my reproducer, kernel log shows:
      
        NFS: server cheating in read reply: count 4096 > recvd 0
      
      rpcrdma_inline_fixup() is zeroing the xdr_stream::page_len field,
      and xdr_align_pages() is now returning that value to the READ XDR
      decoder function.
      
      That field is set up by xdr_inline_pages(), called by the READ XDR
      encoder function.  As far as I can tell, it is supposed to be left alone
      after that, as it describes the dimensions of the reply xdr_stream,
      not the contents of that stream.
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=68391
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
      2b7bbc96
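
      A sketch of the invariant the fix restores, using a simplified
      buffer type (xdr_buf_s and inline_fixup are illustrative names):
      page_len describes the reply buffer's dimensions and must survive
      the transport's fixup step.

        #include <stddef.h>
        #include <string.h>

        struct xdr_buf_s {
                char   *pages;
                size_t  page_len;  /* capacity, set once by the encoder */
        };

        /* Copy received bytes into the reply pages without touching
         * page_len; the buggy version also zeroed buf->page_len,
         * which xdr_align_pages() later returned to the XDR decoder.
         */
        void inline_fixup(struct xdr_buf_s *buf, const char *src,
                          size_t copied)
        {
                if (copied > buf->page_len)
                        copied = buf->page_len;
                memcpy(buf->pages, src, copied);
        }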
  11. 01 Feb 2013, 1 commit
  12. 21 Mar 2012, 1 commit