1. 18 May 2020, 1 commit
    • svcrdma: Fix backchannel return code · ea740bd5
      Committed by Chuck Lever
      Way back when I was writing the RPC/RDMA server-side backchannel
      code, I misread the TCP backchannel reply handler logic. When
      svc_tcp_recvfrom() successfully receives a backchannel reply, it
      does not return -EAGAIN. It sets XPT_DATA and returns zero.
      
      Update svc_rdma_recvfrom() to return zero. Here, XPT_DATA doesn't
      need to be set again: it is set whenever a new message is received,
      behind a spin lock in a single threaded context.
      
      Also, if handling the cb reply is not successful, the message is
      simply dropped. There's no special message framing to deal with as
      there is in the TCP case.
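
      A minimal sketch of the fixed tail of svc_rdma_recvfrom(), as
      described above (abridged and illustrative, not the verbatim patch;
      p points into the received transport header):

          if (svc_rdma_is_backchannel_reply(xprt, p)) {
                  /* An unusable cb reply is simply dropped here. */
                  svc_rdma_handle_bc_reply(rqstp, ctxt);
                  svc_rdma_recv_ctxt_put(rdma_xprt, ctxt);
                  return 0;  /* was -EAGAIN; matches svc_tcp_recvfrom() */
          }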
      
      Now that the handle_bc_reply() return value is ignored, I've removed
      the dprintk call sites in the error exit of handle_bc_reply() in
      favor of trace points in other areas that already report the error
      cases.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
  2. 18 Apr 2020, 1 commit
    • svcrdma: Fix leak of svc_rdma_recv_ctxt objects · 23cf1ee1
      Committed by Chuck Lever
      Utilize the xpo_release_rqst transport method to ensure that each
      rqstp's svc_rdma_recv_ctxt object is released even when the server
      cannot return a Reply for that rqstp.
      
      Without this fix, each RPC whose Reply cannot be sent leaks one
      svc_rdma_recv_ctxt. This is a 2.5KB structure, a 4KB DMA-mapped
      Receive buffer, and any pages that might be part of the Reply
      message.
      
      The leak is infrequent unless the network fabric is unreliable or
      Kerberos is in use, as GSS sequence window overruns, which result
      in connection loss, are more common on fast transports.
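
      A sketch of the release hook this fix wires up (helper and field
      names follow the svcrdma sources, but treat this as illustrative):

          static void svc_rdma_release_rqst(struct svc_rqst *rqstp)
          {
                  struct svc_rdma_recv_ctxt *ctxt = rqstp->rq_xprt_ctxt;
                  struct svc_xprt *xprt = rqstp->rq_xprt;
                  struct svcxprt_rdma *rdma =
                          container_of(xprt, struct svcxprt_rdma, sc_xprt);

                  rqstp->rq_xprt_ctxt = NULL;
                  if (ctxt)
                          svc_rdma_recv_ctxt_put(rdma, ctxt);
          }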
      
      Fixes: 3a88092e ("svcrdma: Preserve Receive buffer until svc_rdma_sendto")
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
  3. 17 Mar 2020, 5 commits
    • svcrdma: Fix double sync of transport header buffer · aee4b74a
      Committed by Chuck Lever
      Performance optimization: Avoid syncing the transport buffer twice
      when Reply buffer pull-up is necessary.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
    • svcrdma: Refactor chunk list encoders · 6fd5034d
      Committed by Chuck Lever
      Same idea as the receive-side changes I did a while back: use
      xdr_stream helpers rather than open-coding the XDR chunk list
      encoders. This builds the Reply transport header from beginning to
      end without backtracking.
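
      For flavor, a hypothetical xdr_stream-based segment encoder in the
      spirit of this change (encode_rdma_segment is an illustrative name,
      not necessarily the patch's exact helper):

          static int encode_rdma_segment(struct xdr_stream *xdr,
                                         u32 handle, u32 length, u64 offset)
          {
                  __be32 *p;

                  /* handle (4) + length (4) + offset (8), reserved at once */
                  p = xdr_reserve_space(xdr, 4 * sizeof(*p));
                  if (!p)
                          return -EMSGSIZE;  /* cannot overrun the Send buffer */
                  *p++ = cpu_to_be32(handle);
                  *p++ = cpu_to_be32(length);
                  xdr_encode_hyper(p, offset);
                  return 0;
          }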
      
      As additional clean-ups, fill in documenting comments for the XDR
      encoders and sprinkle some trace points in the new encoding
      functions.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
    • svcrdma: De-duplicate code that locates Write and Reply chunks · 2fe8c446
      Committed by Chuck Lever
      Cache the locations of the Requester-provided Write list and Reply
      chunk so that the Send path doesn't need to parse the Call header
      again.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
    • svcrdma: Use struct xdr_stream to decode ingress transport headers · e604aad2
      Committed by Chuck Lever
      The logic that checks incoming network headers has to be scrupulous.
      
      De-duplicate: replace open-coded buffer overflow checks with the use
      of xdr_stream helpers that are used most everywhere else XDR
      decoding is done.
      
      One minor change to the sanity checks: instead of checking the
      length of individual segments, cap the length of the whole chunk
      to be sure it can fit in the set of pages available in rq_pages.
      This should be a better test of whether the server can handle the
      chunks in each request.
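
      A hedged sketch of what the xdr_stream-based decode looks like
      (helper name and accounting here are illustrative):

          static bool xdr_check_segment(struct xdr_stream *xdr, u32 *total)
          {
                  __be32 *p;

                  /* handle (4) + length (4) + offset (8) */
                  p = xdr_inline_decode(xdr, 4 * sizeof(*p));
                  if (!p)
                          return false;  /* would overrun the Receive buffer */
                  *total += be32_to_cpup(p + 1);  /* accumulate chunk length */
                  return true;  /* caller caps *total against rq_pages */
          }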
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
    • nfsd: Fix NFSv4 READ on RDMA when using readv · 41205539
      Committed by Chuck Lever
      svcrdma expects that the payload falls precisely into the xdr_buf
      page vector. This does not seem to be the case for
      nfsd4_encode_readv().
      
      This code is called only when fops->splice_read is missing or when
      RQ_SPLICE_OK is clear, so it's not a noticeable problem in many
      common cases.
      
      Add a new transport method, ->xpo_read_payload, so that when a READ
      payload does not fit exactly in rq_res's page vector, the XDR
      encoder can inform the RPC transport exactly where that payload is,
      without the payload's XDR pad.
      
      That way, when a Write chunk is present, the transport knows what
      byte range in the Reply message is supposed to be matched with the
      chunk.
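
      The method's shape, per this patch (a sketch; the other svc_xprt_ops
      members are elided):

          struct svc_xprt_ops {
                  /* ... existing transport methods ... */
                  int     (*xpo_read_payload)(struct svc_rqst *rqstp,
                                              unsigned int offset,
                                              unsigned int length);
          };

      The XDR encoder reports the payload's location through a small
      wrapper, e.g. svc_encode_read_payload(rqstp, offset, length).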
      
      Note that the Linux NFS server implementation of NFS/RDMA can
      currently handle only one Write chunk per RPC-over-RDMA message.
      This simplifies the implementation of this fix.
      
      Fixes: b0420980 ("nfsd4: allow exotic read compounds")
      Buglink: https://bugzilla.kernel.org/show_bug.cgi?id=198053
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
  4. 19 Aug 2019, 1 commit
  5. 07 Feb 2019, 2 commits
    • svcrdma: Remove syslog warnings in work completion handlers · 8820bcaa
      Committed by Chuck Lever
      These can result in a lot of log noise, and they can be triggered
      by client misbehavior. Since there are trace points in these
      handlers now, there's no need to spam the log.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
    • svcrpc: fix unlikely races preventing queueing of sockets · 95503d29
      Committed by J. Bruce Fields
      In the RPC server, when something happens that might be reason to
      wake up a thread to do something, what we do is
      
      	- modify xpt_flags, sk_sock->flags, xpt_reserved, or
      	  xpt_nr_rqsts to indicate the new situation
      	- call svc_xprt_enqueue() to decide whether to wake up a thread.
      
      svc_xprt_enqueue may require multiple conditions to be true before
      queueing up a thread to handle the xprt.  In the SMP case, one of the
      other CPUs may have set another required condition, and in that case,
      although both CPUs run svc_xprt_enqueue(), it's possible that neither
      call sees the writes done by the other CPU in time, and neither one
      recognizes that all the required conditions have been set.  A socket
      could therefore be ignored indefinitely.
      
      Add memory barriers to ensure that any svc_xprt_enqueue() call will
      always see the conditions changed by other CPUs before deciding to
      ignore a socket.
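
      A sketch of the consumer side of that pairing (abridged; the real
      patch checks more flags and counters):

          static bool svc_xprt_ready(struct svc_xprt *xprt)
          {
                  /*
                   * Pairs with barriers in the callers that change
                   * xpt_flags, xpt_reserved, or xpt_nr_rqsts before
                   * calling svc_xprt_enqueue(): this CPU must observe
                   * those stores before testing the conditions.
                   */
                  smp_rmb();
                  return test_bit(XPT_CONN, &xprt->xpt_flags) ||
                         test_bit(XPT_DATA, &xprt->xpt_flags);
          }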
      
      I've never seen this race reported.  In the unlikely event it happens,
      another event will usually come along and the problem will fix itself.
      So I don't think this is worth backporting to stable.
      
      Chuck tried this patch and said "I don't see any performance
      regressions, but my server has only a single last-level CPU cache."
      Tested-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
  6. 29 Nov 2018, 1 commit
  7. 10 Aug 2018, 1 commit
    • svcrdma: Avoid releasing a page in svc_xprt_release() · a53d5cb0
      Committed by Chuck Lever
      svc_xprt_release() invokes svc_free_res_pages(), which releases
      pages between rq_respages and rq_next_page.
      
      Historically, the RPC/RDMA transport has set these two pointers to
      be different by one, which means:
      
      - one page gets released when svc_recv returns 0. This normally
      happens whenever one or more RDMA Reads need to be dispatched to
      complete construction of an RPC Call.
      
      - one page gets released after every call to svc_send.
      
      In both cases, this released page is immediately refilled by
      svc_alloc_arg. There does not seem to be a reason for releasing this
      page.
      
      To avoid this unnecessary memory allocator traffic, set rq_next_page
      more carefully.
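
      In effect (a sketch; the patch makes this adjustment in the svcrdma
      receive and send paths):

          /* Before: one page sat between the pointers and was freed */
          rqstp->rq_next_page = rqstp->rq_respages + 1;

          /* After: the range [rq_respages, rq_next_page) is empty, so
           * svc_xprt_release() has nothing to free */
          rqstp->rq_next_page = rqstp->rq_respages;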
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
  8. 25 Jul 2018, 1 commit
  9. 09 Jun 2018, 1 commit
  10. 12 May 2018, 12 commits
    • svcrdma: Persistently allocate and DMA-map Send buffers · 99722fe4
      Committed by Chuck Lever
      While sending each RPC Reply, svc_rdma_sendto allocates and DMA-
      maps a separate buffer where the RPC/RDMA transport header is
      constructed. The buffer is unmapped and released in the Send
      completion handler. This is significant per-RPC overhead,
      especially for small RPCs.
      
      Instead, allocate and DMA-map a buffer, and cache it in each
      svc_rdma_send_ctxt. This buffer and its mapping can be re-used
      for each RPC, saving the cost of memory allocation and DMA
      mapping.
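
      A sketch of the one-time setup this implies (field names follow
      svc_rdma_send_ctxt; error handling abridged):

          /* Done once per ctxt at transport setup, not per RPC */
          buf = kmalloc(rdma->sc_max_req_size, GFP_KERNEL);
          addr = ib_dma_map_single(rdma->sc_pd->device, buf,
                                   rdma->sc_max_req_size, DMA_TO_DEVICE);
          if (ib_dma_mapping_error(rdma->sc_pd->device, addr))
                  goto fail;

          ctxt->sc_xprt_buf = buf;       /* reused for every Reply */
          ctxt->sc_sges[0].addr = addr;  /* unmapped only at teardown */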
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
    • svcrdma: Remove post_send_wr · 986b7889
      Committed by Chuck Lever
      Clean up: Now that the send_wr is part of the svc_rdma_send_ctxt,
      svc_rdma_post_send_wr is nearly empty.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
    • svcrdma: Introduce svc_rdma_send_ctxt · 4201c746
      Committed by Chuck Lever
      svc_rdma_op_ctxt's are pre-allocated and maintained on a per-xprt
      free list. This eliminates the overhead of calling kmalloc / kfree,
      both of which grab a globally shared lock that disables interrupts.
      Introduce a replacement to svc_rdma_op_ctxt's that is built
      especially for the svcrdma Send path.
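
      A sketch of the free-list pattern that replaces kmalloc/kfree
      (abridged; lock and list names are illustrative):

          struct svc_rdma_send_ctxt *
          svc_rdma_send_ctxt_get(struct svcxprt_rdma *rdma)
          {
                  struct svc_rdma_send_ctxt *ctxt;

                  spin_lock(&rdma->sc_send_lock);
                  ctxt = list_first_entry_or_null(&rdma->sc_send_ctxts,
                                                  struct svc_rdma_send_ctxt,
                                                  sc_list);
                  if (ctxt)
                          list_del(&ctxt->sc_list);
                  spin_unlock(&rdma->sc_send_lock);
                  if (!ctxt)
                          ctxt = svc_rdma_send_ctxt_alloc(rdma); /* slow path */
                  return ctxt;  /* callers handle NULL, see below */
          }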
      
      Subsequent patches will take advantage of this new structure by
      allocating real resources which are then cached in these objects.
      The allocations are freed when the transport is torn down.
      
      I've renamed the structure so that static type checking can be used
      to ensure that uses of op_ctxt and send_ctxt are not confused. As an
      additional clean up, structure fields are renamed to conform with
      kernel coding conventions.
      
      Additional clean ups:
      - Handle svc_rdma_send_ctxt_get allocation failure at each call
        site, rather than pre-allocating and hoping we guessed correctly
      - All send_ctxt_put call-sites request page freeing, so remove
        the @free_pages argument
      - All send_ctxt_put call-sites unmap SGEs, so fold that into
        svc_rdma_send_ctxt_put
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
    • svcrdma: Clean up Send SGE accounting · 23262790
      Committed by Chuck Lever
      Clean up: Since there's already a svc_rdma_op_ctxt being passed
      around with the running count of mapped SGEs, drop unneeded
      parameters to svc_rdma_post_send_wr().
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
    • svcrdma: Allocate recv_ctxt's on CPU handling Receives · eb5d7a62
      Committed by Chuck Lever
      There is a significant latency penalty when processing an ingress
      Receive if the Receive buffer resides in memory that is not on the
      same NUMA node as the CPU handling completions for a CQ.
      
      The system administrator and the device driver determine which CPU
      handles completions. This CPU does not change during the life of
      the CQ. Further, the Upper Layer has no visibility of which CPU
      that is.
      
      Allocating Receive buffers in the Receive completion handler
      guarantees that Receive buffers are allocated on the preferred NUMA
      node for that CQ.
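
      Sketch of the idea: the Receive replenish moves into the completion
      handler, so the allocation runs on the CQ's CPU (abridged):

          static void svc_rdma_wc_receive(struct ib_cq *cq, struct ib_wc *wc)
          {
                  struct svcxprt_rdma *rdma = cq->cq_context;

                  /* ... process the completed Receive ... */

                  /* This CPU services the CQ, so the replacement
                   * recv_ctxt's buffer is allocated node-locally. */
                  svc_rdma_post_recv(rdma);
          }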
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
    • svcrdma: Persistently allocate and DMA-map Receive buffers · 3316f063
      Committed by Chuck Lever
      The current Receive path uses an array of pages which are allocated
      and DMA mapped when each Receive WR is posted, and then handed off
      to the upper layer in rqstp::rq_arg. The page flip releases unused
      pages in the rq_pages pagelist. This mechanism introduces a
      significant amount of overhead.
      
      So instead, kmalloc the Receive buffer, and leave it DMA-mapped
      while the transport remains connected. This confers a number of
      benefits:
      
      * Each Receive WR requires only one receive SGE, no matter how large
        the inline threshold is. This helps the server-side NFS/RDMA
        transport operate on less capable RDMA devices.
      
      * The Receive buffer is left allocated and mapped all the time. This
        relieves svc_rdma_post_recv from the overhead of allocating and
        DMA-mapping a fresh buffer.
      
      * svc_rdma_wc_receive no longer has to DMA unmap the Receive buffer.
        It has to DMA sync only the number of bytes that were received.
      
      * svc_rdma_build_arg_xdr no longer has to free a page in rq_pages
        for each page in the Receive buffer, making it a constant-time
        function.
      
      * The Receive buffer is now plugged directly into the rq_arg's
        head[0].iov_base, and can be larger than a page without spilling
        over into rq_arg's page list. This enables simplification of
        the RDMA Read path in subsequent patches.
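
      A sketch of the receive-side plumbing this enables (abridged;
      rc_recv_buf is the kmalloc'd buffer cached in the recv_ctxt):

          /* Sync only the bytes that arrived, then plug the buffer in */
          ib_dma_sync_single_for_cpu(rdma->sc_pd->device,
                                     ctxt->rc_recv_sge.addr,
                                     wc->byte_len, DMA_FROM_DEVICE);

          rqstp->rq_arg.head[0].iov_base = ctxt->rc_recv_buf;
          rqstp->rq_arg.head[0].iov_len = wc->byte_len;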
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
    • svcrdma: Preserve Receive buffer until svc_rdma_sendto · 3a88092e
      Committed by Chuck Lever
      Rather than releasing the incoming svc_rdma_recv_ctxt at the end of
      svc_rdma_recvfrom, hold onto it until svc_rdma_sendto.
      
      This permits the contents of the Receive buffer to be preserved
      through svc_process and then referenced directly in sendto as it
      constructs Write and Reply chunks to return to the client.
      
      The real changes will come in subsequent patches.
      
      Note: I cannot use ->xpo_release_rqst for this purpose because that
      is called _before_ ->xpo_sendto. svc_rdma_sendto uses information in
      the received Call transport header to construct the Reply transport
      header, which is preserved in the RPC's Receive buffer.
      
      The historical comment in svc_send() isn't helpful: it is already
      obvious that ->xpo_release_rqst is being called before ->xpo_sendto,
      but there is no explanation for this ordering going back to the
      beginning of the git era.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
    • svcrdma: Simplify svc_rdma_recv_ctxt_put · 1e5f4160
      Committed by Chuck Lever
      Currently svc_rdma_recv_ctxt_put's callers have to know whether they
      want to free the ctxt's pages or not. This means the human
      developers have to know when and why to set that free_pages
      argument.
      
      Instead, the ctxt should carry that information with it so that
      svc_rdma_recv_ctxt_put does the right thing no matter who is
      calling.
      
      We want to keep track of the number of pages in the Receive buffer
      separately from the number of pages pulled over by RDMA Read. This
      is so that the correct number of pages can be freed properly and
      that number is well-documented.
      
      So now, rc_hdr_count is the number of pages consumed by head[0]
      (i.e., the page index where the Read chunk should start); and
      rc_page_count is always the number of pages that need to be released
      when the ctxt is put.
      
      The @free_pages argument is no longer needed.
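
      The put path thus becomes self-describing (sketch, abridged):

          void svc_rdma_recv_ctxt_put(struct svcxprt_rdma *rdma,
                                      struct svc_rdma_recv_ctxt *ctxt)
          {
                  unsigned int i;

                  /* rc_page_count was set when the ctxt was filled in;
                   * callers no longer pass a @free_pages flag. */
                  for (i = 0; i < ctxt->rc_page_count; i++)
                          put_page(ctxt->rc_pages[i]);

                  /* ... return the ctxt to its free list ... */
          }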
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
    • svcrdma: Introduce svc_rdma_recv_ctxt · ecf85b23
      Committed by Chuck Lever
      svc_rdma_op_ctxt's are pre-allocated and maintained on a per-xprt
      free list. This eliminates the overhead of calling kmalloc / kfree,
      both of which grab a globally shared lock that disables interrupts.
      To reduce contention further, separate the use of these objects in
      the Receive and Send paths in svcrdma.
      
      Subsequent patches will take advantage of this separation by
      allocating real resources which are then cached in these objects.
      The allocations are freed when the transport is torn down.
      
      I've renamed the structure so that static type checking can be used
      to ensure that uses of op_ctxt and recv_ctxt are not confused. As an
      additional clean up, structure fields are renamed to conform with
      kernel coding conventions.
      
      As a final clean up, helpers related to recv_ctxt are moved closer
      to the functions that use them.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
    • svcrdma: Trace key RDMA API events · bd2abef3
      Committed by Chuck Lever
      This includes:
        * Posting on the Send and Receive queues
        * Send, Receive, Read, and Write completion
        * Connect upcalls
        * QP errors
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
    • svcrdma: Trace key RPC/RDMA protocol events · 98895edb
      Committed by Chuck Lever
      This includes:
        * Transport accept and tear-down
        * Decisions about using Write and Reply chunks
        * Each RDMA segment that is handled
        * Whenever an RDMA_ERR is sent
      
      As a clean-up, I've standardized the order of the includes, and
      removed some now redundant dprintk call sites.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
  11. 21 Mar 2018, 1 commit
  12. 19 Jan 2018, 1 commit
  13. 13 Jul 2017, 2 commits
    • svcrdma: Properly compute .len and .buflen for received RPC Calls · 71641d99
      Committed by Chuck Lever
      When an RPC-over-RDMA request is received, the Receive buffer
      contains a Transport Header possibly followed by an RPC message.
      
      Even though rq_arg.head[0] (as passed to NFSD) does not contain the
      Transport Header, rq_arg.len currently includes the size of the
      Transport Header.
      
      That violates the intent of the xdr_buf API contract. .buflen should
      include everything, but .len should be exactly the length of the RPC
      message in the buffer.
      
      The rq_arg fields are summed together at the end of
      svc_rdma_recvfrom to obtain the correct return value. rq_arg.len
      really ought to contain the correct number of bytes already, but it
      currently doesn't due to the above misbehavior.
      
      Let's instead ensure that .buflen includes the length of the
      transport header, and that .len is always equal to head.iov_len +
      .page_len + tail.iov_len.
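
      In other words, the invariant at the end of svc_rdma_recvfrom looks
      like this (sketch; hdr_len stands for the transport header size):

          rqstp->rq_arg.len = rqstp->rq_arg.head[0].iov_len +
                              rqstp->rq_arg.page_len +
                              rqstp->rq_arg.tail[0].iov_len;
          rqstp->rq_arg.buflen = rqstp->rq_arg.len + hdr_len;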
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
    • svcrdma: Use generic RDMA R/W API in RPC Call path · cafc7398
      Committed by Chuck Lever
      The current svcrdma recvfrom code path has a lot of detail about
      registration mode and the type of port (iWARP, IB, etc).
      
      Instead, use the RDMA core's generic R/W API. This shares with
      other RDMA-enabled ULPs the code that manages the gory details of
      buffer registration and the posting of RDMA Read Work Requests.
      
      Since the Read list marshaling code is being replaced, I took the
      opportunity to replace C structure-based XDR encoding code with more
      portable code that uses pointer arithmetic.
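
      With the core API, posting a Read chunk boils down to roughly this
      shape (sketch; arguments abridged and field names illustrative):

          /* One rdma_rw_ctx per chunk; the core chooses the
           * registration strategy for this device and port. */
          ret = rdma_rw_ctx_init(&ctxt->rw_ctx, rdma->sc_qp,
                                 rdma->sc_port_num, ctxt->rw_sg_table.sgl,
                                 sge_count, 0, remote_offset, rkey,
                                 DMA_FROM_DEVICE);
          if (ret < 0)
                  goto out_err;
          ret = rdma_rw_ctx_post(&ctxt->rw_ctx, rdma->sc_qp,
                                 rdma->sc_port_num, &ctxt->rw_cqe, NULL);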
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
  14. 29 Jun 2017, 5 commits
    • svcrdma: Don't account for Receive queue "starvation" · 2d6491a5
      Committed by Chuck Lever
      From what I can tell, calling ->recvfrom when there is no work to do
      is a normal part of operation. This is the only way svc_recv can
      tell when there is no more data ready to receive on the transport.
      
      Neither the TCP nor the UDP transport implementations have a
      "starve" metric.
      
      The cost of receive starvation accounting is bumping an atomic, which
      results in extra (IMO unnecessary) bus traffic between CPU sockets,
      while holding a spin lock.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
    • svcrdma: Improve Reply chunk sanity checking · ca5c76ab
      Committed by Chuck Lever
      Identify malformed transport headers and unsupported chunk
      combinations as early as possible.
      
      - Ensure that segment lengths are not crazy.
      
      - Ensure that the Reply chunk's segment count is not crazy.
      
      With a 1KB inline threshold, the largest number of Write segments
      that can be conveyed is about 60 (for a RDMA_NOMSG Reply message).
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
    • svcrdma: Improve Write chunk sanity checking · 3c22f326
      Committed by Chuck Lever
      Identify malformed transport headers and unsupported chunk
      combinations as early as possible.
      
      - Reject RPC-over-RDMA messages that contain more than one Write
      chunk, since this implementation does not support more than one per
      message.
      
      - Ensure that segment lengths are not crazy.
      
      - Ensure that the chunk's segment count is not crazy.
      
      With a 1KB inline threshold, the largest number of Write segments
      that can be conveyed is about 60 (for a RDMA_NOMSG Reply message).
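
      A hedged sketch of the early check (helper names and limits are
      illustrative, not the verbatim patch):

          /* Walk the Write list; at most one chunk is supported */
          static __be32 *xdr_check_write_list(__be32 *p, const __be32 *end)
          {
                  u32 chcount = 0;

                  while (*p++ != xdr_zero) {
                          /* per-chunk segment count/length sanity */
                          p = xdr_check_write_chunk(p, end);
                          if (!p)
                                  return NULL;
                          if (++chcount > 1)
                                  return NULL;  /* only one Write chunk */
                  }
                  return p;
          }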
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
    • svcrdma: Improve Read chunk sanity checking · e77340e0
      Committed by Chuck Lever
      Identify malformed transport headers and unsupported chunk
      combinations as early as possible.
      
      - Reject RPC-over-RDMA messages that contain more than one Read chunk,
        since this implementation currently does not support more than one
        per RPC transaction.
      
      - Ensure that segment lengths are not crazy.
      
      - Remove the segment count check. With a 1KB inline threshold, the
        largest number of Read segments that can be conveyed is about 40
        (for a RDMA_NOMSG Call message). This is nowhere near
        RPCSVC_MAXPAGES. As far as I can tell, that was just a sanity
        check and does not enforce an implementation limit.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
    • svcrdma: Remove svc_rdma_marshal.c · a80a3234
      Committed by Chuck Lever
      svc_rdma_marshal.c has one remaining exported function --
      svc_rdma_xdr_decode_req -- and it has a single call site. Take
      the same approach as the sendto path, and move this function
      into the source file where it is called.
      
      This is a refactoring change only.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
  15. 26 Apr 2017, 2 commits
  16. 09 Feb 2017, 2 commits
  17. 13 Jan 2017, 1 commit