1. 18 May 2020, 1 commit
    • svcrdma: Fix backchannel return code · ea740bd5
      Committed by Chuck Lever
      Way back when I was writing the RPC/RDMA server-side backchannel
      code, I misread the TCP backchannel reply handler logic. When
      svc_tcp_recvfrom() successfully receives a backchannel reply, it
      does not return -EAGAIN. It sets XPT_DATA and returns zero.
      
      Update svc_rdma_recvfrom() to return zero. Here, XPT_DATA doesn't
      need to be set again: it is set whenever a new message is received,
      behind a spin lock in a single threaded context.
      
      Also, if handling the cb reply is not successful, the message is
      simply dropped. There's no special message framing to deal with as
      there is in the TCP case.
      
      Now that the handle_bc_reply() return value is ignored, I've removed
      the dprintk call sites in the error exit of handle_bc_reply() in
      favor of trace points in other areas that already report the error
      cases.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
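
      A minimal sketch of the convention described above (not the upstream
      code; the helper's arguments and the surrounding types are assumed
      from this series):

        /* Sketch: the backchannel arm of svc_rdma_recvfrom().  A cb reply
         * that cannot be handled is silently dropped -- there is no record
         * framing to resynchronize, unlike TCP.
         */
        static int svc_rdma_recvfrom_bc_sketch(struct svcxprt_rdma *rdma,
                                               struct svc_rdma_recv_ctxt *ctxt)
        {
                handle_bc_reply(rdma, ctxt);    /* return value ignored */

                /* Like svc_tcp_recvfrom(): return zero, not -EAGAIN.
                 * XPT_DATA was already set when this message arrived.
                 */
                return 0;
        }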
  2. 18 Apr 2020, 1 commit
    • svcrdma: Fix leak of svc_rdma_recv_ctxt objects · 23cf1ee1
      Committed by Chuck Lever
      Utilize the xpo_release_rqst transport method to ensure that each
      rqstp's svc_rdma_recv_ctxt object is released even when the server
      cannot return a Reply for that rqstp.
      
      Without this fix, each RPC whose Reply cannot be sent leaks one
      svc_rdma_recv_ctxt. This is a 2.5KB structure, a 4KB DMA-mapped
      Receive buffer, and any pages that might be part of the Reply
      message.
      
      The leak is infrequent unless the network fabric is unreliable or
      Kerberos is in use, as GSS sequence window overruns, which result
      in connection loss, are more common on fast transports.
      
      Fixes: 3a88092e ("svcrdma: Preserve Receive buffer until svc_rdma_sendto")
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
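
      A minimal sketch of the xpo_release_rqst idea (field and helper
      names follow this series but the exact signatures are assumed):

        #include <linux/sunrpc/svc.h>
        #include <linux/sunrpc/svc_rdma.h>

        /* Sketch: release the Receive context attached to this rqstp,
         * whether or not a Reply was ever sent for it.
         */
        static void svc_rdma_release_rqst_sketch(struct svc_rqst *rqstp)
        {
                struct svc_rdma_recv_ctxt *ctxt = rqstp->rq_xprt_ctxt;
                struct svcxprt_rdma *rdma =
                        container_of(rqstp->rq_xprt, struct svcxprt_rdma,
                                     sc_xprt);

                rqstp->rq_xprt_ctxt = NULL;
                if (ctxt)
                        svc_rdma_recv_ctxt_put(rdma, ctxt);
        }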
  3. 17 Mar 2020, 8 commits
    • svcrdma: Avoid DMA mapping small RPC Replies · 0dabe948
      Committed by Chuck Lever
      On some platforms, DMA mapping part of a page is more costly than
      copying bytes. Indeed, not involving the I/O MMU can help the
      RPC/RDMA transport scale better for tiny I/Os across more RDMA
      devices. This is because interaction with the I/O MMU is eliminated
      for each of these small I/Os. Without the explicit unmapping, the
      NIC no longer needs to do a costly internal TLB shoot down for
      buffers that are just a handful of bytes.
      
      Since pull-up is now a more frequent operation, I've introduced a
      trace point in the pull-up path. It can be used for debugging or by
      user-space tools that count pull-up frequency.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
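
      A rough sketch of the cut-over being described; the threshold value
      and helper name are illustrative assumptions, not the upstream
      heuristic:

        #include <linux/sunrpc/xdr.h>

        /* Sketch: copy ("pull up") tiny Replies into the already-mapped
         * transport header buffer instead of DMA-mapping each xdr_buf
         * fragment, avoiding per-I/O IOMMU work and device TLB shoot-downs
         * on unmap.
         */
        enum { RDMA_PULLUP_THRESH_SKETCH = 256 };       /* bytes, assumed */

        static bool svc_rdma_pull_up_needed_sketch(const struct xdr_buf *xdr)
        {
                return xdr->len < RDMA_PULLUP_THRESH_SKETCH;
        }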
    • svcrdma: Fix double sync of transport header buffer · aee4b74a
      Committed by Chuck Lever
      Performance optimization: Avoid syncing the transport buffer twice
      when Reply buffer pull-up is necessary.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
    • svcrdma: Refactor chunk list encoders · 6fd5034d
      Committed by Chuck Lever
      Same idea as the receive-side changes I did a while back: use
      xdr_stream helpers rather than open-coding the XDR chunk list
      encoders. This builds the Reply transport header from beginning to
      end without backtracking.
      
      As additional clean-ups, fill in documenting comments for the XDR
      encoders and sprinkle some trace points in the new encoding
      functions.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
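
      A minimal sketch of the encoding style (the segment layout is the
      standard RPC/RDMA handle/length/offset triple; this is not the exact
      upstream helper):

        #include <linux/sunrpc/xdr.h>

        /* Sketch: append one RDMA segment to the Reply transport header
         * via the xdr_stream, with no backtracking.
         */
        static int encode_rdma_segment_sketch(struct xdr_stream *xdr,
                                              u32 handle, u32 length,
                                              u64 offset)
        {
                __be32 *p;

                p = xdr_reserve_space(xdr, 4 * sizeof(*p));
                if (!p)
                        return -EMSGSIZE;
                *p++ = cpu_to_be32(handle);
                *p++ = cpu_to_be32(length);
                xdr_encode_hyper(p, offset);
                return 0;
        }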
    • svcrdma: Update synopsis of svc_rdma_map_reply_msg() · 4554755e
      Committed by Chuck Lever
      Preparing for subsequent patches, no behavior change expected.
      
      Pass the RPC Call's svc_rdma_recv_ctxt deeper into the sendto()
      path. This enables passing more information about Requester-
      provided Write and Reply chunks into those lower-level functions.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
    • svcrdma: Update synopsis of svc_rdma_send_reply_chunk() · 6fa5785e
      Committed by Chuck Lever
      Preparing for subsequent patches, no behavior change expected.
      
      Pass the RPC Call's svc_rdma_recv_ctxt deeper into the sendto()
      path. This enables passing more information about Requester-
      provided Write and Reply chunks into the lower-level send
      functions.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
    • svcrdma: De-duplicate code that locates Write and Reply chunks · 2fe8c446
      Committed by Chuck Lever
      Cache the locations of the Requester-provided Write list and Reply
      chunk so that the Send path doesn't need to parse the Call header
      again.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
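
      Sketched fields only; the names follow the rc_* convention used by
      svc_rdma_recv_ctxt in this series, but are assumptions here:

        /* Sketch: cache pointers into the already-decoded Call header so
         * the Send path never re-parses it.
         */
        struct svc_rdma_recv_ctxt_sketch {
                __be32  *rc_write_list;         /* Requester's Write list, or NULL */
                __be32  *rc_reply_chunk;        /* Requester's Reply chunk, or NULL */
                /* ... other Receive-side state ... */
        };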
    • svcrdma: Use struct xdr_stream to decode ingress transport headers · e604aad2
      Committed by Chuck Lever
      The logic that checks incoming network headers has to be scrupulous.
      
      De-duplicate: replace open-coded buffer overflow checks with the
      xdr_stream helpers used nearly everywhere else XDR decoding is done.
      
      One minor change to the sanity checks: instead of checking the
      length of individual segments, cap the length of the whole chunk
      to be sure it can fit in the set of pages available in rq_pages.
      This should be a better test of whether the server can handle the
      chunks in each request.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
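
      A minimal sketch of the decode style; the helper name and the chunk
      cap shown are assumptions, and only the xdr_stream calls are
      standard:

        #include <linux/sunrpc/svc.h>
        #include <linux/sunrpc/xdr.h>

        /* Sketch: every read is bounds-checked by the xdr_stream, so a
         * short or malformed header makes the decode fail cleanly.
         */
        static int decode_rdma_segment_sketch(struct xdr_stream *xdr,
                                              u32 *chunk_len)
        {
                __be32 *p;

                p = xdr_inline_decode(xdr, 4 * sizeof(*p));
                if (!p)
                        return -EMSGSIZE;
                /* p[0] = handle, p[1] = length, p[2..3] = 64-bit offset */
                *chunk_len += be32_to_cpup(p + 1);

                /* Cap the whole chunk, not each segment: it has to fit in
                 * the pages available in rq_pages (cap expression assumed).
                 */
                if (*chunk_len > RPCSVC_MAXPAGES * PAGE_SIZE)
                        return -EINVAL;
                return 0;
        }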
    • nfsd: Fix NFSv4 READ on RDMA when using readv · 41205539
      Committed by Chuck Lever
      svcrdma expects that the payload falls precisely into the xdr_buf
      page vector. This does not seem to be the case for
      nfsd4_encode_readv().
      
      This code is called only when fops->splice_read is missing or when
      RQ_SPLICE_OK is clear, so it's not a noticeable problem in many
      common cases.
      
      Add new transport method: ->xpo_read_payload so that when a READ
      payload does not fit exactly in rq_res's page vector, the XDR
      encoder can inform the RPC transport exactly where that payload is,
      without the payload's XDR pad.
      
      That way, when a Write chunk is present, the transport knows what
      byte range in the Reply message is supposed to be matched with the
      chunk.
      
      Note that the Linux NFS server implementation of NFS/RDMA can
      currently handle only one Write chunk per RPC-over-RDMA message.
      This simplifies the implementation of this fix.
      
      Fixes: b0420980 ("nfsd4: allow exotic read compounds")
      Buglink: https://bugzilla.kernel.org/show_bug.cgi?id=198053
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
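
      A sketch of the new method's shape; the signature is inferred from
      the description above rather than quoted from the header:

        #include <linux/sunrpc/svc.h>

        /* Sketch: the XDR encoder reports where the READ payload begins in
         * rq_res and how long it is (without the XDR pad), so an RDMA
         * transport can line that byte range up with the Requester-provided
         * Write chunk.
         */
        struct svc_xprt_ops_sketch {
                int (*xpo_read_payload)(struct svc_rqst *rqstp,
                                        unsigned int offset,
                                        unsigned int length);
        };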
  4. 19 Aug 2019, 2 commits
  5. 28 Dec 2018, 1 commit
  6. 29 Nov 2018, 1 commit
  7. 30 Oct 2018, 1 commit
  8. 12 May 2018, 13 commits
    • svcrdma: Remove unused svc_rdma_op_ctxt · 51cc257a
      Committed by Chuck Lever
      Clean up: Eliminate a structure that is no longer used.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
    • svcrdma: Persistently allocate and DMA-map Send buffers · 99722fe4
      Committed by Chuck Lever
      While sending each RPC Reply, svc_rdma_sendto allocates and DMA-
      maps a separate buffer where the RPC/RDMA transport header is
      constructed. The buffer is unmapped and released in the Send
      completion handler. This is significant per-RPC overhead,
      especially for small RPCs.
      
      Instead, allocate and DMA-map a buffer, and cache it in each
      svc_rdma_send_ctxt. This buffer and its mapping can be re-used
      for each RPC, saving the cost of memory allocation and DMA
      mapping.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
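
      A minimal sketch of the one-time setup; names are illustrative, and
      the real code caches the buffer and its DMA address in
      svc_rdma_send_ctxt:

        #include <rdma/ib_verbs.h>
        #include <linux/slab.h>

        /* Sketch: allocate the transport-header buffer once, map it once,
         * and reuse both for every Reply sent with this ctxt.
         */
        static int map_send_hdrbuf_sketch(struct ib_device *device,
                                          size_t inline_size,
                                          void **hdrbuf, u64 *hdrbuf_addr)
        {
                void *buf = kmalloc(inline_size, GFP_KERNEL);
                u64 addr;

                if (!buf)
                        return -ENOMEM;
                addr = ib_dma_map_single(device, buf, inline_size,
                                         DMA_TO_DEVICE);
                if (ib_dma_mapping_error(device, addr)) {
                        kfree(buf);
                        return -EIO;
                }
                *hdrbuf = buf;
                *hdrbuf_addr = addr;
                return 0;
        }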
    • svcrdma: Remove post_send_wr · 986b7889
      Committed by Chuck Lever
      Clean up: Now that the send_wr is part of the svc_rdma_send_ctxt,
      svc_rdma_post_send_wr is nearly empty.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
    • svcrdma: Don't overrun the SGE array in svc_rdma_send_ctxt · 25fd86ec
      Committed by Chuck Lever
      Receive buffers are always the same size, but each Send WR has a
      variable number of SGEs, based on the contents of the xdr_buf being
      sent.
      
      While assembling a Send WR, keep track of the number of SGEs so that
      we don't exceed the device's maximum, or walk off the end of the
      Send SGE array.
      
      For now the Send path just fails if it exceeds the maximum.
      
      The current logic in svc_rdma_accept bases the maximum number of
      Send SGEs on the largest NFS request that can be sent or received.
      In the transport layer, the limit is actually based on the
      capabilities of the underlying device, not on properties of the
      Upper Layer Protocol.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
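
      A minimal sketch of the bound being enforced; field names are
      illustrative, and the limit comes from the device's reported maximum
      SGEs per Send WR rather than from Upper Layer message sizes:

        /* Sketch: refuse to add an SGE once the device limit is reached;
         * for now the Send path simply fails in that case.
         */
        static int add_send_sge_sketch(unsigned int *sge_count,
                                       unsigned int max_send_sges)
        {
                if (*sge_count >= max_send_sges)
                        return -EIO;
                ++*sge_count;
                return 0;
        }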
    • svcrdma: Introduce svc_rdma_send_ctxt · 4201c746
      Committed by Chuck Lever
      svc_rdma_op_ctxt's are pre-allocated and maintained on a per-xprt
      free list. This eliminates the overhead of calling kmalloc / kfree,
      both of which grab a globally shared lock that disables interrupts.
      Introduce a replacement for svc_rdma_op_ctxt that is built
      especially for the svcrdma Send path.
      
      Subsequent patches will take advantage of this new structure by
      allocating real resources which are then cached in these objects.
      The allocations are freed when the transport is torn down.
      
      I've renamed the structure so that static type checking can be used
      to ensure that uses of op_ctxt and send_ctxt are not confused. As an
      additional clean up, structure fields are renamed to conform with
      kernel coding conventions.
      
      Additional clean ups:
      - Handle svc_rdma_send_ctxt_get allocation failure at each call
        site, rather than pre-allocating and hoping we guessed correctly
      - All send_ctxt_put call-sites request page freeing, so remove
        the @free_pages argument
      - All send_ctxt_put call-sites unmap SGEs, so fold that into
        svc_rdma_send_ctxt_put
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
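
      A minimal sketch of the free-list pattern; the struct and field
      names are assumed, and the real object also carries the Send WR, the
      SGE array, and the persistently mapped header buffer:

        #include <linux/list.h>
        #include <linux/spinlock.h>

        struct send_ctxt_sketch {
                struct list_head        sc_list;
                /* ... WR, SGEs, header buffer ... */
        };

        struct send_ctxt_pool_sketch {
                spinlock_t              lock;
                struct list_head        free_ctxts;
        };

        /* Sketch: the hot path avoids kmalloc/kfree and their globally
         * shared, interrupt-disabling lock.  Callers must handle a NULL
         * return at each call site.
         */
        static struct send_ctxt_sketch *
        send_ctxt_get_sketch(struct send_ctxt_pool_sketch *pool)
        {
                struct send_ctxt_sketch *ctxt = NULL;

                spin_lock(&pool->lock);
                if (!list_empty(&pool->free_ctxts)) {
                        ctxt = list_first_entry(&pool->free_ctxts,
                                                struct send_ctxt_sketch,
                                                sc_list);
                        list_del(&ctxt->sc_list);
                }
                spin_unlock(&pool->lock);
                return ctxt;
        }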
    • svcrdma: Clean up Send SGE accounting · 23262790
      Committed by Chuck Lever
      Clean up: Since there's already a svc_rdma_op_ctxt being passed
      around with the running count of mapped SGEs, drop unneeded
      parameters to svc_rdma_post_send_wr().
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
    • svcrdma: Refactor svc_rdma_dma_map_buf · f016f305
      Committed by Chuck Lever
      Clean up: svc_rdma_dma_map_buf does mostly the same thing as
      svc_rdma_dma_map_page, so let's fold these together.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
    • svcrdma: Allocate recv_ctxt's on CPU handling Receives · eb5d7a62
      Committed by Chuck Lever
      There is a significant latency penalty when processing an ingress
      Receive if the Receive buffer resides in memory that is not on the
      same NUMA node as the CPU handling completions for a CQ.
      
      The system administrator and the device driver determine which CPU
      handles completions. This CPU does not change during the life of the
      CQ. Further, the Upper Layer has no visibility into which CPU that
      is.
      
      Allocating Receive buffers in the Receive completion handler
      guarantees that Receive buffers are allocated on the preferred NUMA
      node for that CQ.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
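
      A minimal sketch of why the allocation site matters; this assumes
      the CQ is polled from process context (for example, a bound
      workqueue):

        #include <linux/slab.h>

        /* Sketch: called from the Receive completion handler, which always
         * runs on the CPU assigned to this CQ, so a plain kmalloc() lands
         * on that CPU's local NUMA node.
         */
        static void *alloc_recv_ctxt_local_sketch(size_t size)
        {
                return kmalloc(size, GFP_KERNEL);
        }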
    • svcrdma: Persistently allocate and DMA-map Receive buffers · 3316f063
      Committed by Chuck Lever
      The current Receive path uses an array of pages which are allocated
      and DMA mapped when each Receive WR is posted, and then handed off
      to the upper layer in rqstp::rq_arg. The page flip releases unused
      pages in the rq_pages pagelist. This mechanism introduces a
      significant amount of overhead.
      
      So instead, kmalloc the Receive buffer, and leave it DMA-mapped
      while the transport remains connected. This confers a number of
      benefits:
      
      * Each Receive WR requires only one receive SGE, no matter how large
        the inline threshold is. This helps the server-side NFS/RDMA
        transport operate on less capable RDMA devices.
      
      * The Receive buffer is left allocated and mapped all the time. This
        relieves svc_rdma_post_recv from the overhead of allocating and
        DMA-mapping a fresh buffer.
      
      * svc_rdma_wc_receive no longer has to DMA unmap the Receive buffer.
        It has to DMA sync only the number of bytes that were received.
      
      * svc_rdma_build_arg_xdr no longer has to free a page in rq_pages
        for each page in the Receive buffer, making it a constant-time
        function.
      
      * The Receive buffer is now plugged directly into the rq_arg's
        head[0] iovec, and can be larger than a page without spilling
        over into rq_arg's page list. This enables simplification of
        the RDMA Read path in subsequent patches.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
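
      A minimal sketch of the single-SGE Receive setup; names are
      illustrative, and the real code caches all of this in
      svc_rdma_recv_ctxt:

        #include <rdma/ib_verbs.h>
        #include <linux/slab.h>

        /* Sketch: one kmalloc'd buffer, mapped once, posted with a single
         * SGE no matter how large the inline threshold is.  On completion
         * only a DMA sync of the received byte count is needed, never an
         * unmap.
         */
        static int map_recv_buffer_sketch(struct ib_device *device,
                                          struct ib_sge *sge, void **buf,
                                          size_t inline_size, u32 lkey)
        {
                *buf = kmalloc(inline_size, GFP_KERNEL);
                if (!*buf)
                        return -ENOMEM;

                sge->addr = ib_dma_map_single(device, *buf, inline_size,
                                              DMA_FROM_DEVICE);
                if (ib_dma_mapping_error(device, sge->addr)) {
                        kfree(*buf);
                        *buf = NULL;
                        return -EIO;
                }
                sge->length = inline_size;
                sge->lkey = lkey;
                return 0;
        }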
    • svcrdma: Simplify svc_rdma_recv_ctxt_put · 1e5f4160
      Committed by Chuck Lever
      Currently svc_rdma_recv_ctxt_put's callers have to know whether they
      want to free the ctxt's pages or not. That means developers have to
      know when and why to set the free_pages argument.
      
      Instead, the ctxt should carry that information with it so that
      svc_rdma_recv_ctxt_put does the right thing no matter who is
      calling.
      
      We want to keep track of the number of pages in the Receive buffer
      separately from the number of pages pulled over by RDMA Read. This
      is so that the correct number of pages can be freed properly and
      that number is well-documented.
      
      So now, rc_hdr_count is the number of pages consumed by head[0]
      (i.e., the page index where the Read chunk should start); and
      rc_page_count is always the number of pages that need to be released
      when the ctxt is put.
      
      The @free_pages argument is no longer needed.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
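
      A minimal sketch of the put path once the ctxt carries its own page
      count; field names follow this series, and the return of the ctxt to
      its free list is elided:

        #include <linux/mm.h>

        /* Sketch: callers no longer pass a free_pages flag -- the ctxt
         * itself records how many of its pages must be released.
         */
        static void recv_ctxt_put_pages_sketch(struct page **rc_pages,
                                               unsigned int rc_page_count)
        {
                unsigned int i;

                for (i = 0; i < rc_page_count; i++)
                        put_page(rc_pages[i]);
                /* ... then return the ctxt to the transport's free list ... */
        }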
    • svcrdma: Remove sc_rq_depth · 2c577bfe
      Committed by Chuck Lever
      Clean up: No need to retain rq_depth in struct svcrdma_xprt; it is
      used only in svc_rdma_accept().
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
    • svcrdma: Introduce svc_rdma_recv_ctxt · ecf85b23
      Committed by Chuck Lever
      svc_rdma_op_ctxt's are pre-allocated and maintained on a per-xprt
      free list. This eliminates the overhead of calling kmalloc / kfree,
      both of which grab a globally shared lock that disables interrupts.
      To reduce contention further, separate the use of these objects in
      the Receive and Send paths in svcrdma.
      
      Subsequent patches will take advantage of this separation by
      allocating real resources which are then cached in these objects.
      The allocations are freed when the transport is torn down.
      
      I've renamed the structure so that static type checking can be used
      to ensure that uses of op_ctxt and recv_ctxt are not confused. As an
      additional clean up, structure fields are renamed to conform with
      kernel coding conventions.
      
      As a final clean up, helpers related to recv_ctxt are moved closer
      to the functions that use them.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
  9. 21 Mar 2018, 1 commit
    • svcrdma: Consult max_qp_init_rd_atom when accepting connections · 97cc3264
      Committed by Chuck Lever
      The target needs to return the lesser of the client's Inbound RDMA
      Read Queue Depth (IRD), provided in the connection parameters, and
      the local device's Outbound RDMA Read Queue Depth (ORD). The latter
      limit is max_qp_init_rd_atom, not max_qp_rd_atom.
      
      The svcrdma_ord value caps the ORD value for iWARP transports, which
      do not exchange ORD/IRD values at connection time. Since no other
      Linux kernel RDMA-enabled storage target sees fit to provide this
      cap, I'm removing it here too.
      
      initiator_depth is a u8, so ensure the computed ORD value does not
      overflow that field.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
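
      A minimal sketch of the computation described above; parameter names
      mirror the text, and this is not the exact upstream code:

        #include <linux/kernel.h>

        /* Sketch: accept the lesser of the client's advertised IRD and the
         * device's max_qp_init_rd_atom, clamped so it fits the u8
         * initiator_depth field in the connection parameters.
         */
        static u8 compute_responder_ord_sketch(unsigned int client_ird,
                                               unsigned int max_qp_init_rd_atom)
        {
                unsigned int ord = min(client_ird, max_qp_init_rd_atom);

                return min_t(unsigned int, ord, 255U);
        }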
  10. 19 Jan 2018, 1 commit
  11. 13 Jul 2017, 5 commits
  12. 29 Jun 2017, 1 commit
  13. 26 Apr 2017, 4 commits