1. 10 Aug 2018, 1 commit
  2. 12 May 2018, 9 commits
    • svcrdma: Don't overrun the SGE array in svc_rdma_send_ctxt · 25fd86ec
      Chuck Lever authored
      Receive buffers are always the same size, but each Send WR has a
      variable number of SGEs, based on the contents of the xdr_buf being
      sent.
      
      While assembling a Send WR, keep track of the number of SGEs so that
      we don't exceed the device's maximum, or walk off the end of the
      Send SGE array.
      
      For now the Send path just fails if it exceeds the maximum.
      
      The current logic in svc_rdma_accept bases the maximum number of
      Send SGEs on the largest NFS request that can be sent or received.
      In the transport layer, the limit is actually based on the
      capabilities of the underlying device, not on properties of the
      Upper Layer Protocol.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
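
      A minimal sketch of the bounds check this describes (the field
      names are modeled on the svcrdma style and are assumptions, not
      quotes from the patch):

        /* Sketch: refuse to add an SGE once the device's limit is
         * reached, failing the Send instead of overrunning the
         * array. lkey setup is omitted.
         */
        static int svc_rdma_add_sge(struct svcxprt_rdma *rdma,
                                    struct svc_rdma_send_ctxt *ctxt,
                                    u64 dma_addr, u32 length)
        {
                if (ctxt->sc_cur_sge_no >= rdma->sc_max_send_sges)
                        return -EIO;

                ctxt->sc_sges[ctxt->sc_cur_sge_no].addr = dma_addr;
                ctxt->sc_sges[ctxt->sc_cur_sge_no].length = length;
                ctxt->sc_send_wr.num_sge = ++ctxt->sc_cur_sge_no;
                return 0;
        }
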
    • svcrdma: Introduce svc_rdma_send_ctxt · 4201c746
      Chuck Lever authored
      svc_rdma_op_ctxt's are pre-allocated and maintained on a per-xprt
      free list. This eliminates the overhead of calling kmalloc / kfree,
      both of which grab a globally shared lock that disables interrupts.
      Introduce a replacement for svc_rdma_op_ctxt that is built
      especially for the svcrdma Send path.
      
      Subsequent patches will take advantage of this new structure by
      allocating real resources which are then cached in these objects.
      The allocations are freed when the transport is torn down.
      
      I've renamed the structure so that static type checking can be used
      to ensure that uses of op_ctxt and send_ctxt are not confused. As an
      additional clean up, structure fields are renamed to conform with
      kernel coding conventions.
      
      Additional clean ups:
      - Handle svc_rdma_send_ctxt_get allocation failure at each call
        site, rather than pre-allocating and hoping we guessed correctly
      - All send_ctxt_put call-sites request page freeing, so remove
        the @free_pages argument
      - All send_ctxt_put call-sites unmap SGEs, so fold that into
        svc_rdma_send_ctxt_put
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
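
      The free-list pattern described above, in miniature (the list,
      lock, and field names are illustrative assumptions):

        /* Sketch: a per-xprt free list that keeps kmalloc/kfree off
         * the hot path. Callers of _get handle a NULL return at each
         * call site, as noted above.
         */
        static struct svc_rdma_send_ctxt *
        svc_rdma_send_ctxt_get(struct svcxprt_rdma *rdma)
        {
                struct svc_rdma_send_ctxt *ctxt = NULL;

                spin_lock(&rdma->sc_send_lock);
                if (!list_empty(&rdma->sc_send_ctxts)) {
                        ctxt = list_first_entry(&rdma->sc_send_ctxts,
                                                struct svc_rdma_send_ctxt,
                                                sc_list);
                        list_del(&ctxt->sc_list);
                }
                spin_unlock(&rdma->sc_send_lock);
                return ctxt;
        }

        static void svc_rdma_send_ctxt_put(struct svcxprt_rdma *rdma,
                                           struct svc_rdma_send_ctxt *ctxt)
        {
                /* DMA-unmap SGEs and release pages here, then recycle */
                spin_lock(&rdma->sc_send_lock);
                list_add(&ctxt->sc_list, &rdma->sc_send_ctxts);
                spin_unlock(&rdma->sc_send_lock);
        }
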
    • svcrdma: Persistently allocate and DMA-map Receive buffers · 3316f063
      Chuck Lever authored
      The current Receive path uses an array of pages which are allocated
      and DMA mapped when each Receive WR is posted, and then handed off
      to the upper layer in rqstp::rq_arg. The page flip releases unused
      pages in the rq_pages pagelist. This mechanism introduces a
      significant amount of overhead.
      
      So instead, kmalloc the Receive buffer, and leave it DMA-mapped
      while the transport remains connected. This confers a number of
      benefits:
      
      * Each Receive WR requires only one receive SGE, no matter how large
        the inline threshold is. This helps the server-side NFS/RDMA
        transport operate on less capable RDMA devices.
      
      * The Receive buffer is left allocated and mapped all the time. This
        relieves svc_rdma_post_recv from the overhead of allocating and
        DMA-mapping a fresh buffer.
      
      * svc_rdma_wc_receive no longer has to DMA unmap the Receive buffer.
        It has to DMA sync only the number of bytes that were received.
      
      * svc_rdma_build_arg_xdr no longer has to free a page in rq_pages
        for each page in the Receive buffer, making it a constant-time
        function.
      
      * The Receive buffer is now plugged directly into the rq_arg's
        head[0] iovec, and can be larger than a page without spilling
        over into rq_arg's page list. This enables simplification of
        the RDMA Read path in subsequent patches.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
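
      A sketch of the map-once, sync-per-receive pattern (the buffer
      size and field names are assumptions):

        /* Sketch: allocate and DMA-map the Receive buffer once, when
         * the recv_ctxt is created; it stays mapped for the life of
         * the connection.
         */
        ctxt->rc_recv_buf = kmalloc(rdma->sc_max_req_size, GFP_KERNEL);
        ctxt->rc_recv_dma = ib_dma_map_single(rdma->sc_cm_id->device,
                                              ctxt->rc_recv_buf,
                                              rdma->sc_max_req_size,
                                              DMA_FROM_DEVICE);

        /* Per completion, only the received bytes need a CPU sync: */
        ib_dma_sync_single_for_cpu(rdma->sc_cm_id->device,
                                   ctxt->rc_recv_dma, wc->byte_len,
                                   DMA_FROM_DEVICE);
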
    • svcrdma: Remove sc_rq_depth · 2c577bfe
      Chuck Lever authored
      Clean up: there is no need to retain rq_depth in struct
      svcxprt_rdma; it is used only in svc_rdma_accept().
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
    • svcrdma: Introduce svc_rdma_recv_ctxt · ecf85b23
      Chuck Lever authored
      svc_rdma_op_ctxt's are pre-allocated and maintained on a per-xprt
      free list. This eliminates the overhead of calling kmalloc / kfree,
      both of which grab a globally shared lock that disables interrupts.
      To reduce contention further, separate the use of these objects in
      the Receive and Send paths in svcrdma.
      
      Subsequent patches will take advantage of this separation by
      allocating real resources which are then cached in these objects.
      The allocations are freed when the transport is torn down.
      
      I've renamed the structure so that static type checking can be used
      to ensure that uses of op_ctxt and recv_ctxt are not confused. As an
      additional clean up, structure fields are renamed to conform with
      kernel coding conventions.
      
      As a final clean up, helpers related to recv_ctxt are moved closer
      to the functions that use them.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
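
      The separation amounts to giving each path its own pool and
      lock, roughly (the field names are illustrative):

        /* Sketch: one context pool per path, so Receive and Send
         * completions never contend on the same lock.
         */
        struct svcxprt_rdma {
                /* ... */
                spinlock_t       sc_recv_lock;
                struct list_head sc_recv_ctxts; /* svc_rdma_recv_ctxt */
                spinlock_t       sc_send_lock;
                struct list_head sc_send_ctxts; /* svc_rdma_send_ctxt */
        };
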
    • svcrdma: Trace key RDMA API events · bd2abef3
      Chuck Lever authored
      This includes:
        * Posting on the Send and Receive queues
        * Send, Receive, Read, and Write completion
        * Connect upcalls
        * QP errors
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
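
      A hedged illustration of the TRACE_EVENT style this implies (the
      event name and fields here are invented, not taken from the
      patch):

        /* Sketch: a trace event fired from a Send completion
         * handler, recording the wire completion status. */
        TRACE_EVENT(svcrdma_wc_send_example,
                TP_PROTO(const struct ib_wc *wc),
                TP_ARGS(wc),
                TP_STRUCT__entry(
                        __field(u32, status)
                        __field(u32, vendor_err)
                ),
                TP_fast_assign(
                        __entry->status = wc->status;
                        __entry->vendor_err = wc->vendor_err;
                ),
                TP_printk("status=%u vendor_err=%u",
                          __entry->status, __entry->vendor_err)
        );
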
    • svcrdma: Trace key RPC/RDMA protocol events · 98895edb
      Chuck Lever authored
      This includes:
        * Transport accept and tear-down
        * Decisions about using Write and Reply chunks
        * Each RDMA segment that is handled
        * Whenever an RDMA_ERR is sent
      
      As a clean-up, I've standardized the order of the includes, and
      removed some now redundant dprintk call sites.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
    • svcrdma: Use passed-in net namespace when creating RDMA listener · 8dafcbee
      Chuck Lever authored
      Ensure each RDMA listener and its child transports are created
      in the same net namespace as the user that started the NFS
      service. This is similar to how listener sockets are created in
      svc_create_socket, and is required to enable container support.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
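
      The plumbing is essentially passing the caller's namespace down
      to rdma_create_id() instead of using init_net; a sketch (the
      handler and context names are assumptions):

        static struct svc_xprt *svc_rdma_create(struct svc_serv *serv,
                                                struct net *net,
                                                struct sockaddr *sa,
                                                int salen, int flags)
        {
                struct rdma_cm_id *listen_id;

                /* ... */
                listen_id = rdma_create_id(net, svc_rdma_listen_handler,
                                           cma_xprt, RDMA_PS_TCP,
                                           IB_QPT_RC);
                /* ... */
        }
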
  3. 04 Apr 2018, 2 commits
  4. 21 Mar 2018, 2 commits
  5. 19 Jan 2018, 1 commit
  6. 08 Nov 2017, 1 commit
  7. 06 Sep 2017, 2 commits
    • svcrdma: Estimate Send Queue depth properly · 26fb2254
      Chuck Lever authored
      The rdma_rw API adjusts max_send_wr upwards during the
      rdma_create_qp() call. If the ULP actually wants to take advantage
      of these extra resources, it must increase the size of its send
      completion queue (created before rdma_create_qp is called) and
      increase its send queue accounting limit.
      
      Use the new rdma_rw_mr_factor API to figure out the correct value
      to use for the Send Queue and Send Completion Queue depths.
      
      And, ensure that the chosen Send Queue depth for a newly created
      transport does not overrun the QP WR limit of the underlying device.
      
      Lastly, there's no longer a need to carry the Send Queue depth in
      struct svcxprt_rdma, since the value is used only in the
      svc_rdma_accept() path.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
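
      A sketch of the computation inside svc_rdma_accept() (the
      arithmetic is illustrative; the same max_qp_wr clamp protects
      the Receive Queue depth in the next entry):

        /* Sketch: account for the extra WRs the rdma_rw core will
         * post, then keep the Send Queue depth within the device's
         * WR limit.
         */
        ctxts = rdma_rw_mr_factor(dev, newxprt->sc_port_num,
                                  RPCSVC_MAXPAGES);
        newxprt->sc_sq_depth = rq_depth + ctxts * 2;
        if (newxprt->sc_sq_depth > dev->attrs.max_qp_wr)
                newxprt->sc_sq_depth = dev->attrs.max_qp_wr;
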
    • svcrdma: Limit RQ depth · 5a25bfd2
      Chuck Lever authored
      Ensure that the chosen Receive Queue depth for a newly created
      transport does not overrun the QP WR limit of the underlying device.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
  8. 25 Aug 2017, 1 commit
  9. 13 Jul 2017, 5 commits
  10. 26 Apr 2017, 5 commits
    • svcrdma: Remove the req_map cache · 2cf32924
      Chuck Lever authored
      req_maps are no longer used by the send path and can thus be removed.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
    • svcrdma: Remove unused RDMA Write completion handler · 68cc4636
      Chuck Lever authored
      Clean up. All RDMA Write completions are now handled by
      svc_rdma_wc_write_ctx.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
    • svcrdma: Use rdma_rw API in RPC reply path · 9a6a180b
      Chuck Lever authored
      The current svcrdma sendto code path posts one RDMA Write WR at a
      time. Each of these Writes typically carries a small number of pages
      (for instance, up to 30 pages for mlx4 devices). That means a 1MB
      NFS READ reply requires 9 ib_post_send() calls for the Write WRs,
      and one for the Send WR carrying the actual RPC Reply message.
      
      Instead, use the new rdma_rw API. The details of Write WR chain
      construction and memory registration are taken care of in the RDMA
      core. svcrdma can focus on the details of the RPC-over-RDMA
      protocol. This gives three main benefits:
      
      1. All Write WRs for one RDMA segment are posted in a single
      chain, requiring as few as one ib_post_send() per Write chunk.
      
      2. The Write path can now use FRWR to register the Write buffers.
      If the device's maximum page list depth is large, this means a
      single Write WR is needed for each RPC's Write chunk data.
      
      3. The new code introduces support for RPCs that carry both a Write
      list and a Reply chunk. This combination can be used for an NFSv4
      READ where the data payload is large, and thus is removed from the
      Payload Stream, but the Payload Stream is still larger than the
      inline threshold.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
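
      The shape of the rdma_rw calls, assuming a scatterlist has
      already been built for one RDMA segment (the context field names
      are assumptions; error handling is elided):

        /* Sketch: the rdma_rw core registers the buffers and builds
         * the whole Write WR chain; one post covers the segment.
         */
        ret = rdma_rw_ctx_init(&ctxt->rw_ctx, rdma->sc_qp,
                               rdma->sc_port_num, ctxt->rw_sg_table.sgl,
                               ctxt->rw_nents, 0, seg_offset, seg_rkey,
                               DMA_TO_DEVICE);
        if (ret < 0)
                goto out_err;
        ret = rdma_rw_ctx_post(&ctxt->rw_ctx, rdma->sc_qp,
                               rdma->sc_port_num, &ctxt->rw_cqe, NULL);
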
    • svcrdma: Introduce local rdma_rw API helpers · f13193f5
      Chuck Lever authored
      The plan is to replace the local bespoke code that constructs
      and posts RDMA Read and Write Work Requests with calls to the
      rdma_rw API. This shares with other RDMA-enabled ULPs the code
      that manages the gory details of buffer registration and posting
      Work Requests.
      
      Some design notes:
      
       o The structure of RPC-over-RDMA transport headers is flexible,
         allowing multiple segments per Reply with arbitrary alignment,
         each with a unique R_key. Write and Send WRs continue to be
         built and posted in separate code paths. However, one whole
         chunk (with one or more RDMA segments apiece) gets exactly
         one ib_post_send and one work completion.
      
       o svc_xprt reference counting is modified, since a chain of
         rdma_rw_ctx structs generates one completion, no matter how
         many Write WRs are posted.
      
       o The current code builds the transport header as it is
         constructing Write WRs. I've replaced that with marshaling of transport
         header data items in a separate step. This is because the exact
         structure of client-provided segments may not align with the
         components of the server's reply xdr_buf, or the pages in the
         page list. Thus parts of each client-provided segment may be
         written at different points in the send path.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
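
      Since a chain of rdma_rw_ctx structs signals a single
      completion, teardown happens once, in the handler; a sketch (the
      names are modeled on the svcrdma style, not quoted from the
      patch):

        static void svc_rdma_wc_write_ctx(struct ib_cq *cq,
                                          struct ib_wc *wc)
        {
                struct svcxprt_rdma *rdma = cq->cq_context;
                struct svc_rdma_rw_ctxt *ctxt =
                        container_of(wc->wr_cqe,
                                     struct svc_rdma_rw_ctxt, rw_cqe);

                /* One destroy covers every Write WR in the chain. */
                rdma_rw_ctx_destroy(&ctxt->rw_ctx, rdma->sc_qp,
                                    rdma->sc_port_num,
                                    ctxt->rw_sg_table.sgl,
                                    ctxt->rw_nents, DMA_TO_DEVICE);
                svc_xprt_put(&rdma->sc_xprt);
        }
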
    • svcrdma: Eliminate RPCRDMA_SQ_DEPTH_MULT · b623589d
      Chuck Lever authored
      The Send Queue depth is temporarily reduced to 1 SQE per credit. The
      new rdma_rw API does an internal computation, during QP creation, to
      increase the depth of the Send Queue to handle RDMA Read and Write
      operations.
      
      This change has to come before the NFSD code paths are updated to
      use the rdma_rw API. Without this patch, rdma_rw_init_qp() increases
      the size of the SQ too much, resulting in memory allocation failures
      during QP creation.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
  11. 29 Mar 2017, 1 commit
  12. 25 Feb 2017, 1 commit
  13. 09 Feb 2017, 4 commits
  14. 14 Jan 2017, 1 commit
    • locking/atomic, kref: Add kref_read() · 2c935bc5
      Peter Zijlstra authored
      Since we need to change the implementation, stop exposing internals.
      
      Provide kref_read() to read the current reference count; typically
      used for debug messages.
      
      Kills two anti-patterns:
      
      	atomic_read(&kref->refcount)
      	kref->refcount.counter
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
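
      A trivial usage sketch (an svc_xprt and its xpt_ref kref stand
      in for any kref user):

        /* Sketch: read the count for a debug message, without
         * touching kref internals.
         */
        pr_debug("xprt %p has %u references\n",
                 xprt, kref_read(&xprt->xpt_ref));
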
  15. 01 Dec 2016, 4 commits
    • svcrdma: Break up dprintk format in svc_rdma_accept() · 07257450
      Chuck Lever authored
      The current code results in:
      
      Nov  7 14:50:19 klimt kernel: svcrdma: newxprt->sc_cm_id=ffff88085590c800,
       newxprt->sc_pd=ffff880852a7ce00#012    cm_id->device=ffff88084dd20000,
       sc_pd->device=ffff88084dd20000#012    cap.max_send_wr = 272#012
       cap.max_recv_wr = 34#012    cap.max_send_sge = 32#012
       cap.max_recv_sge = 32
      Nov  7 14:50:19 klimt kernel: svcrdma: new connection ffff880855908000
       accepted with the following attributes:#012    local_ip        :
       10.0.0.5#012    local_port#011     : 20049#012    remote_ip       :
       10.0.0.2#012    remote_port     : 59909#012    max_sge         : 32#012
       max_sge_rd      : 30#012    sq_depth        : 272#012    max_requests    :
       32#012    ord             : 16
      
      Split up the output over multiple dprintks and take the opportunity
      to fix the display of IPv6 addresses.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
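
      The fix is mechanical: one attribute per dprintk call, so syslog
      has no embedded newlines to escape into #012, plus %pIS
      formatting for the addresses; roughly (the attribute selection
      is abbreviated, and sap is assumed to point at the peer's
      sockaddr):

        dprintk("svcrdma: new connection %p accepted:\n", newxprt);
        dprintk("    remote address  : %pIS:%u\n",
                sap, rpc_get_port(sap));
        dprintk("    sq_depth        : %u\n", newxprt->sc_sq_depth);
        dprintk("    max_requests    : %u\n", newxprt->sc_max_requests);
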
    • svcrdma: Remove svc_rdma_op_ctxt::wc_status · 96a58f9c
      Chuck Lever authored
      Clean up: Completion status is already reported in the individual
      completion handlers. Save a few bytes in struct svc_rdma_op_ctxt.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
    • svcrdma: Remove DMA map accounting · dd6fd213
      Chuck Lever authored
      Clean up: sc_dma_used is not required for correct operation. It is
      simply a debugging tool to report when svcrdma has leaked DMA maps.
      
      However, manipulating an atomic has a measurable CPU cost, and DMA
      map accounting specific to svcrdma will be meaningless once svcrdma
      is converted to use the new generic r/w API.
      
      A similar kind of debug accounting can be done simply by enabling
      the IOMMU or by using CONFIG_DMA_API_DEBUG, CONFIG_IOMMU_DEBUG, and
      CONFIG_IOMMU_LEAK.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
    • svcrdma: Remove BH-disabled spin locking in svc_rdma_send() · e4eb42ce
      Chuck Lever authored
      svcrdma's current SQ accounting algorithm takes sc_lock and disables
      bottom-halves while posting all RDMA Read, Write, and Send WRs.
      
      This is relatively heavyweight serialization. And note that Write and
      Send are already fully serialized by the xpt_mutex.
      
      Using a single atomic_t should be all that is necessary to guarantee
      that ib_post_send() is called only when there is enough space on the
      send queue. This is what the other RDMA-enabled storage targets do.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
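
      The single-atomic pattern, sketched (the field names match the
      style of the series but are assumptions here):

        /* Sketch: reserve Send Queue slots with one atomic op; back
         * out and wait if the queue is full. No lock, no BH disable.
         */
        while (atomic_sub_return(wr_count, &xprt->sc_sq_avail) < 0) {
                atomic_add(wr_count, &xprt->sc_sq_avail);
                wait_event(xprt->sc_send_wait,
                           atomic_read(&xprt->sc_sq_avail) > wr_count);
        }
        ret = ib_post_send(xprt->sc_qp, wr, &bad_wr);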