1. 19 2月, 2015 1 次提交
  2. 22 7月, 2014 1 次提交
  3. 16 7月, 2014 3 次提交
    • H
      cxgb4/iw_cxgb4: work request logging feature · 7730b4c7
      Hariprasad Shenai 提交于
      This commit enhances the iwarp driver to optionally keep a log of rdma
      work request timining data for kernel mode QPs.  If iw_cxgb4 module option
      c4iw_wr_log is set to non-zero, each work request is tracked and timing
      data maintained in a rolling log that is 4096 entries deep by default.
      Module option c4iw_wr_log_size_order allows specifing a log2 size to use
      instead of the default order of 12 (4096 entries). Both module options
      are read-only and must be passed in at module load time to set them. IE:
      
      modprobe iw_cxgb4 c4iw_wr_log=1 c4iw_wr_log_size_order=10
      
      The timing data is viewable via the iw_cxgb4 debugfs file "wr_log".
      Writing anything to this file will clear all the timing data.
      Data tracked includes:
      
      - The host time when the work request was posted, just before ringing
      the doorbell.  The host time when the completion was polled by the
      application.  This is also the time the log entry is created.  The delta
      of these two times is the amount of time took processing the work request.
      
      - The qid of the EQ used to post the work request.
      
      - The work request opcode.
      
      - The cqe wr_id field.  For sq completions requests this is the swsqe
      index.  For recv completions this is the MSN of the ingress SEND.
      This value can be used to match log entries from this log with firmware
      flowc event entries.
      
      - The sge timestamp value just before ringing the doorbell when
      posting,  the sge timestamp value just after polling the completion,
      and CQE.timestamp field from the completion itself.  With these three
      timestamps we can track the latency from post to poll, and the amount
      of time the completion resided in the CQ before being reaped by the
      application.  With debug firmware, the sge timestamp is also logged by
      firmware in its flowc history so that we can compute the latency from
      posting the work request until the firmware sees it.
      Signed-off-by: NSteve Wise <swise@opengridcomputing.com>
      Signed-off-by: NHariprasad Shenai <hariprasad@chelsio.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7730b4c7
    • H
      cxgb4/iw_cxgb4: use firmware ord/ird resource limits · 4c2c5763
      Hariprasad Shenai 提交于
      Advertise a larger max read queue depth for qps, and gather the resource limits
      from fw and use them to avoid exhaustinq all the resources.
      
      Design:
      
      cxgb4:
      
      Obtain the max_ordird_qp and max_ird_adapter device params from FW
      at init time and pass them up to the ULDs when they attach.  If these
      parameters are not available, due to older firmware, then hard-code
      the values based on the known values for older firmware.
      iw_cxgb4:
      
      Fix the c4iw_query_device() to report these correct values based on
      adapter parameters.  ibv_query_device() will always return:
      
      max_qp_rd_atom = max_qp_init_rd_atom = min(module_max, max_ordird_qp)
      max_res_rd_atom = max_ird_adapter
      
      Bump up the per qp max module option to 32, allowing it to be increased
      by the user up to the device max of max_ordird_qp.  32 seems to be
      sufficient to maximize throughput for streaming read benchmarks.
      
      Fail connection setup if the negotiated IRD exhausts the available
      adapter ird resources.  So the driver will track the amount of ird
      resource in use and not send an RI_WR/INIT to FW that would reduce the
      available ird resources below zero.
      Signed-off-by: NSteve Wise <swise@opengridcomputing.com>
      Signed-off-by: NHariprasad Shenai <hariprasad@chelsio.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4c2c5763
    • H
      iw_cxgb4: Detect Ing. Padding Boundary at run-time · 04e10e21
      Hariprasad Shenai 提交于
      Updates iw_cxgb4 to determine the Ingress Padding Boundary from
      cxgb4_lld_info, and take subsequent actions.
      Signed-off-by: NSteve Wise <swise@opengridcomputing.com>
      Signed-off-by: NHariprasad Shenai <hariprasad@chelsio.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      04e10e21
  4. 14 7月, 2014 1 次提交
  5. 11 6月, 2014 2 次提交
  6. 29 4月, 2014 1 次提交
    • S
      RDMA/cxgb4: Fix endpoint mutex deadlocks · cc18b939
      Steve Wise 提交于
      In cases where the cm calls c4iw_modify_rc_qp() with the endpoint
      mutex held, they must be called with internal == 1.  rx_data() and
      process_mpa_reply() are not doing this.  This causes a deadlock
      because c4iw_modify_rc_qp() might call c4iw_ep_disconnect() in some
      !internal cases, and c4iw_ep_disconnect() acquires the endpoint mutex.
      The design was intended to only do the disconnect for !internal calls.
      
      Change rx_data(), FPDU_MODE case, to call c4iw_modify_rc_qp() with
      internal == 1, and then disconnect only after releasing the mutex.
      
      Change process_mpa_reply() to call c4iw_modify_rc_qp(TERMINATE) with
      internal == 1 and set a new attr flag telling it to send a TERMINATE
      message.  Previously this was implied by !internal.
      
      Change process_mpa_reply() to return whether the caller should
      disconnect after releasing the endpoint mutex.  Now rx_data() will do
      the disconnect in the cases where process_mpa_reply() wants to
      disconnect after the TERMINATE is sent.
      
      Change c4iw_modify_rc_qp() RTS->TERM to only disconnect if !internal,
      and to send a TERMINATE message if attrs->send_term is 1.
      
      Change abort_connection() to not aquire the ep mutex for setting the
      state, and make all calls to abort_connection() do so with the mutex
      held.
      Signed-off-by: NSteve Wise <swise@opengridcomputing.com>
      Signed-off-by: NRoland Dreier <roland@purestorage.com>
      cc18b939
  7. 12 4月, 2014 1 次提交
  8. 21 3月, 2014 2 次提交
  9. 15 3月, 2014 1 次提交
    • S
      cxgb4/iw_cxgb4: Doorbell Drop Avoidance Bug Fixes · 05eb2389
      Steve Wise 提交于
      The current logic suffers from a slow response time to disable user DB
      usage, and also fails to avoid DB FIFO drops under heavy load. This commit
      fixes these deficiencies and makes the avoidance logic more optimal.
      This is done by more efficiently notifying the ULDs of potential DB
      problems, and implements a smoother flow control algorithm in iw_cxgb4,
      which is the ULD that puts the most load on the DB fifo.
      
      Design:
      
      cxgb4:
      
      Direct ULD callback from the DB FULL/DROP interrupt handler.  This allows
      the ULD to stop doing user DB writes as quickly as possible.
      
      While user DB usage is disabled, the LLD will accumulate DB write events
      for its queues.  Then once DB usage is reenabled, a single DB write is
      done for each queue with its accumulated write count.  This reduces the
      load put on the DB fifo when reenabling.
      
      iw_cxgb4:
      
      Instead of marking each qp to indicate DB writes are disabled, we create
      a device-global status page that each user process maps.  This allows
      iw_cxgb4 to only set this single bit to disable all DB writes for all
      user QPs vs traversing the idr of all the active QPs.  If the libcxgb4
      doesn't support this, then we fall back to the old approach of marking
      each QP.  Thus we allow the new driver to work with an older libcxgb4.
      
      When the LLD upcalls iw_cxgb4 indicating DB FULL, we disable all DB writes
      via the status page and transition the DB state to STOPPED.  As user
      processes see that DB writes are disabled, they call into iw_cxgb4
      to submit their DB write events.  Since the DB state is in STOPPED,
      the QP trying to write gets enqueued on a new DB "flow control" list.
      As subsequent DB writes are submitted for this flow controlled QP, the
      amount of writes are accumulated for each QP on the flow control list.
      So all the user QPs that are actively ringing the DB get put on this
      list and the number of writes they request are accumulated.
      
      When the LLD upcalls iw_cxgb4 indicating DB EMPTY, which is in a workq
      context, we change the DB state to FLOW_CONTROL, and begin resuming all
      the QPs that are on the flow control list.  This logic runs on until
      the flow control list is empty or we exit FLOW_CONTROL mode (due to
      a DB DROP upcall, for example).  QPs are removed from this list, and
      their accumulated DB write counts written to the DB FIFO.  Sets of QPs,
      called chunks in the code, are removed at one time. The chunk size is 64.
      So 64 QPs are resumed at a time, and before the next chunk is resumed, the
      logic waits (blocks) for the DB FIFO to drain.  This prevents resuming to
      quickly and overflowing the FIFO.  Once the flow control list is empty,
      the db state transitions back to NORMAL and user QPs are again allowed
      to write directly to the user DB register.
      
      The algorithm is designed such that if the DB write load is high enough,
      then all the DB writes get submitted by the kernel using this flow
      controlled approach to avoid DB drops.  As the load lightens though, we
      resume to normal DB writes directly by user applications.
      Signed-off-by: NSteve Wise <swise@opengridcomputing.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      05eb2389
  10. 14 8月, 2013 2 次提交
    • S
      RDMA/cxgb4: Fix QP flush logic · 1cf24dce
      Steve Wise 提交于
      This patch makes following fixes in QP flush logic:
      
      - correctly flushes unsignaled WRs followed by a signaled WR
      - supports for flushing a CQ bound to multiple QPs
      - resets cidx_flush if a active queue starts getting HW CQEs again
      - marks WQ in error when we leave RTS. This was only being done for
        user queues, but we need it for kernel queues too so that
        post_send/post_recv will start returning the appropriate error
        synchronously
      - eats unsignaled read resp CQEs. HW always inserts CQEs so we must
        silently discard them if the read work request was unsignaled.
      - handles QP flushes with pending SW CQEs. The flush and out of order
        completion logic has a bug where if out of order completions are
        flushed but not yet polled by the consumer and the qp is then
        flushed then we end up inserting duplicate completions.
      - c4iw_flush_sq() should only flush wrs that have not already been
        flushed.  Since we already track where in the SQ we've flushed via
        sq.cidx_flush, just start at that point and flush any remaining.
        This bug only caused a problem in the presence of unsignaled work
        requests.
      Signed-off-by: NSteve Wise <swise@opengridcomputing.com>
      Signed-off-by: NVipul Pandya <vipul@chelsio.com>
      
      [ Fixed sparse warning due to htonl/ntohl confusion.  - Roland ]
      Signed-off-by: NRoland Dreier <roland@purestorage.com>
      1cf24dce
    • V
      RDMA/cxgb4: Add support for active and passive open connection with IPv6 address · 830662f6
      Vipul Pandya 提交于
      Add new cpl messages, cpl_act_open_req6 and cpl_t5_act_open_req6, for
      initiating active open connections.
      
      Use LLD api cxgb4_create_server and cxgb4_create_server6 for
      initiating passive open connections. Similarly use cxgb4_remove_server
      to remove the passive open connections in place of listen_stop.
      
      Add support for iWARP over VLAN device and enable IPv6 support on VLAN device.
      
      Make use of import_ep in c4iw_reconnect.
      Signed-off-by: NVipul Pandya <vipul@chelsio.com>
      
      [ Fix build when IPv6 is disabled and make sure iw_cxgb4 is not built-in
        when ipv6 is a module.  - Roland ]
      Signed-off-by: NRoland Dreier <roland@purestorage.com>
      830662f6
  11. 14 3月, 2013 4 次提交
  12. 28 2月, 2013 1 次提交
  13. 22 2月, 2013 1 次提交
    • S
      IB/core: Add "type 2" memory windows support · 7083e42e
      Shani Michaeli 提交于
      This patch enhances the IB core support for Memory Windows (MWs).
      
      MWs allow an application to have better/flexible control over remote
      access to memory.
      
      Two types of MWs are supported, with the second type having two flavors:
      
          Type 1  - associated with PD only
          Type 2A - associated with QPN only
          Type 2B - associated with PD and QPN
      
      Applications can allocate a MW once, and then repeatedly bind the MW
      to different ranges in MRs that are associated to the same PD. Type 1
      windows are bound through a verb, while type 2 windows are bound by
      posting a work request.
      
      The 32-bit memory key is composed of a 24-bit index and an 8-bit
      key. The key is changed with each bind, thus allowing more control
      over the peer's use of the memory key.
      
      The changes introduced are the following:
      
      * add memory window type enum and a corresponding parameter to ib_alloc_mw.
      * type 2 memory window bind work request support.
      * create a struct that contains the common part of the bind verb struct
        ibv_mw_bind and the bind work request into a single struct.
      * add the ib_inc_rkey helper function to advance the tag part of an rkey.
      
      Consumer interface details:
      
      * new device capability flags IB_DEVICE_MEM_WINDOW_TYPE_2A and
        IB_DEVICE_MEM_WINDOW_TYPE_2B are added to indicate device support
        for these features.
      
        Devices can set either IB_DEVICE_MEM_WINDOW_TYPE_2A or
        IB_DEVICE_MEM_WINDOW_TYPE_2B if it supports type 2A or type 2B
        memory windows. It can set neither to indicate it doesn't support
        type 2 windows at all.
      
      * modify existing provides and consumers code to the new param of
        ib_alloc_mw and the ib_mw_bind_info structure
      Signed-off-by: NHaggai Eran <haggaie@mellanox.com>
      Signed-off-by: NShani Michaeli <shanim@mellanox.com>
      Signed-off-by: NOr Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: NRoland Dreier <roland@purestorage.com>
      7083e42e
  14. 15 2月, 2013 2 次提交
  15. 20 12月, 2012 2 次提交
  16. 19 5月, 2012 6 次提交
  17. 01 11月, 2011 1 次提交
  18. 07 10月, 2011 1 次提交
  19. 25 5月, 2011 1 次提交
    • S
      RDMA/cxgb4: Use completion objects for event blocking · c337374b
      Steve Wise 提交于
      There exists a race condition when using wait_queue_head_t objects
      that are declared on the stack.  This was being done in a few places
      where we are sending work requests to the FW and awaiting replies, but
      we don't have an endpoint structure with an embedded c4iw_wr_wait
      struct.  So the code was allocating it locally on the stack.  Bad
      design.  The race is:
      
        1) thread on cpuX declares the wait_queue_head_t on the stack, then
           posts a firmware WR with that wait object ptr as the cookie to be
           returned in the WR reply.  This thread will proceed to block in
           wait_event_timeout() but before it does:
      
        2) An interrupt runs on cpuY with the WR reply.  fw6_msg() handles
           this and calls c4iw_wake_up().  c4iw_wake_up() sets the condition
           variable in the c4iw_wr_wait object to TRUE and will call
           wake_up(), but before it calls wake_up():
      
        3) The thread on cpuX calls c4iw_wait_for_reply(), which calls
           wait_event_timeout().  The wait_event_timeout() macro checks the
           condition variable and returns immediately since it is TRUE.  So
           this thread never blocks/sleeps. The function then returns
           effectively deallocating the c4iw_wr_wait object that was on the
           stack.
      
        4) So at this point cpuY has a pointer to the c4iw_wr_wait object
           that is no longer valid.  Further its pointing to a stack frame
           that might now be in use by some other context/thread.  So cpuY
           continues execution and calls wake_up() on a ptr to a wait object
           that as been effectively deallocated.
      
      This race, when it hits, can cause a crash in wake_up(), which I've
      seen under heavy stress. It can also corrupt the referenced stack
      which can cause any number of failures.
      
      The fix:
      
      Use struct completion, which supports on-stack declarations.
      Completions use a spinlock around setting the condition to true and
      the wake up so that steps 2 and 4 above are atomic and step 3 can
      never happen in-between.
      Signed-off-by: NSteve Wise <swise@opengridcomputing.com>
      c337374b
  20. 10 5月, 2011 3 次提交
    • S
      RDMA/cxgb4: EEH errors can hang the driver · 2f25e9a5
      Steve Wise 提交于
      A few more EEH fixes:
      
      c4iw_wait_for_reply(): detect fatal EEH condition on timeout and
      return an error.
      
      The iw_cxgb4 driver was only calling ib_deregister_device() on an EEH
      event followed by a ib_register_device() when the device was
      reinitialized.  However, the RDMA core doesn't allow multiple
      iterations of register/deregister by the provider. See
      drivers/infiniband/core/sysfs.c: ib_device_unregister_sysfs() where
      the kobject ref is held until the device is deallocated in
      ib_deallocate_device().  Calling deregister adds this kobj reference,
      and then a subsequent register call will generate a WARN_ON() from the
      kobject subsystem because the kobject is being initialized but is
      already initialized with the ref held.
      
      So the provider must deregister and dealloc when resetting for an EEH
      event, then alloc/register to re-initialize.  To do this, we cannot
      use the device ptr as our ULD handle since it will change with each
      reallocation.  This commit adds a ULD context struct which is used as
      the ULD handle, and then contains the device pointer and other state
      needed.
      Signed-off-by: NSteve Wise <swise@opengridcomputing.com>
      Signed-off-by: NRoland Dreier <roland@purestorage.com>
      2f25e9a5
    • S
      RDMA/cxgb4: Reset wait condition atomically · d9594d99
      Steve Wise 提交于
      The driver was never really waiting for RDMA_WR/FINI completions
      because the condition variable used to determine if the completion
      happened was never reset, and this condition variable is reused for
      both connection setup and teardown.  This causes various driver
      crashes under heavy loads due to releasing resources too early.
      
      The fix is to use atomic bits to correctly reset the condition
      immediately after the completion is detected.
      Signed-off-by: NSteve Wise <swise@opengridcomputing.com>
      Signed-off-by: NRoland Dreier <roland@purestorage.com>
      d9594d99
    • S
      RDMA/cxgb4: Don't change QP state outside EP lock · 30c95c2d
      Steve Wise 提交于
      Concurrent ingress CLOSE and ULP ABORT operations causes a crash due
      to a race condition where the close path releases the EP lock and then
      tries to move the QP state to CLOSED.  This must be done inside the EP
      lock to avoid the race.
      Signed-off-by: NSteve Wise <swise@opengridcomputing.com>
      Signed-off-by: NRoland Dreier <roland@purestorage.com>
      30c95c2d
  21. 15 3月, 2011 1 次提交
  22. 11 1月, 2011 1 次提交
  23. 15 11月, 2010 1 次提交