1. 19 May, 2012 · 5 commits
  2. 01 Nov, 2011 · 1 commit
  3. 07 Oct, 2011 · 1 commit
  4. 25 May, 2011 · 1 commit
    • RDMA/cxgb4: Use completion objects for event blocking · c337374b
      Committed by Steve Wise
      There exists a race condition when using wait_queue_head_t objects
      that are declared on the stack.  This was being done in a few places
      where we are sending work requests to the FW and awaiting replies, but
      we don't have an endpoint structure with an embedded c4iw_wr_wait
      struct.  So the code was allocating it locally on the stack.  Bad
      design.  The race is:
      
        1) thread on cpuX declares the wait_queue_head_t on the stack, then
           posts a firmware WR with that wait object ptr as the cookie to be
           returned in the WR reply.  This thread will proceed to block in
           wait_event_timeout() but before it does:
      
        2) An interrupt runs on cpuY with the WR reply.  fw6_msg() handles
           this and calls c4iw_wake_up().  c4iw_wake_up() sets the condition
           variable in the c4iw_wr_wait object to TRUE and will call
           wake_up(), but before it calls wake_up():
      
        3) The thread on cpuX calls c4iw_wait_for_reply(), which calls
           wait_event_timeout().  The wait_event_timeout() macro checks the
           condition variable and returns immediately since it is TRUE.  So
           this thread never blocks/sleeps.  The function then returns,
           effectively deallocating the c4iw_wr_wait object that was on the
           stack.
      
        4) So at this point cpuY has a pointer to the c4iw_wr_wait object
           that is no longer valid.  Further, it's pointing to a stack frame
           that might now be in use by some other context/thread.  So cpuY
           continues execution and calls wake_up() on a ptr to a wait object
           that has been effectively deallocated.
      
      This race, when it hits, can cause a crash in wake_up(), which I've
      seen under heavy stress. It can also corrupt the referenced stack
      which can cause any number of failures.
      
      The fix:
      
      Use struct completion, which supports on-stack declarations.
      Completions use a spinlock around setting the condition to true and
      the wake up so that steps 2 and 4 above are atomic and step 3 can
      never happen in-between.
      Signed-off-by: Steve Wise <swise@opengridcomputing.com>
      c337374b
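The idea behind the fix can be sketched in userspace C. This is not the driver's code: `completion_sketch` and its helpers are hypothetical stand-ins for the kernel's struct completion, built on pthreads, and the thread stands in for the firmware reply handler. The sketch only illustrates that the flag is set and the wake-up issued under one lock, so the waiter cannot observe the flag in between.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for the kernel's struct completion. */
struct completion_sketch {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    bool            done;
};

static void init_completion_sketch(struct completion_sketch *c)
{
    pthread_mutex_init(&c->lock, NULL);
    pthread_cond_init(&c->cond, NULL);
    c->done = false;
}

/* Set the condition and issue the wake-up while holding the lock, so the
 * two actions appear as one step to the waiter (steps 2 and 4 above). */
static void complete_sketch(struct completion_sketch *c)
{
    pthread_mutex_lock(&c->lock);
    c->done = true;
    pthread_cond_signal(&c->cond);
    pthread_mutex_unlock(&c->lock);
}

/* The waiter checks the condition only while holding the same lock, so it
 * cannot see done == true before the completer has finished its wake-up. */
static void wait_for_completion_sketch(struct completion_sketch *c)
{
    pthread_mutex_lock(&c->lock);
    while (!c->done)
        pthread_cond_wait(&c->cond, &c->lock);
    pthread_mutex_unlock(&c->lock);
}

/* Plays the role of the reply handler: the cookie is the wait object. */
static void *reply_handler(void *cookie)
{
    complete_sketch(cookie);
    return NULL;
}

/* Declares the wait object on the stack, hands its address out as the
 * cookie, and blocks until the handler completes it. */
static int post_request_and_wait(void)
{
    struct completion_sketch done;
    pthread_t handler;

    init_completion_sketch(&done);
    if (pthread_create(&handler, NULL, reply_handler, &done) != 0)
        return -1;
    wait_for_completion_sketch(&done);
    pthread_join(handler, NULL);   /* thread cleanup for this demo only */
    pthread_cond_destroy(&done.cond);
    pthread_mutex_destroy(&done.lock);
    return 0;
}
```

Calling `post_request_and_wait()` returns 0 once the handler thread has completed the on-stack object.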
  5. 10 May, 2011 · 3 commits
    • RDMA/cxgb4: EEH errors can hang the driver · 2f25e9a5
      Committed by Steve Wise
      A few more EEH fixes:
      
      c4iw_wait_for_reply(): detect fatal EEH condition on timeout and
      return an error.
      
      The iw_cxgb4 driver was only calling ib_deregister_device() on an EEH
      event followed by an ib_register_device() when the device was
      reinitialized.  However, the RDMA core doesn't allow multiple
      iterations of register/deregister by the provider. See
      drivers/infiniband/core/sysfs.c: ib_device_unregister_sysfs() where
      the kobject ref is held until the device is deallocated in
      ib_deallocate_device().  Calling deregister adds this kobj reference,
      and then a subsequent register call will generate a WARN_ON() from the
      kobject subsystem because the kobject is being initialized but is
      already initialized with the ref held.
      
      So the provider must deregister and dealloc when resetting for an EEH
      event, then alloc/register to re-initialize.  To do this, we cannot
      use the device ptr as our ULD handle since it will change with each
      reallocation.  This commit adds a ULD context struct which is used as
      the ULD handle, and then contains the device pointer and other state
      needed.
      Signed-off-by: Steve Wise <swise@opengridcomputing.com>
      Signed-off-by: Roland Dreier <roland@purestorage.com>
      2f25e9a5
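A minimal sketch of the handle indirection described in this commit, in plain userspace C. All names here (`uld_ctx_sketch`, `rdma_dev_sketch`, the helper functions) are hypothetical and do not come from the driver; the point is only that the caller keeps a stable context pointer while the device object behind it is freed and reallocated on each reset.

```c
#include <stdlib.h>

/* Hypothetical stand-in for the device object that is reallocated. */
struct rdma_dev_sketch {
    int registered;
};

/* Hypothetical ULD context: the stable handle given to the lower layer.
 * It owns whatever device pointer is current. */
struct uld_ctx_sketch {
    struct rdma_dev_sketch *dev;
    unsigned int resets;
};

/* alloc + register */
static int uld_ctx_bring_up(struct uld_ctx_sketch *ctx)
{
    ctx->dev = calloc(1, sizeof(*ctx->dev));
    if (!ctx->dev)
        return -1;
    ctx->dev->registered = 1;
    return 0;
}

/* deregister + dealloc */
static void uld_ctx_tear_down(struct uld_ctx_sketch *ctx)
{
    if (!ctx->dev)
        return;
    ctx->dev->registered = 0;
    free(ctx->dev);
    ctx->dev = NULL;
}

/* Reset path: the context address never changes, only ctx->dev does. */
static int uld_ctx_reset(struct uld_ctx_sketch *ctx)
{
    uld_ctx_tear_down(ctx);
    ctx->resets++;
    return uld_ctx_bring_up(ctx);
}
```

Because callers hold `struct uld_ctx_sketch *` rather than the device pointer, a reset does not invalidate their handle.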
    • RDMA/cxgb4: Reset wait condition atomically · d9594d99
      Committed by Steve Wise
      The driver was never really waiting for RDMA_WR/FINI completions
      because the condition variable used to determine if the completion
      happened was never reset, and this condition variable is reused for
      both connection setup and teardown.  This causes various driver
      crashes under heavy loads due to releasing resources too early.
      
      The fix is to use atomic bits to correctly reset the condition
      immediately after the completion is detected.
      Signed-off-by: Steve Wise <swise@opengridcomputing.com>
      Signed-off-by: Roland Dreier <roland@purestorage.com>
      d9594d99
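The "test and reset in one atomic step" idea can be sketched with C11 atomics. This is an illustration under assumptions, not the driver's code: `wr_wait_sketch`, `REPLY_READY_BIT`, and the helpers are hypothetical names. The consumer clears the ready bit in the same atomic operation that tests it, so a reused wait object never reports a stale completion.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define REPLY_READY_BIT 0x1UL   /* hypothetical flag bit */

/* Hypothetical wait object reused for both setup and teardown. */
struct wr_wait_sketch {
    atomic_ulong status;
    int ret;
};

static void wr_wait_init(struct wr_wait_sketch *w)
{
    atomic_init(&w->status, 0);
    w->ret = 0;
}

/* Reply side: record the result, then publish the ready bit. */
static void wr_wait_post_reply(struct wr_wait_sketch *w, int ret)
{
    w->ret = ret;
    atomic_fetch_or(&w->status, REPLY_READY_BIT);
}

/* Waiter side: atomically clear the bit and report whether it was set.
 * A second call without a new reply returns false. */
static bool wr_wait_consume_reply(struct wr_wait_sketch *w)
{
    unsigned long old = atomic_fetch_and(&w->status, ~REPLY_READY_BIT);
    return (old & REPLY_READY_BIT) != 0;
}
```

With a plain flag that is only ever set, the second wait on the same object would return immediately; here each reply can be consumed exactly once.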
    • RDMA/cxgb4: Don't change QP state outside EP lock · 30c95c2d
      Committed by Steve Wise
      Concurrent ingress CLOSE and ULP ABORT operations cause a crash due
      to a race condition where the close path releases the EP lock and then
      tries to move the QP state to CLOSED.  This must be done inside the EP
      lock to avoid the race.
      Signed-off-by: Steve Wise <swise@opengridcomputing.com>
      Signed-off-by: Roland Dreier <roland@purestorage.com>
      30c95c2d
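A rough userspace sketch of the locking rule, assuming a much simplified state machine. The types, state names, and functions below are hypothetical and do not mirror the driver; they only show the close path deciding and applying the QP transition while still holding the endpoint lock, so a concurrent abort cannot interleave between the two.

```c
#include <pthread.h>

/* Hypothetical, simplified states for illustration only. */
enum ep_state_sketch { EP_CONNECTED, EP_CLOSED, EP_ABORTED };
enum qp_state_sketch { QP_ACTIVE, QP_CLOSED, QP_ERROR };

struct ep_sketch {
    pthread_mutex_t lock;
    enum ep_state_sketch state;
    enum qp_state_sketch qp_state;
};

static void ep_init(struct ep_sketch *ep)
{
    pthread_mutex_init(&ep->lock, NULL);
    ep->state = EP_CONNECTED;
    ep->qp_state = QP_ACTIVE;
}

/* Close path: the endpoint check and the QP state change happen under
 * the same lock hold, never after dropping it. */
static void ep_peer_close(struct ep_sketch *ep)
{
    pthread_mutex_lock(&ep->lock);
    if (ep->state == EP_CONNECTED) {
        ep->state = EP_CLOSED;
        ep->qp_state = QP_CLOSED;
    }
    pthread_mutex_unlock(&ep->lock);
}

/* Abort path: takes the same lock, so it sees a consistent pair of states. */
static void ep_abort(struct ep_sketch *ep)
{
    pthread_mutex_lock(&ep->lock);
    if (ep->state == EP_CONNECTED) {
        ep->state = EP_ABORTED;
        ep->qp_state = QP_ERROR;
    }
    pthread_mutex_unlock(&ep->lock);
}
```

Whichever path takes the lock first wins, and the other sees the endpoint already moved and leaves the QP state alone.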
  6. 15 Mar, 2011 · 1 commit
  7. 11 Jan, 2011 · 1 commit
  8. 15 Nov, 2010 · 1 commit
  9. 29 Sep, 2010 · 3 commits
  10. 03 Aug, 2010 · 1 commit
  11. 07 Jul, 2010 · 1 commit
  12. 25 May, 2010 · 1 commit
  13. 06 May, 2010 · 1 commit
  14. 22 Apr, 2010 · 1 commit