1. 07 Jul, 2010 1 commit
  2. 10 Dec, 2009 1 commit
    • D
      IPoIB: Clear ipoib_neigh.dgid in ipoib_neigh_alloc() · 0cd4d0fd
      Committed by David J. Wilder
      IPoIB can miss a change in destination GID under some conditions.  The
      problem is caused when ipoib_neigh->dgid contains a stale address.
      The fix is to set ipoib_neigh->dgid to zero in ipoib_neigh_alloc().
      
      This can happen when a system using bonding on its IPoIB interfaces
      has switched its active interface from interface A to B and back to A.
      The system that fails over will not correctly process the 2nd
      address change, as described below.
      
      Each neighbor has an associated ipoib_neigh, and ipoib_neigh->dgid
      holds a copy of the remote node's hardware address.  When an address
      changes, neighbor->ha is updated by the network layer (ARP code)
      with the new address.  IPoIB detects this change in
      ipoib_start_xmit() by comparing neighbor->ha with
      ipoib_neigh->dgid.  The bug is that ipoib_neigh->dgid may already
      contain the new address (A), so the change from B to A is missed by
      IPoIB.  Here is the sequence of events:
      
          ipoib_neigh->dgid = A  and  neighbor->ha = A
      
      The address is switched to B (the first switch)
      
          neighbor->ha = B
      
      The change is seen in ipoib_start_xmit() -- neighbor->ha !=
      ipoib_neigh->dgid so ipoib_neigh is released, and a new one is
      allocated.
      
      The allocator may return the same chunk of memory that was just
      released, therefore ipoib_neigh->dgid still contains A at this point.
      
      ipoib_neigh->dgid should be updated in neigh_add_path(), but if the
      following conditions are true dgid is not updated:
      
              1) __path_find() returns a path
              2) path->ah is NULL
      
      The remote system now switches from address B to A, and neighbor->ha
      is updated to A.
      
      Now we again have: ipoib_neigh->dgid = A  and  neighbor->ha = A
      
      Since the addresses are the same ipoib won't process the change in
      address.  Fix this by zeroing out the dgid field when allocating a new
      struct ipoib_neigh.
      Signed-off-by: David Wilder <dwilder@us.ibm.com>
      Signed-off-by: Roland Dreier <rolandd@cisco.com>
      0cd4d0fd
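The sequence of events in the commit above can be replayed in a small user-space model. This is only a sketch under stated assumptions: `neigh_model`, `neigh_alloc_model()` and the one-slot allocator are hypothetical stand-ins rather than the driver's code, and the hardware address is reduced to a bare 16-byte GID.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define GID_LEN 16  /* an InfiniBand GID is 128 bits */

/* Hypothetical stand-in for struct ipoib_neigh: only the cached GID. */
struct neigh_model {
    uint8_t dgid[GID_LEN];
};

/* One-slot "allocator" that always hands back the same memory, mimicking
 * an allocator that returns the chunk that was just released. */
static struct neigh_model slot;

struct neigh_model *neigh_alloc_model(bool zero_dgid)
{
    if (zero_dgid)
        memset(slot.dgid, 0, sizeof(slot.dgid));  /* the fix */
    return &slot;
}

/* Mirrors the comparison described for ipoib_start_xmit(): a change is
 * noticed only when the cached dgid differs from the current address. */
bool address_change_detected(const struct neigh_model *n,
                             const uint8_t ha[GID_LEN])
{
    return memcmp(n->dgid, ha, GID_LEN) != 0;
}

/* Replays A -> B -> A and reports whether the second switch is seen. */
bool second_switch_detected(bool zero_dgid)
{
    uint8_t gid_a[GID_LEN], gid_b[GID_LEN];
    struct neigh_model *n;

    memset(gid_a, 0xAA, GID_LEN);
    memset(gid_b, 0xBB, GID_LEN);

    n = neigh_alloc_model(true);
    memcpy(n->dgid, gid_a, GID_LEN);        /* dgid = A, ha = A */

    /* First switch: ha = B.  The mismatch is seen, the entry is
     * released, and a new one is allocated from the same memory. */
    if (!address_change_detected(n, gid_b))
        return false;
    n = neigh_alloc_model(zero_dgid);
    /* A path is found but path->ah is NULL, so dgid is not updated. */

    /* Second switch: ha = A again. */
    return address_change_detected(n, gid_a);
}
```

Calling `second_switch_detected(false)` models the behavior before the fix, where the stale A left in recycled memory hides the change; `second_switch_detected(true)` models the zeroed allocation.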
  3. 06 Sep, 2009 1 commit
  4. 03 Jun, 2009 1 commit
  5. 31 May, 2009 1 commit
  6. 21 Apr, 2009 1 commit
  7. 22 Mar, 2009 1 commit
  8. 18 Feb, 2009 1 commit
  9. 15 Jan, 2009 1 commit
  10. 10 Jan, 2009 1 commit
    • Y
      IPoIB: Fix loss of connectivity after bonding failover on both sides · a50df398
      Committed by Yossi Etigin
      Fix bonding failover in the case both peers failover and the
      gratuitous ARP is lost.  In that case, the sender side will create an
      ipoib_neigh and issue a path request with the old GID first.  When
      skb->dst->neighbour->ha changes due to ARP refresh, this ipoib_neigh
      will not be added to the path->list of the path of the new GID,
      because the ipoib_neigh already exists.  It will not have an AH
      either, because of sender-side failover.  Therefore, it will not get
      an AH when the path is resolved.
      
      The solution here is to compare GIDs in ipoib_start_xmit() even if
      neigh->ah is invalid.  Comparing with an uninitialized value of
      neigh->dgid should be fine, since a spurious match is harmless (and
      astronomically unlikely too).
      Signed-off-by: Moni Shoua <monis@voltaire.com>
      Signed-off-by: Yossi Etigin <yosefe@voltaire.com>
      Signed-off-by: Roland Dreier <rolandd@cisco.com>
      a50df398
  11. 13 Nov, 2008 3 commits
  12. 30 Oct, 2008 1 commit
  13. 29 Oct, 2008 1 commit
  14. 23 Oct, 2008 1 commit
  15. 01 Oct, 2008 1 commit
    • R
      IPoIB: Use netif_tx_lock() and get rid of private tx_lock, LLTX · 943c246e
      Committed by Roland Dreier
      Currently, IPoIB is an LLTX driver that uses its own IRQ-disabling
      tx_lock.  Not only do we want to get rid of LLTX, this actually causes
      problems because of the skb_orphan() done with this tx_lock held: some
      skb destructors expect to be run with interrupts enabled.
      
      The simplest fix for this is to get rid of the driver-private tx_lock
      and stop using LLTX.  We kill off priv->tx_lock and use
      netif_tx_lock[_bh]() instead; the patch to do this is a tiny bit
      tricky because we need to update places that take priv->lock inside
      the tx_lock to disable IRQs, rather than relying on tx_lock having
      already disabled IRQs.
      
      Also, there are a couple of places where we need to disable BHs to
      make sure we have a consistent context to call netif_tx_lock() (since
      we no longer can use _irqsave() variants), and we also have to change
      ipoib_send_comp_handler() to call drain_tx_cq() through a timer rather
      than directly, because ipoib_send_comp_handler() runs in interrupt
      context and drain_tx_cq() must run in BH context so it can call
      netif_tx_lock().
      Signed-off-by: Roland Dreier <rolandd@cisco.com>
      943c246e
  16. 26 Sep, 2008 2 commits
  17. 17 Sep, 2008 1 commit
    • Y
      IPoIB: Fix deadlock on RTNL between bcast join comp and ipoib_stop() · e8224e4b
      Committed by Yossi Etigin
      Taking rtnl_lock in ipoib_mcast_join_complete() causes a deadlock with
      ipoib_stop().  We avoid it by scheduling the piece of code that takes
      the lock on ipoib_workqueue instead of executing it directly.  This
      works because we only flush the ipoib_workqueue with the RTNL not held.
      
      The deadlock happens because ipoib_stop() calls ipoib_ib_dev_down()
      which calls ipoib_mcast_dev_flush(), which calls ipoib_mcast_free(),
      which calls ipoib_mcast_leave(). The latter calls
      ib_sa_free_multicast(), and this waits until the multicast completion
      handler finishes.  This handler is ipoib_mcast_join_complete(), which
      waits for the rtnl_lock(), which was already taken by ipoib_stop().
      
      This bug was introduced in commit a77a57a1 ("IPoIB: Fix deadlock on
      RTNL in ipoib_stop()").
      Signed-off-by: Yossi Etigin <yosefe@voltaire.com>
      Signed-off-by: Roland Dreier <rolandd@cisco.com>
      e8224e4b
  18. 20 Aug, 2008 1 commit
    • R
      IPoIB: Fix deadlock on RTNL in ipoib_stop() · a77a57a1
      Committed by Roland Dreier
      Commit c8c2afe3 ("IPoIB: Use rtnl lock/unlock when changing device
      flags") added a call to rtnl_lock() in ipoib_mcast_join_task(), which
      is run from the ipoib_workqueue.  However, ipoib_stop() (which is run
      inside rtnl_lock()) flushes this workqueue, which leads to a deadlock
      if the join task is pending.
      
      Fix this by simply not flushing the workqueue from ipoib_stop().  It
      turns out that we really don't care about workqueue tasks running
      during or after ipoib_stop(), as long as we make sure to flush the
      workqueue before unregistering a netdev.
      
      This fixes <https://bugs.openfabrics.org/show_bug.cgi?id=1114>.
      Signed-off-by: Roland Dreier <rolandd@cisco.com>
      a77a57a1
  19. 23 Jul, 2008 1 commit
  20. 15 Jul, 2008 5 commits
    • E
      IPoIB: Remove priv->mcast_mutex · 5892eff9
      Committed by Eli Cohen
      No need for a mutex around calls to ib_attach_mcast/ib_detach_mcast
      since these operations are synchronized at the HW driver layer.
      Signed-off-by: Eli Cohen <eli@mellanox.co.il>
      Signed-off-by: Roland Dreier <rolandd@cisco.com>
      5892eff9
    • M
      IPoIB: Refresh paths instead of flushing them on SM change events · ee1e2c82
      Committed by Moni Shoua
      The patch tries to solve the problem of the device going down and
      paths being flushed on an SM change event.  The method is to mark
      the paths as candidates for refresh (by setting the new valid flag
      to 0), and wait for an ARP probe to trigger a new path record query.
      
      The solution requires a different and less intrusive handling of SM
      change event. For that, the second argument of the flush function
      changes its meaning from a boolean flag to a level.  In most cases, SM
      failover doesn't cause LID change so traffic won't stop.  In the rare
      cases of LID change, the remote host (the one that hadn't changed its
      LID) will lose connectivity until paths are refreshed. This is no
      worse than the current state.  In fact, preventing the device from
      going down saves packets that otherwise would be lost.
      Signed-off-by: Moni Levy <monil@voltaire.com>
      Signed-off-by: Moni Shoua <monis@voltaire.com>
      Signed-off-by: Roland Dreier <rolandd@cisco.com>
      ee1e2c82
    • V
      IPoIB: add LRO support · af40da89
      Committed by Vladimir Sokolovsky
      Add "ipoib_use_lro" module parameter to enable LRO and an
      "ipoib_lro_max_aggr" module parameter to set the max number of packets
      to be aggregated.  Make LRO controllable and LRO statistics accessible
      through ethtool.
      Signed-off-by: Vladimir Sokolovsky <vlad@mellanox.co.il>
      Signed-off-by: Eli Cohen <eli@mellanox.co.il>
      Signed-off-by: Roland Dreier <rolandd@cisco.com>
      af40da89
    • E
      IPoIB: Copy small received SKBs in connected mode · f89271da
      Committed by Eli Cohen
      The connected mode implementation in the IPoIB driver has a large
      overhead in the way SKBs are handled in the receive flow.  It usually
      allocates an SKB as big as the one used for the currently received
      SKB and moves unused fragments from the old SKB to the new one. This
      involves a loop on all the remaining fragments and incurs overhead on
      the CPU.  This patch, for small SKBs, allocates an SKB just large
      enough to contain the received data and copies to it the data from the
      received SKB.  The newly allocated SKB is passed to the stack and the
      old SKB is reposted.
      
      When running netperf, UDP small messages, without this patch I get:
      
          UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
          14.4.3.178 (14.4.3.178) port 0 AF_INET
          Socket  Message  Elapsed      Messages
          Size    Size     Time         Okay Errors   Throughput
          bytes   bytes    secs            #      #   10^6bits/sec
      
          114688     128   10.00     5142034      0     526.31
          114688           10.00     1130489            115.71
      
      With this patch I get both send and receive at ~315 Mbps.
      
      The reason that send performance actually slows down is as follows:
      When using this patch, the overhead of the CPU for handling RX packets
      is dramatically reduced.  As a result, we do not experience RNR NAK
      messages from the receiver which cause the connection to be closed and
      reopened again; when the patch is not used, the receiver cannot handle
      the packets fast enough so there is less time to post new buffers and
      hence the mentioned RNR NAKs.  So what happens is that the
      application *thinks* it posted a certain number of packets for
      transmission but these packets are flushed and do not really get
      transmitted.  Since the connection gets opened and closed many times,
      each time netperf gets the CPU time that otherwise would have been
      given to IPoIB to actually transmit the packets.  This can be verified
      by comparing the port counters, the output of ifconfig, and the
      output of netperf (this is for the case without the patch):
      
          tx packets
          ==========
          port counter:   1,543,996
          ifconfig:       1,581,426
          netperf:        5,142,034
      
          rx packets
          ==========
          netperf:        1,130,489
      Signed-off-by: Eli Cohen <eli@mellanox.co.il>
      f89271da
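The copy-for-small-packets idea described in the commit above can be sketched in user space as follows. The names `receive_model()`, `struct rx_result` and the 256-byte cutoff `SMALL_RX_THRESHOLD` are hypothetical; the commit text does not state the driver's actual threshold, and plain byte buffers stand in for SKBs.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define SMALL_RX_THRESHOLD 256  /* hypothetical cutoff, not from the commit */

struct rx_result {
    unsigned char *data;    /* buffer handed up the stack */
    size_t len;             /* number of valid bytes in data */
    int ring_buf_reusable;  /* 1 if the ring buffer can be reposted as-is */
};

/* Small packets are copied into a right-sized buffer so the large ring
 * buffer can be reposted immediately; larger packets hand the ring
 * buffer itself up the stack, so a replacement must be allocated. */
struct rx_result receive_model(unsigned char *ring_buf, size_t len)
{
    struct rx_result r = { NULL, len, 0 };

    if (len < SMALL_RX_THRESHOLD) {
        unsigned char *copy = malloc(len ? len : 1);
        if (copy) {
            memcpy(copy, ring_buf, len);
            r.data = copy;            /* caller frees this copy */
            r.ring_buf_reusable = 1;
            return r;
        }
        /* Allocation failed: fall back to the large-packet path. */
    }
    r.data = ring_buf;
    return r;
}
```

The trade-off is one extra copy of a few hundred bytes against skipping the per-fragment bookkeeping for a full-sized receive buffer, which is the overhead the commit message describes.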
    • R
      RDMA: Remove subversion $Id tags · f3781d2e
      Committed by Roland Dreier
      They don't get updated by git and so they're worse than useless.
      Signed-off-by: Roland Dreier <rolandd@cisco.com>
      f3781d2e
  21. 30 Apr, 2008 1 commit
  22. 24 Apr, 2008 1 commit
  23. 17 Apr, 2008 4 commits
  24. 12 Mar, 2008 1 commit
    • R
      IPoIB: Allocate priv->tx_ring with vmalloc() · 10313cbb
      Committed by Roland Dreier
      Commit 7143740d ("IPoIB: Add send gather support") made struct
      ipoib_tx_buf significantly larger, since the mapping member changed
      from a single u64 to an array with MAX_SKB_FRAGS + 1 entries.  This
      means that allocating tx_rings with kzalloc() may fail because there
      is not enough contiguous memory for the new, much bigger size.  Fix
      this regression by allocating the rings with vmalloc() instead.
      Signed-off-by: Roland Dreier <rolandd@cisco.com>
      10313cbb
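The size growth described in the commit above can be made concrete with a rough calculation. In this sketch the struct layouts are simplified models rather than the driver's definitions, and both `MODEL_MAX_SKB_FRAGS` (18, a value assumed here for 4 KiB pages) and `MODEL_RING_ENTRIES` are illustrative assumptions.

```c
#include <stddef.h>
#include <stdint.h>

#define MODEL_MAX_SKB_FRAGS 18   /* assumption: value for 4 KiB pages */
#define MODEL_RING_ENTRIES  128  /* assumption: illustrative ring size */

/* Simplified model of the TX buffer entry before send gather support. */
struct tx_buf_before {
    void *skb;
    uint64_t mapping;
};

/* Simplified model after the change: one mapping per fragment plus one. */
struct tx_buf_after {
    void *skb;
    uint64_t mapping[MODEL_MAX_SKB_FRAGS + 1];
};

/* Total bytes a physically contiguous ring would need in each case. */
size_t ring_bytes_before(void)
{
    return MODEL_RING_ENTRIES * sizeof(struct tx_buf_before);
}

size_t ring_bytes_after(void)
{
    return MODEL_RING_ENTRIES * sizeof(struct tx_buf_after);
}
```

Under these assumptions each entry grows by 18 extra 64-bit mappings, so the ring no longer fits in a single page and a contiguous allocation has to find several adjacent free pages; vmalloc() only needs virtually contiguous memory, which is why it avoids the failure.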
  25. 09 Feb, 2008 1 commit
    • E
      IPoIB: Add high DMA feature flag · eb14032f
      Committed by Eli Cohen
      All current InfiniBand devices can handle all DMA addresses, and it's
      hard to imagine anyone would be silly enough to build a new device
      that couldn't.  Therefore, enable the NETIF_F_HIGHDMA feature for IPoIB.
      
      This has no effect for now, but is needed when we enable gather/scatter
      support and checksum stateless offloads.
      Signed-off-by: Eli Cohen <eli@mellnaox.co.il>
      Signed-off-by: Roland Dreier <rolandd@cisco.com>
      eb14032f
  26. 05 Feb, 2008 2 commits
  27. 26 Jan, 2008 3 commits
    • K
      IPoIB: Remove redundant check of netif_queue_stopped() in xmit handler · 48fe5e59
      Committed by Krishna Kumar
      qdisc_run() now tests for queue_stopped() before calling
      __qdisc_run(), and the same check is done in every iteration of
      __qdisc_run(), so another check is not required in the driver xmit.
      This means that ipoib_start_xmit() no longer needs to test
      netif_queue_stopped(); the test was added to fix earlier kernels,
      where the networking stack did not guarantee that the xmit method of
      an LLTX driver would not be called after the queue was stopped, but
      current kernels do provide this guarantee.
      
      To validate, I put a debug in the TX_BUSY path which never hit with 64
      threads running overnight exercising this code a few 100 million
      times.
      Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
      Signed-off-by: Roland Dreier <rolandd@cisco.com>
      48fe5e59
    • P
      IPoIB/CM: Enable SRQ support on HCAs that support fewer than 16 SG entries · 586a6934
      Committed by Pradeep Satyanarayana
      Some HCAs (such as ehca2) support SRQ, but only support fewer than 16 SG
      entries for SRQs.  Currently IPoIB/CM implicitly assumes all HCAs will
      support 16 SG entries for SRQs (to handle a 64K MTU with 4K pages). This
      patch removes that restriction by limiting the maximum MTU in connected
      mode to what the maximum number of SRQ SG entries allows.
      
      This patch addresses <https://bugs.openfabrics.org/show_bug.cgi?id=728>
      Signed-off-by: Pradeep Satyanarayana <pradeeps@linux.vnet.ibm.com>
      Signed-off-by: Roland Dreier <rolandd@cisco.com>
      586a6934
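The relationship between SG entries and the maximum receive size in the commit above is simple arithmetic: with one page per SG entry, 16 entries of 4 KiB cover 64 KiB. The sketch below shows the clamping idea only; `usable_sg_entries()` and `max_rx_bytes()` are hypothetical helpers, and any header overhead the driver may subtract when deriving the MTU is not modeled.

```c
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096u  /* 4 KiB pages, as in the commit text */
#define MODEL_MAX_RX_SG 16u    /* 16 * 4 KiB = 64 KiB */

/* Never use more SG entries than the HCA reports for SRQs, and never
 * more than are needed for a 64 KiB receive. */
unsigned int usable_sg_entries(unsigned int hca_max_srq_sge)
{
    return hca_max_srq_sge < MODEL_MAX_RX_SG ? hca_max_srq_sge
                                             : MODEL_MAX_RX_SG;
}

/* Largest receive, in bytes, that the usable SG entries can cover when
 * each entry maps one page. */
unsigned int max_rx_bytes(unsigned int hca_max_srq_sge)
{
    return usable_sg_entries(hca_max_srq_sge) * MODEL_PAGE_SIZE;
}
```

For example, an HCA reporting 4 SRQ SG entries would be limited to 16 KiB receives in this model, while one reporting 16 or more keeps the full 64 KiB.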
    • P
      IPoIB/cm: Add connected mode support for devices without SRQs · 68e995a2
      Committed by Pradeep Satyanarayana
      Some IB adapters (notably IBM's eHCA) do not implement SRQs (shared
      receive queues).  The current IPoIB connected mode support only works
      on devices that support SRQs.
      
      Fix this by adding support for using the receive queue of each
      connected mode receive QP.  The disadvantage of this compared to using
      an SRQ is that it means a full queue of receives must be posted for
      each remote connected mode peer, which means that total memory usage
      is potentially much higher than when using SRQs.  To manage this, add
      a new module parameter "max_nonsrq_conn_qp" that limits the number of
      connections allowed per interface.
      
      The rest of the changes are fairly straightforward: we use a table of
      struct ipoib_cm_rx to hold all the active connections, and put the
      table index of the connection in the high bits of receive WR IDs.
      This is needed because we cannot rely on the struct ib_wc.qp field for
      non-SRQ receive completions.  Most of the rest of the changes just
      test whether or not an SRQ is available, and post receives or find
      received packets in the right place depending on the answer.
      
      Cleaning up dead connections actually becomes simpler, because we do
      not have to do the "last WQE reached" dance that is required to
      destroy QPs attached to an SRQ.  We just move the QP to the error
      state and wait for all pending receives to be flushed.
      Signed-off-by: Pradeep Satyanarayana <pradeeps@linux.vnet.ibm.com>
      
      [ Completely rewritten and split up, based on Pradeep's work.  Several
        bugs fixed and no doubt several bugs introduced.  - Roland ]
      Signed-off-by: Roland Dreier <rolandd@cisco.com>
      68e995a2
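Putting a connection's table index in the high bits of a receive WR ID, as the commit above describes, amounts to a small pack/unpack pair. This is a sketch under assumptions: the 32/32 bit split and the helper names are hypothetical, since the commit text does not give the driver's actual bit layout.

```c
#include <stdint.h>

#define MODEL_INDEX_SHIFT 32  /* hypothetical split: high 32 bits = index */
#define MODEL_SLOT_MASK   ((UINT64_C(1) << MODEL_INDEX_SHIFT) - 1)

/* Pack the connection table index into the high bits and the receive
 * ring slot into the low bits of a 64-bit work request ID. */
uint64_t wr_id_encode(uint32_t conn_index, uint32_t ring_slot)
{
    return ((uint64_t)conn_index << MODEL_INDEX_SHIFT) | ring_slot;
}

/* Recover the connection table index from a completed WR ID. */
uint32_t wr_id_conn_index(uint64_t wr_id)
{
    return (uint32_t)(wr_id >> MODEL_INDEX_SHIFT);
}

/* Recover the receive ring slot from a completed WR ID. */
uint32_t wr_id_ring_slot(uint64_t wr_id)
{
    return (uint32_t)(wr_id & MODEL_SLOT_MASK);
}
```

With this encoding a completion handler can look up the connection from the WR ID alone, which matters here because the struct ib_wc.qp field cannot be relied on for non-SRQ receive completions.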