1. 03 Oct, 2012 1 commit
  2. 02 Oct, 2012 1 commit
    • IB/ipoib: Add more rtnl_link_ops callbacks · 862096a8
      Or Gerlitz authored
      Add the rtnl_link_ops changelink and fill_info callbacks, through
      which the admin can now set and get driver policies such as the mode.
      Maintain the proprietary sysfs entries only for legacy children.
      
      For child devices, set dev->iflink to point to the parent
      device ifindex, such that user space tools can now correctly
      show the uplink relation as done for vlan, macvlan, etc
      devices. Pointed out by Patrick McHardy <kaber@trash.net>
      Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      862096a8
  3. 21 Sep, 2012 1 commit
    • IB/ipoib: Add rtnl_link_ops support · 9baa0b03
      Or Gerlitz authored
      Add rtnl_link_ops to IPoIB, with the first usage being child device
      create/delete through them. Child devices are now either legacy ones,
      created/deleted through the ipoib sysfs entries, or RTNL ones.
      
      Adding support for RTNL children involved refactoring ipoib_vlan_add,
      which is now used by both the sysfs and the link_ops code.
      
      Also add an ndo_uninit entry to support calling unregister_netdevice_queue
      from the rtnl dellink entry. This required removal of calls to
      ipoib_dev_cleanup from the driver in flows which use unregister_netdevice,
      since the networking core will invoke ipoib_uninit which does exactly that.
      Signed-off-by: Erez Shitrit <erezsh@mellanox.co.il>
      Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9baa0b03
  4. 13 Sep, 2012 2 commits
  5. 30 Jul, 2012 1 commit
    • IPoIB: Use a private hash table for path lookup in xmit path · b63b70d8
      Shlomo Pongratz authored
      Dave Miller <davem@davemloft.net> provided a detailed description of
      why the way IPoIB is using neighbours for its own ipoib_neigh struct
      is buggy:
      
          Any time an ipoib_neigh is changed, a sequence like the following is made:
      
          			spin_lock_irqsave(&priv->lock, flags);
          			/*
          			 * It's safe to call ipoib_put_ah() inside
          			 * priv->lock here, because we know that
          			 * path->ah will always hold one more reference,
          			 * so ipoib_put_ah() will never do more than
          			 * decrement the ref count.
          			 */
          			if (neigh->ah)
          				ipoib_put_ah(neigh->ah);
          			list_del(&neigh->list);
          			ipoib_neigh_free(dev, neigh);
          			spin_unlock_irqrestore(&priv->lock, flags);
          			ipoib_path_lookup(skb, n, dev);
      
          This doesn't work, because you're leaving a stale pointer to the freed up
          ipoib_neigh in the special neigh->ha pointer cookie.  Yes, it even fails
          with all the locking done to protect _changes_ to *ipoib_neigh(n), and
          with the code in ipoib_neigh_free() that NULLs out the pointer.
      
          The core issue is that read side calls to *to_ipoib_neigh(n) are not
          being synchronized at all, they are performed without any locking.  So
          whether we hold the lock or not when making changes to *ipoib_neigh(n)
          you still can have threads see references to freed up ipoib_neigh
          objects.
      
          	cpu 1			cpu 2
          	n = *ipoib_neigh()
          				*ipoib_neigh() = NULL
          				kfree(n)
          	n->foo == OOPS
      
          [..]
      
          Perhaps the ipoib code can have a private path database it manages
          entirely itself, which holds all the necessary information and is
          looked up by some generic key which is available easily at transmit
          time and does not involve generic neighbour entries.
      
      See <http://marc.info/?l=linux-rdma&m=132812793105624&w=2> and
      <http://marc.info/?l=linux-rdma&w=2&r=1&s=allows+references+to+freed+memory&q=b>
      for the full discussion.
      
      This patch aims to solve the race conditions found in the IPoIB driver.
      
      The patch removes the connection between the core networking neighbour
      structure and the ipoib_neigh structure.  In addition to avoiding the
      race described above, it allows us to handle SKBs carrying IP packets
      that don't have any associated neighbour.
      
      We add an ipoib_neigh hash table with N buckets where the key is the
      destination hardware address.  The ipoib_neigh is fetched from the
      hash table instead of from the stashed location in the neighbour
      structure. The hash table uses both RCU and reference counting to
      guarantee that no ipoib_neigh instance is ever deleted while in use.
      
      Fetching the ipoib_neigh structure instance from the hash also makes
      the special code in ipoib_start_xmit that handles remote and local
      bonding failover redundant.
      
      Aged ipoib_neigh instances are deleted by a garbage collection task
      that runs every M seconds and deletes every ipoib_neigh instance that
      was idle for at least 2*M seconds. The deletion is safe since the
      ipoib_neigh instances are protected using RCU and reference count
      mechanisms.
      
      The number of buckets (N) and the frequency of running the GC thread (M)
      are taken from the exported arp_tbl.
      Signed-off-by: Shlomo Pongratz <shlomop@mellanox.com>
      Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: Roland Dreier <roland@purestorage.com>
      b63b70d8
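The private table described in the commit above can be illustrated with a small user-space sketch. It is not the driver's code: every name here (`neigh_sketch`, `neigh_get`, `neigh_gc`, `NBUCKETS`) is hypothetical, the hash function is arbitrary, and RCU is not modelled at all, so the sketch is single-threaded. It only shows the reference-counting idea: a lookup keyed by the destination hardware address takes a reference, and the garbage collector drops the table's reference for entries idle for at least 2*M seconds, so an entry still held by a user is not freed under it.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define HWADDR_LEN 20  /* length of an IPoIB hardware address in bytes */
#define NBUCKETS   16  /* hypothetical N; the commit takes N from arp_tbl */

struct neigh_sketch {
    uint8_t daddr[HWADDR_LEN];   /* key: destination hardware address */
    int refcnt;                  /* 1 for the table plus 1 per active user */
    unsigned long last_used;     /* time of last lookup, in seconds */
    struct neigh_sketch *next;   /* bucket chain */
};

struct neigh_table_sketch {
    struct neigh_sketch *buckets[NBUCKETS];
};

/* Arbitrary illustrative hash over the hardware address. */
static unsigned hash_daddr(const uint8_t *daddr)
{
    unsigned h = 0;
    for (int i = 0; i < HWADDR_LEN; i++)
        h = h * 31 + daddr[i];
    return h % NBUCKETS;
}

/* Drop one reference; free only when the last reference goes away. */
static void neigh_put(struct neigh_sketch *n)
{
    if (--n->refcnt == 0)
        free(n);
}

/* Look up by hardware address; on a hit, take a reference for the caller. */
static struct neigh_sketch *neigh_get(struct neigh_table_sketch *t,
                                      const uint8_t *daddr, unsigned long now)
{
    struct neigh_sketch *n;

    for (n = t->buckets[hash_daddr(daddr)]; n; n = n->next) {
        if (memcmp(n->daddr, daddr, HWADDR_LEN) == 0) {
            n->refcnt++;
            n->last_used = now;
            return n;
        }
    }
    return NULL;
}

/* Insert a new entry; the table itself holds the initial reference. */
static struct neigh_sketch *neigh_add(struct neigh_table_sketch *t,
                                      const uint8_t *daddr, unsigned long now)
{
    struct neigh_sketch *n = calloc(1, sizeof(*n));
    unsigned b = hash_daddr(daddr);

    if (!n)
        return NULL;
    memcpy(n->daddr, daddr, HWADDR_LEN);
    n->refcnt = 1;
    n->last_used = now;
    n->next = t->buckets[b];
    t->buckets[b] = n;
    return n;
}

/* GC pass: unlink entries idle for at least 2*m seconds and drop the
 * table's reference. Returns the number of entries unlinked. */
static int neigh_gc(struct neigh_table_sketch *t, unsigned long now,
                    unsigned long m)
{
    int removed = 0;

    for (int b = 0; b < NBUCKETS; b++) {
        struct neigh_sketch **pp = &t->buckets[b];
        while (*pp) {
            struct neigh_sketch *n = *pp;
            if (now - n->last_used >= 2 * m) {
                *pp = n->next;   /* no longer reachable by lookups */
                neigh_put(n);    /* freed now only if nobody holds it */
                removed++;
            } else {
                pp = &n->next;
            }
        }
    }
    return removed;
}
```

A caller would pair every successful `neigh_get()` with a `neigh_put()` once it is done with the entry. In the kernel, RCU additionally protects readers that are walking a bucket chain while an entry is being unlinked; that part is deliberately left out here.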
  6. 09 Feb, 2012 2 commits
  7. 27 Jul, 2011 1 commit
  8. 20 Apr, 2011 1 commit
  9. 11 Jan, 2011 1 commit
    • IPoIB: Remove LRO support · 19e364f6
      Or Gerlitz authored
      As a first step in moving from LRO to GRO, revert commit af40da89
      ("IPoIB: add LRO support").  Also eliminate the ethtool set_flags
      callback which isn't needed anymore.  Finally, we need to include
      <linux/sched.h> directly to get the declaration of restart_syscall()
      (which used to be included implicitly through <linux/inet_lro.h>).
      
      Cc: Ben Hutchings <bhutchings@solarflare.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Vladimir Sokolovsky <vlad@mellanox.co.il>
      Signed-off-by: Or Gerlitz <ogerlitz@voltaire.com>
      Signed-off-by: Roland Dreier <rolandd@cisco.com>
      19e364f6
  10. 29 Oct, 2008 1 commit
  11. 23 Oct, 2008 1 commit
  12. 01 Oct, 2008 1 commit
    • IPoIB: Use netif_tx_lock() and get rid of private tx_lock, LLTX · 943c246e
      Roland Dreier authored
      Currently, IPoIB is an LLTX driver that uses its own IRQ-disabling
      tx_lock.  Not only do we want to get rid of LLTX, this actually causes
      problems because of the skb_orphan() done with this tx_lock held: some
      skb destructors expect to be run with interrupts enabled.
      
      The simplest fix for this is to get rid of the driver-private tx_lock
      and stop using LLTX.  We kill off priv->tx_lock and use
      netif_tx_lock[_bh]() instead; the patch to do this is a tiny bit
      tricky because we need to update places that take priv->lock inside
      the tx_lock to disable IRQs, rather than relying on tx_lock having
      already disabled IRQs.
      
      Also, there are a couple of places where we need to disable BHs to
      make sure we have a consistent context to call netif_tx_lock() (since
      we no longer can use _irqsave() variants), and we also have to change
      ipoib_send_comp_handler() to call drain_tx_cq() through a timer rather
      than directly, because ipoib_send_comp_handler() runs in interrupt
      context and drain_tx_cq() must run in BH context so it can call
      netif_tx_lock().
      Signed-off-by: Roland Dreier <rolandd@cisco.com>
      943c246e
  13. 17 Sep, 2008 1 commit
    • IPoIB: Fix deadlock on RTNL between bcast join comp and ipoib_stop() · e8224e4b
      Yossi Etigin authored
      Taking rtnl_lock in ipoib_mcast_join_complete() causes a deadlock with
      ipoib_stop().  We avoid it by scheduling the piece of code that takes
      the lock on ipoib_workqueue instead of executing it directly.  This
      works because we only flush the ipoib_workqueue with the RTNL not held.
      
      The deadlock happens because ipoib_stop() calls ipoib_ib_dev_down()
      which calls ipoib_mcast_dev_flush(), which calls ipoib_mcast_free(),
      which calls ipoib_mcast_leave(). The latter calls
      ib_sa_free_multicast(), and this waits until the multicast completion
      handler finishes.  This handler is ipoib_mcast_join_complete(), which
      waits for the rtnl_lock(), which was already taken by ipoib_stop().
      
      This bug was introduced in commit a77a57a1 ("IPoIB: Fix deadlock on
      RTNL in ipoib_stop()").
      Signed-off-by: Yossi Etigin <yosefe@voltaire.com>
      Signed-off-by: Roland Dreier <rolandd@cisco.com>
      e8224e4b
  14. 15 Jul, 2008 10 commits
    • IPoIB: Double default RX/TX ring sizes · bc3a290b
      Eli Cohen authored
      Increase IPoIB ring sizes to twice their original sizes (RX: 128->256,
      TX: 64->128) to act as a shock absorber for high traffic peaks.  With
      the current settings, we have seen cases where there are many calls to
      netif_stop_queue(), which degrades throughput.  Also,
      larger receive buffer sizes help IPoIB in CM mode to avoid experiencing
      RNR NAK conditions due to insufficient receive buffers at the SRQ.
      Signed-off-by: Eli Cohen <eli@mellanox.co.il>
      Signed-off-by: Roland Dreier <rolandd@cisco.com>
      bc3a290b
    • IPoIB/cm: Reduce connected mode TX object size · e112373f
      Eli Cohen authored
      Since IPoIB connected mode does not use NETIF_F_SG, we only have one DMA
      mapping per send, so we don't need a mapping[] array.  Define a new
      struct with a single u64 mapping member and use it for the CM tx_ring.
      Signed-off-by: Eli Cohen <eli@mellanox.co.il>
      Signed-off-by: Roland Dreier <rolandd@cisco.com>
      e112373f
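The size reduction in the commit above can be pictured with two ring-entry layouts. This is a sketch only: the struct names and the `MAX_MAPPINGS_SKETCH` value are hypothetical stand-ins, not the driver's definitions. It shows why one `u64` mapping per entry, instead of an array sized for scatter/gather, shrinks the whole CM TX ring.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical number of DMA mappings a scatter/gather send may need. */
#define MAX_MAPPINGS_SKETCH 18

struct sk_buff; /* opaque here; only a pointer to it is stored */

/* Ring entry sized for scatter/gather: one mapping per possible piece. */
struct tx_buf_sketch {
    struct sk_buff *skb;
    uint64_t mapping[MAX_MAPPINGS_SKETCH];
};

/* Ring entry for a path without scatter/gather: exactly one mapping. */
struct cm_tx_buf_sketch {
    struct sk_buff *skb;
    uint64_t mapping;
};

/* Bytes saved over a whole ring by using the smaller entry type. */
static size_t ring_bytes_saved(size_t ring_entries)
{
    return ring_entries *
           (sizeof(struct tx_buf_sketch) - sizeof(struct cm_tx_buf_sketch));
}
```

With these assumed sizes each entry drops 17 unused 8-byte mappings, so a 128-entry ring saves `128 * 17 * 8` bytes per connection.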
    • IPoIB: Get rid of ipoib_mcast_detach() wrapper · 9eae554c
      Roland Dreier authored
      ipoib_mcast_detach() does nothing except call ib_detach_mcast(), so just
      use the core API in the one place that does a multicast group detach.
      
      add/remove: 0/1 grow/shrink: 0/1 up/down: 0/-105 (-105)
      function                                     old     new   delta
      ipoib_mcast_leave                            357     319     -38
      ipoib_mcast_detach                            67       -     -67
      Signed-off-by: Roland Dreier <rolandd@cisco.com>
      9eae554c
    • IPoIB: Only set Q_Key once: after joining broadcast group · d0de1362
      Eli Cohen authored
      The current code will set the Q_Key for any join of a non-sendonly
      multicast group.  The operation involves a modify QP operation, which
      is fairly heavyweight, and is only really required after the join of
      the broadcast group.  Fix this by adding a parameter to ipoib_mcast_attach()
      to control when the Q_Key is set.
      Signed-off-by: Eli Cohen <eli@mellanox.co.il>
      Signed-off-by: Roland Dreier <rolandd@cisco.com>
      d0de1362
    • IPoIB: Remove priv->mcast_mutex · 5892eff9
      Eli Cohen authored
      No need for a mutex around calls to ib_attach_mcast/ib_detach_mcast
      since these operations are synchronized at the HW driver layer.
      Signed-off-by: Eli Cohen <eli@mellanox.co.il>
      Signed-off-by: Roland Dreier <rolandd@cisco.com>
      5892eff9
    • IPoIB: Remove unused IPOIB_MCAST_STARTED code · c03d4731
      Eli Cohen authored
      The IPOIB_MCAST_STARTED flag is not used at all since commit b3e2749b
      ("IPoIB: Don't drop multicast sends when they can be queued"), so
      remove it.
      Signed-off-by: Eli Cohen <eli@mellanox.co.il>
      Signed-off-by: Roland Dreier <rolandd@cisco.com>
      c03d4731
    • IPoIB: Refresh paths instead of flushing them on SM change events · ee1e2c82
      Moni Shoua authored
      The patch tries to solve the problem of the device going down and paths
      being flushed on an SM change event. The method is to mark the paths as
      candidates for refresh (by setting the new valid flag to 0), and wait for
      an ARP probe to trigger a new path record query.
      
      The solution requires a different and less intrusive handling of SM
      change event. For that, the second argument of the flush function
      changes its meaning from a boolean flag to a level.  In most cases, SM
      failover doesn't cause LID change so traffic won't stop.  In the rare
      cases of LID change, the remote host (the one that hadn't changed its
      LID) will lose connectivity until paths are refreshed. This is no
      worse than the current state.  In fact, preventing the device from
      going down saves packets that otherwise would be lost.
      Signed-off-by: Moni Levy <monil@voltaire.com>
      Signed-off-by: Moni Shoua <monis@voltaire.com>
      Signed-off-by: Roland Dreier <rolandd@cisco.com>
      ee1e2c82
    • IPoIB: add LRO support · af40da89
      Vladimir Sokolovsky authored
      Add "ipoib_use_lro" module parameter to enable LRO and an
      "ipoib_lro_max_aggr" module parameter to set the max number of packets
      to be aggregated.  Make LRO controllable and LRO statistics accessible
      through ethtool.
      Signed-off-by: Vladimir Sokolovsky <vlad@mellanox.co.il>
      Signed-off-by: Eli Cohen <eli@mellanox.co.il>
      Signed-off-by: Roland Dreier <rolandd@cisco.com>
      af40da89
    • IPoIB: Copy small received SKBs in connected mode · f89271da
      Eli Cohen authored
      The connected mode implementation in the IPoIB driver has a large
      overhead in the way SKBs are handled in the receive flow.  It usually
      allocates an SKB as big as the one used for the currently received SKB
      and moves unused fragments from the old SKB to the new one. This
      involves a loop on all the remaining fragments and incurs overhead on
      the CPU.  This patch, for small SKBs, allocates an SKB just large
      enough to contain the received data and copies to it the data from the
      received SKB.  The newly allocated SKB is passed to the stack and the
      old SKB is reposted.
      
      When running netperf with small UDP messages, without this patch I get:
      
          UDP UNIDIRECTIONAL SEND TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
          14.4.3.178 (14.4.3.178) port 0 AF_INET
          Socket  Message  Elapsed      Messages
          Size    Size     Time         Okay Errors   Throughput
          bytes   bytes    secs            #      #   10^6bits/sec
      
          114688     128   10.00     5142034      0     526.31
          114688           10.00     1130489            115.71
      
      With this patch I get both send and receive at ~315 Mbps.
      
      The reason that send performance actually slows down is as follows:
      When using this patch, the overhead of the CPU for handling RX packets
      is dramatically reduced.  As a result, we do not experience RNR NAK
      messages from the receiver which cause the connection to be closed and
      reopened again; when the patch is not used, the receiver cannot handle
      the packets fast enough so there is less time to post new buffers and
      hence the mentioned RNR NAKs.  So what happens is that the
      application *thinks* it posted a certain number of packets for
      transmission but these packets are flushed and do not really get
      transmitted.  Since the connection gets opened and closed many times,
      each time netperf gets the CPU time that otherwise would have been
      given to IPoIB to actually transmit the packets.  This can be verified
      by comparing the port counters, the output of ifconfig and the
      output of netperf (this is for the case without the patch):
      
          tx packets
          ==========
          port counter:   1,543,996
          ifconfig:       1,581,426
          netperf:        5,142,034
      
          rx packets
          ==========
          netperf:        1,130,489
      Signed-off-by: Eli Cohen <eli@mellanox.co.il>
      f89271da
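The receive-side technique in the commit above is commonly called copy-break. The following user-space sketch shows the decision only; it uses plain heap buffers instead of SKBs, and the names (`rx_buf_sketch`, `rx_deliver`) and the threshold value are hypothetical, not taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical cut-off below which a packet counts as "small". */
#define COPY_THRESHOLD_SKETCH 256

/* A posted receive buffer: large enough for a full-size packet. */
struct rx_buf_sketch {
    unsigned char *data;
    size_t capacity;
};

/*
 * Decide what to hand to the network stack for a packet of `len` bytes.
 * Small packet: copy into a right-sized buffer and set *repost = 1, meaning
 * the original large buffer can be reposted unchanged.
 * Otherwise: hand over the original buffer, attach a freshly allocated
 * replacement to the ring slot, and set *repost = 0.
 * The caller owns (and must free) the returned buffer.
 */
static unsigned char *rx_deliver(struct rx_buf_sketch *rx, size_t len,
                                 int *repost)
{
    if (len < COPY_THRESHOLD_SKETCH) {
        unsigned char *small = malloc(len ? len : 1);
        if (small) {
            memcpy(small, rx->data, len);
            *repost = 1;
            return small;
        }
        /* Allocation failed: fall through to the large-packet path. */
    }

    unsigned char *full = rx->data;
    rx->data = malloc(rx->capacity); /* may be NULL; a real driver must check */
    *repost = 0;
    return full;
}
```

The trade-off is one `memcpy` of a few hundred bytes at most against allocating a full-size replacement buffer and moving fragments around, which is where the commit message locates the CPU overhead.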
    • RDMA: Remove subversion $Id tags · f3781d2e
      Roland Dreier authored
      They don't get updated by git and so they're worse than useless.
      Signed-off-by: Roland Dreier <rolandd@cisco.com>
      f3781d2e
  15. 01 May, 2008 1 commit
    • IB/ipoib: Fix transmit queue stalling forever · 57ce41d1
      Eli Cohen authored
      Commit f56bcd80 ("IPoIB: Use separate CQ for UD send completions")
      introduced a bug where the transmit queue could get stopped and never
      woken up.  The problem is that send completions are only polled at the
      end of the xmit function, so if the send queue fills up and the xmit
      path stops the queue, then there is no way for send completions to
      ever get polled, and so the transmit queue stays stopped forever.
      
      Fix this by arming the send CQ just before posting the last send
      request that fills the send queue.  Then, when the completion event
      handler is called, drain the send CQ.  Since it is possible that not
      enough send completions are in the CQ, verify that the net queue
      has been woken up after draining the send CQ, and if not arm a timer
      and drain again at the timer function.
      Signed-off-by: Roland Dreier <rolandd@cisco.com>
      57ce41d1
  16. 30 Apr, 2008 1 commit
  17. 24 Apr, 2008 1 commit
  18. 17 Apr, 2008 4 commits
  19. 15 Feb, 2008 1 commit
  20. 09 Feb, 2008 1 commit
  21. 26 Jan, 2008 3 commits
    • IPoIB/CM: Enable SRQ support on HCAs that support fewer than 16 SG entries · 586a6934
      Pradeep Satyanarayana authored
      Some HCAs (such as ehca2) support SRQ, but only support fewer than 16 SG
      entries for SRQs.  Currently IPoIB/CM implicitly assumes all HCAs will
      support 16 SG entries for SRQs (to handle a 64K MTU with 4K pages). This
      patch removes that restriction by limiting the maximum MTU in connected
      mode to what the maximum number of SRQ SG entries allows.
      
      This patch addresses <https://bugs.openfabrics.org/show_bug.cgi?id=728>
      Signed-off-by: Pradeep Satyanarayana <pradeeps@linux.vnet.ibm.com>
      Signed-off-by: Roland Dreier <rolandd@cisco.com>
      586a6934
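The arithmetic behind the commit above (16 SG entries of 4K pages cover a 64K MTU) can be written out as a small helper. This is a simplified sketch with hypothetical names; it ignores any header overhead the real driver may subtract and only shows how the connected-mode MTU limit scales with the number of SRQ SG entries the HCA reports.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE_SKETCH  4096u   /* assumes 4K pages, as in the commit text */
#define CM_MAX_MTU_SKETCH 65536u  /* 64K, reached with 16 SG entries */

/*
 * Largest connected-mode MTU that fits in one receive when each SG entry
 * covers one page and the HCA supports `max_srq_sge` entries per SRQ WR.
 */
static unsigned cm_max_mtu(unsigned max_srq_sge)
{
    unsigned long long limit =
        (unsigned long long)max_srq_sge * PAGE_SIZE_SKETCH;

    return limit < CM_MAX_MTU_SKETCH ? (unsigned)limit : CM_MAX_MTU_SKETCH;
}
```

Under these assumptions an HCA that reports 8 SG entries would be limited to a 32K connected-mode MTU, while one that reports 16 or more keeps the full 64K.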
    • IPoIB/cm: Add connected mode support for devices without SRQs · 68e995a2
      Pradeep Satyanarayana authored
      Some IB adapters (notably IBM's eHCA) do not implement SRQs (shared
      receive queues).  The current IPoIB connected mode support only works
      on devices that support SRQs.
      
      Fix this by adding support for using the receive queue of each
      connected mode receive QP.  The disadvantage of this compared to using
      an SRQ is that it means a full queue of receives must be posted for
      each remote connected mode peer, which means that total memory usage
      is potentially much higher than when using SRQs.  To manage this, add
      a new module parameter "max_nonsrq_conn_qp" that limits the number of
      connections allowed per interface.
      
      The rest of the changes are fairly straightforward: we use a table of
      struct ipoib_cm_rx to hold all the active connections, and put the
      table index of the connection in the high bits of receive WR IDs.
      This is needed because we cannot rely on the struct ib_wc.qp field for
      non-SRQ receive completions.  Most of the rest of the changes just
      test whether or not an SRQ is available, and post receives or find
      received packets in the right place depending on the answer.
      
      Cleaning up dead connections actually becomes simpler, because we do
      not have to do the "last WQE reached" dance that is required to
      destroy QPs attached to an SRQ.  We just move the QP to the error
      state and wait for all pending receives to be flushed.
      Signed-off-by: Pradeep Satyanarayana <pradeeps@linux.vnet.ibm.com>
      
      [ Completely rewritten and split up, based on Pradeep's work.  Several
        bugs fixed and no doubt several bugs introduced.  - Roland ]
      Signed-off-by: Roland Dreier <rolandd@cisco.com>
      68e995a2
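The commit message above mentions putting the connection's table index in the high bits of receive WR IDs. A minimal sketch of that kind of packing follows. The 32/32 bit split and the helper names are assumptions made for illustration; the commit text does not state the actual field widths.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical split: low 32 bits hold the ring slot, high 32 bits the
 * index of the connection in the table of active connections. */
#define RING_BITS_SKETCH 32

/* Build a 64-bit work request ID from a connection index and ring slot. */
static uint64_t wr_id_pack(uint32_t conn_index, uint32_t ring_slot)
{
    return ((uint64_t)conn_index << RING_BITS_SKETCH) | ring_slot;
}

/* Recover the connection table index from a completed WR ID. */
static uint32_t wr_id_conn(uint64_t wr_id)
{
    return (uint32_t)(wr_id >> RING_BITS_SKETCH);
}

/* Recover the receive ring slot from a completed WR ID. */
static uint32_t wr_id_slot(uint64_t wr_id)
{
    return (uint32_t)(wr_id & 0xffffffffu);
}
```

On a receive completion, the handler can then find both the connection and the buffer from the WR ID alone, without relying on the `qp` field of the work completion.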
    • IPoIB: Trivial formatting cleanups · 2337f809
      Roland Dreier authored
      Fix whitespace blunders, convert "foo* bar" to "foo *bar", etc.
      Signed-off-by: Roland Dreier <rolandd@cisco.com>
      2337f809
  22. 20 Oct, 2007 1 commit
  23. 16 Oct, 2007 1 commit
    • IB/ipoib: Bound the net device to the ipoib_neigh structure · 732a2170
      Moni Shoua authored
      IPoIB uses a two layer neighboring scheme, such that for each struct neighbour
      whose device is an ipoib one, there is a struct ipoib_neigh buddy which is
      created on demand at the tx flow by an ipoib_neigh_alloc(skb->dst->neighbour)
      call.
      
      When using the bonding driver, neighbours are created by the net stack on behalf
      of the bonding (master) device. On the tx flow the bonding code gets an skb such
      that skb->dev points to the master device; it changes this skb to point to the
      slave device and calls the slave hard_start_xmit function.
      
      Under this scheme, the assumption in ipoib_neigh_destructor that for each struct
      neighbour it gets, n->dev is an ipoib device, and hence that netdev_priv(n->dev)
      can be cast to struct ipoib_dev_priv, is wrong.
      
      To fix it, this patch adds a dev field to struct ipoib_neigh which is used
      instead of the struct neighbour dev one, when n->dev->flags has the
      IFF_MASTER bit set.
      
      Signed-off-by: Moni Shoua <monis at voltaire.com>
      Signed-off-by: Or Gerlitz <ogerlitz at voltaire.com>
      Acked-by: Roland Dreier <rdreier@cisco.com>
      Signed-off-by: Jeff Garzik <jeff@garzik.org>
      732a2170
  24. 11 Oct, 2007 1 commit