1. 31 Aug, 2015 1 commit
    • IB/core: lock client data with lists_rwsem · 7c1eb45a
      Authored by Haggai Eran
      An ib_client callback that is called with the lists_rwsem locked only for
      read is protected from changes to the IB client lists, but not from
      ib_unregister_device() freeing its client data. This is because
      ib_unregister_device() will remove the device from the device list with
      lists_rwsem locked for write, but perform the rest of the cleanup,
      including the call to remove() without that lock.
      
      Mark client data that is undergoing de-registration with a new going_down
      flag in the client data context. Lock the client data list with lists_rwsem
      for write in addition to using the spinlock, so that functions calling the
      callback would be able to lock only lists_rwsem for read and let callbacks
      sleep.
      
      Since ib_unregister_client() now marks the client data context, no need for
      remove() to search the context again, so pass the client data directly to
      remove() callbacks.
      Reviewed-by: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
      Signed-off-by: Haggai Eran <haggaie@mellanox.com>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
  2. 24 Jul, 2015 1 commit
  3. 15 Jul, 2015 5 commits
    • IB/ipoib: Set MTU to max allowed by mode when mode changes · edcd2a74
      Authored by Erez Shitrit
      When switching between modes (datagram / connected), change the MTU
      accordingly: datagram mode allows up to 4K, connected mode up to
      (64K - 0x10).
      Signed-off-by: ELi Cohen <eli@mellanox.com>
      Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
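The rule described in this commit can be pictured with a small userspace sketch. The enum and function names below are hypothetical and are not taken from the driver; the datagram bound uses the commit's round "4K" figure and ignores any encapsulation header the real code may subtract.

```c
#include <assert.h>

/* Hypothetical names for illustration; not the driver's identifiers. */
enum sketch_ipoib_mode { SKETCH_MODE_DATAGRAM, SKETCH_MODE_CONNECTED };

/* Upper MTU bound per mode, using the figures quoted in the commit message. */
unsigned int sketch_max_mtu_for_mode(enum sketch_ipoib_mode mode)
{
    if (mode == SKETCH_MODE_CONNECTED)
        return 0x10000 - 0x10;  /* 64K - 0x10 */
    return 4096;                /* "up to 4K" in datagram mode (assumed exact) */
}

/* On a mode switch, the MTU is set to the maximum the new mode allows. */
unsigned int sketch_mtu_after_mode_change(enum sketch_ipoib_mode new_mode)
{
    return sketch_max_mtu_for_mode(new_mode);
}
```

Under these assumptions, switching to connected mode yields 65520 and switching back to datagram mode yields 4096.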
    • IB/ipoib: Scatter-Gather support in connected mode · c4268778
      Authored by Yuval Shaia
      By default, the IPoIB-CM driver uses a 64k MTU. A larger MTU gives better
      performance.
      This MTU plus overhead puts the memory allocation for IP-based packets at
      32 4k pages (order 5), which have to be contiguous.
      When system memory is under pressure, it was observed that allocating 128k
      of contiguous physical memory is difficult and causes serious errors (such
      as the system becoming unusable).
      
      This enhancement resolves the issue by removing the physically contiguous
      memory requirement, using the Scatter/Gather feature that exists in the
      Linux stack.
      
      With this fix, Scatter-Gather is also supported in connected mode.
      
      This change reverts some of the change made in commit e112373f
      ("IPoIB/cm: Reduce connected mode TX object size").
      
      The ability to use SG in IPoIB CM is possible because the coupling
      between NETIF_F_SG and NETIF_F_CSUM was removed in commit
      ec5f0615 ("net: Kill link between CSUM and SG features.")
      Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
      Acked-by: Christian Marie <christian@ponies.io>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
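The "order 5" figure in this commit follows from power-of-two page allocation: a buffer even slightly larger than 64K rounds up to 32 contiguous 4K pages (128K). The sketch below is a userspace stand-in for that arithmetic, not the kernel's own helper; the function name and the 64-byte overhead used in the usage note are assumptions.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u  /* assumes 4K pages, as in the commit message */

/*
 * Smallest order such that (SKETCH_PAGE_SIZE << order) >= size.
 * Contiguous allocations are made in power-of-two page counts, so this is
 * the order a physically contiguous buffer of 'size' bytes would need.
 */
unsigned int sketch_order_for_size(size_t size)
{
    unsigned int order = 0;
    size_t span = SKETCH_PAGE_SIZE;

    while (span < size) {
        span <<= 1;
        order++;
    }
    return order;
}
```

For example, `sketch_order_for_size(65536)` is 4, while `sketch_order_for_size(65520 + 64)` (a connected-mode MTU plus a hypothetical 64 bytes of overhead) is 5. Scatter/Gather avoids this jump by letting the buffer be built from separate pages.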
    • IB/ipoib: Prevent lockdep warning in __ipoib_ib_dev_flush · 8b7cce0d
      Authored by Haggai Eran
      __ipoib_ib_dev_flush calls itself recursively on child devices, and lockdep
      complains about locking vlan_rwsem twice (see below). Use down_read_nested
      instead of down_read to prevent the warning.
      
       =============================================
       [ INFO: possible recursive locking detected ]
       4.1.0-rc4+ #36 Tainted: G           O
       ---------------------------------------------
       kworker/u20:2/261 is trying to acquire lock:
        (&priv->vlan_rwsem){.+.+..}, at: [<ffffffffa0791e2a>] __ipoib_ib_dev_flush+0x3a/0x2b0 [ib_ipoib]
      
       but task is already holding lock:
        (&priv->vlan_rwsem){.+.+..}, at: [<ffffffffa0791e2a>] __ipoib_ib_dev_flush+0x3a/0x2b0 [ib_ipoib]
      
       other info that might help us debug this:
        Possible unsafe locking scenario:
      
              CPU0
              ----
         lock(&priv->vlan_rwsem);
         lock(&priv->vlan_rwsem);
      
        *** DEADLOCK ***
      
        May be due to missing lock nesting notation
      
       3 locks held by kworker/u20:2/261:
        #0:  ("%s""ipoib_flush"){.+.+..}, at: [<ffffffff810827cc>] process_one_work+0x15c/0x760
        #1:  ((&priv->flush_heavy)){+.+...}, at: [<ffffffff810827cc>] process_one_work+0x15c/0x760
        #2:  (&priv->vlan_rwsem){.+.+..}, at: [<ffffffffa0791e2a>] __ipoib_ib_dev_flush+0x3a/0x2b0 [ib_ipoib]
      
       stack backtrace:
       CPU: 3 PID: 261 Comm: kworker/u20:2 Tainted: G           O    4.1.0-rc4+ #36
       Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
       Workqueue: ipoib_flush ipoib_ib_dev_flush_heavy [ib_ipoib]
        ffff8801c6c54790 ffff8801c9927af8 ffffffff81665238 0000000000000001
        ffffffff825b5b30 ffff8801c9927bd8 ffffffff810bba51 ffff880100000000
        ffffffff00000001 ffff880100000001 ffff8801c6c55428 ffff8801c6c54790
       Call Trace:
        [<ffffffff81665238>] dump_stack+0x4f/0x6f
        [<ffffffff810bba51>] __lock_acquire+0x741/0x1820
        [<ffffffff810bcbf8>] lock_acquire+0xc8/0x240
        [<ffffffffa0791e2a>] ? __ipoib_ib_dev_flush+0x3a/0x2b0 [ib_ipoib]
        [<ffffffff81669d2c>] down_read+0x4c/0x70
        [<ffffffffa0791e2a>] ? __ipoib_ib_dev_flush+0x3a/0x2b0 [ib_ipoib]
        [<ffffffffa0791e2a>] __ipoib_ib_dev_flush+0x3a/0x2b0 [ib_ipoib]
        [<ffffffffa0791e4a>] __ipoib_ib_dev_flush+0x5a/0x2b0 [ib_ipoib]
        [<ffffffffa07920ba>] ipoib_ib_dev_flush_heavy+0x1a/0x20 [ib_ipoib]
        [<ffffffff81082871>] process_one_work+0x201/0x760
        [<ffffffff810827cc>] ? process_one_work+0x15c/0x760
        [<ffffffff81082ef0>] worker_thread+0x120/0x4d0
        [<ffffffff81082dd0>] ? process_one_work+0x760/0x760
        [<ffffffff81082dd0>] ? process_one_work+0x760/0x760
        [<ffffffff81088b7e>] kthread+0xfe/0x120
        [<ffffffff81088a80>] ? __init_kthread_worker+0x70/0x70
        [<ffffffff8166c6e2>] ret_from_fork+0x42/0x70
        [<ffffffff81088a80>] ? __init_kthread_worker+0x70/0x70
      Signed-off-by: Haggai Eran <haggaie@mellanox.com>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
    • IB/IPoIB: Fix bad error flow in ipoib_add_port() · 58e9cc90
      Authored by Amir Vadai
      Error values of ib_query_port() and ib_query_device() weren't propagated
      correctly. Because of that, ipoib_add_port() could return a NULL value,
      which escaped the IS_ERR() check in ipoib_add_one(), and we crashed.
      Signed-off-by: Amir Vadai <amirv@mellanox.com>
      Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
      Reviewed-by: Ira Weiny <ira.weiny@intel.com>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
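Why a NULL return slips past an IS_ERR() check can be shown with a userspace imitation of the kernel's error-pointer convention. Everything prefixed `sketch_` is a stand-in written for this illustration, and the two `add_port` variants are hypothetical shapes of the bug and the fix, not the driver's code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_ERRNO 4095

/* Encode a negative errno in a pointer, in the style of ERR_PTR(). */
void *sketch_err_ptr(long err)
{
    return (void *)(intptr_t)err;
}

/* True only for pointers in the top SKETCH_MAX_ERRNO addresses; NULL is not one. */
int sketch_is_err(const void *p)
{
    return (uintptr_t)p >= (uintptr_t)-SKETCH_MAX_ERRNO;
}

long sketch_ptr_err(const void *p)
{
    return (long)(intptr_t)p;
}

static int sketch_dev;  /* dummy object standing in for a created device */

/* Buggy shape: the error code is lost and NULL comes back instead. */
void *sketch_add_port_buggy(int query_result)
{
    if (query_result)
        return NULL;  /* an IS_ERR()-style check in the caller misses this */
    return &sketch_dev;
}

/* Fixed shape: the error code is propagated as an error pointer. */
void *sketch_add_port_fixed(int query_result)
{
    if (query_result)
        return sketch_err_ptr(query_result);
    return &sketch_dev;
}
```

A caller that only tests `sketch_is_err()` treats the NULL from the buggy variant as a valid device and dereferences it, which matches the crash described above.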
    • IB: Add rdma_cap_ib_switch helper and use where appropriate · 4139032b
      Authored by Hal Rosenstock
      Pursuant to Liran's comments on node_type on the linux-rdma
      mailing list:
      
      In an effort to reform the RDMA core and ULPs to minimize use of
      node_type in struct ib_device, an additional bit is added to
      struct ib_device for is_switch (IB switch). This is needed
      to be initialized by any IB switch device driver. This is a
      NEW requirement on such device drivers which are all
      "out of tree".
      
      In addition, an ib_switch helper was added to ib_verbs.h
      based on the is_switch device bit rather than node_type
      (although those should be consistent).
      
      The RDMA core (MAD, SMI, agent, sa_query, multicast, sysfs)
      as well as (IPoIB and SRP) ULPs are updated where
      appropriate to use this new helper. In some cases,
      the helper is now used under the covers of using
      rdma_[start end]_port rather than the open coding
      previously used.
      Reviewed-by: Sean Hefty <sean.hefty@intel.com>
      Reviewed-By: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
      Reviewed-by: Ira Weiny <ira.weiny@intel.com>
      Tested-by: Ira Weiny <ira.weiny@intel.com>
      Signed-off-by: Hal Rosenstock <hal@mellanox.com>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
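A minimal model of a capability helper built on an is_switch bit, with port-range helpers layered on top, might look like the sketch below. The struct is a cut-down stand-in rather than the real struct ib_device, all `sketch_` names are invented here, and the port numbering assumes the IB convention that a switch is addressed through port 0 while other nodes number ports from 1.

```c
#include <assert.h>
#include <stdbool.h>

/* Cut-down stand-in for the device structure; only what the sketch needs. */
struct sketch_ib_device {
    unsigned int is_switch : 1;  /* would be set by an IB switch driver */
    unsigned char phys_port_cnt;
};

/* Capability test based on the device bit rather than on node_type. */
bool sketch_cap_ib_switch(const struct sketch_ib_device *dev)
{
    return dev->is_switch;
}

/* First valid port number: 0 for a switch, 1 otherwise (assumed convention). */
unsigned char sketch_start_port(const struct sketch_ib_device *dev)
{
    return sketch_cap_ib_switch(dev) ? 0 : 1;
}

/* Last valid port number: 0 for a switch, phys_port_cnt otherwise. */
unsigned char sketch_end_port(const struct sketch_ib_device *dev)
{
    return sketch_cap_ib_switch(dev) ? 0 : dev->phys_port_cnt;
}
```

Callers can then loop from `sketch_start_port()` to `sketch_end_port()` without open-coding a node-type comparison at each site.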
  4. 13 Jun, 2015 1 commit
  5. 02 Jun, 2015 1 commit
  6. 19 May, 2015 1 commit
  7. 06 May, 2015 1 commit
  8. 18 Apr, 2015 1 commit
  9. 16 Apr, 2015 14 commits
    • IB/ipoib: Remove IPOIB_MCAST_RUN bit · 0e5544d9
      Authored by Erez Shitrit
      After Doug Ledford's changes there is no need for that bit; its
      semantics become a subset of the IPOIB_FLAG_OPER_UP bit.
      Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
      Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
    • IB/ipoib: Save only IPOIB_MAX_PATH_REC_QUEUE skb's · 1e85b806
      Authored by Erez Shitrit
      Whenever there is no path->ah to the destination, keep only a defined
      number of skb's. Otherwise there are cases where the driver can keep an
      unbounded list of skb's.
      
      For example, when one device wants to send unicast ARP to the destination
      and for some reason the SM doesn't respond, the driver currently keeps
      all the skb's. If that unicast ARP traffic stops, all these skb's
      are kept by the path object until the interface goes down.
      Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
      Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
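The bounded-queue idea can be sketched in userspace as follows. The struct, the function, and the cap of 3 are assumptions made for illustration; the commit message only says that a defined limit (IPOIB_MAX_PATH_REC_QUEUE) is applied, and the real driver queues actual skb's rather than counting them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative cap; the commit message does not state the actual value. */
#define SKETCH_MAX_PATH_REC_QUEUE 3

struct sketch_path {
    bool has_ah;     /* true once the path has been resolved */
    size_t queued;   /* packets parked while waiting for resolution */
    size_t dropped;  /* packets refused because the queue was full */
};

/* Returns true if the packet was sent or queued, false if it was dropped. */
bool sketch_path_xmit(struct sketch_path *path)
{
    if (path->has_ah)
        return true;  /* resolved path: send right away */

    if (path->queued < SKETCH_MAX_PATH_REC_QUEUE) {
        path->queued++;  /* park the packet until the path resolves */
        return true;
    }

    path->dropped++;  /* queue is full: drop instead of growing without bound */
    return false;
}
```

With an unresolved path, five calls leave three packets queued and two dropped, so memory held by the path object stays bounded even if the SM never answers.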
    • IB/ipoib: Handle QP in SQE state · 2c010730
      Authored by Erez Shitrit
      As the result of a completion error, the QP can be moved to the SQE state
      by the hardware. Since it's not the Error state, there are no flushes
      and hence the driver doesn't know about that.
      
      The fix creates a task that, after a completion with an error that is not
      a flush, checks the QP state and, if it is in the SQE state, moves it back
      to RTS.
      Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
      Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
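The recovery logic described here reduces to two decisions: whether a completion warrants a state check, and what the check does. The sketch below models only that control flow; the enums and function names are simplified stand-ins invented for the example, and the real driver would query and modify the QP through the verbs API from a deferred work item.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-ins for QP states and completion statuses. */
enum sketch_qp_state { SKETCH_QPS_RTS, SKETCH_QPS_SQE, SKETCH_QPS_ERR };
enum sketch_wc_status { SKETCH_WC_SUCCESS, SKETCH_WC_FLUSH_ERR, SKETCH_WC_OTHER_ERR };

struct sketch_qp {
    enum sketch_qp_state state;
};

/* A state check is only needed for an error completion that is not a flush. */
bool sketch_needs_qp_check(enum sketch_wc_status status)
{
    return status != SKETCH_WC_SUCCESS && status != SKETCH_WC_FLUSH_ERR;
}

/* Body of the deferred check: if the QP sits in SQE, move it back to RTS. */
void sketch_qp_state_validate(struct sketch_qp *qp)
{
    if (qp->state == SKETCH_QPS_SQE)
        qp->state = SKETCH_QPS_RTS;
}
```

A QP in the Error state is left alone by this check, since that case already produces flush completions that the driver handles elsewhere.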
    • IB/ipoib: Update broadcast record values after each successful join request · 3fd0605c
      Authored by Erez Shitrit
      Update the cached broadcast record in the priv object after every new
      join of this broadcast domain group.
      
      These values are needed for the port configuration (MTU size) and for
      the initial parameters of all new multicast (non-broadcast) join requests.
      
      For example, the SM starts with a 2K MTU for the whole fabric, and after
      that it restarts (or hands over to a new SM) with a new port configuration
      of 4K MTU. Without using the new values, the driver will keep its old
      configuration of 2K and will not apply the new configuration of 4K.
      Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
      Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
    • IB/ipoib: Use one linear skb in RX flow · a44878d1
      Authored by Erez Shitrit
      The current code in the RX flow uses two sg entries for each incoming
      packet: the first for the IB headers and the second for the rest
      of the data. That causes two DMA map/unmap operations, two allocations,
      and a few more actions in the data path.
      
      Use only one linear skb on each incoming packet, for the data (IB
      headers and payload), that reduces the packet processing in the
      data-path (only one skb, no frags, the first frag was not used anyway,
      less memory allocations) and the dma handling (only one dma map/unmap
      over each incoming packet instead of two map/unmap per each incoming packet).
      
      After commit 73d3fe6d ("gro: fix aggregation for skb using frag_list") from
      Eric Dumazet, we will get full aggregation for large packets.
      
      When running bandwidth tests before and after the change (over the card's
      NUMA node), using "netperf -H 1.1.1.3 -T -t TCP_STREAM", the results are
      ~12Gbs before and ~16Gbs after on my setup (Mellanox's ConnectX3).
      Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
      Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
    • IB/ipoib: drop mcast_mutex usage · 1c0453d6
      Authored by Doug Ledford
      We needed the mcast_mutex when we had to prevent the join completion
      callback from having the value it stored in mcast->mc overwritten
      by a delayed return from ib_sa_join_multicast.  By storing the return
      of ib_sa_join_multicast in an intermediate variable, we prevent a
      delayed return from ib_sa_join_multicast overwriting the valid
      contents of mcast->mc, and we no longer need a mutex to force the
      join callback to run after the return of ib_sa_join_multicast.  This
      allows us to do away with the mutex entirely and protect our critical
      sections with just a spinlock instead.  This is highly desirable
      as there were some places where we couldn't use a mutex because the
      code was not allowed to sleep, and so we were currently using a mix
      of mutex and spinlock to protect what we needed to protect.  Now we
      only have a spin lock and the locking complexity is greatly reduced.
      Signed-off-by: Doug Ledford <dledford@redhat.com>
    • IB/ipoib: deserialize multicast joins · d2fe937c
      Authored by Doug Ledford
      Allow the ipoib layer to attempt to join all outstanding multicast
      groups at once.  The ib_sa layer will serialize multiple attempts to
      join the same group, but will process attempts to join different groups
      in parallel.  Take advantage of that.
      
      In order to make this happen, change the mcast_join_thread to loop
      through all needed joins, sending a join request for each one that we
      still need to join.  There are a few special cases we handle though:
      
      1) Don't attempt to join anything but the broadcast group until the join
      of the broadcast group has succeeded.
      2) No longer restart the join task at the end of completion handling.
      If we completed successfully, we are done.  The join task now needs kicked
      either by mcast_send or mcast_restart_task or mcast_start_thread, but
      should not need started anytime else except when scheduling a backoff
      attempt to rejoin.
      3) No longer use separate join/completion routines for regular and
      sendonly joins, pass them all through the same routine and just do the
      right thing based on the SENDONLY join flag.
      4) Only try to join a SENDONLY join twice, then drop the packets and
      quit trying.  We leave the mcast group in the list so that if we get a
      new packet, all that we have to do is queue up the packet and restart
      the join task and it will automatically try to join twice and then
      either send or flush the queue again.
      Signed-off-by: Doug Ledford <dledford@redhat.com>
    • IB/ipoib: fix MCAST_FLAG_BUSY usage · 69911416
      Authored by Doug Ledford
      Commit a9c8ba58 ("IPoIB: Fix usage of uninitialized multicast
      objects") added a new flag MCAST_JOIN_STARTED, but was not very strict
      in how it was used.  We didn't always initialize the completion struct
      before we set the flag, and we didn't always call complete on the
      completion struct from all paths that complete it.  And when we did
      complete it, sometimes we continued to touch the mcast entry after
      the completion, opening us up to possible use after free issues.
      
      This made it less than totally effective, and certainly made its use
      confusing.  And in the flush function we would use the presence of this
      flag to signal that we should wait on the completion struct, but we never
      cleared this flag, ever.
      
      In order to make things clearer and aid in resolving the rtnl deadlock
      bug I've been chasing, I cleaned this up a bit.
      
       1) Remove the MCAST_JOIN_STARTED flag entirely
       2) Change MCAST_FLAG_BUSY so it now only means a join is in-flight
       3) Test mcast->mc directly to see if we have completed
          ib_sa_join_multicast (using IS_ERR_OR_NULL)
       4) Make sure that before setting MCAST_FLAG_BUSY we always initialize
          the mcast->done completion struct
       5) Make sure that before calling complete(&mcast->done), we always clear
          the MCAST_FLAG_BUSY bit
       6) Take the mcast_mutex before we call ib_sa_multicast_join and also
          take the mutex in our join callback.  This forces
          ib_sa_multicast_join to return and set mcast->mc before we process
          the callback.  This way, our callback can safely clear mcast->mc
          if there is an error on the join and we will do the right thing as
          a result in mcast_dev_flush.
       7) Because we need the mutex to synchronize mcast->mc, we can no
          longer call mcast_sendonly_join directly from mcast_send and
          instead must add sendonly join processing to the mcast_join_task
       8) Make MCAST_RUN mean that we have a working mcast subsystem, not that
          we have a running task.  We know when we need to reschedule our
          join task thread and don't need a flag to tell us.
       9) Add a helper for rescheduling the join task thread
      
      A number of different races are resolved with these changes.  These
      races existed with the old MCAST_FLAG_BUSY usage, the
      MCAST_JOIN_STARTED flag was an attempt to address them, and while it
      helped, a determined effort could still trip things up.
      
      One race looks something like this:
      
      Thread 1                             Thread 2
      ib_sa_join_multicast (as part of running restart mcast task)
        alloc member
        call callback
                                           ifconfig ib0 down
      				     wait_for_completion
          callback call completes
                                           wait_for_completion in
      				     mcast_dev_flush completes
      				       mcast->mc is PTR_ERR_OR_NULL
      				       so we skip ib_sa_leave_multicast
          return from callback
        return from ib_sa_join_multicast
      set mcast->mc = return from ib_sa_multicast
      
      We now have a permanently unbalanced join/leave issue that trips up the
      refcounting in core/multicast.c
      
      Another like this:
      
      Thread 1                   Thread 2         Thread 3
      ib_sa_multicast_join
                                                  ifconfig ib0 down
      					    priv->broadcast = NULL
                                 join_complete
      			                    wait_for_completion
      			   mcast->mc is not yet set, so don't clear
      return from ib_sa_join_multicast and set mcast->mc
      			   complete
      			   return -EAGAIN (making mcast->mc invalid)
      			   		    call ib_sa_multicast_leave
      					    on invalid mcast->mc, hang
      					    forever
      
      By holding the mutex around ib_sa_multicast_join and taking the mutex
      early in the callback, we force mcast->mc to be valid at the time we
      run the callback.  This allows us to clear mcast->mc if there is an
      error and the join is going to fail.  We do this before we complete
      the mcast.  In this way, mcast_dev_flush always sees consistent state
      in regards to mcast->mc membership at the time that the
      wait_for_completion() returns.
      Signed-off-by: Doug Ledford <dledford@redhat.com>
    • IB/ipoib: No longer use flush as a parameter · efc82eee
      Authored by Doug Ledford
      Various places in the IPoIB code had a deadlock related to flushing
      the ipoib workqueue.  Now that we have per device workqueues and a
      specific flush workqueue, there is no longer a deadlock issue with
      flushing the device specific workqueues and we can do so unilaterally.
      Signed-off-by: Doug Ledford <dledford@redhat.com>
    • IB/ipoib: Use dedicated workqueues per interface · 0b39578b
      Authored by Doug Ledford
      During my recent work on the rtnl lock deadlock in the IPoIB driver, I
      saw that even once I fixed the apparent races for a single device, as
      soon as that device had any children, new races popped up.  It turns
      out that this is because no matter how well we protect against races
      on a single device, the fact that all devices use the same workqueue,
      and flush_workqueue() flushes *everything* from that workqueue means
      that we would also have to prevent all races between different devices
      (for instance, ipoib_mcast_restart_task on interface ib0 can race with
      ipoib_mcast_flush_dev on interface ib0.8002, resulting in a deadlock on
      the rtnl_lock).
      
      There are several possible solutions to this problem:
      
      Make carrier_on_task and mcast_restart_task try to take the rtnl for
      some set period of time and if they fail, then bail.  This runs the
      real risk of dropping work on the floor, which can end up being its
      own separate kind of deadlock.
      
      Set some global flag in the driver that says some device is in the
      middle of going down, letting all tasks know to bail.  Again, this can
      drop work on the floor.
      
      Or the method this patch attempts to use, which is when we bring an
      interface up, create a workqueue specifically for that interface, so
      that when we take it back down, we are flushing only those tasks
      associated with our interface.  In addition, keep the global
      workqueue, but now limit it to only flush tasks.  In this way, the
      flush tasks can always flush the device specific work queues without
      having deadlock issues.
      Signed-off-by: Doug Ledford <dledford@redhat.com>
    • IB/ipoib: Make the carrier_on_task race aware · 894021a7
      Authored by Doug Ledford
      We blindly assume that we can just take the rtnl lock and that will
      prevent races with downing this interface.  Unfortunately, that's not
      the case.  In ipoib_mcast_stop_thread() we will call flush_workqueue()
      in an attempt to clear out all remaining instances of ipoib_join_task.
      But, since this task is put on the same workqueue as the join task,
      the flush_workqueue waits on this thread too.  But this thread is
      deadlocked on the rtnl lock.  The better thing here is to use trylock
      and loop on that until we either get the lock or we see that
      FLAG_OPER_UP has been cleared, in which case we don't need to do
      anything anyway and we just return.
      
      While investigating which flag should be used, FLAG_ADMIN_UP or
      FLAG_OPER_UP, it was determined that FLAG_OPER_UP was the more
      appropriate flag to use.  However, there was a mix of these two flags in
      use in the existing code.  So while we check for that flag here as part
      of this race fix, also clean up the two places that had used the less
      appropriate flag for their tests.
      Signed-off-by: Doug Ledford <dledford@redhat.com>
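The "trylock and loop until we get the lock or the flag is cleared" pattern from this commit can be sketched in userspace with C11 atomics. The toy lock below stands in for the rtnl lock and the atomic boolean for FLAG_OPER_UP; all names are hypothetical, and real code would sleep between attempts instead of spinning.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy lock standing in for the rtnl lock; not a real kernel primitive. */
struct sketch_lock {
    atomic_flag held;
};

bool sketch_trylock(struct sketch_lock *l)
{
    return !atomic_flag_test_and_set(&l->held);
}

void sketch_unlock(struct sketch_lock *l)
{
    atomic_flag_clear(&l->held);
}

/*
 * Keep trying to take the lock. Give up as soon as the interface is seen
 * to be going down, because then there is nothing left for the task to do.
 * Returns true with the lock held, or false without it.
 */
bool sketch_lock_unless_down(struct sketch_lock *l, atomic_bool *oper_up)
{
    while (!sketch_trylock(l)) {
        if (!atomic_load(oper_up))
            return false;  /* flag cleared: bail out instead of deadlocking */
        /* a real implementation would sleep briefly here before retrying */
    }
    return true;
}
```

The point of the design is that the task never blocks indefinitely on a lock whose holder may itself be waiting for the task to finish.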
    • IB/ipoib: Consolidate rtnl_lock tasks in workqueue · c84ca6d2
      Authored by Doug Ledford
      The ipoib_mcast_flush_dev routine is called with the rtnl_lock held and
      needs to keep it held.  It also needs to call flush_workqueue() to flush
      out any outstanding work.  In the past, we've had to try and make sure
      that we didn't flush out any outstanding join completions because they
      also wanted to grab rtnl_lock() and that would deadlock.  It turns out
      that the only thing in the join completion handler that needs this lock
      can be safely moved to our carrier_on_task, thereby reducing the
      potential for the join completion code and the flush code to deadlock
      against each other.
      Signed-off-by: Doug Ledford <dledford@redhat.com>
    • IB/ipoib: change init sequence ordering · be7aa663
      Authored by Doug Ledford
      In preparation for using per device work queues, we need to move the
      start of the neighbor thread task to after ipoib_ib_dev_init and move
      the destruction of the neighbor task to before ipoib_ib_dev_cleanup.
      Otherwise we will end up freeing our workqueue with work possibly
      still on it.
      Signed-off-by: Doug Ledford <dledford@redhat.com>
    • IB/ipoib: factor out ah flushing · e135106f
      Authored by Doug Ledford
      Create ipoib_flush_ah and ipoib_stop_ah routines to use at
      appropriate times to flush out all remaining ah entries before we shut
      the device down.
      
      Because neighbors and mcast entries can each have a reference on any
      given ah, we must make sure to free all of those first before our ah
      will actually have a 0 refcount and be able to be reaped.
      
      This factoring is needed in preparation for having per-device work
      queues.  The original per-device workqueue code resulted in the following
      error message:
      
      <ibdev>: ib_dealloc_pd failed
      
      That error was tracked down to this issue.  With the changes to which
      workqueues were flushed when, there were no flushes of the per device
      workqueue after the last ah's were freed, resulting in an attempt to
      dealloc the pd with outstanding resources still allocated.  This code
      puts the explicit flushes in the needed places to avoid that problem.
      Signed-off-by: Doug Ledford <dledford@redhat.com>
  10. 03 Apr, 2015 1 commit
  11. 31 Jan, 2015 8 commits
  12. 16 Dec, 2014 5 commits
    • IPoIB: No longer use flush as a parameter · ce347ab9
      Authored by Doug Ledford
      Various places in the IPoIB code had a deadlock related to flushing
      the ipoib workqueue.  Now that we have per device workqueues and a
      specific flush workqueue, there is no longer a deadlock issue with
      flushing the device specific workqueues and we can do so unilaterally.
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      Signed-off-by: Roland Dreier <roland@purestorage.com>
    • IPoIB: Make ipoib_mcast_stop_thread flush the workqueue · bb42a6dd
      Authored by Doug Ledford
      We used to pass a flush variable to mcast_stop_thread to indicate if
      we should flush the workqueue or not.  This was due to some code
      trying to flush a workqueue that it was currently running on which is
      a no-no.  Now that we have per-device work queues, and now that
      ipoib_mcast_restart_task has taken the fact that it is queued on a
      single thread workqueue with all of the ipoib_mcast_join_task's and
      therefore has no need to stop the join task while it runs, we can do
      away with the flush parameter and unilaterally flush always.
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      Signed-off-by: Roland Dreier <roland@purestorage.com>
    • IPoIB: Use dedicated workqueues per interface · 5141861c
      Authored by Doug Ledford
      During my recent work on the rtnl lock deadlock in the IPoIB driver, I
      saw that even once I fixed the apparent races for a single device, as
      soon as that device had any children, new races popped up.  It turns
      out that this is because no matter how well we protect against races
      on a single device, the fact that all devices use the same workqueue,
      and flush_workqueue() flushes *everything* from that workqueue, we can
      have one device in the middle of a down and holding the rtnl lock and
      another totally unrelated device needing to run mcast_restart_task,
      which wants the rtnl lock and will loop trying to take it unless it
      sees its own FLAG_ADMIN_UP flag go away.  Because the unrelated
      interface will never see its own ADMIN_UP flag drop, the interface
      going down will deadlock trying to flush the queue.  There are several
      possible solutions to this problem:
      
      Make carrier_on_task and mcast_restart_task try to take the rtnl for
      some set period of time and if they fail, then bail.  This runs the
      real risk of dropping work on the floor, which can end up being its
      own separate kind of deadlock.
      
      Set some global flag in the driver that says some device is in the
      middle of going down, letting all tasks know to bail.  Again, this can
      drop work on the floor.  I suppose if our own ADMIN_UP flag doesn't go
      away, then maybe after a few tries on the rtnl lock we can queue our
      own task back up as a delayed work and return and avoid dropping work
      on the floor that way.  But I'm not 100% convinced that we won't cause
      other problems.
      
      Or the method this patch attempts to use, which is when we bring an
      interface up, create a workqueue specifically for that interface, so
      that when we take it back down, we are flushing only those tasks
      associated with our interface.  In addition, keep the global
      workqueue, but now limit it to only flush tasks.  In this way, the
      flush tasks can always flush the device specific work queues without
      having deadlock issues.
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      Signed-off-by: Roland Dreier <roland@purestorage.com>
    • IPoIB: change init sequence ordering · 3bcce487
      Authored by Doug Ledford
      In preparation for using per device work queues, we need to move the
      start of the neighbor thread task to after ipoib_ib_dev_init and move
      the destruction of the neighbor task to before ipoib_ib_dev_cleanup.
      Otherwise we will end up freeing our workqueue with work possibly
      still on it.
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      Signed-off-by: Roland Dreier <roland@purestorage.com>
    • IPoIB: fix mcast_dev_flush/mcast_restart_task race · e5d1dcf1
      Authored by Doug Ledford
      Our mcast_dev_flush routine and our mcast_restart_task can race
      against each other.  In particular, they both hold the priv->lock
      while manipulating the rbtree and while removing mcast entries from
      the multicast_list and while adding entries to the remove_list, but
      they also both drop their locks prior to doing the actual removes.
      The mcast_dev_flush routine is run entirely under the rtnl lock and so
      has at least some locking.  The actual race condition is like this:
      
      Thread 1                                Thread 2
      ifconfig ib0 up
        start multicast join for broadcast
        multicast join completes for broadcast
        start to add more multicast joins
          call mcast_restart_task to add new entries
                                              ifconfig ib0 down
      					  mcast_dev_flush
      					    mcast_leave(mcast A)
          mcast_leave(mcast A)
      
      As mcast_leave calls ib_sa_multicast_leave, and as member in
      core/multicast.c is ref counted, we run into an unbalanced refcount
      issue.  To avoid stomping on each other's removes, take the rtnl lock
      specifically when we are deleting the entries from the remove list.
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      Signed-off-by: Roland Dreier <roland@purestorage.com>