1. 23 7月, 2017 3 次提交
    • E
      IB/ipoib: Make sure no in-flight joins while leaving that mcast · a08e1120
      Erez Shitrit 提交于
      While cleaning neighs and there is a send-only mcast neigh, the driver
      should wait to finish its join process before trying to remove it.
      
      Without this patch, we will see messages like: "ipoib_mcast_leave on an
      in-flight join" and unexpected results in the join_complete.
      Signed-off-by: NErez Shitrit <erezsh@mellanox.com>
      Signed-off-by: NLeon Romanovsky <leon@kernel.org>
      a08e1120
    • E
      IB/ipoib: Use cancel_delayed_work_sync when needed · 6bdc8de2
      Erez Shitrit 提交于
      The work mcast_task can re-queue itself, so instead of doing
      cancel && flush_workqueue, that still can leave a queued task
      on the air, use cancel_delayed_work_sync.
      
      Also, no need to use lock over the cancel, the original lock was
      due to bit assignment setting (IPOIB_MCAST_RUN) that is not in use
      anymore.
      Signed-off-by: NErez Shitrit <erezsh@mellanox.com>
      Signed-off-by: NLeon Romanovsky <leon@kernel.org>
      6bdc8de2
    • F
      IB/ipoib: Fix race between light events and interface restart · edf3f301
      Feras Daoud 提交于
      A potential race between light_event and interface restart
      may attach multicast group to an already attached QP.
      
      Scenario:
      light_event flow goes through ipoib_mcast_dev_flush function,
      if a context switch occurs before calling ipoib_mcast_remove_list,
      then we may face a situation where the broadcast of the priv is null
      and the corresponding QP is not detached yet.
      If an "interface restart" runs during the previous context switch,
      the following scenario occurs:
      When the device goes up, ipoib_ib_dev_up function will be called,
      it will send a new registration request to the broadcast group and then
      attach the group to the QP that was not detached before.
      
           IPOIB_FLUSH_LIGHT                                          INTERFACE RESTART
      
          __ipoib_ib_dev_flush                                                |
              |                                                               |
              |                                                               |
              |                                                               |
          ipoib_mcast_dev_flush                                               |
          Move mcast list and broadcast to remove_list                        |
              |                                                               |
              |                                                               |
          Context Switch-->                                                   |
              |                                                       ipoib_ib_dev_down
              |                                                               |
              |                                                               |
              |                                                       ipoib_ib_dev_up
              |                                                               |
              |                                                               |
              |                                                       ipoib_mcast_join_task
              |                                                       allocate new broadcast
              |                                                               |
              |                                                               |
              |                                                       Attach QP to multicast group
              |                                                               |
              |                                                               |
              |                                                       <--Context Switch
          ipoib_mcast_leave
          Detach QP from multicast group
      Signed-off-by: NFeras Daoud <ferasda@mellanox.com>
      Signed-off-by: NLeon Romanovsky <leon@kernel.org>
      edf3f301
  2. 02 5月, 2017 4 次提交
  3. 29 4月, 2017 1 次提交
  4. 21 4月, 2017 3 次提交
    • F
      IB/ipoib: Fix deadlock between ipoib_stop and mcast join flow · 3e31a490
      Feras Daoud 提交于
      Before calling ipoib_stop, rtnl_lock should be taken, then
      the flow clears the IPOIB_FLAG_ADMIN_UP and IPOIB_FLAG_OPER_UP
      flags, and waits for mcast completion if IPOIB_MCAST_FLAG_BUSY
      is set.
      
      On the other hand, the flow of multicast join task initializes
      a mcast completion, sets the IPOIB_MCAST_FLAG_BUSY and calls
      ipoib_mcast_join. If IPOIB_FLAG_OPER_UP flag is not set, this
      call returns EINVAL without setting the mcast completion and
      leads to a deadlock.
      
          ipoib_stop                          |
              |                               |
          clear_bit(IPOIB_FLAG_ADMIN_UP)      |
              |                               |
          Context Switch                      |
              |                       ipoib_mcast_join_task
              |                               |
              |                       spin_lock_irq(lock)
              |                               |
              |                       init_completion(mcast)
              |                               |
              |                       set_bit(IPOIB_MCAST_FLAG_BUSY)
              |                               |
              |                       Context Switch
              |                               |
          clear_bit(IPOIB_FLAG_OPER_UP)       |
              |                               |
          spin_lock_irqsave(lock)             |
              |                               |
          Context Switch                      |
              |                       ipoib_mcast_join
              |                       return (-EINVAL)
              |                               |
              |                       spin_unlock_irq(lock)
              |                               |
              |                       Context Switch
              |                               |
          ipoib_mcast_dev_flush               |
          wait_for_completion(mcast)          |
      
      ipoib_stop will wait for mcast completion for ever, and will
      not release the rtnl_lock. As a result panic occurs with the
      following trace:
      
          [13441.639268] Call Trace:
          [13441.640150]  [<ffffffff8168b579>] schedule+0x29/0x70
          [13441.641038]  [<ffffffff81688fc9>] schedule_timeout+0x239/0x2d0
          [13441.641914]  [<ffffffff810bc017>] ? complete+0x47/0x50
          [13441.642765]  [<ffffffff810a690d>] ? flush_workqueue_prep_pwqs+0x16d/0x200
          [13441.643580]  [<ffffffff8168b956>] wait_for_completion+0x116/0x170
          [13441.644434]  [<ffffffff810c4ec0>] ? wake_up_state+0x20/0x20
          [13441.645293]  [<ffffffffa05af170>] ipoib_mcast_dev_flush+0x150/0x190 [ib_ipoib]
          [13441.646159]  [<ffffffffa05ac967>] ipoib_ib_dev_down+0x37/0x60 [ib_ipoib]
          [13441.647013]  [<ffffffffa05a4805>] ipoib_stop+0x75/0x150 [ib_ipoib]
      
      Fixes: 08bc3276 ("IB/ipoib: fix for rare multicast join race condition")
      Signed-off-by: NFeras Daoud <ferasda@mellanox.com>
      Signed-off-by: NLeon Romanovsky <leon@kernel.org>
      Signed-off-by: NDoug Ledford <dledford@redhat.com>
      3e31a490
    • E
      IB/IPoIB: Support acceleration options callbacks · cd565b4b
      Erez Shitrit 提交于
      IPoIB driver now uses the new set of callback functions.
      
      If the hardware provider supports the new ipoib_options implementation,
      the driver uses the callbacks in its data path flows, otherwise it uses the
      driver default implementation for all data flows in its code.
      
      The default implementation wasn't change and it is exactly as it was before
      introduction of acceleration support.
      Signed-off-by: NErez Shitrit <erezsh@mellanox.com>
      Reviewed-by: NAlex Vesker <valex@mellanox.com>
      Signed-off-by: NLeon Romanovsky <leon@kernel.org>
      Signed-off-by: NDoug Ledford <dledford@redhat.com>
      cd565b4b
    • E
      IB/IPoIB: Use defined function for netdev_priv function · c1048aff
      Erez Shitrit 提交于
      Make ipoib_priv point to netdev_priv where the code calls netdev_priv.
      Signed-off-by: NErez Shitrit <erezsh@mellanox.com>
      Reviewed-by: NAlex Vesker <valex@mellanox.com>
      Signed-off-by: NLeon Romanovsky <leon@kernel.org>
      Signed-off-by: NDoug Ledford <dledford@redhat.com>
      c1048aff
  5. 25 1月, 2017 1 次提交
  6. 13 1月, 2017 1 次提交
  7. 15 12月, 2016 1 次提交
  8. 14 10月, 2016 1 次提交
    • P
      IB/ipoib: move back IB LL address into the hard header · fc791b63
      Paolo Abeni 提交于
      After the commit 9207f9d4 ("net: preserve IP control block
      during GSO segmentation"), the GSO CB and the IPoIB CB conflict.
      That destroy the IPoIB address information cached there,
      causing a severe performance regression, as better described here:
      
      http://marc.info/?l=linux-kernel&m=146787279825501&w=2
      
      This change moves the data cached by the IPoIB driver from the
      skb control lock into the IPoIB hard header, as done before
      the commit 936d7de3 ("IPoIB: Stop lying about hard_header_len
      and use skb->cb to stash LL addresses").
      In order to avoid GRO issue, on packet reception, the IPoIB driver
      stash into the skb a dummy pseudo header, so that the received
      packets have actually a hard header matching the declared length.
      To avoid changing the connected mode maximum mtu, the allocated
      head buffer size is increased by the pseudo header length.
      
      After this commit, IPoIB performances are back to pre-regression
      value.
      
      v2 -> v3: rebased
      v1 -> v2: avoid changing the max mtu, increasing the head buf size
      
      Fixes: 9207f9d4 ("net: preserve IP control block during GSO segmentation")
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fc791b63
  9. 07 6月, 2016 1 次提交
  10. 26 5月, 2016 2 次提交
    • M
      IB/IPoIB: Allow setting the device address · 492a7e67
      Mark Bloch 提交于
      In IB networks, and specifically in IPoIB/rdmacm traffic, the device
      address of an IPoIB interface is used as a means to exchange information
      between nodes needed for communication.
      
      Currently an IPoIB interface will always be created with a device
      address based on its node GUID without a way to change that.
      
      This change adds the ability to set the device address of an IPoIB
      interface by value. We use the set mac address ndo to do that.
      
      The flow should be broken down to two:
      1) The GID value is already in the GID table,
         in this case the interface will be able to set carrier up.
      
      2) The GID value is not yet in the GID table,
         in this case the interface won't try to join the multicast group
         and will wait (listen on GID_CHANGE event) until the GID is inserted.
      
      In order to track those changes, we add a new flag:
      * IPOIB_FLAG_DEV_ADDR_SET.
      
      When set, it means the dev_addr is a based on a value in the gid
      table. this bit will be cleared upon a dev_addr change triggered
      by the user and set after validation.
      
      Per IB spec the port GUID can't change if the module is loaded.
      port GUID is the basis for GID at index 0 which is the basis for
      the default device address of a ipoib interface.
      
      The issue is that there are devices that don't follow the spec,
      they change the port GUID while HCA is powered on, so in order
      not to break userspace applications. We need to check if the
      user wanted to control the device address and we assume that
      if he sets the device address back to be based on GID index 0,
      he no longer wishs to control it.
      
      In order to track this, we add an additional flag:
      * IPOIB_FLAG_DEV_ADDR_CTRL
      
      When setting the device address, there is no validation of the upper
      twelve bytes of the device address (flags, qpn, subnet prefix) as those
      bytes are not under the control of the user.
      Signed-off-by: NMark Bloch <markb@mellanox.com>
      Reviewed-by: NLeon Romanovsky <leonro@mellanox.com>
      Signed-off-by: NLeon Romanovsky <leon@kernel.org>
      Signed-off-by: NDoug Ledford <dledford@redhat.com>
      492a7e67
    • E
      IB/ipoib: Support SendOnlyFullMember MCG for SendOnly join · 3b561130
      Erez Shitrit 提交于
      Check (via an SA query) if the SM supports the new option for SendOnly
      multicast joins.
      If the SM supports that option it will use the new join state to create
      such multicast group.
      If SendOnlyFullMember is supported, we wouldn't use faked FullMember state
      join for SendOnly MCG, use the correct state if supported.
      
      This check is performed at every invocation of mcast_restart task, to be
      sure that the driver stays in sync with the current state of the SM.
      Signed-off-by: NErez Shitrit <erezsh@mellanox.com>
      Reviewed-by: NLeon Romanovsky <leonro@mellanox.com>
      Signed-off-by: NDoug Ledford <dledford@redhat.com>
      3b561130
  11. 13 2月, 2016 1 次提交
    • A
      IB/ipoib: fix for rare multicast join race condition · 08bc3276
      Alex Estrin 提交于
      A narrow window for race condition still exist between
      multicast join thread and *dev_flush workers.
      A kernel crash caused by prolong erratic link state changes
      was observed (most likely a faulty cabling):
      
      [167275.656270] BUG: unable to handle kernel NULL pointer dereference at
      0000000000000020
      [167275.665973] IP: [<ffffffffa05f8f2e>] ipoib_mcast_join+0xae/0x1d0 [ib_ipoib]
      [167275.674443] PGD 0
      [167275.677373] Oops: 0000 [#1] SMP
      ...
      [167275.977530] Call Trace:
      [167275.982225]  [<ffffffffa05f92f0>] ? ipoib_mcast_free+0x200/0x200 [ib_ipoib]
      [167275.992024]  [<ffffffffa05fa1b7>] ipoib_mcast_join_task+0x2a7/0x490
      [ib_ipoib]
      [167276.002149]  [<ffffffff8109d5fb>] process_one_work+0x17b/0x470
      [167276.010754]  [<ffffffff8109e3cb>] worker_thread+0x11b/0x400
      [167276.019088]  [<ffffffff8109e2b0>] ? rescuer_thread+0x400/0x400
      [167276.027737]  [<ffffffff810a5aef>] kthread+0xcf/0xe0
      Here was a hit spot:
      ipoib_mcast_join() {
      ..............
            rec.qkey      = priv->broadcast->mcmember.qkey;
                                             ^^^^^^^
      .....
       }
      Proposed patch should prevent multicast join task to continue
      if link state change is detected.
      Signed-off-by: NAlex Estrin <alex.estrin@intel.com>
      
      Changes from v4:
      - as suggested by Doug Ledford, optimized spinlock usage,
      i.e. ipoib_mcast_join() is called with lock held.
      Changes from v3:
      - sync with priv->lock before flag check.
      Chages from v2:
      - Move check for OPER_UP flag state to mcast_join() to
      ensure no event worker is in progress.
      - minor style fixes.
      Changes from v1:
      - No need to lock again if error detected.
      Signed-off-by: NDoug Ledford <dledford@redhat.com>
      08bc3276
  12. 20 1月, 2016 1 次提交
    • E
      IB/IPoIB: Fix kernel panic on multicast flow · 50be28de
      Erez Shitrit 提交于
      ipoib_mcast_restart_task calls ipoib_mcast_remove_list with the
      parameter mcast->dev. That mcast is a temporary (used as an iterator)
      variable that may be uninitialized.
      There is no need to send the variable dev to the function, as each mcast
      has its dev as a member in the mcast struct.
      
      This causes the next panic:
      RIP: 0010: ipoib_mcast_leave+0x6d/0xf0 [ib_ipoib]
      RSP: 0018: EFLAGS: 00010246
      RAX: f0201 RBX: 24e00 RCX: 00000
      ....
      ....
      Stack:
      Call Trace:
      	ipoib_mcast_remove_list+0x3a/0x70 [ib_ipoib]
      	ipoib_mcast_restart_task+0x3bb/0x520 [ib_ipoib]
      	process_one_work+0x164/0x470
      	worker_thread+0x11d/0x420
      	...
      
      Fixes: 5a0e81f6 ('IB/IPoIB: factor out common multicast list removal code')
      Signed-off-by: NErez Shitrit <erezsh@mellanox.com>
      Reported-by: NDoron Tsur <doront@mellanox.com>
      Reviewed-by: NChristoph Lameter <cl@linux.com>
      Signed-off-by: NDoug Ledford <dledford@redhat.com>
      50be28de
  13. 24 12月, 2015 2 次提交
  14. 22 10月, 2015 1 次提交
  15. 14 10月, 2015 1 次提交
  16. 08 10月, 2015 1 次提交
    • C
      IB: split struct ib_send_wr · e622f2f4
      Christoph Hellwig 提交于
      This patch split up struct ib_send_wr so that all non-trivial verbs
      use their own structure which embedds struct ib_send_wr.  This dramaticly
      shrinks the size of a WR for most common operations:
      
      sizeof(struct ib_send_wr) (old):	96
      
      sizeof(struct ib_send_wr):		48
      sizeof(struct ib_rdma_wr):		64
      sizeof(struct ib_atomic_wr):		96
      sizeof(struct ib_ud_wr):		88
      sizeof(struct ib_fast_reg_wr):		88
      sizeof(struct ib_bind_mw_wr):		96
      sizeof(struct ib_sig_handover_wr):	80
      
      And with Sagi's pending MR rework the fast registration WR will also be
      down to a reasonable size:
      
      sizeof(struct ib_fastreg_wr):		64
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com> [srp, srpt]
      Reviewed-by: Chuck Lever <chuck.lever@oracle.com> [sunrpc]
      Tested-by: NHaggai Eran <haggaie@mellanox.com>
      Tested-by: NSagi Grimberg <sagig@mellanox.com>
      Tested-by: NSteve Wise <swise@opengridcomputing.com>
      e622f2f4
  17. 26 9月, 2015 2 次提交
    • D
      IB/ipoib: Make sendonly multicast joins create the mcast group · c3852ab0
      Doug Ledford 提交于
      Since IPoIB should, as much as possible, emulate how multicast
      sends work on Ethernet for regular TCP/IP apps, there should be
      no requirement to subscribe to a multicast group before your
      sends are properly sent.  However, due to the difference in how
      multicast is handled on InfiniBand, we must join the appropriate
      multicast group before we can send to it.  Previously we tried
      not to trigger the auto-create feature of the subnet manager when
      doing this because we didn't have tracking of these sendonly
      groups and the auto-creation might never get undone.  The previous
      patch added timing to these sendonly joins and allows us to
      leave them after a reasonable idle expiration time.  So supply
      all of the information needed to auto-create group.
      Signed-off-by: NDoug Ledford <dledford@redhat.com>
      c3852ab0
    • C
      IB/ipoib: Expire sendonly multicast joins · bd99b2e0
      Christoph Lameter 提交于
      On neighbor expiration, check to see if the neighbor was actually a
      sendonly multicast join, and if so, leave the multicast group as we
      expire the neighbor.
      Signed-off-by: NChristoph Lameter <cl@linux.com>
      Signed-off-by: NDoug Ledford <dledford@redhat.com>
      bd99b2e0
  18. 04 9月, 2015 2 次提交
  19. 16 4月, 2015 9 次提交
    • E
      IB/ipoib: Remove IPOIB_MCAST_RUN bit · 0e5544d9
      Erez Shitrit 提交于
      After Doug Ledford's changes there is no need in that bit, it's
      semantic becomes subset of the IPOIB_FLAG_OPER_UP bit.
      Signed-off-by: NErez Shitrit <erezsh@mellanox.com>
      Signed-off-by: NOr Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: NDoug Ledford <dledford@redhat.com>
      0e5544d9
    • E
      IB/ipoib: Update broadcast record values after each successful join request · 3fd0605c
      Erez Shitrit 提交于
      Update the cached broadcast record in the priv object after every new
      join of this broadcast domain group.
      
      These values are needed for the port configuration (MTU size) and to
      all the new multicast (non-broadcast) join requests initial parameters.
      
      For example, SM starts with 2K MTU for all the fabric, and after that it
      restarts (or handover to new SM) with new port configuration of 4K MTU.
      Without using the new values, the driver will keep its old configuration
      of 2K and will not apply the new configuration of 4K.
      Signed-off-by: NErez Shitrit <erezsh@mellanox.com>
      Signed-off-by: NOr Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: NDoug Ledford <dledford@redhat.com>
      3fd0605c
    • D
      IB/ipoib: drop mcast_mutex usage · 1c0453d6
      Doug Ledford 提交于
      We needed the mcast_mutex when we had to prevent the join completion
      callback from having the value it stored in mcast->mc overwritten
      by a delayed return from ib_sa_join_multicast.  By storing the return
      of ib_sa_join_multicast in an intermediate variable, we prevent a
      delayed return from ib_sa_join_multicast overwriting the valid
      contents of mcast->mc, and we no longer need a mutex to force the
      join callback to run after the return of ib_sa_join_multicast.  This
      allows us to do away with the mutex entirely and protect our critical
      sections with a just a spinlock instead.  This is highly desirable
      as there were some places where we couldn't use a mutex because the
      code was not allowed to sleep, and so we were currently using a mix
      of mutex and spinlock to protect what we needed to protect.  Now we
      only have a spin lock and the locking complexity is greatly reduced.
      Signed-off-by: NDoug Ledford <dledford@redhat.com>
      1c0453d6
    • D
      IB/ipoib: deserialize multicast joins · d2fe937c
      Doug Ledford 提交于
      Allow the ipoib layer to attempt to join all outstanding multicast
      groups at once.  The ib_sa layer will serialize multiple attempts to
      join the same group, but will process attempts to join different groups
      in parallel.  Take advantage of that.
      
      In order to make this happen, change the mcast_join_thread to loop
      through all needed joins, sending a join request for each one that we
      still need to join.  There are a few special cases we handle though:
      
      1) Don't attempt to join anything but the broadcast group until the join
      of the broadcast group has succeeded.
      2) No longer restart the join task at the end of completion handling.
      If we completed successfully, we are done.  The join task now needs kicked
      either by mcast_send or mcast_restart_task or mcast_start_thread, but
      should not need started anytime else except when scheduling a backoff
      attempt to rejoin.
      3) No longer use separate join/completion routines for regular and
      sendonly joins, pass them all through the same routine and just do the
      right thing based on the SENDONLY join flag.
      4) Only try to join a SENDONLY join twice, then drop the packets and
      quit trying.  We leave the mcast group in the list so that if we get a
      new packet, all that we have to do is queue up the packet and restart
      the join task and it will automatically try to join twice and then
      either send or flush the queue again.
      Signed-off-by: NDoug Ledford <dledford@redhat.com>
      d2fe937c
    • D
      IB/ipoib: fix MCAST_FLAG_BUSY usage · 69911416
      Doug Ledford 提交于
      Commit a9c8ba58 ("IPoIB: Fix usage of uninitialized multicast
      objects") added a new flag MCAST_JOIN_STARTED, but was not very strict
      in how it was used.  We didn't always initialize the completion struct
      before we set the flag, and we didn't always call complete on the
      completion struct from all paths that complete it.  And when we did
      complete it, sometimes we continued to touch the mcast entry after
      the completion, opening us up to possible use after free issues.
      
      This made it less than totally effective, and certainly made its use
      confusing.  And in the flush function we would use the presence of this
      flag to signal that we should wait on the completion struct, but we never
      cleared this flag, ever.
      
      In order to make things clearer and aid in resolving the rtnl deadlock
      bug I've been chasing, I cleaned this up a bit.
      
       1) Remove the MCAST_JOIN_STARTED flag entirely
       2) Change MCAST_FLAG_BUSY so it now only means a join is in-flight
       3) Test mcast->mc directly to see if we have completed
          ib_sa_join_multicast (using IS_ERR_OR_NULL)
       4) Make sure that before setting MCAST_FLAG_BUSY we always initialize
          the mcast->done completion struct
       5) Make sure that before calling complete(&mcast->done), we always clear
          the MCAST_FLAG_BUSY bit
       6) Take the mcast_mutex before we call ib_sa_multicast_join and also
          take the mutex in our join callback.  This forces
          ib_sa_multicast_join to return and set mcast->mc before we process
          the callback.  This way, our callback can safely clear mcast->mc
          if there is an error on the join and we will do the right thing as
          a result in mcast_dev_flush.
       7) Because we need the mutex to synchronize mcast->mc, we can no
          longer call mcast_sendonly_join directly from mcast_send and
          instead must add sendonly join processing to the mcast_join_task
       8) Make MCAST_RUN mean that we have a working mcast subsystem, not that
          we have a running task.  We know when we need to reschedule our
          join task thread and don't need a flag to tell us.
       9) Add a helper for rescheduling the join task thread
      
      A number of different races are resolved with these changes.  These
      races existed with the old MCAST_FLAG_BUSY usage, the
      MCAST_JOIN_STARTED flag was an attempt to address them, and while it
      helped, a determined effort could still trip things up.
      
      One race looks something like this:
      
      Thread 1                             Thread 2
      ib_sa_join_multicast (as part of running restart mcast task)
        alloc member
        call callback
                                           ifconfig ib0 down
      				     wait_for_completion
          callback call completes
                                           wait_for_completion in
      				     mcast_dev_flush completes
      				       mcast->mc is PTR_ERR_OR_NULL
      				       so we skip ib_sa_leave_multicast
          return from callback
        return from ib_sa_join_multicast
      set mcast->mc = return from ib_sa_multicast
      
      We now have a permanently unbalanced join/leave issue that trips up the
      refcounting in core/multicast.c
      
      Another like this:
      
      Thread 1                   Thread 2         Thread 3
      ib_sa_multicast_join
                                                  ifconfig ib0 down
      					    priv->broadcast = NULL
                                 join_complete
      			                    wait_for_completion
      			   mcast->mc is not yet set, so don't clear
      return from ib_sa_join_multicast and set mcast->mc
      			   complete
      			   return -EAGAIN (making mcast->mc invalid)
      			   		    call ib_sa_multicast_leave
      					    on invalid mcast->mc, hang
      					    forever
      
      By holding the mutex around ib_sa_multicast_join and taking the mutex
      early in the callback, we force mcast->mc to be valid at the time we
      run the callback.  This allows us to clear mcast->mc if there is an
      error and the join is going to fail.  We do this before we complete
      the mcast.  In this way, mcast_dev_flush always sees consistent state
      in regards to mcast->mc membership at the time that the
      wait_for_completion() returns.
      Signed-off-by: NDoug Ledford <dledford@redhat.com>
      69911416
    • D
      IB/ipoib: No longer use flush as a parameter · efc82eee
      Doug Ledford 提交于
      Various places in the IPoIB code had a deadlock related to flushing
      the ipoib workqueue.  Now that we have per device workqueues and a
      specific flush workqueue, there is no longer a deadlock issue with
      flushing the device specific workqueues and we can do so unilaterally.
      Signed-off-by: NDoug Ledford <dledford@redhat.com>
      efc82eee
    • D
      IB/ipoib: Use dedicated workqueues per interface · 0b39578b
      Doug Ledford 提交于
      During my recent work on the rtnl lock deadlock in the IPoIB driver, I
      saw that even once I fixed the apparent races for a single device, as
      soon as that device had any children, new races popped up.  It turns
      out that this is because no matter how well we protect against races
      on a single device, the fact that all devices use the same workqueue,
      and flush_workqueue() flushes *everything* from that workqueue means
      that we would also have to prevent all races between different devices
      (for instance, ipoib_mcast_restart_task on interface ib0 can race with
      ipoib_mcast_flush_dev on interface ib0.8002, resulting in a deadlock on
      the rtnl_lock).
      
      There are several possible solutions to this problem:
      
      Make carrier_on_task and mcast_restart_task try to take the rtnl for
      some set period of time and if they fail, then bail.  This runs the
      real risk of dropping work on the floor, which can end up being its
      own separate kind of deadlock.
      
      Set some global flag in the driver that says some device is in the
      middle of going down, letting all tasks know to bail.  Again, this can
      drop work on the floor.
      
      Or the method this patch attempts to use, which is when we bring an
      interface up, create a workqueue specifically for that interface, so
      that when we take it back down, we are flushing only those tasks
      associated with our interface.  In addition, keep the global
      workqueue, but now limit it to only flush tasks.  In this way, the
      flush tasks can always flush the device specific work queues without
      having deadlock issues.
      Signed-off-by: NDoug Ledford <dledford@redhat.com>
      0b39578b
    • D
      IB/ipoib: Make the carrier_on_task race aware · 894021a7
      Doug Ledford 提交于
      We blindly assume that we can just take the rtnl lock and that will
      prevent races with downing this interface.  Unfortunately, that's not
      the case.  In ipoib_mcast_stop_thread() we will call flush_workqueue()
      in an attempt to clear out all remaining instances of ipoib_join_task.
      But, since this task is put on the same workqueue as the join task,
      the flush_workqueue waits on this thread too.  But this thread is
      deadlocked on the rtnl lock.  The better thing here is to use trylock
      and loop on that until we either get the lock or we see that
      FLAG_OPER_UP has been cleared, in which case we don't need to do
      anything anyway and we just return.
      
      While investigating which flag should be used, FLAG_ADMIN_UP or
      FLAG_OPER_UP, it was determined that FLAG_OPER_UP was the more
      appropriate flag to use.  However, there was a mix of these two flags in
      use in the existing code.  So while we check for that flag here as part
      of this race fix, also cleanup the two places that had used the less
      appropriate flag for their tests.
      Signed-off-by: NDoug Ledford <dledford@redhat.com>
      894021a7
    • D
      IB/ipoib: Consolidate rtnl_lock tasks in workqueue · c84ca6d2
      Doug Ledford 提交于
      The ipoib_mcast_flush_dev routine is called with the rtnl_lock held and
      needs to keep it held.  It also needs to call flush_workqueue() to flush
      out any outstanding work.  In the past, we've had to try and make sure
      that we didn't flush out any outstanding join completions because they
      also wanted to grab rtnl_lock() and that would deadlock.  It turns out
      that the only thing in the join completion handler that needs this lock
      can be safely moved to our carrier_on_task, thereby reducing the
      potential for the join completion code and the flush code to deadlock
      against each other.
      Signed-off-by: NDoug Ledford <dledford@redhat.com>
      c84ca6d2
  20. 31 1月, 2015 2 次提交