1. 21 4月, 2017 5 次提交
  2. 25 1月, 2017 3 次提交
  3. 14 10月, 2016 1 次提交
    • P
      IB/ipoib: move back IB LL address into the hard header · fc791b63
      Paolo Abeni 提交于
      After the commit 9207f9d4 ("net: preserve IP control block
      during GSO segmentation"), the GSO CB and the IPoIB CB conflict.
      That destroy the IPoIB address information cached there,
      causing a severe performance regression, as better described here:
      
      http://marc.info/?l=linux-kernel&m=146787279825501&w=2
      
      This change moves the data cached by the IPoIB driver from the
      skb control lock into the IPoIB hard header, as done before
      the commit 936d7de3 ("IPoIB: Stop lying about hard_header_len
      and use skb->cb to stash LL addresses").
      In order to avoid GRO issue, on packet reception, the IPoIB driver
      stash into the skb a dummy pseudo header, so that the received
      packets have actually a hard header matching the declared length.
      To avoid changing the connected mode maximum mtu, the allocated
      head buffer size is increased by the pseudo header length.
      
      After this commit, IPoIB performances are back to pre-regression
      value.
      
      v2 -> v3: rebased
      v1 -> v2: avoid changing the max mtu, increasing the head buf size
      
      Fixes: 9207f9d4 ("net: preserve IP control block during GSO segmentation")
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fc791b63
  4. 08 10月, 2016 1 次提交
  5. 03 9月, 2016 1 次提交
    • E
      IB/ipoib: Fix memory corruption in ipoib cm mode connect flow · 546481c2
      Erez Shitrit 提交于
      When a new CM connection is being requested, ipoib driver copies data
      from the path pointer in the CM/tx object, the path object might be
      invalid at the point and memory corruption will happened later when now
      the CM driver will try using that data.
      
      The next scenario demonstrates it:
      	neigh_add_path --> ipoib_cm_create_tx -->
      	queue_work (pointer to path is in the cm/tx struct)
      	#while the work is still in the queue,
      	#the port goes down and causes the ipoib_flush_paths:
      	ipoib_flush_paths --> path_free --> kfree(path)
      	#at this point the work scheduled starts.
      	ipoib_cm_tx_start --> copy from the (invalid)path pointer:
      	(memcpy(&pathrec, &p->path->pathrec, sizeof pathrec);)
      	 -> memory corruption.
      
      To fix that the driver now starts the CM/tx connection only if that
      specific path exists in the general paths database.
      This check is protected with the relevant locks, and uses the gid from
      the neigh member in the CM/tx object which is valid according to the ref
      count that was taken by the CM/tx.
      
      Fixes: 839fcaba ('IPoIB: Connected mode experimental support')
      Signed-off-by: NErez Shitrit <erezsh@mellanox.com>
      Signed-off-by: NLeon Romanovsky <leon@kernel.org>
      Signed-off-by: NDoug Ledford <dledford@redhat.com>
      546481c2
  6. 07 6月, 2016 1 次提交
    • E
      IB/IPoIB: Fix race between ipoib_remove_one to sysfs functions · 198b12f7
      Erez Shitrit 提交于
      In ipoib_remove_one the driver holds the rtnl_lock and tries to do some
      operation like dev_change_flags or unregister_netdev, while sysfs
      callback like ipoib_vlan_delete holds sysfs mutex and tries to hold the
      rtnl_lock via rtnl_trylock() and restart_syscall() if the lock is not
      free, meanwhile ipoib_remove_one tries to get the sysfs lock in order to
      free its sysfs directory, and we will get  a->b, b->a deadlock.
      
          Trace like the following:
      
              schedule+0x37/0x80
              schedule_preempt_disabled+0xe/0x10
              __mutex_lock_slowpath+0xb5/0x120
              mutex_lock+0x23/0x40
              rtnl_lock+0x15/0x20
              netdev_run_todo+0x17c/0x320
              rtnl_unlock+0xe/0x10
              ipoib_vlan_delete+0x11b/0x1b0 [ib_ipoib]
              delete_child+0x54/0x80 [ib_ipoib]
              dev_attr_store+0x18/0x30
              sysfs_kf_write+0x37/0x40
              mutex_lock+0x16/0x40
              SyS_write+0x55/0xc0
              entry_SYSCALL_64_fastpath+0x16/0x75
          And
              schedule+0x37/0x80
              __kernfs_remove+0x1a8/0x260
              ? wake_atomic_t_function+0x60/0x60
              kernfs_remove+0x25/0x40
              sysfs_remove_dir+0x50/0x80
              kobject_del+0x18/0x50
              device_del+0x19f/0x260
              netdev_unregister_kobject+0x6a/0x80
              rollback_registered_many+0x1fd/0x340
              rollback_registered+0x3c/0x70
              unregister_netdevice_queue+0x55/0xc0
              unregister_netdev+0x20/0x30
              ipoib_remove_one+0x114/0x1b0 [ib_ipoib]
              ib_unregister_client+0x4a/0x170 [ib_core]
              ? find_module_all+0x71/0xa0
              ipoib_cleanup_module+0x10/0x94 [ib_ipoib]
              SyS_delete_module+0x1b5/0x210
              entry_SYSCALL_64_fastpath+0x16/0x75
      
      The fix is by checking the flag IPOIB_FLAG_INTF_ON_DESTROY in order to
      get out from the sysfs function.
      
      Fixes: 862096a8 ("IB/ipoib: Add more rtnl_link_ops callbacks")
      Fixes: 9baa0b03 ("IB/ipoib: Add rtnl_link_ops support")
      Signed-off-by: NErez Shitrit <erezsh@mellanox.com>
      Signed-off-by: NLeon Romanovsky <leon@kernel.org>
      Signed-off-by: NDoug Ledford <dledford@redhat.com>
      198b12f7
  7. 26 5月, 2016 2 次提交
    • M
      IB/IPoIB: Allow setting the device address · 492a7e67
      Mark Bloch 提交于
      In IB networks, and specifically in IPoIB/rdmacm traffic, the device
      address of an IPoIB interface is used as a means to exchange information
      between nodes needed for communication.
      
      Currently an IPoIB interface will always be created with a device
      address based on its node GUID without a way to change that.
      
      This change adds the ability to set the device address of an IPoIB
      interface by value. We use the set mac address ndo to do that.
      
      The flow should be broken down to two:
      1) The GID value is already in the GID table,
         in this case the interface will be able to set carrier up.
      
      2) The GID value is not yet in the GID table,
         in this case the interface won't try to join the multicast group
         and will wait (listen on GID_CHANGE event) until the GID is inserted.
      
      In order to track those changes, we add a new flag:
      * IPOIB_FLAG_DEV_ADDR_SET.
      
      When set, it means the dev_addr is a based on a value in the gid
      table. this bit will be cleared upon a dev_addr change triggered
      by the user and set after validation.
      
      Per IB spec the port GUID can't change if the module is loaded.
      port GUID is the basis for GID at index 0 which is the basis for
      the default device address of a ipoib interface.
      
      The issue is that there are devices that don't follow the spec,
      they change the port GUID while HCA is powered on, so in order
      not to break userspace applications. We need to check if the
      user wanted to control the device address and we assume that
      if he sets the device address back to be based on GID index 0,
      he no longer wishs to control it.
      
      In order to track this, we add an additional flag:
      * IPOIB_FLAG_DEV_ADDR_CTRL
      
      When setting the device address, there is no validation of the upper
      twelve bytes of the device address (flags, qpn, subnet prefix) as those
      bytes are not under the control of the user.
      Signed-off-by: NMark Bloch <markb@mellanox.com>
      Reviewed-by: NLeon Romanovsky <leonro@mellanox.com>
      Signed-off-by: NLeon Romanovsky <leon@kernel.org>
      Signed-off-by: NDoug Ledford <dledford@redhat.com>
      492a7e67
    • E
      IB/ipoib: Support SendOnlyFullMember MCG for SendOnly join · 3b561130
      Erez Shitrit 提交于
      Check (via an SA query) if the SM supports the new option for SendOnly
      multicast joins.
      If the SM supports that option it will use the new join state to create
      such multicast group.
      If SendOnlyFullMember is supported, we wouldn't use faked FullMember state
      join for SendOnly MCG, use the correct state if supported.
      
      This check is performed at every invocation of mcast_restart task, to be
      sure that the driver stays in sync with the current state of the SM.
      Signed-off-by: NErez Shitrit <erezsh@mellanox.com>
      Reviewed-by: NLeon Romanovsky <leonro@mellanox.com>
      Signed-off-by: NDoug Ledford <dledford@redhat.com>
      3b561130
  8. 22 3月, 2016 1 次提交
  9. 03 3月, 2016 1 次提交
  10. 20 1月, 2016 1 次提交
    • E
      IB/IPoIB: Fix kernel panic on multicast flow · 50be28de
      Erez Shitrit 提交于
      ipoib_mcast_restart_task calls ipoib_mcast_remove_list with the
      parameter mcast->dev. That mcast is a temporary (used as an iterator)
      variable that may be uninitialized.
      There is no need to send the variable dev to the function, as each mcast
      has its dev as a member in the mcast struct.
      
      This causes the next panic:
      RIP: 0010: ipoib_mcast_leave+0x6d/0xf0 [ib_ipoib]
      RSP: 0018: EFLAGS: 00010246
      RAX: f0201 RBX: 24e00 RCX: 00000
      ....
      ....
      Stack:
      Call Trace:
      	ipoib_mcast_remove_list+0x3a/0x70 [ib_ipoib]
      	ipoib_mcast_restart_task+0x3bb/0x520 [ib_ipoib]
      	process_one_work+0x164/0x470
      	worker_thread+0x11d/0x420
      	...
      
      Fixes: 5a0e81f6 ('IB/IPoIB: factor out common multicast list removal code')
      Signed-off-by: NErez Shitrit <erezsh@mellanox.com>
      Reported-by: NDoron Tsur <doront@mellanox.com>
      Reviewed-by: NChristoph Lameter <cl@linux.com>
      Signed-off-by: NDoug Ledford <dledford@redhat.com>
      50be28de
  11. 24 12月, 2015 2 次提交
  12. 14 10月, 2015 1 次提交
  13. 08 10月, 2015 1 次提交
    • C
      IB: split struct ib_send_wr · e622f2f4
      Christoph Hellwig 提交于
      This patch split up struct ib_send_wr so that all non-trivial verbs
      use their own structure which embedds struct ib_send_wr.  This dramaticly
      shrinks the size of a WR for most common operations:
      
      sizeof(struct ib_send_wr) (old):	96
      
      sizeof(struct ib_send_wr):		48
      sizeof(struct ib_rdma_wr):		64
      sizeof(struct ib_atomic_wr):		96
      sizeof(struct ib_ud_wr):		88
      sizeof(struct ib_fast_reg_wr):		88
      sizeof(struct ib_bind_mw_wr):		96
      sizeof(struct ib_sig_handover_wr):	80
      
      And with Sagi's pending MR rework the fast registration WR will also be
      down to a reasonable size:
      
      sizeof(struct ib_fastreg_wr):		64
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com> [srp, srpt]
      Reviewed-by: Chuck Lever <chuck.lever@oracle.com> [sunrpc]
      Tested-by: NHaggai Eran <haggaie@mellanox.com>
      Tested-by: NSagi Grimberg <sagig@mellanox.com>
      Tested-by: NSteve Wise <swise@opengridcomputing.com>
      e622f2f4
  14. 26 9月, 2015 2 次提交
  15. 31 8月, 2015 1 次提交
  16. 15 7月, 2015 1 次提交
    • Y
      IB/ipoib: Scatter-Gather support in connected mode · c4268778
      Yuval Shaia 提交于
      By default, IPoIB-CM driver uses 64k MTU. Larger MTU gives better
      performance.
      This MTU plus overhead puts the memory allocation for IP based packets at
      32 4k pages (order 5), which have to be contiguous.
      When the system memory under pressure, it was observed that allocating 128k
      contiguous physical memory is difficult and causes serious errors (such as
      system becomes unusable).
      
      This enhancement resolve the issue by removing the physically contiguous
      memory requirement using Scatter/Gather feature that exists in Linux stack.
      
      With this fix Scatter-Gather will be supported also in connected mode.
      
      This change reverts some of the change made in commit e112373f
      ("IPoIB/cm: Reduce connected mode TX object size").
      
      The ability to use SG in IPoIB CM is possible because the coupling
      between NETIF_F_SG and NETIF_F_CSUM was removed in commit
      ec5f0615 ("net: Kill link between CSUM and SG features.")
      Signed-off-by: NYuval Shaia <yuval.shaia@oracle.com>
      Acked-by: NChristian Marie <christian@ponies.io>
      Signed-off-by: NDoug Ledford <dledford@redhat.com>
      c4268778
  17. 16 4月, 2015 6 次提交
    • E
      IB/ipoib: Remove IPOIB_MCAST_RUN bit · 0e5544d9
      Erez Shitrit 提交于
      After Doug Ledford's changes there is no need in that bit, it's
      semantic becomes subset of the IPOIB_FLAG_OPER_UP bit.
      Signed-off-by: NErez Shitrit <erezsh@mellanox.com>
      Signed-off-by: NOr Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: NDoug Ledford <dledford@redhat.com>
      0e5544d9
    • E
      IB/ipoib: Handle QP in SQE state · 2c010730
      Erez Shitrit 提交于
      As the result of a completion error the QP can moved to SQE state by
      the hardware. Since it's not the Error state, there are no flushes
      and hence the driver doesn't know about that.
      
      The fix creates a task that after completion with error which is not a
      flush tracks the QP state and if it is in SQE state moves it back to RTS.
      Signed-off-by: NErez Shitrit <erezsh@mellanox.com>
      Signed-off-by: NOr Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: NDoug Ledford <dledford@redhat.com>
      2c010730
    • E
      IB/ipoib: Use one linear skb in RX flow · a44878d1
      Erez Shitrit 提交于
      The current code in the RX flow uses two sg entries for each incoming
      packet, the first one was for the IB headers and the second for the rest
      of the data, that causes two  dma map/unmap and two allocations, and few
      more actions that were done at the data path.
      
      Use only one linear skb on each incoming packet, for the data (IB
      headers and payload), that reduces the packet processing in the
      data-path (only one skb, no frags, the first frag was not used anyway,
      less memory allocations) and the dma handling (only one dma map/unmap
      over each incoming packet instead of two map/unmap per each incoming packet).
      
      After commit 73d3fe6d ("gro: fix aggregation for skb using frag_list") from
      Eric Dumazet, we will get full aggregation for large packets.
      
      When running bandwidth tests before and after the (over the card's numa node),
      using "netperf -H 1.1.1.3 -T -t TCP_STREAM", the results before are ~12Gbs before
      and after ~16Gbs on my setup (Mellanox's ConnectX3).
      Signed-off-by: NErez Shitrit <erezsh@mellanox.com>
      Signed-off-by: NOr Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: NDoug Ledford <dledford@redhat.com>
      a44878d1
    • D
      IB/ipoib: fix MCAST_FLAG_BUSY usage · 69911416
      Doug Ledford 提交于
      Commit a9c8ba58 ("IPoIB: Fix usage of uninitialized multicast
      objects") added a new flag MCAST_JOIN_STARTED, but was not very strict
      in how it was used.  We didn't always initialize the completion struct
      before we set the flag, and we didn't always call complete on the
      completion struct from all paths that complete it.  And when we did
      complete it, sometimes we continued to touch the mcast entry after
      the completion, opening us up to possible use after free issues.
      
      This made it less than totally effective, and certainly made its use
      confusing.  And in the flush function we would use the presence of this
      flag to signal that we should wait on the completion struct, but we never
      cleared this flag, ever.
      
      In order to make things clearer and aid in resolving the rtnl deadlock
      bug I've been chasing, I cleaned this up a bit.
      
       1) Remove the MCAST_JOIN_STARTED flag entirely
       2) Change MCAST_FLAG_BUSY so it now only means a join is in-flight
       3) Test mcast->mc directly to see if we have completed
          ib_sa_join_multicast (using IS_ERR_OR_NULL)
       4) Make sure that before setting MCAST_FLAG_BUSY we always initialize
          the mcast->done completion struct
       5) Make sure that before calling complete(&mcast->done), we always clear
          the MCAST_FLAG_BUSY bit
       6) Take the mcast_mutex before we call ib_sa_multicast_join and also
          take the mutex in our join callback.  This forces
          ib_sa_multicast_join to return and set mcast->mc before we process
          the callback.  This way, our callback can safely clear mcast->mc
          if there is an error on the join and we will do the right thing as
          a result in mcast_dev_flush.
       7) Because we need the mutex to synchronize mcast->mc, we can no
          longer call mcast_sendonly_join directly from mcast_send and
          instead must add sendonly join processing to the mcast_join_task
       8) Make MCAST_RUN mean that we have a working mcast subsystem, not that
          we have a running task.  We know when we need to reschedule our
          join task thread and don't need a flag to tell us.
       9) Add a helper for rescheduling the join task thread
      
      A number of different races are resolved with these changes.  These
      races existed with the old MCAST_FLAG_BUSY usage, the
      MCAST_JOIN_STARTED flag was an attempt to address them, and while it
      helped, a determined effort could still trip things up.
      
      One race looks something like this:
      
      Thread 1                             Thread 2
      ib_sa_join_multicast (as part of running restart mcast task)
        alloc member
        call callback
                                           ifconfig ib0 down
      				     wait_for_completion
          callback call completes
                                           wait_for_completion in
      				     mcast_dev_flush completes
      				       mcast->mc is PTR_ERR_OR_NULL
      				       so we skip ib_sa_leave_multicast
          return from callback
        return from ib_sa_join_multicast
      set mcast->mc = return from ib_sa_multicast
      
      We now have a permanently unbalanced join/leave issue that trips up the
      refcounting in core/multicast.c
      
      Another like this:
      
      Thread 1                   Thread 2         Thread 3
      ib_sa_multicast_join
                                                  ifconfig ib0 down
      					    priv->broadcast = NULL
                                 join_complete
      			                    wait_for_completion
      			   mcast->mc is not yet set, so don't clear
      return from ib_sa_join_multicast and set mcast->mc
      			   complete
      			   return -EAGAIN (making mcast->mc invalid)
      			   		    call ib_sa_multicast_leave
      					    on invalid mcast->mc, hang
      					    forever
      
      By holding the mutex around ib_sa_multicast_join and taking the mutex
      early in the callback, we force mcast->mc to be valid at the time we
      run the callback.  This allows us to clear mcast->mc if there is an
      error and the join is going to fail.  We do this before we complete
      the mcast.  In this way, mcast_dev_flush always sees consistent state
      in regards to mcast->mc membership at the time that the
      wait_for_completion() returns.
      Signed-off-by: NDoug Ledford <dledford@redhat.com>
      69911416
    • D
      IB/ipoib: No longer use flush as a parameter · efc82eee
      Doug Ledford 提交于
      Various places in the IPoIB code had a deadlock related to flushing
      the ipoib workqueue.  Now that we have per device workqueues and a
      specific flush workqueue, there is no longer a deadlock issue with
      flushing the device specific workqueues and we can do so unilaterally.
      Signed-off-by: NDoug Ledford <dledford@redhat.com>
      efc82eee
    • D
      IB/ipoib: Use dedicated workqueues per interface · 0b39578b
      Doug Ledford 提交于
      During my recent work on the rtnl lock deadlock in the IPoIB driver, I
      saw that even once I fixed the apparent races for a single device, as
      soon as that device had any children, new races popped up.  It turns
      out that this is because no matter how well we protect against races
      on a single device, the fact that all devices use the same workqueue,
      and flush_workqueue() flushes *everything* from that workqueue means
      that we would also have to prevent all races between different devices
      (for instance, ipoib_mcast_restart_task on interface ib0 can race with
      ipoib_mcast_flush_dev on interface ib0.8002, resulting in a deadlock on
      the rtnl_lock).
      
      There are several possible solutions to this problem:
      
      Make carrier_on_task and mcast_restart_task try to take the rtnl for
      some set period of time and if they fail, then bail.  This runs the
      real risk of dropping work on the floor, which can end up being its
      own separate kind of deadlock.
      
      Set some global flag in the driver that says some device is in the
      middle of going down, letting all tasks know to bail.  Again, this can
      drop work on the floor.
      
      Or the method this patch attempts to use, which is when we bring an
      interface up, create a workqueue specifically for that interface, so
      that when we take it back down, we are flushing only those tasks
      associated with our interface.  In addition, keep the global
      workqueue, but now limit it to only flush tasks.  In this way, the
      flush tasks can always flush the device specific work queues without
      having deadlock issues.
      Signed-off-by: NDoug Ledford <dledford@redhat.com>
      0b39578b
  18. 31 1月, 2015 4 次提交
  19. 16 12月, 2014 4 次提交
    • D
      IPoIB: No longer use flush as a parameter · ce347ab9
      Doug Ledford 提交于
      Various places in the IPoIB code had a deadlock related to flushing
      the ipoib workqueue.  Now that we have per device workqueues and a
      specific flush workqueue, there is no longer a deadlock issue with
      flushing the device specific workqueues and we can do so unilaterally.
      Signed-off-by: NDoug Ledford <dledford@redhat.com>
      Signed-off-by: NRoland Dreier <roland@purestorage.com>
      ce347ab9
    • D
      IPoIB: Make ipoib_mcast_stop_thread flush the workqueue · bb42a6dd
      Doug Ledford 提交于
      We used to pass a flush variable to mcast_stop_thread to indicate if
      we should flush the workqueue or not.  This was due to some code
      trying to flush a workqueue that it was currently running on which is
      a no-no.  Now that we have per-device work queues, and now that
      ipoib_mcast_restart_task has taken the fact that it is queued on a
      single thread workqueue with all of the ipoib_mcast_join_task's and
      therefore has no need to stop the join task while it runs, we can do
      away with the flush parameter and unilaterally flush always.
      Signed-off-by: NDoug Ledford <dledford@redhat.com>
      Signed-off-by: NRoland Dreier <roland@purestorage.com>
      bb42a6dd
    • D
      IPoIB: Use dedicated workqueues per interface · 5141861c
      Doug Ledford 提交于
      During my recent work on the rtnl lock deadlock in the IPoIB driver, I
      saw that even once I fixed the apparent races for a single device, as
      soon as that device had any children, new races popped up.  It turns
      out that this is because no matter how well we protect against races
      on a single device, the fact that all devices use the same workqueue,
      and flush_workqueue() flushes *everything* from that workqueue, we can
      have one device in the middle of a down and holding the rtnl lock and
      another totally unrelated device needing to run mcast_restart_task,
      which wants the rtnl lock and will loop trying to take it unless is
      sees its own FLAG_ADMIN_UP flag go away.  Because the unrelated
      interface will never see its own ADMIN_UP flag drop, the interface
      going down will deadlock trying to flush the queue.  There are several
      possible solutions to this problem:
      
      Make carrier_on_task and mcast_restart_task try to take the rtnl for
      some set period of time and if they fail, then bail.  This runs the
      real risk of dropping work on the floor, which can end up being its
      own separate kind of deadlock.
      
      Set some global flag in the driver that says some device is in the
      middle of going down, letting all tasks know to bail.  Again, this can
      drop work on the floor.  I suppose if our own ADMIN_UP flag doesn't go
      away, then maybe after a few tries on the rtnl lock we can queue our
      own task back up as a delayed work and return and avoid dropping work
      on the floor that way.  But I'm not 100% convinced that we won't cause
      other problems.
      
      Or the method this patch attempts to use, which is when we bring an
      interface up, create a workqueue specifically for that interface, so
      that when we take it back down, we are flushing only those tasks
      associated with our interface.  In addition, keep the global
      workqueue, but now limit it to only flush tasks.  In this way, the
      flush tasks can always flush the device specific work queues without
      having deadlock issues.
      Signed-off-by: NDoug Ledford <dledford@redhat.com>
      Signed-off-by: NRoland Dreier <roland@purestorage.com>
      5141861c
    • D
      IPoIB: fix MCAST_FLAG_BUSY usage · 016d9fb2
      Doug Ledford 提交于
      Commit a9c8ba58 ("IPoIB: Fix usage of uninitialized multicast
      objects") added a new flag MCAST_JOIN_STARTED, but was not very strict
      in how it was used.  We didn't always initialize the completion struct
      before we set the flag, and we didn't always call complete on the
      completion struct from all paths that complete it.  This made it less
      than totally effective, and certainly made its use confusing.  And in
      the flush function we would use the presence of this flag to signal
      that we should wait on the completion struct, but we never cleared
      this flag, ever.  This is further muddied by the fact that we overload
      the MCAST_FLAG_BUSY flag to mean two different things: we have a join
      in flight, and we have succeeded in getting an ib_sa_join_multicast.
      
      In order to make things clearer and aid in resolving the rtnl deadlock
      bug I've been chasing, I cleaned this up a bit.
      
       1) Remove the MCAST_JOIN_STARTED flag entirely
       2) Un-overload MCAST_FLAG_BUSY so it now only means a join is in-flight
       3) Test on mcast->mc directly to see if we have completed
          ib_sa_join_multicast (using IS_ERR_OR_NULL)
       4) Make sure that before setting MCAST_FLAG_BUSY we always initialize
          the mcast->done completion struct
       5) Make sure that before calling complete(&mcast->done), we always clear
          the MCAST_FLAG_BUSY bit
       6) Take the mcast_mutex before we call ib_sa_multicast_join and also
          take the mutex in our join callback.  This forces
          ib_sa_multicast_join to return and set mcast->mc before we process
          the callback.  This way, our callback can safely clear mcast->mc
          if there is an error on the join and we will do the right thing as
          a result in mcast_dev_flush.
       7) Because we need the mutex to synchronize mcast->mc, we can no
          longer call mcast_sendonly_join directly from mcast_send and
          instead must add sendonly join processing to the mcast_join_task
      
      A number of different races are resolved with these changes.  These
      races existed with the old MCAST_FLAG_BUSY usage, the
      MCAST_JOIN_STARTED flag was an attempt to address them, and while it
      helped, a determined effort could still trip things up.
      
      One race looks something like this:
      
      Thread 1                             Thread 2
      ib_sa_join_multicast (as part of running restart mcast task)
        alloc member
        call callback
                                           ifconfig ib0 down
      				     wait_for_completion
          callback call completes
                                           wait_for_completion in
      				     mcast_dev_flush completes
      				       mcast->mc is PTR_ERR_OR_NULL
      				       so we skip ib_sa_leave_multicast
          return from callback
        return from ib_sa_join_multicast
      set mcast->mc = return from ib_sa_multicast
      
      We now have a permanently unbalanced join/leave issue that trips up the
      refcounting in core/multicast.c
      
      Another like this:
      
      Thread 1                   Thread 2         Thread 3
      ib_sa_multicast_join
                                                  ifconfig ib0 down
      					    priv->broadcast = NULL
                                 join_complete
      			                    wait_for_completion
      			   mcast->mc is not yet set, so don't clear
      return from ib_sa_join_multicast and set mcast->mc
      			   complete
      			   return -EAGAIN (making mcast->mc invalid)
      			   		    call ib_sa_multicast_leave
      					    on invalid mcast->mc, hang
      					    forever
      
      By holding the mutex around ib_sa_multicast_join and taking the mutex
      early in the callback, we force mcast->mc to be valid at the time we
      run the callback.  This allows us to clear mcast->mc if there is an
      error and the join is going to fail.  We do this before we complete
      the mcast.  In this way, mcast_dev_flush always sees consistent state
      in regards to mcast->mc membership at the time that the
      wait_for_completion() returns.
      Signed-off-by: NDoug Ledford <dledford@redhat.com>
      Signed-off-by: NRoland Dreier <roland@purestorage.com>
      016d9fb2
  20. 23 9月, 2014 1 次提交