1. 26 Oct, 2017 2 commits
  2. 29 Sep, 2017 1 commit
  3. 23 Sep, 2017 1 commit
  4. 25 Aug, 2017 1 commit
    • IB/ipoib: Sync between remove_one to sysfs calls that use rtnl_lock · 69956d83
      Committed by Erez Shitrit
      To avoid a deadlock between the sysfs functions (such as create/delete
      child) and remove_one, both of which take the sysfs lock and
      rtnl_lock, the driver now uses a state mutex for synchronization.
      
      This fixes traces such as the following:
      schedule+0x3e/0x90
      kernfs_drain+0x75/0xf0
      ? wait_woken+0x90/0x90
      __kernfs_remove+0x12e/0x1c0
      kernfs_remove+0x25/0x40
      sysfs_remove_dir+0x57/0x90
      kobject_del+0x22/0x60
      device_del+0x195/0x230
       pm_runtime_set_memalloc_noio+0xac/0xf0
      netdev_unregister_kobject+0x71/0x80
      rollback_registered_many+0x205/0x2f0
      rollback_registered+0x31/0x40
      unregister_netdevice_queue+0x58/0xb0
      unregister_netdev+0x20/0x30
      ipoib_remove_one+0xb7/0x240 [ib_ipoib]
      ib_unregister_device+0xbc/0x1b0 [ib_core]
      ib_unregister_mad_agent+0x29/0x30 [ib_core]
      mlx4_ib_remove+0x67/0x280 [mlx4_ib]
      INFO: task echo:24082 blocked for more than 120 seconds.
      Tainted: G           OE   4.1.12-37.5.1.el6uek.x86_64 #2
      "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
      message.
      Call Trace:
      schedule+0x3e/0x90
      schedule_preempt_disabled+0xe/0x10
      __mutex_lock_slowpath+0x95/0x110
      ? _rcu_barrier+0x177/0x220
      mutex_lock+0x23/0x40
      rtnl_lock+0x15/0x20
      netdev_run_todo+0x81/0x1f0
      rtnl_unlock+0xe/0x10
      ipoib_vlan_delete+0x12f/0x1c0 [ib_ipoib]
      delete_child+0x69/0x80 [ib_ipoib]
      dev_attr_store+0x20/0x30
      sysfs_kf_write+0x41/0x50
      Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
      Reviewed-by: Alex Vesker <valex@mellanox.com>
      Signed-off-by: Leon Romanovsky <leon@kernel.org>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      69956d83
  5. 23 Jul, 2017 1 commit
  6. 18 Jul, 2017 2 commits
  7. 02 May, 2017 1 commit
  8. 21 Apr, 2017 1 commit
  9. 02 Mar, 2017 1 commit
  10. 19 Feb, 2017 2 commits
  11. 13 Jan, 2017 4 commits
  12. 04 Dec, 2016 1 commit
  13. 17 Nov, 2016 1 commit
  14. 14 Oct, 2016 1 commit
    • IB/ipoib: move back IB LL address into the hard header · fc791b63
      Committed by Paolo Abeni
      After commit 9207f9d4 ("net: preserve IP control block
      during GSO segmentation"), the GSO CB and the IPoIB CB conflict.
      That destroys the IPoIB address information cached there,
      causing a severe performance regression, as described in more detail here:
      
      http://marc.info/?l=linux-kernel&m=146787279825501&w=2
      
      This change moves the data cached by the IPoIB driver from the
      skb control block into the IPoIB hard header, as was done before
      commit 936d7de3 ("IPoIB: Stop lying about hard_header_len
      and use skb->cb to stash LL addresses").
      To avoid GRO issues, on packet reception the IPoIB driver
      stashes a dummy pseudo header into the skb, so that received
      packets actually have a hard header matching the declared length.
      To avoid changing the connected mode maximum MTU, the allocated
      head buffer size is increased by the pseudo header length.
      
      After this commit, IPoIB performance is back to its pre-regression
      value.
      
      v2 -> v3: rebased
      v1 -> v2: avoid changing the max mtu, increasing the head buf size
      
      Fixes: 9207f9d4 ("net: preserve IP control block during GSO segmentation")
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      fc791b63
  15. 03 Sep, 2016 1 commit
    • IB/ipoib: Fix memory corruption in ipoib cm mode connect flow · 546481c2
      Committed by Erez Shitrit
      When a new CM connection is requested, the ipoib driver copies data
      from the path pointer in the CM/tx object. The path object might be
      invalid at that point, and memory corruption will happen later when
      the CM driver tries to use that data.
      
      The following scenario demonstrates it:
      	neigh_add_path --> ipoib_cm_create_tx -->
      	queue_work (pointer to path is in the cm/tx struct)
      	#while the work is still in the queue,
      	#the port goes down and causes the ipoib_flush_paths:
      	ipoib_flush_paths --> path_free --> kfree(path)
      	#at this point the work scheduled starts.
      	ipoib_cm_tx_start --> copy from the (invalid)path pointer:
      	(memcpy(&pathrec, &p->path->pathrec, sizeof pathrec);)
      	 -> memory corruption.
      
      To fix that, the driver now starts the CM/tx connection only if that
      specific path exists in the general paths database.
      This check is protected by the relevant locks, and uses the gid from
      the neigh member in the CM/tx object, which is valid thanks to the
      reference count taken by the CM/tx.
      
      Fixes: 839fcaba ('IPoIB: Connected mode experimental support')
      Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
      Signed-off-by: Leon Romanovsky <leon@kernel.org>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      546481c2
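The fix described above (start the connection only if the path is still registered) can be sketched in userspace C. This is a minimal, single-threaded illustration, not the driver's code: `path_sketch`, `path_db_sketch`, `find_path` and `tx_start_sketch` are hypothetical names, the database is a plain array, and the locking the commit mentions is omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define GID_LEN   16
#define MAX_PATHS 8

/* Hypothetical stand-in for a path entry keyed by its destination gid. */
struct path_sketch {
	unsigned char gid[GID_LEN];
	int pathrec;                 /* stand-in for the path record payload */
};

/* Hypothetical stand-in for the general paths database. */
struct path_db_sketch {
	struct path_sketch *paths[MAX_PATHS];
	size_t count;
};

/* Look the path up by gid instead of trusting a pointer cached earlier. */
struct path_sketch *find_path(struct path_db_sketch *db,
			      const unsigned char *gid)
{
	assert(db->count <= MAX_PATHS);
	for (size_t i = 0; i < db->count; i++)
		if (memcmp(db->paths[i]->gid, gid, GID_LEN) == 0)
			return db->paths[i];
	return NULL;
}

/*
 * Deferred work: copy the path record only if the path is still in the
 * database. Returns 0 on success, -1 if the path was flushed meanwhile.
 */
int tx_start_sketch(struct path_db_sketch *db,
		    const unsigned char *neigh_gid, int *pathrec)
{
	struct path_sketch *path = find_path(db, neigh_gid);

	if (!path)
		return -1;           /* path is gone: do not touch stale memory */
	*pathrec = path->pathrec;
	return 0;
}
```

In the real driver the lookup and the copy would have to happen under the locks the commit message refers to; the sketch only shows the ordering of "look up, then copy".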
  16. 07 Jun, 2016 1 commit
    • IB/IPoIB: Fix race between ipoib_remove_one to sysfs functions · 198b12f7
      Committed by Erez Shitrit
      In ipoib_remove_one the driver holds the rtnl_lock and tries to perform
      an operation such as dev_change_flags or unregister_netdev, while a sysfs
      callback such as ipoib_vlan_delete holds the sysfs mutex and tries to take
      the rtnl_lock via rtnl_trylock(), calling restart_syscall() if the lock is
      not free. Meanwhile ipoib_remove_one tries to take the sysfs lock in order
      to free its sysfs directory, and we get an a->b, b->a deadlock.
      
          Trace like the following:
      
              schedule+0x37/0x80
              schedule_preempt_disabled+0xe/0x10
              __mutex_lock_slowpath+0xb5/0x120
              mutex_lock+0x23/0x40
              rtnl_lock+0x15/0x20
              netdev_run_todo+0x17c/0x320
              rtnl_unlock+0xe/0x10
              ipoib_vlan_delete+0x11b/0x1b0 [ib_ipoib]
              delete_child+0x54/0x80 [ib_ipoib]
              dev_attr_store+0x18/0x30
              sysfs_kf_write+0x37/0x40
              mutex_lock+0x16/0x40
              SyS_write+0x55/0xc0
              entry_SYSCALL_64_fastpath+0x16/0x75
          And
              schedule+0x37/0x80
              __kernfs_remove+0x1a8/0x260
              ? wake_atomic_t_function+0x60/0x60
              kernfs_remove+0x25/0x40
              sysfs_remove_dir+0x50/0x80
              kobject_del+0x18/0x50
              device_del+0x19f/0x260
              netdev_unregister_kobject+0x6a/0x80
              rollback_registered_many+0x1fd/0x340
              rollback_registered+0x3c/0x70
              unregister_netdevice_queue+0x55/0xc0
              unregister_netdev+0x20/0x30
              ipoib_remove_one+0x114/0x1b0 [ib_ipoib]
              ib_unregister_client+0x4a/0x170 [ib_core]
              ? find_module_all+0x71/0xa0
              ipoib_cleanup_module+0x10/0x94 [ib_ipoib]
              SyS_delete_module+0x1b5/0x210
              entry_SYSCALL_64_fastpath+0x16/0x75
      
      The fix checks the IPOIB_FLAG_INTF_ON_DESTROY flag so that the sysfs
      function can bail out.
      
      Fixes: 862096a8 ("IB/ipoib: Add more rtnl_link_ops callbacks")
      Fixes: 9baa0b03 ("IB/ipoib: Add rtnl_link_ops support")
      Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
      Signed-off-by: Leon Romanovsky <leon@kernel.org>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      198b12f7
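The idea of bailing out of the sysfs function once the interface is being destroyed can be sketched with C11 atomics. Everything here is an assumption for illustration: `dev_sketch`, `remove_begin`, `delete_child_sketch`, the `RESTART_SKETCH` result, the choice of `-EPERM`, and the position of the flag check relative to the lock attempt are not taken from the driver.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define RESTART_SKETCH 1 /* hypothetical "lock busy, retry the call" result */

struct dev_sketch {
	atomic_bool big_lock_held; /* stand-in for a global lock such as rtnl */
	atomic_bool on_destroy;    /* set once removal of the device has begun */
};

void dev_init(struct dev_sketch *dev)
{
	atomic_init(&dev->big_lock_held, false);
	atomic_init(&dev->on_destroy, false);
}

/* Removal path: mark the device first, then take the lock. */
void remove_begin(struct dev_sketch *dev)
{
	atomic_store(&dev->on_destroy, true);
	while (atomic_exchange(&dev->big_lock_held, true))
		; /* spin until the lock is free */
}

/*
 * sysfs-style handler: without the flag check it would keep returning
 * RESTART_SKETCH for as long as the removal path holds the lock.
 */
int delete_child_sketch(struct dev_sketch *dev)
{
	assert(dev != NULL);
	if (atomic_load(&dev->on_destroy))
		return -EPERM;         /* give up instead of retrying forever */
	if (atomic_exchange(&dev->big_lock_held, true))
		return RESTART_SKETCH; /* lock busy: caller should retry */
	/* ... the actual child removal would happen here ... */
	atomic_store(&dev->big_lock_held, false);
	return 0;
}
```

The point of the sketch is that the handler gets an exit that does not depend on acquiring the lock, which is what breaks the a->b, b->a cycle.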
  17. 05 May, 2016 1 commit
  18. 03 Mar, 2016 1 commit
  19. 23 Dec, 2015 1 commit
  20. 12 Dec, 2015 1 commit
    • IB: add a proper completion queue abstraction · 14d3a3b2
      Committed by Christoph Hellwig
      This adds an abstraction that allows ULPs to simply pass a completion
      object and completion callback with each submitted WR and let the RDMA
      core handle the nitty gritty details of how to handle completion
      interrupts and poll the CQ.
      
      In detail there is a new ib_cqe structure which just contains the
      completion callback, and which can be used to get at the containing
      object using container_of.  It is pointed to by the WR and WC as an
      alternative to the wr_id field, similar to how many ULPs already use
      the field to store a pointer using casts.
      
      A driver using the new completion callbacks allocates its CQs using
      the new ib_create_cq API, which in addition to the number of CQEs and
      the completion vectors also takes a mode on how we poll for CQEs.
      Three modes are available: direct for drivers that never take CQ
      interrupts and just poll for them, softirq to poll from softirq context
      using the to be renamed blk-iopoll infrastructure which takes care of
      rearming and budgeting, or a workqueue for consumers who want to be
      called from user context.
      
      Thanks a lot to Sagi Grimberg who helped reviewing the API, wrote
      the current version of the workqueue code because my two previous
      attempts sucked too much and converted the iSER initiator to the new
      API.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      14d3a3b2
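The completion-object idea described above (a small structure holding only the callback, embedded in the caller's object and recovered with container_of) can be sketched in userspace C. The types `cqe_sketch`, `wc_sketch` and `my_request` are hypothetical stand-ins, not the kernel's definitions, and `container_of` is re-implemented locally with `offsetof`.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's container_of(); illustrative only. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct wc_sketch;

/* Hypothetical analogue of ib_cqe: it holds only the completion callback. */
struct cqe_sketch {
	void (*done)(struct cqe_sketch *cqe, struct wc_sketch *wc);
};

/* Hypothetical work completion that points back at the submitted cqe. */
struct wc_sketch {
	struct cqe_sketch *cqe;
	int status;
};

/* A consumer's request embeds the cqe so the callback can find the request. */
struct my_request {
	int id;
	int completed_status;
	struct cqe_sketch cqe;
};

/* Completion callback: recover the containing request from the cqe pointer. */
void my_request_done(struct cqe_sketch *cqe, struct wc_sketch *wc)
{
	struct my_request *req = container_of(cqe, struct my_request, cqe);

	req->completed_status = wc->status;
}

/* What a core polling loop would do for each completion it reaps. */
void dispatch_completion(struct wc_sketch *wc)
{
	assert(wc->cqe != NULL && wc->cqe->done != NULL);
	wc->cqe->done(wc->cqe, wc);
}
```

Compared with stashing a pointer in an integer id field, the embedded object needs no casts and lets the polling code stay unaware of the consumer's request type.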
  21. 08 Oct, 2015 1 commit
    • IB: split struct ib_send_wr · e622f2f4
      Committed by Christoph Hellwig
      This patch splits up struct ib_send_wr so that all non-trivial verbs
      use their own structure which embeds struct ib_send_wr.  This dramatically
      shrinks the size of a WR for most common operations:
      
      sizeof(struct ib_send_wr) (old):	96
      
      sizeof(struct ib_send_wr):		48
      sizeof(struct ib_rdma_wr):		64
      sizeof(struct ib_atomic_wr):		96
      sizeof(struct ib_ud_wr):		88
      sizeof(struct ib_fast_reg_wr):		88
      sizeof(struct ib_bind_mw_wr):		96
      sizeof(struct ib_sig_handover_wr):	80
      
      And with Sagi's pending MR rework the fast registration WR will also be
      down to a reasonable size:
      
      sizeof(struct ib_fastreg_wr):		64
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Bart Van Assche <bart.vanassche@sandisk.com> [srp, srpt]
      Reviewed-by: Chuck Lever <chuck.lever@oracle.com> [sunrpc]
      Tested-by: Haggai Eran <haggaie@mellanox.com>
      Tested-by: Sagi Grimberg <sagig@mellanox.com>
      Tested-by: Steve Wise <swise@opengridcomputing.com>
      e622f2f4
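The embedding pattern this commit describes can be sketched as follows. The structures are deliberately tiny and hypothetical (`send_wr_sketch`, `rdma_wr_sketch`, `to_rdma_wr`); their fields and sizes do not match the kernel's, and the sketch only shows how a verb-specific wrapper carries the extra fields while common code handles the base structure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-in for the kernel's container_of(); illustrative only. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

enum wr_opcode_sketch { WR_SEND_SKETCH, WR_RDMA_WRITE_SKETCH };

/* Fields every work request needs, regardless of verb. */
struct send_wr_sketch {
	struct send_wr_sketch *next;
	uint64_t wr_id;
	enum wr_opcode_sketch opcode;
};

/* RDMA-specific fields live only in the wrapper that embeds the base WR. */
struct rdma_wr_sketch {
	struct send_wr_sketch wr;
	uint64_t remote_addr;
	uint32_t rkey;
};

/* Recover the verb-specific wrapper from the base WR handed to a provider. */
struct rdma_wr_sketch *to_rdma_wr(struct send_wr_sketch *wr)
{
	assert(wr->opcode == WR_RDMA_WRITE_SKETCH);
	return container_of(wr, struct rdma_wr_sketch, wr);
}

/* Provider-side view: only the base pointer is passed around. */
uint64_t remote_addr_of(struct send_wr_sketch *wr)
{
	if (wr->opcode != WR_RDMA_WRITE_SKETCH)
		return 0;            /* plain sends carry no remote address */
	return to_rdma_wr(wr)->remote_addr;
}
```

A plain send then pays only for `struct send_wr_sketch`, which is the size reduction the table in the commit message reports for the real structures.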
  22. 31 Aug, 2015 2 commits
  23. 15 Jul, 2015 1 commit
    • IB/ipoib: Scatter-Gather support in connected mode · c4268778
      Committed by Yuval Shaia
      By default, the IPoIB-CM driver uses a 64k MTU. A larger MTU gives better
      performance.
      This MTU plus overhead puts the memory allocation for IP based packets at
      32 4k pages (order 5), which have to be contiguous.
      When system memory is under pressure, it was observed that allocating 128k
      of contiguous physical memory is difficult and causes serious errors (such
      as the system becoming unusable).
      
      This enhancement resolves the issue by removing the physically contiguous
      memory requirement, using the Scatter/Gather feature that exists in the
      Linux stack.
      
      With this fix Scatter-Gather is also supported in connected mode.
      
      This change reverts some of the changes made in commit e112373f
      ("IPoIB/cm: Reduce connected mode TX object size").
      
      The ability to use SG in IPoIB CM is possible because the coupling
      between NETIF_F_SG and NETIF_F_CSUM was removed in commit
      ec5f0615 ("net: Kill link between CSUM and SG features.")
      Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
      Acked-by: Christian Marie <christian@ponies.io>
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      c4268778
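The arithmetic behind "32 4k pages (order 5)" can be checked with a small sketch. It assumes the usual power-of-two rounding of contiguous page allocations; the 64-byte overhead used in the usage note is a made-up figure for illustration, since the commit message does not state the exact overhead.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest order such that (page_size << order) >= bytes. */
unsigned int order_for_bytes(size_t bytes, size_t page_size)
{
	unsigned int order = 0;
	size_t span = page_size;

	assert(page_size > 0);
	while (span < bytes) {
		span <<= 1;          /* contiguous allocations double per order */
		order++;
	}
	return order;
}

/* Pages needed when the buffer may be scattered over single pages. */
size_t pages_for_bytes(size_t bytes, size_t page_size)
{
	assert(page_size > 0);
	return (bytes + page_size - 1) / page_size;
}
```

With 4k pages, exactly 64k fits in order 4 (16 pages), but 64k plus any overhead, for example `order_for_bytes(65536 + 64, 4096)`, rounds up to order 5, which is 32 contiguous pages or 128k. The same buffer needs only 17 independent pages once contiguity is no longer required, which is what scatter/gather allows.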
  24. 06 May, 2015 1 commit
  25. 16 Apr, 2015 1 commit
    • IB/ipoib: Use dedicated workqueues per interface · 0b39578b
      Committed by Doug Ledford
      During my recent work on the rtnl lock deadlock in the IPoIB driver, I
      saw that even once I fixed the apparent races for a single device, as
      soon as that device had any children, new races popped up.  It turns
      out that this is because no matter how well we protect against races
      on a single device, the fact that all devices use the same workqueue,
      and flush_workqueue() flushes *everything* from that workqueue means
      that we would also have to prevent all races between different devices
      (for instance, ipoib_mcast_restart_task on interface ib0 can race with
      ipoib_mcast_flush_dev on interface ib0.8002, resulting in a deadlock on
      the rtnl_lock).
      
      There are several possible solutions to this problem:
      
      Make carrier_on_task and mcast_restart_task try to take the rtnl for
      some set period of time and if they fail, then bail.  This runs the
      real risk of dropping work on the floor, which can end up being its
      own separate kind of deadlock.
      
      Set some global flag in the driver that says some device is in the
      middle of going down, letting all tasks know to bail.  Again, this can
      drop work on the floor.
      
      Or the method this patch attempts to use, which is when we bring an
      interface up, create a workqueue specifically for that interface, so
      that when we take it back down, we are flushing only those tasks
      associated with our interface.  In addition, keep the global
      workqueue, but now limit it to only flush tasks.  In this way, the
      flush tasks can always flush the device specific work queues without
      having deadlock issues.
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      0b39578b
  26. 31 Jan, 2015 1 commit
  27. 16 Dec, 2014 1 commit
    • IPoIB: Use dedicated workqueues per interface · 5141861c
      Committed by Doug Ledford
      During my recent work on the rtnl lock deadlock in the IPoIB driver, I
      saw that even once I fixed the apparent races for a single device, as
      soon as that device had any children, new races popped up.  It turns
      out that this is because no matter how well we protect against races
      on a single device, the fact that all devices use the same workqueue,
      and flush_workqueue() flushes *everything* from that workqueue, we can
      have one device in the middle of a down and holding the rtnl lock and
      another totally unrelated device needing to run mcast_restart_task,
      which wants the rtnl lock and will loop trying to take it unless it
      sees its own FLAG_ADMIN_UP flag go away.  Because the unrelated
      interface will never see its own ADMIN_UP flag drop, the interface
      going down will deadlock trying to flush the queue.  There are several
      possible solutions to this problem:
      
      Make carrier_on_task and mcast_restart_task try to take the rtnl for
      some set period of time and if they fail, then bail.  This runs the
      real risk of dropping work on the floor, which can end up being its
      own separate kind of deadlock.
      
      Set some global flag in the driver that says some device is in the
      middle of going down, letting all tasks know to bail.  Again, this can
      drop work on the floor.  I suppose if our own ADMIN_UP flag doesn't go
      away, then maybe after a few tries on the rtnl lock we can queue our
      own task back up as a delayed work and return and avoid dropping work
      on the floor that way.  But I'm not 100% convinced that we won't cause
      other problems.
      
      Or the method this patch attempts to use, which is when we bring an
      interface up, create a workqueue specifically for that interface, so
      that when we take it back down, we are flushing only those tasks
      associated with our interface.  In addition, keep the global
      workqueue, but now limit it to only flush tasks.  In this way, the
      flush tasks can always flush the device specific work queues without
      having deadlock issues.
      Signed-off-by: Doug Ledford <dledford@redhat.com>
      Signed-off-by: Roland Dreier <roland@purestorage.com>
      5141861c
  28. 03 Jun, 2014 1 commit
    • IB: Add a QP creation flag to use GFP_NOIO allocations · 09b93088
      Committed by Or Gerlitz
      This addresses a problem where NFS client writes over IPoIB connected
      mode may deadlock on memory allocation/writeback.
      
      The problem is not directly memory reclamation.  There is an indirect
      dependency between network filesystems writing back pages and
      ipoib_cm_tx_init() due to how a kworker is used.  Page reclaim cannot
      make forward progress until ipoib_cm_tx_init() succeeds and it is
      stuck in page reclaim itself waiting for network transmission.
      Ordinarily this situation may be avoided by having the caller use
      GFP_NOFS but ipoib_cm_tx_init() does not have that information.
      
      To address this, take a general approach and add a new QP creation
      flag that tells the low-level hardware driver to use GFP_NOIO for the
      memory allocations related to the new QP.
      
      Use the new flag in the ipoib connected mode path, and if the driver
      doesn't support it, re-issue the QP creation without the flag.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Jiri Kosina <jkosina@suse.cz>
      Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: Roland Dreier <roland@purestorage.com>
      09b93088
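The last paragraph of the message (try with the new flag, re-issue without it if the driver does not support it) follows a common fallback pattern, sketched here in userspace C. The flag value, `driver_sketch`, `driver_create_qp` and `create_tx_qp` are all hypothetical, and the use of `-EINVAL` to signal an unsupported flag is an assumption of the sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define QP_CREATE_NOIO_SKETCH 0x1u /* hypothetical creation flag bit */

struct driver_sketch {
	bool supports_noio;
	unsigned int last_flags;   /* flags used by the last successful create */
};

/* Hypothetical low-level creation: rejects flags it does not understand. */
int driver_create_qp(struct driver_sketch *drv, unsigned int flags)
{
	if ((flags & QP_CREATE_NOIO_SKETCH) && !drv->supports_noio)
		return -EINVAL;
	drv->last_flags = flags;
	return 0;
}

/* Caller: ask for NOIO allocations first, then retry without the flag. */
int create_tx_qp(struct driver_sketch *drv)
{
	unsigned int flags = QP_CREATE_NOIO_SKETCH;
	int ret;

	assert(drv != NULL);
	ret = driver_create_qp(drv, flags);
	if (ret == -EINVAL) {
		flags &= ~QP_CREATE_NOIO_SKETCH;
		ret = driver_create_qp(drv, flags);
	}
	return ret;
}
```

The fallback keeps the connected mode path working on drivers that have not been taught the flag, at the cost of losing the deadlock protection on those drivers.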
  29. 09 Nov, 2013 1 commit
  30. 14 Aug, 2013 1 commit
    • IPoIB: Fix race in deleting ipoib_neigh entries · 49b8e744
      Committed by Jim Foraker
      In several places, this snippet is used when removing neigh entries:
      
      	list_del(&neigh->list);
      	ipoib_neigh_free(neigh);
      
      The list_del() removes neigh from the associated struct ipoib_path, while
      ipoib_neigh_free() removes neigh from the device's neigh entry lookup
      table.  Both of these operations are protected by the priv->lock
      spinlock.  The table however is also protected via RCU, and so naturally
      the lock is not held when doing reads.
      
      This leads to a race condition, in which a thread may successfully look
      up a neigh entry that has already been deleted from neigh->list.  Since
      the previous deletion will have marked the entry with poison, a second
      list_del() on the object will cause a panic:
      
        #5 [ffff8802338c3c70] general_protection at ffffffff815108c5
           [exception RIP: list_del+16]
           RIP: ffffffff81289020  RSP: ffff8802338c3d20  RFLAGS: 00010082
           RAX: dead000000200200  RBX: ffff880433e60c88  RCX: 0000000000009e6c
           RDX: 0000000000000246  RSI: ffff8806012ca298  RDI: ffff880433e60c88
           RBP: ffff8802338c3d30   R8: ffff8806012ca2e8   R9: 00000000ffffffff
           R10: 0000000000000001  R11: 0000000000000000  R12: ffff8804346b2020
           R13: ffff88032a3e7540  R14: ffff8804346b26e0  R15: 0000000000000246
           ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
        #6 [ffff8802338c3d38] ipoib_cm_tx_handler at ffffffffa066fe0a [ib_ipoib]
        #7 [ffff8802338c3d98] cm_process_work at ffffffffa05149a7 [ib_cm]
        #8 [ffff8802338c3de8] cm_work_handler at ffffffffa05161aa [ib_cm]
        #9 [ffff8802338c3e38] worker_thread at ffffffff81090e10
       #10 [ffff8802338c3ee8] kthread at ffffffff81096c66
       #11 [ffff8802338c3f48] kernel_thread at ffffffff8100c0ca
      
      We move the list_del() into ipoib_neigh_free(), so that deletion happens
      only once, after the entry has been successfully removed from the lookup
      table.  This same behavior is already used in ipoib_del_neighs_by_gid()
      and __ipoib_reap_neigh().
      Signed-off-by: Jim Foraker <foraker1@llnl.gov>
      Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
      Reviewed-by: Jack Wang <jinpu.wang@profitbricks.com>
      Reviewed-by: Shlomo Pongratz <shlomop@mellanox.com>
      Signed-off-by: Roland Dreier <roland@purestorage.com>
      49b8e744
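The fix described above, unlinking from the list only inside the free routine and only once, can be modelled in a few lines of userspace C. This is a single-threaded sketch with hypothetical names (`neigh_sketch`, `neigh_free_sketch`); the lookup table is reduced to an `in_table` flag, the list helpers are local re-implementations, and the spinlock and RCU details from the commit are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct list_node {
	struct list_node *prev, *next;
};

struct neigh_sketch {
	struct list_node list;  /* membership in the owning path's list */
	bool in_table;          /* membership in the lookup table */
};

void list_init(struct list_node *head)
{
	head->prev = head->next = head;
}

void list_add(struct list_node *node, struct list_node *head)
{
	node->next = head->next;
	node->prev = head;
	head->next->prev = node;
	head->next = node;
}

/* Unlinks and poisons the node; a second call would dereference NULL. */
void list_del(struct list_node *node)
{
	node->prev->next = node->next;
	node->next->prev = node->prev;
	node->prev = node->next = NULL;
}

/*
 * Free path: unlink from the path list only when this call is the one
 * that removed the entry from the lookup table. Returns true if it did.
 */
bool neigh_free_sketch(struct neigh_sketch *neigh)
{
	assert(neigh != NULL);
	if (!neigh->in_table)
		return false;        /* already removed by an earlier caller */
	neigh->in_table = false;
	list_del(&neigh->list);
	return true;
}
```

Because the unlink is tied to the successful table removal, a caller that still holds a stale reference gets `false` back instead of running `list_del()` on a poisoned node.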
  31. 08 May, 2013 1 commit
  32. 17 Apr, 2013 1 commit
  33. 23 Mar, 2013 1 commit
    • IPoIB: Fix send lockup due to missed TX completion · 1ee9e2aa
      Committed by Mike Marciniszyn
      Commit f0dc117a ("IPoIB: Fix TX queue lockup with mixed UD/CM
      traffic") attempts to solve an issue where unprocessed UD send
      completions can deadlock the netdev.
      
      The patch doesn't fully resolve the issue because if more than half
      the tx_outstanding's were UD and all of the destinations are RC
      reachable, arming the CQ doesn't solve the issue.
      
      This patch uses the IB_CQ_REPORT_MISSED_EVENTS on the
      ib_req_notify_cq().  If the rc is above 0, the UD send cq completion
      callback is called directly to re-arm the send completion timer.
      
      This issue is seen in very large parallel filesystem deployments
      and the patch has been shown to correct the issue.
      
      Cc: <stable@vger.kernel.org>
      Reviewed-by: Dean Luick <dean.luick@intel.com>
      Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
      Signed-off-by: Roland Dreier <roland@purestorage.com>
      1ee9e2aa
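The control flow in the third paragraph (request notification, and if the return value is above 0 call the completion callback directly) can be sketched as below. The CQ is reduced to a few counters, and `cq_sketch`, `req_notify_sketch`, `send_comp_handler_sketch` and `arm_send_cq` are hypothetical stand-ins rather than the verbs API or the driver's functions.

```c
#include <assert.h>
#include <stddef.h>

struct cq_sketch {
	int pending;           /* completions queued before the CQ was armed */
	int armed;             /* set once notification has been requested */
	int poll_timer_armed;  /* set by the handler to schedule a poll */
};

/* Hypothetical arm request: returns > 0 if completions may have been missed. */
int req_notify_sketch(struct cq_sketch *cq)
{
	cq->armed = 1;
	return cq->pending > 0 ? 1 : 0;
}

/* Stand-in for the send completion callback, which re-arms the poll timer. */
void send_comp_handler_sketch(struct cq_sketch *cq)
{
	cq->poll_timer_armed = 1;
}

/* Arm the CQ; if events were missed, invoke the handler by hand. */
void arm_send_cq(struct cq_sketch *cq)
{
	assert(cq != NULL);
	if (req_notify_sketch(cq) > 0)
		send_comp_handler_sketch(cq);
}
```

Without the return-value check, completions that arrived just before arming would never raise an event, so nothing would schedule the poll that frees the transmit queue.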