1. 15 7月, 2015 2 次提交
    • Y
      IB/ipoib: Scatter-Gather support in connected mode · c4268778
      Yuval Shaia 提交于
      By default, IPoIB-CM driver uses 64k MTU. Larger MTU gives better
      performance.
      This MTU plus overhead puts the memory allocation for IP based packets at
      32 4k pages (order 5), which have to be contiguous.
      When the system memory under pressure, it was observed that allocating 128k
      contiguous physical memory is difficult and causes serious errors (such as
      system becomes unusable).
      
      This enhancement resolve the issue by removing the physically contiguous
      memory requirement using Scatter/Gather feature that exists in Linux stack.
      
      With this fix Scatter-Gather will be supported also in connected mode.
      
      This change reverts some of the change made in commit e112373f
      ("IPoIB/cm: Reduce connected mode TX object size").
      
      The ability to use SG in IPoIB CM is possible because the coupling
      between NETIF_F_SG and NETIF_F_CSUM was removed in commit
      ec5f0615 ("net: Kill link between CSUM and SG features.")
      Signed-off-by: NYuval Shaia <yuval.shaia@oracle.com>
      Acked-by: NChristian Marie <christian@ponies.io>
      Signed-off-by: NDoug Ledford <dledford@redhat.com>
      c4268778
    • H
      IB/ipoib: Prevent lockdep warning in __ipoib_ib_dev_flush · 8b7cce0d
      Haggai Eran 提交于
      __ipoib_ib_dev_flush calls itself recursively on child devices, and lockdep
      complains about locking vlan_rwsem twice (see below). Use down_read_nested
      instead of down_read to prevent the warning.
      
       =============================================
       [ INFO: possible recursive locking detected ]
       4.1.0-rc4+ #36 Tainted: G           O
       ---------------------------------------------
       kworker/u20:2/261 is trying to acquire lock:
        (&priv->vlan_rwsem){.+.+..}, at: [<ffffffffa0791e2a>] __ipoib_ib_dev_flush+0x3a/0x2b0 [ib_ipoib]
      
       but task is already holding lock:
        (&priv->vlan_rwsem){.+.+..}, at: [<ffffffffa0791e2a>] __ipoib_ib_dev_flush+0x3a/0x2b0 [ib_ipoib]
      
       other info that might help us debug this:
        Possible unsafe locking scenario:
      
              CPU0
              ----
         lock(&priv->vlan_rwsem);
         lock(&priv->vlan_rwsem);
      
        *** DEADLOCK ***
      
        May be due to missing lock nesting notation
      
       3 locks held by kworker/u20:2/261:
        #0:  ("%s""ipoib_flush"){.+.+..}, at: [<ffffffff810827cc>] process_one_work+0x15c/0x760
        #1:  ((&priv->flush_heavy)){+.+...}, at: [<ffffffff810827cc>] process_one_work+0x15c/0x760
        #2:  (&priv->vlan_rwsem){.+.+..}, at: [<ffffffffa0791e2a>] __ipoib_ib_dev_flush+0x3a/0x2b0 [ib_ipoib]
      
       stack backtrace:
       CPU: 3 PID: 261 Comm: kworker/u20:2 Tainted: G           O    4.1.0-rc4+ #36
       Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2007
       Workqueue: ipoib_flush ipoib_ib_dev_flush_heavy [ib_ipoib]
        ffff8801c6c54790 ffff8801c9927af8 ffffffff81665238 0000000000000001
        ffffffff825b5b30 ffff8801c9927bd8 ffffffff810bba51 ffff880100000000
        ffffffff00000001 ffff880100000001 ffff8801c6c55428 ffff8801c6c54790
       Call Trace:
        [<ffffffff81665238>] dump_stack+0x4f/0x6f
        [<ffffffff810bba51>] __lock_acquire+0x741/0x1820
        [<ffffffff810bcbf8>] lock_acquire+0xc8/0x240
        [<ffffffffa0791e2a>] ? __ipoib_ib_dev_flush+0x3a/0x2b0 [ib_ipoib]
        [<ffffffff81669d2c>] down_read+0x4c/0x70
        [<ffffffffa0791e2a>] ? __ipoib_ib_dev_flush+0x3a/0x2b0 [ib_ipoib]
        [<ffffffffa0791e2a>] __ipoib_ib_dev_flush+0x3a/0x2b0 [ib_ipoib]
        [<ffffffffa0791e4a>] __ipoib_ib_dev_flush+0x5a/0x2b0 [ib_ipoib]
        [<ffffffffa07920ba>] ipoib_ib_dev_flush_heavy+0x1a/0x20 [ib_ipoib]
        [<ffffffff81082871>] process_one_work+0x201/0x760
        [<ffffffff810827cc>] ? process_one_work+0x15c/0x760
        [<ffffffff81082ef0>] worker_thread+0x120/0x4d0
        [<ffffffff81082dd0>] ? process_one_work+0x760/0x760
        [<ffffffff81082dd0>] ? process_one_work+0x760/0x760
        [<ffffffff81088b7e>] kthread+0xfe/0x120
        [<ffffffff81088a80>] ? __init_kthread_worker+0x70/0x70
        [<ffffffff8166c6e2>] ret_from_fork+0x42/0x70
        [<ffffffff81088a80>] ? __init_kthread_worker+0x70/0x70
      Signed-off-by: NHaggai Eran <haggaie@mellanox.com>
      Signed-off-by: NDoug Ledford <dledford@redhat.com>
      8b7cce0d
  2. 16 4月, 2015 5 次提交
    • E
      IB/ipoib: Handle QP in SQE state · 2c010730
      Erez Shitrit 提交于
      As the result of a completion error the QP can moved to SQE state by
      the hardware. Since it's not the Error state, there are no flushes
      and hence the driver doesn't know about that.
      
      The fix creates a task that after completion with error which is not a
      flush tracks the QP state and if it is in SQE state moves it back to RTS.
      Signed-off-by: NErez Shitrit <erezsh@mellanox.com>
      Signed-off-by: NOr Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: NDoug Ledford <dledford@redhat.com>
      2c010730
    • E
      IB/ipoib: Use one linear skb in RX flow · a44878d1
      Erez Shitrit 提交于
      The current code in the RX flow uses two sg entries for each incoming
      packet, the first one was for the IB headers and the second for the rest
      of the data, that causes two  dma map/unmap and two allocations, and few
      more actions that were done at the data path.
      
      Use only one linear skb on each incoming packet, for the data (IB
      headers and payload), that reduces the packet processing in the
      data-path (only one skb, no frags, the first frag was not used anyway,
      less memory allocations) and the dma handling (only one dma map/unmap
      over each incoming packet instead of two map/unmap per each incoming packet).
      
      After commit 73d3fe6d ("gro: fix aggregation for skb using frag_list") from
      Eric Dumazet, we will get full aggregation for large packets.
      
      When running bandwidth tests before and after the (over the card's numa node),
      using "netperf -H 1.1.1.3 -T -t TCP_STREAM", the results before are ~12Gbs before
      and after ~16Gbs on my setup (Mellanox's ConnectX3).
      Signed-off-by: NErez Shitrit <erezsh@mellanox.com>
      Signed-off-by: NOr Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: NDoug Ledford <dledford@redhat.com>
      a44878d1
    • D
      IB/ipoib: No longer use flush as a parameter · efc82eee
      Doug Ledford 提交于
      Various places in the IPoIB code had a deadlock related to flushing
      the ipoib workqueue.  Now that we have per device workqueues and a
      specific flush workqueue, there is no longer a deadlock issue with
      flushing the device specific workqueues and we can do so unilaterally.
      Signed-off-by: NDoug Ledford <dledford@redhat.com>
      efc82eee
    • D
      IB/ipoib: Use dedicated workqueues per interface · 0b39578b
      Doug Ledford 提交于
      During my recent work on the rtnl lock deadlock in the IPoIB driver, I
      saw that even once I fixed the apparent races for a single device, as
      soon as that device had any children, new races popped up.  It turns
      out that this is because no matter how well we protect against races
      on a single device, the fact that all devices use the same workqueue,
      and flush_workqueue() flushes *everything* from that workqueue means
      that we would also have to prevent all races between different devices
      (for instance, ipoib_mcast_restart_task on interface ib0 can race with
      ipoib_mcast_flush_dev on interface ib0.8002, resulting in a deadlock on
      the rtnl_lock).
      
      There are several possible solutions to this problem:
      
      Make carrier_on_task and mcast_restart_task try to take the rtnl for
      some set period of time and if they fail, then bail.  This runs the
      real risk of dropping work on the floor, which can end up being its
      own separate kind of deadlock.
      
      Set some global flag in the driver that says some device is in the
      middle of going down, letting all tasks know to bail.  Again, this can
      drop work on the floor.
      
      Or the method this patch attempts to use, which is when we bring an
      interface up, create a workqueue specifically for that interface, so
      that when we take it back down, we are flushing only those tasks
      associated with our interface.  In addition, keep the global
      workqueue, but now limit it to only flush tasks.  In this way, the
      flush tasks can always flush the device specific work queues without
      having deadlock issues.
      Signed-off-by: NDoug Ledford <dledford@redhat.com>
      0b39578b
    • D
      IB/ipoib: factor out ah flushing · e135106f
      Doug Ledford 提交于
      Create a an ipoib_flush_ah and ipoib_stop_ah routines to use at
      appropriate times to flush out all remaining ah entries before we shut
      the device down.
      
      Because neighbors and mcast entries can each have a reference on any
      given ah, we must make sure to free all of those first before our ah
      will actually have a 0 refcount and be able to be reaped.
      
      This factoring is needed in preparation for having per-device work
      queues.  The original per-device workqueue code resulted in the following
      error message:
      
      <ibdev>: ib_dealloc_pd failed
      
      That error was tracked down to this issue.  With the changes to which
      workqueues were flushed when, there were no flushes of the per device
      workqueue after the last ah's were freed, resulting in an attempt to
      dealloc the pd with outstanding resources still allocated.  This code
      puts the explicit flushes in the needed places to avoid that problem.
      Signed-off-by: NDoug Ledford <dledford@redhat.com>
      e135106f
  3. 31 1月, 2015 3 次提交
  4. 16 12月, 2014 3 次提交
    • D
      IPoIB: No longer use flush as a parameter · ce347ab9
      Doug Ledford 提交于
      Various places in the IPoIB code had a deadlock related to flushing
      the ipoib workqueue.  Now that we have per device workqueues and a
      specific flush workqueue, there is no longer a deadlock issue with
      flushing the device specific workqueues and we can do so unilaterally.
      Signed-off-by: NDoug Ledford <dledford@redhat.com>
      Signed-off-by: NRoland Dreier <roland@purestorage.com>
      ce347ab9
    • D
      IPoIB: Make ipoib_mcast_stop_thread flush the workqueue · bb42a6dd
      Doug Ledford 提交于
      We used to pass a flush variable to mcast_stop_thread to indicate if
      we should flush the workqueue or not.  This was due to some code
      trying to flush a workqueue that it was currently running on which is
      a no-no.  Now that we have per-device work queues, and now that
      ipoib_mcast_restart_task has taken the fact that it is queued on a
      single thread workqueue with all of the ipoib_mcast_join_task's and
      therefore has no need to stop the join task while it runs, we can do
      away with the flush parameter and unilaterally flush always.
      Signed-off-by: NDoug Ledford <dledford@redhat.com>
      Signed-off-by: NRoland Dreier <roland@purestorage.com>
      bb42a6dd
    • D
      IPoIB: Use dedicated workqueues per interface · 5141861c
      Doug Ledford 提交于
      During my recent work on the rtnl lock deadlock in the IPoIB driver, I
      saw that even once I fixed the apparent races for a single device, as
      soon as that device had any children, new races popped up.  It turns
      out that this is because no matter how well we protect against races
      on a single device, the fact that all devices use the same workqueue,
      and flush_workqueue() flushes *everything* from that workqueue, we can
      have one device in the middle of a down and holding the rtnl lock and
      another totally unrelated device needing to run mcast_restart_task,
      which wants the rtnl lock and will loop trying to take it unless is
      sees its own FLAG_ADMIN_UP flag go away.  Because the unrelated
      interface will never see its own ADMIN_UP flag drop, the interface
      going down will deadlock trying to flush the queue.  There are several
      possible solutions to this problem:
      
      Make carrier_on_task and mcast_restart_task try to take the rtnl for
      some set period of time and if they fail, then bail.  This runs the
      real risk of dropping work on the floor, which can end up being its
      own separate kind of deadlock.
      
      Set some global flag in the driver that says some device is in the
      middle of going down, letting all tasks know to bail.  Again, this can
      drop work on the floor.  I suppose if our own ADMIN_UP flag doesn't go
      away, then maybe after a few tries on the rtnl lock we can queue our
      own task back up as a delayed work and return and avoid dropping work
      on the floor that way.  But I'm not 100% convinced that we won't cause
      other problems.
      
      Or the method this patch attempts to use, which is when we bring an
      interface up, create a workqueue specifically for that interface, so
      that when we take it back down, we are flushing only those tasks
      associated with our interface.  In addition, keep the global
      workqueue, but now limit it to only flush tasks.  In this way, the
      flush tasks can always flush the device specific work queues without
      having deadlock issues.
      Signed-off-by: NDoug Ledford <dledford@redhat.com>
      Signed-off-by: NRoland Dreier <roland@purestorage.com>
      5141861c
  5. 11 8月, 2014 1 次提交
    • A
      IB/ipoib: Avoid multicast join attempts with invalid P_key · dd57c930
      Alex Estrin 提交于
      Currently, the parent interface keeps sending broadcast group join
      requests even if p_key index 0 is invalid, which is possible/common in
      virtualized environments where a VF has been probed to VM but the
      actual P_key configuration has not yet been assigned by the management
      software. This creates unnecessary noise on the fabric and in the
      kernel logs:
      
          ib0: multicast join failed for ff12:401b:8000:0000:0000:0000:ffff:ffff, status -22
      
      The original code run the multicast task regardless of the actual
      P_key value, which can be avoided. The fix is to re-init resources and
      bring interface up only if P_key index 0 is valid either when starting
      up or on PKEY_CHANGE event.
      
      Fixes: c2904141 ("IPoIB: Fix pkey change flow for virtualization environments")
      Reviewed-by: NIra Weiny <ira.weiny@intel.com>
      Signed-off-by: NAlex Estrin <alex.estrin@intel.com>
      Signed-off-by: NRoland Dreier <roland@purestorage.com>
      dd57c930
  6. 05 8月, 2014 2 次提交
  7. 09 11月, 2013 4 次提交
    • E
      IPoIB: Add path query flushing in ipoib_ib_dev_cleanup · a39c52ab
      Erez Shitrit 提交于
      The path_rec_completion() callback may be invoked asynchronously even
      at the middle of "driver uninit" process.  This can lead to scheduling
      a task that tries to touch members of the priv object that are no
      longer valid.  For example the function cm_create_tx_qp can attempt to
      create qp with no valid priv->pd object.
      
      The following crash is one of the results:
      RIP: 0010:[<ffffffffa021bb47>]  [<ffffffffa021bb47>] ipoib_cm_create_tx_qp+0x57/0x90 [ib_ipoib]
      Process ipoib (pid: 5916, threadinfo ffff8803786e4000, task ffff8804150e1500)
      Stack:
      Call Trace:
      [<ffffffff81309ef0>] ? get_random_bytes+0x20/0x30
      [<ffffffffa021be2a>] ipoib_cm_tx_init+0xca/0x340 [ib_ipoib]
      [<ffffffffa021f765>] ipoib_cm_tx_start+0x215/0x3f0 [ib_ipoib]
      [<ffffffffa021f550>] ? ipoib_cm_tx_start+0x0/0x3f0 [ib_ipoib]
      [<ffffffff8108b2b0>] worker_thread+0x170/0x2a0
      [<ffffffff81090bf0>] ? autoremove_wake_function+0x0/0x40
      [<ffffffff8108b140>] ? worker_thread+0x0/0x2a0
      [<ffffffff81090886>] kthread+0x96/0xa0
      [<ffffffff8100c14a>] child_rip+0xa/0x20
      [<ffffffff810907f0>] ? kthread+0x0/0xa0
      [<ffffffff8100c140>] ? child_rip+0x0/0x20
      
      Fix that by flushing all pending path queries at this point.
      Signed-off-by: NAlex Markuze <markuze@mellanox.com>
      Signed-off-by: NErez Shitrit <erezsh@mellanox.com>
      Signed-off-by: NOr Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: NRoland Dreier <roland@purestorage.com>
      a39c52ab
    • E
      IPoIB: Avoid flushing the driver workqueue on dev_down · aede2501
      Erez Shitrit 提交于
      The driver should not flush the whole workqueue when only one work (the
      pkey poll one) needs to be cancelled.  Use cancel_delayed_work_sync()
      instead.
      Signed-off-by: NErez Shitrit <erezsh@mellanox.com>
      Signed-off-by: NOr Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: NRoland Dreier <roland@purestorage.com>
      aede2501
    • E
      IPoIB: Fix deadlock between dev_change_flags() and __ipoib_dev_flush() · f47944cc
      Erez Shitrit 提交于
      When ipoib interface is going down it takes all of its children with
      it, under mutex.
      
      For each child, dev_change_flags() is called.  That function calls
      ipoib_stop() via the ndo, and causes flush of the workqueue.
      Sometimes in the workqueue an __ipoib_dev_flush work() is waiting and
      when invoked tries to get the same mutex, which leads to a deadlock,
      as seen below.
      
      The solution is to switch to rw-sem instead of mutex.
      
      The deadlock:
      [11028.165303]  [<ffffffff812b0977>] ? vgacon_scroll+0x107/0x2e0
      [11028.171844]  [<ffffffff814eaac5>] schedule_timeout+0x215/0x2e0
      [11028.178465]  [<ffffffff8105a5c3>] ? perf_event_task_sched_out+0x33/0x80
      [11028.185962]  [<ffffffff814ea743>] wait_for_common+0x123/0x180
      [11028.192491]  [<ffffffff8105fa40>] ? default_wake_function+0x0/0x20
      [11028.199504]  [<ffffffff814ea85d>] wait_for_completion+0x1d/0x20
      [11028.206224]  [<ffffffff8108b4f1>] flush_cpu_workqueue+0x61/0x90
      [11028.212948]  [<ffffffff8108b5a0>] ? wq_barrier_func+0x0/0x20
      [11028.219375]  [<ffffffff8108bfc4>] flush_workqueue+0x54/0x80
      [11028.225712]  [<ffffffffa05a0576>] ipoib_mcast_stop_thread+0x66/0x90 [ib_ipoib]
      [11028.233988]  [<ffffffffa059ccea>] ipoib_ib_dev_down+0x6a/0x100 [ib_ipoib]
      [11028.241678]  [<ffffffffa059849a>] ipoib_stop+0x8a/0x140 [ib_ipoib]
      [11028.248692]  [<ffffffff8142adf1>] dev_close+0x71/0xc0
      [11028.254447]  [<ffffffff8142a631>] dev_change_flags+0xa1/0x1d0
      [11028.261062]  [<ffffffffa059851b>] ipoib_stop+0x10b/0x140 [ib_ipoib]
      [11028.268172]  [<ffffffff8142adf1>] dev_close+0x71/0xc0
      [11028.273922]  [<ffffffff8142a631>] dev_change_flags+0xa1/0x1d0
      [11028.280452]  [<ffffffff8148f20b>] devinet_ioctl+0x5eb/0x6a0
      [11028.286786]  [<ffffffff814903b8>] inet_ioctl+0x88/0xa0
      [11028.292633]  [<ffffffff8141591a>] sock_ioctl+0x7a/0x280
      [11028.298576]  [<ffffffff81189012>] vfs_ioctl+0x22/0xa0
      [11028.304326]  [<ffffffff81140540>] ? unmap_region+0x110/0x130
      [11028.310756]  [<ffffffff811891b4>] do_vfs_ioctl+0x84/0x580
      [11028.316897]  [<ffffffff81189731>] sys_ioctl+0x81/0xa0
      
      and
      
      11028.017533]  [<ffffffff8105a5c3>] ? perf_event_task_sched_out+0x33/0x80
      [11028.025030]  [<ffffffff8100bb8e>] ? apic_timer_interrupt+0xe/0x20
      [11028.031945]  [<ffffffff814eb2ae>] __mutex_lock_slowpath+0x13e/0x180
      [11028.039053]  [<ffffffff814eb14b>] mutex_lock+0x2b/0x50
      [11028.044910]  [<ffffffffa059f7e7>] __ipoib_ib_dev_flush+0x37/0x210 [ib_ipoib]
      [11028.052894]  [<ffffffffa059fa00>] ? ipoib_ib_dev_flush_light+0x0/0x20 [ib_ipoib]
      [11028.061363]  [<ffffffffa059fa17>] ipoib_ib_dev_flush_light+0x17/0x20 [ib_ipoib]
      [11028.069738]  [<ffffffff8108b120>] worker_thread+0x170/0x2a0
      [11028.076068]  [<ffffffff81090990>] ? autoremove_wake_function+0x0/0x40
      [11028.083374]  [<ffffffff8108afb0>] ? worker_thread+0x0/0x2a0
      [11028.089709]  [<ffffffff81090626>] kthread+0x96/0xa0
      [11028.095266]  [<ffffffff8100c0ca>] child_rip+0xa/0x20
      [11028.100921]  [<ffffffff81090590>] ? kthread+0x0/0xa0
      [11028.106573]  [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
      [11028.112423] INFO: task ifconfig:23640 blocked for more than 120 seconds.
      Signed-off-by: NErez Shitrit <erezsh@mellanox.com>
      Signed-off-by: NOr Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: NRoland Dreier <roland@purestorage.com>
      f47944cc
    • E
      IPoIB: Fix crash in dev_open error flow · c2bb5628
      Erez Shitrit 提交于
      If napi has never been enabled when calling ipoib_ib_dev_stop, a
      kernel crash occurs, because the verbs layer completion handler
      (ipoib_ib_completion) calls napi_schedule unconditionally.
      
      If the napi structure passed in the napi_schedule call has not
      been initialized, napi will crash.
      
      The cleanest solution is to simply enable napi before calling
      ipoib_ib_dev_stop in the dev_open error flow. (dev_stop then
      immediately disables napi).
      Signed-off-by: NJack Morgenstein <jackm@dev.mellanox.co.il>
      Signed-off-by: NErez Shitrit <erezsh@mellanox.com>
      Signed-off-by: NOr Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: NRoland Dreier <roland@purestorage.com>
      c2bb5628
  8. 01 8月, 2013 1 次提交
    • E
      IPoIB: Fix pkey change flow for virtualization environments · c2904141
      Erez Shitrit 提交于
      IPoIB's required behaviour w.r.t to the pkey used by the device is the following:
      
      - For "parent" interfaces (e.g ib0, ib1, etc) who are created
        automatically as a result of hot-plug events from the IB core, the
        driver needs to take whatever pkey vlaue it finds in index 0, and
        stick to that index.
      
      - For child interfaces (e.g ib0.8001, etc) created by admin directive,
        the driver needs to use and stick to the value provided during its
        creation.
      
      In SR-IOV environment its possible for the VF probe to take place
      before the cloud management software provisions the suitable pkey for
      the VF in the paravirtualed PKEY table index 0. When this is the case,
      the VF IB stack will find in index 0 an invalide pkey, which is all
      zeros.
      
      Moreover, the cloud managment can assign the pkey value at index 0 at
      any time of the guest life cycle.
      
      The correct behavior for IPoIB to address these requirements for
      parent interfaces is to use PKEY_CHANGE event as trigger to optionally
      re-init the device pkey value and re-create all the relevant resources
      accordingly, if the value of the pkey in index 0 has changed (from
      invalid to valid or from valid value X to invalid value Y).
      
      This patch enhances the heavy flushing code which is triggered by pkey
      change event, to behave correctly for parent devices. For child
      devices, the code remains the same, namely chases pkey value and not
      index.
      Signed-off-by: NErez Shitrit <erezsh@mellanox.com>
      Signed-off-by: NOr Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: NRoland Dreier <roland@purestorage.com>
      c2904141
  9. 06 2月, 2013 1 次提交
    • S
      IPoIB: Fix crash due to skb double destruct · 7e5a90c2
      Shlomo Pongratz 提交于
      After commit b13912bb ("IPoIB: Call skb_dst_drop() once skb is
      enqueued for sending"), using connected mode and running multithreaded
      iperf for long time, ie
      
          iperf -c <IP> -P 16 -t 3600
      
      results in a crash.
      
      After the above-mentioned patch, the driver is calling skb_orphan() and
      skb_dst_drop() after calling post_send() in ipoib_cm.c::ipoib_cm_send()
      (also in ipoib_ib.c::ipoib_send())
      
      The problem with this is, as is written in a comment in both routines,
      "it's entirely possible that the completion handler will run before we
      execute anything after the post_send()."  This leads to running the
      skb cleanup routines simultaneously in two different contexts.
      
      The solution is to always perform the skb_orphan() and skb_dst_drop()
      before queueing the send work request.  If an error occurs, then it
      will be no different than the regular case where dev_free_skb_any() in
      the completion path, which is assumed to be after these two routines.
      Signed-off-by: NShlomo Pongratz <shlomop@mellanox.com>
      Signed-off-by: NRoland Dreier <roland@purestorage.com>
      7e5a90c2
  10. 20 12月, 2012 1 次提交
    • R
      IPoIB: Call skb_dst_drop() once skb is enqueued for sending · b13912bb
      Roland Dreier 提交于
      Currently, IPoIB delays collecting send completions for TX packets in
      order to batch work more efficiently.  It does skb_orphan() right after
      queuing the packets so that destructors run early, to avoid problems
      like holding socket send buffers for too long (since we might not
      collect a send completion until a long time after the packet is
      actually sent).
      
      However, IPoIB clears IFF_XMIT_DST_RELEASE because it actually looks
      at skb_dst() to update the PMTU when it gets a too-long packet.  This
      means that the packets sitting in the TX ring with uncollected send
      completions are holding a reference on the dst.  We've seen this lead
      to pathological behavior with respect to route and neighbour GC.  The
      easy fix for this is to call skb_dst_drop() when we call skb_orphan().
      
      Also, give packets sent via connected mode (CM) the same skb_orphan()
      / skb_dst_drop() treatment that packets sent via datagram mode get.
      Signed-off-by: NRoland Dreier <roland@purestorage.com>
      b13912bb
  11. 11 7月, 2012 1 次提交
    • E
      IPoIB: fix skb truesize underestimatiom · b28ba726
      Eric Dumazet 提交于
      Or Gerlitz reported triggering of WARN_ON_ONCE(delta < len); in
      skb_try_coalesce()
      This warning tracks drivers that incorrectly set skb->truesize
      
      IPoIB indeed allocates a full page to store a fragment, but only
      accounts in skb->truesize the used part of the page (frame length)
      
      This patch fixes skb truesize underestimation, and
      also fixes a performance issue, because RX skbs have not enough tailroom
      to allow IP and TCP stacks to pull their header in skb linear part
      without an expensive call to pskb_expand_head()
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NOr Gerlitz <ogerlitz@mellanox.com>
      Cc: Erez Shitrit <erezsh@mellanox.com>
      Cc: Shlomo Pongartz <shlomop@mellanox.com>
      Cc: Roland Dreier <roland@purestorage.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b28ba726
  12. 09 3月, 2012 1 次提交
    • O
      IB: Change CQE "csum_ok" field to a bit flag · d927d505
      Or Gerlitz 提交于
      Use a bit in wc_flags rather then a whole integer to hold the
      "checksum OK" flag.  By itself, this change doesn't reduce the size of
      struct ib_wc on 64bit machines -- it stays on 56 bytes because of
      padding.  However, it will allow to add more fields in the future
      without enlarging the struct.  Also, it will let us have a unified
      approach with future libibverbs checksum offload reporting, because a
      bit flag doesn't break the library ABI.
      
      This patch was suggested during conversation with Liran Liss
      <liranl@mellanox.com>.
      Signed-off-by: NOr Gerlitz <ogerlitz@mellanox.com>
      Reviewed-by: NSean Hefty <sean.hefty@intel.com>
      Signed-off-by: NRoland Dreier <roland@purestorage.com>
      d927d505
  13. 30 11月, 2011 1 次提交
    • M
      IB/ipoib: Prevent hung task or softlockup processing multicast response · 3874397c
      Mike Marciniszyn 提交于
      This following can occur with ipoib when processing a multicast reponse:
      
          BUG: soft lockup - CPU#0 stuck for 67s! [ib_mad1:982]
          Modules linked in: ...
          CPU 0:
          Modules linked in: ...
          Pid: 982, comm: ib_mad1 Not tainted 2.6.32-131.0.15.el6.x86_64 #1 ProLiant DL160 G5
          RIP: 0010:[<ffffffff814ddb27>]  [<ffffffff814ddb27>] _spin_unlock_irqrestore+0x17/0x20
          RSP: 0018:ffff8802119ed860  EFLAGS: 00000246
          0000000000000004 RBX: ffff8802119ed860 RCX: 000000000000a299
          RDX: ffff88021086c700 RSI: 0000000000000246 RDI: 0000000000000246
          RBP: ffffffff8100bc8e R08: ffff880210ac229c R09: 0000000000000000
          R10: ffff88021278aab8 R11: 0000000000000000 R12: ffff8802119ed860
          R13: ffffffff8100be6e R14: 0000000000000001 R15: 0000000000000003
          FS:  0000000000000000(0000) GS:ffff880028200000(0000) knlGS:0000000000000000
          CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
          CR2: 00000000006d4840 CR3: 0000000209aa5000 CR4: 00000000000406f0
          DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
          DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
          Call Trace:
          [<ffffffffa032c247>] ? ipoib_mcast_send+0x157/0x480 [ib_ipoib]
          [<ffffffff8100bc8e>] ? apic_timer_interrupt+0xe/0x20
          [<ffffffff8100bc8e>] ? apic_timer_interrupt+0xe/0x20
          [<ffffffffa03283d4>] ? ipoib_path_lookup+0x124/0x2d0 [ib_ipoib]
          [<ffffffffa03286fc>] ? ipoib_start_xmit+0x17c/0x430 [ib_ipoib]
          [<ffffffff8141e758>] ? dev_hard_start_xmit+0x2c8/0x3f0
          [<ffffffff81439d0a>] ? sch_direct_xmit+0x15a/0x1c0
          [<ffffffff81423098>] ? dev_queue_xmit+0x388/0x4d0
          [<ffffffffa032d6b7>] ? ipoib_mcast_join_finish+0x2c7/0x510 [ib_ipoib]
          [<ffffffffa032dab8>] ? ipoib_mcast_sendonly_join_complete+0x1b8/0x1f0 [ib_ipoib]
          [<ffffffffa02a0946>] ? mcast_work_handler+0x1a6/0x710 [ib_sa]
          [<ffffffffa015f01e>] ? ib_send_mad+0xfe/0x3c0 [ib_mad]
          [<ffffffffa00f6c93>] ? ib_get_cached_lmc+0xa3/0xb0 [ib_core]
          [<ffffffffa02a0f9b>] ? join_handler+0xeb/0x200 [ib_sa]
          [<ffffffffa029e4fc>] ? ib_sa_mcmember_rec_callback+0x5c/0xa0 [ib_sa]
          [<ffffffffa029e79c>] ? recv_handler+0x3c/0x70 [ib_sa]
          [<ffffffffa01603a4>] ? ib_mad_completion_handler+0x844/0x9d0 [ib_mad]
          [<ffffffffa015fb60>] ? ib_mad_completion_handler+0x0/0x9d0 [ib_mad]
          [<ffffffff81088830>] ? worker_thread+0x170/0x2a0
          [<ffffffff8108e160>] ? autoremove_wake_function+0x0/0x40
          [<ffffffff810886c0>] ? worker_thread+0x0/0x2a0
          [<ffffffff8108ddf6>] ? kthread+0x96/0xa0
          [<ffffffff8100c1ca>] ? child_rip+0xa/0x20
      
      Coinciding with stack trace is the following message:
      
          ib0: ib_address_create failed
      
      The code below in ipoib_mcast_join_finish() will note the above
      failure in the address handle but otherwise continue:
      
                      ah = ipoib_create_ah(dev, priv->pd, &av);
                      if (!ah) {
                              ipoib_warn(priv, "ib_address_create failed\n");
                      } else {
      
      The while loop at the bottom of ipoib_mcast_join_finish() will attempt
      to send queued multicast packets in mcast->pkt_queue and eventually
      end up in ipoib_mcast_send():
      
              if (!mcast->ah) {
                      if (skb_queue_len(&mcast->pkt_queue) < IPOIB_MAX_MCAST_QUEUE)
                              skb_queue_tail(&mcast->pkt_queue, skb);
                      else {
                              ++dev->stats.tx_dropped;
                              dev_kfree_skb_any(skb);
                      }
      
      My read is that the code will requeue the packet and return to the
      ipoib_mcast_join_finish() while loop and the stage is set for the
      "hung" task diagnostic as the while loop never sees a non-NULL ah, and
      will do nothing to resolve.
      
      There are GFP_ATOMIC allocates in the provider routines, so this is
      possible and should be dealt with.
      
      The test that induced the failure is associated with a host SM on the
      same server during a shutdown.
      
      This patch causes ipoib_mcast_join_finish() to exit with an error
      which will flush the queued mcast packets.  Nothing is done to unwind
      the QP attached state so that subsequent sends from above will retry
      the join.
      Reviewed-by: NRam Vepa <ram.vepa@qlogic.com>
      Reviewed-by: NGary Leshner <gary.leshner@qlogic.com>
      Signed-off-by: NMike Marciniszyn <mike.marciniszyn@qlogic.com>
      Signed-off-by: NRoland Dreier <roland@purestorage.com>
      3874397c
  14. 01 11月, 2011 1 次提交
  15. 19 10月, 2011 1 次提交
  16. 27 8月, 2011 1 次提交
  17. 20 4月, 2011 1 次提交
  18. 11 1月, 2011 2 次提交
  19. 29 9月, 2010 1 次提交
  20. 30 3月, 2010 1 次提交
    • T
      include cleanup: Update gfp.h and slab.h includes to prepare for breaking... · 5a0e3ad6
      Tejun Heo 提交于
      include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
      
      percpu.h is included by sched.h and module.h and thus ends up being
      included when building most .c files.  percpu.h includes slab.h which
      in turn includes gfp.h making everything defined by the two files
      universally available and complicating inclusion dependencies.
      
      percpu.h -> slab.h dependency is about to be removed.  Prepare for
      this change by updating users of gfp and slab facilities include those
      headers directly instead of assuming availability.  As this conversion
      needs to touch large number of source files, the following script is
      used as the basis of conversion.
      
        http://userweb.kernel.org/~tj/misc/slabh-sweep.py
      
      The script does the followings.
      
      * Scan files for gfp and slab usages and update includes such that
        only the necessary includes are there.  ie. if only gfp is used,
        gfp.h, if slab is used, slab.h.
      
      * When the script inserts a new include, it looks at the include
        blocks and try to put the new include such that its order conforms
        to its surrounding.  It's put in the include block which contains
        core kernel includes, in the same order that the rest are ordered -
        alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
        doesn't seem to be any matching order.
      
      * If the script can't find a place to put a new include (mostly
        because the file doesn't have fitting include block), it prints out
        an error message indicating which .h file needs to be added to the
        file.
      
      The conversion was done in the following steps.
      
      1. The initial automatic conversion of all .c files updated slightly
         over 4000 files, deleting around 700 includes and adding ~480 gfp.h
         and ~3000 slab.h inclusions.  The script emitted errors for ~400
         files.
      
      2. Each error was manually checked.  Some didn't need the inclusion,
         some needed manual addition while adding it to implementation .h or
         embedding .c file was more appropriate for others.  This step added
         inclusions to around 150 files.
      
      3. The script was run again and the output was compared to the edits
         from #2 to make sure no file was left behind.
      
      4. Several build tests were done and a couple of problems were fixed.
         e.g. lib/decompress_*.c used malloc/free() wrappers around slab
         APIs requiring slab.h to be added manually.
      
      5. The script was run on all .h files but without automatically
         editing them as sprinkling gfp.h and slab.h inclusions around .h
         files could easily lead to inclusion dependency hell.  Most gfp.h
         inclusion directives were ignored as stuff from gfp.h was usually
         wildly available and often used in preprocessor macros.  Each
         slab.h inclusion directive was examined and added manually as
         necessary.
      
      6. percpu.h was updated not to include slab.h.
      
      7. Build test were done on the following configurations and failures
         were fixed.  CONFIG_GCOV_KERNEL was turned off for all tests (as my
         distributed build env didn't work with gcov compiles) and a few
         more options had to be turned off depending on archs to make things
         build (like ipr on powerpc/64 which failed due to missing writeq).
      
         * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
         * powerpc and powerpc64 SMP allmodconfig
         * sparc and sparc64 SMP allmodconfig
         * ia64 SMP allmodconfig
         * s390 SMP allmodconfig
         * alpha SMP allmodconfig
         * um on x86_64 SMP allmodconfig
      
      8. percpu.h modifications were reverted so that it could be applied as
         a separate patch and serve as bisection point.
      
      Given the fact that I had only a couple of failures from tests on step
      6, I'm fairly confident about the coverage of this conversion patch.
      If there is a breakage, it's likely to be something in one of the arch
      headers which should be easily discoverable easily on most builds of
      the specific arch.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Guess-its-ok-by: NChristoph Lameter <cl@linux-foundation.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      5a0e3ad6
  21. 12 3月, 2010 1 次提交
  22. 06 9月, 2009 1 次提交
  23. 03 9月, 2009 1 次提交
  24. 21 4月, 2009 1 次提交
  25. 22 1月, 2009 1 次提交
  26. 23 12月, 2008 1 次提交