1. 16 12月, 2014 5 次提交
    • S
      RDMA/cxgb4: Wake up waiters after flushing the qp · 5b341808
      Steve Wise 提交于
      When transitioning into ERROR state, the QP was getting flushed after
      waking up any waiters.  This can cause applications to miss flushed work
      requests which can stall an NFS mount.
      Signed-off-by: NSteve Wise <swise@opengridcomputing.com>
      Signed-off-by: NRoland Dreier <roland@purestorage.com>
      5b341808
    • H
      RDMA/cxgb4: Limit MRs to < 8GB for T4/T5 devices · 2550a88d
      Hariprasad Shenai 提交于
      T4/T5 hardware can't handle MRs >= 8GB due to a hardware bug.  So limit
      registrations to < 8GB for thse devices.
      
      Based on original work by Steve Wise <swise@opengridcomputing.com>.
      Signed-off-by: NSteve Wise <swise@opengridcomputing.com>
      Signed-off-by: NHariprasad Shenai <hariprasad@chelsio.com>
      Signed-off-by: NRoland Dreier <roland@purestorage.com>
      2550a88d
    • H
      RDMA/cxgb4: Fix locking issue in process_mpa_request · 10be6b48
      Hariprasad Shenai 提交于
      Fix the following lockdep report:
      
          =============================================
          [ INFO: possible recursive locking detected ]
          3.17.0+ #3 Tainted: G            E
          ---------------------------------------------
          kworker/u64:3/299 is trying to acquire lock:
           (&epc->mutex){+.+.+.}, at: [<ffffffffa074e07a>]
          process_mpa_request+0x1aa/0x3e0 [iw_cxgb4]
      
          but task is already holding lock:
           (&epc->mutex){+.+.+.}, at: [<ffffffffa074e34e>] rx_data+0x9e/0x1f0 [iw_cxgb4]
      
          other info that might help us debug this:
           Possible unsafe locking scenario:
      
                 CPU0
                 ----
            lock(&epc->mutex);
            lock(&epc->mutex);
      
           *** DEADLOCK ***
      
           May be due to missing lock nesting notation
      
          3 locks held by kworker/u64:3/299:
           #0:  ("%s""iw_cxgb4"){.+.+.+}, at: [<ffffffff8106f14d>]
          process_one_work+0x13d/0x4d0
           #1:  (skb_work){+.+.+.}, at: [<ffffffff8106f14d>] process_one_work+0x13d/0x4d0
           #2:  (&epc->mutex){+.+.+.}, at: [<ffffffffa074e34e>] rx_data+0x9e/0x1f0
          [iw_cxgb4]
      
          stack backtrace:
          CPU: 2 PID: 299 Comm: kworker/u64:3 Tainted: G            E  3.17.0+ #3
          Hardware name: Dell Inc. PowerEdge T110/0X744K, BIOS 1.2.1 01/28/2010
          Workqueue: iw_cxgb4 process_work [iw_cxgb4]
           ffff8800b91593d0 ffff8800b8a2f9f8 ffffffff815df107 0000000000000001
           ffff8800b9158750 ffff8800b8a2fa28 ffffffff8109f0e2 ffff8800bb768a00
           ffff8800b91593d0 ffff8800b9158750 0000000000000000 ffff8800b8a2fa88
          Call Trace:
           [<ffffffff815df107>] dump_stack+0x49/0x62
           [<ffffffff8109f0e2>] print_deadlock_bug+0xf2/0x100
           [<ffffffff810a0f04>] validate_chain+0x454/0x700
           [<ffffffff810a1574>] __lock_acquire+0x3c4/0x580
           [<ffffffffa074e07a>] ? process_mpa_request+0x1aa/0x3e0 [iw_cxgb4]
           [<ffffffff810a17cc>] lock_acquire+0x9c/0x110
           [<ffffffffa074e07a>] ? process_mpa_request+0x1aa/0x3e0 [iw_cxgb4]
           [<ffffffff815e111b>] mutex_lock_nested+0x4b/0x360
           [<ffffffffa074e07a>] ? process_mpa_request+0x1aa/0x3e0 [iw_cxgb4]
           [<ffffffff810c181a>] ? del_timer_sync+0xaa/0xd0
           [<ffffffff810c1770>] ? try_to_del_timer_sync+0x70/0x70
           [<ffffffffa074e07a>] process_mpa_request+0x1aa/0x3e0 [iw_cxgb4]
           [<ffffffffa074a3ec>] ? update_rx_credits+0xec/0x140 [iw_cxgb4]
           [<ffffffffa074e381>] rx_data+0xd1/0x1f0 [iw_cxgb4]
           [<ffffffff8109ff23>] ? mark_held_locks+0x73/0xa0
           [<ffffffff815e4b90>] ? _raw_spin_unlock_irqrestore+0x40/0x70
           [<ffffffff810a020d>] ? trace_hardirqs_on_caller+0xfd/0x1c0
           [<ffffffff810a02dd>] ? trace_hardirqs_on+0xd/0x10
           [<ffffffffa074c931>] process_work+0x51/0x80 [iw_cxgb4]
           [<ffffffff8106f1c8>] process_one_work+0x1b8/0x4d0
           [<ffffffff8106f14d>] ? process_one_work+0x13d/0x4d0
           [<ffffffff8106f600>] worker_thread+0x120/0x3c0
           [<ffffffff8106f4e0>] ? process_one_work+0x4d0/0x4d0
           [<ffffffff81074a0e>] kthread+0xde/0x100
           [<ffffffff815e4b40>] ? _raw_spin_unlock_irq+0x30/0x40
           [<ffffffff81074930>] ? __init_kthread_worker+0x70/0x70
           [<ffffffff815e512c>] ret_from_fork+0x7c/0xb0
           [<ffffffff81074930>] ? __init_kthread_worker+0x70/0x70
      
      Based on original work by Steve Wise <swise@opengridcomputing.com>.
      Signed-off-by: NSteve Wise <swise@opengridcomputing.com>
      Signed-off-by: NHariprasad Shenai <hariprasad@chelsio.com>
      Signed-off-by: NRoland Dreier <roland@purestorage.com>
      10be6b48
    • P
      RDMA/cxgb4: Configure 0B MRs to match HW implementation · 123bc2a2
      Pramod Kumar 提交于
      0B MRs need some tweaks to work correctly with HW. When writing the
      TPTE, if the MR length is zero we now:
      
      1) turn off all permissions
      2) set the length to -1
      
      While functionality/capabilities of the MR are the same with these
      changes, it resolves a dapltest 0B RDMA Read test failure.  Based on
      original work by Steve Wise <swise@opengridcomputing.com>.
      Signed-off-by: NPramod Kumar <pramod@chelsio.com>
      Signed-off-by: NSteve Wise <swise@opengridcomputing.com>
      Signed-off-by: NHariprasad Shenai <hariprasad@chelsio.com>
      Signed-off-by: NRoland Dreier <roland@purestorage.com>
      123bc2a2
    • P
      RDMA/cxgb4: Increase epd buff size for debug interface · 63a71ba6
      Pramod Kumar 提交于
      IPv6 address string lengths require increasing the buffer size for
      debugfs handlers.
      Signed-off-by: NPramod Kumar <pramod@chelsio.com>
      Signed-off-by: NRoland Dreier <roland@purestorage.com>
      63a71ba6
  2. 12 12月, 2014 15 次提交
    • M
      Fix race condition between vxlan_sock_add and vxlan_sock_release · 00c83b01
      Marcelo Leitner 提交于
      Currently, when trying to reuse a socket, vxlan_sock_add will grab
      vn->sock_lock, locate a reusable socket, inc refcount and release
      vn->sock_lock.
      
      But vxlan_sock_release() will first decrement refcount, and then grab
      that lock. refcnt operations are atomic but as currently we have
      deferred works which hold vs->refcnt each, this might happen, leading to
      a use after free (specially after vxlan_igmp_leave):
      
        CPU 1                            CPU 2
      
      deferred work                    vxlan_sock_add
        ...                              ...
                                         spin_lock(&vn->sock_lock)
                                         vs = vxlan_find_sock();
        vxlan_sock_release
          dec vs->refcnt, reaches 0
          spin_lock(&vn->sock_lock)
                                         vxlan_sock_hold(vs), refcnt=1
                                         spin_unlock(&vn->sock_lock)
          hlist_del_rcu(&vs->hlist);
          vxlan_notify_del_rx_port(vs)
          spin_unlock(&vn->sock_lock)
      
      So when we look for a reusable socket, we check if it wasn't freed
      already before reusing it.
      Signed-off-by: NMarcelo Ricardo Leitner <mleitner@redhat.com>
      Fixes: 7c47cedf ("vxlan: move IGMP join/leave to work queue")
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      00c83b01
    • C
    • M
      net/mlx4: Add support for A0 steering · 7d077cd3
      Matan Barak 提交于
      Add the required firmware commands for A0 steering and a way to enable
      that. The firmware support focuses on INIT_HCA, QUERY_HCA, QUERY_PORT,
      QUERY_DEV_CAP and QUERY_FUNC_CAP commands. Those commands are used
      to configure and query the device.
      
      The different A0 DMFS (steering) modes are:
      
      Static - optimized performance, but flow steering rules are
      limited. This mode should be choosed explicitly by the user
      in order to be used.
      
      Dynamic - this mode should be explicitly choosed by the user.
      In this mode, the FW works in optimized steering mode as long as
      it can and afterwards automatically drops to classic (full) DMFS.
      
      Disable - this mode should be explicitly choosed by the user.
      The user instructs the system not to use optimized steering, even if
      the FW supports Dynamic A0 DMFS (and thus will be able to use optimized
      steering in Default A0 DMFS mode).
      
      Default - this mode is implicitly choosed. In this mode, if the FW
      supports Dynamic A0 DMFS, it'll work in this mode. Otherwise, it'll
      work at Disable A0 DMFS mode.
      
      Under SRIOV configuration, when the A0 steering mode is enabled,
      older guest VF drivers who aren't using the RX QP allocation flag
      (MLX4_RESERVE_A0_QP) will get a QP from the general range and
      fail when attempting to register a steering rule. To avoid that,
      the PF context behaviour is changed once on A0 static mode, to
      require support for the allocation flag in VF drivers too.
      
      In order to enable A0 steering, we use log_num_mgm_entry_size param.
      If the value of the parameter is not positive, we treat the absolute
      value of log_num_mgm_entry_size as a bit field. Setting bit 2 of this
      bit field enables static A0 steering.
      Signed-off-by: NMatan Barak <matanb@mellanox.com>
      Signed-off-by: NOr Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7d077cd3
    • M
      net/mlx4: Refactor QUERY_PORT · 431df8c7
      Matan Barak 提交于
      Currently QUERY_PORT is done as a part of QUERY_DEV_CAP firmware command.
      
      Since we would like to use it without querying all device capabilities,
      extract this part to be a function of its own.
      Signed-off-by: NMatan Barak <matanb@mellanox.com>
      Signed-off-by: NOr Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      431df8c7
    • M
      net/mlx4_core: Add explicit error message when rule doesn't meet configuration · 579d059b
      Matan Barak 提交于
      When a given flow steering rule is invalid in respect to the current
      steering configuration, print the correct error message to the system log.
      Signed-off-by: NMatan Barak <matanb@mellanox.com>
      Signed-off-by: NOr Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      579d059b
    • M
      net/mlx4: Add A0 hybrid steering · d57febe1
      Matan Barak 提交于
      A0 hybrid steering is a form of high performance flow steering.
      By using this mode, mlx4 cards use a fast limited table based steering,
      in order to enable fast steering of unicast packets to a QP.
      
      In order to implement A0 hybrid steering we allocate resources
      from different zones:
      (1) General range
      (2) Special MAC-assigned QPs [RSS, Raw-Ethernet] each has its own region.
      
      When we create a rss QP or a raw ethernet (A0 steerable and BF ready) QP,
      we try hard to allocate the QP from range (2). Otherwise, we try hard not
      to allocate from this  range. However, when the system is pushed to its
      limits and one needs every resource, the allocator uses every region it can.
      
      Meaning, when we run out of raw-eth qps, the allocator allocates from the
      general range (and the special-A0 area is no longer active). If we run out
      of RSS qps, the mechanism tries to allocate from the raw-eth QP zone. If that
      is also exhausted, the allocator will allocate from the general range
      (and the A0 region is no longer active).
      
      Note that if a raw-eth qp is allocated from the general range, it attempts
      to allocate the range such that bits 6 and 7 (blueflame bits) in the
      QP number are not set.
      
      When the feature is used in SRIOV, the VF has to notify the PF what
      kind of QP attributes it needs. In order to do that, along with the
      "Eth QP blueflame" bit, we reserve a new "A0 steerable QP". According
      to the combination of these bits, the PF tries to allocate a suitable QP.
      
      In order to maintain backward compatibility (with older PFs), the PF
      notifies which QP attributes it supports via QUERY_FUNC_CAP command.
      Signed-off-by: NMatan Barak <matanb@mellanox.com>
      Signed-off-by: NOr Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d57febe1
    • M
      net/mlx4: Add mlx4_bitmap zone allocator · 7a89399f
      Matan Barak 提交于
      The zone allocator is a mechanism which manages a few mlx4_bitmaps.
      
      When allocating a resource, the user indicates the desired zone of
      which this resource will be allocated from. If possible, the resource
      will be allocated from this zone. Otherwise, the resource will be
      allocated from a less-than, equal-to, higher-than priority zone,
      according to the desired zone's properties with that respective
      allocation order.
      Signed-off-by: NMatan Barak <matanb@mellanox.com>
      Signed-off-by: NOr Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7a89399f
    • D
      net/mlx4: Add a check if there are too many reserved QPs · ab256e5a
      Dotan Barak 提交于
      The number of reserved QPs is affected both from the firmware and
      from the driver's requirements. This patch adds a check that
      validates that this number is indeed feasable.
      Signed-off-by: NDotan Barak <dotanb@dev.mellanox.co.il>
      Signed-off-by: NMatan Barak <matanb@mellanox.com>
      Signed-off-by: NOr Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ab256e5a
    • E
      net/mlx4: Change QP allocation scheme · ddae0349
      Eugenia Emantayev 提交于
      When using BF (Blue-Flame), the QPN overrides the VLAN, CV, and SV fields
      in the WQE. Thus, BF may only be used for QPNs with bits 6,7 unset.
      
      The current Ethernet driver code reserves a Tx QP range with 256b alignment.
      
      This is wrong because if there are more than 64 Tx QPs in use,
      QPNs >= base + 65 will have bits 6/7 set.
      
      This problem is not specific for the Ethernet driver, any entity that
      tries to reserve more than 64 BF-enabled QPs should fail. Also, using
      ranges is not necessary here and is wasteful.
      
      The new mechanism introduced here will support reservation for
      "Eth QPs eligible for BF" for all drivers: bare-metal, multi-PF, and VFs
      (when hypervisors support WC in VMs). The flow we use is:
      
      1. In mlx4_en, allocate Tx QPs one by one instead of a range allocation,
         and request "BF enabled QPs" if BF is supported for the function
      
      2. In the ALLOC_RES FW command, change param1 to:
      a. param1[23:0]  - number of QPs
      b. param1[31-24] - flags controlling QPs reservation
      
      Bit 31 refers to Eth blueflame supported QPs. Those QPs must have
      bits 6 and 7 unset in order to be used in Ethernet.
      
      Bits 24-30 of the flags are currently reserved.
      
      When a function tries to allocate a QP, it states the required attributes
      for this QP. Those attributes are considered "best-effort". If an attribute,
      such as Ethernet BF enabled QP, is a must-have attribute, the function has
      to check that attribute is supported before trying to do the allocation.
      
      In a lower layer of the code, mlx4_qp_reserve_range masks out the bits
      which are unsupported. If SRIOV is used, the PF validates those attributes
      and masks out unsupported attributes as well. In order to notify VFs which
      attributes are supported, the VF uses QUERY_FUNC_CAP command. This command's
      mailbox is filled by the PF, which notifies which QP allocation attributes
      it supports.
      Signed-off-by: NEugenia Emantayev <eugenia@mellanox.co.il>
      Signed-off-by: NMatan Barak <matanb@mellanox.com>
      Signed-off-by: NOr Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ddae0349
    • M
      net/mlx4_core: Use tasklet for user-space CQ completion events · 3dca0f42
      Matan Barak 提交于
      Previously, we've fired all our completion callbacks straight from our ISR.
      
      Some of those callbacks were lightweight (for example, mlx4_en's and
      IPoIB napi callbacks), but some of them did more work (for example,
      the user-space RDMA stack uverbs' completion handler). Besides that,
      doing more than the minimal work in ISR is generally considered wrong,
      it could even lead to a hard lockup of the system. Since when a lot
      of completion events are generated by the hardware, the loop over those
      events could be so long, that we'll get into a hard lockup by the system
      watchdog.
      
      In order to avoid that, add a new way of invoking completion events
      callbacks. In the interrupt itself, we add the CQs which receive completion
      event to a per-EQ list and schedule a tasklet. In the tasklet context
      we loop over all the CQs in the list and invoke the user callback.
      Signed-off-by: NMatan Barak <matanb@mellanox.com>
      Signed-off-by: NOr Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3dca0f42
    • O
      net/mlx4_core: Mask out host side virtualization features for guests · 383677da
      Or Gerlitz 提交于
      When VFs (guests in this context) issue the QUERY_DEV_CAP command, they
      need not be told that host side virtualization features such as VST, FSM
      (MAC anti-spoofing) and running > 80 VFs are supported by the device.
      Signed-off-by: NOr Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      383677da
    • O
      net/mlx4_en: Set csum level for encapsulated packets · c58942f2
      Or Gerlitz 提交于
      This was dropped by mistake for the napi_gro_frags flow, fix that.
      
      Fixes: dd65beac ('net/mlx4_en: Extend usage of napi_gro_frags')
      Signed-off-by: NOr Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c58942f2
    • S
      be2net: Export tunnel offloads only when a VxLAN tunnel is created · 630f4b70
      Sriharsha Basavapatna 提交于
      The encapsulated offload flags shouldn't be unconditionally exported
      to the stack. The stack expects offloading to work across all tunnel
      types when those flags are set. This would break other tunnels (like
      GRE) since be2net currently supports tunnel offload for VxLAN only.
      
      Also, with VxLANs Skyhawk-R can offload only 1 UDP dport. If more
      than 1 UDP port is added, we should disable offloads in that case too.
      Signed-off-by: NSriharsha Basavapatna <sriharsha.basavapatna@emulex.com>
      Signed-off-by: NSathya Perla <sathya.perla@emulex.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      630f4b70
    • K
      gianfar: Fix dma check map error when DMA_API_DEBUG is enabled · 0a4b5a24
      Kevin Hao 提交于
      We need to use dma_mapping_error() to check the dma address returned
      by dma_map_single/page(). Otherwise we would get warning like this:
        WARNING: at lib/dma-debug.c:1140
        Modules linked in:
        CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.18.0-rc2-next-20141029 #196
        task: c0834300 ti: effe6000 task.ti: c0874000
        NIP: c02b2c98 LR: c02b2c98 CTR: c030abc4
        REGS: effe7d70 TRAP: 0700   Not tainted  (3.18.0-rc2-next-20141029)
        MSR: 00021000 <CE,ME>  CR: 22044022  XER: 20000000
      
        GPR00: c02b2c98 effe7e20 c0834300 00000098 00021000 00000000 c030b898 00000003
        GPR08: 00000001 00000000 00000001 749eec9d 22044022 1001abe0 00000020 ef278678
        GPR16: ef278670 ef278668 ef278660 070a8040 c087f99c c08cdc60 00029000 c0840d44
        GPR24: c08be6e8 c0840000 effe7e78 ef041340 00000600 ef114e10 00000000 c08be6e0
        NIP [c02b2c98] check_unmap+0x51c/0x9e4
        LR [c02b2c98] check_unmap+0x51c/0x9e4
        Call Trace:
        [effe7e20] [c02b2c98] check_unmap+0x51c/0x9e4 (unreliable)
        [effe7e70] [c02b31d8] debug_dma_unmap_page+0x78/0x8c
        [effe7ed0] [c03d1640] gfar_clean_rx_ring+0x208/0x488
        [effe7f40] [c03d1a9c] gfar_poll_rx_sq+0x3c/0xa8
        [effe7f60] [c04f8714] net_rx_action+0xc0/0x178
        [effe7f90] [c00435a0] __do_softirq+0x100/0x1fc
        [effe7fe0] [c0043958] irq_exit+0xa4/0xc8
        [effe7ff0] [c000d14c] call_do_irq+0x24/0x3c
        [c0875e90] [c00048a0] do_IRQ+0x8c/0xf8
        [c0875eb0] [c000ed10] ret_from_except+0x0/0x18
      
      For TX, we need to unmap the pages which has already been mapped and
      free the skb before return.
      
      For RX, move the dma mapping and error check to gfar_new_skb(). We
      would reuse the original skb in the rx ring when either allocating
      skb failure or dma mapping error.
      Signed-off-by: NKevin Hao <haokexin@gmail.com>
      Signed-off-by: NClaudiu Manoil <claudiu.manoil@freescale.com>
      Reviewed-by: NClaudiu Manoil <claudiu.manoil@freescale.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0a4b5a24
    • H
      cxgb4/csiostor: Don't use MASTER_MUST for fw_hello call · 666224d4
      Hariprasad Shenai 提交于
      Remove use of calls into t4_fw_hello() with MASTER_MUST, which results in
      FW_HELLO_CMD_MASTERFORCE being set. The firmware doesn't support this and of
      course any existing PF Drivers will totally go for a toss.
      Signed-off-by: NHariprasad Shenai <hariprasad@chelsio.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      666224d4
  3. 11 12月, 2014 20 次提交