1. 30 10月, 2016 2 次提交
  2. 24 9月, 2016 3 次提交
  3. 12 9月, 2016 1 次提交
  4. 07 9月, 2016 1 次提交
    • B
      net/mlx4_en: protect ring->xdp_prog with rcu_read_lock · 326fe02d
      Brenden Blanco 提交于
      Depending on the preempt mode, the bpf_prog stored in xdp_prog may be
      freed despite the use of call_rcu inside bpf_prog_put. The situation is
      possible when running in PREEMPT_RCU=y mode, for instance, since the rcu
      callback for destroying the bpf prog can run even during the bh handling
      in the mlx4 rx path.
      
      Several options were considered before this patch was settled on:
      
      Add a napi_synchronize loop in mlx4_xdp_set, which would occur after all
      of the rings are updated with the new program.
      This approach has the disadvantage that as the number of rings
      increases, the speed of update will slow down significantly due to
      napi_synchronize's msleep(1).
      
      Add a new rcu_head in bpf_prog_aux, to be used by a new bpf_prog_put_bh.
      The action of the bpf_prog_put_bh would be to then call bpf_prog_put
      later. Those drivers that consume a bpf prog in a bh context (like mlx4)
      would then use the bpf_prog_put_bh instead when the ring is up. This has
      the problem of complexity, in maintaining proper refcnts and rcu lists,
      and would likely be harder to review. In addition, this approach to
      freeing must be exclusive with other frees of the bpf prog, for instance
      a _bh prog must not be referenced from a prog array that is consumed by
      a non-_bh prog.
      
      The placement of rcu_read_lock in this patch is functionally the same as
      putting an rcu_read_lock in napi_poll. Actually doing so could be a
      potentially controversial change, but would bring the implementation in
      line with sk_busy_loop (though of course the nature of those two paths
      is substantially different), and would also avoid future copy/paste
      problems with future supporters of XDP. Still, this patch does not take
      that opinionated option.
      
      Testing was done with kernels in either PREEMPT_RCU=y or
      CONFIG_PREEMPT_VOLUNTARY=y+PREEMPT_RCU=n modes, with neither exhibiting
      any drawback. With PREEMPT_RCU=n, the extra call to rcu_read_lock did
      not show up in the perf report whatsoever, and with PREEMPT_RCU=y the
      overhead of rcu_read_lock (according to perf) was the same before/after.
      In the rx path, rcu_read_lock is eventually called for every packet
      from netif_receive_skb_internal, so the napi poll call's rcu_read_lock
      is easily amortized.
      
      v2:
      Remove extra rcu_read_lock in mlx4_en_process_rx_cq body
      Annotate xdp_prog with __rcu, and convert all usages to rcu_assign or
      rcu_dereference[_protected] as appropriate.
      Add explicit mutex lock around rcu_assign instead of xchg loop.
      
      Fixes: d576acf0 ("net/mlx4_en: add page recycle to prepare rx ring for tx support")
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <alexei.starovoitov@gmail.com>
      Signed-off-by: NBrenden Blanco <bblanco@plumgrid.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      326fe02d
  5. 20 7月, 2016 5 次提交
    • B
      net/mlx4_en: add xdp forwarding and data write support · 9ecc2d86
      Brenden Blanco 提交于
      A user will now be able to loop packets back out of the same port using
      a bpf program attached to xdp hook. Updates to the packet contents from
      the bpf program is also supported.
      
      For the packet write feature to work, the rx buffers are now mapped as
      bidirectional when the page is allocated. This occurs only when the xdp
      hook is active.
      
      When the program returns a TX action, enqueue the packet directly to a
      dedicated tx ring, so as to avoid completely any locking. This requires
      the tx ring to be allocated 1:1 for each rx ring, as well as the tx
      completion running in the same softirq.
      
      Upon tx completion, this dedicated tx ring recycles pages without
      unmapping directly back to the original rx ring. In steady state tx/drop
      workload, effectively 0 page allocs/frees will occur.
      
      In order to separate out the paths between free and recycle, a
      free_tx_desc func pointer is introduced that is optionally updated
      whenever recycle_ring is activated. By default the original free
      function is always initialized.
      Signed-off-by: NBrenden Blanco <bblanco@plumgrid.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9ecc2d86
    • B
      net/mlx4_en: add page recycle to prepare rx ring for tx support · d576acf0
      Brenden Blanco 提交于
      The mlx4 driver by default allocates order-3 pages for the ring to
      consume in multiple fragments. When the device has an xdp program, this
      behavior will prevent tx actions since the page must be re-mapped in
      TODEVICE mode, which cannot be done if the page is still shared.
      
      Start by making the allocator configurable based on whether xdp is
      running, such that order-0 pages are always used and never shared.
      
      Since this will stress the page allocator, add a simple page cache to
      each rx ring. Pages in the cache are left dma-mapped, and in drop-only
      stress tests the page allocator is eliminated from the perf report.
      
      Note that setting an xdp program will now require the rings to be
      reconfigured.
      
      Before:
       26.91%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_process_rx_cq
       17.88%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_alloc_frags
        6.00%  ksoftirqd/0  [mlx4_en]         [k] mlx4_en_free_frag
        4.49%  ksoftirqd/0  [kernel.vmlinux]  [k] get_page_from_freelist
        3.21%  swapper      [kernel.vmlinux]  [k] intel_idle
        2.73%  ksoftirqd/0  [kernel.vmlinux]  [k] bpf_map_lookup_elem
        2.57%  swapper      [mlx4_en]         [k] mlx4_en_process_rx_cq
      
      After:
       31.72%  swapper      [kernel.vmlinux]       [k] intel_idle
        8.79%  swapper      [mlx4_en]              [k] mlx4_en_process_rx_cq
        7.54%  swapper      [kernel.vmlinux]       [k] poll_idle
        6.36%  swapper      [mlx4_core]            [k] mlx4_eq_int
        4.21%  swapper      [kernel.vmlinux]       [k] tasklet_action
        4.03%  swapper      [kernel.vmlinux]       [k] cpuidle_enter_state
        3.43%  swapper      [mlx4_en]              [k] mlx4_en_prepare_rx_desc
        2.18%  swapper      [kernel.vmlinux]       [k] native_irq_return_iret
        1.37%  swapper      [kernel.vmlinux]       [k] menu_select
        1.09%  swapper      [kernel.vmlinux]       [k] bpf_map_lookup_elem
      Signed-off-by: NBrenden Blanco <bblanco@plumgrid.com>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d576acf0
    • B
      net/mlx4_en: add support for fast rx drop bpf program · 47a38e15
      Brenden Blanco 提交于
      Add support for the BPF_PROG_TYPE_XDP hook in mlx4 driver.
      
      In tc/socket bpf programs, helpers linearize skb fragments as needed
      when the program touches the packet data. However, in the pursuit of
      speed, XDP programs will not be allowed to use these slower functions,
      especially if it involves allocating an skb.
      
      Therefore, disallow MTU settings that would produce a multi-fragment
      packet that XDP programs would fail to access. Future enhancements could
      be done to increase the allowable MTU.
      
      The xdp program is present as a per-ring data structure, but as of yet
      it is not possible to set at that granularity through any ndo.
      Signed-off-by: NBrenden Blanco <bblanco@plumgrid.com>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      47a38e15
    • E
      net/mlx4_en: Add resilience in low memory systems · ec25bc04
      Eugenia Emantayev 提交于
      This patch fixes the lost of Ethernet port on low memory system,
      when driver frees its resources and fails to allocate new resources.
      Issue could happen while changing number of channels, rings size or
      changing the timestamp configuration.
      This fix is necessary because of removing vmap use in the code.
      When vmap was in use driver could allocate non-contiguous memory
      and make it contiguous with vmap. Now it could fail to allocate
      a large chunk of contiguous memory and lose the port.
      Current code tries to allocate new resources and then upon success
      frees the old resources.
      
      Fixes: 73898db0 ('net/mlx4: Avoid wrong virtual mappings')
      Signed-off-by: NEugenia Emantayev <eugenia@mellanox.com>
      Signed-off-by: NTariq Toukan <tariqt@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ec25bc04
    • E
      net/mlx4_en: Move filters cleanup to a proper location · 30f56e3c
      Eugenia Emantayev 提交于
      Filters cleanup should be done once before destroying net device,
      since filters list is contained in the private data.
      
      Fixes: 1eb8c695 ('net/mlx4_en: Add accelerated RFS support')
      Signed-off-by: NEugenia Emantayev <eugenia@mellanox.com>
      Signed-off-by: NTariq Toukan <tariqt@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      30f56e3c
  6. 24 6月, 2016 1 次提交
  7. 23 6月, 2016 2 次提交
  8. 18 6月, 2016 1 次提交
  9. 17 6月, 2016 1 次提交
  10. 10 6月, 2016 1 次提交
  11. 26 5月, 2016 4 次提交
  12. 06 5月, 2016 1 次提交
    • H
      net/mlx4: Avoid wrong virtual mappings · 73898db0
      Haggai Abramovsky 提交于
      The dma_alloc_coherent() function returns a virtual address which can
      be used for coherent access to the underlying memory.  On some
      architectures, like arm64, undefined behavior results if this memory is
      also accessed via virtual mappings that are not coherent.  Because of
      their undefined nature, operations like virt_to_page() return garbage
      when passed virtual addresses obtained from dma_alloc_coherent().  Any
      subsequent mappings via vmap() of the garbage page values are unusable
      and result in bad things like bus errors (synchronous aborts in ARM64
      speak).
      
      The mlx4 driver contains code that does the equivalent of:
      vmap(virt_to_page(dma_alloc_coherent)), this results in an OOPs when the
      device is opened.
      
      Prevent Ethernet driver to run this problematic code by forcing it to
      allocate contiguous memory. As for the Infiniband driver, at first we
      are trying to allocate contiguous memory, but in case of failure roll
      back to work with fragmented memory.
      Signed-off-by: NHaggai Abramovsky <hagaya@mellanox.com>
      Signed-off-by: NYishai Hadas <yishaih@mellanox.com>
      Reported-by: NDavid Daney <david.daney@cavium.com>
      Tested-by: NSinan Kaya <okaya@codeaurora.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      73898db0
  13. 05 5月, 2016 2 次提交
  14. 22 4月, 2016 1 次提交
  15. 04 3月, 2016 1 次提交
    • J
      net: relax setup_tc ndo op handle restriction · 5eb4dce3
      John Fastabend 提交于
      I added this check in setup_tc to multiple drivers,
      
       if (handle != TC_H_ROOT || tc->type != TC_SETUP_MQPRIO)
      
      Unfortunately restricting to TC_H_ROOT like this breaks the old
      instantiation of mqprio to setup a hardware qdisc. This patch
      relaxes the test to only check the type to make it equivalent
      to the check before I broke it. With this the old instantiation
      continues to work.
      
      A good smoke test is to setup mqprio with,
      
      # tc qdisc add dev eth4 root mqprio num_tc 8 \
        map 0 1 2 3 4 5 6 7 \
        queues 0@0 1@1 2@2 3@3 4@4 5@5 6@6 7@7
      
      Fixes: e4c6734e ("net: rework ndo tc op to consume additional qdisc handle paramete")
      Reported-by: NSingh Krishneil <krishneil.k.singh@intel.com>
      Reported-by: NJake Keller <jacob.e.keller@intel.com>
      CC: Murali Karicheri <m-karicheri2@ti.com>
      CC: Shradha Shah <sshah@solarflare.com>
      CC: Or Gerlitz <ogerlitz@mellanox.com>
      CC: Ariel Elior <ariel.elior@qlogic.com>
      CC: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      CC: Bruce Allan <bruce.w.allan@intel.com>
      CC: Jesse Brandeburg <jesse.brandeburg@intel.com>
      CC: Don Skidmore <donald.c.skidmore@intel.com>
      Signed-off-by: NJohn Fastabend <john.r.fastabend@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5eb4dce3
  16. 03 3月, 2016 1 次提交
    • J
      net/mlx4_core: Allow resetting VF admin mac to zero · 6e522422
      Jack Morgenstein 提交于
      The VF administrative mac addresses (stored in the PF driver) are
      initialized to zero when the PF driver starts up.
      
      These addresses may be modified in the PF driver through ndo calls
      initiated by iproute2 or libvirt.
      
      While we allow the PF/host to change the VF admin mac address from zero
      to a valid unicast mac, we do not allow restoring the VF admin mac to
      zero. We currently only allow changing this mac to a different unicast mac.
      
      This leads to problems when libvirt scripts are used to deal with
      VF mac addresses, and libvirt attempts to revoke the mac so this
      host will not use it anymore.
      
      Fix this by allowing resetting a VF administrative MAC back to zero.
      
      Fixes: 8f7ba3ca ('net/mlx4: Add set VF mac address support')
      Signed-off-by: NJack Morgenstein <jackm@dev.mellanox.co.il>
      Reported-by: NMoshe Levi <moshele@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6e522422
  17. 02 3月, 2016 1 次提交
  18. 17 2月, 2016 3 次提交
  19. 19 12月, 2015 2 次提交
  20. 19 11月, 2015 1 次提交
    • E
      mlx4: remove mlx4_en_low_latency_recv() · 868fdb06
      Eric Dumazet 提交于
      Busy polling can now be handled in generic NAPI poll infrastructure.
      This removes complexity and fast path overhead :
      
      mlx4 used two spin_lock()/spin_unlock() pair per napi->poll() call
      in mlx4_en_cq_lock_napi()/mlx4_en_cq_unlock_napi()
      
      Tested:
      
      Without busy polling :
      
      lpaa23:~# echo 0 >/proc/sys/net/core/busy_read
      lpaa24:~# echo 0 >/proc/sys/net/core/busy_read
      lpaa23:~# ./netperf -H lpaa24 -t TCP_RR
      MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to lpaa24.prod.google.com () port 0 AF_INET : first burst 0
      Local /Remote
      Socket Size   Request  Resp.   Elapsed  Trans.
      Send   Recv   Size     Size    Time     Rate
      bytes  Bytes  bytes    bytes   secs.    per sec
      
      16384  87380  1        1       10.00    47330.78
      
      With busy polling :
      
      lpaa23:~# echo 70 >/proc/sys/net/core/busy_read
      lpaa24:~# echo 70 >/proc/sys/net/core/busy_read
      lpaa23:~# ./netperf -H lpaa24 -t TCP_RR
      MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to lpaa24.prod.google.com () port 0 AF_INET : first burst 0
      Local /Remote
      Socket Size   Request  Resp.   Elapsed  Trans.
      Send   Recv   Size     Size    Time     Rate
      bytes  Bytes  bytes    bytes   secs.    per sec
      
      16384  87380  1        1       10.00    97643.55
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      868fdb06
  21. 15 10月, 2015 1 次提交
    • J
      net/mlx4_core: Replace VF zero mac with random mac in mlx4_core · 2b3ddf27
      Jack Morgenstein 提交于
      By design, when no default MAC addresses are set in the Hypervisor for VFs,
      the VFs are passed zero-macs. When such a MAC is received by the VF, it
      generates a random MAC address and registers that MAC address
      with the Hypervisor.
      
      This random mac generation is currently done in the mlx4_en module.
      There is a problem, though, if the mlx4_ib module is loaded by a VF before
      the mlx4_en module. In this case, for RoCE, mlx4_ib will see the un-replaced
      zero-mac and register that zero-mac as part of QP1 initialization.
      
      Having a zero-mac in the port's MAC table creates problems for a
      Baseboard Management Console. The BMC occasionally sends packets with a
      zero-mac destination MAC. If there is a zero-mac present in the port's
      MAC table, the FW will send such BMC packets to the host driver rather than
      to the wire, and BMC will stop working.
      
      To address this problem, we move the replacement of zero-mac addresses
      with random-mac addresses to procedure mlx4_slave_cap(), which is part of the
      driver startup for VFs, and is before activation of mlx4_ib and mlx4_en.
      As a result, zero-mac addresses will never be registered in the port MAC table
      by the driver.
      
      In addition, when mlx4_en does initialize the net device, it needs to set
      the NET_ADDR_RANDOM flag in the netdev structure if the address was
      randomly generated. This is done so that udev on the VM does not create
      a new device name after each VF probe (VM boot and such). To accomplish this,
      we add a per-port flag in mlx4_dev which gets set whenever mlx4_core replaces
      a zero-mac with a randomly-generated mac. This flag is examined when mlx4_en
      initializes the net-device.
      
      Fix was suggested by Matan Barak <matanb@mellanox.com>
      Signed-off-by: NJack Morgenstein <jackm@dev.mellanox.co.il>
      Signed-off-by: NOr Gerlitz <ogerlitz@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2b3ddf27
  22. 09 10月, 2015 1 次提交
  23. 28 7月, 2015 1 次提交
    • H
      net/mlx4_en: Add support for hardware accelerated 802.1ad vlan · e38af4fa
      Hadar Hen Zion 提交于
      To enable device support in accelerated 802.1ad vlan, the port
      capability "packet has vlan enable" (phv_en) should be set.
      Firmware won't work properly, in case phv_en is not set.
      
      The user can enable "phv_en" port capability with the new ethtool
      private flag phv-bit. The phv-bit private flag default value is OFF,
      users who are interested in 802.1ad hardware acceleration should turn ON
      the phv-bit private flag:
      $ ethtool --set-priv-flags eth1 phv-bit on
      
      Once the private flag is set, the device is ready for 802.1ad vlan
      acceleration.
      
      The user should also change the interface device features and turn on
      "tx-vlan-stag-hw-insert" which is off by default:
      $ ethtool -K eth1  tx-vlan-stag-hw-insert on
      
      "phv-bit" private flag setting is available only for Physical
      Functions(PF), the Virtual Function (VF) will be able to use the feature
      by setting "tx-vlan-stag-hw-insert" ethtool device feature only if the
      feature was enabled by the Hypervisor.
      Signed-off-by: NHadar Hen Zion <hadarh@mellanox.com>
      Signed-off-by: NAmir Vadai <amirv@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e38af4fa
  24. 25 6月, 2015 1 次提交
  25. 16 6月, 2015 1 次提交