1. 12 Oct 2017, 1 commit
  2. 11 Oct 2017, 1 commit
  3. 27 Sep 2017, 1 commit
    • bpf: add meta pointer for direct access · de8f3a83
      By Daniel Borkmann
      This work enables generic transfer of metadata from XDP into skb. The
      basic idea is that we can make use of the fact that the resulting skb
      must be linear and already comes with a larger headroom for supporting
      bpf_xdp_adjust_head(), which mangles xdp->data. Here, we base our work
      on a similar principle and introduce a small helper bpf_xdp_adjust_meta()
      for adjusting a new pointer called xdp->data_meta. Thus, the packet has
      a flexible and programmable room for meta data, followed by the actual
      packet data. struct xdp_buff is therefore laid out such that we first point
      to data_hard_start, then data_meta directly prepended to data followed
      by data_end marking the end of packet. bpf_xdp_adjust_head() takes into
      account whether we have meta data already prepended and if so, memmove()s
      this along with the given offset provided there's enough room.
      
      xdp->data_meta is optional and programs are not required to use it. The
      rationale is that when we process the packet in XDP (e.g. as DoS filter),
      we can push further meta data along with it for the XDP_PASS case, and
      give the guarantee that a clsact ingress BPF program on the same device
      can pick this up for further post-processing. Since we work with skb
      there, we can also set skb->mark, skb->priority or other skb meta data
      out of BPF, thus having this scratch space generic and programmable
      allows for more flexibility than defining a direct 1:1 transfer of
      potentially new XDP members into skb (it's also more efficient as we
      don't need to initialize/handle each of such new members). The facility
      also works together with GRO aggregation. The scratch space at the head
      of the packet can be a multiple of 4 bytes, up to 32 bytes in size. Drivers not
      yet supporting xdp->data_meta can simply be set up with xdp->data_meta
      as xdp->data + 1, since bpf_xdp_adjust_meta() will detect this and bail out,
      such that the subsequent match against xdp->data for later access is
      guaranteed to fail.
      
      The verifier treats xdp->data_meta/xdp->data the same way as we treat
      xdp->data/xdp->data_end pointer comparisons. The requirement for doing
      the compare against xdp->data is that it hasn't been modified from its
      original address we got from ctx access. It may have a range marking
      already from prior successful xdp->data/xdp->data_end pointer comparisons
      though.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      de8f3a83
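The constraints described above (metadata sits directly before xdp->data, stays a multiple of 4 bytes, is capped at 32 bytes, and must fit inside the headroom) can be sketched as a small userspace model. This is an illustration only, not the kernel helper: pointers are modeled as integer offsets and all names are made up for the example.

```python
XDP_META_MAX = 32  # per the commit text: at most 32 bytes of metadata

def xdp_adjust_meta(xdp, offset):
    """Model of the bpf_xdp_adjust_meta() checks: a negative offset
    grows the metadata area by moving data_meta down toward
    data_hard_start, memmove of existing metadata omitted."""
    meta = xdp["data_meta"] + offset
    metalen = xdp["data"] - meta
    if metalen < 0 or metalen % 4 or metalen > XDP_META_MAX:
        return -1                        # misaligned or oversized
    if meta < xdp["data_hard_start"]:
        return -1                        # would run out of headroom
    xdp["data_meta"] = meta
    return 0

# A driver without data_meta support sets data_meta = data + 1, which
# makes metalen negative so the helper always bails out, as described.
xdp = {"data_hard_start": 0, "data_meta": 64, "data": 64, "data_end": 192}
assert xdp_adjust_meta(xdp, -8) == 0     # reserve 8 bytes of metadata
assert xdp["data"] - xdp["data_meta"] == 8
assert xdp_adjust_meta(xdp, -3) == -1    # not a multiple of 4
assert xdp_adjust_meta(xdp, -64) == -1   # exceeds the 32-byte cap
```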
  4. 17 Aug 2017, 1 commit
  5. 09 Aug 2017, 1 commit
  6. 24 Jul 2017, 1 commit
  7. 18 Jul 2017, 1 commit
  8. 16 Jun 2017, 4 commits
    • net/mlx4_en: Poll XDP TX completion queue in RX NAPI · 6c78511b
      By Tariq Toukan
      Instead of having their own NAPIs, XDP TX completion queues get
      polled within the corresponding RX NAPI.
      This prevents any possible race on the TX ring prod/cons indices
      between the context that issues the transmits (the RX NAPI) and
      the context that handles the completions (previously a separate
      NAPI).
      
      This also improves performance, as it decreases the number
      of NAPIs running on a CPU, saving the overhead of syncing
      and switching between the contexts.
      
      Performance tests:
      Tested on ConnectX3Pro, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
      Single queue no-RSS optimization ON.
      
      XDP_TX packet rate:
      -------------------------------------
           | Before    | After     | Gain |
      IPv4 | 12.0 Mpps | 13.8 Mpps |  15% |
      IPv6 | 12.0 Mpps | 13.8 Mpps |  15% |
      -------------------------------------
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
      Cc: kernel-team@fb.com
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6c78511b
    • net/mlx4_en: Improve XDP xmit function · 36ea7964
      By Tariq Toukan
      Several performance improvements in XDP TX datapath,
      including:
      - Ring a single doorbell for the XDP TX ring per NAPI budget,
        instead of once per a lower packet threshold (previously 8).
        This includes removing the flow of immediate doorbell ringing
        in case of a full TX ring.
      - Compiler branch predictor hints.
      - Calculate values at compile time rather than at runtime.
      
      Performance tests:
      Tested on ConnectX3Pro, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
      Single queue no-RSS optimization ON.
      
      XDP_TX packet rate:
      -------------------------------------
           | Before    | After     | Gain |
      IPv4 | 10.3 Mpps | 12.0 Mpps |  17% |
      IPv6 | 10.3 Mpps | 12.0 Mpps |  17% |
      -------------------------------------
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
      Cc: kernel-team@fb.com
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      36ea7964
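The doorbell batching above can be sketched with a small model. This is purely illustrative (the real driver posts hardware descriptors and rings the doorbell via a device register write; all names here are hypothetical):

```python
def xmit_xdp_napi(ring, packets, budget):
    """Post up to `budget` packets, then ring the doorbell once at the
    end of the NAPI cycle, instead of once per 8-packet threshold and
    with no immediate doorbell when the ring fills up."""
    sent = 0
    for pkt in packets[:budget]:
        if ring["prod"] - ring["cons"] >= ring["size"]:
            break                      # ring full: stop posting
        ring["descs"].append(pkt)
        ring["prod"] += 1
        sent += 1
    if sent:
        ring["doorbells"] += 1         # single doorbell per NAPI budget
    return sent

ring = {"prod": 0, "cons": 0, "size": 64, "descs": [], "doorbells": 0}
assert xmit_xdp_napi(ring, list(range(100)), 64) == 64
assert ring["doorbells"] == 1          # one doorbell, not one per 8 packets
```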
    • net/mlx4_en: Improve receive data-path · 9bcee89a
      By Tariq Toukan
      Several small performance improvements in RX datapath,
      including:
      - Compiler branch predictor hints.
      - Replace a multiplication with a shift operation.
      - Minimize variable scope.
      - Write-prefetch for packet header.
      - Avoid the ternary operator ("?") when the value can be preset in
        a matching branch.
      - Save a branch by updating RX ring doorbell within
        mlx4_en_refill_rx_buffers(), which now returns void.
      
      Performance tests:
      Tested on ConnectX3Pro, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
      Single queue no-RSS optimization ON
      (enable by ethtool -L <interface> rx 1).
      
      XDP_DROP packet rate:
      Same (28.1 Mpps), lower CPU utilization (from ~100% to ~92%).
      
      Drop packets in TC:
      -------------------------------------
           | Before    | After     | Gain |
      IPv4 | 4.14 Mpps | 4.18 Mpps |   1% |
      -------------------------------------
      
      XDP_TX packet rate:
      -------------------------------------
           | Before    | After     | Gain |
      IPv4 | 10.1 Mpps | 10.3 Mpps |   2% |
      IPv6 | 10.1 Mpps | 10.3 Mpps |   2% |
      -------------------------------------
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
      Cc: kernel-team@fb.com
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9bcee89a
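As an illustration of the multiply-to-shift item in the list above (the stride value here is made up; the real driver derives it from the RX ring's descriptor stride):

```python
LOG_STRIDE = 6                      # hypothetical: 64-byte descriptor stride

def desc_offset_mul(i):
    return i * (1 << LOG_STRIDE)    # before: runtime multiply

def desc_offset_shift(i):
    return i << LOG_STRIDE          # after: one shift, same result

# The two forms are equivalent whenever the stride is a power of two.
assert all(desc_offset_mul(i) == desc_offset_shift(i) for i in range(1024))
```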
    • net/mlx4_en: Optimized single ring steering · 4931c6ef
      By Saeed Mahameed
      Avoid touching RX QP RSS context when loading with only
      one RX ring, to allow optimized A0 RX steering.
      
      Enable by:
      - loading mlx4_core with module param: log_num_mgm_entry_size = -6.
      - then: ethtool -L <interface> rx 1
      
      Performance tests:
      Tested on ConnectX3Pro, Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
      
      XDP_DROP packet rate:
      -------------------------------------
           | Before    | After     | Gain |
      IPv4 | 20.5 Mpps | 28.1 Mpps |  37% |
      IPv6 | 18.4 Mpps | 28.1 Mpps |  53% |
      -------------------------------------
      Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Cc: kernel-team@fb.com
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4931c6ef
  9. 09 May 2017, 1 commit
  10. 10 Mar 2017, 13 commits
  11. 23 Feb 2017, 1 commit
  12. 08 Feb 2017, 1 commit
  13. 26 Jan 2017, 1 commit
    • bpf: add initial bpf tracepoints · a67edbf4
      By Daniel Borkmann
      This work adds a number of tracepoints to paths that are either
      considered slow-path or exception-like states, where monitoring or
      inspecting them would be desirable.
      
      For the bpf(2) syscall, tracepoints have been placed for the main
      commands when they succeed. In the XDP case, the tracepoint is for
      exceptions, that is, e.g. on abnormal BPF program exit such as an
      unknown or XDP_ABORTED return code, or when an error occurs during
      the XDP_TX action and the packet could not be forwarded.
      
      Both have been split into separate event headers, and can be further
      extended. Worst case, if they unexpectedly should get in our way in
      future, they can also be removed [1]. Of course, these tracepoints
      (like any other) can be analyzed by eBPF itself, etc. Example output:
      
        # ./perf record -a -e bpf:* sleep 10
        # ./perf script
        sock_example  6197 [005]   283.980322:      bpf:bpf_map_create: map type=ARRAY ufd=4 key=4 val=8 max=256 flags=0
        sock_example  6197 [005]   283.980721:       bpf:bpf_prog_load: prog=a5ea8fa30ea6849c type=SOCKET_FILTER ufd=5
        sock_example  6197 [005]   283.988423:   bpf:bpf_prog_get_type: prog=a5ea8fa30ea6849c type=SOCKET_FILTER
        sock_example  6197 [005]   283.988443: bpf:bpf_map_lookup_elem: map type=ARRAY ufd=4 key=[06 00 00 00] val=[00 00 00 00 00 00 00 00]
        [...]
        sock_example  6197 [005]   288.990868: bpf:bpf_map_lookup_elem: map type=ARRAY ufd=4 key=[01 00 00 00] val=[14 00 00 00 00 00 00 00]
             swapper     0 [005]   289.338243:    bpf:bpf_prog_put_rcu: prog=a5ea8fa30ea6849c type=SOCKET_FILTER
      
        [1] https://lwn.net/Articles/705270/

      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a67edbf4
  14. 20 Jan 2017, 1 commit
    • mlx4: support __GFP_MEMALLOC for rx · dceeab0e
      By Eric Dumazet
      Commit 04aeb56a ("net/mlx4_en: allocate non 0-order pages for RX
      ring with __GFP_NOMEMALLOC") added code that was not needed at the
      time, since mlx4 never used __GFP_MEMALLOC allocations anyway.
      
      As using memory reserves is a must in some situations (swap over NFS or
      iSCSI), this patch adds this flag.
      
      Note that this driver does not reuse pages (yet) so we do not have to
      add anything else.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      dceeab0e
  15. 30 Dec 2016, 1 commit
  16. 09 Dec 2016, 2 commits
  17. 22 Nov 2016, 1 commit
    • mlx4: avoid unnecessary dirtying of critical fields · dad42c30
      By Eric Dumazet
      While stressing a 40Gbit mlx4 NIC with busy polling, I found false
      sharing in the mlx4 driver that can easily be avoided.

      This patch brings an additional 7% performance improvement in the
      UDP_RR workload.
      
      1) If we received no frame during one mlx4_en_process_rx_cq()
         invocation, no need to call mlx4_cq_set_ci() and/or dirty ring->cons
      
      2) Do not refill rx buffers if we have plenty of them.
         This avoids false sharing and allows some bulk/batch optimizations.
         Page allocator and its locks will thank us.
      
      Finally, mlx4_en_poll_rx_cq() should not return 0 if it determined
      that the CPU handling the NIC IRQ should be changed. We should
      return budget-1 instead, so as not to fool net_rx_action() and its
      netdev_budget accounting.
      
      v2: keep AVG_PERF_COUNTER(... polled) even if polled is 0
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Tariq Toukan <tariqt@mellanox.com>
      Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      dad42c30
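The return-value point can be sketched with a tiny model (names are illustrative, not the driver's): returning 0 would tell net_rx_action() that no budget was consumed, while budget-1 charges the work done and still stays below budget, so NAPI can complete.

```python
def poll_rx_cq_return(polled, budget, want_cpu_change):
    """Model of the fixed poll return value: when the handler wants to
    stop because the NIC IRQ should move to another CPU, return
    budget - 1 instead of 0 to keep net_rx_action() accounting honest."""
    if want_cpu_change:
        return budget - 1   # below budget -> NAPI completes, work charged
    return polled           # polled == budget means "keep polling"

assert poll_rx_cq_return(0, 64, True) == 63    # was 0 before the fix
assert poll_rx_cq_return(64, 64, False) == 64  # budget exhausted, repoll
assert poll_rx_cq_return(10, 64, False) == 10  # normal completion
```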
  18. 17 Nov 2016, 1 commit
  19. 03 Nov 2016, 2 commits
    • net/mlx4_en: Add ethtool statistics for XDP cases · 15fca2c8
      By Tariq Toukan
      XDP statistics are reported in ethtool, in total and per ring,
      as follows:
      - xdp_drop: the number of packets dropped by XDP.
      - xdp_tx: the number of packets forwarded by XDP.
      - xdp_tx_full: the number of times an XDP forward failed
        due to a full XDP TX ring.
      
      In addition, all packets that are dropped or forwarded by XDP
      are no longer accounted in the ring's rx_packets/rx_bytes,
      so those counters reflect only traffic passed to the stack.
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      15fca2c8
    • net/mlx4_en: Refactor the XDP forwarding rings scheme · 67f8b1dc
      By Tariq Toukan
      Separately manage the two types of TX rings: regular ones and XDP
      ones. Upon an XDP set, do not borrow regular TX rings and convert
      them into XDP ones, but allocate new ones, unless we hit the max
      number of rings. This means that on systems with fewer cores we
      will not consume the existing TX rings for XDP, as long as we stay
      within the TX ring limit.
      
      XDP TX rings counters are not shown in ethtool statistics.
      Instead, XDP counters will be added to the respective RX rings
      in a downstream patch.
      
      This has no performance implications.
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      67f8b1dc
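A sketch of the allocation scheme described above (the cap and all names are illustrative, not the driver's actual constants or functions):

```python
MAX_TX_RINGS = 16  # hypothetical device limit on total TX rings

def setup_tx_rings(num_regular, num_xdp):
    """Allocate dedicated XDP TX rings on top of the regular ones,
    instead of converting regular rings into XDP ones, unless the
    total would exceed the device maximum."""
    if num_regular + num_xdp > MAX_TX_RINGS:
        return None  # hit the max number of rings
    return {
        "regular": list(range(num_regular)),
        "xdp": list(range(num_regular, num_regular + num_xdp)),
    }

rings = setup_tx_rings(num_regular=8, num_xdp=4)
assert len(rings["regular"]) == 8 and len(rings["xdp"]) == 4
assert setup_tx_rings(num_regular=12, num_xdp=8) is None  # over the cap
```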
  20. 22 Sep 2016, 2 commits
  21. 20 Sep 2016, 1 commit
  22. 19 Sep 2016, 1 commit