1. 20 Jul 2021, 4 commits
  2. 17 Apr 2021, 1 commit
  3. 12 Apr 2021, 3 commits
    • veth: refine napi usage · 47e550e0
      Authored by Paolo Abeni
      After the previous patch, when enabling GRO, locally generated
      TCP traffic experiences some measurable overhead, as it traverses
      the GRO engine without any chance of aggregation.
      
      This change refines the NAPI receive path admission test to avoid
      unnecessary GRO overhead in most scenarios when GRO is enabled
      on a veth peer.
      
      Only skbs that are eligible for aggregation enter the GRO layer;
      the others go through the traditional receive path.
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
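      
      A minimal C sketch of such an admission test, in the context of
      drivers/net/veth.c; the predicate name and the exact conditions are
      assumptions, not the upstream check:
      
      /* Illustrative only: pick the receive path per skb. */
      static bool veth_skb_gro_candidate(const struct sk_buff *skb)
      {
              /* Assumed heuristic: GSO packets and packets still carrying a
               * partial checksum are plausible aggregation candidates.
               */
              return skb_is_gso(skb) || skb->ip_summed == CHECKSUM_PARTIAL;
      }
      
      static void veth_rx_one(struct veth_rq *rq, struct sk_buff *skb)
      {
              if (veth_skb_gro_candidate(skb))
                      napi_gro_receive(&rq->xdp_napi, skb);   /* GRO layer */
              else
                      netif_receive_skb(skb);                 /* traditional path */
      }
      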
    • veth: allow enabling NAPI even without XDP · d3256efd
      Authored by Paolo Abeni
      Currently the veth device has the GRO feature bit set, even if
      no GRO aggregation is possible with the default configuration,
      as the veth device does not hook into the GRO engine.
      
      Flipping the GRO feature bit from user-space is a no-op unless
      XDP is enabled. In that scenario GRO could actually take place, but
      TSO is forced off on the peer device.
      
      This change allows user-space to really control the GRO feature,
      with no need for an XDP program.
      
      The GRO feature bit is now cleared by default - so that there are no
      user-visible behavior changes with the default configuration.
      
      When the GRO bit is set, the per-queue NAPI instances are initialized
      and registered. On xmit, when NAPI instances are available, we try
      to use them.
      
      Some additional checks are in place to ensure we initialize/delete NAPIs
      only when needed in case of overlapping XDP and GRO configuration
      changes.
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
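      
      A minimal sketch, in the context of drivers/net/veth.c, of how a
      feature-change hook could bring the per-queue NAPIs up or down when
      only the GRO bit flips; the helper names mirror the idea and are not
      necessarily the exact upstream code:
      
      static int veth_set_features_sketch(struct net_device *dev,
                                          netdev_features_t features)
      {
              netdev_features_t changed = dev->features ^ features;
              struct veth_priv *priv = netdev_priv(dev);
      
              /* Nothing to do unless the GRO bit itself changed; when an XDP
               * program is attached, it already owns the NAPI instances.
               */
              if (!(changed & NETIF_F_GRO) || priv->_xdp_prog)
                      return 0;
      
              if (features & NETIF_F_GRO)
                      return veth_napi_enable(dev);   /* init + register per-queue NAPIs */
      
              veth_napi_del(dev);                     /* tear the NAPIs back down */
              return 0;
      }
      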
    • veth: use skb_orphan_partial instead of skb_orphan · c75fb320
      Authored by Paolo Abeni
      As described by commit 9c4c3252 ("skbuff: preserve sock
      reference when scrubbing the skb."), orphaning an skb
      in the TX path will cause out-of-order delivery.
      
      Let's use skb_orphan_partial() instead of skb_orphan(), so
      that we keep the sk around for the sake of queue selection while
      still avoiding the problem fixed by commit 4bf9ffa0 ("veth:
      Orphan skb before GRO").
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
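      
      A short sketch of the spot in the veth transmit path where the call
      changes; the surrounding context is illustrative, not the actual
      function body:
      
      static netdev_tx_t veth_xmit_sketch(struct sk_buff *skb,
                                          struct net_device *dev)
      {
              /* Previously skb_orphan(skb) dropped skb->sk entirely.
               * A partial orphan keeps the socket around for TX queue
               * selection while still detaching enough state before the
               * skb enters the peer's GRO path.
               */
              skb_orphan_partial(skb);
      
              /* ... hand the skb to the peer device as before ... */
              return NETDEV_TX_OK;
      }
      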
  4. 31 Mar 2021, 1 commit
  5. 18 Mar 2021, 1 commit
  6. 06 Mar 2021, 1 commit
  7. 04 Feb 2021, 1 commit
  8. 21 Jan 2021, 1 commit
  9. 09 Jan 2021, 2 commits
  10. 01 Dec 2020, 1 commit
  11. 17 Nov 2020, 1 commit
  12. 12 Oct 2020, 1 commit
    • bpf: Add redirect_peer helper · 9aa1206e
      Authored by Daniel Borkmann
      Add an efficient ingress-to-ingress netns switch that can be used from tc BPF
      programs in order to redirect traffic from host ns ingress into a container
      veth device ingress without having to go via the CPU backlog queue [0]. For
      local containers this can also be utilized, and the path via the CPU backlog
      queue then only needs to be taken once, not twice. On a high level this borrows
      from ipvlan, which does a similar switch in __netif_receive_skb_core() and then
      iterates via another_round. This helps to reduce latency for the mentioned use cases.
      
      Pod to remote pod with redirect(), TCP_RR [1]:
      
        # percpu_netperf 10.217.1.33
                RT_LATENCY:         122.450         (per CPU:         122.666         122.401         122.333         122.401 )
              MEAN_LATENCY:         121.210         (per CPU:         121.100         121.260         121.320         121.160 )
            STDDEV_LATENCY:         120.040         (per CPU:         119.420         119.910         125.460         115.370 )
               MIN_LATENCY:          46.500         (per CPU:          47.000          47.000          47.000          45.000 )
               P50_LATENCY:         118.500         (per CPU:         118.000         119.000         118.000         119.000 )
               P90_LATENCY:         127.500         (per CPU:         127.000         128.000         127.000         128.000 )
               P99_LATENCY:         130.750         (per CPU:         131.000         131.000         129.000         132.000 )
      
          TRANSACTION_RATE:       32666.400         (per CPU:        8152.200        8169.842        8174.439        8169.897 )
      
      Pod to remote pod with redirect_peer(), TCP_RR:
      
        # percpu_netperf 10.217.1.33
                RT_LATENCY:          44.449         (per CPU:          43.767          43.127          45.279          45.622 )
              MEAN_LATENCY:          45.065         (per CPU:          44.030          45.530          45.190          45.510 )
            STDDEV_LATENCY:          84.823         (per CPU:          66.770          97.290          84.380          90.850 )
               MIN_LATENCY:          33.500         (per CPU:          33.000          33.000          34.000          34.000 )
               P50_LATENCY:          43.250         (per CPU:          43.000          43.000          43.000          44.000 )
               P90_LATENCY:          46.750         (per CPU:          46.000          47.000          47.000          47.000 )
               P99_LATENCY:          52.750         (per CPU:          51.000          54.000          53.000          53.000 )
      
          TRANSACTION_RATE:       90039.500         (per CPU:       22848.186       23187.089       22085.077       21919.130 )
      
        [0] https://linuxplumbersconf.org/event/7/contributions/674/attachments/568/1002/plumbers_2020_cilium_load_balancer.pdf
        [1] https://github.com/borkmann/netperf_scripts/blob/master/percpu_netperf
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20201010234006.7075-3-daniel@iogearbox.net
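      
      A minimal sketch of a tc BPF program using the new helper; the ifindex
      value and the section name are assumptions for illustration:
      
      #include <linux/bpf.h>
      #include <bpf/bpf_helpers.h>
      
      /* ifindex of the host-side veth whose netns peer should receive the packet */
      #define HOST_VETH_IFINDEX 42
      
      SEC("tc")
      int redirect_to_container(struct __sk_buff *skb)
      {
              /* Switch directly into the peer's netns and deliver to the peer
               * device's ingress, bypassing the CPU backlog queue.
               */
              return bpf_redirect_peer(HOST_VETH_IFINDEX, 0);
      }
      
      char _license[] SEC("license") = "GPL";
      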
  13. 11 Sep 2020, 1 commit
    • net: remove napi_hash_del() from driver-facing API · 5198d545
      Authored by Jakub Kicinski
      We allow drivers to call napi_hash_del() before calling
      netif_napi_del() to batch RCU grace periods. This makes
      the API asymmetric and leaks internal implementation details.
      Soon we will want the grace period to protect more than just
      the NAPI hash table.
      
      Restructure the API and have drivers call a new function -
      __netif_napi_del() if they want to take care of RCU waits.
      
      Note that only core was checking the return status from
      napi_hash_del() so the new helper does not report if the
      NAPI was actually deleted.
      
      Some notes on driver oddness:
       - veth observed the grace period before calling netif_napi_del()
         but that should not matter
       - myri10ge observed normal RCU flavor
       - bnx2x and enic did not actually observe the grace period
         (unless they did so implicitly)
       - virtio_net and enic only unhashed Rx NAPIs
      
      The last two points seem to indicate that the calls to
      napi_hash_del() were a leftover rather than an optimization.
      Regardless, it's easy enough to correct them.
      
      This patch may introduce extra synchronize_net() calls for
      interfaces which set NAPI_STATE_NO_BUSY_POLL and depend on
      free_netdev() to call netif_napi_del(). This seems inevitable
      since we want to use RCU for netpoll dev->napi_list traversal,
      and almost no drivers set IFF_DISABLE_NETPOLL.
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
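      
      A small sketch of the new driver-side pattern described above, for a
      hypothetical multi-queue driver that wants to batch the RCU wait
      itself (struct drv_priv and its fields are assumptions):
      
      static void drv_del_napis(struct drv_priv *priv)
      {
              int i;
      
              /* __netif_napi_del() unhashes and unlinks but does not wait */
              for (i = 0; i < priv->num_queues; i++)
                      __netif_napi_del(&priv->queues[i].napi);
      
              /* one grace period covers every instance deleted above */
              synchronize_net();
      }
      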
  14. 24 Aug 2020, 1 commit
  15. 20 Aug 2020, 1 commit
  16. 26 Jul 2020, 1 commit
  17. 02 Jun 2020, 2 commits
  18. 15 May 2020, 2 commits
  19. 27 Mar 2020, 2 commits
  20. 20 Mar 2020, 5 commits
  21. 06 Mar 2020, 1 commit
    • veth: ignore peer tx_dropped when counting local rx_dropped · e25d5dbc
      Authored by Jiang Lidong
      When local NET_RX backlog is full due to traffic overrun,
      peer veth tx_dropped counter increases. At that time, list
      local veth stats, rx_dropped has double value of peer
      tx_dropped, even bigger than transmit packets by peer.
      
      In NET_RX softirq process, if any packet drop case happens,
      it increases dev's rx_dropped counter and returns NET_RX_DROP.
      
      At veth tx side, it records any error returned from peer netif_rx
      into local dev tx_dropped counter.
      
      In veth get stats process, it puts local dev rx_dropped and
      peer dev tx_dropped into together as local rx_drpped value.
      So that it shows double value of real dropped packets number in
      this case.
      
      This patch ignores peer tx_dropped when counting local rx_dropped,
      since peer tx_dropped is duplicated to local rx_dropped at most cases.
      Signed-off-by: Jiang Lidong <jianglidong3@jd.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
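      
      A rough sketch, in the context of drivers/net/veth.c, of what the stats
      aggregation looks like once the peer's tx_dropped is left out; the
      counter storage shown here is illustrative, not the driver's actual
      per-cpu layout:
      
      static void veth_get_stats64_sketch(struct net_device *dev,
                                          struct rtnl_link_stats64 *tot)
      {
              struct veth_priv *priv = netdev_priv(dev);
              struct net_device *peer;
      
              tot->rx_dropped = dev->stats.rx_dropped;   /* local drops only */
      
              rcu_read_lock();
              peer = rcu_dereference(priv->peer);
              if (peer) {
                      /* Before this patch the peer's tx_dropped was also folded
                       * into tot->rx_dropped here, double counting the same
                       * dropped packets; it is now left out.
                       */
              }
              rcu_read_unlock();
      }
      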
  22. 27 Jan 2020, 1 commit
    • bpf, xdp: Remove no longer required rcu_read_{un}lock() · b23bfa56
      Authored by John Fastabend
      Now that we depend on call_rcu() and synchronize_rcu() to also wait
      for preempt-disabled regions to complete, the RCU read-side critical
      section in __dev_map_flush() is no longer required, except in a few
      special cases in drivers that need it for other reasons.
      
      These originally ensured the map reference was safe while a map was
      also being freed, and additionally that bpf program updates via
      ndo_bpf did not happen while flush updates were in flight. But under
      the new rules flush can only be called from preempt-disabled NAPI
      context. The synchronize_rcu() on the map free path and the call_rcu()
      on the delete path ensure the reference there is safe. So let's remove
      the rcu_read_lock()/rcu_read_unlock() pair to avoid any confusion
      around how this is being protected.
      
      If the rcu_read_lock() were required, it would mean the above logic
      was in error and the original patch would also be wrong.
      
      With the above done, the rcu_read_lock() is placed in the driver
      code where it is needed, in a driver-dependent way. I think this
      helps readability of the code, so we know where and why we are
      taking read locks. Most drivers will not need rcu_read_lock() here,
      and XDP drivers already take rcu_read_lock() in their RX paths when
      reading xdp programs, so this makes things symmetric instead of having
      half of the RCU critical sections defined in the driver and the other
      half in devmap.
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Link: https://lore.kernel.org/bpf/1580084042-11598-4-git-send-email-john.fastabend@gmail.com
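      
      A small sketch of the driver-side placement described above, for a
      hypothetical driver whose XDP transmit path dereferences RCU-protected
      state; the helper and structure names are assumptions:
      
      static int drv_xdp_xmit_sketch(struct net_device *dev, int n,
                                     struct xdp_frame **frames, u32 flags)
      {
              struct drv_priv *priv = netdev_priv(dev);
              int sent;
      
              rcu_read_lock();        /* taken here, in the driver, not in devmap */
              sent = drv_xmit_frames(priv, frames, n);   /* hypothetical helper */
              rcu_read_unlock();
      
              return sent;
      }
      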
  23. 17 Jan 2020, 1 commit
    • xdp: Use bulking for non-map XDP_REDIRECT and consolidate code paths · 1d233886
      Authored by Toke Høiland-Jørgensen
      Since the bulk queue used by XDP_REDIRECT now lives in struct net_device,
      we can re-use the bulking for the non-map version of the bpf_redirect()
      helper. This is a simple matter of having xdp_do_redirect_slow() queue the
      frame on the bulk queue instead of sending it out with __bpf_tx_xdp().
      
      Unfortunately we can't make the bpf_redirect() helper return an error if
      the ifindex doesn't exist (as bpf_redirect_map() does), because we don't
      have a reference to the network namespace of the ingress device at the time
      the helper is called. So we have to leave it as-is and keep the device
      lookup in xdp_do_redirect_slow().
      
      This leaves less reason to have the non-map redirect code in a
      separate function, so we get rid of the xdp_do_redirect_slow() function
      entirely. This does lose us the tracepoint disambiguation, but fortunately
      the xdp_redirect and xdp_redirect_map tracepoints use the same tracepoint
      entry structures. This means both can contain a map index, so we can just
      amend the tracepoint definitions so we always emit the xdp_redirect(_err)
      tracepoints, but with the map ID only populated if a map is present. This
      means we retire the xdp_redirect_map(_err) tracepoints entirely, but keep
      the definitions around in case someone is still listening for them.
      
      With this change, the performance of the xdp_redirect sample program goes
      from 5Mpps to 8.4Mpps (a 68% increase).
      
      Since the flush functions are no longer map-specific, rename the flush()
      functions to drop _map from their names. One of the renamed functions is
      the xdp_do_flush_map() callback used in all the xdp-enabled drivers. To
      keep from having to update all drivers, use a #define to keep the old name
      working, and only update the virtual drivers in this patch.
      Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Link: https://lore.kernel.org/bpf/157918768505.1458396.17518057312953572912.stgit@toke.dk
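      
      A sketch of the rename-with-alias trick mentioned above, so drivers
      still calling the old name keep building while the virtual drivers
      move to the new one; header placement and the driver snippet are
      illustrative:
      
      /* header excerpt, placement illustrative */
      void xdp_do_flush(void);                        /* new, map-agnostic name */
      #define xdp_do_flush_map        xdp_do_flush    /* old name kept as an alias */
      
      /* tail of a driver's NAPI poll loop, illustrative */
      static void drv_xdp_finish_sketch(bool did_redirect)
      {
              if (did_redirect)
                      xdp_do_flush();         /* flush the per-CPU bulk queues */
      }
      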
  24. 08 Nov 2019, 1 commit
  25. 25 Jun 2019, 1 commit
    • veth: Support bulk XDP_TX · 9cda7807
      Authored by Toshiaki Makita
      XDP_TX is similar to XDP_REDIRECT in that it essentially redirects packets
      to the device itself. XDP_REDIRECT has a bulk transmit mechanism to avoid
      the heavy cost of indirect calls, and it also reduces lock acquisitions on
      destination devices that need locks, like veth and tun.
      
      XDP_TX does not use indirect calls, but drivers which require locks can
      benefit from the bulk transmit for XDP_TX as well.
      
      This patch introduces a bulk transmit mechanism in veth using a bulk queue
      on the stack, and improves XDP_TX performance by about 9%.
      
      Here are single-core/single-flow XDP_TX test results. CPU consumptions
      are taken from "perf report --no-child".
      
      - Before:
      
        7.26 Mpps
      
        _raw_spin_lock  7.83%
        veth_xdp_xmit  12.23%
      
      - After:
      
        7.94 Mpps
      
        _raw_spin_lock  1.08%
        veth_xdp_xmit   6.10%
      
      v2:
      - Use stack for bulk queue instead of a global variable.
      Signed-off-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
      Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
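      
      A sketch of an on-stack bulk queue for XDP_TX as described above:
      frames accumulate on the stack and are flushed in a single
      veth_xdp_xmit() call so the destination lock is taken once per batch;
      the size constant and helper names are assumptions:
      
      #define VETH_XDP_TX_BULK_SIZE   16
      
      struct veth_xdp_tx_bq {
              struct xdp_frame *q[VETH_XDP_TX_BULK_SIZE];
              unsigned int count;
      };
      
      static void veth_xdp_flush_bq_sketch(struct net_device *dev,
                                           struct veth_xdp_tx_bq *bq)
      {
              if (!bq->count)
                      return;
              /* one locked transmit for the whole batch */
              veth_xdp_xmit(dev, bq->count, bq->q, 0);
              bq->count = 0;
      }
      
      static void veth_xdp_tx_enqueue_sketch(struct net_device *dev,
                                             struct veth_xdp_tx_bq *bq,
                                             struct xdp_frame *frame)
      {
              if (bq->count == VETH_XDP_TX_BULK_SIZE)
                      veth_xdp_flush_bq_sketch(dev, bq);
              bq->q[bq->count++] = frame;
      }
      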
  26. 19 Jun 2019, 1 commit
  27. 21 May 2019, 1 commit