1. 06 May 2022, 1 commit
  2. 01 May 2022, 1 commit
  3. 08 Apr 2022, 1 commit
    • veth: Ensure eth header is in skb's linear part · 726e2c59
      Guillaume Nault committed
      After feeding a decapsulated packet to a veth device with act_mirred,
      skb_headlen() may be 0. But veth_xmit() calls __dev_forward_skb(),
      which expects at least ETH_HLEN bytes of linear data (as
      __dev_forward_skb2() calls eth_type_trans(), which pulls ETH_HLEN bytes
      unconditionally).
      
      Use pskb_may_pull() to ensure veth_xmit() respects this constraint.
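      
      A minimal sketch of the kind of check this adds at the top of
      veth_xmit() (paraphrased under the commit's description; the real
      patch also bumps the drop statistics, which is elided here):
      
          #include <linux/etherdevice.h>  /* ETH_HLEN */
          #include <linux/netdevice.h>
          #include <linux/skbuff.h>       /* pskb_may_pull() */
      
          static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
          {
                  /* __dev_forward_skb() ends up in eth_type_trans(), which
                   * pulls ETH_HLEN bytes unconditionally, so the Ethernet
                   * header must already be in the linear area.
                   */
                  if (unlikely(!pskb_may_pull(skb, ETH_HLEN))) {
                          kfree_skb(skb);
                          return NETDEV_TX_OK;
                  }
                  /* ... hand the skb to the peer device as before ... */
                  return NETDEV_TX_OK;
          }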
      
      kernel BUG at include/linux/skbuff.h:2328!
      RIP: 0010:eth_type_trans+0xcf/0x140
      Call Trace:
       <IRQ>
       __dev_forward_skb2+0xe3/0x160
       veth_xmit+0x6e/0x250 [veth]
       dev_hard_start_xmit+0xc7/0x200
       __dev_queue_xmit+0x47f/0x520
       ? skb_ensure_writable+0x85/0xa0
       ? skb_mpls_pop+0x98/0x1c0
       tcf_mirred_act+0x442/0x47e [act_mirred]
       tcf_action_exec+0x86/0x140
       fl_classify+0x1d8/0x1e0 [cls_flower]
       ? dma_pte_clear_level+0x129/0x1a0
       ? dma_pte_clear_level+0x129/0x1a0
       ? prb_fill_curr_block+0x2f/0xc0
       ? skb_copy_bits+0x11a/0x220
       __tcf_classify+0x58/0x110
       tcf_classify_ingress+0x6b/0x140
       __netif_receive_skb_core.constprop.0+0x47d/0xfd0
       ? __iommu_dma_unmap_swiotlb+0x44/0x90
       __netif_receive_skb_one_core+0x3d/0xa0
       netif_receive_skb+0x116/0x170
       be_process_rx+0x22f/0x330 [be2net]
       be_poll+0x13c/0x370 [be2net]
       __napi_poll+0x2a/0x170
       net_rx_action+0x22f/0x2f0
       __do_softirq+0xca/0x2a8
       __irq_exit_rcu+0xc1/0xe0
       common_interrupt+0x83/0xa0
      
      Fixes: e314dbdc ("[NET]: Virtual ethernet device driver.")
      Signed-off-by: Guillaume Nault <gnault@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 18 Mar 2022, 3 commits
  5. 14 Feb 2022, 1 commit
    • net: dev: Makes sure netif_rx() can be invoked in any context. · baebdf48
      Sebastian Andrzej Siewior committed
      Dave suggested a while ago (eleven years by now) "Let's make netif_rx()
      work in all contexts and get rid of netif_rx_ni()". Eric agreed and
      pointed out that modern devices should use netif_receive_skb() to avoid
      the overhead.
      In the meantime someone added another variant, netif_rx_any_context(),
      which behaves as suggested.
      
      netif_rx() must be invoked with disabled bottom halves to ensure that
      pending softirqs, which were raised within the function, are handled.
      netif_rx_ni() can be invoked only from process context (bottom halves
      must be enabled) because the function handles pending softirqs without
      checking if bottom halves were disabled or not.
      netif_rx_any_context() invokes one of the former functions by
      checking in_interrupt().
      
      netif_rx() could be taught to handle both cases (disabled and enabled
      bottom halves) by simply disabling bottom halves while invoking
      netif_rx_internal(). The local_bh_enable() invocation will then invoke
      pending softirqs only if the BH-disable counter drops to zero.
      
      Eric is concerned about the overhead of BH-disable+enable especially in
      regard to the loopback driver. As critical as this driver is, it will
      receive a shortcut to avoid the additional overhead which is not needed.
      
      Add a local_bh_disable() section in netif_rx() to ensure softirqs are
      handled if needed.
      Provide __netif_rx() which does not disable BH and has a lockdep assert
      to ensure that interrupts are disabled. Use this shortcut in the
      loopback driver and in drivers/net/*.c.
      Make netif_rx_ni() and netif_rx_any_context() invoke netif_rx() so they
      can be removed once there are no users left.
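      
      A condensed sketch of the resulting entry points (hedged:
      netif_rx_internal() is the pre-existing internal worker; the
      tracepoints and the exact lockdep annotation are omitted):
      
          #include <linux/bottom_half.h>  /* local_bh_disable/enable() */
          #include <linux/netdevice.h>
      
          /* Shortcut for callers that already run with BH disabled
           * (loopback and drivers/net/*.c): no extra BH juggling.
           * The real code asserts via lockdep that this holds.
           */
          int __netif_rx(struct sk_buff *skb)
          {
                  return netif_rx_internal(skb);
          }
      
          /* Safe in any context: if the BH-disable count drops to zero
           * in local_bh_enable(), pending softirqs raised inside
           * netif_rx_internal() are handled right here.
           */
          int netif_rx(struct sk_buff *skb)
          {
                  int ret;
      
                  local_bh_disable();
                  ret = netif_rx_internal(skb);
                  local_bh_enable();
                  return ret;
          }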
      
      Link: https://lkml.kernel.org/r/20100415.020246.218622820.davem@davemloft.net
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. 09 Feb 2022, 1 commit
    • veth: fix races around rq->rx_notify_masked · 68468d8c
      Eric Dumazet committed
      Since veth is NETIF_F_LLTX enabled, we need to be more careful
      whenever we read or write rq->rx_notify_masked.
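      
      The fix boils down to annotated, race-aware accesses in
      __veth_xdp_flush() (a close paraphrase of the patch, not the
      verbatim hunk):
      
          static void __veth_xdp_flush(struct veth_rq *rq)
          {
                  /* Lockless NETIF_F_LLTX senders can run this
                   * concurrently, so the flag is only accessed through
                   * READ_ONCE()/WRITE_ONCE(), and NAPI scheduling is
                   * serialized by napi_schedule_prep().
                   */
                  if (!READ_ONCE(rq->rx_notify_masked) &&
                      napi_schedule_prep(&rq->xdp_napi)) {
                          WRITE_ONCE(rq->rx_notify_masked, true);
                          __napi_schedule(&rq->xdp_napi);
                  }
          }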
      
      BUG: KCSAN: data-race in veth_xmit / veth_xmit
      
      write to 0xffff888133d9a9f8 of 1 bytes by task 23552 on cpu 0:
       __veth_xdp_flush drivers/net/veth.c:269 [inline]
       veth_xmit+0x307/0x470 drivers/net/veth.c:350
       __netdev_start_xmit include/linux/netdevice.h:4683 [inline]
       netdev_start_xmit include/linux/netdevice.h:4697 [inline]
       xmit_one+0x105/0x2f0 net/core/dev.c:3473
       dev_hard_start_xmit net/core/dev.c:3489 [inline]
       __dev_queue_xmit+0x86d/0xf90 net/core/dev.c:4116
       dev_queue_xmit+0x13/0x20 net/core/dev.c:4149
       br_dev_queue_push_xmit+0x3ce/0x430 net/bridge/br_forward.c:53
       NF_HOOK include/linux/netfilter.h:307 [inline]
       br_forward_finish net/bridge/br_forward.c:66 [inline]
       NF_HOOK include/linux/netfilter.h:307 [inline]
       __br_forward+0x2e4/0x400 net/bridge/br_forward.c:115
       br_flood+0x521/0x5c0 net/bridge/br_forward.c:242
       br_dev_xmit+0x8b6/0x960
       __netdev_start_xmit include/linux/netdevice.h:4683 [inline]
       netdev_start_xmit include/linux/netdevice.h:4697 [inline]
       xmit_one+0x105/0x2f0 net/core/dev.c:3473
       dev_hard_start_xmit net/core/dev.c:3489 [inline]
       __dev_queue_xmit+0x86d/0xf90 net/core/dev.c:4116
       dev_queue_xmit+0x13/0x20 net/core/dev.c:4149
       neigh_hh_output include/net/neighbour.h:525 [inline]
       neigh_output include/net/neighbour.h:539 [inline]
       ip_finish_output2+0x6f8/0xb70 net/ipv4/ip_output.c:228
       ip_finish_output+0xfb/0x240 net/ipv4/ip_output.c:316
       NF_HOOK_COND include/linux/netfilter.h:296 [inline]
       ip_output+0xf3/0x1a0 net/ipv4/ip_output.c:430
       dst_output include/net/dst.h:451 [inline]
       ip_local_out net/ipv4/ip_output.c:126 [inline]
       ip_send_skb+0x6e/0xe0 net/ipv4/ip_output.c:1570
       udp_send_skb+0x641/0x880 net/ipv4/udp.c:967
       udp_sendmsg+0x12ea/0x14c0 net/ipv4/udp.c:1254
       inet_sendmsg+0x5f/0x80 net/ipv4/af_inet.c:819
       sock_sendmsg_nosec net/socket.c:705 [inline]
       sock_sendmsg net/socket.c:725 [inline]
       ____sys_sendmsg+0x39a/0x510 net/socket.c:2413
       ___sys_sendmsg net/socket.c:2467 [inline]
       __sys_sendmmsg+0x267/0x4c0 net/socket.c:2553
       __do_sys_sendmmsg net/socket.c:2582 [inline]
       __se_sys_sendmmsg net/socket.c:2579 [inline]
       __x64_sys_sendmmsg+0x53/0x60 net/socket.c:2579
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      read to 0xffff888133d9a9f8 of 1 bytes by task 23563 on cpu 1:
       __veth_xdp_flush drivers/net/veth.c:268 [inline]
       veth_xmit+0x2d6/0x470 drivers/net/veth.c:350
       __netdev_start_xmit include/linux/netdevice.h:4683 [inline]
       netdev_start_xmit include/linux/netdevice.h:4697 [inline]
       xmit_one+0x105/0x2f0 net/core/dev.c:3473
       dev_hard_start_xmit net/core/dev.c:3489 [inline]
       __dev_queue_xmit+0x86d/0xf90 net/core/dev.c:4116
       dev_queue_xmit+0x13/0x20 net/core/dev.c:4149
       br_dev_queue_push_xmit+0x3ce/0x430 net/bridge/br_forward.c:53
       NF_HOOK include/linux/netfilter.h:307 [inline]
       br_forward_finish net/bridge/br_forward.c:66 [inline]
       NF_HOOK include/linux/netfilter.h:307 [inline]
       __br_forward+0x2e4/0x400 net/bridge/br_forward.c:115
       br_flood+0x521/0x5c0 net/bridge/br_forward.c:242
       br_dev_xmit+0x8b6/0x960
       __netdev_start_xmit include/linux/netdevice.h:4683 [inline]
       netdev_start_xmit include/linux/netdevice.h:4697 [inline]
       xmit_one+0x105/0x2f0 net/core/dev.c:3473
       dev_hard_start_xmit net/core/dev.c:3489 [inline]
       __dev_queue_xmit+0x86d/0xf90 net/core/dev.c:4116
       dev_queue_xmit+0x13/0x20 net/core/dev.c:4149
       neigh_hh_output include/net/neighbour.h:525 [inline]
       neigh_output include/net/neighbour.h:539 [inline]
       ip_finish_output2+0x6f8/0xb70 net/ipv4/ip_output.c:228
       ip_finish_output+0xfb/0x240 net/ipv4/ip_output.c:316
       NF_HOOK_COND include/linux/netfilter.h:296 [inline]
       ip_output+0xf3/0x1a0 net/ipv4/ip_output.c:430
       dst_output include/net/dst.h:451 [inline]
       ip_local_out net/ipv4/ip_output.c:126 [inline]
       ip_send_skb+0x6e/0xe0 net/ipv4/ip_output.c:1570
       udp_send_skb+0x641/0x880 net/ipv4/udp.c:967
       udp_sendmsg+0x12ea/0x14c0 net/ipv4/udp.c:1254
       inet_sendmsg+0x5f/0x80 net/ipv4/af_inet.c:819
       sock_sendmsg_nosec net/socket.c:705 [inline]
       sock_sendmsg net/socket.c:725 [inline]
       ____sys_sendmsg+0x39a/0x510 net/socket.c:2413
       ___sys_sendmsg net/socket.c:2467 [inline]
       __sys_sendmmsg+0x267/0x4c0 net/socket.c:2553
       __do_sys_sendmmsg net/socket.c:2582 [inline]
       __se_sys_sendmmsg net/socket.c:2579 [inline]
       __x64_sys_sendmmsg+0x53/0x60 net/socket.c:2579
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x44/0xd0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      value changed: 0x00 -> 0x01
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 1 PID: 23563 Comm: syz-executor.5 Not tainted 5.17.0-rc2-syzkaller-00064-gc36c04c2 #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      
      Fixes: 948d4f21 ("veth: Add driver XDP")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  7. 06 Jan 2022, 1 commit
    • veth: Do not record rx queue hint in veth_xmit · 710ad98c
      Daniel Borkmann committed
      Laurent reported that they have seen a significant amount of TCP retransmissions
      at high throughput from applications residing in network namespaces talking to
      the outside world via veths. The drops were seen on the qdisc layer (fq_codel,
      as per systemd default) of the phys device such as ena or virtio_net due to all
      traffic hitting a _single_ TX queue _despite_ multi-queue device. (Note that the
      setup was _not_ using XDP on veths as the issue is generic.)
      
      More specifically, after edbea922 ("veth: Store queue_mapping independently
      of XDP prog presence") which made it all the way back to v4.19.184+,
      skb_record_rx_queue() would set skb->queue_mapping to 1 (given 1 RX and 1 TX
      queue by default for veths) instead of leaving at 0.
      
      This is eventually retained, and callbacks like ena_select_queue() will also
      pick a single queue via netdev_core_pick_tx()'s ndo_select_queue() once all the
      traffic is forwarded to that device via the upper stack or other means. Similarly,
      for other drivers not implementing ndo_select_queue(), if XPS is disabled,
      netdev_pick_tx() might call into skb_tx_hash() and check for a prior
      skb_rx_queue_recorded() as well.
      
      In general, it is a _bad_ idea for virtual devices like veth to mess around with
      queue selection [by default]. Given dev->real_num_tx_queues is by default 1,
      the skb->queue_mapping was left untouched, and so prior to edbea922 the
      netdev_core_pick_tx() could do its job upon __dev_queue_xmit() on the phys device.
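      
      The helpers involved make the failure mode easy to see (shapes as
      in include/linux/skbuff.h; shown here as a sketch):
      
          /* Records queue + 1, so recording RX queue 0 yields
           * skb->queue_mapping == 1 ...
           */
          static inline void skb_record_rx_queue(struct sk_buff *skb, u16 rx_queue)
          {
                  skb->queue_mapping = rx_queue + 1;
          }
      
          /* ... and any non-zero mapping counts as "recorded", letting
           * skb_tx_hash() pin all forwarded traffic to one TX queue.
           */
          static inline bool skb_rx_queue_recorded(const struct sk_buff *skb)
          {
                  return skb->queue_mapping != 0;
          }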
      
      Unbreak this and restore prior behavior by removing the skb_record_rx_queue()
      from veth_xmit() altogether.
      
      If the veth peer has an XDP program attached, then it would return the first RX
      queue index in xdp_md->rx_queue_index (unless configured in a non-default manner).
      However, this is still better than breaking the generic case.
      
      Fixes: edbea922 ("veth: Store queue_mapping independently of XDP prog presence")
      Fixes: 638264dc ("veth: Support per queue XDP ring")
      Reported-by: Laurent Bernaille <laurent.bernaille@datadoghq.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
      Cc: Toshiaki Makita <toshiaki.makita1@gmail.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Paolo Abeni <pabeni@redhat.com>
      Cc: John Fastabend <john.fastabend@gmail.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Toshiaki Makita <toshiaki.makita1@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  8. 24 Dec 2021, 1 commit
    • veth: ensure skb entering GRO are not cloned. · 9695b7de
      Paolo Abeni committed
      After commit d3256efd ("veth: allow enabling NAPI even without XDP"),
      if GRO is enabled on a veth device and TSO is disabled on the peer
      device, TCP skbs will go through the NAPI callback. If there is no XDP
      program attached, the veth code does not perform any share check, and
      shared/cloned skbs could enter the GRO engine.
      
      Ignat reported a BUG triggered later-on due to the above condition:
      
      [   53.970529][    C1] kernel BUG at net/core/skbuff.c:3574!
      [   53.981755][    C1] invalid opcode: 0000 [#1] PREEMPT SMP KASAN PTI
      [   53.982634][    C1] CPU: 1 PID: 19 Comm: ksoftirqd/1 Not tainted 5.16.0-rc5+ #25
      [   53.982634][    C1] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
      [   53.982634][    C1] RIP: 0010:skb_shift+0x13ef/0x23b0
      [   53.982634][    C1] Code: ea 03 0f b6 04 02 48 89 fa 83 e2 07 38 d0
      7f 08 84 c0 0f 85 41 0c 00 00 41 80 7f 02 00 4d 8d b5 d0 00 00 00 0f
      85 74 f5 ff ff <0f> 0b 4d 8d 77 20 be 04 00 00 00 4c 89 44 24 78 4c 89
      f7 4c 89 8c
      [   53.982634][    C1] RSP: 0018:ffff8881008f7008 EFLAGS: 00010246
      [   53.982634][    C1] RAX: 0000000000000000 RBX: ffff8881180b4c80 RCX: 0000000000000000
      [   53.982634][    C1] RDX: 0000000000000002 RSI: ffff8881180b4d3c RDI: ffff88810bc9cac2
      [   53.982634][    C1] RBP: ffff8881008f70b8 R08: ffff8881180b4cf4 R09: ffff8881180b4cf0
      [   53.982634][    C1] R10: ffffed1022999e5c R11: 0000000000000002 R12: 0000000000000590
      [   53.982634][    C1] R13: ffff88810f940c80 R14: ffff88810f940d50 R15: ffff88810bc9cac0
      [   53.982634][    C1] FS:  0000000000000000(0000) GS:ffff888235880000(0000) knlGS:0000000000000000
      [   53.982634][    C1] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [   53.982634][    C1] CR2: 00007ff5f9b86680 CR3: 0000000108ce8004 CR4: 0000000000170ee0
      [   53.982634][    C1] Call Trace:
      [   53.982634][    C1]  <TASK>
      [   53.982634][    C1]  tcp_sacktag_walk+0xaba/0x18e0
      [   53.982634][    C1]  tcp_sacktag_write_queue+0xe7b/0x3460
      [   53.982634][    C1]  tcp_ack+0x2666/0x54b0
      [   53.982634][    C1]  tcp_rcv_established+0x4d9/0x20f0
      [   53.982634][    C1]  tcp_v4_do_rcv+0x551/0x810
      [   53.982634][    C1]  tcp_v4_rcv+0x22ed/0x2ed0
      [   53.982634][    C1]  ip_protocol_deliver_rcu+0x96/0xaf0
      [   53.982634][    C1]  ip_local_deliver_finish+0x1e0/0x2f0
      [   53.982634][    C1]  ip_sublist_rcv_finish+0x211/0x440
      [   53.982634][    C1]  ip_list_rcv_finish.constprop.0+0x424/0x660
      [   53.982634][    C1]  ip_list_rcv+0x2c8/0x410
      [   53.982634][    C1]  __netif_receive_skb_list_core+0x65c/0x910
      [   53.982634][    C1]  netif_receive_skb_list_internal+0x5f9/0xcb0
      [   53.982634][    C1]  napi_complete_done+0x188/0x6e0
      [   53.982634][    C1]  gro_cell_poll+0x10c/0x1d0
      [   53.982634][    C1]  __napi_poll+0xa1/0x530
      [   53.982634][    C1]  net_rx_action+0x567/0x1270
      [   53.982634][    C1]  __do_softirq+0x28a/0x9ba
      [   53.982634][    C1]  run_ksoftirqd+0x32/0x60
      [   53.982634][    C1]  smpboot_thread_fn+0x559/0x8c0
      [   53.982634][    C1]  kthread+0x3b9/0x490
      [   53.982634][    C1]  ret_from_fork+0x22/0x30
      [   53.982634][    C1]  </TASK>
      
      Address the issue by skipping the GRO stage for shared or cloned skbs.
      To reduce the chance of OoO, try to unclone the skbs before giving up.
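      
      The core of the fix in the veth NAPI receive loop looks roughly
      like this (paraphrased, not the verbatim hunk):
      
          skb = veth_xdp_rcv_skb(rq, skb, bq, stats);
          if (skb) {
                  /* GRO must never see shared or cloned skbs: try to
                   * unclone first, and fall back to the plain receive
                   * path when that is not possible.
                   */
                  if (skb_shared(skb) || skb_unclone(skb, GFP_ATOMIC))
                          netif_receive_skb(skb);
                  else
                          napi_gro_receive(&rq->xdp_napi, skb);
          }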
      
      v1 -> v2:
       - avoid skb_copy and fall back to netif_receive_skb  - Eric
      Reported-by: Ignat Korchagin <ignat@cloudflare.com>
      Fixes: d3256efd ("veth: allow enabling NAPI even without XDP")
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Tested-by: Ignat Korchagin <ignat@cloudflare.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/b5f61c5602aab01bac8d711d8d1bfab0a4817db7.1640197544.git.pabeni@redhat.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  9. 14 Dec 2021, 1 commit
  10. 26 Nov 2021, 1 commit
  11. 22 Nov 2021, 2 commits
  12. 29 Jul 2021, 1 commit
  13. 20 Jul 2021, 4 commits
  14. 17 Apr 2021, 1 commit
  15. 12 Apr 2021, 3 commits
    • veth: refine napi usage · 47e550e0
      Paolo Abeni committed
      After the previous patch, when enabling GRO, locally generated
      TCP traffic experiences some measurable overhead, as it traverses
      the GRO engine without any chance of aggregation.
      
      This change refines the NAPI receive path admission test to avoid
      unnecessary GRO overhead in most scenarios, when GRO is enabled
      on a veth peer.
      
      Only skbs that are eligible for aggregation enter the GRO layer;
      the others go through the traditional receive path.
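      
      A hedged sketch of the admission test (close in spirit to the
      helper this change adds to the veth driver, but treat the name
      and exact feature bits as approximate):
      
          static bool veth_skb_is_eligible_for_gro(const struct net_device *dev,
                                                   const struct net_device *rcv,
                                                   const struct sk_buff *skb)
          {
                  /* GRO can help when the sender has TSO off (packets
                   * arrive unaggregated), or when the receiver can use
                   * fraglist GRO / UDP GRO forwarding on locally
                   * generated traffic.
                   */
                  return !(dev->features & NETIF_F_ALL_TSO) ||
                         (skb->destructor == sock_wfree &&
                          rcv->features & (NETIF_F_GRO_FRAGLIST |
                                           NETIF_F_GRO_UDP_FWD));
          }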
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • veth: allow enabling NAPI even without XDP · d3256efd
      Paolo Abeni committed
      Currently the veth device has the GRO feature bit set, even if
      no GRO aggregation is possible with the default configuration,
      as the veth device does not hook into the GRO engine.
      
      Flipping the GRO feature bit from user-space is a no-op, unless
      XDP is enabled. In such a scenario GRO could actually take place, but
      TSO is forced off on the peer device.
      
      This change allows user-space to really control the GRO feature, with
      no need for an XDP program.
      
      The GRO feature bit is now cleared by default - so that there are no
      user-visible behavior changes with the default configuration.
      
      When the GRO bit is set, the per-queue NAPI instances are initialized
      and registered. On xmit, when napi instances are available, we try
      to use them.
      
      Some additional checks are in place to ensure we initialize/delete NAPIs
      only when needed in case of overlapping XDP and GRO configuration
      changes.
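      
      A condensed sketch of that toggling (hypothetical condensation:
      veth_napi_enable()/veth_napi_del() are the driver's own helpers,
      and the real code also handles the XDP overlap mentioned above):
      
          static int veth_set_features(struct net_device *dev,
                                       netdev_features_t features)
          {
                  netdev_features_t changed = features ^ dev->features;
                  int err;
      
                  if (!(changed & NETIF_F_GRO) || !(dev->flags & IFF_UP))
                          return 0;
      
                  if (features & NETIF_F_GRO) {
                          /* GRO switched on: bring up per-queue NAPI */
                          err = veth_napi_enable(dev);
                          if (err)
                                  return err;
                  } else {
                          /* GRO switched off: tear the NAPIs down again */
                          veth_napi_del(dev);
                  }
                  return 0;
          }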
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • veth: use skb_orphan_partial instead of skb_orphan · c75fb320
      Paolo Abeni committed
      As described by commit 9c4c3252 ("skbuff: preserve sock
      reference when scrubbing the skb."), orphaning a skb
      in the TX path will cause OoO.
      
      Let's use skb_orphan_partial() instead of skb_orphan(), so
      that we keep the sk around for the sake of queue selection and
      still avoid the problem fixed by commit 4bf9ffa0 ("veth:
      Orphan skb before GRO").
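      
      The substitution itself is a one-liner in the veth receive path
      (sketched):
      
          /* was: skb_orphan(skb); -- dropping the sk entirely removes
           * the input for later queue selection, hence the OoO
           */
          skb_orphan_partial(skb);  /* keep sk around for queue selection */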
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  16. 31 Mar 2021, 1 commit
  17. 18 Mar 2021, 1 commit
  18. 06 Mar 2021, 1 commit
  19. 04 Feb 2021, 1 commit
  20. 21 Jan 2021, 1 commit
  21. 09 Jan 2021, 2 commits
  22. 01 Dec 2020, 1 commit
  23. 17 Nov 2020, 1 commit
  24. 12 Oct 2020, 1 commit
    • bpf: Add redirect_peer helper · 9aa1206e
      Daniel Borkmann committed
      Add an efficient ingress to ingress netns switch that can be used from tc BPF
      programs in order to redirect traffic from host ns ingress into a container
      veth device ingress without having to go via the CPU backlog queue [0]. For
      local containers this can also be utilized, and the path via the CPU backlog
      queue only needs to be taken once, not twice. On a high level this borrows from
      ipvlan, which does a similar switch in __netif_receive_skb_core() and then
      iterates via another_round. This helps to reduce latency for the mentioned
      use cases.
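      
      A minimal tc BPF usage sketch for the new helper (the ifindex,
      program name and section name are hypothetical; assumes a
      libbpf-style build):
      
          #include <linux/bpf.h>
          #include <bpf/bpf_helpers.h>
      
          /* Attached to tc ingress on the host device: pass the
           * host-side veth ifindex; the packet is redirected to that
           * device's peer inside the container netns, ingress to
           * ingress, without a trip through the CPU backlog queue.
           */
          SEC("tc")
          int host_ingress(struct __sk_buff *skb)
          {
                  const int veth_host_ifindex = 42;  /* hypothetical */
      
                  return bpf_redirect_peer(veth_host_ifindex, 0);
          }
      
          char LICENSE[] SEC("license") = "GPL";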
      
      Pod to remote pod with redirect(), TCP_RR [1]:
      
        # percpu_netperf 10.217.1.33
                RT_LATENCY:         122.450         (per CPU:         122.666         122.401         122.333         122.401 )
              MEAN_LATENCY:         121.210         (per CPU:         121.100         121.260         121.320         121.160 )
            STDDEV_LATENCY:         120.040         (per CPU:         119.420         119.910         125.460         115.370 )
               MIN_LATENCY:          46.500         (per CPU:          47.000          47.000          47.000          45.000 )
               P50_LATENCY:         118.500         (per CPU:         118.000         119.000         118.000         119.000 )
               P90_LATENCY:         127.500         (per CPU:         127.000         128.000         127.000         128.000 )
               P99_LATENCY:         130.750         (per CPU:         131.000         131.000         129.000         132.000 )
      
          TRANSACTION_RATE:       32666.400         (per CPU:        8152.200        8169.842        8174.439        8169.897 )
      
      Pod to remote pod with redirect_peer(), TCP_RR:
      
        # percpu_netperf 10.217.1.33
                RT_LATENCY:          44.449         (per CPU:          43.767          43.127          45.279          45.622 )
              MEAN_LATENCY:          45.065         (per CPU:          44.030          45.530          45.190          45.510 )
            STDDEV_LATENCY:          84.823         (per CPU:          66.770          97.290          84.380          90.850 )
               MIN_LATENCY:          33.500         (per CPU:          33.000          33.000          34.000          34.000 )
               P50_LATENCY:          43.250         (per CPU:          43.000          43.000          43.000          44.000 )
               P90_LATENCY:          46.750         (per CPU:          46.000          47.000          47.000          47.000 )
               P99_LATENCY:          52.750         (per CPU:          51.000          54.000          53.000          53.000 )
      
          TRANSACTION_RATE:       90039.500         (per CPU:       22848.186       23187.089       22085.077       21919.130 )
      
        [0] https://linuxplumbersconf.org/event/7/contributions/674/attachments/568/1002/plumbers_2020_cilium_load_balancer.pdf
        [1] https://github.com/borkmann/netperf_scripts/blob/master/percpu_netperf
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20201010234006.7075-3-daniel@iogearbox.net
  25. 11 Sep 2020, 1 commit
    • net: remove napi_hash_del() from driver-facing API · 5198d545
      Jakub Kicinski committed
      We allow drivers to call napi_hash_del() before calling
      netif_napi_del() to batch RCU grace periods. This makes
      the API asymmetric and leaks internal implementation details.
      Soon we will want the grace period to protect more than just
      the NAPI hash table.
      
      Restructure the API and have drivers call a new function -
      __netif_napi_del() if they want to take care of RCU waits.
      
      Note that only core was checking the return status from
      napi_hash_del() so the new helper does not report if the
      NAPI was actually deleted.
      
      Some notes on driver oddness:
       - veth observed the grace period before calling netif_napi_del()
         but that should not matter
       - myri10ge observed normal RCU flavor
       - bnx2x and enic did not actually observe the grace period
         (unless they did so implicitly)
       - virtio_net and enic only unhashed Rx NAPIs
      
      The last two points seem to indicate that the calls to
      napi_hash_del() were a leftover rather than an optimization.
      Regardless, it's easy enough to correct them.
      
      This patch may introduce extra synchronize_net() calls for
      interfaces which set NAPI_STATE_NO_BUSY_POLL and depend on
      free_netdev() to call netif_napi_del(). This seems inevitable
      since we want to use RCU for netpoll dev->napi_list traversal,
      and almost no drivers set IFF_DISABLE_NETPOLL.
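      
      The driver-facing pattern after this change, as a sketch (the
      priv fields are hypothetical):
      
          /* Batch several NAPI removals under a single RCU grace period. */
          __netif_napi_del(&priv->napi_a);  /* unhash + unlink, no wait */
          __netif_napi_del(&priv->napi_b);
          synchronize_net();                /* one grace period for both */
      
          /* Drivers that do not care keep calling netif_napi_del(),
           * which is now simply __netif_napi_del() + synchronize_net().
           */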
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  26. 24 Aug 2020, 1 commit
  27. 20 Aug 2020, 1 commit
  28. 26 Jul 2020, 1 commit
  29. 02 Jun 2020, 2 commits
  30. 15 May 2020, 1 commit