1. 09 Feb, 2022 2 commits
  2. 06 Feb, 2022 1 commit
    • net: initialize init_net earlier · 9c1be193
      Committed by Eric Dumazet
      While testing a patch that will follow later
      ("net: add netns refcount tracker to struct nsproxy")
      I found that devtmpfs_init() was called before init_net
      was initialized.
      
      This is a bug, because devtmpfs_setup() calls
      ksys_unshare(CLONE_NEWNS);
      
      This has the effect of increasing init_net refcount,
      which will be later overwritten to 1, as part of setup_net(&init_net)
      
      We had too many prior patches [1] trying to work around the root cause.
      
      Really, make sure init_net is in BSS section, and that net_ns_init()
      is called earlier at boot time.
      
      Note that another patch ("vfs: add netns refcount tracker
      to struct fs_context") also will need net_ns_init() being called
      before vfs_caches_init()
      
      As a bonus, this patch saves around 4KB in .data section.
      
      [1]
      
      f8c46cb3 ("netns: do not call pernet ops for not yet set up init_net namespace")
      b5082df8 ("net: Initialise init_net.count to 1")
      734b6541 ("net: Statically initialize init_net.dev_base_head")
      
      v2: fixed a build error reported by kernel build bots (CONFIG_NET=n)
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 05 Feb, 2022 1 commit
    • net: refine dev_put()/dev_hold() debugging · 4c6c11ea
      Committed by Eric Dumazet
      We are still chasing some syzbot reports where we think a rogue dev_put()
      is called with no corresponding prior dev_hold().
      Unfortunately it eats a reference on dev->dev_refcnt taken by innocent
      dev_hold_track(), meaning that the refcount saturation splat comes
      too late to be useful.
      
      Make sure that 'not tracked' dev_put() and dev_hold() better use
      CONFIG_NET_DEV_REFCNT_TRACKER=y debug infrastructure:
      
      Prior patch in the series allowed ref_tracker_alloc() and ref_tracker_free()
      to be called with a NULL @trackerp parameter, and to use a separate refcount
      only to detect too many put() even in the following case:
      
      dev_hold_track(dev, tracker_1, GFP_ATOMIC);
       dev_hold(dev);
       dev_put(dev);
       dev_put(dev); // Should complain loudly here.
      dev_put_track(dev, tracker_1); // instead of here
      
      Add clarification about netdev_tracker_alloc() role.
      
      v2: I replaced the dev_put() in linkwatch_do_dev()
          with __dev_put() because callers called netdev_tracker_free().
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
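The idea of a separate count for untracked holds can be sketched in user space. This is a minimal model, not the kernel's ref_tracker code: `toy_dev`, `toy_tracker`, `toy_hold()` and `toy_put()` are hypothetical names, and unlike the kernel (which warns but still drops the reference) the sketch refuses the rogue put so the effect is easy to observe.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model; not the kernel's ref_tracker structures. */
struct toy_dev {
	int refcnt;     /* overall reference count (models dev->dev_refcnt) */
	int untracked;  /* holds taken without a tracker */
};

struct toy_tracker {
	bool active;    /* true while this tracker owns one reference */
};

/* Take a reference; a NULL tracker means an untracked hold. */
static void toy_hold(struct toy_dev *dev, struct toy_tracker *t)
{
	assert(dev != NULL);
	dev->refcnt++;
	if (t)
		t->active = true;
	else
		dev->untracked++;
}

/*
 * Drop a reference. Returns false (the "complain loudly" case) when the put
 * has no matching prior hold of the same kind. The bad put is refused, so it
 * cannot eat a reference owned by a tracked holder.
 */
static bool toy_put(struct toy_dev *dev, struct toy_tracker *t)
{
	assert(dev != NULL);
	if (t) {
		if (!t->active)
			return false;  /* second put on the same tracker */
		t->active = false;
	} else {
		if (dev->untracked == 0)
			return false;  /* rogue untracked put */
		dev->untracked--;
	}
	dev->refcnt--;
	return true;
}
```

With one tracked hold and one untracked hold outstanding, the second untracked `toy_put()` is the call that fails, rather than the later tracked put.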
  4. 04 Feb, 2022 1 commit
  5. 12 Jan, 2022 1 commit
  6. 10 Jan, 2022 1 commit
    • net: skb: introduce kfree_skb_reason() · c504e5c2
      Committed by Menglong Dong
      Introduce the interface kfree_skb_reason(), which passes the reason
      why the skb was dropped to the 'kfree_skb' tracepoint.
      
      Add the 'reason' field to 'trace_kfree_skb', so that users can get
      more detailed information about abnormal skbs with 'drop_monitor' or
      eBPF.
      
      All drop reasons are defined in the enum 'skb_drop_reason', and
      they are printed as strings by the 'kfree_skb' tracepoint in the
      format 'reason: XXX'.
      
      ( Maybe the reasons should be defined in a uapi header file, so that
      user space can use them? )
      Signed-off-by: Menglong Dong <imagedong@tencent.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
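The enum-to-string mapping described above can be illustrated with a small user-space sketch. The enum values and helper names here (`toy_drop_reason`, `toy_reason_str()`, `toy_format_drop()`) are hypothetical and do not reproduce the kernel's `skb_drop_reason` list or its tracepoint macros; only the 'reason: XXX' output format is taken from the commit message.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical reasons; the kernel's enum skb_drop_reason has its own list. */
enum toy_drop_reason {
	TOY_DROP_NOT_SPECIFIED,
	TOY_DROP_NO_SOCKET,
	TOY_DROP_MAX
};

/* Map a reason to its printable name, with a fallback for unknown values. */
static const char *toy_reason_str(enum toy_drop_reason r)
{
	static const char *const names[TOY_DROP_MAX] = {
		[TOY_DROP_NOT_SPECIFIED] = "NOT_SPECIFIED",
		[TOY_DROP_NO_SOCKET]     = "NO_SOCKET",
	};

	return (unsigned int)r < TOY_DROP_MAX ? names[r] : "UNKNOWN";
}

/* Format the field the way the commit message describes: "reason: XXX". */
static int toy_format_drop(char *buf, size_t len, enum toy_drop_reason r)
{
	assert(buf != NULL && len > 0);
	return snprintf(buf, len, "reason: %s", toy_reason_str(r));
}
```

Keeping the names in a table indexed by the enum means a new reason only needs one enum entry and one table entry.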
  7. 06 Jan, 2022 1 commit
  8. 18 Dec, 2021 1 commit
  9. 14 Dec, 2021 2 commits
  10. 13 Dec, 2021 1 commit
  11. 07 Dec, 2021 2 commits
    • f77159a3
    • net: add net device refcount tracker infrastructure · 4d92b95f
      Committed by Eric Dumazet
      Net devices are refcounted. Over the years we have had numerous bugs
      caused by imbalanced dev_hold() and dev_put() calls.
      
      The general idea is to be able to precisely pair each decrement with
      a corresponding prior increment. Both share a cookie, basically
      a pointer to private data storing stack traces.
      
      This patch adds dev_hold_track() and dev_put_track().
      
      To use these helpers, each data structure owning a refcount
      should also use a "netdevice_tracker" to pair the hold and put.
      
      netdevice_tracker dev_tracker;
      ...
      dev_hold_track(dev, &dev_tracker, GFP_ATOMIC);
      ...
      dev_put_track(dev, &dev_tracker);
      
      Whenever a leak happens, we will get precise stack traces
      of the point dev_hold_track() happened, at device dismantle phase.
      
      We will also get a stack trace if too many dev_put_track() for the same
      netdevice_tracker are attempted.
      
      This is guarded by CONFIG_NET_DEV_REFCNT_TRACKER option.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
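The pairing of each hold with a cookie can be modelled in user space. This is a rough sketch under stated assumptions: all names (`toy_dev`, `toy_tracker`, `toy_hold_track()`, `toy_put_track()`, `toy_first_leak()`) are hypothetical, a call-site string stands in for the stack trace the kernel records, and the fixed-size table replaces the kernel's dynamically allocated tracker list.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_TRACKERS 8

/* Hypothetical stand-in for netdevice_tracker: remembers who took the hold. */
struct toy_tracker {
	const char *site;  /* call site of the hold, instead of a stack trace */
	int slot;          /* index in the device table, -1 when not holding */
};

struct toy_dev {
	int refcnt;
	struct toy_tracker *live[TOY_MAX_TRACKERS];  /* outstanding holds */
};

/* Take a reference and record which tracker owns it. */
static int toy_hold_track(struct toy_dev *dev, struct toy_tracker *t,
			  const char *site)
{
	assert(dev != NULL && t != NULL);
	for (int i = 0; i < TOY_MAX_TRACKERS; i++) {
		if (!dev->live[i]) {
			dev->live[i] = t;
			t->site = site;
			t->slot = i;
			dev->refcnt++;
			return 0;
		}
	}
	return -1;  /* table full in this toy model */
}

/* Drop the reference owned by this tracker; -1 means "too many puts". */
static int toy_put_track(struct toy_dev *dev, struct toy_tracker *t)
{
	assert(dev != NULL && t != NULL);
	if (t->slot < 0 || t->slot >= TOY_MAX_TRACKERS || dev->live[t->slot] != t)
		return -1;
	dev->live[t->slot] = NULL;
	t->slot = -1;
	dev->refcnt--;
	return 0;
}

/* At dismantle time: call site of the first leaked hold, or NULL if none. */
static const char *toy_first_leak(const struct toy_dev *dev)
{
	for (int i = 0; i < TOY_MAX_TRACKERS; i++)
		if (dev->live[i])
			return dev->live[i]->site;
	return NULL;
}
```

Because every put names its tracker, a double put is caught at the offending call, and a leak can be attributed to the site that took the hold.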
  12. 02 Dec, 2021 1 commit
    • net: annotate data-races on txq->xmit_lock_owner · 7a10d8c8
      Committed by Eric Dumazet
      syzbot found that __dev_queue_xmit() is reading txq->xmit_lock_owner
      without annotations.
      
      No serious issue there, let's document what is happening there.
      
      BUG: KCSAN: data-race in __dev_queue_xmit / __dev_queue_xmit
      
      write to 0xffff888139d09484 of 4 bytes by interrupt on cpu 0:
       __netif_tx_unlock include/linux/netdevice.h:4437 [inline]
       __dev_queue_xmit+0x948/0xf70 net/core/dev.c:4229
       dev_queue_xmit_accel+0x19/0x20 net/core/dev.c:4265
       macvlan_queue_xmit drivers/net/macvlan.c:543 [inline]
       macvlan_start_xmit+0x2b3/0x3d0 drivers/net/macvlan.c:567
       __netdev_start_xmit include/linux/netdevice.h:4987 [inline]
       netdev_start_xmit include/linux/netdevice.h:5001 [inline]
       xmit_one+0x105/0x2f0 net/core/dev.c:3590
       dev_hard_start_xmit+0x72/0x120 net/core/dev.c:3606
       sch_direct_xmit+0x1b2/0x7c0 net/sched/sch_generic.c:342
       __dev_xmit_skb+0x83d/0x1370 net/core/dev.c:3817
       __dev_queue_xmit+0x590/0xf70 net/core/dev.c:4194
       dev_queue_xmit+0x13/0x20 net/core/dev.c:4259
       neigh_hh_output include/net/neighbour.h:511 [inline]
       neigh_output include/net/neighbour.h:525 [inline]
       ip6_finish_output2+0x995/0xbb0 net/ipv6/ip6_output.c:126
       __ip6_finish_output net/ipv6/ip6_output.c:191 [inline]
       ip6_finish_output+0x444/0x4c0 net/ipv6/ip6_output.c:201
       NF_HOOK_COND include/linux/netfilter.h:296 [inline]
       ip6_output+0x10e/0x210 net/ipv6/ip6_output.c:224
       dst_output include/net/dst.h:450 [inline]
       NF_HOOK include/linux/netfilter.h:307 [inline]
       ndisc_send_skb+0x486/0x610 net/ipv6/ndisc.c:508
       ndisc_send_rs+0x3b0/0x3e0 net/ipv6/ndisc.c:702
       addrconf_rs_timer+0x370/0x540 net/ipv6/addrconf.c:3898
       call_timer_fn+0x2e/0x240 kernel/time/timer.c:1421
       expire_timers+0x116/0x240 kernel/time/timer.c:1466
       __run_timers+0x368/0x410 kernel/time/timer.c:1734
       run_timer_softirq+0x2e/0x60 kernel/time/timer.c:1747
       __do_softirq+0x158/0x2de kernel/softirq.c:558
       __irq_exit_rcu kernel/softirq.c:636 [inline]
       irq_exit_rcu+0x37/0x70 kernel/softirq.c:648
       sysvec_apic_timer_interrupt+0x3e/0xb0 arch/x86/kernel/apic/apic.c:1097
       asm_sysvec_apic_timer_interrupt+0x12/0x20
      
      read to 0xffff888139d09484 of 4 bytes by interrupt on cpu 1:
       __dev_queue_xmit+0x5e3/0xf70 net/core/dev.c:4213
       dev_queue_xmit_accel+0x19/0x20 net/core/dev.c:4265
       macvlan_queue_xmit drivers/net/macvlan.c:543 [inline]
       macvlan_start_xmit+0x2b3/0x3d0 drivers/net/macvlan.c:567
       __netdev_start_xmit include/linux/netdevice.h:4987 [inline]
       netdev_start_xmit include/linux/netdevice.h:5001 [inline]
       xmit_one+0x105/0x2f0 net/core/dev.c:3590
       dev_hard_start_xmit+0x72/0x120 net/core/dev.c:3606
       sch_direct_xmit+0x1b2/0x7c0 net/sched/sch_generic.c:342
       __dev_xmit_skb+0x83d/0x1370 net/core/dev.c:3817
       __dev_queue_xmit+0x590/0xf70 net/core/dev.c:4194
       dev_queue_xmit+0x13/0x20 net/core/dev.c:4259
       neigh_resolve_output+0x3db/0x410 net/core/neighbour.c:1523
       neigh_output include/net/neighbour.h:527 [inline]
       ip6_finish_output2+0x9be/0xbb0 net/ipv6/ip6_output.c:126
       __ip6_finish_output net/ipv6/ip6_output.c:191 [inline]
       ip6_finish_output+0x444/0x4c0 net/ipv6/ip6_output.c:201
       NF_HOOK_COND include/linux/netfilter.h:296 [inline]
       ip6_output+0x10e/0x210 net/ipv6/ip6_output.c:224
       dst_output include/net/dst.h:450 [inline]
       NF_HOOK include/linux/netfilter.h:307 [inline]
       ndisc_send_skb+0x486/0x610 net/ipv6/ndisc.c:508
       ndisc_send_rs+0x3b0/0x3e0 net/ipv6/ndisc.c:702
       addrconf_rs_timer+0x370/0x540 net/ipv6/addrconf.c:3898
       call_timer_fn+0x2e/0x240 kernel/time/timer.c:1421
       expire_timers+0x116/0x240 kernel/time/timer.c:1466
       __run_timers+0x368/0x410 kernel/time/timer.c:1734
       run_timer_softirq+0x2e/0x60 kernel/time/timer.c:1747
       __do_softirq+0x158/0x2de kernel/softirq.c:558
       __irq_exit_rcu kernel/softirq.c:636 [inline]
       irq_exit_rcu+0x37/0x70 kernel/softirq.c:648
       sysvec_apic_timer_interrupt+0x8d/0xb0 arch/x86/kernel/apic/apic.c:1097
       asm_sysvec_apic_timer_interrupt+0x12/0x20
       kcsan_setup_watchpoint+0x94/0x420 kernel/kcsan/core.c:443
       folio_test_anon include/linux/page-flags.h:581 [inline]
       PageAnon include/linux/page-flags.h:586 [inline]
       zap_pte_range+0x5ac/0x10e0 mm/memory.c:1347
       zap_pmd_range mm/memory.c:1467 [inline]
       zap_pud_range mm/memory.c:1496 [inline]
       zap_p4d_range mm/memory.c:1517 [inline]
       unmap_page_range+0x2dc/0x3d0 mm/memory.c:1538
       unmap_single_vma+0x157/0x210 mm/memory.c:1583
       unmap_vmas+0xd0/0x180 mm/memory.c:1615
       exit_mmap+0x23d/0x470 mm/mmap.c:3170
       __mmput+0x27/0x1b0 kernel/fork.c:1113
       mmput+0x3d/0x50 kernel/fork.c:1134
       exit_mm+0xdb/0x170 kernel/exit.c:507
       do_exit+0x608/0x17a0 kernel/exit.c:819
       do_group_exit+0xce/0x180 kernel/exit.c:929
       get_signal+0xfc3/0x1550 kernel/signal.c:2852
       arch_do_signal_or_restart+0x8c/0x2e0 arch/x86/kernel/signal.c:868
       handle_signal_work kernel/entry/common.c:148 [inline]
       exit_to_user_mode_loop kernel/entry/common.c:172 [inline]
       exit_to_user_mode_prepare+0x113/0x190 kernel/entry/common.c:207
       __syscall_exit_to_user_mode_work kernel/entry/common.c:289 [inline]
       syscall_exit_to_user_mode+0x20/0x40 kernel/entry/common.c:300
       do_syscall_64+0x50/0xd0 arch/x86/entry/common.c:86
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      value changed: 0x00000000 -> 0xffffffff
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 1 PID: 28712 Comm: syz-executor.0 Tainted: G        W         5.16.0-rc1-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Link: https://lore.kernel.org/r/20211130170155.2331929-1-eric.dumazet@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  13. 29 Nov, 2021 1 commit
    • net: Write lock dev_base_lock without disabling bottom halves. · fd888e85
      Committed by Sebastian Andrzej Siewior
      The writer acquires dev_base_lock with disabled bottom halves.
      The reader can acquire dev_base_lock without disabling bottom halves
      because there is no writer in softirq context.
      
      On PREEMPT_RT the softirqs are preemptible and local_bh_disable() acts
      as a lock to ensure that resources, that are protected by disabling
      bottom halves, remain protected.
      This leads to a circular locking dependency if the lock is acquired
      with disabled bottom halves (as in write_lock_bh()) and somewhere else
      with enabled bottom halves (as by read_lock() in netstat_show())
      followed by disabling bottom halves (cxgb_get_stats() ->
      t4_wr_mbox_meat_timeout() -> spin_lock_bh()). This is the reverse
      locking order.
      
      All read_lock() invocations are from sysfs callbacks, which are not
      invoked from softirq context. Therefore there is no need to disable
      bottom halves while acquiring a write lock.
      
      Acquire the write lock of dev_base_lock without disabling bottom halves.
      Reported-by: Pei Zhang <pezhang@redhat.com>
      Reported-by: Luis Claudio R. Goncalves <lgoncalv@redhat.com>
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  14. 23 Nov, 2021 1 commit
    • net: remove .ndo_change_proto_down · 2106efda
      Committed by Jakub Kicinski
      .ndo_change_proto_down was added seemingly to enable out-of-tree
      implementations. Over 2.5yrs later we still have no real users
      upstream. Hardwire the generic implementation for now, we can
      revert once real users materialize. (rocker is a test vehicle,
      not a user.)
      
      We need to drop the optimization on the sysfs side because,
      unlike ndos, priv_flags will be changed at runtime, so we'd
      need READ_ONCE/WRITE_ONCE everywhere.
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  15. 22 Nov, 2021 1 commit
  16. 20 Nov, 2021 1 commit
  17. 16 Nov, 2021 1 commit
  18. 11 Nov, 2021 1 commit
    • net: fix premature exit from NAPI state polling in napi_disable() · 0315a075
      Committed by Alexander Lobakin
      Commit 719c5719 ("net: make napi_disable() symmetric with
      enable") accidentally introduced a bug sometimes leading to a kernel
      BUG when bringing an iface up/down under heavy traffic load.
      
      Prior to this commit, napi_disable() was polling n->state until
      none of (NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC) was set and then
      always flipped them. Now there's a possibility to get away with
      NAPIF_STATE_SCHED unset, as 'continue' drops us to the cmpxchg()
      call with an uninitialized variable, rather than straight to
      another round of the state check.
      
      Error path looks like:
      
      napi_disable():
      unsigned long val, new; /* new is uninitialized */
      
      do {
      	val = READ_ONCE(n->state); /* NAPIF_STATE_NPSVC and/or
      				      NAPIF_STATE_SCHED is set */
      	if (val & (NAPIF_STATE_SCHED | NAPIF_STATE_NPSVC)) { /* true */
      		usleep_range(20, 200);
      		continue; /* go straight to the condition check */
      	}
      	new = val | <...>
      } while (cmpxchg(&n->state, val, new) != val); /* state == val, cmpxchg()
      						  writes garbage */
      
      napi_enable():
      do {
      	val = READ_ONCE(n->state);
      	BUG_ON(!test_bit(NAPI_STATE_SCHED, &val)); /* 50/50 boom */
      <...>
      
      while the typical BUG splat is like:
      
      [  172.652461] ------------[ cut here ]------------
      [  172.652462] kernel BUG at net/core/dev.c:6937!
      [  172.656914] invalid opcode: 0000 [#1] PREEMPT SMP PTI
      [  172.661966] CPU: 36 PID: 2829 Comm: xdp_redirect_cp Tainted: G          I       5.15.0 #42
      [  172.670222] Hardware name: Intel Corporation S2600WFT/S2600WFT, BIOS SE5C620.86B.02.01.0014.082620210524 08/26/2021
      [  172.680646] RIP: 0010:napi_enable+0x5a/0xd0
      [  172.684832] Code: 07 49 81 cc 00 01 00 00 4c 89 e2 48 89 d8 80 e6 fb f0 48 0f b1 55 10 48 39 c3 74 10 48 8b 5d 10 f6 c7 04 75 3d f6 c3 01 75 b4 <0f> 0b 5b 5d 41 5c c3 65 ff 05 b8 e5 61 53 48 c7 c6 c0 f3 34 ad 48
      [  172.703578] RSP: 0018:ffffa3c9497477a8 EFLAGS: 00010246
      [  172.708803] RAX: ffffa3c96615a014 RBX: 0000000000000000 RCX: ffff8a4b575301a0
      < snip >
      [  172.782403] Call Trace:
      [  172.784857]  <TASK>
      [  172.786963]  ice_up_complete+0x6f/0x210 [ice]
      [  172.791349]  ice_xdp+0x136/0x320 [ice]
      [  172.795108]  ? ice_change_mtu+0x180/0x180 [ice]
      [  172.799648]  dev_xdp_install+0x61/0xe0
      [  172.803401]  dev_xdp_attach+0x1e0/0x550
      [  172.807240]  dev_change_xdp_fd+0x1e6/0x220
      [  172.811338]  do_setlink+0xee8/0x1010
      [  172.814917]  rtnl_setlink+0xe5/0x170
      [  172.818499]  ? bpf_lsm_binder_set_context_mgr+0x10/0x10
      [  172.823732]  ? security_capable+0x36/0x50
      < snip >
      
      Fix this by replacing 'do { } while (cmpxchg())' with an "infinite"
      for-loop with an explicit break.
      
      From v1 [0]:
       - just use a for-loop to simplify both the fix and the existing
         code (Eric).
      
      [0] https://lore.kernel.org/netdev/20211110191126.1214-1-alexandr.lobakin@intel.com
      
      Fixes: 719c5719 ("net: make napi_disable() symmetric with enable")
      Suggested-by: Eric Dumazet <edumazet@google.com> # for-loop
      Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
      Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20211110195605.1304-1-alexandr.lobakin@intel.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
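The shape of the for-loop fix can be sketched with C11 atomics in user space. This is an illustration only, not the kernel function: `toy_disable()`, the `TOY_*` bit values and the `max_spins` bound are assumptions added so the example terminates in a single-threaded run, whereas the kernel sleeps with usleep_range() and keeps waiting.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical bit values; the kernel's NAPIF_STATE_* masks differ. */
#define TOY_SCHED 0x1UL
#define TOY_NPSVC 0x2UL

/*
 * Wait until neither bit is set, then set both atomically.
 * Returns 0 on success, -1 if the bits never cleared within max_spins.
 */
static int toy_disable(atomic_ulong *state, int max_spins)
{
	unsigned long val, new;

	assert(state != NULL);
	for (;;) {
		val = atomic_load(state);
		if (val & (TOY_SCHED | TOY_NPSVC)) {
			if (max_spins-- <= 0)
				return -1;
			/* In a for-loop, 'continue' re-reads the state and
			 * can never fall through to the compare-exchange. */
			continue;
		}
		new = val | TOY_SCHED | TOY_NPSVC;
		if (atomic_compare_exchange_strong(state, &val, new))
			break;  /* explicit exit only after a successful swap */
	}
	return 0;
}
```

In a `do { } while (cmpxchg())` loop, `continue` jumps to the loop condition, which is where the uninitialized `new` was used; the for-loop has no such condition to jump to.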
  19. 27 Oct, 2021 1 commit
  20. 26 Oct, 2021 1 commit
    • net: multicast: calculate csum of looped-back and forwarded packets · 9122a70a
      Committed by Cyril Strejc
      While testing a user-space application which transmits UDP
      multicast datagrams and uses multicast routing to send them out of
      defined network interfaces, I found that a multicast router does
      not fill in the UDP checksum of locally produced, looped-back and
      forwarded UDP datagrams if the original output NIC the datagrams
      are sent to has UDP TX checksum offload enabled.
      
      The datagrams leave the NIC they have been forwarded to malformed.
      
      This happens because:
      
      1. If TX checksum offload is enabled on the output NIC, UDP checksum
         is not calculated by kernel and is not filled into skb data.
      
      2. dev_loopback_xmit(), which is called solely by
         ip_mc_finish_output(), sets skb->ip_summed = CHECKSUM_UNNECESSARY
         unconditionally.
      
      3. Since 35fc92a9 ("[NET]: Allow forwarding of ip_summed except
         CHECKSUM_COMPLETE"), the ip_summed value is preserved during
         forwarding.
      
      4. If ip_summed != CHECKSUM_PARTIAL, checksum is not calculated during
         a packet egress.
      
      The minimum fix in dev_loopback_xmit():
      
      1. Preserves skb->ip_summed CHECKSUM_PARTIAL. This is the
         case when the original output NIC has TX checksum offload enabled.
         The effects are:
      
           a) If the forwarding destination interface supports TX checksum
              offloading, the NIC driver is responsible to fill-in the
              checksum.
      
           b) If the forwarding destination interface does NOT support TX
              checksum offloading, checksums are filled-in by kernel before
              skb is submitted to the NIC driver.
      
           c) For local delivery, checksum validation is skipped as in the
              case of CHECKSUM_UNNECESSARY, thanks to skb_csum_unnecessary().
      
      2. Translates ip_summed CHECKSUM_NONE to CHECKSUM_UNNECESSARY. It
         means, for CHECKSUM_NONE, the behavior is unmodified and is there
         to skip a looped-back packet local delivery checksum validation.
      Signed-off-by: Cyril Strejc <cyril.strejc@skoda.cz>
      Reviewed-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
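The ip_summed translation described in the two numbered points above can be sketched as a pure function. This is a simplified user-space model: the `toy_*` names are hypothetical, and the egress helper only encodes the commit message's point 4 (a checksum is filled in on egress only for CHECKSUM_PARTIAL).

```c
#include <assert.h>
#include <stdbool.h>

/* Toy mirror of the four ip_summed states named in the commit message. */
enum toy_csum {
	TOY_CSUM_NONE,
	TOY_CSUM_UNNECESSARY,
	TOY_CSUM_COMPLETE,
	TOY_CSUM_PARTIAL
};

static_assert(TOY_CSUM_NONE == 0, "NONE is the zero/default state in this model");

/* Old behaviour: every looped-back packet became UNNECESSARY. */
static enum toy_csum toy_loopback_old(enum toy_csum in)
{
	(void)in;
	return TOY_CSUM_UNNECESSARY;
}

/* Fixed behaviour: only NONE is translated; PARTIAL is preserved. */
static enum toy_csum toy_loopback_fixed(enum toy_csum in)
{
	return in == TOY_CSUM_NONE ? TOY_CSUM_UNNECESSARY : in;
}

/* Point 4: a checksum is filled in on egress only for PARTIAL. */
static bool toy_csum_filled_on_egress(enum toy_csum s)
{
	return s == TOY_CSUM_PARTIAL;
}
```

With the old translation a packet that started as PARTIAL loses the marker that would trigger checksum calculation when it is forwarded; with the fixed translation it keeps it.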
  21. 25 Oct, 2021 1 commit
    • net: Prevent infinite while loop in skb_tx_hash() · 0c57eeec
      Committed by Michael Chan
      Drivers call netdev_set_num_tc() and then netdev_set_tc_queue()
      to set the queue count and offset for each TC.  So the queue count
      and offset for the TCs may be zero for a short period after dev->num_tc
      has been set.  If a TX packet is being transmitted at this time in the
      code path netdev_pick_tx() -> skb_tx_hash(), skb_tx_hash() may see
      nonzero dev->num_tc but zero qcount for the TC.  The while loop that
      keeps looping while hash >= qcount will not end.
      
      Fix it by checking the TC's qcount to be nonzero before using it.
      
      Fixes: eadec877 ("net: Add support for subordinate traffic classes to netdev_pick_tx")
      Reviewed-by: Andy Gospodarek <gospo@broadcom.com>
      Signed-off-by: Michael Chan <michael.chan@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
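A user-space sketch of the guarded loop follows. The names (`toy_tc_txq`, `toy_pick_queue()`) are hypothetical, and falling back to the full range of real TX queues when the TC's count is zero is an assumption made for illustration; the commit message only states that qcount is checked to be nonzero before use.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-TC queue range: 'count' queues starting at 'offset'. */
struct toy_tc_txq {
	uint16_t count;
	uint16_t offset;
};

/* Reduce 'hash' into the TC's queue range without looping forever. */
static uint16_t toy_pick_queue(uint16_t hash, struct toy_tc_txq tc,
			       uint16_t real_num_tx_queues)
{
	uint16_t qoffset = tc.offset;
	uint16_t qcount = tc.count;

	assert(real_num_tx_queues > 0);
	if (qcount == 0) {
		/* TC not configured yet: without this guard the loop below
		 * would spin, because hash >= 0 is always true. */
		qoffset = 0;
		qcount = real_num_tx_queues;
	}
	while (hash >= qcount)
		hash -= qcount;
	return (uint16_t)(hash + qoffset);
}
```

The guard matters only in the short window between netdev_set_num_tc() and netdev_set_tc_queue(), which is exactly when a concurrent transmit could observe a zero count.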
  22. 20 Oct, 2021 1 commit
    • net-core: use netdev_* calls for kernel messages · 5b92be64
      Committed by Jesse Brandeburg
      While loading a driver and changing the number of queues, I noticed this
      message in the kernel log:
      
      "[253489.070080] Number of in use tx queues changed invalidating tc
      mappings. Priority traffic classification disabled!"
      
      But I had no idea what interface was being talked about because this
      message used pr_warn().
      
      After investigating, it appears we can use the netdev_* helpers already
      defined to create predictably formatted messages, and that already handle
      <unknown netdev> cases, in more of the messages in dev.c.
      
      After this change, this message (and others) will look like this:
      "[  170.181093] ice 0000:3b:00.0 ens785f0: Number of in use tx queues
      changed invalidating tc mappings. Priority traffic classification
      disabled!"
      
      One goal here was not to change the message significantly from the
      original format so as to not break user's expectations, so I just
      changed messages that used pr_* and generally started with %s ==
      dev->name.
      Signed-off-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  23. 15 Oct, 2021 3 commits
    • netfilter: Introduce egress hook · 42df6e1d
      Committed by Lukas Wunner
      Support classifying packets with netfilter on egress to satisfy user
      requirements such as:
      * outbound security policies for containers (Laura)
      * filtering and mangling intra-node Direct Server Return (DSR) traffic
        on a load balancer (Laura)
      * filtering locally generated traffic coming in through AF_PACKET,
        such as local ARP traffic generated for clustering purposes or DHCP
        (Laura; the AF_PACKET plumbing is contained in a follow-up commit)
      * L2 filtering from ingress and egress for AVB (Audio Video Bridging)
        and gPTP with nftables (Pablo)
      * in the future: in-kernel NAT64/NAT46 (Pablo)
      
      The egress hook introduced herein complements the ingress hook added by
      commit e687ad60 ("netfilter: add netfilter ingress hook after
      handle_ing() under unique static key").  A patch for nftables to hook up
      egress rules from user space has been submitted separately, so users may
      immediately take advantage of the feature.
      
      Alternatively or in addition to netfilter, packets can be classified
      with traffic control (tc).  On ingress, packets are classified first by
      tc, then by netfilter.  On egress, the order is reversed for symmetry.
      Conceptually, tc and netfilter can be thought of as layers, with
      netfilter layered above tc.
      
      Traffic control is capable of redirecting packets to another interface
      (man 8 tc-mirred).  E.g., an ingress packet may be redirected from the
      host namespace to a container via a veth connection:
      tc ingress (host) -> tc egress (veth host) -> tc ingress (veth container)
      
      In this case, netfilter egress classifying is not performed when leaving
      the host namespace!  That's because the packet is still on the tc layer.
      If tc redirects the packet to a physical interface in the host namespace
      such that it leaves the system, the packet is never subjected to
      netfilter egress classifying.  That is only logical since it hasn't
      passed through netfilter ingress classifying either.
      
      Packets can alternatively be redirected at the netfilter layer using
      nft fwd.  Such a packet *is* subjected to netfilter egress classifying
      since it has reached the netfilter layer.
      
      Internally, the skb->nf_skip_egress flag controls whether netfilter is
      invoked on egress by __dev_queue_xmit().  Because __dev_queue_xmit() may
      be called recursively by tunnel drivers such as vxlan, the flag is
      reverted to false after sch_handle_egress().  This ensures that
      netfilter is applied both on the overlay and underlying network.
      
      Interaction between tc and netfilter is possible by setting and querying
      skb->mark.
      
      If netfilter egress classifying is not enabled on any interface, it is
      patched out of the data path by way of a static_key and doesn't make a
      performance difference that is discernible from noise:
      
      Before:             1537 1538 1538 1537 1538 1537 Mb/sec
      After:              1536 1534 1539 1539 1539 1540 Mb/sec
      Before + tc accept: 1418 1418 1418 1419 1419 1418 Mb/sec
      After  + tc accept: 1419 1424 1418 1419 1422 1420 Mb/sec
      Before + tc drop:   1620 1619 1619 1619 1620 1620 Mb/sec
      After  + tc drop:   1616 1624 1625 1624 1622 1619 Mb/sec
      
      When netfilter egress classifying is enabled on at least one interface,
      a minimal performance penalty is incurred for every egress packet, even
      if the interface it's transmitted over doesn't have any netfilter egress
      rules configured.  That is caused by checking dev->nf_hooks_egress
      against NULL.
      
      Measurements were performed on a Core i7-3615QM.  Commands to reproduce:
      ip link add dev foo type dummy
      ip link set dev foo up
      modprobe pktgen
      echo "add_device foo" > /proc/net/pktgen/kpktgend_3
      samples/pktgen/pktgen_bench_xmit_mode_queue_xmit.sh -i foo -n 400000000 -m "11:11:11:11:11:11" -d 1.1.1.1
      
      Accept all traffic with tc:
      tc qdisc add dev foo clsact
      tc filter add dev foo egress bpf da bytecode '1,6 0 0 0,'
      
      Drop all traffic with tc:
      tc qdisc add dev foo clsact
      tc filter add dev foo egress bpf da bytecode '1,6 0 0 2,'
      
      Apply this patch when measuring packet drops to avoid errors in dmesg:
      https://lore.kernel.org/netdev/a73dda33-57f4-95d8-ea51-ed483abd6a7a@iogearbox.net/
      Signed-off-by: Lukas Wunner <lukas@wunner.de>
      Cc: Laura García Liébana <nevola@gmail.com>
      Cc: John Fastabend <john.fastabend@gmail.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Thomas Graf <tgraf@suug.ch>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: Generalize ingress hook include file · 17d20784
      Committed by Lukas Wunner
      Prepare for addition of a netfilter egress hook by generalizing the
      ingress hook include file.
      
      No functional change intended.
      Signed-off-by: Lukas Wunner <lukas@wunner.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: Rename ingress hook include file · 7463acfb
      Committed by Lukas Wunner
      Prepare for addition of a netfilter egress hook by renaming
      <linux/netfilter_ingress.h> to <linux/netfilter_netdev.h>.
      
      The egress hook also necessitates a refactoring of the include file,
      but that is done in a separate commit to ease reviewing.
      
      No functional change intended.
      Signed-off-by: Lukas Wunner <lukas@wunner.de>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
  24. 10 Oct, 2021 1 commit
  25. 09 Oct, 2021 1 commit
    • net: introduce a function to check if a netdev name is in use · 75ea27d0
      Committed by Antoine Tenart
      __dev_get_by_name is currently used either to retrieve a net device
      reference by name or to check whether a name is already used by a
      registered net device (per ns). In the latter case there is no need to
      return a reference to a net device.
      
      Introduce a new helper, netdev_name_in_use, to check whether a name is
      currently used by a registered net device without leaking a reference
      to the corresponding net device. This helper uses netdev_name_node_lookup
      instead of __dev_get_by_name as we don't need the extra logic retrieving
      a reference to the corresponding net device.
      Signed-off-by: Antoine Tenart <atenart@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  26. 02 Oct, 2021 1 commit
  27. 27 Sep, 2021 1 commit
  28. 20 Sep, 2021 1 commit
    • napi: fix race inside napi_enable · 3765996e
      Committed by Xuan Zhuo
      The sequence below leaves NAPI_STATE_SCHED set in napi.state while the
      napi is not on the poll_list, which causes napi_disable() to get stuck.
      
      The prefix "NAPI_STATE_" is removed in the figure below, and
      NAPI_STATE_HASHED is ignored in napi.state.
      
                            CPU0       |                   CPU1       | napi.state
      ===============================================================================
      napi_disable()                   |                              | SCHED | NPSVC
      napi_enable()                    |                              |
      {                                |                              |
          smp_mb__before_atomic();     |                              |
          clear_bit(SCHED, &n->state); |                              | NPSVC
                                       | napi_schedule_prep()         | SCHED | NPSVC
                                       | napi_poll()                  |
                                       |   napi_complete_done()       |
                                       |   {                          |
                                       |      if (n->state & (NPSVC | | (1)
                                       |               _BUSY_POLL)))  |
                                       |           return false;      |
                                       |     ................         |
                                       |   }                          | SCHED | NPSVC
                                       |                              |
          clear_bit(NPSVC, &n->state); |                              | SCHED
      }                                |                              |
                                       |                              |
      napi_schedule_prep()             |                              | SCHED | MISSED (2)
      
      (1) Returns immediately, because NAPI_STATE_NPSVC is set.
      (2) NAPI_STATE_SCHED is already set, so napi.poll_list is not added to
          sd->poll_list.
      
      Since NAPI_STATE_SCHED is already set and the napi is not in the
      sd->poll_list queue, NAPI_STATE_SCHED cannot be cleared and stays set
      forever.
      
      1. This will cause this queue to no longer receive packets.
      2. If napi_disable() is then called under rtnl_lock, it blocks while
         holding rtnl_lock, affecting the overall system.
      
      This patch uses cmpxchg to implement napi_enable(), which ensures that
      there is no race caused by clearing the two bits separately.
      
      Fixes: 2d8bff12 ("netpoll: Close race condition between poll_one_napi and napi_disable")
      Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
      Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
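Clearing both bits in one atomic step, instead of two separate clear_bit() calls, can be sketched with C11 atomics. This is a user-space illustration: `toy_enable()` and the `TOY_*` bit values are hypothetical, and the function returns -1 where the kernel would hit BUG_ON().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical bit values; the kernel's NAPIF_STATE_* masks differ. */
#define TOY_SCHED 0x1UL
#define TOY_NPSVC 0x2UL

/*
 * Clear SCHED and NPSVC together. No other thread can observe the state
 * with only one of the two bits cleared, which closes the window shown in
 * the figure above. Returns -1 if SCHED was not set on entry.
 */
static int toy_enable(atomic_ulong *state)
{
	unsigned long val, new;

	assert(state != NULL);
	val = atomic_load(state);
	do {
		if (!(val & TOY_SCHED))
			return -1;
		new = val & ~(TOY_SCHED | TOY_NPSVC);
		/* On failure, 'val' is refreshed with the current state
		 * and the checks above run again. */
	} while (!atomic_compare_exchange_weak(state, &val, new));
	return 0;
}
```

Other bits in the state word are carried over unchanged, since `new` is derived from the value that the compare-exchange verifies.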
  29. 15 Sep, 2021 1 commit
    • net: sched: update default qdisc visibility after Tx queue cnt changes · 1e080f17
      Committed by Jakub Kicinski
      mq / mqprio make the default child qdiscs visible. They only do
      so for the qdiscs which are within real_num_tx_queues when the
      device is registered. Depending on order of calls in the driver,
      or if user space changes config via ethtool -L the number of
      qdiscs visible under tc qdisc show will differ from the number
      of queues. This is confusing to users and potentially to system
      configuration scripts which try to make sure qdiscs have the
      right parameters.
      
      Add a new Qdisc_ops callback and make relevant qdiscs TTRT.
      
      Note that this uncovers the "shortcut" created by
      commit 1f27cde3 ("net: sched: use pfifo_fast for non real queues")
      The default child qdiscs beyond initial real_num_tx are always
      pfifo_fast, no matter what the sysfs setting is. Fixing this
      gets a little tricky because we'd need to keep a reference
      on whatever the default qdisc was at the time of creation.
      In practice this is likely a non-issue: the qdiscs likely have
      to be configured to non-default settings, so whatever user space
      is doing such configuration can replace the pfifos... now that
      it will see them.
      Reported-by: Matthew Massey <matthewmassey@fb.com>
      Reviewed-by: Dave Taht <dave.taht@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  30. 14 Aug, 2021 1 commit
  31. 10 Aug, 2021 2 commits
  32. 05 Aug, 2021 2 commits
  33. 04 Aug, 2021 1 commit
    • net: add netif_set_real_num_queues() for device reconfig · 271e5b7d
      Committed by Jakub Kicinski
      netif_set_real_num_rx_queues() and netif_set_real_num_tx_queues()
      can fail which breaks drivers trying to implement reconfiguration
      in a way that can't leave the device half-broken. In other words
      those functions are incompatible with prepare/commit approach.
      
      Luckily setting real number of queues can fail only if the number
      is increased, meaning that if we order operations correctly we
      can guarantee ending up with either new config (success), or
      the old one (on error).
      
      Provide a helper implementing such logic so that drivers don't
      have to duplicate it.
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
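The ordering argument (do increases first, so that the error path only needs decreases, which cannot fail) can be sketched in user space. Everything here is a toy model: `toy_dev`, the setters and the `fail_*_grow` knobs are hypothetical and exist only to simulate a failing increase.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_dev {
	unsigned int rxq, txq;            /* current real queue counts */
	bool fail_rx_grow, fail_tx_grow;  /* knobs: make an increase fail */
};

/* Toy setters: only an increase can fail, as the commit message states. */
static int toy_set_rx(struct toy_dev *d, unsigned int n)
{
	if (n > d->rxq && d->fail_rx_grow)
		return -1;
	d->rxq = n;
	return 0;
}

static int toy_set_tx(struct toy_dev *d, unsigned int n)
{
	if (n > d->txq && d->fail_tx_grow)
		return -1;
	d->txq = n;
	return 0;
}

/* End with either the new config (return 0) or the old one (error). */
static int toy_set_real_num_queues(struct toy_dev *d, unsigned int txq,
				   unsigned int rxq)
{
	unsigned int old_rxq;
	int err;

	assert(d != NULL);
	old_rxq = d->rxq;

	/* Step 1: increases, the only operations that can fail. */
	if (rxq > d->rxq) {
		err = toy_set_rx(d, rxq);
		if (err)
			return err;
	}
	if (txq > d->txq) {
		err = toy_set_tx(d, txq);
		if (err) {
			/* Undo a possible RX increase; shrinking cannot fail. */
			toy_set_rx(d, old_rxq);
			return err;
		}
	}

	/* Step 2: decreases, which cannot fail. */
	if (rxq < d->rxq)
		toy_set_rx(d, rxq);
	if (txq < d->txq)
		toy_set_tx(d, txq);
	return 0;
}
```

If the TX increase fails after the RX increase succeeded, the rollback restores the old RX count, so the caller never sees a half-applied configuration.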