1. 21 Jul 2020: 1 commit
  2. 17 Jul 2020: 1 commit
  3. 14 Jul 2020: 2 commits
  4. 30 Jun 2020: 1 commit
    • net: sched: Pass root lock to Qdisc_ops.enqueue · aebe4426
      By Petr Machata
      A following patch introduces qevents, points in the qdisc algorithm where
      a packet can be processed by user-defined filters. Should this processing
      lead to a situation where a new packet is to be enqueued on the same port,
      holding the root lock would lead to deadlocks. To solve the issue, the
      qevent handler needs to unlock and relock the root lock when necessary.
      
      To that end, add the root lock argument to the qdisc op enqueue, and
      propagate it throughout.
      Signed-off-by: Petr Machata <petrm@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      aebe4426
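      For illustration, a minimal sketch of an enqueue handler using the new root-lock
      argument. The example_*() names and the qevent predicate are hypothetical; only
      the extra spinlock_t *root_lock parameter reflects the interface change described
      above.
      
      #include <linux/spinlock.h>
      #include <linux/skbuff.h>
      #include <net/sch_generic.h>
      
      /* Hypothetical helpers standing in for a user-configured qevent. */
      static bool example_qevent_needed(const struct sk_buff *skb) { return false; }
      static void example_run_qevent(struct sk_buff *skb) { }
      
      /* Sketch of an enqueue op with the new root_lock parameter: drop and
       * re-take the root lock around processing that may enqueue another
       * packet on the same port, to avoid the deadlock described above.
       */
      static int example_enqueue(struct sk_buff *skb, struct Qdisc *sch,
                                 spinlock_t *root_lock, struct sk_buff **to_free)
      {
              if (example_qevent_needed(skb)) {
                      spin_unlock(root_lock);
                      example_run_qevent(skb);
                      spin_lock(root_lock);
              }
              return qdisc_enqueue_tail(skb, sch);
      }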
  5. 23 Jun 2020: 1 commit
  6. 19 Jun 2020: 3 commits
    • net: increment xmit_recursion level in dev_direct_xmit() · 0ad6f6e7
      By Eric Dumazet
      Back in commit f60e5990 ("ipv6: protect skb->sk accesses
      from recursive dereference inside the stack"), Hannes added code
      so that the IPv6 stack would not trust skb->sk for the typical case
      where a packet goes through the 'standard' xmit path (__dev_queue_xmit()).
      
      Alas, af_packet had a dev_direct_xmit() path that was not yet
      dealing with the xmit_recursion level.
      
      Also change sk_mc_loop() to dump a stack trace only once.
      
      Without this patch, syzbot was able to trigger:
      
      [1]
      [  153.567378] WARNING: CPU: 7 PID: 11273 at net/core/sock.c:721 sk_mc_loop+0x51/0x70
      [  153.567378] Modules linked in: nfnetlink ip6table_raw ip6table_filter iptable_raw iptable_nat nf_nat nf_conntrack nf_defrag_ipv4 nf_defrag_ipv6 iptable_filter macsec macvtap tap macvlan 8021q hsr wireguard libblake2s blake2s_x86_64 libblake2s_generic udp_tunnel ip6_udp_tunnel libchacha20poly1305 poly1305_x86_64 chacha_x86_64 libchacha curve25519_x86_64 libcurve25519_generic netdevsim batman_adv dummy team bridge stp llc w1_therm wire i2c_mux_pca954x i2c_mux cdc_acm ehci_pci ehci_hcd mlx4_en mlx4_ib ib_uverbs ib_core mlx4_core
      [  153.567386] CPU: 7 PID: 11273 Comm: b159172088 Not tainted 5.8.0-smp-DEV #273
      [  153.567387] RIP: 0010:sk_mc_loop+0x51/0x70
      [  153.567388] Code: 66 83 f8 0a 75 24 0f b6 4f 12 b8 01 00 00 00 31 d2 d3 e0 a9 bf ef ff ff 74 07 48 8b 97 f0 02 00 00 0f b6 42 3a 83 e0 01 5d c3 <0f> 0b b8 01 00 00 00 5d c3 0f b6 87 18 03 00 00 5d c0 e8 04 83 e0
      [  153.567388] RSP: 0018:ffff95c69bb93990 EFLAGS: 00010212
      [  153.567388] RAX: 0000000000000011 RBX: ffff95c6e0ee3e00 RCX: 0000000000000007
      [  153.567389] RDX: ffff95c69ae50000 RSI: ffff95c6c30c3000 RDI: ffff95c6c30c3000
      [  153.567389] RBP: ffff95c69bb93990 R08: ffff95c69a77f000 R09: 0000000000000008
      [  153.567389] R10: 0000000000000040 R11: 00003e0e00026128 R12: ffff95c6c30c3000
      [  153.567390] R13: ffff95c6cc4fd500 R14: ffff95c6f84500c0 R15: ffff95c69aa13c00
      [  153.567390] FS:  00007fdc3a283700(0000) GS:ffff95c6ff9c0000(0000) knlGS:0000000000000000
      [  153.567390] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  153.567391] CR2: 00007ffee758e890 CR3: 0000001f9ba20003 CR4: 00000000001606e0
      [  153.567391] Call Trace:
      [  153.567391]  ip6_finish_output2+0x34e/0x550
      [  153.567391]  __ip6_finish_output+0xe7/0x110
      [  153.567391]  ip6_finish_output+0x2d/0xb0
      [  153.567392]  ip6_output+0x77/0x120
      [  153.567392]  ? __ip6_finish_output+0x110/0x110
      [  153.567392]  ip6_local_out+0x3d/0x50
      [  153.567392]  ipvlan_queue_xmit+0x56c/0x5e0
      [  153.567393]  ? ksize+0x19/0x30
      [  153.567393]  ipvlan_start_xmit+0x18/0x50
      [  153.567393]  dev_direct_xmit+0xf3/0x1c0
      [  153.567393]  packet_direct_xmit+0x69/0xa0
      [  153.567394]  packet_sendmsg+0xbf0/0x19b0
      [  153.567394]  ? plist_del+0x62/0xb0
      [  153.567394]  sock_sendmsg+0x65/0x70
      [  153.567394]  sock_write_iter+0x93/0xf0
      [  153.567394]  new_sync_write+0x18e/0x1a0
      [  153.567395]  __vfs_write+0x29/0x40
      [  153.567395]  vfs_write+0xb9/0x1b0
      [  153.567395]  ksys_write+0xb1/0xe0
      [  153.567395]  __x64_sys_write+0x1a/0x20
      [  153.567395]  do_syscall_64+0x43/0x70
      [  153.567396]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      [  153.567396] RIP: 0033:0x453549
      [  153.567396] Code: Bad RIP value.
      [  153.567396] RSP: 002b:00007fdc3a282cc8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
      [  153.567397] RAX: ffffffffffffffda RBX: 00000000004d32d0 RCX: 0000000000453549
      [  153.567397] RDX: 0000000000000020 RSI: 0000000020000300 RDI: 0000000000000003
      [  153.567398] RBP: 00000000004d32d8 R08: 0000000000000000 R09: 0000000000000000
      [  153.567398] R10: 0000000000000000 R11: 0000000000000246 R12: 00000000004d32dc
      [  153.567398] R13: 00007ffee742260f R14: 00007fdc3a282dc0 R15: 00007fdc3a283700
      [  153.567399] ---[ end trace c1d5ae2b1059ec62 ]---
      
      f60e5990 ("ipv6: protect skb->sk accesses from recursive dereference inside the stack")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0ad6f6e7
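      For illustration, a minimal sketch of bounding recursion on a direct-xmit path.
      dev_xmit_recursion(), dev_xmit_recursion_inc() and dev_xmit_recursion_dec() are
      the per-CPU recursion helpers used on the __dev_queue_xmit() path; the example_
      function itself is hypothetical and much simpler than the real dev_direct_xmit().
      
      #include <linux/netdevice.h>
      #include <linux/skbuff.h>
      
      /* Sketch: refuse to transmit when the per-CPU xmit recursion level is
       * already too deep, instead of re-entering the stack, mirroring the
       * idea of the fix above (simplified; no HARD_TX_LOCK, queue 0 only).
       */
      static int example_direct_xmit(struct sk_buff *skb, struct net_device *dev)
      {
              struct netdev_queue *txq = netdev_get_tx_queue(dev, 0);
              int ret;
      
              local_bh_disable();
              if (unlikely(dev_xmit_recursion())) {
                      local_bh_enable();
                      kfree_skb(skb);
                      return NET_XMIT_DROP;
              }
              dev_xmit_recursion_inc();
              ret = netdev_start_xmit(skb, dev, txq, false);
              dev_xmit_recursion_dec();
              local_bh_enable();
              return ret;
      }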
    • net: napi: remove useless stack trace · 427d5838
      By Eric Dumazet
      Whenever a buggy NAPI driver returns more than its budget,
      we emit a stack trace that is of no use, since it does not
      tell which driver is buggy.
      
      Instead, emit a descriptive message that includes the name of
      the offending poll function.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      427d5838
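      As a sketch of the idea, a one-line report that names the offending poll function
      instead of dumping a stack trace (the message text and helper name below are
      illustrative, not the exact patch):
      
      #include <linux/kernel.h>
      #include <linux/netdevice.h>
      
      /* Sketch: report the buggy NAPI poll function by name, once, instead of
       * a stack trace that never identifies the driver.
       */
      static inline void example_check_napi_budget(struct napi_struct *n,
                                                   int work, int weight)
      {
              if (unlikely(work > weight))
                      pr_err_once("NAPI poll function %pS returned %d, exceeding its budget of %d\n",
                                  n->poll, work, weight);
      }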
    • net: fix memleak in register_netdevice() · 814152a8
      By Yang Yingliang
      I got a memleak report when doing some fuzz testing:
      
      unreferenced object 0xffff888112584000 (size 13599):
        comm "ip", pid 3048, jiffies 4294911734 (age 343.491s)
        hex dump (first 32 bytes):
          74 61 70 30 00 00 00 00 00 00 00 00 00 00 00 00  tap0............
          00 ee d9 19 81 88 ff ff 00 00 00 00 00 00 00 00  ................
        backtrace:
          [<000000002f60ba65>] __kmalloc_node+0x309/0x3a0
          [<0000000075b211ec>] kvmalloc_node+0x7f/0xc0
          [<00000000d3a97396>] alloc_netdev_mqs+0x76/0xfc0
          [<00000000609c3655>] __tun_chr_ioctl+0x1456/0x3d70
          [<000000001127ca24>] ksys_ioctl+0xe5/0x130
          [<00000000b7d5e66a>] __x64_sys_ioctl+0x6f/0xb0
          [<00000000e1023498>] do_syscall_64+0x56/0xa0
          [<000000009ec0eb12>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
      unreferenced object 0xffff888111845cc0 (size 8):
        comm "ip", pid 3048, jiffies 4294911734 (age 343.491s)
        hex dump (first 8 bytes):
          74 61 70 30 00 88 ff ff                          tap0....
        backtrace:
          [<000000004c159777>] kstrdup+0x35/0x70
          [<00000000d8b496ad>] kstrdup_const+0x3d/0x50
          [<00000000494e884a>] kvasprintf_const+0xf1/0x180
          [<0000000097880a2b>] kobject_set_name_vargs+0x56/0x140
          [<000000008fbdfc7b>] dev_set_name+0xab/0xe0
          [<000000005b99e3b4>] netdev_register_kobject+0xc0/0x390
          [<00000000602704fe>] register_netdevice+0xb61/0x1250
          [<000000002b7ca244>] __tun_chr_ioctl+0x1cd1/0x3d70
          [<000000001127ca24>] ksys_ioctl+0xe5/0x130
          [<00000000b7d5e66a>] __x64_sys_ioctl+0x6f/0xb0
          [<00000000e1023498>] do_syscall_64+0x56/0xa0
          [<000000009ec0eb12>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
      unreferenced object 0xffff88811886d800 (size 512):
        comm "ip", pid 3048, jiffies 4294911734 (age 343.491s)
        hex dump (first 32 bytes):
          00 00 00 00 ad 4e ad de ff ff ff ff 00 00 00 00  .....N..........
          ff ff ff ff ff ff ff ff c0 66 3d a3 ff ff ff ff  .........f=.....
        backtrace:
          [<0000000050315800>] device_add+0x61e/0x1950
          [<0000000021008dfb>] netdev_register_kobject+0x17e/0x390
          [<00000000602704fe>] register_netdevice+0xb61/0x1250
          [<000000002b7ca244>] __tun_chr_ioctl+0x1cd1/0x3d70
          [<000000001127ca24>] ksys_ioctl+0xe5/0x130
          [<00000000b7d5e66a>] __x64_sys_ioctl+0x6f/0xb0
          [<00000000e1023498>] do_syscall_64+0x56/0xa0
          [<000000009ec0eb12>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      If call_netdevice_notifiers() fails, rollback_registered() calls
      netdev_unregister_kobject(), which holds a reference to the kobject.
      That reference can never be put, because the netdev is not added to
      the todo list, so it leads to a memleak; we need to put the reference
      explicitly to avoid the leak.
      Reported-by: Hulk Robot <hulkci@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      814152a8
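      A minimal sketch of the fix's idea, assuming the error path runs after the
      notifier failure and rollback; the surrounding code and helper name are
      illustrative, only the explicit kobject_put() mirrors the description above.
      
      #include <linux/kobject.h>
      #include <linux/netdevice.h>
      
      /* Sketch: after rolling back a failed register_netdevice(), drop the
       * kobject reference taken during netdev_register_kobject(); the device
       * never reaches the unregistration todo list that would normally put
       * it, so without this put the netdev (and its name) leak.
       */
      static void example_register_rollback(struct net_device *dev)
      {
              dev->reg_state = NETREG_UNREGISTERED;
              kobject_put(&dev->dev.kobj);
      }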
  7. 10 Jun 2020: 1 commit
    • net: change addr_list_lock back to static key · 845e0ebb
      By Cong Wang
      The dynamic key update for addr_list_lock still causes trouble;
      for example, the following race condition still exists:
      
      CPU 0:				CPU 1:
      (RCU read lock)			(RTNL lock)
      dev_mc_seq_show()		netdev_update_lockdep_key()
      				  -> lockdep_unregister_key()
       -> netif_addr_lock_bh()
      
      because lockdep doesn't provide an API to update it atomically.
      Therefore, we have to move it back to static keys and use subclass
      for nest locking like before.
      
      In commit 1a33e10e ("net: partially revert dynamic lockdep key
      changes"), I already reverted most parts of commit ab92d68f
      ("net: core: add generic lockdep keys").
      
      This patch reverts the rest and also part of commit f3b0a18b
      ("net: remove unnecessary variables and callback"). After this
      patch, addr_list_lock changes back to using static keys and
      subclasses to satisfy lockdep. Thanks to dev->lower_level, we do
      not have to change back to ->ndo_get_lock_subclass().
      
      And hopefully this reduces some syzbot lockdep noises too.
      
      Reported-by: syzbot+f3a0e80c34b3fc28ac5e@syzkaller.appspotmail.com
      Cc: Taehee Yoo <ap420073@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      845e0ebb
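      For illustration, a sketch of nested locking by subclass, assuming
      dev->lower_level is used as the lockdep subclass as the commit describes;
      the example_ helper names are not the kernel's.
      
      #include <linux/netdevice.h>
      #include <linux/spinlock.h>
      
      /* Sketch: with one static lockdep class shared by every netdev's
       * addr_list_lock, nesting across stacked devices (bond over slave
       * over ...) is expressed via spin_lock_nested() subclasses derived
       * from the device's depth in the stack.
       */
      static inline void example_addr_lock_nested(struct net_device *dev)
      {
              spin_lock_nested(&dev->addr_list_lock, dev->lower_level);
      }
      
      static inline void example_addr_unlock(struct net_device *dev)
      {
              spin_unlock(&dev->addr_list_lock);
      }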
  8. 05 Jun 2020: 1 commit
    • net: core: device_rename: Use rwsem instead of a seqcount · 11d6011c
      By Ahmed S. Darwish
      Sequence counter write paths are critical sections that must never be
      preempted, and blocking, even with CONFIG_PREEMPTION=n, is not allowed.
      
      Commit 5dbe7c17 ("net: fix kernel deadlock with interface rename and
      netdev name retrieval.") handled a deadlock, observed with
      CONFIG_PREEMPTION=n, where the devnet_rename seqcount read side was
      infinitely spinning: it got scheduled after the seqcount write side
      blocked inside its own critical section.
      
      To fix that deadlock, among other issues, the commit added a
      cond_resched() inside the read side section. While this will get the
      non-preemptible kernel eventually unstuck, the seqcount reader is fully
      exhausting its slice just spinning -- until TIF_NEED_RESCHED is set.
      
      The fix is also still broken: if the seqcount reader belongs to a
      real-time scheduling policy, it can spin forever and the kernel will
      livelock.
      
      Disabling preemption over the seqcount write side critical section will
      not work: inside it are a number of GFP_KERNEL allocations and mutex
      locking through the drivers/base/ :: device_rename() call chain.
      
      From all the above, replace the seqcount with an rwsem.
      
      Fixes: 5dbe7c17 (net: fix kernel deadlock with interface rename and netdev name retrieval.)
      Fixes: 30e6c9fa (net: devnet_rename_seq should be a seqcount)
      Fixes: c91f6df2 (sockopt: Change getsockopt() of SO_BINDTODEVICE to return an interface name)
      Cc: <stable@vger.kernel.org>
      Reported-by: kbuild test robot <lkp@intel.com> [ v1 missing up_read() on error exit ]
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com> [ v1 missing up_read() on error exit ]
      Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
      Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      11d6011c
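      To illustrate why an rwsem fits here, a minimal sketch with both sides allowed
      to sleep; the variable and function names are assumptions of this sketch, not
      the patch's.
      
      #include <linux/rwsem.h>
      #include <linux/netdevice.h>
      #include <linux/string.h>
      
      static DECLARE_RWSEM(example_rename_sem);
      
      /* Write side: may sleep (GFP_KERNEL allocations, device_rename()),
       * which a seqcount write section must never do.
       */
      static void example_rename(struct net_device *dev, const char *newname)
      {
              down_write(&example_rename_sem);
              strscpy(dev->name, newname, IFNAMSIZ);
              up_write(&example_rename_sem);
      }
      
      /* Read side: simply blocks until a concurrent rename finishes, instead
       * of spinning on a retry loop.
       */
      static void example_get_name(struct net_device *dev, char *buf)
      {
              down_read(&example_rename_sem);
              strscpy(buf, dev->name, IFNAMSIZ);
              up_read(&example_rename_sem);
      }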
  9. 02 Jun 2020: 1 commit
    • bpf: Add support to attach bpf program to a devmap entry · fbee97fe
      By David Ahern
      Add BPF_XDP_DEVMAP attach type for use with programs associated with a
      DEVMAP entry.
      
      Allow DEVMAPs to associate a program with a device entry by adding
      a bpf_prog.fd to 'struct bpf_devmap_val'. Values read show the program
      id, so the fd and id are a union. bpf programs can get access to the
      struct via vmlinux.h.
      
      The program associated with the fd must have type XDP with expected
      attach type BPF_XDP_DEVMAP. When a program is associated with a device
      index, the program is run on an XDP_REDIRECT and before the buffer is
      added to the per-cpu queue. At this point rxq data is still valid; the
      next patch adds tx device information allowing the program to see both
      ingress and egress device indices.
      
      XDP generic is skb based and XDP programs do not work with skb's. Block
      the use case by walking maps used by a program that is to be attached
      via xdpgeneric and fail if any of them are DEVMAP / DEVMAP_HASH with
      a program attached.
      
      Block attach of BPF_XDP_DEVMAP programs to devices.
      Signed-off-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Link: https://lore.kernel.org/bpf/20200529220716.75383-3-dsahern@kernel.org
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      fbee97fe
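      For illustration, roughly what the devmap value described above looks like from
      user space; the struct here is a stand-in (the real definition is struct
      bpf_devmap_val in the UAPI headers / vmlinux.h), and the exact field layout
      should be treated as an assumption.
      
      #include <stdint.h>
      #include <string.h>
      
      /* Stand-in for the devmap value: user space writes bpf_prog.fd when
       * updating an entry, and reads back bpf_prog.id, hence the union.
       */
      struct example_devmap_val {
              uint32_t ifindex;        /* egress device for the redirect */
              union {
                      int fd;          /* XDP prog fd (BPF_XDP_DEVMAP) on update */
                      uint32_t id;     /* prog id when the entry is read back */
              } bpf_prog;
      };
      
      static void example_fill_val(struct example_devmap_val *val,
                                   uint32_t ifindex, int prog_fd)
      {
              memset(val, 0, sizeof(*val));
              val->ifindex = ifindex;
              val->bpf_prog.fd = prog_fd;
      }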
  10. 20 May 2020: 1 commit
  11. 15 May 2020: 1 commit
  12. 08 May 2020: 1 commit
    • net: fix a potential recursive NETDEV_FEAT_CHANGE · dd912306
      By Cong Wang
      syzbot managed to trigger a recursive NETDEV_FEAT_CHANGE event
      between bonding master and slave. I managed to find a reproducer
      for this:
      
        ip li set bond0 up
        ifenslave bond0 eth0
        brctl addbr br0
        ethtool -K eth0 lro off
        brctl addif br0 bond0
        ip li set br0 up
      
      When a NETDEV_FEAT_CHANGE event is triggered on a bonding slave,
      the bonding driver catches it and calls bond_compute_features() to
      fix up its master's and other slaves' features. However, when syncing
      with its lower devices via netdev_sync_lower_features(), this event
      is triggered again on the slaves when the LRO feature fails to change,
      so it goes back and forth recursively until the kernel stack is
      exhausted.
      
      Commit 17b85d29 intentionally lets __netdev_update_features()
      return -1 for such a failure case, so we have to just rely on
      the existing check inside netdev_sync_lower_features() and skip the
      NETDEV_FEAT_CHANGE event only for this specific failure case.
      
      Fixes: fd867d51 ("net/core: generic support for disabling netdev features down stack")
      Reported-by: syzbot+e73ceacfd8560cc8a3ca@syzkaller.appspotmail.com
      Reported-by: syzbot+c2fb6f9ddcea95ba49b5@syzkaller.appspotmail.com
      Cc: Jarod Wilson <jarod@redhat.com>
      Cc: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Jann Horn <jannh@google.com>
      Reviewed-by: Jay Vosburgh <jay.vosburgh@canonical.com>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      dd912306
  13. 05 May 2020: 1 commit
  14. 02 May 2020: 1 commit
  15. 24 Apr 2020: 2 commits
    • net: napi: use READ_ONCE()/WRITE_ONCE() · 7e417a66
      By Eric Dumazet
      gro_flush_timeout and napi_defer_hard_irqs can be read
      from napi_complete_done() while other CPUs write the value,
      without explicit synchronization.
      
      Use READ_ONCE()/WRITE_ONCE() to annotate the races.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7e417a66
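      A minimal sketch of the annotation pattern: the writer (for example the sysfs
      store path) and the lockless reader each mark the access, documenting the
      intentional race; function names are illustrative.
      
      #include <linux/compiler.h>
      #include <linux/netdevice.h>
      
      static void example_store_defer(struct net_device *dev, u32 val)
      {
              /* writer, e.g. sysfs store, runs without a lock shared with readers */
              WRITE_ONCE(dev->napi_defer_hard_irqs, val);
      }
      
      static u32 example_load_defer(const struct net_device *dev)
      {
              /* lockless reader, e.g. napi_complete_done() */
              return READ_ONCE(dev->napi_defer_hard_irqs);
      }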
    • net: napi: add hard irqs deferral feature · 6f8b12d6
      By Eric Dumazet
      Back in commit 3b47d303 ("net: gro: add a per device gro flush timer")
      we added the ability to arm one high resolution timer, used to keep
      incomplete packets in the GRO engine a bit longer, hoping that further
      frames might be added to them.
      
      Since then, we added the napi_complete_done() interface, and commit
      364b6055 ("net: busy-poll: return busypolling status to drivers")
      allowed drivers to avoid re-arming NIC interrupts if we made a promise
      that their NAPI poll() handler would be called in the near future.
      
      This infrastructure can be leveraged, thanks to a new device parameter,
      which allows arming the napi hrtimer instead of re-arming the device
      hard IRQ.
      
      We have noticed that on some servers with 32 RX queues or more, the
      chit-chat between the NIC and the host caused by IRQ delivery and
      re-arming could hurt throughput by ~20% on a 100Gbit NIC.
      
      In contrast, hrtimers use local (per-cpu) resources and might have a
      lower cost.
      
      The new tunable, named napi_defer_hard_irqs, is placed in the same
      sysfs hierarchy as gro_flush_timeout (/sys/class/net/ethX/).
      
      By default, both gro_flush_timeout and napi_defer_hard_irqs are zero.
      
      This patch does not change the prior behavior of gro_flush_timeout
      if used alone: NIC hard irqs should be rearmed as before.
      
      One concrete usage can be:
      
      echo 20000 >/sys/class/net/eth1/gro_flush_timeout
      echo 10 >/sys/class/net/eth1/napi_defer_hard_irqs
      
      If at least one packet is retired, then we reset the napi counter
      to 10 (napi_defer_hard_irqs), ensuring at least 10 periodic scans
      of the queue.
      
      On busy queues, this should avoid NIC hard IRQs, while before this patch
      IRQ avoidance was only possible if napi->poll() was exhausting its budget
      and not calling napi_complete_done().
      
      This feature can also be used to work around some sub-optimal NIC IRQ
      coalescing strategies.
      
      Having the ability to insert XX usec delays between each napi->poll()
      can increase cache efficiency, since we increase batch sizes.
      
      It also keeps the serving CPUs from staying idle too long, reducing
      tail latencies.
      Co-developed-by: Luigi Rizzo <lrizzo@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6f8b12d6
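      Paraphrasing the mechanism above as a sketch of the decision made when a napi
      poll completes: reload the per-napi counter when packets were processed, and
      while it is non-zero keep re-arming the gro_flush_timeout hrtimer instead of
      the NIC hard IRQ. Field names follow the commit text; the exact logic is an
      assumption of this sketch.
      
      #include <linux/netdevice.h>
      
      static bool example_defer_hard_irq(struct napi_struct *n, int work_done)
      {
              unsigned long timeout = 0;
      
              if (work_done) {
                      if (n->gro_bitmask)
                              timeout = READ_ONCE(n->dev->gro_flush_timeout);
                      n->defer_hard_irqs_count = READ_ONCE(n->dev->napi_defer_hard_irqs);
              }
              if (n->defer_hard_irqs_count > 0) {
                      n->defer_hard_irqs_count--;
                      timeout = READ_ONCE(n->dev->gro_flush_timeout);
              }
      
              /* caller arms the hrtimer for 'timeout' ns and skips IRQ re-arm */
              return timeout != 0;
      }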
  16. 21 Apr 2020: 1 commit
  17. 15 Apr 2020: 1 commit
  18. 08 Apr 2020: 1 commit
  19. 30 Mar 2020: 1 commit
  20. 29 Mar 2020: 1 commit
  21. 26 Mar 2020: 1 commit
    • net: Fix CONFIG_NET_CLS_ACT=n and CONFIG_NFT_FWD_NETDEV={y, m} build · 2c64605b
      By Pablo Neira Ayuso
      net/netfilter/nft_fwd_netdev.c: In function ‘nft_fwd_netdev_eval’:
          net/netfilter/nft_fwd_netdev.c:32:10: error: ‘struct sk_buff’ has no member named ‘tc_redirected’
            pkt->skb->tc_redirected = 1;
                    ^~
          net/netfilter/nft_fwd_netdev.c:33:10: error: ‘struct sk_buff’ has no member named ‘tc_from_ingress’
            pkt->skb->tc_from_ingress = 1;
                    ^~
      
      To avoid a direct dependency of netfilter on tc actions, wrap the
      redirect bits in CONFIG_NET_REDIRECT and move the helpers to
      include/linux/skbuff.h. Turn on this toggle from the ifb driver, the
      only existing client of these bits in the tree.
      
      This patch adds skb_set_redirected(), which sets the redirected bit
      on the skbuff, records whether the packet was redirected from ingress,
      and resets the timestamp (the timestamp reset was originally missing in
      the netfilter bugfix).
      
      Fixes: bcfabee1 ("netfilter: nft_fwd_netdev: allow to redirect to ifb via ingress")
      Reported-by: noreply@ellerman.id.au
      Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2c64605b
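      A sketch of the helper described above (the real one is skb_set_redirected()
      in include/linux/skbuff.h; details here are simplified): set the redirected
      bit, record whether the redirect came from ingress, and clear the timestamp
      in that case.
      
      #include <linux/skbuff.h>
      
      static inline void example_set_redirected(struct sk_buff *skb,
                                                bool from_ingress)
      {
              skb->redirected = 1;
      #ifdef CONFIG_NET_REDIRECT
              skb->from_ingress = from_ingress;
              if (skb->from_ingress)
                      skb->tstamp = 0;    /* the reset missing from the netfilter fix */
      #endif
      }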
  22. 19 Mar 2020: 1 commit
  23. 18 Mar 2020: 5 commits
    • net: core: dev.c: fix a documentation warning · 2de9780f
      By Mauro Carvalho Chehab
      There's a markup for links, which is "foo_". In this kernel-doc
      comment we don't want a link, but a literal reference instead.
      So, escape the literal with ``foo`` in order to avoid this warning:
      
      	./net/core/dev.c:5195: WARNING: Unknown target name: "page_is".
      Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2de9780f
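      A hypothetical kernel-doc fragment illustrating the difference (identifier
      names are made up):
      
      /*
       * Bad:  "foo_bar_" style markup is parsed by Sphinx as a link and
       *       produces "Unknown target name" when no such target exists.
       *
       * Good: ``foo_bar`` renders as a literal, so no link resolution is
       *       attempted and no warning is emitted.
       */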
    • net: ethtool: require drivers to set supported_coalesce_params · 9000edb7
      By Jakub Kicinski
      Now that all in-tree drivers have been updated, we can
      make supported_coalesce_params mandatory.
      
      To save debugging time in case some driver was missed
      (or is out of tree), add a warning when a netdev is registered
      with set_coalesce but without supported_coalesce_params.
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Reviewed-by: Michal Kubecek <mkubecek@suse.cz>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9000edb7
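      For illustration, how a driver's ethtool_ops might declare the coalescing
      parameters its .set_coalesce actually honours (v5.7-era callback signature;
      the driver and values are hypothetical):
      
      #include <linux/ethtool.h>
      #include <linux/netdevice.h>
      
      static int example_set_coalesce(struct net_device *dev,
                                      struct ethtool_coalesce *ec)
      {
              /* program ec->rx_coalesce_usecs and ec->rx_max_coalesced_frames */
              return 0;
      }
      
      static const struct ethtool_ops example_ethtool_ops = {
              .supported_coalesce_params = ETHTOOL_COALESCE_RX_USECS |
                                           ETHTOOL_COALESCE_RX_MAX_FRAMES,
              .set_coalesce              = example_set_coalesce,
      };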
    • netfilter: Introduce egress hook · 8537f786
      By Lukas Wunner
      Commit e687ad60 ("netfilter: add netfilter ingress hook after
      handle_ing() under unique static key") introduced the ability to
      classify packets on ingress.
      
      Allow the same on egress.  Position the hook immediately before a packet
      is handed to tc and then sent out on an interface, thereby mirroring the
      ingress order.  This order allows marking packets in the netfilter
      egress hook and subsequently using the mark in tc.  Another benefit of
      this order is consistency with a lot of existing documentation which
      says that egress tc is performed after netfilter hooks.
      
      Egress hooks already exist for the most common protocols, such as
      NF_INET_LOCAL_OUT or NF_ARP_OUT, and those are to be preferred because
      they are executed earlier during packet processing.  However for more
      exotic protocols, there is currently no provision to apply netfilter on
      egress.  A common workaround is to enslave the interface to a bridge and
      use ebtables, or to resort to tc.  But when the ingress hook was
      introduced, consensus was that users should be given the choice to use
      netfilter or tc, whichever tool suits their needs best:
      https://lore.kernel.org/netdev/20150430153317.GA3230@salvia/
      This hook is also useful for NAT46/NAT64, tunneling and filtering of
      locally generated af_packet traffic such as dhclient.
      
      There have also been occasional user requests for a netfilter egress
      hook in the past, e.g.:
      https://www.spinics.net/lists/netfilter/msg50038.html
      
      Performance measurements with pktgen surprisingly show a speedup rather
      than a slowdown with this commit:
      
      * Without this commit:
        Result: OK: 34240933(c34238375+d2558) usec, 100000000 (60byte,0frags)
        2920481pps 1401Mb/sec (1401830880bps) errors: 0
      
      * With this commit:
        Result: OK: 33997299(c33994193+d3106) usec, 100000000 (60byte,0frags)
        2941410pps 1411Mb/sec (1411876800bps) errors: 0
      
      * Without this commit + tc egress:
        Result: OK: 39022386(c39019547+d2839) usec, 100000000 (60byte,0frags)
        2562631pps 1230Mb/sec (1230062880bps) errors: 0
      
      * With this commit + tc egress:
        Result: OK: 37604447(c37601877+d2570) usec, 100000000 (60byte,0frags)
        2659259pps 1276Mb/sec (1276444320bps) errors: 0
      
      * With this commit + nft egress:
        Result: OK: 41436689(c41434088+d2600) usec, 100000000 (60byte,0frags)
        2413320pps 1158Mb/sec (1158393600bps) errors: 0
      
      Tested on a bare-metal Core i7-3615QM, each measurement was performed
      three times to verify that the numbers are stable.
      
      Commands to perform a measurement:
      modprobe pktgen
      echo "add_device lo@3" > /proc/net/pktgen/kpktgend_3
      samples/pktgen/pktgen_bench_xmit_mode_queue_xmit.sh -i 'lo@3' -n 100000000
      
      Commands for testing tc egress:
      tc qdisc add dev lo clsact
      tc filter add dev lo egress protocol ip prio 1 u32 match ip dst 4.3.2.1/32
      
      Commands for testing nft egress:
      nft add table netdev t
      nft add chain netdev t co \{ type filter hook egress device lo priority 0 \; \}
      nft add rule netdev t co ip daddr 4.3.2.1/32 drop
      
      All testing was performed on the loopback interface to avoid distorting
      measurements by the packet handling in the low-level Ethernet driver.
      Signed-off-by: Lukas Wunner <lukas@wunner.de>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      8537f786
    • netfilter: Generalize ingress hook · 5418d388
      By Lukas Wunner
      Prepare for addition of a netfilter egress hook by generalizing the
      ingress hook introduced by commit e687ad60 ("netfilter: add
      netfilter ingress hook after handle_ing() under unique static key").
      
      In particular, rename and refactor the ingress hook's static inlines
      such that they can be reused for an egress hook.
      
      No functional change intended.
      Signed-off-by: Lukas Wunner <lukas@wunner.de>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      5418d388
    • netfilter: Rename ingress hook include file · b030f194
      By Lukas Wunner
      Prepare for addition of a netfilter egress hook by renaming
      <linux/netfilter_ingress.h> to <linux/netfilter_netdev.h>.
      
      The egress hook also necessitates a refactoring of the include file,
      but that is done in a separate commit to ease reviewing.
      
      No functional change intended.
      Signed-off-by: Lukas Wunner <lukas@wunner.de>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      b030f194
  24. 13 Mar 2020: 1 commit
    • Revert "net: sched: make newly activated qdiscs visible" · 7c4046b1
      By Julian Wiedmann
      This reverts commit 4cda7527
      from net-next.
      
      Brown bag time.
      
      Michal noticed that this change doesn't work at all when
      netif_set_real_num_tx_queues() gets called prior to an initial
      dev_activate(), as for instance igb does.
      
      Doing so dies with:
      
      [   40.579142] BUG: kernel NULL pointer dereference, address: 0000000000000400
      [   40.586922] #PF: supervisor read access in kernel mode
      [   40.592668] #PF: error_code(0x0000) - not-present page
      [   40.598405] PGD 0 P4D 0
      [   40.601234] Oops: 0000 [#1] PREEMPT SMP PTI
      [   40.605909] CPU: 18 PID: 1681 Comm: wickedd Tainted: G            E     5.6.0-rc3-ethnl.50-default #1
      [   40.616205] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS RMLSDP.86I.R3.27.D685.1305151734 05/15/2013
      [   40.627377] RIP: 0010:qdisc_hash_add.part.22+0x2e/0x90
      [   40.633115] Code: 00 55 53 89 f5 48 89 fb e8 2f 9b fb ff 85 c0 74 44 48 8b 43 40 48 8b 08 69 43 38 47 86 c8 61 c1 e8 1c 48 83 e8 80 48 8d 14 c1 <48> 8b 04 c1 48 8d 4b 28 48 89 53 30 48 89 43 28 48 85 c0 48 89 0a
      [   40.654080] RSP: 0018:ffffb879864934d8 EFLAGS: 00010203
      [   40.659914] RAX: 0000000000000080 RBX: ffffffffb8328d80 RCX: 0000000000000000
      [   40.667882] RDX: 0000000000000400 RSI: 0000000000000000 RDI: ffffffffb831faa0
      [   40.675849] RBP: 0000000000000000 R08: ffffa0752c8b9088 R09: ffffa0752c8b9208
      [   40.683816] R10: 0000000000000006 R11: 0000000000000000 R12: ffffa0752d734000
      [   40.691783] R13: 0000000000000008 R14: 0000000000000000 R15: ffffa07113c18000
      [   40.699750] FS:  00007f94548e5880(0000) GS:ffffa0752e980000(0000) knlGS:0000000000000000
      [   40.708782] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [   40.715189] CR2: 0000000000000400 CR3: 000000082b6ae006 CR4: 00000000001606e0
      [   40.723156] Call Trace:
      [   40.725888]  dev_qdisc_set_real_num_tx_queues+0x61/0x90
      [   40.731725]  netif_set_real_num_tx_queues+0x94/0x1d0
      [   40.737286]  __igb_open+0x19a/0x5d0 [igb]
      [   40.741767]  __dev_open+0xbb/0x150
      [   40.745567]  __dev_change_flags+0x157/0x1a0
      [   40.750240]  dev_change_flags+0x23/0x60
      
      [...]
      
      Fixes: 4cda7527 ("net: sched: make newly activated qdiscs visible")
      Reported-by: Michal Kubecek <mkubecek@suse.cz>
      CC: Michal Kubecek <mkubecek@suse.cz>
      CC: Eric Dumazet <edumazet@google.com>
      CC: Jamal Hadi Salim <jhs@mojatatu.com>
      CC: Cong Wang <xiyou.wangcong@gmail.com>
      CC: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7c4046b1
  25. 12 Mar 2020: 1 commit
    • net: sched: make newly activated qdiscs visible · 4cda7527
      By Julian Wiedmann
      In their .attach callback, mq[prio] only add the qdiscs of the currently
      active TX queues to the device's qdisc hash list.
      If a user later increases the number of active TX queues, their qdiscs
      are not visible via e.g. 'tc qdisc show'.
      
      Add a hook to netif_set_real_num_tx_queues() that walks all active
      TX queues and adds those which are missing to the hash list.
      
      CC: Eric Dumazet <edumazet@google.com>
      CC: Jamal Hadi Salim <jhs@mojatatu.com>
      CC: Cong Wang <xiyou.wangcong@gmail.com>
      CC: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4cda7527
  26. 27 Feb 2020: 2 commits
  27. 25 Feb 2020: 1 commit
  28. 20 Feb 2020: 2 commits
  29. 19 Feb 2020: 1 commit
  30. 17 Feb 2020: 1 commit
    • net: export netdev_next_lower_dev_rcu() · 7151affe
      By Taehee Yoo
      netdev_next_lower_dev_rcu() will be used to implement a function
      that walks all lower interfaces.
      There are already functions that walk lower interfaces
      (netdev_walk_all_lower_dev_rcu() and netdev_walk_all_lower_dev()),
      but there are cases the given netdev_walk_all_lower_dev_{rcu}()
      functions cannot cover, so some modules may want to implement their
      own walker over all lower interfaces.
      
      The next patch will use netdev_next_lower_dev_rcu().
      In addition, this patch removes two unused prototypes from netdevice.h.
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7151affe
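      For illustration, a sketch of how a module might walk all lower devices with
      the newly exported iterator; the iterator's calling convention (a list_head
      cursor starting at dev->adj_list.lower, NULL at the end) is assumed from the
      existing lower-dev walkers, so treat it as an assumption of this sketch.
      
      #include <linux/netdevice.h>
      #include <linux/rcupdate.h>
      
      static void example_walk_lowers(struct net_device *dev)
      {
              struct net_device *lower;
              struct list_head *iter;
      
              rcu_read_lock();
              iter = &dev->adj_list.lower;
              while ((lower = netdev_next_lower_dev_rcu(dev, &iter)) != NULL)
                      netdev_dbg(dev, "lower device: %s\n", lower->name);
              rcu_read_unlock();
      }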