1. 18 Mar, 2020 1 commit
  2. 13 Mar, 2020 1 commit
    • Revert "net: sched: make newly activated qdiscs visible" · 7c4046b1
      Julian Wiedmann committed
      This reverts commit 4cda7527
      from net-next.
      
      Brown bag time.
      
      Michal noticed that this change doesn't work at all when
      netif_set_real_num_tx_queues() gets called prior to an initial
      dev_activate(), as for instance igb does.
      
      Doing so dies with:
      
      [   40.579142] BUG: kernel NULL pointer dereference, address: 0000000000000400
      [   40.586922] #PF: supervisor read access in kernel mode
      [   40.592668] #PF: error_code(0x0000) - not-present page
      [   40.598405] PGD 0 P4D 0
      [   40.601234] Oops: 0000 [#1] PREEMPT SMP PTI
      [   40.605909] CPU: 18 PID: 1681 Comm: wickedd Tainted: G            E     5.6.0-rc3-ethnl.50-default #1
      [   40.616205] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS RMLSDP.86I.R3.27.D685.1305151734 05/15/2013
      [   40.627377] RIP: 0010:qdisc_hash_add.part.22+0x2e/0x90
      [   40.633115] Code: 00 55 53 89 f5 48 89 fb e8 2f 9b fb ff 85 c0 74 44 48 8b 43 40 48 8b 08 69 43 38 47 86 c8 61 c1 e8 1c 48 83 e8 80 48 8d 14 c1 <48> 8b 04 c1 48 8d 4b 28 48 89 53 30 48 89 43 28 48 85 c0 48 89 0a
      [   40.654080] RSP: 0018:ffffb879864934d8 EFLAGS: 00010203
      [   40.659914] RAX: 0000000000000080 RBX: ffffffffb8328d80 RCX: 0000000000000000
      [   40.667882] RDX: 0000000000000400 RSI: 0000000000000000 RDI: ffffffffb831faa0
      [   40.675849] RBP: 0000000000000000 R08: ffffa0752c8b9088 R09: ffffa0752c8b9208
      [   40.683816] R10: 0000000000000006 R11: 0000000000000000 R12: ffffa0752d734000
      [   40.691783] R13: 0000000000000008 R14: 0000000000000000 R15: ffffa07113c18000
      [   40.699750] FS:  00007f94548e5880(0000) GS:ffffa0752e980000(0000) knlGS:0000000000000000
      [   40.708782] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [   40.715189] CR2: 0000000000000400 CR3: 000000082b6ae006 CR4: 00000000001606e0
      [   40.723156] Call Trace:
      [   40.725888]  dev_qdisc_set_real_num_tx_queues+0x61/0x90
      [   40.731725]  netif_set_real_num_tx_queues+0x94/0x1d0
      [   40.737286]  __igb_open+0x19a/0x5d0 [igb]
      [   40.741767]  __dev_open+0xbb/0x150
      [   40.745567]  __dev_change_flags+0x157/0x1a0
      [   40.750240]  dev_change_flags+0x23/0x60
      
      [...]
      
      Fixes: 4cda7527 ("net: sched: make newly activated qdiscs visible")
      Reported-by: Michal Kubecek <mkubecek@suse.cz>
      CC: Michal Kubecek <mkubecek@suse.cz>
      CC: Eric Dumazet <edumazet@google.com>
      CC: Jamal Hadi Salim <jhs@mojatatu.com>
      CC: Cong Wang <xiyou.wangcong@gmail.com>
      CC: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 12 Mar, 2020 1 commit
    • net: sched: make newly activated qdiscs visible · 4cda7527
      Julian Wiedmann committed
      In their .attach callback, mq[prio] only add the qdiscs of the currently
      active TX queues to the device's qdisc hash list.
      If a user later increases the number of active TX queues, their qdiscs
      are not visible via e.g. 'tc qdisc show'.
      
      Add a hook to netif_set_real_num_tx_queues() that walks all active
      TX queues and adds those which are missing to the hash list.
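The visibility gap described above can be illustrated with a small user-space toy model. This is only a sketch: `ToyDevice`, `visible` and `visible_after_growth` are hypothetical names invented for illustration, and nothing here is kernel code or the patch itself.

```python
class ToyDevice:
    """Toy stand-in for a net_device with per-TX-queue qdiscs (not kernel code)."""

    def __init__(self, num_tx_queues, real_num_tx_queues):
        self.num_tx_queues = num_tx_queues
        self.real_num_tx_queues = real_num_tx_queues
        # One child qdisc per allocated TX queue, named by queue index.
        self.queue_qdiscs = [f"qdisc-{i}" for i in range(num_tx_queues)]
        # Stand-in for the device's qdisc hash list: what a dump would report.
        self.visible = set()

    def attach(self):
        # Like the .attach callback described above: only the currently
        # active queues get their qdiscs added to the hash list.
        for i in range(self.real_num_tx_queues):
            self.visible.add(self.queue_qdiscs[i])

    def set_real_num_tx_queues(self, n, with_hook):
        if not 1 <= n <= self.num_tx_queues:
            raise ValueError("queue count out of range")
        self.real_num_tx_queues = n
        if with_hook:
            # The proposed hook: walk all active queues, add the missing ones.
            for i in range(n):
                self.visible.add(self.queue_qdiscs[i])


def visible_after_growth(with_hook):
    """Attach with 2 active queues, then grow to 4 and report what is visible."""
    dev = ToyDevice(num_tx_queues=4, real_num_tx_queues=2)
    dev.attach()
    dev.set_real_num_tx_queues(4, with_hook)
    return sorted(dev.visible)
```

Calling `visible_after_growth(False)` leaves the qdiscs of queues 2 and 3 out of the visible set, while `visible_after_growth(True)` reports all four. The model always attaches before growing the queue count, so it does not cover the ordering that led to the revert listed earlier on this page.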
      
      CC: Eric Dumazet <edumazet@google.com>
      CC: Jamal Hadi Salim <jhs@mojatatu.com>
      CC: Cong Wang <xiyou.wangcong@gmail.com>
      CC: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 27 Feb, 2020 2 commits
  5. 25 Feb, 2020 1 commit
  6. 20 Feb, 2020 2 commits
  7. 19 Feb, 2020 1 commit
  8. 17 Feb, 2020 2 commits
    • net: export netdev_next_lower_dev_rcu() · 7151affe
      Taehee Yoo committed
      netdev_next_lower_dev_rcu() will be used to implement a function
      that walks all lower interfaces.
      Functions that walk the lower interfaces already exist
      (netdev_walk_all_lower_dev_rcu() and netdev_walk_all_lower_dev()),
      but there are cases that the existing
      netdev_walk_all_lower_dev_{rcu}() functions cannot cover,
      so some modules may want to implement their own function
      to walk all lower interfaces.
      
      In the next patch, netdev_next_lower_dev_rcu() will be used.
      In addition, this patch removes two unused prototypes in netdevice.h.
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: add strict checks in netdev_name_node_alt_destroy() · e08ad805
      Eric Dumazet committed
      netdev_name_node_alt_destroy() does a lookup over all
      device names of a namespace.
      
      We need to make sure the name belongs to the device
      of interest, and that we do not destroy its primary
      name, since we rely on it not being deleted:
      dev->name_node would indeed point to freed memory.
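The two checks named above (the name must belong to the device of interest, and must not be its primary name) can be sketched as a toy model. `ToyNamespace` and its methods are hypothetical names for illustration only; the real kernel data structures and return codes are not reproduced here.

```python
class ToyNamespace:
    """Toy model of a namespace-wide name table (not kernel code)."""

    def __init__(self):
        # name -> (owning device, is_primary_name)
        self.names = {}

    def register(self, dev, primary_name):
        self.names[primary_name] = (dev, True)

    def add_alt_name(self, dev, alt_name):
        self.names[alt_name] = (dev, False)

    def destroy_alt_unchecked(self, dev, name):
        # Looks the name up across the whole namespace and deletes whatever
        # it finds, even another device's name or a primary name.
        return self.names.pop(name, None) is not None

    def destroy_alt_strict(self, dev, name):
        entry = self.names.get(name)
        if entry is None:
            return False
        owner, is_primary = entry
        # Strict checks: must belong to this device and must not be primary.
        if owner != dev or is_primary:
            return False
        del self.names[name]
        return True


def build_namespace():
    ns = ToyNamespace()
    ns.register("dev0", "eth0")
    ns.add_alt_name("dev0", "uplink")
    ns.register("dev1", "eth1")
    return ns
```

In this model, `destroy_alt_unchecked("dev0", "eth0")` removes the primary name, which is the situation that leaves a dangling pointer in the real code, whereas `destroy_alt_strict` refuses both that request and a request naming another device's entry.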
      
      The syzbot report was the following:
      
      BUG: KASAN: use-after-free in dev_net include/linux/netdevice.h:2206 [inline]
      BUG: KASAN: use-after-free in mld_force_mld_version net/ipv6/mcast.c:1172 [inline]
      BUG: KASAN: use-after-free in mld_in_v2_mode_only net/ipv6/mcast.c:1180 [inline]
      BUG: KASAN: use-after-free in mld_in_v1_mode+0x203/0x230 net/ipv6/mcast.c:1190
      Read of size 8 at addr ffff88809886c588 by task swapper/1/0
      
      CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.6.0-rc1-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       <IRQ>
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x197/0x210 lib/dump_stack.c:118
       print_address_description.constprop.0.cold+0xd4/0x30b mm/kasan/report.c:374
       __kasan_report.cold+0x1b/0x32 mm/kasan/report.c:506
       kasan_report+0x12/0x20 mm/kasan/common.c:641
       __asan_report_load8_noabort+0x14/0x20 mm/kasan/generic_report.c:135
       dev_net include/linux/netdevice.h:2206 [inline]
       mld_force_mld_version net/ipv6/mcast.c:1172 [inline]
       mld_in_v2_mode_only net/ipv6/mcast.c:1180 [inline]
       mld_in_v1_mode+0x203/0x230 net/ipv6/mcast.c:1190
       mld_send_initial_cr net/ipv6/mcast.c:2083 [inline]
       mld_dad_timer_expire+0x24/0x230 net/ipv6/mcast.c:2118
       call_timer_fn+0x1ac/0x780 kernel/time/timer.c:1404
       expire_timers kernel/time/timer.c:1449 [inline]
       __run_timers kernel/time/timer.c:1773 [inline]
       __run_timers kernel/time/timer.c:1740 [inline]
       run_timer_softirq+0x6c3/0x1790 kernel/time/timer.c:1786
       __do_softirq+0x262/0x98c kernel/softirq.c:292
       invoke_softirq kernel/softirq.c:373 [inline]
       irq_exit+0x19b/0x1e0 kernel/softirq.c:413
       exiting_irq arch/x86/include/asm/apic.h:546 [inline]
       smp_apic_timer_interrupt+0x1a3/0x610 arch/x86/kernel/apic/apic.c:1146
       apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:829
       </IRQ>
      RIP: 0010:native_safe_halt+0xe/0x10 arch/x86/include/asm/irqflags.h:61
      Code: 68 73 c5 f9 eb 8a cc cc cc cc cc cc e9 07 00 00 00 0f 00 2d 94 be 59 00 f4 c3 66 90 e9 07 00 00 00 0f 00 2d 84 be 59 00 fb f4 <c3> cc 55 48 89 e5 41 57 41 56 41 55 41 54 53 e8 de 2a 74 f9 e8 09
      RSP: 0018:ffffc90000d3fd68 EFLAGS: 00000282 ORIG_RAX: ffffffffffffff13
      RAX: 1ffffffff136761a RBX: ffff8880a99fc340 RCX: 0000000000000000
      RDX: dffffc0000000000 RSI: 0000000000000006 RDI: ffff8880a99fcbd4
      RBP: ffffc90000d3fd98 R08: ffff8880a99fc340 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000000 R12: dffffc0000000000
      R13: ffffffff8aa5a1c0 R14: 0000000000000000 R15: 0000000000000001
       arch_cpu_idle+0xa/0x10 arch/x86/kernel/process.c:686
       default_idle_call+0x84/0xb0 kernel/sched/idle.c:94
       cpuidle_idle_call kernel/sched/idle.c:154 [inline]
       do_idle+0x3c8/0x6e0 kernel/sched/idle.c:269
       cpu_startup_entry+0x1b/0x20 kernel/sched/idle.c:361
       start_secondary+0x2f4/0x410 arch/x86/kernel/smpboot.c:264
       secondary_startup_64+0xa4/0xb0 arch/x86/kernel/head_64.S:242
      
      Allocated by task 10229:
       save_stack+0x23/0x90 mm/kasan/common.c:72
       set_track mm/kasan/common.c:80 [inline]
       __kasan_kmalloc mm/kasan/common.c:515 [inline]
       __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:488
       kasan_kmalloc+0x9/0x10 mm/kasan/common.c:529
       __do_kmalloc_node mm/slab.c:3616 [inline]
       __kmalloc_node+0x4e/0x70 mm/slab.c:3623
       kmalloc_node include/linux/slab.h:578 [inline]
       kvmalloc_node+0x68/0x100 mm/util.c:574
       kvmalloc include/linux/mm.h:645 [inline]
       kvzalloc include/linux/mm.h:653 [inline]
       alloc_netdev_mqs+0x98/0xe40 net/core/dev.c:9797
       rtnl_create_link+0x22d/0xaf0 net/core/rtnetlink.c:3047
       __rtnl_newlink+0xf9f/0x1790 net/core/rtnetlink.c:3309
       rtnl_newlink+0x69/0xa0 net/core/rtnetlink.c:3377
       rtnetlink_rcv_msg+0x45e/0xaf0 net/core/rtnetlink.c:5438
       netlink_rcv_skb+0x177/0x450 net/netlink/af_netlink.c:2477
       rtnetlink_rcv+0x1d/0x30 net/core/rtnetlink.c:5456
       netlink_unicast_kernel net/netlink/af_netlink.c:1302 [inline]
       netlink_unicast+0x59e/0x7e0 net/netlink/af_netlink.c:1328
       netlink_sendmsg+0x91c/0xea0 net/netlink/af_netlink.c:1917
       sock_sendmsg_nosec net/socket.c:652 [inline]
       sock_sendmsg+0xd7/0x130 net/socket.c:672
       __sys_sendto+0x262/0x380 net/socket.c:1998
       __do_compat_sys_socketcall net/compat.c:771 [inline]
       __se_compat_sys_socketcall net/compat.c:719 [inline]
       __ia32_compat_sys_socketcall+0x530/0x710 net/compat.c:719
       do_syscall_32_irqs_on arch/x86/entry/common.c:337 [inline]
       do_fast_syscall_32+0x27b/0xe16 arch/x86/entry/common.c:408
       entry_SYSENTER_compat+0x70/0x7f arch/x86/entry/entry_64_compat.S:139
      
      Freed by task 10229:
       save_stack+0x23/0x90 mm/kasan/common.c:72
       set_track mm/kasan/common.c:80 [inline]
       kasan_set_free_info mm/kasan/common.c:337 [inline]
       __kasan_slab_free+0x102/0x150 mm/kasan/common.c:476
       kasan_slab_free+0xe/0x10 mm/kasan/common.c:485
       __cache_free mm/slab.c:3426 [inline]
       kfree+0x10a/0x2c0 mm/slab.c:3757
       __netdev_name_node_alt_destroy+0x1ff/0x2a0 net/core/dev.c:322
       netdev_name_node_alt_destroy+0x57/0x80 net/core/dev.c:334
       rtnl_alt_ifname net/core/rtnetlink.c:3518 [inline]
       rtnl_linkprop.isra.0+0x575/0x6f0 net/core/rtnetlink.c:3567
       rtnl_dellinkprop+0x46/0x60 net/core/rtnetlink.c:3588
       rtnetlink_rcv_msg+0x45e/0xaf0 net/core/rtnetlink.c:5438
       netlink_rcv_skb+0x177/0x450 net/netlink/af_netlink.c:2477
       rtnetlink_rcv+0x1d/0x30 net/core/rtnetlink.c:5456
       netlink_unicast_kernel net/netlink/af_netlink.c:1302 [inline]
       netlink_unicast+0x59e/0x7e0 net/netlink/af_netlink.c:1328
       netlink_sendmsg+0x91c/0xea0 net/netlink/af_netlink.c:1917
       sock_sendmsg_nosec net/socket.c:652 [inline]
       sock_sendmsg+0xd7/0x130 net/socket.c:672
       ____sys_sendmsg+0x753/0x880 net/socket.c:2343
       ___sys_sendmsg+0x100/0x170 net/socket.c:2397
       __sys_sendmsg+0x105/0x1d0 net/socket.c:2430
       __compat_sys_sendmsg net/compat.c:642 [inline]
       __do_compat_sys_sendmsg net/compat.c:649 [inline]
       __se_compat_sys_sendmsg net/compat.c:646 [inline]
       __ia32_compat_sys_sendmsg+0x7a/0xb0 net/compat.c:646
       do_syscall_32_irqs_on arch/x86/entry/common.c:337 [inline]
       do_fast_syscall_32+0x27b/0xe16 arch/x86/entry/common.c:408
       entry_SYSENTER_compat+0x70/0x7f arch/x86/entry/entry_64_compat.S:139
      
      The buggy address belongs to the object at ffff88809886c000
       which belongs to the cache kmalloc-4k of size 4096
      The buggy address is located 1416 bytes inside of
       4096-byte region [ffff88809886c000, ffff88809886d000)
      The buggy address belongs to the page:
      page:ffffea0002621b00 refcount:1 mapcount:0 mapping:ffff8880aa402000 index:0x0 compound_mapcount: 0
      flags: 0xfffe0000010200(slab|head)
      raw: 00fffe0000010200 ffffea0002610d08 ffffea0002607608 ffff8880aa402000
      raw: 0000000000000000 ffff88809886c000 0000000100000001 0000000000000000
      page dumped because: kasan: bad access detected
      
      Memory state around the buggy address:
       ffff88809886c480: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
       ffff88809886c500: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      >ffff88809886c580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                            ^
       ffff88809886c600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
       ffff88809886c680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      
      Fixes: 36fbf1e5 ("net: rtnetlink: add linkprop commands to add and delete alternative ifnames")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Cc: Jiri Pirko <jiri@mellanox.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  9. 12 Feb, 2020 1 commit
    • core: Don't skip generic XDP program execution for cloned SKBs · ad1e03b2
      Toke Høiland-Jørgensen committed
      The current generic XDP handler skips execution of XDP programs entirely if
      an SKB is marked as cloned. This leads to some surprising behaviour, as
      packets can end up being cloned in various ways, which will make an XDP
      program not see all the traffic on an interface.
      
      This was discovered by a simple test case where an XDP program that always
      returns XDP_DROP is installed on a veth device. When combining this with
      the Scapy packet sniffer (which uses an AF_PACKET socket) on the sending
      side, SKBs reliably end up in the cloned state, causing them to be passed
      through to the receiving interface instead of being dropped. A minimal
      reproducer script for this is included below.
      
      This patch fixes the issue by simply triggering the existing linearisation
      code for cloned SKBs instead of skipping the XDP program execution. This
      behaviour is in line with the behaviour of the native XDP implementation
      for the veth driver, which will reallocate and copy the SKB data if the SKB
      is marked as shared.
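The behavioural difference described above can be sketched with a toy handler. This is an illustration under stated assumptions: packets are plain dicts, `generic_xdp` and `drop_everything` are hypothetical names, and the "private copy" stands in loosely for the kernel's linearisation step.

```python
XDP_DROP = "XDP_DROP"
XDP_PASS = "XDP_PASS"


def drop_everything(packet):
    """Toy XDP program that drops every packet it is shown."""
    return XDP_DROP


def generic_xdp(packet, prog, skip_cloned):
    """Toy generic-XDP handler.

    skip_cloned=True models the old behaviour (cloned packets bypass the
    program); skip_cloned=False models the fixed behaviour (a private copy
    is made and the program still runs).
    """
    if packet["cloned"]:
        if skip_cloned:
            # Program never runs, so the packet is passed up the stack.
            return XDP_PASS
        # Stand-in for linearisation: give the program its own copy.
        packet = {"data": bytes(packet["data"]), "cloned": False}
    return prog(packet)
```

With `skip_cloned=True`, a cloned packet comes back as `XDP_PASS` even though the program always drops, which matches the symptom seen with the sniffer on the sending side. With `skip_cloned=False`, the same packet is dropped.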
      
      Reproducer Python script (requires BCC and Scapy):
      
      from scapy.all import TCP, IP, Ether, sendp, sniff, AsyncSniffer, Raw, UDP
      from bcc import BPF
      import time, sys, subprocess, shlex
      
      SKB_MODE = (1 << 1)
      DRV_MODE = (1 << 2)
      PYTHON=sys.executable
      
      def client():
          time.sleep(2)
          # Sniffing on the sender causes skb_cloned() to be set
          s = AsyncSniffer()
          s.start()
      
          for p in range(10):
              sendp(Ether(dst="aa:aa:aa:aa:aa:aa", src="cc:cc:cc:cc:cc:cc")/IP()/UDP()/Raw("Test"),
                    verbose=False)
              time.sleep(0.1)
      
          s.stop()
          return 0
      
      def server(mode):
          prog = BPF(text="int dummy_drop(struct xdp_md *ctx) {return XDP_DROP;}")
          func = prog.load_func("dummy_drop", BPF.XDP)
          prog.attach_xdp("a_to_b", func, mode)
      
          time.sleep(1)
      
          s = sniff(iface="a_to_b", count=10, timeout=15)
          if len(s):
              print(f"Got {len(s)} packets - should have gotten 0")
              return 1
          else:
              print("Got no packets - as expected")
              return 0
      
      if len(sys.argv) < 2:
          print(f"Usage: {sys.argv[0]} <skb|drv>")
          sys.exit(1)
      
      if sys.argv[1] == "client":
          sys.exit(client())
      elif sys.argv[1] == "server":
          mode = SKB_MODE if sys.argv[2] == 'skb' else DRV_MODE
          sys.exit(server(mode))
      else:
          try:
              mode = sys.argv[1]
              if mode not in ('skb', 'drv'):
                  print(f"Usage: {sys.argv[0]} <skb|drv>")
                  sys.exit(1)
              print(f"Running in {mode} mode")
      
              for cmd in [
                      'ip netns add netns_a',
                      'ip netns add netns_b',
                      'ip -n netns_a link add a_to_b type veth peer name b_to_a netns netns_b',
                      # Disable ipv6 to make sure there's no address autoconf traffic
                      'ip netns exec netns_a sysctl -qw net.ipv6.conf.a_to_b.disable_ipv6=1',
                      'ip netns exec netns_b sysctl -qw net.ipv6.conf.b_to_a.disable_ipv6=1',
                      'ip -n netns_a link set dev a_to_b address aa:aa:aa:aa:aa:aa',
                      'ip -n netns_b link set dev b_to_a address cc:cc:cc:cc:cc:cc',
                      'ip -n netns_a link set dev a_to_b up',
                      'ip -n netns_b link set dev b_to_a up']:
                  subprocess.check_call(shlex.split(cmd))
      
              server = subprocess.Popen(shlex.split(f"ip netns exec netns_a {PYTHON} {sys.argv[0]} server {mode}"))
              client = subprocess.Popen(shlex.split(f"ip netns exec netns_b {PYTHON} {sys.argv[0]} client"))
      
              client.wait()
              server.wait()
              sys.exit(server.returncode)
      
          finally:
              subprocess.run(shlex.split("ip netns delete netns_a"))
              subprocess.run(shlex.split("ip netns delete netns_b"))
      
      Fixes: d4455169 ("net: xdp: support xdp generic on virtual devices")
      Reported-by: Stepan Horacek <shoracek@redhat.com>
      Suggested-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  10. 04 Feb, 2020 1 commit
  11. 27 Jan, 2020 5 commits
  12. 23 Jan, 2020 2 commits
    • net: rtnetlink: validate IFLA_MTU attribute in rtnl_create_link() · d836f5c6
      Eric Dumazet committed
      rtnl_create_link() needs to apply the same dev->min_mtu and dev->max_mtu
      checks that we apply in do_setlink().
      
      Otherwise malicious users can crash the kernel, for example after
      an integer overflow:
      
      BUG: KASAN: use-after-free in memset include/linux/string.h:365 [inline]
      BUG: KASAN: use-after-free in __alloc_skb+0x37b/0x5e0 net/core/skbuff.c:238
      Write of size 32 at addr ffff88819f20b9c0 by task swapper/0/0
      
      CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.5.0-rc1-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       <IRQ>
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x197/0x210 lib/dump_stack.c:118
       print_address_description.constprop.0.cold+0xd4/0x30b mm/kasan/report.c:374
       __kasan_report.cold+0x1b/0x41 mm/kasan/report.c:506
       kasan_report+0x12/0x20 mm/kasan/common.c:639
       check_memory_region_inline mm/kasan/generic.c:185 [inline]
       check_memory_region+0x134/0x1a0 mm/kasan/generic.c:192
       memset+0x24/0x40 mm/kasan/common.c:108
       memset include/linux/string.h:365 [inline]
       __alloc_skb+0x37b/0x5e0 net/core/skbuff.c:238
       alloc_skb include/linux/skbuff.h:1049 [inline]
       alloc_skb_with_frags+0x93/0x590 net/core/skbuff.c:5664
       sock_alloc_send_pskb+0x7ad/0x920 net/core/sock.c:2242
       sock_alloc_send_skb+0x32/0x40 net/core/sock.c:2259
       mld_newpack+0x1d7/0x7f0 net/ipv6/mcast.c:1609
       add_grhead.isra.0+0x299/0x370 net/ipv6/mcast.c:1713
       add_grec+0x7db/0x10b0 net/ipv6/mcast.c:1844
       mld_send_cr net/ipv6/mcast.c:1970 [inline]
       mld_ifc_timer_expire+0x3d3/0x950 net/ipv6/mcast.c:2477
       call_timer_fn+0x1ac/0x780 kernel/time/timer.c:1404
       expire_timers kernel/time/timer.c:1449 [inline]
       __run_timers kernel/time/timer.c:1773 [inline]
       __run_timers kernel/time/timer.c:1740 [inline]
       run_timer_softirq+0x6c3/0x1790 kernel/time/timer.c:1786
       __do_softirq+0x262/0x98c kernel/softirq.c:292
       invoke_softirq kernel/softirq.c:373 [inline]
       irq_exit+0x19b/0x1e0 kernel/softirq.c:413
       exiting_irq arch/x86/include/asm/apic.h:536 [inline]
       smp_apic_timer_interrupt+0x1a3/0x610 arch/x86/kernel/apic/apic.c:1137
       apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:829
       </IRQ>
      RIP: 0010:native_safe_halt+0xe/0x10 arch/x86/include/asm/irqflags.h:61
      Code: 98 6b ea f9 eb 8a cc cc cc cc cc cc e9 07 00 00 00 0f 00 2d 44 1c 60 00 f4 c3 66 90 e9 07 00 00 00 0f 00 2d 34 1c 60 00 fb f4 <c3> cc 55 48 89 e5 41 57 41 56 41 55 41 54 53 e8 4e 5d 9a f9 e8 79
      RSP: 0018:ffffffff89807ce8 EFLAGS: 00000286 ORIG_RAX: ffffffffffffff13
      RAX: 1ffffffff13266ae RBX: ffffffff8987a1c0 RCX: 0000000000000000
      RDX: dffffc0000000000 RSI: 0000000000000006 RDI: ffffffff8987aa54
      RBP: ffffffff89807d18 R08: ffffffff8987a1c0 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000000 R12: dffffc0000000000
      R13: ffffffff8a799980 R14: 0000000000000000 R15: 0000000000000000
       arch_cpu_idle+0xa/0x10 arch/x86/kernel/process.c:690
       default_idle_call+0x84/0xb0 kernel/sched/idle.c:94
       cpuidle_idle_call kernel/sched/idle.c:154 [inline]
       do_idle+0x3c8/0x6e0 kernel/sched/idle.c:269
       cpu_startup_entry+0x1b/0x20 kernel/sched/idle.c:361
       rest_init+0x23b/0x371 init/main.c:451
       arch_call_rest_init+0xe/0x1b
       start_kernel+0x904/0x943 init/main.c:784
       x86_64_start_reservations+0x29/0x2b arch/x86/kernel/head64.c:490
       x86_64_start_kernel+0x77/0x7b arch/x86/kernel/head64.c:471
       secondary_startup_64+0xa4/0xb0 arch/x86/kernel/head_64.S:242
      
      The buggy address belongs to the page:
      page:ffffea00067c82c0 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0
      raw: 057ffe0000000000 ffffea00067c82c8 ffffea00067c82c8 0000000000000000
      raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
      page dumped because: kasan: bad access detected
      
      Memory state around the buggy address:
       ffff88819f20b880: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
       ffff88819f20b900: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
      >ffff88819f20b980: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
                                                 ^
       ffff88819f20ba00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
       ffff88819f20ba80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
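The min/max MTU check that this commit message asks for at link creation can be sketched as a small range test. `validate_mtu` is a hypothetical helper, not the kernel function, and treating `max_mtu == 0` as "no upper bound" is an assumption made for this sketch.

```python
def validate_mtu(new_mtu, min_mtu, max_mtu):
    """Return True if new_mtu is acceptable for a device.

    Assumption for this sketch: max_mtu == 0 means "no upper bound".
    """
    if new_mtu < 0 or new_mtu < min_mtu:
        return False
    if max_mtu > 0 and new_mtu > max_mtu:
        return False
    return True
```

For a device with `min_mtu=68` and `max_mtu=1500`, a request such as `validate_mtu(0xFFFFFFFF, 68, 1500)` is rejected, whereas without any check at creation time the value would be stored as given.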
      
      Fixes: 61e84623 ("net: centralize net_device min/max MTU checking")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Fix packet reordering caused by GRO and listified RX cooperation · c8079432
      Maxim Mikityanskiy committed
      Commit 323ebb61 ("net: use listified RX for handling GRO_NORMAL
      skbs") introduces batching of GRO_NORMAL packets in napi_frags_finish,
      and commit 6570bc79 ("net: core: use listified Rx for GRO_NORMAL in
      napi_gro_receive()") adds the same to napi_skb_finish. However,
      dev_gro_receive (that is called just before napi_{frags,skb}_finish) can
      also pass skbs to the networking stack: e.g., when the GRO session is
      flushed, napi_gro_complete is called, which passes pp directly to
      netif_receive_skb_internal, skipping napi->rx_list. It means that the
      packet stored in pp will be handled by the stack earlier than the
      packets that arrived before, but are still waiting in napi->rx_list. It
      leads to TCP reorderings that can be observed in the TCPOFOQueue counter
      in netstat.
      
      This commit fixes the reordering issue by making napi_gro_complete also
      use napi->rx_list, so that all packets going through GRO will keep their
      order. In order to keep napi_gro_flush working properly, gro_normal_list
      calls are moved after the flush to clear napi->rx_list.
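The reordering and its fix can be illustrated with a toy model of one poll cycle. The event names and `run_poll` are hypothetical, and the model ignores batch-size-triggered flushes; it only shows why a packet delivered directly can overtake packets still waiting in the batched list.

```python
def run_poll(events, complete_via_rx_list):
    """Return the order in which the stack sees packets in one toy poll cycle.

    events: list of (kind, packet) in arrival order, where kind is
      "normal"       - a GRO_NORMAL packet, batched on the rx list
      "gro_complete" - a packet leaving a flushed GRO session
    complete_via_rx_list: False models the old direct delivery,
      True models the fixed behaviour (completed packets join the list).
    """
    stack = []    # delivery order as seen by the networking stack
    rx_list = []  # stand-in for napi->rx_list

    for kind, packet in events:
        if kind == "normal":
            rx_list.append(packet)
        elif kind == "gro_complete":
            if complete_via_rx_list:
                rx_list.append(packet)
            else:
                stack.append(packet)  # overtakes everything in rx_list
        else:
            raise ValueError(f"unknown event kind: {kind}")

    # End-of-poll flush of the batched list.
    stack.extend(rx_list)
    return stack


EVENTS = [("normal", "A"), ("gro_complete", "P"), ("normal", "B")]
```

With direct delivery, `run_poll(EVENTS, False)` hands `P` to the stack before `A`, even though `A` was queued first. Routing completed packets through the same list keeps the order `A`, `P`, `B`.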
      
      iwlwifi calls napi_gro_flush directly and does the same thing that is
      done by gro_normal_list, so the same change is applied there:
      napi_gro_flush is moved to be before the flush of napi->rx_list.
      
      A few other drivers also use napi_gro_flush (brocade/bna/bnad.c,
      cortina/gemini.c, hisilicon/hns3/hns3_enet.c). The first two also use
      napi_complete_done afterwards, which performs the gro_normal_list flush,
      so they are fine. The latter calls napi_gro_receive right after
      napi_gro_flush, so it can end up with non-empty napi->rx_list anyway.
      
      Fixes: 323ebb61 ("net: use listified RX for handling GRO_NORMAL skbs")
      Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
      Cc: Alexander Lobakin <alobakin@dlink.ru>
      Cc: Edward Cree <ecree@solarflare.com>
      Acked-by: Alexander Lobakin <alobakin@dlink.ru>
      Acked-by: Saeed Mahameed <saeedm@mellanox.com>
      Acked-by: Edward Cree <ecree@solarflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  13. 21 Jan, 2020 1 commit
    • net-sysfs: Fix reference count leak · cb626bf5
      Jouni Hogander committed
      netdev_register_kobject() calls device_initialize(). If registration
      then fails, the reference taken by device_initialize() is never dropped.
      
      Drivers are supposed to call free_netdev() in case of error. In the
      non-error case the last reference is dropped there and the device
      release sequence is triggered. In the error case this reference is kept
      and the release sequence is never started.
      
      Fix this by setting reg_state to NETREG_UNREGISTERED if registering
      fails.
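The leak and the fix can be sketched with a toy reference counter. This is a simplified model, not kernel code: `ToyNetdev`, `register` and `released_after_failed_register` are hypothetical names, and the assumption that `free_netdev` skips the final put while the state is still the initial one is a simplification drawn from the description above.

```python
class ToyNetdev:
    """Toy device with a reference count and a registration state."""

    def __init__(self):
        self.reg_state = "NETREG_UNINITIALIZED"
        self.refs = 0
        self.released = False

    def device_initialize(self):
        self.refs = 1  # initial reference

    def put_device(self):
        self.refs -= 1
        if self.refs == 0:
            self.released = True  # release sequence runs


def register(dev, fail, fixed):
    dev.device_initialize()
    if fail:
        if fixed:
            # The fix: mark the device so free_netdev() drops the reference.
            dev.reg_state = "NETREG_UNREGISTERED"
        return False
    dev.reg_state = "NETREG_REGISTERED"
    return True


def free_netdev(dev):
    # Assumption for this sketch: in the initial state the reference taken
    # during registration is never dropped here.
    if dev.reg_state == "NETREG_UNINITIALIZED":
        return
    dev.put_device()


def released_after_failed_register(fixed):
    dev = ToyNetdev()
    register(dev, fail=True, fixed=fixed)
    free_netdev(dev)  # what a driver does on the error path
    return dev.released
```

In this model `released_after_failed_register(False)` leaves the reference held, so the release sequence never runs. With the state change applied, the same error path ends with the device released.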
      
      This is the root cause of a couple of memory leaks reported by Syzkaller:
      
      BUG: memory leak
      unreferenced object 0xffff8880675ca008 (size 256):
        comm "netdev_register", pid 281, jiffies 4294696663 (age 6.808s)
        hex dump (first 32 bytes):
          00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
          00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
        backtrace:
          [<0000000058ca4711>] kmem_cache_alloc_trace+0x167/0x280
          [<000000002340019b>] device_add+0x882/0x1750
          [<000000001d588c3a>] netdev_register_kobject+0x128/0x380
          [<0000000011ef5535>] register_netdevice+0xa1b/0xf00
          [<000000007fcf1c99>] __tun_chr_ioctl+0x20d5/0x3dd0
          [<000000006a5b7b2b>] tun_chr_ioctl+0x2f/0x40
          [<00000000f30f834a>] do_vfs_ioctl+0x1c7/0x1510
          [<00000000fba062ea>] ksys_ioctl+0x99/0xb0
          [<00000000b1c1b8d2>] __x64_sys_ioctl+0x78/0xb0
          [<00000000984cabb9>] do_syscall_64+0x16f/0x580
          [<000000000bde033d>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
          [<00000000e6ca2d9f>] 0xffffffffffffffff
      
      BUG: memory leak
      unreferenced object 0xffff8880668ba588 (size 8):
        comm "kobject_set_nam", pid 286, jiffies 4294725297 (age 9.871s)
        hex dump (first 8 bytes):
          6e 72 30 00 cc be df 2b                          nr0....+
        backtrace:
          [<00000000a322332a>] __kmalloc_track_caller+0x16e/0x290
          [<00000000236fd26b>] kstrdup+0x3e/0x70
          [<00000000dd4a2815>] kstrdup_const+0x3e/0x50
          [<0000000049a377fc>] kvasprintf_const+0x10e/0x160
          [<00000000627fc711>] kobject_set_name_vargs+0x5b/0x140
          [<0000000019eeab06>] dev_set_name+0xc0/0xf0
          [<0000000069cb12bc>] netdev_register_kobject+0xc8/0x320
          [<00000000f2e83732>] register_netdevice+0xa1b/0xf00
          [<000000009e1f57cc>] __tun_chr_ioctl+0x20d5/0x3dd0
          [<000000009c560784>] tun_chr_ioctl+0x2f/0x40
          [<000000000d759e02>] do_vfs_ioctl+0x1c7/0x1510
          [<00000000351d7c31>] ksys_ioctl+0x99/0xb0
          [<000000008390040a>] __x64_sys_ioctl+0x78/0xb0
          [<0000000052d196b7>] do_syscall_64+0x16f/0x580
          [<0000000019af9236>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
          [<00000000bc384531>] 0xffffffffffffffff
      
      v3 -> v4:
      * Set reg_state to NETREG_UNREGISTERED if registering fails
      
      v2 -> v3:
      * Replaced BUG_ON with WARN_ON in free_netdev and netdev_release
      
      v1 -> v2:
      * Relying on driver calling free_netdev rather than calling
        put_device directly in error path
      
      Reported-by: syzbot+ad8ca40ecd77896d51e2@syzkaller.appspotmail.com
      Cc: David Miller <davem@davemloft.net>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Lukas Bulwahn <lukas.bulwahn@gmail.com>
      Signed-off-by: Jouni Hogander <jouni.hogander@unikie.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  14. 17 Jan, 2020 2 commits
    • net: avoid updating qdisc_xmit_lock_key in netdev_update_lockdep_key() · 53d37497
      Cong Wang committed
      syzbot reported some bogus lockdep warnings, for example bad unlock
      balance in sch_direct_xmit(). They are due to a race condition between
      slow path and fast path, that is qdisc_xmit_lock_key gets re-registered
      in netdev_update_lockdep_key() on slow path, while we could still
      acquire the queue->_xmit_lock on fast path in this small window:
      
      CPU A						CPU B
      						__netif_tx_lock();
      lockdep_unregister_key(qdisc_xmit_lock_key);
      						__netif_tx_unlock();
      lockdep_register_key(qdisc_xmit_lock_key);
      
      In fact, unlike the addr_list_lock which has to be reordered when
      the master/slave device relationship changes, queue->_xmit_lock is
      only acquired on fast path and only when NETIF_F_LLTX is not set,
      so there is likely no nested locking for it.
      
      Therefore, we can just get rid of re-registration of
      qdisc_xmit_lock_key.
      
      Reported-by: syzbot+4ec99438ed7450da6272@syzkaller.appspotmail.com
      Fixes: ab92d68f ("net: core: add generic lockdep keys")
      Cc: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Acked-by: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • xdp: Move devmap bulk queue into struct net_device · 75ccae62
      Toke Høiland-Jørgensen committed
      Commit 96360004 ("xdp: Make devmap flush_list common for all map
      instances") changed devmap flushing to be a global operation instead of a
      per-map operation. However, the queue structure used for bulking was still
      allocated as part of the containing map.
      
      This patch moves the devmap bulk queue into struct net_device. The
      motivation for this is reusing it for the non-map variant of XDP_REDIRECT,
      which will be changed in a subsequent commit.  To avoid other fields of
      struct net_device moving to different cache lines, we also move a couple of
      other members around.
      
      We defer the actual allocation of the bulk queue structure until the
      NETDEV_REGISTER notification in devmap.c. This makes it possible to check for
      ndo_xdp_xmit support before allocating the structure, which is not possible
      at the time struct net_device is allocated. However, we keep the freeing in
      free_netdev() to avoid adding another RCU callback on NETDEV_UNREGISTER.
      
      Because of this change, we lose the reference back to the map that
      originated the redirect, so change the tracepoint to always return 0 as the
      map ID and index. Otherwise no functional change is intended with this
      patch.
      
      After this patch, the relevant part of struct net_device looks like this,
      according to pahole:
      
      	/* --- cacheline 14 boundary (896 bytes) --- */
      	struct netdev_queue *      _tx __attribute__((__aligned__(64))); /*   896     8 */
      	unsigned int               num_tx_queues;        /*   904     4 */
      	unsigned int               real_num_tx_queues;   /*   908     4 */
      	struct Qdisc *             qdisc;                /*   912     8 */
      	unsigned int               tx_queue_len;         /*   920     4 */
      	spinlock_t                 tx_global_lock;       /*   924     4 */
      	struct xdp_dev_bulk_queue * xdp_bulkq;           /*   928     8 */
      	struct xps_dev_maps *      xps_cpus_map;         /*   936     8 */
      	struct xps_dev_maps *      xps_rxqs_map;         /*   944     8 */
      	struct mini_Qdisc *        miniq_egress;         /*   952     8 */
      	/* --- cacheline 15 boundary (960 bytes) --- */
      	struct hlist_head  qdisc_hash[16];               /*   960   128 */
      	/* --- cacheline 17 boundary (1088 bytes) --- */
      	struct timer_list  watchdog_timer;               /*  1088    40 */
      
      	/* XXX last struct has 4 bytes of padding */
      
      	int                        watchdog_timeo;       /*  1128     4 */
      
      	/* XXX 4 bytes hole, try to pack */
      
      	struct list_head   todo_list;                    /*  1136    16 */
      	/* --- cacheline 18 boundary (1152 bytes) --- */
      Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Björn Töpel <bjorn.topel@intel.com>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Link: https://lore.kernel.org/bpf/157918768397.1458396.12673224324627072349.stgit@toke.dk
      75ccae62
  15. 18 Dec, 2019 1 commit
    • L
      netfilter: Clean up unnecessary #ifdef · 871185ac
      Lukas Wunner committed
      If CONFIG_NETFILTER_INGRESS is not enabled, nf_ingress() becomes a no-op
      because it solely contains an if-clause calling nf_hook_ingress_active(),
      for which an empty inline stub exists in <linux/netfilter_ingress.h>.
      
      All the symbols used in the if-clause's body are still available even if
      CONFIG_NETFILTER_INGRESS is not enabled.
      
      The additional "#ifdef CONFIG_NETFILTER_INGRESS" in nf_ingress() is thus
      unnecessary, so drop it.
      Signed-off-by: Lukas Wunner <lukas@wunner.de>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      871185ac
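The stub pattern this commit relies on can be sketched in plain C. This is only an illustration under stated assumptions: `NF_DEMO_INGRESS`, `nf_demo_hook_active()` and `nf_demo_ingress()` are hypothetical stand-ins for the config option, the stubbed helper and the caller, not the kernel's actual code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for CONFIG_NETFILTER_INGRESS; uncomment to enable. */
/* #define NF_DEMO_INGRESS 1 */

struct nf_demo_dev {
	int hook_count; /* number of registered ingress hooks (illustrative) */
};

#ifdef NF_DEMO_INGRESS
/* Real helper: reports whether any hook is registered. */
static inline bool nf_demo_hook_active(const struct nf_demo_dev *dev)
{
	return dev->hook_count > 0;
}
#else
/* Empty inline stub: always false, so the caller's branch is dead code. */
static inline bool nf_demo_hook_active(const struct nf_demo_dev *dev)
{
	(void)dev;
	return false;
}
#endif

/* The caller needs no #ifdef of its own: with the stub it is a no-op. */
int nf_demo_ingress(const struct nf_demo_dev *dev)
{
	if (nf_demo_hook_active(dev))
		return 1; /* packet would be handed to the hook */
	return 0;
}

/* With the option disabled, even a device with hooks takes the no-op path. */
void nf_demo_selfcheck(void)
{
	struct nf_demo_dev dev = { .hook_count = 3 };

	assert(nf_demo_ingress(&dev) == 0);
}
```

Because the stub is an inline function returning a constant, a compiler can drop the branch entirely, which is why the extra #ifdef in the caller adds nothing.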
  16. 14 Dec, 2019 1 commit
  17. 10 Dec, 2019 1 commit
  18. 08 Dec, 2019 2 commits
    • T
      sched/rt, net: Use CONFIG_PREEMPTION.patch · 2da2b32f
      Thomas Gleixner committed
      CONFIG_PREEMPTION is selected by CONFIG_PREEMPT and by CONFIG_PREEMPT_RT.
      Both PREEMPT and PREEMPT_RT require the same functionality which today
      depends on CONFIG_PREEMPT.
      
      Update the comment to use CONFIG_PREEMPTION.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: David S. Miller <davem@davemloft.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: netdev@vger.kernel.org
      Link: https://lore.kernel.org/r/20191015191821.11479-22-bigeasy@linutronix.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      2da2b32f
    • E
      inet: protect against too small mtu values. · 501a90c9
      Eric Dumazet committed
      syzbot was once again able to crash a host by setting a very small mtu
      on loopback device.
      
      Let's make inetdev_valid_mtu() available in include/net/ip.h,
      and use it in ip_setup_cork(), so that we protect both ip_append_page()
      and __ip_append_data().
      
      Also add a READ_ONCE() when the device mtu is read.
      
      This pairs the lockless read with one WRITE_ONCE() in __dev_set_mtu(),
      even if other code paths might write over this field.
      
      Add a big comment in include/linux/netdevice.h about dev->mtu
      needing READ_ONCE()/WRITE_ONCE() annotations.
      
      Hopefully we will add the missing ones in followup patches.
      
      [1]
      
      refcount_t: saturated; leaking memory.
      WARNING: CPU: 0 PID: 9464 at lib/refcount.c:22 refcount_warn_saturate+0x138/0x1f0 lib/refcount.c:22
      Kernel panic - not syncing: panic_on_warn set ...
      CPU: 0 PID: 9464 Comm: syz-executor850 Not tainted 5.4.0-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x197/0x210 lib/dump_stack.c:118
       panic+0x2e3/0x75c kernel/panic.c:221
       __warn.cold+0x2f/0x3e kernel/panic.c:582
       report_bug+0x289/0x300 lib/bug.c:195
       fixup_bug arch/x86/kernel/traps.c:174 [inline]
       fixup_bug arch/x86/kernel/traps.c:169 [inline]
       do_error_trap+0x11b/0x200 arch/x86/kernel/traps.c:267
       do_invalid_op+0x37/0x50 arch/x86/kernel/traps.c:286
       invalid_op+0x23/0x30 arch/x86/entry/entry_64.S:1027
      RIP: 0010:refcount_warn_saturate+0x138/0x1f0 lib/refcount.c:22
      Code: 06 31 ff 89 de e8 c8 f5 e6 fd 84 db 0f 85 6f ff ff ff e8 7b f4 e6 fd 48 c7 c7 e0 71 4f 88 c6 05 56 a6 a4 06 01 e8 c7 a8 b7 fd <0f> 0b e9 50 ff ff ff e8 5c f4 e6 fd 0f b6 1d 3d a6 a4 06 31 ff 89
      RSP: 0018:ffff88809689f550 EFLAGS: 00010286
      RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
      RDX: 0000000000000000 RSI: ffffffff815e4336 RDI: ffffed1012d13e9c
      RBP: ffff88809689f560 R08: ffff88809c50a3c0 R09: fffffbfff15d31b1
      R10: fffffbfff15d31b0 R11: ffffffff8ae98d87 R12: 0000000000000001
      R13: 0000000000040100 R14: ffff888099041104 R15: ffff888218d96e40
       refcount_add include/linux/refcount.h:193 [inline]
       skb_set_owner_w+0x2b6/0x410 net/core/sock.c:1999
       sock_wmalloc+0xf1/0x120 net/core/sock.c:2096
       ip_append_page+0x7ef/0x1190 net/ipv4/ip_output.c:1383
       udp_sendpage+0x1c7/0x480 net/ipv4/udp.c:1276
       inet_sendpage+0xdb/0x150 net/ipv4/af_inet.c:821
       kernel_sendpage+0x92/0xf0 net/socket.c:3794
       sock_sendpage+0x8b/0xc0 net/socket.c:936
       pipe_to_sendpage+0x2da/0x3c0 fs/splice.c:458
       splice_from_pipe_feed fs/splice.c:512 [inline]
       __splice_from_pipe+0x3ee/0x7c0 fs/splice.c:636
       splice_from_pipe+0x108/0x170 fs/splice.c:671
       generic_splice_sendpage+0x3c/0x50 fs/splice.c:842
       do_splice_from fs/splice.c:861 [inline]
       direct_splice_actor+0x123/0x190 fs/splice.c:1035
       splice_direct_to_actor+0x3b4/0xa30 fs/splice.c:990
       do_splice_direct+0x1da/0x2a0 fs/splice.c:1078
       do_sendfile+0x597/0xd00 fs/read_write.c:1464
       __do_sys_sendfile64 fs/read_write.c:1525 [inline]
       __se_sys_sendfile64 fs/read_write.c:1511 [inline]
       __x64_sys_sendfile64+0x1dd/0x220 fs/read_write.c:1511
       do_syscall_64+0xfa/0x790 arch/x86/entry/common.c:294
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x441409
      Code: e8 ac e8 ff ff 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 eb 08 fc ff c3 66 2e 0f 1f 84 00 00 00 00
      RSP: 002b:00007fffb64c4f78 EFLAGS: 00000246 ORIG_RAX: 0000000000000028
      RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 0000000000441409
      RDX: 0000000000000000 RSI: 0000000000000006 RDI: 0000000000000005
      RBP: 0000000000073b8a R08: 0000000000000010 R09: 0000000000000010
      R10: 0000000000010001 R11: 0000000000000246 R12: 0000000000402180
      R13: 0000000000402210 R14: 0000000000000000 R15: 0000000000000000
      Kernel Offset: disabled
      Rebooting in 86400 seconds..
      
      Fixes: 1470ddf7 ("inet: Remove explicit write references to sk/inet in ip_append_data")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      501a90c9
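The check-once pattern described in this commit can be sketched in userspace C. This is a rough approximation, not the kernel code: the `MTU_DEMO_*` macros imitate READ_ONCE()/WRITE_ONCE() with volatile casts (using the GNU `__typeof__` extension), `mtu_demo_setup_cork()` and its -1 return value are hypothetical, and the 68-byte floor is the minimum IPv4 MTU from RFC 791.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace approximations of READ_ONCE()/WRITE_ONCE() (assumption:
 * a GCC/Clang-compatible compiler providing __typeof__). */
#define MTU_DEMO_READ_ONCE(x)       (*(const volatile __typeof__(x) *)&(x))
#define MTU_DEMO_WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

#define MTU_DEMO_IPV4_MIN_MTU 68u /* minimum IPv4 MTU per RFC 791 */

struct mtu_demo_dev {
	unsigned int mtu; /* may be changed concurrently by another thread */
};

bool mtu_demo_valid_mtu(unsigned int mtu)
{
	return mtu >= MTU_DEMO_IPV4_MIN_MTU;
}

/* Writer side: a single marked store. */
void mtu_demo_set_mtu(struct mtu_demo_dev *dev, unsigned int new_mtu)
{
	MTU_DEMO_WRITE_ONCE(dev->mtu, new_mtu);
}

/* Reader side: take one snapshot, validate it, then use only the snapshot.
 * Returns 0 on success, -1 (hypothetical error value) if the MTU is too small. */
int mtu_demo_setup_cork(struct mtu_demo_dev *dev, unsigned int *fragsize)
{
	unsigned int mtu = MTU_DEMO_READ_ONCE(dev->mtu);

	if (!mtu_demo_valid_mtu(mtu))
		return -1;
	*fragsize = mtu;
	return 0;
}

void mtu_demo_selfcheck(void)
{
	struct mtu_demo_dev dev = { .mtu = 1500 };
	unsigned int fragsize = 0;

	assert(mtu_demo_setup_cork(&dev, &fragsize) == 0 && fragsize == 1500);
	mtu_demo_set_mtu(&dev, 20); /* absurdly small value, as in the report */
	assert(mtu_demo_setup_cork(&dev, &fragsize) == -1);
}
```

Validating a local snapshot rather than re-reading the field means that a concurrent writer cannot slip a tiny value in between the check and the use.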
  19. 04 Dec, 2019 1 commit
  20. 24 Nov, 2019 1 commit
  21. 17 Nov, 2019 1 commit
    • A
      net: core: allow fast GRO for skbs with Ethernet header in head · 8aef998d
      Alexander Lobakin committed
      Commit 78d3fd0b ("gro: Only use skb_gro_header for completely
      non-linear packets") back in May'09 (v2.6.31-rc1) changed the
      original condition '!skb_headlen(skb)' to
      'skb->mac_header == skb->tail' in gro_reset_offset() saying: "Since
      the drivers that need this optimisation all provide completely
      non-linear packets" (note that this condition later became the current
      'skb_mac_header(skb) == skb_tail_pointer(skb)' with commit
      ced14f68 ("net: Correct comparisons and calculations using
      skb->tail and skb-transport_header") without any functional changes).
      
      For now, we have the following rough statistics for v5.4-rc7:
      1) napi_gro_frags: 14
      2) napi_gro_receive with skb->head containing (most of) payload: 83
      3) napi_gro_receive with skb->head containing all the headers: 20
      4) napi_gro_receive with skb->head containing only Ethernet header: 2
      
      With the current condition, fast GRO with the usage of
      NAPI_GRO_CB(skb)->frag0 is available only in the [1] case.
      Packets pushed by [2] and [3] go through the 'slow' path, but
      it's not a problem for them as they already contain all the needed
      headers in skb->head, so pskb_may_pull() only moves skb->data.
      
      The layout of skbs in the fourth [4] case at the moment of
      dev_gro_receive() is identical to skbs that have come through [1],
      as napi_frags_skb() pulls Ethernet header to skb->head. The only
      difference is that the mentioned condition is always false for them,
      because skb_put() and friends irreversibly alter the tail pointer.
      They also go through the 'slow' path, but now every single
      pskb_may_pull() in every single .gro_receive() will call the *really*
      slow __pskb_pull_tail() to pull headers to head. This significantly
      decreases the overall performance for no visible reasons.
      
      The only two users of method [4] are:
      * drivers/staging/qlge
      * drivers/net/wireless/iwlwifi (all three variants: dvm, mvm, mvm-mq)
      
      Note that in the case of wireless drivers we can't use [1]
      (napi_gro_frags()), at least for now, and the mac80211 stack always
      performs pushes and pulls anyway, so a performance hit is unavoidable.
      
      At the time of v2.6.31 the mentioned change was necessary (that's
      why I don't add the "Fixes:" tag), but it became obsolete when
      skb_gro_mac_header() was removed in commit a50e233c ("net-gro:
      restore frag0 optimization"), so we can simply revert the condition
      in gro_reset_offset() to allow skbs from [4] to go through the 'fast'
      path just like in case [1].
      
      This was tested on a 600 MHz MIPS CPU with a custom driver, and this
      patch gave boosts of up to 40 Mbps to method [4] in both directions
      compared to net-next, which made overall performance relatively
      close to [1] (without it, [4] is the slowest).
      
      v2:
      - Add more references and explanations to commit message
      - Fix some typos ibid
      - No functional changes
      Signed-off-by: Alexander Lobakin <alobakin@dlink.ru>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8aef998d
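The difference between the two conditions discussed in this commit can be illustrated with a toy model of an skb. Everything here is a simplified assumption: `struct gro_demo_skb` tracks only offsets into the linear area, and the two `gro_demo_fast_path_*()` helpers model just the part of the check that the message talks about.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define GRO_DEMO_ETH_HLEN 14 /* Ethernet header length in bytes */

/* Toy skb: offsets into the linear buffer plus a fragment count. */
struct gro_demo_skb {
	size_t mac_header;     /* where the Ethernet header starts */
	size_t data;           /* current start of linear data */
	size_t tail;           /* end of linear data */
	unsigned int nr_frags; /* number of page fragments */
};

/* Bytes currently available in the linear area. */
size_t gro_demo_headlen(const struct gro_demo_skb *skb)
{
	return skb->tail - skb->data;
}

/* Condition in use before this commit: mac header must sit at the tail. */
bool gro_demo_fast_path_old(const struct gro_demo_skb *skb)
{
	return skb->mac_header == skb->tail && skb->nr_frags > 0;
}

/* Restored original condition: nothing left in the linear area. */
bool gro_demo_fast_path_new(const struct gro_demo_skb *skb)
{
	return gro_demo_headlen(skb) == 0 && skb->nr_frags > 0;
}

/* Case [1]: completely non-linear packet, everything lives in fragments. */
struct gro_demo_skb gro_demo_case1(void)
{
	struct gro_demo_skb skb = { 0, 0, 0, 1 };

	return skb;
}

/* Case [4]: the driver copies the Ethernet header into the head (tail
 * advances), then the header is pulled (data advances to the tail). */
struct gro_demo_skb gro_demo_case4(void)
{
	struct gro_demo_skb skb = { 0, 0, 0, 1 };

	skb.tail += GRO_DEMO_ETH_HLEN; /* header copied into the head */
	skb.data += GRO_DEMO_ETH_HLEN; /* header pulled afterwards */
	return skb;
}

void gro_demo_selfcheck(void)
{
	struct gro_demo_skb c1 = gro_demo_case1();
	struct gro_demo_skb c4 = gro_demo_case4();

	assert(gro_demo_fast_path_old(&c1) && gro_demo_fast_path_new(&c1));
	assert(!gro_demo_fast_path_old(&c4)); /* tail moved away from mac header */
	assert(gro_demo_fast_path_new(&c4));  /* but the linear area is empty */
}
```

In this model the tail offset never moves back, so case [4] can never satisfy the old comparison even though it has no linear data left, which matches the behaviour the message describes.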
  22. 09 Nov, 2019 1 commit
    • E
      net/sched: annotate lockless accesses to qdisc->empty · 90b2be27
      Eric Dumazet committed
      KCSAN reported the following race [1]
      
      BUG: KCSAN: data-race in __dev_queue_xmit / net_tx_action
      
      read to 0xffff8880ba403508 of 1 bytes by task 21814 on cpu 1:
       __dev_xmit_skb net/core/dev.c:3389 [inline]
       __dev_queue_xmit+0x9db/0x1b40 net/core/dev.c:3761
       dev_queue_xmit+0x21/0x30 net/core/dev.c:3825
       neigh_hh_output include/net/neighbour.h:500 [inline]
       neigh_output include/net/neighbour.h:509 [inline]
       ip6_finish_output2+0x873/0xec0 net/ipv6/ip6_output.c:116
       __ip6_finish_output net/ipv6/ip6_output.c:142 [inline]
       __ip6_finish_output+0x2d7/0x330 net/ipv6/ip6_output.c:127
       ip6_finish_output+0x41/0x160 net/ipv6/ip6_output.c:152
       NF_HOOK_COND include/linux/netfilter.h:294 [inline]
       ip6_output+0xf2/0x280 net/ipv6/ip6_output.c:175
       dst_output include/net/dst.h:436 [inline]
       ip6_local_out+0x74/0x90 net/ipv6/output_core.c:179
       ip6_send_skb+0x53/0x110 net/ipv6/ip6_output.c:1795
       udp_v6_send_skb.isra.0+0x3ec/0xa70 net/ipv6/udp.c:1173
       udpv6_sendmsg+0x1906/0x1c20 net/ipv6/udp.c:1471
       inet6_sendmsg+0x6d/0x90 net/ipv6/af_inet6.c:576
       sock_sendmsg_nosec net/socket.c:637 [inline]
       sock_sendmsg+0x9f/0xc0 net/socket.c:657
       ___sys_sendmsg+0x2b7/0x5d0 net/socket.c:2311
       __sys_sendmmsg+0x123/0x350 net/socket.c:2413
       __do_sys_sendmmsg net/socket.c:2442 [inline]
       __se_sys_sendmmsg net/socket.c:2439 [inline]
       __x64_sys_sendmmsg+0x64/0x80 net/socket.c:2439
       do_syscall_64+0xcc/0x370 arch/x86/entry/common.c:290
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      write to 0xffff8880ba403508 of 1 bytes by interrupt on cpu 0:
       qdisc_run_begin include/net/sch_generic.h:160 [inline]
       qdisc_run include/net/pkt_sched.h:120 [inline]
       net_tx_action+0x2b1/0x6c0 net/core/dev.c:4551
       __do_softirq+0x115/0x33f kernel/softirq.c:292
       do_softirq_own_stack+0x2a/0x40 arch/x86/entry/entry_64.S:1082
       do_softirq.part.0+0x6b/0x80 kernel/softirq.c:337
       do_softirq kernel/softirq.c:329 [inline]
       __local_bh_enable_ip+0x76/0x80 kernel/softirq.c:189
       local_bh_enable include/linux/bottom_half.h:32 [inline]
       rcu_read_unlock_bh include/linux/rcupdate.h:688 [inline]
       ip6_finish_output2+0x7bb/0xec0 net/ipv6/ip6_output.c:117
       __ip6_finish_output net/ipv6/ip6_output.c:142 [inline]
       __ip6_finish_output+0x2d7/0x330 net/ipv6/ip6_output.c:127
       ip6_finish_output+0x41/0x160 net/ipv6/ip6_output.c:152
       NF_HOOK_COND include/linux/netfilter.h:294 [inline]
       ip6_output+0xf2/0x280 net/ipv6/ip6_output.c:175
       dst_output include/net/dst.h:436 [inline]
       ip6_local_out+0x74/0x90 net/ipv6/output_core.c:179
       ip6_send_skb+0x53/0x110 net/ipv6/ip6_output.c:1795
       udp_v6_send_skb.isra.0+0x3ec/0xa70 net/ipv6/udp.c:1173
       udpv6_sendmsg+0x1906/0x1c20 net/ipv6/udp.c:1471
       inet6_sendmsg+0x6d/0x90 net/ipv6/af_inet6.c:576
       sock_sendmsg_nosec net/socket.c:637 [inline]
       sock_sendmsg+0x9f/0xc0 net/socket.c:657
       ___sys_sendmsg+0x2b7/0x5d0 net/socket.c:2311
       __sys_sendmmsg+0x123/0x350 net/socket.c:2413
       __do_sys_sendmmsg net/socket.c:2442 [inline]
       __se_sys_sendmmsg net/socket.c:2439 [inline]
       __x64_sys_sendmmsg+0x64/0x80 net/socket.c:2439
       do_syscall_64+0xcc/0x370 arch/x86/entry/common.c:290
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 0 PID: 21817 Comm: syz-executor.2 Not tainted 5.4.0-rc6+ #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      
      Fixes: d518d2ed ("net/sched: fix race between deactivation and dequeue for NOLOCK qdisc")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Cc: Paolo Abeni <pabeni@redhat.com>
      Cc: Davide Caratti <dcaratti@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      90b2be27
  23. 02 Nov, 2019 1 commit
    • J
      net: fix installing orphaned programs · aefc3e72
      Jakub Kicinski committed
      When a netdevice with offloaded BPF programs is destroyed,
      the programs are orphaned and removed from the program
      IDA: their IDs get released (the programs may remain
      accessible via existing open file descriptors and pinned
      files). After IDs are released they are set to 0.
      
      This confuses dev_change_xdp_fd() because it compares
      the __dev_xdp_query() result where 0 means no program
      with prog->aux->id where 0 means orphaned.
      
      dev_change_xdp_fd() would have incorrectly returned success
      even though it had not installed the program.
      
      Since drivers already catch this case via bpf_offload_dev_match(),
      let them handle it. The error message drivers produce in
      this case ("program loaded for a different device") is in fact
      correct, as the orphaned program must have been loaded for a
      different device.
      
      Fixes: c14a9f63 ("net: Don't call XDP_SETUP_PROG when nothing is changed")
      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      aefc3e72
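The ambiguity of the value 0 described in this commit can be shown with two small predicates. This is a sketch under assumptions: `xdp_demo_is_noop_buggy()` and `xdp_demo_is_noop_guarded()` are hypothetical names, and the guard shown is one plausible way to tell the two meanings of 0 apart, not necessarily the exact change that was merged.

```c
#include <assert.h>
#include <stdbool.h>

/* installed_id: 0 means "no program installed on the device".
 * new_prog_id:  0 means "program is orphaned, its ID was released". */

/* Naive "nothing changed" test: conflates the two meanings of 0. */
bool xdp_demo_is_noop_buggy(unsigned int installed_id, unsigned int new_prog_id)
{
	return new_prog_id == installed_id;
}

/* Guarded test: an orphaned program (ID 0) never counts as already installed,
 * so the request falls through to the driver, which can reject it. */
bool xdp_demo_is_noop_guarded(unsigned int installed_id, unsigned int new_prog_id)
{
	return new_prog_id != 0 && new_prog_id == installed_id;
}

void xdp_demo_selfcheck(void)
{
	/* Orphaned program, empty device: the naive test claims a no-op. */
	assert(xdp_demo_is_noop_buggy(0, 0));
	assert(!xdp_demo_is_noop_guarded(0, 0));

	/* Re-installing the program that is already attached is still a no-op. */
	assert(xdp_demo_is_noop_guarded(7, 7));
	assert(!xdp_demo_is_noop_guarded(7, 9));
}
```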
  24. 30 Oct, 2019 1 commit
  25. 26 Oct, 2019 1 commit
    • G
      netns: fix GFP flags in rtnl_net_notifyid() · d4e4fdf9
      Guillaume Nault committed
      In rtnl_net_notifyid(), we certainly can't pass a null GFP flag to
      rtnl_notify(). A GFP_KERNEL flag would be fine in most circumstances,
      but there are a few paths calling rtnl_net_notifyid() from atomic
      context or from RCU critical sections. The latter also precludes the use
      of gfp_any() as it wouldn't detect the RCU case. Also, the nlmsg_new()
      call is wrong too, as it uses GFP_KERNEL unconditionally.
      
      Therefore, we need to pass the GFP flags as parameter and propagate it
      through function calls until the proper flags can be determined.
      
      In most cases, GFP_KERNEL is fine. The exceptions are:
        * openvswitch: ovs_vport_cmd_get() and ovs_vport_cmd_dump()
          indirectly call rtnl_net_notifyid() from RCU critical section,
      
        * rtnetlink: rtmsg_ifinfo_build_skb() already receives GFP flags as
          parameter.
      
      Also, in ovs_vport_cmd_build_info(), let's change the GFP flags used
      by nlmsg_new(). The function is allowed to sleep, so better make the
      flags consistent with the ones used in the following
      ovs_vport_cmd_fill_info() call.
      
      Found by code inspection.
      
      Fixes: 9a963454 ("netns: notify netns id events")
      Signed-off-by: Guillaume Nault <gnault@redhat.com>
      Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Acked-by: Pravin B Shelar <pshelar@ovn.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d4e4fdf9
  26. 25 Oct, 2019 4 commits
    • T
      net: remove unnecessary variables and callback · f3b0a18b
      Taehee Yoo committed
      This patch removes the variables and the callback that are related to
      the nested device structure.
      Devices that can be nested have their own nest_level variable that
      represents the depth of nested devices.
      The previous patch added new {lower/upper}_level variables, which
      replace the old private nest_level variable.
      So, this patch removes all 'nest_level' variables.
      
      In order to avoid lockdep warnings, ->ndo_get_lock_subclass() was added
      to get the lockdep subclass value, which is actually the lower nested
      depth value. But now, these devices use dynamic lockdep keys instead of
      the subclass to avoid lockdep warnings.
      So, this patch removes the ->ndo_get_lock_subclass() callback.
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f3b0a18b
    • T
      net: core: add ignore flag to netdev_adjacent structure · 32b6d34f
      Taehee Yoo committed
      In order to link an adjacent node, netdev_upper_dev_link() is used,
      and in order to unlink an adjacent node, netdev_upper_dev_unlink() is used.
      The unlink operation does not fail, but the link operation can fail.
      
      In order to exchange adjacent nodes, we should unlink the old adjacent
      node first, then link the new adjacent node.
      If the link operation fails, we should link the old adjacent node again.
      But this link operation can fail too,
      which eventually breaks the adjacent link relationship.
      
      This patch adds an ignore flag to the netdev_adjacent structure.
      If this flag is set, netdev_upper_dev_link() ignores the old adjacent
      node for a moment.
      
      This patch also adds new functions for other modules:
      netdev_adjacent_change_prepare()
      netdev_adjacent_change_commit()
      netdev_adjacent_change_abort()
      
      netdev_adjacent_change_prepare() inserts the new device into the adjacent
      list, but the new device is not allowed to be used immediately.
      If netdev_adjacent_change_prepare() fails, it internally rolls back the
      adjacent list, so no other action is needed.
      netdev_adjacent_change_commit() deletes the old device from the adjacent
      list and allows the new device to be used.
      netdev_adjacent_change_abort() rolls back the adjacent list.
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      32b6d34f
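The prepare/commit/abort flow described in this commit can be modelled with a toy two-slot structure. This is a loose sketch, not the kernel implementation: `struct adj_demo_upper`, its `fail_next_link` switch and all `adj_demo_*()` functions are hypothetical, and the real helpers take different arguments.

```c
#include <assert.h>
#include <stdbool.h>

/* One adjacency slot: which device it names and whether it may be used. */
struct adj_demo_link {
	int dev;      /* illustrative device number */
	bool present; /* slot is in the adjacent list */
	bool usable;  /* slot may carry traffic */
};

struct adj_demo_upper {
	struct adj_demo_link old_link;
	struct adj_demo_link new_link;
	bool fail_next_link; /* test switch: make the next link attempt fail */
};

/* Prepare: add the new device to the list, but do not let it be used yet.
 * On failure nothing has been modified, so the caller has nothing to undo. */
int adj_demo_change_prepare(struct adj_demo_upper *u, int new_dev)
{
	if (u->fail_next_link)
		return -1;
	u->new_link.dev = new_dev;
	u->new_link.present = true;
	u->new_link.usable = false;
	return 0;
}

/* Commit: drop the old device and allow the new one to be used. */
void adj_demo_change_commit(struct adj_demo_upper *u)
{
	u->old_link.present = false;
	u->old_link.usable = false;
	u->new_link.usable = true;
}

/* Abort: remove the new device again; the old link was never touched. */
void adj_demo_change_abort(struct adj_demo_upper *u)
{
	u->new_link.dev = 0;
	u->new_link.present = false;
	u->new_link.usable = false;
}

/* Returns the device currently allowed to be used, or -1 if none. */
int adj_demo_active_dev(const struct adj_demo_upper *u)
{
	if (u->new_link.present && u->new_link.usable)
		return u->new_link.dev;
	if (u->old_link.present && u->old_link.usable)
		return u->old_link.dev;
	return -1;
}

void adj_demo_selfcheck(void)
{
	struct adj_demo_upper u = { { 1, true, true }, { 0, false, false }, false };

	assert(adj_demo_change_prepare(&u, 2) == 0);
	assert(adj_demo_active_dev(&u) == 1); /* old device still in charge */
	adj_demo_change_abort(&u);
	assert(adj_demo_active_dev(&u) == 1); /* rollback cannot fail */

	assert(adj_demo_change_prepare(&u, 2) == 0);
	adj_demo_change_commit(&u);
	assert(adj_demo_active_dev(&u) == 2);
}
```

The point of the split is that the only step that can fail (linking the new device) happens while the old link is still intact, so neither commit nor abort needs a fallible re-link.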
    • T
      net: core: add generic lockdep keys · ab92d68f
      Taehee Yoo committed
      Some interface types can be nested
      (VLAN, BONDING, TEAM, MACSEC, MACVLAN, IPVLAN, VIRT_WIFI, VXLAN, etc.).
      These interface types should set a lockdep class because, without a
      lockdep class key, lockdep always warns about nonexistent circular locking.
      
      In the current code, these interfaces have their own lockdep class keys and
      manage them themselves, so there is a lot of duplicated code around
      /drivers/net and /net/.
      This patch adds new generic lockdep keys and some helper functions for it.
      
      This patch makes the following changes:
      a) Add lockdep class keys in struct net_device
         - qdisc_running, xmit, addr_list, qdisc_busylock
         - these keys are used as dynamic lockdep keys.
      b) When a net_device is allocated, lockdep keys are registered.
         - alloc_netdev_mqs()
      c) When a net_device is freed, lockdep keys are unregistered.
         - free_netdev()
      d) Add generic lockdep key helper functions
         - netdev_register_lockdep_key()
         - netdev_unregister_lockdep_key()
         - netdev_update_lockdep_key()
      e) Remove unnecessary generic lockdep macros and functions
      f) Remove unnecessary lockdep code from each interface.
      
      After this patch, interface modules don't need to maintain
      their own lockdep keys.
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ab92d68f
    • T
      net: core: limit nested device depth · 5343da4c
      Taehee Yoo committed
      The current code doesn't limit the number of nested devices.
      Nested devices are handled recursively, which needs a lot of stack
      memory. So, unlimited nesting could cause a stack overflow.
      
      This patch adds upper_level and lower_level, common variables that
      represent the maximum lower/upper depth.
      When an upper/lower device is attached or detached,
      {lower/upper}_level are updated, and if the maximum depth is bigger than 8,
      the attach routine fails and returns -EMLINK.
      
      In addition, this patch converts the recursive
      netdev_walk_all_{lower/upper} routines to iterative ones.
      
      Test commands:
          ip link add dummy0 type dummy
          ip link add link dummy0 name vlan1 type vlan id 1
          ip link set vlan1 up
      
          for i in {2..55}
          do
      	    let A=$i-1
      
      	    ip link add vlan$i link vlan$A type vlan id $i
          done
          ip link del dummy0
      
      Splat looks like:
      [  155.513226][  T908] BUG: KASAN: use-after-free in __unwind_start+0x71/0x850
      [  155.514162][  T908] Write of size 88 at addr ffff8880608a6cc0 by task ip/908
      [  155.515048][  T908]
      [  155.515333][  T908] CPU: 0 PID: 908 Comm: ip Not tainted 5.4.0-rc3+ #96
      [  155.516147][  T908] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
      [  155.517233][  T908] Call Trace:
      [  155.517627][  T908]
      [  155.517918][  T908] Allocated by task 0:
      [  155.518412][  T908] (stack is not available)
      [  155.518955][  T908]
      [  155.519228][  T908] Freed by task 0:
      [  155.519885][  T908] (stack is not available)
      [  155.520452][  T908]
      [  155.520729][  T908] The buggy address belongs to the object at ffff8880608a6ac0
      [  155.520729][  T908]  which belongs to the cache names_cache of size 4096
      [  155.522387][  T908] The buggy address is located 512 bytes inside of
      [  155.522387][  T908]  4096-byte region [ffff8880608a6ac0, ffff8880608a7ac0)
      [  155.523920][  T908] The buggy address belongs to the page:
      [  155.524552][  T908] page:ffffea0001822800 refcount:1 mapcount:0 mapping:ffff88806c657cc0 index:0x0 compound_mapcount:0
      [  155.525836][  T908] flags: 0x100000000010200(slab|head)
      [  155.526445][  T908] raw: 0100000000010200 ffffea0001813808 ffffea0001a26c08 ffff88806c657cc0
      [  155.527424][  T908] raw: 0000000000000000 0000000000070007 00000001ffffffff 0000000000000000
      [  155.528429][  T908] page dumped because: kasan: bad access detected
      [  155.529158][  T908]
      [  155.529410][  T908] Memory state around the buggy address:
      [  155.530060][  T908]  ffff8880608a6b80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [  155.530971][  T908]  ffff8880608a6c00: fb fb fb fb fb f1 f1 f1 f1 00 f2 f2 f2 f3 f3 f3
      [  155.531889][  T908] >ffff8880608a6c80: f3 fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      [  155.532806][  T908]                                            ^
      [  155.533509][  T908]  ffff8880608a6d00: fb fb fb fb fb fb fb fb fb f1 f1 f1 f1 00 00 00
      [  155.534436][  T908]  ffff8880608a6d80: f2 f3 f3 f3 f3 fb fb fb 00 00 00 00 00 00 00 00
      [ ... ]
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5343da4c
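The depth accounting described in this commit can be sketched for the simple case of a single chain of devices. This is a toy model under assumptions: real devices can have several upper and lower neighbours, and `struct nest_demo_dev`, `nest_demo_link()` and the starting level of 1 are illustrative choices rather than the kernel's code. Only the limit of 8 and the -EMLINK return value come from the commit message.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define NEST_DEMO_MAX_DEPTH 8 /* limit named in the commit message */

/* Toy device that has at most one upper and one lower neighbour. */
struct nest_demo_dev {
	struct nest_demo_dev *upper;
	struct nest_demo_dev *lower;
	int upper_level; /* 1 + number of devices stacked above */
	int lower_level; /* 1 + number of devices stacked below */
};

void nest_demo_init(struct nest_demo_dev *dev)
{
	dev->upper = NULL;
	dev->lower = NULL;
	dev->upper_level = 1;
	dev->lower_level = 1;
}

/* Stack 'upper' on top of 'lower'; refuse if the chain would get too deep. */
int nest_demo_link(struct nest_demo_dev *lower, struct nest_demo_dev *upper)
{
	struct nest_demo_dev *it;
	int level;

	if (lower->lower_level + upper->upper_level > NEST_DEMO_MAX_DEPTH)
		return -EMLINK;

	lower->upper = upper;
	upper->lower = lower;

	/* Iterative walk downwards: everything below gained devices above it. */
	level = upper->upper_level;
	for (it = lower; it != NULL; it = it->lower)
		it->upper_level = ++level;

	/* Iterative walk upwards: everything above gained devices below it. */
	level = lower->lower_level;
	for (it = upper; it != NULL; it = it->upper)
		it->lower_level = ++level;

	return 0;
}

void nest_demo_selfcheck(void)
{
	struct nest_demo_dev devs[9];
	int i;

	for (i = 0; i < 9; i++)
		nest_demo_init(&devs[i]);

	/* Building a chain of 8 devices succeeds. */
	for (i = 1; i < 8; i++)
		assert(nest_demo_link(&devs[i - 1], &devs[i]) == 0);
	assert(devs[0].upper_level == 8);
	assert(devs[7].lower_level == 8);

	/* A ninth level is rejected. */
	assert(nest_demo_link(&devs[7], &devs[8]) == -EMLINK);
}
```

Both level updates are plain loops over the chain, so the stack usage stays constant no matter how deep the nesting is, which is the motivation given for converting the recursive walkers.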
  27. 16 Oct, 2019 1 commit