1. 15 Oct 2021: 2 commits
  2. 09 Apr 2021: 1 commit
    • net: always use icmp{,v6}_ndo_send from ndo_start_xmit · 0165d324
      Committed by Jason A. Donenfeld
      stable inclusion
      from stable-5.10.24
      commit 91796b65563bd3fd0efe4fb56d6ee1c5c6006eb0
      bugzilla: 51348
      
      --------------------------------
      
      commit 4372339e upstream.
      
      There were a few remaining tunnel drivers that didn't receive the prior
      conversion to icmp{,v6}_ndo_send. Knowing now that this could lead to
      memory corruption (see ee576c47 ("net: icmp: pass zeroed opts from
      icmp{,v6}_ndo_send before sending") for details), it is even more
      imperative to have these all converted. So this commit goes through the
      remaining cases that I could find and does a boring translation to the
      ndo variety.
      
      The Fixes: line below is the merge that originally added
      icmp{,v6}_ndo_send and converted the first batch of icmp{,v6}_send users.
      The rationale then for the change applies equally to this patch. It's
      just that these drivers were left out of the initial conversion because
      these network devices are hiding in net/ rather than in drivers/net/.
      
      Cc: Florian Westphal <fw@strlen.de>
      Cc: Willem de Bruijn <willemb@google.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Cc: David Ahern <dsahern@kernel.org>
      Cc: Jakub Kicinski <kuba@kernel.org>
      Cc: Steffen Klassert <steffen.klassert@secunet.com>
      Fixes: 803381f9 ("Merge branch 'icmp-account-for-NAT-when-sending-icmps-from-ndo-layer'")
      Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
      Acked-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
  3. 09 Mar 2021: 1 commit
  4. 27 Jan 2021: 1 commit
  5. 01 Nov 2020: 1 commit
  6. 06 Oct 2020: 1 commit
  7. 19 Jun 2020: 1 commit
    • ip_tunnel: fix use-after-free in ip_tunnel_lookup() · ba61539c
      Committed by Taehee Yoo
      In the datapath, ip_tunnel_lookup() internally uses the fallback tunnel
      device pointer, fb_tunnel_dev. This pointer should be set to NULL when
      the fallback interface is deleted, but no routine does so. The pointer
      is therefore still used after the interface is deleted, which eventually
      results in a use-after-free.
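The bug pattern described above can be illustrated with a small userspace model. This is a hypothetical sketch, not the kernel's code: `struct tunnel_net_model`, `lookup_fallback()` and `tunnel_uninit_model()` are invented names standing in for the per-namespace tunnel state, the fallback step of ip_tunnel_lookup(), and device teardown. It shows the rule the fix enforces: when the device being torn down is the cached fallback, the cached pointer is cleared before the device is freed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical model of a tunnel device; not the kernel's struct ip_tunnel. */
struct tunnel_dev_model {
	int id;
};

/* Hypothetical model of per-namespace tunnel state. */
struct tunnel_net_model {
	struct tunnel_dev_model *fb_tunnel_dev; /* cached fallback, may be NULL */
};

/* Models the fallback step of a lookup: if nothing else matched,
 * the cached fallback device (if any) is returned. */
struct tunnel_dev_model *lookup_fallback(const struct tunnel_net_model *tn)
{
	return tn->fb_tunnel_dev;
}

/* Models device teardown. Clearing the cached pointer before freeing the
 * device keeps later lookups from returning freed memory. */
void tunnel_uninit_model(struct tunnel_net_model *tn,
			 struct tunnel_dev_model *dev)
{
	assert(tn != NULL && dev != NULL);
	if (tn->fb_tunnel_dev == dev)
		tn->fb_tunnel_dev = NULL;
	free(dev);
}
```

In the real datapath, lookups run under RCU concurrently with teardown; this single-threaded model does not capture that.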
      
      Test commands:
          ip netns add A
          ip netns add B
          ip link add eth0 type veth peer name eth1
          ip link set eth0 netns A
          ip link set eth1 netns B
      
          ip netns exec A ip link set lo up
          ip netns exec A ip link set eth0 up
          ip netns exec A ip link add gre1 type gre local 10.0.0.1 \
      	    remote 10.0.0.2
          ip netns exec A ip link set gre1 up
          ip netns exec A ip a a 10.0.100.1/24 dev gre1
          ip netns exec A ip a a 10.0.0.1/24 dev eth0
      
          ip netns exec B ip link set lo up
          ip netns exec B ip link set eth1 up
          ip netns exec B ip link add gre1 type gre local 10.0.0.2 \
      	    remote 10.0.0.1
          ip netns exec B ip link set gre1 up
          ip netns exec B ip a a 10.0.100.2/24 dev gre1
          ip netns exec B ip a a 10.0.0.2/24 dev eth1
          ip netns exec A hping3 10.0.100.2 -2 --flood -d 60000 &
          ip netns del B
      
      Splat looks like:
      [   77.793450][    C3] ==================================================================
      [   77.794702][    C3] BUG: KASAN: use-after-free in ip_tunnel_lookup+0xcc4/0xf30
      [   77.795573][    C3] Read of size 4 at addr ffff888060bd9c84 by task hping3/2905
      [   77.796398][    C3]
      [   77.796664][    C3] CPU: 3 PID: 2905 Comm: hping3 Not tainted 5.8.0-rc1+ #616
      [   77.797474][    C3] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
      [   77.798453][    C3] Call Trace:
      [   77.798815][    C3]  <IRQ>
      [   77.799142][    C3]  dump_stack+0x9d/0xdb
      [   77.799605][    C3]  print_address_description.constprop.7+0x2cc/0x450
      [   77.800365][    C3]  ? ip_tunnel_lookup+0xcc4/0xf30
      [   77.800908][    C3]  ? ip_tunnel_lookup+0xcc4/0xf30
      [   77.801517][    C3]  ? ip_tunnel_lookup+0xcc4/0xf30
      [   77.802145][    C3]  kasan_report+0x154/0x190
      [   77.802821][    C3]  ? ip_tunnel_lookup+0xcc4/0xf30
      [   77.803503][    C3]  ip_tunnel_lookup+0xcc4/0xf30
      [   77.804165][    C3]  __ipgre_rcv+0x1ab/0xaa0 [ip_gre]
      [   77.804862][    C3]  ? rcu_read_lock_sched_held+0xc0/0xc0
      [   77.805621][    C3]  gre_rcv+0x304/0x1910 [ip_gre]
      [   77.806293][    C3]  ? lock_acquire+0x1a9/0x870
      [   77.806925][    C3]  ? gre_rcv+0xfe/0x354 [gre]
      [   77.807559][    C3]  ? erspan_xmit+0x2e60/0x2e60 [ip_gre]
      [   77.808305][    C3]  ? rcu_read_lock_sched_held+0xc0/0xc0
      [   77.809032][    C3]  ? rcu_read_lock_held+0x90/0xa0
      [   77.809713][    C3]  gre_rcv+0x1b8/0x354 [gre]
      [ ... ]
      Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
      Fixes: c5441932 ("GRE: Refactor GRE tunneling code.")
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  8. 20 May 2020: 1 commit
    • net: add a new ndo_tunnel_ioctl method · 607259a6
      Committed by Christoph Hellwig
      This method is used to properly allow kernel callers of the IPv4 route
      management ioctls.  The existing ip_tunnel_ioctl helper is renamed to
      ip_tunnel_ctl to better reflect that it doesn't directly implement ioctls
      touching user memory, and is used for the guts of ndo_tunnel_ctl
      implementations. A new ip_tunnel_ioctl helper is added that can be wired
      up directly to the ndo_do_ioctl method and takes care of the copy to and
      from userspace.
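The split between a helper that works only on kernel memory and a thin ioctl wrapper that handles the user copies can be sketched in userspace. Everything here is hypothetical: `struct tunnel_parm_model`, `tunnel_ctl_model()`, `tunnel_ioctl_model()`, the `TUNNEL_CMD_GET` command, and `copy_model()`, which stands in for copy_from_user()/copy_to_user() using plain memcpy().

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

#define TUNNEL_CMD_GET 1 /* hypothetical command number */

/* Hypothetical stand-in for the tunnel parameter block. */
struct tunnel_parm_model {
	char name[16];
	int ttl;
};

/* Core logic: touches only memory the caller owns, so in-kernel callers
 * could invoke it directly (models the ip_tunnel_ctl() role). */
int tunnel_ctl_model(struct tunnel_parm_model *p, int cmd)
{
	assert(p != NULL);
	if (cmd != TUNNEL_CMD_GET)
		return -EINVAL;
	p->ttl = 64; /* pretend this is the stored configuration */
	return 0;
}

/* Stand-in for copy_from_user()/copy_to_user(); returns 0 on success. */
static int copy_model(void *dst, const void *src, size_t n)
{
	memcpy(dst, src, n);
	return 0;
}

/* Thin wrapper: copy in, run the core logic, copy out
 * (models the role of the new ip_tunnel_ioctl()). */
int tunnel_ioctl_model(void *user_arg, int cmd)
{
	struct tunnel_parm_model p;
	int err;

	if (copy_model(&p, user_arg, sizeof(p)))
		return -EFAULT;
	err = tunnel_ctl_model(&p, cmd);
	if (!err && copy_model(user_arg, &p, sizeof(p)))
		return -EFAULT;
	return err;
}
```

Keeping the copies in the wrapper means the core helper never has to care whether its argument came from userspace.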
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  9. 30 Mar 2020: 1 commit
    • net, ip_tunnel: fix interface lookup with no key · 25629fda
      Committed by William Dauchy
      When creating a new ipip interface with no local/remote configuration,
      the lookup is done with the TUNNEL_NO_KEY flag, making it impossible to
      match the new interface (the only possible matches being the fallback
      or metadata-mode interface); e.g. `ip link add tunl1 type ipip dev eth0`
      
      To fix this case, add a flag check before the key comparison so that an
      interface with no local/remote config can be matched; this also avoids
      breaking possible userland tools relying on the TUNNEL_NO_KEY flag and
      an uninitialised key.
      
      For context, on my side I'm creating an extra ipip interface attached
      to the physical one, and moving it to a dedicated namespace.
      
      Fixes: c5441932 ("GRE: Refactor GRE tunneling code.")
      Signed-off-by: William Dauchy <w.dauchy@criteo.com>
      Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  10. 21 Jan 2020: 1 commit
  11. 25 Dec 2019: 1 commit
  12. 05 Jun 2019: 1 commit
  13. 07 Mar 2019: 1 commit
    • iptunnel: NULL pointer deref for ip_md_tunnel_xmit · f4b3ec4e
      Committed by Alan Maguire
      Naresh Kamboju noted the following oops during execution of selftest
      tools/testing/selftests/bpf/test_tunnel.sh on x86_64:
      
      [  274.120445] BUG: unable to handle kernel NULL pointer dereference
      at 0000000000000000
      [  274.128285] #PF error: [INSTR]
      [  274.131351] PGD 8000000414a0e067 P4D 8000000414a0e067 PUD 3b6334067 PMD 0
      [  274.138241] Oops: 0010 [#1] SMP PTI
      [  274.141734] CPU: 1 PID: 11464 Comm: ping Not tainted
      5.0.0-rc4-next-20190129 #1
      [  274.149046] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS
      2.0b 07/27/2017
      [  274.156526] RIP: 0010:          (null)
      [  274.160280] Code: Bad RIP value.
      [  274.163509] RSP: 0018:ffffbc9681f83540 EFLAGS: 00010286
      [  274.168726] RAX: 0000000000000000 RBX: ffffdc967fa80a18 RCX: 0000000000000000
      [  274.175851] RDX: ffff9db2ee08b540 RSI: 000000000000000e RDI: ffffdc967fa809a0
      [  274.182974] RBP: ffffbc9681f83580 R08: ffff9db2c4d62690 R09: 000000000000000c
      [  274.190098] R10: 0000000000000000 R11: ffff9db2ee08b540 R12: ffff9db31ce7c000
      [  274.197222] R13: 0000000000000001 R14: 000000000000000c R15: ffff9db3179cf400
      [  274.204346] FS:  00007ff4ae7c5740(0000) GS:ffff9db31fa80000(0000)
      knlGS:0000000000000000
      [  274.212424] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  274.218162] CR2: ffffffffffffffd6 CR3: 00000004574da004 CR4: 00000000003606e0
      [  274.225292] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  274.232416] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  274.239541] Call Trace:
      [  274.241988]  ? tnl_update_pmtu+0x296/0x3b0
      [  274.246085]  ip_md_tunnel_xmit+0x1bc/0x520
      [  274.250176]  gre_fb_xmit+0x330/0x390
      [  274.253754]  gre_tap_xmit+0x128/0x180
      [  274.257414]  dev_hard_start_xmit+0xb7/0x300
      [  274.261598]  sch_direct_xmit+0xf6/0x290
      [  274.265430]  __qdisc_run+0x15d/0x5e0
      [  274.269007]  __dev_queue_xmit+0x2c5/0xc00
      [  274.273011]  ? dev_queue_xmit+0x10/0x20
      [  274.276842]  ? eth_header+0x2b/0xc0
      [  274.280326]  dev_queue_xmit+0x10/0x20
      [  274.283984]  ? dev_queue_xmit+0x10/0x20
      [  274.287813]  arp_xmit+0x1a/0xf0
      [  274.290952]  arp_send_dst.part.19+0x46/0x60
      [  274.295138]  arp_solicit+0x177/0x6b0
      [  274.298708]  ? mod_timer+0x18e/0x440
      [  274.302281]  neigh_probe+0x57/0x70
      [  274.305684]  __neigh_event_send+0x197/0x2d0
      [  274.309862]  neigh_resolve_output+0x18c/0x210
      [  274.314212]  ip_finish_output2+0x257/0x690
      [  274.318304]  ip_finish_output+0x219/0x340
      [  274.322314]  ? ip_finish_output+0x219/0x340
      [  274.326493]  ip_output+0x76/0x240
      [  274.329805]  ? ip_fragment.constprop.53+0x80/0x80
      [  274.334510]  ip_local_out+0x3f/0x70
      [  274.337992]  ip_send_skb+0x19/0x40
      [  274.341391]  ip_push_pending_frames+0x33/0x40
      [  274.345740]  raw_sendmsg+0xc15/0x11d0
      [  274.349403]  ? __might_fault+0x85/0x90
      [  274.353151]  ? _copy_from_user+0x6b/0xa0
      [  274.357070]  ? rw_copy_check_uvector+0x54/0x130
      [  274.361604]  inet_sendmsg+0x42/0x1c0
      [  274.365179]  ? inet_sendmsg+0x42/0x1c0
      [  274.368937]  sock_sendmsg+0x3e/0x50
      [  274.372460]  ___sys_sendmsg+0x26f/0x2d0
      [  274.376293]  ? lock_acquire+0x95/0x190
      [  274.380043]  ? __handle_mm_fault+0x7ce/0xb70
      [  274.384307]  ? lock_acquire+0x95/0x190
      [  274.388053]  ? __audit_syscall_entry+0xdd/0x130
      [  274.392586]  ? ktime_get_coarse_real_ts64+0x64/0xc0
      [  274.397461]  ? __audit_syscall_entry+0xdd/0x130
      [  274.401989]  ? trace_hardirqs_on+0x4c/0x100
      [  274.406173]  __sys_sendmsg+0x63/0xa0
      [  274.409744]  ? __sys_sendmsg+0x63/0xa0
      [  274.413488]  __x64_sys_sendmsg+0x1f/0x30
      [  274.417405]  do_syscall_64+0x55/0x190
      [  274.421064]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [  274.426113] RIP: 0033:0x7ff4ae0e6e87
      [  274.429686] Code: 64 89 02 48 c7 c0 ff ff ff ff eb b9 0f 1f 80 00
      00 00 00 8b 05 ca d9 2b 00 48 63 d2 48 63 ff 85 c0 75 10 b8 2e 00 00
      00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 53 48 89 f3 48 83 ec 10 48 89 7c
      24 08
      [  274.448422] RSP: 002b:00007ffcd9b76db8 EFLAGS: 00000246 ORIG_RAX:
      000000000000002e
      [  274.455978] RAX: ffffffffffffffda RBX: 0000000000000040 RCX: 00007ff4ae0e6e87
      [  274.463104] RDX: 0000000000000000 RSI: 00000000006092e0 RDI: 0000000000000003
      [  274.470228] RBP: 0000000000000000 R08: 00007ffcd9bc40a0 R09: 00007ffcd9bc4080
      [  274.477349] R10: 000000000000060a R11: 0000000000000246 R12: 0000000000000003
      [  274.484475] R13: 0000000000000016 R14: 00007ffcd9b77fa0 R15: 00007ffcd9b78da4
      [  274.491602] Modules linked in: cls_bpf sch_ingress iptable_filter
      ip_tables algif_hash af_alg x86_pkg_temp_thermal fuse [last unloaded:
      test_bpf]
      [  274.504634] CR2: 0000000000000000
      [  274.507976] ---[ end trace 196d18386545eae1 ]---
      [  274.512588] RIP: 0010:          (null)
      [  274.516334] Code: Bad RIP value.
      [  274.519557] RSP: 0018:ffffbc9681f83540 EFLAGS: 00010286
      [  274.524775] RAX: 0000000000000000 RBX: ffffdc967fa80a18 RCX: 0000000000000000
      [  274.531921] RDX: ffff9db2ee08b540 RSI: 000000000000000e RDI: ffffdc967fa809a0
      [  274.539082] RBP: ffffbc9681f83580 R08: ffff9db2c4d62690 R09: 000000000000000c
      [  274.546205] R10: 0000000000000000 R11: ffff9db2ee08b540 R12: ffff9db31ce7c000
      [  274.553329] R13: 0000000000000001 R14: 000000000000000c R15: ffff9db3179cf400
      [  274.560456] FS:  00007ff4ae7c5740(0000) GS:ffff9db31fa80000(0000)
      knlGS:0000000000000000
      [  274.568541] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  274.574277] CR2: ffffffffffffffd6 CR3: 00000004574da004 CR4: 00000000003606e0
      [  274.581403] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  274.588535] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  274.595658] Kernel panic - not syncing: Fatal exception in interrupt
      [  274.602046] Kernel Offset: 0x14400000 from 0xffffffff81000000
      (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
      [  274.612827] ---[ end Kernel panic - not syncing: Fatal exception in
      interrupt ]---
      [  274.620387] ------------[ cut here ]------------
      
      I'm also seeing the same failure on x86_64, and it reproduces
      consistently.
      
      From poking around it looks like the skb's dst entry is being used
      to calculate the mtu in:
      
      mtu = skb_dst(skb) ? dst_mtu(skb_dst(skb)) : dev->mtu;
      
      ...but because that dst_entry has an "ops" value set to md_dst_ops,
      the various ops (including mtu) are not set:
      
      crash> struct sk_buff._skb_refdst ffff928f87447700 -x
            _skb_refdst = 0xffffcd6fbf5ea590
      crash> struct dst_entry.ops 0xffffcd6fbf5ea590
        ops = 0xffffffffa0193800
      crash> struct dst_ops.mtu 0xffffffffa0193800
        mtu = 0x0
      crash>
      
      I confirmed that the dst entry also has dst->input set to
      dst_md_discard, so it looks like it's an entry that's been
      initialized via __metadata_dst_init alright.
      
      I think the fix here is to use skb_valid_dst(skb), which also checks
      for DST_METADATA. With that fix in place, the problem, which was
      previously 100% reproducible, disappears.
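The difference between checking only for a non-NULL dst and checking for a valid (non-metadata) dst can be shown with a userspace model. All types, the `DST_METADATA_MODEL` value and the helper names are hypothetical. `valid_dst_model()` mirrors the idea of skb_valid_dst() described above, and `metadata_ops_model` has no mtu callback, matching the `mtu = 0x0` seen in the crash session.

```c
#include <assert.h>
#include <stddef.h>

#define DST_METADATA_MODEL 0x1u /* hypothetical flag value */

struct dst_model;

struct dst_ops_model {
	unsigned int (*mtu)(const struct dst_model *dst); /* may be NULL */
};

struct dst_model {
	unsigned int flags;
	const struct dst_ops_model *ops;
};

struct dev_model {
	unsigned int mtu;
};

/* An ordinary routing entry with a fixed path MTU, for illustration. */
unsigned int route_mtu_model(const struct dst_model *dst)
{
	(void)dst;
	return 1400;
}

const struct dst_ops_model route_ops_model = { route_mtu_model };
const struct dst_ops_model metadata_ops_model = { NULL }; /* no mtu op */

/* Treat metadata dsts as "no dst" so callers fall back to the device. */
const struct dst_model *valid_dst_model(const struct dst_model *dst)
{
	if (dst && (dst->flags & DST_METADATA_MODEL))
		return NULL;
	return dst;
}

unsigned int tunnel_mtu_model(const struct dst_model *dst,
			      const struct dev_model *dev)
{
	const struct dst_model *valid = valid_dst_model(dst);

	assert(dev != NULL);
	/* A plain NULL check on dst would call metadata_ops_model.mtu here. */
	return valid ? valid->ops->mtu(valid) : dev->mtu;
}
```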
      
      The below patch resolves the panic and all bpf tunnel tests pass
      without incident.
      
      Fixes: c8b34e68 ("ip_tunnel: Add tnl_update_pmtu in ip_md_tunnel_xmit")
      Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
      Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Tested-by: Anders Roxell <anders.roxell@linaro.org>
      Reported-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Tested-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  14. 28 Feb 2019: 1 commit
  15. 25 Feb 2019: 1 commit
  16. 27 Jan 2019: 3 commits
  17. 25 Jan 2019: 1 commit
  18. 02 Jan 2019: 1 commit
    • ip: validate header length on virtual device xmit · cb9f1b78
      Committed by Willem de Bruijn
      KMSAN detected read beyond end of buffer in vti and sit devices when
      passing truncated packets with PF_PACKET. The issue affects additional
      ip tunnel devices.
      
      Extend commit 76c0ddd8 ("ip6_tunnel: be careful when accessing the
      inner header") and commit ccfec9e5 ("ip_tunnel: be careful when
      accessing the inner header").
      
      Move the check to a separate helper and call at the start of each
      ndo_start_xmit function in net/ipv4 and net/ipv6.
      
      Minor changes:
      - convert dev_kfree_skb to kfree_skb on error path,
        as dev_kfree_skb calls consume_skb which is not for error paths.
      - use pskb_network_may_pull even though that is pedantic here,
        as it is the same as pskb_may_pull for devices without llheaders.
      - do not cache ipv6 hdrs if used only once
        (unsafe across pskb_may_pull, was more relevant to earlier patch)
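The pattern of validating that the inner header is present before ndo_start_xmit reads it can be sketched in userspace. This is a hypothetical model: the packet is a flat byte array rather than an skb, and `inner_header_fits()`, `xmit_model()` and the `*_MODEL` protocol constants are invented names. The minimum lengths used are the fixed IPv4 (20 bytes) and IPv6 (40 bytes) header sizes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ETH_P_IP_MODEL   0x0800u /* EtherType values for IPv4 and IPv6 */
#define ETH_P_IPV6_MODEL 0x86DDu

/* Is the buffer long enough for the network header we are about to read? */
bool inner_header_fits(size_t len, uint16_t proto)
{
	size_t need;

	switch (proto) {
	case ETH_P_IP_MODEL:
		need = 20; /* minimal IPv4 header */
		break;
	case ETH_P_IPV6_MODEL:
		need = 40; /* fixed IPv6 header */
		break;
	default:
		need = 0;  /* nothing protocol-specific to validate */
		break;
	}
	return len >= need;
}

/* Models the start of an xmit handler: validate first, then read.
 * Returns the IPv4 TTL, or -1 if the packet is dropped as truncated. */
int xmit_model(const uint8_t *buf, size_t len, uint16_t proto)
{
	assert(buf != NULL || len == 0);
	if (!inner_header_fits(len, proto))
		return -1; /* error path: drop instead of reading past the end */
	if (proto == ETH_P_IP_MODEL)
		return buf[8]; /* TTL is byte 8 of the IPv4 header */
	return 0;
}
```

Putting the check in one helper called at the top of every handler is the same structural choice the commit describes.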
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  19. 25 Sep 2018: 1 commit
  20. 08 Jun 2018: 1 commit
  21. 02 Jun 2018: 1 commit
  22. 06 Apr 2018: 1 commit
    • ip_tunnel: better validate user provided tunnel names · 9cb726a2
      Committed by Eric Dumazet
      Use dev_valid_name() to make sure the user does not provide an illegal
      device name.
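The kinds of checks a name validator such as dev_valid_name() applies can be sketched in userspace. This `valid_ifname()` is a hypothetical reimplementation for illustration, not the kernel function. It rejects empty names, names that do not fit in 16 bytes (Linux's IFNAMSIZ) including the terminating NUL, "." and "..", and names containing '/', ':' or whitespace.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define IFNAMSIZ_MODEL 16 /* Linux IFNAMSIZ: 15 characters plus NUL */

bool valid_ifname(const char *name)
{
	size_t n = 0;

	assert(name != NULL);
	if (name[0] == '\0')
		return false;
	/* Count at most IFNAMSIZ_MODEL bytes; reaching it means no room for NUL. */
	while (n < IFNAMSIZ_MODEL && name[n] != '\0')
		n++;
	if (n == IFNAMSIZ_MODEL)
		return false;
	if (strcmp(name, ".") == 0 || strcmp(name, "..") == 0)
		return false;
	for (; *name != '\0'; name++) {
		if (*name == '/' || *name == ':' || isspace((unsigned char)*name))
			return false;
	}
	return true;
}
```

Validating before copying means the later copy into a fixed-size name buffer only ever sees names that fit.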
      
      syzbot caught the following bug:
      
      BUG: KASAN: stack-out-of-bounds in strlcpy include/linux/string.h:300 [inline]
      BUG: KASAN: stack-out-of-bounds in __ip_tunnel_create+0xca/0x6b0 net/ipv4/ip_tunnel.c:257
      Write of size 20 at addr ffff8801ac79f810 by task syzkaller268107/4482
      
      CPU: 0 PID: 4482 Comm: syzkaller268107 Not tainted 4.16.0+ #1
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:17 [inline]
       dump_stack+0x1b9/0x29f lib/dump_stack.c:53
       print_address_description+0x6c/0x20b mm/kasan/report.c:256
       kasan_report_error mm/kasan/report.c:354 [inline]
       kasan_report.cold.7+0xac/0x2f5 mm/kasan/report.c:412
       check_memory_region_inline mm/kasan/kasan.c:260 [inline]
       check_memory_region+0x13e/0x1b0 mm/kasan/kasan.c:267
       memcpy+0x37/0x50 mm/kasan/kasan.c:303
       strlcpy include/linux/string.h:300 [inline]
       __ip_tunnel_create+0xca/0x6b0 net/ipv4/ip_tunnel.c:257
       ip_tunnel_create net/ipv4/ip_tunnel.c:352 [inline]
       ip_tunnel_ioctl+0x818/0xd40 net/ipv4/ip_tunnel.c:861
       ipip_tunnel_ioctl+0x1c5/0x420 net/ipv4/ipip.c:350
       dev_ifsioc+0x43e/0xb90 net/core/dev_ioctl.c:334
       dev_ioctl+0x69a/0xcc0 net/core/dev_ioctl.c:525
       sock_ioctl+0x47e/0x680 net/socket.c:1015
       vfs_ioctl fs/ioctl.c:46 [inline]
       file_ioctl fs/ioctl.c:500 [inline]
       do_vfs_ioctl+0x1cf/0x1650 fs/ioctl.c:684
       ksys_ioctl+0xa9/0xd0 fs/ioctl.c:701
       SYSC_ioctl fs/ioctl.c:708 [inline]
       SyS_ioctl+0x24/0x30 fs/ioctl.c:706
       do_syscall_64+0x29e/0x9d0 arch/x86/entry/common.c:287
       entry_SYSCALL_64_after_hwframe+0x42/0xb7
      
      Fixes: c5441932 ("GRE: Refactor GRE tunneling code.")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  23. 29 Mar 2018: 1 commit
  24. 24 Mar 2018: 1 commit
    • ip_tunnel: Emit events for post-register MTU changes · f6cc9c05
      Committed by Petr Machata
      For tunnels created with IFLA_MTU, the MTU of the netdevice is set by
      rtnl_create_link() (called from rtnl_newlink()) before the device is
      registered. However without IFLA_MTU that's not done.
      
      rtnl_newlink() proceeds by calling struct rtnl_link_ops.newlink, which
      via ip_tunnel_newlink() calls register_netdevice(), and that emits
      NETDEV_REGISTER. Thus any listeners that inspect the netdevice get the
      MTU of 0.
      
      ip_tunnel_newlink() does correct the MTU after registering the
      netdevice, but since no event is emitted, the listeners don't learn
      about the MTU until something else happens, such as a NETDEV_UP event.
      That's not ideal.
      
      So instead of setting the MTU directly, go through dev_set_mtu(), which
      takes care of distributing the necessary NETDEV_PRECHANGEMTU and
      NETDEV_CHANGEMTU events.
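The difference between assigning the MTU directly and going through a setter that emits notifications can be modelled in userspace. Everything below is hypothetical: `struct netdev_model`, the event enum and the recording listener are invented stand-ins for the netdevice notifier chain. The model also omits details of the real dev_set_mtu(), such as range checks and listeners vetoing the change.

```c
#include <assert.h>
#include <stddef.h>

enum mtu_event_model { EV_PRECHANGEMTU_MODEL, EV_CHANGEMTU_MODEL };

struct netdev_model;
typedef void (*listener_model)(enum mtu_event_model ev,
			       const struct netdev_model *dev);

struct netdev_model {
	unsigned int mtu;
	listener_model listener; /* a single listener, for simplicity */
};

/* A listener that records what it was told. */
int events_seen;
unsigned int last_announced_mtu;

void record_event(enum mtu_event_model ev, const struct netdev_model *dev)
{
	events_seen++;
	if (ev == EV_CHANGEMTU_MODEL)
		last_announced_mtu = dev->mtu;
}

/* Direct assignment: listeners never hear about the new MTU. */
void set_mtu_silently(struct netdev_model *dev, unsigned int mtu)
{
	dev->mtu = mtu;
}

/* Setter that announces the change before and after applying it. */
void set_mtu_with_events(struct netdev_model *dev, unsigned int mtu)
{
	assert(dev != NULL && dev->listener != NULL);
	dev->listener(EV_PRECHANGEMTU_MODEL, dev);
	dev->mtu = mtu;
	dev->listener(EV_CHANGEMTU_MODEL, dev);
}
```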
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: Petr Machata <petrm@mellanox.com>
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  25. 19 Mar 2018: 1 commit
  26. 10 Mar 2018: 1 commit
    • net: do not create fallback tunnels for non-default namespaces · 79134e6c
      Committed by Eric Dumazet
      Fallback tunnels (like tunl0, gre0, gretap0, erspan0, sit0,
      ip6tnl0, ip6gre0) are automatically created when the corresponding
      module is loaded.
      
      These tunnels are also automatically created when a new network
      namespace is created, at a great cost.
      
      In many cases, netns are used for isolation purposes, and these
      extra network devices are a waste of resources. We are using
      thousands of netns per host, and hit the netns creation/delete
      bottleneck a lot. (Many thanks to Kirill for recent work on this)
      
      Add a new sysctl so that we can opt out of this automatic creation.
      
      Note that these tunnels are still created for the initial namespace,
      to be the least intrusive for typical setups.
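The resulting policy can be summarised as a small predicate. This is a hypothetical sketch: `should_create_fallback()` is an invented name, and its second parameter models the value written to /proc/sys/net/core/fb_tunnels_only_for_init_net in the test transcript below, using only the 0/1 semantics described in this commit.

```c
#include <assert.h>
#include <stdbool.h>

/* Fallback tunnels are always created in the initial namespace; in other
 * namespaces only when the sysctl is 0, i.e. the opt-out is not in effect. */
bool should_create_fallback(bool is_init_net, int fb_tunnels_only_for_init_net)
{
	if (is_init_net)
		return true; /* least intrusive for typical setups */
	return fb_tunnels_only_for_init_net == 0;
}
```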
      
      Tested:
      lpk43:~# cat add_del_unshare.sh
      for i in `seq 1 40`
      do
       (for j in `seq 1 100` ; do  unshare -n /bin/true >/dev/null ; done) &
      done
      wait
      
      lpk43:~# echo 0 >/proc/sys/net/core/fb_tunnels_only_for_init_net
      lpk43:~# time ./add_del_unshare.sh
      
      real	0m37.521s
      user	0m0.886s
      sys	7m7.084s
      lpk43:~# echo 1 >/proc/sys/net/core/fb_tunnels_only_for_init_net
      lpk43:~# time ./add_del_unshare.sh
      
      real	0m4.761s
      user	0m0.851s
      sys	1m8.343s
      lpk43:~#
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  27. 28 Feb 2018: 1 commit
  28. 27 Feb 2018: 1 commit
  29. 26 Jan 2018: 1 commit
  30. 25 Jan 2018: 1 commit
  31. 14 Dec 2017: 1 commit
  32. 20 Sep 2017: 1 commit
    • ipv4: speedup ipv6 tunnels dismantle · 64bc1781
      Committed by Eric Dumazet
      Implement exit_batch() method to dismantle more devices
      per round.
      
      (rtnl_lock() ...
       unregister_netdevice_many() ...
       rtnl_unlock())
      
      Tested:
      $ cat add_del_unshare.sh
      for i in `seq 1 40`
      do
       (for j in `seq 1 100` ; do unshare -n /bin/true >/dev/null ; done) &
      done
      wait ; grep net_namespace /proc/slabinfo
      
      Before patch :
      $ time ./add_del_unshare.sh
      net_namespace        126    282   5504    1    2 : tunables    8    4    0 : slabdata    126    282      0
      
      real    1m38.965s
      user    0m0.688s
      sys     0m37.017s
      
      After patch:
      $ time ./add_del_unshare.sh
      net_namespace        135    291   5504    1    2 : tunables    8    4    0 : slabdata    135    291      0
      
      real	0m22.117s
      user	0m0.728s
      sys	0m35.328s
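The idea above, dismantling more devices per rtnl_lock() round, can be made concrete with a hypothetical userspace model that only counts lock rounds. `lock_rounds`, `exit_per_net()` and `exit_batch_model()` are invented names, and the counter stands in for rtnl_lock()/rtnl_unlock() pairs; no devices are actually unregistered.

```c
#include <assert.h>
#include <stddef.h>

/* Counts lock/unlock pairs taken by the model below. */
int lock_rounds;

static void lock_model(void) { lock_rounds++; }
static void unlock_model(void) { }

/* One lock round per namespace being dismantled. */
void exit_per_net(size_t nets)
{
	for (size_t i = 0; i < nets; i++) {
		lock_model();
		/* unregister this namespace's tunnel devices */
		unlock_model();
	}
}

/* Collect devices from every namespace, then unregister them together. */
void exit_batch_model(size_t nets)
{
	size_t queued = 0;

	lock_model();
	for (size_t i = 0; i < nets; i++)
		queued++; /* queue this namespace's devices onto one list */
	assert(queued == nets);
	/* a single unregister_netdevice_many()-style call would go here */
	unlock_model();
}
```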
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  33. 13 Sep 2017: 1 commit
  34. 09 Sep 2017: 1 commit
  35. 17 Jun 2017: 1 commit
  36. 08 Jun 2017: 1 commit
    • net: Fix inconsistent teardown and release of private netdev state. · cf124db5
      Committed by David S. Miller
      Network devices can allocate resources and private memory using
      netdev_ops->ndo_init().  However, the release of these resources
      can occur in one of two different places.
      
      Either netdev_ops->ndo_uninit() or netdev->destructor().
      
      The decision of which operation frees the resources depends upon
      whether it is necessary for all netdev refs to be released before it
      is safe to perform the freeing.
      
      netdev_ops->ndo_uninit() presumably can occur right after the
      NETDEV_UNREGISTER notifier completes and the unicast and multicast
      address lists are flushed.
      
      netdev->destructor(), on the other hand, does not run until the
      netdev references all go away.
      
      Further complicating the situation is that netdev->destructor()
      almost universally also does a free_netdev().
      
      This creates a problem for the logic in register_netdevice(),
      because all callers of register_netdevice() manage the freeing
      of the netdev, and invoke free_netdev(dev) if register_netdevice()
      fails.
      
      If netdev_ops->ndo_init() succeeds, but something else fails inside
      of register_netdevice(), it does call ndo_ops->ndo_uninit().  But
      it is not able to invoke netdev->destructor().
      
      This is because netdev->destructor() will do a free_netdev() and
      then the caller of register_netdevice() will do the same.
      
      However, this means that the resources that would normally be released
      by netdev->destructor() will not be.
      
      Over the years drivers have added local hacks to deal with this, by
      invoking their destructor parts by hand when register_netdevice()
      fails.
      
      Many drivers do not try to deal with this, and instead we have leaks.
      
      Let's close this hole by formalizing the distinction between what
      private things need to be freed up by netdev->destructor() and whether
      the driver needs unregister_netdevice() to perform the free_netdev().
      
      netdev->priv_destructor() performs all actions to free up the private
      resources that used to be freed by netdev->destructor(), except for
      free_netdev().
      
      netdev->needs_free_netdev is a boolean that indicates whether
      free_netdev() should be done at the end of unregister_netdevice().
      
      Now, register_netdevice() can sanely release all resources after
      ndo_ops->ndo_init() succeeds, by invoking both ndo_ops->ndo_uninit()
      and netdev->priv_destructor().
      
      And at the end of unregister_netdevice(), we invoke
      netdev->priv_destructor() and optionally call free_netdev().
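The split between releasing private state and freeing the device itself can be sketched with a hypothetical userspace model. `struct netdev_model`, `free_priv_state()`, `register_failed_cleanup()` and `unregister_cleanup()` are invented names; the two fields mirror the roles of priv_destructor and needs_free_netdev described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical device model; not the kernel's struct net_device. */
struct netdev_model {
	void *priv_state; /* private memory set up by an ndo_init-like step */
	void (*priv_destructor)(struct netdev_model *dev);
	bool needs_free_netdev;
};

/* Frees private resources only; never frees the device itself. */
void free_priv_state(struct netdev_model *dev)
{
	free(dev->priv_state);
	dev->priv_state = NULL;
}

/* Registration failed after init succeeded: release private state and
 * leave the device struct for the caller's own free_netdev(). */
void register_failed_cleanup(struct netdev_model *dev)
{
	assert(dev != NULL);
	if (dev->priv_destructor)
		dev->priv_destructor(dev);
}

/* End of unregistration: release private state, then free the device
 * only if the driver asked for it. Returns true if dev was freed. */
bool unregister_cleanup(struct netdev_model *dev)
{
	assert(dev != NULL);
	if (dev->priv_destructor)
		dev->priv_destructor(dev);
	if (dev->needs_free_netdev) {
		free(dev);
		return true;
	}
	return false;
}
```

Because the private destructor never frees the device, the failure path can always call it without risking a double free against the caller's own free.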
      Signed-off-by: David S. Miller <davem@davemloft.net>
  37. 22 Apr 2017: 1 commit
    • ip_tunnel: Allow policy-based routing through tunnels · 9830ad4c
      Committed by Craig Gallek
      This feature allows the administrator to set an fwmark for
      packets traversing a tunnel.  This allows the use of independent
      routing tables for tunneled packets without the use of iptables.
      
      There is no concept of per-packet routing decisions through IPv4
      tunnels, so this implementation does not need to work with
      per-packet route lookups as the v6 implementation may
      (with IP6_TNL_F_USE_ORIG_FWMARK).
      
      Further, since the v4 tunnel ioctls share data structures
      (which cannot be trivially modified) with the kernel's internal
      tunnel configuration structures, the mark attribute must be stored
      in the tunnel structure itself and passed as a parameter when
      creating or changing tunnel attributes.
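Storing the mark in the tunnel and copying it into every outer route lookup can be sketched as follows. This is a hypothetical userspace model: `struct tunnel_model`, `struct flow_key_model`, `tunnel_init_flow()` and `select_table()` are invented names. `select_table()` is a toy stand-in for a policy rule such as one added with `ip rule add fwmark 0x10 table 100`, falling back to table 254, the Linux main table.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for stored tunnel config and a route lookup key. */
struct tunnel_model {
	uint32_t daddr;  /* remote endpoint */
	uint32_t fwmark; /* mark configured on the tunnel, 0 if unset */
};

struct flow_key_model {
	uint32_t daddr;
	uint32_t mark;
};

/* Build the key used to route the outer packet. The mark comes from the
 * tunnel's configuration, so every packet through the tunnel gets it. */
void tunnel_init_flow(struct flow_key_model *fl, const struct tunnel_model *t)
{
	assert(fl != NULL && t != NULL);
	memset(fl, 0, sizeof(*fl));
	fl->daddr = t->daddr;
	fl->mark = t->fwmark;
}

/* Toy policy rule: mark 0x10 selects table 100, otherwise the main table. */
int select_table(const struct flow_key_model *fl)
{
	return fl->mark == 0x10u ? 100 : 254;
}
```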
      Signed-off-by: Craig Gallek <kraig@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>