1. 20 2月, 2022 4 次提交
  2. 28 1月, 2022 1 次提交
  3. 27 1月, 2022 1 次提交
    • E
      tcp: allocate tcp_death_row outside of struct netns_ipv4 · fbb82952
      Eric Dumazet 提交于
      I forgot tcp had per netns tracking of timewait sockets,
      and their sysctl to change the limit.
      
      After 0dad4087 ("tcp/dccp: get rid of inet_twsk_purge()"),
      whole struct net can be freed before last tw socket is freed.
      
      We need to allocate a separate struct inet_timewait_death_row
      object per netns.
      
      tw_count becomes a refcount and gains associated debugging infrastructure.
      
      BUG: KASAN: use-after-free in inet_twsk_kill+0x358/0x3c0 net/ipv4/inet_timewait_sock.c:46
      Read of size 8 at addr ffff88807d5f9f40 by task kworker/1:7/3690
      
      CPU: 1 PID: 3690 Comm: kworker/1:7 Not tainted 5.16.0-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Workqueue: events pwq_unbound_release_workfn
      Call Trace:
       <IRQ>
       __dump_stack lib/dump_stack.c:88 [inline]
       dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
       print_address_description.constprop.0.cold+0x8d/0x336 mm/kasan/report.c:255
       __kasan_report mm/kasan/report.c:442 [inline]
       kasan_report.cold+0x83/0xdf mm/kasan/report.c:459
       inet_twsk_kill+0x358/0x3c0 net/ipv4/inet_timewait_sock.c:46
       call_timer_fn+0x1a5/0x6b0 kernel/time/timer.c:1421
       expire_timers kernel/time/timer.c:1466 [inline]
       __run_timers.part.0+0x67c/0xa30 kernel/time/timer.c:1734
       __run_timers kernel/time/timer.c:1715 [inline]
       run_timer_softirq+0xb3/0x1d0 kernel/time/timer.c:1747
       __do_softirq+0x29b/0x9c2 kernel/softirq.c:558
       invoke_softirq kernel/softirq.c:432 [inline]
       __irq_exit_rcu+0x123/0x180 kernel/softirq.c:637
       irq_exit_rcu+0x5/0x20 kernel/softirq.c:649
       sysvec_apic_timer_interrupt+0x93/0xc0 arch/x86/kernel/apic/apic.c:1097
       </IRQ>
       <TASK>
       asm_sysvec_apic_timer_interrupt+0x12/0x20 arch/x86/include/asm/idtentry.h:638
      RIP: 0010:lockdep_unregister_key+0x1c9/0x250 kernel/locking/lockdep.c:6328
      Code: 00 00 00 48 89 ee e8 46 fd ff ff 4c 89 f7 e8 5e c9 ff ff e8 09 cc ff ff 9c 58 f6 c4 02 75 26 41 f7 c4 00 02 00 00 74 01 fb 5b <5d> 41 5c 41 5d 41 5e 41 5f e9 19 4a 08 00 0f 0b 5b 5d 41 5c 41 5d
      RSP: 0018:ffffc90004077cb8 EFLAGS: 00000206
      RAX: 0000000000000046 RBX: ffff88807b61b498 RCX: 0000000000000001
      RDX: dffffc0000000000 RSI: 0000000000000000 RDI: 0000000000000000
      RBP: ffff888077027128 R08: 0000000000000001 R09: ffffffff8f1ea4fc
      R10: fffffbfff1ff93ee R11: 000000000000af1e R12: 0000000000000246
      R13: 0000000000000000 R14: ffffffff8ffc89b8 R15: ffffffff90157fb0
       wq_unregister_lockdep kernel/workqueue.c:3508 [inline]
       pwq_unbound_release_workfn+0x254/0x340 kernel/workqueue.c:3746
       process_one_work+0x9ac/0x1650 kernel/workqueue.c:2307
       worker_thread+0x657/0x1110 kernel/workqueue.c:2454
       kthread+0x2e9/0x3a0 kernel/kthread.c:377
       ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
       </TASK>
      
      Allocated by task 3635:
       kasan_save_stack+0x1e/0x50 mm/kasan/common.c:38
       kasan_set_track mm/kasan/common.c:46 [inline]
       set_alloc_info mm/kasan/common.c:437 [inline]
       __kasan_slab_alloc+0x90/0xc0 mm/kasan/common.c:470
       kasan_slab_alloc include/linux/kasan.h:260 [inline]
       slab_post_alloc_hook mm/slab.h:732 [inline]
       slab_alloc_node mm/slub.c:3230 [inline]
       slab_alloc mm/slub.c:3238 [inline]
       kmem_cache_alloc+0x202/0x3a0 mm/slub.c:3243
       kmem_cache_zalloc include/linux/slab.h:705 [inline]
       net_alloc net/core/net_namespace.c:407 [inline]
       copy_net_ns+0x125/0x760 net/core/net_namespace.c:462
       create_new_namespaces+0x3f6/0xb20 kernel/nsproxy.c:110
       unshare_nsproxy_namespaces+0xc1/0x1f0 kernel/nsproxy.c:226
       ksys_unshare+0x445/0x920 kernel/fork.c:3048
       __do_sys_unshare kernel/fork.c:3119 [inline]
       __se_sys_unshare kernel/fork.c:3117 [inline]
       __x64_sys_unshare+0x2d/0x40 kernel/fork.c:3117
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      The buggy address belongs to the object at ffff88807d5f9a80
       which belongs to the cache net_namespace of size 6528
      The buggy address is located 1216 bytes inside of
       6528-byte region [ffff88807d5f9a80, ffff88807d5fb400)
      The buggy address belongs to the page:
      page:ffffea0001f57e00 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff88807d5f9a80 pfn:0x7d5f8
      head:ffffea0001f57e00 order:3 compound_mapcount:0 compound_pincount:0
      memcg:ffff888070023001
      flags: 0xfff00000010200(slab|head|node=0|zone=1|lastcpupid=0x7ff)
      raw: 00fff00000010200 ffff888010dd4f48 ffffea0001404e08 ffff8880118fd000
      raw: ffff88807d5f9a80 0000000000040002 00000001ffffffff ffff888070023001
      page dumped because: kasan: bad access detected
      page_owner tracks the page as allocated
      page last allocated via order 3, migratetype Unmovable, gfp_mask 0xd20c0(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC), pid 3634, ts 119694798460, free_ts 119693556950
       prep_new_page mm/page_alloc.c:2434 [inline]
       get_page_from_freelist+0xa72/0x2f50 mm/page_alloc.c:4165
       __alloc_pages+0x1b2/0x500 mm/page_alloc.c:5389
       alloc_pages+0x1aa/0x310 mm/mempolicy.c:2271
       alloc_slab_page mm/slub.c:1799 [inline]
       allocate_slab mm/slub.c:1944 [inline]
       new_slab+0x28a/0x3b0 mm/slub.c:2004
       ___slab_alloc+0x87c/0xe90 mm/slub.c:3018
       __slab_alloc.constprop.0+0x4d/0xa0 mm/slub.c:3105
       slab_alloc_node mm/slub.c:3196 [inline]
       slab_alloc mm/slub.c:3238 [inline]
       kmem_cache_alloc+0x35c/0x3a0 mm/slub.c:3243
       kmem_cache_zalloc include/linux/slab.h:705 [inline]
       net_alloc net/core/net_namespace.c:407 [inline]
       copy_net_ns+0x125/0x760 net/core/net_namespace.c:462
       create_new_namespaces+0x3f6/0xb20 kernel/nsproxy.c:110
       unshare_nsproxy_namespaces+0xc1/0x1f0 kernel/nsproxy.c:226
       ksys_unshare+0x445/0x920 kernel/fork.c:3048
       __do_sys_unshare kernel/fork.c:3119 [inline]
       __se_sys_unshare kernel/fork.c:3117 [inline]
       __x64_sys_unshare+0x2d/0x40 kernel/fork.c:3117
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      page last free stack trace:
       reset_page_owner include/linux/page_owner.h:24 [inline]
       free_pages_prepare mm/page_alloc.c:1352 [inline]
       free_pcp_prepare+0x374/0x870 mm/page_alloc.c:1404
       free_unref_page_prepare mm/page_alloc.c:3325 [inline]
       free_unref_page+0x19/0x690 mm/page_alloc.c:3404
       skb_free_head net/core/skbuff.c:655 [inline]
       skb_release_data+0x65d/0x790 net/core/skbuff.c:677
       skb_release_all net/core/skbuff.c:742 [inline]
       __kfree_skb net/core/skbuff.c:756 [inline]
       consume_skb net/core/skbuff.c:914 [inline]
       consume_skb+0xc2/0x160 net/core/skbuff.c:908
       skb_free_datagram+0x1b/0x1f0 net/core/datagram.c:325
       netlink_recvmsg+0x636/0xea0 net/netlink/af_netlink.c:1998
       sock_recvmsg_nosec net/socket.c:948 [inline]
       sock_recvmsg net/socket.c:966 [inline]
       sock_recvmsg net/socket.c:962 [inline]
       ____sys_recvmsg+0x2c4/0x600 net/socket.c:2632
       ___sys_recvmsg+0x127/0x200 net/socket.c:2674
       __sys_recvmsg+0xe2/0x1a0 net/socket.c:2704
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Memory state around the buggy address:
       ffff88807d5f9e00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
       ffff88807d5f9e80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      >ffff88807d5f9f00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                                 ^
       ffff88807d5f9f80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
       ffff88807d5fa000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      
      Fixes: 0dad4087 ("tcp/dccp: get rid of inet_twsk_purge()")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Reported-by: NPaolo Abeni <pabeni@redhat.com>
      Tested-by: NPaolo Abeni <pabeni@redhat.com>
      Link: https://lore.kernel.org/r/20220126180714.845362-1-eric.dumazet@gmail.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>
      fbb82952
  4. 25 1月, 2022 2 次提交
    • E
      ipv4/tcp: do not use per netns ctl sockets · 37ba017d
      Eric Dumazet 提交于
      TCP ipv4 uses per-cpu/per-netns ctl sockets in order to send
      RST and some ACK packets (on behalf of TIMEWAIT sockets).
      
      This adds memory and cpu costs, which do not seem needed.
      Now typical servers have 256 or more cores, this adds considerable
      tax to netns users.
      
      tcp sockets are used from BH context, are not receiving packets,
      and do not store any persistent state but the 'struct net' pointer
      in order to be able to use IPv4 output functions.
      
      Note that I attempted a related change in the past, that had
      to be hot-fixed in commit bdbbb852 ("ipv4: tcp: get rid of ugly unicast_sock")
      
      This patch could very well surface old bugs, on layers not
      taking care of sk->sk_kern_sock properly.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      37ba017d
    • E
      tcp/dccp: get rid of inet_twsk_purge() · 0dad4087
      Eric Dumazet 提交于
      Prior patches in the series made sure tw_timer_handler()
      can be fired after netns has been dismantled/freed.
      
      We no longer have to scan a potentially big TCP ehash
      table at netns dismantle.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0dad4087
  5. 22 1月, 2022 1 次提交
  6. 10 1月, 2022 1 次提交
    • M
      net: skb: use kfree_skb_reason() in tcp_v4_rcv() · 85125597
      Menglong Dong 提交于
      Replace kfree_skb() with kfree_skb_reason() in tcp_v4_rcv(). Following
      drop reasons are added:
      
      SKB_DROP_REASON_NO_SOCKET
      SKB_DROP_REASON_PKT_TOO_SMALL
      SKB_DROP_REASON_TCP_CSUM
      SKB_DROP_REASON_TCP_FILTER
      
      After this patch, 'kfree_skb' event will print message like this:
      
      $           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
      $              | |         |   |||||     |         |
                <idle>-0       [000] ..s1.    36.113438: kfree_skb: skbaddr=(____ptrval____) protocol=2048 location=(____ptrval____) reason: NO_SOCKET
      
      The reason of skb drop is printed too.
      Signed-off-by: NMenglong Dong <imagedong@tencent.com>
      Signed-off-by: NJakub Kicinski <kuba@kernel.org>
      85125597
  7. 07 1月, 2022 1 次提交
    • M
      net: bpf: Handle return value of BPF_CGROUP_RUN_PROG_INET{4,6}_POST_BIND() · 91a760b2
      Menglong Dong 提交于
      The return value of BPF_CGROUP_RUN_PROG_INET{4,6}_POST_BIND() in
      __inet_bind() is not handled properly. While the return value
      is non-zero, it will set inet_saddr and inet_rcv_saddr to 0 and
      exit:
      
      	err = BPF_CGROUP_RUN_PROG_INET4_POST_BIND(sk);
      	if (err) {
      		inet->inet_saddr = inet->inet_rcv_saddr = 0;
      		goto out_release_sock;
      	}
      
      Let's take UDP for example and see what will happen. For UDP
      socket, it will be added to 'udp_prot.h.udp_table->hash' and
      'udp_prot.h.udp_table->hash2' after the sk->sk_prot->get_port()
      called success. If 'inet->inet_rcv_saddr' is specified here,
      then 'sk' will be in the 'hslot2' of 'hash2' that it don't belong
      to (because inet_saddr is changed to 0), and UDP packet received
      will not be passed to this sock. If 'inet->inet_rcv_saddr' is not
      specified here, the sock will work fine, as it can receive packet
      properly, which is wired, as the 'bind()' is already failed.
      
      To undo the get_port() operation, introduce the 'put_port' field
      for 'struct proto'. For TCP proto, it is inet_put_port(); For UDP
      proto, it is udp_lib_unhash(); For icmp proto, it is
      ping_unhash().
      
      Therefore, after sys_bind() fail caused by
      BPF_CGROUP_RUN_PROG_INET4_POST_BIND(), it will be unbinded, which
      means that it can try to be binded to another port.
      Signed-off-by: NMenglong Dong <imagedong@tencent.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20220106132022.3470772-2-imagedong@tencent.com
      91a760b2
  8. 21 12月, 2021 1 次提交
    • E
      inet: fully convert sk->sk_rx_dst to RCU rules · 8f905c0e
      Eric Dumazet 提交于
      syzbot reported various issues around early demux,
      one being included in this changelog [1]
      
      sk->sk_rx_dst is using RCU protection without clearly
      documenting it.
      
      And following sequences in tcp_v4_do_rcv()/tcp_v6_do_rcv()
      are not following standard RCU rules.
      
      [a]    dst_release(dst);
      [b]    sk->sk_rx_dst = NULL;
      
      They look wrong because a delete operation of RCU protected
      pointer is supposed to clear the pointer before
      the call_rcu()/synchronize_rcu() guarding actual memory freeing.
      
      In some cases indeed, dst could be freed before [b] is done.
      
      We could cheat by clearing sk_rx_dst before calling
      dst_release(), but this seems the right time to stick
      to standard RCU annotations and debugging facilities.
      
      [1]
      BUG: KASAN: use-after-free in dst_check include/net/dst.h:470 [inline]
      BUG: KASAN: use-after-free in tcp_v4_early_demux+0x95b/0x960 net/ipv4/tcp_ipv4.c:1792
      Read of size 2 at addr ffff88807f1cb73a by task syz-executor.5/9204
      
      CPU: 0 PID: 9204 Comm: syz-executor.5 Not tainted 5.16.0-rc5-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       <TASK>
       __dump_stack lib/dump_stack.c:88 [inline]
       dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
       print_address_description.constprop.0.cold+0x8d/0x320 mm/kasan/report.c:247
       __kasan_report mm/kasan/report.c:433 [inline]
       kasan_report.cold+0x83/0xdf mm/kasan/report.c:450
       dst_check include/net/dst.h:470 [inline]
       tcp_v4_early_demux+0x95b/0x960 net/ipv4/tcp_ipv4.c:1792
       ip_rcv_finish_core.constprop.0+0x15de/0x1e80 net/ipv4/ip_input.c:340
       ip_list_rcv_finish.constprop.0+0x1b2/0x6e0 net/ipv4/ip_input.c:583
       ip_sublist_rcv net/ipv4/ip_input.c:609 [inline]
       ip_list_rcv+0x34e/0x490 net/ipv4/ip_input.c:644
       __netif_receive_skb_list_ptype net/core/dev.c:5508 [inline]
       __netif_receive_skb_list_core+0x549/0x8e0 net/core/dev.c:5556
       __netif_receive_skb_list net/core/dev.c:5608 [inline]
       netif_receive_skb_list_internal+0x75e/0xd80 net/core/dev.c:5699
       gro_normal_list net/core/dev.c:5853 [inline]
       gro_normal_list net/core/dev.c:5849 [inline]
       napi_complete_done+0x1f1/0x880 net/core/dev.c:6590
       virtqueue_napi_complete drivers/net/virtio_net.c:339 [inline]
       virtnet_poll+0xca2/0x11b0 drivers/net/virtio_net.c:1557
       __napi_poll+0xaf/0x440 net/core/dev.c:7023
       napi_poll net/core/dev.c:7090 [inline]
       net_rx_action+0x801/0xb40 net/core/dev.c:7177
       __do_softirq+0x29b/0x9c2 kernel/softirq.c:558
       invoke_softirq kernel/softirq.c:432 [inline]
       __irq_exit_rcu+0x123/0x180 kernel/softirq.c:637
       irq_exit_rcu+0x5/0x20 kernel/softirq.c:649
       common_interrupt+0x52/0xc0 arch/x86/kernel/irq.c:240
       asm_common_interrupt+0x1e/0x40 arch/x86/include/asm/idtentry.h:629
      RIP: 0033:0x7f5e972bfd57
      Code: 39 d1 73 14 0f 1f 80 00 00 00 00 48 8b 50 f8 48 83 e8 08 48 39 ca 77 f3 48 39 c3 73 3e 48 89 13 48 8b 50 f8 48 89 38 49 8b 0e <48> 8b 3e 48 83 c3 08 48 83 c6 08 eb bc 48 39 d1 72 9e 48 39 d0 73
      RSP: 002b:00007fff8a413210 EFLAGS: 00000283
      RAX: 00007f5e97108990 RBX: 00007f5e97108338 RCX: ffffffff81d3aa45
      RDX: ffffffff81d3aa45 RSI: 00007f5e97108340 RDI: ffffffff81d3aa45
      RBP: 00007f5e97107eb8 R08: 00007f5e97108d88 R09: 0000000093c2e8d9
      R10: 0000000000000000 R11: 0000000000000000 R12: 00007f5e97107eb0
      R13: 00007f5e97108338 R14: 00007f5e97107ea8 R15: 0000000000000019
       </TASK>
      
      Allocated by task 13:
       kasan_save_stack+0x1e/0x50 mm/kasan/common.c:38
       kasan_set_track mm/kasan/common.c:46 [inline]
       set_alloc_info mm/kasan/common.c:434 [inline]
       __kasan_slab_alloc+0x90/0xc0 mm/kasan/common.c:467
       kasan_slab_alloc include/linux/kasan.h:259 [inline]
       slab_post_alloc_hook mm/slab.h:519 [inline]
       slab_alloc_node mm/slub.c:3234 [inline]
       slab_alloc mm/slub.c:3242 [inline]
       kmem_cache_alloc+0x202/0x3a0 mm/slub.c:3247
       dst_alloc+0x146/0x1f0 net/core/dst.c:92
       rt_dst_alloc+0x73/0x430 net/ipv4/route.c:1613
       ip_route_input_slow+0x1817/0x3a20 net/ipv4/route.c:2340
       ip_route_input_rcu net/ipv4/route.c:2470 [inline]
       ip_route_input_noref+0x116/0x2a0 net/ipv4/route.c:2415
       ip_rcv_finish_core.constprop.0+0x288/0x1e80 net/ipv4/ip_input.c:354
       ip_list_rcv_finish.constprop.0+0x1b2/0x6e0 net/ipv4/ip_input.c:583
       ip_sublist_rcv net/ipv4/ip_input.c:609 [inline]
       ip_list_rcv+0x34e/0x490 net/ipv4/ip_input.c:644
       __netif_receive_skb_list_ptype net/core/dev.c:5508 [inline]
       __netif_receive_skb_list_core+0x549/0x8e0 net/core/dev.c:5556
       __netif_receive_skb_list net/core/dev.c:5608 [inline]
       netif_receive_skb_list_internal+0x75e/0xd80 net/core/dev.c:5699
       gro_normal_list net/core/dev.c:5853 [inline]
       gro_normal_list net/core/dev.c:5849 [inline]
       napi_complete_done+0x1f1/0x880 net/core/dev.c:6590
       virtqueue_napi_complete drivers/net/virtio_net.c:339 [inline]
       virtnet_poll+0xca2/0x11b0 drivers/net/virtio_net.c:1557
       __napi_poll+0xaf/0x440 net/core/dev.c:7023
       napi_poll net/core/dev.c:7090 [inline]
       net_rx_action+0x801/0xb40 net/core/dev.c:7177
       __do_softirq+0x29b/0x9c2 kernel/softirq.c:558
      
      Freed by task 13:
       kasan_save_stack+0x1e/0x50 mm/kasan/common.c:38
       kasan_set_track+0x21/0x30 mm/kasan/common.c:46
       kasan_set_free_info+0x20/0x30 mm/kasan/generic.c:370
       ____kasan_slab_free mm/kasan/common.c:366 [inline]
       ____kasan_slab_free mm/kasan/common.c:328 [inline]
       __kasan_slab_free+0xff/0x130 mm/kasan/common.c:374
       kasan_slab_free include/linux/kasan.h:235 [inline]
       slab_free_hook mm/slub.c:1723 [inline]
       slab_free_freelist_hook+0x8b/0x1c0 mm/slub.c:1749
       slab_free mm/slub.c:3513 [inline]
       kmem_cache_free+0xbd/0x5d0 mm/slub.c:3530
       dst_destroy+0x2d6/0x3f0 net/core/dst.c:127
       rcu_do_batch kernel/rcu/tree.c:2506 [inline]
       rcu_core+0x7ab/0x1470 kernel/rcu/tree.c:2741
       __do_softirq+0x29b/0x9c2 kernel/softirq.c:558
      
      Last potentially related work creation:
       kasan_save_stack+0x1e/0x50 mm/kasan/common.c:38
       __kasan_record_aux_stack+0xf5/0x120 mm/kasan/generic.c:348
       __call_rcu kernel/rcu/tree.c:2985 [inline]
       call_rcu+0xb1/0x740 kernel/rcu/tree.c:3065
       dst_release net/core/dst.c:177 [inline]
       dst_release+0x79/0xe0 net/core/dst.c:167
       tcp_v4_do_rcv+0x612/0x8d0 net/ipv4/tcp_ipv4.c:1712
       sk_backlog_rcv include/net/sock.h:1030 [inline]
       __release_sock+0x134/0x3b0 net/core/sock.c:2768
       release_sock+0x54/0x1b0 net/core/sock.c:3300
       tcp_sendmsg+0x36/0x40 net/ipv4/tcp.c:1441
       inet_sendmsg+0x99/0xe0 net/ipv4/af_inet.c:819
       sock_sendmsg_nosec net/socket.c:704 [inline]
       sock_sendmsg+0xcf/0x120 net/socket.c:724
       sock_write_iter+0x289/0x3c0 net/socket.c:1057
       call_write_iter include/linux/fs.h:2162 [inline]
       new_sync_write+0x429/0x660 fs/read_write.c:503
       vfs_write+0x7cd/0xae0 fs/read_write.c:590
       ksys_write+0x1ee/0x250 fs/read_write.c:643
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      The buggy address belongs to the object at ffff88807f1cb700
       which belongs to the cache ip_dst_cache of size 176
      The buggy address is located 58 bytes inside of
       176-byte region [ffff88807f1cb700, ffff88807f1cb7b0)
      The buggy address belongs to the page:
      page:ffffea0001fc72c0 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x7f1cb
      flags: 0xfff00000000200(slab|node=0|zone=1|lastcpupid=0x7ff)
      raw: 00fff00000000200 dead000000000100 dead000000000122 ffff8881413bb780
      raw: 0000000000000000 0000000000100010 00000001ffffffff 0000000000000000
      page dumped because: kasan: bad access detected
      page_owner tracks the page as allocated
      page last allocated via order 0, migratetype Unmovable, gfp_mask 0x112a20(GFP_ATOMIC|__GFP_NOWARN|__GFP_NORETRY|__GFP_HARDWALL), pid 5, ts 108466983062, free_ts 108048976062
       prep_new_page mm/page_alloc.c:2418 [inline]
       get_page_from_freelist+0xa72/0x2f50 mm/page_alloc.c:4149
       __alloc_pages+0x1b2/0x500 mm/page_alloc.c:5369
       alloc_pages+0x1a7/0x300 mm/mempolicy.c:2191
       alloc_slab_page mm/slub.c:1793 [inline]
       allocate_slab mm/slub.c:1930 [inline]
       new_slab+0x32d/0x4a0 mm/slub.c:1993
       ___slab_alloc+0x918/0xfe0 mm/slub.c:3022
       __slab_alloc.constprop.0+0x4d/0xa0 mm/slub.c:3109
       slab_alloc_node mm/slub.c:3200 [inline]
       slab_alloc mm/slub.c:3242 [inline]
       kmem_cache_alloc+0x35c/0x3a0 mm/slub.c:3247
       dst_alloc+0x146/0x1f0 net/core/dst.c:92
       rt_dst_alloc+0x73/0x430 net/ipv4/route.c:1613
       __mkroute_output net/ipv4/route.c:2564 [inline]
       ip_route_output_key_hash_rcu+0x921/0x2d00 net/ipv4/route.c:2791
       ip_route_output_key_hash+0x18b/0x300 net/ipv4/route.c:2619
       __ip_route_output_key include/net/route.h:126 [inline]
       ip_route_output_flow+0x23/0x150 net/ipv4/route.c:2850
       ip_route_output_key include/net/route.h:142 [inline]
       geneve_get_v4_rt+0x3a6/0x830 drivers/net/geneve.c:809
       geneve_xmit_skb drivers/net/geneve.c:899 [inline]
       geneve_xmit+0xc4a/0x3540 drivers/net/geneve.c:1082
       __netdev_start_xmit include/linux/netdevice.h:4994 [inline]
       netdev_start_xmit include/linux/netdevice.h:5008 [inline]
       xmit_one net/core/dev.c:3590 [inline]
       dev_hard_start_xmit+0x1eb/0x920 net/core/dev.c:3606
       __dev_queue_xmit+0x299a/0x3650 net/core/dev.c:4229
      page last free stack trace:
       reset_page_owner include/linux/page_owner.h:24 [inline]
       free_pages_prepare mm/page_alloc.c:1338 [inline]
       free_pcp_prepare+0x374/0x870 mm/page_alloc.c:1389
       free_unref_page_prepare mm/page_alloc.c:3309 [inline]
       free_unref_page+0x19/0x690 mm/page_alloc.c:3388
       qlink_free mm/kasan/quarantine.c:146 [inline]
       qlist_free_all+0x5a/0xc0 mm/kasan/quarantine.c:165
       kasan_quarantine_reduce+0x180/0x200 mm/kasan/quarantine.c:272
       __kasan_slab_alloc+0xa2/0xc0 mm/kasan/common.c:444
       kasan_slab_alloc include/linux/kasan.h:259 [inline]
       slab_post_alloc_hook mm/slab.h:519 [inline]
       slab_alloc_node mm/slub.c:3234 [inline]
       kmem_cache_alloc_node+0x255/0x3f0 mm/slub.c:3270
       __alloc_skb+0x215/0x340 net/core/skbuff.c:414
       alloc_skb include/linux/skbuff.h:1126 [inline]
       alloc_skb_with_frags+0x93/0x620 net/core/skbuff.c:6078
       sock_alloc_send_pskb+0x783/0x910 net/core/sock.c:2575
       mld_newpack+0x1df/0x770 net/ipv6/mcast.c:1754
       add_grhead+0x265/0x330 net/ipv6/mcast.c:1857
       add_grec+0x1053/0x14e0 net/ipv6/mcast.c:1995
       mld_send_initial_cr.part.0+0xf6/0x230 net/ipv6/mcast.c:2242
       mld_send_initial_cr net/ipv6/mcast.c:1232 [inline]
       mld_dad_work+0x1d3/0x690 net/ipv6/mcast.c:2268
       process_one_work+0x9b2/0x1690 kernel/workqueue.c:2298
       worker_thread+0x658/0x11f0 kernel/workqueue.c:2445
      
      Memory state around the buggy address:
       ffff88807f1cb600: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
       ffff88807f1cb680: fb fb fb fb fb fb fc fc fc fc fc fc fc fc fc fc
      >ffff88807f1cb700: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                              ^
       ffff88807f1cb780: fb fb fb fb fb fb fc fc fc fc fc fc fc fc fc fc
       ffff88807f1cb800: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      
      Fixes: 41063e9d ("ipv4: Early TCP socket demux.")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20211220143330.680945-1-eric.dumazet@gmail.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>
      8f905c0e
  9. 16 11月, 2021 3 次提交
    • E
      tcp: defer skb freeing after socket lock is released · f35f8219
      Eric Dumazet 提交于
      tcp recvmsg() (or rx zerocopy) spends a fair amount of time
      freeing skbs after their payload has been consumed.
      
      A typical ~64KB GRO packet has to release ~45 page
      references, eventually going to page allocator
      for each of them.
      
      Currently, this freeing is performed while socket lock
      is held, meaning that there is a high chance that
      BH handler has to queue incoming packets to tcp socket backlog.
      
      This can cause additional latencies, because the user
      thread has to process the backlog at release_sock() time,
      and while doing so, additional frames can be added
      by BH handler.
      
      This patch adds logic to defer these frees after socket
      lock is released, or directly from BH handler if possible.
      
      Being able to free these skbs from BH handler helps a lot,
      because this avoids the usual alloc/free assymetry,
      when BH handler and user thread do not run on same cpu or
      NUMA node.
      
      One cpu can now be fully utilized for the kernel->user copy,
      and another cpu is handling BH processing and skb/page
      allocs/frees (assuming RFS is not forcing use of a single CPU)
      
      Tested:
       100Gbit NIC
       Max throughput for one TCP_STREAM flow, over 10 runs
      
      MTU : 1500
      Before: 55 Gbit
      After:  66 Gbit
      
      MTU : 4096+(headers)
      Before: 82 Gbit
      After:  95 Gbit
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f35f8219
    • E
      net: remove sk_route_nocaps · aba54656
      Eric Dumazet 提交于
      Instead of using a full netdev_features_t, we can use a single bit,
      as sk_route_nocaps is only used to remove NETIF_F_GSO_MASK from
      sk->sk_route_cap.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      aba54656
    • E
      tcp: minor optimization in tcp_add_backlog() · d519f350
      Eric Dumazet 提交于
      If packet is going to be coalesced, sk_sndbuf/sk_rcvbuf values
      are not used. Defer their access to the point we need them.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d519f350
  10. 26 10月, 2021 3 次提交
  11. 15 10月, 2021 2 次提交
    • L
      tcp: md5: Allow MD5SIG_FLAG_IFINDEX with ifindex=0 · a76c2315
      Leonard Crestez 提交于
      Multiple VRFs are generally meant to be "separate" but right now md5
      keys for the default VRF also affect connections inside VRFs if the IP
      addresses happen to overlap.
      
      So far the combination of TCP_MD5SIG_FLAG_IFINDEX with tcpm_ifindex == 0
      was an error, accept this to mean "key only applies to default VRF".
      This is what applications using VRFs for traffic separation want.
      Signed-off-by: NLeonard Crestez <cdleonard@gmail.com>
      Reviewed-by: NDavid Ahern <dsahern@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a76c2315
    • L
      tcp: md5: Fix overlap between vrf and non-vrf keys · 86f1e3a8
      Leonard Crestez 提交于
      With net.ipv4.tcp_l3mdev_accept=1 it is possible for a listen socket to
      accept connection from the same client address in different VRFs. It is
      also possible to set different MD5 keys for these clients which differ
      only in the tcpm_l3index field.
      
      This appears to work when distinguishing between different VRFs but not
      between non-VRF and VRF connections. In particular:
      
       * tcp_md5_do_lookup_exact will match a non-vrf key against a vrf key.
      This means that adding a key with l3index != 0 after a key with l3index
      == 0 will cause the earlier key to be deleted. Both keys can be present
      if the non-vrf key is added later.
       * _tcp_md5_do_lookup can match a non-vrf key before a vrf key. This
      casues failures if the passwords differ.
      
      Fix this by making tcp_md5_do_lookup_exact perform an actual exact
      comparison on l3index and by making  __tcp_md5_do_lookup perfer
      vrf-bound keys above other considerations like prefixlen.
      
      Fixes: dea53bb8 ("tcp: Add l3index to tcp_md5sig_key and md5 functions")
      Signed-off-by: NLeonard Crestez <cdleonard@gmail.com>
      Reviewed-by: NDavid Ahern <dsahern@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      86f1e3a8
  12. 23 9月, 2021 1 次提交
  13. 24 7月, 2021 7 次提交
  14. 22 7月, 2021 1 次提交
  15. 20 7月, 2021 1 次提交
  16. 03 7月, 2021 1 次提交
  17. 30 6月, 2021 1 次提交
  18. 16 6月, 2021 1 次提交
    • K
      tcp: Migrate TCP_NEW_SYN_RECV requests at receiving the final ACK. · d4f2c86b
      Kuniyuki Iwashima 提交于
      This patch also changes the code to call reuseport_migrate_sock() and
      inet_reqsk_clone(), but unlike the other cases, we do not call
      inet_reqsk_clone() right after reuseport_migrate_sock().
      
      Currently, in the receive path for TCP_NEW_SYN_RECV sockets, its listener
      has three kinds of refcnt:
      
        (A) for listener itself
        (B) carried by reuqest_sock
        (C) sock_hold() in tcp_v[46]_rcv()
      
      While processing the req, (A) may disappear by close(listener). Also, (B)
      can disappear by accept(listener) once we put the req into the accept
      queue. So, we have to hold another refcnt (C) for the listener to prevent
      use-after-free.
      
      For socket migration, we call reuseport_migrate_sock() to select a listener
      with (A) and to increment the new listener's refcnt in tcp_v[46]_rcv().
      This refcnt corresponds to (C) and is cleaned up later in tcp_v[46]_rcv().
      Thus we have to take another refcnt (B) for the newly cloned request_sock.
      
      In inet_csk_complete_hashdance(), we hold the count (B), clone the req, and
      try to put the new req into the accept queue. By migrating req after
      winning the "own_req" race, we can avoid such a worst situation:
      
        CPU 1 looks up req1
        CPU 2 looks up req1, unhashes it, then CPU 1 loses the race
        CPU 3 looks up req2, unhashes it, then CPU 2 loses the race
        ...
      Signed-off-by: NKuniyuki Iwashima <kuniyu@amazon.co.jp>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/20210612123224.12525-8-kuniyu@amazon.co.jp
      d4f2c86b
  19. 15 5月, 2021 1 次提交
    • J
      tcp: add tracepoint for checksum errors · 709c0314
      Jakub Kicinski 提交于
      Add a tracepoint for capturing TCP segments with
      a bad checksum. This makes it easy to identify
      sources of bad frames in the fleet (e.g. machines
      with faulty NICs).
      
      It should also help tools like IOvisor's tcpdrop.py
      which are used today to get detailed information
      about such packets.
      
      We don't have a socket in many cases so we must
      open code the address extraction based just on
      the skb.
      
      v2: add missing export for ipv6=m
      Signed-off-by: NJakub Kicinski <kuba@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      709c0314
  20. 03 4月, 2021 1 次提交
    • F
      mptcp: add mptcp reset option support · dc87efdb
      Florian Westphal 提交于
      The MPTCP reset option allows to carry a mptcp-specific error code that
      provides more information on the nature of a connection reset.
      
      Reset option data received gets stored in the subflow context so it can
      be sent to userspace via the 'subflow closed' netlink event.
      
      When a subflow is closed, the desired error code that should be sent to
      the peer is also placed in the subflow context structure.
      
      If a reset is sent before subflow establishment could complete, e.g. on
      HMAC failure during an MP_JOIN operation, the mptcp skb extension is
      used to store the reset information.
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NMat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      dc87efdb
  21. 02 4月, 2021 1 次提交
  22. 04 2月, 2021 1 次提交
  23. 21 1月, 2021 2 次提交
    • S
      bpf: Remove extra lock_sock for TCP_ZEROCOPY_RECEIVE · 9cacf81f
      Stanislav Fomichev 提交于
      Add custom implementation of getsockopt hook for TCP_ZEROCOPY_RECEIVE.
      We skip generic hooks for TCP_ZEROCOPY_RECEIVE and have a custom
      call in do_tcp_getsockopt using the on-stack data. This removes
      3% overhead for locking/unlocking the socket.
      
      Without this patch:
           3.38%     0.07%  tcp_mmap  [kernel.kallsyms]  [k] __cgroup_bpf_run_filter_getsockopt
                  |
                   --3.30%--__cgroup_bpf_run_filter_getsockopt
                             |
                              --0.81%--__kmalloc
      
      With the patch applied:
           0.52%     0.12%  tcp_mmap  [kernel.kallsyms]  [k] __cgroup_bpf_run_filter_getsockopt_kern
      
      Note, exporting uapi/tcp.h requires removing netinet/tcp.h
      from test_progs.h because those headers have confliciting
      definitions.
      Signed-off-by: NStanislav Fomichev <sdf@google.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/20210115163501.805133-2-sdf@google.com
      9cacf81f
    • K
      tcp: Fix potential use-after-free due to double kfree() · c89dffc7
      Kuniyuki Iwashima 提交于
      Receiving ACK with a valid SYN cookie, cookie_v4_check() allocates struct
      request_sock and then can allocate inet_rsk(req)->ireq_opt. After that,
      tcp_v4_syn_recv_sock() allocates struct sock and copies ireq_opt to
      inet_sk(sk)->inet_opt. Normally, tcp_v4_syn_recv_sock() inserts the full
      socket into ehash and sets NULL to ireq_opt. Otherwise,
      tcp_v4_syn_recv_sock() has to reset inet_opt by NULL and free the full
      socket.
      
      The commit 01770a16 ("tcp: fix race condition when creating child
      sockets from syncookies") added a new path, in which more than one cores
      create full sockets for the same SYN cookie. Currently, the core which
      loses the race frees the full socket without resetting inet_opt, resulting
      in that both sock_put() and reqsk_put() call kfree() for the same memory:
      
        sock_put
          sk_free
            __sk_free
              sk_destruct
                __sk_destruct
                  sk->sk_destruct/inet_sock_destruct
                    kfree(rcu_dereference_protected(inet->inet_opt, 1));
      
        reqsk_put
          reqsk_free
            __reqsk_free
              req->rsk_ops->destructor/tcp_v4_reqsk_destructor
                kfree(rcu_dereference_protected(inet_rsk(req)->ireq_opt, 1));
      
      Calling kmalloc() between the double kfree() can lead to use-after-free, so
      this patch fixes it by setting NULL to inet_opt before sock_put().
      
      As a side note, this kind of issue does not happen for IPv6. This is
      because tcp_v6_syn_recv_sock() clones both ipv6_opt and pktopts which
      correspond to ireq_opt in IPv4.
      
      Fixes: 01770a16 ("tcp: fix race condition when creating child sockets from syncookies")
      CC: Ricardo Dias <rdias@singlestore.com>
      Signed-off-by: NKuniyuki Iwashima <kuniyu@amazon.co.jp>
      Reviewed-by: NBenjamin Herrenschmidt <benh@amazon.com>
      Reviewed-by: NEric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20210118055920.82516-1-kuniyu@amazon.co.jpSigned-off-by: NJakub Kicinski <kuba@kernel.org>
      c89dffc7
  24. 20 1月, 2021 1 次提交