1. 10 January 2022, 1 commit
    • net: skb: use kfree_skb_reason() in tcp_v4_rcv() · 85125597
      Committed by Menglong Dong
      Replace kfree_skb() with kfree_skb_reason() in tcp_v4_rcv(). The
      following drop reasons are added:
      
      SKB_DROP_REASON_NO_SOCKET
      SKB_DROP_REASON_PKT_TOO_SMALL
      SKB_DROP_REASON_TCP_CSUM
      SKB_DROP_REASON_TCP_FILTER
      
      After this patch, the 'kfree_skb' event will print a message like this:
      
      $           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
      $              | |         |   |||||     |         |
                <idle>-0       [000] ..s1.    36.113438: kfree_skb: skbaddr=(____ptrval____) protocol=2048 location=(____ptrval____) reason: NO_SOCKET
      
      The reason for the skb drop is printed as well.
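
      A minimal sketch of the call-site pattern this commit describes (the
      helper name below is hypothetical; the real patch edits the drop labels
      inside tcp_v4_rcv() directly):

      	#include <linux/skbuff.h>

      	/* Illustration only: pass a drop reason instead of calling plain
      	 * kfree_skb(), so the kfree_skb tracepoint can report why the
      	 * packet was freed. */
      	static void sketch_drop_no_socket(struct sk_buff *skb)
      	{
      		/* before: kfree_skb(skb); */
      		kfree_skb_reason(skb, SKB_DROP_REASON_NO_SOCKET);
      	}
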
      Signed-off-by: Menglong Dong <imagedong@tencent.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      85125597
  2. 07 January 2022, 1 commit
    • net: bpf: Handle return value of BPF_CGROUP_RUN_PROG_INET{4,6}_POST_BIND() · 91a760b2
      Committed by Menglong Dong
      The return value of BPF_CGROUP_RUN_PROG_INET{4,6}_POST_BIND() in
      __inet_bind() is not handled properly. When the return value is
      non-zero, __inet_bind() sets inet_saddr and inet_rcv_saddr to 0 and
      exits:
      
      	err = BPF_CGROUP_RUN_PROG_INET4_POST_BIND(sk);
      	if (err) {
      		inet->inet_saddr = inet->inet_rcv_saddr = 0;
      		goto out_release_sock;
      	}
      
      Let's take UDP for example and see what will happen. A UDP socket is
      added to 'udp_prot.h.udp_table->hash' and 'udp_prot.h.udp_table->hash2'
      after sk->sk_prot->get_port() succeeds. If 'inet->inet_rcv_saddr' is
      specified here, then 'sk' will sit in an 'hslot2' of 'hash2' that it
      doesn't belong to (because inet_saddr is changed to 0), and received
      UDP packets will not be passed to this sock. If 'inet->inet_rcv_saddr'
      is not specified here, the sock will work fine and can receive packets
      properly, which is weird, as the 'bind()' has already failed.
      
      To undo the get_port() operation, introduce the 'put_port' field
      for 'struct proto'. For the TCP proto it is inet_put_port(); for the
      UDP proto it is udp_lib_unhash(); for the ICMP proto it is
      ping_unhash().
      
      Therefore, after a sys_bind() failure caused by
      BPF_CGROUP_RUN_PROG_INET4_POST_BIND(), the socket is unbound, which
      means it can try to bind to another port.
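
      A simplified sketch of the error path described above (illustrative,
      not the exact upstream diff):

      	err = BPF_CGROUP_RUN_PROG_INET4_POST_BIND(sk);
      	if (err) {
      		inet->inet_saddr = inet->inet_rcv_saddr = 0;
      		/* undo the earlier get_port(): inet_put_port() for TCP,
      		 * udp_lib_unhash() for UDP, ping_unhash() for ping sockets */
      		if (sk->sk_prot->put_port)
      			sk->sk_prot->put_port(sk);
      		goto out_release_sock;
      	}
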
      Signed-off-by: Menglong Dong <imagedong@tencent.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20220106132022.3470772-2-imagedong@tencent.com
      91a760b2
  3. 21 December 2021, 1 commit
    • inet: fully convert sk->sk_rx_dst to RCU rules · 8f905c0e
      Committed by Eric Dumazet
      syzbot reported various issues around early demux,
      one being included in this changelog [1]
      
      sk->sk_rx_dst is using RCU protection without clearly
      documenting it.
      
      And following sequences in tcp_v4_do_rcv()/tcp_v6_do_rcv()
      are not following standard RCU rules.
      
      [a]    dst_release(dst);
      [b]    sk->sk_rx_dst = NULL;
      
      They look wrong because a delete operation of RCU protected
      pointer is supposed to clear the pointer before
      the call_rcu()/synchronize_rcu() guarding actual memory freeing.
      
      In some cases indeed, dst could be freed before [b] is done.
      
      We could cheat by clearing sk_rx_dst before calling
      dst_release(), but this seems the right time to stick
      to standard RCU annotations and debugging facilities.
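
      A simplified sketch of the RCU rules this conversion applies (the
      exact accessors in the patch may differ):

      	/* writer side, under the socket lock: publish NULL first,
      	 * then drop the reference */
      	struct dst_entry *old = rcu_dereference_protected(sk->sk_rx_dst,
      					lockdep_sock_is_held(sk));
      	RCU_INIT_POINTER(sk->sk_rx_dst, NULL);
      	dst_release(old);

      	/* reader side, e.g. early demux, inside rcu_read_lock() */
      	struct dst_entry *dst = rcu_dereference(sk->sk_rx_dst);
      	if (dst)
      		dst = dst_check(dst, 0);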
      
      [1]
      BUG: KASAN: use-after-free in dst_check include/net/dst.h:470 [inline]
      BUG: KASAN: use-after-free in tcp_v4_early_demux+0x95b/0x960 net/ipv4/tcp_ipv4.c:1792
      Read of size 2 at addr ffff88807f1cb73a by task syz-executor.5/9204
      
      CPU: 0 PID: 9204 Comm: syz-executor.5 Not tainted 5.16.0-rc5-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       <TASK>
       __dump_stack lib/dump_stack.c:88 [inline]
       dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
       print_address_description.constprop.0.cold+0x8d/0x320 mm/kasan/report.c:247
       __kasan_report mm/kasan/report.c:433 [inline]
       kasan_report.cold+0x83/0xdf mm/kasan/report.c:450
       dst_check include/net/dst.h:470 [inline]
       tcp_v4_early_demux+0x95b/0x960 net/ipv4/tcp_ipv4.c:1792
       ip_rcv_finish_core.constprop.0+0x15de/0x1e80 net/ipv4/ip_input.c:340
       ip_list_rcv_finish.constprop.0+0x1b2/0x6e0 net/ipv4/ip_input.c:583
       ip_sublist_rcv net/ipv4/ip_input.c:609 [inline]
       ip_list_rcv+0x34e/0x490 net/ipv4/ip_input.c:644
       __netif_receive_skb_list_ptype net/core/dev.c:5508 [inline]
       __netif_receive_skb_list_core+0x549/0x8e0 net/core/dev.c:5556
       __netif_receive_skb_list net/core/dev.c:5608 [inline]
       netif_receive_skb_list_internal+0x75e/0xd80 net/core/dev.c:5699
       gro_normal_list net/core/dev.c:5853 [inline]
       gro_normal_list net/core/dev.c:5849 [inline]
       napi_complete_done+0x1f1/0x880 net/core/dev.c:6590
       virtqueue_napi_complete drivers/net/virtio_net.c:339 [inline]
       virtnet_poll+0xca2/0x11b0 drivers/net/virtio_net.c:1557
       __napi_poll+0xaf/0x440 net/core/dev.c:7023
       napi_poll net/core/dev.c:7090 [inline]
       net_rx_action+0x801/0xb40 net/core/dev.c:7177
       __do_softirq+0x29b/0x9c2 kernel/softirq.c:558
       invoke_softirq kernel/softirq.c:432 [inline]
       __irq_exit_rcu+0x123/0x180 kernel/softirq.c:637
       irq_exit_rcu+0x5/0x20 kernel/softirq.c:649
       common_interrupt+0x52/0xc0 arch/x86/kernel/irq.c:240
       asm_common_interrupt+0x1e/0x40 arch/x86/include/asm/idtentry.h:629
      RIP: 0033:0x7f5e972bfd57
      Code: 39 d1 73 14 0f 1f 80 00 00 00 00 48 8b 50 f8 48 83 e8 08 48 39 ca 77 f3 48 39 c3 73 3e 48 89 13 48 8b 50 f8 48 89 38 49 8b 0e <48> 8b 3e 48 83 c3 08 48 83 c6 08 eb bc 48 39 d1 72 9e 48 39 d0 73
      RSP: 002b:00007fff8a413210 EFLAGS: 00000283
      RAX: 00007f5e97108990 RBX: 00007f5e97108338 RCX: ffffffff81d3aa45
      RDX: ffffffff81d3aa45 RSI: 00007f5e97108340 RDI: ffffffff81d3aa45
      RBP: 00007f5e97107eb8 R08: 00007f5e97108d88 R09: 0000000093c2e8d9
      R10: 0000000000000000 R11: 0000000000000000 R12: 00007f5e97107eb0
      R13: 00007f5e97108338 R14: 00007f5e97107ea8 R15: 0000000000000019
       </TASK>
      
      Allocated by task 13:
       kasan_save_stack+0x1e/0x50 mm/kasan/common.c:38
       kasan_set_track mm/kasan/common.c:46 [inline]
       set_alloc_info mm/kasan/common.c:434 [inline]
       __kasan_slab_alloc+0x90/0xc0 mm/kasan/common.c:467
       kasan_slab_alloc include/linux/kasan.h:259 [inline]
       slab_post_alloc_hook mm/slab.h:519 [inline]
       slab_alloc_node mm/slub.c:3234 [inline]
       slab_alloc mm/slub.c:3242 [inline]
       kmem_cache_alloc+0x202/0x3a0 mm/slub.c:3247
       dst_alloc+0x146/0x1f0 net/core/dst.c:92
       rt_dst_alloc+0x73/0x430 net/ipv4/route.c:1613
       ip_route_input_slow+0x1817/0x3a20 net/ipv4/route.c:2340
       ip_route_input_rcu net/ipv4/route.c:2470 [inline]
       ip_route_input_noref+0x116/0x2a0 net/ipv4/route.c:2415
       ip_rcv_finish_core.constprop.0+0x288/0x1e80 net/ipv4/ip_input.c:354
       ip_list_rcv_finish.constprop.0+0x1b2/0x6e0 net/ipv4/ip_input.c:583
       ip_sublist_rcv net/ipv4/ip_input.c:609 [inline]
       ip_list_rcv+0x34e/0x490 net/ipv4/ip_input.c:644
       __netif_receive_skb_list_ptype net/core/dev.c:5508 [inline]
       __netif_receive_skb_list_core+0x549/0x8e0 net/core/dev.c:5556
       __netif_receive_skb_list net/core/dev.c:5608 [inline]
       netif_receive_skb_list_internal+0x75e/0xd80 net/core/dev.c:5699
       gro_normal_list net/core/dev.c:5853 [inline]
       gro_normal_list net/core/dev.c:5849 [inline]
       napi_complete_done+0x1f1/0x880 net/core/dev.c:6590
       virtqueue_napi_complete drivers/net/virtio_net.c:339 [inline]
       virtnet_poll+0xca2/0x11b0 drivers/net/virtio_net.c:1557
       __napi_poll+0xaf/0x440 net/core/dev.c:7023
       napi_poll net/core/dev.c:7090 [inline]
       net_rx_action+0x801/0xb40 net/core/dev.c:7177
       __do_softirq+0x29b/0x9c2 kernel/softirq.c:558
      
      Freed by task 13:
       kasan_save_stack+0x1e/0x50 mm/kasan/common.c:38
       kasan_set_track+0x21/0x30 mm/kasan/common.c:46
       kasan_set_free_info+0x20/0x30 mm/kasan/generic.c:370
       ____kasan_slab_free mm/kasan/common.c:366 [inline]
       ____kasan_slab_free mm/kasan/common.c:328 [inline]
       __kasan_slab_free+0xff/0x130 mm/kasan/common.c:374
       kasan_slab_free include/linux/kasan.h:235 [inline]
       slab_free_hook mm/slub.c:1723 [inline]
       slab_free_freelist_hook+0x8b/0x1c0 mm/slub.c:1749
       slab_free mm/slub.c:3513 [inline]
       kmem_cache_free+0xbd/0x5d0 mm/slub.c:3530
       dst_destroy+0x2d6/0x3f0 net/core/dst.c:127
       rcu_do_batch kernel/rcu/tree.c:2506 [inline]
       rcu_core+0x7ab/0x1470 kernel/rcu/tree.c:2741
       __do_softirq+0x29b/0x9c2 kernel/softirq.c:558
      
      Last potentially related work creation:
       kasan_save_stack+0x1e/0x50 mm/kasan/common.c:38
       __kasan_record_aux_stack+0xf5/0x120 mm/kasan/generic.c:348
       __call_rcu kernel/rcu/tree.c:2985 [inline]
       call_rcu+0xb1/0x740 kernel/rcu/tree.c:3065
       dst_release net/core/dst.c:177 [inline]
       dst_release+0x79/0xe0 net/core/dst.c:167
       tcp_v4_do_rcv+0x612/0x8d0 net/ipv4/tcp_ipv4.c:1712
       sk_backlog_rcv include/net/sock.h:1030 [inline]
       __release_sock+0x134/0x3b0 net/core/sock.c:2768
       release_sock+0x54/0x1b0 net/core/sock.c:3300
       tcp_sendmsg+0x36/0x40 net/ipv4/tcp.c:1441
       inet_sendmsg+0x99/0xe0 net/ipv4/af_inet.c:819
       sock_sendmsg_nosec net/socket.c:704 [inline]
       sock_sendmsg+0xcf/0x120 net/socket.c:724
       sock_write_iter+0x289/0x3c0 net/socket.c:1057
       call_write_iter include/linux/fs.h:2162 [inline]
       new_sync_write+0x429/0x660 fs/read_write.c:503
       vfs_write+0x7cd/0xae0 fs/read_write.c:590
       ksys_write+0x1ee/0x250 fs/read_write.c:643
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      The buggy address belongs to the object at ffff88807f1cb700
       which belongs to the cache ip_dst_cache of size 176
      The buggy address is located 58 bytes inside of
       176-byte region [ffff88807f1cb700, ffff88807f1cb7b0)
      The buggy address belongs to the page:
      page:ffffea0001fc72c0 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x7f1cb
      flags: 0xfff00000000200(slab|node=0|zone=1|lastcpupid=0x7ff)
      raw: 00fff00000000200 dead000000000100 dead000000000122 ffff8881413bb780
      raw: 0000000000000000 0000000000100010 00000001ffffffff 0000000000000000
      page dumped because: kasan: bad access detected
      page_owner tracks the page as allocated
      page last allocated via order 0, migratetype Unmovable, gfp_mask 0x112a20(GFP_ATOMIC|__GFP_NOWARN|__GFP_NORETRY|__GFP_HARDWALL), pid 5, ts 108466983062, free_ts 108048976062
       prep_new_page mm/page_alloc.c:2418 [inline]
       get_page_from_freelist+0xa72/0x2f50 mm/page_alloc.c:4149
       __alloc_pages+0x1b2/0x500 mm/page_alloc.c:5369
       alloc_pages+0x1a7/0x300 mm/mempolicy.c:2191
       alloc_slab_page mm/slub.c:1793 [inline]
       allocate_slab mm/slub.c:1930 [inline]
       new_slab+0x32d/0x4a0 mm/slub.c:1993
       ___slab_alloc+0x918/0xfe0 mm/slub.c:3022
       __slab_alloc.constprop.0+0x4d/0xa0 mm/slub.c:3109
       slab_alloc_node mm/slub.c:3200 [inline]
       slab_alloc mm/slub.c:3242 [inline]
       kmem_cache_alloc+0x35c/0x3a0 mm/slub.c:3247
       dst_alloc+0x146/0x1f0 net/core/dst.c:92
       rt_dst_alloc+0x73/0x430 net/ipv4/route.c:1613
       __mkroute_output net/ipv4/route.c:2564 [inline]
       ip_route_output_key_hash_rcu+0x921/0x2d00 net/ipv4/route.c:2791
       ip_route_output_key_hash+0x18b/0x300 net/ipv4/route.c:2619
       __ip_route_output_key include/net/route.h:126 [inline]
       ip_route_output_flow+0x23/0x150 net/ipv4/route.c:2850
       ip_route_output_key include/net/route.h:142 [inline]
       geneve_get_v4_rt+0x3a6/0x830 drivers/net/geneve.c:809
       geneve_xmit_skb drivers/net/geneve.c:899 [inline]
       geneve_xmit+0xc4a/0x3540 drivers/net/geneve.c:1082
       __netdev_start_xmit include/linux/netdevice.h:4994 [inline]
       netdev_start_xmit include/linux/netdevice.h:5008 [inline]
       xmit_one net/core/dev.c:3590 [inline]
       dev_hard_start_xmit+0x1eb/0x920 net/core/dev.c:3606
       __dev_queue_xmit+0x299a/0x3650 net/core/dev.c:4229
      page last free stack trace:
       reset_page_owner include/linux/page_owner.h:24 [inline]
       free_pages_prepare mm/page_alloc.c:1338 [inline]
       free_pcp_prepare+0x374/0x870 mm/page_alloc.c:1389
       free_unref_page_prepare mm/page_alloc.c:3309 [inline]
       free_unref_page+0x19/0x690 mm/page_alloc.c:3388
       qlink_free mm/kasan/quarantine.c:146 [inline]
       qlist_free_all+0x5a/0xc0 mm/kasan/quarantine.c:165
       kasan_quarantine_reduce+0x180/0x200 mm/kasan/quarantine.c:272
       __kasan_slab_alloc+0xa2/0xc0 mm/kasan/common.c:444
       kasan_slab_alloc include/linux/kasan.h:259 [inline]
       slab_post_alloc_hook mm/slab.h:519 [inline]
       slab_alloc_node mm/slub.c:3234 [inline]
       kmem_cache_alloc_node+0x255/0x3f0 mm/slub.c:3270
       __alloc_skb+0x215/0x340 net/core/skbuff.c:414
       alloc_skb include/linux/skbuff.h:1126 [inline]
       alloc_skb_with_frags+0x93/0x620 net/core/skbuff.c:6078
       sock_alloc_send_pskb+0x783/0x910 net/core/sock.c:2575
       mld_newpack+0x1df/0x770 net/ipv6/mcast.c:1754
       add_grhead+0x265/0x330 net/ipv6/mcast.c:1857
       add_grec+0x1053/0x14e0 net/ipv6/mcast.c:1995
       mld_send_initial_cr.part.0+0xf6/0x230 net/ipv6/mcast.c:2242
       mld_send_initial_cr net/ipv6/mcast.c:1232 [inline]
       mld_dad_work+0x1d3/0x690 net/ipv6/mcast.c:2268
       process_one_work+0x9b2/0x1690 kernel/workqueue.c:2298
       worker_thread+0x658/0x11f0 kernel/workqueue.c:2445
      
      Memory state around the buggy address:
       ffff88807f1cb600: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
       ffff88807f1cb680: fb fb fb fb fb fb fc fc fc fc fc fc fc fc fc fc
      >ffff88807f1cb700: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                              ^
       ffff88807f1cb780: fb fb fb fb fb fb fc fc fc fc fc fc fc fc fc fc
       ffff88807f1cb800: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      
      Fixes: 41063e9d ("ipv4: Early TCP socket demux.")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20211220143330.680945-1-eric.dumazet@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      8f905c0e
  4. 16 November 2021, 3 commits
    • tcp: defer skb freeing after socket lock is released · f35f8219
      Committed by Eric Dumazet
      tcp recvmsg() (or rx zerocopy) spends a fair amount of time
      freeing skbs after their payload has been consumed.
      
      A typical ~64KB GRO packet has to release ~45 page
      references, eventually going to page allocator
      for each of them.
      
      Currently, this freeing is performed while socket lock
      is held, meaning that there is a high chance that
      BH handler has to queue incoming packets to tcp socket backlog.
      
      This can cause additional latencies, because the user
      thread has to process the backlog at release_sock() time,
      and while doing so, additional frames can be added
      by BH handler.
      
      This patch adds logic to defer these frees after socket
      lock is released, or directly from BH handler if possible.
      
      Being able to free these skbs from BH handler helps a lot,
      because this avoids the usual alloc/free asymmetry,
      when BH handler and user thread do not run on same cpu or
      NUMA node.
      
      One cpu can now be fully utilized for the kernel->user copy,
      and another cpu is handling BH processing and skb/page
      allocs/frees (assuming RFS is not forcing use of a single CPU)
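
      A rough sketch of the deferral idea (the list below is a hypothetical
      simplification of the per-socket mechanism the patch adds):

      	struct sk_buff_head defer_list;		/* hypothetical per-socket list */

      	/* while the socket lock is held, in recvmsg(): queue instead of free */
      	__skb_queue_tail(&defer_list, skb);	/* instead of __kfree_skb(skb) */

      	/* after release_sock(sk), or from BH context when that is cheaper */
      	__skb_queue_purge(&defer_list);		/* skb/page frees happen here */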
      
      Tested:
       100Gbit NIC
       Max throughput for one TCP_STREAM flow, over 10 runs
      
      MTU : 1500
      Before: 55 Gbit
      After:  66 Gbit
      
      MTU : 4096+(headers)
      Before: 82 Gbit
      After:  95 Gbit
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f35f8219
    • net: remove sk_route_nocaps · aba54656
      Committed by Eric Dumazet
      Instead of using a full netdev_features_t, we can use a single bit,
      as sk_route_nocaps is only used to remove NETIF_F_GSO_MASK from
      sk->sk_route_caps.
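
      A sketch of the change; the bit name below is an assumption, not
      necessarily the exact upstream field:

      	/* before: a whole netdev_features_t just to hold NETIF_F_GSO_MASK */
      	sk->sk_route_caps &= ~sk->sk_route_nocaps;

      	/* after: a single "GSO disabled" bit is enough */
      	if (sk->sk_gso_disabled)
      		sk->sk_route_caps &= ~NETIF_F_GSO_MASK;
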
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      aba54656
    • tcp: minor optimization in tcp_add_backlog() · d519f350
      Committed by Eric Dumazet
      If the packet is going to be coalesced, the sk_sndbuf/sk_rcvbuf values
      are not used. Defer their access to the point where we need them.
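
      A sketch of the reordering; the 'coalesced' flag is a stand-in for the
      function's early-return path, not an upstream variable:

      	/* before: computed even when the skb is merged into the tail skb */
      	u32 limit = READ_ONCE(sk->sk_rcvbuf) + READ_ONCE(sk->sk_sndbuf);

      	/* after: only computed on the non-coalesced path */
      	if (!coalesced)
      		limit = READ_ONCE(sk->sk_rcvbuf) + READ_ONCE(sk->sk_sndbuf);
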
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d519f350
  5. 26 October 2021, 3 commits
  6. 15 October 2021, 2 commits
    • tcp: md5: Allow MD5SIG_FLAG_IFINDEX with ifindex=0 · a76c2315
      Committed by Leonard Crestez
      Multiple VRFs are generally meant to be "separate" but right now md5
      keys for the default VRF also affect connections inside VRFs if the IP
      addresses happen to overlap.
      
      So far the combination of TCP_MD5SIG_FLAG_IFINDEX with tcpm_ifindex == 0
      was an error; accept it to mean "the key only applies to the default VRF".
      This is what applications using VRFs for traffic separation want.
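
      A userspace sketch of the newly accepted combination (the key and
      helper name are illustrative):

      	#include <string.h>
      	#include <sys/socket.h>
      	#include <netinet/in.h>
      	#include <linux/tcp.h>	/* struct tcp_md5sig, TCP_MD5SIG_EXT */

      	static int set_default_vrf_key(int fd, const struct sockaddr_in *peer)
      	{
      		struct tcp_md5sig md5 = {};

      		memcpy(&md5.tcpm_addr, peer, sizeof(*peer));
      		md5.tcpm_flags   = TCP_MD5SIG_FLAG_IFINDEX;
      		md5.tcpm_ifindex = 0;		/* now means "default VRF only" */
      		md5.tcpm_keylen  = 6;
      		memcpy(md5.tcpm_key, "secret", 6);

      		return setsockopt(fd, IPPROTO_TCP, TCP_MD5SIG_EXT,
      				  &md5, sizeof(md5));
      	}
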
      Signed-off-by: Leonard Crestez <cdleonard@gmail.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a76c2315
    • tcp: md5: Fix overlap between vrf and non-vrf keys · 86f1e3a8
      Committed by Leonard Crestez
      With net.ipv4.tcp_l3mdev_accept=1 it is possible for a listen socket to
      accept connection from the same client address in different VRFs. It is
      also possible to set different MD5 keys for these clients which differ
      only in the tcpm_l3index field.
      
      This appears to work when distinguishing between different VRFs but not
      between non-VRF and VRF connections. In particular:
      
       * tcp_md5_do_lookup_exact will match a non-vrf key against a vrf key.
      This means that adding a key with l3index != 0 after a key with l3index
      == 0 will cause the earlier key to be deleted. Both keys can be present
      if the non-vrf key is added later.
       * __tcp_md5_do_lookup can match a non-vrf key before a vrf key. This
      causes failures if the passwords differ.
      
      Fix this by making tcp_md5_do_lookup_exact perform an actual exact
      comparison on l3index and by making __tcp_md5_do_lookup prefer
      vrf-bound keys above other considerations like prefixlen.
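
      A simplified sketch of the exact-match change (not the full diff):

      	/* tcp_md5_do_lookup_exact(): a non-VRF key (l3index == 0) must no
      	 * longer match a VRF-bound key, and vice versa */
      	if (key->family == family &&
      	    key->prefixlen == prefixlen &&
      	    key->l3index == l3index &&	/* added comparison */
      	    !memcmp(&key->addr, addr, size))
      		return key;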
      
      Fixes: dea53bb8 ("tcp: Add l3index to tcp_md5sig_key and md5 functions")
      Signed-off-by: Leonard Crestez <cdleonard@gmail.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      86f1e3a8
  7. 23 September 2021, 1 commit
  8. 24 July 2021, 7 commits
  9. 22 July 2021, 1 commit
  10. 20 July 2021, 1 commit
  11. 03 July 2021, 1 commit
  12. 30 June 2021, 1 commit
  13. 16 June 2021, 1 commit
    • tcp: Migrate TCP_NEW_SYN_RECV requests at receiving the final ACK. · d4f2c86b
      Committed by Kuniyuki Iwashima
      This patch also changes the code to call reuseport_migrate_sock() and
      inet_reqsk_clone(), but unlike the other cases, we do not call
      inet_reqsk_clone() right after reuseport_migrate_sock().
      
      Currently, in the receive path for TCP_NEW_SYN_RECV sockets, its listener
      has three kinds of refcnt:
      
        (A) for listener itself
        (B) carried by request_sock
        (C) sock_hold() in tcp_v[46]_rcv()
      
      While processing the req, (A) may disappear by close(listener). Also, (B)
      can disappear by accept(listener) once we put the req into the accept
      queue. So, we have to hold another refcnt (C) for the listener to prevent
      use-after-free.
      
      For socket migration, we call reuseport_migrate_sock() to select a listener
      with (A) and to increment the new listener's refcnt in tcp_v[46]_rcv().
      This refcnt corresponds to (C) and is cleaned up later in tcp_v[46]_rcv().
      Thus we have to take another refcnt (B) for the newly cloned request_sock.
      
      In inet_csk_complete_hashdance(), we hold the count (B), clone the req, and
      try to put the new req into the accept queue. By migrating req after
      winning the "own_req" race, we can avoid such a worst situation:
      
        CPU 1 looks up req1
        CPU 2 looks up req1, unhashes it, then CPU 1 loses the race
        CPU 3 looks up req2, unhashes it, then CPU 2 loses the race
        ...
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/20210612123224.12525-8-kuniyu@amazon.co.jp
      d4f2c86b
  14. 15 May 2021, 1 commit
    • tcp: add tracepoint for checksum errors · 709c0314
      Committed by Jakub Kicinski
      Add a tracepoint for capturing TCP segments with
      a bad checksum. This makes it easy to identify
      sources of bad frames in the fleet (e.g. machines
      with faulty NICs).
      
      It should also help tools like IOvisor's tcpdrop.py
      which are used today to get detailed information
      about such packets.
      
      We don't have a socket in many cases so we must
      open code the address extraction based just on
      the skb.
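
      A sketch of the open-coded address extraction (simplified; the real
      tracepoint fills its fields from these headers):

      	const struct iphdr *iph = ip_hdr(skb);
      	const struct tcphdr *th = tcp_hdr(skb);
      	__be32 saddr = iph->saddr;
      	__be32 daddr = iph->daddr;
      	__be16 sport = th->source;
      	__be16 dport = th->dest;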
      
      v2: add missing export for ipv6=m
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      709c0314
  15. 03 April 2021, 1 commit
    • mptcp: add mptcp reset option support · dc87efdb
      Committed by Florian Westphal
      The MPTCP reset option allows carrying an MPTCP-specific error code that
      provides more information on the nature of a connection reset.
      
      Reset option data received gets stored in the subflow context so it can
      be sent to userspace via the 'subflow closed' netlink event.
      
      When a subflow is closed, the desired error code that should be sent to
      the peer is also placed in the subflow context structure.
      
      If a reset is sent before subflow establishment could complete, e.g. on
      HMAC failure during an MP_JOIN operation, the mptcp skb extension is
      used to store the reset information.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      dc87efdb
  16. 02 April 2021, 1 commit
  17. 04 February 2021, 1 commit
  18. 21 January 2021, 2 commits
    • bpf: Remove extra lock_sock for TCP_ZEROCOPY_RECEIVE · 9cacf81f
      Committed by Stanislav Fomichev
      Add custom implementation of getsockopt hook for TCP_ZEROCOPY_RECEIVE.
      We skip generic hooks for TCP_ZEROCOPY_RECEIVE and have a custom
      call in do_tcp_getsockopt using the on-stack data. This removes
      3% overhead for locking/unlocking the socket.
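
      A rough sketch of the shape of the fast path; the helper names below
      are hypothetical, only the idea of running the hook on the on-stack
      buffer is taken from the patch:

      	case TCP_ZEROCOPY_RECEIVE: {
      		struct tcp_zerocopy_receive zc = {};	/* on-stack, no kmalloc */
      		int err;

      		err = copy_from_user(&zc, optval, sizeof(zc)) ? -EFAULT : 0;
      		if (!err)
      			err = do_zerocopy_receive(sk, &zc);	/* hypothetical */
      		if (!err)
      			/* hypothetical kernel-side hook: BPF sees the on-stack
      			 * data without another lock_sock()/kmalloc() round trip */
      			err = run_getsockopt_bpf_kern(sk, &zc);
      		return err;
      	}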
      
      Without this patch:
           3.38%     0.07%  tcp_mmap  [kernel.kallsyms]  [k] __cgroup_bpf_run_filter_getsockopt
                  |
                   --3.30%--__cgroup_bpf_run_filter_getsockopt
                             |
                              --0.81%--__kmalloc
      
      With the patch applied:
           0.52%     0.12%  tcp_mmap  [kernel.kallsyms]  [k] __cgroup_bpf_run_filter_getsockopt_kern
      
      Note, exporting uapi/tcp.h requires removing netinet/tcp.h
      from test_progs.h because those headers have conflicting
      definitions.
      Signed-off-by: Stanislav Fomichev <sdf@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/20210115163501.805133-2-sdf@google.com
      9cacf81f
    • tcp: Fix potential use-after-free due to double kfree() · c89dffc7
      Committed by Kuniyuki Iwashima
      Receiving ACK with a valid SYN cookie, cookie_v4_check() allocates struct
      request_sock and then can allocate inet_rsk(req)->ireq_opt. After that,
      tcp_v4_syn_recv_sock() allocates struct sock and copies ireq_opt to
      inet_sk(sk)->inet_opt. Normally, tcp_v4_syn_recv_sock() inserts the full
      socket into ehash and sets ireq_opt to NULL. Otherwise,
      tcp_v4_syn_recv_sock() has to reset inet_opt to NULL and free the full
      socket.
      
      The commit 01770a16 ("tcp: fix race condition when creating child
      sockets from syncookies") added a new path, in which more than one core
      can create full sockets for the same SYN cookie. Currently, the core that
      loses the race frees the full socket without resetting inet_opt, so both
      sock_put() and reqsk_put() end up calling kfree() on the same memory:
      
        sock_put
          sk_free
            __sk_free
              sk_destruct
                __sk_destruct
                  sk->sk_destruct/inet_sock_destruct
                    kfree(rcu_dereference_protected(inet->inet_opt, 1));
      
        reqsk_put
          reqsk_free
            __reqsk_free
              req->rsk_ops->destructor/tcp_v4_reqsk_destructor
                kfree(rcu_dereference_protected(inet_rsk(req)->ireq_opt, 1));
      
      Calling kmalloc() between the double kfree() can lead to use-after-free, so
      this patch fixes it by setting inet_opt to NULL before sock_put().
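
      A sketch of the fix on the losing path of tcp_v4_syn_recv_sock()
      (simplified):

      	put_and_exit:
      		/* the request_sock still owns ireq_opt; make sure the doomed
      		 * full socket does not free it a second time */
      		newinet->inet_opt = NULL;
      		inet_csk_prepare_forced_close(newsk);
      		tcp_done(newsk);
      		goto exit;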
      
      As a side note, this kind of issue does not happen for IPv6. This is
      because tcp_v6_syn_recv_sock() clones both ipv6_opt and pktopts which
      correspond to ireq_opt in IPv4.
      
      Fixes: 01770a16 ("tcp: fix race condition when creating child sockets from syncookies")
      CC: Ricardo Dias <rdias@singlestore.com>
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
      Reviewed-by: Benjamin Herrenschmidt <benh@amazon.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20210118055920.82516-1-kuniyu@amazon.co.jp
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      c89dffc7
  19. 20 January 2021, 1 commit
  20. 10 December 2020, 1 commit
  21. 04 December 2020, 1 commit
    • tcp: merge 'init_req' and 'route_req' functions · 7ea851d1
      Committed by Florian Westphal
      The Multipath-TCP standard (RFC 8684) says that an MPTCP host should send
      a TCP reset if the token in a MP_JOIN request is unknown.
      
      At this time we don't do this; the 3WHS completes and the 'new subflow'
      is reset afterwards. There are two ways to allow MPTCP to send the
      reset.
      
      1. override 'send_synack' callback and emit the rst from there.
         The drawback is that the request socket gets inserted into the
         listeners queue just to get removed again right away.
      
      2. Send the reset from the 'route_req' function instead.
         This avoids the 'add&remove request socket', but route_req lacks the
         skb that is required to send the TCP reset.
      
      Instead of just adding the skb to that function for MPTCP's sake alone,
      Paolo suggested merging the init_req and route_req functions.
      
      This saves one indirection in the SYN processing path and provides the
      skb to the merged function at the same time.
      
      'send reset on unknown mptcp join token' is added in the next patch.
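
      A sketch of the merged callback; the exact parameter list may differ
      slightly from the upstream signature:

      	struct tcp_request_sock_ops {
      		/* other callbacks elided; 'route_req' now also does what
      		 * 'init_req' used to do and receives the skb, so MPTCP can
      		 * send a reset from here */
      		struct dst_entry *(*route_req)(const struct sock *sk,
      					       struct sk_buff *skb,
      					       struct flowi *fl,
      					       struct request_sock *req);
      	};
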
      Suggested-by: Paolo Abeni <pabeni@redhat.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      7ea851d1
  22. 25 November 2020, 1 commit
    • tcp: Set ECT0 bit in tos/tclass for synack when BPF needs ECN · 407c85c7
      Committed by Alexander Duyck
      When a BPF program is used to select between TCP congestion control
      algorithms that either use ECN or not, there is a case where the
      synack for the frame was going out without the ECT0 bit set. A bit of
      research found that this was due to the final socket being configured to
      use dctcp while the listener socket was staying in cubic.
      
      To reproduce it all that is needed is to monitor TCP traffic while running
      the sample bpf program "samples/bpf/tcp_cong_kern.c". What is observed,
      assuming tcp_dctcp module is loaded or compiled in and the traffic matches
      the rules in the sample file, is that for all frames with the exception of
      the synack the ECT0 bit is set.
      
      To address that it is necessary to make one additional call to
      tcp_bpf_ca_needs_ecn using the request socket and then use the output of
      that to set the ECT0 bit for the tos/tclass of the packet.
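
      A sketch of the added check in the synack path (simplified):

      	/* tcp_v4_send_synack(): if the BPF-selected congestion control for
      	 * this request needs ECN and the chosen tos is not yet ECN-capable,
      	 * set ECT(0) */
      	if (!INET_ECN_is_capable(tos) &&
      	    tcp_bpf_ca_needs_ecn((struct sock *)req))
      		tos |= INET_ECN_ECT_0;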
      
      Fixes: 91b5b21c ("bpf: Add support for changing congestion control")
      Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>
      Link: https://lore.kernel.org/r/160593039663.2604.1374502006916871573.stgit@localhost.localdomain
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      407c85c7
  23. 24 November 2020, 1 commit
    • tcp: fix race condition when creating child sockets from syncookies · 01770a16
      Committed by Ricardo Dias
      When the TCP stack is in SYN flood mode, the server child socket is
      created from the SYN cookie received in a TCP packet with the ACK flag
      set.
      
      The child socket is created when the server receives the first TCP
      packet with a valid SYN cookie from the client. Usually, this packet
      corresponds to the final step of the TCP 3-way handshake, the ACK
      packet. But it is also possible to receive a valid SYN cookie from the
      first TCP data packet sent by the client, and thus create a child socket
      from that SYN cookie.
      
      Since a client socket is ready to send data as soon as it receives the
      SYN+ACK packet from the server, the client can send the ACK packet (sent
      by the TCP stack code), and the first data packet (sent by the userspace
      program) almost at the same time, and thus the server will equally
      receive the two TCP packets with valid SYN cookies almost at the same
      instant.
      
      When such an event happens, the TCP stack code has a race condition that
      occurs between the moment a lookup is done in the established
      connections hashtable to check for the existence of a connection for the
      same client, and the moment that the child socket is added to the
      established connections hashtable. As a consequence, this race condition
      can lead to a situation where we add two child sockets to the
      established connections hashtable and deliver two sockets to the
      userspace program to the same client.
      
      This patch fixes the race condition by checking whether an existing child
      socket exists for the same client when we are adding the second child
      socket to the established connections hashtable. If an existing child
      socket exists, we drop the packet and discard the second child socket
      for the same client.
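
      A sketch of the duplicate check described above (simplified; cleanup on
      the losing path is more involved upstream):

      	bool found_dup_sk = false;

      	*own_req = inet_ehash_nolisten(child, req_to_sk(req_unhash),
      				       &found_dup_sk);
      	if (!*own_req && found_dup_sk) {
      		/* another CPU already hashed a socket for the same client:
      		 * discard this child */
      		sock_put(child);
      		child = NULL;
      	}
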
      Signed-off-by: Ricardo Dias <rdias@singlestore.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20201120111133.GA67501@rdias-suse-pc.lan
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      01770a16
  24. 21 November 2020, 1 commit
    • tcp: Allow full IP tos/IPv6 tclass to be reflected in L3 header · 861602b5
      Committed by Alexander Duyck
      An issue was recently found where DCTCP SYN/ACK packets did not have the
      ECT bit set in the L3 header. A bit of code review found that the recent
      change referenced below had gone through and added a mask that prevented the
      ECN bits from being populated in the L3 header.
      
      This patch addresses that by rolling back the mask so that it is only
      applied to the flags coming from the incoming TCP request instead of
      applying it to the socket tos/tclass field. Doing this the ECT bits were
      restored in the SYN/ACK packets in my testing.
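
      A sketch of the corrected tos computation (the sysctl read is
      simplified to a local 'reflect_tos' flag):

      	/* mask the ECN bits only out of the reflected (incoming) tos and
      	 * keep the socket's own ECN bits */
      	tos = reflect_tos ? (tcp_rsk(req)->syn_tos & ~INET_ECN_MASK) |
      			    (inet_sk(sk)->tos & INET_ECN_MASK)
      			  : inet_sk(sk)->tos;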
      
      One thing that is not addressed by this patch set is the fact that
      tcp_reflect_tos appears to be incompatible with ECN based congestion
      avoidance algorithms. At a minimum the feature should likely be documented
      which it currently isn't.
      
      Fixes: ac8f1710 ("tcp: reflect tos value received in SYN to the socket")
      Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>
      Acked-by: Wei Wang <weiwan@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      861602b5
  25. 15 November 2020, 1 commit
  26. 06 October 2020, 1 commit
    • tcp: fix receive window update in tcp_add_backlog() · 86bccd03
      Committed by Eric Dumazet
      We got reports from GKE customers of flows being reset by netfilter
      conntrack unless nf_conntrack_tcp_be_liberal is set to 1.
      
      Traces seemed to suggest an ACK packet being dropped by the
      packet capture, or more likely that ACKs were received in the
      wrong order.
      
       wscale=7, SYN and SYNACK not shown here.
      
       This ACK allows the sender to send 1871*128 bytes from seq 51359321 :
       New right edge of the window -> 51359321+1871*128=51598809
      
       09:17:23.389210 IP A > B: Flags [.], ack 51359321, win 1871, options [nop,nop,TS val 10 ecr 999], length 0
      
       09:17:23.389212 IP B > A: Flags [.], seq 51422681:51424089, ack 1577, win 268, options [nop,nop,TS val 999 ecr 10], length 1408
       09:17:23.389214 IP A > B: Flags [.], ack 51422681, win 1376, options [nop,nop,TS val 10 ecr 999], length 0
       09:17:23.389253 IP B > A: Flags [.], seq 51424089:51488857, ack 1577, win 268, options [nop,nop,TS val 999 ecr 10], length 64768
       09:17:23.389272 IP A > B: Flags [.], ack 51488857, win 859, options [nop,nop,TS val 10 ecr 999], length 0
       09:17:23.389275 IP B > A: Flags [.], seq 51488857:51521241, ack 1577, win 268, options [nop,nop,TS val 999 ecr 10], length 32384
      
       Receiver now allows to send 606*128=77568 from seq 51521241 :
       New right edge of the window -> 51521241+606*128=51598809
      
       09:17:23.389296 IP A > B: Flags [.], ack 51521241, win 606, options [nop,nop,TS val 10 ecr 999], length 0
      
       09:17:23.389308 IP B > A: Flags [.], seq 51521241:51553625, ack 1577, win 268, options [nop,nop,TS val 999 ecr 10], length 32384
      
       It seems the sender exceeds RWIN allowance, since 51611353 > 51598809
      
       09:17:23.389346 IP B > A: Flags [.], seq 51553625:51611353, ack 1577, win 268, options [nop,nop,TS val 999 ecr 10], length 57728
       09:17:23.389356 IP B > A: Flags [.], seq 51611353:51618393, ack 1577, win 268, options [nop,nop,TS val 999 ecr 10], length 7040
      
       09:17:23.389367 IP A > B: Flags [.], ack 51611353, win 0, options [nop,nop,TS val 10 ecr 999], length 0
      
       netfilter conntrack is not happy and sends RST
      
       09:17:23.389389 IP A > B: Flags [R], seq 92176528, win 0, length 0
       09:17:23.389488 IP B > A: Flags [R], seq 174478967, win 0, length 0
      
       Now imagine ACK were delivered out of order and tcp_add_backlog() sets window based on wrong packet.
       New right edge of the window -> 51521241+859*128=51631193
      
      Normally TCP stack handles OOO packets just fine, but it
      turns out tcp_add_backlog() does not. It can update the window
      field of the aggregated packet even if the ACK sequence
      of the last received packet is too old.
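
      A sketch of the guard the fix adds in tcp_add_backlog()'s coalescing
      path (simplified):

      	/* only an ACK that advances ack_seq may update the receive window
      	 * stored in the aggregated (tail) packet */
      	if (after(TCP_SKB_CB(skb)->ack_seq, TCP_SKB_CB(tail)->ack_seq))
      		thtail->window = th->window;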
      
      Many thanks to Alexandre Ferrieux for independently reporting the issue
      and suggesting a fix.
      
      Fixes: 4f693b55 ("tcp: implement coalescing on backlog queue")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Alexandre Ferrieux <alexandre.ferrieux@orange.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      86bccd03
  27. 11 September 2020, 2 commits