1. 19 6月, 2022 1 次提交
  2. 10 6月, 2022 1 次提交
  3. 22 4月, 2022 1 次提交
    • G
      ipv4: Avoid using RTO_ONLINK with ip_route_connect(). · 67e1e2f4
      Guillaume Nault 提交于
      Now that ip_rt_fix_tos() doesn't reset ->flowi4_scope unconditionally,
      we don't have to rely on the RTO_ONLINK bit to properly set the scope
      of a flowi4 structure. We can just set ->flowi4_scope explicitly and
      avoid using RTO_ONLINK in ->flowi4_tos.
      
      This patch converts callers of ip_route_connect(). Instead of setting
      the tos parameter with RT_CONN_FLAGS(sk), as all callers do, we can:
      
        1- Drop the tos parameter from ip_route_connect(): its value was
           entirely based on sk, which is also passed as parameter.
      
        2- Set ->flowi4_scope depending on the SOCK_LOCALROUTE socket option
           instead of always initialising it with RT_SCOPE_UNIVERSE (let's
           define ip_sock_rt_scope() for this purpose).
      
        3- Avoid overloading ->flowi4_tos with RTO_ONLINK: since the scope is
           now properly initialised, we don't need to tell ip_rt_fix_tos() to
           adjust ->flowi4_scope for us. So let's define ip_sock_rt_tos(),
           which is the same as RT_CONN_FLAGS() but without the RTO_ONLINK
           bit overload.
      
      Note:
        In the original ip_route_connect() code, __ip_route_output_key()
        might clear the RTO_ONLINK bit of fl4->flowi4_tos (because of
        ip_rt_fix_tos()). Therefore flowi4_update_output() had to reuse the
        original tos variable. Now that we don't set RTO_ONLINK any more,
        this is not a problem and we can use fl4->flowi4_tos in
        flowi4_update_output().
      Signed-off-by: NGuillaume Nault <gnault@redhat.com>
      Reviewed-by: NDavid Ahern <dsahern@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      67e1e2f4
  4. 12 4月, 2022 1 次提交
    • O
      net: remove noblock parameter from recvmsg() entities · ec095263
      Oliver Hartkopp 提交于
      The internal recvmsg() functions have two parameters 'flags' and 'noblock'
      that were merged inside skb_recv_datagram(). As a follow up patch to commit
      f4b41f06 ("net: remove noblock parameter from skb_recv_datagram()")
      this patch removes the separate 'noblock' parameter for recvmsg().
      
      Analogue to the referenced patch for skb_recv_datagram() the 'flags' and
      'noblock' parameters are unnecessarily split up with e.g.
      
      err = sk->sk_prot->recvmsg(sk, msg, size, flags & MSG_DONTWAIT,
                                 flags & ~MSG_DONTWAIT, &addr_len);
      
      or in
      
      err = INDIRECT_CALL_2(sk->sk_prot->recvmsg, tcp_recvmsg, udp_recvmsg,
                            sk, msg, size, flags & MSG_DONTWAIT,
                            flags & ~MSG_DONTWAIT, &addr_len);
      
      instead of simply using only flags all the time and check for MSG_DONTWAIT
      where needed (to preserve for the formerly separated no(n)block condition).
      Signed-off-by: NOliver Hartkopp <socketcan@hartkopp.net>
      Link: https://lore.kernel.org/r/20220411124955.154876-1-socketcan@hartkopp.netSigned-off-by: NPaolo Abeni <pabeni@redhat.com>
      ec095263
  5. 21 2月, 2022 1 次提交
    • T
      gso: do not skip outer ip header in case of ipip and net_failover · cc20cced
      Tao Liu 提交于
      We encounter a tcp drop issue in our cloud environment. Packet GROed in
      host forwards to a VM virtio_net nic with net_failover enabled. VM acts
      as a IPVS LB with ipip encapsulation. The full path like:
      host gro -> vm virtio_net rx -> net_failover rx -> ipvs fullnat
       -> ipip encap -> net_failover tx -> virtio_net tx
      
      When net_failover transmits a ipip pkt (gso_type = 0x0103, which means
      SKB_GSO_TCPV4, SKB_GSO_DODGY and SKB_GSO_IPXIP4), there is no gso
      did because it supports TSO and GSO_IPXIP4. But network_header points to
      inner ip header.
      
      Call Trace:
       tcp4_gso_segment        ------> return NULL
       inet_gso_segment        ------> inner iph, network_header points to
       ipip_gso_segment
       inet_gso_segment        ------> outer iph
       skb_mac_gso_segment
      
      Afterwards virtio_net transmits the pkt, only inner ip header is modified.
      And the outer one just keeps unchanged. The pkt will be dropped in remote
      host.
      
      Call Trace:
       inet_gso_segment        ------> inner iph, outer iph is skipped
       skb_mac_gso_segment
       __skb_gso_segment
       validate_xmit_skb
       validate_xmit_skb_list
       sch_direct_xmit
       __qdisc_run
       __dev_queue_xmit        ------> virtio_net
       dev_hard_start_xmit
       __dev_queue_xmit        ------> net_failover
       ip_finish_output2
       ip_output
       iptunnel_xmit
       ip_tunnel_xmit
       ipip_tunnel_xmit        ------> ipip
       dev_hard_start_xmit
       __dev_queue_xmit
       ip_finish_output2
       ip_output
       ip_forward
       ip_rcv
       __netif_receive_skb_one_core
       netif_receive_skb_internal
       napi_gro_receive
       receive_buf
       virtnet_poll
       net_rx_action
      
      The root cause of this issue is specific with the rare combination of
      SKB_GSO_DODGY and a tunnel device that adds an SKB_GSO_ tunnel option.
      SKB_GSO_DODGY is set from external virtio_net. We need to reset network
      header when callbacks.gso_segment() returns NULL.
      
      This patch also includes ipv6_gso_segment(), considering SIT, etc.
      
      Fixes: cb32f511 ("ipip: add GSO/TSO support")
      Signed-off-by: NTao Liu <thomas.liu@ucloud.cn>
      Reviewed-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cc20cced
  6. 07 1月, 2022 1 次提交
    • M
      net: bpf: Handle return value of BPF_CGROUP_RUN_PROG_INET{4,6}_POST_BIND() · 91a760b2
      Menglong Dong 提交于
      The return value of BPF_CGROUP_RUN_PROG_INET{4,6}_POST_BIND() in
      __inet_bind() is not handled properly. While the return value
      is non-zero, it will set inet_saddr and inet_rcv_saddr to 0 and
      exit:
      
      	err = BPF_CGROUP_RUN_PROG_INET4_POST_BIND(sk);
      	if (err) {
      		inet->inet_saddr = inet->inet_rcv_saddr = 0;
      		goto out_release_sock;
      	}
      
      Let's take UDP for example and see what will happen. For UDP
      socket, it will be added to 'udp_prot.h.udp_table->hash' and
      'udp_prot.h.udp_table->hash2' after the sk->sk_prot->get_port()
      called success. If 'inet->inet_rcv_saddr' is specified here,
      then 'sk' will be in the 'hslot2' of 'hash2' that it don't belong
      to (because inet_saddr is changed to 0), and UDP packet received
      will not be passed to this sock. If 'inet->inet_rcv_saddr' is not
      specified here, the sock will work fine, as it can receive packet
      properly, which is wired, as the 'bind()' is already failed.
      
      To undo the get_port() operation, introduce the 'put_port' field
      for 'struct proto'. For TCP proto, it is inet_put_port(); For UDP
      proto, it is udp_lib_unhash(); For icmp proto, it is
      ping_unhash().
      
      Therefore, after sys_bind() fail caused by
      BPF_CGROUP_RUN_PROG_INET4_POST_BIND(), it will be unbinded, which
      means that it can try to be binded to another port.
      Signed-off-by: NMenglong Dong <imagedong@tencent.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20220106132022.3470772-2-imagedong@tencent.com
      91a760b2
  7. 30 12月, 2021 1 次提交
    • M
      net: fix use-after-free in tw_timer_handler · e22e45fc
      Muchun Song 提交于
      A real world panic issue was found as follow in Linux 5.4.
      
          BUG: unable to handle page fault for address: ffffde49a863de28
          PGD 7e6fe62067 P4D 7e6fe62067 PUD 7e6fe63067 PMD f51e064067 PTE 0
          RIP: 0010:tw_timer_handler+0x20/0x40
          Call Trace:
           <IRQ>
           call_timer_fn+0x2b/0x120
           run_timer_softirq+0x1ef/0x450
           __do_softirq+0x10d/0x2b8
           irq_exit+0xc7/0xd0
           smp_apic_timer_interrupt+0x68/0x120
           apic_timer_interrupt+0xf/0x20
      
      This issue was also reported since 2017 in the thread [1],
      unfortunately, the issue was still can be reproduced after fixing
      DCCP.
      
      The ipv4_mib_exit_net is called before tcp_sk_exit_batch when a net
      namespace is destroyed since tcp_sk_ops is registered befrore
      ipv4_mib_ops, which means tcp_sk_ops is in the front of ipv4_mib_ops
      in the list of pernet_list. There will be a use-after-free on
      net->mib.net_statistics in tw_timer_handler after ipv4_mib_exit_net
      if there are some inflight time-wait timers.
      
      This bug is not introduced by commit f2bf415c ("mib: add net to
      NET_ADD_STATS_BH") since the net_statistics is a global variable
      instead of dynamic allocation and freeing. Actually, commit
      61a7e260 ("mib: put net statistics on struct net") introduces
      the bug since it put net statistics on struct net and free it when
      net namespace is destroyed.
      
      Moving init_ipv4_mibs() to the front of tcp_init() to fix this bug
      and replace pr_crit() with panic() since continuing is meaningless
      when init_ipv4_mibs() fails.
      
      [1] https://groups.google.com/g/syzkaller/c/p1tn-_Kc6l4/m/smuL_FMAAgAJ?pli=1
      
      Fixes: 61a7e260 ("mib: put net statistics on struct net")
      Signed-off-by: NMuchun Song <songmuchun@bytedance.com>
      Cc: Cong Wang <cong.wang@bytedance.com>
      Cc: Fam Zheng <fam.zheng@bytedance.com>
      Cc: <stable@vger.kernel.org>
      Link: https://lore.kernel.org/r/20211228104145.9426-1-songmuchun@bytedance.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>
      e22e45fc
  8. 21 12月, 2021 1 次提交
    • E
      inet: fully convert sk->sk_rx_dst to RCU rules · 8f905c0e
      Eric Dumazet 提交于
      syzbot reported various issues around early demux,
      one being included in this changelog [1]
      
      sk->sk_rx_dst is using RCU protection without clearly
      documenting it.
      
      And following sequences in tcp_v4_do_rcv()/tcp_v6_do_rcv()
      are not following standard RCU rules.
      
      [a]    dst_release(dst);
      [b]    sk->sk_rx_dst = NULL;
      
      They look wrong because a delete operation of RCU protected
      pointer is supposed to clear the pointer before
      the call_rcu()/synchronize_rcu() guarding actual memory freeing.
      
      In some cases indeed, dst could be freed before [b] is done.
      
      We could cheat by clearing sk_rx_dst before calling
      dst_release(), but this seems the right time to stick
      to standard RCU annotations and debugging facilities.
      
      [1]
      BUG: KASAN: use-after-free in dst_check include/net/dst.h:470 [inline]
      BUG: KASAN: use-after-free in tcp_v4_early_demux+0x95b/0x960 net/ipv4/tcp_ipv4.c:1792
      Read of size 2 at addr ffff88807f1cb73a by task syz-executor.5/9204
      
      CPU: 0 PID: 9204 Comm: syz-executor.5 Not tainted 5.16.0-rc5-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       <TASK>
       __dump_stack lib/dump_stack.c:88 [inline]
       dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
       print_address_description.constprop.0.cold+0x8d/0x320 mm/kasan/report.c:247
       __kasan_report mm/kasan/report.c:433 [inline]
       kasan_report.cold+0x83/0xdf mm/kasan/report.c:450
       dst_check include/net/dst.h:470 [inline]
       tcp_v4_early_demux+0x95b/0x960 net/ipv4/tcp_ipv4.c:1792
       ip_rcv_finish_core.constprop.0+0x15de/0x1e80 net/ipv4/ip_input.c:340
       ip_list_rcv_finish.constprop.0+0x1b2/0x6e0 net/ipv4/ip_input.c:583
       ip_sublist_rcv net/ipv4/ip_input.c:609 [inline]
       ip_list_rcv+0x34e/0x490 net/ipv4/ip_input.c:644
       __netif_receive_skb_list_ptype net/core/dev.c:5508 [inline]
       __netif_receive_skb_list_core+0x549/0x8e0 net/core/dev.c:5556
       __netif_receive_skb_list net/core/dev.c:5608 [inline]
       netif_receive_skb_list_internal+0x75e/0xd80 net/core/dev.c:5699
       gro_normal_list net/core/dev.c:5853 [inline]
       gro_normal_list net/core/dev.c:5849 [inline]
       napi_complete_done+0x1f1/0x880 net/core/dev.c:6590
       virtqueue_napi_complete drivers/net/virtio_net.c:339 [inline]
       virtnet_poll+0xca2/0x11b0 drivers/net/virtio_net.c:1557
       __napi_poll+0xaf/0x440 net/core/dev.c:7023
       napi_poll net/core/dev.c:7090 [inline]
       net_rx_action+0x801/0xb40 net/core/dev.c:7177
       __do_softirq+0x29b/0x9c2 kernel/softirq.c:558
       invoke_softirq kernel/softirq.c:432 [inline]
       __irq_exit_rcu+0x123/0x180 kernel/softirq.c:637
       irq_exit_rcu+0x5/0x20 kernel/softirq.c:649
       common_interrupt+0x52/0xc0 arch/x86/kernel/irq.c:240
       asm_common_interrupt+0x1e/0x40 arch/x86/include/asm/idtentry.h:629
      RIP: 0033:0x7f5e972bfd57
      Code: 39 d1 73 14 0f 1f 80 00 00 00 00 48 8b 50 f8 48 83 e8 08 48 39 ca 77 f3 48 39 c3 73 3e 48 89 13 48 8b 50 f8 48 89 38 49 8b 0e <48> 8b 3e 48 83 c3 08 48 83 c6 08 eb bc 48 39 d1 72 9e 48 39 d0 73
      RSP: 002b:00007fff8a413210 EFLAGS: 00000283
      RAX: 00007f5e97108990 RBX: 00007f5e97108338 RCX: ffffffff81d3aa45
      RDX: ffffffff81d3aa45 RSI: 00007f5e97108340 RDI: ffffffff81d3aa45
      RBP: 00007f5e97107eb8 R08: 00007f5e97108d88 R09: 0000000093c2e8d9
      R10: 0000000000000000 R11: 0000000000000000 R12: 00007f5e97107eb0
      R13: 00007f5e97108338 R14: 00007f5e97107ea8 R15: 0000000000000019
       </TASK>
      
      Allocated by task 13:
       kasan_save_stack+0x1e/0x50 mm/kasan/common.c:38
       kasan_set_track mm/kasan/common.c:46 [inline]
       set_alloc_info mm/kasan/common.c:434 [inline]
       __kasan_slab_alloc+0x90/0xc0 mm/kasan/common.c:467
       kasan_slab_alloc include/linux/kasan.h:259 [inline]
       slab_post_alloc_hook mm/slab.h:519 [inline]
       slab_alloc_node mm/slub.c:3234 [inline]
       slab_alloc mm/slub.c:3242 [inline]
       kmem_cache_alloc+0x202/0x3a0 mm/slub.c:3247
       dst_alloc+0x146/0x1f0 net/core/dst.c:92
       rt_dst_alloc+0x73/0x430 net/ipv4/route.c:1613
       ip_route_input_slow+0x1817/0x3a20 net/ipv4/route.c:2340
       ip_route_input_rcu net/ipv4/route.c:2470 [inline]
       ip_route_input_noref+0x116/0x2a0 net/ipv4/route.c:2415
       ip_rcv_finish_core.constprop.0+0x288/0x1e80 net/ipv4/ip_input.c:354
       ip_list_rcv_finish.constprop.0+0x1b2/0x6e0 net/ipv4/ip_input.c:583
       ip_sublist_rcv net/ipv4/ip_input.c:609 [inline]
       ip_list_rcv+0x34e/0x490 net/ipv4/ip_input.c:644
       __netif_receive_skb_list_ptype net/core/dev.c:5508 [inline]
       __netif_receive_skb_list_core+0x549/0x8e0 net/core/dev.c:5556
       __netif_receive_skb_list net/core/dev.c:5608 [inline]
       netif_receive_skb_list_internal+0x75e/0xd80 net/core/dev.c:5699
       gro_normal_list net/core/dev.c:5853 [inline]
       gro_normal_list net/core/dev.c:5849 [inline]
       napi_complete_done+0x1f1/0x880 net/core/dev.c:6590
       virtqueue_napi_complete drivers/net/virtio_net.c:339 [inline]
       virtnet_poll+0xca2/0x11b0 drivers/net/virtio_net.c:1557
       __napi_poll+0xaf/0x440 net/core/dev.c:7023
       napi_poll net/core/dev.c:7090 [inline]
       net_rx_action+0x801/0xb40 net/core/dev.c:7177
       __do_softirq+0x29b/0x9c2 kernel/softirq.c:558
      
      Freed by task 13:
       kasan_save_stack+0x1e/0x50 mm/kasan/common.c:38
       kasan_set_track+0x21/0x30 mm/kasan/common.c:46
       kasan_set_free_info+0x20/0x30 mm/kasan/generic.c:370
       ____kasan_slab_free mm/kasan/common.c:366 [inline]
       ____kasan_slab_free mm/kasan/common.c:328 [inline]
       __kasan_slab_free+0xff/0x130 mm/kasan/common.c:374
       kasan_slab_free include/linux/kasan.h:235 [inline]
       slab_free_hook mm/slub.c:1723 [inline]
       slab_free_freelist_hook+0x8b/0x1c0 mm/slub.c:1749
       slab_free mm/slub.c:3513 [inline]
       kmem_cache_free+0xbd/0x5d0 mm/slub.c:3530
       dst_destroy+0x2d6/0x3f0 net/core/dst.c:127
       rcu_do_batch kernel/rcu/tree.c:2506 [inline]
       rcu_core+0x7ab/0x1470 kernel/rcu/tree.c:2741
       __do_softirq+0x29b/0x9c2 kernel/softirq.c:558
      
      Last potentially related work creation:
       kasan_save_stack+0x1e/0x50 mm/kasan/common.c:38
       __kasan_record_aux_stack+0xf5/0x120 mm/kasan/generic.c:348
       __call_rcu kernel/rcu/tree.c:2985 [inline]
       call_rcu+0xb1/0x740 kernel/rcu/tree.c:3065
       dst_release net/core/dst.c:177 [inline]
       dst_release+0x79/0xe0 net/core/dst.c:167
       tcp_v4_do_rcv+0x612/0x8d0 net/ipv4/tcp_ipv4.c:1712
       sk_backlog_rcv include/net/sock.h:1030 [inline]
       __release_sock+0x134/0x3b0 net/core/sock.c:2768
       release_sock+0x54/0x1b0 net/core/sock.c:3300
       tcp_sendmsg+0x36/0x40 net/ipv4/tcp.c:1441
       inet_sendmsg+0x99/0xe0 net/ipv4/af_inet.c:819
       sock_sendmsg_nosec net/socket.c:704 [inline]
       sock_sendmsg+0xcf/0x120 net/socket.c:724
       sock_write_iter+0x289/0x3c0 net/socket.c:1057
       call_write_iter include/linux/fs.h:2162 [inline]
       new_sync_write+0x429/0x660 fs/read_write.c:503
       vfs_write+0x7cd/0xae0 fs/read_write.c:590
       ksys_write+0x1ee/0x250 fs/read_write.c:643
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      The buggy address belongs to the object at ffff88807f1cb700
       which belongs to the cache ip_dst_cache of size 176
      The buggy address is located 58 bytes inside of
       176-byte region [ffff88807f1cb700, ffff88807f1cb7b0)
      The buggy address belongs to the page:
      page:ffffea0001fc72c0 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x7f1cb
      flags: 0xfff00000000200(slab|node=0|zone=1|lastcpupid=0x7ff)
      raw: 00fff00000000200 dead000000000100 dead000000000122 ffff8881413bb780
      raw: 0000000000000000 0000000000100010 00000001ffffffff 0000000000000000
      page dumped because: kasan: bad access detected
      page_owner tracks the page as allocated
      page last allocated via order 0, migratetype Unmovable, gfp_mask 0x112a20(GFP_ATOMIC|__GFP_NOWARN|__GFP_NORETRY|__GFP_HARDWALL), pid 5, ts 108466983062, free_ts 108048976062
       prep_new_page mm/page_alloc.c:2418 [inline]
       get_page_from_freelist+0xa72/0x2f50 mm/page_alloc.c:4149
       __alloc_pages+0x1b2/0x500 mm/page_alloc.c:5369
       alloc_pages+0x1a7/0x300 mm/mempolicy.c:2191
       alloc_slab_page mm/slub.c:1793 [inline]
       allocate_slab mm/slub.c:1930 [inline]
       new_slab+0x32d/0x4a0 mm/slub.c:1993
       ___slab_alloc+0x918/0xfe0 mm/slub.c:3022
       __slab_alloc.constprop.0+0x4d/0xa0 mm/slub.c:3109
       slab_alloc_node mm/slub.c:3200 [inline]
       slab_alloc mm/slub.c:3242 [inline]
       kmem_cache_alloc+0x35c/0x3a0 mm/slub.c:3247
       dst_alloc+0x146/0x1f0 net/core/dst.c:92
       rt_dst_alloc+0x73/0x430 net/ipv4/route.c:1613
       __mkroute_output net/ipv4/route.c:2564 [inline]
       ip_route_output_key_hash_rcu+0x921/0x2d00 net/ipv4/route.c:2791
       ip_route_output_key_hash+0x18b/0x300 net/ipv4/route.c:2619
       __ip_route_output_key include/net/route.h:126 [inline]
       ip_route_output_flow+0x23/0x150 net/ipv4/route.c:2850
       ip_route_output_key include/net/route.h:142 [inline]
       geneve_get_v4_rt+0x3a6/0x830 drivers/net/geneve.c:809
       geneve_xmit_skb drivers/net/geneve.c:899 [inline]
       geneve_xmit+0xc4a/0x3540 drivers/net/geneve.c:1082
       __netdev_start_xmit include/linux/netdevice.h:4994 [inline]
       netdev_start_xmit include/linux/netdevice.h:5008 [inline]
       xmit_one net/core/dev.c:3590 [inline]
       dev_hard_start_xmit+0x1eb/0x920 net/core/dev.c:3606
       __dev_queue_xmit+0x299a/0x3650 net/core/dev.c:4229
      page last free stack trace:
       reset_page_owner include/linux/page_owner.h:24 [inline]
       free_pages_prepare mm/page_alloc.c:1338 [inline]
       free_pcp_prepare+0x374/0x870 mm/page_alloc.c:1389
       free_unref_page_prepare mm/page_alloc.c:3309 [inline]
       free_unref_page+0x19/0x690 mm/page_alloc.c:3388
       qlink_free mm/kasan/quarantine.c:146 [inline]
       qlist_free_all+0x5a/0xc0 mm/kasan/quarantine.c:165
       kasan_quarantine_reduce+0x180/0x200 mm/kasan/quarantine.c:272
       __kasan_slab_alloc+0xa2/0xc0 mm/kasan/common.c:444
       kasan_slab_alloc include/linux/kasan.h:259 [inline]
       slab_post_alloc_hook mm/slab.h:519 [inline]
       slab_alloc_node mm/slub.c:3234 [inline]
       kmem_cache_alloc_node+0x255/0x3f0 mm/slub.c:3270
       __alloc_skb+0x215/0x340 net/core/skbuff.c:414
       alloc_skb include/linux/skbuff.h:1126 [inline]
       alloc_skb_with_frags+0x93/0x620 net/core/skbuff.c:6078
       sock_alloc_send_pskb+0x783/0x910 net/core/sock.c:2575
       mld_newpack+0x1df/0x770 net/ipv6/mcast.c:1754
       add_grhead+0x265/0x330 net/ipv6/mcast.c:1857
       add_grec+0x1053/0x14e0 net/ipv6/mcast.c:1995
       mld_send_initial_cr.part.0+0xf6/0x230 net/ipv6/mcast.c:2242
       mld_send_initial_cr net/ipv6/mcast.c:1232 [inline]
       mld_dad_work+0x1d3/0x690 net/ipv6/mcast.c:2268
       process_one_work+0x9b2/0x1690 kernel/workqueue.c:2298
       worker_thread+0x658/0x11f0 kernel/workqueue.c:2445
      
      Memory state around the buggy address:
       ffff88807f1cb600: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
       ffff88807f1cb680: fb fb fb fb fb fb fc fc fc fc fc fc fc fc fc fc
      >ffff88807f1cb700: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                              ^
       ffff88807f1cb780: fb fb fb fb fb fb fc fc fc fc fc fc fc fc fc fc
       ffff88807f1cb800: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      
      Fixes: 41063e9d ("ipv4: Early TCP socket demux.")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20211220143330.680945-1-eric.dumazet@gmail.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>
      8f905c0e
  9. 25 11月, 2021 2 次提交
  10. 24 11月, 2021 1 次提交
  11. 18 11月, 2021 1 次提交
  12. 16 11月, 2021 1 次提交
  13. 28 10月, 2021 2 次提交
    • P
      net: introduce sk_forward_alloc_get() · 292e6077
      Paolo Abeni 提交于
      A later patch will change the MPTCP memory accounting schema
      in such a way that MPTCP sockets will encode the total amount of
      forward allocated memory in two separate fields (one for tx and
      one for rx).
      
      MPTCP sockets will use their own helper to provide the accurate
      amount of fwd allocated memory.
      
      To allow the above, this patch adds a new, optional, sk method to
      fetch the fwd memory, wrap the call in a new helper and use it
      where it is appropriate.
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: NMat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: NJakub Kicinski <kuba@kernel.org>
      292e6077
    • E
      inet: remove races in inet{6}_getname() · 9dfc685e
      Eric Dumazet 提交于
      syzbot reported data-races in inet_getname() multiple times,
      it is time we fix this instead of pretending applications
      should not trigger them.
      
      getsockname() and getpeername() are not really considered fast path.
      
      v2: added the missing BPF_CGROUP_RUN_SA_PROG() declaration
          needed when CONFIG_CGROUP_BPF=n, as reported by
          kernel test robot <lkp@intel.com>
      
      syzbot typical report:
      
      BUG: KCSAN: data-race in __inet_hash_connect / inet_getname
      
      write to 0xffff888136d66cf8 of 2 bytes by task 14374 on cpu 1:
       __inet_hash_connect+0x7ec/0x950 net/ipv4/inet_hashtables.c:831
       inet_hash_connect+0x85/0x90 net/ipv4/inet_hashtables.c:853
       tcp_v4_connect+0x782/0xbb0 net/ipv4/tcp_ipv4.c:275
       __inet_stream_connect+0x156/0x6e0 net/ipv4/af_inet.c:664
       inet_stream_connect+0x44/0x70 net/ipv4/af_inet.c:728
       __sys_connect_file net/socket.c:1896 [inline]
       __sys_connect+0x254/0x290 net/socket.c:1913
       __do_sys_connect net/socket.c:1923 [inline]
       __se_sys_connect net/socket.c:1920 [inline]
       __x64_sys_connect+0x3d/0x50 net/socket.c:1920
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x44/0xa0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      read to 0xffff888136d66cf8 of 2 bytes by task 14408 on cpu 0:
       inet_getname+0x11f/0x170 net/ipv4/af_inet.c:790
       __sys_getsockname+0x11d/0x1b0 net/socket.c:1946
       __do_sys_getsockname net/socket.c:1961 [inline]
       __se_sys_getsockname net/socket.c:1958 [inline]
       __x64_sys_getsockname+0x3e/0x50 net/socket.c:1958
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x44/0xa0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      value changed: 0x0000 -> 0xdee0
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 0 PID: 14408 Comm: syz-executor.3 Not tainted 5.15.0-rc3-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Link: https://lore.kernel.org/r/20211026213014.3026708-1-eric.dumazet@gmail.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>
      9dfc685e
  14. 30 9月, 2021 2 次提交
    • E
      net: snmp: inline snmp_get_cpu_field() · 59f09ae8
      Eric Dumazet 提交于
      This trivial function is called ~90,000 times on 256 cpus hosts,
      when reading /proc/net/netstat. And this number keeps inflating.
      
      Inlining it saves many cycles.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      59f09ae8
    • W
      net: add new socket option SO_RESERVE_MEM · 2bb2f5fb
      Wei Wang 提交于
      This socket option provides a mechanism for users to reserve a certain
      amount of memory for the socket to use. When this option is set, kernel
      charges the user specified amount of memory to memcg, as well as
      sk_forward_alloc. This amount of memory is not reclaimable and is
      available in sk_forward_alloc for this socket.
      With this socket option set, the networking stack spends less cycles
      doing forward alloc and reclaim, which should lead to better system
      performance, with the cost of an amount of pre-allocated and
      unreclaimable memory, even under memory pressure.
      
      Note:
      This socket option is only available when memory cgroup is enabled and we
      require this reserved memory to be charged to the user's memcg. We hope
      this could avoid mis-behaving users to abused this feature to reserve a
      large amount on certain sockets and cause unfairness for others.
      Signed-off-by: NWei Wang <weiwan@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2bb2f5fb
  15. 23 9月, 2021 1 次提交
  16. 24 8月, 2021 1 次提交
    • D
      bpf: Migrate cgroup_bpf to internal cgroup_bpf_attach_type enum · 6fc88c35
      Dave Marchevsky 提交于
      Add an enum (cgroup_bpf_attach_type) containing only valid cgroup_bpf
      attach types and a function to map bpf_attach_type values to the new
      enum. Inspired by netns_bpf_attach_type.
      
      Then, migrate cgroup_bpf to use cgroup_bpf_attach_type wherever
      possible.  Functionality is unchanged as attach_type_to_prog_type
      switches in bpf/syscall.c were preventing non-cgroup programs from
      making use of the invalid cgroup_bpf array slots.
      
      As a result struct cgroup_bpf uses 504 fewer bytes relative to when its
      arrays were sized using MAX_BPF_ATTACH_TYPE.
      
      bpf_cgroup_storage is notably not migrated as struct
      bpf_cgroup_storage_key is part of uapi and contains a bpf_attach_type
      member which is not meant to be opaque. Similarly, bpf_cgroup_link
      continues to report its bpf_attach_type member to userspace via fdinfo
      and bpf_link_info.
      
      To ease disambiguation, bpf_attach_type variables are renamed from
      'type' to 'atype' when changed to cgroup_bpf_attach_type.
      Signed-off-by: NDave Marchevsky <davemarchevsky@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20210819092420.1984861-2-davemarchevsky@fb.com
      6fc88c35
  17. 23 7月, 2021 1 次提交
    • A
      net: socket: rework compat_ifreq_ioctl() · 29c49648
      Arnd Bergmann 提交于
      compat_ifreq_ioctl() is one of the last users of copy_in_user() and
      compat_alloc_user_space(), as it attempts to convert the 'struct ifreq'
      arguments from 32-bit to 64-bit format as used by dev_ioctl() and a
      couple of socket family specific interpretations.
      
      The current implementation works correctly when calling dev_ioctl(),
      inet_ioctl(), ieee802154_sock_ioctl(), atalk_ioctl(), qrtr_ioctl()
      and packet_ioctl(). The ioctl handlers for x25, netrom, rose and x25 do
      not interpret the arguments and only block the corresponding commands,
      so they do not care.
      
      For af_inet6 and af_decnet however, the compat conversion is slightly
      incorrect, as it will copy more data than the native handler accesses,
      both of them use a structure that is shorter than ifreq.
      
      Replace the copy_in_user() conversion with a pair of accessor functions
      to read and write the ifreq data in place with the correct length where
      needed, while leaving the other ones to copy the (already compatible)
      structures directly.
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      29c49648
  18. 10 6月, 2021 1 次提交
    • E
      inet: annotate data race in inet_send_prepare() and inet_dgram_connect() · dcd01eea
      Eric Dumazet 提交于
      Both functions are known to be racy when reading inet_num
      as we do not want to grab locks for the common case the socket
      has been bound already. The race is resolved in inet_autobind()
      by reading again inet_num under the socket lock.
      
      syzbot reported:
      BUG: KCSAN: data-race in inet_send_prepare / udp_lib_get_port
      
      write to 0xffff88812cba150e of 2 bytes by task 24135 on cpu 0:
       udp_lib_get_port+0x4b2/0xe20 net/ipv4/udp.c:308
       udp_v6_get_port+0x5e/0x70 net/ipv6/udp.c:89
       inet_autobind net/ipv4/af_inet.c:183 [inline]
       inet_send_prepare+0xd0/0x210 net/ipv4/af_inet.c:807
       inet6_sendmsg+0x29/0x80 net/ipv6/af_inet6.c:639
       sock_sendmsg_nosec net/socket.c:654 [inline]
       sock_sendmsg net/socket.c:674 [inline]
       ____sys_sendmsg+0x360/0x4d0 net/socket.c:2350
       ___sys_sendmsg net/socket.c:2404 [inline]
       __sys_sendmmsg+0x315/0x4b0 net/socket.c:2490
       __do_sys_sendmmsg net/socket.c:2519 [inline]
       __se_sys_sendmmsg net/socket.c:2516 [inline]
       __x64_sys_sendmmsg+0x53/0x60 net/socket.c:2516
       do_syscall_64+0x4a/0x90 arch/x86/entry/common.c:47
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      read to 0xffff88812cba150e of 2 bytes by task 24132 on cpu 1:
       inet_send_prepare+0x21/0x210 net/ipv4/af_inet.c:806
       inet6_sendmsg+0x29/0x80 net/ipv6/af_inet6.c:639
       sock_sendmsg_nosec net/socket.c:654 [inline]
       sock_sendmsg net/socket.c:674 [inline]
       ____sys_sendmsg+0x360/0x4d0 net/socket.c:2350
       ___sys_sendmsg net/socket.c:2404 [inline]
       __sys_sendmmsg+0x315/0x4b0 net/socket.c:2490
       __do_sys_sendmmsg net/socket.c:2519 [inline]
       __se_sys_sendmmsg net/socket.c:2516 [inline]
       __x64_sys_sendmmsg+0x53/0x60 net/socket.c:2516
       do_syscall_64+0x4a/0x90 arch/x86/entry/common.c:47
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      value changed: 0x0000 -> 0x9db4
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 1 PID: 24132 Comm: syz-executor.2 Not tainted 5.13.0-rc4-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      dcd01eea
  19. 02 6月, 2021 1 次提交
  20. 18 5月, 2021 1 次提交
  21. 02 4月, 2021 1 次提交
  22. 24 2月, 2021 1 次提交
  23. 03 2月, 2021 2 次提交
  24. 28 1月, 2021 1 次提交
  25. 21 1月, 2021 1 次提交
  26. 03 12月, 2020 1 次提交
  27. 25 8月, 2020 1 次提交
  28. 20 7月, 2020 1 次提交
  29. 08 7月, 2020 1 次提交
  30. 24 6月, 2020 2 次提交
    • E
      udp: move gro declarations to net/udp.h · 6db69328
      Eric Dumazet 提交于
      This removes following warnings :
        CC      net/ipv4/udp_offload.o
      net/ipv4/udp_offload.c:504:17: warning: no previous prototype for 'udp4_gro_receive' [-Wmissing-prototypes]
        504 | struct sk_buff *udp4_gro_receive(struct list_head *head, struct sk_buff *skb)
            |                 ^~~~~~~~~~~~~~~~
      net/ipv4/udp_offload.c:584:29: warning: no previous prototype for 'udp4_gro_complete' [-Wmissing-prototypes]
        584 | INDIRECT_CALLABLE_SCOPE int udp4_gro_complete(struct sk_buff *skb, int nhoff)
            |                             ^~~~~~~~~~~~~~~~~
      
        CHECK   net/ipv6/udp_offload.c
      net/ipv6/udp_offload.c:115:16: warning: symbol 'udp6_gro_receive' was not declared. Should it be static?
      net/ipv6/udp_offload.c:148:29: warning: symbol 'udp6_gro_complete' was not declared. Should it be static?
        CC      net/ipv6/udp_offload.o
      net/ipv6/udp_offload.c:115:17: warning: no previous prototype for 'udp6_gro_receive' [-Wmissing-prototypes]
        115 | struct sk_buff *udp6_gro_receive(struct list_head *head, struct sk_buff *skb)
            |                 ^~~~~~~~~~~~~~~~
      net/ipv6/udp_offload.c:148:29: warning: no previous prototype for 'udp6_gro_complete' [-Wmissing-prototypes]
        148 | INDIRECT_CALLABLE_SCOPE int udp6_gro_complete(struct sk_buff *skb, int nhoff)
            |                             ^~~~~~~~~~~~~~~~~
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6db69328
    • E
      net: move tcp gro declarations to net/tcp.h · 5521d95e
      Eric Dumazet 提交于
      This patch removes following (C=1 W=1) warnings for CONFIG_RETPOLINE=y :
      
      net/ipv4/tcp_offload.c:306:16: warning: symbol 'tcp4_gro_receive' was not declared. Should it be static?
      net/ipv4/tcp_offload.c:306:17: warning: no previous prototype for 'tcp4_gro_receive' [-Wmissing-prototypes]
      net/ipv4/tcp_offload.c:319:29: warning: symbol 'tcp4_gro_complete' was not declared. Should it be static?
      net/ipv4/tcp_offload.c:319:29: warning: no previous prototype for 'tcp4_gro_complete' [-Wmissing-prototypes]
        CHECK   net/ipv6/tcpv6_offload.c
      net/ipv6/tcpv6_offload.c:16:16: warning: symbol 'tcp6_gro_receive' was not declared. Should it be static?
      net/ipv6/tcpv6_offload.c:29:29: warning: symbol 'tcp6_gro_complete' was not declared. Should it be static?
        CC      net/ipv6/tcpv6_offload.o
      net/ipv6/tcpv6_offload.c:16:17: warning: no previous prototype for 'tcp6_gro_receive' [-Wmissing-prototypes]
         16 | struct sk_buff *tcp6_gro_receive(struct list_head *head, struct sk_buff *skb)
            |                 ^~~~~~~~~~~~~~~~
      net/ipv6/tcpv6_offload.c:29:29: warning: no previous prototype for 'tcp6_gro_complete' [-Wmissing-prototypes]
         29 | INDIRECT_CALLABLE_SCOPE int tcp6_gro_complete(struct sk_buff *skb, int thoff)
            |                             ^~~~~~~~~~~~~~~~~
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5521d95e
  31. 20 5月, 2020 1 次提交
    • D
      bpf: Add get{peer, sock}name attach types for sock_addr · 1b66d253
      Daniel Borkmann 提交于
      As stated in 983695fa ("bpf: fix unconnected udp hooks"), the objective
      for the existing cgroup connect/sendmsg/recvmsg/bind BPF hooks is to be
      transparent to applications. In Cilium we make use of these hooks [0] in
      order to enable E-W load balancing for existing Kubernetes service types
      for all Cilium managed nodes in the cluster. Those backends can be local
      or remote. The main advantage of this approach is that it operates as close
      as possible to the socket, and therefore allows to avoid packet-based NAT
      given in connect/sendmsg/recvmsg hooks we only need to xlate sock addresses.
      
      This also allows to expose NodePort services on loopback addresses in the
      host namespace, for example. As another advantage, this also efficiently
      blocks bind requests for applications in the host namespace for exposed
      ports. However, one missing item is that we also need to perform reverse
      xlation for inet{,6}_getname() hooks such that we can return the service
      IP/port tuple back to the application instead of the remote peer address.
      
      The vast majority of applications does not bother about getpeername(), but
      in a few occasions we've seen breakage when validating the peer's address
      since it returns unexpectedly the backend tuple instead of the service one.
      Therefore, this trivial patch allows to customise and adds a getpeername()
      as well as getsockname() BPF cgroup hook for both IPv4 and IPv6 in order
      to address this situation.
      
      Simple example:
      
        # ./cilium/cilium service list
        ID   Frontend     Service Type   Backend
        1    1.2.3.4:80   ClusterIP      1 => 10.0.0.10:80
      
      Before; curl's verbose output example, no getpeername() reverse xlation:
      
        # curl --verbose 1.2.3.4
        * Rebuilt URL to: 1.2.3.4/
        *   Trying 1.2.3.4...
        * TCP_NODELAY set
        * Connected to 1.2.3.4 (10.0.0.10) port 80 (#0)
        > GET / HTTP/1.1
        > Host: 1.2.3.4
        > User-Agent: curl/7.58.0
        > Accept: */*
        [...]
      
      After; with getpeername() reverse xlation:
      
        # curl --verbose 1.2.3.4
        * Rebuilt URL to: 1.2.3.4/
        *   Trying 1.2.3.4...
        * TCP_NODELAY set
        * Connected to 1.2.3.4 (1.2.3.4) port 80 (#0)
        > GET / HTTP/1.1
        >  Host: 1.2.3.4
        > User-Agent: curl/7.58.0
        > Accept: */*
        [...]
      
      Originally, I had both under a BPF_CGROUP_INET{4,6}_GETNAME type and exposed
      peer to the context similar as in inet{,6}_getname() fashion, but API-wise
      this is suboptimal as it always enforces programs having to test for ctx->peer
      which can easily be missed, hence BPF_CGROUP_INET{4,6}_GET{PEER,SOCK}NAME split.
      Similarly, the checked return code is on tnum_range(1, 1), but if a use case
      comes up in future, it can easily be changed to return an error code instead.
      Helper and ctx member access is the same as with connect/sendmsg/etc hooks.
      
        [0] https://github.com/cilium/cilium/blob/master/bpf/bpf_sock.cSigned-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NAndrii Nakryiko <andriin@fb.com>
      Acked-by: NAndrey Ignatov <rdna@fb.com>
      Link: https://lore.kernel.org/bpf/61a479d759b2482ae3efb45546490bacd796a220.1589841594.git.daniel@iogearbox.net
      1b66d253
  32. 19 5月, 2020 1 次提交
  33. 09 5月, 2020 2 次提交
  34. 29 4月, 2020 1 次提交
    • R
      net: ipv4: add sysctl for nexthop api compatibility mode · 4f80116d
      Roopa Prabhu 提交于
      Current route nexthop API maintains user space compatibility
      with old route API by default. Dumps and netlink notifications
      support both new and old API format. In systems which have
      moved to the new API, this compatibility mode cancels some
      of the performance benefits provided by the new nexthop API.
      
      This patch adds new sysctl nexthop_compat_mode which is on
      by default but provides the ability to turn off compatibility
      mode allowing systems to run entirely with the new routing
      API. Old route API behaviour and support is not modified by this
      sysctl.
      
      Uses a single sysctl to cover both ipv4 and ipv6 following
      other sysctls. Covers dumps and delete notifications as
      suggested by David Ahern.
      Signed-off-by: NRoopa Prabhu <roopa@cumulusnetworks.com>
      Reviewed-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4f80116d