- 20 7月, 2021 1 次提交
-
-
由 Vasily Averin 提交于
An netadmin inside container can use 'ip a a' and 'ip r a' to assign a large number of ipv4/ipv6 addresses and routing entries and force kernel to allocate megabytes of unaccounted memory for long-lived per-netdevice related kernel objects: 'struct in_ifaddr', 'struct inet6_ifaddr', 'struct fib6_node', 'struct rt6_info', 'struct fib_rules' and ip_fib caches. These objects can be manually removed, though usually they lives in memory till destroy of its net namespace. It makes sense to account for them to restrict the host's memory consumption from inside the memcg-limited container. One of such objects is the 'struct fib6_node' mostly allocated in net/ipv6/route.c::__ip6_ins_rt() inside the lock_bh()/unlock_bh() section: write_lock_bh(&table->tb6_lock); err = fib6_add(&table->tb6_root, rt, info, mxc); write_unlock_bh(&table->tb6_lock); In this case it is not enough to simply add SLAB_ACCOUNT to corresponding kmem cache. The proper memory cgroup still cannot be found due to the incorrect 'in_interrupt()' check used in memcg_kmem_bypass(). Obsoleted in_interrupt() does not describe real execution context properly. >From include/linux/preempt.h: The following macros are deprecated and should not be used in new code: in_interrupt() - We're in NMI,IRQ,SoftIRQ context or have BH disabled To verify the current execution context new macro should be used instead: in_task() - We're in task context Signed-off-by: NVasily Averin <vvs@virtuozzo.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 18 7月, 2021 1 次提交
-
-
由 Liu Jian 提交于
I got below panic when doing fuzz test: Kernel panic - not syncing: panic_on_warn set ... CPU: 0 PID: 4056 Comm: syz-executor.3 Tainted: G B 5.14.0-rc1-00195-gcff5c4254439-dirty #2 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014 Call Trace: dump_stack_lvl+0x7a/0x9b panic+0x2cd/0x5af end_report.cold+0x5a/0x5a kasan_report+0xec/0x110 ip_check_mc_rcu+0x556/0x5d0 __mkroute_output+0x895/0x1740 ip_route_output_key_hash_rcu+0x2d0/0x1050 ip_route_output_key_hash+0x182/0x2e0 ip_route_output_flow+0x28/0x130 udp_sendmsg+0x165d/0x2280 udpv6_sendmsg+0x121e/0x24f0 inet6_sendmsg+0xf7/0x140 sock_sendmsg+0xe9/0x180 ____sys_sendmsg+0x2b8/0x7a0 ___sys_sendmsg+0xf0/0x160 __sys_sendmmsg+0x17e/0x3c0 __x64_sys_sendmmsg+0x9e/0x100 do_syscall_64+0x3b/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x462eb9 Code: f7 d8 64 89 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007f3df5af1c58 EFLAGS: 00000246 ORIG_RAX: 0000000000000133 RAX: ffffffffffffffda RBX: 000000000073bf00 RCX: 0000000000462eb9 RDX: 0000000000000312 RSI: 0000000020001700 RDI: 0000000000000007 RBP: 0000000000000004 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 00007f3df5af26bc R13: 00000000004c372d R14: 0000000000700b10 R15: 00000000ffffffff It is one use-after-free in ip_check_mc_rcu. In ip_mc_del_src, the ip_sf_list of pmc has been freed under pmc->lock protection. But access to ip_sf_list in ip_check_mc_rcu is not protected by the lock. Signed-off-by: NLiu Jian <liujian56@huawei.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 16 7月, 2021 1 次提交
-
-
由 Cong Wang 提交于
TCP and other connection oriented sockets have accept() for each incoming connection on the server side, hence they can just insert those fd's from accept() to sockmap, which are of course established. Now with datagram sockets begin to support sockmap and redirection, the restriction is no longer applicable to them, as they have no accept(). So we have to lift this restriction for them. This is fine, because inside bpf_sk_redirect_map() we still have another socket status check, sock_map_redirect_allowed(), as a guard. This also means they do not have to be removed from sockmap when disconnecting. Signed-off-by: NCong Wang <cong.wang@bytedance.com> Signed-off-by: NAlexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20210704190252.11866-3-xiyou.wangcong@gmail.com
-
- 14 7月, 2021 1 次提交
-
-
由 Yajun Deng 提交于
It has 'if (err >0 )' statement in nlmsg_unicast(), so use nlmsg_unicast() instead of netlink_unicast(), this looks more concise. v2: remove the change in netfilter. Signed-off-by: NYajun Deng <yajun.deng@linux.dev> Reviewed-by: NDavid Ahern <dsahern@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 10 7月, 2021 3 次提交
-
-
由 Jianguo Wu 提交于
If check_fully_established() causes a subflow reset, it should not continue to process the packet in tcp_data_queue(). Add a return value to mptcp_incoming_options(), and return false if a subflow has been reset, else return true. Then drop the packet in tcp_data_queue()/tcp_rcv_state_process() if mptcp_incoming_options() return false. Fixes: d5824847 ("mptcp: fix fallback for MP_JOIN subflows") Signed-off-by: NJianguo Wu <wujianguo@chinatelecom.cn> Signed-off-by: NMat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Hangbin Liu 提交于
Commit 28e104d0 ("net: ip_tunnel: fix mtu calculation") removed dev->hard_header_len subtraction when calculate MTU for tunnel devices as there is an overhead for device that has header_ops. But there are ETHER tunnel devices, like gre_tap or erspan, which don't have header_ops but set dev->hard_header_len during setup. This makes pkts greater than (MTU - ETH_HLEN) could not be xmited. Fix it by subtracting the ETHER tunnel devices' dev->hard_header_len for MTU calculation. Fixes: 28e104d0 ("net: ip_tunnel: fix mtu calculation") Reported-by: NJianlin Shi <jishi@redhat.com> Signed-off-by: NHangbin Liu <liuhangbin@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Talal Ahmad 提交于
sk_wmem_schedule makes sure that sk_forward_alloc has enough bytes for charging that is going to be done by sk_mem_charge. In the transmit zerocopy path, there is sk_mem_charge but there was no call to sk_wmem_schedule. This change adds that call. Without this call to sk_wmem_schedule, sk_forward_alloc can go negetive which is a bug because sk_forward_alloc is a per-socket space that has been forward charged so this can't be negative. Fixes: f214f915 ("tcp: enable MSG_ZEROCOPY") Signed-off-by: NTalal Ahmad <talalahmad@google.com> Reviewed-by: NWillem de Bruijn <willemb@google.com> Reviewed-by: NWei Wang <weiwan@google.com> Reviewed-by: NSoheil Hassas Yeganeh <soheil@google.com> Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 09 7月, 2021 1 次提交
-
-
由 Eric Dumazet 提交于
While TCP stack scales reasonably well, there is still one part that can be used to DDOS it. IPv6 Packet too big messages have to lookup/insert a new route, and if abused by attackers, can easily put hosts under high stress, with many cpus contending on a spinlock while one is stuck in fib6_run_gc() ip6_protocol_deliver_rcu() icmpv6_rcv() icmpv6_notify() tcp_v6_err() tcp_v6_mtu_reduced() inet6_csk_update_pmtu() ip6_rt_update_pmtu() __ip6_rt_update_pmtu() ip6_rt_cache_alloc() ip6_dst_alloc() dst_alloc() ip6_dst_gc() fib6_run_gc() spin_lock_bh() ... Some of our servers have been hit by malicious ICMPv6 packets trying to _increase_ the MTU/MSS of TCP flows. We believe these ICMPv6 packets are a result of a bug in one ISP stack, since they were blindly sent back for _every_ (small) packet sent to them. These packets are for one TCP flow: 09:24:36.266491 IP6 Addr1 > Victim ICMP6, packet too big, mtu 1460, length 1240 09:24:36.266509 IP6 Addr1 > Victim ICMP6, packet too big, mtu 1460, length 1240 09:24:36.316688 IP6 Addr1 > Victim ICMP6, packet too big, mtu 1460, length 1240 09:24:36.316704 IP6 Addr1 > Victim ICMP6, packet too big, mtu 1460, length 1240 09:24:36.608151 IP6 Addr1 > Victim ICMP6, packet too big, mtu 1460, length 1240 TCP stack can filter some silly requests : 1) MTU below IPV6_MIN_MTU can be filtered early in tcp_v6_err() 2) tcp_v6_mtu_reduced() can drop requests trying to increase current MSS. This tests happen before the IPv6 routing stack is entered, thus removing the potential contention and route exhaustion. Note that IPv6 stack was performing these checks, but too late (ie : after the route has been added, and after the potential garbage collect war) v2: fix typo caught by Martin, thanks ! v3: exports tcp_mtu_to_mss(), caught by David, thanks ! Fixes: 1da177e4 ("Linux-2.6.12-rc2") Signed-off-by: NEric Dumazet <edumazet@google.com> Reviewed-by: NMaciej Żenczykowski <maze@google.com> Cc: Martin KaFai Lau <kafai@fb.com> Acked-by: NMartin KaFai Lau <kafai@fb.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 08 7月, 2021 1 次提交
-
-
由 Roy, UjjaL 提交于
Fixed indentation by removing extra spaces. Signed-off-by: NRoy, UjjaL <royujjal@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 07 7月, 2021 1 次提交
-
-
由 Nguyen Dinh Phi 提交于
This commit fixes a bug (found by syzkaller) that could cause spurious double-initializations for congestion control modules, which could cause memory leaks or other problems for congestion control modules (like CDG) that allocate memory in their init functions. The buggy scenario constructed by syzkaller was something like: (1) create a TCP socket (2) initiate a TFO connect via sendto() (3) while socket is in TCP_SYN_SENT, call setsockopt(TCP_CONGESTION), which calls: tcp_set_congestion_control() -> tcp_reinit_congestion_control() -> tcp_init_congestion_control() (4) receive ACK, connection is established, call tcp_init_transfer(), set icsk_ca_initialized=0 (without first calling cc->release()), call tcp_init_congestion_control() again. Note that in this sequence tcp_init_congestion_control() is called twice without a cc->release() call in between. Thus, for CC modules that allocate memory in their init() function, e.g, CDG, a memory leak may occur. The syzkaller tool managed to find a reproducer that triggered such a leak in CDG. The bug was introduced when that commit 8919a9b3 ("tcp: Only init congestion control if not initialized already") introduced icsk_ca_initialized and set icsk_ca_initialized to 0 in tcp_init_transfer(), missing the possibility for a sequence like the one above, where a process could call setsockopt(TCP_CONGESTION) in state TCP_SYN_SENT (i.e. after the connect() or TFO open sendmsg()), which would call tcp_init_congestion_control(). It did not intend to reset any initialization that the user had already explicitly made; it just missed the possibility of that particular sequence (which syzkaller managed to find). Fixes: 8919a9b3 ("tcp: Only init congestion control if not initialized already") Reported-by: syzbot+f1e24a0594d4e3a895d3@syzkaller.appspotmail.com Signed-off-by: NNguyen Dinh Phi <phind.uet@gmail.com> Acked-by: NNeal Cardwell <ncardwell@google.com> Tested-by: NNeal Cardwell <ncardwell@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 03 7月, 2021 2 次提交
-
-
由 Paolo Abeni 提交于
If an UDP packet enters the GRO engine but is not eligible for aggregation and is not targeting an UDP tunnel, udp_gro_receive() will not set the flush bit, and packet could delayed till the next napi flush. Fix the issue ensuring non GROed packets traverse skb_gro_flush_final(). Reported-and-tested-by: NMatthias Treydte <mt@waldheinz.de> Fixes: 18f25dc3 ("udp: skip L4 aggregation for UDP tunnel packets") Signed-off-by: NPaolo Abeni <pabeni@redhat.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
While tp->mtu_info is read while socket is owned, the write sides happen from err handlers (tcp_v[46]_mtu_reduced) which only own the socket spinlock. Fixes: 563d34d0 ("tcp: dont drop MTU reduction indications") Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 02 7月, 2021 1 次提交
-
-
由 Eric Dumazet 提交于
Accesses to unix_sk(sk)->gso_size are lockless. Add READ_ONCE()/WRITE_ONCE() around them. BUG: KCSAN: data-race in udp_lib_setsockopt / udpv6_sendmsg write to 0xffff88812d78f47c of 2 bytes by task 10849 on cpu 1: udp_lib_setsockopt+0x3b3/0x710 net/ipv4/udp.c:2696 udpv6_setsockopt+0x63/0x90 net/ipv6/udp.c:1630 sock_common_setsockopt+0x5d/0x70 net/core/sock.c:3265 __sys_setsockopt+0x18f/0x200 net/socket.c:2104 __do_sys_setsockopt net/socket.c:2115 [inline] __se_sys_setsockopt net/socket.c:2112 [inline] __x64_sys_setsockopt+0x62/0x70 net/socket.c:2112 do_syscall_64+0x4a/0x90 arch/x86/entry/common.c:47 entry_SYSCALL_64_after_hwframe+0x44/0xae read to 0xffff88812d78f47c of 2 bytes by task 10852 on cpu 0: udpv6_sendmsg+0x161/0x16b0 net/ipv6/udp.c:1299 inet6_sendmsg+0x5f/0x80 net/ipv6/af_inet6.c:642 sock_sendmsg_nosec net/socket.c:654 [inline] sock_sendmsg net/socket.c:674 [inline] ____sys_sendmsg+0x360/0x4d0 net/socket.c:2337 ___sys_sendmsg net/socket.c:2391 [inline] __sys_sendmmsg+0x315/0x4b0 net/socket.c:2477 __do_sys_sendmmsg net/socket.c:2506 [inline] __se_sys_sendmmsg net/socket.c:2503 [inline] __x64_sys_sendmmsg+0x53/0x60 net/socket.c:2503 do_syscall_64+0x4a/0x90 arch/x86/entry/common.c:47 entry_SYSCALL_64_after_hwframe+0x44/0xae value changed: 0x0000 -> 0x0005 Reported by Kernel Concurrency Sanitizer on: CPU: 0 PID: 10852 Comm: syz-executor.0 Not tainted 5.13.0-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Fixes: bec1f6f6 ("udp: generate gso with UDP_SEGMENT") Signed-off-by: NEric Dumazet <edumazet@google.com> Cc: Willem de Bruijn <willemb@google.com> Reported-by: Nsyzbot <syzkaller@googlegroups.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 30 6月, 2021 3 次提交
-
-
由 Eric Dumazet 提交于
Compiler can perform the sanity check instead of waiting to load the module and crash the host. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Alexander Aring 提交于
This patch introduces a function wrapper to call the sk_error_report callback. That will prepare to add additional handling whenever sk_error_report is called, for example to trace socket errors. Signed-off-by: NAlexander Aring <aahringo@redhat.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Liam Howlett 提交于
Use vma_lookup() to find the VMA at a specific address. As vma_lookup() will return NULL if the address is not within any VMA, the start address no longer needs to be validated. Link: https://lkml.kernel.org/r/20210521174745.2219620-13-Liam.Howlett@Oracle.comSigned-off-by: NLiam R. Howlett <Liam.Howlett@Oracle.com> Reviewed-by: NLaurent Dufour <ldufour@linux.ibm.com> Acked-by: NDavid Hildenbrand <david@redhat.com> Acked-by: NDavidlohr Bueso <dbueso@suse.de> Cc: David Miller <davem@davemloft.net> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 29 6月, 2021 4 次提交
-
-
由 Andreas Roeseler 提交于
This patch builds off of commit 2b246b25 and adds functionality to respond to ICMPV6 PROBE requests. Add icmp_build_probe function to construct PROBE requests for both ICMPV4 and ICMPV6. Modify icmpv6_rcv to detect ICMPV6 PROBE messages and call the icmpv6_echo_reply handler. Modify icmpv6_echo_reply to build a PROBE response message based on the queried interface. This patch has been tested using a branch of the iputils git repo which can be found here: https://github.com/Juniper-Clinic-2020/iputils/tree/probe-requestSigned-off-by: NAndreas Roeseler <andreas.a.roeseler@gmail.com> Reviewed-by: NWillem de Bruijn <willemb@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Guillaume Nault 提交于
Commit e271c7b4 ("gre: do not keep the GRE header around in collect medata mode") did reset the mac_header for the collect_md case. Let's extend this behaviour to classical gre devices as well. ipgre_header_parse() seems to be the only case that requires mac_header to point to the outer header. We can detect this case accurately by checking ->header_ops. For all other cases, we can reset mac_header. This allows to push an Ethernet header to ipgre packets and redirect them to an Ethernet device: $ tc filter add dev gre0 ingress matchall \ action vlan push_eth dst_mac 00:00:5e:00:53:01 \ src_mac 00:00:5e:00:53:00 \ action mirred egress redirect dev eth0 Before this patch, this worked only for collect_md gre devices. Now this works for regular gre devices as well. Only the special case of gre devices that use ipgre_header_ops isn't supported. Signed-off-by: NGuillaume Nault <gnault@redhat.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Guillaume Nault 提交于
Even though ipip transports IPv4 or MPLS packets, it needs to reset the mac_header pointer, so that other parts of the stack don't mistakenly access the outer header after the packet has been decapsulated. This allows to push an Ethernet header to ipip or mplsip packets and redirect them to an Ethernet device: $ tc filter add dev ipip0 ingress matchall \ action vlan push_eth dst_mac 00:00:5e:00:53:01 \ src_mac 00:00:5e:00:53:00 \ action mirred egress redirect dev eth0 Without this patch, push_eth refuses to add an ethernet header because the skb appears to already have a MAC header. Signed-off-by: NGuillaume Nault <gnault@redhat.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Vadim Fedorenko 提交于
Commit 14972cbd ("net: lwtunnel: Handle fragmentation") moved fragmentation logic away from lwtunnel by carry encap headroom and use it in output MTU calculation. But the forwarding part was not covered and created difference in MTU for output and forwarding and further to silent drops on ipv4 forwarding path. Fix it by taking into account lwtunnel encap headroom. The same commit also introduced difference in how to treat RTAX_MTU in IPv4 and IPv6 where latter explicitly removes lwtunnel encap headroom from route MTU. Make IPv4 version do the same. Fixes: 14972cbd ("net: lwtunnel: Handle fragmentation") Suggested-by: NDavid Ahern <dsahern@gmail.com> Signed-off-by: NVadim Fedorenko <vfedorenko@novek.ru> Reviewed-by: NDavid Ahern <dsahern@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 25 6月, 2021 1 次提交
-
-
由 Jakub Kicinski 提交于
Dave observed number of machines hitting OOM on the UDP send path. The workload seems to be sending large UDP packets over loopback. Since loopback has MTU of 64k kernel will try to allocate an skb with up to 64k of head space. This has a good chance of failing under memory pressure. What's worse if the message length is <32k the allocation may trigger an OOM killer. This is entirely avoidable, we can use an skb with page frags. af_unix solves a similar problem by limiting the head length to SKB_MAX_ALLOC. This seems like a good and simple approach. It means that UDP messages > 16kB will now use fragments if underlying device supports SG, if extra allocator pressure causes regressions in real workloads we can switch to trying the large allocation first and falling back. v4: pre-calculate all the additions to alloclen so we can be sure it won't go over order-2 Reported-by: NDave Jones <dsj@fb.com> Signed-off-by: NJakub Kicinski <kuba@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 24 6月, 2021 1 次提交
-
-
由 Kuniyuki Iwashima 提交于
This commit adds two stats for the socket migration feature to evaluate the effectiveness: LINUX_MIB_TCPMIGRATEREQ(SUCCESS|FAILURE). If the migration fails because of the own_req race in receiving ACK and sending SYN+ACK paths, we do not increment the failure stat. Then another CPU is responsible for the req. Link: https://lore.kernel.org/bpf/CAK6E8=cgFKuGecTzSCSQ8z3YJ_163C0uwO9yRvfDSE7vOe9mJA@mail.gmail.com/Suggested-by: NYuchung Cheng <ycheng@google.com> Signed-off-by: NKuniyuki Iwashima <kuniyu@amazon.co.jp> Acked-by: NYuchung Cheng <ycheng@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 23 6月, 2021 1 次提交
-
-
由 Miao Wang 提交于
When doing source address validation, the flowi4 struct used for fib_lookup should be in the reverse direction to the given skb. fl4_dport and fl4_sport returned by fib4_rules_early_flow_dissect should thus be swapped. Fixes: 5a847a6e ("net/ipv4: Initialize proto and ports in flow struct") Signed-off-by: NMiao Wang <shankerwangmiao@gmail.com> Reviewed-by: NDavid Ahern <dsahern@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 21 6月, 2021 2 次提交
-
-
由 Cong Wang 提交于
sk_psock_verdict_recv() clones the skb and uses the clone afterward, so udp_read_sock() should free the skb after using it, regardless of error or not. This fixes a real kmemleak. Fixes: d7f57118 ("udp: Implement ->read_sock() for sockmap") Signed-off-by: NCong Wang <cong.wang@bytedance.com> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Acked-by: NJohn Fastabend <john.fastabend@gmail.com> Acked-by: NJakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/bpf/20210615021342.7416-4-xiyou.wangcong@gmail.com
-
由 Cong Wang 提交于
I tried to reuse sk_msg_wait_data() for different protocols, but it turns out it can not be simply reused. For example, UDP actually uses two queues to receive skb: udp_sk(sk)->reader_queue and sk->sk_receive_queue. So we have to check both of them to know whether we have received any packet. Also, UDP does not lock the sock during BH Rx path, it makes no sense for its ->recvmsg() to lock the sock. It is always possible for ->recvmsg() to be called before packets actually arrive in the receive queue, we just use best effort to make it accurate here. Fixes: 1f5be6b3 ("udp: Implement udp_bpf_recvmsg() for sockmap") Signed-off-by: NCong Wang <cong.wang@bytedance.com> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Acked-by: NJohn Fastabend <john.fastabend@gmail.com> Acked-by: NJakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/bpf/20210615021342.7416-2-xiyou.wangcong@gmail.com
-
- 19 6月, 2021 1 次提交
-
-
由 Toke Høiland-Jørgensen 提交于
When constructing ICMP response messages, the kernel will try to pick a suitable source address for the outgoing packet. However, if no IPv4 addresses are configured on the system at all, this will fail and we end up producing an ICMP message with a source address of 0.0.0.0. This can happen on a box routing IPv4 traffic via v6 nexthops, for instance. Since 0.0.0.0 is not generally routable on the internet, there's a good chance that such ICMP messages will never make it back to the sender of the original packet that the ICMP message was sent in response to. This, in turn, can create connectivity and PMTUd problems for senders. Fortunately, RFC7600 reserves a dummy address to be used as a source for ICMP messages (192.0.0.8/32), so let's teach the kernel to substitute that address as a last resort if the regular source address selection procedure fails. Below is a quick example reproducing this issue with network namespaces: ip netns add ns0 ip l add type veth peer netns ns0 ip l set dev veth0 up ip a add 10.0.0.1/24 dev veth0 ip a add fc00:dead:cafe:42::1/64 dev veth0 ip r add 10.1.0.0/24 via inet6 fc00:dead:cafe:42::2 ip -n ns0 l set dev veth0 up ip -n ns0 a add fc00:dead:cafe:42::2/64 dev veth0 ip -n ns0 r add 10.0.0.0/24 via inet6 fc00:dead:cafe:42::1 ip netns exec ns0 sysctl -w net.ipv4.icmp_ratelimit=0 ip netns exec ns0 sysctl -w net.ipv4.ip_forward=1 tcpdump -tpni veth0 -c 2 icmp & ping -w 1 10.1.0.1 > /dev/null tcpdump: verbose output suppressed, use -v[v]... for full protocol decode listening on veth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes IP 10.0.0.1 > 10.1.0.1: ICMP echo request, id 29, seq 1, length 64 IP 0.0.0.0 > 10.0.0.1: ICMP net 10.1.0.1 unreachable, length 92 2 packets captured 2 packets received by filter 0 packets dropped by kernel With this patch the above capture changes to: IP 10.0.0.1 > 10.1.0.1: ICMP echo request, id 31127, seq 1, length 64 IP 192.0.0.8 > 10.0.0.1: ICMP net 10.1.0.1 unreachable, length 92 Fixes: 1da177e4 ("Linux-2.6.12-rc2") Reported-by: NJuliusz Chroboczek <jch@irif.fr> Reviewed-by: NDavid Ahern <dsahern@kernel.org> Signed-off-by: NToke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 17 6月, 2021 1 次提交
-
-
由 Chengyang Fan 提交于
BUG: memory leak unreferenced object 0xffff888101bc4c00 (size 32): comm "syz-executor527", pid 360, jiffies 4294807421 (age 19.329s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 01 00 00 00 00 00 00 00 ac 14 14 bb 00 00 02 00 ................ backtrace: [<00000000f17c5244>] kmalloc include/linux/slab.h:558 [inline] [<00000000f17c5244>] kzalloc include/linux/slab.h:688 [inline] [<00000000f17c5244>] ip_mc_add1_src net/ipv4/igmp.c:1971 [inline] [<00000000f17c5244>] ip_mc_add_src+0x95f/0xdb0 net/ipv4/igmp.c:2095 [<000000001cb99709>] ip_mc_source+0x84c/0xea0 net/ipv4/igmp.c:2416 [<0000000052cf19ed>] do_ip_setsockopt net/ipv4/ip_sockglue.c:1294 [inline] [<0000000052cf19ed>] ip_setsockopt+0x114b/0x30c0 net/ipv4/ip_sockglue.c:1423 [<00000000477edfbc>] raw_setsockopt+0x13d/0x170 net/ipv4/raw.c:857 [<00000000e75ca9bb>] __sys_setsockopt+0x158/0x270 net/socket.c:2117 [<00000000bdb993a8>] __do_sys_setsockopt net/socket.c:2128 [inline] [<00000000bdb993a8>] __se_sys_setsockopt net/socket.c:2125 [inline] [<00000000bdb993a8>] __x64_sys_setsockopt+0xba/0x150 net/socket.c:2125 [<000000006a1ffdbd>] do_syscall_64+0x40/0x80 arch/x86/entry/common.c:47 [<00000000b11467c4>] entry_SYSCALL_64_after_hwframe+0x44/0xae In commit 24803f38 ("igmp: do not remove igmp souce list info when set link down"), the ip_mc_clear_src() in ip_mc_destroy_dev() was removed, because it was also called in igmpv3_clear_delrec(). Rough callgraph: inetdev_destroy -> ip_mc_destroy_dev -> igmpv3_clear_delrec -> ip_mc_clear_src -> RCU_INIT_POINTER(dev->ip_ptr, NULL) However, ip_mc_clear_src() called in igmpv3_clear_delrec() doesn't release in_dev->mc_list->sources. And RCU_INIT_POINTER() assigns the NULL to dev->ip_ptr. As a result, in_dev cannot be obtained through inetdev_by_index() and then in_dev->mc_list->sources cannot be released by ip_mc_del1_src() in the sock_close. Rough call sequence goes like: sock_close -> __sock_release -> inet_release -> ip_mc_drop_socket -> inetdev_by_index -> ip_mc_leave_src -> ip_mc_del_src -> ip_mc_del1_src So we still need to call ip_mc_clear_src() in ip_mc_destroy_dev() to free in_dev->mc_list->sources. Fixes: 24803f38 ("igmp: do not remove igmp souce list info ...") Reported-by: NHulk Robot <hulkci@huawei.com> Signed-off-by: NChengyang Fan <cy.fan@huawei.com> Acked-by: NHangbin Liu <liuhangbin@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 16 6月, 2021 6 次提交
-
-
由 Maciej Żenczykowski 提交于
Timewait sockets have included mark since approx 4.18. Cc: Eric Dumazet <edumazet@google.com> Cc: Jon Maxwell <jmaxwell37@gmail.com> Fixes: 00483690 ("tcp: Add mark for TIMEWAIT sockets") Signed-off-by: NMaciej Żenczykowski <maze@google.com> Reviewed-by: NEric Dumazet <edumazet@google.com> Reviewed-by: NJon Maxwell <jmaxwell37@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Kuniyuki Iwashima 提交于
This patch also changes the code to call reuseport_migrate_sock() and inet_reqsk_clone(), but unlike the other cases, we do not call inet_reqsk_clone() right after reuseport_migrate_sock(). Currently, in the receive path for TCP_NEW_SYN_RECV sockets, its listener has three kinds of refcnt: (A) for listener itself (B) carried by reuqest_sock (C) sock_hold() in tcp_v[46]_rcv() While processing the req, (A) may disappear by close(listener). Also, (B) can disappear by accept(listener) once we put the req into the accept queue. So, we have to hold another refcnt (C) for the listener to prevent use-after-free. For socket migration, we call reuseport_migrate_sock() to select a listener with (A) and to increment the new listener's refcnt in tcp_v[46]_rcv(). This refcnt corresponds to (C) and is cleaned up later in tcp_v[46]_rcv(). Thus we have to take another refcnt (B) for the newly cloned request_sock. In inet_csk_complete_hashdance(), we hold the count (B), clone the req, and try to put the new req into the accept queue. By migrating req after winning the "own_req" race, we can avoid such a worst situation: CPU 1 looks up req1 CPU 2 looks up req1, unhashes it, then CPU 1 loses the race CPU 3 looks up req2, unhashes it, then CPU 2 loses the race ... Signed-off-by: NKuniyuki Iwashima <kuniyu@amazon.co.jp> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Reviewed-by: NEric Dumazet <edumazet@google.com> Acked-by: NMartin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20210612123224.12525-8-kuniyu@amazon.co.jp
-
由 Kuniyuki Iwashima 提交于
As with the preceding patch, this patch changes reqsk_timer_handler() to call reuseport_migrate_sock() and inet_reqsk_clone() to migrate in-flight requests at retransmitting SYN+ACKs. If we can select a new listener and clone the request, we resume setting the SYN+ACK timer for the new req. If we can set the timer, we call inet_ehash_insert() to unhash the old req and put the new req into ehash. The noteworthy point here is that by unhashing the old req, another CPU processing it may lose the "own_req" race in tcp_v[46]_syn_recv_sock() and drop the final ACK packet. However, the new timer will recover this situation. Signed-off-by: NKuniyuki Iwashima <kuniyu@amazon.co.jp> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Reviewed-by: NEric Dumazet <edumazet@google.com> Acked-by: NMartin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20210612123224.12525-7-kuniyu@amazon.co.jp
-
由 Kuniyuki Iwashima 提交于
When we call close() or shutdown() for listening sockets, each child socket in the accept queue are freed at inet_csk_listen_stop(). If we can get a new listener by reuseport_migrate_sock() and clone the request by inet_reqsk_clone(), we try to add it into the new listener's accept queue by inet_csk_reqsk_queue_add(). If it fails, we have to call __reqsk_free() to call sock_put() for its listener and free the cloned request. After putting the full socket into ehash, tcp_v[46]_syn_recv_sock() sets NULL to ireq_opt/pktopts in struct inet_request_sock, but ipv6_opt can be non-NULL. So, we have to set NULL to ipv6_opt of the old request to avoid double free. Note that we do not update req->rsk_listener and instead clone the req to migrate because another path may reference the original request. If we protected it by RCU, we would need to add rcu_read_lock() in many places. Suggested-by: NMartin KaFai Lau <kafai@fb.com> Signed-off-by: NKuniyuki Iwashima <kuniyu@amazon.co.jp> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Reviewed-by: NEric Dumazet <edumazet@google.com> Acked-by: NMartin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/netdev/20201209030903.hhow5r53l6fmozjn@kafai-mbp.dhcp.thefacebook.com/ Link: https://lore.kernel.org/bpf/20210612123224.12525-6-kuniyu@amazon.co.jp
-
由 Kuniyuki Iwashima 提交于
When we close a listening socket, to migrate its connections to another listener in the same reuseport group, we have to handle two kinds of child sockets. One is that a listening socket has a reference to, and the other is not. The former is the TCP_ESTABLISHED/TCP_SYN_RECV sockets, and they are in the accept queue of their listening socket. So we can pop them out and push them into another listener's queue at close() or shutdown() syscalls. On the other hand, the latter, the TCP_NEW_SYN_RECV socket is during the three-way handshake and not in the accept queue. Thus, we cannot access such sockets at close() or shutdown() syscalls. Accordingly, we have to migrate immature sockets after their listening socket has been closed. Currently, if their listening socket has been closed, TCP_NEW_SYN_RECV sockets are freed at receiving the final ACK or retransmitting SYN+ACKs. At that time, if we could select a new listener from the same reuseport group, no connection would be aborted. However, we cannot do that because reuseport_detach_sock() sets NULL to sk_reuseport_cb and forbids access to the reuseport group from closed sockets. This patch allows TCP_CLOSE sockets to remain in the reuseport group and access it while any child socket references them. The point is that reuseport_detach_sock() was called twice from inet_unhash() and sk_destruct(). This patch replaces the first reuseport_detach_sock() with reuseport_stop_listen_sock(), which checks if the reuseport group is capable of migration. If capable, it decrements num_socks, moves the socket backwards in socks[] and increments num_closed_socks. When all connections are migrated, sk_destruct() calls reuseport_detach_sock() to remove the socket from socks[], decrement num_closed_socks, and set NULL to sk_reuseport_cb. By this change, closed or shutdowned sockets can keep sk_reuseport_cb. Consequently, calling listen() after shutdown() can cause EADDRINUSE or EBUSY in inet_csk_bind_conflict() or reuseport_add_sock() which expects such sockets not to have the reuseport group. Therefore, this patch also loosens such validation rules so that a socket can listen again if it has a reuseport group with num_closed_socks more than 0. When such sockets listen again, we handle them in reuseport_resurrect(). If there is an existing reuseport group (reuseport_add_sock() path), we move the socket from the old group to the new one and free the old one if necessary. If there is no existing group (reuseport_alloc() path), we allocate a new reuseport group, detach sk from the old one, and free it if necessary, not to break the current shutdown behaviour: - we cannot carry over the eBPF prog of shutdowned sockets - we cannot attach/detach an eBPF prog to/from listening sockets via shutdowned sockets Note that when the number of sockets gets over U16_MAX, we try to detach a closed socket randomly to make room for the new listening socket in reuseport_grow(). Signed-off-by: NKuniyuki Iwashima <kuniyu@amazon.co.jp> Signed-off-by: NMartin KaFai Lau <kafai@fb.com> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Reviewed-by: NEric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/bpf/20210612123224.12525-4-kuniyu@amazon.co.jp
-
由 Kuniyuki Iwashima 提交于
This commit adds a new sysctl option: net.ipv4.tcp_migrate_req. If this option is enabled or eBPF program is attached, we will be able to migrate child sockets from a listener to another in the same reuseport group after close() or shutdown() syscalls. Signed-off-by: NKuniyuki Iwashima <kuniyu@amazon.co.jp> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Reviewed-by: NBenjamin Herrenschmidt <benh@amazon.com> Reviewed-by: NEric Dumazet <edumazet@google.com> Acked-by: NMartin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20210612123224.12525-2-kuniyu@amazon.co.jp
-
- 15 6月, 2021 1 次提交
-
-
由 David Ahern 提交于
Oliver reported a use case where deleting a VRF device can hang waiting for the refcnt to drop to 0. The root cause is that the dst is allocated against the VRF device but cached on the loopback device. The use case (added to the selftests) has an implicit VRF crossing due to the ordering of the FIB rules (lookup local is before the l3mdev rule, but the problem occurs even if the FIB rules are re-ordered with local after l3mdev because the VRF table does not have a default route to terminate the lookup). The end result is is that the FIB lookup returns the loopback device as the nexthop, but the ingress device is in a VRF. The mismatch causes the dst alloc against the VRF device but then cached on the loopback. The fix is to bring the trick used for IPv6 (see ip6_rt_get_dev_rcu): pick the dst alloc device based the fib lookup result but with checks that the result has a nexthop device (e.g., not an unreachable or prohibit entry). Fixes: f5a0aab8 ("net: ipv4: dst for local input routes should use l3mdev if relevant") Reported-by: NOliver Herms <oliver.peter.herms@gmail.com> Signed-off-by: NDavid Ahern <dsahern@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 11 6月, 2021 1 次提交
-
-
由 Zheng Yongjun 提交于
Function 'ping_queue_rcv_skb' not always return success, which will also return fail. If not check the wrong return value of it, lead to function `ping_rcv` return success. Signed-off-by: NZheng Yongjun <zhengyongjun3@huawei.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 10 6月, 2021 2 次提交
-
-
由 Paolo Abeni 提交于
Kaustubh reported and diagnosed a panic in udp_lib_lookup(). The root cause is udp_abort() racing with close(). Both racing functions acquire the socket lock, but udp{v6}_destroy_sock() release it before performing destructive actions. We can't easily extend the socket lock scope to avoid the race, instead use the SOCK_DEAD flag to prevent udp_abort from doing any action when the critical race happens. Diagnosed-and-tested-by: NKaustubh Pandey <kapandey@codeaurora.org> Fixes: 5d77dca8 ("net: diag: support SOCK_DESTROY for UDP sockets") Signed-off-by: NPaolo Abeni <pabeni@redhat.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
Both functions are known to be racy when reading inet_num as we do not want to grab locks for the common case the socket has been bound already. The race is resolved in inet_autobind() by reading again inet_num under the socket lock. syzbot reported: BUG: KCSAN: data-race in inet_send_prepare / udp_lib_get_port write to 0xffff88812cba150e of 2 bytes by task 24135 on cpu 0: udp_lib_get_port+0x4b2/0xe20 net/ipv4/udp.c:308 udp_v6_get_port+0x5e/0x70 net/ipv6/udp.c:89 inet_autobind net/ipv4/af_inet.c:183 [inline] inet_send_prepare+0xd0/0x210 net/ipv4/af_inet.c:807 inet6_sendmsg+0x29/0x80 net/ipv6/af_inet6.c:639 sock_sendmsg_nosec net/socket.c:654 [inline] sock_sendmsg net/socket.c:674 [inline] ____sys_sendmsg+0x360/0x4d0 net/socket.c:2350 ___sys_sendmsg net/socket.c:2404 [inline] __sys_sendmmsg+0x315/0x4b0 net/socket.c:2490 __do_sys_sendmmsg net/socket.c:2519 [inline] __se_sys_sendmmsg net/socket.c:2516 [inline] __x64_sys_sendmmsg+0x53/0x60 net/socket.c:2516 do_syscall_64+0x4a/0x90 arch/x86/entry/common.c:47 entry_SYSCALL_64_after_hwframe+0x44/0xae read to 0xffff88812cba150e of 2 bytes by task 24132 on cpu 1: inet_send_prepare+0x21/0x210 net/ipv4/af_inet.c:806 inet6_sendmsg+0x29/0x80 net/ipv6/af_inet6.c:639 sock_sendmsg_nosec net/socket.c:654 [inline] sock_sendmsg net/socket.c:674 [inline] ____sys_sendmsg+0x360/0x4d0 net/socket.c:2350 ___sys_sendmsg net/socket.c:2404 [inline] __sys_sendmmsg+0x315/0x4b0 net/socket.c:2490 __do_sys_sendmmsg net/socket.c:2519 [inline] __se_sys_sendmmsg net/socket.c:2516 [inline] __x64_sys_sendmmsg+0x53/0x60 net/socket.c:2516 do_syscall_64+0x4a/0x90 arch/x86/entry/common.c:47 entry_SYSCALL_64_after_hwframe+0x44/0xae value changed: 0x0000 -> 0x9db4 Reported by Kernel Concurrency Sanitizer on: CPU: 1 PID: 24132 Comm: syz-executor.2 Not tainted 5.13.0-rc4-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Signed-off-by: NEric Dumazet <edumazet@google.com> Reported-by: Nsyzbot <syzkaller@googlegroups.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 09 6月, 2021 3 次提交
-
-
由 Florian Westphal 提交于
Its set but never read. Reduces size of xfrm_type to 64 bytes on 64bit. Signed-off-by: NFlorian Westphal <fw@strlen.de> Signed-off-by: NSteffen Klassert <steffen.klassert@secunet.com>
-
由 Zheng Yongjun 提交于
When 'nla_parse_nested_deprecated' failed, it's no need to BUG() here, return -EINVAL is ok. Signed-off-by: NZheng Yongjun <zhengyongjun3@huawei.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Nanyong Sun 提交于
Reported by syzkaller: BUG: memory leak unreferenced object 0xffff888105df7000 (size 64): comm "syz-executor842", pid 360, jiffies 4294824824 (age 22.546s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<00000000e67ed558>] kmalloc include/linux/slab.h:590 [inline] [<00000000e67ed558>] kzalloc include/linux/slab.h:720 [inline] [<00000000e67ed558>] netlbl_cipsov4_add_std net/netlabel/netlabel_cipso_v4.c:145 [inline] [<00000000e67ed558>] netlbl_cipsov4_add+0x390/0x2340 net/netlabel/netlabel_cipso_v4.c:416 [<0000000006040154>] genl_family_rcv_msg_doit.isra.0+0x20e/0x320 net/netlink/genetlink.c:739 [<00000000204d7a1c>] genl_family_rcv_msg net/netlink/genetlink.c:783 [inline] [<00000000204d7a1c>] genl_rcv_msg+0x2bf/0x4f0 net/netlink/genetlink.c:800 [<00000000c0d6a995>] netlink_rcv_skb+0x134/0x3d0 net/netlink/af_netlink.c:2504 [<00000000d78b9d2c>] genl_rcv+0x24/0x40 net/netlink/genetlink.c:811 [<000000009733081b>] netlink_unicast_kernel net/netlink/af_netlink.c:1314 [inline] [<000000009733081b>] netlink_unicast+0x4a0/0x6a0 net/netlink/af_netlink.c:1340 [<00000000d5fd43b8>] netlink_sendmsg+0x789/0xc70 net/netlink/af_netlink.c:1929 [<000000000a2d1e40>] sock_sendmsg_nosec net/socket.c:654 [inline] [<000000000a2d1e40>] sock_sendmsg+0x139/0x170 net/socket.c:674 [<00000000321d1969>] ____sys_sendmsg+0x658/0x7d0 net/socket.c:2350 [<00000000964e16bc>] ___sys_sendmsg+0xf8/0x170 net/socket.c:2404 [<000000001615e288>] __sys_sendmsg+0xd3/0x190 net/socket.c:2433 [<000000004ee8b6a5>] do_syscall_64+0x37/0x90 arch/x86/entry/common.c:47 [<00000000171c7cee>] entry_SYSCALL_64_after_hwframe+0x44/0xae The memory of doi_def->map.std pointing is allocated in netlbl_cipsov4_add_std, but no place has freed it. It should be freed in cipso_v4_doi_free which frees the cipso DOI resource. Fixes: 96cb8e33 ("[NetLabel]: CIPSOv4 and Unlabeled packet integration") Reported-by: NHulk Robot <hulkci@huawei.com> Signed-off-by: NNanyong Sun <sunnanyong@huawei.com> Acked-by: NPaul Moore <paul@paul-moore.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-