- June 19, 2022 (2 commits)

Submitted by Eric Dumazet

Using rwlock in networking code is extremely risky: writers can starve if enough readers keep grabbing the rwlock. I thought rwlocks were at fault and sent this patch: https://lkml.org/lkml/2022/6/17/272 But Peter and Linus essentially told me rwlock had to be unfair. We need to get rid of rwlock in networking code.

Without this fix, the following script triggers soft lockups:

    for i in {1..48}
    do
        ping -f -n -q 127.0.0.1 &
        sleep 0.1
    done

Fixes: 1da177e4 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

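A minimal sketch of the usual replacement pattern this points toward (RCU-protected list for lock-free readers, a plain spinlock for writers); the names and structure below are illustrative, not the exact raw-socket patch:

    #include <linux/rculist.h>
    #include <linux/spinlock.h>

    static DEFINE_SPINLOCK(hash_lock);      /* serializes writers only */
    static HLIST_HEAD(hash_head);

    struct raw_entry {
            struct hlist_node node;
            int key;
    };

    /* Reader: lock-free, so it cannot starve writers. A real caller
     * would keep holding RCU (or take a reference) before unlocking.
     */
    static struct raw_entry *lookup(int key)
    {
            struct raw_entry *e;

            rcu_read_lock();
            hlist_for_each_entry_rcu(e, &hash_head, node) {
                    if (e->key == key) {
                            rcu_read_unlock();
                            return e;
                    }
            }
            rcu_read_unlock();
            return NULL;
    }

    /* Writer: short critical section under a plain spinlock. */
    static void insert(struct raw_entry *e)
    {
            spin_lock(&hash_lock);
            hlist_add_head_rcu(&e->node, &hash_head);
            spin_unlock(&hash_lock);
    }
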
Submitted by Eric Dumazet

In order to prepare the following patch, I change the raw v4 & v6 code to use more conventional iterators.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

- June 18, 2022 (1 commit)

Submitted by Eric Dumazet

Using rwlock in networking code is extremely risky: writers can starve if enough readers keep grabbing the rwlock. I thought rwlocks were at fault and sent this patch: https://lkml.org/lkml/2022/6/17/272 But Peter and Linus essentially told me rwlock had to be unfair. We need to get rid of rwlock in networking code.

Fixes: c319b4d7 ("net: ipv4: add IPPROTO_ICMP socket kind")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

- June 17, 2022 (6 commits)

Submitted by David S. Miller

Remove an accidental duplicate of tcp_wmem_schedule.

Signed-off-by: David S. Miller <davem@davemloft.net>

Submitted by Eric Dumazet

The blamed commit only dealt with applications issuing small writes. The issue here is that we allow forcing a memory schedule for the sk_buff allocation, but we have no guarantee that sendmsg() is able to copy some payload into it.

In this patch, I make sure the socket can use up to tcp_wmem[0] bytes. For example, if we consider tcp_wmem[0] = 4096 (the default on x86) and an initial skb->truesize of 1280, tcp_sendmsg() is able to copy up to 2816 bytes under memory pressure. Before this patch, a sendmsg() sending more than 2816 bytes would either block forever (under persistent memory pressure) or return -EAGAIN. For networks with bigger MTUs, it is advised to increase tcp_wmem[0] to avoid sending packets that are too small.

v2: deal with zero copy paths.

Fixes: 8e4d980a ("tcp: fix behavior for epoll edge trigger")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Wei Wang <weiwan@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

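A minimal sketch of the guarantee described above (the helper name and exact body are assumptions based on this description, not a verbatim quote of the patch). With tcp_wmem[0] = 4096 and skb->truesize = 1280 already queued, up to 4096 - 1280 = 2816 bytes may still be copied under pressure:

    /* Sketch: let a socket under memory pressure still schedule up to
     * tcp_wmem[0] bytes so sendmsg() can make forward progress.
     */
    static int tcp_wmem_schedule(struct sock *sk, int copy)
    {
            int left;

            if (likely(sk_wmem_schedule(sk, copy)))
                    return copy;

            /* Use whatever headroom tcp_wmem[0] still allows. */
            left = sock_net(sk)->ipv4.sysctl_tcp_wmem[0] - sk->sk_wmem_queued;
            if (left > 0)
                    sk_forced_mem_schedule(sk, min(left, copy));

            return min(copy, sk->sk_forward_alloc);
    }
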
Submitted by Eric Dumazet

The blamed commit only dealt with applications issuing small writes. The issue here is that we allow forcing a memory schedule for the sk_buff allocation, but we have no guarantee that sendmsg() is able to copy some payload into it.

In this patch, I make sure the socket can use up to tcp_wmem[0] bytes. For example, if we consider tcp_wmem[0] = 4096 (the default on x86) and an initial skb->truesize of 1280, tcp_sendmsg() is able to copy up to 2816 bytes under memory pressure. Before this patch, a sendmsg() sending more than 2816 bytes would either block forever (under persistent memory pressure) or return -EAGAIN. For networks with bigger MTUs, it is advised to increase tcp_wmem[0] to avoid sending packets that are too small.

v2: deal with zero copy paths.

Fixes: 8e4d980a ("tcp: fix behavior for epoll edge trigger")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Wei Wang <weiwan@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Submitted by Eric Dumazet

sk_forced_mem_schedule() has a bug similar to the ones fixed in commit 7c80b038 ("net: fix sk_wmem_schedule() and sk_rmem_schedule() errors"). While this bug has little chance of triggering in old kernels, we need to fix it before the following patch.

Fixes: d83769a5 ("tcp: fix possible deadlock in tcp_send_fin()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Wei Wang <weiwan@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Submitted by Maxim Mikityanskiy

The new helpers bpf_tcp_raw_{gen,check}_syncookie_ipv{4,6} allow an XDP program to generate SYN cookies in response to TCP SYN packets and to check those cookies upon receiving the first ACK packet (the final packet of the TCP handshake). Unlike bpf_tcp_{gen,check}_syncookie, these new helpers don't need a listening socket on the local machine, which allows using them together with synproxy to accelerate SYN cookie generation.

Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://lore.kernel.org/r/20220615134847.3753567-4-maximmi@nvidia.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

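A hedged sketch of how an XDP program might call the new helpers (packet parsing and SYNACK construction are elided; handle_syn/handle_ack are illustrative names, and iph/th are assumed to be bounds-checked against data_end):

    // SPDX-License-Identifier: GPL-2.0
    #include <linux/bpf.h>
    #include <linux/ip.h>
    #include <linux/tcp.h>
    #include <bpf/bpf_helpers.h>

    static __always_inline int handle_syn(struct iphdr *iph, struct tcphdr *th)
    {
            __s64 cookie;

            /* th_len covers the TCP header including options, so MSS,
             * wscale and SACK permission can be encoded into the cookie.
             */
            cookie = bpf_tcp_raw_gen_syncookie_ipv4(iph, th, th->doff * 4);
            if (cookie < 0)
                    return XDP_DROP;
            /* ... craft and send the SYNACK carrying (__u32)cookie ... */
            return XDP_TX;
    }

    static __always_inline int handle_ack(struct iphdr *iph, struct tcphdr *th)
    {
            /* Returns 0 when the ACK carries a valid cookie. */
            if (bpf_tcp_raw_check_syncookie_ipv4(iph, th))
                    return XDP_DROP;
            return XDP_PASS;
    }
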
Submitted by Joanne Koong

This reverts:

commit d5a42de8 ("net: Add a second bind table hashed by port and address")
commit 538aaf9b ("selftests: Add test for timing a bind request to a port with a populated bhash entry")

Link: https://lore.kernel.org/netdev/20220520001834.2247810-1-kuba@kernel.org/

There are a few things that need to be fixed here:
* Updating bhash2 in cases where the socket's rcv saddr changes
* Adding bhash2 hashbucket locks

Links to syzbot reports:
https://lore.kernel.org/netdev/00000000000022208805e0df247a@google.com/
https://lore.kernel.org/netdev/0000000000003f33bc05dfaf44fe@google.com/

Fixes: d5a42de8 ("net: Add a second bind table hashed by port and address")
Reported-by: syzbot+015d756bbd1f8b5c8f09@syzkaller.appspotmail.com
Reported-by: syzbot+98fd2d1422063b0f8c44@syzkaller.appspotmail.com
Reported-by: syzbot+0a847a982613c6438fba@syzkaller.appspotmail.com
Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Link: https://lore.kernel.org/r/20220615193213.2419568-1-joannelkoong@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

- June 11, 2022 (3 commits)

Submitted by Eric Dumazet

Currently, tcp_memory_allocated can hit the tcp_mem[] limits quite fast. Each TCP socket can forward-allocate up to 2 MB of memory, even after the flow has become less active: 10,000 sockets can have reserved 20 GB of memory, and we have no shrinker in place to reclaim that. Instead of trying to reclaim the extra allocations in some places, just keep sk->sk_forward_alloc values as small as possible.

This should not impact performance too much now that we have per-cpu reserves: changes to tcp_memory_allocated should not be too frequent.

For sockets not using SO_RESERVE_MEM:
- idle sockets (no packets in tx/rx queues) have zero forward alloc.
- non-idle sockets have a forward alloc smaller than one page.

Note:
- Removal of SK_RECLAIM_CHUNK and SK_RECLAIM_THRESHOLD is left to the MPTCP maintainers as a follow-up.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

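A hedged sketch of what "keep sk_forward_alloc small" can look like in the reclaim path (the helper names follow the kernel, but the body is illustrative, not the exact patch):

    /* Give back everything above the SO_RESERVE_MEM floor as soon as at
     * least one page is reclaimable, instead of letting megabytes of
     * forward allocation linger per socket.
     */
    static inline void sk_mem_reclaim(struct sock *sk)
    {
            int reclaimable;

            if (!sk_has_account(sk))        /* protocol does no accounting */
                    return;

            reclaimable = sk->sk_forward_alloc - sk_unused_reserved_mem(sk);
            if (reclaimable >= (int)PAGE_SIZE)
                    __sk_mem_reclaim(sk, reclaimable);
    }
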
Submitted by Eric Dumazet

Each protocol having a ->memory_allocated pointer gets a corresponding per-cpu reserve that the following patches will use. Instead of having reserved bytes per socket, we want to have per-cpu reserves.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

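An illustrative sketch of a per-cpu reserve in front of the global atomic counter: per-cpu deltas are folded into ->memory_allocated only once they exceed a batch size, cutting cache-line bouncing (names mirror the kernel's, but treat the details as assumptions):

    #define SK_MEMORY_PCPU_RESERVE (1 << (20 - PAGE_SHIFT)) /* ~1 MB in pages */

    static inline void sk_memory_allocated_add(struct sock *sk, int amt)
    {
            int local_reserve;

            preempt_disable();
            local_reserve = __this_cpu_add_return(*sk->sk_prot->per_cpu_fw_alloc, amt);
            if (local_reserve >= SK_MEMORY_PCPU_RESERVE) {
                    /* Flush the batched per-cpu delta to the shared counter */
                    __this_cpu_sub(*sk->sk_prot->per_cpu_fw_alloc, local_reserve);
                    atomic_long_add(local_reserve, sk->sk_prot->memory_allocated);
            }
            preempt_enable();
    }
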
Submitted by Eric Dumazet

Due to the memcg interface, SK_MEM_QUANTUM is effectively PAGE_SIZE. This might change in the future, but it seems better to avoid the confusion.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

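Concretely, the old unit was already defined in terms of the page size, so call sites can use PAGE_SIZE/PAGE_SHIFT directly (a before/after sketch; the exact upstream definitions are an assumption):

    /* Before: a dedicated accounting unit that equaled the page size */
    #define SK_MEM_QUANTUM ((int)PAGE_SIZE)
    #define SK_MEM_QUANTUM_SHIFT ilog2(SK_MEM_QUANTUM)

    /* After: no indirection, e.g. converting accounted pages to bytes */
    sk->sk_forward_alloc += pages << PAGE_SHIFT;  /* was: << SK_MEM_QUANTUM_SHIFT */
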
- June 10, 2022 (2 commits)

Submitted by Eric Dumazet

inet_sock_destruct() has four warnings which have been useful for pointing out kernel bugs in the past. However, they are potentially a problem because they could flood the syslog.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Submitted by Jakub Kicinski

Netdev reference helpers have a dev_ prefix for historic reasons. Renaming the old helpers would be too much churn, but we can rename the tracking ones, which are relatively recent and should be the default for new code.

Rename:
  dev_hold_track()    -> netdev_hold()
  dev_put_track()     -> netdev_put()
  dev_replace_track() -> netdev_ref_replace()

Link: https://lore.kernel.org/r/20220608043955.919359-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

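A short usage example of the renamed tracking API (the struct and functions are illustrative; the semantics match the old dev_hold_track()/dev_put_track() pair):

    #include <linux/netdevice.h>

    struct foo {
            struct net_device *dev;
            netdevice_tracker dev_tracker;
    };

    static void foo_bind(struct foo *f, struct net_device *dev)
    {
            f->dev = dev;
            netdev_hold(dev, &f->dev_tracker, GFP_KERNEL);
    }

    static void foo_unbind(struct foo *f)
    {
            netdev_put(f->dev, &f->dev_tracker);
            f->dev = NULL;
    }
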
- June 9, 2022 (3 commits)

Submitted by Muchun Song

On our servers, there may be no high-order (>= 6) memory available, since we reserve lots of HugeTLB pages at boot. The system then panics. So use alloc_large_system_hash() to allocate table_perturb.

Fixes: e9261476 ("tcp: dynamically allocate the perturb table used by source ports")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20220607070214.94443-1-songmuchun@bytedance.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

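The fix's shape, as a hedged sketch (the size constant and argument values are assumptions; the point is that this allocator can fall back to vmalloc()-backed memory instead of demanding a high-order contiguous block):

    table_perturb = alloc_large_system_hash("TCP source port",
                                            sizeof(*table_perturb),
                                            INET_TABLE_PERTURB_SIZE,
                                            0, 0, NULL, NULL,
                                            INET_TABLE_PERTURB_SIZE,
                                            INET_TABLE_PERTURB_SIZE);
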
Submitted by Willem de Bruijn

GRE with TUNNEL_CSUM will apply local checksum offload on CHECKSUM_PARTIAL packets. ipgre_xmit must validate csum_start after an optional skb_pull, else lco_csum may trigger an overflow. The original check was:

    if (csum && skb_checksum_start(skb) < skb->data)
            return -EINVAL;

This had false positives when skb_checksum_start is undefined: when ip_summed is not CHECKSUM_PARTIAL. A discussed refinement was straightforward:

    if (csum && skb->ip_summed == CHECKSUM_PARTIAL &&
        skb_checksum_start(skb) < skb->data)
            return -EINVAL;

But it was eventually revised more thoroughly:

- restrict the check to the only branch where it is needed, in an uncommon GRE path that uses header_ops and calls skb_pull.
- test skb_transport_header, which is set along with csum_start in skb_partial_csum_set in the normal header_ops datapath.

It turns out skbs can arrive in this branch without the transport header set, e.g., through BPF redirection. Revise the check back to testing csum_start directly, and only if CHECKSUM_PARTIAL. Do leave the check in the updated location, and check the field regardless of whether TUNNEL_CSUM is configured.

Link: https://lore.kernel.org/netdev/YS+h%2FtqCJJiQei+W@shredder/
Link: https://lore.kernel.org/all/20210902193447.94039-2-willemdebruijn.kernel@gmail.com/T/#u
Fixes: 8a0ed250 ("ip_gre: validate csum_start only on pull")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Link: https://lore.kernel.org/r/20220606132107.3582565-1-willemdebruijn.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Submitted by Masahiro Yamada

EXPORT_SYMBOL and __init are a bad combination because the .init.text section is freed up after initialization. Hence, modules cannot use symbols annotated __init; access to a freed symbol may end up in a kernel panic. modpost used to detect this, but it had been broken for a decade. Recently, I fixed modpost so it started to warn about it again, and this showed up in linux-next builds.

There are two ways to fix it:

- Remove __init
- Remove EXPORT_SYMBOL

I chose the latter for this case because the only in-tree call site, net/ipv4/xfrm4_policy.c, is never compiled as modular (CONFIG_XFRM is boolean).

Fixes: 2f32b51b ("xfrm: Introduce xfrm_input_afinfo to access the the callbacks properly")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

- June 3, 2022 (1 commit)

Submitted by Wang Yufen

During a TCP sockmap redirect pressure test, the following warning is triggered:

    WARNING: CPU: 3 PID: 2145 at net/core/stream.c:205 sk_stream_kill_queues+0xbc/0xd0
    CPU: 3 PID: 2145 Comm: iperf Kdump: loaded Tainted: G W 5.10.0+ #9
    Call Trace:
     inet_csk_destroy_sock+0x55/0x110
     inet_csk_listen_stop+0xbb/0x380
     tcp_close+0x41b/0x480
     inet_release+0x42/0x80
     __sock_release+0x3d/0xa0
     sock_close+0x11/0x20
     __fput+0x9d/0x240
     task_work_run+0x62/0x90
     exit_to_user_mode_prepare+0x110/0x120
     syscall_exit_to_user_mode+0x27/0x190
     entry_SYSCALL_64_after_hwframe+0x44/0xa9

The reason we observed is this: when the listener is closing, a connection may have completed the three-way handshake but not yet been accepted, and the client has sent some packets. The child sks in the accept queue are released by inet_child_forget() -> inet_csk_destroy_sock(), but the psocks of the child sks have not been released. To fix this, add sock_map_destroy to release the psocks.

Signed-off-by: Wang Yufen <wangyufen@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20220524075311.649153-1-wangyufen@huawei.com

- June 1, 2022 (1 commit)

Submitted by Eric Dumazet

Laurent reported the enclosed report [1]. This bug triggers with the following conditions:

0) Kernel built with CONFIG_DEBUG_PREEMPT=y
1) A new passive FastOpen TCP socket is created. This FO socket waits for an ACK coming from the client to become a complete ESTABLISHED one.
2) A socket operation on this socket goes through the lock_sock()/release_sock() dance.
3) While the socket is owned by the user in step 2), a retransmit of the SYN is received and stored in the socket backlog.
4) At release_sock() time, the socket backlog is processed while in process context.
5) A SYNACK packet is cooked in response to the SYN retransmit.
6) -> tcp_rtx_synack() is called in process context.

Before the blamed commit, tcp_rtx_synack() was always called from a BH handler, from a timer handler. Fix this by using TCP_INC_STATS() & NET_INC_STATS(), which do not assume the caller is in a non-preemptible context.

[1]

    BUG: using __this_cpu_add() in preemptible [00000000] code: epollpep/2180
    caller is tcp_rtx_synack.part.0+0x36/0xc0
    CPU: 10 PID: 2180 Comm: epollpep Tainted: G OE 5.16.0-0.bpo.4-amd64 #1 Debian 5.16.12-1~bpo11+1
    Hardware name: Supermicro SYS-5039MC-H8TRF/X11SCD-F, BIOS 1.7 11/23/2021
    Call Trace:
     <TASK>
     dump_stack_lvl+0x48/0x5e
     check_preemption_disabled+0xde/0xe0
     tcp_rtx_synack.part.0+0x36/0xc0
     tcp_rtx_synack+0x8d/0xa0
     ? kmem_cache_alloc+0x2e0/0x3e0
     ? apparmor_file_alloc_security+0x3b/0x1f0
     inet_rtx_syn_ack+0x16/0x30
     tcp_check_req+0x367/0x610
     tcp_rcv_state_process+0x91/0xf60
     ? get_nohz_timer_target+0x18/0x1a0
     ? lock_timer_base+0x61/0x80
     ? preempt_count_add+0x68/0xa0
     tcp_v4_do_rcv+0xbd/0x270
     __release_sock+0x6d/0xb0
     release_sock+0x2b/0x90
     sock_setsockopt+0x138/0x1140
     ? __sys_getsockname+0x7e/0xc0
     ? aa_sk_perm+0x3e/0x1a0
     __sys_setsockopt+0x198/0x1e0
     __x64_sys_setsockopt+0x21/0x30
     do_syscall_64+0x38/0xc0
     entry_SYSCALL_64_after_hwframe+0x44/0xae

Fixes: 168a8f58 ("tcp: TCP Fast Open Server - main code path")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Laurent Fasnacht <laurent.fasnacht@proton.ch>
Acked-by: Neal Cardwell <ncardwell@google.com>
Link: https://lore.kernel.org/r/20220530213713.601888-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

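The fix's direction as a hedged fragment (which exact counters tcp_rtx_synack() bumps is an assumption here):

    /* __TCP_INC_STATS()/__NET_INC_STATS() expand to __this_cpu_add(),
     * which is only legal with preemption already disabled (BH/timer
     * context). The plain variants disable preemption themselves, so
     * they are safe from process context too:
     */
    TCP_INC_STATS(sock_net(sk), TCP_MIB_RETRANSSEGS);     /* was __TCP_INC_STATS() */
    NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPSYNRETRANS); /* was __NET_INC_STATS() */
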
- May 31, 2022 (1 commit)

Submitted by huhai

Fix the following build warning when CONFIG_IPV6 is not set:

    In function ‘fortify_memcpy_chk’,
        inlined from ‘tcp_md5_do_add’ at net/ipv4/tcp_ipv4.c:1210:2:
    ./include/linux/fortify-string.h:328:4: error: call to ‘__write_overflow_field’ declared with attribute warning: detected write beyond size of field (1st parameter); maybe use struct_group()? [-Werror=attribute-warning]
      328 |    __write_overflow_field(p_size_field, size);
          |    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: huhai <huhai@kylinos.cn>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20220526101213.2392980-1-zhanggenjian@kylinos.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

- May 28, 2022 (1 commit)

Submitted by Eric Dumazet

syzbot got a new report [1] finally pointing to a very old bug, added in the initial support for MTU probing. tcp_mtu_probe() has checks about starting an MTU probe if tcp_snd_cwnd(tp) >= 11, but nothing prevents tcp_snd_cwnd(tp) from being reduced later, before the MTU probe succeeds. This bug would lead to potential zero-divides.

Debugging added in commit 40570375 ("tcp: add accessors to read/set tp->snd_cwnd") has paid off :) While we are at it, address potential overflows in this code.

[1]

    WARNING: CPU: 1 PID: 14132 at include/net/tcp.h:1219 tcp_mtup_probe_success+0x366/0x570 net/ipv4/tcp_input.c:2712
    Modules linked in:
    CPU: 1 PID: 14132 Comm: syz-executor.2 Not tainted 5.18.0-syzkaller-07857-gbabf0bb9 #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    RIP: 0010:tcp_snd_cwnd_set include/net/tcp.h:1219 [inline]
    RIP: 0010:tcp_mtup_probe_success+0x366/0x570 net/ipv4/tcp_input.c:2712
    Code: 74 08 48 89 ef e8 da 80 17 f9 48 8b 45 00 65 48 ff 80 80 03 00 00 48 83 c4 30 5b 41 5c 41 5d 41 5e 41 5f 5d c3 e8 aa b0 c5 f8 <0f> 0b e9 16 fe ff ff 48 8b 4c 24 08 80 e1 07 38 c1 0f 8c c7 fc ff
    RSP: 0018:ffffc900079e70f8 EFLAGS: 00010287
    RAX: ffffffff88c0f7f6 RBX: ffff8880756e7a80 RCX: 0000000000040000
    RDX: ffffc9000c6c4000 RSI: 0000000000031f9e RDI: 0000000000031f9f
    RBP: 0000000000000000 R08: ffffffff88c0f606 R09: ffffc900079e7520
    R10: ffffed101011226d R11: 1ffff1101011226c R12: 1ffff1100eadcf50
    R13: ffff8880756e72c0 R14: 1ffff1100eadcf89 R15: dffffc0000000000
    FS:  00007f643236e700(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000
    CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007f1ab3f1e2a0 CR3: 0000000064fe7000 CR4: 00000000003506e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
     <TASK>
     tcp_clean_rtx_queue+0x223a/0x2da0 net/ipv4/tcp_input.c:3356
     tcp_ack+0x1962/0x3c90 net/ipv4/tcp_input.c:3861
     tcp_rcv_established+0x7c8/0x1ac0 net/ipv4/tcp_input.c:5973
     tcp_v6_do_rcv+0x57b/0x1210 net/ipv6/tcp_ipv6.c:1476
     sk_backlog_rcv include/net/sock.h:1061 [inline]
     __release_sock+0x1d8/0x4c0 net/core/sock.c:2849
     release_sock+0x5d/0x1c0 net/core/sock.c:3404
     sk_stream_wait_memory+0x700/0xdc0 net/core/stream.c:145
     tcp_sendmsg_locked+0x111d/0x3fc0 net/ipv4/tcp.c:1410
     tcp_sendmsg+0x2c/0x40 net/ipv4/tcp.c:1448
     sock_sendmsg_nosec net/socket.c:714 [inline]
     sock_sendmsg net/socket.c:734 [inline]
     __sys_sendto+0x439/0x5c0 net/socket.c:2119
     __do_sys_sendto net/socket.c:2131 [inline]
     __se_sys_sendto net/socket.c:2127 [inline]
     __x64_sys_sendto+0xda/0xf0 net/socket.c:2127
     do_syscall_x64 arch/x86/entry/common.c:50 [inline]
     do_syscall_64+0x2b/0x70 arch/x86/entry/common.c:80
     entry_SYSCALL_64_after_hwframe+0x46/0xb0
    RIP: 0033:0x7f6431289109
    Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
    RSP: 002b:00007f643236e168 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
    RAX: ffffffffffffffda RBX: 00007f643139c100 RCX: 00007f6431289109
    RDX: 00000000d0d0c2ac RSI: 0000000020000080 RDI: 000000000000000a
    RBP: 00007f64312e308d R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000000
    R13: 00007fff372533af R14: 00007f643236e300 R15: 0000000000022000

Fixes: 5d424d5a ("[TCP]: MTU probing")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

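The guard the message describes, as a hedged sketch (64-bit math to avoid overflow, a floor of 1 to avoid the zero-divide; close to, but not guaranteed identical to, the upstream change):

    static void tcp_mtup_probe_success(struct sock *sk)
    {
            struct inet_connection_sock *icsk = inet_csk(sk);
            struct tcp_sock *tp = tcp_sk(sk);
            u64 val;

            tp->prior_ssthresh = tcp_current_ssthresh(sk);

            /* Scale cwnd by mss_to_mtu(old_mss) / probe_size in 64 bits,
             * and never let snd_cwnd drop to 0.
             */
            val = (u64)tcp_snd_cwnd(tp) * tcp_mss_to_mtu(sk, tp->mss_cache);
            do_div(val, icsk->icsk_mtup.probe_size);
            tcp_snd_cwnd_set(tp, max_t(u32, 1U, val));
            /* ... update ssthresh, mss cache and probing state ... */
    }
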
- May 21, 2022 (1 commit)

Submitted by Joanne Koong

We currently have one TCP bind table (bhash) which hashes by port number only. In the socket bind path, we check for bind conflicts by traversing the specified port's inet_bind_bucket while holding the bucket's spinlock (see inet_csk_get_port() and inet_csk_bind_conflict()). In instances where there are tons of sockets hashed to the same port at different addresses, checking for a bind conflict is time-intensive and can cause softirq cpu lockups; it also stops new tcp connections, since __inet_inherit_port() contends for the same spinlock.

This patch proposes adding a second bind table, bhash2, that hashes by port and IP address. Searching the bhash2 table leads to significantly faster conflict resolution and less time holding the spinlock.

Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

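A hedged sketch of the second table's key idea; the field names and layout below are illustrative, not the exact upstream structure:

    /* bhash  : port            -> bucket of all sockets bound to that port
     * bhash2 : (port, address) -> bucket of sockets bound to that exact pair
     * Conflict checks walk the (much shorter) bhash2 chain.
     */
    struct inet_bind2_bucket {
            possible_net_t          ib_net;
            unsigned short          port;
    #if IS_ENABLED(CONFIG_IPV6)
            struct in6_addr         v6_rcv_saddr;
    #endif
            __be32                  rcv_saddr;
            struct hlist_node       node;   /* chain within a bhash2 slot */
            struct hlist_head       owners; /* sockets hashed to this bucket */
    };
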
- May 20, 2022 (1 commit)

Submitted by Yuchung Cheng

This patch improves TCP PRR loss recovery behavior for a corner case. Previously, during PRR conservation-bound mode, it strictly sent an amount equal to the amount newly acked or SACKed. The patch changes this such that PRR may send an additional amount that was banked previously (e.g. application-limited) in conservation-bound mode, similar to the slow-start mode. This unifies and simplifies the algorithm further and may improve recovery latency. This change still follows the general packet conservation design principle and always keeps inflight/cwnd below the slow start threshold set by the congestion control module.

PRR is described in RFC 6937. We'll include this change in the latest revision, rfc6937-bis, as well.

Reported-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20220519003410.2531936-1-ycheng@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

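The PRR send quota from RFC 6937, as a heavily simplified sketch of the two modes the message contrasts (prr_send_quota is an illustrative name; the banked-credit behavior this patch adds is folded into the max()):

    static int prr_send_quota(const struct tcp_sock *tp,
                              int newly_acked_sacked, int delta)
    {
            int sndcnt;

            if (delta < 0) {        /* inflight above ssthresh: proportional reduction */
                    u64 dividend = (u64)tp->snd_ssthresh * tp->prr_delivered +
                                   tp->prior_cwnd - 1;
                    sndcnt = div_u64(dividend, tp->prior_cwnd) - tp->prr_out;
            } else {                /* conservation bound, capped by remaining headroom */
                    sndcnt = min(delta, max_t(int, tp->prr_delivered - tp->prr_out,
                                              newly_acked_sacked));
            }
            return max(sndcnt, 0);
    }
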
- May 18, 2022 (1 commit)

Submitted by Jason A. Donenfeld

random32.c has two random number generators in it: one that is meant to be used deterministically, with some predefined seed, and one that does the same exact thing as random.c, except does it poorly. The first one has some use cases. The second one no longer does and can be replaced with calls to random.c's proper random number generator.

The relatively recent siphash-based bad random32.c code was added in response to concerns that the prior random32.c was too deterministic. Out of fears that random.c was (at the time) too slow, this code was anonymously contributed. Then out of that emerged a kind of shadow entropy-gathering system, with its own tentacles throughout various net code, added willy nilly. Stop 👏 making 👏 bespoke 👏 random 👏 number 👏 generators 👏.

Fortunately, recent advances in random.c mean that we can stop playing with this sketchiness, and just use get_random_u32(), which is now fast enough. In micro benchmarks using RDPMC, I'm seeing the same median cycle count between the two functions, with the mean being _slightly_ higher due to batches refilling (which we can optimize further if need be). However, when doing *real* benchmarks of the net functions that actually use these random numbers, the mean cycles actually *decreased* slightly (with the median still staying the same), likely because the additional prandom code means icache misses and complexity, whereas random.c is generally already being used by something else nearby.

The biggest benefit of this is that there are many users of prandom who probably should be using cryptographically secure random numbers. This makes all of those accidental cases become secure by just flipping a switch. Later on, we can do a tree-wide cleanup to remove the static inline wrapper functions that this commit adds.

There are also some low-ish hanging fruits for making this even faster in the future: a get_random_u16() function for use in the networking stack will give a 2x performance boost there; using SIMD for ChaCha20 will let us compute 4, 8, or 16 blocks of output in parallel instead of just one, giving us large buffers for cheap; and introducing a get_random_*_bh() function that assumes irqs are already disabled will shave off a few cycles for ordinary calls. These are things we can chip away at down the road.

Acked-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>

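The "static inline wrapper" shape the message mentions, as a sketch (the real patch touches many prandom entry points):

    #include <linux/random.h>

    static inline u32 prandom_u32(void)
    {
            return get_random_u32();
    }

    static inline u32 prandom_u32_max(u32 ceil)
    {
            /* multiply-shift maps a uniform u32 onto [0, ceil) */
            return (u32)(((u64)get_random_u32() * ceil) >> 32);
    }
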
- May 16, 2022 (8 commits)

Submitted by Martin Willi

Commit 40867d74 ("net: Add l3mdev index to flow struct and avoid oif reset for port devices") introduces a flow key specific to layer 3 domains, such as a VRF master device. This allows for explicit VRF domain selection instead of abusing the oif flow key. Update ip[6]_route_me_harder() to make use of that new key when re-routing mangled packets within VRFs instead of setting the flow oif, making it consistent with other users.

Signed-off-by: Martin Willi <martin@strongswan.org>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

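A hedged sketch of the re-route flow setup (fill_reroute_key is an illustrative name; the point is selecting the VRF domain via the dedicated l3mdev key instead of overloading flowi4_oif):

    static void fill_reroute_key(struct flowi4 *fl4, const struct sk_buff *skb,
                                 const struct net_device *dev)
    {
            const struct iphdr *iph = ip_hdr(skb);

            memset(fl4, 0, sizeof(*fl4));
            fl4->daddr = iph->daddr;
            fl4->saddr = iph->saddr;
            fl4->flowi4_tos = RT_TOS(iph->tos);
            fl4->flowi4_l3mdev = l3mdev_master_ifindex(dev); /* was: flowi4_oif */
            fl4->flowi4_mark = skb->mark;
    }
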
Submitted by Menglong Dong

The 'drop_reason' passed to kfree_skb_reason() in tcp_v4_rcv() and tcp_v6_rcv() can be SKB_NOT_DROPPED_YET (0), as it is used as the return value of tcp_inbound_md5_hash(). This can panic the kernel with a NULL pointer dereference in net_dm_packet_report_size() if the reason is 0, since drop_reasons[0] is NULL.

Fixes: 1330b6ef ("skb: make drop reason booleanable")
Reviewed-by: Jiang Biao <benbjiang@tencent.com>
Reviewed-by: Hao Peng <flyingpeng@tencent.com>
Signed-off-by: Menglong Dong <imagedong@tencent.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

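A sketch of the guard implied here (fragment only; where exactly it lands in tcp_v{4,6}_rcv() is an assumption):

    /* Reason 0 (SKB_NOT_DROPPED_YET) must never reach kfree_skb_reason(),
     * since drop_reasons[0] is NULL in the drop monitor's string table.
     */
    if (drop_reason == SKB_NOT_DROPPED_YET)
            drop_reason = SKB_DROP_REASON_NOT_SPECIFIED;
    kfree_skb_reason(skb, drop_reason);
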
Submitted by Eric Dumazet

This is no longer a macro, but an inlined function: INET_MATCH() -> inet_match().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Suggested-by: Olivier Hartkopp <socketcan@hartkopp.net>
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

Submitted by Eric Dumazet

INET6_MATCH() runs without holding a lock on the socket, so we probably need to annotate most reads. This patch makes INET6_MATCH() an inline function to ease our changes.

v2: the inline function is only defined if IS_ENABLED(CONFIG_IPV6). Change the name to inet6_match(), as this is no longer a macro.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Submitted by Eric Dumazet

inet_csk_bind_conflict() can access sk->sk_bound_dev_if for unlocked sockets.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Submitted by Eric Dumazet

hystart_ack_delay() had the assumption that a TSO packet would not be bigger than GSO_MAX_SIZE. This will no longer be true, so we should use sk->sk_gso_max_size instead. This reduces the chance of spurious Hystart ACK-train detections.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

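A hedged sketch of the adjusted heuristic (the scaling factor and cap are assumptions; the change of substance is reading this socket's own GSO cap instead of the global constant):

    static u32 hystart_ack_delay(const struct sock *sk)
    {
            unsigned long rate;

            rate = READ_ONCE(sk->sk_pacing_rate);
            if (!rate)
                    return 0;
            /* transmit time (us) of a few max-size TSO bursts, capped at 1 ms */
            return min_t(u64, USEC_PER_MSEC,
                         div64_ul((u64)READ_ONCE(sk->sk_gso_max_size) * 4 * USEC_PER_SEC,
                                  rate));
    }
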
Submitted by Alexander Duyck

The code for gso_max_size was originally added to allow for debugging and working around buggy devices that couldn't support TSO with blocks 64K in size. The original reason for limiting it to 64K was that this was the existing limit of the IPv4 and non-jumbogram IPv6 length fields. With the addition of Big TCP, we can remove this limit and allow the value to potentially go up to UINT_MAX, bounded instead by the tso_max_size value.

So in order to support this, we need to go through and clean up the remaining users of the gso_max_size value so that values will cap at 64K for non-TCPv6 flows. In addition, we can clean up the GSO_MAX_SIZE value so that 64K becomes GSO_LEGACY_MAX_SIZE and UINT_MAX becomes the upper limit for GSO_MAX_SIZE.

v6: (edumazet) fixed a compile error if CONFIG_IPV6=n, in a new sk_trim_gso_size() helper. netif_set_tso_max_size() caps the requested TSO size with GSO_MAX_SIZE.

Signed-off-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

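The relationship between the constants, as a sketch (values follow the description above; the exact upstream definitions are an assumption):

    #define GSO_LEGACY_MAX_SIZE  65536u   /* old IPv4 / non-jumbogram IPv6 cap */
    #define GSO_MAX_SIZE         UINT_MAX /* Big TCP: bounded by tso_max_size */
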
Submitted by Eyal Birger

In IPv4, setting the "disable_policy" flag on a device means no policy should be enforced for traffic originating from the device. This was implemented by setting the DST_NOPOLICY flag in the dst based on the originating device.

However, dsts are cached in nexthops regardless of the originating devices, in which case the DST_NOPOLICY flag value may be incorrect.

Consider the following setup:

                         +------------------------------+
                         | ROUTER                       |
      +-------------+    | +-----------------+          |
      | ipsec src   |----|-|ipsec0           |          |
      +-------------+    | |disable_policy=0 |   +----+ |
                         | +-----------------+   |eth1|-|-----
      +-------------+    | +-----------------+   +----+ |
      | noipsec src |----|-|eth0             |          |
      +-------------+    | |disable_policy=1 |          |
                         | +-----------------+          |
                         +------------------------------+

where ROUTER has a default route towards eth1.

dst entries for traffic arriving from eth0 would have DST_NOPOLICY and would be cached, and can therefore be reused by traffic originating from ipsec0, skipping the policy check.

Fix by setting an IPSKB_NOPOLICY flag in IPCB and observing it instead of the DST in the IN/FWD IPv4 policy checks.

Fixes: 1da177e4 ("Linux-2.6.12-rc2")
Reported-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>

- May 14, 2022 (1 commit)

Submitted by Kevin Mitchell

The checksum is optional for UDP packets. However, nf_reject previously required a valid checksum to elicit a response such as ICMP_DEST_UNREACH. Add some logic to nf_reject_verify_csum to determine whether a UDP packet has a zero checksum and should therefore not be verified.

Signed-off-by: Kevin Mitchell <kevmitch@arista.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

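A hedged sketch of the added logic (udp_csum_absent is an illustrative name; a UDP checksum field of zero means "no checksum computed" per RFC 768, so such packets must be exempt from verification):

    static bool udp_csum_absent(const struct sk_buff *skb, int dataoff)
    {
            struct udphdr _uh;
            const struct udphdr *uh;

            uh = skb_header_pointer(skb, dataoff, sizeof(_uh), &_uh);
            return uh && uh->check == 0;
    }
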
- May 13, 2022 (5 commits)

Submitted by Eric Dumazet

This reverts commits:

0dad4087 ("tcp/dccp: get rid of inet_twsk_purge()")
d507204d ("tcp/dccp: add tw->tw_bslot")

As Leonard pointed out, a newly allocated netns can happen to reuse a freed 'struct net'. While TCP TW timers were covered by my patches, other things were not:

1) Lookups in the rx path (INET_MATCH() and INET6_MATCH()), as they look at the 4-tuple plus the 'struct net' pointer.
2) /proc/net/tcp[6] and inet_diag, for the same reason.
3) hashinfo->bhash[], for the same reason.

Fixing all this seems risky; let's revert instead. In the future, we might have a per-netns TCP hash table, or a per-netns list of timewait sockets...

Fixes: 0dad4087 ("tcp/dccp: get rid of inet_twsk_purge()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Leonard Crestez <cdleonard@gmail.com>
Tested-by: Leonard Crestez <cdleonard@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Submitted by Eric Dumazet

INET_MATCH() runs without holding a lock on the socket, so we probably need to annotate most reads. This patch makes INET_MATCH() an inline function to ease our changes.

v2: We remove the 32-bit version of it, as modern compilers should generate the same code anyway; no need to try to be smarter. Also make 'struct net *net' the first argument.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

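Roughly what the inline form looks like (a sketch; treat the exact body as an assumption — the annotated unlocked read is the point of the conversion):

    static inline bool inet_match(struct net *net, const struct sock *sk,
                                  const __addrpair cookie, const __portpair ports,
                                  int dif, int sdif)
    {
            if (!net_eq(sock_net(sk), net) ||
                sk->sk_portpair != ports ||
                sk->sk_addrpair != cookie)
                    return false;

            /* READ_ONCE() pairs with the WRITE_ONCE() at the bind site */
            return inet_sk_bound_dev_eq(net, READ_ONCE(sk->sk_bound_dev_if),
                                        dif, sdif);
    }
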
Submitted by Martin KaFai Lau

The listen sk is currently stored in two hash tables: listening_hash (hashed by port) and lhash2 (hashed by port and address). After commit 0ee58dad ("net: tcp6: prefer listeners bound to an address") and commit d9fbc7f6 ("net: tcp: prefer listeners bound to an address"), the TCP-SYN lookup fast path does not use listening_hash. Commit 05c0b357 ("tcp: seq_file: Replace listening_hash with lhash2") also moved the seq_file (/proc/net/tcp) iteration usage from listening_hash to lhash2.

There are still a few listening_hash usages left. One of them is inet_reuseport_add_sock(), which uses the listening_hash to search for a listen sk during the listen() system call. This turns out to be very slow on use cases that listen on many different VIPs at a popular port (e.g. 443) [ on top of the slowness of adding to the tail in the IPv6 case ]. The latter patch has a selftest to demonstrate this case.

This patch takes the chance to move all remaining listening_hash usages to lhash2 and then retire listening_hash. Since most changes need to be done together, it is hard to cut the listening_hash-to-lhash2 switch into small patches. The changes in this patch are highlighted here for review purposes:

1. Because of the listening_hash removal, lhash2 can use sk->sk_nulls_node instead of icsk->icsk_listen_portaddr_node. This also keeps the sk_unhashed() check working as-is after we stop adding sk to listening_hash. The union is removed from inet_listen_hashbucket because only nulls_head is needed.

2. icsk->icsk_listen_portaddr_node and its helpers are removed.

3. The current lhash2 users need to iterate with sk_nulls_node instead of icsk_listen_portaddr_node. One case is in inet[6]_lhash2_lookup(). Another case is the seq_file iterator in tcp_ipv4.c. One thing to note is that sk_nulls_next() is needed because the old inet_lhash2_for_each_icsk_continue() did a "next" first before iterating.

4. Move the remaining listening_hash usage to lhash2: inet_reuseport_add_sock(), which this series is trying to improve. inet_diag.c and mptcp_diag.c are the final two remaining use cases and are moved to lhash2 now as well.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Submitted by Martin KaFai Lau

This patch folds the lhash2-related functions into __inet_hash and inet_unhash, which will make the removal of the listening_hash in a latter patch easier to review.

First, this patch folds inet_hash2 into __inet_hash. For unhash, the current call sequence is inet_unhash() => __inet_unhash() => inet_unhash2(). The specific test cases in __inet_unhash() are mostly related to a TCP_LISTEN sk, and its caller inet_unhash() already has the TCP_LISTEN test, so this patch folds both __inet_unhash() and inet_unhash2() into inet_unhash().

Note that all listening_hash users also have lhash2 initialized, so the !h->lhash2 check is no longer needed.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Submitted by Martin KaFai Lau

After commit 0ee58dad ("net: tcp6: prefer listeners bound to an address") and commit d9fbc7f6 ("net: tcp: prefer listeners bound to an address"), the count is no longer used. This patch removes it.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

- May 7, 2022 (1 commit)

Submitted by Lokesh Dhoundiyal

kmemleak reports the following when routing multicast traffic over an ipsec tunnel.

Kmemleak output:

    unreferenced object 0x8000000044bebb00 (size 256):
      comm "softirq", pid 0, jiffies 4294985356 (age 126.810s)
      hex dump (first 32 bytes):
        00 00 00 00 00 00 00 00 80 00 00 00 05 13 74 80  ..............t.
        80 00 00 00 04 9b bf f9 00 00 00 00 00 00 00 00  ................
      backtrace:
        [<00000000f83947e0>] __kmalloc+0x1e8/0x300
        [<00000000b7ed8dca>] metadata_dst_alloc+0x24/0x58
        [<0000000081d32c20>] __ipgre_rcv+0x100/0x2b8
        [<00000000824f6cf1>] gre_rcv+0x178/0x540
        [<00000000ccd4e162>] gre_rcv+0x7c/0xd8
        [<00000000c024b148>] ip_protocol_deliver_rcu+0x124/0x350
        [<000000006a483377>] ip_local_deliver_finish+0x54/0x68
        [<00000000d9271b3a>] ip_local_deliver+0x128/0x168
        [<00000000bd4968ae>] xfrm_trans_reinject+0xb8/0xf8
        [<0000000071672a19>] tasklet_action_common.isra.16+0xc4/0x1b0
        [<0000000062e9c336>] __do_softirq+0x1fc/0x3e0
        [<00000000013d7914>] irq_exit+0xc4/0xe0
        [<00000000a4d73e90>] plat_irq_dispatch+0x7c/0x108
        [<000000000751eb8e>] handle_int+0x16c/0x178
        [<000000001668023b>] _raw_spin_unlock_irqrestore+0x1c/0x28

The metadata dst is leaked when ip_route_input_mc() updates the dst for the skb. Commit f38a9eb1 ("dst: Metadata destinations") correctly handled dropping the dst in ip_route_input_slow() but missed the multicast case, which is handled by ip_route_input_mc(). Drop the dst in ip_route_input_mc(), avoiding the leak.

Fixes: f38a9eb1 ("dst: Metadata destinations")
Signed-off-by: Lokesh Dhoundiyal <lokesh.dhoundiyal@alliedtelesis.co.nz>
Signed-off-by: Chris Packham <chris.packham@alliedtelesis.co.nz>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/r/20220505020017.3111846-1-chris.packham@alliedtelesis.co.nz
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

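The shape of such a fix, as a hedged two-line sketch (fragment from the end of ip_route_input_mc(); surrounding code omitted and the exact placement is an assumption):

    /* Drop any dst already attached (e.g. a metadata dst set by GRE
     * collect_md) before installing the multicast route, so it is not leaked.
     */
    skb_dst_drop(skb);
    skb_dst_set(skb, &rth->dst);
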
- May 6, 2022 (1 commit)

Submitted by Leon Romanovsky

After the removal of Innova IPsec support from the mlx5 driver, the last user of XFRM_ESP_NO_TRAILER is gone too. This means that we can safely remove it, as no other hardware is capable of (or needs) removing the ESP trailer.

Reviewed-by: Raed Salem <raeds@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>