- 15 5月, 2020 1 次提交
-
-
由 xuanzhuo 提交于
to #26353046 TcpRT: Instrument and Diagnostic Analysis System for Service Quality of Cloud Databases at Massive Scale in Real-time. It can also provide information for all request/response services. Such as HTTP request. This is the kernel framework for tcprt, more work needs tcprt module support. TcpRt module should call tcp_unregitsert_rt before rmmod. TcpRt hooks will be called when sock init, recv data, send data, packet acked and socket been destroy. The private data save to icsk->icsk_tcp_rt_priv. Reviewed-by: NCambda Zhu <cambda@linux.alibaba.com> Acked-by: NDust Li <dust.li@linux.alibaba.com> Signed-off-by: Nxuanzhuo <xuanzhuo@linux.alibaba.com>
-
- 06 5月, 2020 2 次提交
-
-
由 Sabrina Dubroca 提交于
to #24913189 commit 6c8991f41546c3c472503dff1ea9daaddf9331c2 upstream. ipv6_stub uses the ip6_dst_lookup function to allow other modules to perform IPv6 lookups. However, this function skips the XFRM layer entirely. All users of ipv6_stub->ip6_dst_lookup use ip_route_output_flow (via the ip_route_output_key and ip_route_output helpers) for their IPv4 lookups, which calls xfrm_lookup_route(). This patch fixes this inconsistent behavior by switching the stub to ip6_dst_lookup_flow, which also calls xfrm_lookup_route(). This requires some changes in all the callers, as these two functions take different arguments and have different return types. Fixes: 5f81bd2e ("ipv6: export a stub for IPv6 symbols used by vxlan") Reported-by: NXiumei Mu <xmu@redhat.com> Signed-off-by: NSabrina Dubroca <sd@queasysnail.net> Signed-off-by: NDavid S. Miller <davem@davemloft.net> [bwh: Backported to 4.19: - Drop change in lwt_bpf.c - Delete now-unused "ret" in mlx5e_route_lookup_ipv6() - Initialise "out_dev" in mlx5e_create_encap_header_ipv6() to avoid introducing a spurious "may be used uninitialised" warning - Adjust filenames, context, indentation] Signed-off-by: NBen Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: NSasha Levin <sashal@kernel.org> References: CVE-2020-1749 Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com> Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
-
由 Sabrina Dubroca 提交于
to #24913189 commit c4e85f73afb6384123e5ef1bba3315b2e3ad031e upstream. This will be used in the conversion of ipv6_stub to ip6_dst_lookup_flow, as some modules currently pass a net argument without a socket to ip6_dst_lookup. This is equivalent to commit 343d60aa ("ipv6: change ipv6_stub_impl.ipv6_dst_lookup to take net argument"). Signed-off-by: NSabrina Dubroca <sd@queasysnail.net> Signed-off-by: NDavid S. Miller <davem@davemloft.net> [bwh: Backported to 4.19: adjust context] Signed-off-by: NBen Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: NSasha Levin <sashal@kernel.org> References: CVE-2020-1749 [zsl: fixes conflicts in net/sctp/ipv6.c] Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com> Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
-
- 22 4月, 2020 1 次提交
-
-
由 Yihao Wu 提交于
fix #25707555 commit 43e33924c38e8faeb0c12035481cb150e602e39d linux-next Deleting list entry within hlist_for_each_entry_safe is not safe unless next pointer (tmp) is protected too. It's not, because once hash_lock is released, cache_clean may delete the entry that tmp points to. Then cache_purge can walk to a deleted entry and tries to double free it. Fix this bug by holding only the deleted entry's reference. Suggested-by: NNeilBrown <neilb@suse.de> Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com> Reviewed-by: NNeilBrown <neilb@suse.de> [ cel: removed unused variable ] Signed-off-by: NChuck Lever <chuck.lever@oracle.com> Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
- 15 4月, 2020 1 次提交
-
-
由 xuanzhuo 提交于
fix #24463023 This reverts commit adb03115. Related Links: https://lkml.org/lkml/2019/7/24/243 https://lore.kernel.org/lkml/b0160f4b-b996-b0ee-405a-3d5f1866272e@gmail.com/ https://lore.kernel.org/lkml/20181101172739.GA3196@hirez.programming.kicks-ass.net/ test methods: 1. add dummy net dev. "ip link add pps_dummy0 type dummy" 2. set the dummy dev with addr 10.10.10.1 3. send numerous udp packets to 10.10.10.2(fake addr) by dummy dev test command: sockperf tp -m 14 -t 20 --mps=max -i 10.10.10.2 -p 11111 By default, the ip_idents_reserve function will be called to distribute the identities of the identities in the ip layer without the DF flag. Use this method to stress test this function. After testing under vm, after the concurrent CPU exceeds 12, the old patch pps will no longer rise. The data for concurrent CPU 32 is as follows: test without atomic_add_return: 11:32:10 AM pps_dummy0 0.00 8008897.00 0.00 437986.55 0.00 0.00 0.00 11:32:11 AM pps_dummy0 0.00 7992910.00 0.00 437112.27 0.00 0.00 0.00 11:32:12 AM pps_dummy0 0.00 7982553.00 0.00 436545.87 0.00 0.00 0.00 11:32:13 AM pps_dummy0 0.00 7977757.00 0.00 436283.59 0.00 0.00 0.00 11:32:14 AM pps_dummy0 0.00 7968355.00 0.00 435769.41 0.00 0.00 0.00 test with atomic_add_return: 11:33:20 AM pps_dummy0 0.00 16024069.00 0.00 876316.27 0.00 0.00 0.00 11:33:21 AM pps_dummy0 0.00 16024252.00 0.00 876326.28 0.00 0.00 0.00 11:33:22 AM pps_dummy0 0.00 16021639.00 0.00 876183.38 0.00 0.00 0.00 11:33:23 AM pps_dummy0 0.00 16018738.00 0.00 876024.73 0.00 0.00 0.00 11:33:24 AM pps_dummy0 0.00 16022333.00 0.00 876221.34 0.00 0.00 0.00 11:33:25 AM pps_dummy0 0.00 16028147.00 0.00 876539.29 0.00 0.00 0.00 Signed-off-by: Nxuanzhuo <xuanzhuo@linux.alibaba.com> Acked-by: NDust Li <dust.li@linux.alibaba.com>
-
- 18 3月, 2020 12 次提交
-
-
由 YueHaibing 提交于
commit 1d3ff0950e2b40dc861b1739029649d03f591820 upstream. [ Fixes: CVE-2019-20096 ] If dccp_feat_push_change fails, we forget free the mem which is alloced by kmemdup in dccp_feat_clone_sp_val. Reported-by: NHulk Robot <hulkci@huawei.com> Fixes: e8ef967a ("dccp: Registration routines for changing feature values") Reviewed-by: NMukesh Ojha <mojha@codeaurora.org> Signed-off-by: NYueHaibing <yuehaibing@huawei.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net> Signed-off-by: NBen Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com> Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
-
由 Zou Cao 提交于
Now hookers has been support in arm64, add the Kconfig depend for arm64, otherwise it won't be built. Signed-off-by: NZou Cao <zoucao@linux.alibaba.com> Acked-by: NXunlei Pang <xlpang@linux.alibaba.com>
-
由 Zou Cao 提交于
arm64 has't global write protect set, here we remove the write proctect in pmd table and update the hook data. Signed-off-by: NZou Cao <zoucao@linux.alibaba.com> Acked-by: NXunlei Pang <xlpang@linux.alibaba.com>
-
由 Florian Westphal 提交于
commit 294304e4c522d797b7ea8200ab74354843fa68e9 upstream We have no explicit signal when a UDP stream has terminated, peers just stop sending. For suspected stream connections a timeout of two minutes is sane to keep NAT mapping alive a while longer. It matches tcp conntracks 'timewait' default timeout value. Signed-off-by: NFlorian Westphal <fw@strlen.de> Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: NTony Lu <tonylu@linux.alibaba.com> Acked-by: NDust Li <dust.li@linux.alibaba.com>
-
由 Florian Westphal 提交于
commit d535c8a69c1924e70186d80be0a9cecaf475f166 upstream Currently DNS resolvers that send both A and AAAA queries from same source port can trigger stream mode prematurely, which results in non-early-evictable conntrack entry for three minutes, even though DNS requests are done in a few milliseconds. Add a two second grace period where we continue to use the ordinary 30-second default timeout. Its enough for DNS request/response traffic, even if two request/reply packets are involved. ASSURED is still set, else conntrack (and thus a possible NAT mapping ...) gets zapped too in case conntrack table runs full. Signed-off-by: NFlorian Westphal <fw@strlen.de> Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: NTony Lu <tonylu@linux.alibaba.com> Acked-by: NDust Li <dust.li@linux.alibaba.com>
-
由 Jens Axboe 提交于
commit de2ea4b64b75a79ed9cdf9bf30e0e197901084e4 upstream. This is identical to __sys_accept4(), except it takes a struct file instead of an fd, and it also allows passing in extra file->f_flags flags. The latter is done to support masking in O_NONBLOCK without manipulating the original file flags. No functional changes in this patch. Cc: netdev@vger.kernel.org Acked-by: NDavid S. Miller <davem@davemloft.net> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
由 Eric Dumazet 提交于
[ Upstream commit 2bec445f9bf35e52e395b971df48d3e1e5dc704a ] Latest commit 853697504de0 ("tcp: Fix highest_sack and highest_sack_seq") apparently allowed syzbot to trigger various crashes in TCP stack [1] I believe this commit only made things easier for syzbot to find its way into triggering use-after-frees. But really the bugs could lead to bad TCP behavior or even plain crashes even for non malicious peers. I have audited all calls to tcp_rtx_queue_unlink() and tcp_rtx_queue_unlink_and_free() and made sure tp->highest_sack would be updated if we are removing from rtx queue the skb that tp->highest_sack points to. These updates were missing in three locations : 1) tcp_clean_rtx_queue() [This one seems quite serious, I have no idea why this was not caught earlier] 2) tcp_rtx_queue_purge() [Probably not a big deal for normal operations] 3) tcp_send_synack() [Probably not a big deal for normal operations] [1] BUG: KASAN: use-after-free in tcp_highest_sack_seq include/net/tcp.h:1864 [inline] BUG: KASAN: use-after-free in tcp_highest_sack_seq include/net/tcp.h:1856 [inline] BUG: KASAN: use-after-free in tcp_check_sack_reordering+0x33c/0x3a0 net/ipv4/tcp_input.c:891 Read of size 4 at addr ffff8880a488d068 by task ksoftirqd/1/16 CPU: 1 PID: 16 Comm: ksoftirqd/1 Not tainted 5.5.0-rc5-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x197/0x210 lib/dump_stack.c:118 print_address_description.constprop.0.cold+0xd4/0x30b mm/kasan/report.c:374 __kasan_report.cold+0x1b/0x41 mm/kasan/report.c:506 kasan_report+0x12/0x20 mm/kasan/common.c:639 __asan_report_load4_noabort+0x14/0x20 mm/kasan/generic_report.c:134 tcp_highest_sack_seq include/net/tcp.h:1864 [inline] tcp_highest_sack_seq include/net/tcp.h:1856 [inline] tcp_check_sack_reordering+0x33c/0x3a0 net/ipv4/tcp_input.c:891 tcp_try_undo_partial net/ipv4/tcp_input.c:2730 [inline] tcp_fastretrans_alert+0xf74/0x23f0 net/ipv4/tcp_input.c:2847 tcp_ack+0x2577/0x5bf0 net/ipv4/tcp_input.c:3710 tcp_rcv_established+0x6dd/0x1e90 net/ipv4/tcp_input.c:5706 tcp_v4_do_rcv+0x619/0x8d0 net/ipv4/tcp_ipv4.c:1619 tcp_v4_rcv+0x307f/0x3b40 net/ipv4/tcp_ipv4.c:2001 ip_protocol_deliver_rcu+0x5a/0x880 net/ipv4/ip_input.c:204 ip_local_deliver_finish+0x23b/0x380 net/ipv4/ip_input.c:231 NF_HOOK include/linux/netfilter.h:307 [inline] NF_HOOK include/linux/netfilter.h:301 [inline] ip_local_deliver+0x1e9/0x520 net/ipv4/ip_input.c:252 dst_input include/net/dst.h:442 [inline] ip_rcv_finish+0x1db/0x2f0 net/ipv4/ip_input.c:428 NF_HOOK include/linux/netfilter.h:307 [inline] NF_HOOK include/linux/netfilter.h:301 [inline] ip_rcv+0xe8/0x3f0 net/ipv4/ip_input.c:538 __netif_receive_skb_one_core+0x113/0x1a0 net/core/dev.c:5148 __netif_receive_skb+0x2c/0x1d0 net/core/dev.c:5262 process_backlog+0x206/0x750 net/core/dev.c:6093 napi_poll net/core/dev.c:6530 [inline] net_rx_action+0x508/0x1120 net/core/dev.c:6598 __do_softirq+0x262/0x98c kernel/softirq.c:292 run_ksoftirqd kernel/softirq.c:603 [inline] run_ksoftirqd+0x8e/0x110 kernel/softirq.c:595 smpboot_thread_fn+0x6a3/0xa40 kernel/smpboot.c:165 kthread+0x361/0x430 kernel/kthread.c:255 ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352 Allocated by task 10091: save_stack+0x23/0x90 mm/kasan/common.c:72 set_track mm/kasan/common.c:80 [inline] __kasan_kmalloc mm/kasan/common.c:513 [inline] __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:486 kasan_slab_alloc+0xf/0x20 mm/kasan/common.c:521 slab_post_alloc_hook mm/slab.h:584 [inline] slab_alloc_node mm/slab.c:3263 [inline] kmem_cache_alloc_node+0x138/0x740 mm/slab.c:3575 __alloc_skb+0xd5/0x5e0 net/core/skbuff.c:198 alloc_skb_fclone include/linux/skbuff.h:1099 [inline] sk_stream_alloc_skb net/ipv4/tcp.c:875 [inline] sk_stream_alloc_skb+0x113/0xc90 net/ipv4/tcp.c:852 tcp_sendmsg_locked+0xcf9/0x3470 net/ipv4/tcp.c:1282 tcp_sendmsg+0x30/0x50 net/ipv4/tcp.c:1432 inet_sendmsg+0x9e/0xe0 net/ipv4/af_inet.c:807 sock_sendmsg_nosec net/socket.c:652 [inline] sock_sendmsg+0xd7/0x130 net/socket.c:672 __sys_sendto+0x262/0x380 net/socket.c:1998 __do_sys_sendto net/socket.c:2010 [inline] __se_sys_sendto net/socket.c:2006 [inline] __x64_sys_sendto+0xe1/0x1a0 net/socket.c:2006 do_syscall_64+0xfa/0x790 arch/x86/entry/common.c:294 entry_SYSCALL_64_after_hwframe+0x49/0xbe Freed by task 10095: save_stack+0x23/0x90 mm/kasan/common.c:72 set_track mm/kasan/common.c:80 [inline] kasan_set_free_info mm/kasan/common.c:335 [inline] __kasan_slab_free+0x102/0x150 mm/kasan/common.c:474 kasan_slab_free+0xe/0x10 mm/kasan/common.c:483 __cache_free mm/slab.c:3426 [inline] kmem_cache_free+0x86/0x320 mm/slab.c:3694 kfree_skbmem+0x178/0x1c0 net/core/skbuff.c:645 __kfree_skb+0x1e/0x30 net/core/skbuff.c:681 sk_eat_skb include/net/sock.h:2453 [inline] tcp_recvmsg+0x1252/0x2930 net/ipv4/tcp.c:2166 inet_recvmsg+0x136/0x610 net/ipv4/af_inet.c:838 sock_recvmsg_nosec net/socket.c:886 [inline] sock_recvmsg net/socket.c:904 [inline] sock_recvmsg+0xce/0x110 net/socket.c:900 __sys_recvfrom+0x1ff/0x350 net/socket.c:2055 __do_sys_recvfrom net/socket.c:2073 [inline] __se_sys_recvfrom net/socket.c:2069 [inline] __x64_sys_recvfrom+0xe1/0x1a0 net/socket.c:2069 do_syscall_64+0xfa/0x790 arch/x86/entry/common.c:294 entry_SYSCALL_64_after_hwframe+0x49/0xbe The buggy address belongs to the object at ffff8880a488d040 which belongs to the cache skbuff_fclone_cache of size 456 The buggy address is located 40 bytes inside of 456-byte region [ffff8880a488d040, ffff8880a488d208) The buggy address belongs to the page: page:ffffea0002922340 refcount:1 mapcount:0 mapping:ffff88821b057000 index:0x0 raw: 00fffe0000000200 ffffea00022a5788 ffffea0002624a48 ffff88821b057000 raw: 0000000000000000 ffff8880a488d040 0000000100000006 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff8880a488cf00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ffff8880a488cf80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc >ffff8880a488d000: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb ^ ffff8880a488d080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff8880a488d100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb Fixes: 853697504de0 ("tcp: Fix highest_sack and highest_sack_seq") Fixes: 50895b9d ("tcp: highest_sack fix") Fixes: 737ff314 ("tcp: use sequence distance to detect reordering") Signed-off-by: NEric Dumazet <edumazet@google.com> Cc: Cambda Zhu <cambda@linux.alibaba.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Acked-by: NNeal Cardwell <ncardwell@google.com> Acked-by: NYuchung Cheng <ycheng@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NTony Lu <tonylu@linux.alibaba.com> Acked-by: NDust Li <dust.li@linux.alibaba.com>
-
由 Jens Axboe 提交于
commit aa1fa28fc73ea6b740ee7b62bf3b07141883dbb8 upstream. This is done through IORING_OP_RECVMSG. This opcode uses the same sqe->msg_flags that IORING_OP_SENDMSG added, and we pass in the msghdr struct in the sqe->addr field as well. We use MSG_DONTWAIT to force an inline fast path if recvmsg() doesn't block, and punt to async execution if it would have. Acked-by: NDavid S. Miller <davem@davemloft.net> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
由 Jens Axboe 提交于
commit 0fa03c624d8fc9932d0f27c39a9deca6a37e0e17 upstream. This is done through IORING_OP_SENDMSG. There's a new sqe->msg_flags for the flags argument, and the msghdr struct is passed in the sqe->addr field. We use MSG_DONTWAIT to force an inline fast path if sendmsg() doesn't block, and punt to async execution if it would have. Acked-by: NDavid S. Miller <davem@davemloft.net> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
由 Jens Axboe 提交于
commit 87e5e6dab6c2a21fab2620f37786276d202e2ce0 upstream. Currently these functions return < 0 on error, and 0 for success. Change that so that we return < 0 on error, but number of bytes for success. Some callers already treat the return value that way, others need a slight tweak. Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
由 Thomas Higdon 提交于
commit 8f7baad7f03543451af27f5380fc816b008aa1f2 upstream Neal Cardwell mentioned that snd_wnd would be useful for diagnosing TCP performance problems -- > (1) Usually when we're diagnosing TCP performance problems, we do so > from the sender, since the sender makes most of the > performance-critical decisions (cwnd, pacing, TSO size, TSQ, etc). > From the sender-side the thing that would be most useful is to see > tp->snd_wnd, the receive window that the receiver has advertised to > the sender. This serves the purpose of adding an additional __u32 to avoid the would-be hole caused by the addition of the tcpi_rcvi_ooopack field. Signed-off-by: NThomas Higdon <tph@fb.com> Acked-by: NYuchung Cheng <ycheng@google.com> Acked-by: NNeal Cardwell <ncardwell@google.com> Acked-by: NSoheil Hassas Yeganeh <soheil@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net> Signed-off-by: NTony Lu <tonylu@linux.alibaba.com> Acked-by: NDust Li <dust.li@linux.alibaba.com>
-
由 Thomas Higdon 提交于
commit f9af2dbbfe01def62765a58af7fbc488351893c3 upstream For receive-heavy cases on the server-side, we want to track the connection quality for individual client IPs. This counter, similar to the existing system-wide TCPOFOQueue counter in /proc/net/netstat, tracks out-of-order packet reception. By providing this counter in TCP_INFO, it will allow understanding to what degree receive-heavy sockets are experiencing out-of-order delivery and packet drops indicating congestion. Please note that this is similar to the counter in NetBSD TCP_INFO, and has the same name. Also note that we avoid increasing the size of the tcp_sock struct by taking advantage of a hole. Signed-off-by: NThomas Higdon <tph@fb.com> Acked-by: NNeal Cardwell <ncardwell@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net> Signed-off-by: NTony Lu <tonylu@linux.alibaba.com> Acked-by: NDust Li <dust.li@linux.alibaba.com>
-
- 17 1月, 2020 5 次提交
-
-
由 Jens Axboe 提交于
commit f4e65870e5cede5ca1ec0006b6c9803994e5f7b8 upstream. We need this functionality for the io_uring file registration, but we cannot rely on it since CONFIG_UNIX can be modular. Move the helpers to a separate file, that's always builtin to the kernel if CONFIG_UNIX is m/y. No functional changes in this patch, just moving code around. Reviewed-by: NHannes Reinecke <hare@suse.com> Acked-by: NDavid S. Miller <davem@davemloft.net> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
-
由 Jens Axboe 提交于
commit 2b188cc1bb857a9d4701ae59aa7768b5124e262e upstream. The submission queue (SQ) and completion queue (CQ) rings are shared between the application and the kernel. This eliminates the need to copy data back and forth to submit and complete IO. IO submissions use the io_uring_sqe data structure, and completions are generated in the form of io_uring_cqe data structures. The SQ ring is an index into the io_uring_sqe array, which makes it possible to submit a batch of IOs without them being contiguous in the ring. The CQ ring is always contiguous, as completion events are inherently unordered, and hence any io_uring_cqe entry can point back to an arbitrary submission. Two new system calls are added for this: io_uring_setup(entries, params) Sets up an io_uring instance for doing async IO. On success, returns a file descriptor that the application can mmap to gain access to the SQ ring, CQ ring, and io_uring_sqes. io_uring_enter(fd, to_submit, min_complete, flags, sigset, sigsetsize) Initiates IO against the rings mapped to this fd, or waits for them to complete, or both. The behavior is controlled by the parameters passed in. If 'to_submit' is non-zero, then we'll try and submit new IO. If IORING_ENTER_GETEVENTS is set, the kernel will wait for 'min_complete' events, if they aren't already available. It's valid to set IORING_ENTER_GETEVENTS and 'min_complete' == 0 at the same time, this allows the kernel to return already completed events without waiting for them. This is useful only for polling, as for IRQ driven IO, the application can just check the CQ ring without entering the kernel. With this setup, it's possible to do async IO with a single system call. Future developments will enable polled IO with this interface, and polled submission as well. The latter will enable an application to do IO without doing ANY system calls at all. For IRQ driven IO, an application only needs to enter the kernel for completions if it wants to wait for them to occur. Each io_uring is backed by a workqueue, to support buffered async IO as well. We will only punt to an async context if the command would need to wait for IO on the device side. Any data that can be accessed directly in the page cache is done inline. This avoids the slowness issue of usual threadpools, since cached data is accessed as quickly as a sync interface. Sample application: http://git.kernel.dk/cgit/fio/plain/t/io_uring.cReviewed-by: NHannes Reinecke <hare@suse.com> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NJeffle Xu <jefflexu@linux.alibaba.com> Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
-
由 David Howells 提交于
commit aa563d7bca6e882ec2bdae24603c8f016401a144 upstream. In the iov_iter struct, separate the iterator type from the iterator direction and use accessor functions to access them in most places. Convert a bunch of places to use switch-statements to access them rather then chains of bitwise-AND statements. This makes it easier to add further iterator types. Also, this can be more efficient as to implement a switch of small contiguous integers, the compiler can use ~50% fewer compare instructions than it has to use bitwise-and instructions. Further, cease passing the iterator type into the iterator setup function. The iterator function can set that itself. Only the direction is required. Signed-off-by: NDavid Howells <dhowells@redhat.com> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Reviewed-by: NJeffle Xu <jefflexu@linux.alibaba.com> Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
-
由 David Howells 提交于
commit 00e23707442a75b404392cef1405ab4fd498de6b upstream. Use accessor functions to access an iterator's type and direction. This allows for the possibility of using some other method of determining the type of iterator than if-chains with bitwise-AND conditions. Signed-off-by: NDavid Howells <dhowells@redhat.com> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Reviewed-by: NJeffle Xu <jefflexu@linux.alibaba.com> Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
-
由 Cambda Zhu 提交于
commit 853697504de043ff0bfd815bd3a64de1dce73dc7 upstream. >From commit 50895b9d ("tcp: highest_sack fix"), the logic about setting tp->highest_sack to the head of the send queue was removed. Of course the logic is error prone, but it is logical. Before we remove the pointer to the highest sack skb and use the seq instead, we need to set tp->highest_sack to NULL when there is no skb after the last sack, and then replace NULL with the real skb when new skb inserted into the rtx queue, because the NULL means the highest sack seq is tp->snd_nxt. If tp->highest_sack is NULL and new data sent, the next ACK with sack option will increase tp->reordering unexpectedly. This patch sets tp->highest_sack to the tail of the rtx queue if it's NULL and new data is sent. The patch keeps the rule that the highest_sack can only be maintained by sack processing, except for this only case. Fixes: 50895b9d ("tcp: highest_sack fix") Signed-off-by: NCambda Zhu <cambda@linux.alibaba.com> Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Acked-by: NDust Li <dust.li@linux.alibaba.com>
-
- 27 12月, 2019 6 次提交
-
-
由 Gen Zhang 提交于
commit 425aa0e1d01513437668fa3d4a971168bbaa8515 upstream. In function ip_ra_control(), the pointer new_ra is allocated a memory space via kmalloc(). And it is used in the following codes. However, when there is a memory allocation error, kmalloc() fails. Thus null pointer dereference may happen. And it will cause the kernel to crash. Therefore, we should check the return value and handle the error. Signed-off-by: NGen Zhang <blackgod016574@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net> Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com> Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Gen Zhang 提交于
commit 95baa60a0da80a0143e3ddd4d3725758b4513825 upstream. In function ip6_ra_control(), the pointer new_ra is allocated a memory space via kmalloc(). And it is used in the following codes. However, when there is a memory allocation error, kmalloc() fails. Thus null pointer dereference may happen. And it will cause the kernel to crash. Therefore, we should check the return value and handle the error. Signed-off-by: NGen Zhang <blackgod016574@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net> Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com> Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 George Zhang 提交于
By default the tcp_tw_timeout value is 60 seconds. The minimum is 1 second and the maximum is 600. This setting is useful on system under heavy tcp load. NOTE: set the tcp_tw_timeout below 60 seconds voilates the "quiet time" restriction, and make your system into the risk of causing some old data to be accepted as new or new data rejected as old duplicated by some receivers. Link: http://web.archive.org/web/20150102003320/http://tools.ietf.org/html/rfc793Signed-off-by: NGeorge Zhang <georgezhang@linux.alibaba.com> Signed-off-by: NJiufei Xue <jiufei.xue@linux.alibaba.com> Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 kbuild test robot 提交于
lkp-build bot reported the following link error with ipv6 disabled: ld: net/hookers/hookers.o:(.data+0x40): undefined reference to `ipv6_specific' ld: net/hookers/hookers.o:(.data+0x78): undefined reference to `ipv6_mapped' ld: net/hookers/hookers.o:(.data+0xe8): undefined reference to `inet6_stream_ops' Fixed this issue by adding IS_ENABLED(CONFIG_IPV6) check. Reported-by: Nkbuild test robot <lkp@intel.com> Signed-off-by: NCaspar Zhang <caspar@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Caspar Zhang 提交于
read/write_cr0() are used in net/hookers.c, but they are only available on x86 platform. Adding a depend-on fields in Kconfig to disable this feature in other platforms. Reported-by: Nkbuild test robot <lkp@intel.com> Signed-off-by: NCaspar Zhang <caspar@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 George Zhang 提交于
LVS fullnat will replace network traffic's source ip with its local ip, and thus the backend servers cannot obtain the real client ip. To solve this, LVS has introduced the tcp option address (TOA) to store the essential ip address information in the last tcp ack packet of the 3-way handshake, and the backend servers need to retrieve it from the packet header. In this patch, we have introduced the sk_toa_data member in the sock structure to hold the TOA information. There used to be an in-tree module for TOA managing, whereas it has now been maintained as an standalone module. In this case, the toa module should register its hook function(s) using the provided interfaces in the hookers module. TOA in sock structure: __be32 sk_toa_data[16]; The hookers module only provides the sk_toa_data placeholder, and the toa module can use this variable through the layout it needs. Hook interfaces: The hookers module replaces the kernel's syn_recv_sock and getname handler with a stub that chains the toa module's hook function(s) to the original handling function. The hookers module allows hook functions to be installed and uninstalled in any order. toa module: The external toa module will be provided in separate RPM package. [xuyu@linux.alibaba.com: amend commit log] Signed-off-by: NGeorge Zhang <georgezhang@linux.alibaba.com> Signed-off-by: NXu Yu <xuyu@linux.alibaba.com> Reviewed-by: NCaspar Zhang <caspar@linux.alibaba.com>
-
- 21 12月, 2019 8 次提交
-
-
由 Taehee Yoo 提交于
[ Upstream commit 9cf1cd8ee3ee09ef2859017df2058e2f53c5347f ] In order to set/get/dump, the tipc uses the generic netlink infrastructure. So, when tipc module is inserted, init function calls genl_register_family(). After genl_register_family(), set/get/dump commands are immediately allowed and these callbacks internally use the net_generic. net_generic is allocated by register_pernet_device() but this is called after genl_register_family() in the __init function. So, these callbacks would use un-initialized net_generic. Test commands: #SHELL1 while : do modprobe tipc modprobe -rv tipc done #SHELL2 while : do tipc link list done Splat looks like: [ 59.616322][ T2788] kasan: CONFIG_KASAN_INLINE enabled [ 59.617234][ T2788] kasan: GPF could be caused by NULL-ptr deref or user memory access [ 59.618398][ T2788] general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI [ 59.619389][ T2788] CPU: 3 PID: 2788 Comm: tipc Not tainted 5.4.0+ #194 [ 59.620231][ T2788] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006 [ 59.621428][ T2788] RIP: 0010:tipc_bcast_get_broadcast_mode+0x131/0x310 [tipc] [ 59.622379][ T2788] Code: c7 c6 ef 8b 38 c0 65 ff 0d 84 83 c9 3f e8 d7 a5 f2 e3 48 8d bb 38 11 00 00 48 b8 00 00 00 00 [ 59.622550][ T2780] NET: Registered protocol family 30 [ 59.624627][ T2788] RSP: 0018:ffff88804b09f578 EFLAGS: 00010202 [ 59.624630][ T2788] RAX: dffffc0000000000 RBX: 0000000000000011 RCX: 000000008bc66907 [ 59.624631][ T2788] RDX: 0000000000000229 RSI: 000000004b3cf4cc RDI: 0000000000001149 [ 59.624633][ T2788] RBP: ffff88804b09f588 R08: 0000000000000003 R09: fffffbfff4fb3df1 [ 59.624635][ T2788] R10: fffffbfff50318f8 R11: ffff888066cadc18 R12: ffffffffa6cc2f40 [ 59.624637][ T2788] R13: 1ffff11009613eba R14: ffff8880662e9328 R15: ffff8880662e9328 [ 59.624639][ T2788] FS: 00007f57d8f7b740(0000) GS:ffff88806cc00000(0000) knlGS:0000000000000000 [ 59.624645][ T2788] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 59.625875][ T2780] tipc: Started in single node mode [ 59.626128][ T2788] CR2: 00007f57d887a8c0 CR3: 000000004b140002 CR4: 00000000000606e0 [ 59.633991][ T2788] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 59.635195][ T2788] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 59.636478][ T2788] Call Trace: [ 59.637025][ T2788] tipc_nl_add_bc_link+0x179/0x1470 [tipc] [ 59.638219][ T2788] ? lock_downgrade+0x6e0/0x6e0 [ 59.638923][ T2788] ? __tipc_nl_add_link+0xf90/0xf90 [tipc] [ 59.639533][ T2788] ? tipc_nl_node_dump_link+0x318/0xa50 [tipc] [ 59.640160][ T2788] ? mutex_lock_io_nested+0x1380/0x1380 [ 59.640746][ T2788] tipc_nl_node_dump_link+0x4fd/0xa50 [tipc] [ 59.641356][ T2788] ? tipc_nl_node_reset_link_stats+0x340/0x340 [tipc] [ 59.642088][ T2788] ? __skb_ext_del+0x270/0x270 [ 59.642594][ T2788] genl_lock_dumpit+0x85/0xb0 [ 59.643050][ T2788] netlink_dump+0x49c/0xed0 [ 59.643529][ T2788] ? __netlink_sendskb+0xc0/0xc0 [ 59.644044][ T2788] ? __netlink_dump_start+0x190/0x800 [ 59.644617][ T2788] ? __mutex_unlock_slowpath+0xd0/0x670 [ 59.645177][ T2788] __netlink_dump_start+0x5a0/0x800 [ 59.645692][ T2788] genl_rcv_msg+0xa75/0xe90 [ 59.646144][ T2788] ? __lock_acquire+0xdfe/0x3de0 [ 59.646692][ T2788] ? genl_family_rcv_msg_attrs_parse+0x320/0x320 [ 59.647340][ T2788] ? genl_lock_dumpit+0xb0/0xb0 [ 59.647821][ T2788] ? genl_unlock+0x20/0x20 [ 59.648290][ T2788] ? genl_parallel_done+0xe0/0xe0 [ 59.648787][ T2788] ? find_held_lock+0x39/0x1d0 [ 59.649276][ T2788] ? genl_rcv+0x15/0x40 [ 59.649722][ T2788] ? lock_contended+0xcd0/0xcd0 [ 59.650296][ T2788] netlink_rcv_skb+0x121/0x350 [ 59.650828][ T2788] ? genl_family_rcv_msg_attrs_parse+0x320/0x320 [ 59.651491][ T2788] ? netlink_ack+0x940/0x940 [ 59.651953][ T2788] ? lock_acquire+0x164/0x3b0 [ 59.652449][ T2788] genl_rcv+0x24/0x40 [ 59.652841][ T2788] netlink_unicast+0x421/0x600 [ ... ] Fixes: 7e436905 ("tipc: fix a slab object leak") Fixes: a62fbcce ("tipc: make subscriber server support net namespace") Signed-off-by: NTaehee Yoo <ap420073@gmail.com> Acked-by: NJon Maloy <jon.maloy@ericsson.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
由 Eric Dumazet 提交于
[ Upstream commit 9424e2e7ad93ffffa88f882c9bc5023570904b55 ] Back in 2008, Adam Langley fixed the corner case of packets for flows having all of the following options : MD5 TS SACK Since MD5 needs 20 bytes, and TS needs 12 bytes, no sack block can be cooked from the remaining 8 bytes. tcp_established_options() correctly sets opts->num_sack_blocks to zero, but returns 36 instead of 32. This means TCP cooks packets with 4 extra bytes at the end of options, containing unitialized bytes. Fixes: 33ad798c ("tcp: options clean up") Signed-off-by: NEric Dumazet <edumazet@google.com> Reported-by: Nsyzbot <syzkaller@googlegroups.com> Acked-by: NNeal Cardwell <ncardwell@google.com> Acked-by: NSoheil Hassas Yeganeh <soheil@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
由 Aaron Conole 提交于
[ Upstream commit 5d50aa83e2c8e91ced2cca77c198b468ca9210f4 ] The openvswitch module shares a common conntrack and NAT infrastructure exposed via netfilter. It's possible that a packet needs both SNAT and DNAT manipulation, due to e.g. tuple collision. Netfilter can support this because it runs through the NAT table twice - once on ingress and again after egress. The openvswitch module doesn't have such capability. Like netfilter hook infrastructure, we should run through NAT twice to keep the symmetry. Fixes: 05752523 ("openvswitch: Interface with NAT.") Signed-off-by: NAaron Conole <aconole@redhat.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
由 Dust Li 提交于
[ Upstream commit 2f23cd42e19c22c24ff0e221089b7b6123b117c5 ] sch->q.len hasn't been set if the subqueue is a NOLOCK qdisc in mq_dump() and mqprio_dump(). Fixes: ce679e8d ("net: sched: add support for TCQ_F_NOLOCK subqueues to sch_mqprio") Signed-off-by: NDust Li <dust.li@linux.alibaba.com> Signed-off-by: NTony Lu <tonylu@linux.alibaba.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
由 Alexander Lobakin 提交于
[ Upstream commit 8bef0af09a5415df761b04fa487a6c34acae74bc ] Commit 43e66528 ("net-next: dsa: fix flow dissection") added an ability to override protocol and network offset during flow dissection for DSA-enabled devices (i.e. controllers shipped as switch CPU ports) in order to fix skb hashing for RPS on Rx path. However, skb_hash() and added part of code can be invoked not only on Rx, but also on Tx path if we have a multi-queued device and: - kernel is running on UP system or - XPS is not configured. The call stack in this two cases will be like: dev_queue_xmit() -> __dev_queue_xmit() -> netdev_core_pick_tx() -> netdev_pick_tx() -> skb_tx_hash() -> skb_get_hash(). The problem is that skbs queued for Tx have both network offset and correct protocol already set up even after inserting a CPU tag by DSA tagger, so calling tag_ops->flow_dissect() on this path actually only breaks flow dissection and hashing. This can be observed by adding debug prints just before and right after tag_ops->flow_dissect() call to the related block of code: Before the patch: Rx path (RPS): [ 19.240001] Rx: proto: 0x00f8, nhoff: 0 /* ETH_P_XDSA */ [ 19.244271] tag_ops->flow_dissect() [ 19.247811] Rx: proto: 0x0800, nhoff: 8 /* ETH_P_IP */ [ 19.215435] Rx: proto: 0x00f8, nhoff: 0 /* ETH_P_XDSA */ [ 19.219746] tag_ops->flow_dissect() [ 19.223241] Rx: proto: 0x0806, nhoff: 8 /* ETH_P_ARP */ [ 18.654057] Rx: proto: 0x00f8, nhoff: 0 /* ETH_P_XDSA */ [ 18.658332] tag_ops->flow_dissect() [ 18.661826] Rx: proto: 0x8100, nhoff: 8 /* ETH_P_8021Q */ Tx path (UP system): [ 18.759560] Tx: proto: 0x0800, nhoff: 26 /* ETH_P_IP */ [ 18.763933] tag_ops->flow_dissect() [ 18.767485] Tx: proto: 0x920b, nhoff: 34 /* junk */ [ 22.800020] Tx: proto: 0x0806, nhoff: 26 /* ETH_P_ARP */ [ 22.804392] tag_ops->flow_dissect() [ 22.807921] Tx: proto: 0x920b, nhoff: 34 /* junk */ [ 16.898342] Tx: proto: 0x86dd, nhoff: 26 /* ETH_P_IPV6 */ [ 16.902705] tag_ops->flow_dissect() [ 16.906227] Tx: proto: 0x920b, nhoff: 34 /* junk */ After: Rx path (RPS): [ 16.520993] Rx: proto: 0x00f8, nhoff: 0 /* ETH_P_XDSA */ [ 16.525260] tag_ops->flow_dissect() [ 16.528808] Rx: proto: 0x0800, nhoff: 8 /* ETH_P_IP */ [ 15.484807] Rx: proto: 0x00f8, nhoff: 0 /* ETH_P_XDSA */ [ 15.490417] tag_ops->flow_dissect() [ 15.495223] Rx: proto: 0x0806, nhoff: 8 /* ETH_P_ARP */ [ 17.134621] Rx: proto: 0x00f8, nhoff: 0 /* ETH_P_XDSA */ [ 17.138895] tag_ops->flow_dissect() [ 17.142388] Rx: proto: 0x8100, nhoff: 8 /* ETH_P_8021Q */ Tx path (UP system): [ 15.499558] Tx: proto: 0x0800, nhoff: 26 /* ETH_P_IP */ [ 20.664689] Tx: proto: 0x0806, nhoff: 26 /* ETH_P_ARP */ [ 18.565782] Tx: proto: 0x86dd, nhoff: 26 /* ETH_P_IPV6 */ In order to fix that we can add the check 'proto == htons(ETH_P_XDSA)' to prevent code from calling tag_ops->flow_dissect() on Tx. I also decided to initialize 'offset' variable so tagger callbacks can now safely leave it untouched without provoking a chaos. Fixes: 43e66528 ("net-next: dsa: fix flow dissection") Signed-off-by: NAlexander Lobakin <alobakin@dlink.ru> Reviewed-by: NFlorian Fainelli <f.fainelli@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
由 Nikolay Aleksandrov 提交于
[ Upstream commit c4b4c421857dc7b1cf0dccbd738472360ff2cd70 ] We have an interesting memory leak in the bridge when it is being unregistered and is a slave to a master device which would change the mac of its slaves on unregister (e.g. bond, team). This is a very unusual setup but we do end up leaking 1 fdb entry because dev_set_mac_address() would cause the bridge to insert the new mac address into its table after all fdbs are flushed, i.e. after dellink() on the bridge has finished and we call NETDEV_UNREGISTER the bond/team would release it and will call dev_set_mac_address() to restore its original address and that in turn will add an fdb in the bridge. One fix is to check for the bridge dev's reg_state in its ndo_set_mac_address callback and return an error if the bridge is not in NETREG_REGISTERED. Easy steps to reproduce: 1. add bond in mode != A/B 2. add any slave to the bond 3. add bridge dev as a slave to the bond 4. destroy the bridge device Trace: unreferenced object 0xffff888035c4d080 (size 128): comm "ip", pid 4068, jiffies 4296209429 (age 1413.753s) hex dump (first 32 bytes): 41 1d c9 36 80 88 ff ff 00 00 00 00 00 00 00 00 A..6............ d2 19 c9 5e 3f d7 00 00 00 00 00 00 00 00 00 00 ...^?........... backtrace: [<00000000ddb525dc>] kmem_cache_alloc+0x155/0x26f [<00000000633ff1e0>] fdb_create+0x21/0x486 [bridge] [<0000000092b17e9c>] fdb_insert+0x91/0xdc [bridge] [<00000000f2a0f0ff>] br_fdb_change_mac_address+0xb3/0x175 [bridge] [<000000001de02dbd>] br_stp_change_bridge_id+0xf/0xff [bridge] [<00000000ac0e32b1>] br_set_mac_address+0x76/0x99 [bridge] [<000000006846a77f>] dev_set_mac_address+0x63/0x9b [<00000000d30738fc>] __bond_release_one+0x3f6/0x455 [bonding] [<00000000fc7ec01d>] bond_netdev_event+0x2f2/0x400 [bonding] [<00000000305d7795>] notifier_call_chain+0x38/0x56 [<0000000028885d4a>] call_netdevice_notifiers+0x1e/0x23 [<000000008279477b>] rollback_registered_many+0x353/0x6a4 [<0000000018ef753a>] unregister_netdevice_many+0x17/0x6f [<00000000ba854b7a>] rtnl_delete_link+0x3c/0x43 [<00000000adf8618d>] rtnl_dellink+0x1dc/0x20a [<000000009b6395fd>] rtnetlink_rcv_msg+0x23d/0x268 Fixes: 43598813 ("bridge: add local MAC address to forwarding table (v2)") Reported-by: syzbot+2add91c08eb181fea1bf@syzkaller.appspotmail.com Signed-off-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
由 Vladyslav Tarasiuk 提交于
[ Upstream commit 9f104c7736904ac72385bbb48669e0c923ca879b ] When user runs a command like tc qdisc add dev eth1 root mqprio KASAN stack-out-of-bounds warning is emitted. Currently, NLA_ALIGN macro used in mqprio_dump provides too large buffer size as argument for nla_put and memcpy down the call stack. The flow looks like this: 1. nla_put expects exact object size as an argument; 2. Later it provides this size to memcpy; 3. To calculate correct padding for SKB, nla_put applies NLA_ALIGN macro itself. Therefore, NLA_ALIGN should not be applied to the nla_put parameter. Otherwise it will lead to out-of-bounds memory access in memcpy. Fixes: 4e8b86c0 ("mqprio: Introduce new hardware offload mode and shaper in mqprio") Signed-off-by: NVladyslav Tarasiuk <vladyslavt@mellanox.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
由 Eric Dumazet 提交于
[ Upstream commit 501a90c945103e8627406763dac418f20f3837b2 ] syzbot was once again able to crash a host by setting a very small mtu on loopback device. Let's make inetdev_valid_mtu() available in include/net/ip.h, and use it in ip_setup_cork(), so that we protect both ip_append_page() and __ip_append_data() Also add a READ_ONCE() when the device mtu is read. Pairs this lockless read with one WRITE_ONCE() in __dev_set_mtu(), even if other code paths might write over this field. Add a big comment in include/linux/netdevice.h about dev->mtu needing READ_ONCE()/WRITE_ONCE() annotations. Hopefully we will add the missing ones in followup patches. [1] refcount_t: saturated; leaking memory. WARNING: CPU: 0 PID: 9464 at lib/refcount.c:22 refcount_warn_saturate+0x138/0x1f0 lib/refcount.c:22 Kernel panic - not syncing: panic_on_warn set ... CPU: 0 PID: 9464 Comm: syz-executor850 Not tainted 5.4.0-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x197/0x210 lib/dump_stack.c:118 panic+0x2e3/0x75c kernel/panic.c:221 __warn.cold+0x2f/0x3e kernel/panic.c:582 report_bug+0x289/0x300 lib/bug.c:195 fixup_bug arch/x86/kernel/traps.c:174 [inline] fixup_bug arch/x86/kernel/traps.c:169 [inline] do_error_trap+0x11b/0x200 arch/x86/kernel/traps.c:267 do_invalid_op+0x37/0x50 arch/x86/kernel/traps.c:286 invalid_op+0x23/0x30 arch/x86/entry/entry_64.S:1027 RIP: 0010:refcount_warn_saturate+0x138/0x1f0 lib/refcount.c:22 Code: 06 31 ff 89 de e8 c8 f5 e6 fd 84 db 0f 85 6f ff ff ff e8 7b f4 e6 fd 48 c7 c7 e0 71 4f 88 c6 05 56 a6 a4 06 01 e8 c7 a8 b7 fd <0f> 0b e9 50 ff ff ff e8 5c f4 e6 fd 0f b6 1d 3d a6 a4 06 31 ff 89 RSP: 0018:ffff88809689f550 EFLAGS: 00010286 RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffffffff815e4336 RDI: ffffed1012d13e9c RBP: ffff88809689f560 R08: ffff88809c50a3c0 R09: fffffbfff15d31b1 R10: fffffbfff15d31b0 R11: ffffffff8ae98d87 R12: 0000000000000001 R13: 0000000000040100 R14: ffff888099041104 R15: ffff888218d96e40 refcount_add include/linux/refcount.h:193 [inline] skb_set_owner_w+0x2b6/0x410 net/core/sock.c:1999 sock_wmalloc+0xf1/0x120 net/core/sock.c:2096 ip_append_page+0x7ef/0x1190 net/ipv4/ip_output.c:1383 udp_sendpage+0x1c7/0x480 net/ipv4/udp.c:1276 inet_sendpage+0xdb/0x150 net/ipv4/af_inet.c:821 kernel_sendpage+0x92/0xf0 net/socket.c:3794 sock_sendpage+0x8b/0xc0 net/socket.c:936 pipe_to_sendpage+0x2da/0x3c0 fs/splice.c:458 splice_from_pipe_feed fs/splice.c:512 [inline] __splice_from_pipe+0x3ee/0x7c0 fs/splice.c:636 splice_from_pipe+0x108/0x170 fs/splice.c:671 generic_splice_sendpage+0x3c/0x50 fs/splice.c:842 do_splice_from fs/splice.c:861 [inline] direct_splice_actor+0x123/0x190 fs/splice.c:1035 splice_direct_to_actor+0x3b4/0xa30 fs/splice.c:990 do_splice_direct+0x1da/0x2a0 fs/splice.c:1078 do_sendfile+0x597/0xd00 fs/read_write.c:1464 __do_sys_sendfile64 fs/read_write.c:1525 [inline] __se_sys_sendfile64 fs/read_write.c:1511 [inline] __x64_sys_sendfile64+0x1dd/0x220 fs/read_write.c:1511 do_syscall_64+0xfa/0x790 arch/x86/entry/common.c:294 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x441409 Code: e8 ac e8 ff ff 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 eb 08 fc ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007fffb64c4f78 EFLAGS: 00000246 ORIG_RAX: 0000000000000028 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 0000000000441409 RDX: 0000000000000000 RSI: 0000000000000006 RDI: 0000000000000005 RBP: 0000000000073b8a R08: 0000000000000010 R09: 0000000000000010 R10: 0000000000010001 R11: 0000000000000246 R12: 0000000000402180 R13: 0000000000402210 R14: 0000000000000000 R15: 0000000000000000 Kernel Offset: disabled Rebooting in 86400 seconds.. Fixes: 1470ddf7 ("inet: Remove explicit write references to sk/inet in ip_append_data") Signed-off-by: NEric Dumazet <edumazet@google.com> Reported-by: Nsyzbot <syzkaller@googlegroups.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
- 18 12月, 2019 4 次提交
-
-
由 Pavel Tikhomirov 提交于
[ Upstream commit 5fcaf6982d1167f1cd9b264704f6d1ef4c505d54 ] I was investigating a crash in our Virtuozzo7 kernel which happened in in svcauth_unix_set_client. I found out that we access m_client field in ip_map structure, which was received from sunrpc_cache_lookup (we have a bit older kernel, now the code is in sunrpc_cache_add_entry), and these field looks uninitialized (m_client == 0x74 don't look like a pointer) but in the cache_head in flags we see 0x1 which is CACHE_VALID. It looks like the problem appeared from our previous fix to sunrpc (1): commit 4ecd55ea0742 ("sunrpc: fix cache_head leak due to queued request") And we've also found a patch already fixing our patch (2): commit d58431eacb22 ("sunrpc: don't mark uninitialised items as VALID.") Though the crash is eliminated, I think the core of the problem is not completely fixed: Neil in the patch (2) makes cache_head CACHE_NEGATIVE, before cache_fresh_locked which was added in (1) to fix crash. These way cache_is_valid won't say the cache is valid anymore and in svcauth_unix_set_client the function cache_check will return error instead of 0, and we don't count entry as initialized. But it looks like we need to remove cache_fresh_locked completely in sunrpc_cache_lookup: In (1) we've only wanted to make cache_fresh_unlocked->cache_dequeue so that cache_requests with no readers also release corresponding cache_head, to fix their leak. We with Vasily were not sure if cache_fresh_locked and cache_fresh_unlocked should be used in pair or not, so we've guessed to use them in pair. Now we see that we don't want the CACHE_VALID bit set here by cache_fresh_locked, as "valid" means "initialized" and there is no initialization in sunrpc_cache_add_entry. Both expiry_time and last_refresh are not used in cache_fresh_unlocked code-path and also not required for the initial fix. So to conclude cache_fresh_locked was called by mistake, and we can just safely remove it instead of crutching it with CACHE_NEGATIVE. It looks ideologically better for me. Hope I don't miss something here. Here is our crash backtrace: [13108726.326291] BUG: unable to handle kernel NULL pointer dereference at 0000000000000074 [13108726.326365] IP: [<ffffffffc01f79eb>] svcauth_unix_set_client+0x2ab/0x520 [sunrpc] [13108726.326448] PGD 0 [13108726.326468] Oops: 0002 [#1] SMP [13108726.326497] Modules linked in: nbd isofs xfs loop kpatch_cumulative_81_0_r1(O) xt_physdev nfnetlink_queue bluetooth rfkill ip6table_nat nf_nat_ipv6 ip_vs_wrr ip_vs_wlc ip_vs_sh nf_conntrack_netlink ip_vs_sed ip_vs_pe_sip nf_conntrack_sip ip_vs_nq ip_vs_lc ip_vs_lblcr ip_vs_lblc ip_vs_ftp ip_vs_dh nf_nat_ftp nf_conntrack_ftp iptable_raw xt_recent nf_log_ipv6 xt_hl ip6t_rt nf_log_ipv4 nf_log_common xt_LOG xt_limit xt_TCPMSS xt_tcpmss vxlan ip6_udp_tunnel udp_tunnel xt_statistic xt_NFLOG nfnetlink_log dummy xt_mark xt_REDIRECT nf_nat_redirect raw_diag udp_diag tcp_diag inet_diag netlink_diag af_packet_diag unix_diag rpcsec_gss_krb5 xt_addrtype ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 ebtable_nat ebtable_broute nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_mangle ip6table_raw nfsv4 [13108726.327173] dns_resolver cls_u32 binfmt_misc arptable_filter arp_tables ip6table_filter ip6_tables devlink fuse_kio_pcs ipt_MASQUERADE nf_nat_masquerade_ipv4 xt_nat iptable_nat nf_nat_ipv4 xt_comment nf_conntrack_ipv4 nf_defrag_ipv4 xt_wdog_tmo xt_multiport bonding xt_set xt_conntrack iptable_filter iptable_mangle kpatch(O) ebtable_filter ebt_among ebtables ip_set_hash_ip ip_set nfnetlink vfat fat skx_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass fuse pcspkr ses enclosure joydev sg mei_me hpwdt hpilo lpc_ich mei ipmi_si shpchp ipmi_devintf ipmi_msghandler xt_ipvs acpi_power_meter ip_vs_rr nfsv3 nfsd auth_rpcgss nfs_acl nfs lockd grace fscache nf_nat cls_fw sch_htb sch_cbq sch_sfq ip_vs em_u32 nf_conntrack tun br_netfilter veth overlay ip6_vzprivnet ip6_vznetstat ip_vznetstat [13108726.327817] ip_vzprivnet vziolimit vzevent vzlist vzstat vznetstat vznetdev vzmon vzdev bridge pio_kaio pio_nfs pio_direct pfmt_raw pfmt_ploop1 ploop ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mgag200 i2c_algo_bit drm_kms_helper scsi_transport_iscsi 8021q syscopyarea sysfillrect garp sysimgblt fb_sys_fops mrp stp ttm llc bnx2x crct10dif_pclmul crct10dif_common crc32_pclmul crc32c_intel drm dm_multipath ghash_clmulni_intel uas aesni_intel lrw gf128mul glue_helper ablk_helper cryptd tg3 smartpqi scsi_transport_sas mdio libcrc32c i2c_core usb_storage ptp pps_core wmi sunrpc dm_mirror dm_region_hash dm_log dm_mod [last unloaded: kpatch_cumulative_82_0_r1] [13108726.328403] CPU: 35 PID: 63742 Comm: nfsd ve: 51332 Kdump: loaded Tainted: G W O ------------ 3.10.0-862.20.2.vz7.73.29 #1 73.29 [13108726.328491] Hardware name: HPE ProLiant DL360 Gen10/ProLiant DL360 Gen10, BIOS U32 10/02/2018 [13108726.328554] task: ffffa0a6a41b1160 ti: ffffa0c2a74bc000 task.ti: ffffa0c2a74bc000 [13108726.328610] RIP: 0010:[<ffffffffc01f79eb>] [<ffffffffc01f79eb>] svcauth_unix_set_client+0x2ab/0x520 [sunrpc] [13108726.328706] RSP: 0018:ffffa0c2a74bfd80 EFLAGS: 00010246 [13108726.328750] RAX: 0000000000000001 RBX: ffffa0a6183ae000 RCX: 0000000000000000 [13108726.328811] RDX: 0000000000000074 RSI: 0000000000000286 RDI: ffffa0c2a74bfcf0 [13108726.328864] RBP: ffffa0c2a74bfe00 R08: ffffa0bab8c22960 R09: 0000000000000001 [13108726.328916] R10: 0000000000000001 R11: 0000000000000001 R12: ffffa0a32aa7f000 [13108726.328969] R13: ffffa0a6183afac0 R14: ffffa0c233d88d00 R15: ffffa0c2a74bfdb4 [13108726.329022] FS: 0000000000000000(0000) GS:ffffa0e17f9c0000(0000) knlGS:0000000000000000 [13108726.329081] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [13108726.332311] CR2: 0000000000000074 CR3: 00000026a1b28000 CR4: 00000000007607e0 [13108726.334606] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [13108726.336754] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [13108726.338908] PKRU: 00000000 [13108726.341047] Call Trace: [13108726.343074] [<ffffffff8a2c78b4>] ? groups_alloc+0x34/0x110 [13108726.344837] [<ffffffffc01f5eb4>] svc_set_client+0x24/0x30 [sunrpc] [13108726.346631] [<ffffffffc01f2ac1>] svc_process_common+0x241/0x710 [sunrpc] [13108726.348332] [<ffffffffc01f3093>] svc_process+0x103/0x190 [sunrpc] [13108726.350016] [<ffffffffc07d605f>] nfsd+0xdf/0x150 [nfsd] [13108726.351735] [<ffffffffc07d5f80>] ? nfsd_destroy+0x80/0x80 [nfsd] [13108726.353459] [<ffffffff8a2bf741>] kthread+0xd1/0xe0 [13108726.355195] [<ffffffff8a2bf670>] ? create_kthread+0x60/0x60 [13108726.356896] [<ffffffff8a9556dd>] ret_from_fork_nospec_begin+0x7/0x21 [13108726.358577] [<ffffffff8a2bf670>] ? create_kthread+0x60/0x60 [13108726.360240] Code: 4c 8b 45 98 0f 8e 2e 01 00 00 83 f8 fe 0f 84 76 fe ff ff 85 c0 0f 85 2b 01 00 00 49 8b 50 40 b8 01 00 00 00 48 89 93 d0 1a 00 00 <f0> 0f c1 02 83 c0 01 83 f8 01 0f 8e 53 02 00 00 49 8b 44 24 38 [13108726.363769] RIP [<ffffffffc01f79eb>] svcauth_unix_set_client+0x2ab/0x520 [sunrpc] [13108726.365530] RSP <ffffa0c2a74bfd80> [13108726.367179] CR2: 0000000000000074 Fixes: d58431eacb22 ("sunrpc: don't mark uninitialised items as VALID.") Signed-off-by: NPavel Tikhomirov <ptikhomirov@virtuozzo.com> Acked-by: NNeilBrown <neilb@suse.de> Signed-off-by: NJ. Bruce Fields <bfields@redhat.com> Signed-off-by: NSasha Levin <sashal@kernel.org>
-
由 Cong Wang 提交于
[ Upstream commit 0e4940928c26527ce8f97237fef4c8a91cd34207 ] After pskb_may_pull() we should always refetch the header pointers from the skb->data in case it got reallocated. In gre_parse_header(), the erspan header is still fetched from the 'options' pointer which is fetched before pskb_may_pull(). Found this during code review of a KMSAN bug report. Fixes: cb73ee40b1b3 ("net: ip_gre: use erspan key field for tunnel lookup") Cc: Lorenzo Bianconi <lorenzo.bianconi@redhat.com> Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com> Acked-by: NLorenzo Bianconi <lorenzo.bianconi@redhat.com> Acked-by: NWilliam Tu <u9012063@gmail.com> Reviewed-by: NSimon Horman <simon.horman@netronome.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net> Signed-off-by: NSasha Levin <sashal@kernel.org>
-
由 Karsten Graul 提交于
[ Upstream commit 33f3fcc290671590821ff3c0c9396db1ec9b7d4c ] smc_cdc_get_free_slot() might wait for free transfer buffers when using SMC-R. This wait should not be done under the send_lock, which is a spin_lock. This fixes a cpu loop in parallel threads waiting for the send_lock. Signed-off-by: NKarsten Graul <kgraul@linux.ibm.com> Signed-off-by: NUrsula Braun <ubraun@linux.ibm.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net> Signed-off-by: NSasha Levin <sashal@kernel.org>
-
由 Toke Høiland-Jørgensen 提交于
[ Upstream commit 8c6c37fdc20ec9ffaa342f827a8e20afe736fb0c ] To ensure parent qdiscs have the same notion of the number of enqueued packets even after splitting a GSO packet, update the qdisc tree with the number of packets that was added due to the split. Reported-by: NPete Heist <pete@heistp.net> Tested-by: NPete Heist <pete@heistp.net> Signed-off-by: NToke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net> Signed-off-by: NSasha Levin <sashal@kernel.org>
-