- 08 1月, 2021 4 次提交
-
-
由 Ido Schimmel 提交于
In case of error, remove the nexthop group entry from the list to which it was previously added. Fixes: 430a0491 ("nexthop: Add support for nexthop groups") Signed-off-by: NIdo Schimmel <idosch@nvidia.com> Reviewed-by: NPetr Machata <petrm@nvidia.com> Reviewed-by: NDavid Ahern <dsahern@kernel.org> Signed-off-by: NJakub Kicinski <kuba@kernel.org>
-
由 Ido Schimmel 提交于
A reference was not taken for the current nexthop entry, so do not try to put it in the error path. Fixes: 430a0491 ("nexthop: Add support for nexthop groups") Signed-off-by: NIdo Schimmel <idosch@nvidia.com> Reviewed-by: NPetr Machata <petrm@nvidia.com> Reviewed-by: NDavid Ahern <dsahern@kernel.org> Signed-off-by: NJakub Kicinski <kuba@kernel.org>
-
由 Florian Westphal 提交于
Conntrack reassembly records the largest fragment size seen in IPCB. However, when this gets forwarded/transmitted, fragmentation will only be forced if one of the fragmented packets had the DF bit set. In that case, a flag in IPCB will force fragmentation even if the MTU is large enough. This should work fine, but this breaks with ip tunnels. Consider client that sends a UDP datagram of size X to another host. The client fragments the datagram, so two packets, of size y and z, are sent. DF bit is not set on any of these packets. Middlebox netfilter reassembles those packets back to single size-X packet, before routing decision. packet-size-vs-mtu checks in ip_forward are irrelevant, because DF bit isn't set. At output time, ip refragmentation is skipped as well because x is still smaller than the mtu of the output device. If ttransmit device is an ip tunnel, the packet size increases to x+overhead. Also, tunnel might be configured to force DF bit on outer header. In this case, packet will be dropped (exceeds MTU) and an ICMP error is generated back to sender. But sender already respects the announced MTU, all the packets that it sent did fit the announced mtu. Force refragmentation as per original sizes unconditionally so ip tunnel will encapsulate the fragments instead. The only other solution I see is to place ip refragmentation in the ip_tunnel code to handle this case. Fixes: d6b915e2 ("ip_fragment: don't forward defragmented DF packet") Reported-by: NChristian Perle <christian.perle@secunet.com> Signed-off-by: NFlorian Westphal <fw@strlen.de> Acked-by: NPablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: NJakub Kicinski <kuba@kernel.org>
-
由 Florian Westphal 提交于
For some reason ip_tunnel insist on setting the DF bit anyway when the inner header has the DF bit set, EVEN if the tunnel was configured with 'nopmtudisc'. This means that the script added in the previous commit cannot be made to work by adding the 'nopmtudisc' flag to the ip tunnel configuration. Doing so breaks connectivity even for the without-conntrack/netfilter scenario. When nopmtudisc is set, the tunnel will skip the mtu check, so no icmp error is sent to client. Then, because inner header has DF set, the outer header gets added with DF bit set as well. IP stack then sends an error to itself because the packet exceeds the device MTU. Fixes: 23a3647b ("ip_tunnels: Use skb-len to PMTU check.") Cc: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: NFlorian Westphal <fw@strlen.de> Acked-by: NPablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: NJakub Kicinski <kuba@kernel.org>
-
- 29 12月, 2020 2 次提交
-
-
由 Cong Wang 提交于
Both version 0 and version 1 use ETH_P_ERSPAN, but version 0 does not have an erspan header. So the check in gre_parse_header() is wrong, we have to distinguish version 1 from version 0. We can just check the gre header length like is_erspan_type1(). Fixes: cb73ee40 ("net: ip_gre: use erspan key field for tunnel lookup") Reported-by: syzbot+f583ce3d4ddf9836b27a@syzkaller.appspotmail.com Cc: William Tu <u9012063@gmail.com> Cc: Lorenzo Bianconi <lorenzo.bianconi@redhat.com> Signed-off-by: NCong Wang <cong.wang@bytedance.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Guillaume Nault 提交于
RT_TOS() only clears one of the ECN bits. Therefore, when fib_compute_spec_dst() resorts to a fib lookup, it can return different results depending on the value of the second ECN bit. For example, ECT(0) and ECT(1) packets could be treated differently. $ ip netns add ns0 $ ip netns add ns1 $ ip link add name veth01 netns ns0 type veth peer name veth10 netns ns1 $ ip -netns ns0 link set dev lo up $ ip -netns ns1 link set dev lo up $ ip -netns ns0 link set dev veth01 up $ ip -netns ns1 link set dev veth10 up $ ip -netns ns0 address add 192.0.2.10/24 dev veth01 $ ip -netns ns1 address add 192.0.2.11/24 dev veth10 $ ip -netns ns1 address add 192.0.2.21/32 dev lo $ ip -netns ns1 route add 192.0.2.10/32 tos 4 dev veth10 src 192.0.2.21 $ ip netns exec ns1 sysctl -wq net.ipv4.icmp_echo_ignore_broadcasts=0 With TOS 4 and ECT(1), ns1 replies using source address 192.0.2.21 (ping uses -Q to set all TOS and ECN bits): $ ip netns exec ns0 ping -c 1 -b -Q 5 192.0.2.255 [...] 64 bytes from 192.0.2.21: icmp_seq=1 ttl=64 time=0.544 ms But with TOS 4 and ECT(0), ns1 replies using source address 192.0.2.11 because the "tos 4" route isn't matched: $ ip netns exec ns0 ping -c 1 -b -Q 6 192.0.2.255 [...] 64 bytes from 192.0.2.11: icmp_seq=1 ttl=64 time=0.597 ms After this patch the ECN bits don't affect the result anymore: $ ip netns exec ns0 ping -c 1 -b -Q 6 192.0.2.255 [...] 64 bytes from 192.0.2.21: icmp_seq=1 ttl=64 time=0.591 ms Fixes: 35ebf65e ("ipv4: Create and use fib_compute_spec_dst() helper.") Signed-off-by: NGuillaume Nault <gnault@redhat.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 18 12月, 2020 1 次提交
-
-
This fixes the dereference to fetch the RCU pointer when holding the appropriate xtables lock. Reported-by: Nkernel test robot <lkp@intel.com> Fixes: cc00bcaa ("netfilter: x_tables: Switch synchronization to RCU") Signed-off-by: NSubash Abhinov Kasiviswanathan <subashab@codeaurora.org> Reviewed-by: NFlorian Westphal <fw@strlen.de> Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
-
- 15 12月, 2020 2 次提交
-
-
由 Alexander Duyck 提交于
There are cases where a fastopen SYN may trigger either a ICMP_TOOBIG message in the case of IPv6 or a fragmentation request in the case of IPv4. This results in the socket stalling for a second or more as it does not respond to the message by retransmitting the SYN frame. Normally a SYN frame should not be able to trigger a ICMP_TOOBIG or ICMP_FRAG_NEEDED however in the case of fastopen we can have a frame that makes use of the entire MSS. In the case of fastopen it does, and an additional complication is that the retransmit queue doesn't contain the original frames. As a result when tcp_simple_retransmit is called and walks the list of frames in the queue it may not mark the frames as lost because both the SYN and the data packet each individually are smaller than the MSS size after the adjustment. This results in the socket being stalled until the retransmit timer kicks in and forces the SYN frame out again without the data attached. In order to resolve this we can reduce the MSS the packets are compared to in tcp_simple_retransmit to -1 for cases where we are still in the TCP_SYN_SENT state for a fastopen socket. Doing this we will mark all of the packets related to the fastopen SYN as lost. Signed-off-by: NAlexander Duyck <alexanderduyck@fb.com> Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NYuchung Cheng <ycheng@google.com> Link: https://lore.kernel.org/r/160780498125.3272.15437756269539236825.stgit@localhost.localdomainSigned-off-by: NJakub Kicinski <kuba@kernel.org>
-
由 Florian Westphal 提交于
Because TCP-level resets only affect the subflow, there is a MPTCP option to indicate that the MPTCP-level connection should be closed immediately without a mptcp-level fin exchange. This is the 'MPTCP fast close option'. It can be carried on ack segments or TCP resets. In the latter case, its needed to parse mptcp options also for reset packets so that MPTCP can act accordingly. Next patch will add receive side fastclose support in MPTCP. Acked-by: NMatthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: NFlorian Westphal <fw@strlen.de> Signed-off-by: NMat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: NJakub Kicinski <kuba@kernel.org>
-
- 13 12月, 2020 1 次提交
-
-
由 SeongJae Park 提交于
On a few of our systems, I found frequent 'unshare(CLONE_NEWNET)' calls make the number of active slab objects including 'sock_inode_cache' type rapidly and continuously increase. As a result, memory pressure occurs. In more detail, I made an artificial reproducer that resembles the workload that we found the problem and reproduce the problem faster. It merely repeats 'unshare(CLONE_NEWNET)' 50,000 times in a loop. It takes about 2 minutes. On 40 CPU cores / 70GB DRAM machine, the available memory continuously reduced in a fast speed (about 120MB per second, 15GB in total within the 2 minutes). Note that the issue don't reproduce on every machine. On my 6 CPU cores machine, the problem didn't reproduce. 'cleanup_net()' and 'fqdir_work_fn()' are functions that deallocate the relevant memory objects. They are asynchronously invoked by the work queues and internally use 'rcu_barrier()' to ensure safe destructions. 'cleanup_net()' works in a batched maneer in a single thread worker, while 'fqdir_work_fn()' works for each 'fqdir_exit()' call in the 'system_wq'. Therefore, 'fqdir_work_fn()' called frequently under the workload and made the contention for 'rcu_barrier()' high. In more detail, the global mutex, 'rcu_state.barrier_mutex' became the bottleneck. This commit avoids such contention by doing the 'rcu_barrier()' and subsequent lightweight works in a batched manner, as similar to that of 'cleanup_net()'. The fqdir hashtable destruction, which is done before the 'rcu_barrier()', is still allowed to run in parallel for fast processing, but this commit makes it to use a dedicated work queue instead of the 'system_wq', to make sure that the number of threads is bounded. Signed-off-by: NSeongJae Park <sjpark@amazon.de> Reviewed-by: NEric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20201211112405.31158-1-sjpark@amazon.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>
-
- 11 12月, 2020 1 次提交
-
-
由 Arjun Roy 提交于
A prior patch increased the size of struct tcp_zerocopy_receive but did not update do_tcp_getsockopt() handling to properly account for this. This patch simply reintroduces content erroneously cut from the referenced prior patch that handles the new struct size. Fixes: 18fb76ed ("net-zerocopy: Copy straggler unaligned data for TCP Rx. zerocopy.") Signed-off-by: NArjun Roy <arjunroy@google.com> Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 10 12月, 2020 2 次提交
-
-
由 Neal Cardwell 提交于
When cwnd is not a multiple of the TSO skb size of N*MSS, we can get into persistent scenarios where we have the following sequence: (1) ACK for full-sized skb of N*MSS arrives -> tcp_write_xmit() transmit full-sized skb with N*MSS -> move pacing release time forward -> exit tcp_write_xmit() because pacing time is in the future (2) TSQ callback or TCP internal pacing timer fires -> try to transmit next skb, but TSO deferral finds remainder of available cwnd is not big enough to trigger an immediate send now, so we defer sending until the next ACK. (3) repeat... So we can get into a case where we never mark ourselves as cwnd-limited for many seconds at a time, even with bulk/infinite-backlog senders, because: o In case (1) above, every time in tcp_write_xmit() we have enough cwnd to send a full-sized skb, we are not fully using the cwnd (because cwnd is not a multiple of the TSO skb size). So every time we send data, we are not cwnd limited, and so in the cwnd-limited tracking code in tcp_cwnd_validate() we mark ourselves as not cwnd-limited. o In case (2) above, every time in tcp_write_xmit() that we try to transmit the "remainder" of the cwnd but defer, we set the local variable is_cwnd_limited to true, but we do not send any packets, so sent_pkts is zero, so we don't call the cwnd-limited logic to update tp->is_cwnd_limited. Fixes: ca8a2263 ("tcp: make cwnd-limited checks measurement-based, and gentler") Reported-by: NIngemar Johansson <ingemar.s.johansson@ericsson.com> Signed-off-by: NNeal Cardwell <ncardwell@google.com> Signed-off-by: NYuchung Cheng <ycheng@google.com> Acked-by: NSoheil Hassas Yeganeh <soheil@google.com> Signed-off-by: NEric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20201209035759.1225145-1-ncardwell.kernel@gmail.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>
-
由 Wei Wang 提交于
For DCTCP, we have to retain the ECT bits set by the congestion control algorithm on the socket when reflecting syn TOS in syn-ack, in order to make ECN work properly. Fixes: ac8f1710 ("tcp: reflect tos value received in SYN to the socket") Reported-by: NAlexander Duyck <alexanderduyck@fb.com> Signed-off-by: NWei Wang <weiwan@google.com> Reviewed-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 09 12月, 2020 1 次提交
-
-
由 Eric Dumazet 提交于
Before commit a337531b ("tcp: up initial rmem to 128KB and SYN rwin to around 64KB") small tcp_rmem[1] values were overridden by tcp_fixup_rcvbuf() to accommodate various MSS. This is no longer the case, and Hazem Mohamed Abuelfotoh reported that DRS would not work for MTU 9000 endpoints receiving regular (1500 bytes) frames. Root cause is that tcp_init_buffer_space() uses tp->rcv_wnd for upper limit of rcvq_space.space computation, while it can select later a smaller value for tp->rcv_ssthresh and tp->window_clamp. ss -temoi on receiver would show : skmem:(r0,rb131072,t0,tb46080,f0,w0,o0,bl0,d0) rcv_space:62496 rcv_ssthresh:56596 This means that TCP can not increase its window in tcp_grow_window(), and that DRS can never kick. Fix this by making sure that rcvq_space.space is not bigger than number of bytes that can be held in TCP receive queue. People unable/unwilling to change their kernel can work around this issue by selecting a bigger tcp_rmem[1] value as in : echo "4096 196608 6291456" >/proc/sys/net/ipv4/tcp_rmem Based on an initial report and patch from Hazem Mohamed Abuelfotoh https://lore.kernel.org/netdev/20201204180622.14285-1-abuehaze@amazon.com/ Fixes: a337531b ("tcp: up initial rmem to 128KB and SYN rwin to around 64KB") Fixes: 041a14d2 ("tcp: start receiver buffer autotuning sooner") Reported-by: NHazem Mohamed Abuelfotoh <abuehaze@amazon.com> Signed-off-by: NEric Dumazet <edumazet@google.com> Acked-by: NSoheil Hassas Yeganeh <soheil@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 08 12月, 2020 1 次提交
-
-
When running concurrent iptables rules replacement with data, the per CPU sequence count is checked after the assignment of the new information. The sequence count is used to synchronize with the packet path without the use of any explicit locking. If there are any packets in the packet path using the table information, the sequence count is incremented to an odd value and is incremented to an even after the packet process completion. The new table value assignment is followed by a write memory barrier so every CPU should see the latest value. If the packet path has started with the old table information, the sequence counter will be odd and the iptables replacement will wait till the sequence count is even prior to freeing the old table info. However, this assumes that the new table information assignment and the memory barrier is actually executed prior to the counter check in the replacement thread. If CPU decides to execute the assignment later as there is no user of the table information prior to the sequence check, the packet path in another CPU may use the old table information. The replacement thread would then free the table information under it leading to a use after free in the packet processing context- Unable to handle kernel NULL pointer dereference at virtual address 000000000000008e pc : ip6t_do_table+0x5d0/0x89c lr : ip6t_do_table+0x5b8/0x89c ip6t_do_table+0x5d0/0x89c ip6table_filter_hook+0x24/0x30 nf_hook_slow+0x84/0x120 ip6_input+0x74/0xe0 ip6_rcv_finish+0x7c/0x128 ipv6_rcv+0xac/0xe4 __netif_receive_skb+0x84/0x17c process_backlog+0x15c/0x1b8 napi_poll+0x88/0x284 net_rx_action+0xbc/0x23c __do_softirq+0x20c/0x48c This could be fixed by forcing instruction order after the new table information assignment or by switching to RCU for the synchronization. Fixes: 80055dab ("netfilter: x_tables: make xt_replace_table wait until old rules are not used anymore") Reported-by: NSean Tranchetti <stranche@codeaurora.org> Reported-by: Nkernel test robot <lkp@intel.com> Suggested-by: NFlorian Westphal <fw@strlen.de> Signed-off-by: NSubash Abhinov Kasiviswanathan <subashab@codeaurora.org> Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
-
- 07 12月, 2020 1 次提交
-
-
由 Xin Long 提交于
Guillaume noticed that: for segments udp_queue_rcv_one_skb() returns the proto, and it should pass "ret" unmodified to ip_protocol_deliver_rcu(). Otherwize, with a negtive value passed, it will underflow inet_protos. This can be reproduced with IPIP FOU: # ip fou add port 5555 ipproto 4 # ethtool -K eth1 rx-gro-list on Fixes: cf329aa4 ("udp: cope with UDP GRO packet misdirection") Reported-by: NGuillaume Nault <gnault@redhat.com> Signed-off-by: NXin Long <lucien.xin@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 05 12月, 2020 9 次提交
-
-
由 Zhang Changzhong 提交于
Fix to return a negative error code from the error handling case instead of 0, as done elsewhere in this function. Fixes: d1566268 ("ipv4: Allow ipv6 gateway with ipv4 routes") Reported-by: NHulk Robot <hulkci@huawei.com> Signed-off-by: NZhang Changzhong <zhangchangzhong@huawei.com> Reviewed-by: NDavid Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/1607071695-33740-1-git-send-email-zhangchangzhong@huawei.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>
-
由 Arjun Roy 提交于
Zapping pages is required only if we are calling vm_insert_page into a region where pages had previously been mapped. Receive zerocopy allows reusing such regions, and hitherto called zap_page_range() before calling vm_insert_page() in that range. zap_page_range() can also be triggered from userspace with madvise(MADV_DONTNEED). If userspace is configured to call this before reusing a segment, or if there was nothing mapped at this virtual address to begin with, we can avoid calling zap_page_range() under the socket lock. That said, if userspace does not do that, then we are still responsible for calling zap_page_range(). This patch adds a flag that the user can use to hint to the kernel that a zap is not required. If the flag is not set, or if an older user application does not have a flags field at all, then the kernel calls zap_page_range as before. Also, if the flag is set but a zap is still required, the kernel performs that zap as necessary. Thus incorrectly indicating that a zap can be avoided does not change the correctness of operation. It also increases the batchsize for vm_insert_pages and prefetches the page struct for the batch since we're about to bump the refcount. An alternative mechanism could be to not have a flag, assume by default a zap is not needed, and fall back to zapping if needed. However, this would harm performance for older applications for which a zap is necessary, and thus we implement it with an explicit flag so newer applications can opt in. When using RPC-style traffic with medium sized (tens of KB) RPCs, this change yields an efficency improvement of about 30% for QPS/CPU usage. Signed-off-by: NArjun Roy <arjunroy@google.com> Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com> Signed-off-by: NJakub Kicinski <kuba@kernel.org>
-
由 Arjun Roy 提交于
Set zerocopy hint, event when falling back to copy, so that the pending data can be efficiently received using zerocopy when possible. Signed-off-by: NArjun Roy <arjunroy@google.com> Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com> Signed-off-by: NJakub Kicinski <kuba@kernel.org>
-
由 Arjun Roy 提交于
Sometimes, we may call tcp receive zerocopy when inq is 0, or inq < PAGE_SIZE, or inq is generally small enough that it is cheaper to copy rather than remap pages. In these cases, we may want to either return early (inq=0) or attempt to use the provided copy buffer to simply copy the received data. This allows us to save both system call overhead and the latency of acquiring mmap_sem in read mode for cases where it would be useless to do so. This patchset enables this behaviour by: 1. Returning quickly if inq is 0. 2. Attempting to perform a regular copy if a hybrid copybuffer is provided and it is large enough to absorb all available bytes. 3. Return quickly if no such buffer was provided and there are less than PAGE_SIZE bytes available. For small RPC ping-pong workloads, normally we would have 1 getsockopt(), 1 recvmsg() and 1 sendmsg() call per RPC. With this change, we remove the recvmsg() call entirely, reducing the syscall overhead by about 33%. In testing with small (hundreds of bytes) RPC traffic, this yields a syscall reduction of about 33% and an efficiency gain of about 3-5% when defined as QPS/CPU Util. Signed-off-by: NArjun Roy <arjunroy@google.com> Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com> Signed-off-by: NJakub Kicinski <kuba@kernel.org>
-
由 Arjun Roy 提交于
Sometimes, we may call tcp receive zerocopy when inq is 0, or inq < PAGE_SIZE, in which case we cannot remap pages. In this case, simply return the appropriate hint for regular copying without taking mmap_sem. Signed-off-by: NArjun Roy <arjunroy@google.com> Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com> Signed-off-by: NJakub Kicinski <kuba@kernel.org>
-
由 Arjun Roy 提交于
Refactor frag-is-remappable test for tcp receive zerocopy. This is part of a patch set that introduces short-circuited hybrid copies for small receive operations, which results in roughly 33% fewer syscalls for small RPC scenarios. Signed-off-by: NArjun Roy <arjunroy@google.com> Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com> Signed-off-by: NJakub Kicinski <kuba@kernel.org>
-
由 Arjun Roy 提交于
Refactor skb frag fast-forwarding for tcp receive zerocopy. This is part of a patch set that introduces short-circuited hybrid copies for small receive operations, which results in roughly 33% fewer syscalls for small RPC scenarios. skb_advance_to_frag(), given a skb and an offset into the skb, iterates from the first frag for the skb until we're at the frag specified by the offset. Assuming the offset provided refers to how many bytes in the skb are already read, the returned frag points to the next frag we may read from, while offset_frag is set to the number of bytes from this frag that we have already read. If frag is not null and offset_frag is equal to 0, then we may be able to map this frag's page into the process address space with vm_insert_page(). However, if offset_frag is not equal to 0, then we cannot do so. Signed-off-by: NArjun Roy <arjunroy@google.com> Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com> Signed-off-by: NJakub Kicinski <kuba@kernel.org>
-
由 Arjun Roy 提交于
Refactor tcp_recvmsg() by splitting it into locked and unlocked portions. Callers already holding the socket lock and not using ERRQUEUE/cmsg/busy polling can simply call tcp_recvmsg_locked(). This is in preparation for a short-circuit copy performed by TCP receive zerocopy for small (< PAGE_SIZE, or otherwise requested by the user) reads. Signed-off-by: NArjun Roy <arjunroy@google.com> Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com> Signed-off-by: NJakub Kicinski <kuba@kernel.org>
-
由 Arjun Roy 提交于
When TCP receive zerocopy does not successfully map the entire requested space, it outputs a 'hint' that the caller should recvmsg(). Augment zerocopy to accept a user buffer that it tries to copy this hint into - if it is possible to copy the entire hint, it will do so. This elides a recvmsg() call for received traffic that isn't exactly page-aligned in size. This was tested with RPC-style traffic of arbitrary sizes. Normally, each received message required at least one getsockopt() call, and one recvmsg() call for the remaining unaligned data. With this change, almost all of the recvmsg() calls are eliminated, leading to a savings of about 25%-50% in number of system calls for RPC-style workloads. Signed-off-by: NArjun Roy <arjunroy@google.com> Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com> Signed-off-by: NJakub Kicinski <kuba@kernel.org>
-
- 04 12月, 2020 3 次提交
-
-
由 Andrii Nakryiko 提交于
Remove a permeating assumption thoughout BPF verifier of vmlinux BTF. Instead, wherever BTF type IDs are involved, also track the instance of struct btf that goes along with the type ID. This allows to gradually add support for kernel module BTFs and using/tracking module types across BPF helper calls and registers. This patch also renames btf_id() function to btf_obj_id() to minimize naming clash with using btf_id to denote BTF *type* ID, rather than BTF *object*'s ID. Also, altough btf_vmlinux can't get destructed and thus doesn't need refcounting, module BTFs need that, so apply BTF refcounting universally when BPF program is using BTF-powered attachment (tp_btf, fentry/fexit, etc). This makes for simpler clean up code. Now that BTF type ID is not enough to uniquely identify a BTF type, extend BPF trampoline key to include BTF object ID. To differentiate that from target program BPF ID, set 31st bit of type ID. BTF type IDs (at least currently) are not allowed to take full 32 bits, so there is no danger of confusing that bit with a valid BTF type ID. Signed-off-by: NAndrii Nakryiko <andrii@kernel.org> Signed-off-by: NAlexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20201203204634.1325171-10-andrii@kernel.org
-
由 Prankur gupta 提交于
Adds a new bpf_setsockopt for TCP sockets, TCP_BPF_WINDOW_CLAMP, which sets the maximum receiver window size. It will be useful for limiting receiver window based on RTT. Signed-off-by: NPrankur gupta <prankgup@fb.com> Signed-off-by: NAlexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20201202213152.435886-2-prankgup@fb.com
-
由 Florian Westphal 提交于
The Multipath-TCP standard (RFC 8684) says that an MPTCP host should send a TCP reset if the token in a MP_JOIN request is unknown. At this time we don't do this, the 3whs completes and the 'new subflow' is reset afterwards. There are two ways to allow MPTCP to send the reset. 1. override 'send_synack' callback and emit the rst from there. The drawback is that the request socket gets inserted into the listeners queue just to get removed again right away. 2. Send the reset from the 'route_req' function instead. This avoids the 'add&remove request socket', but route_req lacks the skb that is required to send the TCP reset. Instead of just adding the skb to that function for MPTCP sake alone, Paolo suggested to merge init_req and route_req functions. This saves one indirection from syn processing path and provides the skb to the merged function at the same time. 'send reset on unknown mptcp join token' is added in next patch. Suggested-by: NPaolo Abeni <pabeni@redhat.com> Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: NFlorian Westphal <fw@strlen.de> Signed-off-by: NJakub Kicinski <kuba@kernel.org>
-
- 03 12月, 2020 1 次提交
-
-
由 Stanislav Fomichev 提交于
I have to now lock/unlock socket for the bind hook execution. That shouldn't cause any overhead because the socket is unbound and shouldn't receive any traffic. Signed-off-by: NStanislav Fomichev <sdf@google.com> Signed-off-by: NAlexei Starovoitov <ast@kernel.org> Acked-by: NAndrey Ignatov <rdna@fb.com> Link: https://lore.kernel.org/bpf/20201202172516.3483656-3-sdf@google.com
-
- 01 12月, 2020 1 次提交
-
-
由 Jan Engelhardt 提交于
True to the message of commit v5.10-rc1-105-g46d6c5ae, _do_ actually make use of state->sk when possible, such as in the REJECT modules. Reported-by: NMinqiang Chen <ptpt52@gmail.com> Cc: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: NJan Engelhardt <jengelh@inai.de> Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
-
- 29 11月, 2020 1 次提交
-
-
由 Guillaume Nault 提交于
When inet_rtm_getroute() was converted to use the RCU variants of ip_route_input() and ip_route_output_key(), the TOS parameters stopped being masked with IPTOS_RT_MASK before doing the route lookup. As a result, "ip route get" can return a different route than what would be used when sending real packets. For example: $ ip route add 192.0.2.11/32 dev eth0 $ ip route add unreachable 192.0.2.11/32 tos 2 $ ip route get 192.0.2.11 tos 2 RTNETLINK answers: No route to host But, packets with TOS 2 (ECT(0) if interpreted as an ECN bit) would actually be routed using the first route: $ ping -c 1 -Q 2 192.0.2.11 PING 192.0.2.11 (192.0.2.11) 56(84) bytes of data. 64 bytes from 192.0.2.11: icmp_seq=1 ttl=64 time=0.173 ms --- 192.0.2.11 ping statistics --- 1 packets transmitted, 1 received, 0% packet loss, time 0ms rtt min/avg/max/mdev = 0.173/0.173/0.173/0.000 ms This patch re-applies IPTOS_RT_MASK in inet_rtm_getroute(), to return results consistent with real route lookups. Fixes: 3765d35e ("net: ipv4: Convert inet_rtm_getroute to rcu versions of route lookup") Signed-off-by: NGuillaume Nault <gnault@redhat.com> Reviewed-by: NDavid Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/b2d237d08317ca55926add9654a48409ac1b8f5b.1606412894.git.gnault@redhat.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>
-
- 25 11月, 2020 1 次提交
-
-
由 Alexander Duyck 提交于
When a BPF program is used to select between a type of TCP congestion control algorithm that uses either ECN or not there is a case where the synack for the frame was coming up without the ECT0 bit set. A bit of research found that this was due to the final socket being configured to dctcp while the listener socket was staying in cubic. To reproduce it all that is needed is to monitor TCP traffic while running the sample bpf program "samples/bpf/tcp_cong_kern.c". What is observed, assuming tcp_dctcp module is loaded or compiled in and the traffic matches the rules in the sample file, is that for all frames with the exception of the synack the ECT0 bit is set. To address that it is necessary to make one additional call to tcp_bpf_ca_needs_ecn using the request socket and then use the output of that to set the ECT0 bit for the tos/tclass of the packet. Fixes: 91b5b21c ("bpf: Add support for changing congestion control") Signed-off-by: NAlexander Duyck <alexanderduyck@fb.com> Link: https://lore.kernel.org/r/160593039663.2604.1374502006916871573.stgit@localhost.localdomainSigned-off-by: NJakub Kicinski <kuba@kernel.org>
-
- 24 11月, 2020 2 次提交
-
-
由 Ricardo Dias 提交于
When the TCP stack is in SYN flood mode, the server child socket is created from the SYN cookie received in a TCP packet with the ACK flag set. The child socket is created when the server receives the first TCP packet with a valid SYN cookie from the client. Usually, this packet corresponds to the final step of the TCP 3-way handshake, the ACK packet. But is also possible to receive a valid SYN cookie from the first TCP data packet sent by the client, and thus create a child socket from that SYN cookie. Since a client socket is ready to send data as soon as it receives the SYN+ACK packet from the server, the client can send the ACK packet (sent by the TCP stack code), and the first data packet (sent by the userspace program) almost at the same time, and thus the server will equally receive the two TCP packets with valid SYN cookies almost at the same instant. When such event happens, the TCP stack code has a race condition that occurs between the momement a lookup is done to the established connections hashtable to check for the existence of a connection for the same client, and the moment that the child socket is added to the established connections hashtable. As a consequence, this race condition can lead to a situation where we add two child sockets to the established connections hashtable and deliver two sockets to the userspace program to the same client. This patch fixes the race condition by checking if an existing child socket exists for the same client when we are adding the second child socket to the established connections socket. If an existing child socket exists, we drop the packet and discard the second child socket to the same client. Signed-off-by: NRicardo Dias <rdias@singlestore.com> Signed-off-by: NEric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20201120111133.GA67501@rdias-suse-pc.lanSigned-off-by: NJakub Kicinski <kuba@kernel.org>
-
由 Paul Moore 提交于
As pointed out by Herbert in a recent related patch, the LSM hooks do not have the necessary address family information to use the flowi struct safely. As none of the LSMs currently use any of the protocol specific flowi information, replace the flowi pointers with pointers to the address family independent flowi_common struct. Reported-by: NHerbert Xu <herbert@gondor.apana.org.au> Acked-by: NJames Morris <jamorris@linux.microsoft.com> Signed-off-by: NPaul Moore <paul@paul-moore.com>
-
- 21 11月, 2020 3 次提交
-
-
由 Alexander Duyck 提交于
When setting congestion control via a BPF program it is seen that the SYN/ACK for packets within a given flow will not include the ECT0 flag. A bit of simple printk debugging shows that when this is configured without BPF we will see the value INET_ECN_xmit value initialized in tcp_assign_congestion_control however when we configure this via BPF the socket is in the closed state and as such it isn't configured, and I do not see it being initialized when we transition the socket into the listen state. The result of this is that the ECT0 bit is configured based on whatever the default state is for the socket. Any easy way to reproduce this is to monitor the following with tcpdump: tools/testing/selftests/bpf/test_progs -t bpf_tcp_ca Without this patch the SYN/ACK will follow whatever the default is. If dctcp all SYN/ACK packets will have the ECT0 bit set, and if it is not then ECT0 will be cleared on all SYN/ACK packets. With this patch applied the SYN/ACK bit matches the value seen on the other packets in the given stream. Fixes: 91b5b21c ("bpf: Add support for changing congestion control") Signed-off-by: NAlexander Duyck <alexanderduyck@fb.com> Signed-off-by: NJakub Kicinski <kuba@kernel.org>
-
由 Alexander Duyck 提交于
An issue was recently found where DCTCP SYN/ACK packets did not have the ECT bit set in the L3 header. A bit of code review found that the recent change referenced below had gone though and added a mask that prevented the ECN bits from being populated in the L3 header. This patch addresses that by rolling back the mask so that it is only applied to the flags coming from the incoming TCP request instead of applying it to the socket tos/tclass field. Doing this the ECT bits were restored in the SYN/ACK packets in my testing. One thing that is not addressed by this patch set is the fact that tcp_reflect_tos appears to be incompatible with ECN based congestion avoidance algorithms. At a minimum the feature should likely be documented which it currently isn't. Fixes: ac8f1710 ("tcp: reflect tos value received in SYN to the socket") Signed-off-by: NAlexander Duyck <alexanderduyck@fb.com> Acked-by: NWei Wang <weiwan@google.com> Signed-off-by: NJakub Kicinski <kuba@kernel.org>
-
由 Florian Westphal 提交于
OoO handling attempts to detect when packet is out-of-window by testing current ack sequence and remaining space vs. sequence number. This doesn't work reliably. Store the highest allowed sequence number that we've announced and use it to detect oow packets. Do this when mptcp options get written to the packet (wire format). For this to work we need to move the write_options call until after stack selected a new tcp window. Acked-by: NPaolo Abeni <pabeni@redhat.com> Signed-off-by: NFlorian Westphal <fw@strlen.de> Signed-off-by: NMat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: NJakub Kicinski <kuba@kernel.org>
-
- 18 11月, 2020 3 次提交
-
-
由 Florian Klink 提交于
Checking for ifdef CONFIG_x fails if CONFIG_x=m. Use IS_ENABLED instead, which is true for both built-ins and modules. Otherwise, a > ip -4 route add 1.2.3.4/32 via inet6 fe80::2 dev eth1 fails with the message "Error: IPv6 support not enabled in kernel." if CONFIG_IPV6 is `m`. In the spirit of b8127113. Fixes: d1566268 ("ipv4: Allow ipv6 gateway with ipv4 routes") Cc: Kim Phillips <kim.phillips@arm.com> Signed-off-by: NFlorian Klink <flokli@flokli.de> Reviewed-by: NDavid Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20201115224509.2020651-1-flokli@flokli.deSigned-off-by: NJakub Kicinski <kuba@kernel.org>
-
由 Wang Hai 提交于
nlmsg_cancel() needs to be called in the error path of inet_req_diag_fill to cancel the message. Fixes: d545caca ("net: inet: diag: expose the socket mark to privileged processes.") Reported-by: NHulk Robot <hulkci@huawei.com> Signed-off-by: NWang Hai <wanghai38@huawei.com> Link: https://lore.kernel.org/r/20201116082018.16496-1-wanghai38@huawei.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>
-
由 John Fastabend 提交于
Fix sockmap sk_skb programs so that they observe sk_rcvbuf limits. This allows users to tune SO_RCVBUF and sockmap will honor them. We can refactor the if(charge) case out in later patches. But, keep this fix to the point. Fixes: 51199405 ("bpf: skb_verdict, support SK_PASS on RX BPF path") Suggested-by: NJakub Sitnicki <jakub@cloudflare.com> Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Reviewed-by: NJakub Sitnicki <jakub@cloudflare.com> Link: https://lore.kernel.org/bpf/160556568657.73229.8404601585878439060.stgit@john-XPS-13-9370
-