- 24 1月, 2020 4 次提交
-
-
由 Christoph Paasch 提交于
This implements MP_CAPABLE options parsing and writing according to RFC 6824 bis / RFC 8684: MPTCP v1. Local key is sent on syn/ack, and both keys are sent on 3rd ack. MP_CAPABLE messages len are updated accordingly. We need the skbuff to correctly emit the above, so we push the skbuff struct as an argument all the way from tcp code to the relevant mptcp callbacks. When processing incoming MP_CAPABLE + data, build a full blown DSS-like map info, to simplify later processing. On child socket creation, we need to record the remote key, if available. Signed-off-by: NChristoph Paasch <cpaasch@apple.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Mat Martineau 提交于
Parses incoming DSS options and populates outgoing MPTCP ACK fields. MPTCP fields are parsed from the TCP option header and placed in an skb extension, allowing the upper MPTCP layer to access MPTCP options after the skb has gone through the TCP stack. The subflow implements its own data_ready() ops, which ensures that the pending data is in sequence - according to MPTCP seq number - dropping out-of-seq skbs. The DATA_READY bit flag is set if this is the case. This allows the MPTCP socket layer to determine if more data is available without having to consult the individual subflows. It additionally validates the current mapping and propagates EoF events to the connection socket. Co-developed-by: NPaolo Abeni <pabeni@redhat.com> Signed-off-by: NPaolo Abeni <pabeni@redhat.com> Co-developed-by: NPeter Krystad <peter.krystad@linux.intel.com> Signed-off-by: NPeter Krystad <peter.krystad@linux.intel.com> Co-developed-by: NDavide Caratti <dcaratti@redhat.com> Signed-off-by: NDavide Caratti <dcaratti@redhat.com> Co-developed-by: NMatthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: NMatthieu Baerts <matthieu.baerts@tessares.net> Co-developed-by: NFlorian Westphal <fw@strlen.de> Signed-off-by: NFlorian Westphal <fw@strlen.de> Signed-off-by: NMat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: NChristoph Paasch <cpaasch@apple.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Peter Krystad 提交于
Add hooks to tcp_output.c to add MP_CAPABLE to an outgoing SYN request, to capture the MP_CAPABLE in the received SYN-ACK, to add MP_CAPABLE to the final ACK of the three-way handshake. Use the .sk_rx_dst_set() handler in the subflow proto to capture when the responding SYN-ACK is received and notify the MPTCP connection layer. Co-developed-by: NPaolo Abeni <pabeni@redhat.com> Signed-off-by: NPaolo Abeni <pabeni@redhat.com> Co-developed-by: NFlorian Westphal <fw@strlen.de> Signed-off-by: NFlorian Westphal <fw@strlen.de> Signed-off-by: NPeter Krystad <peter.krystad@linux.intel.com> Signed-off-by: NChristoph Paasch <cpaasch@apple.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Peter Krystad 提交于
Add hooks to parse and format the MP_CAPABLE option. This option is handled according to MPTCP version 0 (RFC6824). MPTCP version 1 MP_CAPABLE (RFC6824bis/RFC8684) will be added later in coordination with related code changes. Co-developed-by: NMatthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: NMatthieu Baerts <matthieu.baerts@tessares.net> Co-developed-by: NFlorian Westphal <fw@strlen.de> Signed-off-by: NFlorian Westphal <fw@strlen.de> Co-developed-by: NDavide Caratti <dcaratti@redhat.com> Signed-off-by: NDavide Caratti <dcaratti@redhat.com> Signed-off-by: NPeter Krystad <peter.krystad@linux.intel.com> Signed-off-by: NChristoph Paasch <cpaasch@apple.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 16 1月, 2020 1 次提交
-
-
由 Pengcheng Yang 提交于
When the packet pointed to by retransmit_skb_hint is unlinked by ACK, retransmit_skb_hint will be set to NULL in tcp_clean_rtx_queue(). If packet loss is detected at this time, retransmit_skb_hint will be set to point to the current packet loss in tcp_verify_retransmit_hint(), then the packets that were previously marked lost but not retransmitted due to the restriction of cwnd will be skipped and cannot be retransmitted. To fix this, when retransmit_skb_hint is NULL, retransmit_skb_hint can be reset only after all marked lost packets are retransmitted (retrans_out >= lost_out), otherwise we need to traverse from tcp_rtx_queue_head in tcp_xmit_retransmit_queue(). Packetdrill to demonstrate: // Disable RACK and set max_reordering to keep things simple 0 `sysctl -q net.ipv4.tcp_recovery=0` +0 `sysctl -q net.ipv4.tcp_max_reordering=3` // Establish a connection +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3 +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 +0 bind(3, ..., ...) = 0 +0 listen(3, 1) = 0 +.1 < S 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7> +0 > S. 0:0(0) ack 1 <...> +.01 < . 1:1(0) ack 1 win 257 +0 accept(3, ..., ...) = 4 // Send 8 data segments +0 write(4, ..., 8000) = 8000 +0 > P. 1:8001(8000) ack 1 // Enter recovery and 1:3001 is marked lost +.01 < . 1:1(0) ack 1 win 257 <sack 3001:4001,nop,nop> +0 < . 1:1(0) ack 1 win 257 <sack 5001:6001 3001:4001,nop,nop> +0 < . 1:1(0) ack 1 win 257 <sack 5001:7001 3001:4001,nop,nop> // Retransmit 1:1001, now retransmit_skb_hint points to 1001:2001 +0 > . 1:1001(1000) ack 1 // 1001:2001 was ACKed causing retransmit_skb_hint to be set to NULL +.01 < . 1:1(0) ack 2001 win 257 <sack 5001:8001 3001:4001,nop,nop> // Now retransmit_skb_hint points to 4001:5001 which is now marked lost // BUG: 2001:3001 was not retransmitted +0 > . 2001:3001(1000) ack 1 Signed-off-by: NPengcheng Yang <yangpc@wangsu.com> Acked-by: NNeal Cardwell <ncardwell@google.com> Tested-by: NNeal Cardwell <ncardwell@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 10 1月, 2020 1 次提交
-
-
由 Mat Martineau 提交于
Coalesce and collapse of packets carrying MPTCP extensions is allowed when the newer packet has no extension or the extensions carried by both packets are equal. This allows merging of TSO packet trains and even cross-TSO packets, and does not require any additional action when moving data into existing SKBs. v3 -> v4: - allow collapsing, under mptcp_skb_can_collapse() constraint v5 -> v6: - clarify MPTCP skb extensions must always be cleared at allocation time Co-developed-by: NPaolo Abeni <pabeni@redhat.com> Signed-off-by: NPaolo Abeni <pabeni@redhat.com> Signed-off-by: NMat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 03 1月, 2020 2 次提交
-
-
由 Mao Wenan 提交于
REXMIT_NEW is a macro for "FRTO-style transmit of unsent/new packets", this patch makes it more readable. Signed-off-by: NMao Wenan <maowenan@huawei.com> Acked-by: NNeal Cardwell <ncardwell@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Pengcheng Yang 提交于
When we receive a D-SACK, where the sequence number satisfies: undo_marker <= start_seq < end_seq <= prior_snd_una we consider this is a valid D-SACK and tcp_is_sackblock_valid() returns true, then this D-SACK is discarded as "old stuff", but the variable first_sack_index is not marked as negative in tcp_sacktag_write_queue(). If this D-SACK also carries a SACK that needs to be processed (for example, the previous SACK segment was lost), this SACK will be treated as a D-SACK in the following processing of tcp_sacktag_write_queue(), which will eventually lead to incorrect updates of undo_retrans and reordering. Fixes: fd6dad61 ("[TCP]: Earlier SACK block verification & simplify access to them") Signed-off-by: NPengcheng Yang <yangpc@wangsu.com> Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 26 10月, 2019 1 次提交
-
-
由 Jason Baron 提交于
The TCPI_OPT_SYN_DATA bit as part of tcpi_options currently reports whether or not data-in-SYN was ack'd on both the client and server side. We'd like to gather more information on the client-side in the failure case in order to indicate the reason for the failure. This can be useful for not only debugging TFO, but also for creating TFO socket policies. For example, if a middle box removes the TFO option or drops a data-in-SYN, we can can detect this case, and turn off TFO for these connections saving the extra retransmits. The newly added tcpi_fastopen_client_fail status is 2 bits and has the following 4 states: 1) TFO_STATUS_UNSPEC Catch-all state which includes when TFO is disabled via black hole detection, which is indicated via LINUX_MIB_TCPFASTOPENBLACKHOLE. 2) TFO_COOKIE_UNAVAILABLE If TFO_CLIENT_NO_COOKIE mode is off, this state indicates that no cookie is available in the cache. 3) TFO_DATA_NOT_ACKED Data was sent with SYN, we received a SYN/ACK but it did not cover the data portion. Cookie is not accepted by server because the cookie may be invalid or the server may be overloaded. 4) TFO_SYN_RETRANSMITTED Data was sent with SYN, we received a SYN/ACK which did not cover the data after at least 1 additional SYN was sent (without data). It may be the case that a middle-box is dropping data-in-SYN packets. Thus, it would be more efficient to not use TFO on this connection to avoid extra retransmits during connection establishment. These new fields do not cover all the cases where TFO may fail, but other failures, such as SYN/ACK + data being dropped, will result in the connection not becoming established. And a connection blackhole after session establishment shows up as a stalled connection. Signed-off-by: NJason Baron <jbaron@akamai.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Christoph Paasch <cpaasch@apple.com> Cc: Yuchung Cheng <ycheng@google.com> Acked-by: NYuchung Cheng <ycheng@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 14 10月, 2019 6 次提交
-
-
由 Eric Dumazet 提交于
For the sake of tcp_poll(), there are few places where we fetch sk->sk_sndbuf while this field can change from IRQ or other cpu. We need to add READ_ONCE() annotations, and also make sure write sides use corresponding WRITE_ONCE() to avoid store-tearing. Note that other transports probably need similar fixes. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
For the sake of tcp_poll(), there are few places where we fetch sk->sk_rcvbuf while this field can change from IRQ or other cpu. We need to add READ_ONCE() annotations, and also make sure write sides use corresponding WRITE_ONCE() to avoid store-tearing. Note that other transports probably need similar fixes. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
There two places where we fetch tp->urg_seq while this field can change from IRQ or other cpu. We need to add READ_ONCE() annotations, and also make sure write side use corresponding WRITE_ONCE() to avoid store-tearing. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
There are few places where we fetch tp->copied_seq while this field can change from IRQ or other cpu. We need to add READ_ONCE() annotations, and also make sure write sides use corresponding WRITE_ONCE() to avoid store-tearing. Note that tcp_inq_hint() was already using READ_ONCE(tp->copied_seq) Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
There are few places where we fetch tp->rcv_nxt while this field can change from IRQ or other cpu. We need to add READ_ONCE() annotations, and also make sure write sides use corresponding WRITE_ONCE() to avoid store-tearing. Note that tcp_inq_hint() was already using READ_ONCE(tp->rcv_nxt) syzbot reported : BUG: KCSAN: data-race in tcp_poll / tcp_queue_rcv write to 0xffff888120425770 of 4 bytes by interrupt on cpu 0: tcp_rcv_nxt_update net/ipv4/tcp_input.c:3365 [inline] tcp_queue_rcv+0x180/0x380 net/ipv4/tcp_input.c:4638 tcp_rcv_established+0xbf1/0xf50 net/ipv4/tcp_input.c:5616 tcp_v4_do_rcv+0x381/0x4e0 net/ipv4/tcp_ipv4.c:1542 tcp_v4_rcv+0x1a03/0x1bf0 net/ipv4/tcp_ipv4.c:1923 ip_protocol_deliver_rcu+0x51/0x470 net/ipv4/ip_input.c:204 ip_local_deliver_finish+0x110/0x140 net/ipv4/ip_input.c:231 NF_HOOK include/linux/netfilter.h:305 [inline] NF_HOOK include/linux/netfilter.h:299 [inline] ip_local_deliver+0x133/0x210 net/ipv4/ip_input.c:252 dst_input include/net/dst.h:442 [inline] ip_rcv_finish+0x121/0x160 net/ipv4/ip_input.c:413 NF_HOOK include/linux/netfilter.h:305 [inline] NF_HOOK include/linux/netfilter.h:299 [inline] ip_rcv+0x18f/0x1a0 net/ipv4/ip_input.c:523 __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5004 __netif_receive_skb+0x37/0xf0 net/core/dev.c:5118 netif_receive_skb_internal+0x59/0x190 net/core/dev.c:5208 napi_skb_finish net/core/dev.c:5671 [inline] napi_gro_receive+0x28f/0x330 net/core/dev.c:5704 receive_buf+0x284/0x30b0 drivers/net/virtio_net.c:1061 read to 0xffff888120425770 of 4 bytes by task 7254 on cpu 1: tcp_stream_is_readable net/ipv4/tcp.c:480 [inline] tcp_poll+0x204/0x6b0 net/ipv4/tcp.c:554 sock_poll+0xed/0x250 net/socket.c:1256 vfs_poll include/linux/poll.h:90 [inline] ep_item_poll.isra.0+0x90/0x190 fs/eventpoll.c:892 ep_send_events_proc+0x113/0x5c0 fs/eventpoll.c:1749 ep_scan_ready_list.constprop.0+0x189/0x500 fs/eventpoll.c:704 ep_send_events fs/eventpoll.c:1793 [inline] ep_poll+0xe3/0x900 fs/eventpoll.c:1930 do_epoll_wait+0x162/0x180 fs/eventpoll.c:2294 __do_sys_epoll_pwait fs/eventpoll.c:2325 [inline] __se_sys_epoll_pwait fs/eventpoll.c:2311 [inline] __x64_sys_epoll_pwait+0xcd/0x170 fs/eventpoll.c:2311 do_syscall_64+0xcf/0x2f0 arch/x86/entry/common.c:296 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Reported by Kernel Concurrency Sanitizer on: CPU: 1 PID: 7254 Comm: syz-fuzzer Not tainted 5.3.0+ #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Signed-off-by: NEric Dumazet <edumazet@google.com> Reported-by: Nsyzbot <syzkaller@googlegroups.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
Both tcp_v4_err() and tcp_v6_err() do the following operations while they do not own the socket lock : fastopen = tp->fastopen_rsk; snd_una = fastopen ? tcp_rsk(fastopen)->snt_isn : tp->snd_una; The problem is that without appropriate barrier, the compiler might reload tp->fastopen_rsk and trigger a NULL deref. request sockets are protected by RCU, we can simply add the missing annotations and barriers to solve the issue. Fixes: 168a8f58 ("tcp: TCP Fast Open Server - main code path") Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 16 9月, 2019 1 次提交
-
-
由 Thomas Higdon 提交于
For receive-heavy cases on the server-side, we want to track the connection quality for individual client IPs. This counter, similar to the existing system-wide TCPOFOQueue counter in /proc/net/netstat, tracks out-of-order packet reception. By providing this counter in TCP_INFO, it will allow understanding to what degree receive-heavy sockets are experiencing out-of-order delivery and packet drops indicating congestion. Please note that this is similar to the counter in NetBSD TCP_INFO, and has the same name. Also note that we avoid increasing the size of the tcp_sock struct by taking advantage of a hole. Signed-off-by: NThomas Higdon <tph@fb.com> Acked-by: NNeal Cardwell <ncardwell@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 12 9月, 2019 1 次提交
-
-
由 Neal Cardwell 提交于
Fix tcp_ecn_withdraw_cwr() to clear the correct bit: TCP_ECN_QUEUE_CWR. Rationale: basically, TCP_ECN_DEMAND_CWR is a bit that is purely about the behavior of data receivers, and deciding whether to reflect incoming IP ECN CE marks as outgoing TCP th->ece marks. The TCP_ECN_QUEUE_CWR bit is purely about the behavior of data senders, and deciding whether to send CWR. The tcp_ecn_withdraw_cwr() function is only called from tcp_undo_cwnd_reduction() by data senders during an undo, so it should zero the sender-side state, TCP_ECN_QUEUE_CWR. It does not make sense to stop the reflection of incoming CE bits on incoming data packets just because outgoing packets were spuriously retransmitted. The bug has been reproduced with packetdrill to manifest in a scenario with RFC3168 ECN, with an incoming data packet with CE bit set and carrying a TCP timestamp value that causes cwnd undo. Before this fix, the IP CE bit was ignored and not reflected in the TCP ECE header bit, and sender sent a TCP CWR ('W') bit on the next outgoing data packet, even though the cwnd reduction had been undone. After this fix, the sender properly reflects the CE bit and does not set the W bit. Note: the bug actually predates 2005 git history; this Fixes footer is chosen to be the oldest SHA1 I have tested (from Sep 2007) for which the patch applies cleanly (since before this commit the code was in a .h file). Fixes: bdf1ee5d ("[TCP]: Move code from tcp_ecn.h to tcp*.c and tcp.h & remove it") Signed-off-by: NNeal Cardwell <ncardwell@google.com> Acked-by: NYuchung Cheng <ycheng@google.com> Acked-by: NSoheil Hassas Yeganeh <soheil@google.com> Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 31 7月, 2019 2 次提交
-
-
由 Petar Penkov 提交于
This patch allows generation of a SYN cookie before an SKB has been allocated, as is the case at XDP. Signed-off-by: NPetar Penkov <ppenkov@google.com> Reviewed-by: NLorenz Bauer <lmb@cloudflare.com> Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
-
由 Petar Penkov 提交于
This allows us to call this function before an SKB has been allocated. Signed-off-by: NPetar Penkov <ppenkov@google.com> Reviewed-by: NLorenz Bauer <lmb@cloudflare.com> Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
-
- 03 7月, 2019 1 次提交
-
-
由 Stanislav Fomichev 提交于
Performance impact should be minimal because it's under a new BPF_SOCK_OPS_RTT_CB_FLAG flag that has to be explicitly enabled. Suggested-by: NEric Dumazet <edumazet@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Priyaranjan Jha <priyarjha@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: NSoheil Hassas Yeganeh <soheil@google.com> Acked-by: NYuchung Cheng <ycheng@google.com> Signed-off-by: NStanislav Fomichev <sdf@google.com> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
-
- 16 6月, 2019 1 次提交
-
-
由 Eric Dumazet 提交于
Jonathan Looney reported that TCP can trigger the following crash in tcp_shifted_skb() : BUG_ON(tcp_skb_pcount(skb) < pcount); This can happen if the remote peer has advertized the smallest MSS that linux TCP accepts : 48 An skb can hold 17 fragments, and each fragment can hold 32KB on x86, or 64KB on PowerPC. This means that the 16bit witdh of TCP_SKB_CB(skb)->tcp_gso_segs can overflow. Note that tcp_sendmsg() builds skbs with less than 64KB of payload, so this problem needs SACK to be enabled. SACK blocks allow TCP to coalesce multiple skbs in the retransmit queue, thus filling the 17 fragments to maximal capacity. CVE-2019-11477 -- u16 overflow of TCP_SKB_CB(skb)->tcp_gso_segs Fixes: 832d11c5 ("tcp: Try to restore large SKBs while SACK processing") Signed-off-by: NEric Dumazet <edumazet@google.com> Reported-by: NJonathan Looney <jtl@netflix.com> Acked-by: NNeal Cardwell <ncardwell@google.com> Reviewed-by: NTyler Hicks <tyhicks@canonical.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Bruce Curtis <brucec@netflix.com> Cc: Jonathan Lemon <jonathan.lemon@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 15 6月, 2019 1 次提交
-
-
由 Willem de Bruijn 提交于
Deferred static key clean_acked_data_enabled uses the deferred variants of dec and flush. Do the same for inc. Signed-off-by: NWillem de Bruijn <willemb@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 10 6月, 2019 1 次提交
-
-
由 Yuchung Cheng 提交于
Commit 794200d6 ("tcp: undo cwnd on Fast Open spurious SYNACK retransmit") may cause tcp_fastretrans_alert() to warn about pending retransmission in Open state. This is triggered when the Fast Open server both sends data and has spurious SYNACK retransmission during the handshake, and the data packets were lost or reordered. The root cause is a bit complicated: (1) Upon receiving SYN-data: a full socket is created with snd_una = ISN + 1 by tcp_create_openreq_child() (2) On SYNACK timeout the server/sender enters CA_Loss state. (3) Upon receiving the final ACK to complete the handshake, sender does not mark FLAG_SND_UNA_ADVANCED since (1) Sender then calls tcp_process_loss since state is CA_loss by (2) (4) tcp_process_loss() does not invoke undo operations but instead mark REXMIT_LOST to force retransmission (5) tcp_rcv_synrecv_state_fastopen() calls tcp_try_undo_loss(). It changes state to CA_Open but has positive tp->retrans_out (6) Next ACK triggers the WARN_ON in tcp_fastretrans_alert() The step that goes wrong is (4) where the undo operation should have been invoked because the ACK successfully acknowledged the SYN sequence. This fixes that by specifically checking undo when the SYN-ACK sequence is acknowledged. Then after tcp_process_loss() the state would be further adjusted based in tcp_fastretrans_alert() to avoid triggering the warning in (6). Fixes: 794200d6 ("tcp: undo cwnd on Fast Open spurious SYNACK retransmit") Signed-off-by: NYuchung Cheng <ycheng@google.com> Signed-off-by: NNeal Cardwell <ncardwell@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 31 5月, 2019 1 次提交
-
-
由 Young Xiao 提交于
The TCP option parsing routines in tcp_parse_options function could read one byte out of the buffer of the TCP options. 1 while (length > 0) { 2 int opcode = *ptr++; 3 int opsize; 4 5 switch (opcode) { 6 case TCPOPT_EOL: 7 return; 8 case TCPOPT_NOP: /* Ref: RFC 793 section 3.1 */ 9 length--; 10 continue; 11 default: 12 opsize = *ptr++; //out of bound access If length = 1, then there is an access in line2. And another access is occurred in line 12. This would lead to out-of-bound access. Therefore, in the patch we check that the available data length is larger enough to pase both TCP option code and size. Signed-off-by: NYoung Xiao <92siuyang@gmail.com> Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 15 5月, 2019 1 次提交
-
-
由 Yuchung Cheng 提交于
Commit c7d13c8f ("tcp: properly track retry time on passive Fast Open") sets the start of SYNACK retransmission time on passive Fast Open in "retrans_stamp". However the timestamp is not reset upon the handshake has completed. As a result, future data packet retransmission may not update it in tcp_retransmit_skb(). This may lead to socket aborting earlier unexpectedly by retransmits_timed_out() since retrans_stamp remains the SYNACK rtx time. This bug only manifests on passive TFO sender that a) suffered SYNACK timeout and then b) stalls on very first loss recovery. Any successful loss recovery would reset the timestamp to avoid this issue. Fixes: c7d13c8f ("tcp: properly track retry time on passive Fast Open") Signed-off-by: NYuchung Cheng <ycheng@google.com> Signed-off-by: NNeal Cardwell <ncardwell@google.com> Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 10 5月, 2019 1 次提交
-
-
由 Jakub Kicinski 提交于
User space can flip the clean_acked_data_enabled static branch on and off with TLS offload when CONFIG_TLS_DEVICE is enabled. jump_label.h suggests we use the delayed version in this case. Deferred branches now also don't take the branch mutex on decrement, so we avoid potential locking issues. Signed-off-by: NJakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: NSimon Horman <simon.horman@netronome.com> Reviewed-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 01 5月, 2019 7 次提交
-
-
由 Yuchung Cheng 提交于
Relocate the congestion window initialization from tcp_init_metrics() to tcp_init_transfer() to improve code readability. Signed-off-by: NYuchung Cheng <ycheng@google.com> Signed-off-by: NNeal Cardwell <ncardwell@google.com> Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com> Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Yuchung Cheng 提交于
Use a helper to consolidate two identical code block for passive TFO. Signed-off-by: NYuchung Cheng <ycheng@google.com> Signed-off-by: NNeal Cardwell <ncardwell@google.com> Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com> Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Yuchung Cheng 提交于
This patch makes passive Fast Open reverts the cwnd to default initial cwnd (10 packets) if the SYNACK timeout is spurious. Passive Fast Open uses a full socket during handshake so it can use the existing undo logic to detect spurious retransmission by recording the first SYNACK timeout in key state variable retrans_stamp. Upon receiving the ACK of the SYNACK, if the socket has sent some data before the timeout, the spurious timeout is detected by tcp_try_undo_recovery() in tcp_process_loss() in tcp_ack(). But if the socket has not send any data yet, tcp_ack() does not execute the undo code since no data is acknowledged. The fix is to check such case explicitly after tcp_ack() during the ACK processing in SYN_RECV state. In addition this is checked in FIN_WAIT_1 state in case the server closes the socket before handshake completes. Signed-off-by: NYuchung Cheng <ycheng@google.com> Signed-off-by: NNeal Cardwell <ncardwell@google.com> Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com> Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Yuchung Cheng 提交于
Linux implements RFC6298 and use an initial congestion window of 1 upon establishing the connection if the SYNACK packet is retransmitted 2 or more times. In cellular networks SYNACK timeouts are often spurious if the wireless radio was dormant or idle. Also some network path is longer than the default SYNACK timeout. In both cases falsely starting with a minimal cwnd are detrimental to performance. This patch avoids doing so when the final ACK's TCP timestamp indicates the original SYNACK was delivered. It remembers the original SYNACK timestamp when SYNACK timeout has occurred and re-uses the function to detect spurious SYN timeout conveniently. Note that a server may receives multiple SYNs from and immediately retransmits SYNACKs without any SYNACK timeout. This often happens on when the client SYNs have timed out due to wireless delay above. In this case since the server will still use the default initial congestion (e.g. 10) because tp->undo_marker is reset in tcp_init_metrics(). This is an intentional design because packets are not lost but delayed. This patch only covers regular TCP passive open. Fast Open is supported in the next patch. Signed-off-by: NYuchung Cheng <ycheng@google.com> Signed-off-by: NNeal Cardwell <ncardwell@google.com> Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Yuchung Cheng 提交于
Detecting spurious SYNACK timeout using timestamp option requires recording the exact SYNACK skb timestamp. Previously the SYNACK sent timestamp was stamped slightly earlier before the skb was transmitted. This patch uses the SYNACK skb transmission timestamp directly. Signed-off-by: NYuchung Cheng <ycheng@google.com> Signed-off-by: NNeal Cardwell <ncardwell@google.com> Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Yuchung Cheng 提交于
Linux implements RFC6298 and use an initial congestion window of 1 upon establishing the connection if the SYN packet is retransmitted 2 or more times. In cellular networks SYN timeouts are often spurious if the wireless radio was dormant or idle. Also some network path is longer than the default SYN timeout. Having a minimal cwnd on both cases are detrimental to TCP startup performance. This patch extends TCP undo feature (RFC3522 aka TCP Eifel) to detect spurious SYN timeout via TCP timestamps. Since tp->retrans_stamp records the initial SYN timestamp instead of first retransmission, we have to implement a different undo code additionally. The detection also must happen before tcp_ack() as retrans_stamp is reset when SYN is acknowledged. Note this patch covers both active regular and fast open. Signed-off-by: NYuchung Cheng <ycheng@google.com> Signed-off-by: NNeal Cardwell <ncardwell@google.com> Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Yuchung Cheng 提交于
Previously if an active TCP open has SYN timeout, it always undo the cwnd upon receiving the SYNACK. This is because tcp_clean_rtx_queue would reset tp->retrans_stamp when SYN is acked, which fools then tcp_try_undo_loss and tcp_packet_delayed. Addressing this issue is required to properly support undo for spurious SYN timeout. Fixing this is tricky -- for active TCP open tp->retrans_stamp records the time when the handshake starts, not the first retransmission time as the name may suggest. The simplest fix is for tcp_packet_delayed to ensure it is valid before comparing with other timestamp. One side effect of this change is active TCP Fast Open that incurred SYN timeout. Upon receiving a SYN-ACK that only acknowledged the SYN, it would immediately retransmit unacknowledged data in tcp_ack() because the data is marked lost after SYN timeout. But the retransmission would have an incorrect ack sequence number since rcv_nxt has not been updated yet tcp_rcv_synsent_state_process(), the retransmission needs to properly handed by tcp_rcv_fastopen_synack() like before. Signed-off-by: NYuchung Cheng <ycheng@google.com> Signed-off-by: NNeal Cardwell <ncardwell@google.com> Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 17 4月, 2019 1 次提交
-
-
由 Eric Dumazet 提交于
For some reason, tcp_grow_window() correctly tests if enough room is present before attempting to increase tp->rcv_ssthresh, but does not prevent it to grow past tcp_space() This is causing hard to debug issues, like failing the (__tcp_select_window(sk) >= tp->rcv_wnd) test in __tcp_ack_snd_check(), causing ACK delays and possibly slow flows. Depending on tcp_rmem[2], MTU, skb->len/skb->truesize ratio, we can see the problem happening on "netperf -t TCP_RR -- -r 2000,2000" after about 60 round trips, when the active side no longer sends immediate acks. This bug predates git history. Signed-off-by: NEric Dumazet <edumazet@google.com> Acked-by: NSoheil Hassas Yeganeh <soheil@google.com> Acked-by: NNeal Cardwell <ncardwell@google.com> Acked-by: NWei Wang <weiwan@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 05 4月, 2019 1 次提交
-
-
Linux currently disable ECN for incoming connections when the SYN requests ECN and the IP header has ECT(0)/ECT(1) set, as some networks were reportedly mangling the ToS byte, hence could later trigger false congestion notifications. RFC8311 §4.3 relaxes RFC3168's requirements such that ECT can be set one TCP control packets (including SYNs). The main benefit of this is the decreased probability of losing a SYN in a congested ECN-capable network (i.e., it avoids the initial 1s timeout). Additionally, this allows the development of newer TCP extensions, such as AccECN. This patch relaxes the previous check, by enabling ECN on incoming connections using SYN+ECT if at least one bit of the reserved flags of the TCP header is set. Such bit would indicate that the sender of the SYN is using a newer TCP feature than what the host implements, such as AccECN, and is thus implementing RFC8311. This enables end-hosts not supporting such extensions to still negociate ECN, and to have some of the benefits of using ECN on control packets. Signed-off-by: NOlivier Tilmans <olivier.tilmans@nokia-bell-labs.com> Suggested-by: NBob Briscoe <research@bobbriscoe.net> Cc: Koen De Schepper <koen.de_schepper@nokia-bell-labs.com> Signed-off-by: NEric Dumazet <edumazet@google.com> Acked-by: NNeal Cardwell <ncardwell@google.com> Acked-by: NYuchung Cheng <ycheng@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 20 3月, 2019 1 次提交
-
-
由 Guillaume Nault 提交于
Since the request socket is created locally, it'd make more sense to use reqsk_free() instead of reqsk_put() in TFO and syncookies' error path. However, tcp_get_cookie_sock() may set ->rsk_refcnt before freeing the socket; tcp_conn_request() may also have non-null ->rsk_refcnt because of tcp_try_fastopen(). In both cases 'req' hasn't been exposed to the outside world and is safe to free immediately, but that'd trigger the WARN_ON_ONCE in reqsk_free(). Define __reqsk_free() for these situations where we know nobody's referencing the socket, even though ->rsk_refcnt might be non-null. Now we can consolidate the error path of tcp_get_cookie_sock() and tcp_conn_request(). Signed-off-by: NGuillaume Nault <gnault@redhat.com> Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 09 3月, 2019 1 次提交
-
-
由 Guillaume Nault 提交于
Commit 7716682c ("tcp/dccp: fix another race at listener dismantle") let inet_csk_reqsk_queue_add() fail, and adjusted {tcp,dccp}_check_req() accordingly. However, TFO and syncookies weren't modified, thus leaking allocated resources on error. Contrary to tcp_check_req(), in both syncookies and TFO cases, we need to drop the request socket. Also, since the child socket is created with inet_csk_clone_lock(), we have to unlock it and drop an extra reference (->sk_refcount is initially set to 2 and inet_csk_reqsk_queue_add() drops only one ref). For TFO, we also need to revert the work done by tcp_try_fastopen() (with reqsk_fastopen_remove()). Fixes: 7716682c ("tcp/dccp: fix another race at listener dismantle") Signed-off-by: NGuillaume Nault <gnault@redhat.com> Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 26 2月, 2019 2 次提交
-
-
由 Yafang Shao 提交于
Per discussion with Daniel[1] and Eric[2], these SOCK_DEBUG() calles in TCP are not needed now. We'd better clean up it. [1] https://patchwork.ozlabs.org/patch/1035573/ [2] https://patchwork.ozlabs.org/patch/1040533/Signed-off-by: NYafang Shao <laoar.shao@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Taehee Yoo 提交于
parameter state in the tcp_sacktag_bsearch() is not used. So, it can be removed. Signed-off-by: NTaehee Yoo <ap420073@gmail.com> Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 28 1月, 2019 1 次提交
-
-
由 Wei Wang 提交于
Instead of using pingpong as a single bit information, we refactor the code to treat it as a counter. When interactive session is detected, we set pingpong count to TCP_PINGPONG_THRESH. And when pingpong count is >= TCP_PINGPONG_THRESH, we consider the session in pingpong mode. This patch is a pure refactor and sets foundation for the next patch. This patch itself does not change any pingpong logic. Signed-off-by: NWei Wang <weiwan@google.com> Signed-off-by: NYuchung Cheng <ycheng@google.com> Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-