- 03 10月, 2015 1 次提交
-
-
由 Eric Dumazet 提交于
long term plan is to remove struct listen_sock when its hash table is no longer there. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 30 9月, 2015 3 次提交
-
-
由 Eric Dumazet 提交于
tcp_syn_flood_action() will soon be called with unlocked socket. In order to avoid SYN flood warning being emitted multiple times, use xchg(). Extend max_qlen_log and synflood_warned fields in struct listen_sock to u32 Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
Factorize code to get tcp header from skb. It makes no sense to duplicate code in callers. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
Once we realize tcp_rcv_synsent_state_process() does not use its 'len' argument and we get rid of it, then it becomes clear this argument is no longer used in tcp_rcv_state_process() Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 29 9月, 2015 1 次提交
-
-
由 Eric Dumazet 提交于
We found that a TCP Fast Open passive connection was vulnerable to reorders, as the exchange might look like [1] C -> S S <FO ...> <request> [2] S -> C S. ack request <options> [3] S -> C . <answer> packets [2] and [3] can be generated at almost the same time. If C receives the 3rd packet before the 2nd, it will drop it as the socket is in SYN_SENT state and expects a SYNACK. S will have to retransmit the answer. Current OOO avoidance in linux is defeated because SYNACK packets are attached to the LISTEN socket, while DATA packets are attached to the children. They might be sent by different cpus, and different TX queues might be selected. It turns out that for TFO, we created a child, which is a full blown socket in TCP_SYN_RECV state, and we simply can attach the SYNACK packet to this socket. This means that at the time tcp_sendmsg() pushes DATA packet, skb->ooo_okay will be set iff the SYNACK packet had been sent and TX completed. This removes the reorder source at the host level. We also removed the export of tcp_try_fastopen(), as it is no longer called from IPv6. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NYuchung Cheng <ycheng@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 22 9月, 2015 1 次提交
-
-
由 Yuchung Cheng 提交于
Currently SYN/ACK RTT is measured in jiffies. For LAN the SYN/ACK RTT is often measured as 0ms or sometimes 1ms, which would affect RTT estimation and min RTT samping used by some congestion control. This patch improves SYN/ACK RTT to be usec resolution if platform supports it. While the timestamping of SYN/ACK is done in request sock, the RTT measurement is carefully arranged to avoid storing another u64 timestamp in tcp_sock. For regular handshake w/o SYNACK retransmission, the RTT is sampled right after the child socket is created and right before the request sock is released (tcp_check_req() in tcp_minisocks.c) For Fast Open the child socket is already created when SYN/ACK was sent, the RTT is sampled in tcp_rcv_state_process() after processing the final ACK an right before the request socket is released. If the SYN/ACK was retransmistted or SYN-cookie was used, we rely on TCP timestamps to measure the RTT. The sample is taken at the same place in tcp_rcv_state_process() after the timestamp values are validated in tcp_validate_incoming(). Note that we do not store TS echo value in request_sock for SYN-cookies, because the value is already stored in tp->rx_opt used by tcp_ack_update_rtt(). One side benefit is that the RTT measurement now happens before initializing congestion control (of the passive side). Therefore the congestion control can use the SYN/ACK RTT. Signed-off-by: NYuchung Cheng <ycheng@google.com> Signed-off-by: NNeal Cardwell <ncardwell@google.com> Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 18 9月, 2015 1 次提交
-
-
由 Eric Dumazet 提交于
In commit b73c3d0e ("net: Save TX flow hash in sock and set in skbuf on xmit"), Tom provided a l4 hash to most outgoing TCP packets. We'd like to provide one as well for SYNACK packets, so that all packets of a given flow share same txhash, to later enable bonding driver to also use skb->hash to perform slave selection. Note that a SYNACK retransmit shuffles the tx hash, as Tom did in commit 265f94ff ("net: Recompute sk_txhash on negative routing advice") for established sockets. This has nice effect making TCP flows resilient to some kind of black holes, even at connection establish phase. Signed-off-by: NEric Dumazet <edumazet@google.com> Cc: Tom Herbert <tom@herbertland.com> Cc: Mahesh Bandewar <maheshb@google.com> Acked-by: NTom Herbert <tom@herbertland.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 01 9月, 2015 1 次提交
-
-
由 Daniel Borkmann 提交于
Currently, the following case doesn't use DCTCP, even if it should: A responder has f.e. Cubic as system wide default, but for a specific route to the initiating host, DCTCP is being set in RTAX_CC_ALGO. The initiating host then uses DCTCP as congestion control, but since the initiator sets ECT(0), tcp_ecn_create_request() doesn't set ecn_ok, and we have to fall back to Reno after 3WHS completes. We were thinking on how to solve this in a minimal, non-intrusive way without bloating tcp_ecn_create_request() needlessly: lets cache the CA ecn option flag in RTAX_FEATURES. In other words, when ECT(0) is set on the SYN packet, set ecn_ok=1 iff route RTAX_FEATURES contains the unexposed (internal-only) DST_FEATURE_ECN_CA. This allows to only do a single metric feature lookup inside tcp_ecn_create_request(). Joint work with Florian Westphal. Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Signed-off-by: NFlorian Westphal <fw@strlen.de> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 26 8月, 2015 2 次提交
-
-
由 Eric Dumazet 提交于
When TCP pacing was added back in linux-3.12, we chose to apply a fixed ratio of 200 % against current rate, to allow probing for optimal throughput even during slow start phase, where cwnd can be doubled every other gRTT. At Google, we found it was better applying a different ratio while in Congestion Avoidance phase. This ratio was set to 120 %. We've used the normal tcp_in_slow_start() helper for a while, then tuned the condition to select the conservative ratio as soon as cwnd >= ssthresh/2 : - After cwnd reduction, it is safer to ramp up more slowly, as we approach optimal cwnd. - Initial ramp up (ssthresh == INFINITY) still allows doubling cwnd every other RTT. Signed-off-by: NEric Dumazet <edumazet@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Yuchung Cheng <ycheng@google.com> Acked-by: NNeal Cardwell <ncardwell@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
slow start after idle might reduce cwnd, but we perform this after first packet was cooked and sent. With TSO/GSO, it means that we might send a full TSO packet even if cwnd should have been reduced to IW10. Moving the SSAI check in skb_entail() makes sense, because we slightly reduce number of times this check is done, especially for large send() and TCP Small queue callbacks from softirq context. As Neal pointed out, we also need to perform the check if/when receive window opens. Tested: Following packetdrill test demonstrates the problem // Test of slow start after idle `sysctl -q net.ipv4.tcp_slow_start_after_idle=1` 0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3 +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 +0 bind(3, ..., ...) = 0 +0 listen(3, 1) = 0 +0 < S 0:0(0) win 65535 <mss 1000,sackOK,nop,nop,nop,wscale 7> +0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 6> +.100 < . 1:1(0) ack 1 win 511 +0 accept(3, ..., ...) = 4 +0 setsockopt(4, SOL_SOCKET, SO_SNDBUF, [200000], 4) = 0 +0 write(4, ..., 26000) = 26000 +0 > . 1:5001(5000) ack 1 +0 > . 5001:10001(5000) ack 1 +0 %{ assert tcpi_snd_cwnd == 10 }% +.100 < . 1:1(0) ack 10001 win 511 +0 %{ assert tcpi_snd_cwnd == 20, tcpi_snd_cwnd }% +0 > . 10001:20001(10000) ack 1 +0 > P. 20001:26001(6000) ack 1 +.100 < . 1:1(0) ack 26001 win 511 +0 %{ assert tcpi_snd_cwnd == 36, tcpi_snd_cwnd }% +4 write(4, ..., 20000) = 20000 // If slow start after idle works properly, we should send 5 MSS here (cwnd/2) +0 > . 26001:31001(5000) ack 1 +0 %{ assert tcpi_snd_cwnd == 10, tcpi_snd_cwnd }% +0 > . 31001:36001(5000) ack 1 Signed-off-by: NEric Dumazet <edumazet@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Yuchung Cheng <ycheng@google.com> Acked-by: NNeal Cardwell <ncardwell@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 22 7月, 2015 1 次提交
-
-
由 Rick Jones 提交于
Track success and failure of TCP PMTU probing. Signed-off-by: NRick Jones <rick.jones2@hp.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 16 7月, 2015 1 次提交
-
-
由 Yuchung Cheng 提交于
Currently F-RTO may repeatedly send new data packets on non-recurring timeouts in CA_Loss mode. This is a bug because F-RTO (RFC5682) should only be used on either new recovery or recurring timeouts. This exacerbates the recovery progress during frequent timeout & repair, because we prioritize sending new data packets instead of repairing the holes when the bandwidth is already scarce. Fix it by correcting the test of a new recovery episode. Signed-off-by: NYuchung Cheng <ycheng@google.com> Signed-off-by: NNeal Cardwell <ncardwell@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 10 7月, 2015 2 次提交
-
-
由 Yuchung Cheng 提交于
The congestion state and cwnd can be updated in the wrong order. For example, upon receiving a dubious ACK, we incorrectly raise the cwnd first (tcp_may_raise_cwnd()/tcp_cong_avoid()) because the state is still Open, then enter recovery state to reduce cwnd. For another example, if the ACK indicates spurious timeout or retransmits, we first revert the cwnd reduction and congestion state back to Open state. But we don't raise the cwnd even though the ACK does not indicate any congestion. To fix this problem we should first call tcp_fastretrans_alert() to process the dubious ACK and update the congestion state, then call tcp_may_raise_cwnd() that raises cwnd based on the current state. Signed-off-by: NYuchung Cheng <ycheng@google.com> Signed-off-by: NNeal Cardwell <ncardwell@google.com> Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NNandita Dukkipati <nanditad@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Jon Maxwell 提交于
V1 of this patch contains Eric Dumazet's suggestion to move the per dst RTAX_QUICKACK check into tcp_in_quickack_mode(). Thanks Eric. I ran some tests and after setting the "ip route change quickack 1" knob there were still many delayed ACKs sent. This occured because when icsk_ack.quick=0 the !icsk_ack.pingpong value is subsequently ignored as tcp_in_quickack_mode() checks both these values. The condition for a quick ack to trigger requires that both icsk_ack.quick != 0 and icsk_ack.pingpong=0. Currently only icsk_ack.pingpong is controlled by the knob. But the icsk_ack.quick value changes dynamically depending on heuristics. The crux of the matter is that delayed acks still cannot be entirely disabled even with the RTAX_QUICKACK per dst knob enabled. This patch ensures that a quick ack is always sent when the RTAX_QUICKACK per dst knob is turned on. The "ip route change quickack 1" knob was recently added to enable quickacks. It was modeled around the TCP_QUICKACK setsockopt() option. This issue is that even with "ip route change quickack 1" enabled we still see delayed ACKs under some conditions. It would be nice to be able to completely disable delayed ACKs. Here is an example: # netstat -s|grep dela 3 delayed acks sent For all routes enable the knob # ip route change quickack 1 Generate some traffic across a slow link and we still see the delayed acks. # netstat -s|grep dela 106 delayed acks sent 1 delayed acks further delayed because of locked socket The issue is that both the "ip route change quickack 1" knob and the TCP_QUICKACK option set the icsk_ack.pingpong variable to 0. However at the business end in the __tcp_ack_snd_check() routine, tcp_in_quickack_mode() checks that both icsk_ack.quick != 0 and icsk_ack.pingpong=0 in order to trigger a quickack. As icsk_ack.quick is determined by heuristics it can be 0. When that occurs the icsk_ack.pingpong value is ignored and a delayed ACK is sent regardless. This patch moves the RTAX_QUICKACK per dst check into the tcp_in_quickack_mode() routine which ensures that a quickack is always sent when the quickack knob is enabled for that dst. Signed-off-by: NJon Maxwell <jmaxwell37@gmail.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 09 7月, 2015 2 次提交
-
-
由 Yuchung Cheng 提交于
PRR slow start is often too aggressive especially when drops are caused by traffic policers. The policers mainly use token bucket to enforce the rate so sending (twice) faster than the delivery rate causes excessive drops. This patch changes PRR to the conservative reduction bound (CRB) mode in RFC 6937 by default. CRB follows the packet conservation rule to send at most the delivery rate by default. But if many packets are lost and the pipe is empty, CRB may take N round trips to repair N losses. We conditionally turn on slow start mode if all these conditions are made to speed up the recovery: 1) on the second round or later in recovery 2) retransmission sent in the previous round is delivered on this ACK 3) no retransmission is marked lost on this ACK By using packet conservation by default, this change reduces the loss retransmits signicantly on networks that deploy traffic policers, up to 20% reduction of overall loss rate. Signed-off-by: NYuchung Cheng <ycheng@google.com> Signed-off-by: NNandita Dukkipati <nanditad@google.com> Signed-off-by: NNeal Cardwell <ncardwell@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Yuchung Cheng 提交于
If the retransmission in CA_Loss is lost again, we should not continue to slow start or raise cwnd in congestion avoidance mode. Instead we should enter fast recovery and use PRR to reduce cwnd, following the principle in RFC5681: "... or the loss of a retransmission, should be taken as two indications of congestion and, therefore, cwnd (and ssthresh) MUST be lowered twice in this case." This is especially important to reduce loss when the CA_Loss state was caused by a traffic policer dropping the entire inflight. The CA_Loss state has a problem where a loss of L packets causes the sender to send a burst of L packets. So a policer that's dropping most packets in a given RTT can cause a huge retransmit storm. By contrast, PRR includes logic to bound the number of outbound packets that result from a given ACK. So switching to CA_Recovery on lost retransmits in CA_Loss avoids this retransmit storm problem when in CA_Loss. Signed-off-by: NYuchung Cheng <ycheng@google.com> Signed-off-by: NNandita Dukkipati <nanditad@google.com> Signed-off-by: NNeal Cardwell <ncardwell@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 12 6月, 2015 2 次提交
-
-
由 Eric Dumazet 提交于
In commit cd7d8498 ("tcp: change tcp_skb_pcount() location") we stored gso_segs in a temporary cache hot location. This patch does the same for gso_size. This allows to save 2 cache line misses in tcp xmit path for the last packet that is considered but not sent because of various conditions (cwnd, tso defer, receiver window, TSQ...) Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
Our goal is to touch skb_shinfo(skb) only when absolutely needed, to avoid two cache line misses in TCP output path for last skb that is considered but not sent because of various conditions (cwnd, tso defer, receiver window, TSQ...) A packet is GSO only when skb_shinfo(skb)->gso_size is not zero. We can set skb_shinfo(skb)->gso_type to sk->sk_gso_type even for non GSO packets. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 11 6月, 2015 1 次提交
-
-
由 Kenneth Klette Jonassen 提交于
Upcoming tcp_cdg uses tcp_enter_cwr() to initiate PRR. Export this function so that CDG can be compiled as a module. Cc: Eric Dumazet <edumazet@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Stephen Hemminger <stephen@networkplumber.org> Cc: Neal Cardwell <ncardwell@google.com> Cc: David Hayes <davihay@ifi.uio.no> Cc: Andreas Petlund <apetlund@simula.no> Cc: Dave Taht <dave.taht@bufferbloat.net> Cc: Nicolas Kuhn <nicolas.kuhn@telecom-bretagne.eu> Signed-off-by: NKenneth Klette Jonassen <kennetkl@ifi.uio.no> Acked-by: NYuchung Cheng <ycheng@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 23 5月, 2015 1 次提交
-
-
由 Eric Dumazet 提交于
Taking socket spinlock in tcp_get_info() can deadlock, as inet_diag_dump_icsk() holds the &hashinfo->ehash_locks[i], while packet processing can use the reverse locking order. We could avoid this locking for TCP_LISTEN states, but lockdep would certainly get confused as all TCP sockets share same lockdep classes. [ 523.722504] ====================================================== [ 523.728706] [ INFO: possible circular locking dependency detected ] [ 523.734990] 4.1.0-dbg-DEV #1676 Not tainted [ 523.739202] ------------------------------------------------------- [ 523.745474] ss/18032 is trying to acquire lock: [ 523.750002] (slock-AF_INET){+.-...}, at: [<ffffffff81669d44>] tcp_get_info+0x2c4/0x360 [ 523.758129] [ 523.758129] but task is already holding lock: [ 523.763968] (&(&hashinfo->ehash_locks[i])->rlock){+.-...}, at: [<ffffffff816bcb75>] inet_diag_dump_icsk+0x1d5/0x6c0 [ 523.774661] [ 523.774661] which lock already depends on the new lock. [ 523.774661] [ 523.782850] [ 523.782850] the existing dependency chain (in reverse order) is: [ 523.790326] -> #1 (&(&hashinfo->ehash_locks[i])->rlock){+.-...}: [ 523.796599] [<ffffffff811126bb>] lock_acquire+0xbb/0x270 [ 523.802565] [<ffffffff816f5868>] _raw_spin_lock+0x38/0x50 [ 523.808628] [<ffffffff81665af8>] __inet_hash_nolisten+0x78/0x110 [ 523.815273] [<ffffffff816819db>] tcp_v4_syn_recv_sock+0x24b/0x350 [ 523.822067] [<ffffffff81684d41>] tcp_check_req+0x3c1/0x500 [ 523.828199] [<ffffffff81682d09>] tcp_v4_do_rcv+0x239/0x3d0 [ 523.834331] [<ffffffff816842fe>] tcp_v4_rcv+0xa8e/0xc10 [ 523.840202] [<ffffffff81658fa3>] ip_local_deliver_finish+0x133/0x3e0 [ 523.847214] [<ffffffff81659a9a>] ip_local_deliver+0xaa/0xc0 [ 523.853440] [<ffffffff816593b8>] ip_rcv_finish+0x168/0x5c0 [ 523.859624] [<ffffffff81659db7>] ip_rcv+0x307/0x420 Lets use u64_sync infrastructure instead. As a bonus, 64bit arches get optimized, as these are nop for them. Fixes: 0df48c26 ("tcp: add tcpi_bytes_acked to tcp_info") Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 20 5月, 2015 2 次提交
-
-
由 Yuchung Cheng 提交于
After sending the new data packets to probe (step 2), F-RTO may incorrectly send more probes if the next ACK advances SND_UNA and does not sack new packet. However F-RTO RFC 5682 probes at most once. This bug may cause sender to always send new data instead of repairing holes, inducing longer HoL blocking on the receiver for the application. Signed-off-by: NYuchung Cheng <ycheng@google.com> Signed-off-by: NNeal Cardwell <ncardwell@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Yuchung Cheng 提交于
Undo based on TCP timestamps should only happen on ACKs that advance SND_UNA, according to the Eifel algorithm in RFC 3522: Section 3.2: (4) If the value of the Timestamp Echo Reply field of the acceptable ACK's Timestamps option is smaller than the value of RetransmitTS, then proceed to step (5), Section Terminology: We use the term 'acceptable ACK' as defined in [RFC793]. That is an ACK that acknowledges previously unacknowledged data. This is because upon receiving an out-of-order packet, the receiver returns the last timestamp that advances RCV_NXT, not the current timestamp of the packet in the DUPACK. Without checking the flag, the DUPACK will cause tcp_packet_delayed() to return true and tcp_try_undo_loss() will revert cwnd reduction. Note that we check the condition in CA_Recovery already by only calling tcp_try_undo_partial() if FLAG_SND_UNA_ADVANCED is set or tcp_try_undo_recovery() if snd_una crosses high_seq. Signed-off-by: NYuchung Cheng <ycheng@google.com> Signed-off-by: NNeal Cardwell <ncardwell@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 18 5月, 2015 2 次提交
-
-
由 Eric Dumazet 提交于
While testing tight tcp_mem settings, I found tcp sessions could be stuck because we do not allow even one skb to be received on them. By allowing one skb to be received, we introduce fairness and eventuallu force memory hogs to release their allocation. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
Introduce an optimized version of sk_under_memory_pressure() for TCP. Our intent is to use it in fast paths. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 10 5月, 2015 1 次提交
-
-
由 Eric Dumazet 提交于
With the advent of small rto timers in datacenter TCP, (ip route ... rto_min x), the following can happen : 1) Qdisc is full, transmit fails. TCP sets a timer based on icsk_rto to retry the transmit, without exponential backoff. With low icsk_rto, and lot of sockets, all cpus are servicing timer interrupts like crazy. Intent of the code was to retry with a timer between 200 (TCP_RTO_MIN) and 500ms (TCP_RESOURCE_PROBE_INTERVAL) 2) Receivers can send zero windows if they don't drain their receive queue. TCP sends zero window probes, based on icsk_rto current value, with exponential backoff. With /proc/sys/net/ipv4/tcp_retries2 being 15 (or even smaller in some cases), sender can abort in less than one or two minutes ! If receiver stops the sender, it obviously doesn't care of very tight rto. Probability of dropping the ACK reopening the window is not worth the risk. Lets change the base timer to be at least 200ms (TCP_RTO_MIN) for these events (but not normal RTO based retransmits) A followup patch adds a new SNMP counter, as it would have helped a lot diagnosing this issue. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NYuchung Cheng <ycheng@google.com> Acked-by: NNeal Cardwell <ncardwell@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 06 5月, 2015 1 次提交
-
-
由 Eric Dumazet 提交于
This patch allows a server application to get the TCP SYN headers for its passive connections. This is useful if the server is doing fingerprinting of clients based on SYN packet contents. Two socket options are added: TCP_SAVE_SYN and TCP_SAVED_SYN. The first is used on a socket to enable saving the SYN headers for child connections. This can be set before or after the listen() call. The latter is used to retrieve the SYN headers for passive connections, if the parent listener has enabled TCP_SAVE_SYN. TCP_SAVED_SYN is read once, it frees the saved SYN headers. The data returned in TCP_SAVED_SYN are network (IPv4/IPv6) and TCP headers. Original patch was written by Tom Herbert, I changed it to not hold a full skb (and associated dst and conntracking reference). We have used such patch for about 3 years at Google. Signed-off-by: NEric Dumazet <edumazet@google.com> Acked-by: NNeal Cardwell <ncardwell@google.com> Tested-by: NNeal Cardwell <ncardwell@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 04 5月, 2015 3 次提交
-
-
由 Kenneth Klette Jonassen 提交于
Invoking pkts_acked is currently conditioned on FLAG_ACKED: receiving a cumulative ACK of new data, or ACK with SYN flag set. Remove this condition so that CC may get RTT measurements from all SACKs. Cc: Yuchung Cheng <ycheng@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Neal Cardwell <ncardwell@google.com> Signed-off-by: NKenneth Klette Jonassen <kennetkl@ifi.uio.no> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Kenneth Klette Jonassen 提交于
tcp_sacktag_one() always picks the earliest sequence SACKed for RTT. This might not make sense for congestion control in cases where: 1. ACKs are lost, i.e. a SACK following a lost SACK covers both new and old segments at the receiver. 2. The receiver disregards the RFC 5681 recommendation to immediately ACK out-of-order segments. Give congestion control a RTT for the latest segment SACKed, which is the most accurate RTT estimate, but preserve the conservative RTT for RTO. Removes the call to skb_mstamp_get() in tcp_sacktag_one(). Cc: Yuchung Cheng <ycheng@google.com> Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: NKenneth Klette Jonassen <kennetkl@ifi.uio.no> Acked-by: NYuchung Cheng <ycheng@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Kenneth Klette Jonassen 提交于
Later patch passes two values set in tcp_sacktag_one() to tcp_clean_rtx_queue(). Prepare passing them via struct tcp_sacktag_state. Acked-by: NYuchung Cheng <ycheng@google.com> Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: NKenneth Klette Jonassen <kennetkl@ifi.uio.no> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 30 4月, 2015 3 次提交
-
-
由 Yuchung Cheng 提交于
tcp_mark_lost_retrans is not used when FACK is disabled. Since tcp_update_reordering may disable FACK, it should be called first before tcp_mark_lost_retrans. Signed-off-by: NYuchung Cheng <ycheng@google.com> Signed-off-by: NNandita Dukkipati <nanditad@google.com> Signed-off-by: NNeal Cardwell <ncardwell@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
This patch tracks total number of payload bytes received on a TCP socket. This is the sum of all changes done to tp->rcv_nxt RFC4898 named this : tcpEStatsAppHCThruOctetsReceived This is a 64bit field, and can be fetched both from TCP_INFO getsockopt() if one has a handle on a TCP socket, or from inet_diag netlink facility (iproute2/ss patch will follow) Note that tp->bytes_received was placed near tp->rcv_nxt for best data locality and minimal performance impact. Signed-off-by: NEric Dumazet <edumazet@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Matt Mathis <mattmathis@google.com> Cc: Eric Salo <salo@google.com> Cc: Martin Lau <kafai@fb.com> Cc: Chris Rapier <rapier@psc.edu> Acked-by: NYuchung Cheng <ycheng@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
This patch tracks total number of bytes acked for a TCP socket. This is the sum of all changes done to tp->snd_una, and allows for precise tracking of delivered data. RFC4898 named this : tcpEStatsAppHCThruOctetsAcked This is a 64bit field, and can be fetched both from TCP_INFO getsockopt() if one has a handle on a TCP socket, or from inet_diag netlink facility (iproute2/ss patch will follow) Note that tp->bytes_acked was placed near tp->snd_una for best data locality and minimal performance impact. Signed-off-by: NEric Dumazet <edumazet@google.com> Acked-by: NYuchung Cheng <ycheng@google.com> Cc: Matt Mathis <mattmathis@google.com> Cc: Eric Salo <salo@google.com> Cc: Martin Lau <kafai@fb.com> Cc: Chris Rapier <rapier@psc.edu> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 22 4月, 2015 1 次提交
-
-
由 jbaron@akamai.com 提交于
Ensure that we either see that the buffer has write space in tcp_poll() or that we perform a wakeup from the input side. Did not run into any actual problem here, but thought that we should make things explicit. Signed-off-by: NJason Baron <jbaron@akamai.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 14 4月, 2015 1 次提交
-
-
由 Kenneth Klette Jonassen 提交于
Since retransmitted segments are not used for RTT estimation, previously SACKed segments present in the rtx queue are used. This estimation can be several times larger than the actual RTT. When a cumulative ack covers both previously SACKed and retransmitted segments, CC may thus get a bogus RTT. Such segments previously had an RTT estimation in tcp_sacktag_one(), so it seems reasonable to not reuse them in tcp_clean_rtx_queue() at all. Afaik, this has had no effect on SRTT/RTO because of Karn's check. Signed-off-by: NKenneth Klette Jonassen <kennetkl@ifi.uio.no> Acked-by: NNeal Cardwell <ncardwell@google.com> Tested-by: NNeal Cardwell <ncardwell@google.com> Acked-by: NYuchung Cheng <ycheng@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 08 4月, 2015 2 次提交
-
-
由 Daniel Lee 提交于
Fast Open has been using an experimental option with a magic number (RFC6994). This patch makes the client by default use the RFC7413 option (34) to get and send Fast Open cookies. This patch makes the client solicit cookies from a given server first with the RFC7413 option. If that fails to elicit a cookie, then it tries the RFC6994 experimental option. If that also fails, it uses the RFC7413 option on all subsequent connect attempts. If the server returns a Fast Open cookie then the client caches the form of the option that successfully elicited a cookie, and uses that form on later connects when it presents that cookie. The idea is to gradually obsolete the use of experimental options as the servers and clients upgrade, while keeping the interoperability meanwhile. Signed-off-by: NDaniel Lee <Longinus00@gmail.com> Signed-off-by: NYuchung Cheng <ycheng@google.com> Signed-off-by: NNeal Cardwell <ncardwell@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Daniel Lee 提交于
Fast Open has been using the experimental option with a magic number (RFC6994) to request and grant Fast Open cookies. This patch enables the server to support the official IANA option 34 in RFC7413 in addition. The change has passed all existing Fast Open tests with both old and new options at Google. Signed-off-by: NDaniel Lee <Longinus00@gmail.com> Signed-off-by: NYuchung Cheng <ycheng@google.com> Signed-off-by: NNeal Cardwell <ncardwell@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 04 4月, 2015 2 次提交
-
-
由 Ian Morris 提交于
The ipv4 code uses a mixture of coding styles. In some instances check for non-NULL pointer is done as x != NULL and sometimes as x. x is preferred according to checkpatch and this patch makes the code consistent by adopting the latter form. No changes detected by objdiff. Signed-off-by: NIan Morris <ipm@chirality.org.uk> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Ian Morris 提交于
The ipv4 code uses a mixture of coding styles. In some instances check for NULL pointer is done as x == NULL and sometimes as !x. !x is preferred according to checkpatch and this patch makes the code consistent by adopting the latter form. No changes detected by objdiff. Signed-off-by: NIan Morris <ipm@chirality.org.uk> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 03 4月, 2015 1 次提交
-
-
由 Neal Cardwell 提交于
On processing cumulative ACKs, the FRTO code was not checking the SACKed bit, meaning that there could be a spurious FRTO undo on a cumulative ACK of a previously SACKed skb. The FRTO code should only consider a cumulative ACK to indicate that an original/unretransmitted skb is newly ACKed if the skb was not yet SACKed. The effect of the spurious FRTO undo would typically be to make the connection think that all previously-sent packets were in flight when they really weren't, leading to a stall and an RTO. Signed-off-by: NNeal Cardwell <ncardwell@google.com> Signed-off-by: NYuchung Cheng <ycheng@google.com> Fixes: e33099f9 ("tcp: implement RFC5682 F-RTO") Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 30 3月, 2015 1 次提交
-
-
由 Eric Dumazet 提交于
After commit 1fb6f159 ("tcp: add tcp_conn_request"), tcp_syn_flood_action() is no longer used from IPv6. We can make it static, by moving it above tcp_conn_request() Signed-off-by: NEric Dumazet <edumazet@google.com> Reviewed-by: NOctavian Purdila <octavian.purdila@intel.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-