- 21 9月, 2016 5 次提交
-
-
由 Yuchung Cheng 提交于
This commit introduces an optional new "omnipotent" hook, cong_control(), for congestion control modules. The cong_control() function is called at the end of processing an ACK (i.e., after updating sequence numbers, the SACK scoreboard, and loss detection). At that moment we have precise delivery rate information the congestion control module can use to control the sending behavior (using cwnd, TSO skb size, and pacing rate) in any CA state. This function can also be used by a congestion control that prefers not to use the default cwnd reduction approach (i.e., the PRR algorithm) during CA_Recovery to control the cwnd and sending rate during loss recovery. We take advantage of the fact that recent changes defer the retransmission or transmission of new data (e.g. by F-RTO) in recovery until the new tcp_cong_control() function is run. With this commit, we only run tcp_update_pacing_rate() if the congestion control is not using this new API. New congestion controls which use the new API do not want the TCP stack to run the default pacing rate calculation and overwrite whatever pacing rate they have chosen at initialization time. Signed-off-by: NVan Jacobson <vanj@google.com> Signed-off-by: NNeal Cardwell <ncardwell@google.com> Signed-off-by: NYuchung Cheng <ycheng@google.com> Signed-off-by: NNandita Dukkipati <nanditad@google.com> Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Yuchung Cheng 提交于
Currently the TCP send buffer expands to twice cwnd, in order to allow limited transmits in the CA_Recovery state. This assumes that cwnd does not increase in the CA_Recovery. For some congestion control algorithms, like the upcoming BBR module, if the losses in recovery do not indicate congestion then we may continue to raise cwnd multiplicatively in recovery. In such cases the current multiplier will falsely limit the sending rate, much as if it were limited by the application. This commit adds an optional congestion control callback to use a different multiplier to expand the TCP send buffer. For congestion control modules that do not specificy this callback, TCP continues to use the previous default of 2. Signed-off-by: NVan Jacobson <vanj@google.com> Signed-off-by: NNeal Cardwell <ncardwell@google.com> Signed-off-by: NYuchung Cheng <ycheng@google.com> Signed-off-by: NNandita Dukkipati <nanditad@google.com> Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com> Acked-by: NStephen Hemminger <stephen@networkplumber.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Yuchung Cheng 提交于
This patch generates data delivery rate (throughput) samples on a per-ACK basis. These rate samples can be used by congestion control modules, and specifically will be used by TCP BBR in later patches in this series. Key state: tp->delivered: Tracks the total number of data packets (original or not) delivered so far. This is an already-existing field. tp->delivered_mstamp: the last time tp->delivered was updated. Algorithm: A rate sample is calculated as (d1 - d0)/(t1 - t0) on a per-ACK basis: d1: the current tp->delivered after processing the ACK t1: the current time after processing the ACK d0: the prior tp->delivered when the acked skb was transmitted t0: the prior tp->delivered_mstamp when the acked skb was transmitted When an skb is transmitted, we snapshot d0 and t0 in its control block in tcp_rate_skb_sent(). When an ACK arrives, it may SACK and ACK some skbs. For each SACKed or ACKed skb, tcp_rate_skb_delivered() updates the rate_sample struct to reflect the latest (d0, t0). Finally, tcp_rate_gen() generates a rate sample by storing (d1 - d0) in rs->delivered and (t1 - t0) in rs->interval_us. One caveat: if an skb was sent with no packets in flight, then tp->delivered_mstamp may be either invalid (if the connection is starting) or outdated (if the connection was idle). In that case, we'll re-stamp tp->delivered_mstamp. At first glance it seems t0 should always be the time when an skb was transmitted, but actually this could over-estimate the rate due to phase mismatch between transmit and ACK events. To track the delivery rate, we ensure that if packets are in flight then t0 and and t1 are times at which packets were marked delivered. If the initial and final RTTs are different then one may be corrupted by some sort of noise. The noise we see most often is sending gaps caused by delayed, compressed, or stretched acks. This either affects both RTTs equally or artificially reduces the final RTT. We approach this by recording the info we need to compute the initial RTT (duration of the "send phase" of the window) when we recorded the associated inflight. Then, for a filter to avoid bandwidth overestimates, we generalize the per-sample bandwidth computation from: bw = delivered / ack_phase_rtt to the following: bw = delivered / max(send_phase_rtt, ack_phase_rtt) In large-scale experiments, this filtering approach incorporating send_phase_rtt is effective at avoiding bandwidth overestimates due to ACK compression or stretched ACKs. Signed-off-by: NVan Jacobson <vanj@google.com> Signed-off-by: NNeal Cardwell <ncardwell@google.com> Signed-off-by: NYuchung Cheng <ycheng@google.com> Signed-off-by: NNandita Dukkipati <nanditad@google.com> Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Neal Cardwell 提交于
Count the number of packets that a TCP connection marks lost. Congestion control modules can use this loss rate information for more intelligent decisions about how fast to send. Specifically, this is used in TCP BBR policer detection. BBR uses a high packet loss rate as one signal in its policer detection and policer bandwidth estimation algorithm. The BBR policer detection algorithm cannot simply track retransmits, because a retransmit can be (and often is) an indicator of packets lost long, long ago. This is particularly true in a long CA_Loss period that repairs the initial massive losses when a policer kicks in. Signed-off-by: NVan Jacobson <vanj@google.com> Signed-off-by: NNeal Cardwell <ncardwell@google.com> Signed-off-by: NYuchung Cheng <ycheng@google.com> Signed-off-by: NNandita Dukkipati <nanditad@google.com> Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Neal Cardwell 提交于
Refactor the TCP min_rtt code to reuse the new win_minmax library in lib/win_minmax.c to simplify the TCP code. This is a pure refactor: the functionality is exactly the same. We just moved the windowed min code to make TCP easier to read and maintain, and to allow other parts of the kernel to use the windowed min/max filter code. Signed-off-by: NVan Jacobson <vanj@google.com> Signed-off-by: NNeal Cardwell <ncardwell@google.com> Signed-off-by: NYuchung Cheng <ycheng@google.com> Signed-off-by: NNandita Dukkipati <nanditad@google.com> Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 16 9月, 2016 1 次提交
-
-
由 Eric Dumazet 提交于
When skb replaces another one in ooo queue, I forgot to also update tp->ooo_last_skb as well, if the replaced skb was the last one in the queue. To fix this, we simply can re-use the code that runs after an insertion, trying to merge skbs at the right of current skb. This not only fixes the bug, but also remove all small skbs that might be a subset of the new one. Example: We receive segments 2001:3001, 4001:5001 Then we receive 2001:8001 : We should replace 2001:3001 with the big skb, but also remove 4001:50001 from the queue to save space. packetdrill test demonstrating the bug 0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3 +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 +0 bind(3, ..., ...) = 0 +0 listen(3, 1) = 0 +0 < S 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7> +0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7> +0.100 < . 1:1(0) ack 1 win 1024 +0 accept(3, ..., ...) = 4 +0.01 < . 1001:2001(1000) ack 1 win 1024 +0 > . 1:1(0) ack 1 <nop,nop, sack 1001:2001> +0.01 < . 1001:3001(2000) ack 1 win 1024 +0 > . 1:1(0) ack 1 <nop,nop, sack 1001:2001 1001:3001> Fixes: 9f5afeae ("tcp: use an RB tree for ooo receive queue") Signed-off-by: NEric Dumazet <edumazet@google.com> Reported-by: NYuchung Cheng <ycheng@google.com> Cc: Yaogong Wang <wygivan@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 11 9月, 2016 1 次提交
-
-
由 Eric Dumazet 提交于
Willem noticed that we could avoid an rbtree lookup if the the attempt to coalesce incoming skb to the last skb failed for some reason. Since most ooo additions are at the tail, this is definitely worth adding a test and fast path. Suggested-by: NWillem de Bruijn <willemb@google.com> Signed-off-by: NEric Dumazet <edumazet@google.com> Cc: Yaogong Wang <wygivan@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 09 9月, 2016 1 次提交
-
-
由 Yaogong Wang 提交于
Over the years, TCP BDP has increased by several orders of magnitude, and some people are considering to reach the 2 Gbytes limit. Even with current window scale limit of 14, ~1 Gbytes maps to ~740,000 MSS. In presence of packet losses (or reorders), TCP stores incoming packets into an out of order queue, and number of skbs sitting there waiting for the missing packets to be received can be in the 10^5 range. Most packets are appended to the tail of this queue, and when packets can finally be transferred to receive queue, we scan the queue from its head. However, in presence of heavy losses, we might have to find an arbitrary point in this queue, involving a linear scan for every incoming packet, throwing away cpu caches. This patch converts it to a RB tree, to get bounded latencies. Yaogong wrote a preliminary patch about 2 years ago. Eric did the rebase, added ofo_last_skb cache, polishing and tests. Tested with network dropping between 1 and 10 % packets, with good success (about 30 % increase of throughput in stress tests) Next step would be to also use an RB tree for the write queue at sender side ;) Signed-off-by: NYaogong Wang <wygivan@google.com> Signed-off-by: NEric Dumazet <edumazet@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Acked-By: NIlpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 19 8月, 2016 1 次提交
-
-
由 Eric Dumazet 提交于
Over the years, TCP BDP has increased a lot, and is typically in the order of ~10 Mbytes with help of clever Congestion Control modules. In presence of packet losses, TCP stores incoming packets into an out of order queue, and number of skbs sitting there waiting for the missing packets to be received can match the BDP (~10 Mbytes) In some cases, TCP needs to make room for incoming skbs, and current strategy can simply remove all skbs in the out of order queue as a last resort, incurring a huge penalty, both for receiver and sender. Unfortunately these 'last resort events' are quite frequent, forcing sender to send all packets again, stalling the flow and wasting a lot of resources. This patch cleans only a part of the out of order queue in order to meet the memory constraints. Signed-off-by: NEric Dumazet <edumazet@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Cc: C. Stephen Gun <csg@google.com> Cc: Van Jacobson <vanj@google.com> Acked-by: NSoheil Hassas Yeganeh <soheil@google.com> Acked-by: NYuchung Cheng <ycheng@google.com> Acked-by: NNeal Cardwell <ncardwell@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 16 7月, 2016 1 次提交
-
-
由 Jason Baron 提交于
The per-socket rate limit for 'challenge acks' was introduced in the context of limiting ack loops: commit f2b2c582 ("tcp: mitigate ACK loops for connections as tcp_sock") And I think it can be extended to rate limit all 'challenge acks' on a per-socket basis. Since we have the global tcp_challenge_ack_limit, this patch allows for tcp_challenge_ack_limit to be set to a large value and effectively rely on the per-socket limit, or set tcp_challenge_ack_limit to a lower value and still prevents a single connections from consuming the entire challenge ack quota. It further moves in the direction of eliminating the global limit at some point, as Eric Dumazet has suggested. This a follow-up to: Subject: tcp: make challenge acks less predictable Cc: Eric Dumazet <edumazet@google.com> Cc: David S. Miller <davem@davemloft.net> Cc: Neal Cardwell <ncardwell@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Yue Cao <ycao009@ucr.edu> Signed-off-by: NJason Baron <jbaron@akamai.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 12 7月, 2016 1 次提交
-
-
由 Eric Dumazet 提交于
Yue Cao claims that current host rate limiting of challenge ACKS (RFC 5961) could leak enough information to allow a patient attacker to hijack TCP sessions. He will soon provide details in an academic paper. This patch increases the default limit from 100 to 1000, and adds some randomization so that the attacker can no longer hijack sessions without spending a considerable amount of probes. Based on initial analysis and patch from Linus. Note that we also have per socket rate limiting, so it is tempting to remove the host limit in the future. v2: randomize the count of challenge acks per second, not the period. Fixes: 282f23c6 ("tcp: implement RFC 5961 3.2") Reported-by: NYue Cao <ycao009@ucr.edu> Signed-off-by: NEric Dumazet <edumazet@google.com> Suggested-by: NLinus Torvalds <torvalds@linux-foundation.org> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Acked-by: NNeal Cardwell <ncardwell@google.com> Acked-by: NYuchung Cheng <ycheng@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 28 6月, 2016 1 次提交
-
-
由 Huw Davies 提交于
If set, these will take precedence over the parent's options during both sending and child creation. If they're not set, the parent's options (if any) will be used. This is to allow the security_inet_conn_request() hook to modify the IPv6 options in just the same way that it already may do for IPv4. Signed-off-by: NHuw Davies <huw@codeweavers.com> Signed-off-by: NPaul Moore <paul@paul-moore.com>
-
- 11 6月, 2016 1 次提交
-
-
由 Lawrence Brakmo 提交于
Add in_flight (bytes in flight when packet was sent) field to tx component of tcp_skb_cb and make it available to congestion modules' pkts_acked() function through the ack_sample function argument. Signed-off-by: NLawrence Brakmo <brakmo@fb.com> Acked-by: NYuchung Cheng <ycheng@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 08 6月, 2016 1 次提交
-
-
由 Pau Espin Pedrol 提交于
RFC 5961 advises to only accept RST packets containing a seq number matching the next expected seq number instead of the whole receive window in order to avoid spoofing attacks. However, this situation is not optimal in the case SACK is in use at the time the RST is sent. I recently run into a scenario in which packet losses were high while uploading data to a server, and userspace was willing to frequently terminate connections by sending a RST. In this case, the ACK sent on the receiver side (rcv_nxt) is frozen waiting for a lost packet retransmission and SACK blocks are used to let the client continue uploading data. At some point later on, the client sends the RST (snd_nxt), which matches the next expected seq number of the right-most SACK block on the receiver side which is going forward receiving data. In this scenario, as RFC 5961 defines, the RST SEQ doesn't match the frozen main ACK at receiver side and thus gets dropped and a challenge ACK is sent, which gets usually lost due to network conditions. The main consequence is that the connection stays alive for a while even if it made sense to accept the RST. This can get really bad if lots of connections like this one are created in few seconds, allocating all the resources of the server easily. For security reasons, not all SACK blocks are checked (there could be a big amount of SACK blocks => acceptable SEQ numbers). Furthermore, it wouldn't make sense to check for RST in blocks other than the right-most received one because the sender is not expected to be sending new data after the RST. For simplicity, only up to the 4 most recently updated SACK blocks (selective_acks[4] field) are compared to find the right-most block, as usually those are the ones with bigger probability to contain it. This patch was tested in a 3.18 kernel and probed to improve the situation in the scenario described above. Signed-off-by: NPau Espin Pedrol <pau.espin@tessares.net> Acked-by: NEric Dumazet <edumazet@google.com> Acked-by: NNeal Cardwell <ncardwell@google.com> Tested-by: NNeal Cardwell <ncardwell@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 12 5月, 2016 1 次提交
-
-
由 Lawrence Brakmo 提交于
Replace 2 arguments (cnt and rtt) in the congestion control modules' pkts_acked() function with a struct. This will allow adding more information without having to modify existing congestion control modules (tcp_nv in particular needs bytes in flight when packet was sent). As proposed by Neal Cardwell in his comments to the tcp_nv patch. Signed-off-by: NLawrence Brakmo <brakmo@fb.com> Acked-by: NYuchung Cheng <ycheng@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 05 5月, 2016 1 次提交
-
-
由 Eric Dumazet 提交于
tcp_snd_una_update() and tcp_rcv_nxt_update() call u64_stats_update_begin() either from process context or BH handler. This triggers a lockdep splat on 32bit & SMP builds. We could add u64_stats_update_begin_bh() variant but this would slow down 32bit builds with useless local_disable_bh() and local_enable_bh() pairs, since we own the socket lock at this point. I add sock_owned_by_me() helper to have proper lockdep support even on 64bit builds, and new u64_stats_update_begin_raw() and u64_stats_update_end_raw methods. Fixes: c10d9310 ("tcp: do not assume TCP code is non preemptible") Reported-by: NFabio Estevam <festevam@gmail.com> Diagnosed-by: NFrancois Romieu <romieu@fr.zoreil.com> Signed-off-by: NEric Dumazet <edumazet@google.com> Tested-by: NFabio Estevam <fabio.estevam@nxp.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 03 5月, 2016 2 次提交
-
-
由 Eric Dumazet 提交于
AFAIK, nothing in current TCP stack absolutely wants BH being disabled once socket is owned by a thread running in process context. As mentioned in my prior patch ("tcp: give prequeue mode some care"), processing a batch of packets might take time, better not block BH at all. Signed-off-by: NEric Dumazet <edumazet@google.com> Acked-by: NSoheil Hassas Yeganeh <soheil@google.com> Acked-by: NAlexei Starovoitov <ast@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
We want to to make TCP stack preemptible, as draining prequeue and backlog queues can take lot of time. Many SNMP updates were assuming that BH (and preemption) was disabled. Need to convert some __NET_INC_STATS() calls to NET_INC_STATS() and some __TCP_INC_STATS() to TCP_INC_STATS() Before using this_cpu_ptr(net->ipv4.tcp_sk) in tcp_v4_send_reset() and tcp_v4_send_ack(), we add an explicit preempt disabled section. Signed-off-by: NEric Dumazet <edumazet@google.com> Acked-by: NSoheil Hassas Yeganeh <soheil@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 29 4月, 2016 2 次提交
-
-
由 Martin KaFai Lau 提交于
This patch: 1. Prevent next_skb from coalescing to the prev_skb if TCP_SKB_CB(prev_skb)->eor is set 2. Update the TCP_SKB_CB(prev_skb)->eor if coalescing is allowed Packetdrill script for testing: ~~~~~~ +0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10` +0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1` +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3 +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 +0 bind(3, ..., ...) = 0 +0 listen(3, 1) = 0 0.100 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7> 0.100 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7> 0.200 < . 1:1(0) ack 1 win 257 0.200 accept(3, ..., ...) = 4 +0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0 0.200 sendto(4, ..., 730, MSG_EOR, ..., ...) = 730 0.200 sendto(4, ..., 730, MSG_EOR, ..., ...) = 730 0.200 write(4, ..., 11680) = 11680 0.200 > P. 1:731(730) ack 1 0.200 > P. 731:1461(730) ack 1 0.200 > . 1461:8761(7300) ack 1 0.200 > P. 8761:13141(4380) ack 1 0.300 < . 1:1(0) ack 1 win 257 <sack 1461:13141,nop,nop> 0.300 > P. 1:731(730) ack 1 0.300 > P. 731:1461(730) ack 1 0.400 < . 1:1(0) ack 13141 win 257 0.400 close(4) = 0 0.400 > F. 13141:13141(0) ack 1 0.500 < F. 1:1(0) ack 13142 win 257 0.500 > . 13142:13142(0) ack 2 Signed-off-by: NMartin KaFai Lau <kafai@fb.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Cc: Willem de Bruijn <willemb@google.com> Cc: Yuchung Cheng <ycheng@google.com> Acked-by: NEric Dumazet <edumazet@google.com> Acked-by: NSoheil Hassas Yeganeh <soheil@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Soheil Hassas Yeganeh 提交于
The SKBTX_ACK_TSTAMP flag is set in skb_shinfo->tx_flags when the timestamp of the TCP acknowledgement should be reported on error queue. Since accessing skb_shinfo is likely to incur a cache-line miss at the time of receiving the ack, the txstamp_ack bit was added in tcp_skb_cb, which is set iff the SKBTX_ACK_TSTAMP flag is set for an skb. This makes SKBTX_ACK_TSTAMP flag redundant. Remove the SKBTX_ACK_TSTAMP and instead use the txstamp_ack bit everywhere. Note that this frees one bit in shinfo->tx_flags. Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com> Acked-by: NMartin KaFai Lau <kafai@fb.com> Suggested-by: NWillem de Bruijn <willemb@google.com> Acked-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 28 4月, 2016 2 次提交
-
-
由 Eric Dumazet 提交于
Rename NET_INC_STATS_BH() to __NET_INC_STATS() and NET_ADD_STATS_BH() to __NET_ADD_STATS() Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
Rename TCP_INC_STATS_BH() to __TCP_INC_STATS() Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 26 4月, 2016 1 次提交
-
-
由 Eric Dumazet 提交于
We now have proper per-listener but also per network namespace counters for SYN packets that might be dropped. We replace the kfree_skb() by consume_skb() to be drop monitor [1] friendly, and remove an obsolete comment. FastOpen SYN packets can carry payload in them just fine. [1] perf record -a -g -e skb:kfree_skb sleep 1; perf report Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 25 4月, 2016 1 次提交
-
-
由 Eric Dumazet 提交于
Linux TCP stack painfully segments all TSO/GSO packets before retransmits. This was fine back in the days when TSO/GSO were emerging, with their bugs, but we believe the dark age is over. Keeping big packets in write queues, but also in stack traversal has a lot of benefits. - Less memory overhead, because write queues have less skbs - Less cpu overhead at ACK processing. - Better SACK processing, as lot of studies mentioned how awful linux was at this ;) - Less cpu overhead to send the rtx packets (IP stack traversal, netfilter traversal, drivers...) - Better latencies in presence of losses. - Smaller spikes in fq like packet schedulers, as retransmits are not constrained by TCP Small Queues. 1 % packet losses are common today, and at 100Gbit speeds, this translates to ~80,000 losses per second. Losses are often correlated, and we see many retransmit events leading to 1-MSS train of packets, at the time hosts are already under stress. Signed-off-by: NEric Dumazet <edumazet@google.com> Acked-by: NYuchung Cheng <ycheng@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 22 4月, 2016 2 次提交
-
-
由 Martin KaFai Lau 提交于
After receiving sacks, tcp_shifted_skb() will collapse skbs if possible. tx_flags and tskey also have to be merged. This patch reuses the tcp_skb_collapse_tstamp() to handle them. BPF Output Before: ~~~~~ <no-output-due-to-missing-tstamp-event> BPF Output After: ~~~~~ <...>-2024 [007] d.s. 88.644374: : ee_data:14599 Packetdrill Script: ~~~~~ +0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10` +0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1` +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3 +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 +0 bind(3, ..., ...) = 0 +0 listen(3, 1) = 0 0.100 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7> 0.100 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7> 0.200 < . 1:1(0) ack 1 win 257 0.200 accept(3, ..., ...) = 4 +0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0 0.200 write(4, ..., 1460) = 1460 +0 setsockopt(4, SOL_SOCKET, 37, [2688], 4) = 0 0.200 write(4, ..., 13140) = 13140 0.200 > P. 1:1461(1460) ack 1 0.200 > . 1461:8761(7300) ack 1 0.200 > P. 8761:14601(5840) ack 1 0.300 < . 1:1(0) ack 1 win 257 <sack 1461:14601,nop,nop> 0.300 > P. 1:1461(1460) ack 1 0.400 < . 1:1(0) ack 14601 win 257 0.400 close(4) = 0 0.400 > F. 14601:14601(0) ack 1 0.500 < F. 1:1(0) ack 14602 win 257 0.500 > . 14602:14602(0) ack 2 Signed-off-by: NMartin KaFai Lau <kafai@fb.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Cc: Willem de Bruijn <willemb@google.com> Cc: Yuchung Cheng <ycheng@google.com> Acked-by: NSoheil Hassas Yeganeh <soheil@google.com> Tested-by: NSoheil Hassas Yeganeh <soheil@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Martin KaFai Lau 提交于
Assuming SOF_TIMESTAMPING_TX_ACK is on. When dup acks are received, it could incorrectly think that a skb has already been acked and queue a SCM_TSTAMP_ACK cmsg to the sk->sk_error_queue. In tcp_ack_tstamp(), it checks 'between(shinfo->tskey, prior_snd_una, tcp_sk(sk)->snd_una - 1)'. If prior_snd_una == tcp_sk(sk)->snd_una like the following packetdrill script, between() returns true but the tskey is actually not acked. e.g. try between(3, 2, 1). The fix is to replace between() with one before() and one !before(). By doing this, the -1 offset on the tcp_sk(sk)->snd_una can also be removed. A packetdrill script is used to reproduce the dup ack scenario. Due to the lacking cmsg support in packetdrill (may be I cannot find it), a BPF prog is used to kprobe to sock_queue_err_skb() and print out the value of serr->ee.ee_data. Both the packetdrill and the bcc BPF script is attached at the end of this commit message. BPF Output Before Fix: ~~~~~~ <...>-2056 [001] d.s. 433.927987: : ee_data:1459 #incorrect packetdrill-2056 [001] d.s. 433.929563: : ee_data:1459 #incorrect packetdrill-2056 [001] d.s. 433.930765: : ee_data:1459 #incorrect packetdrill-2056 [001] d.s. 434.028177: : ee_data:1459 packetdrill-2056 [001] d.s. 434.029686: : ee_data:14599 BPF Output After Fix: ~~~~~~ <...>-2049 [000] d.s. 113.517039: : ee_data:1459 <...>-2049 [000] d.s. 113.517253: : ee_data:14599 BCC BPF Script: ~~~~~~ #!/usr/bin/env python from __future__ import print_function from bcc import BPF bpf_text = """ #include <uapi/linux/ptrace.h> #include <net/sock.h> #include <bcc/proto.h> #include <linux/errqueue.h> #ifdef memset #undef memset #endif int trace_err_skb(struct pt_regs *ctx) { struct sk_buff *skb = (struct sk_buff *)ctx->si; struct sock *sk = (struct sock *)ctx->di; struct sock_exterr_skb *serr; u32 ee_data = 0; if (!sk || !skb) return 0; serr = SKB_EXT_ERR(skb); bpf_probe_read(&ee_data, sizeof(ee_data), &serr->ee.ee_data); bpf_trace_printk("ee_data:%u\\n", ee_data); return 0; }; """ b = BPF(text=bpf_text) b.attach_kprobe(event="sock_queue_err_skb", fn_name="trace_err_skb") print("Attached to kprobe") b.trace_print() Packetdrill Script: ~~~~~~ +0 `sysctl -q -w net.ipv4.tcp_min_tso_segs=10` +0 `sysctl -q -w net.ipv4.tcp_no_metrics_save=1` +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3 +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 +0 bind(3, ..., ...) = 0 +0 listen(3, 1) = 0 0.100 < S 0:0(0) win 32792 <mss 1460,sackOK,nop,nop,nop,wscale 7> 0.100 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7> 0.200 < . 1:1(0) ack 1 win 257 0.200 accept(3, ..., ...) = 4 +0 setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0 +0 setsockopt(4, SOL_SOCKET, 37, [2688], 4) = 0 0.200 write(4, ..., 1460) = 1460 0.200 write(4, ..., 13140) = 13140 0.200 > P. 1:1461(1460) ack 1 0.200 > . 1461:8761(7300) ack 1 0.200 > P. 8761:14601(5840) ack 1 0.300 < . 1:1(0) ack 1 win 257 <sack 1461:2921,nop,nop> 0.300 < . 1:1(0) ack 1 win 257 <sack 1461:4381,nop,nop> 0.300 < . 1:1(0) ack 1 win 257 <sack 1461:5841,nop,nop> 0.300 > P. 1:1461(1460) ack 1 0.400 < . 1:1(0) ack 14601 win 257 0.400 close(4) = 0 0.400 > F. 14601:14601(0) ack 1 0.500 < F. 1:1(0) ack 14602 win 257 0.500 > . 14602:14602(0) ack 2 Signed-off-by: NMartin KaFai Lau <kafai@fb.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Soheil Hassas Yeganeh <soheil.kdev@gmail.com> Cc: Willem de Bruijn <willemb@google.com> Cc: Yuchung Cheng <ycheng@google.com> Acked-by: NSoheil Hassas Yeganeh <soheil@google.com> Tested-by: NSoheil Hassas Yeganeh <soheil@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 16 4月, 2016 2 次提交
-
-
由 Eric Dumazet 提交于
Last known hot point during SYNFLOOD attack is the clearing of rx_opt.saw_tstamp in tcp_rcv_state_process() It is not needed for a listener, so we move it where it matters. Performance while a SYNFLOOD hits a single listener socket went from 5 Mpps to 6 Mpps on my test server (24 cores, 8 NIC RX queues) Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
When removing sk_refcnt manipulation on synflood, I missed that using skb_set_owner_w() was racy, if sk->sk_wmem_alloc had already transitioned to 0. We should hold sk_refcnt instead, but this is a big deal under attack. (Doing so increase performance from 3.2 Mpps to 3.8 Mpps only) In this patch, I chose to not attach a socket to syncookies skb. Performance is now 5 Mpps instead of 3.2 Mpps. Following patch will remove last known false sharing in tcp_rcv_state_process() Fixes: 3b24d854 ("tcp/dccp: do not touch listener sk_refcnt under synflood") Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 05 4月, 2016 3 次提交
-
-
由 Eric Dumazet 提交于
Goal: packets dropped by a listener are accounted for. This adds tcp_listendrop() helper, and clears sk_drops in sk_clone_lock() so that children do not inherit their parent drop count. Note that we no longer increment LINUX_MIB_LISTENDROPS counter when sending a SYNCOOKIE, since the SYN packet generated a SYNACK. We already have a separate LINUX_MIB_SYNCOOKIESSENT Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Eric Dumazet 提交于
Now ss can report sk_drops, we can instruct TCP to increment this per socket counter when it drops an incoming frame, to refine monitoring and debugging. Following patch takes care of listeners drops. Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Soheil Hassas Yeganeh 提交于
Currently, to avoid a cache line miss for accessing skb_shinfo, tcp_ack_tstamp skips socket that do not have SOF_TIMESTAMPING_TX_ACK bit set in sk_tsflags. This is implemented based on an implicit assumption that the SOF_TIMESTAMPING_TX_ACK is set via socket options for the duration that ACK timestamps are needed. To implement per-write timestamps, this check should be removed and replaced with a per-packet alternative that quickly skips packets missing ACK timestamps marks without a cache-line miss. To enable per-packet marking without a cache line miss, use one bit in TCP_SKB_CB to mark a whether a SKB might need a ack tx timestamp or not. Further checks in tcp_ack_tstamp are not modified and work as before. Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com> Acked-by: NWillem de Bruijn <willemb@google.com> Acked-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 03 4月, 2016 1 次提交
-
-
由 Yuchung Cheng 提交于
For non-SACK connections, cwnd is lowered to inflight plus 3 packets when the recovery ends. This is an optional feature in the NewReno RFC 2582 to reduce the potential burst when cwnd is "re-opened" after recovery and inflight is low. This feature is questionably effective because of PRR: when the recovery ends (i.e., snd_una == high_seq) NewReno holds the CA_Recovery state for another round trip to prevent false fast retransmits. But if the inflight is low, PRR will overwrite the moderated cwnd in tcp_cwnd_reduction() later regardlessly. So if a receiver responds bogus ACKs (i.e., acking future data) to speed up transfer after recovery, it can only induce a burst up to a window worth of data packets by acking up to SND.NXT. A restart from (short) idle or receiving streched ACKs can both cause such bursts as well. On the other hand, if the recovery ends because the sender detects the losses were spurious (e.g., reordering). This feature unconditionally lowers a reverted cwnd even though nothing was lost. By principle loss recovery module should not update cwnd. Further pacing is much more effective to reduce burst. Hence this patch removes the cwnd moderation feature. v2 changes: revised commit message on bogus ACKs and burst, and missing signature Signed-off-by: NMatt Mathis <mattmathis@google.com> Signed-off-by: NNeal Cardwell <ncardwell@google.com> Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com> Signed-off-by: NYuchung Cheng <ycheng@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 17 2月, 2016 1 次提交
-
-
由 Eric Dumazet 提交于
There are some cases where rtt_us derives from deltas of jiffies, instead of using usec timestamps. Since we want to track minimal rtt, better to assume a delta of 0 jiffie might be in fact be very close to 1 jiffie. It is kind of sad jiffies_to_usecs(1) calls a function instead of simply using a constant. Fixes: f6722583 ("tcp: track min RTT using windowed min-filter") Signed-off-by: NEric Dumazet <edumazet@google.com> Signed-off-by: NNeal Cardwell <ncardwell@google.com> Cc: Yuchung Cheng <ycheng@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 08 2月, 2016 7 次提交
-
-
由 Nikolay Borisov 提交于
Signed-off-by: NNikolay Borisov <kernel@kyup.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Nikolay Borisov 提交于
Signed-off-by: NNikolay Borisov <kernel@kyup.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Yuchung Cheng 提交于
Refactor and consolidate cwnd and rate updates into a new function tcp_cong_control(). Signed-off-by: NYuchung Cheng <ycheng@google.com> Signed-off-by: NNeal Cardwell <ncardwell@google.com> Signed-off-by: NEric Dumazet <ncardwell@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Yuchung Cheng 提交于
This change enables congestion control to update cwnd based on not only packet cumulatively acked but also packets delivered out-of-order. This makes congestion control robust against packet reordering because it may raise cwnd as long as packets are being delivered once reordering has been detected (i.e., it only cares the amount of packets delivered, not the ordering among them). Signed-off-by: NYuchung Cheng <ycheng@google.com> Signed-off-by: NNeal Cardwell <ncardwell@google.com> Signed-off-by: NEric Dumazet <ncardwell@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Yuchung Cheng 提交于
A small refactoring that gets number of packets cumulatively acked from tcp_clean_rtx_queue() directly. Signed-off-by: NYuchung Cheng <ycheng@google.com> Signed-off-by: NNeal Cardwell <ncardwell@google.com> Signed-off-by: NEric Dumazet <ncardwell@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Yuchung Cheng 提交于
This patch changes the accounting of how many packets are newly acked or sacked when the sender receives an ACK. The current approach basically computes newly_acked_sacked = (prior_packets - prior_sacked) - (tp->packets_out - tp->sacked_out) where prior_packets and prior_sacked out are snapshot at the beginning of the ACK processing. The new approach tracks the delivery information via a new TCP state variable "delivered" which monotically increases as new packets are delivered in order or out-of-order. The reason for this change is that the current approach is brittle that produces negative or inaccurate estimate. 1) For non-SACK connections, an ACK that advances the SND.UNA could reset the DUPACK counters (tp->sacked_out) in tcp_process_loss() or tcp_fastretrans_alert(). This inflates the inflight suddenly and causes under-estimate or even negative estimate. Here is a real example: before after (processing ACK) packets_out 75 73 sacked_out 23 0 ca state Loss Open The old approach computes (75-23) - (73 - 0) = -21 delivered while the new approach computes 1 delivered since it considers the 2nd-24th packets are delivered OOO. 2) MSS change would re-count packets_out and sacked_out so the estimate is in-accurate and can even become negative. E.g., the inflight is doubled when MSS is halved. 3) Spurious retransmission signaled by DSACK is not accounted The new approach is simpler and more robust. For SACK connections, tp->delivered increments as packets are being acked or sacked in SACK and ACK processing. For non-sack connections, it's done in tcp_remove_reno_sacks() and tcp_add_reno_sack(). When an ACK advances the SND.UNA, tp->delivered is incremented by the number of packets ACKed (less the current number of DUPACKs received plus one packet hole). Upon receiving a DUPACK, tp->delivered is incremented assuming one out-of-order packet is delivered. Upon receiving a DSACK, tp->delivered is incremtened assuming one retransmission is delivered in tcp_sacktag_write_queue(). Signed-off-by: NYuchung Cheng <ycheng@google.com> Signed-off-by: NNeal Cardwell <ncardwell@google.com> Signed-off-by: NEric Dumazet <ncardwell@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Yuchung Cheng 提交于
Currently the cwnd is reduced and increased in various different places. The reduction happens in various places in the recovery state processing (tcp_fastretrans_alert) while the increase happens afterward. A better sequence is to identify lost packets and update the congestion control state (icsk_ca_state) first. Then base on the new state, up/down the cwnd in one central place. It's more clear to reason cwnd changes. Signed-off-by: NYuchung Cheng <ycheng@google.com> Signed-off-by: NNeal Cardwell <ncardwell@google.com> Signed-off-by: NEric Dumazet <ncardwell@google.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-