1. 28 Oct 2013, 1 commit
  2. 22 Oct 2013, 1 commit
    • tcp: initialize passive-side sk_pacing_rate after 3WHS · 02cf4ebd
      Committed by Neal Cardwell
      For passive TCP connections, upon receiving the ACK that completes the
      3WHS, make sure we set our pacing rate after we get our first RTT
      sample.
      
      On passive TCP connections, when we receive the ACK completing the
      3WHS we do not take an RTT sample in tcp_ack(), but rather in
      tcp_synack_rtt_meas(). So upon receiving the ACK that completes the
      3WHS, tcp_ack() leaves sk_pacing_rate at its initial value.
      
      Originally the initial sk_pacing_rate value was 0, so passive-side
      connections defaulted to sysctl_tcp_min_tso_segs (2 segs) in skbs
      built during the first RTT. With a default initial cwnd of 10
      packets, this happened to be correct for RTTs of 5ms or more, so it
      was hard to see problems in WAN or emulated-WAN testing.
      
      Since 7eec4174 ("pkt_sched: fq: fix non TCP flows pacing"), the
      initial sk_pacing_rate is 0xffffffff. So after that change, passive
      TCP connections kept this value (and used large numbers of segments
      per skb) until receiving an ACK for data.
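      
      A minimal sketch of where the fix plausibly lands (assuming the
      3WHS-completion path in tcp_rcv_state_process(), net/ipv4/tcp_input.c,
      and the tcp_update_pacing_rate() helper introduced by 95bd09eb below):
      
      case TCP_SYN_RECV:
              ...
              tcp_synack_rtt_meas(sk, req);   /* first RTT sample */
              ...
              /* the fix: refresh the pacing rate now that this passive
               * connection finally has an RTT sample */
              tcp_update_pacing_rate(sk);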
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 18 Oct 2013, 1 commit
  4. 05 Oct 2013, 1 commit
    • tcp: do not forget FIN in tcp_shifted_skb() · 5e8a402f
      Committed by Eric Dumazet
      Yuchung found the following problem:
      
       There are bugs in the SACK processing code, in the merging part in
       tcp_shift_skb_data(), that incorrectly reset or ignore the sacked
       skbs' FIN flag. When a receiver first SACKs the FIN sequence, and
       later throws away its ofo queue (e.g., on SACK reneging), the sender
       will stop retransmitting the FIN flag, and hang forever.
      
      The following packetdrill test can be used to reproduce the bug.
      
      $ cat sack-merge-bug.pkt
      `sysctl -q net.ipv4.tcp_fack=0`
      
      // Establish a connection and send 10 MSS.
      0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
      +.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
      +.000 bind(3, ..., ...) = 0
      +.000 listen(3, 1) = 0
      
      +.050 < S 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7>
      +.000 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 6>
      +.001 < . 1:1(0) ack 1 win 1024
      +.000 accept(3, ..., ...) = 4
      
      +.100 write(4, ..., 12000) = 12000
      +.000 shutdown(4, SHUT_WR) = 0
      +.000 > . 1:10001(10000) ack 1
      +.050 < . 1:1(0) ack 2001 win 257
      +.000 > FP. 10001:12001(2000) ack 1
      +.050 < . 1:1(0) ack 2001 win 257 <sack 10001:11001,nop,nop>
      +.050 < . 1:1(0) ack 2001 win 257 <sack 10001:12002,nop,nop>
      // SACK reneg
      +.050 < . 1:1(0) ack 12001 win 257
      +0 %{ print "unacked: ",tcpi_unacked }%
      +5 %{ print "" }%
      
      First, a typo inverted the left/right operands of one OR operation;
      second, the code forgot to advance end_seq when the merged skb
      carried a FIN.
      
      Bug was added in 2.6.29 by commit 832d11c5
      ("tcp: Try to restore large SKBs while SACK processing")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 07 Sep 2013, 2 commits
    • tcp: properly increase rcv_ssthresh for ofo packets · 4e4f1fc2
      Committed by Eric Dumazet
      TCP receive window handling is multi-staged.
      
      A socket has a memory budget, static or dynamic, in sk_rcvbuf.
      
      Because we do not really know how this memory budget translates to
      a TCP window (payload), TCP announces a small initial window
      (about 20 MSS).
      
      When a packet is received, we increase the TCP rcv_win depending on
      the payload/truesize ratio of this packet. Good-citizen packets give
      a hint that it's reasonable to have rcv_win = sk_rcvbuf/2.
      
      This heuristic takes place in tcp_grow_window().
      
      The problem is that we currently call tcp_grow_window() only for
      in-order packets.
      
      This means that reordering or packet losses prevent rcv_win from
      growing properly, and senders are unable to benefit from fast
      recovery or proper reordering-level detection.
      
      Really, a packet being stored in the OFO queue is not a bad citizen.
      It should count toward window growth just like in-order packets.
      
      In our traces, we very often see the sender limited by small Linux
      receive windows, even though Linux hosts use autotuning (DRS) and
      should allow rcv_win to grow to ~3MB.
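      
      A hedged sketch of the change (assuming tcp_data_queue_ofo() in
      net/ipv4/tcp_input.c):
      
      static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
      {
              struct tcp_sock *tp = tcp_sk(sk);
      
              if (unlikely(tcp_try_rmem_schedule(sk, skb, skb->truesize))) {
                      __kfree_skb(skb);       /* no memory: drop */
                      return;
              }
      
              /* the fix: reward well-behaved ofo skbs too, so rcv_win
               * keeps growing under reordering/losses; previously only the
               * in-order path (tcp_event_data_recv()) grew the window */
              tcp_grow_window(sk, skb);
              ...
      }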
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: fix no cwnd growth after timeout · 16edfe7e
      Committed by Yuchung Cheng
      Commit 0f7cc9a3 ("tcp: increase throughput when reordering is high")
      allows cwnd to increase only in the Open state. This mistakenly
      disables slow start after timeout (CA_Loss). Moreover, cwnd won't
      grow if the state moves from Disorder to Open later in
      tcp_fastretrans_alert().
      
      The correct logic is therefore to allow cwnd to grow as long as the
      data is received in order, whether in the Open, Loss, or even
      Disorder state.
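      
      A hedged sketch of the corrected predicate (assuming the
      tcp_may_raise_cwnd() helper added by 0f7cc9a3, shown more fully under
      that commit below):
      
      static inline bool tcp_may_raise_cwnd(const struct sock *sk, const int flag)
      {
              if (tcp_in_cwnd_reduction(sk))
                      return false;
              /* (high-reordering branch from 0f7cc9a3 elided) */
      
              /* the fix: no "icsk_ca_state == TCP_CA_Open" test anymore,
               * so in-order deliveries also grow cwnd in Loss/Disorder */
              return flag & FLAG_DATA_ACKED;
      }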
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. 04 Sep 2013, 1 commit
  7. 30 Aug 2013, 2 commits
    • tcp: TSO packets automatic sizing · 95bd09eb
      Committed by Eric Dumazet
      After hearing many people over the past years complain about TSO
      being bursty or even buggy, we are proud to present automatic sizing
      of TSO packets.
      
      One part of the problem is that tcp_tso_should_defer() uses a
      heuristic relying on upcoming ACKs instead of a timer; but more
      generally, big TSO packets make little sense at low rates, as they
      tend to create micro-bursts on the network, and the general consensus
      is to reduce the amount of buffering.
      
      This patch introduces a per-socket sk_pacing_rate that approximates
      the current sending rate, and allows us to size TSO packets so that
      we try to send one packet every ms.
      
      This field could be set by other transports.
      
      The patch has no impact on high-speed flows, where having large TSO
      packets makes sense to reach line rate.
      
      For other flows, this helps better packet scheduling and ACK clocking.
      
      This patch increases performance of TCP flows in lossy environments.
      
      A new sysctl (tcp_min_tso_segs) is added, to specify the
      minimal size of a TSO packet (default being 2).
      
      A follow-up patch will provide a new packet scheduler (FQ), using
      sk_pacing_rate as an input to perform optional per flow pacing.
      
      This explains why we chose to set sk_pacing_rate to twice the current
      rate, allowing 'slow start' ramp up.
      
      sk_pacing_rate = 2 * cwnd * mss / srtt
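      
      A hedged sketch of the computation (tp->srtt is in jiffies << 3 in
      kernels of this era; the exact guards in the real function may
      differ):
      
      static void tcp_update_pacing_rate(struct sock *sk)
      {
              const struct tcp_sock *tp = tcp_sk(sk);
              u64 rate;
      
              /* rate = 2 * mss * cwnd / srtt, in bytes per second */
              rate = (u64)tp->mss_cache * 2 * (HZ << 3);
              rate *= max(tp->snd_cwnd, tp->packets_out);
              if (tp->srtt)
                      do_div(rate, tp->srtt); /* srtt is scaled by 8 */
      
              sk->sk_pacing_rate = min_t(u64, rate, ~0U);
      }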
      
      v2: Neal Cardwell reported suspect deferring of the last two segments
      on an initial write of 10 MSS, so I changed tcp_tso_should_defer() to
      take tp->xmit_size_goal_segs into account.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Van Jacobson <vanj@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: don't apply tsoffset if rcv_tsecr is zero · e3e12028
      Committed by Andrew Vagin
      A zero value means that tsecr is not valid, so it's a special case.
      
      tsoffset is used to customize tcp_time_stamp for one socket.
      tsoffset is usually zero; it's used when a socket was moved from one
      host to another.
      
      Currently this issue affects the logic of tcp_rcv_rtt_measure_ts.
      Due to the incorrect rcv_tsecr value, tcp_rcv_rtt_measure_ts sets the
      rto to TCP_RTO_MAX.
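      
      A hedged sketch of the guard (tp->rx_opt.rcv_tsecr and tp->tsoffset
      are the real fields; the exact call site may differ):
      
      /* only adjust by the per-socket offset when the peer actually
       * echoed a timestamp: tsecr == 0 means "not valid" */
      if (tp->rx_opt.saw_tstamp && tp->rx_opt.rcv_tsecr)
              tp->rx_opt.rcv_tsecr -= tp->tsoffset;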
      
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      Cc: James Morris <jmorris@namei.org>
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Cc: Patrick McHardy <kaber@trash.net>
      Reported-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Signed-off-by: Andrey Vagin <avagin@openvz.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  8. 23 Aug 2013, 1 commit
    • tcp: increase throughput when reordering is high · 0f7cc9a3
      Committed by Yuchung Cheng
      The stack currently detects reordering and avoids spurious
      retransmission very well. However, throughput is sub-optimal under
      high reordering because cwnd is increased only if the data is
      delivered in order, i.e., the FLAG_DATA_ACKED check in tcp_ack(). The
      more packets are reordered, the worse the throughput is.
      
      Therefore, when reordering is proven high, cwnd should advance
      whenever data is delivered, regardless of its ordering. If reordering
      is low, conservatively advance cwnd only on ordered deliveries in the
      Open state, and retain cwnd in the Disorder state (RFC5681).
      
      Tested with netperf on a qdisc setup of 20Mbps BW and random RTT from
      45ms to 55ms (for a reordering effect): this change increases TCP
      throughput by 20-25%, to near the bottleneck BW.
      
      A special case is a stretched ACK with a new SACK and/or ECE mark.
      For example, a receiver may receive an out-of-order or ECN packet
      with unacked data buffered because of LRO or delayed ACK. The
      principle for such an ACK is to advance cwnd for the cumulatively
      acked part first, then reduce cwnd in tcp_fastretrans_alert().
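      
      A hedged sketch of the resulting policy, including the later fix from
      16edfe7e above (assuming a tcp_may_raise_cwnd() helper consulted from
      tcp_ack(); the FLAG_* names are the real ones in tcp_input.c):
      
      static inline bool tcp_may_raise_cwnd(const struct sock *sk, const int flag)
      {
              const struct tcp_sock *tp = tcp_sk(sk);
      
              if (tcp_in_cwnd_reduction(sk))
                      return false;   /* cwnd is being reduced; don't grow */
      
              /* reordering proven high: grow cwnd on any forward progress */
              if (tp->reordering > sysctl_tcp_reordering)
                      return flag & FLAG_FORWARD_PROGRESS;
      
              /* otherwise stay conservative: only on in-order deliveries */
              return flag & FLAG_DATA_ACKED;
      }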
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  9. 14 Aug 2013, 1 commit
    • tcp: reset reordering est. selectively on timeout · 74c181d5
      Committed by Yuchung Cheng
      On timeout the TCP sender unconditionally resets the estimated degree
      of network reordering (tp->reordering). The idea behind this is that
      the estimate is too large to trigger fast recovery (e.g., due to an
      IP path change).
      
      But if, for example, the sender only had 2 packets outstanding, then
      a timeout doesn't say much about reordering. A sender that learns
      about reordering on big writes and loses packets on small writes will
      end up falsely retransmitting again and again, especially when
      reordering is more likely on big writes.
      
      Therefore the sender should only suspect that tp->reordering is too
      high if it could have gone into fast recovery with the (lower) default
      estimate.
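      
      A hedged sketch of the selective reset (assuming tcp_enter_loss() in
      net/ipv4/tcp_input.c):
      
      struct inet_connection_sock *icsk = inet_csk(sk);
      
      /* previously tp->reordering was clamped on every timeout; now only
       * when enough packets were SACKed that the (lower) default estimate
       * would have triggered fast recovery */
      if (icsk->icsk_ca_state <= TCP_CA_Disorder &&
          tp->sacked_out >= sysctl_tcp_reordering)
              tp->reordering = min_t(unsigned int, tp->reordering,
                                     sysctl_tcp_reordering);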
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  10. 23 Jul 2013, 4 commits
    • tcp: use RTT from SACK for RTO · ed08495c
      Committed by Yuchung Cheng
      If RTT is not available because Karn's check has failed or no new
      packet is acked, use the RTT measured from SACK to estimate the RTO.
      The sender can continue to estimate the RTO during loss recovery or
      reordering events upon receiving non-partial ACKs.
      
      This also changes when the RTO is re-armed. Previously it was only
      re-armed when some data was cumulatively acknowledged (i.e., SND.UNA
      advanced), but now it is re-armed whenever the RTT estimator is
      updated. This is particularly useful for reducing spurious timeouts
      under bufferbloat, including on cellular carriers [1], and for RTT
      estimation during reordering events.
      
      [1] "An In-depth Study of LTE: Effect of Network Protocol and
       Application Behavior on Performance", In Proc. of SIGCOMM 2013
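      
      A hedged sketch of the fallback (assuming tcp_ack_update_rtt() grows
      a sack_rtt parameter; timestamp handling elided):
      
      static bool tcp_ack_update_rtt(struct sock *sk, const int flag,
                                     s32 seq_rtt, s32 sack_rtt)
      {
              /* no sample from a cumulatively acked packet (Karn check
               * failed, or nothing newly acked)? use the SACK-based one */
              if (seq_rtt < 0)
                      seq_rtt = sack_rtt;
              if (seq_rtt < 0)
                      return false;   /* no usable sample at all */
      
              tcp_rtt_estimator(sk, seq_rtt);
              tcp_set_rto(sk);
              return true;            /* caller may now re-arm the RTO */
      }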
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: measure RTT from new SACK · 59c9af42
      Committed by Yuchung Cheng
      Take an RTT sample if an ACK selectively acks some sequences that
      have never been retransmitted. Karn's algorithm does not apply even
      if that ACK (s)acks other retransmitted sequences, because the ACK
      must have been generated by an original, but perhaps out-of-order,
      packet. There is no ambiguity. When multiple blocks are newly SACKed
      because of ACK losses, the earliest block is used to measure RTT, as
      with cumulative ACKs.
      
      Such RTT samples allow the sender to estimate the RTO during loss
      recovery and packet reordering events. They are still useful even
      with TCP timestamps, because during these events SND.UNA may not
      advance, preventing RTT samples from TS ECR (hence the FLAG_ACKED
      check before calling tcp_ack_update_rtt()). This new RTT source is
      therefore complementary to the existing ACK and TS RTT mechanisms.
      
      This patch does not update the RTO. It is done in the next patch.
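      
      A hedged sketch of the sampling rule (assuming tcp_sacktag_one(),
      with xmit_time the skb's transmit timestamp and state->rtt an assumed
      field of the sacktag state):
      
      /* Karn's rule does not apply: this range was never retransmitted,
       * so the SACK answers the original transmission. Keep only the
       * earliest newly-SACKed block, as with cumulative ACKs. */
      if (!(sacked & TCPCB_RETRANS) && state->rtt < 0)
              state->rtt = tcp_time_stamp - xmit_time;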
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: prefer packet timing to TS-ECR for RTT · 5b08e47c
      Committed by Yuchung Cheng
      Prefer packet timings to TS-ecr for RTT measurements when both
      sources are available, because broken middleboxes and remote peers
      can return packets with corrupted TS ECR fields. Similarly, most
      congestion controls that require RTT signals also favor timing-based
      sources. In addition, check for bad TS ECR values to avoid RTT
      blow-ups; these have happened on production Web servers.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: consolidate SYNACK RTT sampling · 375fe02c
      Committed by Yuchung Cheng
      The first patch consolidates SYNACK and other RTT measurements to use
      a central function, tcp_ack_update_rtt(). A (small) bonus is that
      SYNACK RTT measurement now happens after the PAWS check, potentially
      reducing the impact of RTO seeding from bad TCP timestamp values.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  11. 20 Jun 2013, 1 commit
    • tcp: introduce a per-route knob for quick ack · bcefe17c
      Committed by Cong Wang
      In previous discussions, I tried to find some reasonable heuristics
      for delayed ACK; however, this seems not to be possible, according to
      Eric:
      
      	"ACKs might also be delayed because of bidirectional
      	traffic, and this is more controlled by the application
      	response time. The TCP stack can not easily estimate it."
      
      	"ACKs can be incredibly useful to recover from losses in
      	a short time.
      
      	The vast majority of TCP sessions are short-lived, and we
      	send one ACK per received segment anyway at the beginning or
      	on retransmits to let the sender smoothly increase its cwnd,
      	so an auto-tuning facility won't help them that much."
      
      and according to David:
      
      	"ACKs are the only information we have to detect loss.
      
      	And, for the same reasons that TCP VEGAS is fundamentally
      	broken, we cannot measure the pipe or some other
      	receiver-side-visible piece of information to determine
      	when it's "safe" to stretch ACK.
      
      	And even if it's "safe", we should not do it so that losses are
      	accurately detected and we don't spuriously retransmit.
      
      	The only way to know when the bandwidth increases is to
      	"test" it, by sending more and more packets until drops happen.
      	That's why all successful congestion control algorithms must
      	operate on explicitly tested pieces of information.
      
      	Similarly, it's not really possible to universally know if
      	it's safe to stretch ACK or not."
      
      It still makes sense to be able to enable or disable quick-ack mode,
      as the TCP_QUICKACK socket option does.
      
      This is similar to the TCP_QUICKACK option, but for people who can't
      modify the source code and still want to control TCP delayed ACK
      behavior. As David suggested, this belongs in per-path scope, since
      different paths may want different behaviors.
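      
      A hedged sketch of how the knob is consulted (assuming a new
      RTAX_QUICKACK route metric; dst_metric() is the real accessor), with
      the presumable iproute2 usage:
      
      /* honor the per-route quickack knob: stay out of pingpong
       * (delayed-ACK) mode on this path */
      if (dst_metric(dst, RTAX_QUICKACK))
              inet_csk(sk)->icsk_ack.pingpong = 0;
      
      # assumed iproute2 syntax
      $ ip route change 10.0.0.0/24 dev eth0 quickack 1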
      
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Rick Jones <rick.jones2@hp.com>
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Thomas Graf <tgraf@suug.ch>
      CC: David Laight <David.Laight@ACULAB.COM>
      Signed-off-by: Cong Wang <amwang@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  12. 13 Jun 2013, 1 commit
  13. 31 May 2013, 4 commits
  14. 26 May 2013, 3 commits
  15. 23 May 2013, 1 commit
    • tcp: bug fix in proportional rate reduction. · 35f079eb
      Committed by Nandita Dukkipati
      This patch is a fix for a bug triggering newly_acked_sacked < 0
      in tcp_ack().
      
      The bug is triggered by sacked_out decreasing relative to
      prior_sacked, while packets_out remains the same as prior_packets.
      This is because the snapshot of prior_packets is taken after
      tcp_sacktag_write_queue() while prior_sacked is captured before it.
      The problem is that tcp_sacktag_write_queue() (via
      tcp_match_skb_to_sack() -> tcp_fragment()) adjusts the pcount for
      packets_out and sacked_out (on an MSS change or for other reasons).
      As a result, this delta in pcount is reflected in
      (prior_sacked - sacked_out) but not in (prior_packets - packets_out).
      
      This patch does the following:
      1) initializes prior_packets at the start of tcp_ack() so as to
      capture the delta in packets_out created by tcp_fragment.
      2) introduces a new "previous_packets_out" variable that snapshots
      packets_out right before tcp_clean_rtx_queue, so pkts_acked can be
      correctly computed as before.
      3) Computes pkts_acked using previous_packets_out, and computes
      newly_acked_sacked using prior_packets.
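      
      A hedged sketch of the snapshot ordering inside tcp_ack() after the
      fix (previous_packets_out is the new variable described above):
      
      int prior_packets = tp->packets_out;    /* 1) before SACK tagging */
      ...
      flag |= tcp_sacktag_write_queue(sk, skb, prior_snd_una);
      ...
      previous_packets_out = tp->packets_out; /* 2) right before cleaning */
      flag |= tcp_clean_rtx_queue(sk, prior_fackets, prior_snd_una);
      
      /* 3) pkts_acked from the late snapshot, newly_acked_sacked from the
       * early one, so tcp_fragment()'s pcount adjustments cancel out */
      pkts_acked = previous_packets_out - tp->packets_out;
      newly_acked_sacked = prior_packets - tp->packets_out +
                           tp->sacked_out - prior_sacked;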
      Signed-off-by: Nandita Dukkipati <nanditad@google.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  16. 20 May 2013, 1 commit
    • tcp: remove bad timeout logic in fast recovery · 3e59cb0d
      Committed by Yuchung Cheng
      tcp_timeout_skb() was intended to trigger fast recovery on timeout;
      unfortunately, in reality it often causes spurious retransmission
      storms during fast recovery. The telltale sign is a fast retransmit
      over the highest sacked sequence (SND.FACK).
      
      Currently the RTO timer re-arming (as in RFC6298) offers a nice
      cushion to avoid spurious timeouts: when SND.UNA advances, the sender
      re-arms the RTO and extends the timeout by icsk_rto. The sender does
      not subtract the time already elapsed since the packet at SND.UNA was
      sent.
      
      But if the next (DUP)ACK arrives later than ~RTTVAR and triggers
      tcp_fastretrans_alert(), then tcp_timeout_skb() will mark as lost any
      packet sent before the icsk_rto interval, including ones above the
      highest sacked sequence. Most likely a large part of the scoreboard
      will be marked.
      
      If most packets are not lost then the subsequent DUPACKs with new SACK
      blocks will cause the sender to continue to retransmit packets beyond
      SND.FACK spuriously. Even if only one packet is lost the sender may
      falsely retransmit almost the entire window.
      
      The situation becomes common in the world of bufferbloat: the RTT
      continues to grow as the queue builds up but RTTVAR remains small and
      close to the minimum 200ms. If a data packet is lost and the DUPACK
      triggered by the next data packet is slightly delayed, then a spurious
      retransmission storm forms.
      
      As the original comment on tcp_timeout_skb() suggests, the usefulness
      of this feature is questionable. It also wastes cycles walking the
      SACK scoreboard, and is actually harmful because it causes false
      recoveries.
      
      It's time to remove it.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Nandita Dukkipati <nanditad@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  17. 17 May 2013, 1 commit
    • tcp: speedup tcp_fixup_rcvbuf() · d2cf4367
      Committed by Eric Dumazet
      tcp_fixup_rcvbuf() contains a loop to estimate the initial socket
      rcv space needed for a given mss. With a large MTU (like 64K on lo),
      we can loop ~500 times and consume a lot of CPU cycles.
      
      perf top of 200 concurrent netperf -t TCP_CRR
      
      5.62%  netperf  [kernel.kallsyms]  [k] tcp_init_buffer_space
      1.71%  netperf  [kernel.kallsyms]  [k] _raw_spin_lock
      1.55%  netperf  [kernel.kallsyms]  [k] kmem_cache_free
      1.51%  netperf  [kernel.kallsyms]  [k] tcp_transmit_skb
      1.50%  netperf  [kernel.kallsyms]  [k] tcp_ack
      
      Let's use a 100% factor and remove the loop.
      
      100% is needed anyway for the default tcp_adv_win_scale=1
      value, and is also the maximum factor.
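      
      A hedged sketch of the simplified function (assuming a
      tcp_default_init_rwnd() helper returning the initial receive window
      in segments):
      
      static void tcp_fixup_rcvbuf(struct sock *sk)
      {
              u32 mss = tcp_sk(sk)->advmss;
              int rcvmem;
      
              /* a flat 100% overhead factor replaces the old loop */
              rcvmem = 2 * SKB_TRUESIZE(mss + MAX_TCP_HEADER) *
                       tcp_default_init_rwnd(mss);
      
              if (sk->sk_rcvbuf < rcvmem)
                      sk->sk_rcvbuf = min(rcvmem, sysctl_tcp_rmem[2]);
      }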
      
      Refs: commit b49960a0
            ("tcp: change tcp_adv_win_scale and tcp_rmem[2]")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  18. 30 Apr 2013, 1 commit
  19. 20 Apr 2013, 1 commit
    • tcp: call tcp_replace_ts_recent() from tcp_ack() · 12fb3dd9
      Committed by Eric Dumazet
      Commit bd090dfc ("tcp: tcp_replace_ts_recent() should not be called
      from tcp_validate_incoming()") introduced a TS ECR bug in slow-path
      processing.
      
      1 A > B P. 1:10001(10000) ack 1 <nop,nop,TS val 1001 ecr 200>
      2 B < A . 1:1(0) ack 1 win 257 <sack 9001:10001,TS val 300 ecr 1001>
      3 A > B . 1:1001(1000) ack 1 win 227 <nop,nop,TS val 1002 ecr 200>
      4 A > B . 1001:2001(1000) ack 1 win 227 <nop,nop,TS val 1002 ecr 200>
      
      (ecr 200 should be ecr 300 in packets 3 & 4)
      
      The problem is that tcp_ack() can trigger transmission of new packets
      (retransmits) that reflect the prior TSval instead of the TSval
      contained in the incoming packet currently being processed.
      
      Fix this by calling tcp_replace_ts_recent() from tcp_ack(), after the
      checks but before the actions.
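      
      A hedged sketch of the placement inside tcp_ack()
      (FLAG_UPDATE_TS_RECENT is assumed to be threaded in from the slow
      path):
      
      /* ...ACK sanity checks passed... */
      
      /* ts_recent update must be made after we are sure this packet is in
       * window, and before anything below can transmit and reflect a
       * stale ts_recent */
      if (flag & FLAG_UPDATE_TS_RECENT)
              tcp_replace_ts_recent(tp, TCP_SKB_CB(skb)->seq);
      
      /* ...actions that may send packets (e.g. retransmits) follow... */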
      Reported-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  20. 25 Mar 2013, 1 commit
  21. 21 Mar 2013, 3 commits
    • tcp: implement RFC5682 F-RTO · e33099f9
      Committed by Yuchung Cheng
      This patch implements F-RTO (forward RTO recovery):
      
      When the first retransmission after timeout is acknowledged, F-RTO
      sends new data instead of old data. If the next ACK acknowledges
      some never-retransmitted data, then the timeout was spurious and the
      congestion state is reverted.  Otherwise if the next ACK selectively
      acknowledges the new data, then the timeout was genuine and the
      loss recovery continues. This idea applies to recurring timeouts
      as well. While F-RTO sends different data during timeout recovery,
      it does not (and should not) change the congestion control.
      
      The implementation follows the three steps of the SACK-enhanced
      algorithm (section 3) of RFC5682. Step 1 is in tcp_enter_loss();
      steps 2 and 3 are in tcp_process_loss(). The basic version is not
      implemented, because the SACK-enhanced version also works for
      non-SACK connections.
      
      The new implementation is functionally on par with the old F-RTO
      implementation except in one case where it increases undo events:
      in addition to the RFC algorithm, a spurious timeout may be detected
      without sending data in step 2, as long as the SACK confirms that not
      all of the original data was dropped. When this happens, the sender
      will undo the cwnd and perhaps enter fast recovery instead. This
      additional check increases F-RTO undo events by 5x on Google Web
      servers compared to the prior implementation, since for HTTP the
      sender often has no new data to send.
      
      Note that F-RTO may detect a spurious timeout before Eifel with
      timestamps does.
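      
      A hedged sketch of steps 2 and 3 as they plausibly appear in
      tcp_process_loss() (FLAG_ORIG_SACK_ACKED marks never-retransmitted
      data being SACKed; 'recovered' says whether the loss was already
      repaired):
      
      if (tp->frto) { /* F-RTO, RFC5682 SACK-enhanced variant */
              /* step 3.b, plus the extra undo case described above: a
               * never-retransmitted range was SACKed, timeout was spurious */
              if ((flag & FLAG_ORIG_SACK_ACKED) &&
                  tcp_try_undo_loss(sk, true))
                      return;
      
              if (after(tp->snd_nxt, tp->high_seq)) {
                      /* step 3.a: the new data sent in step 2 is (s)acked
                       * only partially: the timeout was genuine */
                      if (flag & FLAG_DATA_SACKED || is_dupack)
                              tp->frto = 0;
              } else if ((flag & FLAG_SND_UNA_ADVANCED) && !recovered) {
                      /* step 2.b: the retransmission was acked;
                       * send new data if we have some */
                      tp->high_seq = tp->snd_nxt;
                      __tcp_push_pending_frames(sk, tcp_current_mss(sk),
                                                TCP_NAGLE_OFF);
              }
      }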
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: refactor CA_Loss state processing · ab42d9ee
      Committed by Yuchung Cheng
      Consolidate all of TCP's CA_Loss state processing in
      tcp_fastretrans_alert() into a new function called tcp_process_loss().
      This is to prepare for the new F-RTO implementation in the next patch.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: refactor F-RTO · 9b44190d
      Committed by Yuchung Cheng
      This patch series refactors the F-RTO feature (RFC4138/5682).
      
      This is to simplify the loss recovery processing. The existing F-RTO
      was developed during the experimental stage (RFC4138) and has
      many experimental features. It takes a separate code path from
      the traditional timeout processing, overloading CA_Disorder
      instead of using the CA_Loss state. This complicates CA_Disorder
      state handling because that state is also used for handling dubious
      ACKs and undos. While the algorithm in the RFC does not change the
      congestion control, the implementation intercepts congestion control
      in various places (e.g., frto_cwnd in tcp_ack()).
      
      The new code implements the newer F-RTO RFC5682 using the CA_Loss
      processing path. F-RTO becomes a small extension of the timeout
      processing and interfaces with the congestion control and Eifel undo
      modules. It lets the congestion control (module) determine
      independently how much to send; F-RTO only chooses what to send, in
      order to detect spurious retransmissions. If the timeout is found to
      be spurious, it invokes existing Eifel undo algorithms such as
      DSACK-based or TCP timestamp-based detection.
      
      The first patch removes all F-RTO code; only sysctl_tcp_frto is left
      for the new implementation. Since CA_EVENT_FRTO is removed, TCP
      Westwood now computes ssthresh on the regular timeout event
      (CA_EVENT_LOSS).
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  22. 18 Mar 2013, 1 commit
    • tcp: Remove TCPCT · 1a2c6181
      Committed by Christoph Paasch
      TCPCT uses option number 253, which is reserved for experimental use
      and should not be used in production environments.
      Further, TCPCT does not fully implement RFC 6013.
      
      As a nice side-effect, removing TCPCT increases TCP's performance for
      very short flows:
      
      Doing an apache-benchmark with -c 100 -n 100000, sending HTTP-requests
      for files of 1KB size.
      
      before this patch:
      	average (among 7 runs) of 20845.5 Requests/Second
      after:
      	average (among 7 runs) of 21403.6 Requests/Second
      Signed-off-by: Christoph Paasch <christoph.paasch@uclouvain.be>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  23. 12 Mar 2013, 2 commits
    • tcp: TLP loss detection. · 9b717a8d
      Committed by Nandita Dukkipati
      This is the second of the TLP patch series; it augments the basic TLP
      algorithm with a loss detection scheme.
      
      This patch implements a mechanism for loss detection when a tail
      loss probe retransmission plugs a hole, thereby masking packet loss
      from the sender. The loss detection algorithm relies on counting
      TLP dupacks as outlined in Sec. 3 of:
      http://tools.ietf.org/html/draft-dukkipati-tcpm-tcp-loss-probe-01
      
      The basic idea is: the sender keeps track of a TLP "episode" upon
      retransmission of a TLP packet. An episode ends when the sender
      receives an ACK above the SND.NXT (tracked by tlp_high_seq) at the
      time of the episode. We want to make sure that before the episode
      ends the sender receives a "TLP dupack", indicating that the TLP
      retransmission was unnecessary, so there was no loss/hole that needed
      plugging. If the sender gets no TLP dupack before the end of the
      episode, then it reduces ssthresh and the congestion window, because
      the TLP packet arriving at the receiver probably plugged a hole.
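      
      A hedged sketch of the episode accounting on the ACK path (assuming a
      tcp_process_tlp_ack() helper and the tlp_high_seq field named above):
      
      static void tcp_process_tlp_ack(struct sock *sk, u32 ack, int flag)
      {
              struct tcp_sock *tp = tcp_sk(sk);
              bool is_tlp_dupack = (ack == tp->tlp_high_seq) &&
                                   !(flag & (FLAG_SND_UNA_ADVANCED |
                                             FLAG_NOT_DUP | FLAG_DATA_SACKED));
      
              if (is_tlp_dupack) {
                      /* the probe was unnecessary: no hole was plugged */
                      tp->tlp_high_seq = 0;
              } else if (after(ack, tp->tlp_high_seq)) {
                      /* episode ended without a TLP dupack: the probe most
                       * likely plugged a hole, so reduce ssthresh and cwnd
                       * (enter and immediately finish a CWR episode) */
                      tp->tlp_high_seq = 0;
              }
      }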
      Signed-off-by: Nandita Dukkipati <nanditad@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: Tail loss probe (TLP) · 6ba8a3b1
      Committed by Nandita Dukkipati
      This patch series implements the Tail loss probe (TLP) algorithm
      described in
      http://tools.ietf.org/html/draft-dukkipati-tcpm-tcp-loss-probe-01.
      The first patch implements the basic algorithm.
      
      TLP's goal is to reduce the tail latency of short transactions. It
      achieves this by converting retransmission timeouts (RTOs) occurring
      due to tail losses (losses at the end of transactions) into fast
      recovery.
      TLP transmits one packet in two round-trips when a connection is in
      Open state and isn't receiving any ACKs. The transmitted packet, aka
      loss probe, can be either new or a retransmission. When there is tail
      loss, the ACK from a loss probe triggers FACK/early-retransmit based
      fast recovery, thus avoiding a costly RTO. In the absence of loss,
      there is no change in the connection state.
      
      PTO stands for probe timeout. It is a timer event indicating that an
      ACK is overdue, and it triggers a loss probe packet. The PTO value is
      set to max(2*SRTT, 10ms) and is adjusted to account for the delayed
      ACK timer when there is only one outstanding packet (see the sketch
      after the algorithm below).
      
      TLP Algorithm
      
      On transmission of new data in Open state:
        -> packets_out > 1: schedule PTO in max(2*SRTT, 10ms).
        -> packets_out == 1: schedule PTO in max(2*RTT, 1.5*RTT + 200ms)
        -> PTO = min(PTO, RTO)
      
      Conditions for scheduling PTO:
        -> Connection is in Open state.
        -> Connection is either cwnd limited or no new data to send.
        -> Number of probes per tail loss episode is limited to one.
        -> Connection is SACK enabled.
      
      When PTO fires:
        new_segment_exists:
          -> transmit new segment.
          -> packets_out++. cwnd remains same.
      
        no_new_packet:
          -> retransmit the last segment.
             Its ACK triggers FACK or early retransmit based recovery.
      
      ACK path:
        -> rearm RTO at start of ACK processing.
        -> reschedule PTO if need be.
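      
      A hedged sketch of the PTO computation described above (assuming
      tcp_schedule_loss_probe() in net/ipv4/tcp_output.c; tp->srtt is in
      jiffies << 3):
      
      u32 rtt = tp->srtt >> 3;
      u32 timeout;
      
      /* PTO = 2*SRTT, padded for the delayed-ACK timer when only one
       * packet is outstanding, floored at 10ms, capped by the RTO */
      timeout = rtt << 1;
      if (tp->packets_out == 1)
              timeout = max_t(u32, timeout,
                              rtt + (rtt >> 1) + TCP_DELACK_MAX);
      timeout = max_t(u32, timeout, msecs_to_jiffies(10));
      /* ...then clamp so the probe never fires after the pending RTO... */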
      
      In addition, the patch includes a small variation to the Early
      Retransmit (ER) algorithm, such that ER and TLP together can in
      principle recover any N-degree of tail loss through fast recovery.
      TLP is controlled by the same sysctl as ER, the tcp_early_retrans
      sysctl:
      tcp_early_retrans==0; disables TLP and ER.
      		 ==1; enables RFC5827 ER.
      		 ==2; delayed ER.
      		 ==3; TLP and delayed ER. [DEFAULT]
      		 ==4; TLP only.
      
      The TLP patch series has been extensively tested on Google Web
      servers. It is most effective for short Web transactions, where it
      reduced RTOs by 15% and improved HTTP response time (on average by
      6%, at the 99th percentile by 10%). The transmitted probes account
      for <0.5% of overall transmissions.
      Signed-off-by: Nandita Dukkipati <nanditad@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  24. 05 Mar 2013, 1 commit
  25. 14 Feb 2013, 2 commits
    • net: Fix possible wrong checksum generation. · c9af6db4
      Committed by Pravin B Shelar
      Patch cef401de ("net: fix possible wrong checksum generation") fixed
      wrong checksum calculation, but it broke TSO by defining a new GSO
      type without a corresponding netdev feature; net_gso_ok() would not
      allow hardware checksum/segmentation offload of such packets without
      the feature.
      
      The following patch fixes both TSO and the wrong checksum. It uses
      the same logic that Eric Dumazet used: it introduces a new flag,
      SKBTX_SHARED_FRAG, set if at least one frag can be modified by the
      user, but keeps the flag in the skb shared info tx_flags rather than
      in gso_type.
      
      tx_flags is a better fit than gso_type, since we can have an skb with
      a shared frag that is not a GSO packet. It does not tie SHARED_FRAG
      to GSO, so there is no need to define a netdev feature for this.
      Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: send packets with a socket timestamp · ee684b6f
      Committed by Andrey Vagin
      A socket timestamp is the sum of the global tcp_time_stamp and a
      per-socket offset.
      
      The socket offset is added in the places where the externally visible
      TCP timestamp option is parsed/initialized.
      
      Connections in the SYN_RECV state are not supported; the global
      tcp_time_stamp is used for them, because repair mode doesn't support
      this state. In the future this can be implemented in a way similar to
      TIME_WAIT sockets.
      
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      Cc: James Morris <jmorris@namei.org>
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Cc: Patrick McHardy <kaber@trash.net>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Signed-off-by: Andrey Vagin <avagin@openvz.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  26. 07 Feb 2013, 1 commit