1. 06 Aug, 2014 1 commit
    • tcp: reduce spurious retransmits due to transient SACK reneging · 5ae344c9
      Neal Cardwell committed
      This commit reduces spurious retransmits due to apparent SACK reneging
      by only reacting to SACK reneging that persists for a short delay.
      
      When a sequence space hole at snd_una is filled, some TCP receivers
      send a series of ACKs as they apparently scan their out-of-order queue
      and cumulatively ACK all the packets that have now been consecutively
      received. This is essentially misbehavior B in "Misbehaviors in TCP
      SACK generation" ACM SIGCOMM Computer Communication Review, April
      2011, so we suspect that this is from several common OSes (Windows
      2000, Windows Server 2003, Windows XP). However, this issue has also
      been seen in other cases, e.g. the netdev thread "TCP being hoodwinked
      into spurious retransmissions by lack of timestamps?" from March 2014,
      where the receiver was thought to be a BSD box.
      
      Since snd_una would temporarily be adjacent to a previously SACKed
      range in these scenarios, this receiver behavior triggered the Linux
      SACK reneging code path in the sender. This led the sender to clear
      the SACK scoreboard, enter CA_Loss, and spuriously retransmit
      (potentially) every packet from the entire write queue at line rate
      just a few milliseconds before the ACK for each packet arrives at the
      sender.
      
      To avoid such situations, now when a sender sees apparent reneging it
      does not yet retransmit, but rather adjusts the RTO timer to give the
      receiver a little time (max(RTT/2, 10ms)) to send us some more ACKs
      that will restore sanity to the SACK scoreboard. If the reneging
      persists until this RTO then, as before, we clear the SACK scoreboard
      and enter CA_Loss.
      
      A 10ms delay tolerates a receiver sending such a stream of ACKs at
      56Kbit/sec. And to allow for receivers with slower or more congested
      paths, we wait for at least RTT/2.
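
A minimal sketch of the delay formula described above, assuming microsecond inputs. The helper names reneg_delay_us() and bytes_in_interval() are hypothetical and are not the kernel's functions; the kernel works on its own scaled srtt and jiffies.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: how long to wait after apparent SACK reneging
 * before treating it as real, i.e. max(RTT/2, 10ms), in microseconds.
 */
uint32_t reneg_delay_us(uint32_t rtt_us)
{
	const uint32_t min_delay_us = 10 * 1000;	/* 10 ms floor */
	uint32_t half_rtt_us = rtt_us / 2;

	return half_rtt_us > min_delay_us ? half_rtt_us : min_delay_us;
}

/* Hypothetical helper: bytes a link of rate_bps (bits per second)
 * can carry in interval_us microseconds.
 */
uint64_t bytes_in_interval(uint64_t rate_bps, uint64_t interval_us)
{
	return rate_bps * interval_us / 8 / 1000000;
}
```

With these assumptions a 4 ms RTT yields the 10 ms floor, a 100 ms RTT yields 50 ms, and a 56Kbit/sec link carries 70 bytes in 10 ms, which is on the order of one ACK.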
      
      We validated the resulting max(RTT/2, 10ms) delay formula with a mix
      of North American and South American Google web server traffic, and
      found that for ACKs displaying transient reneging:
      
       (1) 90% of inter-ACK delays were less than 10ms
       (2) 99% of inter-ACK delays were less than RTT/2
      
      In tests on Google web servers this commit reduced reneging events by
      75%-90% (as measured by the TcpExtTCPSACKReneging counter), without
      any measurable impact on latency for user HTTP and SPDY requests.
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5ae344c9
  2. 16 Jul, 2014 1 commit
  3. 08 Jul, 2014 1 commit
    • tcp: fix false undo corner cases · 6e08d5e3
      Yuchung Cheng committed
      The undo code assumes that, upon entering loss recovery, TCP
      1) always retransmits something, and
      2) the retransmission never fails locally (e.g., qdisc drop),

      so undo_marker is set in tcp_enter_recovery() and undo_retrans is
      incremented only when tcp_retransmit_skb() is successful.

      When the assumption is broken, because TCP's cwnd is too small to
      retransmit or the retransmit fails locally, the next (DUP)ACK
      would incorrectly revert the cwnd and the congestion state in
      tcp_try_undo_dsack() or tcp_may_undo(). Subsequent (DUP)ACKs
      may enter the recovery state again. The sender repeatedly enters and
      (incorrectly) exits recovery states if the retransmits continue to
      fail locally while receiving (DUP)ACKs.
      
      The fix is to initialize undo_retrans to -1 and start counting on
      the first retransmission. Always increment undo_retrans even if the
      retransmissions fail locally because they couldn't cause DSACKs to
      undo the cwnd reduction.
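
The counting rule can be modelled in user space roughly as follows. This is a simplified sketch and not the kernel code: struct undo_state and all function names are hypothetical, and only the DSACK-based undo path is modelled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified model of the sender's undo bookkeeping. */
struct undo_state {
	bool undo_marker;	/* set when loss recovery is entered */
	int undo_retrans;	/* -1 means nothing has been retransmitted yet */
};

void enter_recovery(struct undo_state *s)
{
	s->undo_marker = true;
	s->undo_retrans = -1;
}

/* Called for every retransmission attempt, even one that fails locally. */
void on_retransmit_attempt(struct undo_state *s, int pkts)
{
	assert(pkts > 0);
	if (s->undo_retrans < 0)
		s->undo_retrans = 0;	/* start counting on first attempt */
	s->undo_retrans += pkts;
}

/* A DSACK reports that one retransmission was unnecessary. */
void on_dsack(struct undo_state *s)
{
	if (s->undo_marker && s->undo_retrans > 0)
		s->undo_retrans--;
}

/* Undo is allowed only once every counted retransmission was DSACKed. */
bool may_undo_on_dsack(const struct undo_state *s)
{
	return s->undo_marker && s->undo_retrans == 0;
}
```

In this model a sender that has not retransmitted anything keeps undo_retrans at -1, so an incoming (DUP)ACK cannot trigger a false undo.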
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6e08d5e3
  4. 30 Jun, 2014 1 commit
  5. 28 Jun, 2014 1 commit
  6. 20 Jun, 2014 1 commit
    • tcp: fix tcp_match_skb_to_sack() for unaligned SACK at end of an skb · 2cd0d743
      Neal Cardwell committed
      If there is an MSS change (or misbehaving receiver) that causes a SACK
      to arrive that covers the end of an skb but is less than one MSS, then
      tcp_match_skb_to_sack() was rounding up pkt_len to the full length of
      the skb ("Round if necessary..."), then chopping all bytes off the skb
      and creating a zero-byte skb in the write queue.
      
      This was visible now because the recently simplified TLP logic in
      bef1909e ("tcp: fixing TLP's FIN recovery") could find that 0-byte
      skb at the end of the write queue, and now that we do not check that
      skb's length we could send it as a TLP probe.
      
      Consider the following example scenario:
      
       mss: 1000
       skb: seq: 0 end_seq: 4000  len: 4000
       SACK: start_seq: 3999 end_seq: 4000
      
      The tcp_match_skb_to_sack() code will compute:
      
       in_sack = false
       pkt_len = start_seq - TCP_SKB_CB(skb)->seq = 3999 - 0 = 3999
       new_len = (pkt_len / mss) * mss = (3999/1000)*1000 = 3000
       new_len += mss = 4000
      
      Previously we would find the new_len > skb->len check failing, so we
      would fall through and set pkt_len = new_len = 4000 and chop off
      pkt_len of 4000 from the 4000-byte skb, leaving a 0-byte segment
      afterward in the write queue.
      
      With this new commit, we notice that the new new_len >= skb->len check
      succeeds, so that we return without trying to fragment.
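
The rounding walk-through above can be reproduced with a small stand-alone model. This is a sketch under stated assumptions: sack_split_len() is a hypothetical name, it takes plain integers instead of an skb, and it returns 0 to mean "do not fragment".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the rounding step: given the skb length, the
 * unrounded split point pkt_len and the MSS, return the number of bytes
 * to split off the front of the skb, or 0 when no split should happen.
 * in_sack is true when the SACK block covers the head of the skb.
 */
uint32_t sack_split_len(uint32_t skb_len, uint32_t pkt_len, uint32_t mss,
			bool in_sack)
{
	assert(mss > 0);

	if (pkt_len > mss) {
		uint32_t new_len = (pkt_len / mss) * mss;	/* round down */

		if (!in_sack && new_len < pkt_len) {
			new_len += mss;				/* round up */
			if (new_len >= skb_len)
				return 0;	/* would leave a 0-byte skb */
		}
		pkt_len = new_len;
	}
	return pkt_len;
}
```

For the scenario in the text (mss 1000, 4000-byte skb, SACK of 3999:4000) the model returns 0, so no zero-byte segment is created.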
      
      Fixes: adb92db8 ("tcp: Make SACK code to split only at mss boundaries")
      Reported-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Ilpo Jarvinen <ilpo.jarvinen@helsinki.fi>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2cd0d743
  7. 11 Jun, 2014 1 commit
  8. 03 Jun, 2014 1 commit
  9. 04 May, 2014 1 commit
  10. 21 Apr, 2014 1 commit
  11. 12 Apr, 2014 1 commit
    • net: Fix use after free by removing length arg from sk_data_ready callbacks. · 676d2369
      David S. Miller committed
      Several spots in the kernel perform a sequence like:
      
      	skb_queue_tail(&sk->sk_receive_queue, skb);
      	sk->sk_data_ready(sk, skb->len);
      
      But as soon as we place the SKB onto the socket receive queue it
      can be consumed and freed up.  So this skb->len access potentially
      reads freed memory.

      Furthermore, the skb->len can be modified by the consumer so it is
      possible that the value isn't accurate.

      And finally, no implementation of this callback actually uses
      the length argument.  And since nobody actually cared about its
      value, lots of call sites pass arbitrary values such as '0' and
      even '1'.
      
      So just remove the length argument from the callback, that way there
      is no confusion whatsoever and all of these use-after-free cases get
      fixed as a side effect.
      
      Based upon a patch by Eric Dumazet and his suggestion to audit this
      issue tree-wide.
      Signed-off-by: David S. Miller <davem@davemloft.net>
      676d2369
  12. 11 Mar, 2014 1 commit
  13. 04 Mar, 2014 2 commits
  14. 27 Feb, 2014 1 commit
    • tcp: switch rtt estimations to usec resolution · 740b0f18
      Eric Dumazet committed
      Upcoming congestion controls for TCP require usec resolution for RTT
      estimations. Millisecond resolution is simply not enough these days.
      
      FQ/pacing in DC environments also requires this change for finer control
      and removal of bimodal behavior due to the current hack in
      tcp_update_pacing_rate() for 'small rtt'.
      
      TCP_CONG_RTT_STAMP is no longer needed.
      
      As Julian Anastasov pointed out, we need to keep user compatibility :
      tcp_metrics used to export RTT and RTTVAR in msec resolution,
      so we added RTT_US and RTTVAR_US. An iproute2 patch is needed
      to use the new attributes if provided by the kernel.
      
      In this example the ss command displays an srtt of 32 usecs (10Gbit link):

      lpk51:~# ./ss -i dst lpk52
      Netid  State      Recv-Q Send-Q   Local Address:Port       Peer Address:Port
      tcp    ESTAB      0      1         10.246.11.51:42959      10.246.11.52:64614
               cubic wscale:6,6 rto:201 rtt:0.032/0.001 ato:40 mss:1448 cwnd:10 send 3620.0Mbps pacing_rate 7240.0Mbps unacked:1 rcv_rtt:993 rcv_space:29559

      The updated iproute2 ip command displays:

      lpk51:~# ./ip tcp_metrics | grep 10.246.11.52
      10.246.11.52 age 561.914sec cwnd 10 rtt 274us rttvar 213us source 10.246.11.51

      The old binary displays:

      lpk51:~# ip tcp_metrics | grep 10.246.11.52
      10.246.11.52 age 561.914sec cwnd 10 rtt 250us rttvar 125us source 10.246.11.51
      
      With help from Julian Anastasov, Stephen Hemminger and Yuchung Cheng
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Larry Brakmo <brakmo@google.com>
      Cc: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      740b0f18
  15. 07 Feb, 2014 1 commit
    • tcp: remove 1ms offset in srtt computation · 4a5ab4e2
      Eric Dumazet committed
      TCP pacing depends on an accurate srtt estimation.
      
      The current srtt estimation uses jiffy resolution
      and has an artificial offset of at least 1 ms, which can produce
      slowdowns when FQ/pacing is used, especially in the DC world,
      where typical rtt is below 1 ms.

      We are planning a switch to usec resolution for linux-3.15,
      but in the meantime, this patch removes the 1 ms offset.

      All we need is a minimal tp->srtt value of 1, not 8, to tell
      an initialized srtt from an uninitialized one.
      
      The problematic behavior was observed on a 40Gbit testbed,
      where 32 concurrent netperf were reaching 12Gbps of aggregate
      speed, instead of line speed.
      
      This patch also has the effect of reporting more accurate srtt and send
      rates to iproute2 ss command as in :
      
      $ ss -i dst cca2
      Netid  State      Recv-Q Send-Q          Local Address:Port          Peer Address:Port
      tcp    ESTAB      0      0                10.244.129.1:56984         10.244.129.2:12865
               cubic wscale:6,6 rto:200 rtt:0.25/0.25 ato:40 mss:1448 cwnd:10 send 463.4Mbps rcv_rtt:1 rcv_space:29200
      tcp    ESTAB      0      390960           10.244.129.1:60247         10.244.129.2:50204
               cubic wscale:6,6 rto:200 rtt:0.875/0.75 mss:1448 cwnd:73 ssthresh:51 send 966.4Mbps unacked:73 retrans:0/121 rcv_space:29200
      Reported-by: Vytautas Valancius <valas@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4a5ab4e2
  16. 30 Dec, 2013 1 commit
  17. 27 Dec, 2013 1 commit
  18. 05 Nov, 2013 1 commit
    • tcp: properly handle stretch acks in slow start · 9f9843a7
      Yuchung Cheng committed
      Slow start currently increases cwnd by 1 if an ACK acknowledges some
      packets, regardless of the number of packets. Consequently slow start
      performance is highly dependent on the degree of the stretch ACKs caused
      by receiver or network ACK compression mechanisms (e.g., delayed-ACK,
      GRO, etc).  But the slow start algorithm is to send twice the number of
      packets that have left the network, so it should process a stretch ACK
      of degree N as if it were N ACKs of degree 1, then exit when cwnd exceeds
      ssthresh. A follow up patch will use the remainder of the N (if greater
      than 1) to adjust cwnd in the congestion avoidance phase.

      In addition this patch retires the experimental limited slow start
      (LSS) feature. LSS has multiple drawbacks but questionable benefit. The
      fractional cwnd increase in LSS requires a loop in slow start even
      though it's rarely used. Configuring such an increase step via a global
      sysctl on different BDPs seems hard. Finally and most importantly the
      slow start overshoot concern is now better covered by the Hybrid slow
      start (hystart) enabled by default.
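
A rough user-space sketch of the stretch-ACK handling described above. struct ss_state and slow_start() are hypothetical names for illustration; the cwnd clamp and everything outside slow start are left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, minimal congestion state (units: packets). */
struct ss_state {
	uint32_t cwnd;
	uint32_t ssthresh;
};

/* Process an ACK covering `acked` packets as if it were `acked` ACKs of
 * degree 1: grow cwnd by one per acked packet, stop one past ssthresh,
 * and return the packets left over for congestion avoidance.
 * Assumes the caller is in slow start, i.e. cwnd <= ssthresh.
 */
uint32_t slow_start(struct ss_state *s, uint32_t acked)
{
	uint32_t cwnd = s->cwnd + acked;

	assert(s->cwnd <= s->ssthresh);

	if (cwnd > s->ssthresh)
		cwnd = s->ssthresh + 1;
	acked -= cwnd - s->cwnd;
	s->cwnd = cwnd;
	return acked;
}
```

For example, with cwnd 10 and ssthresh 12, a stretch ACK of degree 5 raises cwnd to 13 and leaves a remainder of 2.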
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9f9843a7
  19. 28 Oct, 2013 3 commits
  20. 22 Oct, 2013 1 commit
    • tcp: initialize passive-side sk_pacing_rate after 3WHS · 02cf4ebd
      Neal Cardwell committed
      For passive TCP connections, upon receiving the ACK that completes the
      3WHS, make sure we set our pacing rate after we get our first RTT
      sample.
      
      On passive TCP connections, when we receive the ACK completing the
      3WHS we do not take an RTT sample in tcp_ack(), but rather in
      tcp_synack_rtt_meas(). So upon receiving the ACK that completes the
      3WHS, tcp_ack() leaves sk_pacing_rate at its initial value.
      
      Originally the initial sk_pacing_rate value was 0, so passive-side
      connections defaulted to sysctl_tcp_min_tso_segs (2 segs) in skbuffs
      made in the first RTT. With a default initial cwnd of 10 packets, this
      happened to be correct for RTTs 5ms or bigger, so it was hard to
      see problems in WAN or emulated WAN testing.
      
      Since 7eec4174 ("pkt_sched: fq: fix non TCP flows pacing"), the
      initial sk_pacing_rate is 0xffffffff. So after that change, passive
      TCP connections were keeping this value (and using large numbers of
      segments per skbuff) until receiving an ACK for data.
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      02cf4ebd
  21. 18 Oct, 2013 1 commit
  22. 10 Oct, 2013 1 commit
  23. 05 Oct, 2013 1 commit
    • tcp: do not forget FIN in tcp_shifted_skb() · 5e8a402f
      Eric Dumazet committed
      Yuchung found the following problem:

       There are bugs in the SACK processing code, in the merging part in
       tcp_shift_skb_data(), that incorrectly reset or ignore the sacked
       skb's FIN flag. When a receiver first SACKs the FIN sequence, and later
       throws away its ofo queue (e.g., sack-reneging), the sender will stop
       retransmitting the FIN flag, and hang forever.

      The following packetdrill test can be used to reproduce the bug.
      
      $ cat sack-merge-bug.pkt
      `sysctl -q net.ipv4.tcp_fack=0`
      
      // Establish a connection and send 10 MSS.
      0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
      +.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
      +.000 bind(3, ..., ...) = 0
      +.000 listen(3, 1) = 0
      
      +.050 < S 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7>
      +.000 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 6>
      +.001 < . 1:1(0) ack 1 win 1024
      +.000 accept(3, ..., ...) = 4
      
      +.100 write(4, ..., 12000) = 12000
      +.000 shutdown(4, SHUT_WR) = 0
      +.000 > . 1:10001(10000) ack 1
      +.050 < . 1:1(0) ack 2001 win 257
      +.000 > FP. 10001:12001(2000) ack 1
      +.050 < . 1:1(0) ack 2001 win 257 <sack 10001:11001,nop,nop>
      +.050 < . 1:1(0) ack 2001 win 257 <sack 10001:12002,nop,nop>
      // SACK reneg
      +.050 < . 1:1(0) ack 12001 win 257
      +0 %{ print "unacked: ",tcpi_unacked }%
      +5 %{ print "" }%
      
      First, a typo inverted left/right of one OR operation; then the
      code forgot to advance end_seq if the merged skb carried FIN.

      The bug was added in 2.6.29 by commit 832d11c5
      ("tcp: Try to restore large SKBs while SACK processing")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5e8a402f
  24. 04 Oct, 2013 1 commit
  25. 03 Oct, 2013 1 commit
    • tcp: sndbuf autotuning improvements · 6ae70532
      Eric Dumazet committed
      tcp_fixup_sndbuf() is underestimating initial send buffer requirements.
      
      It was not noticed because big GSO packets were escaping the limitation,
      but with smaller TSO packets (or TSO/GSO/SG off), application hits
      sk_sndbuf before having a chance to fill enough packets in socket write
      queue.
      
      - initial cwnd can be bigger than 10 for specific routes
      
      - SKB_TRUESIZE() is a bit under real needs in some cases,
        because of power-of-two rounding in kmalloc()
      
      - Fast Recovery (RFC 5681 3.2) : Cubic needs 70% factor
      
      - Extra cushion (application might react slowly to POLLOUT)
      
      tcp_v4_conn_req_fastopen() needs to call tcp_init_metrics() before
      calling tcp_init_buffer_space()
      
      Then we realize tcp_new_space() should call tcp_fixup_sndbuf()
      instead of duplicating this stuff.
      
      Rename tcp_fixup_sndbuf() to tcp_sndbuf_expand() to be more
      descriptive.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Maciej Żenczykowski <maze@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6ae70532
  26. 29 Sep, 2013 1 commit
    • net: introduce SO_MAX_PACING_RATE · 62748f32
      Eric Dumazet committed
      As mentioned in commit afe4fd06 ("pkt_sched: fq: Fair Queue packet
      scheduler"), this patch adds a new socket option.
      
      SO_MAX_PACING_RATE offers the application the ability to cap the
      rate computed by transport layer. Value is in bytes per second.
      
      u32 val = 1000000;
      setsockopt(sockfd, SOL_SOCKET, SO_MAX_PACING_RATE, &val, sizeof(val));
      
      To be effectively paced, a flow must use FQ packet scheduler.
      
      Note that a packet scheduler takes into account the headers for its
      computations. The effective payload rate depends on MSS and retransmits
      if any.
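
A slightly fuller version of the two-line setsockopt() example above might look like the sketch below. The helper names set_max_pacing_rate() and demo_cap_pacing() are hypothetical, and the fallback value 47 for SO_MAX_PACING_RATE is an assumption taken from asm-generic/socket.h; some architectures use a different number.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef SO_MAX_PACING_RATE
#define SO_MAX_PACING_RATE 47	/* assumed asm-generic value; arch dependent */
#endif

/* Cap the pacing rate of socket fd at bytes_per_sec (bytes per second).
 * Returns 0 on success, -1 with errno set on failure.
 */
int set_max_pacing_rate(int fd, uint32_t bytes_per_sec)
{
	return setsockopt(fd, SOL_SOCKET, SO_MAX_PACING_RATE,
			  &bytes_per_sec, sizeof(bytes_per_sec));
}

/* Usage sketch: open a TCP socket, cap it, close it again.
 * Returns 0 on success or a negative errno value.
 */
int demo_cap_pacing(uint32_t bytes_per_sec)
{
	int fd = socket(AF_INET, SOCK_STREAM, 0);
	int err = 0;

	if (fd < 0)
		return -errno;
	if (set_max_pacing_rate(fd, bytes_per_sec) < 0)
		err = -errno;	/* e.g. ENOPROTOOPT on kernels without the option */
	close(fd);
	return err;
}
```

Setting the option only stores the cap on the socket; as noted above, the flow is actually paced only when the FQ packet scheduler is in use.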
      
      I chose to make this pacing rate a SOL_SOCKET option instead of a
      TCP one because this can be used by other protocols.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Steinar H. Gunderson <sesse@google.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      62748f32
  27. 24 Sep, 2013 1 commit
    • tcp: fix dynamic right sizing · b0983d3c
      Eric Dumazet committed
      Dynamic Right Sizing (DRS) is supposed to open the TCP receive window
      automatically, but suffers from two bugs, presented in order
      of importance.
      
      1) tcp_rcv_space_adjust() fix :
      
      Using twice the last received amount is very pessimistic,
      because it doesn't allow fast recovery or proper slow start
      ramp up, if sender wants to increase cwin by 100% every RTT.
      
      copied = bytes received in previous RTT
      
      2*copied = bytes we expect to receive in next RTT
      
      4*copied = bytes we need to advertise in rwin at end of next RTT
      
      DRS is one RTT late, it needs a 4x factor.
      
      If the sender is not using ABC, and increases cwin by 50% every RTT,
      then we need a 1.5*1.5 = 2.25 factor.
      This is probably why this bug was not really noticed.

      2) There is no window adjustment after the first RTT. DRS triggers only
        after the second RTT.
        DRS needs two RTTs to initialize, so tcp_fixup_rcvbuf() should set up
        sk_rcvbuf to allow proper window growth for the first two RTTs.
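
The factor arithmetic from point 1) can be checked with a small sketch. drs_needed_window() is a hypothetical helper, not the kernel's tcp_rcv_space_adjust(); it only applies the sender's per-RTT growth twice, because DRS is one RTT late.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: window to advertise given the bytes copied in the
 * previous RTT and the sender's cwin growth per RTT in percent
 * (100 means doubling every RTT).
 */
uint64_t drs_needed_window(uint64_t copied, uint32_t growth_pct)
{
	/* bytes we expect to receive in the next RTT */
	uint64_t next_rtt = copied * (100 + growth_pct) / 100;

	/* bytes to advertise in rwin at the end of the next RTT */
	return next_rtt * (100 + growth_pct) / 100;
}
```

With 100% growth this gives the 4x factor (10000 bytes copied needs a 40000-byte window); with 50% growth it gives 2.25x (22500 bytes).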
      
      This patch increases TCP efficiency particularly for large RTT flows
      when autotuning is used at the receiver, and more particularly
      in presence of packet losses.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Cc: Van Jacobson <vanj@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b0983d3c
  28. 07 Sep, 2013 2 commits
    • tcp: properly increase rcv_ssthresh for ofo packets · 4e4f1fc2
      Eric Dumazet committed
      TCP receive window handling is multi staged.
      
      A socket has a memory budget, static or dynamic, in sk_rcvbuf.
      
      Because we do not really know how this memory budget translates to
      a TCP window (payload), TCP announces a small initial window
      (about 20 MSS).
      
      When a packet is received, we increase TCP rcv_win depending
      on the payload/truesize ratio of this packet. Good citizen
      packets give a hint that it's reasonable to have rcv_win = sk_rcvbuf/2
      
      This heuristic takes place in tcp_grow_window()
      
      Problem is : We currently call tcp_grow_window() only for in-order
      packets.
      
      This means that reorders or packet losses stop proper growth of
      rcv_win, and senders are unable to benefit from fast recovery,
      or proper reordering level detection.

      Really, a packet being stored in the OFO queue is not a bad citizen.
      It should be part of the game just like in-order packets.

      In our traces, we very often see that the sender is limited by small
      Linux receive windows, even if Linux hosts use autotuning (DRS) and
      should allow rcv_win to grow to ~3MB.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4e4f1fc2
    • tcp: fix no cwnd growth after timeout · 16edfe7e
      Yuchung Cheng committed
      Commit 0f7cc9a3 ("tcp: increase throughput when reordering is high")
      only allows cwnd to increase in Open state. This mistakenly disables
      slow start after timeout (CA_Loss). Moreover cwnd won't grow if the
      state moves from Disorder to Open later in tcp_fastretrans_alert().
      
      Therefore the correct logic should be to allow cwnd to grow as long
      as the data is received in order in Open, Loss, or even Disorder state.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      16edfe7e
  29. 04 Sep, 2013 1 commit
  30. 30 Aug, 2013 2 commits
    • tcp: TSO packets automatic sizing · 95bd09eb
      Eric Dumazet committed
      After hearing many people over past years complaining about TSO being
      bursty or even buggy, we are proud to present automatic sizing of TSO
      packets.

      One part of the problem is that tcp_tso_should_defer() uses a heuristic
      relying on upcoming ACKs instead of a timer, but more generally, having
      big TSO packets makes little sense for low rates, as it tends to create
      micro bursts on the network, and the general consensus is to reduce the
      buffering amount.
      
      This patch introduces a per socket sk_pacing_rate, that approximates
      the current sending rate, and allows us to size the TSO packets so
      that we try to send one packet every ms.
      
      This field could be set by other transports.
      
      Patch has no impact for high speed flows, where having large TSO packets
      makes sense to reach line rate.
      
      For other flows, this helps better packet scheduling and ACK clocking.
      
      This patch increases performance of TCP flows in lossy environments.
      
      A new sysctl (tcp_min_tso_segs) is added, to specify the
      minimal size of a TSO packet (default being 2).
      
      A follow-up patch will provide a new packet scheduler (FQ), using
      sk_pacing_rate as an input to perform optional per flow pacing.
      
      This explains why we chose to set sk_pacing_rate to twice the current
      rate, allowing 'slow start' ramp up.
      
      sk_pacing_rate = 2 * cwnd * mss / srtt
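
As an illustration of the formula above and of the "one packet every ms" sizing goal, here is a user-space sketch. Both function names are hypothetical, srtt is assumed to be given in microseconds, and the upper bound that a real stack places on TSO packet size is left out.

```c
#include <assert.h>
#include <stdint.h>

/* sk_pacing_rate = 2 * cwnd * mss / srtt, returned in bytes per second.
 * cwnd is in packets, mss in bytes, srtt_us in microseconds.
 */
uint64_t pacing_rate_bytes_per_sec(uint32_t cwnd, uint32_t mss,
				   uint32_t srtt_us)
{
	assert(srtt_us > 0);
	return 2ULL * cwnd * mss * 1000000ULL / srtt_us;
}

/* Number of MSS-sized segments per TSO packet so that roughly one
 * packet is sent per millisecond, never below min_tso_segs.
 */
uint32_t tso_segs_per_packet(uint64_t rate, uint32_t mss,
			     uint32_t min_tso_segs)
{
	uint64_t bytes_per_ms = rate / 1000;
	uint64_t segs;

	assert(mss > 0);
	segs = bytes_per_ms / mss;
	return segs > min_tso_segs ? (uint32_t)segs : min_tso_segs;
}
```

Under these assumptions a flow with cwnd 10, mss 1448 and a 100 ms srtt gets a rate of 289600 bytes/sec and falls back to the minimum of 2 segments, while a fast flow is sized far larger.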
      
      v2: Neal Cardwell reported a suspect deferring of last two segments on
      initial write of 10 MSS, I had to change tcp_tso_should_defer() to take
      into account tp->xmit_size_goal_segs
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Van Jacobson <vanj@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      95bd09eb
    • tcp: don't apply tsoffset if rcv_tsecr is zero · e3e12028
      Andrew Vagin committed
      The zero value means that tsecr is not valid, so it's a special case.
      
      tsoffset is used to customize tcp_time_stamp for one socket.
      tsoffset is usually zero; it's used when a socket was moved from one
      host to another host.

      Currently this issue affects the logic of tcp_rcv_rtt_measure_ts(). Due
      to an incorrect value of rcv_tsecr, tcp_rcv_rtt_measure_ts() sets rto to
      TCP_RTO_MAX.
      
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      Cc: James Morris <jmorris@namei.org>
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Cc: Patrick McHardy <kaber@trash.net>
      Reported-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Signed-off-by: Andrey Vagin <avagin@openvz.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e3e12028
  31. 23 Aug, 2013 1 commit
    • tcp: increase throughput when reordering is high · 0f7cc9a3
      Yuchung Cheng committed
      The stack currently detects reordering and avoids spurious
      retransmission very well. However the throughput is sub-optimal under
      high reordering because cwnd is increased only if the data is delivered
      in order, i.e., the FLAG_DATA_ACKED check in tcp_ack().  The more packets
      are reordered the worse the throughput is.

      Therefore when reordering is proven high, cwnd should advance whenever
      the data is delivered regardless of its ordering. If reordering is low,
      conservatively advance cwnd only on ordered deliveries in Open state,
      and retain cwnd in Disorder state (RFC5681).

      Using netperf on a qdisc setup of 20Mbps BW and random RTT from 45ms
      to 55ms (for reordering effect), this change increases TCP throughput
      by 20 - 25% to near bottleneck BW.

      A special case is the stretched ACK with new SACK and/or ECE mark.
      For example, a receiver may receive an out of order or ECN packet with
      unacked data buffered because of LRO or delayed ACK. The principle on
      such an ACK is to advance cwnd on the cumulative acked part first,
      then reduce cwnd in tcp_fastretrans_alert().
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0f7cc9a3
  32. 14 Aug, 2013 1 commit
    • tcp: reset reordering est. selectively on timeout · 74c181d5
      Yuchung Cheng committed
      On timeout the TCP sender unconditionally resets the estimated degree
      of network reordering (tp->reordering). The idea behind this is that
      the estimate is too large to trigger fast recovery (e.g., due to an IP
      path change).
      
      But for example if the sender only had 2 packets outstanding, then a
      timeout doesn't tell much about reordering. A sender that learns about
      reordering on big writes and loses packets on small writes will end up
      falsely retransmitting again and again, especially when reordering is
      more likely on big writes.
      
      Therefore the sender should only suspect that tp->reordering is too
      high if it could have gone into fast recovery with the (lower) default
      estimate.
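
One way to picture the selective reset is the sketch below. It is an interpretation, not the patch itself: reordering_after_timeout() is a hypothetical function, the default of 3 is the usual tcp_reordering sysctl default, and using the count of SACKed packets as the "could have gone into fast recovery" signal is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEFAULT_REORDERING 3	/* assumed default of the tcp_reordering sysctl */

/* Hypothetical model: return the reordering estimate to keep after a
 * timeout. The estimate is lowered to the default only if enough packets
 * were SACKed that the default threshold would have triggered fast
 * recovery, and only when the sender was not already recovering.
 */
uint32_t reordering_after_timeout(uint32_t reordering, uint32_t sacked_out,
				  bool open_or_disorder)
{
	if (open_or_disorder && sacked_out >= DEFAULT_REORDERING &&
	    reordering > DEFAULT_REORDERING)
		return DEFAULT_REORDERING;
	return reordering;
}
```

In this model a timeout with only one SACKed packet leaves an estimate of 30 untouched, while a timeout with five SACKed packets lowers it to 3.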
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      74c181d5
  33. 23 Jul, 2013 3 commits
    • tcp: use RTT from SACK for RTO · ed08495c
      Yuchung Cheng committed
      If RTT is not available because Karn's check has failed or no
      new packet is acked, use the RTT measured from SACK to estimate
      the RTO. The sender can continue to estimate the RTO during loss
      recovery or reordering events upon receiving non-partial ACKs.

      This also changes when the RTO is re-armed. Previously it was
      only re-armed when some data was cumulatively acknowledged (i.e.,
      SND.UNA advances), but now it is re-armed whenever the RTT estimator
      is updated. This feature is particularly useful to reduce spurious
      timeouts caused by buffer bloat, including on cellular carriers [1],
      and for RTT estimation on reordering events.
      
      [1] "An In-depth Study of LTE: Effect of Network Protocol and
       Application Behavior on Performance", In Proc. of SIGCOMM 2013
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ed08495c
    • tcp: measure RTT from new SACK · 59c9af42
      Yuchung Cheng committed
      Take an RTT sample if an ACK selectively acks some sequences that
      have never been retransmitted. Karn's algorithm does not apply
      even if that ACK (s)acks other retransmitted sequences, because it
      must have been generated by an original but perhaps out-of-order packet.
      There is no ambiguity. When multiple blocks are newly
      sacked because of ACK losses, the earliest block is used to
      measure RTT, similar to cumulative ACKs.

      Such RTT samples allow the sender to estimate the RTO during loss
      recovery and packet reordering events. They are still useful even with
      TCP timestamps. That's because during these events SND.UNA may
      not advance, preventing RTT samples from TS ECR (thus the FLAG_ACKED
      check before calling tcp_ack_update_rtt()).  Therefore this new
      RTT source is complementary to existing ACK and TS RTT mechanisms.

      This patch does not update the RTO. That is done in the next patch.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      59c9af42
    • tcp: prefer packet timing to TS-ECR for RTT · 5b08e47c
      Yuchung Cheng committed
      Prefer packet timings to TS-ECR for RTT measurements when both
      sources are available. That's because broken middle-boxes and remote
      peers can return packets with corrupted TS ECR fields. Similarly most
      congestion controls that require RTT signals favor timing-based
      sources as well. Also check for bad TS ECR values to avoid RTT
      blow-ups, which have happened on production Web servers.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5b08e47c