1. 29 Jan, 2015 3 commits
    • tcp: fix timing issue in CUBIC slope calculation · d6b1a8a9
      Neal Cardwell committed
      This patch fixes a bug in CUBIC that causes cwnd to increase slightly
      too slowly when multiple ACKs arrive in the same jiffy.
      
      If cwnd is supposed to increase at a rate of more than once per jiffy,
      then CUBIC was sometimes too slow. Because the bic_target is
      calculated for a future point in time, calculated with time in
      jiffies, the cwnd can increase over the course of the jiffy while the
      bic_target calculated as the proper CUBIC cwnd at time
      t=tcp_time_stamp+rtt does not increase, because tcp_time_stamp only
      increases on jiffy tick boundaries.
      
      So since the cnt is set to:
      	ca->cnt = cwnd / (bic_target - cwnd);
      as cwnd increases but bic_target does not increase due to jiffy
      granularity, the cnt becomes too large, causing cwnd to increase
      too slowly.
      
      For example:
      - suppose at the beginning of a jiffy, cwnd=40, bic_target=44
      - so CUBIC sets:
         ca->cnt =  cwnd / (bic_target - cwnd) = 40 / (44 - 40) = 40/4 = 10
      - suppose we get 10 acks, each for 1 segment, so tcp_cong_avoid_ai()
         increases cwnd to 41
      - so CUBIC sets:
         ca->cnt =  cwnd / (bic_target - cwnd) = 41 / (44 - 41) = 41 / 3 = 13
      
      So now CUBIC will wait for 13 packets to be ACKed before increasing
      cwnd to 42, instead of 10 as it should.
      
      The fix is to avoid adjusting the slope (determined by ca->cnt)
      multiple times within a jiffy, and instead skip to compute the Reno
      cwnd, the "TCP friendliness" code path.
      Reported-by: Eyal Perry <eyalpe@mellanox.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
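The arithmetic in the example above can be reproduced with a small user-space sketch. The helper name `cubic_cnt()` is hypothetical and is not a kernel function; it only mirrors the `ca->cnt = cwnd / (bic_target - cwnd)` expression quoted in the commit message, with the same integer division.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not a kernel function): number of ACKed packets
 * CUBIC waits for before raising cwnd by one, following the expression
 * quoted in the commit message. Assumes bic_target > cwnd.
 */
static uint32_t cubic_cnt(uint32_t cwnd, uint32_t bic_target)
{
	return cwnd / (bic_target - cwnd);
}
```

With a target frozen at 44 for the whole jiffy, `cubic_cnt(40, 44)` gives 10 and `cubic_cnt(41, 44)` gives 13, so the pacing of cwnd growth stretches out even though the intended slope has not changed.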
    • tcp: fix stretch ACK bugs in CUBIC · 9cd981dc
      Neal Cardwell committed
      Change CUBIC to properly handle stretch ACKs in additive increase mode
      by passing in the count of ACKed packets to tcp_cong_avoid_ai().
      
      In addition, because we are now precisely accounting for stretch ACKs,
      including delayed ACKs, we can now remove the delayed ACK tracking and
      estimation code that tracked recent delayed ACK behavior in
      ca->delayed_ack.
      Reported-by: Eyal Perry <eyalpe@mellanox.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
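A minimal user-space sketch of additive increase that credits every ACKed packet, assuming a simplified state struct in place of the kernel's socket structure. The names `cwnd_state` and `cong_avoid_ai_sketch()` are illustrative, and the logic is a simplification rather than the kernel's exact code.

```c
#include <stdint.h>

/* Simplified stand-in for the two socket fields used here (assumption). */
struct cwnd_state {
	uint32_t snd_cwnd;      /* congestion window, in packets */
	uint32_t snd_cwnd_cnt;  /* ACKed-packet credits accumulated so far */
};

/*
 * Sketch of additive increase that accounts for stretch ACKs: raise cwnd
 * by one for every w ACKed packets, crediting all `acked` packets at once.
 * Assumes w > 0.
 */
static void cong_avoid_ai_sketch(struct cwnd_state *s, uint32_t w,
				 uint32_t acked)
{
	s->snd_cwnd_cnt += acked;
	if (s->snd_cwnd_cnt >= w) {
		uint32_t delta = s->snd_cwnd_cnt / w;

		s->snd_cwnd_cnt -= delta * w;
		s->snd_cwnd += delta;
	}
}
```

Starting from cwnd=40 with w=10, a single stretch ACK covering 25 packets raises cwnd to 42 and leaves 5 credits, whereas counting that ACK as 1 packet would leave cwnd at 40.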
    • tcp: stretch ACK fixes prep · e73ebb08
      Neal Cardwell committed
      LRO, GRO, delayed ACKs, and middleboxes can cause "stretch ACKs" that
      cover more than the RFC-specified maximum of 2 packets. These stretch
      ACKs can cause serious performance shortfalls in common congestion
      control algorithms that were designed and tuned years ago with
      receiver hosts that were not using LRO or GRO, and were instead
      politely ACKing every other packet.
      
      This patch series fixes Reno and CUBIC to handle stretch ACKs.
      
      This patch prepares for the upcoming stretch ACK bug fix patches. It
      adds an "acked" parameter to tcp_cong_avoid_ai() to allow for future
      fixes to tcp_cong_avoid_ai() to correctly handle stretch ACKs, and
      changes all congestion control algorithms to pass in 1 for the ACKed
      count. It also changes tcp_slow_start() to return the number of packet
      ACK "credits" that were not processed in slow start mode, and can be
      processed by the congestion control module in additive increase mode.
      
      In future patches we will fix tcp_cong_avoid_ai() to handle stretch
      ACKs, and fix Reno and CUBIC handling of stretch ACKs in slow start
      and additive increase mode.
      Reported-by: Eyal Perry <eyalpe@mellanox.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 10 Dec, 2014 2 commits
    • tcp_cubic: refine Hystart delay threshold · 42eef7a0
      Eric Dumazet committed
      In commit 2b4636a5 ("tcp_cubic: make the delay threshold of HyStart
      less sensitive"), HYSTART_DELAY_MIN was changed to 4 ms.
      
      The remaining problem is that using delay_min + (delay_min/16) as the
      threshold is too sensitive.
      
      A 6.25 % variation is too small for RTTs above 60 ms, which are not
      uncommon.
      
      Let's use 12.5 % instead (delay_min + (delay_min/8)).
      
      Tested:
       80 ms RTT between peers, FQ/pacing packet scheduler on sender.
       10 bulk transfers of 10 seconds :
      
      nstat >/dev/null
      for i in `seq 1 10`
       do
         netperf -H remote -- -k THROUGHPUT | grep THROUGHPUT
       done
      nstat | grep Hystart
      
      With the 6.25 % threshold :
      
      THROUGHPUT=20.66
      THROUGHPUT=249.38
      THROUGHPUT=254.10
      THROUGHPUT=14.94
      THROUGHPUT=251.92
      THROUGHPUT=237.73
      THROUGHPUT=19.18
      THROUGHPUT=252.89
      THROUGHPUT=21.32
      THROUGHPUT=15.58
      TcpExtTCPHystartTrainDetect     2                  0.0
      TcpExtTCPHystartTrainCwnd       4756               0.0
      TcpExtTCPHystartDelayDetect     5                  0.0
      TcpExtTCPHystartDelayCwnd       180                0.0
      
      With the 12.5 % threshold
      THROUGHPUT=251.09
      THROUGHPUT=247.46
      THROUGHPUT=250.92
      THROUGHPUT=248.91
      THROUGHPUT=250.88
      THROUGHPUT=249.84
      THROUGHPUT=250.51
      THROUGHPUT=254.15
      THROUGHPUT=250.62
      THROUGHPUT=250.89
      TcpExtTCPHystartTrainDetect     1                  0.0
      TcpExtTCPHystartTrainCwnd       3175               0.0
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Tested-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
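The effect of the two thresholds can be illustrated with a small sketch. The function name and the microsecond units are assumptions made for this example, and any clamping the kernel applies to the threshold is deliberately left out.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: decide whether a delay-based exit from slow start
 * would trigger, given the minimum RTT seen so far and the current RTT
 * sample (both in microseconds). `shift` selects the tolerated variation:
 * 4 gives delay_min/16 (6.25 %), 3 gives delay_min/8 (12.5 %).
 */
static bool hystart_delay_exceeded(uint32_t delay_min_us,
				   uint32_t curr_rtt_us, unsigned int shift)
{
	uint32_t thresh_us = delay_min_us >> shift;

	return curr_rtt_us > delay_min_us + thresh_us;
}
```

For the 80 ms path used in the test, the 6.25 % setting tolerates only 5 ms of extra delay and the 12.5 % setting tolerates 10 ms, so an 87 ms sample ends slow start under the first setting but not under the second.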
    • tcp_cubic: add SNMP counters to track how effective is Hystart · 6e3a8a93
      Eric Dumazet committed
      When deploying FQ pacing, one thing we noticed is that CUBIC Hystart
      triggers too soon.
      
      Having SNMP counters that show how often the various Hystart
      methods trigger is useful prior to any modifications.
      
      This patch adds SNMP counters tracking how many times the "ack train"
      or "Delay" based Hystart triggers, and the cumulative sum of cwnd at
      the time Hystart decided to end SS (Slow Start).
      
      myhost:~# nstat -a | grep Hystart
      TcpExtTCPHystartTrainDetect     9                  0.0
      TcpExtTCPHystartTrainCwnd       20650              0.0
      TcpExtTCPHystartDelayDetect     10                 0.0
      TcpExtTCPHystartDelayCwnd       360                0.0
      
      ->
       Train detection was triggered 9 times, and average cwnd was
       20650/9=2294,
       Delay detection was triggered 10 times and average cwnd was 36
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
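The averages quoted above follow from dividing each cumulative cwnd counter by its matching detection counter. A minimal sketch, with a hypothetical helper name:

```c
#include <stdint.h>

/*
 * Hypothetical helper: average cwnd at the moment Hystart ended slow start,
 * from a *Cwnd cumulative counter and the matching *Detect event counter.
 * Returns 0 when no event was recorded, to avoid dividing by zero.
 */
static uint64_t hystart_avg_cwnd(uint64_t cwnd_sum, uint64_t detect_count)
{
	return detect_count ? cwnd_sum / detect_count : 0;
}
```

Feeding in the counters from the `nstat` output gives 2294 for train detection (20650 over 9 events) and 36 for delay detection (360 over 10 events).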
  3. 02 Sep, 2014 1 commit
  4. 04 May, 2014 1 commit
  5. 01 May, 2014 1 commit
    • tcp_cubic: fix the range of delayed_ack · 0cda345d
      Liu Yu committed
      Commit b9f47a3a (tcp_cubic: limit delayed_ack ratio to prevent
      divide error) tries to prevent a divide error, but there is still a
      small chance that delayed_ack can reach zero. If the param cnt gets a
      negative value, then ratio+cnt would overflow and may happen to be zero.
      As a result, min(ratio, ACK_RATIO_LIMIT) will evaluate to zero.
      
      In some old kernels, such as 2.6.32, there is a bug that would
      pass a negative param, which then ultimately leads to this divide error.
      
      Commit 5b35e1e6 (tcp: fix tcp_trim_head() to adjust segment count
      with skb MSS) fixed the negative param issue. However,
      it is safer to fix the range of delayed_ack as well,
      to make sure we do not hit a divide by zero.
      
      CC: Stephen Hemminger <shemminger@vyatta.com>
      Signed-off-by: Liu Yu <allanyuliu@tencent.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
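A sketch of how bounding the ratio on both sides avoids a zero divisor. The constant values and the moving-average form are assumptions modeled on the description above, and `update_delayed_ack()` is an illustrative name rather than the kernel's code.

```c
#include <stdint.h>

/* Assumed values, modeled on the constants the commit message refers to. */
#define ACK_RATIO_SHIFT 4
#define ACK_RATIO_LIMIT (32u << ACK_RATIO_SHIFT)

/*
 * Sketch of a moving-average update of delayed_ack, with the result
 * bounded on both sides so it can never become 0 and later be used as
 * a divisor.
 */
static uint32_t update_delayed_ack(uint32_t delayed_ack, uint32_t cnt)
{
	uint32_t ratio = delayed_ack;

	ratio -= delayed_ack >> ACK_RATIO_SHIFT;
	ratio += cnt;	/* wraps around if cnt carries a negative value */

	if (ratio < 1u)
		ratio = 1u;
	if (ratio > ACK_RATIO_LIMIT)
		ratio = ACK_RATIO_LIMIT;
	return ratio;
}
```

With `delayed_ack` at 32 and a `cnt` that carries -30, the unsigned sum wraps to exactly 0. An upper bound alone would keep that 0, while the two-sided bound returns 1.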
  6. 27 Feb, 2014 1 commit
    • tcp: switch rtt estimations to usec resolution · 740b0f18
      Eric Dumazet committed
      Upcoming congestion controls for TCP require usec resolution for RTT
      estimations. Millisecond resolution is simply not enough these days.
      
      FQ/pacing in DC environments also require this change for finer control
      and removal of bimodal behavior due to the current hack in
      tcp_update_pacing_rate() for 'small rtt'
      
      TCP_CONG_RTT_STAMP is no longer needed.
      
      As Julian Anastasov pointed out, we need to keep user compatibility :
      tcp_metrics used to export RTT and RTTVAR in msec resolution,
      so we added RTT_US and RTTVAR_US. An iproute2 patch is needed
      to use the new attributes if provided by the kernel.
      
      In this example ss command displays a srtt of 32 usecs (10Gbit link)
      
      lpk51:~# ./ss -i dst lpk52
      Netid  State      Recv-Q Send-Q   Local Address:Port       Peer Address:Port
      tcp    ESTAB      0      1         10.246.11.51:42959      10.246.11.52:64614
               cubic wscale:6,6 rto:201 rtt:0.032/0.001 ato:40 mss:1448 cwnd:10 send 3620.0Mbps pacing_rate 7240.0Mbps unacked:1 rcv_rtt:993 rcv_space:29559
      
      Updated iproute2 ip command displays :
      
      lpk51:~# ./ip tcp_metrics | grep 10.246.11.52
      10.246.11.52 age 561.914sec cwnd 10 rtt 274us rttvar 213us source 10.246.11.51
      
      Old binary displays :
      
      lpk51:~# ip tcp_metrics | grep 10.246.11.52
      10.246.11.52 age 561.914sec cwnd 10 rtt 250us rttvar 125us source 10.246.11.51
      
      With help from Julian Anastasov, Stephen Hemminger and Yuchung Cheng
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Larry Brakmo <brakmo@google.com>
      Cc: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  7. 05 Nov, 2013 1 commit
    • tcp: properly handle stretch acks in slow start · 9f9843a7
      Yuchung Cheng committed
      Slow start currently increases cwnd by 1 if an ACK acknowledges some
      packets, regardless of the number of packets. Consequently slow start
      performance is highly dependent on the degree of the stretch ACKs
      caused by receiver or network ACK compression mechanisms (e.g.,
      delayed-ACK, GRO, etc). But the slow start algorithm is to send twice
      the amount of packets left, so it should process a stretch ACK of
      degree N as if it were N ACKs of degree 1, then exit when cwnd exceeds
      ssthresh. A follow-up patch will use the remainder of the N (if
      greater than 1) to adjust cwnd in the congestion avoidance phase.
      
      In addition this patch retires the experimental limited slow start
      (LSS) feature. LSS has multiple drawbacks but questionable benefit. The
      fractional cwnd increase in LSS requires a loop in slow start even
      though it's rarely used. Configuring such an increase step via a global
      sysctl on different BDPS seems hard. Finally and most importantly the
      slow start overshoot concern is now better covered by the Hybrid slow
      start (hystart) enabled by default.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
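The behavior described above, where a stretch ACK of degree N is handled like N single ACKs and the unused remainder is handed on, can be sketched as follows. The function name and signature are illustrative assumptions, not the kernel's interface.

```c
#include <stdint.h>

/*
 * Sketch of slow start that treats one stretch ACK covering `acked` packets
 * like `acked` separate ACKs: cwnd grows by one per ACKed packet until it
 * passes ssthresh. Returns the ACKed packets left over for congestion
 * avoidance. Assumes *cwnd <= ssthresh on entry.
 */
static uint32_t slow_start_sketch(uint32_t *cwnd, uint32_t ssthresh,
				  uint32_t acked)
{
	uint32_t target = *cwnd + acked;
	uint32_t used;

	if (target > ssthresh)
		target = ssthresh + 1;	/* exit once cwnd exceeds ssthresh */

	used = target - *cwnd;
	*cwnd = target;
	return acked - used;
}
```

With cwnd=18 and ssthresh=20, an ACK covering 8 packets moves cwnd to 21 and returns 5 leftover packets for the congestion avoidance phase.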
  8. 08 Aug, 2013 2 commits
    • tcp: cubic: fix bug in bictcp_acked() · cd6b423a
      Eric Dumazet committed
      While investigating a strange increase of retransmit rates
      on hosts ~24 days after boot, Van found hystart was disabled
      if ca->epoch_start was 0, as the following condition is true
      when the tcp_time_stamp high order bit is set.
      
      (s32)(tcp_time_stamp - ca->epoch_start) < HZ
      
      Quoting Van :
      
       At initialization & after every loss ca->epoch_start is set to zero so
       I believe that the above line will turn off hystart as soon as the 2^31
       bit is set in tcp_time_stamp & hystart will stay off for 24 days.
       I think we've observed that cubic's restart is too aggressive without
       hystart so this might account for the higher drop rate we observe.
      Diagnosed-by: Van Jacobson <vanj@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
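A user-space sketch of why the quoted condition misfires when `epoch_start` is 0. Both helper names are hypothetical, and the guarded variant is one plausible way to avoid the problem, shown only as an assumption about the shape of a fix.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helpers comparing the check quoted in the commit message with
 * a variant that first tests whether an epoch has started. `now` and
 * `epoch_start` are 32-bit jiffies values; `hz` is ticks per second.
 * The unsigned-to-signed cast relies on two's-complement wraparound, as
 * mainstream compilers provide.
 */
static bool within_first_second_old(uint32_t now, uint32_t epoch_start,
				    int32_t hz)
{
	return (int32_t)(now - epoch_start) < hz;
}

static bool within_first_second_guarded(uint32_t now, uint32_t epoch_start,
					int32_t hz)
{
	return epoch_start != 0 && (int32_t)(now - epoch_start) < hz;
}
```

With `now = 0x80000000` and `epoch_start = 0`, the signed difference is negative, so the unguarded check reports "less than one second since the epoch" and keeps doing so for as long as the high bit stays set.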
    • tcp: cubic: fix overflow error in bictcp_update() · 2ed0edf9
      Eric Dumazet committed
      Commit 17a6e9f1 ("tcp_cubic: fix clock dependency") added an
      overflow error in bictcp_update() in the following code:
      
      /* change the unit from HZ to bictcp_HZ */
      t = ((tcp_time_stamp + msecs_to_jiffies(ca->delay_min>>3) -
            ca->epoch_start) << BICTCP_HZ) / HZ;
      
      Because msecs_to_jiffies() returns unsigned long, the compiler does
      implicit type promotion.
      
      We really want to constrain (tcp_time_stamp - ca->epoch_start)
      to a signed 32bit value, or else 't' has unexpected high values.
      
      This bug triggers an increase of retransmit rates ~24 days after
      boot [1], as the high order bit of tcp_time_stamp flips.
      
      [1] for hosts with HZ=1000
      
      Big thanks to Van Jacobson for spotting this problem.
      Diagnosed-by: Van Jacobson <vanj@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
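The "~24 days" figure for HZ=1000 follows from the time a 32-bit tick counter needs to reach 2^31. A minimal sketch, assuming the counter starts at zero at boot and using a hypothetical helper name:

```c
#include <stdint.h>

/*
 * Hypothetical helper: days of uptime until bit 31 of a 32-bit jiffies
 * counter is first set, for a given tick rate `hz` (ticks per second).
 * Assumes the counter starts at zero at boot.
 */
static double days_until_high_bit(uint32_t hz)
{
	const double ticks = 2147483648.0;	/* 2^31 */
	const double seconds_per_day = 86400.0;

	return ticks / (double)hz / seconds_per_day;
}
```

At 1000 ticks per second the result is a little under 25 days, which matches the footnote. A lower tick rate pushes the flip proportionally later.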
  9. 21 Jan, 2012 1 commit
    • tcp: fix undo after RTO for CUBIC · 5a45f008
      Neal Cardwell committed
      This patch fixes CUBIC so that cwnd reductions made during RTOs can be
      undone (just as they already can be undone when using the default/Reno
      behavior).
      
      When undoing cwnd reductions, BIC-derived congestion control modules
      were restoring the cwnd from last_max_cwnd. There were two problems
      with using last_max_cwnd to restore a cwnd during undo:
      
      (a) last_max_cwnd was set to 0 on state transitions into TCP_CA_Loss
      (by calling the module's reset() functions), so cwnd reductions from
      RTOs could not be undone.
      
      (b) when fast_convergence is enabled (which it is by default)
      last_max_cwnd does not actually hold the value of snd_cwnd before the
      loss; instead, it holds a scaled-down version of snd_cwnd.
      
      This patch makes the following changes:
      
      (1) upon undo, revert snd_cwnd to ca->loss_cwnd, which is already, as
      the existing comment notes, the "congestion window at last loss"
      
      (2) stop forgetting ca->loss_cwnd on TCP_CA_Loss events
      
      (3) use ca->last_max_cwnd to check if we're in slow start
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Stephen Hemminger <shemminger@vyatta.com>
      Acked-by: Sangtae Ha <sangtae.ha@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
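Problem (b) and change (1) can be illustrated with a small sketch. The struct, the function names, and the scaling constants are assumptions modeled on CUBIC's usual defaults, not a copy of the kernel code.

```c
#include <stdint.h>

/* Assumed constants, modeled on CUBIC's default multiplicative decrease. */
#define BETA_SCALE 1024u
#define BETA       717u

/* Simplified stand-in for the per-connection CUBIC state (assumption). */
struct cubic_state {
	uint32_t last_max_cwnd;  /* may be scaled down by fast convergence */
	uint32_t loss_cwnd;      /* congestion window at last loss */
};

/* Record state on a loss event, with fast convergence enabled. */
static void on_loss(struct cubic_state *ca, uint32_t snd_cwnd)
{
	if (snd_cwnd < ca->last_max_cwnd)
		ca->last_max_cwnd = snd_cwnd * (BETA_SCALE + BETA) /
				    (2 * BETA_SCALE);
	else
		ca->last_max_cwnd = snd_cwnd;
	ca->loss_cwnd = snd_cwnd;
}

/* Undo: restore at least the window that was in use at the last loss. */
static uint32_t undo_cwnd(const struct cubic_state *ca, uint32_t snd_cwnd)
{
	return snd_cwnd > ca->loss_cwnd ? snd_cwnd : ca->loss_cwnd;
}
```

After a loss at cwnd=100 with a previous maximum of 120, `last_max_cwnd` becomes 85 under these assumed constants while `loss_cwnd` keeps the true pre-loss value of 100, so an undo based on `loss_cwnd` restores the full window.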
  10. 09 May, 2011 1 commit
  11. 15 Mar, 2011 6 commits
  12. 10 Mar, 2011 1 commit
  13. 02 Mar, 2009 1 commit
  14. 02 Nov, 2008 1 commit
  15. 01 May, 2008 1 commit
    • rename div64_64 to div64_u64 · 6f6d6a1a
      Roman Zippel committed
      Rename div64_64 to div64_u64 to make it consistent with the other divide
      functions, so it clearly includes the type of the divide.  Move its definition
      to math64.h as currently no architecture overrides the generic implementation.
       They can still override it of course, but the duplicated declarations are
      avoided.
      Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
      Cc: Avi Kivity <avi@qumranet.com>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Patrick McHardy <kaber@trash.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  16. 05 Mar, 2008 1 commit
  17. 29 Jan, 2008 1 commit
  18. 11 Oct, 2007 1 commit
  19. 31 Jul, 2007 2 commits
  20. 18 Jul, 2007 1 commit
  21. 13 Jun, 2007 1 commit
  22. 26 Apr, 2007 5 commits
  23. 13 Feb, 2007 1 commit
  24. 11 Feb, 2007 1 commit
  25. 26 Oct, 2006 1 commit
  26. 23 Sep, 2006 1 commit