1. 18 5月, 2017 2 次提交
  2. 17 5月, 2017 1 次提交
    • E
      tcp: internal implementation for pacing · 218af599
      Eric Dumazet 提交于
      BBR congestion control depends on pacing, and pacing is
      currently handled by sch_fq packet scheduler for performance reasons,
      and also because implemening pacing with FQ was convenient to truly
      avoid bursts.
      
      However there are many cases where this packet scheduler constraint
      is not practical.
      - Many linux hosts are not focusing on handling thousands of TCP
        flows in the most efficient way.
      - Some routers use fq_codel or other AQM, but still would like
        to use BBR for the few TCP flows they initiate/terminate.
      
      This patch implements an automatic fallback to internal pacing.
      
      Pacing is requested either by BBR or use of SO_MAX_PACING_RATE option.
      
      If sch_fq happens to be in the egress path, pacing is delegated to
      the qdisc, otherwise pacing is done by TCP itself.
      
      One advantage of pacing from TCP stack is to get more precise rtt
      estimations, and less work done from TX completion, since TCP Small
      queue limits are not generally hit. Setups with single TX queue but
      many cpus might even benefit from this.
      
      Note that unlike sch_fq, we do not take into account header sizes.
      Taking care of these headers would add additional complexity for
      no practical differences in behavior.
      
      Some performance numbers using 800 TCP_STREAM flows rate limited to
      ~48 Mbit per second on 40Gbit NIC.
      
      If MQ+pfifo_fast is used on the NIC :
      
      $ sar -n DEV 1 5 | grep eth
      14:48:44         eth0 725743.00 2932134.00  46776.76 4335184.68      0.00      0.00      1.00
      14:48:45         eth0 725349.00 2932112.00  46751.86 4335158.90      0.00      0.00      0.00
      14:48:46         eth0 725101.00 2931153.00  46735.07 4333748.63      0.00      0.00      0.00
      14:48:47         eth0 725099.00 2931161.00  46735.11 4333760.44      0.00      0.00      1.00
      14:48:48         eth0 725160.00 2931731.00  46738.88 4334606.07      0.00      0.00      0.00
      Average:         eth0 725290.40 2931658.20  46747.54 4334491.74      0.00      0.00      0.40
      $ vmstat 1 5
      procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
       r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
       4  0      0 259825920  45644 2708324    0    0    21     2  247   98  0  0 100  0  0
       4  0      0 259823744  45644 2708356    0    0     0     0 2400825 159843  0 19 81  0  0
       0  0      0 259824208  45644 2708072    0    0     0     0 2407351 159929  0 19 81  0  0
       1  0      0 259824592  45644 2708128    0    0     0     0 2405183 160386  0 19 80  0  0
       1  0      0 259824272  45644 2707868    0    0     0    32 2396361 158037  0 19 81  0  0
      
      Now use MQ+FQ :
      
      lpaa23:~# echo fq >/proc/sys/net/core/default_qdisc
      lpaa23:~# tc qdisc replace dev eth0 root mq
      
      $ sar -n DEV 1 5 | grep eth
      14:49:57         eth0 678614.00 2727930.00  43739.13 4033279.14      0.00      0.00      0.00
      14:49:58         eth0 677620.00 2723971.00  43674.69 4027429.62      0.00      0.00      1.00
      14:49:59         eth0 676396.00 2719050.00  43596.83 4020125.02      0.00      0.00      0.00
      14:50:00         eth0 675197.00 2714173.00  43518.62 4012938.90      0.00      0.00      1.00
      14:50:01         eth0 676388.00 2719063.00  43595.47 4020171.64      0.00      0.00      0.00
      Average:         eth0 676843.00 2720837.40  43624.95 4022788.86      0.00      0.00      0.40
      $ vmstat 1 5
      procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
       r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
       2  0      0 259832240  46008 2710912    0    0    21     2  223  192  0  1 99  0  0
       1  0      0 259832896  46008 2710744    0    0     0     0 1702206 198078  0 17 82  0  0
       0  0      0 259830272  46008 2710596    0    0     0     0 1696340 197756  1 17 83  0  0
       4  0      0 259829168  46024 2710584    0    0    16     0 1688472 197158  1 17 82  0  0
       3  0      0 259830224  46024 2710408    0    0     0     0 1692450 197212  0 18 82  0  0
      
      As expected, number of interrupts per second is very different.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Van Jacobson <vanj@google.com>
      Cc: Jerry Chu <hkchu@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      218af599
  3. 30 10月, 2016 1 次提交
  4. 21 9月, 2016 1 次提交
    • N
      tcp_bbr: add BBR congestion control · 0f8782ea
      Neal Cardwell 提交于
      This commit implements a new TCP congestion control algorithm: BBR
      (Bottleneck Bandwidth and RTT). A detailed description of BBR will be
      published in ACM Queue, Vol. 14 No. 5, September-October 2016, as
      "BBR: Congestion-Based Congestion Control".
      
      BBR has significantly increased throughput and reduced latency for
      connections on Google's internal backbone networks and google.com and
      YouTube Web servers.
      
      BBR requires only changes on the sender side, not in the network or
      the receiver side. Thus it can be incrementally deployed on today's
      Internet, or in datacenters.
      
      The Internet has predominantly used loss-based congestion control
      (largely Reno or CUBIC) since the 1980s, relying on packet loss as the
      signal to slow down. While this worked well for many years, loss-based
      congestion control is unfortunately out-dated in today's networks. On
      today's Internet, loss-based congestion control causes the infamous
      bufferbloat problem, often causing seconds of needless queuing delay,
      since it fills the bloated buffers in many last-mile links. On today's
      high-speed long-haul links using commodity switches with shallow
      buffers, loss-based congestion control has abysmal throughput because
      it over-reacts to losses caused by transient traffic bursts.
      
      In 1981 Kleinrock and Gale showed that the optimal operating point for
      a network maximizes delivered bandwidth while minimizing delay and
      loss, not only for single connections but for the network as a
      whole. Finding that optimal operating point has been elusive, since
      any single network measurement is ambiguous: network measurements are
      the result of both bandwidth and propagation delay, and those two
      cannot be measured simultaneously.
      
      While it is impossible to disambiguate any single bandwidth or RTT
      measurement, a connection's behavior over time tells a clearer
      story. BBR uses a measurement strategy designed to resolve this
      ambiguity. It combines these measurements with a robust servo loop
      using recent control systems advances to implement a distributed
      congestion control algorithm that reacts to actual congestion, not
      packet loss or transient queue delay, and is designed to converge with
      high probability to a point near the optimal operating point.
      
      In a nutshell, BBR creates an explicit model of the network pipe by
      sequentially probing the bottleneck bandwidth and RTT. On the arrival
      of each ACK, BBR derives the current delivery rate of the last round
      trip, and feeds it through a windowed max-filter to estimate the
      bottleneck bandwidth. Conversely it uses a windowed min-filter to
      estimate the round trip propagation delay. The max-filtered bandwidth
      and min-filtered RTT estimates form BBR's model of the network pipe.
      
      Using its model, BBR sets control parameters to govern sending
      behavior. The primary control is the pacing rate: BBR applies a gain
      multiplier to transmit faster or slower than the observed bottleneck
      bandwidth. The conventional congestion window (cwnd) is now the
      secondary control; the cwnd is set to a small multiple of the
      estimated BDP (bandwidth-delay product) in order to allow full
      utilization and bandwidth probing while bounding the potential amount
      of queue at the bottleneck.
      
      When a BBR connection starts, it enters STARTUP mode and applies a
      high gain to perform an exponential search to quickly probe the
      bottleneck bandwidth (doubling its sending rate each round trip, like
      slow start). However, instead of continuing until it fills up the
      buffer (i.e. a loss), or until delay or ACK spacing reaches some
      threshold (like Hystart), it uses its model of the pipe to estimate
      when that pipe is full: it estimates the pipe is full when it notices
      the estimated bandwidth has stopped growing. At that point it exits
      STARTUP and enters DRAIN mode, where it reduces its pacing rate to
      drain the queue it estimates it has created.
      
      Then BBR enters steady state. In steady state, PROBE_BW mode cycles
      between first pacing faster to probe for more bandwidth, then pacing
      slower to drain any queue that created if no more bandwidth was
      available, and then cruising at the estimated bandwidth to utilize the
      pipe without creating excess queue. Occasionally, on an as-needed
      basis, it sends significantly slower to probe for RTT (PROBE_RTT
      mode).
      
      BBR has been fully deployed on Google's wide-area backbone networks
      and we're experimenting with BBR on Google.com and YouTube on a global
      scale.  Replacing CUBIC with BBR has resulted in significant
      improvements in network latency and application (RPC, browser, and
      video) metrics. For more details please refer to our upcoming ACM
      Queue publication.
      
      Example performance results, to illustrate the difference between BBR
      and CUBIC:
      
      Resilience to random loss (e.g. from shallow buffers):
        Consider a netperf TCP_STREAM test lasting 30 secs on an emulated
        path with a 10Gbps bottleneck, 100ms RTT, and 1% packet loss
        rate. CUBIC gets 3.27 Mbps, and BBR gets 9150 Mbps (2798x higher).
      
      Low latency with the bloated buffers common in today's last-mile links:
        Consider a netperf TCP_STREAM test lasting 120 secs on an emulated
        path with a 10Mbps bottleneck, 40ms RTT, and 1000-packet bottleneck
        buffer. Both fully utilize the bottleneck bandwidth, but BBR
        achieves this with a median RTT 25x lower (43 ms instead of 1.09
        secs).
      
      Our long-term goal is to improve the congestion control algorithms
      used on the Internet. We are hopeful that BBR can help advance the
      efforts toward this goal, and motivate the community to do further
      research.
      
      Test results, performance evaluations, feedback, and BBR-related
      discussions are very welcome in the public e-mail list for BBR:
      
        https://groups.google.com/forum/#!forum/bbr-dev
      
      NOTE: BBR *must* be used with the fq qdisc ("man tc-fq") with pacing
      enabled, since pacing is integral to the BBR design and
      implementation. BBR without pacing would not function properly, and
      may incur unnecessary high packet loss rates.
      Signed-off-by: NVan Jacobson <vanj@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNandita Dukkipati <nanditad@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0f8782ea