1. 22 May 2017, 1 commit
  2. 18 May 2017, 5 commits
  3. 17 May 2017, 1 commit
    • tcp: internal implementation for pacing · 218af599
      Authored by Eric Dumazet
      BBR congestion control depends on pacing, and pacing is
      currently handled by the sch_fq packet scheduler, for
      performance reasons and also because implementing pacing with
      FQ was a convenient way to truly avoid bursts.
      
      However there are many cases where this packet scheduler constraint
      is not practical.
      - Many Linux hosts are not focused on handling thousands of TCP
        flows in the most efficient way.
      - Some routers use fq_codel or other AQM, but still would like
        to use BBR for the few TCP flows they initiate/terminate.
      
      This patch implements an automatic fallback to internal pacing.
      
      Pacing is requested either by BBR or by use of the
      SO_MAX_PACING_RATE socket option.
      
      If sch_fq happens to be in the egress path, pacing is delegated
      to the qdisc; otherwise pacing is done by TCP itself.
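      A minimal C sketch of the fallback, assuming the pacing_timer
      hrtimer that this patch adds to struct tcp_sock (simplified, not
      the verbatim kernel code): compute how long the skb occupies the
      wire at the socket's pacing rate, then defer further transmits
      by that long.

      static void tcp_internal_pacing(struct sock *sk, const struct sk_buff *skb)
      {
              u64 len_ns;
              u32 rate = sk->sk_pacing_rate;

              /* No pacing requested, or rate left at "unlimited". */
              if (!rate || rate == ~0U)
                      return;

              /* Nanoseconds this packet "occupies" at the pacing rate. */
              len_ns = (u64)skb->len * NSEC_PER_SEC;
              do_div(len_ns, rate);

              hrtimer_start(&tcp_sk(sk)->pacing_timer,
                            ktime_add_ns(ktime_get(), len_ns),
                            HRTIMER_MODE_ABS_PINNED);
      }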
      
      One advantage of pacing from the TCP stack is more precise RTT
      estimations and less work done at TX completion time, since TCP
      Small Queues limits are generally not hit. Setups with a single
      TX queue but many CPUs might even benefit from this.
      
      Note that unlike sch_fq, we do not take header sizes into
      account. Taking care of these headers would add additional
      complexity for no practical difference in behavior.
      
      Some performance numbers, using 800 TCP_STREAM flows rate-limited
      to ~48 Mbit per second each, on a 40Gbit NIC.
      
      If MQ+pfifo_fast is used on the NIC:
      
      $ sar -n DEV 1 5 | grep eth
      14:48:44         eth0 725743.00 2932134.00  46776.76 4335184.68      0.00      0.00      1.00
      14:48:45         eth0 725349.00 2932112.00  46751.86 4335158.90      0.00      0.00      0.00
      14:48:46         eth0 725101.00 2931153.00  46735.07 4333748.63      0.00      0.00      0.00
      14:48:47         eth0 725099.00 2931161.00  46735.11 4333760.44      0.00      0.00      1.00
      14:48:48         eth0 725160.00 2931731.00  46738.88 4334606.07      0.00      0.00      0.00
      Average:         eth0 725290.40 2931658.20  46747.54 4334491.74      0.00      0.00      0.40
      $ vmstat 1 5
      procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
       r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
       4  0      0 259825920  45644 2708324    0    0    21     2  247   98  0  0 100  0  0
       4  0      0 259823744  45644 2708356    0    0     0     0 2400825 159843  0 19 81  0  0
       0  0      0 259824208  45644 2708072    0    0     0     0 2407351 159929  0 19 81  0  0
       1  0      0 259824592  45644 2708128    0    0     0     0 2405183 160386  0 19 80  0  0
       1  0      0 259824272  45644 2707868    0    0     0    32 2396361 158037  0 19 81  0  0
      
      Now use MQ+FQ:
      
      lpaa23:~# echo fq >/proc/sys/net/core/default_qdisc
      lpaa23:~# tc qdisc replace dev eth0 root mq
      
      $ sar -n DEV 1 5 | grep eth
      14:49:57         eth0 678614.00 2727930.00  43739.13 4033279.14      0.00      0.00      0.00
      14:49:58         eth0 677620.00 2723971.00  43674.69 4027429.62      0.00      0.00      1.00
      14:49:59         eth0 676396.00 2719050.00  43596.83 4020125.02      0.00      0.00      0.00
      14:50:00         eth0 675197.00 2714173.00  43518.62 4012938.90      0.00      0.00      1.00
      14:50:01         eth0 676388.00 2719063.00  43595.47 4020171.64      0.00      0.00      0.00
      Average:         eth0 676843.00 2720837.40  43624.95 4022788.86      0.00      0.00      0.40
      $ vmstat 1 5
      procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
       r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
       2  0      0 259832240  46008 2710912    0    0    21     2  223  192  0  1 99  0  0
       1  0      0 259832896  46008 2710744    0    0     0     0 1702206 198078  0 17 82  0  0
       0  0      0 259830272  46008 2710596    0    0     0     0 1696340 197756  1 17 83  0  0
       4  0      0 259829168  46024 2710584    0    0    16     0 1688472 197158  1 17 82  0  0
       3  0      0 259830224  46024 2710408    0    0     0     0 1692450 197212  0 18 82  0  0
      
      As expected, the number of interrupts per second is very different.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Van Jacobson <vanj@google.com>
      Cc: Jerry Chu <hkchu@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 25 Apr 2017, 1 commit
  5. 08 Mar 2017, 1 commit
  6. 14 Jan 2017, 2 commits
    • tcp: remove early retransmit · bec41a11
      Authored by Yuchung Cheng
      This patch removes support for RFC5827 early retransmit (i.e.,
      fast recovery on small inflight with <3 dupacks), because it is
      subsumed by the new RACK loss detection. More specifically, when
      RACK receives DUPACKs, it arms a reordering timer to start fast
      recovery after a quarter of the (min)RTT, so it covers early
      retransmit, except that RACK does not limit itself to specific
      inflight or dupack counts.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: add reordering timer in RACK loss detection · 57dde7f7
      Authored by Yuchung Cheng
      This patch makes RACK install a reordering timer when it suspects
      some packets might be lost, but wants to delay the decision
      a little bit to accommodate reordering.
      
      It does not create a new timer but instead repurposes the existing
      RTO timer, because both are meant to retransmit packets.
      Specifically, it arms an ICSK_TIME_REO_TIMEOUT timer when
      the RACK timing check fails. The wait time is set to
      
        RACK.RTT + RACK.reo_wnd - (NOW - Packet.xmit_time) + fudge
      
      This translates to expecting that a packet (Packet) should take
      (RACK.RTT + RACK.reo_wnd + fudge) to be delivered after it was sent.
      
      When there are multiple packets that need a timer, we use one timer
      with the maximum timeout. Therefore the timer conservatively uses
      the maximum window to expire N packets by one timeout, instead of
      N timeouts to expire N packets sent at different times.
      
      The fudge factor is 2 jiffies, to ensure that when the timer
      fires, all the suspected packets will have exceeded the deadline
      and be marked lost by tcp_rack_detect_loss(). It has to be at
      least 1 jiffy because the clock may tick between calling
      icsk_reset_xmit_timer(timeout) and the timer actually being
      armed. The second jiffy lower-bounds the timeout to 2 jiffies
      when reo_wnd is < 1ms.
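      Expressed as code, the per-packet check and timer arming might
      look like the following sketch (illustrative variable names,
      times in microseconds; tcp_rack_mark_lost() is a hypothetical
      helper standing in for the real marking logic):

      /* Deadline check for one suspect packet. */
      s64 remaining = rack_rtt_us + reo_wnd_us - (now_us - xmit_time_us);

      if (remaining <= 0) {
              tcp_rack_mark_lost(sk, skb);    /* hypothetical helper */
      } else {
              /* Keep the maximum wait across all suspect packets, then
               * arm one reordering timer with the 2-jiffy fudge. */
              timeout = max_t(u32, timeout, usecs_to_jiffies(remaining) + 2);
              inet_csk_reset_xmit_timer(sk, ICSK_TIME_REO_TIMEOUT,
                                        timeout, TCP_RTO_MAX);
      }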
      
      When the reordering timer fires (tcp_rack_reo_timeout), if we
      aren't in Recovery we enter fast recovery and force a fast
      retransmit. This is very similar to early retransmit (RFC5827),
      except that RACK is not constrained to entering recovery only
      for small outstanding flights.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  7. 10 Jan 2017, 1 commit
  8. 06 Dec 2016, 1 commit
  9. 28 Sep 2016, 1 commit
    • tcp: Change txhash on every SYN and RTO retransmit · 3acf3ec3
      Authored by Lawrence Brakmo
      The current code changes txhash (flow labels) on every
      retransmitted SYN/ACK, but only after the 2nd retransmitted SYN
      and only after tcp_retries1 RTO retransmits.
      
      With this patch:
      1) txhash is changed on every SYN retransmit;
      2) txhash is changed on every RTO retransmit.
      
      The result is that we can start re-routing around failed (or
      very congested) paths as soon as possible. Otherwise,
      application health checks may fail and the connection may be
      terminated before we start to change the txhash.
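      The mechanism itself is tiny: the SYN and RTO retransmit paths
      simply ask the socket for a fresh txhash. A sketch of the helper
      this relies on (sk_rethink_txhash() in include/net/sock.h,
      simplified):

      static inline void sk_rethink_txhash(struct sock *sk)
      {
              /* Reshuffle only if a txhash was set in the first place;
               * a new random hash maps the flow onto a (hopefully)
               * different ECMP path / IPv6 flow label. */
              if (sk->sk_txhash)
                      sk_set_txhash(sk);
      }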
      
      v4: Removed sysctl, txhash is changed for all RTOs
      v3: Removed text saying default value of sysctl is 0 (it is 100)
      v2: Added sysctl documentation and cleaned code
      
      Tested with packetdrill tests
      Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  10. 22 Sep 2016, 1 commit
  11. 16 Jul 2016, 1 commit
  12. 03 May 2016, 1 commit
  13. 28 Apr 2016, 1 commit
  14. 25 Apr 2016, 1 commit
    • tcp-tso: do not split TSO packets at retransmit time · 10d3be56
      Authored by Eric Dumazet
      The Linux TCP stack painfully segments all TSO/GSO packets
      before retransmitting them.
      
      This was fine back in the days when TSO/GSO were emerging, with their
      bugs, but we believe the dark age is over.
      
      Keeping big packets in write queues, and during stack traversal,
      has a lot of benefits:
       - Less memory overhead, because write queues contain fewer skbs.
       - Less CPU overhead at ACK processing.
       - Better SACK processing, as a lot of studies mentioned how
         awful Linux was at this ;)
       - Less CPU overhead to send the rtx packets
         (IP stack traversal, netfilter traversal, drivers...)
       - Better latencies in presence of losses.
       - Smaller spikes in fq-like packet schedulers, as retransmits
         are not constrained by TCP Small Queues.
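      A hedged sketch of the resulting retransmit path (condensed from
      tcp_output.c, not verbatim): compute the remaining cwnd budget in
      segments and hand the whole, possibly-TSO skb down, instead of
      first chopping it into MSS-sized pieces.

      /* Inside tcp_xmit_retransmit_queue(), per skb (simplified): */
      int segs = tp->snd_cwnd - tcp_packets_in_flight(tp);

      if (segs <= 0)
              return;                 /* no cwnd budget left */
      if (tcp_retransmit_skb(sk, skb, segs))
              return;                 /* rtx failed; skb stayed unsplit */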
      
      1% packet loss rates are common today, and at 100Gbit speeds
      this translates to ~80,000 losses per second (at ~1500 bytes per
      packet, 100Gbit is roughly 8 million packets per second).
      Losses are often correlated, and we see many retransmit events
      leading to 1-MSS trains of packets, at a time when hosts are
      already under stress.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  15. 08 Feb 2016, 5 commits
  16. 11 Jan 2016, 3 commits
  17. 20 Nov 2015, 2 commits
  18. 12 Oct 2015, 1 commit
  19. 10 Jul 2015, 1 commit
  20. 18 May 2015, 1 commit
  21. 10 May 2015, 1 commit
  22. 08 Apr 2015, 1 commit
    • tcp: RFC7413 option support for Fast Open client · 2646c831
      Authored by Daniel Lee
      Fast Open has been using an experimental option with a magic
      number (RFC6994). This patch makes the client use the RFC7413
      option (34) by default to get and send Fast Open cookies. The
      client first solicits cookies from a given server with the
      RFC7413 option. If that fails to elicit a cookie, it then tries
      the RFC6994 experimental option. If that also fails, it uses the
      RFC7413 option on all subsequent connect attempts. If the server
      returns a Fast Open cookie, then the client caches the form of
      the option that successfully elicited the cookie, and uses that
      form on later connects when it presents that cookie.
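      One way to picture the cached per-server state is a small enum;
      this is purely an illustrative sketch, not the kernel's actual
      representation (the kernel keeps this alongside the cookie in
      its TCP metrics cache):

      /* Hypothetical illustration of the negotiation state. */
      enum tfo_cookie_option_form {
              TFO_SOLICIT_RFC7413,    /* first connect: try option kind 34  */
              TFO_SOLICIT_EXP,        /* 34 elicited nothing: try RFC6994   */
              TFO_USE_RFC7413,        /* cookie via RFC7413, or both failed */
              TFO_USE_EXP,            /* cookie via the experimental form   */
      };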
      
      The idea is to gradually obsolete the use of the experimental
      option as servers and clients upgrade, while maintaining
      interoperability in the meantime.
      Signed-off-by: Daniel Lee <Longinus00@gmail.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  23. 24 Mar 2015, 1 commit
  24. 21 Mar 2015, 1 commit
    • inet: get rid of central tcp/dccp listener timer · fa76ce73
      Authored by Eric Dumazet
      One of the major issues for TCP is SYNACK rtx handling,
      done by inet_csk_reqsk_queue_prune(), fired by the keepalive
      timer of a TCP_LISTEN socket.
      
      This function runs for awfully long times with the socket lock
      held, meaning that other CPUs needing this lock have to spin
      for hundreds of ms.
      
      SYNACKs are sent in huge bursts, likely to cause severe drops anyway.
      
      This model was OK 15 years ago when memory was very tight.
      
      We can now afford a timer per request sock.
      
      Timer invocations no longer need to lock the listener,
      and can run on all CPUs in parallel.
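      A hedged sketch of the new model (simplified; the real handler
      lives in net/ipv4/inet_connection_sock.c): each request sock
      owns a timer whose handler retransmits the SYNACK or expires the
      request without ever locking the listener.

      static void reqsk_timer_handler(unsigned long data)
      {
              struct request_sock *req = (struct request_sock *)data;
              struct sock *sk_listener = req->rsk_listener;

              /* The listener is only read, never locked, so handlers
               * for millions of request socks run concurrently. */
              if (req->num_timeout++ < TCP_SYNACK_RETRIES)
                      inet_rtx_syn_ack(sk_listener, req);
              else
                      inet_csk_reqsk_queue_drop(sk_listener, req);
      }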
      
      With the following patch increasing somaxconn width to 32 bits,
      I tested a listener with more than 4 million active request
      sockets, and a steady SYNFLOOD of ~200,000 SYN per second.
      The host was sending ~830,000 SYNACK per second.
      
      This is ~100 times more than what we could achieve before this
      patch.
      
      Later, we will get rid of the listener hash and use ehash instead.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  25. 07 Mar 2015, 1 commit
    • ipv4: Create probe timer for tcp PMTU as per RFC4821 · 05cbc0db
      Authored by Fan Du
      As per RFC4821, section 7.3 ("Selecting Probe Size"), a probe
      timer should be armed once probing has converged. Once this
      timer expires, probing runs again to take advantage of any path
      PMTU change. The recommended probing interval is 10 minutes, per
      RFC1981. The probing interval can be tuned via
      sysctl_tcp_probe_interval.
      
      Eric Dumazet suggested implementing a pseudo timer based on the
      32-bit jiffies tcp_time_stamp instead of using a classic timer
      for such a rare event.
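      The pseudo timer is just a saved 32-bit timestamp compared on an
      existing code path, rather than a struct timer_list. A sketch
      (probe_timestamp follows this patch; tcp_mtup_probe_reset() is a
      hypothetical helper standing in for the actual reset logic):

      /* Checked opportunistically on an existing TCP code path: */
      if (tcp_time_stamp - icsk->icsk_mtup.probe_timestamp >
          sysctl_tcp_probe_interval * HZ) {
              tcp_mtup_probe_reset(sk);       /* hypothetical helper */
              icsk->icsk_mtup.probe_timestamp = tcp_time_stamp;
      }
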
      Signed-off-by: Fan Du <fan.du@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  26. 10 Feb 2015, 1 commit
    • ipv4: Namespecify TCP PMTU mechanism · b0f9ca53
      Authored by Fan Du
      Packetization Layer Path MTU Discovery works separately from
      Path MTU Discovery at the IP level, and different net namespaces
      have different requirements on which one to choose; e.g., a
      virtualized container instance would require TCP PMTU to probe a
      usable effective MTU for its underlying tunnel, while the host
      would employ the classical ICMP-based PMTU to function.
      
      Hence, make the TCP PMTU mechanism per net namespace to decouple
      the two functionalities. Furthermore, the probe base MSS should
      also be configured separately for each namespace.
      Signed-off-by: Fan Du <fan.du@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  27. 12 Nov 2014, 1 commit
    • net: Convert LIMIT_NETDEBUG to net_dbg_ratelimited · ba7a46f1
      Authored by Joe Perches
      Use the more common dynamic_debug capable net_dbg_ratelimited
      and remove the LIMIT_NETDEBUG macro.
      
      All messages are still ratelimited.
      
      Some KERN_<LEVEL> uses are changed to KERN_DEBUG.
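      The conversion is mechanical; as a sketch (the message text here
      is made up for illustration):

      /* Before: */
      LIMIT_NETDEBUG(KERN_DEBUG "TCP: bad checksum, dropping packet\n");

      /* After: still ratelimited, now dynamic_debug capable; the
       * debug level is implied, so no KERN_<LEVEL> prefix is needed. */
      net_dbg_ratelimited("TCP: bad checksum, dropping packet\n");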
      
      This may have some negative impact on messages that were emitted
      at KERN_INFO and that are not enabled at all unless DEBUG is
      defined or dynamic_debug is enabled. Even so, these messages are
      now _not_ emitted by default.
      
      This also eliminates the use of the net_msg_warn sysctl
      "/proc/sys/net/core/warnings". For backward compatibility,
      the sysctl is not removed, but it has no function. The extern
      declaration of net_msg_warn is removed from sock.h and made
      static in net/core/sysctl_net_core.c.
      
      Miscellanea:
      
      o Update the sysctl documentation
      o Remove the embedded uses of pr_fmt
      o Coalesce format fragments
      o Realign arguments
      Signed-off-by: Joe Perches <joe@perches.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  28. 02 Oct 2014, 1 commit
    • tcp: abort orphan sockets stalling on zero window probes · b248230c
      Authored by Yuchung Cheng
      Currently we have two different policies for orphan sockets
      that repeatedly stall on zero window ACKs. If a socket gets
      a zero window ACK when it is transmitting data, the RTO is
      used to probe the window. The socket is aborted after roughly
      tcp_orphan_retries() retries (as in tcp_write_timeout()).
      
      But if the socket was idle when it received the zero window ACK,
      and later wants to send more data, we use the probe timer to
      probe the window. If the receiver always returns zero window ACKs,
      icsk_probes keeps getting reset in tcp_ack() and the orphan socket
      can stall forever until the system reaches the orphan limit (as
      commented in tcp_probe_timer()). This opens up a simple attack
      to create lots of hanging orphan sockets to burn the memory
      and the CPU, as demonstrated in the recent netdev post "TCP
      connection will hang in FIN_WAIT1 after closing if zero window is
      advertised." http://www.spinics.net/lists/netdev/msg296539.html
      
      This patch follows the design of the RTO-based probe: we abort
      an orphan socket stalling on zero window when the probe timer
      reaches both the maximum backoff and the maximum RTO. For
      example, a 100ms-RTT connection will time out after roughly 153
      seconds (0.3 + 0.6 + ... + 76.8: nine probes with exponential
      backoff, summing to 0.3 x (2^9 - 1) ~= 153.3) if the receiver
      keeps the window shut. If the orphan socket passes this check,
      but the system already has too many orphans (as in
      tcp_out_of_resources()), we still abort it, but we'll also send
      an RST packet as the connection may still be active.
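      A condensed sketch of the abort test this adds to
      tcp_probe_timer() (error paths trimmed; close to the patch but
      not verbatim):

      max_probes = sysctl_tcp_retries2;
      if (sock_flag(sk, SOCK_DEAD)) {
              /* Orphan: has the backoff reached the maximum RTO? */
              const bool alive = inet_csk_rto_backoff(icsk, TCP_RTO_MAX) < TCP_RTO_MAX;

              max_probes = tcp_orphan_retries(sk, alive);
              if (!alive && icsk->icsk_backoff >= max_probes)
                      goto abort;
              if (tcp_out_of_resources(sk, true))
                      return;         /* aborted, RST sent if needed */
      }
      if (icsk->icsk_probes_out > max_probes) {
      abort:
              tcp_write_err(sk);      /* give up and close the socket */
      }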
      
      In addition, we change TCP_USER_TIMEOUT to cover (live or dead)
      sockets stalled on zero-window probes. This changes the
      semantics of TCP_USER_TIMEOUT slightly, because previously it
      applied only when the socket had pending transmission.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Reported-by: Andrey Dmitrov <andrey.dmitrov@oktetlabs.ru>
      Signed-off-by: David S. Miller <davem@davemloft.net>