1. 20 6月, 2017 2 次提交
  2. 16 6月, 2017 2 次提交
  3. 08 6月, 2017 5 次提交
  4. 05 6月, 2017 1 次提交
  5. 18 5月, 2017 4 次提交
  6. 17 5月, 2017 1 次提交
    • E
      tcp: internal implementation for pacing · 218af599
      Eric Dumazet 提交于
      BBR congestion control depends on pacing, and pacing is
      currently handled by sch_fq packet scheduler for performance reasons,
      and also because implemening pacing with FQ was convenient to truly
      avoid bursts.
      
      However there are many cases where this packet scheduler constraint
      is not practical.
      - Many linux hosts are not focusing on handling thousands of TCP
        flows in the most efficient way.
      - Some routers use fq_codel or other AQM, but still would like
        to use BBR for the few TCP flows they initiate/terminate.
      
      This patch implements an automatic fallback to internal pacing.
      
      Pacing is requested either by BBR or use of SO_MAX_PACING_RATE option.
      
      If sch_fq happens to be in the egress path, pacing is delegated to
      the qdisc, otherwise pacing is done by TCP itself.
      
      One advantage of pacing from TCP stack is to get more precise rtt
      estimations, and less work done from TX completion, since TCP Small
      queue limits are not generally hit. Setups with single TX queue but
      many cpus might even benefit from this.
      
      Note that unlike sch_fq, we do not take into account header sizes.
      Taking care of these headers would add additional complexity for
      no practical differences in behavior.
      
      Some performance numbers using 800 TCP_STREAM flows rate limited to
      ~48 Mbit per second on 40Gbit NIC.
      
      If MQ+pfifo_fast is used on the NIC :
      
      $ sar -n DEV 1 5 | grep eth
      14:48:44         eth0 725743.00 2932134.00  46776.76 4335184.68      0.00      0.00      1.00
      14:48:45         eth0 725349.00 2932112.00  46751.86 4335158.90      0.00      0.00      0.00
      14:48:46         eth0 725101.00 2931153.00  46735.07 4333748.63      0.00      0.00      0.00
      14:48:47         eth0 725099.00 2931161.00  46735.11 4333760.44      0.00      0.00      1.00
      14:48:48         eth0 725160.00 2931731.00  46738.88 4334606.07      0.00      0.00      0.00
      Average:         eth0 725290.40 2931658.20  46747.54 4334491.74      0.00      0.00      0.40
      $ vmstat 1 5
      procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
       r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
       4  0      0 259825920  45644 2708324    0    0    21     2  247   98  0  0 100  0  0
       4  0      0 259823744  45644 2708356    0    0     0     0 2400825 159843  0 19 81  0  0
       0  0      0 259824208  45644 2708072    0    0     0     0 2407351 159929  0 19 81  0  0
       1  0      0 259824592  45644 2708128    0    0     0     0 2405183 160386  0 19 80  0  0
       1  0      0 259824272  45644 2707868    0    0     0    32 2396361 158037  0 19 81  0  0
      
      Now use MQ+FQ :
      
      lpaa23:~# echo fq >/proc/sys/net/core/default_qdisc
      lpaa23:~# tc qdisc replace dev eth0 root mq
      
      $ sar -n DEV 1 5 | grep eth
      14:49:57         eth0 678614.00 2727930.00  43739.13 4033279.14      0.00      0.00      0.00
      14:49:58         eth0 677620.00 2723971.00  43674.69 4027429.62      0.00      0.00      1.00
      14:49:59         eth0 676396.00 2719050.00  43596.83 4020125.02      0.00      0.00      0.00
      14:50:00         eth0 675197.00 2714173.00  43518.62 4012938.90      0.00      0.00      1.00
      14:50:01         eth0 676388.00 2719063.00  43595.47 4020171.64      0.00      0.00      0.00
      Average:         eth0 676843.00 2720837.40  43624.95 4022788.86      0.00      0.00      0.40
      $ vmstat 1 5
      procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
       r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
       2  0      0 259832240  46008 2710912    0    0    21     2  223  192  0  1 99  0  0
       1  0      0 259832896  46008 2710744    0    0     0     0 1702206 198078  0 17 82  0  0
       0  0      0 259830272  46008 2710596    0    0     0     0 1696340 197756  1 17 83  0  0
       4  0      0 259829168  46024 2710584    0    0    16     0 1688472 197158  1 17 82  0  0
       3  0      0 259830224  46024 2710408    0    0     0     0 1692450 197212  0 18 82  0  0
      
      As expected, number of interrupts per second is very different.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Van Jacobson <vanj@google.com>
      Cc: Jerry Chu <hkchu@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      218af599
  7. 09 5月, 2017 1 次提交
  8. 06 5月, 2017 1 次提交
    • E
      tcp: randomize timestamps on syncookies · 84b114b9
      Eric Dumazet 提交于
      Whole point of randomization was to hide server uptime, but an attacker
      can simply start a syn flood and TCP generates 'old style' timestamps,
      directly revealing server jiffies value.
      
      Also, TSval sent by the server to a particular remote address vary
      depending on syncookies being sent or not, potentially triggering PAWS
      drops for innocent clients.
      
      Lets implement proper randomization, including for SYNcookies.
      
      Also we do not need to export sysctl_tcp_timestamps, since it is not
      used from a module.
      
      In v2, I added Florian feedback and contribution, adding tsoff to
      tcp_get_cookie_sock().
      
      v3 removed one unused variable in tcp_v4_connect() as Florian spotted.
      
      Fixes: 95a22cae ("tcp: randomize tcp timestamp offsets for each connection")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reviewed-by: NFlorian Westphal <fw@strlen.de>
      Tested-by: NFlorian Westphal <fw@strlen.de>
      Cc: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      84b114b9
  9. 27 4月, 2017 3 次提交
  10. 25 4月, 2017 2 次提交
    • W
      net/tcp_fastopen: Add snmp counter for blackhole detection · 46c2fa39
      Wei Wang 提交于
      This counter records the number of times the firewall blackhole issue is
      detected and active TFO is disabled.
      Signed-off-by: NWei Wang <weiwan@google.com>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      46c2fa39
    • W
      net/tcp_fastopen: Disable active side TFO in certain scenarios · cf1ef3f0
      Wei Wang 提交于
      Middlebox firewall issues can potentially cause server's data being
      blackholed after a successful 3WHS using TFO. Following are the related
      reports from Apple:
      https://www.nanog.org/sites/default/files/Paasch_Network_Support.pdf
      Slide 31 identifies an issue where the client ACK to the server's data
      sent during a TFO'd handshake is dropped.
      C ---> syn-data ---> S
      C <--- syn/ack ----- S
      C (accept & write)
      C <---- data ------- S
      C ----- ACK -> X     S
      		[retry and timeout]
      
      https://www.ietf.org/proceedings/94/slides/slides-94-tcpm-13.pdf
      Slide 5 shows a similar situation that the server's data gets dropped
      after 3WHS.
      C ---- syn-data ---> S
      C <--- syn/ack ----- S
      C ---- ack --------> S
      S (accept & write)
      C?  X <- data ------ S
      		[retry and timeout]
      
      This is the worst failure b/c the client can not detect such behavior to
      mitigate the situation (such as disabling TFO). Failing to proceed, the
      application (e.g., SSL library) may simply timeout and retry with TFO
      again, and the process repeats indefinitely.
      
      The proposed solution is to disable active TFO globally under the
      following circumstances:
      1. client side TFO socket detects out of order FIN
      2. client side TFO socket receives out of order RST
      
      We disable active side TFO globally for 1hr at first. Then if it
      happens again, we disable it for 2h, then 4h, 8h, ...
      And we reset the timeout to 1hr if a client side TFO sockets not opened
      on loopback has successfully received data segs from server.
      And we examine this condition during close().
      
      The rational behind it is that when such firewall issue happens,
      application running on the client should eventually close the socket as
      it is not able to get the data it is expecting. Or application running
      on the server should close the socket as it is not able to receive any
      response from client.
      In both cases, out of order FIN or RST will get received on the client
      given that the firewall will not block them as no data are in those
      frames.
      And we want to disable active TFO globally as it helps if the middle box
      is very close to the client and most of the connections are likely to
      fail.
      
      Also, add a debug sysctl:
        tcp_fastopen_blackhole_detect_timeout_sec:
          the initial timeout to use when firewall blackhole issue happens.
          This can be set and read.
          When setting it to 0, it means to disable the active disable logic.
      Signed-off-by: NWei Wang <weiwan@google.com>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cf1ef3f0
  11. 05 4月, 2017 1 次提交
  12. 25 3月, 2017 1 次提交
    • G
      tcp: sysctl: Fix a race to avoid unexpected 0 window from space · c4836742
      Gao Feng 提交于
      Because sysctl_tcp_adv_win_scale could be changed any time, so there
      is one race in tcp_win_from_space.
      For example,
      1.sysctl_tcp_adv_win_scale<=0 (sysctl_tcp_adv_win_scale is negative now)
      2.space>>(-sysctl_tcp_adv_win_scale) (sysctl_tcp_adv_win_scale is postive now)
      
      As a result, tcp_win_from_space returns 0. It is unexpected.
      
      Certainly if the compiler put the sysctl_tcp_adv_win_scale into one
      register firstly, then use the register directly, it would be ok.
      But we could not depend on the compiler behavior.
      Signed-off-by: NGao Feng <fgao@ikuai8.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c4836742
  13. 17 3月, 2017 2 次提交
  14. 10 3月, 2017 1 次提交
  15. 26 1月, 2017 2 次提交
    • W
      net/tcp-fastopen: Add new API support · 19f6d3f3
      Wei Wang 提交于
      This patch adds a new socket option, TCP_FASTOPEN_CONNECT, as an
      alternative way to perform Fast Open on the active side (client). Prior
      to this patch, a client needs to replace the connect() call with
      sendto(MSG_FASTOPEN). This can be cumbersome for applications who want
      to use Fast Open: these socket operations are often done in lower layer
      libraries used by many other applications. Changing these libraries
      and/or the socket call sequences are not trivial. A more convenient
      approach is to perform Fast Open by simply enabling a socket option when
      the socket is created w/o changing other socket calls sequence:
        s = socket()
          create a new socket
        setsockopt(s, IPPROTO_TCP, TCP_FASTOPEN_CONNECT …);
          newly introduced sockopt
          If set, new functionality described below will be used.
          Return ENOTSUPP if TFO is not supported or not enabled in the
          kernel.
      
        connect()
          With cookie present, return 0 immediately.
          With no cookie, initiate 3WHS with TFO cookie-request option and
          return -1 with errno = EINPROGRESS.
      
        write()/sendmsg()
          With cookie present, send out SYN with data and return the number of
          bytes buffered.
          With no cookie, and 3WHS not yet completed, return -1 with errno =
          EINPROGRESS.
          No MSG_FASTOPEN flag is needed.
      
        read()
          Return -1 with errno = EWOULDBLOCK/EAGAIN if connect() is called but
          write() is not called yet.
          Return -1 with errno = EWOULDBLOCK/EAGAIN if connection is
          established but no msg is received yet.
          Return number of bytes read if socket is established and there is
          msg received.
      
      The new API simplifies life for applications that always perform a write()
      immediately after a successful connect(). Such applications can now take
      advantage of Fast Open by merely making one new setsockopt() call at the time
      of creating the socket. Nothing else about the application's socket call
      sequence needs to change.
      Signed-off-by: NWei Wang <weiwan@google.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      19f6d3f3
    • W
      net/tcp-fastopen: refactor cookie check logic · 065263f4
      Wei Wang 提交于
      Refactor the cookie check logic in tcp_send_syn_data() into a function.
      This function will be called else where in later changes.
      Signed-off-by: NWei Wang <weiwan@google.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      065263f4
  16. 14 1月, 2017 6 次提交
  17. 30 12月, 2016 1 次提交
  18. 28 12月, 2016 1 次提交
  19. 03 12月, 2016 1 次提交
    • F
      tcp: randomize tcp timestamp offsets for each connection · 95a22cae
      Florian Westphal 提交于
      jiffies based timestamps allow for easy inference of number of devices
      behind NAT translators and also makes tracking of hosts simpler.
      
      commit ceaa1fef ("tcp: adding a per-socket timestamp offset")
      added the main infrastructure that is needed for per-connection ts
      randomization, in particular writing/reading the on-wire tcp header
      format takes the offset into account so rest of stack can use normal
      tcp_time_stamp (jiffies).
      
      So only two items are left:
       - add a tsoffset for request sockets
       - extend the tcp isn generator to also return another 32bit number
         in addition to the ISN.
      
      Re-use of ISN generator also means timestamps are still monotonically
      increasing for same connection quadruple, i.e. PAWS will still work.
      
      Includes fixes from Eric Dumazet.
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      95a22cae
  20. 30 11月, 2016 2 次提交