1. 17 5月, 2017 1 次提交
    • E
      tcp: internal implementation for pacing · 218af599
      Eric Dumazet 提交于
      BBR congestion control depends on pacing, and pacing is
      currently handled by sch_fq packet scheduler for performance reasons,
      and also because implemening pacing with FQ was convenient to truly
      avoid bursts.
      
      However there are many cases where this packet scheduler constraint
      is not practical.
      - Many linux hosts are not focusing on handling thousands of TCP
        flows in the most efficient way.
      - Some routers use fq_codel or other AQM, but still would like
        to use BBR for the few TCP flows they initiate/terminate.
      
      This patch implements an automatic fallback to internal pacing.
      
      Pacing is requested either by BBR or use of SO_MAX_PACING_RATE option.
      
      If sch_fq happens to be in the egress path, pacing is delegated to
      the qdisc, otherwise pacing is done by TCP itself.
      
      One advantage of pacing from TCP stack is to get more precise rtt
      estimations, and less work done from TX completion, since TCP Small
      queue limits are not generally hit. Setups with single TX queue but
      many cpus might even benefit from this.
      
      Note that unlike sch_fq, we do not take into account header sizes.
      Taking care of these headers would add additional complexity for
      no practical differences in behavior.
      
      Some performance numbers using 800 TCP_STREAM flows rate limited to
      ~48 Mbit per second on 40Gbit NIC.
      
      If MQ+pfifo_fast is used on the NIC :
      
      $ sar -n DEV 1 5 | grep eth
      14:48:44         eth0 725743.00 2932134.00  46776.76 4335184.68      0.00      0.00      1.00
      14:48:45         eth0 725349.00 2932112.00  46751.86 4335158.90      0.00      0.00      0.00
      14:48:46         eth0 725101.00 2931153.00  46735.07 4333748.63      0.00      0.00      0.00
      14:48:47         eth0 725099.00 2931161.00  46735.11 4333760.44      0.00      0.00      1.00
      14:48:48         eth0 725160.00 2931731.00  46738.88 4334606.07      0.00      0.00      0.00
      Average:         eth0 725290.40 2931658.20  46747.54 4334491.74      0.00      0.00      0.40
      $ vmstat 1 5
      procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
       r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
       4  0      0 259825920  45644 2708324    0    0    21     2  247   98  0  0 100  0  0
       4  0      0 259823744  45644 2708356    0    0     0     0 2400825 159843  0 19 81  0  0
       0  0      0 259824208  45644 2708072    0    0     0     0 2407351 159929  0 19 81  0  0
       1  0      0 259824592  45644 2708128    0    0     0     0 2405183 160386  0 19 80  0  0
       1  0      0 259824272  45644 2707868    0    0     0    32 2396361 158037  0 19 81  0  0
      
      Now use MQ+FQ :
      
      lpaa23:~# echo fq >/proc/sys/net/core/default_qdisc
      lpaa23:~# tc qdisc replace dev eth0 root mq
      
      $ sar -n DEV 1 5 | grep eth
      14:49:57         eth0 678614.00 2727930.00  43739.13 4033279.14      0.00      0.00      0.00
      14:49:58         eth0 677620.00 2723971.00  43674.69 4027429.62      0.00      0.00      1.00
      14:49:59         eth0 676396.00 2719050.00  43596.83 4020125.02      0.00      0.00      0.00
      14:50:00         eth0 675197.00 2714173.00  43518.62 4012938.90      0.00      0.00      1.00
      14:50:01         eth0 676388.00 2719063.00  43595.47 4020171.64      0.00      0.00      0.00
      Average:         eth0 676843.00 2720837.40  43624.95 4022788.86      0.00      0.00      0.40
      $ vmstat 1 5
      procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
       r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
       2  0      0 259832240  46008 2710912    0    0    21     2  223  192  0  1 99  0  0
       1  0      0 259832896  46008 2710744    0    0     0     0 1702206 198078  0 17 82  0  0
       0  0      0 259830272  46008 2710596    0    0     0     0 1696340 197756  1 17 83  0  0
       4  0      0 259829168  46024 2710584    0    0    16     0 1688472 197158  1 17 82  0  0
       3  0      0 259830224  46024 2710408    0    0     0     0 1692450 197212  0 18 82  0  0
      
      As expected, number of interrupts per second is very different.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Van Jacobson <vanj@google.com>
      Cc: Jerry Chu <hkchu@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      218af599
  2. 09 5月, 2017 1 次提交
  3. 06 5月, 2017 1 次提交
    • E
      tcp: randomize timestamps on syncookies · 84b114b9
      Eric Dumazet 提交于
      Whole point of randomization was to hide server uptime, but an attacker
      can simply start a syn flood and TCP generates 'old style' timestamps,
      directly revealing server jiffies value.
      
      Also, TSval sent by the server to a particular remote address vary
      depending on syncookies being sent or not, potentially triggering PAWS
      drops for innocent clients.
      
      Lets implement proper randomization, including for SYNcookies.
      
      Also we do not need to export sysctl_tcp_timestamps, since it is not
      used from a module.
      
      In v2, I added Florian feedback and contribution, adding tsoff to
      tcp_get_cookie_sock().
      
      v3 removed one unused variable in tcp_v4_connect() as Florian spotted.
      
      Fixes: 95a22cae ("tcp: randomize tcp timestamp offsets for each connection")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reviewed-by: NFlorian Westphal <fw@strlen.de>
      Tested-by: NFlorian Westphal <fw@strlen.de>
      Cc: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      84b114b9
  4. 27 4月, 2017 3 次提交
  5. 25 4月, 2017 2 次提交
    • W
      net/tcp_fastopen: Add snmp counter for blackhole detection · 46c2fa39
      Wei Wang 提交于
      This counter records the number of times the firewall blackhole issue is
      detected and active TFO is disabled.
      Signed-off-by: NWei Wang <weiwan@google.com>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      46c2fa39
    • W
      net/tcp_fastopen: Disable active side TFO in certain scenarios · cf1ef3f0
      Wei Wang 提交于
      Middlebox firewall issues can potentially cause server's data being
      blackholed after a successful 3WHS using TFO. Following are the related
      reports from Apple:
      https://www.nanog.org/sites/default/files/Paasch_Network_Support.pdf
      Slide 31 identifies an issue where the client ACK to the server's data
      sent during a TFO'd handshake is dropped.
      C ---> syn-data ---> S
      C <--- syn/ack ----- S
      C (accept & write)
      C <---- data ------- S
      C ----- ACK -> X     S
      		[retry and timeout]
      
      https://www.ietf.org/proceedings/94/slides/slides-94-tcpm-13.pdf
      Slide 5 shows a similar situation that the server's data gets dropped
      after 3WHS.
      C ---- syn-data ---> S
      C <--- syn/ack ----- S
      C ---- ack --------> S
      S (accept & write)
      C?  X <- data ------ S
      		[retry and timeout]
      
      This is the worst failure b/c the client can not detect such behavior to
      mitigate the situation (such as disabling TFO). Failing to proceed, the
      application (e.g., SSL library) may simply timeout and retry with TFO
      again, and the process repeats indefinitely.
      
      The proposed solution is to disable active TFO globally under the
      following circumstances:
      1. client side TFO socket detects out of order FIN
      2. client side TFO socket receives out of order RST
      
      We disable active side TFO globally for 1hr at first. Then if it
      happens again, we disable it for 2h, then 4h, 8h, ...
      And we reset the timeout to 1hr if a client side TFO sockets not opened
      on loopback has successfully received data segs from server.
      And we examine this condition during close().
      
      The rational behind it is that when such firewall issue happens,
      application running on the client should eventually close the socket as
      it is not able to get the data it is expecting. Or application running
      on the server should close the socket as it is not able to receive any
      response from client.
      In both cases, out of order FIN or RST will get received on the client
      given that the firewall will not block them as no data are in those
      frames.
      And we want to disable active TFO globally as it helps if the middle box
      is very close to the client and most of the connections are likely to
      fail.
      
      Also, add a debug sysctl:
        tcp_fastopen_blackhole_detect_timeout_sec:
          the initial timeout to use when firewall blackhole issue happens.
          This can be set and read.
          When setting it to 0, it means to disable the active disable logic.
      Signed-off-by: NWei Wang <weiwan@google.com>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cf1ef3f0
  6. 05 4月, 2017 1 次提交
  7. 25 3月, 2017 1 次提交
    • G
      tcp: sysctl: Fix a race to avoid unexpected 0 window from space · c4836742
      Gao Feng 提交于
      Because sysctl_tcp_adv_win_scale could be changed any time, so there
      is one race in tcp_win_from_space.
      For example,
      1.sysctl_tcp_adv_win_scale<=0 (sysctl_tcp_adv_win_scale is negative now)
      2.space>>(-sysctl_tcp_adv_win_scale) (sysctl_tcp_adv_win_scale is postive now)
      
      As a result, tcp_win_from_space returns 0. It is unexpected.
      
      Certainly if the compiler put the sysctl_tcp_adv_win_scale into one
      register firstly, then use the register directly, it would be ok.
      But we could not depend on the compiler behavior.
      Signed-off-by: NGao Feng <fgao@ikuai8.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c4836742
  8. 17 3月, 2017 2 次提交
  9. 10 3月, 2017 1 次提交
  10. 26 1月, 2017 2 次提交
    • W
      net/tcp-fastopen: Add new API support · 19f6d3f3
      Wei Wang 提交于
      This patch adds a new socket option, TCP_FASTOPEN_CONNECT, as an
      alternative way to perform Fast Open on the active side (client). Prior
      to this patch, a client needs to replace the connect() call with
      sendto(MSG_FASTOPEN). This can be cumbersome for applications who want
      to use Fast Open: these socket operations are often done in lower layer
      libraries used by many other applications. Changing these libraries
      and/or the socket call sequences are not trivial. A more convenient
      approach is to perform Fast Open by simply enabling a socket option when
      the socket is created w/o changing other socket calls sequence:
        s = socket()
          create a new socket
        setsockopt(s, IPPROTO_TCP, TCP_FASTOPEN_CONNECT …);
          newly introduced sockopt
          If set, new functionality described below will be used.
          Return ENOTSUPP if TFO is not supported or not enabled in the
          kernel.
      
        connect()
          With cookie present, return 0 immediately.
          With no cookie, initiate 3WHS with TFO cookie-request option and
          return -1 with errno = EINPROGRESS.
      
        write()/sendmsg()
          With cookie present, send out SYN with data and return the number of
          bytes buffered.
          With no cookie, and 3WHS not yet completed, return -1 with errno =
          EINPROGRESS.
          No MSG_FASTOPEN flag is needed.
      
        read()
          Return -1 with errno = EWOULDBLOCK/EAGAIN if connect() is called but
          write() is not called yet.
          Return -1 with errno = EWOULDBLOCK/EAGAIN if connection is
          established but no msg is received yet.
          Return number of bytes read if socket is established and there is
          msg received.
      
      The new API simplifies life for applications that always perform a write()
      immediately after a successful connect(). Such applications can now take
      advantage of Fast Open by merely making one new setsockopt() call at the time
      of creating the socket. Nothing else about the application's socket call
      sequence needs to change.
      Signed-off-by: NWei Wang <weiwan@google.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      19f6d3f3
    • W
      net/tcp-fastopen: refactor cookie check logic · 065263f4
      Wei Wang 提交于
      Refactor the cookie check logic in tcp_send_syn_data() into a function.
      This function will be called else where in later changes.
      Signed-off-by: NWei Wang <weiwan@google.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      065263f4
  11. 14 1月, 2017 6 次提交
  12. 30 12月, 2016 1 次提交
  13. 28 12月, 2016 1 次提交
  14. 03 12月, 2016 1 次提交
    • F
      tcp: randomize tcp timestamp offsets for each connection · 95a22cae
      Florian Westphal 提交于
      jiffies based timestamps allow for easy inference of number of devices
      behind NAT translators and also makes tracking of hosts simpler.
      
      commit ceaa1fef ("tcp: adding a per-socket timestamp offset")
      added the main infrastructure that is needed for per-connection ts
      randomization, in particular writing/reading the on-wire tcp header
      format takes the offset into account so rest of stack can use normal
      tcp_time_stamp (jiffies).
      
      So only two items are left:
       - add a tsoffset for request sockets
       - extend the tcp isn generator to also return another 32bit number
         in addition to the ISN.
      
      Re-use of ISN generator also means timestamps are still monotonically
      increasing for same connection quadruple, i.e. PAWS will still work.
      
      Includes fixes from Eric Dumazet.
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      95a22cae
  15. 30 11月, 2016 2 次提交
  16. 22 11月, 2016 1 次提交
  17. 14 11月, 2016 1 次提交
  18. 04 11月, 2016 1 次提交
    • D
      net: tcp: check skb is non-NULL for exact match on lookups · da96786e
      David Ahern 提交于
      Andrey reported the following error report while running the syzkaller
      fuzzer:
      
      general protection fault: 0000 [#1] SMP KASAN
      Dumping ftrace buffer:
         (ftrace buffer empty)
      Modules linked in:
      CPU: 0 PID: 648 Comm: syz-executor Not tainted 4.9.0-rc3+ #333
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
      task: ffff8800398c4480 task.stack: ffff88003b468000
      RIP: 0010:[<ffffffff83091106>]  [<     inline     >]
      inet_exact_dif_match include/net/tcp.h:808
      RIP: 0010:[<ffffffff83091106>]  [<ffffffff83091106>]
      __inet_lookup_listener+0xb6/0x500 net/ipv4/inet_hashtables.c:219
      RSP: 0018:ffff88003b46f270  EFLAGS: 00010202
      RAX: 0000000000000004 RBX: 0000000000004242 RCX: 0000000000000001
      RDX: 0000000000000000 RSI: ffffc90000e3c000 RDI: 0000000000000054
      RBP: ffff88003b46f2d8 R08: 0000000000004000 R09: ffffffff830910e7
      R10: 0000000000000000 R11: 000000000000000a R12: ffffffff867fa0c0
      R13: 0000000000004242 R14: 0000000000000003 R15: dffffc0000000000
      FS:  00007fb135881700(0000) GS:ffff88003ec00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000020cc3000 CR3: 000000006d56a000 CR4: 00000000000006f0
      Stack:
       0000000000000000 000000000601a8c0 0000000000000000 ffffffff00004242
       424200003b9083c2 ffff88003def4041 ffffffff84e7e040 0000000000000246
       ffff88003a0911c0 0000000000000000 ffff88003a091298 ffff88003b9083ae
      Call Trace:
       [<ffffffff831100f4>] tcp_v4_send_reset+0x584/0x1700 net/ipv4/tcp_ipv4.c:643
       [<ffffffff83115b1b>] tcp_v4_rcv+0x198b/0x2e50 net/ipv4/tcp_ipv4.c:1718
       [<ffffffff83069d22>] ip_local_deliver_finish+0x332/0xad0
      net/ipv4/ip_input.c:216
      ...
      
      MD5 has a code path that calls __inet_lookup_listener with a null skb,
      so inet{6}_exact_dif_match needs to check skb against null before pulling
      the flag.
      
      Fixes: a04a480d ("net: Require exact match for TCP socket lookups if
             dif is l3mdev")
      Reported-by: NAndrey Konovalov <andreyknvl@google.com>
      Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Tested-by: NAndrey Konovalov <andreyknvl@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      da96786e
  19. 17 10月, 2016 1 次提交
    • D
      net: Require exact match for TCP socket lookups if dif is l3mdev · a04a480d
      David Ahern 提交于
      Currently, socket lookups for l3mdev (vrf) use cases can match a socket
      that is bound to a port but not a device (ie., a global socket). If the
      sysctl tcp_l3mdev_accept is not set this leads to ack packets going out
      based on the main table even though the packet came in from an L3 domain.
      The end result is that the connection does not establish creating
      confusion for users since the service is running and a socket shows in
      ss output. Fix by requiring an exact dif to sk_bound_dev_if match if the
      skb came through an interface enslaved to an l3mdev device and the
      tcp_l3mdev_accept is not set.
      
      skb's through an l3mdev interface are marked by setting a flag in
      inet{6}_skb_parm. The IPv6 variant is already set; this patch adds the
      flag for IPv4. Using an skb flag avoids a device lookup on the dif. The
      flag is set in the VRF driver using the IP{6}CB macros. For IPv4, the
      inet_skb_parm struct is moved in the cb per commit 971f10ec, so the
      match function in the TCP stack needs to use TCP_SKB_CB. For IPv6, the
      move is done after the socket lookup, so IP6CB is used.
      
      The flags field in inet_skb_parm struct needs to be increased to add
      another flag. There is currently a 1-byte hole following the flags,
      so it can be expanded to u16 without increasing the size of the struct.
      
      Fixes: 193125db ("net: Introduce VRF device driver")
      Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a04a480d
  20. 21 9月, 2016 7 次提交
    • Y
      tcp: new CC hook to set sending rate with rate_sample in any CA state · c0402760
      Yuchung Cheng 提交于
      This commit introduces an optional new "omnipotent" hook,
      cong_control(), for congestion control modules. The cong_control()
      function is called at the end of processing an ACK (i.e., after
      updating sequence numbers, the SACK scoreboard, and loss
      detection). At that moment we have precise delivery rate information
      the congestion control module can use to control the sending behavior
      (using cwnd, TSO skb size, and pacing rate) in any CA state.
      
      This function can also be used by a congestion control that prefers
      not to use the default cwnd reduction approach (i.e., the PRR
      algorithm) during CA_Recovery to control the cwnd and sending rate
      during loss recovery.
      
      We take advantage of the fact that recent changes defer the
      retransmission or transmission of new data (e.g. by F-RTO) in recovery
      until the new tcp_cong_control() function is run.
      
      With this commit, we only run tcp_update_pacing_rate() if the
      congestion control is not using this new API. New congestion controls
      which use the new API do not want the TCP stack to run the default
      pacing rate calculation and overwrite whatever pacing rate they have
      chosen at initialization time.
      Signed-off-by: NVan Jacobson <vanj@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNandita Dukkipati <nanditad@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c0402760
    • Y
      tcp: allow congestion control to expand send buffer differently · 77bfc174
      Yuchung Cheng 提交于
      Currently the TCP send buffer expands to twice cwnd, in order to allow
      limited transmits in the CA_Recovery state. This assumes that cwnd
      does not increase in the CA_Recovery.
      
      For some congestion control algorithms, like the upcoming BBR module,
      if the losses in recovery do not indicate congestion then we may
      continue to raise cwnd multiplicatively in recovery. In such cases the
      current multiplier will falsely limit the sending rate, much as if it
      were limited by the application.
      
      This commit adds an optional congestion control callback to use a
      different multiplier to expand the TCP send buffer. For congestion
      control modules that do not specificy this callback, TCP continues to
      use the previous default of 2.
      Signed-off-by: NVan Jacobson <vanj@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNandita Dukkipati <nanditad@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Acked-by: NStephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      77bfc174
    • N
      tcp: export tcp_tso_autosize() and parameterize minimum number of TSO segments · 1b3878ca
      Neal Cardwell 提交于
      To allow congestion control modules to use the default TSO auto-sizing
      algorithm as one of the ingredients in their own decision about TSO sizing:
      
      1) Export tcp_tso_autosize() so that CC modules can use it.
      
      2) Change tcp_tso_autosize() to allow callers to specify a minimum
         number of segments per TSO skb, in case the congestion control
         module has a different notion of the best floor for TSO skbs for
         the connection right now. For very low-rate paths or policed
         connections it can be appropriate to use smaller TSO skbs.
      Signed-off-by: NVan Jacobson <vanj@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNandita Dukkipati <nanditad@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1b3878ca
    • N
      tcp: allow congestion control module to request TSO skb segment count · ed6e7268
      Neal Cardwell 提交于
      Add the tso_segs_goal() function in tcp_congestion_ops to allow the
      congestion control module to specify the number of segments that
      should be in a TSO skb sent by tcp_write_xmit() and
      tcp_xmit_retransmit_queue(). The congestion control module can either
      request a particular number of segments in TSO skb that we transmit,
      or return 0 if it doesn't care.
      
      This allows the upcoming BBR congestion control module to select small
      TSO skb sizes if the module detects that the bottleneck bandwidth is
      very low, or that the connection is policed to a low rate.
      Signed-off-by: NVan Jacobson <vanj@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNandita Dukkipati <nanditad@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ed6e7268
    • S
      tcp: track application-limited rate samples · d7722e85
      Soheil Hassas Yeganeh 提交于
      This commit adds code to track whether the delivery rate represented
      by each rate_sample was limited by the application.
      
      Upon each transmit, we store in the is_app_limited field in the skb a
      boolean bit indicating whether there is a known "bubble in the pipe":
      a point in the rate sample interval where the sender was
      application-limited, and did not transmit even though the cwnd and
      pacing rate allowed it.
      
      This logic marks the flow app-limited on a write if *all* of the
      following are true:
      
        1) There is less than 1 MSS of unsent data in the write queue
           available to transmit.
      
        2) There is no packet in the sender's queues (e.g. in fq or the NIC
           tx queue).
      
        3) The connection is not limited by cwnd.
      
        4) There are no lost packets to retransmit.
      
      The tcp_rate_check_app_limited() code in tcp_rate.c determines whether
      the connection is application-limited at the moment. If the flow is
      application-limited, it sets the tp->app_limited field. If the flow is
      application-limited then that means there is effectively a "bubble" of
      silence in the pipe now, and this silence will be reflected in a lower
      bandwidth sample for any rate samples from now until we get an ACK
      indicating this bubble has exited the pipe: specifically, until we get
      an ACK for the next packet we transmit.
      
      When we send every skb we record in scb->tx.is_app_limited whether the
      resulting rate sample will be application-limited.
      
      The code in tcp_rate_gen() checks to see when it is safe to mark all
      known application-limited bubbles of silence as having exited the
      pipe. It does this by checking to see when the delivered count moves
      past the tp->app_limited marker. At this point it zeroes the
      tp->app_limited marker, as all known bubbles are out of the pipe.
      
      We make room for the tx.is_app_limited bit in the skb by borrowing a
      bit from the in_flight field used by NV to record the number of bytes
      in flight. The receive window in the TCP header is 16 bits, and the
      max receive window scaling shift factor is 14 (RFC 1323). So the max
      receive window offered by the TCP protocol is 2^(16+14) = 2^30. So we
      only need 30 bits for the tx.in_flight used by NV.
      Signed-off-by: NVan Jacobson <vanj@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNandita Dukkipati <nanditad@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d7722e85
    • Y
      tcp: track data delivery rate for a TCP connection · b9f64820
      Yuchung Cheng 提交于
      This patch generates data delivery rate (throughput) samples on a
      per-ACK basis. These rate samples can be used by congestion control
      modules, and specifically will be used by TCP BBR in later patches in
      this series.
      
      Key state:
      
      tp->delivered: Tracks the total number of data packets (original or not)
      	       delivered so far. This is an already-existing field.
      
      tp->delivered_mstamp: the last time tp->delivered was updated.
      
      Algorithm:
      
      A rate sample is calculated as (d1 - d0)/(t1 - t0) on a per-ACK basis:
      
        d1: the current tp->delivered after processing the ACK
        t1: the current time after processing the ACK
      
        d0: the prior tp->delivered when the acked skb was transmitted
        t0: the prior tp->delivered_mstamp when the acked skb was transmitted
      
      When an skb is transmitted, we snapshot d0 and t0 in its control
      block in tcp_rate_skb_sent().
      
      When an ACK arrives, it may SACK and ACK some skbs. For each SACKed
      or ACKed skb, tcp_rate_skb_delivered() updates the rate_sample struct
      to reflect the latest (d0, t0).
      
      Finally, tcp_rate_gen() generates a rate sample by storing
      (d1 - d0) in rs->delivered and (t1 - t0) in rs->interval_us.
      
      One caveat: if an skb was sent with no packets in flight, then
      tp->delivered_mstamp may be either invalid (if the connection is
      starting) or outdated (if the connection was idle). In that case,
      we'll re-stamp tp->delivered_mstamp.
      
      At first glance it seems t0 should always be the time when an skb was
      transmitted, but actually this could over-estimate the rate due to
      phase mismatch between transmit and ACK events. To track the delivery
      rate, we ensure that if packets are in flight then t0 and and t1 are
      times at which packets were marked delivered.
      
      If the initial and final RTTs are different then one may be corrupted
      by some sort of noise. The noise we see most often is sending gaps
      caused by delayed, compressed, or stretched acks. This either affects
      both RTTs equally or artificially reduces the final RTT. We approach
      this by recording the info we need to compute the initial RTT
      (duration of the "send phase" of the window) when we recorded the
      associated inflight. Then, for a filter to avoid bandwidth
      overestimates, we generalize the per-sample bandwidth computation
      from:
      
          bw = delivered / ack_phase_rtt
      
      to the following:
      
          bw = delivered / max(send_phase_rtt, ack_phase_rtt)
      
      In large-scale experiments, this filtering approach incorporating
      send_phase_rtt is effective at avoiding bandwidth overestimates due to
      ACK compression or stretched ACKs.
      Signed-off-by: NVan Jacobson <vanj@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNandita Dukkipati <nanditad@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b9f64820
    • N
      tcp: use windowed min filter library for TCP min_rtt estimation · 64033892
      Neal Cardwell 提交于
      Refactor the TCP min_rtt code to reuse the new win_minmax library in
      lib/win_minmax.c to simplify the TCP code.
      
      This is a pure refactor: the functionality is exactly the same. We
      just moved the windowed min code to make TCP easier to read and
      maintain, and to allow other parts of the kernel to use the windowed
      min/max filter code.
      Signed-off-by: NVan Jacobson <vanj@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNandita Dukkipati <nanditad@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      64033892
  21. 09 9月, 2016 1 次提交
    • Y
      tcp: use an RB tree for ooo receive queue · 9f5afeae
      Yaogong Wang 提交于
      Over the years, TCP BDP has increased by several orders of magnitude,
      and some people are considering to reach the 2 Gbytes limit.
      
      Even with current window scale limit of 14, ~1 Gbytes maps to ~740,000
      MSS.
      
      In presence of packet losses (or reorders), TCP stores incoming packets
      into an out of order queue, and number of skbs sitting there waiting for
      the missing packets to be received can be in the 10^5 range.
      
      Most packets are appended to the tail of this queue, and when
      packets can finally be transferred to receive queue, we scan the queue
      from its head.
      
      However, in presence of heavy losses, we might have to find an arbitrary
      point in this queue, involving a linear scan for every incoming packet,
      throwing away cpu caches.
      
      This patch converts it to a RB tree, to get bounded latencies.
      
      Yaogong wrote a preliminary patch about 2 years ago.
      Eric did the rebase, added ofo_last_skb cache, polishing and tests.
      
      Tested with network dropping between 1 and 10 % packets, with good
      success (about 30 % increase of throughput in stress tests)
      
      Next step would be to also use an RB tree for the write queue at sender
      side ;)
      Signed-off-by: NYaogong Wang <wygivan@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Acked-By: NIlpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9f5afeae
  22. 29 8月, 2016 2 次提交
    • E
      tcp: add tcp_add_backlog() · c9c33212
      Eric Dumazet 提交于
      When TCP operates in lossy environments (between 1 and 10 % packet
      losses), many SACK blocks can be exchanged, and I noticed we could
      drop them on busy senders, if these SACK blocks have to be queued
      into the socket backlog.
      
      While the main cause is the poor performance of RACK/SACK processing,
      we can try to avoid these drops of valuable information that can lead to
      spurious timeouts and retransmits.
      
      Cause of the drops is the skb->truesize overestimation caused by :
      
      - drivers allocating ~2048 (or more) bytes as a fragment to hold an
        Ethernet frame.
      
      - various pskb_may_pull() calls bringing the headers into skb->head
        might have pulled all the frame content, but skb->truesize could
        not be lowered, as the stack has no idea of each fragment truesize.
      
      The backlog drops are also more visible on bidirectional flows, since
      their sk_rmem_alloc can be quite big.
      
      Let's add some room for the backlog, as only the socket owner
      can selectively take action to lower memory needs, like collapsing
      receive queues or partial ofo pruning.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c9c33212
    • T
      tcp: Set read_sock and peek_len proto_ops · 32035585
      Tom Herbert 提交于
      In inet_stream_ops we set read_sock to tcp_read_sock and peek_len to
      tcp_peek_len (which is just a stub function that calls tcp_inq).
      Signed-off-by: NTom Herbert <tom@herbertland.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      32035585