1. 03 10月, 2013 1 次提交
  2. 01 10月, 2013 1 次提交
    • E
      tcp: TSQ can use a dynamic limit · c9eeec26
      Eric Dumazet 提交于
      When TCP Small Queues was added, we used a sysctl to limit amount of
      packets queues on Qdisc/device queues for a given TCP flow.
      
      Problem is this limit is either too big for low rates, or too small
      for high rates.
      
      Now TCP stack has rate estimation in sk->sk_pacing_rate, and TSO
      auto sizing, it can better control number of packets in Qdisc/device
      queues.
      
      New limit is two packets or at least 1 to 2 ms worth of packets.
      
      Low rates flows benefit from this patch by having even smaller
      number of packets in queues, allowing for faster recovery,
      better RTT estimations.
      
      High rates flows benefit from this patch by allowing more than 2 packets
      in flight as we had reports this was a limiting factor to reach line
      rate. [ In particular if TX completion is delayed because of coalescing
      parameters ]
      
      Example for a single flow on 10Gbp link controlled by FQ/pacing
      
      14 packets in flight instead of 2
      
      $ tc -s -d qd
      qdisc fq 8001: dev eth0 root refcnt 32 limit 10000p flow_limit 100p
      buckets 1024 quantum 3028 initial_quantum 15140
       Sent 1168459366606 bytes 771822841 pkt (dropped 0, overlimits 0
      requeues 6822476)
       rate 9346Mbit 771713pps backlog 953820b 14p requeues 6822476
        2047 flow, 2046 inactive, 1 throttled, delay 15673 ns
        2372 gc, 0 highprio, 0 retrans, 9739249 throttled, 0 flows_plimit
      
      Note that sk_pacing_rate is currently set to twice the actual rate, but
      this might be refined in the future when a flow is in congestion
      avoidance.
      
      Additional change : skb->destructor should be set to tcp_wfree().
      
      A future patch (for linux 3.13+) might remove tcp_limit_output_bytes
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Wei Liu <wei.liu2@citrix.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c9eeec26
  3. 31 8月, 2013 1 次提交
  4. 30 8月, 2013 2 次提交
    • E
      tcp: TSO packets automatic sizing · 95bd09eb
      Eric Dumazet 提交于
      After hearing many people over past years complaining against TSO being
      bursty or even buggy, we are proud to present automatic sizing of TSO
      packets.
      
      One part of the problem is that tcp_tso_should_defer() uses an heuristic
      relying on upcoming ACKS instead of a timer, but more generally, having
      big TSO packets makes little sense for low rates, as it tends to create
      micro bursts on the network, and general consensus is to reduce the
      buffering amount.
      
      This patch introduces a per socket sk_pacing_rate, that approximates
      the current sending rate, and allows us to size the TSO packets so
      that we try to send one packet every ms.
      
      This field could be set by other transports.
      
      Patch has no impact for high speed flows, where having large TSO packets
      makes sense to reach line rate.
      
      For other flows, this helps better packet scheduling and ACK clocking.
      
      This patch increases performance of TCP flows in lossy environments.
      
      A new sysctl (tcp_min_tso_segs) is added, to specify the
      minimal size of a TSO packet (default being 2).
      
      A follow-up patch will provide a new packet scheduler (FQ), using
      sk_pacing_rate as an input to perform optional per flow pacing.
      
      This explains why we chose to set sk_pacing_rate to twice the current
      rate, allowing 'slow start' ramp up.
      
      sk_pacing_rate = 2 * cwnd * mss / srtt
      
      v2: Neal Cardwell reported a suspect deferring of last two segments on
      initial write of 10 MSS, I had to change tcp_tso_should_defer() to take
      into account tp->xmit_size_goal_segs
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Van Jacobson <vanj@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      95bd09eb
    • A
      tcp: initialize rcv_tstamp for restored sockets · c7781a6e
      Andrew Vagin 提交于
      u32 rcv_tstamp;     /* timestamp of last received ACK */
      
      Its value used in tcp_retransmit_timer, which closes socket
      if the last ack was received more then TCP_RTO_MAX ago.
      
      Currently rcv_tstamp is initialized to zero and if tcp_retransmit_timer
      is called before receiving a first ack, the connection is closed.
      
      This patch initializes rcv_tstamp to a timestamp, when a socket was
      restored.
      
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      Cc: James Morris <jmorris@namei.org>
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Cc: Patrick McHardy <kaber@trash.net>
      Reported-by: NCyrill Gorcunov <gorcunov@openvz.org>
      Signed-off-by: NAndrey Vagin <avagin@openvz.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c7781a6e
  5. 25 7月, 2013 1 次提交
    • E
      tcp: TCP_NOTSENT_LOWAT socket option · c9bee3b7
      Eric Dumazet 提交于
      Idea of this patch is to add optional limitation of number of
      unsent bytes in TCP sockets, to reduce usage of kernel memory.
      
      TCP receiver might announce a big window, and TCP sender autotuning
      might allow a large amount of bytes in write queue, but this has little
      performance impact if a large part of this buffering is wasted :
      
      Write queue needs to be large only to deal with large BDP, not
      necessarily to cope with scheduling delays (incoming ACKS make room
      for the application to queue more bytes)
      
      For most workloads, using a value of 128 KB or less is OK to give
      applications enough time to react to POLLOUT events in time
      (or being awaken in a blocking sendmsg())
      
      This patch adds two ways to set the limit :
      
      1) Per socket option TCP_NOTSENT_LOWAT
      
      2) A sysctl (/proc/sys/net/ipv4/tcp_notsent_lowat) for sockets
      not using TCP_NOTSENT_LOWAT socket option (or setting a zero value)
      Default value being UINT_MAX (0xFFFFFFFF), meaning this has no effect.
      
      This changes poll()/select()/epoll() to report POLLOUT
      only if number of unsent bytes is below tp->nosent_lowat
      
      Note this might increase number of sendmsg()/sendfile() calls
      when using non blocking sockets,
      and increase number of context switches for blocking sockets.
      
      Note this is not related to SO_SNDLOWAT (as SO_SNDLOWAT is
      defined as :
       Specify the minimum number of bytes in the buffer until
       the socket layer will pass the data to the protocol)
      
      Tested:
      
      netperf sessions, and watching /proc/net/protocols "memory" column for TCP
      
      With 200 concurrent netperf -t TCP_STREAM sessions, amount of kernel memory
      used by TCP buffers shrinks by ~55 % (20567 pages instead of 45458)
      
      lpq83:~# echo -1 >/proc/sys/net/ipv4/tcp_notsent_lowat
      lpq83:~# (super_netperf 200 -t TCP_STREAM -H remote -l 90 &); sleep 60 ; grep TCP /proc/net/protocols
      TCPv6     1880      2   45458   no     208   yes  ipv6        y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y
      TCP       1696    508   45458   no     208   yes  kernel      y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y
      
      lpq83:~# echo 131072 >/proc/sys/net/ipv4/tcp_notsent_lowat
      lpq83:~# (super_netperf 200 -t TCP_STREAM -H remote -l 90 &); sleep 60 ; grep TCP /proc/net/protocols
      TCPv6     1880      2   20567   no     208   yes  ipv6        y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y
      TCP       1696    508   20567   no     208   yes  kernel      y  y  y  y  y  y  y  y  y  y  y  y  y  n  y  y  y  y  y
      
      Using 128KB has no bad effect on the throughput or cpu usage
      of a single flow, although there is an increase of context switches.
      
      A bonus is that we hold socket lock for a shorter amount
      of time and should improve latencies of ACK processing.
      
      lpq83:~# echo -1 >/proc/sys/net/ipv4/tcp_notsent_lowat
      lpq83:~# perf stat -e context-switches ./netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3
      OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : +/-2.500% @ 99% conf.
      Local       Remote      Local  Elapsed Throughput Throughput  Local Local  Remote Remote Local   Remote  Service
      Send Socket Recv Socket Send   Time               Units       CPU   CPU    CPU    CPU    Service Service Demand
      Size        Size        Size   (sec)                          Util  Util   Util   Util   Demand  Demand  Units
      Final       Final                                             %     Method %      Method
      1651584     6291456     16384  20.00   17447.90   10^6bits/s  3.13  S      -1.00  U      0.353   -1.000  usec/KB
      
       Performance counter stats for './netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3':
      
                 412,514 context-switches
      
           200.034645535 seconds time elapsed
      
      lpq83:~# echo 131072 >/proc/sys/net/ipv4/tcp_notsent_lowat
      lpq83:~# perf stat -e context-switches ./netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3
      OMNI Send TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.7.84 () port 0 AF_INET : +/-2.500% @ 99% conf.
      Local       Remote      Local  Elapsed Throughput Throughput  Local Local  Remote Remote Local   Remote  Service
      Send Socket Recv Socket Send   Time               Units       CPU   CPU    CPU    CPU    Service Service Demand
      Size        Size        Size   (sec)                          Util  Util   Util   Util   Demand  Demand  Units
      Final       Final                                             %     Method %      Method
      1593240     6291456     16384  20.00   17321.16   10^6bits/s  3.35  S      -1.00  U      0.381   -1.000  usec/KB
      
       Performance counter stats for './netperf -H 7.7.7.84 -t omni -l 20 -c -i10,3':
      
               2,675,818 context-switches
      
           200.029651391 seconds time elapsed
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-By: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c9bee3b7
  6. 13 7月, 2013 1 次提交
  7. 20 6月, 2013 2 次提交
    • C
      tcp: introduce a per-route knob for quick ack · bcefe17c
      Cong Wang 提交于
      In previous discussions, I tried to find some reasonable heuristics
      for delayed ACK, however this seems not possible, according to Eric:
      
      	"ACKS might also be delayed because of bidirectional
      	traffic, and is more controlled by the application
      	response time. TCP stack can not easily estimate it."
      
      	"ACK can be incredibly useful to recover from losses in
      	a short time.
      
      	The vast majority of TCP sessions are small lived, and we
      	send one ACK per received segment anyway at beginning or
      	retransmits to let the sender smoothly increase its cwnd,
      	so an auto-tuning facility wont help them that much."
      
      and according to David:
      
      	"ACKs are the only information we have to detect loss.
      
      	And, for the same reasons that TCP VEGAS is fundamentally
      	broken, we cannot measure the pipe or some other
      	receiver-side-visible piece of information to determine
      	when it's "safe" to stretch ACK.
      
      	And even if it's "safe", we should not do it so that losses are
      	accurately detected and we don't spuriously retransmit.
      
      	The only way to know when the bandwidth increases is to
      	"test" it, by sending more and more packets until drops happen.
      	That's why all successful congestion control algorithms must
      	operate on explicited tested pieces of information.
      
      	Similarly, it's not really possible to universally know if
      	it's safe to stretch ACK or not."
      
      It still makes sense to enable or disable quick ack mode like
      what TCP_QUICK_ACK does.
      
      Similar to TCP_QUICK_ACK option, but for people who can't
      modify the source code and still wants to control
      TCP delayed ACK behavior. As David suggested, this should belong
      to per-path scope, since different pathes may want different
      behaviors.
      
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Rick Jones <rick.jones2@hp.com>
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Thomas Graf <tgraf@suug.ch>
      CC: David Laight <David.Laight@ACULAB.COM>
      Signed-off-by: NCong Wang <amwang@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bcefe17c
    • W
      tcp:typo unset should be unsent · 9ef71e0c
      Weiping Pan 提交于
      Signed-off-by: NWeiping Pan <wpan@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9ef71e0c
  8. 13 6月, 2013 1 次提交
  9. 24 5月, 2013 1 次提交
    • E
      tcp: xps: fix reordering issues · 547669d4
      Eric Dumazet 提交于
      commit 3853b584 ("xps: Improvements in TX queue selection")
      introduced ooo_okay flag, but the condition to set it is slightly wrong.
      
      In our traces, we have seen ACK packets being received out of order,
      and RST packets sent in response.
      
      We should test if we have any packets still in host queue.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      547669d4
  10. 30 4月, 2013 1 次提交
  11. 19 4月, 2013 1 次提交
    • E
      tcp: introduce TCPSpuriousRtxHostQueues SNMP counter · 0e280af0
      Eric Dumazet 提交于
      Host queues (Qdisc + NIC) can hold packets so long that TCP can
      eventually retransmit a packet before the first transmit even left
      the host.
      
      Its not clear right now if we could avoid this in the first place :
      
      - We could arm RTO timer not at the time we enqueue packets, but
        at the time we TX complete them (tcp_wfree())
      
      - Cancel the sending of the new copy of the packet if prior one
        is still in queue.
      
      This patch adds instrumentation so that we can at least see how
      often this problem happens.
      
      TCPSpuriousRtxHostQueues SNMP counter is incremented every time
      we detect the fast clone is not yet freed in tcp_transmit_skb()
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0e280af0
  12. 13 4月, 2013 1 次提交
    • E
      tcp: GSO should be TSQ friendly · d6a4a104
      Eric Dumazet 提交于
      I noticed that TSQ (TCP Small queues) was less effective when TSO is
      turned off, and GSO is on. If BQL is not enabled, TSQ has then no
      effect.
      
      It turns out the GSO engine frees the original gso_skb at the time the
      fragments are generated and queued to the NIC.
      
      We should instead call the tcp_wfree() destructor for the last fragment,
      to keep the flow control as intended in TSQ. This effectively limits
      the number of queued packets on qdisc + NIC layers.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Nandita Dukkipati <nanditad@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d6a4a104
  13. 12 4月, 2013 1 次提交
  14. 10 4月, 2013 1 次提交
  15. 22 3月, 2013 1 次提交
    • E
      tcp: preserve ACK clocking in TSO · f4541d60
      Eric Dumazet 提交于
      A long standing problem with TSO is the fact that tcp_tso_should_defer()
      rearms the deferred timer, while it should not.
      
      Current code leads to following bad bursty behavior :
      
      20:11:24.484333 IP A > B: . 297161:316921(19760) ack 1 win 119
      20:11:24.484337 IP B > A: . ack 263721 win 1117
      20:11:24.485086 IP B > A: . ack 265241 win 1117
      20:11:24.485925 IP B > A: . ack 266761 win 1117
      20:11:24.486759 IP B > A: . ack 268281 win 1117
      20:11:24.487594 IP B > A: . ack 269801 win 1117
      20:11:24.488430 IP B > A: . ack 271321 win 1117
      20:11:24.489267 IP B > A: . ack 272841 win 1117
      20:11:24.490104 IP B > A: . ack 274361 win 1117
      20:11:24.490939 IP B > A: . ack 275881 win 1117
      20:11:24.491775 IP B > A: . ack 277401 win 1117
      20:11:24.491784 IP A > B: . 316921:332881(15960) ack 1 win 119
      20:11:24.492620 IP B > A: . ack 278921 win 1117
      20:11:24.493448 IP B > A: . ack 280441 win 1117
      20:11:24.494286 IP B > A: . ack 281961 win 1117
      20:11:24.495122 IP B > A: . ack 283481 win 1117
      20:11:24.495958 IP B > A: . ack 285001 win 1117
      20:11:24.496791 IP B > A: . ack 286521 win 1117
      20:11:24.497628 IP B > A: . ack 288041 win 1117
      20:11:24.498459 IP B > A: . ack 289561 win 1117
      20:11:24.499296 IP B > A: . ack 291081 win 1117
      20:11:24.500133 IP B > A: . ack 292601 win 1117
      20:11:24.500970 IP B > A: . ack 294121 win 1117
      20:11:24.501388 IP B > A: . ack 295641 win 1117
      20:11:24.501398 IP A > B: . 332881:351881(19000) ack 1 win 119
      
      While the expected behavior is more like :
      
      20:19:49.259620 IP A > B: . 197601:202161(4560) ack 1 win 119
      20:19:49.260446 IP B > A: . ack 154281 win 1212
      20:19:49.261282 IP B > A: . ack 155801 win 1212
      20:19:49.262125 IP B > A: . ack 157321 win 1212
      20:19:49.262136 IP A > B: . 202161:206721(4560) ack 1 win 119
      20:19:49.262958 IP B > A: . ack 158841 win 1212
      20:19:49.263795 IP B > A: . ack 160361 win 1212
      20:19:49.264628 IP B > A: . ack 161881 win 1212
      20:19:49.264637 IP A > B: . 206721:211281(4560) ack 1 win 119
      20:19:49.265465 IP B > A: . ack 163401 win 1212
      20:19:49.265886 IP B > A: . ack 164921 win 1212
      20:19:49.266722 IP B > A: . ack 166441 win 1212
      20:19:49.266732 IP A > B: . 211281:215841(4560) ack 1 win 119
      20:19:49.267559 IP B > A: . ack 167961 win 1212
      20:19:49.268394 IP B > A: . ack 169481 win 1212
      20:19:49.269232 IP B > A: . ack 171001 win 1212
      20:19:49.269241 IP A > B: . 215841:221161(5320) ack 1 win 119
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Van Jacobson <vanj@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Nandita Dukkipati <nanditad@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f4541d60
  16. 21 3月, 2013 1 次提交
    • Y
      tcp: refactor F-RTO · 9b44190d
      Yuchung Cheng 提交于
      The patch series refactor the F-RTO feature (RFC4138/5682).
      
      This is to simplify the loss recovery processing. Existing F-RTO
      was developed during the experimental stage (RFC4138) and has
      many experimental features.  It takes a separate code path from
      the traditional timeout processing by overloading CA_Disorder
      instead of using CA_Loss state. This complicates CA_Disorder state
      handling because it's also used for handling dubious ACKs and undos.
      While the algorithm in the RFC does not change the congestion control,
      the implementation intercepts congestion control in various places
      (e.g., frto_cwnd in tcp_ack()).
      
      The new code implements newer F-RTO RFC5682 using CA_Loss processing
      path.  F-RTO becomes a small extension in the timeout processing
      and interfaces with congestion control and Eifel undo modules.
      It lets congestion control (module) determines how many to send
      independently.  F-RTO only chooses what to send in order to detect
      spurious retranmission. If timeout is found spurious it invokes
      existing Eifel undo algorithms like DSACK or TCP timestamp based
      detection.
      
      The first patch removes all F-RTO code except the sysctl_tcp_frto is
      left for the new implementation.  Since CA_EVENT_FRTO is removed, TCP
      westwood now computes ssthresh on regular timeout CA_EVENT_LOSS event.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9b44190d
  17. 18 3月, 2013 1 次提交
    • C
      tcp: Remove TCPCT · 1a2c6181
      Christoph Paasch 提交于
      TCPCT uses option-number 253, reserved for experimental use and should
      not be used in production environments.
      Further, TCPCT does not fully implement RFC 6013.
      
      As a nice side-effect, removing TCPCT increases TCP's performance for
      very short flows:
      
      Doing an apache-benchmark with -c 100 -n 100000, sending HTTP-requests
      for files of 1KB size.
      
      before this patch:
      	average (among 7 runs) of 20845.5 Requests/Second
      after:
      	average (among 7 runs) of 21403.6 Requests/Second
      Signed-off-by: NChristoph Paasch <christoph.paasch@uclouvain.be>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1a2c6181
  18. 14 3月, 2013 1 次提交
  19. 12 3月, 2013 2 次提交
    • N
      tcp: TLP loss detection. · 9b717a8d
      Nandita Dukkipati 提交于
      This is the second of the TLP patch series; it augments the basic TLP
      algorithm with a loss detection scheme.
      
      This patch implements a mechanism for loss detection when a Tail
      loss probe retransmission plugs a hole thereby masking packet loss
      from the sender. The loss detection algorithm relies on counting
      TLP dupacks as outlined in Sec. 3 of:
      http://tools.ietf.org/html/draft-dukkipati-tcpm-tcp-loss-probe-01
      
      The basic idea is: Sender keeps track of TLP "episode" upon
      retransmission of a TLP packet. An episode ends when the sender receives
      an ACK above the SND.NXT (tracked by tlp_high_seq) at the time of the
      episode. We want to make sure that before the episode ends the sender
      receives a "TLP dupack", indicating that the TLP retransmission was
      unnecessary, so there was no loss/hole that needed plugging. If the
      sender gets no TLP dupack before the end of the episode, then it reduces
      ssthresh and the congestion window, because the TLP packet arriving at
      the receiver probably plugged a hole.
      Signed-off-by: NNandita Dukkipati <nanditad@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9b717a8d
    • N
      tcp: Tail loss probe (TLP) · 6ba8a3b1
      Nandita Dukkipati 提交于
      This patch series implement the Tail loss probe (TLP) algorithm described
      in http://tools.ietf.org/html/draft-dukkipati-tcpm-tcp-loss-probe-01. The
      first patch implements the basic algorithm.
      
      TLP's goal is to reduce tail latency of short transactions. It achieves
      this by converting retransmission timeouts (RTOs) occuring due
      to tail losses (losses at end of transactions) into fast recovery.
      TLP transmits one packet in two round-trips when a connection is in
      Open state and isn't receiving any ACKs. The transmitted packet, aka
      loss probe, can be either new or a retransmission. When there is tail
      loss, the ACK from a loss probe triggers FACK/early-retransmit based
      fast recovery, thus avoiding a costly RTO. In the absence of loss,
      there is no change in the connection state.
      
      PTO stands for probe timeout. It is a timer event indicating
      that an ACK is overdue and triggers a loss probe packet. The PTO value
      is set to max(2*SRTT, 10ms) and is adjusted to account for delayed
      ACK timer when there is only one oustanding packet.
      
      TLP Algorithm
      
      On transmission of new data in Open state:
        -> packets_out > 1: schedule PTO in max(2*SRTT, 10ms).
        -> packets_out == 1: schedule PTO in max(2*RTT, 1.5*RTT + 200ms)
        -> PTO = min(PTO, RTO)
      
      Conditions for scheduling PTO:
        -> Connection is in Open state.
        -> Connection is either cwnd limited or no new data to send.
        -> Number of probes per tail loss episode is limited to one.
        -> Connection is SACK enabled.
      
      When PTO fires:
        new_segment_exists:
          -> transmit new segment.
          -> packets_out++. cwnd remains same.
      
        no_new_packet:
          -> retransmit the last segment.
             Its ACK triggers FACK or early retransmit based recovery.
      
      ACK path:
        -> rearm RTO at start of ACK processing.
        -> reschedule PTO if need be.
      
      In addition, the patch includes a small variation to the Early Retransmit
      (ER) algorithm, such that ER and TLP together can in principle recover any
      N-degree of tail loss through fast recovery. TLP is controlled by the same
      sysctl as ER, tcp_early_retrans sysctl.
      tcp_early_retrans==0; disables TLP and ER.
      		 ==1; enables RFC5827 ER.
      		 ==2; delayed ER.
      		 ==3; TLP and delayed ER. [DEFAULT]
      		 ==4; TLP only.
      
      The TLP patch series have been extensively tested on Google Web servers.
      It is most effective for short Web trasactions, where it reduced RTOs by 15%
      and improved HTTP response time (average by 6%, 99th percentile by 10%).
      The transmitted probes account for <0.5% of the overall transmissions.
      Signed-off-by: NNandita Dukkipati <nanditad@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6ba8a3b1
  20. 23 2月, 2013 1 次提交
  21. 16 2月, 2013 1 次提交
  22. 14 2月, 2013 2 次提交
    • P
      net: Fix possible wrong checksum generation. · c9af6db4
      Pravin B Shelar 提交于
      Patch cef401de (net: fix possible wrong checksum
      generation) fixed wrong checksum calculation but it broke TSO by
      defining new GSO type but not a netdev feature for that type.
      net_gso_ok() would not allow hardware checksum/segmentation
      offload of such packets without the feature.
      
      Following patch fixes TSO and wrong checksum. This patch uses
      same logic that Eric Dumazet used. Patch introduces new flag
      SKBTX_SHARED_FRAG if at least one frag can be modified by
      the user. but SKBTX_SHARED_FRAG flag is kept in skb shared
      info tx_flags rather than gso_type.
      
      tx_flags is better compared to gso_type since we can have skb with
      shared frag without gso packet. It does not link SHARED_FRAG to
      GSO, So there is no need to define netdev feature for this.
      Signed-off-by: NPravin B Shelar <pshelar@nicira.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c9af6db4
    • A
      tcp: send packets with a socket timestamp · ee684b6f
      Andrey Vagin 提交于
      A socket timestamp is a sum of the global tcp_time_stamp and
      a per-socket offset.
      
      A socket offset is added in places where externally visible
      tcp timestamp option is parsed/initialized.
      
      Connections in the SYN_RECV state are not supported, global
      tcp_time_stamp is used for them, because repair mode doesn't support
      this state. In a future it can be implemented by the similar way
      as for TIME_WAIT sockets.
      
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      Cc: James Morris <jmorris@namei.org>
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Cc: Patrick McHardy <kaber@trash.net>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Signed-off-by: NAndrey Vagin <avagin@openvz.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ee684b6f
  23. 28 1月, 2013 1 次提交
    • E
      net: fix possible wrong checksum generation · cef401de
      Eric Dumazet 提交于
      Pravin Shelar mentioned that GSO could potentially generate
      wrong TX checksum if skb has fragments that are overwritten
      by the user between the checksum computation and transmit.
      
      He suggested to linearize skbs but this extra copy can be
      avoided for normal tcp skbs cooked by tcp_sendmsg().
      
      This patch introduces a new SKB_GSO_SHARED_FRAG flag, set
      in skb_shinfo(skb)->gso_type if at least one frag can be
      modified by the user.
      
      Typical sources of such possible overwrites are {vm}splice(),
      sendfile(), and macvtap/tun/virtio_net drivers.
      
      Tested:
      
      $ netperf -H 7.7.8.84
      MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
      7.7.8.84 () port 0 AF_INET
      Recv   Send    Send
      Socket Socket  Message  Elapsed
      Size   Size    Size     Time     Throughput
      bytes  bytes   bytes    secs.    10^6bits/sec
      
       87380  16384  16384    10.00    3959.52
      
      $ netperf -H 7.7.8.84 -t TCP_SENDFILE
      TCP SENDFILE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 ()
      port 0 AF_INET
      Recv   Send    Send
      Socket Socket  Message  Elapsed
      Size   Size    Size     Time     Throughput
      bytes  bytes   bytes    secs.    10^6bits/sec
      
       87380  16384  16384    10.00    3216.80
      
      Performance of the SENDFILE is impacted by the extra allocation and
      copy, and because we use order-0 pages, while the TCP_STREAM uses
      bigger pages.
      Reported-by: NPravin Shelar <pshelar@nicira.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cef401de
  24. 07 1月, 2013 1 次提交
  25. 08 12月, 2012 1 次提交
    • Y
      tcp: bug fix Fast Open client retransmission · 93b174ad
      Yuchung Cheng 提交于
      If SYN-ACK partially acks SYN-data, the client retransmits the
      remaining data by tcp_retransmit_skb(). This increments lost recovery
      state variables like tp->retrans_out in Open state. If loss recovery
      happens before the retransmission is acked, it triggers the WARN_ON
      check in tcp_fastretrans_alert(). For example: the client sends
      SYN-data, gets SYN-ACK acking only ISN, retransmits data, sends
      another 4 data packets and get 3 dupacks.
      
      Since the retransmission is not caused by network drop it should not
      update the recovery state variables. Further the server may return a
      smaller MSS than the cached MSS used for SYN-data, so the retranmission
      needs a loop. Otherwise some data will not be retransmitted until timeout
      or other loss recovery events.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      93b174ad
  26. 23 11月, 2012 1 次提交
  27. 16 11月, 2012 1 次提交
    • A
      tcp: fix retransmission in repair mode · ec342325
      Andrew Vagin 提交于
      Currently if a socket was repaired with a few packet in a write queue,
      a kernel bug may be triggered:
      
      kernel BUG at net/ipv4/tcp_output.c:2330!
      RIP: 0010:[<ffffffff8155784f>] tcp_retransmit_skb+0x5ff/0x610
      
      According to the initial realization v3.4-rc2-963-gc0e88ff0,
      all skb-s should look like already posted. This patch fixes code
      according with this sentence.
      
      Here are three points, which were not done in the initial patch:
      1. A tcp send head should not be changed
      2. Initialize TSO state of a skb
      3. Reset the retransmission time
      
      This patch moves logic from tcp_sendmsg to tcp_write_xmit. A packet
      passes the ussual way, but isn't sent to network. This patch solves
      all described problems and handles tcp_sendpages.
      
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      Cc: James Morris <jmorris@namei.org>
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Cc: Patrick McHardy <kaber@trash.net>
      Signed-off-by: NAndrey Vagin <avagin@openvz.org>
      Acked-by: NPavel Emelyanov <xemul@parallels.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ec342325
  28. 04 9月, 2012 1 次提交
    • Y
      tcp: use PRR to reduce cwin in CWR state · 684bad11
      Yuchung Cheng 提交于
      Use proportional rate reduction (PRR) algorithm to reduce cwnd in CWR state,
      in addition to Recovery state. Retire the current rate-halving in CWR.
      When losses are detected via ACKs in CWR state, the sender enters Recovery
      state but the cwnd reduction continues and does not restart.
      
      Rename and refactor cwnd reduction functions since both CWR and Recovery
      use the same algorithm:
      tcp_init_cwnd_reduction() is new and initiates reduction state variables.
      tcp_cwnd_reduction() is previously tcp_update_cwnd_in_recovery().
      tcp_ends_cwnd_reduction() is previously  tcp_complete_cwr().
      
      The rate halving functions and logic such as tcp_cwnd_down(), tcp_min_cwnd(),
      and the cwnd moderation inside tcp_enter_cwr() are removed. The unused
      parameter, flag, in tcp_cwnd_reduction() is also removed.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      684bad11
  29. 01 9月, 2012 1 次提交
    • J
      tcp: TCP Fast Open Server - support TFO listeners · 8336886f
      Jerry Chu 提交于
      This patch builds on top of the previous patch to add the support
      for TFO listeners. This includes -
      
      1. allocating, properly initializing, and managing the per listener
      fastopen_queue structure when TFO is enabled
      
      2. changes to the inet_csk_accept code to support TFO. E.g., the
      request_sock can no longer be freed upon accept(), not until 3WHS
      finishes
      
      3. allowing a TCP_SYN_RECV socket to properly poll() and sendmsg()
      if it's a TFO socket
      
      4. properly closing a TFO listener, and a TFO socket before 3WHS
      finishes
      
      5. supporting TCP_FASTOPEN socket option
      
      6. modifying tcp_check_req() to use to check a TFO socket as well
      as request_sock
      
      7. supporting TCP's TFO cookie option
      
      8. adding a new SYN-ACK retransmit handler to use the timer directly
      off the TFO socket rather than the listener socket. Note that TFO
      server side will not retransmit anything other than SYN-ACK until
      the 3WHS is completed.
      
      The patch also contains an important function
      "reqsk_fastopen_remove()" to manage the somewhat complex relation
      between a listener, its request_sock, and the corresponding child
      socket. See the comment above the function for the detail.
      Signed-off-by: NH.K. Jerry Chu <hkchu@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8336886f
  30. 22 8月, 2012 1 次提交
    • E
      tcp: fix possible socket refcount problem · 144d56e9
      Eric Dumazet 提交于
      Commit 6f458dfb (tcp: improve latencies of timer triggered events)
      added bug leading to following trace :
      
      [ 2866.131281] IPv4: Attempt to release TCP socket in state 1 ffff880019ec0000
      [ 2866.131726]
      [ 2866.132188] =========================
      [ 2866.132281] [ BUG: held lock freed! ]
      [ 2866.132281] 3.6.0-rc1+ #622 Not tainted
      [ 2866.132281] -------------------------
      [ 2866.132281] kworker/0:1/652 is freeing memory ffff880019ec0000-ffff880019ec0a1f, with a lock still held there!
      [ 2866.132281]  (sk_lock-AF_INET-RPC){+.+...}, at: [<ffffffff81903619>] tcp_sendmsg+0x29/0xcc6
      [ 2866.132281] 4 locks held by kworker/0:1/652:
      [ 2866.132281]  #0:  (rpciod){.+.+.+}, at: [<ffffffff81083567>] process_one_work+0x1de/0x47f
      [ 2866.132281]  #1:  ((&task->u.tk_work)){+.+.+.}, at: [<ffffffff81083567>] process_one_work+0x1de/0x47f
      [ 2866.132281]  #2:  (sk_lock-AF_INET-RPC){+.+...}, at: [<ffffffff81903619>] tcp_sendmsg+0x29/0xcc6
      [ 2866.132281]  #3:  (&icsk->icsk_retransmit_timer){+.-...}, at: [<ffffffff81078017>] run_timer_softirq+0x1ad/0x35f
      [ 2866.132281]
      [ 2866.132281] stack backtrace:
      [ 2866.132281] Pid: 652, comm: kworker/0:1 Not tainted 3.6.0-rc1+ #622
      [ 2866.132281] Call Trace:
      [ 2866.132281]  <IRQ>  [<ffffffff810bc527>] debug_check_no_locks_freed+0x112/0x159
      [ 2866.132281]  [<ffffffff818a0839>] ? __sk_free+0xfd/0x114
      [ 2866.132281]  [<ffffffff811549fa>] kmem_cache_free+0x6b/0x13a
      [ 2866.132281]  [<ffffffff818a0839>] __sk_free+0xfd/0x114
      [ 2866.132281]  [<ffffffff818a08c0>] sk_free+0x1c/0x1e
      [ 2866.132281]  [<ffffffff81911e1c>] tcp_write_timer+0x51/0x56
      [ 2866.132281]  [<ffffffff81078082>] run_timer_softirq+0x218/0x35f
      [ 2866.132281]  [<ffffffff81078017>] ? run_timer_softirq+0x1ad/0x35f
      [ 2866.132281]  [<ffffffff810f5831>] ? rb_commit+0x58/0x85
      [ 2866.132281]  [<ffffffff81911dcb>] ? tcp_write_timer_handler+0x148/0x148
      [ 2866.132281]  [<ffffffff81070bd6>] __do_softirq+0xcb/0x1f9
      [ 2866.132281]  [<ffffffff81a0a00c>] ? _raw_spin_unlock+0x29/0x2e
      [ 2866.132281]  [<ffffffff81a1227c>] call_softirq+0x1c/0x30
      [ 2866.132281]  [<ffffffff81039f38>] do_softirq+0x4a/0xa6
      [ 2866.132281]  [<ffffffff81070f2b>] irq_exit+0x51/0xad
      [ 2866.132281]  [<ffffffff81a129cd>] do_IRQ+0x9d/0xb4
      [ 2866.132281]  [<ffffffff81a0a3ef>] common_interrupt+0x6f/0x6f
      [ 2866.132281]  <EOI>  [<ffffffff8109d006>] ? sched_clock_cpu+0x58/0xd1
      [ 2866.132281]  [<ffffffff81a0a172>] ? _raw_spin_unlock_irqrestore+0x4c/0x56
      [ 2866.132281]  [<ffffffff81078692>] mod_timer+0x178/0x1a9
      [ 2866.132281]  [<ffffffff818a00aa>] sk_reset_timer+0x19/0x26
      [ 2866.132281]  [<ffffffff8190b2cc>] tcp_rearm_rto+0x99/0xa4
      [ 2866.132281]  [<ffffffff8190dfba>] tcp_event_new_data_sent+0x6e/0x70
      [ 2866.132281]  [<ffffffff8190f7ea>] tcp_write_xmit+0x7de/0x8e4
      [ 2866.132281]  [<ffffffff818a565d>] ? __alloc_skb+0xa0/0x1a1
      [ 2866.132281]  [<ffffffff8190f952>] __tcp_push_pending_frames+0x2e/0x8a
      [ 2866.132281]  [<ffffffff81904122>] tcp_sendmsg+0xb32/0xcc6
      [ 2866.132281]  [<ffffffff819229c2>] inet_sendmsg+0xaa/0xd5
      [ 2866.132281]  [<ffffffff81922918>] ? inet_autobind+0x5f/0x5f
      [ 2866.132281]  [<ffffffff810ee7f1>] ? trace_clock_local+0x9/0xb
      [ 2866.132281]  [<ffffffff8189adab>] sock_sendmsg+0xa3/0xc4
      [ 2866.132281]  [<ffffffff810f5de6>] ? rb_reserve_next_event+0x26f/0x2d5
      [ 2866.132281]  [<ffffffff8103e6a9>] ? native_sched_clock+0x29/0x6f
      [ 2866.132281]  [<ffffffff8103e6f8>] ? sched_clock+0x9/0xd
      [ 2866.132281]  [<ffffffff810ee7f1>] ? trace_clock_local+0x9/0xb
      [ 2866.132281]  [<ffffffff8189ae03>] kernel_sendmsg+0x37/0x43
      [ 2866.132281]  [<ffffffff8199ce49>] xs_send_kvec+0x77/0x80
      [ 2866.132281]  [<ffffffff8199cec1>] xs_sendpages+0x6f/0x1a0
      [ 2866.132281]  [<ffffffff8107826d>] ? try_to_del_timer_sync+0x55/0x61
      [ 2866.132281]  [<ffffffff8199d0d2>] xs_tcp_send_request+0x55/0xf1
      [ 2866.132281]  [<ffffffff8199bb90>] xprt_transmit+0x89/0x1db
      [ 2866.132281]  [<ffffffff81999bcd>] ? call_connect+0x3c/0x3c
      [ 2866.132281]  [<ffffffff81999d92>] call_transmit+0x1c5/0x20e
      [ 2866.132281]  [<ffffffff819a0d55>] __rpc_execute+0x6f/0x225
      [ 2866.132281]  [<ffffffff81999bcd>] ? call_connect+0x3c/0x3c
      [ 2866.132281]  [<ffffffff819a0f33>] rpc_async_schedule+0x28/0x34
      [ 2866.132281]  [<ffffffff810835d6>] process_one_work+0x24d/0x47f
      [ 2866.132281]  [<ffffffff81083567>] ? process_one_work+0x1de/0x47f
      [ 2866.132281]  [<ffffffff819a0f0b>] ? __rpc_execute+0x225/0x225
      [ 2866.132281]  [<ffffffff81083a6d>] worker_thread+0x236/0x317
      [ 2866.132281]  [<ffffffff81083837>] ? process_scheduled_works+0x2f/0x2f
      [ 2866.132281]  [<ffffffff8108b7b8>] kthread+0x9a/0xa2
      [ 2866.132281]  [<ffffffff81a12184>] kernel_thread_helper+0x4/0x10
      [ 2866.132281]  [<ffffffff81a0a4b0>] ? retint_restore_args+0x13/0x13
      [ 2866.132281]  [<ffffffff8108b71e>] ? __init_kthread_worker+0x5a/0x5a
      [ 2866.132281]  [<ffffffff81a12180>] ? gs_change+0x13/0x13
      [ 2866.308506] IPv4: Attempt to release TCP socket in state 1 ffff880019ec0000
      [ 2866.309689] =============================================================================
      [ 2866.310254] BUG TCP (Not tainted): Object already free
      [ 2866.310254] -----------------------------------------------------------------------------
      [ 2866.310254]
      
      The bug comes from the fact that timer set in sk_reset_timer() can run
      before we actually do the sock_hold(). socket refcount reaches zero and
      we free the socket too soon.
      
      timer handler is not allowed to reduce socket refcnt if socket is owned
      by the user, or we need to change sk_reset_timer() implementation.
      
      We should take a reference on the socket in case TCP_DELACK_TIMER_DEFERRED
      or TCP_DELACK_TIMER_DEFERRED bit are set in tsq_flags
      
      Also fix a typo in tcp_delack_timer(), where TCP_WRITE_TIMER_DEFERRED
      was used instead of TCP_DELACK_TIMER_DEFERRED.
      
      For consistency, use same socket refcount change for TCP_MTU_REDUCED_DEFERRED,
      even if not fired from a timer.
      Reported-by: NFengguang Wu <fengguang.wu@intel.com>
      Tested-by: NFengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      144d56e9
  31. 07 8月, 2012 1 次提交
  32. 02 8月, 2012 1 次提交
  33. 01 8月, 2012 1 次提交
  34. 23 7月, 2012 1 次提交
    • E
      tcp: dont drop MTU reduction indications · 563d34d0
      Eric Dumazet 提交于
      ICMP messages generated in output path if frame length is bigger than
      mtu are actually lost because socket is owned by user (doing the xmit)
      
      One example is the ipgre_tunnel_xmit() calling
      icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED, htonl(mtu));
      
      We had a similar case fixed in commit a34a101e (ipv6: disable GSO on
      sockets hitting dst_allfrag).
      
      Problem of such fix is that it relied on retransmit timers, so short tcp
      sessions paid a too big latency increase price.
      
      This patch uses the tcp_release_cb() infrastructure so that MTU
      reduction messages (ICMP messages) are not lost, and no extra delay
      is added in TCP transmits.
      Reported-by: NMaciej Żenczykowski <maze@google.com>
      Diagnosed-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Nandita Dukkipati <nanditad@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Tore Anderson <tore@fud.no>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      563d34d0
  35. 21 7月, 2012 1 次提交
    • E
      tcp: improve latencies of timer triggered events · 6f458dfb
      Eric Dumazet 提交于
      Modern TCP stack highly depends on tcp_write_timer() having a small
      latency, but current implementation doesn't exactly meet the
      expectations.
      
      When a timer fires but finds the socket is owned by the user, it rearms
      itself for an additional delay hoping next run will be more
      successful.
      
      tcp_write_timer() for example uses a 50ms delay for next try, and it
      defeats many attempts to get predictable TCP behavior in term of
      latencies.
      
      Use the recently introduced tcp_release_cb(), so that the user owning
      the socket will call various handlers right before socket release.
      
      This will permit us to post a followup patch to address the
      tcp_tso_should_defer() syndrome (some deferred packets have to wait
      RTO timer to be transmitted, while cwnd should allow us to send them
      sooner)
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Nandita Dukkipati <nanditad@google.com>
      Cc: H.K. Jerry Chu <hkchu@google.com>
      Cc: John Heffner <johnwheffner@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6f458dfb
  36. 20 7月, 2012 1 次提交