1. 20 5月, 2013 1 次提交
    • Y
      tcp: remove bad timeout logic in fast recovery · 3e59cb0d
      Yuchung Cheng 提交于
      tcp_timeout_skb() was intended to trigger fast recovery on timeout,
      unfortunately in reality it often causes spurious retransmission
      storms during fast recovery. The particular sign is a fast retransmit
      over the highest sacked sequence (SND.FACK).
      
      Currently the RTO timer re-arming (as in RFC6298) offers a nice cushion
      to avoid spurious timeout: when SND.UNA advances the sender re-arms
      RTO and extends the timeout by icsk_rto. The sender does not offset
      the time elapsed since the packet at SND.UNA was sent.
      
      But if the next (DUP)ACK arrives later than ~RTTVAR and triggers
      tcp_fastretrans_alert(), then tcp_timeout_skb() will mark any packet
      sent before the icsk_rto interval lost, including one that's above the
      highest sacked sequence. Most likely a large part of scorebard will be
      marked.
      
      If most packets are not lost then the subsequent DUPACKs with new SACK
      blocks will cause the sender to continue to retransmit packets beyond
      SND.FACK spuriously. Even if only one packet is lost the sender may
      falsely retransmit almost the entire window.
      
      The situation becomes common in the world of bufferbloat: the RTT
      continues to grow as the queue builds up but RTTVAR remains small and
      close to the minimum 200ms. If a data packet is lost and the DUPACK
      triggered by the next data packet is slightly delayed, then a spurious
      retransmission storm forms.
      
      As the original comment on tcp_timeout_skb() suggests: the usefulness
      of this feature is questionable. It also wastes cycles walking the
      sack scoreboard and is actually harmful because of false recovery.
      
      It's time to remove this.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Acked-by: NNandita Dukkipati <nanditad@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3e59cb0d
  2. 21 3月, 2013 2 次提交
    • Y
      tcp: implement RFC5682 F-RTO · e33099f9
      Yuchung Cheng 提交于
      This patch implements F-RTO (foward RTO recovery):
      
      When the first retransmission after timeout is acknowledged, F-RTO
      sends new data instead of old data. If the next ACK acknowledges
      some never-retransmitted data, then the timeout was spurious and the
      congestion state is reverted.  Otherwise if the next ACK selectively
      acknowledges the new data, then the timeout was genuine and the
      loss recovery continues. This idea applies to recurring timeouts
      as well. While F-RTO sends different data during timeout recovery,
      it does not (and should not) change the congestion control.
      
      The implementaion follows the three steps of SACK enhanced algorithm
      (section 3) in RFC5682. Step 1 is in tcp_enter_loss(). Step 2 and
      3 are in tcp_process_loss().  The basic version is not supported
      because SACK enhanced version also works for non-SACK connections.
      
      The new implementation is functionally in parity with the old F-RTO
      implementation except the one case where it increases undo events:
      In addition to the RFC algorithm, a spurious timeout may be detected
      without sending data in step 2, as long as the SACK confirms not
      all the original data are dropped. When this happens, the sender
      will undo the cwnd and perhaps enter fast recovery instead. This
      additional check increases the F-RTO undo events by 5x compared
      to the prior implementation on Google Web servers, since the sender
      often does not have new data to send for HTTP.
      
      Note F-RTO may detect spurious timeout before Eifel with timestamps
      does so.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e33099f9
    • Y
      tcp: refactor F-RTO · 9b44190d
      Yuchung Cheng 提交于
      The patch series refactor the F-RTO feature (RFC4138/5682).
      
      This is to simplify the loss recovery processing. Existing F-RTO
      was developed during the experimental stage (RFC4138) and has
      many experimental features.  It takes a separate code path from
      the traditional timeout processing by overloading CA_Disorder
      instead of using CA_Loss state. This complicates CA_Disorder state
      handling because it's also used for handling dubious ACKs and undos.
      While the algorithm in the RFC does not change the congestion control,
      the implementation intercepts congestion control in various places
      (e.g., frto_cwnd in tcp_ack()).
      
      The new code implements newer F-RTO RFC5682 using CA_Loss processing
      path.  F-RTO becomes a small extension in the timeout processing
      and interfaces with congestion control and Eifel undo modules.
      It lets congestion control (module) determines how many to send
      independently.  F-RTO only chooses what to send in order to detect
      spurious retranmission. If timeout is found spurious it invokes
      existing Eifel undo algorithms like DSACK or TCP timestamp based
      detection.
      
      The first patch removes all F-RTO code except the sysctl_tcp_frto is
      left for the new implementation.  Since CA_EVENT_FRTO is removed, TCP
      westwood now computes ssthresh on regular timeout CA_EVENT_LOSS event.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9b44190d
  3. 18 3月, 2013 1 次提交
    • C
      tcp: Remove TCPCT · 1a2c6181
      Christoph Paasch 提交于
      TCPCT uses option-number 253, reserved for experimental use and should
      not be used in production environments.
      Further, TCPCT does not fully implement RFC 6013.
      
      As a nice side-effect, removing TCPCT increases TCP's performance for
      very short flows:
      
      Doing an apache-benchmark with -c 100 -n 100000, sending HTTP-requests
      for files of 1KB size.
      
      before this patch:
      	average (among 7 runs) of 20845.5 Requests/Second
      after:
      	average (among 7 runs) of 21403.6 Requests/Second
      Signed-off-by: NChristoph Paasch <christoph.paasch@uclouvain.be>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1a2c6181
  4. 12 3月, 2013 2 次提交
    • N
      tcp: TLP loss detection. · 9b717a8d
      Nandita Dukkipati 提交于
      This is the second of the TLP patch series; it augments the basic TLP
      algorithm with a loss detection scheme.
      
      This patch implements a mechanism for loss detection when a Tail
      loss probe retransmission plugs a hole thereby masking packet loss
      from the sender. The loss detection algorithm relies on counting
      TLP dupacks as outlined in Sec. 3 of:
      http://tools.ietf.org/html/draft-dukkipati-tcpm-tcp-loss-probe-01
      
      The basic idea is: Sender keeps track of TLP "episode" upon
      retransmission of a TLP packet. An episode ends when the sender receives
      an ACK above the SND.NXT (tracked by tlp_high_seq) at the time of the
      episode. We want to make sure that before the episode ends the sender
      receives a "TLP dupack", indicating that the TLP retransmission was
      unnecessary, so there was no loss/hole that needed plugging. If the
      sender gets no TLP dupack before the end of the episode, then it reduces
      ssthresh and the congestion window, because the TLP packet arriving at
      the receiver probably plugged a hole.
      Signed-off-by: NNandita Dukkipati <nanditad@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9b717a8d
    • N
      tcp: Tail loss probe (TLP) · 6ba8a3b1
      Nandita Dukkipati 提交于
      This patch series implement the Tail loss probe (TLP) algorithm described
      in http://tools.ietf.org/html/draft-dukkipati-tcpm-tcp-loss-probe-01. The
      first patch implements the basic algorithm.
      
      TLP's goal is to reduce tail latency of short transactions. It achieves
      this by converting retransmission timeouts (RTOs) occuring due
      to tail losses (losses at end of transactions) into fast recovery.
      TLP transmits one packet in two round-trips when a connection is in
      Open state and isn't receiving any ACKs. The transmitted packet, aka
      loss probe, can be either new or a retransmission. When there is tail
      loss, the ACK from a loss probe triggers FACK/early-retransmit based
      fast recovery, thus avoiding a costly RTO. In the absence of loss,
      there is no change in the connection state.
      
      PTO stands for probe timeout. It is a timer event indicating
      that an ACK is overdue and triggers a loss probe packet. The PTO value
      is set to max(2*SRTT, 10ms) and is adjusted to account for delayed
      ACK timer when there is only one oustanding packet.
      
      TLP Algorithm
      
      On transmission of new data in Open state:
        -> packets_out > 1: schedule PTO in max(2*SRTT, 10ms).
        -> packets_out == 1: schedule PTO in max(2*RTT, 1.5*RTT + 200ms)
        -> PTO = min(PTO, RTO)
      
      Conditions for scheduling PTO:
        -> Connection is in Open state.
        -> Connection is either cwnd limited or no new data to send.
        -> Number of probes per tail loss episode is limited to one.
        -> Connection is SACK enabled.
      
      When PTO fires:
        new_segment_exists:
          -> transmit new segment.
          -> packets_out++. cwnd remains same.
      
        no_new_packet:
          -> retransmit the last segment.
             Its ACK triggers FACK or early retransmit based recovery.
      
      ACK path:
        -> rearm RTO at start of ACK processing.
        -> reschedule PTO if need be.
      
      In addition, the patch includes a small variation to the Early Retransmit
      (ER) algorithm, such that ER and TLP together can in principle recover any
      N-degree of tail loss through fast recovery. TLP is controlled by the same
      sysctl as ER, tcp_early_retrans sysctl.
      tcp_early_retrans==0; disables TLP and ER.
      		 ==1; enables RFC5827 ER.
      		 ==2; delayed ER.
      		 ==3; TLP and delayed ER. [DEFAULT]
      		 ==4; TLP only.
      
      The TLP patch series have been extensively tested on Google Web servers.
      It is most effective for short Web trasactions, where it reduced RTOs by 15%
      and improved HTTP response time (average by 6%, 99th percentile by 10%).
      The transmitted probes account for <0.5% of the overall transmissions.
      Signed-off-by: NNandita Dukkipati <nanditad@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6ba8a3b1
  5. 11 3月, 2013 1 次提交
  6. 14 2月, 2013 1 次提交
    • A
      tcp: adding a per-socket timestamp offset · ceaa1fef
      Andrey Vagin 提交于
      This functionality is used for restoring tcp sockets. A tcp timestamp
      depends on how long a system has been running, so it's differ for each
      host. The solution is to set a per-socket offset.
      
      A per-socket offset for a TIME_WAIT socket is inherited from a proper
      tcp socket.
      
      tcp_request_sock doesn't have a timestamp offset, because the repair
      mode for them are not implemented.
      
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      Cc: James Morris <jmorris@namei.org>
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Cc: Patrick McHardy <kaber@trash.net>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Signed-off-by: NAndrey Vagin <avagin@openvz.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ceaa1fef
  7. 06 2月, 2013 1 次提交
  8. 09 12月, 2012 1 次提交
  9. 23 10月, 2012 1 次提交
  10. 13 10月, 2012 1 次提交
  11. 21 9月, 2012 1 次提交
  12. 01 9月, 2012 1 次提交
    • J
      tcp: TCP Fast Open Server - header & support functions · 10467163
      Jerry Chu 提交于
      This patch adds all the necessary data structure and support
      functions to implement TFO server side. It also documents a number
      of flags for the sysctl_tcp_fastopen knob, and adds a few Linux
      extension MIBs.
      
      In addition, it includes the following:
      
      1. a new TCP_FASTOPEN socket option an application must call to
      supply a max backlog allowed in order to enable TFO on its listener.
      
      2. A number of key data structures:
      "fastopen_rsk" in tcp_sock - for a big socket to access its
      request_sock for retransmission and ack processing purpose. It is
      non-NULL iff 3WHS not completed.
      
      "fastopenq" in request_sock_queue - points to a per Fast Open
      listener data structure "fastopen_queue" to keep track of qlen (# of
      outstanding Fast Open requests) and max_qlen, among other things.
      
      "listener" in tcp_request_sock - to point to the original listener
      for book-keeping purpose, i.e., to maintain qlen against max_qlen
      as part of defense against IP spoofing attack.
      
      3. various data structure and functions, many in tcp_fastopen.c, to
      support server side Fast Open cookie operations, including
      /proc/sys/net/ipv4/tcp_fastopen_key to allow manual rekeying.
      Signed-off-by: NH.K. Jerry Chu <hkchu@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      10467163
  13. 23 7月, 2012 1 次提交
    • E
      tcp: dont drop MTU reduction indications · 563d34d0
      Eric Dumazet 提交于
      ICMP messages generated in output path if frame length is bigger than
      mtu are actually lost because socket is owned by user (doing the xmit)
      
      One example is the ipgre_tunnel_xmit() calling
      icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED, htonl(mtu));
      
      We had a similar case fixed in commit a34a101e (ipv6: disable GSO on
      sockets hitting dst_allfrag).
      
      Problem of such fix is that it relied on retransmit timers, so short tcp
      sessions paid a too big latency increase price.
      
      This patch uses the tcp_release_cb() infrastructure so that MTU
      reduction messages (ICMP messages) are not lost, and no extra delay
      is added in TCP transmits.
      Reported-by: NMaciej Żenczykowski <maze@google.com>
      Diagnosed-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Nandita Dukkipati <nanditad@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Tore Anderson <tore@fud.no>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      563d34d0
  14. 21 7月, 2012 1 次提交
    • E
      tcp: improve latencies of timer triggered events · 6f458dfb
      Eric Dumazet 提交于
      Modern TCP stack highly depends on tcp_write_timer() having a small
      latency, but current implementation doesn't exactly meet the
      expectations.
      
      When a timer fires but finds the socket is owned by the user, it rearms
      itself for an additional delay hoping next run will be more
      successful.
      
      tcp_write_timer() for example uses a 50ms delay for next try, and it
      defeats many attempts to get predictable TCP behavior in term of
      latencies.
      
      Use the recently introduced tcp_release_cb(), so that the user owning
      the socket will call various handlers right before socket release.
      
      This will permit us to post a followup patch to address the
      tcp_tso_should_defer() syndrome (some deferred packets have to wait
      RTO timer to be transmitted, while cwnd should allow us to send them
      sooner)
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Nandita Dukkipati <nanditad@google.com>
      Cc: H.K. Jerry Chu <hkchu@google.com>
      Cc: John Heffner <johnwheffner@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6f458dfb
  15. 20 7月, 2012 3 次提交
    • Y
      net-tcp: Fast Open client - cookie-less mode · 67da22d2
      Yuchung Cheng 提交于
      In trusted networks, e.g., intranet, data-center, the client does not
      need to use Fast Open cookie to mitigate DoS attacks. In cookie-less
      mode, sendmsg() with MSG_FASTOPEN flag will send SYN-data regardless
      of cookie availability.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      67da22d2
    • Y
      net-tcp: Fast Open client - sending SYN-data · 783237e8
      Yuchung Cheng 提交于
      This patch implements sending SYN-data in tcp_connect(). The data is
      from tcp_sendmsg() with flag MSG_FASTOPEN (implemented in a later patch).
      
      The length of the cookie in tcp_fastopen_req, init'd to 0, controls the
      type of the SYN. If the cookie is not cached (len==0), the host sends
      data-less SYN with Fast Open cookie request option to solicit a cookie
      from the remote. If cookie is not available (len > 0), the host sends
      a SYN-data with Fast Open cookie option. If cookie length is negative,
        the SYN will not include any Fast Open option (for fall back operations).
      
      To deal with middleboxes that may drop SYN with data or experimental TCP
      option, the SYN-data is only sent once. SYN retransmits do not include
      data or Fast Open options. The connection will fall back to regular TCP
      handshake.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      783237e8
    • Y
      net-tcp: Fast Open base · 2100c8d2
      Yuchung Cheng 提交于
      This patch impelements the common code for both the client and server.
      
      1. TCP Fast Open option processing. Since Fast Open does not have an
         option number assigned by IANA yet, it shares the experiment option
         code 254 by implementing draft-ietf-tcpm-experimental-options
         with a 16 bits magic number 0xF989. This enables global experiments
         without clashing the scarce(2) experimental options available for TCP.
      
         When the draft status becomes standard (maybe), the client should
         switch to the new option number assigned while the server supports
         both numbers for transistion.
      
      2. The new sysctl tcp_fastopen
      
      3. A place holder init function
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2100c8d2
  16. 12 7月, 2012 1 次提交
    • E
      tcp: TCP Small Queues · 46d3ceab
      Eric Dumazet 提交于
      This introduce TSQ (TCP Small Queues)
      
      TSQ goal is to reduce number of TCP packets in xmit queues (qdisc &
      device queues), to reduce RTT and cwnd bias, part of the bufferbloat
      problem.
      
      sk->sk_wmem_alloc not allowed to grow above a given limit,
      allowing no more than ~128KB [1] per tcp socket in qdisc/dev layers at a
      given time.
      
      TSO packets are sized/capped to half the limit, so that we have two
      TSO packets in flight, allowing better bandwidth use.
      
      As a side effect, setting the limit to 40000 automatically reduces the
      standard gso max limit (65536) to 40000/2 : It can help to reduce
      latencies of high prio packets, having smaller TSO packets.
      
      This means we divert sock_wfree() to a tcp_wfree() handler, to
      queue/send following frames when skb_orphan() [2] is called for the
      already queued skbs.
      
      Results on my dev machines (tg3/ixgbe nics) are really impressive,
      using standard pfifo_fast, and with or without TSO/GSO.
      
      Without reduction of nominal bandwidth, we have reduction of buffering
      per bulk sender :
      < 1ms on Gbit (instead of 50ms with TSO)
      < 8ms on 100Mbit (instead of 132 ms)
      
      I no longer have 4 MBytes backlogged in qdisc by a single netperf
      session, and both side socket autotuning no longer use 4 Mbytes.
      
      As skb destructor cannot restart xmit itself ( as qdisc lock might be
      taken at this point ), we delegate the work to a tasklet. We use one
      tasklest per cpu for performance reasons.
      
      If tasklet finds a socket owned by the user, it sets TSQ_OWNED flag.
      This flag is tested in a new protocol method called from release_sock(),
      to eventually send new segments.
      
      [1] New /proc/sys/net/ipv4/tcp_limit_output_bytes tunable
      [2] skb_orphan() is usually called at TX completion time,
        but some drivers call it in their start_xmit() handler.
        These drivers should at least use BQL, or else a single TCP
        session can still fill the whole NIC TX ring, since TSQ will
        have no effect.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Dave Taht <dave.taht@bufferbloat.net>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Matt Mathis <mattmathis@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Nandita Dukkipati <nanditad@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      46d3ceab
  17. 11 7月, 2012 1 次提交
  18. 10 6月, 2012 2 次提交
    • P
      net: Make linux/tcp.h C++ friendly (trivial) · 8876d6b5
      Paul Pluzhnikov 提交于
      I originally sent this patch to <trivial@kernel.org>, but Jiri Kosina did
      not feel that this is fully appropriate for the trivial tree.
      
      Using linux/tcp.h from C++ results in:
      
      cat t.cc
      #include <linux/tcp.h>
      int main() { }
      
      g++ -c t.cc
      
      In file included from t.cc:1:
      /usr/include/linux/tcp.h:72: error: '__u32 __fswab32(__u32)' cannot appear in a constant-expression
      /usr/include/linux/tcp.h:72: error: a function call cannot appear in a constant-expression
      ...
      
      Attached trivial patch fixes this problem.
      
      Tested:
      - the t.cc above compiles with g++ and
      - the following program generates the same output before/after
        the patch:
      
      #include <linux/tcp.h>
      #include <stdio.h>
      
      int main ()
      {
      #define P(a) printf("%s: %08x\n", #a, (int)a)
       P(TCP_FLAG_CWR);
       P(TCP_FLAG_ECE);
       P(TCP_FLAG_URG);
       P(TCP_FLAG_ACK);
       P(TCP_FLAG_PSH);
       P(TCP_FLAG_RST);
       P(TCP_FLAG_SYN);
       P(TCP_FLAG_FIN);
       P(TCP_RESERVED_BITS);
       P(TCP_DATA_OFFSET);
      #undef P
       return 0;
      }
      Signed-off-by: NPaul Pluzhnikov <ppluzhnikov@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8876d6b5
    • D
      [PATCH] tcp: Cache inetpeer in timewait socket, and only when necessary. · 2397849b
      David S. Miller 提交于
      Since it's guarenteed that we will access the inetpeer if we're trying
      to do timewait recycling and TCP options were enabled on the
      connection, just cache the peer in the timewait socket.
      
      In the future, inetpeer lookups will be context dependent (per routing
      realm), and this helps facilitate that as well.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2397849b
  19. 09 5月, 2012 1 次提交
  20. 03 5月, 2012 2 次提交
    • Y
      tcp: early retransmit: delayed fast retransmit · 750ea2ba
      Yuchung Cheng 提交于
      Implementing the advanced early retransmit (sysctl_tcp_early_retrans==2).
      Delays the fast retransmit by an interval of RTT/4. We borrow the
      RTO timer to implement the delay. If we receive another ACK or send
      a new packet, the timer is cancelled and restored to original RTO
      value offset by time elapsed.  When the delayed-ER timer fires,
      we enter fast recovery and perform fast retransmit.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      750ea2ba
    • Y
      tcp: early retransmit · eed530b6
      Yuchung Cheng 提交于
      This patch implements RFC 5827 early retransmit (ER) for TCP.
      It reduces DUPACK threshold (dupthresh) if outstanding packets are
      less than 4 to recover losses by fast recovery instead of timeout.
      
      While the algorithm is simple, small but frequent network reordering
      makes this feature dangerous: the connection repeatedly enter
      false recovery and degrade performance. Therefore we implement
      a mitigation suggested in the appendix of the RFC that delays
      entering fast recovery by a small interval, i.e., RTT/4. Currently
      ER is conservative and is disabled for the rest of the connection
      after the first reordering event. A large scale web server
      experiment on the performance impact of ER is summarized in
      section 6 of the paper "Proportional Rate Reduction for TCP”,
      IMC 2011. http://conferences.sigcomm.org/imc/2011/docs/p155.pdf
      
      Note that Linux has a similar feature called THIN_DUPACK. The
      differences are THIN_DUPACK do not mitigate reorderings and is only
      used after slow start. Currently ER is disabled if THIN_DUPACK is
      enabled. I would be happy to merge THIN_DUPACK feature with ER if
      people think it's a good idea.
      
      ER is enabled by sysctl_tcp_early_retrans:
        0: Disables ER
      
        1: Reduce dupthresh to packets_out - 1 when outstanding packets < 4.
      
        2: (Default) reduce dupthresh like mode 1. In addition, delay
           entering fast recovery by RTT/4.
      
      Note: mode 2 is implemented in the third part of this patch series.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      eed530b6
  21. 26 4月, 2012 1 次提交
  22. 22 4月, 2012 2 次提交
    • P
      tcp: Repair connection-time negotiated parameters · b139ba4e
      Pavel Emelyanov 提交于
      There are options, which are set up on a socket while performing
      TCP handshake. Need to resurrect them on a socket while repairing.
      A new sockoption accepts a buffer and parses it. The buffer should
      be CODE:VALUE sequence of bytes, where CODE is standard option
      code and VALUE is the respective value.
      
      Only 4 options should be handled on repaired socket.
      
      To read 3 out of 4 of these options the TCP_INFO sockoption can be
      used. An ability to get the last one (the mss_clamp) was added by
      the previous patch.
      
      Now the restore. Three of these options -- timestamp_ok, mss_clamp
      and snd_wscale -- are just restored on a coket.
      
      The sack_ok flags has 2 issues. First, whether or not to do sacks
      at all. This flag is just read and set back. No other sack  info is
      saved or restored, since according to the standart and the code
      dropping all sack-ed segments is OK, the sender will resubmit them
      again, so after the repair we will probably experience a pause in
      connection. Next, the fack bit. It's just set back on a socket if
      the respective sysctl is set. No collected stats about packets flow
      is preserved. As far as I see (plz, correct me if I'm wrong) the
      fack-based congestion algorithm survives dropping all of the stats
      and repairs itself eventually, probably losing the performance for
      that period.
      Signed-off-by: NPavel Emelyanov <xemul@openvz.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b139ba4e
    • P
      tcp: Initial repair mode · ee995283
      Pavel Emelyanov 提交于
      This includes (according the the previous description):
      
      * TCP_REPAIR sockoption
      
      This one just puts the socket in/out of the repair mode.
      Allowed for CAP_NET_ADMIN and for closed/establised sockets only.
      When repair mode is turned off and the socket happens to be in
      the established state the window probe is sent to the peer to
      'unlock' the connection.
      
      * TCP_REPAIR_QUEUE sockoption
      
      This one sets the queue which we're about to repair. The
      'no-queue' is set by default.
      
      * TCP_QUEUE_SEQ socoption
      
      Sets the write_seq/rcv_nxt of a selected repaired queue.
      Allowed for TCP_CLOSE-d sockets only. When the socket changes
      its state the other seq-s are changed by the kernel according
      to the protocol rules (most of the existing code is actually
      reused).
      
      * Ability to forcibly bind a socket to a port
      
      The sk->sk_reuse is set to SK_FORCE_REUSE.
      
      * Immediate connect modification
      
      The connect syscall initializes the connection, then directly jumps
      to the code which finalizes it.
      
      * Silent close modification
      
      The close just aborts the connection (similar to SO_LINGER with 0
      time) but without sending any FIN/RST-s to peer.
      Signed-off-by: NPavel Emelyanov <xemul@parallels.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ee995283
  23. 29 2月, 2012 1 次提交
  24. 01 2月, 2012 2 次提交
    • E
      tcp: md5: protects md5sig_info with RCU · a8afca03
      Eric Dumazet 提交于
      This patch makes sure we use appropriate memory barriers before
      publishing tp->md5sig_info, allowing tcp_md5_do_lookup() being used from
      tcp_v4_send_reset() without holding socket lock (upcoming patch from
      Shawn Lu)
      
      Note we also need to respect rcu grace period before its freeing, since
      we can free socket without this grace period thanks to
      SLAB_DESTROY_BY_RCU
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Cc: Shawn Lu <shawn.lu@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a8afca03
    • E
      tcp: md5: rcu conversion · a915da9b
      Eric Dumazet 提交于
      In order to be able to support proper RST messages for TCP MD5 flows, we
      need to allow access to MD5 keys without locking listener socket.
      
      This conversion is a nice cleanup, and shrinks size of timewait sockets
      by 80 bytes.
      
      IPv6 code reuses generic code found in IPv4 instead of duplicating it.
      
      Control path uses GFP_KERNEL allocations instead of GFP_ATOMIC.
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Cc: Shawn Lu <shawn.lu@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a915da9b
  25. 21 12月, 2011 1 次提交
  26. 04 10月, 2011 1 次提交
    • E
      tcp: report ECN_SEEN in tcp_info · b5c5693b
      Eric Dumazet 提交于
      Allows ss command (iproute2) to display "ecnseen" if at least one packet
      with ECT(0) or ECT(1) or ECN was received by this socket.
      
      "ecn" means ECN was negotiated at session establishment (TCP level)
      
      "ecnseen" means we received at least one packet with ECT fields set (IP
      level)
      
      ss -i
      ...
      ESTAB      0      0   192.168.20.110:22  192.168.20.144:38016
      ino:5950 sk:f178e400
      	 mem:(r0,w0,f0,t0) ts sack ecn ecnseen bic wscale:7,8 rto:210
      rtt:12.5/7.5 cwnd:10 send 9.3Mbps rcv_space:14480
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b5c5693b
  27. 25 8月, 2011 1 次提交
    • N
      Proportional Rate Reduction for TCP. · a262f0cd
      Nandita Dukkipati 提交于
      This patch implements Proportional Rate Reduction (PRR) for TCP.
      PRR is an algorithm that determines TCP's sending rate in fast
      recovery. PRR avoids excessive window reductions and aims for
      the actual congestion window size at the end of recovery to be as
      close as possible to the window determined by the congestion control
      algorithm. PRR also improves accuracy of the amount of data sent
      during loss recovery.
      
      The patch implements the recommended flavor of PRR called PRR-SSRB
      (Proportional rate reduction with slow start reduction bound) and
      replaces the existing rate halving algorithm. PRR improves upon the
      existing Linux fast recovery under a number of conditions including:
        1) burst losses where the losses implicitly reduce the amount of
      outstanding data (pipe) below the ssthresh value selected by the
      congestion control algorithm and,
        2) losses near the end of short flows where application runs out of
      data to send.
      
      As an example, with the existing rate halving implementation a single
      loss event can cause a connection carrying short Web transactions to
      go into the slow start mode after the recovery. This is because during
      recovery Linux pulls the congestion window down to packets_in_flight+1
      on every ACK. A short Web response often runs out of new data to send
      and its pipe reduces to zero by the end of recovery when all its packets
      are drained from the network. Subsequent HTTP responses using the same
      connection will have to slow start to raise cwnd to ssthresh. PRR on
      the other hand aims for the cwnd to be as close as possible to ssthresh
      by the end of recovery.
      
      A description of PRR and a discussion of its performance can be found at
      the following links:
      - IETF Draft:
          http://tools.ietf.org/html/draft-mathis-tcpm-proportional-rate-reduction-01
      - IETF Slides:
          http://www.ietf.org/proceedings/80/slides/tcpm-6.pdf
          http://tools.ietf.org/agenda/81/slides/tcpm-2.pdf
      - Paper to appear in Internet Measurements Conference (IMC) 2011:
          Improving TCP Loss Recovery
          Nandita Dukkipati, Matt Mathis, Yuchung Cheng
      Signed-off-by: NNandita Dukkipati <nanditad@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a262f0cd
  28. 09 6月, 2011 1 次提交
    • J
      tcp: RFC2988bis + taking RTT sample from 3WHS for the passive open side · 9ad7c049
      Jerry Chu 提交于
      This patch lowers the default initRTO from 3secs to 1sec per
      RFC2988bis. It falls back to 3secs if the SYN or SYN-ACK packet
      has been retransmitted, AND the TCP timestamp option is not on.
      
      It also adds support to take RTT sample during 3WHS on the passive
      open side, just like its active open counterpart, and uses it, if
      valid, to seed the initRTO for the data transmission phase.
      
      The patch also resets ssthresh to its initial default at the
      beginning of the data transmission phase, and reduces cwnd to 1 if
      there has been MORE THAN ONE retransmission during 3WHS per RFC5681.
      Signed-off-by: NH.K. Jerry Chu <hkchu@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9ad7c049
  29. 31 8月, 2010 1 次提交
    • J
      tcp: Add TCP_USER_TIMEOUT socket option. · dca43c75
      Jerry Chu 提交于
      This patch provides a "user timeout" support as described in RFC793. The
      socket option is also needed for the the local half of RFC5482 "TCP User
      Timeout Option".
      
      TCP_USER_TIMEOUT is a TCP level socket option that takes an unsigned int,
      when > 0, to specify the maximum amount of time in ms that transmitted
      data may remain unacknowledged before TCP will forcefully close the
      corresponding connection and return ETIMEDOUT to the application. If
      0 is given, TCP will continue to use the system default.
      
      Increasing the user timeouts allows a TCP connection to survive extended
      periods without end-to-end connectivity. Decreasing the user timeouts
      allows applications to "fail fast" if so desired. Otherwise it may take
      upto 20 minutes with the current system defaults in a normal WAN
      environment.
      
      The socket option can be made during any state of a TCP connection, but
      is only effective during the synchronized states of a connection
      (ESTABLISHED, FIN-WAIT-1, FIN-WAIT-2, CLOSE-WAIT, CLOSING, or LAST-ACK).
      Moreover, when used with the TCP keepalive (SO_KEEPALIVE) option,
      TCP_USER_TIMEOUT will overtake keepalive to determine when to close a
      connection due to keepalive failure.
      
      The option does not change in anyway when TCP retransmits a packet, nor
      when a keepalive probe will be sent.
      
      This option, like many others, will be inherited by an acceptor from its
      listener.
      Signed-off-by: NH.K. Jerry Chu <hkchu@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      dca43c75
  30. 19 2月, 2010 2 次提交
    • A
      net: TCP thin dupack · 7e380175
      Andreas Petlund 提交于
      This patch enables fast retransmissions after one dupACK for
      TCP if the stream is identified as thin. This will reduce
      latencies for thin streams that are not able to trigger fast
      retransmissions due to high packet interarrival time. This
      mechanism is only active if enabled by iocontrol or syscontrol
      and the stream is identified as thin.
      Signed-off-by: NAndreas Petlund <apetlund@simula.no>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7e380175
    • A
      net: TCP thin linear timeouts · 36e31b0a
      Andreas Petlund 提交于
      This patch will make TCP use only linear timeouts if the
      stream is thin. This will help to avoid the very high latencies
      that thin stream suffer because of exponential backoff. This
      mechanism is only active if enabled by iocontrol or syscontrol
      and the stream is identified as thin. A maximum of 6 linear
      timeouts is tried before exponential backoff is resumed.
      Signed-off-by: NAndreas Petlund <apetlund@simula.no>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      36e31b0a
  31. 03 12月, 2009 1 次提交
    • W
      TCPCT part 1d: define TCP cookie option, extend existing struct's · 435cf559
      William Allen Simpson 提交于
      Data structures are carefully composed to require minimal additions.
      For example, the struct tcp_options_received cookie_plus variable fits
      between existing 16-bit and 8-bit variables, requiring no additional
      space (taking alignment into consideration).  There are no additions to
      tcp_request_sock, and only 1 pointer in tcp_sock.
      
      This is a significantly revised implementation of an earlier (year-old)
      patch that no longer applies cleanly, with permission of the original
      author (Adam Langley):
      
          http://thread.gmane.org/gmane.linux.network/102586
      
      The principle difference is using a TCP option to carry the cookie nonce,
      instead of a user configured offset in the data.  This is more flexible and
      less subject to user configuration error.  Such a cookie option has been
      suggested for many years, and is also useful without SYN data, allowing
      several related concepts to use the same extension option.
      
          "Re: SYN floods (was: does history repeat itself?)", September 9, 1996.
          http://www.merit.net/mail.archives/nanog/1996-09/msg00235.html
      
          "Re: what a new TCP header might look like", May 12, 1998.
          ftp://ftp.isi.edu/end2end/end2end-interest-1998.mail
      
      These functions will also be used in subsequent patches that implement
      additional features.
      
      Requires:
         TCPCT part 1a: add request_values parameter for sending SYNACK
         TCPCT part 1b: generate Responder Cookie secret
         TCPCT part 1c: sysctl_tcp_cookie_size, socket option TCP_COOKIE_TRANSACTIONS
      
      Signed-off-by: William.Allen.Simpson@gmail.com
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      435cf559