1. 17 5月, 2013 1 次提交
    • E
      tcp: speedup tcp_fixup_rcvbuf() · d2cf4367
      Eric Dumazet 提交于
      tcp_fixup_rcvbuf() contains a loop to estimate initial socket
      rcv space needed for a given mss. With large MTU (like 64K on lo),
      we can loop ~500 times and consume a lot of cpu cycles.
      
      perf top of 200 concurrent netperf -t TCP_CRR
      
      5.62%  netperf  [kernel.kallsyms]  [k] tcp_init_buffer_space
      1.71%  netperf  [kernel.kallsyms]  [k] _raw_spin_lock
      1.55%  netperf  [kernel.kallsyms]  [k] kmem_cache_free
      1.51%  netperf  [kernel.kallsyms]  [k] tcp_transmit_skb
      1.50%  netperf  [kernel.kallsyms]  [k] tcp_ack
      
      Lets use a 100% factor, and remove the loop.
      
      100% is needed anyway for tcp_adv_win_scale=1
      default value, and is also the maximum factor.
      
      Refs: commit b49960a0
            ("tcp: change tcp_adv_win_scale and tcp_rmem[2]")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d2cf4367
  2. 30 4月, 2013 1 次提交
  3. 20 4月, 2013 1 次提交
    • E
      tcp: call tcp_replace_ts_recent() from tcp_ack() · 12fb3dd9
      Eric Dumazet 提交于
      commit bd090dfc (tcp: tcp_replace_ts_recent() should not be called
      from tcp_validate_incoming()) introduced a TS ecr bug in slow path
      processing.
      
      1 A > B P. 1:10001(10000) ack 1 <nop,nop,TS val 1001 ecr 200>
      2 B < A . 1:1(0) ack 1 win 257 <sack 9001:10001,TS val 300 ecr 1001>
      3 A > B . 1:1001(1000) ack 1 win 227 <nop,nop,TS val 1002 ecr 200>
      4 A > B . 1001:2001(1000) ack 1 win 227 <nop,nop,TS val 1002 ecr 200>
      
      (ecr 200 should be ecr 300 in packets 3 & 4)
      
      Problem is tcp_ack() can trigger send of new packets (retransmits),
      reflecting the prior TSval, instead of the TSval contained in the
      currently processed incoming packet.
      
      Fix this by calling tcp_replace_ts_recent() from tcp_ack() after the
      checks, but before the actions.
      Reported-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      12fb3dd9
  4. 25 3月, 2013 1 次提交
  5. 21 3月, 2013 3 次提交
    • Y
      tcp: implement RFC5682 F-RTO · e33099f9
      Yuchung Cheng 提交于
      This patch implements F-RTO (foward RTO recovery):
      
      When the first retransmission after timeout is acknowledged, F-RTO
      sends new data instead of old data. If the next ACK acknowledges
      some never-retransmitted data, then the timeout was spurious and the
      congestion state is reverted.  Otherwise if the next ACK selectively
      acknowledges the new data, then the timeout was genuine and the
      loss recovery continues. This idea applies to recurring timeouts
      as well. While F-RTO sends different data during timeout recovery,
      it does not (and should not) change the congestion control.
      
      The implementaion follows the three steps of SACK enhanced algorithm
      (section 3) in RFC5682. Step 1 is in tcp_enter_loss(). Step 2 and
      3 are in tcp_process_loss().  The basic version is not supported
      because SACK enhanced version also works for non-SACK connections.
      
      The new implementation is functionally in parity with the old F-RTO
      implementation except the one case where it increases undo events:
      In addition to the RFC algorithm, a spurious timeout may be detected
      without sending data in step 2, as long as the SACK confirms not
      all the original data are dropped. When this happens, the sender
      will undo the cwnd and perhaps enter fast recovery instead. This
      additional check increases the F-RTO undo events by 5x compared
      to the prior implementation on Google Web servers, since the sender
      often does not have new data to send for HTTP.
      
      Note F-RTO may detect spurious timeout before Eifel with timestamps
      does so.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e33099f9
    • Y
      tcp: refactor CA_Loss state processing · ab42d9ee
      Yuchung Cheng 提交于
      Consolidate all of TCP CA_Loss state processing in
      tcp_fastretrans_alert() into a new function called tcp_process_loss().
      This is to prepare the new F-RTO implementation in the next patch.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ab42d9ee
    • Y
      tcp: refactor F-RTO · 9b44190d
      Yuchung Cheng 提交于
      The patch series refactor the F-RTO feature (RFC4138/5682).
      
      This is to simplify the loss recovery processing. Existing F-RTO
      was developed during the experimental stage (RFC4138) and has
      many experimental features.  It takes a separate code path from
      the traditional timeout processing by overloading CA_Disorder
      instead of using CA_Loss state. This complicates CA_Disorder state
      handling because it's also used for handling dubious ACKs and undos.
      While the algorithm in the RFC does not change the congestion control,
      the implementation intercepts congestion control in various places
      (e.g., frto_cwnd in tcp_ack()).
      
      The new code implements newer F-RTO RFC5682 using CA_Loss processing
      path.  F-RTO becomes a small extension in the timeout processing
      and interfaces with congestion control and Eifel undo modules.
      It lets congestion control (module) determines how many to send
      independently.  F-RTO only chooses what to send in order to detect
      spurious retranmission. If timeout is found spurious it invokes
      existing Eifel undo algorithms like DSACK or TCP timestamp based
      detection.
      
      The first patch removes all F-RTO code except the sysctl_tcp_frto is
      left for the new implementation.  Since CA_EVENT_FRTO is removed, TCP
      westwood now computes ssthresh on regular timeout CA_EVENT_LOSS event.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9b44190d
  6. 18 3月, 2013 1 次提交
    • C
      tcp: Remove TCPCT · 1a2c6181
      Christoph Paasch 提交于
      TCPCT uses option-number 253, reserved for experimental use and should
      not be used in production environments.
      Further, TCPCT does not fully implement RFC 6013.
      
      As a nice side-effect, removing TCPCT increases TCP's performance for
      very short flows:
      
      Doing an apache-benchmark with -c 100 -n 100000, sending HTTP-requests
      for files of 1KB size.
      
      before this patch:
      	average (among 7 runs) of 20845.5 Requests/Second
      after:
      	average (among 7 runs) of 21403.6 Requests/Second
      Signed-off-by: NChristoph Paasch <christoph.paasch@uclouvain.be>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1a2c6181
  7. 12 3月, 2013 2 次提交
    • N
      tcp: TLP loss detection. · 9b717a8d
      Nandita Dukkipati 提交于
      This is the second of the TLP patch series; it augments the basic TLP
      algorithm with a loss detection scheme.
      
      This patch implements a mechanism for loss detection when a Tail
      loss probe retransmission plugs a hole thereby masking packet loss
      from the sender. The loss detection algorithm relies on counting
      TLP dupacks as outlined in Sec. 3 of:
      http://tools.ietf.org/html/draft-dukkipati-tcpm-tcp-loss-probe-01
      
      The basic idea is: Sender keeps track of TLP "episode" upon
      retransmission of a TLP packet. An episode ends when the sender receives
      an ACK above the SND.NXT (tracked by tlp_high_seq) at the time of the
      episode. We want to make sure that before the episode ends the sender
      receives a "TLP dupack", indicating that the TLP retransmission was
      unnecessary, so there was no loss/hole that needed plugging. If the
      sender gets no TLP dupack before the end of the episode, then it reduces
      ssthresh and the congestion window, because the TLP packet arriving at
      the receiver probably plugged a hole.
      Signed-off-by: NNandita Dukkipati <nanditad@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9b717a8d
    • N
      tcp: Tail loss probe (TLP) · 6ba8a3b1
      Nandita Dukkipati 提交于
      This patch series implement the Tail loss probe (TLP) algorithm described
      in http://tools.ietf.org/html/draft-dukkipati-tcpm-tcp-loss-probe-01. The
      first patch implements the basic algorithm.
      
      TLP's goal is to reduce tail latency of short transactions. It achieves
      this by converting retransmission timeouts (RTOs) occuring due
      to tail losses (losses at end of transactions) into fast recovery.
      TLP transmits one packet in two round-trips when a connection is in
      Open state and isn't receiving any ACKs. The transmitted packet, aka
      loss probe, can be either new or a retransmission. When there is tail
      loss, the ACK from a loss probe triggers FACK/early-retransmit based
      fast recovery, thus avoiding a costly RTO. In the absence of loss,
      there is no change in the connection state.
      
      PTO stands for probe timeout. It is a timer event indicating
      that an ACK is overdue and triggers a loss probe packet. The PTO value
      is set to max(2*SRTT, 10ms) and is adjusted to account for delayed
      ACK timer when there is only one oustanding packet.
      
      TLP Algorithm
      
      On transmission of new data in Open state:
        -> packets_out > 1: schedule PTO in max(2*SRTT, 10ms).
        -> packets_out == 1: schedule PTO in max(2*RTT, 1.5*RTT + 200ms)
        -> PTO = min(PTO, RTO)
      
      Conditions for scheduling PTO:
        -> Connection is in Open state.
        -> Connection is either cwnd limited or no new data to send.
        -> Number of probes per tail loss episode is limited to one.
        -> Connection is SACK enabled.
      
      When PTO fires:
        new_segment_exists:
          -> transmit new segment.
          -> packets_out++. cwnd remains same.
      
        no_new_packet:
          -> retransmit the last segment.
             Its ACK triggers FACK or early retransmit based recovery.
      
      ACK path:
        -> rearm RTO at start of ACK processing.
        -> reschedule PTO if need be.
      
      In addition, the patch includes a small variation to the Early Retransmit
      (ER) algorithm, such that ER and TLP together can in principle recover any
      N-degree of tail loss through fast recovery. TLP is controlled by the same
      sysctl as ER, tcp_early_retrans sysctl.
      tcp_early_retrans==0; disables TLP and ER.
      		 ==1; enables RFC5827 ER.
      		 ==2; delayed ER.
      		 ==3; TLP and delayed ER. [DEFAULT]
      		 ==4; TLP only.
      
      The TLP patch series have been extensively tested on Google Web servers.
      It is most effective for short Web trasactions, where it reduced RTOs by 15%
      and improved HTTP response time (average by 6%, 99th percentile by 10%).
      The transmitted probes account for <0.5% of the overall transmissions.
      Signed-off-by: NNandita Dukkipati <nanditad@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6ba8a3b1
  8. 05 3月, 2013 1 次提交
  9. 14 2月, 2013 2 次提交
    • P
      net: Fix possible wrong checksum generation. · c9af6db4
      Pravin B Shelar 提交于
      Patch cef401de (net: fix possible wrong checksum
      generation) fixed wrong checksum calculation but it broke TSO by
      defining new GSO type but not a netdev feature for that type.
      net_gso_ok() would not allow hardware checksum/segmentation
      offload of such packets without the feature.
      
      Following patch fixes TSO and wrong checksum. This patch uses
      same logic that Eric Dumazet used. Patch introduces new flag
      SKBTX_SHARED_FRAG if at least one frag can be modified by
      the user. but SKBTX_SHARED_FRAG flag is kept in skb shared
      info tx_flags rather than gso_type.
      
      tx_flags is better compared to gso_type since we can have skb with
      shared frag without gso packet. It does not link SHARED_FRAG to
      GSO, So there is no need to define netdev feature for this.
      Signed-off-by: NPravin B Shelar <pshelar@nicira.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c9af6db4
    • A
      tcp: send packets with a socket timestamp · ee684b6f
      Andrey Vagin 提交于
      A socket timestamp is a sum of the global tcp_time_stamp and
      a per-socket offset.
      
      A socket offset is added in places where externally visible
      tcp timestamp option is parsed/initialized.
      
      Connections in the SYN_RECV state are not supported, global
      tcp_time_stamp is used for them, because repair mode doesn't support
      this state. In a future it can be implemented by the similar way
      as for TIME_WAIT sockets.
      
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      Cc: James Morris <jmorris@namei.org>
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Cc: Patrick McHardy <kaber@trash.net>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Signed-off-by: NAndrey Vagin <avagin@openvz.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ee684b6f
  10. 07 2月, 2013 1 次提交
  11. 06 2月, 2013 1 次提交
  12. 04 2月, 2013 1 次提交
  13. 01 2月, 2013 1 次提交
  14. 28 1月, 2013 1 次提交
    • E
      net: fix possible wrong checksum generation · cef401de
      Eric Dumazet 提交于
      Pravin Shelar mentioned that GSO could potentially generate
      wrong TX checksum if skb has fragments that are overwritten
      by the user between the checksum computation and transmit.
      
      He suggested to linearize skbs but this extra copy can be
      avoided for normal tcp skbs cooked by tcp_sendmsg().
      
      This patch introduces a new SKB_GSO_SHARED_FRAG flag, set
      in skb_shinfo(skb)->gso_type if at least one frag can be
      modified by the user.
      
      Typical sources of such possible overwrites are {vm}splice(),
      sendfile(), and macvtap/tun/virtio_net drivers.
      
      Tested:
      
      $ netperf -H 7.7.8.84
      MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
      7.7.8.84 () port 0 AF_INET
      Recv   Send    Send
      Socket Socket  Message  Elapsed
      Size   Size    Size     Time     Throughput
      bytes  bytes   bytes    secs.    10^6bits/sec
      
       87380  16384  16384    10.00    3959.52
      
      $ netperf -H 7.7.8.84 -t TCP_SENDFILE
      TCP SENDFILE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 ()
      port 0 AF_INET
      Recv   Send    Send
      Socket Socket  Message  Elapsed
      Size   Size    Size     Time     Throughput
      bytes  bytes   bytes    secs.    10^6bits/sec
      
       87380  16384  16384    10.00    3216.80
      
      Performance of the SENDFILE is impacted by the extra allocation and
      copy, and because we use order-0 pages, while the TCP_STREAM uses
      bigger pages.
      Reported-by: NPravin Shelar <pshelar@nicira.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cef401de
  15. 11 1月, 2013 1 次提交
  16. 07 1月, 2013 1 次提交
  17. 27 12月, 2012 1 次提交
  18. 08 12月, 2012 1 次提交
    • Y
      tcp: bug fix Fast Open client retransmission · 93b174ad
      Yuchung Cheng 提交于
      If SYN-ACK partially acks SYN-data, the client retransmits the
      remaining data by tcp_retransmit_skb(). This increments lost recovery
      state variables like tp->retrans_out in Open state. If loss recovery
      happens before the retransmission is acked, it triggers the WARN_ON
      check in tcp_fastretrans_alert(). For example: the client sends
      SYN-data, gets SYN-ACK acking only ISN, retransmits data, sends
      another 4 data packets and get 3 dupacks.
      
      Since the retransmission is not caused by network drop it should not
      update the recovery state variables. Further the server may return a
      smaller MSS than the cached MSS used for SYN-data, so the retranmission
      needs a loop. Otherwise some data will not be retransmitted until timeout
      or other loss recovery events.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      93b174ad
  19. 14 11月, 2012 1 次提交
    • E
      tcp: tcp_replace_ts_recent() should not be called from tcp_validate_incoming() · bd090dfc
      Eric Dumazet 提交于
      We added support for RFC 5961 in latest kernels but TCP fails
      to perform exhaustive check of ACK sequence.
      
      We can update our view of peer tsval from a frame that is
      later discarded by tcp_ack()
      
      This makes timestamps enabled sessions vulnerable to injection of
      a high tsval : peers start an ACK storm, since the victim
      sends a dupack each time it receives an ACK from the other peer.
      
      As tcp_validate_incoming() is called before tcp_ack(), we should
      not peform tcp_replace_ts_recent() from it, and let callers do it
      at the right time.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Nandita Dukkipati <nanditad@google.com>
      Cc: H.K. Jerry Chu <hkchu@google.com>
      Cc: Romain Francoise <romain@orebokech.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bd090dfc
  20. 04 11月, 2012 1 次提交
    • E
      tcp: better retrans tracking for defer-accept · e6c022a4
      Eric Dumazet 提交于
      For passive TCP connections using TCP_DEFER_ACCEPT facility,
      we incorrectly increment req->retrans each time timeout triggers
      while no SYNACK is sent.
      
      SYNACK are not sent for TCP_DEFER_ACCEPT that were established (for
      which we received the ACK from client). Only the last SYNACK is sent
      so that we can receive again an ACK from client, to move the req into
      accept queue. We plan to change this later to avoid the useless
      retransmit (and potential problem as this SYNACK could be lost)
      
      TCP_INFO later gives wrong information to user, claiming imaginary
      retransmits.
      
      Decouple req->retrans field into two independent fields :
      
      num_retrans : number of retransmit
      num_timeout : number of timeouts
      
      num_timeout is the counter that is incremented at each timeout,
      regardless of actual SYNACK being sent or not, and used to
      compute the exponential timeout.
      
      Introduce inet_rtx_syn_ack() helper to increment num_retrans
      only if ->rtx_syn_ack() succeeded.
      
      Use inet_rtx_syn_ack() from tcp_check_req() to increment num_retrans
      when we re-send a SYNACK in answer to a (retransmitted) SYN.
      Prior to this patch, we were not counting these retransmits.
      
      Change tcp_v[46]_rtx_synack() to increment TCP_MIB_RETRANSSEGS
      only if a synack packet was successfully queued.
      Reported-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Julian Anastasov <ja@ssi.bg>
      Cc: Vijay Subramanian <subramanian.vijay@gmail.com>
      Cc: Elliott Hughes <enh@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e6c022a4
  21. 03 11月, 2012 1 次提交
  22. 23 10月, 2012 3 次提交
    • J
      tcp: Reject invalid ack_seq to Fast Open sockets · 37561f68
      Jerry Chu 提交于
      A packet with an invalid ack_seq may cause a TCP Fast Open socket to switch
      to the unexpected TCP_CLOSING state, triggering a BUG_ON kernel panic.
      
      When a FIN packet with an invalid ack_seq# arrives at a socket in
      the TCP_FIN_WAIT1 state, rather than discarding the packet, the current
      code will accept the FIN, causing state transition to TCP_CLOSING.
      
      This may be a small deviation from RFC793, which seems to say that the
      packet should be dropped. Unfortunately I did not expect this case for
      Fast Open hence it will trigger a BUG_ON panic.
      
      It turns out there is really nothing bad about a TFO socket going into
      TCP_CLOSING state so I could just remove the BUG_ON statements. But after
      some thought I think it's better to treat this case like TCP_SYN_RECV
      and return a RST to the confused peer who caused the unacceptable ack_seq
      to be generated in the first place.
      Signed-off-by: NH.K. Jerry Chu <hkchu@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      37561f68
    • Y
      tcp: add SYN/data info to TCP_INFO · 6f73601e
      Yuchung Cheng 提交于
      Add a bit TCPI_OPT_SYN_DATA (32) to the socket option TCP_INFO:tcpi_options.
      It's set if the data in SYN (sent or received) is acked by SYN-ACK. Server or
      client application can use this information to check Fast Open success rate.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6f73601e
    • E
      tcp: RFC 5961 5.2 Blind Data Injection Attack Mitigation · 354e4aa3
      Eric Dumazet 提交于
      RFC 5961 5.2 [Blind Data Injection Attack].[Mitigation]
      
        All TCP stacks MAY implement the following mitigation.  TCP stacks
        that implement this mitigation MUST add an additional input check to
        any incoming segment.  The ACK value is considered acceptable only if
        it is in the range of ((SND.UNA - MAX.SND.WND) <= SEG.ACK <=
        SND.NXT).  All incoming segments whose ACK value doesn't satisfy the
        above condition MUST be discarded and an ACK sent back.
      
      Move tcp_send_challenge_ack() before tcp_ack() to avoid a forward
      declaration.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Jerry Chu <hkchu@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      354e4aa3
  23. 23 9月, 2012 3 次提交
    • N
      tcp: TCP Fast Open Server - record retransmits after 3WHS · 30099b2e
      Neal Cardwell 提交于
      When recording the number of SYNACK retransmits for servers using TCP
      Fast Open, fix the code to ensure that we copy over the retransmit
      count from the request_sock after we receive the ACK that completes
      the 3-way handshake.
      
      The story here is similar to that of SYNACK RTT
      measurements. Previously we were always doing this in
      tcp_v4_syn_recv_sock(). However, for TCP Fast Open connections
      tcp_v4_conn_req_fastopen() calls tcp_v4_syn_recv_sock() at the time we
      receive the SYN. So for TFO we must copy the final SYNACK retransmit
      count in tcp_rcv_state_process().
      
      Note that copying over the SYNACK retransmit count will give us the
      correct count since, as is mentioned in a comment in
      tcp_retransmit_timer(), before we receive an ACK for our SYN-ACK a TFO
      passive connection does not retransmit anything else (e.g., data or
      FIN segments).
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      30099b2e
    • N
      tcp: TCP Fast Open Server - call tcp_validate_incoming() for all packets · e69bebde
      Neal Cardwell 提交于
      A TCP Fast Open (TFO) passive connection must call both
      tcp_check_req() and tcp_validate_incoming() for all incoming ACKs that
      are attempting to complete the 3WHS.
      
      This is needed to parallel all the action that happens for a non-TFO
      connection, where for an ACK that is attempting to complete the 3WHS
      we call both tcp_check_req() and tcp_validate_incoming().
      
      For example, upon receiving the ACK that completes the 3WHS, we need
      to call tcp_fast_parse_options() and update ts_recent based on the
      incoming timestamp value in the ACK.
      
      One symptom of the problem with the previous code was that for passive
      TFO connections using TCP timestamps, the outgoing TS ecr values
      ignored the incoming TS val value on the ACK that completed the 3WHS.
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e69bebde
    • N
      tcp: TCP Fast Open Server - take SYNACK RTT after completing 3WHS · 016818d0
      Neal Cardwell 提交于
      When taking SYNACK RTT samples for servers using TCP Fast Open, fix
      the code to ensure that we only call tcp_valid_rtt_meas() after we
      receive the ACK that completes the 3-way handshake.
      
      Previously we were always taking an RTT sample in
      tcp_v4_syn_recv_sock(). However, for TCP Fast Open connections
      tcp_v4_conn_req_fastopen() calls tcp_v4_syn_recv_sock() at the time we
      receive the SYN. So for TFO we must wait until tcp_rcv_state_process()
      to take the RTT sample.
      
      To fix this, we wait until after TFO calls tcp_v4_syn_recv_sock()
      before we set the snt_synack timestamp, since tcp_synack_rtt_meas()
      already ensures that we only take a SYNACK RTT sample if snt_synack is
      non-zero. To be careful, we only take a snt_synack timestamp when
      a SYNACK transmit or retransmit succeeds.
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      016818d0
  24. 19 9月, 2012 1 次提交
  25. 04 9月, 2012 3 次提交
  26. 01 9月, 2012 2 次提交
    • J
      tcp: TCP Fast Open Server - main code path · 168a8f58
      Jerry Chu 提交于
      This patch adds the main processing path to complete the TFO server
      patches.
      
      A TFO request (i.e., SYN+data packet with a TFO cookie option) first
      gets processed in tcp_v4_conn_request(). If it passes the various TFO
      checks by tcp_fastopen_check(), a child socket will be created right
      away to be accepted by applications, rather than waiting for the 3WHS
      to finish.
      
      In additon to the use of TFO cookie, a simple max_qlen based scheme
      is put in place to fend off spoofed TFO attack.
      
      When a valid ACK comes back to tcp_rcv_state_process(), it will cause
      the state of the child socket to switch from either TCP_SYN_RECV to
      TCP_ESTABLISHED, or TCP_FIN_WAIT1 to TCP_FIN_WAIT2. At this time
      retransmission will resume for any unack'ed (data, FIN,...) segments.
      Signed-off-by: NH.K. Jerry Chu <hkchu@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      168a8f58
    • J
      tcp: TCP Fast Open Server - header & support functions · 10467163
      Jerry Chu 提交于
      This patch adds all the necessary data structure and support
      functions to implement TFO server side. It also documents a number
      of flags for the sysctl_tcp_fastopen knob, and adds a few Linux
      extension MIBs.
      
      In addition, it includes the following:
      
      1. a new TCP_FASTOPEN socket option an application must call to
      supply a max backlog allowed in order to enable TFO on its listener.
      
      2. A number of key data structures:
      "fastopen_rsk" in tcp_sock - for a big socket to access its
      request_sock for retransmission and ack processing purpose. It is
      non-NULL iff 3WHS not completed.
      
      "fastopenq" in request_sock_queue - points to a per Fast Open
      listener data structure "fastopen_queue" to keep track of qlen (# of
      outstanding Fast Open requests) and max_qlen, among other things.
      
      "listener" in tcp_request_sock - to point to the original listener
      for book-keeping purpose, i.e., to maintain qlen against max_qlen
      as part of defense against IP spoofing attack.
      
      3. various data structure and functions, many in tcp_fastopen.c, to
      support server side Fast Open cookie operations, including
      /proc/sys/net/ipv4/tcp_fastopen_key to allow manual rekeying.
      Signed-off-by: NH.K. Jerry Chu <hkchu@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      10467163
  27. 25 8月, 2012 1 次提交
  28. 16 8月, 2012 1 次提交
  29. 07 8月, 2012 1 次提交
    • E
      tcp: ecn: dont delay ACKS after CE · aae06bf5
      Eric Dumazet 提交于
      While playing with CoDel and ECN marking, I discovered a
      non optimal behavior of receiver of CE (Congestion Encountered)
      segments.
      
      In pathological cases, sender has reduced its cwnd to low values,
      and receiver delays its ACK (by 40 ms).
      
      While RFC 3168 6.1.3 (The TCP Receiver) doesn't explicitly recommend
      to send immediate ACKS, we believe its better to not delay ACKS, because
      a CE segment should give same signal than a dropped segment, and its
      quite important to reduce RTT to give ECE/CWR signals as fast as
      possible.
      
      Note we already call tcp_enter_quickack_mode() from TCP_ECN_check_ce()
      if we receive a retransmit, for the same reason.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      aae06bf5