1. 04 2月, 2015 2 次提交
    • A
      net: switch memcpy_fromiovec()/memcpy_fromiovecend() users to copy_from_iter() · 21226abb
      Al Viro 提交于
      That takes care of the majority of ->sendmsg() instances - most of them
      via memcpy_to_msg() or assorted getfrag() callbacks.  One place where we
      still keep memcpy_fromiovecend() is tipc - there we potentially read the
      same data over and over; separate patch, that...
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      21226abb
    • A
      ip: convert tcp_sendmsg() to iov_iter primitives · 57be5bda
      Al Viro 提交于
      patch is actually smaller than it seems to be - most of it is unindenting
      the inner loop body in tcp_sendmsg() itself...
      
      the bit in tcp_input.c is going to get reverted very soon - that's what
      memcpy_from_msg() will become, but not in this commit; let's keep it
      reasonably contained...
      
      There's one potentially subtle change here: in case of short copy from
      userland, mainline tcp_send_syn_data() discards the skb it has allocated
      and falls back to normal path, where we'll send as much as possible after
      rereading the same data again.  This patch trims SYN+data skb instead -
      that way we don't need to copy from the same place twice.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      57be5bda
  2. 03 2月, 2015 1 次提交
    • F
      net: dctcp: loosen requirement to assert ECT(0) during 3WHS · 843c2fdf
      Florian Westphal 提交于
      One deployment requirement of DCTCP is to be able to run
      in a DC setting along with TCP traffic. As Glenn Judd's
      NSDI'15 paper "Attaining the Promise and Avoiding the Pitfalls
      of TCP in the Datacenter" [1] (tba) explains, one way to
      solve this on switch side is to split DCTCP and TCP traffic
      in two queues per switch port based on the DSCP: one queue
      soley intended for DCTCP traffic and one for non-DCTCP traffic.
      
      For the DCTCP queue, there's the marking threshold K as
      explained in commit e3118e83 ("net: tcp: add DCTCP congestion
      control algorithm") for RED marking ECT(0) packets with CE.
      For the non-DCTCP queue, there's f.e. a classic tail drop queue.
      As already explained in e3118e83, running DCTCP at scale
      when not marking SYN/SYN-ACK packets with ECT(0) has severe
      consequences as for non-ECT(0) packets, traversing the RED
      marking DCTCP queue will result in a severe reduction of
      connection probability.
      
      This is due to the DCTCP queue being dominated by ECT(0) traffic
      and switches handle non-ECT traffic in the RED marking queue
      after passing K as drops, where K is usually a low watermark
      in order to leave enough tailroom for bursts. Splitting DCTCP
      traffic among several queues (ECN and non-ECN queue) is being
      considered a terrible idea in the network community as it
      splits single flows across multiple network paths.
      
      Therefore, commit e3118e83 implements this on Linux as
      ECT(0) marked traffic, as we argue that marking all packets
      of a DCTCP flow is the only viable solution and also doesn't
      speak against the draft.
      
      However, recently, a DCTCP implementation for FreeBSD hit also
      their mainline kernel [2]. In order to let them play well
      together with Linux' DCTCP, we would need to loosen the
      requirement that ECT(0) has to be asserted during the 3WHS as
      not implemented in FreeBSD. This simplifies the ECN test and
      lets DCTCP work together with FreeBSD.
      
      Joint work with Daniel Borkmann.
      
        [1] https://www.usenix.org/conference/nsdi15/technical-sessions/presentation/judd
        [2] https://github.com/freebsd/freebsd/commit/8ad879445281027858a7fa706d13e458095b595fSigned-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Cc: Glenn Judd <glenn.judd@morganstanley.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      843c2fdf
  3. 01 2月, 2015 1 次提交
  4. 14 1月, 2015 1 次提交
    • S
      tcp: avoid reducing cwnd when ACK+DSACK is received · 08abdffa
      Sébastien Barré 提交于
      With TLP, the peer may reply to a probe with an
      ACK+D-SACK, with ack value set to tlp_high_seq. In the current code,
      such ACK+DSACK will be missed and only at next, higher ack will the TLP
      episode be considered done. Since the DSACK is not present anymore,
      this will cost a cwnd reduction.
      
      This patch ensures that this scenario does not cause a cwnd reduction, since
      receiving an ACK+DSACK indicates that both the initial segment and the probe
      have been received by the peer.
      
      The following packetdrill test, from Neal Cardwell, validates this patch:
      
      // Establish a connection.
      0     socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
      +0     setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
      +0    bind(3, ..., ...) = 0
      +0    listen(3, 1) = 0
      
      +0    < S 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7>
      +0    > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 6>
      +.020 < . 1:1(0) ack 1 win 257
      +0    accept(3, ..., ...) = 4
      
      // Send 1 packet.
      +0    write(4, ..., 1000) = 1000
      +0    > P. 1:1001(1000) ack 1
      
      // Loss probe retransmission.
      // packets_out == 1 => schedule PTO in max(2*RTT, 1.5*RTT + 200ms)
      // In this case, this means: 1.5*RTT + 200ms = 230ms
      +.230 > P. 1:1001(1000) ack 1
      +0    %{ assert tcpi_snd_cwnd == 10 }%
      
      // Receiver ACKs at tlp_high_seq with a DSACK,
      // indicating they received the original packet and probe.
      +.020 < . 1:1(0) ack 1001 win 257 <sack 1:1001,nop,nop>
      +0    %{ assert tcpi_snd_cwnd == 10 }%
      
      // Send another packet.
      +0    write(4, ..., 1000) = 1000
      +0    > P. 1001:2001(1000) ack 1
      
      // Receiver ACKs above tlp_high_seq, which should end the TLP episode
      // if we haven't already. We should not reduce cwnd.
      +.020 < . 1:1(0) ack 2001 win 257
      +0    %{ assert tcpi_snd_cwnd == 10, tcpi_snd_cwnd }%
      
      Credits:
      -Gregory helped in finding that tcp_process_tlp_ack was where the cwnd
      got reduced in our MPTCP tests.
      -Neal wrote the packetdrill test above
      -Yuchung reworked the patch to make it more readable.
      
      Cc: Gregory Detal <gregory.detal@uclouvain.be>
      Cc: Nandita Dukkipati <nanditad@google.com>
      Tested-by: NNeal Cardwell <ncardwell@google.com>
      Reviewed-by: NYuchung Cheng <ycheng@google.com>
      Reviewed-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NSébastien Barré <sebastien.barre@uclouvain.be>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      08abdffa
  5. 10 12月, 2014 1 次提交
  6. 24 11月, 2014 1 次提交
  7. 22 11月, 2014 1 次提交
    • C
      tcp: Restore RFC5961-compliant behavior for SYN packets · 0c228e83
      Calvin Owens 提交于
      Commit c3ae62af ("tcp: should drop incoming frames without ACK
      flag set") was created to mitigate a security vulnerability in which a
      local attacker is able to inject data into locally-opened sockets by
      using TCP protocol statistics in procfs to quickly find the correct
      sequence number.
      
      This broke the RFC5961 requirement to send a challenge ACK in response
      to spurious RST packets, which was subsequently fixed by commit
      7b514a88 ("tcp: accept RST without ACK flag").
      
      Unfortunately, the RFC5961 requirement that spurious SYN packets be
      handled in a similar manner remains broken.
      
      RFC5961 section 4 states that:
      
         ... the handling of the SYN in the synchronized state SHOULD be
         performed as follows:
      
         1) If the SYN bit is set, irrespective of the sequence number, TCP
            MUST send an ACK (also referred to as challenge ACK) to the remote
            peer:
      
            <SEQ=SND.NXT><ACK=RCV.NXT><CTL=ACK>
      
            After sending the acknowledgment, TCP MUST drop the unacceptable
            segment and stop processing further.
      
         By sending an ACK, the remote peer is challenged to confirm the loss
         of the previous connection and the request to start a new connection.
         A legitimate peer, after restart, would not have a TCB in the
         synchronized state.  Thus, when the ACK arrives, the peer should send
         a RST segment back with the sequence number derived from the ACK
         field that caused the RST.
      
         This RST will confirm that the remote peer has indeed closed the
         previous connection.  Upon receipt of a valid RST, the local TCP
         endpoint MUST terminate its connection.  The local TCP endpoint
         should then rely on SYN retransmission from the remote end to
         re-establish the connection.
      
      This patch lets SYN packets through the discard added in c3ae62af,
      so that spurious SYN packets are properly dealt with as per the RFC.
      
      The challenge ACK is sent unconditionally and is rate-limited, so the
      original vulnerability is not reintroduced by this patch.
      Signed-off-by: NCalvin Owens <calvinowens@fb.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0c228e83
  8. 12 11月, 2014 1 次提交
    • J
      net: Convert LIMIT_NETDEBUG to net_dbg_ratelimited · ba7a46f1
      Joe Perches 提交于
      Use the more common dynamic_debug capable net_dbg_ratelimited
      and remove the LIMIT_NETDEBUG macro.
      
      All messages are still ratelimited.
      
      Some KERN_<LEVEL> uses are changed to KERN_DEBUG.
      
      This may have some negative impact on messages that were
      emitted at KERN_INFO that are not not enabled at all unless
      DEBUG is defined or dynamic_debug is enabled.  Even so,
      these messages are now _not_ emitted by default.
      
      This also eliminates the use of the net_msg_warn sysctl
      "/proc/sys/net/core/warnings".  For backward compatibility,
      the sysctl is not removed, but it has no function.  The extern
      declaration of net_msg_warn is removed from sock.h and made
      static in net/core/sysctl_net_core.c
      
      Miscellanea:
      
      o Update the sysctl documentation
      o Remove the embedded uses of pr_fmt
      o Coalesce format fragments
      o Realign arguments
      Signed-off-by: NJoe Perches <joe@perches.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ba7a46f1
  9. 06 11月, 2014 1 次提交
    • M
      tcp: zero retrans_stamp if all retrans were acked · 1f37bf87
      Marcelo Leitner 提交于
      Ueki Kohei reported that when we are using NewReno with connections that
      have a very low traffic, we may timeout the connection too early if a
      second loss occurs after the first one was successfully acked but no
      data was transfered later. Below is his description of it:
      
      When SACK is disabled, and a socket suffers multiple separate TCP
      retransmissions, that socket's ETIMEDOUT value is calculated from the
      time of the *first* retransmission instead of the *latest*
      retransmission.
      
      This happens because the tcp_sock's retrans_stamp is set once then never
      cleared.
      
      Take the following connection:
      
                            Linux                    remote-machine
                              |                           |
               send#1---->(*1)|--------> data#1 --------->|
                        |     |                           |
                       RTO    :                           :
                        |     |                           |
                       ---(*2)|----> data#1(retrans) ---->|
                        | (*3)|<---------- ACK <----------|
                        |     |                           |
                        |     :                           :
                        |     :                           :
                        |     :                           :
                      16 minutes (or more)                :
                        |     :                           :
                        |     :                           :
                        |     :                           :
                        |     |                           |
               send#2---->(*4)|--------> data#2 --------->|
                        |     |                           |
                       RTO    :                           :
                        |     |                           |
                       ---(*5)|----> data#2(retrans) ---->|
                        |     |                           |
                        |     |                           |
                      RTO*2   :                           :
                        |     |                           |
                        |     |                           |
            ETIMEDOUT<----(*6)|                           |
      
      (*1) One data packet sent.
      (*2) Because no ACK packet is received, the packet is retransmitted.
      (*3) The ACK packet is received. The transmitted packet is acknowledged.
      
      At this point the first "retransmission event" has passed and been
      recovered from. Any future retransmission is a completely new "event".
      
      (*4) After 16 minutes (to correspond with retries2=15), a new data
      packet is sent. Note: No data is transmitted between (*3) and (*4).
      
      The socket's timeout SHOULD be calculated from this point in time, but
      instead it's calculated from the prior "event" 16 minutes ago.
      
      (*5) Because no ACK packet is received, the packet is retransmitted.
      (*6) At the time of the 2nd retransmission, the socket returns
      ETIMEDOUT.
      
      Therefore, now we clear retrans_stamp as soon as all data during the
      loss window is fully acked.
      
      Reported-by: Ueki Kohei
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: NMarcelo Ricardo Leitner <mleitner@redhat.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Tested-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1f37bf87
  10. 05 11月, 2014 1 次提交
    • F
      net: allow setting ecn via routing table · f7b3bec6
      Florian Westphal 提交于
      This patch allows to set ECN on a per-route basis in case the sysctl
      tcp_ecn is not set to 1. In other words, when ECN is set for specific
      routes, it provides a tcp_ecn=1 behaviour for that route while the rest
      of the stack acts according to the global settings.
      
      One can use 'ip route change dev $dev $net features ecn' to toggle this.
      
      Having a more fine-grained per-route setting can be beneficial for various
      reasons, for example, 1) within data centers, or 2) local ISPs may deploy
      ECN support for their own video/streaming services [1], etc.
      
      There was a recent measurement study/paper [2] which scanned the Alexa's
      publicly available top million websites list from a vantage point in US,
      Europe and Asia:
      
      Half of the Alexa list will now happily use ECN (tcp_ecn=2, most likely
      blamed to commit 255cac91 ("tcp: extend ECN sysctl to allow server-side
      only ECN") ;)); the break in connectivity on-path was found is about
      1 in 10,000 cases. Timeouts rather than receiving back RSTs were much
      more common in the negotiation phase (and mostly seen in the Alexa
      middle band, ranks around 50k-150k): from 12-thousand hosts on which
      there _may_ be ECN-linked connection failures, only 79 failed with RST
      when _not_ failing with RST when ECN is not requested.
      
      It's unclear though, how much equipment in the wild actually marks CE
      when buffers start to fill up.
      
      We thought about a fallback to non-ECN for retransmitted SYNs as another
      global option (which could perhaps one day be made default), but as Eric
      points out, there's much more work needed to detect broken middleboxes.
      
      Two examples Eric mentioned are buggy firewalls that accept only a single
      SYN per flow, and middleboxes that successfully let an ECN flow establish,
      but later mark CE for all packets (so cwnd converges to 1).
      
       [1] http://www.ietf.org/proceedings/89/slides/slides-89-tsvarea-1.pdf, p.15
       [2] http://ecn.ethz.ch/
      
      Joint work with Daniel Borkmann.
      
      Reference: http://thread.gmane.org/gmane.linux.network/335797Suggested-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f7b3bec6
  11. 31 10月, 2014 2 次提交
  12. 30 10月, 2014 1 次提交
    • E
      tcp: allow for bigger reordering level · dca145ff
      Eric Dumazet 提交于
      While testing upcoming Yaogong patch (converting out of order queue
      into an RB tree), I hit the max reordering level of linux TCP stack.
      
      Reordering level was limited to 127 for no good reason, and some
      network setups [1] can easily reach this limit and get limited
      throughput.
      
      Allow a new max limit of 300, and add a sysctl to allow admins to even
      allow bigger (or lower) values if needed.
      
      [1] Aggregation of links, per packet load balancing, fabrics not doing
       deep packet inspections, alternative TCP congestion modules...
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Yaogong Wang <wygivan@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      dca145ff
  13. 15 10月, 2014 1 次提交
    • E
      tcp: fix tcp_ack() performance problem · ad971f61
      Eric Dumazet 提交于
      We worked hard to improve tcp_ack() performance, by not accessing
      skb_shinfo() in fast path (cd7d8498 tcp: change tcp_skb_pcount()
      location)
      
      We still have one spurious access because of ACK timestamping,
      added in commit e1c8a607 ("net-timestamp: ACK timestamp for
      bytestreams")
      
      By checking if sk_tsflags has SOF_TIMESTAMPING_TX_ACK set,
      we can avoid two cache line misses for the common case.
      
      While we are at it, add two prefetchw() :
      
      One in tcp_ack() to bring skb at the head of write queue.
      
      One in tcp_clean_rtx_queue() loop to bring following skb,
      as we will delete skb from the write queue and dirty skb->next->prev.
      
      Add a couple of [un]likely() clauses.
      
      After this patch, tcp_ack() is no longer the most consuming
      function in tcp stack.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Van Jacobson <vanj@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ad971f61
  14. 30 9月, 2014 2 次提交
  15. 29 9月, 2014 5 次提交
  16. 28 9月, 2014 2 次提交
  17. 24 9月, 2014 1 次提交
  18. 23 9月, 2014 1 次提交
  19. 20 9月, 2014 1 次提交
  20. 16 9月, 2014 3 次提交
  21. 06 9月, 2014 2 次提交
  22. 23 8月, 2014 1 次提交
    • Y
      tcp: improve undo on timeout · 989e04c5
      Yuchung Cheng 提交于
      Upon timeout, undo (via both timestamps/Eifel and DSACKs) was
      disabled if any retransmits were still in flight.  The concern was
      perhaps that spurious retransmission sent in a previous recovery
      episode may trigger DSACKs to falsely undo the current recovery.
      
      However, this inadvertently misses undo opportunities (using either
      TCP timestamps or DSACKs) when timeout occurs during a loss episode,
      i.e.  recurring timeouts or timeout during fast recovery. In these
      cases some retransmissions will be in flight but we should allow
      undo. Furthermore, we should only reset undo_marker and undo_retrans
      upon timeout if we are starting a new recovery episode. Finally,
      when we do reset our undo state, we now do so in a manner similar
      to tcp_enter_recovery(), so that we require a DSACK for each of
      the outstsanding retransmissions. This will achieve the original
      goal by requiring that we receive the same number of DSACKs as
      retransmissions.
      
      This patch increases the undo events by 50% on Google servers.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      989e04c5
  23. 15 8月, 2014 2 次提交
    • N
      tcp: fix ssthresh and undo for consecutive short FRTO episodes · 0c9ab092
      Neal Cardwell 提交于
      Fix TCP FRTO logic so that it always notices when snd_una advances,
      indicating that any RTO after that point will be a new and distinct
      loss episode.
      
      Previously there was a very specific sequence that could cause FRTO to
      fail to notice a new loss episode had started:
      
      (1) RTO timer fires, enter FRTO and retransmit packet 1 in write queue
      (2) receiver ACKs packet 1
      (3) FRTO sends 2 more packets
      (4) RTO timer fires again (should start a new loss episode)
      
      The problem was in step (3) above, where tcp_process_loss() returned
      early (in the spot marked "Step 2.b"), so that it never got to the
      logic to clear icsk_retransmits. Thus icsk_retransmits stayed
      non-zero. Thus in step (4) tcp_enter_loss() would see the non-zero
      icsk_retransmits, decide that this RTO is not a new episode, and
      decide not to cut ssthresh and remember the current cwnd and ssthresh
      for undo.
      
      There were two main consequences to the bug that we have
      observed. First, ssthresh was not decreased in step (4). Second, when
      there was a series of such FRTO (1-4) sequences that happened to be
      followed by an FRTO undo, we would restore the cwnd and ssthresh from
      before the entire series started (instead of the cwnd and ssthresh
      from before the most recent RTO). This could result in cwnd and
      ssthresh being restored to values much bigger than the proper values.
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Fixes: e33099f9 ("tcp: implement RFC5682 F-RTO")
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0c9ab092
    • H
      tcp: don't allow syn packets without timestamps to pass tcp_tw_recycle logic · a26552af
      Hannes Frederic Sowa 提交于
      tcp_tw_recycle heavily relies on tcp timestamps to build a per-host
      ordering of incoming connections and teardowns without the need to
      hold state on a specific quadruple for TCP_TIMEWAIT_LEN, but only for
      the last measured RTO. To do so, we keep the last seen timestamp in a
      per-host indexed data structure and verify if the incoming timestamp
      in a connection request is strictly greater than the saved one during
      last connection teardown. Thus we can verify later on that no old data
      packets will be accepted by the new connection.
      
      During moving a socket to time-wait state we already verify if timestamps
      where seen on a connection. Only if that was the case we let the
      time-wait socket expire after the RTO, otherwise normal TCP_TIMEWAIT_LEN
      will be used. But we don't verify this on incoming SYN packets. If a
      connection teardown was less than TCP_PAWS_MSL seconds in the past we
      cannot guarantee to not accept data packets from an old connection if
      no timestamps are present. We should drop this SYN packet. This patch
      closes this loophole.
      
      Please note, this patch does not make tcp_tw_recycle in any way more
      usable but only adds another safety check:
      Sporadic drops of SYN packets because of reordering in the network or
      in the socket backlog queues can happen. Users behing NAT trying to
      connect to a tcp_tw_recycle enabled server can get caught in blackholes
      and their connection requests may regullary get dropped because hosts
      behind an address translator don't have synchronized tcp timestamp clocks.
      tcp_tw_recycle cannot work if peers don't have tcp timestamps enabled.
      
      In general, use of tcp_tw_recycle is disadvised.
      
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Florian Westphal <fw@strlen.de>
      Signed-off-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a26552af
  24. 14 8月, 2014 1 次提交
  25. 06 8月, 2014 2 次提交
    • W
      net-timestamp: ACK timestamp for bytestreams · e1c8a607
      Willem de Bruijn 提交于
      Add SOF_TIMESTAMPING_TX_ACK, a request for a tstamp when the last byte
      in the send() call is acknowledged. It implements the feature for TCP.
      
      The timestamp is generated when the TCP socket cumulative ACK is moved
      beyond the tracked seqno for the first time. The feature ignores SACK
      and FACK, because those acknowledge the specific byte, but not
      necessarily the entire contents of the buffer up to that byte.
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e1c8a607
    • N
      tcp: reduce spurious retransmits due to transient SACK reneging · 5ae344c9
      Neal Cardwell 提交于
      This commit reduces spurious retransmits due to apparent SACK reneging
      by only reacting to SACK reneging that persists for a short delay.
      
      When a sequence space hole at snd_una is filled, some TCP receivers
      send a series of ACKs as they apparently scan their out-of-order queue
      and cumulatively ACK all the packets that have now been consecutiveyly
      received. This is essentially misbehavior B in "Misbehaviors in TCP
      SACK generation" ACM SIGCOMM Computer Communication Review, April
      2011, so we suspect that this is from several common OSes (Windows
      2000, Windows Server 2003, Windows XP). However, this issue has also
      been seen in other cases, e.g. the netdev thread "TCP being hoodwinked
      into spurious retransmissions by lack of timestamps?" from March 2014,
      where the receiver was thought to be a BSD box.
      
      Since snd_una would temporarily be adjacent to a previously SACKed
      range in these scenarios, this receiver behavior triggered the Linux
      SACK reneging code path in the sender. This led the sender to clear
      the SACK scoreboard, enter CA_Loss, and spuriously retransmit
      (potentially) every packet from the entire write queue at line rate
      just a few milliseconds before the ACK for each packet arrives at the
      sender.
      
      To avoid such situations, now when a sender sees apparent reneging it
      does not yet retransmit, but rather adjusts the RTO timer to give the
      receiver a little time (max(RTT/2, 10ms)) to send us some more ACKs
      that will restore sanity to the SACK scoreboard. If the reneging
      persists until this RTO then, as before, we clear the SACK scoreboard
      and enter CA_Loss.
      
      A 10ms delay tolerates a receiver sending such a stream of ACKs at
      56Kbit/sec. And to allow for receivers with slower or more congested
      paths, we wait for at least RTT/2.
      
      We validated the resulting max(RTT/2, 10ms) delay formula with a mix
      of North American and South American Google web server traffic, and
      found that for ACKs displaying transient reneging:
      
       (1) 90% of inter-ACK delays were less than 10ms
       (2) 99% of inter-ACK delays were less than RTT/2
      
      In tests on Google web servers this commit reduced reneging events by
      75%-90% (as measured by the TcpExtTCPSACKReneging counter), without
      any measurable impact on latency for user HTTP and SPDY requests.
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5ae344c9
  26. 16 7月, 2014 1 次提交
  27. 08 7月, 2014 1 次提交
    • Y
      tcp: fix false undo corner cases · 6e08d5e3
      Yuchung Cheng 提交于
      The undo code assumes that, upon entering loss recovery, TCP
      1) always retransmit something
      2) the retransmission never fails locally (e.g., qdisc drop)
      
      so undo_marker is set in tcp_enter_recovery() and undo_retrans is
      incremented only when tcp_retransmit_skb() is successful.
      
      When the assumption is broken because TCP's cwnd is too small to
      retransmit or the retransmit fails locally. The next (DUP)ACK
      would incorrectly revert the cwnd and the congestion state in
      tcp_try_undo_dsack() or tcp_may_undo(). Subsequent (DUP)ACKs
      may enter the recovery state. The sender repeatedly enter and
      (incorrectly) exit recovery states if the retransmits continue to
      fail locally while receiving (DUP)ACKs.
      
      The fix is to initialize undo_retrans to -1 and start counting on
      the first retransmission. Always increment undo_retrans even if the
      retransmissions fail locally because they couldn't cause DSACKs to
      undo the cwnd reduction.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6e08d5e3