1. 10 7月, 2015 2 次提交
    • Y
      tcp: update congestion state first before raising cwnd · b20a3fa3
      Yuchung Cheng 提交于
      The congestion state and cwnd can be updated in the wrong order.
      For example, upon receiving a dubious ACK, we incorrectly raise
      the cwnd first (tcp_may_raise_cwnd()/tcp_cong_avoid()) because
      the state is still Open, then enter recovery state to reduce cwnd.
      
      For another example, if the ACK indicates spurious timeout or
      retransmits, we first revert the cwnd reduction and congestion
      state back to Open state.  But we don't raise the cwnd even though
      the ACK does not indicate any congestion.
      
      To fix this problem we should first call tcp_fastretrans_alert() to
      process the dubious ACK and update the congestion state, then call
      tcp_may_raise_cwnd() that raises cwnd based on the current state.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NNandita Dukkipati <nanditad@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b20a3fa3
    • J
      tcp: v1 always send a quick ack when quickacks are enabled · 2251ae46
      Jon Maxwell 提交于
      V1 of this patch contains Eric Dumazet's suggestion to move the per
      dst RTAX_QUICKACK check into tcp_in_quickack_mode(). Thanks Eric.
      
      I ran some tests and after setting the "ip route change quickack 1"
      knob there were still many delayed ACKs sent. This occured
      because when icsk_ack.quick=0 the !icsk_ack.pingpong value is
      subsequently ignored as tcp_in_quickack_mode() checks both these
      values. The condition for a quick ack to trigger requires
      that both icsk_ack.quick != 0 and icsk_ack.pingpong=0. Currently
      only icsk_ack.pingpong is controlled by the knob. But the
      icsk_ack.quick value changes dynamically depending on heuristics.
      The crux of the matter is that delayed acks still cannot be entirely
      disabled even with the RTAX_QUICKACK per dst knob enabled. This
      patch ensures that a quick ack is always sent when the RTAX_QUICKACK
      per dst knob is turned on.
      
      The "ip route change quickack 1" knob was recently added to enable
      quickacks. It was modeled around the TCP_QUICKACK setsockopt() option.
      This issue is that even with "ip route change quickack 1" enabled
      we still see delayed ACKs under some conditions. It would be nice
      to be able to completely disable delayed ACKs.
      
      Here is an example:
      
      # netstat -s|grep dela
          3 delayed acks sent
      
      For all routes enable the knob
      
      # ip route change quickack 1
      
      Generate some traffic across a slow link and we still see the delayed
      acks.
      
      # netstat -s|grep dela
          106 delayed acks sent
          1 delayed acks further delayed because of locked socket
      
      The issue is that both the "ip route change quickack 1" knob and
      the TCP_QUICKACK option set the icsk_ack.pingpong variable to 0.
      However at the business end in the __tcp_ack_snd_check() routine,
      tcp_in_quickack_mode() checks that both icsk_ack.quick != 0
      and icsk_ack.pingpong=0 in order to trigger a quickack. As
      icsk_ack.quick is determined by heuristics it can be 0. When
      that occurs the icsk_ack.pingpong value is ignored and a delayed
      ACK is sent regardless.
      
      This patch moves the RTAX_QUICKACK per dst check into the
      tcp_in_quickack_mode() routine which ensures that a quickack is
      always sent when the quickack knob is enabled for that dst.
      Signed-off-by: NJon Maxwell <jmaxwell37@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2251ae46
  2. 09 7月, 2015 2 次提交
    • Y
      tcp: PRR uses CRB mode by default and SS mode conditionally · 3759824d
      Yuchung Cheng 提交于
      PRR slow start is often too aggressive especially when drops are
      caused by traffic policers. The policers mainly use token bucket
      to enforce the rate so sending (twice) faster than the delivery
      rate causes excessive drops.
      
      This patch changes PRR to the conservative reduction bound
      (CRB) mode in RFC 6937 by default. CRB follows the packet
      conservation rule to send at most the delivery rate by default.
      
      But if many packets are lost and the pipe is empty, CRB may take N
      round trips to repair N losses. We conditionally turn on slow start
      mode if all these conditions are made to speed up the recovery:
      
        1) on the second round or later in recovery
        2) retransmission sent in the previous round is delivered on this ACK
        3) no retransmission is marked lost on this ACK
      
      By using packet conservation by default, this change reduces the loss
      retransmits signicantly on networks that deploy traffic policers,
      up to 20% reduction of overall loss rate.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNandita Dukkipati <nanditad@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3759824d
    • Y
      tcp: reduce cwnd if retransmit is lost in CA_Loss · 291a00d1
      Yuchung Cheng 提交于
      If the retransmission in CA_Loss is lost again, we should not
      continue to slow start or raise cwnd in congestion avoidance mode.
      Instead we should enter fast recovery and use PRR to reduce cwnd,
      following the principle in RFC5681:
      
      "... or the loss of a retransmission, should be taken as two
       indications of congestion and, therefore, cwnd (and ssthresh) MUST
       be lowered twice in this case."
      
      This is especially important to reduce loss when the CA_Loss
      state was caused by a traffic policer dropping the entire inflight.
      The CA_Loss state has a problem where a loss of L packets causes the
      sender to send a burst of L packets. So a policer that's dropping
      most packets in a given RTT can cause a huge retransmit storm. By
      contrast, PRR includes logic to bound the number of outbound packets
      that result from a given ACK. So switching to CA_Recovery on lost
      retransmits in CA_Loss avoids this retransmit storm problem when
      in CA_Loss.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNandita Dukkipati <nanditad@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      291a00d1
  3. 12 6月, 2015 2 次提交
  4. 11 6月, 2015 1 次提交
  5. 23 5月, 2015 1 次提交
    • E
      tcp: fix a potential deadlock in tcp_get_info() · d654976c
      Eric Dumazet 提交于
      Taking socket spinlock in tcp_get_info() can deadlock, as
      inet_diag_dump_icsk() holds the &hashinfo->ehash_locks[i],
      while packet processing can use the reverse locking order.
      
      We could avoid this locking for TCP_LISTEN states, but lockdep would
      certainly get confused as all TCP sockets share same lockdep classes.
      
      [  523.722504] ======================================================
      [  523.728706] [ INFO: possible circular locking dependency detected ]
      [  523.734990] 4.1.0-dbg-DEV #1676 Not tainted
      [  523.739202] -------------------------------------------------------
      [  523.745474] ss/18032 is trying to acquire lock:
      [  523.750002]  (slock-AF_INET){+.-...}, at: [<ffffffff81669d44>] tcp_get_info+0x2c4/0x360
      [  523.758129]
      [  523.758129] but task is already holding lock:
      [  523.763968]  (&(&hashinfo->ehash_locks[i])->rlock){+.-...}, at: [<ffffffff816bcb75>] inet_diag_dump_icsk+0x1d5/0x6c0
      [  523.774661]
      [  523.774661] which lock already depends on the new lock.
      [  523.774661]
      [  523.782850]
      [  523.782850] the existing dependency chain (in reverse order) is:
      [  523.790326]
      -> #1 (&(&hashinfo->ehash_locks[i])->rlock){+.-...}:
      [  523.796599]        [<ffffffff811126bb>] lock_acquire+0xbb/0x270
      [  523.802565]        [<ffffffff816f5868>] _raw_spin_lock+0x38/0x50
      [  523.808628]        [<ffffffff81665af8>] __inet_hash_nolisten+0x78/0x110
      [  523.815273]        [<ffffffff816819db>] tcp_v4_syn_recv_sock+0x24b/0x350
      [  523.822067]        [<ffffffff81684d41>] tcp_check_req+0x3c1/0x500
      [  523.828199]        [<ffffffff81682d09>] tcp_v4_do_rcv+0x239/0x3d0
      [  523.834331]        [<ffffffff816842fe>] tcp_v4_rcv+0xa8e/0xc10
      [  523.840202]        [<ffffffff81658fa3>] ip_local_deliver_finish+0x133/0x3e0
      [  523.847214]        [<ffffffff81659a9a>] ip_local_deliver+0xaa/0xc0
      [  523.853440]        [<ffffffff816593b8>] ip_rcv_finish+0x168/0x5c0
      [  523.859624]        [<ffffffff81659db7>] ip_rcv+0x307/0x420
      
      Lets use u64_sync infrastructure instead. As a bonus, 64bit
      arches get optimized, as these are nop for them.
      
      Fixes: 0df48c26 ("tcp: add tcpi_bytes_acked to tcp_info")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d654976c
  6. 20 5月, 2015 2 次提交
    • Y
      tcp: don't over-send F-RTO probes · b7b0ed91
      Yuchung Cheng 提交于
      After sending the new data packets to probe (step 2), F-RTO may
      incorrectly send more probes if the next ACK advances SND_UNA and
      does not sack new packet. However F-RTO RFC 5682 probes at most
      once. This bug may cause sender to always send new data instead of
      repairing holes, inducing longer HoL blocking on the receiver for
      the application.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b7b0ed91
    • Y
      tcp: only undo on partial ACKs in CA_Loss · da34ac76
      Yuchung Cheng 提交于
      Undo based on TCP timestamps should only happen on ACKs that advance
      SND_UNA, according to the Eifel algorithm in RFC 3522:
      
      Section 3.2:
      
        (4) If the value of the Timestamp Echo Reply field of the
            acceptable ACK's Timestamps option is smaller than the
            value of RetransmitTS, then proceed to step (5),
      
      Section Terminology:
         We use the term 'acceptable ACK' as defined in [RFC793].  That is an
         ACK that acknowledges previously unacknowledged data.
      
      This is because upon receiving an out-of-order packet, the receiver
      returns the last timestamp that advances RCV_NXT, not the current
      timestamp of the packet in the DUPACK. Without checking the flag,
      the DUPACK will cause tcp_packet_delayed() to return true and
      tcp_try_undo_loss() will revert cwnd reduction.
      
      Note that we check the condition in CA_Recovery already by only
      calling tcp_try_undo_partial() if FLAG_SND_UNA_ADVANCED is set or
      tcp_try_undo_recovery() if snd_una crosses high_seq.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      da34ac76
  7. 18 5月, 2015 2 次提交
  8. 10 5月, 2015 1 次提交
    • E
      tcp: adjust window probe timers to safer values · 21c8fe99
      Eric Dumazet 提交于
      With the advent of small rto timers in datacenter TCP,
      (ip route ... rto_min x), the following can happen :
      
      1) Qdisc is full, transmit fails.
      
         TCP sets a timer based on icsk_rto to retry the transmit, without
         exponential backoff.
         With low icsk_rto, and lot of sockets, all cpus are servicing timer
         interrupts like crazy.
         Intent of the code was to retry with a timer between 200 (TCP_RTO_MIN)
         and 500ms (TCP_RESOURCE_PROBE_INTERVAL)
      
      2) Receivers can send zero windows if they don't drain their receive queue.
      
         TCP sends zero window probes, based on icsk_rto current value, with
         exponential backoff.
         With /proc/sys/net/ipv4/tcp_retries2 being 15 (or even smaller in
         some cases), sender can abort in less than one or two minutes !
         If receiver stops the sender, it obviously doesn't care of very tight
         rto. Probability of dropping the ACK reopening the window is not
         worth the risk.
      
      Lets change the base timer to be at least 200ms (TCP_RTO_MIN) for these
      events (but not normal RTO based retransmits)
      
      A followup patch adds a new SNMP counter, as it would have helped a lot
      diagnosing this issue.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      21c8fe99
  9. 06 5月, 2015 1 次提交
    • E
      tcp: provide SYN headers for passive connections · cd8ae852
      Eric Dumazet 提交于
      This patch allows a server application to get the TCP SYN headers for
      its passive connections.  This is useful if the server is doing
      fingerprinting of clients based on SYN packet contents.
      
      Two socket options are added: TCP_SAVE_SYN and TCP_SAVED_SYN.
      
      The first is used on a socket to enable saving the SYN headers
      for child connections. This can be set before or after the listen()
      call.
      
      The latter is used to retrieve the SYN headers for passive connections,
      if the parent listener has enabled TCP_SAVE_SYN.
      
      TCP_SAVED_SYN is read once, it frees the saved SYN headers.
      
      The data returned in TCP_SAVED_SYN are network (IPv4/IPv6) and TCP
      headers.
      
      Original patch was written by Tom Herbert, I changed it to not hold
      a full skb (and associated dst and conntracking reference).
      
      We have used such patch for about 3 years at Google.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Tested-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cd8ae852
  10. 04 5月, 2015 3 次提交
  11. 30 4月, 2015 3 次提交
  12. 22 4月, 2015 1 次提交
  13. 14 4月, 2015 1 次提交
  14. 08 4月, 2015 2 次提交
  15. 04 4月, 2015 2 次提交
  16. 03 4月, 2015 1 次提交
  17. 30 3月, 2015 1 次提交
  18. 25 3月, 2015 1 次提交
    • E
      tcp: fix ipv4 mapped request socks · 0144a81c
      Eric Dumazet 提交于
      ss should display ipv4 mapped request sockets like this :
      
      tcp    SYN-RECV   0      0  ::ffff:192.168.0.1:8080   ::ffff:192.0.2.1:35261
      
      and not like this :
      
      tcp    SYN-RECV   0      0  192.168.0.1:8080   192.0.2.1:35261
      
      We should init ireq->ireq_family based on listener sk_family,
      not the actual protocol carried by SYN packet.
      
      This means we can set ireq_family in inet_reqsk_alloc()
      
      Fixes: 3f66b083 ("inet: introduce ireq_family")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0144a81c
  19. 21 3月, 2015 1 次提交
  20. 18 3月, 2015 7 次提交
  21. 15 3月, 2015 1 次提交
  22. 13 3月, 2015 1 次提交
  23. 12 3月, 2015 1 次提交