1. 28 Nov 2011, 4 commits
    • tcp: allow undo from reordered DSACKs · f698204b
      Authored by Neal Cardwell
      Previously, SACK-enabled connections hung around in TCP_CA_Disorder
      state while snd_una==high_seq, just waiting to accumulate DSACKs and
      hopefully undo a cwnd reduction. This could and did lead to the
      following unfortunate scenario: if some incoming ACKs advanced snd_una
      beyond high_seq, we set undo_marker to 0 and moved to
      TCP_CA_Open, so that if (due to reordering in the ACK return path) we
      shortly thereafter received a DSACK, we were no longer able to
      undo the cwnd reduction.
      
      The change: Simplify the congestion avoidance state machine by
      removing the behavior where SACK-enabled connections hung around in
      the TCP_CA_Disorder state just waiting for DSACKs. Instead, when
      snd_una advances to high_seq or beyond we typically move to
      TCP_CA_Open immediately and allow an undo in either TCP_CA_Open or
      TCP_CA_Disorder if we later receive enough DSACKs.
      
      Other patches in this series will provide other changes that are
      necessary to fully fix this problem.
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
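
      A minimal user-space sketch of the undo idea described above, assuming
      simplified state: each DSACK that proves a retransmission spurious
      decrements a counter, and once every retransmission is proven
      unnecessary the prior cwnd/ssthresh are restored, whether the
      connection is in Open or Disorder. Names such as struct cc_state and
      dsack_undo() are illustrative, not kernel APIs.

      #include <stdbool.h>
      #include <stdio.h>

      /* Illustrative state only; field names loosely mirror tcp_sock. */
      struct cc_state {
              unsigned int cwnd, ssthresh;
              unsigned int prior_cwnd, prior_ssthresh;
              unsigned int undo_retrans;   /* retransmits not yet proven spurious */
      };

      static void enter_recovery(struct cc_state *s, unsigned int retrans)
      {
              s->prior_cwnd = s->cwnd;
              s->prior_ssthresh = s->ssthresh;
              s->ssthresh = s->cwnd / 2;
              s->cwnd = s->ssthresh;
              s->undo_retrans = retrans;
      }

      /* Called for every DSACK that covers one retransmitted segment. */
      static bool dsack_undo(struct cc_state *s)
      {
              if (s->undo_retrans == 0 || --s->undo_retrans > 0)
                      return false;            /* not all retransmits proven spurious yet */
              s->cwnd = s->prior_cwnd;         /* every retransmit was unnecessary: undo */
              s->ssthresh = s->prior_ssthresh;
              return true;
      }

      int main(void)
      {
              struct cc_state s = { .cwnd = 10, .ssthresh = 20 };

              enter_recovery(&s, 2);           /* two segments retransmitted */
              dsack_undo(&s);                  /* first DSACK (possibly after moving to Open) */
              if (dsack_undo(&s))              /* second DSACK: reduction undone */
                      printf("undone: cwnd=%u ssthresh=%u\n", s.cwnd, s.ssthresh);
              return 0;
      }
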
    • tcp: use SACKs and DSACKs that arrive on ACKs below snd_una · e95ae2f2
      Authored by Neal Cardwell
      The bug: When the ACK field is below snd_una (which can happen when
      ACKs are reordered), senders ignored DSACKs (preventing undo) and did
      not call tcp_fastretrans_alert, so they did not increment
      prr_delivered to reflect newly-SACKed sequence ranges, and did not
      call tcp_xmit_retransmit_queue, thus passing up chances to send out
      more retransmitted and new packets based on any newly-SACKed packets.
      
      The change: When the ACK field is below snd_una (the "old_ack" goto
      label), call tcp_fastretrans_alert to allow undo based on any
      newly-arrived DSACKs and try to send out more packets based on
      newly-SACKed packets.
      
      Other patches in this series will provide other changes that are
      necessary to fully fix this problem.
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: use DSACKs that arrive when packets_out is 0 · 5628adf1
      Authored by Neal Cardwell
      The bug: Senders ignored DSACKs after recovery when there were no
      outstanding packets (a common scenario for HTTP servers).
      
      The change: when there are no outstanding packets (the "no_queue" goto
      label), call tcp_fastretrans_alert() in order to use DSACKs to undo
      congestion window reductions.
      
      Other patches in this series will provide other changes that are
      necessary to fully fix this problem.
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: make is_dupack a parameter to tcp_fastretrans_alert() · 7d2b55f8
      Authored by Neal Cardwell
      Allow callers to decide whether an ACK is a duplicate ACK. This is a
      prerequisite to allowing fastretrans_alert to be called from new
      contexts, such as the no_queue and old_ack code paths, from which we
      have extra info that tells us whether an ACK is a dupack.
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Signed-off-by: David S. Miller <davem@davemloft.net>
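
      Taken together with the two commits above, the refactor can be pictured
      with a small user-space sketch: the caller computes its own is_dupack
      verdict and a single alert routine becomes reachable from the normal,
      old_ack and no_queue paths. All names here (ack_kind, handle_ack,
      fastretrans_alert_like) are illustrative stand-ins, not the kernel's
      functions.

      #include <stdbool.h>
      #include <stdio.h>

      enum ack_kind { ACK_NORMAL, ACK_OLD, ACK_NO_QUEUE };

      /* Stand-in for tcp_fastretrans_alert(): the caller now supplies the
       * is_dupack verdict instead of the function deriving it internally. */
      static void fastretrans_alert_like(enum ack_kind kind, bool is_dupack,
                                         bool dsack_seen)
      {
              if (dsack_seen)
                      printf("path %d: DSACK counted toward a possible undo\n", kind);
              if (is_dupack)
                      printf("path %d: dupack processing\n", kind);
      }

      /* Toy ACK dispatcher: is_dupack is decided once by the caller, and the
       * alert routine is now reachable from the old_ack and no_queue paths. */
      static void handle_ack(enum ack_kind kind, bool acked_or_sacked_new_data,
                             bool dsack_seen)
      {
              bool is_dupack = !acked_or_sacked_new_data;

              switch (kind) {
              case ACK_NORMAL:
                      fastretrans_alert_like(kind, is_dupack, dsack_seen);
                      break;
              case ACK_OLD:       /* ACK field below snd_una */
                      if (acked_or_sacked_new_data || dsack_seen)
                              fastretrans_alert_like(kind, is_dupack, dsack_seen);
                      break;
              case ACK_NO_QUEUE:  /* nothing left in flight */
                      if (dsack_seen)
                              fastretrans_alert_like(kind, is_dupack, dsack_seen);
                      break;
              }
      }

      int main(void)
      {
              handle_ack(ACK_OLD, false, true);      /* reordered ACK carrying a DSACK */
              handle_ack(ACK_NO_QUEUE, false, true); /* idle connection, late DSACK   */
              return 0;
      }
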
  2. 21 Oct 2011, 3 commits
  3. 20 Oct 2011, 1 commit
  4. 14 Oct 2011, 1 commit
    • net: more accurate skb truesize · 87fb4b7b
      Authored by Eric Dumazet
      skb truesize currently accounts for the sk_buff struct and only part of
      the skb head; kmalloc() roundings are also ignored.
      
      Considering that skb_shared_info is larger than sk_buff, it's time to
      take it into account for better memory accounting.
      
      This patch introduces the SKB_TRUESIZE(X) macro to centralize these
      assumptions in a single place.
      
      At skb allocation time, we put the skb_shared_info struct at the exact
      end of the skb head, to allow better use of memory (lowering the number
      of reallocations), since kmalloc() gives us power-of-two memory blocks.
      
      Unless SLUB/SLUB debug is active, both skb->head and skb_shared_info are
      aligned to cache lines, as before.
      
      Note: This patch might trigger performance regressions because of
      misconfigured protocol stacks hitting per-socket or global memory
      limits that were previously not reached. But it's a necessary step
      toward more accurate memory accounting.
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      CC: Andi Kleen <ak@linux.intel.com>
      CC: Ben Hutchings <bhutchings@solarflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
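
      A user-space approximation of the accounting the macro centralizes; the
      struct sizes and the cache-line rounding below are placeholders, while
      the real SKB_TRUESIZE(X) adds SKB_DATA_ALIGN(sizeof(struct sk_buff)) and
      SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) to X.

      #include <stdio.h>

      /* Placeholder sizes; the kernel uses sizeof(struct sk_buff) and
       * sizeof(struct skb_shared_info) with its own alignment rounding. */
      #define CACHE_LINE            64u
      #define ALIGN_UP(x, a)        (((x) + (a) - 1) & ~((a) - 1))
      #define FAKE_SKB_SIZE         232u
      #define FAKE_SHARED_INFO_SIZE 320u

      /* Mirrors the shape of SKB_TRUESIZE(X): payload plus both structs,
       * each rounded up, so every user of the constant agrees. */
      #define SKB_TRUESIZE_APPROX(X)                          \
              ((X) + ALIGN_UP(FAKE_SKB_SIZE, CACHE_LINE) +    \
               ALIGN_UP(FAKE_SHARED_INFO_SIZE, CACHE_LINE))

      int main(void)
      {
              /* e.g. truesize charged for a bare ACK of ~66 header bytes */
              printf("truesize(66) = %u\n", SKB_TRUESIZE_APPROX(66u));
              return 0;
      }
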
  5. 05 Oct 2011, 1 commit
    • tcp: properly update lost_cnt_hint during shifting · 1e5289e1
      Authored by Yan, Zheng
      lost_skb_hint is used by tcp_mark_head_lost() to mark the first unhandled skb,
      and lost_cnt_hint is the number of packets or sacked packets before lost_skb_hint.
      When shifting an skb that is before lost_skb_hint: if tcp_is_fack() is true,
      the skb has already been counted in lost_cnt_hint; if tcp_is_fack() is false,
      tcp_sacktag_one() will increase lost_cnt_hint. So tcp_shifted_skb() does not
      need to adjust lost_cnt_hint itself. When shifting the skb that is equal to
      lost_skb_hint, the shifted packets will not be counted by tcp_mark_head_lost(),
      so tcp_shifted_skb() should adjust lost_cnt_hint even if tcp_is_fack(tp) is true.
      Signed-off-by: Zheng Yan <zheng.z.yan@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
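
      A toy restatement of that rule (not the kernel diff): the shifting code
      itself should only bump lost_cnt_hint when the skb being merged away is
      lost_skb_hint itself, since skbs before the hint are accounted for by
      FACK accounting or by the sacktag path. The helper name is hypothetical.

      #include <stdbool.h>
      #include <stdio.h>

      /* Returns how much the shifting code itself should add to
       * lost_cnt_hint, per the reasoning in the message above. */
      static unsigned int hint_adjust(bool skb_is_lost_skb_hint,
                                      unsigned int shifted_pcount)
      {
              /* skbs before the hint are already (or will be) counted by
               * FACK accounting or tcp_sacktag_one(), so add nothing here. */
              if (!skb_is_lost_skb_hint)
                      return 0;
              /* The hint skb itself is merged away, so its packets would
               * otherwise never be counted: adjust here, FACK or not. */
              return shifted_pcount;
      }

      int main(void)
      {
              printf("before hint: +%u\n", hint_adjust(false, 2));
              printf("at hint:     +%u\n", hint_adjust(true, 2));
              return 0;
      }
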
  6. 28 Sep 2011, 1 commit
  7. 27 Sep 2011, 2 commits
  8. 19 Sep 2011, 1 commit
  9. 25 Aug 2011, 1 commit
    • Proportional Rate Reduction for TCP. · a262f0cd
      Authored by Nandita Dukkipati
      This patch implements Proportional Rate Reduction (PRR) for TCP.
      PRR is an algorithm that determines TCP's sending rate in fast
      recovery. PRR avoids excessive window reductions and aims for
      the actual congestion window size at the end of recovery to be as
      close as possible to the window determined by the congestion control
      algorithm. PRR also improves accuracy of the amount of data sent
      during loss recovery.
      
      The patch implements the recommended flavor of PRR called PRR-SSRB
      (Proportional rate reduction with slow start reduction bound) and
      replaces the existing rate halving algorithm. PRR improves upon the
      existing Linux fast recovery under a number of conditions including:
        1) burst losses where the losses implicitly reduce the amount of
      outstanding data (pipe) below the ssthresh value selected by the
      congestion control algorithm and,
        2) losses near the end of short flows where the application runs out
      of data to send.
      
      As an example, with the existing rate halving implementation a single
      loss event can cause a connection carrying short Web transactions to
      go into slow start after recovery. This is because during
      recovery Linux pulls the congestion window down to packets_in_flight+1
      on every ACK. A short Web response often runs out of new data to send
      and its pipe reduces to zero by the end of recovery when all its packets
      are drained from the network. Subsequent HTTP responses using the same
      connection will have to slow start to raise cwnd to ssthresh. PRR on
      the other hand aims for the cwnd to be as close as possible to ssthresh
      by the end of recovery.
      
      A description of PRR and a discussion of its performance can be found at
      the following links:
      - IETF Draft:
          http://tools.ietf.org/html/draft-mathis-tcpm-proportional-rate-reduction-01
      - IETF Slides:
          http://www.ietf.org/proceedings/80/slides/tcpm-6.pdf
          http://tools.ietf.org/agenda/81/slides/tcpm-2.pdf
      - Paper to appear in the Internet Measurement Conference (IMC) 2011:
          Improving TCP Loss Recovery
          Nandita Dukkipati, Matt Mathis, Yuchung Cheng
      Signed-off-by: Nandita Dukkipati <nanditad@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
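
      One PRR-SSRB step, sketched as standalone C following the equations in
      the draft cited above: a proportional part while pipe is still above
      ssthresh, and a slow-start-bounded part to grow back toward ssthresh.
      The struct layout and function names are illustrative, not the kernel's
      implementation.

      #include <stdio.h>

      /* All values are in packets. */
      struct prr {
              unsigned int ssthresh;      /* target cwnd chosen at loss      */
              unsigned int recover_fs;    /* flight size entering recovery   */
              unsigned int prr_delivered; /* data delivered during recovery  */
              unsigned int prr_out;       /* data sent during recovery       */
      };

      static unsigned int prr_sndcnt(struct prr *p, unsigned int pipe,
                                     unsigned int newly_delivered)
      {
              int sndcnt;

              p->prr_delivered += newly_delivered;
              if (pipe > p->ssthresh) {
                      /* proportional part: pace the window reduction */
                      sndcnt = (int)((p->prr_delivered * p->ssthresh +
                                      p->recover_fs - 1) / p->recover_fs) -
                               (int)p->prr_out;
              } else {
                      /* slow start reduction bound: grow back toward ssthresh */
                      int limit = (int)p->prr_delivered - (int)p->prr_out;

                      if (limit < (int)newly_delivered)
                              limit = (int)newly_delivered;
                      limit += 1;
                      sndcnt = (int)(p->ssthresh - pipe);
                      if (sndcnt > limit)
                              sndcnt = limit;
              }
              if (sndcnt < 0)
                      sndcnt = 0;
              p->prr_out += (unsigned int)sndcnt;
              return (unsigned int)sndcnt;   /* cwnd for this ACK = pipe + sndcnt */
      }

      int main(void)
      {
              struct prr p = { .ssthresh = 10, .recover_fs = 20 };

              /* an ACK delivering 2 packets while 18 are still in flight */
              printf("sndcnt = %u\n", prr_sndcnt(&p, 18, 2));
              return 0;
      }
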
  10. 09 Jun 2011, 1 commit
    • tcp: RFC2988bis + taking RTT sample from 3WHS for the passive open side · 9ad7c049
      Authored by Jerry Chu
      This patch lowers the default initRTO from 3 secs to 1 sec per
      RFC 2988bis. It falls back to 3 secs if the SYN or SYN-ACK packet
      has been retransmitted, AND the TCP timestamp option is not on.
      
      It also adds support for taking an RTT sample during the 3WHS on the
      passive open side, just like its active open counterpart, and uses it,
      if valid, to seed the initRTO for the data transmission phase.
      
      The patch also resets ssthresh to its initial default at the
      beginning of the data transmission phase, and reduces cwnd to 1 if
      there has been MORE THAN ONE retransmission during the 3WHS per RFC 5681.
      Signed-off-by: H.K. Jerry Chu <hkchu@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
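
      A small sketch of the fallback policy only (the 3WHS RTT seeding and the
      real srtt/rttvar math are omitted); HZ and the function name are
      illustrative assumptions, not kernel definitions.

      #include <stdbool.h>
      #include <stdio.h>

      #define HZ 1000u   /* example tick rate: 1000 jiffies per second */

      /* Initial RTO policy described above: 1 s by default (RFC 2988bis),
       * 3 s only when the SYN or SYN-ACK was retransmitted and timestamps
       * are off, so the true handshake RTT cannot be told apart. */
      static unsigned int initial_rto_jiffies(bool syn_retransmitted,
                                              bool timestamps_enabled)
      {
              if (syn_retransmitted && !timestamps_enabled)
                      return 3 * HZ;
              return 1 * HZ;
      }

      int main(void)
      {
              printf("clean 3WHS:           %u\n", initial_rto_jiffies(false, false));
              printf("retransmitted, no TS: %u\n", initial_rto_jiffies(true, false));
              printf("retransmitted, TS on: %u\n", initial_rto_jiffies(true, true));
              return 0;
      }
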
  11. 23 Mar 2011, 2 commits
  12. 15 Mar 2011, 1 commit
    • tcp: fix RTT for quick packets in congestion control · febf0819
      Authored by Stephen Hemminger
      In the congestion control interface, the callback for each ACK
      includes an estimated round trip time in microseconds.
      Some algorithms need high resolution (Vegas style) but most only
      need jiffy resolution. If the RTT is not accurate (as with a retransmission),
      -1 is used as a flag value.
      
      When doing coarse resolution, if the RTT is less than a jiffy
      then 0 should be returned rather than no estimate. Otherwise algorithms
      that expect good ACKs to trigger slow start (like CUBIC HyStart)
      will be confused.
      Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
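
      A standalone sketch of the convention described above, using a
      hypothetical conversion helper: -1 stays the "no estimate" flag, while a
      valid sub-jiffy RTT maps to 0 so coarse-resolution consumers still see a
      sample.

      #include <stdio.h>

      #define USEC_PER_JIFFY 1000u   /* example: HZ = 1000 */

      /* Convert a microsecond RTT to jiffy resolution for coarse consumers.
       * -1 still means "no valid sample"; a valid sub-jiffy RTT must map to
       * 0, not to the flag value, or HyStart-style heuristics never see it. */
      static long rtt_in_jiffies(long rtt_us)
      {
              if (rtt_us < 0)
                      return -1;                   /* no estimate at all   */
              return rtt_us / USEC_PER_JIFFY;      /* 0 for quick packets  */
      }

      int main(void)
      {
              printf("%ld %ld %ld\n", rtt_in_jiffies(-1),
                     rtt_in_jiffies(300), rtt_in_jiffies(2500));
              return 0;
      }
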
  13. 22 Feb 2011, 1 commit
  14. 03 Feb 2011, 1 commit
  15. 26 Jan 2011, 1 commit
    • TCP: fix a bug that triggers large number of TCP RST by mistake · 44f5324b
      Authored by Jerry Chu
      This patch fixes a bug that causes TCP RST packets to be generated
      for otherwise correctly behaved applications, e.g., no unread data
      on close. To trigger the bug, at least two conditions must
      be met:
      
      1. The FIN flag is set on the last data packet, i.e., it's not on a
      separate, FIN only packet.
      2. The size of the last data chunk on the receive side matches
      exactly with the size of buffer posted by the receiver, and the
      receiver closes the socket without any further read attempt.
      
      This bug was first noticed on our netperf-based testbed for our IW10
      proposal to the IETF, where a large number of RST packets were observed.
      netperf's read-side code meets condition 2 above 100% of the time.
      
      Before the fix, tcp_data_queue() would queue the last skb that meets
      condition 1 to sk_receive_queue even though its data had been fully
      copied out (via skb_copy_datagram_iovec()). Then if condition 2 is also met,
      tcp_recvmsg() often returns all the copied-out data successfully
      without actually consuming the skb, due to the check
      "if ((chunk = len - tp->ucopy.len) != 0) {"
      and
      "len -= chunk;"
      after tcp_prequeue_process(), which cause "len" to become 0 and force an
      early exit from the big while loop.
      
      I don't see any reason not to free the skb whose data have been fully
      consumed in tcp_data_queue(), regardless of the FIN flag.  We won't
      get there if MSG_PEEK is on. Am I missing some arcane cases related
      to urgent data?
      Signed-off-by: H.K. Jerry Chu <hkchu@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  16. 24 Dec 2010, 1 commit
  17. 10 Dec 2010, 1 commit
    • net: Abstract away all dst_entry metrics accesses. · defb3519
      Authored by David S. Miller
      Use helper functions to hide all direct accesses, especially writes,
      to dst_entry metrics values.
      
      This will allow us to:
      
      1) More easily change how the metrics are stored.
      
      2) Implement COW for metrics.
      
      In particular this will help us put metrics into the inetpeer
      cache if that is what we end up doing.  We can make the _metrics
      member a pointer instead of an array, initially have it point
      at the read-only metrics in the FIB, and then on the first set
      grab an inetpeer entry and point the _metrics member there.
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
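
      An illustrative user-space sketch of the accessor pattern (fake_dst,
      dst_metric_get/set are stand-ins, not the kernel helpers): callers never
      touch the _metrics storage directly, so it can later become a pointer to
      shared, copy-on-write data without touching any caller.

      #include <stdio.h>

      enum { METRIC_RTT, METRIC_CWND, METRIC_MAX };

      struct fake_dst {
              unsigned long _metrics[METRIC_MAX];  /* callers never touch this */
      };

      static unsigned long dst_metric_get(const struct fake_dst *d, int m)
      {
              return d->_metrics[m];
      }

      static void dst_metric_set(struct fake_dst *d, int m, unsigned long v)
      {
              /* a later version could copy shared storage here before writing */
              d->_metrics[m] = v;
      }

      int main(void)
      {
              struct fake_dst d = { { 0 } };

              dst_metric_set(&d, METRIC_RTT, 40);
              printf("rtt metric = %lu\n", dst_metric_get(&d, METRIC_RTT));
              return 0;
      }
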
  18. 11 Nov 2010, 1 commit
  19. 18 Oct 2010, 2 commits
  20. 07 Oct 2010, 1 commit
  21. 30 Sep 2010, 1 commit
  22. 28 Sep 2010, 1 commit
  23. 24 Sep 2010, 1 commit
  24. 21 Sep 2010, 1 commit
  25. 31 Aug 2010, 1 commit
  26. 08 Aug 2010, 1 commit
  27. 13 Jul 2010, 1 commit
  28. 16 Jun 2010, 1 commit
    • tcp: unify tcp flag macros · a3433f35
      Authored by Changli Gao
      Unify the TCP flag macros: TCPHDR_FIN, TCPHDR_SYN, TCPHDR_RST, TCPHDR_PSH,
      TCPHDR_ACK, TCPHDR_URG, TCPHDR_ECE and TCPHDR_CWR. The TCPCB_FLAG_* macros
      are replaced with the corresponding TCPHDR_*.
      Signed-off-by: Changli Gao <xiaosuo@gmail.com>
      ----
       include/net/tcp.h                      |   24 ++++++-------
       net/ipv4/tcp.c                         |    8 ++--
       net/ipv4/tcp_input.c                   |    2 -
       net/ipv4/tcp_output.c                  |   59 ++++++++++++++++-----------------
       net/netfilter/nf_conntrack_proto_tcp.c |   32 ++++++-----------
       net/netfilter/xt_TCPMSS.c              |    4 --
       6 files changed, 58 insertions(+), 71 deletions(-)
      Signed-off-by: David S. Miller <davem@davemloft.net>
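
      A standalone snippet showing the flag layout these macros encode; the
      bit values follow the on-the-wire TCP flag byte (this is a sketch, not a
      copy of the kernel header).

      #include <stdio.h>

      /* One flag bit per position in the TCP header flag byte. */
      #define TCPHDR_FIN 0x01
      #define TCPHDR_SYN 0x02
      #define TCPHDR_RST 0x04
      #define TCPHDR_PSH 0x08
      #define TCPHDR_ACK 0x10
      #define TCPHDR_URG 0x20
      #define TCPHDR_ECE 0x40
      #define TCPHDR_CWR 0x80

      int main(void)
      {
              unsigned char flags = TCPHDR_SYN | TCPHDR_ACK;  /* a SYN-ACK */

              printf("flags = 0x%02x\n", flags);              /* prints 0x12 */
              return 0;
      }
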
  29. 01 Jun 2010, 1 commit
  30. 18 May 2010, 1 commit
  31. 29 Apr 2010, 1 commit
  32. 13 Apr 2010, 1 commit
    • net: sk_dst_cache RCUification · b6c6712a
      Authored by Eric Dumazet
      With the latest CONFIG_PROVE_RCU infrastructure, I felt comfortable
      enough to do this work.
      
      sk->sk_dst_cache is currently protected by an rwlock (sk_dst_lock).
      
      This rwlock is read-locked for a very short time, and dst
      entries are already freed after an RCU grace period. This calls for RCU
      again :)
      
      This patch converts sk_dst_lock to a spinlock, and uses RCU for readers.
      
      __sk_dst_get() is supposed to be called with rcu_read_lock() held or with
      the socket locked by the user, so use the appropriate rcu_dereference_check()
      condition (rcu_read_lock_held() || sock_owned_by_user(sk)).
      
      This patch avoids two atomic ops per tx packet on connected UDP sockets,
      for example, and allows sk_dst_lock to be dirtied much less.
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
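
      A toy analogue of the pattern using C11 atomics: readers load the cached
      dst pointer without taking any lock, while writers publish a replacement.
      The kernel uses rcu_read_lock()/rcu_dereference() for readers, a spinlock
      plus rcu_assign_pointer() for writers, and frees the old entry only after
      an RCU grace period; none of that machinery is reproduced here, and all
      names are illustrative.

      #include <stdatomic.h>
      #include <stdio.h>
      #include <stdlib.h>

      struct fake_dst { int id; };

      static _Atomic(struct fake_dst *) sk_dst_cache;

      static struct fake_dst *sk_dst_get_like(void)
      {
              /* kernel: rcu_read_lock(); rcu_dereference(sk->sk_dst_cache); */
              return atomic_load_explicit(&sk_dst_cache, memory_order_acquire);
      }

      static struct fake_dst *sk_dst_set_like(struct fake_dst *new_dst)
      {
              /* kernel: writers serialize on a spinlock, then
               * rcu_assign_pointer(); the old entry is freed after a grace period */
              return atomic_exchange_explicit(&sk_dst_cache, new_dst,
                                              memory_order_acq_rel);
      }

      int main(void)
      {
              struct fake_dst *d = malloc(sizeof(*d));

              d->id = 1;
              sk_dst_set_like(d);
              printf("reader sees dst id=%d\n", sk_dst_get_like()->id);
              free(sk_dst_set_like(NULL));   /* "grace period" elided in this toy */
              return 0;
      }
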