1. 12 3月, 2013 2 次提交
    • N
      tcp: TLP loss detection. · 9b717a8d
      Nandita Dukkipati 提交于
      This is the second of the TLP patch series; it augments the basic TLP
      algorithm with a loss detection scheme.
      
      This patch implements a mechanism for loss detection when a Tail
      loss probe retransmission plugs a hole thereby masking packet loss
      from the sender. The loss detection algorithm relies on counting
      TLP dupacks as outlined in Sec. 3 of:
      http://tools.ietf.org/html/draft-dukkipati-tcpm-tcp-loss-probe-01
      
      The basic idea is: Sender keeps track of TLP "episode" upon
      retransmission of a TLP packet. An episode ends when the sender receives
      an ACK above the SND.NXT (tracked by tlp_high_seq) at the time of the
      episode. We want to make sure that before the episode ends the sender
      receives a "TLP dupack", indicating that the TLP retransmission was
      unnecessary, so there was no loss/hole that needed plugging. If the
      sender gets no TLP dupack before the end of the episode, then it reduces
      ssthresh and the congestion window, because the TLP packet arriving at
      the receiver probably plugged a hole.
      Signed-off-by: NNandita Dukkipati <nanditad@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9b717a8d
    • N
      tcp: Tail loss probe (TLP) · 6ba8a3b1
      Nandita Dukkipati 提交于
      This patch series implement the Tail loss probe (TLP) algorithm described
      in http://tools.ietf.org/html/draft-dukkipati-tcpm-tcp-loss-probe-01. The
      first patch implements the basic algorithm.
      
      TLP's goal is to reduce tail latency of short transactions. It achieves
      this by converting retransmission timeouts (RTOs) occuring due
      to tail losses (losses at end of transactions) into fast recovery.
      TLP transmits one packet in two round-trips when a connection is in
      Open state and isn't receiving any ACKs. The transmitted packet, aka
      loss probe, can be either new or a retransmission. When there is tail
      loss, the ACK from a loss probe triggers FACK/early-retransmit based
      fast recovery, thus avoiding a costly RTO. In the absence of loss,
      there is no change in the connection state.
      
      PTO stands for probe timeout. It is a timer event indicating
      that an ACK is overdue and triggers a loss probe packet. The PTO value
      is set to max(2*SRTT, 10ms) and is adjusted to account for delayed
      ACK timer when there is only one oustanding packet.
      
      TLP Algorithm
      
      On transmission of new data in Open state:
        -> packets_out > 1: schedule PTO in max(2*SRTT, 10ms).
        -> packets_out == 1: schedule PTO in max(2*RTT, 1.5*RTT + 200ms)
        -> PTO = min(PTO, RTO)
      
      Conditions for scheduling PTO:
        -> Connection is in Open state.
        -> Connection is either cwnd limited or no new data to send.
        -> Number of probes per tail loss episode is limited to one.
        -> Connection is SACK enabled.
      
      When PTO fires:
        new_segment_exists:
          -> transmit new segment.
          -> packets_out++. cwnd remains same.
      
        no_new_packet:
          -> retransmit the last segment.
             Its ACK triggers FACK or early retransmit based recovery.
      
      ACK path:
        -> rearm RTO at start of ACK processing.
        -> reschedule PTO if need be.
      
      In addition, the patch includes a small variation to the Early Retransmit
      (ER) algorithm, such that ER and TLP together can in principle recover any
      N-degree of tail loss through fast recovery. TLP is controlled by the same
      sysctl as ER, tcp_early_retrans sysctl.
      tcp_early_retrans==0; disables TLP and ER.
      		 ==1; enables RFC5827 ER.
      		 ==2; delayed ER.
      		 ==3; TLP and delayed ER. [DEFAULT]
      		 ==4; TLP only.
      
      The TLP patch series have been extensively tested on Google Web servers.
      It is most effective for short Web trasactions, where it reduced RTOs by 15%
      and improved HTTP response time (average by 6%, 99th percentile by 10%).
      The transmitted probes account for <0.5% of the overall transmissions.
      Signed-off-by: NNandita Dukkipati <nanditad@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6ba8a3b1
  2. 19 2月, 2013 2 次提交
  3. 01 9月, 2012 1 次提交
    • J
      tcp: TCP Fast Open Server - header & support functions · 10467163
      Jerry Chu 提交于
      This patch adds all the necessary data structure and support
      functions to implement TFO server side. It also documents a number
      of flags for the sysctl_tcp_fastopen knob, and adds a few Linux
      extension MIBs.
      
      In addition, it includes the following:
      
      1. a new TCP_FASTOPEN socket option an application must call to
      supply a max backlog allowed in order to enable TFO on its listener.
      
      2. A number of key data structures:
      "fastopen_rsk" in tcp_sock - for a big socket to access its
      request_sock for retransmission and ack processing purpose. It is
      non-NULL iff 3WHS not completed.
      
      "fastopenq" in request_sock_queue - points to a per Fast Open
      listener data structure "fastopen_queue" to keep track of qlen (# of
      outstanding Fast Open requests) and max_qlen, among other things.
      
      "listener" in tcp_request_sock - to point to the original listener
      for book-keeping purpose, i.e., to maintain qlen against max_qlen
      as part of defense against IP spoofing attack.
      
      3. various data structure and functions, many in tcp_fastopen.c, to
      support server side Fast Open cookie operations, including
      /proc/sys/net/ipv4/tcp_fastopen_key to allow manual rekeying.
      Signed-off-by: NH.K. Jerry Chu <hkchu@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      10467163
  4. 20 7月, 2012 1 次提交
    • Y
      net-tcp: Fast Open client - sending SYN-data · 783237e8
      Yuchung Cheng 提交于
      This patch implements sending SYN-data in tcp_connect(). The data is
      from tcp_sendmsg() with flag MSG_FASTOPEN (implemented in a later patch).
      
      The length of the cookie in tcp_fastopen_req, init'd to 0, controls the
      type of the SYN. If the cookie is not cached (len==0), the host sends
      data-less SYN with Fast Open cookie request option to solicit a cookie
      from the remote. If cookie is not available (len > 0), the host sends
      a SYN-data with Fast Open cookie option. If cookie length is negative,
        the SYN will not include any Fast Open option (for fall back operations).
      
      To deal with middleboxes that may drop SYN with data or experimental TCP
      option, the SYN-data is only sent once. SYN retransmits do not include
      data or Fast Open options. The connection will fall back to regular TCP
      handshake.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      783237e8
  5. 17 7月, 2012 3 次提交
    • E
      tcp: implement RFC 5961 4.2 · 0c24604b
      Eric Dumazet 提交于
      Implement the RFC 5691 mitigation against Blind
      Reset attack using SYN bit.
      
      Section 4.2 of RFC 5961 advises to send a Challenge ACK and drop
      incoming packet, instead of resetting the session.
      
      Add a new SNMP counter to count number of challenge acks sent
      in response to SYN packets.
      (netstat -s | grep TCPSYNChallenge)
      
      Remove obsolete TCPAbortOnSyn, since we no longer abort a TCP session
      because of a SYN flag.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Kiran Kumar Kella <kkiran@broadcom.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0c24604b
    • E
      tcp: implement RFC 5961 3.2 · 282f23c6
      Eric Dumazet 提交于
      Implement the RFC 5691 mitigation against Blind
      Reset attack using RST bit.
      
      Idea is to validate incoming RST sequence,
      to match RCV.NXT value, instead of previouly accepted
      window : (RCV.NXT <= SEG.SEQ < RCV.NXT+RCV.WND)
      
      If sequence is in window but not an exact match, send
      a "challenge ACK", so that the other part can resend an
      RST with the appropriate sequence.
      
      Add a new sysctl, tcp_challenge_ack_limit, to limit
      number of challenge ACK sent per second.
      
      Add a new SNMP counter to count number of challenge acks sent.
      (netstat -s | grep TCPChallengeACK)
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Kiran Kumar Kella <kkiran@broadcom.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      282f23c6
    • E
      tcp: add OFO snmp counters · a6df1ae9
      Eric Dumazet 提交于
      Add three SNMP TCP counters, to better track TCP behavior
      at global stage (netstat -s), when packets are received
      Out Of Order (OFO)
      
      TCPOFOQueue : Number of packets queued in OFO queue
      
      TCPOFODrop  : Number of packets meant to be queued in OFO
                    but dropped because socket rcvbuf limit hit.
      
      TCPOFOMerge : Number of packets in OFO that were merged with
                    other packets.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a6df1ae9
  6. 20 3月, 2012 1 次提交
    • E
      tcp: reduce out_of_order memory use · c8628155
      Eric Dumazet 提交于
      With increasing receive window sizes, but speed of light not improved
      that much, out of order queue can contain a huge number of skbs, waiting
      to be moved to receive_queue when missing packets can fill the holes.
      
      Some devices happen to use fat skbs (truesize of 4096 + sizeof(struct
      sk_buff)) to store regular (MTU <= 1500) frames. This makes highly
      probable sk_rmem_alloc hits sk_rcvbuf limit, which can be 4Mbytes in
      many cases.
      
      When limit is hit, tcp stack calls tcp_collapse_ofo_queue(), a true
      latency killer and cpu cache blower.
      
      Doing the coalescing attempt each time we add a frame in ofo queue
      permits to keep memory use tight and in many cases avoid the
      tcp_collapse() thing later.
      
      Tested on various wireless setups (b43, ath9k, ...) known to use big skb
      truesize, this patch removed the "packets collapsed in receive queue due
      to low socket buffer" I had before.
      
      This also reduced average memory used by tcp sockets.
      
      With help from Neal Cardwell.
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: H.K. Jerry Chu <hkchu@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c8628155
  7. 27 1月, 2012 1 次提交
  8. 23 1月, 2012 1 次提交
    • Y
      tcp: detect loss above high_seq in recovery · 974c1236
      Yuchung Cheng 提交于
      Correctly implement a loss detection heuristic: New sequences (above
      high_seq) sent during the fast recovery are deemed lost when higher
      sequences are SACKed.
      
      Current code does not catch these losses, because tcp_mark_head_lost()
      does not check packets beyond high_seq. The fix is straight-forward by
      checking packets until the highest sacked packet. In addition, all the
      FLAG_DATA_LOST logic are in-effective and redundant and can be removed.
      
      Update the loss heuristic comments. The algorithm above is documented
      as heuristic B, but it is redundant too because heuristic A already
      covers B.
      
      Note that this change only marks some forward-retransmitted packets LOST.
      It does NOT forbid TCP performing further CWR on new losses. A potential
      follow-up patch under preparation is to perform another CWR on "new"
      losses such as
      1) sequence above high_seq is lost (by resetting high_seq to snd_nxt)
      2) retransmission is lost.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      974c1236
  9. 13 12月, 2011 1 次提交
  10. 10 11月, 2011 1 次提交
    • E
      ipv4: reduce percpu needs for icmpmsg mibs · acb32ba3
      Eric Dumazet 提交于
      Reading /proc/net/snmp on a machine with a lot of cpus is very expensive
      (can be ~88000 us).
      
      This is because ICMPMSG MIB uses 4096 bytes per cpu, and folding values
      for all possible cpus can read 16 Mbytes of memory.
      
      ICMP messages are not considered as fast path on a typical server, and
      eventually few cpus handle them anyway. We can afford an atomic
      operation instead of using percpu data.
      
      This saves 4096 bytes per cpu and per network namespace.
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      acb32ba3
  11. 01 11月, 2011 1 次提交
  12. 16 9月, 2011 1 次提交
    • E
      tcp: Change possible SYN flooding messages · 946cedcc
      Eric Dumazet 提交于
      "Possible SYN flooding on port xxxx " messages can fill logs on servers.
      
      Change logic to log the message only once per listener, and add two new
      SNMP counters to track :
      
      TCPReqQFullDoCookies : number of times a SYNCOOKIE was replied to client
      
      TCPReqQFullDrop : number of times a SYN request was dropped because
      syncookies were not enabled.
      
      Based on a prior patch from Tom Herbert, and suggestions from David.
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      CC: Tom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      946cedcc
  13. 09 12月, 2010 1 次提交
  14. 11 11月, 2010 1 次提交
  15. 01 7月, 2010 1 次提交
    • E
      snmp: 64bit ipstats_mib for all arches · 4ce3c183
      Eric Dumazet 提交于
      /proc/net/snmp and /proc/net/netstat expose SNMP counters.
      
      Width of these counters is either 32 or 64 bits, depending on the size
      of "unsigned long" in kernel.
      
      This means user program parsing these files must already be prepared to
      deal with 64bit values, regardless of user program being 32 or 64 bit.
      
      This patch introduces 64bit snmp values for IPSTAT mib, where some
      counters can wrap pretty fast if they are 32bit wide.
      
      # netstat -s|egrep "InOctets|OutOctets"
          InOctets: 244068329096
          OutOctets: 244069348848
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4ce3c183
  16. 03 6月, 2010 1 次提交
  17. 22 3月, 2010 1 次提交
  18. 09 3月, 2010 1 次提交
  19. 17 2月, 2010 1 次提交
    • T
      percpu: add __percpu sparse annotations to net · 7d720c3e
      Tejun Heo 提交于
      Add __percpu sparse annotations to net.
      
      These annotations are to make sparse consider percpu variables to be
      in a different address space and warn if accessed without going
      through percpu accessors.  This patch doesn't affect normal builds.
      
      The macro and type tricks around snmp stats make things a bit
      interesting.  DEFINE/DECLARE_SNMP_STAT() macros mark the target field
      as __percpu and SNMP_UPD_PO_STATS() macro is updated accordingly.  All
      snmp_mib_*() users which used to cast the argument to (void **) are
      updated to cast it to (void __percpu **).
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NDavid S. Miller <davem@davemloft.net>
      Cc: Patrick McHardy <kaber@trash.net>
      Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
      Cc: Vlad Yasevich <vladislav.yasevich@hp.com>
      Cc: netdev@vger.kernel.org
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7d720c3e
  20. 23 1月, 2010 1 次提交
  21. 27 4月, 2009 1 次提交
  22. 16 2月, 2009 1 次提交
  23. 30 12月, 2008 1 次提交
  24. 26 11月, 2008 2 次提交
  25. 25 11月, 2008 1 次提交
  26. 11 11月, 2008 1 次提交
  27. 30 7月, 2008 1 次提交
  28. 18 7月, 2008 8 次提交