1. 03 5月, 2012 2 次提交
    • Y
      tcp: early retransmit: delayed fast retransmit · 750ea2ba
      Yuchung Cheng 提交于
      Implementing the advanced early retransmit (sysctl_tcp_early_retrans==2).
      Delays the fast retransmit by an interval of RTT/4. We borrow the
      RTO timer to implement the delay. If we receive another ACK or send
      a new packet, the timer is cancelled and restored to original RTO
      value offset by time elapsed.  When the delayed-ER timer fires,
      we enter fast recovery and perform fast retransmit.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      750ea2ba
    • Y
      tcp: early retransmit · eed530b6
      Yuchung Cheng 提交于
      This patch implements RFC 5827 early retransmit (ER) for TCP.
      It reduces DUPACK threshold (dupthresh) if outstanding packets are
      less than 4 to recover losses by fast recovery instead of timeout.
      
      While the algorithm is simple, small but frequent network reordering
      makes this feature dangerous: the connection repeatedly enter
      false recovery and degrade performance. Therefore we implement
      a mitigation suggested in the appendix of the RFC that delays
      entering fast recovery by a small interval, i.e., RTT/4. Currently
      ER is conservative and is disabled for the rest of the connection
      after the first reordering event. A large scale web server
      experiment on the performance impact of ER is summarized in
      section 6 of the paper "Proportional Rate Reduction for TCP”,
      IMC 2011. http://conferences.sigcomm.org/imc/2011/docs/p155.pdf
      
      Note that Linux has a similar feature called THIN_DUPACK. The
      differences are THIN_DUPACK do not mitigate reorderings and is only
      used after slow start. Currently ER is disabled if THIN_DUPACK is
      enabled. I would be happy to merge THIN_DUPACK feature with ER if
      people think it's a good idea.
      
      ER is enabled by sysctl_tcp_early_retrans:
        0: Disables ER
      
        1: Reduce dupthresh to packets_out - 1 when outstanding packets < 4.
      
        2: (Default) reduce dupthresh like mode 1. In addition, delay
           entering fast recovery by RTT/4.
      
      Note: mode 2 is implemented in the third part of this patch series.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      eed530b6
  2. 26 4月, 2012 1 次提交
  3. 22 4月, 2012 2 次提交
    • P
      tcp: Repair connection-time negotiated parameters · b139ba4e
      Pavel Emelyanov 提交于
      There are options, which are set up on a socket while performing
      TCP handshake. Need to resurrect them on a socket while repairing.
      A new sockoption accepts a buffer and parses it. The buffer should
      be CODE:VALUE sequence of bytes, where CODE is standard option
      code and VALUE is the respective value.
      
      Only 4 options should be handled on repaired socket.
      
      To read 3 out of 4 of these options the TCP_INFO sockoption can be
      used. An ability to get the last one (the mss_clamp) was added by
      the previous patch.
      
      Now the restore. Three of these options -- timestamp_ok, mss_clamp
      and snd_wscale -- are just restored on a coket.
      
      The sack_ok flags has 2 issues. First, whether or not to do sacks
      at all. This flag is just read and set back. No other sack  info is
      saved or restored, since according to the standart and the code
      dropping all sack-ed segments is OK, the sender will resubmit them
      again, so after the repair we will probably experience a pause in
      connection. Next, the fack bit. It's just set back on a socket if
      the respective sysctl is set. No collected stats about packets flow
      is preserved. As far as I see (plz, correct me if I'm wrong) the
      fack-based congestion algorithm survives dropping all of the stats
      and repairs itself eventually, probably losing the performance for
      that period.
      Signed-off-by: NPavel Emelyanov <xemul@openvz.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b139ba4e
    • P
      tcp: Initial repair mode · ee995283
      Pavel Emelyanov 提交于
      This includes (according the the previous description):
      
      * TCP_REPAIR sockoption
      
      This one just puts the socket in/out of the repair mode.
      Allowed for CAP_NET_ADMIN and for closed/establised sockets only.
      When repair mode is turned off and the socket happens to be in
      the established state the window probe is sent to the peer to
      'unlock' the connection.
      
      * TCP_REPAIR_QUEUE sockoption
      
      This one sets the queue which we're about to repair. The
      'no-queue' is set by default.
      
      * TCP_QUEUE_SEQ socoption
      
      Sets the write_seq/rcv_nxt of a selected repaired queue.
      Allowed for TCP_CLOSE-d sockets only. When the socket changes
      its state the other seq-s are changed by the kernel according
      to the protocol rules (most of the existing code is actually
      reused).
      
      * Ability to forcibly bind a socket to a port
      
      The sk->sk_reuse is set to SK_FORCE_REUSE.
      
      * Immediate connect modification
      
      The connect syscall initializes the connection, then directly jumps
      to the code which finalizes it.
      
      * Silent close modification
      
      The close just aborts the connection (similar to SO_LINGER with 0
      time) but without sending any FIN/RST-s to peer.
      Signed-off-by: NPavel Emelyanov <xemul@parallels.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ee995283
  4. 29 2月, 2012 1 次提交
  5. 01 2月, 2012 2 次提交
    • E
      tcp: md5: protects md5sig_info with RCU · a8afca03
      Eric Dumazet 提交于
      This patch makes sure we use appropriate memory barriers before
      publishing tp->md5sig_info, allowing tcp_md5_do_lookup() being used from
      tcp_v4_send_reset() without holding socket lock (upcoming patch from
      Shawn Lu)
      
      Note we also need to respect rcu grace period before its freeing, since
      we can free socket without this grace period thanks to
      SLAB_DESTROY_BY_RCU
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Cc: Shawn Lu <shawn.lu@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a8afca03
    • E
      tcp: md5: rcu conversion · a915da9b
      Eric Dumazet 提交于
      In order to be able to support proper RST messages for TCP MD5 flows, we
      need to allow access to MD5 keys without locking listener socket.
      
      This conversion is a nice cleanup, and shrinks size of timewait sockets
      by 80 bytes.
      
      IPv6 code reuses generic code found in IPv4 instead of duplicating it.
      
      Control path uses GFP_KERNEL allocations instead of GFP_ATOMIC.
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Cc: Shawn Lu <shawn.lu@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a915da9b
  6. 21 12月, 2011 1 次提交
  7. 04 10月, 2011 1 次提交
    • E
      tcp: report ECN_SEEN in tcp_info · b5c5693b
      Eric Dumazet 提交于
      Allows ss command (iproute2) to display "ecnseen" if at least one packet
      with ECT(0) or ECT(1) or ECN was received by this socket.
      
      "ecn" means ECN was negotiated at session establishment (TCP level)
      
      "ecnseen" means we received at least one packet with ECT fields set (IP
      level)
      
      ss -i
      ...
      ESTAB      0      0   192.168.20.110:22  192.168.20.144:38016
      ino:5950 sk:f178e400
      	 mem:(r0,w0,f0,t0) ts sack ecn ecnseen bic wscale:7,8 rto:210
      rtt:12.5/7.5 cwnd:10 send 9.3Mbps rcv_space:14480
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b5c5693b
  8. 25 8月, 2011 1 次提交
    • N
      Proportional Rate Reduction for TCP. · a262f0cd
      Nandita Dukkipati 提交于
      This patch implements Proportional Rate Reduction (PRR) for TCP.
      PRR is an algorithm that determines TCP's sending rate in fast
      recovery. PRR avoids excessive window reductions and aims for
      the actual congestion window size at the end of recovery to be as
      close as possible to the window determined by the congestion control
      algorithm. PRR also improves accuracy of the amount of data sent
      during loss recovery.
      
      The patch implements the recommended flavor of PRR called PRR-SSRB
      (Proportional rate reduction with slow start reduction bound) and
      replaces the existing rate halving algorithm. PRR improves upon the
      existing Linux fast recovery under a number of conditions including:
        1) burst losses where the losses implicitly reduce the amount of
      outstanding data (pipe) below the ssthresh value selected by the
      congestion control algorithm and,
        2) losses near the end of short flows where application runs out of
      data to send.
      
      As an example, with the existing rate halving implementation a single
      loss event can cause a connection carrying short Web transactions to
      go into the slow start mode after the recovery. This is because during
      recovery Linux pulls the congestion window down to packets_in_flight+1
      on every ACK. A short Web response often runs out of new data to send
      and its pipe reduces to zero by the end of recovery when all its packets
      are drained from the network. Subsequent HTTP responses using the same
      connection will have to slow start to raise cwnd to ssthresh. PRR on
      the other hand aims for the cwnd to be as close as possible to ssthresh
      by the end of recovery.
      
      A description of PRR and a discussion of its performance can be found at
      the following links:
      - IETF Draft:
          http://tools.ietf.org/html/draft-mathis-tcpm-proportional-rate-reduction-01
      - IETF Slides:
          http://www.ietf.org/proceedings/80/slides/tcpm-6.pdf
          http://tools.ietf.org/agenda/81/slides/tcpm-2.pdf
      - Paper to appear in Internet Measurements Conference (IMC) 2011:
          Improving TCP Loss Recovery
          Nandita Dukkipati, Matt Mathis, Yuchung Cheng
      Signed-off-by: NNandita Dukkipati <nanditad@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a262f0cd
  9. 09 6月, 2011 1 次提交
    • J
      tcp: RFC2988bis + taking RTT sample from 3WHS for the passive open side · 9ad7c049
      Jerry Chu 提交于
      This patch lowers the default initRTO from 3secs to 1sec per
      RFC2988bis. It falls back to 3secs if the SYN or SYN-ACK packet
      has been retransmitted, AND the TCP timestamp option is not on.
      
      It also adds support to take RTT sample during 3WHS on the passive
      open side, just like its active open counterpart, and uses it, if
      valid, to seed the initRTO for the data transmission phase.
      
      The patch also resets ssthresh to its initial default at the
      beginning of the data transmission phase, and reduces cwnd to 1 if
      there has been MORE THAN ONE retransmission during 3WHS per RFC5681.
      Signed-off-by: NH.K. Jerry Chu <hkchu@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9ad7c049
  10. 31 8月, 2010 1 次提交
    • J
      tcp: Add TCP_USER_TIMEOUT socket option. · dca43c75
      Jerry Chu 提交于
      This patch provides a "user timeout" support as described in RFC793. The
      socket option is also needed for the the local half of RFC5482 "TCP User
      Timeout Option".
      
      TCP_USER_TIMEOUT is a TCP level socket option that takes an unsigned int,
      when > 0, to specify the maximum amount of time in ms that transmitted
      data may remain unacknowledged before TCP will forcefully close the
      corresponding connection and return ETIMEDOUT to the application. If
      0 is given, TCP will continue to use the system default.
      
      Increasing the user timeouts allows a TCP connection to survive extended
      periods without end-to-end connectivity. Decreasing the user timeouts
      allows applications to "fail fast" if so desired. Otherwise it may take
      upto 20 minutes with the current system defaults in a normal WAN
      environment.
      
      The socket option can be made during any state of a TCP connection, but
      is only effective during the synchronized states of a connection
      (ESTABLISHED, FIN-WAIT-1, FIN-WAIT-2, CLOSE-WAIT, CLOSING, or LAST-ACK).
      Moreover, when used with the TCP keepalive (SO_KEEPALIVE) option,
      TCP_USER_TIMEOUT will overtake keepalive to determine when to close a
      connection due to keepalive failure.
      
      The option does not change in anyway when TCP retransmits a packet, nor
      when a keepalive probe will be sent.
      
      This option, like many others, will be inherited by an acceptor from its
      listener.
      Signed-off-by: NH.K. Jerry Chu <hkchu@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      dca43c75
  11. 19 2月, 2010 2 次提交
    • A
      net: TCP thin dupack · 7e380175
      Andreas Petlund 提交于
      This patch enables fast retransmissions after one dupACK for
      TCP if the stream is identified as thin. This will reduce
      latencies for thin streams that are not able to trigger fast
      retransmissions due to high packet interarrival time. This
      mechanism is only active if enabled by iocontrol or syscontrol
      and the stream is identified as thin.
      Signed-off-by: NAndreas Petlund <apetlund@simula.no>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7e380175
    • A
      net: TCP thin linear timeouts · 36e31b0a
      Andreas Petlund 提交于
      This patch will make TCP use only linear timeouts if the
      stream is thin. This will help to avoid the very high latencies
      that thin stream suffer because of exponential backoff. This
      mechanism is only active if enabled by iocontrol or syscontrol
      and the stream is identified as thin. A maximum of 6 linear
      timeouts is tried before exponential backoff is resumed.
      Signed-off-by: NAndreas Petlund <apetlund@simula.no>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      36e31b0a
  12. 03 12月, 2009 2 次提交
    • W
      TCPCT part 1d: define TCP cookie option, extend existing struct's · 435cf559
      William Allen Simpson 提交于
      Data structures are carefully composed to require minimal additions.
      For example, the struct tcp_options_received cookie_plus variable fits
      between existing 16-bit and 8-bit variables, requiring no additional
      space (taking alignment into consideration).  There are no additions to
      tcp_request_sock, and only 1 pointer in tcp_sock.
      
      This is a significantly revised implementation of an earlier (year-old)
      patch that no longer applies cleanly, with permission of the original
      author (Adam Langley):
      
          http://thread.gmane.org/gmane.linux.network/102586
      
      The principle difference is using a TCP option to carry the cookie nonce,
      instead of a user configured offset in the data.  This is more flexible and
      less subject to user configuration error.  Such a cookie option has been
      suggested for many years, and is also useful without SYN data, allowing
      several related concepts to use the same extension option.
      
          "Re: SYN floods (was: does history repeat itself?)", September 9, 1996.
          http://www.merit.net/mail.archives/nanog/1996-09/msg00235.html
      
          "Re: what a new TCP header might look like", May 12, 1998.
          ftp://ftp.isi.edu/end2end/end2end-interest-1998.mail
      
      These functions will also be used in subsequent patches that implement
      additional features.
      
      Requires:
         TCPCT part 1a: add request_values parameter for sending SYNACK
         TCPCT part 1b: generate Responder Cookie secret
         TCPCT part 1c: sysctl_tcp_cookie_size, socket option TCP_COOKIE_TRANSACTIONS
      
      Signed-off-by: William.Allen.Simpson@gmail.com
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      435cf559
    • W
      TCPCT part 1c: sysctl_tcp_cookie_size, socket option TCP_COOKIE_TRANSACTIONS · 519855c5
      William Allen Simpson 提交于
      Define sysctl (tcp_cookie_size) to turn on and off the cookie option
      default globally, instead of a compiled configuration option.
      
      Define per socket option (TCP_COOKIE_TRANSACTIONS) for setting constant
      data values, retrieving variable cookie values, and other facilities.
      
      Move inline tcp_clear_options() unchanged from net/tcp.h to linux/tcp.h,
      near its corresponding struct tcp_options_received (prior to changes).
      
      This is a straightforward re-implementation of an earlier (year-old)
      patch that no longer applies cleanly, with permission of the original
      author (Adam Langley):
      
          http://thread.gmane.org/gmane.linux.network/102586
      
      These functions will also be used in subsequent patches that implement
      additional features.
      
      Requires:
         net: TCP_MSS_DEFAULT, TCP_MSS_DESIRED
      
      Signed-off-by: William.Allen.Simpson@gmail.com
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      519855c5
  13. 14 11月, 2009 1 次提交
  14. 05 11月, 2009 1 次提交
  15. 02 9月, 2009 1 次提交
  16. 20 4月, 2009 1 次提交
  17. 16 3月, 2009 2 次提交
    • I
      tcp: cache result of earlier divides when mss-aligning things · 2a3a041c
      Ilpo Järvinen 提交于
      The results is very unlikely change every so often so we
      hardly need to divide again after doing that once for a
      connection. Yet, if divide still becomes necessary we
      detect that and do the right thing and again settle for
      non-divide state. Takes the u16 space which was previously
      taken by the plain xmit_size_goal.
      
      This should take care part of the tso vs non-tso difference
      we found earlier.
      Signed-off-by: NIlpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2a3a041c
    • I
      tcp: simplify tcp_current_mss · 0c54b85f
      Ilpo Järvinen 提交于
      There's very little need for most of the callsites to get
      tp->xmit_goal_size updated. That will cost us divide as is,
      so slice the function in two. Also, the only users of the
      tp->xmit_goal_size are directly behind tcp_current_mss(),
      so there's no need to store that variable into tcp_sock
      at all! The drop of xmit_goal_size currently leaves 16-bit
      hole and some reorganization would again be necessary to
      change that (but I'm aiming to fill that hole with u16
      xmit_goal_size_segs to cache the results of the remaining
      divide to get that tso on regression).
      
      Bring xmit_goal_size parts into tcp.c
      Signed-off-by: NIlpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Cc: Evgeniy Polyakov <zbr@ioremap.net>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0c54b85f
  18. 02 3月, 2009 1 次提交
  19. 15 2月, 2009 1 次提交
  20. 08 10月, 2008 1 次提交
    • I
      tcp: kill pointless urg_mode · 33f5f57e
      Ilpo Järvinen 提交于
      It all started from me noticing that this urgent check in
      tcp_clean_rtx_queue is unnecessarily inside the loop. Then
      I took a longer look to it and found out that the users of
      urg_mode can trivially do without, well almost, there was
      one gotcha.
      
      Bonus: those funny people who use urg with >= 2^31 write_seq -
      snd_una could now rejoice too (that's the only purpose for the
      between being there, otherwise a simple compare would have done
      the thing). Not that I assume that the rest of the tcp code
      happily lives with such mind-boggling numbers :-). Alas, it
      turned out to be impossible to set wmem to such numbers anyway,
      yes I really tried a big sendfile after setting some wmem but
      nothing happened :-). ...Tcp_wmem is int and so is sk_sndbuf...
      So I hacked a bit variable to long and found out that it seems
      to work... :-)
      Signed-off-by: NIlpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      33f5f57e
  21. 21 9月, 2008 2 次提交
  22. 19 7月, 2008 1 次提交
  23. 13 6月, 2008 1 次提交
    • D
      tcp: Revert 'process defer accept as established' changes. · ec0a1966
      David S. Miller 提交于
      This reverts two changesets, ec3c0982
      ("[TCP]: TCP_DEFER_ACCEPT updates - process as established") and
      the follow-on bug fix 9ae27e0a
      ("tcp: Fix slab corruption with ipv6 and tcp6fuzz").
      
      This change causes several problems, first reported by Ingo Molnar
      as a distcc-over-loopback regression where connections were getting
      stuck.
      
      Ilpo Järvinen first spotted the locking problems.  The new function
      added by this code, tcp_defer_accept_check(), only has the
      child socket locked, yet it is modifying state of the parent
      listening socket.
      
      Fixing that is non-trivial at best, because we can't simply just grab
      the parent listening socket lock at this point, because it would
      create an ABBA deadlock.  The normal ordering is parent listening
      socket --> child socket, but this code path would require the
      reverse lock ordering.
      
      Next is a problem noticed by Vitaliy Gusev, he noted:
      
      ----------------------------------------
      >--- a/net/ipv4/tcp_timer.c
      >+++ b/net/ipv4/tcp_timer.c
      >@@ -481,6 +481,11 @@ static void tcp_keepalive_timer (unsigned long data)
      > 		goto death;
      > 	}
      >
      >+	if (tp->defer_tcp_accept.request && sk->sk_state == TCP_ESTABLISHED) {
      >+		tcp_send_active_reset(sk, GFP_ATOMIC);
      >+		goto death;
      
      Here socket sk is not attached to listening socket's request queue. tcp_done()
      will not call inet_csk_destroy_sock() (and tcp_v4_destroy_sock() which should
      release this sk) as socket is not DEAD. Therefore socket sk will be lost for
      freeing.
      ----------------------------------------
      
      Finally, Alexey Kuznetsov argues that there might not even be any
      real value or advantage to these new semantics even if we fix all
      of the bugs:
      
      ----------------------------------------
      Hiding from accept() sockets with only out-of-order data only
      is the only thing which is impossible with old approach. Is this really
      so valuable? My opinion: no, this is nothing but a new loophole
      to consume memory without control.
      ----------------------------------------
      
      So revert this thing for now.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ec0a1966
  24. 29 5月, 2008 1 次提交
    • I
      tcp: Reorganize tcp_sock to fill 64-bit holes & improve locality · b79eeeb9
      Ilpo Järvinen 提交于
      I tried to group recovery related fields nearby (non-CA_Open related
      variables, to be more accurate) so that one to three cachelines would
      not be necessary in CA_Open. These are now contiguously deployed:
      
        struct sk_buff_head        out_of_order_queue;   /*  1968    80 */
        /* --- cacheline 32 boundary (2048 bytes) --- */
        struct tcp_sack_block      duplicate_sack[1];    /*  2048     8 */
        struct tcp_sack_block      selective_acks[4];    /*  2056    32 */
        struct tcp_sack_block      recv_sack_cache[4];   /*  2088    32 */
        /* --- cacheline 33 boundary (2112 bytes) was 8 bytes ago --- */
        struct sk_buff *           highest_sack;         /*  2120     8 */
        int                        lost_cnt_hint;        /*  2128     4 */
        int                        retransmit_cnt_hint;  /*  2132     4 */
        u32                        lost_retrans_low;     /*  2136     4 */
        u8                         reordering;           /*  2140     1 */
        u8                         keepalive_probes;     /*  2141     1 */
      
        /* XXX 2 bytes hole, try to pack */
      
        u32                        prior_ssthresh;       /*  2144     4 */
        u32                        high_seq;             /*  2148     4 */
        u32                        retrans_stamp;        /*  2152     4 */
        u32                        undo_marker;          /*  2156     4 */
        int                        undo_retrans;         /*  2160     4 */
        u32                        total_retrans;        /*  2164     4 */
      
      ...and they're then followed by URG slowpath & keepalive related
      variables.
      
      Head of the out_of_order_queue always needed for empty checks, if
      that's empty (and TCP is in CA_Open), following ~200 bytes (in 64-bit)
      shouldn't be necessary for anything. If only OFO queue exists but TCP
      is in CA_Open, selective_acks (and possibly duplicate_sack) are
      necessary besides the out_of_order_queue but the rest of the block
      again shouldn't be (ie., the other direction had losses).
      
      As the cacheline boundaries depend on many factors in the preceeding
      stuff, trying to align considering them doesn't make too much sense.
      
      Commented one ordering hazard.
      
      There are number of low utilized u8/16s that could be combined get 2
      bytes less in total so that the hole could be made to vanish (includes
      at least ecn_flags, urg_data, urg_mode, frto_counter, nonagle).
      Signed-off-by: NIlpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Acked-by: NEric Dumazet <dada1@cosmosbay.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b79eeeb9
  25. 22 5月, 2008 1 次提交
  26. 22 3月, 2008 1 次提交
    • P
      [TCP]: TCP_DEFER_ACCEPT updates - process as established · ec3c0982
      Patrick McManus 提交于
      Change TCP_DEFER_ACCEPT implementation so that it transitions a
      connection to ESTABLISHED after handshake is complete instead of
      leaving it in SYN-RECV until some data arrvies. Place connection in
      accept queue when first data packet arrives from slow path.
      
      Benefits:
        - established connection is now reset if it never makes it
         to the accept queue
      
       - diagnostic state of established matches with the packet traces
         showing completed handshake
      
       - TCP_DEFER_ACCEPT timeouts are expressed in seconds and can now be
         enforced with reasonable accuracy instead of rounding up to next
         exponential back-off of syn-ack retry.
      Signed-off-by: NPatrick McManus <mcmanus@ducksong.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ec3c0982
  27. 29 1月, 2008 3 次提交
    • I
      [TCP]: Rewrite SACK block processing & sack_recv_cache use · 68f8353b
      Ilpo Järvinen 提交于
      Key points of this patch are:
      
        - In case new SACK information is advance only type, no skb
          processing below previously discovered highest point is done
        - Optimize cases below highest point too since there's no need
          to always go up to highest point (which is very likely still
          present in that SACK), this is not entirely true though
          because I'm dropping the fastpath_skb_hint which could
          previously optimize those cases even better. Whether that's
          significant, I'm not too sure.
      
      Currently it will provide skipping by walking. Combined with
      RB-tree, all skipping would become fast too regardless of window
      size (can be done incrementally later).
      
      Previously a number of cases in TCP SACK processing fails to
      take advantage of costly stored information in sack_recv_cache,
      most importantly, expected events such as cumulative ACK and new
      hole ACKs. Processing on such ACKs result in rather long walks
      building up latencies (which easily gets nasty when window is
      huge). Those latencies are often completely unnecessary
      compared with the amount of _new_ information received, usually
      for cumulative ACK there's no new information at all, yet TCP
      walks whole queue unnecessary potentially taking a number of
      costly cache misses on the way, etc.!
      
      Since the inclusion of highest_sack, there's a lot information
      that is very likely redundant (SACK fastpath hint stuff,
      fackets_out, highest_sack), though there's no ultimate guarantee
      that they'll remain the same whole the time (in all unearthly
      scenarios). Take advantage of this knowledge here and drop
      fastpath hint and use direct access to highest SACKed skb as
      a replacement.
      
      Effectively "special cased" fastpath is dropped. This change
      adds some complexity to introduce better coveraged "fastpath",
      though the added complexity should make TCP behave more cache
      friendly.
      
      The current ACK's SACK blocks are compared against each cached
      block individially and only ranges that are new are then scanned
      by the high constant walk. For other parts of write queue, even
      when in previously known part of the SACK blocks, a faster skip
      function is used (if necessary at all). In addition, whenever
      possible, TCP fast-forwards to highest_sack skb that was made
      available by an earlier patch. In typical case, no other things
      but this fast-forward and mandatory markings after that occur
      making the access pattern quite similar to the former fastpath
      "special case".
      
      DSACKs are special case that must always be walked.
      
      The local to recv_sack_cache copying could be more intelligent
      w.r.t DSACKs which are likely to be there only once but that
      is left to a separate patch.
      Signed-off-by: NIlpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      68f8353b
    • I
    • I
      [TCP]: Convert highest_sack to sk_buff to allow direct access · a47e5a98
      Ilpo Järvinen 提交于
      It is going to replace the sack fastpath hint quite soon... :-)
      Signed-off-by: NIlpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a47e5a98
  28. 16 10月, 2007 1 次提交
  29. 12 10月, 2007 1 次提交
  30. 11 10月, 2007 2 次提交