1. 08 Oct, 2008 1 commit
    • I
      tcp: kill pointless urg_mode · 33f5f57e
      Ilpo Järvinen authored
      It all started from me noticing that this urgent check in
      tcp_clean_rtx_queue is unnecessarily inside the loop. Then
      I took a longer look at it and found out that the users of
      urg_mode can trivially do without it; well, almost, there was
      one gotcha.
      
      Bonus: those funny people who use urg with >= 2^31 write_seq -
      snd_una could now rejoice too (that's the only purpose for the
      between being there, otherwise a simple compare would have done
      the thing). Not that I assume that the rest of the tcp code
      happily lives with such mind-boggling numbers :-). Alas, it
      turned out to be impossible to set wmem to such numbers anyway,
      yes I really tried a big sendfile after setting some wmem but
      nothing happened :-). ...Tcp_wmem is int and so is sk_sndbuf...
      So I hacked a bit variable to long and found out that it seems
      to work... :-)
      Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Signed-off-by: David S. Miller <davem@davemloft.net>
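The remark above about `between` versus "a simple compare" can be illustrated with a small user-space sketch. The helper names `seq_before` and `seq_between` are chosen for this example and are only modelled on the usual modulo-2^32 sequence arithmetic; this is not the code from the commit. The signed-difference compare is only meaningful while the two values are less than 2^31 apart, whereas the range check compares unsigned distances from the lower bound and keeps working for wider windows.

```c
#include <assert.h>
#include <stdint.h>

/* Wraparound-safe "a is before b". Only meaningful while a and b are
 * less than 2^31 apart in sequence space. */
static int seq_before(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) < 0;
}

/* True when low <= seq <= high in modulo-2^32 arithmetic. Both sides are
 * unsigned distances from low, so a span of 2^31 or more still works. */
static int seq_between(uint32_t seq, uint32_t low, uint32_t high)
{
    return high - low >= seq - low;
}

/* Small self-check showing where the two helpers disagree. */
static void seq_examples(void)
{
    uint32_t una = 1000u;               /* illustrative snd_una */
    uint32_t far = una + 0x90000000u;   /* more than 2^31 ahead */

    assert(seq_before(una, una + 100u));
    assert(!seq_before(una, far));      /* the simple compare gives up here */
    assert(seq_between(far, una, una + 0xA0000000u));
}
```

Calling `seq_examples()` exercises both the ordinary case and the case with a distance above 2^31.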
  2. 21 Sep, 2008 2 commits
  3. 19 Jul, 2008 1 commit
  4. 13 Jun, 2008 1 commit
    • D
      tcp: Revert 'process defer accept as established' changes. · ec0a1966
      David S. Miller authored
      This reverts two changesets, ec3c0982
      ("[TCP]: TCP_DEFER_ACCEPT updates - process as established") and
      the follow-on bug fix 9ae27e0a
      ("tcp: Fix slab corruption with ipv6 and tcp6fuzz").
      
      This change causes several problems, first reported by Ingo Molnar
      as a distcc-over-loopback regression where connections were getting
      stuck.
      
      Ilpo Järvinen first spotted the locking problems.  The new function
      added by this code, tcp_defer_accept_check(), only has the
      child socket locked, yet it is modifying state of the parent
      listening socket.
      
      Fixing that is non-trivial at best, because we can't simply just grab
      the parent listening socket lock at this point, because it would
      create an ABBA deadlock.  The normal ordering is parent listening
      socket --> child socket, but this code path would require the
      reverse lock ordering.
      
      Next is a problem noticed by Vitaliy Gusev, who noted:
      
      ----------------------------------------
      >--- a/net/ipv4/tcp_timer.c
      >+++ b/net/ipv4/tcp_timer.c
      >@@ -481,6 +481,11 @@ static void tcp_keepalive_timer (unsigned long data)
      > 		goto death;
      > 	}
      >
      >+	if (tp->defer_tcp_accept.request && sk->sk_state == TCP_ESTABLISHED) {
      >+		tcp_send_active_reset(sk, GFP_ATOMIC);
      >+		goto death;
      
      Here socket sk is not attached to listening socket's request queue. tcp_done()
      will not call inet_csk_destroy_sock() (and tcp_v4_destroy_sock() which should
      release this sk) as socket is not DEAD. Therefore socket sk will be lost for
      freeing.
      ----------------------------------------
      
      Finally, Alexey Kuznetsov argues that there might not even be any
      real value or advantage to these new semantics even if we fix all
      of the bugs:
      
      ----------------------------------------
      Hiding from accept() sockets with only out-of-order data only
      is the only thing which is impossible with old approach. Is this really
      so valuable? My opinion: no, this is nothing but a new loophole
      to consume memory without control.
      ----------------------------------------
      
      So revert this thing for now.
      Signed-off-by: David S. Miller <davem@davemloft.net>
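The lock-ordering problem described in the commit above (normal order is parent listening socket, then child socket) can be modelled with a tiny single-threaded order checker. Everything here is hypothetical and for illustration only: `lock_rank`, `order_tracker` and `order_acquire` are invented names, no real locks are taken, and this is neither the kernel's lockdep nor a proposed fix. The sketch only shows why a path that already holds the child lock cannot safely go on to take the listener lock.

```c
#include <assert.h>

/* Lower rank must be taken first: listener, then child. */
enum lock_rank { RANK_NONE = 0, RANK_LISTENER = 1, RANK_CHILD = 2 };

struct order_tracker {
    int highest_held;   /* highest rank currently held, RANK_NONE if none */
};

/* Returns 0 when taking `rank` respects the listener -> child order,
 * or -1 when it would invert the order (the ABBA pattern). */
static int order_acquire(struct order_tracker *t, enum lock_rank rank)
{
    assert(t);
    if ((int)rank <= t->highest_held)
        return -1;
    t->highest_held = (int)rank;
    return 0;
}

/* Drop everything held, as at the end of a code path. */
static void order_release_all(struct order_tracker *t)
{
    assert(t);
    t->highest_held = RANK_NONE;
}
```

Taking `RANK_LISTENER` and then `RANK_CHILD` succeeds, while a path that starts with `RANK_CHILD` and then asks for `RANK_LISTENER` is rejected, which mirrors the reverse ordering the commit message warns about.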
  5. 29 May, 2008 1 commit
    • I
      tcp: Reorganize tcp_sock to fill 64-bit holes & improve locality · b79eeeb9
      Ilpo Järvinen authored
      I tried to group recovery related fields nearby (non-CA_Open related
      variables, to be more accurate) so that one to three cachelines would
      not be necessary in CA_Open. These are now contiguously deployed:
      
        struct sk_buff_head        out_of_order_queue;   /*  1968    80 */
        /* --- cacheline 32 boundary (2048 bytes) --- */
        struct tcp_sack_block      duplicate_sack[1];    /*  2048     8 */
        struct tcp_sack_block      selective_acks[4];    /*  2056    32 */
        struct tcp_sack_block      recv_sack_cache[4];   /*  2088    32 */
        /* --- cacheline 33 boundary (2112 bytes) was 8 bytes ago --- */
        struct sk_buff *           highest_sack;         /*  2120     8 */
        int                        lost_cnt_hint;        /*  2128     4 */
        int                        retransmit_cnt_hint;  /*  2132     4 */
        u32                        lost_retrans_low;     /*  2136     4 */
        u8                         reordering;           /*  2140     1 */
        u8                         keepalive_probes;     /*  2141     1 */
      
        /* XXX 2 bytes hole, try to pack */
      
        u32                        prior_ssthresh;       /*  2144     4 */
        u32                        high_seq;             /*  2148     4 */
        u32                        retrans_stamp;        /*  2152     4 */
        u32                        undo_marker;          /*  2156     4 */
        int                        undo_retrans;         /*  2160     4 */
        u32                        total_retrans;        /*  2164     4 */
      
      ...and they're then followed by URG slowpath & keepalive related
      variables.
      
      The head of the out_of_order_queue is always needed for empty checks; if
      that's empty (and TCP is in CA_Open), the following ~200 bytes (in 64-bit)
      shouldn't be necessary for anything. If only the OFO queue exists but TCP
      is in CA_Open, selective_acks (and possibly duplicate_sack) are
      necessary besides the out_of_order_queue but the rest of the block
      again shouldn't be (i.e., the other direction had losses).
      
      As the cacheline boundaries depend on many factors in the preceding
      stuff, trying to align considering them doesn't make too much sense.
      
      Commented one ordering hazard.
      
      There are a number of lightly used u8/u16 fields that could be combined
      to save 2 bytes in total so that the hole could be made to vanish
      (includes at least ecn_flags, urg_data, urg_mode, frto_counter, nonagle).
      Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Acked-by: Eric Dumazet <dada1@cosmosbay.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
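The "2 bytes hole" in the layout dump above can be reproduced with a cut-down user-space struct. This is a minimal sketch assuming a common ABI where `uint32_t` is 4-byte aligned; the structs borrow a few field names from the dump but are not `struct tcp_sock`, and `small_field` plus the `HOLE_BETWEEN` macro are hypothetical names introduced for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct with_hole {
    uint32_t lost_retrans_low;
    uint8_t  reordering;
    uint8_t  keepalive_probes;
    /* the compiler pads here so the next u32 is 4-byte aligned */
    uint32_t prior_ssthresh;
};

struct hole_filled {
    uint32_t lost_retrans_low;
    uint8_t  reordering;
    uint8_t  keepalive_probes;
    uint16_t small_field;   /* hypothetical field moved into the hole */
    uint32_t prior_ssthresh;
};

/* Bytes of padding between member `prev` and member `next` of `type`. */
#define HOLE_BETWEEN(type, prev, next) \
    (offsetof(type, next) - (offsetof(type, prev) + sizeof(((type *)0)->prev)))

static size_t hole_in_original(void)
{
    return HOLE_BETWEEN(struct with_hole, keepalive_probes, prior_ssthresh);
}

static size_t hole_after_packing(void)
{
    /* Filling the hole should not grow the struct on the ABIs assumed here. */
    assert(sizeof(struct hole_filled) == sizeof(struct with_hole));
    return HOLE_BETWEEN(struct hole_filled, small_field, prior_ssthresh);
}
```

On such an ABI the first helper reports a 2-byte hole and the second reports none, so a 16-bit field that previously lived elsewhere costs nothing once it is relocated.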
  6. 22 May, 2008 1 commit
  7. 22 Mar, 2008 1 commit
    • P
      [TCP]: TCP_DEFER_ACCEPT updates - process as established · ec3c0982
      Patrick McManus authored
      Change TCP_DEFER_ACCEPT implementation so that it transitions a
      connection to ESTABLISHED after handshake is complete instead of
      leaving it in SYN-RECV until some data arrives. Place connection in
      accept queue when first data packet arrives from slow path.
      
      Benefits:
        - established connection is now reset if it never makes it
          to the accept queue

        - diagnostic state of established matches with the packet traces
          showing completed handshake

        - TCP_DEFER_ACCEPT timeouts are expressed in seconds and can now be
          enforced with reasonable accuracy instead of rounding up to next
          exponential back-off of syn-ack retry.
      Signed-off-by: Patrick McManus <mcmanus@ducksong.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
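For context, this is roughly how an application opts in to the behaviour the commit above changes. It is a minimal Linux-only sketch: `enable_defer_accept` is a hypothetical wrapper name, while `setsockopt()` with `IPPROTO_TCP` and `TCP_DEFER_ACCEPT` taking an integer number of seconds is the documented socket option. How the kernel rounds or enforces the timeout depends on the kernel version and is not shown here.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Ask the kernel to wake the listener only once data has arrived, or
 * after roughly `seconds` have passed. Returns 0 on success and -1 on
 * error, with errno set by setsockopt(). */
static int enable_defer_accept(int listen_fd, int seconds)
{
    assert(seconds > 0);
    return setsockopt(listen_fd, IPPROTO_TCP, TCP_DEFER_ACCEPT,
                      &seconds, sizeof(seconds));
}
```

A server would typically call this on the socket before `listen()`, then use `accept()` as usual.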
  8. 29 Jan, 2008 3 commits
    • I
      [TCP]: Rewrite SACK block processing & sack_recv_cache use · 68f8353b
      Ilpo Järvinen authored
      Key points of this patch are:
      
        - In case new SACK information is advance only type, no skb
          processing below previously discovered highest point is done
        - Optimize cases below highest point too since there's no need
          to always go up to highest point (which is very likely still
          present in that SACK), this is not entirely true though
          because I'm dropping the fastpath_skb_hint which could
          previously optimize those cases even better. Whether that's
          significant, I'm not too sure.
      
      Currently it will provide skipping by walking. Combined with
      RB-tree, all skipping would become fast too regardless of window
      size (can be done incrementally later).
      
      Previously a number of cases in TCP SACK processing failed to
      take advantage of the costly stored information in sack_recv_cache,
      most importantly for expected events such as cumulative ACKs and
      new-hole ACKs. Processing such ACKs results in rather long walks
      that build up latencies (which easily get nasty when the window
      is huge). Those latencies are often completely unnecessary
      compared with the amount of _new_ information received; usually
      a cumulative ACK carries no new information at all, yet TCP
      walks the whole queue unnecessarily, potentially taking a number
      of costly cache misses on the way.
      
      Since the inclusion of highest_sack, there's a lot of information
      that is very likely redundant (SACK fastpath hint stuff,
      fackets_out, highest_sack), though there's no ultimate guarantee
      that they'll remain the same the whole time (in all unearthly
      scenarios). Take advantage of this knowledge here and drop
      fastpath hint and use direct access to highest SACKed skb as
      a replacement.
      
      Effectively the "special cased" fastpath is dropped. This change
      adds some complexity to introduce a "fastpath" with better
      coverage, though the added complexity should make TCP behave in
      a more cache-friendly way.
      
      The current ACK's SACK blocks are compared against each cached
      block individually and only ranges that are new are then scanned
      by the high constant walk. For other parts of write queue, even
      when in previously known part of the SACK blocks, a faster skip
      function is used (if necessary at all). In addition, whenever
      possible, TCP fast-forwards to highest_sack skb that was made
      available by an earlier patch. In the typical case, no other things
      but this fast-forward and mandatory markings after that occur
      making the access pattern quite similar to the former fastpath
      "special case".
      
      DSACKs are special case that must always be walked.
      
      The local to recv_sack_cache copying could be more intelligent
      w.r.t DSACKs which are likely to be there only once but that
      is left to a separate patch.
      Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Signed-off-by: David S. Miller <davem@davemloft.net>
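The idea of comparing an incoming SACK block against a cached one and walking only the ranges that are new can be sketched in user space as follows. This is a simplified illustration, not the kernel implementation: `sack_block` is redeclared locally with half-open `[start_seq, end_seq)` ranges, `sack_new_ranges` is a hypothetical helper, and only one fresh block is compared with one cached block.

```c
#include <assert.h>
#include <stdint.h>

struct sack_block {
    uint32_t start_seq;   /* first sequence number covered */
    uint32_t end_seq;     /* one past the last sequence number covered */
};

/* Wraparound-safe "a is before b" for values less than 2^31 apart. */
static int seq_before(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) < 0;
}

/* Writes the parts of `fresh` not covered by `cached` to out[] and
 * returns how many ranges (0, 1 or 2) still need a queue walk. */
static int sack_new_ranges(struct sack_block fresh, struct sack_block cached,
                           struct sack_block out[2])
{
    int n = 0;

    assert(seq_before(fresh.start_seq, fresh.end_seq));

    /* No overlap with the cache: the whole block is new. */
    if (!seq_before(fresh.start_seq, cached.end_seq) ||
        !seq_before(cached.start_seq, fresh.end_seq)) {
        out[n++] = fresh;
        return n;
    }
    /* Part in front of the cached block. */
    if (seq_before(fresh.start_seq, cached.start_seq)) {
        out[n].start_seq = fresh.start_seq;
        out[n].end_seq = cached.start_seq;
        n++;
    }
    /* Part behind the cached block (the "advance only" case). */
    if (seq_before(cached.end_seq, fresh.end_seq)) {
        out[n].start_seq = cached.end_seq;
        out[n].end_seq = fresh.end_seq;
        n++;
    }
    return n;
}
```

When the fresh block equals the cached one, the function returns 0, which corresponds to the case where an ACK carries no new SACK information and no walk is needed.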
    • I
      [TCP]: Convert highest_sack to sk_buff to allow direct access · a47e5a98
      Ilpo Järvinen authored
      It is going to replace the sack fastpath hint quite soon... :-)
      Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  9. 16 Oct, 2007 1 commit
  10. 12 Oct, 2007 1 commit
  11. 11 Oct, 2007 5 commits
  12. 26 Apr, 2007 5 commits
  13. 09 Feb, 2007 1 commit
    • B
      [TCP]: Seperate DSACK from SACK fast path · 6f74651a
      Baruch Even authored
      Move DSACK code outside the SACK fast-path checking code. If the DSACK
      determined that the information was too old we stayed with a partial cache
      copied. Most likely this matters very little since the next packet will not be
      DSACK and we will find it in the cache, but it's still not good form and there
      is little reason to couple the two checks.
      
      Since the SACK receive cache doesn't need the data to be in host order we also
      remove the ntohl in the checking loop.
      Signed-off-by: Baruch Even <baruch@ev-en.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
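The point about dropping `ntohl` can be shown with a short sketch: an equality test gives the same answer whether both values are in network or host byte order, so a cache that stores blocks exactly as they arrived can be compared without conversion. The struct and function names below are hypothetical and are not taken from the kernel source.

```c
#include <assert.h>
#include <stdint.h>
#include <arpa/inet.h>

/* One SACK block kept exactly as received, in network byte order. */
struct sack_wire_block {
    uint32_t start_seq;
    uint32_t end_seq;
};

/* Returns 1 when the first n incoming blocks equal the cached ones.
 * No ntohl() is needed: equal values stay equal in either byte order. */
static int sack_blocks_unchanged(const struct sack_wire_block *wire,
                                 const struct sack_wire_block *cache, int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        if (wire[i].start_seq != cache[i].start_seq ||
            wire[i].end_seq != cache[i].end_seq)
            return 0;
    }
    return 1;
}
```

Ordering comparisons are a different matter and would still need host-order values; only the equality check benefits from skipping the conversion.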
  14. 03 Dec, 2006 4 commits
  15. 19 Oct, 2006 1 commit
    • J
      [TCP]: Bound TSO defer time · ae8064ac
      John Heffner authored
      This patch limits the amount of time you will defer sending a TSO segment
      to less than two clock ticks, or the time between two acks, whichever is
      longer.
      
      On slow links, deferring causes significant bursts.  See attached plots,
      which show RTT through a 1 Mbps link with a 100 ms RTT and ~100 ms queue
      for (a) non-TSO, (b) current TSO, and (c) patched TSO.  This burstiness
      causes significant jitter, tends to overflow queues early (bad for short
      queues), and makes delay-based congestion control more difficult.
      
      Deferring by a couple clock ticks I believe will have a relatively small
      impact on performance.
      Signed-off-by: John Heffner <jheffner@psc.edu>
      Signed-off-by: David S. Miller <davem@davemloft.net>
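The bound described in the commit above (defer for less than two clock ticks) can be sketched with a small state machine. This is an illustration under stated assumptions, not the kernel's code: `tso_state` and `tso_may_defer` are hypothetical names, ticks are a plain wrapping 32-bit counter, and the function is assumed to be called whenever an ACK arrives. That calling pattern is what would give the "or the time between two acks, whichever is longer" behaviour.

```c
#include <assert.h>
#include <stdint.h>

struct tso_state {
    int      deferring;     /* nonzero while a segment is being held back */
    uint32_t deferred_at;   /* tick at which the current deferral began */
};

/* Returns 1 if the sender may keep deferring, 0 if it must send now.
 * A deferral that has already lasted more than one tick is ended. */
static int tso_may_defer(struct tso_state *st, uint32_t now_ticks)
{
    assert(st);
    if (st->deferring && (uint32_t)(now_ticks - st->deferred_at) > 1) {
        st->deferring = 0;
        return 0;
    }
    if (!st->deferring) {
        st->deferring = 1;
        st->deferred_at = now_ticks;
    }
    return 1;
}
```

The unsigned subtraction keeps the elapsed-tick calculation correct when the tick counter wraps around.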
  16. 29 Sep, 2006 3 commits
  17. 23 Jun, 2006 1 commit
  18. 18 Jun, 2006 1 commit
  19. 26 Apr, 2006 1 commit
  20. 21 Mar, 2006 1 commit
  21. 04 Jan, 2006 3 commits
  22. 11 Nov, 2005 1 commit