1. 13 6月, 2008 1 次提交
    • D
      tcp: Revert 'process defer accept as established' changes. · ec0a1966
      David S. Miller 提交于
      This reverts two changesets, ec3c0982
      ("[TCP]: TCP_DEFER_ACCEPT updates - process as established") and
      the follow-on bug fix 9ae27e0a
      ("tcp: Fix slab corruption with ipv6 and tcp6fuzz").
      
      This change causes several problems, first reported by Ingo Molnar
      as a distcc-over-loopback regression where connections were getting
      stuck.
      
      Ilpo Järvinen first spotted the locking problems.  The new function
      added by this code, tcp_defer_accept_check(), only has the
      child socket locked, yet it is modifying state of the parent
      listening socket.
      
      Fixing that is non-trivial at best, because we can't simply just grab
      the parent listening socket lock at this point, because it would
      create an ABBA deadlock.  The normal ordering is parent listening
      socket --> child socket, but this code path would require the
      reverse lock ordering.
      
      Next is a problem noticed by Vitaliy Gusev, he noted:
      
      ----------------------------------------
      >--- a/net/ipv4/tcp_timer.c
      >+++ b/net/ipv4/tcp_timer.c
      >@@ -481,6 +481,11 @@ static void tcp_keepalive_timer (unsigned long data)
      > 		goto death;
      > 	}
      >
      >+	if (tp->defer_tcp_accept.request && sk->sk_state == TCP_ESTABLISHED) {
      >+		tcp_send_active_reset(sk, GFP_ATOMIC);
      >+		goto death;
      
      Here socket sk is not attached to listening socket's request queue. tcp_done()
      will not call inet_csk_destroy_sock() (and tcp_v4_destroy_sock() which should
      release this sk) as socket is not DEAD. Therefore socket sk will be lost for
      freeing.
      ----------------------------------------
      
      Finally, Alexey Kuznetsov argues that there might not even be any
      real value or advantage to these new semantics even if we fix all
      of the bugs:
      
      ----------------------------------------
      Hiding from accept() sockets with only out-of-order data only
      is the only thing which is impossible with old approach. Is this really
      so valuable? My opinion: no, this is nothing but a new loophole
      to consume memory without control.
      ----------------------------------------
      
      So revert this thing for now.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ec0a1966
  2. 22 5月, 2008 1 次提交
  3. 22 3月, 2008 1 次提交
    • P
      [TCP]: TCP_DEFER_ACCEPT updates - process as established · ec3c0982
      Patrick McManus 提交于
      Change TCP_DEFER_ACCEPT implementation so that it transitions a
      connection to ESTABLISHED after handshake is complete instead of
      leaving it in SYN-RECV until some data arrvies. Place connection in
      accept queue when first data packet arrives from slow path.
      
      Benefits:
        - established connection is now reset if it never makes it
         to the accept queue
      
       - diagnostic state of established matches with the packet traces
         showing completed handshake
      
       - TCP_DEFER_ACCEPT timeouts are expressed in seconds and can now be
         enforced with reasonable accuracy instead of rounding up to next
         exponential back-off of syn-ack retry.
      Signed-off-by: NPatrick McManus <mcmanus@ducksong.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ec3c0982
  4. 29 1月, 2008 3 次提交
    • I
      [TCP]: Rewrite SACK block processing & sack_recv_cache use · 68f8353b
      Ilpo Järvinen 提交于
      Key points of this patch are:
      
        - In case new SACK information is advance only type, no skb
          processing below previously discovered highest point is done
        - Optimize cases below highest point too since there's no need
          to always go up to highest point (which is very likely still
          present in that SACK), this is not entirely true though
          because I'm dropping the fastpath_skb_hint which could
          previously optimize those cases even better. Whether that's
          significant, I'm not too sure.
      
      Currently it will provide skipping by walking. Combined with
      RB-tree, all skipping would become fast too regardless of window
      size (can be done incrementally later).
      
      Previously a number of cases in TCP SACK processing fails to
      take advantage of costly stored information in sack_recv_cache,
      most importantly, expected events such as cumulative ACK and new
      hole ACKs. Processing on such ACKs result in rather long walks
      building up latencies (which easily gets nasty when window is
      huge). Those latencies are often completely unnecessary
      compared with the amount of _new_ information received, usually
      for cumulative ACK there's no new information at all, yet TCP
      walks whole queue unnecessary potentially taking a number of
      costly cache misses on the way, etc.!
      
      Since the inclusion of highest_sack, there's a lot information
      that is very likely redundant (SACK fastpath hint stuff,
      fackets_out, highest_sack), though there's no ultimate guarantee
      that they'll remain the same whole the time (in all unearthly
      scenarios). Take advantage of this knowledge here and drop
      fastpath hint and use direct access to highest SACKed skb as
      a replacement.
      
      Effectively "special cased" fastpath is dropped. This change
      adds some complexity to introduce better coveraged "fastpath",
      though the added complexity should make TCP behave more cache
      friendly.
      
      The current ACK's SACK blocks are compared against each cached
      block individially and only ranges that are new are then scanned
      by the high constant walk. For other parts of write queue, even
      when in previously known part of the SACK blocks, a faster skip
      function is used (if necessary at all). In addition, whenever
      possible, TCP fast-forwards to highest_sack skb that was made
      available by an earlier patch. In typical case, no other things
      but this fast-forward and mandatory markings after that occur
      making the access pattern quite similar to the former fastpath
      "special case".
      
      DSACKs are special case that must always be walked.
      
      The local to recv_sack_cache copying could be more intelligent
      w.r.t DSACKs which are likely to be there only once but that
      is left to a separate patch.
      Signed-off-by: NIlpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      68f8353b
    • I
    • I
      [TCP]: Convert highest_sack to sk_buff to allow direct access · a47e5a98
      Ilpo Järvinen 提交于
      It is going to replace the sack fastpath hint quite soon... :-)
      Signed-off-by: NIlpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a47e5a98
  5. 16 10月, 2007 1 次提交
  6. 12 10月, 2007 1 次提交
  7. 11 10月, 2007 5 次提交
  8. 26 4月, 2007 5 次提交
  9. 09 2月, 2007 1 次提交
    • B
      [TCP]: Seperate DSACK from SACK fast path · 6f74651a
      Baruch Even 提交于
      Move DSACK code outside the SACK fast-path checking code. If the DSACK
      determined that the information was too old we stayed with a partial cache
      copied. Most likely this matters very little since the next packet will not be
      DSACK and we will find it in the cache. but it's still not good form and there
      is little reason to couple the two checks.
      
      Since the SACK receive cache doesn't need the data to be in host order we also
      remove the ntohl in the checking loop.
      Signed-off-by: NBaruch Even <baruch@ev-en.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6f74651a
  10. 03 12月, 2006 4 次提交
  11. 19 10月, 2006 1 次提交
    • J
      [TCP]: Bound TSO defer time · ae8064ac
      John Heffner 提交于
      This patch limits the amount of time you will defer sending a TSO segment
      to less than two clock ticks, or the time between two acks, whichever is
      longer.
      
      On slow links, deferring causes significant bursts.  See attached plots,
      which show RTT through a 1 Mbps link with a 100 ms RTT and ~100 ms queue
      for (a) non-TSO, (b) currnet TSO, and (c) patched TSO.  This burstiness
      causes significant jitter, tends to overflow queues early (bad for short
      queues), and makes delay-based congestion control more difficult.
      
      Deferring by a couple clock ticks I believe will have a relatively small
      impact on performance.
      Signed-off-by: NJohn Heffner <jheffner@psc.edu>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ae8064ac
  12. 29 9月, 2006 3 次提交
  13. 23 6月, 2006 1 次提交
  14. 18 6月, 2006 1 次提交
  15. 26 4月, 2006 1 次提交
  16. 21 3月, 2006 1 次提交
  17. 04 1月, 2006 3 次提交
  18. 11 11月, 2005 2 次提交
  19. 30 8月, 2005 4 次提交
    • A
      [ICSK]: Move TCP congestion avoidance members to icsk · 6687e988
      Arnaldo Carvalho de Melo 提交于
      This changeset basically moves tcp_sk()->{ca_ops,ca_state,etc} to inet_csk(),
      minimal renaming/moving done in this changeset to ease review.
      
      Most of it is just changes of struct tcp_sock * to struct sock * parameters.
      
      With this we move to a state closer to two interesting goals:
      
      1. Generalisation of net/ipv4/tcp_diag.c, becoming inet_diag.c, being used
         for any INET transport protocol that has struct inet_hashinfo and are
         derived from struct inet_connection_sock. Keeps the userspace API, that will
         just not display DCCP sockets, while newer versions of tools can support
         DCCP.
      
      2. INET generic transport pluggable Congestion Avoidance infrastructure, using
         the current TCP CA infrastructure with DCCP.
      Signed-off-by: NArnaldo Carvalho de Melo <acme@mandriva.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6687e988
    • A
      [ICSK]: Introduce reqsk_queue_prune from code in tcp_synack_timer · 295f7324
      Arnaldo Carvalho de Melo 提交于
      With this we're very close to getting all of the current TCP
      refactorings in my dccp-2.6 tree merged, next changeset will export
      some functions needed by the current DCCP code and then dccp-2.6.git
      will be born!
      Signed-off-by: NArnaldo Carvalho de Melo <acme@ghostprotocols.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      295f7324
    • A
      [NET]: Introduce inet_connection_sock · 463c84b9
      Arnaldo Carvalho de Melo 提交于
      This creates struct inet_connection_sock, moving members out of struct
      tcp_sock that are shareable with other INET connection oriented
      protocols, such as DCCP, that in my private tree already uses most of
      these members.
      
      The functions that operate on these members were renamed, using a
      inet_csk_ prefix while not being moved yet to a new file, so as to
      ease the review of these changes.
      Signed-off-by: NArnaldo Carvalho de Melo <acme@ghostprotocols.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      463c84b9
    • A
      [INET]: Generalise tcp_tw_bucket, aka TIME_WAIT sockets · 8feaf0c0
      Arnaldo Carvalho de Melo 提交于
      This paves the way to generalise the rest of the sock ID lookup
      routines and saves some bytes in TCPv4 TIME_WAIT sockets on distro
      kernels (where IPv6 is always built as a module):
      
      [root@qemu ~]# grep tw_sock /proc/slabinfo
      tw_sock_TCPv6  0  0  128  31  1
      tw_sock_TCP    0  0   96  41  1
      [root@qemu ~]#
      
      Now if a protocol wants to use the TIME_WAIT generic infrastructure it
      only has to set the sk_prot->twsk_obj_size field with the size of its
      inet_timewait_sock derived sock and proto_register will create
      sk_prot->twsk_slab, for now its only for INET sockets, but we can
      introduce timewait_sock later if some non INET transport protocolo
      wants to use this stuff.
      
      Next changesets will take advantage of this new infrastructure to
      generalise even more TCP code.
      
      [acme@toy net-2.6.14]$ grep built-in /tmp/before.size /tmp/after.size
      /tmp/before.size: 188646   11764    5068  205478   322a6 net/ipv4/built-in.o
      /tmp/after.size:  188144   11764    5068  204976   320b0 net/ipv4/built-in.o
      [acme@toy net-2.6.14]$
      
      Tested with both IPv4 & IPv6 (::1 (localhost) & ::ffff:172.20.0.1
      (qemu host)).
      Signed-off-by: NArnaldo Carvalho de Melo <acme@ghostprotocols.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8feaf0c0