1. 25 Nov, 2008 2 commits
    • tcp: Try to restore large SKBs while SACK processing · 832d11c5
      Ilpo Järvinen committed
      During SACK processing, most of the benefits of TSO are eaten by
      the SACK blocks that one-by-one fragment SKBs to MSS sized chunks.
      That causes trouble when the cleanup work for them has to be done
      as a large cumulative ACK arrives. Try to return to the pre-split
      state already while more and more SACK info gets discovered, by
      combining newly discovered SACK areas with the previous skb if
      that's SACKed as well.
      
      This approach has a number of benefits:
      
      1) The processing overhead is spread more evenly over the RTT
      2) Write queue has fewer skbs to process (affects everything
         which has to walk the queue past the SACKed areas)
      3) Write queue is consistent the whole time, so no other parts
         of TCP have to be aware of this (this was not the case with
         some other approach that was, well, quite intrusive all
         around).
      4) Clean_rtx_queue can release most of the pages using a single
         put_page instead of the previous PAGE_SIZE/mss+1 calls
      
      In case a hole is fully filled by the new SACK block, we attempt
      to combine the next skb too, which allows construction of skbs
      that are even larger than what TSO split them to, and it handles
      the hole-on-every-nth patterns that often occur during slow start
      overshoot pretty nicely. For this to be really useful, though,
      a retransmission would also have to get lost, since cumulative
      ACKs advance one hole at a time in the most typical case.
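The merging idea described above can be sketched in user space as follows. This is a toy model only, not the kernel code: `struct seg`, `merge_into_prev` and `sack_and_merge` are hypothetical names, the queue is a plain array rather than an skb list, and sequence-number wraparound, pcounts and page frags are ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an skb in the write queue: a byte range
 * [start, end) plus a flag saying whether it has been SACKed. */
struct seg {
    uint32_t start;
    uint32_t end;
    bool sacked;
};

/* Fold entry i into entry i-1 (the ranges must be adjacent) and close
 * the gap in the array. Returns the new queue length. */
static size_t merge_into_prev(struct seg *q, size_t n, size_t i)
{
    assert(i > 0 && i < n && q[i - 1].end == q[i].start);
    q[i - 1].end = q[i].end;
    for (size_t j = i; j + 1 < n; j++)
        q[j] = q[j + 1];
    return n - 1;
}

/* Mark entry i as SACKed, then coalesce it with SACKed neighbours.
 * Returns the new queue length. */
size_t sack_and_merge(struct seg *q, size_t n, size_t i)
{
    assert(i < n);
    q[i].sacked = true;

    /* The hole was fully filled: the next entry is SACKed too, pull it in. */
    if (i + 1 < n && q[i + 1].sacked)
        n = merge_into_prev(q, n, i + 1);

    /* Combine with the previous entry if that one is SACKed as well. */
    if (i > 0 && q[i - 1].sacked)
        n = merge_into_prev(q, n, i);

    return n;
}
```

In this model, SACKing the single hole between two already-SACKed ranges collapses three entries into one, which is the "larger than what TSO split them to" effect in miniature.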
      
      TODO: handle upwards-only merging. That should be rather easy
      when the segment is fully SACKed, but I'm leaving that as a
      future work item (it won't make a very large difference anyway
      since the current approach already covers quite a lot of normal
      cases).
      
      I was earlier thinking of some sophisticated way of tracking
      timestamps of the first and the last segment but later on
      realized that it won't be that necessary at all to store the
      timestamp of the last segment. The cases that can occur are
      basically either:
        1) ambiguous => no sensible measurement can be taken anyway
        2) non-ambiguous due to reordering => having the timestamp
           of the last segment there just skews things further off
           rather than doing any good, since the ACK got triggered by
           one of the holes (besides some subtle issues that would make
           determining the right hole/skb an even harder problem).
           Anyway, it has nothing to do with this change then.
      
      I chose to route some abnormal-looking cases with goto noop;
      some could be handled differently (e.g., by stopping the
      walk at that skb, but again). In general, they either
      shouldn't happen at all or are rare enough to make no difference
      in practice.
      
      In theory this change (as a whole) could cause some macroscale
      (global) regression because of the cache misses that are taken
      over the round-trip time, but it very likely gets better because
      of far fewer (local) cache misses for other write queue walkers
      and for the big recovery-clearing cumulative ACK.
      
      It is worth noting that these benefits would be very easy to get
      without TSO/GSO being on as well, as long as the data is in pages
      so that we can merge them. Currently I won't let that happen
      because DSACK splitting at a fragment would mess up pcounts due
      to sk_can_gso in tcp_set_skb_tso_segs. Once DSACK fragments get
      avoided, we have some conditions that can be made less strict.
      
      TODO: I will probably have to convert the excessive pointer
      passing to struct sacktag_state... :-)
      
      My testing revealed that a considerable number of skbs couldn't
      be shifted because they were cloned (most likely still awaiting
      tx reclaim)...
      
      [The rest concerns future work instead, since I repeatedly got
      EFAULT from tcpdump's recvfrom when I added pskb_expand_head
      to deal with clones, so I separated that into another, later
      patch]
      
      ...To counter that, I gave up on the fifth advantage:
      
      5) When growing the previous SACK block, fewer allocs for new
         skbs are done; basically a new alloc is needed only when a new
         hole is detected and when the previous skb runs out of frags
         space
      
      ...which now only happens if reclaim is fast enough to dispose of
      the clone before the SACK block comes in (the window is RTT long),
      otherwise we'll have to alloc some.
      
      With clones being handled I got these numbers (will be somewhat
      worse without that), taken with fine-grained mibs:
      
                        TCPSackShifted 398
                         TCPSackMerged 877
                  TCPSackShiftFallback 320
            TCPSACKCOLLAPSEFALLBACKGSO 0
        TCPSACKCOLLAPSEFALLBACKSKBBITS 0
        TCPSACKCOLLAPSEFALLBACKSKBDATA 0
          TCPSACKCOLLAPSEFALLBACKBELOW 0
          TCPSACKCOLLAPSEFALLBACKFIRST 1
       TCPSACKCOLLAPSEFALLBACKPREVBITS 318
            TCPSACKCOLLAPSEFALLBACKMSS 1
         TCPSACKCOLLAPSEFALLBACKNOHEAD 0
          TCPSACKCOLLAPSEFALLBACKSHIFT 0
                TCPSACKCOLLAPSENOOPSEQ 0
        TCPSACKCOLLAPSENOOPSMALLPCOUNT 0
           TCPSACKCOLLAPSENOOPSMALLLEN 0
                   TCPSACKCOLLAPSEHOLE 12
      Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • e1aa680f
  2. 14 Nov, 2008 1 commit
  3. 08 Oct, 2008 1 commit
  4. 01 Oct, 2008 1 commit
    • tcp: Port redirection support for TCP · a3116ac5
      KOVACS Krisztian committed
      Current TCP code relies on the local port of the listening socket
      being the same as the destination port of the incoming
      connection. Port redirection used by many transparent proxying
      techniques obviously breaks this, so we have to store the original
      destination port.
      
      This patch extends struct inet_request_sock and stores the incoming
      destination port value there. It also modifies the handshake code to
      use that value as the source port when sending reply packets.
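A minimal sketch of the idea, under stated assumptions: the struct and field names below (`listener`, `request`, `loc_port`, `rmt_port`) and both functions are hypothetical simplifications for illustration, not the kernel's actual definitions, and ports are kept in host byte order.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, trimmed-down model of the two objects involved. */
struct listener {
    uint16_t bound_port;   /* port the listening socket is bound to */
};

struct request {
    uint16_t rmt_port;     /* client's source port */
    uint16_t loc_port;     /* destination port seen on the incoming SYN */
};

struct reply_ports {
    uint16_t source;
    uint16_t dest;
};

/* Record the SYN's destination port instead of assuming that it equals
 * the listener's bound port. */
void request_init(struct request *req, uint16_t syn_sport, uint16_t syn_dport)
{
    req->rmt_port = syn_sport;
    req->loc_port = syn_dport;
}

/* Build the reply ports from the request, not from the listener. */
struct reply_ports synack_ports(const struct listener *lsk,
                                const struct request *req)
{
    struct reply_ports p = { .source = req->loc_port, .dest = req->rmt_port };

    (void)lsk;             /* bound_port is deliberately not consulted */
    return p;
}
```

With a listener bound to port 50080 and a client SYN redirected from port 80, the reply built this way carries source port 80, so the client sees an answer from the port it actually addressed.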
      Signed-off-by: KOVACS Krisztian <hidden@sch.bme.hu>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 23 Sep, 2008 2 commits
  6. 22 Sep, 2008 1 commit
  7. 21 Sep, 2008 4 commits
  8. 09 Sep, 2008 1 commit
  9. 04 Sep, 2008 1 commit
  10. 19 Jul, 2008 2 commits
  11. 18 Jul, 2008 1 commit
  12. 17 Jul, 2008 8 commits
  13. 15 Jun, 2008 1 commit
  14. 13 Jun, 2008 1 commit
    • tcp: Revert 'process defer accept as established' changes. · ec0a1966
      David S. Miller committed
      This reverts two changesets, ec3c0982
      ("[TCP]: TCP_DEFER_ACCEPT updates - process as established") and
      the follow-on bug fix 9ae27e0a
      ("tcp: Fix slab corruption with ipv6 and tcp6fuzz").
      
      This change causes several problems, first reported by Ingo Molnar
      as a distcc-over-loopback regression where connections were getting
      stuck.
      
      Ilpo Järvinen first spotted the locking problems.  The new function
      added by this code, tcp_defer_accept_check(), only has the
      child socket locked, yet it is modifying state of the parent
      listening socket.
      
      Fixing that is non-trivial at best: we can't simply grab the
      parent listening socket lock at this point, because that would
      create an ABBA deadlock.  The normal ordering is parent listening
      socket --> child socket, but this code path would require the
      reverse lock ordering.
      
      Next is a problem noticed by Vitaliy Gusev, who noted:
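The ordering rule can be illustrated with a toy checker. This is only a sketch under assumptions: the ranks, `struct lock_ctx`, `order_ok` and `acquire` are hypothetical, no real locks or threads are involved, and it merely shows why a path that already holds the child lock cannot safely take the listener lock.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical lock classes, ranked in the documented acquisition order:
 * parent listening socket first, child socket second. */
enum lock_rank { RANK_NONE = 0, RANK_LISTENER = 1, RANK_CHILD = 2 };

struct lock_ctx {
    enum lock_rank held_max;   /* highest-ranked lock currently held */
};

/* True if taking `next` respects the parent -> child order. */
bool order_ok(const struct lock_ctx *ctx, enum lock_rank next)
{
    return next > ctx->held_max;
}

/* Records the acquisition, or refuses it when it would invert the order
 * (the situation that can lead to an ABBA deadlock between two paths). */
bool acquire(struct lock_ctx *ctx, enum lock_rank next)
{
    if (!order_ok(ctx, next))
        return false;
    ctx->held_max = next;
    return true;
}
```

A context that takes the listener and then the child passes the check; one that starts with only the child lock held, as the reverted code path did, is refused when it asks for the listener.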
      
      ----------------------------------------
      >--- a/net/ipv4/tcp_timer.c
      >+++ b/net/ipv4/tcp_timer.c
      >@@ -481,6 +481,11 @@ static void tcp_keepalive_timer (unsigned long data)
      > 		goto death;
      > 	}
      >
      >+	if (tp->defer_tcp_accept.request && sk->sk_state == TCP_ESTABLISHED) {
      >+		tcp_send_active_reset(sk, GFP_ATOMIC);
      >+		goto death;
      
      Here socket sk is not attached to listening socket's request queue. tcp_done()
      will not call inet_csk_destroy_sock() (and tcp_v4_destroy_sock() which should
      release this sk) as socket is not DEAD. Therefore socket sk will be lost for
      freeing.
      ----------------------------------------
      
      Finally, Alexey Kuznetsov argues that there might not even be any
      real value or advantage to these new semantics even if we fix all
      of the bugs:
      
      ----------------------------------------
      Hiding from accept() sockets with only out-of-order data only
      is the only thing which is impossible with old approach. Is this really
      so valuable? My opinion: no, this is nothing but a new loophole
      to consume memory without control.
      ----------------------------------------
      
      So revert this thing for now.
      Signed-off-by: David S. Miller <davem@davemloft.net>
  15. 12 Jun, 2008 4 commits
  16. 11 Jun, 2008 1 commit
  17. 16 Apr, 2008 1 commit
  18. 14 Apr, 2008 5 commits
  19. 10 Apr, 2008 1 commit
    • [Syncookies]: Add support for TCP options via timestamps. · 4dfc2817
      Florian Westphal committed
      Allow the use of SACK and window scaling when syncookies are used
      and the client supports TCP timestamps. Options are encoded into
      the timestamp sent in the SYN-ACK and restored from the timestamp
      echo when the ACK is received.
      
      Based on earlier work by Glenn Griffin.
      This patch avoids increasing the size of structs by encoding TCP
      options into the least significant bits of the timestamp and
      by not using any 'timestamp offset'.
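The encoding described above can be sketched as follows. The bit layout is an assumption made for illustration (low 4 bits for the window scale with 0xf meaning "no window scaling", bit 4 for SACK); `TS_OPT_BITS`, `struct ts_opts`, `ts_encode` and `ts_decode` are hypothetical names, and the sketch does not deal with keeping the resulting timestamp from running ahead of the clock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TS_OPT_BITS  5u                         /* assumed: 4 bits wscale + 1 bit SACK */
#define TS_OPT_MASK  ((1u << TS_OPT_BITS) - 1u)
#define TS_NO_WSCALE 0xfu                       /* assumed sentinel: no window scaling */
#define TS_SACK_BIT  (1u << 4)

struct ts_opts {
    bool wscale_ok;
    uint8_t wscale;    /* 0..14 when wscale_ok is set */
    bool sack_ok;
};

/* Replace the least significant bits of the timestamp with the options. */
uint32_t ts_encode(uint32_t now, struct ts_opts o)
{
    uint32_t bits = o.wscale_ok ? (uint32_t)(o.wscale & 0xfu) : TS_NO_WSCALE;

    if (o.sack_ok)
        bits |= TS_SACK_BIT;
    return (now & ~TS_OPT_MASK) | bits;
}

/* Recover the options from the echoed timestamp. */
struct ts_opts ts_decode(uint32_t tsecr)
{
    uint32_t bits = tsecr & TS_OPT_MASK;
    struct ts_opts o;

    o.wscale_ok = (bits & 0xfu) != TS_NO_WSCALE;
    o.wscale = o.wscale_ok ? (uint8_t)(bits & 0xfu) : 0;
    o.sack_ok = (bits & TS_SACK_BIT) != 0;
    return o;
}
```

Because the options live in the timestamp itself, no per-connection state has to be kept between sending the SYN-ACK and receiving the ACK; the cost is that the low bits no longer carry clock information.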
      
      The downside is that the timestamp sent in the packet after the synack
      will increase by several seconds.
      
      changes since v1:
       don't duplicate timestamp echo decoding function, put it into ipv4/syncookie.c
       and have ipv6/syncookies.c use it.
       Feedback from Glenn Griffin: fix line indented with spaces, kill redundant if ()
      Reviewed-by: Hagen Paul Pfeifer <hagen@jauu.net>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  20. 08 Apr, 2008 1 commit
    • [TCP]: tcp_simple_retransmit can cause S+L · 882bebaa
      Ilpo Järvinen committed
      This fixes Bugzilla #10384
      
      tcp_simple_retransmit does L increment without any checking
      whatsoever for overflowing S+L when Reno is in use.
      
      The simplest scenario I can currently think of is rather
      complex in practice (there might be some more straightforward
      cases though). I.e., if the MSS is reduced during MTU probing,
      it may end up marking everything lost, and if some duplicate
      ACKs arrived prior to that, sacked_out will be non-zero as well,
      leading to S+L > packets_out. tcp_clean_rtx_queue on the next
      cumulative ACK or tcp_fastretrans_alert on the next duplicate
      ACK will fix the S counter.
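The overflow scenario can be reproduced with a few counters. This is a sketch under assumptions: `struct counters` and the three functions are hypothetical, and the clamping shown is just one illustrative way to restore the invariant, not the approach this patch takes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical subset of the per-connection counters discussed above. */
struct counters {
    uint32_t packets_out;  /* segments in flight */
    uint32_t sacked_out;   /* S: with Reno, estimated from duplicate ACKs */
    uint32_t lost_out;     /* L: segments marked lost */
};

/* The invariant that must hold is S + L <= packets_out. */
bool left_out_overflows(const struct counters *c)
{
    return c->sacked_out + c->lost_out > c->packets_out;
}

/* Models an unchecked "mark everything lost" step. */
void mark_all_lost(struct counters *c)
{
    c->lost_out = c->packets_out;
}

/* Illustrative repair: shrink S so that S + L fits into packets_out. */
void clamp_sacked(struct counters *c)
{
    assert(c->lost_out <= c->packets_out);
    uint32_t room = c->packets_out - c->lost_out;

    if (c->sacked_out > room)
        c->sacked_out = room;
}
```

Starting from ten segments in flight and three duplicate ACKs already counted, marking everything lost gives S+L = 13 > 10 until the S counter is corrected.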
      
      A more straightforward (but questionable) solution would be to
      just call tcp_reset_reno_sack() in tcp_simple_retransmit, but
      it would negatively impact the probe's retransmission, i.e.,
      the retransmissions would not occur if some duplicate ACKs
      had arrived.
      
      So I had to add Reno sacked_out resetting to the CA_Loss state
      when the first cumulative ACK arrives (this stale sacked_out
      might actually be the explanation for the reports of left_out
      overflows in kernels prior to 2.6.23 and the S+L overflow reports
      of 2.6.24). However, this alone won't be enough to fix kernels
      before 2.6.24 because it builds on top of the commit
      1b6d427b ([TCP]: Reduce sacked_out with reno when purging
      write_queue) to keep sacked_out from overflowing.
      Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Reported-by: Alessandro Suardi <alessandro.suardi@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>