1. 14 Nov 2009, 1 commit
  2. 04 Nov 2009, 1 commit
  3. 29 Oct 2009, 1 commit
  4. 01 Oct 2009, 1 commit
  5. 15 Sep 2009, 1 commit
  6. 03 Sep 2009, 1 commit
    • tcp: replace hard coded GFP_KERNEL with sk_allocation · aa133076
      Authored by Wu Fengguang
      This fixed a lockdep warning which appeared when doing stress
      memory tests over NFS:
      
      	inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
      
      	page reclaim => nfs_writepage => tcp_sendmsg => lock sk_lock
      
      	mount_root => nfs_root_data => tcp_close => lock sk_lock =>
      			tcp_send_fin => alloc_skb_fclone => page reclaim
      
      David raised a concern that if the allocation fails in tcp_send_fin(), and it's
      GFP_ATOMIC, we are going to yield() (which sleeps) and loop endlessly waiting
      for the allocation to succeed.
      
      But the fact is, the original GFP_KERNEL also sleeps. GFP_ATOMIC+yield() looks
      weird, but it is no worse than the implicit sleep inside GFP_KERNEL. Both could
      loop endlessly under memory pressure.
      
      CC: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
      CC: David S. Miller <davem@davemloft.net>
      CC: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      aa133076
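      A hedged illustration of the pattern this commit describes (a minimal
      sketch; the exact call sites in the patch may differ): instead of
      hard-coding GFP_KERNEL, the allocation mode is taken from the socket,
      so sockets used for memory writeout can run with GFP_ATOMIC.

      	/* sketch: allocate the FIN skb with the socket's allocation mode */
      	struct sk_buff *skb = alloc_skb_fclone(MAX_TCP_HEADER,
      					       sk->sk_allocation);

      	/* a subsystem doing writeback over this socket may then set: */
      	sk->sk_allocation = GFP_ATOMIC;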
  7. 02 Sep 2009, 1 commit
  8. 01 Sep 2009, 2 commits
    • Revert Backoff [v3]: Calculate TCP's connection close threshold as a time value. · 6fa12c85
      Authored by Damian Lukowski
      RFC 1122 specifies two threshold values R1 and R2 for connection timeouts,
      which may represent a number of allowed retransmissions or a timeout value.
      Currently Linux uses sysctl_tcp_retries{1,2} to specify the thresholds
      as a number of allowed retransmissions.
      
      For any desired threshold R2 (expressed as a time) one can specify tcp_retries2
      (expressed as a number of retransmissions) such that TCP will not time out
      earlier than R2. This is the case because the RTO schedule follows a fixed
      pattern, namely exponential backoff.
      
      However, the RTO behaviour is no longer predictable if RTO backoffs can be
      reverted, as is the case in the draft
      "Make TCP more Robust to Long Connectivity Disruptions"
      (http://tools.ietf.org/html/draft-zimmermann-tcp-lcd).
      
      In the worst case TCP would time out a connection after 3.2 seconds, if the
      initial RTO equaled MIN_RTO and each backoff has been reverted.
      
      This patch introduces a function retransmits_timed_out(N),
      which calculates the timeout of a TCP connection, assuming an initial
      RTO of MIN_RTO and N unsuccessful, exponentially backed-off retransmissions.
      
      Whenever timeout decisions are made by comparing the retransmission counter
      to some value N, this function can be used instead.
      
      The meaning of tcp_retries2 will be changed, as many more RTO retransmissions
      can occur than the value indicates. However, it yields a timeout which is
      similar to the one of an unpatched, exponentially backing off TCP in the same
      scenario. As no application could rely on an RTO greater than MIN_RTO, there
      should be no risk of a regression.
      Signed-off-by: Damian Lukowski <damian@tvk.rwth-aachen.de>
      Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6fa12c85
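      A minimal self-contained sketch of the computation described above,
      assuming millisecond constants for illustration (the in-kernel helper
      uses a closed-form expression; this loop form is equivalent):

      	#define TCP_RTO_MIN	200	/* ms, assumed for illustration */
      	#define TCP_RTO_MAX	120000	/* ms, assumed for illustration */

      	/* total time consumed by n retransmissions, starting from an
      	 * initial RTO of TCP_RTO_MIN with exponential backoff, where
      	 * each individual RTO is capped at TCP_RTO_MAX */
      	static unsigned int timeout_after_retransmits(unsigned int n)
      	{
      		unsigned int timeout = 0, rto = TCP_RTO_MIN, i;

      		for (i = 0; i < n; i++) {
      			timeout += rto;
      			rto = (rto < TCP_RTO_MAX / 2) ? rto << 1 : TCP_RTO_MAX;
      		}
      		return timeout;	/* compare elapsed time against this value */
      	}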
    • Revert Backoff [v3]: Revert RTO on ICMP destination unreachable · f1ecd5d9
      Authored by Damian Lukowski
      Here, an ICMP host/network unreachable message whose payload matches
      TCP's SND.UNA is taken as an indication that the RTO retransmission has
      not been lost due to congestion, but because of a route failure
      somewhere along the path.
      Under true congestion, a router won't trigger such a message, and the
      patched TCP will operate as standard TCP.
      
      This patch reverts one RTO backoff if an ICMP host/network unreachable
      message whose payload matches TCP's SND.UNA arrives.
      Based on the new RTO, the retransmission timer is reset to reflect the
      remaining time, or - if the revert clocked out the timer - a retransmission
      is sent out immediately.
      Backoffs are only reverted if TCP is in RTO loss recovery, i.e. if there
      have already been retransmissions and reversible backoffs.
      
      Changes from v2:
      1) Renaming of skb in tcp_v4_err() moved to another patch.
      2) Reintroduced tcp_bound_rto() and __tcp_set_rto().
      3) Fixed code comments.
      Signed-off-by: Damian Lukowski <damian@tvk.rwth-aachen.de>
      Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f1ecd5d9
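      A hedged sketch of the revert step (helper names follow the ones
      mentioned in the changelog; validity checks and declarations are
      elided, so this is an illustration rather than the exact patch):

      	/* undo one backoff and recompute the RTO */
      	icsk->icsk_backoff--;
      	icsk->icsk_rto = tcp_bound_rto(__tcp_set_rto(tp) << icsk->icsk_backoff);

      	/* rearm the retransmission timer with the remaining time,
      	 * or retransmit immediately if it has already clocked out */
      	remaining = icsk->icsk_rto - (tcp_time_stamp - tp->retrans_stamp);
      	if ((s32)remaining > 0)
      		inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
      					  remaining, TCP_RTO_MAX);
      	else
      		tcp_retransmit_timer(sk);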
  9. 29 Aug 2009, 1 commit
  10. 20 Jul 2009, 1 commit
  11. 08 May 2009, 2 commits
  12. 05 May 2009, 1 commit
  13. 20 Apr 2009, 1 commit
  14. 03 Apr 2009, 1 commit
  15. 16 Mar 2009, 2 commits
    • tcp: simplify tcp_current_mss · 0c54b85f
      Authored by Ilpo Järvinen
      There's very little need for most of the call sites to get
      tp->xmit_size_goal updated. That costs us a divide as is,
      so slice the function in two. Also, the only users of
      tp->xmit_size_goal sit directly behind tcp_current_mss(),
      so there's no need to store that variable in tcp_sock
      at all! Dropping xmit_size_goal currently leaves a 16-bit
      hole, and some reorganization would again be necessary to
      change that (but I'm aiming to fill that hole with a u16
      xmit_size_goal_segs to cache the results of the remaining
      divide and recover the lost TSO benefit).

      Bring the xmit_size_goal parts into tcp.c.
      Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Cc: Evgeniy Polyakov <zbr@ioremap.net>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0c54b85f
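      A hedged sketch of the resulting split (simplified signatures; the
      actual patch keeps the size-goal helper private to the send path):

      	/* cheap: returns the current MSS only, no divide */
      	unsigned int tcp_current_mss(struct sock *sk);

      	/* pays the divide; called only where a size goal is needed */
      	static unsigned int tcp_xmit_size_goal(struct sock *sk, u32 mss_now,
      					       int large_allowed);

      	/* send path (illustrative): */
      	mss_now = tcp_current_mss(sk);
      	size_goal = tcp_xmit_size_goal(sk, mss_now, !(flags & MSG_OOB));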
    • tcp: consolidate paws check · c887e6d2
      Authored by Ilpo Järvinen
      Wow, it was quite tricky to merge that stream of negations
      but I think I finally got it right:
      
      check & replace_ts_recent:
      (s32)(rcv_tsval - ts_recent) >= 0                  => 0
      (s32)(ts_recent - rcv_tsval) <= 0                  => 0
      
      discard:
      (s32)(ts_recent - rcv_tsval)  > TCP_PAWS_WINDOW    => 1
      (s32)(ts_recent - rcv_tsval) <= TCP_PAWS_WINDOW    => 0
      
      I toggled the return values of tcp_paws_check around, since the old
      encoding added yet another negation, making the tracking of truth
      values really complicated.
      Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c887e6d2
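      The consolidated predicate, restated as a hedged sketch with a
      simplified signature (the kernel version takes the parsed options
      struct; a return value of 1 means the segment passes PAWS):

      	static inline int tcp_paws_check(u32 ts_recent, u32 rcv_tsval,
      					 s32 paws_win)
      	{
      		/* fresh or equal timestamp: acceptable, and the caller
      		 * may also replace ts_recent */
      		if ((s32)(rcv_tsval - ts_recent) >= 0)
      			return 1;
      		/* older, but still within TCP_PAWS_WINDOW: acceptable */
      		if ((s32)(ts_recent - rcv_tsval) <= paws_win)
      			return 1;
      		/* stale beyond the window: discard */
      		return 0;
      	}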
  16. 03 Mar 2009, 1 commit
  17. 02 Mar 2009, 2 commits
  18. 16 Dec 2008, 1 commit
  19. 26 Nov 2008, 3 commits
  20. 25 Nov 2008, 2 commits
    • tcp: Try to restore large SKBs while SACK processing · 832d11c5
      Authored by Ilpo Järvinen
      During SACK processing, most of the benefits of TSO are eaten by
      the SACK blocks, which fragment SKBs one by one into MSS-sized chunks.
      Then we're in trouble when the cleanup work for them has to be done
      as a large cumulative ACK arrives. Try to return to the pre-split
      state already while more and more SACK info is being discovered, by
      combining newly discovered SACK areas with the previous skb if
      that one is SACKed as well.
      
      This approach has a number of benefits:
      
      1) The processing overhead is spread more equally over the RTT.
      2) The write queue has fewer skbs to process (this affects everything
         that has to walk the queue past the SACKed areas).
      3) The write queue is consistent the whole time, so no other part
         of TCP has to be aware of this (this was not the case with
         another approach that was, well, quite intrusive all around).
      4) clean_rtx_queue can release most of the pages with a single
         put_page() instead of the previous PAGE_SIZE/mss+1 calls.
      
      In case a hole is fully filled by the new SACK block, we attempt
      to combine the next skb too, which allows construction of skbs
      that are even larger than what TSO split them to, and it handles
      the "hole on every nth segment" patterns that often occur during
      slow-start overshoot pretty nicely. Though for this to be really
      useful, a retransmission would also have to get lost, since
      cumulative ACKs advance one hole at a time in the most typical case.

      TODO: handle upwards-only merging. That should be rather easy
      when the segment is fully SACKed, but I'm leaving it as a future
      work item (it won't make a very large difference anyway, since
      the current approach already covers quite a lot of the normal
      cases).
      
      I was earlier thinking of some sophisticated way of tracking the
      timestamps of the first and the last segment, but later on
      realized that it won't be necessary at all to store the
      timestamp of the last segment. The cases that can occur are
      basically either:
        1) ambiguous => no sensible measurement can be taken anyway
        2) non-ambiguous due to reordering => having the timestamp
           of the last segment there just skews things further off
           than it does good, since the ACK got triggered by one of
           the holes (besides some subtle issues that would make
           determining the right hole/skb an even harder problem).
           Anyway, it has nothing to do with this change then.
      
      I chose to route some abnormal-looking cases through goto noop;
      some could be handled differently (e.g., by stopping the walk
      at that skb). In general, they either shouldn't happen at all
      or are rare enough to make no difference in practice.
      
      In theory this change (as a whole) could cause some macro-scale
      (global) regression because of cache misses taken across the
      round-trip time, but it very likely gets better because of far fewer
      (local) cache misses for the other write-queue walkers and for the
      big recovery-clearing cumulative ACK.
      
      Worth noting that these benefits would be very easy to get even
      without TSO/GSO being on, as long as the data is in pages so that
      we can merge them. Currently I won't let that happen, because
      DSACK splitting at a fragment would mess up pcounts due to
      sk_can_gso in tcp_set_skb_tso_segs. Once DSACK fragmentation is
      avoided, some conditions can be made less strict.
      
      TODO: I will probably have to convert the excessive pointer
      passing to struct sacktag_state... :-)
      
      My testing revealed that a considerable amount of skbs couldn't
      be shifted because they were cloned (most likely still awaiting
      tx reclaim)...
      
      [The rest concerns future work instead, since I got repeated
      EFAULTs from tcpdump's recvfrom when I added pskb_expand_head
      to deal with clones, so I separated that into another, later
      patch.]
      
      ...To counter that, I gave up on the fifth advantage:

      5) When growing the previous SACK block, fewer allocs for new skbs
         are done; basically a new alloc is needed only when a new hole
         is detected and when the previous skb runs out of frags space

      ...which now only happens if reclaim is fast enough to dispose of
      the clone before the SACK block comes in (the window is one RTT long);
      otherwise we'll have to alloc some.
      
      With clones being handled I got these numbers (they will be somewhat
      worse without that), taken with fine-grained MIBs:
      
                        TCPSackShifted 398
                         TCPSackMerged 877
                  TCPSackShiftFallback 320
            TCPSACKCOLLAPSEFALLBACKGSO 0
        TCPSACKCOLLAPSEFALLBACKSKBBITS 0
        TCPSACKCOLLAPSEFALLBACKSKBDATA 0
          TCPSACKCOLLAPSEFALLBACKBELOW 0
          TCPSACKCOLLAPSEFALLBACKFIRST 1
       TCPSACKCOLLAPSEFALLBACKPREVBITS 318
            TCPSACKCOLLAPSEFALLBACKMSS 1
         TCPSACKCOLLAPSEFALLBACKNOHEAD 0
          TCPSACKCOLLAPSEFALLBACKSHIFT 0
                TCPSACKCOLLAPSENOOPSEQ 0
        TCPSACKCOLLAPSENOOPSMALLPCOUNT 0
           TCPSACKCOLLAPSENOOPSMALLLEN 0
                   TCPSACKCOLLAPSEHOLE 12
      Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      832d11c5
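      A heavily simplified sketch of the shift/merge step described above
      (skb_shift() was introduced alongside this work; all bookkeeping,
      pcount updates, and fallback paths are elided):

      	/* if the previous skb is SACKed too, move this skb's data into
      	 * it and drop the emptied skb, restoring a large pre-split skb */
      	prev = tcp_write_queue_prev(sk, skb);
      	if ((TCP_SKB_CB(prev)->sacked & TCPCB_SACKED_ACKED) &&
      	    skb_shift(prev, skb, skb->len)) {
      		TCP_SKB_CB(prev)->end_seq = TCP_SKB_CB(skb)->end_seq;
      		tcp_unlink_write_queue(skb, sk);
      		sk_wmem_free_skb(sk, skb);
      	}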
    • e1aa680f
  21. 14 Nov 2008, 1 commit
  22. 08 Oct 2008, 1 commit
  23. 01 Oct 2008, 1 commit
    • tcp: Port redirection support for TCP · a3116ac5
      Authored by KOVACS Krisztian
      Current TCP code relies on the local port of the listening socket
      being the same as the destination port of the incoming
      connection. Port redirection, used by many transparent proxying
      techniques, obviously breaks this, so we have to store the original
      destination port.
      
      This patch extends struct inet_request_sock and stores the incoming
      destination port value there. It also modifies the handshake code to
      use that value as the source port when sending reply packets.
      Signed-off-by: KOVACS Krisztian <hidden@sch.bme.hu>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a3116ac5
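      A hedged sketch of the extension (the field name follows the mainline
      change as I understand it; the surrounding code is illustrative only):

      	struct inet_request_sock {
      		/* ... existing fields ... */
      		__be16	loc_port;	/* original destination port of the SYN */
      	};

      	/* on SYN reception (illustrative): */
      	inet_rsk(req)->loc_port = tcp_hdr(skb)->dest;

      	/* the reply path then uses it as the source port, e.g.
      	 * th->source = inet_rsk(req)->loc_port; */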
  24. 23 Sep 2008, 2 commits
  25. 22 Sep 2008, 1 commit
  26. 21 Sep 2008, 4 commits
  27. 09 Sep 2008, 1 commit
  28. 04 Sep 2008, 1 commit
  29. 19 Jul 2008, 1 commit
    • tcp: options clean up · 33ad798c
      Authored by Adam Langley
      This should fix the following bugs:
        * Connections with MD5 signatures produce invalid packets whenever SACK
          options are included
        * MD5 signatures are counted twice in the MSS calculations
      
      Behaviour changes:
        * A SYN with MD5 + SACK + TS elicits a SYNACK with MD5 + SACK
      
          This is because we can't fit any SACK blocks in a packet with MD5 + TS
          options. There was discussion about disabling SACK rather than TS in
          order to fit in better with old, buggy kernels, but that was deemed to
          be unnecessary.
      
        * SYNs with MD5 don't include a TS option
      
          See above.
      
      Additionally, it removes a bunch of duplicated logic for calculating
      options, which should help avoid this sort of issue in the future.
      Signed-off-by: Adam Langley <agl@imperialviolet.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      33ad798c
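      As a rough sketch of the consolidation (names are illustrative, not
      necessarily the exact ones in the patch): the options to be sent are
      collected into one structure, then sized and written in a single
      place per packet type instead of duplicating the logic.

      	struct tcp_out_options {
      		u8	options;	/* bitmask: MD5, TS, SACK, WS, MSS */
      		u8	ws;		/* window scale value */
      		u16	mss;
      		u32	tsval, tsecr;	/* timestamp option values */
      	};

      	/* decide which options fit into the SYN, honouring the
      	 * MD5/TS/SACK interactions listed above */
      	static unsigned int tcp_syn_options(struct sock *sk, struct sk_buff *skb,
      					    struct tcp_out_options *opts);

      	/* a single writer emits the chosen options into the TCP header */
      	static void tcp_options_write(__be32 *ptr, struct tcp_sock *tp,
      				      struct tcp_out_options *opts);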