1. 18 Oct, 2010 1 commit
  2. 29 Sep, 2010 1 commit
  3. 31 Aug, 2010 1 commit
    • J
      tcp: Add TCP_USER_TIMEOUT socket option. · dca43c75
      Committed by Jerry Chu
      This patch provides "user timeout" support as described in RFC 793. The
      socket option is also needed for the local half of the RFC 5482 "TCP
      User Timeout Option".
      
      TCP_USER_TIMEOUT is a TCP-level socket option that takes an unsigned
      int. When > 0, it specifies the maximum amount of time in ms that
      transmitted data may remain unacknowledged before TCP will forcefully
      close the corresponding connection and return ETIMEDOUT to the
      application. If 0 is given, TCP will continue to use the system default.
      
      Increasing the user timeout allows a TCP connection to survive extended
      periods without end-to-end connectivity. Decreasing it allows
      applications to "fail fast" if so desired; otherwise it may take up to
      20 minutes with the current system defaults in a normal WAN
      environment.
      
      The socket option can be set in any state of a TCP connection, but is
      only effective in the synchronized states of a connection
      (ESTABLISHED, FIN-WAIT-1, FIN-WAIT-2, CLOSE-WAIT, CLOSING, or LAST-ACK).
      Moreover, when used with the TCP keepalive (SO_KEEPALIVE) option,
      TCP_USER_TIMEOUT takes precedence over keepalive in determining when to
      close a connection due to keepalive failure.
      
      The option does not in any way change when TCP retransmits a packet or
      when a keepalive probe is sent.
      
      This option, like many others, will be inherited by an acceptor from its
      listener.
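      As an illustration, a minimal userspace sketch of setting the option
      (assuming a libc whose <netinet/tcp.h> defines TCP_USER_TIMEOUT;
      sock_fd is a hypothetical connected TCP socket, error handling
      omitted):
      
        #include <netinet/in.h>
        #include <netinet/tcp.h>
        #include <sys/socket.h>
      
        /* Close the connection and report ETIMEDOUT if transmitted data
         * stays unacknowledged for more than 30 seconds; 0 would restore
         * the system default. */
        unsigned int timeout_ms = 30000;
        setsockopt(sock_fd, IPPROTO_TCP, TCP_USER_TIMEOUT,
                   &timeout_ms, sizeof(timeout_ms));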
      Signed-off-by: H.K. Jerry Chu <hkchu@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      dca43c75
  4. 25 Aug, 2010 1 commit
  5. 13 Jul, 2010 1 commit
  6. 28 Apr, 2010 1 commit
  7. 13 Apr, 2010 1 commit
    • E
      net: sk_dst_cache RCUification · b6c6712a
      Committed by Eric Dumazet
      With the latest CONFIG_PROVE_RCU infrastructure, I felt comfortable
      enough to make this work.
      
      sk->sk_dst_cache is currently protected by an rwlock (sk_dst_lock).
      
      This rwlock is read-locked for a very short time, and dst entries are
      already freed after an RCU grace period. This calls for RCU again :)
      
      This patch converts sk_dst_lock to a spinlock and uses RCU for readers.
      
      __sk_dst_get() is supposed to be called with rcu_read_lock() held or
      with the socket locked by the user, so we use the appropriate
      rcu_dereference_check() condition:
      (rcu_read_lock_held() || sock_owned_by_user(sk))
      
      This patch avoids two atomic ops per tx packet on connected UDP
      sockets, for example, and allows sk_dst_lock to be dirtied much less
      often.
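      As a sketch, the reader-side helper then looks roughly like this (a
      hedged reconstruction from the condition quoted above, not necessarily
      the exact kernel source):
      
        /* Valid under rcu_read_lock() or with the socket lock held,
         * which is exactly the rcu_dereference_check() condition. */
        static inline struct dst_entry *__sk_dst_get(struct sock *sk)
        {
                return rcu_dereference_check(sk->sk_dst_cache,
                                             rcu_read_lock_held() ||
                                             sock_owned_by_user(sk));
        }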
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b6c6712a
  8. 30 Mar, 2010 1 commit
    • T
      include cleanup: Update gfp.h and slab.h includes to prepare for breaking... · 5a0e3ad6
      Committed by Tejun Heo
      include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
      
      percpu.h is included by sched.h and module.h and thus ends up being
      included when building most .c files.  percpu.h includes slab.h, which
      in turn includes gfp.h, making everything defined by the two files
      universally available and complicating inclusion dependencies.
      
      The percpu.h -> slab.h dependency is about to be removed.  Prepare for
      this change by updating users of gfp and slab facilities to include
      those headers directly instead of assuming availability.  As this
      conversion needs to touch a large number of source files, the
      following script was used as the basis of the conversion.
      
        http://userweb.kernel.org/~tj/misc/slabh-sweep.py
      
      The script does the following.
      
      * Scan files for gfp and slab usages and update includes such that
        only the necessary includes are there, i.e. gfp.h if only gfp is
        used, slab.h if slab is used (a hypothetical before/after example
        follows this list).
      
      * When the script inserts a new include, it looks at the include
        blocks and tries to put the new include such that its order conforms
        to its surroundings.  It's put in the include block which contains
        core kernel includes, in the same order that the rest are ordered -
        alphabetical, Christmas tree, rev-Xmas-tree, or at the end if there
        doesn't seem to be any matching order.
      
      * If the script can't find a place to put a new include (mostly
        because the file doesn't have a fitting include block), it prints
        out an error message indicating which .h file needs to be added to
        the file.
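      For illustration, a typical conversion on a file that only used
      kmalloc()/kfree() might look like this (hypothetical example):
      
        /* Before: slab.h was only available implicitly, via
         * sched.h -> percpu.h -> slab.h. */
        #include <linux/sched.h>
      
        /* After: the dependency is stated explicitly. */
        #include <linux/sched.h>
        #include <linux/slab.h>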
      
      The conversion was done in the following steps.
      
      1. The initial automatic conversion of all .c files updated slightly
         over 4000 files, deleting around 700 includes and adding ~480 gfp.h
         and ~3000 slab.h inclusions.  The script emitted errors for ~400
         files.
      
      2. Each error was manually checked.  Some didn't need the inclusion,
         some needed manual addition, and for others adding it to an
         implementation .h or embedding .c file was more appropriate.  This
         step added inclusions to around 150 files.
      
      3. The script was run again and the output was compared to the edits
         from #2 to make sure no file was left behind.
      
      4. Several build tests were done and a couple of problems were fixed,
         e.g. lib/decompress_*.c used malloc/free() wrappers around slab
         APIs, requiring slab.h to be added manually.
      
      5. The script was run on all .h files but without automatically
         editing them, as sprinkling gfp.h and slab.h inclusions around .h
         files could easily lead to inclusion dependency hell.  Most gfp.h
         inclusion directives were ignored, as stuff from gfp.h was usually
         widely available and often used in preprocessor macros.  Each
         slab.h inclusion directive was examined and added manually as
         necessary.
      
      6. percpu.h was updated not to include slab.h.
      
      7. Build tests were done on the following configurations and failures
         were fixed.  CONFIG_GCOV_KERNEL was turned off for all tests (as my
         distributed build env didn't work with gcov compiles) and a few
         more options had to be turned off depending on the arch to make
         things build (like ipr on powerpc/64, which failed due to a
         missing writeq).
      
         * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
         * powerpc and powerpc64 SMP allmodconfig
         * sparc and sparc64 SMP allmodconfig
         * ia64 SMP allmodconfig
         * s390 SMP allmodconfig
         * alpha SMP allmodconfig
         * um on x86_64 SMP allmodconfig
      
      8. percpu.h modifications were reverted so that it could be applied as
         a separate patch and serve as bisection point.
      
      Given that I had only a couple of failures from the build tests in
      step 7, I'm fairly confident about the coverage of this conversion
      patch.  If there is a breakage, it's likely to be something in one of
      the arch headers, which should be easily discoverable on most builds
      of the specific arch.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      5a0e3ad6
  9. 19 Feb, 2010 1 commit
  10. 09 Feb, 2010 1 commit
  11. 18 Jan, 2010 1 commit
    • O
      tcp: account SYN-ACK timeouts & retransmissions · 72659ecc
      Committed by Octavian Purdila
      Currently we don't increment SYN-ACK timeouts & retransmissions
      although we do increment the same stats for SYN. We seem to have lost
      the SYN-ACK accounting with the introduction of tcp_syn_recv_timer
      (commit 2248761e in the netdev-vger-cvs tree).
      
      This patch fixes this issue. In the process we also rename the v4/v6
      SYN/ACK retransmit functions for clarity. We also add a new
      request_sock operation (syn_ack_timeout) so we can keep the code in
      inet_connection_sock.c protocol-agnostic.
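      A sketch of the new operation (a hedged simplification; the real hook
      hangs off the per-protocol request_sock operations):
      
        /* Called from the protocol-agnostic SYN-ACK retransmit path in
         * inet_connection_sock.c, so SYN-ACK timeouts are accounted just
         * like SYN timeouts. */
        void tcp_syn_ack_timeout(struct sock *sk, struct request_sock *req)
        {
                NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPTIMEOUTS);
        }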
      Signed-off-by: Octavian Purdila <opurdila@ixiacom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      72659ecc
  12. 09 Dec, 2009 1 commit
  13. 21 Oct, 2009 1 commit
  14. 19 Oct, 2009 1 commit
    • E
      inet: rename some inet_sock fields · c720c7e8
      Committed by Eric Dumazet
      In order to have better cache layouts of struct sock (separate zones
      for rx/tx paths), we need this preliminary patch.
      
      The goal is to transfer the fields used at lookup time into the first
      read-mostly cache line (inside struct sock_common) and to move
      sk_refcnt to a separate cache line (only written by the rx path).
      
      This patch adds an inet_ prefix to the daddr, rcv_saddr, dport, num,
      saddr, sport and id fields. This allows a future patch to define these
      fields as macros, like sk_refcnt, without name clashes.
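      At a use site, the rename looks like this (illustrative fragment; sk
      is a hypothetical struct sock pointer):
      
        struct inet_sock *inet = inet_sk(sk);
        __be32 dst;
      
        dst = inet->daddr;      /* before this patch */
        dst = inet->inet_daddr; /* after this patch  */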
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c720c7e8
  15. 01 Sep, 2009 2 commits
    • D
      Revert Backoff [v3]: Calculate TCP's connection close threshold as a time value. · 6fa12c85
      Committed by Damian Lukowski
      RFC 1122 specifies two threshold values R1 and R2 for connection
      timeouts, which may represent a number of allowed retransmissions or a
      timeout value. Currently Linux uses sysctl_tcp_retries{1,2} to specify
      the thresholds as a number of allowed retransmissions.
      
      For any desired time threshold R2 one can specify a tcp_retries2 value
      (a number of retransmissions) such that TCP will not time out earlier
      than R2. This works because the RTO schedule follows a fixed pattern,
      namely exponential backoff.
      
      However, the RTO behaviour is no longer predictable if RTO backoffs
      can be reverted, as is the case in the draft
      "Make TCP more Robust to Long Connectivity Disruptions"
      (http://tools.ietf.org/html/draft-zimmermann-tcp-lcd).
      
      In the worst case TCP would time out a connection after 3.2 seconds,
      if the initial RTO equaled MIN_RTO and each backoff had been reverted.
      
      This patch introduces a function retransmits_timed_out(N), which
      calculates the timeout of a TCP connection, assuming an initial RTO of
      MIN_RTO and N unsuccessful, exponentially backed-off retransmissions.
      
      Whenever timeout decisions are made by comparing the retransmission
      counter to some value N, this function can be used instead.
      
      The meaning of tcp_retries2 will be changed, as many more RTO retransmissions
      can occur than the value indicates. However, it yields a timeout which is
      similar to the one of an unpatched, exponentially backing off TCP in the same
      scenario. As no application could rely on an RTO greater than MIN_RTO, there
      should be no risk of a regression.
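      As a sketch, ignoring the RTO_MAX cap that the real code has to apply,
      N exponentially backed-off retransmissions starting from MIN_RTO span
      MIN_RTO * (2^N - 1), so the counter-based check becomes a time-based
      one (a hedged simplification, not the exact kernel source):
      
        /* Time spanned by N unsuccessful retransmissions with exponential
         * backoff and an initial RTO of MIN_RTO (the TCP_RTO_MAX cap is
         * omitted for brevity). */
        static unsigned int timeout_of(unsigned int n)
        {
                return TCP_RTO_MIN * ((1U << n) - 1);
        }
      
        /* Time out only when that much time has really elapsed since the
         * first retransmission, however often the backoff was reverted. */
        static bool retransmits_timed_out(struct sock *sk, unsigned int n)
        {
                return tcp_time_stamp - tcp_sk(sk)->retrans_stamp >=
                       timeout_of(n);
        }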
      Signed-off-by: Damian Lukowski <damian@tvk.rwth-aachen.de>
      Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6fa12c85
    • D
      Revert Backoff [v3]: Revert RTO on ICMP destination unreachable · f1ecd5d9
      Committed by Damian Lukowski
      Here, an ICMP host/network unreachable message whose payload fits
      TCP's SND.UNA is taken as an indication that the RTO retransmission
      has not been lost due to congestion, but because of a route failure
      somewhere along the path.
      With true congestion, a router won't trigger such a message and the
      patched TCP will operate as standard TCP.
      
      This patch reverts one RTO backoff if an ICMP host/network unreachable
      message whose payload fits TCP's SND.UNA arrives.
      Based on the new RTO, the retransmission timer is reset to reflect the
      remaining time, or, if the revert clocked out the timer, a
      retransmission is sent out immediately.
      Backoffs are only reverted if TCP is in RTO loss recovery, i.e. if
      there have already been retransmissions and reversible backoffs.
      
      Changes from v2:
      1) Renaming of skb in tcp_v4_err() moved to another patch.
      2) Reintroduced tcp_bound_rto() and __tcp_set_rto().
      3) Fixed code comments.
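      A sketch of the revert itself (a hedged simplification of the logic
      described above; the real code in tcp_v4_err() carries extra guards,
      and last_xmit_stamp is a stand-in for the timestamp of the
      retransmitted segment, which the real code reads off the skb):
      
        struct inet_connection_sock *icsk = inet_csk(sk);
        u32 remaining;
      
        /* Undo one backoff step and recompute the RTO from it. */
        icsk->icsk_backoff--;
        icsk->icsk_rto = __tcp_set_rto(tp) << icsk->icsk_backoff;
      
        /* Re-arm the retransmission timer with the remaining time, or
         * retransmit immediately if the revert clocked out the timer. */
        remaining = icsk->icsk_rto - min(icsk->icsk_rto,
                                         tcp_time_stamp - last_xmit_stamp);
        if (remaining)
                inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
                                          remaining, TCP_RTO_MAX);
        else
                tcp_retransmit_skb(sk, tcp_write_queue_head(sk));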
      Signed-off-by: Damian Lukowski <damian@tvk.rwth-aachen.de>
      Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f1ecd5d9
  16. 29 Aug, 2009 1 commit
  17. 02 Mar, 2009 1 commit
  18. 19 Dec, 2008 1 commit
  19. 26 Nov, 2008 1 commit
  20. 03 Nov, 2008 1 commit
  21. 31 Oct, 2008 1 commit
  22. 30 Oct, 2008 1 commit
  23. 29 Oct, 2008 1 commit
  24. 08 Oct, 2008 1 commit
  25. 26 Jul, 2008 1 commit
  26. 17 Jul, 2008 1 commit
  27. 03 Jul, 2008 1 commit
    • P
      tcp: de-bloat a bit with factoring NET_INC_STATS_BH out · 40b215e5
      Committed by Pavel Emelyanov
      There are some places in TCP that select one MIB index to
      bump snmp statistics like this:
      
      	if (<something>)
      		NET_INC_STATS_BH(<some_id>);
      	else if (<something_else>)
      		NET_INC_STATS_BH(<some_other_id>);
      	...
      	else
      		NET_INC_STATS_BH(<default_id>);
      
      or in a trickier but still similar way.
      
      On the other hand, this NET_INC_STATS_BH is a camouflaged increment of
      a percpu variable, which is not that small.
      
      Factoring those cases out de-bloats 235 bytes on a non-preemptible
      i386 config and brings parts of the code back within 80 columns.
      
      add/remove: 0/0 grow/shrink: 0/7 up/down: 0/-235 (-235)
      function                                     old     new   delta
      tcp_fastretrans_alert                       1437    1424     -13
      tcp_dsack_set                                137     124     -13
      tcp_xmit_retransmit_queue                    690     676     -14
      tcp_try_undo_recovery                        283     265     -18
      tcp_sacktag_write_queue                     1550    1515     -35
      tcp_update_reordering                        162     106     -56
      tcp_retransmit_timer                         990     904     -86
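      The factored form selects the MIB index first and bumps the counter
      once, roughly (a sketch in the same placeholder style as the snippet
      above):
      
      	int mib_idx;
      
      	if (<something>)
      		mib_idx = <some_id>;
      	else if (<something_else>)
      		mib_idx = <some_other_id>;
      	else
      		mib_idx = <default_id>;
      
      	NET_INC_STATS_BH(mib_idx);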
      Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      40b215e5
  28. 13 Jun, 2008 1 commit
    • D
      tcp: Revert 'process defer accept as established' changes. · ec0a1966
      Committed by David S. Miller
      This reverts two changesets, ec3c0982
      ("[TCP]: TCP_DEFER_ACCEPT updates - process as established") and
      the follow-on bug fix 9ae27e0a
      ("tcp: Fix slab corruption with ipv6 and tcp6fuzz").
      
      This change causes several problems, first reported by Ingo Molnar
      as a distcc-over-loopback regression where connections were getting
      stuck.
      
      Ilpo Järvinen first spotted the locking problems.  The new function
      added by this code, tcp_defer_accept_check(), only has the
      child socket locked, yet it is modifying state of the parent
      listening socket.
      
      Fixing that is non-trivial at best, because we can't simply just grab
      the parent listening socket lock at this point, because it would
      create an ABBA deadlock.  The normal ordering is parent listening
      socket --> child socket, but this code path would require the
      reverse lock ordering.
      
      Next is a problem noticed by Vitaliy Gusev; he noted:
      
      ----------------------------------------
      >--- a/net/ipv4/tcp_timer.c
      >+++ b/net/ipv4/tcp_timer.c
      >@@ -481,6 +481,11 @@ static void tcp_keepalive_timer (unsigned long data)
      > 		goto death;
      > 	}
      >
      >+	if (tp->defer_tcp_accept.request && sk->sk_state == TCP_ESTABLISHED) {
      >+		tcp_send_active_reset(sk, GFP_ATOMIC);
      >+		goto death;
      
      Here socket sk is not attached to listening socket's request queue. tcp_done()
      will not call inet_csk_destroy_sock() (and tcp_v4_destroy_sock() which should
      release this sk) as socket is not DEAD. Therefore socket sk will be lost for
      freeing.
      ----------------------------------------
      
      Finally, Alexey Kuznetsov argues that there might not even be any
      real value or advantage to these new semantics even if we fix all
      of the bugs:
      
      ----------------------------------------
      Hiding from accept() sockets with only out-of-order data only
      is the only thing which is impossible with old approach. Is this really
      so valuable? My opinion: no, this is nothing but a new loophole
      to consume memory without control.
      ----------------------------------------
      
      So revert this thing for now.
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ec0a1966
  29. 12 Jun, 2008 1 commit
  30. 14 Apr, 2008 2 commits
  31. 22 Mar, 2008 1 commit
    • P
      [TCP]: TCP_DEFER_ACCEPT updates - process as established · ec3c0982
      Committed by Patrick McManus
      Change the TCP_DEFER_ACCEPT implementation so that it transitions a
      connection to ESTABLISHED after the handshake is complete instead of
      leaving it in SYN-RECV until some data arrives. Place the connection
      in the accept queue when the first data packet arrives from the slow
      path.
      
      Benefits:
       - an established connection is now reset if it never makes it
         to the accept queue
      
       - the diagnostic state ESTABLISHED matches the packet traces
         showing a completed handshake
      
       - TCP_DEFER_ACCEPT timeouts are expressed in seconds and can now be
         enforced with reasonable accuracy instead of rounding up to the
         next exponential backoff of a SYN-ACK retry (see the sketch below)
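      For reference, a minimal userspace sketch of the option (assuming
      <netinet/tcp.h> defines TCP_DEFER_ACCEPT and listen_fd is a
      hypothetical listening socket, error handling omitted):
      
        #include <netinet/in.h>
        #include <netinet/tcp.h>
        #include <sys/socket.h>
      
        /* Defer waking accept() until data arrives, for up to 5 seconds. */
        int defer_secs = 5;
        setsockopt(listen_fd, IPPROTO_TCP, TCP_DEFER_ACCEPT,
                   &defer_secs, sizeof(defer_secs));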
      Signed-off-by: Patrick McManus <mcmanus@ducksong.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ec3c0982
  32. 29 Jan, 2008 5 commits
  33. 11 Oct, 2007 1 commit
    • I
      [TCP]: Move sack_ok access to obviously named funcs & cleanup · e60402d0
      Committed by Ilpo Järvinen
      Previously the code had IsReno/IsFack defined as macros local to
      tcp_input.c, though the sack_ok field has users elsewhere too for the
      same purpose. This changes them to static inlines, as preferred by the
      current coding style, and unifies access to sack_ok across multiple
      files. The magic bitops of sack_ok for FACK and DSACK are also
      abstracted into functions with appropriate names.
      
      Note:
      - One sack_ok = 1 remains, but that's self-explanatory, i.e., it
        enables SACK
      - A couple of !IsReno cases are changed to tcp_is_sack
      - There were no users of IsDSack => I dropped it
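      A sketch of the resulting accessors (a hedged reconstruction from the
      names mentioned above, not necessarily the exact kernel source):
      
        /* Obviously-named accessors replacing the IsReno/IsFack macros. */
        static inline int tcp_is_sack(const struct tcp_sock *tp)
        {
                return tp->rx_opt.sack_ok;
        }
      
        static inline int tcp_is_reno(const struct tcp_sock *tp)
        {
                return !tcp_is_sack(tp);
        }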
      Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e60402d0
  34. 08 Jun, 2007 1 commit