1. 29 Jan, 2015 2 commits
    • tcp: fix the timid additive increase on stretch ACKs · 814d488c
      Neal Cardwell committed
      tcp_cong_avoid_ai() was too timid (snd_cwnd increased too slowly) on
      "stretch ACKs" -- cases where the receiver ACKed more than 1 packet in
      a single ACK. For example, suppose w is 10 and we get a stretch ACK
      for 20 packets, so acked is 20. We ought to increase snd_cwnd by 2
      (since acked/w = 20/10 = 2), but instead we were only increasing cwnd
      by 1. This patch fixes that behavior.
      Reported-by: Eyal Perry <eyalpe@mellanox.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
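
The w = 10, acked = 20 example above can be reproduced with a small userspace model. This is only a sketch of the behavior the message describes, not the kernel's code: `struct cc_state`, `ai_timid()` and `ai_stretch_aware()` are hypothetical names, and the clamp on the maximum window is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified congestion state (not the kernel's tcp_sock). */
struct cc_state {
    uint32_t cwnd;      /* congestion window, in packets */
    uint32_t cwnd_cnt;  /* ACKed packets not yet converted into cwnd growth */
};

/* Timid variant: grows cwnd by at most 1 per call and drops leftover credit. */
static void ai_timid(struct cc_state *s, uint32_t w, uint32_t acked)
{
    assert(w > 0);
    s->cwnd_cnt += acked;
    if (s->cwnd_cnt >= w) {
        s->cwnd += 1;
        s->cwnd_cnt = 0;
    }
}

/* Stretch-ACK-aware variant: grows cwnd by cwnd_cnt / w and keeps the remainder. */
static void ai_stretch_aware(struct cc_state *s, uint32_t w, uint32_t acked)
{
    assert(w > 0);
    s->cwnd_cnt += acked;
    if (s->cwnd_cnt >= w) {
        uint32_t delta = s->cwnd_cnt / w;

        s->cwnd_cnt -= delta * w;
        s->cwnd += delta;
    }
}
```

Starting from cwnd = 10 with w = 10, one stretch ACK for 20 packets moves the timid variant to 11 and the stretch-aware variant to 12.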
    • tcp: stretch ACK fixes prep · e73ebb08
      Neal Cardwell committed
      LRO, GRO, delayed ACKs, and middleboxes can cause "stretch ACKs" that
      cover more than the RFC-specified maximum of 2 packets. These stretch
      ACKs can cause serious performance shortfalls in common congestion
      control algorithms that were designed and tuned years ago with
      receiver hosts that were not using LRO or GRO, and were instead
      politely ACKing every other packet.
      
      This patch series fixes Reno and CUBIC to handle stretch ACKs.
      
      This patch prepares for the upcoming stretch ACK bug fix patches. It
      adds an "acked" parameter to tcp_cong_avoid_ai() to allow for future
      fixes to tcp_cong_avoid_ai() to correctly handle stretch ACKs, and
      changes all congestion control algorithms to pass in 1 for the ACKed
      count. It also changes tcp_slow_start() to return the number of packet
      ACK "credits" that were not processed in slow start mode, and can be
      processed by the congestion control module in additive increase mode.
      
      In future patches we will fix tcp_cong_avoid_ai() to handle stretch
      ACKs, and fix Reno and CUBIC handling of stretch ACKs in slow start
      and additive increase mode.
      Reported-by: Eyal Perry <eyalpe@mellanox.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
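
The idea of returning unused ACK "credits" from slow start can be sketched in userspace as follows. This is an illustration under assumptions: `slow_start()` here is a hypothetical helper operating on plain integers, and stopping exactly at ssthresh is a simplification rather than a statement about the kernel's exit condition.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Grow *cwnd by up to `acked` packets without passing ssthresh.
 * Returns the ACK credits that slow start did not consume, so the
 * caller can hand them to its additive-increase logic.
 */
static uint32_t slow_start(uint32_t *cwnd, uint32_t ssthresh, uint32_t acked)
{
    uint32_t target;

    assert(*cwnd <= ssthresh);  /* caller is expected to be in slow start */
    target = *cwnd + acked;
    if (target > ssthresh)
        target = ssthresh;
    acked -= target - *cwnd;    /* credits used up by slow start */
    *cwnd = target;
    return acked;
}
```

With cwnd = 8, ssthresh = 10 and an ACK covering 5 packets, the window reaches 10 and 3 credits are left over for additive increase.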
  2. 05 Nov, 2014 1 commit
  3. 01 Oct, 2014 1 commit
  4. 29 Sep, 2014 1 commit
  5. 02 Sep, 2014 1 commit
  6. 04 May, 2014 1 commit
  7. 03 May, 2014 1 commit
    • tcp: fix cwnd limited checking to improve congestion control · e114a710
      Eric Dumazet committed
      Yuchung discovered tcp_is_cwnd_limited() was returning false in
      slow start phase even if the application filled the socket write queue.
      
      All congestion modules take into account tcp_is_cwnd_limited()
      before increasing cwnd, so this behavior limits slow start from
      probing the bandwidth at full speed.
      
      The problem is that even if the write queue is full (i.e. we are _not_
      application limited), cwnd can be underutilized when TSO auto-defer
      or TCP Small Queues decide to hold packets.
      
      So in_flight can be kept at a smaller value, to the point where
      tcp_is_cwnd_limited() returns false.
      
      With TCP Small Queues and FQ/pacing, this issue is more visible.
      
      We fix this by having tcp_cwnd_validate(), which is supposed to track
      such things, take into account unsent_segs, the number of segs that we
      are not sending at the moment due to TSO or TSQ, but intend to send
      real soon. Then when we are cwnd-limited, remember this fact while we
      are processing the window of ACKs that comes back.
      
      For example, suppose we have a brand new connection with cwnd=10; we
      are in slow start, and we send a flight of 9 packets. By the time we
      have received ACKs for all 9 packets we want our cwnd to be 18.
      We implement this by setting tp->lsnd_pending to 9, and
      considering ourselves to be cwnd-limited while cwnd is less than
      twice tp->lsnd_pending (2*9 -> 18).
      
      This makes tcp_is_cwnd_limited() easier to understand by removing
      the GSO/TSO kludge that tried to work around the issue.
      
      Note the in_flight parameter can be removed in a followup cleanup
      patch.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
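
The cwnd = 10, flight of 9 example above can be traced with a toy model. It is a sketch only: `struct flow` and both functions are hypothetical, every ACK is assumed to cover exactly one packet, and only the slow-start rule "cwnd-limited while cwnd < 2 * lsnd_pending" from the message is modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified flow state. */
struct flow {
    uint32_t cwnd;          /* congestion window, in packets */
    uint32_t lsnd_pending;  /* packets sent in the last flight */
};

/* Slow-start rule from the commit message: limited while cwnd < 2 * lsnd_pending. */
static bool cwnd_limited_in_slow_start(const struct flow *f)
{
    return f->cwnd < 2 * f->lsnd_pending;
}

/* Process `acks` single-packet ACKs, growing cwnd by 1 per ACK while cwnd-limited. */
static uint32_t ack_flight(struct flow *f, uint32_t acks)
{
    assert(f != NULL);
    for (uint32_t i = 0; i < acks; i++) {
        if (cwnd_limited_in_slow_start(f))
            f->cwnd++;
    }
    return f->cwnd;
}
```

Starting from cwnd = 10 with lsnd_pending = 9, processing the 9 ACKs leaves cwnd at 18: the first eight ACKs each add one packet, and the ninth finds cwnd already at 2 * 9.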
  8. 25 Feb, 2014 1 commit
    • tcp: reduce the bloat caused by tcp_is_cwnd_limited() · d10473d4
      Eric Dumazet committed
      tcp_is_cwnd_limited() allows GSO/TSO enabled flows to increase
      their cwnd to allow a full size (64KB) TSO packet to be sent.
      
      Non GSO flows only allow an extra room of 3 MSS.
      
      For most flows with a BDP below 10 MSS, this bloats cwnd up to 90
      and inflates the RTT.
      
      Thanks to TSO auto sizing, we can restrict the bloat to the number
      of MSS contained in a TSO packet (tp->xmit_size_goal_segs), to keep
      original intent without performance impact.
      
      Because we keep cwnd small, it helps keep TSO packet sizes at their
      optimal value.
      
      Example for a 10Mbit flow, with low TCP Small queue limits (no more than
      2 skb in qdisc/device tx ring)
      
      Before patch :
      
      lpk51:~# ./ss -i dst lpk52:44862 | grep cwnd
               cubic wscale:6,6 rto:215 rtt:15.875/2.5 mss:1448 cwnd:96
      ssthresh:96
      send 70.1Mbps unacked:14 rcv_space:29200
      
      After patch :
      
      lpk51:~# ./ss -i dst lpk52:52916 | grep cwnd
               cubic wscale:6,6 rto:206 rtt:5.206/0.036 mss:1448 cwnd:15
      ssthresh:14
      send 33.4Mbps unacked:4 rcv_space:29200
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Nandita Dukkipati <nanditad@google.com>
      Cc: Van Jacobson <vanj@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
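
The `ss` figures above can be cross-checked with a little arithmetic. The sketch below assumes the "send" field is approximately cwnd * mss / rtt, which is consistent with both samples but is an interpretation rather than something the message states; both helper names are hypothetical.

```c
#include <assert.h>

/* Approximate sending rate in Mbit/s for a window of `cwnd` segments per RTT. */
static double send_rate_mbps(unsigned cwnd, unsigned mss_bytes, double rtt_ms)
{
    assert(rtt_ms > 0.0);
    return (double)cwnd * mss_bytes * 8.0 / (rtt_ms / 1000.0) / 1e6;
}

/* Bandwidth-delay product expressed in MSS-sized segments. */
static double bdp_in_mss(double rate_mbps, double rtt_ms, unsigned mss_bytes)
{
    assert(mss_bytes > 0);
    return rate_mbps * 1e6 * (rtt_ms / 1000.0) / 8.0 / mss_bytes;
}
```

With cwnd = 96, mss = 1448 and rtt = 15.875 ms this gives roughly 70.05 Mbit/s, and with cwnd = 15 and rtt = 5.206 ms roughly 33.4 Mbit/s, in line with the two samples. For a 10 Mbit/s flow at 5.206 ms the BDP comes out at about 4.5 MSS, so a cwnd of 96 is far more than the path needs.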
  9. 14 Feb, 2014 1 commit
  10. 05 Nov, 2013 1 commit
    • tcp: properly handle stretch acks in slow start · 9f9843a7
      Yuchung Cheng committed
      Slow start currently increases cwnd by 1 if an ACK acknowledges some
      packets, regardless of the number of packets. Consequently slow start
      performance is highly dependent on the degree of the stretch ACKs
      caused by receiver or network ACK compression mechanisms (e.g.,
      delayed-ACK, GRO, etc.). But the slow start algorithm is meant to send
      two packets for every packet that has left the network, so it should
      process a stretch ACK of degree N as if it were N ACKs of degree 1,
      then exit when cwnd exceeds ssthresh. A follow-up patch will use the
      remainder of the N (if greater than 1) to adjust cwnd in the
      congestion avoidance phase.
      
      In addition this patch retires the experimental limited slow start
      (LSS) feature. LSS has multiple drawbacks but questionable benefit. The
      fractional cwnd increase in LSS requires a loop in slow start even
      though it's rarely used. Configuring such an increase step via a global
      sysctl on different BDPs seems hard. Finally and most importantly the
      slow start overshoot concern is now better covered by the Hybrid slow
      start (hystart) enabled by default.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
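
To see how much the degree of a stretch ACK matters, the following sketch models one round trip of slow start under both policies. It rests on assumptions: `one_rtt()` is a hypothetical helper, each ACK covers up to `degree` packets of the current window, and growth is simply capped at ssthresh instead of handing the remainder to congestion avoidance.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Simulate one round trip of slow start for a window of `cwnd` packets
 * whose ACKs each cover up to `degree` packets.
 * count_packets == false: +1 per ACK (the behavior being replaced).
 * count_packets == true:  +1 per ACKed packet (stretch ACK of degree N
 *                         treated as N ACKs of degree 1).
 */
static uint32_t one_rtt(uint32_t cwnd, uint32_t ssthresh,
                        uint32_t degree, bool count_packets)
{
    uint32_t outstanding = cwnd;

    assert(degree > 0);
    while (outstanding > 0 && cwnd < ssthresh) {
        uint32_t covered = outstanding < degree ? outstanding : degree;

        cwnd += count_packets ? covered : 1;
        outstanding -= covered;
    }
    return cwnd < ssthresh ? cwnd : ssthresh;
}
```

For cwnd = 10 and ACKs of degree 4 (three ACKs covering 4, 4 and 2 packets), the per-ACK policy reaches only 13 after one round trip, while the per-packet policy doubles the window to 20, the same result as with degree 1.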
  11. 06 Feb, 2013 1 commit
  12. 04 Feb, 2013 1 commit
  13. 19 Nov, 2012 2 commits
    • net: Allow userns root to control ipv4 · 52e804c6
      Eric W. Biederman committed
      Allow an unprivileged user who has created a user namespace, and then
      created a network namespace to effectively use the new network
      namespace, by reducing capable(CAP_NET_ADMIN) and
      capable(CAP_NET_RAW) calls to be ns_capable(net->user_ns,
      CAP_NET_ADMIN), or ns_capable(net->user_ns, CAP_NET_RAW) calls.
      
      Settings that merely control a single network device are allowed.
      Either the network device is a logical network device where
      restrictions make no difference or the network device is a hardware NIC
      that has been explicitly moved from the initial network namespace.
      
      In general policy and network stack state changes are allowed
      while resource control is left unchanged.
      
      Allow creating raw sockets.
      Allow the SIOCSARP ioctl to control the arp cache.
      Allow the SIOCSIFFLAGS ioctl to allow setting network device flags.
      Allow the SIOCSIFADDR ioctl to allow setting a netdevice ipv4 address.
      Allow the SIOCSIFBRDADDR ioctl to allow setting a netdevice ipv4 broadcast address.
      Allow the SIOCSIFDSTADDR ioctl to allow setting a netdevice ipv4 destination address.
      Allow the SIOCSIFNETMASK ioctl to allow setting a netdevice ipv4 netmask.
      Allow the SIOCADDRT and SIOCDELRT ioctls to allow adding and deleting ipv4 routes.
      
      Allow the SIOCADDTUNNEL, SIOCCHGTUNNEL and SIOCDELTUNNEL ioctls for
      adding, changing and deleting gre tunnels.
      
      Allow the SIOCADDTUNNEL, SIOCCHGTUNNEL and SIOCDELTUNNEL ioctls for
      adding, changing and deleting ipip tunnels.
      
      Allow the SIOCADDTUNNEL, SIOCCHGTUNNEL and SIOCDELTUNNEL ioctls for
      adding, changing and deleting ipsec virtual tunnel interfaces.
      
      Allow setting the MRT_INIT, MRT_DONE, MRT_ADD_VIF, MRT_DEL_VIF, MRT_ADD_MFC,
      MRT_DEL_MFC, MRT_ASSERT, MRT_PIM, MRT_TABLE socket options on multicast routing
      sockets.
      
      Allow setting and receiving IPOPT_CIPSO, IPOPT_SEC, IPOPT_SID and
      arbitrary ip options.
      
      Allow setting IP_SEC_POLICY/IP_XFRM_POLICY ipv4 socket option.
      Allow setting the IP_TRANSPARENT ipv4 socket option.
      Allow setting the TCP_REPAIR socket option.
      Allow setting the TCP_CONGESTION socket option.
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
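
From userspace, the TCP_CONGESTION option mentioned in the list above is set with `setsockopt()`. The sketch below is Linux-specific; the two wrapper names are hypothetical. It picks "reno", which is normally selectable without privileges. As far as I understand, it is the restricted algorithms that require CAP_NET_ADMIN, which this change checks against the user namespace owning the socket's network namespace.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <sys/socket.h>

/* Select a congestion control algorithm by name; returns 0 on success, -1 on error. */
static int set_congestion_control(int fd, const char *name)
{
    assert(name != NULL);
    return setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, name, strlen(name));
}

/* Read back the algorithm name into buf; returns 0 on success, -1 on error. */
static int get_congestion_control(int fd, char *buf, socklen_t len)
{
    memset(buf, 0, len);
    return getsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, buf, &len);
}
```

A caller would create a TCP socket with `socket(AF_INET, SOCK_STREAM, 0)`, call `set_congestion_control(fd, "reno")`, and treat a failure with `errno == EPERM` as a sign that the requested algorithm is restricted for the calling process.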
  14. 02 Aug, 2012 1 commit
  15. 21 Jul, 2012 1 commit
  16. 18 May, 2012 1 commit
  17. 13 Mar, 2012 1 commit
  18. 12 Mar, 2012 1 commit
    • net: Convert printks to pr_<level> · 058bd4d2
      Joe Perches committed
      Use a more current kernel messaging style.
      
      Convert a printk block to print_hex_dump.
      Coalesce formats, align arguments.
      Use %s, __func__ instead of embedding function names.
      
      Some messages that were prefixed with <foo>_close are
      now prefixed with <foo>_fini.  Some ah4 and esp messages
      are now not prefixed with "ip ".
      
      The intent of this patch is to later add something like
        #define pr_fmt(fmt) "IPv4: " fmt.
      to standardize the output messages.
      
      Text size is trivially reduced. (x86-32 allyesconfig)
      
      $ size net/ipv4/built-in.o*
         text	   data	    bss	    dec	    hex	filename
       887888	  31558	 249696	1169142	 11d6f6	net/ipv4/built-in.o.new
       887934	  31558	 249800	1169292	 11d78c	net/ipv4/built-in.o.old
      Signed-off-by: Joe Perches <joe@perches.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
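
The effect of a `pr_fmt()` prefix combined with `%s, __func__` can be imitated in userspace. This is an analogy, not kernel code: `log_to_buf()` and `report_bad_header()` are hypothetical, the message text is made up for illustration, `snprintf()` stands in for printk, and `##__VA_ARGS__` is a GNU extension accepted by GCC and Clang.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Prefix prepended to every message at compile time via string concatenation. */
#define pr_fmt(fmt) "IPv4: " fmt

/* Userspace stand-in for a pr_<level>() macro: formats into a buffer. */
#define log_to_buf(buf, size, fmt, ...) \
    snprintf((buf), (size), pr_fmt(fmt), ##__VA_ARGS__)

/* Uses %s with __func__ instead of embedding the function name in the string. */
static int report_bad_header(char *buf, size_t size, int len)
{
    assert(buf != NULL);
    return log_to_buf(buf, size, "%s: bad header length %d\n", __func__, len);
}
```

Calling `report_bad_header(buf, sizeof(buf), 7)` formats `IPv4: report_bad_header: bad header length 7` followed by a newline; renaming the function changes the message without touching the format string.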
  19. 29 Nov, 2011 1 commit
  20. 28 Aug, 2010 1 commit
    • net/ipv4: Eliminate kstrdup memory leak · c34186ed
      Julia Lawall committed
      The string clone is only used as a temporary copy of the argument val
      within the while loop, and so it should be freed before leaving the
      function.  The call to strsep, however, modifies clone, so a pointer to the
      front of the string is kept in saved_clone, to make it possible to free it.
      
      The semantic match that finds this problem is as follows:
      (http://coccinelle.lip6.fr/)
      
      // <smpl>
      @r exists@
      local idexpression x;
      expression E;
      identifier l;
      statement S;
      @@
      
      *x= \(kasprintf\|kstrdup\)(...);
      ...
      if (x == NULL) S
      ... when != kfree(x)
          when != E = x
      if (...) {
        <... when != kfree(x)
      * goto l;
        ...>
      * return ...;
      }
      // </smpl>
      Signed-off-by: Julia Lawall <julia@diku.dk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  21. 30 Mar, 2010 1 commit
    • include cleanup: Update gfp.h and slab.h includes to prepare for breaking... · 5a0e3ad6
      Tejun Heo committed
      include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
      
      percpu.h is included by sched.h and module.h and thus ends up being
      included when building most .c files.  percpu.h includes slab.h which
      in turn includes gfp.h making everything defined by the two files
      universally available and complicating inclusion dependencies.
      
      The percpu.h -> slab.h dependency is about to be removed.  Prepare for
      this change by updating users of gfp and slab facilities to include
      those headers directly instead of assuming availability.  As this
      conversion needs to touch a large number of source files, the following
      script is used as the basis of conversion.
      
        http://userweb.kernel.org/~tj/misc/slabh-sweep.py
      
      The script does the following.
      
      * Scan files for gfp and slab usages and update includes such that
        only the necessary includes are there.  ie. if only gfp is used,
        gfp.h, if slab is used, slab.h.
      
      * When the script inserts a new include, it looks at the include
        blocks and tries to put the new include such that its order conforms
        to its surroundings.  It's put in the include block which contains
        core kernel includes, in the same order that the rest are ordered -
        alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
        doesn't seem to be any matching order.
      
      * If the script can't find a place to put a new include (mostly
        because the file doesn't have a fitting include block), it prints out
        an error message indicating which .h file needs to be added to the
        file.
      
      The conversion was done in the following steps.
      
      1. The initial automatic conversion of all .c files updated slightly
         over 4000 files, deleting around 700 includes and adding ~480 gfp.h
         and ~3000 slab.h inclusions.  The script emitted errors for ~400
         files.
      
      2. Each error was manually checked.  Some didn't need the inclusion,
         some needed manual addition while adding it to implementation .h or
         embedding .c file was more appropriate for others.  This step added
         inclusions to around 150 files.
      
      3. The script was run again and the output was compared to the edits
         from #2 to make sure no file was left behind.
      
      4. Several build tests were done and a couple of problems were fixed.
         e.g. lib/decompress_*.c used malloc/free() wrappers around slab
         APIs requiring slab.h to be added manually.
      
      5. The script was run on all .h files but without automatically
         editing them as sprinkling gfp.h and slab.h inclusions around .h
         files could easily lead to inclusion dependency hell.  Most gfp.h
         inclusion directives were ignored as stuff from gfp.h was usually
         widely available and often used in preprocessor macros.  Each
         slab.h inclusion directive was examined and added manually as
         necessary.
      
      6. percpu.h was updated not to include slab.h.
      
      7. Build tests were done on the following configurations and failures
         were fixed.  CONFIG_GCOV_KERNEL was turned off for all tests (as my
         distributed build env didn't work with gcov compiles) and a few
         more options had to be turned off depending on archs to make things
         build (like ipr on powerpc/64 which failed due to missing writeq).
      
         * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
         * powerpc and powerpc64 SMP allmodconfig
         * sparc and sparc64 SMP allmodconfig
         * ia64 SMP allmodconfig
         * s390 SMP allmodconfig
         * alpha SMP allmodconfig
         * um on x86_64 SMP allmodconfig
      
      8. percpu.h modifications were reverted so that it could be applied as
         a separate patch and serve as bisection point.
      
      Given the fact that I had only a couple of failures from tests on step
      6, I'm fairly confident about the coverage of this conversion patch.
      If there is a breakage, it's likely to be something in one of the arch
      headers which should be easily discoverable on most builds of
      the specific arch.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
  22. 14 Aug, 2009 1 commit
  23. 02 Mar, 2009 1 commit
  24. 17 Oct, 2008 1 commit
  25. 29 Apr, 2008 2 commits
  26. 29 Jan, 2008 2 commits
    • [TCP]: Uninline tcp_is_cwnd_limited · cea14e0e
      Ilpo Järvinen committed
      net/ipv4/tcp_cong.c:
        tcp_reno_cong_avoid |  -65
       1 function changed, 65 bytes removed, diff: -65
      
      net/ipv4/arp.c:
        arp_ignore |   -5
       1 function changed, 5 bytes removed, diff: -5
      
      net/ipv4/tcp_bic.c:
        bictcp_cong_avoid |  -57
       1 function changed, 57 bytes removed, diff: -57
      
      net/ipv4/tcp_cubic.c:
        bictcp_cong_avoid |  -61
       1 function changed, 61 bytes removed, diff: -61
      
      net/ipv4/tcp_highspeed.c:
        hstcp_cong_avoid |  -63
       1 function changed, 63 bytes removed, diff: -63
      
      net/ipv4/tcp_hybla.c:
        hybla_cong_avoid |  -85
       1 function changed, 85 bytes removed, diff: -85
      
      net/ipv4/tcp_htcp.c:
        htcp_cong_avoid |  -57
       1 function changed, 57 bytes removed, diff: -57
      
      net/ipv4/tcp_veno.c:
        tcp_veno_cong_avoid |  -52
       1 function changed, 52 bytes removed, diff: -52
      
      net/ipv4/tcp_scalable.c:
        tcp_scalable_cong_avoid |  -61
       1 function changed, 61 bytes removed, diff: -61
      
      net/ipv4/tcp_yeah.c:
        tcp_yeah_cong_avoid |  -75
       1 function changed, 75 bytes removed, diff: -75
      
      net/ipv4/tcp_illinois.c:
        tcp_illinois_cong_avoid |  -54
       1 function changed, 54 bytes removed, diff: -54
      
      net/dccp/ccids/ccid3.c:
        ccid3_update_send_interval |   -7
        ccid3_hc_tx_packet_recv    |   +7
       2 functions changed, 7 bytes added, 7 bytes removed, diff: +0
      
      net/ipv4/tcp_cong.c:
        tcp_is_cwnd_limited |  +88
       1 function changed, 88 bytes added, diff: +88
      
      built-in.o:
       14 functions changed, 95 bytes added, 642 bytes removed, diff: -547
      
      ...Again some gcc artifacts visible as well.
      Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  27. 18 Jul, 2007 1 commit
  28. 18 May, 2007 1 commit
  29. 26 Apr, 2007 2 commits
  30. 24 Apr, 2007 1 commit
  31. 18 Feb, 2007 1 commit
  32. 11 Feb, 2007 1 commit
  33. 03 Dec, 2006 3 commits