1. 09 Nov 2011, 1 commit
  2. 21 Oct 2011, 1 commit
  3. 19 Oct 2011, 1 commit
  4. 28 Sep 2011, 1 commit
  5. 25 Aug 2011, 2 commits
    • Proportional Rate Reduction for TCP. · a262f0cd
      Committed by Nandita Dukkipati
      This patch implements Proportional Rate Reduction (PRR) for TCP.
      PRR is an algorithm that determines TCP's sending rate during fast
      recovery. PRR avoids excessive window reductions and aims for the
      actual congestion window size at the end of recovery to be as
      close as possible to the window determined by the congestion control
      algorithm. PRR also improves the accuracy of the amount of data sent
      during loss recovery.
      
      The patch implements the recommended flavor of PRR called PRR-SSRB
      (Proportional rate reduction with slow start reduction bound) and
      replaces the existing rate halving algorithm. PRR improves upon the
      existing Linux fast recovery under a number of conditions including:
        1) burst losses where the losses implicitly reduce the amount of
      outstanding data (pipe) below the ssthresh value selected by the
      congestion control algorithm and,
        2) losses near the end of short flows where the application runs out
      of data to send.
      
      As an example, with the existing rate halving implementation a single
      loss event can cause a connection carrying short Web transactions to
      go into slow start after the recovery. This is because during
      recovery Linux pulls the congestion window down to packets_in_flight+1
      on every ACK. A short Web response often runs out of new data to send,
      and its pipe reduces to zero by the end of recovery, when all its
      packets are drained from the network. Subsequent HTTP responses using
      the same connection will then have to slow start to raise cwnd to
      ssthresh. PRR, on the other hand, aims for cwnd to be as close as
      possible to ssthresh by the end of recovery.
      
      A description of PRR and a discussion of its performance can be found at
      the following links:
      - IETF Draft:
          http://tools.ietf.org/html/draft-mathis-tcpm-proportional-rate-reduction-01
      - IETF Slides:
          http://www.ietf.org/proceedings/80/slides/tcpm-6.pdf
          http://tools.ietf.org/agenda/81/slides/tcpm-2.pdf
      - Paper to appear in Internet Measurements Conference (IMC) 2011:
          Improving TCP Loss Recovery
          Nandita Dukkipati, Matt Mathis, Yuchung Cheng
      Signed-off-by: Nandita Dukkipati <nanditad@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ipv4: convert to SKB frag APIs · aff65da0
      Committed by Ian Campbell
      Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      Cc: "Pekka Savola (ipv6)" <pekkas@netcore.fi>
      Cc: James Morris <jmorris@namei.org>
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Cc: Patrick McHardy <kaber@trash.net>
      Cc: netdev@vger.kernel.org
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. 09 May 2011, 1 commit
    • inet: Pass flowi to ->queue_xmit(). · d9d8da80
      Committed by David S. Miller
      This allows us to acquire the exact route keying information from the
      protocol, however that might be managed.
      
      It handles all of the possibilities, from the simplest case of storing
      the key in inet->cork.fl to the more complex setup SCTP has where
      individual transports determine the flow.
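      
      The shape of the change, sketched against the connection-level ops
      structure (signature only; surrounding members omitted):
      
        /* struct inet_connection_sock_af_ops: the xmit hook now receives
         * the flow key, so each protocol can hand over its exact route
         * keying information. */
        int (*queue_xmit)(struct sk_buff *skb, struct flowi *fl);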
      Signed-off-by: David S. Miller <davem@davemloft.net>
  7. 02 Apr 2011, 1 commit
  8. 31 Mar 2011, 1 commit
  9. 22 Feb 2011, 1 commit
  10. 21 Dec 2010, 1 commit
    • TCP: increase default initial receive window. · 356f0398
      Committed by Nandita Dukkipati
      This patch changes the default initial receive window to 10 mss
      (defined constant). The default window is limited to the maximum
      of 10*1460 and 2*mss (when mss > 1460).
      
      draft-ietf-tcpm-initcwnd-00 is a proposal to the IETF that recommends
      increasing TCP's initial congestion window to 10 mss or about 15KB.
      Leading up to this proposal were several large-scale live Internet
      experiments with an initial congestion window of 10 mss (IW10), where
      we showed that the average latency of HTTP responses improved by
      approximately 10%. This was accompanied by a slight increase in
      retransmission rate (0.5%), most of which is coming from applications
      opening multiple simultaneous connections. To understand the extreme
      worst case scenarios, and fairness issues (IW10 versus IW3), we further
      conducted controlled testbed experiments. We came away finding minimal
      negative impact even under low link bandwidths (dial-ups) and small
      buffers.  These results are extremely encouraging for adopting IW10.
      
      However, an initial congestion window of 10 mss is useless unless a TCP
      receiver advertises an initial receive window of at least 10 mss.
      Fortunately, in the large-scale Internet experiments we found that most
      widely used operating systems advertised large initial receive windows
      of 64KB, allowing us to experiment with a wide range of initial
      congestion windows. Linux systems were among the few exceptions that
      advertised a small receive window of 6KB. The purpose of this patch is
      to fix this shortcoming.
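      
      A standalone C sketch of the resulting default described above (not
      the kernel function itself):
      
        /* Default initial receive window: 10 segments, capped in bytes at
         * max(10*1460, 2*mss) when mss > 1460. */
        static unsigned int default_init_rwnd_bytes(unsigned int mss)
        {
                unsigned int segs = 10;
        
                if (mss > 1460) {
                        segs = (10 * 1460) / mss;
                        if (segs < 2)
                                segs = 2;
                }
                return segs * mss;
        }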
      
      References:
      1. A comprehensive list of all IW10 references to date.
      http://code.google.com/speed/protocols/tcpm-IW10.html
      
      2. Paper describing results from large-scale Internet experiments with IW10.
      http://ccr.sigcomm.org/drupal/?q=node/621
      
      3. Controlled testbed experiments under worst case scenarios and a
      fairness study.
      http://www.ietf.org/proceedings/79/slides/tcpm-0.pdf
      
      4. Raw test data from testbed experiments (Linux senders/receivers)
      with initial congestion and receive windows of both 10 mss.
      http://research.csc.ncsu.edu/netsrv/?q=content/iw10
      
      5. Internet-Draft. Increasing TCP's Initial Window.
      https://datatracker.ietf.org/doc/draft-ietf-tcpm-initcwnd/
      Signed-off-by: Nandita Dukkipati <nanditad@google.com>
      Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  11. 14 Dec 2010, 1 commit
    • net: Abstract default ADVMSS behind an accessor. · 0dbaee3b
      Committed by David S. Miller
      Make all RTAX_ADVMSS metric accesses go through a new helper function,
      dst_metric_advmss().
      
      Leave the actual default metric as "zero" in the real metric slot,
      and compute the actual default value dynamically via a new dst_ops
      AF specific callback.
      
      For stacked IPSEC routes, we use the advmss of the path which
      preserves existing behavior.
      
      Unlike ipv4/ipv6, DECnet ties the advmss to the mtu and thus updates
      advmss on pmtu updates.  This inconsistency in advmss handling
      results in more raw metric accesses than I wish we ended up with.
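      
      A sketch of what the accessor looks like under this scheme (assuming
      a raw metric read plus the new dst_ops callback):
      
        static inline u32 dst_metric_advmss(const struct dst_entry *dst)
        {
                u32 advmss = dst_metric_raw(dst, RTAX_ADVMSS);
        
                /* "zero" in the real metric slot means: compute the
                 * AF-specific default dynamically. */
                if (!advmss)
                        advmss = dst->ops->default_advmss(dst);
                return advmss;
        }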
      Signed-off-by: David S. Miller <davem@davemloft.net>
  12. 09 Dec 2010, 3 commits
  13. 03 Dec 2010, 1 commit
  14. 25 Nov 2010, 1 commit
    • xps: Improvements in TX queue selection · 3853b584
      Committed by Tom Herbert
      In dev_pick_tx, don't do the work of calculating the queue index or
      setting the index in the sock unless the device has more than one
      queue.  This allows the sock to be set only with a queue index of a
      multi-queue device, which is desirable if devices are stacked, as in
      a tunnel.
      
      We also allow the mapping of a socket to a queue to be changed.  To
      maintain in-order packet transmission, a flag (ooo_okay) has been
      added to the sk_buff structure.  If a transport layer sets this flag
      on a packet, the transmit queue can be changed for the socket.
      Presumably, the transport would set this if there were no possibility
      of creating OOO packets (for instance, there are no packets in flight
      for the socket).  This patch includes the modification in TCP output
      for setting this flag.
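      
      In TCP output the test can be as simple as "none of our packets can
      still be in flight"; a sketch (the exact condition used is a
      judgment call, not necessarily this one):
      
        /* Safe to move the socket to another TX queue: reordering is
         * impossible because nothing of ours is in the network. */
        if (tcp_packets_in_flight(tcp_sk(sk)) == 0)
                skb->ooo_okay = 1;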
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  15. 18 Nov 2010, 1 commit
  16. 02 Nov 2010, 1 commit
  17. 24 Sep 2010, 1 commit
  18. 02 Sep 2010, 1 commit
  19. 23 Aug 2010, 1 commit
    • tcp: allow effective reduction of TCP's rcv-buffer via setsockopt · e88c64f0
      Committed by Hagen Paul Pfeifer
      Via setsockopt it is possible to reduce the socket RX buffer
      (SO_RCVBUF).  TCP's method of selecting the initial window and window
      scaling option in tcp_select_initial_window() currently misbehaves:
      it does not consider an RX socket buffer reduced via setsockopt.
      
      Even though the server's RX buffer is reduced via setsockopt() to 256
      bytes (initial window 384 bytes => 256 * 2 - (256 * 2 / 4)), the
      window scale option is still 7:
      
      192.168.1.38.40676 > 78.47.222.210.5001: Flags [S], seq 2577214362, win 5840, options [mss 1460,sackOK,TS val 338417 ecr 0,nop,wscale 0], length 0
      78.47.222.210.5001 > 192.168.1.38.40676: Flags [S.], seq 1570631029, ack 2577214363, win 384, options [mss 1452,sackOK,TS val 2435248895 ecr 338417,nop,wscale 7], length 0
      192.168.1.38.40676 > 78.47.222.210.5001: Flags [.], ack 1, win 5840, options [nop,nop,TS val 338421 ecr 2435248895], length 0
      
      Within tcp_select_initial_window() the original space argument - a
      representation of the RX buffer size - is expanded.  Only
      sysctl_tcp_rmem[2], sysctl_rmem_max and window_clamp are considered
      when calculating the initial window.
      
      This patch adjusts the window_clamp argument if the user explicitly
      reduces the receive buffer.
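      
      A sketch of the idea (placement is illustrative; SOCK_RCVBUF_LOCK is
      the sk_userlocks flag that setsockopt(SO_RCVBUF) sets):
      
        /* If the user pinned the RX buffer, don't let window_clamp
         * exceed what that buffer can actually hold. */
        if (sk->sk_userlocks & SOCK_RCVBUF_LOCK)
                *window_clamp = min(*window_clamp, (u32)space);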
      Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Patrick McHardy <kaber@trash.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  20. 20 Jul 2010, 1 commit
  21. 13 Jul 2010, 1 commit
  22. 29 Jun 2010, 1 commit
  23. 16 Jun 2010, 1 commit
    • tcp: unify tcp flag macros · a3433f35
      Committed by Changli Gao
      Unify the TCP flag macros: TCPHDR_FIN, TCPHDR_SYN, TCPHDR_RST,
      TCPHDR_PSH, TCPHDR_ACK, TCPHDR_URG, TCPHDR_ECE and TCPHDR_CWR.  The
      TCPCB_FLAG_* macros are replaced with the corresponding TCPHDR_*.
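      
      The unified names map onto the standard TCP header flag bits:
      
        #define TCPHDR_FIN 0x01
        #define TCPHDR_SYN 0x02
        #define TCPHDR_RST 0x04
        #define TCPHDR_PSH 0x08
        #define TCPHDR_ACK 0x10
        #define TCPHDR_URG 0x20
        #define TCPHDR_ECE 0x40
        #define TCPHDR_CWR 0x80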
      Signed-off-by: Changli Gao <xiaosuo@gmail.com>
      ----
       include/net/tcp.h                      |   24 ++++++-------
       net/ipv4/tcp.c                         |    8 ++--
       net/ipv4/tcp_input.c                   |    2 -
       net/ipv4/tcp_output.c                  |   59 ++++++++++++++++-----------------
       net/netfilter/nf_conntrack_proto_tcp.c |   32 ++++++-----------
       net/netfilter/xt_TCPMSS.c              |    4 --
       6 files changed, 58 insertions(+), 71 deletions(-)
      Signed-off-by: David S. Miller <davem@davemloft.net>
  24. 18 May 2010, 1 commit
    • tcp: tcp_synack_options() fix · de213e5e
      Committed by Eric Dumazet
      Commit 33ad798c ("tcp: options clean up") introduced a problem
      when MD5+SACK+timestamps were used in the initial SYN message.
      
      Some stacks (old Linux, for example) try to negotiate MD5+SACK+TSTAMP
      sessions, but since 40 bytes of TCP option space are not enough to
      store all the bits needed, we chose to disable timestamps in this case.
      
      We send a SYN-ACK _without_ the timestamp option, but the socket has
      timestamps enabled, so all further outgoing messages contain a TS
      block, all with the initial timestamp of the remote peer.
      
      The fix is to really disable the timestamps option for the whole
      session.
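      
      A sketch of the fix in tcp_synack_options() (surrounding code
      abbreviated; the key line is clearing tstamp_ok on the request
      itself rather than only omitting the option from this SYN-ACK):
      
        if (*md5) {
                opts->options |= OPTION_MD5;
                remaining -= TCPOLEN_MD5SIG_ALIGNED;
        
                /* MD5 + TS + SACK don't fit in 40 bytes of option
                 * space; disable timestamps on the request so the
                 * whole session stays consistent. */
                ireq->tstamp_ok &= !ireq->sack_ok;
        }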
      Reported-by: Bijay Singh <Bijay.Singh@guavus.com>
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  25. 16 May 2010, 1 commit
    • net: Introduce sk_route_nocaps · a465419b
      Committed by Eric Dumazet
      TCP-MD5 sessions fail intermittently when the route cache is
      invalidated: ip_queue_xmit() has to find a new route and calls
      sk_setup_caps(sk, &rt->u.dst), destroying the
      
      sk->sk_route_caps &= ~NETIF_F_GSO_MASK
      
      that MD5 desperately tries to enforce all along its path (from
      tcp_transmit_skb(), for example).
      
      So we send a few bad packets, and everything is fine once
      tcp_transmit_skb() is called again for this socket.
      
      Since ip_queue_xmit() is at a lower level than TCP-MD5, I chose to use a
      socket field, sk_route_nocaps, containing bits to mask on sk_route_caps.
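      
      The mechanism, sketched (helper form is illustrative):
      
        /* Record features this socket must never use ... */
        static inline void sk_nocaps_add(struct sock *sk, int flags)
        {
                sk->sk_route_nocaps |= flags;
                sk->sk_route_caps &= ~flags;
        }
        
        /* ... and have sk_setup_caps() re-apply the mask whenever a new
         * route installs fresh capabilities: */
        sk->sk_route_caps &= ~sk->sk_route_nocaps;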
      Reported-by: Bhaskar Dutta <bhaskie@gmail.com>
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  26. 23 Apr 2010, 1 commit
  27. 21 Apr 2010, 1 commit
  28. 16 Apr 2010, 1 commit
  29. 12 Apr 2010, 2 commits
    • tcp: Set CHECKSUM_UNNECESSARY in tcp_init_nondata_skb · 2e8e18ef
      Committed by David S. Miller
      Back in commit 04a0551c
      ("loopback: Drop obsolete ip_summed setting") we stopped
      setting CHECKSUM_UNNECESSARY in the loopback xmit.
      
      This is because such a setting was a lie since it implies that the
      checksum field of the packet is properly filled in.
      
      Instead what happens normally is that CHECKSUM_PARTIAL is set and
      skb->csum is calculated as needed.
      
      But this was only happening for TCP data packets (via the
      skb->ip_summed assignment done in tcp_sendmsg()).  It doesn't
      happen for non-data packets like ACKs etc.
      
      Fix this by setting skb->ip_summed in the common non-data packet
      constructor.  It already sets skb->csum to zero.
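      
      A sketch of the change in the non-data constructor:
      
        /* tcp_init_nondata_skb(): ACKs, RSTs and friends now get the
         * same checksum setup as data packets. */
        skb->ip_summed = CHECKSUM_PARTIAL;
        skb->csum = 0;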
      
      But this reminds us that we still have things like ip_output.c's
      ip_dev_loopback_xmit() which sets skb->ip_summed to the value
      CHECKSUM_UNNECESSARY, which Herbert's patch teaches us is not
      valid.  So we'll have to address that at some point too.
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • inet: Remove unused send_check length argument · bb296246
      Committed by Herbert Xu
      
      This patch removes the unused length argument from the send_check
      function in struct inet_connection_sock_af_ops.
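      
      The signature change, sketched:
      
        /* struct inet_connection_sock_af_ops
         * before: void (*send_check)(struct sock *sk, int len,
         *                            struct sk_buff *skb);
         * after: */
        void (*send_check)(struct sock *sk, struct sk_buff *skb);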
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Tested-by: Yinghai <yinghai.lu@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  30. 11 Apr 2010, 1 commit
  31. 09 Apr 2010, 1 commit
    • tcp: Set CHECKSUM_UNNECESSARY in tcp_init_nondata_skb · 2626419a
      Committed by David S. Miller
      Back in commit 04a0551c
      ("loopback: Drop obsolete ip_summed setting") we stopped
      setting CHECKSUM_UNNECESSARY in the loopback xmit.
      
      This is because such a setting was a lie since it implies that the
      checksum field of the packet is properly filled in.
      
      Instead what happens normally is that CHECKSUM_PARTIAL is set and
      skb->csum is calculated as needed.
      
      But this was only happening for TCP data packets (via the
      skb->ip_summed assignment done in tcp_sendmsg()).  It doesn't
      happen for non-data packets like ACKs etc.
      
      Fix this by setting skb->ip_summed in the common non-data packet
      constructor.  It already sets skb->csum to zero.
      
      But this reminds us that we still have things like ip_output.c's
      ip_dev_loopback_xmit() which sets skb->ip_summed to the value
      CHECKSUM_UNNECESSARY, which Herbert's patch teaches us is not
      valid.  So we'll have to address that at some point too.
      Signed-off-by: David S. Miller <davem@davemloft.net>
  32. 30 Mar 2010, 1 commit
    • include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h · 5a0e3ad6
      Committed by Tejun Heo
      
      percpu.h is included by sched.h and module.h and thus ends up being
      included when building most .c files.  percpu.h includes slab.h which
      in turn includes gfp.h making everything defined by the two files
      universally available and complicating inclusion dependencies.
      
      The percpu.h -> slab.h dependency is about to be removed.  Prepare for
      this change by updating users of gfp and slab facilities to include
      those headers directly instead of assuming their availability.  As
      this conversion needs to touch a large number of source files, the
      following script was used as the basis of the conversion.
      
        http://userweb.kernel.org/~tj/misc/slabh-sweep.py
      
      The script does the following.
      
      * Scan files for gfp and slab usages and update includes such that
        only the necessary includes are there, i.e. if only gfp is used,
        gfp.h; if slab is used, slab.h.
      
      * When the script inserts a new include, it looks at the include
        blocks and tries to put the new include such that its order conforms
        to its surroundings.  It's put in the include block which contains
        core kernel includes, in the same order that the rest are ordered -
        alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
        doesn't seem to be any matching order.
      
      * If the script can't find a place to put a new include (mostly
        because the file doesn't have a fitting include block), it prints
        out an error message indicating which .h file needs to be added to
        the file.
      
      The conversion was done in the following steps.
      
      1. The initial automatic conversion of all .c files updated slightly
         over 4000 files, deleting around 700 includes and adding ~480 gfp.h
         and ~3000 slab.h inclusions.  The script emitted errors for ~400
         files.
      
      2. Each error was manually checked.  Some didn't need the inclusion,
         some needed manual addition, and for others adding it to an
         implementation .h or embedding .c file was more appropriate.
         This step added inclusions to around 150 files.
      
      3. The script was run again and the output was compared to the edits
         from #2 to make sure no file was left behind.
      
      4. Several build tests were done and a couple of problems were fixed.
         e.g. lib/decompress_*.c used malloc/free() wrappers around slab
         APIs requiring slab.h to be added manually.
      
      5. The script was run on all .h files but without automatically
         editing them as sprinkling gfp.h and slab.h inclusions around .h
         files could easily lead to inclusion dependency hell.  Most gfp.h
         inclusion directives were ignored as stuff from gfp.h was usually
         widely available and often used in preprocessor macros.  Each
         slab.h inclusion directive was examined and added manually as
         necessary.
      
      6. percpu.h was updated not to include slab.h.
      
      7. Build tests were done on the following configurations and failures
         were fixed.  CONFIG_GCOV_KERNEL was turned off for all tests (as my
         distributed build env didn't work with gcov compiles) and a few
         more options had to be turned off depending on archs to make things
         build (like ipr on powerpc/64 which failed due to missing writeq).
      
         * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
         * powerpc and powerpc64 SMP allmodconfig
         * sparc and sparc64 SMP allmodconfig
         * ia64 SMP allmodconfig
         * s390 SMP allmodconfig
         * alpha SMP allmodconfig
         * um on x86_64 SMP allmodconfig
      
      8. percpu.h modifications were reverted so that it could be applied as
         a separate patch and serve as bisection point.
      
      Given the fact that I had only a couple of failures from tests on step
      6, I'm fairly confident about the coverage of this conversion patch.
      If there is a breakage, it's likely to be something in one of the arch
      headers, which should be easily discoverable on most builds of the
      specific arch.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
  33. 09 Mar 2010, 1 commit
  34. 24 Dec 2009, 2 commits
    • net: Add rtnetlink init_rcvwnd to set the TCP initial receive window · 31d12926
      Committed by Laurent Chavey
      Add rtnetlink init_rcvwnd to set the TCP initial receive window size
      advertised by passive and active TCP connections.
      The current Linux TCP implementation limits the advertised TCP initial
      receive window to the one prescribed by slow start.  For short-lived
      TCP connections used for transaction-type traffic (i.e. HTTP
      requests), bounding the advertised TCP initial receive window results
      in increased latency to complete the transaction.
      Support for setting the initial congestion window already exists
      via rtnetlink init_cwnd, but the feature is useless without the
      ability to set a larger TCP initial receive window.
      The rtnetlink init_rcvwnd allows increasing the TCP initial receive
      window, allowing TCP connections to advertise a larger TCP receive
      window than the one bounded by slow start.
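      
      A sketch of how the new metric feeds window selection (RTAX_INITRWND
      is the metric this patch introduces; the call placement is
      illustrative):
      
        /* Prefer a route-supplied initial receive window; 0 means
         * "no override", falling back to the slow-start bound. */
        u32 init_rcv_wnd = dst_metric(dst, RTAX_INITRWND);
        
        tcp_select_initial_window(tcp_full_space(sk), mss,
                                  &tp->rcv_wnd, &tp->window_clamp,
                                  sysctl_tcp_window_scaling,
                                  &rcv_wscale, init_rcv_wnd);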
      Signed-off-by: Laurent Chavey <chavey@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: Remove check in __tcp_push_pending_frames · 12d50c46
      Committed by Krishna Kumar
      tcp_push checks tcp_send_head and calls __tcp_push_pending_frames,
      which again checks tcp_send_head, and this unnecessary check is
      done for every other caller of __tcp_push_pending_frames.
      
      Remove the tcp_send_head check in __tcp_push_pending_frames and add
      the check to tcp_push_pending_frames.  Other functions call
      __tcp_push_pending_frames only when tcp_send_head would evaluate
      to true.
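      
      The reshuffled helper, sketched to match the description above:
      
        /* The outer helper now carries the single tcp_send_head check. */
        static inline void tcp_push_pending_frames(struct sock *sk)
        {
                if (tcp_send_head(sk)) {
                        struct tcp_sock *tp = tcp_sk(sk);
        
                        __tcp_push_pending_frames(sk, tcp_current_mss(sk),
                                                  tp->nonagle);
                }
        }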
      Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
      Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  35. 16 Dec 2009, 1 commit
    • tcp: Revert per-route SACK/DSACK/TIMESTAMP changes. · bb5b7c11
      Committed by David S. Miller
      It creates a regression, triggering badness for SYN_RECV
      sockets, for example:
      
      [19148.022102] Badness at net/ipv4/inet_connection_sock.c:293
      [19148.022570] NIP: c02a0914 LR: c02a0904 CTR: 00000000
      [19148.023035] REGS: eeecbd30 TRAP: 0700   Not tainted  (2.6.32)
      [19148.023496] MSR: 00029032 <EE,ME,CE,IR,DR>  CR: 24002442  XER: 00000000
      [19148.024012] TASK = eee9a820[1756] 'privoxy' THREAD: eeeca000
      
      This is likely caused by the change in the 'estab' parameter
      passed to tcp_parse_options() when invoked by the functions
      in net/ipv4/tcp_minisocks.c
      
      But even if that is fixed, the ->conn_request() changes made in
      this patch series are fundamentally wrong.  They try to use the
      listening socket's 'dst' to probe the route settings.  The
      listening socket doesn't even have a route, and you can't
      get the right route (the child request one) until much later,
      after we set up all of the state, and it must be done by hand.
      
      This stuff really isn't ready, so the best thing to do is a
      full revert.  This reverts the following commits:
      
      f55017a9
      022c3f7d
      1aba721e
      cda42ebd
      345cda2f
      dc343475
      05eaade2
      6a2a2d6b
      Signed-off-by: David S. Miller <davem@davemloft.net>