1. 11 5月, 2012 1 次提交
  2. 03 5月, 2012 3 次提交
    • E
      net: implement tcp coalescing in tcp_queue_rcv() · b081f85c
      Eric Dumazet 提交于
      Extend tcp coalescing implementing it from tcp_queue_rcv(), the main
      receiver function when application is not blocked in recvmsg().
      
      Function tcp_queue_rcv() is moved a bit to allow its call from
      tcp_data_queue()
      
      This gives good results especially if GRO could not kick, and if skb
      head is a fragment.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Alexander Duyck <alexander.h.duyck@intel.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b081f85c
    • E
      tcp: change tcp_adv_win_scale and tcp_rmem[2] · b49960a0
      Eric Dumazet 提交于
      tcp_adv_win_scale default value is 2, meaning we expect a good citizen
      skb to have skb->len / skb->truesize ratio of 75% (3/4)
      
      In 2.6 kernels we (mis)accounted for typical MSS=1460 frame :
      1536 + 64 + 256 = 1856 'estimated truesize', and 1856 * 3/4 = 1392.
      So these skbs were considered as not bloated.
      
      With recent truesize fixes, a typical MSS=1460 frame truesize is now the
      more precise :
      2048 + 256 = 2304. But 2304 * 3/4 = 1728.
      So these skb are not good citizen anymore, because 1460 < 1728
      
      (GRO can escape this problem because it build skbs with a too low
      truesize.)
      
      This also means tcp advertises a too optimistic window for a given
      allocated rcvspace : When receiving frames, sk_rmem_alloc can hit
      sk_rcvbuf limit and we call tcp_prune_queue()/tcp_collapse() too often,
      especially when application is slow to drain its receive queue or in
      case of losses (netperf is fast, scp is slow). This is a major latency
      source.
      
      We should adjust the len/truesize ratio to 50% instead of 75%
      
      This patch :
      
      1) changes tcp_adv_win_scale default to 1 instead of 2
      
      2) increase tcp_rmem[2] limit from 4MB to 6MB to take into account
      better truesize tracking and to allow autotuning tcp receive window to
      reach same value than before. Note that same amount of kernel memory is
      consumed compared to 2.6 kernels.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b49960a0
    • Y
      tcp: early retransmit · eed530b6
      Yuchung Cheng 提交于
      This patch implements RFC 5827 early retransmit (ER) for TCP.
      It reduces DUPACK threshold (dupthresh) if outstanding packets are
      less than 4 to recover losses by fast recovery instead of timeout.
      
      While the algorithm is simple, small but frequent network reordering
      makes this feature dangerous: the connection repeatedly enter
      false recovery and degrade performance. Therefore we implement
      a mitigation suggested in the appendix of the RFC that delays
      entering fast recovery by a small interval, i.e., RTT/4. Currently
      ER is conservative and is disabled for the rest of the connection
      after the first reordering event. A large scale web server
      experiment on the performance impact of ER is summarized in
      section 6 of the paper "Proportional Rate Reduction for TCP”,
      IMC 2011. http://conferences.sigcomm.org/imc/2011/docs/p155.pdf
      
      Note that Linux has a similar feature called THIN_DUPACK. The
      differences are THIN_DUPACK do not mitigate reorderings and is only
      used after slow start. Currently ER is disabled if THIN_DUPACK is
      enabled. I would be happy to merge THIN_DUPACK feature with ER if
      people think it's a good idea.
      
      ER is enabled by sysctl_tcp_early_retrans:
        0: Disables ER
      
        1: Reduce dupthresh to packets_out - 1 when outstanding packets < 4.
      
        2: (Default) reduce dupthresh like mode 1. In addition, delay
           entering fast recovery by RTT/4.
      
      Note: mode 2 is implemented in the third part of this patch series.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      eed530b6
  3. 26 4月, 2012 1 次提交
  4. 24 4月, 2012 1 次提交
  5. 22 4月, 2012 6 次提交
    • N
      tcp: move duplicate code from tcp_v4_init_sock()/tcp_v6_init_sock() · 900f65d3
      Neal Cardwell 提交于
      This commit moves the (substantial) common code shared between
      tcp_v4_init_sock() and tcp_v6_init_sock() to a new address-family
      independent function, tcp_init_sock().
      
      Centralizing this functionality should help avoid drift issues,
      e.g. where the IPv4 side is updated without a corresponding update to
      IPv6. There was already some drift: IPv4 initialized snd_cwnd to
      TCP_INIT_CWND, while the IPv6 side was still initializing snd_cwnd to
      2 (in this case it should not matter, since snd_cwnd is also
      initialized in tcp_init_metrics(), but the general risks and
      maintenance overhead remain).
      
      When diffing the old and new code, note that new tcp_init_sock()
      function uses the order of steps from the tcp_v4_init_sock()
      implementation (the order is slightly different in
      tcp_v6_init_sock()).
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      900f65d3
    • P
      tcp: Repair connection-time negotiated parameters · b139ba4e
      Pavel Emelyanov 提交于
      There are options, which are set up on a socket while performing
      TCP handshake. Need to resurrect them on a socket while repairing.
      A new sockoption accepts a buffer and parses it. The buffer should
      be CODE:VALUE sequence of bytes, where CODE is standard option
      code and VALUE is the respective value.
      
      Only 4 options should be handled on repaired socket.
      
      To read 3 out of 4 of these options the TCP_INFO sockoption can be
      used. An ability to get the last one (the mss_clamp) was added by
      the previous patch.
      
      Now the restore. Three of these options -- timestamp_ok, mss_clamp
      and snd_wscale -- are just restored on a coket.
      
      The sack_ok flags has 2 issues. First, whether or not to do sacks
      at all. This flag is just read and set back. No other sack  info is
      saved or restored, since according to the standart and the code
      dropping all sack-ed segments is OK, the sender will resubmit them
      again, so after the repair we will probably experience a pause in
      connection. Next, the fack bit. It's just set back on a socket if
      the respective sysctl is set. No collected stats about packets flow
      is preserved. As far as I see (plz, correct me if I'm wrong) the
      fack-based congestion algorithm survives dropping all of the stats
      and repairs itself eventually, probably losing the performance for
      that period.
      Signed-off-by: NPavel Emelyanov <xemul@openvz.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b139ba4e
    • P
      tcp: Report mss_clamp with TCP_MAXSEG option in repair mode · 5e6a3ce6
      Pavel Emelyanov 提交于
      The mss_clamp is the only connection-time negotiated option which
      cannot be obtained from the user space. Make the TCP_MAXSEG sockopt
      report one in the repair mode.
      Signed-off-by: NPavel Emelyanov <xemul@openvz.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5e6a3ce6
    • P
      tcp: Repair socket queues · c0e88ff0
      Pavel Emelyanov 提交于
      Reading queues under repair mode is done with recvmsg call.
      The queue-under-repair set by TCP_REPAIR_QUEUE option is used
      to determine which queue should be read. Thus both send and
      receive queue can be read with this.
      
      Caller must pass the MSG_PEEK flag.
      
      Writing to queues is done with sendmsg call and yet again --
      the repair-queue option can be used to push data into the
      receive queue.
      
      When putting an skb into receive queue a zero tcp header is
      appented to its head to address the tcp_hdr(skb)->syn and
      the ->fin checks by the (after repair) tcp_recvmsg. These
      flags flags are both set to zero and that's why.
      
      The fin cannot be met in the queue while reading the source
      socket, since the repair only works for closed/established
      sockets and queueing fin packet always changes its state.
      
      The syn in the queue denotes that the respective skb's seq
      is "off-by-one" as compared to the actual payload lenght. Thus,
      at the rcv queue refill we can just drop this flag and set the
      skb's sequences to precice values.
      
      When the repair mode is turned off, the write queue seqs are
      updated so that the whole queue is considered to be 'already sent,
      waiting for ACKs' (write_seq = snd_nxt <= snd_una). From the
      protocol POV the send queue looks like it was sent, but the data
      between the write_seq and snd_nxt is lost in the network.
      
      This helps to avoid another sockoption for setting the snd_nxt
      sequence. Leaving the whole queue in a 'not yet sent' state (as
      it will be after sendmsg-s) will not allow to receive any acks
      from the peer since the ack_seq will be after the snd_nxt. Thus
      even the ack for the window probe will be dropped and the
      connection will be 'locked' with the zero peer window.
      Signed-off-by: NPavel Emelyanov <xemul@parallels.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c0e88ff0
    • P
      tcp: Initial repair mode · ee995283
      Pavel Emelyanov 提交于
      This includes (according the the previous description):
      
      * TCP_REPAIR sockoption
      
      This one just puts the socket in/out of the repair mode.
      Allowed for CAP_NET_ADMIN and for closed/establised sockets only.
      When repair mode is turned off and the socket happens to be in
      the established state the window probe is sent to the peer to
      'unlock' the connection.
      
      * TCP_REPAIR_QUEUE sockoption
      
      This one sets the queue which we're about to repair. The
      'no-queue' is set by default.
      
      * TCP_QUEUE_SEQ socoption
      
      Sets the write_seq/rcv_nxt of a selected repaired queue.
      Allowed for TCP_CLOSE-d sockets only. When the socket changes
      its state the other seq-s are changed by the kernel according
      to the protocol rules (most of the existing code is actually
      reused).
      
      * Ability to forcibly bind a socket to a port
      
      The sk->sk_reuse is set to SK_FORCE_REUSE.
      
      * Immediate connect modification
      
      The connect syscall initializes the connection, then directly jumps
      to the code which finalizes it.
      
      * Silent close modification
      
      The close just aborts the connection (similar to SO_LINGER with 0
      time) but without sending any FIN/RST-s to peer.
      Signed-off-by: NPavel Emelyanov <xemul@parallels.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ee995283
    • P
      tcp: Move code around · 370816ae
      Pavel Emelyanov 提交于
      This is just the preparation patch, which makes the needed for
      TCP repair code ready for use.
      Signed-off-by: NPavel Emelyanov <xemul@parallels.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      370816ae
  6. 16 4月, 2012 1 次提交
  7. 11 4月, 2012 2 次提交
    • E
      tcp: avoid order-1 allocations on wifi and tx path · a21d4572
      Eric Dumazet 提交于
      Marc Merlin reported many order-1 allocations failures in TX path on its
      wireless setup, that dont make any sense with MTU=1500 network, and non
      SG capable hardware.
      
      After investigation, it turns out TCP uses sk_stream_alloc_skb() and
      used as a convention skb_tailroom(skb) to know how many bytes of data
      payload could be put in this skb (for non SG capable devices)
      
      Note : these skb used kmalloc-4096 (MTU=1500 + MAX_HEADER +
      sizeof(struct skb_shared_info) being above 2048)
      
      Later, mac80211 layer need to add some bytes at the tail of skb
      (IEEE80211_ENCRYPT_TAILROOM = 18 bytes) and since no more tailroom is
      available has to call pskb_expand_head() and request order-1
      allocations.
      
      This patch changes sk_stream_alloc_skb() so that only
      sk->sk_prot->max_header bytes of headroom are reserved, and use a new
      skb field, avail_size to hold the data payload limit.
      
      This way, order-0 allocations done by TCP stack can leave more than 2 KB
      of tailroom and no more allocation is performed in mac80211 layer (or
      any layer needing some tailroom)
      
      avail_size is unioned with mark/dropcount, since mark will be set later
      in IP stack for output packets. Therefore, skb size is unchanged.
      Reported-by: NMarc MERLIN <marc@merlins.org>
      Tested-by: NMarc MERLIN <marc@merlins.org>
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a21d4572
    • E
      tcp: restore correct limit · 5fb84b14
      Eric Dumazet 提交于
      Commit c43b874d (tcp: properly initialize tcp memory limits) tried
      to fix a regression added in commits 4acb4190 & 3dc43e3e,
      but still get it wrong.
      
      Result is machines with low amount of memory have too small tcp_rmem[2]
      value and slow tcp receives : Per socket limit being 1/1024 of memory
      instead of 1/128 in old kernels, so rcv window is capped to small
      values.
      
      Fix this to match comment and previous behavior.
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Glauber Costa <glommer@parallels.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5fb84b14
  8. 06 4月, 2012 2 次提交
  9. 04 4月, 2012 1 次提交
    • E
      tcp: allow splice() to build full TSO packets · 2f533844
      Eric Dumazet 提交于
      vmsplice()/splice(pipe, socket) call do_tcp_sendpages() one page at a
      time, adding at most 4096 bytes to an skb. (assuming PAGE_SIZE=4096)
      
      The call to tcp_push() at the end of do_tcp_sendpages() forces an
      immediate xmit when pipe is not already filled, and tso_fragment() try
      to split these skb to MSS multiples.
      
      4096 bytes are usually split in a skb with 2 MSS, and a remaining
      sub-mss skb (assuming MTU=1500)
      
      This makes slow start suboptimal because many small frames are sent to
      qdisc/driver layers instead of big ones (constrained by cwnd and packets
      in flight of course)
      
      In fact, applications using sendmsg() (adding an additional memory copy)
      instead of vmsplice()/splice()/sendfile() are a bit faster because of
      this anomaly, especially if serving small files in environments with
      large initial [c]wnd.
      
      Call tcp_push() only if MSG_MORE is not set in the flags parameter.
      
      This bit is automatically provided by splice() internals but for the
      last page, or on all pages if user specified SPLICE_F_MORE splice()
      flag.
      
      In some workloads, this can reduce number of sent logical packets by an
      order of magnitude, making zero-copy TCP actually faster than
      one-copy :)
      Reported-by: NTom Herbert <therbert@google.com>
      Cc: Nandita Dukkipati <nanditad@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: H.K. Jerry Chu <hkchu@google.com>
      Cc: Maciej Żenczykowski <maze@google.com>
      Cc: Mahesh Bandewar <maheshb@google.com>
      Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail&gt;com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2f533844
  10. 13 3月, 2012 1 次提交
  11. 12 3月, 2012 1 次提交
    • J
      net: Convert printks to pr_<level> · 058bd4d2
      Joe Perches 提交于
      Use a more current kernel messaging style.
      
      Convert a printk block to print_hex_dump.
      Coalesce formats, align arguments.
      Use %s, __func__ instead of embedding function names.
      
      Some messages that were prefixed with <foo>_close are
      now prefixed with <foo>_fini.  Some ah4 and esp messages
      are now not prefixed with "ip ".
      
      The intent of this patch is to later add something like
        #define pr_fmt(fmt) "IPv4: " fmt.
      to standardize the output messages.
      
      Text size is trivially reduced. (x86-32 allyesconfig)
      
      $ size net/ipv4/built-in.o*
         text	   data	    bss	    dec	    hex	filename
       887888	  31558	 249696	1169142	 11d6f6	net/ipv4/built-in.o.new
       887934	  31558	 249800	1169292	 11d78c	net/ipv4/built-in.o.old
      Signed-off-by: NJoe Perches <joe@perches.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      058bd4d2
  12. 14 2月, 2012 1 次提交
  13. 03 2月, 2012 1 次提交
  14. 02 2月, 2012 1 次提交
    • A
      net: Disambiguate kernel message · efcdbf24
      Arun Sharma 提交于
      Some of our machines were reporting:
      
      TCP: too many of orphaned sockets
      
      even when the number of orphaned sockets was well below the
      limit.
      
      We print a different message depending on whether we're out
      of TCP memory or there are too many orphaned sockets.
      
      Also move the check out of line and cleanup the messages
      that were printed.
      Signed-off-by: NArun Sharma <asharma@fb.com>
      Suggested-by: NMohan Srinivasan <mohan@fb.com>
      Cc: netdev@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Cc: David Miller <davem@davemloft.net>
      Cc: Glauber Costa <glommer@parallels.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Joe Perches <joe@perches.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      efcdbf24
  15. 31 1月, 2012 1 次提交
  16. 13 12月, 2011 1 次提交
  17. 06 12月, 2011 1 次提交
  18. 05 12月, 2011 1 次提交
  19. 30 11月, 2011 1 次提交
    • E
      tcp: avoid frag allocation for small frames · f07d960d
      Eric Dumazet 提交于
      tcp_sendmsg() uses select_size() helper to choose skb head size when a
      new skb must be allocated.
      
      If GSO is enabled for the socket, current strategy is to force all
      payload data to be outside of headroom, in PAGE fragments.
      
      This strategy is not welcome for small packets, wasting memory.
      
      Experiments show that best results are obtained when using 2048 bytes
      for skb head (This includes the skb overhead and various headers)
      
      This patch provides better len/truesize ratios for packets sent to
      loopback device, and reduce memory needs for in-flight loopback packets,
      particularly on arches with big pages.
      
      If a sender sends many 1-byte packets to an unresponsive application,
      receiver rmem_alloc will grow faster and will stop queuing these packets
      sooner, or will collapse its receive queue to free excess memory.
      
      netperf -t TCP_RR results are improved by ~4 %, and many workloads are
      improved as well (tbench, mysql...)
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f07d960d
  20. 29 11月, 2011 1 次提交
  21. 17 11月, 2011 1 次提交
  22. 25 10月, 2011 1 次提交
  23. 24 10月, 2011 1 次提交
  24. 21 10月, 2011 1 次提交
  25. 19 10月, 2011 1 次提交
  26. 04 10月, 2011 1 次提交
    • E
      tcp: report ECN_SEEN in tcp_info · b5c5693b
      Eric Dumazet 提交于
      Allows ss command (iproute2) to display "ecnseen" if at least one packet
      with ECT(0) or ECT(1) or ECN was received by this socket.
      
      "ecn" means ECN was negotiated at session establishment (TCP level)
      
      "ecnseen" means we received at least one packet with ECT fields set (IP
      level)
      
      ss -i
      ...
      ESTAB      0      0   192.168.20.110:22  192.168.20.144:38016
      ino:5950 sk:f178e400
      	 mem:(r0,w0,f0,t0) ts sack ecn ecnseen bic wscale:7,8 rto:210
      rtt:12.5/7.5 cwnd:10 send 9.3Mbps rcv_space:14480
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b5c5693b
  27. 28 9月, 2011 1 次提交
  28. 17 9月, 2011 1 次提交
  29. 25 8月, 2011 1 次提交
  30. 07 7月, 2011 1 次提交
  31. 05 4月, 2011 1 次提交
    • T
      net: Allow no-cache copy from user on transmit · c6e1a0d1
      Tom Herbert 提交于
      This patch uses __copy_from_user_nocache on transmit to bypass data
      cache for a performance improvement.  skb_add_data_nocache and
      skb_copy_to_page_nocache can be called by sendmsg functions to use
      this feature, initial support is in tcp_sendmsg.  This functionality is
      configurable per device using ethtool.
      
      Presumably, this feature would only be useful when the driver does
      not touch the data.  The feature is turned on by default if a device
      indicates that it does some form of checksum offload; it is off by
      default for devices that do no checksum offload or indicate no checksum
      is necessary.  For the former case copy-checksum is probably done
      anyway, in the latter case the device is likely loopback in which case
      the no cache copy is probably not beneficial.
      
      This patch was tested using 200 instances of netperf TCP_RR with
      1400 byte request and one byte reply.  Platform is 16 core AMD x86.
      
      No-cache copy disabled:
         672703 tps, 97.13% utilization
         50/90/99% latency:244.31 484.205 1028.41
      
      No-cache copy enabled:
         702113 tps, 96.16% utilization,
         50/90/99% latency 238.56 467.56 956.955
      
      Using 14000 byte request and response sizes demonstrate the
      effects more dramatically:
      
      No-cache copy disabled:
         79571 tps, 34.34 %utlization
         50/90/95% latency 1584.46 2319.59 5001.76
      
      No-cache copy enabled:
         83856 tps, 34.81% utilization
         50/90/95% latency 2508.42 2622.62 2735.88
      
      Note especially the effect on latency tail (95th percentile).
      
      This seems to provide a nice performance improvement and is
      consistent in the tests I ran.  Presumably, this would provide
      the greatest benfits in the presence of an application workload
      stressing the cache and a lot of transmit data happening.
      Signed-off-by: NTom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c6e1a0d1