1. 29 Sep 2014, 4 commits
    • net: tcp: add flag for ca to indicate that ECN is required · 30e502a3
      Committed by Daniel Borkmann
      This patch adds a flag to TCP congestion control algorithms that lets
      an algorithm request that IPv4/IPv6 TCP sockets be marked as ECN
      capable, that is, ECT(0), when the algorithm requires it.

      It is currently used and needed in DataCenter TCP (DCTCP), as it
      requires both peers to assert ECT on all IP packets sent - it uses ECN
      feedback (i.e. CE, Congestion Experienced information) from switches
      inside the data center to derive feedback to the end hosts.
      
      Therefore, simply add a new flag to icsk_ca_ops. Note that DCTCP's
      algorithm/behaviour slightly diverges from RFC 3168, so this is only
      enabled if the assigned congestion control ops module has explicitly
      requested it. That way, the logic is tightly coupled to the congestion
      control ops that actually asked for it.
      
      Joint work with Florian Westphal and Glenn Judd.
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Glenn Judd <glenn.judd@morganstanley.com>
      Acked-by: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      30e502a3
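
      To make the mechanism above concrete, here is a minimal userspace
      sketch of a per-algorithm flag gating ECT(0) marking. The flag name
      mirrors the commit; its value, the structures, and the helper are
      illustrative, not kernel source.

      /* Illustrative model of a congestion-ops flag that requests ECN marking. */
      #include <stdbool.h>
      #include <stdio.h>

      #define TCP_CONG_NEEDS_ECN 0x2   /* value is illustrative, not the kernel's */

      struct ca_ops {
          const char *name;
          unsigned int flags;
      };

      static const struct ca_ops dctcp_ops = { "dctcp", TCP_CONG_NEEDS_ECN };
      static const struct ca_ops reno_ops  = { "reno",  0 };

      /* Decide whether outgoing IP packets of this socket should carry ECT(0). */
      static bool ca_needs_ecn(const struct ca_ops *ops)
      {
          return ops->flags & TCP_CONG_NEEDS_ECN;
      }

      int main(void)
      {
          printf("%s marks ECT(0): %d\n", dctcp_ops.name, ca_needs_ecn(&dctcp_ops));
          printf("%s marks ECT(0): %d\n", reno_ops.name, ca_needs_ecn(&reno_ops));
          return 0;
      }
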
    • net: tcp: assign tcp cong_ops when tcp sk is created · 55d8694f
      Committed by Florian Westphal
      Split assignment and initialization into two separate functions.
      
      This is required by followup patches that add Datacenter TCP
      (DCTCP) congestion control algorithm - we need to be able to
      determine if the connection is moderated by DCTCP before the
      3WHS has finished.
      
      As we walk the list of available congestion controls during
      assignment, Reno is always guaranteed to be present, as it is built in
      unconditionally. Since the assignment now happens early, the Reno
      alias tcp_init_congestion_ops no longer has a real use and can be
      removed.
      
      Actual usage of the congestion control operations is made after the
      3WHS has finished; in some cases, however, get_info() can be accessed
      via diag if implemented, therefore we need to zero out the private
      area for those modules.
      
      Joint work with Daniel Borkmann and Glenn Judd.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: Glenn Judd <glenn.judd@morganstanley.com>
      Acked-by: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      55d8694f
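
      A minimal sketch of the split described above: assignment picks an ops
      pointer early (falling back to a built-in Reno) and zeroes the
      per-connection private area so diag-style readers never see stale
      data, while initialization is deferred until after the 3WHS. The
      structures and list here are illustrative stand-ins, not kernel code.

      #include <string.h>
      #include <stdio.h>

      #define CA_PRIV_SIZE 64

      struct ca_ops {
          const char *name;
      };

      static const struct ca_ops reno  = { "reno" };
      static const struct ca_ops dctcp = { "dctcp" };
      static const struct ca_ops *available[] = { &dctcp, &reno };  /* reno always present */

      struct sock_ca {
          const struct ca_ops *ops;
          unsigned char priv[CA_PRIV_SIZE];
      };

      /* Step 1: assignment only -- pick ops and clear the private state. */
      static void ca_assign(struct sock_ca *sk, const char *wanted)
      {
          const struct ca_ops *chosen = &reno;            /* guaranteed fallback */
          for (size_t i = 0; i < sizeof(available) / sizeof(available[0]); i++)
              if (strcmp(available[i]->name, wanted) == 0)
                  chosen = available[i];
          sk->ops = chosen;
          memset(sk->priv, 0, sizeof(sk->priv));          /* diag may read this early */
      }

      /* Step 2: initialization would happen later, once the 3WHS completes. */

      int main(void)
      {
          struct sock_ca sk;
          ca_assign(&sk, "dctcp");
          printf("assigned: %s\n", sk.ops->name);
          return 0;
      }
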
    • tcp: change tcp_skb_pcount() location · cd7d8498
      Committed by Eric Dumazet
      Our goal is to perform no more than one cache line access per skb in
      a write or receive queue when doing the various walks.
      
      After recent TCP_SKB_CB() reorganizations, it is almost done.
      
      The last part is tcp_skb_pcount(), which currently uses
      skb_shinfo(skb)->gso_segs. This is a terrible choice, because it needs
      3 cache lines in the current kernel (skb->head, skb->end, and
      shinfo->gso_segs are all in 3 different cache lines, far from skb->cb).
      
      This very simple patch reuses the space currently taken by tcp_tw_isn,
      which is needed only on the input path, as tcp_skb_pcount() is only
      needed for skbs stored in the write queue.

      This considerably speeds up tcp_ack(), provided we avoid touching
      shinfo->tx_flags to get SKBTX_ACK_TSTAMP, which seems possible.
      
      This also speeds up all sack processing in general.
      
      This speeds up tcp_sendmsg() because it no longer has to access/dirty
      shinfo.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      cd7d8498
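
      A sketch of the space-reuse idea: a field needed only on the input
      path (the timewait ISN) shares storage with the segment count needed
      only on the output path, so the pcount lives next to seq/end_seq
      instead of behind skb_shinfo(). The struct below is an illustrative
      model; field names and sizes are assumptions, not the kernel layout.

      #include <stdint.h>
      #include <stdio.h>

      struct tcp_cb {
          uint32_t seq;
          uint32_t end_seq;
          union {
              uint32_t tw_isn;           /* input path only  */
              struct {
                  uint16_t gso_segs;     /* output path only */
                  uint16_t gso_size;
              } tx;
          };
      };

      static unsigned int skb_pcount(const struct tcp_cb *cb)
      {
          return cb->tx.gso_segs;        /* one cache line, next to seq/end_seq */
      }

      int main(void)
      {
          struct tcp_cb cb = { .seq = 1, .end_seq = 2897 };
          cb.tx.gso_segs = 2;
          printf("pcount=%u, cb size=%zu bytes\n", skb_pcount(&cb), sizeof(cb));
          return 0;
      }
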
    • tcp: better TCP_SKB_CB layout to reduce cache line misses · 971f10ec
      Committed by Eric Dumazet
      TCP maintains lists of skbs in the write queue and in the receive
      queues (in-order and out-of-order queues).

      Scanning these lists in both the input and output paths usually
      requires access to skb->next, TCP_SKB_CB(skb)->seq, and
      TCP_SKB_CB(skb)->end_seq.
      
      These fields are currently in two different cache lines, meaning we
      waste a lot of memory bandwidth when these queues are big and flows
      experience either packet drops or packet reordering.
      
      We can move TCP_SKB_CB(skb)->header to the end of TCP_SKB_CB, because
      this header is not used in the fast path. This allows TCP to search
      the skb lists much faster.

      Even with regular flows, we save one cache line miss in the fast path.
      
      Thanks to Christoph Paasch for noticing that we need to clean up
      skb->cb[] (IPCB/IP6CB) before entering the IP stack on the tx path,
      and that I forgot the IPCB use in tcp_v4_hnd_req() and
      tcp_v4_save_options().
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      971f10ec
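
      The effect of the reordering can be illustrated with offsetof():
      moving a cold header block behind the hot fields keeps seq/end_seq at
      the front of the control block. The layouts below are illustrative,
      not the actual TCP_SKB_CB definition.

      #include <stddef.h>
      #include <stdint.h>
      #include <stdio.h>

      struct cb_cold_first {           /* header first: hot fields pushed out */
          uint8_t  header[24];
          uint32_t seq;
          uint32_t end_seq;
      };

      struct cb_hot_first {            /* hot fields first, header at the end */
          uint32_t seq;
          uint32_t end_seq;
          uint8_t  header[24];
      };

      int main(void)
      {
          printf("cold-first: end_seq at offset %zu\n",
                 offsetof(struct cb_cold_first, end_seq));
          printf("hot-first : end_seq at offset %zu\n",
                 offsetof(struct cb_hot_first, end_seq));
          return 0;
      }
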
  2. 06 Sep 2014, 2 commits
  3. 15 Aug 2014, 3 commits
    • tcp: don't allow syn packets without timestamps to pass tcp_tw_recycle logic · a26552af
      Committed by Hannes Frederic Sowa
      tcp_tw_recycle heavily relies on TCP timestamps to build a per-host
      ordering of incoming connections and teardowns without the need to
      hold state on a specific quadruple for TCP_TIMEWAIT_LEN, but only for
      the last measured RTO. To do so, we keep the last seen timestamp in a
      per-host indexed data structure and verify that the incoming timestamp
      in a connection request is strictly greater than the one saved during
      the last connection teardown. Thus we can verify later on that no old
      data packets will be accepted by the new connection.
      
      When moving a socket to time-wait state we already verify whether
      timestamps were seen on the connection. Only if that was the case do
      we let the time-wait socket expire after the RTO; otherwise the normal
      TCP_TIMEWAIT_LEN is used. But we don't verify this on incoming SYN
      packets. If a connection teardown happened less than TCP_PAWS_MSL
      seconds ago and no timestamps are present, we cannot guarantee that
      data packets from the old connection will not be accepted, so we
      should drop such a SYN packet. This patch closes this loophole.
      
      Please note, this patch does not make tcp_tw_recycle in any way more
      usable but only adds another safety check:
      Sporadic drops of SYN packets because of reordering in the network or
      in the socket backlog queues can happen. Users behind NAT trying to
      connect to a tcp_tw_recycle-enabled server can get caught in
      blackholes and their connection requests may regularly get dropped,
      because hosts behind an address translator don't have synchronized
      TCP timestamp clocks. tcp_tw_recycle cannot work if peers don't have
      TCP timestamps enabled.
      
      In general, use of tcp_tw_recycle is discouraged.
      
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Florian Westphal <fw@strlen.de>
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a26552af
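
      A sketch of the added safety check: with recycling enabled, a SYN that
      carries no timestamps from a peer whose previous connection was torn
      down within the PAWS window cannot be validated and is dropped. The
      constant and helper names are illustrative; only the decision logic
      follows the commit text.

      #include <stdbool.h>
      #include <stdio.h>

      #define PAWS_MSL_SECONDS 60   /* illustrative stand-in for TCP_PAWS_MSL */

      struct peer_state {
          long last_teardown;       /* seconds; 0 if unknown */
      };

      static bool accept_syn(bool tw_recycle, bool syn_has_tstamp,
                             const struct peer_state *peer, long now)
      {
          if (!tw_recycle || syn_has_tstamp)
              return true;
          /* No timestamps: refuse if an old connection ended too recently. */
          if (peer->last_teardown && now - peer->last_teardown < PAWS_MSL_SECONDS)
              return false;
          return true;
      }

      int main(void)
      {
          struct peer_state p = { .last_teardown = 1000 };
          printf("SYN w/o timestamps, 10s after teardown: accept=%d\n",
                 accept_syn(true, false, &p, 1010));
          printf("SYN with timestamps: accept=%d\n",
                 accept_syn(true, true, &p, 1010));
          return 0;
      }
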
    • tcp: fix tcp_release_cb() to dispatch via address family for mtu_reduced() · 4fab9071
      Committed by Neal Cardwell
      Make sure we use the correct address-family-specific function for
      handling MTU reductions from within tcp_release_cb().
      
      Previously, AF_INET6 sockets incorrectly always used the IPv6 code
      path, even though they were sometimes handling IPv4 traffic and thus
      had an IPv4 dst.
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Diagnosed-by: Willem de Bruijn <willemb@google.com>
      Fixes: 563d34d0 ("tcp: dont drop MTU reduction indications")
      Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4fab9071
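
      A sketch of the dispatch idea: the deferred MTU-reduction work goes
      through a per-socket ops table rather than a hard-coded IPv6 handler,
      so an AF_INET6 socket that currently has an IPv4 dst reaches the IPv4
      routine. The ops table below is an illustrative stand-in for the
      kernel's per-family callbacks.

      #include <stdio.h>

      struct af_ops {
          void (*mtu_reduced)(void *sk);
      };

      static void ipv4_mtu_reduced(void *sk) { (void)sk; puts("ipv4 path"); }
      static void ipv6_mtu_reduced(void *sk) { (void)sk; puts("ipv6 path"); }

      static const struct af_ops v4_ops = { ipv4_mtu_reduced };
      static const struct af_ops v6_ops = { ipv6_mtu_reduced };

      struct sock_stub {
          const struct af_ops *af_ops;   /* chosen per the actual dst in use */
      };

      static void release_cb(struct sock_stub *sk)
      {
          sk->af_ops->mtu_reduced(sk);   /* dispatch, never hard-code a family */
      }

      int main(void)
      {
          struct sock_stub v6_sock_with_v4_dst = { .af_ops = &v4_ops };
          struct sock_stub plain_v6_sock      = { .af_ops = &v6_ops };
          release_cb(&v6_sock_with_v4_dst);
          release_cb(&plain_v6_sock);
          return 0;
      }
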
    • tcp: don't use timestamp from repaired skb-s to calculate RTT (v2) · 9d186cac
      Committed by Andrey Vagin
      We don't know the right timestamp for repaired skbs. Wrong RTT
      estimations aren't good, because some congestion modules heavily
      depend on them.
      
      This patch adds the TCPCB_REPAIRED flag, which is included in
      TCPCB_RETRANS.
      
      Thanks to Eric for the advice on how to fix this issue.
      
      This patch fixes the warning:
      [  879.562947] WARNING: CPU: 0 PID: 2825 at net/ipv4/tcp_input.c:3078 tcp_ack+0x11f5/0x1380()
      [  879.567253] CPU: 0 PID: 2825 Comm: socket-tcpbuf-l Not tainted 3.16.0-next-20140811 #1
      [  879.567829] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
      [  879.568177]  0000000000000000 00000000c532680c ffff880039643d00 ffffffff817aa2d2
      [  879.568776]  0000000000000000 ffff880039643d38 ffffffff8109afbd ffff880039d6ba80
      [  879.569386]  ffff88003a449800 000000002983d6bd 0000000000000000 000000002983d6bc
      [  879.569982] Call Trace:
      [  879.570264]  [<ffffffff817aa2d2>] dump_stack+0x4d/0x66
      [  879.570599]  [<ffffffff8109afbd>] warn_slowpath_common+0x7d/0xa0
      [  879.570935]  [<ffffffff8109b0ea>] warn_slowpath_null+0x1a/0x20
      [  879.571292]  [<ffffffff816d0a05>] tcp_ack+0x11f5/0x1380
      [  879.571614]  [<ffffffff816d10bd>] tcp_rcv_established+0x1ed/0x710
      [  879.571958]  [<ffffffff816dc9da>] tcp_v4_do_rcv+0x10a/0x370
      [  879.572315]  [<ffffffff81657459>] release_sock+0x89/0x1d0
      [  879.572642]  [<ffffffff816c81a0>] do_tcp_setsockopt.isra.36+0x120/0x860
      [  879.573000]  [<ffffffff8110a52e>] ? rcu_read_lock_held+0x6e/0x80
      [  879.573352]  [<ffffffff816c8912>] tcp_setsockopt+0x32/0x40
      [  879.573678]  [<ffffffff81654ac4>] sock_common_setsockopt+0x14/0x20
      [  879.574031]  [<ffffffff816537b0>] SyS_setsockopt+0x80/0xf0
      [  879.574393]  [<ffffffff817b40a9>] system_call_fastpath+0x16/0x1b
      [  879.574730] ---[ end trace a17cbc38eb8c5c00 ]---
      
      v2: move the setting of skb->when for repaired skbs into
          tcp_write_xmit(), where it is set for other skbs.
      
      Fixes: 431a9124 ("tcp: timestamp SYN+DATA messages")
      Fixes: 740b0f18 ("tcp: switch rtt estimations to usec resolution")
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Signed-off-by: Andrey Vagin <avagin@openvz.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9d186cac
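
      A sketch of the flag arrangement: repaired skbs get their own bit, and
      that bit is folded into the retransmit mask so the usual "never sample
      RTT from retransmitted data" rule also covers repaired data. Bit
      values are illustrative, not the kernel's.

      #include <stdbool.h>
      #include <stdio.h>

      #define TCPCB_SACKED_RETRANS 0x02
      #define TCPCB_EVER_RETRANS   0x80
      #define TCPCB_REPAIRED       0x10
      #define TCPCB_RETRANS (TCPCB_SACKED_RETRANS | TCPCB_EVER_RETRANS | TCPCB_REPAIRED)

      static bool can_sample_rtt(unsigned char sacked_flags)
      {
          return !(sacked_flags & TCPCB_RETRANS);
      }

      int main(void)
      {
          printf("normal skb : sample RTT = %d\n", can_sample_rtt(0));
          printf("repaired   : sample RTT = %d\n", can_sample_rtt(TCPCB_REPAIRED));
          return 0;
      }
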
  4. 06 Aug 2014, 1 commit
    • tcp: reduce spurious retransmits due to transient SACK reneging · 5ae344c9
      Committed by Neal Cardwell
      This commit reduces spurious retransmits due to apparent SACK reneging
      by only reacting to SACK reneging that persists for a short delay.
      
      When a sequence space hole at snd_una is filled, some TCP receivers
      send a series of ACKs as they apparently scan their out-of-order queue
      and cumulatively ACK all the packets that have now been consecutively
      received. This is essentially misbehavior B in "Misbehaviors in TCP
      SACK generation" ACM SIGCOMM Computer Communication Review, April
      2011, so we suspect that this is from several common OSes (Windows
      2000, Windows Server 2003, Windows XP). However, this issue has also
      been seen in other cases, e.g. the netdev thread "TCP being hoodwinked
      into spurious retransmissions by lack of timestamps?" from March 2014,
      where the receiver was thought to be a BSD box.
      
      Since snd_una would temporarily be adjacent to a previously SACKed
      range in these scenarios, this receiver behavior triggered the Linux
      SACK reneging code path in the sender. This led the sender to clear
      the SACK scoreboard, enter CA_Loss, and spuriously retransmit
      (potentially) every packet from the entire write queue at line rate
      just a few milliseconds before the ACK for each packet arrives at the
      sender.
      
      To avoid such situations, now when a sender sees apparent reneging it
      does not yet retransmit, but rather adjusts the RTO timer to give the
      receiver a little time (max(RTT/2, 10ms)) to send us some more ACKs
      that will restore sanity to the SACK scoreboard. If the reneging
      persists until this RTO then, as before, we clear the SACK scoreboard
      and enter CA_Loss.
      
      A 10ms delay tolerates a receiver sending such a stream of ACKs at
      56Kbit/sec. And to allow for receivers with slower or more congested
      paths, we wait for at least RTT/2.
      
      We validated the resulting max(RTT/2, 10ms) delay formula with a mix
      of North American and South American Google web server traffic, and
      found that for ACKs displaying transient reneging:
      
       (1) 90% of inter-ACK delays were less than 10ms
       (2) 99% of inter-ACK delays were less than RTT/2
      
      In tests on Google web servers this commit reduced reneging events by
      75%-90% (as measured by the TcpExtTCPSACKReneging counter), without
      any measurable impact on latency for user HTTP and SPDY requests.
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5ae344c9
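
      The delay rule is easy to state in code: on apparent reneging, arm the
      retransmit timer for max(RTT/2, 10ms) instead of clearing the SACK
      scoreboard immediately. A minimal sketch, with illustrative units and
      helper names:

      #include <stdio.h>

      static unsigned int reneging_delay_us(unsigned int srtt_us)
      {
          unsigned int half_rtt = srtt_us / 2;
          unsigned int floor_us = 10 * 1000;           /* 10 ms floor */
          return half_rtt > floor_us ? half_rtt : floor_us;
      }

      int main(void)
      {
          /* LAN-ish RTT: the 10 ms floor dominates. */
          printf("srtt=2ms   -> wait %u us\n", reneging_delay_us(2000));
          /* WAN RTT: RTT/2 dominates, tolerating slower ACK trains. */
          printf("srtt=120ms -> wait %u us\n", reneging_delay_us(120000));
          return 0;
      }
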
  5. 16 Jul 2014, 1 commit
  6. 08 Jul 2014, 1 commit
    • tcp: switch snt_synack back to measuring transmit time of first SYNACK · 86c6a2c7
      Committed by Neal Cardwell
      Always store in snt_synack the time at which the server received the
      first client SYN and attempted to send the first SYNACK.
      
      Recent commit aa27fc50 ("tcp: tcp_v[46]_conn_request: fix snt_synack
      initialization") resolved an inconsistency between IPv4 and IPv6 in
      the initialization of snt_synack. This commit brings back the idea
      from 843f4a55 ("tcp: use tcp_v4_send_synack on first SYN-ACK"), which
      aimed for the original behavior of snt_synack from the commit that
      added it, 9ad7c049 ("tcp: RFC2988bis + taking RTT sample from 3WHS
      for the passive open side") in v3.1.
      
      In addition to being simpler (and probably a tiny bit faster),
      unconditionally storing the time of the first SYNACK attempt has been
      useful because it allows calculating a performance metric quantifying
      how long it took to establish a passive TCP connection.
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Cc: Octavian Purdila <octavian.purdila@intel.com>
      Cc: Jerry Chu <hkchu@google.com>
      Acked-by: Octavian Purdila <octavian.purdila@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      86c6a2c7
  7. 28 Jun 2014, 10 commits
  8. 18 Jun 2014, 1 commit
  9. 11 Jun 2014, 1 commit
  10. 23 May 2014, 1 commit
    • tcp: make cwnd-limited checks measurement-based, and gentler · ca8a2263
      Committed by Neal Cardwell
      Experience with the recent e114a710 ("tcp: fix cwnd limited
      checking to improve congestion control") has shown that there are
      common cases where that commit can cause cwnd to be much larger than
      necessary. This leads to TSO autosizing cooking skbs that are too
      large, among other things.
      
      The main problems seemed to be:
      
      (1) That commit attempted to predict the future behavior of the
      connection by looking at the write queue (if TSO or TSQ limit
      sending). That prediction sometimes overestimated future outstanding
      packets.
      
      (2) That commit always allowed cwnd to grow to twice the number of
      outstanding packets (even in congestion avoidance, where this is not
      needed).
      
      This commit improves both of these, by:
      
      (1) Switching to a measurement-based approach where we explicitly
      track the largest number of packets in flight during the past window
      ("max_packets_out"), and remember whether we were cwnd-limited at the
      moment we finished sending that flight.
      
      (2) Only allowing cwnd to grow to twice the number of outstanding
      packets ("max_packets_out") in slow start. In congestion avoidance
      mode we now only allow cwnd to grow if it was fully utilized.
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ca8a2263
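
      A simplified model of the measurement-based rule: track the largest
      flight seen in the current window and whether that flight filled cwnd,
      allow 2x headroom only in slow start, and in congestion avoidance grow
      only when cwnd was fully used. Field names follow the commit text;
      everything else is an illustrative sketch, not kernel code.

      #include <stdbool.h>
      #include <stdio.h>

      struct tp {
          unsigned int snd_cwnd, snd_ssthresh;
          unsigned int max_packets_out;
          bool is_cwnd_limited;
      };

      static void on_transmit(struct tp *tp, unsigned int packets_out)
      {
          if (packets_out > tp->max_packets_out)
              tp->max_packets_out = packets_out;     /* largest flight this window */
          if (packets_out >= tp->snd_cwnd)
              tp->is_cwnd_limited = true;            /* cwnd was fully used */
      }

      static bool is_cwnd_limited(const struct tp *tp)
      {
          if (tp->snd_cwnd < tp->snd_ssthresh)                 /* slow start */
              return tp->snd_cwnd < 2 * tp->max_packets_out;
          return tp->is_cwnd_limited;                          /* cong. avoidance */
      }

      int main(void)
      {
          struct tp tp = { .snd_cwnd = 10, .snd_ssthresh = 100 };
          on_transmit(&tp, 9);
          printf("slow start, flight 9 of cwnd 10: limited=%d\n", is_cwnd_limited(&tp));
          return 0;
      }
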
  11. 14 May 2014, 3 commits
  12. 04 May 2014, 1 commit
  13. 03 May 2014, 1 commit
    • tcp: fix cwnd limited checking to improve congestion control · e114a710
      Committed by Eric Dumazet
      Yuchung discovered that tcp_is_cwnd_limited() was returning false in
      the slow start phase even when the application had filled the socket
      write queue.
      
      All congestion modules take into account tcp_is_cwnd_limited()
      before increasing cwnd, so this behavior limits slow start from
      probing the bandwidth at full speed.
      
      The problem is that even if the write queue is full (i.e. we are _not_
      application limited), cwnd can be underutilized if TSO auto-defers or
      TCP Small Queues decides to hold packets.

      So in_flight can be kept at a smaller value, and we can get to the
      point where tcp_is_cwnd_limited() returns false.
      
      With TCP Small Queues and FQ/pacing, this issue is more visible.
      
      We fix this by having tcp_cwnd_validate(), which is supposed to track
      such things, take into account unsent_segs, the number of segs that we
      are not sending at the moment due to TSO or TSQ, but intend to send
      real soon. Then when we are cwnd-limited, remember this fact while we
      are processing the window of ACKs that comes back.
      
      For example, suppose we have a brand new connection with cwnd=10; we
      are in slow start, and we send a flight of 9 packets. By the time we
      have received ACKs for all 9 packets we want our cwnd to be 18.
      We implement this by setting tp->lsnd_pending to 9, and
      considering ourselves to be cwnd-limited while cwnd is less than
      twice tp->lsnd_pending (2*9 -> 18).
      
      This makes tcp_is_cwnd_limited() more understandable by removing the
      GSO/TSO kludge that tried to work around the issue.
      
      Note the in_flight parameter can be removed in a followup cleanup
      patch.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e114a710
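
      The worked example above (cwnd=10, a flight of 9 packets, target 18)
      can be sketched directly; the 2*lsnd_pending test and the field name
      follow the commit text, while the rest is illustrative.

      #include <stdbool.h>
      #include <stdio.h>

      static bool cwnd_limited(unsigned int cwnd, unsigned int lsnd_pending)
      {
          return cwnd < 2 * lsnd_pending;
      }

      int main(void)
      {
          unsigned int cwnd = 10, lsnd_pending = 9;
          /* Each returning ACK in slow start may raise cwnd while "limited". */
          while (cwnd_limited(cwnd, lsnd_pending))
              cwnd++;                    /* slow start: +1 per ACKed packet */
          printf("cwnd grew to %u (2 * %u)\n", cwnd, lsnd_pending);
          return 0;
      }
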
  14. 21 Apr 2014, 1 commit
  15. 21 Mar 2014, 1 commit
  16. 27 Feb 2014, 1 commit
    • tcp: switch rtt estimations to usec resolution · 740b0f18
      Committed by Eric Dumazet
      Upcoming congestion controls for TCP require usec resolution for RTT
      estimations. Millisecond resolution is simply not enough these days.
      
      FQ/pacing in DC environments also requires this change for finer
      control and for removing the bimodal behavior caused by the current
      hack in tcp_update_pacing_rate() for 'small rtt'.
      
      TCP_CONG_RTT_STAMP is no longer needed.
      
      As Julian Anastasov pointed out, we need to keep user compatibility:
      tcp_metrics used to export RTT and RTTVAR in msec resolution, so we
      added RTT_US and RTTVAR_US. An iproute2 patch is needed to use the new
      attributes if provided by the kernel.

      In this example the ss command displays an srtt of 32 usec (10Gbit link):
      
      lpk51:~# ./ss -i dst lpk52
      Netid  State      Recv-Q Send-Q   Local Address:Port       Peer
      Address:Port
      tcp    ESTAB      0      1         10.246.11.51:42959
      10.246.11.52:64614
               cubic wscale:6,6 rto:201 rtt:0.032/0.001 ato:40 mss:1448
      cwnd:10 send
      3620.0Mbps pacing_rate 7240.0Mbps unacked:1 rcv_rtt:993 rcv_space:29559
      
      The updated iproute2 ip command displays:
      
      lpk51:~# ./ip tcp_metrics | grep 10.246.11.52
      10.246.11.52 age 561.914sec cwnd 10 rtt 274us rttvar 213us source
      10.246.11.51
      
      The old binary displays:
      
      lpk51:~# ip tcp_metrics | grep 10.246.11.52
      10.246.11.52 age 561.914sec cwnd 10 rtt 250us rttvar 125us source
      10.246.11.51
      
      With help from Julian Anastasov, Stephen Hemminger and Yuchung Cheng
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Larry Brakmo <brakmo@google.com>
      Cc: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      740b0f18
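
      A sketch of the representation change: keep the smoothed RTT in
      microseconds internally (a simplified RFC 6298-style EWMA) and only
      round when exporting a legacy millisecond view, which is where the
      compatibility attributes come in. Variable and helper names are
      illustrative.

      #include <stdio.h>

      struct rtt_state {
          unsigned int srtt_us;   /* 0 until the first sample */
      };

      static void rtt_update(struct rtt_state *s, unsigned int sample_us)
      {
          if (!s->srtt_us)
              s->srtt_us = sample_us;
          else    /* srtt += (sample - srtt) / 8 */
              s->srtt_us += ((int)sample_us - (int)s->srtt_us) / 8;
      }

      static unsigned int export_msec(const struct rtt_state *s)
      {
          return (s->srtt_us + 500) / 1000;   /* legacy msec view loses precision */
      }

      int main(void)
      {
          struct rtt_state s = { 0 };
          rtt_update(&s, 32);                 /* 32 usec on a 10Gbit LAN, as above */
          printf("srtt = %u us, legacy view = %u ms\n", s.srtt_us, export_msec(&s));
          return 0;
      }
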
  17. 22 Feb 2014, 1 commit
  18. 14 Feb 2014, 1 commit
  19. 30 Dec 2013, 1 commit
  20. 18 Dec 2013, 1 commit
    • tcp: refine TSO splits · d4589926
      Committed by Eric Dumazet
      While investigating performance problems on small RPC workloads,
      I noticed the Linux TCP stack was always splitting the last TSO skb
      into two parts (skbs): one a multiple of MSS, and a small one carrying
      the Push flag. This split is done even if TCP_NODELAY is set, or if no
      small packet is in flight.
      
      Example with request/response of 4K/4K
      
      IP A > B: . ack 68432 win 2783 <nop,nop,timestamp 6524593 6525001>
      IP A > B: . 65537:68433(2896) ack 69632 win 2783 <nop,nop,timestamp 6524593 6525001>
      IP A > B: P 68433:69633(1200) ack 69632 win 2783 <nop,nop,timestamp 6524593 6525001>
      IP B > A: . ack 68433 win 2768 <nop,nop,timestamp 6525001 6524593>
      IP B > A: . 69632:72528(2896) ack 69633 win 2768 <nop,nop,timestamp 6525001 6524593>
      IP B > A: P 72528:73728(1200) ack 69633 win 2768 <nop,nop,timestamp 6525001 6524593>
      IP A > B: . ack 72528 win 2783 <nop,nop,timestamp 6524593 6525001>
      IP A > B: . 69633:72529(2896) ack 73728 win 2783 <nop,nop,timestamp 6524593 6525001>
      IP A > B: P 72529:73729(1200) ack 73728 win 2783 <nop,nop,timestamp 6524593 6525001>
      
      We can avoid this split by performing the Nagle tests at the right place.
      
      Note: if some NIC had trouble sending TSO packets with a partial last
      segment, we would have hit the problem in GRO/forwarding workloads
      already.
      
      tcp_minshall_update() is moved to tcp_output.c and is updated as we might
      feed a TSO packet with a partial last segment.
      
      This patch tremendously improves performance, as the traffic now looks
      like:
      
      IP A > B: . ack 98304 win 2783 <nop,nop,timestamp 6834277 6834685>
      IP A > B: P 94209:98305(4096) ack 98304 win 2783 <nop,nop,timestamp 6834277 6834685>
      IP B > A: . ack 98305 win 2768 <nop,nop,timestamp 6834686 6834277>
      IP B > A: P 98304:102400(4096) ack 98305 win 2768 <nop,nop,timestamp 6834686 6834277>
      IP A > B: . ack 102400 win 2783 <nop,nop,timestamp 6834279 6834686>
      IP A > B: P 98305:102401(4096) ack 102400 win 2783 <nop,nop,timestamp 6834279 6834686>
      IP B > A: . ack 102401 win 2768 <nop,nop,timestamp 6834687 6834279>
      IP B > A: P 102400:106496(4096) ack 102401 win 2768 <nop,nop,timestamp 6834687 6834279>
      IP A > B: . ack 106496 win 2783 <nop,nop,timestamp 6834280 6834687>
      IP A > B: P 102401:106497(4096) ack 106496 win 2783 <nop,nop,timestamp 6834280 6834687>
      IP B > A: . ack 106497 win 2768 <nop,nop,timestamp 6834688 6834280>
      IP B > A: P 106496:110592(4096) ack 106497 win 2768 <nop,nop,timestamp 6834688 6834280>
      
      Before :
      
      lpq83:~# nstat >/dev/null;perf stat ./super_netperf 200 -t TCP_RR -H lpq84 -l 20 -- -r 4K,4K
      280774
      
       Performance counter stats for './super_netperf 200 -t TCP_RR -H lpq84 -l 20 -- -r 4K,4K':
      
           205719.049006 task-clock                #    9.278 CPUs utilized
               8,449,968 context-switches          #    0.041 M/sec
               1,935,997 CPU-migrations            #    0.009 M/sec
                 160,541 page-faults               #    0.780 K/sec
         548,478,722,290 cycles                    #    2.666 GHz                     [83.20%]
         455,240,670,857 stalled-cycles-frontend   #   83.00% frontend cycles idle    [83.48%]
         272,881,454,275 stalled-cycles-backend    #   49.75% backend  cycles idle    [66.73%]
         166,091,460,030 instructions              #    0.30  insns per cycle
                                                   #    2.74  stalled cycles per insn [83.39%]
          29,150,229,399 branches                  #  141.699 M/sec                   [83.30%]
           1,943,814,026 branch-misses             #    6.67% of all branches         [83.32%]
      
            22.173517844 seconds time elapsed
      
      lpq83:~# nstat | egrep "IpOutRequests|IpExtOutOctets"
      IpOutRequests                   16851063           0.0
      IpExtOutOctets                  23878580777        0.0
      
      After patch :
      
      lpq83:~# nstat >/dev/null;perf stat ./super_netperf 200 -t TCP_RR -H lpq84 -l 20 -- -r 4K,4K
      280877
      
       Performance counter stats for './super_netperf 200 -t TCP_RR -H lpq84 -l 20 -- -r 4K,4K':
      
           107496.071918 task-clock                #    4.847 CPUs utilized
               5,635,458 context-switches          #    0.052 M/sec
               1,374,707 CPU-migrations            #    0.013 M/sec
                 160,920 page-faults               #    0.001 M/sec
         281,500,010,924 cycles                    #    2.619 GHz                     [83.28%]
         228,865,069,307 stalled-cycles-frontend   #   81.30% frontend cycles idle    [83.38%]
         142,462,742,658 stalled-cycles-backend    #   50.61% backend  cycles idle    [66.81%]
          95,227,712,566 instructions              #    0.34  insns per cycle
                                                   #    2.40  stalled cycles per insn [83.43%]
          16,209,868,171 branches                  #  150.795 M/sec                   [83.20%]
             874,252,952 branch-misses             #    5.39% of all branches         [83.37%]
      
            22.175821286 seconds time elapsed
      
      lpq83:~# nstat | egrep "IpOutRequests|IpExtOutOctets"
      IpOutRequests                   11239428           0.0
      IpExtOutOctets                  23595191035        0.0
      
      Indeed, the occupancy of tx skbs (IpExtOutOctets/IpOutRequests) is
      higher: 2099 instead of 1417, thus helping GRO to be more efficient
      when using the FQ packet scheduler.
      
      Many thanks to Neal for review and ideas.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Nandita Dukkipati <nanditad@google.com>
      Cc: Van Jacobson <vanj@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Tested-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d4589926
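
      A much-simplified sketch of the split decision: only carve the sub-MSS
      tail off a TSO skb when a Nagle-style test says a partial segment must
      be held back (Nagle enabled and small data already in flight). This is
      an illustrative model under those assumptions, not the kernel's actual
      split logic.

      #include <stdbool.h>
      #include <stdio.h>

      static bool must_split_tail(unsigned int len, unsigned int mss,
                                  bool nodelay, bool small_pkt_in_flight)
      {
          if (len % mss == 0)
              return false;              /* no partial tail at all */
          if (nodelay || !small_pkt_in_flight)
              return false;              /* partial tail may go out as-is */
          return true;                   /* classic Nagle: hold the tail back */
      }

      int main(void)
      {
          /* 4096-byte request, 1448-byte MSS, TCP_NODELAY, nothing small in
           * flight: send one 4096-byte TSO skb instead of 2896 + 1200. */
          printf("split? %d\n", must_split_tail(4096, 1448, true, false));
          return 0;
      }
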
  21. 07 Dec 2013, 1 commit
    • tcp: auto corking · f54b3111
      Committed by Eric Dumazet
      With the introduction of TCP Small Queues, TSO auto sizing, and TCP
      pacing, we can implement Automatic Corking in the kernel, to help
      applications doing small write()/sendmsg() to TCP sockets.
      
      The idea is to change tcp_push() to check whether the current skb
      payload is under the skb's optimal size (a multiple of MSS bytes).
      
      If under 'size_goal', and at least one packet is still in Qdisc or
      NIC TX queues, set the TCP Small Queue Throttled bit, so that the push
      will be delayed up to TX completion time.
      
      This delay might allow the application to coalesce more bytes
      in the skb in following write()/sendmsg()/sendfile() system calls.
      
      The exact duration of the delay depends on the dynamics of the system,
      and might be zero if no packet for this flow is actually held in the
      Qdisc or NIC TX ring.
      
      Using FQ/pacing is a way to increase the probability of
      autocorking being triggered.
      
      Add a new sysctl (/proc/sys/net/ipv4/tcp_autocorking) to control this
      feature, defaulting to 1 (enabled).
      
      Add a new SNMP counter: nstat -a | grep TcpExtTCPAutoCorking
      This counter is incremented every time we detect that an skb was
      underused and its flush was deferred.
      
      Tested:
      
      Interesting effects when using line buffered commands under ssh.
      
      Excellent performance results in terms of CPU usage and total throughput.
      
      lpq83:~# echo 1 >/proc/sys/net/ipv4/tcp_autocorking
      lpq83:~# perf stat ./super_netperf 4 -t TCP_STREAM -H lpq84 -- -m 128
      9410.39
      
       Performance counter stats for './super_netperf 4 -t TCP_STREAM -H lpq84 -- -m 128':
      
            35209.439626 task-clock                #    2.901 CPUs utilized
                   2,294 context-switches          #    0.065 K/sec
                     101 CPU-migrations            #    0.003 K/sec
                   4,079 page-faults               #    0.116 K/sec
          97,923,241,298 cycles                    #    2.781 GHz                     [83.31%]
          51,832,908,236 stalled-cycles-frontend   #   52.93% frontend cycles idle    [83.30%]
          25,697,986,603 stalled-cycles-backend    #   26.24% backend  cycles idle    [66.70%]
         102,225,978,536 instructions              #    1.04  insns per cycle
                                                   #    0.51  stalled cycles per insn [83.38%]
          18,657,696,819 branches                  #  529.906 M/sec                   [83.29%]
              91,679,646 branch-misses             #    0.49% of all branches         [83.40%]
      
            12.136204899 seconds time elapsed
      
      lpq83:~# echo 0 >/proc/sys/net/ipv4/tcp_autocorking
      lpq83:~# perf stat ./super_netperf 4 -t TCP_STREAM -H lpq84 -- -m 128
      6624.89
      
       Performance counter stats for './super_netperf 4 -t TCP_STREAM -H lpq84 -- -m 128':
            40045.864494 task-clock                #    3.301 CPUs utilized
                     171 context-switches          #    0.004 K/sec
                      53 CPU-migrations            #    0.001 K/sec
                   4,080 page-faults               #    0.102 K/sec
         111,340,458,645 cycles                    #    2.780 GHz                     [83.34%]
          61,778,039,277 stalled-cycles-frontend   #   55.49% frontend cycles idle    [83.31%]
          29,295,522,759 stalled-cycles-backend    #   26.31% backend  cycles idle    [66.67%]
         108,654,349,355 instructions              #    0.98  insns per cycle
                                                   #    0.57  stalled cycles per insn [83.34%]
          19,552,170,748 branches                  #  488.244 M/sec                   [83.34%]
             157,875,417 branch-misses             #    0.81% of all branches         [83.34%]
      
            12.130267788 seconds time elapsed
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f54b3111
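
      The corking test itself is small: defer the push when the current skb
      is below its size goal and a previous packet of this flow is still in
      the Qdisc/NIC queues. The sysctl and SNMP counter are as described
      above; the helper below is an illustrative simplification.

      #include <stdbool.h>
      #include <stdio.h>

      static bool should_autocork(unsigned int skb_len, unsigned int size_goal,
                                  unsigned int pkts_in_tx_queues,
                                  bool sysctl_autocorking)
      {
          return sysctl_autocorking &&
                 skb_len < size_goal &&
                 pkts_in_tx_queues > 0;  /* TX completion will trigger the push */
      }

      int main(void)
      {
          /* 128-byte write, 64KB size goal, one packet still queued: cork. */
          printf("cork? %d\n", should_autocork(128, 65536, 1, true));
          /* Queues drained: push immediately. */
          printf("cork? %d\n", should_autocork(128, 65536, 0, true));
          return 0;
      }
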
  22. 05 Nov 2013, 1 commit
    • tcp: properly handle stretch acks in slow start · 9f9843a7
      Committed by Yuchung Cheng
      Slow start currently increases cwnd by 1 whenever an ACK acknowledges
      some packets, regardless of the number of packets acknowledged.
      Consequently, slow start performance is highly dependent on the degree
      of the stretch ACKs caused by receiver or network ACK compression
      mechanisms (e.g., delayed-ACK, GRO, etc). But the slow start algorithm
      is meant to send twice the amount of packets left, so it should
      process a stretch ACK of degree N as if it were N ACKs of degree 1,
      and then exit when cwnd exceeds ssthresh. A follow-up patch will use
      the remainder of the N (if greater than 1) to adjust cwnd in the
      congestion avoidance phase.
      
      In addition, this patch retires the experimental limited slow start
      (LSS) feature. LSS has multiple drawbacks but questionable benefit.
      The fractional cwnd increase in LSS requires a loop in slow start even
      though it's rarely used. Configuring such an increase step via a
      global sysctl across different BDPs seems hard. Finally, and most
      importantly, the slow start overshoot concern is now better covered by
      Hybrid slow start (hystart), which is enabled by default.
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9f9843a7
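
      A sketch of stretch-ACK handling in slow start: a single ACK covering
      N packets advances cwnd as N unit ACKs would, capped at ssthresh, with
      the remainder returned for the congestion-avoidance phase. Simplified
      and illustrative, not the kernel's tcp_slow_start() implementation.

      #include <stdio.h>

      /* Returns the number of ACKed packets left over once ssthresh is reached. */
      static unsigned int slow_start(unsigned int *cwnd, unsigned int ssthresh,
                                     unsigned int acked)
      {
          unsigned int room = ssthresh > *cwnd ? ssthresh - *cwnd : 0;
          unsigned int used = acked < room ? acked : room;

          *cwnd += used;
          return acked - used;
      }

      int main(void)
      {
          unsigned int cwnd = 10, ssthresh = 16;
          /* A GRO/delayed-ACK receiver acknowledges 8 packets at once. */
          unsigned int leftover = slow_start(&cwnd, ssthresh, 8);
          printf("cwnd=%u leftover for CA=%u\n", cwnd, leftover);   /* 16 and 2 */
          return 0;
      }
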
  23. 22 Oct 2013, 1 commit