1. 04 10月, 2014 2 次提交
    • J
      qdisc: dequeue bulking also pickup GSO/TSO packets · 808e7ac0
      Jesper Dangaard Brouer 提交于
      The TSO and GSO segmented packets already benefit from bulking
      on their own.
      
      The TSO packets have always taken advantage of the only updating
      the tailptr once for a large packet.
      
      The GSO segmented packets have recently taken advantage of
      bulking xmit_more API, via merge commit 53fda7f7 ("Merge
      branch 'xmit_list'"), specifically via commit 7f2e870f ("net:
      Move main gso loop out of dev_hard_start_xmit() into helper.")
      allowing qdisc requeue of remaining list.  And via commit
      ce93718f ("net: Don't keep around original SKB when we
      software segment GSO frames.").
      
      This patch allow further bulking of TSO/GSO packets together,
      when dequeueing from the qdisc.
      
      Testing:
       Measuring HoL (Head-of-Line) blocking for TSO and GSO, with
      netperf-wrapper. Bulking several TSO show no performance regressions
      (requeues were in the area 32 requeues/sec).
      
      Bulking several GSOs does show small regression or very small
      improvement (requeues were in the area 8000 requeues/sec).
      
       Using ixgbe 10Gbit/s with GSO bulking, we can measure some additional
      latency. Base-case, which is "normal" GSO bulking, sees varying
      high-prio queue delay between 0.38ms to 0.47ms.  Bulking several GSOs
      together, result in a stable high-prio queue delay of 0.50ms.
      
       Using igb at 100Mbit/s with GSO bulking, shows an improvement.
      Base-case sees varying high-prio queue delay between 2.23ms to 2.35ms
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      808e7ac0
    • J
      qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE · 5772e9a3
      Jesper Dangaard Brouer 提交于
      Based on DaveM's recent API work on dev_hard_start_xmit(), that allows
      sending/processing an entire skb list.
      
      This patch implements qdisc bulk dequeue, by allowing multiple packets
      to be dequeued in dequeue_skb().
      
      The optimization principle for this is two fold, (1) to amortize
      locking cost and (2) avoid expensive tailptr update for notifying HW.
       (1) Several packets are dequeued while holding the qdisc root_lock,
      amortizing locking cost over several packet.  The dequeued SKB list is
      processed under the TXQ lock in dev_hard_start_xmit(), thus also
      amortizing the cost of the TXQ lock.
       (2) Further more, dev_hard_start_xmit() will utilize the skb->xmit_more
      API to delay HW tailptr update, which also reduces the cost per
      packet.
      
      One restriction of the new API is that every SKB must belong to the
      same TXQ.  This patch takes the easy way out, by restricting bulk
      dequeue to qdisc's with the TCQ_F_ONETXQUEUE flag, that specifies the
      qdisc only have attached a single TXQ.
      
      Some detail about the flow; dev_hard_start_xmit() will process the skb
      list, and transmit packets individually towards the driver (see
      xmit_one()).  In case the driver stops midway in the list, the
      remaining skb list is returned by dev_hard_start_xmit().  In
      sch_direct_xmit() this returned list is requeued by dev_requeue_skb().
      
      To avoid overshooting the HW limits, which results in requeuing, the
      patch limits the amount of bytes dequeued, based on the drivers BQL
      limits.  In-effect bulking will only happen for BQL enabled drivers.
      
      Small amounts for extra HoL blocking (2x MTU/0.24ms) were
      measured at 100Mbit/s, with bulking 8 packets, but the
      oscillating nature of the measurement indicate something, like
      sched latency might be causing this effect. More comparisons
      show, that this oscillation goes away occationally. Thus, we
      disregard this artifact completely and remove any "magic" bulking
      limit.
      
      For now, as a conservative approach, stop bulking when seeing TSO and
      segmented GSO packets.  They already benefit from bulking on their own.
      A followup patch add this, to allow easier bisect-ability for finding
      regressions.
      
      Jointed work with Hannes, Daniel and Florian.
      Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5772e9a3
  2. 02 10月, 2014 18 次提交
    • A
      net: pktgen: packet bursting via skb->xmit_more · 38b2cf29
      Alexei Starovoitov 提交于
      This patch demonstrates the effect of delaying update of HW tailptr.
      (based on earlier patch by Jesper)
      
      burst=1 is the default. It sends one packet with xmit_more=false
      burst=2 sends one packet with xmit_more=true and
              2nd copy of the same packet with xmit_more=false
      burst=3 sends two copies of the same packet with xmit_more=true and
              3rd copy with xmit_more=false
      
      Performance with ixgbe (usec 30):
      burst=1  tx:9.2 Mpps
      burst=2  tx:13.5 Mpps
      burst=3  tx:14.5 Mpps full 10G line rate
      Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      38b2cf29
    • F
      net: bridge: add a br_set_state helper function · 775dd692
      Florian Fainelli 提交于
      In preparation for being able to propagate port states to e.g: notifiers
      or other kernel parts, do not manipulate the port state directly, but
      instead use a helper function which will allow us to do a bit more than
      just setting the state.
      Signed-off-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      775dd692
    • W
      net_sched: avoid calling tcf_unbind_filter() in call_rcu callback · a0efb80c
      WANG Cong 提交于
      This fixes the following crash:
      
      [   63.976822] general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
      [   63.980094] CPU: 1 PID: 15 Comm: ksoftirqd/1 Not tainted 3.17.0-rc6+ #648
      [   63.980094] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
      [   63.980094] task: ffff880117dea690 ti: ffff880117dfc000 task.ti: ffff880117dfc000
      [   63.980094] RIP: 0010:[<ffffffff817e6d07>]  [<ffffffff817e6d07>] u32_destroy_key+0x27/0x6d
      [   63.980094] RSP: 0018:ffff880117dffcc0  EFLAGS: 00010202
      [   63.980094] RAX: ffff880117dea690 RBX: ffff8800d02e0820 RCX: 0000000000000000
      [   63.980094] RDX: 0000000000000001 RSI: 0000000000000002 RDI: 6b6b6b6b6b6b6b6b
      [   63.980094] RBP: ffff880117dffcd0 R08: 0000000000000000 R09: 0000000000000000
      [   63.980094] R10: 00006c0900006ba8 R11: 00006ba100006b9d R12: 0000000000000001
      [   63.980094] R13: ffff8800d02e0898 R14: ffffffff817e6d4d R15: ffff880117387a30
      [   63.980094] FS:  0000000000000000(0000) GS:ffff88011a800000(0000) knlGS:0000000000000000
      [   63.980094] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [   63.980094] CR2: 00007f07e6732fed CR3: 000000011665b000 CR4: 00000000000006e0
      [   63.980094] Stack:
      [   63.980094]  ffff88011a9cd300 ffffffff82051ac0 ffff880117dffce0 ffffffff817e6d68
      [   63.980094]  ffff880117dffd70 ffffffff810cb4c7 ffffffff810cb3cd ffff880117dfffd8
      [   63.980094]  ffff880117dea690 ffff880117dea690 ffff880117dfffd8 000000000000000a
      [   63.980094] Call Trace:
      [   63.980094]  [<ffffffff817e6d68>] u32_delete_key_freepf_rcu+0x1b/0x1d
      [   63.980094]  [<ffffffff810cb4c7>] rcu_process_callbacks+0x3bb/0x691
      [   63.980094]  [<ffffffff810cb3cd>] ? rcu_process_callbacks+0x2c1/0x691
      [   63.980094]  [<ffffffff817e6d4d>] ? u32_destroy_key+0x6d/0x6d
      [   63.980094]  [<ffffffff810780a4>] __do_softirq+0x142/0x323
      [   63.980094]  [<ffffffff810782a8>] run_ksoftirqd+0x23/0x53
      [   63.980094]  [<ffffffff81092126>] smpboot_thread_fn+0x203/0x221
      [   63.980094]  [<ffffffff81091f23>] ? smpboot_unpark_thread+0x33/0x33
      [   63.980094]  [<ffffffff8108e44d>] kthread+0xc9/0xd1
      [   63.980094]  [<ffffffff819e00ea>] ? do_wait_for_common+0xf8/0x125
      [   63.980094]  [<ffffffff8108e384>] ? __kthread_parkme+0x61/0x61
      [   63.980094]  [<ffffffff819e43ec>] ret_from_fork+0x7c/0xb0
      [   63.980094]  [<ffffffff8108e384>] ? __kthread_parkme+0x61/0x61
      
      tp could be freed in call_rcu callback too, the order is not guaranteed.
      
      John Fastabend says:
      
      ====================
      Its worth noting why this is safe. Any running schedulers will either
      read the valid class field or it will be zeroed.
      
      All schedulers today when the class is 0 do a lookup using the
      same call used by the tcf_exts_bind(). So even if we have a running
      classifier hit the null class pointer it will do a lookup and get
      to the same result. This is particularly fragile at the moment because
      the only way to verify this is to audit the schedulers call sites.
      ====================
      
      Cc: John Fastabend <john.r.fastabend@intel.com>
      Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
      Acked-by: NJohn Fastabend <john.r.fastabend@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a0efb80c
    • W
      net_sched: fix another crash in cls_tcindex · 6e056569
      WANG Cong 提交于
      This patch fixes the following crash:
      
      [  166.670795] BUG: unable to handle kernel NULL pointer dereference at           (null)
      [  166.674230] IP: [<ffffffff814b739f>] __list_del_entry+0x5c/0x98
      [  166.674230] PGD d0ea5067 PUD ce7fc067 PMD 0
      [  166.674230] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
      [  166.674230] CPU: 1 PID: 775 Comm: tc Not tainted 3.17.0-rc6+ #642
      [  166.674230] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
      [  166.674230] task: ffff8800d03c4d20 ti: ffff8800cae7c000 task.ti: ffff8800cae7c000
      [  166.674230] RIP: 0010:[<ffffffff814b739f>]  [<ffffffff814b739f>] __list_del_entry+0x5c/0x98
      [  166.674230] RSP: 0018:ffff8800cae7f7d0  EFLAGS: 00010207
      [  166.674230] RAX: 0000000000000000 RBX: ffff8800cba8d700 RCX: ffff8800cba8d700
      [  166.674230] RDX: 0000000000000000 RSI: dead000000200200 RDI: ffff8800cba8d700
      [  166.674230] RBP: ffff8800cae7f7d0 R08: 0000000000000001 R09: 0000000000000001
      [  166.674230] R10: 0000000000000000 R11: 000000000000859a R12: ffffffffffffffe8
      [  166.674230] R13: ffff8800cba8c5b8 R14: 0000000000000001 R15: ffff8800cba8d700
      [  166.674230] FS:  00007fdb5f04a740(0000) GS:ffff88011a800000(0000) knlGS:0000000000000000
      [  166.674230] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [  166.674230] CR2: 0000000000000000 CR3: 00000000cf929000 CR4: 00000000000006e0
      [  166.674230] Stack:
      [  166.674230]  ffff8800cae7f7e8 ffffffff814b73e8 ffff8800cba8d6e8 ffff8800cae7f828
      [  166.674230]  ffffffff817caeec 0000000000000046 ffff8800cba8c5b0 ffff8800cba8c5b8
      [  166.674230]  0000000000000000 0000000000000001 ffff8800cf8e33e8 ffff8800cae7f848
      [  166.674230] Call Trace:
      [  166.674230]  [<ffffffff814b73e8>] list_del+0xd/0x2b
      [  166.674230]  [<ffffffff817caeec>] tcf_action_destroy+0x4c/0x71
      [  166.674230]  [<ffffffff817ca0ce>] tcf_exts_destroy+0x20/0x2d
      [  166.674230]  [<ffffffff817ec2b5>] tcindex_delete+0x196/0x1b7
      
      struct list_head can not be simply copied and we should always init it.
      
      Cc: John Fastabend <john.r.fastabend@intel.com>
      Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
      Acked-by: NJohn Fastabend <john.r.fastabend@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6e056569
    • T
      gre: Set inner protocol in v4 and v6 GRE transmit · 54bc9bac
      Tom Herbert 提交于
      Call skb_set_inner_protocol to set inner Ethernet protocol to
      protocol being encapsulation by GRE before tunnel_xmit. This is
      needed for GSO if UDP encapsulation (fou) is being done.
      Signed-off-by: NTom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      54bc9bac
    • T
      ipip: Set inner IP protocol in ipip · 077c5a09
      Tom Herbert 提交于
      Call skb_set_inner_ipproto to set inner IP protocol to IPPROTO_IPV4
      before tunnel_xmit. This is needed if UDP encapsulation (fou) is
      being done.
      Signed-off-by: NTom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      077c5a09
    • T
      sit: Set inner IP protocol in sit · 469471cd
      Tom Herbert 提交于
      Call skb_set_inner_ipproto to set inner IP protocol to IPPROTO_IPV6
      before tunnel_xmit. This is needed if UDP encapsulation (fou) is
      being done.
      Signed-off-by: NTom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      469471cd
    • T
      udp: Generalize skb_udp_segment · 8bce6d7d
      Tom Herbert 提交于
      skb_udp_segment is the function called from udp4_ufo_fragment to
      segment a UDP tunnel packet. This function currently assumes
      segmentation is transparent Ethernet bridging (i.e. VXLAN
      encapsulation). This patch generalizes the function to
      operate on either Ethertype or IP protocol.
      
      The inner_protocol field must be set to the protocol of the inner
      header. This can now be either an Ethertype or an IP protocol
      (in a union). A new flag in the skbuff indicates which type is
      effective. skb_set_inner_protocol and skb_set_inner_ipproto
      helper functions were added to set the inner_protocol. These
      functions are called from the point where the tunnel encapsulation
      is occuring.
      
      When skb_udp_tunnel_segment is called, the function to segment the
      inner packet is selected based on the inner IP or Ethertype. In the
      case of an IP protocol encapsulation, the function is derived from
      inet[6]_offloads. In the case of Ethertype, skb->protocol is
      set to the inner_protocol and skb_mac_gso_segment is called. (GRE
      currently does this, but it might be possible to lookup the protocol
      in offload_base and call the appropriate segmenation function
      directly).
      Signed-off-by: NTom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8bce6d7d
    • E
      net: avoid one atomic operation in skb_clone() · ce1a4ea3
      Eric Dumazet 提交于
      Fast clone cloning can actually avoid an atomic_inc(), if we
      guarantee prior clone_ref value is 1.
      
      This requires a change kfree_skbmem(), to perform the
      atomic_dec_and_test() on clone_ref before setting fclone to
      SKB_FCLONE_UNAVAILABLE.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ce1a4ea3
    • F
      net/dccp/ccid.c: add __init to ccid_activate · e500f488
      Fabian Frederick 提交于
      ccid_activate is only called by __init ccid_initialize_builtins in same module.
      Signed-off-by: NFabian Frederick <fabf@skynet.be>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e500f488
    • F
      net/dccp/proto.c: add __init to dccp_mib_init · 0c5b8a46
      Fabian Frederick 提交于
      dccp_mib_init is only called by __init dccp_init in same module.
      Signed-off-by: NFabian Frederick <fabf@skynet.be>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0c5b8a46
    • E
      net: cleanup and document skb fclone layout · d0bf4a9e
      Eric Dumazet 提交于
      Lets use a proper structure to clearly document and implement
      skb fast clones.
      
      Then, we might experiment more easily alternative layouts.
      
      This patch adds a new skb_fclone_busy() helper, used by tcp and xfrm,
      to stop leaking of implementation details.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d0bf4a9e
    • Y
      tcp: abort orphan sockets stalling on zero window probes · b248230c
      Yuchung Cheng 提交于
      Currently we have two different policies for orphan sockets
      that repeatedly stall on zero window ACKs. If a socket gets
      a zero window ACK when it is transmitting data, the RTO is
      used to probe the window. The socket is aborted after roughly
      tcp_orphan_retries() retries (as in tcp_write_timeout()).
      
      But if the socket was idle when it received the zero window ACK,
      and later wants to send more data, we use the probe timer to
      probe the window. If the receiver always returns zero window ACKs,
      icsk_probes keeps getting reset in tcp_ack() and the orphan socket
      can stall forever until the system reaches the orphan limit (as
      commented in tcp_probe_timer()). This opens up a simple attack
      to create lots of hanging orphan sockets to burn the memory
      and the CPU, as demonstrated in the recent netdev post "TCP
      connection will hang in FIN_WAIT1 after closing if zero window is
      advertised." http://www.spinics.net/lists/netdev/msg296539.html
      
      This patch follows the design in RTO-based probe: we abort an orphan
      socket stalling on zero window when the probe timer reaches both
      the maximum backoff and the maximum RTO. For example, an 100ms RTT
      connection will timeout after roughly 153 seconds (0.3 + 0.6 +
      .... + 76.8) if the receiver keeps the window shut. If the orphan
      socket passes this check, but the system already has too many orphans
      (as in tcp_out_of_resources()), we still abort it but we'll also
      send an RST packet as the connection may still be active.
      
      In addition, we change TCP_USER_TIMEOUT to cover (life or dead)
      sockets stalled on zero-window probes. This changes the semantics
      of TCP_USER_TIMEOUT slightly because it previously only applies
      when the socket has pending transmission.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Reported-by: NAndrey Dmitrov <andrey.dmitrov@oktetlabs.ru>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b248230c
    • F
      cipso: add __init to cipso_v4_cache_init · cb57659a
      Fabian Frederick 提交于
      cipso_v4_cache_init is only called by __init cipso_v4_init
      Signed-off-by: NFabian Frederick <fabf@skynet.be>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cb57659a
    • F
      inet: frags: add __init to ip4_frags_ctl_register · 57a02c39
      Fabian Frederick 提交于
      ip4_frags_ctl_register is only called by __init ipfrag_init
      Signed-off-by: NFabian Frederick <fabf@skynet.be>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      57a02c39
    • F
      tcp: add __init to tcp_init_mem · 47d7a88c
      Fabian Frederick 提交于
      tcp_init_mem is only called by __init tcp_init.
      Signed-off-by: NFabian Frederick <fabf@skynet.be>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      47d7a88c
    • T
      net: dsa: Fix build warning for !PM_SLEEP · e506d405
      Thierry Reding 提交于
      The dsa_switch_suspend() and dsa_switch_resume() functions are only used
      when PM_SLEEP is enabled, so they need #ifdef CONFIG_PM_SLEEP protection
      to avoid a compiler warning.
      Signed-off-by: NThierry Reding <treding@nvidia.com>
      Acked-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e506d405
    • E
      ipv4: mentions skb_gro_postpull_rcsum() in inet_gro_receive() · 2c804d0f
      Eric Dumazet 提交于
      Proper CHECKSUM_COMPLETE support needs to adjust skb->csum
      when we remove one header. Its done using skb_gro_postpull_rcsum()
      
      In the case of IPv4, we know that the adjustment is not really needed,
      because the checksum over IPv4 header is 0. Lets add a comment to
      ease code comprehension and avoid copy/paste errors.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2c804d0f
  3. 01 10月, 2014 6 次提交
  4. 30 9月, 2014 10 次提交
  5. 29 9月, 2014 4 次提交
    • F
      netfilter: conntrack: disable generic tracking for known protocols · db29a950
      Florian Westphal 提交于
      Given following iptables ruleset:
      
      -P FORWARD DROP
      -A FORWARD -m sctp --dport 9 -j ACCEPT
      -A FORWARD -p tcp --dport 80 -j ACCEPT
      -A FORWARD -p tcp -m conntrack -m state ESTABLISHED,RELATED -j ACCEPT
      
      One would assume that this allows SCTP on port 9 and TCP on port 80.
      Unfortunately, if the SCTP conntrack module is not loaded, this allows
      *all* SCTP communication, to pass though, i.e. -p sctp -j ACCEPT,
      which we think is a security issue.
      
      This is because on the first SCTP packet on port 9, we create a dummy
      "generic l4" conntrack entry without any port information (since
      conntrack doesn't know how to extract this information).
      
      All subsequent packets that are unknown will then be in established
      state since they will fallback to proto_generic and will match the
      'generic' entry.
      
      Our originally proposed version [1] completely disabled generic protocol
      tracking, but Jozsef suggests to not track protocols for which a more
      suitable helper is available, hence we now mitigate the issue for in
      tree known ct protocol helpers only, so that at least NAT and direction
      information will still be preserved for others.
      
       [1] http://www.spinics.net/lists/netfilter-devel/msg33430.html
      
      Joint work with Daniel Borkmann.
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
      Acked-by: NJozsef Kadlecsik <kadlec@blackhole.kfki.hu>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      db29a950
    • A
      netfilter: nf_tables: store and dump set policy · 9363dc4b
      Arturo Borrero 提交于
      We want to know in which cases the user explicitly sets the policy
      options. In that case, we also want to dump back the info.
      Signed-off-by: NArturo Borrero Gonzalez <arturo.borrero.glez@gmail.com>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      9363dc4b
    • D
      net: tcp: add DCTCP congestion control algorithm · e3118e83
      Daniel Borkmann 提交于
      This work adds the DataCenter TCP (DCTCP) congestion control
      algorithm [1], which has been first published at SIGCOMM 2010 [2],
      resp. follow-up analysis at SIGMETRICS 2011 [3] (and also, more
      recently as an informational IETF draft available at [4]).
      
      DCTCP is an enhancement to the TCP congestion control algorithm for
      data center networks. Typical data center workloads are i.e.
      i) partition/aggregate (queries; bursty, delay sensitive), ii) short
      messages e.g. 50KB-1MB (for coordination and control state; delay
      sensitive), and iii) large flows e.g. 1MB-100MB (data update;
      throughput sensitive). DCTCP has therefore been designed for such
      environments to provide/achieve the following three requirements:
      
        * High burst tolerance (incast due to partition/aggregate)
        * Low latency (short flows, queries)
        * High throughput (continuous data updates, large file
          transfers) with commodity, shallow buffered switches
      
      The basic idea of its design consists of two fundamentals: i) on the
      switch side, packets are being marked when its internal queue
      length > threshold K (K is chosen so that a large enough headroom
      for marked traffic is still available in the switch queue); ii) the
      sender/host side maintains a moving average of the fraction of marked
      packets, so each RTT, F is being updated as follows:
      
       F := X / Y, where X is # of marked ACKs, Y is total # of ACKs
       alpha := (1 - g) * alpha + g * F, where g is a smoothing constant
      
      The resulting alpha (iow: probability that switch queue is congested)
      is then being used in order to adaptively decrease the congestion
      window W:
      
       W := (1 - (alpha / 2)) * W
      
      The means for receiving marked packets resp. marking them on switch
      side in DCTCP is the use of ECN.
      
      RFC3168 describes a mechanism for using Explicit Congestion Notification
      from the switch for early detection of congestion, rather than waiting
      for segment loss to occur.
      
      However, this method only detects the presence of congestion, not
      the *extent*. In the presence of mild congestion, it reduces the TCP
      congestion window too aggressively and unnecessarily affects the
      throughput of long flows [4].
      
      DCTCP, as mentioned, enhances Explicit Congestion Notification (ECN)
      processing to estimate the fraction of bytes that encounter congestion,
      rather than simply detecting that some congestion has occurred. DCTCP
      then scales the TCP congestion window based on this estimate [4],
      thus it can derive multibit feedback from the information present in
      the single-bit sequence of marks in its control law. And thus act in
      *proportion* to the extent of congestion, not its *presence*.
      
      Switches therefore set the Congestion Experienced (CE) codepoint in
      packets when internal queue lengths exceed threshold K. Resulting,
      DCTCP delivers the same or better throughput than normal TCP, while
      using 90% less buffer space.
      
      It was found in [2] that DCTCP enables the applications to handle 10x
      the current background traffic, without impacting foreground traffic.
      Moreover, a 10x increase in foreground traffic did not cause any
      timeouts, and thus largely eliminates TCP incast collapse problems.
      
      The algorithm itself has already seen deployments in large production
      data centers since then.
      
      We did a long-term stress-test and analysis in a data center, short
      summary of our TCP incast tests with iperf compared to cubic:
      
      This test measured DCTCP throughput and latency and compared it with
      CUBIC throughput and latency for an incast scenario. In this test, 19
      senders sent at maximum rate to a single receiver. The receiver simply
      ran iperf -s.
      
      The senders ran iperf -c <receiver> -t 30. All senders started
      simultaneously (using local clocks synchronized by ntp).
      
      This test was repeated multiple times. Below shows the results from a
      single test. Other tests are similar. (DCTCP results were extremely
      consistent, CUBIC results show some variance induced by the TCP timeouts
      that CUBIC encountered.)
      
      For this test, we report statistics on the number of TCP timeouts,
      flow throughput, and traffic latency.
      
      1) Timeouts (total over all flows, and per flow summaries):
      
                  CUBIC            DCTCP
        Total     3227             25
        Mean       169.842          1.316
        Median     183              1
        Max        207              5
        Min        123              0
        Stddev      28.991          1.600
      
      Timeout data is taken by measuring the net change in netstat -s
      "other TCP timeouts" reported. As a result, the timeout measurements
      above are not restricted to the test traffic, and we believe that it
      is likely that all of the "DCTCP timeouts" are actually timeouts for
      non-test traffic. We report them nevertheless. CUBIC will also include
      some non-test timeouts, but they are drawfed by bona fide test traffic
      timeouts for CUBIC. Clearly DCTCP does an excellent job of preventing
      TCP timeouts. DCTCP reduces timeouts by at least two orders of
      magnitude and may well have eliminated them in this scenario.
      
      2) Throughput (per flow in Mbps):
      
                  CUBIC            DCTCP
        Mean      521.684          521.895
        Median    464              523
        Max       776              527
        Min       403              519
        Stddev    105.891            2.601
        Fairness    0.962            0.999
      
      Throughput data was simply the average throughput for each flow
      reported by iperf. By avoiding TCP timeouts, DCTCP is able to
      achieve much better per-flow results. In CUBIC, many flows
      experience TCP timeouts which makes flow throughput unpredictable and
      unfair. DCTCP, on the other hand, provides very clean predictable
      throughput without incurring TCP timeouts. Thus, the standard deviation
      of CUBIC throughput is dramatically higher than the standard deviation
      of DCTCP throughput.
      
      Mean throughput is nearly identical because even though cubic flows
      suffer TCP timeouts, other flows will step in and fill the unused
      bandwidth. Note that this test is something of a best case scenario
      for incast under CUBIC: it allows other flows to fill in for flows
      experiencing a timeout. Under situations where the receiver is issuing
      requests and then waiting for all flows to complete, flows cannot fill
      in for timed out flows and throughput will drop dramatically.
      
      3) Latency (in ms):
      
                  CUBIC            DCTCP
        Mean      4.0088           0.04219
        Median    4.055            0.0395
        Max       4.2              0.085
        Min       3.32             0.028
        Stddev    0.1666           0.01064
      
      Latency for each protocol was computed by running "ping -i 0.2
      <receiver>" from a single sender to the receiver during the incast
      test. For DCTCP, "ping -Q 0x6 -i 0.2 <receiver>" was used to ensure
      that traffic traversed the DCTCP queue and was not dropped when the
      queue size was greater than the marking threshold. The summary
      statistics above are over all ping metrics measured between the single
      sender, receiver pair.
      
      The latency results for this test show a dramatic difference between
      CUBIC and DCTCP. CUBIC intentionally overflows the switch buffer
      which incurs the maximum queue latency (more buffer memory will lead
      to high latency.) DCTCP, on the other hand, deliberately attempts to
      keep queue occupancy low. The result is a two orders of magnitude
      reduction of latency with DCTCP - even with a switch with relatively
      little RAM. Switches with larger amounts of RAM will incur increasing
      amounts of latency for CUBIC, but not for DCTCP.
      
      4) Convergence and stability test:
      
      This test measured the time that DCTCP took to fairly redistribute
      bandwidth when a new flow commences. It also measured DCTCP's ability
      to remain stable at a fair bandwidth distribution. DCTCP is compared
      with CUBIC for this test.
      
      At the commencement of this test, a single flow is sending at maximum
      rate (near 10 Gbps) to a single receiver. One second after that first
      flow commences, a new flow from a distinct server begins sending to
      the same receiver as the first flow. After the second flow has sent
      data for 10 seconds, the second flow is terminated. The first flow
      sends for an additional second. Ideally, the bandwidth would be evenly
      shared as soon as the second flow starts, and recover as soon as it
      stops.
      
      The results of this test are shown below. Note that the flow bandwidth
      for the two flows was measured near the same time, but not
      simultaneously.
      
      DCTCP performs nearly perfectly within the measurement limitations
      of this test: bandwidth is quickly distributed fairly between the two
      flows, remains stable throughout the duration of the test, and
      recovers quickly. CUBIC, in contrast, is slow to divide the bandwidth
      fairly, and has trouble remaining stable.
      
        CUBIC                      DCTCP
      
        Seconds  Flow 1  Flow 2    Seconds  Flow 1  Flow 2
         0       9.93    0          0       9.92    0
         0.5     9.87    0          0.5     9.86    0
         1       8.73    2.25       1       6.46    4.88
         1.5     7.29    2.8        1.5     4.9     4.99
         2       6.96    3.1        2       4.92    4.94
         2.5     6.67    3.34       2.5     4.93    5
         3       6.39    3.57       3       4.92    4.99
         3.5     6.24    3.75       3.5     4.94    4.74
         4       6       3.94       4       5.34    4.71
         4.5     5.88    4.09       4.5     4.99    4.97
         5       5.27    4.98       5       4.83    5.01
         5.5     4.93    5.04       5.5     4.89    4.99
         6       4.9     4.99       6       4.92    5.04
         6.5     4.93    5.1        6.5     4.91    4.97
         7       4.28    5.8        7       4.97    4.97
         7.5     4.62    4.91       7.5     4.99    4.82
         8       5.05    4.45       8       5.16    4.76
         8.5     5.93    4.09       8.5     4.94    4.98
         9       5.73    4.2        9       4.92    5.02
         9.5     5.62    4.32       9.5     4.87    5.03
        10       6.12    3.2       10       4.91    5.01
        10.5     6.91    3.11      10.5     4.87    5.04
        11       8.48    0         11       8.49    4.94
        11.5     9.87    0         11.5     9.9     0
      
      SYN/ACK ECT test:
      
      This test demonstrates the importance of ECT on SYN and SYN-ACK packets
      by measuring the connection probability in the presence of competing
      flows for a DCTCP connection attempt *without* ECT in the SYN packet.
      The test was repeated five times for each number of competing flows.
      
                    Competing Flows  1 |    2 |    4 |    8 |   16
                                     ------------------------------
      Mean Connection Probability    1 | 0.67 | 0.45 | 0.28 |    0
      Median Connection Probability  1 | 0.65 | 0.45 | 0.25 |    0
      
      As the number of competing flows moves beyond 1, the connection
      probability drops rapidly.
      
      Enabling DCTCP with this patch requires the following steps:
      
      DCTCP must be running both on the sender and receiver side in your
      data center, i.e.:
      
        sysctl -w net.ipv4.tcp_congestion_control=dctcp
      
      Also, ECN functionality must be enabled on all switches in your
      data center for DCTCP to work. The default ECN marking threshold (K)
      heuristic on the switch for DCTCP is e.g., 20 packets (30KB) at
      1Gbps, and 65 packets (~100KB) at 10Gbps (K > 1/7 * C * RTT, [4]).
      
      In above tests, for each switch port, traffic was segregated into two
      queues. For any packet with a DSCP of 0x01 - or equivalently a TOS of
      0x04 - the packet was placed into the DCTCP queue. All other packets
      were placed into the default drop-tail queue. For the DCTCP queue,
      RED/ECN marking was enabled, here, with a marking threshold of 75 KB.
      More details however, we refer you to the paper [2] under section 3).
      
      There are no code changes required to applications running in user
      space. DCTCP has been implemented in full *isolation* of the rest of
      the TCP code as its own congestion control module, so that it can run
      without a need to expose code to the core of the TCP stack, and thus
      nothing changes for non-DCTCP users.
      
      Changes in the CA framework code are minimal, and DCTCP algorithm
      operates on mechanisms that are already available in most Silicon.
      The gain (dctcp_shift_g) is currently a fixed constant (1/16) from
      the paper, but we leave the option that it can be chosen carefully
      to a different value by the user.
      
      In case DCTCP is being used and ECN support on peer site is off,
      DCTCP falls back after 3WHS to operate in normal TCP Reno mode.
      
      ss {-4,-6} -t -i diag interface:
      
        ... dctcp wscale:7,7 rto:203 rtt:2.349/0.026 mss:1448 cwnd:2054
        ssthresh:1102 ce_state 0 alpha 15 ab_ecn 0 ab_tot 735584
        send 10129.2Mbps pacing_rate 20254.1Mbps unacked:1822 retrans:0/15
        reordering:101 rcv_space:29200
      
        ... dctcp-reno wscale:7,7 rto:201 rtt:0.711/1.327 ato:40 mss:1448
        cwnd:10 ssthresh:1102 fallback_mode send 162.9Mbps pacing_rate
        325.5Mbps rcv_rtt:1.5 rcv_space:29200
      
      More information about DCTCP can be found in [1-4].
      
        [1] http://simula.stanford.edu/~alizade/Site/DCTCP.html
        [2] http://simula.stanford.edu/~alizade/Site/DCTCP_files/dctcp-final.pdf
        [3] http://simula.stanford.edu/~alizade/Site/DCTCP_files/dctcp_analysis-full.pdf
        [4] http://tools.ietf.org/html/draft-bensley-tcpm-dctcp-00
      
      Joint work with Florian Westphal and Glenn Judd.
      Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NGlenn Judd <glenn.judd@morganstanley.com>
      Acked-by: NStephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e3118e83
    • F
      net: tcp: more detailed ACK events and events for CE marked packets · 9890092e
      Florian Westphal 提交于
      DataCenter TCP (DCTCP) determines cwnd growth based on ECN information
      and ACK properties, e.g. ACK that updates window is treated differently
      than DUPACK.
      
      Also DCTCP needs information whether ACK was delayed ACK. Furthermore,
      DCTCP also implements a CE state machine that keeps track of CE markings
      of incoming packets.
      
      Therefore, extend the congestion control framework to provide these
      event types, so that DCTCP can be properly implemented as a normal
      congestion algorithm module outside of the core stack.
      
      Joint work with Daniel Borkmann and Glenn Judd.
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
      Signed-off-by: NGlenn Judd <glenn.judd@morganstanley.com>
      Acked-by: NStephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9890092e