1. 02 9月, 2015 1 次提交
  2. 01 9月, 2015 1 次提交
  3. 31 8月, 2015 2 次提交
  4. 18 8月, 2015 1 次提交
  5. 14 8月, 2015 1 次提交
    • D
      net: Fix up inet_addr_type checks · 30bbaa19
      David Ahern 提交于
      Currently inet_addr_type and inet_dev_addr_type expect local addresses
      to be in the local table. With the VRF device local routes for devices
      associated with a VRF will be in the table associated with the VRF.
      Provide an alternate inet_addr lookup to use a specific table rather
      than defaulting to the local table.
      
      inet_addr_type_dev_table keeps the same semantics as inet_addr_type but
      if the passed in device is enslaved to a VRF then the table for that VRF
      is used for the lookup.
      Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      30bbaa19
  6. 23 7月, 2015 1 次提交
  7. 23 6月, 2015 1 次提交
    • C
      tcp: Do not call tcp_fastopen_reset_cipher from interrupt context · dfea2aa6
      Christoph Paasch 提交于
      tcp_fastopen_reset_cipher really cannot be called from interrupt
      context. It allocates the tcp_fastopen_context with GFP_KERNEL and
      calls crypto_alloc_cipher, which allocates all kind of stuff with
      GFP_KERNEL.
      
      Thus, we might sleep when the key-generation is triggered by an
      incoming TFO cookie-request which would then happen in interrupt-
      context, as shown by enabling CONFIG_DEBUG_ATOMIC_SLEEP:
      
      [   36.001813] BUG: sleeping function called from invalid context at mm/slub.c:1266
      [   36.003624] in_atomic(): 1, irqs_disabled(): 0, pid: 1016, name: packetdrill
      [   36.004859] CPU: 1 PID: 1016 Comm: packetdrill Not tainted 4.1.0-rc7 #14
      [   36.006085] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
      [   36.008250]  00000000000004f2 ffff88007f8838a8 ffffffff8171d53a ffff880075a084a8
      [   36.009630]  ffff880075a08000 ffff88007f8838c8 ffffffff810967d3 ffff88007f883928
      [   36.011076]  0000000000000000 ffff88007f8838f8 ffffffff81096892 ffff88007f89be00
      [   36.012494] Call Trace:
      [   36.012953]  <IRQ>  [<ffffffff8171d53a>] dump_stack+0x4f/0x6d
      [   36.014085]  [<ffffffff810967d3>] ___might_sleep+0x103/0x170
      [   36.015117]  [<ffffffff81096892>] __might_sleep+0x52/0x90
      [   36.016117]  [<ffffffff8118e887>] kmem_cache_alloc_trace+0x47/0x190
      [   36.017266]  [<ffffffff81680d82>] ? tcp_fastopen_reset_cipher+0x42/0x130
      [   36.018485]  [<ffffffff81680d82>] tcp_fastopen_reset_cipher+0x42/0x130
      [   36.019679]  [<ffffffff81680f01>] tcp_fastopen_init_key_once+0x61/0x70
      [   36.020884]  [<ffffffff81680f2c>] __tcp_fastopen_cookie_gen+0x1c/0x60
      [   36.022058]  [<ffffffff816814ff>] tcp_try_fastopen+0x58f/0x730
      [   36.023118]  [<ffffffff81671788>] tcp_conn_request+0x3e8/0x7b0
      [   36.024185]  [<ffffffff810e3872>] ? __module_text_address+0x12/0x60
      [   36.025327]  [<ffffffff8167b2e1>] tcp_v4_conn_request+0x51/0x60
      [   36.026410]  [<ffffffff816727e0>] tcp_rcv_state_process+0x190/0xda0
      [   36.027556]  [<ffffffff81661f97>] ? __inet_lookup_established+0x47/0x170
      [   36.028784]  [<ffffffff8167c2ad>] tcp_v4_do_rcv+0x16d/0x3d0
      [   36.029832]  [<ffffffff812e6806>] ? security_sock_rcv_skb+0x16/0x20
      [   36.030936]  [<ffffffff8167cc8a>] tcp_v4_rcv+0x77a/0x7b0
      [   36.031875]  [<ffffffff816af8c3>] ? iptable_filter_hook+0x33/0x70
      [   36.032953]  [<ffffffff81657d22>] ip_local_deliver_finish+0x92/0x1f0
      [   36.034065]  [<ffffffff81657f1a>] ip_local_deliver+0x9a/0xb0
      [   36.035069]  [<ffffffff81657c90>] ? ip_rcv+0x3d0/0x3d0
      [   36.035963]  [<ffffffff81657569>] ip_rcv_finish+0x119/0x330
      [   36.036950]  [<ffffffff81657ba7>] ip_rcv+0x2e7/0x3d0
      [   36.037847]  [<ffffffff81610652>] __netif_receive_skb_core+0x552/0x930
      [   36.038994]  [<ffffffff81610a57>] __netif_receive_skb+0x27/0x70
      [   36.040033]  [<ffffffff81610b72>] process_backlog+0xd2/0x1f0
      [   36.041025]  [<ffffffff81611482>] net_rx_action+0x122/0x310
      [   36.042007]  [<ffffffff81076743>] __do_softirq+0x103/0x2f0
      [   36.042978]  [<ffffffff81723e3c>] do_softirq_own_stack+0x1c/0x30
      
      This patch moves the call to tcp_fastopen_init_key_once to the places
      where a listener socket creates its TFO-state, which always happens in
      user-context (either from the setsockopt, or implicitly during the
      listen()-call)
      
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Fixes: 222e83d2 ("tcp: switch tcp_fastopen key generation to net_get_random_once")
      Signed-off-by: NChristoph Paasch <cpaasch@apple.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      dfea2aa6
  8. 07 6月, 2015 1 次提交
    • E
      inet: add IP_BIND_ADDRESS_NO_PORT to overcome bind(0) limitations · 90c337da
      Eric Dumazet 提交于
      When an application needs to force a source IP on an active TCP socket
      it has to use bind(IP, port=x).
      
      As most applications do not want to deal with already used ports, x is
      often set to 0, meaning the kernel is in charge to find an available
      port.
      But kernel does not know yet if this socket is going to be a listener or
      be connected.
      It has very limited choices (no full knowledge of final 4-tuple for a
      connect())
      
      With limited ephemeral port range (about 32K ports), it is very easy to
      fill the space.
      
      This patch adds a new SOL_IP socket option, asking kernel to ignore
      the 0 port provided by application in bind(IP, port=0) and only
      remember the given IP address.
      
      The port will be automatically chosen at connect() time, in a way
      that allows sharing a source port as long as the 4-tuples are unique.
      
      This new feature is available for both IPv4 and IPv6 (Thanks Neal)
      
      Tested:
      
      Wrote a test program and checked its behavior on IPv4 and IPv6.
      
      strace(1) shows sequences of bind(IP=127.0.0.2, port=0) followed by
      connect().
      Also getsockname() show that the port is still 0 right after bind()
      but properly allocated after connect().
      
      socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 5
      setsockopt(5, SOL_IP, IP_BIND_ADDRESS_NO_PORT, [1], 4) = 0
      bind(5, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, 16) = 0
      getsockname(5, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0
      connect(5, {sa_family=AF_INET, sin_port=htons(53174), sin_addr=inet_addr("127.0.0.3")}, 16) = 0
      getsockname(5, {sa_family=AF_INET, sin_port=htons(38050), sin_addr=inet_addr("127.0.0.2")}, [16]) = 0
      
      IPv6 test :
      
      socket(PF_INET6, SOCK_STREAM, IPPROTO_IP) = 7
      setsockopt(7, SOL_IP, IP_BIND_ADDRESS_NO_PORT, [1], 4) = 0
      bind(7, {sa_family=AF_INET6, sin6_port=htons(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = 0
      getsockname(7, {sa_family=AF_INET6, sin6_port=htons(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 0
      connect(7, {sa_family=AF_INET6, sin6_port=htons(57300), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = 0
      getsockname(7, {sa_family=AF_INET6, sin6_port=htons(60964), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 0
      
      I was able to bind()/connect() a million concurrent IPv4 sockets,
      instead of ~32000 before patch.
      
      lpaa23:~# ulimit -n 1000010
      lpaa23:~# ./bind --connect --num-flows=1000000 &
      1000000 sockets
      
      lpaa23:~# grep TCP /proc/net/sockstat
      TCP: inuse 2000063 orphan 0 tw 47 alloc 2000157 mem 66
      
      Check that a given source port is indeed used by many different
      connections :
      
      lpaa23:~# ss -t src :40000 | head -10
      State      Recv-Q Send-Q   Local Address:Port          Peer Address:Port
      ESTAB      0      0           127.0.0.2:40000         127.0.202.33:44983
      ESTAB      0      0           127.0.0.2:40000         127.2.27.240:44983
      ESTAB      0      0           127.0.0.2:40000           127.2.98.5:44983
      ESTAB      0      0           127.0.0.2:40000        127.0.124.196:44983
      ESTAB      0      0           127.0.0.2:40000         127.2.139.38:44983
      ESTAB      0      0           127.0.0.2:40000          127.1.59.80:44983
      ESTAB      0      0           127.0.0.2:40000          127.3.6.228:44983
      ESTAB      0      0           127.0.0.2:40000          127.0.38.53:44983
      ESTAB      0      0           127.0.0.2:40000         127.1.197.10:44983
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      90c337da
  9. 28 5月, 2015 1 次提交
    • E
      tcp/dccp: try to not exhaust ip_local_port_range in connect() · 07f4c900
      Eric Dumazet 提交于
      A long standing problem on busy servers is the tiny available TCP port
      range (/proc/sys/net/ipv4/ip_local_port_range) and the default
      sequential allocation of source ports in connect() system call.
      
      If a host is having a lot of active TCP sessions, chances are
      very high that all ports are in use by at least one flow,
      and subsequent bind(0) attempts fail, or have to scan a big portion of
      space to find a slot.
      
      In this patch, I changed the starting point in __inet_hash_connect()
      so that we try to favor even [1] ports, leaving odd ports for bind()
      users.
      
      We still perform a sequential search, so there is no guarantee, but
      if connect() targets are very different, end result is we leave
      more ports available to bind(), and we spread them all over the range,
      lowering time for both connect() and bind() to find a slot.
      
      This strategy only works well if /proc/sys/net/ipv4/ip_local_port_range
      is even, ie if start/end values have different parity.
      
      Therefore, default /proc/sys/net/ipv4/ip_local_port_range was changed to
      32768 - 60999 (instead of 32768 - 61000)
      
      There is no change on security aspects here, only some poor hashing
      schemes could be eventually impacted by this change.
      
      [1] : The odd/even property depends on ip_local_port_range values parity
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      07f4c900
  10. 11 5月, 2015 3 次提交
  11. 04 4月, 2015 2 次提交
  12. 03 3月, 2015 1 次提交
  13. 02 3月, 2015 1 次提交
  14. 09 2月, 2015 1 次提交
    • E
      net: rfs: add hash collision detection · 567e4b79
      Eric Dumazet 提交于
      Receive Flow Steering is a nice solution but suffers from
      hash collisions when a mix of connected and unconnected traffic
      is received on the host, when flow hash table is populated.
      
      Also, clearing flow in inet_release() makes RFS not very good
      for short lived flows, as many packets can follow close().
      (FIN , ACK packets, ...)
      
      This patch extends the information stored into global hash table
      to not only include cpu number, but upper part of the hash value.
      
      I use a 32bit value, and dynamically split it in two parts.
      
      For host with less than 64 possible cpus, this gives 6 bits for the
      cpu number, and 26 (32-6) bits for the upper part of the hash.
      
      Since hash bucket selection use low order bits of the hash, we have
      a full hash match, if /proc/sys/net/core/rps_sock_flow_entries is big
      enough.
      
      If the hash found in flow table does not match, we fallback to RPS (if
      it is enabled for the rxqueue).
      
      This means that a packet for an non connected flow can avoid the
      IPI through a unrelated/victim CPU.
      
      This also means we no longer have to clear the table at socket
      close time, and this helps short lived flows performance.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NTom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      567e4b79
  15. 27 11月, 2014 1 次提交
  16. 06 11月, 2014 2 次提交
  17. 21 10月, 2014 1 次提交
    • F
      net: gso: use feature flag argument in all protocol gso handlers · 1e16aa3d
      Florian Westphal 提交于
      skb_gso_segment() has a 'features' argument representing offload features
      available to the output path.
      
      A few handlers, e.g. GRE, instead re-fetch the features of skb->dev and use
      those instead of the provided ones when handing encapsulation/tunnels.
      
      Depending on dev->hw_enc_features of the output device skb_gso_segment() can
      then return NULL even when the caller has disabled all GSO feature bits,
      as segmentation of inner header thinks device will take care of segmentation.
      
      This e.g. affects the tbf scheduler, which will silently drop GRE-encap GSO skbs
      that did not fit the remaining token quota as the segmentation does not work
      when device supports corresponding hw offload capabilities.
      
      Cc: Pravin B Shelar <pshelar@nicira.com>
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1e16aa3d
  18. 02 10月, 2014 1 次提交
  19. 26 9月, 2014 1 次提交
  20. 10 9月, 2014 2 次提交
  21. 17 7月, 2014 1 次提交
  22. 05 6月, 2014 2 次提交
  23. 24 5月, 2014 1 次提交
  24. 15 5月, 2014 1 次提交
  25. 09 5月, 2014 2 次提交
  26. 08 5月, 2014 1 次提交
    • W
      net: clean up snmp stats code · 698365fa
      WANG Cong 提交于
      commit 8f0ea0fe (snmp: reduce percpu needs by 50%)
      reduced snmp array size to 1, so technically it doesn't have to be
      an array any more. What's more, after the following commit:
      
      	commit 933393f5
      	Date:   Thu Dec 22 11:58:51 2011 -0600
      
      	    percpu: Remove irqsafe_cpu_xxx variants
      
      	    We simply say that regular this_cpu use must be safe regardless of
      	    preemption and interrupt state.  That has no material change for x86
      	    and s390 implementations of this_cpu operations.  However, arches that
      	    do not provide their own implementation for this_cpu operations will
      	    now get code generated that disables interrupts instead of preemption.
      
      probably no arch wants to have SNMP_ARRAY_SZ == 2. At least after
      almost 3 years, no one complains.
      
      So, just convert the array to a single pointer and remove snmp_mib_init()
      and snmp_mib_free() as well.
      
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: David S. Miller <davem@davemloft.net>
      Signed-off-by: NCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      698365fa
  27. 15 3月, 2014 1 次提交
  28. 26 2月, 2014 1 次提交
    • H
      ipv4: ipv6: better estimate tunnel header cut for correct ufo handling · 91a48a2e
      Hannes Frederic Sowa 提交于
      Currently the UFO fragmentation process does not correctly handle inner
      UDP frames.
      
      (The following tcpdumps are captured on the parent interface with ufo
      disabled while tunnel has ufo enabled, 2000 bytes payload, mtu 1280,
      both sit device):
      
      IPv6:
      16:39:10.031613 IP (tos 0x0, ttl 64, id 3208, offset 0, flags [DF], proto IPv6 (41), length 1300)
          192.168.122.151 > 1.1.1.1: IP6 (hlim 64, next-header Fragment (44) payload length: 1240) 2001::1 > 2001::8: frag (0x00000001:0|1232) 44883 > distinct: UDP, length 2000
      16:39:10.031709 IP (tos 0x0, ttl 64, id 3209, offset 0, flags [DF], proto IPv6 (41), length 844)
          192.168.122.151 > 1.1.1.1: IP6 (hlim 64, next-header Fragment (44) payload length: 784) 2001::1 > 2001::8: frag (0x00000001:0|776) 58979 > 46366: UDP, length 5471
      
      We can see that fragmentation header offset is not correctly updated.
      (fragmentation id handling is corrected by 916e4cf4 ("ipv6: reuse
      ip6_frag_id from ip6_ufo_append_data")).
      
      IPv4:
      16:39:57.737761 IP (tos 0x0, ttl 64, id 3209, offset 0, flags [DF], proto IPIP (4), length 1296)
          192.168.122.151 > 1.1.1.1: IP (tos 0x0, ttl 64, id 57034, offset 0, flags [none], proto UDP (17), length 1276)
          192.168.99.1.35961 > 192.168.99.2.distinct: UDP, length 2000
      16:39:57.738028 IP (tos 0x0, ttl 64, id 3210, offset 0, flags [DF], proto IPIP (4), length 792)
          192.168.122.151 > 1.1.1.1: IP (tos 0x0, ttl 64, id 57035, offset 0, flags [none], proto UDP (17), length 772)
          192.168.99.1.13531 > 192.168.99.2.20653: UDP, length 51109
      
      In this case fragmentation id is incremented and offset is not updated.
      
      First, I aligned inet_gso_segment and ipv6_gso_segment:
      * align naming of flags
      * ipv6_gso_segment: setting skb->encapsulation is unnecessary, as we
        always ensure that the state of this flag is left untouched when
        returning from upper gso segmenation function
      * ipv6_gso_segment: move skb_reset_inner_headers below updating the
        fragmentation header data, we don't care for updating fragmentation
        header data
      * remove currently unneeded comment indicating skb->encapsulation might
        get changed by upper gso_segment callback (gre and udp-tunnel reset
        encapsulation after segmentation on each fragment)
      
      If we encounter an IPIP or SIT gso skb we now check for the protocol ==
      IPPROTO_UDP and that we at least have already traversed another ip(6)
      protocol header.
      
      The reason why we have to special case GSO_IPIP and GSO_SIT is that
      we reset skb->encapsulation to 0 while skb_mac_gso_segment the inner
      protocol of GSO_UDP_TUNNEL or GSO_GRE packets.
      Reported-by: NWolfgang Walter <linux@stwm.de>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      91a48a2e
  29. 14 1月, 2014 1 次提交
    • H
      ipv4: introduce hardened ip_no_pmtu_disc mode · 8ed1dc44
      Hannes Frederic Sowa 提交于
      This new ip_no_pmtu_disc mode only allowes fragmentation-needed errors
      to be honored by protocols which do more stringent validation on the
      ICMP's packet payload. This knob is useful for people who e.g. want to
      run an unmodified DNS server in a namespace where they need to use pmtu
      for TCP connections (as they are used for zone transfers or fallback
      for requests) but don't want to use possibly spoofed UDP pmtu information.
      
      Currently the whitelisted protocols are TCP, SCTP and DCCP as they check
      if the returned packet is in the window or if the association is valid.
      
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: John Heffner <johnwheffner@gmail.com>
      Suggested-by: NFlorian Weimer <fweimer@redhat.com>
      Signed-off-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8ed1dc44
  30. 08 1月, 2014 1 次提交
    • J
      net-gre-gro: Add GRE support to the GRO stack · bf5a755f
      Jerry Chu 提交于
      This patch built on top of Commit 299603e8
      ("net-gro: Prepare GRO stack for the upcoming tunneling support") to add
      the support of the standard GRE (RFC1701/RFC2784/RFC2890) to the GRO
      stack. It also serves as an example for supporting other encapsulation
      protocols in the GRO stack in the future.
      
      The patch supports version 0 and all the flags (key, csum, seq#) but
      will flush any pkt with the S (seq#) flag. This is because the S flag
      is not support by GSO, and a GRO pkt may end up in the forwarding path,
      thus requiring GSO support to break it up correctly.
      
      Currently the "packet_offload" structure only contains L3 (ETH_P_IP/
      ETH_P_IPV6) GRO offload support so the encapped pkts are limited to
      IP pkts (i.e., w/o L2 hdr). But support for other protocol type can
      be easily added, so is the support for GRE variations like NVGRE.
      
      The patch also support csum offload. Specifically if the csum flag is on
      and the h/w is capable of checksumming the payload (CHECKSUM_COMPLETE),
      the code will take advantage of the csum computed by the h/w when
      validating the GRE csum.
      
      Note that commit 60769a5d "ipv4: gre:
      add GRO capability" already introduces GRO capability to IPv4 GRE
      tunnels, using the gro_cells infrastructure. But GRO is done after
      GRE hdr has been removed (i.e., decapped). The following patch applies
      GRO when pkts first come in (before hitting the GRE tunnel code). There
      is some performance advantage for applying GRO as early as possible.
      Also this approach is transparent to other subsystem like Open vSwitch
      where GRE decap is handled outside of the IP stack hence making it
      harder for the gro_cells stuff to apply. On the other hand, some NICs
      are still not capable of hashing on the inner hdr of a GRE pkt (RSS).
      In that case the GRO processing of pkts from the same remote host will
      all happen on the same CPU and the performance may be suboptimal.
      
      I'm including some rough preliminary performance numbers below. Note
      that the performance will be highly dependent on traffic load, mix as
      usual. Moreover it also depends on NIC offload features hence the
      following is by no means a comprehesive study. Local testing and tuning
      will be needed to decide the best setting.
      
      All tests spawned 50 copies of netperf TCP_STREAM and ran for 30 secs.
      (super_netperf 50 -H 192.168.1.18 -l 30)
      
      An IP GRE tunnel with only the key flag on (e.g., ip tunnel add gre1
      mode gre local 10.246.17.18 remote 10.246.17.17 ttl 255 key 123)
      is configured.
      
      The GRO support for pkts AFTER decap are controlled through the device
      feature of the GRE device (e.g., ethtool -K gre1 gro on/off).
      
      1.1 ethtool -K gre1 gro off; ethtool -K eth0 gro off
      thruput: 9.16Gbps
      CPU utilization: 19%
      
      1.2 ethtool -K gre1 gro on; ethtool -K eth0 gro off
      thruput: 5.9Gbps
      CPU utilization: 15%
      
      1.3 ethtool -K gre1 gro off; ethtool -K eth0 gro on
      thruput: 9.26Gbps
      CPU utilization: 12-13%
      
      1.4 ethtool -K gre1 gro on; ethtool -K eth0 gro on
      thruput: 9.26Gbps
      CPU utilization: 10%
      
      The following tests were performed on a different NIC that is capable of
      csum offload. I.e., the h/w is capable of computing IP payload csum
      (CHECKSUM_COMPLETE).
      
      2.1 ethtool -K gre1 gro on (hence will use gro_cells)
      
      2.1.1 ethtool -K eth0 gro off; csum offload disabled
      thruput: 8.53Gbps
      CPU utilization: 9%
      
      2.1.2 ethtool -K eth0 gro off; csum offload enabled
      thruput: 8.97Gbps
      CPU utilization: 7-8%
      
      2.1.3 ethtool -K eth0 gro on; csum offload disabled
      thruput: 8.83Gbps
      CPU utilization: 5-6%
      
      2.1.4 ethtool -K eth0 gro on; csum offload enabled
      thruput: 8.98Gbps
      CPU utilization: 5%
      
      2.2 ethtool -K gre1 gro off
      
      2.2.1 ethtool -K eth0 gro off; csum offload disabled
      thruput: 5.93Gbps
      CPU utilization: 9%
      
      2.2.2 ethtool -K eth0 gro off; csum offload enabled
      thruput: 5.62Gbps
      CPU utilization: 8%
      
      2.2.3 ethtool -K eth0 gro on; csum offload disabled
      thruput: 7.69Gbps
      CPU utilization: 8%
      
      2.2.4 ethtool -K eth0 gro on; csum offload enabled
      thruput: 8.96Gbps
      CPU utilization: 5-6%
      Signed-off-by: NH.K. Jerry Chu <hkchu@google.com>
      Reviewed-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bf5a755f
  31. 19 12月, 2013 1 次提交
  32. 13 12月, 2013 1 次提交
    • J
      net-gro: Prepare GRO stack for the upcoming tunneling support · 299603e8
      Jerry Chu 提交于
      This patch modifies the GRO stack to avoid the use of "network_header"
      and associated macros like ip_hdr() and ipv6_hdr() in order to allow
      an arbitary number of IP hdrs (v4 or v6) to be used in the
      encapsulation chain. This lays the foundation for various IP
      tunneling support (IP-in-IP, GRE, VXLAN, SIT,...) to be added later.
      
      With this patch, the GRO stack traversing now is mostly based on
      skb_gro_offset rather than special hdr offsets saved in skb (e.g.,
      skb->network_header). As a result all but the top layer (i.e., the
      the transport layer) must have hdrs of the same length in order for
      a pkt to be considered for aggregation. Therefore when adding a new
      encap layer (e.g., for tunneling), one must check and skip flows
      (e.g., by setting NAPI_GRO_CB(p)->same_flow to 0) that have a
      different hdr length.
      
      Note that unlike the network header, the transport header can and
      will continue to be set by the GRO code since there will be at
      most one "transport layer" in the encap chain.
      Signed-off-by: NH.K. Jerry Chu <hkchu@google.com>
      Suggested-by: NEric Dumazet <edumazet@google.com>
      Reviewed-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      299603e8