1. 03 Dec 2014, 4 commits
  2. 27 Nov 2014, 2 commits
  3. 24 Nov 2014, 3 commits
  4. 22 Nov 2014, 5 commits
  5. 20 Nov 2014, 3 commits
  6. 19 Nov 2014, 3 commits
  7. 17 Nov 2014, 1 commit
    • net: provide a per host RSS key generic infrastructure · 960fb622
      Committed by Eric Dumazet
      RSS (Receive Side Scaling) typically uses a Toeplitz hash and a 40- or
      52-byte RSS key.
      
      Some drivers use a constant (and well-known) key, some drivers use a random
      key per port, making bonding setups hard to tune. Well-known keys increase
      the attack surface, considering that the number of queues is usually a
      power of two.
      
      This patch provides infrastructure to help drivers do the right thing.
      
      netdev_rss_key_fill() should be used by drivers to initialize their RSS key,
      even if they provide ethtool -X support to let users redefine the key later.
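      
      As a rough sketch of the intended driver-side usage (the mydrv_* names
      below are hypothetical, not part of this patch), a driver fills its key
      from the per-host key at initialization time:
      
       #include <linux/netdevice.h>
      
       #define MYDRV_RSS_KEY_SIZE 40               /* Toeplitz key size in bytes */
      
       struct mydrv_priv {
               u8 rss_key[MYDRV_RSS_KEY_SIZE];     /* hypothetical per-port key */
       };
      
       static void mydrv_init_rss(struct mydrv_priv *priv)
       {
               /* Fill the key from the per-host RSS key instead of using a
                * constant or a fresh random key per port. */
               netdev_rss_key_fill(priv->rss_key, sizeof(priv->rss_key));
      
               /* ... then program priv->rss_key into the NIC registers ... */
       }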
      
      A new /proc/sys/net/core/netdev_rss_key file can be used to get the host
      RSS key even for drivers not providing ethtool -x support, in case some
      applications want to set up flows precisely to match some RX queues.
      
      Tested:
      
      myhost:~# cat /proc/sys/net/core/netdev_rss_key
      11:63:99:bb:79:fb:a5:a7:07:45:b2:20:bf:02:42:2d:08:1a:dd:19:2b:6b:23:ac:56:28:9d:70:c3:ac:e8:16:4b:b7:c1:10:53:a4:78:41:36:40:74:b6:15:ca:27:44:aa:b3:4d:72
      
      myhost:~# ethtool -x eth0
      RX flow hash indirection table for eth0 with 8 RX ring(s):
          0:      0     1     2     3     4     5     6     7
      RSS hash key:
      11:63:99:bb:79:fb:a5:a7:07:45:b2:20:bf:02:42:2d:08:1a:dd:19:2b:6b:23:ac:56:28:9d:70:c3:ac:e8:16:4b:b7:c1:10:53:a4:78:41
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      960fb622
  8. 14 Nov 2014, 1 commit
    • net: generic dev_disable_lro() stacked device handling · fbe168ba
      Committed by Michal Kubeček
      Large receive offloading is known to cause problems if received packets
      are passed to another host. Therefore the kernel disables it by calling
      dev_disable_lro() whenever a network device is enslaved in a bridge or
      forwarding is enabled for it (or globally). For virtual devices we need
      to disable LRO on the underlying physical device (which is actually
      receiving the packets).
      
      The current dev_disable_lro() code handles this propagation for a vlan
      (including 802.1ad nested vlan), a macvlan, or a vlan on top of a macvlan.
      It doesn't handle other stacked devices and their combinations, in
      particular propagation from a bond to its slaves, which often causes
      problems in virtualization setups.
      
      As we now have generic data structures describing the upper-lower device
      relationship, dev_disable_lro() can be generalized to disable LRO also
      for all lower devices (if any) once it is disabled for the device
      itself.
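      
      In rough outline (a sketch of the generalized helper, not the literal
      patch), the function can recurse over the lower-device list after
      disabling LRO locally:
      
       void dev_disable_lro(struct net_device *dev)
       {
               struct net_device *lower_dev;
               struct list_head *iter;
      
               /* Disable LRO on the device itself, as before. */
               dev->wanted_features &= ~NETIF_F_LRO;
               netdev_update_features(dev);
      
               /* New: propagate to every lower device, recursively. */
               netdev_for_each_lower_dev(dev, lower_dev, iter)
                       dev_disable_lro(lower_dev);
       }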
      
      For bonding and teaming devices, it is necessary to disable LRO not only
      on the current slaves at the moment dev_disable_lro() is called but
      also on any slave (port) added later.
      
      v2: use lower device links for all devices (including vlan and macvlan)
      Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
      Acked-by: Veaceslav Falico <vfalico@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      fbe168ba
  9. 12 Nov 2014, 3 commits
    • neigh: remove dynamic neigh table registration support · d7480fd3
      Committed by WANG Cong
      Currently there are only three neigh tables in the whole kernel: the
      arp table, the ndisc table, and the decnet neigh table. What's more,
      we don't support registering multiple tables per family. Therefore we
      can just make these tables statically built in.
      
      Cc: David S. Miller <davem@davemloft.net>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d7480fd3
    • net: Convert LIMIT_NETDEBUG to net_dbg_ratelimited · ba7a46f1
      Committed by Joe Perches
      Use the more common dynamic_debug-capable net_dbg_ratelimited
      and remove the LIMIT_NETDEBUG macro.
      
      All messages are still ratelimited.
      
      Some KERN_<LEVEL> uses are changed to KERN_DEBUG.
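      
      A representative conversion (the call site below is illustrative, not a
      specific file in the patch):
      
       /* Before: ratelimited printk gated by the net_msg_warn sysctl. */
       LIMIT_NETDEBUG(KERN_DEBUG "UDP: short packet: %d\n", len);
      
       /* After: still ratelimited, dynamic_debug capable, debug level implied. */
       net_dbg_ratelimited("UDP: short packet: %d\n", len);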
      
      This may have some negative impact on messages that were
      emitted at KERN_INFO, since they are now not enabled at all unless
      DEBUG is defined or dynamic_debug is enabled; in other words,
      these messages are no longer emitted by default.
      
      This also eliminates the use of the net_msg_warn sysctl
      "/proc/sys/net/core/warnings".  For backward compatibility,
      the sysctl is not removed, but it has no function.  The extern
      declaration of net_msg_warn is removed from sock.h and made
      static in net/core/sysctl_net_core.c.
      
      Miscellanea:
      
      o Update the sysctl documentation
      o Remove the embedded uses of pr_fmt
      o Coalesce format fragments
      o Realign arguments
      Signed-off-by: Joe Perches <joe@perches.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ba7a46f1
    • net: introduce SO_INCOMING_CPU · 2c8c56e1
      Committed by Eric Dumazet
      An alternative to RPS/RFS is to use hardware support for multiple
      queues.
      
      Then split a set of millions of sockets across worker threads, each
      one using epoll() to manage events on its own socket pool.
      
      Ideally, we want one thread per RX/TX queue/cpu, but we have no way to
      know after accept() or connect() on which queue/cpu a socket is managed.
      
      We normally use one cpu per RX queue (IRQ smp_affinity being properly
      set), so remembering on the socket structure which cpu delivered the last
      packet is enough to solve the problem.
      
      After accept(), connect(), or even file descriptor passing between
      processes, applications can use:
      
       int cpu;
       socklen_t len = sizeof(cpu);
      
       getsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, &cpu, &len);
      
      and use this information to put the socket into the right silo
      for optimal performance, as the whole networking stack should then run
      on the appropriate cpu, without the need to send IPIs (as RPS/RFS does).
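      
      For instance (a hypothetical sketch; worker_for_cpu(), default_worker()
      and add_fd_to_worker() are application helpers, not kernel API), an
      acceptor thread could hand each new socket to the worker pinned to the
      reported cpu:
      
       #include <sys/socket.h>
      
       static void dispatch(int fd)
       {
               int cpu = -1;
               socklen_t len = sizeof(cpu);
      
               if (getsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, &cpu, &len) == 0)
                       add_fd_to_worker(worker_for_cpu(cpu), fd);   /* hypothetical */
               else
                       add_fd_to_worker(default_worker(), fd);      /* hypothetical */
       }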
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2c8c56e1
  10. 11 Nov 2014, 1 commit
    • net: gro: add a per device gro flush timer · 3b47d303
      Committed by Eric Dumazet
      Tuning coalescing parameters on a NIC can be really hard.
      
      Servers can handle both bulk and RPC-like traffic, with conflicting
      goals: bulk flows want GRO packets as big as possible, RPC wants minimal
      latencies.
      
      To get big GRO packets on a 10GbE NIC, one can use:
      
      ethtool -C eth0 rx-usecs 4 rx-frames 44
      
      But this penalizes RPC sessions, increasing latencies by up to
      50% in some cases, as NICs generally do not force an interrupt when
      a packet with the TCP PSH flag is received.
      
      Some NICs do not have an absolute timer, only a timer rearmed for every
      incoming packet.
      
      This patch uses a different strategy: let the GRO stack decide what to do,
      based on the traffic pattern.
      
      Packets with the PSH flag won't be delayed.
      Packets without the PSH flag might be held in the GRO engine, if we keep
      receiving data.
      
      This new mechanism is off by default, and can be enabled by setting
      /sys/class/net/ethX/gro_flush_timeout to a value in nanoseconds.
      
      To fully enable this mechanism, drivers should use napi_complete_done()
      instead of napi_complete().
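      
      A hypothetical driver poll handler using it (the mydrv_* names are
      illustrative, not part of this patch):
      
       static int mydrv_poll(struct napi_struct *napi, int budget)
       {
               int work_done = mydrv_clean_rx(napi, budget);   /* hypothetical RX cleanup */
      
               if (work_done < budget) {
                       /* Report how much was done so the core can decide whether
                        * to flush GRO now or arm the gro_flush_timeout timer. */
                       napi_complete_done(napi, work_done);
                       mydrv_enable_irq(napi);                  /* hypothetical re-enable */
               }
               return work_done;
       }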
      
      Tested:
       Ran 200 netperf TCP_STREAM from A to B (10GbE mlx4 link, 8 RX queues)
      
      Without this feature, we send back about 305,000 ACKs per second.
      
      GRO aggregation ratio is low (811/305 = 2.65 segments per GRO packet).
      
      Setting a timer of 2000 nsec is enough to increase GRO packet sizes
      and reduce the number of ACK packets (811/19.2 = 42 segments per GRO packet).
      
      The receiver performs fewer calls to upper stacks and fewer wakeups.
      This also reduces cpu usage on the sender, as it receives fewer ACK
      packets.
      
      Note that reducing the number of wakeups increases cpu efficiency, but can
      decrease QPS, as applications won't get the chance to warm up cpu caches
      by doing a partial read of RPC requests/answers if they fit in one skb.
      
      B:~# sar -n DEV 1 10 | grep eth0 | tail -1
      Average:         eth0 811269.80 305732.30 1199462.57  19705.72      0.00      0.00      0.50
      
      B:~# echo 2000 >/sys/class/net/eth0/gro_flush_timeout
      
      B:~# sar -n DEV 1 10 | grep eth0 | tail -1
      Average:         eth0 811577.30  19230.80 1199916.51   1239.80      0.00      0.00      0.50
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3b47d303
  11. 08 Nov 2014, 2 commits
  12. 06 Nov 2014, 4 commits
  13. 04 Nov 2014, 1 commit
    • net: less interrupt masking in NAPI · d75b1ade
      Committed by Eric Dumazet
      net_rx_action() can mask irqs a single time to transfer sd->poll_list
      into a private list, for a very short duration.
      
      Then, napi_complete() can avoid masking irqs again,
      and net_rx_action() only needs to mask irqs again in the slow path.
      
      This patch removes two pairs of irq mask/unmask operations per typical
      NAPI run, more if multiple NAPIs were triggered.
      
      Note this also allows giving control back to the caller (do_softirq())
      more often, so that other softirq handlers can be called a bit earlier,
      or ksoftirqd can be woken up earlier under pressure.
      
      This was developed while testing an alternative to RX interrupt
      mitigation to reduce latencies while keeping or improving GRO
      aggregation on fast NICs.
      
      The idea is to test napi->gro_list at the end of a napi->poll() and
      reschedule one NAPI poll, but after servicing a full round of
      softirqs (timers, TX, rcu, ...). This is allowed only if softirqs are
      currently serviced by the idle task or ksoftirqd, and a resched is not needed.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d75b1ade
  14. 01 Nov 2014, 1 commit
  15. 30 Oct 2014, 3 commits
    • neigh: optimize neigh_parms_release() · 75fbfd33
      Committed by Nicolas Dichtel
      In neigh_parms_release() we loop over all entries to find the entry given
      as an argument so that it can be removed from the list. By using a doubly
      linked list, we can avoid this loop.
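      
      A simplified sketch of the idea (field and function names here are
      illustrative, not the literal patch): once the parms entry carries its
      own list_head, removal becomes a constant-time list_del() instead of a
      walk over the whole list:
      
       struct neigh_parms {
               struct list_head list;          /* linked into the table's parms list */
               /* ... */
       };
      
       static void parms_release(struct neigh_table *tbl, struct neigh_parms *parms)
       {
               write_lock_bh(&tbl->lock);
               list_del(&parms->list);         /* O(1): no loop over all entries */
               write_unlock_bh(&tbl->lock);
       }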
      
      Here are some numbers with 30 000 dummy interfaces configured:
      
      Before the patch:
      $ time rmmod dummy
      real	2m0.118s
      user	0m0.000s
      sys	1m50.048s
      
      After the patch:
      $ time rmmod dummy
      real	1m9.970s
      user	0m0.000s
      sys	0m47.976s
      Suggested-by: Thierry Herbelot <thierry.herbelot@6wind.com>
      Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      75fbfd33
    • net: introduce napi_schedule_irqoff() · bc9ad166
      Committed by Eric Dumazet
      napi_schedule() can be called from any context and has to mask hard
      irqs.
      
      Add a variant that can only be called from hard interrupt handlers
      or when irqs are already masked.
      
      Many NIC drivers can use it from their hard IRQ handler instead of the
      generic variant.
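      
      For example (a sketch; mydrv_intr is hypothetical), a per-queue interrupt
      handler can simply do:
      
       static irqreturn_t mydrv_intr(int irq, void *data)
       {
               struct napi_struct *napi = data;
      
               /* Hard irqs are already disabled in this context, so the
                * cheaper variant can be used instead of napi_schedule(). */
               napi_schedule_irqoff(napi);
               return IRQ_HANDLED;
       }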
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bc9ad166
    • net: skb_segment() should preserve backpressure · 432c856f
      Committed by Toshiaki Makita
      This patch generalizes commit d6a4a104 ("tcp: GSO should be TSQ
      friendly") to protocols using skb_set_owner_w().
      
      TCP uses its own destructor (tcp_wfree) and needs a more complex scheme,
      as explained in commit 6ff50cd5 ("tcp: gso: do not generate out of
      order packets").
      
      This allows UDP sockets using UFO to get proper backpressure,
      thus avoiding qdisc drops and excessive cpu usage.
      
      Here are performance test results (macvlan on vlan):
      
      - Before
      # netperf -t UDP_STREAM ...
      Socket  Message  Elapsed      Messages
      Size    Size     Time         Okay Errors   Throughput
      bytes   bytes    secs            #      #   10^6bits/sec
      
      212992   65507   60.00      144096 1224195    1258.56
      212992           60.00          51              0.45
      
      Average:        CPU     %user     %nice   %system   %iowait    %steal     %idle
      Average:        all      0.23      0.00     25.26      0.08      0.00     74.43
      
      - After
      # netperf -t UDP_STREAM ...
      Socket  Message  Elapsed      Messages
      Size    Size     Time         Okay Errors   Throughput
      bytes   bytes    secs            #      #   10^6bits/sec
      
      212992   65507   60.00      109593      0     957.20
      212992           60.00      109593            957.20
      
      Average:        CPU     %user     %nice   %system   %iowait    %steal     %idle
      Average:        all      0.18      0.00      8.38      0.02      0.00     91.43
      
      [edumazet] Rewrote patch and changelog.
      Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      432c856f
  16. 27 Oct 2014, 1 commit
  17. 23 Oct 2014, 1 commit
  18. 21 Oct 2014, 1 commit