1. 12 Nov 2014, 2 commits
    • net: Convert LIMIT_NETDEBUG to net_dbg_ratelimited · ba7a46f1
      Committed by Joe Perches
      Use the more common dynamic_debug capable net_dbg_ratelimited
      and remove the LIMIT_NETDEBUG macro.
      
      All messages are still ratelimited.
      
      Some KERN_<LEVEL> uses are changed to KERN_DEBUG.
      
      This may have some negative impact on messages that were
      emitted at KERN_INFO, as they are now not enabled at all unless
      DEBUG is defined or dynamic_debug is enabled.  Even so,
      these messages are now _not_ emitted by default.
      
      This also eliminates the use of the net_msg_warn sysctl
      "/proc/sys/net/core/warnings".  For backward compatibility,
      the sysctl is not removed, but it has no function.  The extern
      declaration of net_msg_warn is removed from sock.h and made
      static in net/core/sysctl_net_core.c.
      
      Miscellanea:
      
      o Update the sysctl documentation
      o Remove the embedded uses of pr_fmt
      o Coalesce format fragments
      o Realign arguments
      Signed-off-by: Joe Perches <joe@perches.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ba7a46f1
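      A minimal sketch of the conversion described above; the call site and
      message are hypothetical, but net_dbg_ratelimited() is the macro from
      include/linux/net.h:

      #include <linux/net.h>      /* net_dbg_ratelimited() */
      #include <linux/skbuff.h>

      /* Hypothetical call site.  Before the conversion this would have been:
       *
       *     LIMIT_NETDEBUG(KERN_DEBUG "example: bad checksum\n");
       *
       * After the conversion the message is still ratelimited, but it is
       * only emitted when DEBUG is defined or dynamic_debug enables it.
       */
      static void example_rx_error(const struct sk_buff *skb)
      {
              net_dbg_ratelimited("example: bad checksum, len %u\n", skb->len);
      }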
    • net: introduce SO_INCOMING_CPU · 2c8c56e1
      Committed by Eric Dumazet
      An alternative to RPS/RFS is to use hardware support for multiple
      queues.
      
      Then one can split a set of millions of sockets among worker
      threads, each one using epoll() to manage events on its own socket
      pool.
      
      Ideally, we want one thread per RX/TX queue/cpu, but we have no way to
      know after accept() or connect() on which queue/cpu a socket is managed.
      
      We normally use one cpu per RX queue (IRQ smp_affinity being properly
      set), so remembering on the socket structure which cpu delivered the
      last packet is enough to solve the problem.
      
      After accept(), connect(), or even file descriptor passing around
      processes, applications can use:
      
       int cpu;
       socklen_t len = sizeof(cpu);
      
       getsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, &cpu, &len);
      
      And use this information to put the socket into the right silo
      for optimal performance, as the whole networking stack should run
      on the appropriate cpu, without the need to send IPIs (RPS/RFS).
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2c8c56e1
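      A slightly fuller sketch of the pattern described above; the
      dispatcher is a hypothetical application function, and the fallback
      #define assumes the value of SO_INCOMING_CPU used on common Linux
      ABIs for the benefit of older headers:

      #include <stdio.h>
      #include <sys/socket.h>

      #ifndef SO_INCOMING_CPU
      #define SO_INCOMING_CPU 49      /* assumption: common Linux ABI value */
      #endif

      /* Hypothetical dispatcher: hand the fd to the worker pinned on 'cpu'. */
      static int dispatch_to_worker(int fd, int cpu)
      {
              printf("fd %d -> worker for cpu %d\n", fd, cpu);
              return 0;
      }

      /* After accept()/connect(), ask which cpu handled the last packet for
       * this socket and put the socket into the matching silo.
       */
      static int route_connection(int fd)
      {
              int cpu;
              socklen_t len = sizeof(cpu);

              if (getsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, &cpu, &len) < 0)
                      return -1;
              return dispatch_to_worker(fd, cpu);
      }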
  2. 11 Nov 2014, 1 commit
    • net: gro: add a per device gro flush timer · 3b47d303
      Committed by Eric Dumazet
      Tuning coalescing parameters on a NIC can be really hard.
      
      Servers can handle both bulk and RPC-like traffic, with conflicting
      goals: bulk flows want GRO packets as big as possible, RPC wants
      minimal latencies.
      
      To reach big GRO packets on a 10GbE NIC, one can use:
      
      ethtool -C eth0 rx-usecs 4 rx-frames 44
      
      But this penalizes RPC sessions, increasing latencies by up to
      50% in some cases, as NICs generally do not force an interrupt when
      a packet with the TCP Push flag is received.
      
      Some NICs do not have an absolute timer, only a timer rearmed for every
      incoming packet.
      
      This patch uses a different strategy: let the GRO stack decide what
      to do, based on the traffic pattern.
      
      Packets with the Push flag won't be delayed.
      Packets without the Push flag might be held in the GRO engine, if we
      keep receiving data.
      
      This new mechanism is off by default, and shall be enabled by setting
      /sys/class/net/ethX/gro_flush_timeout to a value in nanoseconds.
      
      To fully enable this mechanism, drivers should use napi_complete_done()
      instead of napi_complete().
      
      Tested:
       Ran 200 netperf TCP_STREAM from A to B (10Gbe mlx4 link, 8 RX queues)
      
      Without this feature, we send back about 305,000 ACK per second.
      
      GRO aggregation ratio is low (811/305 = 2.65 segments per GRO packet)
      
      Setting a timer of 2000 nsec is enough to increase GRO packet sizes
      and reduce the number of ACK packets (811/19.2 = 42).
      
      The receiver performs fewer calls into the upper stacks and fewer
      wakeups. This also reduces cpu usage on the sender, as it receives
      fewer ACK packets.
      
      Note that reducing the number of wakeups increases cpu efficiency, but
      can decrease QPS, as applications won't have the chance to warm up cpu
      caches by doing a partial read of RPC requests/answers if they fit in
      one skb.
      
      B:~# sar -n DEV 1 10 | grep eth0 | tail -1
      Average:         eth0 811269.80 305732.30 1199462.57  19705.72      0.00      0.00      0.50

      B:~# echo 2000 >/sys/class/net/eth0/gro_flush_timeout

      B:~# sar -n DEV 1 10 | grep eth0 | tail -1
      Average:         eth0 811577.30  19230.80 1199916.51   1239.80      0.00      0.00      0.50
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3b47d303
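      A minimal sketch of the driver-side change mentioned above; the
      ring-cleaning and irq helpers are hypothetical, only the napi_* calls
      are the real API:

      #include <linux/netdevice.h>

      /* Hypothetical helpers provided elsewhere by the driver. */
      int example_clean_rx_ring(struct napi_struct *napi, int budget);
      void example_enable_rx_irq(struct napi_struct *napi);

      /* Report the amount of work done, so the GRO engine can decide, based
       * on gro_flush_timeout, whether to keep partially built GRO packets a
       * little longer instead of flushing them immediately.
       */
      static int example_napi_poll(struct napi_struct *napi, int budget)
      {
              int work_done = example_clean_rx_ring(napi, budget);

              if (work_done < budget) {
                      /* was: napi_complete(napi); */
                      napi_complete_done(napi, work_done);
                      example_enable_rx_irq(napi);
              }
              return work_done;
      }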
  3. 08 Nov 2014, 2 commits
  4. 06 Nov 2014, 4 commits
  5. 04 Nov 2014, 1 commit
    • net: less interrupt masking in NAPI · d75b1ade
      Committed by Eric Dumazet
      net_rx_action() can mask irqs a single time to transfer sd->poll_list
      onto a private list, for a very short duration.
      
      Then, napi_complete() can avoid masking irqs again,
      and net_rx_action() only needs to mask irqs again in the slow path.
      
      This patch removes two pairs of irq mask/unmask operations per typical
      NAPI run, more if multiple NAPI instances were triggered.
      
      Note this also allows giving control back to the caller (do_softirq())
      more often, so that other softirq handlers can be called a bit earlier,
      or ksoftirqd can be woken up earlier under pressure.
      
      This was developed while testing an alternative to RX interrupt
      mitigation to reduce latencies while keeping or improving GRO
      aggregation on fast NICs.
      
      The idea is to test napi->gro_list at the end of a napi->poll() and
      reschedule one NAPI poll, but only after servicing a full round of
      softirqs (timers, TX, rcu, ...). This would be allowed only if the
      softirq is currently serviced by the idle task or ksoftirqd, and no
      reschedule is needed.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d75b1ade
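      A simplified sketch of the pattern described in this changelog, not
      the actual net/core/dev.c code (budget handling, state flags and
      rescheduling are omitted): irqs are masked once to splice sd->poll_list
      onto a private list, and polling then runs with irqs enabled:

      #include <linux/interrupt.h>
      #include <linux/list.h>
      #include <linux/netdevice.h>

      static void example_rx_action(struct softnet_data *sd)
      {
              LIST_HEAD(list);
              struct napi_struct *n;

              local_irq_disable();
              list_splice_init(&sd->poll_list, &list);   /* single irq-off section */
              local_irq_enable();

              while (!list_empty(&list)) {
                      n = list_first_entry(&list, struct napi_struct, poll_list);
                      list_del_init(&n->poll_list);
                      n->poll(n, 64);                    /* budget value illustrative */
              }
      }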
  6. 01 Nov 2014, 1 commit
  7. 30 Oct 2014, 3 commits
    • neigh: optimize neigh_parms_release() · 75fbfd33
      Committed by Nicolas Dichtel
      In neigh_parms_release() we loop over all entries to find the entry given
      as argument so that we can remove it from the list. By using a doubly
      linked list, we can avoid this loop.
      
      Here are some numbers with 30 000 dummy interfaces configured:
      
      Before the patch:
      $ time rmmod dummy
      real	2m0.118s
      user	0m0.000s
      sys	1m50.048s
      
      After the patch:
      $ time rmmod dummy
      real	1m9.970s
      user	0m0.000s
      sys	0m47.976s
      Suggested-by: Thierry Herbelot <thierry.herbelot@6wind.com>
      Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      75fbfd33
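      A sketch of the general technique, with illustrative names rather than
      the real neighbour-table structures: keeping entries on a list_head
      makes release O(1) via list_del() instead of a search loop:

      #include <linux/list.h>
      #include <linux/slab.h>

      struct example_parms {
              struct list_head list;  /* doubly linked: removal needs no search */
              int value;
      };

      static LIST_HEAD(example_parms_list);

      static void example_parms_release(struct example_parms *p)
      {
              /* With a singly linked list we would have to walk all entries to
               * find the predecessor; list_del() unlinks in constant time.
               */
              list_del(&p->list);
              kfree(p);
      }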
    • net: introduce napi_schedule_irqoff() · bc9ad166
      Committed by Eric Dumazet
      napi_schedule() can be called from any context and has to mask hard
      irqs.
      
      Add a variant that can only be called from hard interrupt handlers
      or when irqs are already masked.
      
      Many NIC drivers can use it from their hard IRQ handler instead of
      the generic variant.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bc9ad166
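      A sketch of a hard-IRQ handler using the new helper (the driver
      structure is hypothetical): since hard irqs are already masked in this
      context, napi_schedule_irqoff() skips the local_irq_save()/restore
      that napi_schedule() would do:

      #include <linux/interrupt.h>
      #include <linux/netdevice.h>

      struct example_priv {
              struct napi_struct napi;        /* hypothetical driver private data */
      };

      static irqreturn_t example_hard_irq(int irq, void *dev_id)
      {
              struct example_priv *priv = dev_id;

              /* Safe and cheaper than napi_schedule(): irqs are already off. */
              napi_schedule_irqoff(&priv->napi);
              return IRQ_HANDLED;
      }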
    • net: skb_segment() should preserve backpressure · 432c856f
      Committed by Toshiaki Makita
      This patch generalizes commit d6a4a104 ("tcp: GSO should be TSQ
      friendly") to protocols using skb_set_owner_w().
      
      TCP uses its own destructor (tcp_wfree) and needs a more complex scheme
      as explained in commit 6ff50cd5 ("tcp: gso: do not generate out of
      order packets").
      
      This allows UDP sockets using UFO to get proper backpressure,
      thus avoiding qdisc drops and excessive cpu usage.
      
      Here are performance test results (macvlan on vlan):
      
      - Before
      # netperf -t UDP_STREAM ...
      Socket  Message  Elapsed      Messages
      Size    Size     Time         Okay Errors   Throughput
      bytes   bytes    secs            #      #   10^6bits/sec
      
      212992   65507   60.00      144096 1224195    1258.56
      212992           60.00          51              0.45
      
      Average:        CPU     %user     %nice   %system   %iowait    %steal     %idle
      Average:        all      0.23      0.00     25.26      0.08      0.00     74.43
      
      - After
      # netperf -t UDP_STREAM ...
      Socket  Message  Elapsed      Messages
      Size    Size     Time         Okay Errors   Throughput
      bytes   bytes    secs            #      #   10^6bits/sec
      
      212992   65507   60.00      109593      0     957.20
      212992           60.00      109593            957.20
      
      Average:        CPU     %user     %nice   %system   %iowait    %steal     %idle
      Average:        all      0.18      0.00      8.38      0.02      0.00     91.43
      
      [edumazet] Rewrote patch and changelog.
      Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      432c856f
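      For context, an illustrative snippet rather than the actual
      skb_segment() change: skb_set_owner_w() charges an skb's truesize to
      the socket's write allocation, and that accounting is what paces the
      sender until the segments are freed:

      #include <linux/skbuff.h>
      #include <net/sock.h>

      /* Illustrative only: once the skb is owned by the socket, its memory
       * counts against sk->sk_wmem_alloc until the destructor (sock_wfree)
       * runs, so a UFO sender is throttled instead of flooding the qdisc.
       */
      static void example_charge_to_socket(struct sk_buff *skb, struct sock *sk)
      {
              skb_set_owner_w(skb, sk);
      }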
  8. 27 Oct 2014, 1 commit
  9. 23 Oct 2014, 1 commit
  10. 21 Oct 2014, 1 commit
  11. 16 Oct 2014, 1 commit
    • net: Add ndo_gso_check · 04ffcb25
      Committed by Tom Herbert
      Add ndo_gso_check, which a device can define to indicate whether it
      is capable of doing GSO on a packet. This function would be called from
      the stack to determine whether software GSO needs to be done. A
      driver should populate this function if it advertises GSO types for
      which there are combinations that it wouldn't be able to handle. For
      instance, a device that performs UDP tunneling might only support
      transparent Ethernet bridging for inner packets, or might have
      limitations on the lengths of inner headers.
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      04ffcb25
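      A sketch of a driver providing the callback; the length limit and the
      header arithmetic are invented for illustration, while the callback
      shape (skb plus device, returning bool) is the one this patch adds:

      #include <linux/netdevice.h>
      #include <linux/skbuff.h>

      #define EXAMPLE_MAX_INNER_HLEN 128      /* illustrative hardware limit */

      /* Refuse hardware GSO when the encapsulated headers are longer than the
       * device can parse, so the stack falls back to software GSO.
       */
      static bool example_gso_check(struct sk_buff *skb, struct net_device *dev)
      {
              if (skb->encapsulation &&
                  skb_inner_transport_header(skb) - skb_mac_header(skb) >
                  EXAMPLE_MAX_INNER_HLEN)
                      return false;
              return true;
      }

      /* Hooked up in the driver's net_device_ops (other callbacks omitted):
       *
       *      .ndo_gso_check = example_gso_check,
       */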
  12. 15 Oct 2014, 1 commit
  13. 11 Oct 2014, 3 commits
  14. 10 Oct 2014, 1 commit
    • net: Missing @ before descriptions cause make xmldocs warning · de3f0d0e
      Committed by Masanari Iida
      This patch fixes the following warnings.
      Warning(.//net/core/skbuff.c:4142): No description found for parameter 'header_len'
      Warning(.//net/core/skbuff.c:4142): No description found for parameter 'data_len'
      Warning(.//net/core/skbuff.c:4142): No description found for parameter 'max_page_order'
      Warning(.//net/core/skbuff.c:4142): No description found for parameter 'errcode'
      Warning(.//net/core/skbuff.c:4142): No description found for parameter 'gfp_mask'
      
      Actually the descriptions exist, but the "@" is missing in front.
      
      This problem started to happen when the following commit was merged
      into Linus's tree during the 3.18-rc1 merge period:
      commit 2e4e4410 ("net: add alloc_skb_with_frags() helper")
      Signed-off-by: Masanari Iida <standby24x7@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      de3f0d0e
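      An illustration of the kind of fix involved; the parameter wording
      here is illustrative, the point is the leading "@" that kernel-doc
      needs in order to match a description to a parameter:

      /* Before: kernel-doc cannot associate the line with the parameter. */
      /**
       * alloc_skb_with_frags - allocate skb with page frags
       * header_len: size of linear part
       */

      /* After: the leading "@" lets kernel-doc pick up the description. */
      /**
       * alloc_skb_with_frags - allocate skb with page frags
       * @header_len: size of linear part
       * @data_len: needed length in frags
       * @max_page_order: max page order desired
       * @errcode: pointer to error code if any
       * @gfp_mask: allocation mask
       */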
  15. 08 Oct 2014, 1 commit
    • net: better IFF_XMIT_DST_RELEASE support · 02875878
      Committed by Eric Dumazet
      Testing xmit_more support with netperf and connected UDP sockets,
      I found strange dst refcount false sharing.
      
      Current handling of IFF_XMIT_DST_RELEASE is not optimal.
      
      Dropping the dst in validate_xmit_skb() is certainly too late in case
      the packet was queued by cpu X but dequeued by cpu Y.
      
      The logical point to take care of drop/force is in __dev_queue_xmit()
      before even taking the qdisc lock.
      
      As Julian Anastasov pointed out, the need for skb_dst() might come from
      some packet schedulers or classifiers.
      
      This patch adds a new helper to cleanly express the needs of various
      drivers or qdiscs/classifiers.
      
      Drivers that need skb_dst() in their ndo_start_xmit() should call the
      following helper in their setup code instead of the prior:
      
      	dev->priv_flags &= ~IFF_XMIT_DST_RELEASE;
      ->
      	netif_keep_dst(dev);
      
      Instead of using a single bit, we use two bits, one of which may be
      rebuilt in the bonding/team drivers.
      
      The other one is permanent and blocks IFF_XMIT_DST_RELEASE from being
      rebuilt in bonding/team. Eventually, we could add something
      smarter later.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Julian Anastasov <ja@ssi.bg>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      02875878
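      A sketch of the driver-side usage; the setup function is hypothetical,
      netif_keep_dst() is the helper this patch introduces:

      #include <linux/netdevice.h>

      /* Hypothetical setup routine for a driver whose ndo_start_xmit()
       * dereferences skb_dst().
       */
      static void example_setup(struct net_device *dev)
      {
              /* was: dev->priv_flags &= ~IFF_XMIT_DST_RELEASE; */
              netif_keep_dst(dev);
      }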
  16. 07 Oct 2014, 3 commits
  17. 06 Oct 2014, 2 commits
  18. 05 Oct 2014, 1 commit
    • net: Cleanup skb cloning by adding SKB_FCLONE_FREE · c8753d55
      Committed by Vijay Subramanian
      SKB_FCLONE_UNAVAILABLE has an overloaded meaning depending on the type of skb.
      1: If the skb is allocated from head_cache, it indicates an fclone is not available.
      2: If the skb is a companion fclone skb (allocated from fclone_cache), it indicates
      it is available to be used.
      
      To avoid confusion for case 2 above, this patch replaces
      SKB_FCLONE_UNAVAILABLE with SKB_FCLONE_FREE where appropriate. For fclone
      companion skbs, this indicates the skb is free for use.
      
      SKB_FCLONE_UNAVAILABLE will now simply indicate the skb is from head_cache
      and cannot/will not have a companion fclone.
      Signed-off-by: Vijay Subramanian <subramanian.vijay@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c8753d55
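      A sketch of the resulting states; the comments paraphrase the changelog
      above, and the authoritative layout is the enum in include/linux/skbuff.h:

      enum example_fclone_states {
              SKB_FCLONE_UNAVAILABLE, /* skb from head_cache: no companion fclone */
              SKB_FCLONE_ORIG,        /* orig skb allocated from fclone_cache */
              SKB_FCLONE_CLONE,       /* companion fclone skb, currently in use */
              SKB_FCLONE_FREE,        /* companion fclone skb, free for use */
      };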
  19. 04 Oct 2014, 2 commits
  20. 03 Oct 2014, 1 commit
  21. 02 Oct 2014, 3 commits
  22. 30 Sep 2014, 4 commits