1. 17 Dec 2010 (2 commits)
  2. 25 Nov 2010 (1 commit)
    • xps: Improvements in TX queue selection · 3853b584
      Committed by Tom Herbert
      In dev_pick_tx, don't do the work of calculating the queue index or
      setting the index in the sock unless the device has more than one
      queue.  This allows the sock to be set only with the queue index of
      a multi-queue device, which is desirable if devices are stacked, as
      in a tunnel.
      
      We also allow the mapping of a socket to a queue to be changed.  To
      maintain in-order packet transmission, a flag (ooo_okay) has been
      added to the sk_buff structure.  If a transport layer sets this flag
      on a packet, the transmit queue can be changed for the socket.
      Presumably, the transport would set this if there were no possibility
      of creating OOO packets (for instance, when there are no packets in
      flight for the socket).  This patch includes the modification in TCP
      output for setting this flag.
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3853b584
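
      For illustration, a minimal userspace sketch of the selection rule
      described above; the structs and pick_tx_queue() here are simplified
      stand-ins for the kernel's types, not the actual dev_pick_tx().

      #include <stdbool.h>
      #include <stdio.h>

      struct sock       { int tx_queue_mapping; };           /* -1 = unset */
      struct sk_buff    { struct sock *sk; bool ooo_okay; unsigned int hash; };
      struct net_device { unsigned int real_num_tx_queues; };

      static unsigned int pick_tx_queue(struct net_device *dev, struct sk_buff *skb)
      {
          struct sock *sk = skb->sk;

          if (dev->real_num_tx_queues == 1)
              return 0;                        /* single queue: no work to do */

          /* Reuse the cached mapping unless re-mapping is safe (ooo_okay). */
          if (sk && sk->tx_queue_mapping >= 0 && !skb->ooo_okay)
              return (unsigned int)sk->tx_queue_mapping;

          unsigned int q = skb->hash % dev->real_num_tx_queues;
          if (sk)
              sk->tx_queue_mapping = (int)q;   /* record only for multi-queue */
          return q;
      }

      int main(void)
      {
          struct net_device dev = { .real_num_tx_queues = 8 };
          struct sock sk = { .tx_queue_mapping = -1 };
          struct sk_buff skb = { .sk = &sk, .ooo_okay = true, .hash = 0xdeadbeef };

          printf("queue %u\n", pick_tx_queue(&dev, &skb));
          return 0;
      }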
  3. 20 Oct 2010 (1 commit)
    • net: avoid RCU for NOCACHE dst · 27b75c95
      Committed by Eric Dumazet
      There is no point using RCU for dsts we allocate for a very short
      time (used once).
      
      Change dst_release() to take DST_NOCACHE into account, but also change
      skb_dst_set_noref() to force a refcount increment for such dst.
      
      This is a _huge_ gain, because we don't waste memory storing xx thousand
      dsts.  Instead of queueing them to RCU, we can free them instantly.
      
      CPU caches can stay hot, re-using same memory blocks to hold temporary
      dsts.
      
      Note : remove unneeded smp_mb__before_atomic_dec(); in dst_release(),
      since atomic_dec_return() implies a full memory barrier.
      
      Stress test: 160,000,000 UDP frames sent, IP route cache disabled
      (DDoS scenario).
      
      Before:
      
      real    0m38.091s
      user    0m13.189s
      sys     7m53.018s
      
      After:
      
      real	0m29.946s
      user	0m12.157s
      sys	7m40.605s
      
      For reference, if IP route cache was enabled :
      
      real	0m32.030s
      user	0m10.521s
      sys	8m15.243s
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      27b75c95
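
      For illustration, a hedged model of the release policy above: a
      cached dst goes through a deferred (RCU-style) free, while a
      DST_NOCACHE used-once dst is reclaimed as soon as its refcount hits
      zero.  Types and free paths are simplified stand-ins, not the
      kernel's dst_release().

      #include <stdatomic.h>
      #include <stdio.h>
      #include <stdlib.h>

      #define DST_NOCACHE 0x0010

      struct dst_entry {
          atomic_int refcnt;
          int flags;
      };

      /* Stand-in for the RCU-deferred path (call_rcu() in the kernel). */
      static void dst_free_via_rcu(struct dst_entry *dst) { free(dst); }
      static void dst_free_now(struct dst_entry *dst)     { free(dst); }

      static void dst_release_model(struct dst_entry *dst)
      {
          /* atomic_fetch_sub() is sequentially consistent, so no extra
           * barrier is needed before the decrement (the point made above
           * about smp_mb__before_atomic_dec()). */
          if (atomic_fetch_sub(&dst->refcnt, 1) == 1) {
              if (dst->flags & DST_NOCACHE)
                  dst_free_now(dst);     /* used once: reclaim instantly */
              else
                  dst_free_via_rcu(dst); /* cached: defer past RCU readers */
          }
      }

      int main(void)
      {
          struct dst_entry *dst = malloc(sizeof(*dst));

          if (!dst)
              return 1;
          atomic_init(&dst->refcnt, 1);
          dst->flags = DST_NOCACHE;
          dst_release_model(dst);        /* freed immediately, no RCU queue */
          printf("done\n");
          return 0;
      }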
  4. 17 Oct 2010 (1 commit)
    • net: allocate skbs on local node · 564824b0
      Committed by Eric Dumazet
      commit b30973f8 (node-aware skb allocation) spread a wrong habit of
      allocating net driver skbs on a given memory node: the one closest to
      the NIC hardware.  This is wrong because as soon as we try to scale
      the network stack, we need many CPUs to handle traffic, and we hit
      slub/slab management on cross-node allocations/frees when these CPUs
      have to alloc/free skbs bound to a central node.
      
      Skbs allocated in the RX path are ephemeral; they have a very short
      lifetime, so the extra cost of maintaining NUMA affinity is too
      expensive.  What appeared to be a nice idea four years ago is in
      fact a bad one.
      
      In 2010, NIC hardware is multiqueue, or we use RPS to spread the load,
      and two 10Gb NICs might deliver more than 28 million packets per
      second, needing all the available CPUs.
      
      The cost of cross-node handling in the network and vm stacks
      outweighs the small benefit the hardware had when doing its DMA
      transfer into its 'local' memory node at RX time.  Even trying to
      differentiate the two allocations done for one skb (the sk_buff on
      the local node, the data part on the NIC hardware's node) is not
      enough to bring good performance.
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Acked-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      564824b0
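
      For a userspace feel of the trade-off, a small libnuma example (link
      with -lnuma): pinning allocations to one fixed "device" node versus
      taking memory from the allocating CPU's own node, which is the
      behaviour this commit moves skbs to.  The node number is illustrative.

      #include <numa.h>
      #include <stdio.h>

      int main(void)
      {
          if (numa_available() < 0) {
              fprintf(stderr, "no NUMA support on this system\n");
              return 1;
          }

          /* Old habit: pin memory to one fixed node (the NIC's, say node 0),
           * forcing cross-node frees once many CPUs handle the traffic. */
          void *device_node_mem = numa_alloc_onnode(4096, 0);

          /* Behaviour this commit moves to: take memory from whichever node
           * the allocating CPU is on, keeping alloc/free traffic local. */
          void *local_node_mem = numa_alloc_local(4096);

          if (device_node_mem)
              numa_free(device_node_mem, 4096);
          if (local_node_mem)
              numa_free(local_node_mem, 4096);
          return 0;
      }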
  5. 27 Sep 2010 (1 commit)
  6. 24 Sep 2010 (1 commit)
  7. 03 Sep 2010 (1 commit)
  8. 23 Aug 2010 (1 commit)
  9. 19 Aug 2010 (1 commit)
  10. 17 Aug 2010 (1 commit)
    • core: Factor out flow calculation from get_rps_cpu · bfb564e7
      Committed by Krishna Kumar
      Factor out flow calculation code from get_rps_cpu, since other
      functions can use the same code.
      
      Revisions:
      
      v2 (Ben): Separate flow calculation out and use it in select queue.
      v3 (Arnd): Don't re-implement MIN.
      v4 (Changli): skb->data points to ethernet header in macvtap, and
      	make a fast path. Tested macvtap with this patch.
      v5 (Changli):
      	- Cache skb->rxhash in skb_get_rxhash
      	- macvtap may not have a power-of-2 number of queues, so
      	  change the queue-selection code.
          (Arnd):
      	- Use first available queue if all fails.
      Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bfb564e7
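
      For illustration, a sketch of the factored-out idea: compute a flow
      hash over the 4-tuple once, cache it in the skb, and reuse it for
      queue selection with no power-of-2 assumption on the queue count.
      The toy hash and structs stand in for the kernel's jhash-based
      skb_get_rxhash().

      #include <stdint.h>
      #include <stdio.h>

      struct flow_keys     { uint32_t saddr, daddr; uint16_t sport, dport; };
      struct sk_buff_model { uint32_t rxhash; struct flow_keys keys; };

      static uint32_t flow_hash(const struct flow_keys *k)
      {
          /* Toy mix; the kernel uses a jhash over these words instead. */
          uint32_t h = k->saddr ^ k->daddr ^
                       ((uint32_t)k->sport << 16 | k->dport);

          h ^= h >> 16; h *= 0x45d9f3b; h ^= h >> 16;
          return h ? h : 1;               /* keep 0 as the "unset" marker */
      }

      static uint32_t skb_get_rxhash_model(struct sk_buff_model *skb)
      {
          if (!skb->rxhash)               /* compute once, then cache */
              skb->rxhash = flow_hash(&skb->keys);
          return skb->rxhash;
      }

      static unsigned int select_queue(struct sk_buff_model *skb, unsigned int numq)
      {
          /* numq need not be a power of two (the macvtap point above). */
          return skb_get_rxhash_model(skb) % numq;
      }

      int main(void)
      {
          struct sk_buff_model skb = { .keys = { 0x0a000001, 0x0a000002, 1234, 80 } };

          printf("queue %u\n", select_queue(&skb, 6));
          return 0;
      }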
  11. 05 Aug 2010 (1 commit)
  12. 03 Aug 2010 (1 commit)
  13. 25 Jul 2010 (1 commit)
  14. 19 Jul 2010 (2 commits)
  15. 16 Jun 2010 (1 commit)
  16. 11 Jun 2010 (1 commit)
    • net: deliver skbs on inactive slaves to exact matches · 597a264b
      Committed by John Fastabend
      Currently, the accelerated receive path for VLANs will drop packets
      if the real device is an inactive slave and the packet is not one of
      the special pkts tested for in skb_bond_should_drop().  This behavior
      differs from the non-accelerated path and from pkts over a bonded
      vlan.
      
      For example,
      
      vlanx -> bond0 -> ethx
      
      will be dropped in the vlan path and not delivered to any
      packet handlers at all.  However,
      
      bond0 -> vlanx -> ethx
      
      and
      
      bond0 -> ethx
      
      will be delivered to handlers that match the exact dev, because the
      VLAN path checks the real_dev, which is not a slave, and
      netif_receive_skb() doesn't drop frames but only delivers them to
      exact matches.
      
      This patch adds a sk_buff flag which is used for tagging skbs that
      would previously have been dropped, and allows the skb to continue
      on to netif_receive_skb().  Here we add logic to check for the
      deliver_no_wcard flag, and if it is set, only deliver to handlers
      that match exactly.  This makes both paths above consistent and
      gives pkt handlers a way to identify skbs that come from inactive
      slaves.  Without this patch, in some configurations skbs will be
      delivered to handlers with exact matches and in others be dropped
      outright in the vlan path.
      
      I have tested the following 4 configurations in failover modes
      and load balancing modes.
      
      # bond0 -> ethx
      
      # vlanx -> bond0 -> ethx
      
      # bond0 -> vlanx -> ethx
      
      # bond0 -> ethx
                  |
        vlanx -> --
      Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      597a264b
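
      For illustration, a simplified model of the delivery rule: skbs
      tagged deliver_no_wcard are handed only to handlers bound to the
      exact receiving device, never to wildcard handlers.  The types here
      are illustrative stand-ins, not the kernel's packet_type machinery.

      #include <stdbool.h>
      #include <stdio.h>

      struct net_device { const char *name; };
      struct sk_buff    { struct net_device *dev; bool deliver_no_wcard; };
      struct packet_type {
          struct net_device *dev;   /* NULL = wildcard: accept any device */
          void (*func)(struct sk_buff *);
      };

      static void maybe_deliver(struct packet_type *pt, struct sk_buff *skb)
      {
          bool exact = (pt->dev == skb->dev);

          if (skb->deliver_no_wcard && !exact)
              return;                   /* inactive slave: exact match only */
          if (pt->dev && !exact)
              return;                   /* handler bound to another device */
          pt->func(skb);
      }

      static void dump(struct sk_buff *skb) { printf("got skb on %s\n", skb->dev->name); }

      int main(void)
      {
          struct net_device eth0 = { "eth0" };
          struct packet_type exact_pt    = { &eth0, dump };
          struct packet_type wildcard_pt = { NULL,  dump };
          struct sk_buff skb = { &eth0, true };  /* came in on an inactive slave */

          maybe_deliver(&exact_pt, &skb);        /* delivered */
          maybe_deliver(&wildcard_pt, &skb);     /* suppressed */
          return 0;
      }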
  17. 05 Jun 2010 (1 commit)
  18. 29 May 2010 (1 commit)
  19. 18 May 2010 (1 commit)
    • net: add a noref bit on skb dst · 7fee226a
      Committed by Eric Dumazet
      Use the low-order bit of skb->_skb_dst to indicate that the dst is
      not refcounted.
      
      Change _skb_dst to _skb_refdst to make sure all uses are caught.
      
      skb_dst() returns the dst, regardless of noref bit set or not, but
      with a lockdep check to make sure a noref dst is not given if current
      user is not rcu protected.
      
      New skb_dst_set_noref() helper to set a non-refcounted dst on a skb
      (with a lockdep check).
      
      skb_dst_drop() drops a reference only if skb dst was refcounted.
      
      The skb_dst_force() helper is used to force a refcount on the dst
      when the skb is queued and no longer RCU protected.
      
      Use skb_dst_force() in __sk_add_backlog(), __dev_xmit_skb() if
      !IFF_XMIT_DST_RELEASE or skb enqueued on qdisc queue, in
      sock_queue_rcv_skb(), in __nf_queue().
      
      Use skb_dst_force() in dev_requeue_skb().
      
      Note: dst_use_noref() still dirties the dst; we might transform it
      later to do one dirtying per jiffy.
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7fee226a
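
      For illustration, a standalone sketch of the pointer-tagging trick:
      the low bit of an aligned dst pointer records whether the reference
      is counted, and the force helper upgrades a borrowed reference
      before the skb leaves RCU protection.  Helper names echo the
      kernel's but the implementation is a simplified model.

      #include <stdio.h>

      #define SKB_DST_NOREF   1UL
      #define SKB_DST_PTRMASK (~SKB_DST_NOREF)

      struct dst_entry { int refcnt; };               /* atomic_t in the kernel */
      struct skb_model { unsigned long _skb_refdst; };

      /* Returns the dst whether or not the noref bit is set. */
      static struct dst_entry *skb_dst_model(const struct skb_model *skb)
      {
          return (struct dst_entry *)(skb->_skb_refdst & SKB_DST_PTRMASK);
      }

      /* Store a borrowed (not refcounted) dst; caller must hold RCU. */
      static void skb_dst_set_noref_model(struct skb_model *skb,
                                          struct dst_entry *dst)
      {
          skb->_skb_refdst = (unsigned long)dst | SKB_DST_NOREF;
      }

      /* Upgrade a borrowed reference to a counted one before the skb is
       * queued and leaves the RCU-protected section. */
      static void skb_dst_force_model(struct skb_model *skb)
      {
          if (skb->_skb_refdst & SKB_DST_NOREF) {
              struct dst_entry *dst = skb_dst_model(skb);

              dst->refcnt++;                          /* atomic in the kernel */
              skb->_skb_refdst = (unsigned long)dst;  /* clear the noref bit */
          }
      }

      int main(void)
      {
          struct dst_entry dst = { .refcnt = 1 };
          struct skb_model skb;

          skb_dst_set_noref_model(&skb, &dst);
          skb_dst_force_model(&skb);
          printf("refcnt=%d\n", dst.refcnt);          /* prints refcnt=2 */
          return 0;
      }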
  20. 07 May 2010 (1 commit)
  21. 05 May 2010 (1 commit)
    • net: __alloc_skb() speedup · ec7d2f2c
      Committed by Eric Dumazet
      With the following patch I can reach the maximum rate of my
      pktgen+udpsink simulator:
      - 'old' machine: dual quad-core E5450 @ 3.00GHz
      - 64 UDP rx flows (differing only by destination port)
      - RPS enabled, NIC interrupts serviced on cpu0
      - RPS dispatched to the 7 other cores (~130,000 IPIs per second)
      - SLAB allocator (faster than SLUB in this workload)
      - tg3 NIC
      - 1,080,000 pps without a single drop at NIC level
      
      The idea is to add two prefetchw() calls in __alloc_skb(): one to
      prefetch the first sk_buff cache line, the second to prefetch the
      shinfo part.
      
      Also use a single memset() to initialize all skb_shared_info fields
      instead of clearing them one by one, reducing the instruction count
      by using long-word moves.  All skb_shared_info fields before
      'dataref' are cleared in __alloc_skb().
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ec7d2f2c
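
      For illustration, a sketch of the two optimizations, with
      __builtin_prefetch() standing in for the kernel's prefetchw() and
      one memset() clearing everything before 'dataref'.  The struct is a
      reduced stand-in, not the real skb_shared_info layout.

      #include <stddef.h>
      #include <stdlib.h>
      #include <string.h>

      struct shinfo_model {
          unsigned short nr_frags;
          unsigned short gso_size;
          unsigned short gso_segs;
          unsigned short gso_type;
          void *frag_list;
          int dataref;              /* first field NOT covered by the memset */
      };

      static struct shinfo_model *alloc_model(size_t datalen)
      {
          char *data = malloc(datalen + sizeof(struct shinfo_model));
          struct shinfo_model *shinfo;

          if (!data)
              return NULL;
          shinfo = (struct shinfo_model *)(data + datalen);

          /* Prefetch for write so the later stores hit warm, exclusive
           * cache lines (prefetchw() in the kernel). */
          __builtin_prefetch(data, 1);
          __builtin_prefetch(shinfo, 1);

          /* One memset over every field before 'dataref' instead of
           * clearing the fields one by one. */
          memset(shinfo, 0, offsetof(struct shinfo_model, dataref));
          shinfo->dataref = 1;
          return shinfo;
      }

      int main(void)
      {
          struct shinfo_model *s = alloc_model(2048);

          if (s)
              free((char *)s - 2048);   /* data block starts 2048 bytes earlier */
          return 0;
      }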
  22. 02 May 2010 (1 commit)
  23. 21 Apr 2010 (1 commit)
  24. 13 Apr 2010 (1 commit)
  25. 25 Mar 2010 (1 commit)
  26. 17 Mar 2010 (1 commit)
    • rps: Receive Packet Steering · 0a9627f2
      Committed by Tom Herbert
      This patch implements software receive side packet steering (RPS).  RPS
      distributes the load of received packet processing across multiple CPUs.
      
      Problem statement: Protocol processing done in the NAPI context for received
      packets is serialized per device queue and becomes a bottleneck under high
      packet load.  This substantially limits pps that can be achieved on a single
      queue NIC and provides no scaling with multiple cores.
      
      This solution queues packets early on in the receive path on the backlog queues
      of other CPUs.   This allows protocol processing (e.g. IP and TCP) to be
      performed on packets in parallel.   For each device (or each receive queue in
      a multi-queue device) a mask of CPUs is set to indicate the CPUs that can
      process packets. A CPU is selected on a per packet basis by hashing contents
      of the packet header (e.g. the TCP or UDP 4-tuple) and using the result to index
      into the CPU mask.  The IPI mechanism is used to raise networking receive
      softirqs between CPUs.  This effectively emulates in software what a
      multi-queue NIC can provide, but is generic, requiring no device
      support.
      
      Many devices now provide a hash over the 4-tuple on a per packet basis
      (e.g. the Toeplitz hash).  This patch allows drivers to set the HW reported hash
      in an skb field, and that value in turn is used to index into the RPS maps.
      Using the HW generated hash can avoid cache misses on the packet when
      steering it to a remote CPU.
      
      The CPU mask is set on a per-device and per-queue basis in the sysfs
      variable /sys/class/net/<device>/queues/rx-<n>/rps_cpus.  This is a
      set of canonical bitmaps for the receive queues of the device
      (numbered by <n>).  If a device does not support multi-queue, a
      single variable is used for the device (rx-0).
      
      Generally, we have found this technique increases the pps capability
      of a single-queue device with good CPU utilization.  Optimal settings
      for the CPU mask seem to depend on architecture and cache hierarchy.
      Below are some results from running 500 instances of the netperf
      TCP_RR test with 1-byte requests and responses.  Results show the
      cumulative transaction rate and system CPU utilization.
      
      e1000e on 8 core Intel
         Without RPS: 108K tps at 33% CPU
         With RPS:    311K tps at 64% CPU
      
      forcedeth on 16 core AMD
         Without RPS: 156K tps at 15% CPU
         With RPS:    404K tps at 49% CPU
      
      bnx2x on 16 core AMD
         Without RPS  567K tps at 61% CPU (4 HW RX queues)
         Without RPS  738K tps at 96% CPU (8 HW RX queues)
         With RPS:    854K tps at 76% CPU (4 HW RX queues)
      
      Caveats:
      - The benefits of this patch are dependent on architecture and cache
      hierarchy.  Tuning the masks to get the best performance is probably
      necessary.
      - This patch adds overhead in the path for processing a single packet.
      In a lightly loaded server this overhead may eliminate the advantages
      of increased parallelism, and possibly cause some relative performance
      degradation.  We have found that masks that are cache aware (sharing
      the same caches as the interrupting CPU) mitigate much of this.
      - The RPS masks can be changed dynamically; however, whenever a mask
      is changed this introduces the possibility of generating out-of-order
      packets.  It's probably best not to change the masks too frequently.
      Signed-off-by: Tom Herbert <therbert@google.com>
      
       include/linux/netdevice.h |   32 ++++-
       include/linux/skbuff.h    |    3 +
       net/core/dev.c            |  335 +++++++++++++++++++++++++++++++++++++--------
       net/core/net-sysfs.c      |  225 ++++++++++++++++++++++++++++++-
       net/core/skbuff.c         |    2 +
       5 files changed, 538 insertions(+), 59 deletions(-)
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0a9627f2
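
      As a usage example, the rps_cpus mask from the text can be set by
      writing a hex CPU bitmap to the sysfs file; the sketch below does so
      from C.  The device name and mask are placeholders (run as root and
      adjust for your system).

      #include <stdio.h>

      int main(void)
      {
          /* 0x0e = CPUs 1-3 may process packets steered from this queue. */
          const char *path = "/sys/class/net/eth0/queues/rx-0/rps_cpus";
          FILE *f = fopen(path, "w");

          if (!f) {
              perror(path);
              return 1;
          }
          fprintf(f, "0e\n");
          fclose(f);
          return 0;
      }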
  27. 02 Mar 2010 (1 commit)
  28. 27 Feb 2010 (1 commit)
    • skbuff: align sk_buff::cb to 64 bit and close some potential holes · da3f5cf1
      Committed by Felix Fietkau
      The alignment requirement for 64-bit load/store instructions on ARM
      is implementation defined.  Some CPUs (such as the Marvell Feroceon)
      do not generate an exception if such an instruction is executed with
      an address that is not 64-bit aligned.  In such a case, the Feroceon
      corrupts adjacent memory, which showed up in my tests as a crash in
      the rx path of ath9k that only occurred with CONFIG_XFRM set.
      
      This crash happened because the first field of the mac80211 rx status
      info in the cb is a u64, and changing it corrupted the skb->sp field.
      
      This patch also closes some potential pre-existing holes in the sk_buff
      struct surrounding the cb[] area.
      Signed-off-by: Felix Fietkau <nbd@openwrt.org>
      Cc: stable@kernel.org
      Signed-off-by: David S. Miller <davem@davemloft.net>
      da3f5cf1
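
      For illustration, a reduced stand-in showing the shape of the fix:
      forcing 64-bit alignment on the cb[] array so a u64 stored at its
      start is aligned on every architecture and cannot clobber the
      following field.  This is not the real sk_buff layout.

      #include <stddef.h>
      #include <stdio.h>

      struct skb_model {
          void *prev, *next;
          char cb[48] __attribute__((aligned(8)));  /* the commit's change */
          void *sp;              /* neighbouring field that was corrupted */
      };

      int main(void)
      {
          /* With the attribute, a u64 placed at the start of cb (as the
           * mac80211 rx status is) is 64-bit aligned on every arch, so a
           * misaligned 64-bit store can no longer clobber sp. */
          printf("cb offset %zu, aligned: %s\n",
                 offsetof(struct skb_model, cb),
                 offsetof(struct skb_model, cb) % 8 == 0 ? "yes" : "no");
          return 0;
      }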
  29. 15 Feb 2010 (1 commit)
  30. 03 Dec 2009 (1 commit)
  31. 21 Nov 2009 (1 commit)
  32. 05 Nov 2009 (1 commit)
  33. 31 Oct 2009 (1 commit)
  34. 30 Oct 2009 (1 commit)
  35. 13 Oct 2009 (2 commits)
    • net: Add netdev_alloc_skb_ip_align() helper · 61321bbd
      Committed by Eric Dumazet
      Instead of hardcoding NET_IP_ALIGN handling in various network
      drivers, we can add a helper around netdev_alloc_skb().
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      61321bbd
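
      For illustration, a minimal model of what such a helper wraps:
      allocate, then reserve NET_IP_ALIGN bytes so the IP header lands on
      an aligned boundary after the 14-byte Ethernet header.  Kernel types
      are modeled minimally here; this is not the kernel helper itself.

      #include <stdlib.h>

      #define NET_IP_ALIGN 2   /* 14-byte Ethernet header + 2 aligns the IP header */

      struct skb_model { unsigned char *head, *data; };

      static struct skb_model *netdev_alloc_skb_model(unsigned int len)
      {
          struct skb_model *skb = malloc(sizeof(*skb));

          if (!skb)
              return NULL;
          skb->head = skb->data = malloc(len + NET_IP_ALIGN);
          if (!skb->head) {
              free(skb);
              return NULL;
          }
          return skb;
      }

      /* The pattern the helper wraps: allocate, then reserve the gap
       * (skb_reserve(skb, NET_IP_ALIGN) in the kernel). */
      static struct skb_model *netdev_alloc_skb_ip_align_model(unsigned int len)
      {
          struct skb_model *skb = netdev_alloc_skb_model(len);

          if (skb)
              skb->data += NET_IP_ALIGN;
          return skb;
      }

      int main(void)
      {
          struct skb_model *skb = netdev_alloc_skb_ip_align_model(1500);

          if (!skb)
              return 1;
          free(skb->head);
          free(skb);
          return 0;
      }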
    • net: Generalize socket rx gap / receive queue overflow cmsg · 3b885787
      Committed by Neil Horman
      Create a new socket-level option to report the number of queue
      overflows.
      
      Recently I augmented the AF_PACKET protocol to report the number of
      frames lost on the socket receive queue between any two enqueued
      frames.  This value was exported via a SOL_PACKET level cmsg.  After
      I completed that work it was requested that this feature be
      generalized so that any datagram-oriented socket could make use of
      this option.  As such I've created this patch.  It creates a new
      SOL_SOCKET level option called SO_RXQ_OVFL which, when enabled,
      exports a SOL_SOCKET level cmsg that reports the number of times the
      sk_receive_queue overflowed between any two given frames.  It also
      augments the AF_PACKET protocol to take advantage of this new feature
      (as it previously did not touch sk->sk_drops, which this patch uses
      to record the overflow count).  Tested successfully by me.
      
      Notes:
      
      1) Unlike my previous patch, this patch simply records the sk_drops
      value, which is not the number of drops between packets but rather a
      total number of drops.  Deltas must be computed in user space.
      
      2) While this patch currently works with datagram-oriented protocols,
      it will also be accepted by non-datagram-oriented protocols.  I'm not
      sure if that's agreeable to everyone, but my argument in favor of
      doing so is that, for those protocols to which this option doesn't
      apply, sk_drops will always be zero, and reporting no drops on a
      receive queue that isn't used by those non-participating protocols
      seems reasonable to me.  This also saves us having to code a
      per-protocol opt-in mechanism.
      
      3) This applies cleanly to net-next assuming that commit
      97775007 (my AF_PACKET cmsg patch) is reverted.
      Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3b885787
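
      As a usage example, a userspace sketch of the option this commit
      adds: enable SO_RXQ_OVFL on a datagram socket and read the
      cumulative drop counter from the per-packet cmsg (deltas are left to
      the application, per note 1 above).  The port number is a
      placeholder.

      #include <stdio.h>
      #include <stdint.h>
      #include <string.h>
      #include <sys/socket.h>
      #include <sys/uio.h>
      #include <netinet/in.h>

      #ifndef SO_RXQ_OVFL
      #define SO_RXQ_OVFL 40   /* value on most architectures */
      #endif

      int main(void)
      {
          int fd = socket(AF_INET, SOCK_DGRAM, 0);
          int on = 1;
          struct sockaddr_in addr = {
              .sin_family = AF_INET,
              .sin_port = htons(9999),             /* placeholder port */
              .sin_addr.s_addr = htonl(INADDR_ANY),
          };

          if (fd < 0)
              return 1;
          setsockopt(fd, SOL_SOCKET, SO_RXQ_OVFL, &on, sizeof(on));
          if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
              return 1;

          char buf[2048], cbuf[CMSG_SPACE(sizeof(uint32_t))];
          struct iovec iov = { buf, sizeof(buf) };
          struct msghdr msg = {
              .msg_iov = &iov, .msg_iovlen = 1,
              .msg_control = cbuf, .msg_controllen = sizeof(cbuf),
          };

          if (recvmsg(fd, &msg, 0) < 0)
              return 1;

          for (struct cmsghdr *c = CMSG_FIRSTHDR(&msg); c;
               c = CMSG_NXTHDR(&msg, c)) {
              if (c->cmsg_level == SOL_SOCKET && c->cmsg_type == SO_RXQ_OVFL) {
                  uint32_t drops;

                  memcpy(&drops, CMSG_DATA(c), sizeof(drops));
                  printf("cumulative rx-queue drops: %u\n", drops);
              }
          }
          return 0;
      }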
  36. 25 Jul 2009 (1 commit)
  37. 15 Jul 2009 (1 commit)