1. 14 Oct 2011, 1 commit
    • net: more accurate skb truesize · 87fb4b7b
      Committed by Eric Dumazet
      skb truesize currently accounts for the sk_buff struct and only part of
      the skb head; kmalloc() rounding is also ignored.
      
      Considering that skb_shared_info is larger than sk_buff, it's time to
      take it into account for better memory accounting.
      
      This patch introduces the SKB_TRUESIZE(X) macro to centralize these
      assumptions in a single place.
      
      At skb alloc time, we put the skb_shared_info struct at the exact end of
      the skb head, to allow better use of memory (lowering the number of
      reallocations), since kmalloc() gives us power-of-two memory blocks.
      
      Unless SLUB/SLUB debug is active, both skb->head and skb_shared_info are
      aligned to cache lines, as before.
      
      Note: This patch might trigger performance regressions because of
      misconfigured protocol stacks, hitting per-socket or global memory
      limits that were previously not reached. But it's a necessary step for
      more accurate memory accounting.
      (An illustrative sketch of the macro follows this entry.)
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      CC: Andi Kleen <ak@linux.intel.com>
      CC: Ben Hutchings <bhutchings@solarflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      87fb4b7b
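      A minimal, userspace-only sketch of the idea behind SKB_TRUESIZE(X): charge the
      requested payload size plus the cache-line-aligned metadata overhead. The struct
      sizes, the cache-line constant and the *_model names below are illustrative
      stand-ins, not the kernel's real definitions.

      #include <stdio.h>
      #include <stddef.h>

      #define CACHE_BYTES   64                              /* assumed cache line size */
      #define DATA_ALIGN(x) (((x) + CACHE_BYTES - 1) & ~(size_t)(CACHE_BYTES - 1))

      struct sk_buff_model     { char bytes[232]; };        /* stand-in for struct sk_buff  */
      struct shared_info_model { char bytes[320]; };        /* stand-in for skb_shared_info */

      /* truesize = payload + aligned sk_buff + aligned skb_shared_info */
      #define TRUESIZE_MODEL(X) \
          ((X) + DATA_ALIGN(sizeof(struct sk_buff_model)) \
               + DATA_ALIGN(sizeof(struct shared_info_model)))

      int main(void)
      {
          /* e.g. how the stack might charge a small, ACK-sized skb */
          printf("modelled truesize for 64 payload bytes: %zu\n",
                 (size_t)TRUESIZE_MODEL(64));
          return 0;
      }

      Centralizing the overhead in one macro is what lets per-socket and global memory
      limits see something much closer to the real cost of each buffer.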
  2. 16 Sep 2011, 1 commit
    • net: copy userspace buffers on device forwarding · 48c83012
      Committed by Michael S. Tsirkin
      dev_forward_skb loops an skb back into the host networking
      stack, which might hang on to the memory indefinitely.
      In particular, this can happen in macvtap in bridged mode.
      Copy the userspace fragments to avoid blocking the
      sender in that case.
      
      As this patch makes skb_copy_ubufs extern now,
      I also added some documentation and made skb_copy_ubufs clear
      the SKBTX_DEV_ZEROCOPY flag automatically, instead
      of doing it in all callers. This can be made into a separate
      patch if people feel it's worth it.
      (A small model of this copy-before-forward pattern follows this entry.)
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      48c83012
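      A hedged, standalone model of the copy-before-forward pattern described above:
      when a buffer backed by userspace fragments is about to be looped into a path
      that may hold it indefinitely, copy the data into locally owned memory and clear
      the zero-copy marker in one place, so callers don't have to. The pkt,
      PKT_ZEROCOPY and copy_user_frags names are illustrative, not the kernel's.

      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>

      #define PKT_ZEROCOPY 0x1                 /* stands in for SKBTX_DEV_ZEROCOPY */

      struct pkt {
          unsigned int   flags;
          size_t         len;
          unsigned char *data;                 /* may point at sender-pinned userspace memory */
      };

      /* Copy user-backed data into locally owned memory, then drop the flag here. */
      static int copy_user_frags(struct pkt *p)
      {
          unsigned char *copy = malloc(p->len);

          if (!copy)
              return -1;
          memcpy(copy, p->data, p->len);
          p->data   = copy;
          p->flags &= ~PKT_ZEROCOPY;           /* cleared in one place, not by every caller */
          return 0;
      }

      static int forward_pkt(struct pkt *p)
      {
          if ((p->flags & PKT_ZEROCOPY) && copy_user_frags(p))
              return -1;                       /* drop rather than block the sender forever */
          printf("forwarded %zu bytes from a private copy\n", p->len);
          return 0;
      }

      int main(void)
      {
          unsigned char user_buf[64] = "payload pinned by a userspace sender";
          struct pkt p = { .flags = PKT_ZEROCOPY, .len = sizeof(user_buf), .data = user_buf };
          int ret = forward_pkt(&p);

          if (p.data != user_buf)
              free(p.data);                    /* the sender's own buffer was released earlier */
          return ret ? EXIT_FAILURE : EXIT_SUCCESS;
      }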
  3. 25 Aug 2011, 1 commit
  4. 23 Aug 2011, 1 commit
  5. 19 Aug 2011, 1 commit
  6. 18 Aug 2011, 1 commit
  7. 27 Jul 2011, 1 commit
  8. 12 Jul 2011, 1 commit
  9. 10 Jul 2011, 1 commit
  10. 07 Jul 2011, 1 commit
    • skbuff: skb supports zero-copy buffers · a6686f2f
      Committed by Shirley Ma
      This patch adds userspace buffer support in skb shared info. A new
      struct skb_ubuf_info is needed to maintain the userspace buffer
      argument and index; a callback is used to notify userspace to release
      the buffers once the lower device has completed DMA (i.e. the last
      reference to that skb has gone).
      
      If any userspace app still references these userspace buffers, the
      buffers will be copied into the kernel. This way we can prevent
      userspace apps from holding these userspace buffers for too long.
      
      Use destructor_arg to point to the userspace buffer info; a new tx flag,
      SKBTX_DEV_ZEROCOPY, is added for the zero-copy buffer check.
      (A sketch of this callback structure follows this entry.)
      Signed-off-by: Shirley Ma <xma@...ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a6686f2f
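      A hedged sketch of the shape described above: per-buffer zero-copy bookkeeping
      reachable from the shared info, with a completion callback fired when the device
      is done with the pages. The zc_info and shared_info_model names and the flag
      value are illustrative; the real structure and SKBTX_DEV_ZEROCOPY live in the
      kernel headers.

      #include <stdio.h>

      #define TX_ZEROCOPY 0x1                          /* models the SKBTX_DEV_ZEROCOPY bit */

      struct zc_info {
          void (*callback)(struct zc_info *zc);        /* tell the sender its pages are free */
          void *arg;                                   /* sender-side bookkeeping            */
          unsigned long desc;                          /* which descriptor/index to complete */
      };

      struct shared_info_model {
          unsigned int    tx_flags;                    /* would carry the zero-copy bit      */
          struct zc_info *zc;                          /* modelled on the destructor_arg use */
      };

      static void sender_release(struct zc_info *zc)
      {
          printf("descriptor %lu can be recycled by the sender\n", zc->desc);
      }

      int main(void)
      {
          struct zc_info zc = { .callback = sender_release, .arg = NULL, .desc = 7 };
          struct shared_info_model shinfo = { .tx_flags = TX_ZEROCOPY, .zc = &zc };

          /* once the last reference is dropped (device DMA finished), complete it */
          if (shinfo.tx_flags & TX_ZEROCOPY)
              shinfo.zc->callback(shinfo.zc);
          return 0;
      }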
  11. 20 Jun 2011, 1 commit
  12. 12 Jun 2011, 1 commit
    • vlan: Fix the ingress VLAN_FLAG_REORDER_HDR check · 0b5c9db1
      Committed by Jiri Pirko
      Testing of VLAN_FLAG_REORDER_HDR does not belong in vlan_untag
      but rather in vlan_do_receive.  Otherwise the vlan header
      will not be properly put on the packet in the case of
      vlan header acceleration.
      
      As we remove the check from vlan_check_reorder_header,
      rename it to vlan_reorder_header to keep the naming clean.
      
      Fix up the skb->pkt_type early so we don't look at the packet
      after adding the vlan tag, which guarantees we don't goof
      and look at the wrong field.
      
      Use a simple if statement instead of a complicated switch
      statement to decide whether we need to increment rx_stats
      for a multicast packet.
      
      Hopefully at some point we will just declare the case where
      VLAN_FLAG_REORDER_HDR is cleared as unsupported and remove
      the code.  Until then this keeps it working correctly.
      (A small illustration of the if-statement change follows this entry.)
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: Jiri Pirko <jpirko@redhat.com>
      Acked-by: Changli Gao <xiaosuo@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0b5c9db1
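      A tiny, hedged illustration of the statistics change mentioned above: a single
      if on the packet type replaces a switch, since only multicast needs the extra
      counter. The enum values and the stats struct are simplified stand-ins for the
      kernel's.

      #include <stdio.h>

      enum pkt_type { PACKET_HOST, PACKET_BROADCAST, PACKET_MULTICAST, PACKET_OTHERHOST };

      struct rx_stats { unsigned long rx_packets, rx_bytes, rx_multicast; };

      static void count_rx(struct rx_stats *st, enum pkt_type type, unsigned int len)
      {
          st->rx_packets++;
          st->rx_bytes += len;
          if (type == PACKET_MULTICAST)    /* the simple form: one test, one counter */
              st->rx_multicast++;
      }

      int main(void)
      {
          struct rx_stats st = { 0 };

          count_rx(&st, PACKET_MULTICAST, 128);
          count_rx(&st, PACKET_HOST, 64);
          printf("packets=%lu multicast=%lu\n", st.rx_packets, st.rx_multicast);
          return 0;
      }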
  13. 23 May 2011, 5 commits
  14. 28 Apr 2011, 1 commit
    • net: filter: Just In Time compiler for x86-64 · 0a14842f
      Committed by Eric Dumazet
      In order to speed up packet filtering, here is an implementation of a
      JIT compiler for x86_64.
      
      It is disabled by default, and must be enabled by the admin.
      
      echo 1 >/proc/sys/net/core/bpf_jit_enable
      
      It uses module_alloc() and module_free() to get memory in the 2GB text
      kernel range, since we call helper functions from the generated code.
      
      EAX : BPF A accumulator
      EBX : BPF X accumulator
      RDI : pointer to skb   (first argument given to JIT function)
      RBP : frame pointer (even if CONFIG_FRAME_POINTER=n)
      r9d : skb->len - skb->data_len (headlen)
      r8  : skb->data
      
      To get a trace of the generated code, use:
      
      echo 2 >/proc/sys/net/core/bpf_jit_enable
      
      Example of generated code:
      
      # tcpdump -p -n -s 0 -i eth1 host 192.168.20.0/24
      
      flen=18 proglen=147 pass=3 image=ffffffffa00b5000
      JIT code: ffffffffa00b5000: 55 48 89 e5 48 83 ec 60 48 89 5d f8 44 8b 4f 60
      JIT code: ffffffffa00b5010: 44 2b 4f 64 4c 8b 87 b8 00 00 00 be 0c 00 00 00
      JIT code: ffffffffa00b5020: e8 24 7b f7 e0 3d 00 08 00 00 75 28 be 1a 00 00
      JIT code: ffffffffa00b5030: 00 e8 fe 7a f7 e0 24 00 3d 00 14 a8 c0 74 49 be
      JIT code: ffffffffa00b5040: 1e 00 00 00 e8 eb 7a f7 e0 24 00 3d 00 14 a8 c0
      JIT code: ffffffffa00b5050: 74 36 eb 3b 3d 06 08 00 00 74 07 3d 35 80 00 00
      JIT code: ffffffffa00b5060: 75 2d be 1c 00 00 00 e8 c8 7a f7 e0 24 00 3d 00
      JIT code: ffffffffa00b5070: 14 a8 c0 74 13 be 26 00 00 00 e8 b5 7a f7 e0 24
      JIT code: ffffffffa00b5080: 00 3d 00 14 a8 c0 75 07 b8 ff ff 00 00 eb 02 31
      JIT code: ffffffffa00b5090: c0 c9 c3
      
      The BPF program is 144 bytes long, so the native program is almost the same size ;)
      
      (000) ldh      [12]
      (001) jeq      #0x800           jt 2    jf 8
      (002) ld       [26]
      (003) and      #0xffffff00
      (004) jeq      #0xc0a81400      jt 16   jf 5
      (005) ld       [30]
      (006) and      #0xffffff00
      (007) jeq      #0xc0a81400      jt 16   jf 17
      (008) jeq      #0x806           jt 10   jf 9
      (009) jeq      #0x8035          jt 10   jf 17
      (010) ld       [28]
      (011) and      #0xffffff00
      (012) jeq      #0xc0a81400      jt 16   jf 13
      (013) ld       [38]
      (014) and      #0xffffff00
      (015) jeq      #0xc0a81400      jt 16   jf 17
      (016) ret      #65535
      (017) ret      #0
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
      Cc: Ben Hutchings <bhutchings@solarflare.com>
      Cc: Hagen Paul Pfeifer <hagen@jauu.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0a14842f
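      For context, the programs this JIT consumes are classic BPF filters like the
      tcpdump output above. Below is a small, self-contained userspace example (the
      standard Linux SO_ATTACH_FILTER socket API, not part of this patch) that
      attaches a four-instruction filter accepting only ARP frames; with
      bpf_jit_enable set, the kernel would translate it to native x86-64 code much
      like the dump shown above.

      #include <stdio.h>
      #include <unistd.h>
      #include <sys/socket.h>
      #include <netinet/in.h>
      #include <linux/filter.h>
      #include <linux/if_ether.h>

      /* Accept only ARP (EtherType 0x0806); drop everything else. */
      static struct sock_filter code[] = {
          BPF_STMT(BPF_LD  | BPF_H   | BPF_ABS, 12),            /* ldh [12]             */
          BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, ETH_P_ARP, 0, 1),  /* jeq #0x806 jt 2 jf 3 */
          BPF_STMT(BPF_RET | BPF_K, 0xffff),                     /* ret #65535           */
          BPF_STMT(BPF_RET | BPF_K, 0),                          /* ret #0               */
      };

      int main(void)
      {
          struct sock_fprog prog = {
              .len    = sizeof(code) / sizeof(code[0]),
              .filter = code,
          };
          int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));

          if (fd < 0) {
              perror("socket (needs CAP_NET_RAW)");
              return 1;
          }
          if (setsockopt(fd, SOL_SOCKET, SO_ATTACH_FILTER, &prog, sizeof(prog)) < 0) {
              perror("SO_ATTACH_FILTER");
              close(fd);
              return 1;
          }
          puts("classic BPF filter attached; the JIT compiles it if bpf_jit_enable is set");
          close(fd);
          return 0;
      }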
  15. 31 Mar 2011, 1 commit
  16. 30 Mar 2011, 1 commit
    • net: Fix warnings caused by MAX_SKB_FRAGS change. · eec00954
      Committed by David S. Miller
      After commit a715dea3 ("net: Always
      allocate at least 16 skb frags regardless of page size"), the value
      of MAX_SKB_FRAGS can now take on either an "unsigned long" or an
      "int" value.
      
      This causes warnings like:
      
      net/packet/af_packet.c: In function ‘tpacket_fill_skb’:
      net/packet/af_packet.c:948: warning: format ‘%lu’ expects type ‘long unsigned int’, but argument 2 has type ‘int’
      
      Fix by forcing the constant to be unsigned long; otherwise we have
      a situation where the type of a system-wide constant is variable.
      (A small illustration of the format-warning issue follows this entry.)
      Signed-off-by: David S. Miller <davem@davemloft.net>
      eec00954
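      A hedged illustration of the warning class being fixed: a constant whose
      definition can expand to int on one configuration and unsigned long on another
      makes "%lu" format strings mismatch, so the UL suffix pins the type. The values
      below are placeholders, not the kernel's real MAX_SKB_FRAGS arithmetic.

      #include <stdio.h>

      #define MAX_FRAGS_BROKEN (65536 / 4096 + 2)     /* plain int: "%lu" would warn here        */
      #define MAX_FRAGS_FIXED  (65536UL / 4096 + 2)   /* unsigned long: "%lu" matches everywhere */

      int main(void)
      {
          printf("fixed constant:  %lu\n", MAX_FRAGS_FIXED);
          printf("broken constant: %d\n", MAX_FRAGS_BROKEN);
          return 0;
      }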
  17. 29 Mar 2011, 1 commit
  18. 17 Mar 2011, 1 commit
  19. 25 Jan 2011, 1 commit
  20. 21 Jan 2011, 1 commit
  21. 13 Jan 2011, 1 commit
  22. 17 Dec 2010, 2 commits
  23. 16 Dec 2010, 1 commit
  24. 25 Nov 2010, 1 commit
    • xps: Improvements in TX queue selection · 3853b584
      Committed by Tom Herbert
      In dev_pick_tx, don't do the work of calculating the queue index or
      setting the index in the sock unless the device has more than one
      queue.  This allows the sock to be set only with a queue index of a
      multi-queue device, which is desirable if devices are stacked, as in
      a tunnel.
      
      We also allow the mapping of a socket to a queue to be changed.  To
      maintain in-order packet transmission, a flag (ooo_okay) has been
      added to the sk_buff structure.  If a transport layer sets this flag
      on a packet, the transmit queue can be changed for the socket.
      Presumably, the transport would set this if there was no possibility
      of creating OOO packets (for instance, there are no packets in flight
      for the socket).  This patch includes the modification in TCP output
      for setting this flag.
      (A small model of this queue-selection logic follows this entry.)
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3853b584
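      A hedged, userspace model of the selection policy described above: the queue
      cached on the socket is reused unless the packet is marked ooo_okay (no data in
      flight, so reordering cannot happen), in which case a new mapping may be chosen.
      Names and the hashing are illustrative; the real dev_pick_tx does considerably
      more.

      #include <stdio.h>
      #include <stdbool.h>

      struct sock_model { int tx_queue; };                       /* -1 = no cached mapping */
      struct pkt_model  { bool ooo_okay; unsigned int hash; };

      static int pick_tx_queue(struct sock_model *sk, const struct pkt_model *p,
                               unsigned int num_queues)
      {
          if (num_queues == 1)
              return 0;                                          /* single queue: no work */

          if (sk->tx_queue < 0 || p->ooo_okay)                   /* safe to (re)map the flow */
              sk->tx_queue = (int)(p->hash % num_queues);

          return sk->tx_queue;
      }

      int main(void)
      {
          struct sock_model sk    = { .tx_queue = -1 };
          struct pkt_model  first = { .ooo_okay = true,  .hash = 0x1234 };
          struct pkt_model  later = { .ooo_okay = false, .hash = 0x9999 };

          printf("first packet -> queue %d\n", pick_tx_queue(&sk, &first, 8));
          printf("later packet -> queue %d (unchanged while data is in flight)\n",
                 pick_tx_queue(&sk, &later, 8));
          return 0;
      }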
  25. 20 Oct 2010, 1 commit
    • net: avoid RCU for NOCACHE dst · 27b75c95
      Committed by Eric Dumazet
      There is no point using RCU for dsts we allocate for a very short time
      (used once).
      
      Change dst_release() to take DST_NOCACHE into account, but also change
      skb_dst_set_noref() to force a refcount increment for such dsts.
      
      This is a _huge_ gain, because we don't waste memory storing xx thousand
      dsts. Instead of queueing them to RCU, we can free them instantly.
      
      CPU caches can stay hot, reusing the same memory blocks to hold temporary
      dsts.
      
      Note: removed the unneeded smp_mb__before_atomic_dec() in dst_release(),
      since atomic_dec_return() implies a full memory barrier.
      
      Stress test: 160,000,000 UDP frames sent, IP route cache disabled
      (DDoS).
      
      Before:
      
      real    0m38.091s
      user    0m13.189s
      sys     7m53.018s
      
      After:
      
      real	0m29.946s
      user	0m12.157s
      sys	7m40.605s
      
      For reference, if the IP route cache was enabled:
      
      real	0m32.030s
      user	0m10.521s
      sys	8m15.243s
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      27b75c95
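      A hedged model of the refcounting choice described above: a dst marked as
      one-shot (NOCACHE) takes a plain reference when attached and is freed the moment
      the last reference drops, instead of being queued for deferred RCU freeing.
      Names and the flag value are illustrative only.

      #include <stdio.h>
      #include <stdlib.h>

      #define DST_NOCACHE_MODEL 0x1

      struct dst_model {
          int          refcnt;
          unsigned int flags;
      };

      static void dst_hold_model(struct dst_model *d) { d->refcnt++; }

      static void dst_release_model(struct dst_model *d)
      {
          if (--d->refcnt == 0 && (d->flags & DST_NOCACHE_MODEL)) {
              /* one-shot dst: free instantly, memory is reusable right away */
              free(d);
              printf("one-shot dst freed immediately, no deferred grace period\n");
              return;
          }
          /* a cached dst would instead stay around for later lookups */
      }

      int main(void)
      {
          struct dst_model *d = calloc(1, sizeof(*d));

          if (!d)
              return 1;
          d->flags = DST_NOCACHE_MODEL;

          dst_hold_model(d);        /* the noref attach path now takes a real reference */
          /* ... packet is routed and transmitted ... */
          dst_release_model(d);     /* last put: immediate free */
          return 0;
      }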
  26. 17 Oct 2010, 1 commit
    • net: allocate skbs on local node · 564824b0
      Committed by Eric Dumazet
      Commit b30973f8 (node-aware skb allocation) spread a bad habit of
      allocating net driver skbs on a given memory node: the one closest to
      the NIC hardware. This is wrong because as soon as we try to scale the
      network stack, we need to use many CPUs to handle traffic and we hit
      slub/slab management overhead on cross-node allocations/frees when these
      CPUs have to alloc/free skbs bound to a central node.
      
      skbs allocated in the RX path are ephemeral; they have a very short
      lifetime, and the extra cost of maintaining NUMA affinity is too
      expensive. What appeared to be a nice idea four years ago is in fact a
      bad one.
      
      In 2010, NIC hardware is multiqueue, or we use RPS to spread the load,
      and two 10Gb NICs might deliver more than 28 million packets per second,
      needing all the available CPUs.
      
      The cost of cross-node handling in the network and VM stacks outweighs
      the small benefit the hardware had from doing its DMA transfer into its
      'local' memory node at RX time. Even trying to differentiate the two
      allocations done for one skb (the sk_buff on the local node, the data
      part on the NIC hardware node) is not enough to bring good performance.
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Acked-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      564824b0
  27. 27 Sep 2010, 1 commit
  28. 24 Sep 2010, 1 commit
  29. 03 Sep 2010, 1 commit
  30. 23 Aug 2010, 1 commit
  31. 19 Aug 2010, 1 commit
  32. 17 Aug 2010, 1 commit
    • core: Factor out flow calculation from get_rps_cpu · bfb564e7
      Committed by Krishna Kumar
      Factor out flow calculation code from get_rps_cpu, since other
      functions can use the same code.
      
      Revisions:
      
      v2 (Ben): Separate flow calculation out and use it in select queue.
      v3 (Arnd): Don't re-implement MIN.
      v4 (Changli): skb->data points to ethernet header in macvtap, and
      	make a fast path. Tested macvtap with this patch.
      v5 (Changli):
      	- Cache skb->rxhash in skb_get_rxhash.
      	- macvtap may not have a power-of-two number of queues, so change
      	  the queue selection code.
          (Arnd):
      	- Use the first available queue if all else fails.
      (A small model of the rxhash caching idea follows this entry.)
      Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bfb564e7
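      A hedged model of the v5 "cache skb->rxhash" idea: compute the flow hash once,
      stash it on the packet, and reuse it for both RPS and queue selection, with
      queue counts that need not be a power of two. The toy hash and field names
      below are illustrative, not the kernel's real rxhash computation.

      #include <stdio.h>
      #include <stdint.h>

      struct pkt_model {
          uint32_t saddr, daddr;
          uint16_t sport, dport;
          uint32_t rxhash;                    /* 0 means "not computed yet" */
      };

      static uint32_t flow_hash(const struct pkt_model *p)
      {
          /* toy mix of the 4-tuple, standing in for the real flow hash */
          uint32_t h = p->saddr ^ (p->daddr * 2654435761u) ^
                       (((uint32_t)p->sport << 16) | p->dport);
          return h ? h : 1;                   /* never return 0, which marks "unset" */
      }

      static uint32_t get_rxhash(struct pkt_model *p)
      {
          if (!p->rxhash)
              p->rxhash = flow_hash(p);       /* computed once, reused afterwards */
          return p->rxhash;
      }

      int main(void)
      {
          struct pkt_model p = { .saddr = 0xc0a80101, .daddr = 0xc0a80102,
                                 .sport = 12345, .dport = 80 };
          unsigned int nqueues = 6;           /* need not be a power of two */

          printf("rxhash=0x%08x -> queue %u\n",
                 (unsigned int)get_rxhash(&p), get_rxhash(&p) % nqueues);
          return 0;
      }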
  33. 05 Aug 2010, 1 commit
  34. 03 Aug 2010, 1 commit
  35. 25 Jul 2010, 1 commit