1. 16 September 2011, 1 commit
    • net: copy userspace buffers on device forwarding · 48c83012
      Committed by Michael S. Tsirkin
      dev_forward_skb loops an skb back into the host networking
      stack, which might hold on to the memory indefinitely.
      In particular, this can happen in macvtap in bridged mode.
      Copy the userspace fragments to avoid blocking the
      sender in that case (a small user-space model of this
      copy-on-forward step follows this entry).
      
      As this patch makes skb_copy_ubufs extern now,
      I also added some documentation and made skb_copy_ubufs
      clear the SKBTX_DEV_ZEROCOPY flag automatically instead
      of doing it in all callers. This can be made into a separate
      patch if people feel it's worth it.
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      48c83012
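      The sketch below is a minimal user-space model of the
      copy-before-loopback idea, not the kernel code; every name in it
      (buf_model, copy_ubufs_model, TX_ZEROCOPY, ubuf_cb) is invented
      for illustration, with TX_ZEROCOPY standing in for
      SKBTX_DEV_ZEROCOPY and copy_ubufs_model for skb_copy_ubufs.

      /* Model: before handing a zero-copy buffer to a consumer that may
       * hold it indefinitely, replace the user pages with a private copy
       * and release the sender immediately. Not kernel code. */
      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>

      #define TX_ZEROCOPY 0x1           /* stand-in for SKBTX_DEV_ZEROCOPY */

      struct buf_model {
          unsigned flags;
          char *data;                   /* may point into "userspace" memory */
          size_t len;
          void (*ubuf_cb)(void);        /* tells the sender its pages are free */
      };

      /* Stand-in for skb_copy_ubufs(): copy, clear the flag, release sender. */
      static int copy_ubufs_model(struct buf_model *b)
      {
          char *copy = malloc(b->len);
          if (!copy)
              return -1;
          memcpy(copy, b->data, b->len);
          b->data = copy;
          b->flags &= ~TX_ZEROCOPY;
          b->ubuf_cb();                 /* sender is no longer blocked */
          return 0;
      }

      static void sender_released(void) { puts("sender released"); }

      int main(void)
      {
          char user_page[] = "payload";
          struct buf_model b = { TX_ZEROCOPY, user_page, sizeof(user_page),
                                 sender_released };

          /* dev_forward_skb-like path: the host stack may sit on this
           * buffer indefinitely, so copy the user fragments first. */
          if ((b.flags & TX_ZEROCOPY) && copy_ubufs_model(&b) != 0)
              return 1;                 /* drop on copy failure */

          printf("forwarding private copy: %s\n", b.data);
          free(b.data);
          return 0;
      }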
  2. 27 July 2011, 1 commit
  3. 12 July 2011, 1 commit
  4. 10 July 2011, 1 commit
  5. 07 July 2011, 1 commit
    • skbuff: skb supports zero-copy buffers · a6686f2f
      Committed by Shirley Ma
      This patch adds userspace buffer support in skb shared info. A new
      struct skb_ubuf_info is needed to maintain the userspace buffers
      argument and index; a callback is used to notify userspace to
      release the buffers once the lower device has finished DMA (i.e.
      the last reference to that skb is gone).
      
      If any other user still references these userspace buffers, they
      will be copied into the kernel. This way we can prevent userspace
      apps from having these buffers held for too long.
      
      Use destructor_arg to point to the userspace buffer info; a new tx
      flag, SKBTX_DEV_ZEROCOPY, is added for the zero-copy buffer check
      (a sketch of the callback arrangement follows this entry).
      Signed-off-by: Shirley Ma <xma@...ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a6686f2f
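      The sketch below models the completion-callback arrangement in
      plain user-space C: a struct carrying a callback, a context
      argument, and a buffer index travels with the packet, and the
      callback fires when the last reference is dropped. The names
      (ubuf_info_model, skb_model, vhost_done) are invented; only the
      field roles mirror the commit's skb_ubuf_info.

      /* User-space model of "notify userspace when DMA is done";
       * a plain int models the skb reference count. */
      #include <stdio.h>

      struct ubuf_info_model {
          void (*callback)(void *ctx);  /* notify userspace: buffers reusable */
          void *ctx;                    /* userspace buffers argument */
          unsigned long desc;           /* userspace buffers index */
      };

      struct skb_model {
          int users;                    /* last reference gone => callback */
          struct ubuf_info_model *ubuf; /* commit keeps this in destructor_arg */
      };

      static void skb_put_model(struct skb_model *skb)
      {
          if (--skb->users == 0 && skb->ubuf)
              skb->ubuf->callback(skb->ubuf->ctx);  /* DMA done, release bufs */
      }

      static void vhost_done(void *ctx)
      {
          printf("guest may reuse buffer index %lu\n",
                 ((struct ubuf_info_model *)ctx)->desc);
      }

      int main(void)
      {
          struct ubuf_info_model u = { vhost_done, &u, 42 };
          struct skb_model skb = { 2, &u };

          skb_put_model(&skb);          /* driver still holds a reference */
          skb_put_model(&skb);          /* last put triggers the callback */
          return 0;
      }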
  6. 20 June 2011, 1 commit
  7. 12 June 2011, 1 commit
    • vlan: Fix the ingress VLAN_FLAG_REORDER_HDR check · 0b5c9db1
      Committed by Jiri Pirko
      Testing of VLAN_FLAG_REORDER_HDR does not belong in vlan_untag
      but rather in vlan_do_receive.  Otherwise the vlan header
      will not be properly put on the packet in the case of
      vlan header acceleration.
      
      As we remove the check from vlan_check_reorder_header,
      rename it to vlan_reorder_header to keep the naming clean.
      
      Fix up the skb->pkt_type early so we don't look at the packet
      after adding the vlan tag, which guarantees we don't goof
      and look at the wrong field.
      
      Use a simple if statement instead of a complicated switch
      statement to decide whether we need to increment rx_stats
      for a multicast packet (see the sketch after this entry).
      
      Hopefully at some point we will just declare the case where
      VLAN_FLAG_REORDER_HDR is cleared as unsupported and remove
      the code.  Until then this keeps it working correctly.
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: Jiri Pirko <jpirko@redhat.com>
      Acked-by: Changli Gao <xiaosuo@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0b5c9db1
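      A small stand-alone illustration of the "simple if instead of a
      switch" point; the stats layout is a model, and PACKET_HOST /
      PACKET_MULTICAST are defined locally so the example is
      self-contained.

      /* Counting multicast receives with a plain if, no switch needed. */
      #include <stdio.h>

      #define PACKET_HOST      0
      #define PACKET_MULTICAST 2

      struct rx_stats_model { unsigned long rx_packets, multicast; };

      static void count_rx(struct rx_stats_model *s, int pkt_type)
      {
          s->rx_packets++;
          if (pkt_type == PACKET_MULTICAST)   /* the simple if */
              s->multicast++;
      }

      int main(void)
      {
          struct rx_stats_model s = { 0, 0 };
          count_rx(&s, PACKET_HOST);
          count_rx(&s, PACKET_MULTICAST);
          printf("rx=%lu mcast=%lu\n", s.rx_packets, s.multicast);
          return 0;
      }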
  8. 23 May 2011, 5 commits
  9. 28 April 2011, 1 commit
    • net: filter: Just In Time compiler for x86-64 · 0a14842f
      Committed by Eric Dumazet
      In order to speed up packet filtering, here is an implementation
      of a JIT compiler for x86_64 (a minimal user-space example of
      attaching a classic BPF filter follows this entry).
      
      It is disabled by default, and must be enabled by the admin.
      
      echo 1 >/proc/sys/net/core/bpf_jit_enable
      
      It uses module_alloc() and module_free() to get memory in the 2GB
      text kernel range, since we call helper functions from the
      generated code.
      
      EAX : BPF A accumulator
      EBX : BPF X accumulator
      RDI : pointer to skb   (first argument given to JIT function)
      RBP : frame pointer (even if CONFIG_FRAME_POINTER=n)
      r9d : skb->len - skb->data_len (headlen)
      r8  : skb->data
      
      To get a trace of the generated code, use:
      
      echo 2 >/proc/sys/net/core/bpf_jit_enable
      
      Example of generated code:
      
      # tcpdump -p -n -s 0 -i eth1 net 192.168.20.0/24
      
      flen=18 proglen=147 pass=3 image=ffffffffa00b5000
      JIT code: ffffffffa00b5000: 55 48 89 e5 48 83 ec 60 48 89 5d f8 44 8b 4f 60
      JIT code: ffffffffa00b5010: 44 2b 4f 64 4c 8b 87 b8 00 00 00 be 0c 00 00 00
      JIT code: ffffffffa00b5020: e8 24 7b f7 e0 3d 00 08 00 00 75 28 be 1a 00 00
      JIT code: ffffffffa00b5030: 00 e8 fe 7a f7 e0 24 00 3d 00 14 a8 c0 74 49 be
      JIT code: ffffffffa00b5040: 1e 00 00 00 e8 eb 7a f7 e0 24 00 3d 00 14 a8 c0
      JIT code: ffffffffa00b5050: 74 36 eb 3b 3d 06 08 00 00 74 07 3d 35 80 00 00
      JIT code: ffffffffa00b5060: 75 2d be 1c 00 00 00 e8 c8 7a f7 e0 24 00 3d 00
      JIT code: ffffffffa00b5070: 14 a8 c0 74 13 be 26 00 00 00 e8 b5 7a f7 e0 24
      JIT code: ffffffffa00b5080: 00 3d 00 14 a8 c0 75 07 b8 ff ff 00 00 eb 02 31
      JIT code: ffffffffa00b5090: c0 c9 c3
      
      The BPF program is 144 bytes long, so the native program is almost
      the same size ;)
      
      (000) ldh      [12]
      (001) jeq      #0x800           jt 2    jf 8
      (002) ld       [26]
      (003) and      #0xffffff00
      (004) jeq      #0xc0a81400      jt 16   jf 5
      (005) ld       [30]
      (006) and      #0xffffff00
      (007) jeq      #0xc0a81400      jt 16   jf 17
      (008) jeq      #0x806           jt 10   jf 9
      (009) jeq      #0x8035          jt 10   jf 17
      (010) ld       [28]
      (011) and      #0xffffff00
      (012) jeq      #0xc0a81400      jt 16   jf 13
      (013) ld       [38]
      (014) and      #0xffffff00
      (015) jeq      #0xc0a81400      jt 16   jf 17
      (016) ret      #65535
      (017) ret      #0
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
      Cc: Ben Hutchings <bhutchings@solarflare.com>
      Cc: Hagen Paul Pfeifer <hagen@jauu.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0a14842f
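      For context, this is how a classic BPF program reaches the JIT in
      practice: any filter attached with SO_ATTACH_FILTER becomes a
      candidate for compilation once bpf_jit_enable is set. A minimal
      example that accepts only IPv4 frames on a packet socket (requires
      CAP_NET_RAW; error handling abbreviated):

      #include <stdio.h>
      #include <sys/socket.h>
      #include <arpa/inet.h>
      #include <linux/if_ether.h>
      #include <linux/filter.h>

      int main(void)
      {
          struct sock_filter code[] = {
              BPF_STMT(BPF_LD  | BPF_H   | BPF_ABS, 12),       /* ldh [12] */
              BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, ETH_P_IP, 0, 1),
              BPF_STMT(BPF_RET | BPF_K, 0xffff),               /* accept   */
              BPF_STMT(BPF_RET | BPF_K, 0),                    /* drop     */
          };
          struct sock_fprog prog = {
              .len = sizeof(code) / sizeof(code[0]),
              .filter = code,
          };
          int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));

          if (fd < 0 || setsockopt(fd, SOL_SOCKET, SO_ATTACH_FILTER,
                                   &prog, sizeof(prog)) < 0) {
              perror("socket/setsockopt");
              return 1;
          }
          puts("filter attached; the kernel may JIT it if bpf_jit_enable=1");
          return 0;
      }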
  10. 31 March 2011, 1 commit
  11. 30 March 2011, 1 commit
    • net: Fix warnings caused by MAX_SKB_FRAGS change. · eec00954
      Committed by David S. Miller
      After commit a715dea3 ("net: Always
      allocate at least 16 skb frags regardless of page size"),
      MAX_SKB_FRAGS can now end up as either an "unsigned long" or an
      "int" value.
      
      This causes warnings like:
      
      net/packet/af_packet.c: In function ‘tpacket_fill_skb’:
      net/packet/af_packet.c:948: warning: format ‘%lu’ expects type ‘long unsigned int’, but argument 2 has type ‘int’
      
      Fix by forcing the constant to be unsigned long; otherwise we have
      a situation where the type of a system-wide constant is variable
      (see the sketch after this entry).
      Signed-off-by: David S. Miller <davem@davemloft.net>
      eec00954
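      The warning and the fix in miniature: without a UL suffix, the
      constant's type depends on which arm of the conditional the
      preprocessor picks, so %lu is wrong on one of them. The definition
      below only approximates the kernel's; PAGE_SIZE is fixed locally
      so the example compiles standalone.

      #include <stdio.h>

      #define PAGE_SIZE 4096UL
      #if (65536 / PAGE_SIZE + 1) < 16
      #define MAX_SKB_FRAGS 16UL        /* UL: was a plain int before the fix */
      #else
      #define MAX_SKB_FRAGS (65536 / PAGE_SIZE + 1)  /* unsigned long math */
      #endif

      int main(void)
      {
          /* %lu now matches the type no matter which branch defined it */
          printf("MAX_SKB_FRAGS = %lu\n", MAX_SKB_FRAGS);
          return 0;
      }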
  12. 29 March 2011, 1 commit
  13. 17 March 2011, 1 commit
  14. 25 January 2011, 1 commit
  15. 21 January 2011, 1 commit
  16. 13 January 2011, 1 commit
  17. 17 December 2010, 2 commits
  18. 16 December 2010, 1 commit
  19. 25 November 2010, 1 commit
    • xps: Improvements in TX queue selection · 3853b584
      Committed by Tom Herbert
      In dev_pick_tx, don't do the work of calculating the queue index
      or setting the index in the sock unless the device has more than
      one queue.  This allows the sock to be set with a queue index only
      for a multi-queue device, which is desirable if devices are
      stacked, as in a tunnel.
      
      We also allow the mapping of a socket to a queue to be changed.
      To maintain in-order packet transmission, a flag (ooo_okay) has
      been added to the sk_buff structure.  If a transport layer sets
      this flag on a packet, the transmit queue can be changed for the
      socket.  Presumably, the transport would set this if there was no
      possibility of creating OOO packets (for instance, when there are
      no packets in flight for the socket).  This patch includes the
      modification in TCP output for setting this flag (a toy model of
      the selection rule follows this entry).
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3853b584
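      A toy model of the selection rule described above: consult and
      record the cached queue only on multi-queue devices, and allow
      remapping when the transport set ooo_okay. Field and function
      names are illustrative, not the kernel's exact ones.

      #include <stdio.h>

      struct dev_model  { unsigned real_num_tx_queues; };
      struct sock_model { int tx_queue; };                 /* -1 == unset */
      struct pkt_model  { int ooo_okay; unsigned hash; };

      static unsigned pick_tx(struct dev_model *dev, struct sock_model *sk,
                              struct pkt_model *p)
      {
          if (dev->real_num_tx_queues == 1)
              return 0;                        /* no work for single-queue devs */

          if (sk->tx_queue >= 0 && !p->ooo_okay)
              return (unsigned)sk->tx_queue;   /* keep in-order delivery */

          /* recompute (hash here) and cache only on multi-queue devices */
          sk->tx_queue = (int)(p->hash % dev->real_num_tx_queues);
          return (unsigned)sk->tx_queue;
      }

      int main(void)
      {
          struct dev_model dev = { 4 };
          struct sock_model sk = { -1 };
          struct pkt_model a = { 0, 7 }, b = { 1, 9 };

          printf("q=%u\n", pick_tx(&dev, &sk, &a)); /* caches 7 % 4 == 3 */
          printf("q=%u\n", pick_tx(&dev, &sk, &a)); /* sticks to cached 3 */
          printf("q=%u\n", pick_tx(&dev, &sk, &b)); /* ooo_okay: may move */
          return 0;
      }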
  20. 20 October 2010, 1 commit
    • net: avoid RCU for NOCACHE dst · 27b75c95
      Committed by Eric Dumazet
      There is no point using RCU for a dst we allocate for a very short
      time (used once).
      
      Change dst_release() to take DST_NOCACHE into account, but also
      change skb_dst_set_noref() to force a refcount increment for such
      a dst.
      
      This is a _huge_ gain, because we don't waste memory storing xx
      thousand dsts.  Instead of queueing them to RCU, we can free them
      instantly (a user-space model of this follows this entry).
      
      CPU caches can stay hot, re-using the same memory blocks to hold
      temporary dsts.
      
      Note: removed an unneeded smp_mb__before_atomic_dec() in
      dst_release(), since atomic_dec_return() implies a full memory
      barrier.
      
      Stress test: 160,000,000 UDP frames sent, IP route cache disabled
      (DDoS scenario).
      
      Before:
      
      real    0m38.091s
      user    0m13.189s
      sys     7m53.018s
      
      After:
      
      real	0m29.946s
      user	0m12.157s
      sys	7m40.605s
      
      For reference, if IP route cache was enabled :
      
      real	0m32.030s
      user	0m10.521s
      sys	8m15.243s
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      27b75c95
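      A user-space model of the refcount logic: an object marked
      NOCACHE is freed immediately when the last reference drops,
      instead of being queued for deferred (RCU-style) destruction.
      All names below are invented for the model.

      #include <stdio.h>
      #include <stdlib.h>
      #include <stdatomic.h>

      #define DST_NOCACHE_MODEL 0x1

      struct dst_model {
          atomic_int refcnt;
          unsigned flags;
      };

      static void dst_release_model(struct dst_model *dst)
      {
          /* atomic_fetch_sub returns the old value; old == 1 => last ref */
          if (atomic_fetch_sub(&dst->refcnt, 1) == 1) {
              if (dst->flags & DST_NOCACHE_MODEL) {
                  free(dst);               /* one-shot dst: free right away */
                  puts("freed immediately");
              } else {
                  puts("would be queued for deferred destruction");
              }
          }
      }

      int main(void)
      {
          struct dst_model *dst = malloc(sizeof(*dst));
          atomic_init(&dst->refcnt, 1);    /* noref users must take a ref too */
          dst->flags = DST_NOCACHE_MODEL;
          dst_release_model(dst);
          return 0;
      }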
  21. 17 October 2010, 1 commit
    • net: allocate skbs on local node · 564824b0
      Committed by Eric Dumazet
      commit b30973f8 (node-aware skb allocation) spread a wrong habit
      of allocating net drivers' skbs on a given memory node: the one
      closest to the NIC hardware. This is wrong because as soon as we
      try to scale the network stack, we need many cpus to handle
      traffic, and we hit slub/slab management on cross-node
      allocations/frees when these cpus have to alloc/free skbs bound
      to a central node.
      
      skbs allocated in the RX path are ephemeral; they have a very
      short lifetime, and the extra cost of maintaining NUMA affinity
      is too expensive. What appeared to be a nice idea four years ago
      is in fact a bad one.
      
      In 2010, NIC hardware is multiqueue, or we use RPS to spread the
      load, and two 10Gb NICs might deliver more than 28 million packets
      per second, needing all the available cpus.
      
      The cost of cross-node handling in the network and vm stacks
      outweighs the small benefit hardware had when doing its DMA
      transfer into its 'local' memory node at RX time. Even trying to
      differentiate the two allocations done for one skb (the sk_buff
      on the local node, the data part on the NIC hardware node) is not
      enough to bring good performance.
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Acked-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      564824b0
  22. 27 September 2010, 1 commit
  23. 24 September 2010, 1 commit
  24. 03 September 2010, 1 commit
  25. 23 August 2010, 1 commit
  26. 19 August 2010, 1 commit
  27. 17 August 2010, 1 commit
    • core: Factor out flow calculation from get_rps_cpu · bfb564e7
      Committed by Krishna Kumar
      Factor out the flow calculation code from get_rps_cpu, since other
      functions can use the same code (a sketch of the v5 rxhash caching
      pattern follows this entry).
      
      Revisions:
      
      v2 (Ben): Separate flow calculation out and use it in select queue.
      v3 (Arnd): Don't re-implement MIN.
      v4 (Changli): skb->data points to the ethernet header in macvtap;
      	make a fast path. Tested macvtap with this patch.
      v5 (Changli):
      	- Cache skb->rxhash in skb_get_rxhash
      	- macvtap may not have a power-of-2 number of queues, so
      	  change the code for queue selection.
          (Arnd):
      	- Use the first available queue if all else fails.
      Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bfb564e7
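      A sketch of the v5 caching point: compute the flow hash once,
      cache it on the packet, and let every later consumer reuse it.
      The hash function here is a stand-in, not the kernel's actual
      rxhash computation; only the compute-once, cache-on-first-use
      shape is the point.

      #include <stdio.h>
      #include <stdint.h>

      struct pkt_model {
          uint32_t saddr, daddr;
          uint16_t sport, dport;
          uint32_t rxhash;              /* 0 means "not computed yet" */
      };

      static uint32_t compute_flow_hash(const struct pkt_model *p)
      {
          /* placeholder mix; the real code hashes the parsed flow keys */
          return (p->saddr ^ p->daddr) * 2654435761u
                 ^ ((uint32_t)p->sport << 16 | p->dport);
      }

      /* Mirrors the skb_get_rxhash() pattern: compute lazily, then cache. */
      static uint32_t get_rxhash_model(struct pkt_model *p)
      {
          if (!p->rxhash)
              p->rxhash = compute_flow_hash(p);
          return p->rxhash;
      }

      int main(void)
      {
          struct pkt_model p = { 0x0a000001, 0x0a000002, 1234, 80, 0 };
          printf("hash=%08x (computed)\n", get_rxhash_model(&p));
          printf("hash=%08x (cached)\n",  get_rxhash_model(&p));
          return 0;
      }

      For a device without a power-of-2 queue count, the cached hash can
      then be reduced with a plain modulo (hash % numqueues), which is
      what the non-power-of-2 note in v5 is about.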
  28. 05 August 2010, 1 commit
  29. 03 August 2010, 1 commit
  30. 25 July 2010, 1 commit
  31. 19 July 2010, 2 commits
  32. 16 June 2010, 1 commit
  33. 11 June 2010, 1 commit
    • net: deliver skbs on inactive slaves to exact matches · 597a264b
      Committed by John Fastabend
      Currently, the accelerated receive path for VLANs will
      drop packets if the real device is an inactive slave and
      is not one of the special pkts tested for in
      skb_bond_should_drop().  This behavior is different from
      the non-accelerated path and for pkts over a bonded vlan.
      
      For example,
      
      vlanx -> bond0 -> ethx
      
      will be dropped in the vlan path and not delivered to any
      packet handlers at all.  However,
      
      bond0 -> vlanx -> ethx
      
      and
      
      bond0 -> ethx
      
      will be delivered to handlers that match the exact dev,
      because the VLAN path checks the real_dev, which is not a
      slave, and netif_receive_skb() doesn't drop frames but only
      delivers them to exact matches.
      
      This patch adds a sk_buff flag which is used for tagging
      skbs that would previously have been dropped and allows the
      skb to continue on to netif_receive_skb().  Here we add
      logic to check for the deliver_no_wcard flag; if it is set,
      deliver only to handlers that match exactly (a toy model of
      this rule follows this entry).  This makes both paths above
      consistent and gives pkt handlers a way to identify skbs
      that come from inactive slaves.  Without this patch, in some
      configurations skbs will be delivered to handlers with exact
      matches and in others be dropped outright in the vlan path.
      
      I have tested the following 4 configurations in failover modes
      and load balancing modes.
      
      # bond0 -> ethx
      
      # vlanx -> bond0 -> ethx
      
      # bond0 -> vlanx -> ethx
      
      # bond0 -> ethx
                  |
        vlanx -> --
      Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      597a264b
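      A toy model of the delivery rule: a flag on the packet restricts
      it to handlers bound to exactly its device, so frames from
      inactive slaves are neither dropped outright nor handed to
      wildcard handlers. Names are invented; the kernel compares
      net_device pointers much like the string pointers compared here.

      #include <stdio.h>
      #include <stddef.h>

      struct handler_model {
          const char *dev;              /* NULL == wildcard (all devices) */
          const char *name;
      };

      struct pkt_model {
          const char *dev;
          int deliver_no_wcard;         /* set when rx'd on an inactive slave */
      };

      static void deliver(const struct pkt_model *p,
                          const struct handler_model *h, size_t n)
      {
          for (size_t i = 0; i < n; i++) {
              if (p->deliver_no_wcard && (!h[i].dev || h[i].dev != p->dev))
                  continue;             /* exact device matches only */
              printf("%s gets packet from %s\n", h[i].name, p->dev);
          }
      }

      int main(void)
      {
          const char *eth0 = "eth0";
          struct handler_model hs[] = {
              { NULL, "wildcard-sniffer" },
              { eth0, "bonding-on-eth0" },
          };
          struct pkt_model p = { eth0, 1 };

          deliver(&p, hs, 2);           /* only bonding-on-eth0 receives it */
          return 0;
      }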
  34. 05 June 2010, 1 commit