1. 02 Mar, 2011 1 commit
  2. 28 Jan, 2011 1 commit
  3. 25 Jan, 2011 2 commits
    • net: change netdev->features to u32 · 04ed3e74
      Michał Mirosław authored
      Quoting Ben Hutchings: we presumably won't be defining features that
      can only be enabled on 64-bit architectures.
      
      Occurrences found by `grep -r` on net/, drivers/net, include/
      
      [ Move features and vlan_features next to each other in
        struct netdev, as per Eric Dumazet's suggestion -DaveM ]
      Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      04ed3e74
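      A minimal, self-contained C sketch of what the narrower type means in
      practice (the struct and the F_* flags below are hypothetical stand-ins,
      not the kernel's definitions): feature tests remain plain bit operations,
      now on a u32 instead of a wider integer.

          #include <stdint.h>
          #include <stdio.h>

          typedef uint32_t netdev_features32_t;  /* illustrative alias, not a kernel type */

          #define F_SG      (1u << 0)            /* scatter/gather I/O (illustrative bit) */
          #define F_HW_CSUM (1u << 3)            /* checksum offload   (illustrative bit) */

          struct fake_netdev {
              netdev_features32_t features;      /* was a wider integer before the change */
              netdev_features32_t vlan_features; /* kept adjacent, per the note above     */
          };

          int main(void)
          {
              struct fake_netdev dev = { .features = F_SG | F_HW_CSUM };

              if (dev.features & F_SG)
                  puts("scatter/gather enabled");
              return 0;
          }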
    • GRO: fix merging a paged skb after non-paged skbs · d1dc7abf
      Michal Schmidt authored
      Suppose several linear skbs of the same flow were received by GRO and were
      thus merged into one skb with a frag_list. Then a new skb of the same flow
      arrives, but it is a paged skb with its data starting in frags[].
      
      Before adding the skb to the frag_list, skb_gro_receive() of course adjusts
      the skb to throw away the headers. It correctly modifies the page_offset and
      size of the frag, but it leaves incorrect information in the skb:
       ->data_len is not decreased at all.
       ->len is decreased only by headlen, as if no change were done to the frag.
      Later, in the receiving process, this causes skb_copy_datagram_iovec() to
      return -EFAULT, which is seen in userspace as the result of the recv() syscall.
      
      In practice the bug can be reproduced with the sfc driver. By default the
      driver uses an adaptive scheme to switch between napi_gro_receive() (with
      skbs) and napi_gro_frags() (with pages). The bug is reproduced when, under
      rx load with enough successful GRO merging, the driver decides to switch
      from the former to the latter.
      
      Manual control is also possible, so reproducing this is easy with netcat:
       - on machine1 (with sfc): nc -l 12345 > /dev/null
       - on machine2: nc machine1 12345 < /dev/zero
       - on machine1:
         echo 1 > /sys/module/sfc/parameters/rx_alloc_method  # use skbs
         echo 2 > /sys/module/sfc/parameters/rx_alloc_method  # use pages
       - See that nc has quit suddenly.
      
      [v2: Modified by Eric Dumazet to avoid advancing skb->data past the end
           and to use a temporary variable.]
      Signed-off-by: Michal Schmidt <mschmidt@redhat.com>
      Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d1dc7abf
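      A hedged sketch of the corrected accounting, with toy structs rather than
      the real sk_buff (the field names mirror the description above; the
      function name is made up for illustration): pulling header bytes out of
      frags[0] must be mirrored in both skb->len and skb->data_len.

          struct fake_frag { unsigned int page_offset, size; };
          struct fake_skb  { unsigned int len, data_len; struct fake_frag frags[1]; };

          static void pull_header_from_frag(struct fake_skb *skb, unsigned int pull)
          {
              struct fake_frag *frag = &skb->frags[0];

              frag->page_offset += pull;  /* skip the header bytes inside the page  */
              frag->size        -= pull;  /* the frag shrinks by the same amount    */
              skb->len          -= pull;  /* the bug: total length was not updated  */
              skb->data_len     -= pull;  /* ...and neither was the paged-data part */
          }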
  4. 13 Jan, 2011 1 commit
  5. 17 Dec, 2010 1 commit
  6. 16 Dec, 2010 1 commit
  7. 04 Dec, 2010 1 commit
  8. 17 Oct, 2010 1 commit
    • net: allocate skbs on local node · 564824b0
      Eric Dumazet authored
      commit b30973f8 (node-aware skb allocation) spread a bad habit of
      allocating net driver skbs on a given memory node: the one closest to the
      NIC hardware. This is wrong because as soon as we try to scale the network
      stack, we need many cpus to handle traffic, and we hit slub/slab management
      on cross-node allocations/frees when these cpus have to alloc/free skbs
      bound to a central node.
      
      skbs allocated in the RX path are ephemeral; they have a very short
      lifetime, so the extra cost of maintaining NUMA affinity is too expensive.
      What appeared to be a nice idea four years ago is in fact a bad one.
      
      In 2010, NIC hardware is multiqueue, or we use RPS to spread the load,
      and two 10Gb NICs might deliver more than 28 million packets per second,
      needing all the available cpus.
      
      The cost of cross-node handling in the network and vm stacks outweighs the
      small benefit the hardware had when doing its DMA transfer into its 'local'
      memory node at RX time. Even trying to differentiate the two allocations
      done for one skb (the sk_buff on the local node, the data part on the NIC
      hardware node) is not enough to bring good performance.
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Acked-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      564824b0
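      A toy, userspace-flavoured sketch of the policy change (malloc stands in
      for the kernel's node-aware allocators, ANY_NODE for NUMA_NO_NODE; none of
      these names are the kernel's): RX buffers are short-lived, so let the
      allocator pick the node local to the running CPU instead of pinning the
      allocation to the NIC's node.

          #include <stdlib.h>

          enum { ANY_NODE = -1 };                   /* stand-in for NUMA_NO_NODE       */

          /* Toy allocator; a kernel would use kmalloc_node()/kmem_cache_alloc_node(). */
          static void *alloc_on_node(size_t size, int node)
          {
              (void)node;                           /* the toy ignores the node hint   */
              return malloc(size);
          }

          static void *rx_buffer_alloc(size_t size, int nic_node)
          {
              (void)nic_node;                       /* old policy: pass the NIC's node */
              return alloc_on_node(size, ANY_NODE); /* new policy: local node wins     */
          }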
  9. 09 Sep, 2010 1 commit
  10. 07 Sep, 2010 2 commits
    • skb: Add tracepoints to freeing skb · 07dc22e7
      Koki Sanagi authored
      This patch adds a tracepoint to consume_skb and adds trace_kfree_skb
      before __kfree_skb in skb_free_datagram_locked and net_tx_action.
      Combined with the tracepoint on dev_hard_start_xmit, we can check
      how long it takes to free transmitted packets, and from that we can
      calculate how many packets the driver held at that time. This is useful
      when drops of transmitted packets are a problem.
      
                  sshd-6828  [000] 112689.258154: consume_skb: skbaddr=f2d99bb8
      Signed-off-by: Koki Sanagi <sanagi.koki@jp.fujitsu.com>
      Acked-by: David S. Miller <davem@davemloft.net>
      Acked-by: Neil Horman <nhorman@tuxdriver.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Kaneshige Kenji <kaneshige.kenji@jp.fujitsu.com>
      Cc: Izumo Taku <izumi.taku@jp.fujitsu.com>
      Cc: Kosaki Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Scott Mcmillan <scott.a.mcmillan@intel.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      LKML-Reference: <4C724364.50903@jp.fujitsu.com>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      07dc22e7
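      A hedged sketch of the call-site pattern with stub tracepoints (printf
      stands in for the real trace_consume_skb()/trace_kfree_skb() events; the
      toy types and stubs are not kernel code): the hook fires just before the
      skb is actually freed, on both the consume and the drop-style paths.

          #include <stdio.h>

          struct fake_skb { int id; };

          static void trace_consume_skb(struct fake_skb *skb)
          {
              printf("consume_skb: skbaddr=%p\n", (void *)skb);
          }
          static void trace_kfree_skb(struct fake_skb *skb, void *call_site)
          {
              printf("kfree_skb: skbaddr=%p call_site=%p\n", (void *)skb, call_site);
          }
          static void __kfree_skb(struct fake_skb *skb) { (void)skb; /* real free here */ }

          /* Normal completion path: the packet was consumed successfully. */
          static void consume_skb(struct fake_skb *skb)
          {
              trace_consume_skb(skb);
              __kfree_skb(skb);
          }

          /* Drop-style path: record the caller before freeing. */
          static void drop_skb(struct fake_skb *skb)
          {
              trace_kfree_skb(skb, __builtin_return_address(0));
              __kfree_skb(skb);
          }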
    • net: pskb_expand_head() optimization · 1fd63041
      Eric Dumazet authored
      pskb_expand_head() blindly takes references on fragments before calling
      skb_release_data(), potentially releasing these references.
      
      We can add a fast path, avoiding these atomic operations, if we own the
      last reference on skb->head.
      
      Based on a previous patch from David
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1fd63041
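      A hedged sketch of the fast-path idea with toy types (not the literal
      kernel code; the real test involves skb_cloned() and the shared-info
      dataref count): when we hold the only reference on skb->head, the
      per-fragment get/put round trip can be skipped entirely.

          #include <stdbool.h>

          struct toy_shinfo { int dataref; int nr_frags; };
          struct toy_skb    { bool cloned; struct toy_shinfo *shinfo; };

          static bool owns_head_exclusively(const struct toy_skb *skb)
          {
              return !skb->cloned || skb->shinfo->dataref == 1;
          }

          static void expand_head(struct toy_skb *skb)
          {
              if (owns_head_exclusively(skb))
                  return;   /* fast path: reuse the existing page references */

              /* Slow path: take a reference on every frag before releasing
               * the old data -- exactly the atomic traffic the fast path avoids. */
              for (int i = 0; i < skb->shinfo->nr_frags; i++)
                  /* get_page(frag->page) in the real code */;
          }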
  11. 04 Sep, 2010 1 commit
  12. 02 Sep, 2010 2 commits
    • gro: fix different skb headrooms · 3d3be433
      Eric Dumazet authored
      Packets entering GRO might have different headrooms, even within a given
      flow (because of implementation details in drivers, like copybreak).
      We can't force drivers to deliver packets with a fixed headroom.
      
      1) fix skb_segment()
      
      skb_segment() makes the false assumption that fragment headrooms are the
      same as the head's. When CHECKSUM_PARTIAL is used, this can give csum_start
      errors and crash later in skb_copy_and_csum_dev().
      
      2) allocate a minimal skb for head of frag_list
      
      skb_gro_receive() uses netdev_alloc_skb(headroom + skb_gro_offset(p)) to
      allocate a fresh skb. This adds NET_SKB_PAD on top of padding already
      provided by the netdevice, which depends on various things, like copybreak.
      
      Use alloc_skb() to allocate exact padding instead, to reduce cache-line
      needs:
      NET_SKB_PAD + NET_IP_ALIGN
      
      bugzilla : https://bugzilla.kernel.org/show_bug.cgi?id=16626
      
      Many thanks to Plamen Petrov for testing many debugging patches!
      With help from Jarek Poplawski.
      Reported-by: Plamen Petrov <pvp-lsts@fs.uni-ruse.bg>
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      CC: Jarek Poplawski <jarkao2@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3d3be433
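      A hedged sketch of the exact-padding allocation with toy stand-ins (the
      TOY_* constants are illustrative, not the kernel's values; the toy
      functions mimic alloc_skb()/skb_reserve()): allocate the frag_list head
      with exactly the padding GRO needs, rather than letting netdev_alloc_skb()
      stack NET_SKB_PAD on top of whatever headroom the driver already provided.

          #include <stdlib.h>

          #define TOY_NET_SKB_PAD   32   /* illustrative values, not the kernel's */
          #define TOY_NET_IP_ALIGN   2

          struct toy_skb { unsigned char *head, *data; };

          /* Toy stand-ins for alloc_skb() / skb_reserve(). */
          static struct toy_skb *toy_alloc_skb(unsigned int size)
          {
              struct toy_skb *skb = calloc(1, sizeof(*skb));
              if (skb && !(skb->head = malloc(size))) {
                  free(skb);
                  return NULL;
              }
              if (skb)
                  skb->data = skb->head;
              return skb;
          }
          static void toy_skb_reserve(struct toy_skb *skb, unsigned int n) { skb->data += n; }

          static struct toy_skb *alloc_gro_head(unsigned int gro_offset)
          {
              unsigned int pad = TOY_NET_SKB_PAD + TOY_NET_IP_ALIGN;
              struct toy_skb *nskb = toy_alloc_skb(gro_offset + pad);  /* exact size */

              if (nskb)
                  toy_skb_reserve(nskb, pad);   /* headers start right after the pad */
              return nskb;
          }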
    • net: skbuff.c cleanup · 6602cebb
      Eric Dumazet authored
      (skb->data - skb->head) can be replaced by skb_headroom(skb).
      
      Remove some uses of NET_SKBUFF_DATA_USES_OFFSET by using
      (skb_end_pointer(skb) - skb->head) or
      (skb_tail_pointer(skb) - skb->head): the compiler does the right thing,
      and this is more readable for us ;)
      
      Add (struct skb_shared_info *) casts in pskb_expand_head() to help memcpy()
      use aligned moves.
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6602cebb
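      A small illustrative sketch with toy types (the real helpers are
      skb_headroom(), skb_end_pointer() and skb_tail_pointer()): the point of
      the cleanup is that open-coded pointer arithmetic goes through an
      accessor, so the offset-vs-pointer representation behind
      NET_SKBUFF_DATA_USES_OFFSET stays hidden in one place.

          struct toy_skb { unsigned char *head, *data, *tail, *end; };

          /* Accessors hide how 'end'/'tail'/'data' are actually represented. */
          static unsigned char *toy_end_pointer(const struct toy_skb *skb)  { return skb->end; }
          static unsigned char *toy_tail_pointer(const struct toy_skb *skb) { return skb->tail; }
          static unsigned int   toy_headroom(const struct toy_skb *skb)
          {
              return (unsigned int)(skb->data - skb->head);
          }

          /* Instead of open-coding (skb->end - skb->head), go through the helper. */
          static unsigned int head_buffer_size(const struct toy_skb *skb)
          {
              return (unsigned int)(toy_end_pointer(skb) - skb->head);
          }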
  13. 23 Aug, 2010 1 commit
  14. 19 Aug, 2010 1 commit
  15. 25 Jul, 2010 1 commit
  16. 23 Jul, 2010 2 commits
  17. 13 Jul, 2010 1 commit
  18. 14 Jun, 2010 2 commits
  19. 01 Jun, 2010 1 commit
  20. 29 May, 2010 2 commits
  21. 22 May, 2010 1 commit
  22. 21 May, 2010 1 commit
  23. 18 May, 2010 1 commit
    • net: add a noref bit on skb dst · 7fee226a
      Eric Dumazet authored
      Use the low-order bit of skb->_skb_dst to tell that the dst is not refcounted.
      
      Rename _skb_dst to _skb_refdst to make sure all uses are caught.
      
      skb_dst() returns the dst regardless of whether the noref bit is set, but
      with a lockdep check to make sure a noref dst is not given out if the
      current user is not rcu protected.
      
      New skb_dst_set_noref() helper to set a non-refcounted dst on an skb
      (with a lockdep check).
      
      skb_dst_drop() drops a reference only if skb dst was refcounted.
      
      The skb_dst_force() helper is used to force a refcount on the dst when the
      skb is queued and no longer RCU protected.
      
      Use skb_dst_force() in __sk_add_backlog(), __dev_xmit_skb() if
      !IFF_XMIT_DST_RELEASE or skb enqueued on qdisc queue, in
      sock_queue_rcv_skb(), in __nf_queue().
      
      Use skb_dst_force() in dev_requeue_skb().
      
      Note: dst_use_noref() still dirties the dst; we might change it later to
      do one dirtying per jiffy.
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7fee226a
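      A hedged, userspace-style sketch of the pointer-tagging idea (toy types;
      these are not the kernel helpers, and the real code uses dst_hold()/
      dst_release() rather than a bare counter): bit 0 of the stored word marks
      a dst reference that was NOT taken, and skb_dst_force() upgrades it to a
      real reference before the skb leaves RCU protection.

          #include <stdint.h>

          #define DST_NOREF 1UL

          struct toy_dst { int refcnt; };
          struct toy_skb { uintptr_t _skb_refdst; };

          static struct toy_dst *toy_skb_dst(const struct toy_skb *skb)
          {
              return (struct toy_dst *)(skb->_skb_refdst & ~DST_NOREF);
          }

          static void toy_skb_dst_set_noref(struct toy_skb *skb, struct toy_dst *dst)
          {
              skb->_skb_refdst = (uintptr_t)dst | DST_NOREF;  /* borrowed, RCU-protected */
          }

          /* Take a real reference before the skb is queued outside RCU protection. */
          static void toy_skb_dst_force(struct toy_skb *skb)
          {
              if (skb->_skb_refdst & DST_NOREF) {
                  struct toy_dst *dst = toy_skb_dst(skb);
                  dst->refcnt++;                              /* dst_hold() in the kernel */
                  skb->_skb_refdst = (uintptr_t)dst;
              }
          }

          /* Drop a reference only if one was actually taken. */
          static void toy_skb_dst_drop(struct toy_skb *skb)
          {
              if (skb->_skb_refdst && !(skb->_skb_refdst & DST_NOREF))
                  toy_skb_dst(skb)->refcnt--;                 /* dst_release() in the kernel */
              skb->_skb_refdst = 0;
          }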
  24. 05 May, 2010 1 commit
    • net: __alloc_skb() speedup · ec7d2f2c
      Eric Dumazet authored
      With the following patch I can reach the maximum rate of my pktgen+udpsink
      simulator:
      - 'old' machine : dual quad core E5450  @3.00GHz
      - 64 UDP rx flows (only differ by destination port)
      - RPS enabled, NIC interrupts serviced on cpu0
      - rps dispatched on 7 other cores. (~130.000 IPI per second)
      - SLAB allocator (faster than SLUB in this workload)
      - tg3 NIC
      - 1.080.000 pps without a single drop at NIC level.
      
      The idea is to add two prefetchw() calls in __alloc_skb(): one to prefetch
      the first sk_buff cache line, the second to prefetch the shinfo part.
      
      Also use one memset() to initialize all skb_shared_info fields instead of
      doing it one by one, to reduce the number of instructions, using long-word
      moves.
      
      All skb_shared_info fields before 'dataref' are cleared in __alloc_skb().
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      ec7d2f2c
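      A hedged sketch of the two micro-optimizations with toy structs (the
      kernel uses prefetchw() and offsetof(struct skb_shared_info, dataref);
      the names below are made up): prefetch-for-write the two hot areas, then
      clear everything before 'dataref' with a single memset.

          #include <stddef.h>
          #include <string.h>

          struct toy_shinfo { unsigned short nr_frags; unsigned short gso_size; int dataref; };
          struct toy_skb    { char first_cacheline[64]; struct toy_shinfo *shinfo; };

          static void toy_alloc_init(struct toy_skb *skb, struct toy_shinfo *shinfo)
          {
              __builtin_prefetch(skb, 1);     /* prefetch for write: first sk_buff line */
              __builtin_prefetch(shinfo, 1);  /* ...and the shared-info area            */

              /* One long-word-friendly memset clears every field before 'dataref',
               * instead of assigning each field individually. */
              memset(shinfo, 0, offsetof(struct toy_shinfo, dataref));
              shinfo->dataref = 1;            /* the refcount is set explicitly after   */
              skb->shinfo = shinfo;
          }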
  25. 02 May, 2010 1 commit
  26. 21 Apr, 2010 1 commit
  27. 17 Mar, 2010 1 commit
    • rps: Receive Packet Steering · 0a9627f2
      Tom Herbert authored
      This patch implements software receive side packet steering (RPS).  RPS
      distributes the load of received packet processing across multiple CPUs.
      
      Problem statement: Protocol processing done in the NAPI context for received
      packets is serialized per device queue and becomes a bottleneck under high
      packet load.  This substantially limits pps that can be achieved on a single
      queue NIC and provides no scaling with multiple cores.
      
      This solution queues packets early on in the receive path on the backlog queues
      of other CPUs.   This allows protocol processing (e.g. IP and TCP) to be
      performed on packets in parallel.   For each device (or each receive queue in
      a multi-queue device) a mask of CPUs is set to indicate the CPUs that can
      process packets. A CPU is selected on a per packet basis by hashing contents
      of the packet header (e.g. the TCP or UDP 4-tuple) and using the result to index
      into the CPU mask.  The IPI mechanism is used to raise networking receive
      softirqs between CPUs.  This effectively emulates in software what a multi-queue
      NIC can provide, but is generic requiring no device support.
      
      Many devices now provide a hash over the 4-tuple on a per packet basis
      (e.g. the Toeplitz hash).  This patch allows drivers to set the HW reported hash
      in an skb field, and that value in turn is used to index into the RPS maps.
      Using the HW generated hash can avoid cache misses on the packet when
      steering it to a remote CPU.
      
      The CPU mask is set on a per device and per queue basis in the sysfs variable
      /sys/class/net/<device>/queues/rx-<n>/rps_cpus.  This is a set of canonical
      bit maps for receive queues in the device (numbered by <n>).  If a device
      does not support multi-queue, a single variable is used for the device (rx-0).
      
      Generally, we have found this technique increases pps capabilities of a single
      queue device with good CPU utilization.  Optimal settings for the CPU mask
      seem to depend on architecture and cache hierarchy.  Below are some results
      running 500 instances of netperf TCP_RR test with 1 byte req. and resp.
      Results show cumulative transaction rate and system CPU utilization.
      
      e1000e on 8 core Intel
         Without RPS: 108K tps at 33% CPU
         With RPS:    311K tps at 64% CPU
      
      forcedeth on 16 core AMD
         Without RPS: 156K tps at 15% CPU
         With RPS:    404K tps at 49% CPU
      
      bnx2x on 16 core AMD
         Without RPS  567K tps at 61% CPU (4 HW RX queues)
         Without RPS  738K tps at 96% CPU (8 HW RX queues)
         With RPS:    854K tps at 76% CPU (4 HW RX queues)
      
      Caveats:
      - The benefits of this patch are dependent on architecture and cache hierarchy.
      Tuning the masks to get best performance is probably necessary.
      - This patch adds overhead in the path for processing a single packet.  In
      a lightly loaded server this overhead may eliminate the advantages of
      increased parallelism, and possibly cause some relative performance degradation.
      We have found that masks that are cache aware (share same caches with
      the interrupting CPU) mitigate much of this.
      - The RPS masks can be changed dynamically, however whenever the mask is changed
      this introduces the possibility of generating out-of-order packets.  It's
      probably best not to change the masks too frequently.
      Signed-off-by: Tom Herbert <therbert@google.com>
      
       include/linux/netdevice.h |   32 ++++-
       include/linux/skbuff.h    |    3 +
       net/core/dev.c            |  335 +++++++++++++++++++++++++++++++++++++--------
       net/core/net-sysfs.c      |  225 ++++++++++++++++++++++++++++++-
       net/core/skbuff.c         |    2 +
       5 files changed, 538 insertions(+), 59 deletions(-)
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0a9627f2
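      A hedged, userspace-flavoured sketch of the core selection step (toy
      types; the real kernel code lives in get_rps_cpu() and scales the hash
      rather than taking a modulo): the flow hash, preferably the HW-reported
      one, indexes into the per-queue CPU map built from rps_cpus.

          #include <stdint.h>

          #define MAX_CPUS 64

          struct toy_rps_map {
              unsigned int len;            /* number of CPUs enabled via rps_cpus */
              uint16_t     cpus[MAX_CPUS]; /* expanded list of those CPU ids      */
          };

          /* Pick the CPU that should process this packet.  'rxhash' is ideally the
           * HW-reported 4-tuple hash (e.g. Toeplitz), avoiding a header cache miss. */
          static int toy_select_cpu(const struct toy_rps_map *map, uint32_t rxhash)
          {
              if (!map || map->len == 0)
                  return -1;                       /* steering disabled for this queue */
              return map->cpus[rxhash % map->len]; /* simple index into the CPU map    */
          }

      The map itself is configured per receive queue through the sysfs path
      quoted above, e.g. writing a hex CPU bitmask such as 'f' (CPUs 0-3, an
      illustrative value) into /sys/class/net/<device>/queues/rx-0/rps_cpus.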
  28. 16 Dec, 2009 1 commit
  29. 21 Nov, 2009 1 commit
  30. 17 Nov, 2009 1 commit
  31. 12 Nov, 2009 1 commit
    • skbuff: Do not allow skb recycling with disabled IRQs · e84af6dd
      Anton Vorontsov authored
      NAPI drivers try to recycle SKBs in their polling routine, but we
      generally don't know the context in which the polling will be called,
      and the skb recycling itself may require IRQs to be enabled.
      
      This patch adds an irqs_disabled() test to the skb_recycle_check()
      routine, so that we'll not let drivers hit the skb recycling path with
      IRQs disabled.
      
      As a side effect, this patch actually disables skb recycling for some
      [broken] drivers. E.g. the gianfar driver grabs an irqsave spinlock during
      TX ring processing and then tries to recycle an skb, which caused the
      following badness:
      
      nf_conntrack version 0.5.0 (1008 buckets, 4032 max)
      ------------[ cut here ]------------
      Badness at kernel/softirq.c:143
      NIP: c003e3c4 LR: c423a528 CTR: c003e344
      ...
      NIP [c003e3c4] local_bh_enable+0x80/0xc4
      LR [c423a528] destroy_conntrack+0xd4/0x13c [nf_conntrack]
      Call Trace:
      [c15d1b60] [c003e32c] local_bh_disable+0x1c/0x34 (unreliable)
      [c15d1b70] [c423a528] destroy_conntrack+0xd4/0x13c [nf_conntrack]
      [c15d1b80] [c02c6370] nf_conntrack_destroy+0x3c/0x70
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e84af6dd
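      A hedged sketch of the added guard with toy types and a stub predicate
      (the real check is irqs_disabled() at the top of skb_recycle_check();
      the remaining tests shown here are simplified illustrations): recycling
      is refused outright when IRQs are off.

          #include <stdbool.h>

          struct toy_skb { bool cloned; unsigned int truesize; };

          static bool toy_irqs_disabled(void) { return false; }  /* stub for the kernel test */

          static bool toy_skb_recycle_check(const struct toy_skb *skb, unsigned int skb_size)
          {
              if (toy_irqs_disabled())
                  return false;   /* recycling may itself need IRQs enabled: refuse */
              if (skb->cloned)
                  return false;   /* shared data cannot be handed back to the pool  */
              return skb->truesize >= skb_size;   /* simplified size check          */
          }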
  32. 25 Jul, 2009 1 commit
  33. 18 Jun, 2009 2 commits