1. 23 Jul, 2010 1 commit
    • net: Fix corruption of skb csum field in pskb_expand_head() of net/core/skbuff.c · 00c5a983
      Andrea Shepard authored
      Make pskb_expand_head() check ip_summed to make sure csum_start is really
      csum_start and not csum before adjusting it.
      
      This fixes a bug I encountered using a Sun Quad-Fast Ethernet card and VLANs.
      On my configuration, the sunhme driver produces skbs with differing amounts
      of headroom on receive depending on the packet size.  See line 2030 of
      drivers/net/sunhme.c; packets smaller than RX_COPY_THRESHOLD have 52 bytes
      of headroom but packets larger than that cutoff have only 20 bytes.
      
      When these packets reach the VLAN driver, vlan_check_reorder_header()
      calls skb_cow(), which, if the packet has less than NET_SKB_PAD (== 32) bytes
      of headroom, uses pskb_expand_head() to make more.
      
      Then, pskb_expand_head() needs to adjust a lot of offsets into the skb,
      including csum_start.  Since csum_start is a union with csum, if the packet
      has a valid csum value this will corrupt it, which was the effect I observed.
      The sunhme hardware computes receive checksums, so the skbs would be created
      by the driver with ip_summed == CHECKSUM_COMPLETE and a valid csum field, and
      then pskb_expand_head() would corrupt the csum field, leading to an "hw csum
      error" message later on, for example in icmp_rcv() for pings larger than the
      sunhme RX_COPY_THRESHOLD.
      
      On the basis of the comment at the beginning of include/linux/skbuff.h,
      I believe that the csum_start skb field is only meaningful if ip_summed is
      CHECKSUM_PARTIAL, so this patch makes pskb_expand_head() adjust it only in that
      case to avoid corrupting a valid csum value.
      
      For more details, please see my in-depth discussion of tracking down
      this bug:
      
      http://puellavulnerata.livejournal.com/112186.html
      http://puellavulnerata.livejournal.com/112567.html
      http://puellavulnerata.livejournal.com/112891.html
      http://puellavulnerata.livejournal.com/113096.html
      http://puellavulnerata.livejournal.com/113591.html
      
      I am not subscribed to this list, so please CC me on replies.
      Signed-off-by: Andrea Shepard <andrea@persephoneslair.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
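The fix described in this commit can be illustrated with a minimal userspace sketch. The `mini_skb` type and `adjust_csum_start()` helper are hypothetical stand-ins, not the kernel's definitions; only the idea that `csum` shares storage with `csum_start` and that the adjustment is guarded by `ip_summed == CHECKSUM_PARTIAL` comes from the commit message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for the kernel's ip_summed states. */
enum { CHECKSUM_NONE, CHECKSUM_UNNECESSARY, CHECKSUM_COMPLETE, CHECKSUM_PARTIAL };

/* Hypothetical, trimmed-down model of the relevant sk_buff fields. */
struct mini_skb {
    union {
        uint32_t csum;               /* meaningful for CHECKSUM_COMPLETE */
        struct {
            uint16_t csum_start;     /* meaningful for CHECKSUM_PARTIAL */
            uint16_t csum_offset;
        };
    };
    uint8_t ip_summed;
};

/*
 * Called after the headroom has grown by nhead bytes. The offset is only
 * touched when the union really holds csum_start; otherwise a valid csum
 * value sharing the same storage would be corrupted.
 */
void adjust_csum_start(struct mini_skb *skb, unsigned int nhead)
{
    assert(skb != NULL);
    if (skb->ip_summed == CHECKSUM_PARTIAL)
        skb->csum_start = (uint16_t)(skb->csum_start + nhead);
}
```

With this guard, an skb carrying a hardware-computed checksum (CHECKSUM_COMPLETE) keeps its `csum` value intact when the headroom is expanded.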
  2. 14 Jun, 2010 2 commits
  3. 01 Jun, 2010 1 commit
  4. 29 May, 2010 2 commits
  5. 22 May, 2010 1 commit
  6. 21 May, 2010 1 commit
  7. 18 May, 2010 1 commit
    • net: add a noref bit on skb dst · 7fee226a
      Eric Dumazet authored
      Use the low order bit of skb->_skb_dst to tell that the dst is not refcounted.
      
      Change _skb_dst to _skb_refdst to make sure all uses are caught.
      
      skb_dst() returns the dst, regardless of whether the noref bit is set,
      but with a lockdep check to make sure a noref dst is not given if the
      current user is not RCU protected.
      
      New skb_dst_set_noref() helper to set a non-refcounted dst on a skb
      (with lockdep check).
      
      skb_dst_drop() drops a reference only if the skb dst was refcounted.
      
      skb_dst_force() helper is used to force a refcount on dst, when the skb
      is queued and no longer RCU protected.
      
      Use skb_dst_force() in __sk_add_backlog(), __dev_xmit_skb() if
      !IFF_XMIT_DST_RELEASE or skb enqueued on qdisc queue, in
      sock_queue_rcv_skb(), in __nf_queue().
      
      Use skb_dst_force() in dev_requeue_skb().
      
      Note: dst_use_noref() still dirties dst; we might transform it
      later to do one dirtying per jiffy.
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
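The noref scheme above is a tagged pointer: the low-order bit of the stored word marks a dst whose reference count was not taken. A minimal userspace sketch of that idea follows. All names here (`mini_skb`, `dst_sketch`, `NOREF_BIT` and the helpers) are hypothetical, and the lockdep/RCU checks mentioned in the commit are omitted.

```c
#include <assert.h>
#include <stdint.h>

#define NOREF_BIT ((uintptr_t)1)          /* low-order tag bit */

struct dst_sketch { int refcnt; };        /* stand-in for a refcounted dst */
struct mini_skb   { uintptr_t refdst; };  /* models skb->_skb_refdst */

/* Return the dst pointer regardless of whether the tag bit is set. */
struct dst_sketch *get_dst(const struct mini_skb *skb)
{
    return (struct dst_sketch *)(skb->refdst & ~NOREF_BIT);
}

/* Attach a dst without taking a reference; the tag records that fact. */
void set_dst_noref(struct mini_skb *skb, struct dst_sketch *d)
{
    assert(((uintptr_t)d & NOREF_BIT) == 0);  /* pointer must be aligned */
    skb->refdst = (uintptr_t)d | NOREF_BIT;
}

/* Before queueing: turn a borrowed dst into a properly counted one. */
void dst_force(struct mini_skb *skb)
{
    if (skb->refdst & NOREF_BIT) {
        struct dst_sketch *d = get_dst(skb);
        d->refcnt++;
        skb->refdst = (uintptr_t)d;           /* clears the tag */
    }
}

/* Drop the reference only if one was actually taken. */
void dst_drop(struct mini_skb *skb)
{
    if (skb->refdst) {
        if (!(skb->refdst & NOREF_BIT))
            get_dst(skb)->refcnt--;
        skb->refdst = 0;
    }
}
```

The design choice is that the common receive path can attach and detach a dst without touching its reference count, and only code that queues the skb beyond the protected section pays for the increment.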
  8. 05 May, 2010 1 commit
    • net: __alloc_skb() speedup · ec7d2f2c
      Eric Dumazet authored
      With the following patch I can reach the maximum rate of my
      pktgen+udpsink simulator:
      - 'old' machine: dual quad core E5450 @3.00GHz
      - 64 UDP rx flows (only differ by destination port)
      - RPS enabled, NIC interrupts serviced on cpu0
      - rps dispatched on 7 other cores (~130,000 IPI per second)
      - SLAB allocator (faster than SLUB in this workload)
      - tg3 NIC
      - 1,080,000 pps without a single drop at NIC level.
      
      The idea is to add two prefetchw() calls in __alloc_skb(), one to prefetch
      the first sk_buff cache line, the second to prefetch the shinfo part.
      
      Also use a single memset() to initialize all skb_shared_info fields instead
      of setting them one by one, reducing the number of instructions by using
      long word moves.
      
      All skb_shared_info fields before 'dataref' are cleared in
      __alloc_skb().
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
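The single-memset() initialization described above relies on field ordering: everything placed before `dataref` is zeroed in one call. A rough userspace sketch, assuming a hypothetical `mini_shinfo` layout (the field list is illustrative, not the kernel's struct) and leaving out the prefetchw() calls:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical layout: fields that must start at zero come first. */
struct mini_shinfo {
    unsigned short nr_frags;
    unsigned short gso_size;
    unsigned short gso_segs;
    unsigned short gso_type;
    void          *frag_list;
    int            dataref;        /* first field NOT covered by the memset */
    unsigned char  frags_area[16]; /* left untouched by the initializer */
};

void shinfo_init(struct mini_shinfo *shinfo)
{
    assert(shinfo != NULL);
    /* One memset clears every field that precedes 'dataref'. */
    memset(shinfo, 0, offsetof(struct mini_shinfo, dataref));
    shinfo->dataref = 1;
}
```

Because the cleared region is one contiguous range, the compiler can emit wide stores instead of one store per field, and the trailing area is never written.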
  9. 02 May, 2010 1 commit
  10. 21 Apr, 2010 1 commit
  11. 17 Mar, 2010 1 commit
    • rps: Receive Packet Steering · 0a9627f2
      Tom Herbert authored
      This patch implements software receive side packet steering (RPS).  RPS
      distributes the load of received packet processing across multiple CPUs.
      
      Problem statement: Protocol processing done in the NAPI context for received
      packets is serialized per device queue and becomes a bottleneck under high
      packet load.  This substantially limits pps that can be achieved on a single
      queue NIC and provides no scaling with multiple cores.
      
      This solution queues packets early on in the receive path on the backlog queues
      of other CPUs.   This allows protocol processing (e.g. IP and TCP) to be
      performed on packets in parallel.   For each device (or each receive queue in
      a multi-queue device) a mask of CPUs is set to indicate the CPUs that can
      process packets. A CPU is selected on a per packet basis by hashing contents
      of the packet header (e.g. the TCP or UDP 4-tuple) and using the result to index
      into the CPU mask.  The IPI mechanism is used to raise networking receive
      softirqs between CPUs.  This effectively emulates in software what a multi-queue
      NIC can provide, but is generic requiring no device support.
      
      Many devices now provide a hash over the 4-tuple on a per packet basis
      (e.g. the Toeplitz hash).  This patch allows drivers to set the HW reported hash
      in an skb field, and that value in turn is used to index into the RPS maps.
      Using the HW generated hash can avoid cache misses on the packet when
      steering it to a remote CPU.
      
      The CPU mask is set on a per device and per queue basis in the sysfs variable
      /sys/class/net/<device>/queues/rx-<n>/rps_cpus.  This is a set of canonical
      bit maps for receive queues in the device (numbered by <n>).  If a device
      does not support multi-queue, a single variable is used for the device (rx-0).
      
      Generally, we have found this technique increases pps capabilities of a single
      queue device with good CPU utilization.  Optimal settings for the CPU mask
      seem to depend on architectures and cache hierarchy.  Below are some results
      running 500 instances of netperf TCP_RR test with 1 byte req. and resp.
      Results show cumulative transaction rate and system CPU utilization.
      
      e1000e on 8 core Intel
         Without RPS: 108K tps at 33% CPU
         With RPS:    311K tps at 64% CPU
      
      forcedeth on 16 core AMD
         Without RPS: 156K tps at 15% CPU
         With RPS:    404K tps at 49% CPU
      
      bnx2x on 16 core AMD
         Without RPS: 567K tps at 61% CPU (4 HW RX queues)
         Without RPS: 738K tps at 96% CPU (8 HW RX queues)
         With RPS:    854K tps at 76% CPU (4 HW RX queues)
      
      Caveats:
      - The benefits of this patch are dependent on architecture and cache hierarchy.
      Tuning the masks to get best performance is probably necessary.
      - This patch adds overhead in the path for processing a single packet.  In
      a lightly loaded server this overhead may eliminate the advantages of
      increased parallelism, and possibly cause some relative performance degradation.
      We have found that masks that are cache aware (share same caches with
      the interrupting CPU) mitigate much of this.
      - The RPS masks can be changed dynamically; however, whenever a mask is changed
      this introduces the possibility of generating out-of-order packets.  It's
      probably best not to change the masks too frequently.
      Signed-off-by: Tom Herbert <therbert@google.com>
      
       include/linux/netdevice.h |   32 ++++-
       include/linux/skbuff.h    |    3 +
       net/core/dev.c            |  335 +++++++++++++++++++++++++++++++++++++--------
       net/core/net-sysfs.c      |  225 ++++++++++++++++++++++++++++++-
       net/core/skbuff.c         |    2 +
       5 files changed, 538 insertions(+), 59 deletions(-)
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
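The per-packet CPU selection in RPS can be sketched in userspace as follows. This is only an illustration: `flow4`, `rps_map_sketch`, `flow_hash()` and `pick_cpu()` are hypothetical names, the mixing function is a generic stand-in rather than the hash the kernel uses, and the multiply-and-shift scaling is one common way to map a 32-bit hash onto a table of `len` entries.

```c
#include <assert.h>
#include <stdint.h>

/* The TCP/UDP 4-tuple that identifies a flow. */
struct flow4 {
    uint32_t saddr, daddr;
    uint16_t sport, dport;
};

/* CPUs allowed to process packets for one receive queue. */
struct rps_map_sketch {
    unsigned int len;
    uint16_t     cpus[8];
};

/* Stand-in mixing function; any well-distributed 32-bit hash would do. */
uint32_t flow_hash(const struct flow4 *f)
{
    uint32_t h = f->saddr ^ (f->daddr * 0x9e3779b1u);
    h ^= ((uint32_t)f->sport << 16) | f->dport;
    h ^= h >> 16;
    h *= 0x85ebca6bu;
    h ^= h >> 13;
    h *= 0xc2b2ae35u;
    h ^= h >> 16;
    return h;
}

/* Scale the hash into [0, len) without a division, then look up the CPU. */
uint16_t pick_cpu(const struct rps_map_sketch *map, uint32_t hash)
{
    assert(map->len > 0 && map->len <= 8);
    return map->cpus[((uint64_t)hash * map->len) >> 32];
}
```

Because the hash depends only on the flow's 4-tuple, every packet of a flow lands on the same CPU as long as the map is unchanged, which is why changing the mask can reorder packets. A hardware-provided hash can be passed to `pick_cpu()` directly, skipping `flow_hash()`.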
  12. 16 Dec, 2009 1 commit
  13. 21 Nov, 2009 1 commit
  14. 17 Nov, 2009 1 commit
  15. 12 Nov, 2009 1 commit
    • skbuff: Do not allow skb recycling with disabled IRQs · e84af6dd
      Anton Vorontsov authored
      NAPI drivers try to recycle SKBs in their polling routine, but we
      generally don't know the context in which the polling will be called,
      and the skb recycling itself may require IRQs to be enabled.
      
      This patch adds irqs_disabled() test to the skb_recycle_check()
      routine, so that we'll not let the drivers hit the skb recycling
      path with IRQs disabled.
      
      As a side effect, this patch actually disables skb recycling for some
      [broken] drivers. E.g. gianfar driver grabs an irqsave spinlock during
      TX ring processing, and then tries to recycle an skb, and that caused
      the following badness:
      
      nf_conntrack version 0.5.0 (1008 buckets, 4032 max)
      ------------[ cut here ]------------
      Badness at kernel/softirq.c:143
      NIP: c003e3c4 LR: c423a528 CTR: c003e344
      ...
      NIP [c003e3c4] local_bh_enable+0x80/0xc4
      LR [c423a528] destroy_conntrack+0xd4/0x13c [nf_conntrack]
      Call Trace:
      [c15d1b60] [c003e32c] local_bh_disable+0x1c/0x34 (unreliable)
      [c15d1b70] [c423a528] destroy_conntrack+0xd4/0x13c [nf_conntrack]
      [c15d1b80] [c02c6370] nf_conntrack_destroy+0x3c/0x70
      Signed-off-by: David S. Miller <davem@davemloft.net>
  16. 25 Jul, 2009 1 commit
  17. 18 Jun, 2009 2 commits
  18. 15 Jun, 2009 1 commit
  19. 11 Jun, 2009 1 commit
    • mac80211: do not pass PS frames out of mac80211 again · 8f77f384
      Johannes Berg authored
      In order to handle powersave frames properly we previously needed
      to pass these out to the device queues again, and introduced
      the skb->requeue bit. This, however, adds unnecessary
      overhead because already-tried frames must be 'cleaned up', and
      this clean-up code is also buggy when software encryption
      is used.
      
      Instead of sending the frames via the master netdev queue
      again, simply put them into the pending queue. This also
      fixes a problem where frames for that particular station
      could be reordered when some were still on the software
      queues and older ones are re-injected into the software
      queue after them.
      Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
      Signed-off-by: John W. Linville <linville@tuxdriver.com>
  20. 09 Jun, 2009 1 commit
  21. 08 Jun, 2009 1 commit
    • net: Ensure partial checksum offset is inside the skb head · 5ff8dda3
      Herbert Xu authored
      On Thu, Jun 04, 2009 at 09:06:00PM +1000, Herbert Xu wrote:
      >
      > tun: Optimise handling of bogus gso->hdr_len
      >
      > As all current versions of virtio_net generate a value for the
      > header length that's too small, we should optimise this so that
      > we don't copy it twice.  This can be done by ensuring that it is
      > at least as large as the place where we'll write the checksum.
      >
      > Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      
      With this applied we can strengthen the partial checksum check:
      
      In skb_partial_csum_set we check to see if the checksum offset
      is within the packet.  However, we really should check that it
      is within the skb head as that's the only bit we can modify
      without copying.
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Acked-by: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
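The strengthened check can be sketched as follows: the 2-byte checksum field must lie within the linear head of the buffer, not merely within the packet. The `mini_skb` type and both helpers are hypothetical userspace stand-ins; the head length is modelled as the total length minus the paged (non-linear) length.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: len is the whole packet, data_len the paged part. */
struct mini_skb {
    unsigned int len;
    unsigned int data_len;
};

/* Bytes available in the linear head, the only part modifiable in place. */
unsigned int headlen(const struct mini_skb *skb)
{
    return skb->len - skb->data_len;
}

/*
 * start: offset where checksumming begins
 * off:   offset from start to the 16-bit checksum field
 * Returns true only if the checksum field fits entirely inside the head.
 */
bool partial_csum_ok(const struct mini_skb *skb, uint16_t start, uint16_t off)
{
    assert(skb->data_len <= skb->len);
    unsigned int hl = headlen(skb);
    return hl >= 2 && start <= hl && (unsigned int)start + off <= hl - 2;
}
```

A checksum location that is inside the packet but beyond the head is rejected, since writing it would require copying data out of the paged area first.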
  22. 03 Jun, 2009 1 commit
  23. 27 May, 2009 4 commits
  24. 25 May, 2009 2 commits
  25. 19 May, 2009 1 commit
  26. 07 May, 2009 1 commit
  27. 30 Apr, 2009 1 commit
  28. 15 Apr, 2009 1 commit
    • tracing/events: move trace point headers into include/trace/events · ad8d75ff
      Steven Rostedt authored
      Impact: clean up
      
      Create a sub directory in include/trace called events to keep the
      trace point headers in their own separate directory. Only headers that
      declare trace points should be defined in this directory.
      
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Neil Horman <nhorman@tuxdriver.com>
      Cc: Zhao Lei <zhaolei@cn.fujitsu.com>
      Cc: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
  29. 29 Mar, 2009 1 commit
  30. 14 Mar, 2009 1 commit
  31. 27 Feb, 2009 1 commit
  32. 18 Feb, 2009 1 commit
    • net: Kill skb_truesize_check(), it only catches false-positives. · 92a0acce
      David S. Miller authored
      A long time ago we had bugs, primarily in TCP, where we would modify
      skb->truesize (for TSO queue collapsing) in ways which would corrupt
      the socket memory accounting.
      
      skb_truesize_check() was added in order to try and catch this error
      more systematically.
      
      However this debugging check has morphed into a Frankenstein of sorts
      and these days it does nothing other than catch false-positives.
      Signed-off-by: David S. Miller <davem@davemloft.net>
  33. 16 Feb, 2009 1 commit
    • net: infrastructure for hardware time stamping · ac45f602
      Patrick Ohly authored
      The additional per-packet information (16 bytes for time stamps, 1
      byte for flags) is stored for all packets in the skb_shared_info
      struct. This implementation detail is hidden from users of that
      information via skb_* accessor functions. A separate struct (for the
      time stamps) and a separate union (for the flags) are used so that
      the additional information can be stored/copied easily outside of
      skb_shared_info.
      
      Compared to previous implementations (reusing the tstamp field
      depending on the context, optional additional structures) this
      is the simplest solution. It does not extend sk_buff itself.
      
      TX time stamping is implemented in software if the device driver
      doesn't support hardware time stamping.
      
      The new semantic for hardware/software time stamping around
      ndo_start_xmit() is based on two assumptions about existing
      network device drivers which don't support hardware time
      stamping and know nothing about it:
       - they leave the new skb_shared_tx unmodified
       - they keep the connection to the originating socket in skb->sk
         alive, i.e., don't call skb_orphan()
      
      Given that skb_shared_tx is new, the first assumption is safe.
      The second is only true for some drivers. As a result, software
      TX time stamping currently works with the bnx2 driver, but not
      with the unmodified igb driver (the two drivers this patch series
      was tested with).
      Signed-off-by: Patrick Ohly <patrick.ohly@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
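The storage layout described in this commit (16 bytes of time stamps plus 1 byte of flags) can be sketched in userspace as below. The type names, field names and the helper are illustrative assumptions, not the kernel's definitions; the sizes are the only part taken from the commit message. The `uint8_t` bit-fields and the anonymous struct rely on C11 and on common compiler behaviour (GCC, Clang).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Two 64-bit time stamps: 16 bytes in total. */
struct hwtstamps_sketch {
    int64_t hwtstamp;    /* raw hardware time stamp (assumed name) */
    int64_t syststamp;   /* hardware stamp converted to system time (assumed name) */
};

/* One byte of TX flags, accessible bit by bit or as a whole. */
union tx_flags_sketch {
    struct {
        uint8_t hardware:1,     /* hardware time stamp requested */
                software:1,     /* software time stamp requested */
                in_progress:1;  /* driver is producing a hardware stamp */
    };
    uint8_t flags;
};

_Static_assert(sizeof(struct hwtstamps_sketch) == 16, "16 bytes of time stamps");
_Static_assert(sizeof(union tx_flags_sketch) == 1, "1 byte of flags");

/* Software fallback applies when requested and no hardware stamp is pending. */
bool wants_sw_tx_tstamp(union tx_flags_sketch tx)
{
    return tx.software && !tx.in_progress;
}
```

Keeping the flags in a one-byte union makes it cheap to copy or clear them as a unit, while the named bits keep call sites readable.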