1. 04 Nov, 2011 1 commit
    • net: Add back alignment for size for __alloc_skb · bc417e30
      Tony Lindgren committed
      Commit 87fb4b7b (net: more accurate skb truesize) changed the
      alignment of size. This can cause problems at least on some
      machines with NFS root:
      
      Unhandled fault: alignment exception (0x801) at 0xc183a43a
      Internal error: : 801 [#1] PREEMPT
      Modules linked in:
      CPU: 0    Not tainted  (3.1.0-08784-g5eeee4a #733)
      pc : [<c02fbba0>]    lr : [<c02fbb9c>]    psr: 60000013
      sp : c180fef8  ip : 00000000  fp : c181f580
      r10: 00000000  r9 : c044b28c  r8 : 00000001
      r7 : c183a3a0  r6 : c1835be0  r5 : c183a412  r4 : 000001f2
      r3 : 00000000  r2 : 00000000  r1 : ffffffe6  r0 : c183a43a
      Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment kernel
      Control: 0005317f  Table: 10004000  DAC: 00000017
      Process swapper (pid: 1, stack limit = 0xc180e270)
      Stack: (0xc180fef8 to 0xc1810000)
      fee0:                                                       00000024 00000000
      ff00: 00000000 c183b9c0 c183b8e0 c044b28c c0507ccc c019dfc4 c180ff2c c0503cf8
      ff20: c180ff4c c180ff4c 00000000 c1835420 c182c740 c18349c0 c05233c0 00000000
      ff40: 00000000 c00e6bb8 c180e000 00000000 c04dd82c c0507e7c c050cc18 c183b9c0
      ff60: c05233c0 00000000 00000000 c01f34f4 c0430d70 c019d364 c04dd898 c04dd898
      ff80: c04dd82c c0507e7c c180e000 00000000 c04c584c c01f4918 c04dd898 c04dd82c
      ffa0: c04ddd28 c180e000 00000000 c0008758 c181fa60 3231d82c 00000037 00000000
      ffc0: 00000000 c04dd898 c04dd82c c04ddd28 00000013 00000000 00000000 00000000
      ffe0: 00000000 c04b2224 00000000 c04b21a0 c001056c c001056c 00000000 00000000
      Function entered at [<c02fbba0>] from [<c019dfc4>]
      Function entered at [<c019dfc4>] from [<c01f34f4>]
      Function entered at [<c01f34f4>] from [<c01f4918>]
      Function entered at [<c01f4918>] from [<c0008758>]
      Function entered at [<c0008758>] from [<c04b2224>]
      Function entered at [<c04b2224>] from [<c001056c>]
      Code: e1a00005 e3a01028 ebfa7cb0 e35a0000 (e5858028)
      
      Here PC is at __alloc_skb and &shinfo->dataref is unaligned because
      skb->end can be unaligned without this patch.
      
      As explained by Eric Dumazet <eric.dumazet@gmail.com>, this happens
      only with SLOB, and not with SLAB or SLUB:
      
      * Eric Dumazet <eric.dumazet@gmail.com> [111102 15:56]:
      >
      > Your patch is absolutely needed, I completely forgot about SLOB :(
      >
      > since, kmalloc(386) on SLOB gives exactly ksize=386 bytes, not nearest
      > power of two.
      >
      > [   60.305763] malloc(size=385)->ffff880112c11e38 ksize=386 -> nsize=2
      > [   60.305921] malloc(size=385)->ffff88007c92ce28 ksize=386 -> nsize=2
      > [   60.306898] malloc(size=656)->ffff88007c44ad28 ksize=656 -> nsize=272
      > [   60.325385] malloc(size=656)->ffff88007c575868 ksize=656 -> nsize=272
      > [   60.325531] malloc(size=656)->ffff88011c777230 ksize=656 -> nsize=272
      > [   60.325701] malloc(size=656)->ffff880114011008 ksize=656 -> nsize=272
      > [   60.346716] malloc(size=385)->ffff880114142008 ksize=386 -> nsize=2
      > [   60.346900] malloc(size=385)->ffff88011c777690 ksize=386 -> nsize=2
      Signed-off-by: Tony Lindgren <tony@atomide.com>
      Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
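The effect of the restored alignment can be checked with a little arithmetic. The sketch below is a Python model, not kernel code. It assumes 64-byte cache lines and takes the aligned skb_shared_info size of 384 bytes from Eric's log (386 - 2 and 656 - 272 both give 384). slob_ksize() is a hypothetical stand-in that rounds to 2 bytes, chosen only because it reproduces the ksize values in that log.

```python
SMP_CACHE_BYTES = 64   # assumption: 64-byte cache lines
SHINFO_ALIGNED = 384   # inferred from the log: 386 - 2 == 656 - 272 == 384


def skb_data_align(n):
    """Round n up to a multiple of the cache line size."""
    return (n + SMP_CACHE_BYTES - 1) & ~(SMP_CACHE_BYTES - 1)


def slob_ksize(request):
    """Hypothetical SLOB model: usable size is the request rounded up to 2 bytes."""
    return (request + 1) & ~1


def end_offset(size, align_first):
    """Offset of skb_shared_info from skb->head when it sits at the end of the block."""
    if align_first:                      # the alignment this patch adds back
        size = skb_data_align(size)
    ksize = slob_ksize(size + SHINFO_ALIGNED)
    return ksize - SHINFO_ALIGNED


for size in (1, 272):
    print(size, end_offset(size, False), end_offset(size, True))
```

Under these assumptions a 1-byte request puts skb_shared_info at offset 2 without the alignment (the nsize=2 case in the log) and at offset 64 with it.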
  2. 21 Oct, 2011 1 commit
  3. 20 Oct, 2011 1 commit
  4. 19 Oct, 2011 1 commit
  5. 14 Oct, 2011 1 commit
    • net: more accurate skb truesize · 87fb4b7b
      Eric Dumazet committed
      skb truesize currently accounts for sk_buff struct and part of skb head.
      kmalloc() roundings are also ignored.
      
      Considering that skb_shared_info is larger than sk_buff, it's time to
      take it into account for better memory accounting.
      
      This patch introduces SKB_TRUESIZE(X) macro to centralize various
      assumptions into a single place.
      
      At skb alloc phase, we put skb_shared_info struct at the exact end of
      skb head, to allow a better use of memory (lowering number of
      reallocations), since kmalloc() gives us power-of-two memory blocks.
      
      Unless SLAB/SLUB debug is active, both skb->head and skb_shared_info are
      aligned to cache lines, as before.
      
      Note: This patch might trigger performance regressions because of
      misconfigured protocol stacks, hitting per socket or global memory
      limits that were previously not reached. But it's a necessary step for
      more accurate memory accounting.
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      CC: Andi Kleen <ak@linux.intel.com>
      CC: Ben Hutchings <bhutchings@solarflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
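The accounting that SKB_TRUESIZE(X) centralizes can be sketched as the data size plus the cache-line-aligned sizes of the two structures. This is a Python model, not the kernel macro itself; the 64-byte cache line and both struct sizes are illustrative assumptions that vary by architecture and configuration.

```python
SMP_CACHE_BYTES = 64  # assumption: 64-byte cache lines


def skb_data_align(n):
    """Round n up to a multiple of the cache line size."""
    return (n + SMP_CACHE_BYTES - 1) & ~(SMP_CACHE_BYTES - 1)


def skb_truesize(x, sizeof_sk_buff, sizeof_shinfo):
    """Memory charged for an skb carrying x bytes of head data."""
    return x + skb_data_align(sizeof_sk_buff) + skb_data_align(sizeof_shinfo)


# Hypothetical struct sizes, picked only for illustration.
print(skb_truesize(2048, sizeof_sk_buff=240, sizeof_shinfo=328))
```

With these made-up sizes the 2048-byte buffer is charged 2048 + 256 + 384 = 2688 bytes, which is why sockets may hit memory limits sooner after this change.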
  6. 16 Sep, 2011 1 commit
    • net: copy userspace buffers on device forwarding · 48c83012
      Michael S. Tsirkin committed
      dev_forward_skb loops an skb back into host networking
      stack which might hang on the memory indefinitely.
      In particular, this can happen in macvtap in bridged mode.
      Copy the userspace fragments to avoid blocking the
      sender in that case.
      
      As this patch makes skb_copy_ubufs extern now,
      I also added some documentation and made it clear
      the SKBTX_DEV_ZEROCOPY flag automatically instead
      of doing it in all callers. This can be made into a separate
      patch if people feel it's worth it.
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  7. 25 Aug, 2011 1 commit
  8. 21 Aug, 2011 1 commit
  9. 18 Aug, 2011 1 commit
  10. 02 Aug, 2011 1 commit
  11. 22 Jul, 2011 1 commit
  12. 09 Jul, 2011 1 commit
  13. 07 Jul, 2011 1 commit
    • skbuff: skb supports zero-copy buffers · a6686f2f
      Shirley Ma committed
      This patch adds userspace buffers support in skb shared info. A new
      struct skb_ubuf_info is needed to maintain the userspace buffers
      argument and index, a callback is used to notify userspace to release
      the buffers once lower device has done DMA (Last reference to that skb
      has gone).
      
      If any userspace apps reference these userspace buffers, the buffers
      will be copied into the kernel. This way we can prevent userspace apps
      from holding these userspace buffers too long.
      
      Use destructor_arg to point to the userspace buffer info; a new tx flag,
      SKBTX_DEV_ZEROCOPY, is added for the zero-copy buffer check.
      Signed-off-by: Shirley Ma <xma@...ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
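The release notification described in this commit can be pictured with a toy reference-counting model. Everything below is a hypothetical Python sketch: the class and function names are invented for illustration, and only the ideas (an info object reached through destructor_arg, a zero-copy flag, a callback fired when the last reference goes away) come from the commit message.

```python
class UbufInfo:
    """Hypothetical stand-in for the buffer info that destructor_arg points to."""

    def __init__(self, callback, arg, desc):
        self.callback = callback  # notifies the buffer owner
        self.arg = arg            # opaque argument for the owner
        self.desc = desc          # index of the userspace buffer


class Skb:
    def __init__(self, ubuf):
        self.users = 1            # reference count
        self.zerocopy = True      # models SKBTX_DEV_ZEROCOPY in tx_flags
        self.destructor_arg = ubuf


def skb_hold(skb):
    """Take an extra reference."""
    skb.users += 1


def skb_release(skb):
    """Drop a reference; notify the owner once the last one is gone."""
    skb.users -= 1
    if skb.users == 0 and skb.zerocopy:
        ubuf = skb.destructor_arg
        ubuf.callback(ubuf)


released = []
skb = Skb(UbufInfo(lambda u: released.append(u.desc), arg=None, desc=7))
skb_hold(skb)      # e.g. the device still owns a reference during DMA
skb_release(skb)   # one reference dropped, the buffer must stay pinned
skb_release(skb)   # last reference gone, the owner is notified
print(released)
```

The callback runs exactly once, after the second release, which is the point at which userspace may reuse buffer 7.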
  14. 21 May, 2011 1 commit
    • sanitize <linux/prefetch.h> usage · 268bb0ce
      Linus Torvalds committed
      Commit e66eed65 ("list: remove prefetching from regular list
      iterators") removed the include of prefetch.h from list.h, which
      uncovered several cases that had apparently relied on that rather
      obscure header file dependency.
      
      So this fixes things up a bit, using
      
         grep -L linux/prefetch.h $(git grep -l '[^a-z_]prefetchw*(' -- '*.[ch]')
         grep -L 'prefetchw*(' $(git grep -l 'linux/prefetch.h' -- '*.[ch]')
      
      to guide us in finding files that either need <linux/prefetch.h>
      inclusion, or have it despite not needing it.
      
      There are more of them around (mostly network drivers), but this gets
      many core ones.
      Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
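The two grep pipelines amount to a two-way consistency check between "calls prefetch()" and "includes the header". A rough Python equivalent over in-memory file contents is sketched below; the file names and contents are hypothetical, and the regular expressions simply mirror the grep patterns quoted above.

```python
import re

CALLS_STRICT = re.compile(r"[^a-z_]prefetchw*\(")  # pattern of the first pipeline
CALLS_LOOSE = re.compile(r"prefetchw*\(")          # pattern of the second pipeline
HEADER = "linux/prefetch.h"


def audit(files):
    """files maps a path to its source text.

    Returns (missing_include, unneeded_include) as sorted lists of paths.
    """
    missing, unneeded = [], []
    for path, text in sorted(files.items()):
        has_header = HEADER in text
        if CALLS_STRICT.search(text) and not has_header:
            missing.append(path)
        if has_header and not CALLS_LOOSE.search(text):
            unneeded.append(path)
    return missing, unneeded


# Hypothetical sources, for illustration only.
files = {
    "a.c": "#include <linux/list.h>\nvoid f(void *p) { prefetch(p); }\n",
    "b.c": "#include <linux/prefetch.h>\nint x;\n",
    "c.c": "#include <linux/prefetch.h>\nvoid g(void *p) { prefetchw(p); }\n",
}
print(audit(files))
```

Here a.c would need the include added, b.c carries it without using it, and c.c is already consistent.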
  15. 18 May, 2011 1 commit
    • net: add skb_dst_force() in sock_queue_err_skb() · abb57ea4
      Eric Dumazet committed
      Commit 7fee226a (add a noref bit on skb dst) forgot to use
      skb_dst_force() on packets queued in sk_error_queue.
      
      This triggers the following warning for applications that use
      IP_CMSG_PKTINFO and receive an error status:
      
      ------------[ cut here ]------------
      WARNING: at include/linux/skbuff.h:457 ip_cmsg_recv_pktinfo+0xa6/0xb0()
      Hardware name: 2669UYD
      Modules linked in: isofs vboxnetadp vboxnetflt nfsd ebtable_nat ebtables
      lib80211_crypt_ccmp uinput xcbc hdaps tp_smapi thinkpad_ec radeonfb fb_ddc
      radeon ttm drm_kms_helper drm ipw2200 intel_agp intel_gtt libipw i2c_algo_bit
      i2c_i801 agpgart rng_core cfbfillrect cfbcopyarea cfbimgblt video raid10 raid1
      raid0 linear md_mod vboxdrv
      Pid: 4697, comm: miredo Not tainted 2.6.39-rc6-00569-g5895198c-dirty #22
      Call Trace:
       [<c17746b6>] ? printk+0x1d/0x1f
       [<c1058302>] warn_slowpath_common+0x72/0xa0
       [<c15bbca6>] ? ip_cmsg_recv_pktinfo+0xa6/0xb0
       [<c15bbca6>] ? ip_cmsg_recv_pktinfo+0xa6/0xb0
       [<c1058350>] warn_slowpath_null+0x20/0x30
       [<c15bbca6>] ip_cmsg_recv_pktinfo+0xa6/0xb0
       [<c15bbdd7>] ip_cmsg_recv+0x127/0x260
       [<c154f82d>] ? skb_dequeue+0x4d/0x70
       [<c1555523>] ? skb_copy_datagram_iovec+0x53/0x300
       [<c178e834>] ? sub_preempt_count+0x24/0x50
       [<c15bdd2d>] ip_recv_error+0x23d/0x270
       [<c15de554>] udp_recvmsg+0x264/0x2b0
       [<c15ea659>] inet_recvmsg+0xd9/0x130
       [<c1547752>] sock_recvmsg+0xf2/0x120
       [<c11179cb>] ? might_fault+0x4b/0xa0
       [<c15546bc>] ? verify_iovec+0x4c/0xc0
       [<c1547660>] ? sock_recvmsg_nosec+0x100/0x100
       [<c1548294>] __sys_recvmsg+0x114/0x1e0
       [<c1093895>] ? __lock_acquire+0x365/0x780
       [<c1148b66>] ? fget_light+0xa6/0x3e0
       [<c1148b7f>] ? fget_light+0xbf/0x3e0
       [<c1148aee>] ? fget_light+0x2e/0x3e0
       [<c1549f29>] sys_recvmsg+0x39/0x60
      
      Close bug https://bugzilla.kernel.org/show_bug.cgi?id=34622
      Reported-by: Witold Baryluk <baryluk@smp.if.uj.edu.pl>
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      CC: Stephen Hemminger <shemminger@vyatta.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
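Why a queued packet needs skb_dst_force() can be illustrated with a toy model of the noref bit. This is a hypothetical Python sketch, not the kernel implementation: a noref dst is borrowed without a reference and is only safe to use inside the RCU read-side section that produced it, so anything parked on a queue has to take a real reference first.

```python
class Dst:
    def __init__(self):
        self.refcnt = 1


class Skb:
    def __init__(self):
        self.dst = None
        self.noref = False


def skb_dst_set_noref(skb, dst):
    """Attach dst without taking a reference (valid only under RCU)."""
    skb.dst = dst
    skb.noref = True


def skb_dst_force(skb):
    """Turn a borrowed dst into a properly referenced one."""
    if skb.noref:
        skb.dst.refcnt += 1
        skb.noref = False


def skb_dst(skb, rcu_read_lock_held):
    """Return (dst, warned); warned models the WARNING in the trace above."""
    warned = skb.noref and not rcu_read_lock_held
    return skb.dst, warned


def queue_err_skb(queue, skb, force=True):
    if force:                      # the call this patch adds
        skb_dst_force(skb)
    queue.append(skb)


dst = Dst()
fixed, broken = Skb(), Skb()
skb_dst_set_noref(fixed, dst)
skb_dst_set_noref(broken, dst)
queue = []
queue_err_skb(queue, fixed)
queue_err_skb(queue, broken, force=False)
# Later, recvmsg() reads the dst outside the original RCU section.
print(skb_dst(fixed, False)[1], skb_dst(broken, False)[1])
```

Only the skb queued without the forced reference trips the check when it is read back later.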
  16. 31 Mar, 2011 1 commit
  17. 17 Mar, 2011 1 commit
  18. 02 Mar, 2011 1 commit
  19. 28 Jan, 2011 1 commit
  20. 25 Jan, 2011 2 commits
    • net: change netdev->features to u32 · 04ed3e74
      Michał Mirosław committed
      Quoting Ben Hutchings: we presumably won't be defining features that
      can only be enabled on 64-bit architectures.
      
      Occurrences found by `grep -r` on net/, drivers/net, include/
      
      [ Move features and vlan_features next to each other in
        struct netdev, as per Eric Dumazet's suggestion -DaveM ]
      Signed-off-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • GRO: fix merging a paged skb after non-paged skbs · d1dc7abf
      Michal Schmidt committed
      Suppose that several linear skbs of the same flow were received by GRO. They
      were thus merged into one skb with a frag_list. Then a new skb of the same flow
      arrives, but it is a paged skb with data starting in its frags[].
      
      Before adding the skb to the frag_list skb_gro_receive() will of course adjust
      the skb to throw away the headers. It correctly modifies the page_offset and
      size of the frag, but it leaves incorrect information in the skb:
       ->data_len is not decreased at all.
       ->len is decreased only by headlen, as if no change were done to the frag.
      Later in a receiving process this causes skb_copy_datagram_iovec() to return
      -EFAULT and this is seen in userspace as the result of the recv() syscall.
      
      In practice the bug can be reproduced with the sfc driver. By default the
      driver uses an adaptive scheme when it switches between using
      napi_gro_receive() (with skbs) and napi_gro_frags() (with pages). The bug is
      reproduced when under rx load with enough successful GRO merging the driver
      decides to switch from the former to the latter.
      
      Manual control is also possible, so reproducing this is easy with netcat:
       - on machine1 (with sfc): nc -l 12345 > /dev/null
       - on machine2: nc machine1 12345 < /dev/zero
       - on machine1:
         echo 1 > /sys/module/sfc/parameters/rx_alloc_method  # use skbs
         echo 2 > /sys/module/sfc/parameters/rx_alloc_method  # use pages
       - See that nc has quit suddenly.
      
      [v2: Modified by Eric Dumazet to avoid advancing skb->data past the end
           and to use a temporary variable.]
      Signed-off-by: Michal Schmidt <mschmidt@redhat.com>
      Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
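The bookkeeping error in the GRO fix can be reproduced with a small model. The Python sketch below is an illustration under assumptions, not the kernel code: an skb is a dict, headlen is tracked explicitly (the kernel derives it as len - data_len), and the 54-byte header size is made up. The `fixed=False` path mimics the behaviour described in the commit message, where the frag is adjusted but len and data_len are not.

```python
def make_paged_skb(payload):
    """A paged skb: no linear data, everything in frags[0]."""
    return {
        "len": payload,
        "data_len": payload,
        "headlen": 0,
        "frags": [{"page_offset": 0, "size": payload}],
    }


def pull_headers(skb, offset, fixed=True):
    """Throw away `offset` header bytes before chaining skb onto a frag_list."""
    headlen = skb["headlen"]
    if offset > headlen:
        eat = offset - headlen            # header bytes that live in frags[0]
        frag = skb["frags"][0]
        frag["page_offset"] += eat
        frag["size"] -= eat
        if fixed:                         # the accounting the bug left out
            skb["data_len"] -= eat
            skb["len"] -= eat
        offset = headlen
    skb["len"] -= offset                  # pull what sits in the linear area
    skb["headlen"] -= offset


def consistent(skb):
    """len must equal linear bytes plus paged bytes, and frags must add up."""
    paged = sum(f["size"] for f in skb["frags"])
    return skb["data_len"] == paged and skb["len"] == skb["headlen"] + paged


good, bad = make_paged_skb(1500), make_paged_skb(1500)
pull_headers(good, 54, fixed=True)
pull_headers(bad, 54, fixed=False)
print(consistent(good), consistent(bad))
```

In the unfixed case len and data_len still claim 1500 bytes while the frag holds 1446, the kind of mismatch that later makes the copy to userspace fail.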
  21. 13 Jan, 2011 1 commit
  22. 17 Dec, 2010 1 commit
  23. 16 Dec, 2010 1 commit
  24. 04 Dec, 2010 1 commit
  25. 17 Oct, 2010 1 commit
    • net: allocate skbs on local node · 564824b0
      Eric Dumazet committed
      commit b30973f8 (node-aware skb allocation) spread a wrong habit of
      allocating net drivers skbs on a given memory node : The one closest to
      the NIC hardware. This is wrong because as soon as we try to scale
      network stack, we need to use many cpus to handle traffic and hit
      slub/slab management on cross-node allocations/frees when these cpus
      have to alloc/free skbs bound to a central node.
      
      skb allocated in RX path are ephemeral, they have a very short
      lifetime : Extra cost to maintain NUMA affinity is too expensive. What
      appeared as a nice idea four years ago is in fact a bad one.
      
      In 2010, NIC hardware is multiqueue, or we use RPS to spread the load,
      and two 10Gb NICs might deliver more than 28 million packets per second,
      needing all the available cpus.
      
      Cost of cross-node handling in network and vm stacks outweighs the
      small benefit hardware had when doing its DMA transfer in its 'local'
      memory node at RX time. Even trying to differentiate the two allocations
      done for one skb (the sk_buff on local node, the data part on NIC
      hardware node) is not enough to bring good performance.
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Acked-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  26. 09 Sep, 2010 1 commit
  27. 07 Sep, 2010 2 commits
    • skb: Add tracepoints to freeing skb · 07dc22e7
      Koki Sanagi committed
      This patch adds a tracepoint to consume_skb and adds trace_kfree_skb
      before __kfree_skb in skb_free_datagram_locked and net_tx_action.
      Combined with the tracepoint on dev_hard_start_xmit, we can check
      how long it takes to free transmitted packets, and from that we can
      calculate how many packets the driver held at that time. This is
      useful when a drop of transmitted packets is a problem.
      
                  sshd-6828  [000] 112689.258154: consume_skb: skbaddr=f2d99bb8
      Signed-off-by: Koki Sanagi <sanagi.koki@jp.fujitsu.com>
      Acked-by: David S. Miller <davem@davemloft.net>
      Acked-by: Neil Horman <nhorman@tuxdriver.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Kaneshige Kenji <kaneshige.kenji@jp.fujitsu.com>
      Cc: Izumo Taku <izumi.taku@jp.fujitsu.com>
      Cc: Kosaki Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Scott Mcmillan <scott.a.mcmillan@intel.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      LKML-Reference: <4C724364.50903@jp.fujitsu.com>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
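The measurement this commit describes, pairing a transmit event with the later free of the same skb, could be scripted roughly as follows. This Python sketch rests on assumptions: the consume_skb line is the one quoted above, while the net_dev_xmit line (its event name, fields and timestamp) is made up for illustration and may not match the real trace output.

```python
import re

# timestamp "seconds.microseconds", event name, then an skbaddr= field
EVENT = re.compile(r"(\d+)\.(\d{6}): (\w+): .*?skbaddr=([0-9a-f]+)")


def free_latencies(lines):
    """Return [(skbaddr, microseconds)] from transmit to free for each skb."""
    pending, result = {}, []
    for line in lines:
        m = EVENT.search(line)
        if not m:
            continue
        usec = int(m.group(1)) * 1_000_000 + int(m.group(2))
        event, addr = m.group(3), m.group(4)
        if event == "net_dev_xmit":          # assumed name of the xmit event
            pending[addr] = usec
        elif event in ("consume_skb", "kfree_skb") and addr in pending:
            result.append((addr, usec - pending.pop(addr)))
    return result


trace = [
    # hypothetical transmit line, invented for this example
    "sshd-6828  [000] 112689.258100: net_dev_xmit: dev=eth0 skbaddr=f2d99bb8 len=98 rc=0",
    # line quoted in the commit message
    "sshd-6828  [000] 112689.258154: consume_skb: skbaddr=f2d99bb8",
]
print(free_latencies(trace))
```

The number of entries still in `pending` at any point approximates how many packets the driver is holding, which is the second quantity the commit message mentions.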
    • net: pskb_expand_head() optimization · 1fd63041
      Eric Dumazet committed
      pskb_expand_head() blindly takes references on fragments before calling
      skb_release_data(), potentially releasing these references.
      
      We can add a fast path, avoiding these atomic operations, if we own the
      last reference on skb->head.
      
      Based on a previous patch from David
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  28. 04 Sep, 2010 1 commit
  29. 02 Sep, 2010 2 commits
    • gro: fix different skb headrooms · 3d3be433
      Eric Dumazet committed
      Packets entering GRO might have different headrooms, even for a given
      flow (because of implementation details in drivers, like copybreak).
      We can't force drivers to deliver packets with a fixed headroom.
      
      1) fix skb_segment()
      
      skb_segment() wrongly assumes that the headrooms of fragments are the
      same as that of the head. When CHECKSUM_PARTIAL is used, this can give
      csum_start errors, and crash later in skb_copy_and_csum_dev().
      
      2) allocate a minimal skb for head of frag_list
      
      skb_gro_receive() uses netdev_alloc_skb(headroom + skb_gro_offset(p)) to
      allocate a fresh skb. This adds NET_SKB_PAD to a padding already
      provided by netdevice, depending on various things, like copybreak.
      
      Use alloc_skb() to allocate an exact padding, to reduce cache line
      needs:
      NET_SKB_PAD + NET_IP_ALIGN
      
      bugzilla : https://bugzilla.kernel.org/show_bug.cgi?id=16626
      
      Many thanks to Plamen Petrov, testing many debugging patches !
      With help of Jarek Poplawski.
      Reported-by: Plamen Petrov <pvp-lsts@fs.uni-ruse.bg>
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      CC: Jarek Poplawski <jarkao2@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: skbuff.c cleanup · 6602cebb
      Eric Dumazet committed
      (skb->data - skb->head) can be changed by skb_headroom(skb)
      
      Remove some uses of NET_SKBUFF_DATA_USES_OFFSET, using
      (skb_end_pointer(skb) - skb->head) or
      (skb_tail_pointer(skb) - skb->head) : compiler does the right thing,
      and this is more readable for us ;)
      
      (struct skb_shared_info *) casts in pskb_expand_head() to help memcpy()
      to use aligned moves.
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  30. 23 Aug, 2010 1 commit
  31. 19 Aug, 2010 1 commit
  32. 25 Jul, 2010 1 commit
  33. 23 Jul, 2010 2 commits
  34. 13 Jul, 2010 1 commit
  35. 14 Jun, 2010 2 commits