1. 16 7月, 2012 1 次提交
  2. 14 6月, 2012 1 次提交
    • E
      splice: fix racy pipe->buffers uses · 047fe360
      Eric Dumazet 提交于
      Dave Jones reported a kernel BUG at mm/slub.c:3474! triggered
      by splice_shrink_spd() called from vmsplice_to_pipe()
      
      commit 35f3d14d (pipe: add support for shrinking and growing pipes)
      added capability to adjust pipe->buffers.
      
      Problem is some paths don't hold pipe mutex and assume pipe->buffers
      doesn't change for their duration.
      
      Fix this by adding nr_pages_max field in struct splice_pipe_desc, and
      use it in place of pipe->buffers where appropriate.
      
      splice_shrink_spd() loses its struct pipe_inode_info argument.
      Reported-by: NDave Jones <davej@redhat.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Tom Herbert <therbert@google.com>
      Cc: stable <stable@vger.kernel.org> # 2.6.35
      Tested-by: NDave Jones <davej@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      047fe360
  3. 09 6月, 2012 1 次提交
  4. 20 5月, 2012 1 次提交
    • E
      net: introduce skb_try_coalesce() · bad43ca8
      Eric Dumazet 提交于
      Move tcp_try_coalesce() protocol independent part to
      skb_try_coalesce().
      
      skb_try_coalesce() can be used in IPv4 defrag and IPv6 reassembly,
      to build optimized skbs (less sk_buff, and possibly less 'headers')
      
      skb_try_coalesce() is zero copy, unless the copy can fit in destination
      header (its a rare case)
      
      kfree_skb_partial() is also moved to net/core/skbuff.c and exported,
      because IPv6 will need it in patch (ipv6: use skb coalescing in
      reassembly).
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Alexander Duyck <alexander.h.duyck@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bad43ca8
  5. 19 5月, 2012 1 次提交
    • E
      net: introduce netdev_alloc_frag() · 6f532612
      Eric Dumazet 提交于
      Fix two issues introduced in commit a1c7fff7
      ( net: netdev_alloc_skb() use build_skb() )
      
      - Must be IRQ safe (non NAPI drivers can use it)
      - Must not leak the frag if build_skb() fails to allocate sk_buff
      
      This patch introduces netdev_alloc_frag() for drivers willing to
      use build_skb() instead of __netdev_alloc_skb() variants.
      
      Factorize code so that :
      __dev_alloc_skb() is a wrapper around __netdev_alloc_skb(), and
      dev_alloc_skb() a wrapper around netdev_alloc_skb()
      
      Use __GFP_COLD flag.
      
      Almost all network drivers now benefit from skb->head_frag
      infrastructure.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6f532612
  6. 18 5月, 2012 1 次提交
  7. 17 5月, 2012 1 次提交
  8. 16 5月, 2012 1 次提交
  9. 07 5月, 2012 3 次提交
  10. 04 5月, 2012 2 次提交
  11. 03 5月, 2012 1 次提交
  12. 01 5月, 2012 3 次提交
    • E
      net: makes skb_splice_bits() aware of skb->head_frag · 1d0c0b32
      Eric Dumazet 提交于
      __skb_splice_bits() can check if skb to be spliced has its skb->head
      mapped to a page fragment, instead of a kmalloc() area.
      
      If so we can avoid a copy of the skb head and get a reference on
      underlying page.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Maciej Żenczykowski <maze@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      Cc: Ben Hutchings <bhutchings@solarflare.com>
      Cc: Matt Carlson <mcarlson@broadcom.com>
      Cc: Michael Chan <mchan@broadcom.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1d0c0b32
    • E
      net: make GRO aware of skb->head_frag · d7e8883c
      Eric Dumazet 提交于
      GRO can check if skb to be merged has its skb->head mapped to a page
      fragment, instead of a kmalloc() area.
      
      We 'upgrade' skb->head as a fragment in itself
      
      This avoids the frag_list fallback, and permits to build true GRO skb
      (one sk_buff and up to 16 fragments), using less memory.
      
      This reduces number of cache misses when user makes its copy, since a
      single sk_buff is fetched.
      
      This is a followup of patch "net: allow skb->head to be a page fragment"
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Maciej Żenczykowski <maze@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      Cc: Ben Hutchings <bhutchings@solarflare.com>
      Cc: Matt Carlson <mcarlson@broadcom.com>
      Cc: Michael Chan <mchan@broadcom.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d7e8883c
    • E
      net: allow skb->head to be a page fragment · d3836f21
      Eric Dumazet 提交于
      skb->head is currently allocated from kmalloc(). This is convenient but
      has the drawback the data cannot be converted to a page fragment if
      needed.
      
      We have three spots were it hurts :
      
      1) GRO aggregation
      
       When a linear skb must be appended to another skb, GRO uses the
      frag_list fallback, very inefficient since we keep all struct sk_buff
      around. So drivers enabling GRO but delivering linear skbs to network
      stack aren't enabling full GRO power.
      
      2) splice(socket -> pipe).
      
       We must copy the linear part to a page fragment.
       This kind of defeats splice() purpose (zero copy claim)
      
      3) TCP coalescing.
      
       Recently introduced, this permits to group several contiguous segments
      into a single skb. This shortens queue lengths and save kernel memory,
      and greatly reduce probabilities of TCP collapses. This coalescing
      doesnt work on linear skbs (or we would need to copy data, this would be
      too slow)
      
      Given all these issues, the following patch introduces the possibility
      of having skb->head be a fragment in itself. We use a new skb flag,
      skb->head_frag to carry this information.
      
      build_skb() is changed to accept a frag_size argument. Drivers willing
      to provide a page fragment instead of kmalloc() data will set a non zero
      value, set to the fragment size.
      
      Then, on situations we need to convert the skb head to a frag in itself,
      we can check if skb->head_frag is set and avoid the copies or various
      fallbacks we have.
      
      This means drivers currently using frags could be updated to avoid the
      current skb->head allocation and reduce their memory footprint (aka skb
      truesize). (thats 512 or 1024 bytes saved per skb). This also makes
      bpf/netfilter faster since the 'first frag' will be part of skb linear
      part, no need to copy data.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Maciej Żenczykowski <maze@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      Cc: Ben Hutchings <bhutchings@solarflare.com>
      Cc: Matt Carlson <mcarlson@broadcom.com>
      Cc: Michael Chan <mchan@broadcom.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d3836f21
  13. 24 4月, 2012 3 次提交
  14. 22 4月, 2012 1 次提交
  15. 20 4月, 2012 1 次提交
  16. 16 4月, 2012 1 次提交
  17. 11 4月, 2012 1 次提交
  18. 06 4月, 2012 1 次提交
  19. 05 4月, 2012 1 次提交
  20. 29 3月, 2012 1 次提交
  21. 26 3月, 2012 1 次提交
  22. 24 2月, 2012 1 次提交
  23. 14 2月, 2012 1 次提交
  24. 17 12月, 2011 1 次提交
  25. 05 12月, 2011 1 次提交
    • E
      tcp: take care of misalignments · 117632e6
      Eric Dumazet 提交于
      We discovered that TCP stack could retransmit misaligned skbs if a
      malicious peer acknowledged sub MSS frame. This currently can happen
      only if output interface is non SG enabled : If SG is enabled, tcp
      builds headless skbs (all payload is included in fragments), so the tcp
      trimming process only removes parts of skb fragments, header stay
      aligned.
      
      Some arches cant handle misalignments, so force a head reallocation and
      shrink headroom to MAX_TCP_HEADER.
      
      Dont care about misaligments on x86 and PPC (or other arches setting
      NET_IP_ALIGN to 0)
      
      This patch introduces __pskb_copy() which can specify the headroom of
      new head, and pskb_copy() becomes a wrapper on top of __pskb_copy()
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      117632e6
  26. 23 11月, 2011 1 次提交
  27. 17 11月, 2011 1 次提交
  28. 15 11月, 2011 1 次提交
    • E
      net: introduce build_skb() · b2b5ce9d
      Eric Dumazet 提交于
      One of the thing we discussed during netdev 2011 conference was the idea
      to change some network drivers to allocate/populate their skb at RX
      completion time, right before feeding the skb to network stack.
      
      In old days, we allocated skbs when populating the RX ring.
      
      This means bringing into cpu cache sk_buff and skb_shared_info cache
      lines (since we clear/initialize them), then 'queue' skb->data to NIC.
      
      By the time NIC fills a frame in skb->data buffer and host can process
      it, cpu probably threw away the cache lines from its caches, because lot
      of things happened between the allocation and final use.
      
      So the deal would be to allocate only the data buffer for the NIC to
      populate its RX ring buffer. And use build_skb() at RX completion to
      attach a data buffer (now filled with an ethernet frame) to a new skb,
      initialize the skb_shared_info portion, and give the hot skb to network
      stack.
      
      build_skb() is the function to allocate an skb, caller providing the
      data buffer that should be attached to it. Drivers are expected to call
      skb_reserve() right after build_skb() to adjust skb->data to the
      Ethernet frame (usually skipping NET_SKB_PAD and NET_IP_ALIGN, but some
      drivers might add a hardware provided alignment)
      
      Data provided to build_skb() MUST have been allocated by a prior
      kmalloc() call, with enough room to add SKB_DATA_ALIGN(sizeof(struct
      skb_shared_info)) bytes at the end of the data without corrupting
      incoming frame.
      
      data = kmalloc(NET_SKB_PAD + NET_IP_ALIGN + 1536 +
                     SKB_DATA_ALIGN(sizeof(struct skb_shared_info)),
      	       GFP_ATOMIC);
      ...
      skb = build_skb(data);
      if (!skb) {
      	recycle_data(data);
      } else {
      	skb_reserve(skb, NET_SKB_PAD + NET_IP_ALIGN);
      	...
      }
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      CC: Eilon Greenstein <eilong@broadcom.com>
      CC: Ben Hutchings <bhutchings@solarflare.com>
      CC: Tom Herbert <therbert@google.com>
      CC: Jamal Hadi Salim <hadi@mojatatu.com>
      CC: Stephen Hemminger <shemminger@vyatta.com>
      CC: Thomas Graf <tgraf@infradead.org>
      CC: Herbert Xu <herbert@gondor.apana.org.au>
      CC: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b2b5ce9d
  29. 10 11月, 2011 1 次提交
    • J
      net: add wireless TX status socket option · 6e3e939f
      Johannes Berg 提交于
      The 802.1X EAPOL handshake hostapd does requires
      knowing whether the frame was ack'ed by the peer.
      Currently, we fudge this pretty badly by not even
      transmitting the frame as a normal data frame but
      injecting it with radiotap and getting the status
      out of radiotap monitor as well. This is rather
      complex, confuses users (mon.wlan0 presence) and
      doesn't work with all hardware.
      
      To get rid of that hack, introduce a real wifi TX
      status option for data frame transmissions.
      
      This works similar to the existing TX timestamping
      in that it reflects the SKB back to the socket's
      error queue with a SCM_WIFI_STATUS cmsg that has
      an int indicating ACK status (0/1).
      
      Since it is possible that at some point we will
      want to have TX timestamping and wifi status in a
      single errqueue SKB (there's little point in not
      doing that), redefine SO_EE_ORIGIN_TIMESTAMPING
      to SO_EE_ORIGIN_TXSTATUS which can collect more
      than just the timestamp; keep the old constant
      as an alias of course. Currently the internal APIs
      don't make that possible, but it wouldn't be hard
      to split them up in a way that makes it possible.
      
      Thanks to Neil Horman for helping me figure out
      the functions that add the control messages.
      Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
      Signed-off-by: NJohn W. Linville <linville@tuxdriver.com>
      6e3e939f
  30. 04 11月, 2011 1 次提交
    • T
      net: Add back alignment for size for __alloc_skb · bc417e30
      Tony Lindgren 提交于
      Commit 87fb4b7b (net: more
      accurate skb truesize) changed the alignment of size. This
      can cause problems at least on some machines with NFS root:
      
      Unhandled fault: alignment exception (0x801) at 0xc183a43a
      Internal error: : 801 [#1] PREEMPT
      Modules linked in:
      CPU: 0    Not tainted  (3.1.0-08784-g5eeee4a #733)
      pc : [<c02fbba0>]    lr : [<c02fbb9c>]    psr: 60000013
      sp : c180fef8  ip : 00000000  fp : c181f580
      r10: 00000000  r9 : c044b28c  r8 : 00000001
      r7 : c183a3a0  r6 : c1835be0  r5 : c183a412  r4 : 000001f2
      r3 : 00000000  r2 : 00000000  r1 : ffffffe6  r0 : c183a43a
      Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment kernel
      Control: 0005317f  Table: 10004000  DAC: 00000017
      Process swapper (pid: 1, stack limit = 0xc180e270)
      Stack: (0xc180fef8 to 0xc1810000)
      fee0:                                                       00000024 00000000
      ff00: 00000000 c183b9c0 c183b8e0 c044b28c c0507ccc c019dfc4 c180ff2c c0503cf8
      ff20: c180ff4c c180ff4c 00000000 c1835420 c182c740 c18349c0 c05233c0 00000000
      ff40: 00000000 c00e6bb8 c180e000 00000000 c04dd82c c0507e7c c050cc18 c183b9c0
      ff60: c05233c0 00000000 00000000 c01f34f4 c0430d70 c019d364 c04dd898 c04dd898
      ff80: c04dd82c c0507e7c c180e000 00000000 c04c584c c01f4918 c04dd898 c04dd82c
      ffa0: c04ddd28 c180e000 00000000 c0008758 c181fa60 3231d82c 00000037 00000000
      ffc0: 00000000 c04dd898 c04dd82c c04ddd28 00000013 00000000 00000000 00000000
      ffe0: 00000000 c04b2224 00000000 c04b21a0 c001056c c001056c 00000000 00000000
      Function entered at [<c02fbba0>] from [<c019dfc4>]
      Function entered at [<c019dfc4>] from [<c01f34f4>]
      Function entered at [<c01f34f4>] from [<c01f4918>]
      Function entered at [<c01f4918>] from [<c0008758>]
      Function entered at [<c0008758>] from [<c04b2224>]
      Function entered at [<c04b2224>] from [<c001056c>]
      Code: e1a00005 e3a01028 ebfa7cb0 e35a0000 (e5858028)
      
      Here PC is at __alloc_skb and &shinfo->dataref is unaligned because
      skb->end can be unaligned without this patch.
      
      As explained by Eric Dumazet <eric.dumazet@gmail.com>, this happens
      only with SLOB, and not with SLAB or SLUB:
      
      * Eric Dumazet <eric.dumazet@gmail.com> [111102 15:56]:
      >
      > Your patch is absolutely needed, I completely forgot about SLOB :(
      >
      > since, kmalloc(386) on SLOB gives exactly ksize=386 bytes, not nearest
      > power of two.
      >
      > [   60.305763] malloc(size=385)->ffff880112c11e38 ksize=386 -> nsize=2
      > [   60.305921] malloc(size=385)->ffff88007c92ce28 ksize=386 -> nsize=2
      > [   60.306898] malloc(size=656)->ffff88007c44ad28 ksize=656 -> nsize=272
      > [   60.325385] malloc(size=656)->ffff88007c575868 ksize=656 -> nsize=272
      > [   60.325531] malloc(size=656)->ffff88011c777230 ksize=656 -> nsize=272
      > [   60.325701] malloc(size=656)->ffff880114011008 ksize=656 -> nsize=272
      > [   60.346716] malloc(size=385)->ffff880114142008 ksize=386 -> nsize=2
      > [   60.346900] malloc(size=385)->ffff88011c777690 ksize=386 -> nsize=2
      Signed-off-by: NTony Lindgren <tony@atomide.com>
      Acked-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bc417e30
  31. 21 10月, 2011 1 次提交
  32. 20 10月, 2011 1 次提交
  33. 19 10月, 2011 1 次提交