1. 20 May, 2012 1 commit
    • net: introduce skb_try_coalesce() · bad43ca8
      Committed by Eric Dumazet
      Move tcp_try_coalesce() protocol independent part to
      skb_try_coalesce().
      
      skb_try_coalesce() can be used in IPv4 defrag and IPv6 reassembly,
      to build optimized skbs (fewer sk_buffs, and possibly fewer 'headers').

      skb_try_coalesce() is zero copy, unless the copy can fit in the
      destination header (a rare case).
      
      kfree_skb_partial() is also moved to net/core/skbuff.c and exported,
      because IPv6 will need it in patch (ipv6: use skb coalescing in
      reassembly).
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Alexander Duyck <alexander.h.duyck@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
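The copy versus zero-copy decision described in this commit can be modeled in user space. The sketch below is a simplified toy, not the kernel implementation: `toy_skb`, `toy_try_coalesce()`, and the buffer sizes are hypothetical names and values chosen for illustration. It copies only when the whole source fits in the destination's tailroom, and otherwise appends fragment descriptors without touching payload bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_HEAD_CAP  128 /* hypothetical size of the linear (header) area */
#define TOY_MAX_FRAGS 16  /* hypothetical fragment limit */

struct toy_frag {
	const unsigned char *data;
	size_t len;
};

struct toy_skb {
	unsigned char head[TOY_HEAD_CAP];
	size_t head_len;
	struct toy_frag frags[TOY_MAX_FRAGS];
	int nr_frags;
};

enum toy_result { TOY_FAILED, TOY_COPIED, TOY_ZERO_COPY };

/* Total payload: linear part plus all fragments. */
static size_t toy_len(const struct toy_skb *skb)
{
	size_t len = skb->head_len;
	int i;

	for (i = 0; i < skb->nr_frags; i++)
		len += skb->frags[i].len;
	return len;
}

/* A buffer that already carries fragments has no usable tailroom. */
static size_t toy_tailroom(const struct toy_skb *skb)
{
	return skb->nr_frags ? 0 : TOY_HEAD_CAP - skb->head_len;
}

static enum toy_result toy_try_coalesce(struct toy_skb *to,
					const struct toy_skb *from)
{
	int extra;
	int i;

	assert(to && from && to != from);
	/* Descriptors needed if "from" is attached without copying. */
	extra = from->nr_frags + (from->head_len ? 1 : 0);

	if (toy_len(from) <= toy_tailroom(to)) {
		/* Rare case: everything fits in the destination header. */
		unsigned char *dst = to->head + to->head_len;

		memcpy(dst, from->head, from->head_len);
		dst += from->head_len;
		for (i = 0; i < from->nr_frags; i++) {
			memcpy(dst, from->frags[i].data, from->frags[i].len);
			dst += from->frags[i].len;
		}
		to->head_len = (size_t)(dst - to->head);
		return TOY_COPIED;
	}

	if (to->nr_frags + extra > TOY_MAX_FRAGS)
		return TOY_FAILED;

	/* Zero copy: only descriptors move; "from" storage must stay alive. */
	if (from->head_len) {
		to->frags[to->nr_frags].data = from->head;
		to->frags[to->nr_frags].len = from->head_len;
		to->nr_frags++;
	}
	for (i = 0; i < from->nr_frags; i++)
		to->frags[to->nr_frags++] = from->frags[i];
	return TOY_ZERO_COPY;
}
```

In this model a small trailing segment is absorbed by a copy, while a larger one is attached by reference. The toy leaves out the kernel's page reference counting and truesize accounting.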
  2. 19 May, 2012 1 commit
    • net: introduce netdev_alloc_frag() · 6f532612
      Committed by Eric Dumazet
      Fix two issues introduced in commit a1c7fff7
      ( net: netdev_alloc_skb() use build_skb() )
      
      - Must be IRQ safe (non NAPI drivers can use it)
      - Must not leak the frag if build_skb() fails to allocate sk_buff
      
      This patch introduces netdev_alloc_frag() for drivers willing to
      use build_skb() instead of __netdev_alloc_skb() variants.
      
      Factor the code so that:
      __dev_alloc_skb() is a wrapper around __netdev_alloc_skb(), and
      dev_alloc_skb() is a wrapper around netdev_alloc_skb().
      
      Use __GFP_COLD flag.
      
      Almost all network drivers now benefit from skb->head_frag
      infrastructure.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 07 May, 2012 1 commit
  4. 04 May, 2012 1 commit
  5. 01 May, 2012 4 commits
    • net: fix two typos in skbuff.h · d9619496
      Committed by Eric Dumazet
      Fix kernel-doc typos in function names.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: skb_peek()/skb_peek_tail() cleanups · 18d07000
      Committed by Eric Dumazet
      Remove useless casts and rename variables to reduce confusion.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: make GRO aware of skb->head_frag · d7e8883c
      Committed by Eric Dumazet
      GRO can check if skb to be merged has its skb->head mapped to a page
      fragment, instead of a kmalloc() area.
      
      We 'upgrade' skb->head to a fragment in itself.

      This avoids the frag_list fallback, and makes it possible to build a
      true GRO skb (one sk_buff and up to 16 fragments), using less memory.

      This reduces the number of cache misses when the user copies the data,
      since a single sk_buff is fetched.
      
      This is a followup of patch "net: allow skb->head to be a page fragment"
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Maciej Żenczykowski <maze@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      Cc: Ben Hutchings <bhutchings@solarflare.com>
      Cc: Matt Carlson <mcarlson@broadcom.com>
      Cc: Michael Chan <mchan@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: allow skb->head to be a page fragment · d3836f21
      Committed by Eric Dumazet
      skb->head is currently allocated from kmalloc(). This is convenient but
      has the drawback that the data cannot be converted to a page fragment if
      needed.

      We have three spots where it hurts:
      
      1) GRO aggregation
      
       When a linear skb must be appended to another skb, GRO uses the
      frag_list fallback, very inefficient since we keep all struct sk_buff
      around. So drivers enabling GRO but delivering linear skbs to network
      stack aren't enabling full GRO power.
      
      2) splice(socket -> pipe).
      
       We must copy the linear part to a page fragment.
       This largely defeats the purpose of splice() (its zero-copy claim).
      
      3) TCP coalescing.
      
       Recently introduced, this makes it possible to group several contiguous
      segments into a single skb. This shortens queue lengths, saves kernel
      memory, and greatly reduces the probability of TCP collapses. This
      coalescing doesn't work on linear skbs (or we would need to copy data,
      which would be too slow).
      
      Given all these issues, the following patch introduces the possibility
      of having skb->head be a fragment in itself. We use a new skb flag,
      skb->head_frag to carry this information.
      
      build_skb() is changed to accept a frag_size argument. Drivers willing
      to provide a page fragment instead of kmalloc() data will set a non zero
      value, set to the fragment size.
      
      Then, in situations where we need to convert the skb head to a frag in
      itself, we can check if skb->head_frag is set and avoid the copies or
      various fallbacks we have.

      This means drivers currently using frags could be updated to avoid the
      current skb->head allocation and reduce their memory footprint (aka skb
      truesize); that's 512 or 1024 bytes saved per skb. This also makes
      bpf/netfilter faster, since the 'first frag' will be part of the skb
      linear part, with no need to copy data.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Maciej Żenczykowski <maze@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Tom Herbert <therbert@google.com>
      Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      Cc: Ben Hutchings <bhutchings@solarflare.com>
      Cc: Matt Carlson <mcarlson@broadcom.com>
      Cc: Michael Chan <mchan@broadcom.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. 24 April, 2012 1 commit
  7. 20 April, 2012 1 commit
  8. 14 April, 2012 1 commit
  9. 11 April, 2012 1 commit
    • tcp: avoid order-1 allocations on wifi and tx path · a21d4572
      Committed by Eric Dumazet
      Marc Merlin reported many order-1 allocation failures in the TX path on
      his wireless setup, which don't make any sense with an MTU=1500 network
      and non SG capable hardware.

      After investigation, it turns out TCP uses sk_stream_alloc_skb() and,
      by convention, uses skb_tailroom(skb) to know how many bytes of data
      payload can be put in this skb (for non SG capable devices).

      Note: these skbs used kmalloc-4096 (MTU=1500 + MAX_HEADER +
      sizeof(struct skb_shared_info) being above 2048).

      Later, the mac80211 layer needs to add some bytes at the tail of the skb
      (IEEE80211_ENCRYPT_TAILROOM = 18 bytes) and, since no more tailroom is
      available, has to call pskb_expand_head() and request order-1
      allocations.

      This patch changes sk_stream_alloc_skb() so that only
      sk->sk_prot->max_header bytes of headroom are reserved, and uses a new
      skb field, avail_size, to hold the data payload limit.
      
      This way, order-0 allocations done by TCP stack can leave more than 2 KB
      of tailroom and no more allocation is performed in mac80211 layer (or
      any layer needing some tailroom)
      
      avail_size is unioned with mark/dropcount, since mark will be set later
      in IP stack for output packets. Therefore, skb size is unchanged.
      Reported-by: Marc MERLIN <marc@merlins.org>
      Tested-by: Marc MERLIN <marc@merlins.org>
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
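The allocation arithmetic behind this commit can be checked with a small user-space sketch. It assumes a 4096-byte page and power-of-two allocation buckets. The `toy_*` helpers are hypothetical names for illustration, not kernel functions.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SIZE 4096u /* assumed page size */

/* Round a request up to the next power-of-two bucket (assumed policy). */
static size_t toy_kmalloc_bucket(size_t size)
{
	size_t bucket = 32;

	assert(size > 0);
	while (bucket < size)
		bucket <<= 1;
	return bucket;
}

/* Smallest page order whose span (TOY_PAGE_SIZE << order) covers size. */
static unsigned int toy_page_order(size_t size)
{
	unsigned int order = 0;
	size_t span = TOY_PAGE_SIZE;

	while (span < size) {
		span <<= 1;
		order++;
	}
	return order;
}

/* Buffer size needed after a layer asks for extra_tail bytes of tailroom. */
static size_t toy_expanded_size(size_t bucket, size_t tailroom,
				size_t extra_tail)
{
	return tailroom >= extra_tail ? bucket : bucket + (extra_tail - tailroom);
}
```

With hypothetical values of 256 bytes for MAX_HEADER and 320 bytes for skb_shared_info, `toy_kmalloc_bucket(1500 + 256 + 320)` lands in the 4096-byte bucket. If TCP has consumed all the tailroom, `toy_expanded_size(4096, 0, 18)` exceeds one page and `toy_page_order()` returns 1. With 2 KB of tailroom left, the request stays at order 0.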
  10. 29 March, 2012 1 commit
  11. 26 March, 2012 1 commit
  12. 20 March, 2012 1 commit
  13. 10 March, 2012 1 commit
  14. 05 March, 2012 1 commit
    • BUG: headers with BUG/BUG_ON etc. need linux/bug.h · 187f1882
      Committed by Paul Gortmaker
      If a header file is making use of BUG, BUG_ON, BUILD_BUG_ON, or any
      other BUG variant in a static inline (i.e. not in a #define) then
      that header really should be including <linux/bug.h> and not just
      expecting it to be implicitly present.
      
      We can make this change risk-free, since if the files using these
      headers didn't have exposure to linux/bug.h already, they would have
      been causing compile failures/warnings.
      Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
  15. 24 February, 2012 2 commits
  16. 22 February, 2012 2 commits
  17. 11 February, 2012 1 commit
  18. 06 January, 2012 1 commit
  19. 24 December, 2011 1 commit
  20. 05 December, 2011 1 commit
    • tcp: take care of misalignments · 117632e6
      Committed by Eric Dumazet
      We discovered that the TCP stack could retransmit misaligned skbs if a
      malicious peer acknowledged a sub-MSS frame. This can currently happen
      only if the output interface is not SG enabled: if SG is enabled, TCP
      builds headless skbs (all payload is included in fragments), so the TCP
      trimming process only removes parts of skb fragments and headers stay
      aligned.

      Some arches can't handle misalignments, so force a head reallocation and
      shrink headroom to MAX_TCP_HEADER.

      Don't care about misalignments on x86 and PPC (or other arches setting
      NET_IP_ALIGN to 0).
      
      This patch introduces __pskb_copy() which can specify the headroom of
      new head, and pskb_copy() becomes a wrapper on top of __pskb_copy()
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  21. 23 November, 2011 1 commit
    • net: remove netdev_alloc_page and use __GFP_COLD · 1f2149c1
      Committed by Eric Dumazet
      Given that we no longer use the struct net_device *dev argument, and this
      interface brings little benefit, remove netdev_{alloc|free}_page() to
      debloat include/linux/skbuff.h a bit.
      
      (Some drivers used a mix of these interfaces and alloc_pages())
      
      When allocating a page given to device for DMA transfer (device to
      memory), it makes sense to use a cold one (__GFP_COLD)
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      CC: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      CC: Dimitris Michailidis <dm@chelsio.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  22. 17 November, 2011 2 commits
  23. 15 November, 2011 1 commit
    • net: introduce build_skb() · b2b5ce9d
      Committed by Eric Dumazet
      One of the things we discussed during the netdev 2011 conference was the
      idea of changing some network drivers to allocate/populate their skb at
      RX completion time, right before feeding the skb to the network stack.

      In the old days, we allocated skbs when populating the RX ring.

      This means bringing sk_buff and skb_shared_info cache lines into the CPU
      cache (since we clear/initialize them), then 'queueing' skb->data to the
      NIC.

      By the time the NIC fills a frame in the skb->data buffer and the host
      can process it, the CPU has probably evicted those cache lines, because
      a lot of things happened between the allocation and the final use.
      
      So the deal would be to allocate only the data buffer for the NIC to
      populate its RX ring buffer. And use build_skb() at RX completion to
      attach a data buffer (now filled with an ethernet frame) to a new skb,
      initialize the skb_shared_info portion, and give the hot skb to network
      stack.
      
      build_skb() is the function to allocate an skb, caller providing the
      data buffer that should be attached to it. Drivers are expected to call
      skb_reserve() right after build_skb() to adjust skb->data to the
      Ethernet frame (usually skipping NET_SKB_PAD and NET_IP_ALIGN, but some
      drivers might add a hardware provided alignment)
      
      Data provided to build_skb() MUST have been allocated by a prior
      kmalloc() call, with enough room to add SKB_DATA_ALIGN(sizeof(struct
      skb_shared_info)) bytes at the end of the data without corrupting
      incoming frame.
      
      data = kmalloc(NET_SKB_PAD + NET_IP_ALIGN + 1536 +
                     SKB_DATA_ALIGN(sizeof(struct skb_shared_info)),
      	       GFP_ATOMIC);
      ...
      skb = build_skb(data);
      if (!skb) {
      	recycle_data(data);
      } else {
      	skb_reserve(skb, NET_SKB_PAD + NET_IP_ALIGN);
      	...
      }
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      CC: Eilon Greenstein <eilong@broadcom.com>
      CC: Ben Hutchings <bhutchings@solarflare.com>
      CC: Tom Herbert <therbert@google.com>
      CC: Jamal Hadi Salim <hadi@mojatatu.com>
      CC: Stephen Hemminger <shemminger@vyatta.com>
      CC: Thomas Graf <tgraf@infradead.org>
      CC: Herbert Xu <herbert@gondor.apana.org.au>
      CC: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
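The buffer sizing rule in this commit message can be worked through as plain arithmetic. The sketch below runs in user space. The cache line size, NET_SKB_PAD, NET_IP_ALIGN, and the skb_shared_info size are assumed values for illustration only, since the kernel derives the real ones per architecture and configuration.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values; the kernel derives the real ones per architecture/config. */
#define TOY_CACHE_BYTES  64u
#define TOY_NET_SKB_PAD  64u
#define TOY_NET_IP_ALIGN 2u
#define TOY_SHINFO_SIZE  320u /* hypothetical sizeof(struct skb_shared_info) */

/* Round x up to a multiple of the cache line size. */
static size_t toy_data_align(size_t x)
{
	return (x + (TOY_CACHE_BYTES - 1)) & ~((size_t)TOY_CACHE_BYTES - 1);
}

/* Bytes a driver would skip with skb_reserve() before the Ethernet frame. */
static size_t toy_rx_headroom(void)
{
	return TOY_NET_SKB_PAD + TOY_NET_IP_ALIGN;
}

/* Total bytes to allocate: headroom + frame room + aligned shared info. */
static size_t toy_rx_buf_size(size_t frame_room)
{
	assert(frame_room > 0);
	return toy_rx_headroom() + frame_room + toy_data_align(TOY_SHINFO_SIZE);
}
```

Under these assumptions, a 1536-byte frame area needs a 1922-byte allocation, of which the first 66 bytes are headroom and the last 320 bytes are reserved for the shared info that build_skb() initializes.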
  24. 10 November, 2011 1 commit
    • net: add wireless TX status socket option · 6e3e939f
      Committed by Johannes Berg
      The 802.1X EAPOL handshake that hostapd performs requires
      knowing whether the frame was ack'ed by the peer.
      Currently, we fudge this pretty badly by not even
      transmitting the frame as a normal data frame but
      injecting it with radiotap and getting the status
      out of radiotap monitor as well. This is rather
      complex, confuses users (mon.wlan0 presence) and
      doesn't work with all hardware.
      
      To get rid of that hack, introduce a real wifi TX
      status option for data frame transmissions.
      
      This works similarly to the existing TX timestamping
      in that it reflects the SKB back to the socket's
      error queue with a SCM_WIFI_STATUS cmsg that has
      an int indicating ACK status (0/1).
      
      Since it is possible that at some point we will
      want to have TX timestamping and wifi status in a
      single errqueue SKB (there's little point in not
      doing that), redefine SO_EE_ORIGIN_TIMESTAMPING
      to SO_EE_ORIGIN_TXSTATUS which can collect more
      than just the timestamp; keep the old constant
      as an alias of course. Currently the internal APIs
      don't make that possible, but it wouldn't be hard
      to split them up in a way that makes it possible.
      
      Thanks to Neil Horman for helping me figure out
      the functions that add the control messages.
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      Signed-off-by: John W. Linville <linville@tuxdriver.com>
  25. 01 November, 2011 1 commit
  26. 24 October, 2011 1 commit
    • net: hold sock reference while processing tx timestamps · da92b194
      Committed by Richard Cochran
      The pair of functions,
      
       * skb_clone_tx_timestamp()
       * skb_complete_tx_timestamp()
      
      were designed to allow timestamping in PHY devices. The first
      function, called during the MAC driver's hard_xmit method, identifies
      PTP protocol packets, clones them, and gives them to the PHY device
      driver. The PHY driver may hold onto the packet and deliver it at a
      later time using the second function, which adds the packet to the
      socket's error queue.
      
      As pointed out by Johannes, nothing prevents the socket from
      disappearing while the cloned packet is sitting in the PHY driver
      awaiting a timestamp. This patch fixes the issue by taking a reference
      on the socket for each such packet. In addition, the comments
      regarding the usage of these functions are expanded to highlight the
      rule that PHY drivers must use skb_complete_tx_timestamp() to release
      the packet, in order to release the socket reference, too.
      
      These functions first appeared in v2.6.36.
      Reported-by: Johannes Berg <johannes@sipsolutions.net>
      Signed-off-by: Richard Cochran <richard.cochran@omicron.at>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Reviewed-by: Johannes Berg <johannes@sipsolutions.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  27. 21 October, 2011 2 commits
  28. 20 October, 2011 2 commits
  29. 19 October, 2011 1 commit
  30. 14 October, 2011 1 commit
    • net: more accurate skb truesize · 87fb4b7b
      Committed by Eric Dumazet
      skb truesize currently accounts for sk_buff struct and part of skb head.
      kmalloc() roundings are also ignored.
      
      Considering that skb_shared_info is larger than sk_buff, it's time to
      take it into account for better memory accounting.
      
      This patch introduces SKB_TRUESIZE(X) macro to centralize various
      assumptions into a single place.
      
      At skb alloc phase, we put skb_shared_info struct at the exact end of
      skb head, to allow a better use of memory (lowering number of
      reallocations), since kmalloc() gives us power-of-two memory blocks.
      
      Unless SLUB/SLUB debug is active, both skb->head and skb_shared_info are
      aligned to cache lines, as before.
      
      Note: This patch might trigger performance regressions because of
      misconfigured protocol stacks, hitting per socket or global memory
      limits that were previously not reached. But it's a necessary step for a
      more accurate memory accounting.
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      CC: Andi Kleen <ak@linux.intel.com>
      CC: Ben Hutchings <bhutchings@solarflare.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
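The accounting idea behind SKB_TRUESIZE(X) can be illustrated with a small user-space calculation. The sketch assumes the macro adds the cache-aligned sizes of sk_buff and skb_shared_info to the head size X. The structure sizes and the 64-byte cache line are hypothetical values, chosen only so that skb_shared_info is larger than sk_buff, as the commit message states.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_CACHE_BYTES  64u  /* assumed cache line size */
#define TOY_SK_BUFF_SIZE 240u /* hypothetical sizeof(struct sk_buff) */
#define TOY_SHINFO_SIZE  320u /* hypothetical sizeof(struct skb_shared_info) */

/* Round x up to a multiple of the (power-of-two) cache line size. */
static size_t toy_align(size_t x)
{
	assert((TOY_CACHE_BYTES & (TOY_CACHE_BYTES - 1)) == 0);
	return (x + (TOY_CACHE_BYTES - 1)) & ~((size_t)TOY_CACHE_BYTES - 1);
}

/* Memory charged for an skb whose head buffer is head_size bytes. */
static size_t toy_truesize(size_t head_size)
{
	return head_size + toy_align(TOY_SK_BUFF_SIZE) +
	       toy_align(TOY_SHINFO_SIZE);
}
```

With these numbers, a 2048-byte head is charged as 2624 bytes, so 576 bytes of per-skb overhead become visible to socket memory limits.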
  31. 16 September, 2011 1 commit
    • net: copy userspace buffers on device forwarding · 48c83012
      Committed by Michael S. Tsirkin
      dev_forward_skb loops an skb back into the host networking
      stack, which might hang on to the memory indefinitely.
      In particular, this can happen in macvtap in bridged mode.
      Copy the userspace fragments to avoid blocking the
      sender in that case.

      As this patch makes skb_copy_ubufs extern now,
      I also added some documentation and changed it to clear
      the SKBTX_DEV_ZEROCOPY flag automatically instead
      of doing it in all callers. This can be made into a separate
      patch if people feel it's worth it.
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  32. 25 August, 2011 1 commit