1. 02 Aug, 2013 1 commit
  2. 28 Jun, 2013 1 commit
  3. 26 Jun, 2013 1 commit
  4. 11 Jun, 2013 2 commits
  5. 01 Jun, 2013 2 commits
  6. 29 May, 2013 1 commit
  7. 28 May, 2013 2 commits
    • MPLS: Add limited GSO support · 0d89d203
      Authored by Simon Horman
      In the case where a non-MPLS packet is received and an MPLS stack is
      added it may well be the case that the original skb is GSO but the
      NIC used for transmit does not support GSO of MPLS packets.
      
      The aim of this code is to provide GSO in software for MPLS packets
      whose skbs are GSO.
      
      SKB Usage:
      
      When an implementation adds an MPLS stack to a non-MPLS packet it should do
      the following to skb metadata:
      
      * Set skb->inner_protocol to the old non-MPLS ethertype of the packet.
        skb->inner_protocol is added by this patch.
      
      * Set skb->protocol to the new MPLS ethertype of the packet.
      
      * Set skb->network_header to correspond to the
        end of the L3 header, including the MPLS label stack.
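      
      A minimal sketch of these metadata updates (the helper name is illustrative
      and this is not the datapath patch itself; it assumes the label stack entry
      has already been pushed in front of the original L3 headers):
      
          /* needs <linux/skbuff.h> and <uapi/linux/if_ether.h> */
          static void example_set_mpls_metadata(struct sk_buff *skb)
          {
              /* remember the original, non-MPLS ethertype */
              skb->inner_protocol = skb->protocol;
      
              /* the packet now carries an MPLS unicast ethertype */
              skb->protocol = htons(ETH_P_MPLS_UC);
      
              /* the caller is also expected to update skb->network_header
               * as described in the last point above (e.g. via
               * skb_set_network_header()), which this sketch leaves out */
          }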
      
      I have posted a patch, "[PATCH v3.29] datapath: Add basic MPLS support to
      kernel" which adds MPLS support to the kernel datapath of Open vSwitch.
      That patch sets the above requirements in datapath/actions.c:push_mpls()
      and was used to exercise this code.  The datapath patch is against the Open
      vSwitch tree but it is intended that it be added to the Open vSwitch code
      present in the mainline Linux kernel at some point.
      
      Features:
      
      I believe that the approach that I have taken is at least partially
      consistent with the handling of other protocols.  Jesse, I understand that
      you have some ideas here.  I am more than happy to change my implementation.
      
      This patch adds dev->mpls_features which may be used by devices
      to advertise features supported for MPLS packets.
      
      A new NETIF_F_MPLS_GSO feature is added for devices which support
      hardware MPLS GSO offload.  Currently no devices support this
      and MPLS GSO always falls back to software.
      
      Alternate Implementation:
      
      One possible alternate implementation is to teach netif_skb_features()
      and skb_network_protocol() about MPLS, in a similar way to their
      understanding of VLANs. I believe this would avoid the need
      for net/mpls/mpls_gso.c and in particular the calls to
      __skb_push() and __skb_pull() in mpls_gso_segment().
      
      I have decided on the implementation in this patch as it should
      not introduce any overhead in the case where mpls_gso is not compiled
      into the kernel or inserted as a module.
      
      MPLS GSO suggested by Jesse Gross.
      Based in part on "v4 GRE: Add TCP segmentation offload for GRE"
      by Pravin B Shelar.
      
      Cc: Jesse Gross <jesse@nicira.com>
      Cc: Pravin B Shelar <pshelar@nicira.com>
      Signed-off-by: Simon Horman <horms@verge.net.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Use 16bits for *_headers fields of struct skbuff · 1a37e412
      Authored by Simon Horman
      In order to mitigate the ongoing increase in the size of struct skbuff,
      use 16-bit integer offsets rather than pointers for inner_*_headers.
      
      This appears to reduce the size of struct skbuff from 0xd0 to 0xc0
      bytes on x86_64 with the following all unset.
      
      	CONFIG_XFRM
      	CONFIG_NF_CONNTRACK
      	CONFIG_NF_CONNTRACK_MODULE
      	NET_SKBUFF_NF_DEFRAG_NEEDED
      	CONFIG_BRIDGE_NETFILTER
      	CONFIG_NET_SCHED
      	CONFIG_IPV6_NDISC_NODETYPE
      	CONFIG_NET_DMA
      	CONFIG_NETWORK_SECMARK
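      
      As a hedged illustration of what the offset-based fields mean for users of
      the accessors (a sketch only, not a hunk from this patch), a header getter
      becomes plain pointer arithmetic relative to skb->head:
      
          static inline unsigned char *
          example_inner_network_header(const struct sk_buff *skb)
          {
              /* inner_network_header is now a 16-bit offset, not a pointer */
              return skb->head + skb->inner_network_header;
          }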
      Signed-off-by: Simon Horman <horms@verge.net.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  8. 20 Apr, 2013 2 commits
  9. 06 Apr, 2013 1 commit
    • netfilter: don't reset nf_trace in nf_reset() · 124dff01
      Authored by Patrick McHardy
      Commit 130549fe ("netfilter: reset nf_trace in nf_reset") added code
      to reset nf_trace in nf_reset(). This is wrong and unnecessary.
      
      nf_reset() is used in the following cases:
      
      - when passing packets up to the socket layer, at which point we want to
        release all netfilter references that might keep modules pinned while
        the packet is queued. nf_trace doesn't matter anymore at this point.
      
      - when encapsulating or decapsulating IPsec packets. We want to continue
        tracing these packets after IPsec processing.
      
      - when passing packets through virtual network devices. Only devices
        that encapsulate in IPv4/v6 matter, since otherwise nf_trace is not
        used anymore. It's not entirely clear whether those packets should
        be traced after that, however we've always done that.
      
      - when passing packets through virtual network devices that make the
        packet cross network namespace boundaries. This is the only case
        where we clearly want to reset nf_trace and is also what the
        original patch intended to fix.
      
      Add a new function nf_reset_trace() and use it in dev_forward_skb() to
      fix this properly.
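      
      A sketch of the shape of the new helper (the exact config guard used here
      is an assumption of this sketch); unlike nf_reset(), it touches nothing
      but the trace flag:
      
          static inline void nf_reset_trace(struct sk_buff *skb)
          {
          #if IS_ENABLED(CONFIG_NETFILTER_XT_TARGET_TRACE)
              /* only the trace flag; no conntrack references are dropped */
              skb->nf_trace = 0;
          #endif
          }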
      Signed-off-by: Patrick McHardy <kaber@trash.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  10. 02 Apr, 2013 1 commit
  11. 28 Mar, 2013 2 commits
  12. 25 Mar, 2013 1 commit
  13. 21 Mar, 2013 1 commit
  14. 14 Mar, 2013 2 commits
    • skb: Propagate pfmemalloc on skb from head page only · cca7af38
      Authored by Pavel Emelyanov
      Hi.
      
      I'm trying to send big chunks of memory from application address space via
      TCP socket using vmsplice + splice like this
      
         mem = mmap(128Mb);
         vmsplice(pipe[1], mem); /* splice memory into pipe */
         splice(pipe[0], tcp_socket); /* send it into network */
      
      When I'm lucky and a huge page splices into the pipe and then into the socket
      _and_ client and server ends of the TCP connection are on the same host,
      communicating via lo, the whole connection gets stuck! The sending queue
      becomes full and the app stops writing/splicing more into it, but the receiving
      queue remains empty, and here's why.
      
      The __skb_fill_page_desc observes a tail page of a huge page and erroneously
      propagates its page->pfmemalloc value onto the socket (the pfmemalloc flag on
      tail pages contains garbage). Then this skb->pfmemalloc leaks through lo and due to the
      
          tcp_v4_rcv
          sk_filter
              if (skb->pfmemalloc && !sock_flag(sk, SOCK_MEMALLOC)) /* true */
                  return -ENOMEM
              goto release_and_discard;
      
      no packets reach the socket. Even TCP re-transmits are dropped by this, as skb
      cloning clones the pfmemalloc flag as well.
      
      That said, here's the proper page->pfmemalloc propagation onto socket: we
      must check the huge-page's head page only, other pages' pfmemalloc and mapping
      values do not contain what is expected in this place. However, I'm not sure
      whether this fix is _complete_, since pfmemalloc propagation via lo also
      doesn't look great.
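      
      A hedged sketch of that propagation rule (the real change is in
      __skb_fill_page_desc, but this is not the literal hunk):
      
          static inline void example_propagate_pfmemalloc(struct page *page,
                                                          struct sk_buff *skb)
          {
              /* only the head page of a compound (huge) page carries valid
               * pfmemalloc/mapping values; tail pages hold garbage here */
              page = compound_head(page);
              if (page->pfmemalloc && !page->mapping)
                  skb->pfmemalloc = true;
          }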
      
      Both the bit propagation from page to skb and this check in sk_filter were
      introduced by c48a11c7 (netvm: propagate page->pfmemalloc to skb), in v3.5,
      so Mel and stable@ are in Cc.
      Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: fix skb_availroom() · 16fad69c
      Authored by Eric Dumazet
      Chrome OS team reported a crash on a Pixel ChromeBook in TCP stack :
      
      https://code.google.com/p/chromium/issues/detail?id=182056
      
      commit a21d4572 (tcp: avoid order-1 allocations on wifi and tx
      path) made a poor choice in adding an 'avail_size' field to the skb, while
      what we really needed was a 'reserved_tailroom' one.
      
      It would have avoided commit 22b4a4f2 (tcp: fix retransmit of
      partially acked frames) and this commit.
      
      The crash occurs because skb_split() is not aware of the 'avail_size'
      management (and should not be aware of it).
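      
      A hedged sketch of the direction argued for here, assuming the
      'reserved_tailroom' field this changelog asks for: available room is then
      plain tailroom minus the explicitly reserved part, with nothing hidden
      from skb_split():
      
          static inline int example_skb_availroom(const struct sk_buff *skb)
          {
              if (skb_is_nonlinear(skb))
                  return 0;
      
              return skb_tailroom(skb) - skb->reserved_tailroom;
          }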
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Mukesh Agrawal <quiche@chromium.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  15. 10 Mar, 2013 2 commits
  16. 16 Feb, 2013 2 commits
  17. 14 Feb, 2013 1 commit
    • net: Fix possible wrong checksum generation. · c9af6db4
      Authored by Pravin B Shelar
      Patch cef401de (net: fix possible wrong checksum
      generation) fixed wrong checksum calculation but it broke TSO by
      defining a new GSO type but not a netdev feature for that type.
      net_gso_ok() would not allow hardware checksum/segmentation
      offload of such packets without the feature.
      
      The following patch fixes TSO and the wrong checksum. This patch uses the
      same logic that Eric Dumazet used. The patch introduces a new flag,
      SKBTX_SHARED_FRAG, set if at least one frag can be modified by
      the user, but the SKBTX_SHARED_FRAG flag is kept in the skb shared
      info tx_flags rather than in gso_type.
      
      tx_flags is better than gso_type since we can have an skb with a
      shared frag that is not a GSO packet. It does not link SHARED_FRAG to
      GSO, so there is no need to define a netdev feature for this.
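      
      A sketch of how such a tx_flags-based property can be tested on any skb,
      GSO or not (the helper name here is illustrative):
      
          static inline bool example_skb_has_shared_frag(const struct sk_buff *skb)
          {
              /* the bit lives in shared-info tx_flags, not in gso_type */
              return skb_shinfo(skb)->tx_flags & SKBTX_SHARED_FRAG;
          }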
      Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  18. 09 Feb, 2013 1 commit
  19. 28 Jan, 2013 1 commit
    • net: fix possible wrong checksum generation · cef401de
      Authored by Eric Dumazet
      Pravin Shelar mentioned that GSO could potentially generate
      wrong TX checksum if skb has fragments that are overwritten
      by the user between the checksum computation and transmit.
      
      He suggested to linearize skbs but this extra copy can be
      avoided for normal tcp skbs cooked by tcp_sendmsg().
      
      This patch introduces a new SKB_GSO_SHARED_FRAG flag, set
      in skb_shinfo(skb)->gso_type if at least one frag can be
      modified by the user.
      
      Typical sources of such possible overwrites are {vm}splice(),
      sendfile(), and macvtap/tun/virtio_net drivers.
      
      Tested:
      
      $ netperf -H 7.7.8.84
      MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
      7.7.8.84 () port 0 AF_INET
      Recv   Send    Send
      Socket Socket  Message  Elapsed
      Size   Size    Size     Time     Throughput
      bytes  bytes   bytes    secs.    10^6bits/sec
      
       87380  16384  16384    10.00    3959.52
      
      $ netperf -H 7.7.8.84 -t TCP_SENDFILE
      TCP SENDFILE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 ()
      port 0 AF_INET
      Recv   Send    Send
      Socket Socket  Message  Elapsed
      Size   Size    Size     Time     Throughput
      bytes  bytes   bytes    secs.    10^6bits/sec
      
       87380  16384  16384    10.00    3216.80
      
      Performance of the SENDFILE is impacted by the extra allocation and
      copy, and because we use order-0 pages, while the TCP_STREAM uses
      bigger pages.
      Reported-by: Pravin Shelar <pshelar@nicira.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  20. 09 Jan, 2013 1 commit
    • net: introduce skb_transport_header_was_set() · fda55eca
      Authored by Eric Dumazet
      We have skb_mac_header_was_set() helper to tell if mac_header
      was set on a skb. We would like the same for transport_header.
      
      __netif_receive_skb() doesn't reset the transport header if already
      set by GRO layer.
      
      Note that network stacks usually reset the transport header anyway,
      after pulling the network header, so this change only allows
      a followup patch to have more precise qdisc pkt_len computation
      for GSO packets at ingress side.
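      
      A hedged sketch of the helper's idea, assuming the offset-based header
      representation where an unset header is marked by an all-ones sentinel
      (mirroring skb_mac_header_was_set()):
      
          static inline bool
          example_transport_header_was_set(const struct sk_buff *skb)
          {
              return skb->transport_header !=
                     (typeof(skb->transport_header))~0U;
          }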
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  21. 09 Dec, 2012 1 commit
  22. 03 Nov, 2012 2 commits
  23. 01 Nov, 2012 1 commit
    • net: compute skb->rxhash if nic hash may be 3-tuple · ecd5cf5d
      Authored by Willem de Bruijn
      Network device drivers can communicate a Toeplitz hash in skb->rxhash,
      but devices differ in their hashing capabilities. All compute a 5-tuple
      hash for TCP over IPv4, but for other connection-oriented protocols,
      they may compute only a 3-tuple. This breaks RPS load balancing, e.g.,
      for TCP over IPv6 flows. Additionally, for GRE and other tunnels,
      the kernel computes a 5-tuple hash over the inner packet if possible,
      but devices do not.
      
      This patch recomputes the rxhash in software in all cases where it
      cannot be certain that a 5-tuple was computed. Device drivers can avoid
      recomputation by setting the skb->l4_rxhash flag.
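      
      A hedged sketch of the resulting check (close in spirit to skb_get_rxhash(),
      written out here for illustration):
      
          static inline __u32 example_skb_get_rxhash(struct sk_buff *skb)
          {
              /* only trust a device-provided hash marked as an l4 hash;
               * otherwise recompute in software via the flow dissector */
              if (!skb->l4_rxhash)
                  skb->rxhash = __skb_get_rxhash(skb);
      
              return skb->rxhash;
          }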
      
      Recomputing adds cycles to each packet when RPS is enabled or the
      packet arrives over a tunnel. A comparison of 200x TCP_STREAM between
      two servers running unmodified net-next with rxhash computation
      in hardware vs software (using ethtool -K eth0 rxhash [on|off]) shows
      how much time is spent in __skb_get_rxhash in this worst case:
      
           0.03%          swapper  [kernel.kallsyms]     [k] __skb_get_rxhash
           0.03%          swapper  [kernel.kallsyms]     [k] __skb_get_rxhash
           0.05%          swapper  [kernel.kallsyms]     [k] __skb_get_rxhash
      
      With 200x TCP_RR it increases to
      
           0.10%          netperf  [kernel.kallsyms]     [k] __skb_get_rxhash
           0.10%          netperf  [kernel.kallsyms]     [k] __skb_get_rxhash
           0.10%          netperf  [kernel.kallsyms]     [k] __skb_get_rxhash
      
      I considered having the patch explicitly skip recomputation when it knows
      that it will not improve the hash (TCP over IPv4), but that conditional
      complicates the code without saving many cycles in practice, because it has
      to take place after the flow dissector.
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  24. 07 Oct, 2012 1 commit
    • net: remove skb recycling · acb600de
      Authored by Eric Dumazet
      Over time, the skb recycling infrastructure got little interest and
      many bugs. Generic rx path skb allocation is now using page
      fragments for efficient GRO / TCP coalescing, and recycling
      a tx skb for the rx path is not worth the pain.
      
      The last identified bug is that fat skbs can be recycled
      and can end up using high-order pages after a few iterations.
      
      With help from Maxime Bizon, who pointed out that commit
      87151b86 (net: allow pskb_expand_head() to get maximum tailroom)
      introduced this regression for recycled skbs.
      
      Instead of fixing this bug, let's remove skb recycling.
      
      Drivers wanting really hot skbs should use build_skb() anyway,
      to allocate/populate the sk_buff right before netif_receive_skb().
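      
      A hedged sketch of the build_skb() pattern referred to (buffer names,
      sizes and the receive-completion hook are placeholders):
      
          static void example_rx_complete(struct net_device *dev, void *buf,
                                          unsigned int len, unsigned int truesize)
          {
              /* wrap an already-filled buffer in an skb right before
               * handing it to the stack, instead of recycling old skbs */
              struct sk_buff *skb = build_skb(buf, truesize);
      
              if (unlikely(!skb))
                  return;
      
              skb_reserve(skb, NET_SKB_PAD);  /* headroom the driver left in buf */
              skb_put(skb, len);
              skb->protocol = eth_type_trans(skb, dev);
              netif_receive_skb(skb);
          }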
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Maxime Bizon <mbizon@freebox.fr>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  25. 04 Aug, 2012 1 commit
  26. 01 Aug, 2012 3 commits
    • netvm: propagate page->pfmemalloc from skb_alloc_page to skb · 0614002b
      Authored by Mel Gorman
      The skb->pfmemalloc flag gets set to true iff the PFMEMALLOC reserves
      were used during the slab allocation of data in __alloc_skb.  If
      page splitting is used, it is possible that pages will be allocated from
      the PFMEMALLOC reserve without propagating this information to the skb.
      This patch propagates page->pfmemalloc from pages allocated for fragments
      to the skb.
      
      It works by reintroducing and expanding the skb_alloc_page() API to take
      an skb.  If the page was allocated from pfmemalloc reserves, it is
      automatically copied.  If the driver allocates the page before the skb, it
      should call skb_propagate_pfmemalloc() after the skb is allocated to
      ensure the flag is copied properly.
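      
      A hedged usage sketch for the driver-allocates-the-page-first case
      described above (allocation sizes and names are placeholders):
      
          static struct sk_buff *example_rx_alloc(struct net_device *dev,
                                                  struct page **pagep)
          {
              /* the data page is allocated before the skb ... */
              struct page *page = alloc_page(GFP_ATOMIC);
              struct sk_buff *skb = netdev_alloc_skb(dev, 128);
      
              /* ... so page->pfmemalloc must be copied over explicitly */
              if (skb && page)
                  skb_propagate_pfmemalloc(page, skb);
      
              *pagep = page;
              return skb;
          }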
      
      Failure to do so is not critical.  The resulting driver may perform slower
      if it is used for swap-over-NBD or swap-over-NFS but it should not result
      in failure.
      
      [davem@davemloft.net: API rename and consistency]
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: David S. Miller <davem@davemloft.net>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • netvm: propagate page->pfmemalloc to skb · c48a11c7
      Authored by Mel Gorman
      The skb->pfmemalloc flag gets set to true iff the PFMEMALLOC reserves
      were used during the slab allocation of data in __alloc_skb.  If the
      packet is fragmented, it is possible that pages will be allocated from the
      PFMEMALLOC reserve without propagating this information to the skb.  This
      patch propagates page->pfmemalloc from pages allocated for fragments to
      the skb.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: David S. Miller <davem@davemloft.net>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • netvm: allow skb allocation to use PFMEMALLOC reserves · c93bdd0e
      Authored by Mel Gorman
      Change the skb allocation API to indicate RX usage and use this to fall
      back to the PFMEMALLOC reserve when needed.  SKBs allocated from the
      reserve are tagged in skb->pfmemalloc.  If an SKB is allocated from the
      reserve and the socket is later found to be unrelated to page reclaim, the
      packet is dropped so that the memory remains available for page reclaim.
      Network protocols are expected to recover from this packet loss.
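      
      A hedged sketch of what "indicate RX usage" can look like at the allocation
      site (the SKB_ALLOC_RX flag name and the flags argument are assumptions of
      this sketch):
      
          static struct sk_buff *example_rx_skb_alloc(unsigned int size)
          {
              /* an RX allocation may fall back to the PFMEMALLOC reserve;
               * the returned skb is then tagged via skb->pfmemalloc so it
               * can later be dropped for sockets unrelated to page reclaim */
              return __alloc_skb(size, GFP_ATOMIC, SKB_ALLOC_RX, NUMA_NO_NODE);
          }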
      
      [a.p.zijlstra@chello.nl: Ideas taken from various patches]
      [davem@davemloft.net: Use static branches, coding style corrections]
      [sebastian@breakpoint.cc: Avoid unnecessary cast, fix !CONFIG_NET build]
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: David S. Miller <davem@davemloft.net>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  27. 23 Jul, 2012 1 commit
  28. 16 Jun, 2012 1 commit
  29. 30 May, 2012 1 commit
    • skb: avoid unnecessary reallocations in __skb_cow · 617c8c11
      Authored by Felix Fietkau
      At the beginning of __skb_cow, headroom gets set to a minimum of
      NET_SKB_PAD. This causes unnecessary reallocations if the buffer was not
      cloned and the headroom is just below NET_SKB_PAD, but still more than the
      amount requested by the caller.
      This was showing up frequently in my tests on VLAN tx, where
      vlan_insert_tag calls skb_cow_head(skb, VLAN_HLEN).
      
      Locally generated packets should have enough headroom, and for forward
      paths, we already have NET_SKB_PAD bytes of headroom, so we don't need to
      add any extra space here.
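      
      A hedged sketch of the adjusted logic (not the literal hunk): reallocate
      only when headroom really is short or the skb is cloned, and round the
      delta up to NET_SKB_PAD only when a reallocation happens anyway:
      
          static inline int example_skb_cow(struct sk_buff *skb,
                                            unsigned int headroom, int cloned)
          {
              int delta = 0;
      
              if (headroom > skb_headroom(skb))
                  delta = headroom - skb_headroom(skb);
      
              if (delta || cloned)
                  return pskb_expand_head(skb, ALIGN(delta, NET_SKB_PAD),
                                          0, GFP_ATOMIC);
              return 0;   /* enough headroom and not cloned: no reallocation */
          }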
      Signed-off-by: Felix Fietkau <nbd@openwrt.org>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>