1. 08 Nov 2016, 1 commit
    • udp: do fwd memory scheduling on dequeue · 7c13f97f
      By Paolo Abeni
      A new argument is added to __skb_recv_datagram to provide
      an explicit skb destructor, invoked under the receive queue
      lock.
      The UDP protocol uses this argument to perform memory
      reclaiming on dequeue, so that the UDP protocol no longer
      sets skb->destructor.
      Instead, explicit memory reclaiming is performed at close() time and
      when skbs are removed from the receive queue.
      In-kernel UDP users now need to call the
      skb_recv_udp() variant instead of skb_recv_datagram() to
      properly perform memory accounting on dequeue.

      Overall, this allows the receive queue lock to be acquired
      only once on dequeue.
      
      Tested using pktgen with random src ports, 64 byte packets,
      wire-speed on a 10G link as the sender, and udp_sink as the receiver,
      using an l4 tuple rxhash to stress the contention, and one or more
      udp_sink instances with reuseport.
      
      nr sinks	vanilla		patched
      1		440		560
      3		2150		2300
      6		3650		3800
      9		4450		4600
      12		6250		6450
      
      v1 -> v2:
       - do rmem and allocated memory scheduling under the receive lock
       - do bulk scheduling in first_packet_length() and in udp_destruct_sock()
       - avoid the typedef for the dequeue callback
      Suggested-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7c13f97f
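      As an illustration of the new dequeue path, here is a minimal sketch of a
      hypothetical in-kernel UDP consumer; the skb_recv_udp() signature shown,
      (sk, flags, noblock, &err), is assumed from the skb_recv_datagram()
      convention described above.

          #include <net/udp.h>

          /* Hypothetical in-kernel UDP consumer: the skb is dequeued via the
           * UDP-specific variant so that forward-allocated receive memory is
           * reclaimed under the receive queue lock, instead of relying on a
           * per-skb destructor.
           */
          static int demo_consume_one(struct sock *sk)
          {
              int err = 0;
              struct sk_buff *skb;

              skb = skb_recv_udp(sk, 0, 1 /* noblock */, &err);  /* assumed signature */
              if (!skb)
                  return err;

              /* ... parse payload between skb->data and skb_tail_pointer(skb) ... */

              consume_skb(skb);    /* no destructor-based accounting needed */
              return 0;
          }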
  2. 30 Oct 2016, 1 commit
  3. 27 Oct 2016, 1 commit
  4. 04 Oct 2016, 1 commit
    • skb_splice_bits(): get rid of callback · 25869262
      By Al Viro
      Since pipe_lock is now the outermost lock, we don't need to drop/regain
      socket locks around the call of splice_to_pipe() from skb_splice_bits(),
      which removes the need for a socket-specific callback; we can just
      call splice_to_pipe() and be done with it.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      25869262
  5. 22 Sep 2016, 1 commit
  6. 21 Sep 2016, 1 commit
    • bpf: direct packet write and access for helpers for clsact progs · 36bbef52
      By Daniel Borkmann
      This work implements direct packet access for helpers and direct packet
      write in a similar fashion as already available for XDP types via commits
      4acf6c0b ("bpf: enable direct packet data write for xdp progs") and
      6841de8b ("bpf: allow helpers access the packet directly"), and as a
      complementary feature to the already available direct packet read for tc
      (cls/act) programs.
      
      To enable this, we need to introduce two helpers, bpf_skb_pull_data()
      and bpf_csum_update(). The first is generally needed for both read and
      write, because they would otherwise be limited to the current linear
      skb head. Usually, when the data_end test fails, programs just bail out,
      or, in the direct read case, use bpf_skb_load_bytes() as an alternative
      to overcome this limitation. If such data sits in non-linear parts, we
      can pull it in once with the new helper, retest and eventually
      access it.
      
      At the same time, this also makes sure the skb is uncloned, which is, of
      course, a necessary condition for direct write. As this needs to be an
      invariant for the write part only, the verifier detects writes and adds
      a prologue that is calling bpf_skb_pull_data() to effectively unclone the
      skb from the very beginning in case it is indeed cloned. The heuristic
      makes use of a similar trick that was done in 233577a2 ("net: filter:
      constify detection of pkt_type_offset"). This comes at zero cost for other
      programs that do not use the direct write feature. Should a program use
      this feature only sparsely and has read access for the most parts with,
      for example, drop return codes, then such write action can be delegated
      to a tail called program for mitigating this cost of potential uncloning
      to a late point in time where it would have been paid similarly with the
      bpf_skb_store_bytes() as well. The advantage of direct write is that the
      writes are inlined, whereas the helper cannot make any length assumptions
      and thus needs to generate a call to memcpy() even for small sizes; the
      cost of the helper call itself, including its sanity checks, is also
      avoided. Plus, when direct read is already used, we don't need to cache
      or re-check the data boundaries (the verifier invalidates previous checks
      for helpers that change skb->data), so more complex programs using rewrites
      can benefit from switching to direct read plus write.
      
      For direct packet access to helpers, we save the otherwise needed copy into
      a temp struct sitting on stack memory when use-case allows. Both facilities
      are enabled via may_access_direct_pkt_data() in verifier. For now, we limit
      this to map helpers and csum_diff, and can successively enable other helpers
      where we find it makes sense. Helpers that definitely cannot be allowed
      for this are those that are part of bpf_helper_changes_skb_data(), since
      they can change the underlying data, and those that write into memory,
      as this could happen for packet-typed args while still cloned. The
      bpf_csum_update() helper accommodates the fact that we need to fix up
      checksum_complete when using direct write instead of bpf_skb_store_bytes(),
      meaning programs can use available
      helpers like bpf_csum_diff(), and implement csum_add(), csum_sub(),
      csum_block_add(), csum_block_sub() equivalents in eBPF together with the
      new helper. A usage example will be provided for iproute2's examples/bpf/
      directory.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      36bbef52
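      A minimal tc/cls_bpf sketch of the pull-and-retest pattern described
      above; the section name and libbpf include are conventional assumptions,
      while bpf_skb_pull_data() and the data/data_end recheck follow the
      commit's description.

          #include <linux/bpf.h>
          #include <linux/pkt_cls.h>
          #include <bpf/bpf_helpers.h>

          SEC("classifier")
          int demo_pull_and_write(struct __sk_buff *skb)
          {
              const int need = 42;    /* arbitrary number of bytes we want linear */
              void *data = (void *)(long)skb->data;
              void *data_end = (void *)(long)skb->data_end;

              if (data + need > data_end) {
                  /* Data may sit in non-linear parts: pull it in once, then retest. */
                  if (bpf_skb_pull_data(skb, need) < 0)
                      return TC_ACT_OK;
                  data = (void *)(long)skb->data;
                  data_end = (void *)(long)skb->data_end;
                  if (data + need > data_end)
                      return TC_ACT_OK;
              }

              /* Direct write is now possible; the verifier's prologue has
               * uncloned the skb if that was needed.
               */
              ((__u8 *)data)[need - 1] = 0;
              return TC_ACT_OK;
          }

          char _license[] SEC("license") = "GPL";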
  7. 09 Sep 2016, 1 commit
    • tcp: use an RB tree for ooo receive queue · 9f5afeae
      By Yaogong Wang
      Over the years, the TCP BDP has increased by several orders of magnitude,
      and some people are considering reaching the 2 Gbytes limit.

      Even with the current window scale limit of 14, ~1 Gbytes maps to ~740,000
      MSS.

      In the presence of packet losses (or reorders), TCP stores incoming packets
      into an out-of-order queue, and the number of skbs sitting there waiting for
      the missing packets to be received can be in the 10^5 range.
      
      Most packets are appended to the tail of this queue, and when
      packets can finally be transferred to receive queue, we scan the queue
      from its head.
      
      However, in the presence of heavy losses, we might have to find an arbitrary
      point in this queue, involving a linear scan for every incoming packet,
      throwing away cpu caches.

      This patch converts it to an RB tree, to get bounded latencies.
      
      Yaogong wrote a preliminary patch about 2 years ago.
      Eric did the rebase, added ofo_last_skb cache, polishing and tests.
      
      Tested with the network dropping between 1 and 10 % of packets, with good
      success (about a 30 % increase in throughput in stress tests)
      
      Next step would be to also use an RB tree for the write queue at sender
      side ;)
      Signed-off-by: Yaogong Wang <wygivan@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Acked-By: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9f5afeae
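      For illustration only, a generic sketch of inserting an skb into an RB
      tree keyed by its TCP sequence number, in the spirit of the conversion
      described above. Everything beyond the rbtree API, TCP_SKB_CB() and
      before() is a simplified assumption; the real tcp_data_queue_ofo() also
      handles overlaps and the ofo_last_skb cache.

          #include <linux/rbtree.h>
          #include <linux/skbuff.h>
          #include <net/tcp.h>

          /* Insert @skb into @root ordered by TCP_SKB_CB(skb)->seq. */
          static void demo_ofo_insert(struct rb_root *root, struct sk_buff *skb)
          {
              struct rb_node **p = &root->rb_node, *parent = NULL;
              u32 seq = TCP_SKB_CB(skb)->seq;

              while (*p) {
                  struct sk_buff *skb1;

                  parent = *p;
                  skb1 = rb_entry(parent, struct sk_buff, rbnode);
                  if (before(seq, TCP_SKB_CB(skb1)->seq))
                      p = &parent->rb_left;
                  else
                      p = &parent->rb_right;
              }
              rb_link_node(&skb->rbnode, parent, p);
              rb_insert_color(&skb->rbnode, root);
          }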
  8. 27 Aug 2016, 1 commit
    • bridge: switchdev: Add forward mark support for stacked devices · 6bc506b4
      By Ido Schimmel
      switchdev_port_fwd_mark_set() is used to set the 'offload_fwd_mark' of
      port netdevs so that packets being flooded by the device won't be
      flooded twice.
      
      It works by assigning a unique identifier (the ifindex of the first
      bridge port) to bridge ports sharing the same parent ID. This prevents
      packets from being flooded twice by the same switch, but will flood
      packets through bridge ports belonging to a different switch.
      
      This method is problematic when stacked devices are taken into account,
      such as VLANs. In such cases, a physical port netdev can have upper
      devices being members in two different bridges, thus requiring two
      different 'offload_fwd_mark's to be configured on the port netdev, which
      is impossible.
      
      The main problem is that packet and netdev marking is performed at the
      physical netdev level, whereas flooding occurs between bridge ports,
      which are not necessarily port netdevs.
      
      Instead, packet and netdev marking should really be done in the bridge
      driver with the switch driver only telling it which packets it already
      forwarded. The bridge driver will mark such packets using the mark
      assigned to the ingress bridge port and will prevent the packet from
      being forwarded through any bridge port sharing the same mark (i.e.
      having the same parent ID).
      
      Remove the current switchdev 'offload_fwd_mark' implementation and
      instead implement the proposed method. In addition, make rocker - the
      sole user of the mark - use the proposed method.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6bc506b4
  9. 19 Aug 2016, 1 commit
    • bpf: add bpf_skb_change_tail helper · 5293efe6
      By Daniel Borkmann
      This work adds a bpf_skb_change_tail() helper for tc BPF programs. The
      basic idea is to expand or shrink the skb in a controlled manner. The
      eBPF program can then rewrite the rest via helpers like bpf_skb_store_bytes(),
      bpf_lX_csum_replace() and others rather than passing a raw buffer for
      writing here.
      
      bpf_skb_change_tail() is really a slow path helper and intended for
      replies with f.e. ICMP control messages. The concept is similar to other
      helpers like bpf_skb_change_proto(): keep the helper free of protocol
      specifics and let the BPF program mangle the remaining parts.
      A flags field has been added and is reserved for now, should we extend
      the helper in the future.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5293efe6
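      A rough sketch of the intended usage from a tc program: shrink the packet
      to a new length, then rewrite the remainder with other helpers. The
      length value, offset and section name here are illustrative assumptions.

          #include <linux/bpf.h>
          #include <linux/pkt_cls.h>
          #include <bpf/bpf_helpers.h>

          SEC("classifier")
          int demo_trim_reply(struct __sk_buff *skb)
          {
              const __u32 new_len = 96;    /* e.g. room for an ICMP-style reply */
              __u8 byte = 0;

              /* Expand or shrink the skb in a controlled manner; flags must be
               * 0 for now, the field is reserved for future extensions.
               */
              if (bpf_skb_change_tail(skb, new_len, 0) < 0)
                  return TC_ACT_SHOT;

              /* Rewrite the resized packet via helpers such as bpf_skb_store_bytes(). */
              if (bpf_skb_store_bytes(skb, new_len - 1, &byte, sizeof(byte), 0) < 0)
                  return TC_ACT_SHOT;

              return TC_ACT_OK;
          }

          char _license[] SEC("license") = "GPL";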
  10. 09 Aug 2016, 1 commit
    • bpf: fix checksum fixups on bpf_skb_store_bytes · 479ffccc
      By Daniel Borkmann
      bpf_skb_store_bytes() invocations above L2 header need BPF_F_RECOMPUTE_CSUM
      flag for updates, so that CHECKSUM_COMPLETE will be fixed up along the way.
      Where we ran into an issue with bpf_skb_store_bytes() is when we did a
      single-byte update on the IPv6 hoplimit despite using BPF_F_RECOMPUTE_CSUM
      flag; simple ping via ICMPv6 triggered a hw csum failure as a result. The
      underlying issue has been tracked down to a buffer alignment issue.
      
      Meaning, that csum_partial() computations via skb_postpull_rcsum() and
      skb_postpush_rcsum() pair invoked had a wrong result since they operated on
      an odd address for the hoplimit, while other computations were done on an
      even address. This mix doesn't work as-is with skb_postpull_rcsum(),
      skb_postpush_rcsum() pair as it always expects at least half-word alignment
      of input buffers, which is normally the case. Thus, instead of these helpers
      using csum_sub() and (implicitly) csum_add(), we need to use csum_block_sub(),
      csum_block_add(), respectively. For unaligned offsets, they rotate the sum
      to align it to a half-word boundary again, otherwise they work the same as
      csum_sub() and csum_add().
      
      Adding __skb_postpull_rcsum(), __skb_postpush_rcsum() variants that take the
      offset as an input and adapting bpf_skb_store_bytes() to them fixes the hw
      csum failures again. The skb_postpull_rcsum(), skb_postpush_rcsum() helpers
      use a 0 constant for offset so that the compiler optimizes the offset & 1
      test away and generates the same code as with csum_sub()/_add().
      
      Fixes: 608cd71a ("tc: bpf: generalize pedit action")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      479ffccc
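      A sketch of what an offset-aware pull variant can look like, following
      the description above: when the pulled region starts at an odd offset,
      csum_block_sub() byte-rotates the partial sum before folding it in, while
      for even offsets it behaves like csum_sub(). This is a minimal
      illustration under those assumptions, not the exact upstream helper.

          #include <linux/skbuff.h>
          #include <net/checksum.h>

          /* Subtract the checksum of @len bytes at @start, which begin at byte
           * offset @off within the checksummed region, from skb->csum.
           */
          static inline void demo_postpull_rcsum(struct sk_buff *skb,
                                                 const void *start,
                                                 unsigned int len,
                                                 unsigned int off)
          {
              if (skb->ip_summed == CHECKSUM_COMPLETE)
                  skb->csum = csum_block_sub(skb->csum,
                                             csum_partial(start, len, 0), off);
          }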
  11. 05 Jul 2016, 1 commit
  12. 02 Jul 2016, 2 commits
  13. 04 Jun 2016, 3 commits
  14. 21 May 2016, 2 commits
    • net: suppress warnings on dev_alloc_skb · 95829b3a
      By Neil Horman
      Noticed an allocation failure in a network driver the other day on a 32 bit
      system:
      
      DMA-API: debugging out of memory - disabling
      bnx2fc: adapter_lookup: hba NULL
      lldpad: page allocation failure. order:0, mode:0x4120
      Pid: 4556, comm: lldpad Not tainted 2.6.32-639.el6.i686.debug #1
      Call Trace:
       [<c08a4086>] ? printk+0x19/0x23
       [<c05166a4>] ? __alloc_pages_nodemask+0x664/0x830
       [<c0649d02>] ? free_object+0x82/0xa0
       [<fb4e2c9b>] ? ixgbe_alloc_rx_buffers+0x10b/0x1d0 [ixgbe]
       [<fb4e2fff>] ? ixgbe_configure_rx_ring+0x29f/0x420 [ixgbe]
       [<fb4e228c>] ? ixgbe_configure_tx_ring+0x15c/0x220 [ixgbe]
       [<fb4e3709>] ? ixgbe_configure+0x589/0xc00 [ixgbe]
       [<fb4e7be7>] ? ixgbe_open+0xa7/0x5c0 [ixgbe]
       [<fb503ce6>] ? ixgbe_init_interrupt_scheme+0x5b6/0x970 [ixgbe]
       [<fb4e8e54>] ? ixgbe_setup_tc+0x1a4/0x260 [ixgbe]
       [<fb505a9f>] ? ixgbe_dcbnl_set_state+0x7f/0x90 [ixgbe]
       [<c088d80d>] ? dcb_doit+0x10ed/0x16d0
      ...
      
      Thought that perhaps the big splat in the logs wasn't really necessary, as
      all call sites for dev_alloc_skb:
      
      a) check the return code for the function
      
      and
      
      b) either print their own error message or have a recovery path that makes the
      warning moot.
      
      Fix it by modifying dev_alloc_pages to pass __GFP_NOWARN as a gfp flag to
      suppress the warning.

      Applies to the net tree.
      Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
      CC: "David S. Miller" <davem@davemloft.net>
      CC: Eric Dumazet <eric.dumazet@gmail.com>
      CC: Alexander Duyck <alexander.duyck@gmail.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      95829b3a
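      The pattern the commit relies on, sketched for a hypothetical driver:
      every dev_alloc_skb() caller already checks the return value and has its
      own recovery path, so the generic page-allocation splat adds little.
      The function and field usage below is an illustrative assumption, not
      any particular driver's code.

          #include <linux/errno.h>
          #include <linux/netdevice.h>
          #include <linux/skbuff.h>

          /* Hypothetical RX refill path: on allocation failure we just account
           * the drop and try again later.
           */
          static int demo_rx_refill(struct net_device *dev, unsigned int len)
          {
              struct sk_buff *skb = dev_alloc_skb(len);

              if (!skb) {
                  dev->stats.rx_dropped++;
                  return -ENOMEM;
              }

              /* ... map the buffer and hand it to the hardware ring ... */
              return 0;
          }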
    • net: define gso types for IPx over IPv4 and IPv6 · 7e13318d
      By Tom Herbert
      This patch defines two new GSO definitions SKB_GSO_IPXIP4 and
      SKB_GSO_IPXIP6 along with corresponding NETIF_F_GSO_IPXIP4 and
      NETIF_F_GSO_IPXIP6. These are used to describe IP-in-IP
      tunnels and what the outer protocol is. The inner protocol
      can be deduced from other GSO types (e.g. SKB_GSO_TCPV4 and
      SKB_GSO_TCPV6). The GSO types of SKB_GSO_IPIP and SKB_GSO_SIT
      are removed (these are both instances of SKB_GSO_IPXIP4).
      SKB_GSO_IPXIP6 will be used when support for GSO with IP
      encapsulation over IPv6 is added.
      Signed-off-by: Tom Herbert <tom@herbertland.com>
      Acked-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7e13318d
  15. 03 May 2016, 1 commit
    • net: relax expensive skb_unclone() in iptunnel_handle_offloads() · 9580bf2e
      By Eric Dumazet
      Locally generated TCP GSO packets that have to go through a GRE/SIT/IPIP
      tunnel currently incur an expensive skb_unclone().
      
      Reallocating skb->head is a lot of work.
      
      The test should really check whether a 'real clone' of the packet was done.
      
      TCP does not care if the original gso_type is changed while the packet
      travels in the stack.
      
      This adds skb_header_unclone(), which is a variant of skb_unclone()
      using the skb_header_cloned() check instead of skb_cloned().
      
      This variant can probably be used from other points.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9580bf2e
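      A minimal sketch of such a header-only unclone, based on the description
      above; the upstream helper may differ in details.

          #include <linux/skbuff.h>

          /* Like skb_unclone(), but only reallocate skb->head when the header
           * itself is shared, using skb_header_cloned() instead of skb_cloned().
           */
          static inline int demo_skb_header_unclone(struct sk_buff *skb, gfp_t pri)
          {
              if (skb_header_cloned(skb))
                  return pskb_expand_head(skb, 0, 0, pri);

              return 0;
          }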
  16. 29 Apr 2016, 1 commit
  17. 26 Apr 2016, 1 commit
    • skbuff: Add pskb_extract() helper function · 6fa01ccd
      By Sowmini Varadhan
      A pattern of skb usage seen in modules such as RDS-TCP is to
      extract `to_copy' bytes from the received TCP segment, starting
      at some offset `off' into a new skb `clone'. This is done in
      the ->data_ready callback, where the clone skb is queued up for rx on
      the PF_RDS socket, while the parent TCP segment is returned unchanged
      back to the TCP engine.
      
      The existing code uses the sequence
      	clone = skb_clone(..);
      	pskb_pull(clone, off, ..);
      	pskb_trim(clone, to_copy, ..);
      with the intention of discarding the first `off' bytes. However,
      skb_clone() + pskb_pull() implies pskb_expand_head(), which ends
      up doing a redundant memcpy of bytes that will then get discarded
      in __pskb_pull_tail().
      
      To avoid this inefficiency, this commit adds pskb_extract() that
      creates the clone, and memcpy's only the relevant header/frag/frag_list
      to the start of `clone'. pskb_trim() is then invoked to trim clone
      down to the requested to_copy bytes.
      Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6fa01ccd
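      The before/after usage, sketched under the assumption that pskb_extract()
      takes (skb, off, to_copy, gfp) and returns the new clone as described;
      the demo_ function names are illustrative.

          #include <linux/skbuff.h>

          /* Old pattern: clone + pull + trim, paying for a pskb_expand_head()
           * copy of bytes that are about to be discarded anyway.
           */
          static struct sk_buff *demo_extract_old(struct sk_buff *skb, int off, int to_copy)
          {
              struct sk_buff *clone = skb_clone(skb, GFP_ATOMIC);

              if (!clone)
                  return NULL;
              if (!pskb_pull(clone, off) || pskb_trim(clone, to_copy)) {
                  kfree_skb(clone);
                  return NULL;
              }
              return clone;
          }

          /* New pattern: copy only the relevant header/frag/frag_list state. */
          static struct sk_buff *demo_extract_new(struct sk_buff *skb, int off, int to_copy)
          {
              return pskb_extract(skb, off, to_copy, GFP_ATOMIC);
          }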
  18. 15 Apr 2016, 2 commits
    • GSO: Support partial segmentation offload · 802ab55a
      By Alexander Duyck
      This patch adds support for something I am referring to as GSO partial.
      The basic idea is that we can support a broader range of devices for
      segmentation if we use fixed outer headers and have the hardware only
      really deal with segmenting the inner header.  The idea behind the naming
      is that everything before csum_start will be fixed headers,
      and everything after will be the region that is handled by hardware.
      
      With the current implementation it allows us to add support for the
      following GSO types with an inner TSO_MANGLEID or TSO6 offload:
      NETIF_F_GSO_GRE
      NETIF_F_GSO_GRE_CSUM
      NETIF_F_GSO_IPIP
      NETIF_F_GSO_SIT
      NETIF_F_UDP_TUNNEL
      NETIF_F_UDP_TUNNEL_CSUM
      
      In the case of hardware that already supports tunneling we may be able to
      extend this further to support TSO_TCPV4 without TSO_MANGLEID if the
      hardware can support updating inner IPv4 headers.
      Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      802ab55a
    • GSO: Add GSO type for fixed IPv4 ID · cbc53e08
      By Alexander Duyck
      This patch adds support for TSO using IPv4 headers with a fixed IP ID
      field.  This is meant to allow us to do a lossless GRO in the case of TCP
      flows that use a fixed IP ID such as those that convert IPv6 header to IPv4
      headers.
      
      In addition I am adding a feature that for now I am referring to as TSO
      with IP ID mangling.  Basically, when this flag is enabled the device has the
      option to either output the flow with incrementing IP IDs or with a fixed
      IP ID regardless of what the original IP ID ordering was.  This is useful
      in cases where the DF bit is set and we do not care if the original IP ID
      value is maintained.
      Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      cbc53e08
  19. 06 Apr 2016, 1 commit
    • udp: enable MSG_PEEK at non-zero offset · 627d2d6b
      By samanthakumar
      Enable peeking at UDP datagrams at the offset specified with socket
      option SOL_SOCKET/SO_PEEK_OFF. Peek at any datagram in the queue, up
      to the end of the given datagram.
      
      Implement the SO_PEEK_OFF semantics introduced in commit ef64a54f
      ("sock: Introduce the SO_PEEK_OFF sock option"). Increase the offset
      on peek, decrease it on regular reads.
      
      When peeking, always checksum the packet immediately, to avoid
      recomputation on subsequent peeks and final read.
      
      The socket lock is not held for the duration of udp_recvmsg, so
      peek and read operations can run concurrently. Only the last store
      to sk_peek_off is preserved.
      Signed-off-by: Sam Kumar <samanthakumar@google.com>
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      627d2d6b
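      From userspace, the semantics look roughly like this sketch: enable
      SO_PEEK_OFF, then each MSG_PEEK advances the peek offset by the bytes
      peeked, while a regular read rewinds it. Error handling is trimmed and
      the buffer size is arbitrary.

          #include <sys/socket.h>
          #include <sys/types.h>

          /* Peek at successive chunks of the queued data, then consume
           * normally.  Assumes @fd is a bound UDP socket with data queued.
           */
          static void demo_peek_off(int fd)
          {
              char buf[128];
              int off = 0;    /* start peeking from offset 0 */

              setsockopt(fd, SOL_SOCKET, SO_PEEK_OFF, &off, sizeof(off));

              recv(fd, buf, sizeof(buf), MSG_PEEK);  /* peek; offset advances   */
              recv(fd, buf, sizeof(buf), MSG_PEEK);  /* peek at the next chunk  */

              recv(fd, buf, sizeof(buf), 0);         /* regular read; offset decreases */
          }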
  20. 04 Mar 2016, 1 commit
    • mld, igmp: Fix reserved tailroom calculation · 1837b2e2
      By Benjamin Poirier
      The current reserved_tailroom calculation fails to take hlen and tlen into
      account.
      
      skb:
      [__hlen__|__data____________|__tlen___|__extra__]
      ^                                               ^
      head                                            skb_end_offset
      
      In this representation, hlen + data + tlen is the size passed to alloc_skb.
      "extra" is the extra space made available in __alloc_skb because of
      rounding up by kmalloc. We can reorder the representation like so:
      
      [__hlen__|__data____________|__extra__|__tlen___]
      ^                                               ^
      head                                            skb_end_offset
      
      The maximum space available for ip headers and payload without
      fragmentation is min(mtu, data + extra). Therefore,
      reserved_tailroom
      = data + extra + tlen - min(mtu, data + extra)
      = skb_end_offset - hlen - min(mtu, skb_end_offset - hlen - tlen)
      = skb_tailroom - min(mtu, skb_tailroom - tlen) ; after skb_reserve(hlen)
      
      Compare the second line to the current expression:
      reserved_tailroom = skb_end_offset - min(mtu, skb_end_offset)
      and we can see that hlen and tlen are not taken into account.
      
      The min() in the third line can be expanded into:
      if mtu < skb_tailroom - tlen:
      	reserved_tailroom = skb_tailroom - mtu
      else:
      	reserved_tailroom = tlen
      
      Depending on hlen, tlen, mtu and the number of multicast address records,
      the current code may output skbs that have less tailroom than
      dev->needed_tailroom or it may output more skbs than needed because not all
      space available is used.
      
      Fixes: 4c672e4b ("ipv6: mld: fix add_grhead skb_over_panic for devs with large MTUs")
      Signed-off-by: Benjamin Poirier <bpoirier@suse.com>
      Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1837b2e2
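      A worked example of the expressions above, with illustrative numbers:
      take hlen = 16, tlen = 16, a 1024-byte data request rounded up by kmalloc
      so that skb_end_offset = 1080, and mtu = 1000.  After skb_reserve(hlen),
      skb_tailroom = 1064.  The corrected expression gives
      reserved_tailroom = 1064 - min(1000, 1064 - 16) = 64, so exactly mtu
      bytes remain for payload while tlen bytes of tailroom stay reserved.
      The old expression gives 1080 - min(1000, 1080) = 80, needlessly
      reserving 16 extra bytes and thus splitting the data across more skbs
      than needed; with other hlen/tlen/mtu combinations it can instead
      reserve less than dev->needed_tailroom.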
  21. 02 Mar 2016, 1 commit
  22. 22 Feb 2016, 1 commit
  23. 19 Feb 2016, 1 commit
    • net: Optimize local checksum offload · 9e74a6da
      By Alexander Duyck
      This patch takes advantage of several assumptions we can make about the
      headers of the frame in order to reduce overall processing overhead for
      computing the outer header checksum.
      
      First we can assume the entire header is in the region pointed to by
      skb->head as this is what csum_start is based on.
      
      Second, as a result of our first assumption, we can just call csum_partial
      instead of making a call to skb_checksum which would end up having to
      configure things so that we could walk through the frags list.
      Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      9e74a6da
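      A sketch of the idea under those assumptions: because everything from the
      outer transport header up to the inner csum_start lives in skb->head, the
      outer checksum contribution can be folded with a single csum_partial()
      over that region, seeded with the complement of the inner checksum. This
      is a simplified illustration, not necessarily the exact upstream code.

          #include <linux/skbuff.h>
          #include <net/checksum.h>

          /* Checksum of [outer transport header, inner csum_start) plus the
           * complement of the inner checksum, all sitting in skb->head, so no
           * frag walking (i.e. no skb_checksum()) is needed.
           */
          static inline __wsum demo_local_csum(struct sk_buff *skb)
          {
              unsigned char *csum_start = skb->head + skb->csum_start;
              unsigned char *l4_hdr = skb_transport_header(skb);
              __wsum partial;

              /* Start from the complement of the inner (pseudo-header) checksum. */
              partial = ~csum_unfold(*(__force __sum16 *)(csum_start + skb->csum_offset));

              /* One csum_partial() over the headers is enough. */
              return csum_partial(l4_hdr, csum_start - l4_hdr, partial);
          }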
  24. 12 Feb 2016, 4 commits
  25. 11 Feb 2016, 2 commits
    • net: Store checksum result for offloaded GSO checksums · 08b64fcc
      By Alexander Duyck
      This patch makes it so that we can offload the checksums for a packet up
      to a certain point and then begin computing the checksums via software.
      Setting this up is fairly straightforward as all we need to do is reset
      the values stored in csum and csum_start for the GSO context block.
      
      One complication for this is remote checksum offload.  In order to allow
      the inner checksums to be offloaded while computing the outer checksum
      manually we needed to have some way of indicating that the offload wasn't
      real.  In order to do that I replaced CHECKSUM_PARTIAL with
      CHECKSUM_UNNECESSARY in the case of us computing checksums for the outer
      header while skipping computing checksums for the inner headers.  We clean
      up the ip_summed flag and set it to either CHECKSUM_PARTIAL or
      CHECKSUM_NONE once we hand the packet off to the next lower level.
      Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      08b64fcc
    • net: Move GSO csum into SKB_GSO_CB · 76443456
      By Alexander Duyck
      This patch moves the checksum maintained by GSO out of skb->csum and into
      the GSO context block in order to allow for us to work on outer checksums
      while maintaining the inner checksum offsets in the case of the inner
      checksum being offloaded, while the outer checksums will be computed.
      
      While updating the code I also did a minor clean-up on gso_make_checksum.
      The change is mostly to make it so that we store the values and compute the
      checksum instead of computing the checksum and then storing the values we
      needed to update.
      Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
      Acked-by: Tom Herbert <tom@herbertland.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      76443456
  26. 09 Feb 2016, 1 commit
  27. 16 Jan 2016, 1 commit
  28. 11 Jan 2016, 1 commit
    • bpf: add skb_postpush_rcsum and fix dev_forward_skb occasions · f8ffad69
      By Daniel Borkmann
      Add a small helper skb_postpush_rcsum() and fix up redirect locations
      that need CHECKSUM_COMPLETE fixups on ingress. dev_forward_skb() expects
      a proper csum that also covers the Ethernet header, f.e. since 2c26d34b
      ("net/core: Handle csum for CHECKSUM_COMPLETE VXLAN forwarding"), we
      also do skb_postpull_rcsum() after pulling Ethernet header off via
      eth_type_trans().
      
      When using eBPF in a netns setup f.e. with vxlan in collect metadata mode,
      I can trigger the following csum issue with an IPv6 setup:
      
        [  505.144065] dummy1: hw csum failure
        [...]
        [  505.144108] Call Trace:
        [  505.144112]  <IRQ>  [<ffffffff81372f08>] dump_stack+0x44/0x5c
        [  505.144134]  [<ffffffff81607cea>] netdev_rx_csum_fault+0x3a/0x40
        [  505.144142]  [<ffffffff815fee3f>] __skb_checksum_complete+0xcf/0xe0
        [  505.144149]  [<ffffffff816f0902>] nf_ip6_checksum+0xb2/0x120
        [  505.144161]  [<ffffffffa08c0e0e>] icmpv6_error+0x17e/0x328 [nf_conntrack_ipv6]
        [  505.144170]  [<ffffffffa0898eca>] ? ip6t_do_table+0x2fa/0x645 [ip6_tables]
        [  505.144177]  [<ffffffffa08c0725>] ? ipv6_get_l4proto+0x65/0xd0 [nf_conntrack_ipv6]
        [  505.144189]  [<ffffffffa06c9a12>] nf_conntrack_in+0xc2/0x5a0 [nf_conntrack]
        [  505.144196]  [<ffffffffa08c039c>] ipv6_conntrack_in+0x1c/0x20 [nf_conntrack_ipv6]
        [  505.144204]  [<ffffffff8164385d>] nf_iterate+0x5d/0x70
        [  505.144210]  [<ffffffff816438d6>] nf_hook_slow+0x66/0xc0
        [  505.144218]  [<ffffffff816bd302>] ipv6_rcv+0x3f2/0x4f0
        [  505.144225]  [<ffffffff816bca40>] ? ip6_make_skb+0x1b0/0x1b0
        [  505.144232]  [<ffffffff8160b77b>] __netif_receive_skb_core+0x36b/0x9a0
        [  505.144239]  [<ffffffff8160bdc8>] ? __netif_receive_skb+0x18/0x60
        [  505.144245]  [<ffffffff8160bdc8>] __netif_receive_skb+0x18/0x60
        [  505.144252]  [<ffffffff8160ccff>] process_backlog+0x9f/0x140
        [  505.144259]  [<ffffffff8160c4a5>] net_rx_action+0x145/0x320
        [...]
      
      What happens is that on ingress, we push Ethernet header back in, either
      from cls_bpf or right before skb_do_redirect(), but without updating csum.
      The "hw csum failure" can be fixed by using the new skb_postpush_rcsum()
      helper for the dev_forward_skb() case to correct the csum diff again.
      
      Thanks to Hannes Frederic Sowa for the csum_partial() idea!
      
      Fixes: 3896d655 ("bpf: introduce bpf_clone_redirect() helper")
      Fixes: 27b29f63 ("bpf: add bpf_redirect() helper")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f8ffad69
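      A minimal sketch of the push-side counterpart described above, together
      with the redirect-path usage it fixes; the helper body is assumed to
      mirror skb_postpull_rcsum() with csum_partial()/csum_add semantics, and
      the demo_ names are illustrative.

          #include <linux/skbuff.h>
          #include <net/checksum.h>

          /* After pushing @len bytes of header back at @start, add their sum to
           * skb->csum so CHECKSUM_COMPLETE still covers the Ethernet header.
           */
          static inline void demo_postpush_rcsum(struct sk_buff *skb,
                                                 const void *start, unsigned int len)
          {
              if (skb->ip_summed == CHECKSUM_COMPLETE)
                  skb->csum = csum_partial(start, len, skb->csum);
          }

          /* Redirect-path usage: re-push the mac header and fix the csum before
           * handing the skb to dev_forward_skb().
           */
          static void demo_push_and_fix(struct sk_buff *skb)
          {
              skb_push(skb, skb->mac_len);
              demo_postpush_rcsum(skb, skb->data, skb->mac_len);
          }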
  29. 16 Dec 2015, 2 commits
  30. 15 Dec 2015, 1 commit