1. 27 1月, 2017 1 次提交
  2. 30 12月, 2016 1 次提交
    • Z
      ipv6: Should use consistent conditional judgement for ip6 fragment between... · e4c5e13a
      Zheng Li 提交于
      ipv6: Should use consistent conditional judgement for ip6 fragment between __ip6_append_data and ip6_finish_output
      
      There is an inconsistent conditional judgement between __ip6_append_data
      and ip6_finish_output functions, the variable length in __ip6_append_data
      just include the length of application's payload and udp6 header, don't
      include the length of ipv6 header, but in ip6_finish_output use
      (skb->len > ip6_skb_dst_mtu(skb)) as judgement, and skb->len include the
      length of ipv6 header.
      
      That causes some particular application's udp6 payloads whose length are
      between (MTU - IPv6 Header) and MTU were fragmented by ip6_fragment even
      though the rst->dev support UFO feature.
      
      Add the length of ipv6 header to length in __ip6_append_data to keep
      consistent conditional judgement as ip6_finish_output for ip6 fragment.
      Signed-off-by: NZheng Li <james.z.li@ericsson.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e4c5e13a
  3. 26 11月, 2016 1 次提交
    • D
      net: ipv4, ipv6: run cgroup eBPF egress programs · 33b48679
      Daniel Mack 提交于
      If the cgroup associated with the receiving socket has an eBPF
      programs installed, run them from ip_output(), ip6_output() and
      ip_mc_output(). From mentioned functions we have two socket contexts
      as per 7026b1dd ("netfilter: Pass socket pointer down through
      okfn()."). We explicitly need to use sk instead of skb->sk here,
      since otherwise the same program would run multiple times on egress
      when encap devices are involved, which is not desired in our case.
      
      eBPF programs used in this context are expected to either return 1 to
      let the packet pass, or != 1 to drop them. The programs have access to
      the skb through bpf_skb_load_bytes(), and the payload starts at the
      network headers (L3).
      
      Note that cgroup_bpf_run_filter() is stubbed out as static inline nop
      for !CONFIG_CGROUP_BPF, and is otherwise guarded by a static key if
      the feature is unused.
      Signed-off-by: NDaniel Mack <daniel@zonque.org>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      33b48679
  4. 20 11月, 2016 1 次提交
    • A
      net: fix bogus cast in skb_pagelen() and use unsigned variables · c72d8cda
      Alexey Dobriyan 提交于
      1) cast to "int" is unnecessary:
         u8 will be promoted to int before decrementing,
         small positive numbers fit into "int", so their values won't be changed
         during promotion.
      
         Once everything is int including loop counters, signedness doesn't
         matter: 32-bit operations will stay 32-bit operations.
      
         But! Someone tried to make this loop smart by making everything of
         the same type apparently in an attempt to optimise it.
         Do the optimization, just differently.
         Do the cast where it matters. :^)
      
      2) frag size is unsigned entity and sum of fragments sizes is also
         unsigned.
      
      Make everything unsigned, leave no MOVSX instruction behind.
      
      	add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-4 (-4)
      	function                                     old     new   delta
      	skb_cow_data                                 835     834      -1
      	ip_do_fragment                              2549    2548      -1
      	ip6_fragment                                3130    3128      -2
      	Total: Before=154865032, After=154865028, chg -0.00%
      Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c72d8cda
  5. 10 11月, 2016 1 次提交
  6. 01 11月, 2016 1 次提交
    • J
      ipv6: Don't use ufo handling on later transformed packets · f89c56ce
      Jakub Sitnicki 提交于
      Similar to commit c146066a ("ipv4: Don't use ufo handling on later
      transformed packets"), don't perform UFO on packets that will be IPsec
      transformed. To detect it we rely on the fact that headerlen in
      dst_entry is non-zero only for transformation bundles (xfrm_dst
      objects).
      
      Unwanted segmentation can be observed with a NETIF_F_UFO capable device,
      such as a dummy device:
      
        DEV=dum0 LEN=1493
      
        ip li add $DEV type dummy
        ip addr add fc00::1/64 dev $DEV nodad
        ip link set $DEV up
        ip xfrm policy add dir out src fc00::1 dst fc00::2 \
           tmpl src fc00::1 dst fc00::2 proto esp spi 1
        ip xfrm state add src fc00::1 dst fc00::2 \
           proto esp spi 1 enc 'aes' 0x0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b
      
        tcpdump -n -nn -i $DEV -t &
        socat /dev/zero,readbytes=$LEN udp6:[fc00::2]:$LEN
      
      tcpdump output before:
      
        IP6 fc00::1 > fc00::2: frag (0|1448) ESP(spi=0x00000001,seq=0x1), length 1448
        IP6 fc00::1 > fc00::2: frag (1448|48)
        IP6 fc00::1 > fc00::2: ESP(spi=0x00000001,seq=0x2), length 88
      
      ... and after:
      
        IP6 fc00::1 > fc00::2: frag (0|1448) ESP(spi=0x00000001,seq=0x1), length 1448
        IP6 fc00::1 > fc00::2: frag (1448|80)
      
      Fixes: e89e9cf5 ("[IPv4/IPv6]: UFO Scatter-gather approach")
      Signed-off-by: NJakub Sitnicki <jkbs@redhat.com>
      Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f89c56ce
  7. 11 9月, 2016 3 次提交
  8. 31 8月, 2016 1 次提交
    • R
      net: lwtunnel: Handle fragmentation · 14972cbd
      Roopa Prabhu 提交于
      Today mpls iptunnel lwtunnel_output redirect expects the tunnel
      output function to handle fragmentation. This is ok but can be
      avoided if we did not do the mpls output redirect too early.
      ie we could wait until ip fragmentation is done and then call
      mpls output for each ip fragment.
      
      To make this work we will need,
      1) the lwtunnel state to carry encap headroom
      2) and do the redirect to the encap output handler on the ip fragment
      (essentially do the output redirect after fragmentation)
      
      This patch adds tunnel headroom in lwtstate to make sure we
      account for tunnel data in mtu calculations during fragmentation
      and adds new xmit redirect handler to redirect to lwtunnel xmit func
      after ip fragmentation.
      
      This includes IPV6 and some mtu fixes and testing from David Ahern.
      Signed-off-by: NRoopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      14972cbd
  9. 18 6月, 2016 1 次提交
    • D
      net: vrf: Implement get_saddr for IPv6 · 0d240e78
      David Ahern 提交于
      IPv6 source address selection needs to consider the real egress route.
      Similar to IPv4 implement a get_saddr6 method which is called if
      source address has not been set.  The get_saddr6 method does a full
      lookup which means pulling a route from the VRF FIB table and properly
      considering linklocal/multicast destination addresses. Lookup failures
      (eg., unreachable) then cause the source address selection to fail
      which gets propagated back to the caller.
      Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0d240e78
  10. 09 6月, 2016 1 次提交
    • J
      ipv6: Skip XFRM lookup if dst_entry in socket cache is valid · 00bc0ef5
      Jakub Sitnicki 提交于
      At present we perform an xfrm_lookup() for each UDPv6 message we
      send. The lookup involves querying the flow cache (flow_cache_lookup)
      and, in case of a cache miss, creating an XFRM bundle.
      
      If we miss the flow cache, we can end up creating a new bundle and
      deriving the path MTU (xfrm_init_pmtu) from on an already transformed
      dst_entry, which we pass from the socket cache (sk->sk_dst_cache) down
      to xfrm_lookup(). This can happen only if we're caching the dst_entry
      in the socket, that is when we're using a connected UDP socket.
      
      To put it another way, the path MTU shrinks each time we miss the flow
      cache, which later on leads to incorrectly fragmented payload. It can
      be observed with ESPv6 in transport mode:
      
        1) Set up a transformation and lower the MTU to trigger fragmentation
          # ip xfrm policy add dir out src ::1 dst ::1 \
            tmpl src ::1 dst ::1 proto esp spi 1
          # ip xfrm state add src ::1 dst ::1 \
            proto esp spi 1 enc 'aes' 0x0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b
          # ip link set dev lo mtu 1500
      
        2) Monitor the packet flow and set up an UDP sink
          # tcpdump -ni lo -ttt &
          # socat udp6-listen:12345,fork /dev/null &
      
        3) Send a datagram that needs fragmentation with a connected socket
          # perl -e 'print "@" x 1470 | socat - udp6:[::1]:12345
          2016/06/07 18:52:52 socat[724] E read(3, 0x555bb3d5ba00, 8192): Protocol error
          00:00:00.000000 IP6 ::1 > ::1: frag (0|1448) ESP(spi=0x00000001,seq=0x2), length 1448
          00:00:00.000014 IP6 ::1 > ::1: frag (1448|32)
          00:00:00.000050 IP6 ::1 > ::1: ESP(spi=0x00000001,seq=0x3), length 1272
          (^ ICMPv6 Parameter Problem)
          00:00:00.000022 IP6 ::1 > ::1: ESP(spi=0x00000001,seq=0x5), length 136
      
        4) Compare it to a non-connected socket
          # perl -e 'print "@" x 1500' | socat - udp6-sendto:[::1]:12345
          00:00:40.535488 IP6 ::1 > ::1: frag (0|1448) ESP(spi=0x00000001,seq=0x6), length 1448
          00:00:00.000010 IP6 ::1 > ::1: frag (1448|64)
      
      What happens in step (3) is:
      
        1) when connecting the socket in __ip6_datagram_connect(), we
           perform an XFRM lookup, miss the flow cache, create an XFRM
           bundle, and cache the destination,
      
        2) afterwards, when sending the datagram, we perform an XFRM lookup,
           again, miss the flow cache (due to mismatch of flowi6_iif and
           flowi6_oif, which is an issue of its own), and recreate an XFRM
           bundle based on the cached (and already transformed) destination.
      
      To prevent the recreation of an XFRM bundle, avoid an XFRM lookup
      altogether whenever we already have a destination entry cached in the
      socket. This prevents the path MTU shrinkage and brings us on par with
      UDPv4.
      
      The fix also benefits connected PINGv6 sockets, another user of
      ip6_sk_dst_lookup_flow(), who also suffer messages being transformed
      twice.
      
      Joint work with Hannes Frederic Sowa.
      Reported-by: NJan Tluka <jtluka@redhat.com>
      Signed-off-by: NJakub Sitnicki <jkbs@redhat.com>
      Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      00bc0ef5
  11. 04 6月, 2016 1 次提交
  12. 04 5月, 2016 1 次提交
    • W
      ipv6: add new struct ipcm6_cookie · 26879da5
      Wei Wang 提交于
      In the sendmsg function of UDP, raw, ICMP and l2tp sockets, we use local
      variables like hlimits, tclass, opt and dontfrag and pass them to corresponding
      functions like ip6_make_skb, ip6_append_data and xxx_push_pending_frames.
      This is not a good practice and makes it hard to add new parameters.
      This fix introduces a new struct ipcm6_cookie similar to ipcm_cookie in
      ipv4 and include the above mentioned variables. And we only pass the
      pointer to this structure to corresponding functions. This makes it easier
      to add new parameters in the future and makes the function cleaner.
      Signed-off-by: NWei Wang <weiwan@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      26879da5
  13. 28 4月, 2016 1 次提交
  14. 08 4月, 2016 1 次提交
    • J
      ipv6: Count in extension headers in skb->network_header · 3ba3458f
      Jakub Sitnicki 提交于
      When sending a UDPv6 message longer than MTU, account for the length
      of fragmentable IPv6 extension headers in skb->network_header offset.
      Same as we do in alloc_new_skb path in __ip6_append_data().
      
      This ensures that later on __ip6_make_skb() will make space in
      headroom for fragmentable extension headers:
      
      	/* move skb->data to ip header from ext header */
      	if (skb->data < skb_network_header(skb))
      		__skb_pull(skb, skb_network_offset(skb));
      
      Prevents a splat due to skb_under_panic:
      
      skbuff: skb_under_panic: text:ffffffff8143397b len:2126 put:14 \
      head:ffff880005bacf50 data:ffff880005bacf4a tail:0x48 end:0xc0 dev:lo
      ------------[ cut here ]------------
      kernel BUG at net/core/skbuff.c:104!
      invalid opcode: 0000 [#1] KASAN
      CPU: 0 PID: 160 Comm: reproducer Not tainted 4.6.0-rc2 #65
      [...]
      Call Trace:
       [<ffffffff813eb7b9>] skb_push+0x79/0x80
       [<ffffffff8143397b>] eth_header+0x2b/0x100
       [<ffffffff8141e0d0>] neigh_resolve_output+0x210/0x310
       [<ffffffff814eab77>] ip6_finish_output2+0x4a7/0x7c0
       [<ffffffff814efe3a>] ip6_output+0x16a/0x280
       [<ffffffff815440c1>] ip6_local_out+0xb1/0xf0
       [<ffffffff814f1115>] ip6_send_skb+0x45/0xd0
       [<ffffffff81518836>] udp_v6_send_skb+0x246/0x5d0
       [<ffffffff8151985e>] udpv6_sendmsg+0xa6e/0x1090
      [...]
      Reported-by: NJi Jianwen <jiji@redhat.com>
      Signed-off-by: NJakub Sitnicki <jkbs@redhat.com>
      Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3ba3458f
  15. 05 4月, 2016 1 次提交
    • S
      sock: enable timestamping using control messages · c14ac945
      Soheil Hassas Yeganeh 提交于
      Currently, SOL_TIMESTAMPING can only be enabled using setsockopt.
      This is very costly when users want to sample writes to gather
      tx timestamps.
      
      Add support for enabling SO_TIMESTAMPING via control messages by
      using tsflags added in `struct sockcm_cookie` (added in the previous
      patches in this series) to set the tx_flags of the last skb created in
      a sendmsg. With this patch, the timestamp recording bits in tx_flags
      of the skbuff is overridden if SO_TIMESTAMPING is passed in a cmsg.
      
      Please note that this is only effective for overriding the recording
      timestamps flags. Users should enable timestamp reporting (e.g.,
      SOF_TIMESTAMPING_SOFTWARE | SOF_TIMESTAMPING_OPT_ID) using
      socket options and then should ask for SOF_TIMESTAMPING_TX_*
      using control messages per sendmsg to sample timestamps for each
      write.
      Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Acked-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c14ac945
  16. 02 3月, 2016 1 次提交
  17. 30 1月, 2016 1 次提交
    • P
      ipv6: enforce flowi6_oif usage in ip6_dst_lookup_tail() · 6f21c96a
      Paolo Abeni 提交于
      The current implementation of ip6_dst_lookup_tail basically
      ignore the egress ifindex match: if the saddr is set,
      ip6_route_output() purposefully ignores flowi6_oif, due
      to the commit d46a9d67 ("net: ipv6: Dont add RT6_LOOKUP_F_IFACE
      flag if saddr set"), if the saddr is 'any' the first route lookup
      in ip6_dst_lookup_tail fails, but upon failure a second lookup will
      be performed with saddr set, thus ignoring the ifindex constraint.
      
      This commit adds an output route lookup function variant, which
      allows the caller to specify lookup flags, and modify
      ip6_dst_lookup_tail() to enforce the ifindex match on the second
      lookup via said helper.
      
      ip6_route_output() becames now a static inline function build on
      top of ip6_route_output_flags(); as a side effect, out-of-tree
      modules need now a GPL license to access the output route lookup
      functionality.
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Acked-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6f21c96a
  18. 12 1月, 2016 1 次提交
  19. 16 12月, 2015 1 次提交
  20. 02 11月, 2015 2 次提交
  21. 29 10月, 2015 2 次提交
  22. 23 10月, 2015 1 次提交
  23. 22 10月, 2015 1 次提交
  24. 13 10月, 2015 1 次提交
  25. 11 10月, 2015 1 次提交
  26. 08 10月, 2015 4 次提交
  27. 30 9月, 2015 1 次提交
  28. 26 9月, 2015 2 次提交
  29. 18 9月, 2015 4 次提交
    • F
      ipv6: ip6_fragment: fix headroom tests and skb leak · 1d325d21
      Florian Westphal 提交于
      David Woodhouse reports skb_under_panic when we try to push ethernet
      header to fragmented ipv6 skbs:
      
       skbuff: skb_under_panic: text:c1277f1e len:1294 put:14 head:dec98000
       data:dec97ffc tail:0xdec9850a end:0xdec98f40 dev:br-lan
      [..]
      ip6_finish_output2+0x196/0x4da
      
      David further debugged this:
        [..] offending fragments were arriving here with skb_headroom(skb)==10.
        Which is reasonable, being the Solos ADSL card's header of 8 bytes
        followed by 2 bytes of PPP frame type.
      
      The problem is that if netfilter ipv6 defragmentation is used, skb_cow()
      in ip6_forward will only see reassembled skb.
      
      Therefore, headroom is overestimated by 8 bytes (we pulled fragment
      header) and we don't check the skbs in the frag_list either.
      
      We can't do these checks in netfilter defrag since outdev isn't known yet.
      
      Furthermore, existing tests in ip6_fragment did not consider the fragment
      or ipv6 header size when checking headroom of the fraglist skbs.
      
      While at it, also fix a skb leak on memory allocation -- ip6_fragment
      must consume the skb.
      
      I tested this e1000 driver hacked to not allocate additional headroom
      (we end up in slowpath, since LL_RESERVED_SPACE is 16).
      
      If 2 bytes of headroom are allocated, fastpath is taken (14 byte
      ethernet header was pulled, so 16 byte headroom available in all
      fragments).
      Reported-by: NDavid Woodhouse <dwmw2@infradead.org>
      Diagnosed-by: NDavid Woodhouse <dwmw2@infradead.org>
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Tested-by: NDavid Woodhouse <David.Woodhouse@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1d325d21
    • E
      netfilter: Add blank lines in callers of netfilter hooks · be10de0a
      Eric W. Biederman 提交于
      In code review it was noticed that I had failed to add some blank lines
      in places where they are customarily used.  Taking a second look at the
      code I have to agree blank lines would be nice so I have added them
      here.
      Reported-by: NNicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      be10de0a
    • E
      netfilter: Pass net into okfn · 0c4b51f0
      Eric W. Biederman 提交于
      This is immediately motivated by the bridge code that chains functions that
      call into netfilter.  Without passing net into the okfns the bridge code would
      need to guess about the best expression for the network namespace to process
      packets in.
      
      As net is frequently one of the first things computed in continuation functions
      after netfilter has done it's job passing in the desired network namespace is in
      many cases a code simplification.
      
      To support this change the function dst_output_okfn is introduced to
      simplify passing dst_output as an okfn.  For the moment dst_output_okfn
      just silently drops the struct net.
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0c4b51f0
    • E
      netfilter: Pass struct net into the netfilter hooks · 29a26a56
      Eric W. Biederman 提交于
      Pass a network namespace parameter into the netfilter hooks.  At the
      call site of the netfilter hooks the path a packet is taking through
      the network stack is well known which allows the network namespace to
      be easily and reliabily.
      
      This allows the replacement of magic code like
      "dev_net(state->in?:state->out)" that appears at the start of most
      netfilter hooks with "state->net".
      
      In almost all cases the network namespace passed in is derived
      from the first network device passed in, guaranteeing those
      paths will not see any changes in practice.
      
      The exceptions are:
      xfrm/xfrm_output.c:xfrm_output_resume()         xs_net(skb_dst(skb)->xfrm)
      ipvs/ip_vs_xmit.c:ip_vs_nat_send_or_cont()      ip_vs_conn_net(cp)
      ipvs/ip_vs_xmit.c:ip_vs_send_or_cont()          ip_vs_conn_net(cp)
      ipv4/raw.c:raw_send_hdrinc()                    sock_net(sk)
      ipv6/ip6_output.c:ip6_xmit()			sock_net(sk)
      ipv6/ndisc.c:ndisc_send_skb()                   dev_net(skb->dev) not dev_net(dst->dev)
      ipv6/raw.c:raw6_send_hdrinc()                   sock_net(sk)
      br_netfilter_hooks.c:br_nf_pre_routing_finish() dev_net(skb->dev) before skb->dev is set to nf_bridge->physindev
      
      In all cases these exceptions seem to be a better expression for the
      network namespace the packet is being processed in then the historic
      "dev_net(in?in:out)".  I am documenting them in case something odd
      pops up and someone starts trying to track down what happened.
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      29a26a56