1. 22 Apr, 2017 2 commits
    • ip_tunnel: Allow policy-based routing through tunnels · 9830ad4c
      Craig Gallek committed
      This feature allows the administrator to set an fwmark for
      packets traversing a tunnel.  This allows the use of independent
      routing tables for tunneled packets without the use of iptables.
      
      There is no concept of per-packet routing decisions through IPv4
      tunnels, so this implementation does not need to work with
      per-packet route lookups as the v6 implementation may
      (with IP6_TNL_F_USE_ORIG_FWMARK).
      
      Further, since the v4 tunnel ioctls share data structures
      (which cannot be trivially modified) with the kernel's internal
      tunnel configuration structures, the mark attribute must be stored
      in the tunnel structure itself and passed as a parameter when
      creating or changing tunnel attributes.
      Signed-off-by: Craig Gallek <kraig@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ip6_tunnel: Allow policy-based routing through tunnels · 0a473b82
      Craig Gallek committed
      This feature allows the administrator to set an fwmark for
      packets traversing a tunnel.  This allows the use of independent
      routing tables for tunneled packets without the use of iptables.
      Signed-off-by: Craig Gallek <kraig@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 21 Apr, 2017 1 commit
  3. 18 Apr, 2017 5 commits
    • net: rtnetlink: plumb extended ack to doit function · c21ef3e3
      David Ahern committed
      Add netlink_ext_ack arg to rtnl_doit_func. Pass extack arg to nlmsg_parse
      for doit functions that call it directly.
      
      This is the first step to using extended error reporting in rtnetlink.
      From here individual subsystems can be updated to set netlink_ext_ack as
      needed.
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: sr: fix BUG due to headroom too small after SRH push · af3b5158
      David Lebrun committed
      When a locally generated packet receives an SRH with two or more segments,
      the remaining headroom is too small to push an ethernet header. This patch
      ensures that the headroom is large enough after SRH push.
      
      The BUG generated the following trace.
      
      [  192.950285] skbuff: skb_under_panic: text:ffffffff81809675 len:198 put:14 head:ffff88006f306400 data:ffff88006f3063fa tail:0xc0 end:0x2c0 dev:A-1
      [  192.952456] ------------[ cut here ]------------
      [  192.953218] kernel BUG at net/core/skbuff.c:105!
      [  192.953411] invalid opcode: 0000 [#1] PREEMPT SMP
      [  192.953411] Modules linked in:
      [  192.953411] CPU: 5 PID: 3433 Comm: ping6 Not tainted 4.11.0-rc3+ #237
      [  192.953411] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.1-0-g8891697-prebuilt.qemu-project.org 04/01/2014
      [  192.953411] task: ffff88007c2d42c0 task.stack: ffffc90000ef4000
      [  192.953411] RIP: 0010:skb_panic+0x61/0x70
      [  192.953411] RSP: 0018:ffffc90000ef7900 EFLAGS: 00010286
      [  192.953411] RAX: 0000000000000085 RBX: 00000000000086dd RCX: 0000000000000201
      [  192.953411] RDX: 0000000080000201 RSI: ffffffff81d104c5 RDI: 00000000ffffffff
      [  192.953411] RBP: ffffc90000ef7920 R08: 0000000000000001 R09: 0000000000000000
      [  192.953411] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
      [  192.953411] R13: ffff88007c5a4000 R14: ffff88007b363d80 R15: 00000000000000b8
      [  192.953411] FS:  00007f94b558b700(0000) GS:ffff88007fd40000(0000) knlGS:0000000000000000
      [  192.953411] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  192.953411] CR2: 00007fff5ecd5080 CR3: 0000000074141000 CR4: 00000000001406e0
      [  192.953411] Call Trace:
      [  192.953411]  skb_push+0x3b/0x40
      [  192.953411]  eth_header+0x25/0xc0
      [  192.953411]  neigh_resolve_output+0x168/0x230
      [  192.953411]  ? ip6_finish_output2+0x242/0x8f0
      [  192.953411]  ip6_finish_output2+0x242/0x8f0
      [  192.953411]  ? ip6_finish_output2+0x76/0x8f0
      [  192.953411]  ip6_finish_output+0xa8/0x1d0
      [  192.953411]  ip6_output+0x64/0x2d0
      [  192.953411]  ? ip6_output+0x73/0x2d0
      [  192.953411]  ? ip6_dst_check+0xb5/0xc0
      [  192.953411]  ? dst_cache_per_cpu_get.isra.2+0x40/0x80
      [  192.953411]  seg6_output+0xb0/0x220
      [  192.953411]  lwtunnel_output+0xcf/0x210
      [  192.953411]  ? lwtunnel_output+0x59/0x210
      [  192.953411]  ip6_local_out+0x38/0x70
      [  192.953411]  ip6_send_skb+0x2a/0xb0
      [  192.953411]  ip6_push_pending_frames+0x48/0x50
      [  192.953411]  rawv6_sendmsg+0xa39/0xf10
      [  192.953411]  ? __lock_acquire+0x489/0x890
      [  192.953411]  ? __mutex_lock+0x1fc/0x970
      [  192.953411]  ? __lock_acquire+0x489/0x890
      [  192.953411]  ? __mutex_lock+0x1fc/0x970
      [  192.953411]  ? tty_ioctl+0x283/0xec0
      [  192.953411]  inet_sendmsg+0x45/0x1d0
      [  192.953411]  ? _copy_from_user+0x54/0x80
      [  192.953411]  sock_sendmsg+0x33/0x40
      [  192.953411]  SYSC_sendto+0xef/0x170
      [  192.953411]  ? entry_SYSCALL_64_fastpath+0x5/0xc2
      [  192.953411]  ? trace_hardirqs_on_caller+0x12b/0x1b0
      [  192.953411]  ? trace_hardirqs_on_thunk+0x1a/0x1c
      [  192.953411]  SyS_sendto+0x9/0x10
      [  192.953411]  entry_SYSCALL_64_fastpath+0x1f/0xc2
      [  192.953411] RIP: 0033:0x7f94b453db33
      [  192.953411] RSP: 002b:00007fff5ecd0578 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
      [  192.953411] RAX: ffffffffffffffda RBX: 00007fff5ecd16e0 RCX: 00007f94b453db33
      [  192.953411] RDX: 0000000000000040 RSI: 000055a78352e9c0 RDI: 0000000000000003
      [  192.953411] RBP: 00007fff5ecd1690 R08: 000055a78352c940 R09: 000000000000001c
      [  192.953411] R10: 0000000000000000 R11: 0000000000000246 R12: 000055a783321e10
      [  192.953411] R13: 000055a7839890c0 R14: 0000000000000004 R15: 0000000000000000
      [  192.953411] Code: 00 00 48 89 44 24 10 8b 87 c4 00 00 00 48 89 44 24 08 48 8b 87 d8 00 00 00 48 c7 c7 90 58 d2 81 48 89 04 24 31 c0 e8 4f 70 9a ff <0f> 0b 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 48 8b 97 d8 00 00
      [  192.953411] RIP: skb_panic+0x61/0x70 RSP: ffffc90000ef7900
      [  193.000186] ---[ end trace bd0b89fabdf2f92c ]---
      [  193.000951] Kernel panic - not syncing: Fatal exception in interrupt
      [  193.001137] Kernel Offset: disabled
      [  193.001169] ---[ end Kernel panic - not syncing: Fatal exception in interrupt
      
      Fixes: 19d5a26f ("ipv6: sr: expand skb head only if necessary")
      Signed-off-by: David Lebrun <david.lebrun@uclouvain.be>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: drop non loopback packets claiming to originate from ::1 · 0aa8c13e
      Florian Westphal committed
      We lack a saddr check for ::1. This causes security issues, e.g. with ACLs
      permitting connections from ::1 on the assumption that these originate
      from the local machine.
      
      Assuming a source address of ::1 is local seems reasonable.
      RFC4291 doesn't allow such a source address either, so drop such packets.
      Reported-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
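The source-address check described in the commit above can be illustrated in user space. This is only a sketch: `drop_spoofed_loopback` and `drop_spoofed_loopback_str` are hypothetical helper names, the `arrived_on_loopback` flag stands in for whatever test the kernel applies to the receiving device, and only the standard `IN6_IS_ADDR_LOOPBACK` macro and `inet_pton()` are real APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/* Hypothetical helper, not the kernel function: report whether a received
 * packet should be dropped because it claims ::1 as its source although it
 * did not arrive on a loopback interface. */
static bool drop_spoofed_loopback(const struct in6_addr *saddr,
                                  bool arrived_on_loopback)
{
    assert(saddr != NULL);
    return IN6_IS_ADDR_LOOPBACK(saddr) && !arrived_on_loopback;
}

/* Convenience wrapper for textual addresses. Text that does not parse as an
 * IPv6 address is rejected as well (returns true). */
static bool drop_spoofed_loopback_str(const char *text,
                                      bool arrived_on_loopback)
{
    struct in6_addr addr;

    assert(text != NULL);
    if (inet_pton(AF_INET6, text, &addr) != 1)
        return true;
    return drop_spoofed_loopback(&addr, arrived_on_loopback);
}
```

With this sketch, `drop_spoofed_loopback_str("::1", false)` reports a drop, while the same address arriving on a loopback device, or any non-loopback source address, is accepted.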
    • net-timestamp: avoid use-after-free in ip_recv_error · 1862d620
      Willem de Bruijn committed
      Syzkaller reported a use-after-free in ip_recv_error at line
      
          info->ipi_ifindex = skb->dev->ifindex;
      
      This function is called on dequeue from the error queue, at which
      point the device pointer may no longer be valid.
      
      Save ifindex on enqueue in __skb_complete_tx_timestamp, when the
      pointer is valid or NULL. Store it in temporary storage skb->cb.
      
      It is safe to reference skb->dev here, as called from device drivers
      or dev_queue_xmit. The exception is when called from tcp_ack_tstamp;
      in that case it is NULL and ifindex is set to 0 (invalid).
      
      Do not return a pktinfo cmsg if ifindex is 0. This maintains the
      current behavior of not returning a cmsg if skb->dev was NULL.
      
      On dequeue, the ipv4 path will cast from sock_exterr_skb to
      in_pktinfo. Both have ifindex as their first element, so no explicit
      conversion is needed. This is by design, introduced in commit
      0b922b7a ("net: original ingress device index in PKTINFO"). For
      ipv6 ip6_datagram_support_cmsg converts to in6_pktinfo.
      
      Fixes: 829ae9d6 ("net-timestamp: allow reading recv cmsg on errqueue with origin tstamp")
      Reported-by: Andrey Konovalov <andreyknvl@google.com>
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
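The pattern in the commit above (copy the interface index into per-packet scratch space at enqueue time, then read only that copy at dequeue time) can be modelled with a few lines of user-space C. Everything here is a toy: `toy_dev`, `toy_skb`, `toy_enqueue_err` and `toy_dequeue_ifindex` are hypothetical names, and the 48-byte `cb` array merely imitates the idea of a control buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct toy_dev { int ifindex; };

struct toy_skb {
    struct toy_dev *dev;   /* may dangle once the packet sits on the error queue */
    unsigned char cb[48];  /* scratch area, assumed free for our use here */
};

/* Enqueue: save the index while dev is still known to be valid or NULL.
 * An index of 0 means "no device". */
static void toy_enqueue_err(struct toy_skb *skb)
{
    int ifindex;

    assert(skb != NULL);
    ifindex = skb->dev ? skb->dev->ifindex : 0;
    memcpy(skb->cb, &ifindex, sizeof(ifindex));
    skb->dev = NULL;       /* model: the pointer must not be used after this */
}

/* Dequeue: read the saved copy only. Returns 1 and fills *out when a
 * pktinfo-style answer should be produced, 0 when the saved index is 0. */
static int toy_dequeue_ifindex(const struct toy_skb *skb, int *out)
{
    int ifindex;

    assert(skb != NULL && out != NULL);
    memcpy(&ifindex, skb->cb, sizeof(ifindex));
    if (ifindex == 0)
        return 0;
    *out = ifindex;
    return 1;
}
```

The dequeue side never touches `dev`, which is the property the fix relies on. Suppressing the answer for index 0 mirrors the "no cmsg if ifindex is 0" rule in the message.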
    • net: ipv6: send unsolicited NA on admin up · 4a6e3c5d
      David Ahern committed
      ndisc_notify is the ipv6 equivalent to arp_notify. When arp_notify is
      set to 1, gratuitous arp requests are sent when the device is brought up.
      The same is expected when ndisc_notify is set to 1 (per ndisc_notify in
      Documentation/networking/ip-sysctl.txt). The NA is not sent on NETDEV_UP
      event; add it.
      
      Fixes: 5cb04436 ("ipv6: add knob to send unsolicited ND on link-layer address change")
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 14 Apr, 2017 1 commit
  5. 13 Apr, 2017 3 commits
  6. 29 Mar, 2017 3 commits
  7. 28 Mar, 2017 1 commit
  8. 25 Mar, 2017 4 commits
    • tcp: Record Rx hash and NAPI ID in tcp_child_process · e5907459
      Alexander Duyck committed
      While working on some recent busy poll changes we found that child sockets
      were being instantiated without NAPI ID being set.  In our first attempt to
      fix it, it was suggested that we should just pull programming the NAPI ID
      into the function itself since all callers will need to have it set.
      
      In addition to the NAPI ID change I have dropped the code that was
      populating the Rx hash since it was actually being populated in
      tcp_get_cookie_sock.
      Reported-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
      Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: sr: use dst_cache in seg6_input · af4a2209
      David Lebrun committed
      We already use dst_cache in seg6_output, when handling locally generated
      packets. We extend it in seg6_input, to also handle forwarded packets, and avoid
      unnecessary fib lookups.
      
      Performance of SRH encapsulation before the patch:
      Result: OK: 5656067(c5655678+d388) usec, 5000000 (1000byte,0frags)
        884006pps 7072Mb/sec (7072048000bps) errors: 0
      
      Performance after the patch:
      Result: OK: 4774543(c4774084+d459) usec, 5000000 (1000byte,0frags)
        1047220pps 8377Mb/sec (8377760000bps) errors: 0
      Signed-off-by: David Lebrun <david.lebrun@uclouvain.be>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: sr: expand skb head only if necessary · 19d5a26f
      David Lebrun committed
      To insert or encapsulate a packet with an SRH, we need a large enough skb
      headroom. Currently, we are using pskb_expand_head to unconditionally increase
      the size of the headroom by the amount needed by the SRH (and IPv6 header).
      If this reallocation is performed by a different CPU from the one that initially
      allocated the skb, then when the initial CPU frees the skb, it will enter the
      __slab_free slowpath, hurting performance.
      
      This patch replaces pskb_expand_head with skb_cow_head, which reallocates the
      skb head only if the headroom is not large enough.
      
      Performance of SRH encapsulation before the patch:
      Result: OK: 7348320(c7347271+d1048) usec, 5000000 (1000byte,0frags)
        680427pps 5443Mb/sec (5443416000bps) errors: 0
      
      Performance after the patch:
      Result: OK: 5656067(c5655678+d388) usec, 5000000 (1000byte,0frags)
        884006pps 7072Mb/sec (7072048000bps) errors: 0
      Signed-off-by: David Lebrun <david.lebrun@uclouvain.be>
      Signed-off-by: David S. Miller <davem@davemloft.net>
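The idea behind the commit above, "reallocate the head only when the headroom is too small", can be shown with a toy buffer in user space. This is a simplified sketch and not the kernel's `skb_cow_head`: `toy_buf`, `toy_init`, `toy_cow_head` and `toy_push` are hypothetical names, and real skbs also deal with cloning, alignment and tailroom, which are ignored here.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for an skb: 'head' is the allocation and 'data' points at
 * the first used byte, so the headroom is data - head. */
struct toy_buf {
    unsigned char *head;
    unsigned char *data;
    size_t len;         /* bytes in use, starting at data */
    unsigned reallocs;  /* how many times the head was reallocated */
};

static size_t toy_headroom(const struct toy_buf *b)
{
    return (size_t)(b->data - b->head);
}

static int toy_init(struct toy_buf *b, size_t headroom, size_t len)
{
    b->head = calloc(1, headroom + len);
    if (!b->head)
        return -1;
    b->data = b->head + headroom;
    b->len = len;
    b->reallocs = 0;
    return 0;
}

/* Reallocate only when the existing headroom is smaller than 'need'. */
static int toy_cow_head(struct toy_buf *b, size_t need)
{
    unsigned char *nh;

    assert(b != NULL && b->head != NULL);
    if (toy_headroom(b) >= need)
        return 0;                    /* fast path: nothing to do */
    nh = malloc(need + b->len);
    if (!nh)
        return -1;
    memcpy(nh + need, b->data, b->len);
    free(b->head);
    b->head = nh;
    b->data = nh + need;
    b->reallocs++;
    return 0;
}

/* Prepend n bytes, as a header push would. The caller must have made sure
 * that the headroom is sufficient. */
static unsigned char *toy_push(struct toy_buf *b, size_t n)
{
    assert(toy_headroom(b) >= n);
    b->data -= n;
    b->len += n;
    return b->data;
}
```

A caller would run `toy_cow_head(&buf, hdr_len)` before every `toy_push(&buf, hdr_len)`. As long as earlier layers left enough headroom, no reallocation happens and the buffer stays with its original allocation.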
    • net: Add sysctl to toggle early demux for tcp and udp · dddb64bc
      subashab@codeaurora.org committed
      Certain systems process a significant unconnected UDP workload.
      It would be preferable to disable UDP early demux for those systems
      and enable it for TCP only.
      
      By disabling UDP demux, we see these slight gains on an ARM64 system:
      782 -> 788Mbps unconnected single stream UDPv4
      633 -> 654Mbps unconnected UDPv4 different sources
      
      The performance impact can change based on CPU architecture and cache
      sizes. Not much difference will be seen if the entire UDP hash table
      is in cache.
      
      Both sysctls are enabled by default to preserve existing behavior.
      
      v1->v2: Change function pointer instead of adding conditional as
      suggested by Stephen.
      
      v2->v3: Read once in callers to avoid issues due to compiler
      optimizations. Also update commit message with the tests.
      
      v3->v4: Store and use read once result instead of querying pointer
      again incorrectly.
      
      v4->v5: Refactor to avoid errors due to compilation with IPV6={m,n}
      Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
      Suggested-by: Eric Dumazet <edumazet@google.com>
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Cc: Tom Herbert <tom@herbertland.com>
      Cc: David Miller <davem@davemloft.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
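The v2 and v3 notes in the commit above describe two techniques: toggling a function pointer instead of adding a conditional, and reading that pointer once in the caller. A minimal user-space sketch of both follows. The struct and function names (`toy_proto`, `toy_set_early_demux`, `toy_rcv`) are hypothetical, and the volatile cast is only a stand-in for the kernel's read-once helper.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*demux_fn)(int pkt);

struct toy_proto {
    demux_fn early_demux;          /* consulted on the receive path */
    demux_fn early_demux_handler;  /* the real implementation, never cleared */
};

/* Placeholder demux routine: returns a fake socket id for the packet. */
static int toy_udp_demux(int pkt)
{
    return pkt + 1;
}

/* What a sysctl write might do: enable or disable by swapping the pointer
 * instead of testing a flag for every packet. */
static void toy_set_early_demux(struct toy_proto *p, int enabled)
{
    assert(p != NULL);
    p->early_demux = enabled ? p->early_demux_handler : NULL;
}

/* Receive path: load the pointer exactly once, then test and call the local
 * copy, so a concurrent toggle cannot slip in between check and call.
 * Returns -1 when early demux is disabled. */
static int toy_rcv(const struct toy_proto *p, int pkt)
{
    demux_fn fn = *(demux_fn const volatile *)&p->early_demux;

    return fn ? fn(pkt) : -1;
}
```

Keeping the handler in a second field lets the toggle restore the pointer later without knowing which protocol it belongs to.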
  9. 23 Mar, 2017 3 commits
  10. 17 Mar, 2017 3 commits
  11. 14 Mar, 2017 3 commits
    • dccp/tcp: fix routing redirect race · 45caeaa5
      Jon Maxwell committed
      As Eric Dumazet pointed out, this also needs to be fixed in IPv6.
      v2: Contains the IPv6 tcp/IPv6 dccp patches as well.
      
      We have seen a few incidents lately where a dst_entry has been freed
      with a dangling TCP socket reference (sk->sk_dst_cache) pointing to that
      dst_entry. If the conditions/timings are right, a crash then ensues when the
      freed dst_entry is referenced later on. A common crashing back trace is:
      
       #8 [] page_fault at ffffffff8163e648
          [exception RIP: __tcp_ack_snd_check+74]
      .
      .
       #9 [] tcp_rcv_established at ffffffff81580b64
      #10 [] tcp_v4_do_rcv at ffffffff8158b54a
      #11 [] tcp_v4_rcv at ffffffff8158cd02
      #12 [] ip_local_deliver_finish at ffffffff815668f4
      #13 [] ip_local_deliver at ffffffff81566bd9
      #14 [] ip_rcv_finish at ffffffff8156656d
      #15 [] ip_rcv at ffffffff81566f06
      #16 [] __netif_receive_skb_core at ffffffff8152b3a2
      #17 [] __netif_receive_skb at ffffffff8152b608
      #18 [] netif_receive_skb at ffffffff8152b690
      #19 [] vmxnet3_rq_rx_complete at ffffffffa015eeaf [vmxnet3]
      #20 [] vmxnet3_poll_rx_only at ffffffffa015f32a [vmxnet3]
      #21 [] net_rx_action at ffffffff8152bac2
      #22 [] __do_softirq at ffffffff81084b4f
      #23 [] call_softirq at ffffffff8164845c
      #24 [] do_softirq at ffffffff81016fc5
      #25 [] irq_exit at ffffffff81084ee5
      #26 [] do_IRQ at ffffffff81648ff8
      
      Of course it may happen with other NIC drivers as well.
      
      The freed dst_entry was found here:
      
          static bool tcp_in_quickack_mode(struct sock *sk)
          {
                  const struct inet_connection_sock *icsk = inet_csk(sk);
                  const struct dst_entry *dst = __sk_dst_get(sk);
          
                  return (dst && dst_metric(dst, RTAX_QUICKACK)) ||
                          (icsk->icsk_ack.quick && !icsk->icsk_ack.pingpong);
          }
      
      But there are other backtraces attributed to the same freed dst_entry in
      netfilter code as well.
      
      All the vmcores showed 2 significant clues:
      
      - Remote hosts behind the default gateway had always been redirected to a
      different gateway. A rtable/dst_entry is added for each such host, creating
      more dst_entries with lower reference counts and making the race more probable.
      
      - All vmcores showed a positive LockDroppedIcmps value, e.g.:
      
      LockDroppedIcmps                  267
      
      A closer look at the tcp_v4_err() handler revealed that do_redirect() will run
      regardless of whether user space has the socket locked. This can result in a
      race condition where the same dst_entry cached in sk->sk_dst_cache can have
      its reference count decremented twice for the same socket via:
      
      do_redirect()->__sk_dst_check()-> dst_release().
      
      Which leads to the dst_entry being prematurely freed with another socket
      pointing to it via sk->sk_dst_cache and a subsequent crash.
      
      To fix this, skip do_redirect() if user space has the socket locked. Instead let
      the redirect take place later when user space does not have the socket
      locked.
      
      The dccp/IPv6 code is very similar in this respect, so fixing it there too.
      
      As Eric Garver pointed out, the following commit now invalidates routes, which
      can set the dst->obsolete flag so that ipv4_dst_check() returns null and
      triggers the dst_release().
      
      Fixes: ceb33206 ("ipv4: Kill routes during PMTU/redirect updates.")
      Cc: Eric Garver <egarver@redhat.com>
      Cc: Hannes Sowa <hsowa@redhat.com>
      Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: avoid write to a possibly cloned skb · 79e49503
      Florian Westphal committed
      ip6_fragment, in case skb has a fraglist, checks if the
      skb is cloned.  If it is, it will move to the 'slow path' and allocate
      new skbs for each fragment.
      
      However, right before entering the slowpath loop, it updates the
      nexthdr value of the last ipv6 extension header to NEXTHDR_FRAGMENT,
      to account for the fragment header that will be inserted in the new
      ipv6-fragment skbs.
      
      In case original skb is cloned this munges nexthdr value of another
      skb.  Avoid this by doing the nexthdr update for each of the new fragment
      skbs separately.
      
      This was observed with tcpdump on a bridge device where netfilter ipv6
      reassembly is active:  tcpdump shows malformed fragment headers as
      the l4 header (icmpv6, tcp, etc.) is decoded as a fragment header.
      
      Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Reported-by: Andreas Karis <akaris@redhat.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: make ECMP route replacement less greedy · 67e19400
      Sabrina Dubroca committed
      Commit 27596472 ("ipv6: fix ECMP route replacement") introduced a
      loop that removes all siblings of an ECMP route that is being
      replaced. However, this loop doesn't stop when it has replaced
      siblings, and keeps removing other routes with a higher metric.
      We also end up triggering the WARN_ON after the loop, because after
      this nsiblings < 0.
      
      Instead, stop the loop when we have taken care of all routes with the
      same metric as the route being replaced.
      
        Reproducer:
        ===========
          #!/bin/sh
      
          ip netns add ns1
          ip netns add ns2
          ip -net ns1 link set lo up
      
          for x in 0 1 2 ; do
              ip link add veth$x netns ns2 type veth peer name eth$x netns ns1
              ip -net ns1 link set eth$x up
              ip -net ns2 link set veth$x up
          done
      
          ip -net ns1 -6 r a 2000::/64 nexthop via fe80::0 dev eth0 \
                  nexthop via fe80::1 dev eth1 nexthop via fe80::2 dev eth2
          ip -net ns1 -6 r a 2000::/64 via fe80::42 dev eth0 metric 256
          ip -net ns1 -6 r a 2000::/64 via fe80::43 dev eth0 metric 2048
      
          echo "before replace, 3 routes"
          ip -net ns1 -6 r | grep -v '^fe80\|^ff00'
          echo
      
          ip -net ns1 -6 r c 2000::/64 nexthop via fe80::4 dev eth0 \
                  nexthop via fe80::5 dev eth1 nexthop via fe80::6 dev eth2
      
          echo "after replace, only 2 routes, metric 2048 is gone"
          ip -net ns1 -6 r | grep -v '^fe80\|^ff00'
      
      Fixes: 27596472 ("ipv6: fix ECMP route replacement")
      Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
      Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Reviewed-by: Xin Long <lucien.xin@gmail.com>
      Reviewed-by: Michal Kubecek <mkubecek@suse.cz>
      Signed-off-by: David S. Miller <davem@davemloft.net>
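The difference between the greedy loop and the fixed loop in the commit above can be modelled on a plain array of routes sorted by metric. This is a rough sketch under assumptions, not the kernel code: `toy_route` and `toy_remove_siblings` are hypothetical names, and the table in the usage note assumes the multipath route from the reproducer sits at metric 1024 between the metric 256 and metric 2048 routes.

```c
#include <assert.h>
#include <stddef.h>

struct toy_route {
    int metric;
    int qualifies_for_ecmp;  /* e.g. a route with a gateway */
    int removed;
};

/* Walk the routes that follow the one being replaced and remove those that
 * qualify as ECMP siblings. The table is assumed to be sorted by metric.
 * With stop_early set, the walk ends at the first route whose metric is
 * higher than that of the replaced route (the fixed behaviour). Without it,
 * the walk keeps going and also removes higher-metric routes (the greedy
 * behaviour described in the message). Returns the number of removals. */
static size_t toy_remove_siblings(struct toy_route *rt, size_t n,
                                  size_t start, int metric, int stop_early)
{
    size_t removed = 0;

    assert(rt != NULL);
    for (size_t i = start; i < n; i++) {
        if (stop_early && rt[i].metric > metric)
            break;
        if (rt[i].qualifies_for_ecmp && !rt[i].removed) {
            rt[i].removed = 1;
            removed++;
        }
    }
    return removed;
}
```

For a table with metrics 256, 1024, 1024, 1024, 2048, where the first 1024 entry has already been replaced and the walk starts at the second one, the greedy variant removes three routes, including the metric 2048 route. The early-stop variant removes only the two remaining siblings.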
  12. 13 Mar, 2017 3 commits
    • netfilter: nft_fib: Support existence check · 055c4b34
      Phil Sutter committed
      Instead of the actual interface index or name, set destination register
      to just 1 or 0 depending on whether the lookup succeeded or not if
      NFTA_FIB_F_PRESENT was set in userspace.
      Signed-off-by: Phil Sutter <phil@nwl.cc>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
    • netfilter: nf_tables: fix mismatch in big-endian system · 10596608
      Liping Zhang committed
      Currently, there are two different methods to store a u16 integer to
      the u32 data register. For example:
        u32 *dest = &regs->data[priv->dreg];
        1. *dest = 0; *(u16 *) dest = val_u16;
        2. *dest = val_u16;
      
      For method 1, the u16 value will be stored like this, on either a
      big-endian or a little-endian system:
        0          15           31
        +-+-+-+-+-+-+-+-+-+-+-+-+
        |   Value   |     0     |
        +-+-+-+-+-+-+-+-+-+-+-+-+
      
      For method 2, on a little-endian system, the u16 value will be the same
      as listed above. But on a big-endian system, the u16 value will be stored
      like this:
        0          15           31
        +-+-+-+-+-+-+-+-+-+-+-+-+
        |     0     |   Value   |
        +-+-+-+-+-+-+-+-+-+-+-+-+
      
      So when we later use "memcmp(&regs->data[priv->sreg], data, 2);" to
      compare in nft_cmp, nft_lookup expr ..., method 2 will get the wrong
      result on a big-endian system, as bits 0~15 will always be zero.
      
      For a similar reason, when loading a u16 value from the u32 data
      register, we should use "*(u16 *) sreg;" instead of "(u16)*sreg;",
      as the 2nd method will get the wrong value on a big-endian system.
      
      So introduce some wrapper functions to store/load a u8 or u16
      integer to/from the u32 data register, and use them in the right
      place.
      Signed-off-by: Liping Zhang <zlpnobody@gmail.com>
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
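The two store methods from the commit above, and the 2-byte `memcmp` that exposes the difference, can be reproduced in portable user-space C. This is a sketch rather than the wrapper functions the patch introduces: all function names are hypothetical, and `memcpy` is used in place of the `*(u16 *) dest` cast to avoid aliasing problems outside the kernel.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Method 1: zero the 32-bit register, then write the u16 into its first two
 * bytes. The layout is the same on every byte order. */
static void store_u16_first_bytes(uint32_t *dest, uint16_t val)
{
    assert(dest != NULL);
    *dest = 0;
    memcpy(dest, &val, sizeof(val));
}

/* Method 2: plain integer assignment. On a big-endian CPU the value ends up
 * in the last two bytes of the register instead of the first two. */
static void store_u16_assign(uint32_t *dest, uint16_t val)
{
    assert(dest != NULL);
    *dest = val;
}

/* Mirrors the 2-byte memcmp() comparison quoted in the message. */
static int reg_matches_u16(const uint32_t *reg, uint16_t expected)
{
    assert(reg != NULL);
    return memcmp(reg, &expected, sizeof(expected)) == 0;
}

/* Returns 1 when the running CPU is little-endian, 0 otherwise. */
static int host_is_little_endian(void)
{
    const uint16_t probe = 1;
    uint8_t first;

    memcpy(&first, &probe, 1);
    return first == 1;
}
```

A register filled by `store_u16_first_bytes` always satisfies `reg_matches_u16`. One filled by `store_u16_assign` does so only when `host_is_little_endian()` returns 1, which is the mismatch the commit describes.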
    • net: ipv6: Add early demux handler for UDP unicast · 5425077d
      subashab@codeaurora.org committed
      While running a single stream UDPv6 test, we observed that the amount
      of CPU time spent in NET_RX softirq was much greater than for UDPv4 at an
      equivalent receive rate. The test here was run on an ARM64 based
      Android system. On further analysis with perf, we found that UDPv6
      was spending significant time in the statistics netfilter targets
      which did socket lookup per packet. These statistics rules perform
      a lookup when there is no socket associated with the skb. Since
      there are multiple instances of these rules based on UID, there
      will be equal number of lookups per skb.
      
      By introducing early demux for UDPv6, we avoid the redundant lookups.
      This also helped to improve the performance (800Mbps -> 870Mbps) on a
      CPU limited system in a single stream UDPv6 receive test with 1450
      byte sized datagrams using iperf.
      
      v1->v2: Use IPv6 cookie to validate dst instead of 0 as suggested
      by Eric
      Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  13. 10 Mar, 2017 4 commits
    • udp: avoid ufo handling on IP payload compression packets · 4b3b45ed
      Alexey Kodanev committed
      commit c146066a ("ipv4: Don't use ufo handling on later transformed
      packets") and commit f89c56ce ("ipv6: Don't use ufo handling on
      later transformed packets") added a check that 'rt->dst.header_len' isn't
      zero in order to skip UFO, but it doesn't include IPcomp in transport mode
      where it equals zero.
      
      Packets, after payload compression, may not require further fragmentation,
      and if original length exceeds MTU, later compressed packets will be
      transmitted incorrectly. This can be reproduced with LTP udp_ipsec.sh test
      on veth device with enabled UFO, MTU is 1500 and UDP payload is 2000:
      
      * IPv4 case, offset is wrong + unnecessary fragmentation
          udp_ipsec.sh -p comp -m transport -s 2000 &
          tcpdump -ni ltp_ns_veth2
          ...
          IP (tos 0x0, ttl 64, id 45203, offset 0, flags [+],
            proto Compressed IP (108), length 49)
            10.0.0.2 > 10.0.0.1: IPComp(cpi=0x1000)
          IP (tos 0x0, ttl 64, id 45203, offset 1480, flags [none],
            proto UDP (17), length 21) 10.0.0.2 > 10.0.0.1: ip-proto-17
      
      * IPv6 case, sending small fragments
          udp_ipsec.sh -6 -p comp -m transport -s 2000 &
          tcpdump -ni ltp_ns_veth2
          ...
          IP6 (flowlabel 0x6b9ba, hlim 64, next-header Compressed IP (108)
            payload length: 37) fd00::2 > fd00::1: IPComp(cpi=0x1000)
          IP6 (flowlabel 0x6b9ba, hlim 64, next-header Compressed IP (108)
            payload length: 21) fd00::2 > fd00::1: IPComp(cpi=0x1000)
      
      Fix it by checking 'rt->dst.xfrm' pointer to 'xfrm_state' struct, skip UFO
      if xfrm is set. So the new check will include both cases: IPcomp and IPsec.
      
      Fixes: c146066a ("ipv4: Don't use ufo handling on later transformed packets")
      Fixes: f89c56ce ("ipv6: Don't use ufo handling on later transformed packets")
      Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: rename *_sequence_number() to *_seq_and_tsoff() · a30aad50
      Alexey Kodanev committed
      The functions that return the tcp sequence number also set up
      the TS offset value, so rename them to better describe their purpose.
      
      No functional changes in this patch.
      Suggested-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net/tunnel: set inner protocol in network gro hooks · 294acf1c
      Paolo Abeni committed
      The gso code of several tunnel types (gre and udp tunnels)
      takes for granted that skb->inner_protocol is properly
      initialized and drops the packet otherwise.
      
      On the forwarding path nothing initializes this field,
      so gro encapsulated packets are dropped on forward.
      
      Since commit 38720352 ("gre: Use inner_proto to obtain
      inner header protocol"), this can be reproduced when the
      encapsulated packets use gre as the tunneling protocol.
      
      The issue happens also with vxlan and geneve tunnels since
      commit 8bce6d7d ("udp: Generalize skb_udp_segment"), if the
      forwarding host's ingress nic has h/w offload for such tunnel
      and a vxlan/geneve device is configured on top of it, regardless
      of the configured peer address and vni.
      
      To address the issue, this change initializes the inner_protocol
      field for encapsulated packets in both ipv4 and ipv6 gro complete
      callbacks.
      
      Fixes: 38720352 ("gre: Use inner_proto to obtain inner header protocol")
      Fixes: 8bce6d7d ("udp: Generalize skb_udp_segment")
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Acked-by: Alexander Duyck <alexander.h.duyck@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: ipv6: Remove redundant RTA_OIF in multipath routes · 5be083ce
      David Ahern committed
      Dinesh reported that RTA_MULTIPATH nexthops are 8 bytes larger with IPv6
      than IPv4. The recent refactoring for multipath support in netlink
      messages does not discriminate between non-multipath, which needs the OIF,
      and multipath, which adds a rtnexthop struct for each hop, making the
      RTA_OIF attribute redundant. Resolve by adding a flag to the info
      function to skip the oif for multipath.
      
      Fixes: beb1afac ("net: ipv6: Add support to dump multipath routes via RTA_MULTIPATH attribute")
      Reported-by: Dinesh Dutt <ddutt@cumulusnetworks.com>
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  14. 08 Mar, 2017 1 commit
    • ipv6: reorder icmpv6_init() and ip6_mr_init() · 15e66807
      WANG Cong committed
      Andrey reported the following kernel crash:
      
      kasan: GPF could be caused by NULL-ptr deref or user memory access
      general protection fault: 0000 [#1] SMP KASAN
      Dumping ftrace buffer:
         (ftrace buffer empty)
      Modules linked in:
      CPU: 0 PID: 14446 Comm: syz-executor6 Not tainted 4.10.0+ #82
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
      task: ffff88001f311700 task.stack: ffff88001f6e8000
      RIP: 0010:ip6mr_sk_done+0x15a/0x3d0 net/ipv6/ip6mr.c:1618
      RSP: 0018:ffff88001f6ef418 EFLAGS: 00010202
      RAX: dffffc0000000000 RBX: 1ffff10003edde8c RCX: ffffc900043ee000
      RDX: 0000000000000004 RSI: ffffffff83e3b3f8 RDI: 0000000000000020
      RBP: ffff88001f6ef508 R08: fffffbfff0dcc5d8 R09: 0000000000000000
      R10: ffffffff86e62ec0 R11: 0000000000000000 R12: 0000000000000000
      R13: 0000000000000000 R14: ffff88001f6ef4e0 R15: ffff8800380a0040
      FS:  00007f7a52cec700(0000) GS:ffff88003ec00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 000000000061c500 CR3: 000000001f1ae000 CR4: 00000000000006f0
      DR0: 0000000020000000 DR1: 0000000020000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
      Call Trace:
       rawv6_close+0x4c/0x80 net/ipv6/raw.c:1217
       inet_release+0xed/0x1c0 net/ipv4/af_inet.c:425
       inet6_release+0x50/0x70 net/ipv6/af_inet6.c:432
       sock_release+0x8d/0x1e0 net/socket.c:597
       __sock_create+0x39d/0x880 net/socket.c:1226
       sock_create_kern+0x3f/0x50 net/socket.c:1243
       inet_ctl_sock_create+0xbb/0x280 net/ipv4/af_inet.c:1526
       icmpv6_sk_init+0x163/0x500 net/ipv6/icmp.c:954
       ops_init+0x10a/0x550 net/core/net_namespace.c:115
       setup_net+0x261/0x660 net/core/net_namespace.c:291
       copy_net_ns+0x27e/0x540 net/core/net_namespace.c:396
      9pnet_virtio: no channels available for device ./file1
       create_new_namespaces+0x437/0x9b0 kernel/nsproxy.c:106
       unshare_nsproxy_namespaces+0xae/0x1e0 kernel/nsproxy.c:205
       SYSC_unshare kernel/fork.c:2281 [inline]
       SyS_unshare+0x64e/0x1000 kernel/fork.c:2231
       entry_SYSCALL_64_fastpath+0x1f/0xc2
      
      This is because net->ipv6.mr6_tables is not initialized at that point:
      ip6mr_rules_init() has not been called yet, so on the error path, when
      we iterate the list, we trigger this oops. Fix this by reordering
      ip6mr_rules_init() before icmpv6_sk_init().
      Reported-by: Andrey Konovalov <andreyknvl@google.com>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  15. 07 Mar, 2017 1 commit
  16. 03 Mar, 2017 2 commits