1. 05 Oct, 2017 3 commits
  2. 22 Sep, 2017 1 commit
  3. 16 Sep, 2017 1 commit
    • net: vrf: avoid gcc-4.6 warning · ecf09117
      Arnd Bergmann authored
      When building an allmodconfig kernel with gcc-4.6, we get a rather
      odd warning:
      
      drivers/net/vrf.c: In function ‘vrf_ip6_input_dst’:
      drivers/net/vrf.c:964:3: error: initialized field with side-effects overwritten [-Werror]
      drivers/net/vrf.c:964:3: error: (near initialization for ‘fl6’) [-Werror]
      
      I have no idea what this warning is even trying to say, but it does
      seem like a false positive. Reordering the initialization to match
      the structure definition gets rid of the warning, and might also avoid
      whatever gcc thinks is wrong here.
      
      Fixes: 9ff74384 ("net: vrf: Handle ipv6 multicast and link-local addresses")
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
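The reordering described in this commit can be illustrated with a minimal sketch. The struct `flow_sketch` and both helper functions are hypothetical stand-ins, not the kernel's `struct flowi6` or the code in `vrf_ip6_input_dst`, and the sketch does not claim to reproduce the gcc-4.6 warning itself. It only shows that listing designated initializers in declaration order produces the same object as listing them out of order.

```c
#include <assert.h>

/* Hypothetical stand-in for a flow key; not the kernel's struct flowi6. */
struct flow_sketch {
    int iif;
    int mark;
    int proto;
};

/* Designators listed out of declaration order. */
static struct flow_sketch make_unordered(int iif, int proto)
{
    struct flow_sketch fl = {
        .proto = proto,
        .iif   = iif,
    };
    return fl;
}

/* Same initializer, reordered to match the structure definition. */
static struct flow_sketch make_ordered(int iif, int proto)
{
    struct flow_sketch fl = {
        .iif   = iif,
        .proto = proto,
    };
    return fl;
}
```

In both versions the field that is not named (`mark`) is zero-initialized, so the reordering changes only how the initializer reads, not the resulting values.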
  4. 14 Aug, 2017 1 commit
  5. 08 Aug, 2017 1 commit
  6. 06 Jul, 2017 1 commit
    • vrf: fix bug_on triggered by rx when destroying a vrf · f630c38e
      Nikolay Aleksandrov authored
      When destroying a VRF device we clean up the slaves in its ndo_uninit()
      function, but that causes packets to be switched (skb->dev == vrf being
      destroyed) even though we're past the point where the VRF should be
      receiving any packets while it is being dismantled. This causes a BUG_ON
      to trigger if we have raw sockets (trace below).
      The reason is that the inetdev of the VRF has been destroyed but we're
      still sending packets up the stack with it, so let's free the slaves in
      the dellink callback as David Ahern suggested.
      
      Note that this fix doesn't prevent packets from going up when the VRF
      device is admin down.
      
      [   35.631371] ------------[ cut here ]------------
      [   35.631603] kernel BUG at net/ipv4/fib_frontend.c:285!
      [   35.631854] invalid opcode: 0000 [#1] SMP
      [   35.631977] Modules linked in:
      [   35.632081] CPU: 2 PID: 22 Comm: ksoftirqd/2 Not tainted 4.12.0-rc7+ #45
      [   35.632247] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
      [   35.632477] task: ffff88005ad68000 task.stack: ffff88005ad64000
      [   35.632632] RIP: 0010:fib_compute_spec_dst+0xfc/0x1ee
      [   35.632769] RSP: 0018:ffff88005ad67978 EFLAGS: 00010202
      [   35.632910] RAX: 0000000000000001 RBX: ffff880059a7f200 RCX: 0000000000000000
      [   35.633084] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffffff82274af0
      [   35.633256] RBP: ffff88005ad679f8 R08: 000000000001ef70 R09: 0000000000000046
      [   35.633430] R10: ffff88005ad679f8 R11: ffff880037731cb0 R12: 0000000000000001
      [   35.633603] R13: ffff8800599e3000 R14: 0000000000000000 R15: ffff8800599cb852
      [   35.634114] FS:  0000000000000000(0000) GS:ffff88005d900000(0000) knlGS:0000000000000000
      [   35.634306] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [   35.634456] CR2: 00007f3563227095 CR3: 000000000201d000 CR4: 00000000000406e0
      [   35.634632] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [   35.634865] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [   35.635055] Call Trace:
      [   35.635271]  ? __lock_acquire+0xf0d/0x1117
      [   35.635522]  ipv4_pktinfo_prepare+0x82/0x151
      [   35.635831]  raw_rcv_skb+0x17/0x3c
      [   35.636062]  raw_rcv+0xe5/0xf7
      [   35.636287]  raw_local_deliver+0x169/0x1d9
      [   35.636534]  ip_local_deliver_finish+0x87/0x1c4
      [   35.636820]  ip_local_deliver+0x63/0x7f
      [   35.637058]  ip_rcv_finish+0x340/0x3a1
      [   35.637295]  ip_rcv+0x314/0x34a
      [   35.637525]  __netif_receive_skb_core+0x49f/0x7c5
      [   35.637780]  ? lock_acquire+0x13f/0x1d7
      [   35.638018]  ? lock_acquire+0x15e/0x1d7
      [   35.638259]  __netif_receive_skb+0x1e/0x94
      [   35.638502]  ? __netif_receive_skb+0x1e/0x94
      [   35.638748]  netif_receive_skb_internal+0x74/0x300
      [   35.639002]  ? dev_gro_receive+0x2ed/0x411
      [   35.639246]  ? lock_is_held_type+0xc4/0xd2
      [   35.639491]  napi_gro_receive+0x105/0x1a0
      [   35.639736]  receive_buf+0xc32/0xc74
      [   35.639965]  ? detach_buf+0x67/0x153
      [   35.640201]  ? virtqueue_get_buf_ctx+0x120/0x176
      [   35.640453]  virtnet_poll+0x128/0x1c5
      [   35.640690]  net_rx_action+0x103/0x343
      [   35.640932]  __do_softirq+0x1c7/0x4b7
      [   35.641171]  run_ksoftirqd+0x23/0x5c
      [   35.641403]  smpboot_thread_fn+0x24f/0x26d
      [   35.641646]  ? sort_range+0x22/0x22
      [   35.641878]  kthread+0x129/0x131
      [   35.642104]  ? __list_add+0x31/0x31
      [   35.642335]  ? __list_add+0x31/0x31
      [   35.642568]  ret_from_fork+0x2a/0x40
      [   35.642804] Code: 05 bd 87 a3 00 01 e8 1f ef 98 ff 4d 85 f6 48 c7 c7 f0 4a 27 82 41 0f 94 c4 31 c9 31 d2 41 0f b6 f4 e8 04 71 a1 ff 45 84 e4 74 02 <0f> 0b 0f b7 93 c4 00 00 00 4d 8b a5 80 05 00 00 48 03 93 d0 00
      [   35.644342] RIP: fib_compute_spec_dst+0xfc/0x1ee RSP: ffff88005ad67978
      
      Fixes: 193125db ("net: Introduce VRF device driver")
      Reported-by: Chris Cormier <chriscormier@cumulusnetworks.com>
      Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Acked-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  7. 27 Jun, 2017 2 commits
  8. 18 Jun, 2017 2 commits
    • net: remove DST_NOCACHE flag · a4c2fd7f
      Wei Wang authored
      DST_NOCACHE flag check has been removed from dst_release() and
      dst_hold_safe() in a previous patch because all the dst are now ref
      counted properly and can be released based on refcnt only.
      Looking at the rest of the DST_NOCACHE use, all of them can now be
      removed or replaced with other checks.
      So this patch gets rid of all the DST_NOCACHE usage and removes this flag
      completely.
      Signed-off-by: Wei Wang <weiwan@google.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv6: take dst->__refcnt for insertion into fib6 tree · 1cfb71ee
      Wei Wang authored
      In IPv6 routing code, struct rt6_info is created for each static route
      and RTF_CACHE route and inserted into fib6 tree. In both cases, dst
      ref count is not taken.
      As explained in the previous patch, this leads to the need of the dst
      garbage collector.
      
      This patch holds ref count of dst before inserting the route into fib6
      tree and properly releases the dst when deleting it from the fib6 tree
      as a preparation in order to fully get rid of dst gc later.
      
      Also, correct fib6_age() logic to check dst->__refcnt to be 1 to indicate
      no user is referencing the dst.
      
      And remove dst_hold() in vrf_rt6_create() as ip6_dst_alloc() already puts
      dst->__refcnt to 1.
      Signed-off-by: Wei Wang <weiwan@google.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
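The reference-counting rule in this commit can be sketched in a few lines of plain C. All names here (`dst_sketch` and its helpers) are hypothetical and only model the idea. The sketch assumes that the initial reference set at allocation is the one the tree ends up owning, so a count of exactly 1 means that no user outside the tree holds the entry.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a ref-counted route entry; not the kernel's dst_entry. */
struct dst_sketch {
    int  refcnt;
    bool destroyed;
};

/* Allocation starts at 1; this sketch treats that reference as the tree's. */
static void dst_sketch_init(struct dst_sketch *d)
{
    d->refcnt = 1;
    d->destroyed = false;
}

static void dst_sketch_hold(struct dst_sketch *d)
{
    d->refcnt++;
}

/* Dropping the last reference destroys the entry; no garbage collector needed. */
static void dst_sketch_release(struct dst_sketch *d)
{
    assert(d->refcnt > 0);
    if (--d->refcnt == 0)
        d->destroyed = true;
}

/* Aging check: only the tree's own reference remains. */
static bool dst_sketch_unused(const struct dst_sketch *d)
{
    return d->refcnt == 1;
}
```

Under this model an extra hold right after allocation would leave the count at 2 forever, which is the kind of imbalance the removal of `dst_hold()` in `vrf_rt6_create()` avoids.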
  9. 16 Jun, 2017 1 commit
    • networking: make skb_push & __skb_push return void pointers · d58ff351
      Johannes Berg authored
      It seems like a historic accident that these return unsigned char *,
      and in many places that means casts are required, more often than not.
      
      Make these functions return void * and remove all the casts across
      the tree, adding a (u8 *) cast only where the unsigned char pointer
      was used directly, all done with the following spatch:
      
          @@
          expression SKB, LEN;
          typedef u8;
          identifier fn = { skb_push, __skb_push, skb_push_rcsum };
          @@
          - *(fn(SKB, LEN))
          + *(u8 *)fn(SKB, LEN)
      
          @@
          expression E, SKB, LEN;
          identifier fn = { skb_push, __skb_push, skb_push_rcsum };
          type T;
          @@
          - E = ((T *)(fn(SKB, LEN)))
          + E = fn(SKB, LEN)
      
          @@
          expression SKB, LEN;
          identifier fn = { skb_push, __skb_push, skb_push_rcsum };
          @@
          - fn(SKB, LEN)[0]
          + *(u8 *)fn(SKB, LEN)
      
      Note that the last part there converts from push(...)[0] to the
      more idiomatic *(u8 *)push(...).
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
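The effect of returning `void *` can be seen in a small user-space sketch. `buf_sketch`, `hdr_sketch`, and `buf_sketch_push` are hypothetical names that only mimic the shape of a push-style helper; this is not the kernel's `skb_push`. Because C converts `void *` to any object pointer implicitly, the header assignment needs no cast, while the single-byte write keeps an explicit `(uint8_t *)` cast, mirroring the spatch rules above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical buffer whose data pointer grows downward on push. */
struct buf_sketch {
    uint8_t  storage[64];
    uint8_t *data;
};

struct hdr_sketch {
    uint8_t type;
    uint8_t len;
};

static void buf_sketch_init(struct buf_sketch *b)
{
    b->data = b->storage + sizeof(b->storage);
}

/* Returns void * so callers can assign to any header pointer without a cast. */
static void *buf_sketch_push(struct buf_sketch *b, size_t len)
{
    assert((size_t)(b->data - b->storage) >= len); /* enough headroom */
    b->data -= len;
    return b->data;
}

static void prepend_example(struct buf_sketch *b)
{
    struct hdr_sketch *h = buf_sketch_push(b, sizeof(*h)); /* no cast needed */

    h->type = 1;
    h->len  = 2;
    *(uint8_t *)buf_sketch_push(b, 1) = 0x7f; /* direct byte use keeps a cast */
}
```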
  10. 09 Jun, 2017 1 commit
  11. 08 Jun, 2017 1 commit
    • net: Fix inconsistent teardown and release of private netdev state. · cf124db5
      David S. Miller authored
      Network devices can allocate resources and private memory using
      netdev_ops->ndo_init().  However, the release of these resources
      can occur in one of two different places.
      
      Either netdev_ops->ndo_uninit() or netdev->destructor().
      
      The decision of which operation frees the resources depends upon
      whether it is necessary for all netdev refs to be released before it
      is safe to perform the freeing.
      
      netdev_ops->ndo_uninit() presumably can occur right after the
      NETDEV_UNREGISTER notifier completes and the unicast and multicast
      address lists are flushed.
      
      netdev->destructor(), on the other hand, does not run until the
      netdev references all go away.
      
      Further complicating the situation is that netdev->destructor()
      almost universally also does a free_netdev().
      
      This creates a problem for the logic in register_netdevice().
      Because all callers of register_netdevice() manage the freeing
      of the netdev, and invoke free_netdev(dev) if register_netdevice()
      fails.
      
      If netdev_ops->ndo_init() succeeds, but something else fails inside
      of register_netdevice(), it does call netdev_ops->ndo_uninit().  But
      it is not able to invoke netdev->destructor().
      
      This is because netdev->destructor() will do a free_netdev() and
      then the caller of register_netdevice() will do the same.
      
      However, this means that the resources that would normally be released
      by netdev->destructor() will not be.
      
      Over the years drivers have added local hacks to deal with this, by
      invoking their destructor parts by hand when register_netdevice()
      fails.
      
      Many drivers do not try to deal with this, and instead we have leaks.
      
      Let's close this hole by formalizing the distinction between what
      private things need to be freed up by netdev->destructor() and whether
      the driver needs unregister_netdevice() to perform the free_netdev().
      
      netdev->priv_destructor() performs all actions to free up the private
      resources that used to be freed by netdev->destructor(), except for
      free_netdev().
      
      netdev->needs_free_netdev is a boolean that indicates whether
      free_netdev() should be done at the end of unregister_netdevice().
      
      Now, register_netdevice() can sanely release all resources after
      netdev_ops->ndo_init() succeeds, by invoking both netdev_ops->ndo_uninit()
      and netdev->priv_destructor().
      
      And at the end of unregister_netdevice(), we invoke
      netdev->priv_destructor() and optionally call free_netdev().
      Signed-off-by: David S. Miller <davem@davemloft.net>
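The split between private cleanup and freeing the device can be modeled with a small sketch. Everything here (`dev_sketch`, `sketch_register`, `sketch_unregister`, and the counters) is hypothetical and greatly simplified; it is not the kernel's `register_netdevice()` and ignores the real ordering of notifiers and reference draining. It only shows why a failed registration can now release private state without double-freeing the device.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical device model; field names loosely echo the commit message. */
struct dev_sketch {
    int  (*init)(struct dev_sketch *);
    void (*uninit)(struct dev_sketch *);
    void (*priv_destructor)(struct dev_sketch *); /* frees private state only */
    bool needs_free;     /* should unregister also free the device? */
    int  priv_allocated; /* 1 while private resources are held */
    int  freed;          /* counts how often the device itself was freed */
};

static int  sketch_init(struct dev_sketch *d)            { d->priv_allocated = 1; return 0; }
static void sketch_uninit(struct dev_sketch *d)          { (void)d; }
static void sketch_priv_destructor(struct dev_sketch *d) { d->priv_allocated = 0; }
static void sketch_free(struct dev_sketch *d)            { d->freed++; }

/* On a late failure, undo init and private state; the caller frees the device. */
static int sketch_register(struct dev_sketch *d, bool later_step_fails)
{
    int err = d->init(d);

    if (err)
        return err;
    if (later_step_fails) {
        d->uninit(d);
        if (d->priv_destructor)
            d->priv_destructor(d);
        return -1;
    }
    return 0;
}

static void sketch_unregister(struct dev_sketch *d)
{
    d->uninit(d);
    if (d->priv_destructor)
        d->priv_destructor(d);
    if (d->needs_free)
        sketch_free(d);
}
```

The design point is that the private destructor never frees the device, so the register failure path and the unregister path can both call it safely.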
  12. 12 May, 2017 1 commit
    • driver: vrf: Fix one possible use-after-free issue · 1a4a5bf5
      Gao Feng authored
      The current code only handles the case where the skb is dropped. It
      can hit a use-after-free when NF_HOOK returns 0, which means the skb
      was stolen by a netfilter rule or hook.
      
      When a netfilter rule or hook steals the skb and returns NF_STOLEN,
      the skb belongs to that rule and other modules must not touch it
      again. The rule may have queued the skb or freed it directly.
      
      Use nf_hook instead of NF_HOOK to get the netfilter result, and check
      the return value of nf_hook. Only when the value equals 1 may the skb
      proceed; otherwise set the skb to NULL.
      
      Also, because vrf_rcv_finish is an empty function, there is no need to
      invoke it even when nf_hook returns 1. But vrf_rcv_finish must be
      modified to deal with the NF_STOLEN case.
      
      There are two cases when skb is stolen.
      1. The skb is stolen and freed directly.
         There is nothing we need to do, and vrf_rcv_finish isn't invoked.
      2. The skb is queued and reinjected again.
         The vrf_rcv_finish would be invoked as okfn, so need to free the
         skb in it.
      Signed-off-by: Gao Feng <gfree.wind@vip.163.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
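The return-value check described in this commit can be sketched as follows. The verdict constants, `pkt_sketch`, and `rcv_sketch` are hypothetical and stand in for the hook call; this is not the kernel's `nf_hook()` signature. The sketch follows the convention stated in the commit message: only a result of 1 lets the packet continue, and any other result means the caller must stop using the pointer.

```c
#include <assert.h>
#include <stddef.h>

struct pkt_sketch {
    int id;
};

/* Hypothetical verdicts: 1 = continue, 0 = stolen or queued, negative = dropped. */
enum {
    VERDICT_DROPPED = -1,
    VERDICT_STOLEN  = 0,
    VERDICT_PASS    = 1,
};

typedef int (*hook_sketch_fn)(struct pkt_sketch *);

static int hook_pass(struct pkt_sketch *p)  { (void)p; return VERDICT_PASS; }
static int hook_steal(struct pkt_sketch *p) { (void)p; return VERDICT_STOLEN; }
static int hook_drop(struct pkt_sketch *p)  { (void)p; return VERDICT_DROPPED; }

/* Returns the packet only if the hook let it pass; otherwise NULL. */
static struct pkt_sketch *rcv_sketch(struct pkt_sketch *p, hook_sketch_fn hook)
{
    if (hook(p) != VERDICT_PASS)
        return NULL; /* ownership has moved elsewhere; do not touch p again */
    return p;
}
```

Returning NULL for both the stolen and the dropped case gives the caller a single condition to test before it continues processing.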
  13. 28 Apr, 2017 1 commit
  14. 18 Apr, 2017 2 commits
  15. 23 Mar, 2017 2 commits
    • net: vrf: performance improvements for IPv6 · a9ec54d1
      David Ahern authored
      The VRF driver allows users to implement device based features for an
      entire domain. For example, a qdisc or netfilter rules can be attached
      to a VRF device or tcpdump can be used to view packets for all devices
      in the L3 domain.
      
      The device-based features come with a performance penalty, most
      notably in the Tx path. The VRF driver uses the l3mdev_l3_out hook
      to switch the dst on an skb to its private dst. This allows the skb
      to traverse the xmit stack with the device set to the VRF device
      which in turn enables the netfilter and qdisc features. The VRF
      driver then performs the FIB lookup again and reinserts the packet.
      
      This patch avoids the redirect for IPv6 packets if a qdisc has not
      been attached to a VRF device which is the default config. In this
      case the netfilter hooks and network taps are directly traversed in
      the l3mdev_l3_out handler. If a qdisc is attached to a VRF device,
      then the redirect using the vrf dst is done.
      
      Additional overhead is removed by only checking packet taps if a
      socket is open on the device (vrf_dev->ptype_all list is not empty).
      Packet sockets bound to any device will still get a copy of the
      packet via the real ingress or egress interface.
      
      The end result of this change is that the overhead of VRF for the
      default, baseline case (i.e., no netfilter rules, no packet sockets,
      no qdisc) ranges from a +3% improvement for UDP, which has a lookup
      per packet (VRF being better than no l3mdev), to a ~2% loss for TCP_CRR,
      which connects a socket for each request-response.
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: vrf: performance improvements for IPv4 · dcdd43c4
      David Ahern authored
      The VRF driver allows users to implement device based features for an
      entire domain. For example, a qdisc or netfilter rules can be attached
      to a VRF device or tcpdump can be used to view packets for all devices
      in the L3 domain.
      
      The device-based features come with a performance penalty, most
      notably in the Tx path. The VRF driver uses the l3mdev_l3_out hook
      to switch the dst on an skb to its private dst. This allows the skb
      to traverse the xmit stack with the device set to the VRF device
      which in turn enables the netfilter and qdisc features. The VRF
      driver then performs the FIB lookup again and reinserts the packet.
      
      This patch avoids the redirect for IPv4 packets if a qdisc has not
      been attached to a VRF device which is the default config. In this
      case the netfilter hooks and network taps are directly traversed in
      the l3mdev_l3_out handler. If a qdisc is attached to a VRF device,
      then the redirect using the vrf dst is done.
      
      Additional overhead is removed by only checking packet taps if a
      socket is open on the device (vrf_dev->ptype_all list is not empty).
      Packet sockets bound to any device will still get a copy of the
      packet via the real ingress or egress interface.
      
      The end result of this change is a decrease in the overhead of VRF
      for the default, baseline case (i.e., no netfilter rules, no packet
      sockets, no qdisc) to ~3% for UDP, which has a lookup per packet, and
      < 1% overhead for connected sockets that leverage early demux and
      avoid FIB lookups.
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  16. 22 Mar, 2017 1 commit
    • net: vrf: Reset rt6i_idev in local dst after put · 3dc857f0
      David Ahern authored
      The VRF driver takes a reference to the inet6_dev on the VRF device for
      its rt6_local dst when handling local traffic through the VRF device as
      a loopback. When the device is deleted the driver does a put on the idev
      but does not reset rt6i_idev in the rt6_info struct. When the dst is
      destroyed, dst_destroy calls ip6_dst_destroy which does a second put for
      what is essentially the same reference causing it to be prematurely freed.
      Reset rt6i_idev after the put in the vrf driver.
      
      Fixes: b4869aa2 ("net: vrf: ipv6 support for local traffic to
                             local addresses")
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
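The double-put problem and its fix can be shown with a minimal sketch. `idev_sketch`, `route_sketch`, `driver_teardown`, and `route_destroy` are hypothetical names that model two cleanup paths sharing one pointer; they are not the kernel's `inet6_dev`, `rt6_info`, or `ip6_dst_destroy`. The sketch assumes the later destroy path skips the put when the pointer is NULL.

```c
#include <assert.h>
#include <stddef.h>

struct idev_sketch {
    int refcnt;
};

struct route_sketch {
    struct idev_sketch *idev; /* holds one reference while non-NULL */
};

static void idev_sketch_put(struct idev_sketch *i)
{
    assert(i->refcnt > 0);
    i->refcnt--;
}

/* Driver path: drop the reference and clear the pointer so nobody repeats it. */
static void driver_teardown(struct route_sketch *rt)
{
    if (rt->idev) {
        idev_sketch_put(rt->idev);
        rt->idev = NULL;
    }
}

/* Generic destroy path: only puts if a reference is still recorded. */
static void route_destroy(struct route_sketch *rt)
{
    if (rt->idev) {
        idev_sketch_put(rt->idev);
        rt->idev = NULL;
    }
}
```

Without the `rt->idev = NULL` in `driver_teardown`, the second path would drop a reference it does not own, which matches the premature free described above.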
  17. 17 Mar, 2017 1 commit
  18. 09 Mar, 2017 1 commit
    • vrf: Fix use-after-free in vrf_xmit · f7887d40
      David Ahern authored
      KASAN detected a use-after-free:
      
      [  269.467067] BUG: KASAN: use-after-free in vrf_xmit+0x7f1/0x827 [vrf] at addr ffff8800350a21c0
      [  269.467067] Read of size 4 by task ssh/1879
      [  269.467067] CPU: 1 PID: 1879 Comm: ssh Not tainted 4.10.0+ #249
      [  269.467067] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
      [  269.467067] Call Trace:
      [  269.467067]  dump_stack+0x81/0xb6
      [  269.467067]  kasan_object_err+0x21/0x78
      [  269.467067]  kasan_report+0x2f7/0x450
      [  269.467067]  ? vrf_xmit+0x7f1/0x827 [vrf]
      [  269.467067]  ? ip_output+0xa4/0xdb
      [  269.467067]  __asan_load4+0x6b/0x6d
      [  269.467067]  vrf_xmit+0x7f1/0x827 [vrf]
      ...
      
      Which corresponds to the skb access after xmit handling. Fix by saving
      skb->len and using the saved value to update stats.
      
      Fixes: 193125db ("net: Introduce VRF device driver")
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
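The fix pattern named in this commit, saving the length before handing the buffer off, can be sketched in user-space C. `pkt_sketch`, `stats_sketch`, `consume_pkt`, and `xmit_and_count` are hypothetical; `consume_pkt` simply frees the packet to model a transmit step that takes ownership. This is not the code of `vrf_xmit`.

```c
#include <assert.h>
#include <stdlib.h>

struct pkt_sketch {
    unsigned int len;
};

struct stats_sketch {
    unsigned long tx_packets;
    unsigned long tx_bytes;
    unsigned long tx_drops;
};

/* Hypothetical transmit step that takes ownership of the packet and frees it. */
static int consume_pkt(struct pkt_sketch *p)
{
    free(p);
    return 0;
}

static int xmit_and_count(struct pkt_sketch *p, struct stats_sketch *st)
{
    unsigned int len = p->len; /* save before ownership passes on */
    int ret = consume_pkt(p);  /* p must not be dereferenced after this call */

    if (ret == 0) {
        st->tx_packets++;
        st->tx_bytes += len;   /* uses the saved value, not p->len */
    } else {
        st->tx_drops++;
    }
    return ret;
}
```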
  19. 12 Feb, 2017 1 commit
  20. 08 Feb, 2017 1 commit
  21. 11 Jan, 2017 2 commits
  22. 09 Jan, 2017 1 commit
  23. 04 Jan, 2017 1 commit
  24. 17 Dec, 2016 2 commits
    • net: vrf: Drop conntrack data after pass through VRF device on Tx · eb63ecc1
      David Ahern authored
      Locally originated traffic in a VRF fails in the presence of a POSTROUTING
      rule. For example,
      
          $ iptables -t nat -A POSTROUTING -s 11.1.1.0/24  -j MASQUERADE
          $ ping -I red -c1 11.1.1.3
          ping: Warning: source address might be selected on device other than red.
          PING 11.1.1.3 (11.1.1.3) from 11.1.1.2 red: 56(84) bytes of data.
          ping: sendmsg: Operation not permitted
      
      Worse, the above causes random corruption resulting in a panic in random
      places (I have not seen a consistent backtrace).
      
      Call nf_reset to drop the conntrack info following the pass through the
      VRF device.  The nf_reset is needed on Tx but not Rx because of the order
      in which NF_HOOK's are hit: on Rx the VRF device is after the real ingress
      device and on Tx it is before the real egress device. Connection
      tracking should be tied to the real egress device and not the VRF device.
      
      Fixes: 8f58336d ("net: Add ethernet header for pass through VRF device")
      Fixes: 35402e31 ("net: Add IPv6 support to VRF device")
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: vrf: Fix NAT within a VRF · a0f37efa
      David Ahern authored
      Connection tracking with VRF is broken because the pass through the VRF
      device drops the connection tracking info. Removing the call to nf_reset
      allows DNAT and MASQUERADE to work across interfaces within a VRF.
      
      Fixes: 73e20b76 ("net: vrf: Add support for PREROUTING rules on vrf device")
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  25. 01 Nov, 2016 1 commit
    • net: Enable support for VRF with ipv4 multicast · e58e4159
      David Ahern authored
      Enable support for IPv4 multicast:
      - similar to unicast the flow struct is updated to L3 master device
        if relevant prior to calling fib_rules_lookup. The table id is saved
        to the lookup arg so the rule action for ipmr can return the table
        associated with the device.
      
      - ip_mr_forward needs to check for master device mismatch as well
        since the skb->dev is set to it
      
      - allow multicast address on VRF device for Rx by checking for the
        daddr in the VRF device as well as the original ingress device
      
      - on Tx need to drop to __mkroute_output when FIB lookup fails for
        multicast destination address.
      
      - if CONFIG_IP_MROUTE_MULTIPLE_TABLES is enabled VRF driver creates
        IPMR FIB rules on first device create similar to FIB rules. In
        addition the VRF driver does not divert IPv4 multicast packets:
        it breaks on Tx since the fib lookup fails on the mcast address.
      
      With this patch, ipmr forwarding and local rx/tx work.
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  26. 17 Oct, 2016 1 commit
    • net: Require exact match for TCP socket lookups if dif is l3mdev · a04a480d
      David Ahern authored
      Currently, socket lookups for l3mdev (vrf) use cases can match a socket
      that is bound to a port but not a device (i.e., a global socket). If the
      sysctl tcp_l3mdev_accept is not set, this leads to ack packets going out
      based on the main table even though the packet came in from an L3 domain.
      The end result is that the connection does not establish, creating
      confusion for users since the service is running and a socket shows in
      ss output. Fix by requiring an exact dif to sk_bound_dev_if match if the
      skb came through an interface enslaved to an l3mdev device and
      tcp_l3mdev_accept is not set.
      
      skb's through an l3mdev interface are marked by setting a flag in
      inet{6}_skb_parm. The IPv6 variant is already set; this patch adds the
      flag for IPv4. Using an skb flag avoids a device lookup on the dif. The
      flag is set in the VRF driver using the IP{6}CB macros. For IPv4, the
      inet_skb_parm struct is moved in the cb per commit 971f10ec, so the
      match function in the TCP stack needs to use TCP_SKB_CB. For IPv6, the
      move is done after the socket lookup, so IP6CB is used.
      
      The flags field in inet_skb_parm struct needs to be increased to add
      another flag. There is currently a 1-byte hole following the flags,
      so it can be expanded to u16 without increasing the size of the struct.
      
      Fixes: 193125db ("net: Introduce VRF device driver")
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
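The remark about the 1-byte hole can be checked with a small layout sketch. `parm_u8` and `parm_u16` are hypothetical, simplified structs rather than the kernel's `inet_skb_parm`, and the result assumes a common ABI in which `uint16_t` is 2-byte aligned. Under that assumption, a `uint8_t` followed by a `uint16_t` leaves one padding byte, so widening the first field to `uint16_t` changes neither the struct size nor the offset of the field after it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout before the change: 1-byte flags, then a padding hole. */
struct parm_u8 {
    int      iif;
    uint8_t  flags;
    uint16_t frag_max_size;
};

/* Hypothetical layout after the change: flags widened into the hole. */
struct parm_u16 {
    int      iif;
    uint16_t flags;
    uint16_t frag_max_size;
};

/* Returns nonzero if widening flags leaves size and the next offset unchanged. */
static int widening_is_free(void)
{
    return sizeof(struct parm_u8) == sizeof(struct parm_u16) &&
           offsetof(struct parm_u8, frag_max_size) ==
           offsetof(struct parm_u16, frag_max_size);
}
```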
  27. 17 Sep, 2016 1 commit
  28. 11 Sep, 2016 5 commits