1. 10 10月, 2017 1 次提交
  2. 01 10月, 2017 1 次提交
    • P
      udp: perform source validation for mcast early demux · bc044e8d
      Paolo Abeni 提交于
      The UDP early demux can leverate the rx dst cache even for
      multicast unconnected sockets.
      
      In such scenario the ipv4 source address is validated only on
      the first packet in the given flow. After that, when we fetch
      the dst entry  from the socket rx cache, we stop enforcing
      the rp_filter and we even start accepting any kind of martian
      addresses.
      
      Disabling the dst cache for unconnected multicast socket will
      cause large performace regression, nearly reducing by half the
      max ingress tput.
      
      Instead we factor out a route helper to completely validate an
      skb source address for multicast packets and we call it from
      the UDP early demux for mcast packets landing on unconnected
      sockets, after successful fetching the related cached dst entry.
      
      This still gives a measurable, but limited performance
      regression:
      
      		rp_filter = 0		rp_filter = 1
      edmux disabled:	1182 Kpps		1127 Kpps
      edmux before:	2238 Kpps		2238 Kpps
      edmux after:	2037 Kpps		2019 Kpps
      
      The above figures are on top of current net tree.
      Applying the net-next commit 6e617de8 ("net: avoid a full
      fib lookup when rp_filter is disabled.") the delta with
      rp_filter == 0 will decrease even more.
      
      Fixes: 421b3885 ("udp: ipv4: Add udp early demux")
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bc044e8d
  3. 19 8月, 2017 2 次提交
    • R
      net: check and errout if res->fi is NULL when RTM_F_FIB_MATCH is set · bc3aae2b
      Roopa Prabhu 提交于
      Syzkaller hit 'general protection fault in fib_dump_info' bug on
      commit 4.13-rc5..
      
      Guilty file: net/ipv4/fib_semantics.c
      
      kasan: GPF could be caused by NULL-ptr deref or user memory access
      general protection fault: 0000 [#1] SMP KASAN
      Modules linked in:
      CPU: 0 PID: 2808 Comm: syz-executor0 Not tainted 4.13.0-rc5 #1
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
      Ubuntu-1.8.2-1ubuntu1 04/01/2014
      task: ffff880078562700 task.stack: ffff880078110000
      RIP: 0010:fib_dump_info+0x388/0x1170 net/ipv4/fib_semantics.c:1314
      RSP: 0018:ffff880078117010 EFLAGS: 00010206
      RAX: dffffc0000000000 RBX: 00000000000000fe RCX: 0000000000000002
      RDX: 0000000000000006 RSI: ffff880078117084 RDI: 0000000000000030
      RBP: ffff880078117268 R08: 000000000000000c R09: ffff8800780d80c8
      R10: 0000000058d629b4 R11: 0000000067fce681 R12: 0000000000000000
      R13: ffff8800784bd540 R14: ffff8800780d80b5 R15: ffff8800780d80a4
      FS:  00000000022fa940(0000) GS:ffff88007fc00000(0000)
      knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00000000004387d0 CR3: 0000000079135000 CR4: 00000000000006f0
      Call Trace:
        inet_rtm_getroute+0xc89/0x1f50 net/ipv4/route.c:2766
        rtnetlink_rcv_msg+0x288/0x680 net/core/rtnetlink.c:4217
        netlink_rcv_skb+0x340/0x470 net/netlink/af_netlink.c:2397
        rtnetlink_rcv+0x28/0x30 net/core/rtnetlink.c:4223
        netlink_unicast_kernel net/netlink/af_netlink.c:1265 [inline]
        netlink_unicast+0x4c4/0x6e0 net/netlink/af_netlink.c:1291
        netlink_sendmsg+0x8c4/0xca0 net/netlink/af_netlink.c:1854
        sock_sendmsg_nosec net/socket.c:633 [inline]
        sock_sendmsg+0xca/0x110 net/socket.c:643
        ___sys_sendmsg+0x779/0x8d0 net/socket.c:2035
        __sys_sendmsg+0xd1/0x170 net/socket.c:2069
        SYSC_sendmsg net/socket.c:2080 [inline]
        SyS_sendmsg+0x2d/0x50 net/socket.c:2076
        entry_SYSCALL_64_fastpath+0x1a/0xa5
        RIP: 0033:0x4512e9
        RSP: 002b:00007ffc75584cc8 EFLAGS: 00000216 ORIG_RAX:
        000000000000002e
        RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00000000004512e9
        RDX: 0000000000000000 RSI: 0000000020f2cfc8 RDI: 0000000000000003
        RBP: 000000000000000e R08: 0000000000000000 R09: 0000000000000000
        R10: 0000000000000000 R11: 0000000000000216 R12: fffffffffffffffe
        R13: 0000000000718000 R14: 0000000020c44ff0 R15: 0000000000000000
        Code: 00 0f b6 8d ec fd ff ff 48 8b 85 f0 fd ff ff 88 48 17 48 8b 45
        28 48 8d 78 30 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03
        <0f>
        b6 04 02 84 c0 74 08 3c 03 0f 8e cb 0c 00 00 48 8b 45 28 44
        RIP: fib_dump_info+0x388/0x1170 net/ipv4/fib_semantics.c:1314 RSP:
        ffff880078117010
      ---[ end trace 254a7af28348f88b ]---
      
      This patch adds a res->fi NULL check.
      
      example run:
      $ip route get 0.0.0.0 iif virt1-0
      broadcast 0.0.0.0 dev lo
          cache <local,brd> iif virt1-0
      
      $ip route get 0.0.0.0 iif virt1-0 fibmatch
      RTNETLINK answers: No route to host
      Reported-by: Nidaifish <idaifish@gmail.com>
      Reported-by: NDmitry Vyukov <dvyukov@google.com>
      Fixes: b6179813 ("net: ipv4: RTM_GETROUTE: return matched fib result when requested")
      Signed-off-by: NRoopa Prabhu <roopa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bc3aae2b
    • E
      ipv4: convert dst_metrics.refcnt from atomic_t to refcount_t · 9620fef2
      Eric Dumazet 提交于
      refcount_t type and corresponding API should be
      used instead of atomic_t when the variable is used as
      a reference counter. This allows to avoid accidental
      refcounter overflows that might lead to use-after-free
      situations.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9620fef2
  4. 17 8月, 2017 1 次提交
    • E
      ipv4: better IP_MAX_MTU enforcement · c780a049
      Eric Dumazet 提交于
      While working on yet another syzkaller report, I found
      that our IP_MAX_MTU enforcements were not properly done.
      
      gcc seems to reload dev->mtu for min(dev->mtu, IP_MAX_MTU), and
      final result can be bigger than IP_MAX_MTU :/
      
      This is a problem because device mtu can be changed on other cpus or
      threads.
      
      While this patch does not fix the issue I am working on, it is
      probably worth addressing it.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c780a049
  5. 16 8月, 2017 1 次提交
  6. 15 8月, 2017 1 次提交
    • F
      ipv4: route: fix inet_rtm_getroute induced crash · 2c87d63a
      Florian Westphal 提交于
      "ip route get $daddr iif eth0 from $saddr" causes:
       BUG: KASAN: use-after-free in ip_route_input_rcu+0x1535/0x1b50
       Call Trace:
        ip_route_input_rcu+0x1535/0x1b50
        ip_route_input_noref+0xf9/0x190
        tcp_v4_early_demux+0x1a4/0x2b0
        ip_rcv+0xbcb/0xc05
        __netif_receive_skb+0x9c/0xd0
        netif_receive_skb_internal+0x5a8/0x890
      
      Problem is that inet_rtm_getroute calls either ip_route_input_rcu (if an
      iif was provided) or ip_route_output_key_hash_rcu.
      
      But ip_route_input_rcu, unlike ip_route_output_key_hash_rcu, already
      associates the dst_entry with the skb.  This clears the SKB_DST_NOREF
      bit (i.e. skb_dst_drop will release/free the entry while it should not).
      
      Thus only set the dst if we called ip_route_output_key_hash_rcu().
      
      I tested this patch by running:
       while true;do ip r get 10.0.1.2;done > /dev/null &
       while true;do ip r get 10.0.1.2 iif eth0  from 10.0.1.1;done > /dev/null &
      ... and saw no crash or memory leak.
      
      Cc: Roopa Prabhu <roopa@cumulusnetworks.com>
      Cc: David Ahern <dsahern@gmail.com>
      Fixes: ba52d61e ("ipv4: route: restore skb_dst_set in inet_rtm_getroute")
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2c87d63a
  7. 14 8月, 2017 1 次提交
  8. 12 8月, 2017 1 次提交
    • D
      net: ipv4: set orig_oif based on fib result for local traffic · 839da4d9
      David Ahern 提交于
      Attempts to connect to a local address with a socket bound
      to a device with the local address hangs if there is no listener:
      
        $ ip addr sh dev eth1
        3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
          link/ether 02:e0:f9:1c:00:37 brd ff:ff:ff:ff:ff:ff
          inet 10.100.1.4/24 scope global eth1
             valid_lft forever preferred_lft forever
          inet6 2001:db8:1::4/120 scope global
             valid_lft forever preferred_lft forever
          inet6 fe80::e0:f9ff:fe1c:37/64 scope link
             valid_lft forever preferred_lft forever
      
        $ vrf-test -I eth1 -r 10.100.1.4
        <hangs when there is no server>
      
      (don't let the command name fool you; vrf-test works without vrfs.)
      
      The problem is that the original intended device, eth1 in this case, is
      lost when the tcp reset is sent, so the socket lookup does not find a
      match for the reset and the connect attempt hangs. Fix by adjusting
      orig_oif for local traffic to the device from the fib lookup result.
      
      With this patch you get the more user friendly:
        $ vrf-test -I eth1 -r 10.100.1.4
        connect failed: 111: Connection refused
      
      orig_oif is saved to the newly created rtable as rt_iif and when set
      it is used as the dif for socket lookups. It is set based on flowi4_oif
      passed in to ip_route_output_key_hash_rcu and will be set to either
      the loopback device, an l3mdev device, nothing (flowi4_oif = 0 which
      is the case in the example above) or a netdev index depending on the
      lookup path. In each case, resetting orig_oif to the device in the fib
      result for the RTN_LOCAL case allows the actual device to be preserved
      as the skb tx and rx is done over the loopback or VRF device.
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      839da4d9
  9. 10 8月, 2017 1 次提交
  10. 20 6月, 2017 1 次提交
  11. 18 6月, 2017 7 次提交
    • W
      net: remove DST_NOCACHE flag · a4c2fd7f
      Wei Wang 提交于
      DST_NOCACHE flag check has been removed from dst_release() and
      dst_hold_safe() in a previous patch because all the dst are now ref
      counted properly and can be released based on refcnt only.
      Looking at the rest of the DST_NOCACHE use, all of them can now be
      removed or replaced with other checks.
      So this patch gets rid of all the DST_NOCACHE usage and remove this flag
      completely.
      Signed-off-by: NWei Wang <weiwan@google.com>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a4c2fd7f
    • W
      net: remove DST_NOGC flag · b2a9c0ed
      Wei Wang 提交于
      Now that all the components have been changed to release dst based on
      refcnt only and not depend on dst gc anymore, we can remove the
      temporary flag DST_NOGC.
      
      Note that we also need to remove the DST_NOCACHE check in dst_release()
      and dst_hold_safe() because now all the dst are released based on refcnt
      and behaves as DST_NOCACHE.
      Signed-off-by: NWei Wang <weiwan@google.com>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b2a9c0ed
    • W
      ipv4: mark DST_NOGC and remove the operation of dst_free() · b838d5e1
      Wei Wang 提交于
      With the previous preparation patches, we are ready to get rid of the
      dst gc operation in ipv4 code and release dst based on refcnt only.
      So this patch adds DST_NOGC flag for all IPv4 dst and remove the calls
      to dst_free().
      At this point, all dst created in ipv4 code do not use the dst gc
      anymore and will be destroyed at the point when refcnt drops to 0.
      Signed-off-by: NWei Wang <weiwan@google.com>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b838d5e1
    • W
      ipv4: call dst_hold_safe() properly · 9df16efa
      Wei Wang 提交于
      This patch checks all the calls to
      dst_hold()/skb_dst_force()/dst_clone()/dst_use() to see if
      dst_hold_safe() is needed to avoid double free issue if dst
      gc is removed and dst_release() directly destroys dst when
      dst->__refcnt drops to 0.
      
      In tx path, TCP hold sk->sk_rx_dst ref count and also hold sock_lock().
      UDP and other similar protocols always hold refcount for
      skb->_skb_refdst. So both paths seem to be safe.
      
      In rx path, as it is lockless and skb_dst_set_noref() is likely to be
      used, dst_hold_safe() should always be used when trying to hold dst.
      
      In the routing code, if dst is held during an rcu protected session, it
      is necessary to call dst_hold_safe() as the current dst might be in its
      rcu grace period.
      Signed-off-by: NWei Wang <weiwan@google.com>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9df16efa
    • W
      ipv4: call dst_dev_put() properly · 95c47f9c
      Wei Wang 提交于
      As the intend of this patch series is to completely remove dst gc,
      we need to call dst_dev_put() to release the reference to dst->dev
      when removing routes from fib because we won't keep the gc list anymore
      and will lose the dst pointer right after removing the routes.
      Without the gc list, there is no way to find all the dst's that have
      dst->dev pointing to the going-down dev.
      Hence, we are doing dst_dev_put() immediately before we lose the last
      reference of the dst from the routing code. The next dst_check() will
      trigger a route re-lookup to find another route (if there is any).
      Signed-off-by: NWei Wang <weiwan@google.com>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      95c47f9c
    • W
      ipv4: take dst->__refcnt when caching dst in fib · 0830106c
      Wei Wang 提交于
      In IPv4 routing code, fib_nh and fib_nh_exception can hold pointers
      to struct rtable but they never increment dst->__refcnt.
      This leads to the need of the dst garbage collector because when user
      is done with this dst and calls dst_release(), it can only decrement
      dst->__refcnt and can not free the dst even it sees dst->__refcnt
      drops from 1 to 0 (unless DST_NOCACHE flag is set) because the routing
      code might still hold reference to it.
      And when the routing code tries to delete a route, it has to put the
      dst to the gc_list if dst->__refcnt is not yet 0 and have a gc thread
      running periodically to check on dst->__refcnt and finally to free dst
      when refcnt becomes 0.
      
      This patch increments dst->__refcnt when
      fib_nh/fib_nh_exception holds reference to this dst and properly release
      the dst when fib_nh/fib_nh_exception has been updated with a new dst.
      
      This patch is a preparation in order to fully get rid of dst gc later.
      Signed-off-by: NWei Wang <weiwan@google.com>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0830106c
    • W
      net: use loopback dev when generating blackhole route · 1dbe3252
      Wei Wang 提交于
      Existing ipv4/6_blackhole_route() code generates a blackhole route
      with dst->dev pointing to the passed in dst->dev.
      It is not necessary to hold reference to the passed in dst->dev
      because the packets going through this route are dropped anyway.
      A loopback interface is good enough so that we don't need to worry about
      releasing this dst->dev when this dev is going down.
      Signed-off-by: NWei Wang <weiwan@google.com>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1dbe3252
  12. 01 6月, 2017 1 次提交
  13. 27 5月, 2017 6 次提交
  14. 09 5月, 2017 2 次提交
  15. 25 4月, 2017 1 次提交
    • R
      ipv4: Avoid caching l3mdev dst on mismatched local route · b7c8487c
      Robert Shearman 提交于
      David reported that doing the following:
      
          ip li add red type vrf table 10
          ip link set dev eth1 vrf red
          ip addr add 127.0.0.1/8 dev red
          ip link set dev eth1 up
          ip li set red up
          ping -c1 -w1 -I red 127.0.0.1
          ip li del red
      
      when either policy routing IP rules are present or the local table
      lookup ip rule is before the l3mdev lookup results in a hang with
      these messages:
      
          unregister_netdevice: waiting for red to become free. Usage count = 1
      
      The problem is caused by caching the dst used for sending the packet
      out of the specified interface on a local route with a different
      nexthop interface. Thus the dst could stay around until the route in
      the table the lookup was done is deleted which may be never.
      
      Address the problem by not forcing output device to be the l3mdev in
      the flow's output interface if the lookup didn't use the l3mdev. This
      then results in the dst using the right device according to the route.
      
      Changes in v2:
       - make the dev_out passed in by __ip_route_output_key_hash correct
         instead of checking the nh dev if FLOWI_FLAG_SKIP_NH_OIF is set as
         suggested by David.
      
      Fixes: 5f02ce24 ("net: l3mdev: Allow the l3mdev to be a loopback")
      Reported-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Suggested-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: NRobert Shearman <rshearma@brocade.com>
      Acked-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Tested-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b7c8487c
  16. 18 4月, 2017 1 次提交
  17. 14 4月, 2017 2 次提交
  18. 07 4月, 2017 2 次提交
    • F
      net: ipv4: fix multipath RTM_GETROUTE behavior when iif is given · bbadb9a2
      Florian Larysch 提交于
      inet_rtm_getroute synthesizes a skeletal ICMP skb, which is passed to
      ip_route_input when iif is given. If a multipath route is present for
      the designated destination, fib_multipath_hash ends up being called with
      that skb. However, as that skb contains no information beyond the
      protocol type, the calculated hash does not match the one we would see
      for a real packet.
      
      There is currently no way to fix this for layer 4 hashing, as
      RTM_GETROUTE doesn't have the necessary information to create layer 4
      headers. To fix this for layer 3 hashing, set appropriate saddr/daddrs
      in the skb and also change the protocol to UDP to avoid special
      treatment for ICMP.
      Signed-off-by: NFlorian Larysch <fl@n621.de>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bbadb9a2
    • F
      net: ipv4: fix multipath RTM_GETROUTE behavior when iif is given · a8801799
      Florian Larysch 提交于
      inet_rtm_getroute synthesizes a skeletal ICMP skb, which is passed to
      ip_route_input when iif is given. If a multipath route is present for
      the designated destination, ip_multipath_icmp_hash ends up being called,
      which uses the source/destination addresses within the skb to calculate
      a hash. However, those are not set in the synthetic skb, causing it to
      return an arbitrary and incorrect result.
      
      Instead, use UDP, which gets no such special treatment.
      Signed-off-by: NFlorian Larysch <fl@n621.de>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a8801799
  19. 22 3月, 2017 1 次提交
    • N
      net: ipv4: add support for ECMP hash policy choice · bf4e0a3d
      Nikolay Aleksandrov 提交于
      This patch adds support for ECMP hash policy choice via a new sysctl
      called fib_multipath_hash_policy and also adds support for L4 hashes.
      The current values for fib_multipath_hash_policy are:
       0 - layer 3 (default)
       1 - layer 4
      If there's an skb hash already set and it matches the chosen policy then it
      will be used instead of being calculated (currently only for L4).
      In L3 mode we always calculate the hash due to the ICMP error special
      case, the flow dissector's field consistentification should handle the
      address order thus we can remove the address reversals.
      If the skb is provided we always use it for the hash calculation,
      otherwise we fallback to fl4, that is if skb is NULL fl4 has to be set.
      Signed-off-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bf4e0a3d
  20. 27 2月, 2017 2 次提交
  21. 08 2月, 2017 1 次提交
  22. 13 1月, 2017 1 次提交
  23. 10 1月, 2017 1 次提交
  24. 09 1月, 2017 1 次提交