1. 25 Dec, 2019 3 commits
  2. 22 Nov, 2019 1 commit
  3. 21 Nov, 2019 1 commit
  4. 08 Nov, 2019 1 commit
    • E
      ipv6: fixes rt6_probe() and fib6_nh->last_probe init · 1bef4c22
      Eric Dumazet authored
      While looking at a syzbot KCSAN report [1], I found multiple
      issues in this code :
      
      1) fib6_nh->last_probe has an initial value of 0.
      
         While probably okay on 64bit kernels, this causes an issue
         on 32bit kernels since the time_after(jiffies, 0 + interval)
         might be false ~24 days after boot (for HZ=1000)
      
      2) The data-race found by KCSAN
         I could use READ_ONCE() and WRITE_ONCE(), but we can also
         take the opportunity to avoid piling up too many rt6_probe_deferred()
         works by using cmpxchg() instead, so that only one cpu wins the race.
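
      A minimal Python sketch of both points, under stated assumptions: time_after() is
      modelled as a signed 32-bit comparison, jiffies values are counted from zero (any
      boot-time offset is ignored), INTERVAL is a hypothetical probe interval, and the
      Word class is a lock-based stand-in for cmpxchg(), not kernel code.

```python
import threading

BITS = 32                      # width of `long` on a 32-bit kernel
MASK = (1 << BITS) - 1


def to_signed(value):
    """Reinterpret a value as a BITS-wide two's-complement integer."""
    value &= MASK
    return value - (1 << BITS) if value >> (BITS - 1) else value


def time_after(a, b):
    """Model of time_after(a, b): true when a is later than b, modulo wraparound."""
    return to_signed(b - a) < 0


HZ = 1000
INTERVAL = 60 * HZ             # hypothetical probe interval, in jiffies
LAST_PROBE_INIT = 0            # the problematic initial value from point 1)

one_hour = 3600 * HZ
twenty_five_days = 25 * 24 * 3600 * HZ   # more than 2**31 jiffies

probe_allowed_early = time_after(one_hour, LAST_PROBE_INIT + INTERVAL)
probe_allowed_late = time_after(twenty_five_days, LAST_PROBE_INIT + INTERVAL)


class Word:
    """A shared word with a lock-based stand-in for cmpxchg()."""

    def __init__(self, value):
        self.value = value
        self._lock = threading.Lock()

    def cmpxchg(self, old, new):
        # Store `new` only if the word still holds `old`; return what was there.
        with self._lock:
            current = self.value
            if current == old:
                self.value = new
            return current


def race(n_cpus, now):
    """Let n_cpus threads try to claim the probe; return how many succeeded."""
    last_probe = Word(1)
    barrier = threading.Barrier(n_cpus)
    winners = []

    def cpu():
        seen = last_probe.value        # every thread reads the same old value
        barrier.wait()                 # ...before any of them tries to update it
        if last_probe.cmpxchg(seen, now) == seen:
            winners.append(1)          # only this thread would schedule the work

    threads = [threading.Thread(target=cpu) for _ in range(n_cpus)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return len(winners)
```

      In this model the probe is allowed one hour in but refused 25 days in, and
      race(4, 5000) reports a single winner however the threads are scheduled.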
      
      [1]
      BUG: KCSAN: data-race in find_match / find_match
      
      write to 0xffff8880bb7aabe8 of 8 bytes by interrupt on cpu 1:
       rt6_probe net/ipv6/route.c:663 [inline]
       find_match net/ipv6/route.c:757 [inline]
       find_match+0x5bd/0x790 net/ipv6/route.c:733
       __find_rr_leaf+0xe3/0x780 net/ipv6/route.c:831
       find_rr_leaf net/ipv6/route.c:852 [inline]
       rt6_select net/ipv6/route.c:896 [inline]
       fib6_table_lookup+0x383/0x650 net/ipv6/route.c:2164
       ip6_pol_route+0xee/0x5c0 net/ipv6/route.c:2200
       ip6_pol_route_output+0x48/0x60 net/ipv6/route.c:2452
       fib6_rule_lookup+0x3d6/0x470 net/ipv6/fib6_rules.c:117
       ip6_route_output_flags_noref+0x16b/0x230 net/ipv6/route.c:2484
       ip6_route_output_flags+0x50/0x1a0 net/ipv6/route.c:2497
       ip6_dst_lookup_tail+0x25d/0xc30 net/ipv6/ip6_output.c:1049
       ip6_dst_lookup_flow+0x68/0x120 net/ipv6/ip6_output.c:1150
       inet6_csk_route_socket+0x2f7/0x420 net/ipv6/inet6_connection_sock.c:106
       inet6_csk_xmit+0x91/0x1f0 net/ipv6/inet6_connection_sock.c:121
       __tcp_transmit_skb+0xe81/0x1d60 net/ipv4/tcp_output.c:1169
       tcp_transmit_skb net/ipv4/tcp_output.c:1185 [inline]
       tcp_xmit_probe_skb+0x19b/0x1d0 net/ipv4/tcp_output.c:3735
      
      read to 0xffff8880bb7aabe8 of 8 bytes by interrupt on cpu 0:
       rt6_probe net/ipv6/route.c:657 [inline]
       find_match net/ipv6/route.c:757 [inline]
       find_match+0x521/0x790 net/ipv6/route.c:733
       __find_rr_leaf+0xe3/0x780 net/ipv6/route.c:831
       find_rr_leaf net/ipv6/route.c:852 [inline]
       rt6_select net/ipv6/route.c:896 [inline]
       fib6_table_lookup+0x383/0x650 net/ipv6/route.c:2164
       ip6_pol_route+0xee/0x5c0 net/ipv6/route.c:2200
       ip6_pol_route_output+0x48/0x60 net/ipv6/route.c:2452
       fib6_rule_lookup+0x3d6/0x470 net/ipv6/fib6_rules.c:117
       ip6_route_output_flags_noref+0x16b/0x230 net/ipv6/route.c:2484
       ip6_route_output_flags+0x50/0x1a0 net/ipv6/route.c:2497
       ip6_dst_lookup_tail+0x25d/0xc30 net/ipv6/ip6_output.c:1049
       ip6_dst_lookup_flow+0x68/0x120 net/ipv6/ip6_output.c:1150
       inet6_csk_route_socket+0x2f7/0x420 net/ipv6/inet6_connection_sock.c:106
       inet6_csk_xmit+0x91/0x1f0 net/ipv6/inet6_connection_sock.c:121
       __tcp_transmit_skb+0xe81/0x1d60 net/ipv4/tcp_output.c:1169
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 0 PID: 18894 Comm: udevd Not tainted 5.4.0-rc3+ #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      
      Fixes: cc3a86c8 ("ipv6: Change rt6_probe to take a fib6_nh")
      Fixes: f547fac6 ("ipv6: rate-limit probes for neighbourless routes")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 06 Nov, 2019 1 commit
  6. 05 Nov, 2019 1 commit
  7. 17 Oct, 2019 1 commit
  8. 16 Oct, 2019 1 commit
    • M
      blackhole_netdev: fix syzkaller reported issue · b0818f80
      Mahesh Bandewar authored
      While invalidating the dst, we assign blackhole_netdev instead of
      loopback device. However, this device does not have idev pointer
      and hence no ip6_ptr even if IPv6 is enabled. Possibly this has
      triggered the syzbot reported crash.
      
      The syzbot report does not have reproducer, however, this is the
      only device that doesn't have matching idev created.
      
      Crash instruction is :
      
      static inline bool ip6_ignore_linkdown(const struct net_device *dev)
      {
              const struct inet6_dev *idev = __in6_dev_get(dev);
      
              return !!idev->cnf.ignore_routes_with_linkdown; <= crash
      }
      
      Also, ipv6 always assumes the presence of an idev and never checks for
      it being NULL (as in the code referenced above). So add an idev for
      the blackhole_netdev to avoid this class of crashes in the future.
      Signed-off-by: David S. Miller <davem@davemloft.net>
  9. 12 Sep, 2019 1 commit
    • S
      ipv6: Don't use dst gateway directly in ip6_confirm_neigh() · cbfd6891
      Stefano Brivio authored
      This is the equivalent of commit 2c6b55f4 ("ipv6: fix neighbour
      resolution with raw socket") for ip6_confirm_neigh(): we can send a
      packet with MSG_CONFIRM on a raw socket for a connected route, so the
      gateway would be :: here, and we should pick the next hop using
      rt6_nexthop() instead.
      
      This was found by code review and, to the best of my knowledge, doesn't
      actually fix a practical issue: the destination address from the packet
      is not considered while confirming a neighbour, as ip6_confirm_neigh()
      calls choose_neigh_daddr() without passing the packet, so there are no
      similar issues as the one fixed by said commit.
      
      A possible source of issues with the existing implementation might come
      from the fact that, if we have a cached dst, we won't consider it,
      while rt6_nexthop() takes care of that. I might just not be creative
      enough to find a practical problem here: the only way to affect this
      with cached routes is to have one coming from an ICMPv6 redirect, but
      if the next hop is a directly connected host, there should be no
      topology for which a redirect applies here, and tests with redirected
      routes show no differences for MSG_CONFIRM (and MSG_PROBE) packets on
      raw sockets destined to a directly connected host.
      
      However, directly using the dst gateway here is not consistent anymore
      with neighbour resolution, and, in general, as we want the next hop,
      using rt6_nexthop() looks like the only sane way to fetch it.
      Reported-by: Guillaume Nault <gnault@redhat.com>
      Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
      Acked-by: Guillaume Nault <gnault@redhat.com>
      Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  10. 07 Sep, 2019 1 commit
  11. 05 Sep, 2019 3 commits
    • D
      net: Properly update v4 routes with v6 nexthop · 7bdf4de1
      Donald Sharp authored
      When creating a v4 route that uses a v6 nexthop from a nexthop group,
      allow the kernel to properly send the nexthop as v6 via the RTA_VIA
      attribute.
      
      Broken behavior:
      
      $ ip nexthop add via fe80::9 dev eth0
      $ ip nexthop show
      id 1 via fe80::9 dev eth0 scope link
      $ ip route add 4.5.6.7/32 nhid 1
      $ ip route show
      default via 10.0.2.2 dev eth0
      4.5.6.7 nhid 1 via 254.128.0.0 dev eth0
      10.0.2.0/24 dev eth0 proto kernel scope link src 10.0.2.15
      $
      
      Fixed behavior:
      
      $ ip nexthop add via fe80::9 dev eth0
      $ ip nexthop show
      id 1 via fe80::9 dev eth0 scope link
      $ ip route add 4.5.6.7/32 nhid 1
      $ ip route show
      default via 10.0.2.2 dev eth0
      4.5.6.7 nhid 1 via inet6 fe80::9 dev eth0
      10.0.2.0/24 dev eth0 proto kernel scope link src 10.0.2.15
      $
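
      The bogus gateway 254.128.0.0 in the broken output is what the first four
      bytes of fe80::9 look like when read as an IPv4 address. That reading is
      inferred from the two addresses rather than stated in the commit; a short
      check with Python's ipaddress module:

```python
import ipaddress

nexthop_v6 = ipaddress.IPv6Address("fe80::9")

# Take the leading 4 of the 16 address bytes and interpret them as IPv4.
misread_v4 = ipaddress.IPv4Address(nexthop_v6.packed[:4])

print(misread_v4)
```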
      
      v2, v3: Addresses code review comments from David Ahern
      
      Fixes: dcb1ecb5 ("ipv4: Prepare for fib6_nh from a nexthop object")
      Signed-off-by: Donald Sharp <sharpd@cumulusnetworks.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • D
      ipv6: Fix RTA_MULTIPATH with nexthop objects · 4255ff05
      David Ahern authored
      A change to the core nla helpers was missed during the push of
      the nexthop changes. rt6_fill_node_nexthop should be calling
      nla_nest_start_noflag not nla_nest_start. Currently, iproute2
      does not print multipath data because of parsing issues with
      the attribute.
      
      Fixes: f88d8ea6 ("ipv6: Plumb support for nexthop object in a fib6_info")
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • M
      net-ipv6: fix excessive RTF_ADDRCONF flag on ::1/128 local route (and others) · d55a2e37
      Maciej Żenczykowski authored
      There is a subtle change in behaviour introduced by:
        commit c7a1ce39
        'ipv6: Change addrconf_f6i_alloc to use ip6_route_info_create'
      
      Before that patch /proc/net/ipv6_route includes:
      00000000000000000000000000000001 80 00000000000000000000000000000000 00 00000000000000000000000000000000 00000000 00000003 00000000 80200001 lo
      
      Afterwards /proc/net/ipv6_route includes:
      00000000000000000000000000000001 80 00000000000000000000000000000000 00 00000000000000000000000000000000 00000000 00000002 00000000 80240001 lo
      
      i.e. the above commit causes the ::1/128 local (automatic) route to be flagged with RTF_ADDRCONF (0x040000).
      
      AFAICT, this is incorrect since these routes are *not* coming from RAs.
      
      As such, this patch restores the old behaviour.
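
      The difference between the two /proc/net/ipv6_route lines can be made visible by
      decoding the flags column. This is a sketch only: route_flags() is a hypothetical
      helper, the flags are assumed to be the ninth whitespace-separated field, and
      apart from RTF_ADDRCONF (given above) the bit values are assumed to match the
      kernel's uapi ipv6_route.h.

```python
# Flag bits; only RTF_ADDRCONF is quoted in the commit message itself.
FLAG_NAMES = {
    0x00000001: "RTF_UP",
    0x00040000: "RTF_ADDRCONF",
    0x00200000: "RTF_NONEXTHOP",
    0x80000000: "RTF_LOCAL",
}


def route_flags(line):
    """Return the set of known flag names set on one /proc/net/ipv6_route line."""
    fields = line.split()
    flags = int(fields[8], 16)     # ninth column: route flags, in hex
    return {name for bit, name in FLAG_NAMES.items() if flags & bit}


BEFORE = ("00000000000000000000000000000001 80 "
          "00000000000000000000000000000000 00 "
          "00000000000000000000000000000000 00000000 00000003 00000000 80200001 lo")
AFTER = ("00000000000000000000000000000001 80 "
         "00000000000000000000000000000000 00 "
         "00000000000000000000000000000000 00000000 00000002 00000000 80240001 lo")

extra_flags = route_flags(AFTER) - route_flags(BEFORE)
print(sorted(extra_flags))
```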
      
      Fixes: c7a1ce39 ("ipv6: Change addrconf_f6i_alloc to use ip6_route_info_create")
      Cc: David Ahern <dsahern@gmail.com>
      Cc: Lorenzo Colitti <lorenzo@google.com>
      Signed-off-by: Maciej Żenczykowski <maze@google.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  12. 06 Aug, 2019 2 commits
  13. 19 Jul, 2019 1 commit
  14. 18 Jul, 2019 1 commit
  15. 09 Jul, 2019 1 commit
  16. 02 Jul, 2019 1 commit
  17. 28 Jun, 2019 1 commit
    • D
      ipv6: Convert gateway validation to use fib6_info · b2c709cc
      David Ahern authored
      Gateway validation does not need a dst_entry, it only needs the fib
      entry to validate the gateway resolution and egress device. So,
      convert ip6_nh_lookup_table from ip6_pol_route to fib6_table_lookup
      and ip6_route_check_nh to use fib6_lookup over rt6_lookup.
      
      ip6_pol_route is a call to fib6_table_lookup and if successful a call
      to fib6_select_path. From there the exception cache is searched for an
      entry or a dst_entry is created to return to the caller. The exception
      entry is not relevant for gateway validation, so what matters are the
      calls to fib6_table_lookup and then fib6_select_path.
      
      Similarly, rt6_lookup can be replaced with a call to fib6_lookup with
      RT6_LOOKUP_F_IFACE set in flags. Again, the exception cache search is
      not relevant, only the lookup with path selection. The primary difference
      in the lookup paths is the use of rt6_select with fib6_lookup versus
      rt6_device_match with rt6_lookup. When you remove complexities in the
      rt6_select path, e.g.,
      1. saddr is not set for gateway validation, so RT6_LOOKUP_F_HAS_SADDR
         is not relevant
      2. rt6_check_neigh is not called so that removes the RT6_NUD_FAIL_DO_RR
         return and round-robin logic.
      
      the code paths are believed to be equivalent for the given use case:
      validating the gateway, optionally against a given device. Furthermore,
      this aligns the validation with the onlink code path and the lookup path
      actually used for rx and tx.
      
      Adjust the users, ip6_route_check_nh_onlink and ip6_route_check_nh to
      handle a fib6_info vs a rt6_info when performing validation checks.
      
      Existing selftests fib-onlink-tests.sh and fib_tests.sh are used to
      verify the changes.
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Reviewed-by: Wei Wang <weiwan@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  18. 27 Jun, 2019 2 commits
    • N
      ipv6: fix neighbour resolution with raw socket · 2c6b55f4
      Nicolas Dichtel authored
      The scenario is the following: the user uses a raw socket to send an ipv6
      packet, destined to a not-connected network, and specifies a connected nh.
      Here is the corresponding python script to reproduce this scenario:
      
       import socket
       IPPROTO_RAW = 255
       send_s = socket.socket(socket.AF_INET6, socket.SOCK_RAW, IPPROTO_RAW)
       # scapy
       # p = IPv6(src='fd00:100::1', dst='fd00:200::fa')/ICMPv6EchoRequest()
       # str(p)
       req = b'`\x00\x00\x00\x00\x08:@\xfd\x00\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\xfd\x00\x02\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xfa\x80\x00\x81\xc0\x00\x00\x00\x00'
       send_s.sendto(req, ('fd00:175::2', 0, 0, 0))
      
      fd00:175::/64 is a connected route and fd00:200::fa is not a connected
      host.
      
      With this scenario, the kernel starts by sending a NS to resolve
      fd00:175::2. When it receives the NA, it flushes its queue and tries to
      send the initial packet. But instead of sending it, it sends another NS to
      resolve fd00:200::fa, which obviously fails, thus the packet is dropped. If
      the user sends the packet again, it now uses the right nh (fd00:175::2).
      
      The problem is that ip6_dst_lookup_neigh() uses the rt6i_gateway, which is
      :: because the associated route is a connected route, thus it uses the dst
      addr of the packet. Let's use rt6_nexthop() to choose the right nh.
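
      The address selection described above can be sketched as a small Python model.
      The function names mirror the kernel helpers, but the bodies are simplified
      assumptions drawn from this description, not copies of the kernel code.

```python
from ipaddress import IPv6Address

UNSPECIFIED = IPv6Address("::")


def choose_neigh_daddr(gateway, packet_daddr, flow_daddr):
    """Pick the address to resolve: a real gateway wins, then the packet header."""
    if gateway != UNSPECIFIED:
        return gateway
    if packet_daddr is not None:
        return packet_daddr
    return flow_daddr


def rt6_nexthop(route_gateway, flow_daddr):
    """Next hop of a route: its gateway, or the flow destination if connected."""
    return route_gateway if route_gateway != UNSPECIFIED else flow_daddr


route_gateway = UNSPECIFIED                 # connected route fd00:175::/64
flow_daddr = IPv6Address("fd00:175::2")     # address passed to sendto()
packet_daddr = IPv6Address("fd00:200::fa")  # destination in the IPv6 header

# Old behaviour: the :: gateway is used directly, so the header address wins.
before = choose_neigh_daddr(route_gateway, packet_daddr, flow_daddr)

# New behaviour: the next hop is computed first and used as the gateway.
after = choose_neigh_daddr(rt6_nexthop(route_gateway, flow_daddr),
                           packet_daddr, flow_daddr)

print(before, after)
```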
      Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • E
      ipv6: fix suspicious RCU usage in rt6_dump_route() · 3b525691
      Eric Dumazet authored
      syzbot reminded us that rt6_nh_dump_exceptions() needs to be called
      with rcu_read_lock()
      
      net/ipv6/route.c:1593 suspicious rcu_dereference_check() usage!
      
      other info that might help us debug this:
      
      rcu_scheduler_active = 2, debug_locks = 1
      2 locks held by syz-executor609/8966:
       #0: 00000000b7dbe288 (rtnl_mutex){+.+.}, at: netlink_dump+0xe7/0xfb0 net/netlink/af_netlink.c:2199
       #1: 00000000f2d87c21 (&(&tb->tb6_lock)->rlock){+...}, at: spin_lock_bh include/linux/spinlock.h:343 [inline]
       #1: 00000000f2d87c21 (&(&tb->tb6_lock)->rlock){+...}, at: fib6_dump_table.isra.0+0x37e/0x570 net/ipv6/ip6_fib.c:533
      
      stack backtrace:
      CPU: 0 PID: 8966 Comm: syz-executor609 Not tainted 5.2.0-rc5+ #43
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x172/0x1f0 lib/dump_stack.c:113
       lockdep_rcu_suspicious+0x153/0x15d kernel/locking/lockdep.c:5250
       fib6_nh_get_excptn_bucket+0x18e/0x1b0 net/ipv6/route.c:1593
       rt6_nh_dump_exceptions+0x45/0x4d0 net/ipv6/route.c:5541
       rt6_dump_route+0x904/0xc50 net/ipv6/route.c:5640
       fib6_dump_node+0x168/0x280 net/ipv6/ip6_fib.c:467
       fib6_walk_continue+0x4a9/0x8e0 net/ipv6/ip6_fib.c:1986
       fib6_walk+0x9d/0x100 net/ipv6/ip6_fib.c:2034
       fib6_dump_table.isra.0+0x38a/0x570 net/ipv6/ip6_fib.c:534
       inet6_dump_fib+0x93c/0xb00 net/ipv6/ip6_fib.c:624
       rtnl_dump_all+0x295/0x490 net/core/rtnetlink.c:3445
       netlink_dump+0x558/0xfb0 net/netlink/af_netlink.c:2244
       __netlink_dump_start+0x5b1/0x7d0 net/netlink/af_netlink.c:2352
       netlink_dump_start include/linux/netlink.h:226 [inline]
       rtnetlink_rcv_msg+0x73d/0xb00 net/core/rtnetlink.c:5182
       netlink_rcv_skb+0x177/0x450 net/netlink/af_netlink.c:2477
       rtnetlink_rcv+0x1d/0x30 net/core/rtnetlink.c:5237
       netlink_unicast_kernel net/netlink/af_netlink.c:1302 [inline]
       netlink_unicast+0x531/0x710 net/netlink/af_netlink.c:1328
       netlink_sendmsg+0x8ae/0xd70 net/netlink/af_netlink.c:1917
       sock_sendmsg_nosec net/socket.c:646 [inline]
       sock_sendmsg+0xd7/0x130 net/socket.c:665
       sock_write_iter+0x27c/0x3e0 net/socket.c:994
       call_write_iter include/linux/fs.h:1872 [inline]
       new_sync_write+0x4d3/0x770 fs/read_write.c:483
       __vfs_write+0xe1/0x110 fs/read_write.c:496
       vfs_write+0x20c/0x580 fs/read_write.c:558
       ksys_write+0x14f/0x290 fs/read_write.c:611
       __do_sys_write fs/read_write.c:623 [inline]
       __se_sys_write fs/read_write.c:620 [inline]
       __x64_sys_write+0x73/0xb0 fs/read_write.c:620
       do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x4401b9
      Code: 18 89 d0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 fb 13 fc ff c3 66 2e 0f 1f 84 00 00 00 00
      RSP: 002b:00007ffc8e134978 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
      RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 00000000004401b9
      RDX: 000000000000001c RSI: 0000000020000000 RDI: 00
      
      Fixes: 1e47b483 ("ipv6: Dump route exceptions if requested")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Stefano Brivio <sbrivio@redhat.com>
      Cc: David Ahern <dsahern@gmail.com>
      Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  19. 26 Jun, 2019 1 commit
  20. 25 Jun, 2019 3 commits
    • S
      ipv6: Dump route exceptions if requested · 1e47b483
      Stefano Brivio authored
      Since commit 2b760fcf ("ipv6: hook up exception table to store dst
      cache"), route exceptions reside in a separate hash table, and won't be
      found by walking the FIB, so they won't be dumped to userspace on a
      RTM_GETROUTE message.
      
      This causes 'ip -6 route list cache' and 'ip -6 route flush cache' to
      have no function anymore:
      
       # ip -6 route get fc00:3::1
       fc00:3::1 via fc00:1::2 dev veth_A-R1 src fc00:1::1 metric 1024 expires 539sec mtu 1400 pref medium
       # ip -6 route get fc00:4::1
       fc00:4::1 via fc00:2::2 dev veth_A-R2 src fc00:2::1 metric 1024 expires 536sec mtu 1500 pref medium
       # ip -6 route list cache
       # ip -6 route flush cache
       # ip -6 route get fc00:3::1
       fc00:3::1 via fc00:1::2 dev veth_A-R1 src fc00:1::1 metric 1024 expires 520sec mtu 1400 pref medium
       # ip -6 route get fc00:4::1
       fc00:4::1 via fc00:2::2 dev veth_A-R2 src fc00:2::1 metric 1024 expires 519sec mtu 1500 pref medium
      
      because iproute2 lists cached routes using RTM_GETROUTE, and flushes them
      by listing all the routes, and deleting them with RTM_DELROUTE one by one.
      
      If cached routes are requested using the RTM_F_CLONED flag together with
      strict checking, or if no strict checking is requested (and hence we can't
      consistently apply filters), look up exceptions in the hash table
      associated with the current fib6_info in rt6_dump_route(), and, if present
      and not expired, add them to the dump.
      
      We might be unable to dump all the entries for a given node in a single
      message, so keep track of how many entries were handled for the current
      node in fib6_walker, and skip that amount in case we start from the same
      partially dumped node.
      
      When a partial dump restarts, as the starting node might change when
      'sernum' changes, we have no guarantee that we need to skip the same
      amount of in-node entries. Therefore, we need two counters, and we need to
      zero the in-node counter if the node from which the dump is resumed
      differs.
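
      A toy Python model of the resume logic, for illustration only: dump_pass(),
      dump_all(), the state dictionary and the per-message budget are hypothetical
      names, and the walker's two counters are reduced here to the identity of the
      last node plus one in-node skip count.

```python
def dump_pass(nodes, resume_node, state, budget):
    """Dump up to `budget` entries, starting at `resume_node`.

    nodes -- ordered list of (node_name, entries)
    state -- {"last_node": name or None, "skip_in_node": int}, kept between passes
    Returns (dumped_entries, done).
    """
    # The in-node counter is only meaningful for the node it was recorded on.
    if state["last_node"] != resume_node:
        state["skip_in_node"] = 0

    out = []
    started = False
    for name, entries in nodes:
        if not started:
            if name != resume_node:
                continue
            started = True
        for index, entry in enumerate(entries):
            if index < state["skip_in_node"]:
                continue                   # already sent in an earlier message
            if len(out) == budget:
                # Message is full: remember where to pick up next time.
                state["last_node"] = name
                state["skip_in_node"] = index
                return out, False
            out.append(entry)
        state["skip_in_node"] = 0          # node finished, next one starts fresh
    return out, True


def dump_all(nodes, budget):
    """Repeat partial passes until every entry has been dumped once."""
    state = {"last_node": None, "skip_in_node": 0}
    resume, collected, done = nodes[0][0], [], False
    while not done:
        chunk, done = dump_pass(nodes, resume, state, budget)
        collected.extend(chunk)
        resume = state["last_node"]
    return collected


NODES = [("A", ["a1", "a2", "a3"]), ("B", ["b1", "b2"])]
print(dump_all(NODES, budget=2))
```

      With a budget of two entries per message, node A is split across two passes and
      the skip count keeps a1 and a2 from being sent twice. If a pass resumes from a
      different node than the one recorded, the stale skip count is discarded.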
      
      Note that, with the current version of iproute2, this only fixes the
      'ip -6 route list cache': on a flush command, iproute2 doesn't pass
      RTM_F_CLONED and, due to this inconsistency, 'ip -6 route flush cache' is
      still unable to fetch the routes to be flushed. This will be addressed in
      a patch for iproute2.
      
      To flush cached routes, a procfs entry could be introduced instead: that's
      how it works for IPv4. We already have a rt6_flush_exception() function
      ready to be wired to it. However, this would not solve the issue for
      listing.
      
      Versions of iproute2 and kernel tested:
      
                          iproute2
      kernel             4.14.0   4.15.0   4.19.0   5.0.0   5.1.0    5.1.0, patched
       3.18    list        +        +        +        +       +            +
               flush       +        +        +        +       +            +
       4.4     list        +        +        +        +       +            +
               flush       +        +        +        +       +            +
       4.9     list        +        +        +        +       +            +
               flush       +        +        +        +       +            +
       4.14    list        +        +        +        +       +            +
               flush       +        +        +        +       +            +
       4.15    list
               flush
       4.19    list
               flush
       5.0     list
               flush
       5.1     list
               flush
       with    list        +        +        +        +       +            +
       fix     flush       +        +        +                             +
      
      v7:
        - Explain usage of "skip" counters in commit message (suggested by
          David Ahern)
      
      v6:
        - Rebase onto net-next, use recently introduced nexthop walker
        - Make rt6_nh_dump_exceptions() a separate function (suggested by David
          Ahern)
      
      v5:
        - Use dump_routes and dump_exceptions from filter, ignore NLM_F_MATCH,
          update test results (flushing works with iproute2 < 5.0.0 now)
      
      v4:
        - Split NLM_F_MATCH and strict check handling in separate patches
        - Filter routes using RTM_F_CLONED: if it's not set, only return
          non-cached routes, and if it's set, only return cached routes:
          change requested by David Ahern and Martin Lau. This implies that
          iproute2 needs a separate patch to be able to flush IPv6 cached
          routes. This is not ideal because we can't fix the breakage caused
          by 2b760fcf entirely in kernel. However, two years have passed
          since then, and this makes it more tolerable
      
      v3:
        - More descriptive comment about expired exceptions in rt6_dump_route()
        - Swap return values of rt6_dump_route() (suggested by Martin Lau)
        - Don't zero skip_in_node in case we don't dump anything in a given pass
          (also suggested by Martin Lau)
        - Remove check on RTM_F_CLONED altogether: in the current UAPI semantic,
          it's just a flag to indicate the route was cloned, not to filter on
          routes
      
      v2: Add tracking of number of entries to be skipped in current node after
          a partial dump. As we restart from the same node, if not all the
          exceptions for a given node fit in a single message, the dump will
          not terminate, as suggested by Martin Lau. This is a concrete
          possibility, setting up a big number of exceptions for the same route
          actually causes the issue, suggested by David Ahern.
      Reported-by: Jianlin Shi <jishi@redhat.com>
      Fixes: 2b760fcf ("ipv6: hook up exception table to store dst cache")
      Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • S
      ipv6/route: Change return code of rt6_dump_route() for partial node dumps · bf9a8a06
      Stefano Brivio authored
      In the next patch, we are going to add optional dump of exceptions to
      rt6_dump_route().
      
      Change the return code of rt6_dump_route() to accommodate partial node
      dumps: we might dump multiple routes per node, and might be able to dump
      only a given number of them, so fib6_dump_node() will need to know how
      many routes have been dumped on partial dump, to restart the dump from the
      point where it was interrupted.
      
      Note that fib6_dump_node() is the only caller and already handles all
      non-negative return codes as success: those become -1 to signal that we're
      done with the node. If we fail, return 0, as we were unable to dump the
      single route in the node, but we're not done with it.
      Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • S
      ipv6/route: Don't match on fc_nh_id if not set in ip6_route_del() · 3401bfb1
      Stefano Brivio authored
      If fc_nh_id isn't set, we shouldn't try to match against it. This
      actually matters just for the RTF_CACHE below (where this case is
      already handled): if iproute2 gets a route exception and tries to
      delete it, it won't reference it by fc_nh_id, even if a nexthop
      object might be associated to the originating route.
      
      Fixes: 5b98324e ("ipv6: Allow routes to use nexthop objects")
      Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  21. 24 Jun, 2019 4 commits
    • W
      ipv6: convert major tx path to use RT6_LOOKUP_F_DST_NOREF · 7d9e5f42
      Wei Wang authored
      For tx path, in most cases, we still have to take refcnt on the dst
      because the caller is caching the dst somewhere. But it is still
      beneficial to make use of RT6_LOOKUP_F_DST_NOREF flag while doing the
      route lookup, because this flag prevents manipulating refcnt on
      net->ipv6.ip6_null_entry when doing fib6_rule_lookup() to traverse each
      routing table. The null_entry is a shared object and constant updates on
      it cause false sharing.
      
      We converted the current major lookup function ip6_route_output_flags()
      to make use of RT6_LOOKUP_F_DST_NOREF.
      
      Together with the change in the rx path, we see a noticeable performance
      boost:
      I ran synflood tests between 2 hosts under the same switch. Both hosts
      have 20G mlx NIC, and 8 tx/rx queues.
      Sender sends pure SYN flood with random src IPs and ports using trafgen.
      Receiver has a simple TCP listener on the target port.
      Both hosts have multiple custom rules:
      - For incoming packets, only local table is traversed.
      - For outgoing packets, 3 tables are traversed to find the route.
      The packet processing rate on the receiver is as follows:
      - Before the fix: 3.78Mpps
      - After the fix:  5.50Mpps
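
      For scale, the relative gain implied by the two reported rates can be worked
      out directly (plain arithmetic on the figures above, nothing measured here):

```python
before_mpps = 3.78
after_mpps = 5.50

speedup = after_mpps / before_mpps        # ratio of the two packet rates
gain_percent = (speedup - 1) * 100        # relative improvement in percent

print(f"{speedup:.2f}x, +{gain_percent:.1f}%")
```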
      Signed-off-by: Wei Wang <weiwan@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • W
      ipv6: convert rx data path to not take refcnt on dst · 67f415dd
      Wei Wang authored
      ip6_route_input() is the key function to do the route lookup in the
      rx data path. All the callers to this function are already holding rcu
      lock. So it is fairly easy to convert it to not take refcnt on the dst:
      We pass in flag RT6_LOOKUP_F_DST_NOREF and do skb_dst_set_noref().
      This saves a few atomic inc or dec operations and should boost
      performance overall.
      This also makes the logic more aligned with v4.
      Signed-off-by: Wei Wang <weiwan@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Mahesh Bandewar <maheshb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • W
      ipv6: initialize rt6->rt6i_uncached in all pre-allocated dst entries · 74109218
      Wei Wang authored
      Initialize rt6->rt6i_uncached on the following pre-allocated dsts:
      net->ipv6.ip6_null_entry
      net->ipv6.ip6_prohibit_entry
      net->ipv6.ip6_blk_hole_entry
      
      This is a preparation patch for later commits to be able to distinguish
      dst entries in uncached list by doing:
      !list_empty(rt6->rt6i_uncached)
      Signed-off-by: Wei Wang <weiwan@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Mahesh Bandewar <maheshb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • W
      ipv6: introduce RT6_LOOKUP_F_DST_NOREF flag in ip6_pol_route() · 0e09edcc
      Wei Wang authored
      This new flag is to instruct the route lookup function to not take
      refcnt on the dst entry. The user which does route lookup with this flag
      must properly use rcu protection.
      ip6_pol_route() is the major route lookup function for both tx and rx
      path.
      In this function:
      Do not take refcnt on dst if RT6_LOOKUP_F_DST_NOREF flag is set, and
      directly return the route entry. The caller should be holding rcu lock
      when using this flag, and decide whether to take refcnt or not.
      
      One note on the dst cache in the uncached_list:
      As uncached_list does not consume refcnt, one refcnt is always returned
      back to the caller even if RT6_LOOKUP_F_DST_NOREF flag is set.
      Uncached dst is only possible in the output path. So in such call path,
      caller MUST check if the dst is in the uncached_list before assuming
      that there is no refcnt taken on the returned dst.
      Signed-off-by: Wei Wang <weiwan@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Mahesh Bandewar <maheshb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  22. 23 Jun, 2019 1 commit
    • I
      ipv6: Error when route does not have any valid nexthops · 9eee3b49
      Ido Schimmel authored
      When user space sends invalid information in RTA_MULTIPATH, the nexthop
      list in ip6_route_multipath_add() is empty and 'rt_notif' is set to
      NULL.
      
      The code that emits the in-kernel notifications does not check for this
      condition, which results in a NULL pointer dereference [1].
      
      Fix this by bailing earlier in the function if the parsed nexthop list
      is empty. This is consistent with the corresponding IPv4 code.
      
      v2:
      * Check if parsed nexthop list is empty and bail with extack set
      
      [1]
      kasan: CONFIG_KASAN_INLINE enabled
      kasan: GPF could be caused by NULL-ptr deref or user memory access
      general protection fault: 0000 [#1] PREEMPT SMP KASAN
      CPU: 0 PID: 9190 Comm: syz-executor149 Not tainted 5.2.0-rc5+ #38
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
      Google 01/01/2011
      RIP: 0010:call_fib6_multipath_entry_notifiers+0xd1/0x1a0
      net/ipv6/ip6_fib.c:396
      Code: 8b b5 30 ff ff ff 48 c7 85 68 ff ff ff 00 00 00 00 48 c7 85 70 ff ff
      ff 00 00 00 00 89 45 88 4c 89 e0 48 c1 e8 03 4c 89 65 80 <42> 80 3c 28 00
      0f 85 9a 00 00 00 48 b8 00 00 00 00 00 fc ff df 4d
      RSP: 0018:ffff88809788f2c0 EFLAGS: 00010246
      RAX: 0000000000000000 RBX: 1ffff11012f11e59 RCX: 00000000ffffffff
      RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
      RBP: ffff88809788f390 R08: ffff88809788f8c0 R09: 000000000000000c
      R10: ffff88809788f5d8 R11: ffff88809788f527 R12: 0000000000000000
      R13: dffffc0000000000 R14: ffff88809788f8c0 R15: ffffffff89541d80
      FS:  000055555632c880(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000020000080 CR3: 000000009ba7c000 CR4: 00000000001406f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
        ip6_route_multipath_add+0xc55/0x1490 net/ipv6/route.c:5094
        inet6_rtm_newroute+0xed/0x180 net/ipv6/route.c:5208
        rtnetlink_rcv_msg+0x463/0xb00 net/core/rtnetlink.c:5219
        netlink_rcv_skb+0x177/0x450 net/netlink/af_netlink.c:2477
        rtnetlink_rcv+0x1d/0x30 net/core/rtnetlink.c:5237
        netlink_unicast_kernel net/netlink/af_netlink.c:1302 [inline]
        netlink_unicast+0x531/0x710 net/netlink/af_netlink.c:1328
        netlink_sendmsg+0x8ae/0xd70 net/netlink/af_netlink.c:1917
        sock_sendmsg_nosec net/socket.c:646 [inline]
        sock_sendmsg+0xd7/0x130 net/socket.c:665
        ___sys_sendmsg+0x803/0x920 net/socket.c:2286
        __sys_sendmsg+0x105/0x1d0 net/socket.c:2324
        __do_sys_sendmsg net/socket.c:2333 [inline]
        __se_sys_sendmsg net/socket.c:2331 [inline]
        __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2331
        do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      RIP: 0033:0x4401f9
      Code: 18 89 d0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 89 f8 48 89 f7
      48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff
      ff 0f 83 fb 13 fc ff c3 66 2e 0f 1f 84 00 00 00 00
      RSP: 002b:00007ffc09fd0028 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
      RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 00000000004401f9
      RDX: 0000000000000000 RSI: 0000000020000080 RDI: 0000000000000003
      RBP: 00000000006ca018 R08: 0000000000000000 R09: 00000000004002c8
      R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000401a80
      R13: 0000000000401b10 R14: 0000000000000000 R15: 0000000000000000
      
      Reported-by: syzbot+382566d339d52cd1a204@syzkaller.appspotmail.com
      Fixes: ebee3cad ("ipv6: Add IPv6 multipath notifications for add / replace")
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Reviewed-by: Jiri Pirko <jiri@mellanox.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  23. 20 Jun, 2019 1 commit
    • D
      ipv6: Default fib6_type to RTN_UNICAST when not set · c7036d97
      David Ahern authored
      A user reported that routes are getting installed with type 0 (RTN_UNSPEC)
      where before the routes were RTN_UNICAST. One example is from accel-ppp
      which apparently still uses the ioctl interface and does not set
      rtmsg_type. Another is the netlink interface where ipv6 does not require
      rtm_type to be set (v4 does). Prior to the commit in the Fixes tag the
      ipv6 stack converted type 0 to RTN_UNICAST, so restore that behavior.
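
      The restored behaviour amounts to a default on the route type. A minimal
      sketch, assuming the usual rtnetlink values RTN_UNSPEC = 0 and RTN_UNICAST = 1;
      effective_route_type() is a hypothetical name, not a kernel function:

```python
RTN_UNSPEC = 0     # type left unset by the caller (ioctl or netlink)
RTN_UNICAST = 1


def effective_route_type(requested_type):
    """Treat an unset route type as unicast; keep any explicit type as given."""
    if requested_type == RTN_UNSPEC:
        return RTN_UNICAST
    return requested_type


print(effective_route_type(RTN_UNSPEC))
```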
      
      Fixes: e8478e80 ("net/ipv6: Save route type in rt6_info")
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  24. 19 Jun, 2019 2 commits
  25. 11 Jun, 2019 4 commits