提交 · 54074f1dbd6fbc0f0a085a54f3297ae26e424d59 · openeuler / Kernel

06 11月, 2019 1 次提交

由 Matteo Croce 提交于 11月 02, 2019

The same code which recognizes ICMP error packets is duplicated several
times. Use the icmp_is_err() and icmpv6_is_err() helpers instead, which
do the same thing.

ip_multipath_l3_keys() and tcf_nat_act() didn't check for all the error types,
assume that they should instead.
Signed-off-by: NMatteo Croce <mcroce@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

54074f1d

05 11月, 2019 1 次提交

ipv6: use jhash2() in rt6_exception_hash() · b6b556af

由 Eric Dumazet 提交于 11月 03, 2019

Faster jhash2() can be used instead of jhash(), since
IPv6 addresses have the needed alignment requirement.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b6b556af

17 10月, 2019 1 次提交

Revert "blackhole_netdev: fix syzkaller reported issue" · bd74708c

由 Mahesh Bandewar 提交于 10月 16, 2019

This reverts commit b0818f80.

Started seeing weird behavior after this patch especially in
the IPv6 code path. Haven't root caused it, but since this was
applied to net branch, taking a precautionary measure to revert
it and look / analyze those failures

Revert this now and I'll send a better fix after analysing / fixing
the weirdness observed.

CC: Eric Dumazet <edumazet@google.com>
CC: Wei Wang <weiwan@google.com>
CC: David S. Miller <davem@davemloft.net>
Signed-off-by: NMahesh Bandewar <maheshb@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

bd74708c

16 10月, 2019 1 次提交

blackhole_netdev: fix syzkaller reported issue · b0818f80

由 Mahesh Bandewar 提交于 10月 11, 2019

While invalidating the dst, we assign backhole_netdev instead of
loopback device. However, this device does not have idev pointer
and hence no ip6_ptr even if IPv6 is enabled. Possibly this has
triggered the syzbot reported crash.

The syzbot report does not have reproducer, however, this is the
only device that doesn't have matching idev created.

Crash instruction is :

static inline bool ip6_ignore_linkdown(const struct net_device *dev)
{
        const struct inet6_dev *idev = __in6_dev_get(dev);

        return !!idev->cnf.ignore_routes_with_linkdown; <= crash
}

Also ipv6 always assumes presence of idev and never checks for it
being NULL (as does the above referenced code). So adding a idev
for the blackhole_netdev to avoid this class of crashes in the future.
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b0818f80

12 9月, 2019 1 次提交

ipv6: Don't use dst gateway directly in ip6_confirm_neigh() · cbfd6891

由 Stefano Brivio 提交于 9月 09, 2019

This is the equivalent of commit 2c6b55f4 ("ipv6: fix neighbour
resolution with raw socket") for ip6_confirm_neigh(): we can send a
packet with MSG_CONFIRM on a raw socket for a connected route, so the
gateway would be :: here, and we should pick the next hop using
rt6_nexthop() instead.

This was found by code review and, to the best of my knowledge, doesn't
actually fix a practical issue: the destination address from the packet
is not considered while confirming a neighbour, as ip6_confirm_neigh()
calls choose_neigh_daddr() without passing the packet, so there are no
similar issues as the one fixed by said commit.

A possible source of issues with the existing implementation might come
from the fact that, if we have a cached dst, we won't consider it,
while rt6_nexthop() takes care of that. I might just not be creative
enough to find a practical problem here: the only way to affect this
with cached routes is to have one coming from an ICMPv6 redirect, but
if the next hop is a directly connected host, there should be no
topology for which a redirect applies here, and tests with redirected
routes show no differences for MSG_CONFIRM (and MSG_PROBE) packets on
raw sockets destined to a directly connected host.

However, directly using the dst gateway here is not consistent anymore
with neighbour resolution, and, in general, as we want the next hop,
using rt6_nexthop() looks like the only sane way to fetch it.
Reported-by: NGuillaume Nault <gnault@redhat.com>
Signed-off-by: NStefano Brivio <sbrivio@redhat.com>
Acked-by: NGuillaume Nault <gnault@redhat.com>
Acked-by: NNicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

cbfd6891

07 9月, 2019 1 次提交

ipv6: addrconf_f6i_alloc - fix non-null pointer check to !IS_ERR() · 8652f17c

由 Maciej Żenczykowski 提交于 9月 05, 2019

Fixes a stupid bug I recently introduced...
ip6_route_info_create() returns an ERR_PTR(err) and not a NULL on error.

Fixes: d55a2e37 ("net-ipv6: fix excessive RTF_ADDRCONF flag on ::1/128 local route (and others)'")
Cc: David Ahern <dsahern@gmail.com>
Cc: Lorenzo Colitti <lorenzo@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: NMaciej Żenczykowski <maze@google.com>
Reported-by: Nsyzbot <syzkaller@googlegroups.com>
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

8652f17c

05 9月, 2019 3 次提交

net: Properly update v4 routes with v6 nexthop · 7bdf4de1

由 Donald Sharp 提交于 9月 04, 2019

When creating a v4 route that uses a v6 nexthop from a nexthop group.
Allow the kernel to properly send the nexthop as v6 via the RTA_VIA
attribute.

Broken behavior:

$ ip nexthop add via fe80::9 dev eth0
$ ip nexthop show
id 1 via fe80::9 dev eth0 scope link
$ ip route add 4.5.6.7/32 nhid 1
$ ip route show
default via 10.0.2.2 dev eth0
4.5.6.7 nhid 1 via 254.128.0.0 dev eth0
10.0.2.0/24 dev eth0 proto kernel scope link src 10.0.2.15
$

Fixed behavior:

$ ip nexthop add via fe80::9 dev eth0
$ ip nexthop show
id 1 via fe80::9 dev eth0 scope link
$ ip route add 4.5.6.7/32 nhid 1
$ ip route show
default via 10.0.2.2 dev eth0
4.5.6.7 nhid 1 via inet6 fe80::9 dev eth0
10.0.2.0/24 dev eth0 proto kernel scope link src 10.0.2.15
$

v2, v3: Addresses code review comments from David Ahern

Fixes: dcb1ecb5 (“ipv4: Prepare for fib6_nh from a nexthop object”)
Signed-off-by: NDonald Sharp <sharpd@cumulusnetworks.com>
Reviewed-by: NDavid Ahern <dsahern@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

7bdf4de1

ipv6: Fix RTA_MULTIPATH with nexthop objects · 4255ff05

由 David Ahern 提交于 9月 03, 2019

A change to the core nla helpers was missed during the push of
the nexthop changes. rt6_fill_node_nexthop should be calling
nla_nest_start_noflag not nla_nest_start. Currently, iproute2
does not print multipath data because of parsing issues with
the attribute.

Fixes: f88d8ea6 ("ipv6: Plumb support for nexthop object in a fib6_info")
Signed-off-by: NDavid Ahern <dsahern@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

4255ff05

net-ipv6: fix excessive RTF_ADDRCONF flag on ::1/128 local route (and others) · d55a2e37

由 Maciej Żenczykowski 提交于 9月 02, 2019

There is a subtle change in behaviour introduced by:
  commit c7a1ce39
  'ipv6: Change addrconf_f6i_alloc to use ip6_route_info_create'

Before that patch /proc/net/ipv6_route includes:
00000000000000000000000000000001 80 00000000000000000000000000000000 00 00000000000000000000000000000000 00000000 00000003 00000000 80200001 lo

Afterwards /proc/net/ipv6_route includes:
00000000000000000000000000000001 80 00000000000000000000000000000000 00 00000000000000000000000000000000 00000000 00000002 00000000 80240001 lo

ie. the above commit causes the ::1/128 local (automatic) route to be flagged with RTF_ADDRCONF (0x040000).

AFAICT, this is incorrect since these routes are *not* coming from RA's.

As such, this patch restores the old behaviour.

Fixes: c7a1ce39 ("ipv6: Change addrconf_f6i_alloc to use ip6_route_info_create")
Cc: David Ahern <dsahern@gmail.com>
Cc: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: NMaciej Żenczykowski <maze@google.com>
Reviewed-by: NDavid Ahern <dsahern@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

d55a2e37

06 8月, 2019 2 次提交

ipv6: have a single rcu unlock point in __ip6_rt_update_pmtu · 43a4b60d

由 David Ahern 提交于 8月 01, 2019

Simplify the unlock path in __ip6_rt_update_pmtu by using a
single point where rcu_read_unlock is called.
Signed-off-by: NDavid Ahern <dsahern@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

43a4b60d

ipv6: Fix unbalanced rcu locking in rt6_update_exception_stamp_rt · cff6a327

由 David Ahern 提交于 8月 01, 2019

The nexthop path in rt6_update_exception_stamp_rt needs to call
rcu_read_unlock if it fails to find a fib6_nh match rather than
just returning.

Fixes: e659ba31 ("ipv6: Handle all fib6_nh in a nexthop in exception handling")
Signed-off-by: NDavid Ahern <dsahern@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

cff6a327

19 7月, 2019 1 次提交

proc/sysctl: add shared variables for range check · eec4844f

由 Matteo Croce 提交于 7月 18, 2019

In the sysctl code the proc_dointvec_minmax() function is often used to
validate the user supplied value between an allowed range.  This
function uses the extra1 and extra2 members from struct ctl_table as
minimum and maximum allowed value.

On sysctl handler declaration, in every source file there are some
readonly variables containing just an integer which address is assigned
to the extra1 and extra2 members, so the sysctl range is enforced.

The special values 0, 1 and INT_MAX are very often used as range
boundary, leading duplication of variables like zero=0, one=1,
int_max=INT_MAX in different source files:

    $ git grep -E '\.extra[12].*&(zero|one|int_max)' |wc -l
    248

Add a const int array containing the most commonly used values, some
macros to refer more easily to the correct array member, and use them
instead of creating a local one for every object file.

This is the bloat-o-meter output comparing the old and new binary
compiled with the default Fedora config:

    # scripts/bloat-o-meter -d vmlinux.o.old vmlinux.o
    add/remove: 2/2 grow/shrink: 0/2 up/down: 24/-188 (-164)
    Data                                         old     new   delta
    sysctl_vals                                    -      12     +12
    __kstrtab_sysctl_vals                          -      12     +12
    max                                           14      10      -4
    int_max                                       16       -     -16
    one                                           68       -     -68
    zero                                         128      28    -100
    Total: Before=20583249, After=20583085, chg -0.00%

[mcroce@redhat.com: tipc: remove two unused variables]
  Link: http://lkml.kernel.org/r/20190530091952.4108-1-mcroce@redhat.com
[akpm@linux-foundation.org: fix net/ipv6/sysctl_net_ipv6.c]
[arnd@arndb.de: proc/sysctl: make firmware loader table conditional]
  Link: http://lkml.kernel.org/r/20190617130014.1713870-1-arnd@arndb.de
[akpm@linux-foundation.org: fix fs/eventpoll.c]
Link: http://lkml.kernel.org/r/20190430180111.10688-1-mcroce@redhat.comSigned-off-by: NMatteo Croce <mcroce@redhat.com>
Signed-off-by: NArnd Bergmann <arnd@arndb.de>
Acked-by: NKees Cook <keescook@chromium.org>
Reviewed-by: NAaron Tomlin <atomlin@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

eec4844f

18 7月, 2019 1 次提交

ipv6: rt6_check should return NULL if 'from' is NULL · 49d05fe2

由 David Ahern 提交于 7月 17, 2019

Paul reported that l2tp sessions were broken after the commit referenced
in the Fixes tag. Prior to this commit rt6_check returned NULL if the
rt6_info 'from' was NULL - ie., the dst_entry was disconnected from a FIB
entry. Restore that behavior.

Fixes: 93531c67 ("net/ipv6: separate handling of FIB entries from dst based routes")
Reported-by: NPaul Donohue <linux-kernel@PaulSD.com>
Tested-by: NPaul Donohue <linux-kernel@PaulSD.com>
Signed-off-by: NDavid Ahern <dsahern@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

49d05fe2

09 7月, 2019 1 次提交

ipv6: Support multipath hashing on inner IP pkts · d8f74f09

由 Stephen Suryaputra 提交于 7月 06, 2019

Make the same support as commit 363887a2 ("ipv4: Support multipath
hashing on inner IP pkts for GRE tunnel") for outer IPv6. The hashing
considers both IPv4 and IPv6 pkts when they are tunneled by IPv6 GRE.
Signed-off-by: NStephen Suryaputra <ssuryaextr@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

d8f74f09

02 7月, 2019 1 次提交

blackhole_netdev: use blackhole_netdev to invalidate dst entries · 8d7017fd

由 Mahesh Bandewar 提交于 7月 01, 2019

Use blackhole_netdev instead of 'lo' device with lower MTU when marking
dst "dead".
Signed-off-by: NMahesh Bandewar <maheshb@google.com>
Tested-by: NMichael Chan <michael.chan@broadcom.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

8d7017fd

28 6月, 2019 1 次提交

ipv6: Convert gateway validation to use fib6_info · b2c709cc

由 David Ahern 提交于 6月 24, 2019

Gateway validation does not need a dst_entry, it only needs the fib
entry to validate the gateway resolution and egress device. So,
convert ip6_nh_lookup_table from ip6_pol_route to fib6_table_lookup
and ip6_route_check_nh to use fib6_lookup over rt6_lookup.

ip6_pol_route is a call to fib6_table_lookup and if successful a call
to fib6_select_path. From there the exception cache is searched for an
entry or a dst_entry is created to return to the caller. The exception
entry is not relevant for gateway validation, so what matters are the
calls to fib6_table_lookup and then fib6_select_path.

Similarly, rt6_lookup can be replaced with a call to fib6_lookup with
RT6_LOOKUP_F_IFACE set in flags. Again, the exception cache search is
not relevant, only the lookup with path selection. The primary difference
in the lookup paths is the use of rt6_select with fib6_lookup versus
rt6_device_match with rt6_lookup. When you remove complexities in the
rt6_select path, e.g.,
1. saddr is not set for gateway validation, so RT6_LOOKUP_F_HAS_SADDR
   is not relevant
2. rt6_check_neigh is not called so that removes the RT6_NUD_FAIL_DO_RR
   return and round-robin logic.

the code paths are believed to be equivalent for the given use case -
validate the gateway and optionally given the device. Furthermore, it
aligns the validation with onlink code path and the lookup path actually
used for rx and tx.

Adjust the users, ip6_route_check_nh_onlink and ip6_route_check_nh to
handle a fib6_info vs a rt6_info when performing validation checks.

Existing selftests fib-onlink-tests.sh and fib_tests.sh are used to
verify the changes.
Signed-off-by: NDavid Ahern <dsahern@gmail.com>
Reviewed-by: NWei Wang <weiwan@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b2c709cc

27 6月, 2019 2 次提交

ipv6: fix neighbour resolution with raw socket · 2c6b55f4

由 Nicolas Dichtel 提交于 6月 24, 2019

The scenario is the following: the user uses a raw socket to send an ipv6
packet, destinated to a not-connected network, and specify a connected nh.
Here is the corresponding python script to reproduce this scenario:

 import socket
 IPPROTO_RAW = 255
 send_s = socket.socket(socket.AF_INET6, socket.SOCK_RAW, IPPROTO_RAW)
 # scapy
 # p = IPv6(src='fd00:100::1', dst='fd00:200::fa')/ICMPv6EchoRequest()
 # str(p)
 req = b'`\x00\x00\x00\x00\x08:@\xfd\x00\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\xfd\x00\x02\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xfa\x80\x00\x81\xc0\x00\x00\x00\x00'
 send_s.sendto(req, ('fd00:175::2', 0, 0, 0))

fd00:175::/64 is a connected route and fd00:200::fa is not a connected
host.

With this scenario, the kernel starts by sending a NS to resolve
fd00:175::2. When it receives the NA, it flushes its queue and try to send
the initial packet. But instead of sending it, it sends another NS to
resolve fd00:200::fa, which obvioulsy fails, thus the packet is dropped. If
the user sends again the packet, it now uses the right nh (fd00:175::2).

The problem is that ip6_dst_lookup_neigh() uses the rt6i_gateway, which is
:: because the associated route is a connected route, thus it uses the dst
addr of the packet. Let's use rt6_nexthop() to choose the right nh.
Signed-off-by: NNicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

2c6b55f4

ipv6: fix suspicious RCU usage in rt6_dump_route() · 3b525691

由 Eric Dumazet 提交于 6月 26, 2019

syzbot reminded us that rt6_nh_dump_exceptions() needs to be called
with rcu_read_lock()

net/ipv6/route.c:1593 suspicious rcu_dereference_check() usage!

other info that might help us debug this:

rcu_scheduler_active = 2, debug_locks = 1
2 locks held by syz-executor609/8966:
 #0: 00000000b7dbe288 (rtnl_mutex){+.+.}, at: netlink_dump+0xe7/0xfb0 net/netlink/af_netlink.c:2199
 #1: 00000000f2d87c21 (&(&tb->tb6_lock)->rlock){+...}, at: spin_lock_bh include/linux/spinlock.h:343 [inline]
 #1: 00000000f2d87c21 (&(&tb->tb6_lock)->rlock){+...}, at: fib6_dump_table.isra.0+0x37e/0x570 net/ipv6/ip6_fib.c:533

stack backtrace:
CPU: 0 PID: 8966 Comm: syz-executor609 Not tainted 5.2.0-rc5+ #43
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x172/0x1f0 lib/dump_stack.c:113
 lockdep_rcu_suspicious+0x153/0x15d kernel/locking/lockdep.c:5250
 fib6_nh_get_excptn_bucket+0x18e/0x1b0 net/ipv6/route.c:1593
 rt6_nh_dump_exceptions+0x45/0x4d0 net/ipv6/route.c:5541
 rt6_dump_route+0x904/0xc50 net/ipv6/route.c:5640
 fib6_dump_node+0x168/0x280 net/ipv6/ip6_fib.c:467
 fib6_walk_continue+0x4a9/0x8e0 net/ipv6/ip6_fib.c:1986
 fib6_walk+0x9d/0x100 net/ipv6/ip6_fib.c:2034
 fib6_dump_table.isra.0+0x38a/0x570 net/ipv6/ip6_fib.c:534
 inet6_dump_fib+0x93c/0xb00 net/ipv6/ip6_fib.c:624
 rtnl_dump_all+0x295/0x490 net/core/rtnetlink.c:3445
 netlink_dump+0x558/0xfb0 net/netlink/af_netlink.c:2244
 __netlink_dump_start+0x5b1/0x7d0 net/netlink/af_netlink.c:2352
 netlink_dump_start include/linux/netlink.h:226 [inline]
 rtnetlink_rcv_msg+0x73d/0xb00 net/core/rtnetlink.c:5182
 netlink_rcv_skb+0x177/0x450 net/netlink/af_netlink.c:2477
 rtnetlink_rcv+0x1d/0x30 net/core/rtnetlink.c:5237
 netlink_unicast_kernel net/netlink/af_netlink.c:1302 [inline]
 netlink_unicast+0x531/0x710 net/netlink/af_netlink.c:1328
 netlink_sendmsg+0x8ae/0xd70 net/netlink/af_netlink.c:1917
 sock_sendmsg_nosec net/socket.c:646 [inline]
 sock_sendmsg+0xd7/0x130 net/socket.c:665
 sock_write_iter+0x27c/0x3e0 net/socket.c:994
 call_write_iter include/linux/fs.h:1872 [inline]
 new_sync_write+0x4d3/0x770 fs/read_write.c:483
 __vfs_write+0xe1/0x110 fs/read_write.c:496
 vfs_write+0x20c/0x580 fs/read_write.c:558
 ksys_write+0x14f/0x290 fs/read_write.c:611
 __do_sys_write fs/read_write.c:623 [inline]
 __se_sys_write fs/read_write.c:620 [inline]
 __x64_sys_write+0x73/0xb0 fs/read_write.c:620
 do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x4401b9
Code: 18 89 d0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 fb 13 fc ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007ffc8e134978 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 00000000004401b9
RDX: 000000000000001c RSI: 0000000020000000 RDI: 00

Fixes: 1e47b483 ("ipv6: Dump route exceptions if requested")
Signed-off-by: NEric Dumazet <edumazet@google.com>
Cc: Stefano Brivio <sbrivio@redhat.com>
Cc: David Ahern <dsahern@gmail.com>
Reviewed-by: NStefano Brivio <sbrivio@redhat.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

3b525691

26 6月, 2019 1 次提交

net/ipv6: Fix misuse of proc_dointvec "skip_notify_on_dev_down" · b8e8a863

由 Eiichi Tsukata 提交于 6月 25, 2019

/proc/sys/net/ipv6/route/skip_notify_on_dev_down assumes given value to be
0 or 1. Use proc_dointvec_minmax instead of proc_dointvec.

Fixes: 7c6bb7d2 ("net/ipv6: Add knob to skip DELROUTE message ondevice down")
Signed-off-by: NEiichi Tsukata <devel@etsukata.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

b8e8a863

25 6月, 2019 3 次提交

ipv6: Dump route exceptions if requested · 1e47b483

由 Stefano Brivio 提交于 6月 21, 2019

Since commit 2b760fcf ("ipv6: hook up exception table to store dst
cache"), route exceptions reside in a separate hash table, and won't be
found by walking the FIB, so they won't be dumped to userspace on a
RTM_GETROUTE message.

This causes 'ip -6 route list cache' and 'ip -6 route flush cache' to
have no function anymore:

 # ip -6 route get fc00:3::1
 fc00:3::1 via fc00:1::2 dev veth_A-R1 src fc00:1::1 metric 1024 expires 539sec mtu 1400 pref medium
 # ip -6 route get fc00:4::1
 fc00:4::1 via fc00:2::2 dev veth_A-R2 src fc00:2::1 metric 1024 expires 536sec mtu 1500 pref medium
 # ip -6 route list cache
 # ip -6 route flush cache
 # ip -6 route get fc00:3::1
 fc00:3::1 via fc00:1::2 dev veth_A-R1 src fc00:1::1 metric 1024 expires 520sec mtu 1400 pref medium
 # ip -6 route get fc00:4::1
 fc00:4::1 via fc00:2::2 dev veth_A-R2 src fc00:2::1 metric 1024 expires 519sec mtu 1500 pref medium

because iproute2 lists cached routes using RTM_GETROUTE, and flushes them
by listing all the routes, and deleting them with RTM_DELROUTE one by one.

If cached routes are requested using the RTM_F_CLONED flag together with
strict checking, or if no strict checking is requested (and hence we can't
consistently apply filters), look up exceptions in the hash table
associated with the current fib6_info in rt6_dump_route(), and, if present
and not expired, add them to the dump.

We might be unable to dump all the entries for a given node in a single
message, so keep track of how many entries were handled for the current
node in fib6_walker, and skip that amount in case we start from the same
partially dumped node.

When a partial dump restarts, as the starting node might change when
'sernum' changes, we have no guarantee that we need to skip the same
amount of in-node entries. Therefore, we need two counters, and we need to
zero the in-node counter if the node from which the dump is resumed
differs.

Note that, with the current version of iproute2, this only fixes the
'ip -6 route list cache': on a flush command, iproute2 doesn't pass
RTM_F_CLONED and, due to this inconsistency, 'ip -6 route flush cache' is
still unable to fetch the routes to be flushed. This will be addressed in
a patch for iproute2.

To flush cached routes, a procfs entry could be introduced instead: that's
how it works for IPv4. We already have a rt6_flush_exception() function
ready to be wired to it. However, this would not solve the issue for
listing.

Versions of iproute2 and kernel tested:

                    iproute2
kernel             4.14.0   4.15.0   4.19.0   5.0.0   5.1.0    5.1.0, patched
 3.18    list        +        +        +        +       +            +
         flush       +        +        +        +       +            +
 4.4     list        +        +        +        +       +            +
         flush       +        +        +        +       +            +
 4.9     list        +        +        +        +       +            +
         flush       +        +        +        +       +            +
 4.14    list        +        +        +        +       +            +
         flush       +        +        +        +       +            +
 4.15    list
         flush
 4.19    list
         flush
 5.0     list
         flush
 5.1     list
         flush
 with    list        +        +        +        +       +            +
 fix     flush       +        +        +                             +

v7:
  - Explain usage of "skip" counters in commit message (suggested by
    David Ahern)

v6:
  - Rebase onto net-next, use recently introduced nexthop walker
  - Make rt6_nh_dump_exceptions() a separate function (suggested by David
    Ahern)

v5:
  - Use dump_routes and dump_exceptions from filter, ignore NLM_F_MATCH,
    update test results (flushing works with iproute2 < 5.0.0 now)

v4:
  - Split NLM_F_MATCH and strict check handling in separate patches
  - Filter routes using RTM_F_CLONED: if it's not set, only return
    non-cached routes, and if it's set, only return cached routes:
    change requested by David Ahern and Martin Lau. This implies that
    iproute2 needs a separate patch to be able to flush IPv6 cached
    routes. This is not ideal because we can't fix the breakage caused
    by 2b760fcf entirely in kernel. However, two years have passed
    since then, and this makes it more tolerable

v3:
  - More descriptive comment about expired exceptions in rt6_dump_route()
  - Swap return values of rt6_dump_route() (suggested by Martin Lau)
  - Don't zero skip_in_node in case we don't dump anything in a given pass
    (also suggested by Martin Lau)
  - Remove check on RTM_F_CLONED altogether: in the current UAPI semantic,
    it's just a flag to indicate the route was cloned, not to filter on
    routes

v2: Add tracking of number of entries to be skipped in current node after
    a partial dump. As we restart from the same node, if not all the
    exceptions for a given node fit in a single message, the dump will
    not terminate, as suggested by Martin Lau. This is a concrete
    possibility, setting up a big number of exceptions for the same route
    actually causes the issue, suggested by David Ahern.
Reported-by: NJianlin Shi <jishi@redhat.com>
Fixes: 2b760fcf ("ipv6: hook up exception table to store dst cache")
Signed-off-by: NStefano Brivio <sbrivio@redhat.com>
Reviewed-by: NDavid Ahern <dsahern@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

1e47b483

ipv6/route: Change return code of rt6_dump_route() for partial node dumps · bf9a8a06

由 Stefano Brivio 提交于 6月 21, 2019

In the next patch, we are going to add optional dump of exceptions to
rt6_dump_route().

Change the return code of rt6_dump_route() to accomodate partial node
dumps: we might dump multiple routes per node, and might be able to dump
only a given number of them, so fib6_dump_node() will need to know how
many routes have been dumped on partial dump, to restart the dump from the
point where it was interrupted.

Note that fib6_dump_node() is the only caller and already handles all
non-negative return codes as success: those become -1 to signal that we're
done with the node. If we fail, return 0, as we were unable to dump the
single route in the node, but we're not done with it.
Signed-off-by: NStefano Brivio <sbrivio@redhat.com>
Reviewed-by: NDavid Ahern <dsahern@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

bf9a8a06

ipv6/route: Don't match on fc_nh_id if not set in ip6_route_del() · 3401bfb1

由 Stefano Brivio 提交于 6月 21, 2019

If fc_nh_id isn't set, we shouldn't try to match against it. This
actually matters just for the RTF_CACHE below (where this case is
already handled): if iproute2 gets a route exception and tries to
delete it, it won't reference it by fc_nh_id, even if a nexthop
object might be associated to the originating route.

Fixes: 5b98324e ("ipv6: Allow routes to use nexthop objects")
Signed-off-by: NStefano Brivio <sbrivio@redhat.com>
Reviewed-by: NDavid Ahern <dsahern@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

3401bfb1

24 6月, 2019 4 次提交

ipv6: convert major tx path to use RT6_LOOKUP_F_DST_NOREF · 7d9e5f42

由 Wei Wang 提交于 6月 20, 2019

For tx path, in most cases, we still have to take refcnt on the dst
cause the caller is caching the dst somewhere. But it still is
beneficial to make use of RT6_LOOKUP_F_DST_NOREF flag while doing the
route lookup. It is cause this flag prevents manipulating refcnt on
net->ipv6.ip6_null_entry when doing fib6_rule_lookup() to traverse each
routing table. The null_entry is a shared object and constant updates on
it cause false sharing.

We converted the current major lookup function ip6_route_output_flags()
to make use of RT6_LOOKUP_F_DST_NOREF.

Together with the change in the rx path, we see noticable performance
boost:
I ran synflood tests between 2 hosts under the same switch. Both hosts
have 20G mlx NIC, and 8 tx/rx queues.
Sender sends pure SYN flood with random src IPs and ports using trafgen.
Receiver has a simple TCP listener on the target port.
Both hosts have multiple custom rules:
- For incoming packets, only local table is traversed.
- For outgoing packets, 3 tables are traversed to find the route.
The packet processing rate on the receiver is as follows:
- Before the fix: 3.78Mpps
- After the fix:  5.50Mpps
Signed-off-by: NWei Wang <weiwan@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

7d9e5f42

ipv6: convert rx data path to not take refcnt on dst · 67f415dd

由 Wei Wang 提交于 6月 20, 2019

ip6_route_input() is the key function to do the route lookup in the
rx data path. All the callers to this function are already holding rcu
lock. So it is fairly easy to convert it to not take refcnt on the dst:
We pass in flag RT6_LOOKUP_F_DST_NOREF and do skb_dst_set_noref().
This saves a few atomic inc or dec operations and should boost
performance overall.
This also makes the logic more aligned with v4.
Signed-off-by: NWei Wang <weiwan@google.com>
Acked-by: NEric Dumazet <edumazet@google.com>
Acked-by: NMahesh Bandewar <maheshb@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

67f415dd

ipv6: initialize rt6->rt6i_uncached in all pre-allocated dst entries · 74109218

由 Wei Wang 提交于 6月 20, 2019

Initialize rt6->rt6i_uncached on the following pre-allocated dsts:
net->ipv6.ip6_null_entry
net->ipv6.ip6_prohibit_entry
net->ipv6.ip6_blk_hole_entry

This is a preparation patch for later commits to be able to distinguish
dst entries in uncached list by doing:
!list_empty(rt6->rt6i_uncached)
Signed-off-by: NWei Wang <weiwan@google.com>
Acked-by: NEric Dumazet <edumazet@google.com>
Acked-by: NMahesh Bandewar <maheshb@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

74109218

ipv6: introduce RT6_LOOKUP_F_DST_NOREF flag in ip6_pol_route() · 0e09edcc

由 Wei Wang 提交于 6月 20, 2019

This new flag is to instruct the route lookup function to not take
refcnt on the dst entry. The user which does route lookup with this flag
must properly use rcu protection.
ip6_pol_route() is the major route lookup function for both tx and rx
path.
In this function:
Do not take refcnt on dst if RT6_LOOKUP_F_DST_NOREF flag is set, and
directly return the route entry. The caller should be holding rcu lock
when using this flag, and decide whether to take refcnt or not.

One note on the dst cache in the uncached_list:
As uncached_list does not consume refcnt, one refcnt is always returned
back to the caller even if RT6_LOOKUP_F_DST_NOREF flag is set.
Uncached dst is only possible in the output path. So in such call path,
caller MUST check if the dst is in the uncached_list before assuming
that there is no refcnt taken on the returned dst.
Signed-off-by: NWei Wang <weiwan@google.com>
Acked-by: NEric Dumazet <edumazet@google.com>
Acked-by: NMahesh Bandewar <maheshb@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

0e09edcc

23 6月, 2019 1 次提交

ipv6: Error when route does not have any valid nexthops · 9eee3b49

由 Ido Schimmel 提交于 6月 20, 2019

When user space sends invalid information in RTA_MULTIPATH, the nexthop
list in ip6_route_multipath_add() is empty and 'rt_notif' is set to
NULL.

The code that emits the in-kernel notifications does not check for this
condition, which results in a NULL pointer dereference [1].

Fix this by bailing earlier in the function if the parsed nexthop list
is empty. This is consistent with the corresponding IPv4 code.

v2:
* Check if parsed nexthop list is empty and bail with extack set

[1]
kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] PREEMPT SMP KASAN
CPU: 0 PID: 9190 Comm: syz-executor149 Not tainted 5.2.0-rc5+ #38
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
RIP: 0010:call_fib6_multipath_entry_notifiers+0xd1/0x1a0
net/ipv6/ip6_fib.c:396
Code: 8b b5 30 ff ff ff 48 c7 85 68 ff ff ff 00 00 00 00 48 c7 85 70 ff ff
ff 00 00 00 00 89 45 88 4c 89 e0 48 c1 e8 03 4c 89 65 80 <42> 80 3c 28 00
0f 85 9a 00 00 00 48 b8 00 00 00 00 00 fc ff df 4d
RSP: 0018:ffff88809788f2c0 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 1ffff11012f11e59 RCX: 00000000ffffffff
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffff88809788f390 R08: ffff88809788f8c0 R09: 000000000000000c
R10: ffff88809788f5d8 R11: ffff88809788f527 R12: 0000000000000000
R13: dffffc0000000000 R14: ffff88809788f8c0 R15: ffffffff89541d80
FS:  000055555632c880(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020000080 CR3: 000000009ba7c000 CR4: 00000000001406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
  ip6_route_multipath_add+0xc55/0x1490 net/ipv6/route.c:5094
  inet6_rtm_newroute+0xed/0x180 net/ipv6/route.c:5208
  rtnetlink_rcv_msg+0x463/0xb00 net/core/rtnetlink.c:5219
  netlink_rcv_skb+0x177/0x450 net/netlink/af_netlink.c:2477
  rtnetlink_rcv+0x1d/0x30 net/core/rtnetlink.c:5237
  netlink_unicast_kernel net/netlink/af_netlink.c:1302 [inline]
  netlink_unicast+0x531/0x710 net/netlink/af_netlink.c:1328
  netlink_sendmsg+0x8ae/0xd70 net/netlink/af_netlink.c:1917
  sock_sendmsg_nosec net/socket.c:646 [inline]
  sock_sendmsg+0xd7/0x130 net/socket.c:665
  ___sys_sendmsg+0x803/0x920 net/socket.c:2286
  __sys_sendmsg+0x105/0x1d0 net/socket.c:2324
  __do_sys_sendmsg net/socket.c:2333 [inline]
  __se_sys_sendmsg net/socket.c:2331 [inline]
  __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2331
  do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x4401f9
Code: 18 89 d0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 89 f8 48 89 f7
48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff
ff 0f 83 fb 13 fc ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007ffc09fd0028 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 00000000004401f9
RDX: 0000000000000000 RSI: 0000000020000080 RDI: 0000000000000003
RBP: 00000000006ca018 R08: 0000000000000000 R09: 00000000004002c8
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000401a80
R13: 0000000000401b10 R14: 0000000000000000 R15: 0000000000000000

Reported-by: syzbot+382566d339d52cd1a204@syzkaller.appspotmail.com
Fixes: ebee3cad ("ipv6: Add IPv6 multipath notifications for add / replace")
Signed-off-by: NIdo Schimmel <idosch@mellanox.com>
Reviewed-by: NJiri Pirko <jiri@mellanox.com>
Reviewed-by: NDavid Ahern <dsahern@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

9eee3b49

20 6月, 2019 1 次提交

ipv6: Default fib6_type to RTN_UNICAST when not set · c7036d97

由 David Ahern 提交于 6月 19, 2019

A user reported that routes are getting installed with type 0 (RTN_UNSPEC)
where before the routes were RTN_UNICAST. One example is from accel-ppp
which apparently still uses the ioctl interface and does not set
rtmsg_type. Another is the netlink interface where ipv6 does not require
rtm_type to be set (v4 does). Prior to the commit in the Fixes tag the
ipv6 stack converted type 0 to RTN_UNICAST, so restore that behavior.

Fixes: e8478e80 ("net/ipv6: Save route type in rt6_info")
Signed-off-by: NDavid Ahern <dsahern@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

c7036d97

19 6月, 2019 2 次提交

ipv6: Add IPv6 multipath notification for route delete · 2881fd61

由 Ido Schimmel 提交于 6月 18, 2019

If all the nexthops of a multipath route are being deleted, send one
notification for the entire route, instead of one per-nexthop.
Signed-off-by: NIdo Schimmel <idosch@mellanox.com>
Acked-by: NJiri Pirko <jiri@mellanox.com>
Reviewed-by: NDavid Ahern <dsahern@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

2881fd61

ipv6: Add IPv6 multipath notifications for add / replace · ebee3cad

由 Ido Schimmel 提交于 6月 18, 2019

Emit a notification when a multipath routes is added or replace.

Note that unlike the replace notifications sent from fib6_add_rt2node(),
it is possible we are sending a 'FIB_EVENT_ENTRY_REPLACE' when a route
was merely added and not replaced.
Signed-off-by: NIdo Schimmel <idosch@mellanox.com>
Acked-by: NJiri Pirko <jiri@mellanox.com>
Reviewed-by: NDavid Ahern <dsahern@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

ebee3cad

11 6月, 2019 9 次提交

ipv6: Allow routes to use nexthop objects · 5b98324e

由 David Ahern 提交于 6月 08, 2019

Add support for RTA_NH_ID attribute to allow a user to specify a
nexthop id to use with a route. fc_nh_id is added to fib6_config to
hold the value passed in the RTA_NH_ID attribute. If a nexthop id
is given, the gateway, device, encap and multipath attributes can
not be set.

Update ip6_route_del to check metric and protocol before nexthop
specs. If fc_nh_id is set, then it must match the id in the route
entry. Since IPv6 allows delete of a cached entry (an exception),
add ip6_del_cached_rt_nh to cycle through all of the fib6_nh in
a fib entry if it is using a nexthop.
Signed-off-by: NDavid Ahern <dsahern@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

5b98324e

ipv6: Handle all fib6_nh in a nexthop in mtu updates · 2d44234b

由 David Ahern 提交于 6月 08, 2019

Use nexthop_for_each_fib6_nh to call fib6_nh_mtu_change for each
fib6_nh in a nexthop for rt6_mtu_change_route. For __ip6_rt_update_pmtu,
we need to find the nexthop that correlates to the device and gateway
in the rt6_info.
Signed-off-by: NDavid Ahern <dsahern@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

2d44234b

ipv6: Handle all fib6_nh in a nexthop in rt6_do_redirect · 49d5b8ef

由 David Ahern 提交于 6月 08, 2019

Use nexthop_for_each_fib6_nh and fib6_nh_find_match to find the
fib6_nh in a nexthop that correlates to the device and gateway
in the rt6_info.
Signed-off-by: NDavid Ahern <dsahern@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

49d5b8ef

ipv6: Handle all fib6_nh in a nexthop in __ip6_route_redirect · c55c8988

由 David Ahern 提交于 6月 08, 2019

Add a hook in __ip6_route_redirect to handle a nexthop struct in a
fib6_info. Use nexthop_for_each_fib6_nh and fib6_nh_redirect_match
to call ip6_redirect_nh_match for each fib6_nh looking for a match.
Signed-off-by: NDavid Ahern <dsahern@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

c55c8988

ipv6: Handle all fib6_nh in a nexthop in exception handling · e659ba31

由 David Ahern 提交于 6月 08, 2019

Add a hook in rt6_flush_exceptions, rt6_remove_exception_rt,
rt6_update_exception_stamp_rt, and rt6_age_exceptions to handle
nexthop struct in a fib6_info.
Signed-off-by: NDavid Ahern <dsahern@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

e659ba31

ipv6: Handle all fib6_nh in a nexthop in fib6_info_uses_dev · 2c170e07

由 David Ahern 提交于 6月 08, 2019

Add a hook in fib6_info_uses_dev to handle nexthop struct in a fib6_info.
Signed-off-by: NDavid Ahern <dsahern@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

2c170e07

ipv6: Handle all fib6_nh in a nexthop in rt6_nlmsg_size · a1b7a1f0

由 David Ahern 提交于 6月 08, 2019

Add a hook in rt6_nlmsg_size to handle nexthop struct in a fib6_info.
rt6_nh_nlmsg_size is used to sum the space needed for all nexthops in
the fib entry.
Signed-off-by: NDavid Ahern <dsahern@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

a1b7a1f0

ipv6: Handle all fib6_nh in a nexthop in __find_rr_leaf · 17a5984e

由 David Ahern 提交于 6月 08, 2019

Add a hook in __find_rr_leaf to handle nexthop struct in a fib6_info.
nexthop_for_each_fib6_nh is used to walk each fib6_nh in a nexthop and
call find_match. On a match, use the fib6_nh saved in the callback arg
to setup fib6_result.
Signed-off-by: NDavid Ahern <dsahern@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

17a5984e

ipv6: Handle all fib6_nh in a nexthop in rt6_device_match · 962b6803

由 David Ahern 提交于 6月 08, 2019

Add a hook in rt6_device_match to handle nexthop struct in a fib6_info.
The new rt6_nh_dev_match uses nexthop_for_each_fib6_nh to walk each
fib6_nh in a nexthop and call __rt6_device_match. On match,
rt6_nh_dev_match returns the fib6_nh and rt6_device_match uses it to
setup fib6_result.
Signed-off-by: NDavid Ahern <dsahern@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

962b6803

07 6月, 2019 1 次提交

ipv6: fix spelling mistake: "wtih" -> "with" · 4daa95af

由 Colin Ian King 提交于 6月 06, 2019

There is a spelling mistake in a NL_SET_ERR_MSG message. Fix it.
Signed-off-by: NColin Ian King <colin.king@canonical.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

4daa95af

openeuler / Kernel 1 年多 前同步成功

openeuler / Kernel
1 年多前同步成功