  1. 28 Oct, 2022 1 commit
  2. 13 Jul, 2022 1 commit
  3. 09 Feb, 2022 1 commit
  4. 30 Dec, 2021 1 commit
  5. 30 Nov, 2021 1 commit
  6. 23 Nov, 2021 1 commit
    • net: nexthop: fix null pointer dereference when IPv6 is not enabled · 1c743127
      Committed by Nikolay Aleksandrov
      When we try to add an IPv6 nexthop and IPv6 is not enabled
      (!CONFIG_IPV6) we'll hit a NULL pointer dereference[1] in the error path
      of nh_create_ipv6() due to calling ipv6_stub->fib6_nh_release. The bug
      has been present since the beginning of IPv6 nexthop gateway support.
      Commit 1aefd3de ("ipv6: Add fib6_nh_init and release to stubs") tells
      us that only fib6_nh_init has a dummy stub because fib6_nh_release should
      not be called if fib6_nh_init returns an error, but the commit below added
      a call to ipv6_stub->fib6_nh_release in its error path. To fix it return
      the dummy stub's -EAFNOSUPPORT error directly without calling
      ipv6_stub->fib6_nh_release in nh_create_ipv6()'s error path.
      
      [1]
       Output is a bit truncated, but it clearly shows the error.
       BUG: kernel NULL pointer dereference, address: 0000000000000000
       #PF: supervisor instruction fetch in kernel mode
       #PF: error_code(0x0010) - not-present page
       PGD 0 P4D 0
       Oops: 0010 [#1] PREEMPT SMP NOPTI
       CPU: 4 PID: 638 Comm: ip Kdump: loaded Not tainted 5.16.0-rc1+ #446
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-4.fc34 04/01/2014
       RIP: 0010:0x0
       Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
       RSP: 0018:ffff888109f5b8f0 EFLAGS: 00010286
       RAX: 0000000000000000 RBX: ffff888109f5ba28 RCX: 0000000000000000
       RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8881008a2860
       RBP: ffff888109f5b9d8 R08: 0000000000000000 R09: 0000000000000000
       R10: ffff888109f5b978 R11: ffff888109f5b948 R12: 00000000ffffff9f
       R13: ffff8881008a2a80 R14: ffff8881008a2860 R15: ffff8881008a2840
       FS:  00007f98de70f100(0000) GS:ffff88822bf00000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: ffffffffffffffd6 CR3: 0000000100efc000 CR4: 00000000000006e0
       Call Trace:
        <TASK>
        nh_create_ipv6+0xed/0x10c
        rtm_new_nexthop+0x6d7/0x13f3
        ? check_preemption_disabled+0x3d/0xf2
        ? lock_is_held_type+0xbe/0xfd
        rtnetlink_rcv_msg+0x23f/0x26a
        ? check_preemption_disabled+0x3d/0xf2
        ? rtnl_calcit.isra.0+0x147/0x147
        netlink_rcv_skb+0x61/0xb2
        netlink_unicast+0x100/0x187
        netlink_sendmsg+0x37f/0x3a0
        ? netlink_unicast+0x187/0x187
        sock_sendmsg_nosec+0x67/0x9b
        ____sys_sendmsg+0x19d/0x1f9
        ? copy_msghdr_from_user+0x4c/0x5e
        ? rcu_read_lock_any_held+0x2a/0x78
        ___sys_sendmsg+0x6c/0x8c
        ? asm_sysvec_apic_timer_interrupt+0x12/0x20
        ? lockdep_hardirqs_on+0xd9/0x102
        ? sockfd_lookup_light+0x69/0x99
        __sys_sendmsg+0x50/0x6e
        do_syscall_64+0xcb/0xf2
        entry_SYSCALL_64_after_hwframe+0x44/0xae
       RIP: 0033:0x7f98dea28914
       Code: 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b5 0f 1f 80 00 00 00 00 48 8d 05 e9 5d 0c 00 8b 00 85 c0 75 13 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 41 54 41 89 d4 55 48 89 f5 53
       RSP: 002b:00007fff859f5e68 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
       RAX: ffffffffffffffda RBX: 00000000619cb810 RCX: 00007f98dea28914
       RDX: 0000000000000000 RSI: 00007fff859f5ed0 RDI: 0000000000000003
       RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000008
       R10: fffffffffffffce6 R11: 0000000000000246 R12: 0000000000000001
       R13: 000055c0097ae520 R14: 000055c0097957fd R15: 00007fff859f63a0
       </TASK>
       Modules linked in: bridge stp llc bonding virtio_net
      
      Cc: stable@vger.kernel.org
      Fixes: 53010f99 ("nexthop: Add support for IPv6 gateways")
      Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
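The error path described in this commit can be modeled in user space. The following is a minimal sketch, not the kernel code: `Ipv6Stub`, `nh_create_ipv6_buggy` and `nh_create_ipv6_fixed` are hypothetical names, and Python's `None` stands in for the NULL `fib6_nh_release` pointer.

```python
import errno


class Ipv6Stub:
    """Hypothetical stand-in for the ipv6_stub ops table with IPv6 disabled."""

    def __init__(self):
        # Only fib6_nh_init has a dummy implementation in this model.
        self.fib6_nh_init = lambda nh: -errno.EAFNOSUPPORT
        # None models the missing (NULL) fib6_nh_release pointer.
        self.fib6_nh_release = None


def nh_create_ipv6_buggy(stub, nh):
    err = stub.fib6_nh_init(nh)
    if err:
        # Calling None raises TypeError, the model's analogue of the oops.
        stub.fib6_nh_release(nh)
    return err


def nh_create_ipv6_fixed(stub, nh):
    err = stub.fib6_nh_init(nh)
    if err == -errno.EAFNOSUPPORT:
        # IPv6 is not enabled: return the stub's error without releasing.
        return err
    if err:
        stub.fib6_nh_release(nh)
    return err
```

With the stub above, the fixed variant returns `-errno.EAFNOSUPPORT` to the caller, while the buggy variant fails inside its own error path.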
  7. 22 Nov, 2021 1 commit
    • net: nexthop: release IPv6 per-cpu dsts when replacing a nexthop group · 1005f19b
      Committed by Nikolay Aleksandrov
      When replacing a nexthop group, we must release the IPv6 per-cpu dsts of
      the removed nexthop entries after an RCU grace period because they
      contain references to the nexthop's net device and to the fib6 info.
      With a specific series of events [1] we can reach a net device refcount
      imbalance which is unrecoverable. IPv4 is not affected because dsts
      don't take a refcount on the route.
      
      [1]
       $ ip nexthop list
        id 200 via 2002:db8::2 dev bridge.10 scope link onlink
        id 201 via 2002:db8::3 dev bridge scope link onlink
        id 203 group 201/200
       $ ip -6 route
        2001:db8::10 nhid 203 metric 1024 pref medium
           nexthop via 2002:db8::3 dev bridge weight 1 onlink
           nexthop via 2002:db8::2 dev bridge.10 weight 1 onlink
      
      Create rt6_info through one of the multipath legs, e.g.:
       $ taskset -a -c 1  ./pkt_inj 24 bridge.10 2001:db8::10
       (pkt_inj is just a custom packet generator, nothing special)
      
      Then remove that leg from the group by replace (let's assume it is id
      200 in this case):
       $ ip nexthop replace id 203 group 201
      
      Now remove the IPv6 route:
       $ ip -6 route del 2001:db8::10/128
      
      The route won't be really deleted due to the stale rt6_info holding 1
      refcnt in nexthop id 200.
      At this point we have the following reference count dependency:
       (deleted) IPv6 route holds 1 reference over nhid 203
       nh 203 holds 1 ref over id 201
       nh 200 holds 1 ref over the net device and the route due to the stale
       rt6_info
      
      Now to create a circular dependency between nh 200 and the IPv6 route, and
      also to get a reference over nh 200, restore nhid 200 in the group:
       $ ip nexthop replace id 203 group 201/200
      
      And now we have a permanent circular dependency because nhid 203 holds a
      reference over nh 200 and 201, but the route holds a ref over nh 203 and
      is deleted.
      
      To trigger the bug just delete the group (nhid 203):
       $ ip nexthop del id 203
      
      It won't really be deleted due to the IPv6 route dependency, and now we
      have 2 unlinked and deleted objects that reference each other: the group
      and the IPv6 route. The group drops the references it holds over its
      entries only at free time (i.e. when its own refcount drops to 0). That
      will never happen, so we get a permanent ref on them. Since one of the
      entries holds a reference over the IPv6 route, the route will also never
      be released.
      
      At this point the dependencies are:
       (deleted, only unlinked) IPv6 route holds reference over group nh 203
       (deleted, only unlinked) group nh 203 holds reference over nh 201 and 200
       nh 200 holds 1 ref over the net device and the route due to the stale
       rt6_info
      
      This is the last point where it can be fixed by running traffic through
      nh 200, and specifically through the same CPU so the rt6_info (dst) will
      get released due to the IPv6 genid, that in turn will free the IPv6
      route, which in turn will free the ref count over the group nh 203.
      
      If nh 200 is deleted at this point, it will never be released due to the
      ref from the unlinked group 203, it will only be unlinked:
       $ ip nexthop del id 200
       $ ip nexthop
       $
      
      Now we can never release that stale rt6_info, we have IPv6 route with ref
      over group nh 203, group nh 203 with ref over nh 200 and 201, nh 200 with
      rt6_info (dst) with ref over the net device and the IPv6 route. All of
      these objects are only unlinked, and cannot be released, thus they can't
      release their ref counts.
      
       Message from syslogd@dev at Nov 19 14:04:10 ...
        kernel:[73501.828730] unregister_netdevice: waiting for bridge.10 to become free. Usage count = 3
       Message from syslogd@dev at Nov 19 14:04:20 ...
        kernel:[73512.068811] unregister_netdevice: waiting for bridge.10 to become free. Usage count = 3
      
      Fixes: 7bf4796d ("nexthops: add support for replace")
      Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
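The reference cycle walked through in this commit can be reproduced with a toy refcount model. This is a simplified sketch under stated assumptions: `Obj` and `device_freed` are hypothetical names, one `rt6_info` stands in for all per-cpu dsts, the counts do not match the kernel's real usage counts, and the RCU grace period is ignored.

```python
class Obj:
    """Refcounted object that drops the references it holds once freed."""

    def __init__(self, name, linked=True):
        self.name = name
        self.refcnt = 1 if linked else 0   # 1 = reference held by its table
        self.holds = []
        self.freed = False

    def hold(self, other):
        other.refcnt += 1
        self.holds.append(other)

    def release(self, other):
        self.holds.remove(other)
        other.put()

    def put(self):
        self.refcnt -= 1
        if self.refcnt == 0:
            self.freed = True
            for other in list(self.holds):
                self.release(other)


def device_freed(release_dsts_on_replace):
    dev, route, group = Obj("bridge.10"), Obj("route"), Obj("nh 203")
    nh200, nh201 = Obj("nh 200"), Obj("nh 201")
    dst = Obj("rt6_info", linked=False)
    nh200.hold(dst)                  # per-cpu dst cached in nh 200
    dst.hold(dev)
    dst.hold(route)                  # the dst pins the device and the route
    route.hold(group)
    group.hold(nh200)
    group.hold(nh201)

    group.release(nh200)             # ip nexthop replace id 203 group 201
    if release_dsts_on_replace:
        nh200.release(dst)           # the fix, minus the RCU grace period
    route.put()                      # ip -6 route del 2001:db8::10/128
    group.hold(nh200)                # ip nexthop replace id 203 group 201/200
    group.put()                      # ip nexthop del id 203
    nh200.put()                      # ip nexthop del id 200
    dev.put()                        # unregister bridge.10
    return dev.freed
```

In this model `device_freed(False)` leaves the device pinned by the stale dst, and `device_freed(True)` lets every object reach a zero refcount.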
  8. 23 Sep, 2021 1 commit
    • nexthop: Fix memory leaks in nexthop notification chain listeners · 3106a084
      Committed by Ido Schimmel
      syzkaller discovered memory leaks [1] that can be reduced to the
      following commands:
      
       # ip nexthop add id 1 blackhole
       # devlink dev reload pci/0000:06:00.0
      
      As part of the reload flow, mlxsw will unregister its netdevs and then
      unregister from the nexthop notification chain. Before unregistering
      from the notification chain, mlxsw will receive delete notifications for
      nexthop objects using netdevs registered by mlxsw or their uppers. mlxsw
      will not receive notifications for nexthops using netdevs that are not
      dismantled as part of the reload flow. For example, the blackhole
      nexthop above that internally uses the loopback netdev as its nexthop
      device.
      
      One way to fix this problem is to have listeners flush their nexthop
      tables after unregistering from the notification chain. This is
      error-prone as evident by this patch and also not symmetric with the
      registration path where a listener receives a dump of all the existing
      nexthops.
      
      Therefore, fix this problem by replaying delete notifications for the
      listener being unregistered. This is symmetric to the registration path
      and also consistent with the netdev notification chain.
      
      The above means that unregister_nexthop_notifier(), like
      register_nexthop_notifier(), will have to take RTNL in order to iterate
      over the existing nexthops and that any callers of the function cannot
      hold RTNL. This is true for mlxsw and netdevsim, but not for the VXLAN
      driver. To avoid a deadlock, change the latter to unregister its nexthop
      listener without holding RTNL, making it symmetric to the registration
      path.
      
      [1]
      unreferenced object 0xffff88806173d600 (size 512):
        comm "syz-executor.0", pid 1290, jiffies 4295583142 (age 143.507s)
        hex dump (first 32 bytes):
          41 9d 1e 60 80 88 ff ff 08 d6 73 61 80 88 ff ff  A..`......sa....
          08 d6 73 61 80 88 ff ff 01 00 00 00 00 00 00 00  ..sa............
        backtrace:
          [<ffffffff81a6b576>] kmemleak_alloc_recursive include/linux/kmemleak.h:43 [inline]
          [<ffffffff81a6b576>] slab_post_alloc_hook+0x96/0x490 mm/slab.h:522
          [<ffffffff81a716d3>] slab_alloc_node mm/slub.c:3206 [inline]
          [<ffffffff81a716d3>] slab_alloc mm/slub.c:3214 [inline]
          [<ffffffff81a716d3>] kmem_cache_alloc_trace+0x163/0x370 mm/slub.c:3231
          [<ffffffff82e8681a>] kmalloc include/linux/slab.h:591 [inline]
          [<ffffffff82e8681a>] kzalloc include/linux/slab.h:721 [inline]
          [<ffffffff82e8681a>] mlxsw_sp_nexthop_obj_group_create drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c:4918 [inline]
          [<ffffffff82e8681a>] mlxsw_sp_nexthop_obj_new drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c:5054 [inline]
          [<ffffffff82e8681a>] mlxsw_sp_nexthop_obj_event+0x59a/0x2910 drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c:5239
          [<ffffffff813ef67d>] notifier_call_chain+0xbd/0x210 kernel/notifier.c:83
          [<ffffffff813f0662>] blocking_notifier_call_chain kernel/notifier.c:318 [inline]
          [<ffffffff813f0662>] blocking_notifier_call_chain+0x72/0xa0 kernel/notifier.c:306
          [<ffffffff8384b9c6>] call_nexthop_notifiers+0x156/0x310 net/ipv4/nexthop.c:244
          [<ffffffff83852bd8>] insert_nexthop net/ipv4/nexthop.c:2336 [inline]
          [<ffffffff83852bd8>] nexthop_add net/ipv4/nexthop.c:2644 [inline]
          [<ffffffff83852bd8>] rtm_new_nexthop+0x14e8/0x4d10 net/ipv4/nexthop.c:2913
          [<ffffffff833e9a78>] rtnetlink_rcv_msg+0x448/0xbf0 net/core/rtnetlink.c:5572
          [<ffffffff83608703>] netlink_rcv_skb+0x173/0x480 net/netlink/af_netlink.c:2504
          [<ffffffff833de032>] rtnetlink_rcv+0x22/0x30 net/core/rtnetlink.c:5590
          [<ffffffff836069de>] netlink_unicast_kernel net/netlink/af_netlink.c:1314 [inline]
          [<ffffffff836069de>] netlink_unicast+0x5ae/0x7f0 net/netlink/af_netlink.c:1340
          [<ffffffff83607501>] netlink_sendmsg+0x8e1/0xe30 net/netlink/af_netlink.c:1929
          [<ffffffff832fde84>] sock_sendmsg_nosec net/socket.c:704 [inline]
          [<ffffffff832fde84>] sock_sendmsg net/socket.c:724 [inline]
          [<ffffffff832fde84>] ____sys_sendmsg+0x874/0x9f0 net/socket.c:2409
          [<ffffffff83304a44>] ___sys_sendmsg+0x104/0x170 net/socket.c:2463
          [<ffffffff83304c01>] __sys_sendmsg+0x111/0x1f0 net/socket.c:2492
          [<ffffffff83304d5d>] __do_sys_sendmsg net/socket.c:2501 [inline]
          [<ffffffff83304d5d>] __se_sys_sendmsg net/socket.c:2499 [inline]
          [<ffffffff83304d5d>] __x64_sys_sendmsg+0x7d/0xc0 net/socket.c:2499
      
      Fixes: 2a014b20 ("mlxsw: spectrum_router: Add support for nexthop objects")
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
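The symmetry argument in this commit (dump on register, replayed deletes on unregister) can be illustrated with a toy notification chain. This is a sketch under assumptions: `NexthopChain`, `Listener` and the `"replace"`/`"del"` event strings are hypothetical, and RTNL locking is not modeled.

```python
class Listener:
    """Driver-side listener that allocates one entry per nexthop it sees."""

    def __init__(self):
        self.table = {}

    def event(self, kind, nh_id):
        if kind == "replace":
            self.table[nh_id] = object()   # stands in for a driver allocation
        else:
            self.table.pop(nh_id, None)


class NexthopChain:
    """Toy chain that replays state on register and on unregister."""

    def __init__(self):
        self.nexthops = set()
        self.listeners = []

    def add_nexthop(self, nh_id):
        self.nexthops.add(nh_id)
        for listener in self.listeners:
            listener.event("replace", nh_id)

    def register(self, listener):
        for nh_id in sorted(self.nexthops):    # dump the existing nexthops
            listener.event("replace", nh_id)
        self.listeners.append(listener)

    def unregister(self, listener, replay_delete=True):
        self.listeners.remove(listener)
        if replay_delete:
            for nh_id in sorted(self.nexthops):  # mirror of the register dump
                listener.event("del", nh_id)
```

Without the replay, a listener that unregisters while a nexthop such as the blackhole one still exists keeps its entry for that nexthop, which is the leak in the model; with the replay its table ends up empty.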
  9. 20 Sep, 2021 1 commit
    • nexthop: Fix division by zero while replacing a resilient group · 563f23b0
      Committed by Ido Schimmel
      The resilient nexthop group torture tests in fib_nexthops.sh exposed a
      possible division by zero while replacing a resilient group [1]. The
      division by zero occurs when the data path sees a resilient nexthop
      group with zero buckets.
      
      The tests replace a resilient nexthop group in a loop while traffic is
      forwarded through it. The tests do not specify the number of buckets
      while performing the replacement, resulting in the kernel allocating a
      stub resilient table (i.e., 'struct nh_res_table') with zero buckets.
      
      This table should never be visible to the data path, but the old nexthop
      group (i.e., 'oldg') might still be used by the data path when the stub
      table is assigned to it.
      
      Fix this by only assigning the stub table to the old nexthop group after
      making sure the group is no longer used by the data path.
      
      Tested with fib_nexthops.sh:
      
      Tests passed: 222
      Tests failed:   0
      
      [1]
       divide error: 0000 [#1] PREEMPT SMP KASAN
       CPU: 0 PID: 1850 Comm: ping Not tainted 5.14.0-custom-10271-ga86eb53057fe #1107
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-4.fc34 04/01/2014
       RIP: 0010:nexthop_select_path+0x2d2/0x1a80
      [...]
       Call Trace:
        fib_select_multipath+0x79b/0x1530
        fib_select_path+0x8fb/0x1c10
        ip_route_output_key_hash_rcu+0x1198/0x2da0
        ip_route_output_key_hash+0x190/0x340
        ip_route_output_flow+0x21/0x120
        raw_sendmsg+0x91d/0x2e10
        inet_sendmsg+0x9e/0xe0
        __sys_sendto+0x23d/0x360
        __x64_sys_sendto+0xe1/0x1b0
        do_syscall_64+0x35/0x80
        entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Cc: stable@vger.kernel.org
      Fixes: 283a72a5 ("nexthop: Add implementation of resilient next-hop groups")
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
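The ordering problem in this commit can be sketched as follows. This is only an illustration: `Group`, `select_bucket` and `replace` are hypothetical names, and the RCU grace period is modeled by nothing more than the order in which an in-flight reader and the stub assignment run.

```python
class Group:
    """Hypothetical stand-in for a nexthop group holding a resilient table."""

    def __init__(self, buckets):
        self.buckets = buckets


def select_bucket(group, flow_hash):
    # The data path picks a bucket by hash modulo the bucket count.
    return group.buckets[flow_hash % len(group.buckets)]


def replace(oldg, stub_buckets, in_flight_reader, wait_for_readers_first):
    """Hand the zero-bucket stub table to the old group during a replace."""
    if wait_for_readers_first:
        in_flight_reader(oldg)        # reader finishes on the real table
        oldg.buckets = stub_buckets   # stub assigned once nobody looks
    else:
        oldg.buckets = stub_buckets   # stub becomes visible too early
        in_flight_reader(oldg)        # reader now divides by zero
```

Passing `lambda g: select_bucket(g, 7)` as the reader raises `ZeroDivisionError` only when the stub is assigned before the reader has finished.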
  10. 02 Sep, 2021 1 commit
    • Set fc_nlinfo in nh_create_ipv4, nh_create_ipv6 · 9aca491e
      Committed by Ryoga Saito
      This patch fixes a kernel NULL pointer dereference when creating a nexthop
      that is bound to SRv6 decapsulation. During nexthop creation,
      __seg6_end_dt_vrf_build is called. __seg6_end_dt_vrf_build expects
      fc_nlinfo in fib6_config to be set correctly, but it isn't set in
      nh_create_ipv6, which causes a kernel crash.
      
      Here are the steps to reproduce the kernel crash:
      
      1. modprobe vrf
      2. ip -6 nexthop add encap seg6local action End.DT4 vrftable 1 dev eth0
      
      We got the following message:
      
      [  901.370336] BUG: kernel NULL pointer dereference, address: 0000000000000ba0
      [  901.371658] #PF: supervisor read access in kernel mode
      [  901.372672] #PF: error_code(0x0000) - not-present page
      [  901.373672] PGD 0 P4D 0
      [  901.374248] Oops: 0000 [#1] SMP PTI
      [  901.374944] CPU: 0 PID: 8593 Comm: ip Not tainted 5.14-051400-generic #202108310811-Ubuntu
      [  901.376404] Hardware name: Red Hat KVM, BIOS 1.11.1-4.module_el8.2.0+320+13f867d7 04/01/2014
      [  901.377907] RIP: 0010:vrf_ifindex_lookup_by_table_id+0x19/0x90 [vrf]
      [  901.379182] Code: c1 e9 72 ff ff ff e8 96 49 01 c2 66 0f 1f 44 00 00 0f 1f 44 00 00 55 48 89 e5 41 56 41 55 41 89 f5 41 54 53 8b 05 47 4c 00 00 <48> 8b 97 a0 0b 00 00 48 8b 1c c2 e8 57 27 53 c1 4c 8d a3 88 00 00
      [  901.382652] RSP: 0018:ffffbf2d02043590 EFLAGS: 00010282
      [  901.383746] RAX: 000000000000000b RBX: ffff990808255e70 RCX: ffffbf2d02043aa8
      [  901.385436] RDX: 0000000000000001 RSI: 0000000000000001 RDI: 0000000000000000
      [  901.386924] RBP: ffffbf2d020435b0 R08: 00000000000000c0 R09: ffff990808255e40
      [  901.388537] R10: ffffffff83b08c90 R11: 0000000000000009 R12: 0000000000000000
      [  901.389937] R13: 0000000000000001 R14: 0000000000000000 R15: 000000000000000b
      [  901.391226] FS:  00007fe49381f740(0000) GS:ffff99087dc00000(0000) knlGS:0000000000000000
      [  901.392737] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  901.393803] CR2: 0000000000000ba0 CR3: 000000000e3e8003 CR4: 0000000000770ef0
      [  901.395122] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  901.396496] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  901.397833] PKRU: 55555554
      [  901.398578] Call Trace:
      [  901.399144]  l3mdev_ifindex_lookup_by_table_id+0x3b/0x70
      [  901.400179]  __seg6_end_dt_vrf_build+0x34/0xd0
      [  901.401067]  seg6_end_dt4_build+0x16/0x20
      [  901.401904]  seg6_local_build_state+0x271/0x430
      [  901.402797]  lwtunnel_build_state+0x81/0x130
      [  901.403645]  fib_nh_common_init+0x82/0x100
      [  901.404465]  ? sock_def_readable+0x4b/0x80
      [  901.405285]  fib6_nh_init+0x115/0x7c0
      [  901.406033]  nh_create_ipv6.isra.0+0xe1/0x140
      [  901.406932]  rtm_new_nexthop+0x3b7/0xeb0
      [  901.407828]  rtnetlink_rcv_msg+0x152/0x3a0
      [  901.408663]  ? rtnl_calcit.isra.0+0x130/0x130
      [  901.409535]  netlink_rcv_skb+0x55/0x100
      [  901.410319]  rtnetlink_rcv+0x15/0x20
      [  901.411026]  netlink_unicast+0x1a8/0x250
      [  901.411813]  netlink_sendmsg+0x238/0x470
      [  901.412602]  ? _copy_from_user+0x2b/0x60
      [  901.413394]  sock_sendmsg+0x65/0x70
      [  901.414112]  ____sys_sendmsg+0x218/0x290
      [  901.414929]  ? copy_msghdr_from_user+0x5c/0x90
      [  901.415814]  ___sys_sendmsg+0x81/0xc0
      [  901.416559]  ? fsnotify_destroy_marks+0x27/0xf0
      [  901.417447]  ? call_rcu+0xa4/0x230
      [  901.418153]  ? kmem_cache_free+0x23f/0x410
      [  901.418972]  ? dentry_free+0x37/0x70
      [  901.419705]  ? mntput_no_expire+0x4c/0x260
      [  901.420574]  __sys_sendmsg+0x62/0xb0
      [  901.421297]  __x64_sys_sendmsg+0x1f/0x30
      [  901.422057]  do_syscall_64+0x5c/0xc0
      [  901.422756]  ? syscall_exit_to_user_mode+0x27/0x50
      [  901.423675]  ? __x64_sys_close+0x12/0x40
      [  901.424462]  ? do_syscall_64+0x69/0xc0
      [  901.425219]  ? irqentry_exit_to_user_mode+0x9/0x20
      [  901.426149]  ? irqentry_exit+0x19/0x30
      [  901.426901]  ? exc_page_fault+0x89/0x160
      [  901.427709]  ? asm_exc_page_fault+0x8/0x30
      [  901.428536]  entry_SYSCALL_64_after_hwframe+0x44/0xae
      [  901.429514] RIP: 0033:0x7fe493945747
      [  901.430248] Code: 64 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 89 54 24 1c 48 89 74 24 10
      [  901.433549] RSP: 002b:00007ffe9932cf68 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
      [  901.434981] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fe493945747
      [  901.436303] RDX: 0000000000000000 RSI: 00007ffe9932cfe0 RDI: 0000000000000003
      [  901.437607] RBP: 00000000613053f7 R08: 0000000000000001 R09: 00007ffe9932d07c
      [  901.438990] R10: 000055f4a903a010 R11: 0000000000000246 R12: 0000000000000001
      [  901.440340] R13: 0000000000000001 R14: 000055f4a802b163 R15: 000055f4a8042020
      [  901.441630] Modules linked in: vrf nls_utf8 isofs nls_iso8859_1 dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua intel_rapl_msr intel_rapl_common isst_if_mbox_msr isst_if_common nfit rapl input_leds joydev serio_raw qemu_fw_cfg mac_hid sch_fq_codel drm virtio_rng ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel crypto_simd virtio_net net_failover cryptd psmouse virtio_blk failover i2c_piix4 pata_acpi floppy
      [  901.450808] CR2: 0000000000000ba0
      [  901.451514] ---[ end trace c27b934b99ade304 ]---
      [  901.452403] RIP: 0010:vrf_ifindex_lookup_by_table_id+0x19/0x90 [vrf]
      [  901.453626] Code: c1 e9 72 ff ff ff e8 96 49 01 c2 66 0f 1f 44 00 00 0f 1f 44 00 00 55 48 89 e5 41 56 41 55 41 89 f5 41 54 53 8b 05 47 4c 00 00 <48> 8b 97 a0 0b 00 00 48 8b 1c c2 e8 57 27 53 c1 4c 8d a3 88 00 00
      [  901.456910] RSP: 0018:ffffbf2d02043590 EFLAGS: 00010282
      [  901.457912] RAX: 000000000000000b RBX: ffff990808255e70 RCX: ffffbf2d02043aa8
      [  901.459238] RDX: 0000000000000001 RSI: 0000000000000001 RDI: 0000000000000000
      [  901.460552] RBP: ffffbf2d020435b0 R08: 00000000000000c0 R09: ffff990808255e40
      [  901.461882] R10: ffffffff83b08c90 R11: 0000000000000009 R12: 0000000000000000
      [  901.463208] R13: 0000000000000001 R14: 0000000000000000 R15: 000000000000000b
      [  901.464529] FS:  00007fe49381f740(0000) GS:ffff99087dc00000(0000) knlGS:0000000000000000
      [  901.466058] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  901.467189] CR2: 0000000000000ba0 CR3: 000000000e3e8003 CR4: 0000000000770ef0
      [  901.468515] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  901.469858] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  901.471139] PKRU: 55555554
      Signed-off-by: Ryoga Saito <contact@proelbtn.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  11. 20 Apr, 2021 1 commit
  12. 29 Mar, 2021 1 commit
    • nexthop: Rename artifacts related to legacy multipath nexthop groups · de1d1ee3
      Committed by Petr Machata
      After resilient next-hop groups have been added recently, there are two
      types of multipath next-hop groups: the legacy "mpath", and the new
      "resilient". Calling the legacy next-hop group type "mpath" is unfortunate,
      because that describes the fact that a packet could be forwarded in one of
      several paths, which is also true for the resilient next-hop groups.
      
      Therefore, to make the naming clearer, rename various artifacts to reflect
      the assumptions made. Therefore as of this patch:
      
      - The flag for multipath groups is nh_grp_entry::is_multipath. This
        includes the legacy and resilient groups, as well as any future group
        types that behave as multipath groups.
        Functions that assume this have "mpath" in the name.
      
      - The flag for legacy multipath groups is nh_grp_entry::hash_threshold.
        Functions that assume this have "hthr" in the name.
      
      - The flag for resilient groups is nh_grp_entry::resilient.
        Functions that assume this have "res" in the name.
      
      Besides the above, struct nh_grp_entry::mpath was renamed to ::hthr as
      well.
      
      UAPI artifacts were obviously left intact.
      Suggested-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  13. 12 Mar, 2021 13 commits
    • nexthop: Enable resilient next-hop groups · 15e1dd57
      Committed by Petr Machata
      Now that all the code is in place, stop rejecting requests to create
      resilient next-hop groups.
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • nexthop: Notify userspace about bucket migrations · 0b4818aa
      Committed by Petr Machata
      Nexthop replacements et al. are notified through netlink, but if a delayed
      work migrates buckets in the background, userspace will stay oblivious.
      Notify these as RTM_NEWNEXTHOPBUCKET events.
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • nexthop: Add netlink handlers for bucket get · 187d4c6b
      Committed by Petr Machata
      Allow getting (but not setting) individual buckets to inspect the next hop
      mapped therein, idle time, and flags.
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • nexthop: Add netlink handlers for bucket dump · 8a1bbabb
      Committed by Petr Machata
      Add a dump handler for resilient next hop buckets. When next-hop group ID
      is given, it walks buckets of that group, otherwise it walks buckets of all
      groups. It then dumps the buckets whose next hops match the given filtering
      criteria.
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • nexthop: Add netlink handlers for resilient nexthop groups · a2601e2b
      Committed by Petr Machata
      Implement the netlink messages that allow creation and dumping of resilient
      nexthop groups.
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • nexthop: Allow reporting activity of nexthop buckets · cfc15c1d
      Committed by Ido Schimmel
      The kernel periodically checks the idle time of nexthop buckets to
      determine if they are idle and can be re-populated with a new nexthop.
      
      When the resilient nexthop group is offloaded to hardware, the kernel
      will not see activity on nexthop buckets unless it is reported from
      hardware.
      
      Add a function that can be periodically called by device drivers to
      report activity on nexthop buckets after querying it from the underlying
      device.
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • nexthop: Allow setting "offload" and "trap" indication of nexthop buckets · 56ad5ba3
      Committed by Ido Schimmel
      Add a function that can be called by device drivers to set "offload" or
      "trap" indication on nexthop buckets following nexthop notifications and
      other changes such as a neighbour becoming invalid.
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • nexthop: Implement notifiers for resilient nexthop groups · 7c37c7e0
      Committed by Petr Machata
      Implement the following notifications towards drivers:
      
      - NEXTHOP_EVENT_REPLACE, when a resilient nexthop group is created.
      
      - NEXTHOP_EVENT_BUCKET_REPLACE any time there is a change in assignment of
        next hops to hash table buckets. That includes replacements, deletions,
        and delayed upkeep cycles. Some bucket notifications can be vetoed by the
        driver, to make it possible to propagate bucket busy-ness flags from the
        HW back to the algorithm. Some are however forced, e.g. if a next hop is
        deleted, all buckets that use this next hop simply must be migrated,
        whether the HW wishes so or not.
      
      - NEXTHOP_EVENT_RES_TABLE_PRE_REPLACE, before a resilient nexthop group is
        replaced. Usually the driver will get the bucket notifications as well,
        and could veto those. But in some cases, a bucket may not be migrated
        immediately, but during delayed upkeep, and that is too late to roll the
        transaction back. This notification allows the driver to take a look and
        veto the new proposed group up front, before anything is committed.
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • nexthop: Add implementation of resilient next-hop groups · 283a72a5
      Committed by Petr Machata
      At this moment, there is only one type of next-hop group: an mpath group,
      which implements the hash-threshold algorithm.
      
      To select a next hop, the hash-threshold algorithm first assigns a range of
      hashes to each next hop in the group, and then selects the next hop by
      comparing the SKB hash with the individual ranges. When a next hop is
      removed from the group, the ranges are recomputed, which leads to
      reassignment of parts of hash space from one next hop to another. While
      there will usually be some overlap between the previous and the new
      distribution, some traffic flows change the next hop that they resolve to.
      That causes problems e.g. as established TCP connections are reset, because
      the traffic is forwarded to a server that is not familiar with the
      connection.
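The hash-threshold behavior described above can be sketched in a few lines. This is an illustrative model, not the kernel implementation: `upper_bounds` and `select_by_threshold` are hypothetical names, and the size of the hash space is a free parameter chosen for readability.

```python
def upper_bounds(weights, space=1 << 16):
    """Return (next hop, exclusive upper bound) pairs covering the hash space."""
    total = sum(weights.values())
    bounds, acc = [], 0
    for nh, weight in weights.items():
        acc += weight
        bounds.append((nh, space * acc // total))
    return bounds


def select_by_threshold(bounds, flow_hash):
    # Pick the first next hop whose range contains the flow hash.
    for nh, upper in bounds:
        if flow_hash < upper:
            return nh
    raise ValueError("hash outside the hash space")
```

With three equal-weight next hops A, B and C over a space of 300, a flow hashing to 170 resolves to B. After A is removed, the ranges are recomputed and the same flow resolves to C, even though B is still in the group.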
      
      Resilient hashing is a technique to address the above problem. A resilient
      next-hop group has another layer of indirection between the group itself
      and its constituent next hops: a hash table. The selection algorithm uses a
      straightforward modulo operation to choose a hash bucket, and then reads
      the next hop that this bucket contains, and forwards traffic there.
      
      This indirection brings an important feature. In the hash-threshold
      algorithm, the range of hashes associated with a next hop must be
      continuous. With a hash table, mapping between the hash table buckets and
      the individual next hops is arbitrary. Therefore when a next hop is deleted
      the buckets that held it are simply reassigned to other next hops. When
      weights of next hops in a group are altered, it may be possible to choose a
      subset of buckets that are currently not used for forwarding traffic, and
      use those to satisfy the new next-hop distribution demands, keeping the
      "busy" buckets intact. This way, established flows ideally continue to be
      forwarded to the same endpoints through the same paths as before the
      next-hop group change.
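
      A minimal sketch of the bucket indirection, assuming a flat array of
      buckets that each name a next hop by ID. The types res_bucket and
      res_table and the helpers res_select() and res_replace_nh() are
      hypothetical and only model the idea; they are not the kernel's data
      structures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bucket: which next hop it points at and when it was last hit. */
struct res_bucket {
    int nh_id;
    uint64_t used_time;
};

struct res_table {
    struct res_bucket *buckets;
    size_t num_buckets;
};

/* Forwarding path: a modulo picks the bucket, the bucket names the next hop.
 * The only write is the "last used" timestamp. */
static int res_select(struct res_table *t, uint32_t hash, uint64_t now)
{
    struct res_bucket *b;

    assert(t->num_buckets > 0);
    b = &t->buckets[hash % t->num_buckets];
    b->used_time = now;
    return b->nh_id;
}

/* Next hop removal: only buckets that pointed at the removed next hop are
 * reassigned; every other bucket, and thus every other flow, is untouched.
 * Returns the number of buckets that were reassigned. */
static size_t res_replace_nh(struct res_table *t, int old_id, int new_id)
{
    size_t moved = 0;

    for (size_t i = 0; i < t->num_buckets; i++) {
        if (t->buckets[i].nh_id == old_id) {
            t->buckets[i].nh_id = new_id;
            moved++;
        }
    }
    return moved;
}
```

      Because the bucket-to-next-hop mapping is arbitrary, a flow whose bucket
      did not reference the removed next hop resolves to the same next hop
      before and after the change.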
      
      In a nutshell, the algorithm works as follows. Each next hop has a number
      of buckets that it wants to have, according to its weight and the number of
      buckets in the hash table. In case of an event that might cause bucket
      allocation change, the numbers for individual next hops are updated,
      similarly to how ranges are updated for mpath group next hops. Following
      that, a new "upkeep" algorithm runs, and for idle buckets that belong to a
      next hop that is currently occupying more buckets than it wants (it is
      "overweight"), it migrates the buckets to one of the next hops that has
      fewer buckets than it wants (it is "underweight"). If, after this, there
      are still underweight next hops, another upkeep run is scheduled to a
      future time.
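
      One upkeep pass could look roughly like the sketch below. It assumes
      that a bucket counts as idle once it has not been used for idle_timer
      time units, and it leaves out forced migration entirely. The names
      up_nh, up_bucket and up_run() are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-next-hop accounting. */
struct up_nh {
    unsigned wants; /* buckets it should own, derived from its weight */
    unsigned count; /* buckets it currently owns */
};

/* Hypothetical bucket: index of the owning next hop and last-use time. */
struct up_bucket {
    size_t nh;
    uint64_t used_time;
};

/* One upkeep pass. Idle buckets owned by an overweight next hop are handed
 * to an underweight one. Returns true if some next hop is still underweight,
 * i.e. another pass should be scheduled for later. */
static bool up_run(struct up_bucket *b, size_t nb, struct up_nh *nhs,
                   size_t nn, uint64_t now, uint64_t idle_timer)
{
    for (size_t i = 0; i < nb; i++) {
        struct up_nh *owner = &nhs[b[i].nh];

        if (owner->count <= owner->wants)
            continue; /* owner is not overweight */
        if (now - b[i].used_time < idle_timer)
            continue; /* bucket is busy, leave its flows alone */

        for (size_t j = 0; j < nn; j++) {
            if (nhs[j].count < nhs[j].wants) {
                owner->count--;
                nhs[j].count++;
                b[i].nh = j;
                break;
            }
        }
    }

    for (size_t j = 0; j < nn; j++)
        if (nhs[j].count < nhs[j].wants)
            return true;
    return false;
}
```

      A caller would re-arm a timer whenever up_run() returns true, which
      corresponds to scheduling another upkeep run for a future time.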
      
      Chances are there are not enough "idle" buckets to satisfy the new demands.
      The algorithm has knobs that control both what it means for a bucket to be
      idle, and whether and when to forcefully migrate buckets if the number of
      idle buckets remains insufficient.
      
      There are three users of the resilient data structures.
      
      - The forwarding code accesses them under RCU, and does not modify them
        except for updating the time a selected bucket was last used.
      
      - Netlink code, running under RTNL, which may modify the data.
      
      - The delayed upkeep code, which may modify the data. This runs unlocked,
        and mutual exclusion between the RTNL code and the delayed upkeep is
        maintained by canceling the delayed work synchronously before the RTNL
        code touches anything. Later it restarts the delayed work if necessary.
      
      The RTNL code has to implement next-hop group replacement, next hop
      removal, etc. For removal, the mpath code uses a neat trick of having a
      backup next hop group structure, doing the necessary changes offline, and
      then RCU-swapping them in. However, the hash tables for resilient hashing
      are about an order of magnitude larger than the groups themselves (the size
      might be e.g. 4K entries), and it was felt that keeping two of them is
      overkill. Both the primary next-hop group and the spare therefore use the
      same resilient table, and writers are careful to keep all references valid
      for the forwarding code. The hash table references next-hop group entries
      from the next-hop group that is currently in the primary role (i.e. not
      spare). During the transition from primary to spare, the table references a
      mix of both the primary group and the spare. When a next hop is deleted,
      the corresponding buckets are not set to NULL, but instead marked as empty,
      so that the pointer is valid and can be used by the forwarding code. The
      buckets are then migrated to a new next-hop group entry during upkeep. The
      only times that the hash table is invalid are the very beginning and very
      end of its lifetime. Between those points, it is always kept valid.
      
      This patch introduces the core support code itself. It does not handle
      notifications towards drivers, which are kept as if the group were an mpath
      one. It does not handle netlink either. The only bit currently exposed to
      user space is the new next-hop group type, and that is currently bounced.
      There is therefore no way to actually access this code.
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      283a72a5
    • I
      nexthop: Add netlink defines and enumerators for resilient NH groups · 710ec562
      Ido Schimmel committed
      - RTM_NEWNEXTHOP et al. that handle resilient groups will have a new nested
        attribute, NHA_RES_GROUP, whose elements are attributes NHA_RES_GROUP_*.
      
      - RTM_NEWNEXTHOPBUCKET et al. is a suite of new messages that will
        currently serve only for dumping of individual buckets of resilient next
        hop groups. For nexthop group buckets, these messages will carry a nested
        attribute NHA_RES_BUCKET, whose elements are attributes NHA_RES_BUCKET_*.
      
        There are several reasons why a new suite of messages is created for
        nexthop buckets instead of overloading the information on the existing
        RTM_{NEW,DEL,GET}NEXTHOP messages.
      
        First, a nexthop group can contain a large number of nexthop buckets (4k
        is not unheard of). This imposes limits on the amount of information that
        can be encoded for each nexthop bucket given a netlink message is limited
        to 64k bytes.
      
        Second, while RTM_NEWNEXTHOPBUCKET is only used for notifications at
        this point, in the future it can be extended to provide user space with
        control over nexthop buckets configuration.
      
      - The new group type is NEXTHOP_GRP_TYPE_RES. Note that nexthop code is
        adjusted to bounce groups with that type for now.
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      710ec562
    • P
      nexthop: Add a dedicated flag for multipath next-hop groups · 90e1a9e2
      Petr Machata committed
      With the introduction of resilient nexthop groups, there will be two types
      of multipath groups: the current hash-threshold "mpath" ones, and resilient
      groups. Both are multipath, but to determine that, the system needs to
      consider two flags. This might prove costly in the datapath. Therefore,
      introduce a new flag that should be set for next-hop groups that have more
      than one nexthop, and should be considered multipath.
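
      The idea can be illustrated with a reduced model in which the dedicated
      flag is derived once, at configuration time, so the datapath tests a
      single field. The struct grp_flags and both helpers are hypothetical,
      and deriving the flag from the two type flags is an assumption made for
      this sketch.

```c
#include <stdbool.h>

/* Hypothetical, reduced model of a next-hop group's type flags. */
struct grp_flags {
    bool hash_threshold; /* classic "mpath" group */
    bool resilient;      /* resilient group */
    bool is_multipath;   /* dedicated flag covering both types above */
};

/* Control path: compute the dedicated flag once, when the type is set. */
static void grp_set_type(struct grp_flags *g, bool hash_threshold,
                         bool resilient)
{
    g->hash_threshold = hash_threshold;
    g->resilient = resilient;
    g->is_multipath = hash_threshold || resilient;
}

/* Datapath: one flag to test instead of two. */
static bool grp_is_multipath(const struct grp_flags *g)
{
    return g->is_multipath;
}
```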
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      90e1a9e2
    • P
      nexthop: __nh_notifier_single_info_init(): Make nh_info an argument · 96a85625
      Petr Machata committed
      The cited function currently uses rtnl_dereference() to get nh_info from a
      handed-in nexthop. However, under the resilient hashing scheme, this
      function will not always be called under RTNL; sometimes the mutual
      exclusion will be achieved differently. Therefore move the nh_info
      extraction from the function to its callers to make it possible to use a
      different synchronization guarantee.
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      96a85625
    • P
      nexthop: Pass nh_config to replace_nexthop() · 597f48e4
      Petr Machata committed
      Currently, replace assumes that the new group that is given is a
      fully-formed object. But mpath groups really only have one attribute, and
      that is the constituent next hop configuration. This may not be universally
      true. From the usability perspective, it is desirable to allow the replace
      operation to adjust just the constituent next hop configuration and leave
      the group attributes as such intact.
      
      But the object that keeps track of whether an attribute was or was not
      given is the nh_config object, not the next hop or next-hop group. To allow
      (selective) attribute updates during NH group replacement, propagate `cfg'
      to replace_nexthop() and further to replace_nexthop_grp().
      Signed-off-by: Petr Machata <petrm@nvidia.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      597f48e4
  14. 05 Mar, 2021 1 commit
    • I
      nexthop: Do not flush blackhole nexthops when loopback goes down · 76c03bf8
      Ido Schimmel committed
      As far as user space is concerned, blackhole nexthops do not have a
      nexthop device and therefore should not be affected by the
      administrative or carrier state of any netdev.
      
      However, when the loopback netdev goes down all the blackhole nexthops
      are flushed. This happens because internally the kernel associates
      blackhole nexthops with the loopback netdev.
      
      This behavior is confusing to those not familiar with kernel
      internals and also diverges from the legacy API where blackhole IPv4
      routes are not flushed when the loopback netdev goes down:
      
       # ip route add blackhole 198.51.100.0/24
       # ip link set dev lo down
       # ip route show 198.51.100.0/24
       blackhole 198.51.100.0/24
      
      Blackhole IPv6 routes are flushed, but at least user space knows that
      they are associated with the loopback netdev:
      
       # ip -6 route show 2001:db8:1::/64
       blackhole 2001:db8:1::/64 dev lo metric 1024 pref medium
      
      Fix this by only flushing blackhole nexthops when the loopback netdev is
      unregistered.
      
      Fixes: ab84be7e ("net: Initial nexthop code")
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reported-by: Donald Sharp <sharpd@nvidia.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      76c03bf8
  15. 29 Jan, 2021 12 commits
  16. 21 Jan, 2021 2 commits