1. 25 Jul, 2023 · 1 commit
  2. 02 Nov, 2022 · 1 commit
  3. 27 Aug, 2020 · 1 commit
    • ipv4: Silence suspicious RCU usage warning · 7f6f32bb
      Committed by Ido Schimmel
      fib_info_notify_update() is always called with RTNL held, but not from
      an RCU read-side critical section. This leads to the following warning
      [1] when the FIB table list is traversed with
      hlist_for_each_entry_rcu(), but without a proper lockdep expression.
      
      Since modification of the list is protected by RTNL, silence the warning
      by adding a lockdep expression which verifies RTNL is held.
      
      [1]
       =============================
       WARNING: suspicious RCU usage
       5.9.0-rc1-custom-14233-g2f26e122d62f #129 Not tainted
       -----------------------------
       net/ipv4/fib_trie.c:2124 RCU-list traversed in non-reader section!!
      
       other info that might help us debug this:
      
       rcu_scheduler_active = 2, debug_locks = 1
       1 lock held by ip/834:
        #0: ffffffff85a3b6b0 (rtnl_mutex){+.+.}-{3:3}, at: rtnetlink_rcv_msg+0x49a/0xbd0
      
       stack backtrace:
       CPU: 0 PID: 834 Comm: ip Not tainted 5.9.0-rc1-custom-14233-g2f26e122d62f #129
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-2.fc32 04/01/2014
       Call Trace:
        dump_stack+0x100/0x184
        lockdep_rcu_suspicious+0x143/0x14d
        fib_info_notify_update+0x8d1/0xa60
        __nexthop_replace_notify+0xd2/0x290
        rtm_new_nexthop+0x35e2/0x5946
        rtnetlink_rcv_msg+0x4f7/0xbd0
        netlink_rcv_skb+0x17a/0x480
        rtnetlink_rcv+0x22/0x30
        netlink_unicast+0x5ae/0x890
        netlink_sendmsg+0x98a/0xf40
        ____sys_sendmsg+0x879/0xa00
        ___sys_sendmsg+0x122/0x190
        __sys_sendmsg+0x103/0x1d0
        __x64_sys_sendmsg+0x7d/0xb0
        do_syscall_64+0x32/0x50
        entry_SYSCALL_64_after_hwframe+0x44/0xa9
       RIP: 0033:0x7fde28c3be57
       Code: 0c 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 89 54 24 1c 48 89 74 24 10
       RSP: 002b:00007ffc09330028 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
       RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fde28c3be57
       RDX: 0000000000000000 RSI: 00007ffc09330090 RDI: 0000000000000003
       RBP: 000000005f45f911 R08: 0000000000000001 R09: 00007ffc0933012c
       R10: 0000000000000076 R11: 0000000000000246 R12: 0000000000000001
       R13: 00007ffc09330290 R14: 00007ffc09330eee R15: 00005610e48ed020
      
      Fixes: 1bff1a0c ("ipv4: Add function to send route updates")
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 30 Jul, 2020 · 1 commit
    • ipv4: Silence suspicious RCU usage warning · 83f35228
      Committed by Ido Schimmel
      fib_trie_unmerge() is called with RTNL held, but not from an RCU
      read-side critical section. This leads to the following warning [1] when
      the FIB alias list in a leaf is traversed with
      hlist_for_each_entry_rcu().
      
      Since the function is always called with RTNL held and since
      modification of the list is protected by RTNL, simply use
      hlist_for_each_entry() and silence the warning.
      
      [1]
      WARNING: suspicious RCU usage
      5.8.0-rc4-custom-01520-gc1f937f3f83b #30 Not tainted
      -----------------------------
      net/ipv4/fib_trie.c:1867 RCU-list traversed in non-reader section!!
      
      other info that might help us debug this:
      
      rcu_scheduler_active = 2, debug_locks = 1
      1 lock held by ip/164:
       #0: ffffffff85a27850 (rtnl_mutex){+.+.}-{3:3}, at: rtnetlink_rcv_msg+0x49a/0xbd0
      
      stack backtrace:
      CPU: 0 PID: 164 Comm: ip Not tainted 5.8.0-rc4-custom-01520-gc1f937f3f83b #30
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-2.fc32 04/01/2014
      Call Trace:
       dump_stack+0x100/0x184
       lockdep_rcu_suspicious+0x153/0x15d
       fib_trie_unmerge+0x608/0xdb0
       fib_unmerge+0x44/0x360
       fib4_rule_configure+0xc8/0xad0
       fib_nl_newrule+0x37a/0x1dd0
       rtnetlink_rcv_msg+0x4f7/0xbd0
       netlink_rcv_skb+0x17a/0x480
       rtnetlink_rcv+0x22/0x30
       netlink_unicast+0x5ae/0x890
       netlink_sendmsg+0x98a/0xf40
       ____sys_sendmsg+0x879/0xa00
       ___sys_sendmsg+0x122/0x190
       __sys_sendmsg+0x103/0x1d0
       __x64_sys_sendmsg+0x7d/0xb0
       do_syscall_64+0x54/0xa0
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      RIP: 0033:0x7fc80a234e97
      Code: Bad RIP value.
      RSP: 002b:00007ffef8b66798 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
      RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fc80a234e97
      RDX: 0000000000000000 RSI: 00007ffef8b66800 RDI: 0000000000000003
      RBP: 000000005f141b1c R08: 0000000000000001 R09: 0000000000000000
      R10: 00007fc80a2a8ac0 R11: 0000000000000246 R12: 0000000000000001
      R13: 0000000000000000 R14: 00007ffef8b67008 R15: 0000556fccb10020
      
      Fixes: 0ddcf43d ("ipv4: FIB Local/MAIN table collapse")
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Reviewed-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 07 Jul, 2020 · 1 commit
  6. 27 May, 2020 · 1 commit
    • ipv4: Refactor nhc evaluation in fib_table_lookup · af7888ad
      Committed by David Ahern
      FIB lookups can return an entry that references an external nexthop.
      While walking the nexthop struct we do not want to make multiple calls
      into the nexthop code, which can result in two different structs being
      accessed: one call returns the number of paths while the rest of the
      loop sees a different nh_grp struct. If the nexthop group shrank, the
      result is an attempt to access a fib_nh_common that does not exist for
      the new nh_grp struct but did for the old one.

      To fix that, move the device evaluation code to a helper that can be
      used for the inline fib_nh path as well as external nexthops.
      
      Update the existing check for fi->nh in fib_table_lookup to call a
      new helper, nexthop_get_nhc_lookup, which walks the external nexthop
      with a single rcu dereference.
      
      Fixes: 430a0491 ("nexthop: Add support for nexthop groups")
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
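The single-dereference idea in the commit above can be sketched in userspace C. This is a minimal model, not the kernel code: `nh_grp_model`, `nexthop_model` and `nh_find_path` are hypothetical names, and a C11 atomic load stands in for the kernel's RCU dereference. The point it illustrates is that the loop bound and the array being indexed come from one snapshot of the group pointer.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins for the kernel's nh_grp and nexthop. */
struct nh_grp_model {
    int num_nh;        /* number of paths in this group */
    const int *oif;    /* one output interface index per path */
};

struct nexthop_model {
    _Atomic(struct nh_grp_model *) grp;  /* a writer may swap this pointer */
};

/*
 * Return the index of the first path using 'oif', or -1 if none does.
 * The group pointer is loaded exactly once, so grp->num_nh and grp->oif
 * always describe the same group even if a writer replaces it meanwhile.
 */
static int nh_find_path(struct nexthop_model *nh, int oif)
{
    struct nh_grp_model *grp =
        atomic_load_explicit(&nh->grp, memory_order_acquire);

    for (int i = 0; i < grp->num_nh; i++) {
        if (grp->oif[i] == oif)
            return i;
    }
    return -1;
}
```

Loading the pointer a second time inside the loop would reintroduce the problem the commit describes: the count could come from a larger group than the array being read.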
  7. 30 Mar, 2020 · 2 commits
    • net: add net available in build_state · faee6769
      Committed by Alexander Aring
      The build_state callback of lwtunnel doesn't contain the net namespace
      structure yet. This patch will add it so we can check on specific
      address configuration at creation time of rpl source routes.
      Signed-off-by: Alexander Aring <alex.aring@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv4: fix a RCU-list lock in fib_triestat_seq_show · fbe4e0c1
      Committed by Qian Cai
      Calling hlist_for_each_entry_rcu(tb, head, tb_hlist) in
      fib_triestat_seq_show() without holding rcu_read_lock() triggers a
      warning:
      
       net/ipv4/fib_trie.c:2579 RCU-list traversed in non-reader section!!
      
       other info that might help us debug this:
      
       rcu_scheduler_active = 2, debug_locks = 1
       1 lock held by proc01/115277:
        #0: c0000014507acf00 (&p->lock){+.+.}-{3:3}, at: seq_read+0x58/0x670
      
       Call Trace:
        dump_stack+0xf4/0x164 (unreliable)
        lockdep_rcu_suspicious+0x140/0x164
        fib_triestat_seq_show+0x750/0x880
        seq_read+0x1a0/0x670
        proc_reg_read+0x10c/0x1b0
        __vfs_read+0x3c/0x70
        vfs_read+0xac/0x170
        ksys_read+0x7c/0x140
        system_call+0x5c/0x68
      
      Fix it by adding a pair of rcu_read_lock/unlock() and use
      cond_resched_rcu() to avoid the situation where walking a large
      number of items may prevent scheduling for a long time.
      Signed-off-by: Qian Cai <cai@lca.pw>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  8. 21 Feb, 2020 · 1 commit
  9. 15 Jan, 2020 · 3 commits
    • ipv4: Add "offload" and "trap" indications to routes · 90b93f1b
      Committed by Ido Schimmel
      When performing L3 offload, routes and nexthops are usually programmed
      into two different tables in the underlying device. Therefore, the fact
      that a nexthop resides in hardware does not necessarily mean that all
      the associated routes also reside in hardware and vice-versa.
      
      While the kernel can signal to user space the presence of a nexthop in
      hardware (via 'RTNH_F_OFFLOAD'), it does not have a corresponding flag
      for routes. In addition, the fact that a route resides in hardware does
      not necessarily mean that the traffic is offloaded. For example,
      unreachable routes (i.e., 'RTN_UNREACHABLE') are programmed to trap
      packets to the CPU so that the kernel will be able to generate the
      appropriate ICMP error packet.
      
      This patch adds "offload" and "trap" indications to IPv4 routes, so
      that users will have better visibility into the offload process.
      
      'struct fib_alias' is extended with two new fields that indicate if the
      route resides in hardware or not and if it is offloading traffic from
      the kernel or trapping packets to it. Note that the new fields are added
      in the 6 bytes hole and therefore the struct still fits in a single
      cache line [1].
      
      Capable drivers are expected to invoke fib_alias_hw_flags_set() with the
      route's key in order to set the flags.
      
      The indications are dumped to user space via new flags (i.e.,
      'RTM_F_OFFLOAD' and 'RTM_F_TRAP') in the 'rtm_flags' field in the
      ancillary header.
      
      v2:
      * Make use of 'struct fib_rt_info' in fib_alias_hw_flags_set()
      
      [1]
      struct fib_alias {
              struct hlist_node  fa_list;                      /*     0    16 */
              struct fib_info *          fa_info;              /*    16     8 */
              u8                         fa_tos;               /*    24     1 */
              u8                         fa_type;              /*    25     1 */
              u8                         fa_state;             /*    26     1 */
              u8                         fa_slen;              /*    27     1 */
              u32                        tb_id;                /*    28     4 */
              s16                        fa_default;           /*    32     2 */
              u8                         offload:1;            /*    34: 0  1 */
              u8                         trap:1;               /*    34: 1  1 */
              u8                         unused:6;             /*    34: 2  1 */
      
              /* XXX 5 bytes hole, try to pack */
      
              struct callback_head rcu __attribute__((__aligned__(8))); /*    40    16 */
      
              /* size: 56, cachelines: 1, members: 12 */
              /* sum members: 50, holes: 1, sum holes: 5 */
              /* sum bitfield members: 8 bits (1 bytes) */
              /* forced alignments: 1, forced holes: 1, sum forced holes: 5 */
              /* last cacheline: 56 bytes */
      } __attribute__((__aligned__(8)));
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Reviewed-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
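As a rough illustration of how a user space consumer might read the two indications from 'rtm_flags', here is a small sketch. The flag values are written out locally and are believed to match include/uapi/linux/rtnetlink.h, but treat them as assumptions and prefer the header in real code. The enum and the function `route_hw_state` are hypothetical names, not part of any kernel or iproute2 API.

```c
/* Assumed values; in real code include <linux/rtnetlink.h> instead. */
#define EX_RTM_F_OFFLOAD 0x4000u  /* route is offloaded by hardware */
#define EX_RTM_F_TRAP    0x8000u  /* route traps packets to the CPU */

enum route_hw_state_model {
    ROUTE_SOFTWARE,   /* neither flag set: not reported as in hardware */
    ROUTE_OFFLOAD,
    ROUTE_TRAP,
};

/* Classify a route from the rtm_flags field of its netlink header. */
static enum route_hw_state_model route_hw_state(unsigned int rtm_flags)
{
    if (rtm_flags & EX_RTM_F_OFFLOAD)
        return ROUTE_OFFLOAD;
    if (rtm_flags & EX_RTM_F_TRAP)
        return ROUTE_TRAP;
    return ROUTE_SOFTWARE;
}
```

Keeping the two flags separate mirrors the commit's point that a route being present in hardware does not by itself mean its traffic is forwarded there.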
    • ipv4: Encapsulate function arguments in a struct · 1e301fd0
      Committed by Ido Schimmel
      fib_dump_info() is used to prepare RTM_{NEW,DEL}ROUTE netlink messages
      using the passed arguments. Currently, the function takes 11 arguments,
      6 of which are attributes of the route being dumped (e.g., prefix, TOS).
      
      The next patch will need the function to also dump to user space an
      indication if the route is present in hardware or not. Instead of
      passing yet another argument, change the function to take a struct
      containing the different route attributes.
      
      v2:
      * Name last argument of fib_dump_info()
      * Move 'struct fib_rt_info' to include/net/ip_fib.h so that it could
        later be passed to fib_alias_hw_flags_set()
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Reviewed-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv4: Replace route in list before notifying · 6324d0fa
      Committed by Ido Schimmel
      Subsequent patches will add an offload / trap indication to routes which
      will signal if the route is present in hardware or not.
      
      After programming the route to the hardware, drivers will have to ask
      the IPv4 code to set the flags by passing the route's key.
      
      In the case of route replace, the new route is notified before it is
      actually inserted into the FIB alias list. This can prevent simple
      drivers (e.g., netdevsim) that program the route to the hardware in the
      same context it is notified in from being able to set the flag.
      
      Solve this by first inserting the new route into the list and rolling
      back the operation if the route is vetoed.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Reviewed-by: Jiri Pirko <jiri@mellanox.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
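The insert-then-notify ordering with rollback can be modelled in a few lines of userspace C. Everything here is a hypothetical stand-in: a single `alias_slot` replaces the FIB alias list, and `route_notifier` replaces the FIB notification chain. The sketch only shows the ordering the commit describes, under those assumptions.

```c
#include <stddef.h>

struct route_model { int id; };

/* One-entry stand-in for the FIB alias list. */
struct alias_slot {
    const struct route_model *cur;
};

/* A listener returns 0 to accept the new route or a negative value to veto. */
typedef int (*route_notifier)(const struct alias_slot *slot,
                              const struct route_model *new_rt);

/* Insert first, notify second, and restore the old route on a veto. */
static int replace_route(struct alias_slot *slot,
                         const struct route_model *new_rt,
                         route_notifier notify)
{
    const struct route_model *old = slot->cur;
    int err;

    slot->cur = new_rt;          /* listeners can now look the route up */
    err = notify(slot, new_rt);
    if (err) {
        slot->cur = old;         /* roll back the replacement */
        return err;
    }
    return 0;
}

/* Models a simple driver that must find the route during notification. */
static int driver_needs_lookup(const struct alias_slot *slot,
                               const struct route_model *new_rt)
{
    return slot->cur == new_rt ? 0 : -1;
}

/* Models a listener that vetoes every replacement. */
static int driver_veto(const struct alias_slot *slot,
                       const struct route_model *new_rt)
{
    (void)slot;
    (void)new_rt;
    return -1;
}
```

With the old ordering (notify before insert), `driver_needs_lookup` would fail because the slot would still hold the previous route when the listener ran.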
  10. 11 Jan, 2020 · 1 commit
    • ipv4: Detect rollover in specific fib table dump · 9827c063
      Committed by David Ahern
      Sven-Haegar reported looping on fib dumps when 255.255.255.255 route has
      been added to a table. The looping is caused by the key rolling over from
      FFFFFFFF to 0. When dumping a specific table only, we need a means to detect
      when the table dump is done. The key and count saved to cb args are both 0
      only at the start of the table dump. If key is 0 and count > 0, then we are
      in the rollover case. Detect and return to avoid looping.
      
      This only affects dumps of a specific table; for dumps of all tables
      (the case prior to the change in the Fixes tag) inet_dump_fib moved
      the entry counter to the next table and reset the cb args used by
      fib_table_dump and fn_trie_dump_leaf, so the rollover ffffffff back
      to 0 did not cause looping with the dumps.
      
      Fixes: effe6792 ("net: Enable kernel side filtering of route dumps")
      Reported-by: Sven-Haegar Koch <haegar@sdinet.de>
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
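The rollover condition described above is small enough to model directly. This sketch assumes a hypothetical `dump_state` that mirrors the key and count saved in the callback args; it is not the kernel's data structure, only the arithmetic.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical resume state, modelled on the key/count kept in cb args. */
struct dump_state {
    uint32_t key;        /* next key to resume the dump from */
    unsigned int count;  /* number of leaves dumped so far */
};

/* Record that the leaf at 'leaf_key' was dumped and move past it. */
static void dump_advance(struct dump_state *st, uint32_t leaf_key)
{
    st->count++;
    st->key = leaf_key + 1;  /* wraps to 0 after 0xFFFFFFFF */
}

/*
 * key == 0 with count == 0 is the start of a dump. key == 0 with
 * count > 0 can only mean the key wrapped, so the table is finished.
 */
static bool dump_rolled_over(const struct dump_state *st)
{
    return st->key == 0 && st->count > 0;
}
```

Without the check, a dump that resumes with key 0 after dumping 255.255.255.255 is indistinguishable from a fresh dump and starts over, which is the loop that was reported.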
  11. 17 Dec, 2019 · 8 commits
  12. 05 Oct, 2019 · 3 commits
  13. 25 Aug, 2019 · 1 commit
    • net: route dump netlink NLM_F_MULTI flag missing · e93fb3e9
      Committed by John Fastabend
      An excerpt from netlink(7) man page,
      
        In multipart messages (multiple nlmsghdr headers with associated payload
        in one byte stream) the first and all following headers have the
        NLM_F_MULTI flag set, except for the last header which has the type
        NLMSG_DONE.

      but, after (ee28906f) there is a missing NLM_F_MULTI flag in the middle of a
      FIB dump. As a result, user space applications that follow the above man
      page excerpt may get confused and stop parsing the message, believing
      something went wrong.
      
      In the golang netlink lib [0] the library logic stops parsing believing the
      message is not a multipart message. Found this running Cilium[1] against
      net-next while adding a feature to auto-detect routes. I noticed with
      multiple route tables we no longer could detect the default routes on net
      tree kernels because the library logic was not returning them.
      
      Fix this by handling the fib_dump_info_fnhe() case the same way the
      fib_dump_info() handles it by passing the flags argument through the
      call chain and adding a flags argument to rt_fill_info().
      
      Tested with Cilium stack and auto-detection of routes works again. Also
      annotated libs to dump netlink msgs and inspected NLM_F_MULTI and
      NLMSG_DONE flags look correct after this.
      
      Note: inet_rtm_getroute() passes '0' as the flags argument of
      rt_fill_info(), the same as is done for fib_dump_info(), so this looks
      correct to me.
      
      [0] https://github.com/vishvananda/netlink/
      [1] https://github.com/cilium/
      
      Fixes: ee28906f ("ipv4: Dump route exceptions if requested")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
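To see why one missing flag hides later routes, consider a reader that follows the netlink(7) wording strictly. The sketch below is a simplified model, not the golang library's code: `nl_hdr_model` keeps only the type and flags of a header, and the two constants are written out locally with the values I believe linux/netlink.h uses (treat them as assumptions).

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed values; real code should take them from <linux/netlink.h>. */
#define EX_NLM_F_MULTI 0x2
#define EX_NLMSG_DONE  0x3

/* Only the two header fields that matter for this illustration. */
struct nl_hdr_model {
    uint16_t type;
    uint16_t flags;
};

/*
 * Count the messages a strict reader accepts. A header without
 * NLM_F_MULTI is treated as a complete single-part reply, so the
 * reader stops after it and never sees the remaining messages.
 */
static size_t parse_multipart(const struct nl_hdr_model *msgs, size_t n)
{
    size_t accepted = 0;

    for (size_t i = 0; i < n; i++) {
        if (msgs[i].type == EX_NLMSG_DONE)
            break;
        accepted++;
        if (!(msgs[i].flags & EX_NLM_F_MULTI))
            break;
    }
    return accepted;
}
```

Feeding it a dump whose second message lacks the flag returns 2 regardless of how many routes follow, which matches the symptom of default routes going missing.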
  14. 03 Jul, 2019 · 1 commit
    • ipv4: Fix off-by-one in route dump counter without netlink strict checking · 885b8b4d
      Committed by Stefano Brivio
      In commit ee28906f ("ipv4: Dump route exceptions if requested") I
      added a counter of per-node dumped routes (including actual routes and
      exceptions), analogous to the existing counter for dumped nodes. Dumping
      exceptions means we need to also keep track of how many routes are dumped
      for each node: this would be just one route per node, without exceptions.
      
      When netlink strict checking is not enabled, we dump both routes and
      exceptions at the same time: the RTM_F_CLONED flag is not used as a
      filter. In this case, the per-node counter 'i_fa' is incremented by one
      to track the single dumped route, then also incremented by one for each
      exception dumped, and then stored as netlink callback argument as skip
      counter, 's_fa', to be used when a partial dump operation restarts.
      
      The per-node counter needs to be increased by one also when we skip a
      route (exception) due to a previous non-zero skip counter, because it
      needs to match the existing skip counter, if we are dumping both routes
      and exceptions. I missed this, and only incremented the counter, for
      regular routes, if the previous skip counter was zero. This means that,
      in case of a mixed dump, partial dump operations after the first one
      will start with a mismatching skip counter value, one less than expected.
      
      This means in turn that the first exception for a given node is skipped
      every time a partial dump operation restarts, if netlink strict checking
      is not enabled (iproute2 < 5.0).
      
      It turns out I didn't repeat the test in its final version, commit
      de755a85 ("selftests: pmtu: Introduce list_flush_ipv4_exception test
      case"), which also counts the number of route exceptions returned, with
      iproute2 versions < 5.0 -- I was instead using the equivalent of the IPv6
      test as it was before commit b964641e ("selftests: pmtu: Make
      list_flush_ipv6_exception test more demanding").
      
      Always increment the per-node counter by one if we previously dumped
      a regular route, so that it matches the current skip counter.
      
      Fixes: ee28906f ("ipv4: Dump route exceptions if requested")
      Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
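The off-by-one is easier to follow with a small simulation. The model below is a sketch under simplifying assumptions: one node holds one regular route followed by `n_exc` exceptions, each partial dump can emit at most `budget` entries (at least 1), and `dump_node` is a hypothetical function that returns how many entries were emitted in total across all restarts. The `fixed` argument selects between the behaviour before and after this commit.

```c
#include <stdbool.h>

/*
 * Simulate a mixed dump (route plus exceptions) of a single node.
 * s_fa is the skip counter saved between partial dumps, i_fa the
 * per-node counter rebuilt on every restart. 'budget' must be >= 1.
 */
static int dump_node(int n_exc, int budget, bool fixed)
{
    int s_fa = 0;
    int total = 0;

    for (;;) {
        int i_fa = 0;
        int room = budget;
        bool done = true;

        if (s_fa == 0) {
            total++;         /* dump the regular route */
            room--;
            i_fa++;
        } else if (fixed) {
            i_fa++;          /* count the route even when it is skipped */
        }

        for (int e = 0; e < n_exc; e++) {
            if (i_fa < s_fa) {   /* already dumped in an earlier pass */
                i_fa++;
                continue;
            }
            if (room == 0) {     /* buffer full: stop and restart later */
                done = false;
                break;
            }
            total++;
            room--;
            i_fa++;
        }

        if (done)
            return total;
        s_fa = i_fa;
    }
}
```

With 5 exceptions and a budget of 2, the fixed variant emits all 6 entries, while the old behaviour emits 5 because one exception is skipped on the first restart. When the whole node fits in one pass there is no restart and both variants agree.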
  15. 25 Jun, 2019 · 1 commit
    • ipv4: Dump route exceptions if requested · ee28906f
      Committed by Stefano Brivio
      Since commit 4895c771 ("ipv4: Add FIB nexthop exceptions."), cached
      exception routes are stored as a separate entity, so they are not dumped
      on a FIB dump, even if the RTM_F_CLONED flag is passed.
      
      This implies that the command 'ip route list cache' doesn't return any
      result anymore.
      
      If the RTM_F_CLONED is passed, and strict checking requested, retrieve
      nexthop exception routes and dump them. If no strict checking is
      requested, filtering can't be performed consistently: dump everything in
      that case.
      
      With this, we need to add an argument to the netlink callback in order to
      track how many entries were already dumped for the last leaf included in
      a partial netlink dump.
      
      A single additional argument is sufficient, even if we traverse logically
      nested structures (nexthop objects, hash table buckets, bucket chains): it
      doesn't matter if we stop in the middle of any of those, because they are
      always traversed the same way. As an example, s_i values in [], s_fa
      values in ():
      
        node (fa) #1 [1]
          nexthop #1
          bucket #1 -> #0 in chain (1)
          bucket #2 -> #0 in chain (2) -> #1 in chain (3) -> #2 in chain (4)
          bucket #3 -> #0 in chain (5) -> #1 in chain (6)
      
          nexthop #2
          bucket #1 -> #0 in chain (7) -> #1 in chain (8)
          bucket #2 -> #0 in chain (9)
        --
        node (fa) #2 [2]
          nexthop #1
          bucket #1 -> #0 in chain (1) -> #1 in chain (2)
          bucket #2 -> #0 in chain (3)
      
      it doesn't matter if we stop at (3), (4), (7) for "node #1", or at (2)
      for "node #2": walking flattens all that.
      
      It would even be possible to drop the distinction between the in-tree
      (s_i) and in-node (s_fa) counter, but a further improvement might
      advise against this. This is only as accurate as the existing tracking
      mechanism for leaves: if a partial dump is restarted after exceptions
      are removed or expired, we might skip some non-dumped entries.
      
      To improve this, we could attach a 'sernum' attribute (similar to the
      one used for IPv6) to nexthop entities, and bump this counter whenever
      exceptions change: having a distinction between the two counters would
      make this more convenient.
      
      Listing of exception routes (modified routes pre-3.5) was tested against
      these versions of kernel and iproute2:
      
                          iproute2
      kernel         4.14.0   4.15.0   4.19.0   5.0.0   5.1.0
       3.5-rc4         +        +        +        +       +
       4.4
       4.9
       4.14
       4.15
       4.19
       5.0
       5.1
       fixed           +        +        +        +       +
      
      v7:
         - Move loop over nexthop objects to route.c, and pass struct fib_info
           and table ID to it, not a struct fib_alias (suggested by David Ahern)
         - While at it, note that the NULL check on fa->fa_info is redundant,
           and the check on RTNH_F_DEAD is also not consistent with what's done
           with regular route listing: just keep it for nhc_flags
         - Rename entry point function for dumping exceptions to
           fib_dump_info_fnhe(), and rearrange arguments for consistency with
           fib_dump_info()
         - Rename fnhe_dump_buckets() to fnhe_dump_bucket() and make it handle
           one bucket at a time
         - Expand commit message to describe why we can have a single "skip"
           counter for all exceptions stored in bucket chains in nexthop objects
           (suggested by David Ahern)
      
      v6:
         - Rebased onto net-next
         - Loop over nexthop paths too. Move loop over fnhe buckets to route.c,
           avoids need to export rt_fill_info() and to touch exceptions from
           fib_trie.c. Pass NULL as flow to rt_fill_info(), it now allows that
           (suggested by David Ahern)
      
      Fixes: 4895c771 ("ipv4: Add FIB nexthop exceptions.")
      Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  16. 20 Jun, 2019 · 1 commit
  17. 05 Jun, 2019 · 3 commits
    • ipv4: Plumb support for nexthop object in a fib_info · 4c7e8084
      Committed by David Ahern
      Add 'struct nexthop' and nh_list list_head to fib_info. nh_list is the
      fib_info side of the nexthop <-> fib_info relationship.
      
      Add fi_list list_head to 'struct nexthop' to track fib_info entries
      using a nexthop instance. Add __remove_nexthop_fib and call it from
      __remove_nexthop to walk the new list_head and mark those fib entries
      as dead when the nexthop is deleted.
      
      Add a few nexthop helpers for use when a nexthop is added to fib_info:
      - nexthop_cmp to determine if 2 nexthops are the same
      - nexthop_path_fib_result to select a path for a multipath
        'struct nexthop'
      - nexthop_fib_nhc to select a specific fib_nh_common within a
        multipath 'struct nexthop'
      
      Update existing fib_info_nhc to use nexthop_fib_nhc if a fib_info uses
      a 'struct nexthop', and mark fib_info_nh as only used for the non-nexthop
      case.
      
      Update the fib_info functions to check for fi->nh and take a different
      path as needed:
      - free_fib_info_rcu - put the nexthop object reference
      - fib_release_info - remove the fib_info from the nexthop's fi_list
      - nh_comp - use nexthop_cmp when either fib_info references a nexthop
        object
      - fib_info_hashfn - use the nexthop id for the hashing vs the oif of
        each fib_nh in a fib_info
      - fib_nlmsg_size - add space for the RTA_NH_ID attribute
      - fib_create_info - verify nexthop reference can be taken, verify
        nexthop spec is valid for fib entry, and add fib_info to fi_list for
        a nexthop
      - fib_select_multipath - use the new nexthop_path_fib_result to select a
        path when nexthop objects are used
      - fib_table_lookup - if the 'struct nexthop' is a blackhole nexthop, treat
        it the same as a fib entry using 'blackhole'
      
      The bulk of the changes are in fib_semantics.c and most of that is
      moving the existing change_nexthops into an else branch.
      
      Update the nexthop code to walk fi_list when a nexthop is deleted to
      remove fib entries referencing it.
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv4: Prepare for fib6_nh from a nexthop object · dcb1ecb5
      Committed by David Ahern
      Convert more IPv4 code to use fib_nh_common over fib_nh to enable routes
      to use a fib6_nh based nexthop. In the end, only code not using a
      nexthop object in a fib_info should directly access fib_nh in a fib_info
      without checking the family and going through fib_nh_common. Those
      functions will be marked when it is not directly evident.
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv4: Use accessors for fib_info nexthop data · 5481d73f
      Committed by David Ahern
      Use helpers to access fib_nh and fib_nhs fields of a fib_info. Drop the
      fib_dev macro which is an alias for the first nexthop. Replacements:
      
        fi->fib_dev    --> fib_info_nh(fi, 0)->fib_nh_dev
        fi->fib_nh     --> fib_info_nh(fi, 0)
        fi->fib_nh[i]  --> fib_info_nh(fi, i)
        fi->fib_nhs    --> fib_info_num_path(fi)
      
      where fib_info_nh(fi, i) returns fi->fib_nh[i] and fib_info_num_path
      returns fi->fib_nhs.

      Move the existing fib_info_nhc to nexthop.h and define the new ones
      there. A later patch adds a check if a fib_info uses a nexthop object,
      and defining the helpers in nexthop.h avoids circular header
      dependencies.
      
      After this all remaining open coded references to fi->fib_nhs and
      fi->fib_nh are in:
      - fib_create_info and helpers used to lookup an existing fib_info
        entry, and
      - the netdev event functions fib_sync_down_dev and fib_sync_up.
      
      The latter two will not be reused for nexthops, and the fib_create_info
      will be updated to handle a nexthop in a fib_info.
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
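The accessor pattern from the commit above can be sketched with simplified userspace types. The struct layouts and the `ex_` prefixed helpers are hypothetical models that only mirror the shape of fib_info_nh() and fib_info_num_path(); they are not the kernel definitions.

```c
/* Simplified stand-ins for the kernel structures. */
struct fib_nh_model {
    int oif;                         /* output interface index */
};

struct fib_info_model {
    int fib_nhs;                     /* number of nexthops in use */
    struct fib_nh_model fib_nh[4];   /* fixed size only for this sketch */
};

/* Model of fib_info_nh(): the nexthop selected by nhsel. */
static inline struct fib_nh_model *
ex_fib_info_nh(struct fib_info_model *fi, int nhsel)
{
    return &fi->fib_nh[nhsel];
}

/* Model of fib_info_num_path(): the number of paths. */
static inline int ex_fib_info_num_path(const struct fib_info_model *fi)
{
    return fi->fib_nhs;
}

/* A caller written only against the accessors, never the raw fields. */
static int count_paths_on_dev(struct fib_info_model *fi, int oif)
{
    int n = 0;

    for (int i = 0; i < ex_fib_info_num_path(fi); i++) {
        if (ex_fib_info_nh(fi, i)->oif == oif)
            n++;
    }
    return n;
}
```

Because `count_paths_on_dev` never touches `fib_nhs` or `fib_nh` directly, the helpers could later be changed to consult a separate nexthop object without editing the caller, which is the motivation the commit gives.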
  18. 31 May, 2019 · 1 commit
  19. 27 May, 2019 · 1 commit
  20. 23 May, 2019 · 1 commit
    • ipv4: Add function to send route updates · 1bff1a0c
      Committed by David Ahern
      Add fib_info_notify_update to walk the fib and send RTM_NEWROUTE
      notifications with NLM_F_REPLACE set for entries linked to a fib_info
      that have nh_updated flag set. This helper will be used by the nexthop
      code to notify userspace of routes that are impacted when a nexthop
      config is updated via replace. The new function and its helper are
      similar to how fib_flush and fib_table_flush work for address delete
      and link down events.
      
      This notification is needed for legacy apps that do not understand
      the new nexthop object. Apps that are nexthop aware can use the
      RTA_NH_ID attribute in the route notification to just ignore it.
      
      In the future this should be wrapped in a sysctl to allow OSes that
      are fully updated to avoid the notification storm.
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  21. 04 Apr, 2019 · 2 commits
    • ipv4: Add fib_nh_common to fib_result · eba618ab
      Committed by David Ahern
      Most of the ipv4 code only needs data from fib_nh_common. Add
      fib_nh_common selection to fib_result and update users to use it.
      
      Right now, fib_nh_common in fib_result will point to a fib_nh struct
      that is embedded within a fib_info:
      
              fib_info  --> fib_nh
                            fib_nh
                            ...
                            fib_nh
                              ^
          fib_result->nhc ----+
      
      Later, nhc can point to a fib_nh within a nexthop struct:
      
              fib_info --> nexthop --> fib_nh
                                         ^
          fib_result->nhc ---------------+
      
      or for a nexthop group:
      
              fib_info --> nexthop --> nexthop --> fib_nh
                                       nexthop --> fib_nh
                                       ...
                                       nexthop --> fib_nh
                                                     ^
          fib_result->nhc ---------------------------+
      
      In all cases nhsel within fib_result will point to which leg in the
      multipath route is used.
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv4: Update fib_table_lookup tracepoint to take common nexthop · 0af7e7c1
      Committed by David Ahern
      Update fib_table_lookup tracepoint to take a fib_nh_common struct and
      dump the v6 gateway address if the nexthop uses it.
      
      Over the years saddr has not proven useful and the output of the
      tracepoint produces very long lines. Since saddr is not part of
      fib_nh_common, drop it. If it needs to be added later, fib_nh which
      contains saddr can be obtained from a fib_nh_common via container_of.
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
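The container_of step mentioned at the end of the commit above can be shown in userspace C. The macro `ex_container_of` and both struct models are illustrative assumptions: the real fib_nh has many more fields and a different layout. The sketch only shows how a pointer to the embedded common part leads back to the enclosing struct.

```c
#include <stddef.h>
#include <stdint.h>

/* Userspace equivalent of the kernel's container_of(), without type checks. */
#define ex_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical models; only the embedding relationship matters here. */
struct nhc_model {
    int nhc_oif;
};

struct fib_nh_model {
    int nh_pad;                  /* keeps nh_common at a non-zero offset */
    struct nhc_model nh_common;  /* embedded common part */
    uint32_t nh_saddr;           /* field that exists only in the outer struct */
};

/* Recover the source address from a pointer to the embedded common part. */
static uint32_t saddr_from_nhc(struct nhc_model *nhc)
{
    struct fib_nh_model *nh =
        ex_container_of(nhc, struct fib_nh_model, nh_common);

    return nh->nh_saddr;
}
```

This is only valid when the common part really is embedded in that outer type. Once a fib_nh_common can also belong to an IPv6 nexthop, a caller would have to check the family before applying the conversion.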
  22. 30 Mar, 2019 · 2 commits
  23. 22 Mar, 2019 · 1 commit
    • ipv4: Allow amount of dirty memory from fib resizing to be controllable · 9ab948a9
      Committed by David Ahern
      The fib_trie implementation calls synchronize_rcu when a certain number
      of pages are dirty from freed entries. The number of pages was determined
      experimentally in 2009 (commit c3059477).
      
      At the current setting, synchronize_rcu is called often -- 51 times in a
      second in one test, with an average delay of 8 msec when adding a fib entry.
      The total impact is a significant slowdown when modifying the fib. This is
      seen in the output of 'time' - the difference between real time and sys+user.
      For example, using 720,022 single path routes and 'ip -batch'[1]:
      
          $ time ./ip -batch ipv4/routes-1-hops
          real    0m14.214s
          user    0m2.513s
          sys     0m6.783s
      
      So roughly 35% of the actual time to install the routes is from the ip
      command getting scheduled out, most notably due to synchronize_rcu (this
      is observed using 'perf sched timehist').
      
      This patch makes the amount of dirty memory configurable, from 64kB where
      synchronize_rcu is called often (small, low end systems that are memory
      sensitive) to 64MB where synchronize_rcu is called rarely during a large
      FIB change (for high end systems with lots of memory). The default is 512kB
      which corresponds to the current setting of 128 pages with a 4kB page size.
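The effect of the threshold can be illustrated with a toy model. This is a sketch under assumptions, not the fib_trie code: `struct sync_model`, `model_free_node()` and `model_count_syncs()` are hypothetical names, and each freed node is assumed to dirty a fixed number of bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of "sync once enough freed memory is pending". */
struct sync_model {
	size_t threshold;     /* bytes of pending freed memory tolerated */
	size_t dirty;         /* bytes currently pending */
	unsigned long syncs;  /* how many times the sync would have run */
};

static void model_free_node(struct sync_model *m, size_t bytes)
{
	assert(m != NULL && m->threshold > 0);

	m->dirty += bytes;
	if (m->dirty >= m->threshold) {
		m->dirty = 0;
		m->syncs++;   /* stands in for a blocking synchronize_rcu() */
	}
}

/* Count syncs for a run of equally sized frees at a given threshold. */
static unsigned long model_count_syncs(size_t threshold, size_t node_bytes,
				       unsigned long frees)
{
	struct sync_model m = { .threshold = threshold };
	unsigned long i;

	for (i = 0; i < frees; i++)
		model_free_node(&m, node_bytes);
	return m.syncs;
}
```

With the 512kB default (128 pages of 4kB), the model syncs once per 128 page-sized frees. Raising the threshold to 16MB lets 4096 such frees accumulate before a sync, which is the trade of memory for fewer blocking calls that the commit describes.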
      
      As an example, at 16MB the worst interval shows 4 calls to synchronize_rcu
      in a second blocking for up to 30 msec in a single instance, and a total
      of almost 100 msec across the 4 calls in the second. The trade-off is
      allowing FIB entries to consume more memory in a given time window
      but with much better fib insertion rates (~30% increase in prefixes/sec).
      With this patch and net.ipv4.fib_sync_mem set to 16MB, the same batch
      file runs in:
      
          $ time ./ip -batch ipv4/routes-1-hops
          real    0m9.692s
          user    0m2.491s
          sys     0m6.769s
      
      So the dead time is reduced to about 1/2 second or <5% of the real time.
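The dead-time figures quoted in this message follow from the 'time' output as real minus (user + sys). A small sketch of that arithmetic, with `dead_time_fraction()` as a hypothetical helper name:

```c
#include <assert.h>

/* Fraction of wall-clock time not accounted for by user + sys CPU time. */
static double dead_time_fraction(double real_s, double user_s, double sys_s)
{
	assert(real_s > 0.0);
	return (real_s - (user_s + sys_s)) / real_s;
}
```

For the first run, 14.214 - (2.513 + 6.783) = 4.918 seconds, or about 35% of the real time. For the run with net.ipv4.fib_sync_mem at 16MB, 9.692 - (2.491 + 6.769) = 0.432 seconds, which is under 5%.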
      
      [1] 'ip' modified to not request ACK messages which improves route
          insertion times by about 20%
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  24. 16 Jan, 2019 1 commit
    • I
      net: ipv4: Fix memory leak in network namespace dismantle · f97f4dd8
      Ido Schimmel committed
      IPv4 routing tables are flushed in two cases:
      
      1. In response to events in the netdev and inetaddr notification chains
      2. When a network namespace is being dismantled
      
      In both cases only routes associated with a dead nexthop group are
      flushed. However, a nexthop group will only be marked as dead in case it
      is populated with actual nexthops using a nexthop device. This is not
      the case when the route in question is an error route (e.g.,
      'blackhole', 'unreachable').
      
      Therefore, when a network namespace is being dismantled such routes are
      not flushed and are leaked [1].
      
      To reproduce:
      # ip netns add blue
      # ip -n blue route add unreachable 192.0.2.0/24
      # ip netns del blue
      
      Fix this by not skipping error routes that are not marked with
      RTNH_F_DEAD when flushing the routing tables.
      
      To prevent the flushing of such routes in case #1, add a parameter to
      fib_table_flush() that indicates if the table is flushed as part of
      namespace dismantle or not.
      
      Note that this problem does not exist in IPv6 since error routes are
      associated with the loopback device.
      
      [1]
      unreferenced object 0xffff888066650338 (size 56):
        comm "ip", pid 1206, jiffies 4294786063 (age 26.235s)
        hex dump (first 32 bytes):
          00 00 00 00 00 00 00 00 b0 1c 62 61 80 88 ff ff  ..........ba....
          e8 8b a1 64 80 88 ff ff 00 07 00 08 fe 00 00 00  ...d............
        backtrace:
          [<00000000856ed27d>] inet_rtm_newroute+0x129/0x220
          [<00000000fcdfc00a>] rtnetlink_rcv_msg+0x397/0xa20
          [<00000000cb85801a>] netlink_rcv_skb+0x132/0x380
          [<00000000ebc991d2>] netlink_unicast+0x4c0/0x690
          [<0000000014f62875>] netlink_sendmsg+0x929/0xe10
          [<00000000bac9d967>] sock_sendmsg+0xc8/0x110
          [<00000000223e6485>] ___sys_sendmsg+0x77a/0x8f0
          [<000000002e94f880>] __sys_sendmsg+0xf7/0x250
          [<00000000ccb1fa72>] do_syscall_64+0x14d/0x610
          [<00000000ffbe3dae>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
          [<000000003a8b605b>] 0xffffffffffffffff
      unreferenced object 0xffff888061621c88 (size 48):
        comm "ip", pid 1206, jiffies 4294786063 (age 26.235s)
        hex dump (first 32 bytes):
          6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
          6b 6b 6b 6b 6b 6b 6b 6b d8 8e 26 5f 80 88 ff ff  kkkkkkkk..&_....
        backtrace:
          [<00000000733609e3>] fib_table_insert+0x978/0x1500
          [<00000000856ed27d>] inet_rtm_newroute+0x129/0x220
          [<00000000fcdfc00a>] rtnetlink_rcv_msg+0x397/0xa20
          [<00000000cb85801a>] netlink_rcv_skb+0x132/0x380
          [<00000000ebc991d2>] netlink_unicast+0x4c0/0x690
          [<0000000014f62875>] netlink_sendmsg+0x929/0xe10
          [<00000000bac9d967>] sock_sendmsg+0xc8/0x110
          [<00000000223e6485>] ___sys_sendmsg+0x77a/0x8f0
          [<000000002e94f880>] __sys_sendmsg+0xf7/0x250
          [<00000000ccb1fa72>] do_syscall_64+0x14d/0x610
          [<00000000ffbe3dae>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
          [<000000003a8b605b>] 0xffffffffffffffff
      
      Fixes: 8cced9ef ("[NETNS]: Enable routing configuration in non-initial namespace.")
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>