1. 16 Mar, 2022 1 commit
    • net: Add l3mdev index to flow struct and avoid oif reset for port devices · 40867d74
      Committed by David Ahern
      The fundamental premise of VRF and l3mdev core code is binding a socket
      to a device (l3mdev or netdev with an L3 domain) to indicate L3 scope.
      Legacy code resets flowi_oif to the l3mdev losing any original port
      device binding. Ben (among others) has demonstrated use cases where the
      original port device binding is important and needs to be retained.
      This patch handles that by adding a new entry to the common flow struct
      that can indicate the l3mdev index for later rule and table matching
      avoiding the need to reset flowi_oif.
      
      In addition to allowing more use cases that require port device binds,
      this patch brings a few datapath simplifications:
      
      1. l3mdev_fib_rule_match is only called when walking fib rules and
         always after l3mdev_update_flow. That allows an optimization to bail
         early for non-VRF use cases when flowi_l3mdev is not set. Also,
         only that index needs to be checked for the FIB table id.
      
      2. l3mdev_update_flow can be called with flowi_oif set to an l3mdev
         (e.g., VRF) device. By resetting flowi_oif only for this case, the
         FLOWI_FLAG_SKIP_NH_OIF flag is no longer needed and can be removed,
         removing several checks in the datapath. The flowi_iif path can be
         simplified to only be called if it is not loopback (loopback cannot
         be assigned to an L3 domain) and the l3mdev index is not already
         set.
      
      3. Avoid another device lookup in the output path when the fib lookup
         returns a reject failure.
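
      The flow handling described in points 1 and 2 can be modelled roughly
      as below. This is a simplified Python sketch, not the kernel code: the
      Dev and Flow classes, the ifindex values and table id 1001 are
      illustrative assumptions; only the field names flowi_oif, flowi_iif,
      flowi_l3mdev and the two function names come from the text above.

```python
from dataclasses import dataclass
from typing import Dict, Optional

LOOPBACK_IFINDEX = 1  # assumed index of the loopback device in this model


@dataclass
class Dev:
    ifindex: int
    is_l3_master: bool = False   # True for a VRF-like device
    master: int = 0              # ifindex of the enslaving l3mdev, 0 if none
    table: int = 0               # FIB table id, only meaningful for an l3mdev


@dataclass
class Flow:
    flowi_oif: int = 0
    flowi_iif: int = 0
    flowi_l3mdev: int = 0


def l3mdev_index(dev: Dev) -> int:
    """Return the L3 domain index for a device, or 0 if it has none."""
    return dev.ifindex if dev.is_l3_master else dev.master


def l3mdev_update_flow(devs: Dict[int, Dev], fl: Flow) -> None:
    dev = devs.get(fl.flowi_oif) if fl.flowi_oif else None
    if dev is not None:
        if not fl.flowi_l3mdev:
            fl.flowi_l3mdev = l3mdev_index(dev)
        # Only an oif that is itself the l3mdev is reset; a port device
        # binding is retained.
        if dev.is_l3_master:
            fl.flowi_oif = 0
        return
    # Input path: skipped for loopback and when the index is already set.
    if fl.flowi_iif > LOOPBACK_IFINDEX and not fl.flowi_l3mdev:
        dev = devs.get(fl.flowi_iif)
        if dev is not None:
            fl.flowi_l3mdev = l3mdev_index(dev)


def l3mdev_fib_rule_match(devs: Dict[int, Dev], fl: Flow) -> Optional[int]:
    """Return the FIB table id to use, or None when the rule does not match."""
    if not fl.flowi_l3mdev:      # non-VRF flow: bail early
        return None
    dev = devs.get(fl.flowi_l3mdev)
    return dev.table if dev is not None and dev.is_l3_master else None


devs = {
    1: Dev(ifindex=1),                                   # lo
    3: Dev(ifindex=3, master=10),                        # eth1, enslaved to the VRF
    10: Dev(ifindex=10, is_l3_master=True, table=1001),  # VRF device
}
```

      In this model a flow bound to eth1 keeps flowi_oif == 3 and gains
      flowi_l3mdev == 10, while a flow bound to the VRF device itself has
      flowi_oif reset to 0.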
      
      Note: 2 functional tests for local traffic with reject fib rules are
      updated to reflect the new direct failure at FIB lookup time for ping
      rather than the failure on packet path. The current code fails like this:
      
          HINT: Fails since address on vrf device is out of device scope
          COMMAND: ip netns exec ns-A ping -c1 -w1 -I eth1 172.16.3.1
          ping: Warning: source address might be selected on device other than: eth1
          PING 172.16.3.1 (172.16.3.1) from 172.16.3.1 eth1: 56(84) bytes of data.
      
          --- 172.16.3.1 ping statistics ---
          1 packets transmitted, 0 received, 100% packet loss, time 0ms
      
      where the test now directly fails:
      
          HINT: Fails since address on vrf device is out of device scope
          COMMAND: ip netns exec ns-A ping -c1 -w1 -I eth1 172.16.3.1
          ping: connect: No route to host

      Signed-off-by: David Ahern <dsahern@kernel.org>
      Tested-by: Ben Greear <greearb@candelatech.com>
      Link: https://lore.kernel.org/r/20220314204551.16369-1-dsahern@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  2. 18 Feb, 2022 1 commit
    • ipv4: fix data races in fib_alias_hw_flags_set · 9fcf986c
      Committed by Eric Dumazet
      fib_alias_hw_flags_set() can be used by concurrent threads,
      and is only RCU protected.
      
      We need to annotate accesses to the following fields of struct fib_alias:
      
          offload, trap, offload_failed
      
      Because of READ_ONCE()/WRITE_ONCE() limitations, make these
      fields u8.
      
      BUG: KCSAN: data-race in fib_alias_hw_flags_set / fib_alias_hw_flags_set
      
      read to 0xffff888134224a6a of 1 bytes by task 2013 on cpu 1:
       fib_alias_hw_flags_set+0x28a/0x470 net/ipv4/fib_trie.c:1050
       nsim_fib4_rt_hw_flags_set drivers/net/netdevsim/fib.c:350 [inline]
       nsim_fib4_rt_add drivers/net/netdevsim/fib.c:367 [inline]
       nsim_fib4_rt_insert drivers/net/netdevsim/fib.c:429 [inline]
       nsim_fib4_event drivers/net/netdevsim/fib.c:461 [inline]
       nsim_fib_event drivers/net/netdevsim/fib.c:881 [inline]
       nsim_fib_event_work+0x1852/0x2cf0 drivers/net/netdevsim/fib.c:1477
       process_one_work+0x3f6/0x960 kernel/workqueue.c:2307
       process_scheduled_works kernel/workqueue.c:2370 [inline]
       worker_thread+0x7df/0xa70 kernel/workqueue.c:2456
       kthread+0x1bf/0x1e0 kernel/kthread.c:377
       ret_from_fork+0x1f/0x30
      
      write to 0xffff888134224a6a of 1 bytes by task 4872 on cpu 0:
       fib_alias_hw_flags_set+0x2d5/0x470 net/ipv4/fib_trie.c:1054
       nsim_fib4_rt_hw_flags_set drivers/net/netdevsim/fib.c:350 [inline]
       nsim_fib4_rt_add drivers/net/netdevsim/fib.c:367 [inline]
       nsim_fib4_rt_insert drivers/net/netdevsim/fib.c:429 [inline]
       nsim_fib4_event drivers/net/netdevsim/fib.c:461 [inline]
       nsim_fib_event drivers/net/netdevsim/fib.c:881 [inline]
       nsim_fib_event_work+0x1852/0x2cf0 drivers/net/netdevsim/fib.c:1477
       process_one_work+0x3f6/0x960 kernel/workqueue.c:2307
       process_scheduled_works kernel/workqueue.c:2370 [inline]
       worker_thread+0x7df/0xa70 kernel/workqueue.c:2456
       kthread+0x1bf/0x1e0 kernel/kthread.c:377
       ret_from_fork+0x1f/0x30
      
      value changed: 0x00 -> 0x02
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 0 PID: 4872 Comm: kworker/0:0 Not tainted 5.17.0-rc3-syzkaller-00188-g1d41d2e8-dirty #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Workqueue: events nsim_fib_event_work
      
      Fixes: 90b93f1b ("ipv4: Add "offload" and "trap" indications to routes")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Reviewed-by: Ido Schimmel <idosch@nvidia.com>
      Link: https://lore.kernel.org/r/20220216173217.3792411-1-eric.dumazet@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  3. 11 Feb, 2022 1 commit
  4. 08 Feb, 2022 1 commit
  5. 31 Jan, 2022 1 commit
  6. 27 Jan, 2022 1 commit
  7. 04 Jan, 2022 2 commits
  8. 07 Dec, 2021 1 commit
  9. 17 Nov, 2021 1 commit
  10. 20 Sep, 2021 1 commit
  11. 31 Aug, 2021 1 commit
  12. 30 Aug, 2021 1 commit
    • ipv4: make exception cache less predictible · 67d6d681
      Committed by Eric Dumazet
      Even after commit 6457378f ("ipv4: use siphash instead of Jenkins in
      fnhe_hashfun()"), an attacker can still use brute force to learn
      some secrets from a victim Linux host.
      
      One way to defeat these attacks is to make the max depth of the hash
      table bucket a random value.
      
      Before this patch, each bucket of the hash table used to store exceptions
      could contain 6 items under attack.
      
      After the patch, each bucket will contain a random number of items,
      between 6 and 10. The attacker can no longer infer secrets.
      
      This slightly increases the memory used by the hash table,
      by 50% on average; we do not expect this to be a problem.
      
      This patch is more complex than the prior one (IPv6 equivalent),
      because IPv4 was reusing the oldest entry.
      Since we need to be able to evict more than one entry per
      update_or_create_fnhe() call, I had to replace
      fnhe_oldest() with fnhe_remove_oldest().
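
      The randomized bucket limit and multi-entry eviction described above
      can be sketched as follows. This is a toy Python model under stated
      assumptions, not the kernel implementation: MIN_DEPTH, MAX_DEPTH and
      insert_exception are hypothetical names, and the 6..10 range is taken
      from the description above.

```python
import random
from collections import deque

MIN_DEPTH = 6    # lower bound on the per-update bucket limit (from the text)
MAX_DEPTH = 10   # upper bound on the per-update bucket limit (from the text)


def insert_exception(bucket: deque, entry, rng: random.Random) -> int:
    """Add an entry, then trim the bucket to a freshly drawn random limit.

    Returns the number of entries evicted by this call.
    """
    bucket.appendleft(entry)                   # newest entry at the front
    limit = rng.randint(MIN_DEPTH, MAX_DEPTH)  # new random limit on every update
    evicted = 0
    while len(bucket) > limit:
        bucket.pop()                           # drop the oldest entry
        evicted += 1
    return evicted


rng = random.Random(0)
bucket = deque()
evictions = [insert_exception(bucket, n, rng) for n in range(1000)]
```

      Because the limit is redrawn on each update, one call may have to
      evict several entries, which is why removing only a single oldest
      entry is not enough in this scheme.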
      
      Also note that we will queue extra kfree_rcu() calls under stress,
      which hopefully won't be too big an issue.
      
      Fixes: 4895c771 ("ipv4: Add FIB nexthop exceptions.")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Keyu Man <kman001@ucr.edu>
      Cc: Willy Tarreau <w@1wt.eu>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Tested-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  13. 26 Aug, 2021 1 commit
  14. 05 Aug, 2021 1 commit
  15. 03 Aug, 2021 1 commit
  16. 21 Jul, 2021 1 commit
  17. 29 Jun, 2021 1 commit
  18. 15 Jun, 2021 1 commit
    • ipv4: Fix device used for dst_alloc with local routes · b87b04f5
      Committed by David Ahern
      Oliver reported a use case where deleting a VRF device can hang
      waiting for the refcnt to drop to 0. The root cause is that the dst
      is allocated against the VRF device but cached on the loopback
      device.
      
      The use case (added to the selftests) has an implicit VRF crossing
      due to the ordering of the FIB rules (lookup local is before the
      l3mdev rule, but the problem occurs even if the FIB rules are
      re-ordered with local after l3mdev because the VRF table does not
      have a default route to terminate the lookup). The end result is
      that the FIB lookup returns the loopback device as the nexthop,
      but the ingress device is in a VRF. The mismatch causes the dst
      to be allocated against the VRF device but then cached on the loopback.
      
      The fix is to bring over the trick used for IPv6 (see ip6_rt_get_dev_rcu):
      pick the dst alloc device based on the FIB lookup result, but with checks
      that the result has a nexthop device (e.g., not an unreachable or
      prohibit entry).
      
      Fixes: f5a0aab8 ("net: ipv4: dst for local input routes should use l3mdev if relevant")
      Reported-by: Oliver Herms <oliver.peter.herms@gmail.com>
      Signed-off-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  19. 19 May, 2021 2 commits
  20. 25 Mar, 2021 1 commit
  21. 17 Mar, 2021 1 commit
  22. 13 Mar, 2021 1 commit
  23. 11 Mar, 2021 2 commits
  24. 09 Feb, 2021 1 commit
    • IPv4: Add "offload failed" indication to routes · 36c5100e
      Committed by Amit Cohen
      After installing a route to the kernel, user space receives an
      acknowledgment, which means the route was installed in the kernel, but not
      necessarily in hardware.
      
      The asynchronous nature of route installation in hardware can lead to a
      routing daemon advertising a route before it was actually installed in
      hardware. This can result in packet loss or mis-routed packets until the
      route is installed in hardware.
      
      To avoid such cases, a previous patch set added the ability to emit
      RTM_NEWROUTE notifications whenever the RTM_F_OFFLOAD/RTM_F_TRAP flags
      are changed. This behavior is controlled by sysctl.
      
      With the above mentioned behavior, it is possible to know from user-space
      if the route was offloaded, but if the offload fails there is no indication
      to user-space. Following a failure, a routing daemon will wait indefinitely
      for a notification that will never come.
      
      This patch adds an "offload_failed" indication to IPv4 routes, so that
      users will have better visibility into the offload process.
      
      'struct fib_alias' and 'struct fib_rt_info' are extended with a new field
      that indicates whether route offload failed. Note that the new field uses
      a previously unused bit, so there is no need to increase the structs' size.

      Signed-off-by: Amit Cohen <amcohen@nvidia.com>
      Signed-off-by: Ido Schimmel <idosch@nvidia.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  25. 05 Feb, 2021 1 commit
  26. 04 Feb, 2021 2 commits
  27. 29 Nov, 2020 1 commit
    • ipv4: Fix tos mask in inet_rtm_getroute() · 1ebf1790
      Committed by Guillaume Nault
      When inet_rtm_getroute() was converted to use the RCU variants of
      ip_route_input() and ip_route_output_key(), the TOS parameters
      stopped being masked with IPTOS_RT_MASK before doing the route lookup.
      
      As a result, "ip route get" can return a different route than what
      would be used when sending real packets.
      
      For example:
      
          $ ip route add 192.0.2.11/32 dev eth0
          $ ip route add unreachable 192.0.2.11/32 tos 2
          $ ip route get 192.0.2.11 tos 2
          RTNETLINK answers: No route to host
      
      But, packets with TOS 2 (ECT(0) if interpreted as an ECN bit) would
      actually be routed using the first route:
      
          $ ping -c 1 -Q 2 192.0.2.11
          PING 192.0.2.11 (192.0.2.11) 56(84) bytes of data.
          64 bytes from 192.0.2.11: icmp_seq=1 ttl=64 time=0.173 ms
      
          --- 192.0.2.11 ping statistics ---
          1 packets transmitted, 1 received, 0% packet loss, time 0ms
          rtt min/avg/max/mdev = 0.173/0.173/0.173/0.000 ms
      
      This patch re-applies IPTOS_RT_MASK in inet_rtm_getroute(), to
      return results consistent with real route lookups.
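
      The effect of the missing mask can be illustrated with a toy lookup.
      This is a hedged Python sketch, not kernel code: ROUTES and lookup are
      hypothetical, the matching rule (a route with tos 0 matches any flow,
      more specific tos first) is a simplification, and the mask values are
      those I understand the kernel headers to have used at the time
      (IPTOS_TOS_MASK 0x1E, with the two low bits cleared for IPTOS_RT_MASK).

```python
from typing import Optional

IPTOS_TOS_MASK = 0x1E
IPTOS_RT_MASK = IPTOS_TOS_MASK & ~3   # 0x1C: the ECN bits are ignored for routing

# Toy routing entries for 192.0.2.11/32, most specific tos first,
# mirroring the two "ip route add" commands in the example above.
ROUTES = [
    (2, "unreachable"),
    (0, "dev eth0"),
]


def lookup(tos: int, apply_mask: bool) -> Optional[str]:
    """Return the first route whose tos is 0 (wildcard) or equals the flow tos."""
    key = tos & IPTOS_RT_MASK if apply_mask else tos
    for route_tos, result in ROUTES:
        if route_tos == 0 or route_tos == key:
            return result
    return None


unmasked = lookup(2, apply_mask=False)   # what "ip route get" did before the fix
masked = lookup(2, apply_mask=True)      # what the packet path does
```

      With the mask applied, tos 2 collapses to 0 and the lookup falls
      through to the eth0 route, matching what the ping in the example sees.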
      
      Fixes: 3765d35e ("net: ipv4: Convert inet_rtm_getroute to rcu versions of route lookup")
      Signed-off-by: Guillaume Nault <gnault@redhat.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Link: https://lore.kernel.org/r/b2d237d08317ca55926add9654a48409ac1b8f5b.1606412894.git.gnault@redhat.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  28. 15 Nov, 2020 1 commit
  29. 12 Nov, 2020 1 commit
  30. 11 Oct, 2020 1 commit
  31. 16 Sep, 2020 1 commit
    • ipv4: Update exception handling for multipath routes via same device · 2fbc6e89
      Committed by David Ahern
      Kfir reported that pmtu exceptions are not created properly for
      deployments where multipath routes use the same device.
      
      After some digging I see 2 compounding problems:
      1. ip_route_output_key_hash_rcu is updating the flowi4_oif *after*
         the route lookup. This is the second use case where this has
         been a problem (the first is related to use of vti devices with
         VRF). I cannot find any reason for the oif to be changed after the
         lookup; the code goes back to the start of git. It does not seem
         logical so remove it.
      
      2. fib_lookups for exceptions do not call fib_select_path to handle
         multipath route selection based on the hash.
      
      The end result is that the fib_lookup used to add the exception
      always creates it using the first leg of the route.
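
      The difference between the two lookups can be sketched with a toy
      model. This is an illustrative Python sketch only: flow_hash here is a
      stand-in that uses the last octet of the destination rather than the
      kernel's flow hash, and LEGS, lookup_first_leg, lookup_with_select_path
      and record_exceptions are hypothetical names.

```python
from typing import Callable, Dict, List

# The two nexthops of the multipath route used in the example topology.
LEGS: List[str] = ["192.168.252.250", "192.168.252.252"]


def flow_hash(dst: str) -> int:
    # Stand-in hash (assumption): last octet of the destination address.
    return int(dst.rsplit(".", 1)[1])


def lookup_first_leg(dst: str) -> str:
    """Lookup without path selection: always lands on the first leg."""
    return LEGS[0]


def lookup_with_select_path(dst: str) -> str:
    """Lookup followed by hash-based leg selection."""
    return LEGS[flow_hash(dst) % len(LEGS)]


def record_exceptions(lookup: Callable[[str], str],
                      dsts: List[str]) -> Dict[str, str]:
    """Map each destination to the leg its exception would be attached to."""
    return {dst: lookup(dst) for dst in dsts}


dsts = ["192.168.247.11", "192.168.247.12"]
before = record_exceptions(lookup_first_leg, dsts)
after = record_exceptions(lookup_with_select_path, dsts)
```

      In this model, before holds only the first leg, while after spreads
      the two destinations over both legs.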
      
      An example topology showing the problem:
      
                       |  host1
                   +------+
                   | eth0 |  .209
                   +------+
                       |
                   +------+
           switch  | br0  |
                   +------+
                       |
             +---------+---------+
             | host2             |  host3
         +------+             +------+
         | eth0 | .250        | eth0 | 192.168.252.252
         +------+             +------+
      
         +-----+             +-----+
         | vti | .2          | vti | 192.168.247.3
         +-----+             +-----+
             \                  /
       =================================
       tunnels
               192.168.247.1/24
      
      for h in host1 host2 host3; do
              ip netns add ${h}
              ip -netns ${h} link set lo up
              ip netns exec ${h} sysctl -wq net.ipv4.ip_forward=1
      done
      
      ip netns add switch
      ip -netns switch li set lo up
      ip -netns switch link add br0 type bridge stp 0
      ip -netns switch link set br0 up
      
      for n in 1 2 3; do
              ip -netns switch link add eth-sw type veth peer name eth-h${n}
              ip -netns switch li set eth-h${n} master br0 up
              ip -netns switch li set eth-sw netns host${n} name eth0
      done
      
      ip -netns host1 addr add 192.168.252.209/24 dev eth0
      ip -netns host1 link set dev eth0 up
      ip -netns host1 route add 192.168.247.0/24 \
              nexthop via 192.168.252.250 dev eth0 nexthop via 192.168.252.252 dev eth0
      
      ip -netns host2 addr add 192.168.252.250/24 dev eth0
      ip -netns host2 link set dev eth0 up
      
      ip -netns host3 addr add 192.168.252.252/24 dev eth0
      ip -netns host3 link set dev eth0 up
      
      ip netns add tunnel
      ip -netns tunnel li set lo up
      ip -netns tunnel li add br0 type bridge
      ip -netns tunnel li set br0 up
      for n in $(seq 11 20); do
              ip -netns tunnel addr add dev br0 192.168.247.${n}/24
      done
      
      for n in 2 3
      do
              ip -netns tunnel link add vti${n} type veth peer name eth${n}
              ip -netns tunnel link set eth${n} mtu 1360 master br0 up
              ip -netns tunnel link set vti${n} netns host${n} mtu 1360 up
              ip -netns host${n} addr add dev vti${n} 192.168.247.${n}/24
      done
      ip -netns tunnel ro add default nexthop via 192.168.247.2 nexthop via 192.168.247.3
      
      ip netns exec host1 ping -M do -s 1400 -c3 -I 192.168.252.209 192.168.247.11
      ip netns exec host1 ping -M do -s 1400 -c3 -I 192.168.252.209 192.168.247.15
      ip -netns host1 ro ls cache
      
      Before this patch the cache always shows exceptions against the first
      leg in the multipath route; 192.168.252.250 per this example. Since the
      hash has an initial random seed, you may need to vary the final octet
      more than what is listed. In my tests, using addresses between 11 and 19
      usually found 1 that used both legs.
      
      With this patch, the cache will have exceptions for both legs.
      
      Fixes: 4895c771 ("ipv4: Add FIB nexthop exceptions")
      Reported-by: Kfir Itzhak <mastertheknife@gmail.com>
      Signed-off-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  32. 15 Sep, 2020 1 commit
    • ipv4: Initialize flowi4_multipath_hash in data path · 1869e226
      Committed by David Ahern
      flowi4_multipath_hash was added by the commit referenced below for
      tunnels. Unfortunately, the patch did not initialize the new field
      for several fast path lookups that do not initialize the entire flow
      struct to 0. Fix those locations. Currently, flowi4_multipath_hash
      is random garbage and affects the hash value computed by
      fib_multipath_hash for multipath selection.
      
      Fixes: 24ba1440 ("route: Add multipath_hash in flowi_common to make user-define hash")
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Cc: wenxu <wenxu@ucloud.cn>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  33. 01 Sep, 2020 1 commit
  34. 25 Aug, 2020 2 commits
  35. 05 Aug, 2020 1 commit
    • ipv4: route: Ignore output interface in FIB lookup for PMTU route · df23bb18
      Committed by Stefano Brivio
      Currently, processes sending traffic to a local bridge with an
      encapsulation device as a port don't get ICMP errors if they exceed
      the PMTU of the encapsulated link.
      
      David Ahern suggested this as a hack, but it actually looks like
      the correct solution: when we update the PMTU for a given destination
      by means of updating or creating a route exception, the encapsulation
      might trigger this because of PMTU discovery happening either on the
      encapsulation device itself, or its lower layer. This happens on
      bridged encapsulations only.
      
      The output interface shouldn't matter, because we already have a
      valid destination. Drop the output interface restriction from the
      associated route lookup.
      
      For UDP tunnels, we will now have a route exception created for the
      encapsulation itself, with an MTU value reflecting its headroom, which
      allows a bridge forwarding IP packets originated locally to deliver
      errors back to the sending socket.
      
      The behaviour is now consistent with IPv6 and verified with selftests
      pmtu_ipv{4,6}_br_{geneve,vxlan}{4,6}_exception introduced later in
      this series.
      
      v2:
      - reset output interface only for bridge ports (David Ahern)
      - add and use netif_is_any_bridge_port() helper (David Ahern)

      Suggested-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>