  1. 10 Nov, 2022 · 1 commit
  2. 17 Aug, 2022 · 2 commits
  3. 06 Jul, 2022 · 1 commit
    • ipv4: Fix route lookups when handling ICMP redirects and PMTU updates · bbbaeb3f
      Committed by Guillaume Nault
      stable inclusion
      from stable-v5.10.110
      commit 40f3b8dadae8e8509166e31198065bc8f6144ed2
      bugzilla: https://gitee.com/openeuler/kernel/issues/I574AL
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=40f3b8dadae8e8509166e31198065bc8f6144ed2
      
      --------------------------------
      
      [ Upstream commit 544b4dd5 ]
      
      The PMTU update and ICMP redirect helper functions initialise their fl4
      variable with either __build_flow_key() or build_sk_flow_key(). These
      initialisation functions always set ->flowi4_scope with
      RT_SCOPE_UNIVERSE and might set the ECN bits of ->flowi4_tos. This is
      not a problem when the route lookup is later done via
      ip_route_output_key_hash(), which properly clears the ECN bits from
      ->flowi4_tos and initialises ->flowi4_scope based on the RTO_ONLINK
      flag. However, some helpers call fib_lookup() directly, without
      sanitising the tos and scope fields, so the route lookup can fail and,
      as a result, the ICMP redirect or PMTU update isn't taken into
      account.
      
      Fix this by extracting the ->flowi4_tos and ->flowi4_scope sanitisation
      code into ip_rt_fix_tos(), then use this function in handlers that call
      fib_lookup() directly.
      
      Note 1: We can't sanitise ->flowi4_tos and ->flowi4_scope in a central
      place (like __build_flow_key() or flowi4_init_output()), because
      ip_route_output_key_hash() expects non-sanitised values. When called
      with sanitised values, it can erroneously overwrite RT_SCOPE_LINK with
      RT_SCOPE_UNIVERSE in ->flowi4_scope. Therefore we have to be careful to
      sanitise the values only for those paths that don't call
      ip_route_output_key_hash().
      
      Note 2: The problem is mostly about sanitising ->flowi4_tos. Having
      ->flowi4_scope initialised with RT_SCOPE_UNIVERSE instead of
      RT_SCOPE_LINK probably wasn't really a problem: sockets with the
      SOCK_LOCALROUTE flag set (those that'd result in RTO_ONLINK being set)
      normally shouldn't receive ICMP redirects or PMTU updates.
      
      Fixes: 4895c771 ("ipv4: Add FIB nexthop exceptions.")
      Signed-off-by: Guillaume Nault <gnault@redhat.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yu Liao <liaoyu15@huawei.com>
      Reviewed-by: Wei Li <liwei391@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
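The sanitisation described in commit bbbaeb3f can be illustrated with a small userspace model. This is only a sketch, not the kernel code: `struct flow_key` and `fix_tos()` are hypothetical stand-ins for `struct flowi4` and `ip_rt_fix_tos()`, and the constants are local copies that are assumed to mirror the kernel's values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local copies, assumed to mirror the kernel's definitions. */
#define IPTOS_RT_MASK     0x1c /* TOS bits used for routing, ECN bits excluded */
#define RTO_ONLINK        0x01 /* "on-link" flag carried in the tos field */
#define RT_SCOPE_UNIVERSE 0
#define RT_SCOPE_LINK     253

/* Hypothetical stand-in for the two struct flowi4 fields discussed above. */
struct flow_key {
	uint8_t tos;
	uint8_t scope;
};

/* Derive the scope from RTO_ONLINK, then keep only the routing TOS bits. */
void fix_tos(struct flow_key *fl)
{
	assert(fl != NULL);

	uint8_t tos = fl->tos & (IPTOS_RT_MASK | RTO_ONLINK);

	fl->scope = (tos & RTO_ONLINK) ? RT_SCOPE_LINK : RT_SCOPE_UNIVERSE;
	fl->tos = tos & IPTOS_RT_MASK;
}
```

In this model a key with tos 0x12 (TOS 0x10 with an ECN bit set) comes out as tos 0x10 with universe scope, so a direct FIB lookup would no longer be thrown off by the ECN bits.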
  4. 19 Oct, 2021 · 3 commits
  5. 13 Oct, 2021 · 1 commit
  6. 06 Jul, 2021 · 1 commit
    • ipv4: Fix device used for dst_alloc with local routes · 2d08e2bf
      Committed by David Ahern
      stable inclusion
      from stable-5.10.46
      commit 0239c439cedcc13c57f6d6e47c36904cdf1da7ca
      bugzilla: 168323
      CVE: NA
      
      --------------------------------
      
      [ Upstream commit b87b04f5 ]
      
      Oliver reported a use case where deleting a VRF device can hang
      waiting for the refcnt to drop to 0. The root cause is that the dst
      is allocated against the VRF device but cached on the loopback
      device.
      
      The use case (added to the selftests) has an implicit VRF crossing
      due to the ordering of the FIB rules (lookup local is before the
      l3mdev rule, but the problem occurs even if the FIB rules are
      re-ordered with local after l3mdev because the VRF table does not
      have a default route to terminate the lookup). The end result is
      that the FIB lookup returns the loopback device as the nexthop,
      but the ingress device is in a VRF. Because of this mismatch, the dst
      is allocated against the VRF device but then cached on the loopback
      device.
      
      The fix is to borrow the trick used for IPv6 (see ip6_rt_get_dev_rcu):
      pick the dst alloc device based on the FIB lookup result, but only
      after checking that the result has a nexthop device (e.g., that it is
      not an unreachable or prohibit entry).
      
      Fixes: f5a0aab8 ("net: ipv4: dst for local input routes should use l3mdev if relevant")
      Reported-by: Oliver Herms <oliver.peter.herms@gmail.com>
      Signed-off-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
  7. 03 Jun, 2021 · 1 commit
  8. 19 Apr, 2021 · 1 commit
  9. 29 Nov, 2020 · 1 commit
    • ipv4: Fix tos mask in inet_rtm_getroute() · 1ebf1790
      Committed by Guillaume Nault
      When inet_rtm_getroute() was converted to use the RCU variants of
      ip_route_input() and ip_route_output_key(), the TOS parameters
      stopped being masked with IPTOS_RT_MASK before doing the route lookup.
      
      As a result, "ip route get" can return a different route than what
      would be used when sending real packets.
      
      For example:
      
          $ ip route add 192.0.2.11/32 dev eth0
          $ ip route add unreachable 192.0.2.11/32 tos 2
          $ ip route get 192.0.2.11 tos 2
          RTNETLINK answers: No route to host
      
      But, packets with TOS 2 (ECT(0) if interpreted as an ECN bit) would
      actually be routed using the first route:
      
          $ ping -c 1 -Q 2 192.0.2.11
          PING 192.0.2.11 (192.0.2.11) 56(84) bytes of data.
          64 bytes from 192.0.2.11: icmp_seq=1 ttl=64 time=0.173 ms
      
          --- 192.0.2.11 ping statistics ---
          1 packets transmitted, 1 received, 0% packet loss, time 0ms
          rtt min/avg/max/mdev = 0.173/0.173/0.173/0.000 ms
      
      This patch re-applies IPTOS_RT_MASK in inet_rtm_getroute(), to
      return results consistent with real route lookups.
      
      Fixes: 3765d35e ("net: ipv4: Convert inet_rtm_getroute to rcu versions of route lookup")
      Signed-off-by: Guillaume Nault <gnault@redhat.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Link: https://lore.kernel.org/r/b2d237d08317ca55926add9654a48409ac1b8f5b.1606412894.git.gnault@redhat.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
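The mismatch fixed by commit 1ebf1790 can be reproduced in a toy lookup. The sketch below is a simplified model, not the kernel's FIB: `struct route`, `example_routes` and `lookup()` are hypothetical, the matching rule (a non-zero route TOS must equal the key, TOS 0 matches anything) is an assumption about how TOS-qualified routes are selected, and `IPTOS_RT_MASK` is a local copy assumed to mirror the kernel value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define IPTOS_RT_MASK 0x1c /* local copy, assumed to mirror the kernel value */

enum route_type { ROUTE_UNICAST, ROUTE_UNREACHABLE, ROUTE_NONE };

/* Hypothetical route entry: same prefix, distinguished only by TOS. */
struct route {
	uint8_t tos;
	enum route_type type;
};

/* The two routes from the example, TOS-specific entry first. */
const struct route example_routes[] = {
	{ .tos = 2, .type = ROUTE_UNREACHABLE },
	{ .tos = 0, .type = ROUTE_UNICAST },
};

/* A route with a non-zero TOS matches only that TOS; TOS 0 matches anything. */
enum route_type lookup(const struct route *tbl, size_t n, uint8_t key_tos)
{
	assert(tbl != NULL);

	for (size_t i = 0; i < n; i++) {
		if (tbl[i].tos && tbl[i].tos != key_tos)
			continue;
		return tbl[i].type;
	}
	return ROUTE_NONE;
}
```

Looking up the raw TOS value 2 hits the unreachable entry, while looking up `2 & IPTOS_RT_MASK` (which is 0) falls through to the unicast route, matching what real packets would use.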
  10. 11 Oct, 2020 · 1 commit
  11. 16 Sep, 2020 · 1 commit
    • ipv4: Update exception handling for multipath routes via same device · 2fbc6e89
      Committed by David Ahern
      Kfir reported that pmtu exceptions are not created properly for
      deployments where multipath routes use the same device.
      
      After some digging I see 2 compounding problems:
      1. ip_route_output_key_hash_rcu is updating the flowi4_oif *after*
         the route lookup. This is the second use case where this has
         been a problem (the first is related to use of vti devices with
         VRF). I cannot find any reason for the oif to be changed after the
         lookup; the code goes back to the start of git history. It does
         not seem logical, so remove it.
      
      2. fib_lookups for exceptions do not call fib_select_path to handle
         multipath route selection based on the hash.
      
      The end result is that the fib_lookup used to add the exception
      always creates it using the first leg of the route.
      
      An example topology showing the problem:
      
                       |  host1
                   +------+
                   | eth0 |  .209
                   +------+
                       |
                   +------+
           switch  | br0  |
                   +------+
                       |
             +---------+---------+
             | host2             |  host3
         +------+             +------+
         | eth0 | .250        | eth0 | 192.168.252.252
         +------+             +------+
      
         +-----+             +-----+
         | vti | .2          | vti | 192.168.247.3
         +-----+             +-----+
             \                  /
       =================================
       tunnels
               192.168.247.1/24
      
      for h in host1 host2 host3; do
              ip netns add ${h}
              ip -netns ${h} link set lo up
              ip netns exec ${h} sysctl -wq net.ipv4.ip_forward=1
      done
      
      ip netns add switch
      ip -netns switch li set lo up
      ip -netns switch link add br0 type bridge stp 0
      ip -netns switch link set br0 up
      
      for n in 1 2 3; do
              ip -netns switch link add eth-sw type veth peer name eth-h${n}
              ip -netns switch li set eth-h${n} master br0 up
              ip -netns switch li set eth-sw netns host${n} name eth0
      done
      
      ip -netns host1 addr add 192.168.252.209/24 dev eth0
      ip -netns host1 link set dev eth0 up
      ip -netns host1 route add 192.168.247.0/24 \
              nexthop via 192.168.252.250 dev eth0 nexthop via 192.168.252.252 dev eth0
      
      ip -netns host2 addr add 192.168.252.250/24 dev eth0
      ip -netns host2 link set dev eth0 up
      
      ip -netns host3 addr add 192.168.252.252/24 dev eth0
      ip -netns host3 link set dev eth0 up
      
      ip netns add tunnel
      ip -netns tunnel li set lo up
      ip -netns tunnel li add br0 type bridge
      ip -netns tunnel li set br0 up
      for n in $(seq 11 20); do
              ip -netns tunnel addr add dev br0 192.168.247.${n}/24
      done
      
      for n in 2 3
      do
              ip -netns tunnel link add vti${n} type veth peer name eth${n}
              ip -netns tunnel link set eth${n} mtu 1360 master br0 up
              ip -netns tunnel link set vti${n} netns host${n} mtu 1360 up
              ip -netns host${n} addr add dev vti${n} 192.168.247.${n}/24
      done
      ip -netns tunnel ro add default nexthop via 192.168.247.2 nexthop via 192.168.247.3
      
      ip netns exec host1 ping -M do -s 1400 -c3 -I 192.168.252.209 192.168.247.11
      ip netns exec host1 ping -M do -s 1400 -c3 -I 192.168.252.209 192.168.247.15
      ip -netns host1 ro ls cache
      
      Before this patch the cache always shows exceptions against the first
      leg in the multipath route; 192.168.252.250 per this example. Since the
      hash has an initial random seed, you may need to vary the final octet
      beyond the values listed. In my tests, trying addresses between 11 and
      19 was usually enough to find ones that, between them, used both legs.
      
      With this patch, the cache will have exceptions for both legs.
      
      Fixes: 4895c771 ("ipv4: Add FIB nexthop exceptions")
      Reported-by: Kfir Itzhak <mastertheknife@gmail.com>
      Signed-off-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
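The effect described in commit 2fbc6e89 can be mimicked with a two-leg toy route. This is a rough sketch under stated assumptions: `struct multipath_route`, `select_leg()` and `add_exception()` are hypothetical, and the modulo-based leg choice is a plain simplification standing in for the kernel's hash-based selection in fib_select_path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define N_LEGS 2

/* Hypothetical multipath route that counts exceptions per leg. */
struct multipath_route {
	unsigned int exceptions[N_LEGS];
};

/* Simplified leg choice: the kernel's selection is hash based but not a modulo. */
size_t select_leg(uint32_t flow_hash)
{
	return flow_hash % N_LEGS;
}

/* Record an exception; without path selection it always lands on leg 0. */
void add_exception(struct multipath_route *rt, uint32_t flow_hash,
		   bool select_path)
{
	assert(rt != NULL);

	size_t leg = select_path ? select_leg(flow_hash) : 0;

	rt->exceptions[leg]++;
}
```

With `select_path` false, two flows whose hashes map to different legs both create their exception on the first leg; with it true, each leg gets one.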
  12. 15 Sep, 2020 · 1 commit
    • ipv4: Initialize flowi4_multipath_hash in data path · 1869e226
      Committed by David Ahern
      flowi4_multipath_hash was added by the commit referenced below for
      tunnels. Unfortunately, the patch did not initialize the new field
      for several fast path lookups that do not initialize the entire flow
      struct to 0. Fix those locations. Currently, flowi4_multipath_hash
      is random garbage and affects the hash value computed by
      fib_multipath_hash for multipath selection.
      
      Fixes: 24ba1440 ("route: Add multipath_hash in flowi_common to make user-define hash")
      Signed-off-by: David Ahern <dsahern@gmail.com>
      Cc: wenxu <wenxu@ucloud.cn>
      Signed-off-by: David S. Miller <davem@davemloft.net>
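The class of bug fixed by commit 1869e226 is easy to show in isolation: a fast-path initialiser that assigns fields one by one leaves any field it forgets holding whatever was on the stack. The sketch below uses a hypothetical, cut-down `struct flow_key` (the real `struct flowi4` has many more fields) and models stack garbage with an explicit junk pattern rather than by reading uninitialised memory.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, cut-down flow key. */
struct flow_key {
	uint32_t daddr;
	uint32_t saddr;
	uint32_t multipath_hash;
};

/* Model an uninitialised stack variable by filling it with a junk pattern. */
void flow_key_poison(struct flow_key *fl)
{
	assert(fl != NULL);
	memset(fl, 0xAA, sizeof(*fl));
}

/* Fast-path style init: no memset, so every field is assigned by hand. */
void flow_key_init_fast(struct flow_key *fl, uint32_t daddr, uint32_t saddr)
{
	assert(fl != NULL);

	fl->daddr = daddr;
	fl->saddr = saddr;
	fl->multipath_hash = 0; /* the kind of assignment the fix adds */
}
```

Dropping the last assignment would leave `multipath_hash` at the junk pattern after initialisation, which is the situation the commit message describes.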
  13. 01 Sep, 2020 · 1 commit
  14. 25 Aug, 2020 · 2 commits
  15. 05 Aug, 2020 · 1 commit
    • ipv4: route: Ignore output interface in FIB lookup for PMTU route · df23bb18
      Committed by Stefano Brivio
      Currently, processes sending traffic to a local bridge with an
      encapsulation device as a port don't get ICMP errors if they exceed
      the PMTU of the encapsulated link.
      
      David Ahern suggested this as a hack, but it actually looks like
      the correct solution: when we update the PMTU for a given destination
      by means of updating or creating a route exception, the encapsulation
      might trigger this because of PMTU discovery happening either on the
      encapsulation device itself, or its lower layer. This happens on
      bridged encapsulations only.
      
      The output interface shouldn't matter, because we already have a
      valid destination. Drop the output interface restriction from the
      associated route lookup.
      
      For UDP tunnels, we will now have a route exception created for the
      encapsulation itself, with a MTU value reflecting its headroom, which
      allows a bridge forwarding IP packets originated locally to deliver
      errors back to the sending socket.
      
      The behaviour is now consistent with IPv6 and verified with selftests
      pmtu_ipv{4,6}_br_{geneve,vxlan}{4,6}_exception introduced later in
      this series.
      
      v2:
      - reset output interface only for bridge ports (David Ahern)
      - add and use netif_is_any_bridge_port() helper (David Ahern)
      
      Suggested-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
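The v2 behaviour of commit df23bb18 (reset the output interface only for bridge ports) boils down to one conditional before the route lookup. A minimal sketch, assuming a hypothetical `struct net_dev` with an `is_bridge_port` flag in place of the `netif_is_any_bridge_port()` helper, and a hypothetical `struct flow_key` in place of `struct flowi4`:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical device and flow key; only the fields needed here. */
struct net_dev {
	int ifindex;
	bool is_bridge_port;
};

struct flow_key {
	int oif; /* 0 means "any output interface" */
};

/* Drop the output interface restriction when the device is a bridge port. */
void pmtu_prepare_lookup(struct flow_key *fl, const struct net_dev *dev)
{
	assert(fl != NULL && dev != NULL);

	if (dev->is_bridge_port)
		fl->oif = 0;
}
```

For a regular device the key keeps its output interface, so the lookup stays as restrictive as before.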
  16. 28 Jun, 2020 · 1 commit
  17. 18 May, 2020 · 1 commit
  18. 09 May, 2020 · 1 commit
    • net: ipv4: really enforce backoff for redirects · 57644431
      Committed by Paolo Abeni
      In commit b406472b ("net: ipv4: avoid mixed n_redirects and
      rate_tokens usage") I missed the fact that a 0 'rate_tokens' will
      bypass the backoff algorithm.
      
      Since rate_tokens is cleared after a redirect silence, and never
      incremented on redirects, if the host keeps receiving packets
      requiring redirect it will reply ignoring the backoff.
      
      Additionally, the 'rate_last' field will be updated with the
      cadence of the ingress packet requiring redirect. If that rate is
      high enough, that will prevent the host from generating any
      other kind of ICMP message.
      
      The check for a zero 'rate_tokens' value was likely a shortcut
      to avoid the more complex backoff algorithm after a redirect
      silence period. Address the issue checking for 'n_redirects'
      instead, which is incremented on successful redirect, and
      does not interfere with other ICMP replies.
      
      Fixes: b406472b ("net: ipv4: avoid mixed n_redirects and rate_tokens usage")
      Reported-and-tested-by: Colin Walters <walters@redhat.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
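The decision that commit 57644431 keys on `n_redirects` can be sketched as a small state machine. This is an illustrative model only: `struct peer_state` and `redirect_allowed()` are hypothetical, the two tunables and their values are assumptions, time is a plain tick counter, and the silence-period reset mentioned in the commit message is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-peer state, loosely modelled on the fields named above. */
struct peer_state {
	unsigned long rate_last;  /* time of the last redirect, in ticks */
	unsigned int n_redirects; /* redirects sent since the last reset */
};

/* Illustrative tunables; names and values are assumptions. */
#define REDIRECT_NUMBER 9U   /* give up after this many redirects */
#define REDIRECT_LOAD   20UL /* base backoff delay, in ticks */

/* Decide whether a redirect may be sent at time 'now'; updates the state. */
bool redirect_allowed(struct peer_state *p, unsigned long now)
{
	assert(p != NULL);

	if (p->n_redirects >= REDIRECT_NUMBER) {
		p->rate_last = now;
		return false;
	}

	/* The first redirect goes out at once; each later one has to wait
	 * for a delay that doubles with every redirect already sent. */
	if (p->n_redirects == 0 ||
	    now > p->rate_last + (REDIRECT_LOAD << p->n_redirects)) {
		p->rate_last = now;
		p->n_redirects++;
		return true;
	}
	return false;
}
```

Because the first test looks at `n_redirects` rather than at a token count that other ICMP replies never replenish, the backoff branch is only skipped for the very first redirect.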
  19. 27 Apr, 2020 · 1 commit
  20. 24 Mar, 2020 · 1 commit
  21. 25 Feb, 2020 · 1 commit
  22. 04 Feb, 2020 · 1 commit
  23. 24 Jan, 2020 · 1 commit
  24. 15 Jan, 2020 · 2 commits
    • ipv4: Add "offload" and "trap" indications to routes · 90b93f1b
      Committed by Ido Schimmel
      When performing L3 offload, routes and nexthops are usually programmed
      into two different tables in the underlying device. Therefore, the fact
      that a nexthop resides in hardware does not necessarily mean that all
      the associated routes also reside in hardware and vice-versa.
      
      While the kernel can signal to user space the presence of a nexthop in
      hardware (via 'RTNH_F_OFFLOAD'), it does not have a corresponding flag
      for routes. In addition, the fact that a route resides in hardware does
      not necessarily mean that the traffic is offloaded. For example,
      unreachable routes (i.e., 'RTN_UNREACHABLE') are programmed to trap
      packets to the CPU so that the kernel will be able to generate the
      appropriate ICMP error packet.
      
      This patch adds "offload" and "trap" indications to IPv4 routes, so
      that users will have better visibility into the offload process.
      
      'struct fib_alias' is extended with two new fields that indicate if the
      route resides in hardware or not and if it is offloading traffic from
      the kernel or trapping packets to it. Note that the new fields are added
      in the 6 bytes hole and therefore the struct still fits in a single
      cache line [1].
      
      Capable drivers are expected to invoke fib_alias_hw_flags_set() with the
      route's key in order to set the flags.
      
      The indications are dumped to user space via new flags (i.e.,
      'RTM_F_OFFLOAD' and 'RTM_F_TRAP') in the 'rtm_flags' field in the
      ancillary header.
      
      v2:
      * Make use of 'struct fib_rt_info' in fib_alias_hw_flags_set()
      
      [1]
      struct fib_alias {
              struct hlist_node  fa_list;                      /*     0    16 */
              struct fib_info *          fa_info;              /*    16     8 */
              u8                         fa_tos;               /*    24     1 */
              u8                         fa_type;              /*    25     1 */
              u8                         fa_state;             /*    26     1 */
              u8                         fa_slen;              /*    27     1 */
              u32                        tb_id;                /*    28     4 */
              s16                        fa_default;           /*    32     2 */
              u8                         offload:1;            /*    34: 0  1 */
              u8                         trap:1;               /*    34: 1  1 */
              u8                         unused:6;             /*    34: 2  1 */
      
              /* XXX 5 bytes hole, try to pack */
      
              struct callback_head rcu __attribute__((__aligned__(8))); /*    40    16 */
      
              /* size: 56, cachelines: 1, members: 12 */
              /* sum members: 50, holes: 1, sum holes: 5 */
              /* sum bitfield members: 8 bits (1 bytes) */
              /* forced alignments: 1, forced holes: 1, sum forced holes: 5 */
              /* last cacheline: 56 bytes */
      } __attribute__((__aligned__(8)));
      
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Reviewed-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
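How the two new bits of commit 90b93f1b might be turned into `rtm_flags` can be sketched as follows. This is an assumption-laden illustration, not the kernel's dump code: `struct route_hw_state` and `route_hw_flags()` are hypothetical, and the flag values are local copies assumed to match the uapi definitions in linux/rtnetlink.h.

```c
#include <assert.h>
#include <stddef.h>

/* Local copies, assumed to match the uapi values. */
#define RTM_F_OFFLOAD 0x4000 /* route is offloaded */
#define RTM_F_TRAP    0x8000 /* route is trapping packets */

/* Hypothetical stand-in for the two new bits in struct fib_alias. */
struct route_hw_state {
	unsigned int offload:1;
	unsigned int trap:1;
};

/* Translate the per-route hardware state into rtm_flags bits. */
unsigned int route_hw_flags(const struct route_hw_state *st)
{
	assert(st != NULL);

	unsigned int flags = 0;

	if (st->offload)
		flags |= RTM_F_OFFLOAD;
	if (st->trap)
		flags |= RTM_F_TRAP;
	return flags;
}
```

An unreachable route programmed into hardware would, in this model, report only the trap bit, which is the distinction the commit message draws between residing in hardware and offloading traffic.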
    • ipv4: Encapsulate function arguments in a struct · 1e301fd0
      Committed by Ido Schimmel
      fib_dump_info() is used to prepare RTM_{NEW,DEL}ROUTE netlink messages
      using the passed arguments. Currently, the function takes 11 arguments,
      6 of which are attributes of the route being dumped (e.g., prefix, TOS).
      
      The next patch will need the function to also dump to user space an
      indication if the route is present in hardware or not. Instead of
      passing yet another argument, change the function to take a struct
      containing the different route attributes.
      
      v2:
      * Name last argument of fib_dump_info()
      * Move 'struct fib_rt_info' to include/net/ip_fib.h so that it could
        later be passed to fib_alias_hw_flags_set()
      
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Reviewed-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  25. 25 Dec, 2019 · 1 commit
    • net: add bool confirm_neigh parameter for dst_ops.update_pmtu · bd085ef6
      Committed by Hangbin Liu
      The MTU update code is supposed to be invoked in response to real
      networking events that update the PMTU. The IPv6 PMTU update function,
      __ip6_rt_update_pmtu(), calls dst_confirm_neigh() to refresh the
      neighbour's confirmed time.
      
      The tunnel code, however, updates the PMTU before transmitting, like:
        - tnl_update_pmtu()
          - skb_dst_update_pmtu()
            - ip6_rt_update_pmtu()
              - __ip6_rt_update_pmtu()
                - dst_confirm_neigh()
      
      If the MAC address of the tunnel's remote dst has changed and we still
      confirm the neighbour, the neigh cache is never updated and ping6 to
      the remote fails.
      
      So for this ip_tunnel_xmit() case, _EVEN_ if the MTU is changed, we
      should not be invoking dst_confirm_neigh() as we have no evidence
      of successful two-way communication at this point.
      
      On the other hand it is also important to keep the neigh reachability fresh
      for TCP flows, so we cannot remove this dst_confirm_neigh() call.
      
      To fix the issue, we have to add a new bool parameter to dst_ops.update_pmtu
      that chooses whether to confirm the neighbour or not. This patch adds the
      parameter and sets it to true in all callers to preserve the previous
      behaviour; later patches fix the tunnel code one by one.
      
      v5: No change.
      v4: No change.
      v3: Do not remove dst_confirm_neigh, but add a new bool parameter in
          dst_ops.update_pmtu to control whether we should do neighbor confirm.
          Also split the big patch to small ones for each area.
      v2: Remove dst_confirm_neigh in __ip6_rt_update_pmtu.
      
      Suggested-by: David Miller <davem@davemloft.net>
      Reviewed-by: Guillaume Nault <gnault@redhat.com>
      Acked-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
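The interface change in commit bd085ef6 amounts to letting the caller decide whether a PMTU update also counts as neighbour confirmation. A minimal sketch, with a hypothetical `struct dst_model` that counts confirmations instead of touching a real neighbour entry; the rule that only a smaller MTU is recorded is an assumption of the model:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical dst entry: current path MTU plus a confirmation counter. */
struct dst_model {
	unsigned int mtu;
	unsigned int neigh_confirms;
};

/* Lower the path MTU; confirm the neighbour only when the caller asks. */
void update_pmtu(struct dst_model *dst, unsigned int mtu, bool confirm_neigh)
{
	assert(dst != NULL);

	if (confirm_neigh)
		dst->neigh_confirms++;
	if (mtu < dst->mtu)
		dst->mtu = mtu;
}
```

A tunnel transmit path would pass `false`, since it has no evidence of two-way communication, while paths driven by real network events keep passing `true`.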
  26. 22 Nov, 2019 · 1 commit
    • ipv4: use dst hint for ipv4 list receive · 02b24941
      Committed by Paolo Abeni
      This is similar to the previous change, with some additional
      IPv4-specific quirks. Even when using the route hint we still have to
      perform additional per-packet checks on source address validity: a new
      helper is added to wrap them.
      
      Hints are explicitly disabled if the destination is a local broadcast;
      that keeps the code simple, and local broadcasts are a slower path anyway.
      
      UDP flood performance vs recvmmsg() receiver:
      
      vanilla		patched		delta
      Kpps		Kpps		%
      1683		1871		+11
      
      In the worst case scenario - each packet has a different
      destination address - the performance delta is within noise
      range.
      
      v3 -> v4:
       - re-enable hints for forward
      
      v2 -> v3:
       - really fix build (sic) and hint usage check
       - use fib4_has_custom_rules() helpers (David A.)
       - add ip_extract_route_hint() helper (Edward C.)
       - use prev skb as hint instead of copying data (Willem)
      
      v1 -> v2:
       - fix build issue with !CONFIG_IP_MULTIPLE_TABLES
      
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
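The saving that commit 02b24941 reports comes from reusing the previous packet's route when the next packet in the batch looks the same. The sketch below counts full lookups for a batch under stated assumptions: `struct pkt` and `lookups_needed()` are hypothetical, a hint is considered usable when destination address and TOS match, and the limited broadcast address stands in for "local broadcast".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LIMITED_BROADCAST 0xffffffffU /* stand-in for "local broadcast" */

/* Hypothetical packet: only the fields the hint check looks at. */
struct pkt {
	uint32_t daddr;
	uint8_t tos;
};

/* Count full route lookups when each packet may reuse its predecessor's route. */
size_t lookups_needed(const struct pkt *batch, size_t n)
{
	assert(batch != NULL || n == 0);

	const struct pkt *hint = NULL;
	size_t lookups = 0;

	for (size_t i = 0; i < n; i++) {
		const struct pkt *p = &batch[i];

		if (!(hint && hint->daddr == p->daddr && hint->tos == p->tos))
			lookups++; /* no usable hint: full lookup */

		/* A broadcast destination never serves as a hint. */
		hint = (p->daddr == LIMITED_BROADCAST) ? NULL : p;
	}
	return lookups;
}
```

A batch where every packet has a different destination needs one lookup per packet, which corresponds to the worst case the commit message says is within noise.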
  27. 06 Nov, 2019 · 1 commit
  28. 18 Oct, 2019 · 2 commits
    • ipv4: fix race condition between route lookup and invalidation · 5018c596
      Committed by Wei Wang
      Jesse and Ido reported the following race condition:
      <CPU A, t0> - Received packet A is forwarded and cached dst entry is
      taken from the nexthop ('nhc->nhc_rth_input'). Calls skb_dst_set()
      
      <t1> - Given Jesse has busy routers ("ingesting full BGP routing tables
      from multiple ISPs"), route is added / deleted and rt_cache_flush() is
      called
      
      <CPU B, t2> - Received packet B tries to use the same cached dst entry
      from t0, but rt_cache_valid() is no longer true and it is replaced in
      rt_cache_route() by the newer one. This calls dst_dev_put() on the
      original dst entry which assigns the blackhole netdev to 'dst->dev'
      
      <CPU A, t3> - dst_input(skb) is called on packet A and it is dropped due
      to 'dst->dev' being the blackhole netdev
      
      There are 2 issues in the v4 routing code:
      1. A per-netns counter is used to validate routes. That means
      whenever a route is changed in the netns, users of all routes in
      the netns need to redo the lookup. v6 has an implementation that only
      updates fn_sernum for the routes that are affected.
      2. When rt_cache_valid() returns false, rt_cache_route() is called to
      throw away the current cache, and create a new one. This seems
      unnecessary because as long as this route does not change, the route
      cache does not need to be recreated.
      
      Fully solving the above 2 issues probably needs quite a few code
      changes and careful testing, and is not suitable for the net branch.
      
      So this patch only adds the deleted cached rt to the uncached
      list, so that users can still use it to receive packets until
      they are done with it.
      
      Fixes: 95c47f9c ("ipv4: call dst_dev_put() properly")
      Signed-off-by: Wei Wang <weiwan@google.com>
      Reported-by: Ido Schimmel <idosch@idosch.org>
      Reported-by: Jesse Hathaway <jesse@mbuki-mvuki.org>
      Tested-by: Jesse Hathaway <jesse@mbuki-mvuki.org>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Cc: David Ahern <dsahern@gmail.com>
      Reviewed-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ipv4: Return -ENETUNREACH if we can't create route but saddr is valid · 595e0651
      Committed by Stefano Brivio
      ...instead of -EINVAL. An issue was found with older kernel versions
      while unplugging an NFS client with pending RPCs, and the wrong error
      code here prevented it from recovering once the link was back up with
      a configured address.
      
      Incidentally, this is not an issue anymore since commit 4f8943f8
      ("SUNRPC: Replace direct task wakeups from softirq context"), included
      in 5.2-rc7, had the effect of decoupling the forwarding of this error
      by using SO_ERROR in xs_wake_error(), as pointed out by Benjamin
      Coddington.
      
      To the best of my knowledge, this isn't currently causing any further
      issue, but the error code doesn't look appropriate anyway, and we
      might hit this in other paths as well.
      
      In detail, as analysed by Gonzalo Siero: once the route is deleted
      because the interface is down, it can't be resolved and we return
      -EINVAL here. This ends up, courtesy of inet_sk_rebuild_header(),
      as the socket error seen by tcp_write_err(), called by
      tcp_retransmit_timer().
      
      In turn, tcp_write_err() indirectly calls xs_error_report(), which
      wakes up the RPC pending tasks with a status of -EINVAL. This is then
      seen by call_status() in the SUN RPC implementation, which aborts the
      RPC call calling rpc_exit(), instead of handling this as a
      potentially temporary condition, i.e. as a timeout.
      
      Return -EINVAL only if the input parameters passed to
      ip_route_output_key_hash_rcu() are actually invalid (this is the case
      if the specified source address is multicast, limited broadcast or
      all zeroes), but return -ENETUNREACH in all cases where, at the given
      moment, the given source address doesn't allow resolving the route.
      
      While at it, drop the initialisation of err to -ENETUNREACH, which
      was added to __ip_route_output_key() back then by commit
      0315e382 ("net: Fix behaviour of unreachable, blackhole and
      prohibit routes"), but actually had no effect, as it was, and is,
      overwritten by the fib_lookup() return code assignment, and anyway
      ignored in all other branches, including the if (fl4->saddr) one:
      I find this rather confusing, as it would look like -ENETUNREACH is
      the "default" error, while that statement has no effect.
      
      Also note that after commit fc75fc83 ("ipv4: dont create routes
      on down devices"), we would get -ENETUNREACH if the device is down,
      but -EINVAL if the source address is specified and we can't resolve
      the route, and this appears to be rather inconsistent.
      
      Reported-by: Stefan Walter <walteste@inf.ethz.ch>
      Analysed-by: Benjamin Coddington <bcodding@redhat.com>
      Analysed-by: Gonzalo Siero <gsierohu@redhat.com>
      Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
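The error-code split introduced by commit 595e0651 can be summarised in a few lines. This is a sketch under assumptions, not the kernel function: `output_route_error()` and the three address helpers are hypothetical, addresses are in host byte order for readability, the zero test is taken to mean the 0.0.0.0/8 network, and `resolvable` stands for "a route can currently be resolved for this source address".

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Address classes, in host byte order for readability. */
bool is_multicast(uint32_t a) { return (a & 0xf0000000U) == 0xe0000000U; }
bool is_lbcast(uint32_t a)    { return a == 0xffffffffU; }
bool is_zeronet(uint32_t a)   { return (a & 0xff000000U) == 0; }

/* -EINVAL only for an invalid source address; otherwise -ENETUNREACH
 * when the route cannot be resolved right now, or 0 on success. */
int output_route_error(uint32_t saddr, bool resolvable)
{
	if (is_multicast(saddr) || is_lbcast(saddr) || is_zeronet(saddr))
		return -EINVAL;
	if (!resolvable)
		return -ENETUNREACH;
	return 0;
}
```

A caller such as an RPC layer can then treat -ENETUNREACH as a potentially temporary condition while still failing fast on a source address that can never be valid.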
  29. 05 Oct, 2019 · 1 commit
    • net: ipv4: avoid mixed n_redirects and rate_tokens usage · b406472b
      Committed by Paolo Abeni
      Since commit c09551c6 ("net: ipv4: use a dedicated counter
      for icmp_v4 redirect packets") we use 'n_redirects' to account
      for redirect packets, but we still use 'rate_tokens' to compute
      the redirect packets exponential backoff.
      
      If the device sent any ICMP error packet to the relevant peer
      after sending a redirect, it will also update 'rate_tokens' according
      to the leaky bucket scheme; typically 'rate_tokens' will rise
      above BITS_PER_LONG and the redirect packets backoff algorithm
      will produce undefined behavior.
      
      Fix the issue using 'n_redirects' to compute the exponential backoff
      in ip_rt_send_redirect().
      
      Note that we still clear rate_tokens after a redirect silence period,
      to avoid changing an established behaviour.
      
      The root cause predates git history; before the mentioned commit, in
      the critical scenario, the kernel stopped sending redirects; after
      the mentioned commit the behavior is more erratic.
      
      Reported-by: Xiumei Mu <xmu@redhat.com>
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Fixes: c09551c6 ("net: ipv4: use a dedicated counter for icmp_v4 redirect packets")
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Acked-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
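The undefined behaviour mentioned in commit b406472b comes from C itself: shifting an `unsigned long` by a count greater than or equal to its width is undefined. The sketch below shows a guarded version of an exponential backoff delay; `backoff_delay()` is a hypothetical helper, and `BITS_PER_LONG` is redefined locally for userspace.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace equivalent of the kernel's BITS_PER_LONG. */
#define BITS_PER_LONG ((unsigned int)(sizeof(unsigned long) * CHAR_BIT))

/* Compute base << exponent, refusing shifts that are undefined or lossy. */
bool backoff_delay(unsigned long base, unsigned int exponent,
		   unsigned long *delay)
{
	assert(delay != NULL);

	if (exponent >= BITS_PER_LONG)
		return false; /* shift count out of range: undefined in C */
	if (base > (ULONG_MAX >> exponent))
		return false; /* high bits would be shifted out */

	*delay = base << exponent;
	return true;
}
```

Using a counter that only grows by one per redirect and is capped by a small limit keeps the exponent in the safe range, which is the effect of switching the computation to 'n_redirects'.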
  30. 21 Sep, 2019 · 1 commit
  31. 25 Aug, 2019 · 1 commit
    • net: route dump netlink NLM_F_MULTI flag missing · e93fb3e9
      Committed by John Fastabend
      An excerpt from netlink(7) man page,
      
        In multipart messages (multiple nlmsghdr headers with associated payload
        in one byte stream) the first and all following headers have the
        NLM_F_MULTI flag set, except for the last  header  which  has the type
        NLMSG_DONE.
      
      but, after commit ee28906f, there is a missing NLM_F_MULTI flag in the
      middle of a FIB dump. As a result, user space applications that follow
      the man page excerpt above may get confused and stop parsing the
      message, believing something went wrong.
      
      In the golang netlink lib [0] the library logic stops parsing believing the
      message is not a multipart message. Found this running Cilium[1] against
      net-next while adding a feature to auto-detect routes. I noticed with
      multiple route tables we no longer could detect the default routes on net
      tree kernels because the library logic was not returning them.
      
      Fix this by handling the fib_dump_info_fnhe() case the same way the
      fib_dump_info() handles it by passing the flags argument through the
      call chain and adding a flags argument to rt_fill_info().
      
      Tested with the Cilium stack: auto-detection of routes works again. I
      also annotated the libs to dump netlink msgs and checked that the
      NLM_F_MULTI and NLMSG_DONE flags look correct after this.
      
      Note: inet_rtm_getroute() passes '0' as the flags argument to
      rt_fill_info(), the same as is done for fib_dump_info(), so this looks
      correct to me.
      
      [0] https://github.com/vishvananda/netlink/
      [1] https://github.com/cilium/
      
      Fixes: ee28906f ("ipv4: Dump route exceptions if requested")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
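Why a single missing NLM_F_MULTI flag hides routes from user space, as described in commit e93fb3e9, can be seen with a strict reader that follows the man page rule. This is a simplified sketch: `struct msg_hdr` and `count_multipart()` are hypothetical, only the type and flags of each header are modelled, and the constants are local copies assumed to match linux/netlink.h and linux/rtnetlink.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local copies, assumed to match the uapi headers. */
#define NLM_F_MULTI  0x02
#define NLMSG_DONE   0x03
#define RTM_NEWROUTE 24

/* Hypothetical header: just the two fields the reader inspects. */
struct msg_hdr {
	uint16_t type;
	uint16_t flags;
};

/* Count the messages a strict reader delivers: it stops at NLMSG_DONE,
 * and also right after any header that lacks NLM_F_MULTI. */
size_t count_multipart(const struct msg_hdr *h, size_t n)
{
	assert(h != NULL || n == 0);

	size_t count = 0;

	for (size_t i = 0; i < n; i++) {
		if (h[i].type == NLMSG_DONE)
			break;
		count++;
		if (!(h[i].flags & NLM_F_MULTI))
			break; /* looks like a single-part message */
	}
	return count;
}
```

With one header in the middle of the dump missing the flag, the reader returns early and every route after that point is never seen.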
  32. 09 Jul, 2019 · 1 commit
  33. 06 Jul, 2019 · 1 commit
    • ipv4: Fix NULL pointer dereference in ipv4_neigh_lookup() · 537de0c8
      Committed by Ido Schimmel
      Both ip_neigh_gw4() and ip_neigh_gw6() can return either a valid pointer
      or an error pointer, but the code currently checks that the pointer is
      not NULL.
      
      Fix this by checking that the pointer is not an error pointer, as the
      current check can result in a NULL pointer dereference [1].
      Specifically, I believe that what happened is that ip_neigh_gw4()
      returned '-EINVAL' (0xffffffffffffffea) to which the offset of
      'refcnt' (0x70) was added, which resulted in the address
      0x000000000000005a.
      
      [1]
       BUG: KASAN: null-ptr-deref in refcount_inc_not_zero_checked+0x6e/0x180
       Read of size 4 at addr 000000000000005a by task swapper/2/0
      
       CPU: 2 PID: 0 Comm: swapper/2 Not tainted 5.2.0-rc6-custom-reg-179657-gaa32d89 #396
       Hardware name: Mellanox Technologies Ltd. MSN2010/SA002610, BIOS 5.6.5 08/24/2017
       Call Trace:
       <IRQ>
       dump_stack+0x73/0xbb
       __kasan_report+0x188/0x1ea
       kasan_report+0xe/0x20
       refcount_inc_not_zero_checked+0x6e/0x180
       ipv4_neigh_lookup+0x365/0x12c0
       __neigh_update+0x1467/0x22f0
       arp_process.constprop.6+0x82e/0x1f00
       __netif_receive_skb_one_core+0xee/0x170
       process_backlog+0xe3/0x640
       net_rx_action+0x755/0xd90
       __do_softirq+0x29b/0xae7
       irq_exit+0x177/0x1c0
       smp_apic_timer_interrupt+0x164/0x5e0
       apic_timer_interrupt+0xf/0x20
       </IRQ>
      
      Fixes: 5c9f7c1d ("ipv4: Add helpers for neigh lookup for nexthop")
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Reported-by: Shalom Toledo <shalomt@mellanox.com>
      Reviewed-by: Jiri Pirko <jiri@mellanox.com>
      Reviewed-by: David Ahern <dsahern@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
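The address arithmetic in commit 537de0c8, and the reason a NULL check does not catch an error pointer, can be checked in userspace. The helpers below are imitations written for this sketch, not the kernel's ERR_PTR()/IS_ERR() macros; `MAX_ERRNO` is assumed to match the kernel's limit, and `faulting_address()` assumes a 64-bit address space as in the KASAN report.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095 /* assumed to match the kernel's limit */

/* Userspace imitations of error-pointer encoding and detection. */
void *err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

bool is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO;
}

/* Address touched when a field at 'offset' is read through an error
 * pointer; the addition wraps modulo 2^64. */
uint64_t faulting_address(long error, uint64_t offset)
{
	return (uint64_t)(int64_t)error + offset;
}
```

An error pointer for -EINVAL is not NULL, so a NULL test lets it through, and adding the 0x70 offset wraps around to 0x5a, the address shown in the report.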
  34. 02 Jul, 2019 · 1 commit