1. 27 5月, 2017 1 次提交
    • E
      ipv4: add reference counting to metrics · 3fb07daf
      Eric Dumazet 提交于
      Andrey Konovalov reported crashes in ipv4_mtu()
      
      I could reproduce the issue with KASAN kernels, between
      10.246.7.151 and 10.246.7.152 :
      
      1) 20 concurrent netperf -t TCP_RR -H 10.246.7.152 -l 1000 &
      
      2) At the same time run following loop :
      while :
      do
       ip ro add 10.246.7.152 dev eth0 src 10.246.7.151 mtu 1500
       ip ro del 10.246.7.152 dev eth0 src 10.246.7.151 mtu 1500
      done
      
      Cong Wang attempted to add back rt->fi in commit
      82486aa6 ("ipv4: restore rt->fi for reference counting")
      but this proved to add some issues that were complex to solve.
      
      Instead, I suggested to add a refcount to the metrics themselves,
      being a standalone object (in particular, no reference to other objects)
      
      I tried to make this patch as small as possible to ease its backport,
      instead of being super clean. Note that we believe that only ipv4 dst
      need to take care of the metric refcount. But if this is wrong,
      this patch adds the basic infrastructure to extend this to other
      families.
      
      Many thanks to Julian Anastasov for reviewing this patch, and Cong Wang
      for his efforts on this problem.
      
      Fixes: 2860583f ("ipv4: Kill rt->fi")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NAndrey Konovalov <andreyknvl@google.com>
      Reviewed-by: NJulian Anastasov <ja@ssi.bg>
      Acked-by: NCong Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3fb07daf
  2. 23 5月, 2017 2 次提交
  3. 22 3月, 2017 1 次提交
    • N
      net: ipv4: add support for ECMP hash policy choice · bf4e0a3d
      Nikolay Aleksandrov 提交于
      This patch adds support for ECMP hash policy choice via a new sysctl
      called fib_multipath_hash_policy and also adds support for L4 hashes.
      The current values for fib_multipath_hash_policy are:
       0 - layer 3 (default)
       1 - layer 4
      If there's an skb hash already set and it matches the chosen policy then it
      will be used instead of being calculated (currently only for L4).
      In L3 mode we always calculate the hash due to the ICMP error special
      case, the flow dissector's field consistentification should handle the
      address order thus we can remove the address reversals.
      If the skb is provided we always use it for the hash calculation,
      otherwise we fallback to fl4, that is if skb is NULL fl4 has to be set.
      Signed-off-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bf4e0a3d
  4. 09 2月, 2017 1 次提交
  5. 31 1月, 2017 1 次提交
  6. 13 1月, 2017 1 次提交
  7. 11 1月, 2017 1 次提交
    • D
      net: ipv4: Fix multipath selection with vrf · 7a18c5b9
      David Ahern 提交于
      fib_select_path does not call fib_select_multipath if oif is set in the
      flow struct. For VRF use cases oif is always set, so multipath route
      selection is bypassed. Use the FLOWI_FLAG_SKIP_NH_OIF to skip the oif
      check similar to what is done in fib_table_lookup.
      
      Add saddr and proto to the flow struct for the fib lookup done by the
      VRF driver to better match hash computation for a flow.
      
      Fixes: 613d09b3 ("net: Use VRF device index for lookups on TX")
      Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7a18c5b9
  8. 07 1月, 2017 1 次提交
  9. 25 12月, 2016 1 次提交
  10. 04 12月, 2016 1 次提交
    • I
      ipv4: fib: Export free_fib_info() · b423cb10
      Ido Schimmel 提交于
      The FIB notification chain is going to be converted to an atomic chain,
      which means switchdev drivers will have to offload FIB entries in
      deferred work, as hardware operations entail sleeping.
      
      However, while the work is queued fib info might be freed, so a
      reference must be taken. To release the reference (and potentially free
      the fib info) fib_info_put() will be called, which in turn calls
      free_fib_info().
      
      Export free_fib_info() so that modules will be able to invoke
      fib_info_put().
      Signed-off-by: NIdo Schimmel <idosch@mellanox.com>
      Signed-off-by: NJiri Pirko <jiri@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b423cb10
  11. 07 9月, 2016 1 次提交
  12. 20 8月, 2016 1 次提交
  13. 12 7月, 2016 1 次提交
    • J
      ipv4: reject RTNH_F_DEAD and RTNH_F_LINKDOWN from user space · 80610229
      Julian Anastasov 提交于
      Vegard Nossum is reporting for a crash in fib_dump_info
      when nh_dev = NULL and fib_nhs == 1:
      
      Pid: 50, comm: netlink.exe Not tainted 4.7.0-rc5+
      RIP: 0033:[<00000000602b3d18>]
      RSP: 0000000062623890  EFLAGS: 00010202
      RAX: 0000000000000000 RBX: 000000006261b800 RCX: 0000000000000000
      RDX: 0000000000000000 RSI: 0000000000000024 RDI: 000000006245ba00
      RBP: 00000000626238f0 R08: 000000000000029c R09: 0000000000000000
      R10: 0000000062468038 R11: 000000006245ba00 R12: 000000006245ba00
      R13: 00000000625f96c0 R14: 00000000601e16f0 R15: 0000000000000000
      Kernel panic - not syncing: Kernel mode fault at addr 0x2e0, ip 0x602b3d18
      CPU: 0 PID: 50 Comm: netlink.exe Not tainted 4.7.0-rc5+ #581
      Stack:
       626238f0 960226a02 00000400 000000fe
       62623910 600afca7 62623970 62623a48
       62468038 00000018 00000000 00000000
      Call Trace:
       [<602b3e93>] rtmsg_fib+0xd3/0x190
       [<602b6680>] fib_table_insert+0x260/0x500
       [<602b0e5d>] inet_rtm_newroute+0x4d/0x60
       [<60250def>] rtnetlink_rcv_msg+0x8f/0x270
       [<60267079>] netlink_rcv_skb+0xc9/0xe0
       [<60250d4b>] rtnetlink_rcv+0x3b/0x50
       [<60265400>] netlink_unicast+0x1a0/0x2c0
       [<60265e47>] netlink_sendmsg+0x3f7/0x470
       [<6021dc9a>] sock_sendmsg+0x3a/0x90
       [<6021e0d0>] ___sys_sendmsg+0x300/0x360
       [<6021fa64>] __sys_sendmsg+0x54/0xa0
       [<6021fac0>] SyS_sendmsg+0x10/0x20
       [<6001ea68>] handle_syscall+0x88/0x90
       [<600295fd>] userspace+0x3fd/0x500
       [<6001ac55>] fork_handler+0x85/0x90
      
      $ addr2line -e vmlinux -i 0x602b3d18
      include/linux/inetdevice.h:222
      net/ipv4/fib_semantics.c:1264
      
      Problem happens when RTNH_F_LINKDOWN is provided from user space
      when creating routes that do not use the flag, catched with
      netlink fuzzer.
      
      Currently, the kernel allows user space to set both flags
      to nh_flags and fib_flags but this is not intentional, the
      assumption was that they are not set. Fix this by rejecting
      both flags with EINVAL.
      Reported-by: NVegard Nossum <vegard.nossum@oracle.com>
      Fixes: 0eeb075f ("net: ipv4 sysctl option to ignore routes when nexthop link is down")
      Signed-off-by: NJulian Anastasov <ja@ssi.bg>
      Cc: Andy Gospodarek <gospo@cumulusnetworks.com>
      Cc: Dinesh Dutt <ddutt@cumulusnetworks.com>
      Cc: Scott Feldman <sfeldma@gmail.com>
      Reviewed-by: NAndy Gospodarek <gospo@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      80610229
  14. 15 5月, 2016 1 次提交
    • P
      net/route: enforce hoplimit max value · 626abd59
      Paolo Abeni 提交于
      Currently, when creating or updating a route, no check is performed
      in both ipv4 and ipv6 code to the hoplimit value.
      
      The caller can i.e. set hoplimit to 256, and when such route will
       be used, packets will be sent with hoplimit/ttl equal to 0.
      
      This commit adds checks for the RTAX_HOPLIMIT value, in both ipv4
      ipv6 route code, substituting any value greater than 255 with 255.
      
      This is consistent with what is currently done for ADVMSS and MTU
      in the ipv4 code.
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      626abd59
  15. 12 4月, 2016 1 次提交
    • D
      net: ipv4: Consider failed nexthops in multipath routes · a6db4494
      David Ahern 提交于
      Multipath route lookups should consider knowledge about next hops and not
      select a hop that is known to be failed.
      
      Example:
      
                           [h2]                   [h3]   15.0.0.5
                            |                      |
                           3|                     3|
                          [SP1]                  [SP2]--+
                           1  2                   1     2
                           |  |     /-------------+     |
                           |   \   /                    |
                           |     X                      |
                           |    / \                     |
                           |   /   \---------------\    |
                           1  2                     1   2
               12.0.0.2  [TOR1] 3-----------------3 [TOR2] 12.0.0.3
                           4                         4
                            \                       /
                              \                    /
                               \                  /
                                -------|   |-----/
                                       1   2
                                      [TOR3]
                                        3|
                                         |
                                        [h1]  12.0.0.1
      
      host h1 with IP 12.0.0.1 has 2 paths to host h3 at 15.0.0.5:
      
          root@h1:~# ip ro ls
          ...
          12.0.0.0/24 dev swp1  proto kernel  scope link  src 12.0.0.1
          15.0.0.0/16
                  nexthop via 12.0.0.2  dev swp1 weight 1
                  nexthop via 12.0.0.3  dev swp1 weight 1
          ...
      
      If the link between tor3 and tor1 is down and the link between tor1
      and tor2 then tor1 is effectively cut-off from h1. Yet the route lookups
      in h1 are alternating between the 2 routes: ping 15.0.0.5 gets one and
      ssh 15.0.0.5 gets the other. Connections that attempt to use the
      12.0.0.2 nexthop fail since that neighbor is not reachable:
      
          root@h1:~# ip neigh show
          ...
          12.0.0.3 dev swp1 lladdr 00:02:00:00:00:1b REACHABLE
          12.0.0.2 dev swp1  FAILED
          ...
      
      The failed path can be avoided by considering known neighbor information
      when selecting next hops. If the neighbor lookup fails we have no
      knowledge about the nexthop, so give it a shot. If there is an entry
      then only select the nexthop if the state is sane. This is similar to
      what fib_detect_death does.
      
      To maintain backward compatibility use of the neighbor information is
      based on a new sysctl, fib_multipath_use_neigh.
      Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Reviewed-by: NJulian Anastasov <ja@ssi.bg>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a6db4494
  16. 05 11月, 2015 1 次提交
  17. 03 11月, 2015 1 次提交
    • P
      ipv4: use l4 hash for locally generated multipath flows · 9920e48b
      Paolo Abeni 提交于
      This patch changes how the multipath hash is computed for locally
      generated flows: now the hash comprises l4 information.
      
      This allows better utilization of the available paths when the existing
      flows have the same source IP and the same destination IP: with l3 hash,
      even when multiple connections are in place simultaneously, a single path
      will be used, while with l4 hash we can use all the available paths.
      
      v2 changes:
      - use get_hash_from_flowi4() instead of implementing just another l4 hash
        function
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9920e48b
  18. 02 11月, 2015 2 次提交
    • J
      ipv4: update RTNH_F_LINKDOWN flag on UP event · c9b3292e
      Julian Anastasov 提交于
      When nexthop is part of multipath route we should clear the
      LINKDOWN flag when link goes UP or when first address is added.
      This is needed because we always set LINKDOWN flag when DEAD flag
      was set but now on UP the nexthop is not dead anymore. Examples when
      LINKDOWN bit can be forgotten when no NETDEV_CHANGE is delivered:
      
      - link goes down (LINKDOWN is set), then link goes UP and device
      shows carrier OK but LINKDOWN remains set
      
      - last address is deleted (LINKDOWN is set), then address is
      added and device shows carrier OK but LINKDOWN remains set
      
      Steps to reproduce:
      modprobe dummy
      ifconfig dummy0 192.168.168.1 up
      
      here add a multipath route where one nexthop is for dummy0:
      
      ip route add 1.2.3.4 nexthop dummy0 nexthop SOME_OTHER_DEVICE
      ifconfig dummy0 down
      ifconfig dummy0 up
      
      now ip route shows nexthop that is not dead. Now set the sysctl var:
      
      echo 1 > /proc/sys/net/ipv4/conf/dummy0/ignore_routes_with_linkdown
      
      now ip route will show a dead nexthop because the forgotten
      RTNH_F_LINKDOWN is propagated as RTNH_F_DEAD.
      
      Fixes: 8a3d0316 ("net: track link-status of ipv4 nexthops")
      Signed-off-by: NJulian Anastasov <ja@ssi.bg>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c9b3292e
    • J
      ipv4: fix to not remove local route on link down · 4f823def
      Julian Anastasov 提交于
      When fib_netdev_event calls fib_disable_ip on NETDEV_DOWN event
      we should not delete the local routes if the local address
      is still present. The confusion comes from the fact that both
      fib_netdev_event and fib_inetaddr_event use the NETDEV_DOWN
      constant. Fix it by returning back the variable 'force'.
      
      Steps to reproduce:
      modprobe dummy
      ifconfig dummy0 192.168.168.1 up
      ifconfig dummy0 down
      ip route list table local | grep dummy | grep host
      local 192.168.168.1 dev dummy0  proto kernel  scope host  src 192.168.168.1
      
      Fixes: 8a3d0316 ("net: track link-status of ipv4 nexthops")
      Signed-off-by: NJulian Anastasov <ja@ssi.bg>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4f823def
  19. 16 10月, 2015 1 次提交
    • D
      net: Fix suspicious RCU usage in fib_rebalance · 51161aa9
      David Ahern 提交于
      This command:
        ip route add 192.168.1.0/24 nexthop via 10.2.1.5 dev eth1 nexthop via 10.2.2.5 dev eth2
      
      generated this suspicious RCU usage message:
      
      [ 63.249262]
      [ 63.249939] ===============================
      [ 63.251571] [ INFO: suspicious RCU usage. ]
      [ 63.253250] 4.3.0-rc3+ #298 Not tainted
      [ 63.254724] -------------------------------
      [ 63.256401] ../include/linux/inetdevice.h:205 suspicious rcu_dereference_check() usage!
      [ 63.259450]
      [ 63.259450] other info that might help us debug this:
      [ 63.259450]
      [ 63.262297]
      [ 63.262297] rcu_scheduler_active = 1, debug_locks = 1
      [ 63.264647] 1 lock held by ip/2870:
      [ 63.265896] #0: (rtnl_mutex){+.+.+.}, at: [<ffffffff813ebfb7>] rtnl_lock+0x12/0x14
      [ 63.268858]
      [ 63.268858] stack backtrace:
      [ 63.270409] CPU: 4 PID: 2870 Comm: ip Not tainted 4.3.0-rc3+ #298
      [ 63.272478] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
      [ 63.275745] 0000000000000001 ffff8800b8c9f8b8 ffffffff8125f73c ffff88013afcf301
      [ 63.278185] ffff8800bab7a380 ffff8800b8c9f8e8 ffffffff8107bf30 ffff8800bb728000
      [ 63.280634] ffff880139fe9a60 0000000000000000 ffff880139fe9a00 ffff8800b8c9f908
      [ 63.283177] Call Trace:
      [ 63.283959] [<ffffffff8125f73c>] dump_stack+0x4c/0x68
      [ 63.285593] [<ffffffff8107bf30>] lockdep_rcu_suspicious+0xfa/0x103
      [ 63.287500] [<ffffffff8144d752>] __in_dev_get_rcu+0x48/0x4f
      [ 63.289169] [<ffffffff8144d797>] fib_rebalance+0x3e/0x127
      [ 63.290753] [<ffffffff8144d986>] ? rcu_read_unlock+0x3e/0x5f
      [ 63.292442] [<ffffffff8144ea45>] fib_create_info+0xaf9/0xdcc
      [ 63.294093] [<ffffffff8106c12f>] ? sched_clock_local+0x12/0x75
      [ 63.295791] [<ffffffff8145236a>] fib_table_insert+0x8c/0x451
      [ 63.297493] [<ffffffff8144bf9c>] ? fib_get_table+0x36/0x43
      [ 63.299109] [<ffffffff8144c3ca>] inet_rtm_newroute+0x43/0x51
      [ 63.300709] [<ffffffff813ef684>] rtnetlink_rcv_msg+0x182/0x195
      [ 63.302334] [<ffffffff8107d04c>] ? trace_hardirqs_on+0xd/0xf
      [ 63.303888] [<ffffffff813ebfb7>] ? rtnl_lock+0x12/0x14
      [ 63.305346] [<ffffffff813ef502>] ? __rtnl_unlock+0x12/0x12
      [ 63.306878] [<ffffffff81407c4c>] netlink_rcv_skb+0x3d/0x90
      [ 63.308437] [<ffffffff813ec00e>] rtnetlink_rcv+0x21/0x28
      [ 63.309916] [<ffffffff81407742>] netlink_unicast+0xfa/0x17f
      [ 63.311447] [<ffffffff81407a5e>] netlink_sendmsg+0x297/0x2dc
      [ 63.313029] [<ffffffff813c6cd4>] sock_sendmsg_nosec+0x12/0x1d
      [ 63.314597] [<ffffffff813c835b>] ___sys_sendmsg+0x196/0x21b
      [ 63.316125] [<ffffffff8100bf9f>] ? native_sched_clock+0x1f/0x3c
      [ 63.317671] [<ffffffff8106c12f>] ? sched_clock_local+0x12/0x75
      [ 63.319185] [<ffffffff8106c397>] ? sched_clock_cpu+0x9d/0xb6
      [ 63.320693] [<ffffffff8107e2d7>] ? __lock_is_held+0x32/0x54
      [ 63.322145] [<ffffffff81159fcb>] ? __fget_light+0x4b/0x77
      [ 63.323541] [<ffffffff813c8726>] __sys_sendmsg+0x3d/0x5b
      [ 63.324947] [<ffffffff813c8751>] SyS_sendmsg+0xd/0x19
      [ 63.326274] [<ffffffff814c8f57>] entry_SYSCALL_64_fastpath+0x12/0x6f
      
      It looks like all of the code paths to fib_rebalance are under rtnl.
      
      Fixes: 0e884c78 ("ipv4: L3 hash-based multipath")
      Cc: Peter Nørlund <pch@ordbogen.com>
      Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      51161aa9
  20. 07 10月, 2015 1 次提交
  21. 06 10月, 2015 1 次提交
  22. 05 10月, 2015 1 次提交
  23. 02 9月, 2015 1 次提交
  24. 01 9月, 2015 3 次提交
  25. 25 8月, 2015 1 次提交
  26. 21 8月, 2015 2 次提交
  27. 19 8月, 2015 1 次提交
  28. 17 8月, 2015 1 次提交
  29. 14 8月, 2015 3 次提交
    • D
      net: Use passed in table for nexthop lookups · 3bfd8472
      David Ahern 提交于
      If a user passes in a table for new routes use that table for nexthop
      lookups. Specifically, this solves the case where a connected route does
      not exist in the main table, but only another table and then a subsequent
      route is added with a next hop using the connected route. ie.,
      
      $ ip route ls
      default via 10.0.2.2 dev eth0
      10.0.2.0/24 dev eth0  proto kernel  scope link  src 10.0.2.15
      169.254.0.0/16 dev eth0  scope link  metric 1003
      192.168.56.0/24 dev eth1  proto kernel  scope link  src 192.168.56.51
      
      $ ip route ls table 10
      1.1.1.0/24 dev eth2  scope link
      
      Without this patch adding a nexthop route fails:
      
      $ ip route add table 10 2.2.2.0/24 via 1.1.1.10
      RTNETLINK answers: Network is unreachable
      
      With this patch the route is added successfully.
      Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3bfd8472
    • D
      net: Add routes to the table associated with the device · 021dd3b8
      David Ahern 提交于
      When a device associated with a VRF is brought up or down routes
      should be added to/removed from the table associated with the VRF.
      fib_magic defaults to using the main or local tables. Have it use
      the table with the device if there is one.
      
      A part of this is directing prefsrc validations to the correct
      table as well.
      Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      021dd3b8
    • D
      net: Fix up inet_addr_type checks · 30bbaa19
      David Ahern 提交于
      Currently inet_addr_type and inet_dev_addr_type expect local addresses
      to be in the local table. With the VRF device local routes for devices
      associated with a VRF will be in the table associated with the VRF.
      Provide an alternate inet_addr lookup to use a specific table rather
      than defaulting to the local table.
      
      inet_addr_type_dev_table keeps the same semantics as inet_addr_type but
      if the passed in device is enslaved to a VRF then the table for that VRF
      is used for the lookup.
      Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      30bbaa19
  30. 27 7月, 2015 2 次提交
  31. 25 7月, 2015 2 次提交
    • J
      ipv4: consider TOS in fib_select_default · 2392debc
      Julian Anastasov 提交于
      fib_select_default considers alternative routes only when
      res->fi is for the first alias in res->fa_head. In the
      common case this can happen only when the initial lookup
      matches the first alias with highest TOS value. This
      prevents the alternative routes to require specific TOS.
      
      This patch solves the problem as follows:
      
      - routes that require specific TOS should be returned by
      fib_select_default only when TOS matches, as already done
      in fib_table_lookup. This rule implies that depending on the
      TOS we can have many different lists of alternative gateways
      and we have to keep the last used gateway (fa_default) in first
      alias for the TOS instead of using single tb_default value.
      
      - as the aliases are ordered by many keys (TOS desc,
      fib_priority asc), we restrict the possible results to
      routes with matching TOS and lowest metric (fib_priority)
      and routes that match any TOS, again with lowest metric.
      
      For example, packet with TOS 8 can not use gw3 (not lowest
      metric), gw4 (different TOS) and gw6 (not lowest metric),
      all other gateways can be used:
      
      tos 8 via gw1 metric 2 <--- res->fa_head and res->fi
      tos 8 via gw2 metric 2
      tos 8 via gw3 metric 3
      tos 4 via gw4
      tos 0 via gw5
      tos 0 via gw6 metric 1
      Reported-by: NHagen Paul Pfeifer <hagen@jauu.net>
      Signed-off-by: NJulian Anastasov <ja@ssi.bg>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2392debc
    • J
      ipv4: fib_select_default should match the prefix · 18a912e9
      Julian Anastasov 提交于
      fib_trie starting from 4.1 can link fib aliases from
      different prefixes in same list. Make sure the alternative
      gateways are in same table and for same prefix (0) by
      checking tb_id and fa_slen.
      
      Fixes: 79e5ad2c ("fib_trie: Remove leaf_info")
      Signed-off-by: NJulian Anastasov <ja@ssi.bg>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      18a912e9