1. 31 5月, 2019 1 次提交
  2. 09 4月, 2019 3 次提交
    • D
      ipv4: Add helpers for neigh lookup for nexthop · 5c9f7c1d
      David Ahern 提交于
      A common theme in the output path is looking up a neigh entry for a
      nexthop, either the gateway in an rtable or a fallback to the daddr
      in the skb:
      
              nexthop = (__force u32)rt_nexthop(rt, ip_hdr(skb)->daddr);
              neigh = __ipv4_neigh_lookup_noref(dev, nexthop);
              if (unlikely(!neigh))
                      neigh = __neigh_create(&arp_tbl, &nexthop, dev, false);
      
      To allow the nexthop to be an IPv6 address we need to consider the
      family of the nexthop and then call __ipv{4,6}_neigh_lookup_noref based
      on it.
      
      To make this simpler, add a ip_neigh_gw4 helper similar to ip_neigh_gw6
      added in an earlier patch which handles:
      
              neigh = __ipv4_neigh_lookup_noref(dev, nexthop);
              if (unlikely(!neigh))
                      neigh = __neigh_create(&arp_tbl, &nexthop, dev, false);
      
      And then add a second one, ip_neigh_for_gw, that calls either
      ip_neigh_gw4 or ip_neigh_gw6 based on the address family of the gateway.
      
      Update the output paths in the VRF driver and core v4 code to use
      ip_neigh_for_gw simplifying the family based lookup and making both
      ready for a v6 nexthop.
      
      ipv4_neigh_lookup has a different need - the potential to resolve a
      passed in address in addition to any gateway in the rtable or skb. Since
      this is a one-off, add ip_neigh_gw4 and ip_neigh_gw6 diectly. The
      difference between __neigh_create used by the helpers and neigh_create
      called by ipv4_neigh_lookup is taking a refcount, so add rcu_read_lock_bh
      and bump the refcnt on the neigh entry.
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5c9f7c1d
    • D
      ipv4: Add support to rtable for ipv6 gateway · 0f5f7d7b
      David Ahern 提交于
      Add support for an IPv6 gateway to rtable. Since a gateway is either
      IPv4 or IPv6, make it a union with rt_gw4 where rt_gw_family decides
      which address is in use.
      
      When dumping the route data, encode an ipv6 nexthop using RTA_VIA.
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Reviewed-by: NIdo Schimmel <idosch@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0f5f7d7b
    • D
      ipv4: Prepare rtable for IPv6 gateway · 1550c171
      David Ahern 提交于
      To allow the gateway to be either an IPv4 or IPv6 address, remove
      rt_uses_gateway from rtable and replace with rt_gw_family. If
      rt_gw_family is set it implies rt_uses_gateway. Rename rt_gateway
      to rt_gw4 to represent the IPv4 version.
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Reviewed-by: NIdo Schimmel <idosch@mellanox.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1550c171
  3. 27 9月, 2018 2 次提交
  4. 29 5月, 2018 1 次提交
  5. 15 3月, 2018 1 次提交
    • S
      ipv4: lock mtu in fnhe when received PMTU < net.ipv4.route.min_pmtu · d52e5a7e
      Sabrina Dubroca 提交于
      Prior to the rework of PMTU information storage in commit
      2c8cec5c ("ipv4: Cache learned PMTU information in inetpeer."),
      when a PMTU event advertising a PMTU smaller than
      net.ipv4.route.min_pmtu was received, we would disable setting the DF
      flag on packets by locking the MTU metric, and set the PMTU to
      net.ipv4.route.min_pmtu.
      
      Since then, we don't disable DF, and set PMTU to
      net.ipv4.route.min_pmtu, so the intermediate router that has this link
      with a small MTU will have to drop the packets.
      
      This patch reestablishes pre-2.6.39 behavior by splitting
      rtable->rt_pmtu into a bitfield with rt_mtu_locked and rt_pmtu.
      rt_mtu_locked indicates that we shouldn't set the DF bit on that path,
      and is checked in ip_dont_fragment().
      
      One possible workaround is to set net.ipv4.route.min_pmtu to a value low
      enough to accommodate the lowest MTU encountered.
      
      Fixes: 2c8cec5c ("ipv4: Cache learned PMTU information in inetpeer.")
      Signed-off-by: NSabrina Dubroca <sd@queasysnail.net>
      Reviewed-by: NStefano Brivio <sbrivio@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d52e5a7e
  6. 16 2月, 2018 2 次提交
    • X
      xfrm: reuse uncached_list to track xdsts · 510c321b
      Xin Long 提交于
      In early time, when freeing a xdst, it would be inserted into
      dst_garbage.list first. Then if it's refcnt was still held
      somewhere, later it would be put into dst_busy_list in
      dst_gc_task().
      
      When one dev was being unregistered, the dev of these dsts in
      dst_busy_list would be set with loopback_dev and put this dev.
      So that this dev's removal wouldn't get blocked, and avoid the
      kmsg warning:
      
        kernel:unregister_netdevice: waiting for veth0 to become \
        free. Usage count = 2
      
      However after Commit 52df157f ("xfrm: take refcnt of dst
      when creating struct xfrm_dst bundle"), the xdst will not be
      freed with dst gc, and this warning happens.
      
      To fix it, we need to find these xdsts that are still held by
      others when removing the dev, and free xdst's dev and set it
      with loopback_dev.
      
      But unfortunately after flow_cache for xfrm was deleted, no
      list tracks them anymore. So we need to save these xdsts
      somewhere to release the xdst's dev later.
      
      To make this easier, this patch is to reuse uncached_list to
      track xdsts, so that the dev refcnt can be released in the
      event NETDEV_UNREGISTER process of fib_netdev_notifier.
      
      Thanks to Florian, we could move forward this fix quickly.
      
      Fixes: 52df157f ("xfrm: take refcnt of dst when creating struct xfrm_dst bundle")
      Reported-by: NJianlin Shi <jishi@redhat.com>
      Reported-by: NHangbin Liu <liuhangbin@gmail.com>
      Tested-by: NEyal Birger <eyal.birger@gmail.com>
      Signed-off-by: NXin Long <lucien.xin@gmail.com>
      Signed-off-by: NSteffen Klassert <steffen.klassert@secunet.com>
      510c321b
    • D
      net/ipv4: Remove fib table id from rtable · 68e813aa
      David Ahern 提交于
      Remove rt_table_id from rtable. It was added for getroute to return the
      table id that was hit in the lookup. With the changes for fibmatch the
      table id can be extracted from the fib_info returned in the fib_result
      so it no longer needs to be in rtable directly.
      Signed-off-by: NDavid Ahern <dsahern@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      68e813aa
  7. 25 1月, 2018 1 次提交
  8. 01 10月, 2017 1 次提交
    • P
      udp: perform source validation for mcast early demux · bc044e8d
      Paolo Abeni 提交于
      The UDP early demux can leverate the rx dst cache even for
      multicast unconnected sockets.
      
      In such scenario the ipv4 source address is validated only on
      the first packet in the given flow. After that, when we fetch
      the dst entry  from the socket rx cache, we stop enforcing
      the rp_filter and we even start accepting any kind of martian
      addresses.
      
      Disabling the dst cache for unconnected multicast socket will
      cause large performace regression, nearly reducing by half the
      max ingress tput.
      
      Instead we factor out a route helper to completely validate an
      skb source address for multicast packets and we call it from
      the UDP early demux for mcast packets landing on unconnected
      sockets, after successful fetching the related cached dst entry.
      
      This still gives a measurable, but limited performance
      regression:
      
      		rp_filter = 0		rp_filter = 1
      edmux disabled:	1182 Kpps		1127 Kpps
      edmux before:	2238 Kpps		2238 Kpps
      edmux after:	2037 Kpps		2019 Kpps
      
      The above figures are on top of current net tree.
      Applying the net-next commit 6e617de8 ("net: avoid a full
      fib lookup when rp_filter is disabled.") the delta with
      rp_filter == 0 will decrease even more.
      
      Fixes: 421b3885 ("udp: ipv4: Add udp early demux")
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bc044e8d
  9. 22 9月, 2017 1 次提交
    • E
      net: prevent dst uses after free · 222d7dbd
      Eric Dumazet 提交于
      In linux-4.13, Wei worked hard to convert dst to a traditional
      refcounted model, removing GC.
      
      We now want to make sure a dst refcount can not transition from 0 back
      to 1.
      
      The problem here is that input path attached a not refcounted dst to an
      skb. Then later, because packet is forwarded and hits skb_dst_force()
      before exiting RCU section, we might try to take a refcount on one dst
      that is about to be freed, if another cpu saw 1 -> 0 transition in
      dst_release() and queued the dst for freeing after one RCU grace period.
      
      Lets unify skb_dst_force() and skb_dst_force_safe(), since we should
      always perform the complete check against dst refcount, and not assume
      it is not zero.
      
      Bugzilla : https://bugzilla.kernel.org/show_bug.cgi?id=197005
      
      [  989.919496]  skb_dst_force+0x32/0x34
      [  989.919498]  __dev_queue_xmit+0x1ad/0x482
      [  989.919501]  ? eth_header+0x28/0xc6
      [  989.919502]  dev_queue_xmit+0xb/0xd
      [  989.919504]  neigh_connected_output+0x9b/0xb4
      [  989.919507]  ip_finish_output2+0x234/0x294
      [  989.919509]  ? ipt_do_table+0x369/0x388
      [  989.919510]  ip_finish_output+0x12c/0x13f
      [  989.919512]  ip_output+0x53/0x87
      [  989.919513]  ip_forward_finish+0x53/0x5a
      [  989.919515]  ip_forward+0x2cb/0x3e6
      [  989.919516]  ? pskb_trim_rcsum.part.9+0x4b/0x4b
      [  989.919518]  ip_rcv_finish+0x2e2/0x321
      [  989.919519]  ip_rcv+0x26f/0x2eb
      [  989.919522]  ? vlan_do_receive+0x4f/0x289
      [  989.919523]  __netif_receive_skb_core+0x467/0x50b
      [  989.919526]  ? tcp_gro_receive+0x239/0x239
      [  989.919529]  ? inet_gro_receive+0x226/0x238
      [  989.919530]  __netif_receive_skb+0x4d/0x5f
      [  989.919532]  netif_receive_skb_internal+0x5c/0xaf
      [  989.919533]  napi_gro_receive+0x45/0x81
      [  989.919536]  ixgbe_poll+0xc8a/0xf09
      [  989.919539]  ? kmem_cache_free_bulk+0x1b6/0x1f7
      [  989.919540]  net_rx_action+0xf4/0x266
      [  989.919543]  __do_softirq+0xa8/0x19d
      [  989.919545]  irq_exit+0x5d/0x6b
      [  989.919546]  do_IRQ+0x9c/0xb5
      [  989.919548]  common_interrupt+0x93/0x93
      [  989.919548]  </IRQ>
      
      Similarly dst_clone() can use dst_hold() helper to have additional
      debugging, as a follow up to commit 44ebe791 ("net: add debug
      atomic_inc_not_zero() in dst_hold()")
      
      In net-next we will convert dst atomic_t to refcount_t for peace of
      mind.
      
      Fixes: a4c2fd7f ("net: remove DST_NOCACHE flag")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Wei Wang <weiwan@google.com>
      Reported-by: NPaweł Staszewski <pstaszewski@itcare.pl>
      Bisected-by: NPaweł Staszewski <pstaszewski@itcare.pl>
      Acked-by: NWei Wang <weiwan@google.com>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      222d7dbd
  10. 04 9月, 2017 1 次提交
  11. 18 6月, 2017 1 次提交
    • W
      ipv4: call dst_hold_safe() properly · 9df16efa
      Wei Wang 提交于
      This patch checks all the calls to
      dst_hold()/skb_dst_force()/dst_clone()/dst_use() to see if
      dst_hold_safe() is needed to avoid double free issue if dst
      gc is removed and dst_release() directly destroys dst when
      dst->__refcnt drops to 0.
      
      In tx path, TCP hold sk->sk_rx_dst ref count and also hold sock_lock().
      UDP and other similar protocols always hold refcount for
      skb->_skb_refdst. So both paths seem to be safe.
      
      In rx path, as it is lockless and skb_dst_set_noref() is likely to be
      used, dst_hold_safe() should always be used when trying to hold dst.
      
      In the routing code, if dst is held during an rcu protected session, it
      is necessary to call dst_hold_safe() as the current dst might be in its
      rcu grace period.
      Signed-off-by: NWei Wang <weiwan@google.com>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9df16efa
  12. 27 5月, 2017 2 次提交
  13. 09 5月, 2017 2 次提交
  14. 22 3月, 2017 1 次提交
    • N
      net: ipv4: add support for ECMP hash policy choice · bf4e0a3d
      Nikolay Aleksandrov 提交于
      This patch adds support for ECMP hash policy choice via a new sysctl
      called fib_multipath_hash_policy and also adds support for L4 hashes.
      The current values for fib_multipath_hash_policy are:
       0 - layer 3 (default)
       1 - layer 4
      If there's an skb hash already set and it matches the chosen policy then it
      will be used instead of being calculated (currently only for L4).
      In L3 mode we always calculate the hash due to the ICMP error special
      case, the flow dissector's field consistentification should handle the
      address order thus we can remove the address reversals.
      If the skb is provided we always use it for the hash calculation,
      otherwise we fallback to fl4, that is if skb is NULL fl4 has to be set.
      Signed-off-by: NNikolay Aleksandrov <nikolay@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bf4e0a3d
  15. 05 11月, 2016 1 次提交
    • L
      net: inet: Support UID-based routing in IP protocols. · e2d118a1
      Lorenzo Colitti 提交于
      - Use the UID in routing lookups made by protocol connect() and
        sendmsg() functions.
      - Make sure that routing lookups triggered by incoming packets
        (e.g., Path MTU discovery) take the UID of the socket into
        account.
      - For packets not associated with a userspace socket, (e.g., ping
        replies) use UID 0 inside the user namespace corresponding to
        the network namespace the socket belongs to. This allows
        all namespaces to apply routing and iptables rules to
        kernel-originated traffic in that namespaces by matching UID 0.
        This is better than using the UID of the kernel socket that is
        sending the traffic, because the UID of kernel sockets created
        at namespace creation time (e.g., the per-processor ICMP and
        TCP sockets) is the UID of the user that created the socket,
        which might not be mapped in the namespace.
      
      Tested: compiles allnoconfig, allyesconfig, allmodconfig
      Tested: https://android-review.googlesource.com/253302Signed-off-by: NLorenzo Colitti <lorenzo@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e2d118a1
  16. 11 9月, 2016 1 次提交
  17. 12 4月, 2016 1 次提交
    • D
      net: vrf: Fix dst reference counting · 9ab179d8
      David Ahern 提交于
      Vivek reported a kernel exception deleting a VRF with an active
      connection through it. The root cause is that the socket has a cached
      reference to a dst that is destroyed. Converting the dst_destroy to
      dst_release and letting proper reference counting kick in does not
      work as the dst has a reference to the device which needs to be released
      as well.
      
      I talked to Hannes about this at netdev and he pointed out the ipv4 and
      ipv6 dst handling has dst_ifdown for just this scenario. Rather than
      continuing with the reinvented dst wheel in VRF just remove it and
      leverage the ipv4 and ipv6 versions.
      
      Fixes: 193125db ("net: Introduce VRF device driver")
      Fixes: 35402e31 ("net: Add IPv6 support to VRF device")
      Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9ab179d8
  18. 08 4月, 2016 1 次提交
  19. 17 2月, 2016 1 次提交
  20. 05 1月, 2016 1 次提交
    • D
      net: Propagate lookup failure in l3mdev_get_saddr to caller · b5bdacf3
      David Ahern 提交于
      Commands run in a vrf context are not failing as expected on a route lookup:
          root@kenny:~# ip ro ls table vrf-red
          unreachable default
      
          root@kenny:~# ping -I vrf-red -c1 -w1 10.100.1.254
          ping: Warning: source address might be selected on device other than vrf-red.
          PING 10.100.1.254 (10.100.1.254) from 0.0.0.0 vrf-red: 56(84) bytes of data.
      
          --- 10.100.1.254 ping statistics ---
          2 packets transmitted, 0 received, 100% packet loss, time 999ms
      
      Since the vrf table does not have a route for 10.100.1.254 the ping
      should have failed. The saddr lookup causes a full VRF table lookup.
      Propogating a lookup failure to the user allows the command to fail as
      expected:
      
          root@kenny:~# ping -I vrf-red -c1 -w1 10.100.1.254
          connect: No route to host
      Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b5bdacf3
  21. 07 10月, 2015 2 次提交
  22. 05 10月, 2015 1 次提交
  23. 30 9月, 2015 2 次提交
  24. 26 9月, 2015 1 次提交
  25. 18 9月, 2015 1 次提交
  26. 16 9月, 2015 1 次提交
  27. 02 9月, 2015 1 次提交
  28. 21 8月, 2015 1 次提交
  29. 14 8月, 2015 3 次提交
  30. 22 7月, 2015 1 次提交