1. 29 May, 2018 1 commit
  2. 28 March, 2018 1 commit
  3. 13 February, 2018 1 commit
    • K
      net: Convert pernet_subsys, registered from inet_init() · f84c6821
      Kirill Tkhai committed
      arp_net_ops just adds/removes a /proc entry.
      
      devinet_ops allocates and frees a duplicate of the init_net tables
      and (un)registers sysctl entries.
      
      fib_net_ops allocates and frees pernet tables, creates/destroys
      netlink socket and (un)initializes /proc entries. Foreign
      pernet_operations do not touch them.
      
      ip_rt_proc_ops only modifies pernet /proc entries.
      
      xfrm_net_ops creates/destroys /proc entries, allocates/frees
      pernet statistics, hashes and tables, and (un)initializes
      sysctl files. These are not touched by foreign pernet_operations.
      
      xfrm4_net_ops allocates/frees private pernet memory, and
      configures sysctls.
      
      sysctl_route_ops creates/destroys sysctls.
      
      rt_genid_ops only initializes fields of the just-allocated net.
      
      ipv4_inetpeer_ops allocates/frees net private memory.
      
      igmp_net_ops just creates/destroys /proc files and a socket
      that no one else is interested in.
      
      tcp_sk_ops seems to be safe, because tcp_sk_init() does not
      depend on any other pernet_operations modifications. Iteration
      over the hash table in inet_twsk_purge() is done under the RCU
      lock, and it's safe to iterate the table this way. Removal from
      the table happens in inet_twsk_deschedule_put(), but this
      function is safe without any external locks, as it is synchronized
      internally. There are many examples of it being used in different
      contexts. So, it's safe to leave tcp_sk_exit_batch() unlocked.
      
      tcp_net_metrics_ops is synchronized on tcp_metrics_lock and safe.
      
      udplite4_net_ops only creates/destroys pernet /proc file.
      
      icmp_sk_ops creates percpu sockets, not touched by foreign
      pernet_operations.
      
      ipmr_net_ops creates/destroys pernet fib tables, (un)registers
      fib rules and /proc files. This seems to be safe to execute
      in parallel with foreign pernet_operations.
      
      af_inet_ops just sets up default parameters of newly created net.
      
      ipv4_mib_ops creates and destroys pernet percpu statistics.
      
      raw_net_ops, tcp4_net_ops, udp4_net_ops, ping_v4_net_ops
      and ip_proc_ops only create/destroy pernet /proc files.
      
      ip4_frags_ops creates and destroys sysctl file.
      
      So, it's safe to make the pernet_operations async.
      Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: Andrei Vagin <avagin@virtuozzo.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
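The argument in commit f84c6821 rests on each pernet_operations entry touching only state that belongs to its own namespace. The following is a minimal userspace sketch of that property in Python, not kernel code: `Net`, `PernetOps`, `setup_net` and `cleanup_net` are hypothetical stand-ins, and the reverse-order teardown is an assumption of the model.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Net:
    """Stand-in for a network namespace; holds only per-net state."""
    name: str
    state: Dict[str, object] = field(default_factory=dict)


@dataclass
class PernetOps:
    """Toy analogue of a pernet_operations entry: one init and one exit hook."""
    name: str
    init_fn: Callable[[Net], None]
    exit_fn: Callable[[Net], None]


def setup_net(net: Net, ops_list: List[PernetOps]) -> None:
    # Run every registered init hook against the new namespace, in order.
    for ops in ops_list:
        ops.init_fn(net)


def cleanup_net(net: Net, ops_list: List[PernetOps]) -> None:
    # Tear down in reverse registration order (assumption of this model).
    for ops in reversed(ops_list):
        ops.exit_fn(net)


# Each hook reads and writes only its own key in net.state, which is the
# property the commit message checks for every converted entry.
proc_ops = PernetOps(
    "proc_entry",
    init_fn=lambda net: net.state.__setitem__("proc", ["arp"]),
    exit_fn=lambda net: net.state.pop("proc"),
)
mib_ops = PernetOps(
    "mib",
    init_fn=lambda net: net.state.__setitem__("mib", {"InReceives": 0}),
    exit_fn=lambda net: net.state.pop("mib"),
)

registered = [proc_ops, mib_ops]
net_a, net_b = Net("a"), Net("b")
setup_net(net_a, registered)
setup_net(net_b, registered)
cleanup_net(net_a, registered)  # net_b is unaffected by net_a's teardown
```

Because no hook reaches into another namespace or into another entry's key, the setup and cleanup of `net_a` and `net_b` could be interleaved in any order with the same result, which is the intuition behind marking the operations async.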
  4. 30 January, 2018 1 commit
  5. 25 January, 2018 2 commits
  6. 14 December, 2017 1 commit
  7. 20 October, 2017 1 commit
  8. 18 October, 2017 1 commit
  9. 17 October, 2017 1 commit
    • F
      net: core: rcu-ify rtnl af_ops · 5fa85a09
      Florian Westphal committed
      rtnl af_ops currently rely on rtnl mutex: unregister (called from module
      exit functions) takes the rtnl mutex and all users that do af_ops lookup
      also take the rtnl mutex. IOW, parallel rmmod will block until doit()
      callback is done.
      
      As none of the af_ops implementations sleep, we can use RCU instead.
      
      doit functions that need the af_ops can now use rcu instead of the
      rtnl mutex provided the mutex isn't needed for other reasons.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
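Python has no RCU, so the idea in commit 5fa85a09 can only be approximated. The sketch below is a copy-on-write analogue under stated assumptions: readers take a reference to an immutable tuple without locking, writers publish a replacement tuple, and the garbage collector plays the role of the grace period. `AfOpsRegistry` and its methods are hypothetical names, not the kernel's rtnl af_ops API.

```python
import threading
from typing import Optional, Tuple

Entry = Tuple[int, str]  # (address family number, ops name)


class AfOpsRegistry:
    """Read-mostly registry: lock-free readers, serialized writers."""

    def __init__(self) -> None:
        self._ops: Tuple[Entry, ...] = ()
        self._writer_lock = threading.Lock()

    def register(self, family: int, name: str) -> None:
        with self._writer_lock:
            # Publish a new tuple; existing readers keep the old one.
            self._ops = self._ops + ((family, name),)

    def unregister(self, family: int) -> None:
        with self._writer_lock:
            self._ops = tuple(e for e in self._ops if e[0] != family)

    def snapshot(self) -> Tuple[Entry, ...]:
        # A single reference read; no lock is taken on the read side.
        return self._ops

    def lookup(self, family: int) -> Optional[str]:
        for fam, name in self.snapshot():
            if fam == family:
                return name
        return None


reg = AfOpsRegistry()
reg.register(2, "inet")    # 2 and 10 are used here as AF_INET / AF_INET6
reg.register(10, "inet6")
held = reg.snapshot()      # a reader that started before the unregister
reg.unregister(10)
```

A reader holding `held` still sees both entries and can finish its traversal, while new lookups no longer find family 10. In the kernel the unregister path additionally has to wait for such readers before freeing memory, which this model does not need to do.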
  10. 22 September, 2017 1 commit
    • P
      net: avoid a full fib lookup when rp_filter is disabled. · 6e617de8
      Paolo Abeni committed
      Since commit 1dced6a8 ("ipv4: Restore accept_local behaviour
      in fib_validate_source()") a full fib lookup is needed even if
      the rp_filter is disabled, if accept_local is false - which is
      the default.
      
      What we really need in the above scenario is just to check
      that the source IP address is not local, and in most cases we
      can do that more cheaply by looking up the ifaddr hash table.
      
      This commit adds a helper for such lookup, and uses it to
      validate the src address when rp_filter is disabled and no
      'local' routes are created by the user space in the relevant
      namespace.
      
      A new ipv4 netns flag is added to account for such routes.
      We need that to preserve the same behavior we had before this
      patch.
      
      It also drops the checks to bail early from __fib_validate_source,
      added by commit 1dced6a8 ("ipv4: Restore accept_local
      behaviour in fib_validate_source()"); they do not give any
      measurable performance improvement: if we do the lookup we are
      on a slower path.
      
      This improves UDP performance for unconnected sockets
      when rp_filter is disabled by 5% and also gives a small but
      measurable performance improvement for TCP flood scenarios.
      
      v1 -> v2:
       - use the ifaddr lookup helper in __ip_dev_find(), as suggested
         by Eric
       - fall-back to full lookup if custom local routes are present
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
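The decision described in commit 6e617de8 can be sketched as a fast path and a fallback. This is an illustrative Python model only: `NetNs`, `has_custom_local_routes` (standing in for the new ipv4 netns flag) and `full_fib_lookup_is_local` are hypothetical names, and the "full lookup" here is a placeholder that merely counts how often it runs.

```python
from typing import Set


class NetNs:
    """Toy namespace: a hash of configured addresses plus a per-netns flag."""

    def __init__(self) -> None:
        self.ifaddr_hash: Set[str] = set()
        self.has_custom_local_routes = False  # stand-in for the new netns flag
        self.full_lookups = 0                 # how often the slow path ran

    def full_fib_lookup_is_local(self, addr: str) -> bool:
        # Placeholder for the expensive FIB walk; only the counter matters here.
        self.full_lookups += 1
        return addr in self.ifaddr_hash


def source_is_local(ns: NetNs, addr: str, rp_filter: bool) -> bool:
    # Fast path: rp_filter off and no user-created 'local' routes, so a
    # hash membership test answers the question.
    if not rp_filter and not ns.has_custom_local_routes:
        return addr in ns.ifaddr_hash
    # Otherwise fall back to the full lookup to preserve old behavior.
    return ns.full_fib_lookup_is_local(addr)


ns = NetNs()
ns.ifaddr_hash.add("10.0.0.1")
fast_result = source_is_local(ns, "10.0.0.1", rp_filter=False)
lookups_after_fast = ns.full_lookups

ns.has_custom_local_routes = True
slow_result = source_is_local(ns, "10.0.0.1", rp_filter=False)
lookups_after_slow = ns.full_lookups
```

The flag is what keeps the shortcut safe: once user space has installed its own 'local' routes, the address hash no longer tells the whole story and the model takes the slow path again.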
  11. 10 August, 2017 1 commit
  12. 01 July, 2017 1 commit
  13. 10 June, 2017 1 commit
    • K
      Ipvlan should return an error when an address is already in use. · 3ad7d246
      Krister Johansen committed
      The ipvlan code already knows how to detect when a duplicate address is
      about to be assigned to an ipvlan device.  However, that failure is not
      propagated outward and leads to a silent failure.
      
      Introduce a validation step at ip address creation time and allow device
      drivers to register to validate the incoming ip addresses.  The ipvlan
      code is the first consumer.  If it detects an address in use, we can
      return an error to the user before beginning to commit the new ifa in
      the networking code.
      
      This can be especially useful if it is necessary to provision many
      ipvlans in containers.  The provisioning software (or operator) can use
      this to detect situations where an ip address is unexpectedly in use.
      Signed-off-by: Krister Johansen <kjlx@templeofstupid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
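The validation step in commit 3ad7d246 follows a validate-then-commit pattern. Below is a small Python sketch of that pattern, assuming a simple list of validator callbacks; `validators`, `ipvlan_validate`, `add_address` and the device names are all hypothetical and do not reflect the kernel's actual notifier interface.

```python
from typing import Callable, Dict, List


class AddressInUse(Exception):
    """Raised by a validator to veto an address before it is committed."""


# Validators registered by drivers; each takes (device, address).
validators: List[Callable[[str, str], None]] = []

# Hypothetical ipvlan state: addresses already claimed on the shared port.
ipvlan_port_addrs = {"10.0.0.5"}

# Addresses actually committed by the core code, per device.
committed: Dict[str, List[str]] = {}


def ipvlan_validate(dev: str, addr: str) -> None:
    # Only ipvlan devices are checked; other devices pass through.
    if dev.startswith("ipvl") and addr in ipvlan_port_addrs:
        raise AddressInUse(f"{addr} already in use on the ipvlan port")


def add_address(dev: str, addr: str) -> None:
    # Validation runs first, so a veto leaves no partial state behind.
    for validate in validators:
        validate(dev, addr)
    committed.setdefault(dev, []).append(addr)


validators.append(ipvlan_validate)
add_address("ipvl0", "10.0.0.6")
try:
    add_address("ipvl1", "10.0.0.5")
    duplicate_rejected = False
except AddressInUse:
    duplicate_rejected = True
```

The caller sees the error instead of a silent failure, which is what lets provisioning software detect an unexpectedly used address.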
  14. 18 April, 2017 1 commit
  15. 14 April, 2017 1 commit
  16. 29 March, 2017 2 commits
  17. 13 March, 2017 1 commit
  18. 02 March, 2017 1 commit
  19. 03 February, 2017 1 commit
  20. 25 December, 2016 1 commit
  21. 02 September, 2016 1 commit
  22. 10 July, 2016 1 commit
  23. 14 March, 2016 2 commits
  24. 27 February, 2016 2 commits
    • D
      net: l3mdev: prefer VRF master for source address selection · 17b693cd
      David Lamparter committed
      When selecting an address in context of a VRF, the vrf master should be
      preferred for address selection.  If it isn't, the user has a hard time
      getting the system to select to their preference - the code will pick
      the address off the first in-VRF interface it can find, which on a
      router could well be a non-routable address.
      Signed-off-by: David Lamparter <equinox@diac24.net>
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      [dsa: Fixed comment style and removed extra blank line]
      Signed-off-by: David S. Miller <davem@davemloft.net>
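The preference described in commit 17b693cd can be illustrated with a short Python sketch. The device table, the addresses and the function `select_in_vrf` are invented for illustration; the real selection code considers scope and other criteria that this model ignores.

```python
from typing import Dict, List, Optional

# Hypothetical VRF members in the order the code would encounter them.
# The first in-VRF interface carries a link-local style address.
vrf_devices: List[Dict[str, object]] = [
    {"name": "eth1", "is_master": False, "addrs": ["169.254.0.1"]},
    {"name": "vrf0", "is_master": True, "addrs": ["10.0.0.3"]},
]


def select_in_vrf(devices: List[Dict[str, object]],
                  prefer_master: bool) -> Optional[str]:
    if prefer_master:
        # Try the VRF master first, so the operator controls the choice.
        for dev in devices:
            if dev["is_master"] and dev["addrs"]:
                return dev["addrs"][0]
    # Fall back to the first in-VRF interface that has any address.
    for dev in devices:
        if dev["addrs"]:
            return dev["addrs"][0]
    return None


without_preference = select_in_vrf(vrf_devices, prefer_master=False)
with_preference = select_in_vrf(vrf_devices, prefer_master=True)
```

Without the preference the model returns whatever the first enslaved interface happens to carry; with it, assigning an address to the master is enough to steer the result.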
    • D
      net: l3mdev: address selection should only consider devices in L3 domain · 3f2fb9a8
      David Ahern committed
      David Lamparter noted a use case where the source address selection fails
      to pick an address from a VRF interface - unnumbered interfaces.
      
      Relevant commands from his script:
          ip addr add 9.9.9.9/32 dev lo
          ip link set lo up
      
          ip link add name vrf0 type vrf table 101
          ip rule add oif vrf0 table 101
          ip rule add iif vrf0 table 101
          ip link set vrf0 up
          ip addr add 10.0.0.3/32 dev vrf0
      
          ip link add name dummy2 type dummy
          ip link set dummy2 master vrf0 up
      
          --> note dummy2 has no address - unnumbered device
      
          ip route add 10.2.2.2/32 dev dummy2 table 101
          ip neigh add 10.2.2.2 dev dummy2 lladdr 02:00:00:00:00:02
      
          tcpdump -ni dummy2 &
      
      And using ping instead of his socat example:
          $ ping -I vrf0 -c1 10.2.2.2
          ping: Warning: source address might be selected on device other than vrf0.
          PING 10.2.2.2 (10.2.2.2) from 9.9.9.9 vrf0: 56(84) bytes of data.
      
      From tcpdump:
          12:57:29.449128 IP 9.9.9.9 > 10.2.2.2: ICMP echo request, id 2491, seq 1, length 64
      
      Note the source address is from lo and is not a VRF local address. With
      this patch:
      
          $ ping -I vrf0 -c1 10.2.2.2
          PING 10.2.2.2 (10.2.2.2) from 10.0.0.3 vrf0: 56(84) bytes of data.
      
      From tcpdump:
          12:59:25.096426 IP 10.0.0.3 > 10.2.2.2: ICMP echo request, id 2113, seq 1, length 64
      
      Now the source address comes from vrf0.
      
      The ipv4 function for selecting source address takes a const argument.
      Removing the const requires touching a lot of places, so instead
      l3mdev_master_ifindex_rcu is changed to take a const argument and then
      do the typecast to non-const as required by netdev_master_upper_dev_get_rcu.
      This is similar to what l3mdev_fib_table_rcu does.
      
      IPv6 for unnumbered interfaces appears to be selecting the addresses
      properly.
      
      Cc: David Lamparter <david@opensourcerouting.org>
      Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
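The before and after behavior in commit 3f2fb9a8 can be reproduced with a toy model of the script's topology. This Python sketch is an assumption-laden simplification: `l3_domain` and `select_source` are hypothetical helpers, and the real code works on ifindex values and address scopes rather than names.

```python
from typing import Dict, List, Optional

# Topology from the commit message: lo and vrf0 have addresses,
# dummy2 is enslaved to vrf0 and is unnumbered.
devices: List[Dict[str, object]] = [
    {"name": "lo", "master": None, "addrs": ["9.9.9.9"]},
    {"name": "vrf0", "master": None, "addrs": ["10.0.0.3"]},
    {"name": "dummy2", "master": "vrf0", "addrs": []},
]
by_name = {d["name"]: d for d in devices}


def l3_domain(dev: Dict[str, object]) -> str:
    # A device belongs to its master's domain; a device without a master
    # is treated as its own domain in this model.
    return dev["master"] or dev["name"]


def select_source(out_dev: str, restrict_to_domain: bool) -> Optional[str]:
    dev = by_name[out_dev]
    if dev["addrs"]:
        return dev["addrs"][0]
    # The outgoing device is unnumbered, so scan other devices.
    for cand in devices:
        if restrict_to_domain and l3_domain(cand) != l3_domain(dev):
            continue  # skip devices outside the L3 domain
        if cand["addrs"]:
            return cand["addrs"][0]
    return None


before_patch = select_source("dummy2", restrict_to_domain=False)
after_patch = select_source("dummy2", restrict_to_domain=True)
```

The unrestricted scan lands on `lo` first and yields 9.9.9.9, matching the first tcpdump line; restricting the scan to the VRF yields 10.0.0.3, matching the second.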
  25. 20 February, 2016 1 commit
  26. 11 February, 2016 2 commits
  27. 22 October, 2015 1 commit
    • A
      netlink: Rightsize IFLA_AF_SPEC size calculation · b1974ed0
      Arad, Ronen committed
      if_nlmsg_size() overestimates the minimum allocation size of netlink
      dump request (when called from rtnl_calcit()) or the size of the
      message (when called from rtnl_getlink()). This is because
      ext_filter_mask is not supported by rtnl_link_get_af_size() and
      rtnl_link_get_size().
      
      The over-estimation is significant when at least one netdev has many
      VLANs configured (8 bytes for each configured VLAN).
      
      This patch-set "rightsizes" the protocol specific attribute size
      calculation by propagating ext_filter_mask to rtnl_link_get_af_size()
      and adding it as an argument to the get_link_af_size op in rtnl_af_ops.
      
      Bridge module already used filtering aware sizing for notifications.
      br_get_link_af_size_filtered() is consistent with the modified
      get_link_af_size op so it replaces br_get_link_af_size() in br_af_ops.
      br_get_link_af_size() becomes unused and thus removed.
      Signed-off-by: Ronen Arad <ronen.arad@intel.com>
      Acked-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
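The effect of propagating ext_filter_mask in commit b1974ed0 can be shown with a small arithmetic sketch in Python. The 8 bytes per VLAN comes from the commit message; the flag value for `RTEXT_FILTER_BRVLAN` and both function names are assumptions made for illustration, and real netlink sizing also includes headers and alignment that are left out here.

```python
RTEXT_FILTER_BRVLAN = 1 << 1   # assumed flag value, for illustration only
BYTES_PER_VLAN = 8             # figure quoted in the commit message


def af_spec_size_unfiltered(num_vlans: int) -> int:
    # Old behavior in this model: always budget for every configured VLAN.
    return num_vlans * BYTES_PER_VLAN


def af_spec_size_filtered(num_vlans: int, ext_filter_mask: int) -> int:
    # New behavior in this model: budget for VLANs only when the request
    # actually asks for them via the filter mask.
    if ext_filter_mask & RTEXT_FILTER_BRVLAN:
        return num_vlans * BYTES_PER_VLAN
    return 0


old_estimate = af_spec_size_unfiltered(4094)
new_without_vlans = af_spec_size_filtered(4094, ext_filter_mask=0)
new_with_vlans = af_spec_size_filtered(4094, ext_filter_mask=RTEXT_FILTER_BRVLAN)
```

With 4094 VLANs on one device the unfiltered estimate is about 32 KB per link even when the requester never asked for VLAN data, which is the over-estimation the commit removes.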
  28. 16 September, 2015 1 commit
  29. 29 July, 2015 1 commit
    • D
      net/ipv4: suppress NETDEV_UP notification on address lifetime update · 865b8042
      David Ward committed
      This notification causes the FIB to be updated, which is not needed
      because the address already exists, and more importantly it may undo
      intentional changes that were made to the FIB after the address was
      originally added. (As a point of comparison, when an address becomes
      deprecated because its preferred lifetime expired, a notification on
      this chain is not generated.)
      
      The motivation for this commit is fixing an incompatibility between
      DHCP clients which set and update the address lifetime according to
      the lease, and a commercial VPN client which replaces kernel routes
      in a way that outbound traffic is sent only through the tunnel (and
      disconnects if any further route changes are detected via netlink).
      Signed-off-by: David Ward <david.ward@ll.mit.edu>
      Signed-off-by: David S. Miller <davem@davemloft.net>
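The DHCP and VPN interaction behind commit 865b8042 can be replayed with a toy model. In this Python sketch, `on_netdev_up`, `set_address` and the route strings are hypothetical; the model only captures that a lifetime update on an existing address no longer triggers the handler that rebuilds the route.

```python
from typing import Dict, Set

addresses: Dict[str, int] = {}   # address -> valid lifetime in seconds
fib_routes: Set[str] = set()     # toy FIB: one connected route per address


def on_netdev_up(addr: str) -> None:
    # Stand-in for the FIB's handler on the address notifier chain.
    fib_routes.add(f"connected:{addr}")


def set_address(addr: str, lifetime: int) -> None:
    is_new = addr not in addresses
    addresses[addr] = lifetime
    if is_new:
        # Only a newly added address generates the notification.
        on_netdev_up(addr)


set_address("192.0.2.10", 3600)              # DHCP client adds the address
fib_routes.discard("connected:192.0.2.10")   # VPN client replaces the route
set_address("192.0.2.10", 7200)              # lease renewal updates lifetime
```

After the renewal the lifetime is updated but the route the VPN client removed stays removed, so the client sees no unexpected route change.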
  30. 09 July, 2015 1 commit
  31. 24 June, 2015 1 commit
    • A
      net: ipv4 sysctl option to ignore routes when nexthop link is down · 0eeb075f
      Andy Gospodarek committed
      This feature is only enabled with the new per-interface or ipv4 global
      sysctls called 'ignore_routes_with_linkdown'.
      
      net.ipv4.conf.all.ignore_routes_with_linkdown = 0
      net.ipv4.conf.default.ignore_routes_with_linkdown = 0
      net.ipv4.conf.lo.ignore_routes_with_linkdown = 0
      ...
      
      When the above sysctls are set, the kernel will report to userspace
      that a route is dead and will no longer resolve to this nexthop when
      performing a fib lookup.  This will signal to userspace that the route
      will not be selected.  The signalling of RTNH_F_DEAD is only passed to
      userspace if the sysctl is enabled and the link is down.  This was done
      as without it the netlink listeners would have no idea whether or not a
      nexthop would be selected.  The kernel only sets RTNH_F_DEAD internally
      if the interface has IFF_UP cleared.
      
      With the new sysctl set, the following behavior can be observed
      (interface p8p1 is link-down):
      
      default via 10.0.5.2 dev p9p1
      10.0.5.0/24 dev p9p1  proto kernel  scope link  src 10.0.5.15
      70.0.0.0/24 dev p7p1  proto kernel  scope link  src 70.0.0.1
      80.0.0.0/24 dev p8p1  proto kernel  scope link  src 80.0.0.1 dead linkdown
      90.0.0.0/24 via 80.0.0.2 dev p8p1  metric 1 dead linkdown
      90.0.0.0/24 via 70.0.0.2 dev p7p1  metric 2
      90.0.0.1 via 70.0.0.2 dev p7p1  src 70.0.0.1
          cache
      local 80.0.0.1 dev lo  src 80.0.0.1
          cache <local>
      80.0.0.2 via 10.0.5.2 dev p9p1  src 10.0.5.15
          cache
      
      While the route does remain in the table (so it can be modified if
      needed rather than being wiped away as it would be if IFF_UP was
      cleared), the proper next-hop is chosen automatically when the link is
      down.  Now interface p8p1 is linked-up:
      
      default via 10.0.5.2 dev p9p1
      10.0.5.0/24 dev p9p1  proto kernel  scope link  src 10.0.5.15
      70.0.0.0/24 dev p7p1  proto kernel  scope link  src 70.0.0.1
      80.0.0.0/24 dev p8p1  proto kernel  scope link  src 80.0.0.1
      90.0.0.0/24 via 80.0.0.2 dev p8p1  metric 1
      90.0.0.0/24 via 70.0.0.2 dev p7p1  metric 2
      192.168.56.0/24 dev p2p1  proto kernel  scope link  src 192.168.56.2
      90.0.0.1 via 80.0.0.2 dev p8p1  src 80.0.0.1
          cache
      local 80.0.0.1 dev lo  src 80.0.0.1
          cache <local>
      80.0.0.2 dev p8p1  src 80.0.0.1
          cache
      
      and the output changes to what one would expect.
      
      If the sysctl is not set, the following output would be expected when
      p8p1 is down:
      
      default via 10.0.5.2 dev p9p1
      10.0.5.0/24 dev p9p1  proto kernel  scope link  src 10.0.5.15
      70.0.0.0/24 dev p7p1  proto kernel  scope link  src 70.0.0.1
      80.0.0.0/24 dev p8p1  proto kernel  scope link  src 80.0.0.1 linkdown
      90.0.0.0/24 via 80.0.0.2 dev p8p1  metric 1 linkdown
      90.0.0.0/24 via 70.0.0.2 dev p7p1  metric 2
      
      Since the dead flag does not appear, there should be no expectation that
      the kernel would skip using this route due to link being down.
      
      v2: Split kernel changes into 2 patches, this actually makes a
      behavioral change if the sysctl is set.  Also took suggestion from Alex
      to simplify code by only checking sysctl during fib lookup and
      suggestion from Scott to add a per-interface sysctl.
      
      v3: Code clean-ups to make it more readable and efficient as well as a
      reverse path check fix.
      
      v4: Drop binary sysctl
      
      v5: Whitespace fixups from Dave
      
      v6: Style changes from Dave and checkpatch suggestions
      
      v7: One more checkpatch fixup
      Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com>
      Signed-off-by: Dinesh Dutt <ddutt@cumulusnetworks.com>
      Acked-by: Scott Feldman <sfeldma@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
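The nexthop choice shown in the route listings of commit 0eeb075f can be modeled in a few lines of Python. The two 90.0.0.0/24 routes and the link state of p8p1 are taken from the commit message; `select_nexthop` and the data layout are hypothetical, and the real lookup is considerably more involved.

```python
from typing import Dict, List, Optional

# The two competing routes to 90.0.0.0/24 from the commit message.
routes: List[Dict[str, object]] = [
    {"via": "80.0.0.2", "dev": "p8p1", "metric": 1},
    {"via": "70.0.0.2", "dev": "p7p1", "metric": 2},
]
link_up = {"p8p1": False, "p7p1": True}  # p8p1 is link-down


def select_nexthop(ignore_routes_with_linkdown: bool) -> Optional[Dict[str, object]]:
    # Lowest metric wins, unless the sysctl tells us to skip link-down devices.
    for route in sorted(routes, key=lambda r: r["metric"]):
        if ignore_routes_with_linkdown and not link_up[route["dev"]]:
            continue
        return route
    return None


with_sysctl = select_nexthop(ignore_routes_with_linkdown=True)
without_sysctl = select_nexthop(ignore_routes_with_linkdown=False)
```

With the sysctl set the model picks the metric 2 route via 70.0.0.2 on p7p1, as in the first listing; without it the metric 1 route via the link-down p8p1 is still chosen.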
  32. 04 April, 2015 2 commits
  33. 01 April, 2015 2 commits