1. 16 6月, 2016 2 次提交
  2. 26 4月, 2016 3 次提交
    • D
      net: ipv6: Delete host routes on an ifdown · 38bd10c4
      David Ahern 提交于
      It was a simple idea -- save IPv6 configured addresses on a link down
      so that IPv6 behaves similar to IPv4. As always the devil is in the
      details and the IPv6 stack as too many behavioral differences from IPv4
      making the simple idea more complicated than it needs to be.
      
      The current implementation for keeping IPv6 addresses can panic or spit
      out a warning in one of many paths:
      
      1. IPv6 route gets an IPv4 route as its 'next' which causes a panic in
         rt6_fill_node while handling a route dump request.
      
      2. rt->dst.obsolete is set to DST_OBSOLETE_DEAD hitting the WARN_ON in
         fib6_del
      
      3. Panic in fib6_purge_rt because rt6i_ref count is not 1.
      
      The root cause of all these is references related to the host route for
      an address that is retained.
      
      So, this patch deletes the host route every time the ifdown loop runs.
      Since the host route is deleted and will be re-generated an up there is
      no longer a need for the l3mdev fix up. On the 'admin up' side move
      addrconf_permanent_addr into the NETDEV_UP event handling so that it
      runs only once versus on UP and CHANGE events.
      
      All of the current panics and warnings appear to be related to
      addresses on the loopback device, but given the catastrophic nature when
      a bug is triggered this patch takes the conservative approach and evicts
      all host routes rather than trying to determine when it can be re-used
      and when it can not. That can be a later optimizaton if desired.
      Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      38bd10c4
    • D
      Revert "ipv6: Revert optional address flusing on ifdown." · 6a923934
      David S. Miller 提交于
      This reverts commit 841645b5.
      
      Ok, this puts the feature back.  I've decided to apply David A.'s
      bug fix and run with that rather than make everyone wait another
      whole release for this feature.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6a923934
    • D
      ipv6: Revert optional address flusing on ifdown. · 841645b5
      David S. Miller 提交于
      This reverts the following three commits:
      
      70af921d
      799977d9
      f1705ec1
      
      The feature was ill conceived, has terrible semantics, and has added
      nothing but regressions to the already fragile ipv6 stack.
      
      Fixes: f1705ec1 ("net: ipv6: Make address flushing on ifdown optional")
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      841645b5
  3. 20 4月, 2016 2 次提交
  4. 14 4月, 2016 2 次提交
  5. 12 4月, 2016 1 次提交
    • D
      net: vrf: Fix dev refcnt leak due to IPv6 prefix route · 4f7f34ea
      David Ahern 提交于
      ifupdown2 found a kernel bug with IPv6 routes and movement from the main
      table to the VRF table. Sequence of events:
      
      Create the interface and add addresses:
          ip link add dev eth4.105 link eth4 type vlan id 105
          ip addr add dev eth4.105 8.105.105.10/24
          ip -6 addr add dev eth4.105 2008:105:105::10/64
      
      At this point IPv6 has inserted a prefix route in the main table even
      though the interface is 'down'. From there the VRF device is created:
          ip link add dev vrf105 type vrf table 105
          ip addr add dev vrf105 9.9.105.10/32
          ip -6 addr add dev vrf105 2000:9:105::10/128
          ip link set vrf105 up
      
      Then the interface is enslaved, while still in the 'down' state:
          ip link set dev eth4.105 master vrf105
      
      Since the device is down the VRF driver cycling the device does not
      send the NETDEV_UP and NETDEV_DOWN but rather the NETDEV_CHANGE event
      which does not flush the routes inserted prior.
      
      When the link is brought up
          ip link set dev eth4.105 up
      
      the prefix route is added in the VRF table, but does not remove
      the route from the main table.
      
      Fix by handling the NETDEV_CHANGEUPPER event similar what was implemented
      for IPv4 in 7f49e7a3 ("net: Flush local routes when device changes vrf
      association")
      
      Fixes: 35402e31 ("net: Add IPv6 support to VRF device")
      Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4f7f34ea
  6. 14 3月, 2016 1 次提交
  7. 04 3月, 2016 1 次提交
    • D
      net: ipv6: Fix refcnt on host routes · 799977d9
      David Ahern 提交于
      Andrew and Ying Huang's test robot both reported usage count problems that
      trace back to the 'keep address on ifdown' patch.
      
      >From Andrew:
      We execute CRIU test on linux-next. On the current linux-next kernel
      they hangs on creating a network namespace.
      
      The kernel log contains many massages like this:
      [ 1036.122108] unregister_netdevice: waiting for lo to become free.
      Usage count = 2
      [ 1046.165156] unregister_netdevice: waiting for lo to become free.
      Usage count = 2
      [ 1056.210287] unregister_netdevice: waiting for lo to become free.
      Usage count = 2
      
      I tried to revert this patch and the bug disappeared.
      
      Here is a set of commands to reproduce this bug:
      
      [root@linux-next-test linux-next]# uname -a
      Linux linux-next-test 4.5.0-rc6-next-20160301+ #3 SMP Wed Mar 2
      17:32:18 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
      
      [root@linux-next-test ~]# unshare -n
      [root@linux-next-test ~]# ip link set up dev lo
      [root@linux-next-test ~]# ip a
      1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
      group default qlen 1
          link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
          inet 127.0.0.1/8 scope host lo
             valid_lft forever preferred_lft forever
          inet6 ::1/128 scope host
             valid_lft forever preferred_lft forever
      [root@linux-next-test ~]# logout
      [root@linux-next-test ~]# unshare -n
      
       -----
      
      The problem is a change made to RTM_DELADDR case in __ipv6_ifa_notify that
      was added in an early version of the offending patch and is no longer
      needed.
      
      Fixes: f1705ec1 ("net: ipv6: Make address flushing on ifdown optional")
      Cc: Andrey Wagin <avagin@gmail.com>
      Cc: Ying Huang <ying.huang@linux.intel.com>
      Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Tested-by: NJeremiah Mahler <jmmahler@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      799977d9
  8. 02 3月, 2016 1 次提交
  9. 26 2月, 2016 1 次提交
    • D
      net: ipv6: Make address flushing on ifdown optional · f1705ec1
      David Ahern 提交于
      Currently, all ipv6 addresses are flushed when the interface is configured
      down, including global, static addresses:
      
          $ ip -6 addr show dev eth1
          3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UP qlen 1000
              inet6 2100:1::2/120 scope global
                 valid_lft forever preferred_lft forever
              inet6 fe80::e0:f9ff:fe79:34bd/64 scope link
                 valid_lft forever preferred_lft forever
          $ ip link set dev eth1 down
          $ ip -6 addr show dev eth1
          << nothing; all addresses have been flushed>>
      
      Add a new sysctl to make this behavior optional. The new setting defaults to
      flush all addresses to maintain backwards compatibility. When the set global
      addresses with no expire times are not flushed on an admin down. The sysctl
      is per-interface or system-wide for all interfaces
      
          $ sysctl -w net.ipv6.conf.eth1.keep_addr_on_down=1
      or
          $ sysctl -w net.ipv6.conf.all.keep_addr_on_down=1
      
      Will keep addresses on eth1 on an admin down.
      
          $ ip -6 addr show dev eth1
          3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UP qlen 1000
              inet6 2100:1::2/120 scope global
                 valid_lft forever preferred_lft forever
              inet6 fe80::e0:f9ff:fe79:34bd/64 scope link
                 valid_lft forever preferred_lft forever
          $ ip link set dev eth1 down
          $ ip -6 addr show dev eth1
          3: eth1: <BROADCAST,MULTICAST> mtu 1500 state DOWN qlen 1000
              inet6 2100:1::2/120 scope global tentative
                 valid_lft forever preferred_lft forever
              inet6 fe80::e0:f9ff:fe79:34bd/64 scope link tentative
                 valid_lft forever preferred_lft forever
      Signed-off-by: NDavid Ahern <dsa@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f1705ec1
  10. 20 2月, 2016 1 次提交
  11. 11 2月, 2016 2 次提交
  12. 06 2月, 2016 1 次提交
    • S
      ipv6: addrconf: Fix recursive spin lock call · 16186a82
      subashab@codeaurora.org 提交于
      A rcu stall with the following backtrace was seen on a system with
      forwarding, optimistic_dad and use_optimistic set. To reproduce,
      set these flags and allow ipv6 autoconf.
      
      This occurs because the device write_lock is acquired while already
      holding the read_lock. Back trace below -
      
      INFO: rcu_preempt self-detected stall on CPU { 1}  (t=2100 jiffies
       g=3992 c=3991 q=4471)
      <6> Task dump for CPU 1:
      <2> kworker/1:0     R  running task    12168    15   2 0x00000002
      <2> Workqueue: ipv6_addrconf addrconf_dad_work
      <6> Call trace:
      <2> [<ffffffc000084da8>] el1_irq+0x68/0xdc
      <2> [<ffffffc000cc4e0c>] _raw_write_lock_bh+0x20/0x30
      <2> [<ffffffc000bc5dd8>] __ipv6_dev_ac_inc+0x64/0x1b4
      <2> [<ffffffc000bcbd2c>] addrconf_join_anycast+0x9c/0xc4
      <2> [<ffffffc000bcf9f0>] __ipv6_ifa_notify+0x160/0x29c
      <2> [<ffffffc000bcfb7c>] ipv6_ifa_notify+0x50/0x70
      <2> [<ffffffc000bd035c>] addrconf_dad_work+0x314/0x334
      <2> [<ffffffc0000b64c8>] process_one_work+0x244/0x3fc
      <2> [<ffffffc0000b7324>] worker_thread+0x2f8/0x418
      <2> [<ffffffc0000bb40c>] kthread+0xe0/0xec
      
      v2: do addrconf_dad_kick inside read lock and then acquire write
      lock for ipv6_ifa_notify as suggested by Eric
      
      Fixes: 7fd2561e ("net: ipv6: Add a sysctl to make optimistic
      addresses useful candidates")
      
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Erik Kline <ek@google.com>
      Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NSubash Abhinov Kasiviswanathan <subashab@codeaurora.org>
      Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      16186a82
  13. 11 1月, 2016 1 次提交
  14. 23 12月, 2015 1 次提交
  15. 19 12月, 2015 1 次提交
    • B
      ipv6: addrconf: use stable address generator for ARPHRD_NONE · cc9da6cc
      Bjørn Mork 提交于
      Add a new address generator mode, using the stable address generator
      with an automatically generated secret. This is intended as a default
      address generator mode for device types with no EUI64 implementation.
      The new generator is used for ARPHRD_NONE interfaces initially, adding
      default IPv6 autoconf support to e.g. tun interfaces.
      
      If the addrgenmode is set to 'random', either by default or manually,
      and no stable secret is available, then a random secret is used as
      input for the stable-privacy address generator.  The secret can be
      read and modified like manually configured secrets, using the proc
      interface.  Modifying the secret will change the addrgen mode to
      'stable-privacy' to indicate that it operates on a known secret.
      
      Existing behaviour of the 'stable-privacy' mode is kept unchanged. If
      a known secret is available when the device is created, then the mode
      will default to 'stable-privacy' as before.  The mode can be manually
      set to 'random' but it will behave exactly like 'stable-privacy' in
      this case. The secret will not change.
      
      Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Cc: 吉藤英明 <hideaki.yoshifuji@miraclelinux.com>
      Signed-off-by: NBjørn Mork <bjorn@mork.no>
      Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cc9da6cc
  16. 16 12月, 2015 1 次提交
  17. 15 12月, 2015 1 次提交
    • A
      ipv6: addrconf: drop ieee802154 specific things · 5241c2d7
      Alexander Aring 提交于
      This patch removes ARPHRD_IEEE802154 from addrconf handling. In the
      earlier days of 802.15.4 6LoWPAN, the interface type was ARPHRD_IEEE802154
      which introduced several issues, because 802.15.4 interfaces used the
      same type.
      
      Since commit 965e613d ("ieee802154: 6lowpan: fix ARPHRD to
      ARPHRD_6LOWPAN") we use ARPHRD_6LOWPAN for 6LoWPAN interfaces. This
      patch will remove ARPHRD_IEEE802154 which is currently deadcode, because
      ARPHRD_IEEE802154 doesn't reach the minimum 1280 MTU of IPv6.
      
      Also we use 6LoWPAN EUI64 specific defines instead using link-layer
      constanst from 802.15.4 link-layer header.
      
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      Cc: James Morris <jmorris@namei.org>
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Cc: Patrick McHardy <kaber@trash.net>
      Signed-off-by: NAlexander Aring <alex.aring@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5241c2d7
  18. 06 12月, 2015 2 次提交
  19. 04 12月, 2015 1 次提交
    • P
      net: ipv6: restrict hop_limit sysctl setting to range [1; 255] · d6df198d
      Phil Sutter 提交于
      Setting a value bigger than 255 resulted in using only the lower eight
      bits of that value as it is assigned to the u8 header field. To avoid
      this unexpected result, reject such values.
      
      Setting a value of zero is technically possible, but hosts receiving
      such a packet have to treat it like hop_limit was set to one, according
      to RFC2460. Therefore I don't see a use-case for that.
      
      Setting a route's hop_limit to zero in iproute2 means to use the sysctl
      default, which is not the case here: Setting e.g.
      net.conf.eth0.hop_limit=0 will not make the kernel use
      net.conf.all.hop_limit for outgoing packets on eth0. To avoid these
      kinds of confusion, reject zero.
      Signed-off-by: NPhil Sutter <phil@nwl.cc>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d6df198d
  20. 02 12月, 2015 1 次提交
  21. 05 11月, 2015 1 次提交
  22. 30 10月, 2015 1 次提交
  23. 22 10月, 2015 1 次提交
    • A
      netlink: Rightsize IFLA_AF_SPEC size calculation · b1974ed0
      Arad, Ronen 提交于
      if_nlmsg_size() overestimates the minimum allocation size of netlink
      dump request (when called from rtnl_calcit()) or the size of the
      message (when called from rtnl_getlink()). This is because
      ext_filter_mask is not supported by rtnl_link_get_af_size() and
      rtnl_link_get_size().
      
      The over-estimation is significant when at least one netdev has many
      VLANs configured (8 bytes for each configured VLAN).
      
      This patch-set "rightsizes" the protocol specific attribute size
      calculation by propagating ext_filter_mask to rtnl_link_get_af_size()
      and adding this a argument to get_link_af_size op in rtnl_af_ops.
      
      Bridge module already used filtering aware sizing for notifications.
      br_get_link_af_size_filtered() is consistent with the modified
      get_link_af_size op so it replaces br_get_link_af_size() in br_af_ops.
      br_get_link_af_size() becomes unused and thus removed.
      Signed-off-by: NRonen Arad <ronen.arad@intel.com>
      Acked-by: NSridhar Samudrala <sridhar.samudrala@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b1974ed0
  24. 13 10月, 2015 1 次提交
  25. 11 10月, 2015 1 次提交
  26. 25 9月, 2015 1 次提交
  27. 16 9月, 2015 2 次提交
    • S
      rtnetlink: RTEXT_FILTER_SKIP_STATS support to avoid dumping inet/inet6 stats · d5566fd7
      Sowmini Varadhan 提交于
      Many commonly used functions like getifaddrs() invoke RTM_GETLINK
      to dump the interface information, and do not need the
      the AF_INET6 statististics that are always returned by default
      from rtnl_fill_ifinfo().
      
      Computing the statistics can be an expensive operation that impacts
      scaling, so it is desirable to avoid this if the information is
      not needed.
      
      This patch adds a the RTEXT_FILTER_SKIP_STATS extended info flag that
      can be passed with netlink_request() to avoid statistics computation
      for the ifinfo path.
      Signed-off-by: NSowmini Varadhan <sowmini.varadhan@oracle.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d5566fd7
    • M
      ipv6: Avoid double dst_free · 8e3d5be7
      Martin KaFai Lau 提交于
      It is a prep work to get dst freeing from fib tree undergo
      a rcu grace period.
      
      The following is a common paradigm:
      if (ip6_del_rt(rt))
      	dst_free(rt)
      
      which means, if rt cannot be deleted from the fib tree, dst_free(rt) now.
      1. We don't know the ip6_del_rt(rt) failure is because it
         was not managed by fib tree (e.g. DST_NOCACHE) or it had already been
         removed from the fib tree.
      2. If rt had been managed by the fib tree, ip6_del_rt(rt) failure means
         dst_free(rt) has been called already.  A second
         dst_free(rt) is not always obviously safe.  The rt may have
         been destroyed already.
      3. If rt is a DST_NOCACHE, dst_free(rt) should not be called.
      4. It is a stopper to make dst freeing from fib tree undergo a
         rcu grace period.
      
      This patch is to use a DST_NOCACHE flag to indicate a rt is
      not managed by the fib tree.
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8e3d5be7
  28. 31 8月, 2015 2 次提交
    • R
      net: Optimize snmp stat aggregation by walking all the percpu data at once · a3a77372
      Raghavendra K T 提交于
      Docker container creation linearly increased from around 1.6 sec to 7.5 sec
      (at 1000 containers) and perf data showed 50% ovehead in snmp_fold_field.
      
      reason: currently __snmp6_fill_stats64 calls snmp_fold_field that walks
      through per cpu data of an item (iteratively for around 36 items).
      
      idea: This patch tries to aggregate the statistics by going through
      all the items of each cpu sequentially which is reducing cache
      misses.
      
      Docker creation got faster by more than 2x after the patch.
      
      Result:
                             Before           After
      Docker creation time   6.836s           3.25s
      cache miss             2.7%             1.41%
      
      perf before:
          50.73%  docker           [kernel.kallsyms]       [k] snmp_fold_field
           9.07%  swapper          [kernel.kallsyms]       [k] snooze_loop
           3.49%  docker           [kernel.kallsyms]       [k] veth_stats_one
           2.85%  swapper          [kernel.kallsyms]       [k] _raw_spin_lock
      
      perf after:
          10.57%  docker           docker                [.] scanblock
           8.37%  swapper          [kernel.kallsyms]     [k] snooze_loop
           6.91%  docker           [kernel.kallsyms]     [k] snmp_get_cpu_field
           6.67%  docker           [kernel.kallsyms]     [k] veth_stats_one
      
      changes/ideas suggested:
      Using buffer in stack (Eric), Usage of memset (David), Using memcpy in
      place of unaligned_put (Joe).
      Signed-off-by: NRaghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a3a77372
    • M
      net/ipv6: Export addrconf_ifid_eui48 · 399e6f95
      Matan Barak 提交于
      For loopback purposes, RoCE devices should have a default GID in the
      port GID table, even when the interface is down. In order to do so,
      we use the IPv6 link local address which would have been genenrated
      for the related Ethernet netdevice when it goes up as a default GID.
      
      addrconf_ifid_eui48 is used to gernerate this address, export it.
      Signed-off-by: NMatan Barak <matanb@mellanox.com>
      Signed-off-by: NDoug Ledford <dledford@redhat.com>
      399e6f95
  29. 21 8月, 2015 1 次提交
  30. 14 8月, 2015 2 次提交
    • A
      net: addr IFLA_OPERSTATE to netlink message for ipv6 ifinfo · 0344338b
      Andy Gospodarek 提交于
      This is useful information to include in ipv6 netlink messages that
      report interface information.  IFLA_OPERSTATE is already included in
      ipv4 messages, but missing for ipv6.  This closes that gap.
      Signed-off-by: NAndy Gospodarek <gospo@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0344338b
    • A
      net: ipv6 sysctl option to ignore routes when nexthop link is down · 35103d11
      Andy Gospodarek 提交于
      Like the ipv4 patch with a similar title, this adds a sysctl to allow
      the user to change routing behavior based on whether or not the
      interface associated with the nexthop was an up or down link.  The
      default setting preserves the current behavior, but anyone that enables
      it will notice that nexthops on down interfaces will no longer be
      selected:
      
      net.ipv6.conf.all.ignore_routes_with_linkdown = 0
      net.ipv6.conf.default.ignore_routes_with_linkdown = 0
      net.ipv6.conf.lo.ignore_routes_with_linkdown = 0
      ...
      
      When the above sysctls are set, not only will link status be reported to
      userspace, but an indication that a nexthop is dead and will not be used
      is also reported.
      
      1000::/8 via 7000::2 dev p7p1  metric 1024 dead linkdown  pref medium
      1000::/8 via 8000::2 dev p8p1  metric 1024  pref medium
      7000::/8 dev p7p1  proto kernel  metric 256 dead linkdown  pref medium
      8000::/8 dev p8p1  proto kernel  metric 256  pref medium
      9000::/8 via 8000::2 dev p8p1  metric 2048  pref medium
      9000::/8 via 7000::2 dev p7p1  metric 1024 dead linkdown  pref medium
      fe80::/64 dev p7p1  proto kernel  metric 256 dead linkdown  pref medium
      fe80::/64 dev p8p1  proto kernel  metric 256  pref medium
      
      This also adds devconf support and notification when sysctl values
      change.
      
      v2: drop use of rt6i_nhflags since it is not needed right now
      Signed-off-by: NAndy Gospodarek <gospo@cumulusnetworks.com>
      Signed-off-by: NDinesh Dutt <ddutt@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      35103d11