1. 05 11月, 2015 1 次提交
  2. 30 10月, 2015 1 次提交
  3. 22 10月, 2015 1 次提交
    • A
      netlink: Rightsize IFLA_AF_SPEC size calculation · b1974ed0
      Arad, Ronen 提交于
      if_nlmsg_size() overestimates the minimum allocation size of netlink
      dump request (when called from rtnl_calcit()) or the size of the
      message (when called from rtnl_getlink()). This is because
      ext_filter_mask is not supported by rtnl_link_get_af_size() and
      rtnl_link_get_size().
      
      The over-estimation is significant when at least one netdev has many
      VLANs configured (8 bytes for each configured VLAN).
      
      This patch-set "rightsizes" the protocol specific attribute size
      calculation by propagating ext_filter_mask to rtnl_link_get_af_size()
      and adding this a argument to get_link_af_size op in rtnl_af_ops.
      
      Bridge module already used filtering aware sizing for notifications.
      br_get_link_af_size_filtered() is consistent with the modified
      get_link_af_size op so it replaces br_get_link_af_size() in br_af_ops.
      br_get_link_af_size() becomes unused and thus removed.
      Signed-off-by: NRonen Arad <ronen.arad@intel.com>
      Acked-by: NSridhar Samudrala <sridhar.samudrala@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b1974ed0
  4. 13 10月, 2015 1 次提交
  5. 11 10月, 2015 1 次提交
  6. 25 9月, 2015 1 次提交
  7. 16 9月, 2015 2 次提交
    • S
      rtnetlink: RTEXT_FILTER_SKIP_STATS support to avoid dumping inet/inet6 stats · d5566fd7
      Sowmini Varadhan 提交于
      Many commonly used functions like getifaddrs() invoke RTM_GETLINK
      to dump the interface information, and do not need the
      the AF_INET6 statististics that are always returned by default
      from rtnl_fill_ifinfo().
      
      Computing the statistics can be an expensive operation that impacts
      scaling, so it is desirable to avoid this if the information is
      not needed.
      
      This patch adds a the RTEXT_FILTER_SKIP_STATS extended info flag that
      can be passed with netlink_request() to avoid statistics computation
      for the ifinfo path.
      Signed-off-by: NSowmini Varadhan <sowmini.varadhan@oracle.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d5566fd7
    • M
      ipv6: Avoid double dst_free · 8e3d5be7
      Martin KaFai Lau 提交于
      It is a prep work to get dst freeing from fib tree undergo
      a rcu grace period.
      
      The following is a common paradigm:
      if (ip6_del_rt(rt))
      	dst_free(rt)
      
      which means, if rt cannot be deleted from the fib tree, dst_free(rt) now.
      1. We don't know the ip6_del_rt(rt) failure is because it
         was not managed by fib tree (e.g. DST_NOCACHE) or it had already been
         removed from the fib tree.
      2. If rt had been managed by the fib tree, ip6_del_rt(rt) failure means
         dst_free(rt) has been called already.  A second
         dst_free(rt) is not always obviously safe.  The rt may have
         been destroyed already.
      3. If rt is a DST_NOCACHE, dst_free(rt) should not be called.
      4. It is a stopper to make dst freeing from fib tree undergo a
         rcu grace period.
      
      This patch is to use a DST_NOCACHE flag to indicate a rt is
      not managed by the fib tree.
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8e3d5be7
  8. 31 8月, 2015 2 次提交
    • R
      net: Optimize snmp stat aggregation by walking all the percpu data at once · a3a77372
      Raghavendra K T 提交于
      Docker container creation linearly increased from around 1.6 sec to 7.5 sec
      (at 1000 containers) and perf data showed 50% ovehead in snmp_fold_field.
      
      reason: currently __snmp6_fill_stats64 calls snmp_fold_field that walks
      through per cpu data of an item (iteratively for around 36 items).
      
      idea: This patch tries to aggregate the statistics by going through
      all the items of each cpu sequentially which is reducing cache
      misses.
      
      Docker creation got faster by more than 2x after the patch.
      
      Result:
                             Before           After
      Docker creation time   6.836s           3.25s
      cache miss             2.7%             1.41%
      
      perf before:
          50.73%  docker           [kernel.kallsyms]       [k] snmp_fold_field
           9.07%  swapper          [kernel.kallsyms]       [k] snooze_loop
           3.49%  docker           [kernel.kallsyms]       [k] veth_stats_one
           2.85%  swapper          [kernel.kallsyms]       [k] _raw_spin_lock
      
      perf after:
          10.57%  docker           docker                [.] scanblock
           8.37%  swapper          [kernel.kallsyms]     [k] snooze_loop
           6.91%  docker           [kernel.kallsyms]     [k] snmp_get_cpu_field
           6.67%  docker           [kernel.kallsyms]     [k] veth_stats_one
      
      changes/ideas suggested:
      Using buffer in stack (Eric), Usage of memset (David), Using memcpy in
      place of unaligned_put (Joe).
      Signed-off-by: NRaghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a3a77372
    • M
      net/ipv6: Export addrconf_ifid_eui48 · 399e6f95
      Matan Barak 提交于
      For loopback purposes, RoCE devices should have a default GID in the
      port GID table, even when the interface is down. In order to do so,
      we use the IPv6 link local address which would have been genenrated
      for the related Ethernet netdevice when it goes up as a default GID.
      
      addrconf_ifid_eui48 is used to gernerate this address, export it.
      Signed-off-by: NMatan Barak <matanb@mellanox.com>
      Signed-off-by: NDoug Ledford <dledford@redhat.com>
      399e6f95
  9. 21 8月, 2015 1 次提交
  10. 14 8月, 2015 2 次提交
    • A
      net: addr IFLA_OPERSTATE to netlink message for ipv6 ifinfo · 0344338b
      Andy Gospodarek 提交于
      This is useful information to include in ipv6 netlink messages that
      report interface information.  IFLA_OPERSTATE is already included in
      ipv4 messages, but missing for ipv6.  This closes that gap.
      Signed-off-by: NAndy Gospodarek <gospo@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0344338b
    • A
      net: ipv6 sysctl option to ignore routes when nexthop link is down · 35103d11
      Andy Gospodarek 提交于
      Like the ipv4 patch with a similar title, this adds a sysctl to allow
      the user to change routing behavior based on whether or not the
      interface associated with the nexthop was an up or down link.  The
      default setting preserves the current behavior, but anyone that enables
      it will notice that nexthops on down interfaces will no longer be
      selected:
      
      net.ipv6.conf.all.ignore_routes_with_linkdown = 0
      net.ipv6.conf.default.ignore_routes_with_linkdown = 0
      net.ipv6.conf.lo.ignore_routes_with_linkdown = 0
      ...
      
      When the above sysctls are set, not only will link status be reported to
      userspace, but an indication that a nexthop is dead and will not be used
      is also reported.
      
      1000::/8 via 7000::2 dev p7p1  metric 1024 dead linkdown  pref medium
      1000::/8 via 8000::2 dev p8p1  metric 1024  pref medium
      7000::/8 dev p7p1  proto kernel  metric 256 dead linkdown  pref medium
      8000::/8 dev p8p1  proto kernel  metric 256  pref medium
      9000::/8 via 8000::2 dev p8p1  metric 2048  pref medium
      9000::/8 via 7000::2 dev p7p1  metric 1024 dead linkdown  pref medium
      fe80::/64 dev p7p1  proto kernel  metric 256 dead linkdown  pref medium
      fe80::/64 dev p8p1  proto kernel  metric 256  pref medium
      
      This also adds devconf support and notification when sysctl values
      change.
      
      v2: drop use of rt6i_nhflags since it is not needed right now
      Signed-off-by: NAndy Gospodarek <gospo@cumulusnetworks.com>
      Signed-off-by: NDinesh Dutt <ddutt@cumulusnetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      35103d11
  11. 31 7月, 2015 1 次提交
    • H
      net/ipv6: add sysctl option accept_ra_min_hop_limit · 8013d1d7
      Hangbin Liu 提交于
      Commit 6fd99094 ("ipv6: Don't reduce hop limit for an interface")
      disabled accept hop limit from RA if it is smaller than the current hop
      limit for security stuff. But this behavior kind of break the RFC definition.
      
      RFC 4861, 6.3.4.  Processing Received Router Advertisements
         A Router Advertisement field (e.g., Cur Hop Limit, Reachable Time,
         and Retrans Timer) may contain a value denoting that it is
         unspecified.  In such cases, the parameter should be ignored and the
         host should continue using whatever value it is already using.
      
         If the received Cur Hop Limit value is non-zero, the host SHOULD set
         its CurHopLimit variable to the received value.
      
      So add sysctl option accept_ra_min_hop_limit to let user choose the minimum
      hop limit value they can accept from RA. And set default to 1 to meet RFC
      standards.
      Signed-off-by: NHangbin Liu <liuhangbin@gmail.com>
      Acked-by: NYOSHIFUJI Hideaki <hideaki.yoshifuji@miraclelinux.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8013d1d7
  12. 23 7月, 2015 1 次提交
  13. 16 7月, 2015 2 次提交
  14. 11 7月, 2015 1 次提交
  15. 02 5月, 2015 1 次提交
    • M
      ipv6: Consider RTF_CACHE when searching the fib6 tree · 1f56a01f
      Martin KaFai Lau 提交于
      It is a prep work for the later bug-fix patch which will stop /128 route
      from disappearing after pmtu update.
      
      The later bug-fix patch will allow a /128 route and its RTF_CACHE clone
      both exist at the same fib6_node.  To do this, we need to prepare the
      existing fib6 tree search to expect RTF_CACHE for /128 route.
      
      Note that the fn->leaf is sorted by rt6i_metric.  Hence,
      RTF_CACHE (if there is any) is always at the front.  This property
      leads to the following:
      
      1. When doing ip6_route_del(), it should honor the RTF_CACHE flag which
         the caller is used to ask for deleting clone or non-clone.
         The rtm_to_fib6_config() should also check the RTM_F_CLONED and
         then set RTF_CACHE accordingly so that:
         - 'ip -6 r del...' will make ip6_route_del() to delete a route
           and all its clones. Note that its clones is flushed by fib6_del()
         - 'ip -6 r flush table cache' will make ip6_route_del() to
            only delete clone(s).
      
      2. Exclude RTF_CACHE from addrconf_get_prefix_route() which
         should not configure on a cloned route.
      
      3. No change is need for rt6_device_match() since it currently could
         return a RTF_CACHE clone route, so the later bug-fix patch will not
         affect it.
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Reviewed-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Cc: Steffen Klassert <steffen.klassert@secunet.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1f56a01f
  16. 03 4月, 2015 1 次提交
  17. 01 4月, 2015 2 次提交
  18. 25 3月, 2015 1 次提交
  19. 24 3月, 2015 6 次提交
    • H
      ipv6: introduce idgen_delay and idgen_retries knobs · 1855b7c3
      Hannes Frederic Sowa 提交于
      This is specified by RFC 7217.
      
      Cc: Erik Kline <ek@google.com>
      Cc: Fernando Gont <fgont@si6networks.com>
      Cc: Lorenzo Colitti <lorenzo@google.com>
      Cc: YOSHIFUJI Hideaki/吉藤英明 <hideaki.yoshifuji@miraclelinux.com>
      Signed-off-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1855b7c3
    • H
      ipv6: do retries on stable privacy addresses · 5f40ef77
      Hannes Frederic Sowa 提交于
      If a DAD conflict is detected, we want to retry privacy stable address
      generation up to idgen_retries (= 3) times with a delay of idgen_delay
      (= 1 second). Add the logic to addrconf_dad_failure.
      
      By design, we don't clean up dad failed permanent addresses.
      
      Cc: Erik Kline <ek@google.com>
      Cc: Fernando Gont <fgont@si6networks.com>
      Cc: Lorenzo Colitti <lorenzo@google.com>
      Cc: YOSHIFUJI Hideaki/吉藤英明 <hideaki.yoshifuji@miraclelinux.com>
      Signed-off-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5f40ef77
    • H
      ipv6: collapse state_lock and lock · 8e8e676d
      Hannes Frederic Sowa 提交于
      Cc: Erik Kline <ek@google.com>
      Cc: Fernando Gont <fgont@si6networks.com>
      Cc: Lorenzo Colitti <lorenzo@google.com>
      Cc: YOSHIFUJI Hideaki/吉藤英明 <hideaki.yoshifuji@miraclelinux.com>
      Signed-off-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8e8e676d
    • H
      ipv6: introduce IFA_F_STABLE_PRIVACY flag · 64236f3f
      Hannes Frederic Sowa 提交于
      We need to mark appropriate addresses so we can do retries in case their
      DAD failed.
      
      Cc: Erik Kline <ek@google.com>
      Cc: Fernando Gont <fgont@si6networks.com>
      Cc: Lorenzo Colitti <lorenzo@google.com>
      Cc: YOSHIFUJI Hideaki/吉藤英明 <hideaki.yoshifuji@miraclelinux.com>
      Signed-off-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      64236f3f
    • H
      ipv6: generation of stable privacy addresses for link-local and autoconf · 622c81d5
      Hannes Frederic Sowa 提交于
      This patch implements the stable privacy address generation for
      link-local and autoconf addresses as specified in RFC7217.
      
        RID = F(Prefix, Net_Iface, Network_ID, DAD_Counter, secret_key)
      
      is the RID (random identifier). As the hash function F we chose one
      round of sha1. Prefix will be either the link-local prefix or the
      router advertised one. As Net_Iface we use the MAC address of the
      device. DAD_Counter and secret_key are implemented as specified.
      
      We don't use Network_ID, as it couples the code too closely to other
      subsystems. It is specified as optional in the RFC.
      
      As Net_Iface we only use the MAC address: we simply have no stable
      identifier in the kernel we could possibly use: because this code might
      run very early, we cannot depend on names, as they might be changed by
      user space early on during the boot process.
      
      A new address generation mode is introduced,
      IN6_ADDR_GEN_MODE_STABLE_PRIVACY. With iproute2 one can switch back to
      none or eui64 address configuration mode although the stable_secret is
      already set.
      
      We refuse writes to ipv6/conf/all/stable_secret but only allow
      ipv6/conf/default/stable_secret and the interface specific file to be
      written to. The default stable_secret is used as the parameter for the
      namespace, the interface specific can overwrite the secret, e.g. when
      switching a network configuration from one system to another while
      inheriting the secret.
      
      Cc: Erik Kline <ek@google.com>
      Cc: Fernando Gont <fgont@si6networks.com>
      Cc: Lorenzo Colitti <lorenzo@google.com>
      Cc: YOSHIFUJI Hideaki/吉藤英明 <hideaki.yoshifuji@miraclelinux.com>
      Signed-off-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      622c81d5
    • H
      ipv6: introduce secret_stable to ipv6_devconf · 3d1bec99
      Hannes Frederic Sowa 提交于
      This patch implements the procfs logic for the stable_address knob:
      The secret is formatted as an ipv6 address and will be stored per
      interface and per namespace. We track initialized flag and return EIO
      errors until the secret is set.
      
      We don't inherit the secret to newly created namespaces.
      
      Cc: Erik Kline <ek@google.com>
      Cc: Fernando Gont <fgont@si6networks.com>
      Cc: Lorenzo Colitti <lorenzo@google.com>
      Cc: YOSHIFUJI Hideaki/吉藤英明 <hideaki.yoshifuji@miraclelinux.com>
      Signed-off-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3d1bec99
  20. 19 3月, 2015 1 次提交
  21. 28 2月, 2015 1 次提交
    • M
      multicast: Extend ip address command to enable multicast group join/leave on · 93a714d6
      Madhu Challa 提交于
      Joining multicast group on ethernet level via "ip maddr" command would
      not work if we have an Ethernet switch that does igmp snooping since
      the switch would not replicate multicast packets on ports that did not
      have IGMP reports for the multicast addresses.
      
      Linux vxlan interfaces created via "ip link add vxlan" have the group option
      that enables then to do the required join.
      
      By extending ip address command with option "autojoin" we can get similar
      functionality for openvswitch vxlan interfaces as well as other tunneling
      mechanisms that need to receive multicast traffic. The kernel code is
      structured similar to how the vxlan driver does a group join / leave.
      
      example:
      ip address add 224.1.1.10/24 dev eth5 autojoin
      ip address del 224.1.1.10/24 dev eth5
      Signed-off-by: NMadhu Challa <challa@noironetworks.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      93a714d6
  22. 24 2月, 2015 1 次提交
    • M
      ipv6: addrconf: validate new MTU before applying it · 77751427
      Marcelo Leitner 提交于
      Currently we don't check if the new MTU is valid or not and this allows
      one to configure a smaller than minimum allowed by RFCs or even bigger
      than interface own MTU, which is a problem as it may lead to packet
      drops.
      
      If you have a daemon like NetworkManager running, this may be exploited
      by remote attackers by forging RA packets with an invalid MTU, possibly
      leading to a DoS. (NetworkManager currently only validates for values
      too small, but not for too big ones.)
      
      The fix is just to make sure the new value is valid. That is, between
      IPV6_MIN_MTU and interface's MTU.
      
      Note that similar check is already performed at
      ndisc_router_discovery(), for when kernel itself parses the RA.
      Signed-off-by: NMarcelo Ricardo Leitner <mleitner@redhat.com>
      Signed-off-by: NSabrina Dubroca <sd@queasysnail.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      77751427
  23. 07 2月, 2015 1 次提交
  24. 06 2月, 2015 1 次提交
    • E
      net: ipv6: allow explicitly choosing optimistic addresses · c58da4c6
      Erik Kline 提交于
      RFC 4429 ("Optimistic DAD") states that optimistic addresses
      should be treated as deprecated addresses.  From section 2.1:
      
         Unless noted otherwise, components of the IPv6 protocol stack
         should treat addresses in the Optimistic state equivalently to
         those in the Deprecated state, indicating that the address is
         available for use but should not be used if another suitable
         address is available.
      
      Optimistic addresses are indeed avoided when other addresses are
      available (i.e. at source address selection time), but they have
      not heretofore been available for things like explicit bind() and
      sendmsg() with struct in6_pktinfo, etc.
      
      This change makes optimistic addresses treated more like
      deprecated addresses than tentative ones.
      Signed-off-by: NErik Kline <ek@google.com>
      Acked-by: NLorenzo Colitti <lorenzo@google.com>
      Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c58da4c6
  25. 31 1月, 2015 1 次提交
  26. 26 1月, 2015 1 次提交
  27. 19 1月, 2015 1 次提交
  28. 18 1月, 2015 1 次提交
    • J
      netlink: make nlmsg_end() and genlmsg_end() void · 053c095a
      Johannes Berg 提交于
      Contrary to common expectations for an "int" return, these functions
      return only a positive value -- if used correctly they cannot even
      return 0 because the message header will necessarily be in the skb.
      
      This makes the very common pattern of
      
        if (genlmsg_end(...) < 0) { ... }
      
      be a whole bunch of dead code. Many places also simply do
      
        return nlmsg_end(...);
      
      and the caller is expected to deal with it.
      
      This also commonly (at least for me) causes errors, because it is very
      common to write
      
        if (my_function(...))
          /* error condition */
      
      and if my_function() does "return nlmsg_end()" this is of course wrong.
      
      Additionally, there's not a single place in the kernel that actually
      needs the message length returned, and if anyone needs it later then
      it'll be very easy to just use skb->len there.
      
      Remove this, and make the functions void. This removes a bunch of dead
      code as described above. The patch adds lines because I did
      
      -	return nlmsg_end(...);
      +	nlmsg_end(...);
      +	return 0;
      
      I could have preserved all the function's return values by returning
      skb->len, but instead I've audited all the places calling the affected
      functions and found that none cared. A few places actually compared
      the return value with <= 0 in dump functionality, but that could just
      be changed to < 0 with no change in behaviour, so I opted for the more
      efficient version.
      
      One instance of the error I've made numerous times now is also present
      in net/phonet/pn_netlink.c in the route_dumpit() function - it didn't
      check for <0 or <=0 and thus broke out of the loop every single time.
      I've preserved this since it will (I think) have caused the messages to
      userspace to be formatted differently with just a single message for
      every SKB returned to userspace. It's possible that this isn't needed
      for the tools that actually use this, but I don't even know what they
      are so couldn't test that changing this behaviour would be acceptable.
      Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      053c095a
  29. 27 11月, 2014 1 次提交
    • Z
      ipv6: Remove unnecessary test · 73cf0e92
      zhuyj 提交于
      The "init_net" test in function addrconf_exit_net is introduced
      in commit 44a6bd29 [Create ipv6 devconf-s for namespaces] to avoid freeing
      init_net. In commit c900a800 [ipv6: fix bad free of addrconf_init_net],
      function addrconf_init_net will allocate memory for every net regardless of
      init_net. In this case, it is unnecessary to make "init_net" test.
      
      CC: Hong Zhiguo <honkiko@gmail.com>
      CC: Octavian Purdila <opurdila@ixiacom.com>
      CC: Pavel Emelyanov <xemul@openvz.org>
      CC: Cong Wang <cwang@twopensource.com>
      Suggested-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NZhu Yanjun <Yanjun.Zhu@windriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      73cf0e92
  30. 24 11月, 2014 1 次提交