1. 24 2月, 2015 1 次提交
    • M
      ipv6: addrconf: validate new MTU before applying it · 77751427
      Marcelo Leitner 提交于
      Currently we don't check if the new MTU is valid or not and this allows
      one to configure a smaller than minimum allowed by RFCs or even bigger
      than interface own MTU, which is a problem as it may lead to packet
      drops.
      
      If you have a daemon like NetworkManager running, this may be exploited
      by remote attackers by forging RA packets with an invalid MTU, possibly
      leading to a DoS. (NetworkManager currently only validates for values
      too small, but not for too big ones.)
      
      The fix is just to make sure the new value is valid. That is, between
      IPV6_MIN_MTU and interface's MTU.
      
      Note that similar check is already performed at
      ndisc_router_discovery(), for when kernel itself parses the RA.
      Signed-off-by: NMarcelo Ricardo Leitner <mleitner@redhat.com>
      Signed-off-by: NSabrina Dubroca <sd@queasysnail.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      77751427
  2. 15 2月, 2015 1 次提交
    • M
      ipv6: fix ipv6_cow_metrics for non DST_HOST case · 3b471175
      Martin KaFai Lau 提交于
      ipv6_cow_metrics() currently assumes only DST_HOST routes require
      dynamic metrics allocation from inetpeer.  The assumption breaks
      when ndisc discovered router with RTAX_MTU and RTAX_HOPLIMIT metric.
      Refer to ndisc_router_discovery() in ndisc.c and note that dst_metric_set()
      is called after the route is created.
      
      This patch creates the metrics array (by calling dst_cow_metrics_generic) in
      ipv6_cow_metrics().
      
      Test:
      radvd.conf:
      interface qemubr0
      {
      	AdvLinkMTU 1300;
      	AdvCurHopLimit 30;
      
      	prefix fd00:face:face:face::/64
      	{
      		AdvOnLink on;
      		AdvAutonomous on;
      		AdvRouterAddr off;
      	};
      };
      
      Before:
      [root@qemu1 ~]# ip -6 r show | egrep -v unreachable
      fd00:face:face:face::/64 dev eth0  proto kernel  metric 256  expires 27sec
      fe80::/64 dev eth0  proto kernel  metric 256
      default via fe80::74df:d0ff:fe23:8ef2 dev eth0  proto ra  metric 1024  expires 27sec
      
      After:
      [root@qemu1 ~]# ip -6 r show | egrep -v unreachable
      fd00:face:face:face::/64 dev eth0  proto kernel  metric 256  expires 27sec mtu 1300
      fe80::/64 dev eth0  proto kernel  metric 256  mtu 1300
      default via fe80::74df:d0ff:fe23:8ef2 dev eth0  proto ra  metric 1024  expires 27sec mtu 1300 hoplimit 30
      
      Fixes: 8e2ec639 (ipv6: don't use inetpeer to store metrics for routes.)
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3b471175
  3. 12 2月, 2015 3 次提交
    • J
      ipv6: fix possible deadlock in ip6_fl_purge / ip6_fl_gc · 4762fb98
      Jan Stancek 提交于
      Use spin_lock_bh in ip6_fl_purge() to prevent following potentially
      deadlock scenario between ip6_fl_purge() and ip6_fl_gc() timer.
      
        =================================
        [ INFO: inconsistent lock state ]
        3.19.0 #1 Not tainted
        ---------------------------------
        inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
        swapper/5/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
         (ip6_fl_lock){+.?...}, at: [<ffffffff8171155d>] ip6_fl_gc+0x2d/0x180
        {SOFTIRQ-ON-W} state was registered at:
          [<ffffffff810ee9a0>] __lock_acquire+0x4a0/0x10b0
          [<ffffffff810efd54>] lock_acquire+0xc4/0x2b0
          [<ffffffff81751d2d>] _raw_spin_lock+0x3d/0x80
          [<ffffffff81711798>] ip6_flowlabel_net_exit+0x28/0x110
          [<ffffffff815f9759>] ops_exit_list.isra.1+0x39/0x60
          [<ffffffff815fa320>] cleanup_net+0x100/0x1e0
          [<ffffffff810ad80a>] process_one_work+0x20a/0x830
          [<ffffffff810adf4b>] worker_thread+0x11b/0x460
          [<ffffffff810b42f4>] kthread+0x104/0x120
          [<ffffffff81752bfc>] ret_from_fork+0x7c/0xb0
        irq event stamp: 84640
        hardirqs last  enabled at (84640): [<ffffffff81752080>] _raw_spin_unlock_irq+0x30/0x50
        hardirqs last disabled at (84639): [<ffffffff81751eff>] _raw_spin_lock_irq+0x1f/0x80
        softirqs last  enabled at (84628): [<ffffffff81091ad1>] _local_bh_enable+0x21/0x50
        softirqs last disabled at (84629): [<ffffffff81093b7d>] irq_exit+0x12d/0x150
      
        other info that might help us debug this:
         Possible unsafe locking scenario:
      
               CPU0
               ----
          lock(ip6_fl_lock);
          <Interrupt>
            lock(ip6_fl_lock);
      
         *** DEADLOCK ***
      Signed-off-by: NJan Stancek <jstancek@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4762fb98
    • T
      udp: Set SKB_GSO_UDP_TUNNEL* in UDP GRO path · 6db93ea1
      Tom Herbert 提交于
      Properly set GSO types and skb->encapsulation in the UDP tunnel GRO
      complete so that packets are properly represented for GSO. This sets
      SKB_GSO_UDP_TUNNEL or SKB_GSO_UDP_TUNNEL_CSUM depending on whether
      non-zero checksums were received, and sets SKB_GSO_TUNNEL_REMCSUM if
      the remote checksum option was processed.
      Signed-off-by: NTom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6db93ea1
    • V
      ipv6: Partial checksum only UDP packets · bf250a1f
      Vlad Yasevich 提交于
      ip6_append_data is used by other protocols and some of them can't
      be partially checksummed.  Only partially checksum UDP protocol.
      
      Fixes: 32dce968 (ipv6: Allow for partial checksums on non-ufo packets)
      Reported-by: NSabrina Dubroca <sd@queasysnail.net>
      Tested-by: NSabrina Dubroca <sd@queasysnail.net>
      Signed-off-by: NVladislav Yasevich <vyasevic@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bf250a1f
  4. 10 2月, 2015 2 次提交
  5. 09 2月, 2015 1 次提交
    • M
      rt6_probe_deferred: Do not depend on struct ordering · 662f5533
      Michael Büsch 提交于
      rt6_probe allocates a struct __rt6_probe_work and schedules a work handler rt6_probe_deferred.
      But rt6_probe_deferred kfree's the struct work_struct instead of struct __rt6_probe_work.
      This works, because struct work_struct is the first element of struct __rt6_probe_work.
      
      Change it to kfree struct __rt6_probe_work to not implicitly depend on
      struct work_struct being the first element.
      
      This does not affect the generated code.
      Signed-off-by: NMichael Buesch <m@bues.ch>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      662f5533
  6. 07 2月, 2015 1 次提交
  7. 06 2月, 2015 1 次提交
    • E
      net: ipv6: allow explicitly choosing optimistic addresses · c58da4c6
      Erik Kline 提交于
      RFC 4429 ("Optimistic DAD") states that optimistic addresses
      should be treated as deprecated addresses.  From section 2.1:
      
         Unless noted otherwise, components of the IPv6 protocol stack
         should treat addresses in the Optimistic state equivalently to
         those in the Deprecated state, indicating that the address is
         available for use but should not be used if another suitable
         address is available.
      
      Optimistic addresses are indeed avoided when other addresses are
      available (i.e. at source address selection time), but they have
      not heretofore been available for things like explicit bind() and
      sendmsg() with struct in6_pktinfo, etc.
      
      This change makes optimistic addresses treated more like
      deprecated addresses than tentative ones.
      Signed-off-by: NErik Kline <ek@google.com>
      Acked-by: NLorenzo Colitti <lorenzo@google.com>
      Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c58da4c6
  8. 05 2月, 2015 2 次提交
    • E
      sit: fix some __be16/u16 mismatches · a409caec
      Eric Dumazet 提交于
      Fixes following sparse warnings :
      
      net/ipv6/sit.c:1509:32: warning: incorrect type in assignment (different base types)
      net/ipv6/sit.c:1509:32:    expected restricted __be16 [usertype] sport
      net/ipv6/sit.c:1509:32:    got unsigned short
      net/ipv6/sit.c:1514:32: warning: incorrect type in assignment (different base types)
      net/ipv6/sit.c:1514:32:    expected restricted __be16 [usertype] dport
      net/ipv6/sit.c:1514:32:    got unsigned short
      net/ipv6/sit.c:1711:38: warning: incorrect type in argument 3 (different base types)
      net/ipv6/sit.c:1711:38:    expected unsigned short [unsigned] [usertype] value
      net/ipv6/sit.c:1711:38:    got restricted __be16 [usertype] sport
      net/ipv6/sit.c:1713:38: warning: incorrect type in argument 3 (different base types)
      net/ipv6/sit.c:1713:38:    expected unsigned short [unsigned] [usertype] value
      net/ipv6/sit.c:1713:38:    got restricted __be16 [usertype] dport
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a409caec
    • S
      ip6_gre: fix endianness errors in ip6gre_err · d1e158e2
      Sabrina Dubroca 提交于
      info is in network byte order, change it back to host byte order
      before use. In particular, the current code sets the MTU of the tunnel
      to a wrong (too big) value.
      
      Fixes: c12b395a ("gre: Support GRE over IPv6")
      Signed-off-by: NSabrina Dubroca <sd@queasysnail.net>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d1e158e2
  9. 04 2月, 2015 4 次提交
  10. 03 2月, 2015 7 次提交
  11. 31 1月, 2015 1 次提交
  12. 27 1月, 2015 1 次提交
    • H
      ipv6: replacing a rt6_info needs to purge possible propagated rt6_infos too · 6e9e16e6
      Hannes Frederic Sowa 提交于
      Lubomir Rintel reported that during replacing a route the interface
      reference counter isn't correctly decremented.
      
      To quote bug <https://bugzilla.kernel.org/show_bug.cgi?id=91941>:
      | [root@rhel7-5 lkundrak]# sh -x lal
      | + ip link add dev0 type dummy
      | + ip link set dev0 up
      | + ip link add dev1 type dummy
      | + ip link set dev1 up
      | + ip addr add 2001:db8:8086::2/64 dev dev0
      | + ip route add 2001:db8:8086::/48 dev dev0 proto static metric 20
      | + ip route add 2001:db8:8088::/48 dev dev1 proto static metric 10
      | + ip route replace 2001:db8:8086::/48 dev dev1 proto static metric 20
      | + ip link del dev0 type dummy
      | Message from syslogd@rhel7-5 at Jan 23 10:54:41 ...
      |  kernel:unregister_netdevice: waiting for dev0 to become free. Usage count = 2
      |
      | Message from syslogd@rhel7-5 at Jan 23 10:54:51 ...
      |  kernel:unregister_netdevice: waiting for dev0 to become free. Usage count = 2
      
      During replacement of a rt6_info we must walk all parent nodes and check
      if the to be replaced rt6_info got propagated. If so, replace it with
      an alive one.
      
      Fixes: 4a287eba ("IPv6 routing, NLM_F_* flag support: REPLACE and EXCL flags support, warn about missing CREATE flag")
      Reported-by: NLubomir Rintel <lkundrak@v3.sk>
      Signed-off-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Tested-by: NLubomir Rintel <lkundrak@v3.sk>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6e9e16e6
  13. 26 1月, 2015 3 次提交
  14. 25 1月, 2015 1 次提交
    • T
      udp: Do not require sock in udp_tunnel_xmit_skb · d998f8ef
      Tom Herbert 提交于
      The UDP tunnel transmit functions udp_tunnel_xmit_skb and
      udp_tunnel6_xmit_skb include a socket argument. The socket being
      passed to the functions (from VXLAN) is a UDP created for receive
      side. The only thing that the socket is used for in the transmit
      functions is to get the setting for checksum (enabled or zero).
      This patch removes the argument and and adds a nocheck argument
      for checksum setting. This eliminates the unnecessary dependency
      on a UDP socket for UDP tunnel transmit.
      Signed-off-by: NTom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d998f8ef
  15. 24 1月, 2015 1 次提交
  16. 20 1月, 2015 2 次提交
  17. 19 1月, 2015 1 次提交
  18. 18 1月, 2015 1 次提交
    • J
      netlink: make nlmsg_end() and genlmsg_end() void · 053c095a
      Johannes Berg 提交于
      Contrary to common expectations for an "int" return, these functions
      return only a positive value -- if used correctly they cannot even
      return 0 because the message header will necessarily be in the skb.
      
      This makes the very common pattern of
      
        if (genlmsg_end(...) < 0) { ... }
      
      be a whole bunch of dead code. Many places also simply do
      
        return nlmsg_end(...);
      
      and the caller is expected to deal with it.
      
      This also commonly (at least for me) causes errors, because it is very
      common to write
      
        if (my_function(...))
          /* error condition */
      
      and if my_function() does "return nlmsg_end()" this is of course wrong.
      
      Additionally, there's not a single place in the kernel that actually
      needs the message length returned, and if anyone needs it later then
      it'll be very easy to just use skb->len there.
      
      Remove this, and make the functions void. This removes a bunch of dead
      code as described above. The patch adds lines because I did
      
      -	return nlmsg_end(...);
      +	nlmsg_end(...);
      +	return 0;
      
      I could have preserved all the function's return values by returning
      skb->len, but instead I've audited all the places calling the affected
      functions and found that none cared. A few places actually compared
      the return value with <= 0 in dump functionality, but that could just
      be changed to < 0 with no change in behaviour, so I opted for the more
      efficient version.
      
      One instance of the error I've made numerous times now is also present
      in net/phonet/pn_netlink.c in the route_dumpit() function - it didn't
      check for <0 or <=0 and thus broke out of the loop every single time.
      I've preserved this since it will (I think) have caused the messages to
      userspace to be formatted differently with just a single message for
      every SKB returned to userspace. It's possible that this isn't needed
      for the tools that actually use this, but I don't even know what they
      are so couldn't test that changing this behaviour would be acceptable.
      Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      053c095a
  19. 16 1月, 2015 1 次提交
  20. 15 1月, 2015 1 次提交
  21. 06 1月, 2015 4 次提交
    • D
      net: tcp: add per route congestion control · 81164413
      Daniel Borkmann 提交于
      This work adds the possibility to define a per route/destination
      congestion control algorithm. Generally, this opens up the possibility
      for a machine with different links to enforce specific congestion
      control algorithms with optimal strategies for each of them based
      on their network characteristics, even transparently for a single
      application listening on all links.
      
      For our specific use case, this additionally facilitates deployment
      of DCTCP, for example, applications can easily serve internal
      traffic/dsts in DCTCP and external one with CUBIC. Other scenarios
      would also allow for utilizing e.g. long living, low priority
      background flows for certain destinations/routes while still being
      able for normal traffic to utilize the default congestion control
      algorithm. We also thought about a per netns setting (where different
      defaults are possible), but given its actually a link specific
      property, we argue that a per route/destination setting is the most
      natural and flexible.
      
      The administrator can utilize this through ip-route(8) by appending
      "congctl [lock] <name>", where <name> denotes the name of a
      congestion control algorithm and the optional lock parameter allows
      to enforce the given algorithm so that applications in user space
      would not be allowed to overwrite that algorithm for that destination.
      
      The dst metric lookups are being done when a dst entry is already
      available in order to avoid a costly lookup and still before the
      algorithms are being initialized, thus overhead is very low when the
      feature is not being used. While the client side would need to drop
      the current reference on the module, on server side this can actually
      even be avoided as we just got a flat-copied socket clone.
      
      Joint work with Florian Westphal.
      Suggested-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      81164413
    • D
      net: tcp: add RTAX_CC_ALGO fib handling · ea697639
      Daniel Borkmann 提交于
      This patch adds the minimum necessary for the RTAX_CC_ALGO congestion
      control metric to be set up and dumped back to user space.
      
      While the internal representation of RTAX_CC_ALGO is handled as a u32
      key, we avoided to expose this implementation detail to user space, thus
      instead, we chose the netlink attribute that is being exchanged between
      user space to be the actual congestion control algorithm name, similarly
      as in the setsockopt(2) API in order to allow for maximum flexibility,
      even for 3rd party modules.
      
      It is a bit unfortunate that RTAX_QUICKACK used up a whole RTAX slot as
      it should have been stored in RTAX_FEATURES instead, we first thought
      about reusing it for the congestion control key, but it brings more
      complications and/or confusion than worth it.
      
      Joint work with Florian Westphal.
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ea697639
    • F
      net: fib6: convert cfg metric to u32 outside of table write lock · e715b6d3
      Florian Westphal 提交于
      Do the nla validation earlier, outside the write lock.
      
      This is needed by followup patch which needs to be able to call
      request_module (which can sleep) if needed.
      
      Joint work with Daniel Borkmann.
      Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e715b6d3
    • D
      net: fib6: fib6_commit_metrics: fix potential NULL pointer dereference · 0409c9a5
      Daniel Borkmann 提交于
      When IPv6 host routes with metrics attached are being added, we fetch
      the metrics store from the dst via COW through dst_metrics_write_ptr(),
      added through commit e5fd387a.
      
      One remaining problem here is that we actually call into inet_getpeer()
      and may end up allocating/creating a new peer from the kmemcache, which
      may fail.
      
      Example trace from perf probe (inet_getpeer:41) where create is 1:
      
      ip 6877 [002] 4221.391591: probe:inet_getpeer: (ffffffff8165e293)
        85e294 inet_getpeer.part.7 (<- kmem_cache_alloc())
        85e578 inet_getpeer
        8eb333 ipv6_cow_metrics
        8f10ff fib6_commit_metrics
      
      Therefore, a check for NULL on the return of dst_metrics_write_ptr()
      is necessary here.
      
      Joint work with Florian Westphal.
      
      Fixes: e5fd387a ("ipv6: do not overwrite inetpeer metrics prematurely")
      Cc: Michal Kubeček <mkubecek@suse.cz>
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0409c9a5