1. 03 Feb, 2015 6 commits
  2. 31 Jan, 2015 1 commit
  3. 27 Jan, 2015 1 commit
    • ipv6: replacing a rt6_info needs to purge possible propagated rt6_infos too · 6e9e16e6
      Committed by Hannes Frederic Sowa
      Lubomir Rintel reported that during replacing a route the interface
      reference counter isn't correctly decremented.
      
      To quote bug <https://bugzilla.kernel.org/show_bug.cgi?id=91941>:
      | [root@rhel7-5 lkundrak]# sh -x lal
      | + ip link add dev0 type dummy
      | + ip link set dev0 up
      | + ip link add dev1 type dummy
      | + ip link set dev1 up
      | + ip addr add 2001:db8:8086::2/64 dev dev0
      | + ip route add 2001:db8:8086::/48 dev dev0 proto static metric 20
      | + ip route add 2001:db8:8088::/48 dev dev1 proto static metric 10
      | + ip route replace 2001:db8:8086::/48 dev dev1 proto static metric 20
      | + ip link del dev0 type dummy
      | Message from syslogd@rhel7-5 at Jan 23 10:54:41 ...
      |  kernel:unregister_netdevice: waiting for dev0 to become free. Usage count = 2
      |
      | Message from syslogd@rhel7-5 at Jan 23 10:54:51 ...
      |  kernel:unregister_netdevice: waiting for dev0 to become free. Usage count = 2
      
      During replacement of a rt6_info we must walk all parent nodes and check
      whether the rt6_info being replaced got propagated to them. If so,
      replace it with a live one.
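
      The parent walk described above can be illustrated with a small
      user-space model. This is only a sketch under stated assumptions:
      toy_route, toy_node and the toy_* functions are hypothetical stand-ins
      for rt6_info and the routing tree nodes, not the kernel's code, and the
      reference counting is reduced to a plain integer.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for rt6_info and a tree node; every name here is hypothetical. */
struct toy_route {
    int refcnt;               /* number of tree nodes pointing at this route */
};

struct toy_node {
    struct toy_node *parent;  /* NULL at the root */
    struct toy_route *leaf;   /* route installed at, or propagated to, this node */
};

/* Point n->leaf at r and keep both reference counts in step. */
void toy_set_leaf(struct toy_node *n, struct toy_route *r)
{
    if (r != NULL)
        r->refcnt++;
    if (n->leaf != NULL)
        n->leaf->refcnt--;
    n->leaf = r;
}

/* Replace old_rt by new_rt at node n and at every ancestor that still holds old_rt. */
void toy_replace_route(struct toy_node *n, struct toy_route *old_rt,
                       struct toy_route *new_rt)
{
    for (; n != NULL; n = n->parent) {
        if (n->leaf == old_rt)
            toy_set_leaf(n, new_rt);
    }
}

/* Usage sketch: returns the references left on the old route after a replace. */
int toy_replace_demo(void)
{
    struct toy_route old_rt = { 0 }, new_rt = { 0 };
    struct toy_node root = { NULL, NULL };
    struct toy_node child = { &root, NULL };

    toy_set_leaf(&child, &old_rt);
    toy_set_leaf(&root, &old_rt);     /* propagated copy in the parent */
    assert(old_rt.refcnt == 2);

    toy_replace_route(&child, &old_rt, &new_rt);
    assert(root.leaf == &new_rt && child.leaf == &new_rt);
    return old_rt.refcnt;
}
```

      In this model, swapping the route only at child would leave root.leaf
      holding a reference to the old route, which is the kind of leftover
      reference the report above points to.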
      
      Fixes: 4a287eba ("IPv6 routing, NLM_F_* flag support: REPLACE and EXCL flags support, warn about missing CREATE flag")
      Reported-by: Lubomir Rintel <lkundrak@v3.sk>
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Tested-by: Lubomir Rintel <lkundrak@v3.sk>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 26 Jan, 2015 3 commits
  5. 25 Jan, 2015 1 commit
    • udp: Do not require sock in udp_tunnel_xmit_skb · d998f8ef
      Committed by Tom Herbert
      The UDP tunnel transmit functions udp_tunnel_xmit_skb and
      udp_tunnel6_xmit_skb include a socket argument. The socket being
      passed to the functions (from VXLAN) is a UDP socket created for the
      receive side. The only thing that the socket is used for in the transmit
      functions is to get the setting for checksum (enabled or zero).
      This patch removes the argument and adds a nocheck argument
      for checksum setting. This eliminates the unnecessary dependency
      on a UDP socket for UDP tunnel transmit.
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. 24 Jan, 2015 1 commit
  7. 20 Jan, 2015 2 commits
  8. 19 Jan, 2015 1 commit
  9. 18 Jan, 2015 1 commit
    • netlink: make nlmsg_end() and genlmsg_end() void · 053c095a
      Committed by Johannes Berg
      Contrary to common expectations for an "int" return, these functions
      return only a positive value -- if used correctly they cannot even
      return 0 because the message header will necessarily be in the skb.
      
      This makes the very common pattern of
      
        if (genlmsg_end(...) < 0) { ... }
      
      be a whole bunch of dead code. Many places also simply do
      
        return nlmsg_end(...);
      
      and the caller is expected to deal with it.
      
      This also commonly (at least for me) causes errors, because it is very
      common to write
      
        if (my_function(...))
          /* error condition */
      
      and if my_function() does "return nlmsg_end()" this is of course wrong.
      
      Additionally, there's not a single place in the kernel that actually
      needs the message length returned, and if anyone needs it later then
      it'll be very easy to just use skb->len there.
      
      Remove this, and make the functions void. This removes a bunch of dead
      code as described above. The patch adds lines because I did
      
      -	return nlmsg_end(...);
      +	nlmsg_end(...);
      +	return 0;
      
      I could have preserved all the function's return values by returning
      skb->len, but instead I've audited all the places calling the affected
      functions and found that none cared. A few places actually compared
      the return value with <= 0 in dump functionality, but that could just
      be changed to < 0 with no change in behaviour, so I opted for the more
      efficient version.
      
      One instance of the error I've made numerous times now is also present
      in net/phonet/pn_netlink.c in the route_dumpit() function - it didn't
      check for <0 or <=0 and thus broke out of the loop every single time.
      I've preserved this since it will (I think) have caused the messages to
      userspace to be formatted differently with just a single message for
      every SKB returned to userspace. It's possible that this isn't needed
      for the tools that actually use this, but I don't even know what they
      are so couldn't test that changing this behaviour would be acceptable.
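
      The return-value pitfall described above can be reproduced in a few
      lines of plain C. This is a sketch only: msg_buf, finish_msg_old,
      finish_msg_new and the fill_* helpers are hypothetical user-space
      stand-ins, not the netlink API, and the 16-byte length is an arbitrary
      value chosen for the illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Toy message buffer; a stand-in for an skb, not the kernel structure. */
struct msg_buf {
    size_t len;
};

/* Old style: returns the message length, which is always positive. */
int finish_msg_old(struct msg_buf *m)
{
    assert(m->len > 0);   /* the header is always part of the message */
    return (int)m->len;
}

/* New style: nothing is returned, so nothing can be misread as an error. */
void finish_msg_new(struct msg_buf *m)
{
    (void)m;
}

/* Old-style fill function: passes the positive length on to its caller. */
int fill_old(struct msg_buf *m)
{
    m->len = 16;          /* pretend a 16-byte header was written */
    return finish_msg_old(m);
}

/* New-style fill function: 0 means success, as callers usually expect. */
int fill_new(struct msg_buf *m)
{
    m->len = 16;
    finish_msg_new(m);
    return 0;
}

/* Counts how many results a caller using "if (ret)" treats as errors. */
int count_false_errors(void)
{
    struct msg_buf m = { 0 };
    int errors = 0;

    if (fill_old(&m))     /* nonzero length is wrongly taken as a failure */
        errors++;
    if (fill_new(&m))     /* 0, so no error is counted */
        errors++;
    return errors;
}
```

      Only the old-style path trips the "if (ret)" check, even though both
      calls succeeded.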
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  10. 16 Jan, 2015 1 commit
  11. 15 Jan, 2015 1 commit
  12. 06 Jan, 2015 5 commits
    • net: tcp: add per route congestion control · 81164413
      Committed by Daniel Borkmann
      This work adds the possibility to define a per route/destination
      congestion control algorithm. Generally, this opens up the possibility
      for a machine with different links to enforce specific congestion
      control algorithms with optimal strategies for each of them based
      on their network characteristics, even transparently for a single
      application listening on all links.
      
      For our specific use case, this additionally facilitates deployment
      of DCTCP. For example, applications can easily serve internal
      traffic/dsts with DCTCP and external traffic with CUBIC. Other
      scenarios would also allow for long-lived, low-priority background
      flows to certain destinations/routes while normal traffic still uses
      the default congestion control algorithm. We also thought about a
      per-netns setting (where different defaults are possible), but given
      that it is actually a link-specific property, we argue that a per
      route/destination setting is the most natural and flexible.
      
      The administrator can use this through ip-route(8) by appending
      "congctl [lock] <name>", where <name> denotes the name of a
      congestion control algorithm. The optional lock parameter enforces
      the given algorithm, so that applications in user space are not
      allowed to override it for that destination.
      
      The dst metric lookups are being done when a dst entry is already
      available in order to avoid a costly lookup and still before the
      algorithms are being initialized, thus overhead is very low when the
      feature is not being used. While the client side would need to drop
      the current reference on the module, on server side this can actually
      even be avoided as we just got a flat-copied socket clone.
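
      The effect of the optional lock parameter can be modelled in user
      space. The sketch below is an illustration under assumptions, not the
      kernel implementation: route_cc, conn, conn_init_cc and conn_set_cc are
      hypothetical names, and returning -EPERM for a refused change is an
      assumption of this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define CC_NAME_MAX 16   /* hypothetical name limit for this sketch */

/* Hypothetical per-route setting, mirroring "congctl [lock] <name>". */
struct route_cc {
    char name[CC_NAME_MAX];   /* empty string: no per-route algorithm */
    bool locked;
};

/* Hypothetical per-connection state. */
struct conn {
    char cc[CC_NAME_MAX];
    bool cc_locked;
};

/* Pick the algorithm when a connection starts using a route. */
void conn_init_cc(struct conn *c, const struct route_cc *rt,
                  const char *sys_default)
{
    bool per_route = rt != NULL && rt->name[0] != '\0';

    snprintf(c->cc, sizeof(c->cc), "%s", per_route ? rt->name : sys_default);
    c->cc_locked = per_route && rt->locked;
}

/* Application request to switch algorithms; refused when the route locked it. */
int conn_set_cc(struct conn *c, const char *name)
{
    if (c->cc_locked)
        return -EPERM;   /* error code is an assumption of this sketch */
    snprintf(c->cc, sizeof(c->cc), "%s", name);
    return 0;
}

/* Usage sketch: an internal route locked to DCTCP, then a route with no setting. */
void congctl_demo(void)
{
    struct route_cc internal = { "dctcp", true };
    struct conn c;

    conn_init_cc(&c, &internal, "cubic");
    assert(strcmp(c.cc, "dctcp") == 0);
    assert(conn_set_cc(&c, "reno") == -EPERM);

    conn_init_cc(&c, NULL, "cubic");
    assert(conn_set_cc(&c, "reno") == 0);
}
```

      Without lock, the per-route algorithm is only a starting point that the
      application may still change; with lock, the change request is refused.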
      
      Joint work with Florian Westphal.
      Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: tcp: add RTAX_CC_ALGO fib handling · ea697639
      Committed by Daniel Borkmann
      This patch adds the minimum necessary for the RTAX_CC_ALGO congestion
      control metric to be set up and dumped back to user space.
      
      While the internal representation of RTAX_CC_ALGO is handled as a u32
      key, we avoided exposing this implementation detail to user space.
      Instead, the netlink attribute exchanged with user space is the actual
      congestion control algorithm name, similar to the setsockopt(2) API,
      in order to allow for maximum flexibility, even for 3rd party modules.
      
      It is a bit unfortunate that RTAX_QUICKACK used up a whole RTAX slot, as
      it should have been stored in RTAX_FEATURES instead. We first thought
      about reusing it for the congestion control key, but that brings more
      complications and/or confusion than it is worth.
      
      Joint work with Florian Westphal.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: fib6: convert cfg metric to u32 outside of table write lock · e715b6d3
      Committed by Florian Westphal
      Do the nla validation earlier, outside the write lock.
      
      This is needed by a follow-up patch, which must be able to call
      request_module (which can sleep) if needed.
      
      Joint work with Daniel Borkmann.
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: fib6: fib6_commit_metrics: fix potential NULL pointer dereference · 0409c9a5
      Committed by Daniel Borkmann
      When IPv6 host routes with metrics attached are being added, we fetch
      the metrics store from the dst via COW through dst_metrics_write_ptr(),
      added through commit e5fd387a.
      
      One remaining problem here is that we actually call into inet_getpeer()
      and may end up allocating/creating a new peer from the kmemcache, which
      may fail.
      
      Example trace from perf probe (inet_getpeer:41) where create is 1:
      
      ip 6877 [002] 4221.391591: probe:inet_getpeer: (ffffffff8165e293)
        85e294 inet_getpeer.part.7 (<- kmem_cache_alloc())
        85e578 inet_getpeer
        8eb333 ipv6_cow_metrics
        8f10ff fib6_commit_metrics
      
      Therefore, a check for NULL on the return of dst_metrics_write_ptr()
      is necessary here.
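
      The shape of such a check can be shown with a user-space stand-in. In
      this sketch, toy_dst, metrics_write_ptr and commit_metric are
      hypothetical names that only imitate the pattern, and returning -ENOMEM
      on failure is an assumption of the sketch rather than something stated
      in this message.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_METRICS 4   /* hypothetical table size for this sketch */

/* Toy route; "store" is NULL when the writable copy could not be allocated. */
struct toy_dst {
    uint32_t *store;
};

/* Stand-in for a copy-on-write accessor that may fail and return NULL. */
uint32_t *metrics_write_ptr(struct toy_dst *dst)
{
    return dst->store;
}

/* Commit one metric; returns 0 on success or a negative errno value. */
int commit_metric(struct toy_dst *dst, unsigned int idx, uint32_t val)
{
    uint32_t *mp = metrics_write_ptr(dst);

    if (mp == NULL)           /* check before the first dereference */
        return -ENOMEM;
    if (idx >= NUM_METRICS)
        return -EINVAL;
    mp[idx] = val;
    return 0;
}

/* Usage sketch: one route with a metrics store, one whose allocation failed. */
void commit_metric_demo(void)
{
    uint32_t backing[NUM_METRICS] = { 0 };
    struct toy_dst ok = { backing };
    struct toy_dst failed = { NULL };

    assert(commit_metric(&ok, 1, 1500) == 0 && backing[1] == 1500);
    assert(commit_metric(&failed, 1, 1500) == -ENOMEM);
}
```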
      
      Joint work with Florian Westphal.
      
      Fixes: e5fd387a ("ipv6: do not overwrite inetpeer metrics prematurely")
      Cc: Michal Kubeček <mkubecek@suse.cz>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ip: Move checksum convert defines to inet · 224d019c
      Committed by Tom Herbert
      Move convert_csum from udp_sock to inet_sock. This allows the
      possibility that we can use convert checksum for different types
      of sockets and also allows convert checksum to be enabled from
      inet layer (what we'll want to do when enabling IP_CHECKSUM cmsg).
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  13. 23 Dec, 2014 2 commits
  14. 11 Dec, 2014 1 commit
  15. 10 Dec, 2014 5 commits
  16. 09 Dec, 2014 2 commits
    • udp: Neaten and reduce size of compute_score functions · 60c04aec
      Committed by Joe Perches
      The compute_score functions are a bit difficult to read.
      
      Neaten them a bit to reduce object sizes and make them a
      bit more intelligible.
      
      Return early to avoid indentation and avoid unnecessary
      initializations.
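
      The early-return style mentioned above looks roughly like the sketch
      below. It is not the kernel's compute_score: toy_sock, toy_pkt and the
      score weights are hypothetical and chosen only to show the control
      flow.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, much reduced socket and packet descriptions. */
struct toy_sock {
    uint16_t port;   /* bound local port */
    uint32_t addr;   /* bound local address, 0 = any */
    int ifindex;     /* bound device, 0 = any */
};

struct toy_pkt {
    uint16_t dport;
    uint32_t daddr;
    int ifindex;
};

/* Score how well sk matches pkt; -1 means no match. Each mismatch returns at once. */
int toy_compute_score(const struct toy_sock *sk, const struct toy_pkt *pkt)
{
    int score;

    if (sk->port != pkt->dport)
        return -1;

    score = 1;                /* assigned only once a match is possible */

    if (sk->addr) {
        if (sk->addr != pkt->daddr)
            return -1;
        score += 4;
    }

    if (sk->ifindex) {
        if (sk->ifindex != pkt->ifindex)
            return -1;
        score += 4;
    }

    return score;
}

/* Usage sketch with made-up port, address and device values. */
void toy_score_demo(void)
{
    struct toy_pkt pkt = { 53, 0x0a000001, 2 };
    struct toy_sock wildcard = { 53, 0, 0 };
    struct toy_sock bound = { 53, 0x0a000001, 2 };
    struct toy_sock other_port = { 67, 0, 0 };

    assert(toy_compute_score(&wildcard, &pkt) == 1);
    assert(toy_compute_score(&bound, &pkt) == 9);
    assert(toy_compute_score(&other_port, &pkt) == -1);
}
```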
      
      (allyesconfig, but w/ -O2 and no profiling)
      
      $ size net/ipv[46]/udp.o.*
         text    data     bss     dec     hex filename
        28680    1184      25   29889    74c1 net/ipv4/udp.o.new
        28756    1184      25   29965    750d net/ipv4/udp.o.old
        17600    1010       2   18612    48b4 net/ipv6/udp.o.new
        17632    1010       2   18644    48d4 net/ipv6/udp.o.old
      Signed-off-by: Joe Perches <joe@perches.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net-timestamp: allow reading recv cmsg on errqueue with origin tstamp · 829ae9d6
      Committed by Willem de Bruijn
      Allow reading of timestamps and cmsg at the same time on all relevant
      socket families. One use is to correlate timestamps with egress
      device, by asking for cmsg IP_PKTINFO.
      
      On AF_INET sockets, call the relevant function (ip_cmsg_recv). To
      avoid changing legacy expectations, only do so if the caller sets a
      new timestamping flag SOF_TIMESTAMPING_OPT_CMSG.
      
      On AF_INET6 sockets, IPV6_PKTINFO and all other recv cmsg are already
      returned for all origins. The only change is to set ifindex, which is
      not initialized for all error origins.
      
      In both cases, only generate the pktinfo message if an ifindex is
      known. This is not the case for ACK timestamps.
      
      The difference between the protocol families is probably a historical
      accident as a result of the different conditions for generating cmsg
      in the relevant ip(v6)_recv_error function:
      
      ipv4:        if (serr->ee.ee_origin == SO_EE_ORIGIN_ICMP) {
      ipv6:        if (serr->ee.ee_origin != SO_EE_ORIGIN_LOCAL) {
      
      At one time, this was the same test bar for the ICMP/ICMP6
      distinction. This is no longer true.
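
      On the receiving side, an application that asked for IP_PKTINFO would
      walk the control messages of a message read from the error queue. The
      sketch below shows only that walk, using the standard CMSG_* macros;
      pktinfo_ifindex is a hypothetical helper name, and requesting the
      timestamps and calling recvmsg(2) with MSG_ERRQUEUE are left out.

```c
#define _GNU_SOURCE           /* for struct in_pktinfo with glibc */
#include <assert.h>
#include <netinet/in.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>

/*
 * Return the interface index from the first IP_PKTINFO control message in
 * msg, or -1 if there is none (for example when no ifindex is known).
 */
int pktinfo_ifindex(struct msghdr *msg)
{
    struct cmsghdr *cm;

    assert(msg != NULL);
    for (cm = CMSG_FIRSTHDR(msg); cm != NULL; cm = CMSG_NXTHDR(msg, cm)) {
        if (cm->cmsg_level == IPPROTO_IP && cm->cmsg_type == IP_PKTINFO) {
            struct in_pktinfo info;

            /* Copy out of the buffer to avoid alignment assumptions. */
            memcpy(&info, CMSG_DATA(cm), sizeof(info));
            return info.ipi_ifindex;
        }
    }
    return -1;
}
```

      A caller would pass the msghdr filled in by recvmsg(2) and treat -1 as
      "no egress device reported".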
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      
      ----
      
      Changes
        v1 -> v2
          large rewrite
          - integrate with existing pktinfo cmsg generation code
          - on ipv4: only send with new flag, to maintain legacy behavior
          - on ipv6: send at most a single pktinfo cmsg
          - on ipv6: initialize fields if not yet initialized
      
      The recv cmsg interfaces are also relevant to the discussion of
      whether looping packet headers is problematic. For v6, cmsgs that
      identify many headers are already returned. This patch expands
      that to v4. If it sounds reasonable, I will follow with patches
      
      1. request timestamps without payload with SOF_TIMESTAMPING_OPT_TSONLY
         (http://patchwork.ozlabs.org/patch/366967/)
      2. sysctl to conditionally drop all timestamps that have payload or
         cmsg from users without CAP_NET_RAW.
      Signed-off-by: David S. Miller <davem@davemloft.net>
  17. 08 Dec, 2014 2 commits
  18. 29 Nov, 2014 1 commit
  19. 27 Nov, 2014 2 commits
  20. 26 Nov, 2014 1 commit