  1. 31 Jan, 2015 1 commit
  2. 29 Jan, 2015 1 commit
  3. 27 Jan, 2015 3 commits
  4. 26 Jan, 2015 7 commits
  5. 25 Jan, 2015 1 commit
    • udp: Do not require sock in udp_tunnel_xmit_skb · d998f8ef
      Tom Herbert committed
      The UDP tunnel transmit functions udp_tunnel_xmit_skb and
      udp_tunnel6_xmit_skb include a socket argument. The socket being
      passed to the functions (from VXLAN) is a UDP created for receive
      side. The only thing that the socket is used for in the transmit
      functions is to get the setting for checksum (enabled or zero).
      This patch removes the argument and adds a nocheck argument
      for checksum setting. This eliminates the unnecessary dependency
      on a UDP socket for UDP tunnel transmit.
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. 20 Jan, 2015 2 commits
    • net: ipv4: handle DSA enabled master network devices · 728c0208
      Florian Fainelli committed
      The logic to configure a network interface for kernel IP
      auto-configuration is very simplistic, and does not handle the case
      where a device is stacked onto another such as with DSA. This causes the
      kernel not to open and configure the master network device in a DSA
      switch tree, and therefore slave network devices using this master
      network device as a conduit device cannot be opened.
      
      This restriction comes from a check in net/dsa/slave.c, which is
      basically checking the master netdev flags for IFF_UP and returns
      -ENETDOWN if it is not the case.
      
      Automatically bringing-up DSA master network devices allows DSA slave
      network devices to be used as valid interfaces for e.g. NFS root booting
      by allowing kernel IP autoconfiguration to succeed on these interfaces.

      On the reverse path, make sure we do not attempt to close a DSA-enabled
      device as this would implicitly prevent the slave DSA network device
      from operating.
      Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tunnels: advertise link netns via netlink · 1728d4fa
      Nicolas Dichtel committed
      Implement rtnl_link_ops->get_link_net() callback so that IFLA_LINK_NETNSID is
      added to rtnetlink messages.
      Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  7. 19 Jan, 2015 1 commit
  8. 18 Jan, 2015 1 commit
    • netlink: make nlmsg_end() and genlmsg_end() void · 053c095a
      Johannes Berg committed
      Contrary to common expectations for an "int" return, these functions
      return only a positive value -- if used correctly they cannot even
      return 0 because the message header will necessarily be in the skb.
      
      This makes the very common pattern of
      
        if (genlmsg_end(...) < 0) { ... }
      
      be a whole bunch of dead code. Many places also simply do
      
        return nlmsg_end(...);
      
      and the caller is expected to deal with it.
      
      This also commonly (at least for me) causes errors, because it is very
      common to write
      
        if (my_function(...))
          /* error condition */
      
      and if my_function() does "return nlmsg_end()" this is of course wrong.
      
      Additionally, there's not a single place in the kernel that actually
      needs the message length returned, and if anyone needs it later then
      it'll be very easy to just use skb->len there.
      
      Remove this, and make the functions void. This removes a bunch of dead
      code as described above. The patch adds lines because I did
      
      -	return nlmsg_end(...);
      +	nlmsg_end(...);
      +	return 0;
      
      I could have preserved all the function's return values by returning
      skb->len, but instead I've audited all the places calling the affected
      functions and found that none cared. A few places actually compared
      the return value with <= 0 in dump functionality, but that could just
      be changed to < 0 with no change in behaviour, so I opted for the more
      efficient version.
      
      One instance of the error I've made numerous times now is also present
      in net/phonet/pn_netlink.c in the route_dumpit() function - it didn't
      check for <0 or <=0 and thus broke out of the loop every single time.
      I've preserved this since it will (I think) have caused the messages to
      userspace to be formatted differently with just a single message for
      every SKB returned to userspace. It's possible that this isn't needed
      for the tools that actually use this, but I don't even know what they
      are so couldn't test that changing this behaviour would be acceptable.
      Signed-off-by: Johannes Berg <johannes.berg@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
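The pitfall described in this commit message can be illustrated with a small user-space sketch. Everything here is hypothetical: `struct msg_buf`, `old_msg_end()`, `new_msg_end()`, `fill_old()` and `fill_new()` are stand-ins invented for the illustration and are not the kernel's `sk_buff`, `nlmsg_end()` or `genlmsg_end()`. The sketch only shows why forwarding an always-positive length as a status value misleads callers that write `if (my_function(...))`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a message buffer; not the kernel's sk_buff. */
struct msg_buf {
	size_t len;	/* bytes used so far, header included */
};

/* Old style: finalize and return the (always positive) message length. */
static int old_msg_end(struct msg_buf *buf, size_t payload)
{
	assert(buf != NULL);
	buf->len += payload;
	return (int)buf->len;
}

/* New style: finalizing cannot fail, so nothing is returned. */
static void new_msg_end(struct msg_buf *buf, size_t payload)
{
	assert(buf != NULL);
	buf->len += payload;
}

/* Error-prone pattern: the length is forwarded as if it were a status. */
static int fill_old(struct msg_buf *buf)
{
	return old_msg_end(buf, 16);
}

/* Pattern after the change: finalize, then report success explicitly. */
static int fill_new(struct msg_buf *buf)
{
	new_msg_end(buf, 16);
	return 0;
}
```

A caller that tests `if (fill_old(&buf))` takes the error branch on every successful fill, whereas `if (fill_new(&buf))` behaves as the usual zero-on-success convention suggests.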
  9. 16 Jan, 2015 2 commits
    • ip: zero sockaddr returned on error queue · f812116b
      Willem de Bruijn committed
      The sockaddr is returned in IP(V6)_RECVERR as part of errhdr. That
      structure is defined and allocated on the stack as
      
          struct {
                  struct sock_extended_err ee;
                  struct sockaddr_in(6)    offender;
          } errhdr;
      
      The second part is only initialized for certain SO_EE_ORIGIN values.
      Always initialize it completely.
      
      An MTU exceeded error on a SOCK_RAW/IPPROTO_RAW is one example that
      would return uninitialized bytes.
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      
      ----
      
      Also verified that there is no padding between errhdr.ee and
      errhdr.offender that could leak additional kernel data.
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
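A minimal user-space sketch of the idea in this commit message follows. The types `struct ext_err`, `struct offender_addr` and `struct errhdr` are simplified, hypothetical stand-ins for `sock_extended_err` and `sockaddr_in`, and `build_errhdr()` and `errhdr_has_padding()` are invented helpers. The sketch zeroes the whole structure before filling it, which is a slightly broader approach than the message describes, and it checks the same "no padding between the two members" property that the note above mentions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-in for sock_extended_err. */
struct ext_err {
	uint32_t ee_errno;
	uint8_t  ee_origin;
	uint8_t  ee_type;
	uint8_t  ee_code;
	uint8_t  ee_pad;
	uint32_t ee_info;
	uint32_t ee_data;
};

/* Hypothetical, simplified stand-in for sockaddr_in. */
struct offender_addr {
	uint16_t family;
	uint16_t port;
	uint32_t addr;
	uint8_t  zero[8];
};

struct errhdr {
	struct ext_err       ee;
	struct offender_addr offender;
};

/* Returns nonzero if the compiler inserted padding between the members. */
static int errhdr_has_padding(void)
{
	return offsetof(struct errhdr, offender) != sizeof(struct ext_err);
}

/* Fill the header; the offender is only set when one is known. */
static void build_errhdr(struct errhdr *hdr, uint32_t err,
			 int have_offender, uint32_t addr)
{
	assert(hdr != NULL);
	/* Zero everything first so no stale stack bytes reach the caller. */
	memset(hdr, 0, sizeof(*hdr));
	hdr->ee.ee_errno = err;
	if (have_offender) {
		hdr->offender.family = 2;	/* illustrative family value */
		hdr->offender.addr = addr;
	}
}
```

Without the `memset()`, the `offender` bytes would keep whatever the stack held whenever `have_offender` is zero, which is the kind of leak the commit closes.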
    • ipv4: per cpu uncached list · 5055c371
      Eric Dumazet committed
      RAW sockets with hdrinc suffer from contention on rt_uncached_lock
      spinlock.
      
      One solution is to use percpu lists, since most routes are destroyed
      by the cpu that created them.
      
      It is unclear why we even have to put these routes in uncached_list,
      as all outgoing packets should be freed when a device is dismantled.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Fixes: caacf05e ("ipv4: Properly purge netdev references on uncached routes.")
      Signed-off-by: David S. Miller <davem@davemloft.net>
  10. 15 Jan, 2015 1 commit
  11. 14 Jan, 2015 2 commits
    • net: rename vlan_tx_* helpers since "tx" is misleading there · df8a39de
      Jiri Pirko committed
      The same macros are used for rx as well, so rename them.
      Signed-off-by: Jiri Pirko <jiri@resnulli.us>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: avoid reducing cwnd when ACK+DSACK is received · 08abdffa
      Sébastien Barré committed
      With TLP, the peer may reply to a probe with an
      ACK+D-SACK, with ack value set to tlp_high_seq. In the current code,
      such ACK+DSACK will be missed and only at next, higher ack will the TLP
      episode be considered done. Since the DSACK is not present anymore,
      this will cost a cwnd reduction.
      
      This patch ensures that this scenario does not cause a cwnd reduction, since
      receiving an ACK+DSACK indicates that both the initial segment and the probe
      have been received by the peer.
      
      The following packetdrill test, from Neal Cardwell, validates this patch:
      
      // Establish a connection.
      0     socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
      +0     setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
      +0    bind(3, ..., ...) = 0
      +0    listen(3, 1) = 0
      
      +0    < S 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7>
      +0    > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 6>
      +.020 < . 1:1(0) ack 1 win 257
      +0    accept(3, ..., ...) = 4
      
      // Send 1 packet.
      +0    write(4, ..., 1000) = 1000
      +0    > P. 1:1001(1000) ack 1
      
      // Loss probe retransmission.
      // packets_out == 1 => schedule PTO in max(2*RTT, 1.5*RTT + 200ms)
      // In this case, this means: 1.5*RTT + 200ms = 230ms
      +.230 > P. 1:1001(1000) ack 1
      +0    %{ assert tcpi_snd_cwnd == 10 }%
      
      // Receiver ACKs at tlp_high_seq with a DSACK,
      // indicating they received the original packet and probe.
      +.020 < . 1:1(0) ack 1001 win 257 <sack 1:1001,nop,nop>
      +0    %{ assert tcpi_snd_cwnd == 10 }%
      
      // Send another packet.
      +0    write(4, ..., 1000) = 1000
      +0    > P. 1001:2001(1000) ack 1
      
      // Receiver ACKs above tlp_high_seq, which should end the TLP episode
      // if we haven't already. We should not reduce cwnd.
      +.020 < . 1:1(0) ack 2001 win 257
      +0    %{ assert tcpi_snd_cwnd == 10, tcpi_snd_cwnd }%
      
      Credits:
      -Gregory helped in finding that tcp_process_tlp_ack was where the cwnd
      got reduced in our MPTCP tests.
      -Neal wrote the packetdrill test above
      -Yuchung reworked the patch to make it more readable.
      
      Cc: Gregory Detal <gregory.detal@uclouvain.be>
      Cc: Nandita Dukkipati <nanditad@google.com>
      Tested-by: Neal Cardwell <ncardwell@google.com>
      Reviewed-by: Yuchung Cheng <ycheng@google.com>
      Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: Sébastien Barré <sebastien.barre@uclouvain.be>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
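The probe timeout quoted in the packetdrill comments, max(2*RTT, 1.5*RTT + 200ms) for a single packet in flight, can be checked with a few lines of C. This is only an arithmetic sketch: `pto_single_packet_ms()` is a hypothetical helper, the 200 ms constant is taken from the test comment, and the upper bound on the RTT is an arbitrary guard added for the illustration.

```c
#include <assert.h>

/*
 * Probe timeout in milliseconds when exactly one packet is in flight,
 * following the formula in the test comment:
 *   max(2 * RTT, 1.5 * RTT + 200ms)
 */
static unsigned int pto_single_packet_ms(unsigned int rtt_ms)
{
	unsigned int two_rtt;
	unsigned int delayed;

	/* Arbitrary bound for this sketch; keeps the arithmetic from overflowing. */
	assert(rtt_ms <= 1000000u);

	two_rtt = 2 * rtt_ms;
	delayed = rtt_ms + rtt_ms / 2 + 200;	/* 1.5 * RTT + 200 ms */
	return two_rtt > delayed ? two_rtt : delayed;
}
```

With the 20 ms RTT established by the handshake in the test, the second term dominates and the result is 230 ms, which matches the `+.230` retransmission step.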
  12. 06 Jan, 2015 8 commits
    • net: tcp: add per route congestion control · 81164413
      Daniel Borkmann committed
      This work adds the possibility to define a per route/destination
      congestion control algorithm. Generally, this opens up the possibility
      for a machine with different links to enforce specific congestion
      control algorithms with optimal strategies for each of them based
      on their network characteristics, even transparently for a single
      application listening on all links.
      
      For our specific use case, this additionally facilitates deployment
      of DCTCP, for example, applications can easily serve internal
      traffic/dsts in DCTCP and external one with CUBIC. Other scenarios
      would also allow for utilizing e.g. long living, low priority
      background flows for certain destinations/routes while still being
      able for normal traffic to utilize the default congestion control
      algorithm. We also thought about a per netns setting (where different
      defaults are possible), but given it's actually a link specific
      property, we argue that a per route/destination setting is the most
      natural and flexible.
      
      The administrator can utilize this through ip-route(8) by appending
      "congctl [lock] <name>", where <name> denotes the name of a
      congestion control algorithm and the optional lock parameter allows
      to enforce the given algorithm so that applications in user space
      would not be allowed to overwrite that algorithm for that destination.
      
      The dst metric lookups are being done when a dst entry is already
      available in order to avoid a costly lookup and still before the
      algorithms are being initialized, thus overhead is very low when the
      feature is not being used. While the client side would need to drop
      the current reference on the module, on server side this can actually
      even be avoided as we just got a flat-copied socket clone.
      
      Joint work with Florian Westphal.
      Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: tcp: add RTAX_CC_ALGO fib handling · ea697639
      Daniel Borkmann committed
      This patch adds the minimum necessary for the RTAX_CC_ALGO congestion
      control metric to be set up and dumped back to user space.
      
      While the internal representation of RTAX_CC_ALGO is handled as a u32
      key, we avoided exposing this implementation detail to user space.
      Instead, we chose the netlink attribute that is exchanged with
      user space to be the actual congestion control algorithm name, similar
      to the setsockopt(2) API, in order to allow for maximum flexibility,
      even for 3rd party modules.

      It is a bit unfortunate that RTAX_QUICKACK used up a whole RTAX slot as
      it should have been stored in RTAX_FEATURES instead. We first thought
      about reusing it for the congestion control key, but that brings more
      complications and/or confusion than it is worth.
      
      Joint work with Florian Westphal.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: tcp: add key management to congestion control · c5c6a8ab
      Daniel Borkmann committed
      This patch adds necessary infrastructure to the congestion control
      framework for later per route congestion control support.
      
      For a per route congestion control possibility, our aim is to store
      a unique u32 key identifier into dst metrics, which can then be
      mapped into a tcp_congestion_ops struct. We argue that having a
      RTAX key entry is the most simple, generic and easy way to manage,
      and also keeps the memory footprint of dst entries lower on 64 bit
      than with storing a pointer directly, for example. Having a unique
      key id also allows for decoupling actual TCP congestion control
      module management from the FIB layer, i.e. we don't have to care
      about expensive module refcounting inside the FIB at this point.
      
      We first thought of using an IDR store for the realization, which
      takes over dynamic assignment of unused key space and also performs
      the key to pointer mapping in RCU. While doing so, we stumbled upon
      the issue that due to the nature of dynamic key distribution, it
      just so happens, arguably in very rare occasions, that excessive
      module loads and unloads can lead to a possible reuse of previously
      used key space. Thus, previously stale keys in the dst metric are
      now being reassigned to a different congestion control algorithm,
      which might lead to unexpected behaviour. One way to resolve this
      would have been to walk FIBs on the actually rare occasion of a
      module unload and reset the metric keys for each FIB in each netns,
      but that's just very costly.
      
      Therefore, we argue a better solution is to reuse the unique
      congestion control algorithm name member and map that into u32 key
      space through jhash. For that, we split the flags attribute (as it
      currently uses 2 bits only anyway) into two u32 attributes, flags
      and key, so that we can keep the cacheline boundary of 2 cachelines
      on x86_64 and cache the precalculated key at registration time for
      the fast path. On average we might expect 2 - 4 modules being loaded,
      in the worst case perhaps 15, so the possibility of a key collision is
      extremely low, and guaranteed collision-free on LE/BE for all in-tree modules.
      Overall this results in much simpler code, and all without the
      overhead of an IDR. Due to the deterministic nature, modules can
      now be unloaded, the congestion control algorithm for a specific
      but unloaded key will fall back to the default one, and on module
      reload time it will switch back to the expected algorithm
      transparently.
      
      Joint work with Florian Westphal.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
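The name-to-key scheme described in this commit message can be sketched in user space as follows. This is not the kernel implementation: the message says the key is derived with jhash, while this sketch substitutes a 32-bit FNV-1a hash so that it stays self-contained, and `struct cc_ops`, `registry`, `name_to_key()`, `cc_register()`, `cc_unregister()` and `cc_find_key()` are all hypothetical names. It shows the property the message relies on: the key is a deterministic function of the algorithm name, so a key stored elsewhere stays meaningful across unload and reload, and an unknown key falls back to the default.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ALGOS 8

/* Hypothetical, minimal congestion control descriptor. */
struct cc_ops {
	const char *name;
	uint32_t key;	/* precalculated once, at registration time */
};

static struct cc_ops registry[MAX_ALGOS];
static size_t n_algos;

/* FNV-1a, used here only as a stand-in for the kernel's jhash. */
static uint32_t name_to_key(const char *name)
{
	uint32_t h = 2166136261u;

	assert(name != NULL);
	for (; *name; name++) {
		h ^= (uint8_t)*name;
		h *= 16777619u;
	}
	return h;
}

/* Register an algorithm; fails when the table is full or the key collides. */
static int cc_register(const char *name)
{
	uint32_t key = name_to_key(name);
	size_t i;

	if (n_algos == MAX_ALGOS)
		return -1;
	for (i = 0; i < n_algos; i++)
		if (registry[i].key == key)
			return -1;
	registry[n_algos].name = name;
	registry[n_algos].key = key;
	n_algos++;
	return 0;
}

/* Remove an algorithm; keys stored elsewhere are left untouched. */
static void cc_unregister(const char *name)
{
	uint32_t key = name_to_key(name);
	size_t i;

	for (i = 0; i < n_algos; i++) {
		if (registry[i].key == key) {
			registry[i] = registry[--n_algos];
			return;
		}
	}
}

/* Map a stored key back to an algorithm, or to the default if unloaded. */
static const struct cc_ops *cc_find_key(uint32_t key,
					const struct cc_ops *fallback)
{
	size_t i;

	for (i = 0; i < n_algos; i++)
		if (registry[i].key == key)
			return &registry[i];
	return fallback;
}
```

Because the key never depends on registration order, nothing has to walk routing tables when a module goes away, which is the cost the message wanted to avoid with dynamically assigned IDs.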
    • net: tcp: refactor reinitialization of congestion control · 29ba4fff
      Daniel Borkmann committed
      We can just move this to an extra function and make the code
      a bit more readable, no functional change.
      
      Joint work with Florian Westphal.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ip: Add offset parameter to ip_cmsg_recv · ad6f939a
      Tom Herbert committed
      Add ip_cmsg_recv_offset function which takes an offset argument
      that indicates the starting offset in skb where data is being received
      from. This will be useful in the case of UDP and for providing the
      checksum to user space.

      ip_cmsg_recv is an inline call to ip_cmsg_recv_offset with offset of
      zero.
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ip: Add offset parameter to ip_cmsg_recv · 5961de9f
      Tom Herbert committed
      Add ip_cmsg_recv_offset function which takes an offset argument
      that indicates the starting offset in skb where data is being received
      from. This will be useful in the case of UDP and for providing the
      checksum to user space.

      ip_cmsg_recv is an inline call to ip_cmsg_recv_offset with offset of
      zero.
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • ip: IP cmsg cleanup · c44d13d6
      Tom Herbert committed
      Move the IP_CMSG_* constants from ip_sockglue.c to inet_sock.h so that
      they can be referenced in other source files.
      
      Restructure ip_cmsg_recv so that it no longer walks the flags by
      shifting; instead, check each flag with a bitwise 'and'. This
      eliminates both the shift and a conditional per flag check.
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
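The two dispatch styles contrasted in this commit message can be sketched as below. The flag names `CMSG_PKTINFO`, `CMSG_TTL` and `CMSG_TOS` and both functions are hypothetical and only loosely modelled on the description; they are not the kernel's `IP_CMSG_*` constants or `ip_cmsg_recv`. Each function returns the set of flags it "handled" so that the two styles can be compared directly.

```c
#include <assert.h>

/* Hypothetical flag bits, one per ancillary-data item. */
#define CMSG_PKTINFO	(1u << 0)
#define CMSG_TTL	(1u << 1)
#define CMSG_TOS	(1u << 2)
#define CMSG_ALL	(CMSG_PKTINFO | CMSG_TTL | CMSG_TOS)

/* Shift style: test bit 0, shift, and re-check for "nothing left" each time. */
static unsigned int recv_by_shift(unsigned int flags)
{
	unsigned int handled = 0;

	assert((flags & ~CMSG_ALL) == 0);
	if (flags & 1)
		handled |= CMSG_PKTINFO;
	if ((flags >>= 1) == 0)
		return handled;
	if (flags & 1)
		handled |= CMSG_TTL;
	if ((flags >>= 1) == 0)
		return handled;
	if (flags & 1)
		handled |= CMSG_TOS;
	return handled;
}

/* Mask style: test each named flag with 'and'; exit early once none remain. */
static unsigned int recv_by_mask(unsigned int flags)
{
	unsigned int handled = 0;

	assert((flags & ~CMSG_ALL) == 0);
	if (flags & CMSG_PKTINFO) {
		handled |= CMSG_PKTINFO;
		flags &= ~CMSG_PKTINFO;
		if (!flags)
			return handled;
	}
	if (flags & CMSG_TTL) {
		handled |= CMSG_TTL;
		flags &= ~CMSG_TTL;
		if (!flags)
			return handled;
	}
	if (flags & CMSG_TOS)
		handled |= CMSG_TOS;
	return handled;
}
```

In the mask style the "anything left?" test only runs after a flag was actually handled, and the order of the checks no longer has to match the bit positions.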
    • ip: Move checksum convert defines to inet · 224d019c
      Tom Herbert committed
      Move convert_csum from udp_sock to inet_sock. This allows the
      possibility that we can use convert checksum for different types
      of sockets and also allows convert checksum to be enabled from
      inet layer (what we'll want to do when enabling IP_CHECKSUM cmsg).
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  13. 05 Jan, 2015 4 commits
    • geneve: Check family when reusing sockets. · 46b1e4f9
      Jesse Gross committed
      When searching for an existing socket to reuse, the address family
      is not taken into account - only port number. This means that an
      IPv4 socket could be used for IPv6 traffic and vice versa, which
      is sure to cause problems when passing packets.
      
      It is not possible to trigger this problem currently because the
      only user of Geneve creates just IPv4 sockets. However, that is
      likely to change in the near future.
      Signed-off-by: Jesse Gross <jesse@nicira.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
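The lookup rule described in this commit message, matching on address family as well as port, can be sketched with a plain linked list. The names `struct tun_sock`, `FAM_V4`, `FAM_V6` and `find_sock()` are hypothetical and stand in for the real Geneve socket structures, which are not shown in the message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical family tags used only in this sketch. */
enum { FAM_V4 = 4, FAM_V6 = 6 };

/* Hypothetical entry in the list of open tunnel sockets. */
struct tun_sock {
	uint16_t port;
	int family;
	struct tun_sock *next;
};

/* Return an existing socket only if both port and family match. */
static struct tun_sock *find_sock(struct tun_sock *head, int family,
				  uint16_t port)
{
	struct tun_sock *s;

	assert(family == FAM_V4 || family == FAM_V6);
	for (s = head; s != NULL; s = s->next)
		if (s->port == port && s->family == family)
			return s;
	return NULL;
}
```

Matching on the port alone would hand an IPv4 socket to an IPv6 caller as soon as both families are in use, which is the situation the commit guards against in advance.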
    • geneve: Remove socket hash table. · df5dba8e
      Jesse Gross committed
      The hash table for open Geneve ports is used only at creation and
      deletion time. It is not performance critical and is not likely to
      grow to a large number of items. Therefore, this can be changed
      to use a simple linked list.
      Signed-off-by: Jesse Gross <jesse@nicira.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • geneve: Simplify locking. · 829a3ada
      Jesse Gross committed
      The existing Geneve locking scheme was pulled over directly from
      VXLAN. However, VXLAN has a number of built in mechanisms which make
      the locking more complex and are unlikely to be necessary with Geneve.
      This simplifies the locking to use a basic scheme of a mutex
      when doing updates plus RCU on receive.
      
      In addition to making the code easier to read, this also avoids the
      possibility of a race when creating or destroying sockets since
      UDP sockets and the list of Geneve sockets are protected by different
      locks. After this change, the entire operation is atomic.
      Signed-off-by: Jesse Gross <jesse@nicira.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • geneve: Remove workqueue. · 61f3cade
      Jesse Gross committed
      The work queue is used only to free the UDP socket upon destruction.
      This is not necessary with Geneve and generally makes the code more
      difficult to reason about. It also introduces nondeterministic
      behavior such as when a socket is rapidly deleted and recreated, which
      could fail as the deletion happens asynchronously.
      Signed-off-by: Jesse Gross <jesse@nicira.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  14. 03 Jan, 2015 2 commits
  15. 01 Jan, 2015 4 commits
    • fib_trie: Add tracking value for suffix length · 5405afd1
      Alexander Duyck committed
      This change adds a tracking value for the maximum suffix length of all
      prefixes stored in any given tnode.  With this value we can determine if we
      need to backtrace or not based on if the suffix is greater than the pos
      value.
      
      By doing this we can reduce the CPU overhead for lookups in the local table
      as many of the prefixes there are 32b long and have a suffix length of 0
      meaning we can immediately backtrace to the root node without needing to
      test any of the nodes between it and where we ended up.
      Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • fib_trie: Remove checks for index >= tnode_child_length from tnode_get_child · 21d1f11d
      Alexander Duyck committed
      For some reason the compiler doesn't seem to understand that when we are in
      a loop that runs from tnode_child_length - 1 to 0 we don't expect the value
      of tn->bits to change.  As such every call to tnode_get_child was rerunning
      tnode_child_length which ended up consuming quite a bit of space in the
      resultant assembly code.

      I have gone through and verified that in all cases where tnode_get_child
      is used we are either winding through a fixed loop from tnode_child_length -
      1 to 0, or are in a fastpath case where we are verifying the value by
      either checking for any remaining bits after shifting index by bits and
      testing for leaf, or by using tnode_child_length.
      
      size net/ipv4/fib_trie.o
      Before:
         text	   data	    bss	    dec	    hex	filename
        15506	    376	      8	  15890	   3e12	net/ipv4/fib_trie.o
      
      After:
         text	   data	    bss	    dec	    hex	filename
        14827	    376	      8	  15211	   3b6b	net/ipv4/fib_trie.o
      Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • fib_trie: inflate/halve nodes in a more RCU friendly way · 12c081a5
      Alexander Duyck committed
      This change pulls the node_set_parent functionality out of put_child_reorg
      and instead leaves that to the function to take care of as well.  By doing
      this we can fully construct the new cluster of tnodes and all of the
      pointers out of it before we start routing pointers into it.
      
      I suspect this will likely fix some concurrency issues, though I don't
      have a good test to show as such.
      Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • fib_trie: Push tnode flushing down to inflate/halve · fc86a93b
      Alexander Duyck committed
      This change pushes the tnode freeing down into the inflate and halve
      functions.  It makes more sense here as we have a better grasp of what is
      going on and when a given cluster of nodes is ready to be freed.
      
      I believe this may address a bug in the freeing logic as well.  For some
      reason if the freelist got to a certain size we would call
      synchronize_rcu().  I'm assuming that what they meant to do is call
      synchronize_rcu() after they had handed off that much memory via
      call_rcu().  As such that is what I have updated the behavior to be.
      Signed-off-by: Alexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>