1. 05 2月, 2015 1 次提交
  2. 03 2月, 2015 2 次提交
    • F
      net: dctcp: loosen requirement to assert ECT(0) during 3WHS · 843c2fdf
      Florian Westphal 提交于
      One deployment requirement of DCTCP is to be able to run
      in a DC setting along with TCP traffic. As Glenn Judd's
      NSDI'15 paper "Attaining the Promise and Avoiding the Pitfalls
      of TCP in the Datacenter" [1] (tba) explains, one way to
      solve this on switch side is to split DCTCP and TCP traffic
      in two queues per switch port based on the DSCP: one queue
      soley intended for DCTCP traffic and one for non-DCTCP traffic.
      
      For the DCTCP queue, there's the marking threshold K as
      explained in commit e3118e83 ("net: tcp: add DCTCP congestion
      control algorithm") for RED marking ECT(0) packets with CE.
      For the non-DCTCP queue, there's f.e. a classic tail drop queue.
      As already explained in e3118e83, running DCTCP at scale
      when not marking SYN/SYN-ACK packets with ECT(0) has severe
      consequences as for non-ECT(0) packets, traversing the RED
      marking DCTCP queue will result in a severe reduction of
      connection probability.
      
      This is due to the DCTCP queue being dominated by ECT(0) traffic
      and switches handle non-ECT traffic in the RED marking queue
      after passing K as drops, where K is usually a low watermark
      in order to leave enough tailroom for bursts. Splitting DCTCP
      traffic among several queues (ECN and non-ECN queue) is being
      considered a terrible idea in the network community as it
      splits single flows across multiple network paths.
      
      Therefore, commit e3118e83 implements this on Linux as
      ECT(0) marked traffic, as we argue that marking all packets
      of a DCTCP flow is the only viable solution and also doesn't
      speak against the draft.
      
      However, recently, a DCTCP implementation for FreeBSD hit also
      their mainline kernel [2]. In order to let them play well
      together with Linux' DCTCP, we would need to loosen the
      requirement that ECT(0) has to be asserted during the 3WHS as
      not implemented in FreeBSD. This simplifies the ECN test and
      lets DCTCP work together with FreeBSD.
      
      Joint work with Daniel Borkmann.
      
        [1] https://www.usenix.org/conference/nsdi15/technical-sessions/presentation/judd
        [2] https://github.com/freebsd/freebsd/commit/8ad879445281027858a7fa706d13e458095b595fSigned-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Cc: Glenn Judd <glenn.judd@morganstanley.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      843c2fdf
    • W
      net-timestamp: no-payload option · 49ca0d8b
      Willem de Bruijn 提交于
      Add timestamping option SOF_TIMESTAMPING_OPT_TSONLY. For transmit
      timestamps, this loops timestamps on top of empty packets.
      
      Doing so reduces the pressure on SO_RCVBUF. Payload inspection and
      cmsg reception (aside from timestamps) are no longer possible. This
      works together with a follow on patch that allows administrators to
      only allow tx timestamping if it does not loop payload or metadata.
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      
      ----
      
      Changes (rfc -> v1)
        - add documentation
        - remove unnecessary skb->len test (thanks to Richard Cochran)
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      49ca0d8b
  3. 01 2月, 2015 2 次提交
  4. 31 1月, 2015 1 次提交
  5. 29 1月, 2015 1 次提交
  6. 27 1月, 2015 3 次提交
  7. 26 1月, 2015 7 次提交
  8. 25 1月, 2015 1 次提交
    • T
      udp: Do not require sock in udp_tunnel_xmit_skb · d998f8ef
      Tom Herbert 提交于
      The UDP tunnel transmit functions udp_tunnel_xmit_skb and
      udp_tunnel6_xmit_skb include a socket argument. The socket being
      passed to the functions (from VXLAN) is a UDP created for receive
      side. The only thing that the socket is used for in the transmit
      functions is to get the setting for checksum (enabled or zero).
      This patch removes the argument and and adds a nocheck argument
      for checksum setting. This eliminates the unnecessary dependency
      on a UDP socket for UDP tunnel transmit.
      Signed-off-by: NTom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d998f8ef
  9. 20 1月, 2015 2 次提交
    • F
      net: ipv4: handle DSA enabled master network devices · 728c0208
      Florian Fainelli 提交于
      The logic to configure a network interface for kernel IP
      auto-configuration is very simplistic, and does not handle the case
      where a device is stacked onto another such as with DSA. This causes the
      kernel not to open and configure the master network device in a DSA
      switch tree, and therefore slave network devices using this master
      network devices as conduit device cannot be open.
      
      This restriction comes from a check in net/dsa/slave.c, which is
      basically checking the master netdev flags for IFF_UP and returns
      -ENETDOWN if it is not the case.
      
      Automatically bringing-up DSA master network devices allows DSA slave
      network devices to be used as valid interfaces for e.g: NFS root booting
      by allowing kernel IP autoconfiguration to succeed on these interfaces.
      
      On the reverse path, make sure we do not attempt to close a DSA-enabled
      device as this would implicitely prevent the slave DSA network device
      from operating.
      Signed-off-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      728c0208
    • N
      tunnels: advertise link netns via netlink · 1728d4fa
      Nicolas Dichtel 提交于
      Implement rtnl_link_ops->get_link_net() callback so that IFLA_LINK_NETNSID is
      added to rtnetlink messages.
      Signed-off-by: NNicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1728d4fa
  10. 19 1月, 2015 1 次提交
  11. 18 1月, 2015 1 次提交
    • J
      netlink: make nlmsg_end() and genlmsg_end() void · 053c095a
      Johannes Berg 提交于
      Contrary to common expectations for an "int" return, these functions
      return only a positive value -- if used correctly they cannot even
      return 0 because the message header will necessarily be in the skb.
      
      This makes the very common pattern of
      
        if (genlmsg_end(...) < 0) { ... }
      
      be a whole bunch of dead code. Many places also simply do
      
        return nlmsg_end(...);
      
      and the caller is expected to deal with it.
      
      This also commonly (at least for me) causes errors, because it is very
      common to write
      
        if (my_function(...))
          /* error condition */
      
      and if my_function() does "return nlmsg_end()" this is of course wrong.
      
      Additionally, there's not a single place in the kernel that actually
      needs the message length returned, and if anyone needs it later then
      it'll be very easy to just use skb->len there.
      
      Remove this, and make the functions void. This removes a bunch of dead
      code as described above. The patch adds lines because I did
      
      -	return nlmsg_end(...);
      +	nlmsg_end(...);
      +	return 0;
      
      I could have preserved all the function's return values by returning
      skb->len, but instead I've audited all the places calling the affected
      functions and found that none cared. A few places actually compared
      the return value with <= 0 in dump functionality, but that could just
      be changed to < 0 with no change in behaviour, so I opted for the more
      efficient version.
      
      One instance of the error I've made numerous times now is also present
      in net/phonet/pn_netlink.c in the route_dumpit() function - it didn't
      check for <0 or <=0 and thus broke out of the loop every single time.
      I've preserved this since it will (I think) have caused the messages to
      userspace to be formatted differently with just a single message for
      every SKB returned to userspace. It's possible that this isn't needed
      for the tools that actually use this, but I don't even know what they
      are so couldn't test that changing this behaviour would be acceptable.
      Signed-off-by: NJohannes Berg <johannes.berg@intel.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      053c095a
  12. 16 1月, 2015 2 次提交
    • W
      ip: zero sockaddr returned on error queue · f812116b
      Willem de Bruijn 提交于
      The sockaddr is returned in IP(V6)_RECVERR as part of errhdr. That
      structure is defined and allocated on the stack as
      
          struct {
                  struct sock_extended_err ee;
                  struct sockaddr_in(6)    offender;
          } errhdr;
      
      The second part is only initialized for certain SO_EE_ORIGIN values.
      Always initialize it completely.
      
      An MTU exceeded error on a SOCK_RAW/IPPROTO_RAW is one example that
      would return uninitialized bytes.
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      
      ----
      
      Also verified that there is no padding between errhdr.ee and
      errhdr.offender that could leak additional kernel data.
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f812116b
    • E
      ipv4: per cpu uncached list · 5055c371
      Eric Dumazet 提交于
      RAW sockets with hdrinc suffer from contention on rt_uncached_lock
      spinlock.
      
      One solution is to use percpu lists, since most routes are destroyed
      by the cpu that created them.
      
      It is unclear why we even have to put these routes in uncached_list,
      as all outgoing packets should be freed when a device is dismantled.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Fixes: caacf05e ("ipv4: Properly purge netdev references on uncached routes.")
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5055c371
  13. 15 1月, 2015 1 次提交
  14. 14 1月, 2015 2 次提交
    • J
      net: rename vlan_tx_* helpers since "tx" is misleading there · df8a39de
      Jiri Pirko 提交于
      The same macros are used for rx as well. So rename it.
      Signed-off-by: NJiri Pirko <jiri@resnulli.us>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      df8a39de
    • S
      tcp: avoid reducing cwnd when ACK+DSACK is received · 08abdffa
      Sébastien Barré 提交于
      With TLP, the peer may reply to a probe with an
      ACK+D-SACK, with ack value set to tlp_high_seq. In the current code,
      such ACK+DSACK will be missed and only at next, higher ack will the TLP
      episode be considered done. Since the DSACK is not present anymore,
      this will cost a cwnd reduction.
      
      This patch ensures that this scenario does not cause a cwnd reduction, since
      receiving an ACK+DSACK indicates that both the initial segment and the probe
      have been received by the peer.
      
      The following packetdrill test, from Neal Cardwell, validates this patch:
      
      // Establish a connection.
      0     socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
      +0     setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
      +0    bind(3, ..., ...) = 0
      +0    listen(3, 1) = 0
      
      +0    < S 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7>
      +0    > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 6>
      +.020 < . 1:1(0) ack 1 win 257
      +0    accept(3, ..., ...) = 4
      
      // Send 1 packet.
      +0    write(4, ..., 1000) = 1000
      +0    > P. 1:1001(1000) ack 1
      
      // Loss probe retransmission.
      // packets_out == 1 => schedule PTO in max(2*RTT, 1.5*RTT + 200ms)
      // In this case, this means: 1.5*RTT + 200ms = 230ms
      +.230 > P. 1:1001(1000) ack 1
      +0    %{ assert tcpi_snd_cwnd == 10 }%
      
      // Receiver ACKs at tlp_high_seq with a DSACK,
      // indicating they received the original packet and probe.
      +.020 < . 1:1(0) ack 1001 win 257 <sack 1:1001,nop,nop>
      +0    %{ assert tcpi_snd_cwnd == 10 }%
      
      // Send another packet.
      +0    write(4, ..., 1000) = 1000
      +0    > P. 1001:2001(1000) ack 1
      
      // Receiver ACKs above tlp_high_seq, which should end the TLP episode
      // if we haven't already. We should not reduce cwnd.
      +.020 < . 1:1(0) ack 2001 win 257
      +0    %{ assert tcpi_snd_cwnd == 10, tcpi_snd_cwnd }%
      
      Credits:
      -Gregory helped in finding that tcp_process_tlp_ack was where the cwnd
      got reduced in our MPTCP tests.
      -Neal wrote the packetdrill test above
      -Yuchung reworked the patch to make it more readable.
      
      Cc: Gregory Detal <gregory.detal@uclouvain.be>
      Cc: Nandita Dukkipati <nanditad@google.com>
      Tested-by: NNeal Cardwell <ncardwell@google.com>
      Reviewed-by: NYuchung Cheng <ycheng@google.com>
      Reviewed-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NSébastien Barré <sebastien.barre@uclouvain.be>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      08abdffa
  15. 06 1月, 2015 8 次提交
    • D
      net: tcp: add per route congestion control · 81164413
      Daniel Borkmann 提交于
      This work adds the possibility to define a per route/destination
      congestion control algorithm. Generally, this opens up the possibility
      for a machine with different links to enforce specific congestion
      control algorithms with optimal strategies for each of them based
      on their network characteristics, even transparently for a single
      application listening on all links.
      
      For our specific use case, this additionally facilitates deployment
      of DCTCP, for example, applications can easily serve internal
      traffic/dsts in DCTCP and external one with CUBIC. Other scenarios
      would also allow for utilizing e.g. long living, low priority
      background flows for certain destinations/routes while still being
      able for normal traffic to utilize the default congestion control
      algorithm. We also thought about a per netns setting (where different
      defaults are possible), but given its actually a link specific
      property, we argue that a per route/destination setting is the most
      natural and flexible.
      
      The administrator can utilize this through ip-route(8) by appending
      "congctl [lock] <name>", where <name> denotes the name of a
      congestion control algorithm and the optional lock parameter allows
      to enforce the given algorithm so that applications in user space
      would not be allowed to overwrite that algorithm for that destination.
      
      The dst metric lookups are being done when a dst entry is already
      available in order to avoid a costly lookup and still before the
      algorithms are being initialized, thus overhead is very low when the
      feature is not being used. While the client side would need to drop
      the current reference on the module, on server side this can actually
      even be avoided as we just got a flat-copied socket clone.
      
      Joint work with Florian Westphal.
      Suggested-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      81164413
    • D
      net: tcp: add RTAX_CC_ALGO fib handling · ea697639
      Daniel Borkmann 提交于
      This patch adds the minimum necessary for the RTAX_CC_ALGO congestion
      control metric to be set up and dumped back to user space.
      
      While the internal representation of RTAX_CC_ALGO is handled as a u32
      key, we avoided to expose this implementation detail to user space, thus
      instead, we chose the netlink attribute that is being exchanged between
      user space to be the actual congestion control algorithm name, similarly
      as in the setsockopt(2) API in order to allow for maximum flexibility,
      even for 3rd party modules.
      
      It is a bit unfortunate that RTAX_QUICKACK used up a whole RTAX slot as
      it should have been stored in RTAX_FEATURES instead, we first thought
      about reusing it for the congestion control key, but it brings more
      complications and/or confusion than worth it.
      
      Joint work with Florian Westphal.
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ea697639
    • D
      net: tcp: add key management to congestion control · c5c6a8ab
      Daniel Borkmann 提交于
      This patch adds necessary infrastructure to the congestion control
      framework for later per route congestion control support.
      
      For a per route congestion control possibility, our aim is to store
      a unique u32 key identifier into dst metrics, which can then be
      mapped into a tcp_congestion_ops struct. We argue that having a
      RTAX key entry is the most simple, generic and easy way to manage,
      and also keeps the memory footprint of dst entries lower on 64 bit
      than with storing a pointer directly, for example. Having a unique
      key id also allows for decoupling actual TCP congestion control
      module management from the FIB layer, i.e. we don't have to care
      about expensive module refcounting inside the FIB at this point.
      
      We first thought of using an IDR store for the realization, which
      takes over dynamic assignment of unused key space and also performs
      the key to pointer mapping in RCU. While doing so, we stumbled upon
      the issue that due to the nature of dynamic key distribution, it
      just so happens, arguably in very rare occasions, that excessive
      module loads and unloads can lead to a possible reuse of previously
      used key space. Thus, previously stale keys in the dst metric are
      now being reassigned to a different congestion control algorithm,
      which might lead to unexpected behaviour. One way to resolve this
      would have been to walk FIBs on the actually rare occasion of a
      module unload and reset the metric keys for each FIB in each netns,
      but that's just very costly.
      
      Therefore, we argue a better solution is to reuse the unique
      congestion control algorithm name member and map that into u32 key
      space through jhash. For that, we split the flags attribute (as it
      currently uses 2 bits only anyway) into two u32 attributes, flags
      and key, so that we can keep the cacheline boundary of 2 cachelines
      on x86_64 and cache the precalculated key at registration time for
      the fast path. On average we might expect 2 - 4 modules being loaded
      worst case perhaps 15, so a key collision possibility is extremely
      low, and guaranteed collision-free on LE/BE for all in-tree modules.
      Overall this results in much simpler code, and all without the
      overhead of an IDR. Due to the deterministic nature, modules can
      now be unloaded, the congestion control algorithm for a specific
      but unloaded key will fall back to the default one, and on module
      reload time it will switch back to the expected algorithm
      transparently.
      
      Joint work with Florian Westphal.
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c5c6a8ab
    • D
      net: tcp: refactor reinitialization of congestion control · 29ba4fff
      Daniel Borkmann 提交于
      We can just move this to an extra function and make the code
      a bit more readable, no functional change.
      
      Joint work with Florian Westphal.
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NDaniel Borkmann <dborkman@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      29ba4fff
    • T
      ip: Add offset parameter to ip_cmsg_recv · ad6f939a
      Tom Herbert 提交于
      Add ip_cmsg_recv_offset function which takes an offset argument
      that indicates the starting offset in skb where data is being received
      from. This will be useful in the case of UDP and provided checksum
      to user space.
      
      ip_cmsg_recv is an inline call to ip_cmsg_recv_offset with offset of
      zero.
      Signed-off-by: NTom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ad6f939a
    • T
      ip: Add offset parameter to ip_cmsg_recv · 5961de9f
      Tom Herbert 提交于
      Add ip_cmsg_recv_offset function which takes an offset argument
      that indicates the starting offset in skb where data is being received
      from. This will be useful in the case of UDP and provided checksum
      to user space.
      
      ip_cmsg_recv is an inline call to ip_cmsg_recv_offset with offset of
      zero.
      Signed-off-by: NTom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5961de9f
    • T
      ip: IP cmsg cleanup · c44d13d6
      Tom Herbert 提交于
      Move the IP_CMSG_* constants from ip_sockglue.c to inet_sock.h so that
      they can be referenced in other source files.
      
      Restructure ip_cmsg_recv to not go through flags using shift, check
      for flags by 'and'. This eliminates both the shift and a conditional
      per flag check.
      Signed-off-by: NTom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c44d13d6
    • T
      ip: Move checksum convert defines to inet · 224d019c
      Tom Herbert 提交于
      Move convert_csum from udp_sock to inet_sock. This allows the
      possibility that we can use convert checksum for different types
      of sockets and also allows convert checksum to be enabled from
      inet layer (what we'll want to do when enabling IP_CHECKSUM cmsg).
      Signed-off-by: NTom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      224d019c
  16. 05 1月, 2015 4 次提交
    • J
      geneve: Check family when reusing sockets. · 46b1e4f9
      Jesse Gross 提交于
      When searching for an existing socket to reuse, the address family
      is not taken into account - only port number. This means that an
      IPv4 socket could be used for IPv6 traffic and vice versa, which
      is sure to cause problems when passing packets.
      
      It is not possible to trigger this problem currently because the
      only user of Geneve creates just IPv4 sockets. However, that is
      likely to change in the near future.
      Signed-off-by: NJesse Gross <jesse@nicira.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      46b1e4f9
    • J
      geneve: Remove socket hash table. · df5dba8e
      Jesse Gross 提交于
      The hash table for open Geneve ports is used only on creation and
      deletion time. It is not performance critical and is not likely to
      grow to a large number of items. Therefore, this can be changed
      to use a simple linked list.
      Signed-off-by: NJesse Gross <jesse@nicira.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      df5dba8e
    • J
      geneve: Simplify locking. · 829a3ada
      Jesse Gross 提交于
      The existing Geneve locking scheme was pulled over directly from
      VXLAN. However, VXLAN has a number of built in mechanisms which make
      the locking more complex and are unlikely to be necessary with Geneve.
      This simplifies the locking to use a basic scheme of a mutex
      when doing updates plus RCU on receive.
      
      In addition to making the code easier to read, this also avoids the
      possibility of a race when creating or destroying sockets since
      UDP sockets and the list of Geneve sockets are protected by different
      locks. After this change, the entire operation is atomic.
      Signed-off-by: NJesse Gross <jesse@nicira.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      829a3ada
    • J
      geneve: Remove workqueue. · 61f3cade
      Jesse Gross 提交于
      The work queue is used only to free the UDP socket upon destruction.
      This is not necessary with Geneve and generally makes the code more
      difficult to reason about. It also introduces nondeterministic
      behavior such as when a socket is rapidly deleted and recreated, which
      could fail as the the deletion happens asynchronously.
      Signed-off-by: NJesse Gross <jesse@nicira.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      61f3cade
  17. 03 1月, 2015 1 次提交