1. 10 2月, 2015 1 次提交
  2. 09 2月, 2015 2 次提交
    • E
      net: rfs: add hash collision detection · 567e4b79
      Eric Dumazet 提交于
      Receive Flow Steering is a nice solution but suffers from
      hash collisions when a mix of connected and unconnected traffic
      is received on the host, when flow hash table is populated.
      
      Also, clearing flow in inet_release() makes RFS not very good
      for short lived flows, as many packets can follow close().
      (FIN , ACK packets, ...)
      
      This patch extends the information stored into global hash table
      to not only include cpu number, but upper part of the hash value.
      
      I use a 32bit value, and dynamically split it in two parts.
      
      For host with less than 64 possible cpus, this gives 6 bits for the
      cpu number, and 26 (32-6) bits for the upper part of the hash.
      
      Since hash bucket selection use low order bits of the hash, we have
      a full hash match, if /proc/sys/net/core/rps_sock_flow_entries is big
      enough.
      
      If the hash found in flow table does not match, we fallback to RPS (if
      it is enabled for the rxqueue).
      
      This means that a packet for an non connected flow can avoid the
      IPI through a unrelated/victim CPU.
      
      This also means we no longer have to clear the table at socket
      close time, and this helps short lived flows performance.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NTom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      567e4b79
    • S
      gre/ipip: use be16 variants of netlink functions · 3e97fa70
      Sabrina Dubroca 提交于
      encap.sport and encap.dport are __be16, use nla_{get,put}_be16 instead
      of nla_{get,put}_u16.
      
      Fixes the sparse warnings:
      
      warning: incorrect type in assignment (different base types)
         expected restricted __be32 [addressable] [usertype] o_key
         got restricted __be16 [addressable] [usertype] i_flags
      warning: incorrect type in assignment (different base types)
         expected restricted __be16 [usertype] sport
         got unsigned short
      warning: incorrect type in assignment (different base types)
         expected restricted __be16 [usertype] dport
         got unsigned short
      warning: incorrect type in argument 3 (different base types)
         expected unsigned short [unsigned] [usertype] value
         got restricted __be16 [usertype] sport
      warning: incorrect type in argument 3 (different base types)
         expected unsigned short [unsigned] [usertype] value
         got restricted __be16 [usertype] dport
      Signed-off-by: NSabrina Dubroca <sd@queasysnail.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3e97fa70
  3. 08 2月, 2015 4 次提交
    • N
      tcp: mitigate ACK loops for connections as tcp_timewait_sock · 4fb17a60
      Neal Cardwell 提交于
      Ensure that in state FIN_WAIT2 or TIME_WAIT, where the connection is
      represented by a tcp_timewait_sock, we rate limit dupacks in response
      to incoming packets (a) with TCP timestamps that fail PAWS checks, or
      (b) with sequence numbers that are out of the acceptable window.
      
      We do not send a dupack in response to out-of-window packets if it has
      been less than sysctl_tcp_invalid_ratelimit (default 500ms) since we
      last sent a dupack in response to an out-of-window packet.
      Reported-by: NAvery Fay <avery@mixpanel.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4fb17a60
    • N
      tcp: mitigate ACK loops for connections as tcp_sock · f2b2c582
      Neal Cardwell 提交于
      Ensure that in state ESTABLISHED, where the connection is represented
      by a tcp_sock, we rate limit dupacks in response to incoming packets
      (a) with TCP timestamps that fail PAWS checks, or (b) with sequence
      numbers or ACK numbers that are out of the acceptable window.
      
      We do not send a dupack in response to out-of-window packets if it has
      been less than sysctl_tcp_invalid_ratelimit (default 500ms) since we
      last sent a dupack in response to an out-of-window packet.
      
      There is already a similar (although global) rate-limiting mechanism
      for "challenge ACKs". When deciding whether to send a challence ACK,
      we first consult the new per-connection rate limit, and then the
      global rate limit.
      Reported-by: NAvery Fay <avery@mixpanel.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f2b2c582
    • N
      tcp: mitigate ACK loops for connections as tcp_request_sock · a9b2c06d
      Neal Cardwell 提交于
      In the SYN_RECV state, where the TCP connection is represented by
      tcp_request_sock, we now rate-limit SYNACKs in response to a client's
      retransmitted SYNs: we do not send a SYNACK in response to client SYN
      if it has been less than sysctl_tcp_invalid_ratelimit (default 500ms)
      since we last sent a SYNACK in response to a client's retransmitted
      SYN.
      
      This allows the vast majority of legitimate client connections to
      proceed unimpeded, even for the most aggressive platforms, iOS and
      MacOS, which actually retransmit SYNs 1-second intervals for several
      times in a row. They use SYN RTO timeouts following the progression:
      1,1,1,1,1,2,4,8,16,32.
      Reported-by: NAvery Fay <avery@mixpanel.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a9b2c06d
    • N
      tcp: helpers to mitigate ACK loops by rate-limiting out-of-window dupacks · 032ee423
      Neal Cardwell 提交于
      Helpers for mitigating ACK loops by rate-limiting dupacks sent in
      response to incoming out-of-window packets.
      
      This patch includes:
      
      - rate-limiting logic
      - sysctl to control how often we allow dupacks to out-of-window packets
      - SNMP counter for cases where we rate-limited our dupack sending
      
      The rate-limiting logic in this patch decides to not send dupacks in
      response to out-of-window segments if (a) they are SYNs or pure ACKs
      and (b) the remote endpoint is sending them faster than the configured
      rate limit.
      
      We rate-limit our responses rather than blocking them entirely or
      resetting the connection, because legitimate connections can rely on
      dupacks in response to some out-of-window segments. For example, zero
      window probes are typically sent with a sequence number that is below
      the current window, and ZWPs thus expect to thus elicit a dupack in
      response.
      
      We allow dupacks in response to TCP segments with data, because these
      may be spurious retransmissions for which the remote endpoint wants to
      receive DSACKs. This is safe because segments with data can't
      realistically be part of ACK loops, which by their nature consist of
      each side sending pure/data-less ACKs to each other.
      
      The dupack interval is controlled by a new sysctl knob,
      tcp_invalid_ratelimit, given in milliseconds, in case an administrator
      needs to dial this upward in the face of a high-rate DoS attack. The
      name and units are chosen to be analogous to the existing analogous
      knob for ICMP, icmp_ratelimit.
      
      The default value for tcp_invalid_ratelimit is 500ms, which allows at
      most one such dupack per 500ms. This is chosen to be 2x faster than
      the 1-second minimum RTO interval allowed by RFC 6298 (section 2, rule
      2.4). We allow the extra 2x factor because network delay variations
      can cause packets sent at 1 second intervals to be compressed and
      arrive much closer.
      Reported-by: NAvery Fay <avery@mixpanel.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      032ee423
  4. 05 2月, 2015 2 次提交
    • E
      tcp: do not pace pure ack packets · 98781965
      Eric Dumazet 提交于
      When we added pacing to TCP, we decided to let sch_fq take care
      of actual pacing.
      
      All TCP had to do was to compute sk->pacing_rate using simple formula:
      
      sk->pacing_rate = 2 * cwnd * mss / rtt
      
      It works well for senders (bulk flows), but not very well for receivers
      or even RPC :
      
      cwnd on the receiver can be less than 10, rtt can be around 100ms, so we
      can end up pacing ACK packets, slowing down the sender.
      
      Really, only the sender should pace, according to its own logic.
      
      Instead of adding a new bit in skb, or call yet another flow
      dissection, we tweak skb->truesize to a small value (2), and
      we instruct sch_fq to use new helper and not pace pure ack.
      
      Note this also helps TCP small queue, as ack packets present
      in qdisc/NIC do not prevent sending a data packet (RPC workload)
      
      This helps to reduce tx completion overhead, ack packets can use regular
      sock_wfree() instead of tcp_wfree() which is a bit more expensive.
      
      This has no impact in the case packets are sent to loopback interface,
      as we do not coalesce ack packets (were we would detect skb->truesize
      lie)
      
      In case netem (with a delay) is used, skb_orphan_partial() also sets
      skb->truesize to 1.
      
      This patch is a combination of two patches we used for about one year at
      Google.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      98781965
    • T
      net: add skb functions to process remote checksum offload · dcdc8994
      Tom Herbert 提交于
      This patch adds skb_remcsum_process and skb_gro_remcsum_process to
      perform the appropriate adjustments to the skb when receiving
      remote checksum offload.
      
      Updated vxlan and gue to use these functions.
      
      Tested: Ran TCP_RR and TCP_STREAM netperf for VXLAN and GUE, did
      not see any change in performance.
      Signed-off-by: NTom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      dcdc8994
  5. 04 2月, 2015 4 次提交
  6. 03 2月, 2015 2 次提交
    • F
      net: dctcp: loosen requirement to assert ECT(0) during 3WHS · 843c2fdf
      Florian Westphal 提交于
      One deployment requirement of DCTCP is to be able to run
      in a DC setting along with TCP traffic. As Glenn Judd's
      NSDI'15 paper "Attaining the Promise and Avoiding the Pitfalls
      of TCP in the Datacenter" [1] (tba) explains, one way to
      solve this on switch side is to split DCTCP and TCP traffic
      in two queues per switch port based on the DSCP: one queue
      soley intended for DCTCP traffic and one for non-DCTCP traffic.
      
      For the DCTCP queue, there's the marking threshold K as
      explained in commit e3118e83 ("net: tcp: add DCTCP congestion
      control algorithm") for RED marking ECT(0) packets with CE.
      For the non-DCTCP queue, there's f.e. a classic tail drop queue.
      As already explained in e3118e83, running DCTCP at scale
      when not marking SYN/SYN-ACK packets with ECT(0) has severe
      consequences as for non-ECT(0) packets, traversing the RED
      marking DCTCP queue will result in a severe reduction of
      connection probability.
      
      This is due to the DCTCP queue being dominated by ECT(0) traffic
      and switches handle non-ECT traffic in the RED marking queue
      after passing K as drops, where K is usually a low watermark
      in order to leave enough tailroom for bursts. Splitting DCTCP
      traffic among several queues (ECN and non-ECN queue) is being
      considered a terrible idea in the network community as it
      splits single flows across multiple network paths.
      
      Therefore, commit e3118e83 implements this on Linux as
      ECT(0) marked traffic, as we argue that marking all packets
      of a DCTCP flow is the only viable solution and also doesn't
      speak against the draft.
      
      However, recently, a DCTCP implementation for FreeBSD hit also
      their mainline kernel [2]. In order to let them play well
      together with Linux' DCTCP, we would need to loosen the
      requirement that ECT(0) has to be asserted during the 3WHS as
      not implemented in FreeBSD. This simplifies the ECN test and
      lets DCTCP work together with FreeBSD.
      
      Joint work with Daniel Borkmann.
      
        [1] https://www.usenix.org/conference/nsdi15/technical-sessions/presentation/judd
        [2] https://github.com/freebsd/freebsd/commit/8ad879445281027858a7fa706d13e458095b595fSigned-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Cc: Glenn Judd <glenn.judd@morganstanley.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      843c2fdf
    • W
      net-timestamp: no-payload option · 49ca0d8b
      Willem de Bruijn 提交于
      Add timestamping option SOF_TIMESTAMPING_OPT_TSONLY. For transmit
      timestamps, this loops timestamps on top of empty packets.
      
      Doing so reduces the pressure on SO_RCVBUF. Payload inspection and
      cmsg reception (aside from timestamps) are no longer possible. This
      works together with a follow on patch that allows administrators to
      only allow tx timestamping if it does not loop payload or metadata.
      Signed-off-by: NWillem de Bruijn <willemb@google.com>
      
      ----
      
      Changes (rfc -> v1)
        - add documentation
        - remove unnecessary skb->len test (thanks to Richard Cochran)
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      49ca0d8b
  7. 02 2月, 2015 1 次提交
    • E
      ipv4: tcp: get rid of ugly unicast_sock · bdbbb852
      Eric Dumazet 提交于
      In commit be9f4a44 ("ipv4: tcp: remove per net tcp_sock")
      I tried to address contention on a socket lock, but the solution
      I chose was horrible :
      
      commit 3a7c384f ("ipv4: tcp: unicast_sock should not land outside
      of TCP stack") addressed a selinux regression.
      
      commit 0980e56e ("ipv4: tcp: set unicast_sock uc_ttl to -1")
      took care of another regression.
      
      commit b5ec8eea ("ipv4: fix ip_send_skb()") fixed another regression.
      
      commit 811230cd ("tcp: ipv4: initialize unicast_sock sk_pacing_rate")
      was another shot in the dark.
      
      Really, just use a proper socket per cpu, and remove the skb_orphan()
      call, to re-enable flow control.
      
      This solves a serious problem with FQ packet scheduler when used in
      hostile environments, as we do not want to allocate a flow structure
      for every RST packet sent in response to a spoofed packet.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bdbbb852
  8. 01 2月, 2015 2 次提交
  9. 31 1月, 2015 1 次提交
  10. 30 1月, 2015 1 次提交
  11. 29 1月, 2015 7 次提交
  12. 27 1月, 2015 3 次提交
  13. 26 1月, 2015 7 次提交
  14. 25 1月, 2015 1 次提交
    • T
      udp: Do not require sock in udp_tunnel_xmit_skb · d998f8ef
      Tom Herbert 提交于
      The UDP tunnel transmit functions udp_tunnel_xmit_skb and
      udp_tunnel6_xmit_skb include a socket argument. The socket being
      passed to the functions (from VXLAN) is a UDP created for receive
      side. The only thing that the socket is used for in the transmit
      functions is to get the setting for checksum (enabled or zero).
      This patch removes the argument and and adds a nocheck argument
      for checksum setting. This eliminates the unnecessary dependency
      on a UDP socket for UDP tunnel transmit.
      Signed-off-by: NTom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d998f8ef
  15. 20 1月, 2015 2 次提交
    • F
      net: ipv4: handle DSA enabled master network devices · 728c0208
      Florian Fainelli 提交于
      The logic to configure a network interface for kernel IP
      auto-configuration is very simplistic, and does not handle the case
      where a device is stacked onto another such as with DSA. This causes the
      kernel not to open and configure the master network device in a DSA
      switch tree, and therefore slave network devices using this master
      network devices as conduit device cannot be open.
      
      This restriction comes from a check in net/dsa/slave.c, which is
      basically checking the master netdev flags for IFF_UP and returns
      -ENETDOWN if it is not the case.
      
      Automatically bringing-up DSA master network devices allows DSA slave
      network devices to be used as valid interfaces for e.g: NFS root booting
      by allowing kernel IP autoconfiguration to succeed on these interfaces.
      
      On the reverse path, make sure we do not attempt to close a DSA-enabled
      device as this would implicitely prevent the slave DSA network device
      from operating.
      Signed-off-by: NFlorian Fainelli <f.fainelli@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      728c0208
    • N
      tunnels: advertise link netns via netlink · 1728d4fa
      Nicolas Dichtel 提交于
      Implement rtnl_link_ops->get_link_net() callback so that IFLA_LINK_NETNSID is
      added to rtnetlink messages.
      Signed-off-by: NNicolas Dichtel <nicolas.dichtel@6wind.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1728d4fa