1. 18 May, 2015 1 commit
  2. 10 May, 2015 1 commit
    • tcp: adjust window probe timers to safer values · 21c8fe99
      Eric Dumazet authored
      With the advent of small rto timers in datacenter TCP,
      (ip route ... rto_min x), the following can happen :
      
      1) Qdisc is full, transmit fails.
      
         TCP sets a timer based on icsk_rto to retry the transmit, without
         exponential backoff.
         With a low icsk_rto and a lot of sockets, all CPUs are servicing timer
         interrupts like crazy.
         Intent of the code was to retry with a timer between 200 (TCP_RTO_MIN)
         and 500ms (TCP_RESOURCE_PROBE_INTERVAL)
      
      2) Receivers can send zero windows if they don't drain their receive queue.
      
         TCP sends zero window probes, based on icsk_rto current value, with
         exponential backoff.
         With /proc/sys/net/ipv4/tcp_retries2 being 15 (or even smaller in
         some cases), sender can abort in less than one or two minutes !
         If the receiver stops the sender, it obviously doesn't care about a very
         tight rto. The probability of dropping the ACK that reopens the window
         is not worth the risk.
      
      Let's change the base timer to be at least 200ms (TCP_RTO_MIN) for these
      events (but not normal RTO based retransmits).
      
      A followup patch adds a new SNMP counter, as it would have helped a lot
      diagnosing this issue.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
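The effect of the 200ms floor described in commit 21c8fe99 can be illustrated with a small userspace sketch. This is not the kernel code: the helper names (`probe_base_ms`, `probe_when_ms`, `total_probe_time_ms`) are hypothetical, times are plain milliseconds rather than jiffies, and the 120 s cap is an assumption chosen to match TCP_RTO_MAX. It only models the sum of exponentially backed-off probe intervals, not the kernel's full abort logic.

```c
#include <assert.h>
#include <stdint.h>

#define RTO_MIN_MS 200u      /* TCP_RTO_MIN, as quoted in the commit message */
#define RTO_MAX_MS 120000u   /* assumption: cap equal to TCP_RTO_MAX (120 s) */

/* Base interval for probe timers: never below RTO_MIN_MS. */
static uint64_t probe_base_ms(uint64_t rto_ms)
{
    return rto_ms > RTO_MIN_MS ? rto_ms : RTO_MIN_MS;
}

/* Interval before the next probe: base doubled per backoff step, capped. */
static uint64_t probe_when_ms(uint64_t base_ms, unsigned int backoff)
{
    assert(backoff < 32);    /* keep the shift well defined in this sketch */
    uint64_t when = base_ms << backoff;
    return when < RTO_MAX_MS ? when : RTO_MAX_MS;
}

/* Sum of the first `probes` intervals: a rough time until the sender aborts. */
static uint64_t total_probe_time_ms(uint64_t base_ms, unsigned int probes)
{
    uint64_t total = 0;
    for (unsigned int i = 0; i < probes; i++)
        total += probe_when_ms(base_ms, i);
    return total;
}
```

Under these assumptions, a 1 ms RTO with 15 probes sums to 32767 ms (about 33 seconds), while `total_probe_time_ms(probe_base_ms(1), 15)` gives 804600 ms (roughly 13 minutes).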
  3. 06 May, 2015 1 commit
    • tcp: provide SYN headers for passive connections · cd8ae852
      Eric Dumazet authored
      This patch allows a server application to get the TCP SYN headers for
      its passive connections.  This is useful if the server is doing
      fingerprinting of clients based on SYN packet contents.
      
      Two socket options are added: TCP_SAVE_SYN and TCP_SAVED_SYN.
      
      The first is used on a socket to enable saving the SYN headers
      for child connections. This can be set before or after the listen()
      call.
      
      The latter is used to retrieve the SYN headers for passive connections,
      if the parent listener has enabled TCP_SAVE_SYN.
      
      TCP_SAVED_SYN can be read only once: reading it frees the saved SYN headers.
      
      The data returned in TCP_SAVED_SYN are network (IPv4/IPv6) and TCP
      headers.
      
      The original patch was written by Tom Herbert; I changed it to not hold
      a full skb (and associated dst and conntracking reference).
      
      We have used such a patch for about 3 years at Google.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Tested-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
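A server might use the two socket options from commit cd8ae852 roughly as sketched below. The wrapper names `enable_save_syn` and `read_saved_syn` are hypothetical; the fallback option values (27 and 28) are the ones in the Linux UAPI header and are only defined here in case the libc headers predate them. Error handling is reduced to returning -1.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Fallbacks for older libc headers; values as in the Linux UAPI header. */
#ifndef TCP_SAVE_SYN
#define TCP_SAVE_SYN 27
#endif
#ifndef TCP_SAVED_SYN
#define TCP_SAVED_SYN 28
#endif

/* Ask the kernel to keep the SYN headers for children of this listener.
 * Returns 0 on success, -1 on error (errno is set by setsockopt). */
static int enable_save_syn(int listen_fd)
{
    int one = 1;
    return setsockopt(listen_fd, IPPROTO_TCP, TCP_SAVE_SYN, &one, sizeof(one));
}

/* Copy the saved network + TCP headers of an accepted socket into buf.
 * Returns the number of bytes stored, or -1 on error. Per the commit
 * message, the saved headers are freed once they have been read. */
static ssize_t read_saved_syn(int conn_fd, void *buf, size_t buflen)
{
    assert(buf != NULL);
    socklen_t len = (socklen_t)buflen;
    if (getsockopt(conn_fd, IPPROTO_TCP, TCP_SAVED_SYN, buf, &len) < 0)
        return -1;
    return (ssize_t)len;
}
```

A typical flow would call `enable_save_syn()` on the listening socket and `read_saved_syn()` once on each socket returned by accept().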
  4. 04 May, 2015 3 commits
  5. 30 April, 2015 3 commits
  6. 22 April, 2015 1 commit
  7. 14 April, 2015 1 commit
  8. 08 April, 2015 2 commits
  9. 04 April, 2015 2 commits
  10. 03 April, 2015 1 commit
  11. 30 March, 2015 1 commit
  12. 25 March, 2015 1 commit
    • tcp: fix ipv4 mapped request socks · 0144a81c
      Eric Dumazet authored
      ss should display ipv4 mapped request sockets like this :
      
      tcp    SYN-RECV   0      0  ::ffff:192.168.0.1:8080   ::ffff:192.0.2.1:35261
      
      and not like this :
      
      tcp    SYN-RECV   0      0  192.168.0.1:8080   192.0.2.1:35261
      
      We should init ireq->ireq_family based on listener sk_family,
      not the actual protocol carried by SYN packet.
      
      This means we can set ireq_family in inet_reqsk_alloc()
      
      Fixes: 3f66b083 ("inet: introduce ireq_family")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  13. 21 March, 2015 1 commit
  14. 18 March, 2015 7 commits
  15. 15 March, 2015 1 commit
  16. 13 March, 2015 1 commit
  17. 12 March, 2015 2 commits
  18. 23 February, 2015 1 commit
  19. 08 February, 2015 2 commits
    • tcp: mitigate ACK loops for connections as tcp_sock · f2b2c582
      Neal Cardwell authored
      Ensure that in state ESTABLISHED, where the connection is represented
      by a tcp_sock, we rate limit dupacks in response to incoming packets
      (a) with TCP timestamps that fail PAWS checks, or (b) with sequence
      numbers or ACK numbers that are out of the acceptable window.
      
      We do not send a dupack in response to out-of-window packets if it has
      been less than sysctl_tcp_invalid_ratelimit (default 500ms) since we
      last sent a dupack in response to an out-of-window packet.
      
      There is already a similar (although global) rate-limiting mechanism
      for "challenge ACKs". When deciding whether to send a challenge ACK,
      we first consult the new per-connection rate limit, and then the
      global rate limit.
      Reported-by: Avery Fay <avery@mixpanel.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: helpers to mitigate ACK loops by rate-limiting out-of-window dupacks · 032ee423
      Neal Cardwell authored
      Helpers for mitigating ACK loops by rate-limiting dupacks sent in
      response to incoming out-of-window packets.
      
      This patch includes:
      
      - rate-limiting logic
      - sysctl to control how often we allow dupacks to out-of-window packets
      - SNMP counter for cases where we rate-limited our dupack sending
      
      The rate-limiting logic in this patch decides to not send dupacks in
      response to out-of-window segments if (a) they are SYNs or pure ACKs
      and (b) the remote endpoint is sending them faster than the configured
      rate limit.
      
      We rate-limit our responses rather than blocking them entirely or
      resetting the connection, because legitimate connections can rely on
      dupacks in response to some out-of-window segments. For example, zero
      window probes are typically sent with a sequence number that is below
      the current window, and ZWPs thus expect to elicit a dupack in
      response.
      
      We allow dupacks in response to TCP segments with data, because these
      may be spurious retransmissions for which the remote endpoint wants to
      receive DSACKs. This is safe because segments with data can't
      realistically be part of ACK loops, which by their nature consist of
      each side sending pure/data-less ACKs to each other.
      
      The dupack interval is controlled by a new sysctl knob,
      tcp_invalid_ratelimit, given in milliseconds, in case an administrator
      needs to dial this upward in the face of a high-rate DoS attack. The
      name and units are chosen to be analogous to the existing
      knob for ICMP, icmp_ratelimit.
      
      The default value for tcp_invalid_ratelimit is 500ms, which allows at
      most one such dupack per 500ms. This is chosen to be 2x faster than
      the 1-second minimum RTO interval allowed by RFC 6298 (section 2, rule
      2.4). We allow the extra 2x factor because network delay variations
      can cause packets sent at 1 second intervals to be compressed and
      arrive much closer.
      Reported-by: Avery Fay <avery@mixpanel.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
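The rate-limiting decision described in commit 032ee423 can be sketched in userspace as follows. This is an illustration under stated assumptions, not the kernel helper: `struct oow_state`, `oow_rate_limited`, and the millisecond clock passed in by the caller are all hypothetical, and the `limited` field merely stands in for the SNMP counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct oow_state {
    bool     have_last;    /* has an out-of-window dupack been sent yet? */
    uint64_t last_ack_ms;  /* time the last such dupack was sent */
    uint64_t limited;      /* stand-in for the SNMP counter */
};

/* Returns true if the dupack for an out-of-window segment should be
 * suppressed, false if it may be sent. */
static bool oow_rate_limited(struct oow_state *st, uint64_t now_ms,
                             bool has_data, bool is_syn,
                             uint64_t ratelimit_ms)
{
    assert(st != NULL);

    /* Segments carrying data (and not SYN) are never limited: they may be
     * spurious retransmissions whose sender wants a DSACK. */
    if (has_data && !is_syn)
        return false;

    /* SYNs and pure ACKs: suppress if the previous dupack is too recent. */
    if (ratelimit_ms != 0 && st->have_last &&
        now_ms >= st->last_ack_ms &&
        now_ms - st->last_ack_ms < ratelimit_ms) {
        st->limited++;
        return true;
    }

    /* Allowed: remember when this dupack was sent. */
    st->have_last = true;
    st->last_ack_ms = now_ms;
    return false;
}
```

With the default of 500ms, a pure ACK arriving 200ms after an answered one is suppressed, while one arriving a full 500ms later is answered again. Passing a `ratelimit_ms` of 0 disables the limit in this sketch.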
  20. 04 February, 2015 2 commits
    • net: switch memcpy_fromiovec()/memcpy_fromiovecend() users to copy_from_iter() · 21226abb
      Al Viro authored
      That takes care of the majority of ->sendmsg() instances - most of them
      via memcpy_to_msg() or assorted getfrag() callbacks.  One place where we
      still keep memcpy_fromiovecend() is tipc - there we potentially read the
      same data over and over; separate patch, that...
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • ip: convert tcp_sendmsg() to iov_iter primitives · 57be5bda
      Al Viro authored
      patch is actually smaller than it seems to be - most of it is unindenting
      the inner loop body in tcp_sendmsg() itself...
      
      the bit in tcp_input.c is going to get reverted very soon - that's what
      memcpy_from_msg() will become, but not in this commit; let's keep it
      reasonably contained...
      
      There's one potentially subtle change here: in case of short copy from
      userland, mainline tcp_send_syn_data() discards the skb it has allocated
      and falls back to normal path, where we'll send as much as possible after
      rereading the same data again.  This patch trims SYN+data skb instead -
      that way we don't need to copy from the same place twice.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  21. 03 February, 2015 1 commit
    • net: dctcp: loosen requirement to assert ECT(0) during 3WHS · 843c2fdf
      Florian Westphal authored
      One deployment requirement of DCTCP is to be able to run
      in a DC setting along with TCP traffic. As Glenn Judd's
      NSDI'15 paper "Attaining the Promise and Avoiding the Pitfalls
      of TCP in the Datacenter" [1] (tba) explains, one way to
      solve this on switch side is to split DCTCP and TCP traffic
      in two queues per switch port based on the DSCP: one queue
      solely intended for DCTCP traffic and one for non-DCTCP traffic.
      
      For the DCTCP queue, there's the marking threshold K as
      explained in commit e3118e83 ("net: tcp: add DCTCP congestion
      control algorithm") for RED marking ECT(0) packets with CE.
      For the non-DCTCP queue, there's, for example, a classic tail drop queue.
      As already explained in e3118e83, running DCTCP at scale
      when not marking SYN/SYN-ACK packets with ECT(0) has severe
      consequences as for non-ECT(0) packets, traversing the RED
      marking DCTCP queue will result in a severe reduction of
      connection probability.
      
      This is due to the DCTCP queue being dominated by ECT(0) traffic
      and switches handle non-ECT traffic in the RED marking queue
      after passing K as drops, where K is usually a low watermark
      in order to leave enough tailroom for bursts. Splitting DCTCP
      traffic among several queues (ECN and non-ECN queue) is being
      considered a terrible idea in the network community as it
      splits single flows across multiple network paths.
      
      Therefore, commit e3118e83 implements this on Linux as
      ECT(0) marked traffic, as we argue that marking all packets
      of a DCTCP flow is the only viable solution and also doesn't
      speak against the draft.
      
      However, a DCTCP implementation for FreeBSD recently landed in its
      mainline kernel as well [2]. In order to let them play well
      together with Linux' DCTCP, we would need to loosen the
      requirement that ECT(0) has to be asserted during the 3WHS as
      not implemented in FreeBSD. This simplifies the ECN test and
      lets DCTCP work together with FreeBSD.
      
      Joint work with Daniel Borkmann.
      
        [1] https://www.usenix.org/conference/nsdi15/technical-sessions/presentation/judd
        [2] https://github.com/freebsd/freebsd/commit/8ad879445281027858a7fa706d13e458095b595f
      
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Glenn Judd <glenn.judd@morganstanley.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  22. 01 February, 2015 1 commit
  23. 14 January, 2015 1 commit
    • tcp: avoid reducing cwnd when ACK+DSACK is received · 08abdffa
      Sébastien Barré authored
      With TLP, the peer may reply to a probe with an
      ACK+D-SACK, with ack value set to tlp_high_seq. In the current code,
      such ACK+DSACK will be missed and only at next, higher ack will the TLP
      episode be considered done. Since the DSACK is not present anymore,
      this will cost a cwnd reduction.
      
      This patch ensures that this scenario does not cause a cwnd reduction, since
      receiving an ACK+DSACK indicates that both the initial segment and the probe
      have been received by the peer.
      
      The following packetdrill test, from Neal Cardwell, validates this patch:
      
      // Establish a connection.
      0     socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
      +0    setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
      +0    bind(3, ..., ...) = 0
      +0    listen(3, 1) = 0
      
      +0    < S 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7>
      +0    > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 6>
      +.020 < . 1:1(0) ack 1 win 257
      +0    accept(3, ..., ...) = 4
      
      // Send 1 packet.
      +0    write(4, ..., 1000) = 1000
      +0    > P. 1:1001(1000) ack 1
      
      // Loss probe retransmission.
      // packets_out == 1 => schedule PTO in max(2*RTT, 1.5*RTT + 200ms)
      // In this case, this means: 1.5*RTT + 200ms = 230ms
      +.230 > P. 1:1001(1000) ack 1
      +0    %{ assert tcpi_snd_cwnd == 10 }%
      
      // Receiver ACKs at tlp_high_seq with a DSACK,
      // indicating they received the original packet and probe.
      +.020 < . 1:1(0) ack 1001 win 257 <sack 1:1001,nop,nop>
      +0    %{ assert tcpi_snd_cwnd == 10 }%
      
      // Send another packet.
      +0    write(4, ..., 1000) = 1000
      +0    > P. 1001:2001(1000) ack 1
      
      // Receiver ACKs above tlp_high_seq, which should end the TLP episode
      // if we haven't already. We should not reduce cwnd.
      +.020 < . 1:1(0) ack 2001 win 257
      +0    %{ assert tcpi_snd_cwnd == 10, tcpi_snd_cwnd }%
      
      Credits:
      - Gregory helped in finding that tcp_process_tlp_ack was where the cwnd
        got reduced in our MPTCP tests.
      - Neal wrote the packetdrill test above.
      - Yuchung reworked the patch to make it more readable.
      
      Cc: Gregory Detal <gregory.detal@uclouvain.be>
      Cc: Nandita Dukkipati <nanditad@google.com>
      Tested-by: Neal Cardwell <ncardwell@google.com>
      Reviewed-by: Yuchung Cheng <ycheng@google.com>
      Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: Sébastien Barré <sebastien.barre@uclouvain.be>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
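The probe timeout arithmetic quoted in the packetdrill comments of commit 08abdffa (PTO = max(2*RTT, 1.5*RTT + 200ms) when packets_out == 1) can be checked with a few lines of C. The function name `pto_single_packet_us` and the use of microseconds are assumptions made for this sketch; it covers only the single-packet case that the test comment states.

```c
#include <assert.h>
#include <stdint.h>

#define DELACK_ALLOWANCE_US 200000u  /* the 200ms term from the test comment */

/* PTO for the packets_out == 1 case: max(2*RTT, 1.5*RTT + 200ms). */
static uint64_t pto_single_packet_us(uint64_t rtt_us)
{
    assert(rtt_us <= UINT64_MAX / 2);   /* avoid overflow in this sketch */
    uint64_t two_rtt = 2 * rtt_us;
    uint64_t padded  = rtt_us + rtt_us / 2 + DELACK_ALLOWANCE_US;
    return two_rtt > padded ? two_rtt : padded;
}
```

For the 20ms RTT used in the test, 2*RTT is 40ms and 1.5*RTT + 200ms is 230ms, so the probe is expected 230ms after the original transmission, matching the `+.230` line in the script. The 2*RTT term only wins once the RTT exceeds 400ms.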
  24. 10 December, 2014 1 commit
  25. 24 November, 2014 1 commit