1. 14 10月, 2019 1 次提交
    • E
      tcp: add rcu protection around tp->fastopen_rsk · d983ea6f
      Eric Dumazet 提交于
      Both tcp_v4_err() and tcp_v6_err() do the following operations
      while they do not own the socket lock :
      
      	fastopen = tp->fastopen_rsk;
       	snd_una = fastopen ? tcp_rsk(fastopen)->snt_isn : tp->snd_una;
      
      The problem is that without appropriate barrier, the compiler
      might reload tp->fastopen_rsk and trigger a NULL deref.
      
      request sockets are protected by RCU, we can simply add
      the missing annotations and barriers to solve the issue.
      
      Fixes: 168a8f58 ("tcp: TCP Fast Open Server - main code path")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d983ea6f
  2. 16 9月, 2019 1 次提交
    • T
      tcp: Add TCP_INFO counter for packets received out-of-order · f9af2dbb
      Thomas Higdon 提交于
      For receive-heavy cases on the server-side, we want to track the
      connection quality for individual client IPs. This counter, similar to
      the existing system-wide TCPOFOQueue counter in /proc/net/netstat,
      tracks out-of-order packet reception. By providing this counter in
      TCP_INFO, it will allow understanding to what degree receive-heavy
      sockets are experiencing out-of-order delivery and packet drops
      indicating congestion.
      
      Please note that this is similar to the counter in NetBSD TCP_INFO, and
      has the same name.
      
      Also note that we avoid increasing the size of the tcp_sock struct by
      taking advantage of a hole.
      Signed-off-by: NThomas Higdon <tph@fb.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f9af2dbb
  3. 12 9月, 2019 1 次提交
    • N
      tcp: fix tcp_ecn_withdraw_cwr() to clear TCP_ECN_QUEUE_CWR · af38d07e
      Neal Cardwell 提交于
      Fix tcp_ecn_withdraw_cwr() to clear the correct bit:
      TCP_ECN_QUEUE_CWR.
      
      Rationale: basically, TCP_ECN_DEMAND_CWR is a bit that is purely about
      the behavior of data receivers, and deciding whether to reflect
      incoming IP ECN CE marks as outgoing TCP th->ece marks. The
      TCP_ECN_QUEUE_CWR bit is purely about the behavior of data senders,
      and deciding whether to send CWR. The tcp_ecn_withdraw_cwr() function
      is only called from tcp_undo_cwnd_reduction() by data senders during
      an undo, so it should zero the sender-side state,
      TCP_ECN_QUEUE_CWR. It does not make sense to stop the reflection of
      incoming CE bits on incoming data packets just because outgoing
      packets were spuriously retransmitted.
      
      The bug has been reproduced with packetdrill to manifest in a scenario
      with RFC3168 ECN, with an incoming data packet with CE bit set and
      carrying a TCP timestamp value that causes cwnd undo. Before this fix,
      the IP CE bit was ignored and not reflected in the TCP ECE header bit,
      and sender sent a TCP CWR ('W') bit on the next outgoing data packet,
      even though the cwnd reduction had been undone.  After this fix, the
      sender properly reflects the CE bit and does not set the W bit.
      
      Note: the bug actually predates 2005 git history; this Fixes footer is
      chosen to be the oldest SHA1 I have tested (from Sep 2007) for which
      the patch applies cleanly (since before this commit the code was in a
      .h file).
      
      Fixes: bdf1ee5d ("[TCP]: Move code from tcp_ecn.h to tcp*.c and tcp.h & remove it")
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      af38d07e
  4. 31 7月, 2019 2 次提交
  5. 03 7月, 2019 1 次提交
  6. 16 6月, 2019 1 次提交
    • E
      tcp: limit payload size of sacked skbs · 3b4929f6
      Eric Dumazet 提交于
      Jonathan Looney reported that TCP can trigger the following crash
      in tcp_shifted_skb() :
      
      	BUG_ON(tcp_skb_pcount(skb) < pcount);
      
      This can happen if the remote peer has advertized the smallest
      MSS that linux TCP accepts : 48
      
      An skb can hold 17 fragments, and each fragment can hold 32KB
      on x86, or 64KB on PowerPC.
      
      This means that the 16bit witdh of TCP_SKB_CB(skb)->tcp_gso_segs
      can overflow.
      
      Note that tcp_sendmsg() builds skbs with less than 64KB
      of payload, so this problem needs SACK to be enabled.
      SACK blocks allow TCP to coalesce multiple skbs in the retransmit
      queue, thus filling the 17 fragments to maximal capacity.
      
      CVE-2019-11477 -- u16 overflow of TCP_SKB_CB(skb)->tcp_gso_segs
      
      Fixes: 832d11c5 ("tcp: Try to restore large SKBs while SACK processing")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NJonathan Looney <jtl@netflix.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Reviewed-by: NTyler Hicks <tyhicks@canonical.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Bruce Curtis <brucec@netflix.com>
      Cc: Jonathan Lemon <jonathan.lemon@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3b4929f6
  7. 15 6月, 2019 1 次提交
  8. 10 6月, 2019 1 次提交
    • Y
      tcp: fix undo spurious SYNACK in passive Fast Open · fcc2202a
      Yuchung Cheng 提交于
      Commit 794200d6 ("tcp: undo cwnd on Fast Open spurious SYNACK
      retransmit") may cause tcp_fastretrans_alert() to warn about pending
      retransmission in Open state. This is triggered when the Fast Open
      server both sends data and has spurious SYNACK retransmission during
      the handshake, and the data packets were lost or reordered.
      
      The root cause is a bit complicated:
      
      (1) Upon receiving SYN-data: a full socket is created with
          snd_una = ISN + 1 by tcp_create_openreq_child()
      
      (2) On SYNACK timeout the server/sender enters CA_Loss state.
      
      (3) Upon receiving the final ACK to complete the handshake, sender
          does not mark FLAG_SND_UNA_ADVANCED since (1)
      
          Sender then calls tcp_process_loss since state is CA_loss by (2)
      
      (4) tcp_process_loss() does not invoke undo operations but instead
          mark REXMIT_LOST to force retransmission
      
      (5) tcp_rcv_synrecv_state_fastopen() calls tcp_try_undo_loss(). It
          changes state to CA_Open but has positive tp->retrans_out
      
      (6) Next ACK triggers the WARN_ON in tcp_fastretrans_alert()
      
      The step that goes wrong is (4) where the undo operation should
      have been invoked because the ACK successfully acknowledged the
      SYN sequence. This fixes that by specifically checking undo
      when the SYN-ACK sequence is acknowledged. Then after
      tcp_process_loss() the state would be further adjusted based
      in tcp_fastretrans_alert() to avoid triggering the warning in (6).
      
      Fixes: 794200d6 ("tcp: undo cwnd on Fast Open spurious SYNACK retransmit")
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fcc2202a
  9. 31 5月, 2019 1 次提交
    • Y
      ipv4: tcp_input: fix stack out of bounds when parsing TCP options. · 9609dad2
      Young Xiao 提交于
      The TCP option parsing routines in tcp_parse_options function could
      read one byte out of the buffer of the TCP options.
      
      1         while (length > 0) {
      2                 int opcode = *ptr++;
      3                 int opsize;
      4
      5                 switch (opcode) {
      6                 case TCPOPT_EOL:
      7                         return;
      8                 case TCPOPT_NOP:        /* Ref: RFC 793 section 3.1 */
      9                         length--;
      10                        continue;
      11                default:
      12                        opsize = *ptr++; //out of bound access
      
      If length = 1, then there is an access in line2.
      And another access is occurred in line 12.
      This would lead to out-of-bound access.
      
      Therefore, in the patch we check that the available data length is
      larger enough to pase both TCP option code and size.
      Signed-off-by: NYoung Xiao <92siuyang@gmail.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9609dad2
  10. 15 5月, 2019 1 次提交
  11. 10 5月, 2019 1 次提交
  12. 01 5月, 2019 7 次提交
  13. 17 4月, 2019 1 次提交
  14. 05 4月, 2019 1 次提交
    • T
      tcp: Accept ECT on SYN in the presence of RFC8311 · f6fee16d
      Tilmans, Olivier (Nokia - BE/Antwerp) 提交于
      Linux currently disable ECN for incoming connections when the SYN
      requests ECN and the IP header has ECT(0)/ECT(1) set, as some
      networks were reportedly mangling the ToS byte, hence could later
      trigger false congestion notifications.
      
      RFC8311 §4.3 relaxes RFC3168's requirements such that ECT can be set
      one TCP control packets (including SYNs). The main benefit of this
      is the decreased probability of losing a SYN in a congested
      ECN-capable network (i.e., it avoids the initial 1s timeout).
      Additionally, this allows the development of newer TCP extensions,
      such as AccECN.
      
      This patch relaxes the previous check, by enabling ECN on incoming
      connections using SYN+ECT if at least one bit of the reserved flags
      of the TCP header is set. Such bit would indicate that the sender of
      the SYN is using a newer TCP feature than what the host implements,
      such as AccECN, and is thus implementing RFC8311. This enables
      end-hosts not supporting such extensions to still negociate ECN, and
      to have some of the benefits of using ECN on control packets.
      Signed-off-by: NOlivier Tilmans <olivier.tilmans@nokia-bell-labs.com>
      Suggested-by: NBob Briscoe <research@bobbriscoe.net>
      Cc: Koen De Schepper <koen.de_schepper@nokia-bell-labs.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f6fee16d
  15. 20 3月, 2019 1 次提交
    • G
      tcp: free request sock directly upon TFO or syncookies error · 9403cf23
      Guillaume Nault 提交于
      Since the request socket is created locally, it'd make more sense to
      use reqsk_free() instead of reqsk_put() in TFO and syncookies' error
      path.
      
      However, tcp_get_cookie_sock() may set ->rsk_refcnt before freeing the
      socket; tcp_conn_request() may also have non-null ->rsk_refcnt because
      of tcp_try_fastopen(). In both cases 'req' hasn't been exposed
      to the outside world and is safe to free immediately, but that'd
      trigger the WARN_ON_ONCE in reqsk_free().
      
      Define __reqsk_free() for these situations where we know nobody's
      referencing the socket, even though ->rsk_refcnt might be non-null.
      Now we can consolidate the error path of tcp_get_cookie_sock() and
      tcp_conn_request().
      Signed-off-by: NGuillaume Nault <gnault@redhat.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9403cf23
  16. 09 3月, 2019 1 次提交
    • G
      tcp: handle inet_csk_reqsk_queue_add() failures · 9d3e1368
      Guillaume Nault 提交于
      Commit 7716682c ("tcp/dccp: fix another race at listener
      dismantle") let inet_csk_reqsk_queue_add() fail, and adjusted
      {tcp,dccp}_check_req() accordingly. However, TFO and syncookies
      weren't modified, thus leaking allocated resources on error.
      
      Contrary to tcp_check_req(), in both syncookies and TFO cases,
      we need to drop the request socket. Also, since the child socket is
      created with inet_csk_clone_lock(), we have to unlock it and drop an
      extra reference (->sk_refcount is initially set to 2 and
      inet_csk_reqsk_queue_add() drops only one ref).
      
      For TFO, we also need to revert the work done by tcp_try_fastopen()
      (with reqsk_fastopen_remove()).
      
      Fixes: 7716682c ("tcp/dccp: fix another race at listener dismantle")
      Signed-off-by: NGuillaume Nault <gnault@redhat.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9d3e1368
  17. 26 2月, 2019 2 次提交
  18. 28 1月, 2019 1 次提交
  19. 01 12月, 2018 1 次提交
  20. 28 11月, 2018 1 次提交
  21. 25 11月, 2018 1 次提交
    • E
      tcp: address problems caused by EDT misshaps · 9efdda4e
      Eric Dumazet 提交于
      When a qdisc setup including pacing FQ is dismantled and recreated,
      some TCP packets are sent earlier than instructed by TCP stack.
      
      TCP can be fooled when ACK comes back, because the following
      operation can return a negative value.
      
          tcp_time_stamp(tp) - tp->rx_opt.rcv_tsecr;
      
      Some paths in TCP stack were not dealing properly with this,
      this patch addresses four of them.
      
      Fixes: ab408b6d ("tcp: switch tcp and sch_fq to new earliest departure time model")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9efdda4e
  22. 22 11月, 2018 1 次提交
    • E
      tcp: defer SACK compression after DupThresh · 86de5921
      Eric Dumazet 提交于
      Jean-Louis reported a TCP regression and bisected to recent SACK
      compression.
      
      After a loss episode (receiver not able to keep up and dropping
      packets because its backlog is full), linux TCP stack is sending
      a single SACK (DUPACK).
      
      Sender waits a full RTO timer before recovering losses.
      
      While RFC 6675 says in section 5, "Algorithm Details",
      
         (2) If DupAcks < DupThresh but IsLost (HighACK + 1) returns true --
             indicating at least three segments have arrived above the current
             cumulative acknowledgment point, which is taken to indicate loss
             -- go to step (4).
      ...
         (4) Invoke fast retransmit and enter loss recovery as follows:
      
      there are old TCP stacks not implementing this strategy, and
      still counting the dupacks before starting fast retransmit.
      
      While these stacks probably perform poorly when receivers implement
      LRO/GRO, we should be a little more gentle to them.
      
      This patch makes sure we do not enable SACK compression unless
      3 dupacks have been sent since last rcv_nxt update.
      
      Ideally we should even rearm the timer to send one or two
      more DUPACK if no more packets are coming, but that will
      be work aiming for linux-4.21.
      
      Many thanks to Jean-Louis for bisecting the issue, providing
      packet captures and testing this patch.
      
      Fixes: 5d9f4262 ("tcp: add SACK compression")
      Reported-by: NJean-Louis Dupond <jean-louis@dupond.be>
      Tested-by: NJean-Louis Dupond <jean-louis@dupond.be>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      86de5921
  23. 21 11月, 2018 1 次提交
  24. 12 11月, 2018 1 次提交
  25. 24 10月, 2018 1 次提交
    • E
      tcp: add tcp_reset_xmit_timer() helper · 3f80e08f
      Eric Dumazet 提交于
      With EDT model, SRTT no longer is inflated by pacing delays.
      
      This means that RTO and some other xmit timers might be setup
      incorrectly. This is particularly visible with either :
      
      - Very small enforced pacing rates (SO_MAX_PACING_RATE)
      - Reduced rto (from the default 200 ms)
      
      This can lead to TCP flows aborts in the worst case,
      or spurious retransmits in other cases.
      
      For example, this session gets far more throughput
      than the requested 80kbit :
      
      $ netperf -H 127.0.0.2 -l 100 -- -q 10000
      MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 127.0.0.2 () port 0 AF_INET
      Recv   Send    Send
      Socket Socket  Message  Elapsed
      Size   Size    Size     Time     Throughput
      bytes  bytes   bytes    secs.    10^6bits/sec
      
      540000 262144 262144    104.00      2.66
      
      With the fix :
      
      $ netperf -H 127.0.0.2 -l 100 -- -q 10000
      MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 127.0.0.2 () port 0 AF_INET
      Recv   Send    Send
      Socket Socket  Message  Elapsed
      Size   Size    Size     Time     Throughput
      bytes  bytes   bytes    secs.    10^6bits/sec
      
      540000 262144 262144    104.00      0.12
      
      EDT allows for better control of rtx timers, since TCP has
      a better idea of the earliest departure time of each skb
      in the rtx queue. We only have to eventually add to the
      timer the difference of the EDT time with current time.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3f80e08f
  26. 02 10月, 2018 2 次提交
  27. 30 9月, 2018 1 次提交
    • Y
      tcp: up initial rmem to 128KB and SYN rwin to around 64KB · a337531b
      Yuchung Cheng 提交于
      Previously TCP initial receive buffer is ~87KB by default and
      the initial receive window is ~29KB (20 MSS). This patch changes
      the two numbers to 128KB and ~64KB (rounding down to the multiples
      of MSS) respectively. The patch also simplifies the calculations s.t.
      the two numbers are directly controlled by sysctl tcp_rmem[1]:
      
        1) Initial receiver buffer budget (sk_rcvbuf): while this should
           be configured via sysctl tcp_rmem[1], previously tcp_fixup_rcvbuf()
           always override and set a larger size when a new connection
           establishes.
      
        2) Initial receive window in SYN: previously it is set to 20
           packets if MSS <= 1460. The number 20 was based on the initial
           congestion window of 10: the receiver needs twice amount to
           avoid being limited by the receive window upon out-of-order
           delivery in the first window burst. But since this only
           applies if the receiving MSS <= 1460, connection using large MTU
           (e.g. to utilize receiver zero-copy) may be limited by the
           receive window.
      
      With this patch TCP memory configuration is more straight-forward and
      more properly sized to modern high-speed networks by default. Several
      popular stacks have been announcing 64KB rwin in SYNs as well.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NWei Wang <weiwan@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reviewed-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a337531b
  28. 22 9月, 2018 1 次提交
  29. 12 9月, 2018 1 次提交
  30. 01 9月, 2018 1 次提交
    • Y
      tcp: change IPv6 flow-label upon receiving spurious retransmission · 7788174e
      Yuchung Cheng 提交于
      Currently a Linux IPv6 TCP sender will change the flow label upon
      timeouts to potentially steer away from a data path that has gone
      bad. However this does not help if the problem is on the ACK path
      and the data path is healthy. In this case the receiver is likely
      to receive repeated spurious retransmission because the sender
      couldn't get the ACKs in time and has recurring timeouts.
      
      This patch adds another feature to mitigate this problem. It
      leverages the DSACK states in the receiver to change the flow
      label of the ACKs to speculatively re-route the ACK packets.
      In order to allow triggering on the second consecutive spurious
      RTO, the receiver changes the flow label upon sending a second
      consecutive DSACK for a sequence number below RCV.NXT.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7788174e
  31. 12 8月, 2018 1 次提交
    • Y
      tcp: avoid resetting ACK timer upon receiving packet with ECN CWR flag · fd2123a3
      Yuchung Cheng 提交于
      Previously commit 9aee4000 ("tcp: ack immediately when a cwr
      packet arrives") calls tcp_enter_quickack_mode to force sending
      two immediate ACKs upon receiving a packet w/ CWR flag. The side
      effect is it'll also reset the delayed ACK timer and interactive
      session tracking. This patch removes that side effect by using the
      new ACK_NOW flag to force an immmediate ACK.
      
      Packetdrill to demonstrate:
      
          0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
         +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
         +0 setsockopt(3, SOL_TCP, TCP_CONGESTION, "dctcp", 5) = 0
         +0 bind(3, ..., ...) = 0
         +0 listen(3, 1) = 0
      
         +0 < [ect0] SEW 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7>
         +0 > SE. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 8>
        +.1 < [ect0] . 1:1(0) ack 1 win 257
         +0 accept(3, ..., ...) = 4
      
         +0 < [ect0] . 1:1001(1000) ack 1 win 257
         +0 > [ect01] . 1:1(0) ack 1001
      
         +0 write(4, ..., 1) = 1
         +0 > [ect01] P. 1:2(1) ack 1001
      
         +0 < [ect0] . 1001:2001(1000) ack 2 win 257
         +0 write(4, ..., 1) = 1
         +0 > [ect01] P. 2:3(1) ack 2001
      
         +0 < [ect0] . 2001:3001(1000) ack 3 win 257
         +0 < [ect0] . 3001:4001(1000) ack 3 win 257
         // Ack delayed ...
      
         +.01 < [ce] P. 4001:4501(500) ack 3 win 257
         +0 > [ect01] . 3:3(0) ack 4001
         +0 > [ect01] E. 3:3(0) ack 4501
      
      +.001 read(4, ..., 4500) = 4500
         +0 write(4, ..., 1) = 1
         +0 > [ect01] PE. 3:4(1) ack 4501 win 100
      
       +.01 < [ect0] W. 4501:5501(1000) ack 4 win 257
         // No delayed ACK on CWR flag
         +0 > [ect01] . 4:4(0) ack 5501
      
       +.31 < [ect0] . 5501:6501(1000) ack 4 win 257
         +0 > [ect01] . 4:4(0) ack 6501
      
      Fixes: 9aee4000 ("tcp: ack immediately when a cwr packet arrives")
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fd2123a3