1. 06 May 2017 (1 commit)
• tcp: randomize timestamps on syncookies · 84b114b9
Committed by Eric Dumazet
The whole point of randomization was to hide server uptime, but an
attacker can simply start a SYN flood and TCP generates 'old style'
timestamps, directly revealing the server's jiffies value.
      
Also, the TSval sent by the server to a particular remote address
varies depending on whether syncookies are being sent or not,
potentially triggering PAWS drops for innocent clients.
      
Let's implement proper randomization, including for SYNcookies.
      
      Also we do not need to export sysctl_tcp_timestamps, since it is not
      used from a module.
      
In v2, I added Florian's feedback and contribution, adding tsoff to
tcp_get_cookie_sock().
      
v3 removed one unused variable in tcp_v4_connect(), as Florian spotted.
      
      Fixes: 95a22cae ("tcp: randomize tcp timestamp offsets for each connection")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Florian Westphal <fw@strlen.de>
Tested-by: Florian Westphal <fw@strlen.de>
Cc: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
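A minimal user-space sketch of the scheme described above, with illustrative names only (the kernel derives the per-peer offset with a keyed hash over the connection addresses and a boot-time secret; the mixer below is just a stand-in):

#include <stdint.h>
#include <stdio.h>

/* Stand-in keyed mixer; any good keyed hash works for the sketch. */
static uint32_t keyed_hash(uint32_t saddr, uint32_t daddr, uint64_t key)
{
    uint64_t h = key ^ (((uint64_t)saddr << 32) | daddr);
    h ^= h >> 33; h *= 0xff51afd7ed558ccdULL;
    h ^= h >> 33; h *= 0xc4ceb9fe1a85ec53ULL;
    h ^= h >> 33;
    return (uint32_t)h;
}

/* Per-peer TSval offset: the same peer always sees the same offset,
 * whether or not its SYN was answered with a syncookie, so PAWS stays
 * consistent and raw jiffies never appear on the wire. */
static uint32_t ts_offset(uint32_t saddr, uint32_t daddr, uint64_t secret)
{
    return keyed_hash(saddr, daddr, secret);
}

int main(void)
{
    uint64_t secret = 0x0123456789abcdefULL; /* random at boot in reality */
    uint32_t jiffies = 100000;               /* pretend server clock */
    printf("TSval on the wire: %u\n",
           jiffies + ts_offset(0x0a000001, 0xc0a80001, secret));
    return 0;
}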
2. 27 Apr 2017 (9 commits)
3. 25 Apr 2017 (2 commits)
• net/tcp_fastopen: Add snmp counter for blackhole detection · 46c2fa39
Committed by Wei Wang
      This counter records the number of times the firewall blackhole issue is
      detected and active TFO is disabled.
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
• net/tcp_fastopen: Disable active side TFO in certain scenarios · cf1ef3f0
Committed by Wei Wang
Middlebox firewall issues can potentially cause the server's data to
be blackholed after a successful 3WHS using TFO. The following are the
related reports from Apple:
      https://www.nanog.org/sites/default/files/Paasch_Network_Support.pdf
      Slide 31 identifies an issue where the client ACK to the server's data
      sent during a TFO'd handshake is dropped.
      C ---> syn-data ---> S
      C <--- syn/ack ----- S
      C (accept & write)
      C <---- data ------- S
      C ----- ACK -> X     S
      		[retry and timeout]
      
      https://www.ietf.org/proceedings/94/slides/slides-94-tcpm-13.pdf
Slide 5 shows a similar situation, in which the server's data gets
dropped after the 3WHS.
      C ---- syn-data ---> S
      C <--- syn/ack ----- S
      C ---- ack --------> S
      S (accept & write)
      C?  X <- data ------ S
      		[retry and timeout]
      
This is the worst failure because the client cannot detect such
behavior to mitigate the situation (such as by disabling TFO). Failing
to proceed, the application (e.g., an SSL library) may simply time out
and retry with TFO again, and the process repeats indefinitely.
      
      The proposed solution is to disable active TFO globally under the
      following circumstances:
1. the client-side TFO socket detects an out-of-order FIN
2. the client-side TFO socket receives an out-of-order RST
      
We disable active-side TFO globally for 1 hour at first. Then if it
happens again, we disable it for 2 hours, then 4 hours, 8 hours, and
so on. We reset the timeout back to 1 hour if a client-side TFO socket
not opened on loopback has successfully received data segments from
the server, and we examine this condition during close().
      
The rationale behind this is that when such a firewall issue happens,
the application running on the client should eventually close the
socket, as it is not able to get the data it is expecting, or the
application running on the server should close the socket, as it is
not able to receive any response from the client.
In both cases, an out-of-order FIN or RST will be received on the
client, given that the firewall will not block them since no data is
carried in those frames.
We disable active TFO globally because, when the middlebox is very
close to the client, most of the connections are likely to fail.
      
Also, add a debug sysctl:
  tcp_fastopen_blackhole_detect_timeout_sec:
    the initial timeout to use when the firewall blackhole issue
    happens. This can be set and read. Setting it to 0 disables the
    active-disable logic.
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
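A hedged sketch of the backoff policy described above (names and state layout are illustrative; the kernel keeps per-namespace equivalents):

#include <stdbool.h>
#include <stdint.h>

static unsigned int blackhole_timeout_sec = 3600; /* sysctl; 0 disables */
static unsigned int backoff_shift;                /* 0 -> 1h, 1 -> 2h, ... */
static uint64_t     tfo_disabled_until;           /* absolute seconds */

/* Called when an out-of-order FIN/RST is seen on a client TFO socket. */
static void tfo_active_disable(uint64_t now_sec)
{
    if (blackhole_timeout_sec == 0)
        return;                 /* detection switched off via sysctl */
    tfo_disabled_until = now_sec +
        ((uint64_t)blackhole_timeout_sec << backoff_shift);
    backoff_shift++;            /* 1h, 2h, 4h, 8h, ... */
}

/* Called on close(): a non-loopback TFO socket that actually received
 * data from the server proves the path works, so reset the backoff. */
static void tfo_active_disable_reset(bool non_loopback, bool got_data)
{
    if (non_loopback && got_data)
        backoff_shift = 0;
}

static bool tfo_active_allowed(uint64_t now_sec)
{
    return now_sec >= tfo_disabled_until;
}

int main(void)
{
    tfo_active_disable(1000);              /* 1st detection: 1h */
    tfo_active_disable(1000);              /* 2nd detection: 2h */
    tfo_active_disable_reset(true, true);  /* good connection: back to 1h */
    return tfo_active_allowed(1000) ? 1 : 0;
}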
4. 21 Apr 2017 (2 commits)
5. 08 Apr 2017 (1 commit)
6. 06 Apr 2017 (1 commit)
7. 05 Apr 2017 (1 commit)
8. 04 Apr 2017 (1 commit)
• tcp: minimize false-positives on TCP/GRO check · 0b9aefea
Committed by Marcelo Ricardo Leitner
      Markus Trippelsdorf reported that after commit dcb17d22 ("tcp: warn
      on bogus MSS and try to amend it") the kernel started logging the
      warning for a NIC driver that doesn't even support GRO.
      
It was diagnosed as likely caused by connections that were using TCP
timestamps but in which some packets lacked the timestamps option. As
we reduce rcv_mss when timestamps are used, their absence makes the
packets bigger than expected, although this is a valid case.
      
As this warning is more of a hint, getting a clean cut on the
threshold is probably not worth the execution time spent on it. This
patch thus alleviates the false positives with 2 quick checks: by
accounting for the entire TCP option space and also by checking
against the interface MTU if it's available.
      
These changes, especially the MTU one, might mask some real positives;
though if they are really happening, the warning will likely be
triggered sooner or later anyway.
Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
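A condensed sketch of the two added checks, with hypothetical names (the real test lives in the warning path of tcp_measure_rcv_mss(); only MAX_TCP_OPTION_SPACE is an actual kernel constant):

#include <stdbool.h>

#define MAX_TCP_OPTION_SPACE 40  /* actual kernel constant (net/tcp.h) */

/* Hypothetical condensed predicate: warn only when the segment exceeds
 * the MSS estimate even after allowing full option space, and does not
 * simply fit the interface MTU. */
static bool should_warn_bogus_mss(unsigned int seg_len,
                                  unsigned int rcv_mss,
                                  unsigned int ip_hdr_len,
                                  unsigned int tcp_hdr_len,
                                  int if_mtu /* -1 if unknown */)
{
    /* Packets without timestamps are legitimately bigger than a
     * timestamp-reduced rcv_mss, so allow the whole option space. */
    if (seg_len <= rcv_mss + MAX_TCP_OPTION_SPACE)
        return false;

    /* If the full packet fits the interface MTU, GRO cannot be the
     * culprit, so stay quiet. */
    if (if_mtu >= 0 &&
        seg_len + ip_hdr_len + tcp_hdr_len <= (unsigned int)if_mtu)
        return false;

    return true;
}

int main(void)
{
    /* 1460-byte payload, rcv_mss shrunk to 1448 by timestamps: no warn. */
    return should_warn_bogus_mss(1460, 1448, 20, 20, 1500) ? 1 : 0;
}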
9. 23 Mar 2017 (1 commit)
10. 17 Mar 2017 (2 commits)
11. 10 Mar 2017 (1 commit)
12. 02 Mar 2017 (1 commit)
• tcp/dccp: block BH for SYN processing · 449809a6
Committed by Eric Dumazet
      SYN processing really was meant to be handled from BH.
      
      When I got rid of BH blocking while processing socket backlog
      in commit 5413d1ba ("net: do not block BH while processing socket
      backlog"), I forgot that a malicious user could transition to TCP_LISTEN
      from a state that allowed (SYN) packets to be parked in the socket
backlog while the socket is owned by the thread doing the listen() call.
      
      Sure enough syzkaller found this and reported the bug ;)
      
      =================================
      [ INFO: inconsistent lock state ]
      4.10.0+ #60 Not tainted
      ---------------------------------
      inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
      syz-executor0/5090 [HC0[0]:SC0[0]:HE1:SE1] takes:
       (&(&hashinfo->ehash_locks[i])->rlock){+.?...}, at:
      [<ffffffff83a6a370>] spin_lock include/linux/spinlock.h:299 [inline]
       (&(&hashinfo->ehash_locks[i])->rlock){+.?...}, at:
      [<ffffffff83a6a370>] inet_ehash_insert+0x240/0xad0
      net/ipv4/inet_hashtables.c:407
      {IN-SOFTIRQ-W} state was registered at:
        mark_irqflags kernel/locking/lockdep.c:2923 [inline]
        __lock_acquire+0xbcf/0x3270 kernel/locking/lockdep.c:3295
        lock_acquire+0x241/0x580 kernel/locking/lockdep.c:3753
        __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
        _raw_spin_lock+0x33/0x50 kernel/locking/spinlock.c:151
        spin_lock include/linux/spinlock.h:299 [inline]
        inet_ehash_insert+0x240/0xad0 net/ipv4/inet_hashtables.c:407
        reqsk_queue_hash_req net/ipv4/inet_connection_sock.c:753 [inline]
        inet_csk_reqsk_queue_hash_add+0x1b7/0x2a0 net/ipv4/inet_connection_sock.c:764
        tcp_conn_request+0x25cc/0x3310 net/ipv4/tcp_input.c:6399
        tcp_v4_conn_request+0x157/0x220 net/ipv4/tcp_ipv4.c:1262
        tcp_rcv_state_process+0x802/0x4130 net/ipv4/tcp_input.c:5889
        tcp_v4_do_rcv+0x56b/0x940 net/ipv4/tcp_ipv4.c:1433
        tcp_v4_rcv+0x2e12/0x3210 net/ipv4/tcp_ipv4.c:1711
        ip_local_deliver_finish+0x4ce/0xc40 net/ipv4/ip_input.c:216
        NF_HOOK include/linux/netfilter.h:257 [inline]
        ip_local_deliver+0x1ce/0x710 net/ipv4/ip_input.c:257
        dst_input include/net/dst.h:492 [inline]
        ip_rcv_finish+0xb1d/0x2110 net/ipv4/ip_input.c:396
        NF_HOOK include/linux/netfilter.h:257 [inline]
        ip_rcv+0xd90/0x19c0 net/ipv4/ip_input.c:487
        __netif_receive_skb_core+0x1ad1/0x3400 net/core/dev.c:4179
        __netif_receive_skb+0x2a/0x170 net/core/dev.c:4217
        netif_receive_skb_internal+0x1d6/0x430 net/core/dev.c:4245
        napi_skb_finish net/core/dev.c:4602 [inline]
        napi_gro_receive+0x4e6/0x680 net/core/dev.c:4636
        e1000_receive_skb drivers/net/ethernet/intel/e1000/e1000_main.c:4033 [inline]
        e1000_clean_rx_irq+0x5e0/0x1490
      drivers/net/ethernet/intel/e1000/e1000_main.c:4489
        e1000_clean+0xb9a/0x2910 drivers/net/ethernet/intel/e1000/e1000_main.c:3834
        napi_poll net/core/dev.c:5171 [inline]
        net_rx_action+0xe70/0x1900 net/core/dev.c:5236
        __do_softirq+0x2fb/0xb7d kernel/softirq.c:284
        invoke_softirq kernel/softirq.c:364 [inline]
        irq_exit+0x19e/0x1d0 kernel/softirq.c:405
        exiting_irq arch/x86/include/asm/apic.h:658 [inline]
        do_IRQ+0x81/0x1a0 arch/x86/kernel/irq.c:250
        ret_from_intr+0x0/0x20
        native_safe_halt+0x6/0x10 arch/x86/include/asm/irqflags.h:53
        arch_safe_halt arch/x86/include/asm/paravirt.h:98 [inline]
        default_idle+0x8f/0x410 arch/x86/kernel/process.c:271
        arch_cpu_idle+0xa/0x10 arch/x86/kernel/process.c:262
        default_idle_call+0x36/0x60 kernel/sched/idle.c:96
        cpuidle_idle_call kernel/sched/idle.c:154 [inline]
        do_idle+0x348/0x440 kernel/sched/idle.c:243
        cpu_startup_entry+0x18/0x20 kernel/sched/idle.c:345
        start_secondary+0x344/0x440 arch/x86/kernel/smpboot.c:272
        verify_cpu+0x0/0xfc
      irq event stamp: 1741
      hardirqs last  enabled at (1741): [<ffffffff84d49d77>]
      __raw_spin_unlock_irqrestore include/linux/spinlock_api_smp.h:160
      [inline]
      hardirqs last  enabled at (1741): [<ffffffff84d49d77>]
      _raw_spin_unlock_irqrestore+0xf7/0x1a0 kernel/locking/spinlock.c:191
      hardirqs last disabled at (1740): [<ffffffff84d4a732>]
      __raw_spin_lock_irqsave include/linux/spinlock_api_smp.h:108 [inline]
      hardirqs last disabled at (1740): [<ffffffff84d4a732>]
      _raw_spin_lock_irqsave+0xa2/0x110 kernel/locking/spinlock.c:159
      softirqs last  enabled at (1738): [<ffffffff84d4deff>]
      __do_softirq+0x7cf/0xb7d kernel/softirq.c:310
      softirqs last disabled at (1571): [<ffffffff84d4b92c>]
      do_softirq_own_stack+0x1c/0x30 arch/x86/entry/entry_64.S:902
      
      other info that might help us debug this:
       Possible unsafe locking scenario:
      
             CPU0
             ----
        lock(&(&hashinfo->ehash_locks[i])->rlock);
        <Interrupt>
          lock(&(&hashinfo->ehash_locks[i])->rlock);
      
       *** DEADLOCK ***
      
      1 lock held by syz-executor0/5090:
       #0:  (sk_lock-AF_INET6){+.+.+.}, at: [<ffffffff83406b43>] lock_sock
      include/net/sock.h:1460 [inline]
       #0:  (sk_lock-AF_INET6){+.+.+.}, at: [<ffffffff83406b43>]
      sock_setsockopt+0x233/0x1e40 net/core/sock.c:683
      
      stack backtrace:
      CPU: 1 PID: 5090 Comm: syz-executor0 Not tainted 4.10.0+ #60
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:15 [inline]
       dump_stack+0x292/0x398 lib/dump_stack.c:51
       print_usage_bug+0x3ef/0x450 kernel/locking/lockdep.c:2387
       valid_state kernel/locking/lockdep.c:2400 [inline]
       mark_lock_irq kernel/locking/lockdep.c:2602 [inline]
       mark_lock+0xf30/0x1410 kernel/locking/lockdep.c:3065
       mark_irqflags kernel/locking/lockdep.c:2941 [inline]
       __lock_acquire+0x6dc/0x3270 kernel/locking/lockdep.c:3295
       lock_acquire+0x241/0x580 kernel/locking/lockdep.c:3753
       __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
       _raw_spin_lock+0x33/0x50 kernel/locking/spinlock.c:151
       spin_lock include/linux/spinlock.h:299 [inline]
       inet_ehash_insert+0x240/0xad0 net/ipv4/inet_hashtables.c:407
       reqsk_queue_hash_req net/ipv4/inet_connection_sock.c:753 [inline]
       inet_csk_reqsk_queue_hash_add+0x1b7/0x2a0 net/ipv4/inet_connection_sock.c:764
       dccp_v6_conn_request+0xada/0x11b0 net/dccp/ipv6.c:380
       dccp_rcv_state_process+0x51e/0x1660 net/dccp/input.c:606
       dccp_v6_do_rcv+0x213/0x350 net/dccp/ipv6.c:632
       sk_backlog_rcv include/net/sock.h:896 [inline]
       __release_sock+0x127/0x3a0 net/core/sock.c:2052
       release_sock+0xa5/0x2b0 net/core/sock.c:2539
       sock_setsockopt+0x60f/0x1e40 net/core/sock.c:1016
       SYSC_setsockopt net/socket.c:1782 [inline]
       SyS_setsockopt+0x2fb/0x3a0 net/socket.c:1765
       entry_SYSCALL_64_fastpath+0x1f/0xc2
      RIP: 0033:0x4458b9
      RSP: 002b:00007fe8b26c2b58 EFLAGS: 00000292 ORIG_RAX: 0000000000000036
      RAX: ffffffffffffffda RBX: 0000000000000006 RCX: 00000000004458b9
      RDX: 000000000000001a RSI: 0000000000000001 RDI: 0000000000000006
      RBP: 00000000006e2110 R08: 0000000000000010 R09: 0000000000000000
      R10: 00000000208c3000 R11: 0000000000000292 R12: 0000000000708000
      R13: 0000000020000000 R14: 0000000000001000 R15: 0000000000000000
      
      Fixes: 5413d1ba ("net: do not block BH while processing socket backlog")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
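The shape of the fix, sketched with stubbed helpers so it stands alone (local_bh_disable()/local_bh_enable() are the real kernel primitives; everything else here is a stand-in):

/* Stubs standing in for <linux/bottom_half.h> and the af-specific
 * conn_request; they only exist to make the sketch self-contained. */
static void local_bh_disable(void) { /* mask softirqs on this CPU */ }
static void local_bh_enable(void)  { /* unmask; run pending softirqs */ }

struct sock;
struct sk_buff;

/* Takes ehash_locks[i], the same lock softirq RX takes. */
static int conn_request(struct sock *sk, struct sk_buff *skb)
{
    (void)sk; (void)skb;
    return 0;
}

/* Backlog processing runs in process context with BH enabled, so a SYN
 * parked there must not reach the ehash lock without blocking BH. */
static int listen_state_syn(struct sock *sk, struct sk_buff *skb)
{
    int acceptable;

    local_bh_disable();
    acceptable = conn_request(sk, skb) >= 0;
    local_bh_enable();
    return acceptable;
}

int main(void)
{
    return listen_state_syn(0, 0) ? 0 : 1;
}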
13. 08 Feb 2017 (1 commit)
14. 26 Jan 2017 (2 commits)
15. 18 Jan 2017 (1 commit)
• tcp: accept RST for rcv_nxt - 1 after receiving a FIN · 0e40f4c9
Committed by Jason Baron
      Using a Mac OSX box as a client connecting to a Linux server, we have found
that when certain applications (such as 'ab') are abruptly terminated
      (via ^C), a FIN is sent followed by a RST packet on tcp connections. The
      FIN is accepted by the Linux stack but the RST is sent with the same
      sequence number as the FIN, and Linux responds with a challenge ACK per
      RFC 5961. The OSX client then sometimes (they are rate-limited) does not
      reply with any RST as would be expected on a closed socket.
      
      This results in sockets accumulating on the Linux server left mostly in
      the CLOSE_WAIT state, although LAST_ACK and CLOSING are also possible.
      This sequence of events can tie up a lot of resources on the Linux server
      since there may be a lot of data in write buffers at the time of the RST.
      Accepting a RST equal to rcv_nxt - 1, after we have already successfully
      processed a FIN, has made a significant difference for us in practice, by
      freeing up unneeded resources in a more expedient fashion.
      
      A packetdrill test demonstrating the behavior:
      
      // testing mac osx rst behavior
      
      // Establish a connection
      0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
      0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
      0.000 bind(3, ..., ...) = 0
      0.000 listen(3, 1) = 0
      
      0.100 < S 0:0(0) win 32768 <mss 1460,nop,wscale 10>
      0.100 > S. 0:0(0) ack 1 <mss 1460,nop,wscale 5>
      0.200 < . 1:1(0) ack 1 win 32768
      0.200 accept(3, ..., ...) = 4
      
      // Client closes the connection
      0.300 < F. 1:1(0) ack 1 win 32768
      
      // now send rst with same sequence
      0.300 < R. 1:1(0) ack 1 win 32768
      
      // make sure we are in TCP_CLOSE
      0.400 %{
      assert tcpi_state == 7
      }%
Signed-off-by: Jason Baron <jbaron@akamai.com>
Cc: Eric Dumazet <edumazet@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
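A hedged sketch of the acceptance test (illustrative predicate; the kernel folds this into its existing RST validation in tcp_validate_incoming()):

#include <stdbool.h>
#include <stdint.h>

enum state { CLOSE_WAIT, LAST_ACK, CLOSING, OTHER };

static bool rst_acceptable(uint32_t seq, uint32_t rcv_nxt, enum state st)
{
    if (seq == rcv_nxt)
        return true;    /* exact in-sequence RST: normal case */

    /* Processing the FIN advanced rcv_nxt past it, so a peer that
     * reuses the FIN's sequence number for its RST lands at
     * rcv_nxt - 1. Accept that only in FIN-received states. */
    if (seq == rcv_nxt - 1 &&
        (st == CLOSE_WAIT || st == LAST_ACK || st == CLOSING))
        return true;

    return false;       /* otherwise: challenge ACK per RFC 5961 */
}

int main(void)
{
    /* The FIN consumed seq 1, so rcv_nxt == 2; the RST arrives at 1. */
    return rst_acceptable(1, 2, CLOSE_WAIT) ? 0 : 1;
}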
16. 14 Jan 2017 (12 commits)
• tcp: disable fack by default · 94bdc978
Committed by Yuchung Cheng
      This patch disables FACK by default as RACK is the successor of FACK
      (inspired by the insights behind FACK).
      
FACK[1] in Linux works as follows: a packet P is deemed lost if a
packet Q of higher sequence is s/acked and P and Q are at least
dupthresh packets apart in sequence space.
      
FACK is more aggressive than the IETF-recommended recovery for SACK
(RFC3517 A Conservative Selective Acknowledgment (SACK)-based Loss
 Recovery Algorithm for TCP), because a single SACK may trigger
fast recovery. This obviously won't work well with reordering, so
FACK is dynamically disabled upon detecting reordering.
      
RACK supersedes FACK by using time distance instead of sequence
distance. On reordering, RACK waits a quarter of an RTT after
receiving a single SACK before starting recovery. (The timer can be
made more adaptive in the future by measuring reordering distance in
time, but currently RTT/4 seems to work well.) Once recovery starts,
RACK behaves almost like FACK because it reduces the reordering
window to 1ms, so it fast-retransmits quickly. In addition, RACK
can detect lost retransmissions, as it does not care about packet
sequences (being repeated or not), which is extremely useful when
the connection is going through a traffic policer.
      
Google server experiments indicate that disabling FACK after enabling
RACK has negligible impact on the overall loss recovery performance,
with more reordering events detected. But we still keep the FACK
implementation as a backup in case RACK has bugs and needs to be
disabled.
      
      [1] M. Mathis, J. Mahdavi, "Forward Acknowledgment: Refining
      TCP Congestion Control," In Proceedings of SIGCOMM '96, August 1996.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
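The FACK criterion from the message above, as a small sketch (hypothetical names; highest_sack is the highest sequence known to be s/acked, mss scales sequence distance into packets):

#include <stdbool.h>
#include <stdint.h>

static bool fack_deems_lost(uint32_t pkt_end_seq, uint32_t highest_sack,
                            uint32_t mss, uint32_t dupthresh)
{
    /* Serial-number compare: nothing s/acked beyond this packet. */
    if ((int32_t)(highest_sack - pkt_end_seq) <= 0)
        return false;
    /* Lost once at least dupthresh packets lie between P and Q. */
    return (highest_sack - pkt_end_seq) / mss >= dupthresh;
}

int main(void)
{
    /* One SACK 4 packets ahead marks the hole lost at dupthresh 3:
     * the single-SACK aggressiveness the text mentions. */
    return fack_deems_lost(1000, 1000 + 4 * 1460, 1460, 3) ? 0 : 1;
}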
• tcp: remove thin_dupack feature · 4a7f6009
Committed by Yuchung Cheng
Thin-stream DUPACK starts fast recovery on only one DUPACK, provided
the connection is a thin stream (i.e., low inflight). But this older
feature is now subsumed by RACK: if a connection receives only a
single DUPACK, RACK will arm a reordering timer and soon start fast
recovery, instead of timing out, if no further ACKs are received.
      
The socket option (THIN_DUPACK) is kept as a no-op for compatibility.
Note that this patch does not change another thin-stream feature that
enables linear RTO, although it might be good to generalize that in
the future (i.e., linear RTO for the first, say, 3 retries).
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
• tcp: remove RFC4653 NCR · ac229dca
Committed by Yuchung Cheng
      This patch removes the (partial) implementation of the aggressive
      limited transmit in RFC4653 TCP Non-Congestion Robustness (NCR).
      
NCR is a mitigation for the problem created by the dynamic DUPACK
threshold: the current adaptive DUPACK threshold (tp->reordering)
can cause timeouts by preventing fast recovery. For example, if the
last packet of a cwnd burst was reordered, the threshold will be set
to the size of the cwnd. But if the next application burst is smaller
than the threshold and has drops instead of reorderings, the sender
would not trigger fast recovery but instead resort to a timeout
recovery.
      
NCR mitigates this issue by additionally checking the number of
DUPACKs against the current flight size. The technique is similar to
the early retransmit RFC.
      
With RACK loss detection, this mitigation is not needed, because RACK
does not use a DUPACK threshold to detect losses. RACK arms a
reordering timer to fire at most a quarter RTT later to start fast
recovery.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
• tcp: remove early retransmit · bec41a11
Committed by Yuchung Cheng
This patch removes support for RFC5827 early retransmit (i.e.,
fast recovery on a small inflight with <3 dupacks), because it is
subsumed by the new RACK loss detection. More specifically, when
RACK receives DUPACKs, it'll arm a reordering timer to start fast
recovery after a quarter of (min)RTT, hence it covers early
retransmit, except that RACK does not limit itself to a specific
inflight or dupack count.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
• tcp: remove forward retransmit feature · 840a3cbe
Committed by Yuchung Cheng
Forward retransmit is an esoteric feature in RFC3517 (condition (3)
in NextSeg()). Basically, if a packet is not considered lost by
the current criteria (# of dupacks etc.), but the congestion window
has room for more packets, then retransmit this packet.
      
However, it actually conflicts with the rest of the recovery design.
For example, when reordering is detected, we want to be conservative
in retransmitting packets, but the forward-retransmit feature would
break that by forcing more retransmissions. Also, the implementation
is fairly complicated inside the retransmission logic, inducing extra
iterations over the write queue. With RACK, losses are detected in a
timely manner and this heuristic is no longer necessary, so this
patch removes the feature.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
• tcp: extend F-RTO to catch more spurious timeouts · 89fe18e4
Committed by Yuchung Cheng
Current F-RTO reverts the cwnd reset whenever a never-retransmitted
packet is (s)acked. The timeout can be declared spurious because the
packets acknowledged by this ACK were transmitted before the timeout,
so clearly not all the packets were lost and the cwnd reset was
unnecessary.

This detection does not really depend on F-RTO internals, so this
patch applies it universally. On Google servers this change detected
20% more spurious timeouts.
Suggested-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
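A sketch of the generalized check: any ACK that (s)acks a never-retransmitted packet sent before the RTO proves the timeout spurious. FLAG_ORIG_SACK_ACKED is the real flag name in tcp_input.c; its value and the surrounding function are shown purely illustratively:

#include <stdbool.h>

#define FLAG_ORIG_SACK_ACKED 0x200 /* never-retransmitted data (s)acked */

/* Illustrative: no F-RTO internal state needed. In loss recovery, an
 * ACK covering an original transmission means the RTO was spurious
 * and the cwnd reduction can be undone. */
static bool timeout_was_spurious(int ack_flag, bool in_loss_recovery)
{
    return in_loss_recovery && (ack_flag & FLAG_ORIG_SACK_ACKED);
}

int main(void)
{
    return timeout_was_spurious(FLAG_ORIG_SACK_ACKED, true) ? 0 : 1;
}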
• tcp: enable RACK loss detection to trigger recovery · a0370b3f
Committed by Yuchung Cheng
      This patch changes two things:
      
      1. Start fast recovery with RACK in addition to other heuristics
   (e.g., DUPACK threshold, FACK). Prior to this change, RACK
   was enabled to detect losses only after recovery had been
   started by other algorithms.
      
      2. Disable TCP early retransmit. RACK subsumes the early retransmit
   with the new reordering timer feature. A later patch in this
   series removes the early retransmit code.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
• tcp: check undo conditions before detecting losses · 98e36d44
Committed by Yuchung Cheng
Currently RACK would mark losses before the undo operations in TCP
loss recovery. This could incorrectly identify real losses as
spurious. For example, a sender first experiences a delay spike and
then eventually some packets are lost due to a buffer overrun.
In this case, the sender should perform fast recovery because not all
the packets were lost.
      
But the sender may first trigger a (spurious) RTO and reset cwnd
to 1. The following ACKs may be used to mark real losses by
tcp_rack_mark_lost. Then in tcp_process_loss such an ACK could trigger
the F-RTO undo condition, unmark the real losses, and revert the cwnd
reduction. If there are no more ACKs coming back, eventually the
sender would time out again instead of performing fast recovery.
      
      The patch fixes this incorrect process by always performing
      the undo checks before detecting losses.
      
      Fixes: 4f41b1c5 ("tcp: use RACK to detect losses")
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
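The essence of the fix is an ordering change, sketched below with stand-in names: decide on undo first, then detect losses, so a spurious-RTO undo can no longer erase marks on genuinely lost packets:

#include <stdbool.h>

static bool undo_spurious_rto(void) { return false; } /* F-RTO undo check */
static void rack_mark_lost(void)    { }               /* mark late packets */

/* Before the patch the two steps ran in the opposite order, letting
 * the undo revert real loss marks made by rack_mark_lost(). */
static void process_ack(void)
{
    if (undo_spurious_rto())
        return;         /* cwnd reverted; skip loss detection */
    rack_mark_lost();
}

int main(void)
{
    process_ack();
    return 0;
}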
• tcp: use sequence to break TS ties for RACK loss detection · 1d0833df
Committed by Yuchung Cheng
The packets inside a jumbo skb (e.g., TSO) share the same skb
timestamp, even though they are sent sequentially on the wire. Since
RACK is based on time, it cannot detect that some packets inside the
same skb are lost. However, we can leverage the packet sequence
numbers as extended timestamps to detect losses. Therefore, when the
RACK timestamp is identical to an skb's timestamp (i.e., one of the
packets of the skb is acked or sacked), we use the sequence numbers
of the acked and unacked packets to break ties.
      
We can use the same sequence logic to advance the RACK xmit time as
well, to detect more losses and avoid timeouts.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
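A sketch of the tie-break rule (illustrative names; the kernel's version is the "sent after" test in the RACK code): a packet counts as sent after another if its timestamp is newer, or if the timestamps are equal (same TSO skb) and its end sequence is higher.

#include <stdbool.h>
#include <stdint.h>

/* "Sent after" with sequence numbers as extended timestamps: TSO can
 * give several packets the identical skb timestamp, so ties are
 * broken by end sequence (serial-number comparison). */
static bool sent_after(uint64_t t1_us, uint32_t seq1,
                       uint64_t t2_us, uint32_t seq2)
{
    return t1_us > t2_us ||
           (t1_us == t2_us && (int32_t)(seq1 - seq2) > 0);
}

int main(void)
{
    /* Two packets from one TSO skb: equal timestamps, so the later
     * sequence wins and the earlier one can still be deemed lost. */
    return sent_after(1000, 2920, 1000, 1460) ? 0 : 1;
}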
• tcp: add reordering timer in RACK loss detection · 57dde7f7
Committed by Yuchung Cheng
      This patch makes RACK install a reordering timer when it suspects
      some packets might be lost, but wants to delay the decision
a little bit to accommodate reordering.
      
      It does not create a new timer but instead repurposes the existing
      RTO timer, because both are meant to retransmit packets.
      Specifically it arms a timer ICSK_TIME_REO_TIMEOUT when
      the RACK timing check fails. The wait time is set to
      
        RACK.RTT + RACK.reo_wnd - (NOW - Packet.xmit_time) + fudge
      
This translates to expecting that a packet (Packet) should be
delivered within (RACK.RTT + RACK.reo_wnd + fudge) of being sent.
      
      When there are multiple packets that need a timer, we use one timer
      with the maximum timeout. Therefore the timer conservatively uses
      the maximum window to expire N packets by one timeout, instead of
      N timeouts to expire N packets sent at different times.
      
The fudge factor is 2 jiffies, to ensure that when the timer fires,
all the suspected packets will have exceeded the deadline and be
marked lost by tcp_rack_detect_loss(). It has to be at least 1 jiffy
because the clock may tick between calling
icsk_reset_xmit_timer(timeout) and the timer actually being armed.
The second jiffy lower-bounds the timeout to 2 jiffies when reo_wnd
is < 1ms.
      
When the reordering timer fires (tcp_rack_reo_timeout): if we aren't
in Recovery, we enter fast recovery and force a fast retransmit.
This is very similar to early retransmit (RFC5827), except that RACK
is not constrained to entering recovery only for small outstanding
flights.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
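A sketch of the remaining-wait computation from the formula above (illustrative types; all times in microseconds, with the 2-jiffy fudge converted by the caller):

#include <stdint.h>
#include <stdio.h>

/* Remaining time before a packet sent at xmit_time may be declared
 * lost: it is expected to be (s)acked within rtt + reo_wnd + fudge
 * of transmission. A negative result means "already late". */
static int64_t rack_reo_wait_us(uint64_t now_us, uint64_t xmit_time_us,
                                uint64_t rtt_us, uint64_t reo_wnd_us,
                                uint64_t fudge_us)
{
    return (int64_t)(rtt_us + reo_wnd_us + fudge_us) -
           (int64_t)(now_us - xmit_time_us);
}

int main(void)
{
    /* Sent 40ms ago, RTT 50ms, reo_wnd 1ms, 2 jiffies at HZ=1000:
     * arm the reordering timer for ~13ms from now. */
    printf("%lld us\n", (long long)
           rack_reo_wait_us(100000, 60000, 50000, 1000, 2000));
    return 0;
}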
• tcp: record most recent RTT in RACK loss detection · deed7be7
Committed by Yuchung Cheng
Record the most recent RTT in RACK. It is often identical to the
"ca_rtt_us" value in tcp_clean_rtx_queue, but when the packet has
been retransmitted, RACK chooses to believe the ACK is for the
(latest) retransmitted packet if the RTT is over the minimum RTT.
      
      This requires passing the arrival time of the most recent ACK to
RACK routines. The timestamp is now recorded in the "ack_time" field
of tcp_sacktag_state during ACK processing.
      
This patch does not change the RACK algorithm itself. It only adds
the RTT variable to prepare for the next main patch.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
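A sketch of the retransmission-aware RTT choice described above (illustrative names): for a retransmitted skb, trust the sample only when it is at least the path's minimum RTT, since a smaller value is likely an ACK for the original transmission.

#include <stdbool.h>
#include <stdint.h>

static bool rack_record_rtt(bool retransmitted, uint64_t rtt_us,
                            uint64_t min_rtt_us, uint64_t *rack_rtt_us)
{
    /* An ACK arriving "faster than physically possible" for the
     * retransmission must belong to the original transmission. */
    if (retransmitted && rtt_us < min_rtt_us)
        return false;
    *rack_rtt_us = rtt_us;  /* most recent RTT, kept for RACK */
    return true;
}

int main(void)
{
    uint64_t rack_rtt = 0;
    return rack_record_rtt(true, 55000, 40000, &rack_rtt) ? 0 : 1;
}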
• tcp: new helper for RACK to detect loss · e636f8b0
Committed by Yuchung Cheng
Create a new helper, tcp_rack_detect_loss, to prepare for the
upcoming RACK reordering timer patch.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
17. 30 Dec 2016 (1 commit)