1. 23 Nov 2022 · 1 commit
  2. 16 Nov 2022 · 1 commit
  3. 15 Nov 2022 · 1 commit
  4. 02 Nov 2022 · 1 commit
    • tcp: refine tcp_prune_ofo_queue() logic · b0e01253
      Authored by Eric Dumazet
      After commits 36a6503f ("tcp: refine tcp_prune_ofo_queue()
      to not drop all packets") and 72cd43ba
      ("tcp: free batches of packets in tcp_prune_ofo_queue()"),
      tcp_prune_ofo_queue() drops a fraction of the ooo queue
      to make room for the incoming packet.
      
      However, it makes no sense to drop packets that are
      before the incoming packet in sequence space.

      To recover from packet losses faster, it makes more sense
      to drop only ooo packets that are after the incoming packet
      (a simplified sketch of this policy follows this entry).
      
      Tested:
      packetdrill test:
         0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
         +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
         +0 setsockopt(3, SOL_SOCKET, SO_RCVBUF, [3800], 4) = 0
         +0 bind(3, ..., ...) = 0
         +0 listen(3, 1) = 0
      
         +0 < S 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7>
         +0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 0>
        +.1 < . 1:1(0) ack 1 win 1024
         +0 accept(3, ..., ...) = 4
      
       +.01 < . 200:300(100) ack 1 win 1024
         +0 > . 1:1(0) ack 1 <nop,nop, sack 200:300>
      
       +.01 < . 400:500(100) ack 1 win 1024
         +0 > . 1:1(0) ack 1 <nop,nop, sack 400:500 200:300>
      
       +.01 < . 600:700(100) ack 1 win 1024
         +0 > . 1:1(0) ack 1 <nop,nop, sack 600:700 400:500 200:300>
      
       +.01 < . 800:900(100) ack 1 win 1024
         +0 > . 1:1(0) ack 1 <nop,nop, sack 800:900 600:700 400:500 200:300>
      
       +.01 < . 1000:1100(100) ack 1 win 1024
         +0 > . 1:1(0) ack 1 <nop,nop, sack 1000:1100 800:900 600:700 400:500>
      
       +.01 < . 1200:1300(100) ack 1 win 1024
         +0 > . 1:1(0) ack 1 <nop,nop, sack 1200:1300 1000:1100 800:900 600:700>
      
      // this packet is dropped because we have no room left.
       +.01 < . 1400:1500(100) ack 1 win 1024
      
       +.01 < . 1:200(199) ack 1 win 1024
      // Make sure kernel did not drop 200:300 sequence
         +0 > . 1:1(0) ack 300 <nop,nop, sack 1200:1300 1000:1100 800:900 600:700>
      // Make room, since our RCVBUF is very small
         +0 read(4, ..., 299) = 299
      
       +.01 < . 300:400(100) ack 1 win 1024
         +0 > . 1:1(0) ack 500 <nop,nop, sack 1200:1300 1000:1100 800:900 600:700>
      
       +.01 < . 500:600(100) ack 1 win 1024
         +0 > . 1:1(0) ack 700 <nop,nop, sack 1200:1300 1000:1100 800:900>
      
         +0 read(4, ..., 400) = 400
      
       +.01 < . 700:800(100) ack 1 win 1024
         +0 > . 1:1(0) ack 900 <nop,nop, sack 1200:1300 1000:1100>
      
       +.01 < . 900:1000(100) ack 1 win 1024
         +0 > . 1:1(0) ack 1100 <nop,nop, sack 1200:1300>
      
       +.01 < . 1100:1200(100) ack 1 win 1024
      // This checks that 1200:1300 has not been removed from ooo queue
         +0 > . 1:1(0) ack 1300
      Suggested-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Acked-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Link: https://lore.kernel.org/r/20221101035234.3910189-1-edumazet@google.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
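
      A minimal userspace sketch of the pruning policy described in this
      commit, assuming a flat, sorted segment array instead of the kernel's
      out-of-order rb-tree; the names and the exact sequence comparison are
      illustrative, not the kernel patch itself:

          #include <stdio.h>
          #include <stdint.h>

          struct ooo_seg { uint32_t start, end; };        /* [start, end) */

          /* Drop at most 'goal' segments from the tail (highest sequence
           * numbers), and only segments lying after the incoming segment;
           * returns the new queue length.  The kernel uses wraparound-safe
           * sequence comparisons here. */
          static int prune_ooo_queue(struct ooo_seg *q, int len,
                                     uint32_t incoming_end, int goal)
          {
              while (len > 0 && goal > 0 && q[len - 1].start >= incoming_end) {
                  len--;
                  goal--;
              }
              return len;
          }

          int main(void)
          {
              /* Same layout as the packetdrill test above. */
              struct ooo_seg q[] = {
                  { 200,  300}, { 400,  500}, { 600,  700},
                  { 800,  900}, {1000, 1100}, {1200, 1300},
              };
              /* Incoming segment 300:400 needs room: only segments after it
               * (taken from the tail) may be dropped, never 200:300. */
              int len = prune_ooo_queue(q, 6, 400, 2);

              for (int i = 0; i < len; i++)
                  printf("kept %u:%u\n", (unsigned)q[i].start, (unsigned)q[i].end);
              return 0;
          }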
  5. 25 Oct 2022 · 1 commit
    • tcp: fix indefinite deferral of RTO with SACK reneging · 3d2af9cc
      Authored by Neal Cardwell
      This commit fixes a bug that can cause a TCP data sender to repeatedly
      defer RTOs when encountering SACK reneging.
      
      The bug is that when we're in fast recovery in a scenario with SACK
      reneging, every time we get an ACK we call tcp_check_sack_reneging()
      and it can note the apparent SACK reneging and rearm the RTO timer for
      srtt/2 into the future. In some SACK reneging scenarios that can
      happen repeatedly until the receive window fills up, at which point
      the sender can't send any more, the ACKs stop arriving, and the RTO
      fires at srtt/2 after the last ACK. But that can take far too long
      (O(10 secs)), since the connection is stuck in fast recovery with a
      low cwnd that cannot grow beyond ssthresh, even if more bandwidth is
      available.
      
      This fix changes the logic in tcp_check_sack_reneging() to only rearm
      the RTO timer if data is cumulatively ACKed, indicating forward
      progress (a simplified model follows this entry). This avoids this
      kind of nearly infinite loop of RTO timer re-arming. In addition, it
      still meets the goal of tcp_check_sack_reneging(): handling Windows
      TCP behavior that temporarily looks like SACK reneging but is not.
      
      Many thanks to Jakub Kicinski and Neil Spring, who reported this issue
      and provided critical packet traces that enabled root-causing this
      issue. Also, many thanks to Jakub Kicinski for testing this fix.
      
      Fixes: 5ae344c9 ("tcp: reduce spurious retransmits due to transient SACK reneging")
      Reported-by: Jakub Kicinski <kuba@kernel.org>
      Reported-by: Neil Spring <ntspring@fb.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Tested-by: Jakub Kicinski <kuba@kernel.org>
      Link: https://lore.kernel.org/r/20221021170821.1093930-1-ncardwell.kernel@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
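
      A small userspace model of the fixed policy, using assumed struct and
      function names rather than the kernel's tcp_check_sack_reneging(); it
      captures only the "rearm the reneging RTO solely on forward progress"
      rule that this commit introduces:

          #include <stdbool.h>
          #include <stdint.h>
          #include <stdio.h>

          struct conn {
              uint32_t snd_una;   /* highest cumulatively ACKed sequence */
              uint32_t srtt_us;   /* smoothed RTT, in microseconds       */
          };

          /* Called when an incoming ACK finds the SACK scoreboard apparently
           * reneged.  Only push the RTO another srtt/2 into the future when
           * the ACK advanced snd_una; otherwise leave the pending RTO alone
           * so it can eventually fire. */
          static bool rearm_reneging_rto(const struct conn *c,
                                         uint32_t prior_snd_una,
                                         uint32_t *delay_us)
          {
              if (c->snd_una == prior_snd_una)
                  return false;            /* no progress: keep pending RTO  */
              *delay_us = c->srtt_us / 2;  /* progress: grant another srtt/2 */
              return true;
          }

          int main(void)
          {
              struct conn c = { .snd_una = 5000, .srtt_us = 200000 };
              uint32_t delay = 0;

              printf("rearm without progress: %d\n",
                     rearm_reneging_rto(&c, 5000, &delay));
              printf("rearm with progress:    %d (srtt/2 = %u us)\n",
                     rearm_reneging_rto(&c, 4000, &delay), (unsigned)delay);
              return 0;
          }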
  6. 06 Sep 2022 · 1 commit
    • tcp: fix early ETIMEDOUT after spurious non-SACK RTO · 686dc2db
      Authored by Neal Cardwell
      Fix a bug reported and analyzed by Nagaraj Arankal, where the handling
      of a spurious non-SACK RTO could cause a connection to fail to clear
      retrans_stamp, causing a later RTO to very prematurely time out the
      connection with ETIMEDOUT.
      
      Here is the buggy scenario, expanding upon Nagaraj Arankal's excellent
      report:
      
      (*1) Send one data packet on a non-SACK connection
      
      (*2) Because no ACK packet is received, the packet is retransmitted
           and we enter CA_Loss; but this retransmission is spurious.
      
      (*3) The ACK for the original data is received. The transmitted packet
           is acknowledged.  The TCP timestamp is before the retrans_stamp,
           so tcp_may_undo() returns true, and tcp_try_undo_loss() returns
           true without changing state to Open (because tcp_is_sack() is
           false), and tcp_process_loss() returns without calling
           tcp_try_undo_recovery().  Normally after undoing a CA_Loss
           episode, tcp_fastretrans_alert() would see that the connection
           has returned to CA_Open and fall through and call
           tcp_try_to_open(), which would set retrans_stamp to 0.  However,
           for non-SACK connections we hold the connection in CA_Loss, so do
           not fall through to call tcp_try_to_open() and do not set
           retrans_stamp to 0. So retrans_stamp is (erroneously) still
           non-zero.
      
           At this point the first "retransmission event" has passed and
           been recovered from. Any future retransmission is a completely
           new "event". However, retrans_stamp is erroneously still
           set. (And we are still in CA_Loss, which is correct.)
      
      (*4) After 16 minutes (to correspond with tcp_retries2=15), a new data
           packet is sent. Note: no data is transmitted between (*3) and
           (*4), and keepalives are disabled.
      
           The socket's timeout SHOULD be calculated from this point in
           time, but instead it's calculated from the prior "event" 16
           minutes ago (step (*2)).
      
      (*5) Because no ACK packet is received, the packet is retransmitted.
      
      (*6) At the time of the 2nd retransmission, the socket returns
           ETIMEDOUT, prematurely, because retrans_stamp is (erroneously)
           too far in the past (set at the time of (*2)).
      
      This commit fixes the bug by making tcp_try_undo_loss() reuse the
      same careful logic for non-SACK connections that we already have in
      tcp_try_undo_recovery(). To avoid duplication, that logic is factored
      out into a new tcp_is_non_sack_preventing_reopen() helper, which is
      called from both undo functions (a simplified model follows this
      entry).
      
      Fixes: da34ac76 ("tcp: only undo on partial ACKs in CA_Loss")
      Reported-by: Nagaraj Arankal <nagaraj.p.arankal@hpe.com>
      Link: https://lore.kernel.org/all/SJ0PR84MB1847BE6C24D274C46A1B9B0EB27A9@SJ0PR84MB1847.NAMPRD84.PROD.OUTLOOK.COM/
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20220903121023.866900-1-ncardwell.kernel@gmail.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
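
      A deliberately simplified userspace model of the helper's role, with
      assumed field names and without the sequence-number checks the kernel
      performs; it is not tcp_is_non_sack_preventing_reopen() itself, only
      an illustration of holding the CA state while still clearing a stale
      retrans_stamp:

          #include <stdbool.h>
          #include <stdint.h>
          #include <stdio.h>

          struct conn {
              bool     is_sack;          /* SACK negotiated on this connection */
              bool     any_retrans_out;  /* retransmitted data still unacked   */
              uint32_t retrans_stamp;    /* start of the current retransmission
                                          * episode; 0 means "no episode"      */
          };

          /* Returns true when a non-SACK connection must be held in its
           * current state (false-fast-retransmit protection).  Crucially, it
           * also clears retrans_stamp once no retransmitted data remains
           * outstanding, so a later, unrelated RTO episode computes its
           * ETIMEDOUT deadline from its own first retransmission instead of
           * from this stale stamp. */
          static bool non_sack_preventing_reopen(struct conn *c)
          {
              if (c->is_sack)
                  return false;
              if (!c->any_retrans_out)
                  c->retrans_stamp = 0;
              return true;
          }

          int main(void)
          {
              struct conn c = { .is_sack = false, .any_retrans_out = false,
                                .retrans_stamp = 12345 };

              printf("hold state: %d, retrans_stamp now: %u\n",
                     non_sack_preventing_reopen(&c), (unsigned)c.retrans_stamp);
              return 0;
          }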
  7. 01 Sep 2022 · 2 commits
  8. 25 Jul 2022 · 5 commits
  9. 22 Jul 2022 · 7 commits
  10. 20 Jul 2022 · 4 commits
  11. 18 Jul 2022 · 3 commits
  12. 13 Jul 2022 · 1 commit
  13. 17 Jun 2022 · 1 commit
  14. 11 Jun 2022 · 2 commits
  15. 28 May 2022 · 1 commit
    • tcp: fix tcp_mtup_probe_success vs wrong snd_cwnd · 11825765
      Authored by Eric Dumazet
      syzbot got a new report [1] finally pointing to a very old bug,
      added in initial support for MTU probing.
      
      tcp_mtu_probe() checks that tcp_snd_cwnd(tp) >= 11 before
      starting an MTU probe.

      But nothing prevents tcp_snd_cwnd(tp) from being reduced later,
      before the MTU probe succeeds.

      This bug could lead to zero-divides.

      Debugging added in commit 40570375 ("tcp: add accessors
      to read/set tp->snd_cwnd") has paid off :)

      While we are at it, address potential overflows in this code
      (a simplified model of the fixed scaling follows this entry).
      
      [1]
      WARNING: CPU: 1 PID: 14132 at include/net/tcp.h:1219 tcp_mtup_probe_success+0x366/0x570 net/ipv4/tcp_input.c:2712
      Modules linked in:
      CPU: 1 PID: 14132 Comm: syz-executor.2 Not tainted 5.18.0-syzkaller-07857-gbabf0bb9 #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      RIP: 0010:tcp_snd_cwnd_set include/net/tcp.h:1219 [inline]
      RIP: 0010:tcp_mtup_probe_success+0x366/0x570 net/ipv4/tcp_input.c:2712
      Code: 74 08 48 89 ef e8 da 80 17 f9 48 8b 45 00 65 48 ff 80 80 03 00 00 48 83 c4 30 5b 41 5c 41 5d 41 5e 41 5f 5d c3 e8 aa b0 c5 f8 <0f> 0b e9 16 fe ff ff 48 8b 4c 24 08 80 e1 07 38 c1 0f 8c c7 fc ff
      RSP: 0018:ffffc900079e70f8 EFLAGS: 00010287
      RAX: ffffffff88c0f7f6 RBX: ffff8880756e7a80 RCX: 0000000000040000
      RDX: ffffc9000c6c4000 RSI: 0000000000031f9e RDI: 0000000000031f9f
      RBP: 0000000000000000 R08: ffffffff88c0f606 R09: ffffc900079e7520
      R10: ffffed101011226d R11: 1ffff1101011226c R12: 1ffff1100eadcf50
      R13: ffff8880756e72c0 R14: 1ffff1100eadcf89 R15: dffffc0000000000
      FS:  00007f643236e700(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f1ab3f1e2a0 CR3: 0000000064fe7000 CR4: 00000000003506e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       <TASK>
       tcp_clean_rtx_queue+0x223a/0x2da0 net/ipv4/tcp_input.c:3356
       tcp_ack+0x1962/0x3c90 net/ipv4/tcp_input.c:3861
       tcp_rcv_established+0x7c8/0x1ac0 net/ipv4/tcp_input.c:5973
       tcp_v6_do_rcv+0x57b/0x1210 net/ipv6/tcp_ipv6.c:1476
       sk_backlog_rcv include/net/sock.h:1061 [inline]
       __release_sock+0x1d8/0x4c0 net/core/sock.c:2849
       release_sock+0x5d/0x1c0 net/core/sock.c:3404
       sk_stream_wait_memory+0x700/0xdc0 net/core/stream.c:145
       tcp_sendmsg_locked+0x111d/0x3fc0 net/ipv4/tcp.c:1410
       tcp_sendmsg+0x2c/0x40 net/ipv4/tcp.c:1448
       sock_sendmsg_nosec net/socket.c:714 [inline]
       sock_sendmsg net/socket.c:734 [inline]
       __sys_sendto+0x439/0x5c0 net/socket.c:2119
       __do_sys_sendto net/socket.c:2131 [inline]
       __se_sys_sendto net/socket.c:2127 [inline]
       __x64_sys_sendto+0xda/0xf0 net/socket.c:2127
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x2b/0x70 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x46/0xb0
      RIP: 0033:0x7f6431289109
      Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
      RSP: 002b:00007f643236e168 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
      RAX: ffffffffffffffda RBX: 00007f643139c100 RCX: 00007f6431289109
      RDX: 00000000d0d0c2ac RSI: 0000000020000080 RDI: 000000000000000a
      RBP: 00007f64312e308d R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000000
      R13: 00007fff372533af R14: 00007f643236e300 R15: 0000000000022000
      
      Fixes: 5d424d5a ("[TCP]: MTU probing")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
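
      A simplified model of the corrected window scaling, with assumed
      parameter names; the kernel works in terms of MSS/MTU accessors, but
      the two guards shown here (a 64-bit multiply against overflow and a
      clamp so the result can never reach zero) are what the commit message
      describes:

          #include <stdint.h>
          #include <stdio.h>

          /* After a successful MTU probe the MSS grows, so snd_cwnd (counted
           * in segments) is scaled down by old_mtu/probe_mtu. */
          static uint32_t scale_cwnd_after_probe(uint32_t snd_cwnd,
                                                 uint32_t old_mtu,
                                                 uint32_t probe_mtu)
          {
              uint64_t val = (uint64_t)snd_cwnd * old_mtu;  /* no 32-bit overflow */

              val /= probe_mtu;
              if (val < 1)
                  val = 1;       /* never let the congestion window become zero */
              return (uint32_t)val;
          }

          int main(void)
          {
              /* cwnd collapsed to 1 after the probe started: an unguarded
               * computation would yield 0 here and invite a zero-divide. */
              printf("new cwnd = %u\n",
                     (unsigned)scale_cwnd_after_probe(1, 1460, 2920));
              return 0;
          }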
  16. 20 May 2022 · 1 commit
  17. 30 Apr 2022 · 1 commit
  18. 29 Apr 2022 · 1 commit
    • tcp: fix F-RTO may not work correctly when receiving DSACK · d9157f68
      Authored by Pengcheng Yang
      Currently a DSACK is regarded as a dupack, which may cause
      F-RTO to incorrectly conclude "loss was real" when a DSACK
      is received (a simplified model follows this entry).
      
      Packetdrill to demonstrate:
      
      // Enable F-RTO and TLP
          0 `sysctl -q net.ipv4.tcp_frto=2`
          0 `sysctl -q net.ipv4.tcp_early_retrans=3`
          0 `sysctl -q net.ipv4.tcp_congestion_control=cubic`
      
      // Establish a connection
         +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
         +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
         +0 bind(3, ..., ...) = 0
         +0 listen(3, 1) = 0
      
      // RTT 10ms, RTO 210ms
        +.1 < S 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7>
         +0 > S. 0:0(0) ack 1 <...>
       +.01 < . 1:1(0) ack 1 win 257
         +0 accept(3, ..., ...) = 4
      
      // Send 2 data segments
         +0 write(4, ..., 2000) = 2000
         +0 > P. 1:2001(2000) ack 1
      
      // TLP
      +.022 > P. 1001:2001(1000) ack 1
      
      // Continue to send 8 data segments
         +0 write(4, ..., 10000) = 10000
         +0 > P. 2001:10001(8000) ack 1
      
      // RTO
      +.188 > . 1:1001(1000) ack 1
      
      // The original data is acked and new data is sent(F-RTO step 2.b)
         +0 < . 1:1(0) ack 2001 win 257
         +0 > P. 10001:12001(2000) ack 1
      
      // D-SACK caused by TLP is regarded as a dupack, this results in
      // the incorrect judgment of "loss was real"(F-RTO step 3.a)
      +.022 < . 1:1(0) ack 2001 win 257 <sack 1001:2001,nop,nop>
      
      // Never-retransmitted data(3001:4001) are acked and
      // expect to switch to open state(F-RTO step 3.b)
         +0 < . 1:1(0) ack 4001 win 257
      +0 %{ assert tcpi_ca_state == 0, tcpi_ca_state }%
      
      Fixes: e33099f9 ("tcp: implement RFC5682 F-RTO")
      Signed-off-by: Pengcheng Yang <yangpc@wangsu.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Tested-by: Neal Cardwell <ncardwell@google.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/1650967419-2150-1-git-send-email-yangpc@wangsu.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
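
      An illustrative model of the corrected duplicate-ACK classification;
      the flag names follow the kernel's conventions but the logic is a
      simplification, not the patch itself.  An ACK that merely reports a
      D-SACK (for example one triggered by a spurious TLP retransmission)
      no longer counts as a dupack for F-RTO step 3.a:

          #include <stdbool.h>
          #include <stdio.h>

          #define FLAG_SND_UNA_ADVANCED 0x01  /* ACK moved snd_una forward      */
          #define FLAG_NOT_DUP          0x02  /* ACK carried data or new window */
          #define FLAG_DSACKING_ACK     0x04  /* ACK carried a D-SACK block     */

          static bool counts_as_dupack(int flag)
          {
              return !(flag & (FLAG_SND_UNA_ADVANCED | FLAG_NOT_DUP)) &&
                     !(flag & FLAG_DSACKING_ACK);
          }

          int main(void)
          {
              printf("plain dupack counts: %d\n", counts_as_dupack(0));
              printf("D-SACK-only counts:  %d\n",
                     counts_as_dupack(FLAG_DSACKING_ACK));
              return 0;
          }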
  19. 25 Apr 2022 · 1 commit
    • tcp: fix potential xmit stalls caused by TCP_NOTSENT_LOWAT · 4bfe744f
      Authored by Eric Dumazet
      I had this bug sitting for too long in my pile, it is time to fix it.
      
      Thanks to Doug Porter for reminding me of it!
      
      We had various attempts in the past, including commit
      0cbe6a8f ("tcp: remove SOCK_QUEUE_SHRUNK"),
      but the issue is that the TCP stack currently only generates
      EPOLLOUT from the input path, when tp->snd_una has advanced
      and skb(s) have been cleaned from the rtx queue.
      
      If a flow has a big RTT, and/or receives SACKs, it is possible
      that the notsent part (tp->write_seq - tp->snd_nxt) reaches 0
      and no more data can be sent until tp->snd_una finally advances.
      
      What is needed is to also check, from the output path, whether
      POLLOUT needs to be generated whenever tp->snd_nxt is advanced
      (a simplified model of this check follows this entry).
      
      This bug triggers more often after an idle period, as
      we do not receive an ACK for at least one RTT. tcp_notsent_lowat
      could be a fraction of what the CWND and pacing rate would allow
      to be sent during this RTT.
      
      In a followup patch, I will remove the bogus call
      to tcp_chrono_stop(sk, TCP_CHRONO_SNDBUF_LIMITED)
      from tcp_check_space(). The fact that we have decided to generate
      an EPOLLOUT does not mean the application has immediately
      refilled the transmit queue. This optimistic call
      might have been the reason the bug seemed not too serious.
      
      Tested:
      
      200 ms rtt, 1% packet loss, 32 MB tcp_rmem[2] and tcp_wmem[2]
      
      $ echo 500000 >/proc/sys/net/ipv4/tcp_notsent_lowat
      $ cat bench_rr.sh
      SUM=0
      for i in {1..10}
      do
       V=`netperf -H remote_host -l30 -t TCP_RR -- -r 10000000,10000 -o LOCAL_BYTES_SENT | egrep -v "MIGRATED|Bytes"`
       echo $V
       SUM=$(($SUM + $V))
      done
      echo SUM=$SUM
      
      Before patch:
      $ bench_rr.sh
      130000000
      80000000
      140000000
      140000000
      140000000
      140000000
      130000000
      40000000
      90000000
      110000000
      SUM=1140000000
      
      After patch:
      $ bench_rr.sh
      430000000
      590000000
      530000000
      450000000
      450000000
      350000000
      450000000
      490000000
      480000000
      460000000
      SUM=4680000000  # This is 410 % of the value before patch.
      
      Fixes: c9bee3b7 ("tcp: TCP_NOTSENT_LOWAT socket option")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Doug Porter <dsp@fb.com>
      Cc: Soheil Hassas Yeganeh <soheil@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
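
      A minimal model of the writeability check, with assumed field names;
      it shows why waking the writer only when tp->snd_una advances is not
      enough: the not-yet-sent backlog (write_seq - snd_nxt) also shrinks
      when the output path advances snd_nxt, and that is the moment this
      patch starts checking for EPOLLOUT as well:

          #include <stdbool.h>
          #include <stdint.h>
          #include <stdio.h>

          struct conn {
              uint32_t write_seq;      /* last byte queued by the application */
              uint32_t snd_nxt;        /* next byte to be sent on the wire    */
              uint32_t notsent_lowat;  /* TCP_NOTSENT_LOWAT threshold, bytes  */
          };

          /* The socket is writeable once the unsent backlog drops below the
           * configured low-water mark. */
          static bool stream_is_writeable(const struct conn *c)
          {
              return c->write_seq - c->snd_nxt < c->notsent_lowat;
          }

          int main(void)
          {
              struct conn c = { .write_seq = 1000000, .snd_nxt = 400000,
                                .notsent_lowat = 500000 };

              printf("writeable before transmit: %d\n", stream_is_writeable(&c));
              c.snd_nxt = 600000;   /* output path sent more data, no ACK yet */
              printf("writeable after transmit:  %d\n", stream_is_writeable(&c));
              return 0;
          }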
  20. 18 Apr 2022 · 1 commit
  21. 17 Apr 2022 · 3 commits