1. 25 7月, 2022 1 次提交
  2. 22 7月, 2022 7 次提交
  3. 20 7月, 2022 4 次提交
  4. 18 7月, 2022 3 次提交
  5. 13 7月, 2022 1 次提交
  6. 28 5月, 2022 1 次提交
    • E
      tcp: fix tcp_mtup_probe_success vs wrong snd_cwnd · 11825765
      Eric Dumazet 提交于
      syzbot got a new report [1] finally pointing to a very old bug,
      added in initial support for MTU probing.
      
      tcp_mtu_probe() has checks about starting an MTU probe if
      tcp_snd_cwnd(tp) >= 11.
      
      But nothing prevents tcp_snd_cwnd(tp) to be reduced later
      and before the MTU probe succeeds.
      
      This bug would lead to potential zero-divides.
      
      Debugging added in commit 40570375 ("tcp: add accessors
      to read/set tp->snd_cwnd") has paid off :)
      
      While we are at it, address potential overflows in this code.
      
      [1]
      WARNING: CPU: 1 PID: 14132 at include/net/tcp.h:1219 tcp_mtup_probe_success+0x366/0x570 net/ipv4/tcp_input.c:2712
      Modules linked in:
      CPU: 1 PID: 14132 Comm: syz-executor.2 Not tainted 5.18.0-syzkaller-07857-gbabf0bb9 #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      RIP: 0010:tcp_snd_cwnd_set include/net/tcp.h:1219 [inline]
      RIP: 0010:tcp_mtup_probe_success+0x366/0x570 net/ipv4/tcp_input.c:2712
      Code: 74 08 48 89 ef e8 da 80 17 f9 48 8b 45 00 65 48 ff 80 80 03 00 00 48 83 c4 30 5b 41 5c 41 5d 41 5e 41 5f 5d c3 e8 aa b0 c5 f8 <0f> 0b e9 16 fe ff ff 48 8b 4c 24 08 80 e1 07 38 c1 0f 8c c7 fc ff
      RSP: 0018:ffffc900079e70f8 EFLAGS: 00010287
      RAX: ffffffff88c0f7f6 RBX: ffff8880756e7a80 RCX: 0000000000040000
      RDX: ffffc9000c6c4000 RSI: 0000000000031f9e RDI: 0000000000031f9f
      RBP: 0000000000000000 R08: ffffffff88c0f606 R09: ffffc900079e7520
      R10: ffffed101011226d R11: 1ffff1101011226c R12: 1ffff1100eadcf50
      R13: ffff8880756e72c0 R14: 1ffff1100eadcf89 R15: dffffc0000000000
      FS:  00007f643236e700(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f1ab3f1e2a0 CR3: 0000000064fe7000 CR4: 00000000003506e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       <TASK>
       tcp_clean_rtx_queue+0x223a/0x2da0 net/ipv4/tcp_input.c:3356
       tcp_ack+0x1962/0x3c90 net/ipv4/tcp_input.c:3861
       tcp_rcv_established+0x7c8/0x1ac0 net/ipv4/tcp_input.c:5973
       tcp_v6_do_rcv+0x57b/0x1210 net/ipv6/tcp_ipv6.c:1476
       sk_backlog_rcv include/net/sock.h:1061 [inline]
       __release_sock+0x1d8/0x4c0 net/core/sock.c:2849
       release_sock+0x5d/0x1c0 net/core/sock.c:3404
       sk_stream_wait_memory+0x700/0xdc0 net/core/stream.c:145
       tcp_sendmsg_locked+0x111d/0x3fc0 net/ipv4/tcp.c:1410
       tcp_sendmsg+0x2c/0x40 net/ipv4/tcp.c:1448
       sock_sendmsg_nosec net/socket.c:714 [inline]
       sock_sendmsg net/socket.c:734 [inline]
       __sys_sendto+0x439/0x5c0 net/socket.c:2119
       __do_sys_sendto net/socket.c:2131 [inline]
       __se_sys_sendto net/socket.c:2127 [inline]
       __x64_sys_sendto+0xda/0xf0 net/socket.c:2127
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x2b/0x70 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x46/0xb0
      RIP: 0033:0x7f6431289109
      Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
      RSP: 002b:00007f643236e168 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
      RAX: ffffffffffffffda RBX: 00007f643139c100 RCX: 00007f6431289109
      RDX: 00000000d0d0c2ac RSI: 0000000020000080 RDI: 000000000000000a
      RBP: 00007f64312e308d R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000000
      R13: 00007fff372533af R14: 00007f643236e300 R15: 0000000000022000
      
      Fixes: 5d424d5a ("[TCP]: MTU probing")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      11825765
  7. 20 5月, 2022 1 次提交
  8. 30 4月, 2022 1 次提交
  9. 29 4月, 2022 1 次提交
    • P
      tcp: fix F-RTO may not work correctly when receiving DSACK · d9157f68
      Pengcheng Yang 提交于
      Currently DSACK is regarded as a dupack, which may cause
      F-RTO to incorrectly enter "loss was real" when receiving
      DSACK.
      
      Packetdrill to demonstrate:
      
      // Enable F-RTO and TLP
          0 `sysctl -q net.ipv4.tcp_frto=2`
          0 `sysctl -q net.ipv4.tcp_early_retrans=3`
          0 `sysctl -q net.ipv4.tcp_congestion_control=cubic`
      
      // Establish a connection
         +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
         +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
         +0 bind(3, ..., ...) = 0
         +0 listen(3, 1) = 0
      
      // RTT 10ms, RTO 210ms
        +.1 < S 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7>
         +0 > S. 0:0(0) ack 1 <...>
       +.01 < . 1:1(0) ack 1 win 257
         +0 accept(3, ..., ...) = 4
      
      // Send 2 data segments
         +0 write(4, ..., 2000) = 2000
         +0 > P. 1:2001(2000) ack 1
      
      // TLP
      +.022 > P. 1001:2001(1000) ack 1
      
      // Continue to send 8 data segments
         +0 write(4, ..., 10000) = 10000
         +0 > P. 2001:10001(8000) ack 1
      
      // RTO
      +.188 > . 1:1001(1000) ack 1
      
      // The original data is acked and new data is sent(F-RTO step 2.b)
         +0 < . 1:1(0) ack 2001 win 257
         +0 > P. 10001:12001(2000) ack 1
      
      // D-SACK caused by TLP is regarded as a dupack, this results in
      // the incorrect judgment of "loss was real"(F-RTO step 3.a)
      +.022 < . 1:1(0) ack 2001 win 257 <sack 1001:2001,nop,nop>
      
      // Never-retransmitted data(3001:4001) are acked and
      // expect to switch to open state(F-RTO step 3.b)
         +0 < . 1:1(0) ack 4001 win 257
      +0 %{ assert tcpi_ca_state == 0, tcpi_ca_state }%
      
      Fixes: e33099f9 ("tcp: implement RFC5682 F-RTO")
      Signed-off-by: NPengcheng Yang <yangpc@wangsu.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Tested-by: NNeal Cardwell <ncardwell@google.com>
      Reviewed-by: NEric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/1650967419-2150-1-git-send-email-yangpc@wangsu.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>
      d9157f68
  10. 25 4月, 2022 1 次提交
    • E
      tcp: fix potential xmit stalls caused by TCP_NOTSENT_LOWAT · 4bfe744f
      Eric Dumazet 提交于
      I had this bug sitting for too long in my pile, it is time to fix it.
      
      Thanks to Doug Porter for reminding me of it!
      
      We had various attempts in the past, including commit
      0cbe6a8f ("tcp: remove SOCK_QUEUE_SHRUNK"),
      but the issue is that TCP stack currently only generates
      EPOLLOUT from input path, when tp->snd_una has advanced
      and skb(s) cleaned from rtx queue.
      
      If a flow has a big RTT, and/or receives SACKs, it is possible
      that the notsent part (tp->write_seq - tp->snd_nxt) reaches 0
      and no more data can be sent until tp->snd_una finally advances.
      
      What is needed is to also check if POLLOUT needs to be generated
      whenever tp->snd_nxt is advanced, from output path.
      
      This bug triggers more often after an idle period, as
      we do not receive ACK for at least one RTT. tcp_notsent_lowat
      could be a fraction of what CWND and pacing rate would allow to
      send during this RTT.
      
      In a followup patch, I will remove the bogus call
      to tcp_chrono_stop(sk, TCP_CHRONO_SNDBUF_LIMITED)
      from tcp_check_space(). Fact that we have decided to generate
      an EPOLLOUT does not mean the application has immediately
      refilled the transmit queue. This optimistic call
      might have been the reason the bug seemed not too serious.
      
      Tested:
      
      200 ms rtt, 1% packet loss, 32 MB tcp_rmem[2] and tcp_wmem[2]
      
      $ echo 500000 >/proc/sys/net/ipv4/tcp_notsent_lowat
      $ cat bench_rr.sh
      SUM=0
      for i in {1..10}
      do
       V=`netperf -H remote_host -l30 -t TCP_RR -- -r 10000000,10000 -o LOCAL_BYTES_SENT | egrep -v "MIGRATED|Bytes"`
       echo $V
       SUM=$(($SUM + $V))
      done
      echo SUM=$SUM
      
      Before patch:
      $ bench_rr.sh
      130000000
      80000000
      140000000
      140000000
      140000000
      140000000
      130000000
      40000000
      90000000
      110000000
      SUM=1140000000
      
      After patch:
      $ bench_rr.sh
      430000000
      590000000
      530000000
      450000000
      450000000
      350000000
      450000000
      490000000
      480000000
      460000000
      SUM=4680000000  # This is 410 % of the value before patch.
      
      Fixes: c9bee3b7 ("tcp: TCP_NOTSENT_LOWAT socket option")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NDoug Porter <dsp@fb.com>
      Cc: Soheil Hassas Yeganeh <soheil@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4bfe744f
  11. 18 4月, 2022 1 次提交
  12. 17 4月, 2022 10 次提交
  13. 07 4月, 2022 1 次提交
  14. 20 2月, 2022 4 次提交
  15. 11 2月, 2022 1 次提交
    • D
      net/smc: Limit SMC visits when handshake workqueue congested · 48b6190a
      D. Wythe 提交于
      This patch intends to provide a mechanism to put constraint on SMC
      connections visit according to the pressure of SMC handshake process.
      At present, frequent visits will cause the incoming connections to be
      backlogged in SMC handshake queue, raise the connections established
      time. Which is quite unacceptable for those applications who base on
      short lived connections.
      
      There are two ways to implement this mechanism:
      
      1. Put limitation after TCP established.
      2. Put limitation before TCP established.
      
      In the first way, we need to wait and receive CLC messages that the
      client will potentially send, and then actively reply with a decline
      message, in a sense, which is also a sort of SMC handshake, affect the
      connections established time on its way.
      
      In the second way, the only problem is that we need to inject SMC logic
      into TCP when it is about to reply the incoming SYN, since we already do
      that, it's seems not a problem anymore. And advantage is obvious, few
      additional processes are required to complete the constraint.
      
      This patch use the second way. After this patch, connections who beyond
      constraint will not informed any SMC indication, and SMC will not be
      involved in any of its subsequent processes.
      
      Link: https://lore.kernel.org/all/1641301961-59331-1-git-send-email-alibuda@linux.alibaba.com/Signed-off-by: ND. Wythe <alibuda@linux.alibaba.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      48b6190a
  16. 03 2月, 2022 1 次提交
  17. 02 2月, 2022 1 次提交
    • A
      tcp: Use BPF timeout setting for SYN ACK RTO · 5903123f
      Akhmat Karakotov 提交于
      When setting RTO through BPF program, some SYN ACK packets were unaffected
      and continued to use TCP_TIMEOUT_INIT constant. This patch adds timeout
      option to struct request_sock. Option is initialized with TCP_TIMEOUT_INIT
      and is reassigned through BPF using tcp_timeout_init call. SYN ACK
      retransmits now use newly added timeout option.
      Signed-off-by: NAkhmat Karakotov <hmukos@yandex-team.ru>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      
      v2:
      	- Add timeout option to struct request_sock. Do not call
      	  tcp_timeout_init on every syn ack retransmit.
      
      v3:
      	- Use unsigned long for min. Bound tcp_timeout_init to TCP_RTO_MAX.
      
      v4:
      	- Refactor duplicate code by adding reqsk_timeout function.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5903123f