1. 10 May 2023, 1 commit
    • tcp: add annotations around sk->sk_shutdown accesses · e14cadfd
      Committed by Eric Dumazet
      Now that sk->sk_shutdown is no longer a bitfield, we can add
      standard READ_ONCE()/WRITE_ONCE() annotations to silence
      KCSAN reports like the following:
      
      BUG: KCSAN: data-race in tcp_disconnect / tcp_poll
      
      write to 0xffff88814588582c of 1 bytes by task 3404 on cpu 1:
      tcp_disconnect+0x4d6/0xdb0 net/ipv4/tcp.c:3121
      __inet_stream_connect+0x5dd/0x6e0 net/ipv4/af_inet.c:715
      inet_stream_connect+0x48/0x70 net/ipv4/af_inet.c:727
      __sys_connect_file net/socket.c:2001 [inline]
      __sys_connect+0x19b/0x1b0 net/socket.c:2018
      __do_sys_connect net/socket.c:2028 [inline]
      __se_sys_connect net/socket.c:2025 [inline]
      __x64_sys_connect+0x41/0x50 net/socket.c:2025
      do_syscall_x64 arch/x86/entry/common.c:50 [inline]
      do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
      entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      read to 0xffff88814588582c of 1 bytes by task 3374 on cpu 0:
      tcp_poll+0x2e6/0x7d0 net/ipv4/tcp.c:562
      sock_poll+0x253/0x270 net/socket.c:1383
      vfs_poll include/linux/poll.h:88 [inline]
      io_poll_check_events io_uring/poll.c:281 [inline]
      io_poll_task_func+0x15a/0x820 io_uring/poll.c:333
      handle_tw_list io_uring/io_uring.c:1184 [inline]
      tctx_task_work+0x1fe/0x4d0 io_uring/io_uring.c:1246
      task_work_run+0x123/0x160 kernel/task_work.c:179
      get_signal+0xe64/0xff0 kernel/signal.c:2635
      arch_do_signal_or_restart+0x89/0x2a0 arch/x86/kernel/signal.c:306
      exit_to_user_mode_loop+0x6f/0xe0 kernel/entry/common.c:168
      exit_to_user_mode_prepare+0x6c/0xb0 kernel/entry/common.c:204
      __syscall_exit_to_user_mode_work kernel/entry/common.c:286 [inline]
      syscall_exit_to_user_mode+0x26/0x140 kernel/entry/common.c:297
      do_syscall_64+0x4d/0xc0 arch/x86/entry/common.c:86
      entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      value changed: 0x03 -> 0x00
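      As a hedged illustration of the pattern (call sites simplified,
      not the exact upstream diff), the writer pairs WRITE_ONCE() with
      READ_ONCE() in lockless readers such as tcp_poll():

      /* Sketch only, assuming the tcp_disconnect()/tcp_poll() call
       * sites shown in the report above.
       */

      /* writer side, socket lock held (tcp_disconnect()) */
      WRITE_ONCE(sk->sk_shutdown, 0);

      /* lockless reader side (tcp_poll()) */
      if (READ_ONCE(sk->sk_shutdown) == SHUTDOWN_MASK ||
          sk->sk_state == TCP_CLOSE)
              mask |= EPOLLHUP;
      if (READ_ONCE(sk->sk_shutdown) & RCV_SHUTDOWN)
              mask |= EPOLLIN | EPOLLRDNORM | EPOLLRDHUP;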
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
2. 31 March 2023, 1 commit
    • tcp: Refine SYN handling for PAWS. · ee05d90d
      Committed by Kuniyuki Iwashima
      Our Network Load Balancer (NLB) [0] has multiple nodes with different
      IP addresses, and each node forwards TCP flows from clients to backend
      targets.  NLB has an option to preserve the client's source IP address
      and port when routing packets to backend targets. [1]
      
      When a client connects to two different NLB nodes, they may select the
      same backend target.  Then, if the client has used the same source IP
      and port, the two flows at the backend side will have the same 4-tuple.
      
      While testing around such cases, I saw these sequences on the backend
      target.
      
      IP 10.0.0.215.60000 > 10.0.3.249.10000: Flags [S], seq 2819965599, win 62727, options [mss 8365,sackOK,TS val 1029816180 ecr 0,nop,wscale 7], length 0
      IP 10.0.3.249.10000 > 10.0.0.215.60000: Flags [S.], seq 3040695044, ack 2819965600, win 62643, options [mss 8961,sackOK,TS val 1224784076 ecr 1029816180,nop,wscale 7], length 0
      IP 10.0.0.215.60000 > 10.0.3.249.10000: Flags [.], ack 1, win 491, options [nop,nop,TS val 1029816181 ecr 1224784076], length 0
      IP 10.0.0.215.60000 > 10.0.3.249.10000: Flags [S], seq 2681819307, win 62727, options [mss 8365,sackOK,TS val 572088282 ecr 0,nop,wscale 7], length 0
      IP 10.0.3.249.10000 > 10.0.0.215.60000: Flags [.], ack 1, win 490, options [nop,nop,TS val 1224794914 ecr 1029816181,nop,nop,sack 1 {4156821004:4156821005}], length 0
      
      It seems to be working correctly, but the last ACK was generated by
      tcp_send_dupack() and PAWSEstab was increased.  This is because the
      second connection has a smaller timestamp than the first one.
      
      In this case, we should send the dup ACK from tcp_send_challenge_ack()
      instead, so that the correct counter is incremented and the ACK is
      rate-limited properly.
      
      Let's check the SYN flag after the PAWS tests to avoid adding unnecessary
      overhead for most packets.
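      As a hedged sketch (close to the PAWS branch of
      tcp_validate_incoming(), but not guaranteed to match the exact
      upstream diff), the refined ordering looks like:

      if (tcp_paws_discard(sk, skb)) {
              if (!th->rst) {
                      /* colliding SYN: route it to the challenge-ACK
                       * path so the right counter is bumped and the
                       * dup ACK is rate-limited properly
                       */
                      if (unlikely(th->syn))
                              goto syn_challenge;
                      NET_INC_STATS(sock_net(sk),
                                    LINUX_MIB_PAWSESTABREJECTED);
                      if (!tcp_oow_rate_limited(sock_net(sk), skb,
                                                LINUX_MIB_TCPACKSKIPPEDPAWS,
                                                &tp->last_oow_ack_time))
                              tcp_send_dupack(sk, skb);
              }
              goto discard;
      }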
      
      Link: https://docs.aws.amazon.com/elasticloadbalancing/latest/network/introduction.html [0]
      Link: https://docs.aws.amazon.com/elasticloadbalancing/latest/network/load-balancer-target-groups.html#client-ip-preservation [1]
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
3. 18 March 2023, 1 commit
    • tcp: preserve const qualifier in tcp_sk() · e9d9da91
      Committed by Eric Dumazet
      We can change tcp_sk() to propagate the const qualifier of its
      argument, thanks to container_of_const().

      We have two places where a const sock pointer has to be upgraded
      to a writable one. We have been using the const qualifier for
      lockless listeners to clearly identify points where writes could happen.
      
      Add tcp_sk_rw() helper to better document these.
      
      tcp_inbound_md5_hash(), __tcp_grow_window(), tcp_reset_check()
      and tcp_rack_reo_wnd() get an additional const qualifier
      for their @tp local variables.
      
      smc_check_reset_syn_req() also needs a similar change.
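      For reference, a hedged sketch of the two helpers described above
      (member path as in current trees, modulo renames):

      #define tcp_sk(ptr) \
              container_of_const(ptr, struct tcp_sock, \
                                 inet_conn.icsk_inet.sk)

      /* Explicit upgrade point: a const (lockless listener) sock
       * that genuinely needs to be written to.
       */
      static inline struct tcp_sock *tcp_sk_rw(struct sock *sk)
      {
              return container_of(sk, struct tcp_sock,
                                  inet_conn.icsk_inet.sk);
      }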
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
4. 17 March 2023, 2 commits
5. 23 November 2022, 1 commit
6. 18 November 2022, 2 commits
7. 16 November 2022, 1 commit
8. 15 November 2022, 1 commit
9. 02 November 2022, 1 commit
    • tcp: refine tcp_prune_ofo_queue() logic · b0e01253
      Committed by Eric Dumazet
      After commits 36a6503f ("tcp: refine tcp_prune_ofo_queue()
      to not drop all packets") and 72cd43ba
      ("tcp: free batches of packets in tcp_prune_ofo_queue()"),
      tcp_prune_ofo_queue() drops a fraction of the ooo queue
      to make room for the incoming packet.

      However, it makes no sense to drop packets that come before
      the incoming packet in sequence space.

      To recover from packet losses faster, it makes more sense to
      drop only the ooo packets that come after the incoming packet.
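      A hedged sketch of the resulting pruning walk (helper names from
      the tree, but not the exact upstream diff); the packetdrill test
      below shows the observable behavior:

      /* Walk the ooo rb-tree from the highest sequence downward and
       * stop once the walk reaches data that sits before the incoming
       * packet (in_skb): dropping it would not help the new packet
       * fit and would only slow down recovery.
       */
      u32 start_seq = TCP_SKB_CB(in_skb)->seq;
      struct rb_node *node = rb_last(&tp->out_of_order_queue);

      while (node) {
              struct sk_buff *skb = rb_to_skb(node);
              struct rb_node *prev = rb_prev(node);

              if (before(TCP_SKB_CB(skb)->seq, start_seq))
                      break;          /* keep everything below in_skb */
              rb_erase(node, &tp->out_of_order_queue);
              kfree_skb(skb);         /* sketch: upstream also accounts
                                       * memory and tags a drop reason */
              node = prev;
      }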
      
      Tested:
      packetdrill test:
         0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
         +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
         +0 setsockopt(3, SOL_SOCKET, SO_RCVBUF, [3800], 4) = 0
         +0 bind(3, ..., ...) = 0
         +0 listen(3, 1) = 0
      
         +0 < S 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7>
         +0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 0>
        +.1 < . 1:1(0) ack 1 win 1024
         +0 accept(3, ..., ...) = 4
      
       +.01 < . 200:300(100) ack 1 win 1024
         +0 > . 1:1(0) ack 1 <nop,nop, sack 200:300>
      
       +.01 < . 400:500(100) ack 1 win 1024
         +0 > . 1:1(0) ack 1 <nop,nop, sack 400:500 200:300>
      
       +.01 < . 600:700(100) ack 1 win 1024
         +0 > . 1:1(0) ack 1 <nop,nop, sack 600:700 400:500 200:300>
      
       +.01 < . 800:900(100) ack 1 win 1024
         +0 > . 1:1(0) ack 1 <nop,nop, sack 800:900 600:700 400:500 200:300>
      
       +.01 < . 1000:1100(100) ack 1 win 1024
         +0 > . 1:1(0) ack 1 <nop,nop, sack 1000:1100 800:900 600:700 400:500>
      
       +.01 < . 1200:1300(100) ack 1 win 1024
         +0 > . 1:1(0) ack 1 <nop,nop, sack 1200:1300 1000:1100 800:900 600:700>
      
      // this packet is dropped because we have no room left.
       +.01 < . 1400:1500(100) ack 1 win 1024
      
       +.01 < . 1:200(199) ack 1 win 1024
      // Make sure kernel did not drop 200:300 sequence
         +0 > . 1:1(0) ack 300 <nop,nop, sack 1200:1300 1000:1100 800:900 600:700>
      // Make room, since our RCVBUF is very small
         +0 read(4, ..., 299) = 299
      
       +.01 < . 300:400(100) ack 1 win 1024
         +0 > . 1:1(0) ack 500 <nop,nop, sack 1200:1300 1000:1100 800:900 600:700>
      
       +.01 < . 500:600(100) ack 1 win 1024
         +0 > . 1:1(0) ack 700 <nop,nop, sack 1200:1300 1000:1100 800:900>
      
         +0 read(4, ..., 400) = 400
      
       +.01 < . 700:800(100) ack 1 win 1024
         +0 > . 1:1(0) ack 900 <nop,nop, sack 1200:1300 1000:1100>
      
       +.01 < . 900:1000(100) ack 1 win 1024
         +0 > . 1:1(0) ack 1100 <nop,nop, sack 1200:1300>
      
       +.01 < . 1100:1200(100) ack 1 win 1024
      // This checks that 1200:1300 has not been removed from ooo queue
         +0 > . 1:1(0) ack 1300
      Suggested-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Acked-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Link: https://lore.kernel.org/r/20221101035234.3910189-1-edumazet@google.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
10. 25 October 2022, 1 commit
    • tcp: fix indefinite deferral of RTO with SACK reneging · 3d2af9cc
      Committed by Neal Cardwell
      This commit fixes a bug that can cause a TCP data sender to repeatedly
      defer RTOs when encountering SACK reneging.
      
      The bug is that when we're in fast recovery in a scenario with SACK
      reneging, every time we get an ACK we call tcp_check_sack_reneging()
      and it can note the apparent SACK reneging and rearm the RTO timer for
      srtt/2 into the future. In some SACK reneging scenarios that can
      happen repeatedly until the receive window fills up, at which point
      the sender can't send any more, the ACKs stop arriving, and the RTO
      fires at srtt/2 after the last ACK. But that can take far too long
      (O(10 secs)), since the connection is stuck in fast recovery with a
      low cwnd that cannot grow beyond ssthresh, even if more bandwidth is
      available.
      
      This fix changes the logic in tcp_check_sack_reneging() to only rearm
      the RTO timer if data is cumulatively ACKed, indicating forward
      progress. This avoids this kind of nearly infinite loop of RTO timer
      re-arming. In addition, this preserves the goal of
      tcp_check_sack_reneging(): tolerating Windows TCP behavior that
      temporarily looks like SACK reneging but is not.
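      A hedged sketch of the tightened condition (flag names as in
      tcp_input.c; not guaranteed to be the exact diff):

      if ((flag & FLAG_SACK_RENEGING) &&
          (flag & FLAG_SND_UNA_ADVANCED)) {
              /* Only rearm the RTO for srtt/2 when this ACK moved
               * snd_una forward, so the re-arming cannot repeat
               * indefinitely without forward progress.
               */
              unsigned long delay = max(usecs_to_jiffies(tp->srtt_us >> 4),
                                        msecs_to_jiffies(10));

              inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
                                        delay, TCP_RTO_MAX);
      }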
      
      Many thanks to Jakub Kicinski and Neil Spring, who reported this issue
      and provided critical packet traces that enabled root-causing this
      issue. Also, many thanks to Jakub Kicinski for testing this fix.
      
      Fixes: 5ae344c9 ("tcp: reduce spurious retransmits due to transient SACK reneging")
      Reported-by: Jakub Kicinski <kuba@kernel.org>
      Reported-by: Neil Spring <ntspring@fb.com>
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Tested-by: Jakub Kicinski <kuba@kernel.org>
      Link: https://lore.kernel.org/r/20221021170821.1093930-1-ncardwell.kernel@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11. 06 September 2022, 1 commit
    • tcp: fix early ETIMEDOUT after spurious non-SACK RTO · 686dc2db
      Committed by Neal Cardwell
      Fix a bug reported and analyzed by Nagaraj Arankal, where the handling
      of a spurious non-SACK RTO could cause a connection to fail to clear
      retrans_stamp, causing a later RTO to very prematurely time out the
      connection with ETIMEDOUT.
      
      Here is the buggy scenario, expanding upon Nagaraj Arankal's excellent
      report:
      
      (*1) Send one data packet on a non-SACK connection
      
      (*2) Because no ACK packet is received, the packet is retransmitted
           and we enter CA_Loss; but this retransmission is spurious.
      
      (*3) The ACK for the original data is received. The transmitted packet
           is acknowledged.  The TCP timestamp is before the retrans_stamp,
           so tcp_may_undo() returns true, and tcp_try_undo_loss() returns
           true without changing state to Open (because tcp_is_sack() is
           false), and tcp_process_loss() returns without calling
           tcp_try_undo_recovery().  Normally after undoing a CA_Loss
           episode, tcp_fastretrans_alert() would see that the connection
           has returned to CA_Open and fall through and call
           tcp_try_to_open(), which would set retrans_stamp to 0.  However,
           for non-SACK connections we hold the connection in CA_Loss, so do
           not fall through to call tcp_try_to_open() and do not set
           retrans_stamp to 0. So retrans_stamp is (erroneously) still
           non-zero.
      
           At this point the first "retransmission event" has passed and
           been recovered from. Any future retransmission is a completely
           new "event". However, retrans_stamp is erroneously still
           set. (And we are still in CA_Loss, which is correct.)
      
      (*4) After 16 minutes (to correspond with tcp_retries2=15), a new data
           packet is sent. Note: no data is transmitted between (*3) and
           (*4), and keepalives are disabled.
      
           The socket's timeout SHOULD be calculated from this point in
           time, but instead it's calculated from the prior "event" 16
           minutes ago (step (*2)).
      
      (*5) Because no ACK packet is received, the packet is retransmitted.
      
      (*6) At the time of the 2nd retransmission, the socket returns
           ETIMEDOUT, prematurely, because retrans_stamp is (erroneously)
           too far in the past (set at the time of (*2)).
      
      This commit fixes this bug by ensuring that we reuse in
      tcp_try_undo_loss() the same careful logic for non-SACK connections
      that we have in tcp_try_undo_recovery(). To avoid duplicating logic,
      we factor out that logic into a new
      tcp_is_non_sack_preventing_reopen() helper and call that helper from
      both undo functions.
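      A hedged sketch of the factored-out helper (close to, but not
      guaranteed to match, the upstream code):

      static bool tcp_is_non_sack_preventing_reopen(struct sock *sk)
      {
              struct tcp_sock *tp = tcp_sk(sk);

              if (tp->snd_una == tp->high_seq && tcp_is_reno(tp)) {
                      /* Hold the old state until something *above*
                       * high_seq is ACKed (RFC 2582 caution against
                       * false fast retransmits), but forget the old
                       * "event" once no retransmitted data remains.
                       */
                      if (!tcp_any_retrans_done(sk))
                              tp->retrans_stamp = 0;
                      return true;
              }
              return false;
      }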
      
      Fixes: da34ac76 ("tcp: only undo on partial ACKs in CA_Loss")
      Reported-by: Nagaraj Arankal <nagaraj.p.arankal@hpe.com>
      Link: https://lore.kernel.org/all/SJ0PR84MB1847BE6C24D274C46A1B9B0EB27A9@SJ0PR84MB1847.NAMPRD84.PROD.OUTLOOK.COM/
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20220903121023.866900-1-ncardwell.kernel@gmail.com
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
12. 01 September 2022, 2 commits
13. 25 July 2022, 5 commits
14. 22 July 2022, 7 commits
15. 20 July 2022, 4 commits
16. 18 July 2022, 3 commits
17. 13 July 2022, 1 commit
18. 17 June 2022, 1 commit
19. 11 June 2022, 2 commits
20. 28 May 2022, 1 commit
    • tcp: fix tcp_mtup_probe_success vs wrong snd_cwnd · 11825765
      Committed by Eric Dumazet
      syzbot got a new report [1] finally pointing to a very old bug,
      added in the initial support for MTU probing.

      tcp_mtu_probe() only starts an MTU probe if tcp_snd_cwnd(tp) >= 11.

      But nothing prevents tcp_snd_cwnd(tp) from being reduced later,
      before the MTU probe succeeds.

      This bug could lead to zero-divides.

      Debugging added in commit 40570375 ("tcp: add accessors
      to read/set tp->snd_cwnd") has paid off :)

      While we are at it, address potential overflows in this code.
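      A hedged sketch of the corresponding tcp_mtup_probe_success()
      change (names from the tree; the exact diff may differ; icsk is
      inet_csk(sk) in that function):

      /* Scale cwnd by old_mss/new_mss with a 64-bit intermediate to
       * avoid overflow, and clamp to at least 1 so a cwnd that shrank
       * after the probe started cannot end up as 0 and cause a
       * zero-divide later.
       */
      u64 val = (u64)tcp_snd_cwnd(tp) * tcp_mss_to_mtu(sk, tp->mss_cache);

      do_div(val, icsk->icsk_mtup.probe_size);
      WARN_ON_ONCE((u32)val != val);  /* paranoia: must fit in u32 */
      tcp_snd_cwnd_set(tp, max_t(u32, 1U, val));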
      
      [1]
      WARNING: CPU: 1 PID: 14132 at include/net/tcp.h:1219 tcp_mtup_probe_success+0x366/0x570 net/ipv4/tcp_input.c:2712
      Modules linked in:
      CPU: 1 PID: 14132 Comm: syz-executor.2 Not tainted 5.18.0-syzkaller-07857-gbabf0bb9 #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      RIP: 0010:tcp_snd_cwnd_set include/net/tcp.h:1219 [inline]
      RIP: 0010:tcp_mtup_probe_success+0x366/0x570 net/ipv4/tcp_input.c:2712
      Code: 74 08 48 89 ef e8 da 80 17 f9 48 8b 45 00 65 48 ff 80 80 03 00 00 48 83 c4 30 5b 41 5c 41 5d 41 5e 41 5f 5d c3 e8 aa b0 c5 f8 <0f> 0b e9 16 fe ff ff 48 8b 4c 24 08 80 e1 07 38 c1 0f 8c c7 fc ff
      RSP: 0018:ffffc900079e70f8 EFLAGS: 00010287
      RAX: ffffffff88c0f7f6 RBX: ffff8880756e7a80 RCX: 0000000000040000
      RDX: ffffc9000c6c4000 RSI: 0000000000031f9e RDI: 0000000000031f9f
      RBP: 0000000000000000 R08: ffffffff88c0f606 R09: ffffc900079e7520
      R10: ffffed101011226d R11: 1ffff1101011226c R12: 1ffff1100eadcf50
      R13: ffff8880756e72c0 R14: 1ffff1100eadcf89 R15: dffffc0000000000
      FS:  00007f643236e700(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f1ab3f1e2a0 CR3: 0000000064fe7000 CR4: 00000000003506e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       <TASK>
       tcp_clean_rtx_queue+0x223a/0x2da0 net/ipv4/tcp_input.c:3356
       tcp_ack+0x1962/0x3c90 net/ipv4/tcp_input.c:3861
       tcp_rcv_established+0x7c8/0x1ac0 net/ipv4/tcp_input.c:5973
       tcp_v6_do_rcv+0x57b/0x1210 net/ipv6/tcp_ipv6.c:1476
       sk_backlog_rcv include/net/sock.h:1061 [inline]
       __release_sock+0x1d8/0x4c0 net/core/sock.c:2849
       release_sock+0x5d/0x1c0 net/core/sock.c:3404
       sk_stream_wait_memory+0x700/0xdc0 net/core/stream.c:145
       tcp_sendmsg_locked+0x111d/0x3fc0 net/ipv4/tcp.c:1410
       tcp_sendmsg+0x2c/0x40 net/ipv4/tcp.c:1448
       sock_sendmsg_nosec net/socket.c:714 [inline]
       sock_sendmsg net/socket.c:734 [inline]
       __sys_sendto+0x439/0x5c0 net/socket.c:2119
       __do_sys_sendto net/socket.c:2131 [inline]
       __se_sys_sendto net/socket.c:2127 [inline]
       __x64_sys_sendto+0xda/0xf0 net/socket.c:2127
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x2b/0x70 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x46/0xb0
      RIP: 0033:0x7f6431289109
      Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
      RSP: 002b:00007f643236e168 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
      RAX: ffffffffffffffda RBX: 00007f643139c100 RCX: 00007f6431289109
      RDX: 00000000d0d0c2ac RSI: 0000000020000080 RDI: 000000000000000a
      RBP: 00007f64312e308d R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000000
      R13: 00007fff372533af R14: 00007f643236e300 R15: 0000000000022000
      
      Fixes: 5d424d5a ("[TCP]: MTU probing")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
21. 20 May 2022, 1 commit