提交 · 59bf6c65a09fff74215517aecffbbdcd67df76e3 · openeuler / Kernel

25 7月, 2022 1 次提交

tcp: Fix data-races around sk_pacing_rate. · 59bf6c65

由 Kuniyuki Iwashima 提交于 7月 22, 2022

While reading sysctl_tcp_pacing_(ss|ca)_ratio, they can be changed
concurrently.  Thus, we need to add READ_ONCE() to their readers.

Fixes: 43e122b0 ("tcp: refine pacing rate determination")
Signed-off-by: NKuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

59bf6c65

22 7月, 2022 7 次提交

tcp: Fix a data-race around sysctl_tcp_invalid_ratelimit. · 2afdbe7b

由 Kuniyuki Iwashima 提交于 7月 20, 2022

While reading sysctl_tcp_invalid_ratelimit, it can be changed
concurrently. Thus, we need to add READ_ONCE() to its reader.

Fixes: 032ee423 ("tcp: helpers to mitigate ACK loops by rate-limiting out-of-window dupacks")
Signed-off-by: NKuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

2afdbe7b

tcp: Fix a data-race around sysctl_tcp_min_rtt_wlen. · 1330ffac

由 Kuniyuki Iwashima 提交于 7月 20, 2022

While reading sysctl_tcp_min_rtt_wlen, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its reader.

Fixes: f6722583 ("tcp: track min RTT using windowed min-filter")
Signed-off-by: NKuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

1330ffac

tcp: Fix a data-race around sysctl_tcp_challenge_ack_limit. · db3815a2

由 Kuniyuki Iwashima 提交于 7月 20, 2022

While reading sysctl_tcp_challenge_ack_limit, it can be changed
concurrently.  Thus, we need to add READ_ONCE() to its reader.

Fixes: 282f23c6 ("tcp: implement RFC 5961 3.2")
Signed-off-by: NKuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

db3815a2

tcp: Fix data-races around sysctl_tcp_moderate_rcvbuf. · 78047648

由 Kuniyuki Iwashima 提交于 7月 20, 2022

While reading sysctl_tcp_moderate_rcvbuf, it can be changed
concurrently.  Thus, we need to add READ_ONCE() to its readers.

Fixes: 1da177e4 ("Linux-2.6.12-rc2")
Signed-off-by: NKuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

78047648

tcp: Fix a data-race around sysctl_tcp_frto. · 706c6202

由 Kuniyuki Iwashima 提交于 7月 20, 2022

While reading sysctl_tcp_frto, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its reader.

Fixes: 1da177e4 ("Linux-2.6.12-rc2")
Signed-off-by: NKuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

706c6202

tcp: Fix a data-race around sysctl_tcp_app_win. · 02ca527a

由 Kuniyuki Iwashima 提交于 7月 20, 2022

While reading sysctl_tcp_app_win, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its reader.

Fixes: 1da177e4 ("Linux-2.6.12-rc2")
Signed-off-by: NKuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

02ca527a

tcp: Fix data-races around sysctl_tcp_dsack. · 58ebb1c8

由 Kuniyuki Iwashima 提交于 7月 20, 2022

While reading sysctl_tcp_dsack, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its readers.

Fixes: 1da177e4 ("Linux-2.6.12-rc2")
Signed-off-by: NKuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

58ebb1c8

20 7月, 2022 4 次提交

tcp: Fix data-races around sysctl_tcp_max_reordering. · a11e5b3e

由 Kuniyuki Iwashima 提交于 7月 18, 2022

While reading sysctl_tcp_max_reordering, it can be changed
concurrently.  Thus, we need to add READ_ONCE() to its readers.

Fixes: dca145ff ("tcp: allow for bigger reordering level")
Signed-off-by: NKuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

a11e5b3e

tcp: Fix a data-race around sysctl_tcp_stdurg. · 4e08ed41

由 Kuniyuki Iwashima 提交于 7月 18, 2022

While reading sysctl_tcp_stdurg, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its reader.

Fixes: 1da177e4 ("Linux-2.6.12-rc2")
Signed-off-by: NKuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

4e08ed41

tcp: Fix data-races around sysctl_tcp_recovery. · e7d2ef83

由 Kuniyuki Iwashima 提交于 7月 18, 2022

While reading sysctl_tcp_recovery, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its readers.

Fixes: 4f41b1c5 ("tcp: use RACK to detect losses")
Signed-off-by: NKuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

e7d2ef83

tcp: Fix data-races around sysctl knobs related to SYN option. · 3666f666

由 Kuniyuki Iwashima 提交于 7月 18, 2022

While reading these knobs, they can be changed concurrently.
Thus, we need to add READ_ONCE() to their readers.

  - tcp_sack
  - tcp_window_scaling
  - tcp_timestamps

Fixes: 1da177e4 ("Linux-2.6.12-rc2")
Signed-off-by: NKuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

3666f666

18 7月, 2022 3 次提交

tcp: Fix data-races around sysctl_max_syn_backlog. · 79539f34

由 Kuniyuki Iwashima 提交于 7月 15, 2022

While reading sysctl_max_syn_backlog, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its readers.

Fixes: 1da177e4 ("Linux-2.6.12-rc2")
Signed-off-by: NKuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

79539f34

tcp: Fix data-races around sysctl_tcp_reordering. · 46778cd1

由 Kuniyuki Iwashima 提交于 7月 15, 2022

While reading sysctl_tcp_reordering, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its readers.

Fixes: 1da177e4 ("Linux-2.6.12-rc2")
Signed-off-by: NKuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

46778cd1

tcp: Fix data-races around sysctl_tcp_syncookies. · f2e383b5

由 Kuniyuki Iwashima 提交于 7月 15, 2022

While reading sysctl_tcp_syncookies, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its readers.

Fixes: 1da177e4 ("Linux-2.6.12-rc2")
Signed-off-by: NKuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

f2e383b5

13 7月, 2022 1 次提交

tcp: Fix data-races around sysctl_tcp_ecn. · 4785a667

由 Kuniyuki Iwashima 提交于 7月 11, 2022

While reading sysctl_tcp_ecn, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its readers.

Fixes: 1da177e4 ("Linux-2.6.12-rc2")
Signed-off-by: NKuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

4785a667

28 5月, 2022 1 次提交

tcp: fix tcp_mtup_probe_success vs wrong snd_cwnd · 11825765

由 Eric Dumazet 提交于 5月 27, 2022

syzbot got a new report [1] finally pointing to a very old bug,
added in initial support for MTU probing.

tcp_mtu_probe() has checks about starting an MTU probe if
tcp_snd_cwnd(tp) >= 11.

But nothing prevents tcp_snd_cwnd(tp) to be reduced later
and before the MTU probe succeeds.

This bug would lead to potential zero-divides.

Debugging added in commit 40570375 ("tcp: add accessors
to read/set tp->snd_cwnd") has paid off :)

While we are at it, address potential overflows in this code.

[1]
WARNING: CPU: 1 PID: 14132 at include/net/tcp.h:1219 tcp_mtup_probe_success+0x366/0x570 net/ipv4/tcp_input.c:2712
Modules linked in:
CPU: 1 PID: 14132 Comm: syz-executor.2 Not tainted 5.18.0-syzkaller-07857-gbabf0bb9 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:tcp_snd_cwnd_set include/net/tcp.h:1219 [inline]
RIP: 0010:tcp_mtup_probe_success+0x366/0x570 net/ipv4/tcp_input.c:2712
Code: 74 08 48 89 ef e8 da 80 17 f9 48 8b 45 00 65 48 ff 80 80 03 00 00 48 83 c4 30 5b 41 5c 41 5d 41 5e 41 5f 5d c3 e8 aa b0 c5 f8 <0f> 0b e9 16 fe ff ff 48 8b 4c 24 08 80 e1 07 38 c1 0f 8c c7 fc ff
RSP: 0018:ffffc900079e70f8 EFLAGS: 00010287
RAX: ffffffff88c0f7f6 RBX: ffff8880756e7a80 RCX: 0000000000040000
RDX: ffffc9000c6c4000 RSI: 0000000000031f9e RDI: 0000000000031f9f
RBP: 0000000000000000 R08: ffffffff88c0f606 R09: ffffc900079e7520
R10: ffffed101011226d R11: 1ffff1101011226c R12: 1ffff1100eadcf50
R13: ffff8880756e72c0 R14: 1ffff1100eadcf89 R15: dffffc0000000000
FS:  00007f643236e700(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f1ab3f1e2a0 CR3: 0000000064fe7000 CR4: 00000000003506e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 tcp_clean_rtx_queue+0x223a/0x2da0 net/ipv4/tcp_input.c:3356
 tcp_ack+0x1962/0x3c90 net/ipv4/tcp_input.c:3861
 tcp_rcv_established+0x7c8/0x1ac0 net/ipv4/tcp_input.c:5973
 tcp_v6_do_rcv+0x57b/0x1210 net/ipv6/tcp_ipv6.c:1476
 sk_backlog_rcv include/net/sock.h:1061 [inline]
 __release_sock+0x1d8/0x4c0 net/core/sock.c:2849
 release_sock+0x5d/0x1c0 net/core/sock.c:3404
 sk_stream_wait_memory+0x700/0xdc0 net/core/stream.c:145
 tcp_sendmsg_locked+0x111d/0x3fc0 net/ipv4/tcp.c:1410
 tcp_sendmsg+0x2c/0x40 net/ipv4/tcp.c:1448
 sock_sendmsg_nosec net/socket.c:714 [inline]
 sock_sendmsg net/socket.c:734 [inline]
 __sys_sendto+0x439/0x5c0 net/socket.c:2119
 __do_sys_sendto net/socket.c:2131 [inline]
 __se_sys_sendto net/socket.c:2127 [inline]
 __x64_sys_sendto+0xda/0xf0 net/socket.c:2127
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x2b/0x70 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x46/0xb0
RIP: 0033:0x7f6431289109
Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f643236e168 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
RAX: ffffffffffffffda RBX: 00007f643139c100 RCX: 00007f6431289109
RDX: 00000000d0d0c2ac RSI: 0000000020000080 RDI: 000000000000000a
RBP: 00007f64312e308d R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000000
R13: 00007fff372533af R14: 00007f643236e300 R15: 0000000000022000

Fixes: 5d424d5a ("[TCP]: MTU probing")
Signed-off-by: NEric Dumazet <edumazet@google.com>
Reported-by: Nsyzbot <syzkaller@googlegroups.com>
Acked-by: NYuchung Cheng <ycheng@google.com>
Acked-by: NNeal Cardwell <ncardwell@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

11825765

20 5月, 2022 1 次提交

tcp: improve PRR loss recovery · 9ad084d6

由 Yuchung Cheng 提交于 5月 18, 2022

This patch improves TCP PRR loss recovery behavior for a corner
case. Previously during PRR conservation-bound mode, it strictly
sends the amount equals to the amount newly acked or s/acked.

The patch changes s.t. PRR may send additional amount that was banked
previously (e.g. application-limited) in the conservation-bound
mode, similar to the slow-start mode. This unifies and simplifies the
algorithm further and may improve the recovery latency. This change
still follow the general packet conservation design principle and
always keep inflight/cwnd below the slow start threshold set
by the congestion control module.

PRR is described in RFC 6937. We'll include this change in the
latest revision rfc6937-bis as well.
Reported-by: NNeal Cardwell <ncardwell@google.com>
Signed-off-by: NYuchung Cheng <ycheng@google.com>
Signed-off-by: NNeal Cardwell <ncardwell@google.com>
Signed-off-by: NEric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20220519003410.2531936-1-ycheng@google.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>

9ad084d6

30 4月, 2022 1 次提交

tcp: drop skb dst in tcp_rcv_established() · 783d108d

由 Eric Dumazet 提交于 4月 29, 2022

In commit f84af32c ("net: ip_queue_rcv_skb() helper")
I dropped the skb dst in tcp_data_queue().

This only dealt with so-called TCP input slow path.

When fast path is taken, tcp_rcv_established() calls
tcp_queue_rcv() while skb still has a dst.

This was mostly fine, because most dsts at this point
are not refcounted (thanks to early demux)

However, TCP packets sent over loopback have refcounted dst.

Then commit 68822bdf ("net: generalize skb freeing
deferral to per-cpu lists") came and had the effect
of delaying skb freeing for an arbitrary time.

If during this time the involved netns is dismantled, cleanup_net()
frees the struct net with embedded net->ipv6.ip6_dst_ops.

Then when eventually dst_destroy_rcu() is called,
if (dst->ops->destroy) ... triggers an use-after-free.

It is not clear if ip6_route_net_exit() lacks a rcu_barrier()
as syzbot reported similar issues before the blamed commit.

( https://groups.google.com/g/syzkaller-bugs/c/CofzW4eeA9A/m/009WjumTAAAJ )

Fixes: 68822bdf ("net: generalize skb freeing deferral to per-cpu lists")
Signed-off-by: NEric Dumazet <edumazet@google.com>
Acked-by: NNeal Cardwell <ncardwell@google.com>
Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

783d108d

29 4月, 2022 1 次提交

tcp: fix F-RTO may not work correctly when receiving DSACK · d9157f68

由 Pengcheng Yang 提交于 4月 26, 2022

Currently DSACK is regarded as a dupack, which may cause
F-RTO to incorrectly enter "loss was real" when receiving
DSACK.

Packetdrill to demonstrate:

// Enable F-RTO and TLP
    0 `sysctl -q net.ipv4.tcp_frto=2`
    0 `sysctl -q net.ipv4.tcp_early_retrans=3`
    0 `sysctl -q net.ipv4.tcp_congestion_control=cubic`

// Establish a connection
   +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
   +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
   +0 bind(3, ..., ...) = 0
   +0 listen(3, 1) = 0

// RTT 10ms, RTO 210ms
  +.1 < S 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7>
   +0 > S. 0:0(0) ack 1 <...>
 +.01 < . 1:1(0) ack 1 win 257
   +0 accept(3, ..., ...) = 4

// Send 2 data segments
   +0 write(4, ..., 2000) = 2000
   +0 > P. 1:2001(2000) ack 1

// TLP
+.022 > P. 1001:2001(1000) ack 1

// Continue to send 8 data segments
   +0 write(4, ..., 10000) = 10000
   +0 > P. 2001:10001(8000) ack 1

// RTO
+.188 > . 1:1001(1000) ack 1

// The original data is acked and new data is sent(F-RTO step 2.b)
   +0 < . 1:1(0) ack 2001 win 257
   +0 > P. 10001:12001(2000) ack 1

// D-SACK caused by TLP is regarded as a dupack, this results in
// the incorrect judgment of "loss was real"(F-RTO step 3.a)
+.022 < . 1:1(0) ack 2001 win 257 <sack 1001:2001,nop,nop>

// Never-retransmitted data(3001:4001) are acked and
// expect to switch to open state(F-RTO step 3.b)
   +0 < . 1:1(0) ack 4001 win 257
+0 %{ assert tcpi_ca_state == 0, tcpi_ca_state }%

Fixes: e33099f9 ("tcp: implement RFC5682 F-RTO")
Signed-off-by: NPengcheng Yang <yangpc@wangsu.com>
Acked-by: NNeal Cardwell <ncardwell@google.com>
Tested-by: NNeal Cardwell <ncardwell@google.com>
Reviewed-by: NEric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/1650967419-2150-1-git-send-email-yangpc@wangsu.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>

d9157f68

25 4月, 2022 1 次提交

tcp: fix potential xmit stalls caused by TCP_NOTSENT_LOWAT · 4bfe744f

由 Eric Dumazet 提交于 4月 24, 2022

I had this bug sitting for too long in my pile, it is time to fix it.

Thanks to Doug Porter for reminding me of it!

We had various attempts in the past, including commit
0cbe6a8f ("tcp: remove SOCK_QUEUE_SHRUNK"),
but the issue is that TCP stack currently only generates
EPOLLOUT from input path, when tp->snd_una has advanced
and skb(s) cleaned from rtx queue.

If a flow has a big RTT, and/or receives SACKs, it is possible
that the notsent part (tp->write_seq - tp->snd_nxt) reaches 0
and no more data can be sent until tp->snd_una finally advances.

What is needed is to also check if POLLOUT needs to be generated
whenever tp->snd_nxt is advanced, from output path.

This bug triggers more often after an idle period, as
we do not receive ACK for at least one RTT. tcp_notsent_lowat
could be a fraction of what CWND and pacing rate would allow to
send during this RTT.

In a followup patch, I will remove the bogus call
to tcp_chrono_stop(sk, TCP_CHRONO_SNDBUF_LIMITED)
from tcp_check_space(). Fact that we have decided to generate
an EPOLLOUT does not mean the application has immediately
refilled the transmit queue. This optimistic call
might have been the reason the bug seemed not too serious.

Tested:

200 ms rtt, 1% packet loss, 32 MB tcp_rmem[2] and tcp_wmem[2]

$ echo 500000 >/proc/sys/net/ipv4/tcp_notsent_lowat
$ cat bench_rr.sh
SUM=0
for i in {1..10}
do
 V=`netperf -H remote_host -l30 -t TCP_RR -- -r 10000000,10000 -o LOCAL_BYTES_SENT | egrep -v "MIGRATED|Bytes"`
 echo $V
 SUM=$(($SUM + $V))
done
echo SUM=$SUM

Before patch:
$ bench_rr.sh
130000000
80000000
140000000
140000000
140000000
140000000
130000000
40000000
90000000
110000000
SUM=1140000000

After patch:
$ bench_rr.sh
430000000
590000000
530000000
450000000
450000000
350000000
450000000
490000000
480000000
460000000
SUM=4680000000  # This is 410 % of the value before patch.

Fixes: c9bee3b7 ("tcp: TCP_NOTSENT_LOWAT socket option")
Signed-off-by: NEric Dumazet <edumazet@google.com>
Reported-by: NDoug Porter <dsp@fb.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

4bfe744f

18 4月, 2022 1 次提交

tcp: fix signed/unsigned comparison · 843f7740

由 Eric Dumazet 提交于 4月 17, 2022

Kernel test robot reported:

smatch warnings:
net/ipv4/tcp_input.c:5966 tcp_rcv_established() warn: unsigned 'reason' is never less than zero.

I actually had one packetdrill failing because of this bug,
and was about to send the fix :)

v2: Andreas Schwab also pointed out that @reason needs to be negated
    before we reach tcp_drop_reason()

Fixes: 4b506af9 ("tcp: add two drop reasons for tcp_ack()")
Signed-off-by: NEric Dumazet <edumazet@google.com>
Reported-by: Nkernel test robot <lkp@intel.com>
Reported-by: NAndreas Schwab <schwab@linux-m68k.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

843f7740

17 4月, 2022 10 次提交

tcp: add drop reason support to tcp_ofo_queue() · 8fbf1957