1. 11 10月, 2022 1 次提交
  2. 07 9月, 2022 12 次提交
  3. 08 7月, 2022 1 次提交
    • E
      tcp: fix tcp_mtup_probe_success vs wrong snd_cwnd · 9e2de13c
      Eric Dumazet 提交于
      stable inclusion
      from stable-4.19.247
      commit f2845e1504a3bc4f3381394f057e8b63cb5f3f7a
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I5FNPY
      CVE: NA
      
      --------------------------------
      
      commit 11825765 upstream.
      
      syzbot got a new report [1] finally pointing to a very old bug,
      added in initial support for MTU probing.
      
      tcp_mtu_probe() has checks about starting an MTU probe if
      tcp_snd_cwnd(tp) >= 11.
      
      But nothing prevents tcp_snd_cwnd(tp) to be reduced later
      and before the MTU probe succeeds.
      
      This bug would lead to potential zero-divides.
      
      Debugging added in commit 40570375 ("tcp: add accessors
      to read/set tp->snd_cwnd") has paid off :)
      
      While we are at it, address potential overflows in this code.
      
      [1]
      WARNING: CPU: 1 PID: 14132 at include/net/tcp.h:1219 tcp_mtup_probe_success+0x366/0x570 net/ipv4/tcp_input.c:2712
      Modules linked in:
      CPU: 1 PID: 14132 Comm: syz-executor.2 Not tainted 5.18.0-syzkaller-07857-gbabf0bb9 #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      RIP: 0010:tcp_snd_cwnd_set include/net/tcp.h:1219 [inline]
      RIP: 0010:tcp_mtup_probe_success+0x366/0x570 net/ipv4/tcp_input.c:2712
      Code: 74 08 48 89 ef e8 da 80 17 f9 48 8b 45 00 65 48 ff 80 80 03 00 00 48 83 c4 30 5b 41 5c 41 5d 41 5e 41 5f 5d c3 e8 aa b0 c5 f8 <0f> 0b e9 16 fe ff ff 48 8b 4c 24 08 80 e1 07 38 c1 0f 8c c7 fc ff
      RSP: 0018:ffffc900079e70f8 EFLAGS: 00010287
      RAX: ffffffff88c0f7f6 RBX: ffff8880756e7a80 RCX: 0000000000040000
      RDX: ffffc9000c6c4000 RSI: 0000000000031f9e RDI: 0000000000031f9f
      RBP: 0000000000000000 R08: ffffffff88c0f606 R09: ffffc900079e7520
      R10: ffffed101011226d R11: 1ffff1101011226c R12: 1ffff1100eadcf50
      R13: ffff8880756e72c0 R14: 1ffff1100eadcf89 R15: dffffc0000000000
      FS:  00007f643236e700(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f1ab3f1e2a0 CR3: 0000000064fe7000 CR4: 00000000003506e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       <TASK>
       tcp_clean_rtx_queue+0x223a/0x2da0 net/ipv4/tcp_input.c:3356
       tcp_ack+0x1962/0x3c90 net/ipv4/tcp_input.c:3861
       tcp_rcv_established+0x7c8/0x1ac0 net/ipv4/tcp_input.c:5973
       tcp_v6_do_rcv+0x57b/0x1210 net/ipv6/tcp_ipv6.c:1476
       sk_backlog_rcv include/net/sock.h:1061 [inline]
       __release_sock+0x1d8/0x4c0 net/core/sock.c:2849
       release_sock+0x5d/0x1c0 net/core/sock.c:3404
       sk_stream_wait_memory+0x700/0xdc0 net/core/stream.c:145
       tcp_sendmsg_locked+0x111d/0x3fc0 net/ipv4/tcp.c:1410
       tcp_sendmsg+0x2c/0x40 net/ipv4/tcp.c:1448
       sock_sendmsg_nosec net/socket.c:714 [inline]
       sock_sendmsg net/socket.c:734 [inline]
       __sys_sendto+0x439/0x5c0 net/socket.c:2119
       __do_sys_sendto net/socket.c:2131 [inline]
       __se_sys_sendto net/socket.c:2127 [inline]
       __x64_sys_sendto+0xda/0xf0 net/socket.c:2127
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x2b/0x70 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x46/0xb0
      RIP: 0033:0x7f6431289109
      Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
      RSP: 002b:00007f643236e168 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
      RAX: ffffffffffffffda RBX: 00007f643139c100 RCX: 00007f6431289109
      RDX: 00000000d0d0c2ac RSI: 0000000020000080 RDI: 000000000000000a
      RBP: 00007f64312e308d R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000000
      R13: 00007fff372533af R14: 00007f643236e300 R15: 0000000000022000
      
      Fixes: 5d424d5a ("[TCP]: MTU probing")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYongqiang Liu <liuyongqiang13@huawei.com>
      Signed-off-by: NLaibin Qiu <qiulaibin@huawei.com>
      9e2de13c
  4. 02 6月, 2022 1 次提交
    • E
      tcp: fix potential xmit stalls caused by TCP_NOTSENT_LOWAT · 6cba1671
      Eric Dumazet 提交于
      stable inclusion
      from stable-4.19.242
      commit cc639aa3c2f5ec7189a2917af49559006f678c62
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I5A6BA
      CVE: NA
      
      --------------------------------
      
      [ Upstream commit 4bfe744f ]
      
      I had this bug sitting for too long in my pile, it is time to fix it.
      
      Thanks to Doug Porter for reminding me of it!
      
      We had various attempts in the past, including commit
      0cbe6a8f ("tcp: remove SOCK_QUEUE_SHRUNK"),
      but the issue is that TCP stack currently only generates
      EPOLLOUT from input path, when tp->snd_una has advanced
      and skb(s) cleaned from rtx queue.
      
      If a flow has a big RTT, and/or receives SACKs, it is possible
      that the notsent part (tp->write_seq - tp->snd_nxt) reaches 0
      and no more data can be sent until tp->snd_una finally advances.
      
      What is needed is to also check if POLLOUT needs to be generated
      whenever tp->snd_nxt is advanced, from output path.
      
      This bug triggers more often after an idle period, as
      we do not receive ACK for at least one RTT. tcp_notsent_lowat
      could be a fraction of what CWND and pacing rate would allow to
      send during this RTT.
      
      In a followup patch, I will remove the bogus call
      to tcp_chrono_stop(sk, TCP_CHRONO_SNDBUF_LIMITED)
      from tcp_check_space(). Fact that we have decided to generate
      an EPOLLOUT does not mean the application has immediately
      refilled the transmit queue. This optimistic call
      might have been the reason the bug seemed not too serious.
      
      Tested:
      
      200 ms rtt, 1% packet loss, 32 MB tcp_rmem[2] and tcp_wmem[2]
      
      $ echo 500000 >/proc/sys/net/ipv4/tcp_notsent_lowat
      $ cat bench_rr.sh
      SUM=0
      for i in {1..10}
      do
       V=`netperf -H remote_host -l30 -t TCP_RR -- -r 10000000,10000 -o LOCAL_BYTES_SENT | egrep -v "MIGRATED|Bytes"`
       echo $V
       SUM=$(($SUM + $V))
      done
      echo SUM=$SUM
      
      Before patch:
      $ bench_rr.sh
      130000000
      80000000
      140000000
      140000000
      140000000
      140000000
      130000000
      40000000
      90000000
      110000000
      SUM=1140000000
      
      After patch:
      $ bench_rr.sh
      430000000
      590000000
      530000000
      450000000
      450000000
      350000000
      450000000
      490000000
      480000000
      460000000
      SUM=4680000000  # This is 410 % of the value before patch.
      
      Fixes: c9bee3b7 ("tcp: TCP_NOTSENT_LOWAT socket option")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NDoug Porter <dsp@fb.com>
      Cc: Soheil Hassas Yeganeh <soheil@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      Signed-off-by: NYongqiang Liu <liuyongqiang13@huawei.com>
      6cba1671
  5. 08 12月, 2021 1 次提交
  6. 12 11月, 2021 1 次提交
  7. 02 8月, 2021 1 次提交
  8. 14 4月, 2021 1 次提交
  9. 11 3月, 2021 1 次提交
  10. 22 2月, 2021 2 次提交
  11. 04 11月, 2020 1 次提交
  12. 22 9月, 2020 7 次提交
    • Y
      tcp: allow at most one TLP probe per flight · e9dfbe8f
      Yuchung Cheng 提交于
      stable inclusion
      from linux-4.19.136
      commit 55c73db2995826dcd63b08605c53e0e37c80b306
      
      --------------------------------
      
      [ Upstream commit 76be93fc ]
      
      Previously TLP may send multiple probes of new data in one
      flight. This happens when the sender is cwnd limited. After the
      initial TLP containing new data is sent, the sender receives another
      ACK that acks partial inflight.  It may re-arm another TLP timer
      to send more, if no further ACK returns before the next TLP timeout
      (PTO) expires. The sender may send in theory a large amount of TLP
      until send queue is depleted. This only happens if the sender sees
      such irregular uncommon ACK pattern. But it is generally undesirable
      behavior during congestion especially.
      
      The original TLP design restrict only one TLP probe per inflight as
      published in "Reducing Web Latency: the Virtue of Gentle Aggression",
      SIGCOMM 2013. This patch changes TLP to send at most one probe
      per inflight.
      
      Note that if the sender is app-limited, TLP retransmits old data
      and did not have this issue.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: NLi Aichun <liaichun@huawei.com>
      Reviewed-by: Nguodeqing <geffrey.guo@huawei.com>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      e9dfbe8f
    • E
      tcp: fix SO_RCVLOWAT possible hangs under high mem pressure · 10f344e7
      Eric Dumazet 提交于
      stable inclusion
      from linux-4.19.134
      commit 721e5f54af0dbe1cdee6ce3d5fc8b6abe36f3e43
      
      --------------------------------
      
      [ Upstream commit ba3bb0e7 ]
      
      Whenever tcp_try_rmem_schedule() returns an error, we are under
      trouble and should make sure to wakeup readers so that they
      can drain socket queues and eventually make room.
      
      Fixes: 03f45c88 ("tcp: avoid extra wakeups for SO_RCVLOWAT users")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: NLi Aichun <liaichun@huawei.com>
      Reviewed-by: Nguodeqing <geffrey.guo@huawei.com>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      10f344e7
    • E
      tcp: grow window for OOO packets only for SACK flows · 3cfedd54
      Eric Dumazet 提交于
      stable inclusion
      from linux-4.19.131
      commit bcf95ac62cb51d5e24be2ccaeab5fbf64da7f582
      
      --------------------------------
      
      [ Upstream commit 66205121 ]
      
      Back in 2013, we made a change that broke fast retransmit
      for non SACK flows.
      
      Indeed, for these flows, a sender needs to receive three duplicate
      ACK before starting fast retransmit. Sending ACK with different
      receive window do not count.
      
      Even if enabling SACK is strongly recommended these days,
      there still are some cases where it has to be disabled.
      
      Not increasing the window seems better than having to
      rely on RTO.
      
      After the fix, following packetdrill test gives :
      
      // Initialize connection
          0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
         +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
         +0 bind(3, ..., ...) = 0
         +0 listen(3, 1) = 0
      
         +0 < S 0:0(0) win 32792 <mss 1000,nop,wscale 7>
         +0 > S. 0:0(0) ack 1 <mss 1460,nop,wscale 8>
         +0 < . 1:1(0) ack 1 win 514
      
         +0 accept(3, ..., ...) = 4
      
         +0 < . 1:1001(1000) ack 1 win 514
      // Quick ack
         +0 > . 1:1(0) ack 1001 win 264
      
         +0 < . 2001:3001(1000) ack 1 win 514
      // DUPACK : Normally we should not change the window
         +0 > . 1:1(0) ack 1001 win 264
      
         +0 < . 3001:4001(1000) ack 1 win 514
      // DUPACK : Normally we should not change the window
         +0 > . 1:1(0) ack 1001 win 264
      
         +0 < . 4001:5001(1000) ack 1 win 514
      // DUPACK : Normally we should not change the window
          +0 > . 1:1(0) ack 1001 win 264
      
         +0 < . 1001:2001(1000) ack 1 win 514
      // Hole is repaired.
         +0 > . 1:1(0) ack 5001 win 272
      
      Fixes: 4e4f1fc2 ("tcp: properly increase rcv_ssthresh for ofo packets")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NVenkat Venkatsubra <venkat.x.venkatsubra@oracle.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: NLi Aichun <liaichun@huawei.com>
      Reviewed-by: Nguodeqing <geffrey.guo@huawei.com>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      3cfedd54
    • D
      tcp: don't ignore ECN CWR on pure ACK · 5bb960d3
      Denis Kirjanov 提交于
      stable inclusion
      from linux-4.19.131
      commit a4f8b2682d5aee896b17ca617df8fbbdab9af38c
      
      --------------------------------
      
      [ Upstream commit 25702840 ]
      
      there is a problem with the CWR flag set in an incoming ACK segment
      and it leads to the situation when the ECE flag is latched forever
      
      the following packetdrill script shows what happens:
      
      // Stack receives incoming segments with CE set
      +0.1 <[ect0]  . 11001:12001(1000) ack 1001 win 65535
      +0.0 <[ce]    . 12001:13001(1000) ack 1001 win 65535
      +0.0 <[ect0] P. 13001:14001(1000) ack 1001 win 65535
      
      // Stack repsonds with ECN ECHO
      +0.0 >[noecn]  . 1001:1001(0) ack 12001
      +0.0 >[noecn] E. 1001:1001(0) ack 13001
      +0.0 >[noecn] E. 1001:1001(0) ack 14001
      
      // Write a packet
      +0.1 write(3, ..., 1000) = 1000
      +0.0 >[ect0] PE. 1001:2001(1000) ack 14001
      
      // Pure ACK received
      +0.01 <[noecn] W. 14001:14001(0) ack 2001 win 65535
      
      // Since CWR was sent, this packet should NOT have ECE set
      
      +0.1 write(3, ..., 1000) = 1000
      +0.0 >[ect0]  P. 2001:3001(1000) ack 14001
      // but Linux will still keep ECE latched here, with packetdrill
      // flagging a missing ECE flag, expecting
      // >[ect0] PE. 2001:3001(1000) ack 14001
      // in the script
      
      In the situation above we will continue to send ECN ECHO packets
      and trigger the peer to reduce the congestion window. To avoid that
      we can check CWR on pure ACKs received.
      
      v3:
      - Add a sequence check to avoid sending an ACK to an ACK
      
      v2:
      - Adjusted the comment
      - move CWR check before checking for unacknowledged packets
      Signed-off-by: NDenis Kirjanov <denis.kirjanov@suse.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: NLi Aichun <liaichun@huawei.com>
      Reviewed-by: Nguodeqing <geffrey.guo@huawei.com>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      5bb960d3
    • E
      tcp: fix SO_RCVLOWAT hangs with fat skbs · 62825541
      Eric Dumazet 提交于
      stable inclusion
      from linux-4.19.124
      commit 99779c24bf4c2b56129df5891902eb4e9a62f275
      
      --------------------------------
      
      [ Upstream commit 24adbc16 ]
      
      We autotune rcvbuf whenever SO_RCVLOWAT is set to account for 100%
      overhead in tcp_set_rcvlowat()
      
      This works well when skb->len/skb->truesize ratio is bigger than 0.5
      
      But if we receive packets with small MSS, we can end up in a situation
      where not enough bytes are available in the receive queue to satisfy
      RCVLOWAT setting.
      As our sk_rcvbuf limit is hit, we send zero windows in ACK packets,
      preventing remote peer from sending more data.
      
      Even autotuning does not help, because it only triggers at the time
      user process drains the queue. If no EPOLLIN is generated, this
      can not happen.
      
      Note poll() has a similar issue, after commit
      c7004482 ("tcp: Respect SO_RCVLOWAT in tcp_poll().")
      
      Fixes: 03f45c88 ("tcp: avoid extra wakeups for SO_RCVLOWAT users")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: NLi Aichun <liaichun@huawei.com>
      Reviewed-by: Nguodeqing <geffrey.guo@huawei.com>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      62825541
    • E
      tcp: do not leave dangling pointers in tp->highest_sack · 72f11987
      Eric Dumazet 提交于
      stable inclusion
      from linux-4.19.100
      commit 9bbde0825846002c6931f41fbbd71eeb848ca0e1
      
      --------------------------------
      
      [ Upstream commit 2bec445f ]
      
      Latest commit 85369750 ("tcp: Fix highest_sack and highest_sack_seq")
      apparently allowed syzbot to trigger various crashes in TCP stack [1]
      
      I believe this commit only made things easier for syzbot to find
      its way into triggering use-after-frees. But really the bugs
      could lead to bad TCP behavior or even plain crashes even for
      non malicious peers.
      
      I have audited all calls to tcp_rtx_queue_unlink() and
      tcp_rtx_queue_unlink_and_free() and made sure tp->highest_sack would be updated
      if we are removing from rtx queue the skb that tp->highest_sack points to.
      
      These updates were missing in three locations :
      
      1) tcp_clean_rtx_queue() [This one seems quite serious,
                                I have no idea why this was not caught earlier]
      
      2) tcp_rtx_queue_purge() [Probably not a big deal for normal operations]
      
      3) tcp_send_synack()     [Probably not a big deal for normal operations]
      
      [1]
      BUG: KASAN: use-after-free in tcp_highest_sack_seq include/net/tcp.h:1864 [inline]
      BUG: KASAN: use-after-free in tcp_highest_sack_seq include/net/tcp.h:1856 [inline]
      BUG: KASAN: use-after-free in tcp_check_sack_reordering+0x33c/0x3a0 net/ipv4/tcp_input.c:891
      Read of size 4 at addr ffff8880a488d068 by task ksoftirqd/1/16
      
      CPU: 1 PID: 16 Comm: ksoftirqd/1 Not tainted 5.5.0-rc5-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x197/0x210 lib/dump_stack.c:118
       print_address_description.constprop.0.cold+0xd4/0x30b mm/kasan/report.c:374
       __kasan_report.cold+0x1b/0x41 mm/kasan/report.c:506
       kasan_report+0x12/0x20 mm/kasan/common.c:639
       __asan_report_load4_noabort+0x14/0x20 mm/kasan/generic_report.c:134
       tcp_highest_sack_seq include/net/tcp.h:1864 [inline]
       tcp_highest_sack_seq include/net/tcp.h:1856 [inline]
       tcp_check_sack_reordering+0x33c/0x3a0 net/ipv4/tcp_input.c:891
       tcp_try_undo_partial net/ipv4/tcp_input.c:2730 [inline]
       tcp_fastretrans_alert+0xf74/0x23f0 net/ipv4/tcp_input.c:2847
       tcp_ack+0x2577/0x5bf0 net/ipv4/tcp_input.c:3710
       tcp_rcv_established+0x6dd/0x1e90 net/ipv4/tcp_input.c:5706
       tcp_v4_do_rcv+0x619/0x8d0 net/ipv4/tcp_ipv4.c:1619
       tcp_v4_rcv+0x307f/0x3b40 net/ipv4/tcp_ipv4.c:2001
       ip_protocol_deliver_rcu+0x5a/0x880 net/ipv4/ip_input.c:204
       ip_local_deliver_finish+0x23b/0x380 net/ipv4/ip_input.c:231
       NF_HOOK include/linux/netfilter.h:307 [inline]
       NF_HOOK include/linux/netfilter.h:301 [inline]
       ip_local_deliver+0x1e9/0x520 net/ipv4/ip_input.c:252
       dst_input include/net/dst.h:442 [inline]
       ip_rcv_finish+0x1db/0x2f0 net/ipv4/ip_input.c:428
       NF_HOOK include/linux/netfilter.h:307 [inline]
       NF_HOOK include/linux/netfilter.h:301 [inline]
       ip_rcv+0xe8/0x3f0 net/ipv4/ip_input.c:538
       __netif_receive_skb_one_core+0x113/0x1a0 net/core/dev.c:5148
       __netif_receive_skb+0x2c/0x1d0 net/core/dev.c:5262
       process_backlog+0x206/0x750 net/core/dev.c:6093
       napi_poll net/core/dev.c:6530 [inline]
       net_rx_action+0x508/0x1120 net/core/dev.c:6598
       __do_softirq+0x262/0x98c kernel/softirq.c:292
       run_ksoftirqd kernel/softirq.c:603 [inline]
       run_ksoftirqd+0x8e/0x110 kernel/softirq.c:595
       smpboot_thread_fn+0x6a3/0xa40 kernel/smpboot.c:165
       kthread+0x361/0x430 kernel/kthread.c:255
       ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352
      
      Allocated by task 10091:
       save_stack+0x23/0x90 mm/kasan/common.c:72
       set_track mm/kasan/common.c:80 [inline]
       __kasan_kmalloc mm/kasan/common.c:513 [inline]
       __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:486
       kasan_slab_alloc+0xf/0x20 mm/kasan/common.c:521
       slab_post_alloc_hook mm/slab.h:584 [inline]
       slab_alloc_node mm/slab.c:3263 [inline]
       kmem_cache_alloc_node+0x138/0x740 mm/slab.c:3575
       __alloc_skb+0xd5/0x5e0 net/core/skbuff.c:198
       alloc_skb_fclone include/linux/skbuff.h:1099 [inline]
       sk_stream_alloc_skb net/ipv4/tcp.c:875 [inline]
       sk_stream_alloc_skb+0x113/0xc90 net/ipv4/tcp.c:852
       tcp_sendmsg_locked+0xcf9/0x3470 net/ipv4/tcp.c:1282
       tcp_sendmsg+0x30/0x50 net/ipv4/tcp.c:1432
       inet_sendmsg+0x9e/0xe0 net/ipv4/af_inet.c:807
       sock_sendmsg_nosec net/socket.c:652 [inline]
       sock_sendmsg+0xd7/0x130 net/socket.c:672
       __sys_sendto+0x262/0x380 net/socket.c:1998
       __do_sys_sendto net/socket.c:2010 [inline]
       __se_sys_sendto net/socket.c:2006 [inline]
       __x64_sys_sendto+0xe1/0x1a0 net/socket.c:2006
       do_syscall_64+0xfa/0x790 arch/x86/entry/common.c:294
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Freed by task 10095:
       save_stack+0x23/0x90 mm/kasan/common.c:72
       set_track mm/kasan/common.c:80 [inline]
       kasan_set_free_info mm/kasan/common.c:335 [inline]
       __kasan_slab_free+0x102/0x150 mm/kasan/common.c:474
       kasan_slab_free+0xe/0x10 mm/kasan/common.c:483
       __cache_free mm/slab.c:3426 [inline]
       kmem_cache_free+0x86/0x320 mm/slab.c:3694
       kfree_skbmem+0x178/0x1c0 net/core/skbuff.c:645
       __kfree_skb+0x1e/0x30 net/core/skbuff.c:681
       sk_eat_skb include/net/sock.h:2453 [inline]
       tcp_recvmsg+0x1252/0x2930 net/ipv4/tcp.c:2166
       inet_recvmsg+0x136/0x610 net/ipv4/af_inet.c:838
       sock_recvmsg_nosec net/socket.c:886 [inline]
       sock_recvmsg net/socket.c:904 [inline]
       sock_recvmsg+0xce/0x110 net/socket.c:900
       __sys_recvfrom+0x1ff/0x350 net/socket.c:2055
       __do_sys_recvfrom net/socket.c:2073 [inline]
       __se_sys_recvfrom net/socket.c:2069 [inline]
       __x64_sys_recvfrom+0xe1/0x1a0 net/socket.c:2069
       do_syscall_64+0xfa/0x790 arch/x86/entry/common.c:294
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      The buggy address belongs to the object at ffff8880a488d040
       which belongs to the cache skbuff_fclone_cache of size 456
      The buggy address is located 40 bytes inside of
       456-byte region [ffff8880a488d040, ffff8880a488d208)
      The buggy address belongs to the page:
      page:ffffea0002922340 refcount:1 mapcount:0 mapping:ffff88821b057000 index:0x0
      raw: 00fffe0000000200 ffffea00022a5788 ffffea0002624a48 ffff88821b057000
      raw: 0000000000000000 ffff8880a488d040 0000000100000006 0000000000000000
      page dumped because: kasan: bad access detected
      
      Memory state around the buggy address:
       ffff8880a488cf00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
       ffff8880a488cf80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      >ffff8880a488d000: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
                                                                ^
       ffff8880a488d080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
       ffff8880a488d100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      
      Fixes: 85369750 ("tcp: Fix highest_sack and highest_sack_seq")
      Fixes: 50895b9d ("tcp: highest_sack fix")
      Fixes: 737ff314 ("tcp: use sequence distance to detect reordering")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Cambda Zhu <cambda@linux.alibaba.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: NLi Aichun <liaichun@huawei.com>
      Reviewed-by: Nguodeqing <geffrey.guo@huawei.com>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      72f11987
    • P
      tcp: fix marked lost packets not being retransmitted · 2e139c9c
      Pengcheng Yang 提交于
      stable inclusion
      from linux-4.19.98
      commit 34e855f998f76169e685c7e3c790b0ee0eed2a75
      
      --------------------------------
      
      [ Upstream commit e176b1ba ]
      
      When the packet pointed to by retransmit_skb_hint is unlinked by ACK,
      retransmit_skb_hint will be set to NULL in tcp_clean_rtx_queue().
      If packet loss is detected at this time, retransmit_skb_hint will be set
      to point to the current packet loss in tcp_verify_retransmit_hint(),
      then the packets that were previously marked lost but not retransmitted
      due to the restriction of cwnd will be skipped and cannot be
      retransmitted.
      
      To fix this, when retransmit_skb_hint is NULL, retransmit_skb_hint can
      be reset only after all marked lost packets are retransmitted
      (retrans_out >= lost_out), otherwise we need to traverse from
      tcp_rtx_queue_head in tcp_xmit_retransmit_queue().
      
      Packetdrill to demonstrate:
      
      // Disable RACK and set max_reordering to keep things simple
          0 `sysctl -q net.ipv4.tcp_recovery=0`
         +0 `sysctl -q net.ipv4.tcp_max_reordering=3`
      
      // Establish a connection
         +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
         +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
         +0 bind(3, ..., ...) = 0
         +0 listen(3, 1) = 0
      
        +.1 < S 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7>
         +0 > S. 0:0(0) ack 1 <...>
       +.01 < . 1:1(0) ack 1 win 257
         +0 accept(3, ..., ...) = 4
      
      // Send 8 data segments
         +0 write(4, ..., 8000) = 8000
         +0 > P. 1:8001(8000) ack 1
      
      // Enter recovery and 1:3001 is marked lost
       +.01 < . 1:1(0) ack 1 win 257 <sack 3001:4001,nop,nop>
         +0 < . 1:1(0) ack 1 win 257 <sack 5001:6001 3001:4001,nop,nop>
         +0 < . 1:1(0) ack 1 win 257 <sack 5001:7001 3001:4001,nop,nop>
      
      // Retransmit 1:1001, now retransmit_skb_hint points to 1001:2001
         +0 > . 1:1001(1000) ack 1
      
      // 1001:2001 was ACKed causing retransmit_skb_hint to be set to NULL
       +.01 < . 1:1(0) ack 2001 win 257 <sack 5001:8001 3001:4001,nop,nop>
      // Now retransmit_skb_hint points to 4001:5001 which is now marked lost
      
      // BUG: 2001:3001 was not retransmitted
         +0 > . 2001:3001(1000) ack 1
      Signed-off-by: NPengcheng Yang <yangpc@wangsu.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Tested-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: NLi Aichun <liaichun@huawei.com>
      Reviewed-by: Nguodeqing <geffrey.guo@huawei.com>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      2e139c9c
  13. 05 3月, 2020 1 次提交
  14. 27 12月, 2019 8 次提交
    • E
      tcp: annotate tp->rcv_nxt lockless reads · 8aa429d6
      Eric Dumazet 提交于
      mainline inclusion
      from mainline-v5.4-rc4
      commit dba7d9b8
      category: bugfix
      bugzilla: 24157
      CVE: NA
      
      -------------------------------------
      
      There are few places where we fetch tp->rcv_nxt while
      this field can change from IRQ or other cpu.
      
      We need to add READ_ONCE() annotations, and also make
      sure write sides use corresponding WRITE_ONCE() to avoid
      store-tearing.
      
      Note that tcp_inq_hint() was already using READ_ONCE(tp->rcv_nxt)
      
      syzbot reported :
      
      BUG: KCSAN: data-race in tcp_poll / tcp_queue_rcv
      
      write to 0xffff888120425770 of 4 bytes by interrupt on cpu 0:
       tcp_rcv_nxt_update net/ipv4/tcp_input.c:3365 [inline]
       tcp_queue_rcv+0x180/0x380 net/ipv4/tcp_input.c:4638
       tcp_rcv_established+0xbf1/0xf50 net/ipv4/tcp_input.c:5616
       tcp_v4_do_rcv+0x381/0x4e0 net/ipv4/tcp_ipv4.c:1542
       tcp_v4_rcv+0x1a03/0x1bf0 net/ipv4/tcp_ipv4.c:1923
       ip_protocol_deliver_rcu+0x51/0x470 net/ipv4/ip_input.c:204
       ip_local_deliver_finish+0x110/0x140 net/ipv4/ip_input.c:231
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip_local_deliver+0x133/0x210 net/ipv4/ip_input.c:252
       dst_input include/net/dst.h:442 [inline]
       ip_rcv_finish+0x121/0x160 net/ipv4/ip_input.c:413
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip_rcv+0x18f/0x1a0 net/ipv4/ip_input.c:523
       __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5004
       __netif_receive_skb+0x37/0xf0 net/core/dev.c:5118
       netif_receive_skb_internal+0x59/0x190 net/core/dev.c:5208
       napi_skb_finish net/core/dev.c:5671 [inline]
       napi_gro_receive+0x28f/0x330 net/core/dev.c:5704
       receive_buf+0x284/0x30b0 drivers/net/virtio_net.c:1061
      
      read to 0xffff888120425770 of 4 bytes by task 7254 on cpu 1:
       tcp_stream_is_readable net/ipv4/tcp.c:480 [inline]
       tcp_poll+0x204/0x6b0 net/ipv4/tcp.c:554
       sock_poll+0xed/0x250 net/socket.c:1256
       vfs_poll include/linux/poll.h:90 [inline]
       ep_item_poll.isra.0+0x90/0x190 fs/eventpoll.c:892
       ep_send_events_proc+0x113/0x5c0 fs/eventpoll.c:1749
       ep_scan_ready_list.constprop.0+0x189/0x500 fs/eventpoll.c:704
       ep_send_events fs/eventpoll.c:1793 [inline]
       ep_poll+0xe3/0x900 fs/eventpoll.c:1930
       do_epoll_wait+0x162/0x180 fs/eventpoll.c:2294
       __do_sys_epoll_pwait fs/eventpoll.c:2325 [inline]
       __se_sys_epoll_pwait fs/eventpoll.c:2311 [inline]
       __x64_sys_epoll_pwait+0xcd/0x170 fs/eventpoll.c:2311
       do_syscall_64+0xcf/0x2f0 arch/x86/entry/common.c:296
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 1 PID: 7254 Comm: syz-fuzzer Not tainted 5.3.0+ #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: Nguodeqing <geffrey.guo@huawei.com>
      Reviewed-by: NWenan Mao <maowenan@huawei.com>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      8aa429d6
    • Y
      tcp: start receiver buffer autotuning sooner · ec0bd6ee
      Yuchung Cheng 提交于
      [ Upstream commit 041a14d2671573611ffd6412bc16e2f64469f7fb ]
      
      Previously receiver buffer auto-tuning starts after receiving
      one advertised window amount of data. After the initial receiver
      buffer was raised by patch a337531b942b ("tcp: up initial rmem to
      128KB and SYN rwin to around 64KB"), the reciver buffer may take
      too long to start raising. To address this issue, this patch lowers
      the initial bytes expected to receive roughly the expected sender's
      initial window.
      
      Fixes: a337531b942b ("tcp: up initial rmem to 128KB and SYN rwin to around 64KB")
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NWei Wang <weiwan@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reviewed-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      ec0bd6ee
    • Y
      tcp: up initial rmem to 128KB and SYN rwin to around 64KB · 3c975d25
      Yuchung Cheng 提交于
      [ Upstream commit a337531b ]
      
      Previously TCP initial receive buffer is ~87KB by default and
      the initial receive window is ~29KB (20 MSS). This patch changes
      the two numbers to 128KB and ~64KB (rounding down to the multiples
      of MSS) respectively. The patch also simplifies the calculations s.t.
      the two numbers are directly controlled by sysctl tcp_rmem[1]:
      
        1) Initial receiver buffer budget (sk_rcvbuf): while this should
           be configured via sysctl tcp_rmem[1], previously tcp_fixup_rcvbuf()
           always override and set a larger size when a new connection
           establishes.
      
        2) Initial receive window in SYN: previously it is set to 20
           packets if MSS <= 1460. The number 20 was based on the initial
           congestion window of 10: the receiver needs twice amount to
           avoid being limited by the receive window upon out-of-order
           delivery in the first window burst. But since this only
           applies if the receiving MSS <= 1460, connection using large MTU
           (e.g. to utilize receiver zero-copy) may be limited by the
           receive window.
      
      With this patch TCP memory configuration is more straight-forward and
      more properly sized to modern high-speed networks by default. Several
      popular stacks have been announcing 64KB rwin in SYNs as well.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NWei Wang <weiwan@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reviewed-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      3c975d25
    • N
      tcp: fix tcp_ecn_withdraw_cwr() to clear TCP_ECN_QUEUE_CWR · f4c84360
      Neal Cardwell 提交于
      [ Upstream commit af38d07ed391b21f7405fa1f936ca9686787d6d2 ]
      
      Fix tcp_ecn_withdraw_cwr() to clear the correct bit:
      TCP_ECN_QUEUE_CWR.
      
      Rationale: basically, TCP_ECN_DEMAND_CWR is a bit that is purely about
      the behavior of data receivers, and deciding whether to reflect
      incoming IP ECN CE marks as outgoing TCP th->ece marks. The
      TCP_ECN_QUEUE_CWR bit is purely about the behavior of data senders,
      and deciding whether to send CWR. The tcp_ecn_withdraw_cwr() function
      is only called from tcp_undo_cwnd_reduction() by data senders during
      an undo, so it should zero the sender-side state,
      TCP_ECN_QUEUE_CWR. It does not make sense to stop the reflection of
      incoming CE bits on incoming data packets just because outgoing
      packets were spuriously retransmitted.
      
      The bug has been reproduced with packetdrill to manifest in a scenario
      with RFC3168 ECN, with an incoming data packet with CE bit set and
      carrying a TCP timestamp value that causes cwnd undo. Before this fix,
      the IP CE bit was ignored and not reflected in the TCP ECE header bit,
      and sender sent a TCP CWR ('W') bit on the next outgoing data packet,
      even though the cwnd reduction had been undone.  After this fix, the
      sender properly reflects the CE bit and does not set the W bit.
      
      Note: the bug actually predates 2005 git history; this Fixes footer is
      chosen to be the oldest SHA1 I have tested (from Sep 2007) for which
      the patch applies cleanly (since before this commit the code was in a
      .h file).
      
      Fixes: bdf1ee5d ("[TCP]: Move code from tcp_ecn.h to tcp*.c and tcp.h & remove it")
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      f4c84360
    • E
      tcp: take care of compressed acks in tcp_add_reno_sack() · 8fcb3c3c
      Eric Dumazet 提交于
      mainline inclusion
      from mainline-v5.0-rc1
      commit 19119f298bb1f2af3bb1093f5f2a1fed8da94e37
      category: perf
      bugzilla: NA
      CVE: NA
      
      -------------------------------------------------
      Neal pointed out that non sack flows might suffer from ACK compression
      added in the following patch ("tcp: implement coalescing on backlog queue")
      
      Instead of tweaking tcp_add_backlog() we can take into
      account how many ACK were coalesced, this information
      will be available in skb_shinfo(skb)->gso_segs
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NBiaoxiang <yebiaoxiang@huawei.com>
      Reviewed-by: NMao Wenan <maowenan@huawei.com>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      8fcb3c3c
    • E
      tcp: limit payload size of sacked skbs · 249b1d27
      Eric Dumazet 提交于
      commit 3b4929f6 upstream.
      
      Jonathan Looney reported that TCP can trigger the following crash
      in tcp_shifted_skb() :
      
      	BUG_ON(tcp_skb_pcount(skb) < pcount);
      
      This can happen if the remote peer has advertized the smallest
      MSS that linux TCP accepts : 48
      
      An skb can hold 17 fragments, and each fragment can hold 32KB
      on x86, or 64KB on PowerPC.
      
      This means that the 16bit witdh of TCP_SKB_CB(skb)->tcp_gso_segs
      can overflow.
      
      Note that tcp_sendmsg() builds skbs with less than 64KB
      of payload, so this problem needs SACK to be enabled.
      SACK blocks allow TCP to coalesce multiple skbs in the retransmit
      queue, thus filling the 17 fragments to maximal capacity.
      
      CVE-2019-11477 -- u16 overflow of TCP_SKB_CB(skb)->tcp_gso_segs
      
      Fixes: 832d11c5 ("tcp: Try to restore large SKBs while SACK processing")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NJonathan Looney <jtl@netflix.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Reviewed-by: NTyler Hicks <tyhicks@canonical.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Bruce Curtis <brucec@netflix.com>
      Cc: Jonathan Lemon <jonathan.lemon@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      249b1d27
    • E
      tcp: tcp_grow_window() needs to respect tcp_space() · cb395e3e
      Eric Dumazet 提交于
      [ Upstream commit 50ce163a72d817a99e8974222dcf2886d5deb1ae ]
      
      For some reason, tcp_grow_window() correctly tests if enough room
      is present before attempting to increase tp->rcv_ssthresh,
      but does not prevent it to grow past tcp_space()
      
      This is causing hard to debug issues, like failing
      the (__tcp_select_window(sk) >= tp->rcv_wnd) test
      in __tcp_ack_snd_check(), causing ACK delays and possibly
      slow flows.
      
      Depending on tcp_rmem[2], MTU, skb->len/skb->truesize ratio,
      we can see the problem happening on "netperf -t TCP_RR -- -r 2000,2000"
      after about 60 round trips, when the active side no longer sends
      immediate acks.
      
      This bug predates git history.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Acked-by: NWei Wang <weiwan@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      cb395e3e
    • G
      tcp: handle inet_csk_reqsk_queue_add() failures · 4788d11e
      Guillaume Nault 提交于
      [  Upstream commit 9d3e1368bb45893a75a5dfb7cd21fdebfa6b47af ]
      
      Commit 7716682c ("tcp/dccp: fix another race at listener
      dismantle") let inet_csk_reqsk_queue_add() fail, and adjusted
      {tcp,dccp}_check_req() accordingly. However, TFO and syncookies
      weren't modified, thus leaking allocated resources on error.
      
      Contrary to tcp_check_req(), in both syncookies and TFO cases,
      we need to drop the request socket. Also, since the child socket is
      created with inet_csk_clone_lock(), we have to unlock it and drop an
      extra reference (->sk_refcount is initially set to 2 and
      inet_csk_reqsk_queue_add() drops only one ref).
      
      For TFO, we also need to revert the work done by tcp_try_fastopen()
      (with reqsk_fastopen_remove()).
      
      Fixes: 7716682c ("tcp/dccp: fix another race at listener dismantle")
      Signed-off-by: NGuillaume Nault <gnault@redhat.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      4788d11e
  15. 06 12月, 2018 1 次提交
    • E
      tcp: defer SACK compression after DupThresh · aaa7e45c
      Eric Dumazet 提交于
      [ Upstream commit 86de5921a3d5dd246df661e09bdd0a6131b39ae3 ]
      
      Jean-Louis reported a TCP regression and bisected to recent SACK
      compression.
      
      After a loss episode (receiver not able to keep up and dropping
      packets because its backlog is full), linux TCP stack is sending
      a single SACK (DUPACK).
      
      Sender waits a full RTO timer before recovering losses.
      
      While RFC 6675 says in section 5, "Algorithm Details",
      
         (2) If DupAcks < DupThresh but IsLost (HighACK + 1) returns true --
             indicating at least three segments have arrived above the current
             cumulative acknowledgment point, which is taken to indicate loss
             -- go to step (4).
      ...
         (4) Invoke fast retransmit and enter loss recovery as follows:
      
      there are old TCP stacks not implementing this strategy, and
      still counting the dupacks before starting fast retransmit.
      
      While these stacks probably perform poorly when receivers implement
      LRO/GRO, we should be a little more gentle to them.
      
      This patch makes sure we do not enable SACK compression unless
      3 dupacks have been sent since last rcv_nxt update.
      
      Ideally we should even rearm the timer to send one or two
      more DUPACK if no more packets are coming, but that will
      be work aiming for linux-4.21.
      
      Many thanks to Jean-Louis for bisecting the issue, providing
      packet captures and testing this patch.
      
      Fixes: 5d9f4262 ("tcp: add SACK compression")
      Reported-by: NJean-Louis Dupond <jean-louis@dupond.be>
      Tested-by: NJean-Louis Dupond <jean-louis@dupond.be>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      aaa7e45c