1. 16 4月, 2020 1 次提交
  2. 14 4月, 2020 1 次提交
    • E
      tcp: annotate tp->rcv_nxt lockless reads · a0adc066
      Eric Dumazet 提交于
      mainline inclusion
      from mainline-v5.4-rc4
      commit dba7d9b8c739df27ff3a234c81d6c6b23e3986fa
      category: bugfix
      bugzilla: 24157
      CVE: NA
      
      -------------------------------------
      
      There are few places where we fetch tp->rcv_nxt while
      this field can change from IRQ or other cpu.
      
      We need to add READ_ONCE() annotations, and also make
      sure write sides use corresponding WRITE_ONCE() to avoid
      store-tearing.
      
      Note that tcp_inq_hint() was already using READ_ONCE(tp->rcv_nxt)
      
      syzbot reported :
      
      BUG: KCSAN: data-race in tcp_poll / tcp_queue_rcv
      
      write to 0xffff888120425770 of 4 bytes by interrupt on cpu 0:
       tcp_rcv_nxt_update net/ipv4/tcp_input.c:3365 [inline]
       tcp_queue_rcv+0x180/0x380 net/ipv4/tcp_input.c:4638
       tcp_rcv_established+0xbf1/0xf50 net/ipv4/tcp_input.c:5616
       tcp_v4_do_rcv+0x381/0x4e0 net/ipv4/tcp_ipv4.c:1542
       tcp_v4_rcv+0x1a03/0x1bf0 net/ipv4/tcp_ipv4.c:1923
       ip_protocol_deliver_rcu+0x51/0x470 net/ipv4/ip_input.c:204
       ip_local_deliver_finish+0x110/0x140 net/ipv4/ip_input.c:231
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip_local_deliver+0x133/0x210 net/ipv4/ip_input.c:252
       dst_input include/net/dst.h:442 [inline]
       ip_rcv_finish+0x121/0x160 net/ipv4/ip_input.c:413
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip_rcv+0x18f/0x1a0 net/ipv4/ip_input.c:523
       __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5004
       __netif_receive_skb+0x37/0xf0 net/core/dev.c:5118
       netif_receive_skb_internal+0x59/0x190 net/core/dev.c:5208
       napi_skb_finish net/core/dev.c:5671 [inline]
       napi_gro_receive+0x28f/0x330 net/core/dev.c:5704
       receive_buf+0x284/0x30b0 drivers/net/virtio_net.c:1061
      
      read to 0xffff888120425770 of 4 bytes by task 7254 on cpu 1:
       tcp_stream_is_readable net/ipv4/tcp.c:480 [inline]
       tcp_poll+0x204/0x6b0 net/ipv4/tcp.c:554
       sock_poll+0xed/0x250 net/socket.c:1256
       vfs_poll include/linux/poll.h:90 [inline]
       ep_item_poll.isra.0+0x90/0x190 fs/eventpoll.c:892
       ep_send_events_proc+0x113/0x5c0 fs/eventpoll.c:1749
       ep_scan_ready_list.constprop.0+0x189/0x500 fs/eventpoll.c:704
       ep_send_events fs/eventpoll.c:1793 [inline]
       ep_poll+0xe3/0x900 fs/eventpoll.c:1930
       do_epoll_wait+0x162/0x180 fs/eventpoll.c:2294
       __do_sys_epoll_pwait fs/eventpoll.c:2325 [inline]
       __se_sys_epoll_pwait fs/eventpoll.c:2311 [inline]
       __x64_sys_epoll_pwait+0xcd/0x170 fs/eventpoll.c:2311
       do_syscall_64+0xcf/0x2f0 arch/x86/entry/common.c:296
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 1 PID: 7254 Comm: syz-fuzzer Not tainted 5.3.0+ #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: Nguodeqing <geffrey.guo@huawei.com>
      Reviewed-by: NWenan Mao <maowenan@huawei.com>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      a0adc066
  3. 27 12月, 2019 7 次提交
  4. 06 12月, 2018 1 次提交
    • E
      tcp: defer SACK compression after DupThresh · aaa7e45c
      Eric Dumazet 提交于
      [ Upstream commit 86de5921a3d5dd246df661e09bdd0a6131b39ae3 ]
      
      Jean-Louis reported a TCP regression and bisected to recent SACK
      compression.
      
      After a loss episode (receiver not able to keep up and dropping
      packets because its backlog is full), linux TCP stack is sending
      a single SACK (DUPACK).
      
      Sender waits a full RTO timer before recovering losses.
      
      While RFC 6675 says in section 5, "Algorithm Details",
      
         (2) If DupAcks < DupThresh but IsLost (HighACK + 1) returns true --
             indicating at least three segments have arrived above the current
             cumulative acknowledgment point, which is taken to indicate loss
             -- go to step (4).
      ...
         (4) Invoke fast retransmit and enter loss recovery as follows:
      
      there are old TCP stacks not implementing this strategy, and
      still counting the dupacks before starting fast retransmit.
      
      While these stacks probably perform poorly when receivers implement
      LRO/GRO, we should be a little more gentle to them.
      
      This patch makes sure we do not enable SACK compression unless
      3 dupacks have been sent since last rcv_nxt update.
      
      Ideally we should even rearm the timer to send one or two
      more DUPACK if no more packets are coming, but that will
      be work aiming for linux-4.21.
      
      Many thanks to Jean-Louis for bisecting the issue, providing
      packet captures and testing this patch.
      
      Fixes: 5d9f4262 ("tcp: add SACK compression")
      Reported-by: NJean-Louis Dupond <jean-louis@dupond.be>
      Tested-by: NJean-Louis Dupond <jean-louis@dupond.be>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      aaa7e45c
  5. 23 11月, 2018 1 次提交
  6. 02 10月, 2018 1 次提交
    • E
      tcp/dccp: fix lockdep issue when SYN is backlogged · 1ad98e9d
      Eric Dumazet 提交于
      In normal SYN processing, packets are handled without listener
      lock and in RCU protected ingress path.
      
      But syzkaller is known to be able to trick us and SYN
      packets might be processed in process context, after being
      queued into socket backlog.
      
      In commit 06f877d6 ("tcp/dccp: fix other lockdep splats
      accessing ireq_opt") I made a very stupid fix, that happened
      to work mostly because of the regular path being RCU protected.
      
      Really the thing protecting ireq->ireq_opt is RCU read lock,
      and the pseudo request refcnt is not relevant.
      
      This patch extends what I did in commit 449809a6 ("tcp/dccp:
      block BH for SYN processing") by adding an extra rcu_read_{lock|unlock}
      pair in the paths that might be taken when processing SYN from
      socket backlog (thus possibly in process context)
      
      Fixes: 06f877d6 ("tcp/dccp: fix other lockdep splats accessing ireq_opt")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1ad98e9d
  7. 12 9月, 2018 1 次提交
  8. 12 8月, 2018 3 次提交
    • Y
      tcp: avoid resetting ACK timer upon receiving packet with ECN CWR flag · fd2123a3
      Yuchung Cheng 提交于
      Previously commit 9aee4000 ("tcp: ack immediately when a cwr
      packet arrives") calls tcp_enter_quickack_mode to force sending
      two immediate ACKs upon receiving a packet w/ CWR flag. The side
      effect is it'll also reset the delayed ACK timer and interactive
      session tracking. This patch removes that side effect by using the
      new ACK_NOW flag to force an immmediate ACK.
      
      Packetdrill to demonstrate:
      
          0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
         +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
         +0 setsockopt(3, SOL_TCP, TCP_CONGESTION, "dctcp", 5) = 0
         +0 bind(3, ..., ...) = 0
         +0 listen(3, 1) = 0
      
         +0 < [ect0] SEW 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7>
         +0 > SE. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 8>
        +.1 < [ect0] . 1:1(0) ack 1 win 257
         +0 accept(3, ..., ...) = 4
      
         +0 < [ect0] . 1:1001(1000) ack 1 win 257
         +0 > [ect01] . 1:1(0) ack 1001
      
         +0 write(4, ..., 1) = 1
         +0 > [ect01] P. 1:2(1) ack 1001
      
         +0 < [ect0] . 1001:2001(1000) ack 2 win 257
         +0 write(4, ..., 1) = 1
         +0 > [ect01] P. 2:3(1) ack 2001
      
         +0 < [ect0] . 2001:3001(1000) ack 3 win 257
         +0 < [ect0] . 3001:4001(1000) ack 3 win 257
         // Ack delayed ...
      
         +.01 < [ce] P. 4001:4501(500) ack 3 win 257
         +0 > [ect01] . 3:3(0) ack 4001
         +0 > [ect01] E. 3:3(0) ack 4501
      
      +.001 read(4, ..., 4500) = 4500
         +0 write(4, ..., 1) = 1
         +0 > [ect01] PE. 3:4(1) ack 4501 win 100
      
       +.01 < [ect0] W. 4501:5501(1000) ack 4 win 257
         // No delayed ACK on CWR flag
         +0 > [ect01] . 4:4(0) ack 5501
      
       +.31 < [ect0] . 5501:6501(1000) ack 4 win 257
         +0 > [ect01] . 4:4(0) ack 6501
      
      Fixes: 9aee4000 ("tcp: ack immediately when a cwr packet arrives")
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fd2123a3
    • Y
      tcp: always ACK immediately on hole repairs · 15bdd568
      Yuchung Cheng 提交于
      RFC 5681 sec 4.2:
        To provide feedback to senders recovering from losses, the receiver
        SHOULD send an immediate ACK when it receives a data segment that
        fills in all or part of a gap in the sequence space.
      
      When a gap is partially filled, __tcp_ack_snd_check already checks
      the out-of-order queue and correctly send an immediate ACK. However
      when a gap is fully filled, the previous implementation only resets
      pingpong mode which does not guarantee an immediate ACK because the
      quick ACK counter may be zero. This patch addresses this issue by
      marking the one-time immediate ACK flag instead.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NWei Wang <weiwan@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      15bdd568
    • Y
      tcp: mandate a one-time immediate ACK · 466466dc
      Yuchung Cheng 提交于
      Add a new flag to indicate a one-time immediate ACK. This flag is
      occasionaly set under specific TCP protocol states in addition to
      the more common quickack mechanism for interactive application.
      
      In several cases in the TCP code we want to force an immediate ACK
      but do not want to call tcp_enter_quickack_mode() because we do
      not want to forget the icsk_ack.pingpong or icsk_ack.ato state.
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NWei Wang <weiwan@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      466466dc
  9. 02 8月, 2018 2 次提交
  10. 26 7月, 2018 1 次提交
    • L
      tcp: ack immediately when a cwr packet arrives · 9aee4000
      Lawrence Brakmo 提交于
      We observed high 99 and 99.9% latencies when doing RPCs with DCTCP. The
      problem is triggered when the last packet of a request arrives CE
      marked. The reply will carry the ECE mark causing TCP to shrink its cwnd
      to 1 (because there are no packets in flight). When the 1st packet of
      the next request arrives, the ACK was sometimes delayed even though it
      is CWR marked, adding up to 40ms to the RPC latency.
      
      This patch insures that CWR marked data packets arriving will be acked
      immediately.
      
      Packetdrill script to reproduce the problem:
      
      0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
      0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
      0.000 setsockopt(3, SOL_TCP, TCP_CONGESTION, "dctcp", 5) = 0
      0.000 bind(3, ..., ...) = 0
      0.000 listen(3, 1) = 0
      
      0.100 < [ect0] SEW 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7>
      0.100 > SE. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 8>
      0.110 < [ect0] . 1:1(0) ack 1 win 257
      0.200 accept(3, ..., ...) = 4
      
      0.200 < [ect0] . 1:1001(1000) ack 1 win 257
      0.200 > [ect01] . 1:1(0) ack 1001
      
      0.200 write(4, ..., 1) = 1
      0.200 > [ect01] P. 1:2(1) ack 1001
      
      0.200 < [ect0] . 1001:2001(1000) ack 2 win 257
      0.200 write(4, ..., 1) = 1
      0.200 > [ect01] P. 2:3(1) ack 2001
      
      0.200 < [ect0] . 2001:3001(1000) ack 3 win 257
      0.200 < [ect0] . 3001:4001(1000) ack 3 win 257
      0.200 > [ect01] . 3:3(0) ack 4001
      
      0.210 < [ce] P. 4001:4501(500) ack 3 win 257
      
      +0.001 read(4, ..., 4500) = 4500
      +0 write(4, ..., 1) = 1
      +0 > [ect01] PE. 3:4(1) ack 4501
      
      +0.010 < [ect0] W. 4501:5501(1000) ack 4 win 257
      // Previously the ACK sequence below would be 4501, causing a long RTO
      +0.040~+0.045 > [ect01] . 4:4(0) ack 5501   // delayed ack
      
      +0.311 < [ect0] . 5501:6501(1000) ack 4 win 257  // More data
      +0 > [ect01] . 4:4(0) ack 6501     // now acks everything
      
      +0.500 < F. 9501:9501(0) ack 4 win 257
      
      Modified based on comments by Neal Cardwell <ncardwell@google.com>
      Signed-off-by: NLawrence Brakmo <brakmo@fb.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9aee4000
  11. 24 7月, 2018 5 次提交
  12. 21 7月, 2018 1 次提交
    • Y
      tcp: do not delay ACK in DCTCP upon CE status change · a0496ef2
      Yuchung Cheng 提交于
      Per DCTCP RFC8257 (Section 3.2) the ACK reflecting the CE status change
      has to be sent immediately so the sender can respond quickly:
      
      """ When receiving packets, the CE codepoint MUST be processed as follows:
      
         1.  If the CE codepoint is set and DCTCP.CE is false, set DCTCP.CE to
             true and send an immediate ACK.
      
         2.  If the CE codepoint is not set and DCTCP.CE is true, set DCTCP.CE
             to false and send an immediate ACK.
      """
      
      Previously DCTCP implementation may continue to delay the ACK. This
      patch fixes that to implement the RFC by forcing an immediate ACK.
      
      Tested with this packetdrill script provided by Larry Brakmo
      
      0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
      0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
      0.000 setsockopt(3, SOL_TCP, TCP_CONGESTION, "dctcp", 5) = 0
      0.000 bind(3, ..., ...) = 0
      0.000 listen(3, 1) = 0
      
      0.100 < [ect0] SEW 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7>
      0.100 > SE. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 8>
      0.110 < [ect0] . 1:1(0) ack 1 win 257
      0.200 accept(3, ..., ...) = 4
         +0 setsockopt(4, SOL_SOCKET, SO_DEBUG, [1], 4) = 0
      
      0.200 < [ect0] . 1:1001(1000) ack 1 win 257
      0.200 > [ect01] . 1:1(0) ack 1001
      
      0.200 write(4, ..., 1) = 1
      0.200 > [ect01] P. 1:2(1) ack 1001
      
      0.200 < [ect0] . 1001:2001(1000) ack 2 win 257
      +0.005 < [ce] . 2001:3001(1000) ack 2 win 257
      
      +0.000 > [ect01] . 2:2(0) ack 2001
      // Previously the ACK below would be delayed by 40ms
      +0.000 > [ect01] E. 2:2(0) ack 3001
      
      +0.500 < F. 9501:9501(0) ack 4 win 257
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a0496ef2
  13. 16 7月, 2018 1 次提交
  14. 15 7月, 2018 1 次提交
  15. 13 7月, 2018 1 次提交
    • A
      tcp: use monotonic timestamps for PAWS · cca9bab1
      Arnd Bergmann 提交于
      Using get_seconds() for timestamps is deprecated since it can lead
      to overflows on 32-bit systems. While the interface generally doesn't
      overflow until year 2106, the specific implementation of the TCP PAWS
      algorithm breaks in 2038 when the intermediate signed 32-bit timestamps
      overflow.
      
      A related problem is that the local timestamps in CLOCK_REALTIME form
      lead to unexpected behavior when settimeofday is called to set the system
      clock backwards or forwards by more than 24 days.
      
      While the first problem could be solved by using an overflow-safe method
      of comparing the timestamps, a nicer solution is to use a monotonic
      clocksource with ktime_get_seconds() that simply doesn't overflow (at
      least not until 136 years after boot) and that doesn't change during
      settimeofday().
      
      To make 32-bit and 64-bit architectures behave the same way here, and
      also save a few bytes in the tcp_options_received structure, I'm changing
      the type to a 32-bit integer, which is now safe on all architectures.
      
      Finally, the ts_recent_stamp field also (confusingly) gets used to store
      a jiffies value in tcp_synq_overflow()/tcp_synq_no_recent_overflow().
      This is currently safe, but changing the type to 32-bit requires
      some small changes there to keep it working.
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cca9bab1
  16. 02 7月, 2018 1 次提交
  17. 01 7月, 2018 1 次提交
    • I
      tcp: prevent bogus FRTO undos with non-SACK flows · 1236f22f
      Ilpo Järvinen 提交于
      If SACK is not enabled and the first cumulative ACK after the RTO
      retransmission covers more than the retransmitted skb, a spurious
      FRTO undo will trigger (assuming FRTO is enabled for that RTO).
      The reason is that any non-retransmitted segment acknowledged will
      set FLAG_ORIG_SACK_ACKED in tcp_clean_rtx_queue even if there is
      no indication that it would have been delivered for real (the
      scoreboard is not kept with TCPCB_SACKED_ACKED bits in the non-SACK
      case so the check for that bit won't help like it does with SACK).
      Having FLAG_ORIG_SACK_ACKED set results in the spurious FRTO undo
      in tcp_process_loss.
      
      We need to use more strict condition for non-SACK case and check
      that none of the cumulatively ACKed segments were retransmitted
      to prove that progress is due to original transmissions. Only then
      keep FLAG_ORIG_SACK_ACKED set, allowing FRTO undo to proceed in
      non-SACK case.
      
      (FLAG_ORIG_SACK_ACKED is planned to be renamed to FLAG_ORIG_PROGRESS
      to better indicate its purpose but to keep this change minimal, it
      will be done in another patch).
      
      Besides burstiness and congestion control violations, this problem
      can result in RTO loop: When the loss recovery is prematurely
      undoed, only new data will be transmitted (if available) and
      the next retransmission can occur only after a new RTO which in case
      of multiple losses (that are not for consecutive packets) requires
      one RTO per loss to recover.
      Signed-off-by: NIlpo Järvinen <ilpo.jarvinen@helsinki.fi>
      Tested-by: NNeal Cardwell <ncardwell@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1236f22f
  18. 30 6月, 2018 1 次提交
  19. 28 6月, 2018 1 次提交
  20. 26 6月, 2018 1 次提交
  21. 22 6月, 2018 1 次提交
  22. 05 6月, 2018 1 次提交
  23. 01 6月, 2018 1 次提交
  24. 23 5月, 2018 2 次提交
  25. 18 5月, 2018 2 次提交