1. 03 2月, 2020 1 次提交
    • S
      tcp: Reduce SYN resend delay if a suspicous ACK is received · 9603d47b
      SeongJae Park 提交于
      When closing a connection, the two acks that required to change closing
      socket's status to FIN_WAIT_2 and then TIME_WAIT could be processed in
      reverse order.  This is possible in RSS disabled environments such as a
      connection inside a host.
      
      For example, expected state transitions and required packets for the
      disconnection will be similar to below flow.
      
      	 00 (Process A)				(Process B)
      	 01 ESTABLISHED				ESTABLISHED
      	 02 close()
      	 03 FIN_WAIT_1
      	 04 		---FIN-->
      	 05 					CLOSE_WAIT
      	 06 		<--ACK---
      	 07 FIN_WAIT_2
      	 08 		<--FIN/ACK---
      	 09 TIME_WAIT
      	 10 		---ACK-->
      	 11 					LAST_ACK
      	 12 CLOSED				CLOSED
      
      In some cases such as LINGER option applied socket, the FIN and FIN/ACK
      will be substituted to RST and RST/ACK, but there is no difference in
      the main logic.
      
      The acks in lines 6 and 8 are the acks.  If the line 8 packet is
      processed before the line 6 packet, it will be just ignored as it is not
      a expected packet, and the later process of the line 6 packet will
      change the status of Process A to FIN_WAIT_2, but as it has already
      handled line 8 packet, it will not go to TIME_WAIT and thus will not
      send the line 10 packet to Process B.  Thus, Process B will left in
      CLOSE_WAIT status, as below.
      
      	 00 (Process A)				(Process B)
      	 01 ESTABLISHED				ESTABLISHED
      	 02 close()
      	 03 FIN_WAIT_1
      	 04 		---FIN-->
      	 05 					CLOSE_WAIT
      	 06 				(<--ACK---)
      	 07	  			(<--FIN/ACK---)
      	 08 				(fired in right order)
      	 09 		<--FIN/ACK---
      	 10 		<--ACK---
      	 11 		(processed in reverse order)
      	 12 FIN_WAIT_2
      
      Later, if the Process B sends SYN to Process A for reconnection using
      the same port, Process A will responds with an ACK for the last flow,
      which has no increased sequence number.  Thus, Process A will send RST,
      wait for TIMEOUT_INIT (one second in default), and then try
      reconnection.  If reconnections are frequent, the one second latency
      spikes can be a big problem.  Below is a tcpdump results of the problem:
      
          14.436259 IP 127.0.0.1.45150 > 127.0.0.1.4242: Flags [S], seq 2560603644
          14.436266 IP 127.0.0.1.4242 > 127.0.0.1.45150: Flags [.], ack 5, win 512
          14.436271 IP 127.0.0.1.45150 > 127.0.0.1.4242: Flags [R], seq 2541101298
          /* ONE SECOND DELAY */
          15.464613 IP 127.0.0.1.45150 > 127.0.0.1.4242: Flags [S], seq 2560603644
      
      This commit mitigates the problem by reducing the delay for the next SYN
      if the suspicous ACK is received while in SYN_SENT state.
      
      Following commit will add a selftest, which can be also helpful for
      understanding of this issue.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NSeongJae Park <sjpark@amazon.de>
      Signed-off-by: NJakub Kicinski <kuba@kernel.org>
      9603d47b
  2. 30 1月, 2020 1 次提交
    • F
      mptcp: handle tcp fallback when using syn cookies · ae2dd716
      Florian Westphal 提交于
      We can't deal with syncookie mode yet, the syncookie rx path will create
      tcp reqsk, i.e. we get OOB access because we treat tcp reqsk as mptcp reqsk one:
      
      TCP: SYN flooding on port 20002. Sending cookies.
      BUG: KASAN: slab-out-of-bounds in subflow_syn_recv_sock+0x451/0x4d0 net/mptcp/subflow.c:191
      Read of size 1 at addr ffff8881167bc148 by task syz-executor099/2120
       subflow_syn_recv_sock+0x451/0x4d0 net/mptcp/subflow.c:191
       tcp_get_cookie_sock+0xcf/0x520 net/ipv4/syncookies.c:209
       cookie_v6_check+0x15a5/0x1e90 net/ipv6/syncookies.c:252
       tcp_v6_cookie_check net/ipv6/tcp_ipv6.c:1123 [inline]
       [..]
      
      Bug can be reproduced via "sysctl net.ipv4.tcp_syncookies=2".
      
      Note that MPTCP should work with syncookies (4th ack would carry needed
      state), but it appears better to sort that out in -next so do tcp
      fallback for now.
      
      I removed the MPTCP ifdef for tcp_rsk "is_mptcp" member because
      if (IS_ENABLED()) is easier to read than "#ifdef IS_ENABLED()/#endif" pair.
      
      Cc: Eric Dumazet <edumazet@google.com>
      Fixes: cec37a6e ("mptcp: Handle MP_CAPABLE options for outgoing connections")
      Reported-by: NChristoph Paasch <cpaasch@apple.com>
      Tested-by: NChristoph Paasch <cpaasch@apple.com>
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ae2dd716
  3. 26 1月, 2020 1 次提交
  4. 24 1月, 2020 5 次提交
    • C
      mptcp: parse and emit MP_CAPABLE option according to v1 spec · cc7972ea
      Christoph Paasch 提交于
      This implements MP_CAPABLE options parsing and writing according
      to RFC 6824 bis / RFC 8684: MPTCP v1.
      
      Local key is sent on syn/ack, and both keys are sent on 3rd ack.
      MP_CAPABLE messages len are updated accordingly. We need the skbuff to
      correctly emit the above, so we push the skbuff struct as an argument
      all the way from tcp code to the relevant mptcp callbacks.
      
      When processing incoming MP_CAPABLE + data, build a full blown DSS-like
      map info, to simplify later processing.  On child socket creation, we
      need to record the remote key, if available.
      Signed-off-by: NChristoph Paasch <cpaasch@apple.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cc7972ea
    • M
      mptcp: Implement MPTCP receive path · 648ef4b8
      Mat Martineau 提交于
      Parses incoming DSS options and populates outgoing MPTCP ACK
      fields. MPTCP fields are parsed from the TCP option header and placed in
      an skb extension, allowing the upper MPTCP layer to access MPTCP
      options after the skb has gone through the TCP stack.
      
      The subflow implements its own data_ready() ops, which ensures that the
      pending data is in sequence - according to MPTCP seq number - dropping
      out-of-seq skbs. The DATA_READY bit flag is set if this is the case.
      This allows the MPTCP socket layer to determine if more data is
      available without having to consult the individual subflows.
      
      It additionally validates the current mapping and propagates EoF events
      to the connection socket.
      Co-developed-by: NPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Co-developed-by: NPeter Krystad <peter.krystad@linux.intel.com>
      Signed-off-by: NPeter Krystad <peter.krystad@linux.intel.com>
      Co-developed-by: NDavide Caratti <dcaratti@redhat.com>
      Signed-off-by: NDavide Caratti <dcaratti@redhat.com>
      Co-developed-by: NMatthieu Baerts <matthieu.baerts@tessares.net>
      Signed-off-by: NMatthieu Baerts <matthieu.baerts@tessares.net>
      Co-developed-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NMat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: NChristoph Paasch <cpaasch@apple.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      648ef4b8
    • P
      mptcp: Handle MP_CAPABLE options for outgoing connections · cec37a6e
      Peter Krystad 提交于
      Add hooks to tcp_output.c to add MP_CAPABLE to an outgoing SYN request,
      to capture the MP_CAPABLE in the received SYN-ACK, to add MP_CAPABLE to
      the final ACK of the three-way handshake.
      
      Use the .sk_rx_dst_set() handler in the subflow proto to capture when the
      responding SYN-ACK is received and notify the MPTCP connection layer.
      Co-developed-by: NPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Co-developed-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NPeter Krystad <peter.krystad@linux.intel.com>
      Signed-off-by: NChristoph Paasch <cpaasch@apple.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cec37a6e
    • P
      mptcp: Handle MPTCP TCP options · eda7acdd
      Peter Krystad 提交于
      Add hooks to parse and format the MP_CAPABLE option.
      
      This option is handled according to MPTCP version 0 (RFC6824).
      MPTCP version 1 MP_CAPABLE (RFC6824bis/RFC8684) will be added later in
      coordination with related code changes.
      Co-developed-by: NMatthieu Baerts <matthieu.baerts@tessares.net>
      Signed-off-by: NMatthieu Baerts <matthieu.baerts@tessares.net>
      Co-developed-by: NFlorian Westphal <fw@strlen.de>
      Signed-off-by: NFlorian Westphal <fw@strlen.de>
      Co-developed-by: NDavide Caratti <dcaratti@redhat.com>
      Signed-off-by: NDavide Caratti <dcaratti@redhat.com>
      Signed-off-by: NPeter Krystad <peter.krystad@linux.intel.com>
      Signed-off-by: NChristoph Paasch <cpaasch@apple.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      eda7acdd
    • E
      tcp: do not leave dangling pointers in tp->highest_sack · 2bec445f
      Eric Dumazet 提交于
      Latest commit 85369750 ("tcp: Fix highest_sack and highest_sack_seq")
      apparently allowed syzbot to trigger various crashes in TCP stack [1]
      
      I believe this commit only made things easier for syzbot to find
      its way into triggering use-after-frees. But really the bugs
      could lead to bad TCP behavior or even plain crashes even for
      non malicious peers.
      
      I have audited all calls to tcp_rtx_queue_unlink() and
      tcp_rtx_queue_unlink_and_free() and made sure tp->highest_sack would be updated
      if we are removing from rtx queue the skb that tp->highest_sack points to.
      
      These updates were missing in three locations :
      
      1) tcp_clean_rtx_queue() [This one seems quite serious,
                                I have no idea why this was not caught earlier]
      
      2) tcp_rtx_queue_purge() [Probably not a big deal for normal operations]
      
      3) tcp_send_synack()     [Probably not a big deal for normal operations]
      
      [1]
      BUG: KASAN: use-after-free in tcp_highest_sack_seq include/net/tcp.h:1864 [inline]
      BUG: KASAN: use-after-free in tcp_highest_sack_seq include/net/tcp.h:1856 [inline]
      BUG: KASAN: use-after-free in tcp_check_sack_reordering+0x33c/0x3a0 net/ipv4/tcp_input.c:891
      Read of size 4 at addr ffff8880a488d068 by task ksoftirqd/1/16
      
      CPU: 1 PID: 16 Comm: ksoftirqd/1 Not tainted 5.5.0-rc5-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x197/0x210 lib/dump_stack.c:118
       print_address_description.constprop.0.cold+0xd4/0x30b mm/kasan/report.c:374
       __kasan_report.cold+0x1b/0x41 mm/kasan/report.c:506
       kasan_report+0x12/0x20 mm/kasan/common.c:639
       __asan_report_load4_noabort+0x14/0x20 mm/kasan/generic_report.c:134
       tcp_highest_sack_seq include/net/tcp.h:1864 [inline]
       tcp_highest_sack_seq include/net/tcp.h:1856 [inline]
       tcp_check_sack_reordering+0x33c/0x3a0 net/ipv4/tcp_input.c:891
       tcp_try_undo_partial net/ipv4/tcp_input.c:2730 [inline]
       tcp_fastretrans_alert+0xf74/0x23f0 net/ipv4/tcp_input.c:2847
       tcp_ack+0x2577/0x5bf0 net/ipv4/tcp_input.c:3710
       tcp_rcv_established+0x6dd/0x1e90 net/ipv4/tcp_input.c:5706
       tcp_v4_do_rcv+0x619/0x8d0 net/ipv4/tcp_ipv4.c:1619
       tcp_v4_rcv+0x307f/0x3b40 net/ipv4/tcp_ipv4.c:2001
       ip_protocol_deliver_rcu+0x5a/0x880 net/ipv4/ip_input.c:204
       ip_local_deliver_finish+0x23b/0x380 net/ipv4/ip_input.c:231
       NF_HOOK include/linux/netfilter.h:307 [inline]
       NF_HOOK include/linux/netfilter.h:301 [inline]
       ip_local_deliver+0x1e9/0x520 net/ipv4/ip_input.c:252
       dst_input include/net/dst.h:442 [inline]
       ip_rcv_finish+0x1db/0x2f0 net/ipv4/ip_input.c:428
       NF_HOOK include/linux/netfilter.h:307 [inline]
       NF_HOOK include/linux/netfilter.h:301 [inline]
       ip_rcv+0xe8/0x3f0 net/ipv4/ip_input.c:538
       __netif_receive_skb_one_core+0x113/0x1a0 net/core/dev.c:5148
       __netif_receive_skb+0x2c/0x1d0 net/core/dev.c:5262
       process_backlog+0x206/0x750 net/core/dev.c:6093
       napi_poll net/core/dev.c:6530 [inline]
       net_rx_action+0x508/0x1120 net/core/dev.c:6598
       __do_softirq+0x262/0x98c kernel/softirq.c:292
       run_ksoftirqd kernel/softirq.c:603 [inline]
       run_ksoftirqd+0x8e/0x110 kernel/softirq.c:595
       smpboot_thread_fn+0x6a3/0xa40 kernel/smpboot.c:165
       kthread+0x361/0x430 kernel/kthread.c:255
       ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352
      
      Allocated by task 10091:
       save_stack+0x23/0x90 mm/kasan/common.c:72
       set_track mm/kasan/common.c:80 [inline]
       __kasan_kmalloc mm/kasan/common.c:513 [inline]
       __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:486
       kasan_slab_alloc+0xf/0x20 mm/kasan/common.c:521
       slab_post_alloc_hook mm/slab.h:584 [inline]
       slab_alloc_node mm/slab.c:3263 [inline]
       kmem_cache_alloc_node+0x138/0x740 mm/slab.c:3575
       __alloc_skb+0xd5/0x5e0 net/core/skbuff.c:198
       alloc_skb_fclone include/linux/skbuff.h:1099 [inline]
       sk_stream_alloc_skb net/ipv4/tcp.c:875 [inline]
       sk_stream_alloc_skb+0x113/0xc90 net/ipv4/tcp.c:852
       tcp_sendmsg_locked+0xcf9/0x3470 net/ipv4/tcp.c:1282
       tcp_sendmsg+0x30/0x50 net/ipv4/tcp.c:1432
       inet_sendmsg+0x9e/0xe0 net/ipv4/af_inet.c:807
       sock_sendmsg_nosec net/socket.c:652 [inline]
       sock_sendmsg+0xd7/0x130 net/socket.c:672
       __sys_sendto+0x262/0x380 net/socket.c:1998
       __do_sys_sendto net/socket.c:2010 [inline]
       __se_sys_sendto net/socket.c:2006 [inline]
       __x64_sys_sendto+0xe1/0x1a0 net/socket.c:2006
       do_syscall_64+0xfa/0x790 arch/x86/entry/common.c:294
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Freed by task 10095:
       save_stack+0x23/0x90 mm/kasan/common.c:72
       set_track mm/kasan/common.c:80 [inline]
       kasan_set_free_info mm/kasan/common.c:335 [inline]
       __kasan_slab_free+0x102/0x150 mm/kasan/common.c:474
       kasan_slab_free+0xe/0x10 mm/kasan/common.c:483
       __cache_free mm/slab.c:3426 [inline]
       kmem_cache_free+0x86/0x320 mm/slab.c:3694
       kfree_skbmem+0x178/0x1c0 net/core/skbuff.c:645
       __kfree_skb+0x1e/0x30 net/core/skbuff.c:681
       sk_eat_skb include/net/sock.h:2453 [inline]
       tcp_recvmsg+0x1252/0x2930 net/ipv4/tcp.c:2166
       inet_recvmsg+0x136/0x610 net/ipv4/af_inet.c:838
       sock_recvmsg_nosec net/socket.c:886 [inline]
       sock_recvmsg net/socket.c:904 [inline]
       sock_recvmsg+0xce/0x110 net/socket.c:900
       __sys_recvfrom+0x1ff/0x350 net/socket.c:2055
       __do_sys_recvfrom net/socket.c:2073 [inline]
       __se_sys_recvfrom net/socket.c:2069 [inline]
       __x64_sys_recvfrom+0xe1/0x1a0 net/socket.c:2069
       do_syscall_64+0xfa/0x790 arch/x86/entry/common.c:294
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      The buggy address belongs to the object at ffff8880a488d040
       which belongs to the cache skbuff_fclone_cache of size 456
      The buggy address is located 40 bytes inside of
       456-byte region [ffff8880a488d040, ffff8880a488d208)
      The buggy address belongs to the page:
      page:ffffea0002922340 refcount:1 mapcount:0 mapping:ffff88821b057000 index:0x0
      raw: 00fffe0000000200 ffffea00022a5788 ffffea0002624a48 ffff88821b057000
      raw: 0000000000000000 ffff8880a488d040 0000000100000006 0000000000000000
      page dumped because: kasan: bad access detected
      
      Memory state around the buggy address:
       ffff8880a488cf00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
       ffff8880a488cf80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      >ffff8880a488d000: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
                                                                ^
       ffff8880a488d080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
       ffff8880a488d100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      
      Fixes: 85369750 ("tcp: Fix highest_sack and highest_sack_seq")
      Fixes: 50895b9d ("tcp: highest_sack fix")
      Fixes: 737ff314 ("tcp: use sequence distance to detect reordering")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Cambda Zhu <cambda@linux.alibaba.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      2bec445f
  5. 16 1月, 2020 1 次提交
    • P
      tcp: fix marked lost packets not being retransmitted · e176b1ba
      Pengcheng Yang 提交于
      When the packet pointed to by retransmit_skb_hint is unlinked by ACK,
      retransmit_skb_hint will be set to NULL in tcp_clean_rtx_queue().
      If packet loss is detected at this time, retransmit_skb_hint will be set
      to point to the current packet loss in tcp_verify_retransmit_hint(),
      then the packets that were previously marked lost but not retransmitted
      due to the restriction of cwnd will be skipped and cannot be
      retransmitted.
      
      To fix this, when retransmit_skb_hint is NULL, retransmit_skb_hint can
      be reset only after all marked lost packets are retransmitted
      (retrans_out >= lost_out), otherwise we need to traverse from
      tcp_rtx_queue_head in tcp_xmit_retransmit_queue().
      
      Packetdrill to demonstrate:
      
      // Disable RACK and set max_reordering to keep things simple
          0 `sysctl -q net.ipv4.tcp_recovery=0`
         +0 `sysctl -q net.ipv4.tcp_max_reordering=3`
      
      // Establish a connection
         +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
         +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
         +0 bind(3, ..., ...) = 0
         +0 listen(3, 1) = 0
      
        +.1 < S 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7>
         +0 > S. 0:0(0) ack 1 <...>
       +.01 < . 1:1(0) ack 1 win 257
         +0 accept(3, ..., ...) = 4
      
      // Send 8 data segments
         +0 write(4, ..., 8000) = 8000
         +0 > P. 1:8001(8000) ack 1
      
      // Enter recovery and 1:3001 is marked lost
       +.01 < . 1:1(0) ack 1 win 257 <sack 3001:4001,nop,nop>
         +0 < . 1:1(0) ack 1 win 257 <sack 5001:6001 3001:4001,nop,nop>
         +0 < . 1:1(0) ack 1 win 257 <sack 5001:7001 3001:4001,nop,nop>
      
      // Retransmit 1:1001, now retransmit_skb_hint points to 1001:2001
         +0 > . 1:1001(1000) ack 1
      
      // 1001:2001 was ACKed causing retransmit_skb_hint to be set to NULL
       +.01 < . 1:1(0) ack 2001 win 257 <sack 5001:8001 3001:4001,nop,nop>
      // Now retransmit_skb_hint points to 4001:5001 which is now marked lost
      
      // BUG: 2001:3001 was not retransmitted
         +0 > . 2001:3001(1000) ack 1
      Signed-off-by: NPengcheng Yang <yangpc@wangsu.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Tested-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e176b1ba
  6. 10 1月, 2020 1 次提交
  7. 03 1月, 2020 2 次提交
  8. 26 10月, 2019 1 次提交
    • J
      tcp: add TCP_INFO status for failed client TFO · 48027478
      Jason Baron 提交于
      The TCPI_OPT_SYN_DATA bit as part of tcpi_options currently reports whether
      or not data-in-SYN was ack'd on both the client and server side. We'd like
      to gather more information on the client-side in the failure case in order
      to indicate the reason for the failure. This can be useful for not only
      debugging TFO, but also for creating TFO socket policies. For example, if
      a middle box removes the TFO option or drops a data-in-SYN, we can
      can detect this case, and turn off TFO for these connections saving the
      extra retransmits.
      
      The newly added tcpi_fastopen_client_fail status is 2 bits and has the
      following 4 states:
      
      1) TFO_STATUS_UNSPEC
      
      Catch-all state which includes when TFO is disabled via black hole
      detection, which is indicated via LINUX_MIB_TCPFASTOPENBLACKHOLE.
      
      2) TFO_COOKIE_UNAVAILABLE
      
      If TFO_CLIENT_NO_COOKIE mode is off, this state indicates that no cookie
      is available in the cache.
      
      3) TFO_DATA_NOT_ACKED
      
      Data was sent with SYN, we received a SYN/ACK but it did not cover the data
      portion. Cookie is not accepted by server because the cookie may be invalid
      or the server may be overloaded.
      
      4) TFO_SYN_RETRANSMITTED
      
      Data was sent with SYN, we received a SYN/ACK which did not cover the data
      after at least 1 additional SYN was sent (without data). It may be the case
      that a middle-box is dropping data-in-SYN packets. Thus, it would be more
      efficient to not use TFO on this connection to avoid extra retransmits
      during connection establishment.
      
      These new fields do not cover all the cases where TFO may fail, but other
      failures, such as SYN/ACK + data being dropped, will result in the
      connection not becoming established. And a connection blackhole after
      session establishment shows up as a stalled connection.
      Signed-off-by: NJason Baron <jbaron@akamai.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Christoph Paasch <cpaasch@apple.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      48027478
  9. 14 10月, 2019 6 次提交
    • E
      tcp: annotate sk->sk_sndbuf lockless reads · e292f05e
      Eric Dumazet 提交于
      For the sake of tcp_poll(), there are few places where we fetch
      sk->sk_sndbuf while this field can change from IRQ or other cpu.
      
      We need to add READ_ONCE() annotations, and also make sure write
      sides use corresponding WRITE_ONCE() to avoid store-tearing.
      
      Note that other transports probably need similar fixes.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e292f05e
    • E
      tcp: annotate sk->sk_rcvbuf lockless reads · ebb3b78d
      Eric Dumazet 提交于
      For the sake of tcp_poll(), there are few places where we fetch
      sk->sk_rcvbuf while this field can change from IRQ or other cpu.
      
      We need to add READ_ONCE() annotations, and also make sure write
      sides use corresponding WRITE_ONCE() to avoid store-tearing.
      
      Note that other transports probably need similar fixes.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ebb3b78d
    • E
      tcp: annotate tp->urg_seq lockless reads · d9b55bf7
      Eric Dumazet 提交于
      There two places where we fetch tp->urg_seq while
      this field can change from IRQ or other cpu.
      
      We need to add READ_ONCE() annotations, and also make
      sure write side use corresponding WRITE_ONCE() to avoid
      store-tearing.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d9b55bf7
    • E
      tcp: annotate tp->copied_seq lockless reads · 7db48e98
      Eric Dumazet 提交于
      There are few places where we fetch tp->copied_seq while
      this field can change from IRQ or other cpu.
      
      We need to add READ_ONCE() annotations, and also make
      sure write sides use corresponding WRITE_ONCE() to avoid
      store-tearing.
      
      Note that tcp_inq_hint() was already using READ_ONCE(tp->copied_seq)
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7db48e98
    • E
      tcp: annotate tp->rcv_nxt lockless reads · dba7d9b8
      Eric Dumazet 提交于
      There are few places where we fetch tp->rcv_nxt while
      this field can change from IRQ or other cpu.
      
      We need to add READ_ONCE() annotations, and also make
      sure write sides use corresponding WRITE_ONCE() to avoid
      store-tearing.
      
      Note that tcp_inq_hint() was already using READ_ONCE(tp->rcv_nxt)
      
      syzbot reported :
      
      BUG: KCSAN: data-race in tcp_poll / tcp_queue_rcv
      
      write to 0xffff888120425770 of 4 bytes by interrupt on cpu 0:
       tcp_rcv_nxt_update net/ipv4/tcp_input.c:3365 [inline]
       tcp_queue_rcv+0x180/0x380 net/ipv4/tcp_input.c:4638
       tcp_rcv_established+0xbf1/0xf50 net/ipv4/tcp_input.c:5616
       tcp_v4_do_rcv+0x381/0x4e0 net/ipv4/tcp_ipv4.c:1542
       tcp_v4_rcv+0x1a03/0x1bf0 net/ipv4/tcp_ipv4.c:1923
       ip_protocol_deliver_rcu+0x51/0x470 net/ipv4/ip_input.c:204
       ip_local_deliver_finish+0x110/0x140 net/ipv4/ip_input.c:231
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip_local_deliver+0x133/0x210 net/ipv4/ip_input.c:252
       dst_input include/net/dst.h:442 [inline]
       ip_rcv_finish+0x121/0x160 net/ipv4/ip_input.c:413
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip_rcv+0x18f/0x1a0 net/ipv4/ip_input.c:523
       __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5004
       __netif_receive_skb+0x37/0xf0 net/core/dev.c:5118
       netif_receive_skb_internal+0x59/0x190 net/core/dev.c:5208
       napi_skb_finish net/core/dev.c:5671 [inline]
       napi_gro_receive+0x28f/0x330 net/core/dev.c:5704
       receive_buf+0x284/0x30b0 drivers/net/virtio_net.c:1061
      
      read to 0xffff888120425770 of 4 bytes by task 7254 on cpu 1:
       tcp_stream_is_readable net/ipv4/tcp.c:480 [inline]
       tcp_poll+0x204/0x6b0 net/ipv4/tcp.c:554
       sock_poll+0xed/0x250 net/socket.c:1256
       vfs_poll include/linux/poll.h:90 [inline]
       ep_item_poll.isra.0+0x90/0x190 fs/eventpoll.c:892
       ep_send_events_proc+0x113/0x5c0 fs/eventpoll.c:1749
       ep_scan_ready_list.constprop.0+0x189/0x500 fs/eventpoll.c:704
       ep_send_events fs/eventpoll.c:1793 [inline]
       ep_poll+0xe3/0x900 fs/eventpoll.c:1930
       do_epoll_wait+0x162/0x180 fs/eventpoll.c:2294
       __do_sys_epoll_pwait fs/eventpoll.c:2325 [inline]
       __se_sys_epoll_pwait fs/eventpoll.c:2311 [inline]
       __x64_sys_epoll_pwait+0xcd/0x170 fs/eventpoll.c:2311
       do_syscall_64+0xcf/0x2f0 arch/x86/entry/common.c:296
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 1 PID: 7254 Comm: syz-fuzzer Not tainted 5.3.0+ #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      dba7d9b8
    • E
      tcp: add rcu protection around tp->fastopen_rsk · d983ea6f
      Eric Dumazet 提交于
      Both tcp_v4_err() and tcp_v6_err() do the following operations
      while they do not own the socket lock :
      
      	fastopen = tp->fastopen_rsk;
       	snd_una = fastopen ? tcp_rsk(fastopen)->snt_isn : tp->snd_una;
      
      The problem is that without appropriate barrier, the compiler
      might reload tp->fastopen_rsk and trigger a NULL deref.
      
      request sockets are protected by RCU, we can simply add
      the missing annotations and barriers to solve the issue.
      
      Fixes: 168a8f58 ("tcp: TCP Fast Open Server - main code path")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d983ea6f
  10. 16 9月, 2019 1 次提交
    • T
      tcp: Add TCP_INFO counter for packets received out-of-order · f9af2dbb
      Thomas Higdon 提交于
      For receive-heavy cases on the server-side, we want to track the
      connection quality for individual client IPs. This counter, similar to
      the existing system-wide TCPOFOQueue counter in /proc/net/netstat,
      tracks out-of-order packet reception. By providing this counter in
      TCP_INFO, it will allow understanding to what degree receive-heavy
      sockets are experiencing out-of-order delivery and packet drops
      indicating congestion.
      
      Please note that this is similar to the counter in NetBSD TCP_INFO, and
      has the same name.
      
      Also note that we avoid increasing the size of the tcp_sock struct by
      taking advantage of a hole.
      Signed-off-by: NThomas Higdon <tph@fb.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f9af2dbb
  11. 12 9月, 2019 1 次提交
    • N
      tcp: fix tcp_ecn_withdraw_cwr() to clear TCP_ECN_QUEUE_CWR · af38d07e
      Neal Cardwell 提交于
      Fix tcp_ecn_withdraw_cwr() to clear the correct bit:
      TCP_ECN_QUEUE_CWR.
      
      Rationale: basically, TCP_ECN_DEMAND_CWR is a bit that is purely about
      the behavior of data receivers, and deciding whether to reflect
      incoming IP ECN CE marks as outgoing TCP th->ece marks. The
      TCP_ECN_QUEUE_CWR bit is purely about the behavior of data senders,
      and deciding whether to send CWR. The tcp_ecn_withdraw_cwr() function
      is only called from tcp_undo_cwnd_reduction() by data senders during
      an undo, so it should zero the sender-side state,
      TCP_ECN_QUEUE_CWR. It does not make sense to stop the reflection of
      incoming CE bits on incoming data packets just because outgoing
      packets were spuriously retransmitted.
      
      The bug has been reproduced with packetdrill to manifest in a scenario
      with RFC3168 ECN, with an incoming data packet with CE bit set and
      carrying a TCP timestamp value that causes cwnd undo. Before this fix,
      the IP CE bit was ignored and not reflected in the TCP ECE header bit,
      and sender sent a TCP CWR ('W') bit on the next outgoing data packet,
      even though the cwnd reduction had been undone.  After this fix, the
      sender properly reflects the CE bit and does not set the W bit.
      
      Note: the bug actually predates 2005 git history; this Fixes footer is
      chosen to be the oldest SHA1 I have tested (from Sep 2007) for which
      the patch applies cleanly (since before this commit the code was in a
      .h file).
      
      Fixes: bdf1ee5d ("[TCP]: Move code from tcp_ecn.h to tcp*.c and tcp.h & remove it")
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      af38d07e
  12. 31 7月, 2019 2 次提交
  13. 03 7月, 2019 1 次提交
  14. 16 6月, 2019 1 次提交
    • E
      tcp: limit payload size of sacked skbs · 3b4929f6
      Eric Dumazet 提交于
      Jonathan Looney reported that TCP can trigger the following crash
      in tcp_shifted_skb() :
      
      	BUG_ON(tcp_skb_pcount(skb) < pcount);
      
      This can happen if the remote peer has advertized the smallest
      MSS that linux TCP accepts : 48
      
      An skb can hold 17 fragments, and each fragment can hold 32KB
      on x86, or 64KB on PowerPC.
      
      This means that the 16bit witdh of TCP_SKB_CB(skb)->tcp_gso_segs
      can overflow.
      
      Note that tcp_sendmsg() builds skbs with less than 64KB
      of payload, so this problem needs SACK to be enabled.
      SACK blocks allow TCP to coalesce multiple skbs in the retransmit
      queue, thus filling the 17 fragments to maximal capacity.
      
      CVE-2019-11477 -- u16 overflow of TCP_SKB_CB(skb)->tcp_gso_segs
      
      Fixes: 832d11c5 ("tcp: Try to restore large SKBs while SACK processing")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NJonathan Looney <jtl@netflix.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Reviewed-by: NTyler Hicks <tyhicks@canonical.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Bruce Curtis <brucec@netflix.com>
      Cc: Jonathan Lemon <jonathan.lemon@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3b4929f6
  15. 15 6月, 2019 1 次提交
  16. 10 6月, 2019 1 次提交
    • Y
      tcp: fix undo spurious SYNACK in passive Fast Open · fcc2202a
      Yuchung Cheng 提交于
      Commit 794200d6 ("tcp: undo cwnd on Fast Open spurious SYNACK
      retransmit") may cause tcp_fastretrans_alert() to warn about pending
      retransmission in Open state. This is triggered when the Fast Open
      server both sends data and has spurious SYNACK retransmission during
      the handshake, and the data packets were lost or reordered.
      
      The root cause is a bit complicated:
      
      (1) Upon receiving SYN-data: a full socket is created with
          snd_una = ISN + 1 by tcp_create_openreq_child()
      
      (2) On SYNACK timeout the server/sender enters CA_Loss state.
      
      (3) Upon receiving the final ACK to complete the handshake, sender
          does not mark FLAG_SND_UNA_ADVANCED since (1)
      
          Sender then calls tcp_process_loss since state is CA_loss by (2)
      
      (4) tcp_process_loss() does not invoke undo operations but instead
          mark REXMIT_LOST to force retransmission
      
      (5) tcp_rcv_synrecv_state_fastopen() calls tcp_try_undo_loss(). It
          changes state to CA_Open but has positive tp->retrans_out
      
      (6) Next ACK triggers the WARN_ON in tcp_fastretrans_alert()
      
      The step that goes wrong is (4) where the undo operation should
      have been invoked because the ACK successfully acknowledged the
      SYN sequence. This fixes that by specifically checking undo
      when the SYN-ACK sequence is acknowledged. Then after
      tcp_process_loss() the state would be further adjusted based
      in tcp_fastretrans_alert() to avoid triggering the warning in (6).
      
      Fixes: 794200d6 ("tcp: undo cwnd on Fast Open spurious SYNACK retransmit")
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fcc2202a
  17. 31 5月, 2019 1 次提交
    • Y
      ipv4: tcp_input: fix stack out of bounds when parsing TCP options. · 9609dad2
      Young Xiao 提交于
      The TCP option parsing routines in tcp_parse_options function could
      read one byte out of the buffer of the TCP options.
      
      1         while (length > 0) {
      2                 int opcode = *ptr++;
      3                 int opsize;
      4
      5                 switch (opcode) {
      6                 case TCPOPT_EOL:
      7                         return;
      8                 case TCPOPT_NOP:        /* Ref: RFC 793 section 3.1 */
      9                         length--;
      10                        continue;
      11                default:
      12                        opsize = *ptr++; //out of bound access
      
      If length = 1, then there is an access in line2.
      And another access is occurred in line 12.
      This would lead to out-of-bound access.
      
      Therefore, in the patch we check that the available data length is
      larger enough to pase both TCP option code and size.
      Signed-off-by: NYoung Xiao <92siuyang@gmail.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9609dad2
  18. 15 5月, 2019 1 次提交
  19. 10 5月, 2019 1 次提交
  20. 01 5月, 2019 7 次提交
  21. 17 4月, 2019 1 次提交
  22. 05 4月, 2019 1 次提交
    • T
      tcp: Accept ECT on SYN in the presence of RFC8311 · f6fee16d
      Tilmans, Olivier (Nokia - BE/Antwerp) 提交于
      Linux currently disable ECN for incoming connections when the SYN
      requests ECN and the IP header has ECT(0)/ECT(1) set, as some
      networks were reportedly mangling the ToS byte, hence could later
      trigger false congestion notifications.
      
      RFC8311 §4.3 relaxes RFC3168's requirements such that ECT can be set
      one TCP control packets (including SYNs). The main benefit of this
      is the decreased probability of losing a SYN in a congested
      ECN-capable network (i.e., it avoids the initial 1s timeout).
      Additionally, this allows the development of newer TCP extensions,
      such as AccECN.
      
      This patch relaxes the previous check, by enabling ECN on incoming
      connections using SYN+ECT if at least one bit of the reserved flags
      of the TCP header is set. Such bit would indicate that the sender of
      the SYN is using a newer TCP feature than what the host implements,
      such as AccECN, and is thus implementing RFC8311. This enables
      end-hosts not supporting such extensions to still negociate ECN, and
      to have some of the benefits of using ECN on control packets.
      Signed-off-by: NOlivier Tilmans <olivier.tilmans@nokia-bell-labs.com>
      Suggested-by: NBob Briscoe <research@bobbriscoe.net>
      Cc: Koen De Schepper <koen.de_schepper@nokia-bell-labs.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f6fee16d
  23. 20 3月, 2019 1 次提交
    • G
      tcp: free request sock directly upon TFO or syncookies error · 9403cf23
      Guillaume Nault 提交于
      Since the request socket is created locally, it'd make more sense to
      use reqsk_free() instead of reqsk_put() in TFO and syncookies' error
      path.
      
      However, tcp_get_cookie_sock() may set ->rsk_refcnt before freeing the
      socket; tcp_conn_request() may also have non-null ->rsk_refcnt because
      of tcp_try_fastopen(). In both cases 'req' hasn't been exposed
      to the outside world and is safe to free immediately, but that'd
      trigger the WARN_ON_ONCE in reqsk_free().
      
      Define __reqsk_free() for these situations where we know nobody's
      referencing the socket, even though ->rsk_refcnt might be non-null.
      Now we can consolidate the error path of tcp_get_cookie_sock() and
      tcp_conn_request().
      Signed-off-by: NGuillaume Nault <gnault@redhat.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9403cf23