1. 29 May, 2020 3 commits
  2. 15 May, 2020 1 commit
    • tcp: fix error recovery in tcp_zerocopy_receive() · e776af60
      Eric Dumazet committed
      If user provides wrong virtual address in TCP_ZEROCOPY_RECEIVE
      operation we want to return -EINVAL error.
      
      But depending on zc->recv_skip_hint content, we might return
      -EIO error if the socket has SOCK_DONE set.
      
      Make sure to return -EINVAL in this case.
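
The ordering of the two error checks can be sketched in plain C. This is a simplified userspace model, not the kernel code: `struct zc_model`, its fields, and `zc_receive_status()` are hypothetical names, and the sketch only covers the case where no bytes were mapped.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the inputs of the real function. */
struct zc_model {
    bool     address_in_tcp_mapping; /* user address lies in a mapping of this socket */
    bool     sock_done;              /* stands in for the SOCK_DONE socket flag */
    uint32_t recv_skip_hint;         /* bytes the caller should fetch with recvmsg() */
};

/*
 * Returns 0 or a negative errno for a call that mapped no bytes.
 * The address is validated first, so a bad address always yields -EINVAL.
 */
static int zc_receive_status(const struct zc_model *zc)
{
    if (!zc->address_in_tcp_mapping)
        return -EINVAL;   /* a bad address wins over every other condition */
    if (zc->recv_skip_hint == 0 && zc->sock_done)
        return -EIO;      /* nothing left to read on a finished socket */
    return 0;
}
```

With this ordering, a caller that passes a wrong address on a socket that already has SOCK_DONE set sees -EINVAL rather than -EIO.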
      
      BUG: KMSAN: uninit-value in tcp_zerocopy_receive net/ipv4/tcp.c:1833 [inline]
      BUG: KMSAN: uninit-value in do_tcp_getsockopt+0x4494/0x6320 net/ipv4/tcp.c:3685
      CPU: 1 PID: 625 Comm: syz-executor.0 Not tainted 5.7.0-rc4-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x1c9/0x220 lib/dump_stack.c:118
       kmsan_report+0xf7/0x1e0 mm/kmsan/kmsan_report.c:121
       __msan_warning+0x58/0xa0 mm/kmsan/kmsan_instr.c:215
       tcp_zerocopy_receive net/ipv4/tcp.c:1833 [inline]
       do_tcp_getsockopt+0x4494/0x6320 net/ipv4/tcp.c:3685
       tcp_getsockopt+0xf8/0x1f0 net/ipv4/tcp.c:3728
       sock_common_getsockopt+0x13f/0x180 net/core/sock.c:3131
       __sys_getsockopt+0x533/0x7b0 net/socket.c:2177
       __do_sys_getsockopt net/socket.c:2192 [inline]
       __se_sys_getsockopt+0xe1/0x100 net/socket.c:2189
       __x64_sys_getsockopt+0x62/0x80 net/socket.c:2189
       do_syscall_64+0xb8/0x160 arch/x86/entry/common.c:297
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      RIP: 0033:0x45c829
      Code: 0d b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 db b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00
      RSP: 002b:00007f1deeb72c78 EFLAGS: 00000246 ORIG_RAX: 0000000000000037
      RAX: ffffffffffffffda RBX: 00000000004e01e0 RCX: 000000000045c829
      RDX: 0000000000000023 RSI: 0000000000000006 RDI: 0000000000000009
      RBP: 000000000078bf00 R08: 0000000020000200 R09: 0000000000000000
      R10: 00000000200001c0 R11: 0000000000000246 R12: 00000000ffffffff
      R13: 00000000000001d8 R14: 00000000004d3038 R15: 00007f1deeb736d4
      
      Local variable ----zc@do_tcp_getsockopt created at:
       do_tcp_getsockopt+0x1a74/0x6320 net/ipv4/tcp.c:3670
       do_tcp_getsockopt+0x1a74/0x6320 net/ipv4/tcp.c:3670
      
      Fixes: 05255b82 ("tcp: add TCP_ZEROCOPY_RECEIVE support for zerocopy receive")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 13 May, 2020 1 commit
    • tcp: fix SO_RCVLOWAT hangs with fat skbs · 24adbc16
      Eric Dumazet committed
      We autotune rcvbuf in tcp_set_rcvlowat() whenever SO_RCVLOWAT is set,
      to account for 100% overhead.
      
      This works well when the skb->len/skb->truesize ratio is bigger than 0.5.
      
      But if we receive packets with small MSS, we can end up in a situation
      where not enough bytes are available in the receive queue to satisfy
      RCVLOWAT setting.
      As our sk_rcvbuf limit is hit, we send zero windows in ACK packets,
      preventing remote peer from sending more data.
      
      Even autotuning does not help, because it only triggers at the time
      user process drains the queue. If no EPOLLIN is generated, this
      can not happen.
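
The arithmetic behind the stall can be sketched as follows. This is a rough model under stated assumptions, not the kernel's memory accounting: the receive buffer is taken to be exactly twice the SO_RCVLOWAT value, every skb is assumed to have the same length and truesize, and the function names and the sample sizes in the usage note are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Payload bytes that fit before the memory budget (rcvbuf) is exhausted. */
static uint64_t payload_capacity(uint64_t rcvbuf, uint32_t skb_len,
                                 uint32_t skb_truesize)
{
    assert(skb_truesize > 0 && skb_len <= skb_truesize);
    /* Only whole skbs count; each one costs truesize bytes of budget. */
    return (rcvbuf / skb_truesize) * skb_len;
}

/* Can the queue ever hold lowat payload bytes with rcvbuf = 2 * lowat? */
static bool rcvlowat_reachable(uint32_t lowat, uint32_t skb_len,
                               uint32_t skb_truesize)
{
    uint64_t rcvbuf = 2 * (uint64_t)lowat;  /* the 100% overhead allowance */

    return payload_capacity(rcvbuf, skb_len, skb_truesize) >= lowat;
}
```

For example, with a low-water mark of 65536 bytes, skbs carrying 1448 payload bytes in 2304 bytes of truesize (ratio about 0.63) can satisfy the mark, while skbs carrying 100 payload bytes in 768 bytes of truesize (ratio about 0.13) cannot: the buffer limit is reached first and the receiver advertises a zero window.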
      
      Note poll() has a similar issue, after commit
      c7004482 ("tcp: Respect SO_RCVLOWAT in tcp_poll().")
      
      Fixes: 03f45c88 ("tcp: avoid extra wakeups for SO_RCVLOWAT users")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 09 May, 2020 1 commit
  5. 02 May, 2020 1 commit
    • net: Replace the limit of TCP_LINGER2 with TCP_FIN_TIMEOUT_MAX · f0628c52
      Cambda Zhu committed
      This patch changes the behavior of TCP_LINGER2 about its limit. The
      sysctl_tcp_fin_timeout used to be the limit of TCP_LINGER2 but now it's
      only the default value. A new macro named TCP_FIN_TIMEOUT_MAX is added
      as the limit of TCP_LINGER2, which is 2 minutes.
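
The new behavior can be sketched as a small clamp function working in seconds. This is an illustrative model, not the kernel code: `FIN_TIMEOUT_MAX_SECS` and `fin_wait2_timeout_secs()` are hypothetical names, a value of 0 is assumed to mean "option not set, use the sysctl default", and negative values are left out of the model.

```c
#include <assert.h>

/* The 2 minute cap described above, expressed in seconds. */
enum { FIN_TIMEOUT_MAX_SECS = 120 };

/*
 * Effective FIN-WAIT-2 timeout for a socket.
 * linger2_secs: value requested through TCP_LINGER2 (0 = not set, assumed).
 * sysctl_fin_timeout_secs: system-wide default.
 */
static int fin_wait2_timeout_secs(int linger2_secs, int sysctl_fin_timeout_secs)
{
    assert(linger2_secs >= 0);  /* negative values are not modeled here */

    if (linger2_secs == 0)
        return sysctl_fin_timeout_secs;   /* the sysctl is only the default */
    if (linger2_secs > FIN_TIMEOUT_MAX_SECS)
        return FIN_TIMEOUT_MAX_SECS;      /* the fixed cap, not the sysctl */
    return linger2_secs;
}
```

With a 60 second sysctl default, a request for 90 seconds is now honored, whereas a request for 300 seconds is capped at 120.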
      
      Since TCP_LINGER2 used sysctl_tcp_fin_timeout as both the default value
      and the limit in the past, the system administrator could not set a
      default value for most sockets while letting some sockets have a greater
      timeout. Making the sysctl the limit of TCP_LINGER2 may have been a
      mistake. We could add a new sysctl to set the maximum of TCP_LINGER2,
      but the FIN-WAIT-2 timeout usually does not need to be very long, and
      2 minutes is permitted by the TCP specs.
      
      Changes in v3:
      - Remove the new socket option and change the TCP_LINGER2 behavior so
        that the timeout can be set to value between sysctl_tcp_fin_timeout
        and 2 minutes.
      
      Changes in v2:
      - Add int overflow check for the new socket option.
      
      Changes in v1:
      - Add a new socket option to set timeout greater than
        sysctl_tcp_fin_timeout.
      Signed-off-by: Cambda Zhu <cambda@linux.alibaba.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. 24 March, 2020 1 commit
    • tcp: repair: fix TCP_QUEUE_SEQ implementation · 6cd6cbf5
      Eric Dumazet committed
      When an application uses the TCP_QUEUE_SEQ socket option to
      change tp->rcv_nxt, we must also update tp->copied_seq.
      
      Otherwise, stuff relying on tcp_inq() being precise can
      eventually be confused.
      
      For example, tcp_zerocopy_receive() might crash because
      it does not expect tcp_recv_skb() to return NULL.
      
      We could add tests in various places to fix the issue,
      or simply make sure tcp_inq() won't return a random value,
      and leave the fast path as it is.
      
      Note that this fixes ioctl(fd, SIOCINQ, &val) at the same
      time.
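
Why both sequence numbers must move together can be sketched with a two-field model. This is a simplification, not the kernel code: `struct seq_model`, `inq_bytes()`, and the two setter functions are hypothetical names, and the readable byte count is assumed to be the unsigned difference of the two fields.

```c
#include <stdint.h>

/* Hypothetical model of the two receive-side sequence numbers. */
struct seq_model {
    uint32_t rcv_nxt;     /* next sequence number expected from the peer */
    uint32_t copied_seq;  /* first sequence number not yet read by the user */
};

/* Bytes the user could read right now (arithmetic is modulo 2^32). */
static uint32_t inq_bytes(const struct seq_model *tp)
{
    return tp->rcv_nxt - tp->copied_seq;
}

/* Behavior before the fix: only rcv_nxt moves. */
static void set_queue_seq_old(struct seq_model *tp, uint32_t seq)
{
    tp->rcv_nxt = seq;
}

/* Behavior after the fix: both fields move, so the queue reads as empty. */
static void set_queue_seq_fixed(struct seq_model *tp, uint32_t seq)
{
    tp->rcv_nxt = seq;
    tp->copied_seq = seq;
}
```

In this model the old setter makes `inq_bytes()` report data that was never received, which is the kind of value a zerocopy receive path or SIOCINQ would then act on.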
      
      Fixes: ee995283 ("tcp: Initial repair mode")
      Fixes: 05255b82 ("tcp: add TCP_ZEROCOPY_RECEIVE support for zerocopy receive")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  7. 13 March, 2020 1 commit
  8. 10 March, 2020 1 commit
  9. 27 February, 2020 1 commit
  10. 17 February, 2020 2 commits
    • tcp-zerocopy: Return sk_err (if set) along with tcp receive zerocopy. · 33946518
      Arjun Roy committed
      This patchset is intended to reduce the number of extra system calls
      imposed by TCP receive zerocopy. For ping-pong RPC style workloads,
      this patchset has demonstrated a system call reduction of about 30%
      when coupled with userspace changes.
      
      For applications using epoll, returning sk_err along with the result
      of tcp receive zerocopy could remove the need to call
      recvmsg()=-EAGAIN after a spurious wakeup.
      
      Consider a multi-threaded application using epoll. A thread may awaken
      with EPOLLIN but another thread may already be reading. The
      spuriously-awoken thread does not necessarily know that another thread
      'won'; rather, it may be possible that it was woken up due to the
      presence of an error if there is no data. A zerocopy read receiving 0
      bytes thus would need to be followed up by recvmsg to be sure.
      
      Instead, we return sk_err directly with zerocopy, so the application
      can avoid this extra system call.
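
The decision an application can make from a single zerocopy result can be sketched as below. This is an illustrative model only: `struct zc_outcome`, `enum next_step`, and `after_zerocopy_read()` are hypothetical names, and the two fields stand in for the byte count and the error value reported by the call.

```c
/* Hypothetical summary of one zerocopy receive call. */
struct zc_outcome {
    unsigned int bytes;  /* bytes delivered by the call */
    int          err;    /* pending socket error reported with the result, 0 if none */
};

enum next_step {
    STEP_PROCESS_DATA,    /* data arrived, handle it */
    STEP_HANDLE_ERROR,    /* the wakeup was caused by a socket error */
    STEP_WAIT_FOR_EVENT,  /* spurious wakeup, go back to epoll_wait() */
};

/* Decide what to do next without issuing a follow-up recvmsg(). */
static enum next_step after_zerocopy_read(const struct zc_outcome *r)
{
    if (r->bytes > 0)
        return STEP_PROCESS_DATA;
    if (r->err != 0)
        return STEP_HANDLE_ERROR;
    return STEP_WAIT_FOR_EVENT;  /* another thread probably consumed the data */
}
```

Without the error value in the result, the zero-byte case would be ambiguous and the thread would need an extra recvmsg() call to tell the last two outcomes apart.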
      Signed-off-by: Arjun Roy <arjunroy@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp-zerocopy: Return inq along with tcp receive zerocopy. · c8856c05
      Arjun Roy committed
      This patchset is intended to reduce the number of extra system calls
      imposed by TCP receive zerocopy. For ping-pong RPC style workloads,
      this patchset has demonstrated a system call reduction of about 30%
      when coupled with userspace changes.
      
      For applications using edge-triggered epoll, returning inq along with
      the result of tcp receive zerocopy could remove the need to call
      recvmsg()=-EAGAIN after a successful zerocopy. Generally speaking,
      since normally we would need to perform a recvmsg() call for every
      successful small RPC read via TCP receive zerocopy, returning inq can
      reduce the number of system calls performed by approximately half.
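
The "approximately half" estimate can be illustrated with a simulated socket that counts receive calls. Everything here is a hypothetical model, not a real socket API: `struct fake_socket` and `fake_receive()` stand in for the kernel side, and a return value of 0 stands in for a call that fails with EAGAIN.

```c
/* Simulated socket: bytes waiting in the queue and calls made so far. */
struct fake_socket {
    unsigned int pending;
    unsigned int calls;
};

/* One receive call: returns bytes delivered, reports what is left in *inq. */
static unsigned int fake_receive(struct fake_socket *s, unsigned int max,
                                 unsigned int *inq)
{
    unsigned int n = s->pending < max ? s->pending : max;

    s->calls++;
    s->pending -= n;
    *inq = s->pending;
    return n;
}

/* Edge-triggered style drain that must see an empty read before stopping. */
static unsigned int drain_without_inq(struct fake_socket *s, unsigned int max)
{
    unsigned int inq;

    while (fake_receive(s, max, &inq) > 0)
        ;
    return s->calls;
}

/* Drain that stops as soon as the reported inq shows an empty queue. */
static unsigned int drain_with_inq(struct fake_socket *s, unsigned int max)
{
    unsigned int inq;

    do {
        if (fake_receive(s, max, &inq) == 0)
            break;
    } while (inq > 0);
    return s->calls;
}
```

For a small RPC of 100 bytes, the first drain needs two calls (the read plus the empty one) while the second needs one, which matches the halving described above.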
      Signed-off-by: Arjun Roy <arjunroy@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  11. 01 February, 2020 4 commits
  12. 26 January, 2020 1 commit
  13. 24 January, 2020 2 commits
    • mptcp: Add MPTCP socket stubs · f870fa0b
      Mat Martineau committed
      Implements the infrastructure for MPTCP sockets.
      
      MPTCP sockets open one in-kernel TCP socket per subflow. These subflow
      sockets are only managed by the MPTCP socket that owns them and are not
      visible from userspace. This commit allows a userspace program to open
      an MPTCP socket with:
      
        sock = socket(AF_INET, SOCK_STREAM, IPPROTO_MPTCP);
      
      The resulting socket is simply a wrapper around a single regular TCP
      socket, without any of the MPTCP protocol implemented over the wire.
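
A userspace program may want to fall back to plain TCP when the running kernel has no MPTCP support. The following is a minimal sketch under that assumption: `open_stream_socket()` is a hypothetical helper, and the fallback definition of `IPPROTO_MPTCP` is only used when the system headers do not provide one.

```c
#include <netinet/in.h>
#include <sys/socket.h>

#ifndef IPPROTO_MPTCP
#define IPPROTO_MPTCP 262  /* protocol number used by Linux for MPTCP */
#endif

/*
 * Try to open an MPTCP socket and fall back to TCP if that fails.
 * Returns the descriptor (or -1) and reports through *is_mptcp which
 * protocol was attempted last.
 */
static int open_stream_socket(int *is_mptcp)
{
    int fd = socket(AF_INET, SOCK_STREAM, IPPROTO_MPTCP);

    if (fd >= 0) {
        *is_mptcp = 1;
        return fd;
    }

    /* Typically the kernel lacks MPTCP support; retry with regular TCP. */
    *is_mptcp = 0;
    return socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
}
```

The caller owns the returned descriptor and should close() it when done; the rest of the program can treat it like any other stream socket.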
      Co-developed-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Co-developed-by: Peter Krystad <peter.krystad@linux.intel.com>
      Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com>
      Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
      Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
      Co-developed-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: Christoph Paasch <cpaasch@apple.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tcp: do not leave dangling pointers in tp->highest_sack · 2bec445f
      Eric Dumazet committed
      Latest commit 85369750 ("tcp: Fix highest_sack and highest_sack_seq")
      apparently allowed syzbot to trigger various crashes in TCP stack [1]
      
      I believe this commit only made things easier for syzbot to find
      its way into triggering use-after-frees. But really the bugs
      could lead to bad TCP behavior or even plain crashes even for
      non malicious peers.
      
      I have audited all calls to tcp_rtx_queue_unlink() and
      tcp_rtx_queue_unlink_and_free() and made sure tp->highest_sack would be updated
      if we are removing from rtx queue the skb that tp->highest_sack points to.
      
      These updates were missing in three locations :
      
      1) tcp_clean_rtx_queue() [This one seems quite serious,
                                I have no idea why this was not caught earlier]
      
      2) tcp_rtx_queue_purge() [Probably not a big deal for normal operations]
      
      3) tcp_send_synack()     [Probably not a big deal for normal operations]
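
The invariant restored at these three locations can be sketched with a toy queue. This is an illustration only: the kernel keeps the retransmit queue in a different data structure, and `struct pkt`, `struct rtx_model`, and `rtx_unlink()` are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy packet and queue; a singly linked list is used for brevity. */
struct pkt {
    uint32_t    seq;
    struct pkt *next;
};

struct rtx_model {
    struct pkt *head;
    struct pkt *highest_sack;  /* must never point at an unlinked packet */
};

/* Unlink victim from the queue, repairing highest_sack before it can dangle. */
static void rtx_unlink(struct rtx_model *q, struct pkt *victim)
{
    struct pkt **link = &q->head;

    while (*link && *link != victim)
        link = &(*link)->next;
    if (!*link)
        return;  /* victim is not on this queue */

    /* The step that was missing: move the hint off the departing packet. */
    if (q->highest_sack == victim)
        q->highest_sack = victim->next;  /* may legitimately become NULL */

    *link = victim->next;
    victim->next = NULL;  /* the caller may now free victim safely */
}
```

If the hint were left untouched, freeing the unlinked packet would turn any later read through `highest_sack` into a use-after-free, which is what the report below shows.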
      
      [1]
      BUG: KASAN: use-after-free in tcp_highest_sack_seq include/net/tcp.h:1864 [inline]
      BUG: KASAN: use-after-free in tcp_highest_sack_seq include/net/tcp.h:1856 [inline]
      BUG: KASAN: use-after-free in tcp_check_sack_reordering+0x33c/0x3a0 net/ipv4/tcp_input.c:891
      Read of size 4 at addr ffff8880a488d068 by task ksoftirqd/1/16
      
      CPU: 1 PID: 16 Comm: ksoftirqd/1 Not tainted 5.5.0-rc5-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:77 [inline]
       dump_stack+0x197/0x210 lib/dump_stack.c:118
       print_address_description.constprop.0.cold+0xd4/0x30b mm/kasan/report.c:374
       __kasan_report.cold+0x1b/0x41 mm/kasan/report.c:506
       kasan_report+0x12/0x20 mm/kasan/common.c:639
       __asan_report_load4_noabort+0x14/0x20 mm/kasan/generic_report.c:134
       tcp_highest_sack_seq include/net/tcp.h:1864 [inline]
       tcp_highest_sack_seq include/net/tcp.h:1856 [inline]
       tcp_check_sack_reordering+0x33c/0x3a0 net/ipv4/tcp_input.c:891
       tcp_try_undo_partial net/ipv4/tcp_input.c:2730 [inline]
       tcp_fastretrans_alert+0xf74/0x23f0 net/ipv4/tcp_input.c:2847
       tcp_ack+0x2577/0x5bf0 net/ipv4/tcp_input.c:3710
       tcp_rcv_established+0x6dd/0x1e90 net/ipv4/tcp_input.c:5706
       tcp_v4_do_rcv+0x619/0x8d0 net/ipv4/tcp_ipv4.c:1619
       tcp_v4_rcv+0x307f/0x3b40 net/ipv4/tcp_ipv4.c:2001
       ip_protocol_deliver_rcu+0x5a/0x880 net/ipv4/ip_input.c:204
       ip_local_deliver_finish+0x23b/0x380 net/ipv4/ip_input.c:231
       NF_HOOK include/linux/netfilter.h:307 [inline]
       NF_HOOK include/linux/netfilter.h:301 [inline]
       ip_local_deliver+0x1e9/0x520 net/ipv4/ip_input.c:252
       dst_input include/net/dst.h:442 [inline]
       ip_rcv_finish+0x1db/0x2f0 net/ipv4/ip_input.c:428
       NF_HOOK include/linux/netfilter.h:307 [inline]
       NF_HOOK include/linux/netfilter.h:301 [inline]
       ip_rcv+0xe8/0x3f0 net/ipv4/ip_input.c:538
       __netif_receive_skb_one_core+0x113/0x1a0 net/core/dev.c:5148
       __netif_receive_skb+0x2c/0x1d0 net/core/dev.c:5262
       process_backlog+0x206/0x750 net/core/dev.c:6093
       napi_poll net/core/dev.c:6530 [inline]
       net_rx_action+0x508/0x1120 net/core/dev.c:6598
       __do_softirq+0x262/0x98c kernel/softirq.c:292
       run_ksoftirqd kernel/softirq.c:603 [inline]
       run_ksoftirqd+0x8e/0x110 kernel/softirq.c:595
       smpboot_thread_fn+0x6a3/0xa40 kernel/smpboot.c:165
       kthread+0x361/0x430 kernel/kthread.c:255
       ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352
      
      Allocated by task 10091:
       save_stack+0x23/0x90 mm/kasan/common.c:72
       set_track mm/kasan/common.c:80 [inline]
       __kasan_kmalloc mm/kasan/common.c:513 [inline]
       __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:486
       kasan_slab_alloc+0xf/0x20 mm/kasan/common.c:521
       slab_post_alloc_hook mm/slab.h:584 [inline]
       slab_alloc_node mm/slab.c:3263 [inline]
       kmem_cache_alloc_node+0x138/0x740 mm/slab.c:3575
       __alloc_skb+0xd5/0x5e0 net/core/skbuff.c:198
       alloc_skb_fclone include/linux/skbuff.h:1099 [inline]
       sk_stream_alloc_skb net/ipv4/tcp.c:875 [inline]
       sk_stream_alloc_skb+0x113/0xc90 net/ipv4/tcp.c:852
       tcp_sendmsg_locked+0xcf9/0x3470 net/ipv4/tcp.c:1282
       tcp_sendmsg+0x30/0x50 net/ipv4/tcp.c:1432
       inet_sendmsg+0x9e/0xe0 net/ipv4/af_inet.c:807
       sock_sendmsg_nosec net/socket.c:652 [inline]
       sock_sendmsg+0xd7/0x130 net/socket.c:672
       __sys_sendto+0x262/0x380 net/socket.c:1998
       __do_sys_sendto net/socket.c:2010 [inline]
       __se_sys_sendto net/socket.c:2006 [inline]
       __x64_sys_sendto+0xe1/0x1a0 net/socket.c:2006
       do_syscall_64+0xfa/0x790 arch/x86/entry/common.c:294
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Freed by task 10095:
       save_stack+0x23/0x90 mm/kasan/common.c:72
       set_track mm/kasan/common.c:80 [inline]
       kasan_set_free_info mm/kasan/common.c:335 [inline]
       __kasan_slab_free+0x102/0x150 mm/kasan/common.c:474
       kasan_slab_free+0xe/0x10 mm/kasan/common.c:483
       __cache_free mm/slab.c:3426 [inline]
       kmem_cache_free+0x86/0x320 mm/slab.c:3694
       kfree_skbmem+0x178/0x1c0 net/core/skbuff.c:645
       __kfree_skb+0x1e/0x30 net/core/skbuff.c:681
       sk_eat_skb include/net/sock.h:2453 [inline]
       tcp_recvmsg+0x1252/0x2930 net/ipv4/tcp.c:2166
       inet_recvmsg+0x136/0x610 net/ipv4/af_inet.c:838
       sock_recvmsg_nosec net/socket.c:886 [inline]
       sock_recvmsg net/socket.c:904 [inline]
       sock_recvmsg+0xce/0x110 net/socket.c:900
       __sys_recvfrom+0x1ff/0x350 net/socket.c:2055
       __do_sys_recvfrom net/socket.c:2073 [inline]
       __se_sys_recvfrom net/socket.c:2069 [inline]
       __x64_sys_recvfrom+0xe1/0x1a0 net/socket.c:2069
       do_syscall_64+0xfa/0x790 arch/x86/entry/common.c:294
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      The buggy address belongs to the object at ffff8880a488d040
       which belongs to the cache skbuff_fclone_cache of size 456
      The buggy address is located 40 bytes inside of
       456-byte region [ffff8880a488d040, ffff8880a488d208)
      The buggy address belongs to the page:
      page:ffffea0002922340 refcount:1 mapcount:0 mapping:ffff88821b057000 index:0x0
      raw: 00fffe0000000200 ffffea00022a5788 ffffea0002624a48 ffff88821b057000
      raw: 0000000000000000 ffff8880a488d040 0000000100000006 0000000000000000
      page dumped because: kasan: bad access detected
      
      Memory state around the buggy address:
       ffff8880a488cf00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
       ffff8880a488cf80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
      >ffff8880a488d000: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
                                                                ^
       ffff8880a488d080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
       ffff8880a488d100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      
      Fixes: 85369750 ("tcp: Fix highest_sack and highest_sack_seq")
      Fixes: 50895b9d ("tcp: highest_sack fix")
      Fixes: 737ff314 ("tcp: use sequence distance to detect reordering")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Cambda Zhu <cambda@linux.alibaba.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  14. 21 January, 2020 1 commit
  15. 10 January, 2020 1 commit
  16. 16 December, 2019 1 commit
  17. 14 December, 2019 1 commit
  18. 11 December, 2019 1 commit
  19. 10 December, 2019 1 commit
  20. 15 November, 2019 1 commit
    • y2038: socket: use __kernel_old_timespec instead of timespec · df1b4ba9
      Arnd Bergmann committed
      The 'timespec' type definition and helpers like ktime_to_timespec()
      or timespec64_to_timespec() should no longer be used in the kernel so
      we can remove them and avoid introducing y2038 issues in new code.
      
      Change the socket code that needs to pass a timespec to user space for
      backward compatibility to use __kernel_old_timespec instead.  This type
      has the same layout but with a clearer defined name.
      
      Slightly reformat tcp_recv_timestamp() for consistency after the removal
      of timespec64_to_timespec().
      Acked-by: Deepa Dinamani <deepa.kernel@gmail.com>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
  21. 07 November, 2019 4 commits
    • tcp: fix data-race in tcp_recvmsg() · a5a7daa5
      Eric Dumazet committed
      Reading tp->recvmsg_inq after socket lock is released
      raises a KCSAN warning [1]
      
      Replace has_tss & has_cmsg by cmsg_flags and make
      sure to not read tp->recvmsg_inq a second time.
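
The idea of taking one snapshot and deciding from it later can be sketched as follows. This is a simplified model, not the kernel code: `struct tp_model`, the `FLAG_*` constants, and both helper functions are hypothetical names.

```c
#include <stdbool.h>

/* Hypothetical bit values for the combined flags word. */
enum {
    FLAG_INQ = 1,  /* user asked for the in-queue byte count */
    FLAG_TSS = 2,  /* a receive timestamp was collected */
};

/* Stand-in for the socket field that may change once the lock is dropped. */
struct tp_model {
    bool recvmsg_inq;
};

/* Called while the socket lock is held: read the shared field exactly once. */
static unsigned int collect_cmsg_flags(const struct tp_model *tp,
                                       bool has_timestamp)
{
    unsigned int flags = tp->recvmsg_inq ? FLAG_INQ : 0;

    if (has_timestamp)
        flags |= FLAG_TSS;
    return flags;
}

/* Called after the lock is released: consult only the local snapshot. */
static bool wants_inq_cmsg(unsigned int flags)
{
    return (flags & FLAG_INQ) != 0;
}
```

Because the later decision depends only on the local word, a concurrent change to the shared field after the lock is released can no longer affect it.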
      
      [1]
      BUG: KCSAN: data-race in tcp_chrono_stop / tcp_recvmsg
      
      write to 0xffff888126adef24 of 2 bytes by interrupt on cpu 0:
       tcp_chrono_set net/ipv4/tcp_output.c:2309 [inline]
       tcp_chrono_stop+0x14c/0x280 net/ipv4/tcp_output.c:2338
       tcp_clean_rtx_queue net/ipv4/tcp_input.c:3165 [inline]
       tcp_ack+0x274f/0x3170 net/ipv4/tcp_input.c:3688
       tcp_rcv_established+0x37e/0xf50 net/ipv4/tcp_input.c:5696
       tcp_v4_do_rcv+0x381/0x4e0 net/ipv4/tcp_ipv4.c:1561
       tcp_v4_rcv+0x19dc/0x1bb0 net/ipv4/tcp_ipv4.c:1942
       ip_protocol_deliver_rcu+0x4d/0x420 net/ipv4/ip_input.c:204
       ip_local_deliver_finish+0x110/0x140 net/ipv4/ip_input.c:231
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip_local_deliver+0x133/0x210 net/ipv4/ip_input.c:252
       dst_input include/net/dst.h:442 [inline]
       ip_rcv_finish+0x121/0x160 net/ipv4/ip_input.c:413
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip_rcv+0x18f/0x1a0 net/ipv4/ip_input.c:523
       __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5010
       __netif_receive_skb+0x37/0xf0 net/core/dev.c:5124
       netif_receive_skb_internal+0x59/0x190 net/core/dev.c:5214
       napi_skb_finish net/core/dev.c:5677 [inline]
       napi_gro_receive+0x28f/0x330 net/core/dev.c:5710
      
      read to 0xffff888126adef25 of 1 bytes by task 7275 on cpu 1:
       tcp_recvmsg+0x77b/0x1a30 net/ipv4/tcp.c:2187
       inet_recvmsg+0xbb/0x250 net/ipv4/af_inet.c:838
       sock_recvmsg_nosec net/socket.c:871 [inline]
       sock_recvmsg net/socket.c:889 [inline]
       sock_recvmsg+0x92/0xb0 net/socket.c:885
       sock_read_iter+0x15f/0x1e0 net/socket.c:967
       call_read_iter include/linux/fs.h:1889 [inline]
       new_sync_read+0x389/0x4f0 fs/read_write.c:414
       __vfs_read+0xb1/0xc0 fs/read_write.c:427
       vfs_read fs/read_write.c:461 [inline]
       vfs_read+0x143/0x2c0 fs/read_write.c:446
       ksys_read+0xd5/0x1b0 fs/read_write.c:587
       __do_sys_read fs/read_write.c:597 [inline]
       __se_sys_read fs/read_write.c:595 [inline]
       __x64_sys_read+0x4c/0x60 fs/read_write.c:595
       do_syscall_64+0xcc/0x370 arch/x86/entry/common.c:290
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 1 PID: 7275 Comm: sshd Not tainted 5.4.0-rc3+ #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      
      Fixes: b75eba76 ("tcp: send in-queue bytes in cmsg upon read")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: silence data-races on sk_backlog.tail · 9ed498c6
      Eric Dumazet committed
      sk->sk_backlog.tail might be read without holding the socket spinlock,
      we need to add proper READ_ONCE()/WRITE_ONCE() to silence the warnings.
      
      KCSAN reported :
      
      BUG: KCSAN: data-race in tcp_add_backlog / tcp_recvmsg
      
      write to 0xffff8881265109f8 of 8 bytes by interrupt on cpu 1:
       __sk_add_backlog include/net/sock.h:907 [inline]
       sk_add_backlog include/net/sock.h:938 [inline]
       tcp_add_backlog+0x476/0xce0 net/ipv4/tcp_ipv4.c:1759
       tcp_v4_rcv+0x1a70/0x1bd0 net/ipv4/tcp_ipv4.c:1947
       ip_protocol_deliver_rcu+0x4d/0x420 net/ipv4/ip_input.c:204
       ip_local_deliver_finish+0x110/0x140 net/ipv4/ip_input.c:231
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip_local_deliver+0x133/0x210 net/ipv4/ip_input.c:252
       dst_input include/net/dst.h:442 [inline]
       ip_rcv_finish+0x121/0x160 net/ipv4/ip_input.c:413
       NF_HOOK include/linux/netfilter.h:305 [inline]
       NF_HOOK include/linux/netfilter.h:299 [inline]
       ip_rcv+0x18f/0x1a0 net/ipv4/ip_input.c:523
       __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:4929
       __netif_receive_skb+0x37/0xf0 net/core/dev.c:5043
       netif_receive_skb_internal+0x59/0x190 net/core/dev.c:5133
       napi_skb_finish net/core/dev.c:5596 [inline]
       napi_gro_receive+0x28f/0x330 net/core/dev.c:5629
       receive_buf+0x284/0x30b0 drivers/net/virtio_net.c:1061
       virtnet_receive drivers/net/virtio_net.c:1323 [inline]
       virtnet_poll+0x436/0x7d0 drivers/net/virtio_net.c:1428
       napi_poll net/core/dev.c:6311 [inline]
       net_rx_action+0x3ae/0xa90 net/core/dev.c:6379
       __do_softirq+0x115/0x33f kernel/softirq.c:292
       invoke_softirq kernel/softirq.c:373 [inline]
       irq_exit+0xbb/0xe0 kernel/softirq.c:413
       exiting_irq arch/x86/include/asm/apic.h:536 [inline]
       do_IRQ+0xa6/0x180 arch/x86/kernel/irq.c:263
       ret_from_intr+0x0/0x19
       native_safe_halt+0xe/0x10 arch/x86/kernel/paravirt.c:71
       arch_cpu_idle+0x1f/0x30 arch/x86/kernel/process.c:571
       default_idle_call+0x1e/0x40 kernel/sched/idle.c:94
       cpuidle_idle_call kernel/sched/idle.c:154 [inline]
       do_idle+0x1af/0x280 kernel/sched/idle.c:263
       cpu_startup_entry+0x1b/0x20 kernel/sched/idle.c:355
       start_secondary+0x208/0x260 arch/x86/kernel/smpboot.c:264
       secondary_startup_64+0xa4/0xb0 arch/x86/kernel/head_64.S:241
      
      read to 0xffff8881265109f8 of 8 bytes by task 8057 on cpu 0:
       tcp_recvmsg+0x46e/0x1b40 net/ipv4/tcp.c:2050
       inet_recvmsg+0xbb/0x250 net/ipv4/af_inet.c:838
       sock_recvmsg_nosec net/socket.c:871 [inline]
       sock_recvmsg net/socket.c:889 [inline]
       sock_recvmsg+0x92/0xb0 net/socket.c:885
       sock_read_iter+0x15f/0x1e0 net/socket.c:967
       call_read_iter include/linux/fs.h:1889 [inline]
       new_sync_read+0x389/0x4f0 fs/read_write.c:414
       __vfs_read+0xb1/0xc0 fs/read_write.c:427
       vfs_read fs/read_write.c:461 [inline]
       vfs_read+0x143/0x2c0 fs/read_write.c:446
       ksys_read+0xd5/0x1b0 fs/read_write.c:587
       __do_sys_read fs/read_write.c:597 [inline]
       __se_sys_read fs/read_write.c:595 [inline]
       __x64_sys_read+0x4c/0x60 fs/read_write.c:595
       do_syscall_64+0xcc/0x370 arch/x86/entry/common.c:290
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 0 PID: 8057 Comm: syz-fuzzer Not tainted 5.4.0-rc6+ #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: annotate lockless accesses to sk->sk_max_ack_backlog · 099ecf59
      Eric Dumazet committed
      sk->sk_max_ack_backlog can be read without any lock being held
      at least in TCP/DCCP cases.
      
      We need to use READ_ONCE()/WRITE_ONCE() to avoid load/store tearing
      and/or potential KCSAN warnings.
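
The annotation pattern can be approximated in userspace with volatile accesses. This is a simplified sketch, not the kernel macros: the two macro definitions below are reduced versions that rely on the GCC/Clang `__typeof__` extension, and `struct listener` with its helpers is a hypothetical stand-in for the socket fields.

```c
#include <stdbool.h>

/* Reduced userspace versions: force exactly one load or store per use. */
#define READ_ONCE(x)     (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, v) (*(volatile __typeof__(x) *)&(x) = (v))

/* Hypothetical stand-in for the two backlog counters of a listener. */
struct listener {
    unsigned int ack_backlog;      /* connections waiting to be accepted */
    unsigned int max_ack_backlog;  /* limit configured through listen() */
};

/* Lockless reader: each field is loaded once, without tearing. */
static bool acceptq_is_full(const struct listener *sk)
{
    return READ_ONCE(sk->ack_backlog) > READ_ONCE(sk->max_ack_backlog);
}

/* Writer side: the matching annotated store. */
static void acceptq_added(struct listener *sk)
{
    WRITE_ONCE(sk->ack_backlog, sk->ack_backlog + 1);
}
```

The annotations do not add locking or ordering; they only tell the compiler (and tools such as KCSAN) that the concurrent access is intentional and must not be split or repeated.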
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: annotate lockless accesses to sk->sk_ack_backlog · 288efe86
      Eric Dumazet committed
      sk->sk_ack_backlog can be read without any lock being held.
      We need to use READ_ONCE()/WRITE_ONCE() to avoid load/store tearing
      and/or potential KCSAN warnings.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  22. 29 October, 2019 2 commits
  23. 26 October, 2019 1 commit
    • tcp: add TCP_INFO status for failed client TFO · 48027478
      Jason Baron committed
      The TCPI_OPT_SYN_DATA bit as part of tcpi_options currently reports whether
      or not data-in-SYN was ack'd on both the client and server side. We'd like
      to gather more information on the client-side in the failure case in order
      to indicate the reason for the failure. This can be useful for not only
      debugging TFO, but also for creating TFO socket policies. For example, if
      a middle box removes the TFO option or drops a data-in-SYN, we can
      detect this case and turn off TFO for these connections, saving the
      extra retransmits.
      
      The newly added tcpi_fastopen_client_fail status is 2 bits and has the
      following 4 states:
      
      1) TFO_STATUS_UNSPEC
      
      Catch-all state which includes when TFO is disabled via black hole
      detection, which is indicated via LINUX_MIB_TCPFASTOPENBLACKHOLE.
      
      2) TFO_COOKIE_UNAVAILABLE
      
      If TFO_CLIENT_NO_COOKIE mode is off, this state indicates that no cookie
      is available in the cache.
      
      3) TFO_DATA_NOT_ACKED
      
      Data was sent with SYN, we received a SYN/ACK but it did not cover the data
      portion. Cookie is not accepted by server because the cookie may be invalid
      or the server may be overloaded.
      
      4) TFO_SYN_RETRANSMITTED
      
      Data was sent with SYN, we received a SYN/ACK which did not cover the data
      after at least 1 additional SYN was sent (without data). It may be the case
      that a middle-box is dropping data-in-SYN packets. Thus, it would be more
      efficient to not use TFO on this connection to avoid extra retransmits
      during connection establishment.
      
      These new fields do not cover all the cases where TFO may fail, but other
      failures, such as SYN/ACK + data being dropped, will result in the
      connection not becoming established. And a connection blackhole after
      session establishment shows up as a stalled connection.
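
A client-side policy built on the four states might look like the sketch below. It is illustrative only: the enum is redeclared locally under the hypothetical name `enum tfo_client_fail_model` with the state names from the text, and both helper functions are assumptions rather than existing APIs.

```c
#include <stdbool.h>

/* The four states listed above, in order, so that they fit in 2 bits. */
enum tfo_client_fail_model {
    TFO_STATUS_UNSPEC,       /* catch-all */
    TFO_COOKIE_UNAVAILABLE,  /* no cookie in the cache */
    TFO_DATA_NOT_ACKED,      /* SYN/ACK did not cover the SYN data */
    TFO_SYN_RETRANSMITTED,   /* same, after at least one extra SYN */
};

/* Extract the status from a raw 2-bit field value. */
static enum tfo_client_fail_model tfo_status_from_bits(unsigned int bits)
{
    return (enum tfo_client_fail_model)(bits & 0x3);
}

/*
 * Example policy: stop using TFO toward a destination only when the
 * failure pattern suggests a middle box dropping data-in-SYN packets.
 */
static bool should_disable_tfo(enum tfo_client_fail_model status)
{
    return status == TFO_SYN_RETRANSMITTED;
}
```

An application would obtain the raw value from the TCP_INFO data of a connected socket and could cache the decision per destination; whether TFO_DATA_NOT_ACKED should also disable TFO is a policy choice left open here.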
      Signed-off-by: Jason Baron <jbaron@akamai.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Christoph Paasch <cpaasch@apple.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  24. 16 October, 2019 1 commit
  25. 14 October, 2019 5 commits