1. 10 May 2023 (1 commit)
    • tcp: add annotations around sk->sk_shutdown accesses · e14cadfd
      Authored by Eric Dumazet
      Now that sk->sk_shutdown is no longer a bitfield, we can add
      standard READ_ONCE()/WRITE_ONCE() annotations to silence
      KCSAN reports like the following:
      
      BUG: KCSAN: data-race in tcp_disconnect / tcp_poll
      
      write to 0xffff88814588582c of 1 bytes by task 3404 on cpu 1:
      tcp_disconnect+0x4d6/0xdb0 net/ipv4/tcp.c:3121
      __inet_stream_connect+0x5dd/0x6e0 net/ipv4/af_inet.c:715
      inet_stream_connect+0x48/0x70 net/ipv4/af_inet.c:727
      __sys_connect_file net/socket.c:2001 [inline]
      __sys_connect+0x19b/0x1b0 net/socket.c:2018
      __do_sys_connect net/socket.c:2028 [inline]
      __se_sys_connect net/socket.c:2025 [inline]
      __x64_sys_connect+0x41/0x50 net/socket.c:2025
      do_syscall_x64 arch/x86/entry/common.c:50 [inline]
      do_syscall_64+0x41/0xc0 arch/x86/entry/common.c:80
      entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      read to 0xffff88814588582c of 1 bytes by task 3374 on cpu 0:
      tcp_poll+0x2e6/0x7d0 net/ipv4/tcp.c:562
      sock_poll+0x253/0x270 net/socket.c:1383
      vfs_poll include/linux/poll.h:88 [inline]
      io_poll_check_events io_uring/poll.c:281 [inline]
      io_poll_task_func+0x15a/0x820 io_uring/poll.c:333
      handle_tw_list io_uring/io_uring.c:1184 [inline]
      tctx_task_work+0x1fe/0x4d0 io_uring/io_uring.c:1246
      task_work_run+0x123/0x160 kernel/task_work.c:179
      get_signal+0xe64/0xff0 kernel/signal.c:2635
      arch_do_signal_or_restart+0x89/0x2a0 arch/x86/kernel/signal.c:306
      exit_to_user_mode_loop+0x6f/0xe0 kernel/entry/common.c:168
      exit_to_user_mode_prepare+0x6c/0xb0 kernel/entry/common.c:204
      __syscall_exit_to_user_mode_work kernel/entry/common.c:286 [inline]
      syscall_exit_to_user_mode+0x26/0x140 kernel/entry/common.c:297
      do_syscall_64+0x4d/0xc0 arch/x86/entry/common.c:86
      entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      value changed: 0x03 -> 0x00
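
      A minimal sketch of the annotation pattern this change applies
      (illustrative only, not the exact kernel diff):

        /* Writer, called with the socket lock held (e.g. from tcp_disconnect()): */
        static void example_clear_shutdown(struct sock *sk)
        {
                WRITE_ONCE(sk->sk_shutdown, 0);
        }

        /* Lockless reader (e.g. tcp_poll() running from io_uring task work): */
        static bool example_rcv_shutdown(const struct sock *sk)
        {
                return READ_ONCE(sk->sk_shutdown) & RCV_SHUTDOWN;
        }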
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 14 April 2023 (1 commit)
    • net: Ensure ->msg_control_user is used for user buffers · c39ef213
      Authored by Kevin Brodsky
      Since commit 1f466e1f ("net: cleanly handle kernel vs user
      buffers for ->msg_control"), pointers to user buffers should be
      stored in struct msghdr::msg_control_user, instead of the
      msg_control field.  Most users of msg_control have already been
      converted (where user buffers are involved), but not all of them.
      
      This patch attempts to address the remaining cases. An exception is
      made for null checks, as it should be safe to use msg_control
      unconditionally for that purpose.
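
      A minimal sketch of the intended pattern, assuming the
      msg_control/msg_control_user union and the msg_control_is_user flag
      introduced by 1f466e1f (simplified; the helper name is made up):

        static void example_set_control(struct msghdr *msg, void __user *uptr,
                                        void *kptr, bool from_user)
        {
                if (from_user) {
                        msg->msg_control_is_user = true;
                        msg->msg_control_user = uptr;   /* user-space buffer */
                } else {
                        msg->msg_control_is_user = false;
                        msg->msg_control = kptr;        /* kernel buffer */
                }
                /* A bare NULL check may still use msg->msg_control, since
                 * the union members alias each other. */
        }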
      
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Kevin Brodsky <kevin.brodsky@arm.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 18 March 2023 (1 commit)
    • tcp: preserve const qualifier in tcp_sk() · e9d9da91
      Authored by Eric Dumazet
      We can change tcp_sk() to propagate its argument const qualifier,
      thanks to container_of_const().
      
      We have two places where a const sock pointer has to be upgraded
      to a writable one. We have been using the const qualifier for lockless
      listeners to clearly identify points where writes could happen.
      
      Add tcp_sk_rw() helper to better document these.
      
      tcp_inbound_md5_hash(), __tcp_grow_window(), tcp_reset_check()
      and tcp_rack_reo_wnd() get an additional const qualifier
      for their @tp local variables.
      
      smc_check_reset_syn_req() also needs a similar change.
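
      A sketch of the resulting pattern (tcp_sk() below follows the
      container_of_const() approach described above; the example caller is
      hypothetical):

        #define tcp_sk(ptr) container_of_const(ptr, struct tcp_sock, \
                                               inet_conn.icsk_inet.sk)

        /* Explicit upgrade point where a write is intended: */
        static inline struct tcp_sock *tcp_sk_rw(struct sock *sk)
        {
                return container_of(sk, struct tcp_sock, inet_conn.icsk_inet.sk);
        }

        /* A read-only helper can now keep const end to end: */
        static u32 example_read_cwnd(const struct sock *sk)
        {
                const struct tcp_sock *tp = tcp_sk(sk);

                return tp->snd_cwnd;
        }
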
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Simon Horman <simon.horman@corigine.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 17 March 2023 (1 commit)
  5. 10 February 2023 (1 commit)
  6. 20 January 2023 (1 commit)
  7. 19 January 2023 (1 commit)
  8. 02 December 2022 (1 commit)
    • net/tcp: Disable TCP-MD5 static key on tcp_md5sig_info destruction · 459837b5
      Authored by Dmitry Safonov
      To do that, we separate two scenarios:
      - where it's the first MD5 key on the system, which means that enabling
        of the static key may need to sleep;
      - copying of an existing key from a listening socket to the request
        socket upon receiving a signed TCP segment, where static key was
        already enabled (when the key was added to the listening socket).
      
      Now the lifetime of the static branch for TCP-MD5 lasts until:
      - the last tcp_md5sig_info is destroyed, and
      - the last socket in time-wait state with an MD5 key is closed.
      
      This means that after all sockets with TCP-MD5 keys are gone, the
      system gets back the performance of the disabled md5-key static branch.
      
      While at it, provide a static_key_fast_inc() helper that does the ref
      counter increment in an atomic fashion (without grabbing cpus_read_lock()
      on CONFIG_JUMP_LABEL=y). This is needed to add a new user for
      a static_key when the caller controls the lifetime of another user.
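
      A rough sketch of how the static branch guards the MD5 lookup on the
      fast path (the deferred-key form and the helper below are assumptions
      of this sketch, not a quote of the patch):

        DEFINE_STATIC_KEY_DEFERRED_FALSE(tcp_md5_needed, HZ);

        static struct tcp_md5sig_key *
        example_md5_lookup(const struct sock *sk, const union tcp_md5_addr *addr,
                           int family)
        {
                /* Patched out entirely while no MD5 key exists on the system. */
                if (!static_branch_unlikely(&tcp_md5_needed.key))
                        return NULL;
                return tcp_md5_do_lookup(sk, addr, family);
        }
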
      Signed-off-by: Dmitry Safonov <dima@arista.com>
      Acked-by: Jakub Kicinski <kuba@kernel.org>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  9. 26 November 2022 (1 commit)
    • use less confusing names for iov_iter direction initializers · de4eda9d
      Authored by Al Viro
      READ/WRITE proved to be actively confusing - the meanings are
      "data destination, as used with read(2)" and "data source, as
      used with write(2)", but people keep interpreting those as
      "we read data from it" and "we write data to it", i.e. exactly
      the wrong way.
      
      Call them ITER_DEST and ITER_SOURCE - at least that is harder
      to misinterpret...
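
      A minimal sketch of the renamed initializers (same iov_iter_init()
      call, clearer direction constants; the wrapper function is hypothetical):

        static void example_iters(void *buf, size_t len)
        {
                struct iovec iov = { .iov_base = buf, .iov_len = len };
                struct iov_iter to, from;

                /* read(2)-style path: the iterator is the data destination. */
                iov_iter_init(&to, ITER_DEST, &iov, 1, len);

                /* write(2)-style path: the iterator is the data source. */
                iov_iter_init(&from, ITER_SOURCE, &iov, 1, len);
        }
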
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  10. 23 November 2022 (1 commit)
    • dccp/tcp: Fixup bhash2 bucket when connect() fails. · e0833d1f
      Authored by Kuniyuki Iwashima
      If a socket bound to a wildcard address fails to connect(), we
      only reset saddr and keep the port.  Then, we have to fix up the
      bhash2 bucket; otherwise, the bucket has an inconsistent address
      in the list.
      
      Also, listen() for such a socket will fire the WARN_ON() in
      inet_csk_get_port(). [0]
      
      Note that when a system runs out of memory, we give up fixing the
      bucket and unlink sk from bhash and bhash2 by inet_put_port().
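
      A hypothetical user-space sketch of the problematic sequence (port and
      destination chosen arbitrarily; the failing connect() is what used to
      leave a stale bhash2 entry behind):

        int fd = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in any = { .sin_family = AF_INET,
                                   .sin_port = htons(20002) }; /* INADDR_ANY */
        struct sockaddr_in dst = { .sin_family = AF_INET,
                                   .sin_port = htons(1) };

        inet_pton(AF_INET, "127.0.0.1", &dst.sin_addr);
        bind(fd, (struct sockaddr *)&any, sizeof(any));    /* wildcard bind */
        connect(fd, (struct sockaddr *)&dst, sizeof(dst)); /* expected to fail */
        listen(fd, 1);  /* previously could hit the WARN_ON() in inet_csk_get_port() */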
      
      [0]:
      WARNING: CPU: 0 PID: 207 at net/ipv4/inet_connection_sock.c:548 inet_csk_get_port (net/ipv4/inet_connection_sock.c:548 (discriminator 1))
      Modules linked in:
      CPU: 0 PID: 207 Comm: bhash2_prev_rep Not tainted 6.1.0-rc3-00799-gc8421681c845 #63
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-1.amzn2022.0.1 04/01/2014
      RIP: 0010:inet_csk_get_port (net/ipv4/inet_connection_sock.c:548 (discriminator 1))
      Code: 74 a7 eb 93 48 8b 54 24 18 0f b7 cb 4c 89 e6 4c 89 ff e8 48 b2 ff ff 49 8b 87 18 04 00 00 e9 32 ff ff ff 0f 0b e9 34 ff ff ff <0f> 0b e9 42 ff ff ff 41 8b 7f 50 41 8b 4f 54 89 fe 81 f6 00 00 ff
      RSP: 0018:ffffc900003d7e50 EFLAGS: 00010202
      RAX: ffff8881047fb500 RBX: 0000000000004e20 RCX: 0000000000000000
      RDX: 000000000000000a RSI: 00000000fffffe00 RDI: 00000000ffffffff
      RBP: ffffffff8324dc00 R08: 0000000000000001 R09: 0000000000000001
      R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000
      R13: 0000000000000001 R14: 0000000000004e20 R15: ffff8881054e1280
      FS:  00007f8ac04dc740(0000) GS:ffff88842fc00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000020001540 CR3: 00000001055fa003 CR4: 0000000000770ef0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      PKRU: 55555554
      Call Trace:
       <TASK>
       inet_csk_listen_start (net/ipv4/inet_connection_sock.c:1205)
       inet_listen (net/ipv4/af_inet.c:228)
       __sys_listen (net/socket.c:1810)
       __x64_sys_listen (net/socket.c:1819 net/socket.c:1817 net/socket.c:1817)
       do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
       entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120)
      RIP: 0033:0x7f8ac051de5d
      Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 93 af 1b 00 f7 d8 64 89 01 48
      RSP: 002b:00007ffc1c177248 EFLAGS: 00000206 ORIG_RAX: 0000000000000032
      RAX: ffffffffffffffda RBX: 0000000020001550 RCX: 00007f8ac051de5d
      RDX: ffffffffffffff80 RSI: 0000000000000000 RDI: 0000000000000004
      RBP: 00007ffc1c177270 R08: 0000000000000018 R09: 0000000000000007
      R10: 0000000020001540 R11: 0000000000000206 R12: 00007ffc1c177388
      R13: 0000000000401169 R14: 0000000000403e18 R15: 00007f8ac0723000
       </TASK>
      
      Fixes: 28044fc1 ("net: Add a bhash2 table hashed by port and address")
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Reported-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Acked-by: Joanne Koong <joannelkoong@gmail.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  11. 07 November 2022 (1 commit)
    • tcp: prohibit TCP_REPAIR_OPTIONS if data was already sent · 0c175da7
      Authored by Lu Wei
      If setsockopt() with the option name TCP_REPAIR_OPTIONS and an opt_code
      of TCPOPT_SACK_PERM is called to enable SACK after data has been sent
      and dupacks have been received, it triggers a warning in
      tcp_verify_left_out() as follows:
      
      ============================================
      WARNING: CPU: 8 PID: 0 at net/ipv4/tcp_input.c:2132
      tcp_timeout_mark_lost+0x154/0x160
      tcp_enter_loss+0x2b/0x290
      tcp_retransmit_timer+0x50b/0x640
      tcp_write_timer_handler+0x1c8/0x340
      tcp_write_timer+0xe5/0x140
      call_timer_fn+0x3a/0x1b0
      __run_timers.part.0+0x1bf/0x2d0
      run_timer_softirq+0x43/0xb0
      __do_softirq+0xfd/0x373
      __irq_exit_rcu+0xf6/0x140
      
      The warning is caused by the following steps:
      1. a socket named socketA is created
      2. socketA enters repair mode without building a connection
      3. socketA calls connect() and its state is changed to TCP_ESTABLISHED
         directly
      4. socketA leaves repair mode
      5. socketA calls sendmsg() to send data; packets_out and sacked_out
         (dupacks received) increase
      6. socketA enters repair mode again
      7. socketA calls setsockopt with TCPOPT_SACK_PERM to enable SACK
      8. the retransmit timer expires and calls tcp_timeout_mark_lost();
         lost_out increases
      9. sacked_out + lost_out > packets_out triggers since lost_out and
         sacked_out increase repeatedly
      
      In tcp_timeout_mark_lost(), tp->sacked_out would be cleared if
      step 7 did not happen, and the warning would not be triggered.
      As suggested by Denis and Eric, TCP_REPAIR_OPTIONS should be
      prohibited if data was already sent.
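
      A hedged user-space sketch of the sequence that is now rejected
      (TCPOPT_SACK_PERM is kernel-internal, so it is redefined locally here
      purely for illustration):

        #define TCPOPT_SACK_PERM 4      /* redefined for this sketch */

        int on = 1;
        struct tcp_repair_opt opt = { .opt_code = TCPOPT_SACK_PERM, .opt_val = 0 };

        setsockopt(fd, IPPROTO_TCP, TCP_REPAIR, &on, sizeof(on));
        /* ... leave repair mode, send data, receive dupacks ... */
        setsockopt(fd, IPPROTO_TCP, TCP_REPAIR, &on, sizeof(on));
        /* With this fix, the call below fails once data has already been sent: */
        setsockopt(fd, IPPROTO_TCP, TCP_REPAIR_OPTIONS, &opt, sizeof(opt));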
      
      The socket-tcp tests in CRIU have been run as follows:
      $ sudo ./test/zdtm.py run -t zdtm/static/socket-tcp*  --keep-going \
             --ignore-taint
      
      socket-tcp* represents all socket-tcp tests in test/zdtm/static/.
      
      Fixes: b139ba4e ("tcp: Repair connection-time negotiated parameters")
      Signed-off-by: Lu Wei <luwei32@huawei.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  12. 28 October 2022 (2 commits)
  13. 22 October 2022 (1 commit)
  14. 13 October 2022 (1 commit)
    • tcp: Fix data races around icsk->icsk_af_ops. · f49cd2f4
      Authored by Kuniyuki Iwashima
      setsockopt(IPV6_ADDRFORM) and tcp_v6_connect() change icsk->icsk_af_ops
      under lock_sock(), but tcp_(get|set)sockopt() read it locklessly.  To
      avoid load/store tearing, we need to add READ_ONCE() and WRITE_ONCE()
      for the reads and writes.
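
      A minimal sketch of the annotation pattern (illustrative fragment; the
      writer runs under lock_sock(), the reader does not):

        /* Writer, e.g. IPV6_ADDRFORM handling or tcp_v6_connect(): */
        WRITE_ONCE(icsk->icsk_af_ops, &ipv6_mapped);

        /* Lockless reader, e.g. tcp_setsockopt() for non-SOL_TCP levels: */
        err = READ_ONCE(icsk->icsk_af_ops)->setsockopt(sk, level, optname,
                                                       optval, optlen);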
      
      Thanks to Eric Dumazet for providing the syzbot report:
      
      BUG: KCSAN: data-race in tcp_setsockopt / tcp_v6_connect
      
      write to 0xffff88813c624518 of 8 bytes by task 23936 on cpu 0:
      tcp_v6_connect+0x5b3/0xce0 net/ipv6/tcp_ipv6.c:240
      __inet_stream_connect+0x159/0x6d0 net/ipv4/af_inet.c:660
      inet_stream_connect+0x44/0x70 net/ipv4/af_inet.c:724
      __sys_connect_file net/socket.c:1976 [inline]
      __sys_connect+0x197/0x1b0 net/socket.c:1993
      __do_sys_connect net/socket.c:2003 [inline]
      __se_sys_connect net/socket.c:2000 [inline]
      __x64_sys_connect+0x3d/0x50 net/socket.c:2000
      do_syscall_x64 arch/x86/entry/common.c:50 [inline]
      do_syscall_64+0x2b/0x70 arch/x86/entry/common.c:80
      entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      read to 0xffff88813c624518 of 8 bytes by task 23937 on cpu 1:
      tcp_setsockopt+0x147/0x1c80 net/ipv4/tcp.c:3789
      sock_common_setsockopt+0x5d/0x70 net/core/sock.c:3585
      __sys_setsockopt+0x212/0x2b0 net/socket.c:2252
      __do_sys_setsockopt net/socket.c:2263 [inline]
      __se_sys_setsockopt net/socket.c:2260 [inline]
      __x64_sys_setsockopt+0x62/0x70 net/socket.c:2260
      do_syscall_x64 arch/x86/entry/common.c:50 [inline]
      do_syscall_64+0x2b/0x70 arch/x86/entry/common.c:80
      entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      value changed: 0xffffffff8539af68 -> 0xffffffff8539aff8
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 1 PID: 23937 Comm: syz-executor.5 Not tainted
      6.0.0-rc4-syzkaller-00331-g4ed9c1e9-dirty #0
      
      Hardware name: Google Google Compute Engine/Google Compute Engine,
      BIOS Google 08/26/2022
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Reported-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  15. 30 September 2022 (1 commit)
    • tcp: fix tcp_cwnd_validate() to not forget is_cwnd_limited · f4ce91ce
      Authored by Neal Cardwell
      This commit fixes a bug in the tracking of max_packets_out and
      is_cwnd_limited. This bug can cause the connection to fail to remember
      that is_cwnd_limited is true, causing the connection to fail to grow
      cwnd when it should, causing throughput to be lower than it should be.
      
      The following event sequence is an example that triggers the bug:
      
       (a) The connection is cwnd_limited, but packets_out is not at its
           peak due to TSO deferral deciding not to send another skb yet.
           In such cases the connection can advance max_packets_seq and set
           tp->is_cwnd_limited to true and max_packets_out to a small
           number.
      
      (b) Then later in the round trip the connection is pacing-limited (not
           cwnd-limited), and packets_out is larger. In such cases the
           connection would raise max_packets_out to a bigger number but
           (unexpectedly) flip tp->is_cwnd_limited from true to false.
      
      This commit fixes that bug.
      
      One straightforward fix would be to separately track (a) the next
      window after max_packets_out reaches a maximum, and (b) the next
      window after tp->is_cwnd_limited is set to true. But this would
      require consuming an extra u32 sequence number.
      
      Instead, to save space we track only the most important
      information. Specifically, we track the strongest available signal of
      the degree to which the cwnd is fully utilized:
      
      (1) If the connection is cwnd-limited then we remember that fact for
      the current window.
      
      (2) If the connection is not cwnd-limited then we track the maximum
      number of outstanding packets in the current window.
      
      In particular, note that the new logic cannot trigger the buggy
      (a)/(b) sequence above because with the new logic a condition where
      tp->packets_out > tp->max_packets_out can only trigger an update of
      tp->is_cwnd_limited if tp->is_cwnd_limited is false.
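
      A sketch of the updated condition in tcp_cwnd_validate() implementing
      (1) and (2) above (simplified; cwnd_usage_seq here stands for the
      per-window sequence snapshot and is an assumption of this sketch):

        if (!before(tp->snd_una, tp->cwnd_usage_seq) ||
            is_cwnd_limited ||
            (!tp->is_cwnd_limited &&
             tp->packets_out > tp->max_packets_out)) {
                tp->is_cwnd_limited = is_cwnd_limited;
                tp->max_packets_out = tp->packets_out;
                tp->cwnd_usage_seq = tp->snd_nxt;
        }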
      
      This first showed up in a testing of a BBRv2 dev branch, but this
      buggy behavior highlighted a general issue with the
      tcp_cwnd_validate() logic that can cause cwnd to fail to increase at
      the proper rate for any TCP congestion control, including Reno or
      CUBIC.
      
      Fixes: ca8a2263 ("tcp: make cwnd-limited checks measurement-based, and gentler")
      Signed-off-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: Kevin(Yudong) Yang <yyd@google.com>
      Signed-off-by: Yuchung Cheng <ycheng@google.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  16. 29 September 2022 (2 commits)
  17. 21 September 2022 (1 commit)
    • tcp: Introduce optional per-netns ehash. · d1e5e640
      Authored by Kuniyuki Iwashima
      The more sockets we have in the hash table, the longer we spend looking
      up the socket.  While running a number of small workloads on the same
      host, they penalise each other and cause performance degradation.
      
      The root cause might be a single workload that consumes much more
      resources than the others.  It often happens on a cloud service where
      different workloads share the same computing resource.
      
      On EC2 c5.24xlarge instance (196 GiB memory and 524288 (1Mi / 2) ehash
      entries), after running iperf3 in different netns, creating 24Mi sockets
      without data transfer in the root netns causes about 10% performance
      regression for the iperf3's connection.
      
       thash_entries		sockets		length		Gbps
      	524288		      1		     1		50.7
      			   24Mi		    48		45.1
      
      It is basically related to the length of the list in each hash bucket.
      For testing purposes, to see how performance drops along with the length,
      I set thash_entries to 131072 (1Mi / 8), and here is the result.
      
       thash_entries		sockets		length		Gbps
              131072		      1		     1		50.7
      			    1Mi		     8		49.9
      			    2Mi		    16		48.9
      			    4Mi		    32		47.3
      			    8Mi		    64		44.6
      			   16Mi		   128		40.6
      			   24Mi		   192		36.3
      			   32Mi		   256		32.5
      			   40Mi		   320		27.0
      			   48Mi		   384		25.0
      
      To resolve the socket lookup degradation, we introduce an optional
      per-netns hash table for TCP, but it's just ehash, and we still share
      the global bhash, bhash2 and lhash2.
      
      With a smaller ehash, we can look up non-listener sockets faster and
      isolate such noisy neighbours.  In addition, we can reduce lock contention.
      
      We can control the ehash size by a new sysctl knob.  However, depending
      on workloads, it will require very sensitive tuning, so we disable the
      feature by default (net.ipv4.tcp_child_ehash_entries == 0).  Moreover,
      we can fall back to using the global ehash in case we fail to allocate
      enough memory for a new ehash.  The maximum size is 16Mi, which is large
      enough that even if we have 48Mi sockets, the average list length is 3,
      and regression would be less than 1%.
      
      We can check the current ehash size by another read-only sysctl knob,
      net.ipv4.tcp_ehash_entries.  A negative value means the netns shares
      the global ehash (per-netns ehash is disabled or failed to allocate
      memory).
      
        # dmesg | cut -d ' ' -f 5- | grep "established hash"
        TCP established hash table entries: 524288 (order: 10, 4194304 bytes, vmalloc hugepage)
      
        # sysctl net.ipv4.tcp_ehash_entries
        net.ipv4.tcp_ehash_entries = 524288  # can be changed by thash_entries
      
        # sysctl net.ipv4.tcp_child_ehash_entries
        net.ipv4.tcp_child_ehash_entries = 0  # disabled by default
      
        # ip netns add test1
        # ip netns exec test1 sysctl net.ipv4.tcp_ehash_entries
        net.ipv4.tcp_ehash_entries = -524288  # share the global ehash
      
        # sysctl -w net.ipv4.tcp_child_ehash_entries=100
        net.ipv4.tcp_child_ehash_entries = 100
      
        # ip netns add test2
        # ip netns exec test2 sysctl net.ipv4.tcp_ehash_entries
        net.ipv4.tcp_ehash_entries = 128  # own a per-netns ehash with 2^n buckets
      
      When more than two processes in the same netns create per-netns ehash
      concurrently with different sizes, we need to guarantee the size in
      one of the following ways:
      
        1) Share the global ehash and create per-netns ehash
      
        First, unshare() with tcp_child_ehash_entries==0.  It creates dedicated
        netns sysctl knobs where we can safely change tcp_child_ehash_entries
        and clone()/unshare() to create a per-netns ehash.
      
        2) Control write on sysctl by BPF
      
        We can use BPF_PROG_TYPE_CGROUP_SYSCTL to allow/deny read/write on
        sysctl knobs.
      
      Note that the global ehash allocated at the boot time is spread over
      available NUMA nodes, but inet_pernet_hashinfo_alloc() will allocate
      pages for each per-netns ehash depending on the current process's NUMA
      policy.  By default, the allocation is done in the local node only, so
      the per-netns hash table could fully reside on a random node.  Thus,
      depending on the NUMA policy the netns is created with and the CPU the
      current thread is running on, we could see some performance differences
      for highly optimised networking applications.
      
      Note also that the default values of two sysctl knobs depend on the ehash
      size and should be tuned carefully:
      
        tcp_max_tw_buckets  : tcp_child_ehash_entries / 2
        tcp_max_syn_backlog : max(128, tcp_child_ehash_entries / 128)
      
      As a bonus, we can dismantle netns faster.  Currently, while destroying
      netns, we call inet_twsk_purge(), which walks through the global ehash.
      It can be potentially big because it can have many sockets other than
      TIME_WAIT in all netns.  Splitting ehash changes that situation, where
      it's only necessary for inet_twsk_purge() to clean up TIME_WAIT sockets
      in each netns.
      
      With regard to this, we do not free the per-netns ehash in inet_twsk_kill()
      to avoid UAF while iterating the per-netns ehash in inet_twsk_purge().
      Instead, we do it in tcp_sk_exit_batch() after calling tcp_twsk_purge() to
      keep it protocol-family-independent.
      
      In the future, we could optimise ehash lookup/iteration further by removing
      netns comparison for the per-netns ehash.
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  18. 20 September 2022 (1 commit)
  19. 16 September 2022 (1 commit)
  20. 03 September 2022 (3 commits)
  21. 02 September 2022 (1 commit)
    • tcp: TX zerocopy should not sense pfmemalloc status · 32614006
      Authored by Eric Dumazet
      We got a recent syzbot report [1] showing a possible misuse
      of pfmemalloc page status in TCP zerocopy paths.
      
      Indeed, for pages coming from user space or other layers,
      using page_is_pfmemalloc() is moot, and could give
      false positives.
      
      There have been attempts to make page_is_pfmemalloc() more robust,
      but not using it in the first place in this context is probably better,
      and it also saves cpu cycles.
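
      A sketch of the resulting pattern (illustrative, not the exact diff;
      the _noacc helper is the prerequisite mentioned below):

        /* Before: skb_fill_page_desc() also propagates the page's pfmemalloc
         * status into skb->pfmemalloc. */
        skb_fill_page_desc(skb, i, page, offset, copy);

        /* After: for user-supplied pages, fill the fragment without
         * sensing pfmemalloc at all. */
        __skb_fill_page_desc_noacc(skb_shinfo(skb), i, page, offset, copy);
        skb_shinfo(skb)->nr_frags = i + 1;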
      
      Note to stable teams :
      
      You need to backport 84ce071e ("net: introduce
      __skb_fill_page_desc_noacc") as a prereq.
      
      The race is more probable after commit c07aea3e
      ("mm: add a signature in struct page") because page_is_pfmemalloc()
      now uses the low-order bit of page->lru.next, which can change
      more often than page->index.
      
      The low-order bit should never be set for lru.next (when used as an
      anchor in an LRU list), so the KCSAN report is mostly a false positive.
      
      Backporting to older kernel versions does not seem necessary.
      
      [1]
      BUG: KCSAN: data-race in lru_add_fn / tcp_build_frag
      
      write to 0xffffea0004a1d2c8 of 8 bytes by task 18600 on cpu 0:
      __list_add include/linux/list.h:73 [inline]
      list_add include/linux/list.h:88 [inline]
      lruvec_add_folio include/linux/mm_inline.h:105 [inline]
      lru_add_fn+0x440/0x520 mm/swap.c:228
      folio_batch_move_lru+0x1e1/0x2a0 mm/swap.c:246
      folio_batch_add_and_move mm/swap.c:263 [inline]
      folio_add_lru+0xf1/0x140 mm/swap.c:490
      filemap_add_folio+0xf8/0x150 mm/filemap.c:948
      __filemap_get_folio+0x510/0x6d0 mm/filemap.c:1981
      pagecache_get_page+0x26/0x190 mm/folio-compat.c:104
      grab_cache_page_write_begin+0x2a/0x30 mm/folio-compat.c:116
      ext4_da_write_begin+0x2dd/0x5f0 fs/ext4/inode.c:2988
      generic_perform_write+0x1d4/0x3f0 mm/filemap.c:3738
      ext4_buffered_write_iter+0x235/0x3e0 fs/ext4/file.c:270
      ext4_file_write_iter+0x2e3/0x1210
      call_write_iter include/linux/fs.h:2187 [inline]
      new_sync_write fs/read_write.c:491 [inline]
      vfs_write+0x468/0x760 fs/read_write.c:578
      ksys_write+0xe8/0x1a0 fs/read_write.c:631
      __do_sys_write fs/read_write.c:643 [inline]
      __se_sys_write fs/read_write.c:640 [inline]
      __x64_sys_write+0x3e/0x50 fs/read_write.c:640
      do_syscall_x64 arch/x86/entry/common.c:50 [inline]
      do_syscall_64+0x2b/0x70 arch/x86/entry/common.c:80
      entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      read to 0xffffea0004a1d2c8 of 8 bytes by task 18611 on cpu 1:
      page_is_pfmemalloc include/linux/mm.h:1740 [inline]
      __skb_fill_page_desc include/linux/skbuff.h:2422 [inline]
      skb_fill_page_desc include/linux/skbuff.h:2443 [inline]
      tcp_build_frag+0x613/0xb20 net/ipv4/tcp.c:1018
      do_tcp_sendpages+0x3e8/0xaf0 net/ipv4/tcp.c:1075
      tcp_sendpage_locked net/ipv4/tcp.c:1140 [inline]
      tcp_sendpage+0x89/0xb0 net/ipv4/tcp.c:1150
      inet_sendpage+0x7f/0xc0 net/ipv4/af_inet.c:833
      kernel_sendpage+0x184/0x300 net/socket.c:3561
      sock_sendpage+0x5a/0x70 net/socket.c:1054
      pipe_to_sendpage+0x128/0x160 fs/splice.c:361
      splice_from_pipe_feed fs/splice.c:415 [inline]
      __splice_from_pipe+0x222/0x4d0 fs/splice.c:559
      splice_from_pipe fs/splice.c:594 [inline]
      generic_splice_sendpage+0x89/0xc0 fs/splice.c:743
      do_splice_from fs/splice.c:764 [inline]
      direct_splice_actor+0x80/0xa0 fs/splice.c:931
      splice_direct_to_actor+0x305/0x620 fs/splice.c:886
      do_splice_direct+0xfb/0x180 fs/splice.c:974
      do_sendfile+0x3bf/0x910 fs/read_write.c:1249
      __do_sys_sendfile64 fs/read_write.c:1317 [inline]
      __se_sys_sendfile64 fs/read_write.c:1303 [inline]
      __x64_sys_sendfile64+0x10c/0x150 fs/read_write.c:1303
      do_syscall_x64 arch/x86/entry/common.c:50 [inline]
      do_syscall_64+0x2b/0x70 arch/x86/entry/common.c:80
      entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      value changed: 0x0000000000000000 -> 0xffffea0004a1d288
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 1 PID: 18611 Comm: syz-executor.4 Not tainted 6.0.0-rc2-syzkaller-00248-ge022620b-dirty #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/22/2022
      
      Fixes: c07aea3e ("mm: add a signature in struct page")
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  22. 01 September 2022 (1 commit)
  23. 25 August 2022 (1 commit)
    • net: Add a bhash2 table hashed by port and address · 28044fc1
      Authored by Joanne Koong
      The current bind hashtable (bhash) is hashed by port only.
      In the socket bind path, we have to check for bind conflicts by
      traversing the specified port's inet_bind_bucket while holding the
      hashbucket's spinlock (see inet_csk_get_port() and
      inet_csk_bind_conflict()). In instances where there are tons of
      sockets hashed to the same port at different addresses, the bind
      conflict check is time-intensive and can cause softirq cpu lockups,
      as well as stall new TCP connections, since __inet_inherit_port()
      also contends for the spinlock.
      
      This patch adds a second bind table, bhash2, that hashes by
      port and sk->sk_rcv_saddr (ipv4) and sk->sk_v6_rcv_saddr (ipv6).
      Searching the bhash2 table leads to significantly faster conflict
      resolution and less time holding the hashbucket spinlock.
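
      A rough sketch of the kind of bucket bhash2 keys on (illustrative and
      simplified, not the exact kernel struct):

        struct example_bind2_bucket {
                possible_net_t          net;
                unsigned short          port;
                union {
                        struct in6_addr v6_rcv_saddr;
                        __be32          rcv_saddr;
                };
                struct hlist_node       node;   /* chained in a bhash2 bucket */
                struct hlist_head       owners; /* sockets bound to (addr, port) */
        };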
      
      Please note a few things:
      * There can be the case where a socket's address changes after it
      has been bound. There are two cases where this happens:
      
        1) The case where there is a bind() call on INADDR_ANY (ipv4) or
        IPV6_ADDR_ANY (ipv6) and then a connect() call. The kernel will
        assign the socket an address when it handles the connect()
      
        2) In inet_sk_reselect_saddr(), which is called when rebuilding the
        sk header and a few pre-conditions are met (eg rerouting fails).
      
      In these two cases, we need to update the bhash2 table by removing the
      entry for the old address, and add a new entry reflecting the updated
      address.
      
      * The bhash2 table must have its own lock, even though concurrent
      accesses on the same port are protected by the bhash lock. Bhash2 must
      have its own lock to protect against cases where sockets on different
      ports hash to different bhash hashbuckets but to the same bhash2
      hashbucket.
      
      This brings up a few stipulations:
        1) When acquiring both the bhash and the bhash2 lock, the bhash2 lock
        will always be acquired after the bhash lock and released before the
        bhash lock is released.
      
        2) There are no nested bhash2 hashbucket locks. A bhash2 lock is always
        acquired+released before another bhash2 lock is acquired+released.
      
      * The bhash table cannot be superseded by the bhash2 table because for
      bind requests on INADDR_ANY (ipv4) or IPV6_ADDR_ANY (ipv6), every socket
      bound to that port must be checked for a potential conflict. The bhash
      table is the only source of port->socket associations.
      Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  24. 24 August 2022 (2 commits)
  25. 19 August 2022 (6 commits)
  26. 28 July 2022 (1 commit)
  27. 27 July 2022 (1 commit)
    • tcp: allow tls to decrypt directly from the tcp rcv queue · 3f92a64e
      Authored by Jakub Kicinski
      Expose the TCP rx queue accessor and cleanup helper, so that TLS can
      decrypt directly from the TCP queue. The expectation
      is that the caller can access the skb returned from
      tcp_recv_skb() and up to inq bytes worth of data (some
      of which may be in ->next skbs) and then call
      tcp_read_done() when data has been consumed.
      The socket lock must be held continuously across
      those two operations.
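
      A sketch of the intended calling pattern for a TLS-style consumer
      (illustrative; error handling omitted and variable names assumed):

        u32 offset;
        size_t consumed = 0;
        struct sk_buff *skb;

        lock_sock(sk);
        skb = tcp_recv_skb(sk, tcp_sk(sk)->copied_seq, &offset);
        /* ... decrypt from skb (and ->next skbs), up to inq bytes,
         * accumulating the number of bytes handled in 'consumed' ... */
        tcp_read_done(sk, consumed);    /* releases the consumed data */
        release_sock(sk);
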
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  28. 25 July 2022 (1 commit)
  29. 22 July 2022 (1 commit)
  30. 20 July 2022 (1 commit)