1. 29 Sep 2022, 1 commit
  2. 21 Sep 2022, 1 commit
    • tcp: Introduce optional per-netns ehash. · d1e5e640
      Authored by Kuniyuki Iwashima
      The more sockets we have in the hash table, the longer we spend looking
      up a socket.  When a number of small workloads run on the same host,
      they penalise each other and cause performance degradation.
      
      The root cause might be a single workload that consumes much more
      resources than the others.  It often happens on a cloud service where
      different workloads share the same computing resource.
      
      On an EC2 c5.24xlarge instance (196 GiB memory and 524288 (1Mi / 2) ehash
      entries), after running iperf3 in a different netns, creating 24Mi sockets
      without data transfer in the root netns causes about a 10% performance
      regression for iperf3's connection.
      
      thash_entries   sockets    length    Gbps
             524288         1         1    50.7
                         24Mi        48    45.1
      
      The regression is basically governed by the length of each hash bucket's
      list.  To see how performance drops as the length grows, I set
      thash_entries to 131072 (1Mi / 8), and here is the result.
      
      thash_entries   sockets    length    Gbps
             131072         1         1    50.7
                          1Mi         8    49.9
                          2Mi        16    48.9
                          4Mi        32    47.3
                          8Mi        64    44.6
                         16Mi       128    40.6
                         24Mi       192    36.3
                         32Mi       256    32.5
                         40Mi       320    27.0
                         48Mi       384    25.0
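
      (The "length" column is simply sockets / thash_entries, e.g. 1Mi / 131072
      = 8 and 48Mi / 131072 = 384, so the average chain length, and with it the
      lookup cost, grows linearly with the number of sockets.)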
      
      To resolve the socket lookup degradation, we introduce an optional
      per-netns hash table for TCP.  It covers only the ehash; the global
      bhash, bhash2 and lhash2 are still shared.
      
      With a smaller ehash, we can look up non-listener sockets faster and
      isolate such noisy neighbours.  In addition, we can reduce lock contention.
      
      We can control the ehash size with a new sysctl knob.  However, depending
      on the workload, it requires very sensitive tuning, so we disable the
      feature by default (net.ipv4.tcp_child_ehash_entries == 0).  If we fail
      to allocate enough memory for a new ehash, we fall back to using the
      global ehash.  The maximum size is 16Mi, which is large enough that even
      with 48Mi sockets the average list length is 3 and the regression would
      be less than 1%.
      
      We can check the current ehash size by another read-only sysctl knob,
      net.ipv4.tcp_ehash_entries.  A negative value means the netns shares
      the global ehash (per-netns ehash is disabled or failed to allocate
      memory).
      
        # dmesg | cut -d ' ' -f 5- | grep "established hash"
        TCP established hash table entries: 524288 (order: 10, 4194304 bytes, vmalloc hugepage)
      
        # sysctl net.ipv4.tcp_ehash_entries
        net.ipv4.tcp_ehash_entries = 524288  # can be changed by thash_entries
      
        # sysctl net.ipv4.tcp_child_ehash_entries
        net.ipv4.tcp_child_ehash_entries = 0  # disabled by default
      
        # ip netns add test1
        # ip netns exec test1 sysctl net.ipv4.tcp_ehash_entries
        net.ipv4.tcp_ehash_entries = -524288  # share the global ehash
      
        # sysctl -w net.ipv4.tcp_child_ehash_entries=100
        net.ipv4.tcp_child_ehash_entries = 100
      
        # ip netns add test2
        # ip netns exec test2 sysctl net.ipv4.tcp_ehash_entries
        net.ipv4.tcp_ehash_entries = 128  # own a per-netns ehash with 2^n buckets
      
      When two or more processes in the same netns create a per-netns ehash
      concurrently with different sizes, we need to guarantee the size in
      one of the following ways:
      
        1) Share the global ehash and create per-netns ehash

        First, unshare() with tcp_child_ehash_entries == 0.  This creates
        dedicated netns sysctl knobs, under which we can safely change
        tcp_child_ehash_entries and then clone()/unshare() again to create a
        per-netns ehash (see the first sketch after this list).
      
        2) Control write on sysctl by BPF

        We can use BPF_PROG_TYPE_CGROUP_SYSCTL to allow/deny read/write access
        to sysctl knobs (see the second sketch after this list).
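
      A minimal userspace sketch of approach 1) could look like the program
      below.  It is purely illustrative and not part of this patch; the paths
      and the 16384 value are just examples.

        #define _GNU_SOURCE
        #include <fcntl.h>
        #include <sched.h>
        #include <string.h>
        #include <unistd.h>

        static int write_sysctl(const char *path, const char *val)
        {
                int fd = open(path, O_WRONLY);
                ssize_t ret;

                if (fd < 0)
                        return -1;
                ret = write(fd, val, strlen(val));
                close(fd);
                return ret < 0 ? -1 : 0;
        }

        int main(void)
        {
                /* 1st unshare() (needs CAP_SYS_ADMIN): dedicated sysctl knobs;
                 * this netns still shares the global ehash because
                 * tcp_child_ehash_entries is 0 here.
                 */
                if (unshare(CLONE_NEWNET))
                        return 1;

                /* Only this netns (and its future children) see the value. */
                if (write_sysctl("/proc/sys/net/ipv4/tcp_child_ehash_entries",
                                 "16384"))
                        return 1;

                /* 2nd unshare(): the new netns allocates its own
                 * 16384-bucket ehash at creation time.
                 */
                return unshare(CLONE_NEWNET) ? 1 : 0;
        }

      For approach 2), a minimal BPF_PROG_TYPE_CGROUP_SYSCTL sketch (again
      illustrative, not from this patch) that simply rejects every sysctl
      write issued by tasks in the attached cgroup could be:

        #include <linux/bpf.h>
        #include <bpf/bpf_helpers.h>

        SEC("cgroup/sysctl")
        int deny_sysctl_writes(struct bpf_sysctl *ctx)
        {
                /* ctx->write is non-zero for a write access; returning 0
                 * rejects the access, returning 1 lets it proceed.
                 */
                return ctx->write ? 0 : 1;
        }

        char _license[] SEC("license") = "GPL";

      A real policy would presumably also match the knob name (e.g. via
      bpf_sysctl_get_name()) and be attached only to the cgroup that runs the
      untrusted workloads.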
      
      Note that the global ehash allocated at the boot time is spread over
      available NUMA nodes, but inet_pernet_hashinfo_alloc() will allocate
      pages for each per-netns ehash depending on the current process's NUMA
      policy.  By default, the allocation is done in the local node only, so
      the per-netns hash table could fully reside on a random node.  Thus,
      depending on the NUMA policy the netns is created with and the CPU the
      current thread is running on, we could see some performance differences
      for highly optimised networking applications.
      
      Note also that the default values of two sysctl knobs depend on the ehash
      size and should be tuned carefully:
      
        tcp_max_tw_buckets  : tcp_child_ehash_entries / 2
        tcp_max_syn_backlog : max(128, tcp_child_ehash_entries / 128)
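
      For example (purely illustrative arithmetic), a netns created with
      tcp_child_ehash_entries=16384 starts with tcp_max_tw_buckets = 8192 and
      tcp_max_syn_backlog = max(128, 128) = 128.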
      
      As a bonus, we can dismantle a netns faster.  Currently, while destroying
      a netns, we call inet_twsk_purge(), which walks through the global ehash.
      That walk can be expensive because the global ehash holds many sockets,
      including non-TIME_WAIT ones, from all netns.  With a split ehash,
      inet_twsk_purge() only has to clean up the TIME_WAIT sockets in each
      netns.
      
      With regard to this, we do not free the per-netns ehash in inet_twsk_kill()
      to avoid UAF while iterating the per-netns ehash in inet_twsk_purge().
      Instead, we do it in tcp_sk_exit_batch() after calling tcp_twsk_purge() to
      keep it protocol-family-independent.
      
      In the future, we could optimise ehash lookup/iteration further by removing
      netns comparison for the per-netns ehash.
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  3. 20 Sep 2022, 1 commit
  4. 16 Sep 2022, 1 commit
  5. 03 Sep 2022, 3 commits
  6. 02 Sep 2022, 1 commit
    • tcp: TX zerocopy should not sense pfmemalloc status · 32614006
      Authored by Eric Dumazet
      We got a recent syzbot report [1] showing a possible misuse
      of pfmemalloc page status in TCP zerocopy paths.
      
      Indeed, for pages coming from user space or other layers,
      using page_is_pfmemalloc() is moot and could give
      false positives.
      
      There have been attempts to make page_is_pfmemalloc() more robust,
      but not using it at all in this context is better
      and also saves CPU cycles.
      
      Note to stable teams:
      
      You need to backport 84ce071e ("net: introduce
      __skb_fill_page_desc_noacc") as a prereq.
      
      The race is more probable after commit c07aea3e
      ("mm: add a signature in struct page") because page_is_pfmemalloc()
      now uses the low-order bit of page->lru.next, which can change
      more often than page->index.

      The low-order bit should never be set in lru.next (when used as an anchor
      in an LRU list), so the KCSAN report is mostly a false positive.

      Backporting to older kernel versions does not seem necessary.
      
      [1]
      BUG: KCSAN: data-race in lru_add_fn / tcp_build_frag
      
      write to 0xffffea0004a1d2c8 of 8 bytes by task 18600 on cpu 0:
      __list_add include/linux/list.h:73 [inline]
      list_add include/linux/list.h:88 [inline]
      lruvec_add_folio include/linux/mm_inline.h:105 [inline]
      lru_add_fn+0x440/0x520 mm/swap.c:228
      folio_batch_move_lru+0x1e1/0x2a0 mm/swap.c:246
      folio_batch_add_and_move mm/swap.c:263 [inline]
      folio_add_lru+0xf1/0x140 mm/swap.c:490
      filemap_add_folio+0xf8/0x150 mm/filemap.c:948
      __filemap_get_folio+0x510/0x6d0 mm/filemap.c:1981
      pagecache_get_page+0x26/0x190 mm/folio-compat.c:104
      grab_cache_page_write_begin+0x2a/0x30 mm/folio-compat.c:116
      ext4_da_write_begin+0x2dd/0x5f0 fs/ext4/inode.c:2988
      generic_perform_write+0x1d4/0x3f0 mm/filemap.c:3738
      ext4_buffered_write_iter+0x235/0x3e0 fs/ext4/file.c:270
      ext4_file_write_iter+0x2e3/0x1210
      call_write_iter include/linux/fs.h:2187 [inline]
      new_sync_write fs/read_write.c:491 [inline]
      vfs_write+0x468/0x760 fs/read_write.c:578
      ksys_write+0xe8/0x1a0 fs/read_write.c:631
      __do_sys_write fs/read_write.c:643 [inline]
      __se_sys_write fs/read_write.c:640 [inline]
      __x64_sys_write+0x3e/0x50 fs/read_write.c:640
      do_syscall_x64 arch/x86/entry/common.c:50 [inline]
      do_syscall_64+0x2b/0x70 arch/x86/entry/common.c:80
      entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      read to 0xffffea0004a1d2c8 of 8 bytes by task 18611 on cpu 1:
      page_is_pfmemalloc include/linux/mm.h:1740 [inline]
      __skb_fill_page_desc include/linux/skbuff.h:2422 [inline]
      skb_fill_page_desc include/linux/skbuff.h:2443 [inline]
      tcp_build_frag+0x613/0xb20 net/ipv4/tcp.c:1018
      do_tcp_sendpages+0x3e8/0xaf0 net/ipv4/tcp.c:1075
      tcp_sendpage_locked net/ipv4/tcp.c:1140 [inline]
      tcp_sendpage+0x89/0xb0 net/ipv4/tcp.c:1150
      inet_sendpage+0x7f/0xc0 net/ipv4/af_inet.c:833
      kernel_sendpage+0x184/0x300 net/socket.c:3561
      sock_sendpage+0x5a/0x70 net/socket.c:1054
      pipe_to_sendpage+0x128/0x160 fs/splice.c:361
      splice_from_pipe_feed fs/splice.c:415 [inline]
      __splice_from_pipe+0x222/0x4d0 fs/splice.c:559
      splice_from_pipe fs/splice.c:594 [inline]
      generic_splice_sendpage+0x89/0xc0 fs/splice.c:743
      do_splice_from fs/splice.c:764 [inline]
      direct_splice_actor+0x80/0xa0 fs/splice.c:931
      splice_direct_to_actor+0x305/0x620 fs/splice.c:886
      do_splice_direct+0xfb/0x180 fs/splice.c:974
      do_sendfile+0x3bf/0x910 fs/read_write.c:1249
      __do_sys_sendfile64 fs/read_write.c:1317 [inline]
      __se_sys_sendfile64 fs/read_write.c:1303 [inline]
      __x64_sys_sendfile64+0x10c/0x150 fs/read_write.c:1303
      do_syscall_x64 arch/x86/entry/common.c:50 [inline]
      do_syscall_64+0x2b/0x70 arch/x86/entry/common.c:80
      entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      value changed: 0x0000000000000000 -> 0xffffea0004a1d288
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 1 PID: 18611 Comm: syz-executor.4 Not tainted 6.0.0-rc2-syzkaller-00248-ge022620b-dirty #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/22/2022
      
      Fixes: c07aea3e ("mm: add a signature in struct page")
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  7. 01 Sep 2022, 1 commit
  8. 25 Aug 2022, 1 commit
    • net: Add a bhash2 table hashed by port and address · 28044fc1
      Authored by Joanne Koong
      The current bind hashtable (bhash) is hashed by port only.
      In the socket bind path, we have to check for bind conflicts by
      traversing the specified port's inet_bind_bucket while holding the
      hashbucket's spinlock (see inet_csk_get_port() and
      inet_csk_bind_conflict()).  In instances where many sockets are
      hashed to the same port at different addresses, the bind conflict
      check is time-intensive and can cause softirq cpu lockups; it also
      stalls new TCP connections, since __inet_inherit_port() contends
      for the same spinlock.
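
      To make the problematic pattern concrete, here is a minimal, purely
      illustrative userspace sketch (not part of this patch) that binds many
      sockets to the same port at different local addresses; every bind()
      walks the same port-only bhash bucket under its spinlock.  The 10.0.x.y
      addresses are assumed to be configured locally, e.g. on a dummy
      interface.

        #include <arpa/inet.h>
        #include <stdio.h>
        #include <string.h>
        #include <sys/socket.h>
        #include <unistd.h>

        int main(void)
        {
                struct sockaddr_in addr;
                char ip[INET_ADDRSTRLEN];
                int i, fd;

                memset(&addr, 0, sizeof(addr));
                addr.sin_family = AF_INET;
                addr.sin_port = htons(8443);    /* same port for every socket */

                for (i = 0; i < 1000; i++) {
                        snprintf(ip, sizeof(ip), "10.0.%d.%d",
                                 i / 250 + 1, i % 250 + 1);
                        inet_pton(AF_INET, ip, &addr.sin_addr);

                        fd = socket(AF_INET, SOCK_STREAM, 0);
                        if (fd < 0 ||
                            bind(fd, (struct sockaddr *)&addr, sizeof(addr)))
                                perror(ip);
                        /* keep the fds open so the sockets stay in bhash */
                }
                pause();        /* hold the sockets while measuring */
                return 0;
        }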
      
      This patch adds a second bind table, bhash2, that hashes by
      port and sk->sk_rcv_saddr (ipv4) and sk->sk_v6_rcv_saddr (ipv6).
      Searching the bhash2 table leads to significantly faster conflict
      resolution and less time holding the hashbucket spinlock.
      
      Please note a few things:
      * There can be the case where a socket's address changes after it
      has been bound.  This happens in two cases:
      
        1) The case where there is a bind() call on INADDR_ANY (ipv4) or
        IPV6_ADDR_ANY (ipv6) and then a connect() call. The kernel will
        assign the socket an address when it handles the connect()
      
        2) In inet_sk_reselect_saddr(), which is called when rebuilding the
        sk header and a few pre-conditions are met (e.g. rerouting fails).
      
      In these two cases, we need to update the bhash2 table by removing the
      entry for the old address and adding a new entry reflecting the
      updated address.
      
      * The bhash2 table must have its own lock, even though concurrent
      accesses on the same port are protected by the bhash lock: sockets on
      different ports can hash to different bhash hashbuckets but to the
      same bhash2 hashbucket, so the bhash lock alone cannot protect bhash2.
      
      This brings up a few stipulations:
        1) When acquiring both the bhash and the bhash2 lock, the bhash2 lock
        will always be acquired after the bhash lock and released before the
        bhash lock is released (see the sketch after these stipulations).
      
        2) There are no nested bhash2 hashbucket locks. A bhash2 lock is always
        acquired+released before another bhash2 lock is acquired+released.
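
      A minimal userspace analogue of this ordering discipline, using pthread
      mutexes in place of the kernel's bucket spinlocks (purely illustrative,
      not the actual kernel code):

        #include <pthread.h>

        static pthread_mutex_t bhash_lock  = PTHREAD_MUTEX_INITIALIZER;
        static pthread_mutex_t bhash2_lock = PTHREAD_MUTEX_INITIALIZER;

        static void check_bind_conflict(void)
        {
                pthread_mutex_lock(&bhash_lock);    /* bhash bucket first   */
                pthread_mutex_lock(&bhash2_lock);   /* bhash2 bucket second */

                /* ... walk the (port, address) bucket for conflicts ... */

                pthread_mutex_unlock(&bhash2_lock); /* bhash2 released first */
                pthread_mutex_unlock(&bhash_lock);  /* bhash released last   */
        }

        int main(void)
        {
                check_bind_conflict();
                return 0;
        }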
      
      * The bhash table cannot be superseded by the bhash2 table because for
      bind requests on INADDR_ANY (ipv4) or IPV6_ADDR_ANY (ipv6), every socket
      bound to that port must be checked for a potential conflict. The bhash
      table is the only source of port->socket associations.
      Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  9. 24 Aug 2022, 2 commits
  10. 19 Aug 2022, 6 commits
  11. 28 Jul 2022, 1 commit
  12. 27 Jul 2022, 1 commit
    • tcp: allow tls to decrypt directly from the tcp rcv queue · 3f92a64e
      Authored by Jakub Kicinski
      Expose the TCP rx queue accessor and cleanup helpers, so that TLS can
      decrypt directly from the TCP queue.  The expectation
      is that the caller can access the skb returned by
      tcp_recv_skb() and up to inq bytes worth of data (some
      of which may be in ->next skbs) and then call
      tcp_read_done() when the data has been consumed.
      The socket lock must be held continuously across
      those two operations.
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  13. 25 Jul 2022, 1 commit
  14. 22 Jul 2022, 1 commit
  15. 20 Jul 2022, 1 commit
  16. 18 Jul 2022, 4 commits
  17. 08 Jul 2022, 1 commit
  18. 29 Jun 2022, 1 commit
  19. 20 Jun 2022, 2 commits
  20. 17 Jun 2022, 4 commits
  21. 11 Jun 2022, 3 commits
  22. 21 May 2022, 1 commit
    • net: Add a second bind table hashed by port and address · d5a42de8
      Authored by Joanne Koong
      We currently have one tcp bind table (bhash) which hashes by port
      number only.  In the socket bind path, we check for bind conflicts by
      traversing the specified port's inet_bind_bucket while holding the
      bucket's spinlock (see inet_csk_get_port() and inet_csk_bind_conflict()).

      In instances where many sockets are hashed to the same port at
      different addresses, checking for a bind conflict is time-intensive
      and can cause softirq cpu lockups; it also stalls new TCP connections,
      since __inet_inherit_port() contends for the same spinlock.
      
      This patch proposes adding a second bind table, bhash2, that hashes by
      port and ip address. Searching the bhash2 table leads to significantly
      faster conflict resolution and less time holding the spinlock.
      Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  23. 13 May 2022, 1 commit
    • net: inet: Retire port only listening_hash · cae3873c
      Authored by Martin KaFai Lau
      The listen sk is currently stored in two hash tables,
      listening_hash (hashed by port) and lhash2 (hashed by port and address).
      
      After commit 0ee58dad ("net: tcp6: prefer listeners bound to an address")
      and commit d9fbc7f6 ("net: tcp: prefer listeners bound to an address"),
      the TCP-SYN lookup fast path does not use listening_hash.
      
      The commit 05c0b357 ("tcp: seq_file: Replace listening_hash with lhash2")
      also moved the seq_file (/proc/net/tcp) iteration usage from
      listening_hash to lhash2.
      
      There are still a few listening_hash usages left.
      One of them is inet_reuseport_add_sock(), which uses the listening_hash
      to search for a listening sk during the listen() system call.  This
      turns out to be very slow on use cases that listen on many different
      VIPs at a popular port (e.g. 443), on top of the slowness of adding
      to the tail in the IPv6 case.  A later patch in this series has a
      selftest to demonstrate this case.
      
      This patch takes the chance to move all remaining listening_hash
      usages to lhash2 and then retire listening_hash.

      Since most changes need to be done together, it is hard to split
      the listening_hash-to-lhash2 switch into small patches.  The
      changes in this patch are highlighted here for review purposes.
      
      1. Because of the listening_hash removal, lhash2 can use
         sk->sk_nulls_node instead of icsk->icsk_listen_portaddr_node.
         This also keeps the sk_unhashed() check working as-is once we
         stop adding the sk to listening_hash.

         The union is removed from inet_listen_hashbucket because
         only nulls_head is needed.
      
      2. icsk->icsk_listen_portaddr_node and its helpers are removed.
      
      3. The current lhash2 users need to iterate with sk_nulls_node
         instead of icsk_listen_portaddr_node.

         One case is inet[6]_lhash2_lookup().

         Another case is the seq_file iterator in tcp_ipv4.c.
         Note that sk_nulls_next() is needed here because the old
         inet_lhash2_for_each_icsk_continue() did a "next" first
         before iterating.
      
      4. Move the remaining listening_hash usage to lhash2:

         inet_reuseport_add_sock(), which this series is
         trying to improve.

         inet_diag.c and mptcp_diag.c are the final two
         remaining use cases and are now moved to lhash2 as well.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>