1. 02 Dec 2022 (2 commits)
  2. 23 Nov 2022 (3 commits)
    • dccp/tcp: Fixup bhash2 bucket when connect() fails. · e0833d1f
      Committed by Kuniyuki Iwashima
      If a socket bound to a wildcard address fails to connect(), we
      only reset saddr and keep the port.  Then, we have to fix up the
      bhash2 bucket; otherwise, the bucket has an inconsistent address
      in the list.
      
      Also, listen() for such a socket will fire the WARN_ON() in
      inet_csk_get_port(). [0]
      
      Note that when a system runs out of memory, we give up fixing the
      bucket and unlink sk from bhash and bhash2 by inet_put_port().
      
      [0]:
      WARNING: CPU: 0 PID: 207 at net/ipv4/inet_connection_sock.c:548 inet_csk_get_port (net/ipv4/inet_connection_sock.c:548 (discriminator 1))
      Modules linked in:
      CPU: 0 PID: 207 Comm: bhash2_prev_rep Not tainted 6.1.0-rc3-00799-gc8421681c845 #63
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-1.amzn2022.0.1 04/01/2014
      RIP: 0010:inet_csk_get_port (net/ipv4/inet_connection_sock.c:548 (discriminator 1))
      Code: 74 a7 eb 93 48 8b 54 24 18 0f b7 cb 4c 89 e6 4c 89 ff e8 48 b2 ff ff 49 8b 87 18 04 00 00 e9 32 ff ff ff 0f 0b e9 34 ff ff ff <0f> 0b e9 42 ff ff ff 41 8b 7f 50 41 8b 4f 54 89 fe 81 f6 00 00 ff
      RSP: 0018:ffffc900003d7e50 EFLAGS: 00010202
      RAX: ffff8881047fb500 RBX: 0000000000004e20 RCX: 0000000000000000
      RDX: 000000000000000a RSI: 00000000fffffe00 RDI: 00000000ffffffff
      RBP: ffffffff8324dc00 R08: 0000000000000001 R09: 0000000000000001
      R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000
      R13: 0000000000000001 R14: 0000000000004e20 R15: ffff8881054e1280
      FS:  00007f8ac04dc740(0000) GS:ffff88842fc00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000020001540 CR3: 00000001055fa003 CR4: 0000000000770ef0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      PKRU: 55555554
      Call Trace:
       <TASK>
       inet_csk_listen_start (net/ipv4/inet_connection_sock.c:1205)
       inet_listen (net/ipv4/af_inet.c:228)
       __sys_listen (net/socket.c:1810)
       __x64_sys_listen (net/socket.c:1819 net/socket.c:1817 net/socket.c:1817)
       do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
       entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120)
      RIP: 0033:0x7f8ac051de5d
      Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 93 af 1b 00 f7 d8 64 89 01 48
      RSP: 002b:00007ffc1c177248 EFLAGS: 00000206 ORIG_RAX: 0000000000000032
      RAX: ffffffffffffffda RBX: 0000000020001550 RCX: 00007f8ac051de5d
      RDX: ffffffffffffff80 RSI: 0000000000000000 RDI: 0000000000000004
      RBP: 00007ffc1c177270 R08: 0000000000000018 R09: 0000000000000007
      R10: 0000000020001540 R11: 0000000000000206 R12: 00007ffc1c177388
      R13: 0000000000401169 R14: 0000000000403e18 R15: 00007f8ac0723000
       </TASK>
      
      Fixes: 28044fc1 ("net: Add a bhash2 table hashed by port and address")
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Reported-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Acked-by: Joanne Koong <joannelkoong@gmail.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • dccp/tcp: Update saddr under bhash's lock. · 8c5dae4c
      Committed by Kuniyuki Iwashima
      When we call connect() for a socket bound to a wildcard address, we update
      saddr locklessly.  However, it could result in a data race; another thread
      iterating over bhash might see a corrupted address.
      
      Let's update saddr under the bhash bucket's lock.
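
      As a user-space illustration of the race (not the kernel code; the
      16-byte buffer, the loop counts and the lock name are made up for the
      demo), a reader copying an address while a writer rewrites it can see
      a torn value unless both sides take the same lock, which is what this
      patch does with the bhash bucket lock:

        /* Toy demo: cc -O2 -pthread torn.c; set use_lock to 0 to observe
         * torn (mixed 'A'/'B') addresses. */
        #include <pthread.h>
        #include <stdio.h>
        #include <string.h>

        static unsigned char saddr[16];     /* stands in for the bound address */
        static pthread_mutex_t bucket_lock = PTHREAD_MUTEX_INITIALIZER;
        static const int use_lock = 1;

        static void *writer(void *arg)
        {
            (void)arg;
            for (int i = 0; i < 1000000; i++) {
                if (use_lock)
                    pthread_mutex_lock(&bucket_lock);
                memset(saddr, (i & 1) ? 'A' : 'B', sizeof(saddr));
                if (use_lock)
                    pthread_mutex_unlock(&bucket_lock);
            }
            return NULL;
        }

        int main(void)
        {
            unsigned char copy[16];
            pthread_t t;
            long torn = 0;

            pthread_create(&t, NULL, writer, NULL);
            for (int i = 0; i < 1000000; i++) {
                if (use_lock)
                    pthread_mutex_lock(&bucket_lock);
                memcpy(copy, saddr, sizeof(copy));  /* the "bhash iteration" */
                if (use_lock)
                    pthread_mutex_unlock(&bucket_lock);
                for (int j = 1; j < 16; j++)
                    if (copy[j] != copy[0])
                        torn++;
            }
            pthread_join(t, NULL);
            printf("torn reads: %ld\n", torn);
            return 0;
        }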
      
      Fixes: 3df80d93 ("[DCCP]: Introduce DCCPv6")
      Fixes: 7c657876 ("[DCCP]: Initial implementation")
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Acked-by: Joanne Koong <joannelkoong@gmail.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • dccp/tcp: Reset saddr on failure after inet6?_hash_connect(). · 77934dc6
      Committed by Kuniyuki Iwashima
      When connect() is called on a socket bound to the wildcard address,
      we change the socket's saddr to a local address.  If the socket
      fails to connect() to the destination, we have to reset the saddr.
      
      However, when an error occurs after inet6?_hash_connect() in
      (dccp|tcp)_v[46]_connect(), we forget to reset saddr and leave
      the socket bound to the address.
      
      From the user's point of view, whether saddr is reset or not varies
      with errno.  Let's fix this inconsistent behaviour.
      
      Note that after this patch, the repro [0] will trigger the WARN_ON()
      in inet_csk_get_port() again, but this patch is not buggy and rather
      fixes a bug papering over the bhash2's bug for which we need another
      fix.
      
      For the record, the repro causes -EADDRNOTAVAIL in inet_hash_connect()
      by this sequence:
      
        s1 = socket()
        s1.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
        s1.bind(('127.0.0.1', 10000))
        s1.sendto(b'hello', MSG_FASTOPEN, (('127.0.0.1', 10000)))
        # or s1.connect(('127.0.0.1', 10000))
      
        s2 = socket()
        s2.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
        s2.bind(('0.0.0.0', 10000))
        s2.connect(('127.0.0.1', 10000))  # -EADDRNOTAVAIL
      
        s2.listen(32)  # WARN_ON(inet_csk(sk)->icsk_bind2_hash != tb2);
      
      [0]: https://syzkaller.appspot.com/bug?extid=015d756bbd1f8b5c8f09
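
      The same sequence rendered in user-space C, ending with a getsockname()
      probe instead of listen() to show what the application observes after
      the failed connect() (a rough sketch: bound_socket() is just a local
      helper, error handling is omitted, and before this series whether a
      specific saddr was still reported depended on which error path
      connect() took):

        #include <stdio.h>
        #include <unistd.h>
        #include <arpa/inet.h>
        #include <sys/socket.h>

        static int bound_socket(const char *addr, int port)
        {
            struct sockaddr_in sin = { .sin_family = AF_INET,
                                       .sin_port = htons(port) };
            int one = 1;
            int fd = socket(AF_INET, SOCK_STREAM, 0);

            setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));
            inet_pton(AF_INET, addr, &sin.sin_addr);
            bind(fd, (struct sockaddr *)&sin, sizeof(sin));
            return fd;
        }

        int main(void)
        {
            struct sockaddr_in dst = { .sin_family = AF_INET,
                                       .sin_port = htons(10000) };
            struct sockaddr_in local;
            socklen_t len = sizeof(local);
            char buf[INET_ADDRSTRLEN];
            int s1, s2;

            inet_pton(AF_INET, "127.0.0.1", &dst.sin_addr);

            s1 = bound_socket("127.0.0.1", 10000);
            connect(s1, (struct sockaddr *)&dst, sizeof(dst)); /* self-connect */

            s2 = bound_socket("0.0.0.0", 10000);
            if (connect(s2, (struct sockaddr *)&dst, sizeof(dst)) < 0)
                perror("connect");                      /* -EADDRNOTAVAIL */

            getsockname(s2, (struct sockaddr *)&local, &len);
            printf("s2 local address: %s:%d\n",
                   inet_ntop(AF_INET, &local.sin_addr, buf, sizeof(buf)),
                   ntohs(local.sin_port));

            close(s1);
            close(s2);
            return 0;
        }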
      
      Fixes: 3df80d93 ("[DCCP]: Introduce DCCPv6")
      Fixes: 7c657876 ("[DCCP]: Initial implementation")
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Acked-by: Joanne Koong <joannelkoong@gmail.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  3. 14 Nov 2022 (1 commit)
  4. 28 Oct 2022 (2 commits)
  5. 24 Oct 2022 (1 commit)
  6. 12 Oct 2022 (1 commit)
    • treewide: use get_random_{u8,u16}() when possible, part 1 · 7e3cf084
      Committed by Jason A. Donenfeld
      Rather than truncate a 32-bit value to a 16-bit value or an 8-bit value,
      simply use the get_random_{u8,u16}() functions, which are faster than
      wasting the additional bytes from a 32-bit value. This was done
      mechanically with this coccinelle script:
      
      @@
      expression E;
      identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
      typedef u16;
      typedef __be16;
      typedef __le16;
      typedef u8;
      @@
      (
      - (get_random_u32() & 0xffff)
      + get_random_u16()
      |
      - (get_random_u32() & 0xff)
      + get_random_u8()
      |
      - (get_random_u32() % 65536)
      + get_random_u16()
      |
      - (get_random_u32() % 256)
      + get_random_u8()
      |
      - (get_random_u32() >> 16)
      + get_random_u16()
      |
      - (get_random_u32() >> 24)
      + get_random_u8()
      |
      - (u16)get_random_u32()
      + get_random_u16()
      |
      - (u8)get_random_u32()
      + get_random_u8()
      |
      - (__be16)get_random_u32()
      + (__be16)get_random_u16()
      |
      - (__le16)get_random_u32()
      + (__le16)get_random_u16()
      |
      - prandom_u32_max(65536)
      + get_random_u16()
      |
      - prandom_u32_max(256)
      + get_random_u8()
      |
      - E->inet_id = get_random_u32()
      + E->inet_id = get_random_u16()
      )
      
      @@
      identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
      typedef u16;
      identifier v;
      @@
      - u16 v = get_random_u32();
      + u16 v = get_random_u16();
      
      @@
      identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
      typedef u8;
      identifier v;
      @@
      - u8 v = get_random_u32();
      + u8 v = get_random_u8();
      
      @@
      identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
      typedef u16;
      u16 v;
      @@
      -  v = get_random_u32();
      +  v = get_random_u16();
      
      @@
      identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
      typedef u8;
      u8 v;
      @@
      -  v = get_random_u32();
      +  v = get_random_u8();
      
      // Find a potential literal
      @literal_mask@
      expression LITERAL;
      type T;
      identifier get_random_u32 =~ "get_random_int|prandom_u32|get_random_u32";
      position p;
      @@
      
              ((T)get_random_u32()@p & (LITERAL))
      
      // Examine limits
      @script:python add_one@
      literal << literal_mask.LITERAL;
      RESULT;
      @@
      
      value = None
      if literal.startswith('0x'):
              value = int(literal, 16)
      elif literal[0] in '123456789':
              value = int(literal, 10)
      if value is None:
              print("I don't know how to handle %s" % (literal))
              cocci.include_match(False)
      elif value < 256:
              coccinelle.RESULT = cocci.make_ident("get_random_u8")
      elif value < 65536:
              coccinelle.RESULT = cocci.make_ident("get_random_u16")
      else:
              print("Skipping large mask of %s" % (literal))
              cocci.include_match(False)
      
      // Replace the literal mask with the calculated result.
      @plus_one@
      expression literal_mask.LITERAL;
      position literal_mask.p;
      identifier add_one.RESULT;
      identifier FUNC;
      @@
      
      -       (FUNC()@p & (LITERAL))
      +       (RESULT() & LITERAL)
      Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: Yury Norov <yury.norov@gmail.com>
      Acked-by: Jakub Kicinski <kuba@kernel.org>
      Acked-by: Toke Høiland-Jørgensen <toke@toke.dk> # for sch_cake
      Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
  7. 21 Sep 2022 (6 commits)
    • tcp: Introduce optional per-netns ehash. · d1e5e640
      Committed by Kuniyuki Iwashima
      The more sockets we have in the hash table, the longer we spend looking
      up the socket.  While running a number of small workloads on the same
      host, they penalise each other and cause performance degradation.
      
      The root cause might be a single workload that consumes much more
      resources than the others.  It often happens on a cloud service where
      different workloads share the same computing resource.
      
      On EC2 c5.24xlarge instance (196 GiB memory and 524288 (1Mi / 2) ehash
      entries), after running iperf3 in different netns, creating 24Mi sockets
      without data transfer in the root netns causes about 10% performance
      regression for the iperf3's connection.
      
       thash_entries		sockets		length		Gbps
      	524288		      1		     1		50.7
      			   24Mi		    48		45.1
      
      It is basically related to the length of the list of each hash bucket.
      For testing purposes to see how performance drops along the length,
      I set 131072 (1Mi / 8) to thash_entries, and here's the result.
      
       thash_entries		sockets		length		Gbps
              131072		      1		     1		50.7
      			    1Mi		     8		49.9
      			    2Mi		    16		48.9
      			    4Mi		    32		47.3
      			    8Mi		    64		44.6
      			   16Mi		   128		40.6
      			   24Mi		   192		36.3
      			   32Mi		   256		32.5
      			   40Mi		   320		27.0
      			   48Mi		   384		25.0
      
      To resolve the socket lookup degradation, we introduce an optional
      per-netns hash table for TCP, but it's just ehash, and we still share
      the global bhash, bhash2 and lhash2.
      
      With a smaller ehash, we can look up non-listener sockets faster and
      isolate such noisy neighbours.  In addition, we can reduce lock contention.
      
      We can control the ehash size by a new sysctl knob.  However, depending
      on workloads, it will require very sensitive tuning, so we disable the
      feature by default (net.ipv4.tcp_child_ehash_entries == 0).  Moreover,
      we can fall back to using the global ehash in case we fail to allocate
      enough memory for a new ehash.  The maximum size is 16Mi, which is large
      enough that even if we have 48Mi sockets, the average list length is 3,
      and regression would be less than 1%.
      
      We can check the current ehash size by another read-only sysctl knob,
      net.ipv4.tcp_ehash_entries.  A negative value means the netns shares
      the global ehash (per-netns ehash is disabled or failed to allocate
      memory).
      
        # dmesg | cut -d ' ' -f 5- | grep "established hash"
        TCP established hash table entries: 524288 (order: 10, 4194304 bytes, vmalloc hugepage)
      
        # sysctl net.ipv4.tcp_ehash_entries
        net.ipv4.tcp_ehash_entries = 524288  # can be changed by thash_entries
      
        # sysctl net.ipv4.tcp_child_ehash_entries
        net.ipv4.tcp_child_ehash_entries = 0  # disabled by default
      
        # ip netns add test1
        # ip netns exec test1 sysctl net.ipv4.tcp_ehash_entries
        net.ipv4.tcp_ehash_entries = -524288  # share the global ehash
      
        # sysctl -w net.ipv4.tcp_child_ehash_entries=100
        net.ipv4.tcp_child_ehash_entries = 100
      
        # ip netns add test2
        # ip netns exec test2 sysctl net.ipv4.tcp_ehash_entries
        net.ipv4.tcp_ehash_entries = 128  # own a per-netns ehash with 2^n buckets
      
      When more than two processes in the same netns create per-netns ehash
      concurrently with different sizes, we need to guarantee the size in
      one of the following ways:
      
        1) Share the global ehash and create per-netns ehash
      
        First, unshare() with tcp_child_ehash_entries==0.  It creates dedicated
        netns sysctl knobs where we can safely change tcp_child_ehash_entries
        and clone()/unshare() to create a per-netns ehash (sketched in C below).
      
        2) Control write on sysctl by BPF
      
        We can use BPF_PROG_TYPE_CGROUP_SYSCTL to allow/deny read/write on
        sysctl knobs.
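
      A rough C sketch of approach 1) (requires CAP_SYS_ADMIN; the value
      4096 is only an example size, and error handling is trimmed):

        #define _GNU_SOURCE
        #include <fcntl.h>
        #include <sched.h>
        #include <stdio.h>
        #include <unistd.h>

        int main(void)
        {
            int fd;

            /* 1. New netns; it still shares the global ehash because the
             *    parent's tcp_child_ehash_entries was 0. */
            if (unshare(CLONE_NEWNET)) {
                perror("unshare");
                return 1;
            }

            /* 2. The knob is now private to this netns; pick the child size. */
            fd = open("/proc/sys/net/ipv4/tcp_child_ehash_entries", O_WRONLY);
            write(fd, "4096", 4);
            close(fd);

            /* 3. Netns created from here on own a 4096-bucket ehash. */
            if (unshare(CLONE_NEWNET)) {
                perror("unshare");
                return 1;
            }

            printf("now in a netns with its own ehash; "
                   "check net.ipv4.tcp_ehash_entries\n");
            return 0;
        }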
      
      Note that the global ehash allocated at the boot time is spread over
      available NUMA nodes, but inet_pernet_hashinfo_alloc() will allocate
      pages for each per-netns ehash depending on the current process's NUMA
      policy.  By default, the allocation is done in the local node only, so
      the per-netns hash table could fully reside on a random node.  Thus,
      depending on the NUMA policy the netns is created with and the CPU the
      current thread is running on, we could see some performance differences
      for highly optimised networking applications.
      
      Note also that the default values of two sysctl knobs depend on the ehash
      size and should be tuned carefully:
      
        tcp_max_tw_buckets  : tcp_child_ehash_entries / 2
        tcp_max_syn_backlog : max(128, tcp_child_ehash_entries / 128)
      
      As a bonus, we can dismantle netns faster.  Currently, while destroying
      netns, we call inet_twsk_purge(), which walks through the global ehash.
      It can be potentially big because it can have many sockets other than
      TIME_WAIT in all netns.  Splitting ehash changes that situation, where
      it's only necessary for inet_twsk_purge() to clean up TIME_WAIT sockets
      in each netns.
      
      With regard to this, we do not free the per-netns ehash in inet_twsk_kill()
      to avoid UAF while iterating the per-netns ehash in inet_twsk_purge().
      Instead, we do it in tcp_sk_exit_batch() after calling tcp_twsk_purge() to
      keep it protocol-family-independent.
      
      In the future, we could optimise ehash lookup/iteration further by removing
      netns comparison for the per-netns ehash.
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • tcp: Save unnecessary inet_twsk_purge() calls. · edc12f03
      Committed by Kuniyuki Iwashima
      While destroying a netns, we call inet_twsk_purge() in tcp_sk_exit_batch()
      and tcpv6_net_exit_batch() for AF_INET and AF_INET6.  Even when the netns
      has no TIME_WAIT sockets, commands as simple as the following make the
      kernel walk through the potentially big ehash twice:
      
        # ip netns add test
        # ip netns del test
      
        or
      
        # unshare -n /bin/true >/dev/null
      
      When tw_refcount is 1, the netns has no TIME_WAIT sockets, and we need
      not call inet_twsk_purge() for it.  We can skip these iterations entirely
      when every netns in net_exit_list has no TIME_WAIT sockets.  This also
      removes the tax imposed by the additional unshare() described in the
      next patch, which guarantees the per-netns ehash size.
      
      Tested:
      
        # mount -t debugfs none /sys/kernel/debug/
        # echo cleanup_net > /sys/kernel/debug/tracing/set_ftrace_filter
        # echo inet_twsk_purge >> /sys/kernel/debug/tracing/set_ftrace_filter
        # echo function > /sys/kernel/debug/tracing/current_tracer
        # cat ./add_del_unshare.sh
        for i in `seq 1 40`
        do
            (for j in `seq 1 100` ; do  unshare -n /bin/true >/dev/null ; done) &
        done
        wait;
        # ./add_del_unshare.sh
      
      Before the patch:
      
        # cat /sys/kernel/debug/tracing/trace_pipe
          kworker/u128:0-8       [031] ...1.   174.162765: cleanup_net <-process_one_work
          kworker/u128:0-8       [031] ...1.   174.240796: inet_twsk_purge <-cleanup_net
          kworker/u128:0-8       [032] ...1.   174.244759: inet_twsk_purge <-tcp_sk_exit_batch
          kworker/u128:0-8       [034] ...1.   174.290861: cleanup_net <-process_one_work
          kworker/u128:0-8       [039] ...1.   175.245027: inet_twsk_purge <-cleanup_net
          kworker/u128:0-8       [046] ...1.   175.290541: inet_twsk_purge <-tcp_sk_exit_batch
          kworker/u128:0-8       [037] ...1.   175.321046: cleanup_net <-process_one_work
          kworker/u128:0-8       [024] ...1.   175.941633: inet_twsk_purge <-cleanup_net
          kworker/u128:0-8       [025] ...1.   176.242539: inet_twsk_purge <-tcp_sk_exit_batch
      
      After:
      
        # cat /sys/kernel/debug/tracing/trace_pipe
          kworker/u128:0-8       [038] ...1.   428.116174: cleanup_net <-process_one_work
          kworker/u128:0-8       [038] ...1.   428.262532: cleanup_net <-process_one_work
          kworker/u128:0-8       [030] ...1.   429.292645: cleanup_net <-process_one_work
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • tcp: Access &tcp_hashinfo via net. · 4461568a
      Committed by Kuniyuki Iwashima
      We will soon introduce an optional per-netns ehash.
      
      This means we cannot use tcp_hashinfo directly in most places.
      
      Instead, access it via net->ipv4.tcp_death_row.hashinfo.
      
      The access will be valid only while initialising tcp_hashinfo
      itself and creating/destroying each netns.
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • tcp: Set NULL to sk->sk_prot->h.hashinfo. · 429e42c1
      Committed by Kuniyuki Iwashima
      We will soon introduce an optional per-netns ehash.
      
      This means we cannot use the global sk->sk_prot->h.hashinfo
      to fetch a TCP hashinfo.
      
      Instead, set NULL to sk->sk_prot->h.hashinfo for TCP and get
      a proper hashinfo from net->ipv4.tcp_death_row.hashinfo.
      
      Note that we need not use sk->sk_prot->h.hashinfo if DCCP is
      disabled.
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • tcp: Don't allocate tcp_death_row outside of struct netns_ipv4. · e9bd0cca
      Committed by Kuniyuki Iwashima
      We will soon introduce an optional per-netns ehash and access hash
      tables via net->ipv4.tcp_death_row->hashinfo instead of &tcp_hashinfo
      in most places.
      
      It could harm the fast path because dereferences of two fields in net
      and tcp_death_row might incur two extra cache line misses.  To save one
      dereference, let's place tcp_death_row back in netns_ipv4 and fetch
      hashinfo via net->ipv4.tcp_death_row.hashinfo.
      
      Note tcp_death_row was initially placed in netns_ipv4, and commit
      fbb82952 ("tcp: allocate tcp_death_row outside of struct netns_ipv4")
      changed it to a pointer so that we can fire TIME_WAIT timers after freeing
      net.  However, we don't do so after commit 04c494e6 ("Revert "tcp/dccp:
      get rid of inet_twsk_purge()""), so we need not define tcp_death_row as a
      pointer.
      
      Also, we move refcount_dec_and_test(&tw_refcount) from tcp_sk_exit() to
      tcp_sk_exit_batch() as a debug check.
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • tcp: Clean up some functions. · 08eaef90
      Committed by Kuniyuki Iwashima
      This patch adds no functional change and cleans up some functions
      that the following patches touch around so that we make them tidy
      and easy to review/revert.  The changes are
      
        - Keep reverse christmas tree order (illustrated below)
        - Remove unnecessary init of port in inet_csk_find_open_port()
        - Use req_to_sk() once in reqsk_queue_unlink()
        - Use sock_net(sk) once in tcp_time_wait() and tcp_v[46]_connect()
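
      For readers unfamiliar with the netdev convention, "reverse christmas
      tree" just means ordering local variable declarations from the longest
      line down to the shortest, e.g. (illustration only, not a hunk from
      this patch):

        #include <stdio.h>
        #include <string.h>

        /* Longest declaration first, shortest last. */
        static int describe(const char *name, int port)
        {
            unsigned long name_length = strlen(name);
            const char *label = "socket";
            int ret;

            ret = printf("%s %s:%d (%lu)\n", label, name, port, name_length);
            return ret;
        }

        int main(void)
        {
            return describe("127.0.0.1", 10000) > 0 ? 0 : 1;
        }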
      Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  8. 01 Sep 2022 (1 commit)
  9. 25 Aug 2022 (1 commit)
    • net: Add a bhash2 table hashed by port and address · 28044fc1
      Committed by Joanne Koong
      The current bind hashtable (bhash) is hashed by port only.
      In the socket bind path, we have to check for bind conflicts by
      traversing the specified port's inet_bind_bucket while holding the
      hashbucket's spinlock (see inet_csk_get_port() and
      inet_csk_bind_conflict()). In instances where there are tons of
      sockets hashed to the same port at different addresses, the bind
      conflict check is time-intensive and can cause softirq cpu lockups,
      as well as stall new tcp connections, since __inet_inherit_port()
      also contends for the spinlock.
      
      This patch adds a second bind table, bhash2, that hashes by
      port and sk->sk_rcv_saddr (ipv4) and sk->sk_v6_rcv_saddr (ipv6).
      Searching the bhash2 table leads to significantly faster conflict
      resolution and less time holding the hashbucket spinlock.
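
      A toy model of why this helps (not the kernel's hash functions; the
      bucket count, multiplier and VIP count are arbitrary): with a
      port-only hash, N sockets bound to the same port at N different
      addresses share one chain, while a port+address hash spreads them out.

        #include <stdio.h>
        #include <stdint.h>

        #define BUCKETS 256

        static unsigned int hash_port(uint16_t port)
        {
            return port % BUCKETS;
        }

        static unsigned int hash_port_addr(uint16_t port, uint32_t addr)
        {
            return (port ^ (addr * 2654435761u)) % BUCKETS;
        }

        int main(void)
        {
            unsigned int chain1[BUCKETS] = {0}, chain2[BUCKETS] = {0}, max2 = 0;
            uint16_t port = 443;

            for (uint32_t addr = 1; addr <= 10000; addr++) { /* 10k VIPs on :443 */
                chain1[hash_port(port)]++;
                chain2[hash_port_addr(port, addr)]++;
            }
            for (int i = 0; i < BUCKETS; i++)
                if (chain2[i] > max2)
                    max2 = chain2[i];

            printf("port-only bucket chain length: %u\n", chain1[hash_port(port)]);
            printf("port+address longest chain:    %u\n", max2);
            return 0;
        }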
      
      Please note a few things:
      * There can be the case where a socket's address changes after it
      has been bound. There are two cases where this happens:
      
        1) The case where there is a bind() call on INADDR_ANY (ipv4) or
        IPV6_ADDR_ANY (ipv6) and then a connect() call. The kernel will
        assign the socket an address when it handles the connect()
      
        2) In inet_sk_reselect_saddr(), which is called when rebuilding the
        sk header and a few pre-conditions are met (eg rerouting fails).
      
      In these two cases, we need to update the bhash2 table by removing the
      entry for the old address, and add a new entry reflecting the updated
      address.
      
      * The bhash2 table must have its own lock, even though concurrent
      accesses on the same port are protected by the bhash lock. Bhash2 must
      have its own lock to protect against cases where sockets on different
      ports hash to different bhash hashbuckets but to the same bhash2
      hashbucket.
      
      This brings up a few stipulations:
        1) When acquiring both the bhash and the bhash2 lock, the bhash2 lock
        will always be acquired after the bhash lock and released before the
        bhash lock is released.
      
        2) There are no nested bhash2 hashbucket locks. A bhash2 lock is always
        acquired+released before another bhash2 lock is acquired+released.
      
      * The bhash table cannot be superseded by the bhash2 table because for
      bind requests on INADDR_ANY (ipv4) or IPV6_ADDR_ANY (ipv6), every socket
      bound to that port must be checked for a potential conflict. The bhash
      table is the only source of port->socket associations.
      Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  10. 25 Jul 2022 (1 commit)
  11. 18 Jul 2022 (1 commit)
  12. 11 Jul 2022 (1 commit)
    • net: Find dst with sk's xfrm policy not ctl_sk · e22aa148
      Committed by sewookseo
      If we set XFRM security policy by calling setsockopt with option
      IPV6_XFRM_POLICY, the policy will be stored in 'sock_policy' in 'sock'
      struct. However tcp_v6_send_response doesn't look up dst_entry with the
      actual socket but looks up with tcp control socket. This may cause a
      problem that a RST packet is sent without ESP encryption & peer's TCP
      socket can't receive it.
      This patch will make the function look up dest_entry with actual socket,
      if the socket has XFRM policy(sock_policy), so that the TCP response
      packet via this function can be encrypted, & aligned on the encrypted
      TCP socket.
      
      Tested: We encountered this problem when a TCP socket encrypted in ESP
      transport mode receives a challenge ACK in SYN_SENT state. After
      receiving the challenge ACK, TCP needs to send RST to establish the
      socket at the next SYN try. But the RST was not encrypted and the
      peer's TCP socket still remained in ESTABLISHED state.
      So we verified this with test step as below.
      [Test step]
      1. Making a TCP state mismatch between client(IDLE) & server(ESTABLISHED).
      2. Client tries a new connection on the same TCP ports(src & dst).
      3. Server will return challenge ACK instead of SYN,ACK.
      4. Client will send RST to server to clear the SOCKET.
      5. Client will retransmit SYN to server on the same TCP ports.
      [Expected result]
      The TCP connection should be established.
      
      Cc: Maciej Żenczykowski <maze@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Steffen Klassert <steffen.klassert@secunet.com>
      Cc: Sehee Lee <seheele@google.com>
      Signed-off-by: Sewook Seo <sewookseo@google.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  13. 25 Jun 2022 (1 commit)
  14. 11 Jun 2022 (1 commit)
  15. 31 May 2022 (1 commit)
  16. 16 May 2022 (1 commit)
  17. 13 May 2022 (2 commits)
    • Revert "tcp/dccp: get rid of inet_twsk_purge()" · 04c494e6
      Committed by Eric Dumazet
      This reverts commits:
      
      0dad4087 ("tcp/dccp: get rid of inet_twsk_purge()")
      d507204d ("tcp/dccp: add tw->tw_bslot")
      
      As Leonard pointed out, a newly allocated netns can happen
      to reuse a freed 'struct net'.
      
      While TCP TW timers were covered by my patches, other things were not:
      
      1) Lookups in rx path (INET_MATCH() and INET6_MATCH()), as they look
        at 4-tuple plus the 'struct net' pointer.
      
      2) /proc/net/tcp[6] and inet_diag, same reason.
      
      3) hashinfo->bhash[], same reason.
      
      Fixing all this seems risky, lets instead revert.
      
      In the future, we might have a per netns tcp hash table, or
      a per netns list of timewait sockets...
      
      Fixes: 0dad4087 ("tcp/dccp: get rid of inet_twsk_purge()")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Leonard Crestez <cdleonard@gmail.com>
      Tested-by: Leonard Crestez <cdleonard@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: inet: Retire port only listening_hash · cae3873c
      Committed by Martin KaFai Lau
      The listen sk is currently stored in two hash tables,
      listening_hash (hashed by port) and lhash2 (hashed by port and address).
      
      After commit 0ee58dad ("net: tcp6: prefer listeners bound to an address")
      and commit d9fbc7f6 ("net: tcp: prefer listeners bound to an address"),
      the TCP-SYN lookup fast path does not use listening_hash.
      
      The commit 05c0b357 ("tcp: seq_file: Replace listening_hash with lhash2")
      also moved the seq_file (/proc/net/tcp) iteration usage from
      listening_hash to lhash2.
      
      There are still a few listening_hash usages left.
      One of them is inet_reuseport_add_sock() which uses the listening_hash
      to search a listen sk during the listen() system call.  This turns
      out to be very slow on use cases that listen on many different
      VIPs at a popular port (e.g. 443).  [ On top of the slowness in
      adding to the tail in the IPv6 case ].  A later patch in the series
      has a selftest to demonstrate this case.
      
      This patch takes this chance to move all remaining listening_hash
      usages to lhash2 and then retire listening_hash.
      
      Since most changes need to be done together, it is hard to cut
      the listening_hash to lhash2 switch into small patches.  The
      changes in this patch are highlighted here for review purposes.
      
      1. Because of the listening_hash removal, lhash2 can use the
         sk->sk_nulls_node instead of the icsk->icsk_listen_portaddr_node.
         This will also keep the sk_unhashed() check to work as is
         after stop adding sk to listening_hash.
      
         The union is removed from inet_listen_hashbucket because
         only nulls_head is needed.
      
      2. icsk->icsk_listen_portaddr_node and its helpers are removed.
      
      3. The current lhash2 users needs to iterate with sk_nulls_node
         instead of icsk_listen_portaddr_node.
      
         One case is in the inet[6]_lhash2_lookup().
      
         Another case is the seq_file iterator in tcp_ipv4.c.
         One thing to note is sk_nulls_next() is needed
         because the old inet_lhash2_for_each_icsk_continue()
         does a "next" first before iterating.
      
      4. Move the remaining listening_hash usage to lhash2
      
         inet_reuseport_add_sock() which this series is
         trying to improve.
      
         inet_diag.c and mptcp_diag.c are the final two
         remaining use cases and is moved to lhash2 now also.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  18. 27 Apr 2022 (1 commit)
    • net: generalize skb freeing deferral to per-cpu lists · 68822bdf
      Committed by Eric Dumazet
      Logic added in commit f35f8219 ("tcp: defer skb freeing after socket
      lock is released") helped bulk TCP flows to move the cost of skb
      frees outside of the critical section where the socket lock was held.
      
      But for RPC traffic, or hosts with RFS enabled, the solution is far from
      being ideal.
      
      For RPC traffic, recvmsg() has to return to user space right after
      skb payload has been consumed, meaning that BH handler has no chance
      to pick the skb before recvmsg() thread. This issue is more visible
      with BIG TCP, as more RPC fit one skb.
      
      For RFS, even if BH handler picks the skbs, they are still picked
      from the cpu on which user thread is running.
      
      Ideally, it is better to free the skbs (and associated page frags)
      on the cpu that originally allocated them.
      
      This patch removes the per socket anchor (sk->defer_list) and
      instead uses a per-cpu list, which will hold more skbs per round.
      
      This new per-cpu list is drained at the end of net_rx_action(),
      after incoming packets have been processed, to lower latencies.
      
      In normal conditions, skbs are added to the per-cpu list with
      no further action. In the (unlikely) cases where the cpu does not
      run the net_rx_action() handler fast enough, we use an IPI to raise
      NET_RX_SOFTIRQ on the remote cpu.
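
      A user-space analogy of the mechanism (not kernel code: a mutex stands
      in for sd->defer_lock, free() for the skb release, and the owning
      thread for the target cpu):

        #include <pthread.h>
        #include <stdio.h>
        #include <stdlib.h>

        struct node { struct node *next; };

        static struct node *defer_list;                 /* per-"cpu" list */
        static pthread_mutex_t defer_lock = PTHREAD_MUTEX_INITIALIZER;

        static void defer_free(struct node *n)          /* remote threads */
        {
            pthread_mutex_lock(&defer_lock);
            n->next = defer_list;
            defer_list = n;
            pthread_mutex_unlock(&defer_lock);
        }

        static void drain_defer_list(void)              /* owning thread */
        {
            pthread_mutex_lock(&defer_lock);
            struct node *n = defer_list;
            defer_list = NULL;
            pthread_mutex_unlock(&defer_lock);

            while (n) {                                 /* free outside the lock */
                struct node *next = n->next;
                free(n);
                n = next;
            }
        }

        int main(void)                                  /* cc -pthread defer.c */
        {
            for (int i = 0; i < 1000; i++)
                defer_free(malloc(sizeof(struct node)));
            drain_defer_list();
            puts("drained");
            return 0;
        }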
      
      Also, we do not bother draining the per-cpu list from dev_cpu_dead().
      This is because skbs in this list have no requirement on how fast
      they should be freed.
      
      Note that we can add in the future a small per-cpu cache
      if we see any contention on sd->defer_lock.
      
      Tested on a pair of hosts with 100Gbit NIC, RFS enabled,
      and /proc/sys/net/ipv4/tcp_rmem[2] tuned to 16MB to work around
      page recycling strategy used by NIC driver (its page pool capacity
      being too small compared to number of skbs/pages held in sockets
      receive queues)
      
      Note that this tuning was only done to demonstrate worse
      conditions for skb freeing for this particular test.
      These conditions can happen in more general production workload.
      
      10 runs of one TCP_STREAM flow
      
      Before:
      Average throughput: 49685 Mbit.
      
      Kernel profiles on cpu running user thread recvmsg() show high cost for
      skb freeing related functions (*)
      
          57.81%  [kernel]       [k] copy_user_enhanced_fast_string
      (*) 12.87%  [kernel]       [k] skb_release_data
      (*)  4.25%  [kernel]       [k] __free_one_page
      (*)  3.57%  [kernel]       [k] __list_del_entry_valid
           1.85%  [kernel]       [k] __netif_receive_skb_core
           1.60%  [kernel]       [k] __skb_datagram_iter
      (*)  1.59%  [kernel]       [k] free_unref_page_commit
      (*)  1.16%  [kernel]       [k] __slab_free
           1.16%  [kernel]       [k] _copy_to_iter
      (*)  1.01%  [kernel]       [k] kfree
      (*)  0.88%  [kernel]       [k] free_unref_page
           0.57%  [kernel]       [k] ip6_rcv_core
           0.55%  [kernel]       [k] ip6t_do_table
           0.54%  [kernel]       [k] flush_smp_call_function_queue
      (*)  0.54%  [kernel]       [k] free_pcppages_bulk
           0.51%  [kernel]       [k] llist_reverse_order
           0.38%  [kernel]       [k] process_backlog
      (*)  0.38%  [kernel]       [k] free_pcp_prepare
           0.37%  [kernel]       [k] tcp_recvmsg_locked
      (*)  0.37%  [kernel]       [k] __list_add_valid
           0.34%  [kernel]       [k] sock_rfree
           0.34%  [kernel]       [k] _raw_spin_lock_irq
      (*)  0.33%  [kernel]       [k] __page_cache_release
           0.33%  [kernel]       [k] tcp_v6_rcv
      (*)  0.33%  [kernel]       [k] __put_page
      (*)  0.29%  [kernel]       [k] __mod_zone_page_state
           0.27%  [kernel]       [k] _raw_spin_lock
      
      After patch:
      Average throughput: 73076 Mbit.
      
      Kernel profiles on cpu running user thread recvmsg() looks better:
      
          81.35%  [kernel]       [k] copy_user_enhanced_fast_string
           1.95%  [kernel]       [k] _copy_to_iter
           1.95%  [kernel]       [k] __skb_datagram_iter
           1.27%  [kernel]       [k] __netif_receive_skb_core
           1.03%  [kernel]       [k] ip6t_do_table
           0.60%  [kernel]       [k] sock_rfree
           0.50%  [kernel]       [k] tcp_v6_rcv
           0.47%  [kernel]       [k] ip6_rcv_core
           0.45%  [kernel]       [k] read_tsc
           0.44%  [kernel]       [k] _raw_spin_lock_irqsave
           0.37%  [kernel]       [k] _raw_spin_lock
           0.37%  [kernel]       [k] native_irq_return_iret
           0.33%  [kernel]       [k] __inet6_lookup_established
           0.31%  [kernel]       [k] ip6_protocol_deliver_rcu
           0.29%  [kernel]       [k] tcp_rcv_established
           0.29%  [kernel]       [k] llist_reverse_order
      
      v2: kdoc issue (kernel bots)
          do not defer if (alloc_cpu == smp_processor_id()) (Paolo)
          replace the sk_buff_head with a single-linked list (Jakub)
          add a READ_ONCE()/WRITE_ONCE() for the lockless read of sd->defer_list
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Paolo Abeni <pabeni@redhat.com>
      Link: https://lore.kernel.org/r/20220422201237.416238-1-eric.dumazet@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  19. 22 Apr 2022 (1 commit)
    • ipv4: Avoid using RTO_ONLINK with ip_route_connect(). · 67e1e2f4
      Committed by Guillaume Nault
      Now that ip_rt_fix_tos() doesn't reset ->flowi4_scope unconditionally,
      we don't have to rely on the RTO_ONLINK bit to properly set the scope
      of a flowi4 structure. We can just set ->flowi4_scope explicitly and
      avoid using RTO_ONLINK in ->flowi4_tos.
      
      This patch converts callers of ip_route_connect(). Instead of setting
      the tos parameter with RT_CONN_FLAGS(sk), as all callers do, we can:
      
        1- Drop the tos parameter from ip_route_connect(): its value was
           entirely based on sk, which is also passed as parameter.
      
        2- Set ->flowi4_scope depending on the SOCK_LOCALROUTE socket option
           instead of always initialising it with RT_SCOPE_UNIVERSE (let's
           define ip_sock_rt_scope() for this purpose).
      
        3- Avoid overloading ->flowi4_tos with RTO_ONLINK: since the scope is
           now properly initialised, we don't need to tell ip_rt_fix_tos() to
           adjust ->flowi4_scope for us. So let's define ip_sock_rt_tos(),
           which is the same as RT_CONN_FLAGS() but without the RTO_ONLINK
           bit overload.
      
      Note:
        In the original ip_route_connect() code, __ip_route_output_key()
        might clear the RTO_ONLINK bit of fl4->flowi4_tos (because of
        ip_rt_fix_tos()). Therefore flowi4_update_output() had to reuse the
        original tos variable. Now that we don't set RTO_ONLINK any more,
        this is not a problem and we can use fl4->flowi4_tos in
        flowi4_update_output().
      Signed-off-by: Guillaume Nault <gnault@redhat.com>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  20. 07 Apr 2022 (1 commit)
  21. 10 Mar 2022 (1 commit)
    • tcp: adjust TSO packet sizes based on min_rtt · 65466904
      Committed by Eric Dumazet
      Back when tcp_tso_autosize() and TCP pacing were introduced,
      our focus was really to reduce burst sizes for long distance
      flows.
      
      The simple heuristic of using sk_pacing_rate/1024 has worked
      well, but can lead to too small packets for hosts in the same
      rack/cluster, when thousands of flows compete for the bottleneck.
      
      Neal Cardwell had the idea of making the TSO burst size
      a function of both sk_pacing_rate and tcp_min_rtt()
      
      Indeed, for local flows, sending bigger bursts is better
      to reduce cpu costs, as occasional losses can be repaired
      quite fast.
      
      This patch is based on Neal Cardwell implementation
      done more than two years ago.
      bbr adjusts max_pacing_rate based on measured bandwidth,
      while cubic would overestimate max_pacing_rate.
      
      /proc/sys/net/ipv4/tcp_tso_rtt_log can be used to tune or disable
      this new feature, in logarithmic steps.
      
      Tested:
      
      100Gbit NIC, two hosts in the same rack, 4K MTU.
      600 flows rate-limited to 20000000 bytes per second.
      
      Before patch: (TSO sizes would be limited to 20000000/1024/4096 -> 4 segments per TSO)
      
      ~# echo 0 >/proc/sys/net/ipv4/tcp_tso_rtt_log
      ~# nstat -n;perf stat ./super_netperf 600 -H otrv6 -l 20 -- -K dctcp -q 20000000;nstat|egrep "TcpInSegs|TcpOutSegs|TcpRetransSegs|Delivered"
        96005
      
       Performance counter stats for './super_netperf 600 -H otrv6 -l 20 -- -K dctcp -q 20000000':
      
               65,945.29 msec task-clock                #    2.845 CPUs utilized
               1,314,632      context-switches          # 19935.279 M/sec
                   5,292      cpu-migrations            #   80.249 M/sec
                 940,641      page-faults               # 14264.023 M/sec
         201,117,030,926      cycles                    # 3049769.216 GHz                   (83.45%)
          17,699,435,405      stalled-cycles-frontend   #    8.80% frontend cycles idle     (83.48%)
         136,584,015,071      stalled-cycles-backend    #   67.91% backend cycles idle      (83.44%)
          53,809,530,436      instructions              #    0.27  insn per cycle
                                                        #    2.54  stalled cycles per insn  (83.36%)
           9,062,315,523      branches                  # 137422329.563 M/sec               (83.22%)
             153,008,621      branch-misses             #    1.69% of all branches          (83.32%)
      
            23.182970846 seconds time elapsed
      
      TcpInSegs                       15648792           0.0
      TcpOutSegs                      58659110           0.0  # Average of 3.7 4K segments per TSO packet
      TcpExtTCPDelivered              58654791           0.0
      TcpExtTCPDeliveredCE            19                 0.0
      
      After patch:
      
      ~# echo 9 >/proc/sys/net/ipv4/tcp_tso_rtt_log
      ~# nstat -n;perf stat ./super_netperf 600 -H otrv6 -l 20 -- -K dctcp -q 20000000;nstat|egrep "TcpInSegs|TcpOutSegs|TcpRetransSegs|Delivered"
        96046
      
       Performance counter stats for './super_netperf 600 -H otrv6 -l 20 -- -K dctcp -q 20000000':
      
               48,982.58 msec task-clock                #    2.104 CPUs utilized
                 186,014      context-switches          # 3797.599 M/sec
                   3,109      cpu-migrations            #   63.472 M/sec
                 941,180      page-faults               # 19214.814 M/sec
         153,459,763,868      cycles                    # 3132982.807 GHz                   (83.56%)
          12,069,861,356      stalled-cycles-frontend   #    7.87% frontend cycles idle     (83.32%)
         120,485,917,953      stalled-cycles-backend    #   78.51% backend cycles idle      (83.24%)
          36,803,672,106      instructions              #    0.24  insn per cycle
                                                        #    3.27  stalled cycles per insn  (83.18%)
           5,947,266,275      branches                  # 121417383.427 M/sec               (83.64%)
              87,984,616      branch-misses             #    1.48% of all branches          (83.43%)
      
            23.281200256 seconds time elapsed
      
      TcpInSegs                       1434706            0.0
      TcpOutSegs                      58883378           0.0  # Average of 41 4K segments per TSO packet
      TcpExtTCPDelivered              58878971           0.0
      TcpExtTCPDeliveredCE            9664               0.0
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reviewed-by: Neal Cardwell <ncardwell@google.com>
      Link: https://lore.kernel.org/r/20220309015757.2532973-1-eric.dumazet@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  22. 09 Mar 2022 (1 commit)
  23. 25 Feb 2022 (1 commit)
  24. 20 Feb 2022 (4 commits)
  25. 28 Jan 2022 (1 commit)
  26. 27 Jan 2022 (1 commit)
    • tcp: allocate tcp_death_row outside of struct netns_ipv4 · fbb82952
      Committed by Eric Dumazet
      I forgot tcp had per netns tracking of timewait sockets,
      and their sysctl to change the limit.
      
      After 0dad4087 ("tcp/dccp: get rid of inet_twsk_purge()"),
      whole struct net can be freed before last tw socket is freed.
      
      We need to allocate a separate struct inet_timewait_death_row
      object per netns.
      
      tw_count becomes a refcount and gains associated debugging infrastructure.
      
      BUG: KASAN: use-after-free in inet_twsk_kill+0x358/0x3c0 net/ipv4/inet_timewait_sock.c:46
      Read of size 8 at addr ffff88807d5f9f40 by task kworker/1:7/3690
      
      CPU: 1 PID: 3690 Comm: kworker/1:7 Not tainted 5.16.0-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Workqueue: events pwq_unbound_release_workfn
      Call Trace:
       <IRQ>
       __dump_stack lib/dump_stack.c:88 [inline]
       dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
       print_address_description.constprop.0.cold+0x8d/0x336 mm/kasan/report.c:255
       __kasan_report mm/kasan/report.c:442 [inline]
       kasan_report.cold+0x83/0xdf mm/kasan/report.c:459
       inet_twsk_kill+0x358/0x3c0 net/ipv4/inet_timewait_sock.c:46
       call_timer_fn+0x1a5/0x6b0 kernel/time/timer.c:1421
       expire_timers kernel/time/timer.c:1466 [inline]
       __run_timers.part.0+0x67c/0xa30 kernel/time/timer.c:1734
       __run_timers kernel/time/timer.c:1715 [inline]
       run_timer_softirq+0xb3/0x1d0 kernel/time/timer.c:1747
       __do_softirq+0x29b/0x9c2 kernel/softirq.c:558
       invoke_softirq kernel/softirq.c:432 [inline]
       __irq_exit_rcu+0x123/0x180 kernel/softirq.c:637
       irq_exit_rcu+0x5/0x20 kernel/softirq.c:649
       sysvec_apic_timer_interrupt+0x93/0xc0 arch/x86/kernel/apic/apic.c:1097
       </IRQ>
       <TASK>
       asm_sysvec_apic_timer_interrupt+0x12/0x20 arch/x86/include/asm/idtentry.h:638
      RIP: 0010:lockdep_unregister_key+0x1c9/0x250 kernel/locking/lockdep.c:6328
      Code: 00 00 00 48 89 ee e8 46 fd ff ff 4c 89 f7 e8 5e c9 ff ff e8 09 cc ff ff 9c 58 f6 c4 02 75 26 41 f7 c4 00 02 00 00 74 01 fb 5b <5d> 41 5c 41 5d 41 5e 41 5f e9 19 4a 08 00 0f 0b 5b 5d 41 5c 41 5d
      RSP: 0018:ffffc90004077cb8 EFLAGS: 00000206
      RAX: 0000000000000046 RBX: ffff88807b61b498 RCX: 0000000000000001
      RDX: dffffc0000000000 RSI: 0000000000000000 RDI: 0000000000000000
      RBP: ffff888077027128 R08: 0000000000000001 R09: ffffffff8f1ea4fc
      R10: fffffbfff1ff93ee R11: 000000000000af1e R12: 0000000000000246
      R13: 0000000000000000 R14: ffffffff8ffc89b8 R15: ffffffff90157fb0
       wq_unregister_lockdep kernel/workqueue.c:3508 [inline]
       pwq_unbound_release_workfn+0x254/0x340 kernel/workqueue.c:3746
       process_one_work+0x9ac/0x1650 kernel/workqueue.c:2307
       worker_thread+0x657/0x1110 kernel/workqueue.c:2454
       kthread+0x2e9/0x3a0 kernel/kthread.c:377
       ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
       </TASK>
      
      Allocated by task 3635:
       kasan_save_stack+0x1e/0x50 mm/kasan/common.c:38
       kasan_set_track mm/kasan/common.c:46 [inline]
       set_alloc_info mm/kasan/common.c:437 [inline]
       __kasan_slab_alloc+0x90/0xc0 mm/kasan/common.c:470
       kasan_slab_alloc include/linux/kasan.h:260 [inline]
       slab_post_alloc_hook mm/slab.h:732 [inline]
       slab_alloc_node mm/slub.c:3230 [inline]
       slab_alloc mm/slub.c:3238 [inline]
       kmem_cache_alloc+0x202/0x3a0 mm/slub.c:3243
       kmem_cache_zalloc include/linux/slab.h:705 [inline]
       net_alloc net/core/net_namespace.c:407 [inline]
       copy_net_ns+0x125/0x760 net/core/net_namespace.c:462
       create_new_namespaces+0x3f6/0xb20 kernel/nsproxy.c:110
       unshare_nsproxy_namespaces+0xc1/0x1f0 kernel/nsproxy.c:226
       ksys_unshare+0x445/0x920 kernel/fork.c:3048
       __do_sys_unshare kernel/fork.c:3119 [inline]
       __se_sys_unshare kernel/fork.c:3117 [inline]
       __x64_sys_unshare+0x2d/0x40 kernel/fork.c:3117
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      The buggy address belongs to the object at ffff88807d5f9a80
       which belongs to the cache net_namespace of size 6528
      The buggy address is located 1216 bytes inside of
       6528-byte region [ffff88807d5f9a80, ffff88807d5fb400)
      The buggy address belongs to the page:
      page:ffffea0001f57e00 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff88807d5f9a80 pfn:0x7d5f8
      head:ffffea0001f57e00 order:3 compound_mapcount:0 compound_pincount:0
      memcg:ffff888070023001
      flags: 0xfff00000010200(slab|head|node=0|zone=1|lastcpupid=0x7ff)
      raw: 00fff00000010200 ffff888010dd4f48 ffffea0001404e08 ffff8880118fd000
      raw: ffff88807d5f9a80 0000000000040002 00000001ffffffff ffff888070023001
      page dumped because: kasan: bad access detected
      page_owner tracks the page as allocated
      page last allocated via order 3, migratetype Unmovable, gfp_mask 0xd20c0(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC), pid 3634, ts 119694798460, free_ts 119693556950
       prep_new_page mm/page_alloc.c:2434 [inline]
       get_page_from_freelist+0xa72/0x2f50 mm/page_alloc.c:4165
       __alloc_pages+0x1b2/0x500 mm/page_alloc.c:5389
       alloc_pages+0x1aa/0x310 mm/mempolicy.c:2271
       alloc_slab_page mm/slub.c:1799 [inline]
       allocate_slab mm/slub.c:1944 [inline]
       new_slab+0x28a/0x3b0 mm/slub.c:2004
       ___slab_alloc+0x87c/0xe90 mm/slub.c:3018
       __slab_alloc.constprop.0+0x4d/0xa0 mm/slub.c:3105
       slab_alloc_node mm/slub.c:3196 [inline]
       slab_alloc mm/slub.c:3238 [inline]
       kmem_cache_alloc+0x35c/0x3a0 mm/slub.c:3243
       kmem_cache_zalloc include/linux/slab.h:705 [inline]
       net_alloc net/core/net_namespace.c:407 [inline]
       copy_net_ns+0x125/0x760 net/core/net_namespace.c:462
       create_new_namespaces+0x3f6/0xb20 kernel/nsproxy.c:110
       unshare_nsproxy_namespaces+0xc1/0x1f0 kernel/nsproxy.c:226
       ksys_unshare+0x445/0x920 kernel/fork.c:3048
       __do_sys_unshare kernel/fork.c:3119 [inline]
       __se_sys_unshare kernel/fork.c:3117 [inline]
       __x64_sys_unshare+0x2d/0x40 kernel/fork.c:3117
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      page last free stack trace:
       reset_page_owner include/linux/page_owner.h:24 [inline]
       free_pages_prepare mm/page_alloc.c:1352 [inline]
       free_pcp_prepare+0x374/0x870 mm/page_alloc.c:1404
       free_unref_page_prepare mm/page_alloc.c:3325 [inline]
       free_unref_page+0x19/0x690 mm/page_alloc.c:3404
       skb_free_head net/core/skbuff.c:655 [inline]
       skb_release_data+0x65d/0x790 net/core/skbuff.c:677
       skb_release_all net/core/skbuff.c:742 [inline]
       __kfree_skb net/core/skbuff.c:756 [inline]
       consume_skb net/core/skbuff.c:914 [inline]
       consume_skb+0xc2/0x160 net/core/skbuff.c:908
       skb_free_datagram+0x1b/0x1f0 net/core/datagram.c:325
       netlink_recvmsg+0x636/0xea0 net/netlink/af_netlink.c:1998
       sock_recvmsg_nosec net/socket.c:948 [inline]
       sock_recvmsg net/socket.c:966 [inline]
       sock_recvmsg net/socket.c:962 [inline]
       ____sys_recvmsg+0x2c4/0x600 net/socket.c:2632
       ___sys_recvmsg+0x127/0x200 net/socket.c:2674
       __sys_recvmsg+0xe2/0x1a0 net/socket.c:2704
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Memory state around the buggy address:
       ffff88807d5f9e00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
       ffff88807d5f9e80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      >ffff88807d5f9f00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                                 ^
       ffff88807d5f9f80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
       ffff88807d5fa000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      
      Fixes: 0dad4087 ("tcp/dccp: get rid of inet_twsk_purge()")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Reported-by: Paolo Abeni <pabeni@redhat.com>
      Tested-by: Paolo Abeni <pabeni@redhat.com>
      Link: https://lore.kernel.org/r/20220126180714.845362-1-eric.dumazet@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  27. 25 Jan 2022 (1 commit)
    • ipv4/tcp: do not use per netns ctl sockets · 37ba017d
      Committed by Eric Dumazet
      TCP ipv4 uses per-cpu/per-netns ctl sockets in order to send
      RST and some ACK packets (on behalf of TIMEWAIT sockets).
      
      This adds memory and cpu costs, which do not seem needed.
      Now that typical servers have 256 or more cores, this is a considerable
      tax on netns users.
      
      These ctl sockets are used from BH context, do not receive packets,
      and store no persistent state other than the 'struct net' pointer
      needed to use IPv4 output functions.
      
      Note that I attempted a related change in the past, that had
      to be hot-fixed in commit bdbbb852 ("ipv4: tcp: get rid of ugly unicast_sock")
      
      This patch could very well surface old bugs, on layers not
      taking care of sk->sk_kern_sock properly.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>